classifier 1.4.3 → 2.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CLAUDE.md +67 -0
- data/README.md +259 -0
- data/lib/classifier/bayes.rb +50 -36
- data/lib/classifier/extensions/vector.rb +30 -22
- data/lib/classifier/extensions/word_hash.rb +8 -1
- data/lib/classifier/lsi/content_node.rb +30 -8
- data/lib/classifier/lsi/word_list.rb +12 -1
- data/lib/classifier/lsi.rb +93 -37
- data/sig/vendor/fast_stemmer.rbs +9 -0
- data/sig/vendor/gsl.rbs +27 -0
- data/sig/vendor/matrix.rbs +26 -0
- data/test/test_helper.rb +11 -1
- metadata +36 -6
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: fea14969bc8a61283823b0b0f5bae013af968caf4676c383155e3b8682b948de
+  data.tar.gz: 4d626c85d084ff75eba2ff305673734a6f25b668e773b1b5a3a0630a6b68df96
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ef53c06db3326b1b6ebc14255b4ba198286c06e291cba3afc67bba360ca766a173f89269405d216751806ca72f885a87ac80ec24a031053f8e6f2987e8e2267e
+  data.tar.gz: 8f120a9b78e802e6fd3e7172fd311b476745e27d5b3d301dc8d140296a451875e5aa33a901514bfdd1bc96c656ad1a43cbb3935a05223cd38548a71ba6a3a1c1
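To compare these values against a locally downloaded copy, note that a `.gem` file is a plain tar archive whose `metadata.gz` and `data.tar.gz` members are what the checksums above cover. A verification sketch only (the download step and filename are assumed, not part of this diff):

```ruby
# Print SHA256 digests of the archives inside a downloaded gem
# (e.g. fetched via `gem fetch classifier --version 2.0.0`) so they can be
# compared with the checksums.yaml values shown above.
require 'digest'
require 'rubygems/package'

File.open('classifier-2.0.0.gem', 'rb') do |io|
  Gem::Package::TarReader.new(io).each do |entry|
    next unless %w[metadata.gz data.tar.gz].include?(entry.full_name)

    puts "#{entry.full_name}: #{Digest::SHA256.hexdigest(entry.read)}"
  end
end
```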
data/CLAUDE.md
ADDED
@@ -0,0 +1,67 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Ruby gem providing text classification via two algorithms:
- **Bayes** (`Classifier::Bayes`) - Naive Bayesian classification
- **LSI** (`Classifier::LSI`) - Latent Semantic Indexing for semantic classification, clustering, and search

## Common Commands

```bash
# Run all tests
rake test

# Run a single test file
ruby -Ilib test/bayes/bayesian_test.rb
ruby -Ilib test/lsi/lsi_test.rb

# Run tests with native Ruby vector (without GSL)
NATIVE_VECTOR=true rake test

# Interactive console
rake console

# Generate documentation
rake doc
```

## Architecture

### Core Components

**Bayesian Classifier** (`lib/classifier/bayes.rb`)
- Train with `train(category, text)` or dynamic methods like `train_spam(text)`
- Classify with `classify(text)` returning the best category
- Uses log probabilities for numerical stability

**LSI Classifier** (`lib/classifier/lsi.rb`)
- Uses Singular Value Decomposition (SVD) for semantic analysis
- Optional GSL gem for 10x faster matrix operations; falls back to pure Ruby SVD
- Key operations: `add_item`, `classify`, `find_related`, `search`
- `auto_rebuild` option controls automatic index rebuilding after changes

**String Extensions** (`lib/classifier/extensions/word_hash.rb`)
- `word_hash` / `clean_word_hash` - tokenize text to stemmed word frequencies
- `CORPUS_SKIP_WORDS` - stopwords filtered during tokenization
- Uses `fast-stemmer` gem for Porter stemming

**Vector Extensions** (`lib/classifier/extensions/vector.rb`)
- Pure Ruby SVD implementation (`Matrix#SV_decomp`)
- Vector normalization and magnitude calculations

### GSL Integration

LSI checks for the `gsl` gem at load time. When available:
- Uses `GSL::Matrix` and `GSL::Vector` for faster operations
- Serialization handled via `vector_serialize.rb`
- Test without GSL: `NATIVE_VECTOR=true rake test`

### Content Nodes (`lib/classifier/lsi/content_node.rb`)

Internal data structure storing:
- `word_hash` - term frequencies
- `raw_vector` / `raw_norm` - initial vector representation
- `lsi_vector` / `lsi_norm` - reduced dimensionality representation after SVD
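The "log probabilities for numerical stability" bullet above corresponds to the Laplace-smoothed scoring that `Bayes#classifications` uses in 2.0.0 (see the `bayes.rb` diff further down). A standalone arithmetic sketch with made-up counts, not the gem's API:

```ruby
# Hypothetical counts for a single category; illustrates the scoring rule
# score(cat) = log P(cat) + sum over words of log((count(w, cat) + 1) / (tokens(cat) + V))
category_word_counts = { buy: 3, cheap: 2 } # tokens seen while training this category
category_docs        = 4.0                  # documents trained into this category
total_docs           = 10.0                 # documents trained across all categories
vocab_size           = 50                   # unique words across all categories

words = %i[buy now]                         # stemmed words of the text being classified
smoothed_total = category_word_counts.values.sum + vocab_size.to_f

word_score  = words.sum { |w| Math.log(((category_word_counts[w] || 0) + 1) / smoothed_total) }
prior_score = Math.log(category_docs / total_docs)

puts word_score + prior_score               # closer to 0 (less negative) means a better match
```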
data/README.md
ADDED
@@ -0,0 +1,259 @@
# Classifier

[](https://badge.fury.io/rb/classifier)
[](https://github.com/cardmagic/classifier/actions/workflows/ruby.yml)
[](https://opensource.org/licenses/LGPL-2.1)

A Ruby library for text classification using Bayesian and Latent Semantic Indexing (LSI) algorithms.

## Table of Contents

- [Installation](#installation)
- [Bayesian Classifier](#bayesian-classifier)
- [LSI (Latent Semantic Indexing)](#lsi-latent-semantic-indexing)
- [Performance](#performance)
- [Development](#development)
- [Contributing](#contributing)
- [License](#license)

## Installation

Add to your Gemfile:

```ruby
gem 'classifier'
```

Then run:

```bash
bundle install
```

Or install directly:

```bash
gem install classifier
```

### Optional: GSL for Faster LSI

For significantly faster LSI operations, install the [GNU Scientific Library](https://www.gnu.org/software/gsl/).

<details>
<summary><strong>Ruby 3+</strong></summary>

The released `gsl` gem doesn't support Ruby 3+. Install from source:

```bash
# Install GSL library
brew install gsl             # macOS
apt-get install libgsl-dev   # Ubuntu/Debian

# Build and install the gem
git clone https://github.com/cardmagic/rb-gsl.git
cd rb-gsl
git checkout fix/ruby-3.4-compatibility
gem build gsl.gemspec
gem install gsl-*.gem
```
</details>

<details>
<summary><strong>Ruby 2.x</strong></summary>

```bash
# macOS
brew install gsl
gem install gsl

# Ubuntu/Debian
apt-get install libgsl-dev
gem install gsl
```
</details>

When GSL is installed, Classifier automatically uses it. To suppress the GSL notice:

```bash
SUPPRESS_GSL_WARNING=true ruby your_script.rb
```

### Compatibility

| Ruby Version | Status |
|--------------|--------|
| 4.0 | Supported |
| 3.4 | Supported |
| 3.3 | Supported |
| 3.2 | Supported |
| 3.1 | EOL (unsupported) |

## Bayesian Classifier

Fast, accurate classification with modest memory requirements. Ideal for spam filtering, sentiment analysis, and content categorization.

### Quick Start

```ruby
require 'classifier'

classifier = Classifier::Bayes.new('Spam', 'Ham')

# Train the classifier
classifier.train_spam "Buy cheap viagra now! Limited offer!"
classifier.train_spam "You've won a million dollars! Claim now!"
classifier.train_ham "Meeting scheduled for tomorrow at 10am"
classifier.train_ham "Please review the attached document"

# Classify new text
classifier.classify "Congratulations! You've won a prize!"
# => "Spam"
```

### Persistence with Madeleine

```ruby
require 'classifier'
require 'madeleine'

m = SnapshotMadeleine.new("classifier_data") {
  Classifier::Bayes.new('Interesting', 'Uninteresting')
}

m.system.train_interesting "fascinating article about science"
m.system.train_uninteresting "boring repetitive content"
m.take_snapshot

# Later, restore and use:
m.system.classify "new scientific discovery"
# => "Interesting"
```

### Learn More

- [Bayesian Filtering Explained](http://www.process.com/precisemail/bayesian_filtering.htm)
- [Wikipedia: Bayesian Filtering](http://en.wikipedia.org/wiki/Bayesian_filtering)
- [Paul Graham: A Plan for Spam](http://www.paulgraham.com/spam.html)

## LSI (Latent Semantic Indexing)

Semantic analysis using Singular Value Decomposition (SVD). More flexible than Bayesian classifiers, providing search, clustering, and classification based on meaning rather than just keywords.

### Quick Start

```ruby
require 'classifier'

lsi = Classifier::LSI.new

# Add documents with categories
lsi.add_item "Dogs are loyal pets that love to play fetch", :pets
lsi.add_item "Cats are independent and love to nap", :pets
lsi.add_item "Ruby is a dynamic programming language", :programming
lsi.add_item "Python is great for data science", :programming

# Classify new text
lsi.classify "My puppy loves to run around"
# => :pets

# Get classification with confidence score
lsi.classify_with_confidence "Learning to code in Ruby"
# => [:programming, 0.89]
```

### Search and Discovery

```ruby
# Find similar documents
lsi.find_related "Dogs are great companions", 2
# => ["Dogs are loyal pets that love to play fetch", "Cats are independent..."]

# Search by keyword
lsi.search "programming", 3
# => ["Ruby is a dynamic programming language", "Python is great for..."]
```

### Learn More

- [Wikipedia: Latent Semantic Analysis](http://en.wikipedia.org/wiki/Latent_semantic_analysis)
- [C2 Wiki: Latent Semantic Indexing](http://www.c2.com/cgi/wiki?LatentSemanticIndexing)

## Performance

### GSL vs Native Ruby

GSL provides dramatic speedups for LSI operations, especially `build_index` (SVD computation):

| Documents | build_index | Overall |
|-----------|-------------|---------|
| 5 | 4x faster | 2.5x |
| 10 | 24x faster | 5.5x |
| 15 | 116x faster | 17x |

<details>
<summary>Detailed benchmark (15 documents)</summary>

```
Operation       Native     GSL      Speedup
----------------------------------------------------------
build_index     0.1412     0.0012   116.2x
classify        0.0142     0.0049   2.9x
search          0.0102     0.0026   3.9x
find_related    0.0069     0.0016   4.2x
----------------------------------------------------------
TOTAL           0.1725     0.0104   16.6x
```
</details>

### Running Benchmarks

```bash
rake benchmark          # Run with current configuration
rake benchmark:compare  # Compare GSL vs native Ruby
```

## Development

### Setup

```bash
git clone https://github.com/cardmagic/classifier.git
cd classifier
bundle install
```

### Running Tests

```bash
rake test                               # Run all tests
ruby -Ilib test/bayes/bayesian_test.rb  # Run specific test file

# Test without GSL (pure Ruby)
NATIVE_VECTOR=true rake test
```

### Console

```bash
rake console
```

## Contributing

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -am 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## Authors

- **Lucas Carlson** - *Original author* - lucas@rufy.com
- **David Fayram II** - *LSI implementation* - dfayram@gmail.com
- **Cameron McBride** - cameron.mcbride@gmail.com
- **Ivan Acosta-Rubio** - ivan@softwarecriollo.com

## License

This library is released under the [GNU Lesser General Public License (LGPL) 2.1](LICENSE).
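The README's note that Classifier automatically uses GSL when installed can be checked at runtime through the `Classifier::LSI.gsl_available` flag introduced in 2.0.0 (see the `lsi.rb` diff below). A small sketch, assuming the gem is installed:

```ruby
require 'classifier'

if Classifier::LSI.gsl_available
  puts 'LSI is using GSL-backed vectors and matrices'
else
  puts 'LSI is using the pure-Ruby matrix fallback'
end
```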
data/lib/classifier/bayes.rb
CHANGED
@@ -1,12 +1,20 @@
+# rbs_inline: enabled
+
 # Author:: Lucas Carlson  (mailto:lucas@rufy.com)
 # Copyright:: Copyright (c) 2005 Lucas Carlson
 # License:: LGPL

 module Classifier
   class Bayes
+    # @rbs @categories: Hash[Symbol, Hash[Symbol, Integer]]
+    # @rbs @total_words: Integer
+    # @rbs @category_counts: Hash[Symbol, Integer]
+    # @rbs @category_word_count: Hash[Symbol, Integer]
+
     # The class can be created with one or more categories, each of which will be
     # initialized and given a training method. E.g.,
     #      b = Classifier::Bayes.new 'Interesting', 'Uninteresting', 'Spam'
+    # @rbs (*String | Symbol) -> void
     def initialize(*categories)
       @categories = {}
       categories.each { |category| @categories[category.prepare_category_name] = {} }
@@ -15,13 +23,14 @@ module Classifier
       @category_word_count = Hash.new(0)
     end

-    #
     # Provides a general training method for all categories specified in Bayes#new
     # For example:
     #     b = Classifier::Bayes.new 'This', 'That', 'the_other'
     #     b.train :this, "This text"
     #     b.train "that", "That text"
     #     b.train "The other", "The other text"
+    #
+    # @rbs (String | Symbol, String) -> void
     def train(category, text)
       category = category.prepare_category_name
       @category_counts[category] += 1
@@ -33,7 +42,6 @@ module Classifier
       end
     end

-    #
     # Provides a untraining method for all categories specified in Bayes#new
     # Be very careful with this method.
     #
@@ -41,6 +49,8 @@ module Classifier
     #     b = Classifier::Bayes.new 'This', 'That', 'the_other'
     #     b.train :this, "This text"
     #     b.untrain :this, "This text"
+    #
+    # @rbs (String | Symbol, String) -> void
     def untrain(category, text)
       category = category.prepare_category_name
       @category_counts[category] -= 1
@@ -59,36 +69,39 @@ module Classifier
       end
     end

-    #
     # Returns the scores in each category the provided +text+. E.g.,
     #    b.classifications "I hate bad words and you"
     #    => {"Uninteresting"=>-12.6997928013932, "Interesting"=>-18.4206807439524}
     # The largest of these scores (the one closest to 0) is the one picked out by #classify
+    #
+    # @rbs (String) -> Hash[String, Float]
     def classifications(text)
-
-
-
-
-
-
-
-
-
-
-
-
-          score[category.to_s] += Math.log(s / training_count)
+      words = text.word_hash.keys
+      training_count = @category_counts.values.sum.to_f
+      vocab_size = [@categories.values.flat_map(&:keys).uniq.size, 1].max
+
+      @categories.to_h do |category, category_words|
+        smoothed_total = ((@category_word_count[category] || 0) + vocab_size).to_f
+
+        # Laplace smoothing: P(word|category) = (count + α) / (total + α * V)
+        word_score = words.sum { |w| Math.log(((category_words[w] || 0) + 1) / smoothed_total) }
+        prior_score = Math.log((@category_counts[category] || 0.1) / training_count)
+
+        [category.to_s, word_score + prior_score]
       end
-      score
     end

-    #
     # Returns the classification of the provided +text+, which is one of the
     # categories given in the initializer. E.g.,
     #    b.classify "I hate bad words and you"
     #    => 'Uninteresting'
+    #
+    # @rbs (String) -> String
     def classify(text)
-
+      best = classifications(text).min_by { |a| -a[1] }
+      raise StandardError, 'No classifications available' unless best
+
+      best.first.to_s
     end

     #
@@ -100,32 +113,30 @@ module Classifier
     #     b.untrain_that "That text"
     #     b.train_the_other "The other text"
     def method_missing(name, *args)
+      return super unless name.to_s =~ /(un)?train_(\w+)/
+
       category = name.to_s.gsub(/(un)?train_(\w+)/, '\2').prepare_category_name
-
-
-
-
-
-
-
-
-
-        raise StandardError, "No such category: #{category}"
-      else
-        super
-      end
+      raise StandardError, "No such category: #{category}" unless @categories.key?(category)
+
+      method = name.to_s.start_with?('untrain_') ? :untrain : :train
+      args.each { |text| send(method, category, text) }
+    end
+
+    # @rbs (Symbol, ?bool) -> bool
+    def respond_to_missing?(name, include_private = false)
+      !!(name.to_s =~ /(un)?train_(\w+)/) || super
     end

-    #
     # Provides a list of category names
     # For example:
     #     b.categories
     #     => ['This', 'That', 'the_other']
-
+    #
+    # @rbs () -> Array[String]
+    def categories
       @categories.keys.collect(&:to_s)
     end

-    #
     # Allows you to add categories to the classifier.
     # For example:
     #     b.add_category "Not spam"
@@ -134,13 +145,14 @@ module Classifier
     # result in an undertrained category that will tend to match
     # more criteria than the trained selective categories. In short,
     # try to initialize your categories at initialization.
+    #
+    # @rbs (String | Symbol) -> Hash[Symbol, Integer]
     def add_category(category)
       @categories[category.prepare_category_name] = {}
     end

     alias append_category add_category

-    #
     # Allows you to remove categories from the classifier.
     # For example:
     #     b.remove_category "Spam"
@@ -148,6 +160,8 @@ module Classifier
     # WARNING: Removing categories from a trained classifier will
     # result in the loss of all training data for that category.
     # Make sure you really want to do this before calling this method.
+    #
+    # @rbs (String | Symbol) -> void
     def remove_category(category)
       category = category.prepare_category_name
       raise StandardError, "No such category: #{category}" unless @categories.key?(category)
data/lib/classifier/extensions/vector.rb
CHANGED
@@ -1,3 +1,5 @@
+# rbs_inline: enabled
+
 # Author:: Ernest Ellingson
 # Copyright:: Copyright (c) 2005

@@ -5,19 +7,20 @@

 require 'matrix'

+# @rbs skip
 class Array
-  def sum_with_identity(identity = 0.0, &
+  def sum_with_identity(identity = 0.0, &)
     return identity unless size.to_i.positive?
+    return map(&).sum_with_identity(identity) if block_given?

-
-      map(&block).sum_with_identity(identity)
-    else
-      compact.reduce(:+).to_f || identity.to_f
-    end
+    compact.reduce(identity, :+).to_f
   end
 end

-
+# @rbs skip
+class Vector
+  EPSILON = 1e-10
+
   def magnitude
     sum_of_squares = 0.to_r
     size.times do |i|
@@ -27,8 +30,10 @@ module VectorExtensions
   end

   def normalize
+    magnitude_value = magnitude
+    return Vector[*Array.new(size, 0.0)] if magnitude_value <= 0.0
+
     normalized_values = []
-    magnitude_value = magnitude.to_r
     size.times do |i|
       normalized_values << (self[i] / magnitude_value)
     end
@@ -36,10 +41,7 @@ module VectorExtensions
   end
 end

-
-  include VectorExtensions
-end
-
+# @rbs skip
 class Matrix
   def self.diag(diagonal_elements)
     Matrix.diagonal(*diagonal_elements)
@@ -61,14 +63,19 @@ class Matrix

     loop do
       iteration_count += 1
-      (0...q_rotation_matrix.row_size - 1).each do |row|
-        (1..q_rotation_matrix.row_size - 1).each do |col|
+      (0...(q_rotation_matrix.row_size - 1)).each do |row|
+        (1..(q_rotation_matrix.row_size - 1)).each do |col|
           next if row == col

-
-
-
-
+          numerator = 2.0 * q_rotation_matrix[row, col]
+          denominator = q_rotation_matrix[row, row] - q_rotation_matrix[col, col]
+
+          angle = if denominator.abs < Vector::EPSILON
+                    numerator >= 0 ? Math::PI / 4.0 : -Math::PI / 4.0
+                  else
+                    Math.atan(numerator / denominator) / 2.0
+                  end
+
           cosine = Math.cos(angle)
           sine = Math.sin(angle)
           rotation_matrix = Matrix.identity(q_rotation_matrix.row_size)
@@ -92,11 +99,12 @@ class Matrix
       break if (sum_of_differences <= 0.001 && iteration_count > 1) || iteration_count >= max_sweeps
     end

-    singular_values =
-
-      singular_values << Math.sqrt(q_rotation_matrix[r, r].to_f)
+    singular_values = q_rotation_matrix.row_size.times.map do |r|
+      Math.sqrt([q_rotation_matrix[r, r].to_f, 0.0].max)
     end
-
+
+    safe_singular_values = singular_values.map { |v| [v, Vector::EPSILON].max }
+    u_matrix = (row_size >= column_size ? self : trans) * v_matrix * Matrix.diagonal(*safe_singular_values).inverse
     [u_matrix, v_matrix, singular_values]
   end

data/lib/classifier/extensions/word_hash.rb
CHANGED
@@ -1,3 +1,5 @@
+# rbs_inline: enabled
+
 # Author:: Lucas Carlson  (mailto:lucas@rufy.com)
 # Copyright:: Copyright (c) 2005 Lucas Carlson
 # License:: LGPL
@@ -11,12 +13,14 @@ class String
   # E.g.,
   #   "Hello (greeting's), with {braces} < >...?".without_punctuation
   #   => "Hello greetings with braces "
+  # @rbs () -> String
   def without_punctuation
-    tr(',?.!;:"@#$%^&*()_=+[]{}
+    tr(',?.!;:"@#$%^&*()_=+[]{}|<>/`~', ' ').tr("'-", '')
   end

   # Return a Hash of strings => ints. Each word in the string is stemmed,
   # interned, and indexes to its frequency in the document.
+  # @rbs () -> Hash[Symbol, Integer]
   def word_hash
     word_hash = clean_word_hash
     symbol_hash = word_hash_for_symbols(gsub(/\w/, ' ').split)
@@ -24,12 +28,14 @@ class String
   end

   # Return a word hash without extra punctuation or short symbols, just stemmed words
+  # @rbs () -> Hash[Symbol, Integer]
   def clean_word_hash
     word_hash_for_words gsub(/[^\w\s]/, '').split
   end

   private

+  # @rbs (Array[String]) -> Hash[Symbol, Integer]
   def word_hash_for_words(words)
     d = Hash.new(0)
     words.each do |word|
@@ -39,6 +45,7 @@ class String
     d
   end

+  # @rbs (Array[String]) -> Hash[Symbol, Integer]
   def word_hash_for_symbols(words)
     d = Hash.new(0)
     words.each do |word|
data/lib/classifier/lsi/content_node.rb
CHANGED
@@ -1,3 +1,5 @@
+# rbs_inline: enabled
+
 # Author:: David Fayram  (mailto:dfayram@lensmen.net)
 # Copyright:: Copyright (c) 2005 David Fayram II
 # License:: LGPL
@@ -7,33 +9,48 @@ module Classifier
   # raw_vector_with, it should be fairly straightforward to understand.
   # You should never have to use it directly.
   class ContentNode
-
-
-
+    # @rbs @word_hash: Hash[Symbol, Integer]
+
+    # @rbs @raw_vector: untyped
+    # @rbs @raw_norm: untyped
+    # @rbs @lsi_vector: untyped
+    # @rbs @lsi_norm: untyped
+    attr_accessor :raw_vector, :raw_norm, :lsi_vector, :lsi_norm
+
+    # @rbs @categories: Array[String | Symbol]
+    attr_accessor :categories

     attr_reader :word_hash

     # If text_proc is not specified, the source will be duck-typed
     # via source.to_s
+    #
+    # @rbs (Hash[Symbol, Integer], *String | Symbol) -> void
     def initialize(word_frequencies, *categories)
       @categories = categories || []
       @word_hash = word_frequencies
     end

     # Use this to fetch the appropriate search vector.
+    #
+    # @rbs () -> untyped
     def search_vector
       @lsi_vector || @raw_vector
     end

     # Use this to fetch the appropriate search vector in normalized form.
+    #
+    # @rbs () -> untyped
     def search_norm
       @lsi_norm || @raw_norm
     end

     # Creates the raw vector out of word_hash using word_list as the
     # key for mapping the vector space.
+    #
+    # @rbs (WordList) -> untyped
     def raw_vector_with(word_list)
-      vec = if
+      vec = if Classifier::LSI.gsl_available
               GSL::Vector.alloc(word_list.size)
             else
               Array.new(word_list.size, 0)
@@ -44,11 +61,13 @@ module Classifier
       end

       # Perform the scaling transform
-      total_words =
+      total_words = Classifier::LSI.gsl_available ? vec.sum : vec.sum_with_identity
+      vec_array = Classifier::LSI.gsl_available ? vec.to_a : vec
+      total_unique_words = vec_array.count { |word| word != 0 }

       # Perform first-order association transform if this vector has more
       # than one word in it.
-      if total_words > 1.0
+      if total_words > 1.0 && total_unique_words > 1
         weighted_total = 0.0

         vec.each do |term|
@@ -59,10 +78,13 @@ module Classifier
           val = term_over_total * Math.log(term_over_total)
           weighted_total += val unless val.nan?
         end
-
+
+        sign = weighted_total.negative? ? 1.0 : -1.0
+        divisor = sign * [weighted_total.abs, Vector::EPSILON].max
+        vec = vec.collect { |val| Math.log(val + 1) / divisor }
       end

-      if
+      if Classifier::LSI.gsl_available
         @raw_norm = vec.normalize
         @raw_vector = vec
       else
data/lib/classifier/lsi/word_list.rb
CHANGED
@@ -1,3 +1,5 @@
+# rbs_inline: enabled
+
 # Author:: David Fayram  (mailto:dfayram@lensmen.net)
 # Copyright:: Copyright (c) 2005 David Fayram II
 # License:: LGPL
@@ -5,29 +7,38 @@
 module Classifier
   # This class keeps a word => index mapping. It is used to map stemmed words
   # to dimensions of a vector.
-
   class WordList
+    # @rbs @location_table: Hash[Symbol, Integer]
+
+    # @rbs () -> void
     def initialize
       @location_table = {}
     end

     # Adds a word (if it is new) and assigns it a unique dimension.
+    #
+    # @rbs (Symbol) -> Integer?
     def add_word(word)
       term = word
       @location_table[term] = @location_table.size unless @location_table[term]
     end

     # Returns the dimension of the word or nil if the word is not in the space.
+    #
+    # @rbs (Symbol) -> Integer?
     def [](lookup)
       term = lookup
       @location_table[term]
     end

+    # @rbs (Integer) -> Symbol?
    def word_for_index(ind)
       @location_table.invert[ind]
     end

     # Returns the number of words mapped.
+    #
+    # @rbs () -> Integer
     def size
       @location_table.size
     end
data/lib/classifier/lsi.rb
CHANGED
@@ -1,17 +1,34 @@
+# rbs_inline: enabled
+
 # Author:: David Fayram  (mailto:dfayram@lensmen.net)
 # Copyright:: Copyright (c) 2005 David Fayram II
 # License:: LGPL

+module Classifier
+  class LSI
+    # @rbs @gsl_available: bool
+    @gsl_available = false
+
+    class << self
+      # @rbs @gsl_available: bool
+      attr_accessor :gsl_available
+    end
+  end
+end
+
 begin
   # to test the native vector class, try `rake test NATIVE_VECTOR=true`
   raise LoadError if ENV['NATIVE_VECTOR'] == 'true'
+  raise LoadError unless Gem::Specification.find_all_by_name('gsl').any?

-  require 'gsl'
+  require 'gsl'
   require 'classifier/extensions/vector_serialize'
-
+  Classifier::LSI.gsl_available = true
 rescue LoadError
-
-
+  unless ENV['SUPPRESS_GSL_WARNING'] == 'true'
+    warn 'Notice: for 10x faster LSI, run `gem install gsl`. Set SUPPRESS_GSL_WARNING=true to hide this.'
+  end
+  Classifier::LSI.gsl_available = false
   require 'classifier/extensions/vector'
 end

@@ -24,13 +41,20 @@ module Classifier
   # data based on underlying semantic relations. For more information on the algorithms used,
   # please consult Wikipedia[http://en.wikipedia.org/wiki/Latent_Semantic_Indexing].
   class LSI
+    # @rbs @auto_rebuild: bool
+    # @rbs @word_list: WordList
+    # @rbs @items: Hash[untyped, ContentNode]
+    # @rbs @version: Integer
+    # @rbs @built_at_version: Integer
+
     attr_reader :word_list
     attr_accessor :auto_rebuild

     # Create a fresh index.
     # If you want to call #build_index manually, use
-    #      Classifier::LSI.new :
+    #      Classifier::LSI.new auto_rebuild: false
     #
+    # @rbs (?Hash[Symbol, untyped]) -> void
     def initialize(options = {})
       @auto_rebuild = true unless options[:auto_rebuild] == false
       @word_list = WordList.new
@@ -42,6 +66,8 @@ module Classifier
     # Returns true if the index needs to be rebuilt. The index needs
     # to be built after all informaton is added, but before you start
     # using it for search, classification and cluster detection.
+    #
+    # @rbs () -> bool
     def needs_rebuild?
       (@items.keys.size > 1) && (@version != @built_at_version)
     end
@@ -59,6 +85,7 @@ module Classifier
     #   ar = ActiveRecordObject.find( :all )
     #   lsi.add_item ar, *ar.categories { |x| ar.content }
     #
+    # @rbs (String, *String | Symbol) ?{ (String) -> String } -> void
     def add_item(item, *categories, &block)
       clean_word_hash = block ? block.call(item).clean_word_hash : item.to_s.clean_word_hash
       @items[item] = ContentNode.new(clean_word_hash, *categories)
@@ -70,12 +97,15 @@ module Classifier
     # you are passing in a string with no categorries. item
     # will be duck typed via to_s .
     #
+    # @rbs (String) -> void
     def <<(item)
       add_item(item)
     end

     # Returns the categories for a given indexed items. You are free to add and remove
     # items from this as you see fit. It does not invalide an index to change its categories.
+    #
+    # @rbs (String) -> Array[String | Symbol]
     def categories_for(item)
       return [] unless @items[item]

@@ -84,6 +114,7 @@ module Classifier

     # Removes an item from the database, if it is indexed.
     #
+    # @rbs (String) -> void
     def remove_item(item)
       return unless @items.key?(item)

@@ -92,6 +123,7 @@ module Classifier
     end

     # Returns an array of items that are indexed.
+    # @rbs () -> Array[untyped]
     def items
       @items.keys
     end
@@ -110,6 +142,8 @@ module Classifier
     # cutoff parameter tells the indexer how many of these values to keep.
     # A value of 1 for cutoff means that no semantic analysis will take place,
     # turning the LSI class into a simple vector search engine.
+    #
+    # @rbs (?Float) -> void
     def build_index(cutoff = 0.75)
       return unless needs_rebuild?

@@ -118,7 +152,7 @@ module Classifier
       doc_list = @items.values
       tda = doc_list.collect { |node| node.raw_vector_with(@word_list) }

-      if
+      if self.class.gsl_available
         tdm = GSL::Matrix.alloc(*tda).trans
         ntdm = build_reduced_matrix(tdm, cutoff)

@@ -131,9 +165,14 @@ module Classifier
         tdm = Matrix.rows(tda).trans
         ntdm = build_reduced_matrix(tdm, cutoff)

-        ntdm.
-
-
+        ntdm.column_size.times do |col|
+          next unless doc_list[col]
+
+          column = ntdm.column(col)
+          next unless column
+
+          doc_list[col].lsi_vector = column
+          doc_list[col].lsi_norm = column.normalize
         end
       end

@@ -148,13 +187,15 @@ module Classifier
     # your dataset's general content. For example, if you were to use categorize on the
     # results of this data, you could gather information on what your dataset is generally
     # about.
+    #
+    # @rbs (?Integer) -> Array[String]
     def highest_relative_content(max_chunks = 10)
       return [] if needs_rebuild?

       avg_density = {}
-      @items.each_key { |x| avg_density[x] = proximity_array_for_content(x).
+      @items.each_key { |x| avg_density[x] = proximity_array_for_content(x).sum { |pair| pair[1] } }

-      avg_density.keys.sort_by { |x| avg_density[x] }.reverse[0..max_chunks - 1].map
+      avg_density.keys.sort_by { |x| avg_density[x] }.reverse[0..(max_chunks - 1)].map
     end

     # This function is the primitive that find_related and classify
@@ -169,13 +210,15 @@ module Classifier
     # The parameter doc is the content to compare. If that content is not
     # indexed, you can pass an optional block to define how to create the
     # text data. See add_item for examples of how this works.
-
+    #
+    # @rbs (String) ?{ (String) -> String } -> Array[[String, Float]]
+    def proximity_array_for_content(doc, &)
       return [] if needs_rebuild?

-      content_node = node_for_content(doc, &
+      content_node = node_for_content(doc, &)
       result =
         @items.keys.collect do |item|
-          val = if
+          val = if self.class.gsl_available
                   content_node.search_vector * @items[item].search_vector.col
                 else
                   (Matrix[content_node.search_vector] * @items[item].search_vector)[0]
@@ -190,13 +233,15 @@ module Classifier
     # calculated vectors instead of their full versions. This is useful when
     # you're trying to perform operations on content that is much smaller than
     # the text you're working with. search uses this primitive.
-
+    #
+    # @rbs (String) ?{ (String) -> String } -> Array[[String, Float]]
+    def proximity_norms_for_content(doc, &)
       return [] if needs_rebuild?

-      content_node = node_for_content(doc, &
+      content_node = node_for_content(doc, &)
       result =
         @items.keys.collect do |item|
-          val = if
+          val = if self.class.gsl_available
                   content_node.search_norm * @items[item].search_norm.col
                 else
                   (Matrix[content_node.search_norm] * @items[item].search_norm)[0]
@@ -213,12 +258,14 @@ module Classifier
     #
     # While this may seem backwards compared to the other functions that LSI supports,
     # it is actually the same algorithm, just applied on a smaller document.
+    #
+    # @rbs (String, ?Integer) -> Array[String]
     def search(string, max_nearest = 3)
       return [] if needs_rebuild?

       carry = proximity_norms_for_content(string)
       result = carry.collect { |x| x[0] }
-      result[0..max_nearest - 1]
+      result[0..(max_nearest - 1)]
     end

     # This function takes content and finds other documents
@@ -230,11 +277,13 @@ module Classifier
     # This is particularly useful for identifing clusters in your document space.
     # For example you may want to identify several "What's Related" items for weblog
     # articles, or find paragraphs that relate to each other in an essay.
+    #
+    # @rbs (String, ?Integer) ?{ (String) -> String } -> Array[String]
     def find_related(doc, max_nearest = 3, &block)
       carry =
         proximity_array_for_content(doc, &block).reject { |pair| pair[0] == doc }
       result = carry.collect { |x| x[0] }
-      result[0..max_nearest - 1]
+      result[0..(max_nearest - 1)]
     end

     # This function uses a voting system to categorize documents, based on
@@ -246,17 +295,19 @@ module Classifier
     # text. A cutoff of 1 means that every document in the index votes on
     # what category the document is in. This may not always make sense.
     #
-
-
+    # @rbs (String, ?Float) ?{ (String) -> String } -> String | Symbol
+    def classify(doc, cutoff = 0.30, &)
+      votes = vote(doc, cutoff, &)

       ranking = votes.keys.sort_by { |x| votes[x] }
       ranking[-1]
     end

-
+    # @rbs (String, ?Float) ?{ (String) -> String } -> Hash[String | Symbol, Float]
+    def vote(doc, cutoff = 0.30, &)
       icutoff = (@items.size * cutoff).round
-      carry = proximity_array_for_content(doc, &
-      carry = carry[0..icutoff - 1]
+      carry = proximity_array_for_content(doc, &)
+      carry = carry[0..(icutoff - 1)]
       votes = {}
       carry.each do |pair|
         categories = @items[pair[0]].categories
@@ -278,11 +329,11 @@ module Classifier
     #      category = nil
     #    end
     #
-    #
     # See classify() for argument docs
-
-
-
+    # @rbs (String, ?Float) ?{ (String) -> String } -> [String | Symbol | nil, Float?]
+    def classify_with_confidence(doc, cutoff = 0.30, &)
+      votes = vote(doc, cutoff, &)
+      votes_sum = votes.values.sum
       return [nil, nil] if votes_sum.zero?

       ranking = votes.keys.sort_by { |x| votes[x] }
@@ -294,16 +345,18 @@ module Classifier
     # Prototype, only works on indexed documents.
     # I have no clue if this is going to work, but in theory
     # it's supposed to.
+    # @rbs (String, ?Integer) -> Array[Symbol]
     def highest_ranked_stems(doc, count = 3)
       raise 'Requested stem ranking on non-indexed content!' unless @items[doc]

       arr = node_for_content(doc).lsi_vector.to_a
-      top_n = arr.sort.reverse[0..count - 1]
+      top_n = arr.sort.reverse[0..(count - 1)]
       top_n.collect { |x| @word_list.word_for_index(arr.index(x)) }
     end

     private

+    # @rbs (untyped, ?Float) -> untyped
     def build_reduced_matrix(matrix, cutoff = 0.75)
       # TODO: Check that M>=N on these dimensions! Transpose helps assure this
       u, v, s = matrix.SV_decomp
@@ -314,23 +367,26 @@ module Classifier
         s[ord] = 0.0 if s[ord] < s_cutoff
       end
       # Reconstruct the term document matrix, only with reduced rank
-      u * (
+      result = u * (self.class.gsl_available ? GSL::Matrix : ::Matrix).diag(s) * v.trans
+
+      # Native Ruby SVD returns transposed dimensions when row_size < column_size
+      # Ensure result matches input dimensions
+      result = result.trans if !self.class.gsl_available && result.row_size != matrix.row_size
+
+      result
     end

+    # @rbs (String) ?{ (String) -> String } -> ContentNode
     def node_for_content(item, &block)
       return @items[item] if @items[item]

       clean_word_hash = block ? block.call(item).clean_word_hash : item.to_s.clean_word_hash
-
-      cn
-
-      unless needs_rebuild?
-        cn.raw_vector_with(@word_list) # make the lsi raw and norm vectors
-      end
-
+      cn = ContentNode.new(clean_word_hash, &block)
+      cn.raw_vector_with(@word_list) unless needs_rebuild?
       cn
     end

+    # @rbs () -> void
     def make_word_list
       @word_list = WordList.new
       @items.each_value do |node|
data/sig/vendor/gsl.rbs
ADDED
@@ -0,0 +1,27 @@
# Type stubs for optional GSL gem
module GSL
  class Vector
    def self.alloc: (untyped) -> Vector
    def to_a: () -> Array[Float]
    def normalize: () -> Vector
    def sum: () -> Float
    def each: () { (Float) -> void } -> void
    def []: (Integer) -> Float
    def []=: (Integer, Float) -> Float
    def size: () -> Integer
    def row: () -> Vector
    def col: () -> Vector
    def *: (untyped) -> untyped
    def collect: () { (Float) -> Float } -> Vector
  end

  class Matrix
    def self.alloc: (*untyped) -> Matrix
    def self.diag: (untyped) -> Matrix
    def trans: () -> Matrix
    def *: (untyped) -> Matrix
    def size: () -> [Integer, Integer]
    def column: (Integer) -> Vector
    def SV_decomp: () -> [Matrix, Matrix, Vector]
  end
end
data/sig/vendor/matrix.rbs
ADDED
@@ -0,0 +1,26 @@
# Type stubs for matrix gem
class Vector[T]
  EPSILON: Float

  def self.[]: [T] (*T) -> Vector[T]
  def size: () -> Integer
  def []: (Integer) -> T
  def magnitude: () -> Float
  def normalize: () -> Vector[T]
  def each: () { (T) -> void } -> void
  def collect: [U] () { (T) -> U } -> Vector[U]
  def to_a: () -> Array[T]
  def *: (untyped) -> untyped
end

class Matrix[T]
  def self.rows: [T] (Array[Array[T]]) -> Matrix[T]
  def self.[]: [T] (*Array[T]) -> Matrix[T]
  def self.diag: (untyped) -> Matrix[untyped]
  def trans: () -> Matrix[T]
  def *: (untyped) -> untyped
  def row_size: () -> Integer
  def column_size: () -> Integer
  def column: (Integer) -> Vector[T]
  def SV_decomp: () -> [Matrix[T], Matrix[T], untyped]
end
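These vendor stubs presumably exist so that calls into the optional GSL gem and the `matrix` gem can be checked against the inline `# @rbs` annotations added throughout the source diffs above; `rbs-inline` itself appears as a new development dependency in the metadata diff below.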
data/test/test_helper.rb
CHANGED
@@ -1,4 +1,14 @@
-
+require 'simplecov'
+SimpleCov.start do
+  add_filter '/test/'
+  add_filter '/vendor/'
+  add_group 'Bayes', 'lib/classifier/bayes.rb'
+  add_group 'LSI', 'lib/classifier/lsi'
+  add_group 'Extensions', 'lib/classifier/extensions'
+  enable_coverage :branch
+end
+
+$LOAD_PATH.unshift("#{File.dirname(__FILE__)}/../lib")

 require 'minitest'
 require 'minitest/autorun'
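With this helper in place, `rake test` also produces a SimpleCov report (by default an HTML report under `coverage/`), and `enable_coverage :branch` adds branch coverage alongside the usual line coverage.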
metadata
CHANGED
@@ -1,14 +1,13 @@
 --- !ruby/object:Gem::Specification
 name: classifier
 version: !ruby/object:Gem::Version
-  version:
+  version: 2.0.0
 platform: ruby
 authors:
 - Lucas Carlson
-autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 1980-01-02 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: fast-stemmer
@@ -52,6 +51,20 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '0'
+- !ruby/object:Gem::Dependency
+  name: matrix
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
 - !ruby/object:Gem::Dependency
   name: minitest
   requirement: !ruby/object:Gem::Requirement
@@ -66,6 +79,20 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '0'
+- !ruby/object:Gem::Dependency
+  name: rbs-inline
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
 - !ruby/object:Gem::Dependency
   name: rdoc
   requirement: !ruby/object:Gem::Requirement
@@ -86,7 +113,9 @@ executables: []
 extensions: []
 extra_rdoc_files: []
 files:
+- CLAUDE.md
 - LICENSE
+- README.md
 - bin/bayes.rb
 - bin/summarize.rb
 - lib/classifier.rb
@@ -99,12 +128,14 @@ files:
 - lib/classifier/lsi/content_node.rb
 - lib/classifier/lsi/summary.rb
 - lib/classifier/lsi/word_list.rb
+- sig/vendor/fast_stemmer.rbs
+- sig/vendor/gsl.rbs
+- sig/vendor/matrix.rbs
 - test/test_helper.rb
 homepage: https://github.com/cardmagic/classifier
 licenses:
 - LGPL
 metadata: {}
-post_install_message:
 rdoc_options: []
 require_paths:
 - lib
@@ -119,8 +150,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
   - !ruby/object:Gem::Version
     version: '0'
 requirements: []
-rubygems_version:
-signing_key:
+rubygems_version: 4.0.3
 specification_version: 4
 summary: A general classifier module to allow Bayesian and other types of classifications.
 test_files: []