classifier 2.1.0 → 2.3.0
- checksums.yaml +4 -4
- data/README.md +70 -199
- data/exe/classifier +9 -0
- data/ext/classifier/classifier_ext.c +1 -0
- data/ext/classifier/incremental_svd.c +393 -0
- data/ext/classifier/linalg.h +8 -0
- data/lib/classifier/bayes.rb +177 -53
- data/lib/classifier/cli.rb +880 -0
- data/lib/classifier/errors.rb +3 -0
- data/lib/classifier/knn.rb +351 -0
- data/lib/classifier/logistic_regression.rb +593 -0
- data/lib/classifier/lsi/incremental_svd.rb +166 -0
- data/lib/classifier/lsi/summary.rb +25 -5
- data/lib/classifier/lsi.rb +365 -17
- data/lib/classifier/streaming/line_reader.rb +99 -0
- data/lib/classifier/streaming/progress.rb +96 -0
- data/lib/classifier/streaming.rb +122 -0
- data/lib/classifier/tfidf.rb +408 -0
- data/lib/classifier/version.rb +3 -0
- data/lib/classifier.rb +5 -0
- data/sig/classifier.rbs +3 -0
- data/sig/vendor/json.rbs +1 -0
- data/sig/vendor/matrix.rbs +25 -14
- data/sig/vendor/optparse.rbs +19 -0
- data/sig/vendor/streaming.rbs +14 -0
- metadata +39 -6
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-metadata.gz:
-data.tar.gz:
+metadata.gz: c13b0ca80981d0186f038581e896182ca112928dcada4ac23700f2a9642ca785
+data.tar.gz: 1949dea18b6d7e06eb931d411e821757f75084c6776ea589927b1f1b89d82280
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 562474db108e959f218407bd66a58bffe1b609ea38b8db16d15f579196d41d4976d74a66b6928e195ff0f6cc5dc4b081ea75d634cefb928e25932f8867358384
+data.tar.gz: 39f72068f85e64a6397672376f6fcc4821e1eccd9556813e4894ba62800975330051c87e88a8747f03278e6ad0e83a665ed48fcc6cc1f623cd91b657089f8dd6
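The `checksums.yaml` entries above are digests of the two archives packed inside the `.gem` file. As a minimal sketch, assuming the gem has been fetched and unpacked locally (for example with `gem fetch classifier -v 2.3.0` followed by `tar -xf classifier-2.3.0.gem`), the SHA256 values could be checked with Ruby's standard `digest` library. The helper name `sha256_matches?` is hypothetical and is not part of the gem or of RubyGems.

```ruby
require "digest"

# Hypothetical helper (an assumption for illustration, not a RubyGems API):
# returns true when the file's SHA-256 hex digest equals the expected value.
# The expected value is stripped and lowercased so that copy-pasted digests
# with trailing newlines or uppercase hex digits still compare correctly.
def sha256_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.strip.downcase
end

# Usage, assuming data.tar.gz was extracted into the current directory:
#   sha256_matches?(
#     "data.tar.gz",
#     "1949dea18b6d7e06eb931d411e821757f75084c6776ea589927b1f1b89d82280"
#   )
```

The same pattern should apply to the SHA512 entries by swapping in `Digest::SHA512`.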
data/README.md
CHANGED
@@ -4,271 +4,142 @@
 [](https://github.com/cardmagic/classifier/actions/workflows/ruby.yml)
 [](https://opensource.org/licenses/LGPL-2.1)
 
-
+Text classification in Ruby. Five algorithms, native performance, streaming support.
 
-**[Documentation](https://rubyclassifier.com/docs)** · **[Tutorials](https://rubyclassifier.com/docs/tutorials)** · **[
+**[Documentation](https://rubyclassifier.com/docs)** · **[Tutorials](https://rubyclassifier.com/docs/tutorials)** · **[API Reference](https://rubydoc.info/gems/classifier)**
 
-##
+## Why This Library?
 
-
-
-
-
-
--
-
-- [License](#license)
+| | This Gem | Other Forks |
+|:--|:--|:--|
+| **Algorithms** | ✅ 5 classifiers | ❌ 2 only |
+| **Incremental LSI** | ✅ Brand's algorithm (no rebuild) | ❌ Full SVD rebuild on every add |
+| **LSI Performance** | ✅ Native C extension (5-50x faster) | ❌ Pure Ruby or requires GSL |
+| **Streaming** | ✅ Train on multi-GB datasets | ❌ Must load all data in memory |
+| **Persistence** | ✅ Pluggable (file, Redis, S3, SQL, Custom) | ❌ Marshal only |
 
 ## Installation
 
-Add to your Gemfile:
-
 ```ruby
 gem 'classifier'
 ```
 
-
-
-```bash
-bundle install
-```
-
-Or install directly:
+## Quick Start
 
-
-gem install classifier
-```
-
-### Native C Extension
-
-The gem includes a native C extension for fast LSI operations. It compiles automatically during gem installation. No external dependencies are required.
-
-To verify the native extension is active:
+### Bayesian
 
 ```ruby
-
-
+classifier = Classifier::Bayes.new(:spam, :ham)
+classifier.train(spam: "Buy viagra cheap pills now")
+classifier.train(spam: "You won million dollars prize")
+classifier.train(ham: ["Meeting tomorrow at 3pm", "Quarterly report attached"])
+classifier.classify("Cheap pills!") # => "Spam"
 ```
+[Bayesian Guide →](https://rubyclassifier.com/docs/guides/bayes/basics)
 
-
+### Logistic Regression
 
-```
-
+```ruby
+classifier = Classifier::LogisticRegression.new(:positive, :negative)
+classifier.train(positive: "love amazing great wonderful")
+classifier.train(negative: "hate terrible awful bad")
+classifier.classify("I love it!") # => "Positive"
 ```
+[Logistic Regression Guide →](https://rubyclassifier.com/docs/guides/logisticregression/basics)
 
-
+### LSI (Latent Semantic Indexing)
 
-```
-
+```ruby
+lsi = Classifier::LSI.new
+lsi.add(dog: "dog puppy canine bark fetch", cat: "cat kitten feline meow purr")
+lsi.classify("My puppy barks") # => "dog"
 ```
+[LSI Guide →](https://rubyclassifier.com/docs/guides/lsi/basics)
 
-###
-
-| Ruby Version | Status |
-|--------------|--------|
-| 4.0 | Supported |
-| 3.4 | Supported |
-| 3.3 | Supported |
-| 3.2 | Supported |
-| 3.1 | EOL (unsupported) |
-
-## Bayesian Classifier
-
-Fast, accurate classification with modest memory requirements. Ideal for spam filtering, sentiment analysis, and content categorization.
-
-### Quick Start
+### k-Nearest Neighbors
 
 ```ruby
-
-
-
-
-# Train the classifier
-classifier.train_spam "Buy cheap viagra now! Limited offer!"
-classifier.train_spam "You've won a million dollars! Claim now!"
-classifier.train_ham "Meeting scheduled for tomorrow at 10am"
-classifier.train_ham "Please review the attached document"
-
-# Classify new text
-classifier.classify "Congratulations! You've won a prize!"
-# => "Spam"
+knn = Classifier::KNN.new(k: 3)
+%w[laptop coding software developer programming].each { |w| knn.add(tech: w) }
+%w[football basketball soccer goal team].each { |w| knn.add(sports: w) }
+knn.classify("programming code") # => "tech"
 ```
+[k-Nearest Neighbors Guide →](https://rubyclassifier.com/docs/guides/knn/basics)
 
-###
-
-- [Bayes Basics Guide](https://rubyclassifier.com/docs/guides/bayes/basics) - In-depth documentation
-- [Build a Spam Filter Tutorial](https://rubyclassifier.com/docs/tutorials/spam-filter) - Step-by-step guide
-- [Paul Graham: A Plan for Spam](http://www.paulgraham.com/spam.html)
-
-## LSI (Latent Semantic Indexing)
-
-Semantic analysis using Singular Value Decomposition (SVD). More flexible than Bayesian classifiers, providing search, clustering, and classification based on meaning rather than just keywords.
-
-### Quick Start
+### TF-IDF
 
 ```ruby
-
-
-
-
-
-lsi.add_item "Dogs are loyal pets that love to play fetch", :pets
-lsi.add_item "Cats are independent and love to nap", :pets
-lsi.add_item "Ruby is a dynamic programming language", :programming
-lsi.add_item "Python is great for data science", :programming
+tfidf = Classifier::TFIDF.new
+tfidf.fit(["Ruby is great", "Python is great", "Ruby on Rails"])
+tfidf.transform("Ruby programming") # => {:rubi => 1.0}
+```
+[TF-IDF Guide →](https://rubyclassifier.com/docs/guides/tfidf/basics)
 
-
-lsi.classify "My puppy loves to run around"
-# => :pets
+## Key Features
 
-
-lsi.classify_with_confidence "Learning to code in Ruby"
-# => [:programming, 0.89]
-```
+### Incremental LSI
 
-
+Add documents without rebuilding the entire index—400x faster for streaming data:
 
 ```ruby
-
-lsi.
-
+lsi = Classifier::LSI.new(incremental: true)
+lsi.add(tech: ["Ruby is elegant", "Python is popular"])
+lsi.build_index
 
-#
-lsi.
-
+# These use Brand's algorithm—no full rebuild
+lsi.add(tech: "Go is fast")
+lsi.add(tech: "Rust is safe")
 ```
 
-
-
-- [LSI Basics Guide](https://rubyclassifier.com/docs/guides/lsi/basics) - In-depth documentation
-- [Wikipedia: Latent Semantic Analysis](http://en.wikipedia.org/wiki/Latent_semantic_analysis)
+[Learn more →](https://rubyclassifier.com/docs/guides/lsi/basics)
 
-
-
-Save and load trained classifiers with pluggable storage backends. Works with both Bayes and LSI classifiers.
-
-### File Storage
+### Persistence
 
 ```ruby
-
-
-classifier = Classifier::Bayes.new('Spam', 'Ham')
-classifier.train_spam "Buy now! Limited offer!"
-classifier.train_ham "Meeting tomorrow at 3pm"
-
-# Configure storage and save
-classifier.storage = Classifier::Storage::File.new(path: "spam_filter.json")
+classifier.storage = Classifier::Storage::File.new(path: "model.json")
 classifier.save
 
-# Load later
 loaded = Classifier::Bayes.load(storage: classifier.storage)
-loaded.classify "Claim your prize now!"
-# => "Spam"
 ```
 
-
+[Learn more →](https://rubyclassifier.com/docs/guides/persistence/basics)
 
-
+### Streaming Training
 
 ```ruby
-
-def initialize(redis:, key:)
-super()
-@redis, @key = redis, key
-end
-
-def write(data) = @redis.set(@key, data)
-def read = @redis.get(@key)
-def delete = @redis.del(@key)
-def exists? = @redis.exists?(@key)
-end
-
-# Use it
-classifier.storage = RedisStorage.new(redis: Redis.new, key: "classifier:spam")
-classifier.save
+classifier.train_from_stream(:spam, File.open("spam_corpus.txt"))
 ```
 
-
-
-- [Persistence Guide](https://rubyclassifier.com/docs/guides/persistence/basics) - Full documentation with examples
+[Learn more →](https://rubyclassifier.com/docs/tutorials/streaming-training)
 
 ## Performance
 
-
-
-The native C extension provides dramatic speedups for LSI operations, especially `build_index` (SVD computation):
-
-| Documents | build_index | Overall |
-|-----------|-------------|---------|
-| 5 | 7x faster | 2.6x |
-| 10 | 25x faster | 4.6x |
-| 15 | 112x faster | 14.5x |
-| 20 | 385x faster | 48.7x |
-
-<details>
-<summary>Detailed benchmark (20 documents)</summary>
-
-```
-Operation Pure Ruby Native C Speedup
------------------------------------------------------------
-build_index 0.5540 0.0014 384.5x
-classify 0.0190 0.0060 3.2x
-search 0.0145 0.0037 3.9x
-find_related 0.0098 0.0011 8.6x
------------------------------------------------------------
-TOTAL 0.5973 0.0123 48.7x
-```
-</details>
+Native C extension provides 5-50x speedup for LSI operations:
 
-
+| Documents | Speedup |
+|-----------|---------|
+| 10 | 25x |
+| 20 | 50x |
 
 ```bash
-rake benchmark
-rake benchmark:compare # Compare native C vs pure Ruby
+rake benchmark:compare # Run your own comparison
 ```
 
 ## Development
 
-### Setup
-
 ```bash
-git clone https://github.com/cardmagic/classifier.git
-cd classifier
 bundle install
-rake compile #
-
-
-### Running Tests
-
-```bash
-rake test # Run all tests (compiles first)
-ruby -Ilib test/bayes/bayesian_test.rb # Run specific test file
-
-# Test with pure Ruby (no native extension)
-NATIVE_VECTOR=true rake test
+rake compile # Build native extension
+rake test # Run tests
 ```
 
-### Console
-
-```bash
-rake console
-```
-
-## Contributing
-
-1. Fork the repository
-2. Create your feature branch (`git checkout -b feature/amazing-feature`)
-3. Commit your changes (`git commit -am 'Add amazing feature'`)
-4. Push to the branch (`git push origin feature/amazing-feature`)
-5. Open a Pull Request
-
 ## Authors
 
-- **Lucas Carlson** -
-- **David Fayram II** -
+- **Lucas Carlson** - lucas@rufy.com
+- **David Fayram II** - dfayram@gmail.com
 - **Cameron McBride** - cameron.mcbride@gmail.com
 - **Ivan Acosta-Rubio** - ivan@softwarecriollo.com
 
 ## License
 
-
+[LGPL 2.1](LICENSE)
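The new README advertises streaming training through `train_from_stream`, which is shown taking an IO object. The diff does not show how that method batches its input, so the following is only a sketch of the general technique: read an IO line by line and hand off fixed-size batches, so the whole corpus is never held in memory at once. `each_batch` and `batch_size` are hypothetical names chosen for illustration and are not the gem's API.

```ruby
require "stringio"

# Hypothetical helper (an assumption, not part of the classifier gem):
# yields arrays of stripped, non-empty lines read from any IO-like object.
# At most `batch_size` lines are held in memory at a time.
def each_batch(io, batch_size: 1000)
  raise ArgumentError, "a block is required" unless block_given?

  batch = []
  io.each_line do |line|
    text = line.strip
    next if text.empty?          # skip blank lines

    batch << text
    next if batch.size < batch_size

    yield batch                  # hand off a full batch
    batch = []                   # start a fresh array for the next one
  end
  yield batch unless batch.empty? # flush the final partial batch
end

# Demonstration with an in-memory stream standing in for a large file.
corpus = StringIO.new("Buy cheap pills\n\nYou won a prize\nClaim now\n")
each_batch(corpus, batch_size: 2) do |batch|
  puts "#{batch.size} line(s): #{batch.inspect}"
end
```

In real use the `StringIO` would be replaced by something like `File.open("spam_corpus.txt")`, and each batch would be passed to whatever training call the classifier exposes.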
data/exe/classifier
ADDED