classifier 2.1.0 → 2.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/README.md +66 -199
- data/ext/classifier/classifier_ext.c +1 -0
- data/ext/classifier/incremental_svd.c +393 -0
- data/ext/classifier/linalg.h +8 -0
- data/lib/classifier/bayes.rb +177 -53
- data/lib/classifier/errors.rb +3 -0
- data/lib/classifier/knn.rb +351 -0
- data/lib/classifier/logistic_regression.rb +571 -0
- data/lib/classifier/lsi/incremental_svd.rb +166 -0
- data/lib/classifier/lsi/summary.rb +25 -5
- data/lib/classifier/lsi.rb +365 -17
- data/lib/classifier/streaming/line_reader.rb +99 -0
- data/lib/classifier/streaming/progress.rb +96 -0
- data/lib/classifier/streaming.rb +122 -0
- data/lib/classifier/tfidf.rb +408 -0
- data/lib/classifier.rb +4 -0
- data/sig/vendor/matrix.rbs +25 -14
- data/sig/vendor/streaming.rbs +14 -0
- metadata +17 -4
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 2ce325479bd32938cd7a7c694152983579266667e5def8d36725c38594eb822b
+  data.tar.gz: a35439c4ddfba77091297fee46930c279c54bfeb314b8b87fba8dd0ccbe6b292
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e82b06f6449d3c5b089f840fe794b0e7f59bc1a00a379d4739010ad872ce7a90a4cbc1eee930b822b2761add461f2378dde8e015115c45b59c67e8aee0ad01b5
+  data.tar.gz: fa9271f248da8de9012ef923911b5f3d18cb9ced08ff9f85372e13622b8ab43a3f75365d21e9de3b3fa2849f3a54c3261da67c00c38a263b3ceee7ac0afd2598
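The SHA256 values in `checksums.yaml` are hex digests of the `metadata.gz` and `data.tar.gz` members stored inside the `.gem` archive. The following is a minimal Ruby sketch of checking one of them, assuming the member has already been extracted to disk. The helper name `sha256_matches?` and the local file path are hypothetical and not part of RubyGems.

```ruby
require "digest"

# Hypothetical helper: compare the SHA256 hex digest of raw bytes
# against an expected value such as one copied from checksums.yaml.
def sha256_matches?(bytes, expected_hex)
  Digest::SHA256.hexdigest(bytes) == expected_hex.strip.downcase
end

# Expected digest for data.tar.gz in 2.2.0, copied from the diff above.
expected_data_sha256 =
  "a35439c4ddfba77091297fee46930c279c54bfeb314b8b87fba8dd0ccbe6b292"

# Assumed path: data.tar.gz extracted from the downloaded .gem file,
# which is itself a plain tar archive.
path = "data.tar.gz"
if File.exist?(path)
  ok = sha256_matches?(File.binread(path), expected_data_sha256)
  puts ok ? "data.tar.gz: checksum OK" : "data.tar.gz: checksum MISMATCH"
end
```

The file is read with `File.binread` so that no newline or encoding translation alters the bytes before hashing.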
data/README.md
CHANGED
@@ -4,271 +4,138 @@
 [](https://github.com/cardmagic/classifier/actions/workflows/ruby.yml)
 [](https://opensource.org/licenses/LGPL-2.1)
 
-
+Text classification in Ruby. Five algorithms, native performance, streaming support.
 
-**[Documentation](https://rubyclassifier.com/docs)** · **[Tutorials](https://rubyclassifier.com/docs/tutorials)** · **[
+**[Documentation](https://rubyclassifier.com/docs)** · **[Tutorials](https://rubyclassifier.com/docs/tutorials)** · **[API Reference](https://rubyclassifier.com/docs/api)**
 
-##
+## Why This Library?
 
-
-
-
-
-
--
-
-- [License](#license)
+| | This Gem | Other Forks |
+|:--|:--|:--|
+| **Algorithms** | ✅ 5 classifiers | ❌ 2 only |
+| **Incremental LSI** | ✅ Brand's algorithm (no rebuild) | ❌ Full SVD rebuild on every add |
+| **LSI Performance** | ✅ Native C extension (5-50x faster) | ❌ Pure Ruby or requires GSL |
+| **Streaming** | ✅ Train on multi-GB datasets | ❌ Must load all data in memory |
+| **Persistence** | ✅ Pluggable (file, Redis, S3) | ❌ Marshal only |
 
 ## Installation
 
-Add to your Gemfile:
-
 ```ruby
 gem 'classifier'
 ```
 
-
-
-```bash
-bundle install
-```
-
-Or install directly:
+## Quick Start
 
-
-gem install classifier
-```
-
-### Native C Extension
-
-The gem includes a native C extension for fast LSI operations. It compiles automatically during gem installation. No external dependencies are required.
-
-To verify the native extension is active:
+### Bayesian
 
 ```ruby
-
-
+classifier = Classifier::Bayes.new(:spam, :ham)
+classifier.train(spam: "Buy cheap viagra now!", ham: "Meeting at 3pm tomorrow")
+classifier.classify "You've won a prize!" # => "Spam"
 ```
+[Bayesian Guide →](https://rubyclassifier.com/docs/guides/bayes/basics)
 
-
+### Logistic Regression
 
-```
-
+```ruby
+classifier = Classifier::LogisticRegression.new(:positive, :negative)
+classifier.train(positive: "Great product!", negative: "Terrible experience")
+classifier.classify "Loved it!" # => "Positive"
 ```
+[Logistic Regression Guide →](https://rubyclassifier.com/docs/guides/logisticregression/basics)
 
-
+### LSI (Latent Semantic Indexing)
 
-```
-
+```ruby
+lsi = Classifier::LSI.new
+lsi.add(pets: "Dogs are loyal", tech: "Ruby is elegant")
+lsi.classify "My puppy is playful" # => "pets"
 ```
+[LSI Guide →](https://rubyclassifier.com/docs/guides/lsi/basics)
 
-###
-
-| Ruby Version | Status |
-|--------------|--------|
-| 4.0 | Supported |
-| 3.4 | Supported |
-| 3.3 | Supported |
-| 3.2 | Supported |
-| 3.1 | EOL (unsupported) |
-
-## Bayesian Classifier
-
-Fast, accurate classification with modest memory requirements. Ideal for spam filtering, sentiment analysis, and content categorization.
-
-### Quick Start
+### k-Nearest Neighbors
 
 ```ruby
-
-
-
-
-# Train the classifier
-classifier.train_spam "Buy cheap viagra now! Limited offer!"
-classifier.train_spam "You've won a million dollars! Claim now!"
-classifier.train_ham "Meeting scheduled for tomorrow at 10am"
-classifier.train_ham "Please review the attached document"
-
-# Classify new text
-classifier.classify "Congratulations! You've won a prize!"
-# => "Spam"
+knn = Classifier::KNN.new(k: 3)
+knn.train(spam: "Free money!", ham: "Quarterly report attached") # or knn.add()
+knn.classify "Claim your prize" # => "spam"
 ```
+[k-Nearest Neighbors Guide →](https://rubyclassifier.com/docs/guides/knn/basics)
 
-###
-
-- [Bayes Basics Guide](https://rubyclassifier.com/docs/guides/bayes/basics) - In-depth documentation
-- [Build a Spam Filter Tutorial](https://rubyclassifier.com/docs/tutorials/spam-filter) - Step-by-step guide
-- [Paul Graham: A Plan for Spam](http://www.paulgraham.com/spam.html)
-
-## LSI (Latent Semantic Indexing)
-
-Semantic analysis using Singular Value Decomposition (SVD). More flexible than Bayesian classifiers, providing search, clustering, and classification based on meaning rather than just keywords.
-
-### Quick Start
+### TF-IDF
 
 ```ruby
-
-
-
-
-
-lsi.add_item "Dogs are loyal pets that love to play fetch", :pets
-lsi.add_item "Cats are independent and love to nap", :pets
-lsi.add_item "Ruby is a dynamic programming language", :programming
-lsi.add_item "Python is great for data science", :programming
+tfidf = Classifier::TFIDF.new
+tfidf.fit(["Dogs are pets", "Cats are independent"])
+tfidf.transform("Dogs are loyal") # => {:dog => 0.707, :loyal => 0.707}
+```
+[TF-IDF Guide →](https://rubyclassifier.com/docs/guides/tfidf/basics)
 
-
-lsi.classify "My puppy loves to run around"
-# => :pets
+## Key Features
 
-
-lsi.classify_with_confidence "Learning to code in Ruby"
-# => [:programming, 0.89]
-```
+### Incremental LSI
 
-
+Add documents without rebuilding the entire index—400x faster for streaming data:
 
 ```ruby
-
-lsi.
-
+lsi = Classifier::LSI.new(incremental: true)
+lsi.add(tech: ["Ruby is elegant", "Python is popular"])
+lsi.build_index
 
-#
-lsi.
-
+# These use Brand's algorithm—no full rebuild
+lsi.add(tech: "Go is fast")
+lsi.add(tech: "Rust is safe")
 ```
 
-
-
-- [LSI Basics Guide](https://rubyclassifier.com/docs/guides/lsi/basics) - In-depth documentation
-- [Wikipedia: Latent Semantic Analysis](http://en.wikipedia.org/wiki/Latent_semantic_analysis)
+[Learn more →](https://rubyclassifier.com/docs/guides/lsi/incremental)
 
-
-
-Save and load trained classifiers with pluggable storage backends. Works with both Bayes and LSI classifiers.
-
-### File Storage
+### Persistence
 
 ```ruby
-
-
-classifier = Classifier::Bayes.new('Spam', 'Ham')
-classifier.train_spam "Buy now! Limited offer!"
-classifier.train_ham "Meeting tomorrow at 3pm"
-
-# Configure storage and save
-classifier.storage = Classifier::Storage::File.new(path: "spam_filter.json")
+classifier.storage = Classifier::Storage::File.new(path: "model.json")
 classifier.save
 
-# Load later
 loaded = Classifier::Bayes.load(storage: classifier.storage)
-loaded.classify "Claim your prize now!"
-# => "Spam"
 ```
 
-
+[Learn more →](https://rubyclassifier.com/docs/guides/persistence)
 
-
+### Streaming Training
 
 ```ruby
-
-def initialize(redis:, key:)
-super()
-@redis, @key = redis, key
-end
-
-def write(data) = @redis.set(@key, data)
-def read = @redis.get(@key)
-def delete = @redis.del(@key)
-def exists? = @redis.exists?(@key)
-end
-
-# Use it
-classifier.storage = RedisStorage.new(redis: Redis.new, key: "classifier:spam")
-classifier.save
+classifier.train_from_stream(:spam, File.open("spam_corpus.txt"))
 ```
 
-
-
-- [Persistence Guide](https://rubyclassifier.com/docs/guides/persistence/basics) - Full documentation with examples
+[Learn more →](https://rubyclassifier.com/docs/guides/streaming)
 
 ## Performance
 
-
-
-The native C extension provides dramatic speedups for LSI operations, especially `build_index` (SVD computation):
-
-| Documents | build_index | Overall |
-|-----------|-------------|---------|
-| 5 | 7x faster | 2.6x |
-| 10 | 25x faster | 4.6x |
-| 15 | 112x faster | 14.5x |
-| 20 | 385x faster | 48.7x |
-
-<details>
-<summary>Detailed benchmark (20 documents)</summary>
-
-```
-Operation Pure Ruby Native C Speedup
-----------------------------------------------------------
-build_index 0.5540 0.0014 384.5x
-classify 0.0190 0.0060 3.2x
-search 0.0145 0.0037 3.9x
-find_related 0.0098 0.0011 8.6x
-----------------------------------------------------------
-TOTAL 0.5973 0.0123 48.7x
-```
-</details>
+Native C extension provides 5-50x speedup for LSI operations:
 
-
+| Documents | Speedup |
+|-----------|---------|
+| 10 | 25x |
+| 20 | 50x |
 
 ```bash
-rake benchmark
-rake benchmark:compare # Compare native C vs pure Ruby
+rake benchmark:compare # Run your own comparison
 ```
 
 ## Development
 
-### Setup
-
 ```bash
-git clone https://github.com/cardmagic/classifier.git
-cd classifier
 bundle install
-rake compile #
-
-
-### Running Tests
-
-```bash
-rake test # Run all tests (compiles first)
-ruby -Ilib test/bayes/bayesian_test.rb # Run specific test file
-
-# Test with pure Ruby (no native extension)
-NATIVE_VECTOR=true rake test
+rake compile # Build native extension
+rake test # Run tests
 ```
 
-### Console
-
-```bash
-rake console
-```
-
-## Contributing
-
-1. Fork the repository
-2. Create your feature branch (`git checkout -b feature/amazing-feature`)
-3. Commit your changes (`git commit -am 'Add amazing feature'`)
-4. Push to the branch (`git push origin feature/amazing-feature`)
-5. Open a Pull Request
-
 ## Authors
 
-- **Lucas Carlson** -
-- **David Fayram II** -
+- **Lucas Carlson** - lucas@rufy.com
+- **David Fayram II** - dfayram@gmail.com
 - **Cameron McBride** - cameron.mcbride@gmail.com
 - **Ivan Acosta-Rubio** - ivan@softwarecriollo.com
 
 ## License
 
-
+[LGPL 2.1](LICENSE)
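The TF-IDF quick-start example added to the README shows `transform` returning two weights of 0.707 each, which is consistent with an L2-normalized vector (1/√2 ≈ 0.707). The sketch below illustrates that general technique in plain Ruby. It is not the gem's implementation: the method names `tokenize`, `tfidf_fit` and `tfidf_transform` are hypothetical, the smoothed IDF formula is an assumption, and it applies no stemming or stopword filtering, so its weights will differ from the README's.

```ruby
# Illustrative TF-IDF sketch; NOT the classifier gem's implementation.
# All method names here are hypothetical.

# Lowercase the text and keep runs of ASCII letters.
# No stemming or stopword removal is applied.
def tokenize(text)
  text.downcase.scan(/[a-z]+/)
end

# Learn document frequencies from an array of strings.
def tfidf_fit(documents)
  doc_freq = Hash.new(0)
  documents.each do |doc|
    tokenize(doc).uniq.each { |term| doc_freq[term] += 1 }
  end
  { doc_count: documents.size, doc_freq: doc_freq }
end

# Weight each term by count * smoothed IDF, then L2-normalize the vector.
# Assumed IDF: log((1 + N) / (1 + df)) + 1, so unseen terms stay finite.
def tfidf_transform(model, text)
  weights = tokenize(text).tally.to_h do |term, count|
    idf = Math.log((1.0 + model[:doc_count]) / (1.0 + model[:doc_freq][term])) + 1.0
    [term, count * idf]
  end
  norm = Math.sqrt(weights.values.sum { |w| w * w })
  return {} if norm.zero?
  weights.transform_values { |w| w / norm }
end

model = tfidf_fit(["Dogs are pets", "Cats are independent"])
vector = tfidf_transform(model, "Dogs are loyal")
vector.each { |term, weight| puts format("%-6s %.3f", term, weight) }
```

Terms that appear in fewer training documents receive a larger weight, so in this sketch "loyal" (unseen during fitting) outweighs "are" (present in every document).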