classifier 2.1.0 → 2.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: e82726ac5c6e619e701be4591c8ed29e8b9f9a88236ad9d8d602fe3f748dcf43
- data.tar.gz: 3771f2d2fce4992ed0cd5b1ad3849cc3e1c4fa310a144c1056e5f9d0c49579ef
+ metadata.gz: 2ce325479bd32938cd7a7c694152983579266667e5def8d36725c38594eb822b
+ data.tar.gz: a35439c4ddfba77091297fee46930c279c54bfeb314b8b87fba8dd0ccbe6b292
  SHA512:
- metadata.gz: 913111c43ffd83a1a461023c6d1f4fef0f1efb7f4e591089db7dbf3d37a1cc98cb3b5de16888019ab1caf8db83ad828dbaf82d2cab237b991216bdb27f529d15
- data.tar.gz: b7142000fa687a58481051921663665edad1ac3b183947e79cf63bb15a79a0ca7bc66cd76104fccea713162cf05e876bd609d5f4866a13ea2c184f8caa089547
+ metadata.gz: e82b06f6449d3c5b089f840fe794b0e7f59bc1a00a379d4739010ad872ce7a90a4cbc1eee930b822b2761add461f2378dde8e015115c45b59c67e8aee0ad01b5
+ data.tar.gz: fa9271f248da8de9012ef923911b5f3d18cb9ced08ff9f85372e13622b8ab43a3f75365d21e9de3b3fa2849f3a54c3261da67c00c38a263b3ceee7ac0afd2598
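The new digests in this hunk can be checked locally. The following is a minimal sketch, assuming `metadata.gz` and `data.tar.gz` have already been extracted from the `.gem` file (which is a plain tar archive) into the current directory. The `verify_digest` helper and the file locations are illustrative assumptions, not part of the gem or of RubyGems.

```ruby
require 'digest'

# Hypothetical helper: true when the file's digest equals the expected
# hex string taken from checksums.yaml.
def verify_digest(path, expected_hex, algorithm: Digest::SHA256)
  algorithm.file(path).hexdigest == expected_hex
end

# SHA256 values for 2.2.0, copied from the checksums.yaml hunk.
expected = {
  'metadata.gz' => '2ce325479bd32938cd7a7c694152983579266667e5def8d36725c38594eb822b',
  'data.tar.gz' => 'a35439c4ddfba77091297fee46930c279c54bfeb314b8b87fba8dd0ccbe6b292'
}

expected.each do |name, hex|
  next unless File.exist?(name) # skip files that have not been extracted
  puts "#{name}: #{verify_digest(name, hex) ? 'OK' : 'MISMATCH'}"
end
```

Passing `algorithm: Digest::SHA512` checks the longer digests the same way.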
data/README.md CHANGED
@@ -4,271 +4,138 @@
  [![CI](https://github.com/cardmagic/classifier/actions/workflows/ruby.yml/badge.svg)](https://github.com/cardmagic/classifier/actions/workflows/ruby.yml)
  [![License: LGPL](https://img.shields.io/badge/License-LGPL_2.1-blue.svg)](https://opensource.org/licenses/LGPL-2.1)
 
- A Ruby library for text classification using Bayesian and Latent Semantic Indexing (LSI) algorithms.
+ Text classification in Ruby. Five algorithms, native performance, streaming support.
 
- **[Documentation](https://rubyclassifier.com/docs)** · **[Tutorials](https://rubyclassifier.com/docs/tutorials)** · **[Guides](https://rubyclassifier.com/docs/guides)**
+ **[Documentation](https://rubyclassifier.com/docs)** · **[Tutorials](https://rubyclassifier.com/docs/tutorials)** · **[API Reference](https://rubyclassifier.com/docs/api)**
 
- ## Table of Contents
+ ## Why This Library?
 
- - [Installation](#installation)
- - [Bayesian Classifier](#bayesian-classifier)
- - [LSI (Latent Semantic Indexing)](#lsi-latent-semantic-indexing)
- - [Persistence](#persistence)
- - [Performance](#performance)
- - [Development](#development)
- - [Contributing](#contributing)
- - [License](#license)
+ | | This Gem | Other Forks |
+ |:--|:--|:--|
+ | **Algorithms** | 5 classifiers | ❌ 2 only |
+ | **Incremental LSI** | ✅ Brand's algorithm (no rebuild) | ❌ Full SVD rebuild on every add |
+ | **LSI Performance** | ✅ Native C extension (5-50x faster) | ❌ Pure Ruby or requires GSL |
+ | **Streaming** | ✅ Train on multi-GB datasets | ❌ Must load all data in memory |
+ | **Persistence** | ✅ Pluggable (file, Redis, S3) | ❌ Marshal only |
 
  ## Installation
 
- Add to your Gemfile:
-
  ```ruby
  gem 'classifier'
  ```
 
- Then run:
-
- ```bash
- bundle install
- ```
-
- Or install directly:
+ ## Quick Start
 
- ```bash
- gem install classifier
- ```
-
- ### Native C Extension
-
- The gem includes a native C extension for fast LSI operations. It compiles automatically during gem installation. No external dependencies are required.
-
- To verify the native extension is active:
+ ### Bayesian
 
  ```ruby
- require 'classifier'
- puts Classifier::LSI.backend # => :native
+ classifier = Classifier::Bayes.new(:spam, :ham)
+ classifier.train(spam: "Buy cheap viagra now!", ham: "Meeting at 3pm tomorrow")
+ classifier.classify "You've won a prize!" # => "Spam"
  ```
+ [Bayesian Guide →](https://rubyclassifier.com/docs/guides/bayes/basics)
 
- To force pure Ruby mode (for debugging):
+ ### Logistic Regression
 
- ```bash
- NATIVE_VECTOR=true ruby your_script.rb
+ ```ruby
+ classifier = Classifier::LogisticRegression.new(:positive, :negative)
+ classifier.train(positive: "Great product!", negative: "Terrible experience")
+ classifier.classify "Loved it!" # => "Positive"
  ```
+ [Logistic Regression Guide →](https://rubyclassifier.com/docs/guides/logisticregression/basics)
 
- To suppress the warning when native extension isn't available:
+ ### LSI (Latent Semantic Indexing)
 
- ```bash
- SUPPRESS_LSI_WARNING=true ruby your_script.rb
+ ```ruby
+ lsi = Classifier::LSI.new
+ lsi.add(pets: "Dogs are loyal", tech: "Ruby is elegant")
+ lsi.classify "My puppy is playful" # => "pets"
  ```
+ [LSI Guide →](https://rubyclassifier.com/docs/guides/lsi/basics)
 
- ### Compatibility
-
- | Ruby Version | Status |
- |--------------|--------|
- | 4.0 | Supported |
- | 3.4 | Supported |
- | 3.3 | Supported |
- | 3.2 | Supported |
- | 3.1 | EOL (unsupported) |
-
- ## Bayesian Classifier
-
- Fast, accurate classification with modest memory requirements. Ideal for spam filtering, sentiment analysis, and content categorization.
-
- ### Quick Start
+ ### k-Nearest Neighbors
 
  ```ruby
- require 'classifier'
-
- classifier = Classifier::Bayes.new('Spam', 'Ham')
-
- # Train the classifier
- classifier.train_spam "Buy cheap viagra now! Limited offer!"
- classifier.train_spam "You've won a million dollars! Claim now!"
- classifier.train_ham "Meeting scheduled for tomorrow at 10am"
- classifier.train_ham "Please review the attached document"
-
- # Classify new text
- classifier.classify "Congratulations! You've won a prize!"
- # => "Spam"
+ knn = Classifier::KNN.new(k: 3)
+ knn.train(spam: "Free money!", ham: "Quarterly report attached") # or knn.add()
+ knn.classify "Claim your prize" # => "spam"
  ```
+ [k-Nearest Neighbors Guide →](https://rubyclassifier.com/docs/guides/knn/basics)
 
- ### Learn More
-
- - [Bayes Basics Guide](https://rubyclassifier.com/docs/guides/bayes/basics) - In-depth documentation
- - [Build a Spam Filter Tutorial](https://rubyclassifier.com/docs/tutorials/spam-filter) - Step-by-step guide
- - [Paul Graham: A Plan for Spam](http://www.paulgraham.com/spam.html)
-
- ## LSI (Latent Semantic Indexing)
-
- Semantic analysis using Singular Value Decomposition (SVD). More flexible than Bayesian classifiers, providing search, clustering, and classification based on meaning rather than just keywords.
-
- ### Quick Start
+ ### TF-IDF
 
  ```ruby
- require 'classifier'
-
- lsi = Classifier::LSI.new
-
- # Add documents with categories
- lsi.add_item "Dogs are loyal pets that love to play fetch", :pets
- lsi.add_item "Cats are independent and love to nap", :pets
- lsi.add_item "Ruby is a dynamic programming language", :programming
- lsi.add_item "Python is great for data science", :programming
+ tfidf = Classifier::TFIDF.new
+ tfidf.fit(["Dogs are pets", "Cats are independent"])
+ tfidf.transform("Dogs are loyal") # => {:dog => 0.707, :loyal => 0.707}
+ ```
+ [TF-IDF Guide →](https://rubyclassifier.com/docs/guides/tfidf/basics)
 
- # Classify new text
- lsi.classify "My puppy loves to run around"
- # => :pets
+ ## Key Features
 
- # Get classification with confidence score
- lsi.classify_with_confidence "Learning to code in Ruby"
- # => [:programming, 0.89]
- ```
+ ### Incremental LSI
 
- ### Search and Discovery
+ Add documents without rebuilding the entire index—400x faster for streaming data:
 
  ```ruby
- # Find similar documents
- lsi.find_related "Dogs are great companions", 2
- # => ["Dogs are loyal pets that love to play fetch", "Cats are independent..."]
+ lsi = Classifier::LSI.new(incremental: true)
+ lsi.add(tech: ["Ruby is elegant", "Python is popular"])
+ lsi.build_index
 
- # Search by keyword
- lsi.search "programming", 3
- # => ["Ruby is a dynamic programming language", "Python is great for..."]
+ # These use Brand's algorithm—no full rebuild
+ lsi.add(tech: "Go is fast")
+ lsi.add(tech: "Rust is safe")
  ```
 
- ### Learn More
-
- - [LSI Basics Guide](https://rubyclassifier.com/docs/guides/lsi/basics) - In-depth documentation
- - [Wikipedia: Latent Semantic Analysis](http://en.wikipedia.org/wiki/Latent_semantic_analysis)
+ [Learn more →](https://rubyclassifier.com/docs/guides/lsi/incremental)
 
- ## Persistence
-
- Save and load trained classifiers with pluggable storage backends. Works with both Bayes and LSI classifiers.
-
- ### File Storage
+ ### Persistence
 
  ```ruby
- require 'classifier'
-
- classifier = Classifier::Bayes.new('Spam', 'Ham')
- classifier.train_spam "Buy now! Limited offer!"
- classifier.train_ham "Meeting tomorrow at 3pm"
-
- # Configure storage and save
- classifier.storage = Classifier::Storage::File.new(path: "spam_filter.json")
+ classifier.storage = Classifier::Storage::File.new(path: "model.json")
  classifier.save
 
- # Load later
  loaded = Classifier::Bayes.load(storage: classifier.storage)
- loaded.classify "Claim your prize now!"
- # => "Spam"
  ```
 
- ### Custom Storage Backends
+ [Learn more →](https://rubyclassifier.com/docs/guides/persistence)
 
- Create backends for Redis, PostgreSQL, S3, or any storage system:
+ ### Streaming Training
 
  ```ruby
- class RedisStorage < Classifier::Storage::Base
- def initialize(redis:, key:)
- super()
- @redis, @key = redis, key
- end
-
- def write(data) = @redis.set(@key, data)
- def read = @redis.get(@key)
- def delete = @redis.del(@key)
- def exists? = @redis.exists?(@key)
- end
-
- # Use it
- classifier.storage = RedisStorage.new(redis: Redis.new, key: "classifier:spam")
- classifier.save
+ classifier.train_from_stream(:spam, File.open("spam_corpus.txt"))
  ```
 
- ### Learn More
-
- - [Persistence Guide](https://rubyclassifier.com/docs/guides/persistence/basics) - Full documentation with examples
+ [Learn more →](https://rubyclassifier.com/docs/guides/streaming)
 
  ## Performance
 
- ### Native C Extension vs Pure Ruby
-
- The native C extension provides dramatic speedups for LSI operations, especially `build_index` (SVD computation):
-
- | Documents | build_index | Overall |
- |-----------|-------------|---------|
- | 5 | 7x faster | 2.6x |
- | 10 | 25x faster | 4.6x |
- | 15 | 112x faster | 14.5x |
- | 20 | 385x faster | 48.7x |
-
- <details>
- <summary>Detailed benchmark (20 documents)</summary>
-
- ```
- Operation Pure Ruby Native C Speedup
- ----------------------------------------------------------
- build_index 0.5540 0.0014 384.5x
- classify 0.0190 0.0060 3.2x
- search 0.0145 0.0037 3.9x
- find_related 0.0098 0.0011 8.6x
- ----------------------------------------------------------
- TOTAL 0.5973 0.0123 48.7x
- ```
- </details>
+ Native C extension provides 5-50x speedup for LSI operations:
 
- ### Running Benchmarks
+ | Documents | Speedup |
+ |-----------|---------|
+ | 10 | 25x |
+ | 20 | 50x |
 
  ```bash
- rake benchmark # Run with current configuration
- rake benchmark:compare # Compare native C vs pure Ruby
+ rake benchmark:compare # Run your own comparison
  ```
 
  ## Development
 
- ### Setup
-
  ```bash
- git clone https://github.com/cardmagic/classifier.git
- cd classifier
  bundle install
- rake compile # Compile native C extension
- ```
-
- ### Running Tests
-
- ```bash
- rake test # Run all tests (compiles first)
- ruby -Ilib test/bayes/bayesian_test.rb # Run specific test file
-
- # Test with pure Ruby (no native extension)
- NATIVE_VECTOR=true rake test
+ rake compile # Build native extension
+ rake test # Run tests
  ```
 
- ### Console
-
- ```bash
- rake console
- ```
-
- ## Contributing
-
- 1. Fork the repository
- 2. Create your feature branch (`git checkout -b feature/amazing-feature`)
- 3. Commit your changes (`git commit -am 'Add amazing feature'`)
- 4. Push to the branch (`git push origin feature/amazing-feature`)
- 5. Open a Pull Request
-
  ## Authors
 
- - **Lucas Carlson** - *Original author* - lucas@rufy.com
- - **David Fayram II** - *LSI implementation* - dfayram@gmail.com
+ - **Lucas Carlson** - lucas@rufy.com
+ - **David Fayram II** - dfayram@gmail.com
  - **Cameron McBride** - cameron.mcbride@gmail.com
  - **Ivan Acosta-Rubio** - ivan@softwarecriollo.com
 
  ## License
 
- This library is released under the [GNU Lesser General Public License (LGPL) 2.1](LICENSE).
+ [LGPL 2.1](LICENSE)
@@ -22,4 +22,5 @@ void Init_classifier_ext(void)
  Init_vector();
  Init_matrix();
  Init_svd();
+ Init_incremental_svd();
  }
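
The 2.1.0 README text removed in this diff described custom storage backends as classes exposing `write`, `read`, `delete`, and `exists?`. The following is a minimal sketch of that shape as an in-memory store. `MemoryStorage` is a hypothetical name, it deliberately does not subclass `Classifier::Storage::Base` so that it runs without the gem, and whether 2.2.0 keeps exactly this interface is an assumption.

```ruby
# Hypothetical in-memory backend mirroring the four methods of the removed
# RedisStorage example. Standalone: it does not inherit from
# Classifier::Storage::Base, so it illustrates the shape only.
class MemoryStorage
  def initialize
    @data = nil
  end

  # Keep the payload handed over by the caller.
  def write(data)
    @data = data
  end

  # Return the stored payload, or nil when nothing has been written.
  def read
    @data
  end

  # Discard the stored payload.
  def delete
    @data = nil
  end

  # True once something has been written and not yet deleted.
  def exists?
    !@data.nil?
  end
end

store = MemoryStorage.new
store.write('placeholder payload') # stand-in for a serialized model
puts store.exists?
puts store.read
store.delete
puts store.exists?
```

With the gem installed, a real backend would presumably inherit from `Classifier::Storage::Base` and be assigned through `classifier.storage =`, as the removed example did.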