classifier 2.1.0 → 2.3.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: e82726ac5c6e619e701be4591c8ed29e8b9f9a88236ad9d8d602fe3f748dcf43
- data.tar.gz: 3771f2d2fce4992ed0cd5b1ad3849cc3e1c4fa310a144c1056e5f9d0c49579ef
+ metadata.gz: c13b0ca80981d0186f038581e896182ca112928dcada4ac23700f2a9642ca785
+ data.tar.gz: 1949dea18b6d7e06eb931d411e821757f75084c6776ea589927b1f1b89d82280
  SHA512:
- metadata.gz: 913111c43ffd83a1a461023c6d1f4fef0f1efb7f4e591089db7dbf3d37a1cc98cb3b5de16888019ab1caf8db83ad828dbaf82d2cab237b991216bdb27f529d15
- data.tar.gz: b7142000fa687a58481051921663665edad1ac3b183947e79cf63bb15a79a0ca7bc66cd76104fccea713162cf05e876bd609d5f4866a13ea2c184f8caa089547
+ metadata.gz: 562474db108e959f218407bd66a58bffe1b609ea38b8db16d15f579196d41d4976d74a66b6928e195ff0f6cc5dc4b081ea75d634cefb928e25932f8867358384
+ data.tar.gz: 39f72068f85e64a6397672376f6fcc4821e1eccd9556813e4894ba62800975330051c87e88a8747f03278e6ad0e83a665ed48fcc6cc1f623cd91b657089f8dd6
data/README.md CHANGED
@@ -4,271 +4,142 @@
  [![CI](https://github.com/cardmagic/classifier/actions/workflows/ruby.yml/badge.svg)](https://github.com/cardmagic/classifier/actions/workflows/ruby.yml)
  [![License: LGPL](https://img.shields.io/badge/License-LGPL_2.1-blue.svg)](https://opensource.org/licenses/LGPL-2.1)
 
- A Ruby library for text classification using Bayesian and Latent Semantic Indexing (LSI) algorithms.
+ Text classification in Ruby. Five algorithms, native performance, streaming support.
 
- **[Documentation](https://rubyclassifier.com/docs)** · **[Tutorials](https://rubyclassifier.com/docs/tutorials)** · **[Guides](https://rubyclassifier.com/docs/guides)**
+ **[Documentation](https://rubyclassifier.com/docs)** · **[Tutorials](https://rubyclassifier.com/docs/tutorials)** · **[API Reference](https://rubydoc.info/gems/classifier)**
 
- ## Table of Contents
+ ## Why This Library?
 
- - [Installation](#installation)
- - [Bayesian Classifier](#bayesian-classifier)
- - [LSI (Latent Semantic Indexing)](#lsi-latent-semantic-indexing)
- - [Persistence](#persistence)
- - [Performance](#performance)
- - [Development](#development)
- - [Contributing](#contributing)
- - [License](#license)
+ | | This Gem | Other Forks |
+ |:--|:--|:--|
+ | **Algorithms** | 5 classifiers | ❌ 2 only |
+ | **Incremental LSI** | ✅ Brand's algorithm (no rebuild) | ❌ Full SVD rebuild on every add |
+ | **LSI Performance** | ✅ Native C extension (5-50x faster) | ❌ Pure Ruby or requires GSL |
+ | **Streaming** | ✅ Train on multi-GB datasets | ❌ Must load all data in memory |
+ | **Persistence** | ✅ Pluggable (file, Redis, S3, SQL, Custom) | ❌ Marshal only |
 
  ## Installation
 
- Add to your Gemfile:
-
  ```ruby
  gem 'classifier'
  ```
 
- Then run:
-
- ```bash
- bundle install
- ```
-
- Or install directly:
+ ## Quick Start
 
- ```bash
- gem install classifier
- ```
-
- ### Native C Extension
-
- The gem includes a native C extension for fast LSI operations. It compiles automatically during gem installation. No external dependencies are required.
-
- To verify the native extension is active:
+ ### Bayesian
 
  ```ruby
- require 'classifier'
- puts Classifier::LSI.backend # => :native
+ classifier = Classifier::Bayes.new(:spam, :ham)
+ classifier.train(spam: "Buy viagra cheap pills now")
+ classifier.train(spam: "You won million dollars prize")
+ classifier.train(ham: ["Meeting tomorrow at 3pm", "Quarterly report attached"])
+ classifier.classify("Cheap pills!") # => "Spam"
  ```
+ [Bayesian Guide →](https://rubyclassifier.com/docs/guides/bayes/basics)
 
- To force pure Ruby mode (for debugging):
+ ### Logistic Regression
 
- ```bash
- NATIVE_VECTOR=true ruby your_script.rb
+ ```ruby
+ classifier = Classifier::LogisticRegression.new(:positive, :negative)
+ classifier.train(positive: "love amazing great wonderful")
+ classifier.train(negative: "hate terrible awful bad")
+ classifier.classify("I love it!") # => "Positive"
  ```
+ [Logistic Regression Guide →](https://rubyclassifier.com/docs/guides/logisticregression/basics)
 
- To suppress the warning when native extension isn't available:
+ ### LSI (Latent Semantic Indexing)
 
- ```bash
- SUPPRESS_LSI_WARNING=true ruby your_script.rb
+ ```ruby
+ lsi = Classifier::LSI.new
+ lsi.add(dog: "dog puppy canine bark fetch", cat: "cat kitten feline meow purr")
+ lsi.classify("My puppy barks") # => "dog"
  ```
+ [LSI Guide →](https://rubyclassifier.com/docs/guides/lsi/basics)
 
- ### Compatibility
-
- | Ruby Version | Status |
- |--------------|--------|
- | 4.0 | Supported |
- | 3.4 | Supported |
- | 3.3 | Supported |
- | 3.2 | Supported |
- | 3.1 | EOL (unsupported) |
-
- ## Bayesian Classifier
-
- Fast, accurate classification with modest memory requirements. Ideal for spam filtering, sentiment analysis, and content categorization.
-
- ### Quick Start
+ ### k-Nearest Neighbors
 
  ```ruby
- require 'classifier'
-
- classifier = Classifier::Bayes.new('Spam', 'Ham')
-
- # Train the classifier
- classifier.train_spam "Buy cheap viagra now! Limited offer!"
- classifier.train_spam "You've won a million dollars! Claim now!"
- classifier.train_ham "Meeting scheduled for tomorrow at 10am"
- classifier.train_ham "Please review the attached document"
-
- # Classify new text
- classifier.classify "Congratulations! You've won a prize!"
- # => "Spam"
+ knn = Classifier::KNN.new(k: 3)
+ %w[laptop coding software developer programming].each { |w| knn.add(tech: w) }
+ %w[football basketball soccer goal team].each { |w| knn.add(sports: w) }
+ knn.classify("programming code") # => "tech"
  ```
+ [k-Nearest Neighbors Guide →](https://rubyclassifier.com/docs/guides/knn/basics)
 
- ### Learn More
-
- - [Bayes Basics Guide](https://rubyclassifier.com/docs/guides/bayes/basics) - In-depth documentation
- - [Build a Spam Filter Tutorial](https://rubyclassifier.com/docs/tutorials/spam-filter) - Step-by-step guide
- - [Paul Graham: A Plan for Spam](http://www.paulgraham.com/spam.html)
-
- ## LSI (Latent Semantic Indexing)
-
- Semantic analysis using Singular Value Decomposition (SVD). More flexible than Bayesian classifiers, providing search, clustering, and classification based on meaning rather than just keywords.
-
- ### Quick Start
+ ### TF-IDF
 
  ```ruby
- require 'classifier'
-
- lsi = Classifier::LSI.new
-
- # Add documents with categories
- lsi.add_item "Dogs are loyal pets that love to play fetch", :pets
- lsi.add_item "Cats are independent and love to nap", :pets
- lsi.add_item "Ruby is a dynamic programming language", :programming
- lsi.add_item "Python is great for data science", :programming
+ tfidf = Classifier::TFIDF.new
+ tfidf.fit(["Ruby is great", "Python is great", "Ruby on Rails"])
+ tfidf.transform("Ruby programming") # => {:rubi => 1.0}
+ ```
+ [TF-IDF Guide →](https://rubyclassifier.com/docs/guides/tfidf/basics)
 
- # Classify new text
- lsi.classify "My puppy loves to run around"
- # => :pets
+ ## Key Features
 
- # Get classification with confidence score
- lsi.classify_with_confidence "Learning to code in Ruby"
- # => [:programming, 0.89]
- ```
+ ### Incremental LSI
 
- ### Search and Discovery
+ Add documents without rebuilding the entire index—400x faster for streaming data:
 
  ```ruby
- # Find similar documents
- lsi.find_related "Dogs are great companions", 2
- # => ["Dogs are loyal pets that love to play fetch", "Cats are independent..."]
+ lsi = Classifier::LSI.new(incremental: true)
+ lsi.add(tech: ["Ruby is elegant", "Python is popular"])
+ lsi.build_index
 
- # Search by keyword
- lsi.search "programming", 3
- # => ["Ruby is a dynamic programming language", "Python is great for..."]
+ # These use Brand's algorithm—no full rebuild
+ lsi.add(tech: "Go is fast")
+ lsi.add(tech: "Rust is safe")
  ```
 
- ### Learn More
-
- - [LSI Basics Guide](https://rubyclassifier.com/docs/guides/lsi/basics) - In-depth documentation
- - [Wikipedia: Latent Semantic Analysis](http://en.wikipedia.org/wiki/Latent_semantic_analysis)
+ [Learn more →](https://rubyclassifier.com/docs/guides/lsi/basics)
 
- ## Persistence
-
- Save and load trained classifiers with pluggable storage backends. Works with both Bayes and LSI classifiers.
-
- ### File Storage
+ ### Persistence
 
  ```ruby
- require 'classifier'
-
- classifier = Classifier::Bayes.new('Spam', 'Ham')
- classifier.train_spam "Buy now! Limited offer!"
- classifier.train_ham "Meeting tomorrow at 3pm"
-
- # Configure storage and save
- classifier.storage = Classifier::Storage::File.new(path: "spam_filter.json")
+ classifier.storage = Classifier::Storage::File.new(path: "model.json")
  classifier.save
 
- # Load later
  loaded = Classifier::Bayes.load(storage: classifier.storage)
- loaded.classify "Claim your prize now!"
- # => "Spam"
  ```
 
- ### Custom Storage Backends
+ [Learn more →](https://rubyclassifier.com/docs/guides/persistence/basics)
 
- Create backends for Redis, PostgreSQL, S3, or any storage system:
+ ### Streaming Training
 
  ```ruby
- class RedisStorage < Classifier::Storage::Base
- def initialize(redis:, key:)
- super()
- @redis, @key = redis, key
- end
-
- def write(data) = @redis.set(@key, data)
- def read = @redis.get(@key)
- def delete = @redis.del(@key)
- def exists? = @redis.exists?(@key)
- end
-
- # Use it
- classifier.storage = RedisStorage.new(redis: Redis.new, key: "classifier:spam")
- classifier.save
+ classifier.train_from_stream(:spam, File.open("spam_corpus.txt"))
  ```
 
- ### Learn More
-
- - [Persistence Guide](https://rubyclassifier.com/docs/guides/persistence/basics) - Full documentation with examples
+ [Learn more →](https://rubyclassifier.com/docs/tutorials/streaming-training)
 
  ## Performance
 
- ### Native C Extension vs Pure Ruby
-
- The native C extension provides dramatic speedups for LSI operations, especially `build_index` (SVD computation):
-
- | Documents | build_index | Overall |
- |-----------|-------------|---------|
- | 5 | 7x faster | 2.6x |
- | 10 | 25x faster | 4.6x |
- | 15 | 112x faster | 14.5x |
- | 20 | 385x faster | 48.7x |
-
- <details>
- <summary>Detailed benchmark (20 documents)</summary>
-
- ```
- Operation Pure Ruby Native C Speedup
- ----------------------------------------------------------
- build_index 0.5540 0.0014 384.5x
- classify 0.0190 0.0060 3.2x
- search 0.0145 0.0037 3.9x
- find_related 0.0098 0.0011 8.6x
- ----------------------------------------------------------
- TOTAL 0.5973 0.0123 48.7x
- ```
- </details>
+ Native C extension provides 5-50x speedup for LSI operations:
 
- ### Running Benchmarks
+ | Documents | Speedup |
+ |-----------|---------|
+ | 10 | 25x |
+ | 20 | 50x |
 
  ```bash
- rake benchmark # Run with current configuration
- rake benchmark:compare # Compare native C vs pure Ruby
+ rake benchmark:compare # Run your own comparison
  ```
 
  ## Development
 
- ### Setup
-
  ```bash
- git clone https://github.com/cardmagic/classifier.git
- cd classifier
  bundle install
- rake compile # Compile native C extension
- ```
-
- ### Running Tests
-
- ```bash
- rake test # Run all tests (compiles first)
- ruby -Ilib test/bayes/bayesian_test.rb # Run specific test file
-
- # Test with pure Ruby (no native extension)
- NATIVE_VECTOR=true rake test
+ rake compile # Build native extension
+ rake test # Run tests
  ```
 
- ### Console
-
- ```bash
- rake console
- ```
-
- ## Contributing
-
- 1. Fork the repository
- 2. Create your feature branch (`git checkout -b feature/amazing-feature`)
- 3. Commit your changes (`git commit -am 'Add amazing feature'`)
- 4. Push to the branch (`git push origin feature/amazing-feature`)
- 5. Open a Pull Request
-
  ## Authors
 
- - **Lucas Carlson** - *Original author* - lucas@rufy.com
- - **David Fayram II** - *LSI implementation* - dfayram@gmail.com
+ - **Lucas Carlson** - lucas@rufy.com
+ - **David Fayram II** - dfayram@gmail.com
  - **Cameron McBride** - cameron.mcbride@gmail.com
  - **Ivan Acosta-Rubio** - ivan@softwarecriollo.com
 
  ## License
 
- This library is released under the [GNU Lesser General Public License (LGPL) 2.1](LICENSE).
+ [LGPL 2.1](LICENSE)
data/exe/classifier ADDED
@@ -0,0 +1,9 @@
+ #!/usr/bin/env ruby
+ require 'classifier/cli'
+
+ result = Classifier::CLI.new(ARGV).run
+
+ warn result[:error] unless result[:error].empty?
+ puts result[:output] unless result[:output].empty?
+
+ exit result[:exit_code]
@@ -22,4 +22,5 @@ void Init_classifier_ext(void)
  Init_vector();
  Init_matrix();
  Init_svd();
+ Init_incremental_svd();
  }