classifier 2.0.0 → 2.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: fea14969bc8a61283823b0b0f5bae013af968caf4676c383155e3b8682b948de
- data.tar.gz: 4d626c85d084ff75eba2ff305673734a6f25b668e773b1b5a3a0630a6b68df96
+ metadata.gz: 2ce325479bd32938cd7a7c694152983579266667e5def8d36725c38594eb822b
+ data.tar.gz: a35439c4ddfba77091297fee46930c279c54bfeb314b8b87fba8dd0ccbe6b292
  SHA512:
- metadata.gz: ef53c06db3326b1b6ebc14255b4ba198286c06e291cba3afc67bba360ca766a173f89269405d216751806ca72f885a87ac80ec24a031053f8e6f2987e8e2267e
- data.tar.gz: 8f120a9b78e802e6fd3e7172fd311b476745e27d5b3d301dc8d140296a451875e5aa33a901514bfdd1bc96c656ad1a43cbb3935a05223cd38548a71ba6a3a1c1
+ metadata.gz: e82b06f6449d3c5b089f840fe794b0e7f59bc1a00a379d4739010ad872ce7a90a4cbc1eee930b822b2761add461f2378dde8e015115c45b59c67e8aee0ad01b5
+ data.tar.gz: fa9271f248da8de9012ef923911b5f3d18cb9ced08ff9f85372e13622b8ab43a3f75365d21e9de3b3fa2849f3a54c3261da67c00c38a263b3ceee7ac0afd2598
data/CLAUDE.md CHANGED
@@ -11,21 +11,28 @@ Ruby gem providing text classification via two algorithms:
  ## Common Commands

  ```bash
- # Run all tests
- rake test
+ # Compile native C extension
+ bundle exec rake compile
+
+ # Run all tests (compiles first)
+ bundle exec rake test

  # Run a single test file
  ruby -Ilib test/bayes/bayesian_test.rb
  ruby -Ilib test/lsi/lsi_test.rb

- # Run tests with native Ruby vector (without GSL)
- NATIVE_VECTOR=true rake test
+ # Run tests with pure Ruby (no native extension)
+ NATIVE_VECTOR=true bundle exec rake test
+
+ # Run benchmarks
+ bundle exec rake benchmark
+ bundle exec rake benchmark:compare

  # Interactive console
- rake console
+ bundle exec rake console

  # Generate documentation
- rake doc
+ bundle exec rake doc
  ```

  ## Architecture
  ## Architecture
@@ -39,7 +46,7 @@ rake doc

  **LSI Classifier** (`lib/classifier/lsi.rb`)
  - Uses Singular Value Decomposition (SVD) for semantic analysis
- - Optional GSL gem for 10x faster matrix operations; falls back to pure Ruby SVD
+ - Native C extension for 5-50x faster matrix operations; falls back to pure Ruby
  - Key operations: `add_item`, `classify`, `find_related`, `search`
  - `auto_rebuild` option controls automatic index rebuilding after changes

 
@@ -49,15 +56,18 @@ rake doc
  - Uses `fast-stemmer` gem for Porter stemming

  **Vector Extensions** (`lib/classifier/extensions/vector.rb`)
- - Pure Ruby SVD implementation (`Matrix#SV_decomp`)
+ - Pure Ruby SVD implementation (`Matrix#SV_decomp`) - used as fallback
  - Vector normalization and magnitude calculations

- ### GSL Integration
+ ### Native C Extension (`ext/classifier/`)
+
+ LSI uses a native C extension for fast linear algebra operations:
+ - `Classifier::Linalg::Vector` - Vector operations (alloc, normalize, dot product)
+ - `Classifier::Linalg::Matrix` - Matrix operations (alloc, transpose, multiply)
+ - Jacobi SVD implementation for singular value decomposition

- LSI checks for the `gsl` gem at load time. When available:
- - Uses `GSL::Matrix` and `GSL::Vector` for faster operations
- - Serialization handled via `vector_serialize.rb`
- - Test without GSL: `NATIVE_VECTOR=true rake test`
+ Check current backend: `Classifier::LSI.backend` returns `:native` or `:ruby`
+ Force pure Ruby: `NATIVE_VECTOR=true bundle exec rake test`

  ### Content Nodes (`lib/classifier/lsi/content_node.rb`)

data/README.md CHANGED
@@ -4,256 +4,138 @@
  [![CI](https://github.com/cardmagic/classifier/actions/workflows/ruby.yml/badge.svg)](https://github.com/cardmagic/classifier/actions/workflows/ruby.yml)
  [![License: LGPL](https://img.shields.io/badge/License-LGPL_2.1-blue.svg)](https://opensource.org/licenses/LGPL-2.1)

- A Ruby library for text classification using Bayesian and Latent Semantic Indexing (LSI) algorithms.
+ Text classification in Ruby. Five algorithms, native performance, streaming support.

- ## Table of Contents
+ **[Documentation](https://rubyclassifier.com/docs)** · **[Tutorials](https://rubyclassifier.com/docs/tutorials)** · **[API Reference](https://rubyclassifier.com/docs/api)**

- - [Installation](#installation)
- - [Bayesian Classifier](#bayesian-classifier)
- - [LSI (Latent Semantic Indexing)](#lsi-latent-semantic-indexing)
- - [Performance](#performance)
- - [Development](#development)
- - [Contributing](#contributing)
- - [License](#license)
+ ## Why This Library?

- ## Installation
+ | | This Gem | Other Forks |
+ |:--|:--|:--|
+ | **Algorithms** | ✅ 5 classifiers | ❌ 2 only |
+ | **Incremental LSI** | ✅ Brand's algorithm (no rebuild) | ❌ Full SVD rebuild on every add |
+ | **LSI Performance** | ✅ Native C extension (5-50x faster) | ❌ Pure Ruby or requires GSL |
+ | **Streaming** | ✅ Train on multi-GB datasets | ❌ Must load all data in memory |
+ | **Persistence** | ✅ Pluggable (file, Redis, S3) | ❌ Marshal only |

- Add to your Gemfile:
+ ## Installation

  ```ruby
  gem 'classifier'
  ```

- Then run:
+ ## Quick Start

- ```bash
- bundle install
- ```
-
- Or install directly:
+ ### Bayesian

- ```bash
- gem install classifier
+ ```ruby
+ classifier = Classifier::Bayes.new(:spam, :ham)
+ classifier.train(spam: "Buy cheap viagra now!", ham: "Meeting at 3pm tomorrow")
+ classifier.classify "You've won a prize!" # => "Spam"
  ```
+ [Bayesian Guide →](https://rubyclassifier.com/docs/guides/bayes/basics)

- ### Optional: GSL for Faster LSI
-
- For significantly faster LSI operations, install the [GNU Scientific Library](https://www.gnu.org/software/gsl/).
-
- <details>
- <summary><strong>Ruby 3+</strong></summary>
-
- The released `gsl` gem doesn't support Ruby 3+. Install from source:
+ ### Logistic Regression

- ```bash
- # Install GSL library
- brew install gsl # macOS
- apt-get install libgsl-dev # Ubuntu/Debian
-
- # Build and install the gem
- git clone https://github.com/cardmagic/rb-gsl.git
- cd rb-gsl
- git checkout fix/ruby-3.4-compatibility
- gem build gsl.gemspec
- gem install gsl-*.gem
+ ```ruby
+ classifier = Classifier::LogisticRegression.new(:positive, :negative)
+ classifier.train(positive: "Great product!", negative: "Terrible experience")
+ classifier.classify "Loved it!" # => "Positive"
  ```
- </details>
-
- <details>
- <summary><strong>Ruby 2.x</strong></summary>
+ [Logistic Regression Guide →](https://rubyclassifier.com/docs/guides/logisticregression/basics)

- ```bash
- # macOS
- brew install gsl
- gem install gsl
+ ### LSI (Latent Semantic Indexing)

- # Ubuntu/Debian
- apt-get install libgsl-dev
- gem install gsl
+ ```ruby
+ lsi = Classifier::LSI.new
+ lsi.add(pets: "Dogs are loyal", tech: "Ruby is elegant")
+ lsi.classify "My puppy is playful" # => "pets"
  ```
- </details>
+ [LSI Guide →](https://rubyclassifier.com/docs/guides/lsi/basics)

- When GSL is installed, Classifier automatically uses it. To suppress the GSL notice:
+ ### k-Nearest Neighbors

- ```bash
- SUPPRESS_GSL_WARNING=true ruby your_script.rb
+ ```ruby
+ knn = Classifier::KNN.new(k: 3)
+ knn.train(spam: "Free money!", ham: "Quarterly report attached") # or knn.add()
+ knn.classify "Claim your prize" # => "spam"
  ```
+ [k-Nearest Neighbors Guide →](https://rubyclassifier.com/docs/guides/knn/basics)

- ### Compatibility
-
- | Ruby Version | Status |
- |--------------|--------|
- | 4.0 | Supported |
- | 3.4 | Supported |
- | 3.3 | Supported |
- | 3.2 | Supported |
- | 3.1 | EOL (unsupported) |
-
- ## Bayesian Classifier
-
- Fast, accurate classification with modest memory requirements. Ideal for spam filtering, sentiment analysis, and content categorization.
-
- ### Quick Start
+ ### TF-IDF

  ```ruby
- require 'classifier'
-
- classifier = Classifier::Bayes.new('Spam', 'Ham')
-
- # Train the classifier
- classifier.train_spam "Buy cheap viagra now! Limited offer!"
- classifier.train_spam "You've won a million dollars! Claim now!"
- classifier.train_ham "Meeting scheduled for tomorrow at 10am"
- classifier.train_ham "Please review the attached document"
-
- # Classify new text
- classifier.classify "Congratulations! You've won a prize!"
- # => "Spam"
+ tfidf = Classifier::TFIDF.new
+ tfidf.fit(["Dogs are pets", "Cats are independent"])
+ tfidf.transform("Dogs are loyal") # => {:dog => 0.707, :loyal => 0.707}
  ```
+ [TF-IDF Guide →](https://rubyclassifier.com/docs/guides/tfidf/basics)

- ### Persistence with Madeleine
+ ## Key Features

- ```ruby
- require 'classifier'
- require 'madeleine'
+ ### Incremental LSI

- m = SnapshotMadeleine.new("classifier_data") {
- Classifier::Bayes.new('Interesting', 'Uninteresting')
- }
+ Add documents without rebuilding the entire index—400x faster for streaming data:

- m.system.train_interesting "fascinating article about science"
- m.system.train_uninteresting "boring repetitive content"
- m.take_snapshot
+ ```ruby
+ lsi = Classifier::LSI.new(incremental: true)
+ lsi.add(tech: ["Ruby is elegant", "Python is popular"])
+ lsi.build_index

- # Later, restore and use:
- m.system.classify "new scientific discovery"
- # => "Interesting"
+ # These use Brand's algorithm—no full rebuild
+ lsi.add(tech: "Go is fast")
+ lsi.add(tech: "Rust is safe")
  ```

- ### Learn More
+ [Learn more →](https://rubyclassifier.com/docs/guides/lsi/incremental)

- - [Bayesian Filtering Explained](http://www.process.com/precisemail/bayesian_filtering.htm)
- - [Wikipedia: Bayesian Filtering](http://en.wikipedia.org/wiki/Bayesian_filtering)
- - [Paul Graham: A Plan for Spam](http://www.paulgraham.com/spam.html)
-
- ## LSI (Latent Semantic Indexing)
-
- Semantic analysis using Singular Value Decomposition (SVD). More flexible than Bayesian classifiers, providing search, clustering, and classification based on meaning rather than just keywords.
-
- ### Quick Start
+ ### Persistence

  ```ruby
- require 'classifier'
-
- lsi = Classifier::LSI.new
-
- # Add documents with categories
- lsi.add_item "Dogs are loyal pets that love to play fetch", :pets
- lsi.add_item "Cats are independent and love to nap", :pets
- lsi.add_item "Ruby is a dynamic programming language", :programming
- lsi.add_item "Python is great for data science", :programming
-
- # Classify new text
- lsi.classify "My puppy loves to run around"
- # => :pets
+ classifier.storage = Classifier::Storage::File.new(path: "model.json")
+ classifier.save

- # Get classification with confidence score
- lsi.classify_with_confidence "Learning to code in Ruby"
- # => [:programming, 0.89]
+ loaded = Classifier::Bayes.load(storage: classifier.storage)
  ```

- ### Search and Discovery
+ [Learn more →](https://rubyclassifier.com/docs/guides/persistence)

- ```ruby
- # Find similar documents
- lsi.find_related "Dogs are great companions", 2
- # => ["Dogs are loyal pets that love to play fetch", "Cats are independent..."]
+ ### Streaming Training

- # Search by keyword
- lsi.search "programming", 3
- # => ["Ruby is a dynamic programming language", "Python is great for..."]
+ ```ruby
+ classifier.train_from_stream(:spam, File.open("spam_corpus.txt"))
  ```

- ### Learn More
-
- - [Wikipedia: Latent Semantic Analysis](http://en.wikipedia.org/wiki/Latent_semantic_analysis)
- - [C2 Wiki: Latent Semantic Indexing](http://www.c2.com/cgi/wiki?LatentSemanticIndexing)
+ [Learn more →](https://rubyclassifier.com/docs/guides/streaming)

  ## Performance

- ### GSL vs Native Ruby
-
- GSL provides dramatic speedups for LSI operations, especially `build_index` (SVD computation):
-
- | Documents | build_index | Overall |
- |-----------|-------------|---------|
- | 5 | 4x faster | 2.5x |
- | 10 | 24x faster | 5.5x |
- | 15 | 116x faster | 17x |
-
- <details>
- <summary>Detailed benchmark (15 documents)</summary>
-
- ```
- Operation Native GSL Speedup
- ----------------------------------------------------------
- build_index 0.1412 0.0012 116.2x
- classify 0.0142 0.0049 2.9x
- search 0.0102 0.0026 3.9x
- find_related 0.0069 0.0016 4.2x
- ----------------------------------------------------------
- TOTAL 0.1725 0.0104 16.6x
- ```
- </details>
+ Native C extension provides 5-50x speedup for LSI operations:

- ### Running Benchmarks
+ | Documents | Speedup |
+ |-----------|---------|
+ | 10 | 25x |
+ | 20 | 50x |

  ```bash
- rake benchmark # Run with current configuration
- rake benchmark:compare # Compare GSL vs native Ruby
+ rake benchmark:compare # Run your own comparison
  ```

  ## Development

- ### Setup
-
  ```bash
- git clone https://github.com/cardmagic/classifier.git
- cd classifier
  bundle install
+ rake compile # Build native extension
+ rake test # Run tests
  ```

- ### Running Tests
-
- ```bash
- rake test # Run all tests
- ruby -Ilib test/bayes/bayesian_test.rb # Run specific test file
-
- # Test without GSL (pure Ruby)
- NATIVE_VECTOR=true rake test
- ```
-
- ### Console
-
- ```bash
- rake console
- ```
-
- ## Contributing
-
- 1. Fork the repository
- 2. Create your feature branch (`git checkout -b feature/amazing-feature`)
- 3. Commit your changes (`git commit -am 'Add amazing feature'`)
- 4. Push to the branch (`git push origin feature/amazing-feature`)
- 5. Open a Pull Request
-
  ## Authors

- - **Lucas Carlson** - *Original author* - lucas@rufy.com
- - **David Fayram II** - *LSI implementation* - dfayram@gmail.com
+ - **Lucas Carlson** - lucas@rufy.com
+ - **David Fayram II** - dfayram@gmail.com
  - **Cameron McBride** - cameron.mcbride@gmail.com
  - **Ivan Acosta-Rubio** - ivan@softwarecriollo.com

  ## License

- This library is released under the [GNU Lesser General Public License (LGPL) 2.1](LICENSE).
+ [LGPL 2.1](LICENSE)
@@ -0,0 +1,26 @@
+ /*
+  * classifier_ext.c
+  * Main entry point for the Classifier native linear algebra extension
+  *
+  * This extension provides zero-dependency Vector, Matrix, and SVD
+  * implementations for the Classifier gem's LSI functionality.
+  */
+
+ #include "linalg.h"
+
+ VALUE mClassifierLinalg;
+ VALUE cClassifierVector;
+ VALUE cClassifierMatrix;
+
+ void Init_classifier_ext(void)
+ {
+     /* Define Classifier::Linalg module */
+     VALUE mClassifier = rb_define_module("Classifier");
+     mClassifierLinalg = rb_define_module_under(mClassifier, "Linalg");
+
+     /* Initialize Vector and Matrix classes */
+     Init_vector();
+     Init_matrix();
+     Init_svd();
+     Init_incremental_svd();
+ }
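
The entry point above calls `Init_svd()`, and the CLAUDE.md changes describe the extension's SVD as a Jacobi implementation, but the C source for that routine is not part of this excerpt. As a rough illustration of the general technique only, here is a minimal one-sided (Hestenes) Jacobi sketch in plain Ruby that returns singular values. The method name `jacobi_singular_values` and its `sweeps` and `eps` parameters are hypothetical and do not come from the gem.

```ruby
# Hypothetical sketch of one-sided Jacobi SVD (singular values only).
# Not the gem's implementation; names and defaults are assumptions.
def jacobi_singular_values(rows, sweeps: 30, eps: 1e-12)
  a = rows.map { |r| r.map(&:to_f) } # work on a float copy
  n = a.first.size

  sweeps.times do
    rotated = false
    (0...n - 1).each do |i|
      (i + 1...n).each do |j|
        # Squared norms and inner product of columns i and j.
        alpha = a.sum { |r| r[i] * r[i] }
        beta  = a.sum { |r| r[j] * r[j] }
        gamma = a.sum { |r| r[i] * r[j] }
        # Skip pairs that are already orthogonal to working precision.
        next if gamma.abs <= eps * Math.sqrt(alpha * beta)

        rotated = true
        # Rotation angle that makes the two columns orthogonal.
        zeta = (beta - alpha) / (2.0 * gamma)
        t = (zeta >= 0 ? 1.0 : -1.0) / (zeta.abs + Math.sqrt(1.0 + zeta * zeta))
        c = 1.0 / Math.sqrt(1.0 + t * t)
        s = c * t
        a.each do |r|
          ri = r[i]
          r[i] = c * ri - s * r[j]
          r[j] = s * ri + c * r[j]
        end
      end
    end
    break unless rotated # converged: every column pair is orthogonal
  end

  # Singular values are the norms of the orthogonalized columns.
  (0...n).map { |j| Math.sqrt(a.sum { |r| r[j] * r[j] }) }.sort.reverse
end

p jacobi_singular_values([[3, 0], [0, 4]]) # => [4.0, 3.0]
```

Each rotation preserves the matrix's Frobenius norm, so the squared singular values always sum to the sum of squared entries. A production routine would also accumulate the rotations to recover the singular vectors, which this sketch omits.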
@@ -0,0 +1,15 @@
+ require 'mkmf'
+
+ # rubocop:disable Style/GlobalVars
+ if ENV['COVERAGE']
+   # Coverage flags: disable optimization for accurate line coverage
+   $CFLAGS << ' -O0 -g --coverage -Wall'
+   $LDFLAGS << ' --coverage'
+ else
+   # Optimization flags for performance
+   $CFLAGS << ' -O3 -ffast-math -Wall'
+ end
+ # rubocop:enable Style/GlobalVars
+
+ # Create the Makefile
+ create_makefile('classifier/classifier_ext')
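
The CLAUDE.md changes earlier in this diff say that LSI falls back to pure Ruby when the native extension is unavailable, and that `NATIVE_VECTOR=true` forces the pure Ruby path. The gem's actual loader is not shown in this excerpt. The following is a minimal sketch of how such a fallback could be written, assuming the compiled extension is required under the path passed to `create_makefile`. The `BackendSketch` module and its `detect` method are hypothetical names used only for illustration.

```ruby
# Hypothetical sketch; not the gem's actual loading code.
module BackendSketch
  # Returns :native if the compiled extension can be required, :ruby otherwise.
  # env and ext_path are injectable so the logic can be tried without the gem.
  def self.detect(env: ENV, ext_path: 'classifier/classifier_ext')
    # NATIVE_VECTOR=true forces the pure Ruby path, per the CLAUDE.md notes.
    return :ruby if env['NATIVE_VECTOR'] == 'true'

    require ext_path
    :native
  rescue LoadError
    # Extension missing or not compiled: fall back to pure Ruby.
    :ruby
  end
end

# Both calls below take the pure Ruby branch.
puts BackendSketch.detect(env: { 'NATIVE_VECTOR' => 'true' })
puts BackendSketch.detect(env: {}, ext_path: 'no_such_extension_for_demo')
```

Rescuing `LoadError` rather than a broader exception class keeps genuine bugs inside a successfully loaded extension visible instead of silently switching backends.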