classifier 1.4.3 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: e2d12a6941acf386b0567d5f504d20bffad8486111675977446867c6caf5e865
- data.tar.gz: b44dca735ec32321183dc9291f339e68ef115af145d0d9ec78c767e9b3e132b2
+ metadata.gz: fea14969bc8a61283823b0b0f5bae013af968caf4676c383155e3b8682b948de
+ data.tar.gz: 4d626c85d084ff75eba2ff305673734a6f25b668e773b1b5a3a0630a6b68df96
  SHA512:
- metadata.gz: 4a37d6482fac59b1b6d3cf1c22f0144a08e580ab4ef681cb01189c266fa3de6d6a11668dfd2f1175db0f9d587c01302570be1814a457d2b05e2a9a72d9b9b975
- data.tar.gz: b9a62dc7243527ae95cd89f946d1caf30ca0c2f52527a34427b4dbe68698b920dce8644b0ffd4f34cba4d646f2a17d8711698c096062b0596ed9228885bd822b
+ metadata.gz: ef53c06db3326b1b6ebc14255b4ba198286c06e291cba3afc67bba360ca766a173f89269405d216751806ca72f885a87ac80ec24a031053f8e6f2987e8e2267e
+ data.tar.gz: 8f120a9b78e802e6fd3e7172fd311b476745e27d5b3d301dc8d140296a451875e5aa33a901514bfdd1bc96c656ad1a43cbb3935a05223cd38548a71ba6a3a1c1
data/CLAUDE.md ADDED
@@ -0,0 +1,67 @@
+ # CLAUDE.md
+
+ This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+ ## Project Overview
+
+ Ruby gem providing text classification via two algorithms:
+ - **Bayes** (`Classifier::Bayes`) - Naive Bayesian classification
+ - **LSI** (`Classifier::LSI`) - Latent Semantic Indexing for semantic classification, clustering, and search
+
+ ## Common Commands
+
+ ```bash
+ # Run all tests
+ rake test
+
+ # Run a single test file
+ ruby -Ilib test/bayes/bayesian_test.rb
+ ruby -Ilib test/lsi/lsi_test.rb
+
+ # Run tests with native Ruby vector (without GSL)
+ NATIVE_VECTOR=true rake test
+
+ # Interactive console
+ rake console
+
+ # Generate documentation
+ rake doc
+ ```
+
+ ## Architecture
+
+ ### Core Components
+
+ **Bayesian Classifier** (`lib/classifier/bayes.rb`)
+ - Train with `train(category, text)` or dynamic methods like `train_spam(text)`
+ - Classify with `classify(text)` returning the best category
+ - Uses log probabilities for numerical stability
+
+ **LSI Classifier** (`lib/classifier/lsi.rb`)
+ - Uses Singular Value Decomposition (SVD) for semantic analysis
+ - Optional GSL gem for 10x faster matrix operations; falls back to pure Ruby SVD
+ - Key operations: `add_item`, `classify`, `find_related`, `search`
+ - `auto_rebuild` option controls automatic index rebuilding after changes
+
+ **String Extensions** (`lib/classifier/extensions/word_hash.rb`)
+ - `word_hash` / `clean_word_hash` - tokenize text to stemmed word frequencies
+ - `CORPUS_SKIP_WORDS` - stopwords filtered during tokenization
+ - Uses `fast-stemmer` gem for Porter stemming
+
+ **Vector Extensions** (`lib/classifier/extensions/vector.rb`)
+ - Pure Ruby SVD implementation (`Matrix#SV_decomp`)
+ - Vector normalization and magnitude calculations
+
+ ### GSL Integration
+
+ LSI checks for the `gsl` gem at load time. When available:
+ - Uses `GSL::Matrix` and `GSL::Vector` for faster operations
+ - Serialization handled via `vector_serialize.rb`
+ - Test without GSL: `NATIVE_VECTOR=true rake test`
+
+ ### Content Nodes (`lib/classifier/lsi/content_node.rb`)
+
+ Internal data structure storing:
+ - `word_hash` - term frequencies
+ - `raw_vector` / `raw_norm` - initial vector representation
+ - `lsi_vector` / `lsi_norm` - reduced dimensionality representation after SVD
data/README.md ADDED
@@ -0,0 +1,259 @@
+ # Classifier
+
+ [![Gem Version](https://badge.fury.io/rb/classifier.svg)](https://badge.fury.io/rb/classifier)
+ [![CI](https://github.com/cardmagic/classifier/actions/workflows/ruby.yml/badge.svg)](https://github.com/cardmagic/classifier/actions/workflows/ruby.yml)
+ [![License: LGPL](https://img.shields.io/badge/License-LGPL_2.1-blue.svg)](https://opensource.org/licenses/LGPL-2.1)
+
+ A Ruby library for text classification using Bayesian and Latent Semantic Indexing (LSI) algorithms.
+
+ ## Table of Contents
+
+ - [Installation](#installation)
+ - [Bayesian Classifier](#bayesian-classifier)
+ - [LSI (Latent Semantic Indexing)](#lsi-latent-semantic-indexing)
+ - [Performance](#performance)
+ - [Development](#development)
+ - [Contributing](#contributing)
+ - [License](#license)
+
+ ## Installation
+
+ Add to your Gemfile:
+
+ ```ruby
+ gem 'classifier'
+ ```
+
+ Then run:
+
+ ```bash
+ bundle install
+ ```
+
+ Or install directly:
+
+ ```bash
+ gem install classifier
+ ```
+
+ ### Optional: GSL for Faster LSI
+
+ For significantly faster LSI operations, install the [GNU Scientific Library](https://www.gnu.org/software/gsl/).
+
+ <details>
+ <summary><strong>Ruby 3+</strong></summary>
+
+ The released `gsl` gem doesn't support Ruby 3+. Install from source:
+
+ ```bash
+ # Install GSL library
+ brew install gsl             # macOS
+ apt-get install libgsl-dev   # Ubuntu/Debian
+
+ # Build and install the gem
+ git clone https://github.com/cardmagic/rb-gsl.git
+ cd rb-gsl
+ git checkout fix/ruby-3.4-compatibility
+ gem build gsl.gemspec
+ gem install gsl-*.gem
+ ```
+ </details>
+
+ <details>
+ <summary><strong>Ruby 2.x</strong></summary>
+
+ ```bash
+ # macOS
+ brew install gsl
+ gem install gsl
+
+ # Ubuntu/Debian
+ apt-get install libgsl-dev
+ gem install gsl
+ ```
+ </details>
+
+ When GSL is installed, Classifier automatically uses it. To suppress the GSL notice:
+
+ ```bash
+ SUPPRESS_GSL_WARNING=true ruby your_script.rb
+ ```
+
+ ### Compatibility
+
+ | Ruby Version | Status |
+ |--------------|--------|
+ | 4.0 | Supported |
+ | 3.4 | Supported |
+ | 3.3 | Supported |
+ | 3.2 | Supported |
+ | 3.1 | EOL (unsupported) |
+
+ ## Bayesian Classifier
+
+ Fast, accurate classification with modest memory requirements. Ideal for spam filtering, sentiment analysis, and content categorization.
+
+ ### Quick Start
+
+ ```ruby
+ require 'classifier'
+
+ classifier = Classifier::Bayes.new('Spam', 'Ham')
+
+ # Train the classifier
+ classifier.train_spam "Buy cheap viagra now! Limited offer!"
+ classifier.train_spam "You've won a million dollars! Claim now!"
+ classifier.train_ham "Meeting scheduled for tomorrow at 10am"
+ classifier.train_ham "Please review the attached document"
+
+ # Classify new text
+ classifier.classify "Congratulations! You've won a prize!"
+ # => "Spam"
+ ```
+
+ ### Persistence with Madeleine
+
+ ```ruby
+ require 'classifier'
+ require 'madeleine'
+
+ m = SnapshotMadeleine.new("classifier_data") {
+   Classifier::Bayes.new('Interesting', 'Uninteresting')
+ }
+
+ m.system.train_interesting "fascinating article about science"
+ m.system.train_uninteresting "boring repetitive content"
+ m.take_snapshot
+
+ # Later, restore and use:
+ m.system.classify "new scientific discovery"
+ # => "Interesting"
+ ```
+
+ ### Learn More
+
+ - [Bayesian Filtering Explained](http://www.process.com/precisemail/bayesian_filtering.htm)
+ - [Wikipedia: Bayesian Filtering](http://en.wikipedia.org/wiki/Bayesian_filtering)
+ - [Paul Graham: A Plan for Spam](http://www.paulgraham.com/spam.html)
+
+ ## LSI (Latent Semantic Indexing)
+
+ Semantic analysis using Singular Value Decomposition (SVD). More flexible than Bayesian classifiers, providing search, clustering, and classification based on meaning rather than just keywords.
+
+ ### Quick Start
+
+ ```ruby
+ require 'classifier'
+
+ lsi = Classifier::LSI.new
+
+ # Add documents with categories
+ lsi.add_item "Dogs are loyal pets that love to play fetch", :pets
+ lsi.add_item "Cats are independent and love to nap", :pets
+ lsi.add_item "Ruby is a dynamic programming language", :programming
+ lsi.add_item "Python is great for data science", :programming
+
+ # Classify new text
+ lsi.classify "My puppy loves to run around"
+ # => :pets
+
+ # Get classification with confidence score
+ lsi.classify_with_confidence "Learning to code in Ruby"
+ # => [:programming, 0.89]
+ ```
+
+ ### Search and Discovery
+
+ ```ruby
+ # Find similar documents
+ lsi.find_related "Dogs are great companions", 2
+ # => ["Dogs are loyal pets that love to play fetch", "Cats are independent..."]
+
+ # Search by keyword
+ lsi.search "programming", 3
+ # => ["Ruby is a dynamic programming language", "Python is great for..."]
+ ```
+
+ ### Learn More
+
+ - [Wikipedia: Latent Semantic Analysis](http://en.wikipedia.org/wiki/Latent_semantic_analysis)
+ - [C2 Wiki: Latent Semantic Indexing](http://www.c2.com/cgi/wiki?LatentSemanticIndexing)
+
+ ## Performance
+
+ ### GSL vs Native Ruby
+
+ GSL provides dramatic speedups for LSI operations, especially `build_index` (SVD computation):
+
+ | Documents | build_index | Overall |
+ |-----------|-------------|---------|
+ | 5 | 4x faster | 2.5x |
+ | 10 | 24x faster | 5.5x |
+ | 15 | 116x faster | 17x |
+
+ <details>
+ <summary>Detailed benchmark (15 documents)</summary>
+
+ ```
+ Operation            Native       GSL          Speedup
+ ----------------------------------------------------------
+ build_index          0.1412       0.0012       116.2x
+ classify             0.0142       0.0049       2.9x
+ search               0.0102       0.0026       3.9x
+ find_related         0.0069       0.0016       4.2x
+ ----------------------------------------------------------
+ TOTAL                0.1725       0.0104       16.6x
+ ```
+ </details>
+
+ ### Running Benchmarks
+
+ ```bash
+ rake benchmark           # Run with current configuration
+ rake benchmark:compare   # Compare GSL vs native Ruby
+ ```
+
+ ## Development
+
+ ### Setup
+
+ ```bash
+ git clone https://github.com/cardmagic/classifier.git
+ cd classifier
+ bundle install
+ ```
+
+ ### Running Tests
+
+ ```bash
+ rake test                                # Run all tests
+ ruby -Ilib test/bayes/bayesian_test.rb   # Run specific test file
+
+ # Test without GSL (pure Ruby)
+ NATIVE_VECTOR=true rake test
+ ```
+
+ ### Console
+
+ ```bash
+ rake console
+ ```
+
+ ## Contributing
+
+ 1. Fork the repository
+ 2. Create your feature branch (`git checkout -b feature/amazing-feature`)
+ 3. Commit your changes (`git commit -am 'Add amazing feature'`)
+ 4. Push to the branch (`git push origin feature/amazing-feature`)
+ 5. Open a Pull Request
+
+ ## Authors
+
+ - **Lucas Carlson** - *Original author* - lucas@rufy.com
+ - **David Fayram II** - *LSI implementation* - dfayram@gmail.com
+ - **Cameron McBride** - cameron.mcbride@gmail.com
+ - **Ivan Acosta-Rubio** - ivan@softwarecriollo.com
+
+ ## License
+
+ This library is released under the [GNU Lesser General Public License (LGPL) 2.1](LICENSE).
data/lib/classifier/bayes.rb CHANGED
@@ -1,12 +1,20 @@
+ # rbs_inline: enabled
+
  # Author:: Lucas Carlson (mailto:lucas@rufy.com)
  # Copyright:: Copyright (c) 2005 Lucas Carlson
  # License:: LGPL

  module Classifier
  class Bayes
+ # @rbs @categories: Hash[Symbol, Hash[Symbol, Integer]]
+ # @rbs @total_words: Integer
+ # @rbs @category_counts: Hash[Symbol, Integer]
+ # @rbs @category_word_count: Hash[Symbol, Integer]
+
  # The class can be created with one or more categories, each of which will be
  # initialized and given a training method. E.g.,
  # b = Classifier::Bayes.new 'Interesting', 'Uninteresting', 'Spam'
+ # @rbs (*String | Symbol) -> void
  def initialize(*categories)
  @categories = {}
  categories.each { |category| @categories[category.prepare_category_name] = {} }
@@ -15,13 +23,14 @@ module Classifier
  @category_word_count = Hash.new(0)
  end

- #
  # Provides a general training method for all categories specified in Bayes#new
  # For example:
  # b = Classifier::Bayes.new 'This', 'That', 'the_other'
  # b.train :this, "This text"
  # b.train "that", "That text"
  # b.train "The other", "The other text"
+ #
+ # @rbs (String | Symbol, String) -> void
  def train(category, text)
  category = category.prepare_category_name
  @category_counts[category] += 1
@@ -33,7 +42,6 @@ module Classifier
  end
  end

- #
  # Provides a untraining method for all categories specified in Bayes#new
  # Be very careful with this method.
  #
@@ -41,6 +49,8 @@ module Classifier
  # b = Classifier::Bayes.new 'This', 'That', 'the_other'
  # b.train :this, "This text"
  # b.untrain :this, "This text"
+ #
+ # @rbs (String | Symbol, String) -> void
  def untrain(category, text)
  category = category.prepare_category_name
  @category_counts[category] -= 1
@@ -59,36 +69,39 @@
  end
  end

- #
  # Returns the scores in each category the provided +text+. E.g.,
  # b.classifications "I hate bad words and you"
  # => {"Uninteresting"=>-12.6997928013932, "Interesting"=>-18.4206807439524}
  # The largest of these scores (the one closest to 0) is the one picked out by #classify
+ #
+ # @rbs (String) -> Hash[String, Float]
  def classifications(text)
- score = {}
- word_hash = text.word_hash
- training_count = @category_counts.values.inject { |x, y| x + y }.to_f
- @categories.each do |category, category_words|
- score[category.to_s] = 0
- total = (@category_word_count[category] || 1).to_f
- word_hash.each_key do |word|
- s = category_words.key?(word) ? category_words[word] : 0.1
- score[category.to_s] += Math.log(s / total)
- end
- # now add prior probability for the category
- s = @category_counts.key?(category) ? @category_counts[category] : 0.1
- score[category.to_s] += Math.log(s / training_count)
+ words = text.word_hash.keys
+ training_count = @category_counts.values.sum.to_f
+ vocab_size = [@categories.values.flat_map(&:keys).uniq.size, 1].max
+
+ @categories.to_h do |category, category_words|
+ smoothed_total = ((@category_word_count[category] || 0) + vocab_size).to_f
+
+ # Laplace smoothing: P(word|category) = (count + α) / (total + α * V)
+ word_score = words.sum { |w| Math.log(((category_words[w] || 0) + 1) / smoothed_total) }
+ prior_score = Math.log((@category_counts[category] || 0.1) / training_count)
+
+ [category.to_s, word_score + prior_score]
  end
- score
  end

- #
  # Returns the classification of the provided +text+, which is one of the
  # categories given in the initializer. E.g.,
  # b.classify "I hate bad words and you"
  # => 'Uninteresting'
+ #
+ # @rbs (String) -> String
  def classify(text)
- (classifications(text).sort_by { |a| -a[1] })[0][0]
+ best = classifications(text).min_by { |a| -a[1] }
+ raise StandardError, 'No classifications available' unless best
+
+ best.first.to_s
  end

  #
@@ -100,32 +113,30 @@
  # b.untrain_that "That text"
  # b.train_the_other "The other text"
  def method_missing(name, *args)
+ return super unless name.to_s =~ /(un)?train_(\w+)/
+
  category = name.to_s.gsub(/(un)?train_(\w+)/, '\2').prepare_category_name
- if @categories.key?(category)
- args.each do |text|
- if name.to_s.start_with?('untrain_')
- untrain(category, text)
- else
- train(category, text)
- end
- end
- elsif name.to_s =~ /(un)?train_(\w+)/
- raise StandardError, "No such category: #{category}"
- else
- super
- end
+ raise StandardError, "No such category: #{category}" unless @categories.key?(category)
+
+ method = name.to_s.start_with?('untrain_') ? :untrain : :train
+ args.each { |text| send(method, category, text) }
+ end
+
+ # @rbs (Symbol, ?bool) -> bool
+ def respond_to_missing?(name, include_private = false)
+ !!(name.to_s =~ /(un)?train_(\w+)/) || super
  end

- #
  # Provides a list of category names
  # For example:
  # b.categories
  # => ['This', 'That', 'the_other']
- def categories # :nodoc:
+ #
+ # @rbs () -> Array[String]
+ def categories
  @categories.keys.collect(&:to_s)
  end

- #
  # Allows you to add categories to the classifier.
  # For example:
  # b.add_category "Not spam"
@@ -134,13 +145,14 @@ module Classifier
  # result in an undertrained category that will tend to match
  # more criteria than the trained selective categories. In short,
  # try to initialize your categories at initialization.
+ #
+ # @rbs (String | Symbol) -> Hash[Symbol, Integer]
  def add_category(category)
  @categories[category.prepare_category_name] = {}
  end

  alias append_category add_category

- #
  # Allows you to remove categories from the classifier.
  # For example:
  # b.remove_category "Spam"
@@ -148,6 +160,8 @@ module Classifier
  # WARNING: Removing categories from a trained classifier will
  # result in the loss of all training data for that category.
  # Make sure you really want to do this before calling this method.
+ #
+ # @rbs (String | Symbol) -> void
  def remove_category(category)
  category = category.prepare_category_name
  raise StandardError, "No such category: #{category}" unless @categories.key?(category)
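The rewritten `classifications` above replaces the old per-word 0.1 pseudo-count with add-one (Laplace) smoothing over the training vocabulary. A minimal sketch of that scoring outside the gem, using hypothetical training counts that mirror the hashes `Bayes` keeps internally:

```ruby
# Hypothetical counts: words per category, train() calls, and total words seen.
category_words  = { spam: { money: 3, offer: 2 }, ham: { meeting: 4 } }
category_counts = { spam: 2, ham: 2 }
word_totals     = { spam: 5, ham: 4 }
vocab_size      = category_words.values.flat_map(&:keys).uniq.size

words  = %i[money meeting]                    # stemmed words of the text to score
scores = category_words.to_h do |cat, counts|
  denom = (word_totals[cat] + vocab_size).to_f
  word_score  = words.sum { |w| Math.log(((counts[w] || 0) + 1) / denom) }
  prior_score = Math.log(category_counts[cat] / category_counts.values.sum.to_f)
  [cat, word_score + prior_score]
end
scores.max_by { |_, score| score }.first      # the category #classify would pick
```

The score closest to zero still wins, exactly as the method comment in the diff describes; only the smoothing of unseen words changed.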
data/lib/classifier/extensions/vector.rb CHANGED
@@ -1,3 +1,5 @@
+ # rbs_inline: enabled
+
  # Author:: Ernest Ellingson
  # Copyright:: Copyright (c) 2005

@@ -5,19 +7,20 @@

  require 'matrix'

+ # @rbs skip
  class Array
- def sum_with_identity(identity = 0.0, &block)
+ def sum_with_identity(identity = 0.0, &)
  return identity unless size.to_i.positive?
+ return map(&).sum_with_identity(identity) if block_given?

- if block_given?
- map(&block).sum_with_identity(identity)
- else
- compact.reduce(:+).to_f || identity.to_f
- end
+ compact.reduce(identity, :+).to_f
  end
  end

- module VectorExtensions
+ # @rbs skip
+ class Vector
+ EPSILON = 1e-10
+
  def magnitude
  sum_of_squares = 0.to_r
  size.times do |i|
@@ -27,8 +30,10 @@ module VectorExtensions
  end

  def normalize
+ magnitude_value = magnitude
+ return Vector[*Array.new(size, 0.0)] if magnitude_value <= 0.0
+
  normalized_values = []
- magnitude_value = magnitude.to_r
  size.times do |i|
  normalized_values << (self[i] / magnitude_value)
  end
@@ -36,10 +41,7 @@ module VectorExtensions
  end
  end

- class Vector
- include VectorExtensions
- end
-
+ # @rbs skip
  class Matrix
  def self.diag(diagonal_elements)
  Matrix.diagonal(*diagonal_elements)
@@ -61,14 +63,19 @@ class Matrix

  loop do
  iteration_count += 1
- (0...q_rotation_matrix.row_size - 1).each do |row|
- (1..q_rotation_matrix.row_size - 1).each do |col|
+ (0...(q_rotation_matrix.row_size - 1)).each do |row|
+ (1..(q_rotation_matrix.row_size - 1)).each do |col|
  next if row == col

- angle = Math.atan((2.to_r * q_rotation_matrix[row,
- col]) / (q_rotation_matrix[row,
- row] - q_rotation_matrix[col,
- col])) / 2.0
+ numerator = 2.0 * q_rotation_matrix[row, col]
+ denominator = q_rotation_matrix[row, row] - q_rotation_matrix[col, col]
+
+ angle = if denominator.abs < Vector::EPSILON
+ numerator >= 0 ? Math::PI / 4.0 : -Math::PI / 4.0
+ else
+ Math.atan(numerator / denominator) / 2.0
+ end
+
  cosine = Math.cos(angle)
  sine = Math.sin(angle)
  rotation_matrix = Matrix.identity(q_rotation_matrix.row_size)
@@ -92,11 +99,12 @@ class Matrix
  break if (sum_of_differences <= 0.001 && iteration_count > 1) || iteration_count >= max_sweeps
  end

- singular_values = []
- q_rotation_matrix.row_size.times do |r|
- singular_values << Math.sqrt(q_rotation_matrix[r, r].to_f)
+ singular_values = q_rotation_matrix.row_size.times.map do |r|
+ Math.sqrt([q_rotation_matrix[r, r].to_f, 0.0].max)
  end
- u_matrix = (row_size >= column_size ? self : trans) * v_matrix * Matrix.diagonal(*singular_values).inverse
+
+ safe_singular_values = singular_values.map { |v| [v, Vector::EPSILON].max }
+ u_matrix = (row_size >= column_size ? self : trans) * v_matrix * Matrix.diagonal(*safe_singular_values).inverse
  [u_matrix, v_matrix, singular_values]
  end

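The patched `Vector#normalize` now short-circuits on a zero magnitude instead of dividing by zero, and `Vector::EPSILON` clamps the denominators used during the Jacobi rotations and singular-value inversion. A small behavioral sketch, assuming GSL is not installed so the pure-Ruby extension above is the one loaded:

```ruby
require 'classifier'   # without GSL, lib/classifier/extensions/vector.rb is loaded

Vector[3.0, 4.0].normalize   # => Vector[0.6, 0.8]
Vector[0.0, 0.0].normalize   # => Vector[0.0, 0.0] rather than an error or NaNs
```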
data/lib/classifier/extensions/word_hash.rb CHANGED
@@ -1,3 +1,5 @@
+ # rbs_inline: enabled
+
  # Author:: Lucas Carlson (mailto:lucas@rufy.com)
  # Copyright:: Copyright (c) 2005 Lucas Carlson
  # License:: LGPL
@@ -11,12 +13,14 @@ class String
  # E.g.,
  # "Hello (greeting's), with {braces} < >...?".without_punctuation
  # => "Hello greetings with braces "
+ # @rbs () -> String
  def without_punctuation
- tr(',?.!;:"@#$%^&*()_=+[]{}\|<>/`~', ' ').tr("'\-", '')
+ tr(',?.!;:"@#$%^&*()_=+[]{}|<>/`~', ' ').tr("'-", '')
  end

  # Return a Hash of strings => ints. Each word in the string is stemmed,
  # interned, and indexes to its frequency in the document.
+ # @rbs () -> Hash[Symbol, Integer]
  def word_hash
  word_hash = clean_word_hash
  symbol_hash = word_hash_for_symbols(gsub(/\w/, ' ').split)
@@ -24,12 +28,14 @@ class String
  end

  # Return a word hash without extra punctuation or short symbols, just stemmed words
+ # @rbs () -> Hash[Symbol, Integer]
  def clean_word_hash
  word_hash_for_words gsub(/[^\w\s]/, '').split
  end

  private

+ # @rbs (Array[String]) -> Hash[Symbol, Integer]
  def word_hash_for_words(words)
  d = Hash.new(0)
  words.each do |word|
@@ -39,6 +45,7 @@ class String
  d
  end

+ # @rbs (Array[String]) -> Hash[Symbol, Integer]
  def word_hash_for_symbols(words)
  d = Hash.new(0)
  words.each do |word|
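These String extensions produce the stemmed, symbolized frequency counts that both classifiers consume. A quick illustration; the exact keys depend on fast-stemmer's output and on the `CORPUS_SKIP_WORDS` stopword list, so treat the result shown as approximate:

```ruby
require 'classifier'

"Searching and searched again".clean_word_hash
# => roughly { search: 2, again: 1 } (stemmed words mapped to frequencies)
```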
data/lib/classifier/lsi/content_node.rb CHANGED
@@ -1,3 +1,5 @@
+ # rbs_inline: enabled
+
  # Author:: David Fayram (mailto:dfayram@lensmen.net)
  # Copyright:: Copyright (c) 2005 David Fayram II
  # License:: LGPL
@@ -7,33 +9,48 @@ module Classifier
  # raw_vector_with, it should be fairly straightforward to understand.
  # You should never have to use it directly.
  class ContentNode
- attr_accessor :raw_vector, :raw_norm,
- :lsi_vector, :lsi_norm,
- :categories
+ # @rbs @word_hash: Hash[Symbol, Integer]
+
+ # @rbs @raw_vector: untyped
+ # @rbs @raw_norm: untyped
+ # @rbs @lsi_vector: untyped
+ # @rbs @lsi_norm: untyped
+ attr_accessor :raw_vector, :raw_norm, :lsi_vector, :lsi_norm
+
+ # @rbs @categories: Array[String | Symbol]
+ attr_accessor :categories

  attr_reader :word_hash

  # If text_proc is not specified, the source will be duck-typed
  # via source.to_s
+ #
+ # @rbs (Hash[Symbol, Integer], *String | Symbol) -> void
  def initialize(word_frequencies, *categories)
  @categories = categories || []
  @word_hash = word_frequencies
  end

  # Use this to fetch the appropriate search vector.
+ #
+ # @rbs () -> untyped
  def search_vector
  @lsi_vector || @raw_vector
  end

  # Use this to fetch the appropriate search vector in normalized form.
+ #
+ # @rbs () -> untyped
  def search_norm
  @lsi_norm || @raw_norm
  end

  # Creates the raw vector out of word_hash using word_list as the
  # key for mapping the vector space.
+ #
+ # @rbs (WordList) -> untyped
  def raw_vector_with(word_list)
- vec = if $GSL
+ vec = if Classifier::LSI.gsl_available
  GSL::Vector.alloc(word_list.size)
  else
  Array.new(word_list.size, 0)
@@ -44,11 +61,13 @@ module Classifier
  end

  # Perform the scaling transform
- total_words = $GSL ? vec.sum : vec.sum_with_identity
+ total_words = Classifier::LSI.gsl_available ? vec.sum : vec.sum_with_identity
+ vec_array = Classifier::LSI.gsl_available ? vec.to_a : vec
+ total_unique_words = vec_array.count { |word| word != 0 }

  # Perform first-order association transform if this vector has more
  # than one word in it.
- if total_words > 1.0
+ if total_words > 1.0 && total_unique_words > 1
  weighted_total = 0.0

  vec.each do |term|
@@ -59,10 +78,13 @@ module Classifier
  val = term_over_total * Math.log(term_over_total)
  weighted_total += val unless val.nan?
  end
- vec = vec.collect { |val| Math.log(val + 1) / -weighted_total }
+
+ sign = weighted_total.negative? ? 1.0 : -1.0
+ divisor = sign * [weighted_total.abs, Vector::EPSILON].max
+ vec = vec.collect { |val| Math.log(val + 1) / divisor }
  end

- if $GSL
+ if Classifier::LSI.gsl_available
  @raw_norm = vec.normalize
  @raw_vector = vec
  else
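The scaling transform itself is unchanged in spirit: each raw count becomes a log-scaled weight divided by the entropy of the document's term distribution; the change above sign-corrects that divisor and clamps it with `Vector::EPSILON` so single-term or degenerate vectors cannot divide by zero. The arithmetic, sketched outside the gem with hypothetical counts:

```ruby
counts  = [2, 1, 1]                                            # hypothetical term frequencies
total   = counts.sum.to_f
entropy = -counts.sum { |c| frac = c / total; frac * Math.log(frac) }
scaled  = counts.map { |c| Math.log(c + 1) / entropy }
# => the log-entropy weights stored in raw_vector; the EPSILON clamp only
#    matters when entropy is effectively zero
```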
data/lib/classifier/lsi/word_list.rb CHANGED
@@ -1,3 +1,5 @@
+ # rbs_inline: enabled
+
  # Author:: David Fayram (mailto:dfayram@lensmen.net)
  # Copyright:: Copyright (c) 2005 David Fayram II
  # License:: LGPL
@@ -5,29 +7,38 @@
  module Classifier
  # This class keeps a word => index mapping. It is used to map stemmed words
  # to dimensions of a vector.
-
  class WordList
+ # @rbs @location_table: Hash[Symbol, Integer]
+
+ # @rbs () -> void
  def initialize
  @location_table = {}
  end

  # Adds a word (if it is new) and assigns it a unique dimension.
+ #
+ # @rbs (Symbol) -> Integer?
  def add_word(word)
  term = word
  @location_table[term] = @location_table.size unless @location_table[term]
  end

  # Returns the dimension of the word or nil if the word is not in the space.
+ #
+ # @rbs (Symbol) -> Integer?
  def [](lookup)
  term = lookup
  @location_table[term]
  end

+ # @rbs (Integer) -> Symbol?
  def word_for_index(ind)
  @location_table.invert[ind]
  end

  # Returns the number of words mapped.
+ #
+ # @rbs () -> Integer
  def size
  @location_table.size
  end
data/lib/classifier/lsi.rb CHANGED
@@ -1,17 +1,34 @@
+ # rbs_inline: enabled
+
  # Author:: David Fayram (mailto:dfayram@lensmen.net)
  # Copyright:: Copyright (c) 2005 David Fayram II
  # License:: LGPL

+ module Classifier
+ class LSI
+ # @rbs @gsl_available: bool
+ @gsl_available = false
+
+ class << self
+ # @rbs @gsl_available: bool
+ attr_accessor :gsl_available
+ end
+ end
+ end
+
  begin
  # to test the native vector class, try `rake test NATIVE_VECTOR=true`
  raise LoadError if ENV['NATIVE_VECTOR'] == 'true'
+ raise LoadError unless Gem::Specification.find_all_by_name('gsl').any?

- require 'gsl' # requires https://github.com/SciRuby/rb-gsl/
+ require 'gsl'
  require 'classifier/extensions/vector_serialize'
- $GSL = true
+ Classifier::LSI.gsl_available = true
  rescue LoadError
- warn 'Notice: for 10x faster LSI support, please install https://github.com/SciRuby/rb-gsl/'
- $GSL = false
+ unless ENV['SUPPRESS_GSL_WARNING'] == 'true'
+ warn 'Notice: for 10x faster LSI, run `gem install gsl`. Set SUPPRESS_GSL_WARNING=true to hide this.'
+ end
+ Classifier::LSI.gsl_available = false
  require 'classifier/extensions/vector'
  end

@@ -24,13 +41,20 @@ module Classifier
  # data based on underlying semantic relations. For more information on the algorithms used,
  # please consult Wikipedia[http://en.wikipedia.org/wiki/Latent_Semantic_Indexing].
  class LSI
+ # @rbs @auto_rebuild: bool
+ # @rbs @word_list: WordList
+ # @rbs @items: Hash[untyped, ContentNode]
+ # @rbs @version: Integer
+ # @rbs @built_at_version: Integer
+
  attr_reader :word_list
  attr_accessor :auto_rebuild

  # Create a fresh index.
  # If you want to call #build_index manually, use
- # Classifier::LSI.new :auto_rebuild => false
+ # Classifier::LSI.new auto_rebuild: false
  #
+ # @rbs (?Hash[Symbol, untyped]) -> void
  def initialize(options = {})
  @auto_rebuild = true unless options[:auto_rebuild] == false
  @word_list = WordList.new
@@ -42,6 +66,8 @@ module Classifier
  # Returns true if the index needs to be rebuilt. The index needs
  # to be built after all informaton is added, but before you start
  # using it for search, classification and cluster detection.
+ #
+ # @rbs () -> bool
  def needs_rebuild?
  (@items.keys.size > 1) && (@version != @built_at_version)
  end
@@ -59,6 +85,7 @@ module Classifier
  # ar = ActiveRecordObject.find( :all )
  # lsi.add_item ar, *ar.categories { |x| ar.content }
  #
+ # @rbs (String, *String | Symbol) ?{ (String) -> String } -> void
  def add_item(item, *categories, &block)
  clean_word_hash = block ? block.call(item).clean_word_hash : item.to_s.clean_word_hash
  @items[item] = ContentNode.new(clean_word_hash, *categories)
@@ -70,12 +97,15 @@ module Classifier
  # you are passing in a string with no categorries. item
  # will be duck typed via to_s .
  #
+ # @rbs (String) -> void
  def <<(item)
  add_item(item)
  end

  # Returns the categories for a given indexed items. You are free to add and remove
  # items from this as you see fit. It does not invalide an index to change its categories.
+ #
+ # @rbs (String) -> Array[String | Symbol]
  def categories_for(item)
  return [] unless @items[item]

@@ -84,6 +114,7 @@ module Classifier

  # Removes an item from the database, if it is indexed.
  #
+ # @rbs (String) -> void
  def remove_item(item)
  return unless @items.key?(item)

@@ -92,6 +123,7 @@ module Classifier
  end

  # Returns an array of items that are indexed.
+ # @rbs () -> Array[untyped]
  def items
  @items.keys
  end
@@ -110,6 +142,8 @@ module Classifier
  # cutoff parameter tells the indexer how many of these values to keep.
  # A value of 1 for cutoff means that no semantic analysis will take place,
  # turning the LSI class into a simple vector search engine.
+ #
+ # @rbs (?Float) -> void
  def build_index(cutoff = 0.75)
  return unless needs_rebuild?

@@ -118,7 +152,7 @@ module Classifier
  doc_list = @items.values
  tda = doc_list.collect { |node| node.raw_vector_with(@word_list) }

- if $GSL
+ if self.class.gsl_available
  tdm = GSL::Matrix.alloc(*tda).trans
  ntdm = build_reduced_matrix(tdm, cutoff)

@@ -131,9 +165,14 @@ module Classifier
  tdm = Matrix.rows(tda).trans
  ntdm = build_reduced_matrix(tdm, cutoff)

- ntdm.row_size.times do |col|
- doc_list[col].lsi_vector = ntdm.column(col) if doc_list[col]
- doc_list[col].lsi_norm = ntdm.column(col).normalize if doc_list[col]
+ ntdm.column_size.times do |col|
+ next unless doc_list[col]
+
+ column = ntdm.column(col)
+ next unless column
+
+ doc_list[col].lsi_vector = column
+ doc_list[col].lsi_norm = column.normalize
  end
  end

@@ -148,13 +187,15 @@ module Classifier
  # your dataset's general content. For example, if you were to use categorize on the
  # results of this data, you could gather information on what your dataset is generally
  # about.
+ #
+ # @rbs (?Integer) -> Array[String]
  def highest_relative_content(max_chunks = 10)
  return [] if needs_rebuild?

  avg_density = {}
- @items.each_key { |x| avg_density[x] = proximity_array_for_content(x).inject(0.0) { |x, y| x + y[1] } }
+ @items.each_key { |x| avg_density[x] = proximity_array_for_content(x).sum { |pair| pair[1] } }

- avg_density.keys.sort_by { |x| avg_density[x] }.reverse[0..max_chunks - 1].map
+ avg_density.keys.sort_by { |x| avg_density[x] }.reverse[0..(max_chunks - 1)].map
  end

  # This function is the primitive that find_related and classify
@@ -169,13 +210,15 @@ module Classifier
  # The parameter doc is the content to compare. If that content is not
  # indexed, you can pass an optional block to define how to create the
  # text data. See add_item for examples of how this works.
- def proximity_array_for_content(doc, &block)
+ #
+ # @rbs (String) ?{ (String) -> String } -> Array[[String, Float]]
+ def proximity_array_for_content(doc, &)
  return [] if needs_rebuild?

- content_node = node_for_content(doc, &block)
+ content_node = node_for_content(doc, &)
  result =
  @items.keys.collect do |item|
- val = if $GSL
+ val = if self.class.gsl_available
  content_node.search_vector * @items[item].search_vector.col
  else
  (Matrix[content_node.search_vector] * @items[item].search_vector)[0]
@@ -190,13 +233,15 @@ module Classifier
  # calculated vectors instead of their full versions. This is useful when
  # you're trying to perform operations on content that is much smaller than
  # the text you're working with. search uses this primitive.
- def proximity_norms_for_content(doc, &block)
+ #
+ # @rbs (String) ?{ (String) -> String } -> Array[[String, Float]]
+ def proximity_norms_for_content(doc, &)
  return [] if needs_rebuild?

- content_node = node_for_content(doc, &block)
+ content_node = node_for_content(doc, &)
  result =
  @items.keys.collect do |item|
- val = if $GSL
+ val = if self.class.gsl_available
  content_node.search_norm * @items[item].search_norm.col
  else
  (Matrix[content_node.search_norm] * @items[item].search_norm)[0]
@@ -213,12 +258,14 @@ module Classifier
  #
  # While this may seem backwards compared to the other functions that LSI supports,
  # it is actually the same algorithm, just applied on a smaller document.
+ #
+ # @rbs (String, ?Integer) -> Array[String]
  def search(string, max_nearest = 3)
  return [] if needs_rebuild?

  carry = proximity_norms_for_content(string)
  result = carry.collect { |x| x[0] }
- result[0..max_nearest - 1]
+ result[0..(max_nearest - 1)]
  end

  # This function takes content and finds other documents
@@ -230,11 +277,13 @@ module Classifier
  # This is particularly useful for identifing clusters in your document space.
  # For example you may want to identify several "What's Related" items for weblog
  # articles, or find paragraphs that relate to each other in an essay.
+ #
+ # @rbs (String, ?Integer) ?{ (String) -> String } -> Array[String]
  def find_related(doc, max_nearest = 3, &block)
  carry =
  proximity_array_for_content(doc, &block).reject { |pair| pair[0] == doc }
  result = carry.collect { |x| x[0] }
- result[0..max_nearest - 1]
+ result[0..(max_nearest - 1)]
  end

  # This function uses a voting system to categorize documents, based on
@@ -246,17 +295,19 @@ module Classifier
  # text. A cutoff of 1 means that every document in the index votes on
  # what category the document is in. This may not always make sense.
  #
- def classify(doc, cutoff = 0.30, &block)
- votes = vote(doc, cutoff, &block)
+ # @rbs (String, ?Float) ?{ (String) -> String } -> String | Symbol
+ def classify(doc, cutoff = 0.30, &)
+ votes = vote(doc, cutoff, &)

  ranking = votes.keys.sort_by { |x| votes[x] }
  ranking[-1]
  end

- def vote(doc, cutoff = 0.30, &block)
+ # @rbs (String, ?Float) ?{ (String) -> String } -> Hash[String | Symbol, Float]
+ def vote(doc, cutoff = 0.30, &)
  icutoff = (@items.size * cutoff).round
- carry = proximity_array_for_content(doc, &block)
- carry = carry[0..icutoff - 1]
+ carry = proximity_array_for_content(doc, &)
+ carry = carry[0..(icutoff - 1)]
  votes = {}
  carry.each do |pair|
  categories = @items[pair[0]].categories
@@ -278,11 +329,11 @@ module Classifier
  # category = nil
  # end
  #
- #
  # See classify() for argument docs
- def classify_with_confidence(doc, cutoff = 0.30, &block)
+ # @rbs (String, ?Float) ?{ (String) -> String } -> [String | Symbol | nil, Float?]
- votes = vote(doc, cutoff, &block)
+ def classify_with_confidence(doc, cutoff = 0.30, &)
- votes_sum = votes.values.inject(0.0) { |sum, v| sum + v }
+ votes = vote(doc, cutoff, &)
+ votes_sum = votes.values.sum
  return [nil, nil] if votes_sum.zero?

  ranking = votes.keys.sort_by { |x| votes[x] }
@@ -294,16 +345,18 @@ module Classifier
  # Prototype, only works on indexed documents.
  # I have no clue if this is going to work, but in theory
  # it's supposed to.
+ # @rbs (String, ?Integer) -> Array[Symbol]
  def highest_ranked_stems(doc, count = 3)
  raise 'Requested stem ranking on non-indexed content!' unless @items[doc]

  arr = node_for_content(doc).lsi_vector.to_a
- top_n = arr.sort.reverse[0..count - 1]
+ top_n = arr.sort.reverse[0..(count - 1)]
  top_n.collect { |x| @word_list.word_for_index(arr.index(x)) }
  end

  private

+ # @rbs (untyped, ?Float) -> untyped
  def build_reduced_matrix(matrix, cutoff = 0.75)
  # TODO: Check that M>=N on these dimensions! Transpose helps assure this
  u, v, s = matrix.SV_decomp
@@ -314,23 +367,26 @@ module Classifier
  s[ord] = 0.0 if s[ord] < s_cutoff
  end
  # Reconstruct the term document matrix, only with reduced rank
- u * ($GSL ? GSL::Matrix : ::Matrix).diag(s) * v.trans
+ result = u * (self.class.gsl_available ? GSL::Matrix : ::Matrix).diag(s) * v.trans
+
+ # Native Ruby SVD returns transposed dimensions when row_size < column_size
+ # Ensure result matches input dimensions
+ result = result.trans if !self.class.gsl_available && result.row_size != matrix.row_size
+
+ result
  end

+ # @rbs (String) ?{ (String) -> String } -> ContentNode
  def node_for_content(item, &block)
  return @items[item] if @items[item]

  clean_word_hash = block ? block.call(item).clean_word_hash : item.to_s.clean_word_hash
-
- cn = ContentNode.new(clean_word_hash, &block) # make the node and extract the data
-
- unless needs_rebuild?
- cn.raw_vector_with(@word_list) # make the lsi raw and norm vectors
- end
-
+ cn = ContentNode.new(clean_word_hash, &block)
+ cn.raw_vector_with(@word_list) unless needs_rebuild?
  cn
  end

+ # @rbs () -> void
  def make_word_list
  @word_list = WordList.new
  @items.each_value do |node|
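With the `$GSL` global gone, callers and tests can query the class-level flag introduced above to see which backend LSI loaded. A minimal check:

```ruby
require 'classifier'

backend = Classifier::LSI.gsl_available ? 'GSL matrices' : 'pure-Ruby Matrix/Vector'
puts "LSI backend: #{backend}"
```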
data/sig/vendor/fast_stemmer.rbs ADDED
@@ -0,0 +1,9 @@
+ # Type stubs for fast-stemmer gem and classifier extensions
+ class String
+ def stem: () -> String
+ def prepare_category_name: () -> Symbol
+ end
+
+ class Symbol
+ def prepare_category_name: () -> Symbol
+ end
data/sig/vendor/gsl.rbs ADDED
@@ -0,0 +1,27 @@
+ # Type stubs for optional GSL gem
+ module GSL
+ class Vector
+ def self.alloc: (untyped) -> Vector
+ def to_a: () -> Array[Float]
+ def normalize: () -> Vector
+ def sum: () -> Float
+ def each: () { (Float) -> void } -> void
+ def []: (Integer) -> Float
+ def []=: (Integer, Float) -> Float
+ def size: () -> Integer
+ def row: () -> Vector
+ def col: () -> Vector
+ def *: (untyped) -> untyped
+ def collect: () { (Float) -> Float } -> Vector
+ end
+
+ class Matrix
+ def self.alloc: (*untyped) -> Matrix
+ def self.diag: (untyped) -> Matrix
+ def trans: () -> Matrix
+ def *: (untyped) -> Matrix
+ def size: () -> [Integer, Integer]
+ def column: (Integer) -> Vector
+ def SV_decomp: () -> [Matrix, Matrix, Vector]
+ end
+ end
data/sig/vendor/matrix.rbs ADDED
@@ -0,0 +1,26 @@
+ # Type stubs for matrix gem
+ class Vector[T]
+ EPSILON: Float
+
+ def self.[]: [T] (*T) -> Vector[T]
+ def size: () -> Integer
+ def []: (Integer) -> T
+ def magnitude: () -> Float
+ def normalize: () -> Vector[T]
+ def each: () { (T) -> void } -> void
+ def collect: [U] () { (T) -> U } -> Vector[U]
+ def to_a: () -> Array[T]
+ def *: (untyped) -> untyped
+ end
+
+ class Matrix[T]
+ def self.rows: [T] (Array[Array[T]]) -> Matrix[T]
+ def self.[]: [T] (*Array[T]) -> Matrix[T]
+ def self.diag: (untyped) -> Matrix[untyped]
+ def trans: () -> Matrix[T]
+ def *: (untyped) -> untyped
+ def row_size: () -> Integer
+ def column_size: () -> Integer
+ def column: (Integer) -> Vector[T]
+ def SV_decomp: () -> [Matrix[T], Matrix[T], untyped]
+ end
data/test/test_helper.rb CHANGED
@@ -1,4 +1,14 @@
- $:.unshift(File.dirname(__FILE__) + '/../lib')
+ require 'simplecov'
+ SimpleCov.start do
+ add_filter '/test/'
+ add_filter '/vendor/'
+ add_group 'Bayes', 'lib/classifier/bayes.rb'
+ add_group 'LSI', 'lib/classifier/lsi'
+ add_group 'Extensions', 'lib/classifier/extensions'
+ enable_coverage :branch
+ end
+
+ $LOAD_PATH.unshift("#{File.dirname(__FILE__)}/../lib")

  require 'minitest'
  require 'minitest/autorun'
metadata CHANGED
@@ -1,14 +1,13 @@
  --- !ruby/object:Gem::Specification
  name: classifier
  version: !ruby/object:Gem::Version
- version: 1.4.3
+ version: 2.0.0
  platform: ruby
  authors:
  - Lucas Carlson
- autorequire:
  bindir: bin
  cert_chain: []
- date: 2024-07-31 00:00:00.000000000 Z
+ date: 1980-01-02 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: fast-stemmer
@@ -52,6 +51,20 @@ dependencies:
  - - ">="
  - !ruby/object:Gem::Version
  version: '0'
+ - !ruby/object:Gem::Dependency
+ name: matrix
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: '0'
+ type: :runtime
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: '0'
  - !ruby/object:Gem::Dependency
  name: minitest
  requirement: !ruby/object:Gem::Requirement
@@ -66,6 +79,20 @@ dependencies:
  - - ">="
  - !ruby/object:Gem::Version
  version: '0'
+ - !ruby/object:Gem::Dependency
+ name: rbs-inline
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: '0'
+ type: :development
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: '0'
  - !ruby/object:Gem::Dependency
  name: rdoc
  requirement: !ruby/object:Gem::Requirement
@@ -86,7 +113,9 @@ executables: []
  extensions: []
  extra_rdoc_files: []
  files:
+ - CLAUDE.md
  - LICENSE
+ - README.md
  - bin/bayes.rb
  - bin/summarize.rb
  - lib/classifier.rb
@@ -99,12 +128,14 @@ files:
  - lib/classifier/lsi/content_node.rb
  - lib/classifier/lsi/summary.rb
  - lib/classifier/lsi/word_list.rb
+ - sig/vendor/fast_stemmer.rbs
+ - sig/vendor/gsl.rbs
+ - sig/vendor/matrix.rbs
  - test/test_helper.rb
  homepage: https://github.com/cardmagic/classifier
  licenses:
  - LGPL
  metadata: {}
- post_install_message:
  rdoc_options: []
  require_paths:
  - lib
@@ -119,8 +150,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  - !ruby/object:Gem::Version
  version: '0'
  requirements: []
- rubygems_version: 3.5.9
- signing_key:
+ rubygems_version: 4.0.3
  specification_version: 4
  summary: A general classifier module to allow Bayesian and other types of classifications.
  test_files: []