classifier 1.4.3 → 2.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CLAUDE.md +67 -0
- data/README.md +259 -0
- data/lib/classifier/bayes.rb +50 -36
- data/lib/classifier/extensions/vector.rb +30 -22
- data/lib/classifier/extensions/word_hash.rb +8 -1
- data/lib/classifier/lsi/content_node.rb +30 -8
- data/lib/classifier/lsi/word_list.rb +12 -1
- data/lib/classifier/lsi.rb +93 -37
- data/sig/vendor/fast_stemmer.rbs +9 -0
- data/sig/vendor/gsl.rbs +27 -0
- data/sig/vendor/matrix.rbs +26 -0
- data/test/test_helper.rb +11 -1
- metadata +36 -6
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: fea14969bc8a61283823b0b0f5bae013af968caf4676c383155e3b8682b948de
+  data.tar.gz: 4d626c85d084ff75eba2ff305673734a6f25b668e773b1b5a3a0630a6b68df96
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ef53c06db3326b1b6ebc14255b4ba198286c06e291cba3afc67bba360ca766a173f89269405d216751806ca72f885a87ac80ec24a031053f8e6f2987e8e2267e
+  data.tar.gz: 8f120a9b78e802e6fd3e7172fd311b476745e27d5b3d301dc8d140296a451875e5aa33a901514bfdd1bc96c656ad1a43cbb3935a05223cd38548a71ba6a3a1c1
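To compare these values against a locally downloaded copy, note that a `.gem` file is a plain tar archive whose `metadata.gz` and `data.tar.gz` members are what the checksums above cover. A verification sketch only (the download step and filename are assumed, not part of this diff):

```ruby
# Print SHA256 digests of the archives inside a downloaded gem
# (e.g. fetched via `gem fetch classifier --version 2.0.0`) so they can be
# compared with the checksums.yaml values shown above.
require 'digest'
require 'rubygems/package'

File.open('classifier-2.0.0.gem', 'rb') do |io|
  Gem::Package::TarReader.new(io).each do |entry|
    next unless %w[metadata.gz data.tar.gz].include?(entry.full_name)

    puts "#{entry.full_name}: #{Digest::SHA256.hexdigest(entry.read)}"
  end
end
```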
data/CLAUDE.md
ADDED
@@ -0,0 +1,67 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Ruby gem providing text classification via two algorithms:
- **Bayes** (`Classifier::Bayes`) - Naive Bayesian classification
- **LSI** (`Classifier::LSI`) - Latent Semantic Indexing for semantic classification, clustering, and search

## Common Commands

```bash
# Run all tests
rake test

# Run a single test file
ruby -Ilib test/bayes/bayesian_test.rb
ruby -Ilib test/lsi/lsi_test.rb

# Run tests with native Ruby vector (without GSL)
NATIVE_VECTOR=true rake test

# Interactive console
rake console

# Generate documentation
rake doc
```

## Architecture

### Core Components

**Bayesian Classifier** (`lib/classifier/bayes.rb`)
- Train with `train(category, text)` or dynamic methods like `train_spam(text)`
- Classify with `classify(text)` returning the best category
- Uses log probabilities for numerical stability

**LSI Classifier** (`lib/classifier/lsi.rb`)
- Uses Singular Value Decomposition (SVD) for semantic analysis
- Optional GSL gem for 10x faster matrix operations; falls back to pure Ruby SVD
- Key operations: `add_item`, `classify`, `find_related`, `search`
- `auto_rebuild` option controls automatic index rebuilding after changes

**String Extensions** (`lib/classifier/extensions/word_hash.rb`)
- `word_hash` / `clean_word_hash` - tokenize text to stemmed word frequencies
- `CORPUS_SKIP_WORDS` - stopwords filtered during tokenization
- Uses `fast-stemmer` gem for Porter stemming

**Vector Extensions** (`lib/classifier/extensions/vector.rb`)
- Pure Ruby SVD implementation (`Matrix#SV_decomp`)
- Vector normalization and magnitude calculations

### GSL Integration

LSI checks for the `gsl` gem at load time. When available:
- Uses `GSL::Matrix` and `GSL::Vector` for faster operations
- Serialization handled via `vector_serialize.rb`
- Test without GSL: `NATIVE_VECTOR=true rake test`

### Content Nodes (`lib/classifier/lsi/content_node.rb`)

Internal data structure storing:
- `word_hash` - term frequencies
- `raw_vector` / `raw_norm` - initial vector representation
- `lsi_vector` / `lsi_norm` - reduced dimensionality representation after SVD
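The "log probabilities for numerical stability" bullet above corresponds to the Laplace-smoothed scoring that `Bayes#classifications` uses in 2.0.0 (see the `bayes.rb` diff further down). A standalone arithmetic sketch with made-up counts, not the gem's API:

```ruby
# Hypothetical counts for a single category; illustrates the scoring rule
# score(cat) = log P(cat) + sum over words of log((count(w, cat) + 1) / (tokens(cat) + V))
category_word_counts = { buy: 3, cheap: 2 } # tokens seen while training this category
category_docs        = 4.0                  # documents trained into this category
total_docs           = 10.0                 # documents trained across all categories
vocab_size           = 50                   # unique words across all categories

words = %i[buy now]                         # stemmed words of the text being classified
smoothed_total = category_word_counts.values.sum + vocab_size.to_f

word_score  = words.sum { |w| Math.log(((category_word_counts[w] || 0) + 1) / smoothed_total) }
prior_score = Math.log(category_docs / total_docs)

puts word_score + prior_score               # closer to 0 (less negative) means a better match
```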
data/README.md
ADDED
@@ -0,0 +1,259 @@
# Classifier

[](https://badge.fury.io/rb/classifier)
[](https://github.com/cardmagic/classifier/actions/workflows/ruby.yml)
[](https://opensource.org/licenses/LGPL-2.1)

A Ruby library for text classification using Bayesian and Latent Semantic Indexing (LSI) algorithms.

## Table of Contents

- [Installation](#installation)
- [Bayesian Classifier](#bayesian-classifier)
- [LSI (Latent Semantic Indexing)](#lsi-latent-semantic-indexing)
- [Performance](#performance)
- [Development](#development)
- [Contributing](#contributing)
- [License](#license)

## Installation

Add to your Gemfile:

```ruby
gem 'classifier'
```

Then run:

```bash
bundle install
```

Or install directly:

```bash
gem install classifier
```

### Optional: GSL for Faster LSI

For significantly faster LSI operations, install the [GNU Scientific Library](https://www.gnu.org/software/gsl/).

<details>
<summary><strong>Ruby 3+</strong></summary>

The released `gsl` gem doesn't support Ruby 3+. Install from source:

```bash
# Install GSL library
brew install gsl             # macOS
apt-get install libgsl-dev   # Ubuntu/Debian

# Build and install the gem
git clone https://github.com/cardmagic/rb-gsl.git
cd rb-gsl
git checkout fix/ruby-3.4-compatibility
gem build gsl.gemspec
gem install gsl-*.gem
```
</details>

<details>
<summary><strong>Ruby 2.x</strong></summary>

```bash
# macOS
brew install gsl
gem install gsl

# Ubuntu/Debian
apt-get install libgsl-dev
gem install gsl
```
</details>

When GSL is installed, Classifier automatically uses it. To suppress the GSL notice:

```bash
SUPPRESS_GSL_WARNING=true ruby your_script.rb
```

### Compatibility

| Ruby Version | Status |
|--------------|--------|
| 4.0 | Supported |
| 3.4 | Supported |
| 3.3 | Supported |
| 3.2 | Supported |
| 3.1 | EOL (unsupported) |

## Bayesian Classifier

Fast, accurate classification with modest memory requirements. Ideal for spam filtering, sentiment analysis, and content categorization.

### Quick Start

```ruby
require 'classifier'

classifier = Classifier::Bayes.new('Spam', 'Ham')

# Train the classifier
classifier.train_spam "Buy cheap viagra now! Limited offer!"
classifier.train_spam "You've won a million dollars! Claim now!"
classifier.train_ham "Meeting scheduled for tomorrow at 10am"
classifier.train_ham "Please review the attached document"

# Classify new text
classifier.classify "Congratulations! You've won a prize!"
# => "Spam"
```

### Persistence with Madeleine

```ruby
require 'classifier'
require 'madeleine'

m = SnapshotMadeleine.new("classifier_data") {
  Classifier::Bayes.new('Interesting', 'Uninteresting')
}

m.system.train_interesting "fascinating article about science"
m.system.train_uninteresting "boring repetitive content"
m.take_snapshot

# Later, restore and use:
m.system.classify "new scientific discovery"
# => "Interesting"
```

### Learn More

- [Bayesian Filtering Explained](http://www.process.com/precisemail/bayesian_filtering.htm)
- [Wikipedia: Bayesian Filtering](http://en.wikipedia.org/wiki/Bayesian_filtering)
- [Paul Graham: A Plan for Spam](http://www.paulgraham.com/spam.html)

## LSI (Latent Semantic Indexing)

Semantic analysis using Singular Value Decomposition (SVD). More flexible than Bayesian classifiers, providing search, clustering, and classification based on meaning rather than just keywords.

### Quick Start

```ruby
require 'classifier'

lsi = Classifier::LSI.new

# Add documents with categories
lsi.add_item "Dogs are loyal pets that love to play fetch", :pets
lsi.add_item "Cats are independent and love to nap", :pets
lsi.add_item "Ruby is a dynamic programming language", :programming
lsi.add_item "Python is great for data science", :programming

# Classify new text
lsi.classify "My puppy loves to run around"
# => :pets

# Get classification with confidence score
lsi.classify_with_confidence "Learning to code in Ruby"
# => [:programming, 0.89]
```

### Search and Discovery

```ruby
# Find similar documents
lsi.find_related "Dogs are great companions", 2
# => ["Dogs are loyal pets that love to play fetch", "Cats are independent..."]

# Search by keyword
lsi.search "programming", 3
# => ["Ruby is a dynamic programming language", "Python is great for..."]
```

### Learn More

- [Wikipedia: Latent Semantic Analysis](http://en.wikipedia.org/wiki/Latent_semantic_analysis)
- [C2 Wiki: Latent Semantic Indexing](http://www.c2.com/cgi/wiki?LatentSemanticIndexing)

## Performance

### GSL vs Native Ruby

GSL provides dramatic speedups for LSI operations, especially `build_index` (SVD computation):

| Documents | build_index | Overall |
|-----------|-------------|---------|
| 5 | 4x faster | 2.5x |
| 10 | 24x faster | 5.5x |
| 15 | 116x faster | 17x |

<details>
<summary>Detailed benchmark (15 documents)</summary>

```
Operation       Native     GSL      Speedup
----------------------------------------------------------
build_index     0.1412     0.0012   116.2x
classify        0.0142     0.0049   2.9x
search          0.0102     0.0026   3.9x
find_related    0.0069     0.0016   4.2x
----------------------------------------------------------
TOTAL           0.1725     0.0104   16.6x
```
</details>

### Running Benchmarks

```bash
rake benchmark          # Run with current configuration
rake benchmark:compare  # Compare GSL vs native Ruby
```

## Development

### Setup

```bash
git clone https://github.com/cardmagic/classifier.git
cd classifier
bundle install
```

### Running Tests

```bash
rake test                               # Run all tests
ruby -Ilib test/bayes/bayesian_test.rb  # Run specific test file

# Test without GSL (pure Ruby)
NATIVE_VECTOR=true rake test
```

### Console

```bash
rake console
```

## Contributing

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -am 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## Authors

- **Lucas Carlson** - *Original author* - lucas@rufy.com
- **David Fayram II** - *LSI implementation* - dfayram@gmail.com
- **Cameron McBride** - cameron.mcbride@gmail.com
- **Ivan Acosta-Rubio** - ivan@softwarecriollo.com

## License

This library is released under the [GNU Lesser General Public License (LGPL) 2.1](LICENSE).
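The README's note that Classifier automatically uses GSL when installed can be checked at runtime through the `Classifier::LSI.gsl_available` flag introduced in 2.0.0 (see the `lsi.rb` diff below). A small sketch, assuming the gem is installed:

```ruby
require 'classifier'

if Classifier::LSI.gsl_available
  puts 'LSI is using GSL-backed vectors and matrices'
else
  puts 'LSI is using the pure-Ruby matrix fallback'
end
```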
data/lib/classifier/bayes.rb
CHANGED
@@ -1,12 +1,20 @@
+# rbs_inline: enabled
+
 # Author:: Lucas Carlson  (mailto:lucas@rufy.com)
 # Copyright:: Copyright (c) 2005 Lucas Carlson
 # License:: LGPL

 module Classifier
   class Bayes
+    # @rbs @categories: Hash[Symbol, Hash[Symbol, Integer]]
+    # @rbs @total_words: Integer
+    # @rbs @category_counts: Hash[Symbol, Integer]
+    # @rbs @category_word_count: Hash[Symbol, Integer]
+
     # The class can be created with one or more categories, each of which will be
     # initialized and given a training method. E.g.,
     #      b = Classifier::Bayes.new 'Interesting', 'Uninteresting', 'Spam'
+    # @rbs (*String | Symbol) -> void
     def initialize(*categories)
       @categories = {}
       categories.each { |category| @categories[category.prepare_category_name] = {} }
@@ -15,13 +23,14 @@ module Classifier
       @category_word_count = Hash.new(0)
     end

-    #
     # Provides a general training method for all categories specified in Bayes#new
     # For example:
     #     b = Classifier::Bayes.new 'This', 'That', 'the_other'
     #     b.train :this, "This text"
     #     b.train "that", "That text"
     #     b.train "The other", "The other text"
+    #
+    # @rbs (String | Symbol, String) -> void
     def train(category, text)
       category = category.prepare_category_name
       @category_counts[category] += 1
@@ -33,7 +42,6 @@ module Classifier
       end
     end

-    #
     # Provides a untraining method for all categories specified in Bayes#new
     # Be very careful with this method.
     #
@@ -41,6 +49,8 @@ module Classifier
     #     b = Classifier::Bayes.new 'This', 'That', 'the_other'
     #     b.train :this, "This text"
     #     b.untrain :this, "This text"
+    #
+    # @rbs (String | Symbol, String) -> void
     def untrain(category, text)
       category = category.prepare_category_name
       @category_counts[category] -= 1
@@ -59,36 +69,39 @@ module Classifier
       end
     end

-    #
     # Returns the scores in each category the provided +text+. E.g.,
     #    b.classifications "I hate bad words and you"
     #    => {"Uninteresting"=>-12.6997928013932, "Interesting"=>-18.4206807439524}
     # The largest of these scores (the one closest to 0) is the one picked out by #classify
+    #
+    # @rbs (String) -> Hash[String, Float]
     def classifications(text)
-
-
-
-
-
-
-
-
-
-
-
-
-          score[category.to_s] += Math.log(s / training_count)
+      words = text.word_hash.keys
+      training_count = @category_counts.values.sum.to_f
+      vocab_size = [@categories.values.flat_map(&:keys).uniq.size, 1].max
+
+      @categories.to_h do |category, category_words|
+        smoothed_total = ((@category_word_count[category] || 0) + vocab_size).to_f
+
+        # Laplace smoothing: P(word|category) = (count + α) / (total + α * V)
+        word_score = words.sum { |w| Math.log(((category_words[w] || 0) + 1) / smoothed_total) }
+        prior_score = Math.log((@category_counts[category] || 0.1) / training_count)
+
+        [category.to_s, word_score + prior_score]
       end
-      score
     end

-    #
     # Returns the classification of the provided +text+, which is one of the
     # categories given in the initializer. E.g.,
     #    b.classify "I hate bad words and you"
     #    => 'Uninteresting'
+    #
+    # @rbs (String) -> String
     def classify(text)
-
+      best = classifications(text).min_by { |a| -a[1] }
+      raise StandardError, 'No classifications available' unless best
+
+      best.first.to_s
     end

     #
@@ -100,32 +113,30 @@ module Classifier
     #     b.untrain_that "That text"
     #     b.train_the_other "The other text"
     def method_missing(name, *args)
+      return super unless name.to_s =~ /(un)?train_(\w+)/
+
       category = name.to_s.gsub(/(un)?train_(\w+)/, '\2').prepare_category_name
-
-
-
-
-
-
-
-
-
-        raise StandardError, "No such category: #{category}"
-      else
-        super
-      end
+      raise StandardError, "No such category: #{category}" unless @categories.key?(category)
+
+      method = name.to_s.start_with?('untrain_') ? :untrain : :train
+      args.each { |text| send(method, category, text) }
+    end
+
+    # @rbs (Symbol, ?bool) -> bool
+    def respond_to_missing?(name, include_private = false)
+      !!(name.to_s =~ /(un)?train_(\w+)/) || super
     end

-    #
     # Provides a list of category names
     # For example:
     #     b.categories
     #     => ['This', 'That', 'the_other']
-
+    #
+    # @rbs () -> Array[String]
+    def categories
       @categories.keys.collect(&:to_s)
     end

-    #
     # Allows you to add categories to the classifier.
     # For example:
     #     b.add_category "Not spam"
@@ -134,13 +145,14 @@ module Classifier
     # result in an undertrained category that will tend to match
     # more criteria than the trained selective categories. In short,
     # try to initialize your categories at initialization.
+    #
+    # @rbs (String | Symbol) -> Hash[Symbol, Integer]
     def add_category(category)
       @categories[category.prepare_category_name] = {}
     end

     alias append_category add_category

-    #
     # Allows you to remove categories from the classifier.
     # For example:
     #     b.remove_category "Spam"
@@ -148,6 +160,8 @@ module Classifier
     # WARNING: Removing categories from a trained classifier will
     # result in the loss of all training data for that category.
     # Make sure you really want to do this before calling this method.
+    #
+    # @rbs (String | Symbol) -> void
     def remove_category(category)
       category = category.prepare_category_name
       raise StandardError, "No such category: #{category}" unless @categories.key?(category)
data/lib/classifier/extensions/vector.rb
CHANGED
@@ -1,3 +1,5 @@
+# rbs_inline: enabled
+
 # Author:: Ernest Ellingson
 # Copyright:: Copyright (c) 2005

@@ -5,19 +7,20 @@

 require 'matrix'

+# @rbs skip
 class Array
-  def sum_with_identity(identity = 0.0, &
+  def sum_with_identity(identity = 0.0, &)
     return identity unless size.to_i.positive?
+    return map(&).sum_with_identity(identity) if block_given?

-
-      map(&block).sum_with_identity(identity)
-    else
-      compact.reduce(:+).to_f || identity.to_f
-    end
+    compact.reduce(identity, :+).to_f
   end
 end

-
+# @rbs skip
+class Vector
+  EPSILON = 1e-10
+
   def magnitude
     sum_of_squares = 0.to_r
     size.times do |i|
@@ -27,8 +30,10 @@ module VectorExtensions
   end

   def normalize
+    magnitude_value = magnitude
+    return Vector[*Array.new(size, 0.0)] if magnitude_value <= 0.0
+
     normalized_values = []
-    magnitude_value = magnitude.to_r
     size.times do |i|
       normalized_values << (self[i] / magnitude_value)
     end
@@ -36,10 +41,7 @@ module VectorExtensions
   end
 end

-
-  include VectorExtensions
-end
-
+# @rbs skip
 class Matrix
   def self.diag(diagonal_elements)
     Matrix.diagonal(*diagonal_elements)
@@ -61,14 +63,19 @@ class Matrix

     loop do
       iteration_count += 1
-      (0...q_rotation_matrix.row_size - 1).each do |row|
-        (1..q_rotation_matrix.row_size - 1).each do |col|
+      (0...(q_rotation_matrix.row_size - 1)).each do |row|
+        (1..(q_rotation_matrix.row_size - 1)).each do |col|
           next if row == col

-
-
-
-
+          numerator = 2.0 * q_rotation_matrix[row, col]
+          denominator = q_rotation_matrix[row, row] - q_rotation_matrix[col, col]
+
+          angle = if denominator.abs < Vector::EPSILON
+                    numerator >= 0 ? Math::PI / 4.0 : -Math::PI / 4.0
+                  else
+                    Math.atan(numerator / denominator) / 2.0
+                  end
+
           cosine = Math.cos(angle)
           sine = Math.sin(angle)
           rotation_matrix = Matrix.identity(q_rotation_matrix.row_size)
@@ -92,11 +99,12 @@ class Matrix
       break if (sum_of_differences <= 0.001 && iteration_count > 1) || iteration_count >= max_sweeps
     end

-    singular_values =
-
-      singular_values << Math.sqrt(q_rotation_matrix[r, r].to_f)
+    singular_values = q_rotation_matrix.row_size.times.map do |r|
+      Math.sqrt([q_rotation_matrix[r, r].to_f, 0.0].max)
     end
-
+
+    safe_singular_values = singular_values.map { |v| [v, Vector::EPSILON].max }
+    u_matrix = (row_size >= column_size ? self : trans) * v_matrix * Matrix.diagonal(*safe_singular_values).inverse
     [u_matrix, v_matrix, singular_values]
   end

data/lib/classifier/extensions/word_hash.rb
CHANGED
@@ -1,3 +1,5 @@
+# rbs_inline: enabled
+
 # Author:: Lucas Carlson  (mailto:lucas@rufy.com)
 # Copyright:: Copyright (c) 2005 Lucas Carlson
 # License:: LGPL
@@ -11,12 +13,14 @@ class String
   # E.g.,
   #   "Hello (greeting's), with {braces} < >...?".without_punctuation
   #   => "Hello greetings with braces "
+  # @rbs () -> String
   def without_punctuation
-    tr(',?.!;:"@#$%^&*()_=+[]{}
+    tr(',?.!;:"@#$%^&*()_=+[]{}|<>/`~', ' ').tr("'-", '')
   end

   # Return a Hash of strings => ints. Each word in the string is stemmed,
   # interned, and indexes to its frequency in the document.
+  # @rbs () -> Hash[Symbol, Integer]
   def word_hash
     word_hash = clean_word_hash
     symbol_hash = word_hash_for_symbols(gsub(/\w/, ' ').split)
@@ -24,12 +28,14 @@ class String
   end

   # Return a word hash without extra punctuation or short symbols, just stemmed words
+  # @rbs () -> Hash[Symbol, Integer]
   def clean_word_hash
     word_hash_for_words gsub(/[^\w\s]/, '').split
   end

   private

+  # @rbs (Array[String]) -> Hash[Symbol, Integer]
   def word_hash_for_words(words)
     d = Hash.new(0)
     words.each do |word|
@@ -39,6 +45,7 @@ class String
     d
   end

+  # @rbs (Array[String]) -> Hash[Symbol, Integer]
   def word_hash_for_symbols(words)
     d = Hash.new(0)
     words.each do |word|
data/lib/classifier/lsi/content_node.rb
CHANGED
@@ -1,3 +1,5 @@
+# rbs_inline: enabled
+
 # Author:: David Fayram  (mailto:dfayram@lensmen.net)
 # Copyright:: Copyright (c) 2005 David Fayram II
 # License:: LGPL
@@ -7,33 +9,48 @@ module Classifier
   # raw_vector_with, it should be fairly straightforward to understand.
   # You should never have to use it directly.
   class ContentNode
-
-
-
+    # @rbs @word_hash: Hash[Symbol, Integer]
+
+    # @rbs @raw_vector: untyped
+    # @rbs @raw_norm: untyped
+    # @rbs @lsi_vector: untyped
+    # @rbs @lsi_norm: untyped
+    attr_accessor :raw_vector, :raw_norm, :lsi_vector, :lsi_norm
+
+    # @rbs @categories: Array[String | Symbol]
+    attr_accessor :categories

     attr_reader :word_hash

     # If text_proc is not specified, the source will be duck-typed
     # via source.to_s
+    #
+    # @rbs (Hash[Symbol, Integer], *String | Symbol) -> void
     def initialize(word_frequencies, *categories)
       @categories = categories || []
       @word_hash = word_frequencies
     end

     # Use this to fetch the appropriate search vector.
+    #
+    # @rbs () -> untyped
     def search_vector
       @lsi_vector || @raw_vector
     end

     # Use this to fetch the appropriate search vector in normalized form.
+    #
+    # @rbs () -> untyped
     def search_norm
       @lsi_norm || @raw_norm
     end

     # Creates the raw vector out of word_hash using word_list as the
     # key for mapping the vector space.
+    #
+    # @rbs (WordList) -> untyped
     def raw_vector_with(word_list)
-      vec = if
+      vec = if Classifier::LSI.gsl_available
               GSL::Vector.alloc(word_list.size)
             else
               Array.new(word_list.size, 0)
@@ -44,11 +61,13 @@ module Classifier
       end

       # Perform the scaling transform
-      total_words =
+      total_words = Classifier::LSI.gsl_available ? vec.sum : vec.sum_with_identity
+      vec_array = Classifier::LSI.gsl_available ? vec.to_a : vec
+      total_unique_words = vec_array.count { |word| word != 0 }

       # Perform first-order association transform if this vector has more
       # than one word in it.
-      if total_words > 1.0
+      if total_words > 1.0 && total_unique_words > 1
         weighted_total = 0.0

         vec.each do |term|
@@ -59,10 +78,13 @@ module Classifier
           val = term_over_total * Math.log(term_over_total)
           weighted_total += val unless val.nan?
         end
-
+
+        sign = weighted_total.negative? ? 1.0 : -1.0
+        divisor = sign * [weighted_total.abs, Vector::EPSILON].max
+        vec = vec.collect { |val| Math.log(val + 1) / divisor }
       end

-      if
+      if Classifier::LSI.gsl_available
         @raw_norm = vec.normalize
         @raw_vector = vec
       else
data/lib/classifier/lsi/word_list.rb
CHANGED
@@ -1,3 +1,5 @@
+# rbs_inline: enabled
+
 # Author:: David Fayram  (mailto:dfayram@lensmen.net)
 # Copyright:: Copyright (c) 2005 David Fayram II
 # License:: LGPL
@@ -5,29 +7,38 @@
 module Classifier
   # This class keeps a word => index mapping. It is used to map stemmed words
   # to dimensions of a vector.
-
   class WordList
+    # @rbs @location_table: Hash[Symbol, Integer]
+
+    # @rbs () -> void
     def initialize
       @location_table = {}
     end

     # Adds a word (if it is new) and assigns it a unique dimension.
+    #
+    # @rbs (Symbol) -> Integer?
     def add_word(word)
       term = word
       @location_table[term] = @location_table.size unless @location_table[term]
     end

     # Returns the dimension of the word or nil if the word is not in the space.
+    #
+    # @rbs (Symbol) -> Integer?
     def [](lookup)
       term = lookup
       @location_table[term]
     end

+    # @rbs (Integer) -> Symbol?
    def word_for_index(ind)
       @location_table.invert[ind]
     end

     # Returns the number of words mapped.
+    #
+    # @rbs () -> Integer
     def size
       @location_table.size
     end
data/lib/classifier/lsi.rb
CHANGED
@@ -1,17 +1,34 @@
+# rbs_inline: enabled
+
 # Author:: David Fayram  (mailto:dfayram@lensmen.net)
 # Copyright:: Copyright (c) 2005 David Fayram II
 # License:: LGPL

+module Classifier
+  class LSI
+    # @rbs @gsl_available: bool
+    @gsl_available = false
+
+    class << self
+      # @rbs @gsl_available: bool
+      attr_accessor :gsl_available
+    end
+  end
+end
+
 begin
   # to test the native vector class, try `rake test NATIVE_VECTOR=true`
   raise LoadError if ENV['NATIVE_VECTOR'] == 'true'
+  raise LoadError unless Gem::Specification.find_all_by_name('gsl').any?

-  require 'gsl'
+  require 'gsl'
   require 'classifier/extensions/vector_serialize'
-
+  Classifier::LSI.gsl_available = true
 rescue LoadError
-
-
+  unless ENV['SUPPRESS_GSL_WARNING'] == 'true'
+    warn 'Notice: for 10x faster LSI, run `gem install gsl`. Set SUPPRESS_GSL_WARNING=true to hide this.'
+  end
+  Classifier::LSI.gsl_available = false
   require 'classifier/extensions/vector'
 end

@@ -24,13 +41,20 @@ module Classifier
   # data based on underlying semantic relations. For more information on the algorithms used,
   # please consult Wikipedia[http://en.wikipedia.org/wiki/Latent_Semantic_Indexing].
   class LSI
+    # @rbs @auto_rebuild: bool
+    # @rbs @word_list: WordList
+    # @rbs @items: Hash[untyped, ContentNode]
+    # @rbs @version: Integer
+    # @rbs @built_at_version: Integer
+
     attr_reader :word_list
     attr_accessor :auto_rebuild

     # Create a fresh index.
     # If you want to call #build_index manually, use
-    #      Classifier::LSI.new :
+    #      Classifier::LSI.new auto_rebuild: false
     #
+    # @rbs (?Hash[Symbol, untyped]) -> void
     def initialize(options = {})
       @auto_rebuild = true unless options[:auto_rebuild] == false
       @word_list = WordList.new
@@ -42,6 +66,8 @@ module Classifier
     # Returns true if the index needs to be rebuilt. The index needs
     # to be built after all informaton is added, but before you start
     # using it for search, classification and cluster detection.
+    #
+    # @rbs () -> bool
     def needs_rebuild?
       (@items.keys.size > 1) && (@version != @built_at_version)
     end
@@ -59,6 +85,7 @@ module Classifier
     #   ar = ActiveRecordObject.find( :all )
     #   lsi.add_item ar, *ar.categories { |x| ar.content }
     #
+    # @rbs (String, *String | Symbol) ?{ (String) -> String } -> void
     def add_item(item, *categories, &block)
       clean_word_hash = block ? block.call(item).clean_word_hash : item.to_s.clean_word_hash
       @items[item] = ContentNode.new(clean_word_hash, *categories)
@@ -70,12 +97,15 @@ module Classifier
     # you are passing in a string with no categorries. item
     # will be duck typed via to_s .
     #
+    # @rbs (String) -> void
     def <<(item)
       add_item(item)
     end

     # Returns the categories for a given indexed items. You are free to add and remove
     # items from this as you see fit. It does not invalide an index to change its categories.
+    #
+    # @rbs (String) -> Array[String | Symbol]
     def categories_for(item)
       return [] unless @items[item]

@@ -84,6 +114,7 @@ module Classifier

     # Removes an item from the database, if it is indexed.
     #
+    # @rbs (String) -> void
     def remove_item(item)
       return unless @items.key?(item)

@@ -92,6 +123,7 @@ module Classifier
     end

     # Returns an array of items that are indexed.
+    # @rbs () -> Array[untyped]
     def items
       @items.keys
     end
@@ -110,6 +142,8 @@ module Classifier
     # cutoff parameter tells the indexer how many of these values to keep.
     # A value of 1 for cutoff means that no semantic analysis will take place,
     # turning the LSI class into a simple vector search engine.
+    #
+    # @rbs (?Float) -> void
     def build_index(cutoff = 0.75)
       return unless needs_rebuild?

@@ -118,7 +152,7 @@ module Classifier
       doc_list = @items.values
       tda = doc_list.collect { |node| node.raw_vector_with(@word_list) }

-      if
+      if self.class.gsl_available
         tdm = GSL::Matrix.alloc(*tda).trans
         ntdm = build_reduced_matrix(tdm, cutoff)

@@ -131,9 +165,14 @@ module Classifier
         tdm = Matrix.rows(tda).trans
         ntdm = build_reduced_matrix(tdm, cutoff)

-        ntdm.
-
-
+        ntdm.column_size.times do |col|
+          next unless doc_list[col]
+
+          column = ntdm.column(col)
+          next unless column
+
+          doc_list[col].lsi_vector = column
+          doc_list[col].lsi_norm = column.normalize
         end
       end

@@ -148,13 +187,15 @@ module Classifier
     # your dataset's general content. For example, if you were to use categorize on the
     # results of this data, you could gather information on what your dataset is generally
     # about.
+    #
+    # @rbs (?Integer) -> Array[String]
     def highest_relative_content(max_chunks = 10)
       return [] if needs_rebuild?

       avg_density = {}
-      @items.each_key { |x| avg_density[x] = proximity_array_for_content(x).
+      @items.each_key { |x| avg_density[x] = proximity_array_for_content(x).sum { |pair| pair[1] } }

-      avg_density.keys.sort_by { |x| avg_density[x] }.reverse[0..max_chunks - 1].map
+      avg_density.keys.sort_by { |x| avg_density[x] }.reverse[0..(max_chunks - 1)].map
     end

     # This function is the primitive that find_related and classify
@@ -169,13 +210,15 @@ module Classifier
     # The parameter doc is the content to compare. If that content is not
     # indexed, you can pass an optional block to define how to create the
     # text data. See add_item for examples of how this works.
-
+    #
+    # @rbs (String) ?{ (String) -> String } -> Array[[String, Float]]
+    def proximity_array_for_content(doc, &)
       return [] if needs_rebuild?

-      content_node = node_for_content(doc, &
+      content_node = node_for_content(doc, &)
       result =
         @items.keys.collect do |item|
-          val = if
+          val = if self.class.gsl_available
                   content_node.search_vector * @items[item].search_vector.col
                 else
                   (Matrix[content_node.search_vector] * @items[item].search_vector)[0]
@@ -190,13 +233,15 @@ module Classifier
     # calculated vectors instead of their full versions. This is useful when
     # you're trying to perform operations on content that is much smaller than
     # the text you're working with. search uses this primitive.
-
+    #
+    # @rbs (String) ?{ (String) -> String } -> Array[[String, Float]]
+    def proximity_norms_for_content(doc, &)
       return [] if needs_rebuild?

-      content_node = node_for_content(doc, &
+      content_node = node_for_content(doc, &)
       result =
         @items.keys.collect do |item|
-          val = if
+          val = if self.class.gsl_available
                   content_node.search_norm * @items[item].search_norm.col
                 else
                   (Matrix[content_node.search_norm] * @items[item].search_norm)[0]
@@ -213,12 +258,14 @@ module Classifier
     #
     # While this may seem backwards compared to the other functions that LSI supports,
     # it is actually the same algorithm, just applied on a smaller document.
+    #
+    # @rbs (String, ?Integer) -> Array[String]
     def search(string, max_nearest = 3)
       return [] if needs_rebuild?

       carry = proximity_norms_for_content(string)
       result = carry.collect { |x| x[0] }
-      result[0..max_nearest - 1]
+      result[0..(max_nearest - 1)]
     end

     # This function takes content and finds other documents
@@ -230,11 +277,13 @@ module Classifier
     # This is particularly useful for identifing clusters in your document space.
     # For example you may want to identify several "What's Related" items for weblog
     # articles, or find paragraphs that relate to each other in an essay.
+    #
+    # @rbs (String, ?Integer) ?{ (String) -> String } -> Array[String]
     def find_related(doc, max_nearest = 3, &block)
       carry =
         proximity_array_for_content(doc, &block).reject { |pair| pair[0] == doc }
       result = carry.collect { |x| x[0] }
-      result[0..max_nearest - 1]
+      result[0..(max_nearest - 1)]
     end

     # This function uses a voting system to categorize documents, based on
@@ -246,17 +295,19 @@ module Classifier
     # text. A cutoff of 1 means that every document in the index votes on
     # what category the document is in. This may not always make sense.
     #
-
-
+    # @rbs (String, ?Float) ?{ (String) -> String } -> String | Symbol
+    def classify(doc, cutoff = 0.30, &)
+      votes = vote(doc, cutoff, &)

       ranking = votes.keys.sort_by { |x| votes[x] }
       ranking[-1]
     end

-
+    # @rbs (String, ?Float) ?{ (String) -> String } -> Hash[String | Symbol, Float]
+    def vote(doc, cutoff = 0.30, &)
       icutoff = (@items.size * cutoff).round
-      carry = proximity_array_for_content(doc, &
-      carry = carry[0..icutoff - 1]
+      carry = proximity_array_for_content(doc, &)
+      carry = carry[0..(icutoff - 1)]
       votes = {}
       carry.each do |pair|
         categories = @items[pair[0]].categories
@@ -278,11 +329,11 @@ module Classifier
     #      category = nil
     #    end
     #
-    #
     # See classify() for argument docs
-
-
-
+    # @rbs (String, ?Float) ?{ (String) -> String } -> [String | Symbol | nil, Float?]
+    def classify_with_confidence(doc, cutoff = 0.30, &)
+      votes = vote(doc, cutoff, &)
+      votes_sum = votes.values.sum
       return [nil, nil] if votes_sum.zero?

       ranking = votes.keys.sort_by { |x| votes[x] }
@@ -294,16 +345,18 @@ module Classifier
     # Prototype, only works on indexed documents.
     # I have no clue if this is going to work, but in theory
     # it's supposed to.
+    # @rbs (String, ?Integer) -> Array[Symbol]
     def highest_ranked_stems(doc, count = 3)
       raise 'Requested stem ranking on non-indexed content!' unless @items[doc]

       arr = node_for_content(doc).lsi_vector.to_a
-      top_n = arr.sort.reverse[0..count - 1]
+      top_n = arr.sort.reverse[0..(count - 1)]
       top_n.collect { |x| @word_list.word_for_index(arr.index(x)) }
     end

     private

+    # @rbs (untyped, ?Float) -> untyped
     def build_reduced_matrix(matrix, cutoff = 0.75)
       # TODO: Check that M>=N on these dimensions! Transpose helps assure this
       u, v, s = matrix.SV_decomp
@@ -314,23 +367,26 @@ module Classifier
         s[ord] = 0.0 if s[ord] < s_cutoff
       end
       # Reconstruct the term document matrix, only with reduced rank
-      u * (
+      result = u * (self.class.gsl_available ? GSL::Matrix : ::Matrix).diag(s) * v.trans
+
+      # Native Ruby SVD returns transposed dimensions when row_size < column_size
+      # Ensure result matches input dimensions
+      result = result.trans if !self.class.gsl_available && result.row_size != matrix.row_size
+
+      result
     end

+    # @rbs (String) ?{ (String) -> String } -> ContentNode
     def node_for_content(item, &block)
       return @items[item] if @items[item]

       clean_word_hash = block ? block.call(item).clean_word_hash : item.to_s.clean_word_hash
-
-      cn
-
-      unless needs_rebuild?
-        cn.raw_vector_with(@word_list) # make the lsi raw and norm vectors
-      end
-
+      cn = ContentNode.new(clean_word_hash, &block)
+      cn.raw_vector_with(@word_list) unless needs_rebuild?
       cn
     end

+    # @rbs () -> void
     def make_word_list
       @word_list = WordList.new
       @items.each_value do |node|
data/sig/vendor/gsl.rbs
ADDED
@@ -0,0 +1,27 @@
# Type stubs for optional GSL gem
module GSL
  class Vector
    def self.alloc: (untyped) -> Vector
    def to_a: () -> Array[Float]
    def normalize: () -> Vector
    def sum: () -> Float
    def each: () { (Float) -> void } -> void
    def []: (Integer) -> Float
    def []=: (Integer, Float) -> Float
    def size: () -> Integer
    def row: () -> Vector
    def col: () -> Vector
    def *: (untyped) -> untyped
    def collect: () { (Float) -> Float } -> Vector
  end

  class Matrix
    def self.alloc: (*untyped) -> Matrix
    def self.diag: (untyped) -> Matrix
    def trans: () -> Matrix
    def *: (untyped) -> Matrix
    def size: () -> [Integer, Integer]
    def column: (Integer) -> Vector
    def SV_decomp: () -> [Matrix, Matrix, Vector]
  end
end
data/sig/vendor/matrix.rbs
ADDED
@@ -0,0 +1,26 @@
# Type stubs for matrix gem
class Vector[T]
  EPSILON: Float

  def self.[]: [T] (*T) -> Vector[T]
  def size: () -> Integer
  def []: (Integer) -> T
  def magnitude: () -> Float
  def normalize: () -> Vector[T]
  def each: () { (T) -> void } -> void
  def collect: [U] () { (T) -> U } -> Vector[U]
  def to_a: () -> Array[T]
  def *: (untyped) -> untyped
end

class Matrix[T]
  def self.rows: [T] (Array[Array[T]]) -> Matrix[T]
  def self.[]: [T] (*Array[T]) -> Matrix[T]
  def self.diag: (untyped) -> Matrix[untyped]
  def trans: () -> Matrix[T]
  def *: (untyped) -> untyped
  def row_size: () -> Integer
  def column_size: () -> Integer
  def column: (Integer) -> Vector[T]
  def SV_decomp: () -> [Matrix[T], Matrix[T], untyped]
end
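These vendor stubs presumably exist so that calls into the optional GSL gem and the `matrix` gem can be checked against the inline `# @rbs` annotations added throughout the source diffs above; `rbs-inline` itself appears as a new development dependency in the metadata diff below.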
data/test/test_helper.rb
CHANGED
@@ -1,4 +1,14 @@
-
+require 'simplecov'
+SimpleCov.start do
+  add_filter '/test/'
+  add_filter '/vendor/'
+  add_group 'Bayes', 'lib/classifier/bayes.rb'
+  add_group 'LSI', 'lib/classifier/lsi'
+  add_group 'Extensions', 'lib/classifier/extensions'
+  enable_coverage :branch
+end
+
+$LOAD_PATH.unshift("#{File.dirname(__FILE__)}/../lib")

 require 'minitest'
 require 'minitest/autorun'
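With this helper in place, `rake test` also produces a SimpleCov report (by default an HTML report under `coverage/`), and `enable_coverage :branch` adds branch coverage alongside the usual line coverage.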
metadata
CHANGED
@@ -1,14 +1,13 @@
 --- !ruby/object:Gem::Specification
 name: classifier
 version: !ruby/object:Gem::Version
-  version:
+  version: 2.0.0
 platform: ruby
 authors:
 - Lucas Carlson
-autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 1980-01-02 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: fast-stemmer
@@ -52,6 +51,20 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '0'
+- !ruby/object:Gem::Dependency
+  name: matrix
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
 - !ruby/object:Gem::Dependency
   name: minitest
   requirement: !ruby/object:Gem::Requirement
@@ -66,6 +79,20 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '0'
+- !ruby/object:Gem::Dependency
+  name: rbs-inline
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
 - !ruby/object:Gem::Dependency
   name: rdoc
   requirement: !ruby/object:Gem::Requirement
@@ -86,7 +113,9 @@ executables: []
 extensions: []
 extra_rdoc_files: []
 files:
+- CLAUDE.md
 - LICENSE
+- README.md
 - bin/bayes.rb
 - bin/summarize.rb
 - lib/classifier.rb
@@ -99,12 +128,14 @@ files:
 - lib/classifier/lsi/content_node.rb
 - lib/classifier/lsi/summary.rb
 - lib/classifier/lsi/word_list.rb
+- sig/vendor/fast_stemmer.rbs
+- sig/vendor/gsl.rbs
+- sig/vendor/matrix.rbs
 - test/test_helper.rb
 homepage: https://github.com/cardmagic/classifier
 licenses:
 - LGPL
 metadata: {}
-post_install_message:
 rdoc_options: []
 require_paths:
 - lib
@@ -119,8 +150,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
   - !ruby/object:Gem::Version
     version: '0'
 requirements: []
-rubygems_version:
-signing_key:
+rubygems_version: 4.0.3
 specification_version: 4
 summary: A general classifier module to allow Bayesian and other types of classifications.
 test_files: []