classifier 2.2.0 → 2.3.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/README.md +19 -15
- data/exe/classifier +9 -0
- data/lib/classifier/cli.rb +880 -0
- data/lib/classifier/logistic_regression.rb +41 -19
- data/lib/classifier/version.rb +3 -0
- data/lib/classifier.rb +1 -0
- data/sig/classifier.rbs +3 -0
- data/sig/vendor/json.rbs +1 -0
- data/sig/vendor/optparse.rbs +19 -0
- metadata +23 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c13b0ca80981d0186f038581e896182ca112928dcada4ac23700f2a9642ca785
+  data.tar.gz: 1949dea18b6d7e06eb931d411e821757f75084c6776ea589927b1f1b89d82280
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 562474db108e959f218407bd66a58bffe1b609ea38b8db16d15f579196d41d4976d74a66b6928e195ff0f6cc5dc4b081ea75d634cefb928e25932f8867358384
+  data.tar.gz: 39f72068f85e64a6397672376f6fcc4821e1eccd9556813e4894ba62800975330051c87e88a8747f03278e6ad0e83a665ed48fcc6cc1f623cd91b657089f8dd6
data/README.md
CHANGED
@@ -6,7 +6,7 @@
 
 Text classification in Ruby. Five algorithms, native performance, streaming support.
 
-**[Documentation](https://rubyclassifier.com/docs)** · **[Tutorials](https://rubyclassifier.com/docs/tutorials)** · **[API Reference](https://
+**[Documentation](https://rubyclassifier.com/docs)** · **[Tutorials](https://rubyclassifier.com/docs/tutorials)** · **[API Reference](https://rubydoc.info/gems/classifier)**
 
 ## Why This Library?
 
@@ -16,7 +16,7 @@ Text classification in Ruby. Five algorithms, native performance, streaming supp
 | **Incremental LSI** | ✅ Brand's algorithm (no rebuild) | ❌ Full SVD rebuild on every add |
 | **LSI Performance** | ✅ Native C extension (5-50x faster) | ❌ Pure Ruby or requires GSL |
 | **Streaming** | ✅ Train on multi-GB datasets | ❌ Must load all data in memory |
-| **Persistence** | ✅ Pluggable (file, Redis, S3) | ❌ Marshal only |
+| **Persistence** | ✅ Pluggable (file, Redis, S3, SQL, Custom) | ❌ Marshal only |
 
 ## Installation
 
@@ -30,8 +30,10 @@ gem 'classifier'
 
 ```ruby
 classifier = Classifier::Bayes.new(:spam, :ham)
-classifier.train(spam: "Buy cheap
-classifier.
+classifier.train(spam: "Buy viagra cheap pills now")
+classifier.train(spam: "You won million dollars prize")
+classifier.train(ham: ["Meeting tomorrow at 3pm", "Quarterly report attached"])
+classifier.classify("Cheap pills!") # => "Spam"
 ```
 [Bayesian Guide →](https://rubyclassifier.com/docs/guides/bayes/basics)
 
@@ -39,8 +41,9 @@ classifier.classify "You've won a prize!" # => "Spam"
 
 ```ruby
 classifier = Classifier::LogisticRegression.new(:positive, :negative)
-classifier.train(positive: "
-classifier.
+classifier.train(positive: "love amazing great wonderful")
+classifier.train(negative: "hate terrible awful bad")
+classifier.classify("I love it!") # => "Positive"
 ```
 [Logistic Regression Guide →](https://rubyclassifier.com/docs/guides/logisticregression/basics)
 
@@ -48,8 +51,8 @@ classifier.classify "Loved it!" # => "Positive"
 
 ```ruby
 lsi = Classifier::LSI.new
-lsi.add(
-lsi.classify
+lsi.add(dog: "dog puppy canine bark fetch", cat: "cat kitten feline meow purr")
+lsi.classify("My puppy barks") # => "dog"
 ```
 [LSI Guide →](https://rubyclassifier.com/docs/guides/lsi/basics)
 
@@ -57,8 +60,9 @@ lsi.classify "My puppy is playful" # => "pets"
 
 ```ruby
 knn = Classifier::KNN.new(k: 3)
-
-
+%w[laptop coding software developer programming].each { |w| knn.add(tech: w) }
+%w[football basketball soccer goal team].each { |w| knn.add(sports: w) }
+knn.classify("programming code") # => "tech"
 ```
 [k-Nearest Neighbors Guide →](https://rubyclassifier.com/docs/guides/knn/basics)
 
@@ -66,8 +70,8 @@ knn.classify "Claim your prize" # => "spam"
 
 ```ruby
 tfidf = Classifier::TFIDF.new
-tfidf.fit(["
-tfidf.transform("
+tfidf.fit(["Ruby is great", "Python is great", "Ruby on Rails"])
+tfidf.transform("Ruby programming") # => {:rubi => 1.0}
 ```
 [TF-IDF Guide →](https://rubyclassifier.com/docs/guides/tfidf/basics)
 
@@ -87,7 +91,7 @@ lsi.add(tech: "Go is fast")
 lsi.add(tech: "Rust is safe")
 ```
 
-[Learn more →](https://rubyclassifier.com/docs/guides/lsi/
+[Learn more →](https://rubyclassifier.com/docs/guides/lsi/basics)
 
 ### Persistence
 
@@ -98,7 +102,7 @@ classifier.save
 loaded = Classifier::Bayes.load(storage: classifier.storage)
 ```
 
-[Learn more →](https://rubyclassifier.com/docs/guides/persistence)
+[Learn more →](https://rubyclassifier.com/docs/guides/persistence/basics)
 
 ### Streaming Training
 
@@ -106,7 +110,7 @@ loaded = Classifier::Bayes.load(storage: classifier.storage)
 classifier.train_from_stream(:spam, File.open("spam_corpus.txt"))
 ```
 
-[Learn more →](https://rubyclassifier.com/docs/
+[Learn more →](https://rubyclassifier.com/docs/tutorials/streaming-training)
 
 ## Performance
 
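The persistence entry in this README diff is only a pointer to the guide. As a rough sketch of the pluggable-persistence flow (assuming the file-backed `Classifier::Storage::File` backend that the new CLI code in this release uses; other backends such as Redis or S3 would plug in the same way), saving and reloading a trained Bayes model looks roughly like this:

```ruby
require 'classifier'

classifier = Classifier::Bayes.new(:spam, :ham)
classifier.train(spam: "Buy cheap pills now")
classifier.train(ham: "Meeting tomorrow at 3pm")

# Assumed backend name; taken from the CLI's save_classifier helper below.
classifier.storage = Classifier::Storage::File.new(path: './classifier.json')
classifier.save

loaded = Classifier::Bayes.load(storage: classifier.storage)
loaded.classify("Cheap pills!") # => "Spam"
```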
data/exe/classifier
ADDED
data/lib/classifier/cli.rb
ADDED
@@ -0,0 +1,880 @@
# rbs_inline: enabled

require 'json'
require 'optparse'
require 'net/http'
require 'uri'
require 'fileutils'
require 'classifier'

module Classifier
  class CLI
    # @rbs @args: Array[String]
    # @rbs @stdin: String?
    # @rbs @options: Hash[Symbol, untyped]
    # @rbs @output: Array[String]
    # @rbs @error: Array[String]
    # @rbs @exit_code: Integer
    # @rbs @parser: OptionParser

    CLASSIFIER_TYPES = {
      'bayes' => :bayes,
      'lsi' => :lsi,
      'knn' => :knn,
      'lr' => :logistic_regression,
      'logistic_regression' => :logistic_regression
    }.freeze

    DEFAULT_REGISTRY = ENV.fetch('CLASSIFIER_REGISTRY', 'cardmagic/classifier-models') #: String
    CACHE_DIR = ENV.fetch('CLASSIFIER_CACHE', File.expand_path('~/.classifier')) #: String

    def initialize(args, stdin: nil)
      @args = args.dup
      @stdin = stdin
      @options = {
        model: ENV.fetch('CLASSIFIER_MODEL', './classifier.json'),
        type: ENV.fetch('CLASSIFIER_TYPE', 'bayes'),
        probabilities: false,
        quiet: false,
        count: 10,
        k: 5,
        weighted: false,
        learning_rate: nil,
        regularization: nil,
        max_iterations: nil,
        remote: nil,
        output_path: nil
      }
      @output = [] #: Array[String]
      @error = [] #: Array[String]
      @exit_code = 0
    end

    def run
      parse_options
      execute_command
      { output: @output.join("\n"), error: @error.join("\n"), exit_code: @exit_code }
    rescue OptionParser::InvalidOption, OptionParser::MissingArgument, OptionParser::InvalidArgument => e
      @error << "Error: #{e.message}"
      @exit_code = 2
      { output: @output.join("\n"), error: @error.join("\n"), exit_code: @exit_code }
    rescue StandardError => e
      @error << "Error: #{e.message}"
      @exit_code = 1
      { output: @output.join("\n"), error: @error.join("\n"), exit_code: @exit_code }
    end

    private

    def parse_options
      @parser = OptionParser.new do |opts|
        opts.banner = 'Usage: classifier [options] [command] [arguments]'
        opts.separator ''
        opts.separator 'Commands:'
        opts.separator '  train <category> [files...]  Train a category from files or stdin'
        opts.separator '  info                         Show model information'
        opts.separator '  fit                          Fit the model (logistic regression)'
        opts.separator '  search <query>               Semantic search (LSI only)'
        opts.separator '  related <item>               Find related documents (LSI only)'
        opts.separator '  models [registry]            List models in registry'
        opts.separator '  pull <model>                 Download model from registry'
        opts.separator '  push <file>                  Contribute model to registry'
        opts.separator '  <text>                       Classify text (default action)'
        opts.separator ''
        opts.separator 'Options:'

        opts.on('-f', '--file FILE', 'Model file (default: ./classifier.json)') do |file|
          @options[:model] = file
        end

        opts.on('-m', '--model TYPE', 'Classifier model: bayes, lsi, knn, lr (default: bayes)') do |type|
          unless CLASSIFIER_TYPES.key?(type)
            raise OptionParser::InvalidArgument, "Unknown classifier model: #{type}. Valid models: #{CLASSIFIER_TYPES.keys.join(', ')}"
          end

          @options[:type] = type
        end

        opts.on('-r', '--remote MODEL', 'Use remote model: name or @user/repo:name') do |model|
          @options[:remote] = model
        end

        opts.on('-o', '--output FILE', 'Output path for pull command') do |file|
          @options[:output_path] = file
        end

        opts.on('-p', 'Show probabilities') do
          @options[:probabilities] = true
        end

        opts.on('-n', '--count N', Integer, 'Number of results for search/related (default: 10)') do |n|
          @options[:count] = n
        end

        opts.on('-k', '--neighbors N', Integer, 'Number of neighbors for KNN (default: 5)') do |n|
          @options[:k] = n
        end

        opts.on('--weighted', 'Use distance-weighted voting for KNN') do
          @options[:weighted] = true
        end

        opts.on('--learning-rate N', Float, 'Learning rate for logistic regression (default: 0.1)') do |n|
          @options[:learning_rate] = n
        end

        opts.on('--regularization N', Float, 'L2 regularization for logistic regression (default: 0.01)') do |n|
          @options[:regularization] = n
        end

        opts.on('--max-iterations N', Integer, 'Max iterations for logistic regression (default: 100)') do |n|
          @options[:max_iterations] = n
        end

        opts.on('-q', 'Quiet mode') do
          @options[:quiet] = true
        end

        opts.on('--local', 'List locally cached models (for models command)') do
          @options[:local] = true
        end

        opts.on('-v', '--version', 'Show version') do
          @output << Classifier::VERSION
          @exit_code = 0
          throw :done
        end

        opts.on('-h', '--help', 'Show help') do
          @output << opts.to_s
          @exit_code = 0
          throw :done
        end
      end

      catch(:done) do
        @parser.parse!(@args)
      end
    end

    def execute_command
      return if @exit_code != 0 || @output.any?

      command = @args.first

      case command
      when 'train'
        command_train
      when 'info'
        command_info
      when 'fit'
        command_fit
      when 'search'
        command_search
      when 'related'
        command_related
      when 'models'
        command_models
      when 'pull'
        command_pull
      when 'push'
        command_push
      else
        command_classify
      end
    end

    def command_train
      @args.shift # remove 'train'
      category = @args.shift

      unless category
        @error << 'Error: category required for train command'
        @exit_code = 2
        return
      end

      classifier = load_or_create_classifier

      if classifier.is_a?(LSI) && @args.any?
        train_lsi_from_files(classifier, category, @args)
        save_classifier(classifier)
        return
      end

      text = read_training_input
      if text.empty?
        @error << 'Error: no training data provided'
        @exit_code = 2
        return
      end

      train_classifier(classifier, category, text)
      save_classifier(classifier)
    end

    def command_info
      unless File.exist?(@options[:model])
        @error << "Error: model not found at #{@options[:model]}"
        @exit_code = 1
        return
      end

      classifier = load_classifier
      info = build_model_info(classifier)
      @output << JSON.pretty_generate(info)
    end

    def build_model_info(classifier)
      info = { file: @options[:model], type: classifier_type_name(classifier) }
      add_common_info(info, classifier)
      add_classifier_specific_info(info, classifier)
      info
    end

    def add_common_info(info, classifier)
      info[:categories] = classifier.categories.map(&:to_s) if classifier.respond_to?(:categories)
      info[:training_count] = classifier.training_count if classifier.respond_to?(:training_count)
      info[:vocab_size] = classifier.vocab_size if classifier.respond_to?(:vocab_size)
      info[:fitted] = classifier.fitted? if classifier.respond_to?(:fitted?)
    end

    def add_classifier_specific_info(info, classifier)
      case classifier
      when Bayes then add_bayes_info(info, classifier)
      when LSI then add_lsi_info(info, classifier)
      when KNN then add_knn_info(info, classifier)
      end
    end

    def add_bayes_info(info, classifier)
      categories_data = classifier.instance_variable_get(:@categories)
      info[:category_stats] = classifier.categories.to_h do |cat|
        cat_data = categories_data[cat.to_sym] || {}
        [cat.to_s, { unique_words: cat_data.size, total_words: cat_data.values.sum }]
      end
    end

    def add_lsi_info(info, classifier)
      info[:documents] = classifier.items.size
      info[:items] = classifier.items
      categories = classifier.items.map { |item| classifier.categories_for(item) }.flatten.uniq
      info[:categories] = categories.map(&:to_s) unless categories.empty?
    end

    def add_knn_info(info, classifier)
      data = classifier.instance_variable_get(:@data) || []
      info[:documents] = data.size
      categories = data.map { |d| d[:category] }.uniq
      info[:categories] = categories.map(&:to_s) unless categories.empty?
    end

    def command_fit
      unless File.exist?(@options[:model])
        @error << "Error: model not found at #{@options[:model]}"
        @exit_code = 1
        return
      end

      classifier = load_classifier

      unless classifier.respond_to?(:fit)
        @output << 'Model does not require fitting' unless @options[:quiet]
        return
      end

      classifier.fit
      save_classifier(classifier)
      @output << 'Model fitted successfully' unless @options[:quiet]
    end

    def command_search
      @args.shift # remove 'search'

      unless File.exist?(@options[:model])
        @error << "Error: model not found at #{@options[:model]}"
        @exit_code = 1
        return
      end

      classifier = load_classifier

      unless classifier.is_a?(LSI)
        @error << 'Error: search requires LSI model (use -t lsi)'
        @exit_code = 1
        return
      end

      query = @args.join(' ')
      query = read_stdin_line if query.empty?

      if query.empty?
        @error << 'Error: search query required'
        @exit_code = 2
        return
      end

      results = classifier.search(query, @options[:count])
      results.each do |item|
        score = classifier.proximity_norms_for_content(query).find { |i, _| i == item }&.last || 0
        @output << "#{item}:#{format('%.2f', score)}"
      end
    end

    def command_related
      @args.shift # remove 'related'
      item = @args.shift

      unless item
        @error << 'Error: item required for related command'
        @exit_code = 2
        return
      end

      unless File.exist?(@options[:model])
        @error << "Error: model not found at #{@options[:model]}"
        @exit_code = 1
        return
      end

      classifier = load_classifier

      unless classifier.is_a?(LSI)
        @error << 'Error: related requires LSI model (use -t lsi)'
        @exit_code = 1
        return
      end

      unless classifier.items.include?(item)
        @error << "Error: item not found in model: #{item}"
        @exit_code = 1
        return
      end

      results = classifier.find_related(item, @options[:count])
      results.each do |related_item|
        scores = classifier.proximity_array_for_content(item)
        score = scores.find { |i, _| i == related_item }&.last || 0
        @output << "#{related_item}:#{format('%.2f', score)}"
      end
    end

    def command_models
      @args.shift # remove 'models'

      if @options[:local]
        list_local_models
      else
        list_remote_models
      end
    end

    def list_remote_models
      registry_arg = @args.shift
      registry = parse_registry(registry_arg) || DEFAULT_REGISTRY
      index = fetch_registry_index(registry)

      return if @exit_code != 0

      if index['models'].empty?
        @output << 'No models found in registry'
        return
      end

      index['models'].each do |name, info|
        type = info['type'] || 'unknown'
        size = info['size'] || 'unknown'
        desc = info['description'] || ''
        @output << format('%-20<name>s %<desc>s (%<type>s, %<size>s)', name: name, desc: desc.slice(0, 40), type: type, size: size)
      end
    end

    def list_local_models
      models_dir = File.join(CACHE_DIR, 'models')

      unless Dir.exist?(models_dir)
        @output << 'No local models found'
        return
      end

      # Find models from default registry
      default_models = Dir.glob(File.join(models_dir, '*.json')).map do |path|
        { name: File.basename(path, '.json'), registry: nil, path: path }
      end

      # Find models from custom registries (@user/repo structure)
      custom_models = Dir.glob(File.join(models_dir, '@*', '*', '*.json')).map do |path|
        # Extract registry from path: .../models/@user/repo/model.json
        repo_dir = File.dirname(path)
        user_dir = File.dirname(repo_dir)
        registry = "#{File.basename(user_dir).delete_prefix('@')}/#{File.basename(repo_dir)}"
        { name: File.basename(path, '.json'), registry: registry, path: path }
      end

      models = default_models + custom_models #: Array[{name: String, registry: String?, path: String}]

      if models.empty?
        @output << 'No local models found'
        return
      end

      models.each do |model|
        info = load_model_info(model[:path])
        type = info['type'] || 'unknown'
        display_name = model[:registry] ? "@#{model[:registry]}:#{model[:name]}" : model[:name]
        size = File.size(model[:path])
        @output << format('%-30<name>s (%<type>s, %<size>s)', name: display_name, type: type, size: human_size(size))
      end
    end

    def load_model_info(path)
      JSON.parse(File.read(path))
    rescue JSON::ParserError
      {}
    end

    def human_size(bytes)
      units = %w[B KB MB GB]
      unit_index = 0
      size = bytes.to_f

      while size >= 1024 && unit_index < units.length - 1
        size /= 1024
        unit_index += 1
      end

      format('%<size>.1f %<unit>s', size: size, unit: units[unit_index])
    end

    def command_pull
      @args.shift # remove 'pull'
      model_spec = @args.shift

      unless model_spec
        @error << 'Error: model name required for pull command'
        @exit_code = 2
        return
      end

      registry, model_name = parse_model_spec(model_spec)
      registry ||= DEFAULT_REGISTRY

      if model_name.nil?
        pull_all_models(registry)
      else
        pull_single_model(registry, model_name)
      end
    end

    def pull_single_model(registry, model_name)
      index = fetch_registry_index(registry)
      return if @exit_code != 0

      model_info = index['models'][model_name]
      unless model_info
        @error << "Error: model '#{model_name}' not found in registry #{registry}"
        @exit_code = 1
        return
      end

      file_path = model_info['file'] || "models/#{model_name}.json"
      output_path = @options[:output_path] || cache_path_for(registry, model_name)

      @output << "Downloading #{model_name} from #{registry}..." unless @options[:quiet]

      content = fetch_github_file(registry, file_path)
      return if @exit_code != 0

      FileUtils.mkdir_p(File.dirname(output_path))
      File.write(output_path, content)

      @output << "Saved to #{output_path}" unless @options[:quiet]
    end

    def pull_all_models(registry)
      index = fetch_registry_index(registry)
      return if @exit_code != 0

      if index['models'].empty?
        @output << 'No models found in registry'
        return
      end

      @output << "Downloading #{index['models'].size} models from #{registry}..." unless @options[:quiet]

      index['models'].each_key do |model_name|
        pull_single_model(registry, model_name)
        break if @exit_code != 0
      end
    end

    def command_push
      @args.shift # remove 'push'

      @output << 'To contribute a model to the registry:'
      @output << ''
      @output << '1. Fork https://github.com/cardmagic/classifier-models'
      @output << '2. Add your model to the models/ directory'
      @output << '3. Update models.json with your model metadata'
      @output << '4. Create a pull request'
      @output << ''
      @output << 'Or use the GitHub CLI:'
      @output << ''
      @output << '  gh repo fork cardmagic/classifier-models --clone'
      @output << '  cp ./classifier.json classifier-models/models/my-model.json'
      @output << '  # Edit classifier-models/models.json to add your model'
      @output << '  cd classifier-models && gh pr create'
    end

    def command_classify
      text = @args.join(' ')

      if @options[:remote]
        classify_with_remote(text)
        return
      end

      if text.empty? && ($stdin.tty? || @stdin.nil?) && !File.exist?(@options[:model])
        show_getting_started
        return
      end

      unless File.exist?(@options[:model])
        @error << "Error: model not found at #{@options[:model]}"
        @exit_code = 1
        return
      end

      classifier = load_classifier

      if text.empty?
        lines = read_stdin_lines
        return show_model_usage(classifier) if lines.empty?

        lines.each { |line| classify_and_output(classifier, line) }
      else
        classify_and_output(classifier, text)
      end
    end

    def classify_with_remote(text)
      registry, model_name = parse_model_spec(@options[:remote])
      registry ||= DEFAULT_REGISTRY

      unless model_name
        @error << 'Error: model name required for -r option'
        @exit_code = 2
        return
      end

      cached_path = cache_path_for(registry, model_name)

      unless File.exist?(cached_path)
        pull_single_model(registry, model_name)
        return if @exit_code != 0
      end

      original_model = @options[:model]
      @options[:model] = cached_path

      begin
        classifier = load_classifier

        if text.empty?
          lines = read_stdin_lines
          return show_model_usage(classifier) if lines.empty?

          lines.each { |line| classify_and_output(classifier, line) }
        else
          classify_and_output(classifier, text)
        end
      ensure
        @options[:model] = original_model
      end
    end

    # @rbs (untyped) -> void
    def show_model_usage(classifier)
      type = classifier_type_name(classifier)
      cats = classifier.categories.map(&:to_s).map(&:downcase)
      first_cat = cats.first || 'category'

      @output << "Model: #{@options[:model]} (#{type})"
      @output << "Categories: #{cats.join(', ')}"
      @output << ''
      @output << 'Classify text:'
      @output << ''
      @output << "  classifier 'text to classify'"
      @output << "  echo 'text to classify' | classifier"
      @output << ''
      @output << 'Train more data:'
      @output << ''
      @output << "  echo 'new example text' | classifier train #{first_cat}"
      @output << "  classifier train #{first_cat} file1.txt file2.txt"
      @output << ''
      @output << 'Other commands:'
      @output << ''
      @output << '  classifier info    Show model details (JSON)'
    end

    def classify_and_output(classifier, text)
      return if text.strip.empty?

      if classifier.is_a?(LogisticRegression) && !classifier.fitted?
        raise StandardError, "Model not fitted. Run 'classifier fit' after training."
      end

      if @options[:probabilities]
        probs = get_probabilities(classifier, text)
        formatted = probs.map { |cat, prob| "#{cat.downcase}:#{format('%.2f', prob)}" }.join(' ')
        @output << formatted
      else
        result = classifier.classify(text)
        @output << result.downcase
      end
    end

    def get_probabilities(classifier, text)
      if classifier.respond_to?(:probabilities)
        classifier.probabilities(text)
      elsif classifier.respond_to?(:classifications)
        scores = classifier.classifications(text)
        normalize_scores(scores)
      else
        { classifier.classify(text) => 1.0 }
      end
    end

    def normalize_scores(scores)
      max_score = scores.values.max
      exp_scores = scores.transform_values { |s| Math.exp(s - max_score) }
      total = exp_scores.values.sum.to_f
      exp_scores.transform_values { |s| (s / total).to_f }
    end

    def load_or_create_classifier
      if File.exist?(@options[:model])
        load_classifier
      else
        create_classifier
      end
    end

    def load_classifier
      json = File.read(@options[:model])
      data = JSON.parse(json)
      type = data['type']

      case type
      when 'bayes'
        Bayes.from_json(data)
      when 'lsi'
        LSI.from_json(data)
      when 'knn'
        KNN.from_json(data)
      when 'logistic_regression'
        LogisticRegression.from_json(data)
      else
        raise "Unknown classifier type in model: #{type}"
      end
    end

    def create_classifier
      type = CLASSIFIER_TYPES[@options[:type]] || :bayes

      case type
      when :lsi
        LSI.new(auto_rebuild: true)
      when :knn
        KNN.new(k: @options[:k], weighted: @options[:weighted])
      when :logistic_regression
        lr_opts = {} #: Hash[Symbol, untyped]
        lr_opts[:learning_rate] = @options[:learning_rate] if @options[:learning_rate]
        lr_opts[:regularization] = @options[:regularization] if @options[:regularization]
        lr_opts[:max_iterations] = @options[:max_iterations] if @options[:max_iterations]
        LogisticRegression.new(**lr_opts)
      else # :bayes or unknown defaults to Bayes
        Bayes.new
      end
    end

    def train_classifier(classifier, category, text)
      case classifier
      when Bayes, LogisticRegression
        classifier.add_category(category) unless classifier.categories.include?(category)
        text.each_line { |line| classifier.train(category, line.strip) unless line.strip.empty? }
      when LSI
        text.each_line do |line|
          next if line.strip.empty?

          classifier.add_item(line.strip, category.to_sym)
        end
      when KNN
        text.each_line do |line|
          next if line.strip.empty?

          classifier.add(category.to_sym => line.strip)
        end
      end
    end

    def train_lsi_from_files(classifier, category, files)
      files.each do |file|
        content = File.read(file)
        classifier.add_item(file, category.to_sym) { content }
      end
    end

    def save_classifier(classifier)
      classifier.storage = Storage::File.new(path: @options[:model])
      classifier.save
    end

    def classifier_type_name(classifier)
      case classifier
      when Bayes then 'bayes'
      when LSI then 'lsi'
      when KNN then 'knn'
      when LogisticRegression then 'logistic_regression'
      else 'unknown'
      end
    end

    def read_training_input
      if @args.any?
        @args.map { |file| File.read(file) }.join("\n")
      else
        read_stdin
      end
    end

    def read_stdin
      @stdin || ($stdin.tty? ? '' : $stdin.read)
    end

    def read_stdin_line
      (@stdin || ($stdin.tty? ? '' : $stdin.read)).to_s.strip
    end

    def read_stdin_lines
      read_stdin.to_s.split("\n").map(&:strip).reject(&:empty?)
    end

    # @rbs () -> void
    def show_getting_started
      @output << 'Classifier - Text classification from the command line'
      @output << ''
      @output << 'Get started by training some categories:'
      @output << ''
      @output << '  # Train from files'
      @output << '  classifier train spam spam_emails/*.txt'
      @output << '  classifier train ham good_emails/*.txt'
      @output << ''
      @output << '  # Train from stdin'
      @output << "  echo 'buy viagra now free pills cheap meds' | classifier train spam"
      @output << "  echo 'meeting scheduled for tomorrow to discuss project' | classifier train ham"
      @output << ''
      @output << 'Then classify text:'
      @output << ''
      @output << "  classifier 'free money buy now'"
      @output << "  classifier 'meeting postponed to friday'"
      @output << ''
      @output << 'Use LSI for semantic search:'
      @output << ''
      @output << "  echo 'ruby is a dynamic programming language' | classifier train docs -m lsi"
      @output << "  echo 'python is great for data science' | classifier train docs -m lsi"
      @output << "  classifier search 'programming'"
      @output << ''
      @output << 'Options:'
      @output << '  -f FILE   Model file (default: ./classifier.json)'
      @output << '  -m TYPE   Model type: bayes, lsi, knn, lr (default: bayes)'
      @output << '  -r MODEL  Use remote model from registry'
      @output << '  -p        Show probabilities'
      @output << ''
      @output << 'Use pre-trained models:'
      @output << ''
      @output << '  classifier models              List available models'
      @output << '  classifier pull sentiment      Download a model'
      @output << "  classifier -r sentiment 'I love this!'   Classify with remote model"
      @output << ''
      @output << 'Run "classifier --help" for full usage.'
    end

    # Parse @user/repo format to extract registry
    # @rbs (String?) -> String?
    def parse_registry(arg)
      return nil unless arg
      return nil unless arg.start_with?('@')

      # @user/repo format
      arg[1..] # Remove @ prefix
    end

    # Parse model spec: name, @user/repo:name, or @user/repo (for all models)
    # Returns [registry, model_name] where model_name is nil if pulling all
    # @rbs (String) -> [String?, String?]
    def parse_model_spec(spec)
      if spec.start_with?('@')
        # @user/repo:model or @user/repo
        rest = spec[1..] || ''
        if spec.include?(':')
          parts = rest.split(':', 2)
          [parts[0], parts[1]]
        else
          # @user/repo - pull all models from registry
          [rest, nil]
        end
      else
        # Just a model name from default registry
        [nil, spec]
      end
    end

    # Get cache path for a model
    # @rbs (String, String) -> String
    def cache_path_for(registry, model_name)
      if registry == DEFAULT_REGISTRY
        File.join(CACHE_DIR, 'models', "#{model_name}.json")
      else
        File.join(CACHE_DIR, 'models', "@#{registry}", "#{model_name}.json")
      end
    end

    # Fetch models.json index from a registry
    # @rbs (String) -> Hash[String, untyped]
    def fetch_registry_index(registry)
      content = fetch_github_file(registry, 'models.json')
      return { 'models' => {} } if @exit_code != 0

      JSON.parse(content)
    rescue JSON::ParserError => e
      @error << "Error: invalid models.json in registry: #{e.message}"
      @exit_code = 1
      { 'models' => {} }
    end

    # Fetch a file from GitHub raw content
    # @rbs (String, String) -> String
    def fetch_github_file(registry, file_path)
      url = "https://raw.githubusercontent.com/#{registry}/main/#{file_path}"
      uri = URI.parse(url)

      response = Net::HTTP.get_response(uri)

      unless response.is_a?(Net::HTTPSuccess)
        # Try master branch if main fails
        url = "https://raw.githubusercontent.com/#{registry}/master/#{file_path}"
        uri = URI.parse(url)
        response = Net::HTTP.get_response(uri)
      end

      unless response.is_a?(Net::HTTPSuccess)
        @error << "Error: failed to fetch #{file_path} from #{registry} (#{response.code})"
        @exit_code = 1
        return ''
      end

      response.body
    end
  end
end
data/lib/classifier/logistic_regression.rb
CHANGED
@@ -62,8 +62,6 @@ module Classifier
                    tolerance: DEFAULT_TOLERANCE)
       super()
       categories = categories.flatten
-      raise ArgumentError, 'At least two categories required' if categories.size < 2
-
       @categories = categories.map { |c| c.to_s.prepare_category_name }
       @weights = @categories.to_h { |c| [c, {}] }
       @bias = @categories.to_h { |c| [c, 0.0] }
@@ -99,6 +97,7 @@ module Classifier
     def fit
       synchronize do
         return self if @training_data.empty?
+        raise ArgumentError, 'At least two categories required for fitting' if @categories.size < 2
 
         optimize_weights
         @fitted = true
@@ -122,13 +121,14 @@ module Classifier
 
     # Returns probability distribution across all categories.
     # Probabilities are well-calibrated (unlike Naive Bayes).
+    # Raises NotFittedError if model has not been fitted.
     #
     #   classifier.probabilities("Buy now!")
     #   # => {"Spam" => 0.92, "Ham" => 0.08}
     #
     # @rbs (String) -> Hash[String, Float]
     def probabilities(text)
-      fit unless @fitted
+      raise NotFittedError, 'Model not fitted. Call fit() after training.' unless @fitted
 
       features = text.word_hash
       synchronize do
@@ -137,10 +137,11 @@ module Classifier
     end
 
     # Returns log-odds scores for each category (before softmax).
+    # Raises NotFittedError if model has not been fitted.
     #
     # @rbs (String) -> Hash[String, Float]
     def classifications(text)
-      fit unless @fitted
+      raise NotFittedError, 'Model not fitted. Call fit() after training.' unless @fitted
 
       features = text.word_hash
       synchronize do
@@ -173,6 +174,23 @@ module Classifier
       synchronize { @categories.map(&:to_s) }
     end
 
+    # Adds a new category to the classifier.
+    # Allows dynamic category creation for CLI and incremental training.
+    #
+    # @rbs (String | Symbol) -> void
+    def add_category(category)
+      cat = category.to_s.prepare_category_name
+      synchronize do
+        return if @categories.include?(cat)
+
+        @categories << cat
+        @weights[cat] = {}
+        @bias[cat] = 0.0
+        @fitted = false
+        @dirty = true
+      end
+    end
+
     # Returns true if the model has been fitted.
     #
     # @rbs () -> bool
@@ -205,11 +223,10 @@ module Classifier
     end
 
     # Returns a hash representation of the classifier state.
+    # Does NOT auto-fit; saves current state including unfitted models.
     #
     # @rbs (?untyped) -> Hash[Symbol, untyped]
     def as_json(_options = nil)
-      fit unless @fitted
-
       {
         version: 1,
         type: 'logistic_regression',
@@ -217,10 +234,12 @@ module Classifier
         weights: @weights.transform_keys(&:to_s).transform_values { |v| v.transform_keys(&:to_s) },
         bias: @bias.transform_keys(&:to_s),
         vocabulary: @vocabulary.keys.map(&:to_s),
+        training_data: @training_data.map { |d| { category: d[:category].to_s, features: d[:features].transform_keys(&:to_s) } },
         learning_rate: @learning_rate,
         regularization: @regularization,
         max_iterations: @max_iterations,
-        tolerance: @tolerance
+        tolerance: @tolerance,
+        fitted: @fitted
       }
     end
 
@@ -546,26 +565,29 @@ module Classifier
     def restore_state(data, categories)
       mu_initialize
       @categories = categories
+      restore_weights_and_bias(data)
+      restore_hyperparameters(data)
+      @fitted = data.fetch('fitted', true)
+      @dirty = false
+      @storage = nil
+    end
+
+    def restore_weights_and_bias(data)
       @weights = {}
       @bias = {}
-
-      data['
-
-
-
-      data['bias'].each do |cat, value|
-        @bias[cat.to_sym] = value.to_f
+      data['weights'].each { |cat, words| @weights[cat.to_sym] = words.transform_keys(&:to_sym).transform_values(&:to_f) }
+      data['bias'].each { |cat, value| @bias[cat.to_sym] = value.to_f }
+      @vocabulary = data['vocabulary'].to_h { |v| [v.to_sym, true] }
+      @training_data = (data['training_data'] || []).map do |d|
+        { category: d['category'].to_sym, features: d['features'].transform_keys(&:to_sym).transform_values(&:to_i) }
       end
+    end
 
-
+    def restore_hyperparameters(data)
       @learning_rate = data['learning_rate']
       @regularization = data['regularization']
       @max_iterations = data['max_iterations']
       @tolerance = data['tolerance']
-      @training_data = []
-      @fitted = true
-      @dirty = false
-      @storage = nil
     end
   end
 end
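A small sketch of the behavioral change above: `probabilities` and `classifications` no longer auto-fit, so callers must fit explicitly after training (class and method names as in the README examples; the probability values are illustrative):

```ruby
classifier = Classifier::LogisticRegression.new(:positive, :negative)
classifier.train(positive: "love amazing great wonderful")
classifier.train(negative: "hate terrible awful bad")

# classifier.probabilities("I love it!")  # would now raise NotFittedError
classifier.fit
classifier.probabilities("I love it!") # => e.g. {"Positive" => 0.9, "Negative" => 0.1}
```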
data/lib/classifier.rb
CHANGED
data/sig/classifier.rbs
ADDED
data/sig/vendor/json.rbs
CHANGED
data/sig/vendor/optparse.rbs
ADDED
@@ -0,0 +1,19 @@
# Minimal type definitions for optparse stdlib

class OptionParser
  class InvalidOption < StandardError
  end

  class MissingArgument < StandardError
  end

  class InvalidArgument < StandardError
  end

  def initialize: () { (OptionParser) -> void } -> void
  def banner=: (String) -> String
  def separator: (String) -> void
  def on: (*untyped) ?{ (*untyped) -> untyped } -> void
  def to_s: () -> String
  def parse!: (Array[String]) -> Array[String]
end
metadata
CHANGED
@@ -1,11 +1,11 @@
 --- !ruby/object:Gem::Specification
 name: classifier
 version: !ruby/object:Gem::Version
-  version: 2.
+  version: 2.3.0
 platform: ruby
 authors:
 - Lucas Carlson
-bindir:
+bindir: exe
 cert_chain: []
 date: 1980-01-02 00:00:00.000000000 Z
 dependencies:
@@ -121,12 +121,27 @@ dependencies:
   - - ">="
     - !ruby/object:Gem::Version
       version: '0'
+- !ruby/object:Gem::Dependency
+  name: webmock
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
 description: A Ruby library for text classification featuring Naive Bayes, LSI (Latent
   Semantic Indexing), Logistic Regression, and k-Nearest Neighbors classifiers. Includes
   TF-IDF vectorization, streaming/incremental training, pluggable persistence backends,
   thread safety, and a native C extension for fast LSI operations.
 email: lucas@rufy.com
-executables:
+executables:
+- classifier
 extensions:
 - ext/classifier/extconf.rb
 extra_rdoc_files: []
@@ -136,6 +151,7 @@ files:
 - README.md
 - bin/bayes.rb
 - bin/summarize.rb
+- exe/classifier
 - ext/classifier/classifier_ext.c
 - ext/classifier/extconf.rb
 - ext/classifier/incremental_svd.c
@@ -145,6 +161,7 @@ files:
 - ext/classifier/vector.c
 - lib/classifier.rb
 - lib/classifier/bayes.rb
+- lib/classifier/cli.rb
 - lib/classifier/errors.rb
 - lib/classifier/extensions/string.rb
 - lib/classifier/extensions/vector.rb
@@ -164,11 +181,14 @@ files:
 - lib/classifier/streaming/line_reader.rb
 - lib/classifier/streaming/progress.rb
 - lib/classifier/tfidf.rb
+- lib/classifier/version.rb
+- sig/classifier.rbs
 - sig/vendor/fast_stemmer.rbs
 - sig/vendor/gsl.rbs
 - sig/vendor/json.rbs
 - sig/vendor/matrix.rbs
 - sig/vendor/mutex_m.rbs
+- sig/vendor/optparse.rbs
 - sig/vendor/streaming.rbs
 - test/test_helper.rb
 homepage: https://rubyclassifier.com