RubyGems - hoatzin - Versions diffs - 0.1.0 → 0.2.0 - Mend

hoatzin 0.1.0 → 0.2.0


Files changed (15)

data/LICENSE.txt +6 -0
data/README.markdown +38 -4
data/VERSION +1 -1
data/hoatzin.gemspec +6 -2
data/lib/classifier.rb +115 -0
data/lib/hoatzin.rb +5 -185
data/lib/parser.rb +82 -0
data/lib/vector_space/builder.rb +81 -0
data/lib/vector_space/model.rb +46 -0
data/test/models/readonly-test/metadata +9 -9
data/test/models/readonly-test/model +0 -0
data/test/models/test/metadata +9 -9
data/test/models/test/model +0 -0
data/test/test_hoatzin.rb +3 -2
metadata +34 -22

data/LICENSE.txt CHANGED

@@ -18,3 +18,9 @@ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
 LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
 OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
 WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+Portions of this software are licensed under the MIT License, as specified in
+their original form (from https://github.com/josephwilk/rsemantic) :
+vector_space/builder.rb
+vector_space/model.rb

data/README.markdown CHANGED

@@ -1,6 +1,6 @@
 # hoatzin
-Hoatzin is a text classifier in Ruby that uses SVM for it's classification.
+Hoatzin is a text classifier in Ruby that uses libsvm for it's classification.
 ## Installation
@@ -19,8 +19,8 @@ gem install hoatzin
 ## Storage
-The Hoatzin classifier supports saving your trained classifier to the filesystem.  It needs
-to store the libsvm model and the associated metadata as two separate files.
+The Hoatzin classifier supports saving your trained classifier to the filesystem.  It stores
+the generated libsvm model and the required metadata as two separate files.
     # Load a previously trained classifier
     c = Hoatzin::Classifier.new(:metadata => '/path/to/file', :model => '/path/to/file')
@@ -36,11 +36,45 @@ to store the libsvm model and the associated metadata as two separate files.
 The classifier can continue to be trained if the model is saved with the :update => true option,
 however the files stored on the filesystem will be much larger as they will contain copies
 of all the documents used during training the classifier.  It is generally advised to save without
-the :update => true option unless it is definitely required.
+the :update => true option unless it is required.
+## Training
+The #train method doesn't calculate all the required information for classification
+(in particular the feature vectors) due to the time they take to recompute for each new
+token generated when adding a document for training. This means that there can be a delay
+when calling the #classify method for the first time whilst all the required information
+is prepared. This preparation step can be explicitly called using the #sync method.  This
+method is transparently called by the #classify method when required.  Sample usage of the #sync
+method is shown below:
+    # Create a hoatzin classifier
+    c = Hoatzin::Classifier.new()
+    # Add the training data to the classifier
+    corpus.each do |doc|
+      c.train(doc[:classification], doc[:text])
+    end
+    # Force the calculation of the feature vectors and
+    # preparation of the SVM model.  This can take some
+    # time if the corpus is large
+    c.sync
+    # Save the model and associated meta-data so we don't have to
+    # call sync again and wait for the feature vectors to be computed
+    c.save(:metadata => '/path/to/metadata', :model => '/path/to/model')
+    # Now call classify
+    c.classify("Spectacular show")
+The saved model and metadata can be loaded again for classification, avoiding the
+need to recompute the feature vectors.
 ## Acknowledgements
 See http://www.igvita.com/2008/01/07/support-vector-machines-svm-in-ruby/ for the original inspiration.
+The Vector Space Model implementation is adapted from https://github.com/josephwilk/rsemantic
 ## Copyright and License
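
Note on the new Training section: the deferred work the README describes (cheap `#train`, expensive `#sync`, and `#classify` calling `#sync` only when needed) can be sketched in isolation. The class below is a hypothetical stand-in, not the gem's code; `LazyIndex`, `rebuilds` and `vocabulary` are names invented for illustration, and the "expensive" step is just a vocabulary rebuild rather than an SVM fit.

```ruby
# Hypothetical stand-in for the classifier's lazy rebuild; not the gem's code.
class LazyIndex
  attr_reader :rebuilds

  def initialize
    @documents = []
    @cache = 0       # number of documents covered by the last rebuild
    @rebuilds = 0    # how many times the expensive step has run
  end

  # Cheap: only records the document.
  def train(text)
    @documents << text
    @documents.length
  end

  # Expensive step, run only when documents were added since the last rebuild.
  def sync
    return false unless @documents.length > @cache
    @cache = @documents.length
    @vocabulary = @documents.flat_map { |d| d.downcase.split }.uniq
    @rebuilds += 1
    true
  end

  # Calls sync transparently, mirroring what the README says #classify does.
  def vocabulary
    sync
    @vocabulary || []
  end
end
```

Calling `vocabulary` twice in a row triggers only one rebuild, which is the behaviour the README relies on when it suggests calling `#sync` once and then saving.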

data/VERSION CHANGED

@@ -1 +1 @@
-0.1.0
+0.2.0

data/hoatzin.gemspec CHANGED

@@ -5,11 +5,11 @@
 Gem::Specification.new do |s|
   s.name = %q{hoatzin}
-  s.version = "0.1.0"
+  s.version = "0.2.0"
   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
   s.authors = ["robl"]
-  s.date = %q{2010-12-31}
+  s.date = %q{2011-01-03}
   s.description = %q{Hoatzin is a text classifier in Ruby that uses SVM for it's classification.}
   s.email = %q{robl@rjlee.net}
   s.extra_rdoc_files = [
@@ -24,7 +24,11 @@ Gem::Specification.new do |s|
     "Rakefile",
     "VERSION",
     "hoatzin.gemspec",
+    "lib/classifier.rb",
     "lib/hoatzin.rb",
+    "lib/parser.rb",
+    "lib/vector_space/builder.rb",
+    "lib/vector_space/model.rb",
     "test/helper.rb",
     "test/models/readonly-test/metadata",
     "test/models/readonly-test/model",

data/lib/classifier.rb ADDED

@@ -0,0 +1,115 @@
+module Hoatzin
+  class Classifier
+    class ReadOnly < Exception; end
+    class InvalidFormat < Exception; end
+    FORMAT_VERSION = 2
+    attr_reader :classifications
+    def initialize options = {}
+      @documents = []
+      @classifications = []
+      @labels = []
+      @problem = @model = nil
+      @cache = 0
+      @readonly = false
+      @metadata_file = options.delete(:metadata) || nil
+      @model_file = options.delete(:model) || nil
+      @builder = VectorSpace::Builder.new(:parser => Hoatzin::Parser.new)
+      # If we have model and metadata files then load them
+      load if @metadata_file && @model_file
+      # Define kernel parameters for libsvm
+      @parameters = Parameter.new(:C => 100,
+                                  :degree => 1,
+                                  :coef0 => 0,
+                                  :eps => 0.001)
+    end
+    def train classification, text
+      # Only allow retraining if we have all the required data
+      raise ReadOnly if @readonly
+      # Add the classification if we haven't seen it before
+      @classifications << classification unless @classifications.include?(classification)
+      # Add to document corpus
+      @documents << text
+      # Add classification to classification list
+      @labels << @classifications.index(classification)
+    end
+    def classify text
+      # See if we need to calculate the feature vectors
+      sync
+      # Calculate the feature vectors for the text to be classified
+      f_vector = @builder.build_query_vector(text)
+      # Classify and return classification
+      pred, probs = @model.predict_probability(f_vector)
+      @classifications[pred.to_i]
+    end
+    def sync
+      # Only update the model if we've trained more documents since it was last updated
+      if !@readonly && @documents.length > @cache
+        return nil if @documents.length == 0
+        @cache = @documents.length
+        assign_model
+      end
+    end
+    def save options = {}
+      @metadata_file = options[:metadata] if options.key?(:metadata)
+      @model_file = options[:model] if options.key?(:model)
+      return false unless (@metadata_file && @model_file)
+      # TODO: Add a version identifier
+      data = { :classifications => @classifications,
+               :version => FORMAT_VERSION,
+               :dictionary => @builder.vector_keyword_index,
+               :readonly => true }
+      data.merge!(:documents => @documents,
+                  :cache => @cache,
+                  :readonly => false) if options[:update]
+      File.open(@metadata_file, 'w+') { |f| Marshal.dump(data, f) }
+      assign_model if @model.nil?
+      @model.save(@model_file)
+    end
+    protected
+    def load
+      data = {}
+      File.open(@metadata_file) { |f| data = Marshal.load(f) }
+      raise InvalidFormat if !data.key?(:version) || data[:version] != FORMAT_VERSION
+      @classifications = data[:classifications]
+      @readonly = data[:readonly]
+      @builder.vector_keyword_index = data[:dictionary]
+      unless @readonly
+        @documents = data[:documents]
+        @cache = data[:cache]
+      end
+      @model = Model.new(@model_file)
+    end
+    def assign_model
+      vector_space_model = @builder.build_document_matrix(@documents)
+      @problem = Problem.new(@labels, vector_space_model.matrix)
+      @model = Model.new(@problem, @parameters)
+    end
+  end
+end
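
Note on the metadata format: 0.2.0 writes a `:version` key and `load` raises `InvalidFormat` when it is missing or different, so metadata saved by 0.1.0 (which had no `:version` key) would be rejected. A minimal sketch of that check, assuming an in-memory byte string instead of a file; `dump_metadata` and `load_metadata` are hypothetical helper names. The gem's error classes inherit from `Exception`, which a bare `rescue` does not catch, so the sketch uses `StandardError` instead. `Marshal.load` should only be used on trusted files.

```ruby
FORMAT_VERSION = 2

# StandardError (not Exception) so that a plain `rescue` can catch it.
class InvalidFormat < StandardError; end

# Serialise read-only metadata together with a format version.
def dump_metadata(classifications, dictionary)
  Marshal.dump({ :version => FORMAT_VERSION,
                 :classifications => classifications,
                 :dictionary => dictionary,
                 :readonly => true })
end

# Deserialise and refuse anything that is not the expected format version.
def load_metadata(bytes)
  data = Marshal.load(bytes)
  unless data.is_a?(Hash) && data[:version] == FORMAT_VERSION
    raise InvalidFormat, "expected format version #{FORMAT_VERSION}"
  end
  data
end
```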

data/lib/hoatzin.rb CHANGED

@@ -1,189 +1,9 @@
 require 'svm'
 require 'fast_stemmer'
 require 'iconv'
+require 'pp'
-module Hoatzin
-  class Classifier
-    class ReadOnly < Exception; end
-    attr_reader :classifications
-    def initialize options = {}
-      @metadata_file = options.delete(:metadata) || nil
-      @model_file = options.delete(:model) || nil
-      @documents = []
-      @dictionary = []
-      @classifications = []
-      @labels = []
-      @feature_vectors = []
-      @problem = @model = nil
-      @cache = 0
-      @readonly = false
-      # If we have model and metadata files then load them
-      load if @metadata_file && @model_file
-      # Define kernel parameters for libsvm
-      @parameters = Parameter.new(:C => 100,
-                                  :degree => 1,
-                                  :coef0 => 0,
-                                  :eps => 0.001)
-    end
-    def train classification, text
-      # Only allow retraining if we have all the required data
-      raise ReadOnly if @readonly
-      # Add the classification if we haven't seen it before
-      @classifications << classification unless @classifications.include?(classification)
-      # Tokenize the text
-      tokens = Classifier.tokenize(text)
-      # Add tokens to word list
-      @dictionary << tokens
-      @dictionary.flatten!.uniq!
-      # Add to list of documents
-      @documents << tokens
-      # Add classification to classification list
-      @labels << @classifications.index(classification)
-      # Compute the feature vectors
-      @feature_vectors = @documents.map { |doc| @dictionary.map{|x| doc.include?(x) ? 1 : 0} }
-    end
-    def classify text
-      # Only update the model if we've trained more documents since it was last updated
-      if !@readonly && @documents.length > @cache
-        return nil if @documents.length == 0
-        @cache = @documents.length
-        assign_model
-      end
-      # Tokenize the text
-      tokens = Classifier.tokenize(text)
-      # Calculate the feature vectors for the text to be classified
-      f_vector = @dictionary.map{|x| tokens.include?(x) ? 1 : 0}
-      # Classify and return classification
-      pred, probs = @model.predict_probability(f_vector)
-      @classifications[pred.to_i]
-    end
-    def save options = {}
-      @metadata_file = options[:metadata] if options.key?(:metadata)
-      @model_file = options[:model] if options.key?(:model)
-      return false unless (@metadata_file && @model_file)
-      data = { :dictionary => @dictionary, :classifications => @classifications}
-      data.merge!(:documents => @documents,
-                  :labels => @labels,
-                  :feature_vectors => @feature_vectors,
-                  :cache => @cache) if options[:update]
-      File.open(@metadata_file, 'w+') { |f| Marshal.dump(data, f) }
-      assign_model if @model.nil?
-      @model.save(@model_file)
-    end
-    protected
-    def load
-      data = {}
-      File.open(@metadata_file) { |f| data = Marshal.load(f) }
-      @dictionary = data[:dictionary]
-      @classifications = data[:classifications]
-      if data.key?(:documents)
-        @documents = data[:documents]
-        @labels = data[:labels]
-        @feature_vectors = data[:feature_vectors]
-        @cache = data[:cache]
-      end
-      @readonly = @documents.length > 0 ? false : true
-      @model = Model.new(@model_file)
-    end
-    def assign_model
-      @problem = Problem.new(@labels, @feature_vectors)
-      @model = Model.new(@problem, @parameters)
-    end
-    # Adapted from ankusa, to replace with tokenizer gem
-    def self.tokenize text
-      tokens = []
-      # from http://www.jroller.com/obie/tags/unicode
-      converter = Iconv.new('ASCII//IGNORE//TRANSLIT', 'UTF-8')
-      converter.iconv(text).unpack('U*').select { |cp| cp < 127 }.pack('U*') rescue ""
-      text.tr('-', ' ').gsub(/[^\w\s]/," ").split.each do |token|
-        tokens << token if (token.length > 3 && !Classifier.stop_words.include?(token))
-      end
-      tokens
-    end
-    # ftp://ftp.cs.cornell.edu/pub/smart/english.stop
-    def self.stop_words
-      %w{
-          a a's able about above according accordingly across actually after
-          afterwards again against ain't all allow allows almost alone along
-          already also although always am among amongst an and another any
-          anybody anyhow anyone anything anyway anyways anywhere apart appear
-          appreciate appropriate are aren't around as aside ask asking
-          associated at available away awfully b be became because become
-          becomes becoming been before beforehand behind being believe below
-          beside besides best better between beyond both brief but by c
-          c'mon c's came can can't cannot cant cause causes certain certainly
-          changes clearly co com come comes concerning consequently consider
-          considering contain containing contains corresponding could couldn't
-          course currently d definitely described despite did didn't different
-          do does doesn't doing don't done down downwards during e each edu
-          eg eight either else elsewhere enough entirely especially et etc
-          even ever every everybody everyone everything everywhere ex exactly
-          example except f far few fifth first five followed following follows
-          for former formerly forth four from further furthermore g get gets
-          getting given gives go goes going gone got gotten greetings h had
-          hadn't happens hardly has hasn't have haven't having he he's hello
-          help hence her here here's hereafter hereby herein hereupon hers
-          herself hi him himself his hither hopefully how howbeit however i
-          i'd i'll i'm i've ie if ignored immediate in inasmuch inc indeed
-          indicate indicated indicates inner insofar instead into inward is
-          isn't it it'd it'll it's its itself j just k keep keeps kept know
-          knows known l last lately later latter latterly least less lest let
-          let's like liked likely little look looking looks ltd m mainly many
-          may maybe me mean meanwhile merely might more moreover most mostly
-          much must my myself n name namely nd near nearly necessary need needs
-          neither never nevertheless new next nine no nobody non none noone
-          nor normally not nothing novel now nowhere o obviously of off often
-          oh ok okay old on once one ones only onto or other others otherwise
-          ought our ours ourselves out outside over overall own p particular
-          particularly per perhaps placed please plus possible presumably
-          probably provides q que quite qv r rather rd re really reasonably
-          regarding regardless regards relatively respectively right s said
-          same saw say saying says second secondly see seeing seem seemed
-          seeming seems seen self selves sensible sent serious seriously
-          seven several shall she should shouldn't since six so some somebody
-          somehow someone something sometime sometimes somewhat somewhere soon
-          sorry specified specify specifying still sub such sup sure t t's
-          take taken tell tends th than thank thanks thanx that that's thats
-          the their theirs them themselves then thence there there's thereafter
-          thereby therefore therein theres thereupon these they they'd they'll
-          they're they've think third this thorough thoroughly those though
-          three through throughout thru thus to together too took toward
-          towards tried tries truly try trying twice two u un under
-          unfortunately unless unlikely until unto up upon us use used useful
-          uses using usually uucp v value various very via viz vs w want wants
-          was wasn't way we we'd we'll we're we've welcome well went were weren't
-          what what's whatever when whence whenever where where's whereafter
-          whereas whereby wherein whereupon wherever whether which while
-          whither who who's whoever whole whom whose why will willing wish
-          with within without won't wonder would would wouldn't x y yes yet
-          you you'd you'll you're you've your yours yourself yourselves
-          z zero
-        }
-    end
-  end
-end
+require 'classifier'
+require 'parser'
+require 'vector_space/model'
+require 'vector_space/builder'

data/lib/parser.rb ADDED

@@ -0,0 +1,82 @@
+module Hoatzin
+  class Parser
+    def initialize
+    end
+    # Adapted from ankusa, to replace with tokenizer gem
+    def tokenize text
+      tokens = []
+      # from http://www.jroller.com/obie/tags/unicode
+      converter = Iconv.new('ASCII//IGNORE//TRANSLIT', 'UTF-8')
+      converter.iconv(text).unpack('U*').select { |cp| cp < 127 }.pack('U*') rescue ""
+      text.tr('-', ' ').gsub(/[^\w\s]/," ").split.each do |token|
+        token = token.stem
+        tokens << token if (token.length > 3 && !stop_words.include?(token))
+      end
+      tokens
+    end
+    # ftp://ftp.cs.cornell.edu/pub/smart/english.stop
+    def stop_words
+      %w{
+          a a's able about above according accordingly across actually after
+          afterwards again against ain't all allow allows almost alone along
+          already also although always am among amongst an and another any
+          anybody anyhow anyone anything anyway anyways anywhere apart appear
+          appreciate appropriate are aren't around as aside ask asking
+          associated at available away awfully b be became because become
+          becomes becoming been before beforehand behind being believe below
+          beside besides best better between beyond both brief but by c
+          c'mon c's came can can't cannot cant cause causes certain certainly
+          changes clearly co com come comes concerning consequently consider
+          considering contain containing contains corresponding could couldn't
+          course currently d definitely described despite did didn't different
+          do does doesn't doing don't done down downwards during e each edu
+          eg eight either else elsewhere enough entirely especially et etc
+          even ever every everybody everyone everything everywhere ex exactly
+          example except f far few fifth first five followed following follows
+          for former formerly forth four from further furthermore g get gets
+          getting given gives go goes going gone got gotten greetings h had
+          hadn't happens hardly has hasn't have haven't having he he's hello
+          help hence her here here's hereafter hereby herein hereupon hers
+          herself hi him himself his hither hopefully how howbeit however i
+          i'd i'll i'm i've ie if ignored immediate in inasmuch inc indeed
+          indicate indicated indicates inner insofar instead into inward is
+          isn't it it'd it'll it's its itself j just k keep keeps kept know
+          knows known l last lately later latter latterly least less lest let
+          let's like liked likely little look looking looks ltd m mainly many
+          may maybe me mean meanwhile merely might more moreover most mostly
+          much must my myself n name namely nd near nearly necessary need needs
+          neither never nevertheless new next nine no nobody non none noone
+          nor normally not nothing novel now nowhere o obviously of off often
+          oh ok okay old on once one ones only onto or other others otherwise
+          ought our ours ourselves out outside over overall own p particular
+          particularly per perhaps placed please plus possible presumably
+          probably provides q que quite qv r rather rd re really reasonably
+          regarding regardless regards relatively respectively right s said
+          same saw say saying says second secondly see seeing seem seemed
+          seeming seems seen self selves sensible sent serious seriously
+          seven several shall she should shouldn't since six so some somebody
+          somehow someone something sometime sometimes somewhat somewhere soon
+          sorry specified specify specifying still sub such sup sure t t's
+          take taken tell tends th than thank thanks thanx that that's thats
+          the their theirs them themselves then thence there there's thereafter
+          thereby therefore therein theres thereupon these they they'd they'll
+          they're they've think third this thorough thoroughly those though
+          three through throughout thru thus to together too took toward
+          towards tried tries truly try trying twice two u un under
+          unfortunately unless unlikely until unto up upon us use used useful
+          uses using usually uucp v value various very via viz vs w want wants
+          was wasn't way we we'd we'll we're we've welcome well went were weren't
+          what what's whatever when whence whenever where where's whereafter
+          whereas whereby wherein whereupon wherever whether which while
+          whither who who's whoever whole whom whose why will willing wish
+          with within without won't wonder would would wouldn't x y yes yet
+          you you'd you'll you're you've your yours yourself yourselves
+          z zero
+        }
+    end
+  end
+end
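
Note on `tokenize`: as written in this diff, the result of `converter.iconv(...)` is not assigned to anything, so the ASCII conversion appears to have no effect on the tokens; in addition, `Iconv` was removed from the standard library in Ruby 2.0. The sketch below shows one way the same steps could be written with `String#encode`. It is an assumption-laden illustration, not the gem's implementation: `STOP_WORDS` is a tiny subset, and the stemmer is an optional callable because `fast-stemmer` is not part of the standard library.

```ruby
# Tiny illustrative subset of the stop-word list, not the full SMART list.
STOP_WORDS = %w[that this with from have].freeze

# Split text into tokens, dropping non-ASCII characters, punctuation,
# short tokens (3 characters or fewer) and stop words.
# `stemmer` is a hypothetical optional callable, e.g. a lambda wrapping String#stem.
def tokenize(text, stemmer = nil)
  ascii = text.encode('ASCII', :invalid => :replace, :undef => :replace, :replace => '')
  ascii.tr('-', ' ').gsub(/[^\w\s]/, ' ').split.each_with_object([]) do |token, tokens|
    token = stemmer.call(token) if stemmer
    tokens << token if token.length > 3 && !STOP_WORDS.include?(token)
  end
end
```

Like the diffed code, the sketch applies the length and stop-word filters after stemming and does not change case.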

data/lib/vector_space/builder.rb ADDED

@@ -0,0 +1,81 @@
+# Adapted from : https://github.com/josephwilk/rsemantic
+module Hoatzin
+  module VectorSpace
+    #A algebraic model for representing text documents as vectors of identifiers.
+    #A document is represented as a vector. Each dimension of the vector corresponds to a
+    #separate term. If a term occurs in the document, then the value in the vector is non-zero.
+    class Builder
+      attr_accessor :vector_keyword_index
+      def initialize(options={})
+        @parser = options.delete(:parser)
+        @options = options
+        @parsed_document_cache = []
+      end
+      def build_document_matrix(documents)
+        @vector_keyword_index = build_vector_keyword_index(documents)
+        document_matrix = []
+        document_matrix += documents.enum_for(:each_with_index).map{|document,document_id| build_vector(document, document_id)}
+        Model.new(document_matrix, @vector_keyword_index)
+      end
+      def build_query_vector(text)
+        build_vector(text)
+      end
+      def marshal_dump
+        [@parser, @options, @parsed_document_cache, @vector_keyword_index]
+      end
+      def marshal_load(ary)
+        @parser, @options, @parsed_document_cache, @vector_keyword_index = ary
+      end
+      private
+      def build_vector_keyword_index(documents)
+        parse_and_cache(documents)
+        vocabulary_list = find_unique_vocabulary
+	      map_vocabulary_to_vector_positions(vocabulary_list)
+      end
+      def parse_and_cache(documents)
+        documents.each_with_index do |document, index|
+          @parsed_document_cache[index] = @parser.tokenize(document)
+        end
+      end
+      def find_unique_vocabulary
+        vocabulary_list = @parsed_document_cache.inject([]) { |parsed_document, vocabulary_list| vocabulary_list + parsed_document  }
+        vocabulary_list.uniq
+      end
+      def map_vocabulary_to_vector_positions(vocabulary_list)
+        vector_index={}
+        column = 0
+	      vocabulary_list.each do |word|
+          vector_index[word] = column
+          column += 1
+        end
+        vector_index
+      end
+      def build_vector(word_string, document_id=nil)
+        if document_id.nil?
+          word_list = @parser.tokenize(word_string)
+        else
+          word_list = @parsed_document_cache[document_id]
+        end
+        vector = Array.new(@vector_keyword_index.length, 0)
+        word_list.each { |word| vector[@vector_keyword_index[word]] += 1 if @vector_keyword_index.has_key?(word)  }
+        vector
+      end
+    end
+  end
+end
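
Note on the feature vectors: 0.1.0 built binary presence vectors (`doc.include?(x) ? 1 : 0`), whereas `build_vector` here increments a counter per occurrence, so features become term counts, and words missing from the keyword index are ignored. A standalone sketch of those two steps, assuming documents are already tokenized; `build_keyword_index` is a hypothetical name that folds the builder's vocabulary and position-mapping helpers into one function.

```ruby
# Map each distinct word to a column position, in order of first appearance.
def build_keyword_index(tokenized_docs)
  index = {}
  tokenized_docs.flatten.uniq.each_with_index { |word, column| index[word] = column }
  index
end

# Count occurrences of each indexed word; words not in the index are skipped.
def build_vector(tokens, index)
  vector = Array.new(index.length, 0)
  tokens.each { |word| vector[index[word]] += 1 if index.key?(word) }
  vector
end
```

A query vector built this way always has the same length as the training vectors, which is why the classifier only needs the saved dictionary to classify new text.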

data/lib/vector_space/model.rb ADDED

@@ -0,0 +1,46 @@
+# Adapted from : https://github.com/josephwilk/rsemantic
+#require 'stringio'
+module Hoatzin
+  module VectorSpace
+    class Model
+      def initialize(matrix, keywords)
+        @keywords = keywords || {}
+        @_dc_obj = matrix
+      end
+      def matrix=(matrix)
+        @_dc_obj = matrix
+      end
+      def matrix
+        @_dc_obj
+      end
+#      def to_s
+#        out = StringIO.new
+#        out.print " " * 9
+#
+#        matrix.ncol.times do |id|
+#          out.print "  D#{id+1}  "
+#        end
+#        out.puts
+#
+#        matrix.rows.each_with_index do |terms, index|
+#          out.print "#{@keywords.index(index).ljust(6)}" if @keywords.has_value?(index)
+#          out.print "[ "
+#          terms.columns.each do |document|
+#            out.print "%+0.2f " % document
+#          end
+#          out.print "]"
+#          out.puts
+#        end
+#        out.string
+#      end
+    end
+  end
+end

data/test/models/readonly-test/metadata CHANGED

@@ -1,16 +1,16 @@
 svm_type c_svc
 kernel_type rbf
-gamma 0.0454545
+gamma 0.0833333
 nr_class 2
 total_sv 7
-rho -0.00785397
+rho -0.44634
 label 0 1
 nr_sv 4 3
 SV
-1.178526450216892 0:1 1:1 2:1 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0 16:0 17:0 18:0 19:0 20:0 21:0
-0.8670862694825178 0:1 1:0 2:0 3:1 4:1 5:1 6:1 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0 16:0 17:0 18:0 19:0 20:0 21:0
-1.565208229773764 0:0 1:1 2:0 3:1 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:1 13:1 14:1 15:0 16:0 17:0 18:0 19:0 20:0 21:0
-3.06895439364994 0:1 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:1 16:0 17:0 18:0 19:0 20:0 21:0
--2.445428417806817 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:1 8:1 9:1 10:1 11:1 12:0 13:0 14:0 15:0 16:0 17:0 18:0 19:0 20:0 21:0
--0.5364005807527658 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:1 8:1 9:1 10:1 11:1 12:0 13:0 14:0 15:0 16:1 17:1 18:1 19:0 20:0 21:0
--3.697946344563531 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:1 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0 16:0 17:0 18:0 19:1 20:1 21:1
+0.09624983775649808 0:0 1:0 2:0 3:0 4:0 5:1 6:0 7:0 8:1 9:0 10:0 11:1
+1.421056914116459 0:0 1:0 2:0 3:0 4:0 5:1 6:0 7:0 8:0 9:1 10:1 11:0
+3.139840844768948 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:1 8:1 9:0 10:0 11:0
+2.541431764584626 0:0 1:0 2:0 3:0 4:0 5:1 6:1 7:0 8:0 9:0 10:0 11:0
+-0.1202843540395405 0:1 1:0 2:0 3:1 4:1 5:0 6:0 7:0 8:0 9:0 10:0 11:0
+-0.7832988164395933 0:1 1:1 2:1 3:1 4:1 5:0 6:0 7:0 8:0 9:0 10:0 11:0
+-6.294996190747396 0:1 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0

data/test/models/readonly-test/model CHANGED

Binary file

data/test/models/test/metadata CHANGED

@@ -1,16 +1,16 @@
 svm_type c_svc
 kernel_type rbf
-gamma 0.0454545
+gamma 0.0833333
 nr_class 2
 total_sv 7
-rho -0.00785397
+rho -0.44634
 label 0 1
 nr_sv 4 3
 SV
-1.178526450216892 0:1 1:1 2:1 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0 16:0 17:0 18:0 19:0 20:0 21:0
-0.8670862694825178 0:1 1:0 2:0 3:1 4:1 5:1 6:1 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0 16:0 17:0 18:0 19:0 20:0 21:0
-1.565208229773764 0:0 1:1 2:0 3:1 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:1 13:1 14:1 15:0 16:0 17:0 18:0 19:0 20:0 21:0
-3.06895439364994 0:1 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:1 16:0 17:0 18:0 19:0 20:0 21:0
--2.445428417806817 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:1 8:1 9:1 10:1 11:1 12:0 13:0 14:0 15:0 16:0 17:0 18:0 19:0 20:0 21:0
--0.5364005807527658 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:1 8:1 9:1 10:1 11:1 12:0 13:0 14:0 15:0 16:1 17:1 18:1 19:0 20:0 21:0
--3.697946344563531 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:1 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0 16:0 17:0 18:0 19:1 20:1 21:1
+0.09624983775649808 0:0 1:0 2:0 3:0 4:0 5:1 6:0 7:0 8:1 9:0 10:0 11:1
+1.421056914116459 0:0 1:0 2:0 3:0 4:0 5:1 6:0 7:0 8:0 9:1 10:1 11:0
+3.139840844768948 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:1 8:1 9:0 10:0 11:0
+2.541431764584626 0:0 1:0 2:0 3:0 4:0 5:1 6:1 7:0 8:0 9:0 10:0 11:0
+-0.1202843540395405 0:1 1:0 2:0 3:1 4:1 5:0 6:0 7:0 8:0 9:0 10:0 11:0
+-0.7832988164395933 0:1 1:1 2:1 3:1 4:1 5:0 6:0 7:0 8:0 9:0 10:0 11:0
+-6.294996190747396 0:1 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0

data/test/models/test/model CHANGED

Binary file

data/test/test_hoatzin.rb CHANGED

@@ -9,7 +9,7 @@ class TestHoatzin < Test::Unit::TestCase
     end
     should "support training and classification" do
-      assert_equal @c.train(:positive, "Thats nice"), [[1, 1]]
+      assert_equal @c.train(:positive, "Thats nice"), [0] #[[1, 1]]
       assert_equal @c.classify("Thats nice"), :positive
     end
@@ -22,10 +22,10 @@ class TestHoatzin < Test::Unit::TestCase
       end
       should "classify the test set correctly" do
-        #@c.save(:metadata => METADATA_FILE, :model => MODEL_FILE, :update => true)
         TESTING_LABELS.each_with_index do |label, index|
           assert_equal @c.classify(TESTING_DOCS[index]), label
         end
+        #@c.save(:metadata => READONLY_METADATA_FILE, :model => READONLY_MODEL_FILE, :update => false)
       end
       should "return the classifications" do
@@ -71,4 +71,5 @@ class TestHoatzin < Test::Unit::TestCase
     end
   end
 end

metadata CHANGED

@@ -1,12 +1,13 @@
 --- !ruby/object:Gem::Specification
 name: hoatzin
 version: !ruby/object:Gem::Version
+  hash: 23
   prerelease: false
   segments:
   - 0
-  - 1
+  - 2
   - 0
-  version: 0.1.0
+  version: 0.2.0
 platform: ruby
 authors:
 - robl
@@ -14,91 +15,97 @@ autorequire:
 bindir: bin
 cert_chain: []
-date: 2010-12-31 00:00:00 +00:00
+date: 2011-01-03 00:00:00 +00:00
 default_executable:
 dependencies:
 - !ruby/object:Gem::Dependency
+  prerelease: false
   name: libsvm-ruby-swig
-  requirement: &id001 !ruby/object:Gem::Requirement
+  version_requirements: &id001 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
+        hash: 3
         segments:
         - 0
         version: "0"
+  requirement: *id001
   type: :runtime
-  prerelease: false
-  version_requirements: *id001
 - !ruby/object:Gem::Dependency
+  prerelease: false
   name: fast-stemmer
-  requirement: &id002 !ruby/object:Gem::Requirement
+  version_requirements: &id002 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
+        hash: 3
         segments:
         - 0
         version: "0"
+  requirement: *id002
   type: :runtime
-  prerelease: false
-  version_requirements: *id002
 - !ruby/object:Gem::Dependency
+  prerelease: false
   name: shoulda
-  requirement: &id003 !ruby/object:Gem::Requirement
+  version_requirements: &id003 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
+        hash: 3
         segments:
         - 0
         version: "0"
+  requirement: *id003
   type: :development
-  prerelease: false
-  version_requirements: *id003
 - !ruby/object:Gem::Dependency
+  prerelease: false
   name: bundler
-  requirement: &id004 !ruby/object:Gem::Requirement
+  version_requirements: &id004 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
       - !ruby/object:Gem::Version
+        hash: 23
         segments:
         - 1
         - 0
         - 0
         version: 1.0.0
+  requirement: *id004
   type: :development
-  prerelease: false
-  version_requirements: *id004
 - !ruby/object:Gem::Dependency
+  prerelease: false
   name: jeweler
-  requirement: &id005 !ruby/object:Gem::Requirement
+  version_requirements: &id005 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
       - !ruby/object:Gem::Version
+        hash: 7
         segments:
         - 1
         - 5
         - 2
         version: 1.5.2
+  requirement: *id005
   type: :development
-  prerelease: false
-  version_requirements: *id005
 - !ruby/object:Gem::Dependency
+  prerelease: false
   name: rcov
-  requirement: &id006 !ruby/object:Gem::Requirement
+  version_requirements: &id006 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
+        hash: 3
         segments:
         - 0
         version: "0"
+  requirement: *id006
   type: :development
-  prerelease: false
-  version_requirements: *id006
 description: Hoatzin is a text classifier in Ruby that uses SVM for it's classification.
 email: robl@rjlee.net
 executables: []
@@ -116,7 +123,11 @@ files:
 - Rakefile
 - VERSION
 - hoatzin.gemspec
+- lib/classifier.rb
 - lib/hoatzin.rb
+- lib/parser.rb
+- lib/vector_space/builder.rb
+- lib/vector_space/model.rb
 - test/helper.rb
 - test/models/readonly-test/metadata
 - test/models/readonly-test/model
@@ -137,7 +148,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
-      hash: -1045663747
+      hash: 3
       segments:
       - 0
       version: "0"
@@ -146,6 +157,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
+      hash: 3
       segments:
       - 0
       version: "0"