yanbi-ml 0.1.2 → 0.2.0

data/README.md CHANGED
@@ -1,6 +1,6 @@
  # YANBI-ML
 
- Yet Another Naive Bayes Implementation
+ Yet Another Naive Bayes Implementation - Bayes and Fisher document classifiers
 
  ## Installation
 
@@ -34,9 +34,27 @@ classifier.train_raw(:odd, "one three five seven")
  classifier.classify_raw("one two three") => :odd
  ```
 
+ ## What is a Fisher Classifier?
+
+ An alternative to the standard Bayesian classifier that can also give very accurate results. A Bayesian classifier works by computing a single, document-wide probability for each class that a document might belong to. A Fisher classifier, by contrast, computes a probability for each individual feature in a document. If the document does not belong to a given class, you would expect a random distribution of probabilities across its features - in fact, the eponymous Fisher showed that you would generally get a *chi squared distribution* of probabilities. If the document does belong to the class, you would expect the probabilities to be skewed towards higher values instead. A Fisher classifier uses Fisher's statistical method (a p-value) to measure how far the feature probabilities in a document diverge from the expected random distribution.
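+
+ To make the math concrete, here is a minimal sketch of the Fisher combining step (purely illustrative - the numbers and variable names below are hypothetical, not part of the gem's API):
+
+ ```ruby
+ #hypothetical per-feature probabilities for one document
+ probs = [0.8, 0.7, 0.9]
+
+ #Fisher's method: -2 times the log of the product of the p-values...
+ score = -2 * Math.log(probs.reduce(:*))
+
+ #...is chi-squared distributed with two degrees of freedom per feature,
+ #so the score is compared against that distribution (the gem's invchi2)
+ df = probs.count * 2
+ ```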
+
+ ## I don't care, I just want to use it!
+
+ Fortunately, the interface is pretty consistent:
+
+ ```ruby
+ classifier = Yanbi::Fisher.default(:even, :odd)
+ classifier.train_raw(:even, "two four six eight")
+ classifier.train_raw(:odd, "one three five seven")
+
+ classifier.classify_raw("one two three") => :odd
+ ```
+
+ See? Easy.
+
  ## Bags (of words)
 
- A bag of words is a just a Hash of word counts (a multi-set of word frequencies, to ML folk). This makes a useful abstraction because you can use it with more than one kind of classifier, and because the bag provides a natural location for various kinds of pre-processing you might want to do to the words (features) of the text before training with or classifying them.
+ A bag of words is just a Hash of word counts (a multi-set of word frequencies, to ML folk). This makes a useful abstraction because you can use it with more than one kind of classifier, and because the bag provides a natural place for any pre-processing you might want to do to the words (features) of the text before training on or classifying them. Although a single bag can contain as many documents as you want, in practice it's a good idea to treat each word bag as corresponding to a single document.
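+
+ For intuition, a bag built from the string "one two two three" boils down to word counts like these (an illustrative sketch, not necessarily the gem's internal representation):
+
+ ```ruby
+ {"one" => 1, "two" => 2, "three" => 1}
+ ```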
 
  A handful of classes are provided:
 
@@ -163,6 +181,41 @@ docs.each_doc do |d|
  end
  ```
 
+ ## Feature thresholds
+
+ The classifier provides a set_significance method for pruning infrequently seen features. This is often one of the first things recommended for improving the accuracy of a classifier in real world applications. Note that once you prune features, there's no un-pruning afterwards - so be sure you actually want to do it!
+
+ ```ruby
+ classifier = Yanbi.default(:even, :odd)
+
+ #...tons of training happens here...
+
+ #we now have thousands of documents. Ignore any words we haven't
+ #seen at least a dozen times
+
+ classifier.set_significance(12)
+
+ #actually, the 'odd' category is especially noisy, so let's make
+ #that two dozen for odd items
+
+ classifier.set_significance(24, :odd)
+ ```
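+
+ Under the hood, pruning amounts to dropping low-count entries from the word-count Hashes. A rough sketch of the idea (hypothetical - not the gem's actual implementation):
+
+ ```ruby
+ #drop every word seen fewer than threshold times
+ threshold = 12
+ counts = {"the" => 500, "cromulent" => 3}
+ counts.reject! {|word, count| count < threshold}
+ #=> {"the" => 500}
+ ```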
+
+ ## Persisting
+
+ After going to all the trouble of training a classifier on a large corpus, it would be very useful to save it to disk for later use. You can do just that with the appropriately named save and load functions:
+
+ ```ruby
+ classifier.save('testclassifier')
+
+ #...some time later
+
+ newclassifier = Yanbi::Bayes.load('testclassifier')
+ ```
+
+ Note that a .obj extension is added to saved classifiers by default - there's no need to include it explicitly.
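+
+ Load through the class you expect to get back: load raises a LoadError if the saved object isn't an instance of the class it's called on. So a Fisher classifier round-trips like this (a short sketch; the 'testfisher' name is just an example):
+
+ ```ruby
+ fisher = Yanbi::Fisher.default(:even, :odd)
+ fisher.save('testfisher')     #writes testfisher.obj
+
+ restored = Yanbi::Fisher.load('testfisher')
+ ```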
+
  ## Putting it all together
 
  ```ruby
@@ -176,11 +229,43 @@ other.add_file('biglistofotherstuff.txt', '@@@@')
 
  stuff.each_doc {|d| classifier.train(:stuff, d)}
  otherstuff.each_doc {|d| classifier.train(:otherstuff, d)}
+
+ #...classify all the things....
+ ```
+
+ A slightly fancier example:
+
+ ```ruby
+
+ STOP_WORDS = %w(in the a and at of)
+
+ #classify using stemmed words
+ classifier = Yanbi::Bayes.new(Yanbi::StemmedWordBag, :stuff, :otherstuff)
+
+ #create our corpora
+ stuff = Yanbi::Corpus.new(Yanbi::StemmedWordBag)
+ stuff.add_file('biglistofstuff.txt', '****')
+
+ other = Yanbi::Corpus.new(Yanbi::StemmedWordBag)
+ other.add_file('biglistofotherstuff.txt', '@@@@')
+
+ #get rid of those nasty stop words
+ stuff.each_doc {|d| d.remove(STOP_WORDS)}
+ other.each_doc {|d| d.remove(STOP_WORDS)}
+
+ #train away!
+ stuff.each_doc {|d| classifier.train(:stuff, d)}
+ other.each_doc {|d| classifier.train(:otherstuff, d)}
+
+ #get rid of the long tail
+ classifier.set_significance(50)
+
+ #...classify all the things....
  ```
 
  ## Contributing
 
- Bug reports and pull requests are welcome on GitHub at https://github.com/rdormer/yanbi-ml.
+ Bug reports, corrections of any tragic mathematical misunderstandings, and pull requests are welcome on GitHub at https://github.com/rdormer/yanbi-ml.
 
 
  ## License
@@ -34,6 +34,18 @@ module Yanbi
    self.new(WordBag, *categories)
  end
 
+ def self.load(fname)
+   c = YAML::load(File.read(fname + ".obj"))
+   raise LoadError unless c.is_a? self
+   c
+ end
+
+ def save(name)
+   File.open(name + ".obj", 'w') do |out|
+     YAML.dump(self, out)
+   end
+ end
+
  def train(category, document)
    cat = category.to_sym
    @document_counts[cat] += 1
@@ -69,13 +81,7 @@ module Yanbi
  def newdoc(doc)
    Yanbi.const_get(@bag_class).new(doc)
  end
-
- def save(name)
-   File.open(name + ".obj", 'w') do |out|
-     YAML.dump(self, out)
-   end
- end
-
+
  private
 
  def cond_prob(cat, document)
@@ -102,11 +108,6 @@ module Yanbi
    @categories[i]
  end
 
- # def weighted_prob(word, category, basicprob, weight=1.0, ap=0.5)
- #   #basicprob = word_prob(category, word) if basicprob.nil?
- #   totals = @category_counts.inject(0) {|sum, cat| sum += cat.last[word].to_i}
- #   ((weight * ap) + (totals*basicprob)) / (weight + totals)
- # end
  end
 
  end
@@ -1,6 +1,10 @@
+ # Author:: Robert Dormer (mailto:rdormer@gmail.com)
+ # Copyright:: Copyright (c) 2016 Robert Dormer
+ # License:: MIT
+
  module Yanbi
 
- class Fisher < Yanbi::Bayes
+ class Fisher < Bayes
 
    def classify(text)
      max_score(text) do |cat, doc|
@@ -12,36 +16,31 @@ module Yanbi
 
  def fisher_score(category, document)
    features = document.words.uniq
-   pscores = 1
-
-
-   ###
-   #compute weighted probabilities for each word/cat tuple
-   #and then multiply them all together...
-   ##
-
-
-
-   features.each do |word|
-     clf = word_prob(category, word)
-     freqsum = @categories.reduce(0) {|sum, x| sum + word_prob(x, word)}
-     pscores *= (clf / freqsum) if clf > 0
-   end
-
-   #####
-
-
-   #compute fisher factor of pscores
+   probs = features.map {|x| weighted_prob(x, category)}
+   pscores = probs.reduce(&:*)
    score = -2 * Math.log(pscores)
-
-   #this is okay
    invchi2(score, features.count * 2)
  end
-
+
+ def category_prob(cat, word)
+   wp = word_prob(cat, word)
+   sum = @categories.inject(0) {|s,c| s + word_prob(c, word)}
+   return 0 if sum.zero?
+   wp / sum
+ end
+
  def word_prob(cat, word)
-   @category_counts[cat][word].to_f / @document_counts[cat]
+   all_word_count = @category_counts[cat].values.reduce(&:+)
+   count = @category_counts[cat].has_key?(word) ? @category_counts[cat][word].to_f : 0
+   count / all_word_count
  end
-
+
+ def weighted_prob(word, category, basicprob=nil, weight=1.0, ap=0.5)
+   basicprob = category_prob(category, word)
+   totals = @category_counts.inject(0) {|sum, cat| sum += cat.last[word].to_i}
+   ((weight * ap) + (totals*basicprob)) / (weight + totals)
+ end
+
  def invchi2(chi, df)
    m = chi / 2.0
    sum = Math.exp(-m)
@@ -3,5 +3,5 @@
  # License:: MIT
 
  module Yanbi
-   VERSION = "0.1.2"
+   VERSION = "0.2.0"
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: yanbi-ml
  version: !ruby/object:Gem::Version
-   version: 0.1.2
+   version: 0.2.0
  prerelease:
  platform: ruby
  authors:
@@ -9,7 +9,7 @@ authors:
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2016-07-05 00:00:00.000000000 Z
+ date: 2016-07-08 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: bundler