ankusa 0.0.14 → 0.0.15

Files changed (9)

data/Gemfile CHANGED

@@ -2,3 +2,5 @@ source "http://rubygems.org"
 # Specify your gem's dependencies in bandit.gemspec
 gemspec
+gem 'rake'

data/Gemfile.lock CHANGED

@@ -1,16 +1,18 @@
 PATH
   remote: .
   specs:
-    ankusa (0.0.14)
+    ankusa (0.0.15)
       fast-stemmer (>= 1.0.0)
 GEM
   remote: http://rubygems.org/
   specs:
     fast-stemmer (1.0.1)
+    rake (10.0.2)
 PLATFORMS
   ruby
 DEPENDENCIES
   ankusa!
+  rake

data/README.rdoc CHANGED

@@ -9,7 +9,7 @@ First, install HBase/Hadoop, Mongo, or Cassandra (>= 0.7.0-rc2).  Then, install
   gem install hbaserb
   # or
   gem install cassandra
-  # or
+  # or
   gem install mongo
 If you're using HBase, make sure the HBase Thrift interface has been started as well.  Then:
@@ -35,7 +35,7 @@ Using the naive Bayes classifier:
   # This will return the most likely class (as symbol)
   puts c.classify "This is some spammy text"
-  # This will return Hash with classes as keys and
+  # This will return Hash with classes as keys and
   # membership probability as values
   puts c.classifications "This is some spammy text"
@@ -54,13 +54,13 @@ Using the naive Bayes classifier:
 == KL Divergence Classifier
 There is a Kullback–Leibler divergence classifier as well.  KL divergence is a distance measure (though not a true metric because it does not satisfy the triangle inequality).  The KL classifier simply measures the relative entropy between the text you want to classify and each of the classes.  The class with the shortest "distance" is the best class.  You may find that for an especially large corpus it may be slightly faster to use this classifier (since prior probabilities are never calculated, only likelihoods).
-The API is the same as the NaiveBayesClassifier, except rather than calling "classifications" if you want actual numbers you call "distances".
+The API is the same as the NaiveBayesClassifier, except rather than calling "classifications" if you want actual numbers you call "distances".
   require 'rubygems'
   require 'ankusa'
   require 'ankusa/hbase_storage'
-  # connect to HBase
+  # connect to HBase
   storage = Ankusa::HBaseStorage.new 'localhost'
   c = Ankusa::KLDivergenceClassifier.new storage
@@ -72,7 +72,7 @@ The API is the same as the NaiveBayesClassifier, except rather than calling "cla
   # This will return the most likely class (as symbol)
   puts c.classify "This is some spammy text"
-  # This will return Hash with classes as keys and
+  # This will return Hash with classes as keys and
   # distances >= 0 as values
   puts c.distances "This is some spammy text"
@@ -104,13 +104,13 @@ HBase storage:
 For Cassandra storage:
 * You will need Cassandra version 0.7.0-rc2 or greater.
-* You will need to set a max number classes since current implementation of the Ruby Cassandra client doesn't support table scans.
+* You will need to set a max number classes since current implementation of the Ruby Cassandra client doesn't support table scans.
 * Prior to using the Cassandra storage you will need to run the following command from the cassandra-cli: "create keyspace ankusa with replication_factor = 1".  This should be fixed with a new release candidate for Cassandra.
 To use the Cassandra storage class:
   require 'ankusa/cassandra_storage'
   # defaults: host='127.0.0.1', port=9160, keyspace = 'ankusa', max_classes = 100
-  storage = Ankusa::HBaseStorage.new host, port, keyspace, max_classes
+  storage = Ankusa::CassandraStorage.new host, port, keyspace, max_classes
 For MongoDB storage:
   require 'ankusa/mongo_db_storage'
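
The README's description of the KL classifier (relative entropy between the text and each class, smallest "distance" wins) can be illustrated with a small standalone sketch. This is not ankusa's implementation: the `word_probs` and `kl_divergence` helpers, the toy word counts, and the add-one smoothing are all assumptions made for illustration.

```ruby
# Illustrative only: not ankusa's code. Helper names, toy data, and the
# add-one smoothing are assumptions.

# Turn raw word counts into a smoothed probability distribution over vocab.
def word_probs(counts, vocab)
  total = vocab.inject(0) { |sum, w| sum + counts.fetch(w, 0) + 1 }.to_f
  probs = {}
  vocab.each { |w| probs[w] = (counts.fetch(w, 0) + 1) / total }
  probs
end

# D(P || Q) = sum over words of P(w) * log(P(w) / Q(w)); always >= 0.
def kl_divergence(p, q)
  p.keys.inject(0.0) { |sum, w| sum + p[w] * Math.log(p[w] / q[w]) }
end

classes = {
  :spam => { "free" => 4, "money" => 3, "hello" => 1 },
  :good => { "hello" => 4, "meeting" => 3, "money" => 1 }
}
text  = { "free" => 2, "money" => 1 }
vocab = (classes.values.map { |c| c.keys }.flatten + text.keys).uniq

text_probs = word_probs(text, vocab)
distances = {}
classes.each { |name, counts|
  distances[name] = kl_divergence(text_probs, word_probs(counts, vocab))
}

# The class with the smallest distance is the best match.
best = distances.min_by { |_, d| d }.first
puts distances.inspect
puts best
```

Smoothing matters here because a zero probability in the class distribution would make the logarithm undefined for any word the class has never seen.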

data/Rakefile CHANGED

@@ -1,5 +1,7 @@
 require 'rubygems'
 require 'bundler'
+require 'rake/dsl_definition'
+require 'rake/alt_system'
 require 'rake/testtask'
 require 'rdoc/task'
@@ -23,7 +25,7 @@ Rake::TestTask.new("test_memory") { |t|
 desc "Run all unit tests with HBase storage"
 Rake::TestTask.new("test_hbase") { |t|
   t.libs += ["lib", "."]
-  t.test_files = FileList['test/hasher_test.rb']
+  t.test_files = FileList['test/hasher_test.rb', 'test/hbase_classifier_test.rb']
   t.verbose = true
 }

data/lib/ankusa/classifier.rb CHANGED

@@ -49,7 +49,7 @@ module Ankusa
       probs = Hash.new 0
       @storage.get_word_counts(word).each { |k,v| probs[k] = v if classnames.include? k }
       vs = vocab_sizes
-      classnames.each { |cn|
+      classnames.each { |cn|
         # if we've never seen the class, the word prob is 0
         next unless vs.has_key? cn

data/lib/ankusa/hbase_storage.rb CHANGED

@@ -26,7 +26,7 @@ module Ankusa
       drop_tables
       init_tables
     end
     def drop_tables
       freq_table.delete
       summary_table.delete
@@ -69,10 +69,12 @@ module Ankusa
         @klass_word_counts[klass] = summary_table.get(klass, "totals:wordcount").first.to_i64.to_f
       }
     end
     def get_doc_count(klass)
       @klass_doc_counts.fetch(klass) {
-        @klass_doc_counts[klass] = summary_table.get(klass, "totals:doccount").first.to_i64.to_f
+        totals = summary_table.get(klass, "totals:doccount")
+        totals = (totals.size === 0) ? 0 : totals.first.to_i64.to_f
+        @klass_doc_counts[klass] = totals
       }
     end
@@ -83,7 +85,7 @@ module Ankusa
       if size == count
         summary_table.atomic_increment klass, "totals:vocabsize"
       elsif size == 0
-        summary_table.atomic_increment klass, "totals:vocabsize", -1
+        summary_table.atomic_increment klass, "totals:vocabsize", -1
       end
       size
     end
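
The `get_doc_count` change above guards against a class that has no `totals:doccount` cell yet. A standalone sketch of that pattern follows. `FakeTable` and `doc_count` are hypothetical stand-ins (this is not hbaserb's API), and the cell is simplified to a plain integer rather than a value that needs `to_i64`.

```ruby
# Hypothetical stand-in for a storage table; not hbaserb's API.
class FakeTable
  def initialize(rows)
    @rows = rows
  end

  # Returns an array of cell values, empty when the row has none.
  def get(row, column)
    @rows.fetch(row, {}).fetch(column, [])
  end
end

# Memoized lookup that falls back to 0 instead of calling a method on nil.
def doc_count(table, cache, klass)
  cache.fetch(klass) {
    cells = table.get(klass, "totals:doccount")
    cache[klass] = cells.empty? ? 0 : cells.first.to_f
  }
end

table = FakeTable.new(:spam => { "totals:doccount" => [12] })
cache = {}
puts doc_count(table, cache, :spam)    # trained class
puts doc_count(table, cache, :unseen)  # class with no stored count
```

Without the emptiness check, `cells.first` would be nil for an unseen class and the numeric conversion would raise a NoMethodError.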

data/lib/ankusa/naive_bayes.rb CHANGED

@@ -6,9 +6,16 @@ module Ankusa
     def classify(text, classes=nil)
       # return the most probable class
-      log_likelihoods(text, classes).sort_by { |c| -c[1] }.first.first
+      result = log_likelihoods(text, classes)
+      if result.values.uniq.size == 1
+        # if all classes are equally likely, return nil
+        return nil
+      else
+        result.sort_by { |c| -c[1] }.first.first
+      end
     end
     # Classes is an array of classes to look at
     def classifications(text, classnames=nil)
       result = log_likelihoods text, classnames
@@ -17,8 +24,10 @@ module Ankusa
       }
       # normalize to get probs
-      sum = result.values.inject { |x,y| x+y }
-      result.keys.each { |k| result[k] = result[k] / sum }
+      sum = result.values.inject{ |x,y| x+y }
+      result.keys.each { |k|
+        result[k] = result[k] / sum
+        } unless sum.zero?
       result
     end
@@ -29,7 +38,7 @@ module Ankusa
       TextHash.new(text).each { |word, count|
         probs = get_word_probs(word, classnames)
-        classnames.each { |k|
+        classnames.each { |k|
           # log likelihood should be negative infinity if we've never seen the klass
           result[k] += probs[k] > 0 ? (Math.log(probs[k]) * count) : -INFTY
         }
@@ -37,9 +46,11 @@ module Ankusa
       # add the prior
       doc_counts = doc_count_totals.select { |k,v| classnames.include? k }.map { |k,v| v }
-      doc_count_total = (doc_counts.inject { |x,y| x+y } + classnames.length).to_f
-      classnames.each { |k|
-        result[k] += Math.log((@storage.get_doc_count(k) + 1).to_f / doc_count_total)
+      doc_count_total = (doc_counts.inject(0){ |x,y| x+y } + classnames.length).to_f
+      classnames.each { |k|
+        result[k] += Math.log((@storage.get_doc_count(k) + 1).to_f / doc_count_total)
       }
       result
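
Three of the `naive_bayes.rb` changes guard degenerate inputs: `classify` returns nil when every class scores the same, normalization is skipped when the sum is zero, and `inject(0)` avoids a nil result on an empty list. A minimal standalone sketch of those guards, in which `pick_class` and `normalize` are hypothetical helpers rather than ankusa methods:

```ruby
# Hypothetical helpers illustrating the guards; not ankusa's methods.

# Return the highest-scoring class, or nil when every class scores the same.
def pick_class(scores)
  return nil if scores.values.uniq.size == 1
  scores.max_by { |_, score| score }.first
end

# Scale values so they sum to 1, leaving them untouched when the sum is zero.
def normalize(scores)
  sum = scores.values.inject(0.0) { |x, y| x + y }
  return scores if sum.zero?
  result = {}
  scores.each { |k, v| result[k] = v / sum }
  result
end

puts pick_class(:spam => -3.2, :good => -7.9).inspect
puts pick_class(:spam => -5.0, :good => -5.0).inspect
puts normalize(:spam => 1.0, :good => 3.0).inspect
puts normalize(:spam => 0.0, :good => 0.0).inspect

# Without a seed value, inject on an empty array returns nil.
puts [].inject { |x, y| x + y }.inspect
puts [].inject(0) { |x, y| x + y }.inspect
```

Note that a tie check based on `uniq.size == 1` also returns nil when only one class is present, since a single score is trivially "all equal".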

data/lib/ankusa/version.rb CHANGED

@@ -1,3 +1,3 @@
 module Ankusa
-  VERSION = "0.0.14"
+  VERSION = "0.0.15"
 end

metadata CHANGED

@@ -1,33 +1,37 @@
---- !ruby/object:Gem::Specification
+--- !ruby/object:Gem::Specification
 name: ankusa
-version: !ruby/object:Gem::Version
-  version: 0.0.14
+version: !ruby/object:Gem::Version
   prerelease:
+  version: 0.0.15
 platform: ruby
-authors:
+authors:
 - Brian Muller
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-08-15 00:00:00.000000000Z
-dependencies:
-- !ruby/object:Gem::Dependency
+date: 2012-11-30 00:00:00 Z
+dependencies:
+- !ruby/object:Gem::Dependency
   name: fast-stemmer
-  requirement: &70197701802820 !ruby/object:Gem::Requirement
+  prerelease: false
+  requirement: &id001 !ruby/object:Gem::Requirement
     none: false
-    requirements:
-    - - ! '>='
-      - !ruby/object:Gem::Version
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
         version: 1.0.0
   type: :runtime
-  prerelease: false
-  version_requirements: *70197701802820
+  version_requirements: *id001
 description: Text classifier with HBase, Cassandra, or Mongo storage
 email: brian.muller@livingsocial.com
 executables: []
 extensions: []
 extra_rdoc_files: []
-files:
+files:
 - lib/ankusa/cassandra_storage.rb
 - lib/ankusa/classifier.rb
 - lib/ankusa/extensions.rb
@@ -48,34 +52,30 @@ files:
 - README.rdoc
 homepage: https://github.com/livingsocial/ankusa
 licenses: []
 post_install_message:
 rdoc_options: []
-require_paths:
+require_paths:
 - lib
-required_ruby_version: !ruby/object:Gem::Requirement
+required_ruby_version: !ruby/object:Gem::Requirement
   none: false
-  requirements:
-  - - ! '>='
-    - !ruby/object:Gem::Version
-      version: '0'
-      segments:
-      - 0
-      hash: 2837888903817045284
-required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: "0"
+required_rubygems_version: !ruby/object:Gem::Requirement
   none: false
-  requirements:
-  - - ! '>='
-    - !ruby/object:Gem::Version
-      version: '0'
-      segments:
-      - 0
-      hash: 2837888903817045284
-requirements:
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: "0"
+requirements:
 - Either hbaserb >= 0.0.3 or cassandra >= 0.7
 rubyforge_project: ankusa
-rubygems_version: 1.8.17
+rubygems_version: 1.8.24
 signing_key:
 specification_version: 3
-summary: Text classifier in Ruby that uses Hadoop's HBase, Cassandra, or Mongo for
-  storage
+summary: Text classifier in Ruby that uses Hadoop's HBase, Cassandra, or Mongo for storage
 test_files: []