RubyGems - simple_naive_bayes - Versions diffs - 0.0.2 - Mend

simple_naive_bayes 0.0.2

Files changed (12)

checksums.yaml +7 -0
data/.gitignore +17 -0
data/Gemfile +4 -0
data/LICENSE.txt +22 -0
data/README.md +61 -0
data/Rakefile +1 -0
data/example/example.rb +19 -0
data/example/publiccorpus_test.rb +155 -0
data/lib/simple_naive_bayes/version.rb +3 -0
data/lib/simple_naive_bayes.rb +116 -0
data/simple_naive_bayes.gemspec +23 -0
metadata +82 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: fe3a97c568b49df1ee13a85badfa0c106e800f84
+  data.tar.gz: 3ce26714bcbbd3cd914dd96b996447e6c0150db4
+SHA512:
+  metadata.gz: 652a9f57aa077f89e7d3611649630684bf4a49e6236e886685952d433d2081ba65a2d7c78fa8d1d0085f458378e30ed46a6e3e64e82659af10e78c39b8433151
+  data.tar.gz: d313812706be35cedaa51cbe93b3db5d41fbb6eb1659d733ec1b4556a7732ac14777bc56c731b269260537112a4a4aaf4746738bc96cd82bc51fd4475a5712d4

data/.gitignore ADDED

@@ -0,0 +1,17 @@
+*.gem
+*.rbc
+.bundle
+.config
+.yardoc
+Gemfile.lock
+InstalledFiles
+_yardoc
+coverage
+doc/
+lib/bundler/man
+pkg
+rdoc
+spec/reports
+test/tmp
+test/version_tmp
+tmp

data/Gemfile ADDED

@@ -0,0 +1,4 @@
+source 'https://rubygems.org'
+# Specify your gem's dependencies in simple_naive_bayes.gemspec
+gemspec

data/LICENSE.txt ADDED

@@ -0,0 +1,22 @@
+Copyright (c) 2013 y42sora
+MIT License
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,61 @@
+# SimpleNaiveBayes
+This is a very simple naive Bayes classifier written in Ruby.
+## Installation
+    $ gem install simple_naive_bayes
+## Usage
+```ruby
+require 'simple_naive_bayes'
+cl = SimpleNaiveBayes::NaiveBayes.new
+cl.training("yes", ["Chinese", "Beijing", "Chinese"])
+cl.training("yes", ["Chinese", "Chinese", "Shanghai"])
+cl.training("yes", ["Chinese", "Macao"])
+cl.training("no", ["Tokyo", "Japan", "Chinese"])
+cl.classify(["Tokyo"])
+```
+See example/example.rb for a complete example.
+## Supported Ruby Versions
+Ruby 2.0.0
+## Performance
+To measure the performance of the filter, I ran a test.
+The data source is the SpamAssassin public corpus (http://spamassassin.apache.org/publiccorpus/).
+The corpus consists of email messages, so the test classifies mail.
+Each message is one of three types: spam, easy_ham, or hard_ham.
+The test script is example/publiccorpus_test.rb.
+### Data sources
+#### Training Data
+* http://spamassassin.apache.org/publiccorpus/20021010_easy_ham.tar.bz2
+* http://spamassassin.apache.org/publiccorpus/20021010_hard_ham.tar.bz2
+* http://spamassassin.apache.org/publiccorpus/20021010_spam.tar.bz2
+#### Test Data
+* http://spamassassin.apache.org/publiccorpus/20030228_easy_ham.tar.bz2
+* http://spamassassin.apache.org/publiccorpus/20030228_hard_ham.tar.bz2
+* http://spamassassin.apache.org/publiccorpus/20030228_spam.tar.bz2
+### Result
+* spam accuracy rate is 99.6% (498/500)
+* easy ham accuracy rate is 99.88% (2497/2500)
+* hard ham accuracy rate is 81.6% (204/250)
+## License
+MIT License
+## Contributing
+1. Fork it
+2. Create your feature branch (`git checkout -b my-new-feature`)
+3. Commit your changes (`git commit -am 'Add some feature'`)
+4. Push to the branch (`git push origin my-new-feature`)
+5. Create new Pull Request
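
Note on the Result list in README.md: the percentages follow directly from the correct/total counts given in parentheses, which also appear in the output recorded at the end of example/publiccorpus_test.rb. The short sketch below recomputes them to two decimal places. The helper name `accuracy_percent` is an assumption made for illustration and is not part of the gem.

```ruby
# Recompute the accuracy percentages from the correct/total counts.
# accuracy_percent is a hypothetical helper, not part of simple_naive_bayes.
def accuracy_percent(correct, total)
  (100.0 * correct / total).round(2)
end

# Counts as listed in the README and in the script's recorded output.
[["spam", 498, 500], ["easy ham", 2497, 2500], ["hard ham", 204, 250]].each do |label, correct, total|
  puts "#{label}: #{accuracy_percent(correct, total)}% (#{correct}/#{total})"
end
```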

data/Rakefile ADDED

@@ -0,0 +1 @@
+require "bundler/gem_tasks"

data/example/example.rb ADDED Viewed

@@ -0,0 +1,19 @@
+require 'simple_naive_bayes'
+cl = SimpleNaiveBayes::NaiveBayes.new
+data = [
+  ["yes", ["Chinese", "Beijing", "Chinese"]],
+  ["yes", ["Chinese", "Chinese", "Shanghai"]],
+  ["yes", ["Chinese", "Macao"]],
+  ["no", ["Tokyo", "Japan", "Chinese"]]
+]
+data.each do |cat, doc|
+  cl.training(cat, doc)
+end
+test = ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]
+p cl.classify(test)
+p cl.classify_with_all_result(test)
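
Note on example/example.rb: as a hand check, the scores that `classify_with_all_result` should return for this test document can be recomputed from the counts left behind by the four training calls. This is a sketch based on reading lib/simple_naive_bayes.rb as shown in this diff (additive value 0.5, base-2 logarithms), not output captured from the gem. The helper name `log2_score` and its parameters are assumptions made for illustration.

```ruby
# Hand check of the example scores.
# Assumption: the arithmetic mirrors lib/simple_naive_bayes.rb in this diff.
# log2_score is a hypothetical helper, not part of the gem.
#
# prior:       [documents in the category, documents overall]
# word_counts: T(cat, word) for each word of the test document
# denominator: smoothed word total stored for the category
def log2_score(prior, word_counts, denominator, additive = 0.5)
  log_prior = Math.log2(prior[0]) - Math.log2(prior[1])
  log_words = word_counts.map do |count|
    Math.log2(count + additive) - Math.log2(denominator)
  end
  log_prior + log_words.reduce(0.0, :+)
end

# Test document: Chinese, Chinese, Chinese, Tokyo, Japan
# "yes": Chinese was seen 5 times, Tokyo and Japan never.
#        The stored denominator is 8 + 4 * 0.5 = 10 (vocabulary size was 4
#        when "yes" was last trained).
yes_score = log2_score([3, 4], [5, 5, 5, 0, 0], 10.0)
# "no": each word was seen once; the denominator is 3 + 6 * 0.5 = 6.
no_score = log2_score([1, 4], [1, 1, 1, 1, 1], 6.0)

puts "yes: #{yes_score.round(3)}"
puts "no:  #{no_score.round(3)}"
```

Under these assumptions the "yes" score comes out slightly higher than the "no" score, so `classify` would be expected to return "yes" for this document.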

data/example/publiccorpus_test.rb ADDED

@@ -0,0 +1,155 @@
+# -*- coding: utf-8 -*-
+# test script for http://spamassassin.apache.org/publiccorpus/
+require 'find'
+require 'simple_naive_bayes'
+train_spam_folder = "20021010/spam"
+train_ham_folder = "20021010/easy_ham"
+train_hard_ham_folder = "20021010/hard_ham"
+test_spam_folder = "20030228/spam"
+test_ham_folder = "20030228/easy_ham"
+test_hard_ham_folder = "20030228/hard_ham"
+@header_regxp = /[\w-]*: .*/
+@nb_classifier = SimpleNaiveBayes::NaiveBayes.new
+# Strip the mail header and return only the body.
+# The body starts at the first non-blank line that follows a blank line and is not a header field.
+# It expects a header like this:
+# X-Original-Date: Wed, 4 Dec 2002 11:54:45 +0000
+# Date: Wed, 4 Dec 2002 11:54:45 +0000
+#
+#
+# Hi,
+# I think you need to give us a little more detailed information.
+# ...
+def get_context_from_file(filepath)
+  context = []
+  end_header = false
+  before_line = "before"
+  File.open(filepath) {|f|
+    f.each {|line|
+      line = line.encode("UTF-16BE", :invalid => :replace, :undef => :replace, :replace => '?').encode("UTF-8")
+      line = line.chomp
+      if before_line.empty?  and not line.empty? and not @header_regxp.match(line)
+        end_header = true
+      end
+      context << line if end_header
+      before_line = line
+    }
+  }
+  context.join(" ")
+end
+# Split the context string into a list of words.
+# Returns an array like [word1, word2, word3].
+# Strips one trailing punctuation mark and drops words shorter than 3 characters.
+def get_word_from_context(context)
+  words = []
+  context.split(" ").each do |word|
+    if word[-1] == "." or word[-1] == "," or
+        word[-1] == "?" or word[-1] == "!" or
+        word[-1] == ":"
+      word = word[0..-2]
+    end
+    words << word unless word.size < 3
+  end
+  words
+end
+def train_data_from_file(category, filepath)
+  context = get_context_from_file(filepath)
+  words = get_word_from_context(context)
+  @nb_classifier.training(category, words)
+end
+def train_data_from_folder(category, folder)
+  all_num = 0
+  t0 = Time.now
+  Find.find(folder) do |filepath|
+    if File::ftype(filepath) == "file"
+      train_data_from_file(category, filepath)
+      all_num += 1
+    end
+  end
+  t1 = Time.now
+  puts "training #{category} #{t1 - t0} sec and #{all_num} file"
+end
+# Classify every file in the folder and return [total count, correct count].
+def check_data_from_folder(category, folder)
+  correct_num = 0
+  all_num = 0
+  t0 = Time.now
+  Find.find(folder) do |filepath|
+    if File::ftype(filepath) == "file"
+      context = get_context_from_file(filepath)
+      words = get_word_from_context(context)
+      correct_num += 1 if category == @nb_classifier.classify(words)
+      all_num += 1
+    end
+  end
+  t1 = Time.now
+  puts "check #{category} #{t1 - t0} sec"
+  [all_num, correct_num]
+end
+train_data_from_folder("spam", train_spam_folder)
+train_data_from_folder("ham", train_ham_folder)
+train_data_from_folder("hard", train_hard_ham_folder)
+puts "----check spam----"
+ans = check_data_from_folder("spam", test_spam_folder)
+puts "spam rate is " + (ans[1].to_f / ans[0]).to_s
+puts "all #{ans[0]} correct #{ans[1]}"
+puts "----check ham----"
+ans = check_data_from_folder("ham", test_ham_folder)
+puts "ham rate is " + (ans[1].to_f / ans[0]).to_s
+puts "all #{ans[0]} correct #{ans[1]}"
+puts "----check hard_ham----"
+ans = check_data_from_folder("hard", test_hard_ham_folder)
+puts "hard ham rate is " + (ans[1].to_f / ans[0]).to_s
+puts "all #{ans[0]} correct #{ans[1]}"
+=begin
+training spam 2.337407645 sec and 501 file
+training ham 7.85665402 sec and 2551 file
+training hard 6.014818518 sec and 250 file
+----check spam----
+check spam 4.681404607 sec
+spam rate is 0.996
+all 500 correct 498
+----check ham----
+check ham 11.444270327 sec
+ham rate is 0.9988
+all 2500 correct 2497
+----check hard_ham----
+check hard 8.78753183 sec
+hard ham rate is 0.816
+all 250 correct 204
+=end
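
Note on example/publiccorpus_test.rb: the word-splitting step in `get_word_from_context` can be written more compactly. The sketch below is intended to behave the same way (split on spaces, strip a single trailing `.`, `,`, `?`, `!` or `:`, drop words shorter than three characters). The method name `tokenize` is hypothetical and the code is an illustration, not a replacement shipped with the gem.

```ruby
# Compact sketch of the tokenizing step used before training and classifying.
# tokenize is a hypothetical name; the script's method is get_word_from_context.
def tokenize(text)
  text.split(" ")
      .map { |word| word.sub(/[.,?!:]\z/, "") } # strip one trailing punctuation mark
      .reject { |word| word.size < 3 }          # drop very short words
end

p tokenize("Hi, I think you need to give us a little more detailed information.")
```

Only one trailing character is removed per word, so a token such as `ok...` keeps two of its dots, as in the original loop.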

data/lib/simple_naive_bayes/version.rb ADDED

@@ -0,0 +1,3 @@
+module SimpleNaiveBayes
+  VERSION = "0.0.2"
+end

data/lib/simple_naive_bayes.rb ADDED

@@ -0,0 +1,116 @@
+require "simple_naive_bayes/version"
+require 'set'
+module SimpleNaiveBayes
+    class NaiveBayes
+=begin
+    P(cat|doc) = P(doc|cat) * P(cat) / P(doc)
+    P(doc) is the same for every category, so it can be ignored.
+    P(cat) = @categories_count[cat] / @all_category_num
+    P(doc|cat) = P(word1|cat) * P(word2|cat) * ...
+    P(word1|cat) = T(cat, word1) / (T(cat, word1) + T(cat, word2) + ...)
+    T(cat, word1) = @categories_word[cat][word1]
+    (T(cat, word1) + T(cat, word2) + ...) = sum(T(cat, word)) = @categories_all_word_count[cat]
+    Additive smoothing (a = @additive)
+    P(word1|cat) = (T(cat, word1) + a) / sum(T(cat, word) + a)
+    sum(T(cat, word) + a) = sum(T(cat, word)) + @all_word_set.length() * a = @laplace_categories_all_word_count[cat]
+    arg max P(cat|doc) = arg max log(P(cat|doc))
+    log(P(cat|doc)) = log(P(doc|cat)) + log(P(cat)) + const
+    log(P(cat)) = log(@categories_count[cat]) - log(@all_category_num)
+    log(P(doc|cat)) = log(P(word1|cat)) + log(P(word2|cat)) + ...
+    log(P(word1|cat)) = log(T(cat, word1) + a) - log(sum(T(cat, word) + a))
+    Reference: http://aidiary.hatenablog.com/entry/20100613/1276389337
+=end
+      def initialize()
+        @all_category_num = 0
+        @all_word_set = Set.new
+        @categories_count = Hash.new(0)
+        @categories_word = Hash.new
+        @categories_all_word_count = Hash.new(0)
+        @laplace_categories_all_word_count = Hash.new(0)
+        @additive = 0.5
+      end
+      # Record one training document for the given category.
+      # doc = [word1, word2, word3...]
+      def training(category, doc)
+        @categories_count[category] += 1
+        @all_category_num += 1
+        @categories_word[category] = Hash.new(0) unless @categories_word.key?(category)
+        doc.each do |word|
+          @all_word_set.add(word)
+          @categories_word[category][word] += 1
+          @categories_all_word_count[category] += 1
+        end
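+        # sum(T(cat, word) + a), additive smoothing, refreshed for this category only.
+        # Other categories keep the value from their own last training call.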
+        @laplace_categories_all_word_count[category] = @categories_all_word_count[category] + @all_word_set.length() * @additive
+      end
+      # classify and return best category
+      def classify(doc)
+        result = classify_with_all_result(doc)
+        best = result.max_by { |classify_result| classify_result[1] }
+        best[0]
+      end
+      # classify and return the log2 score of every category
+      # each score is log(P(cat|doc)) up to a constant
+      # return [ [category1, score1], [category2, score2]... ]
+      def classify_with_all_result(doc)
+        result = []
+        @categories_count.keys().each do |category|
+          # log(P(doc|cat))
+          document_category = calc_document_category(doc, category)
+          # log(P(cat)) =  log(@categories_count[cat]) - log( @all_category_num )
+          category_probability = Math.log2(@categories_count[category]) - Math.log2(@all_category_num)
+          # log(P(cat|doc)) = log(P(doc|cat)) + log(P(cat))
+          category_document_probability = document_category + category_probability
+          result << [category, category_document_probability]
+        end
+        result
+      end
+      # log(P(doc|cat)) = log(P(word1|cat)) + log(P(word2|cat)) + ....
+      def calc_document_category(doc, category)
+        probability = 0
+        # log(P(word1|cat)) + log(P(word2|cat)) + ....
+        doc.each do |word|
+          # log(T(cat, word1) + a)
+          # Additive smoothing (a = @additive)
+          category_word = Math.log2(@categories_word[category][word] + @additive)
+          # log(sum(T(cat, word) + a))
+          all_category_word = Math.log2(@laplace_categories_all_word_count[category])
+          # log(P(word1|cat)) = log(T(cat, word1) + a) - log(sum(T(cat, word) + a))
+          prob = category_word - all_category_word
+          probability += prob if prob.finite?
+        end
+        probability
+      end
+    end
+end
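
Note on lib/simple_naive_bayes.rb: in `training`, `@laplace_categories_all_word_count` is updated only for the category being trained, so a category trained earlier keeps a denominator based on the vocabulary size at that time. One possible alternative is to derive the denominator when classifying, so that every category uses the current vocabulary size. The sketch below illustrates that idea under stated assumptions: the class name `SmoothedNaiveBayes` and its instance variables are hypothetical, and this is not the gem's implementation.

```ruby
require 'set'

# Minimal sketch, not the gem's code. SmoothedNaiveBayes is a hypothetical
# class that computes the smoothed denominator at classification time.
class SmoothedNaiveBayes
  def initialize(additive = 0.5)
    @additive = additive
    @vocabulary = Set.new
    @doc_counts = Hash.new(0)                             # documents per category
    @word_counts = Hash.new { |h, k| h[k] = Hash.new(0) } # T(cat, word)
    @word_totals = Hash.new(0)                            # sum(T(cat, word))
  end

  # doc = [word1, word2, word3...]
  def training(category, doc)
    @doc_counts[category] += 1
    doc.each do |word|
      @vocabulary.add(word)
      @word_counts[category][word] += 1
      @word_totals[category] += 1
    end
  end

  # Returns [[category, log2 score], ...]
  def classify_with_all_result(doc)
    total_docs = @doc_counts.values.reduce(0, :+)
    @doc_counts.keys.map do |category|
      # sum(T(cat, word) + a), using the vocabulary as it is now
      denominator = @word_totals[category] + @vocabulary.size * @additive
      score = Math.log2(@doc_counts[category]) - Math.log2(total_docs)
      doc.each do |word|
        score += Math.log2(@word_counts[category][word] + @additive) - Math.log2(denominator)
      end
      [category, score]
    end
  end

  # Returns the category with the highest score.
  def classify(doc)
    classify_with_all_result(doc).max_by { |_, score| score }.first
  end
end

cl = SmoothedNaiveBayes.new
cl.training("yes", ["Chinese", "Beijing", "Chinese"])
cl.training("yes", ["Chinese", "Chinese", "Shanghai"])
cl.training("yes", ["Chinese", "Macao"])
cl.training("no", ["Tokyo", "Japan", "Chinese"])
p cl.classify(["Tokyo"])
```

Because the denominators differ, scores from this sketch will not match the 0.0.2 code exactly, and on very small training sets the winning category can differ as well.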

data/simple_naive_bayes.gemspec ADDED

@@ -0,0 +1,23 @@
+# coding: utf-8
+lib = File.expand_path('../lib', __FILE__)
+$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+require 'simple_naive_bayes/version'
+Gem::Specification.new do |spec|
+  spec.name          = "simple_naive_bayes"
+  spec.version       = SimpleNaiveBayes::VERSION
+  spec.authors       = ["y42sora"]
+  spec.email         = ["y42sora@y42sora.com"]
+  spec.description   = %q{Simple pure ruby naive bayes}
+  spec.summary       = %q{Simple pure ruby naive bayes}
+  spec.homepage      = "https://github.com/y42sora/simple_naive_bayes"
+  spec.license       = "MIT"
+  spec.files         = `git ls-files`.split($/)
+  spec.executables   = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
+  spec.test_files    = spec.files.grep(%r{^(test|spec|features)/})
+  spec.require_paths = ["lib"]
+  spec.add_development_dependency "bundler", "~> 1.3"
+  spec.add_development_dependency "rake"
+end

metadata ADDED

@@ -0,0 +1,82 @@
+--- !ruby/object:Gem::Specification
+name: simple_naive_bayes
+version: !ruby/object:Gem::Version
+  version: 0.0.2
+platform: ruby
+authors:
+- y42sora
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2013-08-17 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: bundler
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.3'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.3'
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+description: Simple pure ruby naive bayes
+email:
+- y42sora@y42sora.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- .gitignore
+- Gemfile
+- LICENSE.txt
+- README.md
+- Rakefile
+- example/example.rb
+- example/publiccorpus_test.rb
+- lib/simple_naive_bayes.rb
+- lib/simple_naive_bayes/version.rb
+- simple_naive_bayes.gemspec
+homepage: https://github.com/y42sora/simple_naive_bayes
+licenses:
+- MIT
+metadata: {}
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 2.0.0
+signing_key:
+specification_version: 4
+summary: Simple pure ruby naive bayes
+test_files: []