luisparravicini-classifier 1.3.9 → 1.4.0

Files changed (7)

data/README.rdoc +11 -2
data/VERSION.yml +2 -2
data/lib/classifier/base.rb +1 -1
data/lib/classifier/stopwords.rb +21 -9
data/luisparravicini-classifier.gemspec +1 -1
data/test/stopwords_test.rb +22 -5
metadata +1 -1

data/README.rdoc CHANGED

@@ -9,8 +9,8 @@ rb-gsl:: http://rb-gsl.rubyforge.org
 Notice that LSI will work without these libraries, but as soon as they are installed, Classifier will make use of them. No configuration changes are needed, we like to keep things ridiculously easy for you.
 == Changes in this branch
-I made this branch to fix a TypeError on untrain (classifier-1.3.1), then francois[http://github.com/francois/classifier/] branch for jeweler and all the changes yuri[http://github.com/yury/classifier/] made on his branch (specially the use of ruby-stemmer, and the incompatibility fix on Array#sum, which I needed).
-After that I added support for loading the stopwords of certain language from a file (before the list was embedded on the source code) and a stopword list for Spanish.
+I made this branch to fix a TypeError on untrain (classifier-1.3.1), then merged francois[http://github.com/francois/classifier/] branch for jeweler and all the changes yuri[http://github.com/yury/classifier/] made on his branch (specially the use of ruby-stemmer, and the incompatibility fix on Array#sum, which I needed).
+After that I added support for loading the stopwords of a certain language from a file (before the list was embedded on the source code) and added a stopword list for Spanish.
 This branch only works with Ruby 1.9
 == Bayes
@@ -46,6 +46,15 @@ The default values are 'en' for language and 'UTF-8' for the encoding.
 Each language uses a word list to exclude certain words (stopwords). classifier comes with three included stopword lists, for English, Russian and Spanish.
 The English list is the list that comes with the original gem (don't know where it was taken from) and the Russian and Spanish are from snowball[http://snowball.tartarus.org/algorithms/].
+You can override the default stopword list, or add lists for new languages sending a value for :lang_dir when initializing Bayes:
+	  b = Classifier::Bayes.new :categories => ['Interesting', 'Uninteresting'],
+      :lang_dir => '/home/xrm0/stopwords'
+This directory is used when loading the list from disk and takes precedence over the default directory in lib/classifier/stopwords. Each file is named after the language (using the two letter code).
+The stopwords file can have comments (indicated with '#'), blank lines are ignored and the encoding must be utf-8.
 === Bayesian Classification
 * http://www.process.com/precisemail/bayesian_filtering.htm
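
Note on the stopword file format added to the README above: the parsing loop itself is not visible in this diff, so the following is only a minimal sketch of a reader that follows the README's rules (UTF-8, '#' comments, blank lines ignored). `read_stopwords` and `SAMPLE_WORDS` are hypothetical names, not part of the gem, and treating a '#' after a word as a trailing comment is an assumption.

```ruby
require 'tmpdir'

# Hypothetical helper, not part of the gem: reads a stopword file following
# the rules the README states (UTF-8, '#' comments, blank lines ignored).
# Assumption: a '#' after a word starts a trailing comment.
def read_stopwords(path)
  words = []
  File.open(path, 'r:utf-8') do |f|
    f.each_line do |line|
      line = line.sub(/#.*/, '').strip  # drop the comment, then whitespace
      words << line unless line.empty?  # skip blank and comment-only lines
    end
  end
  words
end

# Write a small sample file named after the two-letter language code,
# as the README requires, and read it back.
SAMPLE_WORDS = Dir.mktmpdir do |dir|
  path = File.join(dir, 'es')
  File.write(path, "# Spanish stopwords\nde\n\nla  # article\nm\u00e1s\n",
             encoding: 'utf-8')
  read_stopwords(path)
end
```

A directory laid out like the temporary one above is what would be passed as :lang_dir.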

data/VERSION.yml CHANGED

@@ -1,5 +1,5 @@
 ---
 :major: 1
-:minor: 3
-:patch: 9
+:minor: 4
+:patch: 0
 :build:

data/lib/classifier/base.rb CHANGED

@@ -40,7 +40,7 @@ module Classifier
   	def word_hash_for_words(words)
   		d = Hash.new
-  		skip_words = SkipWords.for(@options[:language])
+  		skip_words = StopWords.for(@options[:language], @options[:lang_dir])
   		words.each do |word|
   			word = word.mb_chars.downcase.to_s if word =~ /[\w]+/
   			key = stemmer.stem(word).intern

data/lib/classifier/stopwords.rb CHANGED

@@ -1,18 +1,30 @@
 module Classifier
-  module SkipWords
+  module StopWords
-    def self.for(language)
-      unless SKIP_WORDS.has_key?(language)
-        SKIP_WORDS[language] = load_stopwords(language) || []
+    def self.for(language, lang_dir=nil)
+      unless STOP_WORDS.has_key?(language)
+        STOP_WORDS[language] = load_stopwords(language, lang_dir) || []
       end
-      SKIP_WORDS[language]
+      STOP_WORDS[language]
+    end
+    def self.reset
+      STOP_WORDS.clear
     end
     protected
-      def self.load_stopwords(language)
-        lang_file = File.join(File.dirname(__FILE__), 'stopwords', language)
+      def self.load_stopwords(language, lang_dir)
+        default_dir = File.join(File.dirname(__FILE__), 'stopwords')
+        load_file(language, lang_dir) || load_file(language, default_dir) || []
+      end
+      def self.load_file(language, lang_dir)
+        return if lang_dir.nil?
+        lang_file = File.join(lang_dir, language)
         if File.exist?(lang_file)
           data = []
           File.open(lang_file, 'r:utf-8') do |f|
@@ -21,10 +33,10 @@ module Classifier
               data << line unless line.empty?
             end
           end
-          data
+          data unless data.empty?
         end
       end
-    SKIP_WORDS = {}
+    STOP_WORDS = {}
   end
 end
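
Note on the lookup order in the hunk above: the custom directory is tried first, then the default one, and the result is cached per language only. Because the cache key does not include lang_dir, a later call for the same language with a different directory appears to return the cached list until reset is called. The standalone sketch below mirrors that behavior so it can run without the gem; `StopWordsSketch`, its explicit `default_dir` argument, and the simplified line reading are assumptions made for illustration.

```ruby
require 'tmpdir'

# Standalone sketch mirroring the lookup order in the diff; not the gem's code.
module StopWordsSketch
  CACHE = {}

  # default_dir is passed explicitly only so the sketch runs without the gem.
  def self.for(language, lang_dir, default_dir)
    unless CACHE.has_key?(language)
      CACHE[language] = load_file(language, lang_dir) ||
                        load_file(language, default_dir) || []
    end
    CACHE[language]
  end

  def self.reset
    CACHE.clear
  end

  # Returns nil when dir is nil, the file is missing, or it holds no words,
  # so the caller can fall through to the next directory.
  def self.load_file(language, dir)
    return if dir.nil?
    path = File.join(dir, language)
    return unless File.exist?(path)
    words = File.readlines(path, encoding: 'utf-8').map(&:strip).reject(&:empty?)
    words unless words.empty?
  end
end

RESULTS = Dir.mktmpdir do |root|
  custom  = File.join(root, 'custom')
  default = File.join(root, 'default')
  Dir.mkdir(custom)
  Dir.mkdir(default)
  File.write(File.join(custom, 'en'), "foo\n")
  File.write(File.join(default, 'en'), "the\na\n")
  File.write(File.join(default, 'ru'), "da\n")

  out = {}
  out[:custom_wins] = StopWordsSketch.for('en', custom, default)  # custom dir first
  out[:fallback]    = StopWordsSketch.for('ru', custom, default)  # only in default dir
  out[:cached]      = StopWordsSketch.for('en', nil, default)     # cache hit, dir ignored
  StopWordsSketch.reset
  out[:after_reset] = StopWordsSketch.for('en', nil, default)     # reloaded from default
  out[:unknown]     = StopWordsSketch.for('xx', custom, default)  # no file anywhere
  out
end
```

This is also why the new test's teardown calls reset: without it, a list cached by one test would leak into the next.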

data/luisparravicini-classifier.gemspec CHANGED

@@ -5,7 +5,7 @@
 Gem::Specification.new do |s|
   s.name = %q{luisparravicini-classifier}
-  s.version = "1.3.9"
+  s.version = "1.4.0"
   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
   s.authors = ["Luis Parravicini"]

data/test/stopwords_test.rb CHANGED

@@ -1,21 +1,38 @@
 # coding:utf-8
 require File.dirname(__FILE__) + '/test_helper'
+require 'tempfile'
-class SkipWordsTest < Test::Unit::TestCase
+class StopWordsTest < Test::Unit::TestCase
   def test_en
-    assert_equal 80, Classifier::SkipWords.for('en').size
+    assert_equal 80, Classifier::StopWords.for('en').size
   end
   def test_ru
-    assert_equal 159, Classifier::SkipWords.for('ru').size
+    assert_equal 159, Classifier::StopWords.for('ru').size
   end
   def test_stopword_es
-    list = Classifier::SkipWords.for('es')
+    list = Classifier::StopWords.for('es')
     assert list.include?('más')
   end
   def test_unknown
-    assert_equal [], Classifier::SkipWords.for('xxyyzz')
+    assert_equal [], Classifier::StopWords.for('_unknown_')
+  end
+  def setup
+    @tmp = nil
+  end
+  def teardown
+    Classifier::StopWords.reset
+    File.delete(@tmp) unless @tmp.nil?
+  end
+  def test_custom_lang_file
+    lang = 'xxyyzz'
+    @tmp = File.join(File.dirname(__FILE__), lang)
+    File.open(@tmp, 'w') { |f| f.puts "str1\nstr2" }
+    assert_equal ["str1", "str2"], Classifier::StopWords.for(lang,
+      File.dirname(@tmp))
   end
 end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: luisparravicini-classifier
 version: !ruby/object:Gem::Version
-  version: 1.3.9
+  version: 1.4.0
 platform: ruby
 authors:
 - Luis Parravicini