luisparravicini-classifier 1.3.9 → 1.4.0

@@ -9,8 +9,8 @@ rb-gsl:: http://rb-gsl.rubyforge.org
  Notice that LSI will work without these libraries, but as soon as they are installed, Classifier will make use of them. No configuration changes are needed, we like to keep things ridiculously easy for you.
 
  == Changes in this branch
- I made this branch to fix a TypeError on untrain (classifier-1.3.1), then francois[http://github.com/francois/classifier/] branch for jeweler and all the changes yuri[http://github.com/yury/classifier/] made on his branch (specially the use of ruby-stemmer, and the incompatibility fix on Array#sum, which I needed).
- After that I added support for loading the stopwords of certain language from a file (before the list was embedded on the source code) and a stopword list for Spanish.
+ I made this branch to fix a TypeError on untrain (classifier-1.3.1), then merged francois[http://github.com/francois/classifier/] branch for jeweler and all the changes yuri[http://github.com/yury/classifier/] made on his branch (specially the use of ruby-stemmer, and the incompatibility fix on Array#sum, which I needed).
+ After that I added support for loading the stopwords of a certain language from a file (before the list was embedded on the source code) and added a stopword list for Spanish.
  This branch only works with Ruby 1.9
 
  == Bayes
@@ -46,6 +46,15 @@ The default values are 'en' for language and 'UTF-8' for the encoding.
  Each language uses a word list to exclude certain words (stopwords). classifier comes with three included stopword lists, for English, Russian and Spanish.
  The English list is the list that comes with the original gem (don't know where it was taken from) and the Russian and Spanish are from snowball[http://snowball.tartarus.org/algorithms/].
 
+ You can override the default stopword list, or add lists for new languages sending a value for :lang_dir when initializing Bayes:
+
+   b = Classifier::Bayes.new :categories => ['Interesting', 'Uninteresting'],
+                             :lang_dir => '/home/xrm0/stopwords'
+
+ This directory is used when loading the list from disk and takes precedence over the default directory in lib/classifier/stopwords. Each file is named after the language (using the two letter code).
+
+ The stopwords file can have comments (indicated with '#'), blank lines are ignored and the encoding must be utf-8.
+
  === Bayesian Classification
 
  * http://www.process.com/precisemail/bayesian_filtering.htm
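Note on the stopword file format described in the hunk above: the following is a minimal sketch of a file that follows those rules (UTF-8, '#' comments, blank lines ignored, file named after the two-letter language code) and of a reader for it. The helper `parse_stopwords` is hypothetical and is not the gem's loader. The sketch assumes that a comment runs from '#' to the end of the line, which the README text does not spell out.

```ruby
require 'tmpdir'

# Hypothetical helper, NOT part of the gem: reads a stopword file using the
# rules stated in the README hunk (UTF-8, '#' comments, blank lines ignored).
# Assumption: a comment runs from '#' to the end of the line.
def parse_stopwords(path)
  words = []
  File.open(path, 'r:utf-8') do |f|
    f.each_line do |line|
      word = line.sub(/#.*/, '').strip   # drop the comment, then surrounding whitespace
      words << word unless word.empty?   # skip blank and comment-only lines
    end
  end
  words
end

Dir.mktmpdir do |dir|
  # The file is named after the two-letter language code, here 'es'.
  path = File.join(dir, 'es')
  File.write(path, "# Spanish stopwords\nde\n\nmás # accented words need UTF-8\n")
  p parse_stopwords(path)
end
```

A directory laid out like `dir` above is what would be passed as :lang_dir.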
@@ -1,5 +1,5 @@
  ---
  :major: 1
- :minor: 3
- :patch: 9
+ :minor: 4
+ :patch: 0
  :build:
@@ -40,7 +40,7 @@ module Classifier
 
  def word_hash_for_words(words)
    d = Hash.new
-   skip_words = SkipWords.for(@options[:language])
+   skip_words = StopWords.for(@options[:language], @options[:lang_dir])
    words.each do |word|
      word = word.mb_chars.downcase.to_s if word =~ /[\w]+/
      key = stemmer.stem(word).intern
@@ -1,18 +1,30 @@
  module Classifier
 
-   module SkipWords
+   module StopWords
 
-     def self.for(language)
-       unless SKIP_WORDS.has_key?(language)
-         SKIP_WORDS[language] = load_stopwords(language) || []
+     def self.for(language, lang_dir=nil)
+       unless STOP_WORDS.has_key?(language)
+         STOP_WORDS[language] = load_stopwords(language, lang_dir) || []
        end
-       SKIP_WORDS[language]
+       STOP_WORDS[language]
+     end
+
+     def self.reset
+       STOP_WORDS.clear
      end
 
      protected
 
-     def self.load_stopwords(language)
-       lang_file = File.join(File.dirname(__FILE__), 'stopwords', language)
+     def self.load_stopwords(language, lang_dir)
+       default_dir = File.join(File.dirname(__FILE__), 'stopwords')
+
+       load_file(language, lang_dir) || load_file(language, default_dir) || []
+     end
+
+     def self.load_file(language, lang_dir)
+       return if lang_dir.nil?
+
+       lang_file = File.join(lang_dir, language)
        if File.exist?(lang_file)
          data = []
          File.open(lang_file, 'r:utf-8') do |f|
@@ -21,10 +33,10 @@ module Classifier
              data << line unless line.empty?
            end
          end
-         data
+         data unless data.empty?
        end
      end
 
-     SKIP_WORDS = {}
+     STOP_WORDS = {}
    end
  end
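Note on the lookup order and cache in the two hunks above: `StopWords.for` tries `lang_dir` first, falls back to the bundled directory, and stores the result in `STOP_WORDS` keyed by language only. A consequence is that a different `lang_dir` passed later for an already loaded language is ignored until `reset` is called. The sketch below re-creates that behaviour in a standalone module so it can be run without the gem. `DemoStopWords`, `read_list` and the explicit `default_dir` parameter are hypothetical names introduced for illustration, and the parsing is simplified to one word per line.

```ruby
require 'tmpdir'

# Standalone re-creation of the lookup shown in the diff, for illustration only.
# DemoStopWords, read_list and the default_dir parameter are hypothetical; the
# gem derives its default directory from __FILE__ instead.
module DemoStopWords
  CACHE = {}

  def self.for(language, lang_dir = nil, default_dir = nil)
    # As in the diff, the cache key is the language alone: lang_dir is not part of it.
    unless CACHE.key?(language)
      CACHE[language] = read_list(language, lang_dir) ||
                        read_list(language, default_dir) || []
    end
    CACHE[language]
  end

  def self.reset
    CACHE.clear
  end

  def self.read_list(language, dir)
    return if dir.nil?

    path = File.join(dir, language)
    return unless File.exist?(path)

    words = File.readlines(path, chomp: true).map(&:strip).reject(&:empty?)
    words unless words.empty? # nil lets the caller fall through to the next directory
  end
end

Dir.mktmpdir do |defaults|
  Dir.mktmpdir do |custom|
    File.write(File.join(defaults, 'en'), "the\nand\n")
    File.write(File.join(custom, 'en'), "foo\n")

    p DemoStopWords.for('en', nil, defaults)    # read from the default directory
    p DemoStopWords.for('en', custom, defaults) # served from the cache, custom dir ignored
    DemoStopWords.reset
    p DemoStopWords.for('en', custom, defaults) # custom directory now takes precedence
  end
end
```

This is presumably why the new test calls `Classifier::StopWords.reset` in `teardown`: without it, a list loaded in one test would leak into the next.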
@@ -5,7 +5,7 @@
 
  Gem::Specification.new do |s|
    s.name = %q{luisparravicini-classifier}
-   s.version = "1.3.9"
+   s.version = "1.4.0"
 
    s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
    s.authors = ["Luis Parravicini"]
@@ -1,21 +1,38 @@
  # coding:utf-8
  require File.dirname(__FILE__) + '/test_helper'
+ require 'tempfile'
 
- class SkipWordsTest < Test::Unit::TestCase
+ class StopWordsTest < Test::Unit::TestCase
    def test_en
-     assert_equal 80, Classifier::SkipWords.for('en').size
+     assert_equal 80, Classifier::StopWords.for('en').size
    end
 
    def test_ru
-     assert_equal 159, Classifier::SkipWords.for('ru').size
+     assert_equal 159, Classifier::StopWords.for('ru').size
    end
 
    def test_stopword_es
-     list = Classifier::SkipWords.for('es')
+     list = Classifier::StopWords.for('es')
      assert list.include?('más')
    end
 
    def test_unknown
-     assert_equal [], Classifier::SkipWords.for('xxyyzz')
+     assert_equal [], Classifier::StopWords.for('_unknown_')
+   end
+
+   def setup
+     @tmp = nil
+   end
+   def teardown
+     Classifier::StopWords.reset
+     File.delete(@tmp) unless @tmp.nil?
+   end
+
+   def test_custom_lang_file
+     lang = 'xxyyzz'
+     @tmp = File.join(File.dirname(__FILE__), lang)
+     File.open(@tmp, 'w') { |f| f.puts "str1\nstr2" }
+     assert_equal ["str1", "str2"], Classifier::StopWords.for(lang,
+       File.dirname(@tmp))
    end
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: luisparravicini-classifier
  version: !ruby/object:Gem::Version
-   version: 1.3.9
+   version: 1.4.0
  platform: ruby
  authors:
  - Luis Parravicini