luisparravicini-classifier 1.3.9 → 1.4.0
- data/README.rdoc +11 -2
- data/VERSION.yml +2 -2
- data/lib/classifier/base.rb +1 -1
- data/lib/classifier/stopwords.rb +21 -9
- data/luisparravicini-classifier.gemspec +1 -1
- data/test/stopwords_test.rb +22 -5
- metadata +1 -1
data/README.rdoc
CHANGED
@@ -9,8 +9,8 @@ rb-gsl:: http://rb-gsl.rubyforge.org
 Notice that LSI will work without these libraries, but as soon as they are installed, Classifier will make use of them. No configuration changes are needed, we like to keep things ridiculously easy for you.

 == Changes in this branch
-I made this branch to fix a TypeError on untrain (classifier-1.3.1), then francois[http://github.com/francois/classifier/] branch for jeweler and all the changes yuri[http://github.com/yury/classifier/] made on his branch (specially the use of ruby-stemmer, and the incompatibility fix on Array#sum, which I needed).
-After that I added support for loading the stopwords of certain language from a file (before the list was embedded on the source code) and a stopword list for Spanish.
+I made this branch to fix a TypeError on untrain (classifier-1.3.1), then merged francois[http://github.com/francois/classifier/] branch for jeweler and all the changes yuri[http://github.com/yury/classifier/] made on his branch (specially the use of ruby-stemmer, and the incompatibility fix on Array#sum, which I needed).
+After that I added support for loading the stopwords of a certain language from a file (before the list was embedded on the source code) and added a stopword list for Spanish.
 This branch only works with Ruby 1.9

 == Bayes
@@ -46,6 +46,15 @@ The default values are 'en' for language and 'UTF-8' for the encoding.
 Each language uses a word list to exclude certain words (stopwords). classifier comes with three included stopword lists, for English, Russian and Spanish.
 The English list is the list that comes with the original gem (don't know where it was taken from) and the Russian and Spanish are from snowball[http://snowball.tartarus.org/algorithms/].

+You can override the default stopword list, or add lists for new languages sending a value for :lang_dir when initializing Bayes:
+
+    b = Classifier::Bayes.new :categories => ['Interesting', 'Uninteresting'],
+                              :lang_dir => '/home/xrm0/stopwords'
+
+This directory is used when loading the list from disk and takes precedence over the default directory in lib/classifier/stopwords. Each file is named after the language (using the two letter code).
+
+The stopwords file can have comments (indicated with '#'), blank lines are ignored and the encoding must be utf-8.
+
 === Bayesian Classification

 * http://www.process.com/precisemail/bayesian_filtering.htm
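The file format described in the README change above (comments marked with '#', blank lines ignored, UTF-8 encoding) could be read roughly as follows. This is only a sketch: the helper name `parse_stopwords` is hypothetical, and the way comments are stripped is an assumption, since the gem's own parsing lines are not visible in this diff.

```ruby
require 'tmpdir'

# Hypothetical helper, not part of the gem: reads a stopword file and
# returns its words, dropping '#' comments and blank lines.
# ASSUMPTION: a '#' starts a comment anywhere on a line; the gem's real
# comment handling is elided from the diff and may differ.
def parse_stopwords(path)
  words = []
  File.open(path, 'r:utf-8') do |f|
    f.each_line do |line|
      line = line.sub(/#.*\z/m, '').strip
      words << line unless line.empty?
    end
  end
  words
end

# Write a small sample file named after a two-letter language code,
# then parse it back.
sample_words = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, 'es')
  File.write(path, "# Spanish sample\nmás\n\npero  # trailing comment\n", encoding: 'utf-8')
  sample_words = parse_stopwords(path)
end

p sample_words
```

A directory holding files like this one is what would be passed as `:lang_dir`.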
data/VERSION.yml
CHANGED
data/lib/classifier/base.rb
CHANGED
@@ -40,7 +40,7 @@ module Classifier

     def word_hash_for_words(words)
       d = Hash.new
-      skip_words =
+      skip_words = StopWords.for(@options[:language], @options[:lang_dir])
       words.each do |word|
         word = word.mb_chars.downcase.to_s if word =~ /[\w]+/
         key = stemmer.stem(word).intern
data/lib/classifier/stopwords.rb
CHANGED
@@ -1,18 +1,30 @@
 module Classifier

-  module
+  module StopWords

-    def self.for(language)
-      unless
-
+    def self.for(language, lang_dir=nil)
+      unless STOP_WORDS.has_key?(language)
+        STOP_WORDS[language] = load_stopwords(language, lang_dir) || []
       end
-
+      STOP_WORDS[language]
+    end
+
+    def self.reset
+      STOP_WORDS.clear
     end

     protected

-    def self.load_stopwords(language)
-
+    def self.load_stopwords(language, lang_dir)
+      default_dir = File.join(File.dirname(__FILE__), 'stopwords')
+
+      load_file(language, lang_dir) || load_file(language, default_dir) || []
+    end
+
+    def self.load_file(language, lang_dir)
+      return if lang_dir.nil?
+
+      lang_file = File.join(lang_dir, language)
       if File.exist?(lang_file)
         data = []
         File.open(lang_file, 'r:utf-8') do |f|
@@ -21,10 +33,10 @@ module Classifier
             data << line unless line.empty?
           end
         end
-        data
+        data unless data.empty?
       end
     end

-
+    STOP_WORDS = {}
   end
 end
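The lookup order and caching introduced in stopwords.rb can be tried out in isolation with the sketch below. It is modelled on the diff above but is not the gem's code: the module name `StopWordsSketch` and the explicit `default_dir` argument are assumptions added so that it runs without the gem installed, and comment handling is left out.

```ruby
require 'tmpdir'

# Standalone sketch modelled on the diff; names here are hypothetical.
module StopWordsSketch
  CACHE = {}

  # Returns the list for +language+, loading it on first use.
  # The cache is keyed by language only, as in the diff.
  def self.for(language, lang_dir = nil, default_dir = nil)
    unless CACHE.key?(language)
      CACHE[language] = load_file(language, lang_dir) ||
                        load_file(language, default_dir) || []
    end
    CACHE[language]
  end

  # Forgets every cached list.
  def self.reset
    CACHE.clear
  end

  # Returns nil when the directory is nil, the file is missing,
  # or the file contains no words.
  def self.load_file(language, dir)
    return if dir.nil?
    path = File.join(dir, language)
    return unless File.exist?(path)
    words = File.readlines(path, encoding: 'utf-8').map(&:strip).reject(&:empty?)
    words unless words.empty?
  end
end

Dir.mktmpdir do |custom|
  Dir.mktmpdir do |default|
    File.write(File.join(default, 'xx'), "default\n")
    File.write(File.join(custom, 'xx'), "custom\n")

    p StopWordsSketch.for('xx', custom, default)  # custom directory wins
    p StopWordsSketch.for('xx', nil, default)     # served from the cache
    StopWordsSketch.reset
    p StopWordsSketch.for('xx', nil, default)     # reloaded from the default directory
    p StopWordsSketch.for('zz', custom, default)  # unknown language gives an empty list
  end
end
```

Because the cache is keyed by language alone, a different `lang_dir` passed on a later call has no effect until `reset` is called, which is presumably why the test's `teardown` calls `Classifier::StopWords.reset`.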
data/test/stopwords_test.rb
CHANGED
@@ -1,21 +1,38 @@
 # coding:utf-8
 require File.dirname(__FILE__) + '/test_helper'
+require 'tempfile'

-class
+class StopWordsTest < Test::Unit::TestCase
   def test_en
-    assert_equal 80, Classifier::
+    assert_equal 80, Classifier::StopWords.for('en').size
   end

   def test_ru
-    assert_equal 159, Classifier::
+    assert_equal 159, Classifier::StopWords.for('ru').size
   end

   def test_stopword_es
-    list = Classifier::
+    list = Classifier::StopWords.for('es')
     assert list.include?('más')
   end

   def test_unknown
-    assert_equal [], Classifier::
+    assert_equal [], Classifier::StopWords.for('_unknown_')
+  end
+
+  def setup
+    @tmp = nil
+  end
+  def teardown
+    Classifier::StopWords.reset
+    File.delete(@tmp) unless @tmp.nil?
+  end
+
+  def test_custom_lang_file
+    lang = 'xxyyzz'
+    @tmp = File.join(File.dirname(__FILE__), lang)
+    File.open(@tmp, 'w') { |f| f.puts "str1\nstr2" }
+    assert_equal ["str1", "str2"], Classifier::StopWords.for(lang,
+      File.dirname(@tmp))
   end
 end