classifier-reborn 2.0.0.rc1 → 2.0.0
- checksums.yaml +4 -4
- data/README.markdown +62 -45
- data/lib/classifier-reborn/bayes.rb +9 -7
- data/lib/classifier-reborn/extensions/hasher.rb +134 -0
- data/lib/classifier-reborn/extensions/string.rb +1 -1
- data/lib/classifier-reborn/extensions/vector.rb +1 -2
- data/lib/classifier-reborn/lsi.rb +9 -9
- data/lib/classifier-reborn/lsi/content_node.rb +6 -4
- data/lib/classifier-reborn/lsi/summarizer.rb +33 -0
- data/lib/classifier-reborn/version.rb +1 -1
- metadata +10 -10
- data/lib/classifier-reborn/extensions/word_hash.rb +0 -136
- data/lib/classifier-reborn/lsi/summary.rb +0 -31
checksums.yaml CHANGED

```diff
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: b5bf47d31418ceab033a1cb67ef62dd0958d619f
+  data.tar.gz: 9fbfb8cfa723f0c124130aa15ef831e8cb30e317
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 79d2af7643a0184efc1f4173f4cce6e966d6e02d8ddc60ebfa03c8aed1f34d6c33ec6cb75077e46f466acf91eabe9122af4ff611d5f96f3d7bfba29923de90db
+  data.tar.gz: 80bc672d0e87fdb0a5064dcd6d8d80746923f7713f38da1c34decb3e3b4a7e11fe92e16e2c8230a1837905b91b3db3de7368c56f0f89654500f4ea242291775a
```
data/README.markdown CHANGED

````diff
@@ -1,22 +1,35 @@
-## Welcome to Classifier
+## Welcome to Classifier Reborn
 
 Classifier is a general module to allow Bayesian and other types of classifications.
 
+Classifier Reborn is a fork of cardmagic/classifier under more active development.
+
 ## Download
 
-
-
-
+Add this line to your application's Gemfile:
+
+    gem 'classifier-reborn'
+
+And then execute:
+
+    $ bundle
+
+Or install it yourself as:
+
+    $ gem install classifier-reborn
 
 ## Dependencies
 
-
+The only runtime dependency you'll need to install is Roman Shterenzon's fast-stemmer gem:
 
 gem install fast-stemmer
 
+This should install automatically with RubyGems.
+
 If you would like to speed up LSI classification by at least 10x, please install the following libraries:
-
-
+
+* [GNU GSL](http://www.gnu.org/software/gsl)
+* [rb-gsl](http://rb-gsl.rubyforge.org)
 
 Notice that LSI will work without these libraries, but as soon as they are installed, Classifier will make use of them. No configuration changes are needed, we like to keep things ridiculously easy for you.
 
@@ -26,20 +39,22 @@ A Bayesian classifier by Lucas Carlson. Bayesian Classifiers are accurate, fast,
 
 ### Usage
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+```ruby
+require 'classifier'
+b = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting'
+b.train_interesting "here are some good words. I hope you love them"
+b.train_uninteresting "here are some bad words, I hate you"
+b.classify "I hate bad words and you" # returns 'Uninteresting'
+
+require 'madeleine' # use madeline to persist the data
+m = SnapshotMadeleine.new("bayes_data") {
+  ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting'
+}
+m.system.train_interesting "here are some good words. I hope you love them"
+m.system.train_uninteresting "here are some bad words, I hate you"
+m.take_snapshot
+m.system.classify "I love you" # returns 'Interesting'
+```
 
 Using Madeleine, your application can persist the learned data over time.
 
@@ -52,33 +67,35 @@ Using Madeleine, your application can persist the learned data over time.
 ## LSI
 
 A Latent Semantic Indexer by David Fayram. Latent Semantic Indexing engines
-are not as fast or as small as Bayesian classifiers, but are more flexible, providing
-fast search and clustering detection as well as semantic analysis of the text that
+are not as fast or as small as Bayesian classifiers, but are more flexible, providing
+fast search and clustering detection as well as semantic analysis of the text that
 theoretically simulates human learning.
 
 ### Usage
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+```ruby
+require 'classifier'
+lsi = ClassifierReborn::LSI.new
+strings = [ ["This text deals with dogs. Dogs.", :dog],
+            ["This text involves dogs too. Dogs! ", :dog],
+            ["This text revolves around cats. Cats.", :cat],
+            ["This text also involves cats. Cats!", :cat],
+            ["This text involves birds. Birds.", :bird] ]
+strings.each {|x| lsi.add_item x.first, x.last}
+
+lsi.search("dog", 3)
+# returns => ["This text deals with dogs. Dogs.", "This text involves dogs too. Dogs! ",
+#             "This text also involves cats. Cats!"]
+
+lsi.find_related(strings[2], 2)
+# returns => ["This text revolves around cats. Cats.", "This text also involves cats. Cats!"]
+
+lsi.classify "This text is also about dogs!"
+# returns => :dog
+```
+
 Please see the ClassifierReborn::LSI documentation for more information. It is possible to index, search and classify
-with more than just simple strings.
+with more than just simple strings.
 
 ### Latent Semantic Indexing
 
@@ -86,12 +103,12 @@ with more than just simple strings.
 * http://www.chadfowler.com/index.cgi/Computing/LatentSemanticIndexing.rdoc
 * http://en.wikipedia.org/wiki/Latent_semantic_analysis
 
-## Authors
+## Authors
 
 * Lucas Carlson  (lucas@rufy.com)
 * David Fayram II  (dfayram@gmail.com)
 * Cameron McBride  (cameron.mcbride@gmail.com)
 * Ivan Acosta-Rubio  (ivan@softwarecriollo.com)
+* Parker Moore  (email@byparker.com)
 
 This library is released under the terms of the GNU LGPL. See LICENSE for more details.
-
````
data/lib/classifier-reborn/bayes.rb CHANGED

```diff
@@ -2,6 +2,8 @@
 # Copyright:: Copyright (c) 2005 Lucas Carlson
 # License::   LGPL
 
+require_relative 'extensions/string'
+
 module ClassifierReborn
   class Bayes
     # The class can be created with one or more categories, each of which will be
@@ -23,7 +25,7 @@ module ClassifierReborn
     def train(category, text)
       category = category.prepare_category_name
       @category_counts[category] += 1
-
+      Hasher.word_hash(text).each do |word, count|
         @categories[category][word] ||= 0
         @categories[category][word] += count
         @total_words += count
@@ -39,12 +41,12 @@ module ClassifierReborn
     #     b.untrain :this, "This text"
     def untrain(category, text)
       category = category.prepare_category_name
-
-
+      @category_counts[category] -= 1
+      Hasher.word_hash(text).each do |word, count|
         if @total_words >= 0
-          orig = @categories[category][word]
-          @categories[category][word]
-          @categories[category][word]
+          orig = @categories[category][word] || 0
+          @categories[category][word] ||= 0
+          @categories[category][word] -= count
           if @categories[category][word] <= 0
             @categories[category].delete(word)
             count = orig
@@ -64,7 +66,7 @@ module ClassifierReborn
       @categories.each do |category, category_words|
         score[category.to_s] = 0
         total = category_words.values.inject(0) {|sum, element| sum+element}
-
+        Hasher.word_hash(text).each do |word, count|
           s = category_words.has_key?(word) ? category_words[word] : 0.1
          score[category.to_s] += Math.log(s/total.to_f)
        end
```
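The `untrain` hunk above replaces a bare hash read with a nil-guarded decrement: the old code crashed with `NoMethodError` when untraining a word the classifier had never seen. A minimal sketch of the same pattern on a plain Hash (the method name `untrain_word` is hypothetical, not part of the gem's API):

```ruby
# Nil-safe decrement, mirroring the untrain fix: default the count to 0
# before subtracting, and drop the key once it reaches zero.
def untrain_word(counts, word, count)
  orig = counts[word] || 0   # the old code read a possibly-nil value here
  counts[word] ||= 0
  counts[word] -= count
  if counts[word] <= 0
    counts.delete(word)
    count = orig
  end
  count
end

counts = { "good" => 2 }
untrain_word(counts, "good", 1)    # known word: decrements 2 -> 1
untrain_word(counts, "missing", 3) # unseen word: no NoMethodError, key removed
```

The `|| 0` / `||= 0` pair is what turns an unseen word into a harmless no-op instead of an exception.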
data/lib/classifier-reborn/extensions/hasher.rb ADDED

```diff
@@ -0,0 +1,134 @@
+# Author::    Lucas Carlson  (mailto:lucas@rufy.com)
+# Copyright:: Copyright (c) 2005 Lucas Carlson
+# License::   LGPL
+
+require "set"
+
+module ClassifierReborn
+  module Hasher
+    extend self
+
+    # Removes common punctuation symbols, returning a new string.
+    # E.g.,
+    #   "Hello (greeting's), with {braces} < >...?".without_punctuation
+    #   => "Hello  greetings   with  braces         "
+    def without_punctuation(str)
+      str.tr( ',?.!;:"@#$%^&*()_=+[]{}\|<>/`~', " " ).tr( "'\-", "")
+    end
+
+    # Return a Hash of strings => ints. Each word in the string is stemmed,
+    # interned, and indexes to its frequency in the document.
+    def word_hash(str)
+      word_hash = clean_word_hash(str)
+      symbol_hash = word_hash_for_symbols(str.gsub(/[\w]/," ").split)
+      return clean_word_hash(str).merge(symbol_hash)
+    end
+
+    # Return a word hash without extra punctuation or short symbols, just stemmed words
+    def clean_word_hash(str)
+      word_hash_for_words str.gsub(/[^\w\s]/,"").split
+    end
+
+    def word_hash_for_words(words)
+      d = Hash.new(0)
+      words.each do |word|
+        word.downcase!
+        if ! CORPUS_SKIP_WORDS.include?(word) && word.length > 2
+          d[word.stem.intern] += 1
+        end
+      end
+      return d
+    end
+
+    def word_hash_for_symbols(words)
+      d = Hash.new(0)
+      words.each do |word|
+        d[word.intern] += 1
+      end
+      return d
+    end
+
+    CORPUS_SKIP_WORDS = Set.new(%w[
+      a
+      again
+      all
+      along
+      are
+      also
+      an
+      and
+      as
+      at
+      but
+      by
+      came
+      can
+      cant
+      couldnt
+      did
+      didn
+      didnt
+      do
+      doesnt
+      dont
+      ever
+      first
+      from
+      have
+      her
+      here
+      him
+      how
+      i
+      if
+      in
+      into
+      is
+      isnt
+      it
+      itll
+      just
+      last
+      least
+      like
+      most
+      my
+      new
+      no
+      not
+      now
+      of
+      on
+      or
+      should
+      sinc
+      so
+      some
+      th
+      than
+      this
+      that
+      the
+      their
+      then
+      those
+      to
+      told
+      too
+      true
+      try
+      until
+      url
+      us
+      were
+      when
+      whether
+      while
+      with
+      within
+      yes
+      you
+      youll
+    ])
+  end
+end
```
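The new `Hasher` module builds a frequency hash by merging stemmed-word counts with counts of standalone symbols. Here is a runnable approximation of that logic that swaps fast-stemmer's `String#stem` for plain downcasing (so it needs no gems), with the stop-word list truncated to three entries for brevity — both substitutions are mine, not the gem's:

```ruby
require "set"

SKIP_WORDS = Set.new(%w[a and the]) # stand-in for the full CORPUS_SKIP_WORDS list

# Sketch of Hasher.word_hash: word counts merged with symbol counts.
def word_hash(str)
  counts = Hash.new(0)
  # Words: strip punctuation, skip stop words and tokens of length <= 2.
  str.gsub(/[^\w\s]/, "").split.each do |word|
    word = word.downcase
    counts[word.intern] += 1 if !SKIP_WORDS.include?(word) && word.length > 2
  end
  # Symbols: whatever remains after blanking out all word characters.
  str.gsub(/[\w]/, " ").split.each { |sym| counts[sym.intern] += 1 }
  counts
end

word_hash("good words, good words!")
# => {:good=>2, :words=>2, :","=>1, :!=>1}
```

In the real module each word also passes through the Porter stemmer, so "words" and "word" would collapse into one key.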
data/lib/classifier-reborn/extensions/vector.rb CHANGED

```diff
@@ -4,7 +4,6 @@
 # These are extensions to the std-lib 'matrix' to allow an all ruby SVD
 
 require 'matrix'
-require 'mathn'
 
 class Array
   def sum(identity = 0, &block)
@@ -13,7 +12,7 @@ class Array
     if block_given?
       map(&block).sum
     else
-      reduce(:+)
+      reduce(:+) || 0
     end
   end
 end
```
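The `reduce(:+) || 0` change fixes the empty-array case: `[].reduce(:+)` returns `nil`, which then blows up in downstream arithmetic. A self-contained sketch of the patched extension (named `safe_sum` here to avoid shadowing Ruby's built-in `Array#sum`):

```ruby
class Array
  # Re-implementation of the patched extension: an empty array now sums
  # to 0 instead of returning nil ([].reduce(:+) is nil).
  def safe_sum(identity = 0, &block)
    if block_given?
      map(&block).safe_sum
    else
      reduce(:+) || 0
    end
  end
end

[].reduce(:+)               # => nil, the bug the patch works around
[].safe_sum                 # => 0
[1, 2, 3].safe_sum          # => 6
[1, 2, 3].safe_sum { |x| x * 2 } # => 12
```

Modern Ruby's built-in `Array#sum` already returns 0 for an empty array, but this library targeted Ruby 1.9.3, where the monkey patch was needed.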
data/lib/classifier-reborn/lsi.rb CHANGED

```diff
@@ -6,16 +6,16 @@ begin
   raise LoadError if ENV['NATIVE_VECTOR'] == "true" # to test the native vector class, try `rake test NATIVE_VECTOR=true`
 
   require 'gsl' # requires http://rb-gsl.rubyforge.org/
-
+  require_relative 'extensions/vector_serialize'
   $GSL = true
 
 rescue LoadError
-
+  require_relative 'extensions/vector'
 end
 
-
-
-
+require_relative 'lsi/word_list'
+require_relative 'lsi/content_node'
+require_relative 'lsi/summarizer'
 
 module ClassifierReborn
 
@@ -58,7 +58,7 @@ module ClassifierReborn
   #     lsi.add_item ar, *ar.categories { |x| ar.content }
   #
   def add_item( item, *categories, &block )
-    clean_word_hash = block ? block.call(item)
+    clean_word_hash = Hasher.clean_word_hash(block ? block.call(item) : item.to_s)
     @items[item] = ContentNode.new(clean_word_hash, *categories)
     @version += 1
     build_index if @auto_rebuild
@@ -82,8 +82,8 @@ module ClassifierReborn
   # Removes an item from the database, if it is indexed.
   #
   def remove_item( item )
-    if @items.
-      @items.
+    if @items.key? item
+      @items.delete item
       @version += 1
     end
   end
@@ -293,7 +293,7 @@ module ClassifierReborn
     if @items[item]
       return @items[item]
     else
-      clean_word_hash = block ? block.call(item)
+      clean_word_hash = Hasher.clean_word_hash(block ? block.call(item) : item.to_s)
 
       cn = ContentNode.new(clean_word_hash, &block) # make the node and extract the data
 
```
data/lib/classifier-reborn/lsi/content_node.rb CHANGED

```diff
@@ -43,12 +43,14 @@ module ClassifierReborn
         vec[word_list[word]] = @word_hash[word] if word_list[word]
       end
 
-      # Perform the scaling transform
-      total_words = vec.sum
+      # Perform the scaling transform and force floating point arithmetic
+      total_words = vec.sum.to_f
+
+      total_unique_words = vec.count{|word| word != 0}
 
       # Perform first-order association transform if this vector has more
-      #
-      if total_words > 1.0
+      # then one word in it.
+      if total_words > 1.0 && total_unique_words > 1
         weighted_total = 0.0
         vec.each do |term|
           if ( term > 0 )
```
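The content_node.rb hunk does two things: it forces float division (so `term / total_words` can't truncate to 0 under integer math), and it skips the first-order association transform when the vector contains only one distinct word, where log-weighting degenerates. A sketch of just the two computed guards (the helper name `scaling_inputs` is mine, not the gem's):

```ruby
# Inputs to the patched scaling guard in ContentNode#raw_vector_with.
def scaling_inputs(vec)
  total_words        = vec.sum.to_f              # force floats for term/total
  total_unique_words = vec.count { |w| w != 0 }  # distinct words with nonzero count
  apply_transform    = total_words > 1.0 && total_unique_words > 1
  [total_words, total_unique_words, apply_transform]
end

scaling_inputs([2, 0, 1]) # => [3.0, 2, true]
scaling_inputs([5, 0, 0]) # => [5.0, 1, false]  (one unique word: skip transform)
```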
data/lib/classifier-reborn/lsi/summarizer.rb ADDED

```diff
@@ -0,0 +1,33 @@
+# Author::    Lucas Carlson  (mailto:lucas@rufy.com)
+# Copyright:: Copyright (c) 2005 Lucas Carlson
+# License::   LGPL
+
+module ClassifierReborn
+  module Summarizer
+    extend self
+
+    def summary( str, count=10, separator=" [...] " )
+      perform_lsi split_sentences(str), count, separator
+    end
+
+    def paragraph_summary( str, count=1, separator=" [...] " )
+      perform_lsi split_paragraphs(str), count, separator
+    end
+
+    def split_sentences(str)
+      str.split /(\.|\!|\?)/ # TODO: make this less primitive
+    end
+
+    def split_paragraphs(str)
+      str.split /(\n\n|\r\r|\r\n\r\n)/ # TODO: make this less primitive
+    end
+
+    def perform_lsi(chunks, count, separator)
+      lsi = ClassifierReborn::LSI.new :auto_rebuild => false
+      chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 }
+      lsi.build_index
+      summaries = lsi.highest_relative_content count
+      return summaries.reject { |chunk| !summaries.include? chunk }.map { |x| x.strip }.join(separator)
+    end
+  end
+end
```
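The new `Summarizer` module (the instance-method version of the removed `String#summary`) splits text with capture-group regexes, which means the delimiters themselves come back as array elements; `perform_lsi` then filters out empty and single-word chunks before indexing. A gem-free sketch of that splitting and filtering step (`keep_chunk?` is a hypothetical name for the inline condition):

```ruby
# Same regex as Summarizer#split_sentences; the capture group keeps
# each ".", "!", "?" as its own array element.
def split_sentences(str)
  str.split(/(\.|\!|\?)/)
end

# The filter perform_lsi applies before feeding chunks into the LSI index.
def keep_chunk?(chunk)
  !(chunk.strip.empty? || chunk.strip.split.size == 1)
end

parts = split_sentences("Dogs bark. Cats meow! Ok?")
# parts => ["Dogs bark", ".", " Cats meow", "!", " Ok", "?"]
parts.select { |c| keep_chunk?(c) }
# => ["Dogs bark", " Cats meow"]   (punctuation and one-word chunks dropped)
```

The filter is what keeps the stray punctuation elements produced by the capture groups out of the summary index.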
metadata CHANGED

```diff
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: classifier-reborn
 version: !ruby/object:Gem::Version
-  version: 2.0.0.rc1
+  version: 2.0.0
 platform: ruby
 authors:
 - Lucas Carlson
@@ -9,22 +9,22 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-08-
+date: 2014-08-13 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: fast-stemmer
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - "~>"
      - !ruby/object:Gem::Version
-        version: 1.0
+        version: '1.0'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
    requirements:
-    - - "
+    - - "~>"
      - !ruby/object:Gem::Version
-        version: 1.0
+        version: '1.0'
 - !ruby/object:Gem::Dependency
   name: rake
   requirement: !ruby/object:Gem::Requirement
@@ -71,13 +71,13 @@ files:
 - bin/summarize.rb
 - lib/classifier-reborn.rb
 - lib/classifier-reborn/bayes.rb
+- lib/classifier-reborn/extensions/hasher.rb
 - lib/classifier-reborn/extensions/string.rb
 - lib/classifier-reborn/extensions/vector.rb
 - lib/classifier-reborn/extensions/vector_serialize.rb
-- lib/classifier-reborn/extensions/word_hash.rb
 - lib/classifier-reborn/lsi.rb
 - lib/classifier-reborn/lsi/content_node.rb
-- lib/classifier-reborn/lsi/summary.rb
+- lib/classifier-reborn/lsi/summarizer.rb
 - lib/classifier-reborn/lsi/word_list.rb
 - lib/classifier-reborn/version.rb
 homepage: https://github.com/jekyll/classifier-reborn
@@ -96,9 +96,9 @@ required_ruby_version: !ruby/object:Gem::Requirement
     version: 1.9.3
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
-  - - "
+  - - ">="
    - !ruby/object:Gem::Version
-      version:
+      version: '0'
 requirements: []
 rubyforge_project:
 rubygems_version: 2.2.2
```
data/lib/classifier-reborn/extensions/word_hash.rb DELETED

```diff
@@ -1,136 +0,0 @@
-# Author::    Lucas Carlson  (mailto:lucas@rufy.com)
-# Copyright:: Copyright (c) 2005 Lucas Carlson
-# License::   LGPL
-
-require "set"
-
-# These are extensions to the String class to provide convenience
-# methods for the Classifier package.
-class String
-
-  # Removes common punctuation symbols, returning a new string.
-  # E.g.,
-  #   "Hello (greeting's), with {braces} < >...?".without_punctuation
-  #   => "Hello  greetings   with  braces         "
-  def without_punctuation
-    tr( ',?.!;:"@#$%^&*()_=+[]{}\|<>/`~', " " ).tr( "'\-", "")
-  end
-
-  # Return a Hash of strings => ints. Each word in the string is stemmed,
-  # interned, and indexes to its frequency in the document.
-  def word_hash
-    word_hash = clean_word_hash()
-    symbol_hash = word_hash_for_symbols(gsub(/[\w]/," ").split)
-    return word_hash.merge(symbol_hash)
-  end
-
-  # Return a word hash without extra punctuation or short symbols, just stemmed words
-  def clean_word_hash
-    word_hash_for_words gsub(/[^\w\s]/,"").split
-  end
-
-  private
-
-  def word_hash_for_words(words)
-    d = Hash.new(0)
-    words.each do |word|
-      word.downcase!
-      if ! CORPUS_SKIP_WORDS.include?(word) && word.length > 2
-        d[word.stem.intern] += 1
-      end
-    end
-    return d
-  end
-
-
-  def word_hash_for_symbols(words)
-    d = Hash.new(0)
-    words.each do |word|
-      d[word.intern] += 1
-    end
-    return d
-  end
-
-  CORPUS_SKIP_WORDS = Set.new([
-    "a",
-    "again",
-    "all",
-    "along",
-    "are",
-    "also",
-    "an",
-    "and",
-    "as",
-    "at",
-    "but",
-    "by",
-    "came",
-    "can",
-    "cant",
-    "couldnt",
-    "did",
-    "didn",
-    "didnt",
-    "do",
-    "doesnt",
-    "dont",
-    "ever",
-    "first",
-    "from",
-    "have",
-    "her",
-    "here",
-    "him",
-    "how",
-    "i",
-    "if",
-    "in",
-    "into",
-    "is",
-    "isnt",
-    "it",
-    "itll",
-    "just",
-    "last",
-    "least",
-    "like",
-    "most",
-    "my",
-    "new",
-    "no",
-    "not",
-    "now",
-    "of",
-    "on",
-    "or",
-    "should",
-    "sinc",
-    "so",
-    "some",
-    "th",
-    "than",
-    "this",
-    "that",
-    "the",
-    "their",
-    "then",
-    "those",
-    "to",
-    "told",
-    "too",
-    "true",
-    "try",
-    "until",
-    "url",
-    "us",
-    "were",
-    "when",
-    "whether",
-    "while",
-    "with",
-    "within",
-    "yes",
-    "you",
-    "youll",
-  ])
-end
```
data/lib/classifier-reborn/lsi/summary.rb DELETED

```diff
@@ -1,31 +0,0 @@
-# Author::    Lucas Carlson  (mailto:lucas@rufy.com)
-# Copyright:: Copyright (c) 2005 Lucas Carlson
-# License::   LGPL
-
-class String
-  def summary( count=10, separator=" [...] " )
-    perform_lsi split_sentences, count, separator
-  end
-
-  def paragraph_summary( count=1, separator=" [...] " )
-    perform_lsi split_paragraphs, count, separator
-  end
-
-  def split_sentences
-    split /(\.|\!|\?)/ # TODO: make this less primitive
-  end
-
-  def split_paragraphs
-    split /(\n\n|\r\r|\r\n\r\n)/ # TODO: make this less primitive
-  end
-
-  private
-
-  def perform_lsi(chunks, count, separator)
-    lsi = ClassifierReborn::LSI.new :auto_rebuild => false
-    chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 }
-    lsi.build_index
-    summaries = lsi.highest_relative_content count
-    return summaries.reject { |chunk| !summaries.include? chunk }.map { |x| x.strip }.join(separator)
-  end
-end
```