yanbi-ml 0.1.2 → 0.2.0
- data/README.md +88 -3
- data/lib/bayes/bayes.rb +13 -12
- data/lib/bayes/fisher.rb +25 -26
- data/lib/version.rb +1 -1
- metadata +2 -2
data/README.md
CHANGED
````diff
@@ -1,6 +1,6 @@
 # YANBI-ML
 
-Yet Another Naive Bayes Implementation
+Yet Another Naive Bayes Implementation - Bayes and Fisher document classifiers
 
 ## Installation
 
````
````diff
@@ -34,9 +34,27 @@ classifier.train_raw(:odd, "one three five seven")
 classifier.classify_raw("one two three") => :odd
 ```
 
+## What is a Fisher Classifier?
+
+An alternative to the standard Bayesian classifier that can also give very accurate results. A Bayesian classifier works by computing a single, document-wide probability for each class that a document might belong to. A Fisher classifier, by contrast, will compute a probability for each individual feature in a document. If the document does not belong to a given class, then you would expect to get a random distribution of probabilities for the features in the document. In fact, the eponymous Fisher showed that you would generally get a *chi squared distribution* of probabilities. If the document does belong to a given class, you would expect the probabilities to be skewed towards higher values, instead of being randomly distributed. A Fisher classifier uses the Fisher statistical method (p-value) to determine the degree to which the features in the document diverge from the expected random distribution.
+
+## I don't care, I just want to use it!
+
+Fortunately the interface is pretty consistent:
+
+```ruby
+classifier = Yanbi::Fisher.default(:even, :odd)
+classifier.train_raw(:even, "two four six eight")
+classifier.train_raw(:odd, "one three five seven")
+
+classifier.classify_raw("one two three") => :odd
+```
+
+See? Easy.
+
 ## Bags (of words)
 
-A bag of words is a just a Hash of word counts (a multi-set of word frequencies, to ML folk). This makes a useful abstraction because you can use it with more than one kind of classifier, and because the bag provides a natural location for various kinds of pre-processing you might want to do to the words (features) of the text before training with or classifying them.
+A bag of words is just a Hash of word counts (a multi-set of word frequencies, to ML folk). This makes a useful abstraction because you can use it with more than one kind of classifier, and because the bag provides a natural location for various kinds of pre-processing you might want to do to the words (features) of the text before training with or classifying them. Although a single bag can contain as many documents as you want, in practice it's a good idea to treat word bags as corresponding to a single document.
 
 A handful of classes are provided:
 
````
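The chi-squared combining step the new Fisher section describes can be sketched in standalone Ruby. This is the textbook Fisher's-method recipe with a series-expansion inverse chi-square, not the gem's own code; the method names and the flat probability-list input are illustrative:

```ruby
# Fisher's method: under the null hypothesis (random per-feature
# probabilities), -2 * sum(ln p) follows a chi-squared distribution
# with 2 degrees of freedom per feature.

# Series expansion of the chi-squared survival function for even df:
# the probability of seeing a statistic at least this extreme.
def invchi2(chi, df)
  m = chi / 2.0
  sum = term = Math.exp(-m)
  (1...(df / 2)).each do |i|
    term *= m / i
    sum += term
  end
  [sum, 1.0].min
end

# Combine a list of per-feature probabilities into one score in [0, 1].
def fisher_score(probs)
  score = -2 * Math.log(probs.reduce(:*))
  invchi2(score, probs.count * 2)
end
```

Middling per-feature probabilities (0.5 each) give a middling combined score of about 0.6, while consistently low per-feature probabilities drive the score toward zero, which is the skew the classifier looks for.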
````diff
@@ -163,6 +181,41 @@ docs.each_doc do |d|
 end
 ```
 
+## Feature thresholds
+
+A method on the classifier is provided to prune infrequently seen features. This is often one of the first things recommended for improving the accuracy of a classifier in real world applications. Note that when you prune features, there's no un-pruning afterwards - so be sure you actually want to do it!
+
+```ruby
+classifier = Yanbi.default(:even, :odd)
+
+#...tons of training happens here...
+
+#we now have thousands of documents. Ignore any words we haven't
+#seen at least a dozen times
+
+classifier.set_significance(12)
+
+#actually, the 'odd' category is especially noisy, so let's make
+#that two dozen for odd items
+
+classifier.set_significance(24, :odd)
+```
+
+## Persisting
+
+After going to all of the trouble of training a classifier on a large corpus, it would be very useful to save it to disk for later use. You can do just that with the appropriately named save and load functions:
+
+```ruby
+classifier.save('testclassifier')
+
+#...some time later
+
+newclassifier = Yanbi::Bayes.load('testclassifier')
+```
+
+Note that an .obj extension is added to saved classifiers by default - no need to explicitly include it.
+
 ## Putting it all together
 
 ```ruby
````
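The pruning that `set_significance` performs amounts to dropping low-count entries from a word-count Hash. A minimal sketch of the idea over a plain Hash (this is an illustration of the concept, not the gem's internals):

```ruby
# Drop features seen fewer than `threshold` times from a word-count Hash.
# Like the classifier's pruning, this is one-way: discarded counts are gone.
def prune(counts, threshold)
  counts.reject { |_word, count| count < threshold }
end

counts = {'the' => 120, 'cromulent' => 2, 'bayes' => 15}
pruned = prune(counts, 12)  # => {'the' => 120, 'bayes' => 15}
```

The payoff is that rare, noisy features stop contributing to the probability estimates, usually at negligible cost to accuracy.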
````diff
@@ -176,11 +229,43 @@ other.add_file('biglistofotherstuff.txt', '@@@@')
 
 stuff.each_doc {|d| classifier.train(:stuff, d)}
 otherstuff.each_doc {|d| classifier.train(:otherstuff, d)}
+
+#...classify all the things....
+```
+
+A slightly fancier example:
+
+```ruby
+STOP_WORDS = %w(in the a and at of)
+
+#classify using stemmed words
+classifier = Yanbi::Bayes.new(Yanbi::StemmedWordBag, :stuff, :otherstuff)
+
+#create our corpora
+stuff = Yanbi::Corpus.new(Yanbi::StemmedWordBag)
+stuff.add_file('biglistofstuff.txt', '****')
+
+other = Yanbi::Corpus.new(Yanbi::StemmedWordBag)
+other.add_file('biglistofotherstuff.txt', '@@@@')
+
+#get rid of those nasty stop words
+stuff.each_doc {|d| d.remove(STOP_WORDS)}
+other.each_doc {|d| d.remove(STOP_WORDS)}
+
+#train away!
+stuff.each_doc {|d| classifier.train(:stuff, d)}
+other.each_doc {|d| classifier.train(:otherstuff, d)}
+
+#get rid of the long tail
+classifier.set_significance(50)
+
+#...classify all the things....
 ```
 
 ## Contributing
 
-Bug reports and pull requests are welcome on GitHub at https://github.com/rdormer/yanbi-ml.
+Bug reports, corrections of any tragic mathematical misunderstandings, and pull requests are welcome on GitHub at https://github.com/rdormer/yanbi-ml.
 
 
 ## License
````
data/lib/bayes/bayes.rb
CHANGED
````diff
@@ -34,6 +34,18 @@ module Yanbi
     self.new(WordBag, *categories)
   end
 
+  def self.load(fname)
+    c = YAML::load(File.read(fname + ".obj"))
+    raise LoadError unless c.is_a? self
+    c
+  end
+
+  def save(name)
+    File.open(name + ".obj", 'w') do |out|
+      YAML.dump(self, out)
+    end
+  end
+
   def train(category, document)
     cat = category.to_sym
     @document_counts[cat] += 1
@@ -69,13 +81,7 @@ module Yanbi
   def newdoc(doc)
     Yanbi.const_get(@bag_class).new(doc)
   end
-
-  def save(name)
-    File.open(name + ".obj", 'w') do |out|
-      YAML.dump(self, out)
-    end
-  end
-
+
   private
 
   def cond_prob(cat, document)
@@ -102,11 +108,6 @@ module Yanbi
     @categories[i]
   end
 
-  # def weighted_prob(word, category, basicprob, weight=1.0, ap=0.5)
-  #   #basicprob = word_prob(category, word) if basicprob.nil?
-  #   totals = @category_counts.inject(0) {|sum, cat| sum += cat.last[word].to_i}
-  #   ((weight * ap) + (totals*basicprob)) / (weight + totals)
-  # end
 end
 
 end
````
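The save/load pair added to bayes.rb above is plain YAML serialization plus a type check on load. A standalone round-trip sketch of that pattern — `TinyModel`, `save_model`, and `load_model` are stand-ins, not the gem's API, and newer Rubies (Psych 4 / Ruby 3.1+) need `unsafe_load` to revive arbitrary objects:

```ruby
require 'yaml'
require 'tmpdir'

# Stand-in for a trained classifier holding some learned state.
class TinyModel
  attr_reader :counts
  def initialize(counts)
    @counts = counts
  end
end

# Dump the object to <name>.obj, as the gem's save does.
def save_model(model, name)
  File.open(name + '.obj', 'w') { |out| YAML.dump(model, out) }
end

# Read it back and refuse anything that isn't the expected class.
def load_model(name)
  data = File.read(name + '.obj')
  # Psych 4 restricts YAML.load to safe types; fall back on older Rubies.
  model = YAML.respond_to?(:unsafe_load) ? YAML.unsafe_load(data) : YAML.load(data)
  raise LoadError unless model.is_a?(TinyModel)
  model
end

path = File.join(Dir.mktmpdir, 'classifier')
save_model(TinyModel.new({'spam' => 3}), path)
reloaded = load_model(path)
```

The `is_a?` check is the same guard the diff adds: a YAML file for the wrong class fails loudly instead of producing a half-valid classifier.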
data/lib/bayes/fisher.rb
CHANGED
````diff
@@ -1,6 +1,10 @@
+# Author:: Robert Dormer (mailto:rdormer@gmail.com)
+# Copyright:: Copyright (c) 2016 Robert Dormer
+# License:: MIT
+
 module Yanbi
 
-class Fisher <
+class Fisher < Bayes
 
   def classify(text)
     max_score(text) do |cat, doc|
@@ -12,36 +16,31 @@ module Yanbi
 
   def fisher_score(category, document)
     features = document.words.uniq
-
-    ###
-    #compute weighted probabilities for each word/cat tuple
-    #and then multiply them all together...
-    ##
-
-    features.each do |word|
-      clf = word_prob(category, word)
-      freqsum = @categories.reduce(0) {|sum, x| sum + word_prob(x, word)}
-      pscores *= (clf / freqsum) if clf > 0
-    end
-
-    #####
-
-    #compute fisher factor of pscores
+    probs = features.map {|x| weighted_prob(x, category)}
+    pscores = probs.reduce(&:*)
     score = -2 * Math.log(pscores)
-
-    #this is okay
     invchi2(score, features.count * 2)
   end
-
+
+  def category_prob(cat, word)
+    wp = word_prob(cat, word)
+    sum = @categories.inject(0) {|s,c| s + word_prob(c, word)}
+    return 0 if sum.zero?
+    wp / sum
+  end
+
   def word_prob(cat, word)
-    @category_counts[cat]
+    all_word_count = @category_counts[cat].values.reduce(&:+)
+    count = @category_counts[cat].has_key?(word) ? @category_counts[cat][word].to_f : 0
+    count / all_word_count
   end
-
+
+  def weighted_prob(word, category, basicprob=nil, weight=1.0, ap=0.5)
+    basicprob = category_prob(category, word)
+    totals = @category_counts.inject(0) {|sum, cat| sum += cat.last[word].to_i}
+    ((weight * ap) + (totals*basicprob)) / (weight + totals)
+  end
+
   def invchi2(chi, df)
     m = chi / 2.0
     sum = Math.exp(-m)
````
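The `weighted_prob` method introduced above is a smoothing step: it averages a word's raw per-category probability with an assumed prior (`ap = 0.5`), weighted by how many times the word has actually been seen. A standalone sketch over a plain counts Hash — the parameter names mirror the diff, but this is an illustration, not the gem's code:

```ruby
# category_counts maps category => {word => count}.
# basicprob is the raw per-category probability estimate for the word.
def weighted_prob(category_counts, basicprob, word, weight: 1.0, ap: 0.5)
  # Total times the word was seen across every category.
  totals = category_counts.inject(0) { |sum, cat| sum + cat.last[word].to_i }
  # Blend the assumed prior with the observed estimate.
  ((weight * ap) + (totals * basicprob)) / (weight + totals)
end

counts = {spam: {'viagra' => 3}, ham: {'viagra' => 1}}

# A never-seen word falls back to the assumed prior of 0.5...
weighted_prob(counts, 0.0, 'unseen')   # => 0.5
# ...while a word seen 4 times stays close to its raw estimate:
weighted_prob(counts, 0.75, 'viagra')  # => (0.5 + 4 * 0.75) / 5 = 0.7
```

This keeps rare words from producing extreme probabilities of 0 or 1, which would otherwise dominate the multiplied score in `fisher_score`.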
data/lib/version.rb
CHANGED
metadata
CHANGED
````diff
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: yanbi-ml
 version: !ruby/object:Gem::Version
-  version: 0.1.2
+  version: 0.2.0
 prerelease:
 platform: ruby
 authors:
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2016-07-
+date: 2016-07-08 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
````