yanbi-ml 0.1.2 → 0.2.0

data/README.md CHANGED
@@ -1,6 +1,6 @@
  # YANBI-ML

- Yet Another Naive Bayes Implementation
+ Yet Another Naive Bayes Implementation - Bayes and Fisher document classifiers

  ## Installation

@@ -34,9 +34,27 @@ classifier.train_raw(:odd, "one three five seven")
  classifier.classify_raw("one two three") => :odd
  ```

+ ## What is a Fisher Classifier?
+
+ An alternative to the standard Bayesian classifier that can also give very accurate results. A Bayesian classifier works by computing a single, document-wide probability for each class that a document might belong to. A Fisher classifier, by contrast, computes a probability for each individual feature in a document. If the document does not belong to a given class, you would expect a random distribution of probabilities across its features. In fact, the eponymous Fisher showed that you would generally get a *chi squared distribution* of probabilities. If the document does belong to a given class, you would expect the probabilities to be skewed towards higher values rather than randomly distributed. A Fisher classifier uses Fisher's statistical method (a p-value) to measure how far the features of the document diverge from the expected random distribution.
+
+ ## I don't care, I just want to use it!
+
+ Fortunately the interface is pretty consistent:
+
+ ```ruby
+ classifier = Yanbi::Fisher.default(:even, :odd)
+ classifier.train_raw(:even, "two four six eight")
+ classifier.train_raw(:odd, "one three five seven")
+
+ classifier.classify_raw("one two three") => :odd
+ ```
+
+ See? Easy.
+
  ## Bags (of words)

- A bag of words is a just a Hash of word counts (a multi-set of word frequencies, to ML folk). This makes a useful abstraction because you can use it with more than one kind of classifier, and because the bag provides a natural location for various kinds of pre-processing you might want to do to the words (features) of the text before training with or classifying them.
+ A bag of words is just a Hash of word counts (a multi-set of word frequencies, to ML folk). This makes a useful abstraction because you can use it with more than one kind of classifier, and because the bag provides a natural location for various kinds of pre-processing you might want to do to the words (features) of the text before training with or classifying them. Although a single bag can contain as many documents as you want, in practice it's a good idea to treat each word bag as corresponding to a single document.

  A handful of classes are provided:
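
To make the bag abstraction concrete, here is a minimal sketch of building and pre-processing a bag by hand. The constructors and the `words`/`remove` calls mirror usage shown elsewhere in this diff; treat any other detail as an assumption rather than documented API:

```ruby
# a bag is built from a document string and holds its word counts
bag = Yanbi::WordBag.new("the cat sat on the cat mat")

# the word features seen in the document
bag.words

# pre-processing lives on the bag - e.g. strip stop words before
# training on it or classifying it
bag.remove(%w(the on))

# StemmedWordBag works the same way, but stems words as they are added
stemmed = Yanbi::StemmedWordBag.new("run running runs")
```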
@@ -163,6 +181,41 @@ docs.each_doc do |d|
  end
  ```

+ ## Feature thresholds
+
+ A method on the classifier is provided to prune infrequently seen features. This is often one of the first things recommended for improving the accuracy of a classifier in real world applications. Note that when you prune features, there's no un-pruning afterwards - so be sure you actually want to do it!
+
+
+ ```ruby
+ classifier = Yanbi.default(:even, :odd)
+
+ #...tons of training happens here...
+
+ #we now have thousands of documents. Ignore any words we haven't
+ #seen at least a dozen times
+
+ classifier.set_significance(12)
+
+ #actually, the 'odd' category is especially noisy, so let's make
+ #that two dozen for odd items
+
+ classifier.set_significance(24, :odd)
+ ```
+
+ ## Persisting
+
+ After going to all of the trouble of training a classifier on a large corpus, it would be very useful to save it to disk for later use. You can do just that with the appropriately named save and load functions:
+
+ ```ruby
+ classifier.save('testclassifier')
+
+ #...some time later
+
+ newclassifier = Yanbi::Bayes.load('testclassifier')
+ ```
+
+ Note that an .obj extension is added to saved classifiers by default - no need to explicitly include it.
+
  ## Putting it all together

  ```ruby
@@ -176,11 +229,43 @@ other.add_file('biglistofotherstuff.txt', '@@@@')

  stuff.each_doc {|d| classifier.train(:stuff, d)}
  otherstuff.each_doc {|d| classifier.train(:otherstuff, d)}
+
+ #...classify all the things....
+ ```
+
+ A slightly fancier example:
+
+ ```ruby
+
+ STOP_WORDS = %w(in the a and at of)
+
+ #classify using stemmed words
+ classifier = Yanbi::Bayes.new(Yanbi::StemmedWordBag, :stuff, :otherstuff)
+
+ #create our corpora
+ stuff = Yanbi::Corpus.new(Yanbi::StemmedWordBag)
+ stuff.add_file('biglistofstuff.txt', '****')
+
+ other = Yanbi::Corpus.new(Yanbi::StemmedWordBag)
+ other.add_file('biglistofotherstuff.txt', '@@@@')
+
+ #get rid of those nasty stop words
+ stuff.each_doc {|d| d.remove(STOP_WORDS)}
+ other.each_doc {|d| d.remove(STOP_WORDS)}
+
+ #train away!
+ stuff.each_doc {|d| classifier.train(:stuff, d)}
+ other.each_doc {|d| classifier.train(:otherstuff, d)}
+
+ #get rid of the long tail
+ classifier.set_significance(50)
+
+ #...classify all the things....
  ```

  ## Contributing

- Bug reports and pull requests are welcome on GitHub at https://github.com/rdormer/yanbi-ml.
+ Bug reports, corrections of any tragic mathematical misunderstandings, and pull requests are welcome on GitHub at https://github.com/rdormer/yanbi-ml.


  ## License
@@ -34,6 +34,18 @@ module Yanbi
        self.new(WordBag, *categories)
      end

+     def self.load(fname)
+       c = YAML::load(File.read(fname + ".obj"))
+       raise LoadError unless c.is_a? self
+       c
+     end
+
+     def save(name)
+       File.open(name + ".obj", 'w') do |out|
+         YAML.dump(self, out)
+       end
+     end
+
      def train(category, document)
        cat = category.to_sym
        @document_counts[cat] += 1
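
As a quick illustration of the new persistence methods above, here is a minimal sketch of a save/load round trip; the categories and file name are made up for the example:

```ruby
classifier = Yanbi::Bayes.default(:spam, :ham)
classifier.train_raw(:spam, "buy cheap pills now")
classifier.train_raw(:ham, "meeting minutes attached")

# save serializes the whole classifier to YAML in '<name>.obj'
classifier.save('mailfilter')

# ...in a later session...

# load reads the YAML back, and raises LoadError if the file does not
# deserialize into the expected classifier class
restored = Yanbi::Bayes.load('mailfilter')
restored.classify_raw("cheap pills")
```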
@@ -69,13 +81,7 @@ module Yanbi
      def newdoc(doc)
        Yanbi.const_get(@bag_class).new(doc)
      end
-
-     def save(name)
-       File.open(name + ".obj", 'w') do |out|
-         YAML.dump(self, out)
-       end
-     end
-
+
      private

      def cond_prob(cat, document)
@@ -102,11 +108,6 @@ module Yanbi
        @categories[i]
      end

-     # def weighted_prob(word, category, basicprob, weight=1.0, ap=0.5)
-     #   #basicprob = word_prob(category, word) if basicprob.nil?
-     #   totals = @category_counts.inject(0) {|sum, cat| sum += cat.last[word].to_i}
-     #   ((weight * ap) + (totals*basicprob)) / (weight + totals)
-     # end
    end

  end
@@ -1,6 +1,10 @@
+ # Author:: Robert Dormer (mailto:rdormer@gmail.com)
+ # Copyright:: Copyright (c) 2016 Robert Dormer
+ # License:: MIT
+
  module Yanbi

-   class Fisher < Yanbi::Bayes
+   class Fisher < Bayes

      def classify(text)
        max_score(text) do |cat, doc|
@@ -12,36 +16,31 @@ module Yanbi

      def fisher_score(category, document)
        features = document.words.uniq
-       pscores = 1
-
-
-       ###
-       #compute weighted probabilities for each word/cat tuple
-       #and then multiply them all together...
-       ##
-
-
-
-       features.each do |word|
-         clf = word_prob(category, word)
-         freqsum = @categories.reduce(0) {|sum, x| sum + word_prob(x, word)}
-         pscores *= (clf / freqsum) if clf > 0
-       end
-
-       #####
-
-
-       #compute fisher factor of pscores
+       probs = features.map {|x| weighted_prob(x, category)}
+       pscores = probs.reduce(&:*)
        score = -2 * Math.log(pscores)
-
-       #this is okay
        invchi2(score, features.count * 2)
      end
-
+
+     def category_prob(cat, word)
+       wp = word_prob(cat, word)
+       sum = @categories.inject(0) {|s,c| s + word_prob(c, word)}
+       return 0 if sum.zero?
+       wp / sum
+     end
+
      def word_prob(cat, word)
-       @category_counts[cat][word].to_f / @document_counts[cat]
+       all_word_count = @category_counts[cat].values.reduce(&:+)
+       count = @category_counts[cat].has_key?(word) ? @category_counts[cat][word].to_f : 0
+       count / all_word_count
      end
-
+
+     def weighted_prob(word, category, basicprob=nil, weight=1.0, ap=0.5)
+       basicprob = category_prob(category, word)
+       totals = @category_counts.inject(0) {|sum, cat| sum += cat.last[word].to_i}
+       ((weight * ap) + (totals*basicprob)) / (weight + totals)
+     end
+
      def invchi2(chi, df)
        m = chi / 2.0
        sum = Math.exp(-m)
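
To make the new Fisher scoring path above easier to follow, here is a standalone sketch of the calculation it performs; the per-feature probabilities are made up, and invchi2 is only referenced rather than reimplemented:

```ruby
# per-feature probabilities that the document belongs to the category,
# as produced by weighted_prob: category_prob smoothed toward a 0.5
# prior for rarely seen words
probs = [0.9, 0.8, 0.7]

# Fisher's method combines them into one statistic; under the null
# hypothesis (the document is not in the category) it follows a chi
# squared distribution with 2 * n degrees of freedom for n features
score = -2 * Math.log(probs.reduce(:*))   # equivalently, -2 * the sum of the logs

# invchi2(score, probs.size * 2) then converts the statistic into the
# probability that classify uses to rank the categories
```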
@@ -3,5 +3,5 @@
  # License:: MIT

  module Yanbi
-   VERSION = "0.1.2"
+   VERSION = "0.2.0"
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: yanbi-ml
  version: !ruby/object:Gem::Version
-   version: 0.1.2
+   version: 0.2.0
  prerelease:
  platform: ruby
  authors:
@@ -9,7 +9,7 @@ authors:
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2016-07-05 00:00:00.000000000 Z
+ date: 2016-07-08 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: bundler