yanbi-ml 0.1.2 → 0.2.0

data/README.md CHANGED
@@ -1,6 +1,6 @@
  # YANBI-ML
 
- Yet Another Naive Bayes Implementation
+ Yet Another Naive Bayes Implementation - Bayes and Fisher document classifiers
 
  ## Installation
 
@@ -34,9 +34,27 @@ classifier.train_raw(:odd, "one three five seven")
  classifier.classify_raw("one two three") => :odd
  ```
 
+ ## What is a Fisher Classifier?
+
+ An alternative to the standard Bayesian classifier that can also give very accurate results. A Bayesian classifier works by computing a single, document-wide probability for each class that a document might belong to. A Fisher classifier, by contrast, computes a probability for each individual feature in a document. If the document does not belong to a given class, you would expect a random distribution of probabilities across its features - in fact, the eponymous Fisher showed that you would generally get a *chi squared distribution* of probabilities. If the document does belong to the class, you would expect the probabilities to be skewed towards higher values instead. A Fisher classifier uses Fisher's statistical method (a p-value) to measure how far the feature probabilities in a document diverge from the expected random distribution.
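+
+ To make the math concrete, here is a minimal sketch of the Fisher combining step (purely illustrative - the numbers and variable names below are hypothetical, not part of the gem's API):
+
+ ```ruby
+ #hypothetical per-feature probabilities for one document
+ probs = [0.8, 0.7, 0.9]
+
+ #Fisher's method: -2 times the log of the product of the p-values...
+ score = -2 * Math.log(probs.reduce(:*))
+
+ #...is chi-squared distributed with two degrees of freedom per feature,
+ #so the score is compared against that distribution (the gem's invchi2)
+ df = probs.count * 2
+ ```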
+
+ ## I don't care, I just want to use it!
+
+ Fortunately, the interface is pretty consistent:
+
+ ```ruby
+ classifier = Yanbi::Fisher.default(:even, :odd)
+ classifier.train_raw(:even, "two four six eight")
+ classifier.train_raw(:odd, "one three five seven")
+
+ classifier.classify_raw("one two three") => :odd
+ ```
+
+ See? Easy.
+
  ## Bags (of words)
 
- A bag of words is a just a Hash of word counts (a multi-set of word frequencies, to ML folk). This makes a useful abstraction because you can use it with more than one kind of classifier, and because the bag provides a natural location for various kinds of pre-processing you might want to do to the words (features) of the text before training with or classifying them.
+ A bag of words is just a Hash of word counts (a multi-set of word frequencies, to ML folk). This makes a useful abstraction because you can use it with more than one kind of classifier, and because the bag provides a natural place for any pre-processing you might want to do to the words (features) of the text before training on or classifying them. Although a single bag can contain as many documents as you want, in practice it's a good idea to treat each word bag as corresponding to a single document.
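+
+ For intuition, a bag built from the string "one two two three" boils down to word counts like these (an illustrative sketch, not necessarily the gem's internal representation):
+
+ ```ruby
+ {"one" => 1, "two" => 2, "three" => 1}
+ ```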
 
  A handful of classes are provided:
 
@@ -163,6 +181,41 @@ docs.each_doc do |d|
  end
  ```
 
+ ## Feature thresholds
+
+ The classifier provides a set_significance method for pruning infrequently seen features. This is often one of the first things recommended for improving the accuracy of a classifier in real world applications. Note that once you prune features, there's no un-pruning afterwards - so be sure you actually want to do it!
+
+ ```ruby
+ classifier = Yanbi.default(:even, :odd)
+
+ #...tons of training happens here...
+
+ #we now have thousands of documents. Ignore any words we haven't
+ #seen at least a dozen times
+
+ classifier.set_significance(12)
+
+ #actually, the 'odd' category is especially noisy, so let's make
+ #that two dozen for odd items
+
+ classifier.set_significance(24, :odd)
+ ```
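+
+ Under the hood, pruning amounts to dropping low-count entries from the word-count Hashes. A rough sketch of the idea (hypothetical - not the gem's actual implementation):
+
+ ```ruby
+ #drop every word seen fewer than threshold times
+ threshold = 12
+ counts = {"the" => 500, "cromulent" => 3}
+ counts.reject! {|word, count| count < threshold}
+ #=> {"the" => 500}
+ ```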
+
+ ## Persisting
+
+ After going to all the trouble of training a classifier on a large corpus, it would be very useful to save it to disk for later use. You can do just that with the appropriately named save and load functions:
+
+ ```ruby
+ classifier.save('testclassifier')
+
+ #...some time later
+
+ newclassifier = Yanbi::Bayes.load('testclassifier')
+ ```
+
+ Note that a .obj extension is added to saved classifiers by default - there's no need to include it explicitly.
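+
+ Load through the class you expect to get back: load raises a LoadError if the saved object isn't an instance of the class it's called on. So a Fisher classifier round-trips like this (a short sketch; the 'testfisher' name is just an example):
+
+ ```ruby
+ fisher = Yanbi::Fisher.default(:even, :odd)
+ fisher.save('testfisher')     #writes testfisher.obj
+
+ restored = Yanbi::Fisher.load('testfisher')
+ ```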
+
  ## Putting it all together
 
  ```ruby
@@ -176,11 +229,43 @@ other.add_file('biglistofotherstuff.txt', '@@@@')
 
  stuff.each_doc {|d| classifier.train(:stuff, d)}
  otherstuff.each_doc {|d| classifier.train(:otherstuff, d)}
+
+ #...classify all the things....
+ ```
+
+ A slightly fancier example:
+
+ ```ruby
+
+ STOP_WORDS = %w(in the a and at of)
+
+ #classify using stemmed words
+ classifier = Yanbi::Bayes.new(Yanbi::StemmedWordBag, :stuff, :otherstuff)
+
+ #create our corpora
+ stuff = Yanbi::Corpus.new(Yanbi::StemmedWordBag)
+ stuff.add_file('biglistofstuff.txt', '****')
+
+ other = Yanbi::Corpus.new(Yanbi::StemmedWordBag)
+ other.add_file('biglistofotherstuff.txt', '@@@@')
+
+ #get rid of those nasty stop words
+ stuff.each_doc {|d| d.remove(STOP_WORDS)}
+ other.each_doc {|d| d.remove(STOP_WORDS)}
+
+ #train away!
+ stuff.each_doc {|d| classifier.train(:stuff, d)}
+ other.each_doc {|d| classifier.train(:otherstuff, d)}
+
+ #get rid of the long tail
+ classifier.set_significance(50)
+
+ #...classify all the things....
  ```
 
  ## Contributing
 
- Bug reports and pull requests are welcome on GitHub at https://github.com/rdormer/yanbi-ml.
+ Bug reports, corrections of any tragic mathematical misunderstandings, and pull requests are welcome on GitHub at https://github.com/rdormer/yanbi-ml.
 
 
  ## License
@@ -34,6 +34,18 @@ module Yanbi
    self.new(WordBag, *categories)
  end
 
+ def self.load(fname)
+   c = YAML::load(File.read(fname + ".obj"))
+   raise LoadError unless c.is_a? self
+   c
+ end
+
+ def save(name)
+   File.open(name + ".obj", 'w') do |out|
+     YAML.dump(self, out)
+   end
+ end
+
  def train(category, document)
    cat = category.to_sym
    @document_counts[cat] += 1
@@ -69,13 +81,7 @@ module Yanbi
  def newdoc(doc)
    Yanbi.const_get(@bag_class).new(doc)
  end
-
- def save(name)
-   File.open(name + ".obj", 'w') do |out|
-     YAML.dump(self, out)
-   end
- end
-
+
  private
 
  def cond_prob(cat, document)
@@ -102,11 +108,6 @@ module Yanbi
    @categories[i]
  end
 
- # def weighted_prob(word, category, basicprob, weight=1.0, ap=0.5)
- #   #basicprob = word_prob(category, word) if basicprob.nil?
- #   totals = @category_counts.inject(0) {|sum, cat| sum += cat.last[word].to_i}
- #   ((weight * ap) + (totals*basicprob)) / (weight + totals)
- # end
  end
 
  end
@@ -1,6 +1,10 @@
+ # Author:: Robert Dormer (mailto:rdormer@gmail.com)
+ # Copyright:: Copyright (c) 2016 Robert Dormer
+ # License:: MIT
+
  module Yanbi
 
- class Fisher < Yanbi::Bayes
+ class Fisher < Bayes
 
    def classify(text)
      max_score(text) do |cat, doc|
@@ -12,36 +16,31 @@ module Yanbi
 
  def fisher_score(category, document)
    features = document.words.uniq
-   pscores = 1
-
-
-   ###
-   #compute weighted probabilities for each word/cat tuple
-   #and then multiply them all together...
-   ##
-
-
-
-   features.each do |word|
-     clf = word_prob(category, word)
-     freqsum = @categories.reduce(0) {|sum, x| sum + word_prob(x, word)}
-     pscores *= (clf / freqsum) if clf > 0
-   end
-
-   #####
-
-
-   #compute fisher factor of pscores
+   probs = features.map {|x| weighted_prob(x, category)}
+   pscores = probs.reduce(&:*)
    score = -2 * Math.log(pscores)
-
-   #this is okay
    invchi2(score, features.count * 2)
  end
-
+
+ def category_prob(cat, word)
+   wp = word_prob(cat, word)
+   sum = @categories.inject(0) {|s,c| s + word_prob(c, word)}
+   return 0 if sum.zero?
+   wp / sum
+ end
+
  def word_prob(cat, word)
-   @category_counts[cat][word].to_f / @document_counts[cat]
+   all_word_count = @category_counts[cat].values.reduce(&:+)
+   count = @category_counts[cat].has_key?(word) ? @category_counts[cat][word].to_f : 0
+   count / all_word_count
  end
-
+
+ def weighted_prob(word, category, basicprob=nil, weight=1.0, ap=0.5)
+   basicprob = category_prob(category, word)
+   totals = @category_counts.inject(0) {|sum, cat| sum += cat.last[word].to_i}
+   ((weight * ap) + (totals*basicprob)) / (weight + totals)
+ end
+
  def invchi2(chi, df)
    m = chi / 2.0
    sum = Math.exp(-m)
@@ -3,5 +3,5 @@
  # License:: MIT
 
  module Yanbi
-   VERSION = "0.1.2"
+   VERSION = "0.2.0"
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: yanbi-ml
  version: !ruby/object:Gem::Version
-   version: 0.1.2
+   version: 0.2.0
  prerelease:
  platform: ruby
  authors:
@@ -9,7 +9,7 @@ authors:
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2016-07-05 00:00:00.000000000 Z
+ date: 2016-07-08 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: bundler