yanbi-ml 0.1.0 → 0.1.1

Files changed (3)
  1. data/README.md +156 -2
  2. data/lib/version.rb +1 -1
  3. metadata +1 -1
data/README.md CHANGED
@@ -20,8 +20,163 @@ Or install it yourself as:
 
  ## Usage
 
- TODO: Write usage instructions here
+ This is a Naive Bayesian classifier based on the bag-of-words model that is so popular in the text classification literature. Although it is built primarily around these bags, it also includes an interface for training on and classifying raw text directly. The gem is written with an eye towards training on and classifying large sets of documents as painlessly as possible. I originally wrote this for an unpublished project of mine, and decided the interface might be useful for other people :)
 
+ ## I want to keep it simple!
+
+ Okay, do this:
+
+ ```ruby
+ classifier = Yanbi::Bayes.default(:even, :odd)
+ classifier.train_raw(:even, "two four six eight")
+ classifier.train_raw(:odd, "one three five seven")
+
+ classifier.classify_raw("one two three") # => :odd
+ ```
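For readers who want to see why the example above comes out as `:odd`, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing. It is plain Ruby written for illustration; `ToyBayes` is a hypothetical class, and nothing here is taken from yanbi-ml's actual implementation, which may score documents differently.

```ruby
# Toy multinomial Naive Bayes over word counts.
# Illustrative only: ToyBayes is a hypothetical name, not yanbi-ml code.
class ToyBayes
  def initialize(*categories)
    @counts = categories.to_h { |c| [c, Hash.new(0)] } # word counts per category
    @docs   = Hash.new(0)                               # documents seen per category
  end

  def train_raw(category, text)
    text.downcase.scan(/\w+/).each { |w| @counts[category][w] += 1 }
    @docs[category] += 1
  end

  def classify_raw(text)
    words      = text.downcase.scan(/\w+/)
    vocab      = @counts.values.flat_map(&:keys).uniq.size
    total_docs = @docs.values.sum.to_f

    @counts.keys.max_by do |cat|
      total_words = @counts[cat].values.sum
      # log prior plus the sum of log likelihoods, with add-one smoothing
      words.sum(Math.log(@docs[cat] / total_docs)) do |w|
        Math.log((@counts[cat][w] + 1.0) / (total_words + vocab))
      end
    end
  end
end

classifier = ToyBayes.new(:even, :odd)
classifier.train_raw(:even, "two four six eight")
classifier.train_raw(:odd, "one three five seven")
classifier.classify_raw("one two three") # => :odd
```

In this sketch "one" and "three" were each seen once under `:odd`, while only "two" was seen under `:even`, so the `:odd` score is higher.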
+
+ ## Bags (of words)
+
+ A bag of words is just a Hash of word counts (a multiset of word frequencies, to ML folk). This makes a useful abstraction because you can use it with more than one kind of classifier, and because the bag provides a natural place for the various kinds of pre-processing you might want to apply to the words (features) of the text before training on or classifying them.
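As a rough illustration of the idea, the sketch below builds such a Hash from a string in plain Ruby. The `word_counts` helper and its tokenizing regex are assumptions made for this example, not the gem's own normalization rules.

```ruby
# Hypothetical helper, not part of yanbi-ml: count the words in a string.
def word_counts(text)
  text.downcase.scan(/[a-z0-9']+/).each_with_object(Hash.new(0)) do |word, counts|
    counts[word] += 1
  end
end

counts = word_counts("The cat saw the other cat.")
# => {"the"=>2, "cat"=>2, "saw"=>1, "other"=>1}

# Pre-processing then becomes ordinary Hash manipulation, e.g. dropping stop words:
counts.reject { |word, _| %w(the a).include?(word) }
# => {"cat"=>2, "saw"=>1, "other"=>1}
```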
+
+ A handful of classes are provided:
+
+ <ul>
+ <li>WordBag - basic, default bag of words</li>
+ <li>StemmedWordBag - bag of words with lemmatization (stemming)</li>
+ <li>DiadBag - bag of overlapping pairs of words</li>
+ <li>StemmedDiadBag - overlapping pairs of stemmed words</li>
+ </ul>
+
+ All of these classes do the same basic standardization of text: lowercasing, punctuation and whitespace stripping, and so on. Choosing among them gives you some flexibility in how you process and classify text:
+
+ ```ruby
+ # I want to use stemmed words!
+ classifier = Yanbi::Bayes.new(Yanbi::StemmedWordBag, :even, :odd)
+ classifier.train_raw(:even, "two four six eight")
+ classifier.train_raw(:odd, "one three five seven")
+ classifier.classify_raw("one two three") # => :odd
+ ```
+
+ Or, if you want to deal with bags directly:
+
+ ```ruby
+ classifier = Yanbi::Bayes.new(Yanbi::StemmedWordBag, :even, :odd)
+ classifier.train(:even, classifier.newdoc('two four six eight'))
+ classifier.train(:odd, classifier.newdoc('one three five seven'))
+ classifier.classify(classifier.newdoc('one two three')) # => :odd
+ ```
+
+ The newdoc method creates the type of bag associated with that classifier. Although it's not strictly necessary to keep the type of word bag you use with a classifier consistent, it's recommended unless you have a good reason not to. Using the newdoc method makes that much easier.
+
+ Of course, you can also create word bags directly:
+
+ ```ruby
+ bag = Yanbi::WordBag.new('this is a test, of the emergency broadcast system')
+ ```
+
+ and query them:
+ ```ruby
+ bag = Yanbi::WordBag.new('one two three')
+ bag.words # => ["one", "two", "three"]
+ bag.word_counts # => {"one"=>1, "two"=>1, "three"=>1}
+
+ bag = Yanbi::DiadBag.new('one two three four')
+ bag.words # => ["one two", "two three", "three four"]
+ bag.word_counts # => {"one two"=>1, "two three"=>1, "three four"=>1}
+
+ bag = Yanbi::StemmedWordBag.new
+ bag.empty? # => true
+ ```
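The overlapping pairs shown for DiadBag can be reproduced in plain Ruby with `each_cons`. This is only a sketch of the idea; the `diads` helper is a hypothetical name and the gem may tokenize or normalize differently.

```ruby
# Hypothetical helper, not part of yanbi-ml: overlapping pairs of adjacent words.
def diads(text)
  text.downcase.scan(/\w+/).each_cons(2).map { |pair| pair.join(" ") }
end

diads("one two three four") # => ["one two", "two three", "three four"]
diads("one")                # => [] (a single word has no pairs)
```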
+
+ You can also add text after the fact:
+ ```ruby
+ bag = Yanbi::WordBag.new('one two three')
+ bag.add_text('four five six seven')
+ bag.words # => ["one", "two", "three", "four", "five", "six", "seven"]
+ ```
+
+ And remove words:
+ ```ruby
+ bag = Yanbi::WordBag.new('one two three four five six seven')
+ bag.remove(%w(one three five))
+ bag.words # => ["two", "four", "six", "seven"]
+ ```
+
+ And see where bags of words overlap:
+ ```ruby
+ first = Yanbi::WordBag.new('one two three four')
+ second = Yanbi::WordBag.new('three four five six')
+ first.intersection(second) # => ["three", "four"]
+ ```
+
+ ## Corpora
+
+ A Corpus is a set of related documents, and naturally, a Corpus class is provided to process text and documents into a collection of word bags. It can accept text directly or from a file. It can optionally accept multiple documents concatenated together (this makes dealing with large numbers of documents a *lot* easier), along with a regex specifying a comment pattern (for metadata or feature shaping). A comment can either be enclosed (/*like this*/) or run to the end of the line (//like these), depending on which regex you choose.
+
+ A corpus is created with an associated word bag type. By default, this is the basic WordBag.
+
+ ```ruby
+ # Just make a basic corpus, no muss, no fuss
+ docs = Yanbi::Corpus.new
+
+ # I want to stem!
+ docs = Yanbi::Corpus.new(Yanbi::StemmedWordBag)
+ ```
+
+ Once that's done, it's on to creating the actual corpus:
+ ```ruby
+
+ # just load a file as a single document
+ docs.add_file('biglistofstuff.txt')
+
+ # to make things easier, I pasted a ton of documents into a
+ # text file and separated them with a **** delimiter
+ docs.add_file('biglistofstuff.txt', '****')
+
+ # to make things easier, I pasted a ton of documents into a
+ # text file and separated them with a **** delimiter, and
+ # commented out noise like so: %%noise noise noise%%
+ docs.add_file('biglistofstuff.txt', '****', /\%\%.+\%\%/)
+ ```
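To make the delimiter and comment arguments concrete, here is a small plain-Ruby sketch of what splitting a concatenated file might look like. `split_documents` is a hypothetical helper and an assumption about the behaviour described above, not the gem's implementation. It uses a non-greedy `.+?` so that two comments on the same line are stripped separately; the greedy `.+` in the example above would also remove any text between them.

```ruby
# Hypothetical helper, not part of yanbi-ml: strip comments, then split
# a concatenated string into individual documents.
def split_documents(text, delimiter = nil, comment = nil)
  text = text.gsub(comment, "") if comment           # drop commented-out noise
  docs = delimiter ? text.split(delimiter) : [text]  # one document if no delimiter
  docs.map(&:strip).reject(&:empty?)
end

raw = "first doc %%noise%% here\n****\nsecond doc\n****\n"
split_documents(raw, "****", /%%.+?%%/)
# => ["first doc  here", "second doc"]
```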
+
+ Of course you're not limited to files:
+
+ ```ruby
+ array_of_strings.each do |current|
+   docs.add_doc(current)
+ end
+
+ # wait, these have comments!
+ array_of_commented_strings.each do |current|
+   docs.add_doc(current, /\%\%.+\%\%/)
+ end
+
+ ```
+
+ Once you've started adding documents, they're available for iteration as word bags of the type you specified when you created the corpus:
+
+ ```ruby
+ STOP_WORDS = %w(the a at in and of)
+
+ docs.each_doc do |d|
+   d.remove(STOP_WORDS)
+ end
+ ```
+
+ ## Putting it all together
+
+ ```ruby
+ classifier = Yanbi.default(:stuff, :otherstuff)
+
+ stuff = Yanbi::Corpus.new
+ stuff.add_file('biglistofstuff.txt', '****')
+
+ other = Yanbi::Corpus.new
+ other.add_file('biglistofotherstuff.txt', '@@@@')
+
+ stuff.each_doc {|d| classifier.train(:stuff, d)}
+ other.each_doc {|d| classifier.train(:otherstuff, d)}
+ ```
 
  ## Contributing
 
@@ -31,4 +186,3 @@ Bug reports and pull requests are welcome on GitHub at https://github.com/rdorme
  ## License
 
  The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).
-
data/lib/version.rb CHANGED
@@ -3,5 +3,5 @@
  # License:: MIT
 
  module Yanbi
-   VERSION = "0.1.0"
+   VERSION = "0.1.1"
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: yanbi-ml
  version: !ruby/object:Gem::Version
-   version: 0.1.0
+   version: 0.1.1
  prerelease:
  platform: ruby
  authors: