green_midget 0.1.0 → 0.1.1
- data/README.md +108 -40
- data/bin/greenmidget +62 -0
- data/lib/green_midget/version.rb +1 -1
- metadata +7 -5
data/README.md
CHANGED
|
@@ -6,43 +6,84 @@ On Bayesian Classification
|
|
|
6
6
|
|
|
7
7
|
This project started during an internship at SoundCloud.
|
|
8
8
|
|
|
9
|
-
Using SoundCloud's private messaging means that you can effectively reach out
|
|
9
|
+
Using SoundCloud's private messaging means that you can effectively reach out
|
|
10
|
+
to everyone on the Cloud. On top of that, you have track commenting, groups
|
|
11
|
+
posting, forum topics, track sharing - we care about your voice being heard!
|
|
12
|
+
And read.
|
|
10
13
|
|
|
11
|
-
I'll put this in some perspective and say that we're now having daily text
|
|
14
|
+
I'll put this in some perspective and say that we're now having daily text
|
|
15
|
+
exchange volume in the order of hundreds of thousands. And it's also rapidly
|
|
16
|
+
going up.
|
|
12
17
|
|
|
13
|
-
And while most of this runs smoother than Berliner beer on a SoundCloud Friday,
|
|
18
|
+
And while most of this runs smoother than Berliner beer on a SoundCloud Friday,
|
|
19
|
+
violations to our [Community guidelines][guidelines] are starting to be less
|
|
20
|
+
and less of an exception. So I've been given the task to address this and
|
|
21
|
+
build a system that progressively learns how to tell good community behaviour
|
|
22
|
+
from less good - welcome to the:
|
|
14
23
|
|
|
15
24
|
GreenMidget
|
|
16
25
|
----------
|
|
17
26
|
|
|
18
|
-
GreenMidget is a trainable, feature-full Bayesian text classifier. Out of the
|
|
27
|
+
GreenMidget is a trainable, feature-full Bayesian text classifier. Out of the
|
|
28
|
+
box it's super straightforward to use, but it also offers easy customisation
|
|
29
|
+
options. It's a Ruby gem and today we're open sourcing it, so you can start
|
|
30
|
+
with it within a minute after the:
|
|
19
31
|
|
|
20
32
|
Installation
|
|
21
33
|
----------
|
|
22
34
|
|
|
23
|
-
|
|
35
|
+
If you're using Bundler, simply add the following to your Gemfile:
|
|
24
36
|
|
|
25
37
|
gem 'green_midget'
|
|
26
38
|
|
|
27
|
-
|
|
39
|
+
and then run
|
|
28
40
|
|
|
29
41
|
bundle install
|
|
30
42
|
|
|
31
43
|
after which (so that you get the ActiveRecord backend ready):
|
|
32
44
|
|
|
33
|
-
rake green_midget:setup:active_record
|
|
45
|
+
bundle exec rake green_midget:setup:active_record
|
|
46
|
+
|
|
47
|
+
This creates a `green_midget_records` table and populates some entries there.
|
|
34
48
|
|
|
35
49
|
You're now done.
|
|
36
50
|
|
|
51
|
+
Try it out (right on the CLI)
|
|
52
|
+
----------
|
|
53
|
+
After you install the gem a shell executable is available for a quick play
|
|
54
|
+
with an online GreenMidget server trained on ~ 9000 public spam and ham
|
|
55
|
+
examples posted on SoundCloud as posts or track comments.
|
|
56
|
+
|
|
57
|
+
$ greenmidget 'buy cheap bags online'
|
|
58
|
+
$ greenmidget 'upload and share cool tracks online'
|
|
59
|
+
$ greenmidget potential_spam.txt # will read the file and classify the text
|
|
60
|
+
|
|
61
|
+
Go ahead and play around a bit, but keep in mind that this online service is in a
|
|
62
|
+
very early training stage and lacks even basic features (see below).
|
|
63
|
+
|
|
37
64
|
How it works
|
|
38
65
|
----------
|
|
39
66
|
|
|
40
|
-
GreenMidget
|
|
67
|
+
GreenMidget is a Naive Bayes implementation that uses a Log ratio of spam vs
|
|
68
|
+
ham probabilities for a given object to classify it to any of the categories.
|
|
69
|
+
There's an indecisive range as well - by default between 0 and Log(3).
|
|
70
|
+
Everything under 0 will be considered legit and above Log(3) will be spam.
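
As a rough illustration of the thresholds just described, here is a minimal sketch of the decision rule. It is not taken from GreenMidget's source: the method name `decide`, the constant `UPPER_BOUND` and the use of the natural logarithm for "Log" are all assumptions made for this example.

```ruby
# Hypothetical sketch of a log-ratio decision rule; not GreenMidget's code.
# Assumes "Log" means the natural logarithm.
UPPER_BOUND = Math.log(3) # default upper edge of the indecisive range

# Returns 1 (spam), 0 (not enough evidence) or -1 (legit).
def decide(spam_probability, ham_probability)
  # Subtracting logs avoids integer division surprises and equals log(spam / ham).
  log_ratio = Math.log(spam_probability) - Math.log(ham_probability)

  if log_ratio > UPPER_BOUND
    1
  elsif log_ratio < 0
    -1
  else
    0
  end
end
```

With these defaults an object is only called spam when the spam probability is more than three times the ham probability, which keeps borderline cases in the "not enough evidence" bucket.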
|
|
71
|
+
|
|
72
|
+
GreenMidget adjusts the probabilities for individual words from training with
|
|
73
|
+
known examples and thus it improves its capability.
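
To make "adjusts the probabilities for individual words" a bit more concrete, here is a small in-memory sketch. It is only an illustration under stated assumptions: the class `TinyTrainer`, its method names and the add-one smoothing are invented for this example, and GreenMidget itself keeps its counts in the ActiveRecord backend rather than in a Hash.

```ruby
# Hypothetical in-memory trainer; names and smoothing are assumptions of this sketch.
class TinyTrainer
  def initialize
    # @counts[category][word] => number of training texts containing the word
    @counts = { spam: Hash.new(0), ham: Hash.new(0) }
    # @totals[category] => number of training texts seen for the category
    @totals = { spam: 0, ham: 0 }
  end

  # Record one known example for the given category (:spam or :ham).
  def train(text, category)
    text.downcase.scan(/\w+/).uniq.each { |word| @counts[category][word] += 1 }
    @totals[category] += 1
  end

  # Add-one smoothed estimate of P(word | category), so unseen words never get 0.
  def probability(word, category)
    (@counts[category][word] + 1.0) / (@totals[category] + 2.0)
  end
end

trainer = TinyTrainer.new
trainer.train('buy cheap bags online', :spam)
trainer.train('upload and share cool tracks online', :ham)
trainer.probability('cheap', :spam) > trainer.probability('cheap', :ham) # true
```

Every further training example shifts these per-word estimates, which is the sense in which the classifier "improves its capability".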
|
|
74
|
+
|
|
75
|
+
You can define further features (perhaps based on characteristics of the objects
|
|
76
|
+
you have to deal with) and use them to calculate probabilities. You can also
|
|
77
|
+
define heuristic checks for either category (see below for more on how to do
|
|
78
|
+
these).
|
|
41
79
|
|
|
42
80
|
Use it
|
|
43
81
|
----------
|
|
44
82
|
|
|
45
|
-
`GreenMidget::Classifier` is the interaction class that is there after
|
|
83
|
+
`GreenMidget::Classifier` is the interaction class that is there after
|
|
84
|
+
installation. It exposes two public instance methods as a start:
|
|
85
|
+
`GreenMidget::Classifier#classify_as!` and `GreenMidget::Classifier#classify`.
|
|
86
|
+
We'll illustrate them with a three-line classification session.
|
|
46
87
|
|
|
47
88
|
We'll start training `GreenMidget` with a spammy example:
|
|
48
89
|
|
|
@@ -52,81 +93,108 @@ Similarly for legitimate examples
|
|
|
52
93
|
|
|
53
94
|
GreenMidget::Classifier.new(known_legit_text).classify_as! :ham
|
|
54
95
|
|
|
55
|
-
After we've given to it some training data, we can start classifying unknown
|
|
96
|
+
After we've given it some training data, we can start classifying unknown
|
|
97
|
+
text:
|
|
56
98
|
|
|
57
99
|
decision = GreenMidget::Classifier.new(new_text).classify
|
|
58
100
|
|
|
59
|
-
`decision` is now in `[ -1, 0, 1 ]` meaning respectively 'No spam',
|
|
101
|
+
`decision` is now in `[ -1, 0, 1 ]` meaning respectively 'No spam',
|
|
102
|
+
'Not enough evidence', 'Spam'.
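
If you want to act on that value, something along these lines might do. This is just a sketch: the three values and their meanings come from the sentence above, while the constant `DECISION_LABELS` and the helper `describe` are hypothetical names that are not part of the gem.

```ruby
# Hypothetical helper; only the -1 / 0 / 1 values come from the README.
DECISION_LABELS = {
  -1 => 'No spam',
  0  => 'Not enough evidence',
  1  => 'Spam'
}.freeze

# Translate a classifier decision into its human-readable meaning.
def describe(decision)
  DECISION_LABELS.fetch(decision) do
    raise ArgumentError, "unexpected decision: #{decision.inspect}"
  end
end
```

Failing loudly on anything outside the three documented values makes it harder to silently treat an error as "no spam".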
|
|
60
103
|
|
|
61
104
|
Extend it
|
|
62
105
|
----------
|
|
63
106
|
|
|
64
|
-
If the above functionality is not enough for you and you want to add custom
|
|
107
|
+
If the above functionality is not enough for you and you want to add custom
|
|
108
|
+
logic to GreenMidget you can do that by extending the `GreenMidget::Base`
|
|
109
|
+
class (check `lib/green_midget/extensions/sample.b` in the [code][green_midget_github]
|
|
110
|
+
for an example):
|
|
65
111
|
|
|
66
|
-
* Implement heuristics logic, which will directly classify incoming object as a
|
|
112
|
+
* Implement heuristics logic, which will directly classify incoming object as a
|
|
113
|
+
given category. Example:
|
|
67
114
|
|
|
68
115
|
def pass_ham_heuristics?
|
|
69
116
|
words.count > 5 || url_in_text?
|
|
70
117
|
end
|
|
71
118
|
|
|
72
|
-
This method will be `true` for longer text or such that contains an external
|
|
119
|
+
This method will be `true` for longer text or such that contains an external
|
|
120
|
+
url. In this case the classifier would go on to the actual testing procedure.
|
|
121
|
+
If `false`, however, the procedure will not be done and the classifier will
|
|
122
|
+
return the ham category as a result. Note the default
|
|
123
|
+
`GreenMidget::Base#words` and `GreenMidget::Base#url_in_text?`
|
|
73
124
|
|
|
74
|
-
All heuristic checks return `true` by default so it's up to you whether you
|
|
125
|
+
All heuristic checks return `true` by default so it's up to you whether you
|
|
126
|
+
will define and use heuristics at all or not. However, using them can help
|
|
127
|
+
you integrate your application context and decrease classification error
|
|
128
|
+
chance especially at the edge cases.
|
|
75
129
|
|
|
76
|
-
* Expand the source of evidence. Traditionally, _naive_ Bayesian text
|
|
130
|
+
* Expand the source of evidence. Traditionally, _naive_ Bayesian text
|
|
131
|
+
classifiers see individual words as evidence and calculate category-likelihoods
|
|
132
|
+
for each word. But there could be more than that in your application context,
|
|
133
|
+
e.g. user data or specific text features.
|
|
77
134
|
|
|
78
|
-
By default GreenMidget comes with two feature definitions `url_in_text` and
|
|
135
|
+
By default GreenMidget comes with two feature definitions `url_in_text` and
|
|
136
|
+
`email_in_text`, but you can implement as many more as you want by writing a
|
|
137
|
+
boolean method that checks for the feature:
|
|
79
138
|
|
|
80
139
|
def regular_user?
|
|
81
140
|
@user.sign_up_count > 10
|
|
82
141
|
end
|
|
83
142
|
|
|
84
|
-
and then implement a `features` method that returns an array with your custom
|
|
143
|
+
and then implement a `features` method that returns an array with your custom
|
|
144
|
+
feature names:
|
|
85
145
|
|
|
86
146
|
def features
|
|
87
147
|
['regular_user', .... ]
|
|
88
148
|
end
|
|
89
149
|
|
|
90
|
-
(do make sure that the array entry is the same as the name of the method that
|
|
150
|
+
(do make sure that the array entry is the same as the name of the method that
|
|
151
|
+
would be checking for this feature)
|
|
91
152
|
|
|
92
|
-
The GreenMidget features definitions have more weight on shorter texts and
|
|
153
|
+
The GreenMidget features definitions have more weight on shorter texts and
|
|
154
|
+
less weight on longer ones, thus they provide a ground source of evidence for
|
|
155
|
+
GreenMidget's classification.
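
To show how the heuristic check and the custom features described above could fit together, here is a self-contained sketch. It deliberately does not subclass `GreenMidget::Base`: `SketchClassifier`, `present_features`, the constructor arguments and the block passed to `classify` are stand-ins invented for this example, and looking a feature up by appending `?` to its name is an assumption about how the entries in `features` map to methods.

```ruby
# Stand-in for a GreenMidget::Base subclass; every name here is illustrative only.
class SketchClassifier
  def initialize(text, sign_up_count)
    @text = text
    @sign_up_count = sign_up_count
  end

  def words
    @text.scan(/\w+/)
  end

  def url_in_text?
    @text.match?(%r{https?://\S+})
  end

  # Custom feature: a boolean method named after its entry in #features.
  def regular_user?
    @sign_up_count > 10
  end

  def features
    %w[url_in_text regular_user]
  end

  # Names of the features that hold for this object (assumes a "?" suffix).
  def present_features
    features.select { |name| public_send("#{name}?") }
  end

  # true  => go on to the actual testing procedure
  # false => skip it and return the ham category directly
  def pass_ham_heuristics?
    words.count > 5 || url_in_text?
  end

  # The block stands in for the probabilistic test.
  def classify
    return -1 unless pass_ham_heuristics?
    yield self
  end
end

short = SketchClassifier.new('nice track', 20)
short.classify { 1 }   # returns -1, the block is never called
short.present_features # returns ["regular_user"]
```

The short, URL-free text never reaches the probabilistic test, which is exactly the shortcut the heuristic is meant to provide.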
|
|
93
156
|
|
|
94
157
|
If that's not enough too, see the Contribute section below.
|
|
95
158
|
|
|
96
|
-
|
|
159
|
+
Performance
|
|
97
160
|
----------
|
|
98
161
|
|
|
99
|
-
|
|
162
|
+
GreenMidget uses ActiveRecord as backend and this guarantees wide support and
|
|
163
|
+
easy setup, however it's less performant than other data stores especially on
|
|
164
|
+
training operations. You should do such tasks asynchronously on real
|
|
165
|
+
applications. A future version backed by Redis is planned.
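
One way to keep training off the request path is a small queue drained by a worker thread. The sketch below is only an illustration of that idea: `BackgroundTrainer` and its methods are hypothetical and not part of the gem, and a real application would more likely hand the work to whatever job system it already runs.

```ruby
# Hypothetical background trainer: a queue drained by a single worker thread.
class BackgroundTrainer
  # The block receives (text, category) and performs the actual training.
  def initialize(&train)
    @train = train
    @queue = Queue.new
    @worker = Thread.new do
      # Process jobs in order until the :stop marker arrives.
      while (job = @queue.pop) != :stop
        @train.call(*job)
      end
    end
  end

  # Returns immediately; the training itself happens on the worker thread.
  def enqueue(text, category)
    @queue << [text, category]
  end

  # Finish the queued jobs, then stop the worker.
  def shutdown
    @queue << :stop
    @worker.join
  end
end
```

With the gem loaded, the block could simply call `GreenMidget::Classifier.new(text).classify_as!(category)`, so the caller only pays for pushing onto the queue.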
|
|
100
166
|
|
|
101
|
-
|
|
167
|
+
Classification Efficiency
|
|
168
|
+
----------
|
|
102
169
|
|
|
103
|
-
|
|
104
|
-
|
|
105
|
-
|
|
106
|
-
* on ~ 1 000 000 rows = 0.6773 sec / messages
|
|
170
|
+
Obviously this will depend on the training data that you have, but do give a
|
|
171
|
+
try to the Heroku GreenMidget app from the supplied CLI tool for a start (see
|
|
172
|
+
above for examples) or type:
|
|
107
173
|
|
|
108
|
-
|
|
174
|
+
$ greenmidget
|
|
109
175
|
|
|
110
|
-
|
|
111
|
-
|
|
112
|
-
|
|
113
|
-
|
|
176
|
+
on your shell for a help message. The online classifier for example lacks many
|
|
177
|
+
possible features such as heuristic checks, word stemming, stop words, etc.
|
|
178
|
+
It's only trained on the word occurrences of a total of 9000 messages (4500 of
|
|
179
|
+
each spam and ham).
|
|
114
180
|
|
|
115
|
-
|
|
181
|
+
During the development tests at SoundCloud, with those features in place, we
|
|
182
|
+
achieved more than 98% correct classification of spam examples using GreenMidget.
|
|
183
|
+
|
|
184
|
+
Thanks
|
|
116
185
|
----------
|
|
117
186
|
|
|
118
|
-
|
|
187
|
+
massively to everyone at SoundCloud for the help during the development of
|
|
188
|
+
GreenMidget.
|
|
119
189
|
|
|
120
190
|
Contribute
|
|
121
191
|
----------
|
|
122
192
|
|
|
123
|
-
|
|
124
|
-
code, just do that!
|
|
193
|
+
Just do the standard:
|
|
125
194
|
|
|
126
|
-
* Make a fork
|
|
127
|
-
* `
|
|
128
|
-
* `bundle`
|
|
129
|
-
* `bundle exec rake` to run the specs
|
|
195
|
+
* Make a fork and then:
|
|
196
|
+
* run `bundle` to set up dependencies
|
|
197
|
+
* and `bundle exec rake` to run the specs
|
|
130
198
|
* Make a patch
|
|
131
199
|
* Send a Pull Request
|
|
132
200
|
|
data/bin/greenmidget
ADDED
|
@@ -0,0 +1,62 @@
|
|
|
1
|
+
#!/usr/bin/env ruby
|
|
2
|
+
|
|
3
|
+
$LOAD_PATH.unshift(File.dirname(__FILE__) + '/../lib')
|
|
4
|
+
|
|
5
|
+
require 'net/http'
|
|
6
|
+
require 'green_midget'
|
|
7
|
+
|
|
8
|
+
def say(what); puts "==> #{what}"; end
|
|
9
|
+
|
|
10
|
+
if (text = ARGV[0]) && ARGV.size == 1
|
|
11
|
+
say "This will check your input against some of SoundCloud\'s history of spaammmm..\n"
|
|
12
|
+
say "(run without arguments for more info)\n\n"
|
|
13
|
+
|
|
14
|
+
text =
|
|
15
|
+
if File.exist?(text)
|
|
16
|
+
IO.readlines(text, '').join
|
|
17
|
+
else
|
|
18
|
+
text
|
|
19
|
+
end
|
|
20
|
+
|
|
21
|
+
uri = URI("http://freezing-earth-5798.herokuapp.com/?q=#{URI.escape(text)}")
|
|
22
|
+
response = ''
|
|
23
|
+
|
|
24
|
+
begin
|
|
25
|
+
response = Net::HTTP.get(uri).to_i
|
|
26
|
+
rescue
|
|
27
|
+
say 'An error connecting to Heroku. How about your internets?'
|
|
28
|
+
exit 1
|
|
29
|
+
end
|
|
30
|
+
|
|
31
|
+
case response
|
|
32
|
+
when 1
|
|
33
|
+
say 'Hm.. It looks not so good! ( looks like spam )'
|
|
34
|
+
when 0
|
|
35
|
+
say 'Well, i cant really tell - it could be either'
|
|
36
|
+
when -1
|
|
37
|
+
say 'And.. it sounds ok..'
|
|
38
|
+
else
|
|
39
|
+
say 'An unknown error struck!'
|
|
40
|
+
end
|
|
41
|
+
else
|
|
42
|
+
say "Checks for spam!\n\n"
|
|
43
|
+
puts <<-TEXT
|
|
44
|
+
This tool accesses an online GreenMidget service trained on 4500
|
|
45
|
+
examples of public spam messages or track comments that were posted on
|
|
46
|
+
SoundCloud. You can use it to classify your texts against it.
|
|
47
|
+
|
|
48
|
+
Examples:
|
|
49
|
+
|
|
50
|
+
greenmidget 'buy cheap bags online'
|
|
51
|
+
greenmidget 'upload cool tracks online'
|
|
52
|
+
greenmidget potential_spam.txt
|
|
53
|
+
|
|
54
|
+
Notice: This service is only used as an illustration to the GreenMidget
|
|
55
|
+
classifier, however its training is limited and it lacks even basic
|
|
56
|
+
features, that GreenMidget could provide.
|
|
57
|
+
|
|
58
|
+
This is not actually in use at SoundCloud!
|
|
59
|
+
|
|
60
|
+
read more on: http://github.com/chochkov/greenmidget
|
|
61
|
+
TEXT
|
|
62
|
+
end
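
A note for anyone running this script on a recent Ruby: `URI.escape` was removed in Ruby 3.0, so the request URL has to be built differently there. A minimal sketch of one alternative follows. The helper name `classifier_uri` is hypothetical, and it form-encodes the query (spaces become `+`), which assumes the server decodes the `q` parameter as a form value.

```ruby
require 'uri'

# Hypothetical helper; the default host is the one used by the script above.
def classifier_uri(text, base = 'http://freezing-earth-5798.herokuapp.com/')
  uri = URI(base)
  # encode_www_form escapes reserved characters such as "&" and "=" in the text.
  uri.query = URI.encode_www_form(q: text)
  uri
end
```

The returned object can be passed to `Net::HTTP.get` in place of the interpolated string.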
|
data/lib/green_midget/version.rb
CHANGED
metadata
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: green_midget
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 0.1.
|
|
4
|
+
version: 0.1.1
|
|
5
5
|
prerelease:
|
|
6
6
|
platform: ruby
|
|
7
7
|
authors:
|
|
@@ -9,12 +9,12 @@ authors:
|
|
|
9
9
|
autorequire:
|
|
10
10
|
bindir: bin
|
|
11
11
|
cert_chain: []
|
|
12
|
-
date: 2012-
|
|
12
|
+
date: 2012-03-05 00:00:00.000000000 +01:00
|
|
13
13
|
default_executable:
|
|
14
14
|
dependencies:
|
|
15
15
|
- !ruby/object:Gem::Dependency
|
|
16
16
|
name: activerecord
|
|
17
|
-
requirement: &
|
|
17
|
+
requirement: &2153446000 !ruby/object:Gem::Requirement
|
|
18
18
|
none: false
|
|
19
19
|
requirements:
|
|
20
20
|
- - ! '>='
|
|
@@ -22,11 +22,12 @@ dependencies:
|
|
|
22
22
|
version: '0'
|
|
23
23
|
type: :runtime
|
|
24
24
|
prerelease: false
|
|
25
|
-
version_requirements: *
|
|
25
|
+
version_requirements: *2153446000
|
|
26
26
|
description: Naive Bayesian Classifier with customizable features
|
|
27
27
|
email:
|
|
28
28
|
- nikola@howkul.info
|
|
29
|
-
executables:
|
|
29
|
+
executables:
|
|
30
|
+
- greenmidget
|
|
30
31
|
extensions: []
|
|
31
32
|
extra_rdoc_files: []
|
|
32
33
|
files:
|
|
@@ -39,6 +40,7 @@ files:
|
|
|
39
40
|
- Rakefile
|
|
40
41
|
- benchmark/benchmark.rb
|
|
41
42
|
- benchmark/test.rb
|
|
43
|
+
- bin/greenmidget
|
|
42
44
|
- green_midget.gemspec
|
|
43
45
|
- lib/green_midget.rb
|
|
44
46
|
- lib/green_midget/base.rb
|