green_midget 0.1.0 → 0.1.1
- data/README.md +108 -40
- data/bin/greenmidget +62 -0
- data/lib/green_midget/version.rb +1 -1
- metadata +7 -5
data/README.md
CHANGED
@@ -6,43 +6,84 @@ On Bayesian Classification
|
|
6
6
|
|
7
7
|
This project started during an internship at SoundCloud.
|
8
8
|
|
9
|
-
Using SoundCloud's private messaging means that you can effectively reach out
|
9
|
+
Using SoundCloud's private messaging means that you can effectively reach out
|
10
|
+
to everyone on the Cloud. On top of that, you have track commenting, groups
|
11
|
+
posting, forum topics, track sharing - we care about your voice being heard!
|
12
|
+
And read.
|
10
13
|
|
11
|
-
I'll put this in some perspective and say that we're now having daily text
|
14
|
+
I'll put this in some perspective and say that we're now having daily text
|
15
|
+
exchange volume in the order of hundreds of thousands. And it's also rapidly
|
16
|
+
going up.
|
12
17
|
|
13
|
-
And while most of this runs smoother than Berliner beer on a SoundCloud Friday,
|
18
|
+
And while most of this runs smoother than Berliner beer on a SoundCloud Friday,
|
19
|
+
violations to our [Community guidelines][guidelines] are starting to be less
|
20
|
+
and less of an exception. So I've been given the task to address this and
|
21
|
+
build a system that progressively learns how to tell good community behaviour
|
22
|
+
from less good - welcome to the:
|
14
23
|
|
15
24
|
GreenMidget
|
16
25
|
----------
|
17
26
|
|
18
|
-
GreenMidget is a trainable, feature-full Bayesian text classifier. Out of the
|
27
|
+
GreenMidget is a trainable, feature-full Bayesian text classifier. Out of the
|
28
|
+
box it's super straightforward to use, but it also offers easy customisation
|
29
|
+
options. It's a Ruby gem and today we're open sourcing it, so you can start
|
30
|
+
with it within a minute after the:
|
19
31
|
|
20
32
|
Installation
|
21
33
|
----------
|
22
34
|
|
23
|
-
|
35
|
+
If you're using Bundler, add the following to your Gemfile:
|
24
36
|
|
25
37
|
gem 'green_midget'
|
26
38
|
|
27
|
-
|
39
|
+
and then run
|
28
40
|
|
29
41
|
bundle install
|
30
42
|
|
31
43
|
after which (so that you get the ActiveRecord backend ready):
|
32
44
|
|
33
|
-
rake green_midget:setup:active_record
|
45
|
+
bundle exec rake green_midget:setup:active_record
|
46
|
+
|
47
|
+
This creates a `green_midget_records` table and populates some entries there.
|
34
48
|
|
35
49
|
You're now done.
|
36
50
|
|
51
|
+
Try it out (right on the CLI)
|
52
|
+
----------
|
53
|
+
After you install the gem a shell executable is available for a quick play
|
54
|
+
with an online GreenMidget server trained on ~ 9000 public spam and ham
|
55
|
+
examples posted on SoundCloud as posts or track comments.
|
56
|
+
|
57
|
+
$ greenmidget 'buy cheap bags online'
|
58
|
+
$ greenmidget 'upload and share cool tracks online'
|
59
|
+
$ greenmidget potential_spam.txt # will read the file and classify the text
|
60
|
+
|
61
|
+
Go ahead and try around a bit, but keep in mind that this online service is in a
|
62
|
+
very early training stage and lacks even basic features (see below).
|
63
|
+
|
37
64
|
How it works
|
38
65
|
----------
|
39
66
|
|
40
|
-
GreenMidget
|
67
|
+
GreenMidget is a Naive Bayes implementation that uses a Log ratio of spam vs
|
68
|
+
ham probabilities for a given object to classify it to any of the categories.
|
69
|
+
There's an indecisive range as well - by default between 0 and Log(3).
|
70
|
+
Everything under 0 will be considered legit and above Log(3) will be spam.
|
71
|
+
|
72
|
+
GreenMidget adjusts the probabilities for individual words from training with
|
73
|
+
known examples and thus it improves its capability.
|
74
|
+
|
75
|
+
You can define further features (perhaps based on characteristics of the objects
|
76
|
+
you have to deal with) and use them to calculate probabilities. You can also
|
77
|
+
define heuristic checks for either category (see below for more on how to do
|
78
|
+
these).
|
41
79
|
|
42
80
|
Use it
|
43
81
|
----------
|
44
82
|
|
45
|
-
`GreenMidget::Classifier` is the interaction class that is there after
|
83
|
+
`GreenMidget::Classifier` is the interaction class that is there after
|
84
|
+
installation. It exposes two public instance methods as a start:
|
85
|
+
`GreenMidget::Classifier#classify_as!` and `GreenMidget::Classifier#classify`.
|
86
|
+
We'll illustrate both in a three-line classification session.
|
46
87
|
|
47
88
|
We'll start training `GreenMidget` with a spammy example:
|
48
89
|
|
@@ -52,81 +93,108 @@ Similarly for legitimate examples
|
|
52
93
|
|
53
94
|
GreenMidget::Classifier.new(known_legit_text).classify_as! :ham
|
54
95
|
|
55
|
-
After we've given to it some training data, we can start classifying unknown
|
96
|
+
After we've given it some training data, we can start classifying unknown
|
97
|
+
text:
|
56
98
|
|
57
99
|
decision = GreenMidget::Classifier.new(new_text).classify
|
58
100
|
|
59
|
-
`decision` is now in `[ -1, 0, 1 ]` meaning respectively 'No spam',
|
101
|
+
`decision` is now in `[ -1, 0, 1 ]` meaning respectively 'No spam',
|
102
|
+
'Not enough evidence', 'Spam'.
|
60
103
|
|
61
104
|
Extend it
|
62
105
|
----------
|
63
106
|
|
64
|
-
If the above functionality is not enough for you and you want to add custom
|
107
|
+
If the above functionality is not enough for you and you want to add custom
|
108
|
+
logic to GreenMidget you can do that by extending the `GreenMidget::Base`
|
109
|
+
class (check `lib/green_midget/extensions/sample.b` in the [code][green_midget_github]
|
110
|
+
for an example):
|
65
111
|
|
66
|
-
* Implement heuristics logic, which will directly classify incoming object as a
|
112
|
+
* Implement heuristics logic, which will directly classify incoming object as a
|
113
|
+
given category. Example:
|
67
114
|
|
68
115
|
def pass_ham_heuristics?
|
69
116
|
words.count > 5 || url_in_text?
|
70
117
|
end
|
71
118
|
|
72
|
-
This method will be `true` for longer text or such that contains an external
|
119
|
+
This method will be `true` for longer text or such that contains an external
|
120
|
+
url. In this case the classifier would go on to the actual testing procedure.
|
121
|
+
If `false`, however, the procedure will not be done and the classifier will
|
122
|
+
return the ham category as a result. Note the default
|
123
|
+
`GreenMidget::Base#words` and `GreenMidget::Base#url_in_text?`
|
73
124
|
|
74
|
-
All heuristic checks return `true` by default so it's up to you whether you
|
125
|
+
All heuristic checks return `true` by default so it's up to you whether you
|
126
|
+
will define and use heuristics at all or not. However, using them can help
|
127
|
+
you integrate your application context and decrease classification error
|
128
|
+
chance especially at the edge cases.
|
75
129
|
|
76
|
-
* Expand the source of evidence. Traditionally, _naive_ Bayesian text
|
130
|
+
* Expand the source of evidence. Traditionally, _naive_ Bayesian text
|
131
|
+
classifiers see individual words as evidence and calculate category-likelihoods
|
132
|
+
for each word. But there could be more than that in your application context,
|
133
|
+
e.g. user data or specific text features.
|
77
134
|
|
78
|
-
By default GreenMidget comes with two feature definitions `url_in_text` and
|
135
|
+
By default GreenMidget comes with two feature definitions `url_in_text` and
|
136
|
+
`email_in_text`, but you can implement as many more as you want by writing a
|
137
|
+
boolean method that checks for the feature:
|
79
138
|
|
80
139
|
def regular_user?
|
81
140
|
@user.sign_up_count > 10
|
82
141
|
end
|
83
142
|
|
84
|
-
and then implement a `features` method that returns an array with your custom
|
143
|
+
and then implement a `features` method that returns an array with your custom
|
144
|
+
feature names:
|
85
145
|
|
86
146
|
def features
|
87
147
|
['regular_user', .... ]
|
88
148
|
end
|
89
149
|
|
90
|
-
(do make sure that the array entry is the same as the name of the method that
|
150
|
+
(do make sure that the array entry is the same as the name of the method that
|
151
|
+
would be checking for this feature)
|
91
152
|
|
92
|
-
The GreenMidget features definitions have more weight on shorter texts and
|
153
|
+
The GreenMidget features definitions have more weight on shorter texts and
|
154
|
+
less weight on longer thus they provide a ground source of evidence for
|
155
|
+
GreenMidget's classification.
|
93
156
|
|
94
157
|
If that's not enough too, see the Contribute section below.
|
95
158
|
|
96
|
-
|
159
|
+
Performance
|
97
160
|
----------
|
98
161
|
|
99
|
-
|
162
|
+
GreenMidget uses ActiveRecord as backend and this guarantees wide support and
|
163
|
+
easy setup, however it's less performant than other data stores especially on
|
164
|
+
training operations. You should do such tasks asynchronously on real
|
165
|
+
applications. A future version backed on Redis is planned.
|
100
166
|
|
101
|
-
|
167
|
+
Classification Efficiency
|
168
|
+
----------
|
102
169
|
|
103
|
-
|
104
|
-
|
105
|
-
|
106
|
-
* on ~ 1 000 000 rows = 0.6773 sec / messages
|
170
|
+
Obviously this will depend on the training data that you have, but do give a
|
171
|
+
try to the Heroku GreenMidget app from the supplied CLI tool for a start (see
|
172
|
+
above for examples) or type:
|
107
173
|
|
108
|
-
|
174
|
+
$ greenmidget
|
109
175
|
|
110
|
-
|
111
|
-
|
112
|
-
|
113
|
-
|
176
|
+
on your shell for a help message. The online classifier for example lacks many
|
177
|
+
possible features such as heuristic checks, word stemming, stop words, etc.
|
178
|
+
It's only trained on the word occurrences of a total of 9000 messages (4500 of
|
179
|
+
each spam and ham).
|
114
180
|
|
115
|
-
|
181
|
+
During the development tests at SoundCloud, with those features in place, we
|
182
|
+
achieved more than 98% correct classification of spam examples using GreenMidget.
|
183
|
+
|
184
|
+
Thanks
|
116
185
|
----------
|
117
186
|
|
118
|
-
|
187
|
+
massively to everyone at SoundCloud for the help during the development of
|
188
|
+
GreenMidget.
|
119
189
|
|
120
190
|
Contribute
|
121
191
|
----------
|
122
192
|
|
123
|
-
|
124
|
-
code, just do that!
|
193
|
+
Just do the standard:
|
125
194
|
|
126
|
-
* Make a fork
|
127
|
-
* `
|
128
|
-
* `bundle`
|
129
|
-
* `bundle exec rake` to run the specs
|
195
|
+
* Make a fork and then:
|
196
|
+
* run `bundle` to set up dependencies
|
197
|
+
* and `bundle exec rake` to run the specs
|
130
198
|
* Make a patch
|
131
199
|
* Send a Pull Request
|
132
200
|
|
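The "How it works" section of the README above describes a log ratio of spam versus ham probabilities, with an indecisive range between 0 and Log(3). A minimal sketch of that decision rule follows. It is not GreenMidget's actual code: the names `log_ratio` and `decide`, the layout of the count hashes, the add-one smoothing and the equal class priors are all assumptions made for illustration.

```ruby
# Hypothetical sketch of the decision rule described in the README.
# Thresholds follow the README: below 0 is ham, above log(3) is spam.
SPAM_THRESHOLD = Math.log(3)
HAM_THRESHOLD  = 0.0

# counts: { spam: { word => occurrences }, ham: { word => occurrences } }
# totals: { spam: total_words_seen, ham: total_words_seen }
# Equal class priors are assumed, so only word likelihoods contribute.
def log_ratio(words, counts, totals, vocabulary_size)
  words.sum do |word|
    # Add-one smoothing (an assumption) keeps unseen words from giving log(0).
    p_spam = (counts[:spam].fetch(word, 0) + 1.0) / (totals[:spam] + vocabulary_size)
    p_ham  = (counts[:ham].fetch(word, 0) + 1.0) / (totals[:ham] + vocabulary_size)
    Math.log(p_spam / p_ham)
  end
end

# Maps the ratio onto the README's three outcomes: 1 spam, 0 undecided, -1 ham.
def decide(ratio)
  return 1 if ratio > SPAM_THRESHOLD
  return -1 if ratio < HAM_THRESHOLD
  0
end
```

With counts gathered from training examples, `decide(log_ratio(words, counts, totals, vocabulary_size))` yields a value in `[ -1, 0, 1 ]`, mirroring what the README says `classify` returns.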
data/bin/greenmidget
ADDED
@@ -0,0 +1,62 @@
|
|
1
|
+
#!/usr/bin/env ruby
|
2
|
+
|
3
|
+
$LOAD_PATH.unshift(File.dirname(__FILE__) + '/../lib')
|
4
|
+
|
5
|
+
require 'net/http'
|
6
|
+
require 'green_midget'
|
7
|
+
|
8
|
+
def say(what); puts "==> #{what}"; end
|
9
|
+
|
10
|
+
if (text = ARGV[0]) && ARGV.size == 1
|
11
|
+
say "This will check your input against some of SoundCloud\'s history of spaammmm..\n"
|
12
|
+
say "(run without arguments for more info)\n\n"
|
13
|
+
|
14
|
+
text =
|
15
|
+
if File.exist?(text)
|
16
|
+
IO.readlines(text, '').join
|
17
|
+
else
|
18
|
+
text
|
19
|
+
end
|
20
|
+
|
21
|
+
uri = URI("http://freezing-earth-5798.herokuapp.com/?q=#{URI.escape(text)}")
|
22
|
+
response = ''
|
23
|
+
|
24
|
+
begin
|
25
|
+
response = Net::HTTP.get(uri).to_i
|
26
|
+
rescue
|
27
|
+
say 'An error connecting to Heroku. How about your internets?'
|
28
|
+
exit 1
|
29
|
+
end
|
30
|
+
|
31
|
+
case response
|
32
|
+
when 1
|
33
|
+
say 'Hm.. It looks not so good! ( looks like spam )'
|
34
|
+
when 0
|
35
|
+
say 'Well, i cant really tell - it could be either'
|
36
|
+
when -1
|
37
|
+
say 'And.. it sounds ok..'
|
38
|
+
else
|
39
|
+
say 'An unknown error stroke!'
|
40
|
+
end
|
41
|
+
else
|
42
|
+
say "Checks for spam!\n\n"
|
43
|
+
puts <<-TEXT
|
44
|
+
This tool accesses an online GreenMidget service trained on 4500
|
45
|
+
examples of public spam messages or track comments that were posted on
|
46
|
+
SoundCloud. You can use it to classify your texts against it.
|
47
|
+
|
48
|
+
Examples:
|
49
|
+
|
50
|
+
greenmidget 'buy cheap bags online'
|
51
|
+
greenmidget 'upload cool tracks online'
|
52
|
+
greenmidget potential_spam.txt
|
53
|
+
|
54
|
+
Notice: This service is only used as an illustration to the GreenMidget
|
55
|
+
classifier, however its training is limited and it lacks even basic
|
56
|
+
features, that GreenMidget could provide.
|
57
|
+
|
58
|
+
This is not actually in use at SoundCloud!
|
59
|
+
|
60
|
+
read more on: http://github.com/chochkov/greenmidget
|
61
|
+
TEXT
|
62
|
+
end
|
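The script above builds its request URL with `URI.escape`, which was removed in Ruby 3.0. For anyone running it on a newer Ruby, one possible replacement is sketched here. The helper name `classifier_uri` is hypothetical, and the sketch assumes the server decodes form-encoded query strings (where a space arrives as `+`), which the diff does not confirm.

```ruby
require 'uri'

# Hypothetical helper; the default host is the one used in bin/greenmidget.
def classifier_uri(text, host = 'freezing-earth-5798.herokuapp.com')
  # encode_www_form percent-encodes the value and emits a "q=..." pair.
  URI::HTTP.build(host: host, path: '/', query: URI.encode_www_form(q: text))
end

uri = classifier_uri('buy cheap bags & more')
puts uri
```

The resulting `uri` can be passed to `Net::HTTP.get` exactly as in the original script.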
data/lib/green_midget/version.rb
CHANGED
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: green_midget
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.1.
|
4
|
+
version: 0.1.1
|
5
5
|
prerelease:
|
6
6
|
platform: ruby
|
7
7
|
authors:
|
@@ -9,12 +9,12 @@ authors:
|
|
9
9
|
autorequire:
|
10
10
|
bindir: bin
|
11
11
|
cert_chain: []
|
12
|
-
date: 2012-
|
12
|
+
date: 2012-03-05 00:00:00.000000000 +01:00
|
13
13
|
default_executable:
|
14
14
|
dependencies:
|
15
15
|
- !ruby/object:Gem::Dependency
|
16
16
|
name: activerecord
|
17
|
-
requirement: &
|
17
|
+
requirement: &2153446000 !ruby/object:Gem::Requirement
|
18
18
|
none: false
|
19
19
|
requirements:
|
20
20
|
- - ! '>='
|
@@ -22,11 +22,12 @@ dependencies:
|
|
22
22
|
version: '0'
|
23
23
|
type: :runtime
|
24
24
|
prerelease: false
|
25
|
-
version_requirements: *
|
25
|
+
version_requirements: *2153446000
|
26
26
|
description: Naive Bayesian Classifier with customizable features
|
27
27
|
email:
|
28
28
|
- nikola@howkul.info
|
29
|
-
executables:
|
29
|
+
executables:
|
30
|
+
- greenmidget
|
30
31
|
extensions: []
|
31
32
|
extra_rdoc_files: []
|
32
33
|
files:
|
@@ -39,6 +40,7 @@ files:
|
|
39
40
|
- Rakefile
|
40
41
|
- benchmark/benchmark.rb
|
41
42
|
- benchmark/test.rb
|
43
|
+
- bin/greenmidget
|
42
44
|
- green_midget.gemspec
|
43
45
|
- lib/green_midget.rb
|
44
46
|
- lib/green_midget/base.rb
|