green_midget 0.1.0 → 0.1.1

data/README.md CHANGED
@@ -6,43 +6,84 @@ On Bayesian Classification
6
6
 
7
7
  This project started during an internship at SoundCloud.
8
8
 
9
- Using SoundCloud's private messaging means that you can effectively reach out to everyone on the Cloud. On top of that, you have track commenting, groups posting, forum topics, track sharing - we care about your voice being heard! And read.
9
+ Using SoundCloud's private messaging means that you can effectively reach out
10
+ to everyone on the Cloud. On top of that, you have track commenting, groups
11
+ posting, forum topics, track sharing - we care about your voice being heard!
12
+ And read.
10
13
 
11
- I'll put this in some perspective and say that we're now having daily text exchange volume in the order of hundreds of thousands. And it's also rapidly going up.
14
+ I'll put this in some perspective and say that we're now having daily text
15
+ exchange volume in the order of hundreds of thousands. And it's also rapidly
16
+ going up.
12
17
 
13
- And while most of this runs smoother than Berliner beer on a SoundCloud Friday, violations to our [Community guidelines][guidelines] are starting to be less and less of an exception. So I've been given the task to address this and build a system that progressively learns how to tell good community behaviour from less good - welcome to the:
18
+ And while most of this runs smoother than Berliner beer on a SoundCloud Friday,
19
+ violations to our [Community guidelines][guidelines] are starting to be less
20
+ and less of an exception. So I've been given the task to address this and
21
+ build a system that progressively learns how to tell good community behaviour
22
+ from less good - welcome to the:
14
23
 
15
24
  GreenMidget
16
25
  ----------
17
26
 
18
- GreenMidget is a trainable, feature-full Bayesian text classifier. Out of the box it's super straightforward to use, but it also offers easy customisation options. It's a Ruby gem and today we're open sourcing it, so you can start with it within a minute after the:
27
+ GreenMidget is a trainable, feature-full Bayesian text classifier. Out of the
28
+ box it's super straightforward to use, but it also offers easy customisation
29
+ options. It's a Ruby gem and today we're open sourcing it, so you can start
30
+ with it within a minute after the:
19
31
 
20
32
  Installation
21
33
  ----------
22
34
 
23
- You are very likely (but not necessarily) gonna be on a Rails app, so just add
35
+ If you're using Bundler, add the following to your Gemfile:
24
36
 
25
37
  gem 'green_midget'
26
38
 
27
- to your Gemfile and run
39
+ and then run
28
40
 
29
41
  bundle install
30
42
 
31
43
  after which (so that you get the ActiveRecord backend ready):
32
44
 
33
- rake green_midget:setup:active_record # creates a green_midget_records table and populate some entried there
45
+ bundle exec rake green_midget:setup:active_record
46
+
47
+ This creates a `green_midget_records` table and populates it with some entries.
34
48
 
35
49
  You're now done.
36
50
 
51
+ Try it out (right on the CLI)
52
+ ----------
53
+ After you install the gem a shell executable is available for a quick play
54
+ with an online GreenMidget server trained on ~ 9000 public spam and ham
55
+ examples posted on SoundCloud as posts or track comments.
56
+
57
+ $ greenmidget 'buy cheap bags online'
58
+ $ greenmidget 'upload and share cool tracks online'
59
+ $ greenmidget potential_spam.txt # will read the file and classify the text
60
+
61
+ Go ahead and try around a bit, but keep in mind that this online service is in a
62
+ very early training stage and lacks even basic features (see below).
63
+
37
64
  How it works
38
65
  ----------
39
66
 
40
- GreenMidget learns to classify between two categories, so what you should first do is provide training examples for each of those two categories. See below.
67
+ GreenMidget is a Naive Bayes implementation that uses a Log ratio of spam vs
68
+ ham probabilities for a given object to classify it to any of the categories.
69
+ There's an indecisive range as well - by default between 0 and Log(3).
70
+ Everything under 0 will be considered legit and above Log(3) will be spam.
71
+
72
+ GreenMidget adjusts the probabilities for individual words from training with
73
+ known examples and thus it improves its capability.
74
+
75
+ You can define further features (perhaps based on characteristics of the objects
76
+ you have to deal with) and use them to calculate probabilities. You can also
77
+ define heuristic checks for either category (see below for more on how to do
78
+ these).
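
The decision rule described above can be sketched in a few lines of Ruby. This is a minimal illustration, not GreenMidget's actual code: the method names `log_ratio` and `decide`, the shape of the `counts` and `totals` hashes, the add-one smoothing, and the reading of `Log` as the natural logarithm are all assumptions made for the example.

```ruby
# Hypothetical sketch of a log-ratio Naive Bayes decision.
# Nothing here is taken from the GreenMidget source.

HAM_THRESHOLD  = 0.0          # scores below this are treated as legit
SPAM_THRESHOLD = Math.log(3)  # scores above this are treated as spam
                              # (natural log is an assumption)

# counts: { 'word' => { spam: n, ham: n } } -- messages containing the word
# totals: { spam: n, ham: n }               -- training messages per category
def log_ratio(words, counts, totals)
  # Start from the ratio of the category priors.
  score = Math.log(totals[:spam].to_f / totals[:ham])
  words.each do |word|
    c = counts.fetch(word, { spam: 0, ham: 0 })
    # Add-one smoothing so unseen words do not produce log(0).
    p_spam = (c[:spam] + 1.0) / (totals[:spam] + 2.0)
    p_ham  = (c[:ham] + 1.0) / (totals[:ham] + 2.0)
    score += Math.log(p_spam / p_ham)
  end
  score
end

# Maps a score to -1 (legit), 0 (indecisive range) or 1 (spam).
def decide(score)
  return 1 if score > SPAM_THRESHOLD
  return -1 if score < HAM_THRESHOLD
  0
end
```

Under these assumptions a word seen only in spam pushes the score up, and a word never seen in training contributes nothing, so a message made only of unknown words stays in the indecisive range.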
41
79
 
42
80
  Use it
43
81
  ----------
44
82
 
45
- `GreenMidget::Classifier` is the interaction class that is there after installation. It exposes two public instance methods as a start: `GreenMidget::Classifier#classify_as!` and `GreenMidget::Classifier#classify`. We'll do a three lines classification session and illustrate them.
83
+ `GreenMidget::Classifier` is the interaction class that is there after
84
+ installation. It exposes two public instance methods as a start:
85
+ `GreenMidget::Classifier#classify_as!` and `GreenMidget::Classifier#classify`.
86
+ We'll do a three lines classification session and illustrate them.
46
87
 
47
88
  We'll start training `GreenMidget` with a spammy example:
48
89
 
@@ -52,81 +93,108 @@ Similarly for legitimate examples
52
93
 
53
94
  GreenMidget::Classifier.new(known_legit_text).classify_as! :ham
54
95
 
55
- After we've given to it some training data, we can start classifying unknown text:
96
+ After we've given it some training data, we can start classifying unknown
97
+ text:
56
98
 
57
99
  decision = GreenMidget::Classifier.new(new_text).classify
58
100
 
59
- `decision` is now in `[ -1, 0, 1 ]` meaning respectively 'No spam', 'Not enough evidence', 'Spam'.
101
+ `decision` is now in `[ -1, 0, 1 ]` meaning respectively 'No spam',
102
+ 'Not enough evidence', 'Spam'.
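
One way an application might act on the three possible values of `decision` is sketched below. The label names and the actions (`:publish`, `:queue_for_review`, `:reject`) are hypothetical and only illustrate the mapping; only the values -1, 0 and 1 and their meanings come from the README.

```ruby
# Hypothetical mapping from the classifier's numeric decision to labels.
DECISION_LABELS = { -1 => :ham, 0 => :undecided, 1 => :spam }.freeze

# Returns an illustrative action for a decision in [-1, 0, 1].
def action_for(decision)
  label = DECISION_LABELS.fetch(decision) do
    raise ArgumentError, "unexpected decision: #{decision.inspect}"
  end

  case label
  when :ham       then :publish
  when :undecided then :queue_for_review
  when :spam      then :reject
  end
end
```

Treating the 'Not enough evidence' case as its own branch, rather than folding it into ham or spam, keeps the indecisive range visible to whoever moderates the content.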
60
103
 
61
104
  Extend it
62
105
  ----------
63
106
 
64
- If the above functionality is not enough for you and you want to add custom logic to GreenMidget you can do that by extending the `GreenMidget::Base` class (check `lib/green_midget/extensions/sample.b` in the [code][green_midget_github] for an example):
107
+ If the above functionality is not enough for you and you want to add custom
108
+ logic to GreenMidget you can do that by extending the `GreenMidget::Base`
109
+ class (check `lib/green_midget/extensions/sample.b` in the [code][green_midget_github]
110
+ for an example):
65
111
 
66
- * Implement heuristics logic, which will directly classify incoming object as a given category. Example:
112
+ * Implement heuristics logic, which will directly classify incoming object as a
113
+ given category. Example:
67
114
 
68
115
  def pass_ham_heuristics?
69
116
  words.count > 5 || url_in_text?
70
117
  end
71
118
 
72
- This method will be `true` for longer text or such that contains an external url. In this case the classifier would go on to the actual testing procedure. If `false`, however, the procedure will not be done and the classifier will return the ham category as a result. Note the native `GreenMidget::Base#words` and `GreenMidget::Base#url_in_text?`
119
+ This method will be `true` for longer text or such that contains an external
120
+ url. In this case the classifier would go on to the actual testing procedure.
121
+ If `false`, however, the procedure will not be done and the classifier will
122
+ return the ham category as a result. Note the default
123
+ `GreenMidget::Base#words` and `GreenMidget::Base#url_in_text?`
73
124
 
74
- All heuristic checks return `true` by default so it's up to you whether you will define and use heuristics at all or not. However, using them can help you integrate your application context and decrease classification error chance especially at the edge cases.
125
+ All heuristic checks return `true` by default so it's up to you whether you
126
+ will define and use heuristics at all or not. However, using them can help
127
+ you integrate your application context and decrease classification error
128
+ chance especially at the edge cases.
75
129
 
76
- * Expand the source of evidence. Traditionally, _naive_ Bayesian text classifiers see individual words as evidence and calculate category-likelihoods for each word. But there could be more than that in your application context, eg. user's data or specific text features.
130
+ * Expand the source of evidence. Traditionally, _naive_ Bayesian text
131
+ classifiers see individual words as evidence and calculate category-likelihoods
132
+ for each word. But there could be more than that in your application context,
133
+ eg. user's data or specific text features.
77
134
 
78
- By default GreenMidget comes with two feature definitions `url_in_text` and `email_in_text`, but you can implement as many more as you want by writing a boolean method that checks for the feature:
135
+ By default GreenMidget comes with two feature definitions `url_in_text` and
136
+ `email_in_text`, but you can implement as many more as you want by writing a
137
+ boolean method that checks for the feature:
79
138
 
80
139
  def regular_user?
81
140
  @user.sign_up_count > 10
82
141
  end
83
142
 
84
- and then implement a `features` method that returns an array with your custom feature names:
143
+ and then implement a `features` method that returns an array with your custom
144
+ feature names:
85
145
 
86
146
  def features
87
147
  ['regular_user', .... ]
88
148
  end
89
149
 
90
- (do make sure that the array entry is the same as the name of the method that would be checking for this feature)
150
+ (do make sure that the array entry is the same as the name of the method that
151
+ would be checking for this feature)
91
152
 
92
- The GreenMidget features definitions have more weight on shorter texts and less weight on longer thus they provide a ground source of evidence for GreenMidget's classification.
153
+ The GreenMidget features definitions have more weight on shorter texts and
154
+ less weight on longer thus they provide a ground source of evidence for
155
+ GreenMidget's classification.
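
The note about matching the array entry to the method name can be made concrete with a standalone sketch. `FeatureSketch` below is a stand-in written for this example and does not inherit from `GreenMidget::Base`; the `present_features` helper, the URL regex and the plain integer sign-in count are assumptions, and the gem's own lookup may work differently.

```ruby
# Standalone illustration; NOT the GreenMidget::Base API.
class FeatureSketch
  def initialize(text, sign_in_count)
    @text = text
    @sign_in_count = sign_in_count # stands in for the README's @user
  end

  # Feature names; each must have a matching `<name>?` method below.
  def features
    %w[url_in_text regular_user]
  end

  def url_in_text?
    @text.match?(%r{https?://\S+})
  end

  def regular_user?
    @sign_in_count > 10
  end

  # Resolves every feature name to its predicate method by name, which is
  # why a misspelled entry in `features` would raise NoMethodError here.
  def present_features
    features.select { |name| public_send("#{name}?") }
  end
end
```

For instance, `FeatureSketch.new('see http://x.example', 3).present_features` selects only `url_in_text`, because the sign-in count is below the threshold.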
93
156
 
94
157
  If that's not enough too, see the Contribute section below.
95
158
 
96
- Benchmarking
159
+ Performance
97
160
  ----------
98
161
 
99
- Before moving on, let's say that `GreenMidget` is intended for asynchronous spam checks. Using ActiveRecord as backend has the benefit of wide support and easy setup, but as it also means that the time performance will become progressively worse the more training you provide.
162
+ GreenMidget uses ActiveRecord as backend and this guarantees wide support and
163
+ easy setup, however it's less performant than other data stores especially on
164
+ training operations. You should do such tasks asynchronously on real
165
+ applications. A future version backed on Redis is planned.
100
166
 
101
- 1. GreenMidget is optimised for classification operations (`classify` method), on which it's relatively efficient. The results below were obtained from classification on randomly generated messages of length _1 000 words_ (that's _very_ long for SoundCloud). Since GreenMidget runs on a relational database (through ActiveRecord) by default the table size impacts data fetch and write:
167
+ Classification Efficiency
168
+ ----------
102
169
 
103
- * on ~ 10 000 table rows = 0.0703 seconds / message
104
- * on ~ 100 000 rows = 0.2082 sec / message
105
- * on ~ 500 000 rows = 0.6505 sec / message
106
- * on ~ 1 000 000 rows = 0.6773 sec / messages
170
+ Obviously this will depend on the training data that you have, but do give a
171
+ try to the Heroku GreenMidget app from the supplied CLI tool for a start (see
172
+ above for examples) or type:
107
173
 
108
- 2. Training operations (`classify_as!`) are, however, less performant because they invoke a database write per word. Under the same conditions as above, the training times of randomly generated messages follows:
174
+ $ greenmidget
109
175
 
110
- * on ~ 10 000 table rows = 1.5984 seconds / message
111
- * on ~ 100 000 rows = 0.1303 sec / message
112
- * on ~ 500 000 rows = 1.7185 sec / message
113
- * on ~ 1 000 000 rows = 2.5335 sec / message
176
+ on your shell for a help message. The online classifier for example lacks many
177
+ possible features such as heuristic checks, word stemming, stop words, etc.
178
+ It's only trained on the word occurrences of a total of 9000 messages (4500 of
179
+ each spam and ham).
114
180
 
115
- Classification Efficiency
181
+ During the development tests at SoundCloud, with those features in place, we
182
+ achieved more than 98% correct classification of spam examples using GreenMidget.
183
+
184
+ Thanks
116
185
  ----------
117
186
 
118
- TODO: give test results; provide a web interface to a trained classifier using some of SoundCloud's spam and legit data; give production experience from DigitaleSeiten.
187
+ massively to everyone at SoundCloud for the help during the development of
188
+ GreenMidget.
119
189
 
120
190
  Contribute
121
191
  ----------
122
192
 
123
- Let me know on any feedback or feature requests. If you want to hack on the
124
- code, just do that!
193
+ Just do the standard:
125
194
 
126
- * Make a fork
127
- * `git clone git@github.com:chochkov/GreenMidget.git`
128
- * `bundle`
129
- * `bundle exec rake` to run the specs
195
+ * Make a fork and then:
196
+ * run `bundle` to set up dependencies
197
+ * and `bundle exec rake` to run the specs
130
198
  * Make a patch
131
199
  * Send a Pull Request
132
200
 
data/bin/greenmidget ADDED
@@ -0,0 +1,62 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ $LOAD_PATH.unshift(File.dirname(__FILE__) + '/../lib')
4
+
5
+ require 'net/http'
6
+ require 'green_midget'
7
+
8
+ def say(what); puts "==> #{what}"; end
9
+
10
+ if (text = ARGV[0]) && ARGV.size == 1
11
+ say "This will check your input against some of SoundCloud\'s history of spaammmm..\n"
12
+ say "(run without arguments for more info)\n\n"
13
+
14
+ text =
15
+ if File.exist?(text)
16
+ IO.readlines(text, '').join
17
+ else
18
+ text
19
+ end
20
+
21
+ uri = URI("http://freezing-earth-5798.herokuapp.com/?q=#{URI.escape(text)}")
22
+ response = ''
23
+
24
+ begin
25
+ response = Net::HTTP.get(uri).to_i
26
+ rescue
27
+ say 'An error connecting to Heroku. How about your internets?'
28
+ exit 1
29
+ end
30
+
31
+ case response
32
+ when 1
33
+ say 'Hm.. It looks not so good! ( looks like spam )'
34
+ when 0
35
+ say 'Well, i cant really tell - it could be either'
36
+ when -1
37
+ say 'And.. it sounds ok..'
38
+ else
39
+ say 'An unknown error struck!'
40
+ end
41
+ else
42
+ say "Checks for spam!\n\n"
43
+ puts <<-TEXT
44
+ This tool accesses an online GreenMidget service trained on 4500
45
+ examples of public spam messages or track comments that were posted on
46
+ SoundCloud. You can use it to classify your texts against it.
47
+
48
+ Examples:
49
+
50
+ greenmidget 'buy cheap bags online'
51
+ greenmidget 'upload cool tracks online'
52
+ greenmidget potential_spam.txt
53
+
54
+ Notice: This service is only used as an illustration to the GreenMidget
55
+ classifier, however its training is limited and it lacks even basic
56
+ features, that GreenMidget could provide.
57
+
58
+ This is not actually in use at SoundCloud!
59
+
60
+ read more on: http://github.com/chochkov/greenmidget
61
+ TEXT
62
+ end
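
Two details of the script above are worth a note for anyone running it on a current Ruby. `URI.escape` was removed in Ruby 3.0, and `String#to_i` returns 0 for a non-numeric body, so an error page from the server would be reported as "could be either". A possible adjustment is sketched below; the helper names `build_query_uri` and `parse_decision` are made up for the example and are not part of the gem.

```ruby
require 'uri'

# Builds the request URI with a form-encoded `q` parameter instead of
# relying on URI.escape (removed in Ruby 3.0).
def build_query_uri(base, text)
  uri = URI(base)
  uri.query = URI.encode_www_form(q: text)
  uri
end

# Parses the response body strictly: returns an Integer, or nil when the
# body is not a plain base-10 number (String#to_i would return 0 instead).
def parse_decision(body)
  Integer(body.to_s.strip, 10, exception: false)
end
```

With these helpers the `case` statement could treat `nil` as the error branch, so that only a genuine `0` from the service is reported as undecided.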
@@ -1,3 +1,3 @@
1
1
  module GreenMidget
2
- VERSION = '0.1.0'
2
+ VERSION = '0.1.1'
3
3
  end
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: green_midget
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.0
4
+ version: 0.1.1
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,12 +9,12 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2012-02-17 00:00:00.000000000 +01:00
12
+ date: 2012-03-05 00:00:00.000000000 +01:00
13
13
  default_executable:
14
14
  dependencies:
15
15
  - !ruby/object:Gem::Dependency
16
16
  name: activerecord
17
- requirement: &2153348200 !ruby/object:Gem::Requirement
17
+ requirement: &2153446000 !ruby/object:Gem::Requirement
18
18
  none: false
19
19
  requirements:
20
20
  - - ! '>='
@@ -22,11 +22,12 @@ dependencies:
22
22
  version: '0'
23
23
  type: :runtime
24
24
  prerelease: false
25
- version_requirements: *2153348200
25
+ version_requirements: *2153446000
26
26
  description: Naive Bayesian Classifier with customizable features
27
27
  email:
28
28
  - nikola@howkul.info
29
- executables: []
29
+ executables:
30
+ - greenmidget
30
31
  extensions: []
31
32
  extra_rdoc_files: []
32
33
  files:
@@ -39,6 +40,7 @@ files:
39
40
  - Rakefile
40
41
  - benchmark/benchmark.rb
41
42
  - benchmark/test.rb
43
+ - bin/greenmidget
42
44
  - green_midget.gemspec
43
45
  - lib/green_midget.rb
44
46
  - lib/green_midget/base.rb