green_midget 0.0.2 → 0.0.3
- data/.gitignore +2 -0
- data/Gemfile +1 -16
- data/README.md +18 -28
- data/lib/green_midget/version.rb +1 -1
- metadata +7 -13
- data/.rspec +0 -1
- data/.rvmrc +0 -1
- data/benchmark/benchmark.rb +0 -40
- data/benchmark/test.rb +0 -31
- data/extensions/green_midget_check.rb +0 -8
- data/extensions/sample.rb +0 -19
data/.gitignore
CHANGED
data/Gemfile
CHANGED
@@ -1,19 +1,4 @@
|
|
1
1
|
# Copyright (c) 2011, SoundCloud Ltd., Nikola Chochkov
|
2
2
|
source "http://rubygems.org"
|
3
|
-
# Add dependencies required to use your gem here.
|
4
|
-
# Example:
|
5
|
-
# gem "activesupport", ">= 2.3.5"
|
6
3
|
|
7
|
-
|
8
|
-
|
9
|
-
# remove this dependency after testing.
|
10
|
-
gem 'jberkel-mysql-ruby', '= 2.8.1', :require => 'mysql' # Ruby 1.9 fixes
|
11
|
-
|
12
|
-
# Add dependencies to develop your gem here.
|
13
|
-
# Include everything needed to run rake, tests, features, etc.
|
14
|
-
group :development do
|
15
|
-
gem "rspec", "~> 2.3.0"
|
16
|
-
gem "bundler", "~> 1.0.0"
|
17
|
-
gem "jeweler", "~> 1.5.2"
|
18
|
-
gem "rcov", ">= 0"
|
19
|
-
end
|
4
|
+
gemspec
|
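Note on the Gemfile change: the single `gemspec` line tells Bundler to read dependencies from the gem's own specification instead of listing them in the Gemfile. Below is a minimal sketch of what such a specification might declare, assuming the dependencies recorded in this release's metadata (activerecord at runtime, rspec and bundler for development). The method name `sketch_gemspec`, the author list and the summary string are illustrative assumptions, not copied from the actual `green_midget.gemspec`.

```ruby
require "rubygems"

# Hypothetical gemspec body for illustration; the real file may differ.
def sketch_gemspec
  Gem::Specification.new do |s|
    s.name        = "green_midget"
    s.version     = "0.0.3"
    s.authors     = ["Nikola Chochkov"]   # assumption, taken from the copyright line
    s.email       = ["nikola@howkul.info"]
    s.summary     = "Naive Bayesian Classifier with customizable features"
    s.description = "Naive Bayesian Classifier with customizable features"

    # Runtime dependency, as recorded in the gem metadata.
    s.add_runtime_dependency "activerecord", ">= 0"

    # Development dependencies, as recorded in the gem metadata.
    s.add_development_dependency "rspec", ">= 0"
    s.add_development_dependency "bundler", ">= 0"
  end
end

spec = sketch_gemspec
puts "runtime:     #{spec.runtime_dependencies.map(&:name).join(', ')}"
puts "development: #{spec.development_dependencies.map(&:name).join(', ')}"
```

With a specification like this in place, the Gemfile only needs its `source` line and `gemspec`, which matches the four-line Gemfile in 0.0.3.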
data/README.md
CHANGED
@@ -1,6 +1,8 @@
|
|
1
1
|
On Bayesian Classification
|
2
2
|
----------
|
3
3
|
|
4
|
+
This project started during an internship at SoundCloud.
|
5
|
+
|
4
6
|
Using SoundCloud's private messaging means that you can effectively reach out to everyone on the Cloud. On top of that, you have track commenting, groups posting, forum topics, track sharing - we care about your voice being heard! And read.
|
5
7
|
|
6
8
|
I'll put this in some perspective and say that we're now having daily text exchange volume in the order of hundreds of thousands. And it's also rapidly going up.
|
@@ -23,48 +25,40 @@ to your Gemfile and run
|
|
23
25
|
|
24
26
|
bundle install
|
25
27
|
|
26
|
-
|
27
|
-
|
28
|
-
require 'green_midget'
|
29
|
-
|
30
|
-
to your Rakefile and run
|
28
|
+
after which (so that you get the ActiveRecord backend ready):
|
31
29
|
|
32
|
-
rake green_midget:setup
|
30
|
+
rake green_midget:setup:active_record # creates a green_midget_records table and populates some entries there
|
33
31
|
|
34
32
|
You're now done.
|
35
33
|
|
36
34
|
How it works
|
37
35
|
----------
|
38
36
|
|
39
|
-
GreenMidget
|
40
|
-
|
41
|
-
GreenMidget is a learner, so you will only expect effective classification from it only once it has received sufficient training. Training it means providing examples of messages for the categor
|
42
|
-
|
43
|
-
|
37
|
+
GreenMidget learns to classify between two categories, so what you should first do is provide training examples for each of those two categories. See below.
|
44
38
|
|
45
39
|
Use it
|
46
40
|
----------
|
47
41
|
|
48
|
-
`GreenMidget` exposes two public methods as a start: `GreenMidget#classify_as!` and `GreenMidget#classify`.
|
42
|
+
`GreenMidget::Classifier` is the class you interact with after installation. It exposes two public instance methods as a start: `GreenMidget::Classifier#classify_as!` and `GreenMidget::Classifier#classify`. We'll illustrate them with a three-line classification session.
|
49
43
|
|
50
|
-
We'll start training `GreenMidget` with a spammy example
|
44
|
+
We'll start training `GreenMidget` with a spammy example:
|
51
45
|
|
52
|
-
GreenMidget.new(known_spam_text).classify_as! :spam
|
46
|
+
GreenMidget::Classifier.new(known_spam_text).classify_as! :spam
|
53
47
|
|
54
48
|
Similarly for legitimate examples
|
55
49
|
|
56
|
-
GreenMidget.new(known_legit_text).classify_as! :ham
|
50
|
+
GreenMidget::Classifier.new(known_legit_text).classify_as! :ham
|
57
51
|
|
58
|
-
|
52
|
+
After we've given it some training data, we can start classifying unknown text:
|
59
53
|
|
60
|
-
decision = GreenMidget.new(new_text).classify
|
54
|
+
decision = GreenMidget::Classifier.new(new_text).classify
|
61
55
|
|
62
|
-
`decision` is now
|
56
|
+
`decision` is now one of `[ -1, 0, 1 ]`, meaning 'No spam', 'Not enough evidence' and 'Spam' respectively.
|
63
57
|
|
64
58
|
Extend it
|
65
59
|
----------
|
66
60
|
|
67
|
-
If the above functionality is not enough for you and you want to add custom logic to GreenMidget you can do that by extending the `GreenMidget::Base` class (check `extensions/sample.b` in the [code][green_midget_github] for an example):
|
61
|
+
If the above functionality is not enough for you and you want to add custom logic to GreenMidget, you can do that by extending the `GreenMidget::Base` class (check `lib/green_midget/extensions/sample.rb` in the [code][green_midget_github] for an example):
|
68
62
|
|
69
63
|
* Implement heuristics logic, which will directly classify incoming object as a given category. Example:
|
70
64
|
|
@@ -99,7 +93,9 @@ If that's not enough too, you're welcome to [browse the code][green_midget_githu
|
|
99
93
|
Benchmarking
|
100
94
|
----------
|
101
95
|
|
102
|
-
|
96
|
+
Before moving on, let's say that `GreenMidget` is intended for asynchronous spam checks. Using ActiveRecord as the backend has the benefit of wide support and easy setup, but it also means that performance will get progressively worse the more training you provide.
|
97
|
+
|
98
|
+
1. GreenMidget is optimised for classification operations (the `classify` method), on which it's relatively efficient. The results below were obtained from classification on randomly generated messages of length _1 000 words_ (that's _very_ long for SoundCloud). Since GreenMidget runs on a relational database (through ActiveRecord) by default, the table size affects how long reads and writes take:
|
103
99
|
|
104
100
|
* on ~ 10 000 table rows = 0.0703 seconds / message
|
105
101
|
* on ~ 100 000 rows = 0.2082 sec / message
|
@@ -116,13 +112,7 @@ Benchmarking
|
|
116
112
|
Classification Efficiency
|
117
113
|
----------
|
118
114
|
|
119
|
-
|
120
|
-
|
121
|
-
Benchmarks = > data
|
122
|
-
|
123
|
-
Efficiency = > for the sake of this article I ran a small off-production test to show some results on real data - I used 150 000 text items from records which we marked as good and records which we marked as not-the-best!
|
124
|
-
|
125
|
-
We'll be next building our own SoundCloud extensions to GreenMidget and use it, so expect to hear more from the student! Meanwhile, I'll be happy to answer everything concerning the project so do feel free to get in touch.
|
115
|
+
TODO: give test results; provide a web interface to a trained classifier using some of SoundCloud's spam and legit data; give production experience from DigitaleSeiten.
|
126
116
|
|
127
|
-
[green_midget_github]: http://github.com/chochkov/
|
117
|
+
[green_midget_github]: http://github.com/chochkov/GreenMidget "Github repository"
|
128
118
|
[guidelines]: http://soundcloud.com/community-guidelines "Community guidelines"
|
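Note on the README's "Use it" section: the train-then-classify session can be tried without a database using the stand-in below. This is only a sketch of the interface the README describes (`classify_as!`, `classify`, and a decision in `[ -1, 0, 1 ]`). The class name `SketchClassifier`, the `THRESHOLD` margin, the tokenizer and the in-memory storage are all assumptions made for illustration and are not the gem's implementation, which uses an ActiveRecord backend.

```ruby
# Stand-in for illustration only: NOT GreenMidget's implementation.
# Keeps training data in memory and scores only the words present in a text.
class SketchClassifier
  CATEGORIES = [:spam, :ham].freeze
  THRESHOLD  = 1.0 # hypothetical log-odds margin below which we abstain

  @word_counts     = Hash.new { |hash, category| hash[category] = Hash.new(0) }
  @training_counts = Hash.new(0)

  class << self
    attr_reader :word_counts, :training_counts

    # Forget all training data.
    def reset!
      @word_counts.clear
      @training_counts.clear
    end
  end

  def initialize(text)
    # Each distinct word counts once per message.
    @words = text.downcase.scan(/[a-z']+/).uniq
  end

  # Record this text as a training example for the given category.
  def classify_as!(category)
    raise ArgumentError, "unknown category" unless CATEGORIES.include?(category)
    self.class.training_counts[category] += 1
    @words.each { |word| self.class.word_counts[category][word] += 1 }
    category
  end

  # Returns 1 (spam), -1 (no spam) or 0 (not enough evidence).
  def classify
    return 0 if CATEGORIES.any? { |c| self.class.training_counts[c].zero? }
    score = log_score(:spam) - log_score(:ham)
    return 1  if score >  THRESHOLD
    return -1 if score < -THRESHOLD
    0
  end

  private

  # log P(category) plus the sum of log P(word | category), add-one smoothed.
  def log_score(category)
    trainings = self.class.training_counts[category]
    total     = self.class.training_counts.values.sum
    prior     = Math.log(trainings.to_f / total)
    @words.inject(prior) do |sum, word|
      count = self.class.word_counts[category][word]
      sum + Math.log((count + 1.0) / (trainings + 2.0))
    end
  end
end

SketchClassifier.new("buy cheap followers now").classify_as! :spam
SketchClassifier.new("loved your new track, great mix").classify_as! :ham
decision = SketchClassifier.new("cheap followers").classify
puts "decision: #{decision}"
```

Returning `0` when the log-odds fall inside the margin is one simple way to express 'Not enough evidence'; how the gem itself decides that case is not shown in this diff.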
data/lib/green_midget/version.rb
CHANGED
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: green_midget
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.0.
|
4
|
+
version: 0.0.3
|
5
5
|
prerelease:
|
6
6
|
platform: ruby
|
7
7
|
authors:
|
@@ -14,7 +14,7 @@ default_executable:
|
|
14
14
|
dependencies:
|
15
15
|
- !ruby/object:Gem::Dependency
|
16
16
|
name: activerecord
|
17
|
-
requirement: &
|
17
|
+
requirement: &2153074740 !ruby/object:Gem::Requirement
|
18
18
|
none: false
|
19
19
|
requirements:
|
20
20
|
- - ! '>='
|
@@ -22,10 +22,10 @@ dependencies:
|
|
22
22
|
version: '0'
|
23
23
|
type: :runtime
|
24
24
|
prerelease: false
|
25
|
-
version_requirements: *
|
25
|
+
version_requirements: *2153074740
|
26
26
|
- !ruby/object:Gem::Dependency
|
27
27
|
name: rspec
|
28
|
-
requirement: &
|
28
|
+
requirement: &2153074320 !ruby/object:Gem::Requirement
|
29
29
|
none: false
|
30
30
|
requirements:
|
31
31
|
- - ! '>='
|
@@ -33,10 +33,10 @@ dependencies:
|
|
33
33
|
version: '0'
|
34
34
|
type: :development
|
35
35
|
prerelease: false
|
36
|
-
version_requirements: *
|
36
|
+
version_requirements: *2153074320
|
37
37
|
- !ruby/object:Gem::Dependency
|
38
38
|
name: bundler
|
39
|
-
requirement: &
|
39
|
+
requirement: &2153073900 !ruby/object:Gem::Requirement
|
40
40
|
none: false
|
41
41
|
requirements:
|
42
42
|
- - ! '>='
|
@@ -44,7 +44,7 @@ dependencies:
|
|
44
44
|
version: '0'
|
45
45
|
type: :development
|
46
46
|
prerelease: false
|
47
|
-
version_requirements: *
|
47
|
+
version_requirements: *2153073900
|
48
48
|
description: Naive Bayesian Classifier with customizable features
|
49
49
|
email:
|
50
50
|
- nikola@howkul.info
|
@@ -54,17 +54,11 @@ extra_rdoc_files: []
|
|
54
54
|
files:
|
55
55
|
- .document
|
56
56
|
- .gitignore
|
57
|
-
- .rspec
|
58
|
-
- .rvmrc
|
59
57
|
- Gemfile
|
60
58
|
- Gemfile.lock
|
61
59
|
- LICENSE.txt
|
62
60
|
- README.md
|
63
61
|
- Rakefile
|
64
|
-
- benchmark/benchmark.rb
|
65
|
-
- benchmark/test.rb
|
66
|
-
- extensions/green_midget_check.rb
|
67
|
-
- extensions/sample.rb
|
68
62
|
- green_midget.gemspec
|
69
63
|
- lib/green_midget.rb
|
70
64
|
- lib/green_midget/base.rb
|
data/.rspec
DELETED
@@ -1 +0,0 @@
|
|
1
|
-
--color
|
data/.rvmrc
DELETED
@@ -1 +0,0 @@
|
|
1
|
-
rvm use ruby-1.9.2@green_midget
|
data/benchmark/benchmark.rb
DELETED
@@ -1,40 +0,0 @@
|
|
1
|
-
include GreenMidget
|
2
|
-
|
3
|
-
TRAININGS = 90
|
4
|
-
CLASSIFICATIONS = 1
|
5
|
-
|
6
|
-
MESSAGE_LENGTH = 1000
|
7
|
-
|
8
|
-
@training_times = []
|
9
|
-
@classification_times = []
|
10
|
-
|
11
|
-
records_count_at_start = GreenMidgetRecords.count
|
12
|
-
|
13
|
-
def generate_text(message_length = 1)
|
14
|
-
message ||= []
|
15
|
-
while message.count < message_length do
|
16
|
-
word = ''
|
17
|
-
(rand(7) + 3).times { word += ('a'..'z').to_a[rand(26)] }
|
18
|
-
message << word unless message.include?(word)
|
19
|
-
end
|
20
|
-
text = message.join(' ')
|
21
|
-
end
|
22
|
-
|
23
|
-
TRAININGS.times do
|
24
|
-
a = GreenMidgetCheck.new generate_text(MESSAGE_LENGTH)
|
25
|
-
@training_times << Benchmark.measure{ a.classify_as! [ ALTERNATIVE, NULL ][rand(2)] }.real
|
26
|
-
end
|
27
|
-
|
28
|
-
CLASSIFICATIONS.times do
|
29
|
-
a = GreenMidgetCheck.new generate_text(MESSAGE_LENGTH)
|
30
|
-
@classification_times << Benchmark.measure{ a.classify }.real
|
31
|
-
end
|
32
|
-
|
33
|
-
puts " ------------------------------- "
|
34
|
-
puts " Average seconds from #{ TRAININGS } trainings and #{ CLASSIFICATIONS } classifications. #{ MESSAGE_LENGTH } words per message:"
|
35
|
-
puts " Number of records at start: #{ records_count_at_start } and at the end: #{ GreenMidgetRecords.count }"
|
36
|
-
puts " ------------------------------- "
|
37
|
-
puts " Training times: #{ (@training_times.sum.to_f/TRAININGS).round(4) }"
|
38
|
-
puts " ------------------------------- "
|
39
|
-
puts " Classification times: #{ (@classification_times.sum.to_f/CLASSIFICATIONS).round(4) }"
|
40
|
-
puts " ------------------------------- "
|
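Note on the removed benchmark script: its approach (generate a message of distinct random words, then average the wall-clock time of an operation) can be reproduced without the gem. The sketch below is self-contained and uses only Ruby's standard `benchmark` library. The helper name `average_seconds` and the word-counting workload are placeholders assumed for illustration; substitute a real `classify` call to measure the classifier.

```ruby
require "benchmark"

# Build a message of `message_length` distinct random lowercase words,
# each 3 to 9 letters long (the same shape as the removed script's helper).
def generate_text(message_length = 1)
  letters = ("a".."z").to_a
  words = []
  while words.size < message_length
    word = Array.new(rand(3..9)) { letters.sample }.join
    words << word unless words.include?(word)
  end
  words.join(" ")
end

# Average wall-clock seconds per call of the given block over `runs` runs.
def average_seconds(runs)
  total = 0.0
  runs.times { total += Benchmark.realtime { yield } }
  (total / runs).round(4)
end

# Placeholder workload standing in for a classification call.
text = generate_text(1000)
average = average_seconds(5) { text.split.uniq.size }
puts "Average seconds over 5 runs, 1000 words per message: #{average}"
```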
data/benchmark/test.rb
DELETED
@@ -1,31 +0,0 @@
|
|
1
|
-
require 'sqlite3'
|
2
|
-
|
3
|
-
require File.join(File.dirname(__FILE__), '..', 'spec', 'tester')
|
4
|
-
include GreenMidget
|
5
|
-
|
6
|
-
ActiveRecord::Base.establish_connection(:adapter => 'sqlite3', :database => '/sc/user_backup/data.db')
|
7
|
-
|
8
|
-
@spam = [ 'messages', 'comments', 'posts' ].map { |table| ActiveRecord::Base.connection.execute("select body from #{table} limit 1500").inject([]) { |memo, hash| memo << hash["body"] } }
|
9
|
-
|
10
|
-
ActiveRecord::Base.establish_connection(:adapter => 'mysql', :username => 'root', :password => 'root', :database => 'soundcloud_development_temp')
|
11
|
-
|
12
|
-
@ham = [ 'messages', 'comments', 'posts' ].map { |table| GreenMidgetRecords.find_by_sql("select body from #{table} limit 1500").to_a.inject([]) { |memo, hash| memo << hash["body"] } }
|
13
|
-
|
14
|
-
ActiveRecord::Base.establish_connection(:adapter => 'mysql', :username => 'root', :password => 'root', :database => 'classifier_development_weird')
|
15
|
-
#
|
16
|
-
# # ------ I. PERFORM TRAINING
|
17
|
-
# puts Benchmark.measure {
|
18
|
-
# @spam.each { |src|
|
19
|
-
# src.each {|body|
|
20
|
-
# klass = Tester.new(body);klass.classify_as! :spam
|
21
|
-
# }
|
22
|
-
# };true
|
23
|
-
# }
|
24
|
-
#
|
25
|
-
# puts Benchmark.measure {
|
26
|
-
# @ham.each { |src|
|
27
|
-
# src.each {|body|
|
28
|
-
# klass = Tester.new(body);klass.classify_as! :ham
|
29
|
-
# }
|
30
|
-
# };true
|
31
|
-
# }
|
data/extensions/sample.rb
DELETED
@@ -1,19 +0,0 @@
|
|
1
|
-
# Copyright (c) 2011, SoundCloud Ltd., Nikola Chochkov
|
2
|
-
class Sample < GreenMidget::Base
|
3
|
-
attr_accessor :user
|
4
|
-
|
5
|
-
def initialize(text, user)
|
6
|
-
@text = text
|
7
|
-
@user = user
|
8
|
-
end
|
9
|
-
|
10
|
-
private
|
11
|
-
|
12
|
-
def features
|
13
|
-
%w(regular_user) + super
|
14
|
-
end
|
15
|
-
|
16
|
-
def regular_user?
|
17
|
-
|
18
|
-
end
|
19
|
-
end
|