green_midget 0.0.2 → 0.0.3

data/.gitignore CHANGED
@@ -2,3 +2,5 @@
  .bundle
  Gemfile.lock
  pkg/*
+ .rvmrc
+ .rspec
data/Gemfile CHANGED
@@ -1,19 +1,4 @@
  # Copyright (c) 2011, SoundCloud Ltd., Nikola Chochkov
  source "http://rubygems.org"
- # Add dependencies required to use your gem here.
- # Example:
- #   gem "activesupport", ">= 2.3.5"
 
- gem "activerecord"
-
- # remove this dependency after testing.
- gem 'jberkel-mysql-ruby', '= 2.8.1', :require => 'mysql' # Ruby 1.9 fixes
-
- # Add dependencies to develop your gem here.
- # Include everything needed to run rake, tests, features, etc.
- group :development do
-   gem "rspec", "~> 2.3.0"
-   gem "bundler", "~> 1.0.0"
-   gem "jeweler", "~> 1.5.2"
-   gem "rcov", ">= 0"
- end
+ gemspec
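
Note on this change: the Gemfile now defers to the gemspec through the `gemspec` directive, so dependencies are declared in one place. A minimal sketch of what such a declaration could look like follows, assuming only the dependency names and `>= 0` constraints listed in the gem metadata further down. The method name `build_sketch_spec` and the summary text are illustrative, and this is not the project's actual `green_midget.gemspec`.

```ruby
require "rubygems"

# Hypothetical sketch: only the gem name, version, description and the
# dependency names/constraints are taken from the metadata in this diff.
def build_sketch_spec
  Gem::Specification.new do |s|
    s.name        = "green_midget"
    s.version     = "0.0.3"
    s.summary     = "Naive Bayesian Classifier with customizable features"
    s.description = "Naive Bayesian Classifier with customizable features"

    # Needed by anyone who installs the gem.
    s.add_runtime_dependency "activerecord", ">= 0"

    # Needed only when working on the gem itself.
    s.add_development_dependency "rspec",   ">= 0"
    s.add_development_dependency "bundler", ">= 0"
  end
end

spec = build_sketch_spec
puts "runtime:     #{spec.runtime_dependencies.map(&:name).join(', ')}"
puts "development: #{spec.development_dependencies.map(&:name).sort.join(', ')}"
```

With a gemspec like this in the project root, `gemspec` in the Gemfile lets Bundler read the same dependency list instead of duplicating it.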
data/README.md CHANGED
@@ -1,6 +1,8 @@
  On Bayesian Classification
  ----------
 
+ This project started during an internship at SoundCloud.
+
  Using SoundCloud's private messaging means that you can effectively reach out to everyone on the Cloud. On top of that, you have track commenting, groups posting, forum topics, track sharing - we care about your voice being heard! And read.
 
  I'll put this in some perspective and say that we're now having daily text exchange volume in the order of hundreds of thousands. And it's also rapidly going up.
@@ -23,48 +25,40 @@ to your Gemfile and run
 
  bundle install
 
- then add
-
- require 'green_midget'
-
- to your Rakefile and run
+ after which (so that you get the ActiveRecord backend ready):
 
- rake green_midget:setup
+ rake green_midget:setup:active_record # creates a green_midget_records table and populate some entried there
 
  You're now done.
 
  How it works
  ----------
 
- GreenMidget
-
- GreenMidget is a learner, so you will only expect effective classification from it only once it has received sufficient training. Training it means providing examples of messages for the categor
-
-
+ GreenMidget learns to classify between two categories, so what you should first do is provide training examples for each of those two categories. See below.
 
  Use it
  ----------
 
- `GreenMidget` exposes two public methods as a start: `GreenMidget#classify_as!` and `GreenMidget#classify`. Let's do a three lines classification session and illustrate them
+ `GreenMidget::Classifier` is the interaction class that is there after installation. It exposes two public instance methods as a start: `GreenMidget::Classifier#classify_as!` and `GreenMidget::Classifier#classify`. We'll do a three lines classification session and illustrate them.
 
- We'll start training `GreenMidget` with a spammy example
+ We'll start training `GreenMidget` with a spammy example:
 
- GreenMidget.new(known_spam_text).classify_as! :spam
+ GreenMidget::Classifier.new(known_spam_text).classify_as! :spam
 
  Similarly for legitimate examples
 
- GreenMidget.new(known_legit_text).classify_as! :ham
+ GreenMidget::Classifier.new(known_legit_text).classify_as! :ham
 
- To get a classification decision we would
+ After we've given to it some training data, we can start classifying unknown text:
 
- decision = GreenMidget.new(new_text).classify
+ decision = GreenMidget::Classifier.new(new_text).classify
 
- `decision` is now one of `[-1, 0, 1]` meaning respectively 'No spam', 'Not enough evidence', 'Spam'.
+ `decision` is now in `[ -1, 0, 1 ]` meaning respectively 'No spam', 'Not enough evidence', 'Spam'.
 
  Extend it
  ----------
 
- If the above functionality is not enough for you and you want to add custom logic to GreenMidget you can do that by extending the `GreenMidget::Base` class (check `extensions/sample.b` in the [code][green_midget_github] for an example):
+ If the above functionality is not enough for you and you want to add custom logic to GreenMidget you can do that by extending the `GreenMidget::Base` class (check `lib/green_midget/extensions/sample.b` in the [code][green_midget_github] for an example):
 
  * Implement heuristics logic, which will directly classify incoming object as a given category. Example:
 
@@ -99,7 +93,9 @@ If that's not enough too, you're welcome to [browse the code][green_midget_githu
  Benchmarking
  ----------
 
- 1. GreenMidget is optimised for classification operations (`classify` method), on which it's very efficient. The results below were obtained from classification on randomly generated messages of length _1 000 words_ (that's _very_ long for SoundCloud). Since GreenMidget runs on a relational database (through ActiveRecord) by default the table size impacts data fetch and write:
+ Before moving on, let's say that `GreenMidget` is intended for asynchronous spam checks. Using ActiveRecord as backend has the benefit of wide support and easy setup, but as it also means that the time performance will become progressively worse the more training you provide.
+
+ 1. GreenMidget is optimised for classification operations (`classify` method), on which it's relatively efficient. The results below were obtained from classification on randomly generated messages of length _1 000 words_ (that's _very_ long for SoundCloud). Since GreenMidget runs on a relational database (through ActiveRecord) by default the table size impacts data fetch and write:
 
  * on ~ 10 000 table rows = 0.0703 seconds / message
  * on ~ 100 000 rows = 0.2082 sec / message
@@ -116,13 +112,7 @@ Benchmarking
  Classification Efficiency
  ----------
 
- Trained on s
-
- Benchmarks = > data
-
- Efficiency = > for the sake of this article I ran a small off-production test to show some results on real data - I used 150 000 text items from records which we marked as good and records which we marked as not-the-best!
-
- We'll be next building our own SoundCloud extensions to GreenMidget and use it, so expect to hear more from the student! Meanwhile, I'll be happy to answer everything concerning the project so do feel free to get in touch.
+ TODO: give test results; provide a web interface to a trained classifier using some of SoundCloud's spam and legit data; give production experience from DigitaleSeiten.
 
- [green_midget_github]: http://github.com/chochkov/green_midget "Github repository"
+ [green_midget_github]: http://github.com/chochkov/GreenMidget "Github repository"
  [guidelines]: http://soundcloud.com/community-guidelines "Community guidelines"
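
Note on the README's "Use it" section: `classify` is described as returning one of -1, 0 or 1. A small sketch of how calling code might turn that value into something readable is given here. `DECISION_LABELS` and `describe_decision` are hypothetical names and are not part of GreenMidget; only the three values and their meanings come from the README.

```ruby
# Labels for the three decision values described in the README.
DECISION_LABELS = {
  -1 => :ham,       # 'No spam'
   0 => :undecided, # 'Not enough evidence'
   1 => :spam       # 'Spam'
}.freeze

# Hypothetical helper: maps a decision to a label and rejects anything
# outside the documented range instead of silently guessing.
def describe_decision(decision)
  DECISION_LABELS.fetch(decision) do
    raise ArgumentError, "unexpected decision: #{decision.inspect}"
  end
end

# With the gem installed and trained, usage would be roughly:
#   decision = GreenMidget::Classifier.new(new_text).classify
#   describe_decision(decision)
[-1, 0, 1].each { |d| puts "#{d} => #{describe_decision(d)}" }
```

Treating 0 as its own outcome, rather than folding it into ham or spam, keeps the "not enough evidence" case visible to whatever asynchronous job consumes the result.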
@@ -1,3 +1,3 @@
  module GreenMidget
-   VERSION = '0.0.2'
+   VERSION = '0.0.3'
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: green_midget
  version: !ruby/object:Gem::Version
- version: 0.0.2
+ version: 0.0.3
  prerelease:
  platform: ruby
  authors:
@@ -14,7 +14,7 @@ default_executable:
  dependencies:
  - !ruby/object:Gem::Dependency
  name: activerecord
- requirement: &2152860260 !ruby/object:Gem::Requirement
+ requirement: &2153074740 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ! '>='
@@ -22,10 +22,10 @@ dependencies:
  version: '0'
  type: :runtime
  prerelease: false
- version_requirements: *2152860260
+ version_requirements: *2153074740
  - !ruby/object:Gem::Dependency
  name: rspec
- requirement: &2152859840 !ruby/object:Gem::Requirement
+ requirement: &2153074320 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ! '>='
@@ -33,10 +33,10 @@ dependencies:
  version: '0'
  type: :development
  prerelease: false
- version_requirements: *2152859840
+ version_requirements: *2153074320
  - !ruby/object:Gem::Dependency
  name: bundler
- requirement: &2152859420 !ruby/object:Gem::Requirement
+ requirement: &2153073900 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ! '>='
@@ -44,7 +44,7 @@ dependencies:
  version: '0'
  type: :development
  prerelease: false
- version_requirements: *2152859420
+ version_requirements: *2153073900
  description: Naive Bayesian Classifier with customizable features
  email:
  - nikola@howkul.info
@@ -54,17 +54,11 @@ extra_rdoc_files: []
  files:
  - .document
  - .gitignore
- - .rspec
- - .rvmrc
  - Gemfile
  - Gemfile.lock
  - LICENSE.txt
  - README.md
  - Rakefile
- - benchmark/benchmark.rb
- - benchmark/test.rb
- - extensions/green_midget_check.rb
- - extensions/sample.rb
  - green_midget.gemspec
  - lib/green_midget.rb
  - lib/green_midget/base.rb
data/.rspec DELETED
@@ -1 +0,0 @@
- --color
data/.rvmrc DELETED
@@ -1 +0,0 @@
- rvm use ruby-1.9.2@green_midget
data/benchmark/benchmark.rb DELETED
@@ -1,40 +0,0 @@
- include GreenMidget
-
- TRAININGS = 90
- CLASSIFICATIONS = 1
-
- MESSAGE_LENGTH = 1000
-
- @training_times = []
- @classification_times = []
-
- records_count_at_start = GreenMidgetRecords.count
-
- def generate_text(message_length = 1)
-   message ||= []
-   while message.count < message_length do
-     word = ''
-     (rand(7) + 3).times { word += ('a'..'z').to_a[rand(26)] }
-     message << word unless message.include?(word)
-   end
-   text = message.join(' ')
- end
-
- TRAININGS.times do
-   a = GreenMidgetCheck.new generate_text(MESSAGE_LENGTH)
-   @training_times << Benchmark.measure{ a.classify_as! [ ALTERNATIVE, NULL ][rand(2)] }.real
- end
-
- CLASSIFICATIONS.times do
-   a = GreenMidgetCheck.new generate_text(MESSAGE_LENGTH)
-   @classification_times << Benchmark.measure{ a.classify }.real
- end
-
- puts " ------------------------------- "
- puts " Average seconds from #{ TRAININGS } trainings and #{ CLASSIFICATIONS } classifications. #{ MESSAGE_LENGTH } words per message:"
- puts " Number of records at start: #{ records_count_at_start } and at the end: #{ GreenMidgetRecords.count }"
- puts " ------------------------------- "
- puts " Training times: #{ (@training_times.sum.to_f/TRAININGS).round(4) }"
- puts " ------------------------------- "
- puts " Classification times: #{ (@classification_times.sum.to_f/CLASSIFICATIONS).round(4) }"
- puts " ------------------------------- "
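
Note on the removed script: it builds random messages of distinct words and averages wall-clock times over several runs. A self-contained sketch of that idea follows, using a stand-in workload so it runs without the gem or a database. `random_message` and `average_seconds` are illustrative names, not code from the project.

```ruby
require "benchmark"

LETTERS = ("a".."z").to_a.freeze

# Builds a message of `length` distinct random words, each 3 to 9 letters
# long (the same word-length range as the removed script). A Hash is used
# so the uniqueness check stays cheap for long messages.
def random_message(length, rng = Random.new)
  words = {}
  while words.size < length
    word = Array.new(rng.rand(7) + 3) { LETTERS[rng.rand(26)] }.join
    words[word] = true
  end
  words.keys.join(" ")
end

# Runs the given block `runs` times and returns the mean wall-clock
# seconds per run, rounded to four decimals.
def average_seconds(runs)
  total = 0.0
  runs.times { total += Benchmark.realtime { yield } }
  (total / runs).round(4)
end

# Stand-in workload. With the gem loaded, the block would instead call
# `classify` or `classify_as!` on a classifier built from the message.
message = random_message(1000)
puts "average seconds per run: #{average_seconds(5) { message.split.uniq.size }}"
```

Passing a seeded `Random` instance makes the generated messages reproducible between benchmark runs.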
data/benchmark/test.rb DELETED
@@ -1,31 +0,0 @@
- require 'sqlite3'
-
- require File.join(File.dirname(__FILE__), '..', 'spec', 'tester')
- include GreenMidget
-
- ActiveRecord::Base.establish_connection(:adapter => 'sqlite3', :database => '/sc/user_backup/data.db')
-
- @spam = [ 'messages', 'comments', 'posts' ].map { |table| ActiveRecord::Base.connection.execute("select body from #{table} limit 1500").inject([]) { |memo, hash| memo << hash["body"] } }
-
- ActiveRecord::Base.establish_connection(:adapter => 'mysql', :username => 'root', :password => 'root', :database => 'soundcloud_development_temp')
-
- @ham = [ 'messages', 'comments', 'posts' ].map { |table| GreenMidgetRecords.find_by_sql("select body from #{table} limit 1500").to_a.inject([]) { |memo, hash| memo << hash["body"] } }
-
- ActiveRecord::Base.establish_connection(:adapter => 'mysql', :username => 'root', :password => 'root', :database => 'classifier_development_weird')
- #
- # # ------ I. PERFORM TRAINING
- # puts Benchmark.measure {
- # @spam.each { |src|
- # src.each {|body|
- # klass = Tester.new(body);klass.classify_as! :spam
- # }
- # };true
- # }
- #
- # puts Benchmark.measure {
- # @ham.each { |src|
- # src.each {|body|
- # klass = Tester.new(body);klass.classify_as! :ham
- # }
- # };true
- # }
data/extensions/green_midget_check.rb DELETED
@@ -1,8 +0,0 @@
- # Copyright (c) 2011, SoundCloud Ltd., Nikola Chochkov
- class GreenMidgetCheck < GreenMidget::Base
-   attr_accessor :text
-
-   def initialize(text)
-     self.text = text
-   end
- end
data/extensions/sample.rb DELETED
@@ -1,19 +0,0 @@
- # Copyright (c) 2011, SoundCloud Ltd., Nikola Chochkov
- class Sample < GreenMidget::Base
-   attr_accessor :user
-
-   def initialize(text, user)
-     @text = text
-     @user = user
-   end
-
-   private
-
-   def features
-     %w(regular_user) + super
-   end
-
-   def regular_user?
-
-   end
- end
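
Note on the removed sample extension: it adds a feature name to the inherited list and pairs it with a predicate method of the same name. The sketch below shows that pattern in runnable form. It ASSUMES the base class evaluates each feature by calling `<name>?`, which this diff does not confirm. `SketchBase`, `SketchSample`, `present_features`, `url_in_text` and the `:regular` flag are all hypothetical stand-ins, not GreenMidget's API.

```ruby
# Stand-in for GreenMidget::Base so the sketch runs without the gem.
# ASSUMPTION: the real base class may evaluate features differently.
class SketchBase
  def initialize(text)
    @text = text
  end

  # Names of the features whose predicate returns true for this object.
  def present_features
    features.select { |name| send("#{name}?") }
  end

  private

  # Feature names known to the base class (illustrative).
  def features
    %w(url_in_text)
  end

  def url_in_text?
    !!(@text =~ %r{https?://})
  end
end

# Mirrors the shape of the removed sample: one extra feature name plus a
# predicate that decides whether it applies.
class SketchSample < SketchBase
  def initialize(text, user)
    super(text)
    @user = user
  end

  private

  def features
    %w(regular_user) + super
  end

  # Hypothetical rule: the user hash carries a :regular flag.
  def regular_user?
    @user.fetch(:regular, false)
  end
end

sample = SketchSample.new("see http://example.com", { regular: true })
puts sample.present_features.inspect
```

Prepending to `super` keeps the inherited features in play, so an extension only has to describe what it adds.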