RubyGems - green_midget - Versions diffs - 0.1.0 → 0.1.1 - Mend

green_midget 0.1.0 → 0.1.1

Files changed (4)

data/README.md CHANGED

@@ -6,43 +6,84 @@ On Bayesian Classification
 This project started during an internship at SoundCloud.
-Using SoundCloud's private messaging means that you can effectively reach out to everyone on the Cloud. On top of that, you have track commenting, groups posting, forum topics, track sharing - we care about your voice being heard! And read.
+Using SoundCloud's private messaging means that you can effectively reach out
+to everyone on the Cloud. On top of that, you have track commenting, groups
+posting, forum topics, track sharing - we care about your voice being heard!
+And read.
-I'll put this in some perspective and say that we're now having daily text exchange volume in the order of hundreds of thousands. And it's also rapidly going up.
+I'll put this in some perspective and say that we're now having daily text
+exchange volume in the order of hundreds of thousands. And it's also rapidly
+going up.
-And while most of this runs smoother than Berliner beer on a SoundCloud Friday, violations to our [Community guidelines][guidelines] are starting to be less and less of an exception. So I've been given the task to address this and build a system that progressively learns how to tell good community behaviour from less good - welcome to the:
+And while most of this runs smoother than Berliner beer on a SoundCloud Friday,
+violations to our [Community guidelines][guidelines] are starting to be less
+and less of an exception. So I've been given the task to address this and
+build a system that progressively learns how to tell good community behaviour
+from less good - welcome to the:
 GreenMidget
 ----------
-GreenMidget is a trainable, feature-full Bayesian text classifier. Out of the box it's super straightforward to use, but it also offers easy customisation options. It's a Ruby gem and today we're open sourcing it, so you can start with it within a minute after the:
+GreenMidget is a trainable, feature-full Bayesian text classifier. Out of the
+box it's super straightforward to use, but it also offers easy customisation
+options. It's a Ruby gem and today we're open sourcing it, so you can start
+with it within a minute after the:
 Installation
 ----------
-You are very likely (but not necessarily) gonna be on a Rails app, so just add
+If you're using Bundler, simply add the following to your Gemfile
     gem 'green_midget'
-to your Gemfile and run
+and then run
     bundle install
 after which (so that you get the ActiveRecord backend ready):
-    rake green_midget:setup:active_record # creates a green_midget_records table and populate some entried there
+    bundle exec rake green_midget:setup:active_record
+This creates a `green_midget_records` table and populates some entries there.
 You're now done.
+Try it out (right on the CLI)
+----------
+After you install the gem a shell executable is available for a quick play
+with an online GreenMidget server trained on ~ 9000 public spam and ham
+examples posted on SoundCloud as posts or track comments.
+    $ greenmidget 'buy cheap bags online'
+    $ greenmidget 'upload and share cool tracks online'
+    $ greenmidget potential_spam.txt # will read the file and classify the text
+Go ahead and play around a bit, but keep in mind that this online service is in a
+very early training stage and lacks even basic features (see below).
 How it works
 ----------
-GreenMidget learns to classify between two categories, so what you should first do is provide training examples for each of those two categories. See below.
+GreenMidget is a Naive Bayes implementation that uses the log ratio of spam vs
+ham probabilities for a given object to assign it to one of the two categories.
+There's an indecisive range as well - by default between 0 and Log(3).
+Everything under 0 is considered legit and everything above Log(3) is spam.
+GreenMidget adjusts the probabilities for individual words by training on
+known examples, and thus improves its accuracy over time.
+You can define further features (perhaps based on characteristics of the objects
+you have to deal with) and use them to calculate probabilities. You can also
+define heuristic checks for either category (see below for more on how to do
+these).
 Use it
 ----------
-`GreenMidget::Classifier` is the interaction class that is there after installation. It exposes two public instance methods as a start: `GreenMidget::Classifier#classify_as!` and `GreenMidget::Classifier#classify`. We'll do a three lines classification session and illustrate them.
+`GreenMidget::Classifier` is the interaction class that is there after
+installation. It exposes two public instance methods as a start:
+`GreenMidget::Classifier#classify_as!` and `GreenMidget::Classifier#classify`.
+We'll do a three-line classification session and illustrate them.
 We'll start training `GreenMidget` with a spammy example:
@@ -52,81 +93,108 @@ Similarly for legitimate examples
     GreenMidget::Classifier.new(known_legit_text).classify_as! :ham
-After we've given to it some training data, we can start classifying unknown text:
+After we've given it some training data, we can start classifying unknown
+text:
     decision = GreenMidget::Classifier.new(new_text).classify
-`decision` is now in `[ -1, 0, 1 ]` meaning respectively 'No spam', 'Not enough evidence', 'Spam'.
+`decision` is now in `[ -1, 0, 1 ]` meaning respectively 'No spam',
+'Not enough evidence', 'Spam'.
 Extend it
 ----------
-If the above functionality is not enough for you and you want to add custom logic to GreenMidget you can do that by extending the `GreenMidget::Base` class (check `lib/green_midget/extensions/sample.b` in the [code][green_midget_github] for an example):
+If the above functionality is not enough for you and you want to add custom
+logic to GreenMidget you can do that by extending the `GreenMidget::Base`
+class (check `lib/green_midget/extensions/sample.b` in the [code][green_midget_github]
+for an example):
-* Implement heuristics logic, which will directly classify incoming object as a given category. Example:
+* Implement heuristics logic, which will directly classify an incoming object as a
+given category. Example:
     def pass_ham_heuristics?
       words.count > 5 || url_in_text?
     end
-  This method will be `true` for longer text or such that contains an external url. In this case the classifier would go on to the actual testing procedure. If `false`, however, the procedure will not be done and the classifier will return the ham category as a result. Note the native `GreenMidget::Base#words` and `GreenMidget::Base#url_in_text?`
+  This method will be `true` for longer texts or ones that contain an external
+  URL. In this case the classifier would go on to the actual testing procedure.
+  If `false`, however, the procedure will not be done and the classifier will
+  return the ham category as a result. Note the default
+  `GreenMidget::Base#words` and `GreenMidget::Base#url_in_text?`
-  All heuristic checks return `true` by default so it's up to you whether you will define and use heuristics at all or not. However, using them can help you integrate your application context and decrease classification error chance especially at the edge cases.
+  All heuristic checks return `true` by default so it's up to you whether you
+  will define and use heuristics at all or not. However, using them can help
+  you integrate your application context and decrease classification error
+  chance especially at the edge cases.
-* Expand the source of evidence. Traditionally, _naive_ Bayesian text classifiers see individual words as evidence and calculate category-likelihoods for each word. But there could be more than that in your application context, eg. user's data or specific text features.
+* Expand the source of evidence. Traditionally, _naive_ Bayesian text
+classifiers see individual words as evidence and calculate category-likelihoods
+for each word. But there could be more than that in your application context,
+e.g. user's data or specific text features.
-  By default GreenMidget comes with two feature definitions `url_in_text` and `email_in_text`, but you can implement as many more as you want by writing a boolean method that checks for the feature:
+  By default GreenMidget comes with two feature definitions `url_in_text` and
+  `email_in_text`, but you can implement as many more as you want by writing a
+  boolean method that checks for the feature:
     def regular_user?
       @user.sign_up_count > 10
     end
-  and then implement a `features` method that returns an array with your custom feature names:
+  and then implement a `features` method that returns an array with your custom
+  feature names:
     def features
       ['regular_user', .... ]
     end
-  (do make sure that the array entry is the same as the name of the method that would be checking for this feature)
+  (do make sure that the array entry is the same as the name of the method that
+  would be checking for this feature)
-  The GreenMidget features definitions have more weight on shorter texts and less weight on longer thus they provide a ground source of evidence for GreenMidget's classification.
+  The GreenMidget feature definitions have more weight on shorter texts and
+  less weight on longer ones, so they provide a ground source of evidence for
+  GreenMidget's classification.
 If that's not enough too, see the Contribute section below.
-Benchmarking
+Performance
 ----------
-Before moving on, let's say that `GreenMidget` is intended for asynchronous spam checks. Using ActiveRecord as backend has the benefit of wide support and easy setup, but as it also means that the time performance will become progressively worse the more training you provide.
+GreenMidget uses ActiveRecord as its backend. This gives wide support and easy
+setup, but it's less performant than other data stores, especially on
+training operations. You should run such tasks asynchronously in real
+applications. A future version backed by Redis is planned.
-1. GreenMidget is optimised for classification operations (`classify` method), on which it's relatively efficient. The results below were obtained from classification on randomly generated messages of length _1 000 words_ (that's _very_ long for SoundCloud). Since GreenMidget runs on a relational database (through ActiveRecord) by default the table size impacts data fetch and write:
+Classification Efficiency
+----------
-	* on ~ 10 000 table rows = 0.0703 seconds / message
-	* on ~ 100 000 rows = 0.2082 sec / message
-	* on ~ 500 000 rows = 0.6505 sec / message
-	* on ~ 1 000 000 rows = 0.6773 sec / messages
+Obviously this will depend on the training data that you have, but do give the
+Heroku GreenMidget app a try from the supplied CLI tool for a start (see
+above for examples) or type:
-2. Training operations (`classify_as!`) are, however, less performant because they invoke a database write per word. Under the same conditions as above, the training times of randomly generated messages follows:
+    $ greenmidget
-	* on ~ 10 000 table rows = 1.5984 seconds / message
-	* on ~ 100 000 rows = 0.1303 sec / message
-	* on ~ 500 000 rows = 1.7185 sec / message
-	* on ~ 1 000 000 rows = 2.5335 sec / message
+on your shell for a help message. The online classifier, for example, lacks many
+possible features such as heuristic checks, word stemming, stop words, etc.
+It's only trained on the word occurrences of a total of 9000 messages (4500
+each of spam and ham).
-Classification Efficiency
+During the development tests at SoundCloud, with those features in place, we
+achieved more than 98% correct classification of spam examples using GreenMidget.
+Thanks
 ----------
-TODO: give test results; provide a web interface to a trained classifier using some of SoundCloud's spam and legit data; give production experience from DigitaleSeiten.
+massively to everyone at SoundCloud for the help during the development of
+GreenMidget.
 Contribute
 ----------
-Let me know on any feedback or feature requests. If you want to hack on the
-code, just do that!
+Just do the standard:
-  * Make a fork
-    * `git clone git@github.com:chochkov/GreenMidget.git`
-    * `bundle`
-    * `bundle exec rake` to run the specs
+  * Make a fork and then:
+    * run `bundle` to set up dependencies
+    * and `bundle exec rake` to run the specs
   * Make a patch
   * Send a Pull Request
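
The decision rule described in the "How it works" section of the README above can be sketched in a few lines of Ruby. This is only an illustration and not GreenMidget's actual code: the method names `log_ratio` and `decide`, the per-word probability hashes, the equal priors, and the default probability for unseen words are all assumptions made for the example. Only the thresholds (0 and Log(3)) and the return values (-1, 0, 1) come from the README.

```ruby
# Hypothetical sketch of a log-ratio decision rule; not GreenMidget's own code.
# Thresholds (0 and log 3) and return values (-1, 0, 1) follow the README.

HAM_THRESHOLD  = 0.0          # below this: ham
SPAM_THRESHOLD = Math.log(3)  # above this: spam

# Log of the prior ratio plus the sum of log(P(word|spam) / P(word|ham)).
# Words never seen in training get the same small default probability in
# both categories (an assumption of this sketch), so they contribute 0.
def log_ratio(words, spam_probs, ham_probs, prior_spam: 0.5, unseen: 1e-3)
  prior = Math.log(prior_spam / (1.0 - prior_spam))
  words.reduce(prior) do |sum, word|
    sum + Math.log(spam_probs.fetch(word, unseen) / ham_probs.fetch(word, unseen))
  end
end

# Maps a log ratio to -1 (no spam), 0 (not enough evidence) or 1 (spam).
def decide(ratio)
  return -1 if ratio < HAM_THRESHOLD
  return 1 if ratio > SPAM_THRESHOLD
  0
end

# Made-up word probabilities, for illustration only.
SPAM_PROBS = { 'cheap' => 0.2,  'bags' => 0.1,  'tracks' => 0.01 }.freeze
HAM_PROBS  = { 'cheap' => 0.02, 'bags' => 0.05, 'tracks' => 0.1 }.freeze

puts decide(log_ratio(%w[cheap bags], SPAM_PROBS, HAM_PROBS))
puts decide(log_ratio(%w[tracks], SPAM_PROBS, HAM_PROBS))
```

With equal priors the prior term is zero, so a text made only of unseen words lands exactly on 0 and falls into the indecisive range.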

data/bin/greenmidget ADDED

@@ -0,0 +1,62 @@
+#!/usr/bin/env ruby
+$LOAD_PATH.unshift(File.dirname(__FILE__) + '/../lib')
+require 'net/http'
+require 'green_midget'
+def say(what); puts "==> #{what}"; end
+if (text = ARGV[0]) && ARGV.size == 1
+  say "This will check your input against some of SoundCloud\'s history of spaammmm..\n"
+  say "(run without arguments for more info)\n\n"
+  text =
+    if File.exist?(text)
+      IO.readlines(text, '').join
+    else
+      text
+    end
+  uri = URI("http://freezing-earth-5798.herokuapp.com/?q=#{URI.escape(text)}")
+  response = ''
+  begin
+    response = Net::HTTP.get(uri).to_i
+  rescue
+    say 'An error connecting to Heroku. How about your internets?'
+    exit 1
+  end
+  case response
+  when 1
+    say 'Hm.. It looks not so good! ( looks like spam )'
+  when 0
+    say 'Well, i cant really tell - it could be either'
+  when -1
+    say 'And.. it sounds ok..'
+  else
+    say 'An unknown error struck!'
+  end
+else
+  say "Checks for spam!\n\n"
+  puts <<-TEXT
+This tool accesses an online GreenMidget service trained on 4500
+examples of public spam messages or track comments that were posted on
+SoundCloud. You can use it to classify your texts against it.
+  Examples:
+  greenmidget 'buy cheap bags online'
+  greenmidget 'upload cool tracks online'
+  greenmidget potential_spam.txt
+Notice: This service is only used as an illustration to the GreenMidget
+classifier, however its training is limited and it lacks even basic
+features, that GreenMidget could provide.
+This is not actually in use at SoundCloud!
+read more on: http://github.com/chochkov/greenmidget
+  TEXT
+end
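
Two details of the script above are worth a note for anyone running it on a current Ruby. `URI.escape` was removed in Ruby 3.0, and `String#to_i` returns 0 for non-numeric input, so an error page from the server would be reported as "can't tell". The sketch below shows one possible way to handle both. The helper names `classification_uri` and `parse_decision` are hypothetical and not part of the gem; the host and the `q` parameter are taken from the script.

```ruby
require 'uri'

# Hypothetical helpers, not part of the gem.
HOST = 'freezing-earth-5798.herokuapp.com'

# Builds the request URL with URI.encode_www_form, which percent-encodes
# reserved characters and writes spaces as '+'.
def classification_uri(text)
  URI::HTTP.build(host: HOST, path: '/', query: URI.encode_www_form(q: text))
end

# Returns -1, 0 or 1 for a well-formed response body and nil otherwise,
# so a non-numeric body is not mistaken for the "undecided" answer 0.
def parse_decision(body)
  value = Integer(body.strip, exception: false)
  [-1, 0, 1].include?(value) ? value : nil
end

puts classification_uri('buy cheap bags online')
```

Form encoding writes a space as `+` where `URI.escape` produced `%20`. Standard query-string parsers decode both to a space, though whether the demo server does so is an assumption here.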

data/lib/green_midget/version.rb CHANGED

@@ -1,3 +1,3 @@
 module GreenMidget
-  VERSION = '0.1.0'
+  VERSION = '0.1.1'
 end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: green_midget
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.1.1
   prerelease:
 platform: ruby
 authors:
@@ -9,12 +9,12 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-02-17 00:00:00.000000000 +01:00
+date: 2012-03-05 00:00:00.000000000 +01:00
 default_executable:
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activerecord
-  requirement: &2153348200 !ruby/object:Gem::Requirement
+  requirement: &2153446000 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -22,11 +22,12 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *2153348200
+  version_requirements: *2153446000
 description: Naive Bayesian Classifier with customizable features
 email:
 - nikola@howkul.info
-executables: []
+executables:
+- greenmidget
 extensions: []
 extra_rdoc_files: []
 files:
@@ -39,6 +40,7 @@ files:
 - Rakefile
 - benchmark/benchmark.rb
 - benchmark/test.rb
+- bin/greenmidget
 - green_midget.gemspec
 - lib/green_midget.rb
 - lib/green_midget/base.rb