monkeyshines 0.2.0 → 0.2.1
- data/README.textile +121 -16
- data/lib/monkeyshines/fetcher/http_fetcher.rb +1 -1
- metadata +2 -2
data/README.textile
CHANGED
@@ -4,34 +4,121 @@ It's designed to handle large-scale scrapes that may exceed the capabilities of

h2. Install

-
+** "Main Install and Setup Documentation":http://mrflip.github.com/monkeyshines/INSTALL.html **

-
-* http://github.com/mrflip/wuclan
-* http://github.com/mrflip/wukong
-* http://github.com/mrflip/monkeyshines (this repo)
+h3. Get the code

-
+We're still actively developing monkeyshines. The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/monkeyshines

-
+pre. $ git clone git://github.com/mrflip/monkeyshines

-
-* extlib (0.9.12)
-* htmlentities (4.2.0)
+A gem is available from "gemcutter:":http://gemcutter.org/gems/monkeyshines

-
+pre. $ sudo gem install monkeyshines --source=http://gemcutter.org

-
-* json-jruby (1.1.7)
+(don't use the gems.github.com version -- it's way out of date.)

-
+You can instead download this project in either "zip":http://github.com/mrflip/monkeyshines/zipball/master or "tar":http://github.com/mrflip/monkeyshines/tarball/master formats.

-
+h3. Dependencies and setup

-
+To finish setting up, see the "detailed setup instructions":http://mrflip.github.com/monkeyshines/INSTALL.html and then read the "usage notes":http://mrflip.github.com/monkeyshines/usage.html

---------------------------------------------------------------------------

+h2. Overview
+
+h3. Runner
+
+* Constructed via a Builder pattern
+* Does the running itself:
+
+# Set stuff up
+# Loop until there are no more requests:
+## Get a request from the #source
+## Pass that request to the fetcher
+##* The fetcher has a #get method,
+##* which stuffs the response contents into the request object
+## If the fetcher got a successful response, pass it along to the store
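
The loop above, as a toy Ruby sketch (hypothetical names throughout -- this is the shape of the thing, not the gem's actual API):

pre. # source, fetcher and store are duck-typed collaborators
class TinyRunner
  def initialize(source, fetcher, store)
    @source, @fetcher, @store = source, fetcher, store
  end
  def run!
    while (req = @source.get)     # 1. get a request from the source
      @fetcher.get(req)           # 2. the fetcher stuffs the response into the request
      @store.save(req) if req.response_code.to_s =~ /\A2/   # 3. keep it only on success
    end
  end
end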
+
+**Bulk URL Scraper**
+
+# Open a file with URLs, one per line
+# Loop until there are no more requests:
+## Get a simple_request from the #source
+##* The source is a FlatFileStore;
+##* it generates simple_requests (objects of type SimpleRequest), each with a #url and attributes holding the contents, response_code and response_message.
+## Pass that request to an http_fetcher
+##* The fetcher has a #get method,
+##* which stuffs the body of the response -- basically, the HTML for the page -- into the request object's contents (and likewise for the response_code and response_message).
+## If the fetcher got a successful response,
+##* pass it to a flat_file_store,
+##* which just writes the request to disk, one line per request, as tab-separated fields:
+
+pre. url  moreinfo  scraped_at  response_code  response_message  contents
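
Spelled out with nothing but the Ruby standard library, the whole recipe is roughly this (the file names are placeholders, and it skips the gem's own fetcher and store classes):

pre. require 'net/http'
require 'uri'
File.open('work/scraped.tsv', 'a') do |dump|
  File.foreach('urls.txt') do |line|                 # one URL per line
    url = line.chomp
    next if url.empty?
    resp = Net::HTTP.get_response(URI.parse(url))    # the fetcher's #get
    next unless resp.is_a?(Net::HTTPSuccess)         # store successful responses only
    flat = resp.body.gsub(/[\t\r\n]+/, ' ')          # keep the record on one line
    dump.puts [url, '', Time.now.utc.strftime('%Y%m%d%H%M%S'),
               resp.code, resp.message, flat].join("\t")
  end
end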
+
+beanstalk == queue
+ttserver == distributed lightweight DB
+god == monitoring & restart
+shotgun == runs sinatra for development
+thin == runs sinatra for production
+
+~/ics/monkeyshines/examples/twitter_search == twitter search scraper
+
+* the work directory holds everything generated: logs, output, dumps of the scrape queue
+* ./dump_twitter_search_jobs.rb --handle=com.twitter.search --dest-filename=dump.tsv serializes the queue to a flat file in work/seed
+* load_twitter_search_jobs.rb
+* scrape_twitter_search.rb, typically run as
+
+pre. $ nohup ./scrape_twitter_search.rb --handle=com.twitter.search >> work/log/twitter_search-console-`date "+%Y%m%d%H%M%S"`.log 2>&1 &
+
+* tail -f work/log/twitter_search-console-20091006.log follows the console log (replace the date with that of the latest run)
+* tail -f work/20091013/comtwittersearch+20091013164824-17240.tsv | cutc 150 follows the actual file being stored
+
+h3. Request Source
+
+* runner.source
+** request stream
+** Supplies the raw material to initialize a job; for instance:
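
A source can be as small as this sketch (hypothetical class names; all the runner needs is a #get that returns nil when the stream runs dry):

pre. SimpleRequest = Struct.new(:url, :moreinfo, :scraped_at, :response_code, :response_message, :contents)
class FlatFileSource
  def initialize(filename)
    @file = File.open(filename)
  end
  def get                            # hand the runner one request at a time
    line = @file.gets or return nil  # nil signals "no more requests"
    SimpleRequest.new(line.chomp)
  end
end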
+
+Twitter search scraper
+

h2. Request Queue

h3. Periodic requests
@@ -109,3 +196,21 @@ h4. Rescheduling
Want to time the next scrape so that it yields a couple of pages, or one mostly-full page. Need to track a rate (num_items / timespan) and clamp the resulting interval to min_reschedule / max_reschedule bounds.
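
As arithmetic, that's one division and a clamp (the constants are stand-ins, not the gem's defaults):

pre. MIN_RESCHEDULE = 60            # seconds
MAX_RESCHEDULE = 24 * 60 * 60
def next_delay(num_items, timespan, items_per_page)
  return MAX_RESCHEDULE if num_items.zero?   # nothing new last time: back way off
  rate  = num_items.to_f / timespan          # observed items per second
  delay = items_per_page / rate              # time to accumulate ~one full page
  [[delay, MIN_RESCHEDULE].max, MAX_RESCHEDULE].min
end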

+
+---------------------------------------------------------------------------
+
+<notextile><div class="toggle"></notextile>
+
+h2. More info
+
+There are many useful examples in the examples/ directory.
+
+h3. Credits
+
+Monkeyshines was written by "Philip (flip) Kromer":http://mrflip.com (flip@infochimps.org / "@mrflip":http://twitter.com/mrflip) for the "infochimps project":http://infochimps.org
+
+h3. Help!
+
+Send monkeyshines questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code
+
+<notextile></div></notextile>
data/lib/monkeyshines/fetcher/http_fetcher.rb
CHANGED
@@ -88,7 +88,7 @@ module Monkeyshines
when Net::HTTPUnauthorized       then sleep_time = 0   # 401 (protected user, probably)
when Net::HTTPForbidden          then sleep_time = 4   # 403 update limit
when Net::HTTPNotFound           then sleep_time = 0   # 404 deleted
-when Net::HTTPServiceUnavailable then sleep_time =
+when Net::HTTPServiceUnavailable then sleep_time = 15  # 503 Fail Whale
when Net::HTTPServerError        then sleep_time = 2   # 5xx All other server errors
else                                  sleep_time = 1
end
metadata
CHANGED
@@ -1,7 +1,7 @@
--- !ruby/object:Gem::Specification
name: monkeyshines
version: !ruby/object:Gem::Version
-  version: 0.2.0
+  version: 0.2.1
platform: ruby
authors:
- Philip (flip) Kromer
@@ -9,7 +9,7 @@ autorequire:
bindir: bin
cert_chain: []

-date: 2009-
+date: 2009-11-02 00:00:00 -06:00
default_executable:
dependencies:
- !ruby/object:Gem::Dependency