monkeyshines 0.2.0 → 0.2.1

data/README.textile CHANGED
@@ -4,34 +4,121 @@ It's designed to handle large-scale scrapes that may exceed the capabilities of
4
4
 
5
5
  h2. Install
6
6
 
7
- This is best run standalone -- not as a gem; it's still in heavy development. I recommend cloning
7
+ ** "Main Install and Setup Documentation":http://mrflip.github.com/monkeyshines/INSTALL.html **
8
8
 
9
- * http://github.com/mrflip/edamame
10
- * http://github.com/mrflip/wuclan
11
- * http://github.com/mrflip/wukong
12
- * http://github.com/mrflip/monkeyshines (this repo)
9
+ h3. Get the code
13
10
 
14
- into a common directory.
11
+ We're still actively developing monkeyshines. The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/monkeyshines
15
12
 
16
- Additionally, you'll need some of these gems:
13
+ pre. $ git clone git://github.com/mrflip/monkeyshines
17
14
 
18
- * addressable (2.1.0)
19
- * extlib (0.9.12)
20
- * htmlentities (4.2.0)
15
+ A gem is available from "gemcutter:":http://gemcutter.org/gems/monkeyshines
21
16
 
22
- And if you spell ruby with a 'j', you'll want
17
+ pre. $ sudo gem install monkeyshines --source=http://gemcutter.org
23
18
 
24
- * jruby-openssl (0.5.2)
25
- * json-jruby (1.1.7)
19
+ (don't use the gems.github.com version -- it's way out of date.)
26
20
 
27
- ---------------------------------------------------------------------------
21
+ You can instead download this project in either "zip":http://github.com/mrflip/monkeyshines/zipball/master or "tar":http://github.com/mrflip/monkeyshines/tarball/master formats.
28
22
 
29
- h2. Help!
23
+ h3. Dependencies and setup
30
24
 
31
- Send Monkeyshines questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code
25
+ To finish setting up, see the "detailed setup instructions":http://mrflip.github.com/monkeyshines/INSTALL.html and then read the "usage notes":http://mrflip.github.com/monkeyshines/usage.html
32
26
 
33
27
  ---------------------------------------------------------------------------
34
28
 
29
+ h2. Overview
30
+
31
+ h3. Runner
32
+
33
+ * Builder Pattern to construct
34
+ * does the running itself
35
+ *
36
+
37
+ # Set stuff up
38
+ # Loop. (Until no more requests)
39
+ ## Get a request from #source
40
+ ## Pass that request to the fetcher
41
+ ##* The fetcher has a #get method,
42
+ ##* which stuffs the response contents into the request object
43
+ ## if the fetcher has a successful response,
44
+ ##
45
+
46
+ **Bulk URL Scraper**
47
+
48
+ # Open a file with URLs, one per line
49
+ # Loop until no more requests:
50
+ ## Get a simple_request from #source
51
+ ##* The source is a FlatFileStore;
52
+ ##* It generates simple_request objects (of type SimpleRequest), each with a #url and
53
+ an attribute holding (contents, response_code, response_message).
54
+
55
+ ## Pass that request to an http_fetcher
56
+ ##* The fetcher has a #get method,
57
+ ##* which stuffs the body of the response -- basically, the HTML for the page -- into the request object's contents (and likewise for the response_code and response_message).
58
+
59
+ ## if the fetcher has a successful response,
60
+ ##* Pass it to a flat_file_store
61
+ ##* which just writes the request to disk, one line per request, with tab-separated fields:
62
+ url moreinfo scraped_at response_code response_message contents
63
+
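The bulk URL scraper loop above can be sketched roughly as follows. This is a minimal illustration, not the monkeyshines implementation: @StubFetcher@ and @scrape_urls@ are hypothetical names, the fetcher returns canned pages instead of making HTTP calls, and flattening tabs and newlines in the body is an assumption made to keep one request per line. Only the field order (url, moreinfo, scraped_at, response_code, response_message, contents) comes from the notes above.

```ruby
require 'stringio'

# Field order follows the flat-file layout described in the notes.
SimpleRequest = Struct.new(:url, :moreinfo, :scraped_at,
                           :response_code, :response_message, :contents)

# Hypothetical stand-in for an HTTP fetcher: #get fills in the request
# from a canned table of pages instead of touching the network.
class StubFetcher
  def initialize(pages)
    @pages = pages
  end

  def get(request)
    body = @pages[request.url]
    request.scraped_at       = Time.now.utc.strftime('%Y%m%d%H%M%S')
    request.response_code    = body ? 200 : 404
    request.response_message = body ? 'OK' : 'Not Found'
    request.contents         = body
    request
  end
end

# Read URLs one per line, fetch each, and write every successful
# response as a single tab-separated line.
def scrape_urls(source, fetcher, store)
  source.each_line do |line|
    url = line.strip
    next if url.empty?
    request = fetcher.get(SimpleRequest.new(url))
    next unless request.response_code == 200
    # Assumption: collapse tabs and newlines so each request stays on one line.
    fields = request.to_a.map { |field| field.to_s.gsub(/[\t\r\n]+/, ' ') }
    store.puts(fields.join("\t"))
  end
  store
end

# Usage: any IO-like objects work as source and store.
urls   = StringIO.new("http://example.com/a\nhttp://example.com/missing\n")
pages  = { 'http://example.com/a' => '<html>hello</html>' }
output = scrape_urls(urls, StubFetcher.new(pages), StringIO.new)
puts output.string
```

Passing the source, fetcher and store in as arguments mirrors the runner's split into a request source, a fetcher and a store, so each piece can be swapped independently.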
64
+ beanstalk == queue
65
+ ttserver == distributed lightweight DB
66
+ god == monitoring & restart
67
+ shotgun == runs sinatra for development
68
+ thin == runs sinatra for production
69
+
70
+ ~/ics/monkeyshines/examples/twitter_search == twitter search scraper
71
+ * work directory holds everything generated: logs, output, dumps of the scrape queue
72
+ * ./dump_twitter_search_jobs.rb --handle=com.twitter.search --dest-filename=dump.tsv
73
+ serializes the queue to a flat file in work/seed
74
+ load_twitter_search_jobs.rb*
75
+ scrape_twitter_search.rb
76
+ * nohup ./scrape_twitter_search.rb --handle=com.twitter.search >> work/log/twitter_search-console-`date "+%Y%m%d%H%M%S"`.log 2>&1 &
77
+
78
+
79
+ * tail -f work/log/twitter_search-console-20091006.log (<-- replace date with latest run)
80
+
81
+ # the actual file being stored
82
+ * tail -f work/20091013/comtwittersearch+20091013164824-17240.tsv | cutc 150
83
+
84
+
85
+ h3. Request Source
86
+
87
+
88
+ * runner.source
89
+ ** request stream
90
+ ** Supplies raw material to initialize a job
91
+
92
+
93
+
94
+
95
+
96
+
97
+
98
+
99
+
100
+
101
+
102
+
103
+ Twitter search scra
104
+
105
+
106
+
107
+
108
+
109
+
110
+
111
+
112
+
113
+
114
+
115
+
116
+
117
+
118
+
119
+
120
+
121
+
35
122
  h2. Request Queue
36
123
 
37
124
  h3. Periodic requests
@@ -109,3 +196,21 @@ h4. Rescheduling
109
196
 
110
197
  The next scrape should be timed to yield a couple of pages or one mostly-full page. This requires tracking a rate (num_items / timespan), with the reschedule interval clamped to the min_reschedule / max_reschedule bounds.
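
One way to read the rescheduling note is sketched below. It is an illustration only: @reschedule_delay@ and @target_items@ are hypothetical names, and falling back to the maximum delay when nothing was observed is an assumed policy. Only the rate (num_items / timespan) and the min_reschedule / max_reschedule bounds come from the note itself.

```ruby
# Hypothetical helper: seconds to wait so that the next scrape finds
# roughly target_items new items, given what the last scrape observed.
def reschedule_delay(num_items, timespan, target_items:, min_reschedule:, max_reschedule:)
  # Assumption: with no observed items we cannot estimate a rate,
  # so wait as long as the bounds allow.
  return max_reschedule if num_items <= 0 || timespan <= 0

  rate = num_items.to_f / timespan # items per second
  (target_items / rate).clamp(min_reschedule, max_reschedule)
end

# Usage: 50 items over the last 10 minutes, aiming for 100 per scrape.
delay = reschedule_delay(50, 600, target_items: 100,
                         min_reschedule: 60, max_reschedule: 3600)
puts delay
```

Clamping the delay rather than the rate keeps the bounds in the same unit (seconds) that a scheduler would consume.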
111
198
 
199
+
200
+ ---------------------------------------------------------------------------
201
+
202
+ <notextile><div class="toggle"></notextile>
203
+
204
+ h2. More info
205
+
206
+ There are many useful examples in the examples/ directory.
207
+
208
+ h3. Credits
209
+
210
+ Monkeyshines was written by "Philip (flip) Kromer":http://mrflip.com (flip@infochimps.org / "@mrflip":http://twitter.com/mrflip) for the "infochimps project":http://infochimps.org
211
+
212
+ h3. Help!
213
+
214
+ Send monkeyshines questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code
215
+
216
+ <notextile></div></notextile>
@@ -88,7 +88,7 @@ module Monkeyshines
88
88
  when Net::HTTPUnauthorized then sleep_time = 0 # 401 (protected user, probably)
89
89
  when Net::HTTPForbidden then sleep_time = 4 # 403 update limit
90
90
  when Net::HTTPNotFound then sleep_time = 0 # 404 deleted
91
- when Net::HTTPServiceUnavailable then sleep_time = 9 # 503 Fail Whale
91
+ when Net::HTTPServiceUnavailable then sleep_time = 15 # 503 Fail Whale
92
92
  when Net::HTTPServerError then sleep_time = 2 # 5xx All other server errors
93
93
  else sleep_time = 1
94
94
  end
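
The branch order in this hunk matters, which the following sketch tries to make concrete. @sleep_time_for@ is a hypothetical helper that reproduces only the branches visible above (any earlier branches in the real case statement are not shown in the diff and are omitted here). Because @Net::HTTPServiceUnavailable@ is a subclass of @Net::HTTPServerError@, the 503 branch has to come before the generic 5xx branch or it would never match.

```ruby
require 'net/http'

# Hypothetical helper mirroring the visible branches of the hunk above,
# using the 0.2.1 value of 15 seconds for a 503.
def sleep_time_for(response)
  case response
  when Net::HTTPUnauthorized       then 0  # 401 (protected user, probably)
  when Net::HTTPForbidden          then 4  # 403 update limit
  when Net::HTTPNotFound           then 0  # 404 deleted
  when Net::HTTPServiceUnavailable then 15 # 503, must precede the 5xx branch
  when Net::HTTPServerError        then 2  # all other 5xx server errors
  else 1
  end
end

# Usage: response objects can be built directly, without a network call.
response = Net::HTTPServiceUnavailable.new('1.1', '503', 'Service Unavailable')
puts sleep_time_for(response)
```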
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: monkeyshines
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.2.0
4
+ version: 0.2.1
5
5
  platform: ruby
6
6
  authors:
7
7
  - Philip (flip) Kromer
@@ -9,7 +9,7 @@ autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
11
 
12
- date: 2009-10-12 00:00:00 -05:00
12
+ date: 2009-11-02 00:00:00 -06:00
13
13
  default_executable:
14
14
  dependencies:
15
15
  - !ruby/object:Gem::Dependency