monkeyshines 0.2.0 → 0.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/README.textile CHANGED
@@ -4,34 +4,121 @@ It's designed to handle large-scale scrapes that may exceed the capabilities of
 
  h2. Install
 
- This is best run standalone -- not as a gem; it's still in heavy development. I recommend cloning
+ ** "Main Install and Setup Documentation":http://mrflip.github.com/monkeyshines/INSTALL.html **
 
- * http://github.com/mrflip/edamame
- * http://github.com/mrflip/wuclan
- * http://github.com/mrflip/wukong
- * http://github.com/mrflip/monkeyshines (this repo)
+ h3. Get the code
 
- into a common directory.
+ We're still actively developing monkeyshines. The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/monkeyshines
 
- Additionally, you'll need some of these gems:
+ pre. $ git clone git://github.com/mrflip/monkeyshines
 
- * addressable (2.1.0)
- * extlib (0.9.12)
- * htmlentities (4.2.0)
+ A gem is available from "gemcutter:":http://gemcutter.org/gems/monkeyshines
 
- And if you spell ruby with a 'j', you'll want
+ pre. $ sudo gem install monkeyshines --source=http://gemcutter.org
 
- * jruby-openssl (0.5.2)
- * json-jruby (1.1.7)
+ (don't use the gems.github.com version -- it's way out of date.)
 
- ---------------------------------------------------------------------------
+ You can instead download this project in either "zip":http://github.com/mrflip/monkeyshines/zipball/master or "tar":http://github.com/mrflip/monkeyshines/tarball/master formats.
 
- h2. Help!
+ h3. Dependencies and setup
 
- Send Monkeyshines questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code
+ To finish setting up, see the "detailed setup instructions":http://mrflip.github.com/monkeyshines/INSTALL.html and then read the "usage notes":http://mrflip.github.com/monkeyshines/usage.html
 
  ---------------------------------------------------------------------------
 
+ h2. Overview
+
+ h3. Runner
+
+ * Uses a Builder pattern for construction
+ * Does the running itself
+
+ # Set stuff up
+ # Loop (until no more requests):
+ ## Get a request from #source
+ ## Pass that request to the fetcher
+ ##* The fetcher has a #get method,
+ ##* which stuffs the response contents into the request object
+ ## If the fetcher got a successful response,
+ ##* pass it along to the store
+
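The runner loop above can be sketched in a few lines of plain Ruby. This is an illustrative stand-in (TinyRunner and its collaborators are hypothetical names, not the real monkeyshines API): the source yields requests, the fetcher's #get fills them in, and only successful responses reach the store.

```ruby
# Illustrative sketch of the runner loop described above.
# TinyRunner, source, fetcher and dest are hypothetical stand-ins,
# not the actual monkeyshines classes.
class TinyRunner
  def initialize(source, fetcher, dest)
    @source, @fetcher, @dest = source, fetcher, dest
  end

  def run!
    @source.each do |request|        # get a request from the source
      result = @fetcher.get(request) # fetcher stuffs the response into it
      @dest.save(result) if result   # only successful responses get stored
    end
  end
end
```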
+ **Bulk URL Scraper**
+
+ # Open a file with URLs, one per line
+ # Loop until no more requests:
+ ## Get a simple_request from #source
+ ##* The source is a FlatFileStore;
+ ##* it generates simple_requests (objects of type SimpleRequest), each with a #url and attributes holding the contents, response_code and response_message.
+
+ ## Pass that request to an http_fetcher
+ ##* The fetcher has a #get method,
+ ##* which stuffs the body of the response -- basically, the HTML for the page -- into the request object's contents (and likewise for the response_code and response_message).
+
+ ## If the fetcher got a successful response,
+ ##* pass it to a flat_file_store,
+ ##* which just writes the request to disk, one line per request, tab-separated on the fields:
+ url moreinfo scraped_at response_code response_message contents
+
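That store step can be sketched concretely. Here SimpleRequest is a stand-in struct for illustration (the real class may carry more state); each request becomes one tab-separated line in the field order listed above.

```ruby
# Stand-in for monkeyshines' SimpleRequest -- illustrative only.
SimpleRequest = Struct.new(:url, :moreinfo, :scraped_at,
                           :response_code, :response_message, :contents)

# One line per request, fields joined by tabs, in the README's field order.
def to_tsv_line(req)
  [req.url, req.moreinfo, req.scraped_at,
   req.response_code, req.response_message, req.contents].join("\t")
end
```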
+ beanstalk == queue
+ ttserver == distributed lightweight DB
+ god == monitoring & restart
+ shotgun == runs sinatra for development
+ thin == runs sinatra for production
+
+ ~/ics/monkeyshines/examples/twitter_search == twitter search scraper
+ * the work directory holds everything generated: logs, output, dumps of the scrape queue
+ * ./dump_twitter_search_jobs.rb --handle=com.twitter.search --dest-filename=dump.tsv serializes the queue to a flat file in work/seed
+ * load_twitter_search_jobs.rb
+ * scrape_twitter_search.rb
+ * nohup ./scrape_twitter_search.rb --handle=com.twitter.search >> work/log/twitter_search-console-`date "+%Y%m%d%H%M%S"`.log 2>&1 &
+
+ * tail -f work/log/twitter_search-console-20091006.log (<-- replace the date with that of the latest run)
+
+ # the actual file being stored
+ * tail -f work/20091013/comtwittersearch+20091013164824-17240.tsv | cutc 150
+
+ h3. Request Source
+
+ * runner.source
+ ** a request stream
+ ** supplies the raw material to initialize a job
+
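A request source can be as little as an enumerable that turns raw lines into request objects. A hypothetical sketch (FileRequestSource is an illustrative name, not a real monkeyshines class):

```ruby
# Hypothetical request source: wraps lines of a flat file (one URL per
# line) and yields one bare request per line. Illustrative names only.
class FileRequestSource
  include Enumerable

  def initialize(lines)
    @lines = lines
  end

  def each
    @lines.each { |line| yield({ url: line.strip }) }
  end
end
```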
+
+ Twitter search scraper
+
  h2. Request Queue
 
  h3. Periodic requests
@@ -109,3 +196,21 @@ h4. Rescheduling
 
  Want the next scrape to return a couple of pages, or one mostly-full page. Need to track a rate (num_items / timespan), clamped to min_reschedule / max_reschedule bounds.
 
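That rate-based rule can be sketched as follows; items_per_page and the clamp bounds are illustrative parameters, not actual monkeyshines settings:

```ruby
# Sketch of the rescheduling rule above: estimate a rate from the last
# interval, aim the next delay at roughly one full page of new items,
# and clamp it to the allowed bounds. Parameter names are illustrative.
def next_delay(num_items, timespan, items_per_page:, min_reschedule:, max_reschedule:)
  rate = num_items.to_f / timespan                       # items per second
  target = rate.zero? ? max_reschedule : items_per_page / rate
  target.clamp(min_reschedule, max_reschedule)
end
```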
+
+ ---------------------------------------------------------------------------
+
+ <notextile><div class="toggle"></notextile>
+
+ h2. More info
+
+ There are many useful examples in the examples/ directory.
+
+ h3. Credits
+
+ Monkeyshines was written by "Philip (flip) Kromer":http://mrflip.com (flip@infochimps.org / "@mrflip":http://twitter.com/mrflip) for the "infochimps project":http://infochimps.org
+
+ h3. Help!
+
+ Send monkeyshines questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code
+
+ <notextile></div></notextile>
@@ -88,7 +88,7 @@ module Monkeyshines
  when Net::HTTPUnauthorized then sleep_time = 0 # 401 (protected user, probably)
  when Net::HTTPForbidden then sleep_time = 4 # 403 update limit
  when Net::HTTPNotFound then sleep_time = 0 # 404 deleted
- when Net::HTTPServiceUnavailable then sleep_time = 9 # 503 Fail Whale
+ when Net::HTTPServiceUnavailable then sleep_time = 15 # 503 Fail Whale
  when Net::HTTPServerError then sleep_time = 2 # 5xx All other server errors
  else sleep_time = 1
  end
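The hunk above raises the 503 backoff from 9 to 15 seconds. The dispatch leans on Net::HTTP's response-class hierarchy, so order matters: Net::HTTPServiceUnavailable must be matched before its parent class Net::HTTPServerError. A standalone illustration of the same table (not the actual Monkeyshines fetcher method):

```ruby
require 'net/http'

# Standalone illustration of the backoff table in the diff above:
# map a Net::HTTP response to a sleep time. The specific subclasses
# must be matched before the catch-all Net::HTTPServerError parent.
def backoff_for(response)
  case response
  when Net::HTTPUnauthorized       then 0   # 401 (protected user, probably)
  when Net::HTTPForbidden          then 4   # 403 update limit
  when Net::HTTPNotFound           then 0   # 404 deleted
  when Net::HTTPServiceUnavailable then 15  # 503 Fail Whale
  when Net::HTTPServerError        then 2   # 5xx all other server errors
  else                                  1
  end
end
```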
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: monkeyshines
  version: !ruby/object:Gem::Version
- version: 0.2.0
+ version: 0.2.1
  platform: ruby
  authors:
  - Philip (flip) Kromer
@@ -9,7 +9,7 @@ autorequire:
  bindir: bin
  cert_chain: []
 
- date: 2009-10-12 00:00:00 -05:00
+ date: 2009-11-02 00:00:00 -06:00
  default_executable:
  dependencies:
  - !ruby/object:Gem::Dependency