monkeyshines 0.2.0 → 0.2.1
- data/README.textile +121 -16
- data/lib/monkeyshines/fetcher/http_fetcher.rb +1 -1
- metadata +2 -2
data/README.textile
CHANGED
@@ -4,34 +4,121 @@ It's designed to handle large-scale scrapes that may exceed the capabilities of

h2. Install

-
+** "Main Install and Setup Documentation":http://mrflip.github.com/monkeyshines/INSTALL.html **

-
-* http://github.com/mrflip/wuclan
-* http://github.com/mrflip/wukong
-* http://github.com/mrflip/monkeyshines (this repo)
+h3. Get the code

-
+We're still actively developing monkeyshines. The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/monkeyshines

-
+pre. $ git clone git://github.com/mrflip/monkeyshines

-
-* extlib (0.9.12)
-* htmlentities (4.2.0)
+A gem is available from "gemcutter:":http://gemcutter.org/gems/monkeyshines

-
+pre. $ sudo gem install monkeyshines --source=http://gemcutter.org

-
-* json-jruby (1.1.7)
+(don't use the gems.github.com version -- it's way out of date.)

-
+You can instead download this project in either "zip":http://github.com/mrflip/monkeyshines/zipball/master or "tar":http://github.com/mrflip/monkeyshines/tarball/master formats.

-
+h3. Dependencies and setup

-
+To finish setting up, see the "detailed setup instructions":http://mrflip.github.com/monkeyshines/INSTALL.html and then read the "usage notes":http://mrflip.github.com/monkeyshines/usage.html

---------------------------------------------------------------------------

+h2. Overview
+
+h3. Runner
+
+* Constructed via a Builder pattern
+* Does the running itself:
+
+# Set stuff up
+# Loop until there are no more requests:
+## Get a request from the #source
+## Pass that request to the fetcher
+##* The fetcher has a #get method,
+##* which stuffs the response contents into the request object
+## If the fetcher got a successful response, pass it along to the store
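
The loop above, as a toy Ruby sketch (hypothetical names throughout -- this is the shape of the thing, not the gem's actual API):

pre. # source, fetcher and store are duck-typed collaborators
class TinyRunner
  def initialize(source, fetcher, store)
    @source, @fetcher, @store = source, fetcher, store
  end
  def run!
    while (req = @source.get)     # 1. get a request from the source
      @fetcher.get(req)           # 2. the fetcher stuffs the response into the request
      @store.save(req) if req.response_code.to_s =~ /\A2/   # 3. keep it only on success
    end
  end
end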
+
+**Bulk URL Scraper**
+
+# Open a file with URLs, one per line
+# Loop until there are no more requests:
+## Get a simple_request from the #source
+##* The source is a FlatFileStore;
+##* it generates simple_requests (objects of type SimpleRequest), each with a #url and attributes holding the contents, response_code and response_message.
+## Pass that request to an http_fetcher
+##* The fetcher has a #get method,
+##* which stuffs the body of the response -- basically, the HTML for the page -- into the request object's contents (and likewise for the response_code and response_message).
+## If the fetcher got a successful response,
+##* pass it to a flat_file_store,
+##* which just writes the request to disk, one line per request, as tab-separated fields:
+
+pre. url  moreinfo  scraped_at  response_code  response_message  contents
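
Spelled out with nothing but the Ruby standard library, the whole recipe is roughly this (the file names are placeholders, and it skips the gem's own fetcher and store classes):

pre. require 'net/http'
require 'uri'
File.open('work/scraped.tsv', 'a') do |dump|
  File.foreach('urls.txt') do |line|                 # one URL per line
    url = line.chomp
    next if url.empty?
    resp = Net::HTTP.get_response(URI.parse(url))    # the fetcher's #get
    next unless resp.is_a?(Net::HTTPSuccess)         # store successful responses only
    flat = resp.body.gsub(/[\t\r\n]+/, ' ')          # keep the record on one line
    dump.puts [url, '', Time.now.utc.strftime('%Y%m%d%H%M%S'),
               resp.code, resp.message, flat].join("\t")
  end
end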
+
+beanstalk == queue
+ttserver == distributed lightweight DB
+god == monitoring & restart
+shotgun == runs sinatra for development
+thin == runs sinatra for production
+
+~/ics/monkeyshines/examples/twitter_search == twitter search scraper
+
+* the work directory holds everything generated: logs, output, dumps of the scrape queue
+* ./dump_twitter_search_jobs.rb --handle=com.twitter.search --dest-filename=dump.tsv serializes the queue to a flat file in work/seed
+* load_twitter_search_jobs.rb
+* scrape_twitter_search.rb, typically run as
+
+pre. $ nohup ./scrape_twitter_search.rb --handle=com.twitter.search >> work/log/twitter_search-console-`date "+%Y%m%d%H%M%S"`.log 2>&1 &
+
+* tail -f work/log/twitter_search-console-20091006.log follows the console log (replace the date with that of the latest run)
+* tail -f work/20091013/comtwittersearch+20091013164824-17240.tsv | cutc 150 follows the actual file being stored
+
+h3. Request Source
+
+* runner.source
+** request stream
+** Supplies the raw material to initialize a job; for instance:
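
A source can be as small as this sketch (hypothetical class names; all the runner needs is a #get that returns nil when the stream runs dry):

pre. SimpleRequest = Struct.new(:url, :moreinfo, :scraped_at, :response_code, :response_message, :contents)
class FlatFileSource
  def initialize(filename)
    @file = File.open(filename)
  end
  def get                            # hand the runner one request at a time
    line = @file.gets or return nil  # nil signals "no more requests"
    SimpleRequest.new(line.chomp)
  end
end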
+
+Twitter search scraper
+

h2. Request Queue

h3. Periodic requests
@@ -109,3 +196,21 @@ h4. Rescheduling
Want to time the next scrape so that it yields a couple of pages, or one mostly-full page. Need to track a rate (num_items / timespan) and clamp the resulting interval to min_reschedule / max_reschedule bounds.
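
As arithmetic, that's one division and a clamp (the constants are stand-ins, not the gem's defaults):

pre. MIN_RESCHEDULE = 60            # seconds
MAX_RESCHEDULE = 24 * 60 * 60
def next_delay(num_items, timespan, items_per_page)
  return MAX_RESCHEDULE if num_items.zero?   # nothing new last time: back way off
  rate  = num_items.to_f / timespan          # observed items per second
  delay = items_per_page / rate              # time to accumulate ~one full page
  [[delay, MIN_RESCHEDULE].max, MAX_RESCHEDULE].min
end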

+
+---------------------------------------------------------------------------
+
+<notextile><div class="toggle"></notextile>
+
+h2. More info
+
+There are many useful examples in the examples/ directory.
+
+h3. Credits
+
+Monkeyshines was written by "Philip (flip) Kromer":http://mrflip.com (flip@infochimps.org / "@mrflip":http://twitter.com/mrflip) for the "infochimps project":http://infochimps.org
+
+h3. Help!
+
+Send monkeyshines questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code
+
+<notextile></div></notextile>
data/lib/monkeyshines/fetcher/http_fetcher.rb
CHANGED
@@ -88,7 +88,7 @@ module Monkeyshines
when Net::HTTPUnauthorized       then sleep_time = 0   # 401 (protected user, probably)
when Net::HTTPForbidden          then sleep_time = 4   # 403 update limit
when Net::HTTPNotFound           then sleep_time = 0   # 404 deleted
-when Net::HTTPServiceUnavailable then sleep_time =
+when Net::HTTPServiceUnavailable then sleep_time = 15  # 503 Fail Whale
when Net::HTTPServerError        then sleep_time = 2   # 5xx All other server errors
else                                  sleep_time = 1
end
metadata
CHANGED
@@ -1,7 +1,7 @@
--- !ruby/object:Gem::Specification
name: monkeyshines
version: !ruby/object:Gem::Version
-  version: 0.2.0
+  version: 0.2.1
platform: ruby
authors:
- Philip (flip) Kromer
@@ -9,7 +9,7 @@ autorequire:
bindir: bin
cert_chain: []

-date: 2009-
+date: 2009-11-02 00:00:00 -06:00
default_executable:
dependencies:
- !ruby/object:Gem::Dependency