monkeyshines 0.2.0 → 0.2.1
- data/README.textile +121 -16
- data/lib/monkeyshines/fetcher/http_fetcher.rb +1 -1
- metadata +2 -2
data/README.textile
CHANGED

@@ -4,34 +4,121 @@ It's designed to handle large-scale scrapes that may exceed the capabilities of
 
 h2. Install
 
-
+** "Main Install and Setup Documentation":http://mrflip.github.com/monkeyshines/INSTALL.html **
 
-
-* http://github.com/mrflip/wuclan
-* http://github.com/mrflip/wukong
-* http://github.com/mrflip/monkeyshines (this repo)
+h3. Get the code
 
-
+We're still actively developing monkeyshines. The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/monkeyshines
 
-
+pre. $ git clone git://github.com/mrflip/monkeyshines
 
-
-* extlib (0.9.12)
-* htmlentities (4.2.0)
+A gem is available from "gemcutter:":http://gemcutter.org/gems/monkeyshines
 
-
+pre. $ sudo gem install monkeyshines --source=http://gemcutter.org
 
-
-* json-jruby (1.1.7)
+(don't use the gems.github.com version -- it's way out of date.)
 
-
+You can instead download this project in either "zip":http://github.com/mrflip/monkeyshines/zipball/master or "tar":http://github.com/mrflip/monkeyshines/tarball/master formats.
 
-
+h3. Dependencies and setup
 
-
+To finish setting up, see the "detailed setup instructions":http://mrflip.github.com/monkeyshines/INSTALL.html and then read the "usage notes":http://mrflip.github.com/monkeyshines/usage.html
 
 ---------------------------------------------------------------------------
 
+h2. Overview
+
+h3. Runner
+
+* Builder Pattern to construct
+* does the running itself
+*
+
+# Set stuff up
+# Loop. (Until no more requests)
+## Get a request from #source
+## Pass that request to the fetcher
+##* The fetcher has a #get method,
+##* which stuffs the response contents into the request object
+## if the fetcher has a successful response,
+##
+
+**Bulk URL Scraper**
+
+# Open a file with URLs, one per line
+# Loop until no more requests:
+## Get a simple_request from #source
+##* The source is a FlatFileStore;
+##* It generates simple_request (objects of type SimpleRequest): has a #url and
+an attribute holding (contents, response_code, response_message).
+
+## Pass that request to an http_fetcher
+##* The fetcher has a #get method,
+##* which stuffs the body of the response -- basically, the HTML for the page -- request object's contents. (and so on for the response_code and response_message).
+
+## if the fetcher has a successful response,
+##* Pass it to a flat_file_store
+##* which just wrtes the request to disk, one line per request, tab separated on fields.
+url moreinfo scraped_at response_code response_message contents
+
+beanstalk == queue
+ttserver == distributed lightweight DB
+god == monitoring & restart
+shotgun == runs sinatra for development
+thin == runs sinatra for production
+
+~/ics/monkeyshines/examples/twitter_search == twitter search scraper
+* work directory holds everything generated: logs, output, dumps of the scrape queue
+* ./dump_twitter_search_jobs.rb --handle=com.twitter.search --dest-filename=dump.tsv
+serializes the queue to a flat file in work/seed
+load_twitter_search_jobs.rb*
+scrape_twitter_search.rb
+* nohup ./scrape_twitter_search.rb --handle=com.twitter.search >> work/log/twitter_search-console-`date "+%Y%m%d%M%H%S`.log 2>&1 &
+
+
+* tail -f work/log/twitter_search-console-20091006.log (<-- replace date with latest run)
+
+# the acutal file being stored
+* tail -f work/20091013/comtwittersearch+20091013164824-17240.tsv | cutc 150
+
+
+h3. Request Source
+
+
+* runner.source
+** request stream
+** Supplies raw material to initialize a job
+
+
+
+
+
+
+
+
+
+
+
+
+
+Twitter search scraper
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
 h2. Request Queue
 
 h3. Periodic requests

@@ -109,3 +196,21 @@ h4. Rescheduling
 
 Want to perform next scrape to give a couple pages or a mostly-full page. Need to track a rate (num_items / timespan), clamped to a min_reschedule / max_reschedule bounds.
+
+---------------------------------------------------------------------------
+
+<notextile><div class="toggle"></notextile>
+
+h2. More info
+
+There are many useful examples in the examples/ directory.
+
+h3. Credits
+
+Monkeyshines was written by "Philip (flip) Kromer":http://mrflip.com (flip@infochimps.org / "@mrflip":http://twitter.com/mrflip) for the "infochimps project":http://infochimps.org
+
+h3. Help!
+
+Send monkeyshines questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code
+
+<notextile></div></notextile>
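The "Bulk URL Scraper" walkthrough added in this README hunk can be sketched in plain Ruby. SimpleRequest, FlatFileStore, and the fetcher's #get method are named in the README; everything else below (FakeFetcher, the reduced field list, the scrape helper) is a simplified stand-in, not the actual monkeyshines API:

```ruby
require 'stringio'

# Sketch of the bulk URL scraper loop the README describes.
# SimpleRequest and FlatFileStore are simplified stand-ins.
SimpleRequest = Struct.new(:url, :response_code, :response_message, :contents)

class FlatFileStore
  def initialize(io)
    @io = io
  end

  # One line per request, tab-separated fields (the real store also
  # records moreinfo and scraped_at).
  def save(req)
    @io.puts [req.url, req.response_code, req.response_message, req.contents].join("\t")
  end
end

class FakeFetcher
  # #get stuffs the response body, code and message into the request;
  # a real fetcher would perform the HTTP round trip here.
  def get(req)
    req.response_code, req.response_message, req.contents = 200, 'OK', '<html>...</html>'
    req
  end
end

# Open a file of URLs (one per line), fetch each, store successes.
def scrape(urls_io, fetcher, store)
  urls_io.each_line do |line|
    req = fetcher.get(SimpleRequest.new(line.strip))
    store.save(req) if req.response_code == 200
  end
end
```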
data/lib/monkeyshines/fetcher/http_fetcher.rb
CHANGED

@@ -88,7 +88,7 @@ module Monkeyshines
 when Net::HTTPUnauthorized then sleep_time = 0 # 401 (protected user, probably)
 when Net::HTTPForbidden then sleep_time = 4 # 403 update limit
 when Net::HTTPNotFound then sleep_time = 0 # 404 deleted
-when Net::HTTPServiceUnavailable then sleep_time =
+when Net::HTTPServiceUnavailable then sleep_time = 15 # 503 Fail Whale
 when Net::HTTPServerError then sleep_time = 2 # 5xx All other server errors
 else sleep_time = 1
 end
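One detail worth noting in this hunk: Ruby's `case` tests `when` clauses top to bottom with `Class#===`, and Net::HTTPServiceUnavailable is a subclass of Net::HTTPServerError, so the new 503 branch only fires because it precedes the generic 5xx branch. A standalone sketch of the same dispatch (sleep values copied from the diff):

```ruby
require 'net/http'

# Back-off dispatch mirroring the http_fetcher.rb case statement.
# The more specific Net::HTTPServiceUnavailable (503) must come
# before its parent class Net::HTTPServerError (generic 5xx),
# or the 503 case would never be reached.
def sleep_time_for(response)
  case response
  when Net::HTTPUnauthorized       then 0  # 401 (protected user, probably)
  when Net::HTTPForbidden          then 4  # 403 update limit
  when Net::HTTPNotFound           then 0  # 404 deleted
  when Net::HTTPServiceUnavailable then 15 # 503 Fail Whale
  when Net::HTTPServerError        then 2  # 5xx all other server errors
  else 1
  end
end
```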
metadata
CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: monkeyshines
 version: !ruby/object:Gem::Version
-  version: 0.2.0
+  version: 0.2.1
 platform: ruby
 authors:
 - Philip (flip) Kromer

@@ -9,7 +9,7 @@ autorequire:
 bindir: bin
 cert_chain: []
 
-date: 2009-
+date: 2009-11-02 00:00:00 -06:00
 default_executable:
 dependencies:
 - !ruby/object:Gem::Dependency
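The rescheduling rule sketched in the README hunk above (track a rate of num_items / timespan and clamp the next-visit delay to min_reschedule / max_reschedule bounds) might look like the following; the method name, the items_per_page target, and all default values are hypothetical, not monkeyshines API:

```ruby
# Hedged sketch of the README's rescheduling rule: estimate an item
# rate, aim the next scrape at roughly one full page of new items,
# and clamp the resulting delay to [min_reschedule, max_reschedule].
def next_delay(num_items, timespan, items_per_page: 100,
               min_reschedule: 60, max_reschedule: 24 * 60 * 60)
  rate  = num_items.to_f / timespan                      # items per second
  delay = rate.zero? ? max_reschedule : items_per_page / rate
  delay.clamp(min_reschedule, max_reschedule)            # seconds until next scrape
end
```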