extraloop 0.0.7 → 0.0.8
- data/History.txt +3 -0
- data/README.md +12 -12
- data/examples/mod_pay_data.rb +6 -6
- data/examples/wikipedia_categories.rb +0 -4
- data/lib/extraloop.rb +1 -1
- data/lib/extraloop/dom_extractor.rb +2 -1
- data/lib/extraloop/json_extractor.rb +1 -1
- data/lib/extraloop/scraper_base.rb +13 -6
- data/spec/scraper_base_spec.rb +28 -2
- metadata +32 -19
data/History.txt
CHANGED
data/README.md
CHANGED
@@ -1,14 +1,14 @@
 # Extra Loop
 
 A Ruby library for extracting structured data from websites and web based APIs.
-Supports most common document formats (i.e. HTML, XML, and JSON), and comes with a handy mechanism
+Supports most common document formats (i.e. HTML, XML, CSV, and JSON), and comes with a handy mechanism
 for iterating over paginated datasets.
 
-
+## Installation:
 
 gem install extraloop
 
-
+## Usage:
 
 A basic scraper that fetches the top 25 websites from [Alexa's daily top 100](www.alexa.com/topsites) list:
 
@@ -37,7 +37,7 @@ An iterative Scraper that fetches URL, title, and publisher from some 110 Google
 run()
 
 
-
+## Scraper initialisation signature
 
 #new(urls, scraper_options, http_options)
 
@@ -45,7 +45,7 @@ An iterative Scraper that fetches URL, title, and publisher from some 110 Google
 - __scraper_options__ - hash of scraper options (see below).
 - __http_options__ - hash of request options for `Typheous::Request#initialize` (see [API documentation](http://rubydoc.info/github/pauldix/typhoeus/master/Typhoeus/Request#initialize-instance_method) for details).
 
-
+### scraper options:
 
 * __format__ - Specifies the scraped document format; needed only if the Content-Type in the server response is not the correct one. Supported formats are: 'html', 'xml', 'json', and 'csv'.
 * __async__ - Specifies whether the scraper's HTTP requests should be run in parallel or in series (defaults to false). **Note:** currently only GET requests can be run asynchronously.
@@ -53,7 +53,7 @@ An iterative Scraper that fetches URL, title, and publisher from some 110 Google
 * __loglevel__ - a symbol specifying the desired log level (defaults to `:info`).
 * __appenders__ - a list of Logging.appenders object (defaults to `Logging.appenders.sterr`).
 
-
+## Extractors
 
 ExtraLoop allows to fetch structured data from online documents by looping through a list of elements matching a given selector.
 For each matched element, an arbitrary set of fields can be extracted. While the `loop_on` method sets up such loop, the `extract`
@@ -82,7 +82,7 @@ Both the `loop_on` and the `extract` methods may be called with a selector, a bl
 # extract a description text, separating paragraphs with newlines
 extract(:description, "div.description") { |node| node.css("p").map(&:text).join("\n") }
 
-
+### Extracting data from JSON Documents
 
 While processing an HTTP response, ExtraLoop tries to automatically detect the scraped document format by looking at
 the `ContentType` header sent by the server. This value can be overriden by providing a `:format` key in the scraper's
@@ -116,26 +116,26 @@ one argument, it will in fact try to fetch a hash value using the provided field
 # => "johndoe"
 
 
-
+## Iteration methods
 
 The `IterativeScraper` class comes with two methods that allow scrapers to loop over paginated content.
 
-
+### set\_iteration
 
 * __iteration_parameter__ - A symbol identifying the request parameter that the scraper will use as offset in order to iterate over the paginated content.
 * __array_or_range_or_block__ - Either an explicit set of values or a block of code. If provided, the block is called with the parsed document object as its first argument. The block should return a non empty array, which will determine the value of the offset parameter during each iteration. If the block fails to return a non empty array, the iteration stops.
 
-
+### continue\_with
 
 The second iteration method, `#continue_with`, allows to continue an interation as long as a block of code returns a truthy, non-nil value (to be assigned to the iteration parameter).
 
 * __iteration_parameter__ - the scraper' iteration parameter.
 * __&block__ - An arbitrary block of ruby code, its return value will be used to determine the value of the next iteration's offset parameter.
 
-
+## Running tests
 
 ExtraLoop uses `rspec` and `rr` as its testing framework. The test suite can be run by calling the `rspec` executable from within the `spec` directory:
 
 cd spec
 rspec *
-
+
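The `continue_with` contract documented in the README changes above can be sketched in plain Ruby. This is a schematic illustration only, not the gem's internals: `iterate` and the numeric offsets are made-up stand-ins for the scraper's request loop.

```ruby
# Schematic sketch of the continue_with iteration contract:
# keep iterating while the block returns a truthy, non-nil value,
# which becomes the next value of the iteration parameter.
def iterate(first_offset)
  offsets = []
  offset = first_offset
  while offset
    offsets << offset
    # the block returns the next offset, or nil to stop the iteration
    offset = yield(offset)
  end
  offsets
end

# stop once the offset reaches 30: pages 10, 20, 30 are visited
visited = iterate(10) { |offset| offset < 30 ? offset + 10 : nil }
```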
data/examples/mod_pay_data.rb
CHANGED
@@ -1,6 +1,5 @@
 #
-#
-#
+# Fetches name, job title, and actual pay ceiling from a CSV dataset listing UK Ministry of Defence's organogram and staff pay data
 # source: http://data.gov.uk/dataset/staff-organograms-and-pay-mod
 #
 
@@ -12,7 +11,8 @@ class ModPayScraper < ExtraLoop::ScraperBase
     dataset_url = "http://www.mod.uk/NR/rdonlyres/FF9761D8-2AB9-4CD4-88BC-983A46A0CD90/0/20111208CTLBOrganogramFinal7Useniordata.csv"
     super dataset_url, :format => :csv
 
-    # Select only
+    # Select only records of officers earning more than 100k per year
+
     loop_on do |rows|
       rows[1..-1].select { |row| row[14].to_i > 100000 }
     end
@@ -22,11 +22,11 @@ class ModPayScraper < ExtraLoop::ScraperBase
     extract :pay, 14
 
     on("data") do |records|
-
-
+      records.
+        sort { |r1, r2| r2.pay.to_i <=> r1.pay.to_i }.
+        each { |record| puts [record.pay, record.name].map { |string| string.ljust 7 }.join }
     end
   end
 end
 
-
 ModPayScraper.new.run
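The filter/sort/format pipeline in the example above can be exercised against made-up rows. The `Record` struct, the sample data, and the padding rows here are illustrative stand-ins; only the use of column 14 as the pay ceiling comes from the script itself.

```ruby
# Made-up sample rows mimicking the CSV layout: a header row
# followed by data rows with the pay ceiling in column 14.
Record = Struct.new(:name, :pay)

rows = [
  ["Name",        *([""] * 13), "Actual Pay Ceiling"],  # header, dropped by rows[1..-1]
  ["A. Officer",  *([""] * 13), "120000"],
  ["B. Clerk",    *([""] * 13), "45000"],
  ["C. Director", *([""] * 13), "150000"],
]

# keep only records above the 100k threshold, as in loop_on
records = rows[1..-1].
  select { |row| row[14].to_i > 100000 }.
  map    { |row| Record.new(row[0], row[14]) }

# sort descending by pay and format each line, as in the "data" hook
lines = records.
  sort { |r1, r2| r2.pay.to_i <=> r1.pay.to_i }.
  map  { |record| [record.pay, record.name].map { |string| string.ljust 7 }.join }

puts lines
```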
data/examples/wikipedia_categories.rb
CHANGED
@@ -18,10 +18,6 @@ params = {
 
 options = {
   :format => :json,
-  :log => {
-    :appenders => [Logging.appenders.stderr],
-    :log_level => :info
-  }
 }
 request_arguments = { :params => params, :headers => {
   "User-Agent" => "ExtraLoop - ruby data extraction toolkit: http://github.com/afiore/extraloop"
data/lib/extraloop.rb
CHANGED
data/lib/extraloop/dom_extractor.rb
CHANGED
@@ -32,7 +32,8 @@ module ExtraLoop
   def extract_list(input)
     nodes = (input.respond_to?(:document) ? input : parse(input))
     nodes = nodes.search(@selector) if @selector
-
+    nodes = nodes.css("*") unless @selector or @callback
+    @callback && Array(@environment.run(nodes, &@callback)) || nodes
   end
 
   def parse(input)
data/lib/extraloop/json_extractor.rb
CHANGED
@@ -22,7 +22,7 @@ module ExtraLoop
   def extract_list(input)
     @environment.document = input = (input.is_a?(String) ? parse(input) : input)
     input = input.get_in(@path) if @path
-    @callback && @environment.run(input, &@callback) || input
+    Array(@callback && @environment.run(input, &@callback) || input)
   end
 
   def parse(input)
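The `Array(...)` wrapper introduced in `extract_list` leans on `Kernel#Array`'s normalisation rules, which plain Ruby shows directly: the extractor now always returns an array, whatever the callback produced.

```ruby
# Kernel#Array, as relied on by the new extract_list return value:
# nil becomes [], a single value is wrapped, an array is left as-is.
results = [
  Array(nil),            # no callback result and no match
  Array("johndoe"),      # a single extracted value
  Array(%w[a b c]),      # an already-array result
]
```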
data/lib/extraloop/scraper_base.rb
CHANGED
@@ -23,7 +23,7 @@ module ExtraLoop
 
   def initialize(urls, options = {}, arguments = {})
     @urls = Array(urls)
-    @loop_extractor_args =
+    @loop_extractor_args = []
     @extractor_args = []
     @loop = nil
 
@@ -61,8 +61,8 @@ module ExtraLoop
 
   def loop_on(*args, &block)
     args << block if block
-    #
-    @loop_extractor_args = args.insert(0, nil)
+    # prepend placeholder values for loop name and extraction environment
+    @loop_extractor_args = args.insert(0, nil, nil)
     self
   end
 
@@ -129,6 +129,9 @@ module ExtraLoop
     end
 
     log("queueing url: #{url}, params #{arguments[:params]}", :debug)
+
+
+
     @queued_count += 1
     @hydra.queue(request)
   end
@@ -147,6 +150,7 @@ module ExtraLoop
     log("response ##{@response_count} of #{@queued_count}, status code: [#{response.code}], URL fragment: ...#{response.effective_url.split('/').last if response.effective_url}")
 
     @loop.run
+
     @environment = @loop.environment
     run_hook(:data, [@loop.records, response])
     #TODO: add hock for scraper completion (useful in iterative scrapes).
@@ -159,7 +163,10 @@ module ExtraLoop
     extractor_classname = "#{format.to_s.capitalize}Extractor"
     extractor_class = ExtraLoop.const_defined?(extractor_classname) && ExtraLoop.const_get(extractor_classname) || DomExtractor
 
-
+
+    #replace empty placeholder with extraction environment
+    @loop_extractor_args[1] = ExtractionEnvironment.new(self)
+
     loop_extractor = extractor_class.new(*@loop_extractor_args)
 
     # There is no point in parsing response.body more than once, so we reuse
@@ -167,8 +174,8 @@ module ExtraLoop
 
     document = loop_extractor.parse(response.body)
 
-    extractors = @extractor_args.map do |
-      args.insert(1, ExtractionEnvironment.new(self, document))
+    extractors = @extractor_args.map do |original_args|
+      args = original_args.clone.insert(1, ExtractionEnvironment.new(self, document))
      extractor_class.new(*args)
     end
 
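The switch from one placeholder to two in `loop_on` is plain `Array#insert` behaviour; a small sketch (the selector value and the `:environment` symbol are illustrative stand-ins for the real loop-name and `ExtractionEnvironment` slots):

```ruby
# What loop_on now does with its arguments: reserve slots 0 and 1
# for the loop name and the extraction environment.
args = ["ul li.file a"]                  # a selector passed to loop_on
loop_extractor_args = args.insert(0, nil, nil)

# later, before building the loop extractor, slot 1 is filled in
loop_extractor_args[1] = :environment    # stand-in for ExtractionEnvironment.new(self)
```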
data/spec/scraper_base_spec.rb
CHANGED
@@ -69,14 +69,15 @@ describe ScraperBase do
       loop_on("ul li.file a").
       extract(:url, :href).
       extract(:filename).
-
+      on(:data) { |records| results = records }.
+      run
 
     @results = results
+
   end
 
 
   it "Should handle response" do
-    @scraper.run
     @results.should_not be_empty
     @results.all? { |record| record.extracted_at && record.url && record.filename }.should be_true
   end
@@ -166,4 +167,29 @@ describe ScraperBase do
     end
   end
 end
+
+context "no loop defined.." do
+  describe "#run", :thisonly => true do
+    before do
+      data = []
+      @url = "http://localhost/fixture"
+
+      stub_http({}, :body => @fixture_doc) do |hydra, request, response|
+        hydra.stub(:get, request.url).and_return(response)
+      end
+
+      (ScraperBase.new @url).
+        extract(:url, "a[href]", :href).
+        on("data") { |records| data = records }.
+        run
+
+      @data = data
+    end
+
+    it "should run and extract data" do
+      @data.should_not be_empty
+    end
+  end
+end
+
 end
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: extraloop
 version: !ruby/object:Gem::Version
-  version: 0.0.7
+  version: 0.0.8
 prerelease:
 platform: ruby
 authors:
@@ -9,11 +9,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-
+date: 2012-03-27 00:00:00.000000000Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: yajl-ruby
-  requirement: &
+  requirement: &13625500 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -21,10 +21,10 @@ dependencies:
       version: 1.1.0
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *13625500
 - !ruby/object:Gem::Dependency
   name: nokogiri
-  requirement: &
+  requirement: &13624900 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -32,10 +32,10 @@ dependencies:
       version: 1.5.0
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *13624900
 - !ruby/object:Gem::Dependency
   name: typhoeus
-  requirement: &
+  requirement: &13624340 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -43,10 +43,10 @@ dependencies:
       version: 0.3.2
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *13624340
 - !ruby/object:Gem::Dependency
   name: logging
-  requirement: &
+  requirement: &13623540 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -54,10 +54,10 @@ dependencies:
       version: 0.6.1
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *13623540
 - !ruby/object:Gem::Dependency
   name: rspec
-  requirement: &
+  requirement: &13622600 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -65,10 +65,10 @@ dependencies:
       version: 2.7.0
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *13622600
 - !ruby/object:Gem::Dependency
   name: rr
-  requirement: &
+  requirement: &13620140 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -76,10 +76,10 @@ dependencies:
       version: 1.0.4
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *13620140
 - !ruby/object:Gem::Dependency
   name: pry-nav
-  requirement: &
+  requirement: &13619680 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -87,10 +87,21 @@ dependencies:
       version: 0.1.0
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *13619680
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: &13619180 !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: 0.9.2.2
+  type: :development
+  prerelease: false
+  version_requirements: *13619180
 description: A Ruby library for extracting data from websites and web based APIs.
-  Supports most common document formats (i.e. HTML, XML, and JSON), and comes
-  a handy mechanism for iterating over paginated datasets.
+  Supports most common document formats (i.e. HTML, XML, CSV, and JSON), and comes
+  with a handy mechanism for iterating over paginated datasets.
 email: andrea.giulio.fiore@googlemail.com
 executables: []
 extensions: []
@@ -140,6 +151,9 @@ required_ruby_version: !ruby/object:Gem::Requirement
   - - ! '>='
     - !ruby/object:Gem::Version
       version: '0'
+      segments:
+      - 0
+      hash: 2441039337834275619
 required_rubygems_version: !ruby/object:Gem::Requirement
   none: false
   requirements:
@@ -153,4 +167,3 @@ signing_key:
 specification_version: 2
 summary: A toolkit for online data extraction.
 test_files: []
-has_rdoc:
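The `&13625500`/`*13625500` pairs in the metadata diff are YAML anchors and aliases: `requirement` defines an anchored node and `version_requirements` references the very same object. A minimal sketch in plain Ruby (the hash content below is made up; only the anchor/alias mechanism is the point):

```ruby
require "yaml"

# An anchored node (&) and an alias to it (*), as in the gemspec
# metadata: both keys end up pointing at one shared object.
text = <<~YAML
  requirement: &13625500
    operator: "~>"
    version: 1.1.0
  version_requirements: *13625500
YAML

# Psych 4 forbids aliases in YAML.load by default; fall back for older Rubies.
doc = YAML.respond_to?(:unsafe_load) ? YAML.unsafe_load(text) : YAML.load(text)
```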