extraloop 0.0.7 → 0.0.8
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/History.txt +3 -0
- data/README.md +12 -12
- data/examples/mod_pay_data.rb +6 -6
- data/examples/wikipedia_categories.rb +0 -4
- data/lib/extraloop.rb +1 -1
- data/lib/extraloop/dom_extractor.rb +2 -1
- data/lib/extraloop/json_extractor.rb +1 -1
- data/lib/extraloop/scraper_base.rb +13 -6
- data/spec/scraper_base_spec.rb +28 -2
- metadata +32 -19
data/History.txt
CHANGED
data/README.md
CHANGED
@@ -1,14 +1,14 @@
 # Extra Loop
 
 A Ruby library for extracting structured data from websites and web based APIs.
-Supports most common document formats (i.e. HTML, XML, and JSON), and comes with a handy mechanism
+Supports most common document formats (i.e. HTML, XML, CSV, and JSON), and comes with a handy mechanism
 for iterating over paginated datasets.
 
-
+## Installation:
 
     gem install extraloop
 
-
+## Usage:
 
 A basic scraper that fetches the top 25 websites from [Alexa's daily top 100](www.alexa.com/topsites) list:
 
@@ -37,7 +37,7 @@ An iterative Scraper that fetches URL, title, and publisher from some 110 Google
     run()
 
 
-
+## Scraper initialisation signature
 
     #new(urls, scraper_options, http_options)
 
@@ -45,7 +45,7 @@ An iterative Scraper that fetches URL, title, and publisher from some 110 Google
 - __scraper_options__ - hash of scraper options (see below).
 - __http_options__ - hash of request options for `Typheous::Request#initialize` (see [API documentation](http://rubydoc.info/github/pauldix/typhoeus/master/Typhoeus/Request#initialize-instance_method) for details).
 
-
+### scraper options:
 
 * __format__ - Specifies the scraped document format; needed only if the Content-Type in the server response is not the correct one. Supported formats are: 'html', 'xml', 'json', and 'csv'.
 * __async__ - Specifies whether the scraper's HTTP requests should be run in parallel or in series (defaults to false). **Note:** currently only GET requests can be run asynchronously.
@@ -53,7 +53,7 @@ An iterative Scraper that fetches URL, title, and publisher from some 110 Google
 * __loglevel__ - a symbol specifying the desired log level (defaults to `:info`).
 * __appenders__ - a list of Logging.appenders object (defaults to `Logging.appenders.sterr`).
 
-
+## Extractors
 
 ExtraLoop allows to fetch structured data from online documents by looping through a list of elements matching a given selector.
 For each matched element, an arbitrary set of fields can be extracted. While the `loop_on` method sets up such loop, the `extract`
@@ -82,7 +82,7 @@ Both the `loop_on` and the `extract` methods may be called with a selector, a bl
     # extract a description text, separating paragraphs with newlines
     extract(:description, "div.description") { |node| node.css("p").map(&:text).join("\n") }
 
-
+### Extracting data from JSON Documents
 
 While processing an HTTP response, ExtraLoop tries to automatically detect the scraped document format by looking at
 the `ContentType` header sent by the server. This value can be overriden by providing a `:format` key in the scraper's
@@ -116,26 +116,26 @@ one argument, it will in fact try to fetch a hash value using the provided field
     # => "johndoe"
 
 
-
+## Iteration methods
 
 The `IterativeScraper` class comes with two methods that allow scrapers to loop over paginated content.
 
-
+### set\_iteration
 
 * __iteration_parameter__ - A symbol identifying the request parameter that the scraper will use as offset in order to iterate over the paginated content.
 * __array_or_range_or_block__ - Either an explicit set of values or a block of code. If provided, the block is called with the parsed document object as its first argument. The block should return a non empty array, which will determine the value of the offset parameter during each iteration. If the block fails to return a non empty array, the iteration stops.
 
-
+### continue\_with
 
 The second iteration method, `#continue_with`, allows to continue an interation as long as a block of code returns a truthy, non-nil value (to be assigned to the iteration parameter).
 
 * __iteration_parameter__ - the scraper' iteration parameter.
 * __&block__ - An arbitrary block of ruby code, its return value will be used to determine the value of the next iteration's offset parameter.
 
-
+## Running tests
 
 ExtraLoop uses `rspec` and `rr` as its testing framework. The test suite can be run by calling the `rspec` executable from within the `spec` directory:
 
    cd spec
    rspec *
-
+
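The README sections above document the `#new(urls, scraper_options, http_options)` signature and the `loop_on`/`extract`/`on` chain. A minimal sketch of how those pieces fit together, based only on the API described above; the URL, selectors, and field names here are illustrative, not taken from the package:

    require 'extraloop'

    results = []

    # urls + scraper_options, per the initialisation signature above.
    scraper = ExtraLoop::ScraperBase.new("http://example.com/topsites",
                                         :format => :html, :async => false)

    scraper.
      loop_on("li.site").                               # one iteration per matched node
      extract(:rank, "span.rank").                      # field from a child selector
      extract(:url, "a", :href).                        # field from a node attribute
      on(:data) { |records| results.concat(records) }.  # the "data" hook receives extracted records
      run

    results.each { |record| puts "#{record.rank} #{record.url}" }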
data/examples/mod_pay_data.rb
CHANGED
@@ -1,6 +1,5 @@
 #
-#
-#
+# Fetches name, job title, and actual pay ceiling from a CSV dataset listing UK Ministry of Defence's organogram and staff pay data
 # source: http://data.gov.uk/dataset/staff-organograms-and-pay-mod
 #
 
@@ -12,7 +11,8 @@ class ModPayScraper < ExtraLoop::ScraperBase
     dataset_url = "http://www.mod.uk/NR/rdonlyres/FF9761D8-2AB9-4CD4-88BC-983A46A0CD90/0/20111208CTLBOrganogramFinal7Useniordata.csv"
     super dataset_url, :format => :csv
 
-    # Select only
+    # Select only records of officers earning more than 100k per year
+
     loop_on do |rows|
       rows[1..-1].select { |row| row[14].to_i > 100000 }
     end
@@ -22,11 +22,11 @@ class ModPayScraper < ExtraLoop::ScraperBase
     extract :pay, 14
 
     on("data") do |records|
-
-
+      records.
+        sort { |r1, r2| r2.pay.to_i <=> r1.pay.to_i }.
+        each { |record| puts [record.pay, record.name].map { |string| string.ljust 7 }.join }
     end
   end
 end
 
-
 ModPayScraper.new.run
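In CSV mode the loop block receives the parsed rows as arrays: `rows[1..-1]` skips the header row, and `row[14]` is the pay-ceiling column of this particular dataset. A standalone sketch of the same filter on made-up data (not package code):

    # Fifteen columns per row; the pay ceiling sits at index 14 in this dataset.
    rows = [
      ["Name", "Title"] + [""] * 12 + ["Pay ceiling"],   # header row
      ["Jane Doe", "Director"] + [""] * 12 + ["120000"],
      ["John Roe", "Analyst"]  + [""] * 12 + ["45000"],
    ]

    senior = rows[1..-1].select { |row| row[14].to_i > 100000 }
    senior.each { |row| puts [row[14], row[0]].join(" ") }
    # => 120000 Jane Doe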
data/examples/wikipedia_categories.rb
CHANGED
@@ -18,10 +18,6 @@ params = {
 
 options = {
   :format => :json,
-  :log => {
-    :appenders => [Logging.appenders.stderr],
-    :log_level => :info
-  }
 }
 request_arguments = { :params => params, :headers => {
   "User-Agent" => "ExtraLoop - ruby data extraction toolkit: http://github.com/afiore/extraloop"
data/lib/extraloop.rb
CHANGED
data/lib/extraloop/dom_extractor.rb
CHANGED
@@ -32,7 +32,8 @@ module ExtraLoop
     def extract_list(input)
       nodes = (input.respond_to?(:document) ? input : parse(input))
       nodes = nodes.search(@selector) if @selector
-
+      nodes = nodes.css("*") unless @selector or @callback
+      @callback && Array(@environment.run(nodes, &@callback)) || nodes
     end
 
     def parse(input)
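The new `nodes.css("*")` fallback means a DOM extractor configured with neither a selector nor a callback now loops over every descendant element instead of nothing. A quick standalone Nokogiri illustration of that expression (not package code):

    require 'nokogiri'

    doc = Nokogiri::HTML("<ul><li><a href='/a'>a</a></li><li><a href='/b'>b</a></li></ul>")

    doc.search("a").size       # => 2 (with an explicit selector)
    doc.css("*").map(&:name)   # => ["html", "body", "ul", "li", "a", "li", "a"] (the fallback)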
data/lib/extraloop/json_extractor.rb
CHANGED
@@ -22,7 +22,7 @@ module ExtraLoop
     def extract_list(input)
       @environment.document = input = (input.is_a?(String) ? parse(input) : input)
       input = input.get_in(@path) if @path
-      @callback && @environment.run(input, &@callback) || input
+      Array(@callback && @environment.run(input, &@callback) || input)
     end
 
     def parse(input)
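Wrapping the result in `Array(...)` guarantees callers can always iterate over the return value, even when the JSON path resolves to a single value or the callback returns `nil`. The relevant `Kernel#Array` conversions:

    Array(nil)           # => []
    Array("one-match")   # => ["one-match"]
    Array([1, 2, 3])     # => [1, 2, 3]
    Array({"a" => 1})    # => [["a", 1]] (note: hashes become key/value pairs)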
data/lib/extraloop/scraper_base.rb
CHANGED
@@ -23,7 +23,7 @@ module ExtraLoop
 
     def initialize(urls, options = {}, arguments = {})
       @urls = Array(urls)
-      @loop_extractor_args =
+      @loop_extractor_args = []
       @extractor_args = []
       @loop = nil
 
@@ -61,8 +61,8 @@ module ExtraLoop
 
     def loop_on(*args, &block)
       args << block if block
-      #
-      @loop_extractor_args = args.insert(0, nil)
+      # prepend placeholder values for loop name and extraction environment
+      @loop_extractor_args = args.insert(0, nil, nil)
       self
     end
 
@@ -129,6 +129,9 @@ module ExtraLoop
       end
 
       log("queueing url: #{url}, params #{arguments[:params]}", :debug)
+
+
+
       @queued_count += 1
       @hydra.queue(request)
     end
@@ -147,6 +150,7 @@ module ExtraLoop
       log("response ##{@response_count} of #{@queued_count}, status code: [#{response.code}], URL fragment: ...#{response.effective_url.split('/').last if response.effective_url}")
 
       @loop.run
+
       @environment = @loop.environment
       run_hook(:data, [@loop.records, response])
       #TODO: add hock for scraper completion (useful in iterative scrapes).
@@ -159,7 +163,10 @@ module ExtraLoop
       extractor_classname = "#{format.to_s.capitalize}Extractor"
       extractor_class = ExtraLoop.const_defined?(extractor_classname) && ExtraLoop.const_get(extractor_classname) || DomExtractor
 
-
+
+      #replace empty placeholder with extraction environment
+      @loop_extractor_args[1] = ExtractionEnvironment.new(self)
+
       loop_extractor = extractor_class.new(*@loop_extractor_args)
 
       # There is no point in parsing response.body more than once, so we reuse
@@ -167,8 +174,8 @@ module ExtraLoop
 
       document = loop_extractor.parse(response.body)
 
-      extractors = @extractor_args.map do |
-        args.insert(1, ExtractionEnvironment.new(self, document))
+      extractors = @extractor_args.map do |original_args|
+        args = original_args.clone.insert(1, ExtractionEnvironment.new(self, document))
         extractor_class.new(*args)
       end
 
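Together, the empty-array default, the extra `nil` placeholder slot (later filled with the `ExtractionEnvironment`), and the cloned per-extractor argument lists let a scraper run without an explicit `loop_on`, which the new spec below exercises. A sketch of that usage; the URL is hypothetical:

    require 'extraloop'

    links = []

    # No loop_on: the DomExtractor fallback above loops over every node.
    ExtraLoop::ScraperBase.new("http://example.com/files").
      extract(:url, "a[href]", :href).
      on("data") { |records| links = records }.
      run

    puts links.map(&:url)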
data/spec/scraper_base_spec.rb
CHANGED
@@ -69,14 +69,15 @@ describe ScraperBase do
         loop_on("ul li.file a").
         extract(:url, :href).
         extract(:filename).
-
+        on(:data) { |records| results = records }.
+        run
 
       @results = results
+
     end
 
 
     it "Should handle response" do
-      @scraper.run
       @results.should_not be_empty
       @results.all? { |record| record.extracted_at && record.url && record.filename }.should be_true
     end
@@ -166,4 +167,29 @@ describe ScraperBase do
       end
     end
   end
+
+  context "no loop defined.." do
+    describe "#run", :thisonly => true do
+      before do
+        data = []
+        @url = "http://localhost/fixture"
+
+        stub_http({}, :body => @fixture_doc) do |hydra, request, response|
+          hydra.stub(:get, request.url).and_return(response)
+        end
+
+        (ScraperBase.new @url).
+          extract(:url, "a[href]", :href).
+          on("data") { |records| data = records }.
+          run
+
+        @data = data
+      end
+
+      it "should run and extract data" do
+        @data.should_not be_empty
+      end
+    end
+  end
+
 end
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: extraloop
 version: !ruby/object:Gem::Version
-  version: 0.0.7
+  version: 0.0.8
 prerelease:
 platform: ruby
 authors:
@@ -9,11 +9,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-
+date: 2012-03-27 00:00:00.000000000Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: yajl-ruby
-  requirement: &
+  requirement: &13625500 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -21,10 +21,10 @@ dependencies:
         version: 1.1.0
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *13625500
 - !ruby/object:Gem::Dependency
   name: nokogiri
-  requirement: &
+  requirement: &13624900 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -32,10 +32,10 @@ dependencies:
         version: 1.5.0
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *13624900
 - !ruby/object:Gem::Dependency
   name: typhoeus
-  requirement: &
+  requirement: &13624340 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -43,10 +43,10 @@ dependencies:
         version: 0.3.2
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *13624340
 - !ruby/object:Gem::Dependency
   name: logging
-  requirement: &
+  requirement: &13623540 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -54,10 +54,10 @@ dependencies:
         version: 0.6.1
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *13623540
 - !ruby/object:Gem::Dependency
   name: rspec
-  requirement: &
+  requirement: &13622600 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -65,10 +65,10 @@ dependencies:
         version: 2.7.0
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *13622600
 - !ruby/object:Gem::Dependency
   name: rr
-  requirement: &
+  requirement: &13620140 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -76,10 +76,10 @@ dependencies:
         version: 1.0.4
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *13620140
 - !ruby/object:Gem::Dependency
   name: pry-nav
-  requirement: &
+  requirement: &13619680 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -87,10 +87,21 @@ dependencies:
         version: 0.1.0
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *13619680
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: &13619180 !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: 0.9.2.2
+  type: :development
+  prerelease: false
+  version_requirements: *13619180
 description: A Ruby library for extracting data from websites and web based APIs.
-  Supports most common document formats (i.e. HTML, XML, and JSON), and comes
-  a handy mechanism for iterating over paginated datasets.
+  Supports most common document formats (i.e. HTML, XML, CSV, and JSON), and comes
+  with a handy mechanism for iterating over paginated datasets.
 email: andrea.giulio.fiore@googlemail.com
 executables: []
 extensions: []
@@ -140,6 +151,9 @@ required_ruby_version: !ruby/object:Gem::Requirement
   - - ! '>='
     - !ruby/object:Gem::Version
       version: '0'
+      segments:
+      - 0
+      hash: 2441039337834275619
 required_rubygems_version: !ruby/object:Gem::Requirement
   none: false
   requirements:
@@ -153,4 +167,3 @@ signing_key:
 specification_version: 2
 summary: A toolkit for online data extraction.
 test_files: []
-has_rdoc: