extraloop 0.0.4 → 0.0.5
- data/History.txt +3 -0
- data/README.md +59 -59
- data/examples/google_news_scraper.rb +5 -6
- data/lib/extraloop/hookable.rb +1 -2
- data/lib/extraloop/iterative_scraper.rb +2 -2
- data/lib/extraloop/scraper_base.rb +6 -3
- data/spec/iterative_scraper_spec.rb +2 -2
- data/spec/scraper_base_spec.rb +4 -10
- metadata +17 -17
data/History.txt
CHANGED
data/README.md
CHANGED
@@ -8,70 +8,70 @@ for iterating over paginated datasets.

     gem install extraloop

-###
+### Sample scrapers:

-A basic scraper that fetches the top 25 websites from Alexa's daily top 100 list:
+A basic scraper that fetches the top 25 websites from [Alexa's daily top 100](www.alexa.com/topsites) list:

-
-
-    ExtraLoop::ScraperBase.
+    alexa_scraper = ExtraLoop::ScraperBase.
       new("http://www.alexa.com/topsites").
       loop_on("li.site-listing").
       extract(:site_name, "h2").
       extract(:url, "h2 a").
       extract(:description, ".description").
-      on(:data
-      run()
+      on(:data) { |data| data.each { |record| puts record.site_name } }

-
+    alexa_scraper.run
+
+An iterative scraper that fetches URL, title, and publisher from some 110 Google News articles mentioning the keyword _'Egypt'_.

     results = []

     ExtraLoop::IterativeScraper.
-
-
-
-
-
-
-
-
+      new("https://www.google.com/search?tbm=nws&q=Egypt").
+      set_iteration(:start, (1..101).step(10)).
+      loop_on("h3") { |nodes| nodes.map(&:parent) }.
+      extract(:title, "h3.r a").
+      extract(:url, "h3.r a", :href).
+      extract(:source, "br") { |node| node.next.text.split("-").first }.
+      on(:data) { |data, response| data.each { |record| results << record } }.
+      run()


-###
+### Scraper initialisation signature

-    #new(urls, scraper_options,
+    #new(urls, scraper_options, http_options)

-
-
-
+- __urls__ - a single URL, or an array of URLs.
+- __scraper_options__ - a hash of scraper options (see below).
+- __http_options__ - a hash of request options for `Typhoeus::Request#initialize` (see the [API documentation](http://rubydoc.info/github/pauldix/typhoeus/master/Typhoeus/Request#initialize-instance_method) for details).

-####
-
-*
-*
-
-*
+#### scraper options:
+
+* __format__ - Specifies the scraped document format (valid values are :html, :xml, :json).
+* __async__ - Specifies whether the scraper's HTTP requests should be run in parallel or in series (defaults to false). **Note:** currently only GET requests can be run asynchronously.
+* __log__ - Logging options hash:
+    * __loglevel__ - a symbol specifying the desired log level (defaults to `:info`).
+    * __appenders__ - a list of Logging.appenders objects (defaults to `Logging.appenders.stderr`).

### Extractors

ExtraLoop allows you to fetch structured data from online documents by looping through a list of elements matching a given selector.
For each matched element, an arbitrary set of fields can be extracted. While the `loop_on` method sets up such a loop, the `extract`
-method extracts a piece of information from an element (e.g. a story's title) and
+method extracts a specific piece of information from an element (e.g. a story's title) and stores it into a record's field.

-    # using a CSS3 or
+    # looping over a set of document elements using a CSS3 (or XPath) selector
     loop_on('div.post')

-    #
+    # looping using a block of code
-    loop_on
+    loop_on { |doc| doc.search('div.post') }

-    # using both a selector and a proc (matched
+    # using both a selector and a proc (the matched element list is passed in to the proc as its first argument)

-    loop_on('div.post'
+    loop_on('div.post') { |posts| posts.reject { |post| post.attr(:class) == 'sticky' } }

-Both the `loop_on` and the `extract` methods may be called with a selector, a
-`Nokogiri::XML::Node#text()`. Alternatively, `extract` also accepts an attribute name
+Both the `loop_on` and the `extract` methods may be called with a selector, a block, or a combination of the two. By default, when parsing DOM documents, `extract` will call
+`Nokogiri::XML::Node#text()`. Alternatively, `extract` also accepts an attribute name and a block; the block is evaluated in the context of the current iteration element.

     # extract a story's title
     extract(:title, 'h3')
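As a quick illustration of the `#new(urls, scraper_options, http_options)` signature documented above, here is a minimal sketch; the option values and the `:params` request option passed through to Typhoeus are illustrative choices, not taken from the gem's own examples:

    scraper = ExtraLoop::ScraperBase.new(
      "http://www.alexa.com/topsites",
      { :format => :html,                                # force the parser format
        :async  => false,
        :log    => { :log_level => :debug,
                     :appenders => [Logging.appenders.stderr] } },
      { :params => { :hl => "en" } }                     # http_options, forwarded to Typhoeus::Request#initialize
    )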
@@ -80,34 +80,32 @@ Both the `loop_on` and the `extract` methods may be called with a selector, a pr
     extract(:url, "a.link-to-story", :href)

     # extract a description text, separating paragraphs with newlines
-    extract(:description, "div.description"
-      node.css("p").map(&:text).join("\n")
-    })
+    extract(:description, "div.description") { |node| node.css("p").map(&:text).join("\n") }

#### Extracting from JSON Documents

While processing each HTTP response, ExtraLoop tries to automatically detect the scraped document format by looking at
-the `ContentType` header sent by the server
-initialization options
-In this case, both the `loop_on` and the `extract` methods still behave as illustrated above,
-
+the `ContentType` header sent by the server. This value can be overridden by providing a `:format` key in the scraper's
+initialization options. When the format is JSON, the document is parsed using the `yajl` JSON parser and converted into a hash.
+In this case, both the `loop_on` and the `extract` methods still behave as illustrated above, except for the
+CSS3/XPath selectors, which are specific to DOM documents.

-When working with JSON data, you can just use a
+When working with JSON data, you can just use a block and have it return the document elements you want to loop on.

     # Fetch a portion of a document using a proc
-    loop_on
+    loop_on { |data| data['query']['categorymembers'] }

-Alternatively, the same loop can be defined by passing an array of keys pointing at a value located
-at several levels of depth down into the parsed document
+Alternatively, the same loop can be defined by passing an array of keys pointing at a hash value located
+at several levels of depth down into the parsed document structure.

     # Fetch the same document portion above using a hash path
     loop_on(['query', 'categorymembers'])

-When fetching fields from a
-
+When fetching fields from a JSON document fragment, `extract` will often not need a block or an array of keys. If called with only
+one argument, it will in fact try to fetch a hash value using the provided field name as key.

     # current node:
-    #
+    #
     # {
     #    'from_user' => "johndoe",
     #    'text' => 'bla bla bla',
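Putting the JSON pieces together, a rough sketch of a scraper that combines a hash path with single-argument `extract` calls; the Wikipedia API URL and the `:title`/`:pageid` field names are illustrative assumptions, while the `['query', 'categorymembers']` path mirrors the example above:

    members = []

    ExtraLoop::ScraperBase.
      new("http://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Ruby&format=json").
      loop_on(['query', 'categorymembers']).
      extract(:title).                        # fetched as a hash key of the current node
      extract(:pageid).
      on(:data) { |data| data.each { |record| members << record } }.
      run()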
@@ -118,25 +116,27 @@ field name as a hash key if no key path or key string is provided.
     # => "johndoe"


-### Iteration methods
+### Iteration methods
+
+The `IterativeScraper` class comes with two methods that allow scrapers to loop over paginated content.
+
+    set_iteration(iteration_parameter, array_range_or_block)

-
+* __iteration_parameter__ - A symbol identifying the request parameter that the scraper will use as offset in order to iterate over the paginated content.
+* __array_or_range_or_block__ - Either an explicit set of values or a block of code. If provided, the block is called with the parsed document object as its first argument. The block should return a non empty array, which will determine the value of the offset parameter during each iteration. If the block fails to return a non empty array, the iteration stops.

-_set_iteration(iteration_parameter, array_range_or_proc)_
-
-
-* `array_or_range_or_proc` - Either an explicit set of values or a block of code. If provided, the block is called with the parsed document as its first argument. Its return value is then used to shift, at each iteration, the value of the iteration parameter. If the block fails to return a non empty array, the iteration stops.
+The second iteration method, `#continue_with`, allows the scraper to continue iterating until an arbitrary block of code returns a positive, non-nil value (to be assigned to the iteration parameter).

-
+    continue_with(iteration_parameter, &block)

-
+* __iteration_parameter__ - the scraper's iteration parameter.
+* __&block__ - An arbitrary block of ruby code; its return value will be used to determine the value of the next iteration's offset parameter.

-* `iteration_parameter` - the scraper' iteration parameter.
-* `block` - An arbitrary block of ruby code, its return value will be used to determine the value of the next iteration's offset parameter.

### Running tests

ExtraLoop uses `rspec` and `rr` as its testing framework. The test suite can be run by calling the `rspec` executable from within the `spec` directory:

-cd spec
+    cd spec
     rspec *
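For the block form of `set_iteration`, a sketch along the lines of the spec suite (which passes the block in as a proc argument); the URL, selectors, and the `:page` parameter name are illustrative:

    titles = []

    ExtraLoop::IterativeScraper.
      new("http://example.com/search").
      set_iteration(:page, proc { |doc| doc.search("a.page-link").map(&:text) }).  # must return a non empty array
      loop_on("div.result").
      extract(:title, "h3").
      on(:data) { |data| data.each { |record| titles << record.title } }.
      run()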
data/examples/google_news_scraper.rb
CHANGED
@@ -5,16 +5,15 @@ results = []
ExtraLoop::IterativeScraper.new("https://www.google.com/search?tbm=nws&q=Egypt", :log => {
  :log_level => :debug,
  :appenders => [Logging.appenders.stderr ]
+
}).set_iteration(:start, (1..101).step(10)).
-  loop_on("h3"
+  loop_on("h3") { |nodes| nodes.map(&:parent) }.
  extract(:title, "h3.r a").
  extract(:url, "h3.r a", :href).
-  extract(:source, "br"
-
-  }).
-  on(:data, proc { |data, response|
+  extract(:source, "br") { |node| node.next.text.split("-").first }.
+  on(:data) do |data, response|
    data.each { |record| results << record }
-
+  end.
  run()

results.each_with_index do |record, index|
data/lib/extraloop/hookable.rb
CHANGED
@@ -6,9 +6,8 @@ module ExtraLoop
      end
    end

-    def set_hook(hookname, handler)
+    def set_hook(hookname, &handler)
      @hooks ||= {}
-      raise Exceptions::HookArgumentError.new "handler must be a callable proc" unless handler.respond_to?(:call)
      @hooks[hookname.to_sym] ? @hooks[hookname.to_sym].push(handler) : @hooks[hookname.to_sym] = [handler]
      self
    end
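With `set_hook` now taking its handler as a block argument, handlers can be attached either with a literal block or by converting a proc with `&`, as the updated specs further down do. A minimal sketch (the `scraper` variable and the record's `title` field are assumed, following the README examples):

    # 0.0.4 style: handler passed as a positional proc
    # scraper.set_hook(:data, proc { |records| records.each { |record| puts record.title } })

    # 0.0.5 style: handler passed as a block
    scraper.set_hook(:data) { |records| records.each { |record| puts record.title } }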
data/lib/extraloop/iterative_scraper.rb
CHANGED
@@ -209,7 +209,7 @@ module ExtraLoop
    def update_request_params!
      offset = @iteration_set.at(@iteration_count) || default_offset
      @request_arguments[:params] ||= {}
-      @request_arguments[:params][@iteration_param.to_sym] = offset
+      @request_arguments[:params][@iteration_param.to_sym] = offset if @iteration_count > 0
    end


@@ -246,7 +246,7 @@ module ExtraLoop

    def issue_request(url)
      # remove continue argument if this is the first iteration
-
+      #@request_arguments[:params].delete(@iteration_param.to_sym) if @continue_clause_args && @iteration_count == 0
      super(url)
      # clear previous value of iteration parameter
      @request_arguments[:params].delete(@iteration_param.to_sym) if @request_arguments[:params] && @request_arguments[:params].any?
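Taken together, these changes mean the iteration parameter is only appended to the request parameters from the second iteration onwards. A rough illustration of the resulting requests for a scraper iterating on `:p` over pages 1..4 (the host is illustrative), which is also why the spec below now expects `@params_sent` to equal `[nil, "2", "3", "4"]`:

    # 1st request: GET http://example.com/search        (no :p parameter)
    # 2nd request: GET http://example.com/search?p=2
    # 3rd request: GET http://example.com/search?p=3
    # 4th request: GET http://example.com/search?p=4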
data/lib/extraloop/scraper_base.rb
CHANGED
@@ -52,13 +52,15 @@ module ExtraLoop
    #
    #
    # selector  - The CSS3 selector identifying the node list over which iterate (optional).
-    # callback  - A block of code (optional).
    # attribute - An attribute name (optional).
    #
+    # callback  - A block of code (optional).
+    #
    # Returns itself.
    #

-    def loop_on(*args)
+    def loop_on(*args, &block)
+      args << block if block
      @loop_extractor_args = args.insert(0, nil, ExtractionEnvironment.new(self))
      self
    end
@@ -75,7 +77,8 @@ module ExtraLoop
    #
    #

-    def extract(*args)
+    def extract(*args, &block)
+      args << block if block
      @extractor_args << args.insert(1, ExtractionEnvironment.new(self))
      self
    end
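Since `loop_on` and `extract` now capture a literal block and append it to their argument list, the README snippets above can be written directly as:

    scraper.loop_on("div.post") { |posts| posts.reject { |post| post.attr(:class) == 'sticky' } }
    scraper.extract(:description, "div.description") { |node| node.css("p").map(&:text).join("\n") }

Here `scraper` stands for any `ScraperBase` instance; passing a proc with `&`, as the specs do, still works.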
data/spec/iterative_scraper_spec.rb
CHANGED
@@ -78,7 +78,7 @@ describe IterativeScraper do
      new("http://whatever.net/search-stuff").
      set_iteration(:p, iteration_proc).
      loop_on(".whatever").
-      set_hook(:data, proc { iteration_count += 1 }).
+      set_hook(:data, &proc { iteration_count += 1 }).
      run()

    @iteration_count = iteration_count
@@ -90,7 +90,7 @@ describe IterativeScraper do
    end

    it "should have sent p=1, p=2, p=3, p=4 as request parameters" do
-      @params_sent.should eql([
+      @params_sent.should eql([nil, "2", "3", "4"])
    end
  end
end
data/spec/scraper_base_spec.rb
CHANGED
@@ -24,16 +24,10 @@ describe ScraperBase do
  end

  describe "#set_hook" do
-    subject { @scraper.set_hook(:after, proc {}) }
+    subject { @scraper.set_hook(:after, &proc {}) }
    it { should be_an_instance_of(ScraperBase) }
  end

-  describe "#set_hook" do
-    it "should raise exception if no proc is provided" do
-      expect { @scraper.set_hook(:after, :method) }.to raise_exception(ScraperBase::Exceptions::HookArgumentError)
-    end
-  end
-
  context "request params in both the url and the arguments hash" do
    describe "#run" do
      before do
@@ -76,7 +70,7 @@ describe ScraperBase do
        loop_on("ul li.file a").
        extract(:url, :href).
        extract(:filename).
-        set_hook(:data, proc { |records| records.each { |record| results << record }})
+        set_hook(:data, &proc { |records| records.each { |record| results << record }})

        @results = results
      end
@@ -110,7 +104,7 @@ describe ScraperBase do
        loop_on("ul li.file a").
        extract(:url, :href).
        extract(:filename).
-        set_hook(:data, proc { |records| records.each { |record| results << record } })
+        set_hook(:data, &proc { |records| records.each { |record| results << record } })

        @results = results

@@ -154,7 +148,7 @@ describe ScraperBase do
        loop_on("ul li.file a").
        extract(:url, :href).
        extract(:filename).
-        set_hook(:data, proc { |records| records.each { |record| results << record } })
+        set_hook(:data, &proc { |records| records.each { |record| results << record } })

        @results = results

metadata
CHANGED
@@ -1,7 +1,7 @@
--- !ruby/object:Gem::Specification
name: extraloop
version: !ruby/object:Gem::Version
-  version: 0.0.
+  version: 0.0.5
prerelease:
platform: ruby
authors:
@@ -9,11 +9,11 @@ authors:
autorequire:
bindir: bin
cert_chain: []
-date: 2012-01-
+date: 2012-01-28 00:00:00.000000000Z
dependencies:
- !ruby/object:Gem::Dependency
  name: yajl-ruby
-  requirement: &
+  requirement: &19726880 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ~>
@@ -21,10 +21,10 @@ dependencies:
        version: 1.1.0
  type: :runtime
  prerelease: false
-  version_requirements: *
+  version_requirements: *19726880
- !ruby/object:Gem::Dependency
  name: nokogiri
-  requirement: &
+  requirement: &19726420 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ~>
@@ -32,10 +32,10 @@ dependencies:
        version: 1.5.0
  type: :runtime
  prerelease: false
-  version_requirements: *
+  version_requirements: *19726420
- !ruby/object:Gem::Dependency
  name: typhoeus
-  requirement: &
+  requirement: &19725960 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ~>
@@ -43,10 +43,10 @@ dependencies:
        version: 0.3.2
  type: :runtime
  prerelease: false
-  version_requirements: *
+  version_requirements: *19725960
- !ruby/object:Gem::Dependency
  name: logging
-  requirement: &
+  requirement: &19725500 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ~>
@@ -54,21 +54,21 @@ dependencies:
        version: 0.6.1
  type: :runtime
  prerelease: false
-  version_requirements: *
+  version_requirements: *19725500
- !ruby/object:Gem::Dependency
  name: rspec
-  requirement: &
+  requirement: &19725040 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ~>
    - !ruby/object:Gem::Version
-      version: 2.7.
+        version: 2.7.0
  type: :development
  prerelease: false
-  version_requirements: *
+  version_requirements: *19725040
- !ruby/object:Gem::Dependency
  name: rr
-  requirement: &
+  requirement: &19724580 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ~>
@@ -76,10 +76,10 @@ dependencies:
        version: 1.0.4
  type: :development
  prerelease: false
-  version_requirements: *
+  version_requirements: *19724580
- !ruby/object:Gem::Dependency
  name: pry
-  requirement: &
+  requirement: &19724060 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ~>
@@ -87,7 +87,7 @@ dependencies:
        version: 0.9.7.4
  type: :development
  prerelease: false
-  version_requirements: *
+  version_requirements: *19724060
description: A Ruby library for extracting data from websites and web based APIs.
  Supports most common document formats (i.e. HTML, XML, and JSON), and comes with
  a handy mechanism for iterating over paginated datasets.
|