extraloop 0.0.3 → 0.0.4
- data/History.txt +3 -0
- data/README.md +28 -21
- data/examples/wikipedia_categories_recoursive.rb +57 -0
- data/lib/extraloop/extraction_environment.rb +1 -0
- data/lib/extraloop/extraction_loop.rb +1 -1
- data/lib/extraloop/iterative_scraper.rb +1 -1
- data/spec/extraction_loop_spec.rb +5 -0
- metadata +17 -16
data/History.txt
CHANGED
data/README.md
CHANGED
@@ -20,9 +20,7 @@ A basic scraper that fetches the top 25 websites from Alexa's daily top 100 list

     extract(:site_name, "h2").
     extract(:url, "h2 a").
     extract(:description, ".description").
-    on(:data, proc { |data|
-      results = data
-    }).
+    on(:data, proc { |data| results = data }).
     run()

 An Iterative Scraper that fetches URL, title, and publisher from some 110 Google News articles mentioning the keyword 'Egypt'.
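The collapsed `on(:data, ...)` handler above follows a simple callback contract: a proc is registered under an event name and later called with the extracted records. A plain-Ruby sketch of that contract (illustrative only, not ExtraLoop's internals):

```ruby
# Register a proc under the :data event, then fire it with some records.
# The hash-of-arrays layout is an assumption made for this sketch.
hooks = Hash.new { |hash, key| hash[key] = [] }
hooks[:data] << proc { |data| @results = data }

records = [{ site_name: 'example.com', url: 'http://example.com' }]
hooks[:data].each { |hook| hook.call(records) }

puts @results.first[:site_name]  # prints "example.com"
```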
@@ -57,8 +55,9 @@ An Iterative Scraper that fetches URL, title, and publisher from some 110 Google

 ### Extractors

-ExtraLoop allows to fetch structured data from online documents by looping through a list of elements matching a given selector.
-
+ExtraLoop allows to fetch structured data from online documents by looping through a list of elements matching a given selector.
+For each matched element, an arbitrary set of fields can be extracted. While the `loop_on` method sets up such a loop, the `extract`
+method extracts a piece of information from an element (e.g. a story's title) and stores it into a record's field.

     # using a CSS3 or an XPath selector
     loop_on('div.post')
@@ -67,12 +66,12 @@ ExtraLoop allows to fetch structured data from online documents by looping throu

     loop_on(proc { |doc| doc.search('div.post') })

-    # using both a selector and a proc (
+    # using both a selector and a proc (matched elements are passed in as the first argument of the proc)

-    loop_on('div.post', proc { |posts| posts.reject {|post| post.attr(:class) == 'sticky' }})
+    loop_on('div.post', proc { |posts| posts.reject { |post| post.attr(:class) == 'sticky' }})

 Both the `loop_on` and the `extract` methods may be called with a selector, a proc or a combination of the two. By default, when parsing DOM documents, `extract` will call
-`Nokogiri::XML::Node#text()`. Alternatively, `extract`
+`Nokogiri::XML::Node#text()`. Alternatively, `extract` also accepts an attribute name or a proc as a third argument; this will be evaluated in the context of the matching element.

     # extract a story's title
     extract(:title, 'h3')
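The three extraction styles just described (default text, attribute name, proc) can be illustrated in plain Ruby with a hypothetical `FakeNode` stand-in for a matched DOM node; this is a sketch, not a real Nokogiri node or ExtraLoop code:

```ruby
# Hypothetical stand-in for a matched DOM node; only #text and #attr are
# modelled, mirroring what the extract examples above rely on.
FakeNode = Struct.new(:text, :attributes) do
  def attr(name)
    attributes[name]
  end
end

node = FakeNode.new('Story title', { href: '/story/1' })

title   = node.text                              # default: the node's text
url     = node.attr(:href)                       # attribute name as third argument
upcased = proc { |n| n.text.upcase }.call(node)  # proc evaluated against the node
```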
@@ -87,21 +86,25 @@ Both the `loop_on` and the `extract` methods may be called with a selector, a pr

 #### Extracting from JSON Documents

-While processing each HTTP response, ExtraLoop tries to automatically detect the scraped document format by looking at
-
-
+While processing each HTTP response, ExtraLoop tries to automatically detect the scraped document format by looking at
+the `ContentType` header sent by the server (this value may be overridden by providing a `:format` key in the scraper's
+initialization options). When the format is JSON, the document is parsed using the `yajl` parser and converted into a hash.
+In this case, both the `loop_on` and the `extract` methods still behave as illustrated above, with the sole exception
+of the CSS3/XPath selector string, which is specific to DOM documents.
+
+When working with JSON data, you can just use a proc and return the document elements you want to loop on.

     # Fetch a portion of a document using a proc
     loop_on(proc { |data| data['query']['categorymembers'] })

-Alternatively, the same loop can be defined by passing an array of
+Alternatively, the same loop can be defined by passing an array of keys pointing at a value located
+at several levels of depth down into the parsed document hash.

     # Fetch the same document portion above using a hash path
     loop_on(['query', 'categorymembers'])

-
-
-field name as a key if no key path or key string is provided.
+When fetching fields from a portion of a JSON document, `extract` will use the
+field name as a hash key if no key path or key string is provided.

     # current node:
     #
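The hash-path form of `loop_on` can be sketched in plain Ruby: the key array is walked one level at a time into the parsed document hash. The `resolve_path` helper below is illustrative only, not the library's actual implementation:

```ruby
# A parsed JSON document as yajl would produce it (sample data).
parsed = {
  'query' => {
    'categorymembers' => [
      { 'title' => 'Category:Italian newspapers', 'ns' => 14 },
      { 'title' => 'RAI', 'ns' => 0 }
    ]
  }
}

# Walk the key path one level down per key, returning nil if any level is
# missing; this is what loop_on(['query', 'categorymembers']) amounts to.
def resolve_path(document, path)
  path.reduce(document) { |node, key| node && node[key] }
end

members = resolve_path(parsed, ['query', 'categorymembers'])
members.each { |member| puts member['title'] }
```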
@@ -110,26 +113,30 @@ field name as a key if no key path or key string is provided.

     # 'text' => 'bla bla bla',
     # 'from_user_id'..
     # }
-

-    extract(:from_user)
+    # >> extract(:from_user)
     # => "johndoe"


 ### Iteration methods:

-The `IterativeScraper` class comes with two methods for defining how a scraper should loop over paginated content.
-
+The `IterativeScraper` class comes with two methods for defining how a scraper should loop over paginated content.

-
+_set_iteration(iteration_parameter, array_or_range_or_proc)_

 * `iteration_parameter` - A symbol identifying the request parameter that the scraper will use as offset in order to iterate over the paginated content.
 * `array_or_range_or_proc` - Either an explicit set of values or a block of code. If provided, the block is called with the parsed document as its first argument. Its return value is then used to shift, at each iteration, the value of the iteration parameter. If the block fails to return a non-empty array, the iteration stops.

 The second iteration method, `#continue_with`, allows iteration to continue until an arbitrary block of code returns a positive, non-nil value.

-
+_continue_with(iteration_parameter, block)_

 * `iteration_parameter` - the scraper's iteration parameter.
 * `block` - An arbitrary block of Ruby code; its return value will be used to determine the value of the next iteration's offset parameter.

+### Running tests
+
+ExtraLoop uses `rspec` and `rr` as its testing framework. The test suite can be run by calling the `rspec` executable from within the `spec` directory:
+
+    cd spec
+    rspec *
data/examples/wikipedia_categories_recoursive.rb
ADDED
@@ -0,0 +1,57 @@
+require '../lib/extraloop'
+require 'pry'
+
+class WikipediaCategoryScraper < ExtraLoop::IterativeScraper
+  attr_accessor :members
+  attr_reader :request_arguments
+
+  baseurl = "http://en.wikipedia.org"
+  endpoint_url = "/w/api.php"
+  @@api_url = baseurl + endpoint_url
+
+  def initialize(category, depth=2, parent=nil)
+
+    @members = []
+    @parent = parent
+
+    params = {
+      :action => 'query',
+      :list => 'categorymembers',
+      :format => 'json',
+      :cmtitle => "Category:#{category.gsub(/^Category\:/,'')}",
+      :cmlimit => "500",
+      :cmtype => 'page|subcat',
+      :cmdir => 'asc',
+      :cmprop => 'ids|title|type|timestamp'
+    }
+    options = {
+      :depth => depth,
+      :format => :json,
+      :log => false
+    }
+    request_arguments = { :params => params }
+
+    super(@@api_url, options, request_arguments)
+
+    loop_on(['query', 'categorymembers']).
+      extract(:title).
+      extract(:ns).
+      extract(:type).
+      extract(:timestamp).
+      on(:data, proc { |results|
+        puts "#{"\t" * (@options[:depth] - 2).abs } #{@scraper.request_arguments[:params][:cmtitle]}"
+        categories = results.select { |record| record.ns == 14 }.each { |category| results.delete(category) }
+
+
+        categories.each do |record|
+          # Instantiate a sub-scraper if the current depth is greater than zero and the category member is a subcategory.
+          WikipediaCategoryScraper.new(record.title, @options[:depth] - 1, @scraper.request_arguments[:params][:cmtitle]).run unless @options[:depth] <= 0
+        end
+
+      }).
+      continue_with(:cmcontinue, ['query-continue', 'categorymembers', 'cmcontinue'])
+  end
+end
+
+
+WikipediaCategoryScraper.new("Italian_media").run
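For reference, the MediaWiki request the example issues can be previewed by encoding a params hash like the one above with Ruby's standard library; this sketch is independent of ExtraLoop and uses a trimmed-down params set:

```ruby
require 'uri'

# Subset of the example's request parameters, for illustration.
params = {
  :action  => 'query',
  :list    => 'categorymembers',
  :format  => 'json',
  :cmtitle => 'Category:Italian_media',
  :cmlimit => '500'
}

# Build the query string the scraper's HTTP layer would send.
url = "http://en.wikipedia.org/w/api.php?" + URI.encode_www_form(params)
```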
data/lib/extraloop/extraction_loop.rb
CHANGED
@@ -15,7 +15,7 @@ module ExtraLoop

       @document = @loop_extractor.parse(document)
       @records = []
       @hooks = hooks
-      @environment = ExtractionEnvironment.new(
+      @environment = ExtractionEnvironment.new(scraper, @document, @records)
       self
     end

data/spec/extraction_loop_spec.rb
CHANGED
@@ -31,6 +31,10 @@ describe ExtractionLoop

   describe "run" do
     before(:each) do

+      @fake_scraper = Object.new
+      stub(@fake_scraper).options {{}}
+      stub(@fake_scraper).results { }
+
       @extractors = [:a, :b].map do |field_name|
         object = Object.new
         stub(object).extract_field { |node, record| node[field_name] }
@@ -55,6 +59,7 @@ describe ExtractionLoop

       mock(env).run.with_any_args.times(20 + 2)
     end

+
       @extraction_loop = ExtractionLoop.new(@loop_extractor, @extractors, "fake document", hooks, @fake_scraper).run
     end

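The new spec setup uses `rr`'s `stub` to build a fake scraper that answers the two messages `ExtractionLoop` now sends it (`#options` and `#results`). The same effect can be sketched in plain Ruby with singleton methods, shown here only to clarify what the stubs provide:

```ruby
# Minimal stand-in: an object that responds to #options with an empty hash
# and to #results with nil, matching the rr stubs in the spec.
fake_scraper = Object.new

def fake_scraper.options
  {}
end

def fake_scraper.results
  nil
end
```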
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: extraloop
 version: !ruby/object:Gem::Version
-  version: 0.0.3
+  version: 0.0.4
 prerelease:
 platform: ruby
 authors:
@@ -9,11 +9,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-01-
+date: 2012-01-14 00:00:00.000000000Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: yajl-ruby
-  requirement: &
+  requirement: &10100300 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -21,10 +21,10 @@ dependencies:
     version: 1.1.0
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *10100300
 - !ruby/object:Gem::Dependency
   name: nokogiri
-  requirement: &
+  requirement: &10098200 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -32,10 +32,10 @@ dependencies:
     version: 1.5.0
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *10098200
 - !ruby/object:Gem::Dependency
   name: typhoeus
-  requirement: &
+  requirement: &10095680 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -43,10 +43,10 @@ dependencies:
     version: 0.3.2
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *10095680
 - !ruby/object:Gem::Dependency
   name: logging
-  requirement: &
+  requirement: &10094320 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -54,10 +54,10 @@ dependencies:
     version: 0.6.1
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *10094320
 - !ruby/object:Gem::Dependency
   name: rspec
-  requirement: &
+  requirement: &10092700 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -65,10 +65,10 @@ dependencies:
     version: 2.7.1
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *10092700
 - !ruby/object:Gem::Dependency
   name: rr
-  requirement: &
+  requirement: &10089100 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -76,10 +76,10 @@ dependencies:
     version: 1.0.4
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *10089100
 - !ruby/object:Gem::Dependency
   name: pry
-  requirement: &
+  requirement: &10088040 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -87,7 +87,7 @@ dependencies:
     version: 0.9.7.4
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *10088040
 description: A Ruby library for extracting data from websites and web based APIs.
   Supports most common document formats (i.e. HTML, XML, and JSON), and comes with
   a handy mechanism for iterating over paginated datasets.
@@ -100,6 +100,7 @@ files:
 - README.md
 - examples/google_news_scraper.rb
 - examples/wikipedia_categories.rb
+- examples/wikipedia_categories_recoursive.rb
 - lib/extraloop.rb
 - lib/extraloop/dom_extractor.rb
 - lib/extraloop/extraction_environment.rb