extraloop 0.0.4 → 0.0.5

data/History.txt CHANGED
@@ -1,3 +1,6 @@
+ == 0.0.5 / 2011-01-14
+ * Refactored #extract, #loop_on, and #set_hook to make a more idiomatic use of Ruby blocks
+
  == 0.0.4 / 2011-01-14
  * fixed a bug which prevented subclassing `IterativeScraper` instances
 
data/README.md CHANGED
@@ -8,70 +8,70 @@ for iterating over paginated datasets.
 
  gem install extraloop
 
- ### Usage:
+ ### Sample scrapers:
 
- A basic scraper that fetches the top 25 websites from Alexa's daily top 100 list:
+ A basic scraper that fetches the top 25 websites from [Alexa's daily top 100](http://www.alexa.com/topsites) list:
 
- results = nil
-
- ExtraLoop::ScraperBase.
+ alexa_scraper = ExtraLoop::ScraperBase.
  new("http://www.alexa.com/topsites").
  loop_on("li.site-listing").
  extract(:site_name, "h2").
  extract(:url, "h2 a").
  extract(:description, ".description").
- on(:data, proc { |data| results = data }).
- run()
+ on(:data) { |data| data.each { |record| puts record.site_name } }
 
- An Iterative Scraper that fetches URL, title, and publisher from some 110 Google News articles mentioning the keyword 'Egypt'.
+ alexa_scraper.run
+
+ An iterative scraper that fetches URL, title, and publisher from some 110 Google News articles mentioning the keyword _'Egypt'_.
 
  results = []
 
  ExtraLoop::IterativeScraper.
- new("https://www.google.com/search?tbm=nws&q=Egypt").
- set_iteration(:start, (1..101).step(10)).
- loop_on("h3", proc { |nodes| nodes.map(&:parent) }).
- extract(:title, "h3.r a").
- extract(:url, "h3.r a", :href).
- extract(:source, "br", proc { |node| node.next.text.split("-").first }).
- on(:data, proc { |data, response| data.each { |record| results << record } }).
- run()
+ new("https://www.google.com/search?tbm=nws&q=Egypt").
+ set_iteration(:start, (1..101).step(10)).
+ loop_on("h3") { |nodes| nodes.map(&:parent) }.
+ extract(:title, "h3.r a").
+ extract(:url, "h3.r a", :href).
+ extract(:source, "br") { |node| node.next.text.split("-").first }.
+ on(:data) { |data, response| data.each { |record| results << record } }.
+ run()
 
 
- ### Initializations
+ ### Scraper initialisation signature
 
- #new(urls, scraper_options, httpclient_options)
+ #new(urls, scraper_options, http_options)
 
- - `urls` - single url, or array of several urls.
- - `scraper_options` - hash of scraper options (see below).
- - `httpclient_options` - hash of request options for `Typheous::Request#initialize` (see [API documentation](http://rubydoc.info/github/pauldix/typhoeus/master/Typhoeus/Request#initialize-instance_method) for details).
+ - __urls__ - a single URL, or an array of URLs.
+ - __scraper_options__ - a hash of scraper options (see below).
+ - __http_options__ - a hash of request options for `Typhoeus::Request#initialize` (see [API documentation](http://rubydoc.info/github/pauldix/typhoeus/master/Typhoeus/Request#initialize-instance_method) for details).
 
- #### Scraper options:
- * `format` - Specifies the scraped document format (valid values are :html, :xml, :json).
- * `async` - Specifies whether the scraper's HTTP requests should be run in parallel or in series (defaults to false). **Note:** currently only GET requests can be run asynchronously.
- * `log` - Logging options hash:
- * `loglevel` - a symbol specifying the desired log level (defaults to `:info`).
- * `appenders` - a list of Logging.appenders object (defaults to `Logging.appenders.sterr`).
+ #### Scraper options:
+
+ * __format__ - Specifies the scraped document format (valid values are `:html`, `:xml`, and `:json`).
+ * __async__ - Specifies whether the scraper's HTTP requests should be run in parallel or in series (defaults to false). **Note:** currently only GET requests can be run asynchronously.
+ * __log__ - Logging options hash:
+ * __loglevel__ - a symbol specifying the desired log level (defaults to `:info`).
+ * __appenders__ - a list of `Logging.appenders` objects (defaults to `Logging.appenders.stderr`).
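
For illustration, a minimal initialisation sketch following the signature above (the target URL and option values here are hypothetical; `:followlocation` is a standard Typhoeus request option):

    scraper = ExtraLoop::ScraperBase.new(
      "http://example.com/stories",                        # hypothetical target URL
      { :format => :html,                                  # scraper options
        :async  => true,
        :log    => { :loglevel  => :debug,
                     :appenders => [Logging.appenders.stderr] } },
      { :followlocation => true }                          # forwarded to Typhoeus::Request#initialize
    )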
 
  ### Extractors
 
  ExtraLoop allows you to fetch structured data from online documents by looping through a list of elements matching a given selector.
  For each matched element, an arbitrary set of fields can be extracted. While the `loop_on` method sets up such a loop, the `extract`
- method extracts a piece of information from an element (e.g. a story's title) and and stores it into a record's field.
+ method extracts a specific piece of information from an element (e.g. a story's title) and stores it into a record's field.
 
- # using a CSS3 or an XPath selector
+ # looping over a set of document elements using a CSS3 (or XPath) selector
  loop_on('div.post')
 
- # using a proc as a selector
+ # looping over the elements returned by a block
 
- loop_on(proc { |doc| doc.search('div.post') })
+ loop_on { |doc| doc.search('div.post') }
 
- # using both a selector and a proc (matched elements are passed in as the first argument of the proc )
+ # using both a selector and a block (the matched element list is passed to the block as its first argument)
 
- loop_on('div.post', proc { |posts| posts.reject { |post| post.attr(:class) == 'sticky' }})
+ loop_on('div.post') { |posts| posts.reject { |post| post.attr(:class) == 'sticky' } }
 
- Both the `loop_on` and the `extract` methods may be called with a selector, a proc or a combination of the two. By default, when parsing DOM documents, `extract` will call
- `Nokogiri::XML::Node#text()`. Alternatively, `extract` also accepts an attribute name or a proc as a third argument, this will be evaluated in the context of the matching element.
+ Both the `loop_on` and the `extract` methods may be called with a selector, a block, or a combination of the two. By default, when parsing DOM documents, `extract` will call
+ `Nokogiri::XML::Node#text()`. Alternatively, `extract` also accepts an attribute name or a block; the latter will be evaluated in the context of the current iteration element.
 
  # extract a story's title
  extract(:title, 'h3')
@@ -80,34 +80,32 @@ Both the `loop_on` and the `extract` methods may be called with a selector, a pr
  extract(:url, "a.link-to-story", :href)
 
  # extract a description text, separating paragraphs with newlines
- extract(:description, "div.description", proc { |node|
- node.css("p").map(&:text).join("\n")
- })
+ extract(:description, "div.description") { |node| node.css("p").map(&:text).join("\n") }
 
  #### Extracting from JSON Documents
 
  While processing each HTTP response, ExtraLoop tries to automatically detect the scraped document format by looking at
- the `ContentType` header sent by the server (this value may be overriden by providing a `:format` key in the scraper's
- initialization options). When the format is JSON, the document is parsed using the `yajl` parser and converted into a hash.
- In this case, both the `loop_on` and the `extract` methods still behave as illustrated above, with the sole exception
- of the CSS3/XPath selector string, which is specific of DOM documents.
+ the `ContentType` header sent by the server. This value can be overridden by providing a `:format` key in the scraper's
+ initialization options. When the format is JSON, the document is parsed using the `yajl` JSON parser and converted into a hash.
+ In this case, both the `loop_on` and the `extract` methods still behave as illustrated above, except for the
+ CSS3/XPath selectors, which are specific to DOM documents.
 
- When working with JSON data, you can just use a proc and return the document elements you want to loop on.
+ When working with JSON data, you can just use a block and have it return the document elements you want to loop on.
 
  # Fetch a portion of a document using a block
- loop_on(proc { |data| data['query']['categorymembers'] })
+ loop_on { |data| data['query']['categorymembers'] }
 
- Alternatively, the same loop can be defined by passing an array of keys pointing at a value located
- at several levels of depth down into the parsed document hash.
+ Alternatively, the same loop can be defined by passing an array of keys pointing at a hash value located
+ several levels deep in the parsed document structure.
 
  # Fetch the same document portion above using a hash path
  loop_on(['query', 'categorymembers'])
 
- When fetching fields from a portion of a JSON document, `extract` will use the
- field name as a hash key if no key path or key string is provided.
+ When fetching fields from a JSON document fragment, `extract` will often not need a block or an array of keys. If called with only
+ one argument, it will in fact try to fetch a hash value using the provided field name as the key.
 
  # current node:
- # 
+ #
  # {
  # 'from_user' => "johndoe",
  # 'text' => 'bla bla bla',
@@ -118,25 +116,27 @@ field name as a hash key if no key path or key string is provided.
  # => "johndoe"
 
 
- ### Iteration methods:
+ ### Iteration methods
+
+ The `IterativeScraper` class comes with two methods that allow scrapers to loop over paginated content.
+
+ set_iteration(iteration_parameter, array_range_or_block)
 
- The `IterativeScraper` class comes with two methods for defining how a scraper should loop over paginated content.
+ * __iteration_parameter__ - A symbol identifying the request parameter that the scraper will use as an offset in order to iterate over the paginated content.
+ * __array_range_or_block__ - Either an explicit set of values or a block of code. If provided, the block is called with the parsed document object as its first argument. The block should return a non-empty array, which will determine the value of the offset parameter during each iteration. If the block fails to return a non-empty array, the iteration stops.
 
- _set_iteration(iteration_parameter, array_range_or_proc)_
 
- * `iteration_parameter` - A symbol identifying the request parameter that the scraper will use as offset in order to iterate over the paginated content.
- * `array_or_range_or_proc` - Either an explicit set of values or a block of code. If provided, the block is called with the parsed document as its first argument. Its return value is then used to shift, at each iteration, the value of the iteration parameter. If the block fails to return a non empty array, the iteration stops.
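
To make the two accepted argument shapes concrete, a minimal sketch (the `:page` parameter name and the pagination selector are hypothetical):

    # iterate over an explicit set of offset values
    set_iteration(:page, (1..10).to_a)

    # or compute the offsets from the first parsed document
    set_iteration(:page) { |doc| doc.search("a.pagination-link").map(&:text) }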
+ The second iteration method, `#continue_with`, allows iteration to continue until an arbitrary block of code returns a positive, non-nil value (to be assigned to the iteration parameter).
 
- The second iteration methods, `#continue_with`, allows to continue iterating untill an arbitrary block of code returns a positive, non-nil value.
+ continue_with(iteration_parameter, &block)
 
- _continue_with(iteration_parameter, block)_
+ * __iteration_parameter__ - the scraper's iteration parameter.
+ * __&block__ - An arbitrary block of Ruby code; its return value will be used to determine the value of the next iteration's offset parameter.
 
- * `iteration_parameter` - the scraper' iteration parameter.
- * `block` - An arbitrary block of ruby code, its return value will be used to determine the value of the next iteration's offset parameter.
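
As a sketch of the block form, a MediaWiki-style continuation loop (the response keys are hypothetical; iteration stops once the block returns nil):

    continue_with(:cmcontinue) do |doc|
      doc['query-continue'] && doc['query-continue']['categorymembers']['cmcontinue']
    end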
 
  ### Running tests
 
  ExtraLoop uses `rspec` and `rr` as its testing framework. The test suite can be run by calling the `rspec` executable from within the `spec` directory:
 
- cd spec
+ cd spec
  rspec *
@@ -5,16 +5,15 @@ results = []
  ExtraLoop::IterativeScraper.new("https://www.google.com/search?tbm=nws&q=Egypt", :log => {
  :log_level => :debug,
  :appenders => [Logging.appenders.stderr ]
+
  }).set_iteration(:start, (1..101).step(10)).
- loop_on("h3", proc { |nodes| nodes.map(&:parent) }).
+ loop_on("h3") { |nodes| nodes.map(&:parent) }.
  extract(:title, "h3.r a").
  extract(:url, "h3.r a", :href).
- extract(:source, "br", proc { |node|
- node.next.text.split("-").first
- }).
- on(:data, proc { |data, response|
+ extract(:source, "br") { |node| node.next.text.split("-").first }.
+ on(:data) do |data, response|
  data.each { |record| results << record }
- }).
+ end.
  run()
 
  results.each_with_index do |record, index|
@@ -6,9 +6,8 @@ module ExtraLoop
  end
  end
 
- def set_hook(hookname, handler)
+ def set_hook(hookname, &handler)
  @hooks ||= {}
- raise Exceptions::HookArgumentError.new "handler must be a callable proc" unless handler.respond_to?(:call)
  @hooks[hookname.to_sym] ? @hooks[hookname.to_sym].push(handler) : @hooks[hookname.to_sym] = [handler]
  self
  end
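
With this refactoring, hooks are registered by passing a Ruby block instead of an explicit proc argument. A minimal usage sketch (the `:data` hook name is taken from the specs below; the handler body is illustrative):

    scraper.set_hook(:data) do |records|
      records.each { |record| puts record.inspect }
    end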
@@ -209,7 +209,7 @@ module ExtraLoop
  def update_request_params!
  offset = @iteration_set.at(@iteration_count) || default_offset
  @request_arguments[:params] ||= {}
- @request_arguments[:params][@iteration_param.to_sym] = offset
+ @request_arguments[:params][@iteration_param.to_sym] = offset if @iteration_count > 0
  end
 
 
@@ -246,7 +246,7 @@ module ExtraLoop
 
  def issue_request(url)
  # remove continue argument if this is the first iteration
- @request_arguments[:params].delete(@iteration_param.to_sym) if @continue_clause_args && @iteration_count == 0
+ #@request_arguments[:params].delete(@iteration_param.to_sym) if @continue_clause_args && @iteration_count == 0
  super(url)
  # clear previous value of iteration parameter
  @request_arguments[:params].delete(@iteration_param.to_sym) if @request_arguments[:params] && @request_arguments[:params].any?
@@ -52,13 +52,15 @@ module ExtraLoop
  #
  #
  # selector - The CSS3 selector identifying the node list over which iterate (optional).
- # callback - A block of code (optional).
  # attribute - An attribute name (optional).
  #
+ # callback - A block of code (optional).
+ #
  # Returns itself.
  #
 
- def loop_on(*args)
+ def loop_on(*args, &block)
+ args << block if block
  @loop_extractor_args = args.insert(0, nil, ExtractionEnvironment.new(self))
  self
  end
@@ -75,7 +77,8 @@ module ExtraLoop
  #
  #
 
- def extract(*args)
+ def extract(*args, &block)
+ args << block if block
  @extractor_args << args.insert(1, ExtractionEnvironment.new(self))
  self
  end
@@ -78,7 +78,7 @@ describe IterativeScraper do
  new("http://whatever.net/search-stuff").
  set_iteration(:p, iteration_proc).
  loop_on(".whatever").
- set_hook(:data, proc { iteration_count += 1 }).
+ set_hook(:data, &proc { iteration_count += 1 }).
  run()
 
  @iteration_count = iteration_count
@@ -90,7 +90,7 @@ describe IterativeScraper do
  end
 
  it "should have sent p=1, p=2, p=3, p=4 as request parameters" do
- @params_sent.should eql(["1", "2", "3", "4"])
+ @params_sent.should eql([nil, "2", "3", "4"])
  end
  end
  end
@@ -24,16 +24,10 @@ describe ScraperBase do
  end
 
  describe "#set_hook" do
- subject { @scraper.set_hook(:after, proc {}) }
+ subject { @scraper.set_hook(:after, &proc {}) }
  it { should be_an_instance_of(ScraperBase) }
  end
 
- describe "#set_hook" do
- it "should raise exception if no proc is provided" do
- expect { @scraper.set_hook(:after, :method) }.to raise_exception(ScraperBase::Exceptions::HookArgumentError)
- end
- end
-
  context "request params in both the url and the arguments hash" do
  describe "#run" do
  before do
@@ -76,7 +70,7 @@ describe ScraperBase do
  loop_on("ul li.file a").
  extract(:url, :href).
  extract(:filename).
- set_hook(:data, proc { |records| records.each { |record| results << record }})
+ set_hook(:data, &proc { |records| records.each { |record| results << record }})
 
  @results = results
  end
@@ -110,7 +104,7 @@ describe ScraperBase do
  loop_on("ul li.file a").
  extract(:url, :href).
  extract(:filename).
- set_hook(:data, proc { |records| records.each { |record| results << record } })
+ set_hook(:data, &proc { |records| records.each { |record| results << record } })
 
  @results = results
 
@@ -154,7 +148,7 @@ describe ScraperBase do
  loop_on("ul li.file a").
  extract(:url, :href).
  extract(:filename).
- set_hook(:data, proc { |records| records.each { |record| results << record } })
+ set_hook(:data, &proc { |records| records.each { |record| results << record } })
 
  @results = results
 
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: extraloop
  version: !ruby/object:Gem::Version
- version: 0.0.4
+ version: 0.0.5
  prerelease:
  platform: ruby
  authors:
@@ -9,11 +9,11 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2012-01-14 00:00:00.000000000Z
+ date: 2012-01-28 00:00:00.000000000Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: yajl-ruby
- requirement: &10100300 !ruby/object:Gem::Requirement
+ requirement: &19726880 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ~>
@@ -21,10 +21,10 @@ dependencies:
  version: 1.1.0
  type: :runtime
  prerelease: false
- version_requirements: *10100300
+ version_requirements: *19726880
  - !ruby/object:Gem::Dependency
  name: nokogiri
- requirement: &10098200 !ruby/object:Gem::Requirement
+ requirement: &19726420 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ~>
@@ -32,10 +32,10 @@ dependencies:
  version: 1.5.0
  type: :runtime
  prerelease: false
- version_requirements: *10098200
+ version_requirements: *19726420
  - !ruby/object:Gem::Dependency
  name: typhoeus
- requirement: &10095680 !ruby/object:Gem::Requirement
+ requirement: &19725960 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ~>
@@ -43,10 +43,10 @@ dependencies:
  version: 0.3.2
  type: :runtime
  prerelease: false
- version_requirements: *10095680
+ version_requirements: *19725960
  - !ruby/object:Gem::Dependency
  name: logging
- requirement: &10094320 !ruby/object:Gem::Requirement
+ requirement: &19725500 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ~>
@@ -54,21 +54,21 @@ dependencies:
  version: 0.6.1
  type: :runtime
  prerelease: false
- version_requirements: *10094320
+ version_requirements: *19725500
  - !ruby/object:Gem::Dependency
  name: rspec
- requirement: &10092700 !ruby/object:Gem::Requirement
+ requirement: &19725040 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ~>
  - !ruby/object:Gem::Version
- version: 2.7.1
+ version: 2.7.0
  type: :development
  prerelease: false
- version_requirements: *10092700
+ version_requirements: *19725040
  - !ruby/object:Gem::Dependency
  name: rr
- requirement: &10089100 !ruby/object:Gem::Requirement
+ requirement: &19724580 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ~>
@@ -76,10 +76,10 @@ dependencies:
  version: 1.0.4
  type: :development
  prerelease: false
- version_requirements: *10089100
+ version_requirements: *19724580
  - !ruby/object:Gem::Dependency
  name: pry
- requirement: &10088040 !ruby/object:Gem::Requirement
+ requirement: &19724060 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ~>
@@ -87,7 +87,7 @@ dependencies:
  version: 0.9.7.4
  type: :development
  prerelease: false
- version_requirements: *10088040
+ version_requirements: *19724060
  description: A Ruby library for extracting data from websites and web based APIs.
  Supports most common document formats (i.e. HTML, XML, and JSON), and comes with
  a handy mechanism for iterating over paginated datasets.