extraloop 0.0.7 → 0.0.8

This diff shows the changes between the publicly released 0.0.7 and 0.0.8 versions of the extraloop gem, as published to the registry.
@@ -1,3 +1,6 @@
+ == 0.0.8 / 2012-03-27
+ * Fixed a major bug that prevented iterative scrapers from working properly
+
  == 0.0.7 / 2012-02-28
  * Added support for CSV data extraction.
 
data/README.md CHANGED
@@ -1,14 +1,14 @@
  # Extra Loop
 
  A Ruby library for extracting structured data from websites and web based APIs.
- Supports most common document formats (i.e. HTML, XML, and JSON), and comes with a handy mechanism
+ Supports most common document formats (i.e. HTML, XML, CSV, and JSON), and comes with a handy mechanism
  for iterating over paginated datasets.
 
- ### Installation:
+ ## Installation:
 
      gem install extraloop
 
- ### Sample scrapers:
+ ## Usage:
 
  A basic scraper that fetches the top 25 websites from [Alexa's daily top 100](www.alexa.com/topsites) list:
 
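The Alexa scraper itself sits outside the changed lines of this diff; as a rough sketch of the kind of `ScraperBase` chain the README refers to (the CSS selectors here are assumptions, not taken from the package):

    results = []

    ExtraLoop::ScraperBase.new("http://www.alexa.com/topsites").
      loop_on("li.site-listing").                  # one element per listed site
      extract(:rank, ".count").
      extract(:site_name, ".desc-container a").
      on(:data) { |records| results += records }.  # collect extracted records
      run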
@@ -37,7 +37,7 @@ An iterative Scraper that fetches URL, title, and publisher from some 110 Google
      run()
 
 
- ### Scraper initialisation signature
+ ## Scraper initialisation signature
 
      #new(urls, scraper_options, http_options)
 
@@ -45,7 +45,7 @@ An iterative Scraper that fetches URL, title, and publisher from some 110 Google
  - __scraper_options__ - hash of scraper options (see below).
  - __http_options__ - hash of request options for `Typhoeus::Request#initialize` (see [API documentation](http://rubydoc.info/github/pauldix/typhoeus/master/Typhoeus/Request#initialize-instance_method) for details).
 
- #### scraper options:
+ ### scraper options:
 
  * __format__ - Specifies the scraped document format; needed only if the Content-Type in the server response is not the correct one. Supported formats are: 'html', 'xml', 'json', and 'csv'.
  * __async__ - Specifies whether the scraper's HTTP requests should be run in parallel or in series (defaults to false). **Note:** currently only GET requests can be run asynchronously.
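To ground the signature and the options above (the logging options follow in the next hunk), a minimal instantiation sketch — the URL, option values, and request params are illustrative assumptions:

    scraper = ExtraLoop::ScraperBase.new(
      "http://example.com/feed",          # urls: a single URL or an array
      { :format => :json,                 # override Content-Type detection
        :async  => true },                # fetch in parallel (GET only)
      { :params => { :page => 1 } }       # http_options, forwarded to Typhoeus
    )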
@@ -53,7 +53,7 @@ An iterative Scraper that fetches URL, title, and publisher from some 110 Google
  * __loglevel__ - a symbol specifying the desired log level (defaults to `:info`).
  * __appenders__ - a list of Logging.appenders objects (defaults to `Logging.appenders.stderr`).
 
- ### Extractors
+ ## Extractors
 
  ExtraLoop allows you to fetch structured data from online documents by looping through a list of elements matching a given selector.
  For each matched element, an arbitrary set of fields can be extracted. While the `loop_on` method sets up such a loop, the `extract`
@@ -82,7 +82,7 @@ Both the `loop_on` and the `extract` methods may be called with a selector, a bl
  # extract a description text, separating paragraphs with newlines
  extract(:description, "div.description") { |node| node.css("p").map(&:text).join("\n") }
 
- #### Extracting data from JSON Documents
+ ### Extracting data from JSON Documents
 
  While processing an HTTP response, ExtraLoop tries to automatically detect the scraped document format by looking at
  the `Content-Type` header sent by the server. This value can be overridden by providing a `:format` key in the scraper's
@@ -116,26 +116,26 @@ one argument, it will in fact try to fetch a hash value using the provided field
  # => "johndoe"
 
 
- ### Iteration methods
+ ## Iteration methods
 
  The `IterativeScraper` class comes with two methods that allow scrapers to loop over paginated content.
 
- #### set\_iteration
+ ### set\_iteration
 
  * __iteration_parameter__ - A symbol identifying the request parameter that the scraper will use as an offset in order to iterate over the paginated content.
  * __array_or_range_or_block__ - Either an explicit set of values or a block of code. If provided, the block is called with the parsed document object as its first argument. The block should return a non-empty array, which will determine the value of the offset parameter during each iteration. If the block fails to return a non-empty array, the iteration stops.
 
- #### continue\_with
+ ### continue\_with
 
  The second iteration method, `#continue_with`, allows an iteration to continue as long as a block of code returns a truthy, non-nil value (to be assigned to the iteration parameter).
 
  * __iteration_parameter__ - the scraper's iteration parameter.
  * __&block__ - An arbitrary block of Ruby code; its return value will be used to determine the value of the next iteration's offset parameter.
 
- ### Running tests
+ ## Running tests
 
  ExtraLoop uses `rspec` and `rr` as its testing framework. The test suite can be run by calling the `rspec` executable from within the `spec` directory:
 
      cd spec
      rspec *
-
+
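A hedged sketch of the two iteration styles documented in the hunk above — the endpoint, the `:start` parameter, and the selectors are assumptions, not taken from this diff:

    results = []

    # Explicit offsets: request start=1, 11, 21, ... 101.
    scraper = ExtraLoop::IterativeScraper.new("http://example.com/search?q=extraloop")
    scraper.set_iteration(:start, (1..101).step(10))
    scraper.loop_on("li.result").
      extract(:title, "h3").
      extract(:url, "a", :href).
      on(:data) { |records| results += records }.
      run

    # Alternatively, derive the next offset from each parsed page:
    # scraper.continue_with(:start) { |doc| (link = doc.at("a.next")) && link[:href][/start=(\d+)/, 1] }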
@@ -1,6 +1,5 @@
  #
- # Fetch name, job title, and actual pay ceiling from a csv dataset containing UK Ministry of Defence's organogram and staff pay data
- #
+ # Fetches name, job title, and actual pay ceiling from a CSV dataset listing UK Ministry of Defence's organogram and staff pay data
  # source: http://data.gov.uk/dataset/staff-organograms-and-pay-mod
  #
 
@@ -12,7 +11,8 @@ class ModPayScraper < ExtraLoop::ScraperBase
      dataset_url = "http://www.mod.uk/NR/rdonlyres/FF9761D8-2AB9-4CD4-88BC-983A46A0CD90/0/20111208CTLBOrganogramFinal7Useniordata.csv"
      super dataset_url, :format => :csv
 
-     # Select only record of officiers who earn more than 100k per year
+     # Select only records of officers earning more than 100k per year
+
      loop_on do |rows|
        rows[1..-1].select { |row| row[14].to_i > 100000 }
      end
@@ -22,11 +22,11 @@ class ModPayScraper < ExtraLoop::ScraperBase
      extract :pay, 14
 
      on("data") do |records|
-
-       records.sort { |r1, r2| r2.pay <=> r1.pay }.each { |record| puts [record.pay, record.name].map { |string| string.ljust 7 }.join }
+       records.
+         sort { |r1, r2| r2.pay.to_i <=> r1.pay.to_i }.
+         each { |record| puts [record.pay, record.name].map { |string| string.ljust 7 }.join }
      end
    end
  end
 
-
  ModPayScraper.new.run
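The added `to_i` calls matter because CSV fields arrive as strings, and strings compare lexicographically rather than numerically:

    "90000" <=> "100000"   # => 1, because "9" sorts after "1"
    90000 <=> 100000       # => -1, the numeric order the report needs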
@@ -18,10 +18,6 @@ params = {
 
  options = {
    :format => :json,
-   :log => {
-     :appenders => [Logging.appenders.stderr],
-     :log_level => :info
-   }
  }
  request_arguments = { :params => params, :headers => {
    "User-Agent" => "ExtraLoop - ruby data extraction toolkit: http://github.com/afiore/extraloop"
@@ -1,7 +1,7 @@
  base_path = File.expand_path(File.dirname(__FILE__) + "/extraloop" )
 
  module ExtraLoop
-   VERSION = '0.0.3'
+   VERSION = '0.0.8'
  end
 
 
@@ -32,7 +32,8 @@ module ExtraLoop
    def extract_list(input)
      nodes = (input.respond_to?(:document) ? input : parse(input))
      nodes = nodes.search(@selector) if @selector
-     Array(@callback && @environment.run(nodes, &@callback) || nodes)
+     nodes = nodes.css("*") unless @selector or @callback
+     @callback && Array(@environment.run(nodes, &@callback)) || nodes
    end
 
    def parse(input)
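The added fallback changes what a selector-less, callback-less extractor loops over: every element of the parsed document instead of the bare parse result. A small sketch of what Nokogiri's `css("*")` yields (the output comment is my expectation, not taken from the package):

    require 'nokogiri'

    doc = Nokogiri::HTML("<ul><li>a</li><li>b</li></ul>")
    # css("*") matches every element node, including the html/body
    # wrappers Nokogiri adds while parsing:
    doc.css("*").map(&:name)  # => ["html", "body", "ul", "li", "li"]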
@@ -22,7 +22,7 @@ module ExtraLoop
    def extract_list(input)
      @environment.document = input = (input.is_a?(String) ? parse(input) : input)
      input = input.get_in(@path) if @path
-     @callback && @environment.run(input, &@callback) || input
+     Array(@callback && @environment.run(input, &@callback) || input)
    end
 
    def parse(input)
@@ -23,7 +23,7 @@ module ExtraLoop
 
    def initialize(urls, options = {}, arguments = {})
      @urls = Array(urls)
-     @loop_extractor_args = nil
+     @loop_extractor_args = []
      @extractor_args = []
      @loop = nil
@@ -61,8 +61,8 @@ module ExtraLoop
 
    def loop_on(*args, &block)
      args << block if block
-     # we prepend a nil value, as the loop extractor does not need to specify a field name
-     @loop_extractor_args = args.insert(0, nil)
+     # prepend placeholder values for loop name and extraction environment
+     @loop_extractor_args = args.insert(0, nil, nil)
      self
    end
 
@@ -129,6 +129,9 @@
      end
 
      log("queueing url: #{url}, params #{arguments[:params]}", :debug)
+
+
+
      @queued_count += 1
      @hydra.queue(request)
    end
@@ -147,6 +150,7 @@
      log("response ##{@response_count} of #{@queued_count}, status code: [#{response.code}], URL fragment: ...#{response.effective_url.split('/').last if response.effective_url}")
 
      @loop.run
+
      @environment = @loop.environment
      run_hook(:data, [@loop.records, response])
      #TODO: add hook for scraper completion (useful in iterative scrapes).
@@ -159,7 +163,10 @@
      extractor_classname = "#{format.to_s.capitalize}Extractor"
      extractor_class = ExtraLoop.const_defined?(extractor_classname) && ExtraLoop.const_get(extractor_classname) || DomExtractor
 
-     @loop_extractor_args.insert(1, ExtractionEnvironment.new(self))
+
+     #replace empty placeholder with extraction environment
+     @loop_extractor_args[1] = ExtractionEnvironment.new(self)
+
      loop_extractor = extractor_class.new(*@loop_extractor_args)
 
      # There is no point in parsing response.body more than once, so we reuse
@@ -167,8 +174,8 @@
 
      document = loop_extractor.parse(response.body)
 
-     extractors = @extractor_args.map do |args|
-       args.insert(1, ExtractionEnvironment.new(self, document))
+     extractors = @extractor_args.map do |original_args|
+       args = original_args.clone.insert(1, ExtractionEnvironment.new(self, document))
        extractor_class.new(*args)
      end
 
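This `clone` appears to be the iterative-scraper fix announced in the changelog: `Array#insert` mutates its receiver and returns `self`, so the old code kept prepending a fresh `ExtractionEnvironment` to the shared `@extractor_args` on every response. A minimal sketch of the footgun:

    args = [:title, "h3"]
    args.insert(1, :env_a)         # => [:title, :env_a, "h3"] — args itself changed
    args.insert(1, :env_b)         # => [:title, :env_b, :env_a, "h3"]
    args.clone.insert(1, :env_c)   # returns a modified copy; args is untouched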
@@ -69,14 +69,15 @@ describe ScraperBase do
        loop_on("ul li.file a").
        extract(:url, :href).
        extract(:filename).
-       set_hook(:data, &proc { |records| records.each { |record| results << record }})
+       on(:data) { |records| results = records }.
+       run
 
      @results = results
+
    end
 
 
    it "Should handle response" do
-     @scraper.run
      @results.should_not be_empty
      @results.all? { |record| record.extracted_at && record.url && record.filename }.should be_true
    end
@@ -166,4 +167,29 @@ describe ScraperBase do
      end
    end
  end
+
+ context "no loop defined.." do
+   describe "#run", :thisonly => true do
+     before do
+       data = []
+       @url = "http://localhost/fixture"
+
+       stub_http({}, :body => @fixture_doc) do |hydra, request, response|
+         hydra.stub(:get, request.url).and_return(response)
+       end
+
+       (ScraperBase.new @url).
+         extract(:url, "a[href]", :href).
+         on("data") { |records| data = records }.
+         run
+
+       @data = data
+     end
+
+     it "should run and extract data" do
+       @data.should_not be_empty
+     end
+   end
+ end
+
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: extraloop
  version: !ruby/object:Gem::Version
-   version: 0.0.7
+   version: 0.0.8
  prerelease:
  platform: ruby
  authors:
@@ -9,11 +9,11 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2012-02-28 00:00:00.000000000Z
+ date: 2012-03-27 00:00:00.000000000Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: yajl-ruby
-   requirement: &21376200 !ruby/object:Gem::Requirement
+   requirement: &13625500 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ~>
@@ -21,10 +21,10 @@ dependencies:
      version: 1.1.0
    type: :runtime
    prerelease: false
-   version_requirements: *21376200
+   version_requirements: *13625500
  - !ruby/object:Gem::Dependency
    name: nokogiri
-   requirement: &21373200 !ruby/object:Gem::Requirement
+   requirement: &13624900 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ~>
@@ -32,10 +32,10 @@ dependencies:
      version: 1.5.0
    type: :runtime
    prerelease: false
-   version_requirements: *21373200
+   version_requirements: *13624900
  - !ruby/object:Gem::Dependency
    name: typhoeus
-   requirement: &21368180 !ruby/object:Gem::Requirement
+   requirement: &13624340 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ~>
@@ -43,10 +43,10 @@ dependencies:
      version: 0.3.2
    type: :runtime
    prerelease: false
-   version_requirements: *21368180
+   version_requirements: *13624340
  - !ruby/object:Gem::Dependency
    name: logging
-   requirement: &21365740 !ruby/object:Gem::Requirement
+   requirement: &13623540 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ~>
@@ -54,10 +54,10 @@ dependencies:
      version: 0.6.1
    type: :runtime
    prerelease: false
-   version_requirements: *21365740
+   version_requirements: *13623540
  - !ruby/object:Gem::Dependency
    name: rspec
-   requirement: &21363940 !ruby/object:Gem::Requirement
+   requirement: &13622600 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ~>
@@ -65,10 +65,10 @@ dependencies:
      version: 2.7.0
    type: :development
    prerelease: false
-   version_requirements: *21363940
+   version_requirements: *13622600
  - !ruby/object:Gem::Dependency
    name: rr
-   requirement: &21362040 !ruby/object:Gem::Requirement
+   requirement: &13620140 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ~>
@@ -76,10 +76,10 @@ dependencies:
      version: 1.0.4
    type: :development
    prerelease: false
-   version_requirements: *21362040
+   version_requirements: *13620140
  - !ruby/object:Gem::Dependency
    name: pry-nav
-   requirement: &21355940 !ruby/object:Gem::Requirement
+   requirement: &13619680 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ~>
@@ -87,10 +87,21 @@ dependencies:
      version: 0.1.0
    type: :development
    prerelease: false
-   version_requirements: *21355940
+   version_requirements: *13619680
+ - !ruby/object:Gem::Dependency
+   name: rake
+   requirement: &13619180 !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: 0.9.2.2
+   type: :development
+   prerelease: false
+   version_requirements: *13619180
  description: A Ruby library for extracting data from websites and web based APIs.
-   Supports most common document formats (i.e. HTML, XML, and JSON), and comes with
-   a handy mechanism for iterating over paginated datasets.
+   Supports most common document formats (i.e. HTML, XML, CSV, and JSON), and comes
+   with a handy mechanism for iterating over paginated datasets.
  email: andrea.giulio.fiore@googlemail.com
  executables: []
  extensions: []
@@ -140,6 +151,9 @@ required_ruby_version: !ruby/object:Gem::Requirement
  - - ! '>='
    - !ruby/object:Gem::Version
      version: '0'
+     segments:
+     - 0
+     hash: 2441039337834275619
  required_rubygems_version: !ruby/object:Gem::Requirement
    none: false
    requirements:
@@ -153,4 +167,3 @@ signing_key:
  specification_version: 2
  summary: A toolkit for online data extraction.
  test_files: []
- has_rdoc: