extraloop 0.0.7 → 0.0.8

@@ -1,3 +1,6 @@
+ == 0.0.8 / 2012-03-27
+ * Fixed major bug which prevented iterative scrapers from working properly
+
  == 0.0.7 / 2012-02-28
  * Added support for CSV data extraction.
 
data/README.md CHANGED
@@ -1,14 +1,14 @@
  # Extra Loop
 
  A Ruby library for extracting structured data from websites and web based APIs.
- Supports most common document formats (i.e. HTML, XML, and JSON), and comes with a handy mechanism
+ Supports most common document formats (i.e. HTML, XML, CSV, and JSON), and comes with a handy mechanism
  for iterating over paginated datasets.
 
- ### Installation:
+ ## Installation:
 
      gem install extraloop
 
- ### Sample scrapers:
+ ## Usage:
 
  A basic scraper that fetches the top 25 websites from [Alexa's daily top 100](www.alexa.com/topsites) list:
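The snippet this sentence introduces falls outside the hunk's context window. The sketch below is only an illustration of what such a scraper could look like, pieced together from the API visible elsewhere in this diff (`loop_on`, `extract`, `on`, `run`); the CSS selectors and the output handling are assumptions, not the gem's bundled example.

    require 'extraloop'

    # Illustrative only: the CSS selectors are guesses, not the ones used in
    # the gem's actual Alexa example.
    ExtraLoop::ScraperBase.new("http://www.alexa.com/topsites").
      loop_on("li.site-listing").                    # one node per ranked site
      extract(:site_name, "h2 a").                   # text of the site link
      extract(:url, "h2 a", :href).                  # href attribute of the same link
      on("data") { |records| records.each { |r| puts r.site_name } }.
      run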
 
@@ -37,7 +37,7 @@ An iterative Scraper that fetches URL, title, and publisher from some 110 Google
      run()
 
 
- ### Scraper initialisation signature
+ ## Scraper initialisation signature
 
      #new(urls, scraper_options, http_options)
 
@@ -45,7 +45,7 @@ An iterative Scraper that fetches URL, title, and publisher from some 110 Google
  - __scraper_options__ - hash of scraper options (see below).
  - __http_options__ - hash of request options for `Typheous::Request#initialize` (see [API documentation](http://rubydoc.info/github/pauldix/typhoeus/master/Typhoeus/Request#initialize-instance_method) for details).
 
- #### scraper options:
+ ### scraper options:
 
  * __format__ - Specifies the scraped document format; needed only if the Content-Type in the server response is not the correct one. Supported formats are: 'html', 'xml', 'json', and 'csv'.
  * __async__ - Specifies whether the scraper's HTTP requests should be run in parallel or in series (defaults to false). **Note:** currently only GET requests can be run asynchronously.
@@ -53,7 +53,7 @@ An iterative Scraper that fetches URL, title, and publisher from some 110 Google
  * __loglevel__ - a symbol specifying the desired log level (defaults to `:info`).
  * __appenders__ - a list of Logging.appenders object (defaults to `Logging.appenders.sterr`).
 
- ### Extractors
+ ## Extractors
 
  ExtraLoop allows to fetch structured data from online documents by looping through a list of elements matching a given selector.
  For each matched element, an arbitrary set of fields can be extracted. While the `loop_on` method sets up such loop, the `extract`
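Putting the constructor signature and the two option hashes above together, a call might look like the following. This is a minimal sketch against a hypothetical JSON endpoint: the URL, parameters, and header are placeholders, and only options documented in the hunks above are used.

    require 'extraloop'

    scraper = ExtraLoop::ScraperBase.new(
      "https://example.com/api/items.json",                       # urls (placeholder)
      { :format => :json, :async => false, :loglevel => :debug }, # scraper_options
      { :params  => { "per_page" => 25 },                         # http_options, passed
        :headers => { "User-Agent" => "ExtraLoop" } }             # through to Typhoeus
    )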
@@ -82,7 +82,7 @@ Both the `loop_on` and the `extract` methods may be called with a selector, a bl
      # extract a description text, separating paragraphs with newlines
      extract(:description, "div.description") { |node| node.css("p").map(&:text).join("\n") }
 
- #### Extracting data from JSON Documents
+ ### Extracting data from JSON Documents
 
  While processing an HTTP response, ExtraLoop tries to automatically detect the scraped document format by looking at
  the `ContentType` header sent by the server. This value can be overriden by providing a `:format` key in the scraper's
@@ -116,26 +116,26 @@ one argument, it will in fact try to fetch a hash value using the provided field
      # => "johndoe"
 
 
- ### Iteration methods
+ ## Iteration methods
 
  The `IterativeScraper` class comes with two methods that allow scrapers to loop over paginated content.
 
- #### set\_iteration
+ ### set\_iteration
 
  * __iteration_parameter__ - A symbol identifying the request parameter that the scraper will use as offset in order to iterate over the paginated content.
  * __array_or_range_or_block__ - Either an explicit set of values or a block of code. If provided, the block is called with the parsed document object as its first argument. The block should return a non empty array, which will determine the value of the offset parameter during each iteration. If the block fails to return a non empty array, the iteration stops.
 
- #### continue\_with
+ ### continue\_with
 
  The second iteration method, `#continue_with`, allows to continue an interation as long as a block of code returns a truthy, non-nil value (to be assigned to the iteration parameter).
 
  * __iteration_parameter__ - the scraper' iteration parameter.
  * __&block__ - An arbitrary block of ruby code, its return value will be used to determine the value of the next iteration's offset parameter.
 
- ### Running tests
+ ## Running tests
 
  ExtraLoop uses `rspec` and `rr` as its testing framework. The test suite can be run by calling the `rspec` executable from within the `spec` directory:
 
      cd spec
      rspec *
-
+ 
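A hedged sketch of how the two iteration methods above might be wired up. The endpoint, the `:p` offset parameter, and the selectors are illustrative assumptions, not the gem's bundled Google News example.

    require 'extraloop'

    scraper = ExtraLoop::IterativeScraper.new("https://example.com/search?q=ruby")

    # Offset-based iteration: repeat the scrape for p=1, 2, ... 5.
    # continue_with(:next_page) { |doc| ... } would instead keep iterating
    # for as long as the block returns a non-nil offset value.
    scraper.set_iteration(:p, 1..5)

    scraper.
      loop_on("div.result").
      extract(:title, "h3").
      extract(:url, "h3 a", :href).
      on("data") { |records| records.each { |r| puts r.title } }.
      run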
@@ -1,6 +1,5 @@
  #
- # Fetch name, job title, and actual pay ceiling from a csv dataset containing UK Ministry of Defence's organogram and staff pay data
- #
+ # Fetches name, job title, and actual pay ceiling from a CSV dataset listing UK Ministry of Defence's organogram and staff pay data
  # source: http://data.gov.uk/dataset/staff-organograms-and-pay-mod
  #
 
@@ -12,7 +11,8 @@ class ModPayScraper < ExtraLoop::ScraperBase
  dataset_url = "http://www.mod.uk/NR/rdonlyres/FF9761D8-2AB9-4CD4-88BC-983A46A0CD90/0/20111208CTLBOrganogramFinal7Useniordata.csv"
  super dataset_url, :format => :csv
 
- # Select only record of officiers who earn more than 100k per year
+ # Select only records of officers earning more than 100k per year
+
  loop_on do |rows|
  rows[1..-1].select { |row| row[14].to_i > 100000 }
  end
@@ -22,11 +22,11 @@ class ModPayScraper < ExtraLoop::ScraperBase
  extract :pay, 14
 
  on("data") do |records|
-
- records.sort { |r1, r2| r2.pay <=> r1.pay }.each { |record| puts [record.pay, record.name].map { |string| string.ljust 7 }.join }
+ records.
+ sort { |r1, r2| r2.pay.to_i <=> r1.pay.to_i }.
+ each { |record| puts [record.pay, record.name].map { |string| string.ljust 7 }.join }
  end
  end
  end
 
-
  ModPayScraper.new.run
@@ -18,10 +18,6 @@ params = {
 
  options = {
  :format => :json,
- :log => {
- :appenders => [Logging.appenders.stderr],
- :log_level => :info
- }
  }
  request_arguments = { :params => params, :headers => {
  "User-Agent" => "ExtraLoop - ruby data extraction toolkit: http://github.com/afiore/extraloop"
@@ -1,7 +1,7 @@
  base_path = File.expand_path(File.dirname(__FILE__) + "/extraloop" )
 
  module ExtraLoop
- VERSION = '0.0.3'
+ VERSION = '0.0.8'
  end
 
 
@@ -32,7 +32,8 @@ module ExtraLoop
  def extract_list(input)
  nodes = (input.respond_to?(:document) ? input : parse(input))
  nodes = nodes.search(@selector) if @selector
- Array(@callback && @environment.run(nodes, &@callback) || nodes)
+ nodes = nodes.css("*") unless @selector or @callback
+ @callback && Array(@environment.run(nodes, &@callback)) || nodes
  end
 
  def parse(input)
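The new fallback line means a scraper defined without `loop_on`, and therefore without a selector or callback, still hands its field extractors a list of nodes to loop over (the new "no loop defined" spec further down exercises this). A small illustration of that fallback using Nokogiri directly on a throwaway HTML fragment:

    require 'nokogiri'

    doc = Nokogiri::HTML("<ul><li><a href='/a'>A</a></li><li><a href='/b'>B</a></li></ul>")

    # css("*") yields every element in the parsed document, which is what the
    # extractor now falls back to instead of the bare document node.
    doc.css("*").map(&:name)  # => ["html", "body", "ul", "li", "a", "li", "a"]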
@@ -22,7 +22,7 @@ module ExtraLoop
  def extract_list(input)
  @environment.document = input = (input.is_a?(String) ? parse(input) : input)
  input = input.get_in(@path) if @path
- @callback && @environment.run(input, &@callback) || input
+ Array(@callback && @environment.run(input, &@callback) || input)
  end
 
  def parse(input)
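One plausible reading of this one-line change: wrapping the result in `Kernel#Array` normalizes what downstream looping code receives, so a callback or `get_in` lookup that yields nothing no longer leaks `nil` out of the extractor. Plain Ruby, nothing extraloop-specific:

    Array(nil)                          # => []
    Array([{ "user" => "johndoe" }])    # => [{"user"=>"johndoe"}]  (arrays pass through untouched)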
@@ -23,7 +23,7 @@ module ExtraLoop
 
  def initialize(urls, options = {}, arguments = {})
  @urls = Array(urls)
- @loop_extractor_args = nil
+ @loop_extractor_args = []
  @extractor_args = []
  @loop = nil
 
@@ -61,8 +61,8 @@ module ExtraLoop
 
  def loop_on(*args, &block)
  args << block if block
- # we prepend a nil value, as the loop extractor does not need to specify a field name
- @loop_extractor_args = args.insert(0, nil)
+ # prepend placeholder values for loop name and extraction environment
+ @loop_extractor_args = args.insert(0, nil, nil)
  self
  end
 
@@ -129,6 +129,9 @@ module ExtraLoop
  end
 
  log("queueing url: #{url}, params #{arguments[:params]}", :debug)
+
+
+
  @queued_count += 1
  @hydra.queue(request)
  end
@@ -147,6 +150,7 @@ module ExtraLoop
  log("response ##{@response_count} of #{@queued_count}, status code: [#{response.code}], URL fragment: ...#{response.effective_url.split('/').last if response.effective_url}")
 
  @loop.run
+
  @environment = @loop.environment
  run_hook(:data, [@loop.records, response])
  #TODO: add hock for scraper completion (useful in iterative scrapes).
@@ -159,7 +163,10 @@ module ExtraLoop
  extractor_classname = "#{format.to_s.capitalize}Extractor"
  extractor_class = ExtraLoop.const_defined?(extractor_classname) && ExtraLoop.const_get(extractor_classname) || DomExtractor
 
- @loop_extractor_args.insert(1, ExtractionEnvironment.new(self))
+
+ #replace empty placeholder with extraction environment
+ @loop_extractor_args[1] = ExtractionEnvironment.new(self)
+
  loop_extractor = extractor_class.new(*@loop_extractor_args)
 
  # There is no point in parsing response.body more than once, so we reuse
@@ -167,8 +174,8 @@ module ExtraLoop
 
  document = loop_extractor.parse(response.body)
 
- extractors = @extractor_args.map do |args|
- args.insert(1, ExtractionEnvironment.new(self, document))
+ extractors = @extractor_args.map do |original_args|
+ args = original_args.clone.insert(1, ExtractionEnvironment.new(self, document))
  extractor_class.new(*args)
  end
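These two hunks line up with the changelog entry about iterative scrapers: `Array#insert` mutates its receiver, so inserting an `ExtractionEnvironment` into the stored argument lists on every response made those lists grow from one iteration to the next. Writing into a reserved slot (loop extractor) and cloning before inserting (field extractors) keeps each response isolated. A stand-alone illustration of the mutation being avoided, with `:env` as a stand-in for the environment object:

    args = [:url, "a[href]", :href]

    # Re-using and mutating the same array on every response, as before:
    2.times { args.insert(1, :env) }
    args                              # => [:url, :env, :env, "a[href]", :href]

    # Cloning first, as the fix does, leaves the stored arguments untouched:
    fresh = [:url, "a[href]", :href]
    fresh.clone.insert(1, :env)       # => [:url, :env, "a[href]", :href]
    fresh                             # => [:url, "a[href]", :href]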
 
@@ -69,14 +69,15 @@ describe ScraperBase do
  loop_on("ul li.file a").
  extract(:url, :href).
  extract(:filename).
- set_hook(:data, &proc { |records| records.each { |record| results << record }})
+ on(:data) { |records| results = records }.
+ run
 
  @results = results
+
  end
 
 
  it "Should handle response" do
- @scraper.run
  @results.should_not be_empty
  @results.all? { |record| record.extracted_at && record.url && record.filename }.should be_true
  end
@@ -166,4 +167,29 @@ describe ScraperBase do
  end
  end
  end
+
+ context "no loop defined.." do
+ describe "#run", :thisonly => true do
+ before do
+ data = []
+ @url = "http://localhost/fixture"
+
+ stub_http({}, :body => @fixture_doc) do |hydra, request, response|
+ hydra.stub(:get, request.url).and_return(response)
+ end
+
+ (ScraperBase.new @url).
+ extract(:url, "a[href]", :href).
+ on("data") { |records| data = records }.
+ run
+
+ @data = data
+ end
+
+ it "should run and extract data" do
+ @data.should_not be_empty
+ end
+ end
+ end
+
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: extraloop
  version: !ruby/object:Gem::Version
- version: 0.0.7
+ version: 0.0.8
  prerelease:
  platform: ruby
  authors:
@@ -9,11 +9,11 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2012-02-28 00:00:00.000000000Z
+ date: 2012-03-27 00:00:00.000000000Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: yajl-ruby
- requirement: &21376200 !ruby/object:Gem::Requirement
+ requirement: &13625500 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ~>
@@ -21,10 +21,10 @@ dependencies:
  version: 1.1.0
  type: :runtime
  prerelease: false
- version_requirements: *21376200
+ version_requirements: *13625500
  - !ruby/object:Gem::Dependency
  name: nokogiri
- requirement: &21373200 !ruby/object:Gem::Requirement
+ requirement: &13624900 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ~>
@@ -32,10 +32,10 @@ dependencies:
  version: 1.5.0
  type: :runtime
  prerelease: false
- version_requirements: *21373200
+ version_requirements: *13624900
  - !ruby/object:Gem::Dependency
  name: typhoeus
- requirement: &21368180 !ruby/object:Gem::Requirement
+ requirement: &13624340 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ~>
@@ -43,10 +43,10 @@ dependencies:
  version: 0.3.2
  type: :runtime
  prerelease: false
- version_requirements: *21368180
+ version_requirements: *13624340
  - !ruby/object:Gem::Dependency
  name: logging
- requirement: &21365740 !ruby/object:Gem::Requirement
+ requirement: &13623540 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ~>
@@ -54,10 +54,10 @@ dependencies:
  version: 0.6.1
  type: :runtime
  prerelease: false
- version_requirements: *21365740
+ version_requirements: *13623540
  - !ruby/object:Gem::Dependency
  name: rspec
- requirement: &21363940 !ruby/object:Gem::Requirement
+ requirement: &13622600 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ~>
@@ -65,10 +65,10 @@ dependencies:
  version: 2.7.0
  type: :development
  prerelease: false
- version_requirements: *21363940
+ version_requirements: *13622600
  - !ruby/object:Gem::Dependency
  name: rr
- requirement: &21362040 !ruby/object:Gem::Requirement
+ requirement: &13620140 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ~>
@@ -76,10 +76,10 @@ dependencies:
  version: 1.0.4
  type: :development
  prerelease: false
- version_requirements: *21362040
+ version_requirements: *13620140
  - !ruby/object:Gem::Dependency
  name: pry-nav
- requirement: &21355940 !ruby/object:Gem::Requirement
+ requirement: &13619680 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ~>
@@ -87,10 +87,21 @@ dependencies:
  version: 0.1.0
  type: :development
  prerelease: false
- version_requirements: *21355940
+ version_requirements: *13619680
+ - !ruby/object:Gem::Dependency
+ name: rake
+ requirement: &13619180 !ruby/object:Gem::Requirement
+ none: false
+ requirements:
+ - - ~>
+ - !ruby/object:Gem::Version
+ version: 0.9.2.2
+ type: :development
+ prerelease: false
+ version_requirements: *13619180
  description: A Ruby library for extracting data from websites and web based APIs.
- Supports most common document formats (i.e. HTML, XML, and JSON), and comes with
- a handy mechanism for iterating over paginated datasets.
+ Supports most common document formats (i.e. HTML, XML, CSV, and JSON), and comes
+ with a handy mechanism for iterating over paginated datasets.
  email: andrea.giulio.fiore@googlemail.com
  executables: []
  extensions: []
@@ -140,6 +151,9 @@ required_ruby_version: !ruby/object:Gem::Requirement
  - - ! '>='
  - !ruby/object:Gem::Version
  version: '0'
+ segments:
+ - 0
+ hash: 2441039337834275619
  required_rubygems_version: !ruby/object:Gem::Requirement
  none: false
  requirements:
@@ -153,4 +167,3 @@ signing_key:
  specification_version: 2
  summary: A toolkit for online data extraction.
  test_files: []
- has_rdoc: