extraloop 0.0.6 → 0.0.7
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/History.txt +8 -5
- data/README.md +12 -13
- data/examples/google_news_scraper.rb +3 -2
- data/examples/mod_pay_data.rb +32 -0
- data/lib/extraloop.rb +3 -0
- data/lib/extraloop/csv_extractor.rb +32 -0
- data/lib/extraloop/dom_extractor.rb +3 -3
- data/lib/extraloop/extraction_environment.rb +1 -0
- data/lib/extraloop/extraction_loop.rb +1 -1
- data/lib/extraloop/extractor_base.rb +3 -2
- data/lib/extraloop/json_extractor.rb +2 -6
- data/lib/extraloop/scraper_base.rb +26 -7
- data/spec/csv_extractor.rb +67 -0
- data/spec/dom_extractor_spec.rb +33 -6
- data/spec/fixtures/doc.csv +23 -0
- data/spec/json_extractor_spec.rb +38 -7
- data/spec/scraper_base_spec.rb +2 -5
- metadata +22 -18
data/History.txt
CHANGED
@@ -1,14 +1,17 @@
-== 0.0.
+== 0.0.7 / 2012-02-28
+* Added support for CSV data extraction.
+
+== 0.0.5 / 2012-01-14
 * Refactored #extract, #loop_on, and #set_hook to make a more idematic use of ruby blocks
 
-== 0.0.4 /
+== 0.0.4 / 2012-01-14
 * fixed a bug which prevented from subclassing `IterativeScraper` instances
 
-== 0.0.3 /
+== 0.0.3 / 2012-01-01
 * namespaced all classes into the ExtraLoop module
 
-== 0.0.2 /
+== 0.0.2 / 2012-01-01
 * changed repository URL
 
-== 0.0.1 /
+== 0.0.1 / 2012-01-01
 * Project Birthday!
data/README.md
CHANGED
@@ -1,6 +1,6 @@
 # Extra Loop
 
-A Ruby library for extracting data from websites and web based APIs.
+A Ruby library for extracting structured data from websites and web based APIs.
 Supports most common document formats (i.e. HTML, XML, and JSON), and comes with a handy mechanism
 for iterating over paginated datasets.
 
@@ -47,7 +47,7 @@ An iterative Scraper that fetches URL, title, and publisher from some 110 Google
 
 #### scraper options:
 
-* __format__ - Specifies the scraped document format
+* __format__ - Specifies the scraped document format; needed only if the Content-Type in the server response is not the correct one. Supported formats are: 'html', 'xml', 'json', and 'csv'.
 * __async__ - Specifies whether the scraper's HTTP requests should be run in parallel or in series (defaults to false). **Note:** currently only GET requests can be run asynchronously.
 * __log__ - Logging options hash:
   * __loglevel__ - a symbol specifying the desired log level (defaults to `:info`).
@@ -71,7 +71,7 @@ method extracts a specific piece of information from an element (e.g. a story's
     loop_on('div.post') { |posts| posts.reject { |post| post.attr(:class) == 'sticky' } }
 
 Both the `loop_on` and the `extract` methods may be called with a selector, a block or a combination of the two. By default, when parsing DOM documents, `extract` will call
-`Nokogiri::XML::Node#text()`. Alternatively, `extract` also accepts an attribute name and block
+`Nokogiri::XML::Node#text()`. Alternatively, `extract` also accepts an attribute name and a block. The latter is evaluated in the context of the current iteration's element.
 
     # extract a story's title
     extract(:title, 'h3')
@@ -82,13 +82,13 @@ Both the `loop_on` and the `extract` methods may be called with a selector, a bl
     # extract a description text, separating paragraphs with newlines
     extract(:description, "div.description") { |node| node.css("p").map(&:text).join("\n") }
 
-#### Extracting from JSON Documents
+#### Extracting data from JSON Documents
 
-While processing
+While processing an HTTP response, ExtraLoop tries to automatically detect the scraped document format by looking at
 the `ContentType` header sent by the server. This value can be overriden by providing a `:format` key in the scraper's
 initialization options. When format is JSON, the document is parsed using the `yajl` JSON parser and converted into a hash.
-In this case, both the `loop_on` and the `extract` methods still behave as illustrated above, except
-CSS3/XPath selectors
+In this case, both the `loop_on` and the `extract` methods still behave as illustrated above, except it does not support
+CSS3/XPath selectors.
 
 When working with JSON data, you can just use a block and have it return the document elements you want to loop on.
 
@@ -98,7 +98,7 @@ When working with JSON data, you can just use a block and have it return the doc
 Alternatively, the same loop can be defined by passing an array of keys pointing at a hash value located
 at several levels of depth down into the parsed document structure.
 
-    #
+    # Same as above, using a hash path
     loop_on(['query', 'categorymembers'])
 
 When fetching fields from a JSON document fragment, `extract` will often not need a block or an array of keys. If called with only
@@ -120,23 +120,22 @@ one argument, it will in fact try to fetch a hash value using the provided field
 
 The `IterativeScraper` class comes with two methods that allow scrapers to loop over paginated content.
 
-
+#### set\_iteration
 
 * __iteration_parameter__ - A symbol identifying the request parameter that the scraper will use as offset in order to iterate over the paginated content.
 * __array_or_range_or_block__ - Either an explicit set of values or a block of code. If provided, the block is called with the parsed document object as its first argument. The block should return a non empty array, which will determine the value of the offset parameter during each iteration. If the block fails to return a non empty array, the iteration stops.
 
+#### continue\_with
 
-The second iteration
-
-    continue_with(iteration_parameter, &block)
+The second iteration method, `#continue_with`, allows to continue an interation as long as a block of code returns a truthy, non-nil value (to be assigned to the iteration parameter).
 
 * __iteration_parameter__ - the scraper' iteration parameter.
 * __&block__ - An arbitrary block of ruby code, its return value will be used to determine the value of the next iteration's offset parameter.
 
-
 ### Running tests
 
 ExtraLoop uses `rspec` and `rr` as its testing framework. The test suite can be run by calling the `rspec` executable from within the `spec` directory:
 
     cd spec
    rspec *
+
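The two pagination methods documented in the README hunks above compose with the rest of the scraper chain. A minimal sketch of both styles, assuming made-up values: the URL, the `:p` parameter, and the selectors are illustrative only, not from the gem's own examples.

    require 'extraloop'

    # Explicit offsets: request p=1 through p=5.
    ExtraLoop::IterativeScraper.new("http://example.com/search?q=news").
      set_iteration(:p, (1..5).to_a).
      loop_on("div.result").
      extract(:title, "h3").
      on(:data) { |records| records.each { |record| puts record.title } }.
      run

    # Open-ended iteration: continue while the block returns a non-nil offset.
    offsets = (2..4).to_a
    ExtraLoop::IterativeScraper.new("http://example.com/search?q=news").
      continue_with(:p) { offsets.shift }.  # iteration stops once this returns nil
      loop_on("div.result").
      extract(:title, "h3").
      run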
data/examples/google_news_scraper.rb
CHANGED
@@ -1,10 +1,11 @@
 require '../lib/extraloop'
+require 'pry'
 
 results = []
 
 ExtraLoop::IterativeScraper.new("https://www.google.com/search?tbm=nws&q=Egypt", :log => {
-
-
+  #:log_level => :debug,
+  #:appenders => [ Logging.appenders.stderr ]
 
 }).set_iteration(:start, (1..101).step(10)).
   loop_on("h3") { |nodes| nodes.map(&:parent) }.
data/examples/mod_pay_data.rb
ADDED
@@ -0,0 +1,32 @@
+#
+# Fetch name, job title, and actual pay ceiling from a csv dataset containing UK Ministry of Defence's organogram and staff pay data
+#
+# source: http://data.gov.uk/dataset/staff-organograms-and-pay-mod
+#
+
+require "../lib/extraloop.rb"
+require "pry"
+
+class ModPayScraper < ExtraLoop::ScraperBase
+  def initialize
+    dataset_url = "http://www.mod.uk/NR/rdonlyres/FF9761D8-2AB9-4CD4-88BC-983A46A0CD90/0/20111208CTLBOrganogramFinal7Useniordata.csv"
+    super dataset_url, :format => :csv
+
+    # Select only record of officiers who earn more than 100k per year
+    loop_on do |rows|
+      rows[1..-1].select { |row| row[14].to_i > 100000 }
+    end
+
+    extract :name, "Name"
+    extract :title, "Job Title"
+    extract :pay, 14
+
+    on("data") do |records|
+
+      records.sort { |r1, r2| r2.pay <=> r1.pay }.each { |record| puts [record.pay, record.name].map { |string| string.ljust 7 }.join }
+    end
+  end
+end
+
+
+ModPayScraper.new.run
data/lib/extraloop.rb
CHANGED
@@ -16,6 +16,8 @@ gem "typhoeus"
 gem "logging"
 
 
+
+autoload :CSV, "csv"
 autoload :Nokogiri, "nokogiri"
 autoload :Yajl, "yajl"
 autoload :Typhoeus, "typhoeus"
@@ -29,6 +31,7 @@ ExtraLoop.autoload :ExtractionEnvironment , "#{base_path}/extraction_environment
 ExtraLoop.autoload :ExtractorBase , "#{base_path}/extractor_base"
 ExtraLoop.autoload :DomExtractor , "#{base_path}/dom_extractor"
 ExtraLoop.autoload :JsonExtractor , "#{base_path}/json_extractor"
+ExtraLoop.autoload :CsvExtractor , "#{base_path}/csv_extractor"
 ExtraLoop.autoload :ExtractionLoop , "#{base_path}/extraction_loop"
 ExtraLoop.autoload :ScraperBase , "#{base_path}/scraper_base"
 ExtraLoop.autoload :Loggable , "#{base_path}/loggable"
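The new `autoload :CSV, "csv"` registration defers loading Ruby's standard-library CSV parser until the constant is first referenced (here, by the new `CsvExtractor`). A standalone sketch of the mechanism, not part of the gem:

    # autoload registers a constant without requiring the file yet;
    # the require fires on first reference.
    autoload :CSV, "csv"

    p Object.autoload?(:CSV)  # => "csv" (load still pending)
    CSV.parse("a,b\n")        # first use triggers require "csv"
    p Object.autoload?(:CSV)  # => nil (already loaded)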
data/lib/extraloop/csv_extractor.rb
ADDED
@@ -0,0 +1,32 @@
+class ExtraLoop::CsvExtractor < ExtraLoop::ExtractorBase
+
+  def initialize(*args)
+    super(*args)
+    @selector = args[2] if args[2] && args[2].is_a?(Integer)
+  end
+
+  def extract_field(row, record=nil)
+    target = row = row.respond_to?(:entries)? row : parse(row)
+    headers = @environment.document.first
+    selector = !@selector && @field_name || @selector
+
+    # allow using CSV column names or array indices as selectors
+    target = row[headers.index(selector.to_s)] if selector && selector.to_s.match(/[a-z]/i)
+    target = row[selector] if selector.is_a?(Integer)
+
+    target = @environment.run(target, record, &@callback) if @callback
+    target
+  end
+
+  def extract_list(input)
+    rows = (input.respond_to?(:entries) ? input : parse(input))
+    Array(@callback && @environment.run(rows, &@callback) || rows)
+  end
+
+
+  def parse(input, options=Hash.new)
+    super(input)
+    document = CSV.parse(input, options)
+    @environment.document = document
+  end
+end
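As the hunk above shows, a `CsvExtractor` selector may be a column name (any value containing a letter, resolved against the header row) or an integer index into the row; with no selector, the field name itself is used. A minimal sketch of that resolution logic against plain `CSV.parse` output, using made-up data rather than the gem's fixture:

    require 'csv'

    csv  = "Name,Job Title,Pay\nAlice,Director,120000\nBob,Analyst,45000\n"
    rows = CSV.parse(csv)
    headers = rows.first                       # => ["Name", "Job Title", "Pay"]

    # selector contains a letter: look it up in the header row
    puts rows[1][headers.index("Job Title")]   # => "Director"

    # integer selector: index the row directly
    puts rows[1][2]                            # => "120000"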
data/lib/extraloop/dom_extractor.rb
CHANGED
@@ -11,7 +11,7 @@ module ExtraLoop
 
    def extract_field(node, record=nil)
      target = node = node.respond_to?(:document) ? node : parse(node)
-      target = node.
+      target = node.at(@selector) if @selector
      target = target.attr(@attribute) if target.respond_to?(:attr) && @attribute
      target = @environment.run(target, record, &@callback) if @callback
 
@@ -30,9 +30,9 @@ module ExtraLoop
    #
 
    def extract_list(input)
-      nodes = input.respond_to?(:document) ? input : parse(input)
+      nodes = (input.respond_to?(:document) ? input : parse(input))
      nodes = nodes.search(@selector) if @selector
-      @callback &&
+      Array(@callback && @environment.run(nodes, &@callback) || nodes)
    end
 
    def parse(input)
data/lib/extraloop/extraction_loop.rb
CHANGED
@@ -12,7 +12,7 @@ module ExtraLoop
    def initialize(loop_extractor, extractors=[], document=nil, hooks = {}, scraper = nil)
      @loop_extractor = loop_extractor
      @extractors = extractors
-      @document = @loop_extractor.parse(document)
+      @document = document.is_a?(String) ? @loop_extractor.parse(document) : document
      @records = []
      @hooks = hooks
      @environment = ExtractionEnvironment.new(scraper, @document, @records)
data/lib/extraloop/extractor_base.rb
CHANGED
@@ -1,5 +1,5 @@
 module ExtraLoop
-  # Pseudo Abstract class.
+  # Pseudo Abstract class from which all extractors inherit.
   # This should not be called directly
   #
   class ExtractorBase
@@ -9,8 +9,9 @@ module ExtraLoop
    end
 
    attr_reader :field_name
+
    #
-    # Public:
+    # Public: Initialises a Data extractor.
    #
    # Parameters:
    # field_name - The machine readable field name
data/lib/extraloop/json_extractor.rb
CHANGED
@@ -20,13 +20,9 @@ module ExtraLoop
    end
 
    def extract_list(input)
-
-      # into possible hash traversal techniques
-
-      input = input.is_a?(String) ? parse(input) : input
+      @environment.document = input = (input.is_a?(String) ? parse(input) : input)
      input = input.get_in(@path) if @path
-
-      @callback && Array(@environment.run(input, &@callback)) || input
+      @callback && @environment.run(input, &@callback) || input
    end
 
    def parse(input)
data/lib/extraloop/scraper_base.rb
CHANGED
@@ -61,7 +61,8 @@ module ExtraLoop
 
    def loop_on(*args, &block)
      args << block if block
-
+      # we prepend a nil value, as the loop extractor does not need to specify a field name
+      @loop_extractor_args = args.insert(0, nil)
      self
    end
 
@@ -79,7 +80,7 @@ module ExtraLoop
 
    def extract(*args, &block)
      args << block if block
-      @extractor_args << args
+      @extractor_args << args
      self
    end
 
@@ -144,24 +145,42 @@ module ExtraLoop
      @response_count += 1
      @loop = prepare_loop(response)
      log("response ##{@response_count} of #{@queued_count}, status code: [#{response.code}], URL fragment: ...#{response.effective_url.split('/').last if response.effective_url}")
-      @loop.run
 
+      @loop.run
      @environment = @loop.environment
      run_hook(:data, [@loop.records, response])
+      #TODO: add hock for scraper completion (useful in iterative scrapes).
    end
 
    def prepare_loop(response)
-
-
+      content_type = response.headers_hash.fetch('Content-Type', nil)
+      format = @options[:format] || detect_format(content_type)
+
+      extractor_classname = "#{format.to_s.capitalize}Extractor"
+      extractor_class = ExtraLoop.const_defined?(extractor_classname) && ExtraLoop.const_get(extractor_classname) || DomExtractor
+
+      @loop_extractor_args.insert(1, ExtractionEnvironment.new(self))
      loop_extractor = extractor_class.new(*@loop_extractor_args)
-
-
+
+      # There is no point in parsing response.body more than once, so we reuse
+      # the first parsed document
+
+      document = loop_extractor.parse(response.body)
+
+      extractors = @extractor_args.map do |args|
+        args.insert(1, ExtractionEnvironment.new(self, document))
+        extractor_class.new(*args)
+      end
+
+      ExtractionLoop.new(loop_extractor, extractors, document, @hooks, self)
    end
 
    def detect_format(content_type)
      #TODO: add support for xml/rdf documents
      if content_type && content_type =~ /json$/
        :json
+      elsif content_type && content_type =~ /(csv)|(comma-separated-values)$/
+        :csv
      else
        :html
      end
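The `prepare_loop` rewrite above picks an extractor class by capitalizing the detected format: `:csv` yields `CsvExtractor`, `:json` yields `JsonExtractor`, while `:html` yields no `HtmlExtractor` constant, so the `const_defined?` check falls back to `DomExtractor`. The new Content-Type detection can be exercised on its own; a sketch restating the method from the hunk above:

    def detect_format(content_type)
      if content_type && content_type =~ /json$/
        :json
      elsif content_type && content_type =~ /(csv)|(comma-separated-values)$/
        :csv
      else
        :html
      end
    end

    p detect_format("application/json")             # => :json
    p detect_format("text/csv")                     # => :csv
    p detect_format("text/comma-separated-values")  # => :csv
    p detect_format("text/html; charset=utf-8")     # => :html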
data/spec/csv_extractor.rb
ADDED
@@ -0,0 +1,67 @@
+require 'helpers/spec_helper'
+
+describe JsonExtractor do
+  before(:each) do
+    stub(scraper = Object.new).options
+    stub(scraper).results
+    @env = ExtractionEnvironment.new(scraper)
+
+    File.open('fixtures/doc.csv', 'r') { |file|
+      @csv = file.read
+      @parsed_csv = CSV.parse(@csv)
+      file.close
+    }
+
+  end
+
+  describe "#extract_field" do
+    context "with only a field name defined" do
+      before do
+        @extractor = CsvExtractor.new(:customer_company_name, @env)
+        @extractor.parse(@csv)
+      end
+
+      subject { @extractor.extract_field @parsed_csv[2] }
+      it { should eql("Utility A") }
+    end
+
+    context "with a field name and a selector defined" do
+      before do
+        @extractor = CsvExtractor.new(:name, @env, "customer_company_name")
+        @extractor.parse(@csv)
+      end
+      subject { @extractor.extract_field @parsed_csv[2] }
+      it { should eql("Utility A") }
+    end
+
+    context "with a field name, using a numerical index as selector", :onlythis => true do
+      before do
+        @extractor = CsvExtractor.new(:company_name, @env, 2)
+        @extractor.parse(@csv)
+      end
+      subject { @extractor.extract_field @parsed_csv[2] }
+      it { should eql("Utility A") }
+    end
+
+    context "Without any other arguments but a callback" do
+      before do
+        @extractor = CsvExtractor.new nil, @env, proc { |row| row[2] }
+        @extractor.parse(@csv)
+      end
+      subject { @extractor.extract_field @parsed_csv[2] }
+      it { should eql("Utility A") }
+    end
+  end
+
+  describe "#extract_list" do
+    context "with no arguments" do
+      subject { CsvExtractor.new(nil, @env).extract_list(@csv) }
+      it { should eql(@parsed_csv) }
+    end
+
+    context "with a callback" do
+      subject { CsvExtractor.new(nil, @env, proc { |rows| rows[0..10] }).extract_list(@csv) }
+      it { should eql(@parsed_csv[0..10]) }
+    end
+  end
+end
data/spec/dom_extractor_spec.rb
CHANGED
@@ -54,17 +54,31 @@ describe DomExtractor do
    end
  end
 
-  context "when a selector and a block is provided" do
+  context "when a selector and a block is provided", :bla => true do
    before do
+      document_defined = scraper_defined = false
+
      @extractor = DomExtractor.new(:anchor, @env, "p a", proc { |node|
+        document_defined = @document && @document.is_a?(Nokogiri::HTML::Document)
+        scraper_defined = instance_variable_defined? "@scraper"
        node.text.gsub("dummy", "fancy")
      })
+
      @node = @extractor.parse(@html)
+      @output = @extractor.extract_field(@node)
+
+      @scraper_defined = scraper_defined
+      @document_defined = document_defined
    end
 
    describe "#extract_field" do
-
-
+      it "should return the block output" do
+        @output.should match(/my\sfancy/)
+      end
+      it "should add the @scraper and @document instance variables to the extraction environment" do
+        @scraper_defined.should be_true
+        @document_defined.should be_true
+      end
    end
  end
 
@@ -93,6 +107,7 @@ describe DomExtractor do
    end
  end
 
+
  context "when nothing but a field name is provided" do
    before do
      @extractor = DomExtractor.new(:url, @env)
@@ -117,13 +132,25 @@ describe DomExtractor do
 
  context "block provided" do
    before do
-
+      document_defined = scraper_defined = false
+
+      @extractor = DomExtractor.new(nil, @env, "div.entry", proc { |nodeList|
+        document_defined = @document && @document.is_a?(Nokogiri::HTML::Document)
+        scraper_defined = instance_variable_defined? "@scraper"
+
      nodeList.reject {|node| node.attr(:class).split(" ").include?('exclude') }
      })
+
+      @output = @extractor.extract_list(@html)
+      @scraper_defined = scraper_defined
+      @document_defined = document_defined
    end
 
-
-    it
+    it { @output.should have(2).items }
+    it "should add @scraper and @document instance variables to the ExtractionEnvironment instance" do
+      @scraper_defined.should be_true
+      @document_defined.should be_true
+    end
  end
 end
 
data/spec/fixtures/doc.csv
ADDED
@@ -0,0 +1,23 @@
+contract_id,seller_company_name,customer_company_name,customer_duns_number,contract_affiliate,FERC_tariff_reference,contract_service_agreement_id,contract_execution_date,contract_commencement_date,contract_termination_date,actual_termination_date,extension_provision_description,class_name,term_name,increment_name,increment_peaking_name,product_type_name,product_name,quantity,units_for_contract,rate,rate_minimum,rate_maximum,rate_description,units_for_rate,point_of_receipt_control_area,point_of_receipt_specific_location,point_of_delivery_control_area,point_of_delivery_specific_location,begin_date,end_date,time_zone
+C71,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Original Volume No. 10,2,2/15/2001,2/15/2001,,,Evergreen,N/A,N/A,N/A,N/A,MB,ENERGY,0,, , , ,Market Based,,,,,,,,ES
+C72,The Electric Company,Utility A,38495837,n,FERC Electric Tariff Original Volume No. 10,15,7/25/2001,8/1/2001,,,Evergreen,N/A,N/A,N/A,N/A,MB,ENERGY,0,, , , ,Market Based,,,,,,,,ES
+C73,The Electric Company,Utility B,493758794,N,FERC Electric Tariff Original Volume No. 10,7,6/8/2001,7/6/2001,,,Evergreen,N/A,N/A,N/A,N/A,MB,ENERGY,0,, , , ,Market Based,,,, , ,,,ep
+C74,The Electric Company,Utility C,594739573,n,FERC Electric Tariff Original Volume No. 10,25,6/8/2001,7/6/2001,,,Evergreen,N/A,N/A,N/A,N/A,MB,ENERGY,0,, , , ,Market Based,,,, , ,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,ENERGY,2000,KWh,.1475, , ,Max amount of capacity and energy to be transmitted. Bill based on monthly max delivery to City.,$/KWh,PJM,Point A,PJM,Point B,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,point-to-point agreement,2000,KW,0.01, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,network,2000,KW,0.2, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,BLACK START SERVICE,2000,KW,0.22, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,CAPACITY,2000,KW,0.04, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,regulation & frequency response,2000,KW,0.1, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,real power transmission loss,2000,KW,7, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+C76,The Electric Company,The Power Company,456534333,N,FERC Electric Tariff Original Volume No. 10,132,12/15/2001,1/1/2002,12/31/2004,12/31/2004,None,F,LT,M,FP,MB,CAPACITY,70,MW,3750, , ,70MW for each and every hour over the term of the agreement (7x24 schedule).,$/MW,,,,,,,ep
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,35, , ,,$/MWH,,,PJM,Bus 4321,20020101,20030101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,37, , ,,$/MWH,,,PJM,Bus 4321,20030101,20040101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,39, , ,,$/MWH,,,PJM,Bus 4321,20040101,20050101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,41, , ,,$/MWH,,,PJM,Bus 4321,20050101,20060101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,43, , ,,$/MWH,,,PJM,Bus 4321,20060101,20070101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,45, , ,,$/MWH,,,PJM,Bus 4321,20070101,20080101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,47, , ,,$/MWH,,,PJM,Bus 4321,20080101,20090101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,49, , ,,$/MWH,,,PJM,Bus 4321,20090101,20100101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,51, , ,,$/MWH,,,PJM,Bus 4321,20100101,20110101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,53, , ,,$/MWH,,,PJM,Bus 4321,20110101,20120101,EP
data/spec/json_extractor_spec.rb
CHANGED
@@ -10,6 +10,7 @@ describe JsonExtractor do
      content = file.read
      file.close
      content
+
    }.call()
  end
 
@@ -37,12 +38,27 @@ describe JsonExtractor do
 
  context "field_name and callback" do
    before do
-
+      scraper_defined = document_defined = false
+
+      @extractor = JsonExtractor.new(:from_user, @env, proc { |node|
+        document_defined = @document && @document.is_a?(Hash)
+        scraper_defined = instance_variable_defined? "@scraper"
+
+        node['from_user_name']
+      })
+
      @node = @extractor.parse(@json)['results'].first
+      @output = @extractor.extract_field(@node)
+
+      @scraper_defined = scraper_defined
+      @document_defined = document_defined
    end
 
-
-    it
+    it { @output.should eql("Ludovic kohn") }
+    it "should add the @scraper and @document instance variables to the extraction environment" do
+      @scraper_defined.should be_true
+      @document_defined.should be_true
+    end
  end
 
  context "field_name and attribute" do
@@ -108,12 +124,27 @@ describe JsonExtractor do
 
  context "with pre-parsed input" do
    before do
-
+      document_defined = scraper_defined = false
+
+      @extractor = JsonExtractor.new(nil, @env, proc { |data|
+        document_defined = @document && @document.is_a?(Hash)
+        scraper_defined = instance_variable_defined? "@scraper"
+        data['results']
+      })
+
+
+      @output = @extractor.extract_list((Yajl::Parser.new).parse(@json))
+      @scraper_defined = scraper_defined
+      @document_defined = document_defined
    end
 
-
-    it {
-
+    it { @output.size.should eql(15) }
+    it { @output.should be_an_instance_of(Array) }
+
+    it "should add the @scraper and @document instance variables to the extraction environment" do
+      @scraper_defined.should be_true
+      @document_defined.should be_true
+    end
  end
 
 end
data/spec/scraper_base_spec.rb
CHANGED
@@ -12,7 +12,6 @@ describe ScraperBase do
    @scraper = ScraperBase.new("http://localhost/fixture")
  end
 
-
  describe "#loop_on" do
    subject { @scraper.loop_on("bla.bla") }
    it { should be_an_instance_of(ScraperBase) }
@@ -113,7 +112,7 @@ describe ScraperBase do
    stub(@fake_loop).environment { ExtractionEnvironment.new }
    stub(@fake_loop).records { Array(1..3).map { |n| Object.new } }
 
-    mock(ExtractionLoop).new(is_a(DomExtractor), is_a(Array), is_a(
+    mock(ExtractionLoop).new(is_a(DomExtractor), is_a(Array), is_a(Nokogiri::HTML::Document), is_a(Hash), is_a(ScraperBase)).times(3) { @fake_loop }
  end
 
 
@@ -157,10 +156,9 @@ describe ScraperBase do
    stub(@fake_loop).environment { ExtractionEnvironment.new }
    stub(@fake_loop).records { Array(1..3).map { |n| Object.new } }
 
-    mock(ExtractionLoop).new(is_a(DomExtractor), is_a(Array), is_a(
+    mock(ExtractionLoop).new(is_a(DomExtractor), is_a(Array), is_a(Nokogiri::HTML::Document), is_a(Hash), is_a(ScraperBase)).times(@urls.size) { @fake_loop }
  end
 
-
  it "Should handle response" do
    @scraper.run
    @results.size.should eql(@urls.size * 3)
@@ -168,5 +166,4 @@ describe ScraperBase do
    end
  end
 end
-
 end
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: extraloop
 version: !ruby/object:Gem::Version
-  version: 0.0.6
+  version: 0.0.7
 prerelease:
 platform: ruby
 authors:
@@ -9,11 +9,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-
+date: 2012-02-28 00:00:00.000000000Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: yajl-ruby
-  requirement: &
+  requirement: &21376200 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -21,10 +21,10 @@ dependencies:
         version: 1.1.0
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *21376200
 - !ruby/object:Gem::Dependency
   name: nokogiri
-  requirement: &
+  requirement: &21373200 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -32,10 +32,10 @@ dependencies:
         version: 1.5.0
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *21373200
 - !ruby/object:Gem::Dependency
   name: typhoeus
-  requirement: &
+  requirement: &21368180 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -43,10 +43,10 @@ dependencies:
         version: 0.3.2
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *21368180
 - !ruby/object:Gem::Dependency
   name: logging
-  requirement: &
+  requirement: &21365740 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -54,10 +54,10 @@ dependencies:
         version: 0.6.1
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *21365740
 - !ruby/object:Gem::Dependency
   name: rspec
-  requirement: &
+  requirement: &21363940 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -65,10 +65,10 @@ dependencies:
         version: 2.7.0
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *21363940
 - !ruby/object:Gem::Dependency
   name: rr
-  requirement: &
+  requirement: &21362040 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -76,18 +76,18 @@ dependencies:
         version: 1.0.4
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *21362040
 - !ruby/object:Gem::Dependency
-  name: pry
-  requirement: &
+  name: pry-nav
+  requirement: &21355940 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
     - !ruby/object:Gem::Version
-      version: 0.
+      version: 0.1.0
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *21355940
 description: A Ruby library for extracting data from websites and web based APIs.
   Supports most common document formats (i.e. HTML, XML, and JSON), and comes with
   a handy mechanism for iterating over paginated datasets.
@@ -99,9 +99,11 @@ files:
 - History.txt
 - README.md
 - examples/google_news_scraper.rb
+- examples/mod_pay_data.rb
 - examples/wikipedia_categories.rb
 - examples/wikipedia_categories_recoursive.rb
 - lib/extraloop.rb
+- lib/extraloop/csv_extractor.rb
 - lib/extraloop/dom_extractor.rb
 - lib/extraloop/extraction_environment.rb
 - lib/extraloop/extraction_loop.rb
@@ -112,8 +114,10 @@ files:
 - lib/extraloop/loggable.rb
 - lib/extraloop/scraper_base.rb
 - lib/extraloop/utils.rb
+- spec/csv_extractor.rb
 - spec/dom_extractor_spec.rb
 - spec/extraction_loop_spec.rb
+- spec/fixtures/doc.csv
 - spec/fixtures/doc.html
 - spec/fixtures/doc.json
 - spec/helpers/scraper_helper.rb