extraloop 0.0.6 → 0.0.7
- data/History.txt +8 -5
- data/README.md +12 -13
- data/examples/google_news_scraper.rb +3 -2
- data/examples/mod_pay_data.rb +32 -0
- data/lib/extraloop.rb +3 -0
- data/lib/extraloop/csv_extractor.rb +32 -0
- data/lib/extraloop/dom_extractor.rb +3 -3
- data/lib/extraloop/extraction_environment.rb +1 -0
- data/lib/extraloop/extraction_loop.rb +1 -1
- data/lib/extraloop/extractor_base.rb +3 -2
- data/lib/extraloop/json_extractor.rb +2 -6
- data/lib/extraloop/scraper_base.rb +26 -7
- data/spec/csv_extractor.rb +67 -0
- data/spec/dom_extractor_spec.rb +33 -6
- data/spec/fixtures/doc.csv +23 -0
- data/spec/json_extractor_spec.rb +38 -7
- data/spec/scraper_base_spec.rb +2 -5
- metadata +22 -18
data/History.txt
CHANGED
@@ -1,14 +1,17 @@
-== 0.0.
+== 0.0.7 / 2012-02-28
+* Added support for CSV data extraction.
+
+== 0.0.5 / 2012-01-14
 * Refactored #extract, #loop_on, and #set_hook to make a more idematic use of ruby blocks
 
-== 0.0.4 /
+== 0.0.4 / 2012-01-14
 * fixed a bug which prevented from subclassing `IterativeScraper` instances
 
-== 0.0.3 /
+== 0.0.3 / 2012-01-01
 * namespaced all classes into the ExtraLoop module
 
-== 0.0.2 /
+== 0.0.2 / 2012-01-01
 * changed repository URL
 
-== 0.0.1 /
+== 0.0.1 / 2012-01-01
 * Project Birthday!
data/README.md
CHANGED
@@ -1,6 +1,6 @@
 # Extra Loop
 
-A Ruby library for extracting data from websites and web based APIs.
+A Ruby library for extracting structured data from websites and web based APIs.
 Supports most common document formats (i.e. HTML, XML, and JSON), and comes with a handy mechanism
 for iterating over paginated datasets.
 
@@ -47,7 +47,7 @@ An iterative Scraper that fetches URL, title, and publisher from some 110 Google
 
 #### scraper options:
 
-* __format__ - Specifies the scraped document format
+* __format__ - Specifies the scraped document format; needed only if the Content-Type in the server response is not the correct one. Supported formats are: 'html', 'xml', 'json', and 'csv'.
 * __async__ - Specifies whether the scraper's HTTP requests should be run in parallel or in series (defaults to false). **Note:** currently only GET requests can be run asynchronously.
 * __log__ - Logging options hash:
   * __loglevel__ - a symbol specifying the desired log level (defaults to `:info`).
@@ -71,7 +71,7 @@ method extracts a specific piece of information from an element (e.g. a story's
 loop_on('div.post') { |posts| posts.reject { |post| post.attr(:class) == 'sticky' } }
 
 Both the `loop_on` and the `extract` methods may be called with a selector, a block or a combination of the two. By default, when parsing DOM documents, `extract` will call
-`Nokogiri::XML::Node#text()`. Alternatively, `extract` also accepts an attribute name and block
+`Nokogiri::XML::Node#text()`. Alternatively, `extract` also accepts an attribute name and a block. The latter is evaluated in the context of the current iteration's element.
 
 # extract a story's title
 extract(:title, 'h3')
@@ -82,13 +82,13 @@ Both the `loop_on` and the `extract` methods may be called with a selector, a bl
 # extract a description text, separating paragraphs with newlines
 extract(:description, "div.description") { |node| node.css("p").map(&:text).join("\n") }
 
-#### Extracting from JSON Documents
+#### Extracting data from JSON Documents
 
-While processing
+While processing an HTTP response, ExtraLoop tries to automatically detect the scraped document format by looking at
 the `ContentType` header sent by the server. This value can be overriden by providing a `:format` key in the scraper's
 initialization options. When format is JSON, the document is parsed using the `yajl` JSON parser and converted into a hash.
-In this case, both the `loop_on` and the `extract` methods still behave as illustrated above, except
-CSS3/XPath selectors
+In this case, both the `loop_on` and the `extract` methods still behave as illustrated above, except it does not support
+CSS3/XPath selectors.
 
 When working with JSON data, you can just use a block and have it return the document elements you want to loop on.
 
@@ -98,7 +98,7 @@ When working with JSON data, you can just use a block and have it return the doc
 Alternatively, the same loop can be defined by passing an array of keys pointing at a hash value located
 at several levels of depth down into the parsed document structure.
 
-#
+# Same as above, using a hash path
 loop_on(['query', 'categorymembers'])
 
 When fetching fields from a JSON document fragment, `extract` will often not need a block or an array of keys. If called with only
@@ -120,23 +120,22 @@ one argument, it will in fact try to fetch a hash value using the provided field
 
 The `IterativeScraper` class comes with two methods that allow scrapers to loop over paginated content.
 
-
+#### set\_iteration
 
 * __iteration_parameter__ - A symbol identifying the request parameter that the scraper will use as offset in order to iterate over the paginated content.
 * __array_or_range_or_block__ - Either an explicit set of values or a block of code. If provided, the block is called with the parsed document object as its first argument. The block should return a non empty array, which will determine the value of the offset parameter during each iteration. If the block fails to return a non empty array, the iteration stops.
 
+#### continue\_with
 
-The second iteration
-
-continue_with(iteration_parameter, &block)
+The second iteration method, `#continue_with`, allows to continue an interation as long as a block of code returns a truthy, non-nil value (to be assigned to the iteration parameter).
 
 * __iteration_parameter__ - the scraper' iteration parameter.
 * __&block__ - An arbitrary block of ruby code, its return value will be used to determine the value of the next iteration's offset parameter.
 
-
 ### Running tests
 
 ExtraLoop uses `rspec` and `rr` as its testing framework. The test suite can be run by calling the `rspec` executable from within the `spec` directory:
 
     cd spec
     rspec *
+
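The hash-path form of `loop_on` described in the README diff above can be illustrated with a small, self-contained sketch (plain Ruby; the `get_in` helper here is a hypothetical stand-in for ExtraLoop's internal hash traversal, not its public API):

```ruby
# Hypothetical get_in helper: walks nested hashes one key at a time,
# returning nil as soon as a key is missing.
def get_in(document, path)
  path.reduce(document) { |fragment, key| fragment && fragment[key] }
end

# Shape of a parsed MediaWiki-style response, as in the README example
parsed = {
  "query" => {
    "categorymembers" => [
      { "title" => "Page A" },
      { "title" => "Page B" }
    ]
  }
}

# The array the scraper would loop on
members = get_in(parsed, ["query", "categorymembers"])
```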
data/examples/google_news_scraper.rb
CHANGED
@@ -1,10 +1,11 @@
 require '../lib/extraloop'
+require 'pry'
 
 results = []
 
 ExtraLoop::IterativeScraper.new("https://www.google.com/search?tbm=nws&q=Egypt", :log => {
-
-
+  #:log_level => :debug,
+  #:appenders => [ Logging.appenders.stderr ]
 
 }).set_iteration(:start, (1..101).step(10)).
   loop_on("h3") { |nodes| nodes.map(&:parent) }.
data/examples/mod_pay_data.rb
ADDED
@@ -0,0 +1,32 @@
+#
+# Fetch name, job title, and actual pay ceiling from a csv dataset containing UK Ministry of Defence's organogram and staff pay data
+#
+# source: http://data.gov.uk/dataset/staff-organograms-and-pay-mod
+#
+
+require "../lib/extraloop.rb"
+require "pry"
+
+class ModPayScraper < ExtraLoop::ScraperBase
+  def initialize
+    dataset_url = "http://www.mod.uk/NR/rdonlyres/FF9761D8-2AB9-4CD4-88BC-983A46A0CD90/0/20111208CTLBOrganogramFinal7Useniordata.csv"
+    super dataset_url, :format => :csv
+
+    # Select only record of officiers who earn more than 100k per year
+    loop_on do |rows|
+      rows[1..-1].select { |row| row[14].to_i > 100000 }
+    end
+
+    extract :name, "Name"
+    extract :title, "Job Title"
+    extract :pay, 14
+
+    on("data") do |records|
+
+      records.sort { |r1, r2| r2.pay <=> r1.pay }.each { |record| puts [record.pay, record.name].map { |string| string.ljust 7 }.join }
+    end
+  end
+end
+
+
+ModPayScraper.new.run
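The `loop_on` filter in the new example above (header row dropped, pay column compared against the threshold) can be tried in isolation with Ruby's stdlib CSV. The tiny dataset and column index 2 below are invented for illustration; the MoD dataset keys pay on index 14:

```ruby
require "csv"

# Invented miniature dataset standing in for the MoD organogram file
csv = "name,title,pay\nA,Director,150000\nB,Analyst,45000\nC,Director General,101000\n"
rows = CSV.parse(csv)

# Drop the header row, keep rows whose pay column exceeds 100000
selected = rows[1..-1].select { |row| row[2].to_i > 100_000 }
```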
data/lib/extraloop.rb
CHANGED
@@ -16,6 +16,8 @@ gem "typhoeus"
 gem "logging"
 
 
+
+autoload :CSV, "csv"
 autoload :Nokogiri, "nokogiri"
 autoload :Yajl, "yajl"
 autoload :Typhoeus, "typhoeus"
@@ -29,6 +31,7 @@ ExtraLoop.autoload :ExtractionEnvironment , "#{base_path}/extraction_environment
 ExtraLoop.autoload :ExtractorBase         , "#{base_path}/extractor_base"
 ExtraLoop.autoload :DomExtractor          , "#{base_path}/dom_extractor"
 ExtraLoop.autoload :JsonExtractor         , "#{base_path}/json_extractor"
+ExtraLoop.autoload :CsvExtractor          , "#{base_path}/csv_extractor"
 ExtraLoop.autoload :ExtractionLoop        , "#{base_path}/extraction_loop"
 ExtraLoop.autoload :ScraperBase           , "#{base_path}/scraper_base"
 ExtraLoop.autoload :Loggable              , "#{base_path}/loggable"
data/lib/extraloop/csv_extractor.rb
ADDED
@@ -0,0 +1,32 @@
+class ExtraLoop::CsvExtractor < ExtraLoop::ExtractorBase
+
+  def initialize(*args)
+    super(*args)
+    @selector = args[2] if args[2] && args[2].is_a?(Integer)
+  end
+
+  def extract_field(row, record=nil)
+    target = row = row.respond_to?(:entries)? row : parse(row)
+    headers = @environment.document.first
+    selector = !@selector && @field_name || @selector
+
+    # allow using CSV column names or array indices as selectors
+    target = row[headers.index(selector.to_s)] if selector && selector.to_s.match(/[a-z]/i)
+    target = row[selector] if selector.is_a?(Integer)
+
+    target = @environment.run(target, record, &@callback) if @callback
+    target
+  end
+
+  def extract_list(input)
+    rows = (input.respond_to?(:entries) ? input : parse(input))
+    Array(@callback && @environment.run(rows, &@callback) || rows)
+  end
+
+
+  def parse(input, options=Hash.new)
+    super(input)
+    document = CSV.parse(input, options)
+    @environment.document = document
+  end
+end
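The selector logic in the new `CsvExtractor#extract_field` above (a column name is resolved to an index via the header row, while a numeric selector indexes the row directly) can be sketched in isolation with stdlib CSV; the sample document is invented:

```ruby
require "csv"

# Invented sample document: first row is the header
doc = CSV.parse("name,pay\nAlice,120000\nBob,90000\n")
headers = doc.first
row = doc[1]

# String selector: resolve the column name to an index via the header row
value = row[headers.index("pay")]

# Integer selector: index into the row directly
value_by_index = row[1]
```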
data/lib/extraloop/dom_extractor.rb
CHANGED
@@ -11,7 +11,7 @@ module ExtraLoop
 
   def extract_field(node, record=nil)
     target = node = node.respond_to?(:document) ? node : parse(node)
-    target = node.
+    target = node.at(@selector) if @selector
     target = target.attr(@attribute) if target.respond_to?(:attr) && @attribute
     target = @environment.run(target, record, &@callback) if @callback
 
@@ -30,9 +30,9 @@ module ExtraLoop
   #
 
   def extract_list(input)
-    nodes = input.respond_to?(:document) ? input : parse(input)
+    nodes = (input.respond_to?(:document) ? input : parse(input))
     nodes = nodes.search(@selector) if @selector
-    @callback &&
+    Array(@callback && @environment.run(nodes, &@callback) || nodes)
   end
 
   def parse(input)
data/lib/extraloop/extraction_loop.rb
CHANGED
@@ -12,7 +12,7 @@ module ExtraLoop
   def initialize(loop_extractor, extractors=[], document=nil, hooks = {}, scraper = nil)
     @loop_extractor = loop_extractor
     @extractors = extractors
-    @document = @loop_extractor.parse(document)
+    @document = document.is_a?(String) ? @loop_extractor.parse(document) : document
     @records = []
     @hooks = hooks
     @environment = ExtractionEnvironment.new(scraper, @document, @records)
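The guard introduced above (parse only when handed a raw string, reuse an already-parsed document otherwise) can be sketched outside ExtraLoop; `FakeExtractor` is a hypothetical stand-in for a real extractor:

```ruby
# FakeExtractor is a hypothetical stand-in; only its parse method matters here
class FakeExtractor
  def parse(input)
    { parsed: input }
  end
end

# Parse raw strings once; pass pre-parsed documents through untouched
def prepare_document(extractor, document)
  document.is_a?(String) ? extractor.parse(document) : document
end
```

Handing `prepare_document` a hash or array returns the object unchanged, which is what lets a single parsed document be shared across extractors instead of being re-parsed for each one.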
data/lib/extraloop/extractor_base.rb
CHANGED
@@ -1,5 +1,5 @@
 module ExtraLoop
-  # Pseudo Abstract class.
+  # Pseudo Abstract class from which all extractors inherit.
   # This should not be called directly
   #
   class ExtractorBase
@@ -9,8 +9,9 @@ module ExtraLoop
     end
 
     attr_reader :field_name
+
     #
-    # Public:
+    # Public: Initialises a Data extractor.
     #
     # Parameters:
     # field_name - The machine readable field name
data/lib/extraloop/json_extractor.rb
CHANGED
@@ -20,13 +20,9 @@ module ExtraLoop
     end
 
     def extract_list(input)
-
-      # into possible hash traversal techniques
-
-      input = input.is_a?(String) ? parse(input) : input
+      @environment.document = input = (input.is_a?(String) ? parse(input) : input)
       input = input.get_in(@path) if @path
-
-      @callback && Array(@environment.run(input, &@callback)) || input
+      @callback && @environment.run(input, &@callback) || input
     end
 
     def parse(input)
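The rewritten `extract_list` above leans on Ruby's `&&`/`||` precedence for the callback fallback: with no callback (or a callback that returns nil) the input passes through unchanged. A minimal standalone sketch of that pattern:

```ruby
# Stand-in for the `@callback && @environment.run(...) || input` pattern
def apply_callback(input, callback = nil)
  callback && callback.call(input) || input
end
```

One consequence worth noting: a callback that deliberately returns nil or false also falls back to the original input.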
data/lib/extraloop/scraper_base.rb
CHANGED
@@ -61,7 +61,8 @@ module ExtraLoop
 
   def loop_on(*args, &block)
     args << block if block
-
+    # we prepend a nil value, as the loop extractor does not need to specify a field name
+    @loop_extractor_args = args.insert(0, nil)
     self
   end
 
@@ -79,7 +80,7 @@ module ExtraLoop
 
   def extract(*args, &block)
     args << block if block
-    @extractor_args << args
+    @extractor_args << args
     self
   end
 
@@ -144,24 +145,42 @@ module ExtraLoop
     @response_count += 1
     @loop = prepare_loop(response)
     log("response ##{@response_count} of #{@queued_count}, status code: [#{response.code}], URL fragment: ...#{response.effective_url.split('/').last if response.effective_url}")
-    @loop.run
 
+    @loop.run
     @environment = @loop.environment
     run_hook(:data, [@loop.records, response])
+    #TODO: add hock for scraper completion (useful in iterative scrapes).
   end
 
   def prepare_loop(response)
-
-
+    content_type = response.headers_hash.fetch('Content-Type', nil)
+    format = @options[:format] || detect_format(content_type)
+
+    extractor_classname = "#{format.to_s.capitalize}Extractor"
+    extractor_class = ExtraLoop.const_defined?(extractor_classname) && ExtraLoop.const_get(extractor_classname) || DomExtractor
+
+    @loop_extractor_args.insert(1, ExtractionEnvironment.new(self))
     loop_extractor = extractor_class.new(*@loop_extractor_args)
-
-
+
+    # There is no point in parsing response.body more than once, so we reuse
+    # the first parsed document
+
+    document = loop_extractor.parse(response.body)
+
+    extractors = @extractor_args.map do |args|
+      args.insert(1, ExtractionEnvironment.new(self, document))
+      extractor_class.new(*args)
+    end
+
+    ExtractionLoop.new(loop_extractor, extractors, document, @hooks, self)
   end
 
   def detect_format(content_type)
     #TODO: add support for xml/rdf documents
     if content_type && content_type =~ /json$/
       :json
+    elsif content_type && content_type =~ /(csv)|(comma-separated-values)$/
+      :csv
     else
       :html
     end
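The `detect_format` logic extended above can be exercised outside the scraper; this standalone sketch reuses the same regular expressions:

```ruby
# Same regular expressions as the detect_format method above,
# lifted into a standalone function
def detect_format(content_type)
  if content_type && content_type =~ /json$/
    :json
  elsif content_type && content_type =~ /(csv)|(comma-separated-values)$/
    :csv
  else
    :html
  end
end

detect_format("application/json")  # => :json
detect_format("text/csv")          # => :csv
detect_format(nil)                 # => :html
```

Note that the anchored `/json$/` will not match a Content-Type carrying a trailing charset parameter (e.g. `application/json; charset=utf-8`); such responses fall back to `:html` unless `:format` is passed explicitly.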
data/spec/csv_extractor.rb
ADDED
@@ -0,0 +1,67 @@
+require 'helpers/spec_helper'
+
+describe JsonExtractor do
+  before(:each) do
+    stub(scraper = Object.new).options
+    stub(scraper).results
+    @env = ExtractionEnvironment.new(scraper)
+
+    File.open('fixtures/doc.csv', 'r') { |file|
+      @csv = file.read
+      @parsed_csv = CSV.parse(@csv)
+      file.close
+    }
+
+  end
+
+  describe "#extract_field" do
+    context "with only a field name defined" do
+      before do
+        @extractor = CsvExtractor.new(:customer_company_name, @env)
+        @extractor.parse(@csv)
+      end
+
+      subject { @extractor.extract_field @parsed_csv[2] }
+      it { should eql("Utility A") }
+    end
+
+    context "with a field name and a selector defined" do
+      before do
+        @extractor = CsvExtractor.new(:name, @env, "customer_company_name")
+        @extractor.parse(@csv)
+      end
+      subject { @extractor.extract_field @parsed_csv[2] }
+      it { should eql("Utility A") }
+    end
+
+    context "with a field name, using a numerical index as selector", :onlythis => true do
+      before do
+        @extractor = CsvExtractor.new(:company_name, @env, 2)
+        @extractor.parse(@csv)
+      end
+      subject { @extractor.extract_field @parsed_csv[2] }
+      it { should eql("Utility A") }
+    end
+
+    context "Without any other arguments but a callback" do
+      before do
+        @extractor = CsvExtractor.new nil, @env, proc { |row| row[2] }
+        @extractor.parse(@csv)
+      end
+      subject { @extractor.extract_field @parsed_csv[2] }
+      it { should eql("Utility A") }
+    end
+  end
+
+  describe "#extract_list" do
+    context "with no arguments" do
+      subject { CsvExtractor.new(nil, @env).extract_list(@csv) }
+      it { should eql(@parsed_csv) }
+    end
+
+    context "with a callback" do
+      subject { CsvExtractor.new(nil, @env, proc { |rows| rows[0..10] }).extract_list(@csv) }
+      it { should eql(@parsed_csv[0..10]) }
+    end
+  end
+end
data/spec/dom_extractor_spec.rb
CHANGED
@@ -54,17 +54,31 @@ describe DomExtractor do
   end
 end
 
-context "when a selector and a block is provided" do
+context "when a selector and a block is provided", :bla => true do
   before do
+    document_defined = scraper_defined = false
+
    @extractor = DomExtractor.new(:anchor, @env, "p a", proc { |node|
+      document_defined = @document && @document.is_a?(Nokogiri::HTML::Document)
+      scraper_defined = instance_variable_defined? "@scraper"
      node.text.gsub("dummy", "fancy")
    })
+
    @node = @extractor.parse(@html)
+    @output = @extractor.extract_field(@node)
+
+    @scraper_defined = scraper_defined
+    @document_defined = document_defined
  end
 
  describe "#extract_field" do
-
-
+    it "should return the block output" do
+      @output.should match(/my\sfancy/)
+    end
+    it "should add the @scraper and @document instance variables to the extraction environment" do
+      @scraper_defined.should be_true
+      @document_defined.should be_true
+    end
  end
end
 
@@ -93,6 +107,7 @@ describe DomExtractor do
  end
end
 
+
context "when nothing but a field name is provided" do
  before do
    @extractor = DomExtractor.new(:url, @env)
@@ -117,13 +132,25 @@ describe DomExtractor do
 
context "block provided" do
  before do
-
+    document_defined = scraper_defined = false
+
+    @extractor = DomExtractor.new(nil, @env, "div.entry", proc { |nodeList|
+      document_defined = @document && @document.is_a?(Nokogiri::HTML::Document)
+      scraper_defined = instance_variable_defined? "@scraper"
+
      nodeList.reject {|node| node.attr(:class).split(" ").include?('exclude') }
    })
+
+    @output = @extractor.extract_list(@html)
+    @scraper_defined = scraper_defined
+    @document_defined = document_defined
  end
 
-
-  it
+  it { @output.should have(2).items }
+  it "should add @scraper and @document instance variables to the ExtractionEnvironment instance" do
+    @scraper_defined.should be_true
+    @document_defined.should be_true
+  end
end
end
 
data/spec/fixtures/doc.csv
ADDED
@@ -0,0 +1,23 @@
+contract_id,seller_company_name,customer_company_name,customer_duns_number,contract_affiliate,FERC_tariff_reference,contract_service_agreement_id,contract_execution_date,contract_commencement_date,contract_termination_date,actual_termination_date,extension_provision_description,class_name,term_name,increment_name,increment_peaking_name,product_type_name,product_name,quantity,units_for_contract,rate,rate_minimum,rate_maximum,rate_description,units_for_rate,point_of_receipt_control_area,point_of_receipt_specific_location,point_of_delivery_control_area,point_of_delivery_specific_location,begin_date,end_date,time_zone
+C71,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Original Volume No. 10,2,2/15/2001,2/15/2001,,,Evergreen,N/A,N/A,N/A,N/A,MB,ENERGY,0,, , , ,Market Based,,,,,,,,ES
+C72,The Electric Company,Utility A,38495837,n,FERC Electric Tariff Original Volume No. 10,15,7/25/2001,8/1/2001,,,Evergreen,N/A,N/A,N/A,N/A,MB,ENERGY,0,, , , ,Market Based,,,,,,,,ES
+C73,The Electric Company,Utility B,493758794,N,FERC Electric Tariff Original Volume No. 10,7,6/8/2001,7/6/2001,,,Evergreen,N/A,N/A,N/A,N/A,MB,ENERGY,0,, , , ,Market Based,,,, , ,,,ep
+C74,The Electric Company,Utility C,594739573,n,FERC Electric Tariff Original Volume No. 10,25,6/8/2001,7/6/2001,,,Evergreen,N/A,N/A,N/A,N/A,MB,ENERGY,0,, , , ,Market Based,,,, , ,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,ENERGY,2000,KWh,.1475, , ,Max amount of capacity and energy to be transmitted. Bill based on monthly max delivery to City.,$/KWh,PJM,Point A,PJM,Point B,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,point-to-point agreement,2000,KW,0.01, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,network,2000,KW,0.2, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,BLACK START SERVICE,2000,KW,0.22, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,CAPACITY,2000,KW,0.04, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,regulation & frequency response,2000,KW,0.1, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,real power transmission loss,2000,KW,7, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+C76,The Electric Company,The Power Company,456534333,N,FERC Electric Tariff Original Volume No. 10,132,12/15/2001,1/1/2002,12/31/2004,12/31/2004,None,F,LT,M,FP,MB,CAPACITY,70,MW,3750, , ,70MW for each and every hour over the term of the agreement (7x24 schedule).,$/MW,,,,,,,ep
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,35, , ,,$/MWH,,,PJM,Bus 4321,20020101,20030101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,37, , ,,$/MWH,,,PJM,Bus 4321,20030101,20040101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,39, , ,,$/MWH,,,PJM,Bus 4321,20040101,20050101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,41, , ,,$/MWH,,,PJM,Bus 4321,20050101,20060101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,43, , ,,$/MWH,,,PJM,Bus 4321,20060101,20070101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,45, , ,,$/MWH,,,PJM,Bus 4321,20070101,20080101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,47, , ,,$/MWH,,,PJM,Bus 4321,20080101,20090101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,49, , ,,$/MWH,,,PJM,Bus 4321,20090101,20100101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,51, , ,,$/MWH,,,PJM,Bus 4321,20100101,20110101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,53, , ,,$/MWH,,,PJM,Bus 4321,20110101,20120101,EP
data/spec/json_extractor_spec.rb
CHANGED
@@ -10,6 +10,7 @@ describe JsonExtractor do
   content = file.read
   file.close
   content
+
 }.call()
 end
 
@@ -37,12 +38,27 @@ describe JsonExtractor do
 
 context "field_name and callback" do
   before do
-
+    scraper_defined = document_defined = false
+
+    @extractor = JsonExtractor.new(:from_user, @env, proc { |node|
+      document_defined = @document && @document.is_a?(Hash)
+      scraper_defined = instance_variable_defined? "@scraper"
+
+      node['from_user_name']
+    })
+
    @node = @extractor.parse(@json)['results'].first
+    @output = @extractor.extract_field(@node)
+
+    @scraper_defined = scraper_defined
+    @document_defined = document_defined
  end
 
-
-  it
+  it { @output.should eql("Ludovic kohn") }
+  it "should add the @scraper and @document instance variables to the extraction environment" do
+    @scraper_defined.should be_true
+    @document_defined.should be_true
+  end
 end
 
 context "field_name and attribute" do
@@ -108,12 +124,27 @@ describe JsonExtractor do
 
 context "with pre-parsed input" do
   before do
-
+    document_defined = scraper_defined = false
+
+    @extractor = JsonExtractor.new(nil, @env, proc { |data|
+      document_defined = @document && @document.is_a?(Hash)
+      scraper_defined = instance_variable_defined? "@scraper"
+      data['results']
+    })
+
+
+    @output = @extractor.extract_list((Yajl::Parser.new).parse(@json))
+    @scraper_defined = scraper_defined
+    @document_defined = document_defined
  end
 
-
-  it {
-
+  it { @output.size.should eql(15) }
+  it { @output.should be_an_instance_of(Array) }
+
+  it "should add the @scraper and @document instance variables to the extraction environment" do
+    @scraper_defined.should be_true
+    @document_defined.should be_true
+  end
 end
 
 end
data/spec/scraper_base_spec.rb
CHANGED
@@ -12,7 +12,6 @@ describe ScraperBase do
   @scraper = ScraperBase.new("http://localhost/fixture")
 end
 
-
 describe "#loop_on" do
   subject { @scraper.loop_on("bla.bla") }
   it { should be_an_instance_of(ScraperBase) }
@@ -113,7 +112,7 @@ describe ScraperBase do
   stub(@fake_loop).environment { ExtractionEnvironment.new }
   stub(@fake_loop).records { Array(1..3).map { |n| Object.new } }
 
-  mock(ExtractionLoop).new(is_a(DomExtractor), is_a(Array), is_a(
+  mock(ExtractionLoop).new(is_a(DomExtractor), is_a(Array), is_a(Nokogiri::HTML::Document), is_a(Hash), is_a(ScraperBase)).times(3) { @fake_loop }
 end
 
 
@@ -157,10 +156,9 @@ describe ScraperBase do
   stub(@fake_loop).environment { ExtractionEnvironment.new }
   stub(@fake_loop).records { Array(1..3).map { |n| Object.new } }
 
-  mock(ExtractionLoop).new(is_a(DomExtractor), is_a(Array), is_a(
+  mock(ExtractionLoop).new(is_a(DomExtractor), is_a(Array), is_a(Nokogiri::HTML::Document), is_a(Hash), is_a(ScraperBase)).times(@urls.size) { @fake_loop }
 end
 
-
 it "Should handle response" do
   @scraper.run
   @results.size.should eql(@urls.size * 3)
@@ -168,5 +166,4 @@ describe ScraperBase do
 end
 end
 end
-
 end
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: extraloop
 version: !ruby/object:Gem::Version
-  version: 0.0.
+  version: 0.0.7
 prerelease:
 platform: ruby
 authors:
@@ -9,11 +9,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-
+date: 2012-02-28 00:00:00.000000000Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: yajl-ruby
-  requirement: &
+  requirement: &21376200 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -21,10 +21,10 @@ dependencies:
         version: 1.1.0
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *21376200
 - !ruby/object:Gem::Dependency
   name: nokogiri
-  requirement: &
+  requirement: &21373200 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -32,10 +32,10 @@ dependencies:
        version: 1.5.0
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *21373200
 - !ruby/object:Gem::Dependency
   name: typhoeus
-  requirement: &
+  requirement: &21368180 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -43,10 +43,10 @@ dependencies:
        version: 0.3.2
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *21368180
 - !ruby/object:Gem::Dependency
   name: logging
-  requirement: &
+  requirement: &21365740 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -54,10 +54,10 @@ dependencies:
        version: 0.6.1
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *21365740
 - !ruby/object:Gem::Dependency
   name: rspec
-  requirement: &
+  requirement: &21363940 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -65,10 +65,10 @@ dependencies:
        version: 2.7.0
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *21363940
 - !ruby/object:Gem::Dependency
   name: rr
-  requirement: &
+  requirement: &21362040 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -76,18 +76,18 @@ dependencies:
        version: 1.0.4
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *21362040
 - !ruby/object:Gem::Dependency
-  name: pry
-  requirement: &
+  name: pry-nav
+  requirement: &21355940 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
     - !ruby/object:Gem::Version
-      version: 0.
+      version: 0.1.0
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *21355940
 description: A Ruby library for extracting data from websites and web based APIs.
   Supports most common document formats (i.e. HTML, XML, and JSON), and comes with
   a handy mechanism for iterating over paginated datasets.
@@ -99,9 +99,11 @@ files:
 - History.txt
 - README.md
 - examples/google_news_scraper.rb
+- examples/mod_pay_data.rb
 - examples/wikipedia_categories.rb
 - examples/wikipedia_categories_recoursive.rb
 - lib/extraloop.rb
+- lib/extraloop/csv_extractor.rb
 - lib/extraloop/dom_extractor.rb
 - lib/extraloop/extraction_environment.rb
 - lib/extraloop/extraction_loop.rb
@@ -112,8 +114,10 @@ files:
 - lib/extraloop/loggable.rb
 - lib/extraloop/scraper_base.rb
 - lib/extraloop/utils.rb
+- spec/csv_extractor.rb
 - spec/dom_extractor_spec.rb
 - spec/extraction_loop_spec.rb
+- spec/fixtures/doc.csv
 - spec/fixtures/doc.html
 - spec/fixtures/doc.json
 - spec/helpers/scraper_helper.rb