extraloop 0.0.6 → 0.0.7

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/History.txt CHANGED
@@ -1,14 +1,17 @@
-== 0.0.5 / 2011-01-14
+== 0.0.7 / 2012-02-28
+* Added support for CSV data extraction.
+
+== 0.0.5 / 2012-01-14
 * Refactored #extract, #loop_on, and #set_hook to make a more idiomatic use of ruby blocks
 
-== 0.0.4 / 2011-01-14
+== 0.0.4 / 2012-01-14
 * fixed a bug which prevented subclassing of `IterativeScraper` instances
 
-== 0.0.3 / 2011-01-01
+== 0.0.3 / 2012-01-01
 * namespaced all classes into the ExtraLoop module
 
-== 0.0.2 / 2011-01-01
+== 0.0.2 / 2012-01-01
 * changed repository URL
 
-== 0.0.1 / 2011-01-01
+== 0.0.1 / 2012-01-01
 * Project Birthday!
data/README.md CHANGED
@@ -1,6 +1,6 @@
 # Extra Loop
 
-A Ruby library for extracting data from websites and web based APIs.
+A Ruby library for extracting structured data from websites and web based APIs.
 Supports most common document formats (i.e. HTML, XML, and JSON), and comes with a handy mechanism
 for iterating over paginated datasets.
 
@@ -47,7 +47,7 @@ An iterative Scraper that fetches URL, title, and publisher from some 110 Google
 
 #### scraper options:
 
-* __format__ - Specifies the scraped document format (valid values are :html, :xml, :json).
+* __format__ - Specifies the scraped document format; needed only if the Content-Type header in the server response is incorrect. Supported formats are: 'html', 'xml', 'json', and 'csv'.
 * __async__ - Specifies whether the scraper's HTTP requests should be run in parallel or in series (defaults to false). **Note:** currently only GET requests can be run asynchronously.
 * __log__ - Logging options hash:
   * __loglevel__ - a symbol specifying the desired log level (defaults to `:info`).
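The `:format` override goes in the scraper's options hash; a minimal sketch (hypothetical URL), for a server that labels a CSV payload with a generic Content-Type:

    # force CSV parsing regardless of the response's Content-Type header
    ExtraLoop::ScraperBase.new("http://example.com/report", :format => :csv)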
@@ -71,7 +71,7 @@ method extracts a specific piece of information from an element (e.g. a story's
     loop_on('div.post') { |posts| posts.reject { |post| post.attr(:class) == 'sticky' } }
 
 Both the `loop_on` and the `extract` methods may be called with a selector, a block, or a combination of the two. By default, when parsing DOM documents, `extract` will call
-`Nokogiri::XML::Node#text()`. Alternatively, `extract` also accepts an attribute name and block, this will be evaluated in the context of the current iteration element.
+`Nokogiri::XML::Node#text()`. Alternatively, `extract` also accepts an attribute name and a block. The latter is evaluated in the context of the current iteration's element.
 
     # extract a story's title
     extract(:title, 'h3')
@@ -82,13 +82,13 @@ Both the `loop_on` and the `extract` methods may be called with a selector, a bl
     # extract a description text, separating paragraphs with newlines
     extract(:description, "div.description") { |node| node.css("p").map(&:text).join("\n") }
 
-#### Extracting from JSON Documents
+#### Extracting data from JSON Documents
 
-While processing each HTTP response, ExtraLoop tries to automatically detect the scraped document format by looking at
+While processing an HTTP response, ExtraLoop tries to automatically detect the scraped document format by looking at
 the `Content-Type` header sent by the server. This value can be overridden by providing a `:format` key in the scraper's
 initialization options. When the format is JSON, the document is parsed using the `yajl` JSON parser and converted into a hash.
-In this case, both the `loop_on` and the `extract` methods still behave as illustrated above, except for the
-CSS3/XPath selectors, which are specific to DOM documents.
+In this case, both the `loop_on` and the `extract` methods still behave as illustrated above, except that they do not support
+CSS3/XPath selectors.
 
 When working with JSON data, you can just use a block and have it return the document elements you want to loop on.
 
@@ -98,7 +98,7 @@ When working with JSON data, you can just use a block and have it return the doc
 Alternatively, the same loop can be defined by passing an array of keys pointing at a hash value located
 several levels deep in the parsed document structure.
 
-    # Fetch the same document portion above using a hash path
+    # Same as above, using a hash path
     loop_on(['query', 'categorymembers'])
 
 When fetching fields from a JSON document fragment, `extract` will often not need a block or an array of keys. If called with only
@@ -120,23 +120,22 @@ one argument, it will in fact try to fetch a hash value using the provided field
 
 The `IterativeScraper` class comes with two methods that allow scrapers to loop over paginated content.
 
-    set_iteration(iteration_parameter, array_range_or_block)
+#### set\_iteration
 
 * __iteration_parameter__ - A symbol identifying the request parameter that the scraper will use as an offset in order to iterate over the paginated content.
 * __array_or_range_or_block__ - Either an explicit set of values or a block of code. If provided, the block is called with the parsed document object as its first argument. The block should return a non-empty array, which will determine the value of the offset parameter during each iteration. If the block fails to return a non-empty array, the iteration stops.
 
+#### continue\_with
 
-The second iteration methods, `#continue_with`, allows to continue iterating untill an arbitrary block of code returns a positive, non-nil value (to be assigned to the iteration parameter).
-
-    continue_with(iteration_parameter, &block)
+The second iteration method, `#continue_with`, keeps the scraper iterating for as long as a block of code returns a truthy, non-nil value (to be assigned to the iteration parameter).
 
 * __iteration_parameter__ - the scraper's iteration parameter.
 * __&block__ - An arbitrary block of ruby code; its return value will be used to determine the value of the next iteration's offset parameter.
 
-
 ### Running tests
 
 ExtraLoop uses `rspec` and `rr` as its testing framework. The test suite can be run by calling the `rspec` executable from within the `spec` directory:
 
     cd spec
     rspec *
+
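A minimal sketch of `#continue_with` (hypothetical URL and `:page` parameter; assuming the block receives the parsed document, as `set_iteration` blocks do):

    ExtraLoop::IterativeScraper.
      new("http://example.com/api/items").
      loop_on(['items']).
      extract(:title).
      continue_with(:page) { |document| document['next_page'] }.  # iteration stops once this returns nil
      run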
data/examples/google_news_scraper.rb CHANGED
@@ -1,10 +1,11 @@
 require '../lib/extraloop'
+require 'pry'
 
 results = []
 
 ExtraLoop::IterativeScraper.new("https://www.google.com/search?tbm=nws&q=Egypt", :log => {
-  :log_level => :debug,
-  :appenders => [ Logging.appenders.stderr ]
+  #:log_level => :debug,
+  #:appenders => [ Logging.appenders.stderr ]
 
 }).set_iteration(:start, (1..101).step(10)).
   loop_on("h3") { |nodes| nodes.map(&:parent) }.
data/examples/mod_pay_data.rb ADDED
@@ -0,0 +1,32 @@
+#
+# Fetch name, job title, and actual pay ceiling from a CSV dataset containing the UK Ministry of Defence's organogram and staff pay data
+#
+# source: http://data.gov.uk/dataset/staff-organograms-and-pay-mod
+#
+
+require "../lib/extraloop.rb"
+require "pry"
+
+class ModPayScraper < ExtraLoop::ScraperBase
+  def initialize
+    dataset_url = "http://www.mod.uk/NR/rdonlyres/FF9761D8-2AB9-4CD4-88BC-983A46A0CD90/0/20111208CTLBOrganogramFinal7Useniordata.csv"
+    super dataset_url, :format => :csv
+
+    # Select only records of officers who earn more than 100k per year
+    loop_on do |rows|
+      rows[1..-1].select { |row| row[14].to_i > 100000 }
+    end
+
+    extract :name, "Name"
+    extract :title, "Job Title"
+    extract :pay, 14
+
+    on("data") do |records|
+      records.sort { |r1, r2| r2.pay <=> r1.pay }.each { |record| puts [record.pay, record.name].map { |string| string.ljust 7 }.join }
+    end
+  end
+end
+
+ModPayScraper.new.run
data/lib/extraloop.rb CHANGED
@@ -16,6 +16,8 @@ gem "typhoeus"
16
16
  gem "logging"
17
17
 
18
18
 
19
+
20
+ autoload :CSV, "csv"
19
21
  autoload :Nokogiri, "nokogiri"
20
22
  autoload :Yajl, "yajl"
21
23
  autoload :Typhoeus, "typhoeus"
@@ -29,6 +31,7 @@ ExtraLoop.autoload :ExtractionEnvironment , "#{base_path}/extraction_environment
29
31
  ExtraLoop.autoload :ExtractorBase , "#{base_path}/extractor_base"
30
32
  ExtraLoop.autoload :DomExtractor , "#{base_path}/dom_extractor"
31
33
  ExtraLoop.autoload :JsonExtractor , "#{base_path}/json_extractor"
34
+ ExtraLoop.autoload :CsvExtractor , "#{base_path}/csv_extractor"
32
35
  ExtraLoop.autoload :ExtractionLoop , "#{base_path}/extraction_loop"
33
36
  ExtraLoop.autoload :ScraperBase , "#{base_path}/scraper_base"
34
37
  ExtraLoop.autoload :Loggable , "#{base_path}/loggable"
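`autoload :CSV, "csv"` defers loading Ruby's standard CSV library until the `CSV` constant is first referenced, so scrapers that never parse CSV pay no startup cost; the other parser dependencies are registered the same way. In short:

    autoload :CSV, "csv"   # registers the constant; nothing is required yet
    CSV.parse("a,b\n1,2")  # first reference triggers require "csv", then parses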
data/lib/extraloop/csv_extractor.rb ADDED
@@ -0,0 +1,32 @@
+class ExtraLoop::CsvExtractor < ExtraLoop::ExtractorBase
+
+  def initialize(*args)
+    super(*args)
+    @selector = args[2] if args[2] && args[2].is_a?(Integer)
+  end
+
+  def extract_field(row, record=nil)
+    target = row = row.respond_to?(:entries) ? row : parse(row)
+    headers = @environment.document.first
+    selector = !@selector && @field_name || @selector
+
+    # allow using CSV column names or array indices as selectors
+    target = row[headers.index(selector.to_s)] if selector && selector.to_s.match(/[a-z]/i)
+    target = row[selector] if selector.is_a?(Integer)
+
+    target = @environment.run(target, record, &@callback) if @callback
+    target
+  end
+
+  def extract_list(input)
+    rows = (input.respond_to?(:entries) ? input : parse(input))
+    Array(@callback && @environment.run(rows, &@callback) || rows)
+  end
+
+  def parse(input, options=Hash.new)
+    super(input)
+    document = CSV.parse(input, options)
+    @environment.document = document
+  end
+end
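To make the selector logic in `extract_field` concrete, here is a minimal sketch of driving `CsvExtractor` by hand (inline data and field names invented for illustration; in normal use `ScraperBase` builds the extractor and its environment):

    env = ExtraLoop::ExtractionEnvironment.new
    extractor = ExtraLoop::CsvExtractor.new(:pay, env)
    extractor.parse("name,pay\nAlice,120000\nBob,90000")  # stores the parsed table on the environment
    extractor.extract_field(["Alice", "120000"])          # => "120000", looked up via the 'pay' header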
data/lib/extraloop/dom_extractor.rb CHANGED
@@ -11,7 +11,7 @@ module ExtraLoop
 
     def extract_field(node, record=nil)
       target = node = node.respond_to?(:document) ? node : parse(node)
-      target = node.at_css(@selector) if @selector
+      target = node.at(@selector) if @selector
       target = target.attr(@attribute) if target.respond_to?(:attr) && @attribute
       target = @environment.run(target, record, &@callback) if @callback
 
@@ -30,9 +30,9 @@ module ExtraLoop
     #
 
     def extract_list(input)
-      nodes = input.respond_to?(:document) ? input : parse(input)
+      nodes = (input.respond_to?(:document) ? input : parse(input))
       nodes = nodes.search(@selector) if @selector
-      @callback && Array(@environment.run(nodes, &@callback)) || nodes
+      Array(@callback && @environment.run(nodes, &@callback) || nodes)
     end
 
     def parse(input)
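The switch from `at_css` to `at` means `extract` selectors are no longer limited to CSS: Nokogiri's `#at` accepts either a CSS or an XPath expression. For example:

    doc = Nokogiri::HTML("<div><p><a href='/x'>link</a></p></div>")
    doc.at("p a")     # CSS selector
    doc.at("//p/a")   # equivalent XPath selector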
data/lib/extraloop/extraction_environment.rb CHANGED
@@ -7,6 +7,7 @@ module ExtraLoop
     attr_reader :scraper
 
     def initialize(scraper=nil, document=nil, records=nil)
+
       if scraper
         @options = scraper.options
         @results = scraper.results
data/lib/extraloop/extraction_loop.rb CHANGED
@@ -12,7 +12,7 @@ module ExtraLoop
     def initialize(loop_extractor, extractors=[], document=nil, hooks = {}, scraper = nil)
       @loop_extractor = loop_extractor
       @extractors = extractors
-      @document = @loop_extractor.parse(document)
+      @document = document.is_a?(String) ? @loop_extractor.parse(document) : document
       @records = []
       @hooks = hooks
       @environment = ExtractionEnvironment.new(scraper, @document, @records)
data/lib/extraloop/extractor_base.rb CHANGED
@@ -1,5 +1,5 @@
 module ExtraLoop
-  # Pseudo Abstract class.
+  # Pseudo Abstract class from which all extractors inherit.
   # This should not be called directly
   #
   class ExtractorBase
@@ -9,8 +9,9 @@ module ExtraLoop
     end
 
     attr_reader :field_name
+
     #
-    # Public: Initializes a Data extractor.
+    # Public: Initialises a Data extractor.
     #
     # Parameters:
     #   field_name - The machine readable field name
data/lib/extraloop/json_extractor.rb CHANGED
@@ -20,13 +20,9 @@ module ExtraLoop
     end
 
     def extract_list(input)
-      #TODO: implement more clever stuff here after looking
-      # into possible hash traversal techniques
-
-      input = input.is_a?(String) ? parse(input) : input
+      @environment.document = input = (input.is_a?(String) ? parse(input) : input)
       input = input.get_in(@path) if @path
-
-      @callback && Array(@environment.run(input, &@callback)) || input
+      @callback && @environment.run(input, &@callback) || input
     end
 
     def parse(input)
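`get_in` is defined in the gem's utils (not shown in this diff) and walks a nested hash along a key path; the idea is roughly this (a sketch, not the gem's actual implementation):

    # hypothetical stand-alone equivalent of the hash-path lookup used above
    def get_in(hash, path)
      path.reduce(hash) { |node, key| node && node[key] }
    end

    get_in({ 'query' => { 'categorymembers' => [1, 2] } }, ['query', 'categorymembers'])
    # => [1, 2]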
data/lib/extraloop/scraper_base.rb CHANGED
@@ -61,7 +61,8 @@ module ExtraLoop
 
     def loop_on(*args, &block)
       args << block if block
-      @loop_extractor_args = args.insert(0, nil, ExtractionEnvironment.new(self))
+      # we prepend a nil value, as the loop extractor does not need to specify a field name
+      @loop_extractor_args = args.insert(0, nil)
       self
     end
 
@@ -79,7 +80,7 @@
 
     def extract(*args, &block)
       args << block if block
-      @extractor_args << args.insert(1, ExtractionEnvironment.new(self))
+      @extractor_args << args
       self
     end
 
@@ -144,24 +145,42 @@
       @response_count += 1
       @loop = prepare_loop(response)
       log("response ##{@response_count} of #{@queued_count}, status code: [#{response.code}], URL fragment: ...#{response.effective_url.split('/').last if response.effective_url}")
-      @loop.run
 
+      @loop.run
       @environment = @loop.environment
       run_hook(:data, [@loop.records, response])
+      #TODO: add hook for scraper completion (useful in iterative scrapes).
     end
 
     def prepare_loop(response)
-      format = @options[:format] || detect_format(response.headers_hash.fetch('Content-Type', nil))
-      extractor_class = format == :json ? JsonExtractor : DomExtractor
+      content_type = response.headers_hash.fetch('Content-Type', nil)
+      format = @options[:format] || detect_format(content_type)
+
+      extractor_classname = "#{format.to_s.capitalize}Extractor"
+      extractor_class = ExtraLoop.const_defined?(extractor_classname) && ExtraLoop.const_get(extractor_classname) || DomExtractor
+
+      @loop_extractor_args.insert(1, ExtractionEnvironment.new(self))
       loop_extractor = extractor_class.new(*@loop_extractor_args)
-      extractors = @extractor_args.map { |args| extractor_class.new(*args) }
-      ExtractionLoop.new(loop_extractor, extractors, response.body, @hooks, self)
+
+      # There is no point in parsing response.body more than once, so we reuse
+      # the first parsed document
+
+      document = loop_extractor.parse(response.body)
+
+      extractors = @extractor_args.map do |args|
+        args.insert(1, ExtractionEnvironment.new(self, document))
+        extractor_class.new(*args)
+      end
+
+      ExtractionLoop.new(loop_extractor, extractors, document, @hooks, self)
     end
 
     def detect_format(content_type)
       #TODO: add support for xml/rdf documents
       if content_type && content_type =~ /json$/
         :json
+      elsif content_type && content_type =~ /(csv|comma-separated-values)$/
+        :csv
       else
        :html
      end
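The extractor lookup in `prepare_loop` resolves a class from the detected format by naming convention, falling back to `DomExtractor` when no matching constant exists:

    ExtraLoop.const_get("#{:csv.to_s.capitalize}Extractor")  # => ExtraLoop::CsvExtractor
    ExtraLoop.const_defined?("HtmlExtractor")                # => false, so :html uses DomExtractor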
data/spec/csv_extractor.rb ADDED
@@ -0,0 +1,67 @@
+require 'helpers/spec_helper'
+
+describe CsvExtractor do
+  before(:each) do
+    stub(scraper = Object.new).options
+    stub(scraper).results
+    @env = ExtractionEnvironment.new(scraper)
+
+    File.open('fixtures/doc.csv', 'r') { |file|
+      @csv = file.read
+      @parsed_csv = CSV.parse(@csv)
+      file.close
+    }
+  end
+
+  describe "#extract_field" do
+    context "with only a field name defined" do
+      before do
+        @extractor = CsvExtractor.new(:customer_company_name, @env)
+        @extractor.parse(@csv)
+      end
+
+      subject { @extractor.extract_field @parsed_csv[2] }
+      it { should eql("Utility A") }
+    end
+
+    context "with a field name and a selector defined" do
+      before do
+        @extractor = CsvExtractor.new(:name, @env, "customer_company_name")
+        @extractor.parse(@csv)
+      end
+      subject { @extractor.extract_field @parsed_csv[2] }
+      it { should eql("Utility A") }
+    end
+
+    context "with a field name, using a numerical index as selector", :onlythis => true do
+      before do
+        @extractor = CsvExtractor.new(:company_name, @env, 2)
+        @extractor.parse(@csv)
+      end
+      subject { @extractor.extract_field @parsed_csv[2] }
+      it { should eql("Utility A") }
+    end
+
+    context "without any other arguments but a callback" do
+      before do
+        @extractor = CsvExtractor.new nil, @env, proc { |row| row[2] }
+        @extractor.parse(@csv)
+      end
+      subject { @extractor.extract_field @parsed_csv[2] }
+      it { should eql("Utility A") }
+    end
+  end
+
+  describe "#extract_list" do
+    context "with no arguments" do
+      subject { CsvExtractor.new(nil, @env).extract_list(@csv) }
+      it { should eql(@parsed_csv) }
+    end
+
+    context "with a callback" do
+      subject { CsvExtractor.new(nil, @env, proc { |rows| rows[0..10] }).extract_list(@csv) }
+      it { should eql(@parsed_csv[0..10]) }
+    end
+  end
+end
data/spec/dom_extractor_spec.rb CHANGED
@@ -54,17 +54,31 @@ describe DomExtractor do
     end
   end
 
-  context "when a selector and a block are provided" do
+  context "when a selector and a block are provided", :bla => true do
     before do
+      document_defined = scraper_defined = false
+
       @extractor = DomExtractor.new(:anchor, @env, "p a", proc { |node|
+        document_defined = @document && @document.is_a?(Nokogiri::HTML::Document)
+        scraper_defined = instance_variable_defined? "@scraper"
        node.text.gsub("dummy", "fancy")
      })
+
      @node = @extractor.parse(@html)
+      @output = @extractor.extract_field(@node)
+
+      @scraper_defined = scraper_defined
+      @document_defined = document_defined
    end
 
    describe "#extract_field" do
-      subject { @extractor.extract_field(@node) }
-      it { should match(/my fancy/) }
+      it "should return the block output" do
+        @output.should match(/my\sfancy/)
+      end
+      it "should add the @scraper and @document instance variables to the extraction environment" do
+        @scraper_defined.should be_true
+        @document_defined.should be_true
+      end
    end
  end
 
@@ -93,6 +107,7 @@ describe DomExtractor do
    end
  end
 
+
  context "when nothing but a field name is provided" do
    before do
      @extractor = DomExtractor.new(:url, @env)
@@ -117,13 +132,25 @@ describe DomExtractor do
 
  context "block provided" do
    before do
-      @extractor = DomExtractor.new(nil, @env, "div.entry", lambda { |nodeList|
+      document_defined = scraper_defined = false
+
+      @extractor = DomExtractor.new(nil, @env, "div.entry", proc { |nodeList|
+        document_defined = @document && @document.is_a?(Nokogiri::HTML::Document)
+        scraper_defined = instance_variable_defined? "@scraper"
+
        nodeList.reject {|node| node.attr(:class).split(" ").include?('exclude') }
      })
+
+      @output = @extractor.extract_list(@html)
+      @scraper_defined = scraper_defined
+      @document_defined = document_defined
    end
 
-    subject { @extractor.extract_list(@html) }
-    it { subject.should have(2).items }
+    it { @output.should have(2).items }
+    it "should add @scraper and @document instance variables to the ExtractionEnvironment instance" do
+      @scraper_defined.should be_true
+      @document_defined.should be_true
+    end
  end
 end
 
data/spec/fixtures/doc.csv ADDED
@@ -0,0 +1,23 @@
+contract_id,seller_company_name,customer_company_name,customer_duns_number,contract_affiliate,FERC_tariff_reference,contract_service_agreement_id,contract_execution_date,contract_commencement_date,contract_termination_date,actual_termination_date,extension_provision_description,class_name,term_name,increment_name,increment_peaking_name,product_type_name,product_name,quantity,units_for_contract,rate,rate_minimum,rate_maximum,rate_description,units_for_rate,point_of_receipt_control_area,point_of_receipt_specific_location,point_of_delivery_control_area,point_of_delivery_specific_location,begin_date,end_date,time_zone
+C71,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Original Volume No. 10,2,2/15/2001,2/15/2001,,,Evergreen,N/A,N/A,N/A,N/A,MB,ENERGY,0,, , , ,Market Based,,,,,,,,ES
+C72,The Electric Company,Utility A,38495837,n,FERC Electric Tariff Original Volume No. 10,15,7/25/2001,8/1/2001,,,Evergreen,N/A,N/A,N/A,N/A,MB,ENERGY,0,, , , ,Market Based,,,,,,,,ES
+C73,The Electric Company,Utility B,493758794,N,FERC Electric Tariff Original Volume No. 10,7,6/8/2001,7/6/2001,,,Evergreen,N/A,N/A,N/A,N/A,MB,ENERGY,0,, , , ,Market Based,,,, , ,,,ep
+C74,The Electric Company,Utility C,594739573,n,FERC Electric Tariff Original Volume No. 10,25,6/8/2001,7/6/2001,,,Evergreen,N/A,N/A,N/A,N/A,MB,ENERGY,0,, , , ,Market Based,,,, , ,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,ENERGY,2000,KWh,.1475, , ,Max amount of capacity and energy to be transmitted. Bill based on monthly max delivery to City.,$/KWh,PJM,Point A,PJM,Point B,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,point-to-point agreement,2000,KW,0.01, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,network,2000,KW,0.2, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,BLACK START SERVICE,2000,KW,0.22, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,CAPACITY,2000,KW,0.04, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,regulation & frequency response,2000,KW,0.1, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,real power transmission loss,2000,KW,7, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+C76,The Electric Company,The Power Company,456534333,N,FERC Electric Tariff Original Volume No. 10,132,12/15/2001,1/1/2002,12/31/2004,12/31/2004,None,F,LT,M,FP,MB,CAPACITY,70,MW,3750, , ,70MW for each and every hour over the term of the agreement (7x24 schedule).,$/MW,,,,,,,ep
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,35, , ,,$/MWH,,,PJM,Bus 4321,20020101,20030101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,37, , ,,$/MWH,,,PJM,Bus 4321,20030101,20040101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,39, , ,,$/MWH,,,PJM,Bus 4321,20040101,20050101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,41, , ,,$/MWH,,,PJM,Bus 4321,20050101,20060101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,43, , ,,$/MWH,,,PJM,Bus 4321,20060101,20070101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,45, , ,,$/MWH,,,PJM,Bus 4321,20070101,20080101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,47, , ,,$/MWH,,,PJM,Bus 4321,20080101,20090101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,49, , ,,$/MWH,,,PJM,Bus 4321,20090101,20100101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,51, , ,,$/MWH,,,PJM,Bus 4321,20100101,20110101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,53, , ,,$/MWH,,,PJM,Bus 4321,20110101,20120101,EP
data/spec/json_extractor_spec.rb CHANGED
@@ -10,6 +10,7 @@ describe JsonExtractor do
      content = file.read
      file.close
      content
+
    }.call()
  end
 
@@ -37,12 +38,27 @@ describe JsonExtractor do
 
  context "field_name and callback" do
    before do
-      @extractor = JsonExtractor.new(:from_user, @env, proc { |node| node['from_user_name'] } )
+      scraper_defined = document_defined = false
+
+      @extractor = JsonExtractor.new(:from_user, @env, proc { |node|
+        document_defined = @document && @document.is_a?(Hash)
+        scraper_defined = instance_variable_defined? "@scraper"
+
+        node['from_user_name']
+      })
+
      @node = @extractor.parse(@json)['results'].first
+      @output = @extractor.extract_field(@node)
+
+      @scraper_defined = scraper_defined
+      @document_defined = document_defined
    end
 
-    subject { @extractor.extract_field(@node) }
-    it { should eql("Ludovic kohn") }
+    it { @output.should eql("Ludovic kohn") }
+    it "should add the @scraper and @document instance variables to the extraction environment" do
+      @scraper_defined.should be_true
+      @document_defined.should be_true
+    end
  end
 
  context "field_name and attribute" do
@@ -108,12 +124,27 @@ describe JsonExtractor do
 
  context "with pre-parsed input" do
    before do
-      @extractor = JsonExtractor.new(nil, @env, proc { |data| data['results'] })
+      document_defined = scraper_defined = false
+
+      @extractor = JsonExtractor.new(nil, @env, proc { |data|
+        document_defined = @document && @document.is_a?(Hash)
+        scraper_defined = instance_variable_defined? "@scraper"
+        data['results']
+      })
+
+
+      @output = @extractor.extract_list((Yajl::Parser.new).parse(@json))
+      @scraper_defined = scraper_defined
+      @document_defined = document_defined
    end
 
-    subject { @extractor.extract_list((Yajl::Parser.new).parse(@json)) }
-    it { subject.size.should eql(15) }
-    it { should be_an_instance_of(Array) }
+    it { @output.size.should eql(15) }
+    it { @output.should be_an_instance_of(Array) }
+
+    it "should add the @scraper and @document instance variables to the extraction environment" do
+      @scraper_defined.should be_true
+      @document_defined.should be_true
+    end
  end
 
 end
data/spec/scraper_base_spec.rb CHANGED
@@ -12,7 +12,6 @@ describe ScraperBase do
    @scraper = ScraperBase.new("http://localhost/fixture")
  end
 
-
  describe "#loop_on" do
    subject { @scraper.loop_on("bla.bla") }
    it { should be_an_instance_of(ScraperBase) }
@@ -113,7 +112,7 @@ describe ScraperBase do
      stub(@fake_loop).environment { ExtractionEnvironment.new }
      stub(@fake_loop).records { Array(1..3).map { |n| Object.new } }
 
-      mock(ExtractionLoop).new(is_a(DomExtractor), is_a(Array), is_a(String), is_a(Hash), is_a(ScraperBase)).times(3) { @fake_loop }
+      mock(ExtractionLoop).new(is_a(DomExtractor), is_a(Array), is_a(Nokogiri::HTML::Document), is_a(Hash), is_a(ScraperBase)).times(3) { @fake_loop }
    end
 
 
@@ -157,10 +156,9 @@ describe ScraperBase do
      stub(@fake_loop).environment { ExtractionEnvironment.new }
      stub(@fake_loop).records { Array(1..3).map { |n| Object.new } }
 
-      mock(ExtractionLoop).new(is_a(DomExtractor), is_a(Array), is_a(String), is_a(Hash), is_a(ScraperBase)).times(@urls.size) { @fake_loop }
+      mock(ExtractionLoop).new(is_a(DomExtractor), is_a(Array), is_a(Nokogiri::HTML::Document), is_a(Hash), is_a(ScraperBase)).times(@urls.size) { @fake_loop }
    end
 
-
    it "Should handle response" do
      @scraper.run
      @results.size.should eql(@urls.size * 3)
@@ -168,5 +166,4 @@ describe ScraperBase do
      end
    end
  end
-
end
metadata CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: extraloop
 version: !ruby/object:Gem::Version
-  version: 0.0.6
+  version: 0.0.7
 prerelease:
 platform: ruby
 authors:
@@ -9,11 +9,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-01-30 00:00:00.000000000Z
+date: 2012-02-28 00:00:00.000000000Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: yajl-ruby
-  requirement: &15579720 !ruby/object:Gem::Requirement
+  requirement: &21376200 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -21,10 +21,10 @@ dependencies:
        version: 1.1.0
   type: :runtime
   prerelease: false
-  version_requirements: *15579720
+  version_requirements: *21376200
 - !ruby/object:Gem::Dependency
   name: nokogiri
-  requirement: &15579260 !ruby/object:Gem::Requirement
+  requirement: &21373200 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ~>
@@ -32,10 +32,10 @@ dependencies:
        version: 1.5.0
   type: :runtime
   prerelease: false
-  version_requirements: *15579260
+  version_requirements: *21373200
 - !ruby/object:Gem::Dependency
   name: typhoeus
-  requirement: &15578800 !ruby/object:Gem::Requirement
+  requirement: &21368180 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ~>
@@ -43,10 +43,10 @@ dependencies:
        version: 0.3.2
   type: :runtime
   prerelease: false
-  version_requirements: *15578800
+  version_requirements: *21368180
 - !ruby/object:Gem::Dependency
   name: logging
-  requirement: &15578340 !ruby/object:Gem::Requirement
+  requirement: &21365740 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ~>
@@ -54,10 +54,10 @@ dependencies:
        version: 0.6.1
   type: :runtime
   prerelease: false
-  version_requirements: *15578340
+  version_requirements: *21365740
 - !ruby/object:Gem::Dependency
   name: rspec
-  requirement: &15577880 !ruby/object:Gem::Requirement
+  requirement: &21363940 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ~>
@@ -65,10 +65,10 @@ dependencies:
        version: 2.7.0
   type: :development
   prerelease: false
-  version_requirements: *15577880
+  version_requirements: *21363940
 - !ruby/object:Gem::Dependency
   name: rr
-  requirement: &15577420 !ruby/object:Gem::Requirement
+  requirement: &21362040 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ~>
@@ -76,18 +76,18 @@ dependencies:
        version: 1.0.4
   type: :development
   prerelease: false
-  version_requirements: *15577420
+  version_requirements: *21362040
 - !ruby/object:Gem::Dependency
-  name: pry
-  requirement: &15576960 !ruby/object:Gem::Requirement
+  name: pry-nav
+  requirement: &21355940 !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ~>
      - !ruby/object:Gem::Version
-        version: 0.9.7.4
+        version: 0.1.0
   type: :development
   prerelease: false
-  version_requirements: *15576960
+  version_requirements: *21355940
 description: A Ruby library for extracting data from websites and web based APIs.
   Supports most common document formats (i.e. HTML, XML, and JSON), and comes with
   a handy mechanism for iterating over paginated datasets.
@@ -99,9 +99,11 @@ files:
 - History.txt
 - README.md
 - examples/google_news_scraper.rb
+- examples/mod_pay_data.rb
 - examples/wikipedia_categories.rb
 - examples/wikipedia_categories_recoursive.rb
 - lib/extraloop.rb
+- lib/extraloop/csv_extractor.rb
 - lib/extraloop/dom_extractor.rb
 - lib/extraloop/extraction_environment.rb
 - lib/extraloop/extraction_loop.rb
@@ -112,8 +114,10 @@ files:
 - lib/extraloop/loggable.rb
 - lib/extraloop/scraper_base.rb
 - lib/extraloop/utils.rb
+- spec/csv_extractor.rb
 - spec/dom_extractor_spec.rb
 - spec/extraction_loop_spec.rb
+- spec/fixtures/doc.csv
 - spec/fixtures/doc.html
 - spec/fixtures/doc.json
 - spec/helpers/scraper_helper.rb