extraloop 0.0.6 → 0.0.7

data/History.txt CHANGED
@@ -1,14 +1,17 @@
- == 0.0.5 / 2011-01-14
+ == 0.0.7 / 2012-02-28
+ * Added support for CSV data extraction.
+
+ == 0.0.5 / 2012-01-14
  * Refactored #extract, #loop_on, and #set_hook to make a more idiomatic use of ruby blocks
 
- == 0.0.4 / 2011-01-14
+ == 0.0.4 / 2012-01-14
  * fixed a bug which prevented subclassing `IterativeScraper` instances
 
- == 0.0.3 / 2011-01-01
+ == 0.0.3 / 2012-01-01
  * namespaced all classes into the ExtraLoop module
 
- == 0.0.2 / 2011-01-01
+ == 0.0.2 / 2012-01-01
  * changed repository URL
 
- == 0.0.1 / 2011-01-01
+ == 0.0.1 / 2012-01-01
  * Project Birthday!
data/README.md CHANGED
@@ -1,6 +1,6 @@
  # Extra Loop
 
- A Ruby library for extracting data from websites and web based APIs.
+ A Ruby library for extracting structured data from websites and web based APIs.
  Supports most common document formats (i.e. HTML, XML, and JSON), and comes with a handy mechanism
  for iterating over paginated datasets.
 
@@ -47,7 +47,7 @@ An iterative Scraper that fetches URL, title, and publisher from some 110 Google
 
  #### scraper options:
 
- * __format__ - Specifies the scraped document format (valid values are :html, :xml, :json).
+ * __format__ - Specifies the scraped document format; needed only if the Content-Type header sent by the server is incorrect. Supported formats are: 'html', 'xml', 'json', and 'csv'.
  * __async__ - Specifies whether the scraper's HTTP requests should be run in parallel or in series (defaults to false). **Note:** currently only GET requests can be run asynchronously.
  * __log__ - Logging options hash:
    * __loglevel__ - a symbol specifying the desired log level (defaults to `:info`).
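For instance, a hedged sketch of a scraper initialised with these options (option names follow the bundled examples; the URL is illustrative only):

    ExtraLoop::ScraperBase.new("http://example.com/data.csv",
      :format => :csv,                     # skip Content-Type based detection
      :async  => false,                    # run requests in series (the default)
      :log    => { :log_level => :debug }  # see the logging options above
    )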
@@ -71,7 +71,7 @@ method extracts a specific piece of information from an element (e.g. a story's
      loop_on('div.post') { |posts| posts.reject { |post| post.attr(:class) == 'sticky' } }
 
  Both the `loop_on` and the `extract` methods may be called with a selector, a block, or a combination of the two. By default, when parsing DOM documents, `extract` will call
- `Nokogiri::XML::Node#text()`. Alternatively, `extract` also accepts an attribute name and block, this will be evaluated in the context of the current iteration element.
+ `Nokogiri::XML::Node#text()`. Alternatively, `extract` also accepts an attribute name and a block. The latter is evaluated in the context of the current iteration's element.
 
      # extract a story's title
      extract(:title, 'h3')
@@ -82,13 +82,13 @@ Both the `loop_on` and the `extract` methods may be called with a selector, a bl
      # extract a description text, separating paragraphs with newlines
      extract(:description, "div.description") { |node| node.css("p").map(&:text).join("\n") }
 
- #### Extracting from JSON Documents
+ #### Extracting data from JSON Documents
 
- While processing each HTTP response, ExtraLoop tries to automatically detect the scraped document format by looking at
+ While processing an HTTP response, ExtraLoop tries to automatically detect the scraped document format by looking at
  the `Content-Type` header sent by the server. This value can be overridden by providing a `:format` key in the scraper's
  initialization options. When the format is JSON, the document is parsed using the `yajl` JSON parser and converted into a hash.
- In this case, both the `loop_on` and the `extract` methods still behave as illustrated above, except for the
- CSS3/XPath selectors, which are specific to DOM documents.
+ In this case, both the `loop_on` and the `extract` methods still behave as illustrated above, except that they do not support
+ CSS3/XPath selectors.
 
  When working with JSON data, you can just use a block and have it return the document elements you want to loop on.
 
@@ -98,7 +98,7 @@ When working with JSON data, you can just use a block and have it return the doc
  Alternatively, the same loop can be defined by passing an array of keys pointing at a hash value located
  several levels deep in the parsed document structure.
 
-     # Fetch the same document portion above using a hash path
+     # Same as above, using a hash path
      loop_on(['query', 'categorymembers'])
 
  When fetching fields from a JSON document fragment, `extract` will often not need a block or an array of keys. If called with only
@@ -120,23 +120,22 @@ one argument, it will in fact try to fetch a hash value using the provided field
 
  The `IterativeScraper` class comes with two methods that allow scrapers to loop over paginated content.
 
-     set_iteration(iteration_parameter, array_range_or_block)
+ #### set\_iteration
 
  * __iteration_parameter__ - A symbol identifying the request parameter that the scraper will use as an offset in order to iterate over the paginated content.
  * __array_or_range_or_block__ - Either an explicit set of values or a block of code. If provided, the block is called with the parsed document object as its first argument. The block should return a non-empty array, which will determine the value of the offset parameter during each iteration. If the block fails to return a non-empty array, the iteration stops (see the sketch below).
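A hedged sketch of both call styles (the `:start` range mirrors the bundled Google News example; the `:page` parameter and selector are hypothetical):

    # explicit values: iterate the start offset from 1 to 101 in steps of 10
    set_iteration(:start, (1..101).step(10))

    # block form: receives the parsed document and must return a non-empty
    # array of offsets, otherwise the iteration stops
    set_iteration(:page) { |document| document.css("a.page").map(&:text) }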
 
+ #### continue\_with
 
- The second iteration methods, `#continue_with`, allows to continue iterating untill an arbitrary block of code returns a positive, non-nil value (to be assigned to the iteration parameter).
-
- continue_with(iteration_parameter, &block)
+ The second iteration method, `#continue_with`, allows the scraper to keep iterating for as long as a block of code returns a truthy, non-nil value (which is then assigned to the iteration parameter).
 
  * __iteration_parameter__ - the scraper's iteration parameter.
  * __&block__ - An arbitrary block of ruby code; its return value will be used to determine the value of the next iteration's offset parameter (see the sketch below).
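A hedged sketch (the `next_page` key is hypothetical; a nil return value stops the iteration):

    continue_with(:page) { |document| document['next_page'] }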
 
-
  ### Running tests
 
  ExtraLoop uses `rspec` and `rr` as its testing framework. The test suite can be run by calling the `rspec` executable from within the `spec` directory:
 
      cd spec
      rspec *
+
data/examples/google_news_scraper.rb CHANGED
@@ -1,10 +1,11 @@
  require '../lib/extraloop'
+ require 'pry'
 
  results = []
 
  ExtraLoop::IterativeScraper.new("https://www.google.com/search?tbm=nws&q=Egypt", :log => {
-   :log_level => :debug,
-   :appenders => [ Logging.appenders.stderr ]
+   #:log_level => :debug,
+   #:appenders => [ Logging.appenders.stderr ]
 
  }).set_iteration(:start, (1..101).step(10)).
  loop_on("h3") { |nodes| nodes.map(&:parent) }.
data/examples/mod_pay_data.rb ADDED
@@ -0,0 +1,32 @@
+ #
+ # Fetch name, job title, and actual pay ceiling from a CSV dataset containing the UK Ministry of Defence's organogram and staff pay data
+ #
+ # source: http://data.gov.uk/dataset/staff-organograms-and-pay-mod
+ #
+
+ require "../lib/extraloop.rb"
+ require "pry"
+
+ class ModPayScraper < ExtraLoop::ScraperBase
+   def initialize
+     dataset_url = "http://www.mod.uk/NR/rdonlyres/FF9761D8-2AB9-4CD4-88BC-983A46A0CD90/0/20111208CTLBOrganogramFinal7Useniordata.csv"
+     super dataset_url, :format => :csv
+
+     # Select only records of officers who earn more than 100k per year
+     loop_on do |rows|
+       rows[1..-1].select { |row| row[14].to_i > 100000 }
+     end
+
+     extract :name, "Name"
+     extract :title, "Job Title"
+     extract :pay, 14
+
+     on("data") do |records|
+       records.sort { |r1, r2| r2.pay <=> r1.pay }.each { |record| puts [record.pay, record.name].map { |string| string.ljust 7 }.join }
+     end
+   end
+ end
+
+
+ ModPayScraper.new.run
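Assuming the script is saved as examples/mod_pay_data.rb (the path listed in the gem manifest below), it can be run from within the examples directory with `ruby mod_pay_data.rb`.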
data/lib/extraloop.rb CHANGED
@@ -16,6 +16,8 @@ gem "typhoeus"
  gem "logging"
 
 
+
+ autoload :CSV, "csv"
  autoload :Nokogiri, "nokogiri"
  autoload :Yajl, "yajl"
  autoload :Typhoeus, "typhoeus"
@@ -29,6 +31,7 @@ ExtraLoop.autoload :ExtractionEnvironment , "#{base_path}/extraction_environment
  ExtraLoop.autoload :ExtractorBase , "#{base_path}/extractor_base"
  ExtraLoop.autoload :DomExtractor , "#{base_path}/dom_extractor"
  ExtraLoop.autoload :JsonExtractor , "#{base_path}/json_extractor"
+ ExtraLoop.autoload :CsvExtractor , "#{base_path}/csv_extractor"
  ExtraLoop.autoload :ExtractionLoop , "#{base_path}/extraction_loop"
  ExtraLoop.autoload :ScraperBase , "#{base_path}/scraper_base"
  ExtraLoop.autoload :Loggable , "#{base_path}/loggable"
data/lib/extraloop/csv_extractor.rb ADDED
@@ -0,0 +1,32 @@
+ class ExtraLoop::CsvExtractor < ExtraLoop::ExtractorBase
+
+   def initialize(*args)
+     super(*args)
+     @selector = args[2] if args[2] && args[2].is_a?(Integer)
+   end
+
+   def extract_field(row, record=nil)
+     target = row = row.respond_to?(:entries) ? row : parse(row)
+     headers = @environment.document.first
+     selector = @selector || @field_name
+
+     # allow using CSV column names or array indices as selectors
+     target = row[headers.index(selector.to_s)] if selector && selector.to_s.match(/[a-z]/i)
+     target = row[selector] if selector.is_a?(Integer)
+
+     target = @environment.run(target, record, &@callback) if @callback
+     target
+   end
+
+   def extract_list(input)
+     rows = (input.respond_to?(:entries) ? input : parse(input))
+     Array(@callback && @environment.run(rows, &@callback) || rows)
+   end
+
+   def parse(input, options=Hash.new)
+     super(input)
+     document = CSV.parse(input, options)
+     @environment.document = document
+   end
+ end
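For reference, a hedged sketch of the three selector styles the extractor accepts, mirroring the spec further down (`env` stands in for an `ExtractionEnvironment` instance):

    ExtraLoop::CsvExtractor.new(:customer_company_name, env)          # header inferred from the field name
    ExtraLoop::CsvExtractor.new(:name, env, "customer_company_name")  # explicit column header
    ExtraLoop::CsvExtractor.new(:name, env, 2)                        # numeric column index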
data/lib/extraloop/dom_extractor.rb CHANGED
@@ -11,7 +11,7 @@ module ExtraLoop
 
    def extract_field(node, record=nil)
      target = node = node.respond_to?(:document) ? node : parse(node)
-     target = node.at_css(@selector) if @selector
+     target = node.at(@selector) if @selector
      target = target.attr(@attribute) if target.respond_to?(:attr) && @attribute
      target = @environment.run(target, record, &@callback) if @callback
 
@@ -30,9 +30,9 @@ module ExtraLoop
  #
 
    def extract_list(input)
-     nodes = input.respond_to?(:document) ? input : parse(input)
+     nodes = (input.respond_to?(:document) ? input : parse(input))
      nodes = nodes.search(@selector) if @selector
-     @callback && Array(@environment.run(nodes, &@callback)) || nodes
+     Array(@callback && @environment.run(nodes, &@callback) || nodes)
    end
 
    def parse(input)
data/lib/extraloop/extraction_environment.rb CHANGED
@@ -7,6 +7,7 @@ module ExtraLoop
    attr_reader :scraper
 
    def initialize(scraper=nil, document=nil, records=nil)
+
      if scraper
        @options = scraper.options
        @results = scraper.results
data/lib/extraloop/extraction_loop.rb CHANGED
@@ -12,7 +12,7 @@ module ExtraLoop
    def initialize(loop_extractor, extractors=[], document=nil, hooks = {}, scraper = nil)
      @loop_extractor = loop_extractor
      @extractors = extractors
-     @document = @loop_extractor.parse(document)
+     @document = document.is_a?(String) ? @loop_extractor.parse(document) : document
      @records = []
      @hooks = hooks
      @environment = ExtractionEnvironment.new(scraper, @document, @records)
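With this change the constructor accepts either a raw response body or an already-parsed document; a hedged sketch (argument names as in the signature above):

    # both forms are now valid; the string form is parsed, anything else is reused as-is
    ExtractionLoop.new(loop_extractor, extractors, "<html>...</html>", hooks, scraper)
    ExtractionLoop.new(loop_extractor, extractors, parsed_document, hooks, scraper)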
data/lib/extraloop/extractor_base.rb CHANGED
@@ -1,5 +1,5 @@
  module ExtraLoop
-   # Pseudo Abstract class.
+   # Pseudo Abstract class from which all extractors inherit.
    # This should not be called directly
    #
    class ExtractorBase
@@ -9,8 +9,9 @@ module ExtraLoop
    end
 
    attr_reader :field_name
+
    #
-   # Public: Initializes a Data extractor.
+   # Public: Initialises a Data extractor.
    #
    # Parameters:
    #   field_name - The machine readable field name
data/lib/extraloop/json_extractor.rb CHANGED
@@ -20,13 +20,9 @@ module ExtraLoop
    end
 
    def extract_list(input)
-     #TODO: implement more clever stuff here after looking
-     # into possible hash traversal techniques
-
-     input = input.is_a?(String) ? parse(input) : input
+     @environment.document = input = (input.is_a?(String) ? parse(input) : input)
      input = input.get_in(@path) if @path
-
-     @callback && Array(@environment.run(input, &@callback)) || input
+     @callback && @environment.run(input, &@callback) || input
    end
 
    def parse(input)
data/lib/extraloop/scraper_base.rb CHANGED
@@ -61,7 +61,8 @@ module ExtraLoop
 
    def loop_on(*args, &block)
      args << block if block
-     @loop_extractor_args = args.insert(0, nil, ExtractionEnvironment.new(self))
+     # we prepend a nil value, as the loop extractor does not need to specify a field name
+     @loop_extractor_args = args.insert(0, nil)
      self
    end
 
@@ -79,7 +80,7 @@ module ExtraLoop
 
    def extract(*args, &block)
      args << block if block
-     @extractor_args << args.insert(1, ExtractionEnvironment.new(self))
+     @extractor_args << args
      self
    end
 
@@ -144,24 +145,42 @@
      @response_count += 1
      @loop = prepare_loop(response)
      log("response ##{@response_count} of #{@queued_count}, status code: [#{response.code}], URL fragment: ...#{response.effective_url.split('/').last if response.effective_url}")
-     @loop.run
 
+     @loop.run
      @environment = @loop.environment
      run_hook(:data, [@loop.records, response])
+     #TODO: add hook for scraper completion (useful in iterative scrapes).
    end
 
    def prepare_loop(response)
-     format = @options[:format] || detect_format(response.headers_hash.fetch('Content-Type', nil))
-     extractor_class = format == :json ? JsonExtractor : DomExtractor
+     content_type = response.headers_hash.fetch('Content-Type', nil)
+     format = @options[:format] || detect_format(content_type)
+
+     extractor_classname = "#{format.to_s.capitalize}Extractor"
+     extractor_class = ExtraLoop.const_defined?(extractor_classname) && ExtraLoop.const_get(extractor_classname) || DomExtractor
+
+     @loop_extractor_args.insert(1, ExtractionEnvironment.new(self))
      loop_extractor = extractor_class.new(*@loop_extractor_args)
-     extractors = @extractor_args.map { |args| extractor_class.new(*args) }
-     ExtractionLoop.new(loop_extractor, extractors, response.body, @hooks, self)
+
+     # There is no point in parsing response.body more than once, so we reuse
+     # the first parsed document
+     document = loop_extractor.parse(response.body)
+
+     extractors = @extractor_args.map do |args|
+       args.insert(1, ExtractionEnvironment.new(self, document))
+       extractor_class.new(*args)
+     end
+
+     ExtractionLoop.new(loop_extractor, extractors, document, @hooks, self)
    end
 
    def detect_format(content_type)
      #TODO: add support for xml/rdf documents
      if content_type && content_type =~ /json$/
        :json
+     elsif content_type && content_type =~ /(csv)|(comma-separated-values)$/
+       :csv
      else
        :html
      end
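In effect, detection now maps Content-Type values roughly as follows (a sketch of the branch above, not an exhaustive list):

    detect_format('application/json') # => :json
    detect_format('text/csv')         # => :csv
    detect_format('text/html')        # => :html (the fallback)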
data/spec/csv_extractor.rb ADDED
@@ -0,0 +1,67 @@
+ require 'helpers/spec_helper'
+
+ describe CsvExtractor do
+   before(:each) do
+     stub(scraper = Object.new).options
+     stub(scraper).results
+     @env = ExtractionEnvironment.new(scraper)
+
+     File.open('fixtures/doc.csv', 'r') { |file|
+       @csv = file.read
+       @parsed_csv = CSV.parse(@csv)
+       file.close
+     }
+   end
+
+   describe "#extract_field" do
+     context "with only a field name defined" do
+       before do
+         @extractor = CsvExtractor.new(:customer_company_name, @env)
+         @extractor.parse(@csv)
+       end
+
+       subject { @extractor.extract_field @parsed_csv[2] }
+       it { should eql("Utility A") }
+     end
+
+     context "with a field name and a selector defined" do
+       before do
+         @extractor = CsvExtractor.new(:name, @env, "customer_company_name")
+         @extractor.parse(@csv)
+       end
+       subject { @extractor.extract_field @parsed_csv[2] }
+       it { should eql("Utility A") }
+     end
+
+     context "with a field name, using a numerical index as selector", :onlythis => true do
+       before do
+         @extractor = CsvExtractor.new(:company_name, @env, 2)
+         @extractor.parse(@csv)
+       end
+       subject { @extractor.extract_field @parsed_csv[2] }
+       it { should eql("Utility A") }
+     end
+
+     context "without any arguments but a callback" do
+       before do
+         @extractor = CsvExtractor.new nil, @env, proc { |row| row[2] }
+         @extractor.parse(@csv)
+       end
+       subject { @extractor.extract_field @parsed_csv[2] }
+       it { should eql("Utility A") }
+     end
+   end
+
+   describe "#extract_list" do
+     context "with no arguments" do
+       subject { CsvExtractor.new(nil, @env).extract_list(@csv) }
+       it { should eql(@parsed_csv) }
+     end
+
+     context "with a callback" do
+       subject { CsvExtractor.new(nil, @env, proc { |rows| rows[0..10] }).extract_list(@csv) }
+       it { should eql(@parsed_csv[0..10]) }
+     end
+   end
+ end
data/spec/dom_extractor_spec.rb CHANGED
@@ -54,17 +54,31 @@ describe DomExtractor do
      end
    end
 
-   context "when a selector and a block is provided" do
+   context "when a selector and a block is provided", :bla => true do
      before do
+       document_defined = scraper_defined = false
+
        @extractor = DomExtractor.new(:anchor, @env, "p a", proc { |node|
+         document_defined = @document && @document.is_a?(Nokogiri::HTML::Document)
+         scraper_defined = instance_variable_defined? "@scraper"
          node.text.gsub("dummy", "fancy")
        })
+
        @node = @extractor.parse(@html)
+       @output = @extractor.extract_field(@node)
+
+       @scraper_defined = scraper_defined
+       @document_defined = document_defined
      end
 
      describe "#extract_field" do
-       subject { @extractor.extract_field(@node) }
-       it { should match(/my fancy/) }
+       it "should return the block output" do
+         @output.should match(/my\sfancy/)
+       end
+       it "should add the @scraper and @document instance variables to the extraction environment" do
+         @scraper_defined.should be_true
+         @document_defined.should be_true
+       end
      end
    end
 
@@ -93,6 +107,7 @@ describe DomExtractor do
      end
    end
 
+
    context "when nothing but a field name is provided" do
      before do
        @extractor = DomExtractor.new(:url, @env)
@@ -117,13 +132,25 @@ describe DomExtractor do
 
    context "block provided" do
      before do
-       @extractor = DomExtractor.new(nil, @env, "div.entry", lambda { |nodeList|
+       document_defined = scraper_defined = false
+
+       @extractor = DomExtractor.new(nil, @env, "div.entry", proc { |nodeList|
+         document_defined = @document && @document.is_a?(Nokogiri::HTML::Document)
+         scraper_defined = instance_variable_defined? "@scraper"
+
          nodeList.reject {|node| node.attr(:class).split(" ").include?('exclude') }
        })
+
+       @output = @extractor.extract_list(@html)
+       @scraper_defined = scraper_defined
+       @document_defined = document_defined
      end
 
-     subject { @extractor.extract_list(@html) }
-     it { subject.should have(2).items }
+     it { @output.should have(2).items }
+     it "should add @scraper and @document instance variables to the ExtractionEnvironment instance" do
+       @scraper_defined.should be_true
+       @document_defined.should be_true
+     end
    end
  end
 
data/spec/fixtures/doc.csv ADDED
@@ -0,0 +1,23 @@
+ contract_id,seller_company_name,customer_company_name,customer_duns_number,contract_affiliate,FERC_tariff_reference,contract_service_agreement_id,contract_execution_date,contract_commencement_date,contract_termination_date,actual_termination_date,extension_provision_description,class_name,term_name,increment_name,increment_peaking_name,product_type_name,product_name,quantity,units_for_contract,rate,rate_minimum,rate_maximum,rate_description,units_for_rate,point_of_receipt_control_area,point_of_receipt_specific_location,point_of_delivery_control_area,point_of_delivery_specific_location,begin_date,end_date,time_zone
+ C71,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Original Volume No. 10,2,2/15/2001,2/15/2001,,,Evergreen,N/A,N/A,N/A,N/A,MB,ENERGY,0,, , , ,Market Based,,,,,,,,ES
+ C72,The Electric Company,Utility A,38495837,n,FERC Electric Tariff Original Volume No. 10,15,7/25/2001,8/1/2001,,,Evergreen,N/A,N/A,N/A,N/A,MB,ENERGY,0,, , , ,Market Based,,,,,,,,ES
+ C73,The Electric Company,Utility B,493758794,N,FERC Electric Tariff Original Volume No. 10,7,6/8/2001,7/6/2001,,,Evergreen,N/A,N/A,N/A,N/A,MB,ENERGY,0,, , , ,Market Based,,,, , ,,,ep
+ C74,The Electric Company,Utility C,594739573,n,FERC Electric Tariff Original Volume No. 10,25,6/8/2001,7/6/2001,,,Evergreen,N/A,N/A,N/A,N/A,MB,ENERGY,0,, , , ,Market Based,,,, , ,,,ep
+ C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,ENERGY,2000,KWh,.1475, , ,Max amount of capacity and energy to be transmitted. Bill based on monthly max delivery to City.,$/KWh,PJM,Point A,PJM,Point B,,,ep
+ C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,point-to-point agreement,2000,KW,0.01, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+ C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,network,2000,KW,0.2, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+ C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,BLACK START SERVICE,2000,KW,0.22, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+ C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,CAPACITY,2000,KW,0.04, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+ C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,regulation & frequency response,2000,KW,0.1, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+ C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,real power transmission loss,2000,KW,7, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+ C76,The Electric Company,The Power Company,456534333,N,FERC Electric Tariff Original Volume No. 10,132,12/15/2001,1/1/2002,12/31/2004,12/31/2004,None,F,LT,M,FP,MB,CAPACITY,70,MW,3750, , ,70MW for each and every hour over the term of the agreement (7x24 schedule).,$/MW,,,,,,,ep
+ C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,35, , ,,$/MWH,,,PJM,Bus 4321,20020101,20030101,EP
+ C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,37, , ,,$/MWH,,,PJM,Bus 4321,20030101,20040101,EP
+ C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,39, , ,,$/MWH,,,PJM,Bus 4321,20040101,20050101,EP
+ C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,41, , ,,$/MWH,,,PJM,Bus 4321,20050101,20060101,EP
+ C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,43, , ,,$/MWH,,,PJM,Bus 4321,20060101,20070101,EP
+ C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,45, , ,,$/MWH,,,PJM,Bus 4321,20070101,20080101,EP
+ C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,47, , ,,$/MWH,,,PJM,Bus 4321,20080101,20090101,EP
+ C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,49, , ,,$/MWH,,,PJM,Bus 4321,20090101,20100101,EP
+ C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,51, , ,,$/MWH,,,PJM,Bus 4321,20100101,20110101,EP
+ C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,53, , ,,$/MWH,,,PJM,Bus 4321,20110101,20120101,EP
data/spec/json_extractor_spec.rb CHANGED
@@ -10,6 +10,7 @@ describe JsonExtractor do
      content = file.read
      file.close
      content
+
    }.call()
  end
 
@@ -37,12 +38,27 @@ describe JsonExtractor do
 
    context "field_name and callback" do
      before do
-       @extractor = JsonExtractor.new(:from_user, @env, proc { |node| node['from_user_name'] } )
+       scraper_defined = document_defined = false
+
+       @extractor = JsonExtractor.new(:from_user, @env, proc { |node|
+         document_defined = @document && @document.is_a?(Hash)
+         scraper_defined = instance_variable_defined? "@scraper"
+
+         node['from_user_name']
+       })
+
        @node = @extractor.parse(@json)['results'].first
+       @output = @extractor.extract_field(@node)
+
+       @scraper_defined = scraper_defined
+       @document_defined = document_defined
      end
 
-     subject { @extractor.extract_field(@node) }
-     it { should eql("Ludovic kohn") }
+     it { @output.should eql("Ludovic kohn") }
+     it "should add the @scraper and @document instance variables to the extraction environment" do
+       @scraper_defined.should be_true
+       @document_defined.should be_true
+     end
    end
 
    context "field_name and attribute" do
@@ -108,12 +124,27 @@ describe JsonExtractor do
 
    context "with pre-parsed input" do
      before do
-       @extractor = JsonExtractor.new(nil, @env, proc { |data| data['results'] })
+       document_defined = scraper_defined = false
+
+       @extractor = JsonExtractor.new(nil, @env, proc { |data|
+         document_defined = @document && @document.is_a?(Hash)
+         scraper_defined = instance_variable_defined? "@scraper"
+         data['results']
+       })
+
+
+       @output = @extractor.extract_list((Yajl::Parser.new).parse(@json))
+       @scraper_defined = scraper_defined
+       @document_defined = document_defined
      end
 
-     subject { @extractor.extract_list((Yajl::Parser.new).parse(@json)) }
-     it { subject.size.should eql(15) }
-     it { should be_an_instance_of(Array) }
+     it { @output.size.should eql(15) }
+     it { @output.should be_an_instance_of(Array) }
+
+     it "should add the @scraper and @document instance variables to the extraction environment" do
+       @scraper_defined.should be_true
+       @document_defined.should be_true
+     end
    end
 
  end
data/spec/scraper_base_spec.rb CHANGED
@@ -12,7 +12,6 @@ describe ScraperBase do
      @scraper = ScraperBase.new("http://localhost/fixture")
    end
 
-
    describe "#loop_on" do
      subject { @scraper.loop_on("bla.bla") }
      it { should be_an_instance_of(ScraperBase) }
@@ -113,7 +112,7 @@ describe ScraperBase do
      stub(@fake_loop).environment { ExtractionEnvironment.new }
      stub(@fake_loop).records { Array(1..3).map { |n| Object.new } }
 
-     mock(ExtractionLoop).new(is_a(DomExtractor), is_a(Array), is_a(String), is_a(Hash), is_a(ScraperBase)).times(3) { @fake_loop }
+     mock(ExtractionLoop).new(is_a(DomExtractor), is_a(Array), is_a(Nokogiri::HTML::Document), is_a(Hash), is_a(ScraperBase)).times(3) { @fake_loop }
    end
 
 
@@ -157,10 +156,9 @@ describe ScraperBase do
      stub(@fake_loop).environment { ExtractionEnvironment.new }
      stub(@fake_loop).records { Array(1..3).map { |n| Object.new } }
 
-     mock(ExtractionLoop).new(is_a(DomExtractor), is_a(Array), is_a(String), is_a(Hash), is_a(ScraperBase)).times(@urls.size) { @fake_loop }
+     mock(ExtractionLoop).new(is_a(DomExtractor), is_a(Array), is_a(Nokogiri::HTML::Document), is_a(Hash), is_a(ScraperBase)).times(@urls.size) { @fake_loop }
    end
 
-
    it "Should handle response" do
      @scraper.run
      @results.size.should eql(@urls.size * 3)
@@ -168,5 +166,4 @@ describe ScraperBase do
    end
  end
  end
-
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: extraloop
  version: !ruby/object:Gem::Version
-   version: 0.0.6
+   version: 0.0.7
  prerelease:
  platform: ruby
  authors:
@@ -9,11 +9,11 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2012-01-30 00:00:00.000000000Z
+ date: 2012-02-28 00:00:00.000000000Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: yajl-ruby
-   requirement: &15579720 !ruby/object:Gem::Requirement
+   requirement: &21376200 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -21,10 +21,10 @@ dependencies:
        version: 1.1.0
    type: :runtime
    prerelease: false
-   version_requirements: *15579720
+   version_requirements: *21376200
  - !ruby/object:Gem::Dependency
    name: nokogiri
-   requirement: &15579260 !ruby/object:Gem::Requirement
+   requirement: &21373200 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -32,10 +32,10 @@ dependencies:
        version: 1.5.0
    type: :runtime
    prerelease: false
-   version_requirements: *15579260
+   version_requirements: *21373200
  - !ruby/object:Gem::Dependency
    name: typhoeus
-   requirement: &15578800 !ruby/object:Gem::Requirement
+   requirement: &21368180 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -43,10 +43,10 @@ dependencies:
        version: 0.3.2
    type: :runtime
    prerelease: false
-   version_requirements: *15578800
+   version_requirements: *21368180
  - !ruby/object:Gem::Dependency
    name: logging
-   requirement: &15578340 !ruby/object:Gem::Requirement
+   requirement: &21365740 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -54,10 +54,10 @@ dependencies:
        version: 0.6.1
    type: :runtime
    prerelease: false
-   version_requirements: *15578340
+   version_requirements: *21365740
  - !ruby/object:Gem::Dependency
    name: rspec
-   requirement: &15577880 !ruby/object:Gem::Requirement
+   requirement: &21363940 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -65,10 +65,10 @@ dependencies:
        version: 2.7.0
    type: :development
    prerelease: false
-   version_requirements: *15577880
+   version_requirements: *21363940
  - !ruby/object:Gem::Dependency
    name: rr
-   requirement: &15577420 !ruby/object:Gem::Requirement
+   requirement: &21362040 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -76,18 +76,18 @@ dependencies:
        version: 1.0.4
    type: :development
    prerelease: false
-   version_requirements: *15577420
+   version_requirements: *21362040
  - !ruby/object:Gem::Dependency
-   name: pry
-   requirement: &15576960 !ruby/object:Gem::Requirement
+   name: pry-nav
+   requirement: &21355940 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
      - !ruby/object:Gem::Version
-       version: 0.9.7.4
+       version: 0.1.0
    type: :development
    prerelease: false
-   version_requirements: *15576960
+   version_requirements: *21355940
  description: A Ruby library for extracting data from websites and web based APIs.
    Supports most common document formats (i.e. HTML, XML, and JSON), and comes with
    a handy mechanism for iterating over paginated datasets.
@@ -99,9 +99,11 @@ files:
  - History.txt
  - README.md
  - examples/google_news_scraper.rb
+ - examples/mod_pay_data.rb
  - examples/wikipedia_categories.rb
  - examples/wikipedia_categories_recoursive.rb
  - lib/extraloop.rb
+ - lib/extraloop/csv_extractor.rb
  - lib/extraloop/dom_extractor.rb
  - lib/extraloop/extraction_environment.rb
  - lib/extraloop/extraction_loop.rb
@@ -112,8 +114,10 @@ files:
  - lib/extraloop/loggable.rb
  - lib/extraloop/scraper_base.rb
  - lib/extraloop/utils.rb
+ - spec/csv_extractor.rb
  - spec/dom_extractor_spec.rb
  - spec/extraction_loop_spec.rb
+ - spec/fixtures/doc.csv
  - spec/fixtures/doc.html
  - spec/fixtures/doc.json
  - spec/helpers/scraper_helper.rb