extraloop 0.0.6 → 0.0.7
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/History.txt +8 -5
- data/README.md +12 -13
- data/examples/google_news_scraper.rb +3 -2
- data/examples/mod_pay_data.rb +32 -0
- data/lib/extraloop.rb +3 -0
- data/lib/extraloop/csv_extractor.rb +32 -0
- data/lib/extraloop/dom_extractor.rb +3 -3
- data/lib/extraloop/extraction_environment.rb +1 -0
- data/lib/extraloop/extraction_loop.rb +1 -1
- data/lib/extraloop/extractor_base.rb +3 -2
- data/lib/extraloop/json_extractor.rb +2 -6
- data/lib/extraloop/scraper_base.rb +26 -7
- data/spec/csv_extractor.rb +67 -0
- data/spec/dom_extractor_spec.rb +33 -6
- data/spec/fixtures/doc.csv +23 -0
- data/spec/json_extractor_spec.rb +38 -7
- data/spec/scraper_base_spec.rb +2 -5
- metadata +22 -18
data/History.txt
CHANGED
@@ -1,14 +1,17 @@
-== 0.0.
+== 0.0.7 / 2012-02-28
+* Added support for CSV data extraction.
+
+== 0.0.5 / 2012-01-14
 * Refactored #extract, #loop_on, and #set_hook to make a more idematic use of ruby blocks
 
-== 0.0.4 /
+== 0.0.4 / 2012-01-14
 * fixed a bug which prevented from subclassing `IterativeScraper` instances
 
-== 0.0.3 /
+== 0.0.3 / 2012-01-01
 * namespaced all classes into the ExtraLoop module
 
-== 0.0.2 /
+== 0.0.2 / 2012-01-01
 * changed repository URL
 
-== 0.0.1 /
+== 0.0.1 / 2012-01-01
 * Project Birthday!
data/README.md
CHANGED
@@ -1,6 +1,6 @@
 # Extra Loop
 
-A Ruby library for extracting data from websites and web based APIs.
+A Ruby library for extracting structured data from websites and web based APIs.
 Supports most common document formats (i.e. HTML, XML, and JSON), and comes with a handy mechanism
 for iterating over paginated datasets.
 
@@ -47,7 +47,7 @@ An iterative Scraper that fetches URL, title, and publisher from some 110 Google
 
 #### scraper options:
 
-* __format__ - Specifies the scraped document format
+* __format__ - Specifies the scraped document format; needed only if the Content-Type in the server response is not the correct one. Supported formats are: 'html', 'xml', 'json', and 'csv'.
 * __async__ - Specifies whether the scraper's HTTP requests should be run in parallel or in series (defaults to false). **Note:** currently only GET requests can be run asynchronously.
 * __log__ - Logging options hash:
   * __loglevel__ - a symbol specifying the desired log level (defaults to `:info`).
@@ -71,7 +71,7 @@ method extracts a specific piece of information from an element (e.g. a story's
     loop_on('div.post') { |posts| posts.reject { |post| post.attr(:class) == 'sticky' } }
 
 Both the `loop_on` and the `extract` methods may be called with a selector, a block or a combination of the two. By default, when parsing DOM documents, `extract` will call
-`Nokogiri::XML::Node#text()`. Alternatively, `extract` also accepts an attribute name and block
+`Nokogiri::XML::Node#text()`. Alternatively, `extract` also accepts an attribute name and a block. The latter is evaluated in the context of the current iteration's element.
 
     # extract a story's title
     extract(:title, 'h3')
@@ -82,13 +82,13 @@ Both the `loop_on` and the `extract` methods may be called with a selector, a bl
     # extract a description text, separating paragraphs with newlines
     extract(:description, "div.description") { |node| node.css("p").map(&:text).join("\n") }
 
-#### Extracting from JSON Documents
+#### Extracting data from JSON Documents
 
-While processing
+While processing an HTTP response, ExtraLoop tries to automatically detect the scraped document format by looking at
 the `ContentType` header sent by the server. This value can be overriden by providing a `:format` key in the scraper's
 initialization options. When format is JSON, the document is parsed using the `yajl` JSON parser and converted into a hash.
-In this case, both the `loop_on` and the `extract` methods still behave as illustrated above, except
-CSS3/XPath selectors
+In this case, both the `loop_on` and the `extract` methods still behave as illustrated above, except it does not support
+CSS3/XPath selectors.
 
 When working with JSON data, you can just use a block and have it return the document elements you want to loop on.
 
@@ -98,7 +98,7 @@ When working with JSON data, you can just use a block and have it return the doc
 Alternatively, the same loop can be defined by passing an array of keys pointing at a hash value located
 at several levels of depth down into the parsed document structure.
 
-    #
+    # Same as above, using a hash path
     loop_on(['query', 'categorymembers'])
 
 When fetching fields from a JSON document fragment, `extract` will often not need a block or an array of keys. If called with only
@@ -120,23 +120,22 @@ one argument, it will in fact try to fetch a hash value using the provided field
 
 The `IterativeScraper` class comes with two methods that allow scrapers to loop over paginated content.
 
-
+#### set\_iteration
 
 * __iteration_parameter__ - A symbol identifying the request parameter that the scraper will use as offset in order to iterate over the paginated content.
 * __array_or_range_or_block__ - Either an explicit set of values or a block of code. If provided, the block is called with the parsed document object as its first argument. The block should return a non empty array, which will determine the value of the offset parameter during each iteration. If the block fails to return a non empty array, the iteration stops.
 
+#### continue\_with
 
-The second iteration
-
-    continue_with(iteration_parameter, &block)
+The second iteration method, `#continue_with`, allows to continue an interation as long as a block of code returns a truthy, non-nil value (to be assigned to the iteration parameter).
 
 * __iteration_parameter__ - the scraper' iteration parameter.
 * __&block__ - An arbitrary block of ruby code, its return value will be used to determine the value of the next iteration's offset parameter.
 
-
 ### Running tests
 
 ExtraLoop uses `rspec` and `rr` as its testing framework. The test suite can be run by calling the `rspec` executable from within the `spec` directory:
 
     cd spec
    rspec *
+
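The two pagination methods documented in the README hunks above compose with the rest of the scraper chain. A minimal sketch of both styles, assuming made-up values: the URL, the `:p` parameter, and the selectors are illustrative only, not from the gem's own examples.

    require 'extraloop'

    # Explicit offsets: request p=1 through p=5.
    ExtraLoop::IterativeScraper.new("http://example.com/search?q=news").
      set_iteration(:p, (1..5).to_a).
      loop_on("div.result").
      extract(:title, "h3").
      on(:data) { |records| records.each { |record| puts record.title } }.
      run

    # Open-ended iteration: continue while the block returns a non-nil offset.
    offsets = (2..4).to_a
    ExtraLoop::IterativeScraper.new("http://example.com/search?q=news").
      continue_with(:p) { offsets.shift }.  # iteration stops once this returns nil
      loop_on("div.result").
      extract(:title, "h3").
      run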
data/examples/google_news_scraper.rb
CHANGED
@@ -1,10 +1,11 @@
 require '../lib/extraloop'
+require 'pry'
 
 results = []
 
 ExtraLoop::IterativeScraper.new("https://www.google.com/search?tbm=nws&q=Egypt", :log => {
-
-
+  #:log_level => :debug,
+  #:appenders => [ Logging.appenders.stderr ]
 
 }).set_iteration(:start, (1..101).step(10)).
   loop_on("h3") { |nodes| nodes.map(&:parent) }.
data/examples/mod_pay_data.rb
ADDED
@@ -0,0 +1,32 @@
+#
+# Fetch name, job title, and actual pay ceiling from a csv dataset containing UK Ministry of Defence's organogram and staff pay data
+#
+# source: http://data.gov.uk/dataset/staff-organograms-and-pay-mod
+#
+
+require "../lib/extraloop.rb"
+require "pry"
+
+class ModPayScraper < ExtraLoop::ScraperBase
+  def initialize
+    dataset_url = "http://www.mod.uk/NR/rdonlyres/FF9761D8-2AB9-4CD4-88BC-983A46A0CD90/0/20111208CTLBOrganogramFinal7Useniordata.csv"
+    super dataset_url, :format => :csv
+
+    # Select only record of officiers who earn more than 100k per year
+    loop_on do |rows|
+      rows[1..-1].select { |row| row[14].to_i > 100000 }
+    end
+
+    extract :name, "Name"
+    extract :title, "Job Title"
+    extract :pay, 14
+
+    on("data") do |records|
+
+      records.sort { |r1, r2| r2.pay <=> r1.pay }.each { |record| puts [record.pay, record.name].map { |string| string.ljust 7 }.join }
+    end
+  end
+end
+
+
+ModPayScraper.new.run
data/lib/extraloop.rb
CHANGED
@@ -16,6 +16,8 @@ gem "typhoeus"
 gem "logging"
 
 
+
+autoload :CSV, "csv"
 autoload :Nokogiri, "nokogiri"
 autoload :Yajl, "yajl"
 autoload :Typhoeus, "typhoeus"
@@ -29,6 +31,7 @@ ExtraLoop.autoload :ExtractionEnvironment , "#{base_path}/extraction_environment
 ExtraLoop.autoload :ExtractorBase , "#{base_path}/extractor_base"
 ExtraLoop.autoload :DomExtractor , "#{base_path}/dom_extractor"
 ExtraLoop.autoload :JsonExtractor , "#{base_path}/json_extractor"
+ExtraLoop.autoload :CsvExtractor , "#{base_path}/csv_extractor"
 ExtraLoop.autoload :ExtractionLoop , "#{base_path}/extraction_loop"
 ExtraLoop.autoload :ScraperBase , "#{base_path}/scraper_base"
 ExtraLoop.autoload :Loggable , "#{base_path}/loggable"
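The new `autoload :CSV, "csv"` registration defers loading Ruby's standard-library CSV parser until the constant is first referenced (here, by the new `CsvExtractor`). A standalone sketch of the mechanism, not part of the gem:

    # autoload registers a constant without requiring the file yet;
    # the require fires on first reference.
    autoload :CSV, "csv"

    p Object.autoload?(:CSV)  # => "csv" (load still pending)
    CSV.parse("a,b\n")        # first use triggers require "csv"
    p Object.autoload?(:CSV)  # => nil (already loaded)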
data/lib/extraloop/csv_extractor.rb
ADDED
@@ -0,0 +1,32 @@
+class ExtraLoop::CsvExtractor < ExtraLoop::ExtractorBase
+
+  def initialize(*args)
+    super(*args)
+    @selector = args[2] if args[2] && args[2].is_a?(Integer)
+  end
+
+  def extract_field(row, record=nil)
+    target = row = row.respond_to?(:entries)? row : parse(row)
+    headers = @environment.document.first
+    selector = !@selector && @field_name || @selector
+
+    # allow using CSV column names or array indices as selectors
+    target = row[headers.index(selector.to_s)] if selector && selector.to_s.match(/[a-z]/i)
+    target = row[selector] if selector.is_a?(Integer)
+
+    target = @environment.run(target, record, &@callback) if @callback
+    target
+  end
+
+  def extract_list(input)
+    rows = (input.respond_to?(:entries) ? input : parse(input))
+    Array(@callback && @environment.run(rows, &@callback) || rows)
+  end
+
+
+  def parse(input, options=Hash.new)
+    super(input)
+    document = CSV.parse(input, options)
+    @environment.document = document
+  end
+end
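As the hunk above shows, a `CsvExtractor` selector may be a column name (any value containing a letter, resolved against the header row) or an integer index into the row; with no selector, the field name itself is used. A minimal sketch of that resolution logic against plain `CSV.parse` output, using made-up data rather than the gem's fixture:

    require 'csv'

    csv  = "Name,Job Title,Pay\nAlice,Director,120000\nBob,Analyst,45000\n"
    rows = CSV.parse(csv)
    headers = rows.first                       # => ["Name", "Job Title", "Pay"]

    # selector contains a letter: look it up in the header row
    puts rows[1][headers.index("Job Title")]   # => "Director"

    # integer selector: index the row directly
    puts rows[1][2]                            # => "120000"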
data/lib/extraloop/dom_extractor.rb
CHANGED
@@ -11,7 +11,7 @@ module ExtraLoop
 
    def extract_field(node, record=nil)
      target = node = node.respond_to?(:document) ? node : parse(node)
-      target = node.
+      target = node.at(@selector) if @selector
      target = target.attr(@attribute) if target.respond_to?(:attr) && @attribute
      target = @environment.run(target, record, &@callback) if @callback
 
@@ -30,9 +30,9 @@ module ExtraLoop
    #
 
    def extract_list(input)
-      nodes = input.respond_to?(:document) ? input : parse(input)
+      nodes = (input.respond_to?(:document) ? input : parse(input))
      nodes = nodes.search(@selector) if @selector
-      @callback &&
+      Array(@callback && @environment.run(nodes, &@callback) || nodes)
    end
 
    def parse(input)
data/lib/extraloop/extraction_loop.rb
CHANGED
@@ -12,7 +12,7 @@ module ExtraLoop
    def initialize(loop_extractor, extractors=[], document=nil, hooks = {}, scraper = nil)
      @loop_extractor = loop_extractor
      @extractors = extractors
-      @document = @loop_extractor.parse(document)
+      @document = document.is_a?(String) ? @loop_extractor.parse(document) : document
      @records = []
      @hooks = hooks
      @environment = ExtractionEnvironment.new(scraper, @document, @records)
data/lib/extraloop/extractor_base.rb
CHANGED
@@ -1,5 +1,5 @@
 module ExtraLoop
-  # Pseudo Abstract class.
+  # Pseudo Abstract class from which all extractors inherit.
   # This should not be called directly
   #
   class ExtractorBase
@@ -9,8 +9,9 @@ module ExtraLoop
    end
 
    attr_reader :field_name
+
    #
-    # Public:
+    # Public: Initialises a Data extractor.
    #
    # Parameters:
    # field_name - The machine readable field name
data/lib/extraloop/json_extractor.rb
CHANGED
@@ -20,13 +20,9 @@ module ExtraLoop
    end
 
    def extract_list(input)
-
-      # into possible hash traversal techniques
-
-      input = input.is_a?(String) ? parse(input) : input
+      @environment.document = input = (input.is_a?(String) ? parse(input) : input)
      input = input.get_in(@path) if @path
-
-      @callback && Array(@environment.run(input, &@callback)) || input
+      @callback && @environment.run(input, &@callback) || input
    end
 
    def parse(input)
data/lib/extraloop/scraper_base.rb
CHANGED
@@ -61,7 +61,8 @@ module ExtraLoop
 
    def loop_on(*args, &block)
      args << block if block
-
+      # we prepend a nil value, as the loop extractor does not need to specify a field name
+      @loop_extractor_args = args.insert(0, nil)
      self
    end
 
@@ -79,7 +80,7 @@ module ExtraLoop
 
    def extract(*args, &block)
      args << block if block
-      @extractor_args << args
+      @extractor_args << args
      self
    end
 
@@ -144,24 +145,42 @@ module ExtraLoop
      @response_count += 1
      @loop = prepare_loop(response)
      log("response ##{@response_count} of #{@queued_count}, status code: [#{response.code}], URL fragment: ...#{response.effective_url.split('/').last if response.effective_url}")
-      @loop.run
 
+      @loop.run
      @environment = @loop.environment
      run_hook(:data, [@loop.records, response])
+      #TODO: add hock for scraper completion (useful in iterative scrapes).
    end
 
    def prepare_loop(response)
-
-
+      content_type = response.headers_hash.fetch('Content-Type', nil)
+      format = @options[:format] || detect_format(content_type)
+
+      extractor_classname = "#{format.to_s.capitalize}Extractor"
+      extractor_class = ExtraLoop.const_defined?(extractor_classname) && ExtraLoop.const_get(extractor_classname) || DomExtractor
+
+      @loop_extractor_args.insert(1, ExtractionEnvironment.new(self))
      loop_extractor = extractor_class.new(*@loop_extractor_args)
-
-
+
+      # There is no point in parsing response.body more than once, so we reuse
+      # the first parsed document
+
+      document = loop_extractor.parse(response.body)
+
+      extractors = @extractor_args.map do |args|
+        args.insert(1, ExtractionEnvironment.new(self, document))
+        extractor_class.new(*args)
+      end
+
+      ExtractionLoop.new(loop_extractor, extractors, document, @hooks, self)
    end
 
    def detect_format(content_type)
      #TODO: add support for xml/rdf documents
      if content_type && content_type =~ /json$/
        :json
+      elsif content_type && content_type =~ /(csv)|(comma-separated-values)$/
+        :csv
      else
        :html
      end
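The `prepare_loop` rewrite above picks an extractor class by capitalizing the detected format: `:csv` yields `CsvExtractor`, `:json` yields `JsonExtractor`, while `:html` yields no `HtmlExtractor` constant, so the `const_defined?` check falls back to `DomExtractor`. The new Content-Type detection can be exercised on its own; a sketch restating the method from the hunk above:

    def detect_format(content_type)
      if content_type && content_type =~ /json$/
        :json
      elsif content_type && content_type =~ /(csv)|(comma-separated-values)$/
        :csv
      else
        :html
      end
    end

    p detect_format("application/json")             # => :json
    p detect_format("text/csv")                     # => :csv
    p detect_format("text/comma-separated-values")  # => :csv
    p detect_format("text/html; charset=utf-8")     # => :html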
data/spec/csv_extractor.rb
ADDED
@@ -0,0 +1,67 @@
+require 'helpers/spec_helper'
+
+describe JsonExtractor do
+  before(:each) do
+    stub(scraper = Object.new).options
+    stub(scraper).results
+    @env = ExtractionEnvironment.new(scraper)
+
+    File.open('fixtures/doc.csv', 'r') { |file|
+      @csv = file.read
+      @parsed_csv = CSV.parse(@csv)
+      file.close
+    }
+
+  end
+
+  describe "#extract_field" do
+    context "with only a field name defined" do
+      before do
+        @extractor = CsvExtractor.new(:customer_company_name, @env)
+        @extractor.parse(@csv)
+      end
+
+      subject { @extractor.extract_field @parsed_csv[2] }
+      it { should eql("Utility A") }
+    end
+
+    context "with a field name and a selector defined" do
+      before do
+        @extractor = CsvExtractor.new(:name, @env, "customer_company_name")
+        @extractor.parse(@csv)
+      end
+      subject { @extractor.extract_field @parsed_csv[2] }
+      it { should eql("Utility A") }
+    end
+
+    context "with a field name, using a numerical index as selector", :onlythis => true do
+      before do
+        @extractor = CsvExtractor.new(:company_name, @env, 2)
+        @extractor.parse(@csv)
+      end
+      subject { @extractor.extract_field @parsed_csv[2] }
+      it { should eql("Utility A") }
+    end
+
+    context "Without any other arguments but a callback" do
+      before do
+        @extractor = CsvExtractor.new nil, @env, proc { |row| row[2] }
+        @extractor.parse(@csv)
+      end
+      subject { @extractor.extract_field @parsed_csv[2] }
+      it { should eql("Utility A") }
+    end
+  end
+
+  describe "#extract_list" do
+    context "with no arguments" do
+      subject { CsvExtractor.new(nil, @env).extract_list(@csv) }
+      it { should eql(@parsed_csv) }
+    end
+
+    context "with a callback" do
+      subject { CsvExtractor.new(nil, @env, proc { |rows| rows[0..10] }).extract_list(@csv) }
+      it { should eql(@parsed_csv[0..10]) }
+    end
+  end
+end
data/spec/dom_extractor_spec.rb
CHANGED
@@ -54,17 +54,31 @@ describe DomExtractor do
    end
  end
 
-  context "when a selector and a block is provided" do
+  context "when a selector and a block is provided", :bla => true do
    before do
+      document_defined = scraper_defined = false
+
      @extractor = DomExtractor.new(:anchor, @env, "p a", proc { |node|
+        document_defined = @document && @document.is_a?(Nokogiri::HTML::Document)
+        scraper_defined = instance_variable_defined? "@scraper"
        node.text.gsub("dummy", "fancy")
      })
+
      @node = @extractor.parse(@html)
+      @output = @extractor.extract_field(@node)
+
+      @scraper_defined = scraper_defined
+      @document_defined = document_defined
    end
 
    describe "#extract_field" do
-
-
+      it "should return the block output" do
+        @output.should match(/my\sfancy/)
+      end
+      it "should add the @scraper and @document instance variables to the extraction environment" do
+        @scraper_defined.should be_true
+        @document_defined.should be_true
+      end
    end
  end
 
@@ -93,6 +107,7 @@ describe DomExtractor do
    end
  end
 
+
  context "when nothing but a field name is provided" do
    before do
      @extractor = DomExtractor.new(:url, @env)
@@ -117,13 +132,25 @@ describe DomExtractor do
 
  context "block provided" do
    before do
-
+      document_defined = scraper_defined = false
+
+      @extractor = DomExtractor.new(nil, @env, "div.entry", proc { |nodeList|
+        document_defined = @document && @document.is_a?(Nokogiri::HTML::Document)
+        scraper_defined = instance_variable_defined? "@scraper"
+
      nodeList.reject {|node| node.attr(:class).split(" ").include?('exclude') }
      })
+
+      @output = @extractor.extract_list(@html)
+      @scraper_defined = scraper_defined
+      @document_defined = document_defined
    end
 
-
-    it
+    it { @output.should have(2).items }
+    it "should add @scraper and @document instance variables to the ExtractionEnvironment instance" do
+      @scraper_defined.should be_true
+      @document_defined.should be_true
+    end
  end
 end
 
data/spec/fixtures/doc.csv
ADDED
@@ -0,0 +1,23 @@
+contract_id,seller_company_name,customer_company_name,customer_duns_number,contract_affiliate,FERC_tariff_reference,contract_service_agreement_id,contract_execution_date,contract_commencement_date,contract_termination_date,actual_termination_date,extension_provision_description,class_name,term_name,increment_name,increment_peaking_name,product_type_name,product_name,quantity,units_for_contract,rate,rate_minimum,rate_maximum,rate_description,units_for_rate,point_of_receipt_control_area,point_of_receipt_specific_location,point_of_delivery_control_area,point_of_delivery_specific_location,begin_date,end_date,time_zone
+C71,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Original Volume No. 10,2,2/15/2001,2/15/2001,,,Evergreen,N/A,N/A,N/A,N/A,MB,ENERGY,0,, , , ,Market Based,,,,,,,,ES
+C72,The Electric Company,Utility A,38495837,n,FERC Electric Tariff Original Volume No. 10,15,7/25/2001,8/1/2001,,,Evergreen,N/A,N/A,N/A,N/A,MB,ENERGY,0,, , , ,Market Based,,,,,,,,ES
+C73,The Electric Company,Utility B,493758794,N,FERC Electric Tariff Original Volume No. 10,7,6/8/2001,7/6/2001,,,Evergreen,N/A,N/A,N/A,N/A,MB,ENERGY,0,, , , ,Market Based,,,, , ,,,ep
+C74,The Electric Company,Utility C,594739573,n,FERC Electric Tariff Original Volume No. 10,25,6/8/2001,7/6/2001,,,Evergreen,N/A,N/A,N/A,N/A,MB,ENERGY,0,, , , ,Market Based,,,, , ,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,ENERGY,2000,KWh,.1475, , ,Max amount of capacity and energy to be transmitted. Bill based on monthly max delivery to City.,$/KWh,PJM,Point A,PJM,Point B,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,point-to-point agreement,2000,KW,0.01, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,network,2000,KW,0.2, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,BLACK START SERVICE,2000,KW,0.22, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,CAPACITY,2000,KW,0.04, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,regulation & frequency response,2000,KW,0.1, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,real power transmission loss,2000,KW,7, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep
+C76,The Electric Company,The Power Company,456534333,N,FERC Electric Tariff Original Volume No. 10,132,12/15/2001,1/1/2002,12/31/2004,12/31/2004,None,F,LT,M,FP,MB,CAPACITY,70,MW,3750, , ,70MW for each and every hour over the term of the agreement (7x24 schedule).,$/MW,,,,,,,ep
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,35, , ,,$/MWH,,,PJM,Bus 4321,20020101,20030101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,37, , ,,$/MWH,,,PJM,Bus 4321,20030101,20040101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,39, , ,,$/MWH,,,PJM,Bus 4321,20040101,20050101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,41, , ,,$/MWH,,,PJM,Bus 4321,20050101,20060101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,43, , ,,$/MWH,,,PJM,Bus 4321,20060101,20070101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,45, , ,,$/MWH,,,PJM,Bus 4321,20070101,20080101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,47, , ,,$/MWH,,,PJM,Bus 4321,20080101,20090101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,49, , ,,$/MWH,,,PJM,Bus 4321,20090101,20100101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,51, , ,,$/MWH,,,PJM,Bus 4321,20100101,20110101,EP
+C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,53, , ,,$/MWH,,,PJM,Bus 4321,20110101,20120101,EP
data/spec/json_extractor_spec.rb
CHANGED
@@ -10,6 +10,7 @@ describe JsonExtractor do
      content = file.read
      file.close
      content
+
    }.call()
  end
 
@@ -37,12 +38,27 @@ describe JsonExtractor do
 
  context "field_name and callback" do
    before do
-
+      scraper_defined = document_defined = false
+
+      @extractor = JsonExtractor.new(:from_user, @env, proc { |node|
+        document_defined = @document && @document.is_a?(Hash)
+        scraper_defined = instance_variable_defined? "@scraper"
+
+        node['from_user_name']
+      })
+
      @node = @extractor.parse(@json)['results'].first
+      @output = @extractor.extract_field(@node)
+
+      @scraper_defined = scraper_defined
+      @document_defined = document_defined
    end
 
-
-    it
+    it { @output.should eql("Ludovic kohn") }
+    it "should add the @scraper and @document instance variables to the extraction environment" do
+      @scraper_defined.should be_true
+      @document_defined.should be_true
+    end
  end
 
  context "field_name and attribute" do
@@ -108,12 +124,27 @@ describe JsonExtractor do
 
  context "with pre-parsed input" do
    before do
-
+      document_defined = scraper_defined = false
+
+      @extractor = JsonExtractor.new(nil, @env, proc { |data|
+        document_defined = @document && @document.is_a?(Hash)
+        scraper_defined = instance_variable_defined? "@scraper"
+        data['results']
+      })
+
+
+      @output = @extractor.extract_list((Yajl::Parser.new).parse(@json))
+      @scraper_defined = scraper_defined
+      @document_defined = document_defined
    end
 
-
-    it {
-
+    it { @output.size.should eql(15) }
+    it { @output.should be_an_instance_of(Array) }
+
+    it "should add the @scraper and @document instance variables to the extraction environment" do
+      @scraper_defined.should be_true
+      @document_defined.should be_true
+    end
  end
 
 end
data/spec/scraper_base_spec.rb
CHANGED
@@ -12,7 +12,6 @@ describe ScraperBase do
    @scraper = ScraperBase.new("http://localhost/fixture")
  end
 
-
  describe "#loop_on" do
    subject { @scraper.loop_on("bla.bla") }
    it { should be_an_instance_of(ScraperBase) }
@@ -113,7 +112,7 @@ describe ScraperBase do
    stub(@fake_loop).environment { ExtractionEnvironment.new }
    stub(@fake_loop).records { Array(1..3).map { |n| Object.new } }
 
-    mock(ExtractionLoop).new(is_a(DomExtractor), is_a(Array), is_a(
+    mock(ExtractionLoop).new(is_a(DomExtractor), is_a(Array), is_a(Nokogiri::HTML::Document), is_a(Hash), is_a(ScraperBase)).times(3) { @fake_loop }
  end
 
 
@@ -157,10 +156,9 @@ describe ScraperBase do
    stub(@fake_loop).environment { ExtractionEnvironment.new }
    stub(@fake_loop).records { Array(1..3).map { |n| Object.new } }
 
-    mock(ExtractionLoop).new(is_a(DomExtractor), is_a(Array), is_a(
+    mock(ExtractionLoop).new(is_a(DomExtractor), is_a(Array), is_a(Nokogiri::HTML::Document), is_a(Hash), is_a(ScraperBase)).times(@urls.size) { @fake_loop }
  end
 
-
  it "Should handle response" do
    @scraper.run
    @results.size.should eql(@urls.size * 3)
@@ -168,5 +166,4 @@ describe ScraperBase do
    end
  end
 end
-
 end
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: extraloop
 version: !ruby/object:Gem::Version
-  version: 0.0.6
+  version: 0.0.7
 prerelease:
 platform: ruby
 authors:
@@ -9,11 +9,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-
+date: 2012-02-28 00:00:00.000000000Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: yajl-ruby
-  requirement: &
+  requirement: &21376200 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -21,10 +21,10 @@ dependencies:
         version: 1.1.0
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *21376200
 - !ruby/object:Gem::Dependency
   name: nokogiri
-  requirement: &
+  requirement: &21373200 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -32,10 +32,10 @@ dependencies:
         version: 1.5.0
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *21373200
 - !ruby/object:Gem::Dependency
   name: typhoeus
-  requirement: &
+  requirement: &21368180 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -43,10 +43,10 @@ dependencies:
         version: 0.3.2
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *21368180
 - !ruby/object:Gem::Dependency
   name: logging
-  requirement: &
+  requirement: &21365740 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -54,10 +54,10 @@ dependencies:
         version: 0.6.1
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *21365740
 - !ruby/object:Gem::Dependency
   name: rspec
-  requirement: &
+  requirement: &21363940 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -65,10 +65,10 @@ dependencies:
         version: 2.7.0
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *21363940
 - !ruby/object:Gem::Dependency
   name: rr
-  requirement: &
+  requirement: &21362040 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -76,18 +76,18 @@ dependencies:
         version: 1.0.4
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *21362040
 - !ruby/object:Gem::Dependency
-  name: pry
-  requirement: &
+  name: pry-nav
+  requirement: &21355940 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
     - !ruby/object:Gem::Version
-      version: 0.
+      version: 0.1.0
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *21355940
 description: A Ruby library for extracting data from websites and web based APIs.
   Supports most common document formats (i.e. HTML, XML, and JSON), and comes with
   a handy mechanism for iterating over paginated datasets.
@@ -99,9 +99,11 @@ files:
 - History.txt
 - README.md
 - examples/google_news_scraper.rb
+- examples/mod_pay_data.rb
 - examples/wikipedia_categories.rb
 - examples/wikipedia_categories_recoursive.rb
 - lib/extraloop.rb
+- lib/extraloop/csv_extractor.rb
 - lib/extraloop/dom_extractor.rb
 - lib/extraloop/extraction_environment.rb
 - lib/extraloop/extraction_loop.rb
@@ -112,8 +114,10 @@ files:
 - lib/extraloop/loggable.rb
 - lib/extraloop/scraper_base.rb
 - lib/extraloop/utils.rb
+- spec/csv_extractor.rb
 - spec/dom_extractor_spec.rb
 - spec/extraction_loop_spec.rb
+- spec/fixtures/doc.csv
 - spec/fixtures/doc.html
 - spec/fixtures/doc.json
 - spec/helpers/scraper_helper.rb