extraloop 0.0.7 → 0.0.8
- data/History.txt +3 -0
- data/README.md +12 -12
- data/examples/mod_pay_data.rb +6 -6
- data/examples/wikipedia_categories.rb +0 -4
- data/lib/extraloop.rb +1 -1
- data/lib/extraloop/dom_extractor.rb +2 -1
- data/lib/extraloop/json_extractor.rb +1 -1
- data/lib/extraloop/scraper_base.rb +13 -6
- data/spec/scraper_base_spec.rb +28 -2
- metadata +32 -19
data/History.txt
CHANGED
data/README.md
CHANGED
@@ -1,14 +1,14 @@
 # Extra Loop
 
 A Ruby library for extracting structured data from websites and web based APIs.
-Supports most common document formats (i.e. HTML, XML, and JSON), and comes with a handy mechanism
+Supports most common document formats (i.e. HTML, XML, CSV, and JSON), and comes with a handy mechanism
 for iterating over paginated datasets.
 
-
+## Installation:
 
 gem install extraloop
 
-
+## Usage:
 
 A basic scraper that fetches the top 25 websites from [Alexa's daily top 100](www.alexa.com/topsites) list:
 
@@ -37,7 +37,7 @@ An iterative Scraper that fetches URL, title, and publisher from some 110 Google
 run()
 
 
-
+## Scraper initialisation signature
 
 #new(urls, scraper_options, http_options)
 
@@ -45,7 +45,7 @@ An iterative Scraper that fetches URL, title, and publisher from some 110 Google
 - __scraper_options__ - hash of scraper options (see below).
 - __http_options__ - hash of request options for `Typheous::Request#initialize` (see [API documentation](http://rubydoc.info/github/pauldix/typhoeus/master/Typhoeus/Request#initialize-instance_method) for details).
 
-
+### scraper options:
 
 * __format__ - Specifies the scraped document format; needed only if the Content-Type in the server response is not the correct one. Supported formats are: 'html', 'xml', 'json', and 'csv'.
 * __async__ - Specifies whether the scraper's HTTP requests should be run in parallel or in series (defaults to false). **Note:** currently only GET requests can be run asynchronously.
@@ -53,7 +53,7 @@ An iterative Scraper that fetches URL, title, and publisher from some 110 Google
 * __loglevel__ - a symbol specifying the desired log level (defaults to `:info`).
 * __appenders__ - a list of Logging.appenders object (defaults to `Logging.appenders.sterr`).
 
-
+## Extractors
 
 ExtraLoop allows to fetch structured data from online documents by looping through a list of elements matching a given selector.
 For each matched element, an arbitrary set of fields can be extracted. While the `loop_on` method sets up such loop, the `extract`
@@ -82,7 +82,7 @@ Both the `loop_on` and the `extract` methods may be called with a selector, a bl
 # extract a description text, separating paragraphs with newlines
 extract(:description, "div.description") { |node| node.css("p").map(&:text).join("\n") }
 
-
+### Extracting data from JSON Documents
 
 While processing an HTTP response, ExtraLoop tries to automatically detect the scraped document format by looking at
 the `ContentType` header sent by the server. This value can be overriden by providing a `:format` key in the scraper's
@@ -116,26 +116,26 @@ one argument, it will in fact try to fetch a hash value using the provided field
 # => "johndoe"
 
 
-
+## Iteration methods
 
 The `IterativeScraper` class comes with two methods that allow scrapers to loop over paginated content.
 
-
+### set\_iteration
 
 * __iteration_parameter__ - A symbol identifying the request parameter that the scraper will use as offset in order to iterate over the paginated content.
 * __array_or_range_or_block__ - Either an explicit set of values or a block of code. If provided, the block is called with the parsed document object as its first argument. The block should return a non empty array, which will determine the value of the offset parameter during each iteration. If the block fails to return a non empty array, the iteration stops.
 
-
+### continue\_with
 
 The second iteration method, `#continue_with`, allows to continue an interation as long as a block of code returns a truthy, non-nil value (to be assigned to the iteration parameter).
 
 * __iteration_parameter__ - the scraper' iteration parameter.
 * __&block__ - An arbitrary block of ruby code, its return value will be used to determine the value of the next iteration's offset parameter.
 
-
+## Running tests
 
 ExtraLoop uses `rspec` and `rr` as its testing framework. The test suite can be run by calling the `rspec` executable from within the `spec` directory:
 
 cd spec
 rspec *
-
+
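The `continue_with` contract documented in the README changes above can be sketched in plain Ruby. This is a schematic illustration only, not the gem's internals: `iterate` and the numeric offsets are made-up stand-ins for the scraper's request loop.

```ruby
# Schematic sketch of the continue_with iteration contract:
# keep iterating while the block returns a truthy, non-nil value,
# which becomes the next value of the iteration parameter.
def iterate(first_offset)
  offsets = []
  offset = first_offset
  while offset
    offsets << offset
    # the block returns the next offset, or nil to stop the iteration
    offset = yield(offset)
  end
  offsets
end

# stop once the offset reaches 30: pages 10, 20, 30 are visited
visited = iterate(10) { |offset| offset < 30 ? offset + 10 : nil }
```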
data/examples/mod_pay_data.rb
CHANGED
@@ -1,6 +1,5 @@
 #
-#
-#
+# Fetches name, job title, and actual pay ceiling from a CSV dataset listing UK Ministry of Defence's organogram and staff pay data
 # source: http://data.gov.uk/dataset/staff-organograms-and-pay-mod
 #
 
@@ -12,7 +11,8 @@ class ModPayScraper < ExtraLoop::ScraperBase
     dataset_url = "http://www.mod.uk/NR/rdonlyres/FF9761D8-2AB9-4CD4-88BC-983A46A0CD90/0/20111208CTLBOrganogramFinal7Useniordata.csv"
     super dataset_url, :format => :csv
 
-    # Select only
+    # Select only records of officers earning more than 100k per year
+
     loop_on do |rows|
       rows[1..-1].select { |row| row[14].to_i > 100000 }
     end
@@ -22,11 +22,11 @@ class ModPayScraper < ExtraLoop::ScraperBase
     extract :pay, 14
 
     on("data") do |records|
-
-
+      records.
+        sort { |r1, r2| r2.pay.to_i <=> r1.pay.to_i }.
+        each { |record| puts [record.pay, record.name].map { |string| string.ljust 7 }.join }
     end
   end
 end
 
-
 ModPayScraper.new.run
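The filter/sort/format pipeline in the example above can be exercised against made-up rows. The `Record` struct, the sample data, and the padding rows here are illustrative stand-ins; only the use of column 14 as the pay ceiling comes from the script itself.

```ruby
# Made-up sample rows mimicking the CSV layout: a header row
# followed by data rows with the pay ceiling in column 14.
Record = Struct.new(:name, :pay)

rows = [
  ["Name",        *([""] * 13), "Actual Pay Ceiling"],  # header, dropped by rows[1..-1]
  ["A. Officer",  *([""] * 13), "120000"],
  ["B. Clerk",    *([""] * 13), "45000"],
  ["C. Director", *([""] * 13), "150000"],
]

# keep only records above the 100k threshold, as in loop_on
records = rows[1..-1].
  select { |row| row[14].to_i > 100000 }.
  map    { |row| Record.new(row[0], row[14]) }

# sort descending by pay and format each line, as in the "data" hook
lines = records.
  sort { |r1, r2| r2.pay.to_i <=> r1.pay.to_i }.
  map  { |record| [record.pay, record.name].map { |string| string.ljust 7 }.join }

puts lines
```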
data/examples/wikipedia_categories.rb
CHANGED
@@ -18,10 +18,6 @@ params = {
 
 options = {
   :format => :json,
-  :log => {
-    :appenders => [Logging.appenders.stderr],
-    :log_level => :info
-  }
 }
 request_arguments = { :params => params, :headers => {
   "User-Agent" => "ExtraLoop - ruby data extraction toolkit: http://github.com/afiore/extraloop"
data/lib/extraloop.rb
CHANGED
data/lib/extraloop/dom_extractor.rb
CHANGED
@@ -32,7 +32,8 @@ module ExtraLoop
   def extract_list(input)
     nodes = (input.respond_to?(:document) ? input : parse(input))
     nodes = nodes.search(@selector) if @selector
-
+    nodes = nodes.css("*") unless @selector or @callback
+    @callback && Array(@environment.run(nodes, &@callback)) || nodes
   end
 
   def parse(input)
data/lib/extraloop/json_extractor.rb
CHANGED
@@ -22,7 +22,7 @@ module ExtraLoop
   def extract_list(input)
     @environment.document = input = (input.is_a?(String) ? parse(input) : input)
     input = input.get_in(@path) if @path
-    @callback && @environment.run(input, &@callback) || input
+    Array(@callback && @environment.run(input, &@callback) || input)
   end
 
   def parse(input)
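The `Array(...)` wrapper introduced in `extract_list` leans on `Kernel#Array`'s normalisation rules, which plain Ruby shows directly: the extractor now always returns an array, whatever the callback produced.

```ruby
# Kernel#Array, as relied on by the new extract_list return value:
# nil becomes [], a single value is wrapped, an array is left as-is.
results = [
  Array(nil),            # no callback result and no match
  Array("johndoe"),      # a single extracted value
  Array(%w[a b c]),      # an already-array result
]
```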
data/lib/extraloop/scraper_base.rb
CHANGED
@@ -23,7 +23,7 @@ module ExtraLoop
 
   def initialize(urls, options = {}, arguments = {})
     @urls = Array(urls)
-    @loop_extractor_args =
+    @loop_extractor_args = []
     @extractor_args = []
     @loop = nil
 
@@ -61,8 +61,8 @@ module ExtraLoop
 
   def loop_on(*args, &block)
     args << block if block
-    #
-    @loop_extractor_args = args.insert(0, nil)
+    # prepend placeholder values for loop name and extraction environment
+    @loop_extractor_args = args.insert(0, nil, nil)
     self
   end
 
@@ -129,6 +129,9 @@ module ExtraLoop
     end
 
     log("queueing url: #{url}, params #{arguments[:params]}", :debug)
+
+
+
     @queued_count += 1
     @hydra.queue(request)
   end
@@ -147,6 +150,7 @@ module ExtraLoop
     log("response ##{@response_count} of #{@queued_count}, status code: [#{response.code}], URL fragment: ...#{response.effective_url.split('/').last if response.effective_url}")
 
     @loop.run
+
     @environment = @loop.environment
     run_hook(:data, [@loop.records, response])
     #TODO: add hock for scraper completion (useful in iterative scrapes).
@@ -159,7 +163,10 @@ module ExtraLoop
     extractor_classname = "#{format.to_s.capitalize}Extractor"
     extractor_class = ExtraLoop.const_defined?(extractor_classname) && ExtraLoop.const_get(extractor_classname) || DomExtractor
 
-
+
+    #replace empty placeholder with extraction environment
+    @loop_extractor_args[1] = ExtractionEnvironment.new(self)
+
     loop_extractor = extractor_class.new(*@loop_extractor_args)
 
     # There is no point in parsing response.body more than once, so we reuse
@@ -167,8 +174,8 @@ module ExtraLoop
 
     document = loop_extractor.parse(response.body)
 
-    extractors = @extractor_args.map do |
-      args.insert(1, ExtractionEnvironment.new(self, document))
+    extractors = @extractor_args.map do |original_args|
+      args = original_args.clone.insert(1, ExtractionEnvironment.new(self, document))
      extractor_class.new(*args)
     end
 
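The switch from one placeholder to two in `loop_on` is plain `Array#insert` behaviour; a small sketch (the selector value and the `:environment` symbol are illustrative stand-ins for the real loop-name and `ExtractionEnvironment` slots):

```ruby
# What loop_on now does with its arguments: reserve slots 0 and 1
# for the loop name and the extraction environment.
args = ["ul li.file a"]                  # a selector passed to loop_on
loop_extractor_args = args.insert(0, nil, nil)

# later, before building the loop extractor, slot 1 is filled in
loop_extractor_args[1] = :environment    # stand-in for ExtractionEnvironment.new(self)
```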
data/spec/scraper_base_spec.rb
CHANGED
@@ -69,14 +69,15 @@ describe ScraperBase do
       loop_on("ul li.file a").
       extract(:url, :href).
       extract(:filename).
-
+      on(:data) { |records| results = records }.
+      run
 
     @results = results
+
   end
 
 
   it "Should handle response" do
-    @scraper.run
     @results.should_not be_empty
     @results.all? { |record| record.extracted_at && record.url && record.filename }.should be_true
   end
@@ -166,4 +167,29 @@ describe ScraperBase do
     end
   end
 end
+
+context "no loop defined.." do
+  describe "#run", :thisonly => true do
+    before do
+      data = []
+      @url = "http://localhost/fixture"
+
+      stub_http({}, :body => @fixture_doc) do |hydra, request, response|
+        hydra.stub(:get, request.url).and_return(response)
+      end
+
+      (ScraperBase.new @url).
+        extract(:url, "a[href]", :href).
+        on("data") { |records| data = records }.
+        run
+
+      @data = data
+    end
+
+    it "should run and extract data" do
+      @data.should_not be_empty
+    end
+  end
+end
+
 end
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: extraloop
 version: !ruby/object:Gem::Version
-  version: 0.0.7
+  version: 0.0.8
 prerelease:
 platform: ruby
 authors:
@@ -9,11 +9,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-
+date: 2012-03-27 00:00:00.000000000Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: yajl-ruby
-  requirement: &
+  requirement: &13625500 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -21,10 +21,10 @@ dependencies:
       version: 1.1.0
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *13625500
 - !ruby/object:Gem::Dependency
   name: nokogiri
-  requirement: &
+  requirement: &13624900 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -32,10 +32,10 @@ dependencies:
       version: 1.5.0
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *13624900
 - !ruby/object:Gem::Dependency
   name: typhoeus
-  requirement: &
+  requirement: &13624340 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -43,10 +43,10 @@ dependencies:
       version: 0.3.2
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *13624340
 - !ruby/object:Gem::Dependency
   name: logging
-  requirement: &
+  requirement: &13623540 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -54,10 +54,10 @@ dependencies:
       version: 0.6.1
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *13623540
 - !ruby/object:Gem::Dependency
   name: rspec
-  requirement: &
+  requirement: &13622600 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -65,10 +65,10 @@ dependencies:
       version: 2.7.0
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *13622600
 - !ruby/object:Gem::Dependency
   name: rr
-  requirement: &
+  requirement: &13620140 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -76,10 +76,10 @@ dependencies:
       version: 1.0.4
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *13620140
 - !ruby/object:Gem::Dependency
   name: pry-nav
-  requirement: &
+  requirement: &13619680 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -87,10 +87,21 @@ dependencies:
       version: 0.1.0
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *13619680
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: &13619180 !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: 0.9.2.2
+  type: :development
+  prerelease: false
+  version_requirements: *13619180
 description: A Ruby library for extracting data from websites and web based APIs.
-  Supports most common document formats (i.e. HTML, XML, and JSON), and comes
-  a handy mechanism for iterating over paginated datasets.
+  Supports most common document formats (i.e. HTML, XML, CSV, and JSON), and comes
+  with a handy mechanism for iterating over paginated datasets.
 email: andrea.giulio.fiore@googlemail.com
 executables: []
 extensions: []
@@ -140,6 +151,9 @@ required_ruby_version: !ruby/object:Gem::Requirement
   - - ! '>='
     - !ruby/object:Gem::Version
       version: '0'
+      segments:
+      - 0
+      hash: 2441039337834275619
 required_rubygems_version: !ruby/object:Gem::Requirement
   none: false
   requirements:
@@ -153,4 +167,3 @@ signing_key:
 specification_version: 2
 summary: A toolkit for online data extraction.
 test_files: []
-has_rdoc:
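The `&13625500`/`*13625500` pairs in the metadata diff are YAML anchors and aliases: `requirement` defines an anchored node and `version_requirements` references the very same object. A minimal sketch in plain Ruby (the hash content below is made up; only the anchor/alias mechanism is the point):

```ruby
require "yaml"

# An anchored node (&) and an alias to it (*), as in the gemspec
# metadata: both keys end up pointing at one shared object.
text = <<~YAML
  requirement: &13625500
    operator: "~>"
    version: 1.1.0
  version_requirements: *13625500
YAML

# Psych 4 forbids aliases in YAML.load by default; fall back for older Rubies.
doc = YAML.respond_to?(:unsafe_load) ? YAML.unsafe_load(text) : YAML.load(text)
```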