extraloop 0.0.1

data/History.txt ADDED
@@ -0,0 +1,2 @@
+ == 0.0.0 / 2011-01-01
+ * Project Birthday!
data/README.md ADDED
@@ -0,0 +1,135 @@
+ # Extra Loop
+ 
+ A Ruby library for extracting data from websites and web-based APIs.
+ It supports the most common document formats (HTML, XML, and JSON), and comes with a handy mechanism
+ for iterating over paginated datasets.
+ 
+ ### Installation:
+ 
+     gem install extraloop
+ 
+ ### Usage:
+ 
+ A basic scraper that fetches the top 25 websites from Alexa's daily top 100 list:
+ 
+     results = nil
+ 
+     Scraper.
+       new("http://www.alexa.com/topsites").
+       loop_on("li.site-listing").
+         extract(:site_name, "h2").
+         extract(:url, "h2 a").
+         extract(:description, ".description").
+       on(:data, proc { |data|
+         results = data
+       }).
+       run()
+ 
+ An iterative scraper that fetches URL, title, and publisher from some 110 Google News articles mentioning the keyword 'Egypt':
+ 
+     results = []
+ 
+     IterativeScraper.
+       new("https://www.google.com/search?tbm=nws&q=Egypt").
+       set_iteration(:start, (1..101).step(10)).
+       loop_on("h3", proc { |nodes| nodes.map(&:parent) }).
+         extract(:title, "h3.r a").
+         extract(:url, "h3.r a", :href).
+         extract(:source, "br", proc { |node| node.next.text.split("-").first }).
+       on(:data, proc { |data, response| data.each { |record| results << record } }).
+       run()
+ 
+ ### Initialization
+ 
+     #new(urls, scraper_options, httpclient_options)
+ 
+ - `urls` - a single URL, or an array of several URLs.
+ - `scraper_options` - a hash of scraper options (see below).
+ - `httpclient_options` - a hash of request options for `Typhoeus::Request#initialize` (see the [API documentation](http://rubydoc.info/github/pauldix/typhoeus/master/Typhoeus/Request#initialize-instance_method) for details).
+ 
+ #### Scraper options:
+ 
+ * `format` - Specifies the scraped document format (valid values are `:html`, `:xml`, and `:json`).
+ * `async` - Specifies whether the scraper's HTTP requests should be run in parallel or in series (defaults to false). **Note:** currently only GET requests can be run asynchronously.
+ * `log` - Logging options hash:
+     * `log_level` - a symbol specifying the desired log level (defaults to `:info`).
+     * `appenders` - a list of `Logging.appenders` objects (defaults to `Logging.appenders.stderr`).
+ 
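+ For example, a scraper honouring the options above could be initialised like this (a minimal sketch; the option values are purely illustrative):
+ 
+     Scraper.new("http://www.alexa.com/topsites", {
+       :format => :html,
+       :async  => false,
+       :log    => {
+         :log_level => :debug,
+         :appenders => [Logging.appenders.stderr]
+       }
+     })
+ 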
+ ### Extractors
+ 
+ ExtraLoop fetches structured data from online documents by looping over a list of elements that match a given selector. For each matched element, an arbitrary set of fields can be extracted. The `loop_on` method sets up such a loop, while the `extract` method extracts a piece of information from an element (e.g. a story's title) and stores it in a record's field.
+ 
+     # using a CSS3 or an XPath selector
+     loop_on('div.post')
+ 
+     # using a proc as a selector
+     loop_on(proc { |doc| doc.search('div.post') })
+ 
+     # using both a selector and a proc (the result of applying the selector is passed on as the first argument of the proc)
+     loop_on('div.post', proc { |posts| posts.reject { |post| post.attr(:class) == 'sticky' } })
+ 
+ Both the `loop_on` and the `extract` methods may be called with a selector, a proc, or a combination of the two. By default, when parsing DOM documents, `extract` calls
+ `Nokogiri::XML::Node#text()`. Alternatively, `extract` can be passed an attribute name or a proc as a third argument; this will be evaluated in the context of the matching element.
+ 
+     # extract a story's title
+     extract(:title, 'h3')
+ 
+     # extract a story's url
+     extract(:url, "a.link-to-story", :href)
+ 
+     # extract a description text, separating paragraphs with newlines
+     extract(:description, "div.description", proc { |node|
+       node.css("p").map(&:text).join("\n")
+     })
+ 
+ #### Extracting from JSON Documents
+ 
+ While processing each HTTP response, ExtraLoop tries to detect the scraped document format automatically by looking at the `Content-Type` header sent by the server (this value may be overridden by providing a `:format` key in the scraper's initialization options).
+ When the format is JSON, the document is parsed with the `yajl` parser and converted into a hash. In this case, both the `loop_on` and the `extract` methods still behave as documented above, with the sole exception of the CSS3/XPath selector string, which is specific to DOM documents.
+ When working with JSON documents, it is possible to loop over an arbitrary portion of the document by simply passing a proc to `loop_on`.
+ 
+     # Fetch a portion of a document using a proc
+     loop_on(proc { |data| data['query']['categorymembers'] })
+ 
+ Alternatively, the same loop can be defined by passing an array of nested keys locating the position of the document fragments.
+ 
+     # Fetch the same document portion as above using a key path
+     loop_on(['query', 'categorymembers'])
+ 
+ Passing an array of nested keys also works with the `extract` method.
+ When fetching fields from a JSON document fragment, `extract` will try to use the
+ field name as a key if no key path or key string is provided.
+ 
+     # current node:
+     #
+     # {
+     #    'from_user' => "johndoe",
+     #    'text' => 'bla bla bla',
+     #    'from_user_id'..
+     # }
+ 
+     extract(:from_user)
+     # => "johndoe"
+ 
+ ### Iteration methods:
+ 
+ The `IterativeScraper` class comes with two methods for defining how a scraper should loop over paginated content.
+ 
+     #set_iteration(iteration_parameter, array_or_range_or_proc)
+ 
+ * `iteration_parameter` - A symbol identifying the request parameter that the scraper will use as an offset in order to iterate over the paginated content.
+ * `array_or_range_or_proc` - Either an explicit set of values or a block of code. If a block is provided, it is called with the parsed document as its first argument, and its return value is used to shift, at each iteration, the value of the iteration parameter. If the block fails to return a non-empty array, the iteration stops.
+ 
+ The second iteration method, `#continue_with`, allows the scraper to keep iterating until an arbitrary block of code returns a positive, non-nil value (see the example below).
+ 
+     #continue_with(iteration_parameter, block)
+ 
+ * `iteration_parameter` - The scraper's iteration parameter.
+ * `block` - An arbitrary block of Ruby code; its return value will be used to determine the value of the next iteration's offset parameter.
+ 
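+ A minimal sketch of `#continue_with` in action, condensed from the Wikipedia category example shipped with the gem (the `api_url`, `options`, and `request_arguments` values are set up as in that example):
+ 
+     all_results = []
+ 
+     IterativeScraper.new(api_url, options, request_arguments).
+       loop_on(['query', 'categorymembers']).
+       extract(:title).
+       on(:data, proc { |records| records.each { |record| all_results << record } }).
+       continue_with(:cmcontinue, ['query-continue', 'categorymembers', 'cmcontinue']).
+       run()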
@@ -0,0 +1,22 @@
+ require '../lib/extraloop'
+ 
+ results = []
+ 
+ IterativeScraper.new("https://www.google.com/search?tbm=nws&q=Egypt", :log => {
+     :log_level => :debug,
+     :appenders => [Logging.appenders.stderr]
+   }).set_iteration(:start, (1..101).step(10)).
+   loop_on("h3", proc { |nodes| nodes.map(&:parent) }).
+   extract(:title, "h3.r a").
+   extract(:url, "h3.r a", :href).
+   extract(:source, "br", proc { |node|
+     node.next.text.split("-").first
+   }).
+   on(:data, proc { |data, response|
+     data.each { |record| results << record }
+   }).
+   run()
+ 
+ results.each_with_index do |record, index|
+   puts "#{index}) #{record.title} (source: #{record.source})"
+ end
@@ -0,0 +1,49 @@
+ require '../lib/extraloop'
+ 
+ wikipedia_baseurl = "http://en.wikipedia.org"
+ endpoint_url = "/w/api.php"
+ api_url = wikipedia_baseurl + endpoint_url
+ all_results = []
+ 
+ params = {
+   :action  => 'query',
+   :list    => 'categorymembers',
+   :format  => 'json',
+   :cmtitle => 'Category:Linguistics',
+   :cmlimit => "100",
+   :cmtype  => 'page|subcat',
+   :cmdir   => 'asc',
+   :cmprop  => 'ids|title|type|timestamp'
+ }
+ 
+ options = {
+   :format => :json,
+   :log => {
+     :appenders => [Logging.appenders.stderr],
+     :log_level => :info
+   }
+ }
+ 
+ request_arguments = { :params => params }
+ 
+ #
+ # Fetches members of the English Wikipedia's category "Linguistics".
+ #
+ # This uses the #continue_with method instead of the #set_iteration method
+ # (used in the Google News example).
+ #
+ 
+ IterativeScraper.new(api_url, options, request_arguments).
+   loop_on(['query', 'categorymembers']).
+   extract(:title).
+   extract(:ns).
+   extract(:type).
+   extract(:timestamp).
+   on(:data, proc { |results|
+     results.each { |record| all_results << record }
+   }).
+   continue_with(:cmcontinue, ['query-continue', 'categorymembers', 'cmcontinue']).
+   run()
+ 
+ puts "#{all_results.size} fetched"
@@ -0,0 +1,45 @@
+ class DomExtractor < ExtractorBase
+ 
+   # Public: Runs the extractor against a document fragment (DOM node or document).
+   #
+   # node   - The document fragment.
+   # record - The record being extracted.
+   #
+   # Returns the text content of the element, or the output of the extractor's callback.
+   #
+   def extract_field(node, record=nil)
+     target = node = node.respond_to?(:document) ? node : parse(node)
+     target = node.at_css(@selector) if @selector
+     target = target.attr(@attribute) if target.respond_to?(:attr) && @attribute
+     target = @environment.run(target, record, &@callback) if @callback
+ 
+     # if target is still a DOM node, return its text content
+     target = target.text if target.respond_to?(:text)
+     target
+   end
+ 
+   #
+   # Public: Extracts a list of document fragments matching the provided selector/callback.
+   #
+   # input - A document (either as a string or as a parsed Nokogiri document).
+   #
+   # Returns an array of elements matching the specified selector or function.
+   #
+   def extract_list(input)
+     nodes = input.respond_to?(:document) ? input : parse(input)
+     nodes = nodes.search(@selector) if @selector
+     @callback && Array(@environment.run(nodes, &@callback)) || nodes
+   end
+ 
+   def parse(input)
+     super(input)
+     @environment.document = is_xml(input) ? Nokogiri::XML(input) : Nokogiri::HTML(input)
+   end
+ 
+   def is_xml(input)
+     input =~ /^\s*\<\?xml version=\"\d\.\d\"\?\>/
+   end
+ end
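
A minimal, standalone sketch of how this extractor behaves, assuming the gem and its dependencies are loaded (the sample markup is made up for illustration):

    html      = "<div class='post'><a class='link' href='/story-1'>Read more</a></div>"
    extractor = DomExtractor.new(:url, ExtractionEnvironment.new, "a.link", :href)
    extractor.extract_field(html)
    # => "/story-1"
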
@@ -0,0 +1,20 @@
+ # This class is used as a virtual environment within
+ # which hook handlers and extractors are run (through #run).
+ 
+ class ExtractionEnvironment
+   attr_accessor :document
+ 
+   def initialize(scraper=nil, document=nil, records=nil)
+     if scraper
+       @options = scraper.options
+       @results = scraper.results
+       @scraper = scraper
+     end
+ 
+     @document = document
+     @records  = records
+   end
+ 
+   def run(*arguments, &block)
+     self.instance_exec(*arguments, &block)
+   end
+ end
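
Because `#run` relies on `instance_exec`, a block handed to it can read the environment's instance variables directly. A small illustrative sketch (the values are made up):

    env = ExtractionEnvironment.new
    env.document = "<html>...</html>"
    env.run(:greeting) { |arg| [arg, @document] }
    # => [:greeting, "<html>...</html>"]
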
@@ -0,0 +1,46 @@
+ class ExtractionLoop
+   include Hookable
+ 
+   module Exceptions
+     class UnsupportedFormat < StandardError; end
+   end
+ 
+   attr_reader :records, :environment
+   attr_accessor :extractors, :document, :hooks, :children, :parent, :scraper
+ 
+   def initialize(loop_extractor, extractors=[], document=nil, hooks = {}, scraper = nil)
+     @loop_extractor = loop_extractor
+     @extractors = extractors
+     @document = @loop_extractor.parse(document)
+     @records = []
+     @hooks = hooks
+     @scraper = scraper
+     @environment = ExtractionEnvironment.new(@scraper, @document, @records)
+     self
+   end
+ 
+   def run
+     run_hook(:before, @document)
+ 
+     get_nodelist.each do |node|
+       run_hook(:before_extract, [node])
+       @records << run_extractors(node)
+       run_hook(:after_extract, [node, records.last])
+     end
+ 
+     run_hook(:after, @records)
+     self
+   end
+ 
+   private
+ 
+   def get_nodelist
+     @loop_extractor.extract_list(@document)
+   end
+ 
+   def run_extractors(node)
+     record = OpenStruct.new(:extracted_at => Time.now.to_i)
+     @extractors.each { |extractor| record.send("#{extractor.field_name}=", extractor.extract_field(node, record)) }
+     record
+   end
+ end
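
A rough sketch of how these pieces fit together, assuming the gem is loaded (the markup and field names are illustrative):

    html = "<div class='post'><h2>First post</h2></div><div class='post'><h2>Second post</h2></div>"

    environment    = ExtractionEnvironment.new
    loop_extractor = DomExtractor.new(:post,  environment, "div.post")
    title          = DomExtractor.new(:title, environment, "h2")

    records = ExtractionLoop.new(loop_extractor, [title], html).run.records
    records.map(&:title)
    # => ["First post", "Second post"]
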
@@ -0,0 +1,40 @@
+ # Abstract class.
+ # This should not be instantiated directly.
+ #
+ class ExtractorBase
+   module Exceptions
+     class WrongArgumentError < StandardError; end
+     class ExtractorParseError < StandardError; end
+   end
+ 
+   attr_reader :field_name
+ 
+   #
+   # Public: Initializes a data extractor.
+   #
+   # Parameters:
+   #   field_name  - The machine-readable field name.
+   #   environment - The object within which the extractor callback will be run (using #run).
+   #   selector    - The CSS3 selector used to match a specific portion of a document (optional).
+   #   callback    - A block of code to which the extracted node/attribute will be passed (optional).
+   #   attribute   - A node attribute. If provided, the attribute's value will be returned (optional).
+   #
+   # Returns itself.
+   #
+   def initialize(field_name, environment, *args)
+     @field_name = field_name
+     @environment = environment
+ 
+     @selector = args.find { |arg| arg.is_a?(String) }
+     args.delete(@selector) if @selector
+     @attribute = args.find { |arg| arg.is_a?(String) || arg.is_a?(Symbol) }
+     @callback = args.find { |arg| arg.respond_to?(:call) }
+     self
+   end
+ 
+   def parse(input)
+     raise Exceptions::ExtractorParseError.new("input parameter must be a string") unless input.is_a?(String)
+   end
+ end
@@ -0,0 +1,26 @@
+ module Hookable
+ 
+   module Exceptions
+     class HookArgumentError < StandardError
+     end
+   end
+ 
+   def set_hook(hookname, handler)
+     @hooks ||= {}
+     raise Exceptions::HookArgumentError.new("handler must be a callable proc") unless handler.respond_to?(:call)
+     @hooks[hookname.to_sym] ? @hooks[hookname.to_sym].push(handler) : @hooks[hookname.to_sym] = [handler]
+     self
+   end
+ 
+   def run_hook(hook, arguments)
+     return unless @hooks && @hooks.has_key?(hook)
+ 
+     @hooks[hook].each do |handler|
+       (@environment || ExtractionEnvironment.new).run(*arguments, &handler)
+     end
+   end
+ 
+   alias_method :on, :set_hook
+ end
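
A small sketch of the hook mechanism outside the scraper classes (the `Job` class here is hypothetical and used purely for illustration; `ExtractionEnvironment` is assumed to be loaded):

    class Job
      include Hookable

      def perform
        run_hook(:before, ["starting"])
        # ... do the actual work ...
        run_hook(:after, ["done"])
      end
    end

    job = Job.new
    job.on(:after, proc { |status| puts "after hook: #{status}" })
    job.perform
    # prints "after hook: done"
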
@@ -0,0 +1,291 @@
+ class IterativeScraper < ScraperBase
+ 
+   module Exceptions
+     class NonGetAsyncRequestNotYetImplemented < StandardError; end
+   end
+ 
+   #
+   # Public: Initializes an iterative scraper (i.e. a scraper which can extract data from a list of several web pages).
+   #
+   # urls      - A single URL or an array of several URLs.
+   # options   - A hash of scraper options (optional).
+   #   async : Whether the scraper should issue its HTTP requests asynchronously (defaults to false).
+   #   log   : Logging options (set to false to completely suppress logging).
+   #   hydra : A list of arguments to be passed in when initializing the HTTP queue (see Typhoeus::Hydra).
+   # arguments - A hash of arguments to be passed to the Typhoeus HTTP client (optional).
+   #
+   # Examples:
+   #
+   #   # Iterates over the first 10 pages of Google News search results for the query 'Egypt'.
+   #
+   #   IterativeScraper.new("https://www.google.com/search?tbm=nws&q=Egypt", :log => {
+   #     :appenders => ['example.log', :stderr],
+   #     :log_level => :debug
+   #   }).set_iteration(:start, (1..101).step(10))
+   #
+   #   # Iterates over the first 10 pages of Google News search results for the query 'Egypt' first, and then
+   #   # for the query 'Syria', issuing HTTP requests asynchronously and ignoring SSL certificate verification.
+   #
+   #   IterativeScraper.new([
+   #     "https://www.google.com/search?tbm=nws&q=Egypt",
+   #     "https://www.google.com/search?tbm=nws&q=Syria"
+   #   ], { :async => true }, { :disable_ssl_peer_verification => true }).
+   #     set_iteration(:start, (1..101).step(10))
+   #
+   # Returns itself.
+   #
+   def initialize(urls, options = {}, arguments = {})
+     super([], options, arguments)
+ 
+     @base_urls = Array(urls)
+     @iteration_set = []
+     @iteration_extractor = nil
+     @iteration_extractor_args = nil
+     @iteration_count = 0
+     @iteration_param = nil
+     @iteration_param_value = nil
+     @continue_clause_args = nil
+     self
+   end
+ 
+   # Public: Specifies the collection of values over which the scraper should iterate.
+   # At each iteration, the current value in the iteration set will be included as part of the request parameters.
+   #
+   # param - The name of the iteration parameter.
+   # args  - Either an array of values, or a set of arguments used to initialize an Extractor object.
+   #
+   # Examples:
+   #
+   #   # Explicitly specify the iteration set (can be either a range or an array).
+   #
+   #   IterativeScraper.new("http://my-site.com/events").
+   #     set_iteration(:p, 1..10)
+   #
+   #   # Pass in a code block to extract the iteration set from the document dynamically.
+   #   # The code block is used to generate an Extractor that is run at the first iteration.
+   #   # The iteration will not continue unless the proc returns a non-empty set of values.
+   #
+   #   fetch_page_numbers = proc { |elements|
+   #     elements.map { |a|
+   #       a.attr(:href).match(/p=(\d+)/)
+   #       $1
+   #     }.reject { |p| p == "1" }
+   #   }
+   #
+   #   IterativeScraper.new("http://my-site.com/events").
+   #     set_iteration(:p, "div#pagination a", fetch_page_numbers)
+   #
+   # Returns itself.
+   #
+   def set_iteration(param, *args)
+     #TODO: allow passing ranges as well as arrays
+     if args.first.respond_to?(:map)
+       @iteration_set = Array(args.first).map(&:to_s)
+     else
+       @iteration_extractor_args = [:pagination, *args]
+     end
+     set_iteration_param(param)
+     self
+   end
+ 
+   # Public: Builds an extractor and uses it to set the value of the next iteration's offset parameter.
+   # If the extractor returns nil, the iteration stops.
+   #
+   # param          - A symbol identifying the iteration parameter name.
+   # extractor_args - Arguments to be passed to the extractor which will be used to evaluate the continue value.
+   #
+   # Returns itself.
+   #
+   def continue_with(param, *extractor_args)
+     raise Exceptions::NonGetAsyncRequestNotYetImplemented.new("the #continue_with method currently requires the 'async' option to be set to false") if @options[:async]
+ 
+     @continue_clause_args = extractor_args
+     set_iteration_param(param)
+     self
+   end
+ 
+   def run
+     @base_urls.each do |base_url|
+ 
+       # Run an extra iteration to determine the value of the next offset parameter (if #continue_with is used)
+       # or the entire iteration set (if #set_iteration is used).
+       (run_iteration(base_url); @iteration_count += 1) if @iteration_extractor_args || @continue_clause_args
+ 
+       while @iteration_set.at(@iteration_count)
+         method = @options[:async] ? :run_iteration_async : :run_iteration
+         send(method, base_url)
+         @iteration_count += 1
+       end
+ 
+       # reset all counts
+       @queued_count = 0
+       @response_count = 0
+       @iteration_count = 0
+     end
+     self
+   end
+ 
+   protected
+ 
+   #
+   # Sets the name (and optionally the default value) of the iteration parameter.
+   #
+   # param - A symbol, or a hash containing the parameter name (as the key) and its default value.
+   #
+   # Returns nothing.
+   #
+   def set_iteration_param(param)
+     if param.respond_to?(:keys)
+       @iteration_param = param.keys.first
+       @iteration_param_value = param.values.first
+     else
+       @iteration_param = param
+     end
+   end
+ 
+   def default_offset
+     @iteration_param_value or "1"
+   end
+ 
+   #
+   # Runs an iteration performing one blocking, synchronous HTTP request at a time
+   # (calls ScraperBase#run for each request).
+   #
+   # url - The current iteration's URL.
+   #
+   # Returns nothing.
+   #
+   def run_iteration(url)
+     @urls = Array(url)
+     update_request_params!
+     run_super(:run)
+   end
+ 
+   #
+   # Runs an iteration performing parallel, non-blocking HTTP requests.
+   #
+   # url - The current iteration's URL.
+   #
+   # Returns nothing.
+   #
+   def run_iteration_async(url)
+     error_message = "When the option 'async' is set, the IterativeScraper class currently supports only the HTTP method 'get'. " +
+       "If you have to use an HTTP method other than GET, you will have to set the 'async' option to false."
+ 
+     raise Exceptions::NonGetAsyncRequestNotYetImplemented.new(error_message) unless @request_arguments[:method].nil? || @request_arguments[:method].downcase.to_sym == :get
+ 
+     @urls << add_iteration_param(url)
+ 
+     if @iteration_set[@iteration_count] == @iteration_set.last
+       run_super(:run)
+     end
+   end
+ 
+   #
+   # Dynamically updates the request parameter hash with the
+   # current iteration parameter value.
+   #
+   # Returns nothing.
+   #
+   def update_request_params!
+     offset = @iteration_set.at(@iteration_count) || default_offset
+     @request_arguments[:params] ||= {}
+     @request_arguments[:params][@iteration_param.to_sym] = offset
+   end
+ 
+   #
+   # Adds the current iteration offset to a URL as a GET parameter.
+   #
+   # url - The URL to be updated.
+   #
+   # Returns a URL with the current iteration value represented as a GET parameter.
+   #
+   def add_iteration_param(url)
+     offset = @iteration_set.at(@iteration_count) || default_offset
+     param = "#{@iteration_param}=#{offset}"
+     parsed_url = URI::parse(url)
+ 
+     if parsed_url.query
+       parsed_url.query += "&" + param
+     else
+       parsed_url.query = param
+     end
+     parsed_url.to_s
+   end
+ 
+   #
+   # Utility method for calling a superclass instance method
+   # (currently used to call ScraperBase#run).
+   #
+   def run_super(method, args=[])
+     self.class.superclass.instance_method(method).bind(self).call(*args)
+   end
+ 
+   def issue_request(url)
+     # remove the continue argument if this is the first iteration
+     @request_arguments[:params].delete(@iteration_param.to_sym) if @continue_clause_args && @iteration_count == 0
+     super(url)
+     # clear the previous value of the iteration parameter
+     @request_arguments[:params].delete(@iteration_param.to_sym) if @request_arguments[:params] && @request_arguments[:params].any?
+   end
+ 
+   #
+   # Overrides ScraperBase#handle_response in order to apply the proc used to dynamically extract the iteration set.
+   #
+   # TODO: update doc
+   #
+   # Returns nothing.
+   #
+   def handle_response(response)
+     format = @options[:format] || run_super(:detect_format, response.headers_hash['Content-Type'])
+     extractor_class = format == :json ? JsonExtractor : DomExtractor
+ 
+     run_iteration_extractor(response.body, extractor_class) if @response_count == 0 && @iteration_extractor_args
+     run_continue_clause(response.body, extractor_class) if @continue_clause_args
+ 
+     super(response)
+   end
+ 
+   def run_continue_clause(response_body, extractor_class)
+     extractor = extractor_class.new(:continue, ExtractionEnvironment.new(self), *@continue_clause_args)
+     continue_value = extractor.extract_field(response_body)
+     #TODO: check if continue_value is valid
+ 
+     @iteration_set << "" if @iteration_count == 0 # horrible hack: please refactor
+     @iteration_set << continue_value.to_s if continue_value
+   end
+ 
+   def run_iteration_extractor(response_body, extractor_class)
+     @iteration_extractor = extractor_class.new(*@iteration_extractor_args.insert(1, ExtractionEnvironment.new(self)))
+     #NOTE: does this default_offset make any sense?
+     @iteration_set = Array(default_offset) + @iteration_extractor.extract_list(response_body).map(&:to_s) if @iteration_extractor
+   end
+ end