extraloop 0.0.3 → 0.0.4
- data/History.txt +3 -0
- data/README.md +28 -21
- data/examples/wikipedia_categories_recoursive.rb +57 -0
- data/lib/extraloop/extraction_environment.rb +1 -0
- data/lib/extraloop/extraction_loop.rb +1 -1
- data/lib/extraloop/iterative_scraper.rb +1 -1
- data/spec/extraction_loop_spec.rb +5 -0
- metadata +17 -16
data/History.txt
CHANGED
data/README.md
CHANGED
@@ -20,9 +20,7 @@ A basic scraper that fetches the top 25 websites from Alexa's daily top 100 list

     extract(:site_name, "h2").
     extract(:url, "h2 a").
     extract(:description, ".description").
-    on(:data, proc { |data|
-      results = data
-    }).
+    on(:data, proc { |data| results = data }).
     run()

 An Iterative Scraper that fetches URL, title, and publisher from some 110 Google News articles mentioning the keyword 'Egypt'.
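The collapsed `on(:data, ...)` handler above follows a simple callback contract: a proc is registered under an event name and later called with the extracted records. A plain-Ruby sketch of that contract (illustrative only, not ExtraLoop's internals):

```ruby
# Register a proc under the :data event, then fire it with some records.
# The hash-of-arrays layout is an assumption made for this sketch.
hooks = Hash.new { |hash, key| hash[key] = [] }
hooks[:data] << proc { |data| @results = data }

records = [{ site_name: 'example.com', url: 'http://example.com' }]
hooks[:data].each { |hook| hook.call(records) }

puts @results.first[:site_name]  # prints "example.com"
```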
@@ -57,8 +55,9 @@ An Iterative Scraper that fetches URL, title, and publisher from some 110 Google

 ### Extractors

-ExtraLoop allows to fetch structured data from online documents by looping through a list of elements matching a given selector.
-
+ExtraLoop allows to fetch structured data from online documents by looping through a list of elements matching a given selector.
+For each matched element, an arbitrary set of fields can be extracted. While the `loop_on` method sets up such a loop, the `extract`
+method extracts a piece of information from an element (e.g. a story's title) and stores it into a record's field.

     # using a CSS3 or an XPath selector
     loop_on('div.post')
@@ -67,12 +66,12 @@ ExtraLoop allows to fetch structured data from online documents by looping throu

     loop_on(proc { |doc| doc.search('div.post') })

-    # using both a selector and a proc (
+    # using both a selector and a proc (matched elements are passed in as the first argument of the proc)

-    loop_on('div.post', proc { |posts| posts.reject {|post| post.attr(:class) == 'sticky' }})
+    loop_on('div.post', proc { |posts| posts.reject { |post| post.attr(:class) == 'sticky' }})

 Both the `loop_on` and the `extract` methods may be called with a selector, a proc or a combination of the two. By default, when parsing DOM documents, `extract` will call
-`Nokogiri::XML::Node#text()`. Alternatively, `extract`
+`Nokogiri::XML::Node#text()`. Alternatively, `extract` also accepts an attribute name or a proc as a third argument; this will be evaluated in the context of the matching element.

     # extract a story's title
     extract(:title, 'h3')
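The three extraction styles just described (default text, attribute name, proc) can be illustrated in plain Ruby with a hypothetical `FakeNode` stand-in for a matched DOM node; this is a sketch, not a real Nokogiri node or ExtraLoop code:

```ruby
# Hypothetical stand-in for a matched DOM node; only #text and #attr are
# modelled, mirroring what the extract examples above rely on.
FakeNode = Struct.new(:text, :attributes) do
  def attr(name)
    attributes[name]
  end
end

node = FakeNode.new('Story title', { href: '/story/1' })

title   = node.text                              # default: the node's text
url     = node.attr(:href)                       # attribute name as third argument
upcased = proc { |n| n.text.upcase }.call(node)  # proc evaluated against the node
```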
@@ -87,21 +86,25 @@ Both the `loop_on` and the `extract` methods may be called with a selector, a pr

 #### Extracting from JSON Documents

-While processing each HTTP response, ExtraLoop tries to automatically detect the scraped document format by looking at
-
-
+While processing each HTTP response, ExtraLoop tries to automatically detect the scraped document format by looking at
+the `ContentType` header sent by the server (this value may be overridden by providing a `:format` key in the scraper's
+initialization options). When the format is JSON, the document is parsed using the `yajl` parser and converted into a hash.
+In this case, both the `loop_on` and the `extract` methods still behave as illustrated above, with the sole exception
+of the CSS3/XPath selector string, which is specific to DOM documents.
+
+When working with JSON data, you can just use a proc and return the document elements you want to loop on.

     # Fetch a portion of a document using a proc
     loop_on(proc { |data| data['query']['categorymembers'] })

-Alternatively, the same loop can be defined by passing an array of
+Alternatively, the same loop can be defined by passing an array of keys pointing at a value located
+at several levels of depth down into the parsed document hash.

     # Fetch the same document portion above using a hash path
     loop_on(['query', 'categorymembers'])

-
-
-field name as a key if no key path or key string is provided.
+When fetching fields from a portion of a JSON document, `extract` will use the
+field name as a hash key if no key path or key string is provided.

     # current node:
     #
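The hash-path form of `loop_on` can be sketched in plain Ruby: the key array is walked one level at a time into the parsed document hash. The `resolve_path` helper below is illustrative only, not the library's actual implementation:

```ruby
# A parsed JSON document as yajl would produce it (sample data).
parsed = {
  'query' => {
    'categorymembers' => [
      { 'title' => 'Category:Italian newspapers', 'ns' => 14 },
      { 'title' => 'RAI', 'ns' => 0 }
    ]
  }
}

# Walk the key path one level down per key, returning nil if any level is
# missing; this is what loop_on(['query', 'categorymembers']) amounts to.
def resolve_path(document, path)
  path.reduce(document) { |node, key| node && node[key] }
end

members = resolve_path(parsed, ['query', 'categorymembers'])
members.each { |member| puts member['title'] }
```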
@@ -110,26 +113,30 @@ field name as a key if no key path or key string is provided.

     # 'text' => 'bla bla bla',
     # 'from_user_id'..
     # }
-

-    extract(:from_user)
+    # >> extract(:from_user)
     # => "johndoe"


 ### Iteration methods:

-The `IterativeScraper` class comes with two methods for defining how a scraper should loop over paginated content.
-
+The `IterativeScraper` class comes with two methods for defining how a scraper should loop over paginated content.

-
+_set_iteration(iteration_parameter, array_or_range_or_proc)_

 * `iteration_parameter` - A symbol identifying the request parameter that the scraper will use as offset in order to iterate over the paginated content.
 * `array_or_range_or_proc` - Either an explicit set of values or a block of code. If provided, the block is called with the parsed document as its first argument. Its return value is then used to shift, at each iteration, the value of the iteration parameter. If the block fails to return a non-empty array, the iteration stops.

 The second iteration method, `#continue_with`, allows iteration to continue until an arbitrary block of code returns a positive, non-nil value.

-
+_continue_with(iteration_parameter, block)_

 * `iteration_parameter` - the scraper's iteration parameter.
 * `block` - An arbitrary block of Ruby code; its return value will be used to determine the value of the next iteration's offset parameter.

+### Running tests
+
+ExtraLoop uses `rspec` and `rr` as its testing framework. The test suite can be run by calling the `rspec` executable from within the `spec` directory:
+
+    cd spec
+    rspec *
data/examples/wikipedia_categories_recoursive.rb
ADDED
@@ -0,0 +1,57 @@
+require '../lib/extraloop'
+require 'pry'
+
+class WikipediaCategoryScraper < ExtraLoop::IterativeScraper
+  attr_accessor :members
+  attr_reader :request_arguments
+
+  baseurl = "http://en.wikipedia.org"
+  endpoint_url = "/w/api.php"
+  @@api_url = baseurl + endpoint_url
+
+  def initialize(category, depth=2, parent=nil)
+
+    @members = []
+    @parent = parent
+
+    params = {
+      :action => 'query',
+      :list => 'categorymembers',
+      :format => 'json',
+      :cmtitle => "Category:#{category.gsub(/^Category\:/,'')}",
+      :cmlimit => "500",
+      :cmtype => 'page|subcat',
+      :cmdir => 'asc',
+      :cmprop => 'ids|title|type|timestamp'
+    }
+    options = {
+      :depth => depth,
+      :format => :json,
+      :log => false
+    }
+    request_arguments = { :params => params }
+
+    super(@@api_url, options, request_arguments)
+
+    loop_on(['query', 'categorymembers']).
+      extract(:title).
+      extract(:ns).
+      extract(:type).
+      extract(:timestamp).
+      on(:data, proc { |results|
+        puts "#{"\t" * (@options[:depth] - 2).abs } #{@scraper.request_arguments[:params][:cmtitle]}"
+        categories = results.select { |record| record.ns == 14 }.each { |category| results.delete(category) }
+
+
+        categories.each do |record|
+          # Instantiate a sub-scraper if the current depth is greater than zero and the category member is a subcategory.
+          WikipediaCategoryScraper.new(record.title, @options[:depth] - 1, @scraper.request_arguments[:params][:cmtitle]).run unless @options[:depth] <= 0
+        end
+
+      }).
+      continue_with(:cmcontinue, ['query-continue', 'categorymembers', 'cmcontinue'])
+  end
+end
+
+
+WikipediaCategoryScraper.new("Italian_media").run
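For reference, the MediaWiki request the example issues can be previewed by encoding a params hash like the one above with Ruby's standard library; this sketch is independent of ExtraLoop and uses a trimmed-down params set:

```ruby
require 'uri'

# Subset of the example's request parameters, for illustration.
params = {
  :action  => 'query',
  :list    => 'categorymembers',
  :format  => 'json',
  :cmtitle => 'Category:Italian_media',
  :cmlimit => '500'
}

# Build the query string the scraper's HTTP layer would send.
url = "http://en.wikipedia.org/w/api.php?" + URI.encode_www_form(params)
```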
data/lib/extraloop/extraction_loop.rb
CHANGED
@@ -15,7 +15,7 @@ module ExtraLoop

       @document = @loop_extractor.parse(document)
       @records = []
       @hooks = hooks
-      @environment = ExtractionEnvironment.new(
+      @environment = ExtractionEnvironment.new(scraper, @document, @records)
       self
     end

data/spec/extraction_loop_spec.rb
CHANGED
@@ -31,6 +31,10 @@ describe ExtractionLoop

   describe "run" do
     before(:each) do

+      @fake_scraper = Object.new
+      stub(@fake_scraper).options {{}}
+      stub(@fake_scraper).results { }
+
       @extractors = [:a, :b].map do |field_name|
         object = Object.new
         stub(object).extract_field { |node, record| node[field_name] }
@@ -55,6 +59,7 @@ describe ExtractionLoop

       mock(env).run.with_any_args.times(20 + 2)
     end

+
       @extraction_loop = ExtractionLoop.new(@loop_extractor, @extractors, "fake document", hooks, @fake_scraper).run
     end

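The new spec setup uses `rr`'s `stub` to build a fake scraper that answers the two messages `ExtractionLoop` now sends it (`#options` and `#results`). The same effect can be sketched in plain Ruby with singleton methods, shown here only to clarify what the stubs provide:

```ruby
# Minimal stand-in: an object that responds to #options with an empty hash
# and to #results with nil, matching the rr stubs in the spec.
fake_scraper = Object.new

def fake_scraper.options
  {}
end

def fake_scraper.results
  nil
end
```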
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: extraloop
 version: !ruby/object:Gem::Version
-  version: 0.0.3
+  version: 0.0.4
 prerelease:
 platform: ruby
 authors:
@@ -9,11 +9,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-01-
+date: 2012-01-14 00:00:00.000000000Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: yajl-ruby
-  requirement: &
+  requirement: &10100300 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -21,10 +21,10 @@ dependencies:
     version: 1.1.0
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *10100300
 - !ruby/object:Gem::Dependency
   name: nokogiri
-  requirement: &
+  requirement: &10098200 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -32,10 +32,10 @@ dependencies:
     version: 1.5.0
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *10098200
 - !ruby/object:Gem::Dependency
   name: typhoeus
-  requirement: &
+  requirement: &10095680 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -43,10 +43,10 @@ dependencies:
     version: 0.3.2
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *10095680
 - !ruby/object:Gem::Dependency
   name: logging
-  requirement: &
+  requirement: &10094320 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -54,10 +54,10 @@ dependencies:
     version: 0.6.1
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *10094320
 - !ruby/object:Gem::Dependency
   name: rspec
-  requirement: &
+  requirement: &10092700 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -65,10 +65,10 @@ dependencies:
     version: 2.7.1
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *10092700
 - !ruby/object:Gem::Dependency
   name: rr
-  requirement: &
+  requirement: &10089100 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -76,10 +76,10 @@ dependencies:
     version: 1.0.4
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *10089100
 - !ruby/object:Gem::Dependency
   name: pry
-  requirement: &
+  requirement: &10088040 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -87,7 +87,7 @@ dependencies:
     version: 0.9.7.4
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *10088040
 description: A Ruby library for extracting data from websites and web based APIs.
   Supports most common document formats (i.e. HTML, XML, and JSON), and comes with
   a handy mechanism for iterating over paginated datasets.
@@ -100,6 +100,7 @@ files:
 - README.md
 - examples/google_news_scraper.rb
 - examples/wikipedia_categories.rb
+- examples/wikipedia_categories_recoursive.rb
 - lib/extraloop.rb
 - lib/extraloop/dom_extractor.rb
 - lib/extraloop/extraction_environment.rb