grubby 1.1.0 → 1.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +24 -11
- data/README.md +24 -28
- data/grubby.gemspec +1 -2
- data/lib/grubby.rb +72 -25
- data/lib/grubby/core_ext/string.rb +2 -1
- data/lib/grubby/core_ext/uri.rb +4 -3
- data/lib/grubby/json_parser.rb +1 -1
- data/lib/grubby/mechanize/download.rb +1 -1
- data/lib/grubby/mechanize/file.rb +1 -1
- data/lib/grubby/mechanize/page.rb +7 -3
- data/lib/grubby/page_scraper.rb +1 -1
- data/lib/grubby/scraper.rb +165 -25
- data/lib/grubby/version.rb +1 -1
- metadata +5 -20
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 84d759cf7187c8502b42e9d7828f59f126bb87af8da524e9d8e6f6ad8a64f467
+  data.tar.gz: bf26cca3991fca00e573f51f28a1c457e063e4f419986971f1429f051f2e3155
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 38b8f7818be985da5c48484b8a3f42a40401b4890e46da93c2565c546654a660537cf15303e1106bdca201d1ea8e7ff90e13ab13dcb652997b0acc9becc01b48
+  data.tar.gz: e3c8b063d275ebf49dc50c5a70fa82cb0f9e517f17cc9e3735557a2fe998d5ea82a3ea0932ad8a6ecec630f4f66c8d62443c76497f6e98e4f202a72df988095e
data/CHANGELOG.md
CHANGED
@@ -1,15 +1,28 @@
+## 1.2.0
+
+* Add `Grubby#journal=`
+* Add `$grubby` global default `Grubby` instance
+* Add `Scraper.scrape`
+* Add `Scraper.each`
+* Support `:if` and `:unless` options for `Scraper.scrapes`
+* Fix fail-fast behavior of inherited scraper fields
+* Fix `JsonParser` on empty response body
+* Loosen Active Support version constraint
+
+
 ## 1.1.0
-
-*
-*
-
-
-*
-*
-*
-
-
-*
+
+* Add `Grubby#ok?`
+* Add `PageScraper.scrape_file` and `JsonScraper.scrape_file`
+* Add `Mechanize::Parser#save_to` and `Mechanize::Parser#save_to!`,
+  which are inherited by `Mechanize::Download` and `Mechanize::File`
+* Add `URI#basename`
+* Add `URI#query_param`
+* Add utility methods from [ryoba](https://rubygems.org/gems/ryoba)
+* Add `Scraper::Error#scraper` and `Scraper#errors` for interactive
+  debugging with e.g. `byebug`
+* Improve log messages and error formatting
+* Fix compatibility with net-http-persistent gem v3.0
 
 
 ## 1.0.0
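Taken together, the headline 1.2.0 additions (`$grubby` and `Scraper.scrape`) shorten the typical entry point. A minimal sketch of how they combine; the scraper class, URL, and CSS selector below are assumptions for illustration, not part of the gem:

```ruby
require "grubby"

# Hypothetical scraper used only for illustration.
class ExamplePostScraper < Grubby::PageScraper
  scrapes(:title) { page.at!("h1").text }
end

# 1.2.0 defines a global default agent ($grubby), so Scraper.scrape can be
# called without constructing a Grubby instance first.
post = ExamplePostScraper.scrape("https://example.com/posts/1")
puts post.title
```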
data/README.md
CHANGED
@@ -11,7 +11,7 @@ below, or browse the [full documentation].
 
 ## Examples
 
-The following example scrapes the [Hacker News] front page:
+The following example scrapes stories from the [Hacker News] front page:
 
 ```ruby
 require "grubby"
@@ -19,38 +19,31 @@ require "grubby"
 class HackerNews < Grubby::PageScraper
 
   scrapes(:items) do
-    page.search!(".athing").map{|
+    page.search!(".athing").map{|el| Item.new(el) }
   end
 
-
-
-
-
-    scrapes(:title) { @row1.at!(".storylink").text }
-    scrapes(:submitter) { @row2.at!(".hnuser").text }
-    scrapes(:story_uri) { URI.join(@base_uri, @row1.at!(".storylink")["href"]) }
-    scrapes(:comments_uri) { URI.join(@base_uri, @row2.at!(".age a")["href"]) }
-
-    def initialize(source)
-      @row1 = source
-      @row2 = source.next_sibling
-      @base_uri = source.document.url
-      super
+  class Item < Grubby::Scraper
+    scrapes(:story_link){ source.at!("a.storylink") }
+    scrapes(:story_uri) { story_link.uri }
+    scrapes(:title) { story_link.text }
   end
 
 end
 
-grubby = Grubby.new
-
 # The following line will raise an exception if anything goes wrong
 # during the scraping process. For example, if the structure of the
-# HTML does not match expectations, either due to
-#
-#
-#
-hn = HackerNews.
-
-
+# HTML does not match expectations, either due to incorrect assumptions
+# or a site change, the script will terminate immediately with a helpful
+# error message. This prevents bad data from propagating and causing
+# hard-to-trace errors.
+hn = HackerNews.scrape("https://news.ycombinator.com/news")
+
+# Your processing logic goes here:
+hn.items.take(10).each do |item|
+  puts "* #{item.title}"
+  puts "  #{item.story_uri}"
+  puts
+end
 ```
 
 [Hacker News]: https://news.ycombinator.com/news
@@ -64,7 +57,9 @@ puts hn.items.take(10).map(&:title) # your scraping logic goes here
   - [#singleton](http://www.rubydoc.info/gems/grubby/Grubby:singleton)
   - [#time_between_requests](http://www.rubydoc.info/gems/grubby/Grubby:time_between_requests)
 - [Scraper](http://www.rubydoc.info/gems/grubby/Grubby/Scraper)
+  - [.each](http://www.rubydoc.info/gems/grubby/Grubby/Scraper.each)
   - [.fields](http://www.rubydoc.info/gems/grubby/Grubby/Scraper.fields)
+  - [.scrape](http://www.rubydoc.info/gems/grubby/Grubby/Scraper.scrape)
   - [.scrapes](http://www.rubydoc.info/gems/grubby/Grubby/Scraper.scrapes)
   - [#[]](http://www.rubydoc.info/gems/grubby/Grubby/Scraper:[])
   - [#source](http://www.rubydoc.info/gems/grubby/Grubby/Scraper:source)
@@ -136,14 +131,14 @@ for a complete API listing.
   - [String#assert_match!](http://www.rubydoc.info/gems/mini_sanity/String:assert_match%21)
 - [pleasant_path](https://rubygems.org/gems/pleasant_path)
   ([docs](http://www.rubydoc.info/gems/pleasant_path/))
+  - [Pathname#available_name](http://www.rubydoc.info/gems/pleasant_path/Pathname:available_name)
   - [Pathname#dirs](http://www.rubydoc.info/gems/pleasant_path/Pathname:dirs)
-  - [Pathname#dirs_r](http://www.rubydoc.info/gems/pleasant_path/Pathname:dirs_r)
   - [Pathname#files](http://www.rubydoc.info/gems/pleasant_path/Pathname:files)
-  - [Pathname#files_r](http://www.rubydoc.info/gems/pleasant_path/Pathname:files_r)
   - [Pathname#make_dirname](http://www.rubydoc.info/gems/pleasant_path/Pathname:make_dirname)
+  - [Pathname#make_file](http://www.rubydoc.info/gems/pleasant_path/Pathname:make_file)
+  - [Pathname#move_as](http://www.rubydoc.info/gems/pleasant_path/Pathname:move_as)
   - [Pathname#rename_basename](http://www.rubydoc.info/gems/pleasant_path/Pathname:rename_basename)
   - [Pathname#rename_extname](http://www.rubydoc.info/gems/pleasant_path/Pathname:rename_extname)
-  - [Pathname#touch_file](http://www.rubydoc.info/gems/pleasant_path/Pathname:touch_file)
 - [ryoba](https://rubygems.org/gems/ryoba)
   ([docs](http://www.rubydoc.info/gems/ryoba/))
   - [Nokogiri::XML::Node#matches!](http://www.rubydoc.info/gems/ryoba/Nokogiri/XML/Node:matches%21)
@@ -154,6 +149,7 @@ for a complete API listing.
   - [Nokogiri::XML::Searchable#at!](http://www.rubydoc.info/gems/ryoba/Nokogiri/XML/Searchable:at%21)
   - [Nokogiri::XML::Searchable#search!](http://www.rubydoc.info/gems/ryoba/Nokogiri/XML/Searchable:search%21)
 
+
 ## Installation
 
 Install from [Ruby Gems](https://rubygems.org/gems/grubby):
data/grubby.gemspec
CHANGED
@@ -20,9 +20,8 @@ Gem::Specification.new do |spec|
   spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
   spec.require_paths = ["lib"]
 
-  spec.add_runtime_dependency "activesupport", "
+  spec.add_runtime_dependency "activesupport", ">= 5.0"
   spec.add_runtime_dependency "casual_support", "~> 3.0"
-  spec.add_runtime_dependency "dumb_delimited", "~> 1.0"
   spec.add_runtime_dependency "gorge", "~> 1.0"
   spec.add_runtime_dependency "mechanize", "~> 2.7"
   spec.add_runtime_dependency "mini_sanity", "~> 1.0"
data/lib/grubby.rb
CHANGED
@@ -1,6 +1,5 @@
 require "active_support/all"
 require "casual_support"
-require "dumb_delimited"
 require "gorge"
 require "mechanize"
 require "mini_sanity"
@@ -32,7 +31,7 @@ class Grubby < Mechanize
   attr_accessor :time_between_requests
 
   # Journal file used to ensure only-once processing of resources by
-  # {singleton} across multiple program runs.
+  # {singleton} across multiple program runs.
   #
   # @return [Pathname, nil]
   attr_reader :journal
@@ -68,20 +67,37 @@ class Grubby < Mechanize
     self.pre_connect_hooks << Proc.new{ self.send(:sleep_between_requests) }
     self.time_between_requests = 1.0
 
-
-
+    self.journal = journal
+  end
+
+  # Sets the journal file used to ensure only-once processing of
+  # resources by {singleton} across multiple program runs. Setting the
+  # journal file will clear the in-memory list of previously-processed
+  # resources, and, if the journal file exists, load the list from file.
+  #
+  # @param path [Pathname, String, nil]
+  # @return [Pathname]
+  def journal=(path)
+    @journal = path&.to_pathname&.touch_file
+    @seen = if @journal
+      require "csv"
+      CSV.read(@journal).map{|row| SingletonKey.new(*row) }.index_to{ true }
+    else
+      {}
+    end
+    @journal
   end
 
   # Calls +#head+ and returns true if the result has response code
   # "200". Unlike +#head+, error response codes (e.g. "404", "500")
   # do not cause a +Mechanize::ResponseCodeError+ to be raised.
   #
-  # @param uri [String]
+  # @param uri [URI, String]
   # @return [Boolean]
   def ok?(uri, query_params = {}, headers = {})
     begin
       head(uri, query_params, headers).code == "200"
-    rescue Mechanize::ResponseCodeError
+    rescue Mechanize::ResponseCodeError
       false
     end
   end
@@ -91,7 +107,21 @@ class Grubby < Mechanize
   # Rescues and logs +Mechanize::ResponseCodeError+ failures for all but
   # the last mirror.
   #
-  # @
+  # @example
+  #   grubby = Grubby.new
+  #
+  #   urls = [
+  #     "http://httpstat.us/404",
+  #     "http://httpstat.us/500",
+  #     "http://httpstat.us/200#foo",
+  #     "http://httpstat.us/200#bar",
+  #   ]
+  #
+  #   grubby.get_mirrored(urls).uri  # == URI("http://httpstat.us/200#foo")
+  #
+  #   grubby.get_mirrored(urls.take(2))  # raise Mechanize::ResponseCodeError
+  #
+  # @param mirror_uris [Array<URI>, Array<String>]
   # @return [Mechanize::Page, Mechanize::File, Mechanize::Download, ...]
   # @raise [Mechanize::ResponseCodeError]
   #   if all +mirror_uris+ fail
@@ -111,32 +141,43 @@ class Grubby < Mechanize
     end
   end
 
-  # Ensures only-once processing of the resource indicated by +
-  #
-  #
-  #
-  #
+  # Ensures only-once processing of the resource indicated by +uri+ for
+  # the specified +purpose+. A list of previously-processed resource
+  # URIs and content hashes is maintained in the Grubby instance. The
+  # given block is called with the fetched resource only if the
+  # resource's URI and the resource's content hash have not been
   # previously processed under the specified +purpose+.
   #
-  # @
-  #
+  # @example
+  #   grubby = Grubby.new
+  #
+  #   grubby.singleton("https://example.com/foo") do |page|
+  #     # will be executed (first time "/foo")
+  #   end
+  #
+  #   grubby.singleton("https://example.com/foo#bar") do |page|
+  #     # will be skipped (already seen "/foo")
+  #   end
+  #
+  #   grubby.singleton("https://example.com/foo", "again!") do |page|
+  #     # will be executed (new purpose for "/foo")
+  #   end
+  #
+  # @param uri [URI, String]
   # @param purpose [String]
-  #   the purpose of processing the resource
   # @yield [resource]
-  #   processes the resource
   # @yieldparam resource [Mechanize::Page, Mechanize::File, Mechanize::Download, ...]
-  #   the fetched resource
   # @return [Boolean]
   #   whether the given block was called
   # @raise [Mechanize::ResponseCodeError]
   #   if fetching the resource results in error (see +Mechanize#get+)
-  def singleton(
+  def singleton(uri, purpose = "")
     series = []
 
-
-    return if try_skip_singleton(
+    uri = uri.to_absolute_uri
+    return if try_skip_singleton(uri, purpose, series)
 
-    normalized_uri = normalize_uri(
+    normalized_uri = normalize_uri(uri)
     return if try_skip_singleton(normalized_uri, purpose, series)
 
     $log.info("Fetch #{normalized_uri}")
@@ -146,7 +187,9 @@ class Grubby < Mechanize
 
     yield resource unless skip
 
-
+    CSV.open(journal, "a") do |csv|
+      series.each{|singleton_key| csv << singleton_key }
+    end if journal
 
     !skip
   end
@@ -154,7 +197,8 @@ class Grubby < Mechanize
 
   private
 
-
+  # @!visibility private
+  SingletonKey = Struct.new(:purpose, :target)
 
   def try_skip_singleton(target, purpose, series)
     series << SingletonKey.new(purpose, target.to_s)
@@ -175,8 +219,8 @@ class Grubby < Mechanize
 
   def sleep_between_requests
     @last_request_at ||= 0.0
-    delay_duration =
-      rand(
+    delay_duration = time_between_requests.is_a?(Range) ?
+      rand(time_between_requests) : time_between_requests
     sleep_duration = @last_request_at + delay_duration - Time.now.to_f
     sleep(sleep_duration) if sleep_duration > 0
     @last_request_at = Time.now.to_f
@@ -189,3 +233,6 @@ require_relative "grubby/json_parser"
 require_relative "grubby/scraper"
 require_relative "grubby/page_scraper"
 require_relative "grubby/json_scraper"
+
+
+$grubby = Grubby.new
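The new `journal=` setter and the Range-aware request delay shown above can be combined as follows. A hedged sketch; the journal path and URL are assumptions for illustration:

```ruby
require "grubby"

grubby = Grubby.new

# Persist the list of processed resources to a CSV journal so that
# singleton() also skips them on later runs (the path is hypothetical).
grubby.journal = "processed.csv"

# A Range delay is sampled per request by sleep_between_requests.
grubby.time_between_requests = 1.0..3.0

grubby.singleton("https://example.com/feed") do |page|
  # Runs at most once per URI / content hash for this purpose.
  puts page.title
end
```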
data/lib/grubby/core_ext/string.rb
CHANGED
@@ -4,7 +4,8 @@ class String
   # does not denote an absolute URI.
   #
   # @return [URI]
-  # @raise [RuntimeError]
+  # @raise [RuntimeError]
+  #   if the String does not denote an absolute URI
   def to_absolute_uri
     URI(self).to_absolute_uri
   end
data/lib/grubby/core_ext/uri.rb
CHANGED
@@ -9,7 +9,7 @@ module URI
   #
   # @return [String]
   def basename
-    self.path == "/" ? "" : File.basename(self.path)
+    self.path == "/" ? "" : ::File.basename(self.path)
   end
 
   # Returns the value of the specified param in the URI's +query+.
@@ -21,7 +21,7 @@ module URI
   #   occurrence of that param in the query string.
   #
   # @example
-  #   URI("http://example.com/?foo=a").query_param("foo")
+  #   URI("http://example.com/?foo=a").query_param("foo")  # == "a"
   #
   #   URI("http://example.com/?foo=a&foo=b").query_param("foo")  # == "b"
   #   URI("http://example.com/?foo=a&foo=b").query_param("foo[]")  # == nil
@@ -43,7 +43,8 @@ module URI
   # Raises an exception if the URI is not +absolute?+.
   #
   # @return [self]
-  # @raise [RuntimeError]
+  # @raise [RuntimeError]
+  #   if the URI is not +absolute?+
   def to_absolute_uri
     raise "URI is not absolute: #{self}" unless self.absolute?
     self
data/lib/grubby/json_parser.rb
CHANGED
@@ -39,7 +39,7 @@ class Grubby::JsonParser < Mechanize::File
   attr_reader :json
 
   def initialize(uri = nil, response = nil, body = nil, code = nil)
-    @json = body && JSON.parse(body, self.class.json_parse_options)
+    @json = body.presence && JSON.parse(body, self.class.json_parse_options)
     super
   end
 
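The switch to `body.presence` means an empty response body now yields `json == nil` instead of raising a parse error. A minimal sketch of the difference, assuming the parser can be constructed directly with a nil response (as the `scrape_file` helpers do for `Mechanize::Page`):

```ruby
require "grubby"

# Empty body: String#presence turns "" into nil, so JSON.parse is skipped.
empty = Grubby::JsonParser.new(URI("http://example.com/empty"), nil, "", "200")
empty.json  # => nil

# A non-empty body is parsed as before.
full = Grubby::JsonParser.new(URI("http://example.com/data"), nil, '{"ok":true}', "200")
full.json   # => {"ok"=>true}
```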
data/lib/grubby/mechanize/page.rb
CHANGED
@@ -1,17 +1,21 @@
 class Mechanize::Page
 
   # @!method search!(*queries)
-  #   See
+  #   See Ryoba's +Nokogiri::XML::Searchable#search!+.
   #
   #   @param queries [Array<String>]
-  #   @return [
+  #   @return [Nokogiri::XML::NodeSet]
+  #   @raise [Ryoba::Error]
+  #     if all queries yield no results
   def_delegators :parser, :search!
 
   # @!method at!(*queries)
-  #   See
+  #   See Ryoba's +Nokogiri::XML::Searchable#at!+.
   #
   #   @param queries [Array<String>]
   #   @return [Nokogiri::XML::Element]
+  #   @raise [Ryoba::Error]
+  #     if all queries yield no results
   def_delegators :parser, :at!
 
 end
data/lib/grubby/page_scraper.rb
CHANGED
@@ -24,7 +24,7 @@ class Grubby::PageScraper < Grubby::Scraper
   # @param path [String]
   # @param agent [Mechanize]
   # @return [Grubby::PageScraper]
-  def self.scrape_file(path, agent =
+  def self.scrape_file(path, agent = $grubby)
     uri = URI.join("file:///", File.expand_path(path))
     body = File.read(path)
     self.new(Mechanize::Page.new(uri, nil, body, "200", agent))
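With the agent argument defaulting to `$grubby`, locally saved HTML can be scraped without wiring up an agent first. A brief sketch; the file path, class name, and selector are assumptions for illustration:

```ruby
require "grubby"

# Hypothetical scraper for a page saved to disk.
class SavedPageScraper < Grubby::PageScraper
  scrapes(:heading) { page.at!("h1").text }
end

# scrape_file reads the file, wraps it in a Mechanize::Page, and hands it
# to the scraper, using the global $grubby agent by default.
scraper = SavedPageScraper.scrape_file("saved_page.html")
puts scraper.heading
```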
data/lib/grubby/scraper.rb
CHANGED
@@ -2,61 +2,200 @@ class Grubby::Scraper
 
   # Defines an attribute reader method named by +field+. During
   # +initialize+, the given block is called, and the attribute is set to
-  # the block's return value.
-  #
-  #
+  # the block's return value.
+  #
+  # By default, if the block's return value is nil, an exception will be
+  # raised. To prevent this behavior, specify +optional: true+.
+  #
+  # The block may also be evaluated conditionally, based on another
+  # method's return value, using the +:if+ or +:unless+ options.
+  #
+  # @example
+  #   class GreetingScraper < Grubby::Scraper
+  #     scrapes(:salutation) do
+  #       source[/\A(hello|good morning)\b/i]
+  #     end
+  #
+  #     scrapes(:recipient, optional: true) do
+  #       source[/\A#{salutation} ([a-z ]+)/i, 1]
+  #     end
+  #   end
+  #
+  #   scraper = GreetingScraper.new("Hello World!")
+  #   scraper.salutation  # == "Hello"
+  #   scraper.recipient  # == "World"
+  #
+  #   scraper = GreetingScraper.new("Good morning!")
+  #   scraper.salutation  # == "Good morning"
+  #   scraper.recipient  # == nil
+  #
+  #   scraper = GreetingScraper.new("Hey!")  # raises Grubby::Scraper::Error
+  #
+  # @example
+  #   class EmbeddedUrlScraper < Grubby::Scraper
+  #     scrapes(:url, optional: true){ source[%r"\bhttps?://\S+"] }
+  #
+  #     scrapes(:domain, if: :url){ url[%r"://([^/]+)/", 1] }
+  #   end
+  #
+  #   scraper = EmbeddedUrlScraper.new("visit https://example.com/foo for details")
+  #   scraper.url  # == "https://example.com/foo"
+  #   scraper.domain  # == "example.com"
+  #
+  #   scraper = EmbeddedUrlScraper.new("visit our website for details")
+  #   scraper.url  # == nil
+  #   scraper.domain  # == nil
   #
   # @param field [Symbol, String]
-  #
-  # @
-  #
+  # @param options [Hash]
+  # @option options :optional [Boolean]
+  # @option options :if [Symbol]
+  # @option options :unless [Symbol]
   # @yield []
-  #   scrapes the value
   # @yieldreturn [Object]
-  #
-  def self.scrapes(field,
+  # @return [void]
+  def self.scrapes(field, **options, &block)
     field = field.to_sym
     self.fields << field
 
     define_method(field) do
       raise "#{self.class}#initialize does not invoke `super`" unless defined?(@scraped)
-      return @scraped[field] if @scraped.key?(field)
 
-
+      if !@scraped.key?(field) && !@errors.key?(field)
         begin
-
-
-
-
+          skip = (options[:if] && !self.send(options[:if])) ||
+            (options[:unless] && self.send(options[:unless]))
+
+          if skip
+            @scraped[field] = nil
+          else
+            @scraped[field] = instance_eval(&block)
+            if @scraped[field].nil?
+              raise FieldValueRequiredError.new(field) unless options[:optional]
+              $log.debug("#{self.class}##{field} is nil")
+            end
           end
-          @scraped[field] = value
         rescue RuntimeError, IndexError => e
           @errors[field] = e
         end
       end
 
-
-
-
+      if @errors.key?(field)
+        raise FieldScrapeFailedError.new(field, @errors[field])
+      else
+        @scraped[field]
+      end
     end
   end
 
-  #
+  # Fields defined by {scrapes}.
   #
   # @return [Array<Symbol>]
   def self.fields
-    @fields ||= []
+    @fields ||= self == Grubby::Scraper ? [] : self.superclass.fields.dup
+  end
+
+  # Instantiates the Scraper class with the resource specified by +url+.
+  # This method acts as a default factory method, and provides a
+  # standard interface for specialized overrides.
+  #
+  # @example Default factory method
+  #   class PostPageScraper < Grubby::PageScraper
+  #     # ...
+  #   end
+  #
+  #   PostPageScraper.scrape("https://example.com/posts/42")
+  #     # == PostPageScraper.new($grubby.get("https://example.com/posts/42"))
+  #
+  # @example Specialized factory method
+  #   class PostApiScraper < Grubby::JsonScraper
+  #     # ...
+  #
+  #     def self.scrapes(url, agent = $grubby)
+  #       api_url = url.sub(%r"//example.com/(.+)", '//api.example.com/\1.json')
+  #       super(api_url, agent)
+  #     end
+  #   end
+  #
+  #   PostApiScraper.scrape("https://example.com/posts/42")
+  #     # == PostApiScraper.new($grubby.get("https://api.example.com/posts/42.json"))
+  #
+  # @param url [String, URI]
+  # @param agent [Mechanize]
+  # @return [Grubby::Scraper]
+  def self.scrape(url, agent = $grubby)
+    self.new(agent.get(url))
+  end
+
+  # Iterates a series of pages, starting at +start_url+. For each page,
+  # the Scraper class is instantiated and passed to the given block.
+  # Subsequent pages in the series are determined by invoking
+  # +next_method+ on each previous scraper instance.
+  #
+  # Iteration stops when the +next_method+ method returns nil. If the
+  # +next_method+ method returns a String or URI, that value will be
+  # treated as the URL of the next page. Otherwise that value will be
+  # treated as the page itself.
+  #
+  # @example
+  #   class PostsIndexScraper < Grubby::PageScraper
+  #     scrapes(:page_param){ page.uri.query_param("page") }
+  #
+  #     def next
+  #       page.link_with(text: "Next >")&.click
+  #     end
+  #   end
+  #
+  #   PostsIndexScraper.each("https://example.com/posts?page=1") do |scraper|
+  #     scraper.page_param  # == "1", "2", "3", ...
+  #   end
+  #
+  # @example
+  #   class PostsIndexScraper < Grubby::PageScraper
+  #     scrapes(:page_param){ page.uri.query_param("page") }
+  #
+  #     scrapes(:next_uri, optional: true) do
+  #       page.link_with(text: "Next >")&.to_absolute_uri
+  #     end
+  #   end
+  #
+  #   PostsIndexScraper.each("https://example.com/posts?page=1", next_method: :next_uri) do |scraper|
+  #     scraper.page_param  # == "1", "2", "3", ...
+  #   end
+  #
+  # @param start_url [String, URI]
+  # @param agent [Mechanize]
+  # @param next_method [Symbol]
+  # @yield [scraper]
+  # @yieldparam scraper [Grubby::Scraper]
+  # @return [void]
+  # @raise [NoMethodError]
+  #   if Scraper class does not implement +next_method+
+  def self.each(start_url, agent = $grubby, next_method: :next)
+    unless self.method_defined?(next_method)
+      raise NoMethodError.new(nil, next_method), "#{self} does not define `#{next_method}`"
+    end
+
+    return to_enum(:each, start_url, agent, next_method: next_method) unless block_given?
+
+    current = start_url
+    while current
+      current = agent.get(current) if current.is_a?(String) || current.is_a?(URI)
+      scraper = self.new(current)
+      yield scraper
+      current = scraper.send(next_method)
+    end
   end
 
-  # The
+  # The object being scraped. Typically a Mechanize pluggable parser
   # such as +Mechanize::Page+.
   #
   # @return [Object]
   attr_reader :source
 
-  #
-  # {
-  #   be empty.
+  # Collected errors raised during {initialize} by blocks passed to
+  # {scrapes}, indexed by field name. If {initialize} did not raise
+  # +Grubby::Scraper::Error+, this Hash will be empty.
   #
   # @return [Hash<Symbol, StandardError>]
   attr_reader :errors
@@ -123,6 +262,7 @@ class Grubby::Scraper
     end
   end
 
+  # @!visibility private
   class FieldScrapeFailedError < RuntimeError
     def initialize(field, field_error)
       super("`#{field}` raised #{field_error.class}")
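Because `Scraper.each` returns an Enumerator when no block is given, pagination composes with the usual Enumerable methods. A hedged sketch along the lines of the doc comments above; the URL, selector, and link text are assumptions for illustration:

```ruby
require "grubby"

# Hypothetical paginated index scraper.
class ExampleIndexScraper < Grubby::PageScraper
  scrapes(:headlines) { page.search!(".headline").map(&:text) }

  # Returning nil ends iteration; returning a page continues with it.
  def next
    page.link_with(text: "Next")&.click
  end
end

# With no block, .each returns an Enumerator, so the first three pages can
# be consumed lazily and flattened into a single list.
headlines = ExampleIndexScraper
  .each("https://example.com/index?page=1")
  .take(3)
  .flat_map(&:headlines)
```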
data/lib/grubby/version.rb
CHANGED
@@ -1 +1 @@
-GRUBBY_VERSION = "1.
+GRUBBY_VERSION = "1.2.0"
metadata
CHANGED
@@ -1,27 +1,27 @@
 --- !ruby/object:Gem::Specification
 name: grubby
 version: !ruby/object:Gem::Version
-  version: 1.
+  version: 1.2.0
 platform: ruby
 authors:
 - Jonathan Hefner
 autorequire:
 bindir: exe
 cert_chain: []
-date:
+date: 2019-07-06 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activesupport
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - ">="
     - !ruby/object:Gem::Version
       version: '5.0'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - ">="
     - !ruby/object:Gem::Version
       version: '5.0'
 - !ruby/object:Gem::Dependency
@@ -38,20 +38,6 @@ dependencies:
     - - "~>"
     - !ruby/object:Gem::Version
       version: '3.0'
-- !ruby/object:Gem::Dependency
-  name: dumb_delimited
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-    - !ruby/object:Gem::Version
-      version: '1.0'
-  type: :runtime
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-    - !ruby/object:Gem::Version
-      version: '1.0'
 - !ruby/object:Gem::Dependency
   name: gorge
   requirement: !ruby/object:Gem::Requirement
@@ -227,8 +213,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
   - !ruby/object:Gem::Version
     version: '0'
 requirements: []
-
-rubygems_version: 2.7.6
+rubygems_version: 3.0.1
 signing_key:
 specification_version: 4
 summary: Fail-fast web scraping