RubyGems - grubby - Versions diffs - 1.2.1 → 2.0.0 - Mend

grubby 1.2.1 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (22)

checksums.yaml +4 -4
data/.gitignore +1 -0
data/.travis.yml +6 -3
data/CHANGELOG.md +12 -0
data/Gemfile +3 -0
data/README.md +140 -92
data/Rakefile +0 -13
data/gemfiles/activesupport-6.0.gemfile +3 -0
data/grubby.gemspec +17 -18
data/lib/grubby.rb +64 -46
data/lib/grubby/core_ext/uri.rb +12 -11
data/lib/grubby/json_parser.rb +1 -27
data/lib/grubby/json_scraper.rb +6 -2
data/lib/grubby/mechanize/download.rb +1 -1
data/lib/grubby/mechanize/file.rb +1 -2
data/lib/grubby/mechanize/link.rb +9 -6
data/lib/grubby/mechanize/page.rb +4 -2
data/lib/grubby/mechanize/parser.rb +9 -9
data/lib/grubby/page_scraper.rb +6 -2
data/lib/grubby/scraper.rb +86 -60
data/lib/grubby/version.rb +1 -1
metadata +17 -69

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 91cb5fb76be040dc0a6b86c7dd5513e7dfa79327e68b6f15da6ed41df1492740
-  data.tar.gz: d96e1a83f6ebc93c09403bc66ee3251132bbdabeb40379aa081dbece2c978b98
+  metadata.gz: e313c9ba144ee119b31eb6b7ec5fef721df811c8d579f532e5aa5de5a8d65198
+  data.tar.gz: 07f06e01378301c37ca0177a29f95e72f3cf549b65c3e1c9896c9749a9cd857d
 SHA512:
-  metadata.gz: 4e10fa8ae3b183fa600a26af1ff87e0e340e63cfdeec9369c1f9987ace143591b9c33b1edfed980b841ffea5806f96332b1b32e117551b714dcd3b66cff5a8da
-  data.tar.gz: 63985a6d1d39a1ac224eb1aca676f3266b911059e7ab5e838a535dd14e6249d2bbc1d41b59a35101e17983930ebd7ab258a6ce39375a300bcf1725a0e79b72c1
+  metadata.gz: ea948a4c90d2d9ef0e1cd527adc3ef89cb0379ad98751ffbf671b5cf2210e6e700b7856983e1378a500d9db842d9411cf20275f005ea4a2e2eba824a9c929ee3
+  data.tar.gz: 7a7985f0d5127d6c7e25f9d39a489c460cd76f05219358072dce618667edcefd033740335fbb4b6c8cfa216f0ed4f4d3cfab239af7d33f7d92ec939508a6ea20

data/.gitignore CHANGED

@@ -4,6 +4,7 @@
 /_yardoc/
 /coverage/
 /doc/
+/gemfiles/*.lock
 /pkg/
 /spec/reports/
 /tmp/

data/.travis.yml CHANGED

@@ -1,5 +1,8 @@
-sudo: false
 language: ruby
 rvm:
-  - 2.2.5
-before_install: gem install bundler -v 1.15.1
+  - 2.6
+  - 2.7
+gemfile:
+  - gemfiles/activesupport-6.0.gemfile

data/CHANGELOG.md CHANGED

@@ -1,3 +1,15 @@
+## 2.0.0
+* [BREAKING] Drop support for Active Support < 6.0
+* [BREAKING] Require casual_support ~> 4.0
+* [BREAKING] Require mini_sanity ~> 2.0
+* [BREAKING] Require pleasant_path ~> 2.0
+* [BREAKING] Remove `JsonParser.json_parse_options`
+  * Use `::JSON.load_default_options` instead
+* [BREAKING] Rename `Grubby#singleton` to `Grubby#fulfill`
+* [BREAKING] Change `Grubby#fulfill` to return block's result
 ## 1.2.1
 * Add `JsonParser#mech` attribute for parity with `Mechanize::Page#mech`

data/Gemfile CHANGED

@@ -2,3 +2,6 @@ source "https://rubygems.org"
 # Specify your gem's dependencies in grubby.gemspec
 gemspec
+gem "rake", "~> 12.0"
+gem "minitest", "~> 5.0"

data/README.md CHANGED

@@ -1,162 +1,211 @@
-# grubby
+# grubby [![Build Status](https://travis-ci.org/jonathanhefner/grubby.svg?branch=master)](https://travis-ci.org/jonathanhefner/grubby)
 [Fail-fast] web scraping.  *grubby* adds a layer of utility and
-error-checking atop the marvelous [Mechanize gem].  See API summary
+error-checking atop the marvelous [Mechanize gem].  See API listing
 below, or browse the [full documentation].
 [Fail-fast]: https://en.wikipedia.org/wiki/Fail-fast
 [Mechanize gem]: https://rubygems.org/gems/mechanize
-[full documentation]: http://www.rubydoc.info/gems/grubby/
+[full documentation]: https://www.rubydoc.info/gems/grubby/
 ## Examples
-The following example scrapes stories from the [Hacker News] front page:
+The following code scrapes stories from the [Hacker News](
+https://news.ycombinator.com/news) front page:
 ```ruby
 require "grubby"
 class HackerNews < Grubby::PageScraper
   scrapes(:items) do
-    page.search!(".athing").map{|el| Item.new(el) }
+    page.search!(".athing").map{|element| Item.new(element) }
   end
   class Item < Grubby::Scraper
     scrapes(:story_link){ source.at!("a.storylink") }
-    scrapes(:story_uri){ story_link.uri }
+    scrapes(:story_url){ expand_url(story_link["href"]) }
     scrapes(:title){ story_link.text }
+    scrapes(:comments_link, optional: true) do
+      source.next_sibling.search!(".subtext a").find do |link|
+        link.text.match?(/comment|discuss/)
+      end
+    end
+    scrapes(:comments_url, if: :comments_link) do
+      expand_url(comments_link["href"])
+    end
+    scrapes(:comment_count, if: :comments_link) do
+      comments_link.text.to_i
+    end
+    def expand_url(url)
+      url.include?("://") ? url : source.document.uri.merge(url).to_s
+    end
   end
 end
 # The following line will raise an exception if anything goes wrong
 # during the scraping process.  For example, if the structure of the
-# HTML does not match expectations, either due to incorrect assumptions
-# or a site change, the script will terminate immediately with a helpful
-# error message.  This prevents bad data from propagating and causing
-# hard-to-trace errors.
+# HTML does not match expectations due to a site change, the script will
+# terminate immediately with a helpful error message.  This prevents bad
+# data from propagating and causing hard-to-trace errors.
 hn = HackerNews.scrape("https://news.ycombinator.com/news")
 # Your processing logic goes here:
 hn.items.take(10).each do |item|
   puts "* #{item.title}"
-  puts "  #{item.story_uri}"
+  puts "  #{item.story_url}"
+  puts "  #{item.comment_count} comments: #{item.comments_url}" if item.comments_url
   puts
 end
 ```
-[Hacker News]: https://news.ycombinator.com/news
+Hacker News also offers a [JSON API](https://github.com/HackerNews/API),
+which may be more robust for scraping purposes.  *grubby* can scrape
+JSON just as well:
+```ruby
+require "grubby"
+class HackerNews < Grubby::JsonScraper
+  scrapes(:items) do
+    # API returns array of top 500 item IDs, so limit as necessary
+    json.take(10).map do |item_id|
+      Item.scrape("https://hacker-news.firebaseio.com/v0/item/#{item_id}.json")
+    end
+  end
+  class Item < Grubby::JsonScraper
+    scrapes(:story_url){ json["url"] || hn_url }
+    scrapes(:title){ json["title"] }
+    scrapes(:comments_url, optional: true) do
+      hn_url if json["descendants"]
+    end
+    scrapes(:comment_count, optional: true) do
+      json["descendants"]&.to_i
+    end
+    def hn_url
+      "https://news.ycombinator.com/item?id=#{json["id"]}"
+    end
+  end
+end
+hn = HackerNews.scrape("https://hacker-news.firebaseio.com/v0/topstories.json")
+# Your processing logic goes here:
+hn.items.each do |item|
+  puts "* #{item.title}"
+  puts "  #{item.story_url}"
+  puts "  #{item.comment_count} comments: #{item.comments_url}" if item.comments_url
+  puts
+end
+```
 ## Core API
-- [Grubby](http://www.rubydoc.info/gems/grubby/Grubby)
-  - [#get_mirrored](http://www.rubydoc.info/gems/grubby/Grubby:get_mirrored)
-  - [#ok?](http://www.rubydoc.info/gems/grubby/Grubby:ok%3F)
-  - [#singleton](http://www.rubydoc.info/gems/grubby/Grubby:singleton)
-  - [#time_between_requests](http://www.rubydoc.info/gems/grubby/Grubby:time_between_requests)
-- [Scraper](http://www.rubydoc.info/gems/grubby/Grubby/Scraper)
-  - [.each](http://www.rubydoc.info/gems/grubby/Grubby/Scraper.each)
-  - [.fields](http://www.rubydoc.info/gems/grubby/Grubby/Scraper.fields)
-  - [.scrape](http://www.rubydoc.info/gems/grubby/Grubby/Scraper.scrape)
-  - [.scrapes](http://www.rubydoc.info/gems/grubby/Grubby/Scraper.scrapes)
-  - [#[]](http://www.rubydoc.info/gems/grubby/Grubby/Scraper:[])
-  - [#source](http://www.rubydoc.info/gems/grubby/Grubby/Scraper:source)
-  - [#to_h](http://www.rubydoc.info/gems/grubby/Grubby/Scraper:to_h)
-- [PageScraper](http://www.rubydoc.info/gems/grubby/Grubby/PageScraper)
-  - [.scrape_file](http://www.rubydoc.info/gems/grubby/Grubby/PageScraper.scrape_file)
-  - [#page](http://www.rubydoc.info/gems/grubby/Grubby/PageScraper:page)
-- [JsonScraper](http://www.rubydoc.info/gems/grubby/Grubby/JsonScraper)
-  - [.scrape_file](http://www.rubydoc.info/gems/grubby/Grubby/JsonScraper.scrape_file)
-  - [#json](http://www.rubydoc.info/gems/grubby/Grubby/JsonScraper:json)
-- Mechanize::Download
-  - [#save_to](http://www.rubydoc.info/gems/grubby/Mechanize/Parser:save_to)
-  - [#save_to!](http://www.rubydoc.info/gems/grubby/Mechanize/Parser:save_to%21)
+- [Grubby](https://www.rubydoc.info/gems/grubby/Grubby)
+  - [#fulfill](https://www.rubydoc.info/gems/grubby/Grubby:fulfill)
+  - [#get_mirrored](https://www.rubydoc.info/gems/grubby/Grubby:get_mirrored)
+  - [#ok?](https://www.rubydoc.info/gems/grubby/Grubby:ok%3F)
+  - [#time_between_requests](https://www.rubydoc.info/gems/grubby/Grubby:time_between_requests)
+- [Scraper](https://www.rubydoc.info/gems/grubby/Grubby/Scraper)
+  - [.each](https://www.rubydoc.info/gems/grubby/Grubby/Scraper.each)
+  - [.scrape](https://www.rubydoc.info/gems/grubby/Grubby/Scraper.scrape)
+  - [.scrapes](https://www.rubydoc.info/gems/grubby/Grubby/Scraper.scrapes)
+  - [#[]](https://www.rubydoc.info/gems/grubby/Grubby/Scraper:[])
+  - [#to_h](https://www.rubydoc.info/gems/grubby/Grubby/Scraper:to_h)
+- [PageScraper](https://www.rubydoc.info/gems/grubby/Grubby/PageScraper)
+  - [.scrape_file](https://www.rubydoc.info/gems/grubby/Grubby/PageScraper.scrape_file)
+  - [#page](https://www.rubydoc.info/gems/grubby/Grubby/PageScraper:page)
+- [JsonScraper](https://www.rubydoc.info/gems/grubby/Grubby/JsonScraper)
+  - [.scrape_file](https://www.rubydoc.info/gems/grubby/Grubby/JsonScraper.scrape_file)
+  - [#json](https://www.rubydoc.info/gems/grubby/Grubby/JsonScraper:json)
 - Mechanize::File
-  - [#save_to](http://www.rubydoc.info/gems/grubby/Mechanize/Parser:save_to)
-  - [#save_to!](http://www.rubydoc.info/gems/grubby/Mechanize/Parser:save_to%21)
+  - [#save_to](https://www.rubydoc.info/gems/grubby/Mechanize/Parser:save_to)
+  - [#save_to!](https://www.rubydoc.info/gems/grubby/Mechanize/Parser:save_to%21)
 - Mechanize::Page
-  - [#at!](http://www.rubydoc.info/gems/grubby/Mechanize/Page:at%21)
-  - [#search!](http://www.rubydoc.info/gems/grubby/Mechanize/Page:search%21)
+  - [#at!](https://www.rubydoc.info/gems/grubby/Mechanize/Page:at%21)
+  - [#search!](https://www.rubydoc.info/gems/grubby/Mechanize/Page:search%21)
 - Mechanize::Page::Link
-  - [#to_absolute_uri](http://www.rubydoc.info/gems/grubby/Mechanize/Page/Link#to_absolute_uri)
+  - [#to_absolute_uri](https://www.rubydoc.info/gems/grubby/Mechanize/Page/Link#to_absolute_uri)
 - URI
   - [#basename](https://www.rubydoc.info/gems/grubby/URI:basename)
   - [#query_param](https://www.rubydoc.info/gems/grubby/URI:query_param)
-## Supplemental API
+## Auxiliary API
-*grubby* includes several gems which extend Ruby objects with
-convenience methods.  When you load *grubby* you automatically make
-these methods available.  The included gems are listed below, along with
-**a few** of the methods each provides.  See each gem's documentation
-for a complete API listing.
+*grubby* loads several gems that extend Ruby objects with utility
+methods.  Some of those methods are listed below.  See each gem's
+documentation for a complete API listing.
 - [Active Support](https://rubygems.org/gems/activesupport)
-  ([docs](http://www.rubydoc.info/gems/activesupport/))
+  ([docs](https://www.rubydoc.info/gems/activesupport/))
   - [Enumerable#index_by](https://www.rubydoc.info/gems/activesupport/Enumerable:index_by)
   - [File.atomic_write](https://www.rubydoc.info/gems/activesupport/File:atomic_write)
-  - [NilClass#try](https://www.rubydoc.info/gems/activesupport/NilClass:try)
   - [Object#presence](https://www.rubydoc.info/gems/activesupport/Object:presence)
   - [String#blank?](https://www.rubydoc.info/gems/activesupport/String:blank%3F)
   - [String#squish](https://www.rubydoc.info/gems/activesupport/String:squish)
 - [casual_support](https://rubygems.org/gems/casual_support)
-  ([docs](http://www.rubydoc.info/gems/casual_support/))
-  - [Enumerable#index_to](http://www.rubydoc.info/gems/casual_support/Enumerable:index_to)
-  - [String#after](http://www.rubydoc.info/gems/casual_support/String:after)
-  - [String#after_last](http://www.rubydoc.info/gems/casual_support/String:after_last)
-  - [String#before](http://www.rubydoc.info/gems/casual_support/String:before)
-  - [String#before_last](http://www.rubydoc.info/gems/casual_support/String:before_last)
-  - [String#between](http://www.rubydoc.info/gems/casual_support/String:between)
-  - [Time#to_hms](http://www.rubydoc.info/gems/casual_support/Time:to_hms)
-  - [Time#to_ymd](http://www.rubydoc.info/gems/casual_support/Time:to_ymd)
+  ([docs](https://www.rubydoc.info/gems/casual_support/))
+  - [Enumerable#index_to](https://www.rubydoc.info/gems/casual_support/Enumerable:index_to)
+  - [String#after](https://www.rubydoc.info/gems/casual_support/String:after)
+  - [String#after_last](https://www.rubydoc.info/gems/casual_support/String:after_last)
+  - [String#before](https://www.rubydoc.info/gems/casual_support/String:before)
+  - [String#before_last](https://www.rubydoc.info/gems/casual_support/String:before_last)
+  - [String#between](https://www.rubydoc.info/gems/casual_support/String:between)
+  - [Time#to_hms](https://www.rubydoc.info/gems/casual_support/Time:to_hms)
+  - [Time#to_ymd](https://www.rubydoc.info/gems/casual_support/Time:to_ymd)
 - [gorge](https://rubygems.org/gems/gorge)
-  ([docs](http://www.rubydoc.info/gems/gorge/))
-  - [Pathname#file_crc32](http://www.rubydoc.info/gems/gorge/Pathname:file_crc32)
-  - [Pathname#file_md5](http://www.rubydoc.info/gems/gorge/Pathname:file_md5)
-  - [Pathname#file_sha1](http://www.rubydoc.info/gems/gorge/Pathname:file_sha1)
-  - [String#crc32](http://www.rubydoc.info/gems/gorge/String:crc32)
-  - [String#md5](http://www.rubydoc.info/gems/gorge/String:md5)
-  - [String#sha1](http://www.rubydoc.info/gems/gorge/String:sha1)
+  ([docs](https://www.rubydoc.info/gems/gorge/))
+  - [Pathname#file_crc32](https://www.rubydoc.info/gems/gorge/Pathname:file_crc32)
+  - [Pathname#file_md5](https://www.rubydoc.info/gems/gorge/Pathname:file_md5)
+  - [Pathname#file_sha1](https://www.rubydoc.info/gems/gorge/Pathname:file_sha1)
 - [mini_sanity](https://rubygems.org/gems/mini_sanity)
-  ([docs](http://www.rubydoc.info/gems/mini_sanity/))
-  - [Array#assert_length!](http://www.rubydoc.info/gems/mini_sanity/Array:assert_length%21)
-  - [Enumerable#refute_empty!](http://www.rubydoc.info/gems/mini_sanity/Enumerable:refute_empty%21)
-  - [Object#assert_equal!](http://www.rubydoc.info/gems/mini_sanity/Object:assert_equal%21)
-  - [Object#assert_in!](http://www.rubydoc.info/gems/mini_sanity/Object:assert_in%21)
-  - [Object#refute_nil!](http://www.rubydoc.info/gems/mini_sanity/Object:refute_nil%21)
-  - [Pathname#assert_exist!](http://www.rubydoc.info/gems/mini_sanity/Pathname:assert_exist%21)
-  - [String#assert_match!](http://www.rubydoc.info/gems/mini_sanity/String:assert_match%21)
+  ([docs](https://www.rubydoc.info/gems/mini_sanity/))
+  - [Enumerator#result!](https://www.rubydoc.info/gems/mini_sanity/Enumerator:result%21)
+  - [Enumerator#results!](https://www.rubydoc.info/gems/mini_sanity/Enumerator:results%21)
+  - [Object#assert!](https://www.rubydoc.info/gems/mini_sanity/Object:assert%21)
+  - [Object#refute!](https://www.rubydoc.info/gems/mini_sanity/Object:refute%21)
+  - [String#match!](https://www.rubydoc.info/gems/mini_sanity/String:match%21)
 - [pleasant_path](https://rubygems.org/gems/pleasant_path)
-  ([docs](http://www.rubydoc.info/gems/pleasant_path/))
-  - [Pathname#available_name](http://www.rubydoc.info/gems/pleasant_path/Pathname:available_name)
-  - [Pathname#dirs](http://www.rubydoc.info/gems/pleasant_path/Pathname:dirs)
-  - [Pathname#files](http://www.rubydoc.info/gems/pleasant_path/Pathname:files)
-  - [Pathname#make_dirname](http://www.rubydoc.info/gems/pleasant_path/Pathname:make_dirname)
-  - [Pathname#make_file](http://www.rubydoc.info/gems/pleasant_path/Pathname:make_file)
-  - [Pathname#move_as](http://www.rubydoc.info/gems/pleasant_path/Pathname:move_as)
-  - [Pathname#rename_basename](http://www.rubydoc.info/gems/pleasant_path/Pathname:rename_basename)
-  - [Pathname#rename_extname](http://www.rubydoc.info/gems/pleasant_path/Pathname:rename_extname)
+  ([docs](https://www.rubydoc.info/gems/pleasant_path/))
+  - [Pathname#available_name](https://www.rubydoc.info/gems/pleasant_path/Pathname:available_name)
+  - [Pathname#existence](https://www.rubydoc.info/gems/pleasant_path/Pathname:existence)
+  - [Pathname#make_dirname](https://www.rubydoc.info/gems/pleasant_path/Pathname:make_dirname)
+  - [Pathname#move_as](https://www.rubydoc.info/gems/pleasant_path/Pathname:move_as)
+  - [Pathname#rename_basename](https://www.rubydoc.info/gems/pleasant_path/Pathname:rename_basename)
+  - [Pathname#rename_extname](https://www.rubydoc.info/gems/pleasant_path/Pathname:rename_extname)
 - [ryoba](https://rubygems.org/gems/ryoba)
-  ([docs](http://www.rubydoc.info/gems/ryoba/))
-  - [Nokogiri::XML::Node#matches!](http://www.rubydoc.info/gems/ryoba/Nokogiri/XML/Node:matches%21)
-  - [Nokogiri::XML::Node#text!](http://www.rubydoc.info/gems/ryoba/Nokogiri/XML/Node:text%21)
-  - [Nokogiri::XML::Node#uri](http://www.rubydoc.info/gems/ryoba/Nokogiri/XML/Node:uri)
-  - [Nokogiri::XML::Searchable#ancestor!](http://www.rubydoc.info/gems/ryoba/Nokogiri/XML/Searchable:ancestor%21)
-  - [Nokogiri::XML::Searchable#ancestors!](http://www.rubydoc.info/gems/ryoba/Nokogiri/XML/Searchable:ancestors%21)
-  - [Nokogiri::XML::Searchable#at!](http://www.rubydoc.info/gems/ryoba/Nokogiri/XML/Searchable:at%21)
-  - [Nokogiri::XML::Searchable#search!](http://www.rubydoc.info/gems/ryoba/Nokogiri/XML/Searchable:search%21)
+  ([docs](https://www.rubydoc.info/gems/ryoba/))
+  - [Nokogiri::XML::Node#matches!](https://www.rubydoc.info/gems/ryoba/Nokogiri/XML/Node:matches%21)
+  - [Nokogiri::XML::Node#text!](https://www.rubydoc.info/gems/ryoba/Nokogiri/XML/Node:text%21)
+  - [Nokogiri::XML::Node#uri](https://www.rubydoc.info/gems/ryoba/Nokogiri/XML/Node:uri)
+  - [Nokogiri::XML::Searchable#ancestor!](https://www.rubydoc.info/gems/ryoba/Nokogiri/XML/Searchable:ancestor%21)
+  - [Nokogiri::XML::Searchable#ancestors!](https://www.rubydoc.info/gems/ryoba/Nokogiri/XML/Searchable:ancestors%21)
+  - [Nokogiri::XML::Searchable#at!](https://www.rubydoc.info/gems/ryoba/Nokogiri/XML/Searchable:at%21)
+  - [Nokogiri::XML::Searchable#search!](https://www.rubydoc.info/gems/ryoba/Nokogiri/XML/Searchable:search%21)
 ## Installation
-Install from [Ruby Gems](https://rubygems.org/gems/grubby):
+Install the [gem](https://rubygems.org/gems/grubby):
 ```bash
 $ gem install grubby
 ```
-Then require in your Ruby script:
+Then require in your Ruby code:
 ```ruby
 require "grubby"
@@ -165,8 +214,7 @@ require "grubby"
 ## Contributing
-Run `rake test` to run the tests.  You can also run `rake irb` for an
-interactive prompt that pre-loads the project code.
+Run `rake test` to run the tests.
 ## License
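
Note on the README's HTML example above: the `expand_url` helper resolves relative links against the page URI with `URI#merge`. The following is a minimal stand-alone sketch of that resolution using only Ruby's standard `uri` library. The `base` argument is an assumption made for illustration; it stands in for `source.document.uri`, which is only available inside a scraper.

```ruby
require "uri"

# Stand-alone variant of the README's expand_url helper.
# `base` is a hypothetical parameter standing in for `source.document.uri`.
def expand_url(url, base)
  # Absolute URLs pass through unchanged; relative ones are merged with base.
  url.include?("://") ? url : URI(base).merge(url).to_s
end

relative = expand_url("item?id=1", "https://news.ycombinator.com/news")
absolute = expand_url("https://example.com/story", "https://news.ycombinator.com/news")

puts relative
puts absolute
```

Because the base path is `/news`, the relative reference `item?id=1` replaces the last path segment rather than being appended to it.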

data/Rakefile CHANGED

@@ -1,18 +1,5 @@
 require "bundler/gem_tasks"
 require "rake/testtask"
-require "yard"
-YARD::Rake::YardocTask.new(:doc) do |t|
-end
-desc "Launch IRB with this gem pre-loaded"
-task :irb do
-  require "grubby"
-  require "irb"
-  ARGV.clear
-  IRB.start
-end
 Rake::TestTask.new(:test) do |t|
   t.libs << "test"

data/gemfiles/activesupport-6.0.gemfile ADDED

@@ -0,0 +1,3 @@
+eval_gemfile "../Gemfile"
+gem "activesupport", "~> 6.0.0"

data/grubby.gemspec CHANGED

@@ -1,7 +1,4 @@
-# coding: utf-8
-lib = File.expand_path("../lib", __FILE__)
-$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
-require "grubby/version"
+require_relative "lib/grubby/version"
 Gem::Specification.new do |spec|
   spec.name          = "grubby"
@@ -12,24 +9,26 @@ Gem::Specification.new do |spec|
   spec.summary       = %q{Fail-fast web scraping}
   spec.homepage      = "https://github.com/jonathanhefner/grubby"
   spec.license       = "MIT"
+  spec.required_ruby_version = ">= 2.6"
-  spec.files         = `git ls-files -z`.split("\x0").reject do |f|
-    f.match(%r{^(test|spec|features)/})
+  spec.metadata["homepage_uri"] = spec.homepage
+  spec.metadata["source_code_uri"] = spec.homepage
+  spec.metadata["changelog_uri"] = spec.metadata["source_code_uri"] + "/blob/master/CHANGELOG.md"
+  # Specify which files should be added to the gem when it is released.
+  # The `git ls-files -z` loads the files in the RubyGem that have been added into git.
+  spec.files         = Dir.chdir(__dir__) do
+    `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
   end
   spec.bindir        = "exe"
   spec.executables   = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
   spec.require_paths = ["lib"]
-  spec.add_runtime_dependency "activesupport", ">= 5.0"
-  spec.add_runtime_dependency "casual_support", "~> 3.0"
-  spec.add_runtime_dependency "gorge", "~> 1.0"
-  spec.add_runtime_dependency "mechanize", "~> 2.7"
-  spec.add_runtime_dependency "mini_sanity", "~> 1.0"
-  spec.add_runtime_dependency "pleasant_path", "~> 1.1"
-  spec.add_runtime_dependency "ryoba", "~> 1.0"
-  spec.add_development_dependency "bundler", "~> 1.15"
-  spec.add_development_dependency "rake", "~> 10.0"
-  spec.add_development_dependency "minitest", "~> 5.0"
-  spec.add_development_dependency "yard", "~> 0.9"
+  spec.add_dependency "activesupport", ">= 6.0"
+  spec.add_dependency "casual_support", "~> 4.0"
+  spec.add_dependency "gorge", "~> 1.0"
+  spec.add_dependency "mechanize", "~> 2.7"
+  spec.add_dependency "mini_sanity", "~> 2.0"
+  spec.add_dependency "pleasant_path", "~> 2.0"
+  spec.add_dependency "ryoba", "~> 1.0"
 end
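
Note on the dependency changes above: to check whether an installed version falls inside the new constraints, the RubyGems `Gem::Requirement` API can be queried directly. This is a small sketch; the concrete version numbers are made up for illustration and are not taken from the diff.

```ruby
require "rubygems"

# Constraints copied from the 2.0.0 gemspec.
casual_support = Gem::Requirement.new("~> 4.0")
active_support = Gem::Requirement.new(">= 6.0")

# Hypothetical candidate versions, chosen only to exercise the constraints.
checks = {
  "casual_support 4.3.1" => casual_support.satisfied_by?(Gem::Version.new("4.3.1")),
  "casual_support 3.0.2" => casual_support.satisfied_by?(Gem::Version.new("3.0.2")),
  "casual_support 5.0.0" => casual_support.satisfied_by?(Gem::Version.new("5.0.0")),
  "activesupport 5.2.8"  => active_support.satisfied_by?(Gem::Version.new("5.2.8")),
  "activesupport 6.1.0"  => active_support.satisfied_by?(Gem::Version.new("6.1.0")),
}

checks.each { |name, ok| puts "#{name}: #{ok ? "allowed" : "rejected"}" }
```

The pessimistic operator `~> 4.0` accepts any 4.x release but rejects 5.0 and later, which is why the bumps to casual_support, mini_sanity, and pleasant_path are listed as breaking in the changelog.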

data/lib/grubby.rb CHANGED

@@ -23,22 +23,22 @@ class Grubby < Mechanize
   VERSION = GRUBBY_VERSION
-  # The enforced minimum amount of time to wait between requests, in
-  # seconds.  If the value is a Range, a random number within the Range
-  # is chosen for each request.
+  # The minimum amount of time enforced between requests, in seconds.
+  # If the value is a Range, a random number within the Range is chosen
+  # for each request.
   #
   # @return [Integer, Float, Range<Integer>, Range<Float>]
   attr_accessor :time_between_requests
   # Journal file used to ensure only-once processing of resources by
-  # {singleton} across multiple program runs.
+  # {fulfill} across multiple program runs.
   #
   # @return [Pathname, nil]
   attr_reader :journal
   # @param journal [Pathname, String]
   #   Optional journal file used to ensure only-once processing of
-  #   resources by {singleton} across multiple program runs.
+  #   resources by {fulfill} across multiple program runs
   def initialize(journal = nil)
     super()
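
Note on `time_between_requests` in the hunk above: the comment says a Range value yields a random number within the Range for each request. How grubby implements this is not shown in the diff; the sketch below only illustrates one way such a value could be resolved to a concrete delay. `pick_delay` is a hypothetical helper, not part of grubby.

```ruby
# Hypothetical helper (not grubby's code): turn a time_between_requests
# value into a concrete number of seconds to wait.
def pick_delay(time_between_requests)
  if time_between_requests.is_a?(Range)
    # Kernel#rand accepts Integer and Float ranges and returns a member.
    rand(time_between_requests)
  else
    time_between_requests
  end
end

puts pick_delay(3)          # fixed delay
puts pick_delay(1.0..2.0)   # random Float between 1.0 and 2.0
puts pick_delay(2..5)       # random Integer between 2 and 5
```
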
@@ -74,26 +74,27 @@ class Grubby < Mechanize
   end
   # Sets the journal file used to ensure only-once processing of
-  # resources by {singleton} across multiple program runs.  Setting the
+  # resources by {fulfill} across multiple program runs.  Setting the
   # journal file will clear the in-memory list of previously-processed
   # resources, and, if the journal file exists, load the list from file.
   #
   # @param path [Pathname, String, nil]
   # @return [Pathname]
   def journal=(path)
-    @journal = path&.to_pathname&.touch_file
-    @seen = if @journal
+    @journal = path&.to_pathname&.make_file
+    @fulfilled = if @journal
         require "csv"
-        CSV.read(@journal).map{|row| SingletonKey.new(*row) }.to_set
+        CSV.read(@journal).map{|row| FulfilledEntry.new(*row) }.to_set
       else
         Set.new
       end
     @journal
   end
-  # Calls +#head+ and returns true if the result has response code
-  # "200".  Unlike +#head+, error response codes (e.g. "404", "500")
-  # do not cause a +Mechanize::ResponseCodeError+ to be raised.
+  # Calls +#head+ and returns true if a response code "200" is received,
+  # false otherwise.  Unlike +#head+, error response codes (e.g. "404",
+  # "500") do not result in a +Mechanize::ResponseCodeError+ being
+  # raised.
   #
   # @param uri [URI, String]
   # @return [Boolean]
@@ -106,7 +107,7 @@ class Grubby < Mechanize
   end
   # Calls +#get+ with each of +mirror_uris+ until a successful
-  # ("200 OK") response is recieved, and returns that +#get+ result.
+  # ("200 OK") response is received, and returns that +#get+ result.
   # Rescues and logs +Mechanize::ResponseCodeError+ failures for all but
   # the last mirror.
   #
@@ -114,13 +115,13 @@ class Grubby < Mechanize
   #   grubby = Grubby.new
   #
   #   urls = [
-  #     "http://httpstat.us/404",
-  #     "http://httpstat.us/500",
-  #     "http://httpstat.us/200#foo",
-  #     "http://httpstat.us/200#bar",
+  #     "https://httpstat.us/404",
+  #     "https://httpstat.us/500",
+  #     "https://httpstat.us/200?foo",
+  #     "https://httpstat.us/200?bar",
   #   ]
   #
-  #   grubby.get_mirrored(urls).uri  # == URI("http://httpstat.us/200#foo")
+  #   grubby.get_mirrored(urls).uri  # == URI("https://httpstat.us/200?foo")
   #
   #   grubby.get_mirrored(urls.take(2))  # raise Mechanize::ResponseCodeError
   #
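
Note on `get_mirrored` in the hunk above: the documented behavior (try each mirror in order, rescue failures for all but the last) can be sketched without any network access. Everything named here is an assumption for illustration: `FetchError` stands in for `Mechanize::ResponseCodeError`, `get_first_available` for `get_mirrored`, and `fetch` for the real `#get` call. Returning `nil` for an empty list is a choice made for this sketch, not a documented grubby behavior.

```ruby
# Hypothetical stand-in for Mechanize::ResponseCodeError.
class FetchError < StandardError; end

# Try each mirror in order and return the first successful result.
# Failures are recorded in `log` for every mirror except the last,
# whose error is re-raised to the caller.
def get_first_available(mirror_uris, fetch, log = [])
  mirror_uris.each_with_index do |uri, index|
    begin
      return fetch.call(uri)
    rescue FetchError => error
      raise if index == mirror_uris.length - 1
      log << "Mirror failed: #{uri} (#{error.message})"
    end
  end
  nil # empty mirror list (sketch-only behavior)
end

# Fake fetcher: succeeds only for URLs containing "/200".
fake_fetch = lambda do |uri|
  raise FetchError, "non-200 response" unless uri.include?("/200")
  uri
end

urls = [
  "https://httpstat.us/404",
  "https://httpstat.us/500",
  "https://httpstat.us/200?foo",
  "https://httpstat.us/200?bar",
]

puts get_first_available(urls, fake_fetch)
```

With only the first two URLs, the call raises `FetchError`, mirroring the `raise Mechanize::ResponseCodeError` line in the documented example.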
@@ -145,70 +146,87 @@ class Grubby < Mechanize
   end
   # Ensures only-once processing of the resource indicated by +uri+ for
-  # the specified +purpose+.  A list of previously-processed resource
-  # URIs and content hashes is maintained in the Grubby instance.  The
-  # given block is called with the fetched resource only if the
-  # resource's URI and the resource's content hash have not been
-  # previously processed under the specified +purpose+.
+  # the specified +purpose+.  The given block is executed and the result
+  # is returned if and only if the Grubby instance has not recorded a
+  # previous call to +fulfill+ for the same resource and purpose.
+  #
+  # Note that the resource is identified by both its URI and its content
+  # hash.  The latter prevents superfluous and rearranged URI query
+  # string parameters from interfering with only-once processing.
+  #
+  # If {journal} is set, and if the block does not raise an exception,
+  # the resource and purpose are logged to the journal file.  This
+  # enables only-once processing across multiple program runs.  It also
+  # provides a means to resume batch processing after an unexpected
+  # termination.
   #
   # @example
   #   grubby = Grubby.new
   #
-  #   grubby.singleton("https://example.com/foo") do |page|
-  #     # will be executed (first time "/foo")
+  #   grubby.fulfill("https://example.com/posts") do |page|
+  #     "first time"
+  #   end
+  #   # == "first time"
+  #
+  #   grubby.fulfill("https://example.com/posts") do |page|
+  #     "already seen" # not evaluated
   #   end
+  #   # == nil
   #
-  #   grubby.singleton("https://example.com/foo#bar") do |page|
-  #     # will be skipped (already seen "/foo")
+  #   grubby.fulfill("https://example.com/posts?page=1") do |page|
+  #     "already seen content hash" # not evaluated
   #   end
+  #   # == nil
   #
-  #   grubby.singleton("https://example.com/foo", "again!") do |page|
-  #     # will be executed (new purpose for "/foo")
+  #   grubby.fulfill("https://example.com/posts", "again!") do |page|
+  #     "already seen, but new purpose"
   #   end
+  #   # == "already seen, but new purpose"
   #
   # @param uri [URI, String]
   # @param purpose [String]
-  # @yield [resource]
   # @yieldparam resource [Mechanize::Page, Mechanize::File, Mechanize::Download, ...]
-  # @return [Boolean]
-  #   whether the given block was called
+  # @yieldreturn [Object]
+  # @return [Object, nil]
   # @raise [Mechanize::ResponseCodeError]
   #   if fetching the resource results in error (see +Mechanize#get+)
-  def singleton(uri, purpose = "")
+  def fulfill(uri, purpose = "")
     series = []
     uri = uri.to_absolute_uri
-    return if try_skip_singleton(uri, purpose, series)
+    return unless add_fulfilled(uri, purpose, series)
     normalized_uri = normalize_uri(uri)
-    return if try_skip_singleton(normalized_uri, purpose, series)
+    return unless add_fulfilled(normalized_uri, purpose, series)
     $log.info("Fetch #{normalized_uri}")
     resource = get(normalized_uri)
-    skip = try_skip_singleton(resource.uri, purpose, series) |
-      try_skip_singleton("content hash: #{resource.content_hash}", purpose, series)
+    unprocessed = add_fulfilled(resource.uri, purpose, series) &
+      add_fulfilled("content hash: #{resource.content_hash}", purpose, series)
-    yield resource unless skip
+    result = yield resource if unprocessed
     CSV.open(journal, "a") do |csv|
-      series.each{|singleton_key| csv << singleton_key }
+      series.each{|entry| csv << entry }
     end if journal
-    !skip
+    result
   end
   private
   # @!visibility private
-  SingletonKey = Struct.new(:purpose, :target)
+  FulfilledEntry = Struct.new(:purpose, :target)
-  def try_skip_singleton(target, purpose, series)
-    series << SingletonKey.new(purpose, target.to_s)
-    if series.uniq!.nil? && !@seen.add?(series.last)
-      seen_info = series.length > 1 ? "seen #{series.last.target}" : "seen"
-      $log.info("Skip #{series.first.target} (#{seen_info})")
+  def add_fulfilled(target, purpose, series)
+    series << FulfilledEntry.new(purpose, target.to_s)
+    if (series.uniq!) || @fulfilled.add?(series.last)
       true
+    else
+      $log.info("Skip #{series.first.target}" \
+        " (seen#{" #{series.last.target}" unless series.length == 1})")
+      false
     end
   end
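
Note on the `singleton` to `fulfill` change above: the return value moved from a Boolean to the block's result (or `nil` when skipped). The sketch below is a simplified in-memory model of that contract, built around `Set#add?` as in `add_fulfilled`. It is an illustration only: `OnceOnly` and `Entry` are hypothetical names, and the model leaves out fetching, URI normalization, content hashes, and the journal file.

```ruby
require "set"

# Hypothetical stand-in for FulfilledEntry; Structs compare by value,
# so equal (purpose, target) pairs collapse to one Set member.
Entry = Struct.new(:purpose, :target)

# Simplified model of only-once processing (not grubby's implementation).
class OnceOnly
  def initialize
    @fulfilled = Set.new
  end

  # Returns the block's result the first time a (target, purpose) pair
  # is seen, and nil on every later call with the same pair.
  def fulfill(target, purpose = "")
    # Set#add? returns nil when the entry was already present.
    return unless @fulfilled.add?(Entry.new(purpose, target.to_s))
    yield target
  end
end

once = OnceOnly.new
first  = once.fulfill("https://example.com/posts") { "first time" }
second = once.fulfill("https://example.com/posts") { "already seen" }
third  = once.fulfill("https://example.com/posts", "again!") { "new purpose" }

p first, second, third
```

Callers that previously branched on the Boolean from `singleton` can usually branch on the returned value instead, provided the block never legitimately returns `nil` or `false`.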