grubby 1.0.0 → 1.1.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
-SHA1:
-  metadata.gz: 88c8ecc06ffba254ee9e9de3d42f868c0692b244
-  data.tar.gz: 8cd3445f33c9f7db05550947d686293ceee620c1
+SHA256:
+  metadata.gz: 7528791ce5da4ca182e8258cf5bc8920345470ee76ee50de44cf89adac7ffec6
+  data.tar.gz: 3b3dad255ae1841583abb2c61345fbefffb268231906968619000b95005044be
 SHA512:
-  metadata.gz: d9dc7435763425d54d82f4930935c913cd67eb15a1250c58c7ccf3b0e419eeb6ceb87289ed3c22386faa96e3ae3584a9f39cf8e671af8b56b52bb3e2c8257e4d
-  data.tar.gz: 2c5c96993c8a673274a4acc34c3f82a8719fbea508c1dffb5ce8cff1d4a13cd90c9fcbf8a3c743500a31f08419c687c24adf5f2aee464a8a4f1935b2b302b184
+  metadata.gz: 295c2957f708d86b596a4c062fcdf31d9c5083d26d15989de31feb174316ee17430e19d1746ad0f389d599560cc419e740835b1bfba6b0f57627e633f1a0ecf1
+  data.tar.gz: e8bc4ecb3ce277436be91ee4e8cf9c187c1f0bbf5ee170bc7a4e3f221f94d678e0e24d2f7dd878427c7b2b0e0b2fe1baa485556bdf21cd89e05a7ff222a9dc53
CHANGELOG.md ADDED
@@ -0,0 +1,17 @@
+## 1.1.0
+* Added `Grubby#ok?`.
+* Added `Grubby::PageScraper.scrape_file` and `Grubby::JsonScraper.scrape_file`.
+* Added `Mechanize::Parser#save_to` and `Mechanize::Parser#save_to!`,
+  which are inherited by `Mechanize::Download` and `Mechanize::File`.
+* Added `URI#basename`.
+* Added `URI#query_param`.
+* Added utility methods from [ryoba](https://rubygems.org/gems/ryoba).
+* Added `Grubby::Scraper::Error#scraper` and `Grubby::Scraper#errors`
+  for interactive debugging with e.g. byebug.
+* Improved log messages and error formatting.
+* Fixed compatibility with net-http-persistent gem v3.0.
+
+
+## 1.0.0
+
+* Initial release
data/README.md CHANGED
@@ -60,6 +60,7 @@ puts hn.items.take(10).map(&:title) # your scraping logic goes here
 
 - [Grubby](http://www.rubydoc.info/gems/grubby/Grubby)
   - [#get_mirrored](http://www.rubydoc.info/gems/grubby/Grubby:get_mirrored)
+  - [#ok?](http://www.rubydoc.info/gems/grubby/Grubby:ok%3F)
   - [#singleton](http://www.rubydoc.info/gems/grubby/Grubby:singleton)
   - [#time_between_requests](http://www.rubydoc.info/gems/grubby/Grubby:time_between_requests)
 - [Scraper](http://www.rubydoc.info/gems/grubby/Grubby/Scraper)
@@ -69,37 +70,89 @@ puts hn.items.take(10).map(&:title) # your scraping logic goes here
   - [#source](http://www.rubydoc.info/gems/grubby/Grubby/Scraper:source)
   - [#to_h](http://www.rubydoc.info/gems/grubby/Grubby/Scraper:to_h)
 - [PageScraper](http://www.rubydoc.info/gems/grubby/Grubby/PageScraper)
+  - [.scrape_file](http://www.rubydoc.info/gems/grubby/Grubby/PageScraper.scrape_file)
   - [#page](http://www.rubydoc.info/gems/grubby/Grubby/PageScraper:page)
 - [JsonScraper](http://www.rubydoc.info/gems/grubby/Grubby/JsonScraper)
+  - [.scrape_file](http://www.rubydoc.info/gems/grubby/Grubby/JsonScraper.scrape_file)
   - [#json](http://www.rubydoc.info/gems/grubby/Grubby/JsonScraper:json)
-- Nokogiri::XML::Searchable
-  - [#at!](http://www.rubydoc.info/gems/grubby/Nokogiri/XML/Searchable:at%21)
-  - [#search!](http://www.rubydoc.info/gems/grubby/Nokogiri/XML/Searchable:search%21)
+- Mechanize::Download
+  - [#save_to](http://www.rubydoc.info/gems/grubby/Mechanize/Parser:save_to)
+  - [#save_to!](http://www.rubydoc.info/gems/grubby/Mechanize/Parser:save_to%21)
+- Mechanize::File
+  - [#save_to](http://www.rubydoc.info/gems/grubby/Mechanize/Parser:save_to)
+  - [#save_to!](http://www.rubydoc.info/gems/grubby/Mechanize/Parser:save_to%21)
 - Mechanize::Page
   - [#at!](http://www.rubydoc.info/gems/grubby/Mechanize/Page:at%21)
   - [#search!](http://www.rubydoc.info/gems/grubby/Mechanize/Page:search%21)
 - Mechanize::Page::Link
   - [#to_absolute_uri](http://www.rubydoc.info/gems/grubby/Mechanize/Page/Link#to_absolute_uri)
+- URI
+  - [#basename](https://www.rubydoc.info/gems/grubby/URI:basename)
+  - [#query_param](https://www.rubydoc.info/gems/grubby/URI:query_param)
 
 
 ## Supplemental API
 
-*grubby* uses several gems which extend core Ruby objects with
-convenience methods.  When you import *grubby* you automatically make
-these methods available.  See each gem below for its specific API
-documentation:
+*grubby* includes several gems which extend Ruby objects with
+convenience methods.  When you load *grubby* you automatically make
+these methods available.  The included gems are listed below, along with
+**a few** of the methods each provides.  See each gem's documentation
+for a complete API listing.
 
 - [Active Support](https://rubygems.org/gems/activesupport)
   ([docs](http://www.rubydoc.info/gems/activesupport/))
+  - [Enumerable#index_by](https://www.rubydoc.info/gems/activesupport/Enumerable:index_by)
+  - [File.atomic_write](https://www.rubydoc.info/gems/activesupport/File:atomic_write)
+  - [NilClass#try](https://www.rubydoc.info/gems/activesupport/NilClass:try)
+  - [Object#presence](https://www.rubydoc.info/gems/activesupport/Object:presence)
+  - [String#blank?](https://www.rubydoc.info/gems/activesupport/String:blank%3F)
+  - [String#squish](https://www.rubydoc.info/gems/activesupport/String:squish)
 - [casual_support](https://rubygems.org/gems/casual_support)
   ([docs](http://www.rubydoc.info/gems/casual_support/))
+  - [Enumerable#index_to](http://www.rubydoc.info/gems/casual_support/Enumerable:index_to)
+  - [String#after](http://www.rubydoc.info/gems/casual_support/String:after)
+  - [String#after_last](http://www.rubydoc.info/gems/casual_support/String:after_last)
+  - [String#before](http://www.rubydoc.info/gems/casual_support/String:before)
+  - [String#before_last](http://www.rubydoc.info/gems/casual_support/String:before_last)
+  - [String#between](http://www.rubydoc.info/gems/casual_support/String:between)
+  - [Time#to_hms](http://www.rubydoc.info/gems/casual_support/Time:to_hms)
+  - [Time#to_ymd](http://www.rubydoc.info/gems/casual_support/Time:to_ymd)
 - [gorge](https://rubygems.org/gems/gorge)
   ([docs](http://www.rubydoc.info/gems/gorge/))
+  - [Pathname#file_crc32](http://www.rubydoc.info/gems/gorge/Pathname:file_crc32)
+  - [Pathname#file_md5](http://www.rubydoc.info/gems/gorge/Pathname:file_md5)
+  - [Pathname#file_sha1](http://www.rubydoc.info/gems/gorge/Pathname:file_sha1)
+  - [String#crc32](http://www.rubydoc.info/gems/gorge/String:crc32)
+  - [String#md5](http://www.rubydoc.info/gems/gorge/String:md5)
+  - [String#sha1](http://www.rubydoc.info/gems/gorge/String:sha1)
 - [mini_sanity](https://rubygems.org/gems/mini_sanity)
   ([docs](http://www.rubydoc.info/gems/mini_sanity/))
+  - [Array#assert_length!](http://www.rubydoc.info/gems/mini_sanity/Array:assert_length%21)
+  - [Enumerable#refute_empty!](http://www.rubydoc.info/gems/mini_sanity/Enumerable:refute_empty%21)
+  - [Object#assert_equal!](http://www.rubydoc.info/gems/mini_sanity/Object:assert_equal%21)
+  - [Object#assert_in!](http://www.rubydoc.info/gems/mini_sanity/Object:assert_in%21)
+  - [Object#refute_nil!](http://www.rubydoc.info/gems/mini_sanity/Object:refute_nil%21)
+  - [Pathname#assert_exist!](http://www.rubydoc.info/gems/mini_sanity/Pathname:assert_exist%21)
+  - [String#assert_match!](http://www.rubydoc.info/gems/mini_sanity/String:assert_match%21)
 - [pleasant_path](https://rubygems.org/gems/pleasant_path)
   ([docs](http://www.rubydoc.info/gems/pleasant_path/))
-
+  - [Pathname#dirs](http://www.rubydoc.info/gems/pleasant_path/Pathname:dirs)
+  - [Pathname#dirs_r](http://www.rubydoc.info/gems/pleasant_path/Pathname:dirs_r)
+  - [Pathname#files](http://www.rubydoc.info/gems/pleasant_path/Pathname:files)
+  - [Pathname#files_r](http://www.rubydoc.info/gems/pleasant_path/Pathname:files_r)
+  - [Pathname#make_dirname](http://www.rubydoc.info/gems/pleasant_path/Pathname:make_dirname)
+  - [Pathname#rename_basename](http://www.rubydoc.info/gems/pleasant_path/Pathname:rename_basename)
+  - [Pathname#rename_extname](http://www.rubydoc.info/gems/pleasant_path/Pathname:rename_extname)
+  - [Pathname#touch_file](http://www.rubydoc.info/gems/pleasant_path/Pathname:touch_file)
+- [ryoba](https://rubygems.org/gems/ryoba)
+  ([docs](http://www.rubydoc.info/gems/ryoba/))
+  - [Nokogiri::XML::Node#matches!](http://www.rubydoc.info/gems/ryoba/Nokogiri/XML/Node:matches%21)
+  - [Nokogiri::XML::Node#text!](http://www.rubydoc.info/gems/ryoba/Nokogiri/XML/Node:text%21)
+  - [Nokogiri::XML::Node#uri](http://www.rubydoc.info/gems/ryoba/Nokogiri/XML/Node:uri)
+  - [Nokogiri::XML::Searchable#ancestor!](http://www.rubydoc.info/gems/ryoba/Nokogiri/XML/Searchable:ancestor%21)
+  - [Nokogiri::XML::Searchable#ancestors!](http://www.rubydoc.info/gems/ryoba/Nokogiri/XML/Searchable:ancestors%21)
+  - [Nokogiri::XML::Searchable#at!](http://www.rubydoc.info/gems/ryoba/Nokogiri/XML/Searchable:at%21)
+  - [Nokogiri::XML::Searchable#search!](http://www.rubydoc.info/gems/ryoba/Nokogiri/XML/Searchable:search%21)
 
 
 ## Installation
 
data/Rakefile CHANGED
@@ -8,9 +8,6 @@ end
 
 desc "Launch IRB with this gem pre-loaded"
 task :irb do
-  # HACK because lib/grubby/version is prematurely loaded by bundler/gem_tasks
-  Object.send(:remove_const, :Grubby)
-
   require "grubby"
   require "irb"
   ARGV.clear
grubby.gemspec CHANGED
@@ -5,7 +5,7 @@ require "grubby/version"
 
 Gem::Specification.new do |spec|
   spec.name = "grubby"
-  spec.version = Grubby::VERSION
+  spec.version = GRUBBY_VERSION
   spec.authors = ["Jonathan Hefner"]
   spec.email = ["jonathan.hefner@gmail.com"]
 
@@ -27,6 +27,7 @@ Gem::Specification.new do |spec|
   spec.add_runtime_dependency "mechanize", "~> 2.7"
   spec.add_runtime_dependency "mini_sanity", "~> 1.0"
   spec.add_runtime_dependency "pleasant_path", "~> 1.1"
+  spec.add_runtime_dependency "ryoba", "~> 1.0"
 
   spec.add_development_dependency "bundler", "~> 1.15"
   spec.add_development_dependency "rake", "~> 10.0"
data/lib/grubby.rb CHANGED
@@ -5,7 +5,9 @@ require "gorge"
 require "mechanize"
 require "mini_sanity"
 require "pleasant_path"
+require "ryoba"
 
+require_relative "grubby/version"
 require_relative "grubby/log"
 
 require_relative "grubby/core_ext/string"
@@ -15,22 +17,30 @@ require_relative "grubby/mechanize/download"
 require_relative "grubby/mechanize/file"
 require_relative "grubby/mechanize/link"
 require_relative "grubby/mechanize/page"
-require_relative "grubby/nokogiri/searchable"
+require_relative "grubby/mechanize/parser"
 
 
 class Grubby < Mechanize
 
+  VERSION = GRUBBY_VERSION
+
+  # The enforced minimum amount of time to wait between requests, in
+  # seconds.  If the value is a Range, a random number within the Range
+  # is chosen for each request.
+  #
   # @return [Integer, Float, Range<Integer>, Range<Float>]
-  #   The enforced minimum amount of time to wait between requests, in
-  #   seconds.  If the value is a Range, a random number within the
-  #   Range is chosen for each request.
   attr_accessor :time_between_requests
 
-  # @param singleton_journal [Pathname, String]
-  #   Optional journal file to persist the list of resources processed
-  #   by {singleton}.  Useful to ensure only-once processing across
-  #   multiple program runs.
-  def initialize(singleton_journal = nil)
+  # Journal file used to ensure only-once processing of resources by
+  # {singleton} across multiple program runs.  Set via {initialize}.
+  #
+  # @return [Pathname, nil]
+  attr_reader :journal
+
+  # @param journal [Pathname, String]
+  #   Optional journal file used to ensure only-once processing of
+  #   resources by {singleton} across multiple program runs.
+  def initialize(journal = nil)
     super()
 
     # Prevent "memory leaks", and prevent mistakenly blank urls from
@@ -58,10 +68,22 @@ class Grubby < Mechanize
     self.pre_connect_hooks << Proc.new{ self.send(:sleep_between_requests) }
     self.time_between_requests = 1.0
 
-    @journal = singleton_journal ?
-      singleton_journal.to_pathname.touch_file : Pathname::NULL
-    @seen = SingletonKey.parse_file(@journal).
-      group_by(&:purpose).transform_values{|sks| sks.map(&:key).index_to{ true } }
+    @journal = journal.try(&:to_pathname).try(&:touch_file)
+    @seen = @journal ? SingletonKey.parse_file(@journal).index_to{ true } : {}
+  end
+
+  # Calls +#head+ and returns true if the result has response code
+  # "200".  Unlike +#head+, error response codes (e.g. "404", "500")
+  # do not cause a +Mechanize::ResponseCodeError+ to be raised.
+  #
+  # @param uri [String]
+  # @return [Boolean]
+  def ok?(uri, query_params = {}, headers = {})
+    begin
+      head(uri, query_params, headers).code == "200"
+    rescue Mechanize::ResponseCodeError => e
+      false
+    end
   end
 
   # Calls +#get+ with each of +mirror_uris+ until a successful
@@ -82,8 +104,8 @@ class Grubby < Mechanize
       if i >= mirror_uris.length
         raise
       else
-        $log.info("Mirror failed with response code #{e.response_code}: #{mirror_uris[i - 1]}")
-        $log.debug("Trying next mirror: #{mirror_uris[i]}")
+        $log.debug("Mirror failed (code #{e.response_code}): #{mirror_uris[i - 1]}")
+        $log.debug("Try mirror: #{mirror_uris[i]}")
         retry
       end
     end
@@ -111,20 +133,20 @@ class Grubby < Mechanize
   def singleton(target, purpose = "")
     series = []
 
-    original_url = target.to_absolute_uri
-    return if skip_singleton?(purpose, original_url.to_s, series)
+    original_uri = target.to_absolute_uri
+    return if try_skip_singleton(original_uri, purpose, series)
 
-    url = normalize_url(original_url)
-    return if skip_singleton?(purpose, url.to_s, series)
+    normalized_uri = normalize_uri(original_uri)
+    return if try_skip_singleton(normalized_uri, purpose, series)
 
-    $log.info("Fetching #{url}")
-    resource = get(url)
-    skip = skip_singleton?(purpose, resource.uri.to_s, series) |
-      skip_singleton?(purpose, "content hash: #{resource.content_hash}", series)
+    $log.info("Fetch #{normalized_uri}")
+    resource = get(normalized_uri)
+    skip = try_skip_singleton(resource.uri, purpose, series) |
+      try_skip_singleton("content hash: #{resource.content_hash}", purpose, series)
 
     yield resource unless skip
 
-    series.map{|k| SingletonKey.new(purpose, k) }.append_to_file(@journal)
+    series.append_to_file(@journal) if @journal
 
     !skip
   end
@@ -132,22 +154,23 @@ class Grubby < Mechanize
 
   private
 
-  SingletonKey = DumbDelimited[:purpose, :key]
+  SingletonKey = DumbDelimited[:purpose, :target]
 
-  def skip_singleton?(purpose, key, series)
-    return false if series.include?(key)
-    series << key
-    already = (@seen[purpose.to_s] ||= {}).displace(key, true)
-    $log.info("Skipping #{series.first} (already seen #{series.last})") if already
-    already
+  def try_skip_singleton(target, purpose, series)
+    series << SingletonKey.new(purpose, target.to_s)
+    if series.uniq!.nil? && @seen.displace(series.last, true)
+      seen_info = series.length > 1 ? "seen #{series.last.target}" : "seen"
+      $log.info("Skip #{series.first.target} (#{seen_info})")
+      true
+    end
   end
 
-  def normalize_url(url)
-    url = url.dup
-    $log.warn("Discarding fragment in URL: #{url}") if url.fragment
-    url.fragment = nil
-    url.path = url.path.chomp("/")
-    url
+  def normalize_uri(uri)
+    uri = uri.dup
+    $log.warn("Ignore ##{uri.fragment} in #{uri}") if uri.fragment
+    uri.fragment = nil
+    uri.path = uri.path.chomp("/")
+    uri
   end
 
   def sleep_between_requests
@@ -162,7 +185,6 @@ class Grubby < Mechanize
 end
 
 
-require_relative "grubby/version"
 require_relative "grubby/json_parser"
 require_relative "grubby/scraper"
 require_relative "grubby/page_scraper"
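The renamed `normalize_uri` helper drops the URI fragment and any trailing slash before a URL is journaled, so equivalent URLs dedupe to the same key. A standalone sketch of that normalization using only Ruby's stdlib (`warn` stands in for the gem's `$log.warn`):

```ruby
require "uri"

# Normalize a URI for duplicate detection, as Grubby#singleton does
# before journaling: discard the fragment and any trailing slash.
def normalize_uri(uri)
  uri = uri.dup
  warn("Ignore ##{uri.fragment} in #{uri}") if uri.fragment
  uri.fragment = nil
  uri.path = uri.path.chomp("/")
  uri
end

a = normalize_uri(URI("http://example.com/foo/#bar"))
b = normalize_uri(URI("http://example.com/foo"))
puts a  # http://example.com/foo
puts a == b  # true -- both normalize to the same URI
```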
@@ -1,5 +1,45 @@
 module URI
 
+  # Returns the basename of the URI's +path+, a la +File.basename+.
+  #
+  # @example
+  #   URI("http://example.com/foo/bar").basename  # == "bar"
+  #   URI("http://example.com/foo").basename      # == "foo"
+  #   URI("http://example.com/").basename         # == ""
+  #
+  # @return [String]
+  def basename
+    self.path == "/" ? "" : File.basename(self.path)
+  end
+
+  # Returns the value of the specified param in the URI's +query+.
+  # The specified param name must be exactly as it appears in the query
+  # string, and support for complex nested values is limited.  (See
+  # +CGI.parse+ for parsing behavior.)  If the param name includes a
+  # +"[]"+, the result will be an array of all occurrences of that param
+  # in the query string.  Otherwise, the result will be the last
+  # occurrence of that param in the query string.
+  #
+  # @example
+  #   URI("http://example.com/?foo=a").query_param("foo")  # == "a"
+  #
+  #   URI("http://example.com/?foo=a&foo=b").query_param("foo")    # == "b"
+  #   URI("http://example.com/?foo=a&foo=b").query_param("foo[]")  # == nil
+  #
+  #   URI("http://example.com/?foo[]=a&foo[]=b").query_param("foo")    # == nil
+  #   URI("http://example.com/?foo[]=a&foo[]=b").query_param("foo[]")  # == ["a", "b"]
+  #
+  #   URI("http://example.com/?foo[][x]=a&foo[][y]=b").query_param("foo[]")     # == nil
+  #   URI("http://example.com/?foo[][x]=a&foo[][y]=b").query_param("foo[][x]")  # == ["a"]
+  #
+  # @return [String, nil]
+  # @return [Array<String>, nil]
+  #   if +name+ contains +"[]"+
+  def query_param(name)
+    values = CGI.parse(self.query)[name.to_s]
+    (values.nil? || name.include?("[]")) ? values : values.last
+  end
+
   # Raises an exception if the URI is not +absolute?+.
   #
   # @return [self]
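Because `query_param` is a thin wrapper around stdlib `CGI.parse`, its documented `"[]"` behavior can be reproduced without the gem. A sketch operating on a raw query string (the free-standing `query_param` helper here is illustrative; the gem defines it on `URI` instead):

```ruby
require "cgi"

# Sketch of URI#query_param's documented behavior, using stdlib
# CGI.parse: a param name containing "[]" returns every occurrence;
# any other name returns only the last occurrence; a name absent from
# the query string returns nil.
def query_param(query, name)
  values = CGI.parse(query)[name.to_s]
  return nil if values.nil? || values.empty?
  name.include?("[]") ? values : values.last
end

puts query_param("foo=a&foo=b", "foo").inspect        # "b"
puts query_param("foo[]=a&foo[]=b", "foo[]").inspect  # ["a", "b"]
puts query_param("foo[]=a&foo[]=b", "foo").inspect    # nil
```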
data/lib/grubby/json_parser.rb CHANGED
@@ -33,8 +33,9 @@ class Grubby::JsonParser < Mechanize::File
     @json_parse_options = options
   end
 
+  # The parsed JSON data.
+  #
   # @return [Hash, Array]
-  #   The parsed JSON data.
   attr_reader :json
 
   def initialize(uri = nil, response = nil, body = nil, code = nil)
data/lib/grubby/json_scraper.rb CHANGED
@@ -1,7 +1,8 @@
 class Grubby::JsonScraper < Grubby::Scraper
 
+  # The parsed JSON data being scraped.
+  #
   # @return [Hash, Array]
-  #   The parsed JSON data being scraped.
   attr_reader :json
 
   # @param source [Grubby::JsonParser]
@@ -10,4 +11,22 @@ class Grubby::JsonScraper < Grubby::Scraper
     super
   end
 
+  # Scrapes a locally-stored file.  This method is intended for use with
+  # subclasses of +Grubby::JsonScraper+.
+  #
+  # @example
+  #   class MyScraper < Grubby::JsonScraper
+  #     # ...
+  #   end
+  #
+  #   MyScraper.scrape_file("path/to/local_file.json").class  # == MyScraper
+  #
+  # @param path [String]
+  # @return [Grubby::JsonScraper]
+  def self.scrape_file(path)
+    uri = URI.join("file:///", File.expand_path(path))
+    body = File.read(path)
+    self.new(Grubby::JsonParser.new(uri, nil, body, "200"))
+  end
+
 end
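Both `scrape_file` class methods turn a relative local path into an absolute `file://` URI before constructing a parser; that one-liner is pure stdlib and can be checked in isolation (`file_uri` is a hypothetical helper name for the inline expression):

```ruby
require "uri"

# scrape_file converts a relative local path into an absolute file://
# URI like this, so the parser sees a proper URI for a local file.
def file_uri(path)
  URI.join("file:///", File.expand_path(path))
end

uri = file_uri("data/local_file.json")
puts uri.scheme  # "file"
puts uri         # file:///<cwd>/data/local_file.json (cwd-dependent)
```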
@@ -9,9 +9,8 @@ class Mechanize::HTTP::Agent
   IDEMPOTENT_HTTP_METHODS = [:get, :head, :options, :delete]
 
   # Replacement for +Mechanize::HTTP::Agent#fetch+.  When a "too many
-  # connection resets" error is encountered, this method shuts down the
-  # persistent HTTP connection, and then retries the request (upto
-  # {MAX_CONNECTION_RESET_RETRIES} times).
+  # connection resets" error is encountered, this method retries the
+  # request (up to {MAX_CONNECTION_RESET_RETRIES} times).
   def fetch_with_retry(uri, http_method = :get, headers = {}, params = [], referer = current_page, redirects = 0)
     retry_count = 0
     begin
@@ -26,9 +25,9 @@ class Mechanize::HTTP::Agent
 
       # otherwise, shutdown the persistent HTTP connection and try again
       retry_count += 1
-      $log.warn("Possible connection reset bug.  Retry(#{retry_count}) #{http_method.to_s.upcase} #{uri}")
-      self.http.shutdown
-      sleep(retry_count) # incremental backoff in case problem is with server
+      $log.warn("#{e.message} (#{e.class}).  Retry in #{retry_count} seconds.")
+      sleep(retry_count) # incremental backoff to allow server to self-correct
+      $log.warn("Retry #{http_method.to_s.upcase} #{uri}")
       retry
     end
   end
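The rescue/backoff/`retry` shape of `fetch_with_retry` is a general Ruby pattern. A gem-free sketch with a deliberately flaky request (the `fetch` lambda and `MAX_RETRIES` constant are stand-ins for Mechanize's internals, and the backoff is shortened so the example runs instantly):

```ruby
MAX_RETRIES = 3

# Simulate a request that fails with a connection reset twice before
# succeeding, to exercise the retry loop.
attempts = 0
fetch = lambda do
  attempts += 1
  raise Errno::ECONNRESET if attempts < 3
  "response body"
end

retry_count = 0
result =
  begin
    fetch.call
  rescue Errno::ECONNRESET
    raise if retry_count >= MAX_RETRIES  # give up past the cap
    retry_count += 1
    sleep(retry_count * 0.001)  # incremental backoff (shortened here)
    retry
  end

puts result       # "response body"
puts retry_count  # 2
```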
data/lib/grubby/mechanize/parser.rb ADDED
@@ -0,0 +1,46 @@
+require "fileutils"
+
+module Mechanize::Parser
+
+  # Saves the payload to a specified directory, but using the default
+  # filename suggested by the server.  If a file with that name already
+  # exists, this method will try to find a free filename by appending
+  # numbers to the original name.  Returns the full path of the saved
+  # file.
+  #
+  # NOTE: this method expects a +#save!+ method to be defined by the
+  # class extending +Mechanize::Parser+, e.g. +Mechanize::File#save!+
+  # and +Mechanize::Download#save!+.
+  #
+  # @param directory [String]
+  # @return [String]
+  def save_to(directory)
+    raise "#{self.class}#save! is not defined" unless self.respond_to?(:save!)
+
+    FileUtils.mkdir_p(directory)
+    path = find_free_name(File.join(directory, @filename))
+    save!(path)
+    path
+  end
+
+  # Saves the payload to a specified directory, but using the default
+  # filename suggested by the server.  If a file with that name already
+  # exists, that file will be overwritten.  Returns the full path of the
+  # saved file.
+  #
+  # NOTE: this method expects a +#save!+ method to be defined by the
+  # class extending +Mechanize::Parser+, e.g. +Mechanize::File#save!+
+  # and +Mechanize::Download#save!+.
+  #
+  # @param directory [String]
+  # @return [String]
+  def save_to!(directory)
+    raise "#{self.class}#save! is not defined" unless self.respond_to?(:save!)
+
+    FileUtils.mkdir_p(directory)
+    path = File.join(directory, @filename)
+    save!(path)
+    path
+  end
+
+end
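`save_to` defers to Mechanize's own `find_free_name` to avoid clobbering an existing file. A hypothetical stdlib-only equivalent of that idea (`free_name` is not part of Mechanize's API, and the exact numbering scheme below is an assumption for illustration):

```ruby
require "fileutils"
require "tmpdir"

# Find an unused filename by inserting an increasing number before the
# extension: "report.txt" -> "report.1.txt", "report.2.txt", ...
# Hypothetical stand-in for Mechanize's free-name logic.
def free_name(path)
  return path unless File.exist?(path)
  ext = File.extname(path)
  base = path.delete_suffix(ext)
  n = 0
  loop do
    n += 1
    candidate = "#{base}.#{n}#{ext}"
    return candidate unless File.exist?(candidate)
  end
end

Dir.mktmpdir do |dir|
  first = File.join(dir, "report.txt")
  FileUtils.touch(first)
  puts free_name(first)  # e.g. <dir>/report.1.txt
end
```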
data/lib/grubby/page_scraper.rb CHANGED
@@ -1,7 +1,8 @@
 class Grubby::PageScraper < Grubby::Scraper
 
+  # The Page being scraped.
+  #
   # @return [Mechanize::Page]
-  #   The Page being scraped.
   attr_reader :page
 
   # @param source [Mechanize::Page]
@@ -10,4 +11,23 @@ class Grubby::PageScraper < Grubby::Scraper
     super
   end
 
+  # Scrapes a locally-stored file.  This method is intended for use with
+  # subclasses of +Grubby::PageScraper+.
+  #
+  # @example
+  #   class MyScraper < Grubby::PageScraper
+  #     # ...
+  #   end
+  #
+  #   MyScraper.scrape_file("path/to/local_file.html").class  # == MyScraper
+  #
+  # @param path [String]
+  # @param agent [Mechanize]
+  # @return [Grubby::PageScraper]
+  def self.scrape_file(path, agent = Grubby.new)
+    uri = URI.join("file:///", File.expand_path(path))
+    body = File.read(path)
+    self.new(Mechanize::Page.new(uri, nil, body, "200", agent))
+  end
+
 end
data/lib/grubby/scraper.rb CHANGED
@@ -1,8 +1,5 @@
 class Grubby::Scraper
 
-  class Error < RuntimeError
-  end
-
   # Defines an attribute reader method named by +field+.  During
   # +initialize+, the given block is called, and the attribute is set to
   # the block's return value.  By default, if the block's return value
@@ -22,38 +19,48 @@ class Grubby::Scraper
     self.fields << field
 
     define_method(field) do
+      raise "#{self.class}#initialize does not invoke `super`" unless defined?(@scraped)
       return @scraped[field] if @scraped.key?(field)
 
-      unless @errors.key?(field)
+      unless @errors[field]
        begin
          value = instance_eval(&block)
          if value.nil?
-            raise "`#{field}` cannot be nil" unless optional
-            $log.debug("Scraped nil value for #{self.class}##{field}")
+            raise FieldValueRequiredError.new(field) unless optional
+            $log.debug("#{self.class}##{field} is nil")
          end
          @scraped[field] = value
-        rescue RuntimeError => e
+        rescue RuntimeError, IndexError => e
          @errors[field] = e
        end
      end
 
-      raise "`#{field}` raised a #{@errors[field].class}" if @errors.key?(field)
+      raise FieldScrapeFailedError.new(field, @errors[field]) if @errors[field]
 
      @scraped[field]
    end
  end
 
+  # The names of all scraped values, as defined by {scrapes}.
+  #
   # @return [Array<Symbol>]
-  #   The names of all scraped values, as defined by {scrapes}.
   def self.fields
     @fields ||= []
   end
 
+  # The source being scraped.  Typically a Mechanize pluggable parser
+  # such as +Mechanize::Page+.
+  #
   # @return [Object]
-  #   The source being scraped.  Typically a Mechanize pluggable parser
-  #   such as +Mechanize::Page+.
   attr_reader :source
 
+  # Hash of errors raised by blocks passed to {scrapes}.  If
+  # {initialize} does not raise +Grubby::Scraper::Error+, this Hash will
+  # be empty.
+  #
+  # @return [Hash<Symbol, StandardError>]
+  attr_reader :errors
+
   # @param source
   # @raise [Grubby::Scraper::Error]
   #   if any scraped values result in error
@@ -65,18 +72,11 @@ class Grubby::Scraper
     self.class.fields.each do |field|
       begin
         self.send(field)
-      rescue RuntimeError
+      rescue FieldScrapeFailedError
       end
     end
 
-    unless @errors.empty?
-      listing = @errors.map do |field, error|
-        error_class = " (#{error.class})" unless error.class == RuntimeError
-        error_trace = error.backtrace.join("\n").indent(2)
-        "* #{field} -- #{error.message}#{error_class}\n#{error_trace}"
-      end
-      raise Error.new("Failed to scrape the following fields:\n#{listing.join("\n")}")
-    end
+    raise Error.new(self) unless @errors.empty?
   end
 
   # Returns the scraped value named by +field+.
@@ -96,4 +96,43 @@ class Grubby::Scraper
     @scraped.dup
   end
 
+  class Error < RuntimeError
+    BACKTRACE_CLEANER = ActiveSupport::BacktraceCleaner.new.tap do |cleaner|
+      cleaner.add_silencer do |line|
+        line.include?(__dir__) && line.include?("scraper.rb:")
+      end
+    end
+
+    # @return [Grubby::Scraper]
+    #   The Scraper that raised this error.
+    attr_accessor :scraper
+
+    def initialize(scraper)
+      self.scraper = scraper
+
+      listing = scraper.errors.
+        reject{|field, error| error.is_a?(FieldScrapeFailedError) }.
+        map do |field, error|
+          "* `#{field}` (#{error.class})\n" +
+            error.message.indent(2) + "\n\n" +
+            BACKTRACE_CLEANER.clean(error.backtrace).join("\n").indent(4) + "\n"
+        end.
+        join("\n")
+
+      super("Failed to scrape the following fields:\n#{listing}")
+    end
+  end
+
+  class FieldScrapeFailedError < RuntimeError
+    def initialize(field, field_error)
+      super("`#{field}` raised #{field_error.class}")
+    end
+  end
+
+  class FieldValueRequiredError < RuntimeError
+    def initialize(field)
+      super("`#{field}` is nil but is not marked as optional")
+    end
+  end
+
 end
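The `scrapes` macro's mechanics, a `define_method` that memoizes its block's value in `@scraped` and records failures in `@errors` for later inspection, can be sketched without the gem. `MiniScraper` below is an illustrative stand-in, not Grubby's actual class:

```ruby
# Minimal sketch of Grubby::Scraper's field mechanics: each `scrapes`
# block runs once, its value is memoized in @scraped, and any error it
# raises is recorded in @errors for interactive debugging.
class MiniScraper
  def self.fields
    @fields ||= []
  end

  def self.scrapes(field, &block)
    fields << field
    define_method(field) do
      return @scraped[field] if @scraped.key?(field)
      raise "`#{field}` raised #{@errors[field].class}" if @errors[field]
      begin
        @scraped[field] = instance_eval(&block)
      rescue RuntimeError => e
        @errors[field] = e  # keep the original error for inspection
        raise "`#{field}` raised #{e.class}"
      end
    end
  end

  attr_reader :errors

  def initialize
    @scraped = {}
    @errors = {}
    # Eagerly evaluate every field, collecting (not propagating) errors.
    self.class.fields.each do |field|
      begin
        send(field)
      rescue RuntimeError
      end
    end
  end
end

class ExampleScraper < MiniScraper
  scrapes(:title)  { "Hello" }
  scrapes(:broken) { raise "nope" }
end

ex = ExampleScraper.new
puts ex.title                    # "Hello"
puts ex.errors[:broken].message  # "nope"
```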
data/lib/grubby/version.rb CHANGED
@@ -1,3 +1 @@
-class Grubby
-  VERSION = "1.0.0"
-end
+GRUBBY_VERSION = "1.1.0"
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: grubby
 version: !ruby/object:Gem::Version
-  version: 1.0.0
+  version: 1.1.0
 platform: ruby
 authors:
 - Jonathan Hefner
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2017-09-05 00:00:00.000000000 Z
+date: 2018-07-27 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activesupport
@@ -108,6 +108,20 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '1.1'
+- !ruby/object:Gem::Dependency
+  name: ryoba
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.0'
 - !ruby/object:Gem::Dependency
   name: bundler
   requirement: !ruby/object:Gem::Requirement
@@ -173,6 +187,7 @@ extra_rdoc_files: []
 files:
 - ".gitignore"
 - ".travis.yml"
+- CHANGELOG.md
 - Gemfile
 - LICENSE.txt
 - README.md
@@ -189,7 +204,7 @@ files:
 - lib/grubby/mechanize/file.rb
 - lib/grubby/mechanize/link.rb
 - lib/grubby/mechanize/page.rb
-- lib/grubby/nokogiri/searchable.rb
+- lib/grubby/mechanize/parser.rb
 - lib/grubby/page_scraper.rb
 - lib/grubby/scraper.rb
 - lib/grubby/version.rb
@@ -213,7 +228,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 2.6.13
+rubygems_version: 2.7.6
 signing_key:
 specification_version: 4
 summary: Fail-fast web scraping
data/lib/grubby/nokogiri/searchable.rb DELETED
@@ -1,27 +0,0 @@
-module Nokogiri::XML::Searchable
-
-  # Searches the node using the given XPath or CSS queries, and returns
-  # the results.  Raises an exception if there are no results.  See also
-  # +#search+.
-  #
-  # @param queries [Array<String>]
-  # @return [Array<Nokogiri::XML::Element>]
-  # @raise [RuntimeError] if queries yield no results
-  def search!(*queries)
-    results = search(*queries)
-    raise "No elements matching #{queries.map(&:inspect).join(" OR ")}" if results.empty?
-    results
-  end
-
-  # Searches the node using the given XPath or CSS queries, and returns
-  # only the first result.  Raises an exception if there are no results.
-  # See also +#at+.
-  #
-  # @param queries [Array<String>]
-  # @return [Nokogiri::XML::Element]
-  # @raise [RuntimeError] if queries yield no results
-  def at!(*queries)
-    search!(*queries).first
-  end
-
-end