grubby 1.1.0 → 1.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 7528791ce5da4ca182e8258cf5bc8920345470ee76ee50de44cf89adac7ffec6
- data.tar.gz: 3b3dad255ae1841583abb2c61345fbefffb268231906968619000b95005044be
+ metadata.gz: 84d759cf7187c8502b42e9d7828f59f126bb87af8da524e9d8e6f6ad8a64f467
+ data.tar.gz: bf26cca3991fca00e573f51f28a1c457e063e4f419986971f1429f051f2e3155
  SHA512:
- metadata.gz: 295c2957f708d86b596a4c062fcdf31d9c5083d26d15989de31feb174316ee17430e19d1746ad0f389d599560cc419e740835b1bfba6b0f57627e633f1a0ecf1
- data.tar.gz: e8bc4ecb3ce277436be91ee4e8cf9c187c1f0bbf5ee170bc7a4e3f221f94d678e0e24d2f7dd878427c7b2b0e0b2fe1baa485556bdf21cd89e05a7ff222a9dc53
+ metadata.gz: 38b8f7818be985da5c48484b8a3f42a40401b4890e46da93c2565c546654a660537cf15303e1106bdca201d1ea8e7ff90e13ab13dcb652997b0acc9becc01b48
+ data.tar.gz: e3c8b063d275ebf49dc50c5a70fa82cb0f9e517f17cc9e3735557a2fe998d5ea82a3ea0932ad8a6ecec630f4f66c8d62443c76497f6e98e4f202a72df988095e
@@ -1,15 +1,28 @@
+ ## 1.2.0
+
+ * Add `Grubby#journal=`
+ * Add `$grubby` global default `Grubby` instance
+ * Add `Scraper.scrape`
+ * Add `Scraper.each`
+ * Support `:if` and `:unless` options for `Scraper.scrapes`
+ * Fix fail-fast behavior of inherited scraper fields
+ * Fix `JsonParser` on empty response body
+ * Loosen Active Support version constraint
+
+
  ## 1.1.0
- * Added `Grubby#ok?`.
- * Added `Grubby::PageScraper.scrape_file` and `Grubby::JsonScraper.scrape_file`.
- * Added `Mechanize::Parser#save_to` and `Mechanize::Parser#save_to!`,
- which are inherited by `Mechanize::Download` and `Mechanize::File`.
- * Added `URI#basename`.
- * Added `URI#query_param`.
- * Added utility methods from [ryoba](https://rubygems.org/gems/ryoba).
- * Added `Grubby::Scraper::Error#scraper` and `Grubby::Scraper#errors`
- for interactive debugging with e.g. byebug.
- * Improved log messages and error formatting.
- * Fixed compatibility with net-http-persistent gem v3.0.
+
+ * Add `Grubby#ok?`
+ * Add `PageScraper.scrape_file` and `JsonScraper.scrape_file`
+ * Add `Mechanize::Parser#save_to` and `Mechanize::Parser#save_to!`,
+ which are inherited by `Mechanize::Download` and `Mechanize::File`
+ * Add `URI#basename`
+ * Add `URI#query_param`
+ * Add utility methods from [ryoba](https://rubygems.org/gems/ryoba)
+ * Add `Scraper::Error#scraper` and `Scraper#errors` for interactive
+ debugging with e.g. `byebug`
+ * Improve log messages and error formatting
+ * Fix compatibility with net-http-persistent gem v3.0


  ## 1.0.0
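Taken together, the 1.2.0 entries above can be exercised with a few lines of code. The following is an illustrative sketch only, based on the changelog and the diffs below; the scraper class, CSS selectors, journal file name, and URL are made up:

```ruby
require "grubby"

# Hypothetical scraper; class name and selectors are illustrative.
class QuoteScraper < Grubby::PageScraper
  scrapes(:quote){ page.at!(".quote").text }
  scrapes(:author, optional: true){ page.at(".author")&.text }
  scrapes(:author_slug, if: :author){ author.downcase.tr(" ", "-") }  # new :if option
end

# $grubby is the new global default Grubby instance (see lib/grubby.rb diff).
$grubby.journal = "seen.csv"  # new Grubby#journal= writer

# New Scraper.scrape factory method: fetches the URL with $grubby and
# instantiates the scraper.
scraper = QuoteScraper.scrape("https://example.com/quote-of-the-day")
puts scraper.quote
```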
data/README.md CHANGED
@@ -11,7 +11,7 @@ below, or browse the [full documentation].
 
  ## Examples
 
- The following example scrapes the [Hacker News] front page:
+ The following example scrapes stories from the [Hacker News] front page:
 
  ```ruby
  require "grubby"
@@ -19,38 +19,31 @@ require "grubby"
  class HackerNews < Grubby::PageScraper
 
  scrapes(:items) do
- page.search!(".athing").map{|item| HackerNewsItem.new(item) }
+ page.search!(".athing").map{|el| Item.new(el) }
  end
 
- end
-
- class HackerNewsItem < Grubby::Scraper
-
- scrapes(:title) { @row1.at!(".storylink").text }
- scrapes(:submitter) { @row2.at!(".hnuser").text }
- scrapes(:story_uri) { URI.join(@base_uri, @row1.at!(".storylink")["href"]) }
- scrapes(:comments_uri) { URI.join(@base_uri, @row2.at!(".age a")["href"]) }
-
- def initialize(source)
- @row1 = source
- @row2 = source.next_sibling
- @base_uri = source.document.url
- super
+ class Item < Grubby::Scraper
+ scrapes(:story_link){ source.at!("a.storylink") }
+ scrapes(:story_uri) { story_link.uri }
+ scrapes(:title) { story_link.text }
  end
 
  end
 
- grubby = Grubby.new
-
  # The following line will raise an exception if anything goes wrong
  # during the scraping process. For example, if the structure of the
- # HTML does not match expectations, either due to a bad assumption or
- # due to a site-wide change, the script will terminate immediately with
- # a relevant error message. This prevents bad values from propogating
- # and causing hard-to-trace errors.
- hn = HackerNews.new(grubby.get("https://news.ycombinator.com/news"))
-
- puts hn.items.take(10).map(&:title) # your scraping logic goes here
+ # HTML does not match expectations, either due to incorrect assumptions
+ # or a site change, the script will terminate immediately with a helpful
+ # error message. This prevents bad data from propagating and causing
+ # hard-to-trace errors.
+ hn = HackerNews.scrape("https://news.ycombinator.com/news")
+
+ # Your processing logic goes here:
+ hn.items.take(10).each do |item|
+ puts "* #{item.title}"
+ puts " #{item.story_uri}"
+ puts
+ end
  ```
 
  [Hacker News]: https://news.ycombinator.com/news
@@ -64,7 +57,9 @@ puts hn.items.take(10).map(&:title) # your scraping logic goes here
  - [#singleton](http://www.rubydoc.info/gems/grubby/Grubby:singleton)
  - [#time_between_requests](http://www.rubydoc.info/gems/grubby/Grubby:time_between_requests)
  - [Scraper](http://www.rubydoc.info/gems/grubby/Grubby/Scraper)
+ - [.each](http://www.rubydoc.info/gems/grubby/Grubby/Scraper.each)
  - [.fields](http://www.rubydoc.info/gems/grubby/Grubby/Scraper.fields)
+ - [.scrape](http://www.rubydoc.info/gems/grubby/Grubby/Scraper.scrape)
  - [.scrapes](http://www.rubydoc.info/gems/grubby/Grubby/Scraper.scrapes)
  - [#[]](http://www.rubydoc.info/gems/grubby/Grubby/Scraper:[])
  - [#source](http://www.rubydoc.info/gems/grubby/Grubby/Scraper:source)
@@ -136,14 +131,14 @@ for a complete API listing.
  - [String#assert_match!](http://www.rubydoc.info/gems/mini_sanity/String:assert_match%21)
  - [pleasant_path](https://rubygems.org/gems/pleasant_path)
  ([docs](http://www.rubydoc.info/gems/pleasant_path/))
+ - [Pathname#available_name](http://www.rubydoc.info/gems/pleasant_path/Pathname:available_name)
  - [Pathname#dirs](http://www.rubydoc.info/gems/pleasant_path/Pathname:dirs)
- - [Pathname#dirs_r](http://www.rubydoc.info/gems/pleasant_path/Pathname:dirs_r)
  - [Pathname#files](http://www.rubydoc.info/gems/pleasant_path/Pathname:files)
- - [Pathname#files_r](http://www.rubydoc.info/gems/pleasant_path/Pathname:files_r)
  - [Pathname#make_dirname](http://www.rubydoc.info/gems/pleasant_path/Pathname:make_dirname)
+ - [Pathname#make_file](http://www.rubydoc.info/gems/pleasant_path/Pathname:make_file)
+ - [Pathname#move_as](http://www.rubydoc.info/gems/pleasant_path/Pathname:move_as)
  - [Pathname#rename_basename](http://www.rubydoc.info/gems/pleasant_path/Pathname:rename_basename)
  - [Pathname#rename_extname](http://www.rubydoc.info/gems/pleasant_path/Pathname:rename_extname)
- - [Pathname#touch_file](http://www.rubydoc.info/gems/pleasant_path/Pathname:touch_file)
  - [ryoba](https://rubygems.org/gems/ryoba)
  ([docs](http://www.rubydoc.info/gems/ryoba/))
  - [Nokogiri::XML::Node#matches!](http://www.rubydoc.info/gems/ryoba/Nokogiri/XML/Node:matches%21)
@@ -154,6 +149,7 @@ for a complete API listing.
  - [Nokogiri::XML::Searchable#at!](http://www.rubydoc.info/gems/ryoba/Nokogiri/XML/Searchable:at%21)
  - [Nokogiri::XML::Searchable#search!](http://www.rubydoc.info/gems/ryoba/Nokogiri/XML/Searchable:search%21)
 
+
  ## Installation
 
  Install from [Ruby Gems](https://rubygems.org/gems/grubby):
@@ -20,9 +20,8 @@ Gem::Specification.new do |spec|
  spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
  spec.require_paths = ["lib"]
 
- spec.add_runtime_dependency "activesupport", "~> 5.0"
+ spec.add_runtime_dependency "activesupport", ">= 5.0"
  spec.add_runtime_dependency "casual_support", "~> 3.0"
- spec.add_runtime_dependency "dumb_delimited", "~> 1.0"
  spec.add_runtime_dependency "gorge", "~> 1.0"
  spec.add_runtime_dependency "mechanize", "~> 2.7"
  spec.add_runtime_dependency "mini_sanity", "~> 1.0"
@@ -1,6 +1,5 @@
  require "active_support/all"
  require "casual_support"
- require "dumb_delimited"
  require "gorge"
  require "mechanize"
  require "mini_sanity"
@@ -32,7 +31,7 @@ class Grubby < Mechanize
  attr_accessor :time_between_requests
 
  # Journal file used to ensure only-once processing of resources by
- # {singleton} across multiple program runs. Set via {initialize}.
+ # {singleton} across multiple program runs.
  #
  # @return [Pathname, nil]
  attr_reader :journal
@@ -68,20 +67,37 @@ class Grubby < Mechanize
  self.pre_connect_hooks << Proc.new{ self.send(:sleep_between_requests) }
  self.time_between_requests = 1.0
 
- @journal = journal.try(&:to_pathname).try(&:touch_file)
- @seen = @journal ? SingletonKey.parse_file(@journal).index_to{ true } : {}
+ self.journal = journal
+ end
+
+ # Sets the journal file used to ensure only-once processing of
+ # resources by {singleton} across multiple program runs. Setting the
+ # journal file will clear the in-memory list of previously-processed
+ # resources, and, if the journal file exists, load the list from file.
+ #
+ # @param path [Pathname, String, nil]
+ # @return [Pathname]
+ def journal=(path)
+ @journal = path&.to_pathname&.touch_file
+ @seen = if @journal
+ require "csv"
+ CSV.read(@journal).map{|row| SingletonKey.new(*row) }.index_to{ true }
+ else
+ {}
+ end
+ @journal
  end
 
  # Calls +#head+ and returns true if the result has response code
  # "200". Unlike +#head+, error response codes (e.g. "404", "500")
  # do not cause a +Mechanize::ResponseCodeError+ to be raised.
  #
- # @param uri [String]
+ # @param uri [URI, String]
  # @return [Boolean]
  def ok?(uri, query_params = {}, headers = {})
  begin
  head(uri, query_params, headers).code == "200"
- rescue Mechanize::ResponseCodeError => e
+ rescue Mechanize::ResponseCodeError
  false
  end
  end
@@ -91,7 +107,21 @@ class Grubby < Mechanize
  # Rescues and logs +Mechanize::ResponseCodeError+ failures for all but
  # the last mirror.
  #
- # @param mirror_uris [Array<String>]
+ # @example
+ # grubby = Grubby.new
+ #
+ # urls = [
+ # "http://httpstat.us/404",
+ # "http://httpstat.us/500",
+ # "http://httpstat.us/200#foo",
+ # "http://httpstat.us/200#bar",
+ # ]
+ #
+ # grubby.get_mirrored(urls).uri # == URI("http://httpstat.us/200#foo")
+ #
+ # grubby.get_mirrored(urls.take(2)) # raise Mechanize::ResponseCodeError
+ #
+ # @param mirror_uris [Array<URI>, Array<String>]
  # @return [Mechanize::Page, Mechanize::File, Mechanize::Download, ...]
  # @raise [Mechanize::ResponseCodeError]
  # if all +mirror_uris+ fail
@@ -111,32 +141,43 @@ class Grubby < Mechanize
  end
  end
 
- # Ensures only-once processing of the resource indicated by +target+
- # for the specified +purpose+. A list of previously-processed
- # resource URIs and content hashes is maintained in the Grubby
- # instance. The given block is called with the fetched resource only
- # if the resource's URI and the resource's content hash have not been
+ # Ensures only-once processing of the resource indicated by +uri+ for
+ # the specified +purpose+. A list of previously-processed resource
+ # URIs and content hashes is maintained in the Grubby instance. The
+ # given block is called with the fetched resource only if the
+ # resource's URI and the resource's content hash have not been
  # previously processed under the specified +purpose+.
  #
- # @param target [URI, String, Mechanize::Page::Link, #to_absolute_uri]
- # designates the resource to fetch
+ # @example
+ # grubby = Grubby.new
+ #
+ # grubby.singleton("https://example.com/foo") do |page|
+ # # will be executed (first time "/foo")
+ # end
+ #
+ # grubby.singleton("https://example.com/foo#bar") do |page|
+ # # will be skipped (already seen "/foo")
+ # end
+ #
+ # grubby.singleton("https://example.com/foo", "again!") do |page|
+ # # will be executed (new purpose for "/foo")
+ # end
+ #
+ # @param uri [URI, String]
  # @param purpose [String]
- # the purpose of processing the resource
  # @yield [resource]
- # processes the resource
  # @yieldparam resource [Mechanize::Page, Mechanize::File, Mechanize::Download, ...]
- # the fetched resource
  # @return [Boolean]
  # whether the given block was called
  # @raise [Mechanize::ResponseCodeError]
  # if fetching the resource results in error (see +Mechanize#get+)
- def singleton(target, purpose = "")
+ def singleton(uri, purpose = "")
  series = []
 
- original_uri = target.to_absolute_uri
- return if try_skip_singleton(original_uri, purpose, series)
+ uri = uri.to_absolute_uri
+ return if try_skip_singleton(uri, purpose, series)
 
- normalized_uri = normalize_uri(original_uri)
+ normalized_uri = normalize_uri(uri)
  return if try_skip_singleton(normalized_uri, purpose, series)
 
  $log.info("Fetch #{normalized_uri}")
@@ -146,7 +187,9 @@ class Grubby < Mechanize
 
  yield resource unless skip
 
- series.append_to_file(@journal) if @journal
+ CSV.open(journal, "a") do |csv|
+ series.each{|singleton_key| csv << singleton_key }
+ end if journal
 
  !skip
  end
@@ -154,7 +197,8 @@ class Grubby < Mechanize
 
  private
 
- SingletonKey = DumbDelimited[:purpose, :target]
+ # @!visibility private
+ SingletonKey = Struct.new(:purpose, :target)
 
  def try_skip_singleton(target, purpose, series)
  series << SingletonKey.new(purpose, target.to_s)
@@ -175,8 +219,8 @@ class Grubby < Mechanize
 
  def sleep_between_requests
  @last_request_at ||= 0.0
- delay_duration = @time_between_requests.is_a?(Range) ?
- rand(@time_between_requests) : @time_between_requests
+ delay_duration = time_between_requests.is_a?(Range) ?
+ rand(time_between_requests) : time_between_requests
  sleep_duration = @last_request_at + delay_duration - Time.now.to_f
  sleep(sleep_duration) if sleep_duration > 0
  @last_request_at = Time.now.to_f
@@ -189,3 +233,6 @@ require_relative "grubby/json_parser"
  require_relative "grubby/scraper"
  require_relative "grubby/page_scraper"
  require_relative "grubby/json_scraper"
+
+
+ $grubby = Grubby.new
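For orientation, here is a small sketch of how the reworked journal and singleton plumbing above might be driven; the URL and journal file name are purely illustrative:

```ruby
require "grubby"

grubby = Grubby.new
grubby.journal = "singleton.csv"  # illustrative path; seen keys persist as CSV across runs

# The block is called only if this URI (and its content hash) has not
# already been recorded in the journal for the default purpose.
ran = grubby.singleton("https://example.com/reports/latest") do |page|
  puts page.title
end

puts ran ? "processed" : "already seen"
```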
@@ -4,7 +4,8 @@ class String
  # does not denote an absolute URI.
  #
  # @return [URI]
- # @raise [RuntimeError] if the String does not denote an absolute URI
+ # @raise [RuntimeError]
+ # if the String does not denote an absolute URI
  def to_absolute_uri
  URI(self).to_absolute_uri
  end
@@ -9,7 +9,7 @@ module URI
  #
  # @return [String]
  def basename
- self.path == "/" ? "" : File.basename(self.path)
+ self.path == "/" ? "" : ::File.basename(self.path)
  end
 
  # Returns the value of the specified param in the URI's +query+.
@@ -21,7 +21,7 @@ module URI
  # occurrence of that param in the query string.
  #
  # @example
- # URI("http://example.com/?foo=a").query_param("foo") # == "a"
+ # URI("http://example.com/?foo=a").query_param("foo") # == "a"
  #
  # URI("http://example.com/?foo=a&foo=b").query_param("foo") # == "b"
  # URI("http://example.com/?foo=a&foo=b").query_param("foo[]") # == nil
@@ -43,7 +43,8 @@ module URI
  # Raises an exception if the URI is not +absolute?+.
  #
  # @return [self]
- # @raise [RuntimeError] if the URI is not +absolute?+
+ # @raise [RuntimeError]
+ # if the URI is not +absolute?+
  def to_absolute_uri
  raise "URI is not absolute: #{self}" unless self.absolute?
  self
@@ -39,7 +39,7 @@ class Grubby::JsonParser < Mechanize::File
  attr_reader :json
 
  def initialize(uri = nil, response = nil, body = nil, code = nil)
- @json = body && JSON.parse(body, self.class.json_parse_options)
+ @json = body.presence && JSON.parse(body, self.class.json_parse_options)
  super
  end
 
@@ -1,6 +1,6 @@
  class Mechanize::Download
 
- # private
+ # @!visibility private
  def content_hash
  @content_hash ||= Digest::SHA1.new.io(self.body_io).hexdigest
  end
@@ -1,6 +1,6 @@
  class Mechanize::File
 
- # private
+ # @!visibility private
  def content_hash
  @content_hash ||= self.body.to_s.sha1
  end
@@ -1,17 +1,21 @@
  class Mechanize::Page
 
  # @!method search!(*queries)
- # See {::Nokogiri::XML::Searchable#search!}.
+ # See Ryoba's +Nokogiri::XML::Searchable#search!+.
  #
  # @param queries [Array<String>]
- # @return [Array<Nokogiri::XML::Element>]
+ # @return [Nokogiri::XML::NodeSet]
+ # @raise [Ryoba::Error]
+ # if all queries yield no results
  def_delegators :parser, :search!
 
  # @!method at!(*queries)
- # See {::Nokogiri::XML::Searchable#at!}.
+ # See Ryoba's +Nokogiri::XML::Searchable#at!+.
  #
  # @param queries [Array<String>]
  # @return [Nokogiri::XML::Element]
+ # @raise [Ryoba::Error]
+ # if all queries yield no results
  def_delegators :parser, :at!
 
  end
@@ -24,7 +24,7 @@ class Grubby::PageScraper < Grubby::Scraper
  # @param path [String]
  # @param agent [Mechanize]
  # @return [Grubby::PageScraper]
- def self.scrape_file(path, agent = Grubby.new)
+ def self.scrape_file(path, agent = $grubby)
  uri = URI.join("file:///", File.expand_path(path))
  body = File.read(path)
  self.new(Mechanize::Page.new(uri, nil, body, "200", agent))
@@ -2,61 +2,200 @@ class Grubby::Scraper
 
  # Defines an attribute reader method named by +field+. During
  # +initialize+, the given block is called, and the attribute is set to
- # the block's return value. By default, if the block's return value
- # is nil, an exception will be raised. To prevent this behavior, set
- # +optional+ to true.
+ # the block's return value.
+ #
+ # By default, if the block's return value is nil, an exception will be
+ # raised. To prevent this behavior, specify +optional: true+.
+ #
+ # The block may also be evaluated conditionally, based on another
+ # method's return value, using the +:if+ or +:unless+ options.
+ #
+ # @example
+ # class GreetingScraper < Grubby::Scraper
+ # scrapes(:salutation) do
+ # source[/\A(hello|good morning)\b/i]
+ # end
+ #
+ # scrapes(:recipient, optional: true) do
+ # source[/\A#{salutation} ([a-z ]+)/i, 1]
+ # end
+ # end
+ #
+ # scraper = GreetingScraper.new("Hello World!")
+ # scraper.salutation # == "Hello"
+ # scraper.recipient # == "World"
+ #
+ # scraper = GreetingScraper.new("Good morning!")
+ # scraper.salutation # == "Good morning"
+ # scraper.recipient # == nil
+ #
+ # scraper = GreetingScraper.new("Hey!") # raises Grubby::Scraper::Error
+ #
+ # @example
+ # class EmbeddedUrlScraper < Grubby::Scraper
+ # scrapes(:url, optional: true){ source[%r"\bhttps?://\S+"] }
+ #
+ # scrapes(:domain, if: :url){ url[%r"://([^/]+)/", 1] }
+ # end
+ #
+ # scraper = EmbeddedUrlScraper.new("visit https://example.com/foo for details")
+ # scraper.url # == "https://example.com/foo"
+ # scraper.domain # == "example.com"
+ #
+ # scraper = EmbeddedUrlScraper.new("visit our website for details")
+ # scraper.url # == nil
+ # scraper.domain # == nil
  #
  # @param field [Symbol, String]
- # name of the scraped value
- # @param optional [Boolean]
- # whether to permit a nil scraped value
+ # @param options [Hash]
+ # @option options :optional [Boolean]
+ # @option options :if [Symbol]
+ # @option options :unless [Symbol]
  # @yield []
- # scrapes the value
  # @yieldreturn [Object]
- # scraped value
- def self.scrapes(field, optional: false, &block)
+ # @return [void]
+ def self.scrapes(field, **options, &block)
  field = field.to_sym
  self.fields << field
 
  define_method(field) do
  raise "#{self.class}#initialize does not invoke `super`" unless defined?(@scraped)
- return @scraped[field] if @scraped.key?(field)
 
- unless @errors[field]
+ if !@scraped.key?(field) && !@errors.key?(field)
  begin
- value = instance_eval(&block)
- if value.nil?
- raise FieldValueRequiredError.new(field) unless optional
- $log.debug("#{self.class}##{field} is nil")
+ skip = (options[:if] && !self.send(options[:if])) ||
+ (options[:unless] && self.send(options[:unless]))
+
+ if skip
+ @scraped[field] = nil
+ else
+ @scraped[field] = instance_eval(&block)
+ if @scraped[field].nil?
+ raise FieldValueRequiredError.new(field) unless options[:optional]
+ $log.debug("#{self.class}##{field} is nil")
+ end
  end
- @scraped[field] = value
  rescue RuntimeError, IndexError => e
  @errors[field] = e
  end
  end
 
- raise FieldScrapeFailedError.new(field, @errors[field]) if @errors[field]
-
- @scraped[field]
+ if @errors.key?(field)
+ raise FieldScrapeFailedError.new(field, @errors[field])
+ else
+ @scraped[field]
+ end
  end
  end
 
- # The names of all scraped values, as defined by {scrapes}.
+ # Fields defined by {scrapes}.
  #
  # @return [Array<Symbol>]
  def self.fields
- @fields ||= []
+ @fields ||= self == Grubby::Scraper ? [] : self.superclass.fields.dup
+ end
+
+ # Instantiates the Scraper class with the resource specified by +url+.
+ # This method acts as a default factory method, and provides a
+ # standard interface for specialized overrides.
+ #
+ # @example Default factory method
+ # class PostPageScraper < Grubby::PageScraper
+ # # ...
+ # end
+ #
+ # PostPageScraper.scrape("https://example.com/posts/42")
+ # # == PostPageScraper.new($grubby.get("https://example.com/posts/42"))
+ #
+ # @example Specialized factory method
+ # class PostApiScraper < Grubby::JsonScraper
+ # # ...
+ #
+ # def self.scrapes(url, agent = $grubby)
+ # api_url = url.sub(%r"//example.com/(.+)", '//api.example.com/\1.json')
+ # super(api_url, agent)
+ # end
+ # end
+ #
+ # PostApiScraper.scrape("https://example.com/posts/42")
+ # # == PostApiScraper.new($grubby.get("https://api.example.com/posts/42.json"))
+ #
+ # @param url [String, URI]
+ # @param agent [Mechanize]
+ # @return [Grubby::Scraper]
+ def self.scrape(url, agent = $grubby)
+ self.new(agent.get(url))
+ end
+
+ # Iterates a series of pages, starting at +start_url+. For each page,
+ # the Scraper class is instantiated and passed to the given block.
+ # Subsequent pages in the series are determined by invoking
+ # +next_method+ on each previous scraper instance.
+ #
+ # Iteration stops when the +next_method+ method returns nil. If the
+ # +next_method+ method returns a String or URI, that value will be
+ # treated as the URL of the next page. Otherwise that value will be
+ # treated as the page itself.
+ #
+ # @example
+ # class PostsIndexScraper < Grubby::PageScraper
+ # scrapes(:page_param){ page.uri.query_param("page") }
+ #
+ # def next
+ # page.link_with(text: "Next >")&.click
+ # end
+ # end
+ #
+ # PostsIndexScraper.each("https://example.com/posts?page=1") do |scraper|
+ # scraper.page_param # == "1", "2", "3", ...
+ # end
+ #
+ # @example
+ # class PostsIndexScraper < Grubby::PageScraper
+ # scrapes(:page_param){ page.uri.query_param("page") }
+ #
+ # scrapes(:next_uri, optional: true) do
+ # page.link_with(text: "Next >")&.to_absolute_uri
+ # end
+ # end
+ #
+ # PostsIndexScraper.each("https://example.com/posts?page=1", next_method: :next_uri) do |scraper|
+ # scraper.page_param # == "1", "2", "3", ...
+ # end
+ #
+ # @param start_url [String, URI]
+ # @param agent [Mechanize]
+ # @param next_method [Symbol]
+ # @yield [scraper]
+ # @yieldparam scraper [Grubby::Scraper]
+ # @return [void]
+ # @raise [NoMethodError]
+ # if Scraper class does not implement +next_method+
+ def self.each(start_url, agent = $grubby, next_method: :next)
+ unless self.method_defined?(next_method)
+ raise NoMethodError.new(nil, next_method), "#{self} does not define `#{next_method}`"
+ end
+
+ return to_enum(:each, start_url, agent, next_method: next_method) unless block_given?
+
+ current = start_url
+ while current
+ current = agent.get(current) if current.is_a?(String) || current.is_a?(URI)
+ scraper = self.new(current)
+ yield scraper
+ current = scraper.send(next_method)
+ end
  end
 
- # The source being scraped. Typically a Mechanize pluggable parser
+ # The object being scraped. Typically a Mechanize pluggable parser
  # such as +Mechanize::Page+.
  #
  # @return [Object]
  attr_reader :source
 
- # Hash of errors raised by blocks passed to {scrapes}. If
- # {initialize} does not raise +Grubby::Scraper::Error+, this Hash will
- # be empty.
+ # Collected errors raised during {initialize} by blocks passed to
+ # {scrapes}, indexed by field name. If {initialize} did not raise
+ # +Grubby::Scraper::Error+, this Hash will be empty.
  #
  # @return [Hash<Symbol, StandardError>]
  attr_reader :errors
@@ -123,6 +262,7 @@ class Grubby::Scraper
  end
  end
 
+ # @!visibility private
  class FieldScrapeFailedError < RuntimeError
  def initialize(field, field_error)
  super("`#{field}` raised #{field_error.class}")
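One consequence of the `fields` change above (the "Fix fail-fast behavior of inherited scraper fields" changelog entry) is that a subclass now starts from a copy of its parent's field list, so inherited fields are evaluated along with its own. A minimal sketch, with made-up class names and selectors:

```ruby
class ArticleScraper < Grubby::PageScraper
  scrapes(:title){ page.at!("h1").text }
end

class DatedArticleScraper < ArticleScraper
  scrapes(:published_on){ page.at!("time").text }
end

DatedArticleScraper.fields  # == [:title, :published_on] -- parent fields plus its own
ArticleScraper.fields       # == [:title] -- unaffected by the subclass
```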
@@ -1 +1 @@
- GRUBBY_VERSION = "1.1.0"
+ GRUBBY_VERSION = "1.2.0"
metadata CHANGED
@@ -1,27 +1,27 @@
  --- !ruby/object:Gem::Specification
  name: grubby
  version: !ruby/object:Gem::Version
- version: 1.1.0
+ version: 1.2.0
  platform: ruby
  authors:
  - Jonathan Hefner
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2018-07-27 00:00:00.000000000 Z
+ date: 2019-07-06 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: activesupport
  requirement: !ruby/object:Gem::Requirement
  requirements:
- - - "~>"
+ - - ">="
  - !ruby/object:Gem::Version
  version: '5.0'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
- - - "~>"
+ - - ">="
  - !ruby/object:Gem::Version
  version: '5.0'
  - !ruby/object:Gem::Dependency
@@ -38,20 +38,6 @@ dependencies:
  - - "~>"
  - !ruby/object:Gem::Version
  version: '3.0'
- - !ruby/object:Gem::Dependency
- name: dumb_delimited
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: '1.0'
- type: :runtime
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: '1.0'
  - !ruby/object:Gem::Dependency
  name: gorge
  requirement: !ruby/object:Gem::Requirement
@@ -227,8 +213,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  - !ruby/object:Gem::Version
  version: '0'
  requirements: []
- rubyforge_project:
- rubygems_version: 2.7.6
+ rubygems_version: 3.0.1
  signing_key:
  specification_version: 4
  summary: Fail-fast web scraping