scraper_utils 0.1.0 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: db12d36e0d3be635eba2c00dbe149f4d177ddc5e538a08fcd9038a026feaee91
4
- data.tar.gz: 8d2f140b7fff7e02d90df19ac196018f8719cd73d85067519ee3e931f679f619
3
+ metadata.gz: b4291b6994419c04851935fe4aa4e047eb4069cab3fecf451bf65f8e91acb48d
4
+ data.tar.gz: 2e3a657ce230f9c6bc9defe042cf7babb9e52e2130d32f0ec8312571f5dcb26a
5
5
  SHA512:
6
- metadata.gz: 7138204493653a872aafcf4a1f8b78d8d5129c70d79a54d6ca10aa1440fc60362edc270522cc0d66c13a7694527a502d75d3dec36cb21f2240fceea85367eec4
7
- data.tar.gz: 83dffaedd054ed40c7a269c4fd3db270892bc0fa20c4b7d1a904a075cb990bee51004ccb9c0cb86840d87a631655207301b8af1f7a5572f0893d9317a1b90aa5
6
+ metadata.gz: 51c29aea77f43a8c7de8e874a2601b4e0b9e9c36ae512180f10dcd182b2ecc899cc08944face686e0f993b02338975672d9eaf06d8a0185ee222cfc263993244
7
+ data.tar.gz: ec63009a4f10677a8e9500b5ca15c68b432081fa995d5bee2eda2b2cff88cdb7090595be7ad64a0dfeacf09decbc292313b47f9f8af795a3b95c718c77f59339
data/.rubocop.yml CHANGED
@@ -1,12 +1,5 @@
1
1
  AllCops:
2
- Exclude:
3
- - bin/*
4
- # This is a temporary dumping ground for authority specific
5
- # code that we're probably just initially copying across from
6
- # other scrapers. So, we don't care about the formatting and style
7
- # initially.
8
- # TODO: Remove this once we've removed all the code from here
9
- - lib/technology_one_scraper/authority/*
2
+ NewCops: enable
10
3
 
11
4
  # Bumping max line length to something a little more reasonable
12
5
  Layout/LineLength:
data/CHANGELOG.md ADDED
@@ -0,0 +1,5 @@
1
+ # Changelog
2
+
3
+ ## 0.1.0 - 2025-02-23
4
+
5
+ First release for development
data/GUIDELINES.md ADDED
@@ -0,0 +1,75 @@
1
+ # Project-Specific Guidelines
2
+
3
+ These project-specific guidelines supplement the general project instructions.
4
+ Ask for clarification of any apparent conflicts with SPECS, IMPLEMENTATION or project instructions.
5
+
6
+ ## Error Handling Approaches
7
+
8
+ Process each authority's site in isolation - problems with one authority are irrelevant to the others.
9
+
10
+ * we make a 2nd attempt for authorities with the same proxy settings
11
+ * and a 3rd attempt, with the proxy disabled, for those that failed when using the proxy
12
+
13
+ Within an attempt, distinguish between:
14
+
15
+ * Errors that are specific to that record
16
+ * only allow 5 such errors plus 10% of successfully processed records (see the sketch below)
17
+ * these could be regarded as not worth retrying (which is what we currently do)
18
+ * Any other exceptions stop the processing of that authority's site
19
+
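+ A minimal sketch of that 5 + 10% allowance (illustration only - the counter below is not the gem's implementation; `records`, `unprocessable?` and `save` are hypothetical):
+
+ ```ruby
+ # Start 5 "in credit"; each success earns another 10% of an extra error allowance.
+ balance = -5.0
+ records.each do |record|
+   if unprocessable?(record) # hypothetical predicate
+     balance += 1
+     raise "Too many unprocessable records" if balance.positive?
+   else
+     balance -= 0.1
+     save(record) # hypothetical save
+   end
+ end
+ ```
+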
20
+ ### Fail Fast on deeper calls
21
+
22
+ - Raise exceptions early (when they are clearly detectable)
23
+ - Validate input according to the scraper specs
24
+
25
+ ### Be forgiving on things that don't matter
26
+
27
+ - Not all sites have a robots.txt file, and not all robots.txt files are well formatted; stop processing the file on obvious conflicts with the specs,
28
+ but if the file is bad, just treat it as missing.
29
+
30
+ - don't fuss over things we are not going to record.
31
+
32
+ - we do detect maintenance pages because that is a helpful and simple clue that we won't find the data, and we can just wait until the site is back online
33
+
34
+ ## Type Checking
35
+
36
+ ### Partial Duck Typing
37
+ - Focus on behavior over types internally (`.rubocop.yml` should disable requiring everything to be typed)
38
+ - Runtime validation of values
39
+ - Document the public API, though
40
+ - Use `@param` and `@return` comments to document types for external uses of the public methods (RubyMine will use these for checking); see the sketch below
41
+
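+ A minimal sketch of such documentation (the method, parameters and types shown are hypothetical):
+
+ ```ruby
+ # Fetches the list of development application records for an authority.
+ #
+ # @param authority [Symbol] the authority key from AUTHORITIES
+ # @param agent [Mechanize] a configured Mechanize agent
+ # @return [Array<Hash>] one hash per development application record
+ def fetch_records(authority, agent)
+   # ... behaviour-focused implementation, validated at runtime ...
+ end
+ ```
+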
42
+ ## Input Validation
43
+
44
+ ### Early Validation
45
+ - Check all inputs at system boundaries
46
+ - Fail on any invalid input (see the sketch below)
47
+
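+ A minimal sketch of such boundary validation (the field checked and the method name are hypothetical):
+
+ ```ruby
+ # Validate a scraped record at the boundary, before it is saved.
+ def validate_record!(record)
+   if record["council_reference"].to_s.strip.empty?
+     raise ScraperUtils::UnprocessableRecord, "Missing council_reference"
+   end
+ end
+ ```
+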
48
+ ## Testing Strategies
49
+
50
+ * Avoid mocking unless really needed; instead:
51
+ * instantiate a real object to use in the test
52
+ * use mocking facilities provided by the gem (e.g. Mechanize, Aws, etc.)
53
+ * use integration tests with WebMock for simple external sites, or VCR for more complex ones (see the sketch below)
54
+ * Testing the integration all the way through is just as important as the specific algorithms
55
+ * Consider using single-responsibility classes / methods to make testing simpler, but don't make things more complex just to be testable
56
+ * If necessary, expose internal values as read-only attributes for testing,
57
+ for example adding a read-only attribute to Mechanize agent instances holding values calculated internally
58
+
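+ A minimal sketch of a WebMock-based integration test (the URL and page body are hypothetical; a plain Mechanize agent keeps the example self-contained):
+
+ ```ruby
+ require "mechanize"
+ require "webmock/rspec"
+
+ RSpec.describe "application list page" do
+   it "fetches and parses the stubbed page" do
+     # Stub the external site so the test never hits the network
+     stub_request(:get, "https://example.com/planning-apps")
+       .to_return(status: 200,
+                  body: "<html><body><table class='results-table'></table></body></html>",
+                  headers: { "Content-Type" => "text/html" })
+
+     page = Mechanize.new.get("https://example.com/planning-apps")
+     expect(page.at(".results-table")).not_to be_nil
+   end
+ end
+ ```
+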
59
+ ### Behavior-Driven Development (BDD)
60
+ - Focus on behavior specifications
61
+ - User-centric scenarios
62
+ - Best for: User-facing features
63
+
64
+ ## Documentation Approaches
65
+
66
+ ### Just-Enough Documentation
67
+ - Focus on key decisions
68
+ - Document non-obvious choices
69
+ - Best for: Rapid development, internal tools
70
+
71
+ ## Logging Philosophy
72
+
73
+ ### Minimal Logging
74
+ - Log only key events (down to the level of adding a record)
75
+ - Focus on errors
data/Gemfile CHANGED
@@ -27,7 +27,7 @@ gem "rake", platform && (platform == :heroku16 ? "~> 12.3.3" : "~> 13.0")
27
27
  gem "rspec", platform && (platform == :heroku16 ? "~> 3.9.0" : "~> 3.12")
28
28
  gem "rubocop", platform && (platform == :heroku16 ? "~> 0.80.0" : "~> 1.57")
29
29
  gem "simplecov", platform && (platform == :heroku16 ? "~> 0.18.0" : "~> 0.22.0")
30
- # gem "simplecov-console" listed in gemspec
30
+ gem "simplecov-console"
31
31
  gem "webmock", platform && (platform == :heroku16 ? "~> 3.14.0" : "~> 3.19.0")
32
32
 
33
33
  gemspec
data/IMPLEMENTATION.md ADDED
@@ -0,0 +1,33 @@
1
+ IMPLEMENTATION
2
+ ==============
3
+
4
+ Document decisions on how we are implementing the specs, to keep things consistent and save time.
5
+ Things we MUST do go in SPECS.
6
+ Choices between a number of valid possibilities go here.
7
+ Once made, these choices should only be changed after careful consideration.
8
+
9
+ ASK for clarification of any apparent conflicts with SPECS, GUIDELINES or project instructions.
10
+
11
+ ## Debugging
12
+
13
+ Output debugging messages if ENV['DEBUG'] is set, for example:
14
+
15
+ ```ruby
16
+ puts "Pre Connect request: #{request.inspect}" if ENV["DEBUG"]
17
+ ```
18
+
19
+ ## Robots.txt Handling
20
+
21
+ - Used as a "good citizen" mechanism for respecting site preferences
22
+ - Graceful fallback (to permitted) if robots.txt is unavailable or invalid
23
+ - Match `/^User-agent:\s*ScraperUtils/i` for specific user agent
24
+ - If there is a line matching `/^Disallow:\s*\//` then we are disallowed
25
+ - Check for `/^Crawl-delay:\s*(\d[.0-9]*)/` to extract delay
26
+ - If no crawl-delay is found in that section, then check in the default `/^User-agent:\s*\*/` section
27
+ - This is a deliberate, significant simplification of the robots.txt specification in RFC 9309 (sketched below).
28
+
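+ A minimal sketch of these rules (illustration only - not the gem's actual implementation; the method names are hypothetical):
+
+ ```ruby
+ # Extract the crawl-delay value from a robots.txt section, if any
+ def crawl_delay_in(section)
+   section && section[/^Crawl-delay:\s*(\d[.0-9]*)/, 1]
+ end
+
+ # Returns [allowed?, crawl_delay] using the simplified rules above
+ def parse_robots_txt(content)
+   sections = content.split(/^(?=User-agent:)/i)
+   ours     = sections.find { |s| s.match?(/^User-agent:\s*ScraperUtils/i) }
+   default  = sections.find { |s| s.match?(/^User-agent:\s*\*/) }
+
+   return [false, nil] if ours&.match?(/^Disallow:\s*\//)
+
+   delay = crawl_delay_in(ours) || crawl_delay_in(default)
+   [true, delay&.to_f]
+ end
+ ```
+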
29
+ ## Method Organization
30
+
31
+ - Externalize configuration to improve testability
32
+ - Keep shared logic in the main class
33
+ - Decisions / information specific to just one class can be documented there; otherwise it belongs here
data/README.md CHANGED
@@ -1,11 +1,43 @@
1
1
  ScraperUtils (Ruby)
2
2
  ===================
3
3
 
4
- Utilities to help make planningalerts scrapers, especially multis easier to develop, run and debug.
4
+ Utilities to help make planningalerts scrapers, especially multis, easier to develop, run and debug.
5
5
 
6
- WARNING: This is still under development! Breaking changes may occur in version 0!
6
+ WARNING: This is still under development! Breaking changes may occur in version 0.x!
7
7
 
8
- ## Installation
8
+ For Server Administrators
9
+ -------------------------
10
+
11
+ The ScraperUtils library is designed to be a respectful citizen of the web. If you're a server administrator and notice
12
+ our scraper accessing your systems, here's what you should know:
13
+
14
+ ### How to Control Our Behavior
15
+
16
+ Our scraper utilities respect the standard server **robots.txt** control mechanisms (by default). To control our access:
17
+
18
+ - Add a section for our user agent: `User-agent: ScraperUtils` (default)
19
+ - Set a crawl delay: `Crawl-delay: 5`
20
+ - If needed specify disallowed paths: `Disallow: /private/`
21
+
22
+ ### Built-in Politeness Features
23
+
24
+ Even without specific configuration, our scrapers will, by default:
25
+
26
+ - **Identify themselves**: Our user agent clearly indicates who we are and provides a link to the project repository:
27
+ `Mozilla/5.0 (compatible; ScraperUtils/0.2.0 2025-02-22; +https://github.com/ianheggie-oaf/scraper_utils)`
28
+
29
+ - **Limit server load**: We introduce delays, by default based on your response time, to avoid undue load on your
30
+ server.
31
+ The slower your server is running, the longer the delay we add between requests to help you.
32
+ In the default "compliant mode" the maximum load defaults to 20%, and custom settings are capped at 33%.
33
+
34
+ - **Add randomized delays**: We add random delays between requests to avoid creating regular traffic patterns that might
35
+ impact server performance (enabled by default).
36
+
37
+ Our goal is to access public planning information without negatively impacting your services.
38
+
39
+ Installation
40
+ ------------
9
41
 
10
42
  Add these lines to your application's Gemfile:
11
43
 
@@ -22,20 +54,85 @@ Or install it yourself for testing:
22
54
 
23
55
  $ gem install scraper_utils
24
56
 
25
- ## Usage
57
+ Usage
58
+ -----
59
+
60
+ ### Ruby Versions
61
+
62
+ This gem is designed to be compatible with the latest ruby supported by morph.io - other versions may work, but are not tested:
63
+
64
+ * ruby 3.2.2 - requires the `platform` file to contain `heroku_18` in the scraper
65
+ * ruby 2.5.8 - `heroku_16` (the default)
26
66
 
27
67
  ### Environment variables
28
68
 
29
- Optionally filter authorities via environment variable in morph > scraper > settings or
69
+ #### `MORPH_AUSTRALIAN_PROXY`
70
+
71
+ On morph.io set the environment variable `MORPH_AUSTRALIAN_PROXY` to
72
+ `http://morph:password@au.proxy.oaf.org.au:8888`
73
+ replacing password with the real password.
74
+ Alternatively enter your own AUSTRALIAN proxy details when testing.
75
+
76
+ #### `MORPH_EXPECT_BAD`
77
+
78
+ To avoid morph complaining about sites that are known to be bad,
79
+ but you want them to keep being tested, list them on `MORPH_EXPECT_BAD`, for example:
80
+
81
+ #### `MORPH_AUTHORITIES`
82
+
83
+ Optionally filter authorities for multi authority scrapers
84
+ via environment variable in morph > scraper > settings or
30
85
  in your dev environment:
31
86
 
32
87
  ```bash
33
88
  export MORPH_AUTHORITIES=noosa,wagga
34
89
  ```
35
90
 
91
+ #### `DEBUG`
92
+
93
+ Optionally enable verbose debugging messages when developing:
94
+
95
+ ```bash
96
+ export DEBUG=1
97
+ ```
98
+
99
+ ### Extra Mechanize options
100
+
101
+ Add `client_options` to your AUTHORITIES configuration and move any of the following settings into it (sketched below):
102
+
103
+ * `timeout: Integer` - Timeout for agent connections in case the server is slower than normal
104
+ * `australian_proxy: true` - Use the MORPH_AUSTRALIAN_PROXY as proxy url if the site is geo-locked
105
+ * `disable_ssl_certificate_check: true` - Disable SSL verification for old / incorrect certificates
106
+
107
+ See the documentation on `ScraperUtils::MechanizeUtils::AgentConfig` for more options
108
+
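+ A minimal sketch of such an `AUTHORITIES` entry (the `url` and `period` values are placeholders; `wagga` is just an example key):
+
+ ```ruby
+ AUTHORITIES = {
+   wagga: {
+     url: "https://example.com/planning-apps", # placeholder URL
+     period: "thismonth",                      # placeholder period
+     client_options: {
+       timeout: 30,
+       australian_proxy: true
+     }
+   }
+ }.freeze
+ ```
+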
109
+ Then adjust your code to accept `client_options` and pass them through to:
110
+ `ScraperUtils::MechanizeUtils.mechanize_agent(client_options || {})`
111
+ to receive a `Mechanize::Agent` configured accordingly.
112
+
113
+ The agent returned is configured using Mechanize hooks to implement the desired delays automatically.
114
+
115
+ ### Default Configuration
116
+
117
+ By default, the Mechanize agent is configured with the following settings.
118
+
119
+ ```ruby
120
+ ScraperUtils::MechanizeUtils::AgentConfig.configure do |config|
121
+ config.default_timeout = 60
122
+ config.default_compliant_mode = true
123
+ config.default_random_delay = 3
124
+ config.default_max_load = 20 # percentage
125
+ config.default_disable_ssl_certificate_check = false
126
+ config.default_australian_proxy = false
127
+ end
128
+ ```
129
+
130
+ You can modify these global defaults before creating any Mechanize agents. These settings will be used for all Mechanize
131
+ agents created by `ScraperUtils::MechanizeUtils.mechanize_agent` unless overridden by passing parameters to that method.
132
+
36
133
  ### Example updated `scraper.rb` file
37
134
 
38
- Update your `scraper.rb` as per the following example:
135
+ Update your `scraper.rb` as per the following example for basic utilities:
39
136
 
40
137
  ```ruby
41
138
  #!/usr/bin/env ruby
@@ -48,55 +145,42 @@ require "technology_one_scraper"
48
145
 
49
146
  # Main Scraper class
50
147
  class Scraper
51
- AUTHORITIES = TechnologyOneScraper::AUTHORITIES
148
+ AUTHORITIES = YourScraper::AUTHORITIES
52
149
 
53
- def self.scrape(authorities, attempt)
54
- results = {}
150
+ # ADD: attempt argument
151
+ def self.scrape(authorities, attempt)
152
+ exceptions = {}
153
+ # ADD: Report attempt number
55
154
  authorities.each do |authority_label|
56
- these_results = results[authority_label] = {}
155
+ puts "\nCollecting feed data for #{authority_label}, attempt: #{attempt}..."
156
+
57
157
  begin
58
- records_scraped = 0
59
- unprocessable_records = 0
60
- # Allow 5 + 10% unprocessable records
61
- too_many_unprocessable = -5.0
62
- use_proxy = AUTHORITIES[authority_label][:australian_proxy] && ScraperUtils.australian_proxy
63
- next if attempt > 2 && !use_proxy
64
-
65
- puts "",
66
- "Collecting feed data for #{authority_label}, attempt: #{attempt}" \
67
- "#{use_proxy ? ' (via proxy)' : ''} ..."
68
- # Change scrape to accept a use_proxy flag and return an unprocessable flag
69
- # it should rescue ScraperUtils::UnprocessableRecord thrown deeper in the scraping code and
70
- # set unprocessable
71
- TechnologyOneScraper.scrape(use_proxy, authority_label) do |record, unprocessable|
72
- unless unprocessable
73
- begin
74
- record["authority_label"] = authority_label.to_s
75
- ScraperUtils::DbUtils.save_record(record)
76
- rescue ScraperUtils::UnprocessableRecord => e
77
- # validation error
78
- unprocessable = true
79
- these_results[:error] = e
80
- end
81
- end
82
- if unprocessable
83
- unprocessable_records += 1
84
- these_results[:unprocessable_records] = unprocessable_records
85
- too_many_unprocessable += 1
86
- raise "Too many unprocessable records" if too_many_unprocessable.positive?
87
- else
88
- records_scraped += 1
89
- these_results[:records_scraped] = records_scraped
90
- too_many_unprocessable -= 0.1
158
+ # REPLACE:
159
+ # YourScraper.scrape(authority_label) do |record|
160
+ # record["authority_label"] = authority_label.to_s
161
+ # YourScraper.log(record)
162
+ # ScraperWiki.save_sqlite(%w[authority_label council_reference], record)
163
+ # end
164
+ # WITH:
165
+ ScraperUtils::DataQualityMonitor.start_authority(authority_label)
166
+ YourScraper.scrape(authority_label) do |record|
167
+ begin
168
+ record["authority_label"] = authority_label.to_s
169
+ ScraperUtils::DbUtils.save_record(record)
170
+ rescue ScraperUtils::UnprocessableRecord => e
171
+ ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)
172
+ exceptions[authority_label] = e
91
173
  end
92
174
  end
93
- rescue StandardError => e
94
- warn "#{authority_label}: ERROR: #{e}"
95
- warn e.backtrace || "No backtrace available"
96
- these_results[:error] = e
175
+ # END OF REPLACE
97
176
  end
177
+ rescue StandardError => e
178
+ warn "#{authority_label}: ERROR: #{e}"
179
+ warn e.backtrace
180
+ exceptions[authority_label] = e
98
181
  end
99
- results
182
+
183
+ exceptions
100
184
  end
101
185
 
102
186
  def self.selected_authorities
@@ -106,123 +190,79 @@ class Scraper
106
190
  def self.run(authorities)
107
191
  puts "Scraping authorities: #{authorities.join(', ')}"
108
192
  start_time = Time.now
109
- results = scrape(authorities, 1)
193
+ exceptions = scrape(authorities, 1)
194
+ # Pass the start_time and attempt used for the call above when logging the run below
110
195
  ScraperUtils::LogUtils.log_scraping_run(
111
196
  start_time,
112
197
  1,
113
198
  authorities,
114
- results
199
+ exceptions
115
200
  )
116
201
 
117
- retry_errors = results.select do |_auth, result|
118
- result[:error] && !result[:error].is_a?(ScraperUtils::UnprocessableRecord)
119
- end.keys
120
-
121
- unless retry_errors.empty?
122
- puts "",
123
- "***************************************************"
202
+ unless exceptions.empty?
203
+ puts "\n***************************************************"
124
204
  puts "Now retrying authorities which earlier had failures"
125
- puts retry_errors.join(", ").to_s
205
+ puts exceptions.keys.join(", ").to_s
126
206
  puts "***************************************************"
127
207
 
128
- start_retry = Time.now
129
- retry_results = scrape(retry_errors, 2)
208
+ start_time = Time.now
209
+ exceptions = scrape(exceptions.keys, 2)
210
+ # Set start_time and attempt to the call above and log run below
130
211
  ScraperUtils::LogUtils.log_scraping_run(
131
- start_retry,
212
+ start_time,
132
213
  2,
133
- retry_errors,
134
- retry_results
214
+ authorities,
215
+ exceptions
135
216
  )
136
-
137
- retry_results.each do |auth, result|
138
- unless result[:error] && !result[:error].is_a?(ScraperUtils::UnprocessableRecord)
139
- results[auth] = result
140
- end
141
- end.keys
142
- retry_no_proxy = retry_results.select do |_auth, result|
143
- result[:used_proxy] && result[:error] &&
144
- !result[:error].is_a?(ScraperUtils::UnprocessableRecord)
145
- end.keys
146
-
147
- unless retry_no_proxy.empty?
148
- puts "",
149
- "*****************************************************************"
150
- puts "Now retrying authorities which earlier had failures without proxy"
151
- puts retry_no_proxy.join(", ").to_s
152
- puts "*****************************************************************"
153
-
154
- start_retry = Time.now
155
- second_retry_results = scrape(retry_no_proxy, 3)
156
- ScraperUtils::LogUtils.log_scraping_run(
157
- start_retry,
158
- 3,
159
- retry_no_proxy,
160
- second_retry_results
161
- )
162
- second_retry_results.each do |auth, result|
163
- unless result[:error] && !result[:error].is_a?(ScraperUtils::UnprocessableRecord)
164
- results[auth] = result
165
- end
166
- end.keys
167
- end
168
217
  end
169
218
 
170
219
  # Report on results, raising errors for unexpected conditions
171
- ScraperUtils::LogUtils.report_on_results(authorities, results)
220
+ ScraperUtils::LogUtils.report_on_results(authorities, exceptions)
172
221
  end
173
222
  end
174
223
 
175
224
  if __FILE__ == $PROGRAM_NAME
176
225
  # Default to list of authorities we can't or won't fix in code, explain why
177
- # wagga: url redirects and reports Application error, main site says to use NSW Planning Portal from 1 July 2021
178
- # which doesn't list any DA's for wagga wagga!
226
+ # wagga: url redirects and then reports Application error
179
227
 
180
228
  ENV["MORPH_EXPECT_BAD"] ||= "wagga"
181
229
  Scraper.run(Scraper.selected_authorities)
182
230
  end
183
231
  ```
184
232
 
185
- Then deeper in your code update:
233
+ Your code should raise ScraperUtils::UnprocessableRecord when there is a problem with the data presented on a page for a
234
+ record.
235
+ Then just before you would normally yield a record for saving, rescue that exception and:
236
+
237
+ * Call ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)
238
+ * Do NOT yield the record for saving
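+
+ A minimal sketch of that pattern (`check_record!` is a hypothetical validation helper that raises `ScraperUtils::UnprocessableRecord`):
+
+ ```ruby
+ begin
+   check_record!(record) # hypothetical validation of the scraped data
+   yield record
+ rescue ScraperUtils::UnprocessableRecord => e
+   ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)
+   # deliberately do NOT yield the record for saving
+ end
+ ```
+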
186
239
 
187
- * Change scrape to accept a `use_proxy` flag and return an `unprocessable` flag
188
- * it should rescue ScraperUtils::UnprocessableRecord thrown deeper in the scraping code and
189
- set and yield unprocessable eg: `TechnologyOneScraper.scrape(use_proxy, authority_label) do |record, unprocessable|`
240
+ In your code, update where you create a mechanize agent (often `YourScraper.scrape_period`) and the `AUTHORITIES` hash
241
+ to move mechanize_agent options (like `australian_proxy` and `timeout`) to a hash under a new key: `client_options`.
242
+ For example:
190
243
 
191
244
  ```ruby
192
245
  require "scraper_utils"
193
246
  #...
194
- module TechnologyOneScraper
195
- # Note the extra parameter: use_proxy
196
- def self.scrape(use_proxy, authority)
197
- raise "Unexpected authority: #{authority}" unless AUTHORITIES.key?(authority)
247
+ module YourScraper
248
+ # ... some code ...
198
249
 
199
- scrape_period(use_proxy, AUTHORITIES[authority]) do |record, unprocessable|
200
- yield record, unprocessable
201
- end
202
- end
203
-
204
- # ... rest of code ...
205
-
206
- # Note the extra parameters: use_proxy and timeout
207
- def self.scrape_period(use_proxy,
208
- url:, period:, webguest: "P1.WEBGUEST", disable_ssl_certificate_check: false,
209
- australian_proxy: false, timeout: nil
250
+ # Note the extra parameter: client_options
251
+ def self.scrape_period(url:, period:, webguest: "P1.WEBGUEST",
252
+ client_options: {}
210
253
  )
211
- agent = ScraperUtils::MechanizeUtils.mechanize_agent(use_proxy: use_proxy, timeout: timeout)
212
- agent.verify_mode = OpenSSL::SSL::VERIFY_NONE if disable_ssl_certificate_check
213
-
214
- # ... rest of code ...
215
-
216
- # Update yield to return unprocessable as well as record
254
+ agent = ScraperUtils::MechanizeUtils.mechanize_agent(**client_options)
217
255
 
256
+ # ... rest of code ...
218
257
  end
258
+
219
259
  # ... rest of code ...
220
260
  end
221
261
  ```
222
262
 
223
263
  ### Debugging Techniques
224
264
 
225
- The following code will print dbugging info if you set:
265
+ The following code will cause debugging info to be output:
226
266
 
227
267
  ```bash
228
268
  export DEBUG=1
@@ -235,8 +275,8 @@ require 'scraper_utils'
235
275
 
236
276
  # Debug an HTTP request
237
277
  ScraperUtils::DebugUtils.debug_request(
238
- "GET",
239
- "https://example.com/planning-apps",
278
+ "GET",
279
+ "https://example.com/planning-apps",
240
280
  parameters: { year: 2023 },
241
281
  headers: { "Accept" => "application/json" }
242
282
  )
@@ -248,7 +288,56 @@ ScraperUtils::DebugUtils.debug_page(page, "Checking search results page")
248
288
  ScraperUtils::DebugUtils.debug_selector(page, '.results-table', "Looking for development applications")
249
289
  ```
250
290
 
251
- ## Development
291
+ Interleaving Requests
292
+ ---------------------
293
+
294
+ The `ScraperUtils::FiberScheduler` provides a lightweight utility that:
295
+
296
+ * works on the other authorities whilst in the delay period for an authority's next request
297
+ * thus optimizing the total scraper run time
298
+ * allows you to increase the random delay for authorities without undue effect on total run time
299
+ * For the curious, it uses [ruby fibers](https://ruby-doc.org/core-2.5.8/Fiber.html) rather than threads as that is
300
+ simpler to get right and debug!
301
+
302
+ To enable it, change the scrape method in the example above to:
303
+
304
+ ```ruby
305
+
306
+ def scrape(authorities, attempt)
307
+ ScraperUtils::FiberScheduler.reset!
308
+ exceptions = {}
309
+ authorities.each do |authority_label|
310
+ ScraperUtils::FiberScheduler.register_operation(authority_label) do
311
+ ScraperUtils::FiberScheduler.log "Collecting feed data for #{authority_label}, attempt: #{attempt}..."
312
+ begin
313
+ ScraperUtils::DataQualityMonitor.start_authority(authority_label)
314
+ YourScraper.scrape(authority_label) do |record|
315
+ begin
316
+ record["authority_label"] = authority_label.to_s
317
+ ScraperUtils::DbUtils.save_record(record)
318
+ rescue ScraperUtils::UnprocessableRecord => e
319
+ ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)
320
+ exceptions[authority_label] = e
321
+ end
322
+ end
323
+ rescue StandardError => e
324
+ warn "#{authority_label}: ERROR: #{e}"
325
+ warn e.backtrace
326
+ exceptions[authority_label] = e
327
+ end
328
+ end # end of register_operation block
329
+ end
330
+ ScraperUtils::FiberScheduler.run_all
331
+ exceptions
332
+ end
333
+ ```
334
+
335
+ And use `ScraperUtils::FiberScheduler.log` instead of `puts` when logging within the authority processing code.
336
+ This will prefix the output lines with the authority name, which is needed since the system will interleave the work and
337
+ thus the output.
338
+
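+ For example (a hypothetical log message):
+
+ ```ruby
+ # Instead of:
+ puts "Saving record #{record['council_reference']}"
+ # use:
+ ScraperUtils::FiberScheduler.log "Saving record #{record['council_reference']}"
+ ```
+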
339
+ Development
340
+ -----------
252
341
 
253
342
  After checking out the repo, run `bin/setup` to install dependencies.
254
343
  Then, run `rake test` to run the tests.
@@ -259,13 +348,19 @@ To install this gem onto your local machine, run `bundle exec rake install`.
259
348
 
260
349
  To release a new version, update the version number in `version.rb`, and
261
350
  then run `bundle exec rake release`,
262
- which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
351
+ which will create a git tag for the version, push git commits and tags, and push the `.gem` file
352
+ to [rubygems.org](https://rubygems.org).
353
+
354
+ NOTE: You need to use ruby 3.2.2 instead of 2.5.8 to release to OTP protected accounts.
263
355
 
264
- ## Contributing
356
+ Contributing
357
+ ------------
265
358
 
266
359
  Bug reports and pull requests are welcome on GitHub at https://github.com/ianheggie-oaf/scraper_utils
267
360
 
268
- ## License
361
+ CHANGELOG.md is maintained by the author aiming to follow https://github.com/vweevers/common-changelog
269
362
 
270
- The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
363
+ License
364
+ -------
271
365
 
366
+ The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
data/SPECS.md ADDED
@@ -0,0 +1,25 @@
1
+ SPECS
2
+ =====
3
+
4
+ These project specific Specifications go into further details than the
5
+ installation and usage notes in `README.md`.
6
+
7
+ ASK for clarification of any apparent conflicts with IMPLEMENTATION, GUIDELINES or project instructions.
8
+
9
+ ## Core Design Principles
10
+
11
+ ### Error Handling
12
+ - Record-level errors abort only that record's processing
13
+ - Allow up to 5 + 10% unprocessable records before failing
14
+ - External service reliability (e.g., robots.txt) should not block core functionality
15
+
16
+ ### Rate Limiting
17
+ - Honor site-specific rate limits when clearly specified
18
+ - Apply adaptive delays based on response times
19
+ - Use randomized delays to avoid looking like a bot
20
+ - Support proxy configuration for geolocation needs
21
+
22
+ ### Testing
23
+ - Ensure components are independently testable
24
+ - Avoid timing-based tests in favor of logic validation
25
+ - Keep test scenarios focused and under 20 lines
data/bin/console CHANGED
@@ -1,4 +1,5 @@
1
1
  #!/usr/bin/env ruby
2
+ # frozen_string_literal: true
2
3
 
3
4
  require "bundler/setup"
4
5
  require "scraper_utils"
data/bin/setup CHANGED
@@ -1,4 +1,5 @@
1
- #!/usr/bin/env bash
1
+ #!/bin/bash
2
+
2
3
  set -euo pipefail
3
4
  IFS=$'\n\t'
4
5
  set -vx