scraper_utils 0.1.0 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: db12d36e0d3be635eba2c00dbe149f4d177ddc5e538a08fcd9038a026feaee91
- data.tar.gz: 8d2f140b7fff7e02d90df19ac196018f8719cd73d85067519ee3e931f679f619
+ metadata.gz: a2082c406ad96266f644fc1dfc588046aa43494e9e79b8fec38fe59252c09f06
+ data.tar.gz: 5100907cdcc8c55ddd59b25cf393d212f681b573ef874c5a6c65f748a8c852ec
  SHA512:
- metadata.gz: 7138204493653a872aafcf4a1f8b78d8d5129c70d79a54d6ca10aa1440fc60362edc270522cc0d66c13a7694527a502d75d3dec36cb21f2240fceea85367eec4
- data.tar.gz: 83dffaedd054ed40c7a269c4fd3db270892bc0fa20c4b7d1a904a075cb990bee51004ccb9c0cb86840d87a631655207301b8af1f7a5572f0893d9317a1b90aa5
+ metadata.gz: feabea23ac5b14f6b642769db303bcab954b22cfd2d95d7694af51afd661c73f1ff70a0c17b54557a730fe811275c126c0fdd4473e2ae6d2b83cfc495aa52bc6
+ data.tar.gz: d91154c0dbccfb4271fd4830a82316676c1ad5665773c30d212bcc1ab6178a09d4fee97b2b2a32480c32b8ee69da08fc9a78575460d49f8233fb91b74bf7df66
data/.gitignore CHANGED
@@ -9,6 +9,9 @@
  /test/tmp/
  /test/version_tmp/

+ # Ignore log files
+ /log/
+
  # Temp files
  ,*
  *.bak
data/.rubocop.yml CHANGED
@@ -1,12 +1,9 @@
+ plugins:
+   - rubocop-rake
+   - rubocop-rspec
+
  AllCops:
-   Exclude:
-     - bin/*
-     # This is a temporary dumping ground for authority specific
-     # code that we're probably just initially copying across from
-     # other scrapers. So, we don't care about the formatting and style
-     # initially.
-     # TODO: Remove this once we've removed all the code from here
-     - lib/technology_one_scraper/authority/*
+   NewCops: enable

  # Bumping max line length to something a little more reasonable
  Layout/LineLength:
data/CHANGELOG.md ADDED
@@ -0,0 +1,14 @@
+ # Changelog
+
+
+ ## 0.2.1 - 2025-02-28
+
+ Fixed broken v0.2.0
+
+ ## 0.2.0 - 2025-02-28
+
+ Added FiberScheduler, enabled compliant mode with delays by default, and simplified usage by removing the third retry without proxy
+
+ ## 0.1.0 - 2025-02-23
+
+ First release for development
data/GUIDELINES.md ADDED
@@ -0,0 +1,75 @@
+ # Project-Specific Guidelines
+
+ These project-specific guidelines supplement the general project instructions.
+ Ask for clarification of any apparent conflicts with SPECS, IMPLEMENTATION or project instructions.
+
+ ## Error Handling Approaches
+
+ Process each authority's site in isolation - problems with one authority are irrelevant to others.
+
+ * we do a 2nd attempt of authorities with the same proxy settings
+ * and a 3rd attempt for those that failed with the proxy, but with the proxy disabled
+
+ Within a proxy attempt, distinguish between
+
+ * Errors that are specific to that record
+   * only allow 5 such errors plus 10% of successfully processed records (see the sketch after this list)
+   * these could be regarded as not worth retrying (we currently do)
+ * Any other exceptions stop the processing of that authority's site
+
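A minimal sketch of that error budget, using the same counter arithmetic as the old `scraper.rb` example shown later in this diff; `records`, `unprocessable?` and `save_record` are assumed helpers standing in for your own code:

```ruby
# Illustrative only: "allow 5 such errors plus 10% of successfully processed records"
# tracked as a single counter - start at -5, +1 per unprocessable record,
# -0.1 per success; going positive means too many.
too_many_unprocessable = -5.0
records.each do |record|
  if unprocessable?(record)   # assumed check, e.g. rescuing ScraperUtils::UnprocessableRecord
    too_many_unprocessable += 1
    raise "Too many unprocessable records" if too_many_unprocessable.positive?
  else
    save_record(record)       # assumed save helper
    too_many_unprocessable -= 0.1
  end
end
```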
+ ### Fail Fast on deeper calls
+
+ - Raise exceptions early (when they are clearly detectable)
+ - Input validation according to scraper specs
+
+ ### Be forgiving on things that don't matter
+
+ - Not all sites have robots.txt, and not all robots.txt files are well formatted; therefore stop processing the file on obvious conflicts with the specs,
+   but if the file is bad, just treat it as missing.
+
+ - Don't fuss over things we are not going to record.
+
+ - We do detect maintenance pages because that is a helpful and simple clue that we won't find the data, and we can just wait till the site is back online
+
+ ## Type Checking
+
+ ### Partial Duck Typing
+ - Focus on behavior over types internally (.rubocop.yml should disable requiring everything to be typed)
+ - Runtime validation of values
+ - Document the public API though
+ - Use @param and @return comments to document types for external uses of the public methods (RubyMine will use these for checking), for example:
+
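For instance, a hypothetical public method documented in that style (the method and parameters are made up for illustration):

```ruby
# Fetches the index page of development applications for an authority.
#
# @param url [String] the authority's search page URL
# @param agent [Mechanize] a configured Mechanize agent
# @return [Mechanize::Page] the fetched index page
def fetch_index_page(url, agent)
  agent.get(url)
end
```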
+ ## Input Validation
+
+ ### Early Validation
+ - Check all inputs at system boundaries
+ - Fail on any invalid input
+
+ ## Testing Strategies
+
+ * Avoid mocking unless really needed; instead
+   * instantiate a real object to use in the test
+   * use mocking facilities provided by the gem (eg Mechanize, Aws etc)
+   * use integration tests with WebMock for simple external sites or VCR for more complex ones.
+ * Testing the integration all the way through is just as important as the specific algorithms
+ * Consider using single-responsibility classes / methods to make testing simpler, but don't make things more complex just to be testable
+ * If necessary, expose internal values as read-only attributes as needed for testing,
+   for example adding a read-only attribute to Mechanize agent instances with the values calculated internally
+
+ ### Behavior-Driven Development (BDD)
+ - Focus on behavior specifications
+ - User-centric scenarios
+ - Best for: User-facing features
+
+ ## Documentation Approaches
+
+ ### Just-Enough Documentation
+ - Focus on key decisions
+ - Document non-obvious choices
+ - Best for: Rapid development, internal tools
+
+ ## Logging Philosophy
+
+ ### Minimal Logging
+ - Log only key events (key means down to adding a record)
+ - Focus on errors
data/Gemfile CHANGED
@@ -22,12 +22,15 @@ gem "sqlite3", platform && (platform == :heroku16 ? "~> 1.4.0" : "~> 1.6.3")
  gem "scraperwiki", git: "https://github.com/openaustralia/scraperwiki-ruby.git",
      branch: "morph_defaults"

- # development and test test gems
+ # development and test gems
  gem "rake", platform && (platform == :heroku16 ? "~> 12.3.3" : "~> 13.0")
  gem "rspec", platform && (platform == :heroku16 ? "~> 3.9.0" : "~> 3.12")
- gem "rubocop", platform && (platform == :heroku16 ? "~> 0.80.0" : "~> 1.57")
+ gem "rubocop", platform && (platform == :heroku16 ? "~> 1.28.2" : "~> 1.73")
+ gem "rubocop-rake", platform && (platform == :heroku16 ? "~> 0.6.0" : "~> 0.7")
+ gem "rubocop-rspec", platform && (platform == :heroku16 ? "~> 2.10.0" : "~> 3.5")
  gem "simplecov", platform && (platform == :heroku16 ? "~> 0.18.0" : "~> 0.22.0")
- # gem "simplecov-console" listed in gemspec
+ gem "simplecov-console"
+ gem "terminal-table"
  gem "webmock", platform && (platform == :heroku16 ? "~> 3.14.0" : "~> 3.19.0")

  gemspec
data/IMPLEMENTATION.md ADDED
@@ -0,0 +1,33 @@
1
+ IMPLEMENTATION
2
+ ==============
3
+
4
+ Document decisions on how we are implementing the specs to be consistent and save time.
5
+ Things we MUST do go in SPECS.
6
+ Choices between a number of valid possibilities go here.
7
+ Once made, these choices should only be changed after careful consideration.
8
+
9
+ ASK for clarification of any apparent conflicts with SPECS, GUIDELINES or project instructions.
10
+
11
+ ## Debugging
12
+
13
+ Output debugging messages if ENV['DEBUG'] is set, for example:
14
+
15
+ ```ruby
16
+ puts "Pre Connect request: #{request.inspect}" if ENV["DEBUG"]
17
+ ```
18
+
19
+ ## Robots.txt Handling
20
+
21
+ - Used as a "good citizen" mechanism for respecting site preferences
22
+ - Graceful fallback (to permitted) if robots.txt is unavailable or invalid
23
+ - Match `/^User-agent:\s*ScraperUtils/i` for specific user agent
24
+ - If there is a line matching `/^Disallow:\s*\//` then we are disallowed
25
+ - Check for `/^Crawl-delay:\s*(\d[.0-9]*)/` to extract delay
26
+ - If the no crawl-delay is found in that section, then check in the default `/^User-agent:\s*\*/` section
27
+ - This is a deliberate significant simplification of the robots.txt specification in RFC 9309.
28
+
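A minimal sketch of that simplified handling (the method name and return shape are assumptions for illustration, not the gem's actual implementation):

```ruby
# Illustrative only - not the gem's real code. Applies the simplified rules above:
# look for our User-agent section, treat "Disallow: /" there as disallowed, and
# take Crawl-delay from our section, falling back to the default "*" section.
def simplified_robots_rules(content)
  section = nil     # :ours, :default, or nil while inside some other agent's section
  disallowed = false
  delays = {}
  content.each_line do |line|
    case line
    when /^User-agent:\s*ScraperUtils/i then section = :ours
    when /^User-agent:\s*\*/ then section = :default
    when /^User-agent:/i then section = nil
    when %r{^Disallow:\s*/} then disallowed = true if section == :ours
    when /^Crawl-delay:\s*(\d[.0-9]*)/ then delays[section] = Regexp.last_match(1).to_f if section
    end
  end
  { disallowed: disallowed, crawl_delay: delays[:ours] || delays[:default] }
end
```

Missing or unparseable content simply falls through to the permissive defaults, matching the "graceful fallback" point above.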
+ ## Method Organization
+
+ - Externalize configuration to improve testability
+ - Keep shared logic in the main class
+ - Decisions / information specific to just one class can be documented there; otherwise it belongs here
data/README.md CHANGED
@@ -1,11 +1,53 @@
  ScraperUtils (Ruby)
  ===================

- Utilities to help make planningalerts scrapers, especially multis easier to develop, run and debug.
+ Utilities to help make planningalerts scrapers, especially multis, easier to develop, run and debug.

- WARNING: This is still under development! Breaking changes may occur in version 0!
+ WARNING: This is still under development! Breaking changes may occur in version 0.x!

- ## Installation
+ For Server Administrators
+ -------------------------
+
+ The ScraperUtils library is designed to be a respectful citizen of the web. If you're a server administrator and notice
+ our scraper accessing your systems, here's what you should know:
+
+ ### How to Control Our Behavior
+
+ Our scraper utilities respect the standard server **robots.txt** control mechanisms (by default).
+ To control our access:
+
+ - Add a section for our user agent: `User-agent: ScraperUtils` (default)
+ - Set a crawl delay, eg: `Crawl-delay: 20`
+ - If needed, specify disallowed paths: `Disallow: /private/`
+
+ ### Built-in Politeness Features
+
+ Even without specific configuration, our scrapers will, by default:
+
+ - **Identify themselves**: Our user agent clearly indicates who we are and provides a link to the project repository:
+   `Mozilla/5.0 (compatible; ScraperUtils/0.2.0 2025-02-22; +https://github.com/ianheggie-oaf/scraper_utils)`
+
+ - **Limit server load**: We slow down our requests so we should never be a significant load on your server, let alone
+   overload it.
+   The slower your server is running, the longer the delay we add between requests to help.
+   In the default "compliant mode" this defaults to a max load of 20% and is capped at 33%.
+
+ - **Add randomized delays**: We add random delays between requests to further reduce our impact on servers, which should
+   bring us down to the load of a single industrious person.
+
+ Extra utilities provided for scrapers to further reduce your server load:
+
+ - **Interleave requests**: This spreads out the requests to your server rather than focusing on one scraper at a time.
+
+ - **Intelligent Date Range selection**: This reduces server load by over 60% by a smarter choice of date range searches,
+   checking the most recent 4 days each day and reducing down to checking every 3 days by the end of the 33-day period. This
+   replaces the simplistic check of the last 30 days each day.
+
+ Our goal is to access public planning information without negatively impacting your services.
+
+ Installation
+ ------------

  Add these line to your application's Gemfile:

@@ -18,211 +60,133 @@ And then execute:

      $ bundle

- Or install it yourself for testing:
+ Usage
+ -----
+
+ ### Ruby Versions

-     $ gem install scraper_utils
+ This gem is designed to be compatible with the latest ruby supported by morph.io - other versions may work, but are not tested:

- ## Usage
+ * ruby 3.2.2 - requires the `platform` file to contain `heroku_18` in the scraper
+ * ruby 2.5.8 - `heroku_16` (the default)

  ### Environment variables

- Optionally filter authorities via environment variable in morph > scraper > settings or
+ #### `MORPH_AUSTRALIAN_PROXY`
+
+ On morph.io set the environment variable `MORPH_AUSTRALIAN_PROXY` to
+ `http://morph:password@au.proxy.oaf.org.au:8888`,
+ replacing password with the real password.
+ Alternatively enter your own AUSTRALIAN proxy details when testing.
+
+ #### `MORPH_EXPECT_BAD`
+
+ To avoid morph complaining about sites that are known to be bad,
+ but you want them to keep being tested, list them on `MORPH_EXPECT_BAD`, for example:
+
+ #### `MORPH_AUTHORITIES`
+
+ Optionally filter authorities for multi authority scrapers
+ via environment variable in morph > scraper > settings or
  in your dev environment:

  ```bash
  export MORPH_AUTHORITIES=noosa,wagga
  ```

- ### Example updated `scraper.rb` file
+ #### `DEBUG`

- Update your `scraper.rb` as per the following example:
+ Optionally enable verbose debugging messages when developing:

- ```ruby
- #!/usr/bin/env ruby
- # frozen_string_literal: true
+ ```bash
+ export DEBUG=1
+ ```

- $LOAD_PATH << "./lib"
+ ### Extra Mechanize options

- require "scraper_utils"
- require "technology_one_scraper"
-
- # Main Scraper class
- class Scraper
-   AUTHORITIES = TechnologyOneScraper::AUTHORITIES
-
-   def self.scrape(authorities, attempt)
-     results = {}
-     authorities.each do |authority_label|
-       these_results = results[authority_label] = {}
-       begin
-         records_scraped = 0
-         unprocessable_records = 0
-         # Allow 5 + 10% unprocessable records
-         too_many_unprocessable = -5.0
-         use_proxy = AUTHORITIES[authority_label][:australian_proxy] && ScraperUtils.australian_proxy
-         next if attempt > 2 && !use_proxy
-
-         puts "",
-              "Collecting feed data for #{authority_label}, attempt: #{attempt}" \
-              "#{use_proxy ? ' (via proxy)' : ''} ..."
-         # Change scrape to accept a use_proxy flag and return an unprocessable flag
-         # it should rescue ScraperUtils::UnprocessableRecord thrown deeper in the scraping code and
-         # set unprocessable
-         TechnologyOneScraper.scrape(use_proxy, authority_label) do |record, unprocessable|
-           unless unprocessable
-             begin
-               record["authority_label"] = authority_label.to_s
-               ScraperUtils::DbUtils.save_record(record)
-             rescue ScraperUtils::UnprocessableRecord => e
-               # validation error
-               unprocessable = true
-               these_results[:error] = e
-             end
-           end
-           if unprocessable
-             unprocessable_records += 1
-             these_results[:unprocessable_records] = unprocessable_records
-             too_many_unprocessable += 1
-             raise "Too many unprocessable records" if too_many_unprocessable.positive?
-           else
-             records_scraped += 1
-             these_results[:records_scraped] = records_scraped
-             too_many_unprocessable -= 0.1
-           end
-         end
-       rescue StandardError => e
-         warn "#{authority_label}: ERROR: #{e}"
-         warn e.backtrace || "No backtrace available"
-         these_results[:error] = e
-       end
-     end
-     results
-   end
+ Add `client_options` to your AUTHORITIES configuration and move any of the following settings into it:

-   def self.selected_authorities
-     ScraperUtils::AuthorityUtils.selected_authorities(AUTHORITIES.keys)
-   end
+ * `timeout: Integer` - Timeout for agent connections in case the server is slower than normal
+ * `australian_proxy: true` - Use the proxy url in the `MORPH_AUSTRALIAN_PROXY` env variable if the site is geo-locked
+ * `disable_ssl_certificate_check: true` - Disables SSL verification for old / incorrect certificates

-   def self.run(authorities)
-     puts "Scraping authorities: #{authorities.join(', ')}"
-     start_time = Time.now
-     results = scrape(authorities, 1)
-     ScraperUtils::LogUtils.log_scraping_run(
-       start_time,
-       1,
-       authorities,
-       results
-     )
-
-     retry_errors = results.select do |_auth, result|
-       result[:error] && !result[:error].is_a?(ScraperUtils::UnprocessableRecord)
-     end.keys
-
-     unless retry_errors.empty?
-       puts "",
-            "***************************************************"
-       puts "Now retrying authorities which earlier had failures"
-       puts retry_errors.join(", ").to_s
-       puts "***************************************************"
-
-       start_retry = Time.now
-       retry_results = scrape(retry_errors, 2)
-       ScraperUtils::LogUtils.log_scraping_run(
-         start_retry,
-         2,
-         retry_errors,
-         retry_results
-       )
-
-       retry_results.each do |auth, result|
-         unless result[:error] && !result[:error].is_a?(ScraperUtils::UnprocessableRecord)
-           results[auth] = result
-         end
-       end.keys
-       retry_no_proxy = retry_results.select do |_auth, result|
-         result[:used_proxy] && result[:error] &&
-           !result[:error].is_a?(ScraperUtils::UnprocessableRecord)
-       end.keys
-
-       unless retry_no_proxy.empty?
-         puts "",
-              "*****************************************************************"
-         puts "Now retrying authorities which earlier had failures without proxy"
-         puts retry_no_proxy.join(", ").to_s
-         puts "*****************************************************************"
-
-         start_retry = Time.now
-         second_retry_results = scrape(retry_no_proxy, 3)
-         ScraperUtils::LogUtils.log_scraping_run(
-           start_retry,
-           3,
-           retry_no_proxy,
-           second_retry_results
-         )
-         second_retry_results.each do |auth, result|
-           unless result[:error] && !result[:error].is_a?(ScraperUtils::UnprocessableRecord)
-             results[auth] = result
-           end
-         end.keys
-       end
-     end
-
-     # Report on results, raising errors for unexpected conditions
-     ScraperUtils::LogUtils.report_on_results(authorities, results)
-   end
+ See the documentation on `ScraperUtils::MechanizeUtils::AgentConfig` for more options
+
+ Then adjust your code to accept `client_options` and pass them through to:
+ `ScraperUtils::MechanizeUtils.mechanize_agent(client_options || {})`
+ to receive a `Mechanize::Agent` configured accordingly.
+
+ The agent returned is configured using Mechanize hooks to implement the desired delays automatically.
+
+ ### Default Configuration
+
+ By default, the Mechanize agent is configured with the following settings.
+ As you can see, the defaults can be changed using env variables.
+
+ Note - compliant mode forces max_load to be set to a value no greater than 33.
+ PLEASE don't use our user agent string with a max_load higher than 33!
+
+ ```ruby
+ ScraperUtils::MechanizeUtils::AgentConfig.configure do |config|
+   config.default_timeout = ENV.fetch('MORPH_TIMEOUT', 60).to_i # 60
+   config.default_compliant_mode = ENV.fetch('MORPH_NOT_COMPLIANT', nil).to_s.empty? # true
+   config.default_random_delay = ENV.fetch('MORPH_RANDOM_DELAY', 15).to_i # 15
+   config.default_max_load = ENV.fetch('MORPH_MAX_LOAD', 20.0).to_f # 20
+   config.default_disable_ssl_certificate_check = !ENV.fetch('MORPH_DISABLE_SSL_CHECK', nil).to_s.empty? # false
+   config.default_australian_proxy = !ENV.fetch('MORPH_USE_PROXY', nil).to_s.empty? # false
+   config.default_user_agent = ENV.fetch('MORPH_USER_AGENT', nil) # Uses Mechanize user agent
  end
+ ```
+
+ You can modify these global defaults before creating any Mechanize agents. These settings will be used for all Mechanize
+ agents created by `ScraperUtils::MechanizeUtils.mechanize_agent` unless overridden by passing parameters to that method.

- if __FILE__ == $PROGRAM_NAME
-   # Default to list of authorities we can't or won't fix in code, explain why
-   # wagga: url redirects and reports Application error, main site says to use NSW Planning Portal from 1 July 2021
-   # which doesn't list any DA's for wagga wagga!
+ To speed up testing, set the following in `spec_helper.rb`:

-   ENV["MORPH_EXPECT_BAD"] ||= "wagga"
-   Scraper.run(Scraper.selected_authorities)
+ ```ruby
+ ScraperUtils::MechanizeUtils::AgentConfig.configure do |config|
+   config.default_random_delay = nil
+   config.default_max_load = 33
  end
  ```

- Then deeper in your code update:
+ ### Example updated `scraper.rb` file

- * Change scrape to accept a `use_proxy` flag and return an `unprocessable` flag
- * it should rescue ScraperUtils::UnprocessableRecord thrown deeper in the scraping code and
-   set and yield unprocessable eg: `TechnologyOneScraper.scrape(use_proxy, authority_label) do |record, unprocessable|`
+ Update your `scraper.rb` as per the [example scraper](docs/example_scraper.rb).
+
+ Your code should raise ScraperUtils::UnprocessableRecord when there is a problem with the data presented on a page for a
+ record.
+ Then, just before you would normally yield a record for saving, rescue that exception and:
+
+ * Call `ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)`
+ * Do NOT yield the record for saving
+
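A minimal sketch of that rescue pattern; only `ScraperUtils::UnprocessableRecord` and `DataQualityMonitor.log_unprocessable_record` come from the README above, the surrounding helper names are assumptions:

```ruby
# Illustrative only - wrap the point where you would normally yield a record for saving.
def save_if_processable(record)
  validate_record!(record)  # assumed: your own check that raises ScraperUtils::UnprocessableRecord
  yield record              # hand the record on for saving as normal
rescue ScraperUtils::UnprocessableRecord => e
  ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)
  # deliberately do NOT yield the record for saving
end
```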
+ In your code, update where you create a Mechanize agent (often `YourScraper.scrape_period`) and the `AUTHORITIES` hash
+ to move Mechanize agent options (like `australian_proxy` and `timeout`) to a hash under a new key: `client_options`.
+ For example:

  ```ruby
  require "scraper_utils"
  #...
- module TechnologyOneScraper
-   # Note the extra parameter: use_proxy
-   def self.scrape(use_proxy, authority)
-     raise "Unexpected authority: #{authority}" unless AUTHORITIES.key?(authority)
-
-     scrape_period(use_proxy, AUTHORITIES[authority]) do |record, unprocessable|
-       yield record, unprocessable
-     end
-   end
-
-   # ... rest of code ...
+ module YourScraper
+   # ... some code ...

-   # Note the extra parameters: use_proxy and timeout
-   def self.scrape_period(use_proxy,
-                          url:, period:, webguest: "P1.WEBGUEST", disable_ssl_certificate_check: false,
-                          australian_proxy: false, timeout: nil
+   # Note the extra parameter: client_options
+   def self.scrape_period(url:, period:, webguest: "P1.WEBGUEST",
+                          client_options: {}
  )
-     agent = ScraperUtils::MechanizeUtils.mechanize_agent(use_proxy: use_proxy, timeout: timeout)
-     agent.verify_mode = OpenSSL::SSL::VERIFY_NONE if disable_ssl_certificate_check
-
-     # ... rest of code ...
-
-     # Update yield to return unprocessable as well as record
+     agent = ScraperUtils::MechanizeUtils.mechanize_agent(**client_options)

+     # ... rest of code ...
  end
+
  # ... rest of code ...
  end
  ```

  ### Debugging Techniques

- The following code will print dbugging info if you set:
+ The following code will cause debugging info to be output:

  ```bash
  export DEBUG=1
@@ -235,8 +199,8 @@ require 'scraper_utils'

  # Debug an HTTP request
  ScraperUtils::DebugUtils.debug_request(
- "GET",
- "https://example.com/planning-apps",
+ "GET",
+ "https://example.com/planning-apps",
  parameters: { year: 2023 },
  headers: { "Accept" => "application/json" }
  )
@@ -248,7 +212,85 @@ ScraperUtils::DebugUtils.debug_page(page, "Checking search results page")
  ScraperUtils::DebugUtils.debug_selector(page, '.results-table', "Looking for development applications")
  ```

- ## Development
+ Interleaving Requests
+ ---------------------
+
+ The `ScraperUtils::FiberScheduler` provides a lightweight utility that:
+
+ * works on the other authorities whilst in the delay period for an authority's next request
+   * thus optimizing the total scraper run time
+ * allows you to increase the random delay for authorities without undue effect on total run time
+ * For the curious, it uses [ruby fibers](https://ruby-doc.org/core-2.5.8/Fiber.html) rather than threads as that is
+   a simpler system and thus easier to get right, understand and debug!
+ * Cycles around the authorities when compliant_mode, max_load and random_delay are disabled
+
+ To enable this, change the scrape method to be like the [example scrape method using fibers](docs/example_scrape_with_fibers.rb)
+
+ And use `ScraperUtils::FiberScheduler.log` instead of `puts` when logging within the authority processing code.
+ This will prefix the output lines with the authority name, which is needed since the system will interleave the work and
+ thus the output. For example:
+
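As a small illustration of that logging change (the fiber set-up itself lives in the linked example file and is not reproduced here; `authority_label` is an assumed variable name):

```ruby
# Inside the authority-specific processing code, replace bare puts calls with
# FiberScheduler.log so the interleaved output is prefixed with the authority name.
ScraperUtils::FiberScheduler.log "Collecting feed data for #{authority_label} ..."
```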
+ This uses `ScraperUtils::RandomizeUtils` as described below. Remember to add the recommended line to
+ `spec/spec_helper.rb`.
+
+ Intelligent Date Range Selection
+ --------------------------------
+
+ To further reduce server load and speed up scrapers, we provide an intelligent date range selection mechanism
+ that can reduce server requests by 60% without significantly impacting the delay in picking up changes.
+
+ The `ScraperUtils::DateRangeUtils#calculate_date_ranges` method provides a smart approach to searching historical
+ records:
+
+ - Always checks the most recent 4 days daily (configurable)
+ - Progressively reduces search frequency for older records
+ - Uses a Fibonacci-like progression to create natural, efficient search intervals
+ - Configurable `max_period` (default is 3 days)
+ - Merges adjacent search ranges and handles the changeover in search frequency by extending some searches
+
+ Example usage in your scraper:
+
+ ```ruby
+ date_ranges = ScraperUtils::DateRangeUtils.new.calculate_date_ranges
+ date_ranges.each do |from_date, to_date, _debugging_comment|
+   # Adjust your normal search code to use this date range
+   your_search_records(from_date: from_date, to_date: to_date) do |record|
+     # process as normal
+   end
+ end
+ ```
+
+ Typical server load reductions:
+
+ * Max period 2 days : ~42% of the 33 days selected
+ * Max period 3 days : ~37% of the 33 days selected (default)
+ * Max period 5 days : ~35% (or ~31% when days = 45)
+
+ See the class documentation for customizing defaults and passing options.
+
+ Randomizing Requests
+ --------------------
+
+ Pass a `Collection` or `Array` to `ScraperUtils::RandomizeUtils.randomize_order` to randomize it in production, but
+ receive it as is when testing.
+
+ Use this with the list of records scraped from an index to make any requests for further information less bot-like.
+
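For example, a minimal sketch (`scrape_index_page` and `fetch_further_information` are assumed helpers standing in for your own code):

```ruby
# Randomize the order in which detail pages are requested; in sequential (test) mode
# randomize_order returns the collection unchanged.
index_records = scrape_index_page(agent)
ScraperUtils::RandomizeUtils.randomize_order(index_records).each do |record|
  fetch_further_information(agent, record)
end
```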
+ ### Spec setup
+
+ You should enforce sequential mode when testing by adding the following code to `spec/spec_helper.rb`:
+
+ ```ruby
+ ScraperUtils::RandomizeUtils.sequential = true
+ ```
+
+ Note:
+
+ * You can also force sequential mode by setting the env variable `MORPH_PROCESS_SEQUENTIALLY` to `1` (any non-blank value)
+ * Testing using VCR requires sequential mode
+
+ Development
+ -----------

  After checking out the repo, run `bin/setup` to install dependencies.
  Then, run `rake test` to run the tests.
@@ -259,13 +301,20 @@ To install this gem onto your local machine, run `bundle exec rake install`.

  To release a new version, update the version number in `version.rb`, and
  then run `bundle exec rake release`,
- which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
+ which will create a git tag for the version, push git commits and tags, and push the `.gem` file
+ to [rubygems.org](https://rubygems.org).
+
+ NOTE: You need to use ruby 3.2.2 instead of 2.5.8 to release to OTP-protected accounts.
+
+ Contributing
+ ------------

- ## Contributing
+ Bug reports and pull requests with working tests are welcome on [GitHub](https://github.com/ianheggie-oaf/scraper_utils).

- Bug reports and pull requests are welcome on GitHub at https://github.com/ianheggie-oaf/scraper_utils
+ CHANGELOG.md is maintained by the author, aiming to follow https://github.com/vweevers/common-changelog

- ## License
+ License
+ -------

  The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).