scraper_utils 0.4.2 → 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 65565228471cfb92e0ea4c7280f2f4b30c22ac0fdb88cad30e99d14f011b2ff4
4
- data.tar.gz: 812f6ca2270a046db40af5427684015ca68f1c1a7670041eb51889a3ba959cd2
3
+ metadata.gz: 63a24c24b497494b79c4d7e12f04a1bd2555068f37f50389f3906c0033817d7e
4
+ data.tar.gz: 6d6b96112dc3e2f9dc5a54de6318a544c240c0e3d5246ab4178c07346d0de7dc
5
5
  SHA512:
6
- metadata.gz: 5aee38354b3fde81b2e9fd0442aa3155df8fdde8959e2fc6db49d55cd3872b5ad15ffc7a3fccbe99e759c89fdbca1ced9f453021e0fe7a6ae47d23c42f264a39
7
- data.tar.gz: 4e51e770bc2252caf18c548572925bba4cb5fd651cdc063d379bb42a732921f740c76d1bdfde32df2a0443449b1c4da7fe624a4e908d8116f0286b4c2c169da5
6
+ metadata.gz: eda8d10d996d51b7ef1d2610e21da31390c10dd29f4daa70bd5d9c3c8dc6eb9bed651803ccd6a59f53b03dae4fcd1ea016802e693f8828f4a13b92e07a0b046e
7
+ data.tar.gz: eba2704a99c6599a2789ec573fa335d7939a63d0c27b06886d6e905cd785e2095d7d0307e7aa1195a1209e022340fa5d027a72ccca61a350590058e998355d5d
data/CHANGELOG.md CHANGED
@@ -1,5 +1,9 @@
1
1
  # Changelog
2
2
 
3
+ ## 0.5.0 - 2025-03-05
4
+
5
+ * Add action processing utility
6
+
3
7
  ## 0.4.2 - 2025-03-04
4
8
 
5
9
  * Fix gem require list
data/README.md CHANGED
@@ -3,8 +3,6 @@ ScraperUtils (Ruby)
3
3
 
4
4
  Utilities to help make planningalerts scrapers, especially multis, easier to develop, run and debug.
5
5
 
6
- WARNING: This is still under development! Breaking changes may occur in version 0.x!
7
-
8
6
  For Server Administrators
9
7
  -------------------------
10
8
 
@@ -18,331 +16,97 @@ To control our access:
18
16
 
19
17
  - Add a section for our user agent: `User-agent: ScraperUtils` (default)
20
18
  - Set a crawl delay, eg: `Crawl-delay: 20`
21
- - If needed specify disallowed paths*: `Disallow: /private/`
19
+ - If needed specify disallowed paths: `Disallow: /private/`
22
20
 
23
- ### Built-in Politeness Features
21
+ ### We play nice with your servers
24
22
 
25
- Even without specific configuration, our scrapers will, by default:
23
+ Our goal is to access public planning information with minimal impact on your services. The following features are on by
24
+ default:
26
25
 
27
26
  - **Identify themselves**: Our user agent clearly indicates who we are and provides a link to the project repository:
28
27
  `Mozilla/5.0 (compatible; ScraperUtils/0.2.0 2025-02-22; +https://github.com/ianheggie-oaf/scraper_utils)`
29
28
 
30
- - **Limit server load**: We slow down our requests so we should never be a significant load to your server, let alone
31
- overload it.
32
- The slower your server is running, the longer the delay we add between requests to help.
33
- In the default "compliant mode" this defaults to a max load of 20% and is capped at 33%.
34
-
35
- - **Add randomized delays**: We add random delays between requests to further reduce our impact on servers, which should
36
- bring us down to the load of a single industrious person.
37
-
38
- Extra utilities provided for scrapers to further reduce your server load:
29
+ - **Limit server load**:
30
+ - We wait double your response time before making another request to avoid being a significant load on your server
31
+ - We also randomly add extra delays to give your server a chance to catch up with background tasks
39
32
 
40
- - **Interleave requests**: This spreads out the requests to your server rather than focusing on one scraper at a time.
33
+ We also provide scraper developers with other features to reduce overall load.
41
34
 
42
- - **Intelligent Date Range selection**: This reduces server load by over 60% by a smarter choice of date range searches,
43
- checking the recent 4 days each day and reducing down to checking each 3 days by the end of the 33-day mark. This
44
- replaces the simplistic check of the last 30 days each day.
35
+ For Scraper Developers
36
+ ----------------------
45
37
 
46
- - Alternative **Cycle Utilities** - a convenience class to cycle through short and longer search ranges to reduce server
47
- load.
38
+ We provide utilities to make developing, running and debugging your scraper easier in addition to the base utilities
39
+ mentioned above.
48
40
 
49
- Our goal is to access public planning information without negatively impacting your services.
41
+ ## Installation & Configuration
50
42
 
51
- Installation
52
- ------------
53
-
54
- Add these line to your application's Gemfile:
43
+ Add to your [scraper's](https://www.planningalerts.org.au/how_to_write_a_scraper) Gemfile:
55
44
 
56
45
  ```ruby
57
46
  gem "scraperwiki", git: "https://github.com/openaustralia/scraperwiki-ruby.git", branch: "morph_defaults"
58
47
  gem 'scraper_utils'
59
48
  ```
60
49
 
61
- And then execute:
62
-
63
- $ bundle
64
-
65
- Usage
66
- -----
67
-
68
- ### Ruby Versions
69
-
70
- This gem is designed to be compatible the latest ruby supported by morph.io - other versions may work, but not tested:
71
-
72
- * ruby 3.2.2 - requires the `platform` file to contain `heroku_18` in the scraper
73
- * ruby 2.5.8 - `heroku_16` (the default)
74
-
75
- ### Environment variables
76
-
77
- #### `MORPH_AUSTRALIAN_PROXY`
78
-
79
- On morph.io set the environment variable `MORPH_AUSTRALIAN_PROXY` to
80
- `http://morph:password@au.proxy.oaf.org.au:8888`
81
- replacing password with the real password.
82
- Alternatively enter your own AUSTRALIAN proxy details when testing.
83
-
84
- #### `MORPH_EXPECT_BAD`
85
-
86
- To avoid morph complaining about sites that are known to be bad,
87
- but you want them to keep being tested, list them on `MORPH_EXPECT_BAD`, for example:
88
-
89
- #### `MORPH_AUTHORITIES`
90
-
91
- Optionally filter authorities for multi authority scrapers
92
- via environment variable in morph > scraper > settings or
93
- in your dev environment:
94
-
95
- ```bash
96
- export MORPH_AUTHORITIES=noosa,wagga
97
- ```
98
-
99
- #### `DEBUG`
100
-
101
- Optionally enable verbose debugging messages when developing:
102
-
103
- ```bash
104
- export DEBUG=1
105
- ```
106
-
107
- ### Extra Mechanize options
50
+ For detailed setup and configuration options, see the [Getting Started guide](docs/getting_started.md).
108
51
 
109
- Add `client_options` to your AUTHORITIES configuration and move any of the following settings into it:
52
+ ## Key Features
110
53
 
111
- * `timeout: Integer` - Timeout for agent connections in case the server is slower than normal
112
- * `australian_proxy: true` - Use the proxy url in the `MORPH_AUSTRALIAN_PROXY` env variable if the site is geo-locked
113
- * `disable_ssl_certificate_check: true` - Disabled SSL verification for old / incorrect certificates
54
+ ### Well-Behaved Web Client
114
55
 
115
- See the documentation on `ScraperUtils::MechanizeUtils::AgentConfig` for more options
56
+ - Configure Mechanize agents with sensible defaults
57
+ - Automatic rate limiting based on server response times
58
+ - Supports robots.txt and crawl-delay directives
59
+ - Supports extra actions required to get to the results page
60
+ - [Learn more about Mechanize utilities](docs/mechanize_utilities.md)
116
61
 
117
- Then adjust your code to accept `client_options` and pass then through to:
118
- `ScraperUtils::MechanizeUtils.mechanize_agent(client_options || {})`
119
- to receive a `Mechanize::Agent` configured accordingly.
62
+ ### Optimize Server Load
120
63
 
121
- The agent returned is configured using Mechanize hooks to implement the desired delays automatically.
64
+ - Intelligent date range selection (reduces server load by up to 60%)
65
+ - Cycle utilities for rotating search parameters
66
+ - [Learn more about reducing server load](docs/reducing_server_load.md)
122
67
 
123
- ### Default Configuration
68
+ ### Improve Scraper Efficiency
124
69
 
125
- By default, the Mechanize agent is configured with the following settings.
126
- As you can see, the defaults can be changed using env variables.
70
+ - Interleave requests to optimize run time
71
+ - [Learn more about interleaving requests](docs/interleaving_requests.md)
72
+ - Randomize processing order for more natural request patterns
73
+ - [Learn more about randomizing requests](docs/randomizing_requests.md)
127
74
 
128
- Note - compliant mode forces max_load to be set to a value no greater than 50.
75
+ ### Error Handling & Quality Monitoring
129
76
 
130
- ```ruby
131
- ScraperUtils::MechanizeUtils::AgentConfig.configure do |config|
132
- config.default_timeout = ENV.fetch('MORPH_TIMEOUT', 60).to_i # 60
133
- config.default_compliant_mode = ENV.fetch('MORPH_NOT_COMPLIANT', nil).to_s.empty? # true
134
- config.default_random_delay = ENV.fetch('MORPH_RANDOM_DELAY', 5).to_i # 5
135
- config.default_max_load = ENV.fetch('MORPH_MAX_LOAD', 33.3).to_f # 33.3
136
- config.default_disable_ssl_certificate_check = !ENV.fetch('MORPH_DISABLE_SSL_CHECK', nil).to_s.empty? # false
137
- config.default_australian_proxy = !ENV.fetch('MORPH_USE_PROXY', nil).to_s.empty? # false
138
- config.default_user_agent = ENV.fetch('MORPH_USER_AGENT', nil) # Uses Mechanize user agent
139
- end
140
- ```
77
+ - Record-level error handling with appropriate thresholds
78
+ - Data quality monitoring during scraping
79
+ - Detailed logging and reporting
141
80
 
142
- You can modify these global defaults before creating any Mechanize agents. These settings will be used for all Mechanize
143
- agents created by `ScraperUtils::MechanizeUtils.mechanize_agent` unless overridden by passing parameters to that method.
81
+ ### Developer Tools
144
82
 
145
- To speed up testing, set the following in `spec_helper.rb`:
83
+ - Enhanced debugging utilities
84
+ - Simple logging with authority context
85
+ - [Learn more about debugging](docs/debugging.md)
146
86
 
147
- ```ruby
148
- ScraperUtils::MechanizeUtils::AgentConfig.configure do |config|
149
- config.default_random_delay = nil
150
- config.default_max_load = 33
151
- end
152
- ```
87
+ ## API Documentation
153
88
 
154
- ### Example updated `scraper.rb` file
89
+ Complete API documentation is available at [RubyDoc.info](https://rubydoc.info/gems/scraper_utils).
155
90
 
156
- Update your `scraper.rb` as per the [example scraper](docs/example_scraper.rb).
91
+ ## Ruby Versions
157
92
 
158
- Your code should raise ScraperUtils::UnprocessableRecord when there is a problem with the data presented on a page for a
159
- record.
160
- Then just before you would normally yield a record for saving, rescue that exception and:
161
-
162
- * Call `ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)`
163
- * NOT yield the record for saving
164
-
165
- In your code update where create a mechanize agent (often `YourScraper.scrape_period`) and the `AUTHORITIES` hash
166
- to move Mechanize agent options (like `australian_proxy` and `timeout`) to a hash under a new key: `client_options`.
167
- For example:
168
-
169
- ```ruby
170
- require "scraper_utils"
171
- #...
172
- module YourScraper
173
- # ... some code ...
174
-
175
- # Note the extra parameter: client_options
176
- def self.scrape_period(url:, period:, webguest: "P1.WEBGUEST",
177
- client_options: {}
178
- )
179
- agent = ScraperUtils::MechanizeUtils.mechanize_agent(**client_options)
180
-
181
- # ... rest of code ...
182
- end
183
-
184
- # ... rest of code ...
185
- end
186
- ```
93
+ This gem is designed to be compatible with Ruby versions supported by morph.io:
187
94
 
188
- ### Debugging Techniques
189
-
190
- The following code will cause debugging info to be output:
191
-
192
- ```bash
193
- export DEBUG=1
194
- ```
195
-
196
- Add the following immediately before requesting or examining pages
197
-
198
- ```ruby
199
- require 'scraper_utils'
200
-
201
- # Debug an HTTP request
202
- ScraperUtils::DebugUtils.debug_request(
203
- "GET",
204
- "https://example.com/planning-apps",
205
- parameters: { year: 2023 },
206
- headers: { "Accept" => "application/json" }
207
- )
208
-
209
- # Debug a web page
210
- ScraperUtils::DebugUtils.debug_page(page, "Checking search results page")
211
-
212
- # Debug a specific page selector
213
- ScraperUtils::DebugUtils.debug_selector(page, '.results-table', "Looking for development applications")
214
- ```
95
+ * Ruby 3.2.2 - requires the `platform` file to contain `heroku_18` in the scraper
96
+ * Ruby 2.5.8 - `heroku_16` (the default)
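+
+ For example, a minimal sketch of opting into the newer Ruby (assuming the `platform` file lives in the scraper's root directory):
+
+ ```bash
+ echo "heroku_18" > platform
+ ```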
215
97
 
216
- Interleaving Requests
217
- ---------------------
218
-
219
- The `ScraperUtils::FiberScheduler` provides a lightweight utility that:
220
-
221
- * works on the other authorities whilst in the delay period for an authorities next request
222
- * thus optimizing the total scraper run time
223
- * allows you to increase the random delay for authorities without undue effect on total run time
224
- * For the curious, it uses [ruby fibers](https://ruby-doc.org/core-2.5.8/Fiber.html) rather than threads as that is
225
- a simpler system and thus easier to get right, understand and debug!
226
- * Cycles around the authorities when compliant_mode, max_load and random_delay are disabled
227
-
228
- To enable change the scrape method to be like [example scrape method using fibers](docs/example_scrape_with_fibers.rb)
229
-
230
- And use `ScraperUtils::FiberScheduler.log` instead of `puts` when logging within the authority processing code.
231
- This will prefix the output lines with the authority name, which is needed since the system will interleave the work and
232
- thus the output.
233
-
234
- This uses `ScraperUtils::RandomizeUtils` as described below. Remember to add the recommended line to
235
- `spec/spec_heper.rb`.
236
-
237
- Intelligent Date Range Selection
238
- --------------------------------
239
-
240
- To further reduce server load and speed up scrapers, we provide an intelligent date range selection mechanism
241
- that can reduce server requests by 60% without significantly impacting delay in picking up changes.
242
-
243
- The `ScraperUtils::DateRangeUtils#calculate_date_ranges` method provides a smart approach to searching historical
244
- records:
245
-
246
- - Always checks the most recent 4 days daily (configurable)
247
- - Progressively reduces search frequency for older records
248
- - Uses a Fibonacci-like progression to create natural, efficient search intervals
249
- - Configurable `max_period` (default is 3 days)
250
- - merges adjacent search ranges and handles the changeover in search frequency by extending some searches
251
-
252
- Example usage in your scraper:
253
-
254
- ```ruby
255
- date_ranges = ScraperUtils::DateRangeUtils.new.calculate_date_ranges
256
- date_ranges.each do |from_date, to_date, _debugging_comment|
257
- # Adjust your normal search code to use for this date range
258
- your_search_records(from_date: from_date, to_date: to_date) do |record|
259
- # process as normal
260
- end
261
- end
262
- ```
263
-
264
- Typical server load reductions:
265
-
266
- * Max period 2 days : ~42% of the 33 days selected
267
- * Max period 3 days : ~37% of the 33 days selected (default)
268
- * Max period 5 days : ~35% (or ~31% when days = 45)
269
-
270
- See the class documentation for customizing defaults and passing options.
271
-
272
- ### Other possibilities
273
-
274
- If the site uses tags like 'L28', 'L14' and 'L7' for the last 28, 14 and 7 days, an alternative solution
275
- is to cycle through ['L28', 'L7', 'L14', 'L7'] which would drop the load by 50% and be less Bot like.
276
-
277
- Cycle Utils
278
- -----------
279
- Simple utility for cycling through options based on Julian day number:
280
-
281
- ```ruby
282
- # Toggle between main and alternate behaviour
283
- alternate = ScraperUtils::CycleUtils.position(2).even?
284
-
285
- # OR cycle through a list of values day by day:
286
- period = ScraperUtils::CycleUtils.pick(['L28', 'L7', 'L14', 'L7'])
287
-
288
- # Use with any cycle size
289
- pos = ScraperUtils::CycleUtils.position(7) # 0-6 cycle
290
-
291
- # Test with specific date
292
- pos = ScraperUtils::CycleUtils.position(3, date: Date.new(2024, 1, 5))
293
-
294
- # Override for testing
295
- # CYCLE_POSITION=2 bundle exec ruby scraper.rb
296
- ```
297
-
298
- Randomizing Requests
299
- --------------------
300
-
301
- Pass a `Collection` or `Array` to `ScraperUtils::RandomizeUtils.randomize_order` to randomize it in production, but
302
- receive in as is when testing.
303
-
304
- Use this with the list of records scraped from an index to randomise any requests for further information to be less Bot
305
- like.
306
-
307
- ### Spec setup
308
-
309
- You should enforce sequential mode when testing by adding the following code to `spec/spec_helper.rb` :
310
-
311
- ```
312
- ScraperUtils::RandomizeUtils.sequential = true
313
- ```
314
-
315
- Note:
316
-
317
- * You can also force sequential mode by setting the env variable `MORPH_PROCESS_SEQUENTIALLY` to `1` (any non blank)
318
- * testing using VCR requires sequential mode
319
-
320
- Development
321
- -----------
98
+ ## Development
322
99
 
323
100
  After checking out the repo, run `bin/setup` to install dependencies.
324
101
  Then, run `rake test` to run the tests.
325
102
 
326
- You can also run `bin/console` for an interactive prompt that will allow you to experiment.
327
-
328
103
  To install this gem onto your local machine, run `bundle exec rake install`.
329
104
 
330
- To release a new version, update the version number in `version.rb`, and
331
- then run `bundle exec rake release`,
332
- which will create a git tag for the version, push git commits and tags, and push the `.gem` file
333
- to [rubygems.org](https://rubygems.org).
334
-
335
- NOTE: You need to use ruby 3.2.2 instead of 2.5.8 to release to OTP protected accounts.
336
-
337
- Contributing
338
- ------------
105
+ ## Contributing
339
106
 
340
- Bug reports and pull requests with working tests are welcome on [GitHub](https://github.com/ianheggie-oaf/scraper_utils)
107
+ Bug reports and pull requests with working tests are welcome
108
+ on [GitHub](https://github.com/ianheggie-oaf/scraper_utils).
341
109
 
342
- CHANGELOG.md is maintained by the author aiming to follow https://github.com/vweevers/common-changelog
343
-
344
- License
345
- -------
110
+ ## License
346
111
 
347
112
  The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
348
-
data/docs/debugging.md ADDED
@@ -0,0 +1,50 @@
1
+ # Debugging Techniques
2
+
3
+ ScraperUtils provides several debugging utilities to help you troubleshoot your scrapers.
4
+
5
+ ## Enabling Debug Mode
6
+
7
+ Set the `DEBUG` environment variable to enable debugging:
8
+
9
+ ```bash
10
+ export DEBUG=1 # Basic debugging
11
+ export DEBUG=2 # Verbose debugging
12
+ export DEBUG=3 # Trace debugging with detailed content
13
+ ```
14
+
15
+ ## Debug Utilities
16
+
17
+ The `ScraperUtils::DebugUtils` module provides several methods for debugging:
18
+
19
+ ```ruby
20
+ # Debug an HTTP request
21
+ ScraperUtils::DebugUtils.debug_request(
22
+ "GET",
23
+ "https://example.com/planning-apps",
24
+ parameters: { year: 2023 },
25
+ headers: { "Accept" => "application/json" }
26
+ )
27
+
28
+ # Debug a web page
29
+ ScraperUtils::DebugUtils.debug_page(page, "Checking search results page")
30
+
31
+ # Debug a specific page selector
32
+ ScraperUtils::DebugUtils.debug_selector(page, '.results-table', "Looking for development applications")
33
+ ```
34
+
35
+ ## Debug Level Constants
36
+
37
+ - `DISABLED_LEVEL = 0`: Debugging disabled
38
+ - `BASIC_LEVEL = 1`: Basic debugging information
39
+ - `VERBOSE_LEVEL = 2`: Verbose debugging information
40
+ - `TRACE_LEVEL = 3`: Detailed tracing information
41
+
42
+ ## Helper Methods
43
+
44
+ - `debug_level`: Get the current debug level
45
+ - `debug?(level)`: Check if debugging is enabled at the specified level
46
+ - `basic?`: Check if basic debugging is enabled
47
+ - `verbose?`: Check if verbose debugging is enabled
48
+ - `trace?`: Check if trace debugging is enabled
49
+
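+ As a minimal illustrative sketch (assuming the level constants are defined on `ScraperUtils::DebugUtils`), these helpers can guard optional debug output:
+
+ ```ruby
+ require "scraper_utils"
+
+ # Only produce the extra output when DEBUG is 2 or higher
+ if ScraperUtils::DebugUtils.verbose?
+   puts "Debug level is #{ScraperUtils::DebugUtils.debug_level}"
+ end
+
+ # Equivalent check against an explicit level constant
+ puts "Tracing enabled" if ScraperUtils::DebugUtils.debug?(ScraperUtils::DebugUtils::TRACE_LEVEL)
+ ```
+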
50
+ For full details, see the [DebugUtils class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/DebugUtils).
data/docs/getting_started.md ADDED
@@ -0,0 +1,145 @@
1
+ # Getting Started with ScraperUtils
2
+
3
+ This guide will help you get started with ScraperUtils for your PlanningAlerts scraper.
4
+
5
+ ## Installation
6
+
7
+ Add these lines to your [scraper's](https://www.planningalerts.org.au/how_to_write_a_scraper) Gemfile:
8
+
9
+ ```ruby
10
+ # Below:
11
+ gem "scraperwiki", git: "https://github.com/openaustralia/scraperwiki-ruby.git", branch: "morph_defaults"
12
+
13
+ # Add:
14
+ gem 'scraper_utils'
15
+ ```
16
+
17
+ And then execute:
18
+
19
+ ```bash
20
+ bundle install
21
+ ```
22
+
23
+ ## Environment Variables
24
+
25
+ ### `MORPH_AUSTRALIAN_PROXY`
26
+
27
+ On morph.io set the environment variable `MORPH_AUSTRALIAN_PROXY` to
28
+ `http://morph:password@au.proxy.oaf.org.au:8888`
29
+ replacing password with the real password.
30
+ Alternatively enter your own AUSTRALIAN proxy details when testing.
31
+
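+ For example, when testing locally you can export it in your shell (replace `password` with the real value):
+
+ ```bash
+ export MORPH_AUSTRALIAN_PROXY=http://morph:password@au.proxy.oaf.org.au:8888
+ ```
+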
32
+ ### `MORPH_EXPECT_BAD`
33
+
34
+ To stop morph complaining about sites that are known to be bad but that you still want tested,
35
+ list them in `MORPH_EXPECT_BAD`, for example:
36
+
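+ ```bash
+ # Illustrative authority labels - use your own comma-separated list
+ export MORPH_EXPECT_BAD=noosa,wagga
+ ```
+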
37
+ ### `MORPH_AUTHORITIES`
38
+
39
+ Optionally filter authorities for multi authority scrapers
40
+ via environment variable in morph > scraper > settings or
41
+ in your dev environment:
42
+
43
+ ```bash
44
+ export MORPH_AUTHORITIES=noosa,wagga
45
+ ```
46
+
47
+ ### `DEBUG`
48
+
49
+ Optionally enable verbose debugging messages when developing:
50
+
51
+ ```bash
52
+ export DEBUG=1 # basic; use 2 for verbose or 3 to trace nearly everything
53
+ ```
54
+
55
+ ## Example Scraper Implementation
56
+
57
+ Update your `scraper.rb` as follows:
58
+
59
+ ```ruby
60
+ #!/usr/bin/env ruby
61
+ # frozen_string_literal: true
62
+
63
+ $LOAD_PATH << "./lib"
64
+
65
+ require "scraper_utils"
66
+ require "your_scraper"
67
+
68
+ # Main Scraper class
69
+ class Scraper
70
+ AUTHORITIES = YourScraper::AUTHORITIES
71
+
72
+ def scrape(authorities, attempt)
73
+ exceptions = {}
74
+ authorities.each do |authority_label|
75
+ puts "\nCollecting feed data for #{authority_label}, attempt: #{attempt}..."
76
+
77
+ begin
78
+ ScraperUtils::DataQualityMonitor.start_authority(authority_label)
79
+ YourScraper.scrape(authority_label) do |record|
80
+ begin
81
+ record["authority_label"] = authority_label.to_s
82
+ ScraperUtils::DbUtils.save_record(record)
83
+ rescue ScraperUtils::UnprocessableRecord => e
84
+ ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)
85
+ exceptions[authority_label] = e
86
+ end
87
+ end
88
+ rescue StandardError => e
89
+ warn "#{authority_label}: ERROR: #{e}"
90
+ warn e.backtrace
91
+ exceptions[authority_label] = e
92
+ end
93
+ end
94
+
95
+ exceptions
96
+ end
97
+
98
+ def self.selected_authorities
99
+ ScraperUtils::AuthorityUtils.selected_authorities(AUTHORITIES.keys)
100
+ end
101
+
102
+ def self.run(authorities)
103
+ puts "Scraping authorities: #{authorities.join(', ')}"
104
+ start_time = Time.now
105
+ exceptions = new.scrape(authorities, 1)
106
+ ScraperUtils::LogUtils.log_scraping_run(
107
+ start_time,
108
+ 1,
109
+ authorities,
110
+ exceptions
111
+ )
112
+
113
+ unless exceptions.empty?
114
+ puts "\n***************************************************"
115
+ puts "Now retrying authorities which earlier had failures"
116
+ puts exceptions.keys.join(", ").to_s
117
+ puts "***************************************************"
118
+
119
+ start_time = Time.now
120
+ exceptions = new.scrape(exceptions.keys, 2)
121
+ ScraperUtils::LogUtils.log_scraping_run(
122
+ start_time,
123
+ 2,
124
+ authorities,
125
+ exceptions
126
+ )
127
+ end
128
+
129
+ ScraperUtils::LogUtils.report_on_results(authorities, exceptions)
130
+ end
131
+ end
132
+
133
+ if __FILE__ == $PROGRAM_NAME
134
+ ENV["MORPH_EXPECT_BAD"] ||= "wagga"
135
+ Scraper.run(Scraper.selected_authorities)
136
+ end
137
+ ```
138
+
139
+ For more advanced implementations, see the [Interleaving Requests documentation](interleaving_requests.md).
140
+
141
+ ## Next Steps
142
+
143
+ - [Reducing Server Load](reducing_server_load.md)
144
+ - [Mechanize Utilities](mechanize_utilities.md)
145
+ - [Debugging](debugging.md)
data/docs/interleaving_requests.md ADDED
@@ -0,0 +1,61 @@
1
+ # Interleaving Requests with FiberScheduler
2
+
3
+ The `ScraperUtils::FiberScheduler` provides a lightweight utility that:
4
+
5
+ * Works on other authorities while in the delay period for an authority's next request
6
+ * Optimizes the total scraper run time
7
+ * Allows you to increase the random delay for authorities without undue effect on total run time
8
+ * For the curious, it uses [ruby fibers](https://ruby-doc.org/core-2.5.8/Fiber.html) rather than threads as that is
9
+ a simpler system and thus easier to get right, understand and debug!
10
+ * Cycles around the authorities when compliant_mode, max_load and random_delay are disabled
11
+
12
+ ## Implementation
13
+
14
+ To enable fiber scheduling, change your scrape method to follow this pattern:
15
+
16
+ ```ruby
17
+ def scrape(authorities, attempt)
18
+ ScraperUtils::FiberScheduler.reset!
19
+ exceptions = {}
20
+ authorities.each do |authority_label|
21
+ ScraperUtils::FiberScheduler.register_operation(authority_label) do
22
+ ScraperUtils::FiberScheduler.log(
23
+ "Collecting feed data for #{authority_label}, attempt: #{attempt}..."
24
+ )
25
+ ScraperUtils::DataQualityMonitor.start_authority(authority_label)
26
+ YourScraper.scrape(authority_label) do |record|
27
+ record["authority_label"] = authority_label.to_s
28
+ ScraperUtils::DbUtils.save_record(record)
29
+ rescue ScraperUtils::UnprocessableRecord => e
30
+ ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)
31
+ exceptions[authority_label] = e
32
+ # Continues processing other records
33
+ end
34
+ rescue StandardError => e
35
+ warn "#{authority_label}: ERROR: #{e}"
36
+ warn e.backtrace || "No backtrace available"
37
+ exceptions[authority_label] = e
38
+ end
39
+ # end of register_operation block
40
+ end
41
+ ScraperUtils::FiberScheduler.run_all
42
+ exceptions
43
+ end
44
+ ```
45
+
46
+ ## Logging with FiberScheduler
47
+
48
+ Use `ScraperUtils::FiberScheduler.log` instead of `puts` when logging within the authority processing code.
49
+ This will prefix the output lines with the authority name, which is needed since the system will interleave the work and
50
+ thus the output.
51
+
52
+ ## Testing Considerations
53
+
54
+ This uses `ScraperUtils::RandomizeUtils` for determining the order of operations. Remember to add the following line to
55
+ `spec/spec_helper.rb`:
56
+
57
+ ```ruby
58
+ ScraperUtils::RandomizeUtils.sequential = true
59
+ ```
60
+
61
+ For full details, see the [FiberScheduler class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/FiberScheduler).
data/docs/mechanize_utilities.md ADDED
@@ -0,0 +1,92 @@
1
+ # Mechanize Utilities
2
+
3
+ This document provides detailed information about the Mechanize utilities provided by ScraperUtils.
4
+
5
+ ## MechanizeUtils
6
+
7
+ The `ScraperUtils::MechanizeUtils` module provides utilities for configuring and using Mechanize for web scraping.
8
+
9
+ ### Creating a Mechanize Agent
10
+
11
+ ```ruby
12
+ agent = ScraperUtils::MechanizeUtils.mechanize_agent(**options)
13
+ ```
14
+
15
+ ### Configuration Options
16
+
17
+ Add `client_options` to your AUTHORITIES configuration and move any of the following settings into it:
18
+
19
+ * `timeout: Integer` - Timeout for agent connections in case the server is slower than normal
20
+ * `australian_proxy: true` - Use the proxy url in the `MORPH_AUSTRALIAN_PROXY` env variable if the site is geo-locked
21
+ * `disable_ssl_certificate_check: true` - Disables SSL verification for old / incorrect certificates
22
+
23
+ Then adjust your code to accept `client_options` and pass them through to:
24
+ `ScraperUtils::MechanizeUtils.mechanize_agent(client_options || {})`
25
+ to receive a `Mechanize::Agent` configured accordingly.
26
+
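+ For example, a minimal sketch of passing `client_options` through (adapted from the earlier README example; `YourScraper` and `scrape_period` are illustrative names):
+
+ ```ruby
+ require "scraper_utils"
+
+ module YourScraper
+   # Note the extra keyword argument: client_options
+   def self.scrape_period(url:, period:, client_options: {})
+     agent = ScraperUtils::MechanizeUtils.mechanize_agent(**client_options)
+     # ... use agent to fetch and parse the pages for this authority ...
+   end
+ end
+ ```
+
+ The matching `AUTHORITIES` entry would then carry something like `client_options: { timeout: 90, australian_proxy: true }`.
+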
27
+ The agent returned is configured using Mechanize hooks to implement the desired delays automatically.
28
+
29
+ ### Default Configuration
30
+
31
+ By default, the Mechanize agent is configured with the following settings.
32
+ As you can see, the defaults can be changed using env variables.
33
+
34
+ Note - compliant mode forces max_load to be set to a value no greater than 50.
35
+
36
+ ```ruby
37
+ ScraperUtils::MechanizeUtils::AgentConfig.configure do |config|
38
+ config.default_timeout = ENV.fetch('MORPH_TIMEOUT', 60).to_i # 60
39
+ config.default_compliant_mode = ENV.fetch('MORPH_NOT_COMPLIANT', nil).to_s.empty? # true
40
+ config.default_random_delay = ENV.fetch('MORPH_RANDOM_DELAY', 5).to_i # 5
41
+ config.default_max_load = ENV.fetch('MORPH_MAX_LOAD', 33.3).to_f # 33.3
42
+ config.default_disable_ssl_certificate_check = !ENV.fetch('MORPH_DISABLE_SSL_CHECK', nil).to_s.empty? # false
43
+ config.default_australian_proxy = !ENV.fetch('MORPH_USE_PROXY', nil).to_s.empty? # false
44
+ config.default_user_agent = ENV.fetch('MORPH_USER_AGENT', nil) # Uses Mechanize user agent
45
+ end
46
+ ```
47
+
48
+ For full details, see the [MechanizeUtils class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/MechanizeUtils).
49
+
50
+ ## MechanizeActions
51
+
52
+ The `ScraperUtils::MechanizeActions` class provides a convenient way to execute a series of actions (like clicking links, filling forms) on a Mechanize page.
53
+
54
+ ### Action Format
55
+
56
+ ```ruby
57
+ actions = [
58
+ [:click, "Find an application"],
59
+ [:click, ["Submitted Last 28 Days", "Submitted Last 7 Days"]],
60
+ [:block, ->(page, args, agent, results) { [page, { custom_results: 'data' }] }]
61
+ ]
62
+
63
+ processor = ScraperUtils::MechanizeActions.new(agent)
64
+ result_page = processor.process(page, actions)
65
+ ```
66
+
67
+ ### Supported Actions
68
+
69
+ - `:click` - Clicks on a link or element matching the provided selector
70
+ - `:block` - Executes a custom block of code for complex scenarios
71
+
72
+ ### Selector Types
73
+
74
+ - Text selector (default): `"Find an application"`
75
+ - CSS selector: `"css:.button"`
76
+ - XPath selector: `"xpath://a[@class='button']"`
77
+
78
+ ### Replacements
79
+
80
+ You can use replacements in your action parameters:
81
+
82
+ ```ruby
83
+ replacements = { FROM_DATE: "2022-01-01", TO_DATE: "2022-03-01" }
84
+ processor = ScraperUtils::MechanizeActions.new(agent, replacements)
85
+
86
+ # Use replacements in actions
87
+ actions = [
88
+ [:click, "Search between {FROM_DATE} and {TO_DATE}"]
89
+ ]
90
+ ```
91
+
92
+ For full details, see the [MechanizeActions class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/MechanizeActions).
data/docs/randomizing_requests.md ADDED
@@ -0,0 +1,34 @@
1
+ # Randomizing Requests
2
+
3
+ `ScraperUtils::RandomizeUtils` provides utilities for randomizing processing order in scrapers,
4
+ which is helpful for distributing load and avoiding predictable patterns.
5
+
6
+ ## Basic Usage
7
+
8
+ Pass a `Collection` or `Array` to `ScraperUtils::RandomizeUtils.randomize_order` to randomize it in production, but
9
+ receive it as is when testing.
10
+
11
+ ```ruby
12
+ # Randomize a collection
13
+ randomized_authorities = ScraperUtils::RandomizeUtils.randomize_order(authorities)
14
+
15
+ # Use with a list of records from an index to randomize requests for details
16
+ ScraperUtils::RandomizeUtils.randomize_order(records).each do |record|
17
+ # Process record
18
+ end
19
+ ```
20
+
21
+ ## Testing Configuration
22
+
23
+ Enforce sequential mode when testing by adding the following code to `spec/spec_helper.rb`:
24
+
25
+ ```ruby
26
+ ScraperUtils::RandomizeUtils.sequential = true
27
+ ```
28
+
29
+ ## Notes
30
+
31
+ * You can also force sequential mode by setting the env variable `MORPH_PROCESS_SEQUENTIALLY` to `1` (any non-blank value)
32
+ * Testing using VCR requires sequential mode
33
+
34
+ For full details, see the [RandomizeUtils class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/RandomizeUtils).
data/docs/reducing_server_load.md ADDED
@@ -0,0 +1,63 @@
1
+ # Reducing Server Load
2
+
3
+ This document explains various techniques for reducing load on the servers you're scraping.
4
+
5
+ ## Intelligent Date Range Selection
6
+
7
+ To further reduce server load and speed up scrapers, we provide an intelligent date range selection mechanism
8
+ that can reduce server requests by 60% without significantly impacting delay in picking up changes.
9
+
10
+ The `ScraperUtils::DateRangeUtils#calculate_date_ranges` method provides a smart approach to searching historical
11
+ records:
12
+
13
+ - Always checks the most recent 4 days daily (configurable)
14
+ - Progressively reduces search frequency for older records
15
+ - Uses a Fibonacci-like progression to create natural, efficient search intervals
16
+ - Configurable `max_period` (default is 3 days)
17
+ - Merges adjacent search ranges and handles the changeover in search frequency by extending some searches
18
+
19
+ Example usage in your scraper:
20
+
21
+ ```ruby
22
+ date_ranges = ScraperUtils::DateRangeUtils.new.calculate_date_ranges
23
+ date_ranges.each do |from_date, to_date, _debugging_comment|
24
+ # Adjust your normal search code to use for this date range
25
+ your_search_records(from_date: from_date, to_date: to_date) do |record|
26
+ # process as normal
27
+ end
28
+ end
29
+ ```
30
+
31
+ Typical server load reductions:
32
+
33
+ * Max period 2 days : ~42% of the 33 days selected
34
+ * Max period 3 days : ~37% of the 33 days selected (default)
35
+ * Max period 5 days : ~35% (or ~31% when days = 45)
36
+
37
+ See the [DateRangeUtils class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/DateRangeUtils) for customizing defaults and passing options.
38
+
39
+ ## Cycle Utilities
40
+
41
+ Simple utility for cycling through options based on Julian day number to reduce server load and make your scraper seem less bot-like.
42
+
43
+ If the site uses tags like 'L28', 'L14' and 'L7' for the last 28, 14 and 7 days, an alternative solution
44
+ is to cycle through ['L28', 'L7', 'L14', 'L7'] which would drop the load by 50% and be less bot-like.
45
+
46
+ ```ruby
47
+ # Toggle between main and alternate behaviour
48
+ alternate = ScraperUtils::CycleUtils.position(2).even?
49
+
50
+ # OR cycle through a list of values day by day:
51
+ period = ScraperUtils::CycleUtils.pick(['L28', 'L7', 'L14', 'L7'])
52
+
53
+ # Use with any cycle size
54
+ pos = ScraperUtils::CycleUtils.position(7) # 0-6 cycle
55
+
56
+ # Test with specific date
57
+ pos = ScraperUtils::CycleUtils.position(3, date: Date.new(2024, 1, 5))
58
+
59
+ # Override for testing
60
+ # CYCLE_POSITION=2 bundle exec ruby scraper.rb
61
+ ```
62
+
63
+ For full details, see the [CycleUtils class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/CycleUtils).
data/lib/scraper_utils/cycle_utils.rb CHANGED
@@ -19,6 +19,7 @@ module ScraperUtils
19
19
  # @return value from array
20
20
  # Can override using CYCLE_POSITION ENV variable
21
21
  def self.pick(values, date: nil)
22
+ values = values.to_a
22
23
  values[position(values.size, date: date)]
23
24
  end
24
25
  end
data/lib/scraper_utils/mechanize_actions.rb ADDED
@@ -0,0 +1,154 @@
1
+ # frozen_string_literal: true
2
+
3
+ module ScraperUtils
4
+ # Class for executing a series of mechanize actions with flexible replacements
5
+ #
6
+ # @example Basic usage
7
+ # agent = ScraperUtils::MechanizeUtils.mechanize_agent
8
+ # page = agent.get("https://example.com")
9
+ #
10
+ # actions = [
11
+ # [:click, "Next Page"],
12
+ # [:click, ["Option A", "Option B"]] # Will select one randomly
13
+ # ]
14
+ #
15
+ # processor = ScraperUtils::MechanizeActions.new(agent)
16
+ # result_page = processor.process(page, actions)
17
+ #
18
+ # @example With replacements
19
+ # replacements = { FROM_DATE: "2022-01-01", TO_DATE: "2022-03-01" }
20
+ # processor = ScraperUtils::MechanizeActions.new(agent, replacements)
21
+ #
22
+ # # Use replacements in actions
23
+ # actions = [
24
+ # [:click, "Search between {FROM_DATE} and {TO_DATE}"]
25
+ # ]
26
+ class MechanizeActions
27
+ # @return [Mechanize] The mechanize agent used for actions
28
+ attr_reader :agent
29
+
30
+ # @return [Array] The results of each action performed
31
+ attr_reader :results
32
+
33
+ # Initialize a new MechanizeActions processor
34
+ #
35
+ # @param agent [Mechanize] The mechanize agent to use for actions
36
+ # @param replacements [Hash] Optional text replacements to apply to action parameters
37
+ def initialize(agent, replacements = {})
38
+ @agent = agent
39
+ @replacements = replacements || {}
40
+ @results = []
41
+ end
42
+
43
+ # Process a sequence of actions on a page
44
+ #
45
+ # @param page [Mechanize::Page] The starting page
46
+ # @param actions [Array<Array>] The sequence of actions to perform
47
+ # @return [Mechanize::Page] The resulting page after all actions
48
+ # @raise [ArgumentError] If an unknown action type is provided
49
+ #
50
+ # @example Action format
51
+ # actions = [
52
+ # [:click, "Link Text"], # Click on link with this text
53
+ # [:click, ["Option A", "Option B"]], # Click on one of these options (randomly selected)
54
+ # [:click, "css:.some-button"], # Use CSS selector
55
+ # [:click, "xpath://div[@id='results']/a"], # Use XPath selector
56
+ # [:block, ->(page, args, agent, results) { [page, { custom_results: 'data' }] }] # Custom block
57
+ # ]
58
+ def process(page, actions)
59
+ @results = []
60
+ current_page = page
61
+
62
+ actions.each do |action|
63
+ args = action.dup
64
+ action_type = args.shift
65
+ current_page, result =
66
+ case action_type
67
+ when :click
68
+ handle_click(current_page, args)
69
+ when :block
70
+ block = args.shift
71
+ block.call(current_page, args, agent, @results.dup)
72
+ else
73
+ raise ArgumentError, "Unknown action type: #{action_type}"
74
+ end
75
+
76
+ @results << result
77
+ end
78
+
79
+ current_page
80
+ end
81
+
82
+ private
83
+
84
+ # Handle a click action
85
+ #
86
+ # @param page [Mechanize::Page] The current page
87
+ # @param args [Array] The first element is the selection target
88
+ # @return [Array<Mechanize::Page, Hash>] The resulting page and status
89
+ def handle_click(page, args)
90
+ target = args.shift
91
+ if target.is_a?(Array)
92
+ target = ScraperUtils::CycleUtils.pick(target, date: @replacements[:TODAY])
93
+ end
94
+ target = apply_replacements(target)
95
+ element = select_element(page, target)
96
+ if element.nil?
97
+ raise "Unable to find click target: #{target}"
98
+ end
99
+
100
+ result = { action: :click, target: target }
101
+ next_page = element.click
102
+ [next_page, result]
103
+ end
104
+
105
+ # Select an element on the page based on selector string
106
+ #
107
+ # @param page [Mechanize::Page] The page to search in
108
+ # @param selector_string [String] The selector string
109
+ # @return [Mechanize::Element, nil] The selected element or nil if not found
110
+ def select_element(page, selector_string)
111
+ # Handle different selector types based on prefixes
112
+ if selector_string.start_with?("css:")
113
+ selector = selector_string.sub(/^css:/, '')
114
+ page.at_css(selector)
115
+ elsif selector_string.start_with?("xpath:")
116
+ selector = selector_string.sub(/^xpath:/, '')
117
+ page.at_xpath(selector)
118
+ else
119
+ # Default to text: for links
120
+ selector = selector_string.sub(/^text:/, '')
121
+ # Find links that include the text and don't have fragment-only hrefs
122
+ matching_links = page.links.select do |l|
123
+ l.text.include?(selector) &&
124
+ !(l.href.nil? || l.href.start_with?('#'))
125
+ end
126
+
127
+ if matching_links.empty?
128
+ # try case-insensitive
129
+ selector = selector.downcase
130
+ matching_links = page.links.select do |l|
131
+ l.text.downcase.include?(selector) &&
132
+ !(l.href.nil? || l.href.start_with?('#'))
133
+ end
134
+ end
135
+
136
+ # Get the link with the shortest (closest matching) text then the longest href
137
+ matching_links.min_by { |l| [l.text.strip.length, -l.href.length] }
138
+ end
139
+ end
140
+
141
+ # Apply text replacements to a string
142
+ #
143
+ # @param text [String, Object] The text to process or object to return unchanged
144
+ # @return [String, Object] The processed text with replacements or original object
145
+ def apply_replacements(text)
146
+ result = text.to_s
147
+
148
+ @replacements.each do |key, value|
149
+ result = result.gsub(/\{#{key}\}/, value.to_s)
150
+ end
151
+ result
152
+ end
153
+ end
154
+ end
data/lib/scraper_utils/version.rb CHANGED
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module ScraperUtils
4
- VERSION = "0.4.2"
4
+ VERSION = "0.5.0"
5
5
  end
data/lib/scraper_utils.rb CHANGED
@@ -9,6 +9,7 @@ require "scraper_utils/db_utils"
9
9
  require "scraper_utils/debug_utils"
10
10
  require "scraper_utils/fiber_scheduler"
11
11
  require "scraper_utils/log_utils"
12
+ require "scraper_utils/mechanize_actions"
12
13
  require "scraper_utils/mechanize_utils/agent_config"
13
14
  require "scraper_utils/mechanize_utils"
14
15
  require "scraper_utils/randomize_utils"
data/scraper_utils.gemspec CHANGED
@@ -13,8 +13,8 @@ Gem::Specification.new do |spec|
13
13
 
14
14
  spec.summary = "planningalerts scraper utilities"
15
15
  spec.description = "Utilities to help make planningalerts scrapers, " \
16
- "+especially multis easier to develop, run and debug."
17
- spec.homepage = "https://github.com/ianheggie-oaf/scraper_utils"
16
+ "especially multi authority scrapers, easier to develop, run and debug."
17
+ spec.homepage = "https://github.com/ianheggie-oaf/#{spec.name}"
18
18
  spec.license = "MIT"
19
19
 
20
20
  if spec.respond_to?(:metadata)
@@ -22,10 +22,11 @@ Gem::Specification.new do |spec|
22
22
 
23
23
  spec.metadata["homepage_uri"] = spec.homepage
24
24
  spec.metadata["source_code_uri"] = spec.homepage
25
- # spec.metadata["changelog_uri"] = "TODO: Put your gem's CHANGELOG.md URL here."
25
+ spec.metadata["documentation_uri"] = "https://rubydoc.info/gems/#{spec.name}/#{ScraperUtils::VERSION}"
26
+ spec.metadata["changelog_uri"] = "#{spec.metadata["source_code_uri"]}/blob/main/CHANGELOG.md"
26
27
  else
27
28
  raise "RubyGems 2.0 or newer is required to protect against " \
28
- "public gem pushes."
29
+ "public gem pushes."
29
30
  end
30
31
 
31
32
  # Specify which files should be added to the gem when it is released.
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: scraper_utils
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.4.2
4
+ version: 0.5.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Ian Heggie
8
8
  autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2025-03-03 00:00:00.000000000 Z
11
+ date: 2025-03-05 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: mechanize
@@ -52,8 +52,8 @@ dependencies:
52
52
  - - ">="
53
53
  - !ruby/object:Gem::Version
54
54
  version: '0'
55
- description: Utilities to help make planningalerts scrapers, +especially multis easier
56
- to develop, run and debug.
55
+ description: Utilities to help make planningalerts scrapers, especially multi authority
56
+ scrapers, easier to develop, run and debug.
57
57
  email:
58
58
  - ian@heggie.biz
59
59
  executables: []
@@ -74,8 +74,14 @@ files:
74
74
  - SPECS.md
75
75
  - bin/console
76
76
  - bin/setup
77
+ - docs/debugging.md
77
78
  - docs/example_scrape_with_fibers.rb
78
79
  - docs/example_scraper.rb
80
+ - docs/getting_started.md
81
+ - docs/interleaving_requests.md
82
+ - docs/mechanize_utilities.md
83
+ - docs/randomizing_requests.md
84
+ - docs/reducing_server_load.md
79
85
  - lib/scraper_utils.rb
80
86
  - lib/scraper_utils/adaptive_delay.rb
81
87
  - lib/scraper_utils/authority_utils.rb
@@ -86,6 +92,7 @@ files:
86
92
  - lib/scraper_utils/debug_utils.rb
87
93
  - lib/scraper_utils/fiber_scheduler.rb
88
94
  - lib/scraper_utils/log_utils.rb
95
+ - lib/scraper_utils/mechanize_actions.rb
89
96
  - lib/scraper_utils/mechanize_utils.rb
90
97
  - lib/scraper_utils/mechanize_utils/agent_config.rb
91
98
  - lib/scraper_utils/randomize_utils.rb
@@ -99,6 +106,8 @@ metadata:
99
106
  allowed_push_host: https://rubygems.org
100
107
  homepage_uri: https://github.com/ianheggie-oaf/scraper_utils
101
108
  source_code_uri: https://github.com/ianheggie-oaf/scraper_utils
109
+ documentation_uri: https://rubydoc.info/gems/scraper_utils/0.5.0
110
+ changelog_uri: https://github.com/ianheggie-oaf/scraper_utils/blob/main/CHANGELOG.md
102
111
  rubygems_mfa_required: 'true'
103
112
  post_install_message:
104
113
  rdoc_options: []