scraper_utils 0.4.2 → 0.5.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +4 -0
- data/README.md +50 -286
- data/docs/debugging.md +50 -0
- data/docs/getting_started.md +145 -0
- data/docs/interleaving_requests.md +61 -0
- data/docs/mechanize_utilities.md +92 -0
- data/docs/randomizing_requests.md +34 -0
- data/docs/reducing_server_load.md +63 -0
- data/lib/scraper_utils/cycle_utils.rb +1 -0
- data/lib/scraper_utils/mechanize_actions.rb +154 -0
- data/lib/scraper_utils/version.rb +1 -1
- data/lib/scraper_utils.rb +1 -0
- data/scraper_utils.gemspec +5 -4
- metadata +13 -4
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 63a24c24b497494b79c4d7e12f04a1bd2555068f37f50389f3906c0033817d7e
+  data.tar.gz: 6d6b96112dc3e2f9dc5a54de6318a544c240c0e3d5246ab4178c07346d0de7dc
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: eda8d10d996d51b7ef1d2610e21da31390c10dd29f4daa70bd5d9c3c8dc6eb9bed651803ccd6a59f53b03dae4fcd1ea016802e693f8828f4a13b92e07a0b046e
+  data.tar.gz: eba2704a99c6599a2789ec573fa335d7939a63d0c27b06886d6e905cd785e2095d7d0307e7aa1195a1209e022340fa5d027a72ccca61a350590058e998355d5d
data/CHANGELOG.md
CHANGED
data/README.md
CHANGED
@@ -3,8 +3,6 @@ ScraperUtils (Ruby)
 
 Utilities to help make planningalerts scrapers, especially multis, easier to develop, run and debug.
 
-WARNING: This is still under development! Breaking changes may occur in version 0.x!
-
 For Server Administrators
 -------------------------
 
@@ -18,331 +16,97 @@ To control our access:
 
 - Add a section for our user agent: `User-agent: ScraperUtils` (default)
 - Set a crawl delay, eg: `Crawl-delay: 20`
-- If needed specify disallowed paths
+- If needed specify disallowed paths: `Disallow: /private/`
 
-###
+### We play nice with your servers
 
-
+Our goal is to access public planning information with minimal impact on your services. The following features are on by
+default:
 
 - **Identify themselves**: Our user agent clearly indicates who we are and provides a link to the project repository:
   `Mozilla/5.0 (compatible; ScraperUtils/0.2.0 2025-02-22; +https://github.com/ianheggie-oaf/scraper_utils)`
 
-- **Limit server load**:
-
-
-  In the default "compliant mode" this defaults to a max load of 20% and is capped at 33%.
-
-- **Add randomized delays**: We add random delays between requests to further reduce our impact on servers, which should
-  bring us down to the load of a single industrious person.
-
-Extra utilities provided for scrapers to further reduce your server load:
+- **Limit server load**:
+  - We wait double your response time before making another request to avoid being a significant load on your server
+  - We also randomly add extra delays to give your server a chance to catch up with background tasks
 
-
+We also provide scraper developers other features to reduce overall load as well.
 
-
-
-  replaces the simplistic check of the last 30 days each day.
+For Scraper Developers
+----------------------
 
-
-
+We provide utilities to make developing, running and debugging your scraper easier in addition to the base utilities
+mentioned above.
 
-
+## Installation & Configuration
 
-
-------------
-
-Add these line to your application's Gemfile:
+Add to your [scraper's](https://www.planningalerts.org.au/how_to_write_a_scraper) Gemfile:
 
 ```ruby
 gem "scraperwiki", git: "https://github.com/openaustralia/scraperwiki-ruby.git", branch: "morph_defaults"
 gem 'scraper_utils'
 ```
 
-
-
-$ bundle
-
-Usage
------
-
-### Ruby Versions
-
-This gem is designed to be compatible the latest ruby supported by morph.io - other versions may work, but not tested:
-
-* ruby 3.2.2 - requires the `platform` file to contain `heroku_18` in the scraper
-* ruby 2.5.8 - `heroku_16` (the default)
-
-### Environment variables
-
-#### `MORPH_AUSTRALIAN_PROXY`
-
-On morph.io set the environment variable `MORPH_AUSTRALIAN_PROXY` to
-`http://morph:password@au.proxy.oaf.org.au:8888`
-replacing password with the real password.
-Alternatively enter your own AUSTRALIAN proxy details when testing.
-
-#### `MORPH_EXPECT_BAD`
-
-To avoid morph complaining about sites that are known to be bad,
-but you want them to keep being tested, list them on `MORPH_EXPECT_BAD`, for example:
-
-#### `MORPH_AUTHORITIES`
-
-Optionally filter authorities for multi authority scrapers
-via environment variable in morph > scraper > settings or
-in your dev environment:
-
-```bash
-export MORPH_AUTHORITIES=noosa,wagga
-```
-
-#### `DEBUG`
-
-Optionally enable verbose debugging messages when developing:
-
-```bash
-export DEBUG=1
-```
-
-### Extra Mechanize options
+For detailed setup and configuration options, see the [Getting Started guide](docs/getting_started.md).
 
-
+## Key Features
 
-
-* `australian_proxy: true` - Use the proxy url in the `MORPH_AUSTRALIAN_PROXY` env variable if the site is geo-locked
-* `disable_ssl_certificate_check: true` - Disabled SSL verification for old / incorrect certificates
+### Well-Behaved Web Client
 
-
+- Configure Mechanize agents with sensible defaults
+- Automatic rate limiting based on server response times
+- Supports robots.txt and crawl-delay directives
+- Supports extra actions required to get to results page
+- [Learn more about Mechanize utilities](docs/mechanize_utilities.md)
 
-
-`ScraperUtils::MechanizeUtils.mechanize_agent(client_options || {})`
-to receive a `Mechanize::Agent` configured accordingly.
+### Optimize Server Load
 
-
+- Intelligent date range selection (reduce server load by up to 60%)
+- Cycle utilities for rotating search parameters
+- [Learn more about reducing server load](docs/reducing_server_load.md)
 
-###
+### Improve Scraper Efficiency
 
-
-
+- Interleave requests to optimize run time
+  - [Learn more about interleaving requests](docs/interleaving_requests.md)
+- Randomize processing order for more natural request patterns
+  - [Learn more about randomizing requests](docs/randomizing_requests.md)
 
-
+### Error Handling & Quality Monitoring
 
-
-
-
-  config.default_compliant_mode = ENV.fetch('MORPH_NOT_COMPLIANT', nil).to_s.empty? # true
-  config.default_random_delay = ENV.fetch('MORPH_RANDOM_DELAY', 5).to_i # 5
-  config.default_max_load = ENV.fetch('MORPH_MAX_LOAD', 33.3).to_f # 33.3
-  config.default_disable_ssl_certificate_check = !ENV.fetch('MORPH_DISABLE_SSL_CHECK', nil).to_s.empty? # false
-  config.default_australian_proxy = !ENV.fetch('MORPH_USE_PROXY', nil).to_s.empty? # false
-  config.default_user_agent = ENV.fetch('MORPH_USER_AGENT', nil) # Uses Mechanize user agent
-end
-```
+- Record-level error handling with appropriate thresholds
+- Data quality monitoring during scraping
+- Detailed logging and reporting
 
-
-agents created by `ScraperUtils::MechanizeUtils.mechanize_agent` unless overridden by passing parameters to that method.
+### Developer Tools
 
-
+- Enhanced debugging utilities
+- Simple logging with authority context
+- [Learn more about debugging](docs/debugging.md)
 
-
-ScraperUtils::MechanizeUtils::AgentConfig.configure do |config|
-  config.default_random_delay = nil
-  config.default_max_load = 33
-end
-```
+## API Documentation
 
-
+Complete API documentation is available at [RubyDoc.info](https://rubydoc.info/gems/scraper_utils).
 
-
+## Ruby Versions
 
-
-  record.
-Then just before you would normally yield a record for saving, rescue that exception and:
-
-* Call `ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)`
-* NOT yield the record for saving
-
-In your code update where create a mechanize agent (often `YourScraper.scrape_period`) and the `AUTHORITIES` hash
-to move Mechanize agent options (like `australian_proxy` and `timeout`) to a hash under a new key: `client_options`.
-For example:
-
-```ruby
-require "scraper_utils"
-#...
-module YourScraper
-  # ... some code ...
-
-  # Note the extra parameter: client_options
-  def self.scrape_period(url:, period:, webguest: "P1.WEBGUEST",
-                         client_options: {}
-                        )
-    agent = ScraperUtils::MechanizeUtils.mechanize_agent(**client_options)
-
-    # ... rest of code ...
-  end
-
-  # ... rest of code ...
-end
-```
+This gem is designed to be compatible with Ruby versions supported by morph.io:
 
-
-
-The following code will cause debugging info to be output:
-
-```bash
-export DEBUG=1
-```
-
-Add the following immediately before requesting or examining pages
-
-```ruby
-require 'scraper_utils'
-
-# Debug an HTTP request
-ScraperUtils::DebugUtils.debug_request(
-  "GET",
-  "https://example.com/planning-apps",
-  parameters: { year: 2023 },
-  headers: { "Accept" => "application/json" }
-)
-
-# Debug a web page
-ScraperUtils::DebugUtils.debug_page(page, "Checking search results page")
-
-# Debug a specific page selector
-ScraperUtils::DebugUtils.debug_selector(page, '.results-table', "Looking for development applications")
-```
+* Ruby 3.2.2 - requires the `platform` file to contain `heroku_18` in the scraper
+* Ruby 2.5.8 - `heroku_16` (the default)
 
-
----------------------
-
-The `ScraperUtils::FiberScheduler` provides a lightweight utility that:
-
-* works on the other authorities whilst in the delay period for an authorities next request
-* thus optimizing the total scraper run time
-* allows you to increase the random delay for authorities without undue effect on total run time
-* For the curious, it uses [ruby fibers](https://ruby-doc.org/core-2.5.8/Fiber.html) rather than threads as that is
-  a simpler system and thus easier to get right, understand and debug!
-* Cycles around the authorities when compliant_mode, max_load and random_delay are disabled
-
-To enable change the scrape method to be like [example scrape method using fibers](docs/example_scrape_with_fibers.rb)
-
-And use `ScraperUtils::FiberScheduler.log` instead of `puts` when logging within the authority processing code.
-This will prefix the output lines with the authority name, which is needed since the system will interleave the work and
-thus the output.
-
-This uses `ScraperUtils::RandomizeUtils` as described below. Remember to add the recommended line to
-`spec/spec_heper.rb`.
-
-Intelligent Date Range Selection
---------------------------------
-
-To further reduce server load and speed up scrapers, we provide an intelligent date range selection mechanism
-that can reduce server requests by 60% without significantly impacting delay in picking up changes.
-
-The `ScraperUtils::DateRangeUtils#calculate_date_ranges` method provides a smart approach to searching historical
-records:
-
-- Always checks the most recent 4 days daily (configurable)
-- Progressively reduces search frequency for older records
-- Uses a Fibonacci-like progression to create natural, efficient search intervals
-- Configurable `max_period` (default is 3 days)
-- merges adjacent search ranges and handles the changeover in search frequency by extending some searches
-
-Example usage in your scraper:
-
-```ruby
-date_ranges = ScraperUtils::DateRangeUtils.new.calculate_date_ranges
-date_ranges.each do |from_date, to_date, _debugging_comment|
-  # Adjust your normal search code to use for this date range
-  your_search_records(from_date: from_date, to_date: to_date) do |record|
-    # process as normal
-  end
-end
-```
-
-Typical server load reductions:
-
-* Max period 2 days : ~42% of the 33 days selected
-* Max period 3 days : ~37% of the 33 days selected (default)
-* Max period 5 days : ~35% (or ~31% when days = 45)
-
-See the class documentation for customizing defaults and passing options.
-
-### Other possibilities
-
-If the site uses tags like 'L28', 'L14' and 'L7' for the last 28, 14 and 7 days, an alternative solution
-is to cycle through ['L28', 'L7', 'L14', 'L7'] which would drop the load by 50% and be less Bot like.
-
-Cycle Utils
------------
-Simple utility for cycling through options based on Julian day number:
-
-```ruby
-# Toggle between main and alternate behaviour
-alternate = ScraperUtils::CycleUtils.position(2).even?
-
-# OR cycle through a list of values day by day:
-period = ScraperUtils::CycleUtils.pick(['L28', 'L7', 'L14', 'L7'])
-
-# Use with any cycle size
-pos = ScraperUtils::CycleUtils.position(7) # 0-6 cycle
-
-# Test with specific date
-pos = ScraperUtils::CycleUtils.position(3, date: Date.new(2024, 1, 5))
-
-# Override for testing
-# CYCLE_POSITION=2 bundle exec ruby scraper.rb
-```
-
-Randomizing Requests
---------------------
-
-Pass a `Collection` or `Array` to `ScraperUtils::RandomizeUtils.randomize_order` to randomize it in production, but
-receive in as is when testing.
-
-Use this with the list of records scraped from an index to randomise any requests for further information to be less Bot
-like.
-
-### Spec setup
-
-You should enforce sequential mode when testing by adding the following code to `spec/spec_helper.rb` :
-
-```
-ScraperUtils::RandomizeUtils.sequential = true
-```
-
-Note:
-
-* You can also force sequential mode by setting the env variable `MORPH_PROCESS_SEQUENTIALLY` to `1` (any non blank)
-* testing using VCR requires sequential mode
-
-Development
------------
+## Development
 
 After checking out the repo, run `bin/setup` to install dependencies.
 Then, run `rake test` to run the tests.
 
-You can also run `bin/console` for an interactive prompt that will allow you to experiment.
-
 To install this gem onto your local machine, run `bundle exec rake install`.
 
-
-then run `bundle exec rake release`,
-which will create a git tag for the version, push git commits and tags, and push the `.gem` file
-to [rubygems.org](https://rubygems.org).
-
-NOTE: You need to use ruby 3.2.2 instead of 2.5.8 to release to OTP protected accounts.
-
-Contributing
-------------
+## Contributing
 
-Bug reports and pull requests with working tests are welcome
+Bug reports and pull requests with working tests are welcome
+on [GitHub](https://github.com/ianheggie-oaf/scraper_utils).
 
-
-
-License
--------
+## License
 
 The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
-
data/docs/debugging.md
ADDED
@@ -0,0 +1,50 @@
+# Debugging Techniques
+
+ScraperUtils provides several debugging utilities to help you troubleshoot your scrapers.
+
+## Enabling Debug Mode
+
+Set the `DEBUG` environment variable to enable debugging:
+
+```bash
+export DEBUG=1  # Basic debugging
+export DEBUG=2  # Verbose debugging
+export DEBUG=3  # Trace debugging with detailed content
+```
+
+## Debug Utilities
+
+The `ScraperUtils::DebugUtils` module provides several methods for debugging:
+
+```ruby
+# Debug an HTTP request
+ScraperUtils::DebugUtils.debug_request(
+  "GET",
+  "https://example.com/planning-apps",
+  parameters: { year: 2023 },
+  headers: { "Accept" => "application/json" }
+)
+
+# Debug a web page
+ScraperUtils::DebugUtils.debug_page(page, "Checking search results page")
+
+# Debug a specific page selector
+ScraperUtils::DebugUtils.debug_selector(page, '.results-table', "Looking for development applications")
+```
+
+## Debug Level Constants
+
+- `DISABLED_LEVEL = 0`: Debugging disabled
+- `BASIC_LEVEL = 1`: Basic debugging information
+- `VERBOSE_LEVEL = 2`: Verbose debugging information
+- `TRACE_LEVEL = 3`: Detailed tracing information
+
+## Helper Methods
+
+- `debug_level`: Get the current debug level
+- `debug?(level)`: Check if debugging is enabled at the specified level
+- `basic?`: Check if basic debugging is enabled
+- `verbose?`: Check if verbose debugging is enabled
+- `trace?`: Check if trace debugging is enabled
+
+For full details, see the [DebugUtils class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/DebugUtils).
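The helper methods listed above can be used to guard expensive diagnostic output in your own scraper code. A minimal sketch follows; the `records` array and the messages are illustrative, not part of the gem:

```ruby
require "scraper_utils"

# Only build the expensive summary string when verbose debugging is enabled
if ScraperUtils::DebugUtils.verbose?
  puts "Index page yielded #{records.size} records: #{records.map { |r| r['council_reference'] }.inspect}"
end

# Or test an explicit level using the constants listed above
if ScraperUtils::DebugUtils.debug?(ScraperUtils::DebugUtils::TRACE_LEVEL)
  puts "Full record dump: #{records.inspect}"
end
```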
data/docs/getting_started.md
ADDED
@@ -0,0 +1,145 @@
+# Getting Started with ScraperUtils
+
+This guide will help you get started with ScraperUtils for your PlanningAlerts scraper.
+
+## Installation
+
+Add these lines to your [scraper's](https://www.planningalerts.org.au/how_to_write_a_scraper) Gemfile:
+
+```ruby
+# Below:
+gem "scraperwiki", git: "https://github.com/openaustralia/scraperwiki-ruby.git", branch: "morph_defaults"
+
+# Add:
+gem 'scraper_utils'
+```
+
+And then execute:
+
+```bash
+bundle install
+```
+
+## Environment Variables
+
+### `MORPH_AUSTRALIAN_PROXY`
+
+On morph.io set the environment variable `MORPH_AUSTRALIAN_PROXY` to
+`http://morph:password@au.proxy.oaf.org.au:8888`
+replacing password with the real password.
+Alternatively enter your own AUSTRALIAN proxy details when testing.
+
+### `MORPH_EXPECT_BAD`
+
+To avoid morph complaining about sites that are known to be bad,
+but you want them to keep being tested, list them on `MORPH_EXPECT_BAD`, for example:
+
+### `MORPH_AUTHORITIES`
+
+Optionally filter authorities for multi authority scrapers
+via environment variable in morph > scraper > settings or
+in your dev environment:
+
+```bash
+export MORPH_AUTHORITIES=noosa,wagga
+```
+
+### `DEBUG`
+
+Optionally enable verbose debugging messages when developing:
+
+```bash
+export DEBUG=1 # for basic, or 2 for verbose or 3 for tracing nearly everything
+```
+
+## Example Scraper Implementation
+
+Update your `scraper.rb` as follows:
+
+```ruby
+#!/usr/bin/env ruby
+# frozen_string_literal: true
+
+$LOAD_PATH << "./lib"
+
+require "scraper_utils"
+require "your_scraper"
+
+# Main Scraper class
+class Scraper
+  AUTHORITIES = YourScraper::AUTHORITIES
+
+  def scrape(authorities, attempt)
+    exceptions = {}
+    authorities.each do |authority_label|
+      puts "\nCollecting feed data for #{authority_label}, attempt: #{attempt}..."
+
+      begin
+        ScraperUtils::DataQualityMonitor.start_authority(authority_label)
+        YourScraper.scrape(authority_label) do |record|
+          begin
+            record["authority_label"] = authority_label.to_s
+            ScraperUtils::DbUtils.save_record(record)
+          rescue ScraperUtils::UnprocessableRecord => e
+            ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)
+            exceptions[authority_label] = e
+          end
+        end
+      rescue StandardError => e
+        warn "#{authority_label}: ERROR: #{e}"
+        warn e.backtrace
+        exceptions[authority_label] = e
+      end
+    end
+
+    exceptions
+  end
+
+  def self.selected_authorities
+    ScraperUtils::AuthorityUtils.selected_authorities(AUTHORITIES.keys)
+  end
+
+  def self.run(authorities)
+    puts "Scraping authorities: #{authorities.join(', ')}"
+    start_time = Time.now
+    exceptions = new.scrape(authorities, 1)
+    ScraperUtils::LogUtils.log_scraping_run(
+      start_time,
+      1,
+      authorities,
+      exceptions
+    )
+
+    unless exceptions.empty?
+      puts "\n***************************************************"
+      puts "Now retrying authorities which earlier had failures"
+      puts exceptions.keys.join(", ").to_s
+      puts "***************************************************"
+
+      start_time = Time.now
+      exceptions = new.scrape(exceptions.keys, 2)
+      ScraperUtils::LogUtils.log_scraping_run(
+        start_time,
+        2,
+        authorities,
+        exceptions
+      )
+    end
+
+    ScraperUtils::LogUtils.report_on_results(authorities, exceptions)
+  end
+end
+
+if __FILE__ == $PROGRAM_NAME
+  ENV["MORPH_EXPECT_BAD"] ||= "wagga"
+  Scraper.run(Scraper.selected_authorities)
+end
+```
+
+For more advanced implementations, see the [Interleaving Requests documentation](interleaving_requests.md).
+
+## Next Steps
+
+- [Reducing Server Load](reducing_server_load.md)
+- [Mechanize Utilities](mechanize_utilities.md)
+- [Debugging](debugging.md)
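The example `scraper.rb` above requires a `your_scraper.rb` that exposes an `AUTHORITIES` hash and a `scrape(authority_label)` method yielding one hash per application. A hypothetical skeleton of that file is sketched below; the authority names, URLs, `client_options` values and record field values are illustrative only:

```ruby
# lib/your_scraper.rb - hypothetical skeleton assumed by the scraper.rb above
require "date"
require "scraper_utils"

module YourScraper
  AUTHORITIES = {
    noosa: { url: "https://example.com/planning-applications" },
    wagga: { url: "https://example.com/da-search",
             client_options: { australian_proxy: true, timeout: 60 } }
  }.freeze

  def self.scrape(authority_label)
    authority = AUTHORITIES[authority_label]
    # Pass any per-authority client_options through to the configured agent
    agent = ScraperUtils::MechanizeUtils.mechanize_agent(**(authority[:client_options] || {}))
    page = agent.get(authority[:url])

    # Parse the index page (parsing code omitted) and yield one hash per record
    yield(
      "council_reference" => "DA-2025-0001",
      "address" => "1 Example Street, Example QLD 4567",
      "description" => "Example development application",
      "info_url" => authority[:url],
      "date_scraped" => Date.today.to_s
    )
  end
end
```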
data/docs/interleaving_requests.md
ADDED
@@ -0,0 +1,61 @@
+# Interleaving Requests with FiberScheduler
+
+The `ScraperUtils::FiberScheduler` provides a lightweight utility that:
+
+* Works on other authorities while in the delay period for an authority's next request
+* Optimizes the total scraper run time
+* Allows you to increase the random delay for authorities without undue effect on total run time
+* For the curious, it uses [ruby fibers](https://ruby-doc.org/core-2.5.8/Fiber.html) rather than threads as that is
+  a simpler system and thus easier to get right, understand and debug!
+* Cycles around the authorities when compliant_mode, max_load and random_delay are disabled
+
+## Implementation
+
+To enable fiber scheduling, change your scrape method to follow this pattern:
+
+```ruby
+def scrape(authorities, attempt)
+  ScraperUtils::FiberScheduler.reset!
+  exceptions = {}
+  authorities.each do |authority_label|
+    ScraperUtils::FiberScheduler.register_operation(authority_label) do
+      ScraperUtils::FiberScheduler.log(
+        "Collecting feed data for #{authority_label}, attempt: #{attempt}..."
+      )
+      ScraperUtils::DataQualityMonitor.start_authority(authority_label)
+      YourScraper.scrape(authority_label) do |record|
+        record["authority_label"] = authority_label.to_s
+        ScraperUtils::DbUtils.save_record(record)
+      rescue ScraperUtils::UnprocessableRecord => e
+        ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)
+        exceptions[authority_label] = e
+        # Continues processing other records
+      end
+    rescue StandardError => e
+      warn "#{authority_label}: ERROR: #{e}"
+      warn e.backtrace || "No backtrace available"
+      exceptions[authority_label] = e
+    end
+    # end of register_operation block
+  end
+  ScraperUtils::FiberScheduler.run_all
+  exceptions
+end
+```
+
+## Logging with FiberScheduler
+
+Use `ScraperUtils::FiberScheduler.log` instead of `puts` when logging within the authority processing code.
+This will prefix the output lines with the authority name, which is needed since the system will interleave the work and
+thus the output.
+
+## Testing Considerations
+
+This uses `ScraperUtils::RandomizeUtils` for determining the order of operations. Remember to add the following line to
+`spec/spec_helper.rb`:
+
+```ruby
+ScraperUtils::RandomizeUtils.sequential = true
+```
+
+For full details, see the [FiberScheduler class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/FiberScheduler).
data/docs/mechanize_utilities.md
ADDED
@@ -0,0 +1,92 @@
+# Mechanize Utilities
+
+This document provides detailed information about the Mechanize utilities provided by ScraperUtils.
+
+## MechanizeUtils
+
+The `ScraperUtils::MechanizeUtils` module provides utilities for configuring and using Mechanize for web scraping.
+
+### Creating a Mechanize Agent
+
+```ruby
+agent = ScraperUtils::MechanizeUtils.mechanize_agent(**options)
+```
+
+### Configuration Options
+
+Add `client_options` to your AUTHORITIES configuration and move any of the following settings into it:
+
+* `timeout: Integer` - Timeout for agent connections in case the server is slower than normal
+* `australian_proxy: true` - Use the proxy url in the `MORPH_AUSTRALIAN_PROXY` env variable if the site is geo-locked
+* `disable_ssl_certificate_check: true` - Disables SSL verification for old / incorrect certificates
+
+Then adjust your code to accept `client_options` and pass them through to:
+`ScraperUtils::MechanizeUtils.mechanize_agent(client_options || {})`
+to receive a `Mechanize::Agent` configured accordingly.
+
+The agent returned is configured using Mechanize hooks to implement the desired delays automatically.
+
+### Default Configuration
+
+By default, the Mechanize agent is configured with the following settings.
+As you can see, the defaults can be changed using env variables.
+
+Note - compliant mode forces max_load to be set to a value no greater than 50.
+
+```ruby
+ScraperUtils::MechanizeUtils::AgentConfig.configure do |config|
+  config.default_timeout = ENV.fetch('MORPH_TIMEOUT', 60).to_i # 60
+  config.default_compliant_mode = ENV.fetch('MORPH_NOT_COMPLIANT', nil).to_s.empty? # true
+  config.default_random_delay = ENV.fetch('MORPH_RANDOM_DELAY', 5).to_i # 5
+  config.default_max_load = ENV.fetch('MORPH_MAX_LOAD', 33.3).to_f # 33.3
+  config.default_disable_ssl_certificate_check = !ENV.fetch('MORPH_DISABLE_SSL_CHECK', nil).to_s.empty? # false
+  config.default_australian_proxy = !ENV.fetch('MORPH_USE_PROXY', nil).to_s.empty? # false
+  config.default_user_agent = ENV.fetch('MORPH_USER_AGENT', nil) # Uses Mechanize user agent
+end
+```
+
+For full details, see the [MechanizeUtils class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/MechanizeUtils).
+
+## MechanizeActions
+
+The `ScraperUtils::MechanizeActions` class provides a convenient way to execute a series of actions (like clicking links, filling forms) on a Mechanize page.
+
+### Action Format
+
+```ruby
+actions = [
+  [:click, "Find an application"],
+  [:click, ["Submitted Last 28 Days", "Submitted Last 7 Days"]],
+  [:block, ->(page, args, agent, results) { [new_page, result_data] }]
+]
+
+processor = ScraperUtils::MechanizeActions.new(agent)
+result_page = processor.process(page, actions)
+```
+
+### Supported Actions
+
+- `:click` - Clicks on a link or element matching the provided selector
+- `:block` - Executes a custom block of code for complex scenarios
+
+### Selector Types
+
+- Text selector (default): `"Find an application"`
+- CSS selector: `"css:.button"`
+- XPath selector: `"xpath://a[@class='button']"`
+
+### Replacements
+
+You can use replacements in your action parameters:
+
+```ruby
+replacements = { FROM_DATE: "2022-01-01", TO_DATE: "2022-03-01" }
+processor = ScraperUtils::MechanizeActions.new(agent, replacements)
+
+# Use replacements in actions
+actions = [
+  [:click, "Search between {FROM_DATE} and {TO_DATE}"]
+]
+```
+
+For full details, see the [MechanizeActions class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/MechanizeActions).
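The `:block` action format above uses placeholder names (`new_page`, `result_data`). A concrete sketch of a block that fills and submits a search form is shown below; the form name, field names and dates are illustrative, and a block must return a `[next_page, result_data]` pair:

```ruby
# Custom :block action: fill a hypothetical search form and submit it
submit_search = lambda do |page, _args, _agent, _results|
  form = page.form_with(name: "searchForm")  # hypothetical form name
  form["dateFrom"] = "2025-01-01"
  form["dateTo"]   = "2025-01-31"
  [form.submit, { action: :block, form: "searchForm" }]
end

actions = [
  [:click, "Find an application"],
  [:block, submit_search]
]

processor = ScraperUtils::MechanizeActions.new(agent)
result_page = processor.process(page, actions)
```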
data/docs/randomizing_requests.md
ADDED
@@ -0,0 +1,34 @@
+# Randomizing Requests
+
+`ScraperUtils::RandomizeUtils` provides utilities for randomizing processing order in scrapers,
+which is helpful for distributing load and avoiding predictable patterns.
+
+## Basic Usage
+
+Pass a `Collection` or `Array` to `ScraperUtils::RandomizeUtils.randomize_order` to randomize it in production, but
+receive it as is when testing.
+
+```ruby
+# Randomize a collection
+randomized_authorities = ScraperUtils::RandomizeUtils.randomize_order(authorities)
+
+# Use with a list of records from an index to randomize requests for details
+records.each do |record|
+  # Process record
+end
+```
+
+## Testing Configuration
+
+Enforce sequential mode when testing by adding the following code to `spec/spec_helper.rb`:
+
+```ruby
+ScraperUtils::RandomizeUtils.sequential = true
+```
+
+## Notes
+
+* You can also force sequential mode by setting the env variable `MORPH_PROCESS_SEQUENTIALLY` to `1` (any non-blank value)
+* Testing using VCR requires sequential mode
+
+For full details, see the [RandomizeUtils class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/RandomizeUtils).
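The prose above suggests applying this to the list of index records before fetching detail pages; a small sketch of that pattern follows, where `scrape_index_page` and the `info_url` field are illustrative names, not part of the gem:

```ruby
records = scrape_index_page(page)  # hypothetical: returns an Array of record hashes

ScraperUtils::RandomizeUtils.randomize_order(records).each do |record|
  detail_page = agent.get(record["info_url"])
  # parse detail_page and save the full record as normal ...
end
```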
data/docs/reducing_server_load.md
ADDED
@@ -0,0 +1,63 @@
+# Reducing Server Load
+
+This document explains various techniques for reducing load on the servers you're scraping.
+
+## Intelligent Date Range Selection
+
+To further reduce server load and speed up scrapers, we provide an intelligent date range selection mechanism
+that can reduce server requests by 60% without significantly impacting delay in picking up changes.
+
+The `ScraperUtils::DateRangeUtils#calculate_date_ranges` method provides a smart approach to searching historical
+records:
+
+- Always checks the most recent 4 days daily (configurable)
+- Progressively reduces search frequency for older records
+- Uses a Fibonacci-like progression to create natural, efficient search intervals
+- Configurable `max_period` (default is 3 days)
+- Merges adjacent search ranges and handles the changeover in search frequency by extending some searches
+
+Example usage in your scraper:
+
+```ruby
+date_ranges = ScraperUtils::DateRangeUtils.new.calculate_date_ranges
+date_ranges.each do |from_date, to_date, _debugging_comment|
+  # Adjust your normal search code to use for this date range
+  your_search_records(from_date: from_date, to_date: to_date) do |record|
+    # process as normal
+  end
+end
+```
+
+Typical server load reductions:
+
+* Max period 2 days : ~42% of the 33 days selected
+* Max period 3 days : ~37% of the 33 days selected (default)
+* Max period 5 days : ~35% (or ~31% when days = 45)
+
+See the [DateRangeUtils class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/DateRangeUtils) for customizing defaults and passing options.
+
+## Cycle Utilities
+
+Simple utility for cycling through options based on Julian day number to reduce server load and make your scraper seem less bot-like.
+
+If the site uses tags like 'L28', 'L14' and 'L7' for the last 28, 14 and 7 days, an alternative solution
+is to cycle through ['L28', 'L7', 'L14', 'L7'] which would drop the load by 50% and be less bot-like.
+
+```ruby
+# Toggle between main and alternate behaviour
+alternate = ScraperUtils::CycleUtils.position(2).even?
+
+# OR cycle through a list of values day by day:
+period = ScraperUtils::CycleUtils.pick(['L28', 'L7', 'L14', 'L7'])
+
+# Use with any cycle size
+pos = ScraperUtils::CycleUtils.position(7) # 0-6 cycle
+
+# Test with specific date
+pos = ScraperUtils::CycleUtils.position(3, date: Date.new(2024, 1, 5))
+
+# Override for testing
+# CYCLE_POSITION=2 bundle exec ruby scraper.rb
+```
+
+For full details, see the [CycleUtils class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/CycleUtils).
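The date ranges above can be combined with the `{FROM_DATE}`/`{TO_DATE}` replacements described in [Mechanize Utilities](mechanize_utilities.md). A sketch of that combination follows; `agent`, `index_page` and the search link text are illustrative, not part of the gem:

```ruby
date_ranges = ScraperUtils::DateRangeUtils.new.calculate_date_ranges
date_ranges.each do |from_date, to_date, _comment|
  replacements = { FROM_DATE: from_date.to_s, TO_DATE: to_date.to_s }
  processor = ScraperUtils::MechanizeActions.new(agent, replacements)
  results_page = processor.process(index_page, [
    [:click, "Search from {FROM_DATE} to {TO_DATE}"]
  ])
  # parse results_page for this date range as normal ...
end
```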
data/lib/scraper_utils/mechanize_actions.rb
ADDED
@@ -0,0 +1,154 @@
+# frozen_string_literal: true
+
+module ScraperUtils
+  # Class for executing a series of mechanize actions with flexible replacements
+  #
+  # @example Basic usage
+  #   agent = ScraperUtils::MechanizeUtils.mechanize_agent
+  #   page = agent.get("https://example.com")
+  #
+  #   actions = [
+  #     [:click, "Next Page"],
+  #     [:click, ["Option A", "Option B"]] # Will select one randomly
+  #   ]
+  #
+  #   processor = ScraperUtils::MechanizeActions.new(agent)
+  #   result_page = processor.process(page, actions)
+  #
+  # @example With replacements
+  #   replacements = { FROM_DATE: "2022-01-01", TO_DATE: "2022-03-01" }
+  #   processor = ScraperUtils::MechanizeActions.new(agent, replacements)
+  #
+  #   # Use replacements in actions
+  #   actions = [
+  #     [:click, "Search between {FROM_DATE} and {TO_DATE}"]
+  #   ]
+  class MechanizeActions
+    # @return [Mechanize] The mechanize agent used for actions
+    attr_reader :agent
+
+    # @return [Array] The results of each action performed
+    attr_reader :results
+
+    # Initialize a new MechanizeActions processor
+    #
+    # @param agent [Mechanize] The mechanize agent to use for actions
+    # @param replacements [Hash] Optional text replacements to apply to action parameters
+    def initialize(agent, replacements = {})
+      @agent = agent
+      @replacements = replacements || {}
+      @results = []
+    end
+
+    # Process a sequence of actions on a page
+    #
+    # @param page [Mechanize::Page] The starting page
+    # @param actions [Array<Array>] The sequence of actions to perform
+    # @return [Mechanize::Page] The resulting page after all actions
+    # @raise [ArgumentError] If an unknown action type is provided
+    #
+    # @example Action format
+    #   actions = [
+    #     [:click, "Link Text"], # Click on link with this text
+    #     [:click, ["Option A", "Option B"]], # Click on one of these options (randomly selected)
+    #     [:click, "css:.some-button"], # Use CSS selector
+    #     [:click, "xpath://div[@id='results']/a"], # Use XPath selector
+    #     [:block, ->(page, args, agent, results) { [page, { custom_results: 'data' }] }] # Custom block
+    #   ]
+    def process(page, actions)
+      @results = []
+      current_page = page
+
+      actions.each do |action|
+        args = action.dup
+        action_type = args.shift
+        current_page, result =
+          case action_type
+          when :click
+            handle_click(current_page, args)
+          when :block
+            block = args.shift
+            block.call(current_page, args, agent, @results.dup)
+          else
+            raise ArgumentError, "Unknown action type: #{action_type}"
+          end
+
+        @results << result
+      end
+
+      current_page
+    end
+
+    private
+
+    # Handle a click action
+    #
+    # @param page [Mechanize::Page] The current page
+    # @param args [Array] The first element is the selection target
+    # @return [Array<Mechanize::Page, Hash>] The resulting page and status
+    def handle_click(page, args)
+      target = args.shift
+      if target.is_a?(Array)
+        target = ScraperUtils::CycleUtils.pick(target, date: @replacements[:TODAY])
+      end
+      target = apply_replacements(target)
+      element = select_element(page, target)
+      if element.nil?
+        raise "Unable to find click target: #{target}"
+      end
+
+      result = { action: :click, target: target }
+      next_page = element.click
+      [next_page, result]
+    end
+
+    # Select an element on the page based on selector string
+    #
+    # @param page [Mechanize::Page] The page to search in
+    # @param selector_string [String] The selector string
+    # @return [Mechanize::Element, nil] The selected element or nil if not found
+    def select_element(page, selector_string)
+      # Handle different selector types based on prefixes
+      if selector_string.start_with?("css:")
+        selector = selector_string.sub(/^css:/, '')
+        page.at_css(selector)
+      elsif selector_string.start_with?("xpath:")
+        selector = selector_string.sub(/^xpath:/, '')
+        page.at_xpath(selector)
+      else
+        # Default to text: for links
+        selector = selector_string.sub(/^text:/, '')
+        # Find links that include the text and don't have fragment-only hrefs
+        matching_links = page.links.select do |l|
+          l.text.include?(selector) &&
+            !(l.href.nil? || l.href.start_with?('#'))
+        end
+
+        if matching_links.empty?
+          # try case-insensitive
+          selector = selector.downcase
+          matching_links = page.links.select do |l|
+            l.text.downcase.include?(selector) &&
+              !(l.href.nil? || l.href.start_with?('#'))
+          end
+        end
+
+        # Get the link with the shortest (closest matching) text then the longest href
+        matching_links.min_by { |l| [l.text.strip.length, -l.href.length] }
+      end
+    end
+
+    # Apply text replacements to a string
+    #
+    # @param text [String, Object] The text to process or object to return unchanged
+    # @return [String, Object] The processed text with replacements or original object
+    def apply_replacements(text)
+      result = text.to_s
+
+      @replacements.each do |key, value|
+        result = result.gsub(/\{#{key}\}/, value.to_s)
+      end
+      result
+    end
+  end
+end
data/lib/scraper_utils.rb
CHANGED
@@ -9,6 +9,7 @@ require "scraper_utils/db_utils"
 require "scraper_utils/debug_utils"
 require "scraper_utils/fiber_scheduler"
 require "scraper_utils/log_utils"
+require "scraper_utils/mechanize_actions"
 require "scraper_utils/mechanize_utils/agent_config"
 require "scraper_utils/mechanize_utils"
 require "scraper_utils/randomize_utils"
data/scraper_utils.gemspec
CHANGED
@@ -13,8 +13,8 @@ Gem::Specification.new do |spec|
 
   spec.summary = "planningalerts scraper utilities"
   spec.description = "Utilities to help make planningalerts scrapers, " \
-
-  spec.homepage = "https://github.com/ianheggie-oaf
+                     "especially multi authority scrapers, easier to develop, run and debug."
+  spec.homepage = "https://github.com/ianheggie-oaf/#{spec.name}"
   spec.license = "MIT"
 
   if spec.respond_to?(:metadata)
@@ -22,10 +22,11 @@ Gem::Specification.new do |spec|
 
     spec.metadata["homepage_uri"] = spec.homepage
     spec.metadata["source_code_uri"] = spec.homepage
-
+    spec.metadata["documentation_uri"] = "https://rubydoc.info/gems/#{spec.name}/#{ScraperUtils::VERSION}"
+    spec.metadata["changelog_uri"] = "#{spec.metadata["source_code_uri"]}/blob/main/CHANGELOG.md"
  else
    raise "RubyGems 2.0 or newer is required to protect against " \
-
+          "public gem pushes."
  end
 
  # Specify which files should be added to the gem when it is released.
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: scraper_utils
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.5.0
 platform: ruby
 authors:
 - Ian Heggie
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2025-03-
+date: 2025-03-05 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: mechanize
@@ -52,8 +52,8 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '0'
-description: Utilities to help make planningalerts scrapers,
-  to develop, run and debug.
+description: Utilities to help make planningalerts scrapers, especially multi authority
+  scrapers, easier to develop, run and debug.
 email:
 - ian@heggie.biz
 executables: []
@@ -74,8 +74,14 @@ files:
 - SPECS.md
 - bin/console
 - bin/setup
+- docs/debugging.md
 - docs/example_scrape_with_fibers.rb
 - docs/example_scraper.rb
+- docs/getting_started.md
+- docs/interleaving_requests.md
+- docs/mechanize_utilities.md
+- docs/randomizing_requests.md
+- docs/reducing_server_load.md
 - lib/scraper_utils.rb
 - lib/scraper_utils/adaptive_delay.rb
 - lib/scraper_utils/authority_utils.rb
@@ -86,6 +92,7 @@ files:
 - lib/scraper_utils/debug_utils.rb
 - lib/scraper_utils/fiber_scheduler.rb
 - lib/scraper_utils/log_utils.rb
+- lib/scraper_utils/mechanize_actions.rb
 - lib/scraper_utils/mechanize_utils.rb
 - lib/scraper_utils/mechanize_utils/agent_config.rb
 - lib/scraper_utils/randomize_utils.rb
@@ -99,6 +106,8 @@ metadata:
   allowed_push_host: https://rubygems.org
   homepage_uri: https://github.com/ianheggie-oaf/scraper_utils
   source_code_uri: https://github.com/ianheggie-oaf/scraper_utils
+  documentation_uri: https://rubydoc.info/gems/scraper_utils/0.5.0
+  changelog_uri: https://github.com/ianheggie-oaf/scraper_utils/blob/main/CHANGELOG.md
   rubygems_mfa_required: 'true'
 post_install_message:
 rdoc_options: []