RubyGems - scraper_utils - Versions diffs - 0.12.1 → 0.14.1 - Mend

scraper_utils 0.12.1 → 0.14.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (12)

checksums.yaml +4 -4
data/CHANGELOG.md +34 -3
data/lib/scraper_utils/db_utils.rb +9 -19
data/lib/scraper_utils/host_throttler.rb +82 -0
data/lib/scraper_utils/log_utils.rb +6 -14
data/lib/scraper_utils/mechanize_utils/agent_config.rb +27 -21
data/lib/scraper_utils/misc_utils.rb +25 -12
data/lib/scraper_utils/pa_validation.rb +88 -0
data/lib/scraper_utils/spec_support.rb +34 -16
data/lib/scraper_utils/version.rb +1 -1
data/lib/scraper_utils.rb +2 -0
metadata +5 -3

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: fc8aab3f24d29cc1e3fe44880d9030e7149372189a8691acd6e628e2a260c57c
-  data.tar.gz: a82ecb97878e57f5d0cf75d623c94a6ff54e7805c43a7be140e6a2da7bbcca49
+  metadata.gz: 03b44a667992331d6e36bb6eca68afc286205846d7be06263694fed52b5e2d30
+  data.tar.gz: 9f0dd276223f1b22dd688453e1769199cbda34efa5141d58e546a8ddcb85c795
 SHA512:
-  metadata.gz: bd6d178afa8669916b70f2c6bdb56b77a8a5995d18e47c64b360a8d5d334b479645d20792d80759afde1aac6550d79f46a0504030d8a8920c8dea55fa6ad6132
-  data.tar.gz: 20a2c5f144cfc8e2106d2e4643d7f3b0cb35110a5519d63bb0d8c1e655c8958fe73115089352a71272a419168c50299c8ce985d318be6bfa9635e64c9c4fb238
+  metadata.gz: b42e0be0f9e42d9a83588cf7dcbb98ec079d01262340d2e6fef8ac7201c3d80faa645351631f60f767186721a58580f4f1e5e09c130a3a32aebb4f301dbfbdfc
+  data.tar.gz: e3cec3345d0af13026259600a54e417efd0c36394f1bc22ecac1a25573551a3a2e51482b060ad1b72ed7ba4850d55bf9f8032321d1b8c1ae6eab581244e92410

data/CHANGELOG.md CHANGED

@@ -1,5 +1,29 @@
 # Changelog
+## 0.14.1 - 2026-03-04
+* Can pass `known_suburbs: ['Suburb', ...]` to `ScraperUtils::SpecSupport.validate_addresses_are_geocodable!` and
+  `ScraperUtils::SpecSupport.geocodable?` to validate addresses that have neither postcodes nor uppercase suburb names
+* Can pass `ignore_case: true` to relax the requirement for either a postcode or an uppercase suburb when you don't want to
+  pass `known_suburbs`.
+* Move Throttling to HostThrottler
+## 0.13.1 - 2026-02-21
+* Added PaValidation that validates based
+  on [How to write a scraper](https://www.planningalerts.org.au/how_to_write_a_scraper)
+  * `ScraperUtils::PaValidation.validate_record!` raises an exception if record is invalid; it calls `validate_record`
+  * `ScraperUtils::PaValidation.validate_record` returns an Array of error messages if record is invalid, otherwise nil
+* Added `ScraperUtils::SpecSupport.validate_unique_references!` which validates that all references are unique
+  * Note: due to saving records based on the unique reference, any duplicates are overwritten and are never presented to
+    PA, so this is basically checking that you are not losing records due to an incorrect reference
+* Refactored `DbUtils.save_record` to use PaValidation
+* Merged `cleanup_old_records` from LogUtils into the same method in DbUtils, bringing across the `force` named param
+  * `LogUtils.cleanup_old_records` now warns that it is deprecated
+* Increased test coverage
+* Fixed edge case in `ScraperUtils::MechanizeUtils::AgentConfig#verify_proxy_works` - it now raises an exception on json
+  parse error
 ## 0.12.1 - 2026-02-18
 * Added override for the threshold of when to abandon scraping due to unprocessable records
@@ -34,15 +58,18 @@
 ## 0.9.0 - 2025-07-11
-**Significant cleanup - removed code we ended up not using as none of the councils are actually concerned about server load**
+**Significant cleanup - removed code we ended up not using as none of the councils are actually concerned about server
+load**
 * Refactored example code into simple callable methods
 * Expand test for geocodeable addresses to include comma between postcode and state at the end of the address.
 ### Added
 - `ScraperUtils::SpecSupport.validate_addresses_are_geocodable!` - validates percentage of geocodable addresses
 - `ScraperUtils::SpecSupport.validate_descriptions_are_reasonable!` - validates percentage of reasonable descriptions
-- `ScraperUtils::SpecSupport.validate_uses_one_valid_info_url!` - validates single global info_url usage and availability
+- `ScraperUtils::SpecSupport.validate_uses_one_valid_info_url!` - validates single global info_url usage and
+  availability
 - `ScraperUtils::SpecSupport.validate_info_urls_have_expected_details!` - validates info_urls contain expected content
 - `ScraperUtils::MathsUtils.fibonacci_series` - generates fibonacci sequence up to max value
 - `bot_check_expected` parameter to info_url validation methods for handling reCAPTCHA/Cloudflare protection
@@ -53,10 +80,12 @@
 - .editorconfig as an example for scrapers
 ### Fixed
 - Typo in `geocodable?` method debug output (`has_suburb_stats` → `has_suburb_states`)
 - Code example in `docs/enhancing_specs.md`
 ### Updated
 - `ScraperUtils::SpecSupport.acceptable_description?` - Accept 1 or 2 word descriptors with planning specific terms
 - Code example in `docs/enhancing_specs.md` to reflect new support methods
 - Code examples
@@ -68,6 +97,7 @@
 - Added extra street types
 ### Removed
 - Unused CycleUtils
 - Unused DateRangeUtils
 - Unused RandomizeUtils
@@ -150,7 +180,8 @@ Fixed broken v0.2.0
 ## 0.2.0 - 2025-02-28
-Added FiberScheduler, enabled compliant mode with delays by default and simplified usage removing third retry without proxy
+Added FiberScheduler, enabled compliant mode with delays by default and simplified usage removing third retry without
+proxy
 ## 0.1.0 - 2025-02-23
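
Review note: the 0.14.1 entries describe two ways to relax the suburb requirement. A minimal standalone sketch of the locality part of that check follows. The postcode pattern and state list are not shown in this diff, so `POSTCODE`, `STATES` and the helper name `locality_hint?` are assumptions made for illustration, not the gem's API.

```ruby
# Standalone sketch; not the gem's implementation.
# STATES and POSTCODE are assumptions made for illustration.
STATES = %w[ACT NSW NT QLD SA TAS VIC WA].freeze
POSTCODE = /\b\d{4}\b/.freeze

# True when the address carries a postcode, an all-caps suburb,
# or one of the caller-supplied known suburbs.
def locality_hint?(address, known_suburbs: [])
  has_postcode = address.match?(POSTCODE)
  uppercase_words = address.scan(/\b[A-Z]{2,}\b/)
  has_uppercase_suburb = uppercase_words.any? { |word| !STATES.include?(word) }
  has_known_suburb = known_suburbs.any? { |suburb| address.include?(suburb) }
  has_postcode || has_uppercase_suburb || has_known_suburb
end
```

An address such as "1 Smith St, Richmond VIC" has no postcode and no all-caps suburb, so under these assumptions it only passes when "Richmond" is supplied as a known suburb.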

data/lib/scraper_utils/db_utils.rb CHANGED

@@ -1,5 +1,6 @@
 # frozen_string_literal: true
+require "uri"
 require "scraperwiki"
 module ScraperUtils
@@ -27,23 +28,10 @@ module ScraperUtils
     # @raise [ScraperUtils::UnprocessableRecord] If record fails validation
     # @return [void]
     def self.save_record(record)
-      # Validate required fields
-      required_fields = %w[council_reference address description info_url date_scraped]
-      required_fields.each do |field|
-        if record[field].to_s.empty?
-          raise ScraperUtils::UnprocessableRecord, "Missing required field: #{field}"
-        end
-      end
-      # Validate date formats
-      %w[date_scraped date_received on_notice_from on_notice_to].each do |date_field|
-        Date.parse(record[date_field]) unless record[date_field].to_s.empty?
-      rescue ArgumentError
-        raise ScraperUtils::UnprocessableRecord,
-              "Invalid date format for #{date_field}: #{record[date_field].inspect}"
-      end
+      record = record.transform_keys(&:to_s)
+      ScraperUtils::PaValidation.validate_record!(record)
-      # Determine primary key based on presence of authority_label
+      # Determine the primary key based on the presence of authority_label
       primary_key = if record.key?("authority_label")
                       %w[authority_label council_reference]
                     else
@@ -58,7 +46,7 @@ module ScraperUtils
     end
     # Clean up records older than 30 days and approx once a month vacuum the DB
-    def self.cleanup_old_records
+    def self.cleanup_old_records(force: false)
       cutoff_date = (Date.today - 30).to_s
       vacuum_cutoff_date = (Date.today - 35).to_s
@@ -70,15 +58,17 @@ module ScraperUtils
       deleted_count = stats["count"]
       oldest_date = stats["oldest"]
-      return unless deleted_count.positive? || ENV["VACUUM"]
+      return unless deleted_count.positive? || ENV["VACUUM"] || force
       LogUtils.log "Deleting #{deleted_count} applications scraped between #{oldest_date} and #{cutoff_date}"
       ScraperWiki.sqliteexecute("DELETE FROM data WHERE date_scraped < ?", [cutoff_date])
-      return unless rand < 0.03 || (oldest_date && oldest_date < vacuum_cutoff_date) || ENV["VACUUM"]
+      return unless rand < 0.03 || (oldest_date && oldest_date < vacuum_cutoff_date) || ENV["VACUUM"] || force
       LogUtils.log "  Running VACUUM to reclaim space..."
       ScraperWiki.sqliteexecute("VACUUM")
+    rescue SqliteMagic::NoSuchTable => e
+      ScraperUtils::LogUtils.log "Ignoring: #{e} whilst cleaning old records" if ScraperUtils::DebugUtils.trace?
     end
   end
 end
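
Review note: the two guard clauses in `cleanup_old_records` can be read as a pair of predicates. The sketch below restates them as standalone helpers, with the random draw, the `VACUUM` environment value and today's date passed in so the logic can be exercised without a database. The helper names and injected arguments are hypothetical and do not exist in the gem.

```ruby
require "date"

# Sketch of the guards in cleanup_old_records; the helper names and the
# injected arguments are illustrative and do not exist in the gem.
def delete_needed?(deleted_count, env_vacuum: nil, force: false)
  deleted_count.positive? || !env_vacuum.nil? || force
end

def vacuum_needed?(oldest_date, today:, draw:, env_vacuum: nil, force: false)
  vacuum_cutoff = (today - 35).to_s
  # ISO 8601 date strings sort chronologically, so string comparison works.
  draw < 0.03 || (!oldest_date.nil? && oldest_date < vacuum_cutoff) || !env_vacuum.nil? || force
end
```

With `force: true` both predicates are true, which matches the new keyword making the delete and the VACUUM run unconditionally.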

data/lib/scraper_utils/host_throttler.rb ADDED

@@ -0,0 +1,82 @@
+# frozen_string_literal: true
+module ScraperUtils
+  # Tracks per-host next-allowed-request time so that time spent parsing
+  # and saving records counts toward the crawl delay rather than being
+  # added on top of it.
+  #
+  # Usage:
+  #   throttler = HostThrottler.new(crawl_delay: 1.0, max_load: 50.0)
+  #   throttler.before_request(hostname)   # sleep until ready
+  #   # ... make request ...
+  #   throttler.after_request(hostname)    # record timing, schedule next slot
+  #   throttler.after_request(hostname, overloaded: true)  # normal delay + 2x response time + 5s
+  class HostThrottler
+    MAX_DELAY = 120.0
+    # @param crawl_delay [Float] minimum seconds between requests per host
+    # @param max_load [Float] target server load percentage (10..100);
+    #   50 means response_time == pause_time
+    def initialize(crawl_delay: 0.0, max_load: nil)
+      @crawl_delay = crawl_delay.to_f
+      # Clamp between 10 (delay 9x response) and 100 (no extra delay)
+      @max_load = max_load ? max_load.to_f.clamp(10.0, 100.0) : nil
+      @next_request_at = {}   # hostname => Time
+      @request_started_at = {} # hostname => Time
+    end
+    # Sleep until this host's throttle window has elapsed.
+    # Records when the request actually started.
+    # @param hostname [String]
+    # @return [void]
+    def before_request(hostname)
+      target = @next_request_at[hostname]
+      if target
+        remaining = target - Time.now
+        sleep(remaining) if remaining > 0
+      end
+      @request_started_at[hostname] = Time.now
+    end
+    # Calculate and store the next allowed request time for this host.
+    # @param hostname [String]
+    # @param overloaded [Boolean] true when the server signalled overload
+    #   (HTTP 429/500/503); adds twice the response time plus 5 seconds to the normal delay.
+    # @return [void]
+    def after_request(hostname, overloaded: false)
+      started = @request_started_at[hostname] || Time.now
+      response_time = Time.now - started
+      delay = @crawl_delay
+      if @max_load
+        delay += (100.0 - @max_load) * response_time / @max_load
+      end
+      if overloaded
+        delay = delay + response_time * 2 + 5.0
+      end
+      delay = delay.round(3).clamp(0.0, MAX_DELAY)
+      @next_request_at[hostname] = Time.now + delay
+      if DebugUtils.basic?
+        msg = "HostThrottler: #{hostname} response=#{response_time.round(3)}s"
+        msg += " OVERLOADED" if overloaded
+        msg += ", Will delay #{delay}s before next request"
+        LogUtils.log(msg)
+      end
+    end
+    # Duck-type check for HTTP overload errors across Mechanize, HTTParty, etc.
+    # @param error [Exception]
+    # @return [Boolean]
+    def self.overload_error?(error)
+      code = if error.respond_to?(:response) && error.response.respond_to?(:code)
+               error.response.code.to_i          # HTTParty style
+             elsif error.respond_to?(:response_code)
+               error.response_code.to_i          # Mechanize style
+             end
+      [429, 500, 503].include?(code)
+    end
+  end
+end
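
Review note: the delay arithmetic in `after_request` is easier to check with numbers. The sketch below copies the formula from the diff into a pure function; `next_delay` is a hypothetical helper, not part of the gem, and it takes the response time as an argument instead of measuring it.

```ruby
# Sketch of the delay arithmetic in HostThrottler#after_request as shown in
# the diff; next_delay is a hypothetical helper, not part of the gem.
MAX_DELAY = 120.0

def next_delay(response_time:, crawl_delay: 0.0, max_load: nil, overloaded: false)
  delay = crawl_delay.to_f
  # max_load is a percentage: 50 means pause as long as the response took.
  delay += (100.0 - max_load) * response_time / max_load if max_load
  # Overload adds twice the response time plus a flat 5 seconds.
  delay += response_time * 2 + 5.0 if overloaded
  delay.round(3).clamp(0.0, MAX_DELAY)
end
```

For a 2 second response with `crawl_delay: 1.0` and `max_load: 50.0` this gives 3.0 seconds, or 12.0 seconds when flagged as overloaded. With the constructor defaults (`crawl_delay: 0.0`, `max_load: nil`) the computed delay is zero unless the request is flagged as overloaded.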

data/lib/scraper_utils/log_utils.rb CHANGED

@@ -85,7 +85,7 @@ module ScraperUtils
         failed
       )
-      cleanup_old_records
+      DbUtils.cleanup_old_records
     end
     # Extracts the first relevant line from backtrace that's from our project
@@ -225,21 +225,13 @@ module ScraperUtils
       )
     end
+    # Moved to DbUtils
+    # :nocov:
     def self.cleanup_old_records(force: false)
-      cutoff = (Date.today - LOG_RETENTION_DAYS).to_s
-      return if !force && @last_cutoff == cutoff
-      @last_cutoff = cutoff
-      [SUMMARY_TABLE, LOG_TABLE].each do |table|
-        ScraperWiki.sqliteexecute(
-          "DELETE FROM #{table} WHERE date(run_at) < date(?)",
-          [cutoff]
-        )
-      rescue SqliteMagic::NoSuchTable => e
-        ScraperUtils::LogUtils.log "Ignoring: #{e} whilst cleaning old records" if ScraperUtils::DebugUtils.trace?
-      end
+      warn "`#{self}.#{__method__}` is deprecated and will be removed in a future release, use `ScraperUtils::DbUtils.cleanup_old_records` instead.", category: :deprecated
+      ScraperUtils::DbUtils.cleanup_old_records(force: force)
     end
+    # :nocov:
     # Extracts meaningful backtrace - 3 lines from ruby/gem and max 6 in total
     def self.extract_meaningful_backtrace(error)

data/lib/scraper_utils/mechanize_utils/agent_config.rb CHANGED

@@ -2,6 +2,7 @@
 require "mechanize"
 require "ipaddr"
+require_relative "../host_throttler"
 module ScraperUtils
   module MechanizeUtils
@@ -76,8 +77,7 @@ module ScraperUtils
       attr_reader :user_agent
       # Give access for testing
-      attr_reader :max_load, :crawl_delay
+      attr_reader :max_load, :crawl_delay, :throttler
       # Creates Mechanize agent configuration with sensible defaults overridable via configure
       # @param timeout [Integer, nil] Timeout for agent connections (default: 60)
@@ -107,6 +107,7 @@ module ScraperUtils
         @crawl_delay = crawl_delay.nil? ? self.class.default_crawl_delay : crawl_delay.to_f
         # Clamp between 10 (delay 9 x response) and 100 (no delay)
         @max_load = (max_load.nil? ? self.class.default_max_load : max_load).to_f.clamp(10.0, 100.0)
+        @throttler = HostThrottler.new(crawl_delay: @crawl_delay, max_load: @max_load)
         # Validate proxy URL format if proxy will be used
         @australian_proxy &&= !ScraperUtils.australian_proxy.to_s.empty?
@@ -155,6 +156,7 @@ module ScraperUtils
         agent.pre_connect_hooks << method(:pre_connect_hook)
         agent.post_connect_hooks << method(:post_connect_hook)
+        agent.error_hooks << method(:error_hook) if agent.respond_to?(:error_hooks)
       end
       private
@@ -175,38 +177,41 @@ module ScraperUtils
       end
       def pre_connect_hook(_agent, request)
-        @connection_started_at = Time.now
-        return unless DebugUtils.verbose?
-        ScraperUtils::LogUtils.log(
-          "Pre Connect request: #{request.inspect} at #{@connection_started_at}"
-        )
+        hostname = request.respond_to?(:uri) ? request.uri.host : 'unknown'
+        @throttler.before_request(hostname)
+        if DebugUtils.verbose?
+          ScraperUtils::LogUtils.log(
+            "Pre Connect request: #{request.inspect}"
+          )
+        end
       end
       def post_connect_hook(_agent, uri, response, _body)
         raise ArgumentError, "URI must be present in post-connect hook" unless uri
-        response_time = Time.now - @connection_started_at
-        response_delay = @crawl_delay || 0.0
-        if @crawl_delay ||@max_load
-          response_delay += response_time
-          if @max_load && @max_load >= 1
-            response_delay += (100.0 - @max_load) * response_time / @max_load
-          end
-          response_delay = response_delay.round(3)
-        end
+        status = response.respond_to?(:code) ? response.code.to_i : nil
+        overloaded = [429, 500, 503].include?(status)
+        @throttler.after_request(uri.host, overloaded: overloaded)
         if DebugUtils.basic?
           ScraperUtils::LogUtils.log(
-            "Post Connect uri: #{uri.inspect}, response: #{response.inspect} " \
-              "after #{response_time} seconds#{response_delay > 0.0 ? ", pausing for #{response_delay} seconds" : ""}"
+            "Post Connect uri: #{uri.inspect}, response: #{response.inspect}"
           )
         end
-        sleep(response_delay) if response_delay > 0.0
         response
       end
+      def error_hook(_agent, error)
+        # Best-effort: record the error against whatever host we can find
+        # Mechanize errors often carry the URI in the message; fall back to 'unknown'
+        hostname = if error.respond_to?(:uri)
+                     error.uri.host
+                   else
+                     'unknown'
+                   end
+        @throttler.after_request(hostname, overloaded: HostThrottler.overload_error?(error))
+      end
       def verify_proxy_works(agent)
         $stderr.flush
         $stdout.flush
@@ -227,6 +232,7 @@ module ScraperUtils
         rescue JSON::ParserError => e
           puts "Couldn't parse public_headers: #{e}! Raw response:"
           puts my_headers.inspect
+          raise "Couldn't parse public_headers as JSON: #{e}!"
         end
       rescue Timeout::Error => e # Includes Net::OpenTimeout
         raise "Proxy check timed out: #{e}"
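
Review note: both `post_connect_hook` and `error_hook` reduce a response or an error to a single overload flag using the status codes 429, 500 and 503. The sketch below mirrors the duck-typed lookup from `HostThrottler.overload_error?`. The two Struct classes are stand-ins invented for this example; they are not real HTTParty or Mechanize error classes.

```ruby
# Sketch of the duck-typed status lookup used by HostThrottler.overload_error?;
# the Struct classes are stand-ins, not real HTTParty or Mechanize errors.
OVERLOAD_CODES = [429, 500, 503].freeze

FakeResponse = Struct.new(:code)
ResponseStyleError = Struct.new(:response)     # exposes error.response.code
CodeStyleError = Struct.new(:response_code)    # exposes error.response_code

def overload_error?(error)
  code = if error.respond_to?(:response) && error.response.respond_to?(:code)
           error.response.code.to_i
         elsif error.respond_to?(:response_code)
           error.response_code.to_i
         end
  # code is nil when the error exposes neither interface, so this is false.
  OVERLOAD_CODES.include?(code)
end
```

An error that exposes neither interface, such as a plain `StandardError`, is treated as not overloaded.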

data/lib/scraper_utils/misc_utils.rb CHANGED

@@ -1,23 +1,36 @@
 # frozen_string_literal: true
+require_relative "host_throttler"
 module ScraperUtils
   # Misc Standalone Utilities
   module MiscUtils
-    MAX_PAUSE = 120.0
+    THROTTLE_HOSTNAME = "block"
     class << self
-      attr_accessor :pause_duration
-      # Throttle block to be nice to servers we are scraping
-      def throttle_block(extra_delay: 0.5)
-        if @pause_duration&.positive?
-          puts "Pausing #{@pause_duration}s" if ScraperUtils::DebugUtils.trace?
-          sleep(@pause_duration)
+      # Throttle block to be nice to servers we are scraping.
+      # Time spent inside the block (parsing, saving) counts toward the delay.
+      def throttle_block
+        throttler.before_request(THROTTLE_HOSTNAME)
+        begin
+          result = yield
+          throttler.after_request(THROTTLE_HOSTNAME)
+          result
+        rescue StandardError => e
+          throttler.after_request(THROTTLE_HOSTNAME, overloaded: HostThrottler.overload_error?(e))
+          raise
         end
-        start_time = Time.now.to_f
-        result = yield
-        @pause_duration = (Time.now.to_f - start_time + extra_delay).round(3).clamp(0.0, MAX_PAUSE)
-        result
+      end
+      # Reset the internal throttler (useful in tests)
+      def reset_throttler!
+        @throttler = nil
+      end
+      private
+      def throttler
+        @throttler ||= HostThrottler.new
       end
     end
   end
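
Review note: the new `throttle_block` always pairs `before_request` with exactly one `after_request`, whether the block returns or raises. The sketch below shows that shape against a stand-in throttler that only records calls. `RecordingThrottler` and `throttled` are hypothetical names, and unlike the gem this sketch always passes `overloaded: false` on error.

```ruby
# Sketch of the begin/rescue shape of throttle_block. RecordingThrottler is a
# stand-in that only records calls; it does not sleep or time anything.
class RecordingThrottler
  attr_reader :calls

  def initialize
    @calls = []
  end

  def before_request(host)
    @calls << [:before, host]
  end

  def after_request(host, overloaded: false)
    @calls << [:after, host, overloaded]
  end
end

def throttled(throttler, host)
  throttler.before_request(host)
  begin
    result = yield
    throttler.after_request(host)
    result
  rescue StandardError
    # The gem derives overloaded from the error; this sketch always passes false.
    throttler.after_request(host, overloaded: false)
    raise
  end
end
```

The block's return value is passed through unchanged, and any error is re-raised after the throttler has been told the request finished.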

data/lib/scraper_utils/pa_validation.rb ADDED

@@ -0,0 +1,88 @@
+# frozen_string_literal: true
+require "uri"
+require "date"
+module ScraperUtils
+  # Validates scraper records match Planning Alerts requirements before submission.
+  # Use in specs to catch problems early rather than waiting for PA's import.
+  module PaValidation
+    REQUIRED_FIELDS = %w[council_reference address description date_scraped].freeze
+    # Validates a single record (hash with string keys) against PA's rules.
+    # @param record [Hash] The record to validate
+    # @raise [ScraperUtils::UnprocessableRecord] if there are error messages
+    def self.validate_record!(record)
+      errors = validate_record(record)
+      raise(ScraperUtils::UnprocessableRecord, errors.join("; ")) if errors&.any?
+    end
+    # Validates a single record (hash with string keys) against PA's rules.
+    # @param record [Hash] The record to validate
+    # @return [Array<String>, nil] Array of error messages, or nil if valid
+    def self.validate_record(record)
+      record = record.transform_keys(&:to_s)
+      errors = []
+      validate_presence(record, errors)
+      validate_info_url(record, errors)
+      validate_dates(record, errors)
+      errors.empty? ? nil : errors
+    end
+    private
+    def self.validate_presence(record, errors)
+      REQUIRED_FIELDS.each do |field|
+        errors << "#{field} can't be blank" if record[field].to_s.strip.empty?
+      end
+      errors << "info_url can't be blank" if record["info_url"].to_s.strip.empty?
+    end
+    def self.validate_info_url(record, errors)
+      url = record["info_url"].to_s.strip
+      return if url.empty? # already caught by presence check
+      begin
+        uri = URI.parse(url)
+        unless uri.is_a?(URI::HTTP) && uri.host.to_s != ""
+          errors << "info_url must be a valid http\/https URL with host"
+        end
+      rescue URI::InvalidURIError
+        errors << "info_url must be a valid http\/https URL"
+      end
+    end
+    def self.validate_dates(record, errors)
+      today = Date.today
+      date_scraped = parse_date(record["date_scraped"])
+      errors << "Invalid date format for date_scraped: #{record["date_scraped"].inspect} is not a valid ISO 8601 date" if record["date_scraped"] && date_scraped.nil?
+      date_received = parse_date(record["date_received"])
+      if record["date_received"] && date_received.nil?
+        errors << "Invalid date format for date_received: #{record["date_received"].inspect} is not a valid ISO 8601 date"
+      elsif date_received && date_received.to_date > today
+        errors << "Invalid date for date_received: #{record["date_received"].inspect} is in the future"
+      end
+      %w[on_notice_from on_notice_to].each do |field|
+        val = parse_date(record[field])
+        errors << "Invalid date format for #{field}: #{record[field].inspect} is not a valid ISO 8601 date" if record[field] && val.nil?
+      end
+    end
+    # Returns a Date if value is already a Date, or parses a YYYY-MM-DD string.
+    # Returns nil if unparseable or blank.
+    def self.parse_date(value)
+      return nil if value.nil? || value == ""
+      return value if value.is_a?(Date) || value.is_a?(Time)
+      return nil unless value.is_a?(String) && value =~ /\A\d{4}-\d{2}-\d{2}\z/
+      Date.parse(value)
+    rescue ArgumentError
+      nil
+    end
+  end
+end
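
Review note: two of the checks in `PaValidation` are stricter than the code they replace in `DbUtils.save_record`: dates must match `YYYY-MM-DD` before `Date.parse` is tried, and `info_url` must parse as an http or https URI with a host. The sketch below restates both as standalone helpers. The method names are hypothetical, and the date helper covers only the String case (the diff's `parse_date` also passes `Date` and `Time` values through).

```ruby
require "date"
require "uri"

# Sketch of two checks from PaValidation, written as standalone helpers.
# The method names are hypothetical; the logic follows the diff.
def strict_iso_date(value)
  return nil unless value.is_a?(String) && value.match?(/\A\d{4}-\d{2}-\d{2}\z/)

  # The pattern only checks the shape; Date.parse rejects impossible dates.
  Date.parse(value)
rescue ArgumentError
  nil
end

def http_url_with_host?(url)
  uri = URI.parse(url)
  # URI::HTTPS is a subclass of URI::HTTP, so both schemes pass this check.
  uri.is_a?(URI::HTTP) && !uri.host.to_s.empty?
rescue URI::InvalidURIError
  false
end
```

A value such as "4 March 2026" would have been accepted by a bare `Date.parse` but is rejected by the pattern check here.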

data/lib/scraper_utils/spec_support.rb CHANGED

@@ -78,30 +78,49 @@ module ScraperUtils
       "#{prefix}#{authority_labels.first}#{suffix}"
     end
+    # Finds records with duplicate [authority_label, council_reference] keys.
+    # @param records [Array<Hash>] All records to check
+    # @raise [UnprocessableSite] if any [authority_label, council_reference] key occurs more than once
+    # @return [nil] if all keys are unique
+    def self.validate_unique_references!(records)
+      groups = records.group_by do |r|
+        [r["authority_label"], r["council_reference"]&.downcase]
+      end
+      duplicates = groups.select { |_k, g| g.size > 1 }
+      return if duplicates.empty?
+      raise UnprocessableSite, "Duplicate authority labels: #{duplicates.keys.map(&:inspect).join(', ')}"
+    end
     # Validates enough addresses are geocodable
     # @param results [Array<Hash>] The results from scraping an authority
     # @param percentage [Integer] The min percentage of addresses expected to be geocodable (default:50)
     # @param variation [Integer] The variation allowed in addition to percentage (default:3)
+    # @param ignore_case [Boolean] Ignores case which relaxes suburb check
+    # @param known_suburbs [Array<String>] Known suburbs to detect in address when there is no postcode and no uppercase suburb
     # @raise RuntimeError if insufficient addresses are geocodable
-    def self.validate_addresses_are_geocodable!(results, percentage: 50, variation: 3)
+    def self.validate_addresses_are_geocodable!(results, percentage: 50, variation: 3, ignore_case: false, known_suburbs: [])
       return nil if results.empty?
       geocodable = results
                      .map { |record| record["address"] }
                      .uniq
-                     .count { |text| ScraperUtils::SpecSupport.geocodable? text }
+                     .count { |text| ScraperUtils::SpecSupport.geocodable? text, known_suburbs: known_suburbs, ignore_case: ignore_case }
       puts "Found #{geocodable} out of #{results.count} unique geocodable addresses " \
              "(#{(100.0 * geocodable / results.count).round(1)}%)"
       expected = [((percentage.to_f / 100.0) * results.count - variation), 1].max
-      raise "Expected at least #{expected} (#{percentage}% - #{variation}) geocodable addresses, got #{geocodable}" unless geocodable >= expected
+      unless geocodable >= expected
+        raise UnprocessableSite, "Expected at least #{expected} (#{percentage}% - #{variation}) geocodable addresses, got #{geocodable}"
+      end
       geocodable
     end
     # Check if an address is likely to be geocodable by analyzing its format.
     # This is a bit stricter than needed - typically assert >= 75% match
     # @param address [String] The address to check
+    # @param ignore_case [Boolean] Ignores case which relaxes suburb check
+    # @param known_suburbs [Array<String>] Known suburbs to detect in address when there is no postcode and no uppercase suburb
     # @return [Boolean] True if the address appears to be geocodable.
-    def self.geocodable?(address, ignore_case: false)
+    def self.geocodable?(address, ignore_case: false, known_suburbs: [])
       return false if address.nil? || address.empty?
       check_address = ignore_case ? address.upcase : address
@@ -114,16 +133,17 @@ module ScraperUtils
       uppercase_words = address.scan(/\b[A-Z]{2,}\b/)
       has_uppercase_suburb = uppercase_words.any? { |word| !AUSTRALIAN_STATES.include?(word) }
+      has_known_suburb = known_suburbs.any? { |suburb| address.include?(suburb) }
       if ENV["DEBUG"]
         missing = []
         missing << "street type" unless has_street_type
-        missing << "postcode/Uppercase suburb" unless has_postcode || has_uppercase_suburb
+        missing << "postcode/Uppercase suburb/Known suburb" unless has_postcode || has_uppercase_suburb || has_known_suburb
         missing << "state" unless has_state
         puts "  address: #{address} is not geocodable, missing #{missing.join(', ')}" if missing.any?
       end
-      has_street_type && (has_postcode || has_uppercase_suburb) && has_state
+      has_street_type && (has_postcode || has_uppercase_suburb || has_known_suburb) && has_state
     end
     PLACEHOLDERS = [
@@ -157,7 +177,7 @@ module ScraperUtils
       puts "Found #{descriptions} out of #{results.count} unique reasonable descriptions " \
              "(#{(100.0 * descriptions / results.count).round(1)}%)"
       expected = [(percentage.to_f / 100.0) * results.count - variation, 1].max
-      raise "Expected at least #{expected} (#{percentage}% - #{variation}) reasonable descriptions, got #{descriptions}" unless descriptions >= expected
+      raise UnprocessableSite, "Expected at least #{expected} (#{percentage}% - #{variation}) reasonable descriptions, got #{descriptions}" unless descriptions >= expected
       descriptions
     end
@@ -278,7 +298,7 @@ module ScraperUtils
           next
         end
-        raise "Expected 200 response, got #{page.code}" unless page.code == "200"
+        raise UnprocessableRecord, "Expected 200 response, got #{page.code}" unless page.code == "200"
         page_body = page.body.dup.force_encoding("UTF-8").gsub(/\s\s+/, " ")
@@ -310,12 +330,10 @@ module ScraperUtils
           min_required = ((percentage.to_f / 100.0) * count - variation).round(0)
           passed = count - failed
           raise "Too many failures: #{passed}/#{count} passed (min required: #{min_required})" if passed < min_required
-          end
-          end
-          puts "#{(100.0 * (count - failed) / count).round(1)}% detail checks passed (#{failed}/#{count} failed)!" if count > 0
-          end
-          end
-          end
+        end
+      end
+      puts "#{(100.0 * (count - failed) / count).round(1)}% detail checks passed (#{failed}/#{count} failed)!" if count > 0
+    end
+  end
+end
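
Review note: `validate_unique_references!` treats references that differ only by case as duplicates, because the grouping key downcases `council_reference`. The sketch below isolates that grouping. `duplicate_reference_keys` is a hypothetical helper that returns the offending keys instead of raising.

```ruby
# Sketch of the grouping used by validate_unique_references!;
# duplicate_reference_keys is a hypothetical helper that returns the
# offending keys instead of raising.
def duplicate_reference_keys(records)
  groups = records.group_by do |r|
    [r["authority_label"], r["council_reference"]&.downcase]
  end
  groups.select { |_key, group| group.size > 1 }.keys
end
```

For records with references "DA-1", "da-1" and "DA-2" under one authority label, only the key for "da-1" is reported.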

data/lib/scraper_utils/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module ScraperUtils
-  VERSION = "0.12.1"
+  VERSION = "0.14.1"
 end

data/lib/scraper_utils.rb CHANGED

@@ -7,9 +7,11 @@ require "scraper_utils/authority_utils"
 require "scraper_utils/data_quality_monitor"
 require "scraper_utils/db_utils"
 require "scraper_utils/debug_utils"
+require "scraper_utils/host_throttler"
 require "scraper_utils/log_utils"
 require "scraper_utils/maths_utils"
 require "scraper_utils/misc_utils"
+require "scraper_utils/pa_validation"
 require "scraper_utils/spec_support"
 # Mechanize utilities

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: scraper_utils
 version: !ruby/object:Gem::Version
-  version: 0.12.1
+  version: 0.14.1
 platform: ruby
 authors:
 - Ian Heggie
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2026-02-18 00:00:00.000000000 Z
+date: 2026-03-04 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: mechanize
@@ -97,11 +97,13 @@ files:
 - lib/scraper_utils/data_quality_monitor.rb
 - lib/scraper_utils/db_utils.rb
 - lib/scraper_utils/debug_utils.rb
+- lib/scraper_utils/host_throttler.rb
 - lib/scraper_utils/log_utils.rb
 - lib/scraper_utils/maths_utils.rb
 - lib/scraper_utils/mechanize_utils.rb
 - lib/scraper_utils/mechanize_utils/agent_config.rb
 - lib/scraper_utils/misc_utils.rb
+- lib/scraper_utils/pa_validation.rb
 - lib/scraper_utils/spec_support.rb
 - lib/scraper_utils/version.rb
 - scraper_utils.gemspec
@@ -112,7 +114,7 @@ metadata:
   allowed_push_host: https://rubygems.org
   homepage_uri: https://github.com/ianheggie-oaf/scraper_utils
   source_code_uri: https://github.com/ianheggie-oaf/scraper_utils
-  documentation_uri: https://rubydoc.info/gems/scraper_utils/0.12.1
+  documentation_uri: https://rubydoc.info/gems/scraper_utils/0.14.1
   changelog_uri: https://github.com/ianheggie-oaf/scraper_utils/blob/main/CHANGELOG.md
   rubygems_mfa_required: 'true'
 post_install_message: