scraper_utils 0.12.1 → 0.13.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: fc8aab3f24d29cc1e3fe44880d9030e7149372189a8691acd6e628e2a260c57c
- data.tar.gz: a82ecb97878e57f5d0cf75d623c94a6ff54e7805c43a7be140e6a2da7bbcca49
+ metadata.gz: 3bce8cc5a624f9904ebf8bb35ccb5c5c6c831e28ed56f88d3baf3b8d19fbbd13
+ data.tar.gz: 0a481566e846a4274796b0542fb64a805f486065ed08045724cea7bc3d46710d
  SHA512:
- metadata.gz: bd6d178afa8669916b70f2c6bdb56b77a8a5995d18e47c64b360a8d5d334b479645d20792d80759afde1aac6550d79f46a0504030d8a8920c8dea55fa6ad6132
- data.tar.gz: 20a2c5f144cfc8e2106d2e4643d7f3b0cb35110a5519d63bb0d8c1e655c8958fe73115089352a71272a419168c50299c8ce985d318be6bfa9635e64c9c4fb238
+ metadata.gz: 231c167ffe232daacbc862b8c3dd2c0c71be6b8fc2ff061f4f36d88f2e2185a454eb0aa79653c7a99a2ed65c9857d961059456f8403af8c1ed39623cc8e2db6a
+ data.tar.gz: f287f85cdd4cc11cf17c3e5d34d5493e2809f255f3a3544bc881e756f3379c897dd70dbba5ebf16b30837bb8612f42f704872e06c6bec1cad87845606fce6231
data/CHANGELOG.md CHANGED
@@ -1,5 +1,21 @@
  # Changelog
 
+ ## 0.13.1 - 2026-02-21
+
+ * Added PaValidation, which validates records based
+   on [How to write a scraper](https://www.planningalerts.org.au/how_to_write_a_scraper)
+ * `ScraperUtils::PaValidation.validate_record!` raises an exception if the record is invalid
+ * `ScraperUtils::PaValidation.validate_record` returns an Array of error messages if the record is invalid, otherwise nil
+ * Added `ScraperUtils::SpecSupport.validate_unique_references!`, which validates that all references are unique
+ * Note: because records are saved under their unique reference, any duplicates overwrite each other and are never
+   presented to PA, so this essentially checks that you are not losing records due to an incorrect reference
+ * Refactored `DbUtils.save_record` to use PaValidation
+ * Merged `cleanup_old_records` from LogUtils into the same method in DbUtils, bringing across the `force` named param
+ * `LogUtils.cleanup_old_records` now warns that it is deprecated
+ * Increased test coverage
+ * Fixed an edge case in `ScraperUtils::MechanizeUtils::AgentConfig#verify_proxy_works` - it now raises an exception on
+   JSON parse error
+
  ## 0.12.1 - 2026-02-18
 
  * Added override for the threshold of when to abandon scraping due to unprocessable records
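
The two PaValidation entry points listed for 0.13.1 differ only in how failures are reported. A minimal standalone sketch of that pattern (hypothetical top-level methods, not the gem itself; the field list comes from the new `pa_validation.rb`, and `ArgumentError` stands in for the gem's `ScraperUtils::UnprocessableRecord`):

```ruby
# Hypothetical sketch of PaValidation's two entry points, reduced to
# presence checks only.
REQUIRED = %w[council_reference address description info_url date_scraped].freeze

# Returns an Array of error messages, or nil when the record passes.
def validate_record(record)
  record = record.transform_keys(&:to_s)
  errors = REQUIRED.select { |f| record[f].to_s.strip.empty? }
                   .map { |f| "#{f} can't be blank" }
  errors.empty? ? nil : errors
end

# Bang variant: raises instead of returning messages.
def validate_record!(record)
  errors = validate_record(record)
  raise ArgumentError, errors.join("; ") if errors
end
```

The non-bang form suits specs that want to report all problems at once; the bang form suits the save path, where the first invalid record should abort processing.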
@@ -34,15 +50,18 @@
 
  ## 0.9.0 - 2025-07-11
 
- **Significant cleanup - removed code we ended up not using as none of the councils are actually concerned about server load**
+ **Significant cleanup - removed code we ended up not using as none of the councils are actually concerned about server
+ load**
 
  * Refactored example code into simple callable methods
  * Expand test for geocodeable addresses to include comma between postcode and state at the end of the address.
 
  ### Added
+
  - `ScraperUtils::SpecSupport.validate_addresses_are_geocodable!` - validates percentage of geocodable addresses
  - `ScraperUtils::SpecSupport.validate_descriptions_are_reasonable!` - validates percentage of reasonable descriptions
- - `ScraperUtils::SpecSupport.validate_uses_one_valid_info_url!` - validates single global info_url usage and availability
+ - `ScraperUtils::SpecSupport.validate_uses_one_valid_info_url!` - validates single global info_url usage and
+   availability
  - `ScraperUtils::SpecSupport.validate_info_urls_have_expected_details!` - validates info_urls contain expected content
  - `ScraperUtils::MathsUtils.fibonacci_series` - generates fibonacci sequence up to max value
  - `bot_check_expected` parameter to info_url validation methods for handling reCAPTCHA/Cloudflare protection
@@ -53,10 +72,12 @@
  - .editorconfig as an example for scrapers
 
  ### Fixed
+
  - Typo in `geocodable?` method debug output (`has_suburb_stats` → `has_suburb_states`)
  - Code example in `docs/enhancing_specs.md`
 
  ### Updated
+
  - `ScraperUtils::SpecSupport.acceptable_description?` - Accept 1 or 2 word descriptors with planning specific terms
  - Code example in `docs/enhancing_specs.md` to reflect new support methods
  - Code examples
@@ -68,6 +89,7 @@
  - Added extra street types
 
  ### Removed
+
  - Unused CycleUtils
  - Unused DateRangeUtils
  - Unused RandomizeUtils
@@ -150,7 +172,8 @@ Fixed broken v0.2.0
 
  ## 0.2.0 - 2025-02-28
 
- Added FiberScheduler, enabled compliant mode with delays by default and simplified usage removing third retry without proxy
+ Added FiberScheduler, enabled compliant mode with delays by default and simplified usage removing third retry without
+ proxy
 
  ## 0.1.0 - 2025-02-23
 
data/lib/scraper_utils/db_utils.rb CHANGED
@@ -1,5 +1,6 @@
  # frozen_string_literal: true
 
+ require "uri"
  require "scraperwiki"
 
  module ScraperUtils
@@ -27,23 +28,10 @@ module ScraperUtils
  # @raise [ScraperUtils::UnprocessableRecord] If record fails validation
  # @return [void]
  def self.save_record(record)
-   # Validate required fields
-   required_fields = %w[council_reference address description info_url date_scraped]
-   required_fields.each do |field|
-     if record[field].to_s.empty?
-       raise ScraperUtils::UnprocessableRecord, "Missing required field: #{field}"
-     end
-   end
-
-   # Validate date formats
-   %w[date_scraped date_received on_notice_from on_notice_to].each do |date_field|
-     Date.parse(record[date_field]) unless record[date_field].to_s.empty?
-   rescue ArgumentError
-     raise ScraperUtils::UnprocessableRecord,
-           "Invalid date format for #{date_field}: #{record[date_field].inspect}"
-   end
+   record = record.transform_keys(&:to_s)
+   ScraperUtils::PaValidation.validate_record!(record)
 
-   # Determine primary key based on presence of authority_label
+   # Determine the primary key based on the presence of authority_label
    primary_key = if record.key?("authority_label")
      %w[authority_label council_reference]
    else
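
The primary-key choice in `save_record` can be sketched as a standalone helper (`primary_key_for` is a hypothetical name, not part of the gem): a composite key when the scraper tags records with an `authority_label` (multi-authority scrapers), otherwise `council_reference` alone.

```ruby
# Hypothetical helper mirroring the primary-key selection in save_record.
def primary_key_for(record)
  if record.key?("authority_label")
    %w[authority_label council_reference]  # composite key per authority
  else
    %w[council_reference]                  # single-authority scraper
  end
end
```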
@@ -58,7 +46,7 @@ module ScraperUtils
  end
 
  # Clean up records older than 30 days and approx once a month vacuum the DB
- def self.cleanup_old_records
+ def self.cleanup_old_records(force: false)
    cutoff_date = (Date.today - 30).to_s
    vacuum_cutoff_date = (Date.today - 35).to_s
 
@@ -70,15 +58,17 @@ module ScraperUtils
    deleted_count = stats["count"]
    oldest_date = stats["oldest"]
 
-   return unless deleted_count.positive? || ENV["VACUUM"]
+   return unless deleted_count.positive? || ENV["VACUUM"] || force
 
    LogUtils.log "Deleting #{deleted_count} applications scraped between #{oldest_date} and #{cutoff_date}"
    ScraperWiki.sqliteexecute("DELETE FROM data WHERE date_scraped < ?", [cutoff_date])
 
-   return unless rand < 0.03 || (oldest_date && oldest_date < vacuum_cutoff_date) || ENV["VACUUM"]
+   return unless rand < 0.03 || (oldest_date && oldest_date < vacuum_cutoff_date) || ENV["VACUUM"] || force
 
    LogUtils.log " Running VACUUM to reclaim space..."
    ScraperWiki.sqliteexecute("VACUUM")
+ rescue SqliteMagic::NoSuchTable => e
+   ScraperUtils::LogUtils.log "Ignoring: #{e} whilst cleaning old records" if ScraperUtils::DebugUtils.trace?
  end
  end
  end
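
The retention windows used by `cleanup_old_records` are simple date arithmetic: delete rows scraped more than 30 days ago, and consider a VACUUM when the oldest remaining row predates a 35-day cutoff (or `force:` / `ENV["VACUUM"]` is set). A sketch of the two cutoffs, pinned to an example "today" (the release date) so they are deterministic:

```ruby
require "date"

# Sketch of the two cutoffs computed above, with a fixed example date.
today = Date.new(2026, 2, 21)
cutoff_date = (today - 30).to_s        # rows with date_scraped < this are deleted
vacuum_cutoff_date = (today - 35).to_s # an oldest row older than this triggers VACUUM

puts cutoff_date         # "2026-01-22"
puts vacuum_cutoff_date  # "2026-01-17"
```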
data/lib/scraper_utils/log_utils.rb CHANGED
@@ -85,7 +85,7 @@ module ScraperUtils
    failed
  )
 
- cleanup_old_records
+ DbUtils::cleanup_old_records
  end
 
  # Extracts the first relevant line from backtrace that's from our project
@@ -225,21 +225,13 @@ module ScraperUtils
  )
  end
 
+ # Moved to DbUtils
+ # :nocov:
  def self.cleanup_old_records(force: false)
-   cutoff = (Date.today - LOG_RETENTION_DAYS).to_s
-   return if !force && @last_cutoff == cutoff
-
-   @last_cutoff = cutoff
-
-   [SUMMARY_TABLE, LOG_TABLE].each do |table|
-     ScraperWiki.sqliteexecute(
-       "DELETE FROM #{table} WHERE date(run_at) < date(?)",
-       [cutoff]
-     )
-   rescue SqliteMagic::NoSuchTable => e
-     ScraperUtils::LogUtils.log "Ignoring: #{e} whilst cleaning old records" if ScraperUtils::DebugUtils.trace?
-   end
+   warn "`#{self.class}##{__method__}` is deprecated and will be removed in a future release, use `ScraperUtils::DbUtils.cleanup_old_records` instead.", category: :deprecated
+   ScraperUtils::DbUtils.cleanup_old_records(force: force)
  end
+ # :nocov:
 
  # Extracts meaningful backtrace - 3 lines from ruby/gem and max 6 in total
  def self.extract_meaningful_backtrace(error)
data/lib/scraper_utils/mechanize_utils/agent_config.rb CHANGED
@@ -227,6 +227,7 @@ module ScraperUtils
  rescue JSON::ParserError => e
    puts "Couldn't parse public_headers: #{e}! Raw response:"
    puts my_headers.inspect
+   raise "Couldn't parse public_headers as JSON: #{e}!"
  end
  rescue Timeout::Error => e # Includes Net::OpenTimeout
    raise "Proxy check timed out: #{e}"
data/lib/scraper_utils/pa_validation.rb ADDED
@@ -0,0 +1,88 @@
+ # frozen_string_literal: true
+
+ require "uri"
+ require "date"
+
+ module ScraperUtils
+   # Validates scraper records match Planning Alerts requirements before submission.
+   # Use in specs to catch problems early rather than waiting for PA's import.
+   module PaValidation
+     REQUIRED_FIELDS = %w[council_reference address description date_scraped].freeze
+
+     # Validates a single record (hash with string keys) against PA's rules.
+     # @param record [Hash] The record to validate
+     # @raise [ScraperUtils::UnprocessableRecord] if there are error messages
+     def self.validate_record!(record)
+       errors = validate_record(record)
+       raise(ScraperUtils::UnprocessableRecord, errors.join("; ")) if errors&.any?
+     end
+
+     # Validates a single record (hash with string keys) against PA's rules.
+     # @param record [Hash] The record to validate
+     # @return [Array<String>, nil] Array of error messages, or nil if valid
+     def self.validate_record(record)
+       record = record.transform_keys(&:to_s)
+       errors = []
+
+       validate_presence(record, errors)
+       validate_info_url(record, errors)
+       validate_dates(record, errors)
+
+       errors.empty? ? nil : errors
+     end
+
+     private
+
+     def self.validate_presence(record, errors)
+       REQUIRED_FIELDS.each do |field|
+         errors << "#{field} can't be blank" if record[field].to_s.strip.empty?
+       end
+       errors << "info_url can't be blank" if record["info_url"].to_s.strip.empty?
+     end
+
+     def self.validate_info_url(record, errors)
+       url = record["info_url"].to_s.strip
+       return if url.empty? # already caught by presence check
+
+       begin
+         uri = URI.parse(url)
+         unless uri.is_a?(URI::HTTP) && uri.host.to_s != ""
+           errors << "info_url must be a valid http/https URL with host"
+         end
+       rescue URI::InvalidURIError
+         errors << "info_url must be a valid http/https URL"
+       end
+     end
+
+     def self.validate_dates(record, errors)
+       today = Date.today
+
+       date_scraped = parse_date(record["date_scraped"])
+       errors << "Invalid date format for date_scraped: #{record["date_scraped"].inspect} is not a valid ISO 8601 date" if record["date_scraped"] && date_scraped.nil?
+
+       date_received = parse_date(record["date_received"])
+       if record["date_received"] && date_received.nil?
+         errors << "Invalid date format for date_received: #{record["date_received"].inspect} is not a valid ISO 8601 date"
+       elsif date_received && date_received.to_date > today
+         errors << "Invalid date for date_received: #{record["date_received"].inspect} is in the future"
+       end
+
+       %w[on_notice_from on_notice_to].each do |field|
+         val = parse_date(record[field])
+         errors << "Invalid date format for #{field}: #{record[field].inspect} is not a valid ISO 8601 date" if record[field] && val.nil?
+       end
+     end
+
+     # Returns a Date if value is already a Date, or parses a YYYY-MM-DD string.
+     # Returns nil if unparseable or blank.
+     def self.parse_date(value)
+       return nil if value.nil? || value == ""
+       return value if value.is_a?(Date) || value.is_a?(Time)
+       return nil unless value.is_a?(String) && value =~ /\A\d{4}-\d{2}-\d{2}\z/
+
+       Date.parse(value)
+     rescue ArgumentError
+       nil
+     end
+   end
+ end
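
The date handling in this new file accepts only ISO 8601 `YYYY-MM-DD` strings (or existing Date/Time objects): the regexp gate rejects other shapes before `Date.parse` can guess at them, and `Date.parse` itself rejects impossible dates. A standalone copy of that check, simplified to Strings and Dates only (the gem also passes Time through):

```ruby
require "date"

# Standalone copy of the ISO-8601-only rule in PaValidation.parse_date,
# simplified to Strings and Dates.
def parse_date(value)
  return nil if value.nil? || value == ""
  return value if value.is_a?(Date)
  # Require the exact YYYY-MM-DD shape before letting Date.parse interpret it
  return nil unless value.is_a?(String) && value =~ /\A\d{4}-\d{2}-\d{2}\z/

  Date.parse(value)
rescue ArgumentError
  nil
end

parse_date("2026-02-21") # valid ISO 8601 date
parse_date("21/02/2026") # rejected: wrong shape, even though Date.parse could read it
parse_date("2026-02-30") # rejected: right shape but not a real date
```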
data/lib/scraper_utils/spec_support.rb CHANGED
@@ -78,6 +78,19 @@ module ScraperUtils
    "#{prefix}#{authority_labels.first}#{suffix}"
  end
 
+ # Finds records with duplicate [authority_label, council_reference] keys.
+ # @param records [Array<Hash>] All records to check
+ # @raise [UnprocessableSite] if any [authority_label, council_reference] key occurs more than once
+ # @return [nil] when all references are unique
+ def self.validate_unique_references!(records)
+   groups = records.group_by do |r|
+     [r["authority_label"], r["council_reference"]&.downcase]
+   end
+   duplicates = groups.select { |_k, g| g.size > 1 }
+   return if duplicates.empty?
+
+   raise UnprocessableSite, "Duplicate [authority_label, council_reference] keys: #{duplicates.keys.map(&:inspect).join(', ')}"
+ end
+
  # Validates enough addresses are geocodable
  # @param results [Array<Hash>] The results from scraping an authority
  # @param percentage [Integer] The min percentage of addresses expected to be geocodable (default:50)
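
The duplicate detection added in `validate_unique_references!` groups records by `[authority_label, downcased council_reference]` and flags any group of two or more. A standalone sketch (`duplicate_keys` is a hypothetical helper that returns the offending keys rather than raising):

```ruby
# Hypothetical helper mirroring validate_unique_references!'s grouping:
# records sharing a [authority_label, downcased council_reference] key
# would silently overwrite each other when saved.
def duplicate_keys(records)
  records.group_by { |r| [r["authority_label"], r["council_reference"]&.downcase] }
         .select { |_key, group| group.size > 1 }
         .keys
end

records = [
  { "authority_label" => "north", "council_reference" => "DA-1" },
  { "authority_label" => "north", "council_reference" => "da-1" }, # duplicate (case-insensitive)
  { "authority_label" => "south", "council_reference" => "DA-1" }, # distinct authority, not a duplicate
]
```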
@@ -93,7 +106,9 @@ module ScraperUtils
    puts "Found #{geocodable} out of #{results.count} unique geocodable addresses " \
         "(#{(100.0 * geocodable / results.count).round(1)}%)"
    expected = [((percentage.to_f / 100.0) * results.count - variation), 1].max
-   raise "Expected at least #{expected} (#{percentage}% - #{variation}) geocodable addresses, got #{geocodable}" unless geocodable >= expected
+   unless geocodable >= expected
+     raise UnprocessableSite, "Expected at least #{expected} (#{percentage}% - #{variation}) geocodable addresses, got #{geocodable}"
+   end
    geocodable
  end
 
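
The pass threshold in the geocodable-address check above (and the matching descriptions check later in the file) is the same arithmetic: the requested percentage of the result count minus an allowed variation, floored at 1 so at least one match is always required. A sketch (`min_expected` is a hypothetical name):

```ruby
# Hypothetical helper showing the shared threshold arithmetic:
# percentage of count, minus the allowed variation, never below 1.
def min_expected(count, percentage, variation)
  [(percentage.to_f / 100.0) * count - variation, 1].max
end

min_expected(40, 50, 3) # 50% of 40, minus 3 => 17.0
min_expected(2, 50, 3)  # would be negative, floored => 1
```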
@@ -157,7 +172,7 @@ module ScraperUtils
    puts "Found #{descriptions} out of #{results.count} unique reasonable descriptions " \
         "(#{(100.0 * descriptions / results.count).round(1)}%)"
    expected = [(percentage.to_f / 100.0) * results.count - variation, 1].max
-   raise "Expected at least #{expected} (#{percentage}% - #{variation}) reasonable descriptions, got #{descriptions}" unless descriptions >= expected
+   raise UnprocessableSite, "Expected at least #{expected} (#{percentage}% - #{variation}) reasonable descriptions, got #{descriptions}" unless descriptions >= expected
    descriptions
  end
 
@@ -278,7 +293,7 @@ module ScraperUtils
    next
  end
 
- raise "Expected 200 response, got #{page.code}" unless page.code == "200"
+ raise UnprocessableRecord, "Expected 200 response, got #{page.code}" unless page.code == "200"
 
  page_body = page.body.dup.force_encoding("UTF-8").gsub(/\s\s+/, " ")
 
@@ -310,12 +325,10 @@ module ScraperUtils
    min_required = ((percentage.to_f / 100.0) * count - variation).round(0)
    passed = count - failed
    raise "Too many failures: #{passed}/#{count} passed (min required: #{min_required})" if passed < min_required
- end
- end
-
- puts "#{(100.0 * (count - failed) / count).round(1)}% detail checks passed (#{failed}/#{count} failed)!" if count > 0
- end
-
- end
- end
+ end
+ end
 
+ puts "#{(100.0 * (count - failed) / count).round(1)}% detail checks passed (#{failed}/#{count} failed)!" if count > 0
+ end
+ end
+ end
data/lib/scraper_utils/version.rb CHANGED
@@ -1,5 +1,5 @@
  # frozen_string_literal: true
 
  module ScraperUtils
-   VERSION = "0.12.1"
+   VERSION = "0.13.1"
  end
data/lib/scraper_utils.rb CHANGED
@@ -5,6 +5,7 @@ require "scraper_utils/version"
  # Public Apis (responsible for requiring their own dependencies)
  require "scraper_utils/authority_utils"
  require "scraper_utils/data_quality_monitor"
+ require "scraper_utils/pa_validation"
  require "scraper_utils/db_utils"
  require "scraper_utils/debug_utils"
  require "scraper_utils/log_utils"
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: scraper_utils
  version: !ruby/object:Gem::Version
-   version: 0.12.1
+   version: 0.13.1
  platform: ruby
  authors:
  - Ian Heggie
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2026-02-18 00:00:00.000000000 Z
+ date: 2026-02-21 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: mechanize
@@ -102,6 +102,7 @@ files:
  - lib/scraper_utils/mechanize_utils.rb
  - lib/scraper_utils/mechanize_utils/agent_config.rb
  - lib/scraper_utils/misc_utils.rb
+ - lib/scraper_utils/pa_validation.rb
  - lib/scraper_utils/spec_support.rb
  - lib/scraper_utils/version.rb
  - scraper_utils.gemspec
@@ -112,7 +113,7 @@ metadata:
  allowed_push_host: https://rubygems.org
  homepage_uri: https://github.com/ianheggie-oaf/scraper_utils
  source_code_uri: https://github.com/ianheggie-oaf/scraper_utils
- documentation_uri: https://rubydoc.info/gems/scraper_utils/0.12.1
+ documentation_uri: https://rubydoc.info/gems/scraper_utils/0.13.1
  changelog_uri: https://github.com/ianheggie-oaf/scraper_utils/blob/main/CHANGELOG.md
  rubygems_mfa_required: 'true'
  post_install_message: