scraper_utils 0.8.3 → 0.9.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.editorconfig +42 -0
- data/CHANGELOG.md +14 -5
- data/README.md +11 -1
- data/docs/enhancing_specs.md +48 -14
- data/docs/example_parallel_scraper.rb +124 -0
- data/docs/example_scraper.rb +19 -3
- data/docs/parallel_scrapers.md +95 -0
- data/lib/scraper_utils/db_utils.rb +22 -2
- data/lib/scraper_utils/spec_support.rb +83 -35
- data/lib/scraper_utils/version.rb +1 -1
- metadata +6 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-metadata.gz:
-data.tar.gz:
+metadata.gz: e3d985ab4384498ed5c44360db6d6020a039e38ec0480ab3e9942cd0d3267d9e
+data.tar.gz: c61a923aa54f37623f8766a6750ecd97529dcaf6d315d056e04dde77f962cdaa
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 5e99031153c341917fecc73057ce5e6f662cf6f45859990c817c7c77e530aa11162f8d5c778f78c09dbb4d2287e060769880e23f4591f8a30abb816c6dfe28e1
+data.tar.gz: e8d9dcbb019f93b05dc9ca3f2c09cf5f488434befab13b92841d0e9377041eb3e8914efafdd81218596bad158aab24e0304212ef25ed8211cb3c1b3a7c5c2478
data/.editorconfig
ADDED
@@ -0,0 +1,42 @@
+# EditorConfig helps maintain consistent coding styles across different editors and IDEs
+# https://editorconfig.org/
+
+root = true
+
+# Ruby default: 2 spaces for indentation (Ruby Style Guide standard)
+[*]
+end_of_line = lf
+indent_size = 2
+indent_style = space
+insert_final_newline = true
+tab_width = 4 # Tab display width (RubyMine/Vim standard)
+trim_trailing_whitespace = true
+
+[*.bat]
+end_of_line = crlf
+
+# Python uses 4 spaces (PEP 8 standard)
+[*.py]
+indent_size = 4
+
+# Makefiles require tabs
+[{*[Mm]akefile*,*.mak,*.mk,depend}]
+indent_style = tab
+
+# Minified JavaScript files shouldn't be changed
+[**.min.js]
+indent_style = ignore
+insert_final_newline = ignore
+
+# Editor Setup Instructions:
+#
+# Vim: Add to ~/.vimrc:
+# filetype plugin indent on
+# set expandtab
+# autocmd FileType ruby setlocal shiftwidth=2 tabstop=2 softtabstop=2
+#
+# RubyMine: Settings → Editor → Code Style → Ruby → set to 2 spaces
+# (should auto-detect this .editorconfig file)
+#
+# VS Code: Install "EditorConfig for VS Code" extension
+# (will automatically apply these settings)
data/CHANGELOG.md
CHANGED
@@ -14,17 +14,26 @@
 - `ScraperUtils::SpecSupport.validate_info_urls_have_expected_details!` - validates info_urls contain expected content
 - `ScraperUtils::MathsUtils.fibonacci_series` - generates fibonacci sequence up to max value
 - `bot_check_expected` parameter to info_url validation methods for handling reCAPTCHA/Cloudflare protection
+- Experimental Parallel Processing support
+- Uses the parallel gem with subprocesses
+- Added facility to collect records in memory
+- see docs/parallel_scrapers.md and docs/example_parallel_scraper.rb
+- .editorconfig as an example for scrapers
 
 ### Fixed
-- Typo in `geocodable?` method debug output (`has_suburb_stats` → `has_suburb_states`)
+- Typo in `geocodable?` method debug output (`has_suburb_stats` → `has_suburb_states`)
 - Code example in `docs/enhancing_specs.md`
 
 ### Updated
+- `ScraperUtils::SpecSupport.acceptable_description?` - Accept 1 or 2 word descriptors with planning specific terms
 - Code example in `docs/enhancing_specs.md` to reflect new support methods
--
-
--
+- Code examples
+- geocodeable? test is simpler - it requires
+- a street type
+- an uppercase word (assumed to be a suburb) or postcode and
 - a state
+- Support for 1 or 2 word "reasonable" descriptions that use words specific to planning alerts
+- Added extra street types
 
 ### Removed
 - Unsued CycleUtils
@@ -32,7 +41,7 @@
 - Unused RandomizeUtils
 - Unused Scheduling (Fiber and Threads)
 - Unused Compliant mode, delays for Agent (Agent is configured with an agent string)
-- Unused MechanizeActions
+- Unused MechanizeActions
 
 ## 0.8.2 - 2025-05-07
 
data/README.md
CHANGED
@@ -10,6 +10,8 @@ Add to [your scraper's](https://www.planningalerts.org.au/how_to_write_a_scraper
 ```ruby
 gem "scraperwiki", git: "https://github.com/openaustralia/scraperwiki-ruby.git", branch: "morph_defaults"
 gem 'scraper_utils'
+# Optional: For parallel processing of multi-authority scrapers
+gem 'parallel'
 ```
 
 For detailed setup and configuration options,
@@ -23,6 +25,14 @@ see {file:docs/getting_started.md Getting Started guide}
 - Supports extra actions required to get to results page
 - {file:docs/mechanize_utilities.md Learn more about Mechanize utilities}
 
+### Parallel Processing
+
+- Process multiple authorities simultaneously for significant speed improvements
+- 3-8x faster execution for multi-authority scrapers
+- Simple migration from sequential processing with minimal code changes
+- {file:docs/parallel_scraping.md Learn more about parallel scraping}
+- {file:docs/example_parallel_scraper.rb Example parallel scraper script}
+
 ### Error Handling & Quality Monitoring
 
 - Record-level error handling with appropriate thresholds
@@ -67,4 +77,4 @@ on [ianheggie-oaf/scraper_utils | GitHub](https://github.com/ianheggie-oaf/scrap
 
 ## License
 
-The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
+The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
data/docs/enhancing_specs.md
CHANGED
@@ -25,6 +25,17 @@ RSpec.describe Scraper do
 ].freeze
 
 describe ".scrape" do
+def fetch_url_with_redirects(url)
+agent = Mechanize.new
+page = agent.get(url)
+if YourScraper::Pages::TermsAndConditions.on_page?(page)
+puts "Agreeing to terms and conditions for #{url}"
+YourScraper::Pages::TermsAndConditions.click_agree(page)
+page = agent.get(url)
+end
+page
+end
+
 def test_scrape(authority)
 ScraperWiki.close_sqlite
 FileUtils.rm_f("data.sqlite")
@@ -55,20 +66,27 @@ RSpec.describe Scraper do
 
 expect(results).to eq expected
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+if results.any?
+ScraperUtils::SpecSupport.validate_addresses_are_geocodable!(results, percentage: 70, variation: 3)
+
+ScraperUtils::SpecSupport.validate_descriptions_are_reasonable!(results, percentage: 55, variation: 3)
+
+global_info_url = Scraper::AUTHORITIES[authority][:info_url]
+# OR
+# global_info_url = results.first['info_url']
+bot_check_expected = AUTHORITIES_WITH_BOT_PROTECTION.include?(authority)
+
+unless ENV['DISABLE_INFO_URL_CHECK']
+if global_info_url
+ScraperUtils::SpecSupport.validate_uses_one_valid_info_url!(results, global_info_url, bot_check_expected: bot_check_expected) do |url|
+fetch_url_with_redirects(url)
+end
+else
+ScraperUtils::SpecSupport.validate_info_urls_have_expected_details!(results, percentage: 70, variation: 3, bot_check_expected: bot_check_expected) do |url|
+fetch_url_with_redirects(url)
+end
+end
+end
+end
 end
 end
 
@@ -119,6 +137,21 @@ ScraperUtils::SpecSupport.validate_info_urls_have_expected_details!(results, per
 ScraperUtils::SpecSupport.validate_info_urls_have_expected_details!(results, percentage: 75, variation: 3, bot_check_expected: true)
 ```
 
+### Custom URL fetching
+
+For sites requiring special handling (terms agreement, cookies, etc.):
+
+```ruby
+# Custom URL fetching with block
+ScraperUtils::SpecSupport.validate_uses_one_valid_info_url!(results, url) do |url|
+fetch_url_with_redirects(url) # Your custom fetch implementation
+end
+
+ScraperUtils::SpecSupport.validate_info_urls_have_expected_details!(results) do |url|
+fetch_url_with_redirects(url) # Handle terms agreement, cookies, etc.
+end
+```
+
 ## Bot Protection Handling
 
 The `bot_check_expected` parameter allows validation methods to accept bot protection as valid responses:
@@ -137,3 +170,4 @@ The `bot_check_expected` parameter allows validation methods to accept bot prote
 
 All validation methods accept `percentage` (minimum percentage required) and `variation` (additional tolerance)
 parameters for consistent configuration.
+
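Note on the spec example above: it references an `AUTHORITIES_WITH_BOT_PROTECTION` constant that the diff never defines. A minimal sketch of how such a list might be declared alongside the other frozen constants in the spec (the authority names are placeholders, not taken from the gem):

```ruby
# Hypothetical list of authorities whose info_urls sit behind reCAPTCHA/Cloudflare;
# pass bot_check_expected: true for these so the validators accept a bot-challenge page.
AUTHORITIES_WITH_BOT_PROTECTION = %i[
  example_shire
  sample_city
].freeze
```

Setting `DISABLE_INFO_URL_CHECK` in the environment skips the info_url checks entirely, per the `unless ENV['DISABLE_INFO_URL_CHECK']` guard in the example.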
data/docs/example_parallel_scraper.rb
ADDED
@@ -0,0 +1,124 @@
+#!/usr/bin/env ruby
+# frozen_string_literal: true
+
+$LOAD_PATH << "./lib"
+
+require "scraper_utils"
+require "parallel"
+require "your_scraper"
+
+# Main Scraper class
+class Scraper
+AUTHORITIES = YourScraper::AUTHORITIES
+
+# Process a single authority and returns an array of:
+# * authority_label,
+# * array of records to save,
+# * an array of arrays of unprocessable_records and their exception
+# * nil or a fatal exception,
+def self.scrape_authority(authority_label, attempt)
+puts "\nCollecting feed data for #{authority_label}, attempt: #{attempt}..."
+
+# Enable in-memory collection mode, which disables saving to file and avoids conflicts
+ScraperUtils::DbUtils.collect_saves!
+unprocessable_record_details = []
+fatal_exception = nil
+
+begin
+ScraperUtils::DataQualityMonitor.start_authority(authority_label)
+YourScraper.scrape(authority_label) do |record|
+begin
+record["authority_label"] = authority_label.to_s
+ScraperUtils::DbUtils.save_record(record)
+rescue ScraperUtils::UnprocessableRecord => e
+# Log bad record but continue processing unless too many have occurred
+ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)
+unprocessable_record_details << [e, record]
+end
+end
+rescue StandardError => e
+warn "#{authority_label}: ERROR: #{e}"
+warn e.backtrace
+fatal_exception = e
+end
+[authority_label, ScraperUtils::DbUtils.collected_saves, unprocessable_record_details, fatal_exception]
+end
+
+# Process authorities in parallel
+def self.scrape_parallel(authorities, attempt, process_count: 4)
+exceptions = {}
+# Saves immediately in main process
+ScraperUtils::DbUtils.save_immediately!
+Parallel.map(authorities, in_processes: process_count) do |authority_label|
+# Runs in sub process
+scrape_authority(authority_label, attempt)
+end.each do |authority_label, saves, unprocessable, fatal_exception|
+# Runs in main process
+status = fatal_exception ? 'FAILED' : 'OK'
+puts "Saving results of #{authority_label}: #{saves.size} records, #{unprocessable.size} unprocessable #{status}"
+
+saves.each do |record|
+ScraperUtils::DbUtils.save_record(record)
+end
+unprocessable.each do |e, record|
+ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)
+exceptions[authority_label] = e
+end
+
+if fatal_exception
+puts " Warning: #{authority_label} failed with: #{fatal_exception.message}"
+puts " Saved #{saves.size} records before failure"
+exceptions[authority_label] = fatal_exception
+end
+end
+
+exceptions
+end
+
+def self.selected_authorities
+ScraperUtils::AuthorityUtils.selected_authorities(AUTHORITIES.keys)
+end
+
+def self.run(authorities, process_count: 8)
+puts "Scraping authorities in parallel: #{authorities.join(', ')}"
+puts "Using #{process_count} processes"
+
+start_time = Time.now
+exceptions = scrape_parallel(authorities, 1, process_count: process_count)
+
+ScraperUtils::LogUtils.log_scraping_run(
+start_time,
+1,
+authorities,
+exceptions
+)
+
+unless exceptions.empty?
+puts "\n***************************************************"
+puts "Now retrying authorities which earlier had failures"
+puts exceptions.keys.join(", ").to_s
+puts "***************************************************"
+
+start_time = Time.now
+exceptions = scrape_parallel(exceptions.keys, 2, process_count: process_count)
+
+ScraperUtils::LogUtils.log_scraping_run(
+start_time,
+2,
+authorities,
+exceptions
+)
+end
+
+# Report on results, raising errors for unexpected conditions
+ScraperUtils::LogUtils.report_on_results(authorities, exceptions)
+end
+end
+
+if __FILE__ == $PROGRAM_NAME
+ENV["MORPH_EXPECT_BAD"] ||= "some,councils"
+
+process_count = (ENV['MORPH_PROCESSES'] || Etc.nprocessors * 2).to_i
+
+Scraper.run(Scraper.selected_authorities, process_count: process_count)
+end
data/docs/example_scraper.rb
CHANGED
@@ -1,6 +1,8 @@
 #!/usr/bin/env ruby
 # frozen_string_literal: true
 
+Bundler.require
+
 $LOAD_PATH << "./lib"
 
 require "scraper_utils"
@@ -57,8 +59,9 @@ class Scraper
 unless exceptions.empty?
 puts "\n***************************************************"
 puts "Now retrying authorities which earlier had failures"
-puts exceptions.keys.join(", ")
+puts exceptions.keys.join(", ")
 puts "***************************************************"
+ENV['DEBUG'] ||= '1'
 
 start_time = Time.now
 exceptions = scrape(exceptions.keys, 2)
@@ -79,8 +82,21 @@ end
 if __FILE__ == $PROGRAM_NAME
 # Default to list of authorities we can't or won't fix in code, explain why
 # some: url-for-issue Summary Reason
-# councils
+# councils: url-for-issue Summary Reason
+
+if ENV['MORPH_EXPECT_BAD'].nil?
+default_expect_bad = {
+}
+puts 'Default EXPECT_BAD:', default_expect_bad.to_yaml if default_expect_bad.any?
 
-
+ENV["MORPH_EXPECT_BAD"] = default_expect_bad.keys.join(',')
+end
 Scraper.run(Scraper.selected_authorities)
+
+# Dump database for morph-cli
+if File.exist?("tmp/dump-data-sqlite")
+puts "-- dump of data.sqlite --"
+system "sqlite3 data.sqlite .dump"
+puts "-- end of dump --"
+end
 end
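The `default_expect_bad` hash added above is left empty; the `# councils: url-for-issue Summary Reason` comment suggests entries keyed by authority name with a link to the tracking issue and a short reason. A hypothetical filled-in entry (the authority name and URL are invented for illustration) might look like:

```ruby
# Hypothetical example only - key is the authority, value points at the issue and why it is expected to fail
default_expect_bad = {
  "broken_shire" => "https://github.com/example-org/issues/999 Site blocks non-browser requests"
}
```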
data/docs/parallel_scrapers.md
ADDED
@@ -0,0 +1,95 @@
+# Parallel Scraping with ScraperUtils
+
+This guide shows how to parallelize your multi-authority scraper to significantly reduce run times.
+
+## When to Use Parallel Scraping
+
+Use parallel scraping when:
+- You have 10+ authorities taking significant time each
+- Authorities are independent (no shared state)
+- You want to reduce total scraper run time from hours to minutes
+
+## Installation
+
+Add the parallel gem to your Gemfile:
+
+```ruby
+gem "scraperwiki", git: "https://github.com/openaustralia/scraperwiki-ruby.git", branch: "morph_defaults"
+gem 'scraper_utils'
+gem 'parallel' # Add this line
+```
+
+## Modified Scraper Implementation
+
+See `example_parallel_scraper.rb` as an example of how to convert your existing scraper to use parallel processing.
+
+## Key Changes from Sequential Version
+
+1. **Added `parallel` gem** to Gemfile
+2. **Split scraping logic** into `scrape_authority` (single authority) and `scrape_parallel` (coordinator)
+3. **Enable collection mode** with `ScraperUtils::DbUtils.collect_saves!` in each subprocess
+4. **Return results** as `[authority_label, saves, unprocessable, exception]` from each subprocess
+5. **Save in main process** to avoid SQLite locking issues
+6. **Preserve error handling**: UnprocessableRecord exceptions logged but don't re-raise
+
+## Configuration Options
+
+### Process Count
+
+Control the number of parallel processes:
+
+```ruby
+# In code
+process_count = (ENV['MORPH_PROCESSES'] || Etc.nprocessors * 2).to_i
+Scraper.run(authorities, process_count: process_count)
+
+# Via environment variable
+export MORPH_PROCESSES=6
+```
+
+Start with 4 processes and adjust based on:
+- Available CPU cores
+- Memory usage
+- Network bandwidth
+- Target site responsiveness
+
+### Environment Variables
+
+All existing environment variables work unchanged:
+- `MORPH_AUTHORITIES` - filter authorities
+- `MORPH_EXPECT_BAD` - expected bad authorities
+- `DEBUG` - debugging output
+- `MORPH_PROCESSES` - number of parallel processes
+
+## Performance Expectations
+
+Typical performance improvements:
+- **4 processes**: 3-4x faster
+- **8 processes**: 6-7x faster (if you have the cores/bandwidth)
+- **Diminishing returns** beyond 8 processes for most scrapers
+
+Example: 20 authorities × 6 minutes each = 2 hours sequential → 30 minutes with 4 processes
+
+## Debugging Parallel Scrapers
+
+1. **Test with 1 process first**: `process_count: 1` to isolate logic issues
+2. **Check individual authorities**: Use `MORPH_AUTHORITIES=problematic_auth`
+3. **Monitor resource usage**: Watch CPU, memory, and network during runs
+4. **Enable debugging**: `DEBUG=1` works in all processes
+
+## Limitations
+
+- **Shared state**: Each process is isolated - no shared variables between authorities
+- **Memory usage**: Each process uses full memory - monitor total usage
+- **Database locking**: Only the main process writes to SQLite (by design)
+- **Error handling**: Exceptions in one process don't affect others
+
+## Migration from Sequential
+
+Your existing scraper logic requires minimal changes:
+1. Extract single-authority logic into separate method
+2. Add `collect_saves!` call at start of each subprocess
+3. Return collected saves instead of direct database writes
+4. Use `Parallel.map` instead of `each` for authorities
+
+The core scraping logic in `YourScraper.scrape` remains completely unchanged.
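The "Key Changes from Sequential Version" list above boils down to one pattern: collect records in each subprocess, return them as plain data, and let the main process do all SQLite writes. A minimal standalone sketch of that flow with the parallel gem (authority names and records are made up; a real scraper would use `ScraperUtils::DbUtils.collect_saves!` and `save_record` as in the example script):

```ruby
require "parallel"

authorities = %i[example_north example_south]

results = Parallel.map(authorities, in_processes: 2) do |authority_label|
  # Runs in a subprocess: gather records in memory and return them as plain data
  records = [{ "authority_label" => authority_label.to_s, "council_reference" => "DA-1" }]
  [authority_label, records]
end

results.each do |authority_label, records|
  # Runs in the main process: the only place that would touch SQLite
  records.each { |record| puts "would save #{record.inspect} for #{authority_label}" }
end
```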
data/lib/scraper_utils/db_utils.rb
CHANGED
@@ -5,6 +5,22 @@ require "scraperwiki"
 module ScraperUtils
 # Utilities for database operations in scrapers
 module DbUtils
+# Enable in-memory collection mode instead of saving to SQLite
+def self.collect_saves!
+@collected_saves = []
+end
+
+# Save to disk rather than collect
+def self.save_immediately!
+@collected_saves = nil
+end
+
+# Get all collected save calls
+# @return [Array<Array>] Array of [primary_key, record] pairs
+def self.collected_saves
+@collected_saves
+end
+
 # Saves a record to the SQLite database with validation and logging
 #
 # @param record [Hash] The record to be saved
@@ -33,8 +49,12 @@ module ScraperUtils
 else
 ["council_reference"]
 end
-
-
+if @collected_saves
+@collected_saves << record
+else
+ScraperWiki.save_sqlite(primary_key, record)
+ScraperUtils::DataQualityMonitor.log_saved_record(record)
+end
 end
 end
 end
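Read together, the new DbUtils methods form a simple toggle: `collect_saves!` switches `save_record` to buffering in memory, `collected_saves` returns the buffer, and `save_immediately!` restores direct SQLite writes. A short sketch of the intended call sequence (not from the gem's docs):

```ruby
require "scraper_utils"

ScraperUtils::DbUtils.collect_saves!              # buffer records instead of writing to SQLite
# ... call ScraperUtils::DbUtils.save_record(record) as usual; records are appended to the buffer ...
buffered = ScraperUtils::DbUtils.collected_saves  # => array of the records passed to save_record
ScraperUtils::DbUtils.save_immediately!           # switch back to writing directly to SQLite
```

Note that as implemented in this hunk the buffer receives the record itself (`@collected_saves << record`), even though the doc comment describes `[primary_key, record]` pairs.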
data/lib/scraper_utils/spec_support.rb
CHANGED
@@ -7,35 +7,61 @@ module ScraperUtils
 # Methods to support specs
 module SpecSupport
 AUSTRALIAN_STATES = %w[ACT NSW NT QLD SA TAS VIC WA].freeze
+
 STREET_TYPE_PATTERNS = [
+/\bArcade\b/i,
 /\bAv(e(nue)?)?\b/i,
-/\bB(oulevard|lvd)\b/i,
+/\bB(oulevard|lvd|vd)\b/i,
 /\b(Circuit|Cct)\b/i,
+/\bCir(cle)?\b/i,
 /\bCl(ose)?\b/i,
 /\bC(our|r)?t\b/i,
-/\bCircle\b/i,
 /\bChase\b/i,
+/\bCorso\b/i,
 /\bCr(es(cent)?)?\b/i,
+/\bCross\b/i,
 /\bDr((ive)?|v)\b/i,
 /\bEnt(rance)?\b/i,
+/\bEsp(lanade)?\b/i,
 /\bGr(ove)?\b/i,
 /\bH(ighwa|w)y\b/i,
-/\
+/\bL(ane?|a)\b/i,
 /\bLoop\b/i,
+/\bM(ews|w)\b/i,
+/\bP(arade|de)\b/i,
 /\bParkway\b/i,
 /\bPl(ace)?\b/i,
 /\bPriv(ate)?\b/i,
-/\
+/\bProm(enade)?\b/i,
+/\bQuay\b/i,
 /\bR(oa)?d\b/i,
+/\bR(idge|dg)\b/i,
 /\bRise\b/i,
+/\bSq(uare)?\b/i,
 /\bSt(reet)?\b/i,
-/\
-/\
-/\bWay\b/i
+/\bT(erra)?ce\b/i,
+/\bWa?y\b/i
 ].freeze
 
 AUSTRALIAN_POSTCODES = /\b\d{4}\b/.freeze
 
+PLANNING_KEYWORDS = [
+# Building types
+'dwelling', 'house', 'unit', 'building', 'structure', 'facility',
+# Modifications
+'addition', 'extension', 'renovation', 'alteration', 'modification',
+'replacement', 'upgrade', 'improvement',
+# Specific structures
+'carport', 'garage', 'shed', 'pool', 'deck', 'patio', 'pergola',
+'verandah', 'balcony', 'fence', 'wall', 'driveway',
+# Development types
+'subdivision', 'demolition', 'construction', 'development',
+# Services/utilities
+'signage', 'telecommunications', 'stormwater', 'water', 'sewer',
+# Approvals/certificates
+'certificate', 'approval', 'consent', 'permit'
+].freeze
+
 def self.fetch_url_with_redirects(url)
 agent = Mechanize.new
 # FIXME - Allow injection of a check to agree to terms if needed to set a cookie and reget the url
@@ -45,7 +71,7 @@
 def self.authority_label(results, prefix: '', suffix: '')
 return nil if results.nil?
 
-authority_labels = results.map { |record| record['authority_label']}.compact.uniq
+authority_labels = results.map { |record| record['authority_label'] }.compact.uniq
 return nil if authority_labels.empty?
 
 raise "Expected one authority_label, not #{authority_labels.inspect}" if authority_labels.size > 1
@@ -86,22 +112,18 @@
 # Using the pre-compiled patterns
 has_street_type = STREET_TYPE_PATTERNS.any? { |pattern| check_address.match?(pattern) }
 
-has_unit_or_lot = address.match?(/\b(Unit|Lot:?)\s+\d+/i)
-
 uppercase_words = address.scan(/\b[A-Z]{2,}\b/)
 has_uppercase_suburb = uppercase_words.any? { |word| !AUSTRALIAN_STATES.include?(word) }
 
 if ENV["DEBUG"]
 missing = []
-unless has_street_type
-missing << "street type / unit / lot"
-end
+missing << "street type" unless has_street_type
 missing << "postcode/Uppercase suburb" unless has_postcode || has_uppercase_suburb
 missing << "state" unless has_state
 puts " address: #{address} is not geocodable, missing #{missing.join(', ')}" if missing.any?
 end
 
-
+has_street_type && (has_postcode || has_uppercase_suburb) && has_state
 end
 
 PLACEHOLDERS = [
@@ -142,14 +164,23 @@
 # Check if this looks like a "reasonable" description
 # This is a bit stricter than needed - typically assert >= 75% match
 def self.reasonable_description?(text)
-
+return false if placeholder?(text)
+
+# Long descriptions (3+ words) are assumed reasonable
+return true if text.to_s.split.size >= 3
+
+# Short descriptions must contain at least one planning keyword
+text_lower = text.to_s.downcase
+PLANNING_KEYWORDS.any? { |keyword| text_lower.include?(keyword) }
 end
 
 # Validates that all records use the expected global info_url and it returns 200
 # @param results [Array<Hash>] The results from scraping an authority
 # @param expected_url [String] The expected global info_url for this authority
+# @param bot_check_expected [Boolean] Whether bot protection is acceptable
+# @yield [String] Optional block to customize URL fetching (e.g., handle terms agreement)
 # @raise RuntimeError if records don't use the expected URL or it doesn't return 200
-def self.validate_uses_one_valid_info_url!(results, expected_url, bot_check_expected: false)
+def self.validate_uses_one_valid_info_url!(results, expected_url, bot_check_expected: false, &block)
 info_urls = results.map { |record| record["info_url"] }.uniq
 
 unless info_urls.size == 1
@@ -163,11 +194,11 @@
 
 if defined?(VCR)
 VCR.use_cassette("#{authority_label(results, suffix: '_')}one_info_url") do
-page = fetch_url_with_redirects(expected_url)
+page = block_given? ? block.call(expected_url) : fetch_url_with_redirects(expected_url)
 validate_page_response(page, bot_check_expected)
 end
 else
-page = fetch_url_with_redirects(expected_url)
+page = block_given? ? block.call(expected_url) : fetch_url_with_redirects(expected_url)
 validate_page_response(page, bot_check_expected)
 end
 end
@@ -176,14 +207,16 @@
 # @param results [Array<Hash>] The results from scraping an authority
 # @param percentage [Integer] The min percentage of detail checks expected to pass (default:75)
 # @param variation [Integer] The variation allowed in addition to percentage (default:3)
+# @param bot_check_expected [Boolean] Whether bot protection is acceptable
+# @yield [String] Optional block to customize URL fetching (e.g., handle terms agreement)
 # @raise RuntimeError if insufficient detail checks pass
-def self.validate_info_urls_have_expected_details!(results, percentage: 75, variation: 3, bot_check_expected: false)
+def self.validate_info_urls_have_expected_details!(results, percentage: 75, variation: 3, bot_check_expected: false, &block)
 if defined?(VCR)
 VCR.use_cassette("#{authority_label(results, suffix: '_')}info_url_details") do
-check_info_url_details(results, percentage, variation, bot_check_expected)
+check_info_url_details(results, percentage, variation, bot_check_expected, &block)
 end
 else
-check_info_url_details(results, percentage, variation, bot_check_expected)
+check_info_url_details(results, percentage, variation, bot_check_expected, &block)
 end
 end
 
@@ -195,7 +228,7 @@
 
 return false unless page.body
 
-body_lower = page.body
+body_lower = page.body&.downcase
 
 # Check for common bot protection indicators
 bot_indicators = [
@@ -228,7 +261,7 @@
 
 private
 
-def self.check_info_url_details(results, percentage, variation, bot_check_expected)
+def self.check_info_url_details(results, percentage, variation, bot_check_expected, &block)
 count = 0
 failed = 0
 fib_indices = ScraperUtils::MathsUtils.fibonacci_series(results.size - 1).uniq
@@ -238,7 +271,7 @@
 info_url = record["info_url"]
 puts "Checking info_url[#{index}]: #{info_url} has the expected reference, address and description..."
 
-page = fetch_url_with_redirects(info_url)
+page = block_given? ? block.call(info_url) : fetch_url_with_redirects(info_url)
 
 if bot_check_expected && bot_protection_detected?(page)
 puts " Bot protection detected - skipping detailed validation"
@@ -252,22 +285,37 @@
 %w[council_reference address description].each do |attribute|
 count += 1
 expected = CGI.escapeHTML(record[attribute]).gsub(/\s\s+/, " ")
-expected2 =
-
-
+expected2 = case attribute
+when 'council_reference'
+expected.sub(/\ADA\s*-\s*/, '')
+when 'address'
+expected.sub(/(\S+)\s+(\S+)\z/, '\2 \1').sub(/,\s*\z/, '') # Handle Lismore post-code/state swap
+else
+expected
+end
+expected3 = case attribute
+when 'address'
+expected.sub(/\s*,?\s+(VIC|NSW|QLD|SA|TAS|WA|ACT|NT)\z/, '')
+else
+expected
+end.gsub(/\s*,\s*/, ' ').gsub(/\s*-\s*/, '-')
+next if page_body.include?(expected) || page_body.include?(expected2) || page_body.gsub(/\s*,\s*/, ' ').gsub(/\s*-\s*/, '-').include?(expected3)
 
 failed += 1
-
+desc2 = expected2 == expected ? '' : " or #{expected2.inspect}"
+desc3 = expected3 == expected ? '' : " or #{expected3.inspect}"
+puts " Missing: #{expected.inspect}#{desc2}#{desc3}"
 puts " IN: #{page_body}" if ENV['DEBUG']
 
-min_required =
+min_required = ((percentage.to_f / 100.0) * count - variation).round(0)
 passed = count - failed
 raise "Too many failures: #{passed}/#{count} passed (min required: #{min_required})" if passed < min_required
-
-
+end
+end
 
-
-
+puts "#{(100.0 * (count - failed) / count).round(1)}% detail checks passed (#{failed}/#{count} failed)!" if count > 0
+end
+
+end
+end
 
-end
-end
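Putting the new `PLANNING_KEYWORDS` list and the rewritten `reasonable_description?` together, the behaviour shown in this hunk is: placeholders fail, descriptions of three or more words pass, and one- or two-word descriptions pass only when they contain a planning keyword. A few illustrative calls (assuming none of these strings appear in the `PLACEHOLDERS` list, which is not shown in this diff):

```ruby
require "scraper_utils"

ScraperUtils::SpecSupport.reasonable_description?("Dwelling alterations") # => true  ("dwelling" is a planning keyword)
ScraperUtils::SpecSupport.reasonable_description?("Two storey residence") # => true  (three or more words)
ScraperUtils::SpecSupport.reasonable_description?("See plans")            # => false (short, no planning keyword)
```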
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: scraper_utils
 version: !ruby/object:Gem::Version
-version: 0.
+version: 0.9.0
 platform: ruby
 authors:
 - Ian Heggie
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2025-07-
+date: 2025-07-11 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
 name: mechanize
@@ -61,6 +61,7 @@ executables:
 extensions: []
 extra_rdoc_files: []
 files:
+- ".editorconfig"
 - ".gitignore"
 - ".rspec"
 - ".rubocop.yml"
@@ -82,9 +83,11 @@ files:
 - docs/enhancing_specs.md
 - docs/example_custom_Rakefile
 - docs/example_dot_scraper_validation.yml
+- docs/example_parallel_scraper.rb
 - docs/example_scraper.rb
 - docs/getting_started.md
 - docs/mechanize_utilities.md
+- docs/parallel_scrapers.md
 - docs/testing_custom_scrapers.md
 - exe/validate_scraper_data
 - lib/scraper_utils.rb
@@ -106,7 +109,7 @@ metadata:
 allowed_push_host: https://rubygems.org
 homepage_uri: https://github.com/ianheggie-oaf/scraper_utils
 source_code_uri: https://github.com/ianheggie-oaf/scraper_utils
-documentation_uri: https://rubydoc.info/gems/scraper_utils/0.
+documentation_uri: https://rubydoc.info/gems/scraper_utils/0.9.0
 changelog_uri: https://github.com/ianheggie-oaf/scraper_utils/blob/main/CHANGELOG.md
 rubygems_mfa_required: 'true'
 post_install_message: