scraper_utils 0.6.0 → 0.7.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 13ad14102f284c98d658bb928bcf7806ea7594326d11c5426903ebc6b1f919e0
- data.tar.gz: 260cd94a1b76e9851f5af47f716dc754386b081f3923ee0a0eb6fb2b2d086c4f
+ metadata.gz: 4293294d99565b0ee4adc097e9d31619868477e6efa8255f3cd5ec02dc8b275b
+ data.tar.gz: e78a08bced34a1f55b8d6e869158887da33f1d6b9803493c7d836402d64aa4ef
  SHA512:
- metadata.gz: 824b9e64ae7debdf9cddfc90b47de6dff7865e3f655ad6022f58181f38efd06788413e0251525410736e58d6fd325917a8b7a0ad6b2468fa7ab9de3b697955af
- data.tar.gz: fd154118b2eaa22962f4343a3f62ca1daabc19de924a36e4c3cd4f61c9c7bb08a232554f10141b4e3534f3fa2e546a86f860d3bc7997f54d29a067cfb3c4f451
+ metadata.gz: ea00598b58e69cf5d911b62148d5d38e13ef017dd5a22159839f3b4437bf28c7c2b17c4832e3869688970037048f69fb543f9645922d6b92888a5fc68d88fc33
+ data.tar.gz: 57500daacbec8602c9288bf27bd7ac5f2ff407f621f9ac2c9e114a85fbbb0c555ebebfdc62782b814616b9b46dad65bd5a665b24bb1c9e6774625469bb290485
data/CHANGELOG.md CHANGED
@@ -1,5 +1,22 @@
  # Changelog

+ ## 0.7.1 - 2025-04-15
+
+ * Accept mixed case suburb names after a comma as well as uppercase suburb names as geocodable
+ * Accept more street type abbreviations and check they are on word boundaries
+
+ ## 0.7.0 - 2025-04-15
+
+ * Added Spec helpers and associated doc: `docs/enhancing_specs.md`
+   * `ScraperUtils::SpecSupport.geocodable?`
+   * `ScraperUtils::SpecSupport.reasonable_description?`
+
+ ## 0.6.1 - 2025-03-28
+
+ * Changed DEFAULT_MAX_LOAD to 50.0 as we were overestimating the load we present, since network latency is included
+ * Correct documentation of spec_helper extra lines
+ * Fix misc bugs found in use
+
  ## 0.6.0 - 2025-03-16

  * Add threads for more efficient scraping
data/IMPLEMENTATION.md CHANGED
@@ -54,7 +54,6 @@ Our test directory structure reflects various testing strategies and aspects of
  These specs check the options we use when things go wrong in production

  - `spec/scraper_utils/no_threads/` - Tests with threads disabled (`MORPH_DISABLE_THREADS=1`)
- - `spec/scraper_utils/no_fibers/` - Tests with fibers disabled (`MORPH_MAX_WORKERS=0`)
  - `spec/scraper_utils/sequential/` - Tests with exactly one worker (`MORPH_MAX_WORKERS=1`)

  ### Directories to break up large specs
data/README.md CHANGED
@@ -15,11 +15,11 @@ Our goal is to access public planning information with minimal impact on your se
  default:

  - **Limit server load**:
-   - We limit the max load we present to your server to well less than a third of a single cpu
-   - The more loaded your server is, the longer we wait between requests
+   - We limit the max load we present to your server to less than a half of one of your cpu cores
+   - The more loaded your server is, the longer we wait between requests!
    - We respect Crawl-delay from robots.txt (see section below), so you can tell us an acceptable rate
    - Scraper developers can
-     - reduce the max_load we present to your server even lower
+     - reduce the max_load we present to your server even further
      - add random extra delays to give your server a chance to catch up with background tasks

  - **Identify themselves**: Our user agent clearly indicates who we are and provides a link to the project repository:
data/docs/enhancing_specs.md ADDED
@@ -0,0 +1,100 @@
+ # Enhancing specs
+
+ ScraperUtils provides two methods to help with checking results
+
+ * `ScraperUtils::SpecSupport.geocodable?`
+ * `ScraperUtils::SpecSupport.reasonable_description?`
+
+ ## Example Code:
+
+ ```ruby
+ # frozen_string_literal: true
+
+ require "timecop"
+ require_relative "../scraper"
+
+ RSpec.describe Scraper do
+   describe ".scrape" do
+     def test_scrape(authority)
+       File.delete("./data.sqlite") if File.exist?("./data.sqlite")
+
+       VCR.use_cassette(authority) do
+         date = Date.new(2025, 4, 15)
+         Timecop.freeze(date) do
+           Scraper.scrape([authority], 1)
+         end
+       end
+
+       expected = if File.exist?("spec/expected/#{authority}.yml")
+                    YAML.safe_load(File.read("spec/expected/#{authority}.yml"))
+                  else
+                    []
+                  end
+       results = ScraperWiki.select("* from data order by council_reference")
+
+       ScraperWiki.close_sqlite
+
+       if results != expected
+         # Overwrite expected so that we can compare with version control
+         # (and maybe commit if it is correct)
+         File.open("spec/expected/#{authority}.yml", "w") do |f|
+           f.write(results.to_yaml)
+         end
+       end
+
+       expect(results).to eq expected
+
+       geocodable = results
+         .map { |record| record["address"] }
+         .uniq
+         .count { |text| SpecHelper.geocodable? text }
+       puts "Found #{geocodable} out of #{results.count} unique geocodable addresses " \
+            "(#{(100.0 * geocodable / results.count).round(1)}%)"
+       expect(geocodable).to be > (0.7 * results.count)
+
+       descriptions = results
+         .map { |record| record["description"] }
+         .uniq
+         .count do |text|
+           selected = SpecHelper.reasonable_description? text
+           puts " description: #{text} is not reasonable" if ENV["DEBUG"] && !selected
+           selected
+         end
+       puts "Found #{descriptions} out of #{results.count} unique reasonable descriptions " \
+            "(#{(100.0 * descriptions / results.count).round(1)}%)"
+       expect(descriptions).to be > (0.55 * results.count)
+
+       info_urls = results
+         .map { |record| record["info_url"] }
+         .uniq
+         .count { |text| text.to_s.match(%r{\Ahttps?://}) }
+       puts "Found #{info_urls} out of #{results.count} unique info_urls " \
+            "(#{(100.0 * info_urls / results.count).round(1)}%)"
+       expect(info_urls).to be > (0.7 * results.count) if info_urls != 1
+
+       VCR.use_cassette("#{authority}.info_urls") do
+         results.each do |record|
+           info_url = record["info_url"]
+           puts "Checking info_url #{info_url} #{info_urls > 1 ? ' has expected details' : ''} ..."
+           response = Net::HTTP.get_response(URI(info_url))
+
+           expect(response.code).to eq("200")
+           # If info_url is the same for all records, then it won't have details
+           break if info_urls == 1
+
+           expect(response.body).to include(record["council_reference"])
+           expect(response.body).to include(record["address"])
+           expect(response.body).to include(record["description"])
+         end
+       end
+     end
+
+     Scraper.selected_authorities.each do |authority|
+       it authority do
+         test_scrape(authority)
+       end
+     end
+   end
+ end
+
+ ```
@@ -26,7 +26,8 @@ This uses {ScraperUtils::RandomizeUtils} for determining the order of operations
  `spec/spec_helper.rb`:

  ```ruby
- ScraperUtils::RandomizeUtils.sequential = true
+ ScraperUtils::RandomizeUtils.random = false
+ ScraperUtils::Scheduler.max_workers = 1
  ```

  For full details, see the {Scheduler}.
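
For context, the new two-line setup is typically all a scraper's spec support needs from this gem. A minimal sketch of a complete `spec/spec_helper.rb` follows (assuming the scraper already depends on the gem; the comments describe the usual intent rather than documented guarantees):

```ruby
# spec/spec_helper.rb (minimal sketch)
require "scraper_utils"

# Keep the order of operations deterministic so spec runs are repeatable
ScraperUtils::RandomizeUtils.random = false

# Run a single worker so requests happen sequentially, which keeps recorded
# HTTP interactions (for example VCR cassettes) replaying in a stable order
ScraperUtils::Scheduler.max_workers = 1
```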
@@ -29,16 +29,16 @@ The agent returned is configured using Mechanize hooks to implement the desired
  ### Default Configuration

  By default, the Mechanize agent is configured with the following settings.
- As you can see, the defaults can be changed using env variables.
+ As you can see, the defaults can be changed using env variables or via code.

  Note - compliant mode forces max_load to be set to a value no greater than 50.

  ```ruby
  ScraperUtils::MechanizeUtils::AgentConfig.configure do |config|
-   config.default_timeout = ENV.fetch('MORPH_TIMEOUT', 60).to_i # 60
+   config.default_timeout = ENV.fetch('MORPH_TIMEOUT', DEFAULT_TIMEOUT).to_i # 60
    config.default_compliant_mode = ENV.fetch('MORPH_NOT_COMPLIANT', nil).to_s.empty? # true
-   config.default_random_delay = ENV.fetch('MORPH_RANDOM_DELAY', 5).to_i # 5
-   config.default_max_load = ENV.fetch('MORPH_MAX_LOAD', 33.3).to_f # 33.3
+   config.default_random_delay = ENV.fetch('MORPH_RANDOM_DELAY', DEFAULT_RANDOM_DELAY).to_i # 0
+   config.default_max_load = ENV.fetch('MORPH_MAX_LOAD', DEFAULT_MAX_LOAD).to_f # 50.0
    config.default_disable_ssl_certificate_check = !ENV.fetch('MORPH_DISABLE_SSL_CHECK', nil).to_s.empty? # false
    config.default_australian_proxy = !ENV.fetch('MORPH_USE_PROXY', nil).to_s.empty? # false
    config.default_user_agent = ENV.fetch('MORPH_USER_AGENT', nil) # Uses Mechanize user agent
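
As a concrete illustration of changing defaults "via code" rather than env variables, a scraper could override just the values it cares about before creating any agents. This is a sketch only; the specific numbers are arbitrary examples, not recommendations:

```ruby
# Early in scraper.rb, before any Mechanize agents are built
ScraperUtils::MechanizeUtils::AgentConfig.configure do |config|
  config.default_max_load = 20.0   # present even less load than the shipped default of 50.0
  config.default_random_delay = 5  # add random extra delays (seconds) to give the server a chance to catch up
end
```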
@@ -1,6 +1,7 @@
  # frozen_string_literal: true

  require "uri"
+ require_relative "agent_config"

  module ScraperUtils
    module MechanizeUtils
@@ -17,13 +18,13 @@ module ScraperUtils
  #
  # @param min_delay [Float] Minimum delay between requests in seconds
  # @param max_delay [Float] Maximum delay between requests in seconds
- # @param max_load [Float] Maximum load percentage (1-99) we aim to place on the server
+ # @param max_load [Float] Maximum load percentage (1..Constants::MAX_LOAD_CAP) we aim to place on the server
  #   Lower values are more conservative (e.g., 20% = 4x response time delay)
- def initialize(min_delay: DEFAULT_MIN_DELAY, max_delay: DEFAULT_MAX_DELAY, max_load: 20.0)
+ def initialize(min_delay: DEFAULT_MIN_DELAY, max_delay: DEFAULT_MAX_DELAY, max_load: AgentConfig::DEFAULT_MAX_LOAD)
    @delays = {} # domain -> last delay used
    @min_delay = min_delay.to_f
    @max_delay = max_delay.to_f
-   @max_load = max_load.to_f.clamp(1.0, 99.0)
+   @max_load = max_load.to_f.clamp(1.0, AgentConfig::MAX_LOAD_CAP)
    @response_multiplier = (100.0 - @max_load) / @max_load

    return unless DebugUtils.basic?
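
The `@response_multiplier` assignment is where `max_load` becomes an actual wait: the doc comment's "20% = 4x response time delay" is just `(100.0 - 20.0) / 20.0`. The throwaway sketch below (not part of the gem) evaluates that formula for the old and new defaults plus the new cap, assuming the delay applied is roughly the last response time scaled by this multiplier, as the doc comment suggests:

```ruby
# Evaluate (100.0 - max_load) / max_load for a few interesting values
[20.0, 33.3, 50.0, 80.0].each do |max_load|
  multiplier = (100.0 - max_load) / max_load
  puts format("max_load %5.1f%% -> delay ~= %.2f x last response time", max_load, multiplier)
end
# max_load  20.0% -> delay ~= 4.00 x last response time  (the "20% = 4x" example)
# max_load  33.3% -> delay ~= 2.00 x last response time  (old DEFAULT_MAX_LOAD)
# max_load  50.0% -> delay ~= 1.00 x last response time  (new DEFAULT_MAX_LOAD)
# max_load  80.0% -> delay ~= 0.25 x last response time  (new MAX_LOAD_CAP)
```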
@@ -25,8 +25,8 @@ module ScraperUtils
  class AgentConfig
    DEFAULT_TIMEOUT = 60
    DEFAULT_RANDOM_DELAY = 0
-   DEFAULT_MAX_LOAD = 33.3
-   MAX_LOAD_CAP = 50.0
+   DEFAULT_MAX_LOAD = 50.0
+   MAX_LOAD_CAP = 80.0

    # Class-level defaults that can be modified
    class << self
@@ -69,8 +69,8 @@ module ScraperUtils
  def reset_defaults!
    @default_timeout = ENV.fetch('MORPH_CLIENT_TIMEOUT', DEFAULT_TIMEOUT).to_i # 60
    @default_compliant_mode = ENV.fetch('MORPH_NOT_COMPLIANT', nil).to_s.empty? # true
-   @default_random_delay = ENV.fetch('MORPH_RANDOM_DELAY', DEFAULT_RANDOM_DELAY).to_i # 5
-   @default_max_load = ENV.fetch('MORPH_MAX_LOAD', DEFAULT_MAX_LOAD).to_f # 33.3
+   @default_random_delay = ENV.fetch('MORPH_RANDOM_DELAY', DEFAULT_RANDOM_DELAY).to_i # 0
+   @default_max_load = ENV.fetch('MORPH_MAX_LOAD', DEFAULT_MAX_LOAD).to_f # 50.0
    @default_disable_ssl_certificate_check = !ENV.fetch('MORPH_DISABLE_SSL_CHECK', nil).to_s.empty? # false
    @default_australian_proxy = !ENV.fetch('MORPH_USE_PROXY', nil).to_s.empty? # false
    @default_user_agent = ENV.fetch('MORPH_USER_AGENT', nil) # Uses Mechanize user agent
@@ -68,7 +68,7 @@ module ScraperUtils

  # Controls whether Mechanize network requests are executed in parallel using threads
  #
- # @return [Integer] max concurrent workers using fibers and threads, defaults to MAX_WORKERS env variable or 50
+ # @return [Integer] max concurrent workers using fibers and threads, defaults to MORPH_MAX_WORKERS env variable or 50
  attr_accessor :max_workers

  # @return [Hash{Symbol => Exception}] exceptions by authority
@@ -259,7 +259,7 @@ module ScraperUtils
  #
  # @return [ThreadResponse, nil] Result of request execution
  def self.get_response(non_block = true)
-   return nil if non_block && @response_queue.empty?
+   return nil if @response_queue.nil? || (non_block && @response_queue.empty?)

    @response_queue.pop(non_block)
  end
data/lib/scraper_utils/spec_support.rb ADDED
@@ -0,0 +1,91 @@
+ # frozen_string_literal: true
+
+ require "scraperwiki"
+
+ module ScraperUtils
+   # Methods to support specs
+   module SpecSupport
+     AUSTRALIAN_STATES = %w[ACT NSW NT QLD SA TAS VIC WA].freeze
+     STREET_TYPE_PATTERNS = [
+       /\bAv(e(nue)?)?\b/i,
+       /\bB(oulevard|lvd)\b/i,
+       /\b(Circuit|Cct)\b/i,
+       /\bCl(ose)?\b/i,
+       /\bC(our|r)t\b/i,
+       /\bCircle\b/i,
+       /\bChase\b/i,
+       /\bCr(escent)?\b/i,
+       /\bDr((ive)?|v)\b/i,
+       /\bEnt(rance)?\b/i,
+       /\bGr(ove)?\b/i,
+       /\bH(ighwa|w)y\b/i,
+       /\bLane\b/i,
+       /\bLoop\b/i,
+       /\bParkway\b/i,
+       /\bPl(ace)?\b/i,
+       /\bPriv(ate)?\b/i,
+       /\bParade\b/i,
+       /\bR(oa)?d\b/i,
+       /\bRise\b/i,
+       /\bSt(reet)?\b/i,
+       /\bSquare\b/i,
+       /\bTerrace\b/i,
+       /\bWay\b/i
+     ].freeze
+
+
+     AUSTRALIAN_POSTCODES = /\b\d{4}\b/.freeze
+
+     # Check if an address is likely to be geocodable by analyzing its format.
+     # This is a bit stricter than needed - typically assert >= 75% match
+     # @param address [String] The address to check
+     # @return [Boolean] True if the address appears to be geocodable.
+     def self.geocodable?(address, ignore_case: false)
+       return false if address.nil? || address.empty?
+       check_address = ignore_case ? address.upcase : address
+
+       # Basic structure check - must have a street name, suburb, state and postcode
+       has_state = AUSTRALIAN_STATES.any? { |state| check_address.end_with?(" #{state}") || check_address.include?(" #{state} ") }
+       has_postcode = address.match?(AUSTRALIAN_POSTCODES)
+
+       # Using the pre-compiled patterns
+       has_street_type = STREET_TYPE_PATTERNS.any? { |pattern| check_address.match?(pattern) }
+
+       has_unit_or_lot = address.match?(/\b(Unit|Lot:?)\s+\d+/i)
+
+       has_suburb_stats = check_address.match?(/(\b[A-Z]{2,}(\s+[A-Z]+)*,?|,\s+[A-Z][A-Za-z ]+)\s+(#{AUSTRALIAN_STATES.join('|')})\b/)
+
+       if ENV["DEBUG"]
+         missing = []
+         unless has_street_type || has_unit_or_lot
+           missing << "street type / unit / lot"
+         end
+         missing << "state" unless has_state
+         missing << "postcode" unless has_postcode
+         missing << "suburb state" unless has_suburb_stats
+         puts " address: #{address} is not geocodable, missing #{missing.join(', ')}" if missing.any?
+       end
+
+       (has_street_type || has_unit_or_lot) && has_state && has_postcode && has_suburb_stats
+     end
+
+     PLACEHOLDERS = [
+       /no description/i,
+       /not available/i,
+       /to be confirmed/i,
+       /\btbc\b/i,
+       %r{\bn/a\b}i
+     ].freeze
+
+     def self.placeholder?(text)
+       PLACEHOLDERS.any? { |placeholder| text.to_s.match?(placeholder) }
+     end
+
+     # Check if this looks like a "reasonable" description
+     # This is a bit stricter than needed - typically assert >= 75% match
+     def self.reasonable_description?(text)
+       !placeholder?(text) && text.to_s.split.size >= 3
+     end
+   end
+ end
+
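
For a quick feel for what these helpers accept and reject, here is a short sketch using made-up addresses and descriptions (hypothetical values, chosen only to illustrate the checks implemented above):

```ruby
require "scraper_utils"

ScraperUtils::SpecSupport.geocodable?("12 Example St, Sampleville NSW 2000")
# => true  (street type, suburb after a comma, state and postcode all present)
ScraperUtils::SpecSupport.geocodable?("Lot 3 Example Road")
# => false (no state or postcode)

ScraperUtils::SpecSupport.reasonable_description?("Construction of a single storey dwelling")
# => true
ScraperUtils::SpecSupport.reasonable_description?("N/A")
# => false (matches a placeholder pattern)
ScraperUtils::SpecSupport.reasonable_description?("New dwelling")
# => false (fewer than three words)
```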
data/lib/scraper_utils/version.rb CHANGED
@@ -1,5 +1,5 @@
  # frozen_string_literal: true

  module ScraperUtils
-   VERSION = "0.6.0"
+   VERSION = "0.7.1"
  end
data/lib/scraper_utils.rb CHANGED
@@ -12,6 +12,7 @@ require "scraper_utils/debug_utils"
  require "scraper_utils/log_utils"
  require "scraper_utils/randomize_utils"
  require "scraper_utils/scheduler"
+ require "scraper_utils/spec_support"

  # Mechanize utilities
  require "scraper_utils/mechanize_actions"
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: scraper_utils
  version: !ruby/object:Gem::Version
-   version: 0.6.0
+   version: 0.7.1
  platform: ruby
  authors:
  - Ian Heggie
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2025-03-27 00:00:00.000000000 Z
+ date: 2025-04-15 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: mechanize
@@ -77,6 +77,7 @@ files:
  - bin/rspec
  - bin/setup
  - docs/debugging.md
+ - docs/enhancing_specs.md
  - docs/example_scrape_with_fibers.rb
  - docs/example_scraper.rb
  - docs/fibers_and_threads.md
@@ -107,6 +108,7 @@ files:
  - lib/scraper_utils/scheduler/process_request.rb
  - lib/scraper_utils/scheduler/thread_request.rb
  - lib/scraper_utils/scheduler/thread_response.rb
+ - lib/scraper_utils/spec_support.rb
  - lib/scraper_utils/version.rb
  - scraper_utils.gemspec
  homepage: https://github.com/ianheggie-oaf/scraper_utils
@@ -116,7 +118,7 @@ metadata:
  allowed_push_host: https://rubygems.org
  homepage_uri: https://github.com/ianheggie-oaf/scraper_utils
  source_code_uri: https://github.com/ianheggie-oaf/scraper_utils
- documentation_uri: https://rubydoc.info/gems/scraper_utils/0.6.0
+ documentation_uri: https://rubydoc.info/gems/scraper_utils/0.7.1
  changelog_uri: https://github.com/ianheggie-oaf/scraper_utils/blob/main/CHANGELOG.md
  rubygems_mfa_required: 'true'
  post_install_message: