scraper_utils 0.12.1 → 0.13.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: fc8aab3f24d29cc1e3fe44880d9030e7149372189a8691acd6e628e2a260c57c
- data.tar.gz: a82ecb97878e57f5d0cf75d623c94a6ff54e7805c43a7be140e6a2da7bbcca49
+ metadata.gz: 3bce8cc5a624f9904ebf8bb35ccb5c5c6c831e28ed56f88d3baf3b8d19fbbd13
+ data.tar.gz: 0a481566e846a4274796b0542fb64a805f486065ed08045724cea7bc3d46710d
  SHA512:
- metadata.gz: bd6d178afa8669916b70f2c6bdb56b77a8a5995d18e47c64b360a8d5d334b479645d20792d80759afde1aac6550d79f46a0504030d8a8920c8dea55fa6ad6132
- data.tar.gz: 20a2c5f144cfc8e2106d2e4643d7f3b0cb35110a5519d63bb0d8c1e655c8958fe73115089352a71272a419168c50299c8ce985d318be6bfa9635e64c9c4fb238
+ metadata.gz: 231c167ffe232daacbc862b8c3dd2c0c71be6b8fc2ff061f4f36d88f2e2185a454eb0aa79653c7a99a2ed65c9857d961059456f8403af8c1ed39623cc8e2db6a
+ data.tar.gz: f287f85cdd4cc11cf17c3e5d34d5493e2809f255f3a3544bc881e756f3379c897dd70dbba5ebf16b30837bb8612f42f704872e06c6bec1cad87845606fce6231
data/CHANGELOG.md CHANGED
@@ -1,5 +1,21 @@
  # Changelog
 
+ ## 0.13.1 - 2026-02-21
+
+ * Added PaValidation, which validates records based
+   on [How to write a scraper](https://www.planningalerts.org.au/how_to_write_a_scraper)
+ * `ScraperUtils::PaValidation.validate_record!` raises an exception if the record is invalid
+ * `ScraperUtils::PaValidation.validate_record` returns an Array of error messages if the record is invalid, otherwise nil
+ * Added `ScraperUtils::SpecSupport.validate_unique_references!`, which validates that all references are unique
+ * Note: because records are saved under their unique reference, any duplicates overwrite each other and are never
+   presented to PA, so this essentially checks that you are not losing records due to an incorrect reference
+ * Refactored `DbUtils.save_record` to use PaValidation
+ * Merged `cleanup_old_records` from LogUtils into the same method in DbUtils, bringing across the `force` named param
+ * `LogUtils.cleanup_old_records` now warns that it is deprecated
+ * Increased test coverage
+ * Fixed an edge case in `ScraperUtils::MechanizeUtils::AgentConfig#verify_proxy_works` - it now raises an exception on
+   JSON parse error
+
  ## 0.12.1 - 2026-02-18
 
  * Added override for the threshold of when to abandon scraping due to unprocessable records
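
The two PaValidation entry points listed for 0.13.1 differ only in how failures are reported. A minimal standalone sketch of that pattern (hypothetical top-level methods, not the gem itself; the field list comes from the new `pa_validation.rb`, and `ArgumentError` stands in for the gem's `ScraperUtils::UnprocessableRecord`):

```ruby
# Hypothetical sketch of PaValidation's two entry points, reduced to
# presence checks only.
REQUIRED = %w[council_reference address description info_url date_scraped].freeze

# Returns an Array of error messages, or nil when the record passes.
def validate_record(record)
  record = record.transform_keys(&:to_s)
  errors = REQUIRED.select { |f| record[f].to_s.strip.empty? }
                   .map { |f| "#{f} can't be blank" }
  errors.empty? ? nil : errors
end

# Bang variant: raises instead of returning messages.
def validate_record!(record)
  errors = validate_record(record)
  raise ArgumentError, errors.join("; ") if errors
end
```

The non-bang form suits specs that want to report all problems at once; the bang form suits the save path, where the first invalid record should abort processing.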
@@ -34,15 +50,18 @@
 
  ## 0.9.0 - 2025-07-11
 
- **Significant cleanup - removed code we ended up not using as none of the councils are actually concerned about server load**
+ **Significant cleanup - removed code we ended up not using as none of the councils are actually concerned about server
+ load**
 
  * Refactored example code into simple callable methods
  * Expand test for geocodeable addresses to include comma between postcode and state at the end of the address.
 
  ### Added
+
  - `ScraperUtils::SpecSupport.validate_addresses_are_geocodable!` - validates percentage of geocodable addresses
  - `ScraperUtils::SpecSupport.validate_descriptions_are_reasonable!` - validates percentage of reasonable descriptions
- - `ScraperUtils::SpecSupport.validate_uses_one_valid_info_url!` - validates single global info_url usage and availability
+ - `ScraperUtils::SpecSupport.validate_uses_one_valid_info_url!` - validates single global info_url usage and
+   availability
  - `ScraperUtils::SpecSupport.validate_info_urls_have_expected_details!` - validates info_urls contain expected content
  - `ScraperUtils::MathsUtils.fibonacci_series` - generates fibonacci sequence up to max value
  - `bot_check_expected` parameter to info_url validation methods for handling reCAPTCHA/Cloudflare protection
@@ -53,10 +72,12 @@
  - .editorconfig as an example for scrapers
 
  ### Fixed
+
  - Typo in `geocodable?` method debug output (`has_suburb_stats` → `has_suburb_states`)
  - Code example in `docs/enhancing_specs.md`
 
  ### Updated
+
  - `ScraperUtils::SpecSupport.acceptable_description?` - Accept 1 or 2 word descriptors with planning specific terms
  - Code example in `docs/enhancing_specs.md` to reflect new support methods
  - Code examples
@@ -68,6 +89,7 @@
  - Added extra street types
 
  ### Removed
+
  - Unused CycleUtils
  - Unused DateRangeUtils
  - Unused RandomizeUtils
@@ -150,7 +172,8 @@ Fixed broken v0.2.0
 
  ## 0.2.0 - 2025-02-28
 
- Added FiberScheduler, enabled compliant mode with delays by default and simplified usage removing third retry without proxy
+ Added FiberScheduler, enabled compliant mode with delays by default and simplified usage removing third retry without
+ proxy
 
  ## 0.1.0 - 2025-02-23
 
data/lib/scraper_utils/db_utils.rb CHANGED
@@ -1,5 +1,6 @@
  # frozen_string_literal: true
 
+ require "uri"
  require "scraperwiki"
 
  module ScraperUtils
@@ -27,23 +28,10 @@ module ScraperUtils
  # @raise [ScraperUtils::UnprocessableRecord] If record fails validation
  # @return [void]
  def self.save_record(record)
-   # Validate required fields
-   required_fields = %w[council_reference address description info_url date_scraped]
-   required_fields.each do |field|
-     if record[field].to_s.empty?
-       raise ScraperUtils::UnprocessableRecord, "Missing required field: #{field}"
-     end
-   end
-
-   # Validate date formats
-   %w[date_scraped date_received on_notice_from on_notice_to].each do |date_field|
-     Date.parse(record[date_field]) unless record[date_field].to_s.empty?
-   rescue ArgumentError
-     raise ScraperUtils::UnprocessableRecord,
-           "Invalid date format for #{date_field}: #{record[date_field].inspect}"
-   end
+   record = record.transform_keys(&:to_s)
+   ScraperUtils::PaValidation.validate_record!(record)
 
-   # Determine primary key based on presence of authority_label
+   # Determine the primary key based on the presence of authority_label
    primary_key = if record.key?("authority_label")
      %w[authority_label council_reference]
    else
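
The primary-key choice in `save_record` can be sketched as a standalone helper (`primary_key_for` is a hypothetical name, not part of the gem): a composite key when the scraper tags records with an `authority_label` (multi-authority scrapers), otherwise `council_reference` alone.

```ruby
# Hypothetical helper mirroring the primary-key selection in save_record.
def primary_key_for(record)
  if record.key?("authority_label")
    %w[authority_label council_reference]  # composite key per authority
  else
    %w[council_reference]                  # single-authority scraper
  end
end
```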
@@ -58,7 +46,7 @@ module ScraperUtils
  end
 
  # Clean up records older than 30 days and approx once a month vacuum the DB
- def self.cleanup_old_records
+ def self.cleanup_old_records(force: false)
    cutoff_date = (Date.today - 30).to_s
    vacuum_cutoff_date = (Date.today - 35).to_s
 
@@ -70,15 +58,17 @@ module ScraperUtils
    deleted_count = stats["count"]
    oldest_date = stats["oldest"]
 
-   return unless deleted_count.positive? || ENV["VACUUM"]
+   return unless deleted_count.positive? || ENV["VACUUM"] || force
 
    LogUtils.log "Deleting #{deleted_count} applications scraped between #{oldest_date} and #{cutoff_date}"
    ScraperWiki.sqliteexecute("DELETE FROM data WHERE date_scraped < ?", [cutoff_date])
 
-   return unless rand < 0.03 || (oldest_date && oldest_date < vacuum_cutoff_date) || ENV["VACUUM"]
+   return unless rand < 0.03 || (oldest_date && oldest_date < vacuum_cutoff_date) || ENV["VACUUM"] || force
 
    LogUtils.log " Running VACUUM to reclaim space..."
    ScraperWiki.sqliteexecute("VACUUM")
+ rescue SqliteMagic::NoSuchTable => e
+   ScraperUtils::LogUtils.log "Ignoring: #{e} whilst cleaning old records" if ScraperUtils::DebugUtils.trace?
  end
  end
  end
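
The retention windows used by `cleanup_old_records` are simple date arithmetic: delete rows scraped more than 30 days ago, and consider a VACUUM when the oldest remaining row predates a 35-day cutoff (or `force:` / `ENV["VACUUM"]` is set). A sketch of the two cutoffs, pinned to an example "today" (the release date) so they are deterministic:

```ruby
require "date"

# Sketch of the two cutoffs computed above, with a fixed example date.
today = Date.new(2026, 2, 21)
cutoff_date = (today - 30).to_s        # rows with date_scraped < this are deleted
vacuum_cutoff_date = (today - 35).to_s # an oldest row older than this triggers VACUUM

puts cutoff_date         # "2026-01-22"
puts vacuum_cutoff_date  # "2026-01-17"
```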
data/lib/scraper_utils/log_utils.rb CHANGED
@@ -85,7 +85,7 @@ module ScraperUtils
    failed
  )
 
- cleanup_old_records
+ DbUtils::cleanup_old_records
  end
 
  # Extracts the first relevant line from backtrace that's from our project
@@ -225,21 +225,13 @@ module ScraperUtils
  )
  end
 
+ # Moved to DbUtils
+ # :nocov:
  def self.cleanup_old_records(force: false)
-   cutoff = (Date.today - LOG_RETENTION_DAYS).to_s
-   return if !force && @last_cutoff == cutoff
-
-   @last_cutoff = cutoff
-
-   [SUMMARY_TABLE, LOG_TABLE].each do |table|
-     ScraperWiki.sqliteexecute(
-       "DELETE FROM #{table} WHERE date(run_at) < date(?)",
-       [cutoff]
-     )
-   rescue SqliteMagic::NoSuchTable => e
-     ScraperUtils::LogUtils.log "Ignoring: #{e} whilst cleaning old records" if ScraperUtils::DebugUtils.trace?
-   end
+   warn "`#{self.class}##{__method__}` is deprecated and will be removed in a future release, use `ScraperUtils::DbUtils.cleanup_old_records` instead.", category: :deprecated
+   ScraperUtils::DbUtils.cleanup_old_records(force: force)
  end
+ # :nocov:
 
  # Extracts meaningful backtrace - 3 lines from ruby/gem and max 6 in total
  def self.extract_meaningful_backtrace(error)
data/lib/scraper_utils/mechanize_utils/agent_config.rb CHANGED
@@ -227,6 +227,7 @@ module ScraperUtils
  rescue JSON::ParserError => e
    puts "Couldn't parse public_headers: #{e}! Raw response:"
    puts my_headers.inspect
+   raise "Couldn't parse public_headers as JSON: #{e}!"
  end
  rescue Timeout::Error => e # Includes Net::OpenTimeout
    raise "Proxy check timed out: #{e}"
data/lib/scraper_utils/pa_validation.rb ADDED
@@ -0,0 +1,88 @@
+ # frozen_string_literal: true
+
+ require "uri"
+ require "date"
+
+ module ScraperUtils
+   # Validates scraper records match Planning Alerts requirements before submission.
+   # Use in specs to catch problems early rather than waiting for PA's import.
+   module PaValidation
+     REQUIRED_FIELDS = %w[council_reference address description date_scraped].freeze
+
+     # Validates a single record (hash with string keys) against PA's rules.
+     # @param record [Hash] The record to validate
+     # @raise [ScraperUtils::UnprocessableRecord] if there are error messages
+     def self.validate_record!(record)
+       errors = validate_record(record)
+       raise(ScraperUtils::UnprocessableRecord, errors.join("; ")) if errors&.any?
+     end
+
+     # Validates a single record (hash with string keys) against PA's rules.
+     # @param record [Hash] The record to validate
+     # @return [Array<String>, nil] Array of error messages, or nil if valid
+     def self.validate_record(record)
+       record = record.transform_keys(&:to_s)
+       errors = []
+
+       validate_presence(record, errors)
+       validate_info_url(record, errors)
+       validate_dates(record, errors)
+
+       errors.empty? ? nil : errors
+     end
+
+     private
+
+     def self.validate_presence(record, errors)
+       REQUIRED_FIELDS.each do |field|
+         errors << "#{field} can't be blank" if record[field].to_s.strip.empty?
+       end
+       errors << "info_url can't be blank" if record["info_url"].to_s.strip.empty?
+     end
+
+     def self.validate_info_url(record, errors)
+       url = record["info_url"].to_s.strip
+       return if url.empty? # already caught by presence check
+
+       begin
+         uri = URI.parse(url)
+         unless uri.is_a?(URI::HTTP) && uri.host.to_s != ""
+           errors << "info_url must be a valid http/https URL with host"
+         end
+       rescue URI::InvalidURIError
+         errors << "info_url must be a valid http/https URL"
+       end
+     end
+
+     def self.validate_dates(record, errors)
+       today = Date.today
+
+       date_scraped = parse_date(record["date_scraped"])
+       errors << "Invalid date format for date_scraped: #{record["date_scraped"].inspect} is not a valid ISO 8601 date" if record["date_scraped"] && date_scraped.nil?
+
+       date_received = parse_date(record["date_received"])
+       if record["date_received"] && date_received.nil?
+         errors << "Invalid date format for date_received: #{record["date_received"].inspect} is not a valid ISO 8601 date"
+       elsif date_received && date_received.to_date > today
+         errors << "Invalid date for date_received: #{record["date_received"].inspect} is in the future"
+       end
+
+       %w[on_notice_from on_notice_to].each do |field|
+         val = parse_date(record[field])
+         errors << "Invalid date format for #{field}: #{record[field].inspect} is not a valid ISO 8601 date" if record[field] && val.nil?
+       end
+     end
+
+     # Returns a Date if value is already a Date, or parses a YYYY-MM-DD string.
+     # Returns nil if unparseable or blank.
+     def self.parse_date(value)
+       return nil if value.nil? || value == ""
+       return value if value.is_a?(Date) || value.is_a?(Time)
+       return nil unless value.is_a?(String) && value =~ /\A\d{4}-\d{2}-\d{2}\z/
+
+       Date.parse(value)
+     rescue ArgumentError
+       nil
+     end
+   end
+ end
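
The date handling in this new file accepts only ISO 8601 `YYYY-MM-DD` strings (or existing Date/Time objects): the regexp gate rejects other shapes before `Date.parse` can guess at them, and `Date.parse` itself rejects impossible dates. A standalone copy of that check, simplified to Strings and Dates only (the gem also passes Time through):

```ruby
require "date"

# Standalone copy of the ISO-8601-only rule in PaValidation.parse_date,
# simplified to Strings and Dates.
def parse_date(value)
  return nil if value.nil? || value == ""
  return value if value.is_a?(Date)
  # Require the exact YYYY-MM-DD shape before letting Date.parse interpret it
  return nil unless value.is_a?(String) && value =~ /\A\d{4}-\d{2}-\d{2}\z/

  Date.parse(value)
rescue ArgumentError
  nil
end

parse_date("2026-02-21") # valid ISO 8601 date
parse_date("21/02/2026") # rejected: wrong shape, even though Date.parse could read it
parse_date("2026-02-30") # rejected: right shape but not a real date
```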
data/lib/scraper_utils/spec_support.rb CHANGED
@@ -78,6 +78,19 @@ module ScraperUtils
    "#{prefix}#{authority_labels.first}#{suffix}"
  end
 
+ # Finds records with duplicate [authority_label, council_reference] keys.
+ # @param records [Array<Hash>] All records to check
+ # @raise [UnprocessableSite] if any [authority_label, council_reference] key occurs more than once
+ # @return [nil] when all references are unique
+ def self.validate_unique_references!(records)
+   groups = records.group_by do |r|
+     [r["authority_label"], r["council_reference"]&.downcase]
+   end
+   duplicates = groups.select { |_k, g| g.size > 1 }
+   return if duplicates.empty?
+
+   raise UnprocessableSite, "Duplicate [authority_label, council_reference] keys: #{duplicates.keys.map(&:inspect).join(', ')}"
+ end
+
  # Validates enough addresses are geocodable
  # @param results [Array<Hash>] The results from scraping an authority
  # @param percentage [Integer] The min percentage of addresses expected to be geocodable (default:50)
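
The duplicate detection added in `validate_unique_references!` groups records by `[authority_label, downcased council_reference]` and flags any group of two or more. A standalone sketch (`duplicate_keys` is a hypothetical helper that returns the offending keys rather than raising):

```ruby
# Hypothetical helper mirroring validate_unique_references!'s grouping:
# records sharing a [authority_label, downcased council_reference] key
# would silently overwrite each other when saved.
def duplicate_keys(records)
  records.group_by { |r| [r["authority_label"], r["council_reference"]&.downcase] }
         .select { |_key, group| group.size > 1 }
         .keys
end

records = [
  { "authority_label" => "north", "council_reference" => "DA-1" },
  { "authority_label" => "north", "council_reference" => "da-1" }, # duplicate (case-insensitive)
  { "authority_label" => "south", "council_reference" => "DA-1" }, # distinct authority, not a duplicate
]
```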
@@ -93,7 +106,9 @@ module ScraperUtils
    puts "Found #{geocodable} out of #{results.count} unique geocodable addresses " \
         "(#{(100.0 * geocodable / results.count).round(1)}%)"
    expected = [((percentage.to_f / 100.0) * results.count - variation), 1].max
-   raise "Expected at least #{expected} (#{percentage}% - #{variation}) geocodable addresses, got #{geocodable}" unless geocodable >= expected
+   unless geocodable >= expected
+     raise UnprocessableSite, "Expected at least #{expected} (#{percentage}% - #{variation}) geocodable addresses, got #{geocodable}"
+   end
    geocodable
  end
 
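
The pass threshold in the geocodable-address check above (and the matching descriptions check later in the file) is the same arithmetic: the requested percentage of the result count minus an allowed variation, floored at 1 so at least one match is always required. A sketch (`min_expected` is a hypothetical name):

```ruby
# Hypothetical helper showing the shared threshold arithmetic:
# percentage of count, minus the allowed variation, never below 1.
def min_expected(count, percentage, variation)
  [(percentage.to_f / 100.0) * count - variation, 1].max
end

min_expected(40, 50, 3) # 50% of 40, minus 3 => 17.0
min_expected(2, 50, 3)  # would be negative, floored => 1
```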
@@ -157,7 +172,7 @@ module ScraperUtils
    puts "Found #{descriptions} out of #{results.count} unique reasonable descriptions " \
         "(#{(100.0 * descriptions / results.count).round(1)}%)"
    expected = [(percentage.to_f / 100.0) * results.count - variation, 1].max
-   raise "Expected at least #{expected} (#{percentage}% - #{variation}) reasonable descriptions, got #{descriptions}" unless descriptions >= expected
+   raise UnprocessableSite, "Expected at least #{expected} (#{percentage}% - #{variation}) reasonable descriptions, got #{descriptions}" unless descriptions >= expected
    descriptions
  end
 
@@ -278,7 +293,7 @@ module ScraperUtils
    next
  end
 
- raise "Expected 200 response, got #{page.code}" unless page.code == "200"
+ raise UnprocessableRecord, "Expected 200 response, got #{page.code}" unless page.code == "200"
 
  page_body = page.body.dup.force_encoding("UTF-8").gsub(/\s\s+/, " ")
 
@@ -310,12 +325,10 @@ module ScraperUtils
    min_required = ((percentage.to_f / 100.0) * count - variation).round(0)
    passed = count - failed
    raise "Too many failures: #{passed}/#{count} passed (min required: #{min_required})" if passed < min_required
- end
- end
-
- puts "#{(100.0 * (count - failed) / count).round(1)}% detail checks passed (#{failed}/#{count} failed)!" if count > 0
- end
-
- end
- end
+ end
+ end
 
+ puts "#{(100.0 * (count - failed) / count).round(1)}% detail checks passed (#{failed}/#{count} failed)!" if count > 0
+ end
+ end
+ end
data/lib/scraper_utils/version.rb CHANGED
@@ -1,5 +1,5 @@
  # frozen_string_literal: true
 
  module ScraperUtils
-   VERSION = "0.12.1"
+   VERSION = "0.13.1"
  end
data/lib/scraper_utils.rb CHANGED
@@ -5,6 +5,7 @@ require "scraper_utils/version"
  # Public Apis (responsible for requiring their own dependencies)
  require "scraper_utils/authority_utils"
  require "scraper_utils/data_quality_monitor"
+ require "scraper_utils/pa_validation"
  require "scraper_utils/db_utils"
  require "scraper_utils/debug_utils"
  require "scraper_utils/log_utils"
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: scraper_utils
  version: !ruby/object:Gem::Version
-   version: 0.12.1
+   version: 0.13.1
  platform: ruby
  authors:
  - Ian Heggie
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2026-02-18 00:00:00.000000000 Z
+ date: 2026-02-21 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: mechanize
@@ -102,6 +102,7 @@ files:
  - lib/scraper_utils/mechanize_utils.rb
  - lib/scraper_utils/mechanize_utils/agent_config.rb
  - lib/scraper_utils/misc_utils.rb
+ - lib/scraper_utils/pa_validation.rb
  - lib/scraper_utils/spec_support.rb
  - lib/scraper_utils/version.rb
  - scraper_utils.gemspec
@@ -112,7 +113,7 @@ metadata:
  allowed_push_host: https://rubygems.org
  homepage_uri: https://github.com/ianheggie-oaf/scraper_utils
  source_code_uri: https://github.com/ianheggie-oaf/scraper_utils
- documentation_uri: https://rubydoc.info/gems/scraper_utils/0.12.1
+ documentation_uri: https://rubydoc.info/gems/scraper_utils/0.13.1
  changelog_uri: https://github.com/ianheggie-oaf/scraper_utils/blob/main/CHANGELOG.md
  rubygems_mfa_required: 'true'
  post_install_message: