scraper_utils 0.3.0 → 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: a2082c406ad96266f644fc1dfc588046aa43494e9e79b8fec38fe59252c09f06
- data.tar.gz: 5100907cdcc8c55ddd59b25cf393d212f681b573ef874c5a6c65f748a8c852ec
+ metadata.gz: 9443f89de2518d12830ddcd3d4d5e246c77a35b34e6549031a2f61a454c9c96c
+ data.tar.gz: 693aeaeb896c9779135f3c703f86aa436a3d9b10715448cc18c96c590ddf3723
  SHA512:
- metadata.gz: feabea23ac5b14f6b642769db303bcab954b22cfd2d95d7694af51afd661c73f1ff70a0c17b54557a730fe811275c126c0fdd4473e2ae6d2b83cfc495aa52bc6
- data.tar.gz: d91154c0dbccfb4271fd4830a82316676c1ad5665773c30d212bcc1ab6178a09d4fee97b2b2a32480c32b8ee69da08fc9a78575460d49f8233fb91b74bf7df66
+ metadata.gz: 38d23ede40ee6313c0c8098b34d1258829c2842e2b43fe39fd156503d0c387fc59929fec740205ffc232cafdfe095932a62f937b3aded2ab7b5990dc889615d7
+ data.tar.gz: 7e9b7da5194265fdd77a8136202bf11a3293a2dda0b2f5d00ffaf7503872079233889b80d81ab2267eccbd9c570b56e8778b9e4fce28cee9e3a2657788f02334
data/CHANGELOG.md CHANGED
@@ -1,5 +1,15 @@
  # Changelog

+ ## 0.4.0 - 2025-03-04
+
+ * Add Cycle Utils as an alternative to Date range utils
+ * Update README.md with changed defaults
+
+ ## 0.3.0 - 2025-03-04
+
+ * Add date range utils
+ * Flush $stdout and $stderr when logging to sync exception output and logging lines
+ * Break out example code from README.md into docs dir

  ## 0.2.1 - 2025-02-28

@@ -12,3 +22,5 @@ Added FiberScheduler, enabled complient mode with delays by default and simplifi
  ## 0.1.0 - 2025-02-23

  First release for development
+
+
data/README.md CHANGED
@@ -33,8 +33,7 @@ Even without specific configuration, our scrapers will, by default:
  In the default "compliant mode" this defaults to a max load of 20% and is capped at 33%.

  - **Add randomized delays**: We add random delays between requests to further reduce our impact on servers, which should
- bring us
- - down to the load of a single industrious person.
+ bring us down to the load of a single industrious person.

  Extra utilities provided for scrapers to further reduce your server load:

@@ -44,6 +43,9 @@ Extra utilities provided for scrapers to further reduce your server load:
  checking the recent 4 days each day and reducing down to checking each 3 days by the end of the 33-day mark. This
  replaces the simplistic check of the last 30 days each day.

+ - Alternative **Cycle Utilities** - a convenience class to cycle through short and longer search ranges to reduce server
+ load.
+
  Our goal is to access public planning information without negatively impacting your services.

  Installation
@@ -123,15 +125,14 @@ The agent returned is configured using Mechanize hooks to implement the desired
  By default, the Mechanize agent is configured with the following settings.
  As you can see, the defaults can be changed using env variables.

- Note - compliant mode forces max_load to be set to a value no greater than 33.
- PLEASE don't use our user agent string with a max_load higher than 33!
+ Note - compliant mode forces max_load to be set to a value no greater than 50.

  ```ruby
  ScraperUtils::MechanizeUtils::AgentConfig.configure do |config|
    config.default_timeout = ENV.fetch('MORPH_TIMEOUT', 60).to_i # 60
    config.default_compliant_mode = ENV.fetch('MORPH_NOT_COMPLIANT', nil).to_s.empty? # true
-   config.default_random_delay = ENV.fetch('MORPH_RANDOM_DELAY', 15).to_i # 15
-   config.default_max_load = ENV.fetch('MORPH_MAX_LOAD', 20.0).to_f # 20
+   config.default_random_delay = ENV.fetch('MORPH_RANDOM_DELAY', 5).to_i # 5
+   config.default_max_load = ENV.fetch('MORPH_MAX_LOAD', 33.3).to_f # 33.3
    config.default_disable_ssl_certificate_check = !ENV.fetch('MORPH_DISABLE_SSL_CHECK', nil).to_s.empty? # false
    config.default_australian_proxy = !ENV.fetch('MORPH_USE_PROXY', nil).to_s.empty? # false
    config.default_user_agent = ENV.fetch('MORPH_USER_AGENT', nil) # Uses Mechanize user agent
@@ -268,13 +269,37 @@ Typical server load reductions:

  See the class documentation for customizing defaults and passing options.

+ ### Other possibilities
+
+ If the site uses tags like 'L28', 'L14' and 'L7' for the last 28, 14 and 7 days, an alternative solution
+ is to cycle through ['L28', 'L7', 'L14', 'L7'] which would drop the load by 50% and be less Bot like.
+
+ Cycle Utils
+ -----------
+ Simple utility for cycling through options based on Julian day number:
+
+ ```ruby
+ # Toggle between main and alternate behaviour
+ alternate = ScraperUtils::CycleUtils.position(2).even?
+
+ # Use with any cycle size
+ pos = ScraperUtils::CycleUtils.position(7) # 0-6 cycle
+
+ # Test with specific date
+ pos = ScraperUtils::CycleUtils.position(3, date: Date.new(2024, 1, 5))
+
+ # Override for testing
+ # CYCLE_POSITION=2 bundle exec ruby scraper.rb
+ ```
+
  Randomizing Requests
  --------------------

  Pass a `Collection` or `Array` to `ScraperUtils::RandomizeUtils.randomize_order` to randomize it in production, but
  receive in as is when testing.

- Use this with the list of records scraped from an index to randomise any requests for further information to be less Bot like.
+ Use this with the list of records scraped from an index to randomise any requests for further information to be less Bot
+ like.

  ### Spec setup

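As an aside on the `randomize_order` behaviour the README describes (shuffle in production, pass through unchanged when testing): the RandomizeUtils source itself is not part of this diff, so the following is only a hypothetical sketch of such a helper, and the `MORPH_PROCESS_SEQUENTIALLY` flag name is an assumption for illustration.

```ruby
# Hypothetical sketch of a randomize_order-style helper -- NOT the gem's code.
# Shuffles in normal runs; returns input unchanged when a sequential (test)
# mode is requested via an assumed environment flag.
module RandomizeSketch
  def self.randomize_order(collection)
    return collection.to_a if sequential?

    collection.to_a.shuffle
  end

  # Treat any non-empty value of the (assumed) env var as "run sequentially"
  def self.sequential?
    !ENV.fetch('MORPH_PROCESS_SEQUENTIALLY', nil).to_s.empty?
  end
end

# Typical use: shuffle index records before fetching detail pages
records = [{ id: 3 }, { id: 1 }, { id: 2 }]
ordered = RandomizeSketch.randomize_order(records)
```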
data/lib/scraper_utils/cycle_utils.rb ADDED
@@ -0,0 +1,25 @@
+ # frozen_string_literal: true
+
+ module ScraperUtils
+   # Provides utilities for cycling through a range of options day by day
+   module CycleUtils
+     # Returns position in cycle from zero onwards
+     # @param cycle [Integer] Cycle size (2 onwards)
+     # @param date [Date, nil] Optional date to use instead of today
+     # @return [Integer] position in cycle progressing from zero to cycle-1 and then repeating day by day
+     # Can override using CYCLE_POSITION ENV variable
+     def self.position(cycle, date: nil)
+       day = ENV.fetch('CYCLE_POSITION', (date || Date.today).jd).to_i
+       day % cycle
+     end
+
+     # Returns one value per day, cycling through all possible values in order
+     # @param values [Array] Values to cycle through
+     # @param date [Date, nil] Optional date to use instead of today to calculate position
+     # @return value from array
+     # Can override using CYCLE_POSITION ENV variable
+     def self.pick(values, date: nil)
+       values[position(values.size, date: date)]
+     end
+   end
+ end
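The new file above is self-contained apart from the stdlib date library. A quick sketch of exercising it (module body reproduced from the diff), including the README's idea of picking an L28/L14/L7 search-range tag per day:

```ruby
require 'date'

# CycleUtils as added in 0.4.0 (reproduced from the diff above)
module ScraperUtils
  module CycleUtils
    # Position in a repeating day-by-day cycle, based on the Julian day number
    def self.position(cycle, date: nil)
      day = ENV.fetch('CYCLE_POSITION', (date || Date.today).jd).to_i
      day % cycle
    end

    # One value per day, cycling through the array in order
    def self.pick(values, date: nil)
      values[position(values.size, date: date)]
    end
  end
end

# Pick a search-range tag for a given day
tag = ScraperUtils::CycleUtils.pick(%w[L28 L7 L14 L7], date: Date.new(2024, 1, 5))
```

Note that when the CYCLE_POSITION environment variable is set it takes precedence even if `date:` is given, which is what makes the override useful for testing.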
@@ -69,8 +69,8 @@ module ScraperUtils
  def reset_defaults!
    @default_timeout = ENV.fetch('MORPH_TIMEOUT', DEFAULT_TIMEOUT).to_i # 60
    @default_compliant_mode = ENV.fetch('MORPH_NOT_COMPLIANT', nil).to_s.empty? # true
-   @default_random_delay = ENV.fetch('MORPH_RANDOM_DELAY', DEFAULT_RANDOM_DELAY).to_i # 33
-   @default_max_load = ENV.fetch('MORPH_MAX_LOAD', DEFAULT_MAX_LOAD).to_f # 20
+   @default_random_delay = ENV.fetch('MORPH_RANDOM_DELAY', DEFAULT_RANDOM_DELAY).to_i # 5
+   @default_max_load = ENV.fetch('MORPH_MAX_LOAD', DEFAULT_MAX_LOAD).to_f # 33.3
    @default_disable_ssl_certificate_check = !ENV.fetch('MORPH_DISABLE_SSL_CHECK', nil).to_s.empty? # false
    @default_australian_proxy = !ENV.fetch('MORPH_USE_PROXY', nil).to_s.empty? # false
    @default_user_agent = ENV.fetch('MORPH_USER_AGENT', nil) # Uses Mechanize user agent
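The `ENV.fetch` pattern in `reset_defaults!` above falls back to the default only when the variable is unset, and the boolean flags go through `.to_s.empty?` so that an unset or empty variable yields the default while any non-empty value flips it. A standalone sketch of the pattern (the `DEFAULT_*` constants here are local stand-ins mirroring the commented values in the diff):

```ruby
# Sketch of the ENV-driven defaults pattern used above.
# Variable names (MORPH_TIMEOUT etc.) are from the diff; the constants
# below are local stand-ins for the library's own DEFAULT_* constants.
DEFAULT_TIMEOUT = 60
DEFAULT_MAX_LOAD = 33.3

timeout  = ENV.fetch('MORPH_TIMEOUT', DEFAULT_TIMEOUT).to_i
max_load = ENV.fetch('MORPH_MAX_LOAD', DEFAULT_MAX_LOAD).to_f

# Boolean flags: unset or empty string => default; any non-empty value flips it
compliant_mode = ENV.fetch('MORPH_NOT_COMPLIANT', nil).to_s.empty?  # true unless opted out
use_proxy      = !ENV.fetch('MORPH_USE_PROXY', nil).to_s.empty?     # false unless opted in
```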
@@ -1,5 +1,5 @@
  # frozen_string_literal: true

  module ScraperUtils
-   VERSION = "0.3.0"
+   VERSION = "0.4.0"
  end
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: scraper_utils
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.3.0
4
+ version: 0.4.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Ian Heggie
@@ -79,6 +79,7 @@ files:
79
79
  - lib/scraper_utils.rb
80
80
  - lib/scraper_utils/adaptive_delay.rb
81
81
  - lib/scraper_utils/authority_utils.rb
82
+ - lib/scraper_utils/cycle_utils.rb
82
83
  - lib/scraper_utils/data_quality_monitor.rb
83
84
  - lib/scraper_utils/date_range_utils.rb
84
85
  - lib/scraper_utils/db_utils.rb