scraper_utils 0.5.1 → 0.7.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (42)
  1. checksums.yaml +4 -4
  2. data/.yardopts +5 -0
  3. data/CHANGELOG.md +19 -0
  4. data/GUIDELINES.md +2 -1
  5. data/Gemfile +1 -0
  6. data/IMPLEMENTATION.md +39 -0
  7. data/README.md +29 -23
  8. data/SPECS.md +13 -1
  9. data/bin/rspec +27 -0
  10. data/docs/enhancing_specs.md +100 -0
  11. data/docs/example_scrape_with_fibers.rb +4 -4
  12. data/docs/fibers_and_threads.md +72 -0
  13. data/docs/getting_started.md +6 -6
  14. data/docs/interleaving_requests.md +9 -8
  15. data/docs/mechanize_utilities.md +4 -4
  16. data/docs/parallel_requests.md +138 -0
  17. data/docs/randomizing_requests.md +12 -8
  18. data/docs/reducing_server_load.md +6 -6
  19. data/lib/scraper_utils/data_quality_monitor.rb +2 -3
  20. data/lib/scraper_utils/date_range_utils.rb +37 -78
  21. data/lib/scraper_utils/debug_utils.rb +5 -5
  22. data/lib/scraper_utils/log_utils.rb +15 -0
  23. data/lib/scraper_utils/mechanize_actions.rb +37 -8
  24. data/lib/scraper_utils/mechanize_utils/adaptive_delay.rb +80 -0
  25. data/lib/scraper_utils/mechanize_utils/agent_config.rb +35 -34
  26. data/lib/scraper_utils/mechanize_utils/robots_checker.rb +151 -0
  27. data/lib/scraper_utils/mechanize_utils.rb +8 -5
  28. data/lib/scraper_utils/randomize_utils.rb +22 -19
  29. data/lib/scraper_utils/scheduler/constants.rb +12 -0
  30. data/lib/scraper_utils/scheduler/operation_registry.rb +101 -0
  31. data/lib/scraper_utils/scheduler/operation_worker.rb +199 -0
  32. data/lib/scraper_utils/scheduler/process_request.rb +59 -0
  33. data/lib/scraper_utils/scheduler/thread_request.rb +51 -0
  34. data/lib/scraper_utils/scheduler/thread_response.rb +59 -0
  35. data/lib/scraper_utils/scheduler.rb +286 -0
  36. data/lib/scraper_utils/spec_support.rb +67 -0
  37. data/lib/scraper_utils/version.rb +1 -1
  38. data/lib/scraper_utils.rb +12 -14
  39. metadata +18 -6
  40. data/lib/scraper_utils/adaptive_delay.rb +0 -70
  41. data/lib/scraper_utils/fiber_scheduler.rb +0 -229
  42. data/lib/scraper_utils/robots_checker.rb +0 -149
data/docs/parallel_requests.md
@@ -0,0 +1,138 @@
1
+ Parallel Request Processing
2
+ ===========================
3
+
4
+ The ScraperUtils library provides a mechanism for executing network I/O requests in parallel, using a thread for each
5
+ operation worker. This lets the calling fiber yield control so other fibers can continue processing while the thread performs the
6
+ mechanize network I/O request.
7
+
8
+ This can be disabled by setting the `MORPH_DISABLE_THREADS` ENV var to a non-blank value.
9
+
10
+ Overview
11
+ --------
12
+
13
+ When scraping multiple authority websites, around 99% of the time was spent waiting for network I/O. While the
14
+ `Scheduler`
15
+ efficiently interleaves fibers during delay periods, network I/O requests will still block a fiber until they
16
+ complete.
17
+
18
+ The `OperationWorker` optimizes this process by:
19
+
20
+ 1. Executing mechanize network operations in parallel using a thread for each operation_worker and fiber
21
+ 2. Allowing other fibers to continue working while waiting for thread responses
22
+ 3. Integrating seamlessly with the existing `Scheduler`
23
+
24
+ Usage
25
+ -----
26
+
27
+ ```ruby
28
+ # In your authority scraper block
29
+ ScraperUtils::Scheduler.register_operation("authority_name") do
30
+ # Instead of:
31
+ # page = agent.get(url)
32
+
33
+ # Use:
34
+ page = ScraperUtils::Scheduler.execute_request(agent, :get, [url])
35
+
36
+ # Process page as normal
37
+ process_page(page)
38
+ end
39
+ ```
40
+
41
+ For testing purposes, you can also execute non-network operations:
42
+
43
+ ```ruby
44
+ # Create a test object
45
+ test_object = Object.new
46
+
47
+ def test_object.sleep_test(duration)
48
+ sleep(duration)
49
+ "Completed after #{duration} seconds"
50
+ end
51
+
52
+ # Queue a sleep command
53
+ command = ScraperUtils::ProcessRequest.new(
54
+ "test_id",
55
+ test_object,
56
+ :sleep_test,
57
+ [0.5]
58
+ )
59
+
60
+ thread_scheduler.queue_request(command)
61
+ ```
62
+
63
+ Configuration
64
+ -------------
65
+
66
+ The following ENV variables affect how `Scheduler` is configured:
67
+
68
+ * `MORPH_DISABLE_THREADS=1` disables the use of threads
69
+ * `MORPH_MAX_WORKERS=N` configures the system to use at most N workers (minimum 1).
70
+ If N is 1 then this forces the system to process one authority at a time.
71
+
72
+ Key Components
73
+ --------------
74
+
75
+ ### ThreadRequest
76
+
77
+ A value object encapsulating a command to be executed:
78
+
79
+ - External ID: Any value suitable as a hash key (String, Symbol, Integer, Object) that identifies the command
80
+ - Subject: The object to call the method on
81
+ - Method: The method to call on the subject
82
+ - Args: Arguments to pass to the method
83
+
84
+ ### ThreadResponse
85
+
86
+ A value object encapsulating a response:
87
+
88
+ - External ID: Matches the ID from the original command
89
+ - Result: The result of the operation
90
+ - Error: Any error that occurred
91
+ - Time Taken: Execution time in seconds
92
+
93
+ ### ThreadPool
94
+
95
+ Manages a pool of threads that execute commands:
96
+
97
+ - Processes commands from a queue
98
+ - Returns responses with matching external IDs
99
+ - Provides clear separation between I/O and scheduling
100
+
101
+ Benefits
102
+ --------
103
+
104
+ 1. **Improved Throughput**: Process multiple operations simultaneously
105
+ 2. **Reduced Total Runtime**: Make better use of wait time during network operations
106
+ 3. **Optimal Resource Usage**: Efficiently balance CPU and network operations
107
+ 4. **Better Geolocation Handling**: Distribute requests across proxies more efficiently
108
+ 5. **Testability**: Execute non-network operations for testing concurrency
109
+
110
+ Debugging
111
+ ---------
112
+
113
+ When debugging issues with parallel operations, use:
114
+
115
+ ```shell
116
+ # Set debug level to see request/response logging
117
+ export DEBUG=2
118
+ ```
119
+
120
+ The system will log:
121
+
122
+ - When commands are queued
123
+ - When responses are received
124
+ - How long each operation took
125
+ - Any errors that occurred
126
+
127
+ ## Implementation Details
128
+
129
+ The integration between `Scheduler` and `ThreadPool` follows these principles:
130
+
131
+ 1. `Scheduler` maintains ownership of all fiber scheduling
132
+ 2. `ThreadPool` only knows about commands and responses
133
+ 3. Communication happens via value objects with validation
134
+ 4. State is managed in dedicated `FiberState` objects
135
+ 5. Each component has a single responsibility
136
+
137
+ This design provides a clean separation of concerns while enabling parallel operations within the existing fiber
138
+ scheduling framework.
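
The request/response cycle described above can be pictured with plain Ruby objects. This is a toy illustration of the pattern only (simple structs plus a Thread and two Queues), not the gem's actual `ThreadRequest`/`ThreadResponse` classes, whose APIs may differ:

```ruby
# Toy stand-ins for the value objects described above (illustration only)
ToyRequest  = Struct.new(:external_id, :subject, :method_name, :args)
ToyResponse = Struct.new(:external_id, :result, :error, :time_taken)

requests  = Queue.new
responses = Queue.new

# A single worker thread that executes queued commands and returns responses
worker = Thread.new do
  while (request = requests.pop)
    started = Time.now
    begin
      result = request.subject.send(request.method_name, *request.args)
      responses << ToyResponse.new(request.external_id, result, nil, Time.now - started)
    rescue StandardError => e
      responses << ToyResponse.new(request.external_id, nil, e, Time.now - started)
    end
  end
end

requests << ToyRequest.new("test_id", Kernel, :sleep, [0.5])
puts responses.pop.to_h # => external_id, result, error and time_taken, matching the fields above
requests.close
worker.join
```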
data/docs/randomizing_requests.md
@@ -1,9 +1,11 @@
1
- # Randomizing Requests
1
+ Randomizing Requests
2
+ ====================
2
3
 
3
4
  `ScraperUtils::RandomizeUtils` provides utilities for randomizing processing order in scrapers,
4
5
  which is helpful for distributing load and avoiding predictable patterns.
5
6
 
6
- ## Basic Usage
7
+ Usage
8
+ -----
7
9
 
8
10
  Pass a `Collection` or `Array` to `ScraperUtils::RandomizeUtils.randomize_order` to randomize it in production, but
9
11
  receive it as is when testing.
@@ -18,17 +20,19 @@ records.each do |record|
18
20
  end
19
21
  ```
20
22
 
21
- ## Testing Configuration
23
+ Testing Configuration
24
+ ---------------------
22
25
 
23
26
  Enforce sequential mode when testing by adding the following code to `spec/spec_helper.rb`:
24
27
 
25
28
  ```ruby
26
- ScraperUtils::RandomizeUtils.sequential = true
29
+ ScraperUtils::RandomizeUtils.random = false
27
30
  ```
28
31
 
29
- ## Notes
32
+ Notes
33
+ -----
30
34
 
31
- * You can also force sequential mode by setting the env variable `MORPH_PROCESS_SEQUENTIALLY` to `1` (any non-blank value)
32
- * Testing using VCR requires sequential mode
35
+ * You can also disable random mode by setting the env variable `MORPH_DISABLE_RANDOM` to `1` (or any non-blank value)
36
+ * Testing using VCR requires random to be disabled
33
37
 
34
- For full details, see the [RandomizeUtils class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/RandomizeUtils).
38
+ For full details, see {ScraperUtils::RandomizeUtils Randomize Utils class documentation}
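
A hedged sketch of the testing configuration above: the `random = false` setter and `randomize_order` come from this doc, and the expectation follows from "receive it as is when testing":

```ruby
# spec/spec_helper.rb
ScraperUtils::RandomizeUtils.random = false

# spec/randomize_utils_spec.rb
RSpec.describe ScraperUtils::RandomizeUtils do
  it "returns the collection in its original order when random is disabled" do
    expect(described_class.randomize_order([1, 2, 3])).to eq([1, 2, 3])
  end
end
```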
data/docs/reducing_server_load.md
@@ -12,8 +12,8 @@ records:
12
12
 
13
13
  - Always checks the most recent 4 days daily (configurable)
14
14
  - Progressively reduces search frequency for older records
15
- - Uses a Fibonacci-like progression to create natural, efficient search intervals
16
- - Configurable `max_period` (default is 3 days)
15
+ - Uses a progression from every 2 days upwards to create efficient search intervals
16
+ - Configurable `max_period` (default is 2 days)
17
17
  - Merges adjacent search ranges and handles the changeover in search frequency by extending some searches
18
18
 
19
19
  Example usage in your scraper:
@@ -28,11 +28,11 @@ date_ranges.each do |from_date, to_date, _debugging_comment|
28
28
  end
29
29
  ```
30
30
 
31
- Typical server load reductions:
31
+ Typical server load compared to searching all days each time:
32
32
 
33
- * Max period 2 days : ~42% of the 33 days selected
34
- * Max period 3 days : ~37% of the 33 days selected (default)
35
- * Max period 5 days : ~35% (or ~31% when days = 45)
33
+ * Max period 2 days : ~59% of the 33 days selected (default, alternates between 57% and 61% covered)
34
+ * Max period 3 days : ~50% of the 33 days selected (varies much more - between 33 and 67%)
35
+ * Max period 4 days : ~46% (more efficient if you search back 50 or more days, varies between 15 and 61%)
36
36
 
37
37
  See the [DateRangeUtils class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/DateRangeUtils) for customizing defaults and passing options.
38
38
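
A hedged sketch of the `calculate_date_ranges` call pattern this doc describes, with the keyword arguments spelled out (the values mirror the documented defaults; `calculate_date_ranges` is an instance method, as the library change later in this diff shows):

```ruby
date_ranges = ScraperUtils::DateRangeUtils.new.calculate_date_ranges(
  days: 33,       # cover the last 33 dates
  everytime: 4,   # always include the most recent 4 days
  max_period: 2   # older dates are checked at least every 2 days
)

date_ranges.each do |from_date, to_date, debugging_comment|
  puts "Searching #{from_date}..#{to_date} (#{debugging_comment})"
  # run your search for applications lodged in this range here
end
```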
 
data/lib/scraper_utils/data_quality_monitor.rb
@@ -13,7 +13,6 @@ module ScraperUtils
13
13
  # Notes the start of processing an authority and clears any previous stats
14
14
  #
15
15
  # @param authority_label [Symbol] The authority we are processing
16
- # @return [void]
17
16
  def self.start_authority(authority_label)
18
17
  @stats ||= {}
19
18
  @stats[authority_label] = { saved: 0, unprocessed: 0 }
@@ -41,7 +40,7 @@ module ScraperUtils
41
40
  def self.log_unprocessable_record(exception, record)
42
41
  authority_label = extract_authority(record)
43
42
  @stats[authority_label][:unprocessed] += 1
44
- ScraperUtils::FiberScheduler.log "Erroneous record #{authority_label} - #{record&.fetch(
43
+ ScraperUtils::LogUtils.log "Erroneous record #{authority_label} - #{record&.fetch(
45
44
  'address', nil
46
45
  ) || record.inspect}: #{exception}"
47
46
  return unless @stats[authority_label][:unprocessed] > threshold(authority_label)
@@ -58,7 +57,7 @@ module ScraperUtils
58
57
  def self.log_saved_record(record)
59
58
  authority_label = extract_authority(record)
60
59
  @stats[authority_label][:saved] += 1
61
- ScraperUtils::FiberScheduler.log "Saving record #{authority_label} - #{record['address']}"
60
+ ScraperUtils::LogUtils.log "Saving record #{authority_label} - #{record['address']}"
62
61
  end
63
62
  end
64
63
  end
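
A hedged sketch of how the three `DataQualityMonitor` calls touched above typically fit together in a scraper loop; `records` and `save_record` stand in for your own collection and persistence code, and the record field names are illustrative only:

```ruby
ScraperUtils::DataQualityMonitor.start_authority(:example_authority)

records.each do |record| # each record is assumed to include an "address" field
  begin
    save_record(record)  # your own persistence code
    ScraperUtils::DataQualityMonitor.log_saved_record(record)
  rescue StandardError => e
    # logs the erroneous record and bumps the unprocessed count for this authority
    ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)
  end
end
```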
data/lib/scraper_utils/date_range_utils.rb
@@ -3,7 +3,7 @@
3
3
  module ScraperUtils
4
4
  class DateRangeUtils
5
5
  MERGE_ADJACENT_RANGES = true
6
- PERIODS = [2, 3, 5, 8].freeze
6
+ PERIODS = [2, 3, 4].freeze
7
7
 
8
8
  class << self
9
9
  # @return [Integer] Default number of days to cover
@@ -33,7 +33,7 @@ module ScraperUtils
33
33
  def reset_defaults!
34
34
  @default_days = ENV.fetch('MORPH_DAYS', 33).to_i # 33
35
35
  @default_everytime = ENV.fetch('MORPH_EVERYTIME', 4).to_i # 4
36
- @default_max_period = ENV.fetch('MORPH_MAX_PERIOD', 3).to_i # 3
36
+ @default_max_period = ENV.fetch('MORPH_MAX_PERIOD', 2).to_i # 3
37
37
  end
38
38
  end
39
39
 
@@ -46,8 +46,8 @@ module ScraperUtils
46
46
  # Generates one or more date ranges to check the most recent daily through to checking each max_period
47
47
  # There is a graduated schedule from the latest `everytime` days through to the oldest of `days` dates which is checked each `max_period` days.
48
48
  # @param days [Integer, nil] create ranges that cover the last `days` dates
49
- # @param everytime [Integer, nil] Always include the latest `everytime` out of `days` dates
50
- # @param max_period [Integer, nil] the last `days` dates must be checked at least every `max_period` days
49
+ # @param everytime [Integer, nil] Always include the latest `everytime` out of `days` dates (minimum 1)
50
+ # @param max_period [Integer, nil] the last `days` dates must be checked at least every `max_period` days (1..4)
51
51
  # @param today [Date, nil] overrides the default determination of today at UTC+09:30 (middle of Australia)
52
52
  # @return [Array{[Date, Date, String]}] being from_date, to_date and a comment
53
53
  #
@@ -58,7 +58,7 @@ module ScraperUtils
58
58
  def calculate_date_ranges(days: nil, everytime: nil, max_period: nil, today: nil)
59
59
  _calculate_date_ranges(
60
60
  Integer(days || self.class.default_days),
61
- Integer(everytime || self.class.default_everytime),
61
+ [1, Integer(everytime || self.class.default_everytime)].max,
62
62
  Integer(max_period || self.class.default_max_period),
63
63
  today || Time.now(in: '+09:30').to_date
64
64
  )
@@ -76,84 +76,43 @@ module ScraperUtils
76
76
  # cover everything everytime
77
77
  return [[today + 1 - days, today, "everything"]]
78
78
  end
79
-
80
79
  max_period = valid_periods.max
81
-
82
- run_number = today.to_date.jd
83
- ranges = []
84
- if everytime.positive?
85
- ranges << [to_date + 1 - everytime, to_date, "everytime"]
86
- days -= everytime
87
- to_date -= everytime
88
- end
89
-
90
- periods = valid_periods.dup
91
- loop do
92
- period = periods.shift
93
- break if period.nil? || period >= max_period || !days.positive?
94
-
95
- if DebugUtils.trace?
96
- FiberScheduler.log "DEBUG: #{period} day periods started #{(today - to_date).to_i} days in."
80
+ @max_period_used = max_period
81
+
82
+ one_half = ((days - everytime) / 2).to_i
83
+ one_third = ((days - everytime) / 3).to_i
84
+ two_ninths = (2 * (days - everytime) / 9).to_i
85
+ run_ranges =
86
+ case max_period
87
+ when 2
88
+ [
89
+ [[to_date - (one_half + everytime), to_date, "#{max_period}#0+everytime"]],
90
+ [[to_date - days, to_date - (one_half + everytime), "#{max_period}#1"], [to_date - everytime, to_date, "everytime"]]
91
+ ]
92
+ when 3
93
+ [
94
+ [[to_date - days - 1, to_date + two_ninths - days, "3#0"], [to_date - (one_third + everytime), to_date, "2#0+everytime"]],
95
+ [[to_date + two_ninths - days, to_date + 2 * two_ninths - days, "3#1"], [to_date - everytime, to_date, "everytime"]],
96
+ [[to_date + 2 * two_ninths - days, to_date, "3#2+2#0+everytime"]],
97
+ [[to_date - days - 1, to_date + two_ninths - days, "3#3"], [to_date - everytime, to_date, "everytime"]],
98
+ [[to_date + two_ninths - days, to_date + 2 * two_ninths - days, "3#4"], [to_date - (one_third + everytime), to_date, "2#2+everytime"]],
99
+ [[to_date + 2 * two_ninths - days, to_date - (one_third + everytime), "3#5"], [to_date - everytime, to_date, "everytime"]]
100
+ ]
101
+ else
102
+ [
103
+ [[to_date - (one_half + everytime), to_date, "2#0+everytime"]],
104
+ [[to_date - days - 2, to_date - (one_half + everytime), "4#0"], [to_date - everytime, to_date, "everytime"]],
105
+ [[to_date - (one_half + everytime), to_date, "2#1+everytime"]],
106
+ [[to_date - everytime, to_date, "everytime"]]
107
+ ]
97
108
  end
98
- period.times do |index|
99
- break unless days.positive?
100
-
101
- this_period = [days, period].min
102
- break if this_period <= 0
103
-
104
- earliest_from = to_date - days
105
- # we are working from the oldest back towards today
106
- if run_number % period == index
107
- from = to_date - index - (this_period - 1)
108
- from = earliest_from if from < earliest_from
109
- to = [today, to_date - index].min
110
- break if from > to
109
+ run_number = today.to_date.jd % run_ranges.size
111
110
 
112
- @max_period_used = [this_period, @max_period_used].max
113
- if ranges.any? && ranges.last[0] <= to + 1 && MERGE_ADJACENT_RANGES
114
- # extend adjacent range
115
- ranges.last[0] = [from, ranges.last[0]].min
116
- ranges.last[2] = "#{period}\##{index},#{ranges.last[2]}"
117
- else
118
- to = ranges.last[0] - 1 if ranges.any? && to >= ranges.last[0]
119
- ranges << [from, to, "#{period}\##{index}"]
120
- end
121
- end
122
- days -= this_period
123
- to_date -= this_period
124
- end
125
- end
126
- # remainder of range at max_period, whatever that is
111
+ ranges = run_ranges[run_number]
127
112
  if days.positive? && ScraperUtils::DebugUtils.trace?
128
- FiberScheduler.log "DEBUG: #{max_period} day periods started #{(today - to_date).to_i} days in."
129
- end
130
- index = -1
131
- while days.positive?
132
- index += 1
133
- this_period = [days, max_period].min
134
- break if this_period <= 0
135
-
136
- earliest_from = to_date - days
137
- if (run_number % max_period) == (index % max_period)
138
- from = to_date - index - (this_period - 1)
139
- from = earliest_from if from < earliest_from
140
- to = to_date - index
141
- break if from > to
142
-
143
- @max_period_used = [this_period, @max_period_used].max
144
- if ranges.any? && ranges.last[0] <= to + 1 && MERGE_ADJACENT_RANGES
145
- # extend adjacent range
146
- ranges.last[0] = [from, ranges.last[0]].min
147
- ranges.last[2] = "#{this_period}\##{index},#{ranges.last[2]}"
148
- else
149
- to = ranges.last[0] - 1 if ranges.any? && to >= ranges.last[0]
150
- ranges << [from, to, "#{this_period}\##{index}"]
151
- end
152
- end
153
- days -= this_period
154
- to_date -= this_period
113
+ LogUtils.log "DEBUG: #{max_period} ranges: #{ranges.inspect}"
155
114
  end
156
- ranges.reverse
115
+ ranges
157
116
  end
158
117
  end
159
118
  end
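
The class-level defaults above come from ENV via `reset_defaults!`. A brief hedged sketch of overriding one at runtime (the `default_max_period` reader is assumed from the `self.class.default_max_period` call earlier in this hunk):

```ruby
ENV["MORPH_MAX_PERIOD"] = "3"
ScraperUtils::DateRangeUtils.reset_defaults!
ScraperUtils::DateRangeUtils.default_max_period # => 3
```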
data/lib/scraper_utils/debug_utils.rb
@@ -51,17 +51,17 @@ module ScraperUtils
51
51
 
52
52
  # Logs details of an HTTP request when debug mode is enabled
53
53
  #
54
- # @param method [String] HTTP method (GET, POST, etc.)
54
+ # @param http_method [String] HTTP method (GET, POST, etc.)
55
55
  # @param url [String] Request URL
56
56
  # @param parameters [Hash, nil] Optional request parameters
57
57
  # @param headers [Hash, nil] Optional request headers
58
58
  # @param body [Hash, nil] Optional request body
59
59
  # @return [void]
60
- def self.debug_request(method, url, parameters: nil, headers: nil, body: nil)
60
+ def self.debug_request(http_method, url, parameters: nil, headers: nil, body: nil)
61
61
  return unless basic?
62
62
 
63
63
  puts
64
- FiberScheduler.log "🔍 #{method.upcase} #{url}"
64
+ LogUtils.log "🔍 #{http_method.upcase} #{url}"
65
65
  puts "Parameters:", JSON.pretty_generate(parameters) if parameters
66
66
  puts "Headers:", JSON.pretty_generate(headers) if headers
67
67
  puts "Body:", JSON.pretty_generate(body) if body
@@ -77,7 +77,7 @@ module ScraperUtils
77
77
  return unless trace?
78
78
 
79
79
  puts
80
- FiberScheduler.log "🔍 DEBUG: #{message}"
80
+ LogUtils.log "🔍 DEBUG: #{message}"
81
81
  puts "Current URL: #{page.uri}"
82
82
  puts "Page title: #{page.at('title').text.strip}" if page.at("title")
83
83
  puts "",
@@ -98,7 +98,7 @@ module ScraperUtils
98
98
  return unless trace?
99
99
 
100
100
  puts
101
- FiberScheduler.log "🔍 DEBUG: #{message}"
101
+ LogUtils.log "🔍 DEBUG: #{message}"
102
102
  puts "Looking for selector: #{selector}"
103
103
  element = page.at(selector)
104
104
  if element
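
A hedged example of calling the renamed `debug_request` above; output only appears when the relevant DEBUG level is enabled:

```ruby
ScraperUtils::DebugUtils.debug_request(
  "GET",
  "https://example.com/applications",
  parameters: { "from" => "2025-01-01", "to" => "2025-01-04" }
)
```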
data/lib/scraper_utils/log_utils.rb
@@ -9,6 +9,21 @@
9
9
  LOG_TABLE = "scrape_log"
10
10
  LOG_RETENTION_DAYS = 30
11
11
 
12
+ # Logs a message, automatically prefixing with authority name if in a fiber
13
+ #
14
+ # @param message [String] the message to log
15
+ # @return [void]
16
+ def self.log(message, authority = nil)
17
+ authority ||= Scheduler.current_authority
18
+ $stderr.flush
19
+ if authority
20
+ puts "[#{authority}] #{message}"
21
+ else
22
+ puts message
23
+ end
24
+ $stdout.flush
25
+ end
26
+
12
27
  # Log details about a scraping run for one or more authorities
13
28
  # @param start_time [Time] When this scraping attempt was started
14
29
  # @param attempt [Integer] 1 for first run, 2 for first retry, 3 for last retry (without proxy)
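
A hedged example of the new `LogUtils.log` helper added above; inside a registered operation the authority prefix is picked up automatically from `Scheduler.current_authority`:

```ruby
ScraperUtils::LogUtils.log("Saved 12 records")           # "Saved 12 records" (no current authority)
ScraperUtils::LogUtils.log("Saved 12 records", :albury)  # "[albury] Saved 12 records"
```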
data/lib/scraper_utils/mechanize_actions.rb
@@ -9,7 +9,7 @@
9
9
  #
10
10
  # actions = [
11
11
  # [:click, "Next Page"],
12
- # [:click, ["Option A", "Option B"]] # Will select one randomly
12
+ # [:click, ["Option A", "xpath://div[@id='results']/a", "css:.some-button"]] # Will select one randomly
13
13
  # ]
14
14
  #
15
15
  # processor = ScraperUtils::MechanizeActions.new(agent)
@@ -50,7 +50,7 @@ module ScraperUtils
50
50
  # @example Action format
51
51
  # actions = [
52
52
  # [:click, "Link Text"], # Click on link with this text
53
- # [:click, ["Option A", "Option B"]], # Click on one of these options (randomly selected)
53
+ # [:click, ["Option A", "text:Option B"]], # Click on one of these options (randomly selected)
54
54
  # [:click, "css:.some-button"], # Use CSS selector
55
55
  # [:click, "xpath://div[@id='results']/a"], # Use XPath selector
56
56
  # [:block, ->(page, args, agent, results) { [page, { custom_results: 'data' }] }] # Custom block
@@ -67,8 +67,7 @@ module ScraperUtils
67
67
  when :click
68
68
  handle_click(current_page, args)
69
69
  when :block
70
- block = args.shift
71
- block.call(current_page, args, agent, @results.dup)
70
+ handle_block(current_page, args)
72
71
  else
73
72
  raise ArgumentError, "Unknown action type: #{action_type}"
74
73
  end
@@ -81,6 +80,18 @@ module ScraperUtils
81
80
 
82
81
  private
83
82
 
83
+ # Process a block action
84
+ #
85
+ # @param page [Mechanize::Page] The current page
86
+ # @param args [Array] The block and its arguments
87
+ # @return [Array<Mechanize::Page, Hash>] The resulting page and status
88
+ def handle_block(page, args)
89
+ block = args.shift
90
+ # Apply replacements to all remaining arguments
91
+ processed_args = args.map { |arg| apply_replacements(arg) }
92
+ block.call(page, processed_args.first, agent, @results.dup)
93
+ end
94
+
84
95
  # Handle a click action
85
96
  #
86
97
  # @param page [Mechanize::Page] The current page
@@ -105,16 +116,34 @@ module ScraperUtils
105
116
  # Select an element on the page based on selector string
106
117
  #
107
118
  # @param page [Mechanize::Page] The page to search in
108
- # @param selector_string [String] The selector string
119
+ # @param selector_string [String] The selector string, optionally with "css:", "xpath:" or "text:" prefix
109
120
  # @return [Mechanize::Element, nil] The selected element or nil if not found
110
121
  def select_element(page, selector_string)
111
122
  # Handle different selector types based on prefixes
112
123
  if selector_string.start_with?("css:")
113
124
  selector = selector_string.sub(/^css:/, '')
114
- page.at_css(selector)
125
+ # We need to convert Nokogiri elements to Mechanize elements for clicking
126
+ css_element = page.at_css(selector)
127
+ return nil unless css_element
128
+
129
+ # If it's a link, find the matching Mechanize link
130
+ if css_element.name.downcase == 'a' && css_element['href']
131
+ return page.links.find { |link| link.href == css_element['href'] }
132
+ end
133
+
134
+ return css_element
115
135
  elsif selector_string.start_with?("xpath:")
116
136
  selector = selector_string.sub(/^xpath:/, '')
117
- page.at_xpath(selector)
137
+ # We need to convert Nokogiri elements to Mechanize elements for clicking
138
+ xpath_element = page.at_xpath(selector)
139
+ return nil unless xpath_element
140
+
141
+ # If it's a link, find the matching Mechanize link
142
+ if xpath_element.name.downcase == 'a' && xpath_element['href']
143
+ return page.links.find { |link| link.href == xpath_element['href'] }
144
+ end
145
+
146
+ return xpath_element
118
147
  else
119
148
  # Default to text: for links
120
149
  selector = selector_string.sub(/^text:/, '')
@@ -133,7 +162,7 @@ module ScraperUtils
133
162
  end
134
163
  end
135
164
 
136
- # Get the link with the shortest (closest matching) text then the longest href
165
+ # Get the link with (a) the shortest (closest matching) text, then (b) the longest href
137
166
  matching_links.min_by { |l| [l.text.strip.length, -l.href.length] }
138
167
  end
139
168
  end
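
A hedged sketch of driving the selector prefixes documented above (`text:`, `css:`, `xpath:`); the top-level `process` method name and its return value are assumptions not shown in this diff:

```ruby
agent = Mechanize.new
processor = ScraperUtils::MechanizeActions.new(agent)

actions = [
  [:click, "text:Planning applications"],
  [:click, ["css:.next-page", "xpath://a[@rel='next']"]] # one of these is chosen at random
]

page = agent.get("https://example.com/")
result_page = processor.process(page, actions) # method name and return value assumed
```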
data/lib/scraper_utils/mechanize_utils/adaptive_delay.rb
@@ -0,0 +1,80 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "uri"
4
+ require_relative "agent_config"
5
+
6
+ module ScraperUtils
7
+ module MechanizeUtils
8
+ # Adapts delays between requests based on server response times.
9
+ # Target delay is proportional to response time based on max_load setting.
10
+ # Uses an exponential moving average to smooth variations in response times.
11
+ class AdaptiveDelay
12
+ DEFAULT_MIN_DELAY = 0.0
13
+ DEFAULT_MAX_DELAY = 30.0 # Presumed default timeout for Mechanize
14
+
15
+ attr_reader :min_delay, :max_delay, :max_load
16
+
17
+ # Creates a new adaptive delay calculator
18
+ #
19
+ # @param min_delay [Float] Minimum delay between requests in seconds
20
+ # @param max_delay [Float] Maximum delay between requests in seconds
21
+ # @param max_load [Float] Maximum load percentage (1..Constants::MAX_LOAD_CAP) we aim to place on the server
22
+ # Lower values are more conservative (e.g., 20% = 4x response time delay)
23
+ def initialize(min_delay: DEFAULT_MIN_DELAY, max_delay: DEFAULT_MAX_DELAY, max_load: AgentConfig::DEFAULT_MAX_LOAD)
24
+ @delays = {} # domain -> last delay used
25
+ @min_delay = min_delay.to_f
26
+ @max_delay = max_delay.to_f
27
+ @max_load = max_load.to_f.clamp(1.0, AgentConfig::MAX_LOAD_CAP)
28
+ @response_multiplier = (100.0 - @max_load) / @max_load
29
+
30
+ return unless DebugUtils.basic?
31
+
32
+ ScraperUtils::LogUtils.log(
33
+ "AdaptiveDelay initialized with delays between #{@min_delay} and #{@max_delay} seconds, " \
34
+ "Max_load #{@max_load}% thus response multiplier: #{@response_multiplier.round(2)}x"
35
+ )
36
+ end
37
+
38
+ # @param uri [URI::Generic, String] URL to get delay for
39
+ # @return [Float] Current delay for the domain, or min_delay if no delay set
40
+ def delay(uri)
41
+ @delays[domain(uri)] || @min_delay
42
+ end
43
+
44
+ # Returns the next_delay calculated from a smoothed average of response_time to use less than max_load% of server
45
+ #
46
+ # @param uri [URI::Generic, String] URL the response came from
47
+ # @param response_time [Float] Time in seconds the server took to respond
48
+ # @return [Float] The calculated delay to use with the next request
49
+ def next_delay(uri, response_time)
50
+ uris_domain = domain(uri)
51
+ # calculate target_delay to achieve desired max_load% using pre-calculated multiplier
52
+ target_delay = (response_time * @response_multiplier).clamp(0.0, @max_delay)
53
+ # Initialise average from initial_response_time rather than zero to start with reasonable approximation
54
+ current_delay = @delays[uris_domain] || target_delay
55
+ # exponential smooth the delay to smooth out wild swings (Equivalent to an RC low pass filter)
56
+ delay = ((3.0 * current_delay) + target_delay) / 4.0
57
+ delay = delay.clamp(@min_delay, @max_delay)
58
+
59
+ if DebugUtils.basic?
60
+ ScraperUtils::LogUtils.log(
61
+ "Adaptive delay for #{uris_domain} updated to #{delay.round(2)}s (target: " \
62
+ "#{@response_multiplier.round(1)}x response_time of #{response_time.round(2)}s)"
63
+ )
64
+ end
65
+
66
+ @delays[uris_domain] = delay
67
+ delay
68
+ end
69
+
70
+ private
71
+
72
+ # @param uri [URI::Generic, String] The URL to extract the domain from
73
+ # @return [String] The domain in the format "scheme://host"
74
+ def domain(uri)
75
+ uri = URI(uri) unless uri.is_a?(URI)
76
+ "#{uri.scheme}://#{uri.host}".downcase
77
+ end
78
+ end
79
+ end
80
+ end
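
A hedged sketch of the intended call sequence for the new AdaptiveDelay class above; the mechanize request itself is elided, and `max_load: 20.0` is just the illustrative figure from the class comment:

```ruby
delay_calc = ScraperUtils::MechanizeUtils::AdaptiveDelay.new(max_load: 20.0)
url = "https://example.com/page"

sleep(delay_calc.delay(url)) # current delay for this domain (min_delay until a response is recorded)

started = Time.now
# page = agent.get(url)      # your mechanize request goes here
response_time = Time.now - started

delay_calc.next_delay(url, response_time) # stores and returns the smoothed delay for the next request
```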