scraper_utils 0.5.1 → 0.7.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.yardopts +5 -0
- data/CHANGELOG.md +19 -0
- data/GUIDELINES.md +2 -1
- data/Gemfile +1 -0
- data/IMPLEMENTATION.md +39 -0
- data/README.md +29 -23
- data/SPECS.md +13 -1
- data/bin/rspec +27 -0
- data/docs/enhancing_specs.md +100 -0
- data/docs/example_scrape_with_fibers.rb +4 -4
- data/docs/fibers_and_threads.md +72 -0
- data/docs/getting_started.md +6 -6
- data/docs/interleaving_requests.md +9 -8
- data/docs/mechanize_utilities.md +4 -4
- data/docs/parallel_requests.md +138 -0
- data/docs/randomizing_requests.md +12 -8
- data/docs/reducing_server_load.md +6 -6
- data/lib/scraper_utils/data_quality_monitor.rb +2 -3
- data/lib/scraper_utils/date_range_utils.rb +37 -78
- data/lib/scraper_utils/debug_utils.rb +5 -5
- data/lib/scraper_utils/log_utils.rb +15 -0
- data/lib/scraper_utils/mechanize_actions.rb +37 -8
- data/lib/scraper_utils/mechanize_utils/adaptive_delay.rb +80 -0
- data/lib/scraper_utils/mechanize_utils/agent_config.rb +35 -34
- data/lib/scraper_utils/mechanize_utils/robots_checker.rb +151 -0
- data/lib/scraper_utils/mechanize_utils.rb +8 -5
- data/lib/scraper_utils/randomize_utils.rb +22 -19
- data/lib/scraper_utils/scheduler/constants.rb +12 -0
- data/lib/scraper_utils/scheduler/operation_registry.rb +101 -0
- data/lib/scraper_utils/scheduler/operation_worker.rb +199 -0
- data/lib/scraper_utils/scheduler/process_request.rb +59 -0
- data/lib/scraper_utils/scheduler/thread_request.rb +51 -0
- data/lib/scraper_utils/scheduler/thread_response.rb +59 -0
- data/lib/scraper_utils/scheduler.rb +286 -0
- data/lib/scraper_utils/spec_support.rb +67 -0
- data/lib/scraper_utils/version.rb +1 -1
- data/lib/scraper_utils.rb +12 -14
- metadata +18 -6
- data/lib/scraper_utils/adaptive_delay.rb +0 -70
- data/lib/scraper_utils/fiber_scheduler.rb +0 -229
- data/lib/scraper_utils/robots_checker.rb +0 -149
data/docs/parallel_requests.md
@@ -0,0 +1,138 @@
+Parallel Request Processing
+===========================
+
+The ScraperUtils library provides a mechanism for executing network I/O requests in parallel, using a thread for each
+operation worker so the fiber can yield control and let other fibers process whilst the thread performs the
+mechanize network I/O request.
+
+This can be disabled by setting the `MORPH_DISABLE_THREADS` ENV var to a non-blank value.
+
+Overview
+--------
+
+When scraping multiple authority websites, around 99% of the time was spent waiting for network I/O. While the
+`Scheduler` efficiently interleaves fibers during delay periods, network I/O requests will still block a fiber until
+they complete.
+
+The `OperationWorker` optimizes this process by:
+
+1. Executing mechanize network operations in parallel using a thread for each operation_worker and fiber
+2. Allowing other fibers to continue working while waiting for thread responses
+3. Integrating seamlessly with the existing `Scheduler`
+
+Usage
+-----
+
+```ruby
+# In your authority scraper block
+ScraperUtils::Scheduler.register_operation("authority_name") do
+  # Instead of:
+  # page = agent.get(url)
+
+  # Use:
+  page = ScraperUtils::Scheduler.execute_request(agent, :get, [url])
+
+  # Process page as normal
+  process_page(page)
+end
+```
+
+For testing purposes, you can also execute non-network operations:
+
+```ruby
+# Create a test object
+test_object = Object.new
+
+def test_object.sleep_test(duration)
+  sleep(duration)
+  "Completed after #{duration} seconds"
+end
+
+# Queue a sleep command
+command = ScraperUtils::ProcessRequest.new(
+  "test_id",
+  test_object,
+  :sleep_test,
+  [0.5]
+)
+
+thread_scheduler.queue_request(command)
+```
+
+Configuration
+-------------
+
+The following ENV variables affect how `Scheduler` is configured:
+
+* `MORPH_DISABLE_THREADS=1` disables the use of threads
+* `MORPH_MAX_WORKERS=N` configures the system to a max of N workers (minimum 1).
+  If N is 1 then this forces the system to process one authority at a time.
+
+Key Components
+--------------
+
+### ThreadRequest
+
+A value object encapsulating a command to be executed:
+
+- External ID: Any value suitable as a hash key (String, Symbol, Integer, Object) that identifies the command
+- Subject: The object to call the method on
+- Method: The method to call on the subject
+- Args: Arguments to pass to the method
+
+### ThreadResponse
+
+A value object encapsulating a response:
+
+- External ID: Matches the ID from the original command
+- Result: The result of the operation
+- Error: Any error that occurred
+- Time Taken: Execution time in seconds
+
+### ThreadPool
+
+Manages a pool of threads that execute commands:
+
+- Processes commands from a queue
+- Returns responses with matching external IDs
+- Provides clear separation between I/O and scheduling
+
+Benefits
+--------
+
+1. **Improved Throughput**: Process multiple operations simultaneously
+2. **Reduced Total Runtime**: Make better use of wait time during network operations
+3. **Optimal Resource Usage**: Efficiently balance CPU and network operations
+4. **Better Geolocation Handling**: Distribute requests across proxies more efficiently
+5. **Testability**: Execute non-network operations for testing concurrency
+
+Debugging
+---------
+
+When debugging issues with parallel operations, use:
+
+```shell
+# Set debug level to see request/response logging
+export DEBUG=2
+```
+
+The system will log:
+
+- When commands are queued
+- When responses are received
+- How long each operation took
+- Any errors that occurred
+
+## Implementation Details
+
+The integration between `Scheduler` and `ThreadPool` follows these principles:
+
+1. `Scheduler` maintains ownership of all fiber scheduling
+2. `ThreadPool` only knows about commands and responses
+3. Communication happens via value objects with validation
+4. State is managed in dedicated `FiberState` objects
+5. Each component has a single responsibility
+
+This design provides a clean separation of concerns while enabling parallel operations within the existing fiber
+scheduling framework.
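As a rough illustration of the page added above, the sketch below registers two authorities so their mechanize I/O runs in its own worker thread. Only `register_operation` and `execute_request` come from the documentation in this diff; the `authorities` hash, `process_page`, the `mechanize_agent` helper and the final `run_operations` call are illustrative assumptions, not confirmed API.

```ruby
require "scraper_utils"

# Hypothetical list of authorities to scrape (names and URLs are placeholders)
authorities = {
  "example_north" => "https://north.example.gov.au/applications",
  "example_south" => "https://south.example.gov.au/applications"
}

authorities.each do |authority_name, url|
  ScraperUtils::Scheduler.register_operation(authority_name) do
    agent = ScraperUtils::MechanizeUtils.mechanize_agent # assumed helper from mechanize_utils.rb
    # Runs the GET in this operation's worker thread while other fibers keep going
    page = ScraperUtils::Scheduler.execute_request(agent, :get, [url])
    process_page(page) # your own parsing / record saving
  end
end

ScraperUtils::Scheduler.run_operations # assumed entry point that drains all registered fibers
```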
data/docs/randomizing_requests.md
@@ -1,9 +1,11 @@
-
+Randomizing Requests
+====================
 
 `ScraperUtils::RandomizeUtils` provides utilities for randomizing processing order in scrapers,
 which is helpful for distributing load and avoiding predictable patterns.
 
-
+Usage
+-----
 
 Pass a `Collection` or `Array` to `ScraperUtils::RandomizeUtils.randomize_order` to randomize it in production, but
 receive it as is when testing.

@@ -18,17 +20,19 @@ records.each do |record|
 end
 ```
 
-
+Testing Configuration
+---------------------
 
 Enforce sequential mode when testing by adding the following code to `spec/spec_helper.rb`:
 
 ```ruby
-ScraperUtils::RandomizeUtils.
+ScraperUtils::RandomizeUtils.random = false
 ```
 
-
+Notes
+-----
 
-* You can also
-* Testing using VCR requires
+* You can also disable random mode by setting the env variable `MORPH_DISABLE_RANDOM` to `1` (or any non-blank value)
+* Testing using VCR requires random to be disabled
 
-For full details, see
+For full details, see {ScraperUtils::RandomizeUtils Randomize Utils class documentation}
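A short sketch of the `randomize_order` pattern described in the hunk above; the `authorities` array and the `scrape` method are placeholders.

```ruby
require "scraper_utils"

authorities = %w[example_north example_south example_west] # placeholder list

# Shuffled in production, returned unchanged when random mode is disabled
ScraperUtils::RandomizeUtils.randomize_order(authorities).each do |authority|
  scrape(authority) # your scraping entry point
end

# In spec/spec_helper.rb (or MORPH_DISABLE_RANDOM=1) so VCR cassettes stay deterministic
ScraperUtils::RandomizeUtils.random = false
```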
data/docs/reducing_server_load.md
@@ -12,8 +12,8 @@ records:
 
 - Always checks the most recent 4 days daily (configurable)
 - Progressively reduces search frequency for older records
-- Uses a
-- Configurable `max_period` (default is
+- Uses a progression from every 2 days and upwards to create efficient search intervals
+- Configurable `max_period` (default is 2 days)
 - Merges adjacent search ranges and handles the changeover in search frequency by extending some searches
 
 Example usage in your scraper:

@@ -28,11 +28,11 @@ date_ranges.each do |from_date, to_date, _debugging_comment|
 end
 ```
 
-Typical server load
+Typical server load compared to searching all days each time:
 
-* Max period 2 days : ~
-* Max period 3 days : ~
-* Max period
+* Max period 2 days : ~59% of the 33 days selected (default, alternates between 57% and 61% covered)
+* Max period 3 days : ~50% of the 33 days selected (varies much more - between 33% and 67%)
+* Max period 4 days : ~46% (more efficient if you search back 50 or more days, varies between 15% and 61%)
 
 See the [DateRangeUtils class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/DateRangeUtils) for customizing defaults and passing options.
 
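The documented loop above can be fleshed out roughly as follows; `search_applications` is a placeholder for the authority-specific search and the option values are illustrative.

```ruby
require "scraper_utils"

date_ranges = ScraperUtils::DateRangeUtils.new.calculate_date_ranges(days: 33, max_period: 2)

date_ranges.each do |from_date, to_date, debugging_comment|
  puts "Searching #{from_date}..#{to_date} (#{debugging_comment})"
  search_applications(from_date, to_date) # placeholder for your date-range search
end
```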
data/lib/scraper_utils/data_quality_monitor.rb
@@ -13,7 +13,6 @@ module ScraperUtils
     # Notes the start of processing an authority and clears any previous stats
     #
     # @param authority_label [Symbol] The authority we are processing
-    # @return [void]
     def self.start_authority(authority_label)
       @stats ||= {}
       @stats[authority_label] = { saved: 0, unprocessed: 0 }

@@ -41,7 +40,7 @@ module ScraperUtils
     def self.log_unprocessable_record(exception, record)
       authority_label = extract_authority(record)
       @stats[authority_label][:unprocessed] += 1
-      ScraperUtils::
+      ScraperUtils::LogUtils.log "Erroneous record #{authority_label} - #{record&.fetch(
         'address', nil
       ) || record.inspect}: #{exception}"
       return unless @stats[authority_label][:unprocessed] > threshold(authority_label)

@@ -58,7 +57,7 @@ module ScraperUtils
     def self.log_saved_record(record)
       authority_label = extract_authority(record)
       @stats[authority_label][:saved] += 1
-      ScraperUtils::
+      ScraperUtils::LogUtils.log "Saving record #{authority_label} - #{record['address']}"
     end
   end
 end
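A sketch of how the logging hooks changed above are typically driven; `save_record`, the `records` collection and the record layout (including whichever key `extract_authority` reads) are assumptions for illustration.

```ruby
ScraperUtils::DataQualityMonitor.start_authority(:example_north)

records.each do |record|
  begin
    save_record(record) # your persistence call
    ScraperUtils::DataQualityMonitor.log_saved_record(record)
  rescue StandardError => e
    # Counts the failure and reports it via the new LogUtils.log call
    ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)
  end
end
```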
data/lib/scraper_utils/date_range_utils.rb
@@ -3,7 +3,7 @@
 module ScraperUtils
   class DateRangeUtils
     MERGE_ADJACENT_RANGES = true
-    PERIODS = [2, 3,
+    PERIODS = [2, 3, 4].freeze
 
     class << self
       # @return [Integer] Default number of days to cover

@@ -33,7 +33,7 @@ module ScraperUtils
       def reset_defaults!
         @default_days = ENV.fetch('MORPH_DAYS', 33).to_i # 33
         @default_everytime = ENV.fetch('MORPH_EVERYTIME', 4).to_i # 4
-        @default_max_period = ENV.fetch('MORPH_MAX_PERIOD',
+        @default_max_period = ENV.fetch('MORPH_MAX_PERIOD', 2).to_i # 3
       end
     end
 

@@ -46,8 +46,8 @@ module ScraperUtils
     # Generates one or more date ranges to check the most recent daily through to checking each max_period
     # There is a graduated schedule from the latest `everytime` days through to the oldest of `days` dates which is checked each `max_period` days.
     # @param days [Integer, nil] create ranges that cover the last `days` dates
-    # @param everytime [Integer, nil] Always include the latest `everytime` out of `days` dates
-    # @param max_period [Integer, nil] the last `days` dates must be checked at least every `max_period` days
+    # @param everytime [Integer, nil] Always include the latest `everytime` out of `days` dates (minimum 1)
+    # @param max_period [Integer, nil] the last `days` dates must be checked at least every `max_period` days (1..4)
     # @param today [Date, nil] overrides the default determination of today at UTC+09:30 (middle of Australia)
     # @return [Array{[Date, Date, String]}] being from_date, to_date and a comment
     #

@@ -58,7 +58,7 @@ module ScraperUtils
     def calculate_date_ranges(days: nil, everytime: nil, max_period: nil, today: nil)
       _calculate_date_ranges(
         Integer(days || self.class.default_days),
-        Integer(everytime || self.class.default_everytime),
+        [1, Integer(everytime || self.class.default_everytime)].max,
         Integer(max_period || self.class.default_max_period),
         today || Time.now(in: '+09:30').to_date
       )

@@ -76,84 +76,43 @@ module ScraperUtils
         # cover everything everytime
         return [[today + 1 - days, today, "everything"]]
       end
-
       max_period = valid_periods.max
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+      @max_period_used = max_period
+
+      one_half = ((days - everytime) / 2).to_i
+      one_third = ((days - everytime) / 3).to_i
+      two_ninths = (2 * (days - everytime) / 9).to_i
+      run_ranges =
+        case max_period
+        when 2
+          [
+            [[to_date - (one_half + everytime), to_date, "#{max_period}#0+everytime"]],
+            [[to_date - days, to_date - (one_half + everytime), "#{max_period}#1"], [to_date - everytime, to_date, "everytime"]]
+          ]
+        when 3
+          [
+            [[to_date - days - 1, to_date + two_ninths - days, "3#0"], [to_date - (one_third + everytime), to_date, "2#0+everytime"]],
+            [[to_date + two_ninths - days, to_date + 2 * two_ninths - days, "3#1"], [to_date - everytime, to_date, "everytime"]],
+            [[to_date + 2 * two_ninths - days, to_date, "3#2+2#0+everytime"]],
+            [[to_date - days - 1, to_date + two_ninths - days, "3#3"], [to_date - everytime, to_date, "everytime"]],
+            [[to_date + two_ninths - days, to_date + 2 * two_ninths - days, "3#4"], [to_date - (one_third + everytime), to_date, "2#2+everytime"]],
+            [[to_date + 2 * two_ninths - days, to_date - (one_third + everytime), "3#5"], [to_date - everytime, to_date, "everytime"]]
+          ]
+        else
+          [
+            [[to_date - (one_half + everytime), to_date, "2#0+everytime"]],
+            [[to_date - days - 2, to_date - (one_half + everytime), "4#0"], [to_date - everytime, to_date, "everytime"]],
+            [[to_date - (one_half + everytime), to_date, "2#1+everytime"]],
+            [[to_date - everytime, to_date, "everytime"]]
+          ]
       end
-
-      break unless days.positive?
-
-      this_period = [days, period].min
-      break if this_period <= 0
-
-      earliest_from = to_date - days
-      # we are working from the oldest back towards today
-      if run_number % period == index
-        from = to_date - index - (this_period - 1)
-        from = earliest_from if from < earliest_from
-        to = [today, to_date - index].min
-        break if from > to
+      run_number = today.to_date.jd % run_ranges.size
 
-
-        if ranges.any? && ranges.last[0] <= to + 1 && MERGE_ADJACENT_RANGES
-          # extend adjacent range
-          ranges.last[0] = [from, ranges.last[0]].min
-          ranges.last[2] = "#{period}\##{index},#{ranges.last[2]}"
-        else
-          to = ranges.last[0] - 1 if ranges.any? && to >= ranges.last[0]
-          ranges << [from, to, "#{period}\##{index}"]
-        end
-      end
-      days -= this_period
-      to_date -= this_period
-      end
-      end
-      # remainder of range at max_period, whatever that is
+      ranges = run_ranges[run_number]
       if days.positive? && ScraperUtils::DebugUtils.trace?
-
-      end
-      index = -1
-      while days.positive?
-        index += 1
-        this_period = [days, max_period].min
-        break if this_period <= 0
-
-        earliest_from = to_date - days
-        if (run_number % max_period) == (index % max_period)
-          from = to_date - index - (this_period - 1)
-          from = earliest_from if from < earliest_from
-          to = to_date - index
-          break if from > to
-
-          @max_period_used = [this_period, @max_period_used].max
-          if ranges.any? && ranges.last[0] <= to + 1 && MERGE_ADJACENT_RANGES
-            # extend adjacent range
-            ranges.last[0] = [from, ranges.last[0]].min
-            ranges.last[2] = "#{this_period}\##{index},#{ranges.last[2]}"
-          else
-            to = ranges.last[0] - 1 if ranges.any? && to >= ranges.last[0]
-            ranges << [from, to, "#{this_period}\##{index}"]
-          end
-        end
-        days -= this_period
-        to_date -= this_period
+        LogUtils.log "DEBUG: #{max_period} ranges: #{ranges.inspect}"
       end
-      ranges
+      ranges
     end
   end
 end
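The core change above replaces the old nested loops with a lookup table of `run_ranges` selected by the day's Julian day number, so consecutive days rotate through the table. A tiny illustration of that selection logic, with placeholder labels instead of real date ranges:

```ruby
require "date"

run_ranges = ["older half", "newer half + everytime"] # placeholders for the max_period 2 table
(Date.new(2025, 3, 7)..Date.new(2025, 3, 10)).each do |today|
  run_number = today.jd % run_ranges.size
  puts "#{today}: uses #{run_ranges[run_number]}"
end
```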
data/lib/scraper_utils/debug_utils.rb
@@ -51,17 +51,17 @@ module ScraperUtils
 
     # Logs details of an HTTP request when debug mode is enabled
     #
-    # @param
+    # @param http_method [String] HTTP method (GET, POST, etc.)
     # @param url [String] Request URL
     # @param parameters [Hash, nil] Optional request parameters
     # @param headers [Hash, nil] Optional request headers
     # @param body [Hash, nil] Optional request body
     # @return [void]
-    def self.debug_request(
+    def self.debug_request(http_method, url, parameters: nil, headers: nil, body: nil)
       return unless basic?
 
       puts
-
+      LogUtils.log "🔍 #{http_method.upcase} #{url}"
       puts "Parameters:", JSON.pretty_generate(parameters) if parameters
       puts "Headers:", JSON.pretty_generate(headers) if headers
       puts "Body:", JSON.pretty_generate(body) if body

@@ -77,7 +77,7 @@ module ScraperUtils
       return unless trace?
 
       puts
-
+      LogUtils.log "🔍 DEBUG: #{message}"
       puts "Current URL: #{page.uri}"
       puts "Page title: #{page.at('title').text.strip}" if page.at("title")
       puts "",

@@ -98,7 +98,7 @@ module ScraperUtils
       return unless trace?
 
       puts
-
+      LogUtils.log "🔍 DEBUG: #{message}"
       puts "Looking for selector: #{selector}"
       element = page.at(selector)
       if element
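With the explicit signature introduced above, a call might look like the following; output only appears when the gem's debug level is enabled (the `basic?` guard), and the URL and values are placeholders.

```ruby
ScraperUtils::DebugUtils.debug_request(
  "GET",
  "https://example.gov.au/applications", # placeholder URL
  parameters: { page: 2 },
  headers: { "Accept" => "text/html" }
)
```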
data/lib/scraper_utils/log_utils.rb
@@ -9,6 +9,21 @@ module ScraperUtils
     LOG_TABLE = "scrape_log"
     LOG_RETENTION_DAYS = 30
 
+    # Logs a message, automatically prefixing with authority name if in a fiber
+    #
+    # @param message [String] the message to log
+    # @return [void]
+    def self.log(message, authority = nil)
+      authority ||= Scheduler.current_authority
+      $stderr.flush
+      if authority
+        puts "[#{authority}] #{message}"
+      else
+        puts message
+      end
+      $stdout.flush
+    end
+
     # Log details about a scraping run for one or more authorities
     # @param start_time [Time] When this scraping attempt was started
     # @param attempt [Integer] 1 for first run, 2 for first retry, 3 for last retry (without proxy)
data/lib/scraper_utils/mechanize_actions.rb
@@ -9,7 +9,7 @@ module ScraperUtils
   #
   #   actions = [
   #     [:click, "Next Page"],
-  #     [:click, ["Option A", "
+  #     [:click, ["Option A", "xpath://div[@id='results']/a", "css:.some-button"]] # Will select one randomly
   #   ]
   #
   #   processor = ScraperUtils::MechanizeActions.new(agent)

@@ -50,7 +50,7 @@ module ScraperUtils
     # @example Action format
     #   actions = [
     #     [:click, "Link Text"], # Click on link with this text
-    #     [:click, ["Option A", "Option B"]],
+    #     [:click, ["Option A", "text:Option B"]], # Click on one of these options (randomly selected)
     #     [:click, "css:.some-button"], # Use CSS selector
     #     [:click, "xpath://div[@id='results']/a"], # Use XPath selector
     #     [:block, ->(page, args, agent, results) { [page, { custom_results: 'data' }] }] # Custom block

@@ -67,8 +67,7 @@ module ScraperUtils
       when :click
         handle_click(current_page, args)
       when :block
-
-        block.call(current_page, args, agent, @results.dup)
+        handle_block(current_page, args)
       else
         raise ArgumentError, "Unknown action type: #{action_type}"
       end

@@ -81,6 +80,18 @@ module ScraperUtils
 
     private
 
+    # Process a block action
+    #
+    # @param page [Mechanize::Page] The current page
+    # @param args [Array] The block and its arguments
+    # @return [Array<Mechanize::Page, Hash>] The resulting page and status
+    def handle_block(page, args)
+      block = args.shift
+      # Apply replacements to all remaining arguments
+      processed_args = args.map { |arg| apply_replacements(arg) }
+      block.call(page, processed_args.first, agent, @results.dup)
+    end
+
     # Handle a click action
     #
     # @param page [Mechanize::Page] The current page

@@ -105,16 +116,34 @@ module ScraperUtils
     # Select an element on the page based on selector string
     #
     # @param page [Mechanize::Page] The page to search in
-    # @param selector_string [String] The selector string
+    # @param selector_string [String] The selector string, optionally with "css:", "xpath:" or "text:" prefix
     # @return [Mechanize::Element, nil] The selected element or nil if not found
     def select_element(page, selector_string)
       # Handle different selector types based on prefixes
       if selector_string.start_with?("css:")
         selector = selector_string.sub(/^css:/, '')
-
+        # We need to convert Nokogiri elements to Mechanize elements for clicking
+        css_element = page.at_css(selector)
+        return nil unless css_element
+
+        # If it's a link, find the matching Mechanize link
+        if css_element.name.downcase == 'a' && css_element['href']
+          return page.links.find { |link| link.href == css_element['href'] }
+        end
+
+        return css_element
       elsif selector_string.start_with?("xpath:")
         selector = selector_string.sub(/^xpath:/, '')
-
+        # We need to convert Nokogiri elements to Mechanize elements for clicking
+        xpath_element = page.at_xpath(selector)
+        return nil unless xpath_element
+
+        # If it's a link, find the matching Mechanize link
+        if xpath_element.name.downcase == 'a' && xpath_element['href']
+          return page.links.find { |link| link.href == xpath_element['href'] }
+        end
+
+        return xpath_element
       else
         # Default to text: for links
         selector = selector_string.sub(/^text:/, '')

@@ -133,7 +162,7 @@ module ScraperUtils
       end
     end
 
-    # Get the link with the shortest (closest matching) text then the longest href
+    # Get the link with (a) the shortest (closest matching) text and then (b) the longest href
     matching_links.min_by { |l| [l.text.strip.length, -l.href.length] }
   end
 end
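Pulling the action formats documented above together, a run might look roughly like this; the `process` call is assumed to be the class's entry point, and the agent, page and URL values are placeholders.

```ruby
require "mechanize"
require "scraper_utils"

agent = Mechanize.new
page = agent.get("https://example.gov.au/applications") # placeholder URL

processor = ScraperUtils::MechanizeActions.new(agent)
actions = [
  [:click, "Next Page"],                                       # plain link text
  [:click, ["Option A", "text:Option B", "css:.some-button"]], # one selected at random
  [:block, ->(current_page, _args, _agent, _results) { [current_page, { rows: current_page.search("tr").size }] }]
]

final_page = processor.process(page, actions) # assumed entry point
```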
data/lib/scraper_utils/mechanize_utils/adaptive_delay.rb
@@ -0,0 +1,80 @@
+# frozen_string_literal: true
+
+require "uri"
+require_relative "agent_config"
+
+module ScraperUtils
+  module MechanizeUtils
+    # Adapts delays between requests based on server response times.
+    # Target delay is proportional to response time based on max_load setting.
+    # Uses an exponential moving average to smooth variations in response times.
+    class AdaptiveDelay
+      DEFAULT_MIN_DELAY = 0.0
+      DEFAULT_MAX_DELAY = 30.0 # Presumed default timeout for Mechanize
+
+      attr_reader :min_delay, :max_delay, :max_load
+
+      # Creates a new adaptive delay calculator
+      #
+      # @param min_delay [Float] Minimum delay between requests in seconds
+      # @param max_delay [Float] Maximum delay between requests in seconds
+      # @param max_load [Float] Maximum load percentage (1..Constants::MAX_LOAD_CAP) we aim to place on the server
+      #   Lower values are more conservative (e.g., 20% = 4x response time delay)
+      def initialize(min_delay: DEFAULT_MIN_DELAY, max_delay: DEFAULT_MAX_DELAY, max_load: AgentConfig::DEFAULT_MAX_LOAD)
+        @delays = {} # domain -> last delay used
+        @min_delay = min_delay.to_f
+        @max_delay = max_delay.to_f
+        @max_load = max_load.to_f.clamp(1.0, AgentConfig::MAX_LOAD_CAP)
+        @response_multiplier = (100.0 - @max_load) / @max_load
+
+        return unless DebugUtils.basic?
+
+        ScraperUtils::LogUtils.log(
+          "AdaptiveDelay initialized with delays between #{@min_delay} and #{@max_delay} seconds, " \
+          "Max_load #{@max_load}% thus response multiplier: #{@response_multiplier.round(2)}x"
+        )
+      end
+
+      # @param uri [URI::Generic, String] URL to get delay for
+      # @return [Float] Current delay for the domain, or min_delay if no delay set
+      def delay(uri)
+        @delays[domain(uri)] || @min_delay
+      end
+
+      # Returns the next_delay calculated from a smoothed average of response_time to use less than max_load% of server
+      #
+      # @param uri [URI::Generic, String] URL the response came from
+      # @param response_time [Float] Time in seconds the server took to respond
+      # @return [Float] The calculated delay to use with the next request
+      def next_delay(uri, response_time)
+        uris_domain = domain(uri)
+        # calculate target_delay to achieve desired max_load% using pre-calculated multiplier
+        target_delay = (response_time * @response_multiplier).clamp(0.0, @max_delay)
+        # Initialise average from initial_response_time rather than zero to start with reasonable approximation
+        current_delay = @delays[uris_domain] || target_delay
+        # exponential smooth the delay to smooth out wild swings (Equivalent to an RC low pass filter)
+        delay = ((3.0 * current_delay) + target_delay) / 4.0
+        delay = delay.clamp(@min_delay, @max_delay)
+
+        if DebugUtils.basic?
+          ScraperUtils::LogUtils.log(
+            "Adaptive delay for #{uris_domain} updated to #{delay.round(2)}s (target: " \
+            "#{@response_multiplier.round(1)}x response_time of #{response_time.round(2)}s)"
+          )
+        end
+
+        @delays[uris_domain] = delay
+        delay
+      end
+
+      private
+
+      # @param uri [URI::Generic, String] The URL to extract the domain from
+      # @return [String] The domain in the format "scheme://host"
+      def domain(uri)
+        uri = URI(uri) unless uri.is_a?(URI)
+        "#{uri.scheme}://#{uri.host}".downcase
+      end
+    end
+  end
+end
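A brief sketch of driving the new `AdaptiveDelay` class above: the caller measures the response time itself and feeds it back in, and the URL and timings below are illustrative.

```ruby
delay_calc = ScraperUtils::MechanizeUtils::AdaptiveDelay.new(max_load: 20.0)

url = "https://example.gov.au/applications" # placeholder URL
response_time = 0.8 # seconds the last request took, measured by the caller

sleep delay_calc.next_delay(url, response_time) # roughly 4x the response time at 20% max_load
puts delay_calc.delay(url) # smoothed delay remembered per scheme://host domain
```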