scraper_utils 0.8.2 → 0.8.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.gitignore +39 -9
- data/CHANGELOG.md +34 -0
- data/README.md +7 -55
- data/docs/enhancing_specs.md +86 -47
- data/docs/example_custom_Rakefile +38 -0
- data/docs/example_dot_scraper_validation.yml +23 -0
- data/docs/mechanize_utilities.md +0 -3
- data/docs/testing_custom_scrapers.md +74 -0
- data/exe/validate_scraper_data +150 -0
- data/lib/scraper_utils/log_utils.rb +5 -5
- data/lib/scraper_utils/maths_utils.rb +23 -0
- data/lib/scraper_utils/mechanize_utils/agent_config.rb +9 -65
- data/lib/scraper_utils/mechanize_utils.rb +0 -2
- data/lib/scraper_utils/spec_support.rb +189 -6
- data/lib/scraper_utils/version.rb +1 -1
- data/lib/scraper_utils.rb +1 -5
- data/scraper_utils.gemspec +1 -0
- metadata +11 -24
- data/docs/example_scrape_with_fibers.rb +0 -31
- data/docs/fibers_and_threads.md +0 -72
- data/docs/interleaving_requests.md +0 -33
- data/docs/parallel_requests.md +0 -138
- data/docs/randomizing_requests.md +0 -38
- data/docs/reducing_server_load.md +0 -63
- data/lib/scraper_utils/cycle_utils.rb +0 -26
- data/lib/scraper_utils/date_range_utils.rb +0 -118
- data/lib/scraper_utils/mechanize_actions.rb +0 -183
- data/lib/scraper_utils/mechanize_utils/adaptive_delay.rb +0 -80
- data/lib/scraper_utils/mechanize_utils/robots_checker.rb +0 -151
- data/lib/scraper_utils/randomize_utils.rb +0 -37
- data/lib/scraper_utils/scheduler/constants.rb +0 -12
- data/lib/scraper_utils/scheduler/operation_registry.rb +0 -101
- data/lib/scraper_utils/scheduler/operation_worker.rb +0 -199
- data/lib/scraper_utils/scheduler/process_request.rb +0 -59
- data/lib/scraper_utils/scheduler/thread_request.rb +0 -51
- data/lib/scraper_utils/scheduler/thread_response.rb +0 -59
- data/lib/scraper_utils/scheduler.rb +0 -286
data/docs/fibers_and_threads.md
DELETED
@@ -1,72 +0,0 @@
-Fibers and Threads
-==================
-
-This sequence diagram supplements the notes on the {ScraperUtils::Scheduler} class and is intended to help show
-the passing of messages and control between the fibers and threads.
-
-* To keep things simple I have only shown the Fibers and Threads and not all the other calls like to the
-  OperationRegistry to lookup the current operation, or OperationWorker etc.
-* There is ONE (global) response queue, which is monitored by the Scheduler.run_operations loop in the main Fiber
-* Each authority has ONE OperationWorker (not shown), which has ONE Fiber, ONE Thread, ONE request queue.
-* I use "◀─▶" to indicate a call and response, and "║" for which fiber / object is currently running.
-
-```text
-
-SCHEDULER (Main Fiber)
-NxRegister-operation RESPONSE.Q
-║──creates────────◀─▶┐
-║                    │          FIBER (runs block passed to register_operation)
-║──creates──────────────────◀─▶┐           WORKER object and Registry
-║──registers(fiber)───────────────────────▶┐             REQUEST-Q
-│                    │         │           ║──creates─◀─▶┐         THREAD
-│                    │         │           ║──creates───────────◀─▶┐
-║◀─────────────────────────────────────────┘             ║◀──pop───║ ...[block waiting for request]
-║                    │         │           │             ║         │
-run_operations       │         │           │             ║         │
-║──pop(non block)─◀─▶│         │           │             ║         │ ...[no responses yet]
-║                    │         │           │             ║         │
-║───resumes-next─"can_resume"─────────────▶┐             ║         │
-│                    │         │           ║             ║         │
-│                    │         ║◀──resume──┘             ║         │ ...[first Resume passes true]
-│                    │         ║           │             ║         │ ...[initialise scraper]
-```
-**REPEATS FROM HERE**
-```text
-SCHEDULER            RESPONSE.Q FIBER      WORKER        REQUEST.Q THREAD
-│                    │         ║──request─▶┐             ║         │
-│                    │         │           ║──push req ─▶║         │
-│                    │         ║◀──────────┘             ║──req───▶║
-║◀──yields control─(waiting)───┘           │             │         ║
-║                    │         │           │             │         ║ ...[Executes network I/O request]
-║                    │         │           │             │         ║
-║───other-resumes... │         │           │             │         ║ ...[Other Workers will be resumed
-║                    │         │           │             │         ║     till most (99%) are waiting on
-║───lots of          │         │           │             │         ║     responses from their threads
-║   short sleeps     ║◀──pushes response───────────────────────────┘
-║                    ║         │           │             ║◀──pop───║ ...[block waiting for request]
-║──pop(response)──◀─▶║         │           │             ║         │
-║                    │         │           │             ║         │
-║──saves─response───────────────────────◀─▶│             ║         │
-║                    │         │           │             ║         │
-║───resumes-next─"can_resume"─────────────▶┐             ║         │
-│                    │         │           ║             ║         │
-│                    │         ║◀──resume──┘             ║         │ ...[Resume passes response]
-│                    │         ║           │             ║         │
-│                    │         ║           │             ║         │ ...[Process Response]
-```
-**REPEATS TO HERE** - WHEN FIBER FINISHES, instead it:
-```text
-SCHEDULER            RESPONSE.Q FIBER      WORKER        REQUEST.Q THREAD
-│                    │         ║           │             ║         │
-│                    │         ║─deregister─▶║           ║         │
-│                    │         │           ║──close─────▶║         │
-│                    │         │           ║             ║──nil───▶┐
-│                    │         │           ║             │         ║ ... [thread exits]
-│                    │         │           ║──join──────────────◀─▶┘
-│                    │         │           ║ ....... [worker removes
-│                    │         │           ║           itself from registry]
-│                    │         ║◀──returns───┘
-│◀──returns─nil────────────────┘
-│                    │
-```
-When the last fiber finishes and the registry is empty, then the response queue is also removed
data/docs/interleaving_requests.md
DELETED
@@ -1,33 +0,0 @@
-# Interleaving Requests with Scheduler
-
-The `ScraperUtils::Scheduler` provides a lightweight utility that:
-
-* Works on other authorities while in the delay period for an authority's next request
-* Optimizes the total scraper run time
-* Allows you to increase the random delay for authorities without undue effect on total run time
-* For the curious, it uses [ruby fibers](https://ruby-doc.org/core-2.5.8/Fiber.html) rather than threads as that is
-  a simpler system and thus easier to get right, understand and debug!
-* Cycles around the authorities when compliant_mode, max_load and random_delay are disabled
-
-## Implementation
-
-To enable fiber scheduling, change your scrape method as per
-{example_scrape_with_fibers.rb example scrape with fibers}
-
-## Logging with Scheduler
-
-Use {ScraperUtils::LogUtils.log} instead of `puts` when logging within the authority processing code.
-This will prefix the output lines with the authority name, which is needed since the system will interleave the work and
-thus the output.
-
-## Testing Considerations
-
-This uses {ScraperUtils::RandomizeUtils} for determining the order of operations. Remember to add the following lines to
-`spec/spec_helper.rb`:
-
-```ruby
-ScraperUtils::RandomizeUtils.random = false
-ScraperUtils::Scheduler.max_workers = 1
-```
-
-For full details, see the {Scheduler}.
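
The `example_scrape_with_fibers.rb` file referenced above is also deleted in this release (`data/docs/example_scrape_with_fibers.rb +0 -31` in the summary) but its body is not reproduced in this diff. A minimal sketch of the pattern it illustrated, based on the `register_operation` usage shown in `parallel_requests.md` below - `authorities` and `scrape_authority` are placeholders for your scraper's own code, not part of the gem:

```ruby
# Minimal sketch only, not the deleted example file itself.
def scrape(authorities)
  authorities.each do |authority_label|
    # Register one fiber-scheduled operation per authority
    ScraperUtils::Scheduler.register_operation(authority_label) do
      scrape_authority(authority_label) # placeholder for your per-authority code
    end
  end
  # Run the registered operations, interleaving work on other
  # authorities during each authority's delay periods
  ScraperUtils::Scheduler.run_operations
end
```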
data/docs/parallel_requests.md
DELETED
@@ -1,138 +0,0 @@
-Parallel Request Processing
-===========================
-
-The ScraperUtils library provides a mechanism for executing network I/O requests in parallel using a thread for each
-operation worker, allowing the fiber to yield control and allow other fibers to process whilst the thread processes the
-mechanize network I/O request.
-
-This can be disabled by setting the `MORPH_DISABLE_THREADS` ENV var to a non-blank value.
-
-Overview
---------
-
-When scraping multiple authority websites, around 99% of the time was spent waiting for network I/O. While the
-`Scheduler`
-efficiently interleaves fibers during delay periods, network I/O requests will still block a fiber until they
-complete.
-
-The `OperationWorker` optimizes this process by:
-
-1. Executing mechanize network operations in parallel using a thread for each operation_worker and fiber
-2. Allowing other fibers to continue working while waiting for thread responses
-3. Integrating seamlessly with the existing `Scheduler`
-
-Usage
------
-
-```ruby
-# In your authority scraper block
-ScraperUtils::Scheduler.register_operation("authority_name") do
-  # Instead of:
-  # page = agent.get(url)
-
-  # Use:
-  page = ScraperUtils::Scheduler.execute_request(agent, :get, [url])
-
-  # Process page as normal
-  process_page(page)
-end
-```
-
-For testing purposes, you can also execute non-network operations:
-
-```ruby
-# Create a test object
-test_object = Object.new
-
-def test_object.sleep_test(duration)
-  sleep(duration)
-  "Completed after #{duration} seconds"
-end
-
-# Queue a sleep command
-command = ScraperUtils::ProcessRequest.new(
-  "test_id",
-  test_object,
-  :sleep_test,
-  [0.5]
-)
-
-thread_scheduler.queue_request(command)
-```
-
-Configuration
--------------
-
-The following ENV variables affect how `Scheduler` is configured:
-
-* `MORPH_DISABLE_THREADS=1` disables the use of threads
-* `MORPH_MAX_WORKERS=N` configures the system to a max of N workers (minimum 1).
-  If N is 1 then this forces the system to process one authority at a time.
-
-Key Components
---------------
-
-### ThreadRequest
-
-A value object encapsulating a command to be executed:
-
-- External ID: Any value suitable as a hash key (String, Symbol, Integer, Object) that identifies the command
-- Subject: The object to call the method on
-- Method: The method to call on the subject
-- Args: Arguments to pass to the method
-
-### ThreadResponse
-
-A value object encapsulating a response:
-
-- External ID: Matches the ID from the original command
-- Result: The result of the operation
-- Error: Any error that occurred
-- Time Taken: Execution time in seconds
-
-### ThreadPool
-
-Manages a pool of threads that execute commands:
-
-- Processes commands from a queue
-- Returns responses with matching external IDs
-- Provides clear separation between I/O and scheduling
-
-Benefits
---------
-
-1. **Improved Throughput**: Process multiple operations simultaneously
-2. **Reduced Total Runtime**: Make better use of wait time during network operations
-3. **Optimal Resource Usage**: Efficiently balance CPU and network operations
-4. **Better Geolocation Handling**: Distribute requests across proxies more efficiently
-5. **Testability**: Execute non-network operations for testing concurrency
-
-Debugging
----------
-
-When debugging issues with parallel operations, use:
-
-```shell
-# Set debug level to see request/response logging
-export DEBUG=2
-```
-
-The system will log:
-
-- When commands are queued
-- When responses are received
-- How long each operation took
-- Any errors that occurred
-
-## Implementation Details
-
-The integration between `Scheduler` and `ThreadPool` follows these principles:
-
-1. `Scheduler` maintains ownership of all fiber scheduling
-2. `ThreadPool` only knows about commands and responses
-3. Communication happens via value objects with validation
-4. State is managed in dedicated `FiberState` objects
-5. Each component has a single responsibility
-
-This design provides a clean separation of concerns while enabling parallel operations within the existing fiber
-scheduling framework.
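
The `ThreadRequest`/`ThreadResponse` pairing described under Key Components can be pictured with plain Ruby queues. This is only an illustrative sketch of the external-ID matching mechanism, not the gem's internal implementation:

```ruby
request_q  = Queue.new
response_q = Queue.new

# Worker thread: pop commands, execute them, push back responses
# tagged with the same external ID (mirrors ThreadRequest/ThreadResponse)
worker = Thread.new do
  while (command = request_q.pop) # a nil command signals shutdown
    external_id, subject, method_name, args = command
    started = Time.now
    result, error = begin
                      [subject.public_send(method_name, *args), nil]
                    rescue StandardError => e
                      [nil, e]
                    end
    response_q.push([external_id, result, error, Time.now - started])
  end
end

request_q.push([:my_id, "hello", :upcase, []])
external_id, result, _error, _time_taken = response_q.pop
# external_id (:my_id) lets a scheduler match this response to its command
request_q.push(nil) # shut down, mirroring the close/nil/join sequence in fibers_and_threads.md
worker.join
```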
data/docs/randomizing_requests.md
DELETED
@@ -1,38 +0,0 @@
-Randomizing Requests
-====================
-
-`ScraperUtils::RandomizeUtils` provides utilities for randomizing processing order in scrapers,
-which is helpful for distributing load and avoiding predictable patterns.
-
-Usage
------
-
-Pass a `Collection` or `Array` to `ScraperUtils::RandomizeUtils.randomize_order` to randomize it in production, but
-receive it as is when testing.
-
-```ruby
-# Randomize a collection
-randomized_authorities = ScraperUtils::RandomizeUtils.randomize_order(authorities)
-
-# Use with a list of records from an index to randomize requests for details
-ScraperUtils::RandomizeUtils.randomize_order(records).each do |record|
-  # Process record
-end
-```
-
-Testing Configuration
----------------------
-
-Enforce sequential mode when testing by adding the following code to `spec/spec_helper.rb`:
-
-```ruby
-ScraperUtils::RandomizeUtils.random = false
-```
-
-Notes
------
-
-* You can also disable random mode by setting the env variable `MORPH_DISABLE_RANDOM` to `1` (or any non-blank value)
-* Testing using VCR requires random to be disabled
-
-For full details, see {ScraperUtils::RandomizeUtils Randomize Utils class documentation}
data/docs/reducing_server_load.md
DELETED
@@ -1,63 +0,0 @@
-# Reducing Server Load
-
-This document explains various techniques for reducing load on the servers you're scraping.
-
-## Intelligent Date Range Selection
-
-To further reduce server load and speed up scrapers, we provide an intelligent date range selection mechanism
-that can reduce server requests by 60% without significantly impacting delay in picking up changes.
-
-The `ScraperUtils::DateRangeUtils#calculate_date_ranges` method provides a smart approach to searching historical
-records:
-
-- Always checks the most recent 4 days daily (configurable)
-- Progressively reduces search frequency for older records
-- Uses a progression from every 2 days upwards to create efficient search intervals
-- Configurable `max_period` (default is 2 days)
-- Merges adjacent search ranges and handles the changeover in search frequency by extending some searches
-
-Example usage in your scraper:
-
-```ruby
-date_ranges = ScraperUtils::DateRangeUtils.new.calculate_date_ranges
-date_ranges.each do |from_date, to_date, _debugging_comment|
-  # Adjust your normal search code to use this date range
-  your_search_records(from_date: from_date, to_date: to_date) do |record|
-    # process as normal
-  end
-end
-```
-
-Typical server load compared to searching all days each time:
-
-* Max period 2 days : ~59% of the 33 days selected (default, alternates between 57% and 61% covered)
-* Max period 3 days : ~50% of the 33 days selected (varies much more - between 33 and 67%)
-* Max period 4 days : ~46% (more efficient if you search back 50 or more days, varies between 15 and 61%)
-
-See the [DateRangeUtils class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/DateRangeUtils) for customizing defaults and passing options.
-
-## Cycle Utilities
-
-Simple utility for cycling through options based on Julian day number to reduce server load and make your scraper seem less bot-like.
-
-If the site uses tags like 'L28', 'L14' and 'L7' for the last 28, 14 and 7 days, an alternative solution
-is to cycle through ['L28', 'L7', 'L14', 'L7'] which would drop the load by 50% and be less bot-like.
-
-```ruby
-# Toggle between main and alternate behaviour
-alternate = ScraperUtils::CycleUtils.position(2).even?
-
-# OR cycle through a list of values day by day:
-period = ScraperUtils::CycleUtils.pick(['L28', 'L7', 'L14', 'L7'])
-
-# Use with any cycle size
-pos = ScraperUtils::CycleUtils.position(7) # 0-6 cycle
-
-# Test with specific date
-pos = ScraperUtils::CycleUtils.position(3, date: Date.new(2024, 1, 5))
-
-# Override for testing
-# CYCLE_POSITION=2 bundle exec ruby scraper.rb
-```
-
-For full details, see the [CycleUtils class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/CycleUtils).
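
As a concrete follow-on to the `pick` example above, the chosen tag can feed straight into the day's search request. A hypothetical sketch - the URL and `within` parameter are placeholders, not a real site's API:

```ruby
# Cycle the "last N days" tag day by day to roughly halve server load
period = ScraperUtils::CycleUtils.pick(['L28', 'L7', 'L14', 'L7'])
page = agent.get("https://example.com/applications?within=#{period}") # placeholder URL
```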
data/lib/scraper_utils/cycle_utils.rb
DELETED
@@ -1,26 +0,0 @@
-# frozen_string_literal: true
-
-module ScraperUtils
-  # Provides utilities for cycling through a range of options day by day
-  module CycleUtils
-    # Returns position in cycle from zero onwards
-    # @param cycle [Integer] Cycle size (2 onwards)
-    # @param date [Date, nil] Optional date to use instead of today
-    # @return [Integer] position in cycle progressing from zero to cycle-1 and then repeating day by day
-    # Can override using CYCLE_POSITION ENV variable
-    def self.position(cycle, date: nil)
-      day = ENV.fetch('CYCLE_POSITION', (date || Date.today).jd).to_i
-      day % cycle
-    end
-
-    # Returns one value per day, cycling through all possible values in order
-    # @param values [Array] Values to cycle through
-    # @param date [Date, nil] Optional date to use instead of today to calculate position
-    # @return value from array
-    # Can override using CYCLE_POSITION ENV variable
-    def self.pick(values, date: nil)
-      values = values.to_a
-      values[position(values.size, date: date)]
-    end
-  end
-end
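
A quick sketch of how the two methods behave with a pinned date, assuming the `CYCLE_POSITION` ENV variable is not set (`Date.new(2024, 1, 5).jd` is 2460315, which is divisible by 3):

```ruby
require 'date'

ScraperUtils::CycleUtils.position(3, date: Date.new(2024, 1, 5))     # => 0 (2460315 % 3)
ScraperUtils::CycleUtils.pick(%w[a b c], date: Date.new(2024, 1, 5)) # => "a"
```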
data/lib/scraper_utils/date_range_utils.rb
DELETED
@@ -1,118 +0,0 @@
-# frozen_string_literal: true
-
-module ScraperUtils
-  class DateRangeUtils
-    MERGE_ADJACENT_RANGES = true
-    PERIODS = [2, 3, 4].freeze
-
-    class << self
-      # @return [Integer] Default number of days to cover
-      attr_accessor :default_days
-
-      # @return [Integer] Default days to always include in ranges
-      attr_accessor :default_everytime
-
-      # @return [Integer, nil] Default max days between any one date being in a range
-      attr_accessor :default_max_period
-
-      # Configure default settings for all DateRangeUtils instances
-      # @yield [self] Yields self for configuration
-      # @example
-      #   DateRangeUtils.configure do |config|
-      #     config.default_everytime = 3
-      #     config.default_days = 35
-      #     config.default_max_period = 5
-      #   end
-      # @return [void]
-      def configure
-        yield self if block_given?
-      end
-
-      # Reset all configuration options to their default values
-      # @return [void]
-      def reset_defaults!
-        @default_days = ENV.fetch('MORPH_DAYS', 33).to_i # 33
-        @default_everytime = ENV.fetch('MORPH_EVERYTIME', 4).to_i # 4
-        @default_max_period = ENV.fetch('MORPH_MAX_PERIOD', 2).to_i # 3
-      end
-    end
-
-    # Set defaults on load
-    reset_defaults!
-
-    attr_reader :max_period_used
-    attr_reader :extended_max_period
-
-    # Generates one or more date ranges to check the most recent daily through to checking each max_period
-    # There is a graduated schedule from the latest `everytime` days through to the oldest of `days` dates which is checked each `max_period` days.
-    # @param days [Integer, nil] create ranges that cover the last `days` dates
-    # @param everytime [Integer, nil] Always include the latest `everytime` out of `days` dates (minimum 1)
-    # @param max_period [Integer, nil] the last `days` dates must be checked at least every `max_period` days (1..4)
-    # @param today [Date, nil] overrides the default determination of today at UTC+09:30 (middle of Australia)
-    # @return [Array{[Date, Date, String]}] being from_date, to_date and a comment
-    #
-    # Uses a Fibonacci sequence to create a natural progression of check frequencies.
-    # Newer data is checked more frequently, with periods between checks growing
-    # according to the Fibonacci sequence (2, 3, 5, 8, 13...) until reaching max_period.
-    # This creates an efficient schedule that mimics natural information decay patterns.
-    def calculate_date_ranges(days: nil, everytime: nil, max_period: nil, today: nil)
-      _calculate_date_ranges(
-        Integer(days || self.class.default_days),
-        [1, Integer(everytime || self.class.default_everytime)].max,
-        Integer(max_period || self.class.default_max_period),
-        today || Time.now(in: '+09:30').to_date
-      )
-    end
-
-    private
-
-    def _calculate_date_ranges(days, everytime, max_period, today)
-      @max_period_used = 1
-      to_date = today
-      valid_periods = PERIODS.select { |p| p <= max_period }
-      if !max_period.positive? || !days.positive?
-        return []
-      elsif valid_periods.empty? || everytime >= days
-        # cover everything everytime
-        return [[today + 1 - days, today, "everything"]]
-      end
-      max_period = valid_periods.max
-      @max_period_used = max_period
-
-      one_half = ((days - everytime) / 2).to_i
-      one_third = ((days - everytime) / 3).to_i
-      two_ninths = (2 * (days - everytime) / 9).to_i
-      run_ranges =
-        case max_period
-        when 2
-          [
-            [[to_date - (one_half + everytime), to_date, "#{max_period}#0+everytime"]],
-            [[to_date - days, to_date - (one_half + everytime), "#{max_period}#1"], [to_date - everytime, to_date, "everytime"]]
-          ]
-        when 3
-          [
-            [[to_date - days - 1, to_date + two_ninths - days, "3#0"], [to_date - (one_third + everytime), to_date, "2#0+everytime"]],
-            [[to_date + two_ninths - days, to_date + 2 * two_ninths - days, "3#1"], [to_date - everytime, to_date, "everytime"]],
-            [[to_date + 2 * two_ninths - days, to_date, "3#2+2#0+everytime"]],
-            [[to_date - days - 1, to_date + two_ninths - days, "3#3"], [to_date - everytime, to_date, "everytime"]],
-            [[to_date + two_ninths - days, to_date + 2 * two_ninths - days, "3#4"], [to_date - (one_third + everytime), to_date, "2#2+everytime"]],
-            [[to_date + 2 * two_ninths - days, to_date - (one_third + everytime), "3#5"], [to_date - everytime, to_date, "everytime"]]
-          ]
-        else
-          [
-            [[to_date - (one_half + everytime), to_date, "2#0+everytime"]],
-            [[to_date - days - 2, to_date - (one_half + everytime), "4#0"], [to_date - everytime, to_date, "everytime"]],
-            [[to_date - (one_half + everytime), to_date, "2#1+everytime"]],
-            [[to_date - everytime, to_date, "everytime"]]
-          ]
-        end
-      run_number = today.to_date.jd % run_ranges.size
-
-      ranges = run_ranges[run_number]
-      if days.positive? && ScraperUtils::DebugUtils.trace?
-        LogUtils.log "DEBUG: #{max_period} ranges: #{ranges.inspect}"
-      end
-      ranges
-    end
-  end
-end
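
A sketch of driving this class directly with a pinned `today`, which makes the run number (`today.to_date.jd % run_ranges.size`) reproducible; the exact ranges returned depend on the Julian day:

```ruby
require 'date'

utils = ScraperUtils::DateRangeUtils.new
ranges = utils.calculate_date_ranges(days: 33, everytime: 4, max_period: 3,
                                     today: Date.new(2024, 1, 5))
ranges.each do |from_date, to_date, comment|
  # each entry is [from_date, to_date, comment], e.g. comment "everytime"
  puts "#{from_date}..#{to_date} (#{comment})"
end
```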
data/lib/scraper_utils/mechanize_actions.rb
DELETED
@@ -1,183 +0,0 @@
-# frozen_string_literal: true
-
-module ScraperUtils
-  # Class for executing a series of mechanize actions with flexible replacements
-  #
-  # @example Basic usage
-  #   agent = ScraperUtils::MechanizeUtils.mechanize_agent
-  #   page = agent.get("https://example.com")
-  #
-  #   actions = [
-  #     [:click, "Next Page"],
-  #     [:click, ["Option A", "xpath://div[@id='results']/a", "css:.some-button"]] # Will select one randomly
-  #   ]
-  #
-  #   processor = ScraperUtils::MechanizeActions.new(agent)
-  #   result_page = processor.process(page, actions)
-  #
-  # @example With replacements
-  #   replacements = { FROM_DATE: "2022-01-01", TO_DATE: "2022-03-01" }
-  #   processor = ScraperUtils::MechanizeActions.new(agent, replacements)
-  #
-  #   # Use replacements in actions
-  #   actions = [
-  #     [:click, "Search between {FROM_DATE} and {TO_DATE}"]
-  #   ]
-  class MechanizeActions
-    # @return [Mechanize] The mechanize agent used for actions
-    attr_reader :agent
-
-    # @return [Array] The results of each action performed
-    attr_reader :results
-
-    # Initialize a new MechanizeActions processor
-    #
-    # @param agent [Mechanize] The mechanize agent to use for actions
-    # @param replacements [Hash] Optional text replacements to apply to action parameters
-    def initialize(agent, replacements = {})
-      @agent = agent
-      @replacements = replacements || {}
-      @results = []
-    end
-
-    # Process a sequence of actions on a page
-    #
-    # @param page [Mechanize::Page] The starting page
-    # @param actions [Array<Array>] The sequence of actions to perform
-    # @return [Mechanize::Page] The resulting page after all actions
-    # @raise [ArgumentError] If an unknown action type is provided
-    #
-    # @example Action format
-    #   actions = [
-    #     [:click, "Link Text"],                    # Click on link with this text
-    #     [:click, ["Option A", "text:Option B"]],  # Click on one of these options (randomly selected)
-    #     [:click, "css:.some-button"],             # Use CSS selector
-    #     [:click, "xpath://div[@id='results']/a"], # Use XPath selector
-    #     [:block, ->(page, args, agent, results) { [page, { custom_results: 'data' }] }] # Custom block
-    #   ]
-    def process(page, actions)
-      @results = []
-      current_page = page
-
-      actions.each do |action|
-        args = action.dup
-        action_type = args.shift
-        current_page, result =
-          case action_type
-          when :click
-            handle_click(current_page, args)
-          when :block
-            handle_block(current_page, args)
-          else
-            raise ArgumentError, "Unknown action type: #{action_type}"
-          end
-
-        @results << result
-      end
-
-      current_page
-    end
-
-    private

-    # Process a block action
-    #
-    # @param page [Mechanize::Page] The current page
-    # @param args [Array] The block and its arguments
-    # @return [Array<Mechanize::Page, Hash>] The resulting page and status
-    def handle_block(page, args)
-      block = args.shift
-      # Apply replacements to all remaining arguments
-      processed_args = args.map { |arg| apply_replacements(arg) }
-      block.call(page, processed_args.first, agent, @results.dup)
-    end
-
-    # Handle a click action
-    #
-    # @param page [Mechanize::Page] The current page
-    # @param args [Array] The first element is the selection target
-    # @return [Array<Mechanize::Page, Hash>] The resulting page and status
-    def handle_click(page, args)
-      target = args.shift
-      if target.is_a?(Array)
-        target = ScraperUtils::CycleUtils.pick(target, date: @replacements[:TODAY])
-      end
-      target = apply_replacements(target)
-      element = select_element(page, target)
-      if element.nil?
-        raise "Unable to find click target: #{target}"
-      end
-
-      result = { action: :click, target: target }
-      next_page = element.click
-      [next_page, result]
-    end
-
-    # Select an element on the page based on selector string
-    #
-    # @param page [Mechanize::Page] The page to search in
-    # @param selector_string [String] The selector string, optionally with "css:", "xpath:" or "text:" prefix
-    # @return [Mechanize::Element, nil] The selected element or nil if not found
-    def select_element(page, selector_string)
-      # Handle different selector types based on prefixes
-      if selector_string.start_with?("css:")
-        selector = selector_string.sub(/^css:/, '')
-        # We need to convert Nokogiri elements to Mechanize elements for clicking
-        css_element = page.at_css(selector)
-        return nil unless css_element
-
-        # If it's a link, find the matching Mechanize link
-        if css_element.name.downcase == 'a' && css_element['href']
-          return page.links.find { |link| link.href == css_element['href'] }
-        end
-
-        return css_element
-      elsif selector_string.start_with?("xpath:")
-        selector = selector_string.sub(/^xpath:/, '')
-        # We need to convert Nokogiri elements to Mechanize elements for clicking
-        xpath_element = page.at_xpath(selector)
-        return nil unless xpath_element
-
-        # If it's a link, find the matching Mechanize link
-        if xpath_element.name.downcase == 'a' && xpath_element['href']
-          return page.links.find { |link| link.href == xpath_element['href'] }
-        end
-
-        return xpath_element
-      else
-        # Default to text: for links
-        selector = selector_string.sub(/^text:/, '')
-        # Find links that include the text and don't have fragment-only hrefs
-        matching_links = page.links.select do |l|
-          l.text.include?(selector) &&
-            !(l.href.nil? || l.href.start_with?('#'))
-        end
-
-        if matching_links.empty?
-          # try case-insensitive
-          selector = selector.downcase
-          matching_links = page.links.select do |l|
-            l.text.downcase.include?(selector) &&
-              !(l.href.nil? || l.href.start_with?('#'))
-          end
-        end
-
-        # Get the link with a. the shortest (closest matching) text and then b. the longest href
-        matching_links.min_by { |l| [l.text.strip.length, -l.href.length] }
-      end
-    end
-
-    # Apply text replacements to a string
-    #
-    # @param text [String, Object] The text to process or object to return unchanged
-    # @return [String, Object] The processed text with replacements or original object
-    def apply_replacements(text)
-      result = text.to_s
-
-      @replacements.each do |key, value|
-        result = result.gsub(/\{#{key}\}/, value.to_s)
-      end
-      result
-    end
-  end
-end
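
Pulling the two `@example` blocks from the class documentation above together into one end-to-end sketch - the URL and link texts are placeholders:

```ruby
agent = ScraperUtils::MechanizeUtils.mechanize_agent
page = agent.get("https://example.com/search") # placeholder URL

processor = ScraperUtils::MechanizeActions.new(
  agent,
  { FROM_DATE: "2022-01-01", TO_DATE: "2022-03-01" }
)
result_page = processor.process(page, [
                                  [:click, "Advanced Search"], # placeholder link text
                                  [:click, "Search between {FROM_DATE} and {TO_DATE}"] # replacements applied
                                ])
processor.results # one { action: :click, target: ... } hash per action
```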