scraper_utils 0.8.2 → 0.8.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (38)
  1. checksums.yaml +4 -4
  2. data/.gitignore +39 -9
  3. data/CHANGELOG.md +34 -0
  4. data/README.md +7 -55
  5. data/docs/enhancing_specs.md +86 -47
  6. data/docs/example_custom_Rakefile +38 -0
  7. data/docs/example_dot_scraper_validation.yml +23 -0
  8. data/docs/mechanize_utilities.md +0 -3
  9. data/docs/testing_custom_scrapers.md +74 -0
  10. data/exe/validate_scraper_data +150 -0
  11. data/lib/scraper_utils/log_utils.rb +5 -5
  12. data/lib/scraper_utils/maths_utils.rb +23 -0
  13. data/lib/scraper_utils/mechanize_utils/agent_config.rb +9 -65
  14. data/lib/scraper_utils/mechanize_utils.rb +0 -2
  15. data/lib/scraper_utils/spec_support.rb +189 -6
  16. data/lib/scraper_utils/version.rb +1 -1
  17. data/lib/scraper_utils.rb +1 -5
  18. data/scraper_utils.gemspec +1 -0
  19. metadata +11 -24
  20. data/docs/example_scrape_with_fibers.rb +0 -31
  21. data/docs/fibers_and_threads.md +0 -72
  22. data/docs/interleaving_requests.md +0 -33
  23. data/docs/parallel_requests.md +0 -138
  24. data/docs/randomizing_requests.md +0 -38
  25. data/docs/reducing_server_load.md +0 -63
  26. data/lib/scraper_utils/cycle_utils.rb +0 -26
  27. data/lib/scraper_utils/date_range_utils.rb +0 -118
  28. data/lib/scraper_utils/mechanize_actions.rb +0 -183
  29. data/lib/scraper_utils/mechanize_utils/adaptive_delay.rb +0 -80
  30. data/lib/scraper_utils/mechanize_utils/robots_checker.rb +0 -151
  31. data/lib/scraper_utils/randomize_utils.rb +0 -37
  32. data/lib/scraper_utils/scheduler/constants.rb +0 -12
  33. data/lib/scraper_utils/scheduler/operation_registry.rb +0 -101
  34. data/lib/scraper_utils/scheduler/operation_worker.rb +0 -199
  35. data/lib/scraper_utils/scheduler/process_request.rb +0 -59
  36. data/lib/scraper_utils/scheduler/thread_request.rb +0 -51
  37. data/lib/scraper_utils/scheduler/thread_response.rb +0 -59
  38. data/lib/scraper_utils/scheduler.rb +0 -286
@@ -1,72 +0,0 @@
- Fibers and Threads
- ==================
-
- This sequence diagram supplements the notes on the {ScraperUtils::Scheduler} class and is intended to help show
- the passing of messages and control between the fibers and threads.
-
- * To keep things simple I have only shown the Fibers and Threads and not all the other calls like to the
-   OperationRegistry to look up the current operation, or OperationWorker etc.
- * There is ONE (global) response queue, which is monitored by the Scheduler.run_operations loop in the main Fiber
- * Each authority has ONE OperationWorker (not shown), which has ONE Fiber, ONE Thread, ONE request queue.
- * I use "◀─▶" to indicate a call and response, and "║" for which fiber / object is currently running.
-
- ```text
-
- SCHEDULER (Main Fiber)
- NxRegister-operation RESPONSE.Q
- ║──creates────────◀─▶┐
- ║ │ FIBER (runs block passed to register_operation)
- ║──creates──────────────────◀─▶┐ WORKER object and Registry
- ║──registers(fiber)───────────────────────▶┐ REQUEST-Q
- │ │ │ ║──creates─◀─▶┐ THREAD
- │ │ │ ║──creates───────────◀─▶┐
- ║◀─────────────────────────────────────────┘ ║◀──pop───║ ...[block waiting for request]
- ║ │ │ │ ║ │
- run_operations │ │ │ ║ │
- ║──pop(non block)─◀─▶│ │ │ ║ │ ...[no responses yet]
- ║ │ │ │ ║ │
- ║───resumes-next─"can_resume"─────────────▶┐ ║ │
- │ │ │ ║ ║ │
- │ │ ║◀──resume──┘ ║ │ ...[first Resume passes true]
- │ │ ║ │ ║ │ ...[initialise scraper]
- ```
- **REPEATS FROM HERE**
- ```text
- SCHEDULER RESPONSE.Q FIBER WORKER REQUEST.Q THREAD
- │ │ ║──request─▶┐ ║ │
- │ │ │ ║──push req ─▶║ │
- │ │ ║◀──────────┘ ║──req───▶║
- ║◀──yields control─(waiting)───┘ │ │ ║
- ║ │ │ │ │ ║ ...[Executes network I/O request]
- ║ │ │ │ │ ║
- ║───other-resumes... │ │ │ │ ║ ...[Other Workers will be resumed
- ║ │ │ │ │ ║ till most (~99%) are waiting on
- ║───lots of │ │ │ │ ║ responses from their threads
- ║ short sleeps ║◀──pushes response───────────────────────────┘
- ║ ║ │ │ ║◀──pop───║ ...[block waiting for request]
- ║──pop(response)──◀─▶║ │ │ ║ │
- ║ │ │ │ ║ │
- ║──saves─response───────────────────────◀─▶│ ║ │
- ║ │ │ │ ║ │
- ║───resumes-next─"can_resume"─────────────▶┐ ║ │
- │ │ │ ║ ║ │
- │ │ ║◀──resume──┘ ║ │ ...[Resume passes response]
- │ │ ║ │ ║ │
- │ │ ║ │ ║ │ ...[Process Response]
- ```
- **REPEATS TO HERE** - WHEN FIBER FINISHES, instead it:
- ```text
- SCHEDULER RESPONSE.Q FIBER WORKER REQUEST.Q THREAD
- │ │ ║ │ ║ │
- │ │ ║─deregister─▶║ ║ │
- │ │ │ ║──close───▶║ │
- │ │ │ ║ ║──nil───▶┐
- │ │ │ ║ │ ║ ... [thread exits]
- │ │ │ ║──join────────────◀─▶┘
- │ │ │ ║ ....... [worker removes
- │ │ │ ║ itself from registry]
- │ │ ║◀──returns───┘
- │◀──returns─nil────────────────┘
- │ │
- ```
- When the last fiber finishes and the registry is empty, then the response queue is also removed
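In scraper code, the sequence above is driven by two public calls; a minimal sketch (the authority labels are illustrative and the block body is elided):

```ruby
# Sketch only: register_operation creates one worker (fiber + thread + request
# queue) per authority, then run_operations resumes workers and drains the
# shared response queue until the registry is empty.
%w[authority_a authority_b].each do |label|
  ScraperUtils::Scheduler.register_operation(label) do
    # ... authority-specific scraping, issuing requests via
    # ScraperUtils::Scheduler.execute_request(agent, :get, [url]) ...
  end
end
ScraperUtils::Scheduler.run_operations
```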
@@ -1,33 +0,0 @@
- # Interleaving Requests with Scheduler
-
- The `ScraperUtils::Scheduler` provides a lightweight utility that:
-
- * Works on other authorities while in the delay period for an authority's next request
- * Optimizes the total scraper run time
- * Allows you to increase the random delay for authorities without undue effect on total run time
- * For the curious, it uses [ruby fibers](https://ruby-doc.org/core-2.5.8/Fiber.html) rather than threads as that is
-   a simpler system and thus easier to get right, understand and debug!
- * Cycles around the authorities when compliant_mode, max_load and random_delay are disabled
-
- ## Implementation
-
- To enable fiber scheduling, change your scrape method as per
- {example_scrape_with_fibers.rb example scrape with fibers}
-
- ## Logging with Scheduler
-
- Use {ScraperUtils::LogUtils.log} instead of `puts` when logging within the authority processing code.
- This will prefix the output lines with the authority name, which is needed since the system will interleave the work and
- thus the output.
-
- ## Testing Considerations
-
- This uses {ScraperUtils::RandomizeUtils} for determining the order of operations. Remember to add the following lines to
- `spec/spec_helper.rb`:
-
- ```ruby
- ScraperUtils::RandomizeUtils.random = false
- ScraperUtils::Scheduler.max_workers = 1
- ```
-
- For full details, see the {Scheduler}.
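The referenced example_scrape_with_fibers.rb is not included in this hunk; a rough sketch of the pattern it describes (scrape_authority is a placeholder for your per-authority code, not gem API):

```ruby
# Sketch only, based on the deleted documentation above: each authority's work
# runs in its own registered operation and logs via LogUtils.log so the
# interleaved output stays attributable.
def scrape(authorities, attempt)
  exceptions = {}
  authorities.each do |authority_label|
    ScraperUtils::Scheduler.register_operation(authority_label) do
      ScraperUtils::LogUtils.log "Collecting feed data for #{authority_label}, attempt: #{attempt}..."
      scrape_authority(authority_label) # placeholder for your existing scraping code
    rescue StandardError => e
      ScraperUtils::LogUtils.log "ERROR: #{e}"
      exceptions[authority_label] = e
    end
  end
  ScraperUtils::Scheduler.run_operations
  exceptions
end
```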
@@ -1,138 +0,0 @@
- Parallel Request Processing
- ===========================
-
- The ScraperUtils library provides a mechanism for executing network I/O requests in parallel using a thread for each
- operation worker, allowing the fiber to yield control so other fibers can continue whilst the thread processes the
- mechanize network I/O request.
-
- This can be disabled by setting `MORPH_DISABLE_THREADS` ENV var to a non-blank value.
-
- Overview
- --------
-
- When scraping multiple authority websites, around 99% of the time was spent waiting for network I/O. While the
- `Scheduler`
- efficiently interleaves fibers during delay periods, network I/O requests will still block a fiber until they
- complete.
-
- The `OperationWorker` optimizes this process by:
-
- 1. Executing mechanize network operations in parallel using a thread for each operation_worker and fiber
- 2. Allowing other fibers to continue working while waiting for thread responses
- 3. Integrating seamlessly with the existing `Scheduler`
-
- Usage
- -----
-
- ```ruby
- # In your authority scraper block
- ScraperUtils::Scheduler.register_operation("authority_name") do
-   # Instead of:
-   # page = agent.get(url)
-
-   # Use:
-   page = ScraperUtils::Scheduler.execute_request(agent, :get, [url])
-
-   # Process page as normal
-   process_page(page)
- end
- ```
-
- For testing purposes, you can also execute non-network operations:
-
- ```ruby
- # Create a test object
- test_object = Object.new
-
- def test_object.sleep_test(duration)
-   sleep(duration)
-   "Completed after #{duration} seconds"
- end
-
- # Queue a sleep command
- command = ScraperUtils::ProcessRequest.new(
-   "test_id",
-   test_object,
-   :sleep_test,
-   [0.5]
- )
-
- thread_scheduler.queue_request(command)
- ```
-
- Configuration
- -------------
-
- The following ENV variables affect how `Scheduler` is configured:
-
- * `MORPH_DISABLE_THREADS=1` disables the use of threads
- * `MORPH_MAX_WORKERS=N` configures the system to a max of N workers (minimum 1).
-   If N is 1 then this forces the system to process one authority at a time.
-
- Key Components
- --------------
-
- ### ThreadRequest
-
- A value object encapsulating a command to be executed:
-
- - External ID: Any value suitable as a hash key (String, Symbol, Integer, Object) that identifies the command
- - Subject: The object to call the method on
- - Method: The method to call on the subject
- - Args: Arguments to pass to the method
-
- ### ThreadResponse
-
- A value object encapsulating a response:
-
- - External ID: Matches the ID from the original command
- - Result: The result of the operation
- - Error: Any error that occurred
- - Time Taken: Execution time in seconds
-
- ### ThreadPool
-
- Manages a pool of threads that execute commands:
-
- - Processes commands from a queue
- - Returns responses with matching external IDs
- - Provides clear separation between I/O and scheduling
-
- Benefits
- --------
-
- 1. **Improved Throughput**: Process multiple operations simultaneously
- 2. **Reduced Total Runtime**: Make better use of wait time during network operations
- 3. **Optimal Resource Usage**: Efficiently balance CPU and network operations
- 4. **Better Geolocation Handling**: Distribute requests across proxies more efficiently
- 5. **Testability**: Execute non-network operations for testing concurrency
-
- Debugging
- ---------
-
- When debugging issues with parallel operations, use:
-
- ```shell
- # Set debug level to see request/response logging
- export DEBUG=2
- ```
-
- The system will log:
-
- - When commands are queued
- - When responses are received
- - How long each operation took
- - Any errors that occurred
-
- ## Implementation Details
-
- The integration between `Scheduler` and `ThreadPool` follows these principles:
-
- 1. `Scheduler` maintains ownership of all fiber scheduling
- 2. `ThreadPool` only knows about commands and responses
- 3. Communication happens via value objects with validation
- 4. State is managed in dedicated `FiberState` objects
- 5. Each component has a single responsibility
-
- This design provides a clean separation of concerns while enabling parallel operations within the existing fiber
- scheduling framework.
@@ -1,38 +0,0 @@
- Randomizing Requests
- ====================
-
- `ScraperUtils::RandomizeUtils` provides utilities for randomizing processing order in scrapers,
- which is helpful for distributing load and avoiding predictable patterns.
-
- Usage
- -----
-
- Pass a `Collection` or `Array` to `ScraperUtils::RandomizeUtils.randomize_order` to randomize it in production, but
- receive it as is when testing.
-
- ```ruby
- # Randomize a collection
- randomized_authorities = ScraperUtils::RandomizeUtils.randomize_order(authorities)
-
- # Use with a list of records from an index to randomize requests for details
- records.each do |record|
-   # Process record
- end
- ```
-
- Testing Configuration
- ---------------------
-
- Enforce sequential mode when testing by adding the following code to `spec/spec_helper.rb`:
-
- ```ruby
- ScraperUtils::RandomizeUtils.random = false
- ```
-
- Notes
- -----
-
- * You can also disable random mode by setting the env variable `MORPH_DISABLE_RANDOM` to `1` (or any non-blank value)
- * Testing using VCR requires random to be disabled
-
- For full details, see {ScraperUtils::RandomizeUtils Randomize Utils class documentation}
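A sketch of the detail-page pattern hinted at above (collect_index_records and record[:detail_url] are hypothetical placeholders, not gem API):

```ruby
# Sketch only: randomize the order of detail-page requests inside a scheduled
# operation so production runs are less predictable while specs stay stable.
ScraperUtils::Scheduler.register_operation("example_authority") do
  agent = ScraperUtils::MechanizeUtils.mechanize_agent
  records = collect_index_records(agent) # hypothetical index-scraping helper
  ScraperUtils::RandomizeUtils.randomize_order(records).each do |record|
    page = ScraperUtils::Scheduler.execute_request(agent, :get, [record[:detail_url]])
    # ... parse and save the detail page ...
  end
end
```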
@@ -1,63 +0,0 @@
- # Reducing Server Load
-
- This document explains various techniques for reducing load on the servers you're scraping.
-
- ## Intelligent Date Range Selection
-
- To further reduce server load and speed up scrapers, we provide an intelligent date range selection mechanism
- that can reduce server requests by 60% without significantly impacting delay in picking up changes.
-
- The `ScraperUtils::DateRangeUtils#calculate_date_ranges` method provides a smart approach to searching historical
- records:
-
- - Always checks the most recent 4 days daily (configurable)
- - Progressively reduces search frequency for older records
- - Uses a progression from every 2 days upwards to create efficient search intervals
- - Configurable `max_period` (default is 2 days)
- - Merges adjacent search ranges and handles the changeover in search frequency by extending some searches
-
- Example usage in your scraper:
-
- ```ruby
- date_ranges = ScraperUtils::DateRangeUtils.new.calculate_date_ranges
- date_ranges.each do |from_date, to_date, _debugging_comment|
-   # Adjust your normal search code to use this date range
-   your_search_records(from_date: from_date, to_date: to_date) do |record|
-     # process as normal
-   end
- end
- ```
-
- Typical server load compared to searching all days each time:
-
- * Max period 2 days : ~59% of the 33 days selected (default, alternates between 57% and 61% covered)
- * Max period 3 days : ~50% of the 33 days selected (varies much more - between 33 and 67%)
- * Max period 4 days : ~46% (more efficient if you search back 50 or more days, varies between 15 and 61%)
-
- See the [DateRangeUtils class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/DateRangeUtils) for customizing defaults and passing options.
-
- ## Cycle Utilities
-
- Simple utility for cycling through options based on Julian day number to reduce server load and make your scraper seem less bot-like.
-
- If the site uses tags like 'L28', 'L14' and 'L7' for the last 28, 14 and 7 days, an alternative solution
- is to cycle through ['L28', 'L7', 'L14', 'L7'] which would drop the load by 50% and be less bot-like.
-
- ```ruby
- # Toggle between main and alternate behaviour
- alternate = ScraperUtils::CycleUtils.position(2).even?
-
- # OR cycle through a list of values day by day:
- period = ScraperUtils::CycleUtils.pick(['L28', 'L7', 'L14', 'L7'])
-
- # Use with any cycle size
- pos = ScraperUtils::CycleUtils.position(7) # 0-6 cycle
-
- # Test with specific date
- pos = ScraperUtils::CycleUtils.position(3, date: Date.new(2024, 1, 5))
-
- # Override for testing
- # CYCLE_POSITION=2 bundle exec ruby scraper.rb
- ```
-
- For full details, see the [CycleUtils class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/CycleUtils).
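A sketch of feeding the day-cycled tag into an actual request (the URL and its period parameter are made up for illustration):

```ruby
# Sketch only: fetch the listing using whichever period tag today's cycle picks.
agent = ScraperUtils::MechanizeUtils.mechanize_agent
period = ScraperUtils::CycleUtils.pick(['L28', 'L7', 'L14', 'L7'])
page = ScraperUtils::Scheduler.execute_request(agent, :get, ["https://example.com/applications?period=#{period}"])
```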
@@ -1,26 +0,0 @@
- # frozen_string_literal: true
-
- module ScraperUtils
-   # Provides utilities for cycling through a range of options day by day
-   module CycleUtils
-     # Returns position in cycle from zero onwards
-     # @param cycle [Integer] Cycle size (2 onwards)
-     # @param date [Date, nil] Optional date to use instead of today
-     # @return [Integer] position in cycle progressing from zero to cycle-1 and then repeating day by day
-     # Can override using CYCLE_POSITION ENV variable
-     def self.position(cycle, date: nil)
-       day = ENV.fetch('CYCLE_POSITION', (date || Date.today).jd).to_i
-       day % cycle
-     end
-
-     # Returns one value per day, cycling through all possible values in order
-     # @param values [Array] Values to cycle through
-     # @param date [Date, nil] Optional date to use instead of today to calculate position
-     # @return value from array
-     # Can override using CYCLE_POSITION ENV variable
-     def self.pick(values, date: nil)
-       values = values.to_a
-       values[position(values.size, date: date)]
-     end
-   end
- end
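A minimal usage sketch of the module above; the behaviour follows directly from the source (CYCLE_POSITION, when set, replaces the Julian day number):

```ruby
require "date"
require "scraper_utils"

# Without CYCLE_POSITION set, position/pick advance one step per calendar day:
ScraperUtils::CycleUtils.position(2, date: Date.new(2024, 1, 5)) # => Date.new(2024, 1, 5).jd % 2
ScraperUtils::CycleUtils.pick(%w[L28 L7 L14 L7], date: Date.new(2024, 1, 5))

# With CYCLE_POSITION=2 in the environment the day number is overridden,
# so pick(%w[L28 L7 L14 L7]) always returns "L14" (index 2 % 4 = 2).
```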
@@ -1,118 +0,0 @@
- # frozen_string_literal: true
-
- module ScraperUtils
-   class DateRangeUtils
-     MERGE_ADJACENT_RANGES = true
-     PERIODS = [2, 3, 4].freeze
-
-     class << self
-       # @return [Integer] Default number of days to cover
-       attr_accessor :default_days
-
-       # @return [Integer] Default days to always include in ranges
-       attr_accessor :default_everytime
-
-       # @return [Integer, nil] Default max days between any one date being in a range
-       attr_accessor :default_max_period
-
-       # Configure default settings for all DateRangeUtils instances
-       # @yield [self] Yields self for configuration
-       # @example
-       #   AgentConfig.configure do |config|
-       #     config.default_everytime = 3
-       #     config.default_days = 35
-       #     config.default_max_period = 5
-       #   end
-       # @return [void]
-       def configure
-         yield self if block_given?
-       end
-
-       # Reset all configuration options to their default values
-       # @return [void]
-       def reset_defaults!
-         @default_days = ENV.fetch('MORPH_DAYS', 33).to_i # 33
-         @default_everytime = ENV.fetch('MORPH_EVERYTIME', 4).to_i # 4
-         @default_max_period = ENV.fetch('MORPH_MAX_PERIOD', 2).to_i # 3
-       end
-     end
-
-     # Set defaults on load
-     reset_defaults!
-
-     attr_reader :max_period_used
-     attr_reader :extended_max_period
-
-     # Generates one or more date ranges to check the most recent daily through to checking each max_period
-     # There is a graduated schedule from the latest `everytime` days through to the oldest of `days` dates which is checked each `max_period` days.
-     # @param days [Integer, nil] create ranges that cover the last `days` dates
-     # @param everytime [Integer, nil] Always include the latest `everytime` out of `days` dates (minimum 1)
-     # @param max_period [Integer, nil] the last `days` dates must be checked at least every `max_period` days (1..4)
-     # @param today [Date, nil] overrides the default determination of today at UTC+09:30 (middle of Australia)
-     # @return [Array{[Date, Date, String]}] being from_date, to_date and a comment
-     #
-     # Uses a Fibonacci sequence to create a natural progression of check frequencies.
-     # Newer data is checked more frequently, with periods between checks growing
-     # according to the Fibonacci sequence (2, 3, 5, 8, 13...) until reaching max_period.
-     # This creates an efficient schedule that mimics natural information decay patterns.
-     def calculate_date_ranges(days: nil, everytime: nil, max_period: nil, today: nil)
-       _calculate_date_ranges(
-         Integer(days || self.class.default_days),
-         [1, Integer(everytime || self.class.default_everytime)].max,
-         Integer(max_period || self.class.default_max_period),
-         today || Time.now(in: '+09:30').to_date
-       )
-     end
-
-     private
-
-     def _calculate_date_ranges(days, everytime, max_period, today)
-       @max_period_used = 1
-       to_date = today
-       valid_periods = PERIODS.select { |p| p <= max_period }
-       if !max_period.positive? || !days.positive?
-         return []
-       elsif valid_periods.empty? || everytime >= days
-         # cover everything everytime
-         return [[today + 1 - days, today, "everything"]]
-       end
-       max_period = valid_periods.max
-       @max_period_used = max_period
-
-       one_half = ((days - everytime) / 2).to_i
-       one_third = ((days - everytime) / 3).to_i
-       two_ninths = (2 * (days - everytime) / 9).to_i
-       run_ranges =
-         case max_period
-         when 2
-           [
-             [[to_date - (one_half + everytime), to_date, "#{max_period}#0+everytime"]],
-             [[to_date - days, to_date - (one_half + everytime), "#{max_period}#1"], [to_date - everytime, to_date, "everytime"]]
-           ]
-         when 3
-           [
-             [[to_date - days - 1, to_date + two_ninths - days, "3#0"], [to_date - (one_third + everytime), to_date, "2#0+everytime"]],
-             [[to_date + two_ninths - days, to_date + 2 * two_ninths - days, "3#1"], [to_date - everytime, to_date, "everytime"]],
-             [[to_date + 2 * two_ninths - days, to_date, "3#2+2#0+everytime"]],
-             [[to_date - days - 1, to_date + two_ninths - days, "3#3"], [to_date - everytime, to_date, "everytime"]],
-             [[to_date + two_ninths - days, to_date + 2 * two_ninths - days, "3#4"], [to_date - (one_third + everytime), to_date, "2#2+everytime"]],
-             [[to_date + 2 * two_ninths - days, to_date - (one_third + everytime), "3#5"], [to_date - everytime, to_date, "everytime"]]
-           ]
-         else
-           [
-             [[to_date - (one_half + everytime), to_date, "2#0+everytime"]],
-             [[to_date - days - 2, to_date - (one_half + everytime), "4#0"], [to_date - everytime, to_date, "everytime"]],
-             [[to_date - (one_half + everytime), to_date, "2#1+everytime"]],
-             [[to_date - everytime, to_date, "everytime"]]
-           ]
-         end
-       run_number = today.to_date.jd % run_ranges.size
-
-       ranges = run_ranges[run_number]
-       if days.positive? && ScraperUtils::DebugUtils.trace?
-         LogUtils.log "DEBUG: #{max_period} ranges: #{ranges.inspect}"
-       end
-       ranges
-     end
-   end
- end
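A short usage sketch of the class above (the configured values are illustrative only):

```ruby
require "date"
require "scraper_utils"

# Sketch only: tune the class-level defaults shown above, then generate ranges.
ScraperUtils::DateRangeUtils.configure do |config|
  config.default_days = 35       # cover the last 35 days
  config.default_everytime = 4   # always include the most recent 4 days
  config.default_max_period = 3  # check every date at least every 3 days
end

utils = ScraperUtils::DateRangeUtils.new
utils.calculate_date_ranges(today: Date.new(2024, 1, 5)).each do |from_date, to_date, comment|
  puts "search #{from_date}..#{to_date} (#{comment})"
end
puts "max period used: #{utils.max_period_used}"
```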
@@ -1,183 +0,0 @@
- # frozen_string_literal: true
-
- module ScraperUtils
-   # Class for executing a series of mechanize actions with flexible replacements
-   #
-   # @example Basic usage
-   #   agent = ScraperUtils::MechanizeUtils.mechanize_agent
-   #   page = agent.get("https://example.com")
-   #
-   #   actions = [
-   #     [:click, "Next Page"],
-   #     [:click, ["Option A", "xpath://div[@id='results']/a", "css:.some-button"]] # Will select one randomly
-   #   ]
-   #
-   #   processor = ScraperUtils::MechanizeActions.new(agent)
-   #   result_page = processor.process(page, actions)
-   #
-   # @example With replacements
-   #   replacements = { FROM_DATE: "2022-01-01", TO_DATE: "2022-03-01" }
-   #   processor = ScraperUtils::MechanizeActions.new(agent, replacements)
-   #
-   #   # Use replacements in actions
-   #   actions = [
-   #     [:click, "Search between {FROM_DATE} and {TO_DATE}"]
-   #   ]
-   class MechanizeActions
-     # @return [Mechanize] The mechanize agent used for actions
-     attr_reader :agent
-
-     # @return [Array] The results of each action performed
-     attr_reader :results
-
-     # Initialize a new MechanizeActions processor
-     #
-     # @param agent [Mechanize] The mechanize agent to use for actions
-     # @param replacements [Hash] Optional text replacements to apply to action parameters
-     def initialize(agent, replacements = {})
-       @agent = agent
-       @replacements = replacements || {}
-       @results = []
-     end
-
-     # Process a sequence of actions on a page
-     #
-     # @param page [Mechanize::Page] The starting page
-     # @param actions [Array<Array>] The sequence of actions to perform
-     # @return [Mechanize::Page] The resulting page after all actions
-     # @raise [ArgumentError] If an unknown action type is provided
-     #
-     # @example Action format
-     #   actions = [
-     #     [:click, "Link Text"], # Click on link with this text
-     #     [:click, ["Option A", "text:Option B"]], # Click on one of these options (randomly selected)
-     #     [:click, "css:.some-button"], # Use CSS selector
-     #     [:click, "xpath://div[@id='results']/a"], # Use XPath selector
-     #     [:block, ->(page, args, agent, results) { [page, { custom_results: 'data' }] }] # Custom block
-     #   ]
-     def process(page, actions)
-       @results = []
-       current_page = page
-
-       actions.each do |action|
-         args = action.dup
-         action_type = args.shift
-         current_page, result =
-           case action_type
-           when :click
-             handle_click(current_page, args)
-           when :block
-             handle_block(current_page, args)
-           else
-             raise ArgumentError, "Unknown action type: #{action_type}"
-           end
-
-         @results << result
-       end
-
-       current_page
-     end
-
-     private
-
-     # Process a block action
-     #
-     # @param page [Mechanize::Page] The current page
-     # @param args [Array] The block and its arguments
-     # @return [Array<Mechanize::Page, Hash>] The resulting page and status
-     def handle_block(page, args)
-       block = args.shift
-       # Apply replacements to all remaining arguments
-       processed_args = args.map { |arg| apply_replacements(arg) }
-       block.call(page, processed_args.first, agent, @results.dup)
-     end
-
-     # Handle a click action
-     #
-     # @param page [Mechanize::Page] The current page
-     # @param args [Array] The first element is the selection target
-     # @return [Array<Mechanize::Page, Hash>] The resulting page and status
-     def handle_click(page, args)
-       target = args.shift
-       if target.is_a?(Array)
-         target = ScraperUtils::CycleUtils.pick(target, date: @replacements[:TODAY])
-       end
-       target = apply_replacements(target)
-       element = select_element(page, target)
-       if element.nil?
-         raise "Unable to find click target: #{target}"
-       end
-
-       result = { action: :click, target: target }
-       next_page = element.click
-       [next_page, result]
-     end
-
-     # Select an element on the page based on selector string
-     #
-     # @param page [Mechanize::Page] The page to search in
-     # @param selector_string [String] The selector string, optionally with "css:", "xpath:" or "text:" prefix
-     # @return [Mechanize::Element, nil] The selected element or nil if not found
-     def select_element(page, selector_string)
-       # Handle different selector types based on prefixes
-       if selector_string.start_with?("css:")
-         selector = selector_string.sub(/^css:/, '')
-         # We need to convert Nokogiri elements to Mechanize elements for clicking
-         css_element = page.at_css(selector)
-         return nil unless css_element
-
-         # If it's a link, find the matching Mechanize link
-         if css_element.name.downcase == 'a' && css_element['href']
-           return page.links.find { |link| link.href == css_element['href'] }
-         end
-
-         return css_element
-       elsif selector_string.start_with?("xpath:")
-         selector = selector_string.sub(/^xpath:/, '')
-         # We need to convert Nokogiri elements to Mechanize elements for clicking
-         xpath_element = page.at_xpath(selector)
-         return nil unless xpath_element
-
-         # If it's a link, find the matching Mechanize link
-         if xpath_element.name.downcase == 'a' && xpath_element['href']
-           return page.links.find { |link| link.href == xpath_element['href'] }
-         end
-
-         return xpath_element
-       else
-         # Default to text: for links
-         selector = selector_string.sub(/^text:/, '')
-         # Find links that include the text and don't have fragment-only hrefs
-         matching_links = page.links.select do |l|
-           l.text.include?(selector) &&
-             !(l.href.nil? || l.href.start_with?('#'))
-         end
-
-         if matching_links.empty?
-           # try case-insensitive
-           selector = selector.downcase
-           matching_links = page.links.select do |l|
-             l.text.downcase.include?(selector) &&
-               !(l.href.nil? || l.href.start_with?('#'))
-           end
-         end
-
-         # Get the link with the a. shortest (closest matching) text and then b. the longest href
-         matching_links.min_by { |l| [l.text.strip.length, -l.href.length] }
-       end
-     end
-
-     # Apply text replacements to a string
-     #
-     # @param text [String, Object] The text to process or object to return unchanged
-     # @return [String, Object] The processed text with replacements or original object
-     def apply_replacements(text)
-       result = text.to_s
-
-       @replacements.each do |key, value|
-         result = result.gsub(/\{#{key}\}/, value.to_s)
-       end
-       result
-     end
-   end
- end
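A usage sketch combining the pieces above (the URL, link texts and dates are illustrative, not part of any real site):

```ruby
require "scraper_utils"

# Sketch only: drive a search flow with MechanizeActions and replacements.
agent = ScraperUtils::MechanizeUtils.mechanize_agent
page = agent.get("https://example.com/search") # illustrative URL

replacements = { FROM_DATE: "2024-01-01", TO_DATE: "2024-01-31" }
processor = ScraperUtils::MechanizeActions.new(agent, replacements)

actions = [
  [:click, "Advanced Search"],                          # text selector (the default)
  [:click, "Search between {FROM_DATE} and {TO_DATE}"], # replacements applied before matching
  [:block, ->(current_page, _args, _agent, _results) { [current_page, { links: current_page.links.size }] }]
]

result_page = processor.process(page, actions)
puts processor.results.inspect
puts result_page.uri
```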