scraper_utils 0.4.2 → 0.5.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +4 -0
- data/README.md +50 -286
- data/docs/debugging.md +50 -0
- data/docs/getting_started.md +145 -0
- data/docs/interleaving_requests.md +61 -0
- data/docs/mechanize_utilities.md +92 -0
- data/docs/randomizing_requests.md +34 -0
- data/docs/reducing_server_load.md +63 -0
- data/lib/scraper_utils/cycle_utils.rb +1 -0
- data/lib/scraper_utils/mechanize_actions.rb +154 -0
- data/lib/scraper_utils/version.rb +1 -1
- data/lib/scraper_utils.rb +1 -0
- data/scraper_utils.gemspec +5 -4
- metadata +13 -4
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 63a24c24b497494b79c4d7e12f04a1bd2555068f37f50389f3906c0033817d7e
+  data.tar.gz: 6d6b96112dc3e2f9dc5a54de6318a544c240c0e3d5246ab4178c07346d0de7dc
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: eda8d10d996d51b7ef1d2610e21da31390c10dd29f4daa70bd5d9c3c8dc6eb9bed651803ccd6a59f53b03dae4fcd1ea016802e693f8828f4a13b92e07a0b046e
+  data.tar.gz: eba2704a99c6599a2789ec573fa335d7939a63d0c27b06886d6e905cd785e2095d7d0307e7aa1195a1209e022340fa5d027a72ccca61a350590058e998355d5d
data/CHANGELOG.md
CHANGED
data/README.md
CHANGED
@@ -3,8 +3,6 @@ ScraperUtils (Ruby)
 
 Utilities to help make planningalerts scrapers, especially multis, easier to develop, run and debug.
 
-WARNING: This is still under development! Breaking changes may occur in version 0.x!
-
 For Server Administrators
 -------------------------
 
@@ -18,331 +16,97 @@ To control our access:
 
 - Add a section for our user agent: `User-agent: ScraperUtils` (default)
 - Set a crawl delay, eg: `Crawl-delay: 20`
-- If needed specify disallowed paths
+- If needed specify disallowed paths: `Disallow: /private/`
 
-###
+### We play nice with your servers
 
-
+Our goal is to access public planning information with minimal impact on your services. The following features are on by
+default:
 
 - **Identify themselves**: Our user agent clearly indicates who we are and provides a link to the project repository:
   `Mozilla/5.0 (compatible; ScraperUtils/0.2.0 2025-02-22; +https://github.com/ianheggie-oaf/scraper_utils)`
 
-- **Limit server load**:
-
-
-  In the default "compliant mode" this defaults to a max load of 20% and is capped at 33%.
-
-- **Add randomized delays**: We add random delays between requests to further reduce our impact on servers, which should
-  bring us down to the load of a single industrious person.
-
-Extra utilities provided for scrapers to further reduce your server load:
+- **Limit server load**:
+  - We wait double your response time before making another request to avoid being a significant load on your server
+  - We also randomly add extra delays to give your server a chance to catch up with background tasks
 
-
+We also provide scraper developers other features to reduce overall load as well.
 
-
-
-  replaces the simplistic check of the last 30 days each day.
+For Scraper Developers
+----------------------
 
-
-
+We provide utilities to make developing, running and debugging your scraper easier in addition to the base utilities
+mentioned above.
 
-
+## Installation & Configuration
 
-
-------------
-
-Add these line to your application's Gemfile:
+Add to your [scraper's](https://www.planningalerts.org.au/how_to_write_a_scraper) Gemfile:
 
 ```ruby
 gem "scraperwiki", git: "https://github.com/openaustralia/scraperwiki-ruby.git", branch: "morph_defaults"
 gem 'scraper_utils'
 ```
 
-
-
-$ bundle
-
-Usage
------
-
-### Ruby Versions
-
-This gem is designed to be compatible the latest ruby supported by morph.io - other versions may work, but not tested:
-
-* ruby 3.2.2 - requires the `platform` file to contain `heroku_18` in the scraper
-* ruby 2.5.8 - `heroku_16` (the default)
-
-### Environment variables
-
-#### `MORPH_AUSTRALIAN_PROXY`
-
-On morph.io set the environment variable `MORPH_AUSTRALIAN_PROXY` to
-`http://morph:password@au.proxy.oaf.org.au:8888`
-replacing password with the real password.
-Alternatively enter your own AUSTRALIAN proxy details when testing.
-
-#### `MORPH_EXPECT_BAD`
-
-To avoid morph complaining about sites that are known to be bad,
-but you want them to keep being tested, list them on `MORPH_EXPECT_BAD`, for example:
-
-#### `MORPH_AUTHORITIES`
-
-Optionally filter authorities for multi authority scrapers
-via environment variable in morph > scraper > settings or
-in your dev environment:
-
-```bash
-export MORPH_AUTHORITIES=noosa,wagga
-```
-
-#### `DEBUG`
-
-Optionally enable verbose debugging messages when developing:
-
-```bash
-export DEBUG=1
-```
-
-### Extra Mechanize options
+For detailed setup and configuration options, see the [Getting Started guide](docs/getting_started.md).
 
-
+## Key Features
 
-
-* `australian_proxy: true` - Use the proxy url in the `MORPH_AUSTRALIAN_PROXY` env variable if the site is geo-locked
-* `disable_ssl_certificate_check: true` - Disabled SSL verification for old / incorrect certificates
+### Well-Behaved Web Client
 
-
+- Configure Mechanize agents with sensible defaults
+- Automatic rate limiting based on server response times
+- Supports robots.txt and crawl-delay directives
+- Supports extra actions required to get to results page
+- [Learn more about Mechanize utilities](docs/mechanize_utilities.md)
 
-
-`ScraperUtils::MechanizeUtils.mechanize_agent(client_options || {})`
-to receive a `Mechanize::Agent` configured accordingly.
+### Optimize Server Load
 
-
+- Intelligent date range selection (reduce server load by up to 60%)
+- Cycle utilities for rotating search parameters
+- [Learn more about reducing server load](docs/reducing_server_load.md)
 
-###
+### Improve Scraper Efficiency
 
-
-
+- Interleave requests to optimize run time
+  - [Learn more about interleaving requests](docs/interleaving_requests.md)
+- Randomize processing order for more natural request patterns
+  - [Learn more about randomizing requests](docs/randomizing_requests.md)
 
-
+### Error Handling & Quality Monitoring
 
-
-
-
-  config.default_compliant_mode = ENV.fetch('MORPH_NOT_COMPLIANT', nil).to_s.empty? # true
-  config.default_random_delay = ENV.fetch('MORPH_RANDOM_DELAY', 5).to_i # 5
-  config.default_max_load = ENV.fetch('MORPH_MAX_LOAD', 33.3).to_f # 33.3
-  config.default_disable_ssl_certificate_check = !ENV.fetch('MORPH_DISABLE_SSL_CHECK', nil).to_s.empty? # false
-  config.default_australian_proxy = !ENV.fetch('MORPH_USE_PROXY', nil).to_s.empty? # false
-  config.default_user_agent = ENV.fetch('MORPH_USER_AGENT', nil) # Uses Mechanize user agent
-end
-```
+- Record-level error handling with appropriate thresholds
+- Data quality monitoring during scraping
+- Detailed logging and reporting
 
-
-agents created by `ScraperUtils::MechanizeUtils.mechanize_agent` unless overridden by passing parameters to that method.
+### Developer Tools
 
-
+- Enhanced debugging utilities
+- Simple logging with authority context
+- [Learn more about debugging](docs/debugging.md)
 
-
-ScraperUtils::MechanizeUtils::AgentConfig.configure do |config|
-  config.default_random_delay = nil
-  config.default_max_load = 33
-end
-```
+## API Documentation
 
-
+Complete API documentation is available at [RubyDoc.info](https://rubydoc.info/gems/scraper_utils).
 
-
+## Ruby Versions
 
-
-  record.
-Then just before you would normally yield a record for saving, rescue that exception and:
-
-* Call `ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)`
-* NOT yield the record for saving
-
-In your code update where create a mechanize agent (often `YourScraper.scrape_period`) and the `AUTHORITIES` hash
-to move Mechanize agent options (like `australian_proxy` and `timeout`) to a hash under a new key: `client_options`.
-For example:
-
-```ruby
-require "scraper_utils"
-#...
-module YourScraper
-  # ... some code ...
-
-  # Note the extra parameter: client_options
-  def self.scrape_period(url:, period:, webguest: "P1.WEBGUEST",
-                         client_options: {}
-                        )
-    agent = ScraperUtils::MechanizeUtils.mechanize_agent(**client_options)
-
-    # ... rest of code ...
-  end
-
-  # ... rest of code ...
-end
-```
+This gem is designed to be compatible with Ruby versions supported by morph.io:
 
-
-
-The following code will cause debugging info to be output:
-
-```bash
-export DEBUG=1
-```
-
-Add the following immediately before requesting or examining pages
-
-```ruby
-require 'scraper_utils'
-
-# Debug an HTTP request
-ScraperUtils::DebugUtils.debug_request(
-  "GET",
-  "https://example.com/planning-apps",
-  parameters: { year: 2023 },
-  headers: { "Accept" => "application/json" }
-)
-
-# Debug a web page
-ScraperUtils::DebugUtils.debug_page(page, "Checking search results page")
-
-# Debug a specific page selector
-ScraperUtils::DebugUtils.debug_selector(page, '.results-table', "Looking for development applications")
-```
+* Ruby 3.2.2 - requires the `platform` file to contain `heroku_18` in the scraper
+* Ruby 2.5.8 - `heroku_16` (the default)
 
-
----------------------
-
-The `ScraperUtils::FiberScheduler` provides a lightweight utility that:
-
-* works on the other authorities whilst in the delay period for an authorities next request
-* thus optimizing the total scraper run time
-* allows you to increase the random delay for authorities without undue effect on total run time
-* For the curious, it uses [ruby fibers](https://ruby-doc.org/core-2.5.8/Fiber.html) rather than threads as that is
-  a simpler system and thus easier to get right, understand and debug!
-* Cycles around the authorities when compliant_mode, max_load and random_delay are disabled
-
-To enable change the scrape method to be like [example scrape method using fibers](docs/example_scrape_with_fibers.rb)
-
-And use `ScraperUtils::FiberScheduler.log` instead of `puts` when logging within the authority processing code.
-This will prefix the output lines with the authority name, which is needed since the system will interleave the work and
-thus the output.
-
-This uses `ScraperUtils::RandomizeUtils` as described below. Remember to add the recommended line to
-`spec/spec_heper.rb`.
-
-Intelligent Date Range Selection
---------------------------------
-
-To further reduce server load and speed up scrapers, we provide an intelligent date range selection mechanism
-that can reduce server requests by 60% without significantly impacting delay in picking up changes.
-
-The `ScraperUtils::DateRangeUtils#calculate_date_ranges` method provides a smart approach to searching historical
-records:
-
-- Always checks the most recent 4 days daily (configurable)
-- Progressively reduces search frequency for older records
-- Uses a Fibonacci-like progression to create natural, efficient search intervals
-- Configurable `max_period` (default is 3 days)
-- merges adjacent search ranges and handles the changeover in search frequency by extending some searches
-
-Example usage in your scraper:
-
-```ruby
-date_ranges = ScraperUtils::DateRangeUtils.new.calculate_date_ranges
-date_ranges.each do |from_date, to_date, _debugging_comment|
-  # Adjust your normal search code to use for this date range
-  your_search_records(from_date: from_date, to_date: to_date) do |record|
-    # process as normal
-  end
-end
-```
-
-Typical server load reductions:
-
-* Max period 2 days : ~42% of the 33 days selected
-* Max period 3 days : ~37% of the 33 days selected (default)
-* Max period 5 days : ~35% (or ~31% when days = 45)
-
-See the class documentation for customizing defaults and passing options.
-
-### Other possibilities
-
-If the site uses tags like 'L28', 'L14' and 'L7' for the last 28, 14 and 7 days, an alternative solution
-is to cycle through ['L28', 'L7', 'L14', 'L7'] which would drop the load by 50% and be less Bot like.
-
-Cycle Utils
------------
-Simple utility for cycling through options based on Julian day number:
-
-```ruby
-# Toggle between main and alternate behaviour
-alternate = ScraperUtils::CycleUtils.position(2).even?
-
-# OR cycle through a list of values day by day:
-period = ScraperUtils::CycleUtils.pick(['L28', 'L7', 'L14', 'L7'])
-
-# Use with any cycle size
-pos = ScraperUtils::CycleUtils.position(7) # 0-6 cycle
-
-# Test with specific date
-pos = ScraperUtils::CycleUtils.position(3, date: Date.new(2024, 1, 5))
-
-# Override for testing
-# CYCLE_POSITION=2 bundle exec ruby scraper.rb
-```
-
-Randomizing Requests
---------------------
-
-Pass a `Collection` or `Array` to `ScraperUtils::RandomizeUtils.randomize_order` to randomize it in production, but
-receive in as is when testing.
-
-Use this with the list of records scraped from an index to randomise any requests for further information to be less Bot
-like.
-
-### Spec setup
-
-You should enforce sequential mode when testing by adding the following code to `spec/spec_helper.rb` :
-
-```
-ScraperUtils::RandomizeUtils.sequential = true
-```
-
-Note:
-
-* You can also force sequential mode by setting the env variable `MORPH_PROCESS_SEQUENTIALLY` to `1` (any non blank)
-* testing using VCR requires sequential mode
-
-Development
------------
+## Development
 
 After checking out the repo, run `bin/setup` to install dependencies.
 Then, run `rake test` to run the tests.
 
-You can also run `bin/console` for an interactive prompt that will allow you to experiment.
-
 To install this gem onto your local machine, run `bundle exec rake install`.
 
-
-then run `bundle exec rake release`,
-which will create a git tag for the version, push git commits and tags, and push the `.gem` file
-to [rubygems.org](https://rubygems.org).
-
-NOTE: You need to use ruby 3.2.2 instead of 2.5.8 to release to OTP protected accounts.
-
-Contributing
-------------
+## Contributing
 
-Bug reports and pull requests with working tests are welcome
+Bug reports and pull requests with working tests are welcome
+on [GitHub](https://github.com/ianheggie-oaf/scraper_utils).
 
-
-
-License
--------
+## License
 
 The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
-
data/docs/debugging.md
ADDED
@@ -0,0 +1,50 @@
+# Debugging Techniques
+
+ScraperUtils provides several debugging utilities to help you troubleshoot your scrapers.
+
+## Enabling Debug Mode
+
+Set the `DEBUG` environment variable to enable debugging:
+
+```bash
+export DEBUG=1  # Basic debugging
+export DEBUG=2  # Verbose debugging
+export DEBUG=3  # Trace debugging with detailed content
+```
+
+## Debug Utilities
+
+The `ScraperUtils::DebugUtils` module provides several methods for debugging:
+
+```ruby
+# Debug an HTTP request
+ScraperUtils::DebugUtils.debug_request(
+  "GET",
+  "https://example.com/planning-apps",
+  parameters: { year: 2023 },
+  headers: { "Accept" => "application/json" }
+)
+
+# Debug a web page
+ScraperUtils::DebugUtils.debug_page(page, "Checking search results page")
+
+# Debug a specific page selector
+ScraperUtils::DebugUtils.debug_selector(page, '.results-table', "Looking for development applications")
+```
+
+## Debug Level Constants
+
+- `DISABLED_LEVEL = 0`: Debugging disabled
+- `BASIC_LEVEL = 1`: Basic debugging information
+- `VERBOSE_LEVEL = 2`: Verbose debugging information
+- `TRACE_LEVEL = 3`: Detailed tracing information
+
+## Helper Methods
+
+- `debug_level`: Get the current debug level
+- `debug?(level)`: Check if debugging is enabled at the specified level
+- `basic?`: Check if basic debugging is enabled
+- `verbose?`: Check if verbose debugging is enabled
+- `trace?`: Check if trace debugging is enabled
+
+For full details, see the [DebugUtils class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/DebugUtils).
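The helper methods listed above can be used to guard expensive diagnostic output in your own scraper code. A minimal sketch follows; the `records` array and the messages are illustrative, not part of the gem:

```ruby
require "scraper_utils"

# Only build the expensive summary string when verbose debugging is enabled
if ScraperUtils::DebugUtils.verbose?
  puts "Index page yielded #{records.size} records: #{records.map { |r| r['council_reference'] }.inspect}"
end

# Or test an explicit level using the constants listed above
if ScraperUtils::DebugUtils.debug?(ScraperUtils::DebugUtils::TRACE_LEVEL)
  puts "Full record dump: #{records.inspect}"
end
```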
data/docs/getting_started.md
ADDED
@@ -0,0 +1,145 @@
+# Getting Started with ScraperUtils
+
+This guide will help you get started with ScraperUtils for your PlanningAlerts scraper.
+
+## Installation
+
+Add these lines to your [scraper's](https://www.planningalerts.org.au/how_to_write_a_scraper) Gemfile:
+
+```ruby
+# Below:
+gem "scraperwiki", git: "https://github.com/openaustralia/scraperwiki-ruby.git", branch: "morph_defaults"
+
+# Add:
+gem 'scraper_utils'
+```
+
+And then execute:
+
+```bash
+bundle install
+```
+
+## Environment Variables
+
+### `MORPH_AUSTRALIAN_PROXY`
+
+On morph.io set the environment variable `MORPH_AUSTRALIAN_PROXY` to
+`http://morph:password@au.proxy.oaf.org.au:8888`
+replacing password with the real password.
+Alternatively enter your own AUSTRALIAN proxy details when testing.
+
+### `MORPH_EXPECT_BAD`
+
+To avoid morph complaining about sites that are known to be bad,
+but you want them to keep being tested, list them on `MORPH_EXPECT_BAD`, for example:
+
+### `MORPH_AUTHORITIES`
+
+Optionally filter authorities for multi authority scrapers
+via environment variable in morph > scraper > settings or
+in your dev environment:
+
+```bash
+export MORPH_AUTHORITIES=noosa,wagga
+```
+
+### `DEBUG`
+
+Optionally enable verbose debugging messages when developing:
+
+```bash
+export DEBUG=1 # for basic, or 2 for verbose or 3 for tracing nearly everything
+```
+
+## Example Scraper Implementation
+
+Update your `scraper.rb` as follows:
+
+```ruby
+#!/usr/bin/env ruby
+# frozen_string_literal: true
+
+$LOAD_PATH << "./lib"
+
+require "scraper_utils"
+require "your_scraper"
+
+# Main Scraper class
+class Scraper
+  AUTHORITIES = YourScraper::AUTHORITIES
+
+  def scrape(authorities, attempt)
+    exceptions = {}
+    authorities.each do |authority_label|
+      puts "\nCollecting feed data for #{authority_label}, attempt: #{attempt}..."
+
+      begin
+        ScraperUtils::DataQualityMonitor.start_authority(authority_label)
+        YourScraper.scrape(authority_label) do |record|
+          begin
+            record["authority_label"] = authority_label.to_s
+            ScraperUtils::DbUtils.save_record(record)
+          rescue ScraperUtils::UnprocessableRecord => e
+            ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)
+            exceptions[authority_label] = e
+          end
+        end
+      rescue StandardError => e
+        warn "#{authority_label}: ERROR: #{e}"
+        warn e.backtrace
+        exceptions[authority_label] = e
+      end
+    end
+
+    exceptions
+  end
+
+  def self.selected_authorities
+    ScraperUtils::AuthorityUtils.selected_authorities(AUTHORITIES.keys)
+  end
+
+  def self.run(authorities)
+    puts "Scraping authorities: #{authorities.join(', ')}"
+    start_time = Time.now
+    exceptions = new.scrape(authorities, 1)
+    ScraperUtils::LogUtils.log_scraping_run(
+      start_time,
+      1,
+      authorities,
+      exceptions
+    )
+
+    unless exceptions.empty?
+      puts "\n***************************************************"
+      puts "Now retrying authorities which earlier had failures"
+      puts exceptions.keys.join(", ").to_s
+      puts "***************************************************"
+
+      start_time = Time.now
+      exceptions = new.scrape(exceptions.keys, 2)
+      ScraperUtils::LogUtils.log_scraping_run(
+        start_time,
+        2,
+        authorities,
+        exceptions
+      )
+    end
+
+    ScraperUtils::LogUtils.report_on_results(authorities, exceptions)
+  end
+end
+
+if __FILE__ == $PROGRAM_NAME
+  ENV["MORPH_EXPECT_BAD"] ||= "wagga"
+  Scraper.run(Scraper.selected_authorities)
+end
+```
+
+For more advanced implementations, see the [Interleaving Requests documentation](interleaving_requests.md).
+
+## Next Steps
+
+- [Reducing Server Load](reducing_server_load.md)
+- [Mechanize Utilities](mechanize_utilities.md)
+- [Debugging](debugging.md)
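The example `scraper.rb` above requires a `your_scraper.rb` that exposes an `AUTHORITIES` hash and a `scrape(authority_label)` method yielding one hash per application. A hypothetical skeleton of that file is sketched below; the authority names, URLs, `client_options` values and record field values are illustrative only:

```ruby
# lib/your_scraper.rb - hypothetical skeleton assumed by the scraper.rb above
require "date"
require "scraper_utils"

module YourScraper
  AUTHORITIES = {
    noosa: { url: "https://example.com/planning-applications" },
    wagga: { url: "https://example.com/da-search",
             client_options: { australian_proxy: true, timeout: 60 } }
  }.freeze

  def self.scrape(authority_label)
    authority = AUTHORITIES[authority_label]
    # Pass any per-authority client_options through to the configured agent
    agent = ScraperUtils::MechanizeUtils.mechanize_agent(**(authority[:client_options] || {}))
    page = agent.get(authority[:url])

    # Parse the index page (parsing code omitted) and yield one hash per record
    yield(
      "council_reference" => "DA-2025-0001",
      "address" => "1 Example Street, Example QLD 4567",
      "description" => "Example development application",
      "info_url" => authority[:url],
      "date_scraped" => Date.today.to_s
    )
  end
end
```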
data/docs/interleaving_requests.md
ADDED
@@ -0,0 +1,61 @@
+# Interleaving Requests with FiberScheduler
+
+The `ScraperUtils::FiberScheduler` provides a lightweight utility that:
+
+* Works on other authorities while in the delay period for an authority's next request
+* Optimizes the total scraper run time
+* Allows you to increase the random delay for authorities without undue effect on total run time
+* For the curious, it uses [ruby fibers](https://ruby-doc.org/core-2.5.8/Fiber.html) rather than threads as that is
+  a simpler system and thus easier to get right, understand and debug!
+* Cycles around the authorities when compliant_mode, max_load and random_delay are disabled
+
+## Implementation
+
+To enable fiber scheduling, change your scrape method to follow this pattern:
+
+```ruby
+def scrape(authorities, attempt)
+  ScraperUtils::FiberScheduler.reset!
+  exceptions = {}
+  authorities.each do |authority_label|
+    ScraperUtils::FiberScheduler.register_operation(authority_label) do
+      ScraperUtils::FiberScheduler.log(
+        "Collecting feed data for #{authority_label}, attempt: #{attempt}..."
+      )
+      ScraperUtils::DataQualityMonitor.start_authority(authority_label)
+      YourScraper.scrape(authority_label) do |record|
+        record["authority_label"] = authority_label.to_s
+        ScraperUtils::DbUtils.save_record(record)
+      rescue ScraperUtils::UnprocessableRecord => e
+        ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)
+        exceptions[authority_label] = e
+        # Continues processing other records
+      end
+    rescue StandardError => e
+      warn "#{authority_label}: ERROR: #{e}"
+      warn e.backtrace || "No backtrace available"
+      exceptions[authority_label] = e
+    end
+    # end of register_operation block
+  end
+  ScraperUtils::FiberScheduler.run_all
+  exceptions
+end
+```
+
+## Logging with FiberScheduler
+
+Use `ScraperUtils::FiberScheduler.log` instead of `puts` when logging within the authority processing code.
+This will prefix the output lines with the authority name, which is needed since the system will interleave the work and
+thus the output.
+
+## Testing Considerations
+
+This uses `ScraperUtils::RandomizeUtils` for determining the order of operations. Remember to add the following line to
+`spec/spec_helper.rb`:
+
+```ruby
+ScraperUtils::RandomizeUtils.sequential = true
+```
+
+For full details, see the [FiberScheduler class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/FiberScheduler).
data/docs/mechanize_utilities.md
ADDED
@@ -0,0 +1,92 @@
+# Mechanize Utilities
+
+This document provides detailed information about the Mechanize utilities provided by ScraperUtils.
+
+## MechanizeUtils
+
+The `ScraperUtils::MechanizeUtils` module provides utilities for configuring and using Mechanize for web scraping.
+
+### Creating a Mechanize Agent
+
+```ruby
+agent = ScraperUtils::MechanizeUtils.mechanize_agent(**options)
+```
+
+### Configuration Options
+
+Add `client_options` to your AUTHORITIES configuration and move any of the following settings into it:
+
+* `timeout: Integer` - Timeout for agent connections in case the server is slower than normal
+* `australian_proxy: true` - Use the proxy url in the `MORPH_AUSTRALIAN_PROXY` env variable if the site is geo-locked
+* `disable_ssl_certificate_check: true` - Disables SSL verification for old / incorrect certificates
+
+Then adjust your code to accept `client_options` and pass them through to:
+`ScraperUtils::MechanizeUtils.mechanize_agent(client_options || {})`
+to receive a `Mechanize::Agent` configured accordingly.
+
+The agent returned is configured using Mechanize hooks to implement the desired delays automatically.
+
+### Default Configuration
+
+By default, the Mechanize agent is configured with the following settings.
+As you can see, the defaults can be changed using env variables.
+
+Note - compliant mode forces max_load to be set to a value no greater than 50.
+
+```ruby
+ScraperUtils::MechanizeUtils::AgentConfig.configure do |config|
+  config.default_timeout = ENV.fetch('MORPH_TIMEOUT', 60).to_i # 60
+  config.default_compliant_mode = ENV.fetch('MORPH_NOT_COMPLIANT', nil).to_s.empty? # true
+  config.default_random_delay = ENV.fetch('MORPH_RANDOM_DELAY', 5).to_i # 5
+  config.default_max_load = ENV.fetch('MORPH_MAX_LOAD', 33.3).to_f # 33.3
+  config.default_disable_ssl_certificate_check = !ENV.fetch('MORPH_DISABLE_SSL_CHECK', nil).to_s.empty? # false
+  config.default_australian_proxy = !ENV.fetch('MORPH_USE_PROXY', nil).to_s.empty? # false
+  config.default_user_agent = ENV.fetch('MORPH_USER_AGENT', nil) # Uses Mechanize user agent
+end
+```
+
+For full details, see the [MechanizeUtils class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/MechanizeUtils).
+
+## MechanizeActions
+
+The `ScraperUtils::MechanizeActions` class provides a convenient way to execute a series of actions (like clicking links, filling forms) on a Mechanize page.
+
+### Action Format
+
+```ruby
+actions = [
+  [:click, "Find an application"],
+  [:click, ["Submitted Last 28 Days", "Submitted Last 7 Days"]],
+  [:block, ->(page, args, agent, results) { [new_page, result_data] }]
+]
+
+processor = ScraperUtils::MechanizeActions.new(agent)
+result_page = processor.process(page, actions)
+```
+
+### Supported Actions
+
+- `:click` - Clicks on a link or element matching the provided selector
+- `:block` - Executes a custom block of code for complex scenarios
+
+### Selector Types
+
+- Text selector (default): `"Find an application"`
+- CSS selector: `"css:.button"`
+- XPath selector: `"xpath://a[@class='button']"`
+
+### Replacements
+
+You can use replacements in your action parameters:
+
+```ruby
+replacements = { FROM_DATE: "2022-01-01", TO_DATE: "2022-03-01" }
+processor = ScraperUtils::MechanizeActions.new(agent, replacements)
+
+# Use replacements in actions
+actions = [
+  [:click, "Search between {FROM_DATE} and {TO_DATE}"]
+]
+```
+
+For full details, see the [MechanizeActions class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/MechanizeActions).
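The `:block` action format above uses placeholder names (`new_page`, `result_data`). A concrete sketch of a block that fills and submits a search form is shown below; the form name, field names and dates are illustrative, and a block must return a `[next_page, result_data]` pair:

```ruby
# Custom :block action: fill a hypothetical search form and submit it
submit_search = lambda do |page, _args, _agent, _results|
  form = page.form_with(name: "searchForm")  # hypothetical form name
  form["dateFrom"] = "2025-01-01"
  form["dateTo"]   = "2025-01-31"
  [form.submit, { action: :block, form: "searchForm" }]
end

actions = [
  [:click, "Find an application"],
  [:block, submit_search]
]

processor = ScraperUtils::MechanizeActions.new(agent)
result_page = processor.process(page, actions)
```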
data/docs/randomizing_requests.md
ADDED
@@ -0,0 +1,34 @@
+# Randomizing Requests
+
+`ScraperUtils::RandomizeUtils` provides utilities for randomizing processing order in scrapers,
+which is helpful for distributing load and avoiding predictable patterns.
+
+## Basic Usage
+
+Pass a `Collection` or `Array` to `ScraperUtils::RandomizeUtils.randomize_order` to randomize it in production, but
+receive it as is when testing.
+
+```ruby
+# Randomize a collection
+randomized_authorities = ScraperUtils::RandomizeUtils.randomize_order(authorities)
+
+# Use with a list of records from an index to randomize requests for details
+records.each do |record|
+  # Process record
+end
+```
+
+## Testing Configuration
+
+Enforce sequential mode when testing by adding the following code to `spec/spec_helper.rb`:
+
+```ruby
+ScraperUtils::RandomizeUtils.sequential = true
+```
+
+## Notes
+
+* You can also force sequential mode by setting the env variable `MORPH_PROCESS_SEQUENTIALLY` to `1` (any non-blank value)
+* Testing using VCR requires sequential mode
+
+For full details, see the [RandomizeUtils class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/RandomizeUtils).
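The prose above suggests applying this to the list of index records before fetching detail pages; a small sketch of that pattern follows, where `scrape_index_page` and the `info_url` field are illustrative names, not part of the gem:

```ruby
records = scrape_index_page(page)  # hypothetical: returns an Array of record hashes

ScraperUtils::RandomizeUtils.randomize_order(records).each do |record|
  detail_page = agent.get(record["info_url"])
  # parse detail_page and save the full record as normal ...
end
```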
data/docs/reducing_server_load.md
ADDED
@@ -0,0 +1,63 @@
+# Reducing Server Load
+
+This document explains various techniques for reducing load on the servers you're scraping.
+
+## Intelligent Date Range Selection
+
+To further reduce server load and speed up scrapers, we provide an intelligent date range selection mechanism
+that can reduce server requests by 60% without significantly impacting delay in picking up changes.
+
+The `ScraperUtils::DateRangeUtils#calculate_date_ranges` method provides a smart approach to searching historical
+records:
+
+- Always checks the most recent 4 days daily (configurable)
+- Progressively reduces search frequency for older records
+- Uses a Fibonacci-like progression to create natural, efficient search intervals
+- Configurable `max_period` (default is 3 days)
+- Merges adjacent search ranges and handles the changeover in search frequency by extending some searches
+
+Example usage in your scraper:
+
+```ruby
+date_ranges = ScraperUtils::DateRangeUtils.new.calculate_date_ranges
+date_ranges.each do |from_date, to_date, _debugging_comment|
+  # Adjust your normal search code to use for this date range
+  your_search_records(from_date: from_date, to_date: to_date) do |record|
+    # process as normal
+  end
+end
+```
+
+Typical server load reductions:
+
+* Max period 2 days : ~42% of the 33 days selected
+* Max period 3 days : ~37% of the 33 days selected (default)
+* Max period 5 days : ~35% (or ~31% when days = 45)
+
+See the [DateRangeUtils class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/DateRangeUtils) for customizing defaults and passing options.
+
+## Cycle Utilities
+
+Simple utility for cycling through options based on Julian day number to reduce server load and make your scraper seem less bot-like.
+
+If the site uses tags like 'L28', 'L14' and 'L7' for the last 28, 14 and 7 days, an alternative solution
+is to cycle through ['L28', 'L7', 'L14', 'L7'] which would drop the load by 50% and be less bot-like.
+
+```ruby
+# Toggle between main and alternate behaviour
+alternate = ScraperUtils::CycleUtils.position(2).even?
+
+# OR cycle through a list of values day by day:
+period = ScraperUtils::CycleUtils.pick(['L28', 'L7', 'L14', 'L7'])
+
+# Use with any cycle size
+pos = ScraperUtils::CycleUtils.position(7) # 0-6 cycle
+
+# Test with specific date
+pos = ScraperUtils::CycleUtils.position(3, date: Date.new(2024, 1, 5))
+
+# Override for testing
+# CYCLE_POSITION=2 bundle exec ruby scraper.rb
+```
+
+For full details, see the [CycleUtils class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/CycleUtils).
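The date ranges above can be combined with the `{FROM_DATE}`/`{TO_DATE}` replacements described in [Mechanize Utilities](mechanize_utilities.md). A sketch of that combination follows; `agent`, `index_page` and the search link text are illustrative, not part of the gem:

```ruby
date_ranges = ScraperUtils::DateRangeUtils.new.calculate_date_ranges
date_ranges.each do |from_date, to_date, _comment|
  replacements = { FROM_DATE: from_date.to_s, TO_DATE: to_date.to_s }
  processor = ScraperUtils::MechanizeActions.new(agent, replacements)
  results_page = processor.process(index_page, [
    [:click, "Search from {FROM_DATE} to {TO_DATE}"]
  ])
  # parse results_page for this date range as normal ...
end
```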
data/lib/scraper_utils/mechanize_actions.rb
ADDED
@@ -0,0 +1,154 @@
+# frozen_string_literal: true
+
+module ScraperUtils
+  # Class for executing a series of mechanize actions with flexible replacements
+  #
+  # @example Basic usage
+  #   agent = ScraperUtils::MechanizeUtils.mechanize_agent
+  #   page = agent.get("https://example.com")
+  #
+  #   actions = [
+  #     [:click, "Next Page"],
+  #     [:click, ["Option A", "Option B"]] # Will select one randomly
+  #   ]
+  #
+  #   processor = ScraperUtils::MechanizeActions.new(agent)
+  #   result_page = processor.process(page, actions)
+  #
+  # @example With replacements
+  #   replacements = { FROM_DATE: "2022-01-01", TO_DATE: "2022-03-01" }
+  #   processor = ScraperUtils::MechanizeActions.new(agent, replacements)
+  #
+  #   # Use replacements in actions
+  #   actions = [
+  #     [:click, "Search between {FROM_DATE} and {TO_DATE}"]
+  #   ]
+  class MechanizeActions
+    # @return [Mechanize] The mechanize agent used for actions
+    attr_reader :agent
+
+    # @return [Array] The results of each action performed
+    attr_reader :results
+
+    # Initialize a new MechanizeActions processor
+    #
+    # @param agent [Mechanize] The mechanize agent to use for actions
+    # @param replacements [Hash] Optional text replacements to apply to action parameters
+    def initialize(agent, replacements = {})
+      @agent = agent
+      @replacements = replacements || {}
+      @results = []
+    end
+
+    # Process a sequence of actions on a page
+    #
+    # @param page [Mechanize::Page] The starting page
+    # @param actions [Array<Array>] The sequence of actions to perform
+    # @return [Mechanize::Page] The resulting page after all actions
+    # @raise [ArgumentError] If an unknown action type is provided
+    #
+    # @example Action format
+    #   actions = [
+    #     [:click, "Link Text"], # Click on link with this text
+    #     [:click, ["Option A", "Option B"]], # Click on one of these options (randomly selected)
+    #     [:click, "css:.some-button"], # Use CSS selector
+    #     [:click, "xpath://div[@id='results']/a"], # Use XPath selector
+    #     [:block, ->(page, args, agent, results) { [page, { custom_results: 'data' }] }] # Custom block
+    #   ]
+    def process(page, actions)
+      @results = []
+      current_page = page
+
+      actions.each do |action|
+        args = action.dup
+        action_type = args.shift
+        current_page, result =
+          case action_type
+          when :click
+            handle_click(current_page, args)
+          when :block
+            block = args.shift
+            block.call(current_page, args, agent, @results.dup)
+          else
+            raise ArgumentError, "Unknown action type: #{action_type}"
+          end
+
+        @results << result
+      end
+
+      current_page
+    end
+
+    private
+
+    # Handle a click action
+    #
+    # @param page [Mechanize::Page] The current page
+    # @param args [Array] The first element is the selection target
+    # @return [Array<Mechanize::Page, Hash>] The resulting page and status
+    def handle_click(page, args)
+      target = args.shift
+      if target.is_a?(Array)
+        target = ScraperUtils::CycleUtils.pick(target, date: @replacements[:TODAY])
+      end
+      target = apply_replacements(target)
+      element = select_element(page, target)
+      if element.nil?
+        raise "Unable to find click target: #{target}"
+      end
+
+      result = { action: :click, target: target }
+      next_page = element.click
+      [next_page, result]
+    end
+
+    # Select an element on the page based on selector string
+    #
+    # @param page [Mechanize::Page] The page to search in
+    # @param selector_string [String] The selector string
+    # @return [Mechanize::Element, nil] The selected element or nil if not found
+    def select_element(page, selector_string)
+      # Handle different selector types based on prefixes
+      if selector_string.start_with?("css:")
+        selector = selector_string.sub(/^css:/, '')
+        page.at_css(selector)
+      elsif selector_string.start_with?("xpath:")
+        selector = selector_string.sub(/^xpath:/, '')
+        page.at_xpath(selector)
+      else
+        # Default to text: for links
+        selector = selector_string.sub(/^text:/, '')
+        # Find links that include the text and don't have fragment-only hrefs
+        matching_links = page.links.select do |l|
+          l.text.include?(selector) &&
+            !(l.href.nil? || l.href.start_with?('#'))
+        end
+
+        if matching_links.empty?
+          # try case-insensitive
+          selector = selector.downcase
+          matching_links = page.links.select do |l|
+            l.text.downcase.include?(selector) &&
+              !(l.href.nil? || l.href.start_with?('#'))
+          end
+        end
+
+        # Get the link with the shortest (closest matching) text then the longest href
+        matching_links.min_by { |l| [l.text.strip.length, -l.href.length] }
+      end
+    end
+
+    # Apply text replacements to a string
+    #
+    # @param text [String, Object] The text to process or object to return unchanged
+    # @return [String, Object] The processed text with replacements or original object
+    def apply_replacements(text)
+      result = text.to_s
+
+      @replacements.each do |key, value|
+        result = result.gsub(/\{#{key}\}/, value.to_s)
+      end
+      result
+    end
+  end
+end
data/lib/scraper_utils.rb
CHANGED
@@ -9,6 +9,7 @@ require "scraper_utils/db_utils"
 require "scraper_utils/debug_utils"
 require "scraper_utils/fiber_scheduler"
 require "scraper_utils/log_utils"
+require "scraper_utils/mechanize_actions"
 require "scraper_utils/mechanize_utils/agent_config"
 require "scraper_utils/mechanize_utils"
 require "scraper_utils/randomize_utils"
data/scraper_utils.gemspec
CHANGED
@@ -13,8 +13,8 @@ Gem::Specification.new do |spec|
 
   spec.summary = "planningalerts scraper utilities"
   spec.description = "Utilities to help make planningalerts scrapers, " \
-
-  spec.homepage = "https://github.com/ianheggie-oaf
+                     "especially multi authority scrapers, easier to develop, run and debug."
+  spec.homepage = "https://github.com/ianheggie-oaf/#{spec.name}"
   spec.license = "MIT"
 
   if spec.respond_to?(:metadata)
@@ -22,10 +22,11 @@ Gem::Specification.new do |spec|
 
     spec.metadata["homepage_uri"] = spec.homepage
     spec.metadata["source_code_uri"] = spec.homepage
-
+    spec.metadata["documentation_uri"] = "https://rubydoc.info/gems/#{spec.name}/#{ScraperUtils::VERSION}"
+    spec.metadata["changelog_uri"] = "#{spec.metadata["source_code_uri"]}/blob/main/CHANGELOG.md"
  else
    raise "RubyGems 2.0 or newer is required to protect against " \
-
+          "public gem pushes."
  end
 
  # Specify which files should be added to the gem when it is released.
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: scraper_utils
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.5.0
 platform: ruby
 authors:
 - Ian Heggie
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2025-03-
+date: 2025-03-05 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: mechanize
@@ -52,8 +52,8 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '0'
-description: Utilities to help make planningalerts scrapers,
-  to develop, run and debug.
+description: Utilities to help make planningalerts scrapers, especially multi authority
+  scrapers, easier to develop, run and debug.
 email:
 - ian@heggie.biz
 executables: []
@@ -74,8 +74,14 @@ files:
 - SPECS.md
 - bin/console
 - bin/setup
+- docs/debugging.md
 - docs/example_scrape_with_fibers.rb
 - docs/example_scraper.rb
+- docs/getting_started.md
+- docs/interleaving_requests.md
+- docs/mechanize_utilities.md
+- docs/randomizing_requests.md
+- docs/reducing_server_load.md
 - lib/scraper_utils.rb
 - lib/scraper_utils/adaptive_delay.rb
 - lib/scraper_utils/authority_utils.rb
@@ -86,6 +92,7 @@ files:
 - lib/scraper_utils/debug_utils.rb
 - lib/scraper_utils/fiber_scheduler.rb
 - lib/scraper_utils/log_utils.rb
+- lib/scraper_utils/mechanize_actions.rb
 - lib/scraper_utils/mechanize_utils.rb
 - lib/scraper_utils/mechanize_utils/agent_config.rb
 - lib/scraper_utils/randomize_utils.rb
@@ -99,6 +106,8 @@ metadata:
   allowed_push_host: https://rubygems.org
   homepage_uri: https://github.com/ianheggie-oaf/scraper_utils
   source_code_uri: https://github.com/ianheggie-oaf/scraper_utils
+  documentation_uri: https://rubydoc.info/gems/scraper_utils/0.5.0
+  changelog_uri: https://github.com/ianheggie-oaf/scraper_utils/blob/main/CHANGELOG.md
   rubygems_mfa_required: 'true'
 post_install_message:
 rdoc_options: []