scraper_utils 0.1.0 → 0.3.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.gitignore +3 -0
- data/.rubocop.yml +5 -8
- data/CHANGELOG.md +14 -0
- data/GUIDELINES.md +75 -0
- data/Gemfile +6 -3
- data/IMPLEMENTATION.md +33 -0
- data/README.md +226 -177
- data/SPECS.md +25 -0
- data/bin/console +1 -0
- data/bin/setup +2 -1
- data/docs/example_scrape_with_fibers.rb +31 -0
- data/docs/example_scraper.rb +93 -0
- data/lib/scraper_utils/adaptive_delay.rb +70 -0
- data/lib/scraper_utils/authority_utils.rb +2 -2
- data/lib/scraper_utils/data_quality_monitor.rb +64 -0
- data/lib/scraper_utils/date_range_utils.rb +159 -0
- data/lib/scraper_utils/db_utils.rb +1 -2
- data/lib/scraper_utils/debug_utils.rb +63 -23
- data/lib/scraper_utils/fiber_scheduler.rb +229 -0
- data/lib/scraper_utils/log_utils.rb +58 -25
- data/lib/scraper_utils/mechanize_utils/agent_config.rb +276 -0
- data/lib/scraper_utils/mechanize_utils.rb +32 -30
- data/lib/scraper_utils/randomize_utils.rb +34 -0
- data/lib/scraper_utils/robots_checker.rb +149 -0
- data/lib/scraper_utils/version.rb +1 -1
- data/lib/scraper_utils.rb +6 -10
- data/scraper_utils.gemspec +3 -8
- metadata +17 -74
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: a2082c406ad96266f644fc1dfc588046aa43494e9e79b8fec38fe59252c09f06
+  data.tar.gz: 5100907cdcc8c55ddd59b25cf393d212f681b573ef874c5a6c65f748a8c852ec
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: feabea23ac5b14f6b642769db303bcab954b22cfd2d95d7694af51afd661c73f1ff70a0c17b54557a730fe811275c126c0fdd4473e2ae6d2b83cfc495aa52bc6
+  data.tar.gz: d91154c0dbccfb4271fd4830a82316676c1ad5665773c30d212bcc1ab6178a09d4fee97b2b2a32480c32b8ee69da08fc9a78575460d49f8233fb91b74bf7df66
data/.gitignore
CHANGED
data/.rubocop.yml
CHANGED
@@ -1,12 +1,9 @@
+plugins:
+  - rubocop-rake
+  - rubocop-rspec
+
 AllCops:
-  Exclude:
-    - bin/*
-    # This is a temporary dumping ground for authority specific
-    # code that we're probably just initially copying across from
-    # other scrapers. So, we don't care about the formatting and style
-    # initially.
-    # TODO: Remove this once we've removed all the code from here
-    - lib/technology_one_scraper/authority/*
+  NewCops: enable
 
 # Bumping max line length to something a little more reasonable
 Layout/LineLength:
data/CHANGELOG.md
ADDED
@@ -0,0 +1,14 @@
+# Changelog
+
+
+## 0.2.1 - 2025-02-28
+
+Fixed broken v0.2.0
+
+## 0.2.0 - 2025-02-28
+
+Added FiberScheduler, enabled compliant mode with delays by default and simplified usage, removing the third retry without proxy
+
+## 0.1.0 - 2025-02-23
+
+First release for development
data/GUIDELINES.md
ADDED
@@ -0,0 +1,75 @@
+# Project-Specific Guidelines
+
+These project-specific guidelines supplement the general project instructions.
+Ask for clarification of any apparent conflicts with SPECS, IMPLEMENTATION or project instructions.
+
+## Error Handling Approaches
+
+Process each authority's site in isolation - problems with one authority are irrelevant to others.
+
+* we do a 2nd attempt of authorities with the same proxy settings
+* and a 3rd attempt for those that failed with the proxy, but with the proxy disabled
+
+Within a proxy attempt, distinguish between:
+
+* Errors that are specific to that record
+  * only allow 5 such errors plus 10% of successfully processed records
+  * these could be regarded as not worth retrying (we currently do)
+* Any other exceptions stop the processing of that authority's site
+
+### Fail Fast on deeper calls
+
+- Raise exceptions early (when they are clearly detectable)
+- input validation according to scraper specs
+
+### Be forgiving on things that don't matter
+
+- not all sites have robots.txt, and not all robots.txt files are well formatted, therefore stop processing the file on obvious conflicts with the specs,
+  but if the file is bad, just treat it as missing.
+
+- don't fuss over things we are not going to record.
+
+- we do detect maintenance pages because that is a helpful and simple clue that we won't find the data, and we can just wait till the site is back online
+
+## Type Checking
+
+### Partial Duck Typing
+- Focus on behavior over types internally (.rubocop.yml should disable requiring everything to be typed)
+- Runtime validation of values
+- document the public API though
+- Use @params and @returns comments to document types for external uses of the public methods (RubyMine will use these for checking)
+
+## Input Validation
+
+### Early Validation
+- Check all inputs at system boundaries
+- Fail on any invalid input
+
+## Testing Strategies
+
+* Avoid mocking unless really needed, instead
+  * instantiate a real object to use in the test
+  * use mocking facilities provided by the gem (eg Mechanize, Aws etc)
+  * use integration tests with WebMock for simple external sites or VCR for more complex ones.
+* Testing the integration all the way through is just as important as the specific algorithms
+* Consider using single-responsibility classes / methods to make testing simpler, but don't make things more complex just to be testable
+* If necessary, expose internal values as read-only attributes as needed for testing,
+  for example adding a read-only attribute to mechanize agent instances with the values calculated internally
+
+### Behavior-Driven Development (BDD)
+- Focus on behavior specifications
+- User-centric scenarios
+- Best for: User-facing features
+
+## Documentation Approaches
+
+### Just-Enough Documentation
+- Focus on key decisions
+- Document non-obvious choices
+- Best for: Rapid development, internal tools
+
+## Logging Philosophy
+
+### Minimal Logging
+- Log only key events (key means down to adding a record)
+- Focus on errors
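The "5 plus 10% of successful records" budget above is what the old `scraper.rb` example removed later in this diff implemented with a running counter. A minimal sketch of that counter logic, for reference (the `records` collection and `save_record` helper are assumed; the counter name follows the removed example):

```ruby
# Sketch of the unprocessable-record budget: start 5 "in credit",
# each unprocessable record costs 1, each good record earns back 0.1 (the 10%).
too_many_unprocessable = -5.0

records.each do |record|
  begin
    save_record(record) # your normal validation / save
    too_many_unprocessable -= 0.1
  rescue ScraperUtils::UnprocessableRecord
    too_many_unprocessable += 1
    raise "Too many unprocessable records" if too_many_unprocessable.positive?
  end
end
```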
data/Gemfile
CHANGED
@@ -22,12 +22,15 @@ gem "sqlite3", platform && (platform == :heroku16 ? "~> 1.4.0" : "~> 1.6.3")
 gem "scraperwiki", git: "https://github.com/openaustralia/scraperwiki-ruby.git",
                    branch: "morph_defaults"
 
-# development and test
+# development and test gems
 gem "rake", platform && (platform == :heroku16 ? "~> 12.3.3" : "~> 13.0")
 gem "rspec", platform && (platform == :heroku16 ? "~> 3.9.0" : "~> 3.12")
-gem "rubocop", platform && (platform == :heroku16 ? "~>
+gem "rubocop", platform && (platform == :heroku16 ? "~> 1.28.2" : "~> 1.73")
+gem "rubocop-rake", platform && (platform == :heroku16 ? "~> 0.6.0" : "~> 0.7")
+gem "rubocop-rspec", platform && (platform == :heroku16 ? "~> 2.10.0" : "~> 3.5")
 gem "simplecov", platform && (platform == :heroku16 ? "~> 0.18.0" : "~> 0.22.0")
-
+gem "simplecov-console"
+gem "terminal-table"
 gem "webmock", platform && (platform == :heroku16 ? "~> 3.14.0" : "~> 3.19.0")
 
 gemspec
data/IMPLEMENTATION.md
ADDED
@@ -0,0 +1,33 @@
+IMPLEMENTATION
+==============
+
+Document decisions on how we are implementing the specs, to be consistent and save time.
+Things we MUST do go in SPECS.
+Choices between a number of valid possibilities go here.
+Once made, these choices should only be changed after careful consideration.
+
+ASK for clarification of any apparent conflicts with SPECS, GUIDELINES or project instructions.
+
+## Debugging
+
+Output debugging messages if ENV['DEBUG'] is set, for example:
+
+```ruby
+puts "Pre Connect request: #{request.inspect}" if ENV["DEBUG"]
+```
+
+## Robots.txt Handling
+
+- Used as a "good citizen" mechanism for respecting site preferences
+- Graceful fallback (to permitted) if robots.txt is unavailable or invalid
+- Match `/^User-agent:\s*ScraperUtils/i` for our specific user agent section
+- If there is a line matching `/^Disallow:\s*\//` then we are disallowed
+- Check for `/^Crawl-delay:\s*(\d[.0-9]*)/` to extract the delay
+- If no crawl-delay is found in that section, then check the default `/^User-agent:\s*\*/` section
+- This is a deliberate, significant simplification of the robots.txt specification in RFC 9309.
+
+## Method Organization
+
+- Externalize configuration to improve testability
+- Keep shared logic in the main class
+- Decisions / information specific to just one class can be documented there, otherwise it belongs here
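A minimal Ruby sketch of the simplified robots.txt rules listed above (illustrative only - this is not the gem's `RobotsChecker` implementation, and the method name and section-splitting approach are assumptions):

```ruby
# Returns [disallowed, crawl_delay] for the ScraperUtils user agent,
# following the simplified rules described above.
def check_robots_txt(robots_txt)
  # Split into per user-agent sections (the zero-width lookahead keeps each header line)
  sections = robots_txt.split(/^(?=User-agent:)/i)
  ours = sections.find { |s| s =~ /^User-agent:\s*ScraperUtils/i }
  default = sections.find { |s| s =~ /^User-agent:\s*\*/ }

  disallowed = !!(ours && ours =~ /^Disallow:\s*\//)
  # Prefer a Crawl-delay in our section, falling back to the default (*) section
  delay = ours && ours[/^Crawl-delay:\s*(\d[.0-9]*)/, 1]
  delay ||= default && default[/^Crawl-delay:\s*(\d[.0-9]*)/, 1]

  [disallowed, delay && delay.to_f]
end
```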
data/README.md
CHANGED
@@ -1,11 +1,53 @@
 ScraperUtils (Ruby)
 ===================
 
-Utilities to help make planningalerts scrapers, especially multis easier to develop, run and debug.
+Utilities to help make planningalerts scrapers, especially multis, easier to develop, run and debug.
 
-WARNING: This is still under development! Breaking changes may occur in version 0!
+WARNING: This is still under development! Breaking changes may occur in version 0.x!
 
-
+For Server Administrators
+-------------------------
+
+The ScraperUtils library is designed to be a respectful citizen of the web. If you're a server administrator and notice
+our scraper accessing your systems, here's what you should know:
+
+### How to Control Our Behavior
+
+Our scraper utilities respect the standard server **robots.txt** control mechanisms (by default).
+To control our access:
+
+- Add a section for our user agent: `User-agent: ScraperUtils` (default)
+- Set a crawl delay, eg: `Crawl-delay: 20`
+- If needed, specify disallowed paths: `Disallow: /private/`
+
+### Built-in Politeness Features
+
+Even without specific configuration, our scrapers will, by default:
+
+- **Identify themselves**: Our user agent clearly indicates who we are and provides a link to the project repository:
+  `Mozilla/5.0 (compatible; ScraperUtils/0.2.0 2025-02-22; +https://github.com/ianheggie-oaf/scraper_utils)`
+
+- **Limit server load**: We slow down our requests so we should never be a significant load to your server, let alone
+  overload it.
+  The slower your server is running, the longer the delay we add between requests to help.
+  In the default "compliant mode" this defaults to a max load of 20% and is capped at 33%.
+
+- **Add randomized delays**: We add random delays between requests to further reduce our impact on servers, which should
+  bring us
+  down to the load of a single industrious person.
+
+Extra utilities provided for scrapers to further reduce your server load:
+
+- **Interleave requests**: This spreads out the requests to your server rather than focusing on one scraper at a time.
+
+- **Intelligent Date Range selection**: This reduces server load by over 60% by a smarter choice of date range searches,
+  checking the recent 4 days each day and reducing down to checking each 3 days by the end of the 33-day mark. This
+  replaces the simplistic check of the last 30 days each day.
+
+Our goal is to access public planning information without negatively impacting your services.
+
+Installation
+------------
 
 Add these lines to your application's Gemfile:
 
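Putting those directives together, a robots.txt entry for our user agent could look like the following (the disallowed path is purely illustrative):

```
User-agent: ScraperUtils
Crawl-delay: 20
Disallow: /private/
```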
@@ -18,211 +60,133 @@ And then execute:
 
     $ bundle
 
-
+Usage
+-----
+
+### Ruby Versions
 
-
+This gem is designed to be compatible with the latest ruby supported by morph.io - other versions may work, but are not tested:
 
-
+* ruby 3.2.2 - requires the `platform` file to contain `heroku_18` in the scraper
+* ruby 2.5.8 - `heroku_16` (the default)
 
 ### Environment variables
 
-
+#### `MORPH_AUSTRALIAN_PROXY`
+
+On morph.io set the environment variable `MORPH_AUSTRALIAN_PROXY` to
+`http://morph:password@au.proxy.oaf.org.au:8888`
+replacing password with the real password.
+Alternatively enter your own AUSTRALIAN proxy details when testing.
+
+#### `MORPH_EXPECT_BAD`
+
+To avoid morph complaining about sites that are known to be bad,
+but you want them to keep being tested, list them on `MORPH_EXPECT_BAD`, for example:
+
+#### `MORPH_AUTHORITIES`
+
+Optionally filter authorities for multi authority scrapers
+via environment variable in morph > scraper > settings or
 in your dev environment:
 
 ```bash
 export MORPH_AUTHORITIES=noosa,wagga
 ```
 
-
+#### `DEBUG`
 
-
+Optionally enable verbose debugging messages when developing:
 
-```
-
-
+```bash
+export DEBUG=1
+```
 
-
+### Extra Mechanize options
 
-
-require "technology_one_scraper"
-
-# Main Scraper class
-class Scraper
-  AUTHORITIES = TechnologyOneScraper::AUTHORITIES
-
-  def self.scrape(authorities, attempt)
-    results = {}
-    authorities.each do |authority_label|
-      these_results = results[authority_label] = {}
-      begin
-        records_scraped = 0
-        unprocessable_records = 0
-        # Allow 5 + 10% unprocessable records
-        too_many_unprocessable = -5.0
-        use_proxy = AUTHORITIES[authority_label][:australian_proxy] && ScraperUtils.australian_proxy
-        next if attempt > 2 && !use_proxy
-
-        puts "",
-             "Collecting feed data for #{authority_label}, attempt: #{attempt}" \
-             "#{use_proxy ? ' (via proxy)' : ''} ..."
-        # Change scrape to accept a use_proxy flag and return an unprocessable flag
-        # it should rescue ScraperUtils::UnprocessableRecord thrown deeper in the scraping code and
-        # set unprocessable
-        TechnologyOneScraper.scrape(use_proxy, authority_label) do |record, unprocessable|
-          unless unprocessable
-            begin
-              record["authority_label"] = authority_label.to_s
-              ScraperUtils::DbUtils.save_record(record)
-            rescue ScraperUtils::UnprocessableRecord => e
-              # validation error
-              unprocessable = true
-              these_results[:error] = e
-            end
-          end
-          if unprocessable
-            unprocessable_records += 1
-            these_results[:unprocessable_records] = unprocessable_records
-            too_many_unprocessable += 1
-            raise "Too many unprocessable records" if too_many_unprocessable.positive?
-          else
-            records_scraped += 1
-            these_results[:records_scraped] = records_scraped
-            too_many_unprocessable -= 0.1
-          end
-        end
-      rescue StandardError => e
-        warn "#{authority_label}: ERROR: #{e}"
-        warn e.backtrace || "No backtrace available"
-        these_results[:error] = e
-      end
-    end
-    results
-  end
+Add `client_options` to your AUTHORITIES configuration and move any of the following settings into it:
 
-
-
-
+* `timeout: Integer` - Timeout for agent connections in case the server is slower than normal
+* `australian_proxy: true` - Use the proxy url in the `MORPH_AUSTRALIAN_PROXY` env variable if the site is geo-locked
+* `disable_ssl_certificate_check: true` - Disables SSL verification for old / incorrect certificates
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-      start_retry,
-      2,
-      retry_errors,
-      retry_results
-    )
-
-    retry_results.each do |auth, result|
-      unless result[:error] && !result[:error].is_a?(ScraperUtils::UnprocessableRecord)
-        results[auth] = result
-      end
-    end.keys
-    retry_no_proxy = retry_results.select do |_auth, result|
-      result[:used_proxy] && result[:error] &&
-        !result[:error].is_a?(ScraperUtils::UnprocessableRecord)
-    end.keys
-
-    unless retry_no_proxy.empty?
-      puts "",
-           "*****************************************************************"
-      puts "Now retrying authorities which earlier had failures without proxy"
-      puts retry_no_proxy.join(", ").to_s
-      puts "*****************************************************************"
-
-      start_retry = Time.now
-      second_retry_results = scrape(retry_no_proxy, 3)
-      ScraperUtils::LogUtils.log_scraping_run(
-        start_retry,
-        3,
-        retry_no_proxy,
-        second_retry_results
-      )
-      second_retry_results.each do |auth, result|
-        unless result[:error] && !result[:error].is_a?(ScraperUtils::UnprocessableRecord)
-          results[auth] = result
-        end
-      end.keys
-    end
-  end
-
-  # Report on results, raising errors for unexpected conditions
-  ScraperUtils::LogUtils.report_on_results(authorities, results)
-end
+See the documentation on `ScraperUtils::MechanizeUtils::AgentConfig` for more options
+
+Then adjust your code to accept `client_options` and pass them through to:
+`ScraperUtils::MechanizeUtils.mechanize_agent(client_options || {})`
+to receive a `Mechanize::Agent` configured accordingly.
+
+The agent returned is configured using Mechanize hooks to implement the desired delays automatically.
+
+### Default Configuration
+
+By default, the Mechanize agent is configured with the following settings.
+As you can see, the defaults can be changed using env variables.
+
+Note - compliant mode forces max_load to be set to a value no greater than 33.
+PLEASE don't use our user agent string with a max_load higher than 33!
+
+```ruby
+ScraperUtils::MechanizeUtils::AgentConfig.configure do |config|
+  config.default_timeout = ENV.fetch('MORPH_TIMEOUT', 60).to_i # 60
+  config.default_compliant_mode = ENV.fetch('MORPH_NOT_COMPLIANT', nil).to_s.empty? # true
+  config.default_random_delay = ENV.fetch('MORPH_RANDOM_DELAY', 15).to_i # 15
+  config.default_max_load = ENV.fetch('MORPH_MAX_LOAD', 20.0).to_f # 20
+  config.default_disable_ssl_certificate_check = !ENV.fetch('MORPH_DISABLE_SSL_CHECK', nil).to_s.empty? # false
+  config.default_australian_proxy = !ENV.fetch('MORPH_USE_PROXY', nil).to_s.empty? # false
+  config.default_user_agent = ENV.fetch('MORPH_USER_AGENT', nil) # Uses Mechanize user agent
 end
+```
+
+You can modify these global defaults before creating any Mechanize agents. These settings will be used for all Mechanize
+agents created by `ScraperUtils::MechanizeUtils.mechanize_agent` unless overridden by passing parameters to that method.
 
-
-# Default to list of authorities we can't or won't fix in code, explain why
-# wagga: url redirects and reports Application error, main site says to use NSW Planning Portal from 1 July 2021
-# which doesn't list any DA's for wagga wagga!
+To speed up testing, set the following in `spec_helper.rb`:
 
-
-
+```ruby
+ScraperUtils::MechanizeUtils::AgentConfig.configure do |config|
+  config.default_random_delay = nil
+  config.default_max_load = 33
 end
 ```
 
-
+### Example updated `scraper.rb` file
 
-
-
-
+Update your `scraper.rb` as per the [example scraper](docs/example_scraper.rb).
+
+Your code should raise ScraperUtils::UnprocessableRecord when there is a problem with the data presented on a page for a
+record.
+Then just before you would normally yield a record for saving, rescue that exception and:
+
+* Call `ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)`
+* NOT yield the record for saving
+
+In your code, update where you create a mechanize agent (often `YourScraper.scrape_period`) and the `AUTHORITIES` hash
+to move Mechanize agent options (like `australian_proxy` and `timeout`) to a hash under a new key: `client_options`.
+For example:
 
 ```ruby
 require "scraper_utils"
 #...
-module
-#
-  def self.scrape(use_proxy, authority)
-    raise "Unexpected authority: #{authority}" unless AUTHORITIES.key?(authority)
-
-    scrape_period(use_proxy, AUTHORITIES[authority]) do |record, unprocessable|
-      yield record, unprocessable
-    end
-  end
-
-  # ... rest of code ...
+module YourScraper
+  # ... some code ...
 
-  # Note the extra
-  def self.scrape_period(
-
-      australian_proxy: false, timeout: nil
+  # Note the extra parameter: client_options
+  def self.scrape_period(url:, period:, webguest: "P1.WEBGUEST",
+                         client_options: {}
   )
-    agent = ScraperUtils::MechanizeUtils.mechanize_agent(
-    agent.verify_mode = OpenSSL::SSL::VERIFY_NONE if disable_ssl_certificate_check
-
-    # ... rest of code ...
-
-    # Update yield to return unprocessable as well as record
+    agent = ScraperUtils::MechanizeUtils.mechanize_agent(**client_options)
 
+    # ... rest of code ...
   end
+
   # ... rest of code ...
 end
 ```
 
 ### Debugging Techniques
 
-The following code will
+The following code will cause debugging info to be output:
 
 ```bash
 export DEBUG=1
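To make the `UnprocessableRecord` guidance in the hunk above concrete, here is a minimal sketch of the rescue-and-log pattern (the `AUTHORITIES` entry, the loop and the `check_record_fields!` helper are assumptions for illustration; `client_options`, `ScraperUtils::UnprocessableRecord` and `ScraperUtils::DataQualityMonitor.log_unprocessable_record` come from the README text above):

```ruby
AUTHORITIES = {
  # Hypothetical authority entry with the Mechanize options moved under client_options
  some_authority: {
    url: "https://example.com/planning",
    client_options: { australian_proxy: true, timeout: 60 }
  }
}.freeze

def scrape_detail_pages(records)
  records.each do |record|
    begin
      check_record_fields!(record) # assumed helper that raises ScraperUtils::UnprocessableRecord
      yield record                 # only yield records that are worth saving
    rescue ScraperUtils::UnprocessableRecord => e
      # Log the problem and do NOT yield the record for saving
      ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)
    end
  end
end
```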
@@ -235,8 +199,8 @@ require 'scraper_utils'
 
 # Debug an HTTP request
 ScraperUtils::DebugUtils.debug_request(
-"GET",
-"https://example.com/planning-apps",
+  "GET",
+  "https://example.com/planning-apps",
   parameters: { year: 2023 },
   headers: { "Accept" => "application/json" }
 )
@@ -248,7 +212,85 @@ ScraperUtils::DebugUtils.debug_page(page, "Checking search results page")
 ScraperUtils::DebugUtils.debug_selector(page, '.results-table', "Looking for development applications")
 ```
 
-
+Interleaving Requests
+---------------------
+
+The `ScraperUtils::FiberScheduler` provides a lightweight utility that:
+
+* works on the other authorities whilst in the delay period for an authority's next request
+  * thus optimizing the total scraper run time
+* allows you to increase the random delay for authorities without undue effect on total run time
+* For the curious, it uses [ruby fibers](https://ruby-doc.org/core-2.5.8/Fiber.html) rather than threads as that is
+  a simpler system and thus easier to get right, understand and debug!
+* Cycles around the authorities when compliant_mode, max_load and random_delay are disabled
+
+To enable, change the scrape method to be like the [example scrape method using fibers](docs/example_scrape_with_fibers.rb)
+
+And use `ScraperUtils::FiberScheduler.log` instead of `puts` when logging within the authority processing code.
+This will prefix the output lines with the authority name, which is needed since the system will interleave the work and
+thus the output.
+
+This uses `ScraperUtils::RandomizeUtils` as described below. Remember to add the recommended line to
+`spec/spec_helper.rb`.
+
+Intelligent Date Range Selection
+--------------------------------
+
+To further reduce server load and speed up scrapers, we provide an intelligent date range selection mechanism
+that can reduce server requests by 60% without significantly impacting the delay in picking up changes.
+
+The `ScraperUtils::DateRangeUtils#calculate_date_ranges` method provides a smart approach to searching historical
+records:
+
+- Always checks the most recent 4 days daily (configurable)
+- Progressively reduces search frequency for older records
+- Uses a Fibonacci-like progression to create natural, efficient search intervals
+- Configurable `max_period` (default is 3 days)
+- Merges adjacent search ranges and handles the changeover in search frequency by extending some searches
+
+Example usage in your scraper:
+
+```ruby
+date_ranges = ScraperUtils::DateRangeUtils.new.calculate_date_ranges
+date_ranges.each do |from_date, to_date, _debugging_comment|
+  # Adjust your normal search code to use this date range
+  your_search_records(from_date: from_date, to_date: to_date) do |record|
+    # process as normal
+  end
+end
+```
+
+Typical server load reductions:
+
+* Max period 2 days: ~42% of the 33 days selected
+* Max period 3 days: ~37% of the 33 days selected (default)
+* Max period 5 days: ~35% (or ~31% when days = 45)
+
+See the class documentation for customizing defaults and passing options.
+
+Randomizing Requests
+--------------------
+
+Pass a `Collection` or `Array` to `ScraperUtils::RandomizeUtils.randomize_order` to randomize it in production, but
+receive it as is when testing.
+
+Use this with the list of records scraped from an index, to randomise any requests for further information and be less bot-like.
+
+### Spec setup
+
+You should enforce sequential mode when testing by adding the following code to `spec/spec_helper.rb`:
+
+```
+ScraperUtils::RandomizeUtils.sequential = true
+```
+
+Note:
+
+* You can also force sequential mode by setting the env variable `MORPH_PROCESS_SEQUENTIALLY` to `1` (any non-blank value)
+* testing using VCR requires sequential mode
+
+Development
+-----------
 
 After checking out the repo, run `bin/setup` to install dependencies.
 Then, run `rake test` to run the tests.
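For example, a minimal sketch of randomizing the detail-page requests described above (assuming `records` holds the entries scraped from an index page):

```ruby
records = ScraperUtils::RandomizeUtils.randomize_order(records)
records.each do |record|
  # fetch and process the detail page for each record as you normally would
end
```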
@@ -259,13 +301,20 @@ To install this gem onto your local machine, run `bundle exec rake install`.
 
 To release a new version, update the version number in `version.rb`, and
 then run `bundle exec rake release`,
-which will create a git tag for the version, push git commits and tags, and push the `.gem` file
+which will create a git tag for the version, push git commits and tags, and push the `.gem` file
+to [rubygems.org](https://rubygems.org).
+
+NOTE: You need to use ruby 3.2.2 instead of 2.5.8 to release to OTP protected accounts.
+
+Contributing
+------------
 
-
+Bug reports and pull requests with working tests are welcome on [GitHub](https://github.com/ianheggie-oaf/scraper_utils)
 
-
+CHANGELOG.md is maintained by the author aiming to follow https://github.com/vweevers/common-changelog
 
-
+License
+-------
 
 The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
 