scraper_utils 0.5.1 → 0.7.0
- checksums.yaml +4 -4
- data/.yardopts +5 -0
- data/CHANGELOG.md +19 -0
- data/GUIDELINES.md +2 -1
- data/Gemfile +1 -0
- data/IMPLEMENTATION.md +39 -0
- data/README.md +29 -23
- data/SPECS.md +13 -1
- data/bin/rspec +27 -0
- data/docs/enhancing_specs.md +100 -0
- data/docs/example_scrape_with_fibers.rb +4 -4
- data/docs/fibers_and_threads.md +72 -0
- data/docs/getting_started.md +6 -6
- data/docs/interleaving_requests.md +9 -8
- data/docs/mechanize_utilities.md +4 -4
- data/docs/parallel_requests.md +138 -0
- data/docs/randomizing_requests.md +12 -8
- data/docs/reducing_server_load.md +6 -6
- data/lib/scraper_utils/data_quality_monitor.rb +2 -3
- data/lib/scraper_utils/date_range_utils.rb +37 -78
- data/lib/scraper_utils/debug_utils.rb +5 -5
- data/lib/scraper_utils/log_utils.rb +15 -0
- data/lib/scraper_utils/mechanize_actions.rb +37 -8
- data/lib/scraper_utils/mechanize_utils/adaptive_delay.rb +80 -0
- data/lib/scraper_utils/mechanize_utils/agent_config.rb +35 -34
- data/lib/scraper_utils/mechanize_utils/robots_checker.rb +151 -0
- data/lib/scraper_utils/mechanize_utils.rb +8 -5
- data/lib/scraper_utils/randomize_utils.rb +22 -19
- data/lib/scraper_utils/scheduler/constants.rb +12 -0
- data/lib/scraper_utils/scheduler/operation_registry.rb +101 -0
- data/lib/scraper_utils/scheduler/operation_worker.rb +199 -0
- data/lib/scraper_utils/scheduler/process_request.rb +59 -0
- data/lib/scraper_utils/scheduler/thread_request.rb +51 -0
- data/lib/scraper_utils/scheduler/thread_response.rb +59 -0
- data/lib/scraper_utils/scheduler.rb +286 -0
- data/lib/scraper_utils/spec_support.rb +67 -0
- data/lib/scraper_utils/version.rb +1 -1
- data/lib/scraper_utils.rb +12 -14
- metadata +18 -6
- data/lib/scraper_utils/adaptive_delay.rb +0 -70
- data/lib/scraper_utils/fiber_scheduler.rb +0 -229
- data/lib/scraper_utils/robots_checker.rb +0 -149
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: da26385a1d788bc9ad9d725f0eaefe233d4b70a8bf9aeab0af3168041adc0bc2
+  data.tar.gz: '02892db893cc706ec67845bde0a05d80b89cdfc2d48759b2e66574e6f87b0031'
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: b8101b0b0d2ed1d775de54f0e8bac5a1a22ca6f540cea2752de76218a4915c325ec7569ac76a719bf540474f54123cb32f700635160cdbabdcc68679ac33c2e4
+  data.tar.gz: ae1e5d72f45b077f0525e62dc0399f9bd6f519d52222518e95f57f8f45e03575f3b8b909fb02b030a27b33bfccaefec74bb143d9d13ff1bcfe094ceaa8e369d3
data/.yardopts
ADDED
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,24 @@
 # Changelog

+## 0.7.0 - 2025-04-15
+
+* Added Spec helpers and associated doc: `docs/enhancing_specs.md`
+  * `ScraperUtils::SpecSupport.geocodable?`
+  * `ScraperUtils::SpecSupport.reasonable_description?`
+
+## 0.6.1 - 2025-03-28
+
+* Changed DEFAULT_MAX_LOAD to 50.0 as we are overestimating the load we present as network latency is included
+* Correct documentation of spec_helper extra lines
+* Fix misc bugs found in use
+
+## 0.6.0 - 2025-03-16
+
+* Add threads for more efficient scraping
+* Adjust defaults for more efficient scraping, retaining just response based delays by default
+* Correct and simplify date range utilities so everything is checked at least `max_period` days
+* Release Candidate for v1.0.0, subject to testing in production
+
 ## 0.5.1 - 2025-03-05

 * Remove duplicated example code in docs
data/GUIDELINES.md
CHANGED
@@ -47,7 +47,8 @@ but if the file is bad, just treat it as missing.

 ## Testing Strategies

-*
+* AVOID mocking unless really needed (and REALLY avoid mocking your own code), instead
+* Consider if you can change your own code, whilst keeping it simple, to make it easier to test
 * instantiate a real object to use in the test
 * use mocking facilities provided by the gem (eg Mechanize, Aws etc)
 * use integration tests with WebMock for simple external sites or VCR for more complex.
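To illustrate the WebMock approach in the last bullet above, a minimal integration-style spec might look like the sketch below; the URL, stubbed HTML and selector are invented for illustration and are not part of scraper_utils or any real council site.

```ruby
# Hedged sketch only: stubs one page with WebMock and exercises a real
# Mechanize agent rather than mocking our own code.
require "webmock/rspec"
require "mechanize"

RSpec.describe "fetching a results page" do
  it "parses the stubbed page with a real Mechanize agent" do
    stub_request(:get, "https://example.council.gov.au/search")
      .to_return(status: 200,
                 body: "<html><body><table id='results'></table></body></html>",
                 headers: { "Content-Type" => "text/html" })

    page = Mechanize.new.get("https://example.council.gov.au/search")

    expect(page.at("#results")).not_to be_nil
  end
end
```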
data/Gemfile
CHANGED
data/IMPLEMENTATION.md
CHANGED
@@ -31,3 +31,42 @@ puts "Pre Connect request: #{request.inspect}" if ENV["DEBUG"]
 - Externalize configuration to improve testability
 - Keep shared logic in the main class
 - Decisions / information specific to just one class, can be documented there, otherwise it belongs here
+
+## Testing Directory Structure
+
+Our test directory structure reflects various testing strategies and aspects of the codebase:
+
+### API Context Directories
+- `spec/scraper_utils/fiber_api/` - Tests functionality called from within worker fibers
+- `spec/scraper_utils/main_fiber/` - Tests functionality called from the main fiber's perspective
+- `spec/scraper_utils/thread_api/` - Tests functionality called from within worker threads
+
+### Utility Classes
+- `spec/scraper_utils/mechanize_utils/` - Tests for `lib/scraper_utils/mechanize_utils/*.rb` files
+- `spec/scraper_utils/scheduler/` - Tests for `lib/scraper_utils/scheduler/*.rb` files
+- `spec/scraper_utils/scheduler2/` - FIXME: remove duplicate tests and merge to `spec/scraper_utils/scheduler/` unless > 200 lines
+
+### Integration vs Unit Tests
+- `spec/scraper_utils/integration/` - Tests that focus on the integration between components
+- Name tests after the most "parent-like" class of the components involved
+
+### Special Configuration Directories
+These specs check the options we use when things go wrong in production
+
+- `spec/scraper_utils/no_threads/` - Tests with threads disabled (`MORPH_DISABLE_THREADS=1`)
+- `spec/scraper_utils/sequential/` - Tests with exactly one worker (`MORPH_MAX_WORKERS=1`)
+
+### Directories to break up large specs
+Keep specs less than 200 lines long
+
+- `spec/scraper_utils/replacements` - Tests for replacements in MechanizeActions
+- `spec/scraper_utils/replacements2` - FIXME: remove duplicate tests and merge to `spec/scraper_utils/replacements/`?
+- `spec/scraper_utils/selectors` - Tests the various node selectors available in MechanizeActions
+- `spec/scraper_utils/selectors2` - FIXME: remove duplicate tests and merge to `spec/scraper_utils/selectors/`?
+
+### General Testing Guidelines
+- Respect fiber and thread context validation - never mock the objects under test
+- Structure tests to run in the appropriate fiber context
+- Use real fibers, threads and operations rather than excessive mocking
+- Ensure proper cleanup of resources in both success and error paths
+- ASK when unsure which (yard doc, spec or code) is wrong as I don't always follow the "write specs first" strategy
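The diff does not show how the special-configuration spec directories set those `MORPH_*` variables; one possible arrangement (an assumption, not taken from the gem) is an `around` hook in that directory's spec helper:

```ruby
# Hedged sketch: pin the documented env flag for the no_threads specs.
# The :no_threads metadata tag is an assumption, not gem API.
RSpec.configure do |config|
  config.around(:each, :no_threads) do |example|
    original = ENV["MORPH_DISABLE_THREADS"]
    begin
      ENV["MORPH_DISABLE_THREADS"] = "1" # flag documented above for disabling threads
      example.run
    ensure
      ENV["MORPH_DISABLE_THREADS"] = original
    end
  end
end
```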
data/README.md
CHANGED
@@ -9,28 +9,30 @@ For Server Administrators
 The ScraperUtils library is designed to be a respectful citizen of the web. If you're a server administrator and notice
 our scraper accessing your systems, here's what you should know:

-### How to Control Our Behavior
-
-Our scraper utilities respect the standard server **robots.txt** control mechanisms (by default).
-To control our access:
-
-- Add a section for our user agent: `User-agent: ScraperUtils` (default)
-- Set a crawl delay, eg: `Crawl-delay: 20`
-- If needed specify disallowed paths: `Disallow: /private/`
-
 ### We play nice with your servers

 Our goal is to access public planning information with minimal impact on your services. The following features are on by
 default:

+- **Limit server load**:
+  - We limit the max load we present to your server to less than a half of one of your cpu cores
+  - The more loaded your server is, the longer we wait between requests!
+  - We respect Crawl-delay from robots.txt (see section below), so you can tell us an acceptable rate
+  - Scarper developers can
+    - reduce the max_load we present to your server even further
+    - add random extra delays to give your server a chance to catch up with background tasks
+
 - **Identify themselves**: Our user agent clearly indicates who we are and provides a link to the project repository:
   `Mozilla/5.0 (compatible; ScraperUtils/0.2.0 2025-02-22; +https://github.com/ianheggie-oaf/scraper_utils)`

-
-
-
+### How to Control Our Behavior
+
+Our scraper utilities respect the standard server **robots.txt** control mechanisms (by default).
+To control our access:

-
+- Add a section for our user agent: `User-agent: ScraperUtils`
+- Set a crawl delay, eg: `Crawl-delay: 20`
+- If needed specify disallowed paths: `Disallow: /private/`

 For Scraper Developers
 ----------------------
@@ -40,14 +42,15 @@ mentioned above.

 ## Installation & Configuration

-Add to your
+Add to [your scraper's](https://www.planningalerts.org.au/how_to_write_a_scraper) Gemfile:

 ```ruby
 gem "scraperwiki", git: "https://github.com/openaustralia/scraperwiki-ruby.git", branch: "morph_defaults"
 gem 'scraper_utils'
 ```

-For detailed setup and configuration options,
+For detailed setup and configuration options,
+see {file:docs/getting_started.md Getting Started guide}

 ## Key Features

@@ -57,20 +60,23 @@ For detailed setup and configuration options, see the [Getting Started guide](do
 - Automatic rate limiting based on server response times
 - Supports robots.txt and crawl-delay directives
 - Supports extra actions required to get to results page
--
+- {file:docs/mechanize_utilities.md Learn more about Mechanize utilities}

 ### Optimize Server Load

 - Intelligent date range selection (reduce server load by up to 60%)
 - Cycle utilities for rotating search parameters
--
+- {file:docs/reducing_server_load.md Learn more about reducing server load}

 ### Improve Scraper Efficiency

--
--
+- Interleaves requests to optimize run time
+- {file:docs/interleaving_requests.md Learn more about interleaving requests}
+- Use {ScraperUtils::Scheduler.execute_request} so Mechanize network requests will be performed by threads in parallel
+- {file:docs/parallel_requests.md Parallel Request} - see Usage section for installation instructions
 - Randomize processing order for more natural request patterns
--
+- {file:docs/randomizing_requests.md Learn more about randomizing requests} - see Usage section for installation
+  instructions

 ### Error Handling & Quality Monitoring

@@ -82,11 +88,11 @@ For detailed setup and configuration options, see the [Getting Started guide](do

 - Enhanced debugging utilities
 - Simple logging with authority context
--
+- {file:docs/debugging.md Learn more about debugging}

 ## API Documentation

-Complete API documentation is available at [RubyDoc.info](https://rubydoc.info/gems/scraper_utils).
+Complete API documentation is available at [scraper_utils | RubyDoc.info](https://rubydoc.info/gems/scraper_utils).

 ## Ruby Versions

@@ -105,7 +111,7 @@ To install this gem onto your local machine, run `bundle exec rake install`.
 ## Contributing

 Bug reports and pull requests with working tests are welcome
-on [GitHub](https://github.com/ianheggie-oaf/scraper_utils).
+on [ianheggie-oaf/scraper_utils | GitHub](https://github.com/ianheggie-oaf/scraper_utils).

 ## License

data/SPECS.md
CHANGED
@@ -6,7 +6,13 @@ installation and usage notes in `README.md`.

 ASK for clarification of any apparent conflicts with IMPLEMENTATION, GUIDELINES or project instructions.

-
+Core Design Principles
+----------------------
+
+## Coding Style and Complexity
+- KISS (Keep it Simple and Stupid) is a guiding principle:
+  - Simple: Design and implement with as little complexity as possible while still achieving the desired functionality
+  - Stupid: Should be easy to diagnose and repair with basic tooling

 ### Error Handling
 - Record-level errors abort only that record's processing
@@ -23,3 +29,9 @@ ASK for clarification of any apparent conflicts with IMPLEMENTATION, GUIDELINES
 - Ensure components are independently testable
 - Avoid timing-based tests in favor of logic validation
 - Keep test scenarios focused and under 20 lines
+
+#### Fiber and Thread Testing
+- Test in appropriate fiber/thread context using API-specific directories
+- Validate cooperative concurrency with real fibers rather than mocks
+- Ensure tests for each context: main fiber, worker fibers, and various thread configurations
+- Test special configurations (no threads, no fibers, sequential) in dedicated directories
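As a small illustration of the record-level error handling principle above, a per-record rescue keeps one bad record from aborting the rest of an authority's run; `records` and `save_record` below are placeholders, not gem API:

```ruby
# Hedged sketch of record-level error handling: a failure is logged and
# skipped, while the remaining records are still attempted.
def process_records(records)
  records.each do |record|
    begin
      save_record(record) # the scraper's own per-record save logic (placeholder)
    rescue StandardError => e
      # Abort only this record's processing, continue with the rest.
      puts "Skipping record #{record['council_reference']}: #{e.message}"
    end
  end
end
```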
data/bin/rspec
ADDED
@@ -0,0 +1,27 @@
+#!/usr/bin/env ruby
+# frozen_string_literal: true
+
+#
+# This file was generated by Bundler.
+#
+# The application 'rspec' is installed as part of a gem, and
+# this file is here to facilitate running it.
+#
+
+ENV["BUNDLE_GEMFILE"] ||= File.expand_path("../Gemfile", __dir__)
+
+bundle_binstub = File.expand_path("bundle", __dir__)
+
+if File.file?(bundle_binstub)
+  if File.read(bundle_binstub, 300).include?("This file was generated by Bundler")
+    load(bundle_binstub)
+  else
+    abort("Your `bin/bundle` was not generated by Bundler, so this binstub cannot run.
+Replace `bin/bundle` by running `bundle binstubs bundler --force`, then run this command again.")
+  end
+end
+
+require "rubygems"
+require "bundler/setup"
+
+load Gem.bin_path("rspec-core", "rspec")
data/docs/enhancing_specs.md
ADDED
@@ -0,0 +1,100 @@
+# Enhancing specs
+
+ScraperUtils provides two methods to help with checking results
+
+* `ScraperUtils::SpecSupport.geocodable?`
+* `ScraperUtils::SpecSupport.reasonable_description?`
+
+## Example Code:
+
+```ruby
+# frozen_string_literal: true
+
+require "timecop"
+require_relative "../scraper"
+
+RSpec.describe Scraper do
+  describe ".scrape" do
+    def test_scrape(authority)
+      File.delete("./data.sqlite") if File.exist?("./data.sqlite")
+
+      VCR.use_cassette(authority) do
+        date = Date.new(2025, 4, 15)
+        Timecop.freeze(date) do
+          Scraper.scrape([authority], 1)
+        end
+      end
+
+      expected = if File.exist?("spec/expected/#{authority}.yml")
+                   YAML.safe_load(File.read("spec/expected/#{authority}.yml"))
+                 else
+                   []
+                 end
+      results = ScraperWiki.select("* from data order by council_reference")
+
+      ScraperWiki.close_sqlite
+
+      if results != expected
+        # Overwrite expected so that we can compare with version control
+        # (and maybe commit if it is correct)
+        File.open("spec/expected/#{authority}.yml", "w") do |f|
+          f.write(results.to_yaml)
+        end
+      end
+
+      expect(results).to eq expected
+
+      geocodable = results
+                   .map { |record| record["address"] }
+                   .uniq
+                   .count { |text| SpecHelper.geocodable? text }
+      puts "Found #{geocodable} out of #{results.count} unique geocodable addresses " \
+           "(#{(100.0 * geocodable / results.count).round(1)}%)"
+      expect(geocodable).to be > (0.7 * results.count)
+
+      descriptions = results
+                     .map { |record| record["description"] }
+                     .uniq
+                     .count do |text|
+        selected = SpecHelper.reasonable_description? text
+        puts " description: #{text} is not reasonable" if ENV["DEBUG"] && !selected
+        selected
+      end
+      puts "Found #{descriptions} out of #{results.count} unique reasonable descriptions " \
+           "(#{(100.0 * descriptions / results.count).round(1)}%)"
+      expect(descriptions).to be > (0.55 * results.count)
+
+      info_urls = results
+                  .map { |record| record["info_url"] }
+                  .uniq
+                  .count { |text| text.to_s.match(%r{\Ahttps?://}) }
+      puts "Found #{info_urls} out of #{results.count} unique info_urls " \
+           "(#{(100.0 * info_urls / results.count).round(1)}%)"
+      expect(info_urls).to be > (0.7 * results.count) if info_urls != 1
+
+      VCR.use_cassette("#{authority}.info_urls") do
+        results.each do |record|
+          info_url = record["info_url"]
+          puts "Checking info_url #{info_url} #{info_urls > 1 ? ' has expected details' : ''} ..."
+          response = Net::HTTP.get_response(URI(info_url))
+
+          expect(response.code).to eq("200")
+          # If info_url is the same for all records, then it won't have details
+          break if info_urls == 1
+
+          expect(response.body).to include(record["council_reference"])
+          expect(response.body).to include(record["address"])
+          expect(response.body).to include(record["description"])
+        end
+      end
+    end
+
+    Scraper.selected_authorities.each do |authority|
+      it authority do
+        test_scrape(authority)
+      end
+    end
+  end
+end
+
+```
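For a quick standalone check of the two helpers documented above (assuming, as the example spec implies, that each takes a string and returns a boolean):

```ruby
# Hedged sketch: the sample values are invented; only the two method names
# come from the documentation above.
require "scraper_utils"

address = "1 Example Street, Sydney NSW 2000"
description = "Construction of a two storey dwelling"

puts ScraperUtils::SpecSupport.geocodable?(address)                  # => true/false
puts ScraperUtils::SpecSupport.reasonable_description?(description)  # => true/false
```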
data/docs/example_scrape_with_fibers.rb
CHANGED
@@ -3,11 +3,11 @@
 # Example scrape method updated to use ScraperUtils::FibreScheduler

 def scrape(authorities, attempt)
-  ScraperUtils::
+  ScraperUtils::Scheduler.reset!
   exceptions = {}
   authorities.each do |authority_label|
-    ScraperUtils::
-    ScraperUtils::
+    ScraperUtils::Scheduler.register_operation(authority_label) do
+      ScraperUtils::LogUtils.log(
         "Collecting feed data for #{authority_label}, attempt: #{attempt}..."
       )
       ScraperUtils::DataQualityMonitor.start_authority(authority_label)
@@ -26,6 +26,6 @@ def scrape(authorities, attempt)
     end
     # end of register_operation block
   end
-  ScraperUtils::
+  ScraperUtils::Scheduler.run_operations
   exceptions
 end
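Pulling the changed lines above together, the updated scrape method presumably looks something like the sketch below; `YourScraper` and the rescue handling stand in for the lines this diff does not show.

```ruby
# Hedged consolidated sketch of the scrape method after this change;
# YourScraper is a placeholder for the per-authority work not shown in the diff.
require "scraper_utils"

def scrape(authorities, attempt)
  ScraperUtils::Scheduler.reset!
  exceptions = {}
  authorities.each do |authority_label|
    ScraperUtils::Scheduler.register_operation(authority_label) do
      begin
        ScraperUtils::LogUtils.log(
          "Collecting feed data for #{authority_label}, attempt: #{attempt}..."
        )
        ScraperUtils::DataQualityMonitor.start_authority(authority_label)
        YourScraper.scrape(authority_label) # placeholder for the real work
      rescue StandardError => e
        exceptions[authority_label] = e
      end
    end
    # end of register_operation block
  end
  ScraperUtils::Scheduler.run_operations
  exceptions
end
```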
data/docs/fibers_and_threads.md
ADDED
@@ -0,0 +1,72 @@
+Fibers and Threads
+==================
+
+This sequence diagram supplements the notes on the {ScraperUtils::Scheduler} class and is intended to help show
+the passing of messages and control between the fibers and threads.
+
+* To keep things simple I have only shown the Fibers and Threads and not all the other calls like to the
+  OperationRegistry to lookup the current operation, or OperationWorker etc.
+* There is ONE (global) response queue, which is monitored by the Scheduler.run_operations loop in the main Fiber
+* Each authority has ONE OperationWorker (not shown), which has ONE Fiber, ONE Thread, ONE request queue.
+* I use "◀─▶" to indicate a call and response, and "║" for which fiber / object is currently running.
+
+```text
+
+SCHEDULER (Main Fiber)
+NxRegister-operation RESPONSE.Q
+║──creates────────◀─▶┐
+║ │ FIBER (runs block passed to register_operation)
+║──creates──────────────────◀─▶┐ WORKER object and Registry
+║──registers(fiber)───────────────────────▶┐ REQUEST-Q
+│ │ │ ║──creates─◀─▶┐ THREAD
+│ │ │ ║──creates───────────◀─▶┐
+║◀─────────────────────────────────────────┘ ║◀──pop───║ ...[block waiting for request]
+║ │ │ │ ║ │
+run_operations │ │ │ ║ │
+║──pop(non block)─◀─▶│ │ │ ║ │ ...[no responses yet]
+║ │ │ │ ║ │
+║───resumes-next─"can_resume"─────────────▶┐ ║ │
+│ │ │ ║ ║ │
+│ │ ║◀──resume──┘ ║ │ ...[first Resume passes true]
+│ │ ║ │ ║ │ ...[initialise scraper]
+```
+**REPEATS FROM HERE**
+```text
+SCHEDULER RESPONSE.Q FIBER WORKER REQUEST.Q THREAD
+│ │ ║──request─▶┐ ║ │
+│ │ │ ║──push req ─▶║ │
+│ │ ║◀──────────┘ ║──req───▶║
+║◀──yields control─(waiting)───┘ │ │ ║
+║ │ │ │ │ ║ ...[Executes network I/O request]
+║ │ │ │ │ ║
+║───other-resumes... │ │ │ │ ║ ...[Other Workers will be resumed
+║ │ │ │ │ ║ till most 99% are waiting on
+║───lots of │ │ │ │ ║ responses from their threads
+║ short sleeps ║◀──pushes response───────────────────────────┘
+║ ║ │ │ ║◀──pop───║ ...[block waiting for request]
+║──pop(response)──◀─▶║ │ │ ║ │
+║ │ │ │ ║ │
+║──saves─response───────────────────────◀─▶│ ║ │
+║ │ │ │ ║ │
+║───resumes-next─"can_resume"─────────────▶┐ ║ │
+│ │ │ ║ ║ │
+│ │ ║◀──resume──┘ ║ │ ...[Resume passes response]
+│ │ ║ │ ║ │
+│ │ ║ │ ║ │ ...[Process Response]
+```
+**REPEATS TO HERE** - WHEN FIBER FINISHES, instead it:
+```text
+SCHEDULER RESPONSE.Q FIBER WORKER REQUEST.Q THREAD
+│ │ ║ │ ║ │
+│ │ ║─deregister─▶║ ║ │
+│ │ │ ║──close───▶║ │
+│ │ │ ║ ║──nil───▶┐
+│ │ │ ║ │ ║ ... [thread exists]
+│ │ │ ║──join────────────◀─▶┘
+│ │ │ ║ ....... [worker removes
+│ │ │ ║ itself from registry]
+│ │ ║◀──returns───┘
+│◀──returns─nil────────────────┘
+│ │
+```
+When the last fiber finishes and the registry is empty, then the response queue is also removed
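The request the FIBER pushes onto its REQUEST.Q above is created when scraper code asks for a network call, which the README section earlier routes through `ScraperUtils::Scheduler.execute_request`. Its exact signature is not shown in this diff, so the sketch below assumes a (client, method name, argument list) form and uses a made-up URL.

```ruby
# Hedged sketch, as seen from inside a register_operation block: hand the
# Mechanize call to the worker's thread. The three-argument form is an
# assumption based on the docs referenced above, not confirmed by this diff.
require "mechanize"
require "scraper_utils"

agent = Mechanize.new # in practice, use the agent configured via MechanizeUtils
page = ScraperUtils::Scheduler.execute_request(agent, :get, ["https://example.com/planning"])
puts page&.title
```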
data/docs/getting_started.md
CHANGED
@@ -54,14 +54,14 @@ export DEBUG=1 # for basic, or 2 for verbose or 3 for tracing nearly everything

 ## Example Scraper Implementation

-Update your `scraper.rb` as per
+Update your `scraper.rb` as per {file:example_scraper.rb example scraper}

-For more advanced implementations, see the
+For more advanced implementations, see the {file:interleaving_requests.md Interleaving Requests documentation}.

 ## Logging Tables

 The following logging tables are created for use in monitoring failure patterns and debugging issues.
-Records are
+Records are automatically cleared after 30 days.

 The `ScraperUtils::LogUtils.log_scraping_run` call also logs the information to the `scrape_log` table.

@@ -69,6 +69,6 @@ The `ScraperUtils::LogUtils.save_summary_record` call also logs the information

 ## Next Steps

--
--
--
+- {file:reducing_server_load.md Reducing Server Load}
+- {file:mechanize_utilities.md Mechanize Utilities}
+- {file:debugging.md Debugging}
data/docs/interleaving_requests.md
CHANGED
@@ -1,6 +1,6 @@
-# Interleaving Requests with
+# Interleaving Requests with Scheduler

-The `ScraperUtils::
+The `ScraperUtils::Scheduler` provides a lightweight utility that:

 * Works on other authorities while in the delay period for an authority's next request
 * Optimizes the total scraper run time
@@ -12,21 +12,22 @@ The `ScraperUtils::FiberScheduler` provides a lightweight utility that:
 ## Implementation

 To enable fiber scheduling, change your scrape method as per
-
+{example_scrape_with_fibers.rb example scrape with fibers}

-## Logging with
+## Logging with Scheduler

-Use
+Use {ScraperUtils::LogUtils.log} instead of `puts` when logging within the authority processing code.
 This will prefix the output lines with the authority name, which is needed since the system will interleave the work and
 thus the output.

 ## Testing Considerations

-This uses
+This uses {ScraperUtils::RandomizeUtils} for determining the order of operations. Remember to add the following line to
 `spec/spec_helper.rb`:

 ```ruby
-ScraperUtils::RandomizeUtils.
+ScraperUtils::RandomizeUtils.random = false
+ScraperUtils::Scheduler.max_workers = 1
 ```

-For full details, see the
+For full details, see the {Scheduler}.
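Putting those two `spec/spec_helper.rb` lines in context, a minimal fragment might look like this (the require line is an assumption about how a scraper's suite loads the gem):

```ruby
# spec/spec_helper.rb (sketch)
require "scraper_utils"

# Keep test runs deterministic: fixed collection order, one worker at a time.
ScraperUtils::RandomizeUtils.random = false
ScraperUtils::Scheduler.max_workers = 1
```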
data/docs/mechanize_utilities.md
CHANGED
@@ -29,16 +29,16 @@ The agent returned is configured using Mechanize hooks to implement the desired
 ### Default Configuration

 By default, the Mechanize agent is configured with the following settings.
-As you can see, the defaults can be changed using env variables.
+As you can see, the defaults can be changed using env variables or via code.

 Note - compliant mode forces max_load to be set to a value no greater than 50.

 ```ruby
 ScraperUtils::MechanizeUtils::AgentConfig.configure do |config|
-  config.default_timeout = ENV.fetch('MORPH_TIMEOUT',
+  config.default_timeout = ENV.fetch('MORPH_TIMEOUT', DEFAULT_TIMEOUT).to_i # 60
   config.default_compliant_mode = ENV.fetch('MORPH_NOT_COMPLIANT', nil).to_s.empty? # true
-  config.default_random_delay = ENV.fetch('MORPH_RANDOM_DELAY',
-  config.default_max_load = ENV.fetch('MORPH_MAX_LOAD',
+  config.default_random_delay = ENV.fetch('MORPH_RANDOM_DELAY', DEFAULT_RANDOM_DELAY).to_i # 0
+  config.default_max_load = ENV.fetch('MORPH_MAX_LOAD',DEFAULT_MAX_LOAD).to_f # 50.0
   config.default_disable_ssl_certificate_check = !ENV.fetch('MORPH_DISABLE_SSL_CHECK', nil).to_s.empty? # false
   config.default_australian_proxy = !ENV.fetch('MORPH_USE_PROXY', nil).to_s.empty? # false
   config.default_user_agent = ENV.fetch('MORPH_USER_AGENT', nil) # Uses Mechanize user agent