scraper_utils 0.5.0 → 0.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (40)
  1. checksums.yaml +4 -4
  2. data/.yardopts +5 -0
  3. data/CHANGELOG.md +11 -0
  4. data/GUIDELINES.md +2 -1
  5. data/Gemfile +1 -0
  6. data/IMPLEMENTATION.md +40 -0
  7. data/README.md +29 -23
  8. data/SPECS.md +13 -1
  9. data/bin/rspec +27 -0
  10. data/docs/example_scrape_with_fibers.rb +4 -4
  11. data/docs/example_scraper.rb +14 -21
  12. data/docs/fibers_and_threads.md +72 -0
  13. data/docs/getting_started.md +13 -84
  14. data/docs/interleaving_requests.md +8 -37
  15. data/docs/parallel_requests.md +138 -0
  16. data/docs/randomizing_requests.md +12 -8
  17. data/docs/reducing_server_load.md +6 -6
  18. data/lib/scraper_utils/data_quality_monitor.rb +2 -3
  19. data/lib/scraper_utils/date_range_utils.rb +37 -78
  20. data/lib/scraper_utils/debug_utils.rb +5 -5
  21. data/lib/scraper_utils/log_utils.rb +15 -0
  22. data/lib/scraper_utils/mechanize_actions.rb +37 -8
  23. data/lib/scraper_utils/mechanize_utils/adaptive_delay.rb +79 -0
  24. data/lib/scraper_utils/mechanize_utils/agent_config.rb +31 -30
  25. data/lib/scraper_utils/mechanize_utils/robots_checker.rb +151 -0
  26. data/lib/scraper_utils/mechanize_utils.rb +8 -5
  27. data/lib/scraper_utils/randomize_utils.rb +22 -19
  28. data/lib/scraper_utils/scheduler/constants.rb +12 -0
  29. data/lib/scraper_utils/scheduler/operation_registry.rb +101 -0
  30. data/lib/scraper_utils/scheduler/operation_worker.rb +199 -0
  31. data/lib/scraper_utils/scheduler/process_request.rb +59 -0
  32. data/lib/scraper_utils/scheduler/thread_request.rb +51 -0
  33. data/lib/scraper_utils/scheduler/thread_response.rb +59 -0
  34. data/lib/scraper_utils/scheduler.rb +286 -0
  35. data/lib/scraper_utils/version.rb +1 -1
  36. data/lib/scraper_utils.rb +11 -14
  37. metadata +16 -6
  38. data/lib/scraper_utils/adaptive_delay.rb +0 -70
  39. data/lib/scraper_utils/fiber_scheduler.rb +0 -229
  40. data/lib/scraper_utils/robots_checker.rb +0 -149
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 63a24c24b497494b79c4d7e12f04a1bd2555068f37f50389f3906c0033817d7e
- data.tar.gz: 6d6b96112dc3e2f9dc5a54de6318a544c240c0e3d5246ab4178c07346d0de7dc
+ metadata.gz: 13ad14102f284c98d658bb928bcf7806ea7594326d11c5426903ebc6b1f919e0
+ data.tar.gz: 260cd94a1b76e9851f5af47f716dc754386b081f3923ee0a0eb6fb2b2d086c4f
  SHA512:
- metadata.gz: eda8d10d996d51b7ef1d2610e21da31390c10dd29f4daa70bd5d9c3c8dc6eb9bed651803ccd6a59f53b03dae4fcd1ea016802e693f8828f4a13b92e07a0b046e
- data.tar.gz: eba2704a99c6599a2789ec573fa335d7939a63d0c27b06886d6e905cd785e2095d7d0307e7aa1195a1209e022340fa5d027a72ccca61a350590058e998355d5d
+ metadata.gz: 824b9e64ae7debdf9cddfc90b47de6dff7865e3f655ad6022f58181f38efd06788413e0251525410736e58d6fd325917a8b7a0ad6b2468fa7ab9de3b697955af
+ data.tar.gz: fd154118b2eaa22962f4343a3f62ca1daabc19de924a36e4c3cd4f61c9c7bb08a232554f10141b4e3534f3fa2e546a86f860d3bc7997f54d29a067cfb3c4f451
data/.yardopts ADDED
@@ -0,0 +1,5 @@
+ --files docs/*.rb
+ --files docs/*.md
+ --files CHANGELOG.md
+ --files LICENSE.txt
+ --readme README.md
data/CHANGELOG.md CHANGED
@@ -1,5 +1,16 @@
  # Changelog

+ ## 0.6.0 - 2025-03-16
+
+ * Add threads for more efficient scraping
+ * Adjust defaults for more efficient scraping, retaining just response-based delays by default
+ * Correct and simplify date range utilities so everything is checked at least every `max_period` days
+ * Release Candidate for v1.0.0, subject to testing in production
+
+ ## 0.5.1 - 2025-03-05
+
+ * Remove duplicated example code in docs
+
  ## 0.5.0 - 2025-03-05

  * Add action processing utility
data/GUIDELINES.md CHANGED
@@ -47,7 +47,8 @@ but if the file is bad, just treat it as missing.

  ## Testing Strategies

- * Avoid mocking unless really needed, instead
+ * AVOID mocking unless really needed (and REALLY avoid mocking your own code), instead
+ * Consider if you can change your own code, whilst keeping it simple, to make it easier to test
  * instantiate a real object to use in the test
  * use mocking facilities provided by the gem (eg Mechanize, Aws etc)
  * use integration tests with WebMock for simple external sites or VCR for more complex.
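
As an aside on the WebMock guideline above, a minimal sketch of such an integration spec — the file name, URL and HTML are hypothetical, not part of the gem:

```ruby
# spec/integration/simple_site_spec.rb - hypothetical illustration only
require "webmock/rspec"
require "mechanize"

RSpec.describe "scraping a simple external site" do
  it "parses the stubbed page with a real Mechanize agent" do
    # Stub the single HTTP call instead of mocking our own code
    stub_request(:get, "https://example.com/planning")
      .to_return(status: 200,
                 headers: { "Content-Type" => "text/html" },
                 body: "<html><body><h1>Applications</h1></body></html>")

    page = Mechanize.new.get("https://example.com/planning")

    expect(page.at("h1").text).to eq("Applications")
  end
end
```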
data/Gemfile CHANGED
@@ -32,5 +32,6 @@ gem "simplecov", platform && (platform == :heroku16 ? "~> 0.18.0" : "~> 0.22.0")
  gem "simplecov-console"
  gem "terminal-table"
  gem "webmock", platform && (platform == :heroku16 ? "~> 3.14.0" : "~> 3.19.0")
+ gem "yard"

  gemspec
data/IMPLEMENTATION.md CHANGED
@@ -31,3 +31,43 @@ puts "Pre Connect request: #{request.inspect}" if ENV["DEBUG"]
  - Externalize configuration to improve testability
  - Keep shared logic in the main class
  - Decisions / information specific to just one class, can be documented there, otherwise it belongs here
+
+ ## Testing Directory Structure
+
+ Our test directory structure reflects various testing strategies and aspects of the codebase:
+
+ ### API Context Directories
+ - `spec/scraper_utils/fiber_api/` - Tests functionality called from within worker fibers
+ - `spec/scraper_utils/main_fiber/` - Tests functionality called from the main fiber's perspective
+ - `spec/scraper_utils/thread_api/` - Tests functionality called from within worker threads
+
+ ### Utility Classes
+ - `spec/scraper_utils/mechanize_utils/` - Tests for `lib/scraper_utils/mechanize_utils/*.rb` files
+ - `spec/scraper_utils/scheduler/` - Tests for `lib/scraper_utils/scheduler/*.rb` files
+ - `spec/scraper_utils/scheduler2/` - FIXME: remove duplicate tests and merge to `spec/scraper_utils/scheduler/` unless > 200 lines
+
+ ### Integration vs Unit Tests
+ - `spec/scraper_utils/integration/` - Tests that focus on the integration between components
+   - Name tests after the most "parent-like" class of the components involved
+
+ ### Special Configuration Directories
+ These specs check the options we use when things go wrong in production
+
+ - `spec/scraper_utils/no_threads/` - Tests with threads disabled (`MORPH_DISABLE_THREADS=1`)
+ - `spec/scraper_utils/no_fibers/` - Tests with fibers disabled (`MORPH_MAX_WORKERS=0`)
+ - `spec/scraper_utils/sequential/` - Tests with exactly one worker (`MORPH_MAX_WORKERS=1`)
+
+ ### Directories to break up large specs
+ Keep specs less than 200 lines long
+
+ - `spec/scraper_utils/replacements` - Tests for replacements in MechanizeActions
+ - `spec/scraper_utils/replacements2` - FIXME: remove duplicate tests and merge to `spec/scraper_utils/replacements/`?
+ - `spec/scraper_utils/selectors` - Tests the various node selectors available in MechanizeActions
+ - `spec/scraper_utils/selectors2` - FIXME: remove duplicate tests and merge to `spec/scraper_utils/selectors/`?
+
+ ### General Testing Guidelines
+ - Respect fiber and thread context validation - never mock the objects under test
+ - Structure tests to run in the appropriate fiber context
+ - Use real fibers, threads and operations rather than excessive mocking
+ - Ensure proper cleanup of resources in both success and error paths
+ - ASK when unsure which (yard doc, spec or code) is wrong as I don't always follow the "write specs first" strategy
data/README.md CHANGED
@@ -9,28 +9,30 @@ For Server Administrators
  The ScraperUtils library is designed to be a respectful citizen of the web. If you're a server administrator and notice
  our scraper accessing your systems, here's what you should know:

- ### How to Control Our Behavior
-
- Our scraper utilities respect the standard server **robots.txt** control mechanisms (by default).
- To control our access:
-
- - Add a section for our user agent: `User-agent: ScraperUtils` (default)
- - Set a crawl delay, eg: `Crawl-delay: 20`
- - If needed specify disallowed paths: `Disallow: /private/`
-
  ### We play nice with your servers

  Our goal is to access public planning information with minimal impact on your services. The following features are on by
  default:

+ - **Limit server load**:
+   - We limit the max load we present to your server to well less than a third of a single cpu
+   - The more loaded your server is, the longer we wait between requests
+   - We respect Crawl-delay from robots.txt (see section below), so you can tell us an acceptable rate
+   - Scraper developers can
+     - reduce the max_load we present to your server even lower
+     - add random extra delays to give your server a chance to catch up with background tasks
+
  - **Identify themselves**: Our user agent clearly indicates who we are and provides a link to the project repository:
  `Mozilla/5.0 (compatible; ScraperUtils/0.2.0 2025-02-22; +https://github.com/ianheggie-oaf/scraper_utils)`

- - **Limit server load**:
-   - We wait double your response time before making another request to avoid being a significant load on your server
-   - We also randomly add extra delays to give your server a chance to catch up with background tasks
+ ### How to Control Our Behavior
+
+ Our scraper utilities respect the standard server **robots.txt** control mechanisms (by default).
+ To control our access:

- We also provide scraper developers other features to reduce overall load as well.
+ - Add a section for our user agent: `User-agent: ScraperUtils`
+ - Set a crawl delay, eg: `Crawl-delay: 20`
+ - If needed specify disallowed paths: `Disallow: /private/`

  For Scraper Developers
  ----------------------
@@ -40,14 +42,15 @@ mentioned above.

  ## Installation & Configuration

- Add to your [scraper's](https://www.planningalerts.org.au/how_to_write_a_scraper) Gemfile:
+ Add to [your scraper's](https://www.planningalerts.org.au/how_to_write_a_scraper) Gemfile:

  ```ruby
  gem "scraperwiki", git: "https://github.com/openaustralia/scraperwiki-ruby.git", branch: "morph_defaults"
  gem 'scraper_utils'
  ```

- For detailed setup and configuration options, see the [Getting Started guide](docs/getting_started.md).
+ For detailed setup and configuration options,
+ see {file:docs/getting_started.md Getting Started guide}

  ## Key Features

@@ -57,20 +60,23 @@ For detailed setup and configuration options, see the [Getting Started guide](do
  - Automatic rate limiting based on server response times
  - Supports robots.txt and crawl-delay directives
  - Supports extra actions required to get to results page
- - [Learn more about Mechanize utilities](docs/mechanize_utilities.md)
+ - {file:docs/mechanize_utilities.md Learn more about Mechanize utilities}

  ### Optimize Server Load

  - Intelligent date range selection (reduce server load by up to 60%)
  - Cycle utilities for rotating search parameters
- - [Learn more about reducing server load](docs/reducing_server_load.md)
+ - {file:docs/reducing_server_load.md Learn more about reducing server load}

  ### Improve Scraper Efficiency

- - Interleave requests to optimize run time
- - [Learn more about interleaving requests](docs/interleaving_requests.md)
+ - Interleaves requests to optimize run time
+ - {file:docs/interleaving_requests.md Learn more about interleaving requests}
+ - Use {ScraperUtils::Scheduler.execute_request} so Mechanize network requests will be performed by threads in parallel
+ - {file:docs/parallel_requests.md Parallel Request} - see Usage section for installation instructions
  - Randomize processing order for more natural request patterns
- - [Learn more about randomizing requests](docs/randomizing_requests.md)
+ - {file:docs/randomizing_requests.md Learn more about randomizing requests} - see Usage section for installation
+   instructions

  ### Error Handling & Quality Monitoring

@@ -82,11 +88,11 @@ For detailed setup and configuration options, see the [Getting Started guide](do

  - Enhanced debugging utilities
  - Simple logging with authority context
- - [Learn more about debugging](docs/debugging.md)
+ - {file:docs/debugging.md Learn more about debugging}

  ## API Documentation

- Complete API documentation is available at [RubyDoc.info](https://rubydoc.info/gems/scraper_utils).
+ Complete API documentation is available at [scraper_utils | RubyDoc.info](https://rubydoc.info/gems/scraper_utils).

  ## Ruby Versions

@@ -105,7 +111,7 @@ To install this gem onto your local machine, run `bundle exec rake install`.
  ## Contributing

  Bug reports and pull requests with working tests are welcome
- on [GitHub](https://github.com/ianheggie-oaf/scraper_utils).
+ on [ianheggie-oaf/scraper_utils | GitHub](https://github.com/ianheggie-oaf/scraper_utils).

  ## License

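
Pulling together the robots.txt directives from the "How to Control Our Behavior" hunk above, a server administrator's entry might look like the following (the values shown are simply the README's own examples):

```text
User-agent: ScraperUtils
Crawl-delay: 20
Disallow: /private/
```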
data/SPECS.md CHANGED
@@ -6,7 +6,13 @@ installation and usage notes in `README.md`.

  ASK for clarification of any apparent conflicts with IMPLEMENTATION, GUIDELINES or project instructions.

- ## Core Design Principles
+ Core Design Principles
+ ----------------------
+
+ ## Coding Style and Complexity
+ - KISS (Keep it Simple and Stupid) is a guiding principle:
+   - Simple: Design and implement with as little complexity as possible while still achieving the desired functionality
+   - Stupid: Should be easy to diagnose and repair with basic tooling

  ### Error Handling
  - Record-level errors abort only that record's processing
@@ -23,3 +29,9 @@ ASK for clarification of any apparent conflicts with IMPLEMENTATION, GUIDELINES
  - Ensure components are independently testable
  - Avoid timing-based tests in favor of logic validation
  - Keep test scenarios focused and under 20 lines
+
+ #### Fiber and Thread Testing
+ - Test in appropriate fiber/thread context using API-specific directories
+ - Validate cooperative concurrency with real fibers rather than mocks
+ - Ensure tests for each context: main fiber, worker fibers, and various thread configurations
+ - Test special configurations (no threads, no fibers, sequential) in dedicated directories
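
As a minimal illustration of "validate cooperative concurrency with real fibers rather than mocks", a hypothetical spec that drives a plain Ruby core Fiber instead of stubbing one (it deliberately avoids the gem's own classes):

```ruby
RSpec.describe "running code in a real worker fiber" do
  it "executes the block inside its own fiber, not the main fiber" do
    executed_in = nil
    worker = Fiber.new { executed_in = Fiber.current }

    worker.resume  # actually run the block rather than mocking Fiber

    expect(executed_in).to eq(worker)
    expect(executed_in).not_to eq(Fiber.current)
  end
end
```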
data/bin/rspec ADDED
@@ -0,0 +1,27 @@
+ #!/usr/bin/env ruby
+ # frozen_string_literal: true
+
+ #
+ # This file was generated by Bundler.
+ #
+ # The application 'rspec' is installed as part of a gem, and
+ # this file is here to facilitate running it.
+ #
+
+ ENV["BUNDLE_GEMFILE"] ||= File.expand_path("../Gemfile", __dir__)
+
+ bundle_binstub = File.expand_path("bundle", __dir__)
+
+ if File.file?(bundle_binstub)
+   if File.read(bundle_binstub, 300).include?("This file was generated by Bundler")
+     load(bundle_binstub)
+   else
+     abort("Your `bin/bundle` was not generated by Bundler, so this binstub cannot run.
+ Replace `bin/bundle` by running `bundle binstubs bundler --force`, then run this command again.")
+   end
+ end
+
+ require "rubygems"
+ require "bundler/setup"
+
+ load Gem.bin_path("rspec-core", "rspec")
data/docs/example_scrape_with_fibers.rb CHANGED
@@ -3,11 +3,11 @@
  # Example scrape method updated to use ScraperUtils::FibreScheduler

  def scrape(authorities, attempt)
- ScraperUtils::FiberScheduler.reset!
+ ScraperUtils::Scheduler.reset!
  exceptions = {}
  authorities.each do |authority_label|
- ScraperUtils::FiberScheduler.register_operation(authority_label) do
- ScraperUtils::FiberScheduler.log(
+ ScraperUtils::Scheduler.register_operation(authority_label) do
+ ScraperUtils::LogUtils.log(
  "Collecting feed data for #{authority_label}, attempt: #{attempt}..."
  )
  ScraperUtils::DataQualityMonitor.start_authority(authority_label)
@@ -26,6 +26,6 @@ def scrape(authorities, attempt)
  end
  # end of register_operation block
  end
- ScraperUtils::FiberScheduler.run_all
+ ScraperUtils::Scheduler.run_operations
  exceptions
  end
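
Since the hunks above only show the changed fragments, here is the whole method as it reads after this change, assembled from these hunks plus the matching block removed from interleaving_requests.md further down in this diff — a sketch of the documented example rather than a verbatim copy of the file:

```ruby
def scrape(authorities, attempt)
  ScraperUtils::Scheduler.reset!
  exceptions = {}
  authorities.each do |authority_label|
    ScraperUtils::Scheduler.register_operation(authority_label) do
      ScraperUtils::LogUtils.log(
        "Collecting feed data for #{authority_label}, attempt: #{attempt}..."
      )
      ScraperUtils::DataQualityMonitor.start_authority(authority_label)
      YourScraper.scrape(authority_label) do |record|
        record["authority_label"] = authority_label.to_s
        ScraperUtils::DbUtils.save_record(record)
      rescue ScraperUtils::UnprocessableRecord => e
        ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)
        exceptions[authority_label] = e
        # Continues processing other records
      end
    rescue StandardError => e
      warn "#{authority_label}: ERROR: #{e}"
      warn e.backtrace || "No backtrace available"
      exceptions[authority_label] = e
    end
    # end of register_operation block
  end
  ScraperUtils::Scheduler.run_operations
  exceptions
end
```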
data/docs/example_scraper.rb CHANGED
@@ -4,7 +4,7 @@
  $LOAD_PATH << "./lib"

  require "scraper_utils"
- require "technology_one_scraper"
+ require "your_scraper"

  # Main Scraper class
  class Scraper
@@ -17,26 +17,18 @@ class Scraper
  authorities.each do |authority_label|
  puts "\nCollecting feed data for #{authority_label}, attempt: #{attempt}..."

- begin
- # REPLACE:
- # YourScraper.scrape(authority_label) do |record|
- # record["authority_label"] = authority_label.to_s
- # YourScraper.log(record)
- # ScraperWiki.save_sqlite(%w[authority_label council_reference], record)
- # end
- # WITH:
- ScraperUtils::DataQualityMonitor.start_authority(authority_label)
- YourScraper.scrape(authority_label) do |record|
- begin
- record["authority_label"] = authority_label.to_s
- ScraperUtils::DbUtils.save_record(record)
- rescue ScraperUtils::UnprocessableRecord => e
- ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)
- exceptions[authority_label] = e
- end
+ # REPLACE section with:
+ ScraperUtils::DataQualityMonitor.start_authority(authority_label)
+ YourScraper.scrape(authority_label) do |record|
+ begin
+ record["authority_label"] = authority_label.to_s
+ ScraperUtils::DbUtils.save_record(record)
+ rescue ScraperUtils::UnprocessableRecord => e
+ ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)
+ exceptions[authority_label] = e
  end
- # END OF REPLACE
  end
+ # END OF REPLACE
  rescue StandardError => e
  warn "#{authority_label}: ERROR: #{e}"
  warn e.backtrace
@@ -86,8 +78,9 @@ end

  if __FILE__ == $PROGRAM_NAME
  # Default to list of authorities we can't or won't fix in code, explain why
- # wagga: url redirects and then reports Application error
+ # some: url-for-issue Summary Reason
+ # councils : url-for-issue Summary Reason

- ENV["MORPH_EXPECT_BAD"] ||= "wagga"
+ ENV["MORPH_EXPECT_BAD"] ||= "some,councils"
  Scraper.run(Scraper.selected_authorities)
  end
data/docs/fibers_and_threads.md ADDED
@@ -0,0 +1,72 @@
+ Fibers and Threads
+ ==================
+
+ This sequence diagram supplements the notes on the {ScraperUtils::Scheduler} class and is intended to help show
+ the passing of messages and control between the fibers and threads.
+
+ * To keep things simple I have only shown the Fibers and Threads and not all the other calls like to the
+   OperationRegistry to lookup the current operation, or OperationWorker etc.
+ * There is ONE (global) response queue, which is monitored by the Scheduler.run_operations loop in the main Fiber
+ * Each authority has ONE OperationWorker (not shown), which has ONE Fiber, ONE Thread, ONE request queue.
+ * I use "◀─▶" to indicate a call and response, and "║" for which fiber / object is currently running.
+
+ ```text
+
+ SCHEDULER (Main Fiber)
+ NxRegister-operation RESPONSE.Q
+ ║──creates────────◀─▶┐
+ │ │ FIBER (runs block passed to register_operation)
+ ║──creates──────────────────◀─▶┐ WORKER object and Registry
+ ║──registers(fiber)───────────────────────▶┐ REQUEST-Q
+ │ │ │ ║──creates─◀─▶┐ THREAD
+ │ │ │ ║──creates───────────◀─▶┐
+ ║◀─────────────────────────────────────────┘ ║◀──pop───║ ...[block waiting for request]
+ ║ │ │ │ ║ │
+ run_operations │ │ │ ║ │
+ ║──pop(non block)─◀─▶│ │ │ ║ │ ...[no responses yet]
+ ║ │ │ │ ║ │
+ ║───resumes-next─"can_resume"─────────────▶┐ ║ │
+ │ │ │ ║ ║ │
+ │ │ ║◀──resume──┘ ║ │ ...[first Resume passes true]
+ │ │ ║ │ ║ │ ...[initialise scraper]
+ ```
+ **REPEATS FROM HERE**
+ ```text
+ SCHEDULER RESPONSE.Q FIBER WORKER REQUEST.Q THREAD
+ │ │ ║──request─▶┐ ║ │
+ │ │ │ ║──push req ─▶║ │
+ │ │ ║◀──────────┘ ║──req───▶║
+ ║◀──yields control─(waiting)───┘ │ │ ║
+ ║ │ │ │ │ ║ ...[Executes network I/O request]
+ ║ │ │ │ │ ║
+ ║───other-resumes... │ │ │ │ ║ ...[Other Workers will be resumed
+ ║ │ │ │ │ ║ till most 99% are waiting on
+ ║───lots of │ │ │ │ ║ responses from their threads
+ ║ short sleeps ║◀──pushes response───────────────────────────┘
+ ║ ║ │ │ ║◀──pop───║ ...[block waiting for request]
+ ║──pop(response)──◀─▶║ │ │ ║ │
+ ║ │ │ │ ║ │
+ ║──saves─response───────────────────────◀─▶│ ║ │
+ ║ │ │ │ ║ │
+ ║───resumes-next─"can_resume"─────────────▶┐ ║ │
+ │ │ │ ║ ║ │
+ │ │ ║◀──resume──┘ ║ │ ...[Resume passes response]
+ │ │ ║ │ ║ │
+ │ │ ║ │ ║ │ ...[Process Response]
+ ```
+ **REPEATS TO HERE** - WHEN FIBER FINISHES, instead it:
+ ```text
+ SCHEDULER RESPONSE.Q FIBER WORKER REQUEST.Q THREAD
+ │ │ ║ │ ║ │
+ │ │ ║─deregister─▶║ ║ │
+ │ │ │ ║──close───▶║ │
+ │ │ │ ║ ║──nil───▶┐
+ │ │ │ ║ │ ║ ... [thread exits]
+ │ │ │ ║──join────────────◀─▶┘
+ │ │ │ ║ ....... [worker removes
+ │ │ │ ║ itself from registry]
+ │ │ ║◀──returns───┘
+ │◀──returns─nil────────────────┘
+ │ │
+ ```
+ When the last fiber finishes and the registry is empty, then the response queue is also removed
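
The column alignment of the diagram above was lost in extraction, so here is a minimal, runnable Ruby sketch of the same message flow — one global response queue, and one request queue plus one thread per worker fiber. It only illustrates the pattern; it is not the gem's Scheduler/OperationWorker code, and all names are made up:

```ruby
# One global response queue, polled by the main fiber (Scheduler.run_operations in the gem)
response_queue = Queue.new

# Build one worker: its own request queue, a thread doing the blocking work,
# and a fiber that yields while the thread is busy
def build_worker(authority, response_queue)
  request_queue = Queue.new
  thread = Thread.new do
    while (request = request_queue.pop)                 # nil closes the thread
      response_queue << [authority, "#{authority} result for #{request.inspect}"]
    end
  end
  Fiber.new do
    request_queue << "GET /planning"                    # stand-in for a network request
    response = Fiber.yield                              # hand control back to the main loop
    puts "#{authority}: processed #{response.inspect}"
    request_queue << nil                                # tell the thread to finish
    thread.join
  end
end

workers = {}
%i[authority_a authority_b].each do |authority|
  workers[authority] = build_worker(authority, response_queue)
  workers[authority].resume                             # runs until the fiber yields, waiting on its thread
end

# Main loop: pop each response and resume the matching fiber until all workers finish
until workers.empty?
  authority, result = response_queue.pop
  workers.delete(authority).resume(result)
end
```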
data/docs/getting_started.md CHANGED
@@ -54,92 +54,21 @@ export DEBUG=1 # for basic, or 2 for verbose or 3 for tracing nearly everything

  ## Example Scraper Implementation

- Update your `scraper.rb` as follows:
+ Update your `scraper.rb` as per {file:example_scraper.rb example scraper}

- ```ruby
- #!/usr/bin/env ruby
- # frozen_string_literal: true
-
- $LOAD_PATH << "./lib"
-
- require "scraper_utils"
- require "your_scraper"
-
- # Main Scraper class
- class Scraper
- AUTHORITIES = YourScraper::AUTHORITIES
-
- def scrape(authorities, attempt)
- exceptions = {}
- authorities.each do |authority_label|
- puts "\nCollecting feed data for #{authority_label}, attempt: #{attempt}..."
-
- begin
- ScraperUtils::DataQualityMonitor.start_authority(authority_label)
- YourScraper.scrape(authority_label) do |record|
- begin
- record["authority_label"] = authority_label.to_s
- ScraperUtils::DbUtils.save_record(record)
- rescue ScraperUtils::UnprocessableRecord => e
- ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)
- exceptions[authority_label] = e
- end
- end
- rescue StandardError => e
- warn "#{authority_label}: ERROR: #{e}"
- warn e.backtrace
- exceptions[authority_label] = e
- end
- end
-
- exceptions
- end
-
- def self.selected_authorities
- ScraperUtils::AuthorityUtils.selected_authorities(AUTHORITIES.keys)
- end
-
- def self.run(authorities)
- puts "Scraping authorities: #{authorities.join(', ')}"
- start_time = Time.now
- exceptions = new.scrape(authorities, 1)
- ScraperUtils::LogUtils.log_scraping_run(
- start_time,
- 1,
- authorities,
- exceptions
- )
-
- unless exceptions.empty?
- puts "\n***************************************************"
- puts "Now retrying authorities which earlier had failures"
- puts exceptions.keys.join(", ").to_s
- puts "***************************************************"
-
- start_time = Time.now
- exceptions = new.scrape(exceptions.keys, 2)
- ScraperUtils::LogUtils.log_scraping_run(
- start_time,
- 2,
- authorities,
- exceptions
- )
- end
-
- ScraperUtils::LogUtils.report_on_results(authorities, exceptions)
- end
- end
-
- if __FILE__ == $PROGRAM_NAME
- ENV["MORPH_EXPECT_BAD"] ||= "wagga"
- Scraper.run(Scraper.selected_authorities)
- end
- ```
+ For more advanced implementations, see the {file:interleaving_requests.md Interleaving Requests documentation}.
+
+ ## Logging Tables
+
+ The following logging tables are created for use in monitoring failure patterns and debugging issues.
+ Records are automatically cleared after 30 days.
+
+ The `ScraperUtils::LogUtils.log_scraping_run` call also logs the information to the `scrape_log` table.

- For more advanced implementations, see the [Interleaving Requests documentation](interleaving_requests.md).
+ The `ScraperUtils::LogUtils.save_summary_record` call also logs the information to the `scrape_summary` table.

  ## Next Steps

- - [Reducing Server Load](reducing_server_load.md)
- - [Mechanize Utilities](mechanize_utilities.md)
- - [Debugging](debugging.md)
+ - {file:reducing_server_load.md Reducing Server Load}
+ - {file:mechanize_utilities.md Mechanize Utilities}
+ - {file:debugging.md Debugging}
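
For reference, the call that feeds the `scrape_log` table appears in the example scraper code removed above; a fragment of that usage (assuming `authorities` and the `exceptions` hash returned by `scrape`, as in `Scraper.run`):

```ruby
start_time = Time.now
exceptions = new.scrape(authorities, 1)
# Also writes per-authority records to the scrape_log table (cleared after 30 days)
ScraperUtils::LogUtils.log_scraping_run(start_time, 1, authorities, exceptions)
```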
data/docs/interleaving_requests.md CHANGED
@@ -1,6 +1,6 @@
- # Interleaving Requests with FiberScheduler
+ # Interleaving Requests with Scheduler

- The `ScraperUtils::FiberScheduler` provides a lightweight utility that:
+ The `ScraperUtils::Scheduler` provides a lightweight utility that:

  * Works on other authorities while in the delay period for an authority's next request
  * Optimizes the total scraper run time
@@ -11,51 +11,22 @@ The `ScraperUtils::FiberScheduler` provides a lightweight utility that:

  ## Implementation

- To enable fiber scheduling, change your scrape method to follow this pattern:
+ To enable fiber scheduling, change your scrape method as per
+ {example_scrape_with_fibers.rb example scrape with fibers}

- ```ruby
- def scrape(authorities, attempt)
- ScraperUtils::FiberScheduler.reset!
- exceptions = {}
- authorities.each do |authority_label|
- ScraperUtils::FiberScheduler.register_operation(authority_label) do
- ScraperUtils::FiberScheduler.log(
- "Collecting feed data for #{authority_label}, attempt: #{attempt}..."
- )
- ScraperUtils::DataQualityMonitor.start_authority(authority_label)
- YourScraper.scrape(authority_label) do |record|
- record["authority_label"] = authority_label.to_s
- ScraperUtils::DbUtils.save_record(record)
- rescue ScraperUtils::UnprocessableRecord => e
- ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)
- exceptions[authority_label] = e
- # Continues processing other records
- end
- rescue StandardError => e
- warn "#{authority_label}: ERROR: #{e}"
- warn e.backtrace || "No backtrace available"
- exceptions[authority_label] = e
- end
- # end of register_operation block
- end
- ScraperUtils::FiberScheduler.run_all
- exceptions
- end
- ```
-
- ## Logging with FiberScheduler
+ ## Logging with Scheduler

- Use `ScraperUtils::FiberScheduler.log` instead of `puts` when logging within the authority processing code.
+ Use {ScraperUtils::LogUtils.log} instead of `puts` when logging within the authority processing code.
  This will prefix the output lines with the authority name, which is needed since the system will interleave the work and
  thus the output.

  ## Testing Considerations

- This uses `ScraperUtils::RandomizeUtils` for determining the order of operations. Remember to add the following line to
+ This uses {ScraperUtils::RandomizeUtils} for determining the order of operations. Remember to add the following line to
  `spec/spec_helper.rb`:

  ```ruby
  ScraperUtils::RandomizeUtils.sequential = true
  ```

- For full details, see the [FiberScheduler class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/FiberScheduler).
+ For full details, see the {Scheduler}.