scraper_utils 0.5.0 → 0.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (40)
  1. checksums.yaml +4 -4
  2. data/.yardopts +5 -0
  3. data/CHANGELOG.md +11 -0
  4. data/GUIDELINES.md +2 -1
  5. data/Gemfile +1 -0
  6. data/IMPLEMENTATION.md +40 -0
  7. data/README.md +29 -23
  8. data/SPECS.md +13 -1
  9. data/bin/rspec +27 -0
  10. data/docs/example_scrape_with_fibers.rb +4 -4
  11. data/docs/example_scraper.rb +14 -21
  12. data/docs/fibers_and_threads.md +72 -0
  13. data/docs/getting_started.md +13 -84
  14. data/docs/interleaving_requests.md +8 -37
  15. data/docs/parallel_requests.md +138 -0
  16. data/docs/randomizing_requests.md +12 -8
  17. data/docs/reducing_server_load.md +6 -6
  18. data/lib/scraper_utils/data_quality_monitor.rb +2 -3
  19. data/lib/scraper_utils/date_range_utils.rb +37 -78
  20. data/lib/scraper_utils/debug_utils.rb +5 -5
  21. data/lib/scraper_utils/log_utils.rb +15 -0
  22. data/lib/scraper_utils/mechanize_actions.rb +37 -8
  23. data/lib/scraper_utils/mechanize_utils/adaptive_delay.rb +79 -0
  24. data/lib/scraper_utils/mechanize_utils/agent_config.rb +31 -30
  25. data/lib/scraper_utils/mechanize_utils/robots_checker.rb +151 -0
  26. data/lib/scraper_utils/mechanize_utils.rb +8 -5
  27. data/lib/scraper_utils/randomize_utils.rb +22 -19
  28. data/lib/scraper_utils/scheduler/constants.rb +12 -0
  29. data/lib/scraper_utils/scheduler/operation_registry.rb +101 -0
  30. data/lib/scraper_utils/scheduler/operation_worker.rb +199 -0
  31. data/lib/scraper_utils/scheduler/process_request.rb +59 -0
  32. data/lib/scraper_utils/scheduler/thread_request.rb +51 -0
  33. data/lib/scraper_utils/scheduler/thread_response.rb +59 -0
  34. data/lib/scraper_utils/scheduler.rb +286 -0
  35. data/lib/scraper_utils/version.rb +1 -1
  36. data/lib/scraper_utils.rb +11 -14
  37. metadata +16 -6
  38. data/lib/scraper_utils/adaptive_delay.rb +0 -70
  39. data/lib/scraper_utils/fiber_scheduler.rb +0 -229
  40. data/lib/scraper_utils/robots_checker.rb +0 -149
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 63a24c24b497494b79c4d7e12f04a1bd2555068f37f50389f3906c0033817d7e
- data.tar.gz: 6d6b96112dc3e2f9dc5a54de6318a544c240c0e3d5246ab4178c07346d0de7dc
+ metadata.gz: 13ad14102f284c98d658bb928bcf7806ea7594326d11c5426903ebc6b1f919e0
+ data.tar.gz: 260cd94a1b76e9851f5af47f716dc754386b081f3923ee0a0eb6fb2b2d086c4f
  SHA512:
- metadata.gz: eda8d10d996d51b7ef1d2610e21da31390c10dd29f4daa70bd5d9c3c8dc6eb9bed651803ccd6a59f53b03dae4fcd1ea016802e693f8828f4a13b92e07a0b046e
- data.tar.gz: eba2704a99c6599a2789ec573fa335d7939a63d0c27b06886d6e905cd785e2095d7d0307e7aa1195a1209e022340fa5d027a72ccca61a350590058e998355d5d
+ metadata.gz: 824b9e64ae7debdf9cddfc90b47de6dff7865e3f655ad6022f58181f38efd06788413e0251525410736e58d6fd325917a8b7a0ad6b2468fa7ab9de3b697955af
+ data.tar.gz: fd154118b2eaa22962f4343a3f62ca1daabc19de924a36e4c3cd4f61c9c7bb08a232554f10141b4e3534f3fa2e546a86f860d3bc7997f54d29a067cfb3c4f451
data/.yardopts ADDED
@@ -0,0 +1,5 @@
+ --files docs/*.rb
+ --files docs/*.md
+ --files CHANGELOG.md
+ --files LICENSE.txt
+ --readme README.md
data/CHANGELOG.md CHANGED
@@ -1,5 +1,16 @@
  # Changelog

+ ## 0.6.0 - 2025-03-16
+
+ * Add threads for more efficient scraping
+ * Adjust defaults for more efficient scraping, retaining just response-based delays by default
+ * Correct and simplify date range utilities so everything is checked at least every `max_period` days
+ * Release Candidate for v1.0.0, subject to testing in production
+
+ ## 0.5.1 - 2025-03-05
+
+ * Remove duplicated example code in docs
+
  ## 0.5.0 - 2025-03-05

  * Add action processing utility
data/GUIDELINES.md CHANGED
@@ -47,7 +47,8 @@ but if the file is bad, just treat it as missing.

  ## Testing Strategies

- * Avoid mocking unless really needed, instead
+ * AVOID mocking unless really needed (and REALLY avoid mocking your own code), instead
+ * Consider if you can change your own code, whilst keeping it simple, to make it easier to test
  * instantiate a real object to use in the test
  * use mocking facilities provided by the gem (eg Mechanize, Aws etc)
  * use integration tests with WebMock for simple external sites or VCR for more complex.
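
As an aside on the WebMock guideline above, a minimal sketch of such an integration spec — the file name, URL and HTML are hypothetical, not part of the gem:

```ruby
# spec/integration/simple_site_spec.rb - hypothetical illustration only
require "webmock/rspec"
require "mechanize"

RSpec.describe "scraping a simple external site" do
  it "parses the stubbed page with a real Mechanize agent" do
    # Stub the single HTTP call instead of mocking our own code
    stub_request(:get, "https://example.com/planning")
      .to_return(status: 200,
                 headers: { "Content-Type" => "text/html" },
                 body: "<html><body><h1>Applications</h1></body></html>")

    page = Mechanize.new.get("https://example.com/planning")

    expect(page.at("h1").text).to eq("Applications")
  end
end
```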
data/Gemfile CHANGED
@@ -32,5 +32,6 @@ gem "simplecov", platform && (platform == :heroku16 ? "~> 0.18.0" : "~> 0.22.0")
  gem "simplecov-console"
  gem "terminal-table"
  gem "webmock", platform && (platform == :heroku16 ? "~> 3.14.0" : "~> 3.19.0")
+ gem "yard"

  gemspec
data/IMPLEMENTATION.md CHANGED
@@ -31,3 +31,43 @@ puts "Pre Connect request: #{request.inspect}" if ENV["DEBUG"]
  - Externalize configuration to improve testability
  - Keep shared logic in the main class
  - Decisions / information specific to just one class, can be documented there, otherwise it belongs here
+
+ ## Testing Directory Structure
+
+ Our test directory structure reflects various testing strategies and aspects of the codebase:
+
+ ### API Context Directories
+ - `spec/scraper_utils/fiber_api/` - Tests functionality called from within worker fibers
+ - `spec/scraper_utils/main_fiber/` - Tests functionality called from the main fiber's perspective
+ - `spec/scraper_utils/thread_api/` - Tests functionality called from within worker threads
+
+ ### Utility Classes
+ - `spec/scraper_utils/mechanize_utils/` - Tests for `lib/scraper_utils/mechanize_utils/*.rb` files
+ - `spec/scraper_utils/scheduler/` - Tests for `lib/scraper_utils/scheduler/*.rb` files
+ - `spec/scraper_utils/scheduler2/` - FIXME: remove duplicate tests and merge to `spec/scraper_utils/scheduler/` unless > 200 lines
+
+ ### Integration vs Unit Tests
+ - `spec/scraper_utils/integration/` - Tests that focus on the integration between components
+   - Name tests after the most "parent-like" class of the components involved
+
+ ### Special Configuration Directories
+ These specs check the options we use when things go wrong in production
+
+ - `spec/scraper_utils/no_threads/` - Tests with threads disabled (`MORPH_DISABLE_THREADS=1`)
+ - `spec/scraper_utils/no_fibers/` - Tests with fibers disabled (`MORPH_MAX_WORKERS=0`)
+ - `spec/scraper_utils/sequential/` - Tests with exactly one worker (`MORPH_MAX_WORKERS=1`)
+
+ ### Directories to break up large specs
+ Keep specs less than 200 lines long
+
+ - `spec/scraper_utils/replacements` - Tests for replacements in MechanizeActions
+ - `spec/scraper_utils/replacements2` - FIXME: remove duplicate tests and merge to `spec/scraper_utils/replacements/`?
+ - `spec/scraper_utils/selectors` - Tests the various node selectors available in MechanizeActions
+ - `spec/scraper_utils/selectors2` - FIXME: remove duplicate tests and merge to `spec/scraper_utils/selectors/`?
+
+ ### General Testing Guidelines
+ - Respect fiber and thread context validation - never mock the objects under test
+ - Structure tests to run in the appropriate fiber context
+ - Use real fibers, threads and operations rather than excessive mocking
+ - Ensure proper cleanup of resources in both success and error paths
+ - ASK when unsure which (yard doc, spec or code) is wrong as I don't always follow the "write specs first" strategy
data/README.md CHANGED
@@ -9,28 +9,30 @@ For Server Administrators
  The ScraperUtils library is designed to be a respectful citizen of the web. If you're a server administrator and notice
  our scraper accessing your systems, here's what you should know:

- ### How to Control Our Behavior
-
- Our scraper utilities respect the standard server **robots.txt** control mechanisms (by default).
- To control our access:
-
- - Add a section for our user agent: `User-agent: ScraperUtils` (default)
- - Set a crawl delay, eg: `Crawl-delay: 20`
- - If needed specify disallowed paths: `Disallow: /private/`
-
  ### We play nice with your servers

  Our goal is to access public planning information with minimal impact on your services. The following features are on by
  default:

+ - **Limit server load**:
+   - We limit the max load we present to your server to well less than a third of a single cpu
+   - The more loaded your server is, the longer we wait between requests
+   - We respect Crawl-delay from robots.txt (see section below), so you can tell us an acceptable rate
+   - Scraper developers can
+     - reduce the max_load we present to your server even lower
+     - add random extra delays to give your server a chance to catch up with background tasks
+
  - **Identify themselves**: Our user agent clearly indicates who we are and provides a link to the project repository:
  `Mozilla/5.0 (compatible; ScraperUtils/0.2.0 2025-02-22; +https://github.com/ianheggie-oaf/scraper_utils)`

- - **Limit server load**:
-   - We wait double your response time before making another request to avoid being a significant load on your server
-   - We also randomly add extra delays to give your server a chance to catch up with background tasks
+ ### How to Control Our Behavior
+
+ Our scraper utilities respect the standard server **robots.txt** control mechanisms (by default).
+ To control our access:

- We also provide scraper developers other features to reduce overall load as well.
+ - Add a section for our user agent: `User-agent: ScraperUtils`
+ - Set a crawl delay, eg: `Crawl-delay: 20`
+ - If needed specify disallowed paths: `Disallow: /private/`

  For Scraper Developers
  ----------------------
@@ -40,14 +42,15 @@ mentioned above.

  ## Installation & Configuration

- Add to your [scraper's](https://www.planningalerts.org.au/how_to_write_a_scraper) Gemfile:
+ Add to [your scraper's](https://www.planningalerts.org.au/how_to_write_a_scraper) Gemfile:

  ```ruby
  gem "scraperwiki", git: "https://github.com/openaustralia/scraperwiki-ruby.git", branch: "morph_defaults"
  gem 'scraper_utils'
  ```

- For detailed setup and configuration options, see the [Getting Started guide](docs/getting_started.md).
+ For detailed setup and configuration options,
+ see {file:docs/getting_started.md Getting Started guide}

  ## Key Features

@@ -57,20 +60,23 @@ For detailed setup and configuration options, see the [Getting Started guide](do
  - Automatic rate limiting based on server response times
  - Supports robots.txt and crawl-delay directives
  - Supports extra actions required to get to results page
- - [Learn more about Mechanize utilities](docs/mechanize_utilities.md)
+ - {file:docs/mechanize_utilities.md Learn more about Mechanize utilities}

  ### Optimize Server Load

  - Intelligent date range selection (reduce server load by up to 60%)
  - Cycle utilities for rotating search parameters
- - [Learn more about reducing server load](docs/reducing_server_load.md)
+ - {file:docs/reducing_server_load.md Learn more about reducing server load}

  ### Improve Scraper Efficiency

- - Interleave requests to optimize run time
- - [Learn more about interleaving requests](docs/interleaving_requests.md)
+ - Interleaves requests to optimize run time
+ - {file:docs/interleaving_requests.md Learn more about interleaving requests}
+ - Use {ScraperUtils::Scheduler.execute_request} so Mechanize network requests will be performed by threads in parallel
+ - {file:docs/parallel_requests.md Parallel Request} - see Usage section for installation instructions
  - Randomize processing order for more natural request patterns
- - [Learn more about randomizing requests](docs/randomizing_requests.md)
+ - {file:docs/randomizing_requests.md Learn more about randomizing requests} - see Usage section for installation
+   instructions

  ### Error Handling & Quality Monitoring

@@ -82,11 +88,11 @@ For detailed setup and configuration options, see the [Getting Started guide](do

  - Enhanced debugging utilities
  - Simple logging with authority context
- - [Learn more about debugging](docs/debugging.md)
+ - {file:docs/debugging.md Learn more about debugging}

  ## API Documentation

- Complete API documentation is available at [RubyDoc.info](https://rubydoc.info/gems/scraper_utils).
+ Complete API documentation is available at [scraper_utils | RubyDoc.info](https://rubydoc.info/gems/scraper_utils).

  ## Ruby Versions

@@ -105,7 +111,7 @@ To install this gem onto your local machine, run `bundle exec rake install`.
  ## Contributing

  Bug reports and pull requests with working tests are welcome
- on [GitHub](https://github.com/ianheggie-oaf/scraper_utils).
+ on [ianheggie-oaf/scraper_utils | GitHub](https://github.com/ianheggie-oaf/scraper_utils).

  ## License

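
Pulling together the robots.txt directives from the "How to Control Our Behavior" hunk above, a server administrator's entry might look like the following (the values shown are simply the README's own examples):

```text
User-agent: ScraperUtils
Crawl-delay: 20
Disallow: /private/
```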
data/SPECS.md CHANGED
@@ -6,7 +6,13 @@ installation and usage notes in `README.md`.

  ASK for clarification of any apparent conflicts with IMPLEMENTATION, GUIDELINES or project instructions.

- ## Core Design Principles
+ Core Design Principles
+ ----------------------
+
+ ## Coding Style and Complexity
+ - KISS (Keep it Simple and Stupid) is a guiding principle:
+   - Simple: Design and implement with as little complexity as possible while still achieving the desired functionality
+   - Stupid: Should be easy to diagnose and repair with basic tooling

  ### Error Handling
  - Record-level errors abort only that record's processing
@@ -23,3 +29,9 @@ ASK for clarification of any apparent conflicts with IMPLEMENTATION, GUIDELINES
  - Ensure components are independently testable
  - Avoid timing-based tests in favor of logic validation
  - Keep test scenarios focused and under 20 lines
+
+ #### Fiber and Thread Testing
+ - Test in appropriate fiber/thread context using API-specific directories
+ - Validate cooperative concurrency with real fibers rather than mocks
+ - Ensure tests for each context: main fiber, worker fibers, and various thread configurations
+ - Test special configurations (no threads, no fibers, sequential) in dedicated directories
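
As a minimal illustration of "validate cooperative concurrency with real fibers rather than mocks", a hypothetical spec that drives a plain Ruby core Fiber instead of stubbing one (it deliberately avoids the gem's own classes):

```ruby
RSpec.describe "running code in a real worker fiber" do
  it "executes the block inside its own fiber, not the main fiber" do
    executed_in = nil
    worker = Fiber.new { executed_in = Fiber.current }

    worker.resume  # actually run the block rather than mocking Fiber

    expect(executed_in).to eq(worker)
    expect(executed_in).not_to eq(Fiber.current)
  end
end
```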
data/bin/rspec ADDED
@@ -0,0 +1,27 @@
+ #!/usr/bin/env ruby
+ # frozen_string_literal: true
+
+ #
+ # This file was generated by Bundler.
+ #
+ # The application 'rspec' is installed as part of a gem, and
+ # this file is here to facilitate running it.
+ #
+
+ ENV["BUNDLE_GEMFILE"] ||= File.expand_path("../Gemfile", __dir__)
+
+ bundle_binstub = File.expand_path("bundle", __dir__)
+
+ if File.file?(bundle_binstub)
+   if File.read(bundle_binstub, 300).include?("This file was generated by Bundler")
+     load(bundle_binstub)
+   else
+     abort("Your `bin/bundle` was not generated by Bundler, so this binstub cannot run.
+ Replace `bin/bundle` by running `bundle binstubs bundler --force`, then run this command again.")
+   end
+ end
+
+ require "rubygems"
+ require "bundler/setup"
+
+ load Gem.bin_path("rspec-core", "rspec")
data/docs/example_scrape_with_fibers.rb CHANGED
@@ -3,11 +3,11 @@
  # Example scrape method updated to use ScraperUtils::FibreScheduler

  def scrape(authorities, attempt)
- ScraperUtils::FiberScheduler.reset!
+ ScraperUtils::Scheduler.reset!
  exceptions = {}
  authorities.each do |authority_label|
- ScraperUtils::FiberScheduler.register_operation(authority_label) do
- ScraperUtils::FiberScheduler.log(
+ ScraperUtils::Scheduler.register_operation(authority_label) do
+ ScraperUtils::LogUtils.log(
  "Collecting feed data for #{authority_label}, attempt: #{attempt}..."
  )
  ScraperUtils::DataQualityMonitor.start_authority(authority_label)
@@ -26,6 +26,6 @@ def scrape(authorities, attempt)
  end
  # end of register_operation block
  end
- ScraperUtils::FiberScheduler.run_all
+ ScraperUtils::Scheduler.run_operations
  exceptions
  end
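
Since the hunks above only show the changed fragments, here is the whole method as it reads after this change, assembled from these hunks plus the matching block removed from interleaving_requests.md further down in this diff — a sketch of the documented example rather than a verbatim copy of the file:

```ruby
def scrape(authorities, attempt)
  ScraperUtils::Scheduler.reset!
  exceptions = {}
  authorities.each do |authority_label|
    ScraperUtils::Scheduler.register_operation(authority_label) do
      ScraperUtils::LogUtils.log(
        "Collecting feed data for #{authority_label}, attempt: #{attempt}..."
      )
      ScraperUtils::DataQualityMonitor.start_authority(authority_label)
      YourScraper.scrape(authority_label) do |record|
        record["authority_label"] = authority_label.to_s
        ScraperUtils::DbUtils.save_record(record)
      rescue ScraperUtils::UnprocessableRecord => e
        ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)
        exceptions[authority_label] = e
        # Continues processing other records
      end
    rescue StandardError => e
      warn "#{authority_label}: ERROR: #{e}"
      warn e.backtrace || "No backtrace available"
      exceptions[authority_label] = e
    end
    # end of register_operation block
  end
  ScraperUtils::Scheduler.run_operations
  exceptions
end
```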
data/docs/example_scraper.rb CHANGED
@@ -4,7 +4,7 @@
  $LOAD_PATH << "./lib"

  require "scraper_utils"
- require "technology_one_scraper"
+ require "your_scraper"

  # Main Scraper class
  class Scraper
@@ -17,26 +17,18 @@ class Scraper
  authorities.each do |authority_label|
  puts "\nCollecting feed data for #{authority_label}, attempt: #{attempt}..."

- begin
- # REPLACE:
- # YourScraper.scrape(authority_label) do |record|
- # record["authority_label"] = authority_label.to_s
- # YourScraper.log(record)
- # ScraperWiki.save_sqlite(%w[authority_label council_reference], record)
- # end
- # WITH:
- ScraperUtils::DataQualityMonitor.start_authority(authority_label)
- YourScraper.scrape(authority_label) do |record|
- begin
- record["authority_label"] = authority_label.to_s
- ScraperUtils::DbUtils.save_record(record)
- rescue ScraperUtils::UnprocessableRecord => e
- ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)
- exceptions[authority_label] = e
- end
+ # REPLACE section with:
+ ScraperUtils::DataQualityMonitor.start_authority(authority_label)
+ YourScraper.scrape(authority_label) do |record|
+ begin
+ record["authority_label"] = authority_label.to_s
+ ScraperUtils::DbUtils.save_record(record)
+ rescue ScraperUtils::UnprocessableRecord => e
+ ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)
+ exceptions[authority_label] = e
  end
- # END OF REPLACE
  end
+ # END OF REPLACE
  rescue StandardError => e
  warn "#{authority_label}: ERROR: #{e}"
  warn e.backtrace
@@ -86,8 +78,9 @@ end

  if __FILE__ == $PROGRAM_NAME
  # Default to list of authorities we can't or won't fix in code, explain why
- # wagga: url redirects and then reports Application error
+ # some: url-for-issue Summary Reason
+ # councils : url-for-issue Summary Reason

- ENV["MORPH_EXPECT_BAD"] ||= "wagga"
+ ENV["MORPH_EXPECT_BAD"] ||= "some,councils"
  Scraper.run(Scraper.selected_authorities)
  end
data/docs/fibers_and_threads.md ADDED
@@ -0,0 +1,72 @@
+ Fibers and Threads
+ ==================
+
+ This sequence diagram supplements the notes on the {ScraperUtils::Scheduler} class and is intended to help show
+ the passing of messages and control between the fibers and threads.
+
+ * To keep things simple I have only shown the Fibers and Threads and not all the other calls like to the
+   OperationRegistry to lookup the current operation, or OperationWorker etc.
+ * There is ONE (global) response queue, which is monitored by the Scheduler.run_operations loop in the main Fiber
+ * Each authority has ONE OperationWorker (not shown), which has ONE Fiber, ONE Thread, ONE request queue.
+ * I use "◀─▶" to indicate a call and response, and "║" for which fiber / object is currently running.
+
+ ```text
+
+ SCHEDULER (Main Fiber)
+ NxRegister-operation RESPONSE.Q
+ ║──creates────────◀─▶┐
+ │ │ FIBER (runs block passed to register_operation)
+ ║──creates──────────────────◀─▶┐ WORKER object and Registry
+ ║──registers(fiber)───────────────────────▶┐ REQUEST-Q
+ │ │ │ ║──creates─◀─▶┐ THREAD
+ │ │ │ ║──creates───────────◀─▶┐
+ ║◀─────────────────────────────────────────┘ ║◀──pop───║ ...[block waiting for request]
+ ║ │ │ │ ║ │
+ run_operations │ │ │ ║ │
+ ║──pop(non block)─◀─▶│ │ │ ║ │ ...[no responses yet]
+ ║ │ │ │ ║ │
+ ║───resumes-next─"can_resume"─────────────▶┐ ║ │
+ │ │ │ ║ ║ │
+ │ │ ║◀──resume──┘ ║ │ ...[first Resume passes true]
+ │ │ ║ │ ║ │ ...[initialise scraper]
+ ```
+ **REPEATS FROM HERE**
+ ```text
+ SCHEDULER RESPONSE.Q FIBER WORKER REQUEST.Q THREAD
+ │ │ ║──request─▶┐ ║ │
+ │ │ │ ║──push req ─▶║ │
+ │ │ ║◀──────────┘ ║──req───▶║
+ ║◀──yields control─(waiting)───┘ │ │ ║
+ ║ │ │ │ │ ║ ...[Executes network I/O request]
+ ║ │ │ │ │ ║
+ ║───other-resumes... │ │ │ │ ║ ...[Other Workers will be resumed
+ ║ │ │ │ │ ║ till most 99% are waiting on
+ ║───lots of │ │ │ │ ║ responses from their threads
+ ║ short sleeps ║◀──pushes response───────────────────────────┘
+ ║ ║ │ │ ║◀──pop───║ ...[block waiting for request]
+ ║──pop(response)──◀─▶║ │ │ ║ │
+ ║ │ │ │ ║ │
+ ║──saves─response───────────────────────◀─▶│ ║ │
+ ║ │ │ │ ║ │
+ ║───resumes-next─"can_resume"─────────────▶┐ ║ │
+ │ │ │ ║ ║ │
+ │ │ ║◀──resume──┘ ║ │ ...[Resume passes response]
+ │ │ ║ │ ║ │
+ │ │ ║ │ ║ │ ...[Process Response]
+ ```
+ **REPEATS TO HERE** - WHEN FIBER FINISHES, instead it:
+ ```text
+ SCHEDULER RESPONSE.Q FIBER WORKER REQUEST.Q THREAD
+ │ │ ║ │ ║ │
+ │ │ ║─deregister─▶║ ║ │
+ │ │ │ ║──close───▶║ │
+ │ │ │ ║ ║──nil───▶┐
+ │ │ │ ║ │ ║ ... [thread exits]
+ │ │ │ ║──join────────────◀─▶┘
+ │ │ │ ║ ....... [worker removes
+ │ │ │ ║ itself from registry]
+ │ │ ║◀──returns───┘
+ │◀──returns─nil────────────────┘
+ │ │
+ ```
+ When the last fiber finishes and the registry is empty, then the response queue is also removed
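
The column alignment of the diagram above was lost in extraction, so here is a minimal, runnable Ruby sketch of the same message flow — one global response queue, and one request queue plus one thread per worker fiber. It only illustrates the pattern; it is not the gem's Scheduler/OperationWorker code, and all names are made up:

```ruby
# One global response queue, polled by the main fiber (Scheduler.run_operations in the gem)
response_queue = Queue.new

# Build one worker: its own request queue, a thread doing the blocking work,
# and a fiber that yields while the thread is busy
def build_worker(authority, response_queue)
  request_queue = Queue.new
  thread = Thread.new do
    while (request = request_queue.pop)                 # nil closes the thread
      response_queue << [authority, "#{authority} result for #{request.inspect}"]
    end
  end
  Fiber.new do
    request_queue << "GET /planning"                    # stand-in for a network request
    response = Fiber.yield                              # hand control back to the main loop
    puts "#{authority}: processed #{response.inspect}"
    request_queue << nil                                # tell the thread to finish
    thread.join
  end
end

workers = {}
%i[authority_a authority_b].each do |authority|
  workers[authority] = build_worker(authority, response_queue)
  workers[authority].resume                             # runs until the fiber yields, waiting on its thread
end

# Main loop: pop each response and resume the matching fiber until all workers finish
until workers.empty?
  authority, result = response_queue.pop
  workers.delete(authority).resume(result)
end
```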
data/docs/getting_started.md CHANGED
@@ -54,92 +54,21 @@ export DEBUG=1 # for basic, or 2 for verbose or 3 for tracing nearly everything

  ## Example Scraper Implementation

- Update your `scraper.rb` as follows:
+ Update your `scraper.rb` as per {file:example_scraper.rb example scraper}

- ```ruby
- #!/usr/bin/env ruby
- # frozen_string_literal: true
-
- $LOAD_PATH << "./lib"
-
- require "scraper_utils"
- require "your_scraper"
-
- # Main Scraper class
- class Scraper
- AUTHORITIES = YourScraper::AUTHORITIES
-
- def scrape(authorities, attempt)
- exceptions = {}
- authorities.each do |authority_label|
- puts "\nCollecting feed data for #{authority_label}, attempt: #{attempt}..."
-
- begin
- ScraperUtils::DataQualityMonitor.start_authority(authority_label)
- YourScraper.scrape(authority_label) do |record|
- begin
- record["authority_label"] = authority_label.to_s
- ScraperUtils::DbUtils.save_record(record)
- rescue ScraperUtils::UnprocessableRecord => e
- ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)
- exceptions[authority_label] = e
- end
- end
- rescue StandardError => e
- warn "#{authority_label}: ERROR: #{e}"
- warn e.backtrace
- exceptions[authority_label] = e
- end
- end
-
- exceptions
- end
-
- def self.selected_authorities
- ScraperUtils::AuthorityUtils.selected_authorities(AUTHORITIES.keys)
- end
-
- def self.run(authorities)
- puts "Scraping authorities: #{authorities.join(', ')}"
- start_time = Time.now
- exceptions = new.scrape(authorities, 1)
- ScraperUtils::LogUtils.log_scraping_run(
- start_time,
- 1,
- authorities,
- exceptions
- )
-
- unless exceptions.empty?
- puts "\n***************************************************"
- puts "Now retrying authorities which earlier had failures"
- puts exceptions.keys.join(", ").to_s
- puts "***************************************************"
-
- start_time = Time.now
- exceptions = new.scrape(exceptions.keys, 2)
- ScraperUtils::LogUtils.log_scraping_run(
- start_time,
- 2,
- authorities,
- exceptions
- )
- end
-
- ScraperUtils::LogUtils.report_on_results(authorities, exceptions)
- end
- end
-
- if __FILE__ == $PROGRAM_NAME
- ENV["MORPH_EXPECT_BAD"] ||= "wagga"
- Scraper.run(Scraper.selected_authorities)
- end
- ```
+ For more advanced implementations, see the {file:interleaving_requests.md Interleaving Requests documentation}.
+
+ ## Logging Tables
+
+ The following logging tables are created for use in monitoring failure patterns and debugging issues.
+ Records are automatically cleared after 30 days.
+
+ The `ScraperUtils::LogUtils.log_scraping_run` call also logs the information to the `scrape_log` table.

- For more advanced implementations, see the [Interleaving Requests documentation](interleaving_requests.md).
+ The `ScraperUtils::LogUtils.save_summary_record` call also logs the information to the `scrape_summary` table.

  ## Next Steps

- - [Reducing Server Load](reducing_server_load.md)
- - [Mechanize Utilities](mechanize_utilities.md)
- - [Debugging](debugging.md)
+ - {file:reducing_server_load.md Reducing Server Load}
+ - {file:mechanize_utilities.md Mechanize Utilities}
+ - {file:debugging.md Debugging}
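
For reference, the call that feeds the `scrape_log` table appears in the example scraper code removed above; a fragment of that usage (assuming `authorities` and the `exceptions` hash returned by `scrape`, as in `Scraper.run`):

```ruby
start_time = Time.now
exceptions = new.scrape(authorities, 1)
# Also writes per-authority records to the scrape_log table (cleared after 30 days)
ScraperUtils::LogUtils.log_scraping_run(start_time, 1, authorities, exceptions)
```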
data/docs/interleaving_requests.md CHANGED
@@ -1,6 +1,6 @@
- # Interleaving Requests with FiberScheduler
+ # Interleaving Requests with Scheduler

- The `ScraperUtils::FiberScheduler` provides a lightweight utility that:
+ The `ScraperUtils::Scheduler` provides a lightweight utility that:

  * Works on other authorities while in the delay period for an authority's next request
  * Optimizes the total scraper run time
@@ -11,51 +11,22 @@ The `ScraperUtils::FiberScheduler` provides a lightweight utility that:

  ## Implementation

- To enable fiber scheduling, change your scrape method to follow this pattern:
+ To enable fiber scheduling, change your scrape method as per
+ {example_scrape_with_fibers.rb example scrape with fibers}

- ```ruby
- def scrape(authorities, attempt)
- ScraperUtils::FiberScheduler.reset!
- exceptions = {}
- authorities.each do |authority_label|
- ScraperUtils::FiberScheduler.register_operation(authority_label) do
- ScraperUtils::FiberScheduler.log(
- "Collecting feed data for #{authority_label}, attempt: #{attempt}..."
- )
- ScraperUtils::DataQualityMonitor.start_authority(authority_label)
- YourScraper.scrape(authority_label) do |record|
- record["authority_label"] = authority_label.to_s
- ScraperUtils::DbUtils.save_record(record)
- rescue ScraperUtils::UnprocessableRecord => e
- ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)
- exceptions[authority_label] = e
- # Continues processing other records
- end
- rescue StandardError => e
- warn "#{authority_label}: ERROR: #{e}"
- warn e.backtrace || "No backtrace available"
- exceptions[authority_label] = e
- end
- # end of register_operation block
- end
- ScraperUtils::FiberScheduler.run_all
- exceptions
- end
- ```
-
- ## Logging with FiberScheduler
+ ## Logging with Scheduler

- Use `ScraperUtils::FiberScheduler.log` instead of `puts` when logging within the authority processing code.
+ Use {ScraperUtils::LogUtils.log} instead of `puts` when logging within the authority processing code.
  This will prefix the output lines with the authority name, which is needed since the system will interleave the work and
  thus the output.

  ## Testing Considerations

- This uses `ScraperUtils::RandomizeUtils` for determining the order of operations. Remember to add the following line to
+ This uses {ScraperUtils::RandomizeUtils} for determining the order of operations. Remember to add the following line to
  `spec/spec_helper.rb`:

  ```ruby
  ScraperUtils::RandomizeUtils.sequential = true
  ```

- For full details, see the [FiberScheduler class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/FiberScheduler).
+ For full details, see the {Scheduler}.