scraper_utils 0.5.1 → 0.6.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.yardopts +5 -0
- data/CHANGELOG.md +7 -0
- data/GUIDELINES.md +2 -1
- data/Gemfile +1 -0
- data/IMPLEMENTATION.md +40 -0
- data/README.md +29 -23
- data/SPECS.md +13 -1
- data/bin/rspec +27 -0
- data/docs/example_scrape_with_fibers.rb +4 -4
- data/docs/fibers_and_threads.md +72 -0
- data/docs/getting_started.md +6 -6
- data/docs/interleaving_requests.md +7 -7
- data/docs/parallel_requests.md +138 -0
- data/docs/randomizing_requests.md +12 -8
- data/docs/reducing_server_load.md +6 -6
- data/lib/scraper_utils/data_quality_monitor.rb +2 -3
- data/lib/scraper_utils/date_range_utils.rb +37 -78
- data/lib/scraper_utils/debug_utils.rb +5 -5
- data/lib/scraper_utils/log_utils.rb +15 -0
- data/lib/scraper_utils/mechanize_actions.rb +37 -8
- data/lib/scraper_utils/mechanize_utils/adaptive_delay.rb +79 -0
- data/lib/scraper_utils/mechanize_utils/agent_config.rb +31 -30
- data/lib/scraper_utils/mechanize_utils/robots_checker.rb +151 -0
- data/lib/scraper_utils/mechanize_utils.rb +8 -5
- data/lib/scraper_utils/randomize_utils.rb +22 -19
- data/lib/scraper_utils/scheduler/constants.rb +12 -0
- data/lib/scraper_utils/scheduler/operation_registry.rb +101 -0
- data/lib/scraper_utils/scheduler/operation_worker.rb +199 -0
- data/lib/scraper_utils/scheduler/process_request.rb +59 -0
- data/lib/scraper_utils/scheduler/thread_request.rb +51 -0
- data/lib/scraper_utils/scheduler/thread_response.rb +59 -0
- data/lib/scraper_utils/scheduler.rb +286 -0
- data/lib/scraper_utils/version.rb +1 -1
- data/lib/scraper_utils.rb +11 -14
- metadata +16 -6
- data/lib/scraper_utils/adaptive_delay.rb +0 -70
- data/lib/scraper_utils/fiber_scheduler.rb +0 -229
- data/lib/scraper_utils/robots_checker.rb +0 -149
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 13ad14102f284c98d658bb928bcf7806ea7594326d11c5426903ebc6b1f919e0
+  data.tar.gz: 260cd94a1b76e9851f5af47f716dc754386b081f3923ee0a0eb6fb2b2d086c4f
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 824b9e64ae7debdf9cddfc90b47de6dff7865e3f655ad6022f58181f38efd06788413e0251525410736e58d6fd325917a8b7a0ad6b2468fa7ab9de3b697955af
+  data.tar.gz: fd154118b2eaa22962f4343a3f62ca1daabc19de924a36e4c3cd4f61c9c7bb08a232554f10141b4e3534f3fa2e546a86f860d3bc7997f54d29a067cfb3c4f451

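As an aside (not part of the published diff): the new SHA256 values above can be checked locally with Ruby's standard Digest library. The file names follow the checksum keys; the unpack step is an assumption about how the archives were obtained.

```ruby
require "digest"

# Expected SHA256 values from the 0.6.0 checksums.yaml shown above.
EXPECTED_SHA256 = {
  "metadata.gz" => "13ad14102f284c98d658bb928bcf7806ea7594326d11c5426903ebc6b1f919e0",
  "data.tar.gz" => "260cd94a1b76e9851f5af47f716dc754386b081f3923ee0a0eb6fb2b2d086c4f"
}.freeze

# Assumes the gem archive has been unpacked first (e.g. `tar -xf scraper_utils-0.6.0.gem`)
# so that metadata.gz and data.tar.gz sit in the current directory.
EXPECTED_SHA256.each do |name, expected|
  actual = Digest::SHA256.file(name).hexdigest
  puts format("%-12s %s", name, actual == expected ? "OK" : "MISMATCH (got #{actual})")
end
```
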
data/.yardopts
ADDED
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,12 @@
 # Changelog
 
+## 0.6.0 - 2025-03-16
+
+* Add threads for more efficient scraping
+* Adjust defaults for more efficient scraping, retaining just response based delays by default
+* Correct and simplify date range utilities so everything is checked at least `max_period` days
+* Release Candidate for v1.0.0, subject to testing in production
+
 ## 0.5.1 - 2025-03-05
 
 * Remove duplicated example code in docs

data/GUIDELINES.md
CHANGED
@@ -47,7 +47,8 @@ but if the file is bad, just treat it as missing.
 
 ## Testing Strategies
 
-*
+* AVOID mocking unless really needed (and REALLY avoid mocking your own code), instead
+* Consider if you can change your own code, whilst keeping it simple, to make it easier to test
 * instantiate a real object to use in the test
 * use mocking facilities provided by the gem (eg Mechanize, Aws etc)
 * use integration tests with WebMock for simple external sites or VCR for more complex.

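To illustrate the integration-test style the updated guideline recommends (not part of the diff), here is a minimal RSpec + WebMock sketch; the URL, HTML body and example group are hypothetical.

```ruby
require "webmock/rspec"
require "mechanize"

RSpec.describe "scraping a results page (illustration)" do
  it "parses the stubbed council response" do
    # Hypothetical endpoint and HTML, used purely to show the WebMock style.
    stub_request(:get, "https://example.council.gov.au/applications")
      .to_return(status: 200,
                 headers: { "Content-Type" => "text/html" },
                 body: "<html><body><div class='result'>DA-2025-001</div></body></html>")

    page = Mechanize.new.get("https://example.council.gov.au/applications")

    expect(page.at(".result").text).to eq("DA-2025-001")
  end
end
```
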
data/Gemfile
CHANGED
data/IMPLEMENTATION.md
CHANGED
@@ -31,3 +31,43 @@ puts "Pre Connect request: #{request.inspect}" if ENV["DEBUG"]
 - Externalize configuration to improve testability
 - Keep shared logic in the main class
 - Decisions / information specific to just one class, can be documented there, otherwise it belongs here
+
+## Testing Directory Structure
+
+Our test directory structure reflects various testing strategies and aspects of the codebase:
+
+### API Context Directories
+- `spec/scraper_utils/fiber_api/` - Tests functionality called from within worker fibers
+- `spec/scraper_utils/main_fiber/` - Tests functionality called from the main fiber's perspective
+- `spec/scraper_utils/thread_api/` - Tests functionality called from within worker threads
+
+### Utility Classes
+- `spec/scraper_utils/mechanize_utils/` - Tests for `lib/scraper_utils/mechanize_utils/*.rb` files
+- `spec/scraper_utils/scheduler/` - Tests for `lib/scraper_utils/scheduler/*.rb` files
+- `spec/scraper_utils/scheduler2/` - FIXME: remove duplicate tests and merge to `spec/scraper_utils/scheduler/` unless > 200 lines
+
+### Integration vs Unit Tests
+- `spec/scraper_utils/integration/` - Tests that focus on the integration between components
+  - Name tests after the most "parent-like" class of the components involved
+
+### Special Configuration Directories
+These specs check the options we use when things go wrong in production
+
+- `spec/scraper_utils/no_threads/` - Tests with threads disabled (`MORPH_DISABLE_THREADS=1`)
+- `spec/scraper_utils/no_fibers/` - Tests with fibers disabled (`MORPH_MAX_WORKERS=0`)
+- `spec/scraper_utils/sequential/` - Tests with exactly one worker (`MORPH_MAX_WORKERS=1`)
+
+### Directories to break up large specs
+Keep specs less than 200 lines long
+
+- `spec/scraper_utils/replacements` - Tests for replacements in MechanizeActions
+- `spec/scraper_utils/replacements2` - FIXME: remove duplicate tests and merge to `spec/scraper_utils/replacements/`?
+- `spec/scraper_utils/selectors` - Tests the various node selectors available in MechanizeActions
+- `spec/scraper_utils/selectors2` - FIXME: remove duplicate tests and merge to `spec/scraper_utils/selectors/`?
+
+### General Testing Guidelines
+- Respect fiber and thread context validation - never mock the objects under test
+- Structure tests to run in the appropriate fiber context
+- Use real fibers, threads and operations rather than excessive mocking
+- Ensure proper cleanup of resources in both success and error paths
+- ASK when unsure which (yard doc, spec or code) is wrong as I don't always follow the "write specs first" strategy

data/README.md
CHANGED
@@ -9,28 +9,30 @@ For Server Administrators
 The ScraperUtils library is designed to be a respectful citizen of the web. If you're a server administrator and notice
 our scraper accessing your systems, here's what you should know:
 
-### How to Control Our Behavior
-
-Our scraper utilities respect the standard server **robots.txt** control mechanisms (by default).
-To control our access:
-
-- Add a section for our user agent: `User-agent: ScraperUtils` (default)
-- Set a crawl delay, eg: `Crawl-delay: 20`
-- If needed specify disallowed paths: `Disallow: /private/`
-
 ### We play nice with your servers
 
 Our goal is to access public planning information with minimal impact on your services. The following features are on by
 default:
 
+- **Limit server load**:
+  - We limit the max load we present to your server to well less than a third of a single cpu
+  - The more loaded your server is, the longer we wait between requests
+  - We respect Crawl-delay from robots.txt (see section below), so you can tell us an acceptable rate
+  - Scraper developers can
+    - reduce the max_load we present to your server even lower
+    - add random extra delays to give your server a chance to catch up with background tasks
+
 - **Identify themselves**: Our user agent clearly indicates who we are and provides a link to the project repository:
   `Mozilla/5.0 (compatible; ScraperUtils/0.2.0 2025-02-22; +https://github.com/ianheggie-oaf/scraper_utils)`
 
-
-
-
+### How to Control Our Behavior
+
+Our scraper utilities respect the standard server **robots.txt** control mechanisms (by default).
+To control our access:
 
-
+- Add a section for our user agent: `User-agent: ScraperUtils`
+- Set a crawl delay, eg: `Crawl-delay: 20`
+- If needed specify disallowed paths: `Disallow: /private/`
 
 For Scraper Developers
 ----------------------
@@ -40,14 +42,15 @@ mentioned above.
 
 ## Installation & Configuration
 
-Add to your
+Add to [your scraper's](https://www.planningalerts.org.au/how_to_write_a_scraper) Gemfile:
 
 ```ruby
 gem "scraperwiki", git: "https://github.com/openaustralia/scraperwiki-ruby.git", branch: "morph_defaults"
 gem 'scraper_utils'
 ```
 
-For detailed setup and configuration options,
+For detailed setup and configuration options,
+see {file:docs/getting_started.md Getting Started guide}
 
 ## Key Features
 
@@ -57,20 +60,23 @@ For detailed setup and configuration options, see the [Getting Started guide](do
 - Automatic rate limiting based on server response times
 - Supports robots.txt and crawl-delay directives
 - Supports extra actions required to get to results page
--
+- {file:docs/mechanize_utilities.md Learn more about Mechanize utilities}
 
 ### Optimize Server Load
 
 - Intelligent date range selection (reduce server load by up to 60%)
 - Cycle utilities for rotating search parameters
--
+- {file:docs/reducing_server_load.md Learn more about reducing server load}
 
 ### Improve Scraper Efficiency
 
--
--
+- Interleaves requests to optimize run time
+  - {file:docs/interleaving_requests.md Learn more about interleaving requests}
+- Use {ScraperUtils::Scheduler.execute_request} so Mechanize network requests will be performed by threads in parallel
+  - {file:docs/parallel_requests.md Parallel Request} - see Usage section for installation instructions
 - Randomize processing order for more natural request patterns
--
+  - {file:docs/randomizing_requests.md Learn more about randomizing requests} - see Usage section for installation
+    instructions
 
 ### Error Handling & Quality Monitoring
 
@@ -82,11 +88,11 @@ For detailed setup and configuration options, see the [Getting Started guide](do
 
 - Enhanced debugging utilities
 - Simple logging with authority context
--
+- {file:docs/debugging.md Learn more about debugging}
 
 ## API Documentation
 
-Complete API documentation is available at [RubyDoc.info](https://rubydoc.info/gems/scraper_utils).
+Complete API documentation is available at [scraper_utils | RubyDoc.info](https://rubydoc.info/gems/scraper_utils).
 
 ## Ruby Versions
 
@@ -105,7 +111,7 @@ To install this gem onto your local machine, run `bundle exec rake install`.
 ## Contributing
 
 Bug reports and pull requests with working tests are welcome
-on [GitHub](https://github.com/ianheggie-oaf/scraper_utils).
+on [ianheggie-oaf/scraper_utils | GitHub](https://github.com/ianheggie-oaf/scraper_utils).
 
 ## License
 

data/SPECS.md
CHANGED
@@ -6,7 +6,13 @@ installation and usage notes in `README.md`.
 
 ASK for clarification of any apparent conflicts with IMPLEMENTATION, GUIDELINES or project instructions.
 
-
+Core Design Principles
+----------------------
+
+## Coding Style and Complexity
+- KISS (Keep it Simple and Stupid) is a guiding principle:
+  - Simple: Design and implement with as little complexity as possible while still achieving the desired functionality
+  - Stupid: Should be easy to diagnose and repair with basic tooling
 
 ### Error Handling
 - Record-level errors abort only that record's processing
@@ -23,3 +29,9 @@ ASK for clarification of any apparent conflicts with IMPLEMENTATION, GUIDELINES
 - Ensure components are independently testable
 - Avoid timing-based tests in favor of logic validation
 - Keep test scenarios focused and under 20 lines
+
+#### Fiber and Thread Testing
+- Test in appropriate fiber/thread context using API-specific directories
+- Validate cooperative concurrency with real fibers rather than mocks
+- Ensure tests for each context: main fiber, worker fibers, and various thread configurations
+- Test special configurations (no threads, no fibers, sequential) in dedicated directories

data/bin/rspec
ADDED
@@ -0,0 +1,27 @@
+#!/usr/bin/env ruby
+# frozen_string_literal: true
+
+#
+# This file was generated by Bundler.
+#
+# The application 'rspec' is installed as part of a gem, and
+# this file is here to facilitate running it.
+#
+
+ENV["BUNDLE_GEMFILE"] ||= File.expand_path("../Gemfile", __dir__)
+
+bundle_binstub = File.expand_path("bundle", __dir__)
+
+if File.file?(bundle_binstub)
+  if File.read(bundle_binstub, 300).include?("This file was generated by Bundler")
+    load(bundle_binstub)
+  else
+    abort("Your `bin/bundle` was not generated by Bundler, so this binstub cannot run.
+Replace `bin/bundle` by running `bundle binstubs bundler --force`, then run this command again.")
+  end
+end
+
+require "rubygems"
+require "bundler/setup"
+
+load Gem.bin_path("rspec-core", "rspec")

data/docs/example_scrape_with_fibers.rb
CHANGED
@@ -3,11 +3,11 @@
 # Example scrape method updated to use ScraperUtils::FibreScheduler
 
 def scrape(authorities, attempt)
-  ScraperUtils::
+  ScraperUtils::Scheduler.reset!
   exceptions = {}
   authorities.each do |authority_label|
-    ScraperUtils::
-    ScraperUtils::
+    ScraperUtils::Scheduler.register_operation(authority_label) do
+      ScraperUtils::LogUtils.log(
        "Collecting feed data for #{authority_label}, attempt: #{attempt}..."
      )
      ScraperUtils::DataQualityMonitor.start_authority(authority_label)
@@ -26,6 +26,6 @@ def scrape(authorities, attempt)
     end
     # end of register_operation block
   end
-  ScraperUtils::
+  ScraperUtils::Scheduler.run_operations
  exceptions
 end

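For context (not from the diff), a hedged sketch of how a scraper's top level might drive the scrape method above, retrying failed authorities once; the authority labels and retry policy are illustrative, and only calls shown elsewhere in this diff (LogUtils.log) are used.

```ruby
# Hypothetical authority labels; real scrapers derive these from their AUTHORITIES list.
authorities = %i[example_city example_shire]

exceptions = scrape(authorities, 1)
unless exceptions.empty?
  ScraperUtils::LogUtils.log "Retrying #{exceptions.keys.join(', ')} (attempt 2)"
  exceptions = scrape(exceptions.keys, 2)
end

exceptions.each do |authority_label, e|
  ScraperUtils::LogUtils.log "ERROR for #{authority_label}: #{e.class} - #{e.message}"
end
```
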
data/docs/fibers_and_threads.md
ADDED
@@ -0,0 +1,72 @@
+Fibers and Threads
+==================
+
+This sequence diagram supplements the notes on the {ScraperUtils::Scheduler} class and is intended to help show
+the passing of messages and control between the fibers and threads.
+
+* To keep things simple I have only shown the Fibers and Threads and not all the other calls like to the
+  OperationRegistry to lookup the current operation, or OperationWorker etc.
+* There is ONE (global) response queue, which is monitored by the Scheduler.run_operations loop in the main Fiber
+* Each authority has ONE OperationWorker (not shown), which has ONE Fiber, ONE Thread, ONE request queue.
+* I use "◀─▶" to indicate a call and response, and "║" for which fiber / object is currently running.
+
+```text
+
+SCHEDULER (Main Fiber)
+NxRegister-operation RESPONSE.Q
+║──creates────────◀─▶┐
+║ │ FIBER (runs block passed to register_operation)
+║──creates──────────────────◀─▶┐ WORKER object and Registry
+║──registers(fiber)───────────────────────▶┐ REQUEST-Q
+│ │ │ ║──creates─◀─▶┐ THREAD
+│ │ │ ║──creates───────────◀─▶┐
+║◀─────────────────────────────────────────┘ ║◀──pop───║ ...[block waiting for request]
+║ │ │ │ ║ │
+run_operations │ │ │ ║ │
+║──pop(non block)─◀─▶│ │ │ ║ │ ...[no responses yet]
+║ │ │ │ ║ │
+║───resumes-next─"can_resume"─────────────▶┐ ║ │
+│ │ │ ║ ║ │
+│ │ ║◀──resume──┘ ║ │ ...[first Resume passes true]
+│ │ ║ │ ║ │ ...[initialise scraper]
+```
+**REPEATS FROM HERE**
+```text
+SCHEDULER RESPONSE.Q FIBER WORKER REQUEST.Q THREAD
+│ │ ║──request─▶┐ ║ │
+│ │ │ ║──push req ─▶║ │
+│ │ ║◀──────────┘ ║──req───▶║
+║◀──yields control─(waiting)───┘ │ │ ║
+║ │ │ │ │ ║ ...[Executes network I/O request]
+║ │ │ │ │ ║
+║───other-resumes... │ │ │ │ ║ ...[Other Workers will be resumed
+║ │ │ │ │ ║ till most 99% are waiting on
+║───lots of │ │ │ │ ║ responses from their threads
+║ short sleeps ║◀──pushes response───────────────────────────┘
+║ ║ │ │ ║◀──pop───║ ...[block waiting for request]
+║──pop(response)──◀─▶║ │ │ ║ │
+║ │ │ │ ║ │
+║──saves─response───────────────────────◀─▶│ ║ │
+║ │ │ │ ║ │
+║───resumes-next─"can_resume"─────────────▶┐ ║ │
+│ │ │ ║ ║ │
+│ │ ║◀──resume──┘ ║ │ ...[Resume passes response]
+│ │ ║ │ ║ │
+│ │ ║ │ ║ │ ...[Process Response]
+```
+**REPEATS TO HERE** - WHEN FIBER FINISHES, instead it:
+```text
+SCHEDULER RESPONSE.Q FIBER WORKER REQUEST.Q THREAD
+│ │ ║ │ ║ │
+│ │ ║─deregister─▶║ ║ │
+│ │ │ ║──close───▶║ │
+│ │ │ ║ ║──nil───▶┐
+│ │ │ ║ │ ║ ... [thread exits]
+│ │ │ ║──join────────────◀─▶┘
+│ │ │ ║ ....... [worker removes
+│ │ │ ║ itself from registry]
+│ │ ║◀──returns───┘
+│◀──returns─nil────────────────┘
+│ │
+```
+When the last fiber finishes and the registry is empty, then the response queue is also removed

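To make the hand-off in the diagram concrete, here is a plain-Ruby sketch using only Fiber, Thread and Queue (not the gem's Scheduler/OperationWorker classes); the request payload is an arbitrary sleep call and the hash keys are illustrative.

```ruby
# One shared response queue (Scheduler side) and one request queue per worker.
response_queue = Queue.new
request_queue  = Queue.new

# THREAD: blocks on pop, executes the request, pushes the response, exits on nil.
thread = Thread.new do
  while (request = request_queue.pop)
    result = request[:subject].send(request[:method], *request[:args])
    response_queue.push(authority: request[:authority], result: result)
  end
end

# FIBER: queues a request, then yields control back to the main loop.
worker_fiber = Fiber.new do
  request_queue.push(authority: :example, subject: Kernel, method: :sleep, args: [0.1])
  response = Fiber.yield            # control returns here when the main loop resumes us
  puts "Worker got: #{response[:result].inspect}"
  request_queue.push(nil)           # tell the thread to exit (the "close" step)
end

# MAIN FIBER (Scheduler.run_operations loop, greatly simplified).
worker_fiber.resume                  # run until the fiber yields after queuing its request
response = worker_fiber && response_queue.pop   # wait for the thread's response
worker_fiber.resume(response)        # hand the response back to the worker fiber
thread.join
```
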
data/docs/getting_started.md
CHANGED
@@ -54,14 +54,14 @@ export DEBUG=1 # for basic, or 2 for verbose or 3 for tracing nearly everything
 
 ## Example Scraper Implementation
 
-Update your `scraper.rb` as per
+Update your `scraper.rb` as per {file:example_scraper.rb example scraper}
 
-For more advanced implementations, see the
+For more advanced implementations, see the {file:interleaving_requests.md Interleaving Requests documentation}.
 
 ## Logging Tables
 
 The following logging tables are created for use in monitoring failure patterns and debugging issues.
-Records are
+Records are automatically cleared after 30 days.
 
 The `ScraperUtils::LogUtils.log_scraping_run` call also logs the information to the `scrape_log` table.
 
@@ -69,6 +69,6 @@ The `ScraperUtils::LogUtils.save_summary_record` call also logs the information
 
 ## Next Steps
 
--
--
--
+- {file:reducing_server_load.md Reducing Server Load}
+- {file:mechanize_utilities.md Mechanize Utilities}
+- {file:debugging.md Debugging}

data/docs/interleaving_requests.md
CHANGED
@@ -1,6 +1,6 @@
-# Interleaving Requests with
+# Interleaving Requests with Scheduler
 
-The `ScraperUtils::
+The `ScraperUtils::Scheduler` provides a lightweight utility that:
 
 * Works on other authorities while in the delay period for an authority's next request
 * Optimizes the total scraper run time
@@ -12,21 +12,21 @@ The `ScraperUtils::FiberScheduler` provides a lightweight utility that:
 ## Implementation
 
 To enable fiber scheduling, change your scrape method as per
-
+{example_scrape_with_fibers.rb example scrape with fibers}
 
-## Logging with
+## Logging with Scheduler
 
-Use
+Use {ScraperUtils::LogUtils.log} instead of `puts` when logging within the authority processing code.
 This will prefix the output lines with the authority name, which is needed since the system will interleave the work and
 thus the output.
 
 ## Testing Considerations
 
-This uses
+This uses {ScraperUtils::RandomizeUtils} for determining the order of operations. Remember to add the following line to
 `spec/spec_helper.rb`:
 
 ```ruby
 ScraperUtils::RandomizeUtils.sequential = true
 ```
 
-For full details, see the
+For full details, see the {Scheduler}.

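A hedged sketch of the interleaving pattern described above, using only the register_operation / run_operations / LogUtils.log calls shown in this diff; the authority labels and block bodies are placeholders.

```ruby
require "scraper_utils"

ScraperUtils::Scheduler.reset!

%i[example_city example_shire].each do |authority_label|
  ScraperUtils::Scheduler.register_operation(authority_label) do
    # LogUtils.log prefixes each line with the authority name, so interleaved
    # output from different operations stays readable.
    ScraperUtils::LogUtils.log "Starting #{authority_label}..."
    # ... fetch and save records for this authority here ...
    ScraperUtils::LogUtils.log "Finished #{authority_label}"
  end
end

ScraperUtils::Scheduler.run_operations
```
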
data/docs/parallel_requests.md
ADDED
@@ -0,0 +1,138 @@
+Parallel Request Processing
+===========================
+
+The ScraperUtils library provides a mechanism for executing network I/O requests in parallel using a thread for each
+operation worker, allowing the fiber to yield control and allow other fibers to process whilst the thread processes the
+mechanize network I/O request.
+
+This can be disabled by setting `MORPH_DISABLE_THREADS` ENV var to a non-blank value.
+
+Overview
+--------
+
+When scraping multiple authority websites, around 99% of the time was spent waiting for network I/O. While the
+`Scheduler`
+efficiently interleaves fibers during delay periods, network I/O requests will still block a fiber until they
+complete.
+
+The `OperationWorker` optimizes this process by:
+
+1. Executing mechanize network operations in parallel using a thread for each operation_worker and fiber
+2. Allowing other fibers to continue working while waiting for thread responses
+3. Integrating seamlessly with the existing `Scheduler`
+
+Usage
+-----
+
+```ruby
+# In your authority scraper block
+ScraperUtils::Scheduler.register_operation("authority_name") do
+  # Instead of:
+  # page = agent.get(url)
+
+  # Use:
+  page = ScraperUtils::Scheduler.execute_request(agent, :get, [url])
+
+  # Process page as normal
+  process_page(page)
+end
+```
+
+For testing purposes, you can also execute non-network operations:
+
+```ruby
+# Create a test object
+test_object = Object.new
+
+def test_object.sleep_test(duration)
+  sleep(duration)
+  "Completed after #{duration} seconds"
+end
+
+# Queue a sleep command
+command = ScraperUtils::ProcessRequest.new(
+  "test_id",
+  test_object,
+  :sleep_test,
+  [0.5]
+)
+
+thread_scheduler.queue_request(command)
+```
+
+Configuration
+-------------
+
+The following ENV variables affect how `Scheduler` is configured:
+
+* `MORPH_DISABLE_THREADS=1` disables the use of threads
+* `MORPH_MAX_WORKERS=N` configures the system to a max of N workers (minimum 1).
+  If N is 1 then this forces the system to process one authority at a time.
+
+Key Components
+--------------
+
+### ThreadRequest
+
+A value object encapsulating a command to be executed:
+
+- External ID: Any value suitable as a hash key (String, Symbol, Integer, Object) that identifies the command
+- Subject: The object to call the method on
+- Method: The method to call on the subject
+- Args: Arguments to pass to the method
+
+### ThreadResponse
+
+A value object encapsulating a response:
+
+- External ID: Matches the ID from the original command
+- Result: The result of the operation
+- Error: Any error that occurred
+- Time Taken: Execution time in seconds
+
+### ThreadPool
+
+Manages a pool of threads that execute commands:
+
+- Processes commands from a queue
+- Returns responses with matching external IDs
+- Provides clear separation between I/O and scheduling
+
+Benefits
+--------
+
+1. **Improved Throughput**: Process multiple operations simultaneously
+2. **Reduced Total Runtime**: Make better use of wait time during network operations
+3. **Optimal Resource Usage**: Efficiently balance CPU and network operations
+4. **Better Geolocation Handling**: Distribute requests across proxies more efficiently
+5. **Testability**: Execute non-network operations for testing concurrency
+
+Debugging
+---------
+
+When debugging issues with parallel operations, use:
+
+```shell
+# Set debug level to see request/response logging
+export DEBUG=2
+```
+
+The system will log:
+
+- When commands are queued
+- When responses are received
+- How long each operation took
+- Any errors that occurred
+
+## Implementation Details
+
+The integration between `Scheduler` and `ThreadPool` follows these principles:
+
+1. `Scheduler` maintains ownership of all fiber scheduling
+2. `ThreadPool` only knows about commands and responses
+3. Communication happens via value objects with validation
+4. State is managed in dedicated `FiberState` objects
+5. Each component has a single responsibility
+
+This design provides a clean separation of concerns while enabling parallel operations within the existing fiber
+scheduling framework.

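The command/response round trip described under Key Components can be sketched in plain Ruby; the Structs below merely stand in for the gem's ThreadRequest/ThreadResponse value objects and are not its implementation.

```ruby
# Stand-ins for the gem's value objects, used only to illustrate the data flow.
ThreadRequestSketch  = Struct.new(:external_id, :subject, :method_name, :args)
ThreadResponseSketch = Struct.new(:external_id, :result, :error, :time_taken)

# Execute a request and wrap the outcome (result or error) with timing information.
def execute(request)
  started = Time.now
  result = request.subject.send(request.method_name, *request.args)
  ThreadResponseSketch.new(request.external_id, result, nil, Time.now - started)
rescue StandardError => e
  ThreadResponseSketch.new(request.external_id, nil, e, Time.now - started)
end

request = ThreadRequestSketch.new(:example_authority, Kernel, :sleep, [0.2])
response = execute(request)
puts "#{response.external_id}: took #{response.time_taken.round(2)}s, error: #{response.error.inspect}"
```
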
data/docs/randomizing_requests.md
CHANGED
@@ -1,9 +1,11 @@
-
+Randomizing Requests
+====================
 
 `ScraperUtils::RandomizeUtils` provides utilities for randomizing processing order in scrapers,
 which is helpful for distributing load and avoiding predictable patterns.
 
-
+Usage
+-----
 
 Pass a `Collection` or `Array` to `ScraperUtils::RandomizeUtils.randomize_order` to randomize it in production, but
 receive it as is when testing.
@@ -18,17 +20,19 @@ records.each do |record|
 end
 ```
 
-
+Testing Configuration
+---------------------
 
 Enforce sequential mode when testing by adding the following code to `spec/spec_helper.rb`:
 
 ```ruby
-ScraperUtils::RandomizeUtils.
+ScraperUtils::RandomizeUtils.random = false
 ```
 
-
+Notes
+-----
 
-* You can also
-* Testing using VCR requires
+* You can also disable random mode by setting the env variable `MORPH_DISABLE_RANDOM` to `1` (or any non-blank value)
+* Testing using VCR requires random to be disabled
 
-For full details, see
+For full details, see {ScraperUtils::RandomizeUtils Randomize Utils class documentation}

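Conceptually, randomize_order behaves like a conditional shuffle. The sketch below is an illustration of that behaviour, not the gem's implementation; the MORPH_DISABLE_RANDOM check mirrors the note above.

```ruby
# Conceptual stand-in for ScraperUtils::RandomizeUtils.randomize_order:
# shuffle in normal runs, return the collection unchanged when random mode
# is off (e.g. in specs or when MORPH_DISABLE_RANDOM is set).
def randomize_order_sketch(collection, random: ENV["MORPH_DISABLE_RANDOM"].to_s.strip.empty?)
  random ? collection.to_a.shuffle : collection.to_a
end

records = %w[record_a record_b record_c]
puts randomize_order_sketch(records).inspect                 # shuffled in production
puts randomize_order_sketch(records, random: false).inspect  # original order for tests
```
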
data/docs/reducing_server_load.md
CHANGED
@@ -12,8 +12,8 @@ records:
 
 - Always checks the most recent 4 days daily (configurable)
 - Progressively reduces search frequency for older records
-- Uses a
-- Configurable `max_period` (default is
+- Uses a progression from each 2 days and upwards to create efficient search intervals
+- Configurable `max_period` (default is 2 days)
 - Merges adjacent search ranges and handles the changeover in search frequency by extending some searches
 
 Example usage in your scraper:
@@ -28,11 +28,11 @@ date_ranges.each do |from_date, to_date, _debugging_comment|
 end
 ```
 
-Typical server load
+Typical server load compared to search all days each time:
 
-* Max period 2 days : ~
-* Max period 3 days : ~
-* Max period
+* Max period 2 days : ~59% of the 33 days selected (default, alternates between 57% and 61% covered)
+* Max period 3 days : ~50% of the 33 days selected (varies much more - between 33 and 67%)
+* Max period 4 days : ~46% (more efficient if you search back 50 or more days, varies between 15 and 61%)
 
 See the [DateRangeUtils class documentation](https://rubydoc.info/gems/scraper_utils/ScraperUtils/DateRangeUtils) for customizing defaults and passing options.
 

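A hedged usage sketch based on the text above and the linked class documentation; treat the calculate_date_ranges name and the max_period option as assumptions to verify against your installed version, and the search helper is hypothetical.

```ruby
require "scraper_utils"

# Ask for ranges that cover everything at least every 3 days (max_period),
# then run one date-filtered search per returned range.
date_ranges = ScraperUtils::DateRangeUtils.new.calculate_date_ranges(max_period: 3)

date_ranges.each do |from_date, to_date, debugging_comment|
  ScraperUtils::LogUtils.log "Searching #{from_date} .. #{to_date} (#{debugging_comment})"
  # search_applications_between(from_date, to_date)  # hypothetical scraper-specific helper
end
```
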
data/lib/scraper_utils/data_quality_monitor.rb
CHANGED
@@ -13,7 +13,6 @@ module ScraperUtils
     # Notes the start of processing an authority and clears any previous stats
     #
     # @param authority_label [Symbol] The authority we are processing
-    # @return [void]
     def self.start_authority(authority_label)
       @stats ||= {}
       @stats[authority_label] = { saved: 0, unprocessed: 0 }
@@ -41,7 +40,7 @@ module ScraperUtils
     def self.log_unprocessable_record(exception, record)
       authority_label = extract_authority(record)
       @stats[authority_label][:unprocessed] += 1
-      ScraperUtils::
+      ScraperUtils::LogUtils.log "Erroneous record #{authority_label} - #{record&.fetch(
         'address', nil
       ) || record.inspect}: #{exception}"
       return unless @stats[authority_label][:unprocessed] > threshold(authority_label)
@@ -58,7 +57,7 @@ module ScraperUtils
     def self.log_saved_record(record)
       authority_label = extract_authority(record)
       @stats[authority_label][:saved] += 1
-      ScraperUtils::
+      ScraperUtils::LogUtils.log "Saving record #{authority_label} - #{record['address']}"
     end
   end
 end

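To show how the changed logging calls are typically used (an illustration, not from the diff): the record hash, the authority_label key and the save call are assumptions modelled on typical planning scrapers.

```ruby
require "scraper_utils"
require "scraperwiki"

ScraperUtils::DataQualityMonitor.start_authority(:example_city)

# "authority_label" and the other keys are assumptions about the record layout.
records = [
  { "authority_label" => "example_city", "council_reference" => "DA-2025-001", "address" => "1 Example St" }
]

records.each do |record|
  ScraperWiki.save_sqlite(%w[council_reference], record)
  ScraperUtils::DataQualityMonitor.log_saved_record(record)
rescue StandardError => e
  # Logs the bad record and, once too many records fail for this authority,
  # escalates rather than silently continuing (see the threshold check above).
  ScraperUtils::DataQualityMonitor.log_unprocessable_record(e, record)
end
```
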