RubyGems - relaton-w3c - Versions diffs - 2.1.4 → 2.2.0.pre.alpha.1 - Mend

relaton-w3c 2.1.4 → 2.2.0.pre.alpha.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (9)

checksums.yaml +4 -4
data/CLAUDE.md +3 -3
data/Gemfile +8 -0
data/README.adoc +5 -3
data/lib/relaton/w3c/data_fetcher.rb +78 -15
data/lib/relaton/w3c/version.rb +1 -1
data/relaton-w3c.gemspec +5 -4
metadata +23 -10
data/.rubocop.yml +0 -12

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 7cdd6ed3f2403c63011b2f3d023afca6d1fde795b09e65d39e9044384ed52b2b
-  data.tar.gz: 4086a9ec931c36512084c11c7327be1cd5d21c85106bbe312e6c142dbe9639d4
+  metadata.gz: 3e7055aa24f2e33a4eb5825a57fca822732a5ddfeec7f7c484d44afe86dbacaa
+  data.tar.gz: 415b74a5cd3ecbee917d601794721fb9e207b427ea4fcd9dd1ce8420a439a5b3
 SHA512:
-  metadata.gz: d2a193173054fd17d1b1e30218887722b58f3a12192923e7abb36e167314c6378e4255a87632bb9fac21035f7bc33e6b4f032ff8dd180fef938ef1977ab6cbd6
-  data.tar.gz: c3b52a5693c876cee96728332e694824350a50a2d4945c04c941e84d7c85d874216a02472f9121c3ca4446420cf686a6d8b8b042397deafcb4dcc89870e5a8b7
+  metadata.gz: 87deba90b8ae19bb20c67eefd794dcb0b29171573c48cecb04445f73e66921953c228ef5e759be7742d8c072a18dad2cecf3f9214993de1988ea4a64562310ee
+  data.tar.gz: c1a9d9c598da26dbb05611415b655bdcb8adac22a3d1d1488a15f00d06b68dfbbcca0ad4bfa1d30366a466ae2948cf99ffc2690e5c61309fd3fca3247af00274

data/CLAUDE.md CHANGED

@@ -47,7 +47,7 @@ All classes live under `lib/relaton/w3c/` in the `Relaton::W3c` namespace:
 - **`Processor`** (`processor.rb`) — extends `Relaton::Core::Processor`, registers the W3C flavor (prefix `W3C`, dataset `w3c-api`)
 **Data fetching:**
-- **`DataFetcher`** (`data_fetcher.rb`) — extends `Core::DataFetcher`, fetches all W3C specs via the W3C API. Fetches the specification index with `embed: true` so each spec is realized from the page's embedded payload instead of a per-spec HTTP request, and paginates by page number (only the `fetch` path repopulates `_embedded`, unlike realizing the `next` link). Runs `fetch_spec` across a small thread pool. A SIGINT (Ctrl-C) is handled gracefully — the producer stops queuing and workers stop after their in-flight spec, then the index of everything fetched so far is saved (the prior INT handler is restored afterwards, so the trap doesn't leak into the host process). See **Crawler tuning** for the env-var knobs.
+- **`DataFetcher`** (`data_fetcher.rb`) — extends `Core::DataFetcher`, fetches all W3C specs via the W3C API. Fetches the specification index with `embed: true` so each spec is realized from the page's embedded payload instead of a per-spec HTTP request, and paginates by page number (only the `fetch` path repopulates `_embedded`, unlike realizing the `next` link). Runs `fetch_spec` across a small thread pool. A SIGINT (Ctrl-C) is handled gracefully — the producer stops queuing and workers stop after their in-flight spec, then the index of everything fetched so far is saved (the prior INT handler is restored afterwards, so the trap doesn't leak into the host process). If an index page fails to fetch after retries, or pagination ends before the API's advertised last page, `enqueue_specs` raises `CrawlIncompleteError` and the crawl aborts **without** saving the index — a transient rate-limit must never silently truncate the dataset (`crawler.rb` wipes `data/` before each run, so a partial crawl would otherwise commit mass deletions). The worker pool is still drained in an `ensure` so the abort doesn't deadlock. See **Crawler tuning** for the env-var knobs.
 - **`DataParser`** (`data_parser.rb`) — converts W3C API spec objects into `Relaton::W3c::Item` instances
 - **`SafeRealize`** (`safe_realize.rb`) — mixin that, on a terminal error, skips the resource (returns `nil`) so one bad link doesn't abort the crawl (see Rate limiting & retries). It does not retry or cache successes — those live upstream.
 - **`PubId`** (`pubid.rb`) — parses and compares W3C document identifiers (stage, code, date parts)
@@ -61,7 +61,7 @@ The entry module is defined in `lib/relaton/w3c.rb` and exposes `grammar_hash`.
 `DataFetcher` is tunable via environment variables (read by class methods, so they apply to the whole crawl):
-- **`RELATON_W3C_FETCH_CONCURRENCY`** (default `8`) — number of `fetch_spec` worker threads. Lower it to lighten load on api.w3.org or for debugging.
+- **`RELATON_W3C_FETCH_CONCURRENCY`** (default `4`) — number of `fetch_spec` worker threads. Kept conservative so the version-history requests don't burst fast enough to trip the W3C API rate limiter (429s); raise it for a faster run, lower it for debugging or if 429 skips appear.
 - **`RELATON_W3C_FETCH_VERSIONS`** (default enabled) — set to `false`/`0`/`no`/`off` for a faster, shallower crawl that emits only the top-level specifications and skips each spec's version-history fan-out (version_history, predecessor/successor versions — the bulk of the API requests). Leave it set (the default) for a complete dataset.
 `embed: true` (always on) inlines each specification into its index page, so the per-spec realize is served from memory rather than an HTTP request — the largest single reduction in request count.
@@ -76,7 +76,7 @@ Successful objects are cached by **w3c_api** (lutaml-hal caches realized objects
 ### Key Dependencies
-- **relaton-bib** (~> 2.1.0) — provides base `Bib::Item`, `Bib::Ext`, `Bib::Doctype` and serialization mixins (LutaML model layer)
+- **relaton-bib** (~> 2.2.0) — provides base `Bib::Item`, `Bib::Ext`, `Bib::Doctype` and serialization mixins (LutaML model layer)
 - **relaton-core** — provides base `Core::Processor` and `Core::DataFetcher`
 - **relaton-index** — index-based search for bibliographic references; also unpacks the index zip at runtime
 - **w3c_api** (~> 0.3.2) — W3C API (HAL/REST) client used by `DataFetcher` to retrieve specifications; owns rate-limit and transient-error retries, and the (thread-safe) object cache
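
The CLAUDE.md change above says the worker pool is still drained in an `ensure` so an abort doesn't deadlock. A minimal sketch of that pattern, using only the standard library, might look like the following. `run_pool`, `ProducerFailed`, and the `fail_after` parameter are hypothetical names chosen for illustration; they are not part of the gem.

```ruby
# Hypothetical stand-in for a producer/worker crawl; not the gem's code.
class ProducerFailed < StandardError; end

def run_pool(items, n_workers: 2, fail_after: nil)
  queue = Queue.new
  processed = Queue.new # thread-safe collector for worker results

  workers = Array.new(n_workers) do
    Thread.new do
      # Each worker pops until it receives a nil poison pill.
      while (item = queue.pop)
        processed << item * 2
      end
    end
  end

  begin
    items.each_with_index do |item, i|
      # Simulate the producer failing partway through enqueuing.
      raise ProducerFailed, "producer stopped at item #{i}" if fail_after == i
      queue << item
    end
  ensure
    # Runs on success and on failure, so no worker blocks forever on pop.
    n_workers.times { queue << nil }
    workers.each(&:join)
  end

  Array.new(processed.size) { processed.pop }.sort
end
```

Calling `run_pool([1, 2, 3], fail_after: 1)` raises `ProducerFailed` only after every worker has received its pill and been joined, which is the property the diff relies on when `enqueue_specs` raises.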

data/Gemfile CHANGED

@@ -3,6 +3,14 @@ source "https://rubygems.org"
 # Specify your gem's dependencies in relaton_w3c.gemspec
 gemspec
+# Use local monorepo sibling gems where available.
+Dir["../*/"].each do |dir|
+  name = File.basename(dir)
+  next if name == File.basename(__dir__)
+  next unless File.exist?(File.join(dir, "#{name}.gemspec"))
+  gem name, path: dir
+end
 gem "rake", "~> 13.0"
 gem "rspec", "~> 3.0"

data/README.adoc CHANGED

@@ -120,17 +120,19 @@ Relaton::W3c::DataFetcher.fetch
 The crawl is tunable via environment variables:
-- `RELATON_W3C_FETCH_CONCURRENCY` (default `8`) - number of parallel worker threads. Lower it to lighten load on `api.w3.org`.
+- `RELATON_W3C_FETCH_CONCURRENCY` (default `4`) - number of parallel worker threads. The default is kept conservative so the version-history requests don't burst fast enough to trip the W3C API rate limiter; raise it for a faster run, lower it if you still see rate-limit (429) skips.
 - `RELATON_W3C_FETCH_VERSIONS` (default enabled) - set to `false` for a faster, shallower crawl that fetches only the top-level specifications and skips each spec's version history (the bulk of the API requests). Leave it unset for a complete dataset.
 The fetcher requests the specifications index with embedded specification data, so each specification is read from the page already in memory instead of issuing a separate HTTP request.
 A full crawl is long-running, so it handles `Ctrl-C` gracefully: it stops fetching, lets in-flight work finish, and saves the index of everything collected so far rather than losing the run.
+If a specifications-index page can't be fetched (e.g. a rate-limit that outlasts the retries) the crawl aborts with `CrawlIncompleteError` rather than treating the failure as the end of the list. This is deliberate: a truncated crawl is never saved, so a transient API hiccup can't silently drop most of the dataset.
 [source,sh]
 ----
-# Fast, shallow refresh: top-level specs only, 4 workers
-RELATON_W3C_FETCH_VERSIONS=false RELATON_W3C_FETCH_CONCURRENCY=4 \
+# Fast, shallow refresh: top-level specs only, 8 workers
+RELATON_W3C_FETCH_VERSIONS=false RELATON_W3C_FETCH_CONCURRENCY=8 \
   ruby -r relaton/w3c/data_fetcher -e 'Relaton::W3c::DataFetcher.fetch'
 ----
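
The README paragraph above distinguishes a failed page fetch from the real end of the list. A simplified sketch of that distinction, over an in-memory page source, could look like this. `Page`, `IncompleteError`, and `collect_all` are hypothetical names; the real crawler works with W3C API page objects rather than this struct.

```ruby
# Hypothetical error and page type for illustration only.
class IncompleteError < StandardError; end

Page = Struct.new(:number, :total_pages, :items, :has_next)

# Walks pages through `fetch`, a callable mapping a page number to a Page
# or nil (nil stands for a fetch that failed after retries).
def collect_all(fetch)
  page = fetch.call(1)
  raise IncompleteError, "failed to fetch page 1" unless page

  expected = page.total_pages
  items = []
  loop do
    items.concat(page.items)
    last = page.number
    unless page.has_next
      # End of list: accept it only if the advertised last page was reached.
      raise IncompleteError, "ended at page #{last} of #{expected}" if last < expected

      return items
    end
    nxt = fetch.call(last + 1)
    # A missing page is a failure, never a silent end of pagination.
    raise IncompleteError, "failed to fetch page #{last + 1}" unless nxt

    page = nxt
  end
end
```

Both failure modes raise instead of returning a shortened list, so a caller that saves results only on a normal return can never persist a truncated set.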

data/lib/relaton/w3c/data_fetcher.rb CHANGED

@@ -10,11 +10,26 @@ module Relaton
     class DataFetcher < Core::DataFetcher
       include Relaton::W3c::SafeRealize
-      DEFAULT_CONCURRENCY = 8
+      # Raised when pagination over the specifications index stops before the
+      # last page (e.g. a page fetch fails after retries, or the API reports
+      # more pages than were reached). It aborts the whole crawl so a truncated
+      # dataset is never saved or committed — see #fetch and #enqueue_specs.
+      class CrawlIncompleteError < StandardError; end
+      # Conservative default: too many parallel workers burst the per-spec
+      # version-history requests fast enough to trip the W3C API rate limiter
+      # (429s), which is what silently truncated the dataset before the crawl
+      # learned to abort on incomplete pagination. Raise it via the env var on
+      # a faster/shallower run; lower it further if 429s still appear.
+      DEFAULT_CONCURRENCY = 4
+      # How many times #fetch_specifications_page retries a transient failure
+      # (rate-limit/connection) before giving up and aborting the crawl.
+      PAGE_FETCH_ATTEMPTS = 3
       # Number of fetch_spec worker threads. Tunable via env var so CI or
-      # local runs can dial it down (e.g. for debugging or to lighten load
-      # on api.w3.org).
+      # local runs can dial it up for speed or down to lighten load on
+      # api.w3.org (or for debugging).
       def self.concurrency
         (ENV["RELATON_W3C_FETCH_CONCURRENCY"] || DEFAULT_CONCURRENCY).to_i
       end
@@ -65,9 +80,15 @@ module Relaton
         workers = Array.new(n_workers) { spawn_worker(queue) }
         with_interrupt_handler do
-          enqueue_specs(queue)
-          n_workers.times { queue << nil } # poison pills
-          workers.each(&:join)
+          # The poison pills + join run in `ensure` so an exception raised while
+          # enqueuing (e.g. CrawlIncompleteError) still unblocks the producer
+          # and drains the workers instead of deadlocking on queue.pop.
+          begin
+            enqueue_specs(queue)
+          ensure
+            n_workers.times { queue << nil } # poison pills
+            workers.each(&:join)
+          end
           Util.warn "Crawl interrupted — saving progress collected so far." if @interrupted
           index.save
         end
@@ -87,6 +108,8 @@ module Relaton
       #
       def enqueue_specs(queue)
         specs = client.specifications(embed: true)
+        expected_pages = specs.pages
+        last_page = nil
         loop do
           page = specs
           page.links.specifications.each do |spec|
@@ -94,7 +117,10 @@ module Relaton
             queue << [spec, page]
           end
-          break if @interrupted || !page.next?
+          break if @interrupted
+          last_page = page.page
+          break unless page.next?
           # Fetch the next page through the client's fetch path rather than
           # realizing the `next` link: only fetch populates the page's
@@ -102,10 +128,35 @@ module Relaton
           # the `next` link drops `_embedded` and forces a per-spec HTTP
           # request for every specification on every later page.
           next_page = fetch_specifications_page(page.page + 1)
-          break unless next_page
+          # A nil here means the page fetch failed after retries (not the end
+          # of the list — that is `!page.next?` above). Aborting rather than
+          # `break`ing prevents a rate-limit blip from silently truncating the
+          # dataset: a partial crawl must never be saved/committed.
+          unless next_page
+            raise CrawlIncompleteError,
+                  "specifications pagination stopped at page #{page.page}: " \
+                  "failed to fetch page #{page.page + 1}"
+          end
           specs = next_page
         end
+        return if @interrupted
+        guard_complete_pagination(last_page, expected_pages)
+      end
+      # Defense in depth: even when no page fetch raised, make sure pagination
+      # actually reached the last page the API advertised. Catches truncation
+      # modes other than a failed fetch (e.g. a `next` link that goes missing).
+      # Only enforced when the index reported a positive page count.
+      def guard_complete_pagination(last_page, expected_pages)
+        return unless expected_pages.is_a?(Integer) && expected_pages.positive?
+        return unless last_page.is_a?(Integer) && last_page < expected_pages
+        raise CrawlIncompleteError,
+              "specifications pagination ended at page #{last_page} of " \
+              "#{expected_pages}; refusing to save a partial dataset"
       end
       def fetch_spec(unrealized_spec, page = nil)
@@ -209,14 +260,26 @@ module Relaton
       # Fetch one page of the specifications index with embed enabled. Goes
       # through the client (the register's fetch path) so the page's
       # embedded_data is populated. Transient 403/5xx/connection failures are
-      # already retried upstream (w3c_api/lutaml-hal); a terminal error here
-      # stops pagination gracefully rather than crashing the crawl.
+      # already retried upstream (w3c_api/lutaml-hal), but losing an index page
+      # drops every spec on it, so retry a few more times here with backoff to
+      # ride out a brief rate-limit window. Returns nil only once the attempts
+      # are exhausted; the caller turns that into a CrawlIncompleteError so the
+      # crawl aborts instead of committing a truncated dataset.
       def fetch_specifications_page(number)
-        client.specifications(embed: true, page: number)
-      rescue Lutaml::Hal::Error, Faraday::Error => e
-        log_error "Failed to fetch specifications page #{number}: " \
-                  "#{e.class}: #{e.message}"
-        nil
+        attempt = 0
+        begin
+          attempt += 1
+          client.specifications(embed: true, page: number)
+        rescue Lutaml::Hal::Error, Faraday::Error => e
+          log_error "Failed to fetch specifications page #{number} " \
+                    "(attempt #{attempt}/#{PAGE_FETCH_ATTEMPTS}): " \
+                    "#{e.class}: #{e.message}"
+          if attempt < PAGE_FETCH_ATTEMPTS
+            sleep(2**attempt)
+            retry
+          end
+          nil
+        end
       end
       def spawn_worker(queue)
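
The new `fetch_specifications_page` retries with `sleep(2**attempt)` and returns `nil` once the attempts are exhausted. The sketch below reproduces that control flow in isolation. `with_retries`, `TransientError`, and the injectable `sleeper` are assumptions made for illustration (the injected sleeper lets the example run without actually waiting); the gem rescues `Lutaml::Hal::Error` and `Faraday::Error` instead.

```ruby
# Hypothetical error class standing in for the transient failures the
# real method rescues.
class TransientError < StandardError; end

# Yields up to `attempts` times, sleeping 2**attempt seconds between
# tries. Returns the block's value, or nil once attempts are exhausted.
def with_retries(attempts: 3, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue TransientError
    if attempt < attempts
      sleeper.call(2**attempt)
      retry
    end
    nil
  end
end

# Usage: record the delays instead of sleeping, and fail the first two tries.
delays = []
result = with_retries(sleeper: ->(seconds) { delays << seconds }) do |attempt|
  raise TransientError, "rate limited" if attempt < 3

  "page payload"
end
p result
p delays
```

With three attempts the waits are 2 and 4 seconds, so a page fetch gives up after roughly six seconds of backoff on top of whatever the upstream client already retried.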

data/lib/relaton/w3c/version.rb CHANGED

@@ -1,5 +1,5 @@
 module Relaton
   module W3c
-    VERSION = "2.1.4".freeze
+    VERSION = "2.2.0.pre.alpha.1".freeze
   end
 end

data/relaton-w3c.gemspec CHANGED

@@ -14,7 +14,7 @@ Gem::Specification.new do |spec|
                        "using the IsoBibliographicItem model"
   spec.homepage      = "https://github.com/relaton/relaton-wc3"
   spec.license       = "BSD-2-Clause"
-  spec.required_ruby_version = Gem::Requirement.new(">= 3.2.0")
+  spec.required_ruby_version = Gem::Requirement.new(">= 3.3.0")
   # spec.metadata["allowed_push_host"] = "TODO: Set to 'http://mygemserver.com'"
@@ -31,8 +31,9 @@ Gem::Specification.new do |spec|
   spec.executables   = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
   spec.require_paths = ["lib"]
-  spec.add_dependency "relaton-bib", "~> 2.1.0"
-  spec.add_dependency "relaton-core", "~> 0.0.13"
-  spec.add_dependency "relaton-index", "~> 0.2.8"
+  spec.add_dependency "concurrent-ruby", "~> 1.0"
+  spec.add_dependency "relaton-bib", "~> 2.2.0.pre.alpha.1"
+  spec.add_dependency "relaton-core", "~> 2.2.0.pre.alpha.1"
+  spec.add_dependency "relaton-index", "~> 2.2.0.pre.alpha.1"
   spec.add_dependency "w3c_api", "~> 0.3.2"
 end
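
The gemspec now pins the Relaton dependencies with `~> 2.2.0.pre.alpha.1`. To see which versions such a pessimistic constraint admits, RubyGems' own `Gem::Requirement` can be queried directly, as in this small sketch. `satisfies?` is a hypothetical convenience wrapper, and the candidate versions are arbitrary examples rather than a list of published releases.

```ruby
require "rubygems"

# Hypothetical wrapper around Gem::Requirement#satisfied_by?.
def satisfies?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

# Compare the old and new relaton-bib constraints against sample versions.
["~> 2.1.0", "~> 2.2.0.pre.alpha.1"].each do |constraint|
  %w[2.1.4 2.2.0.pre.alpha.1 2.2.3 2.3.0].each do |version|
    puts "#{constraint} admits #{version}: #{satisfies?(constraint, version)}"
  end
end
```

The prerelease constraint accepts the alpha itself and later 2.2.x versions while rejecting 2.1.x and 2.3.0, which is why the dependent gems have to move to the 2.2 line together.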

metadata CHANGED

@@ -1,57 +1,71 @@
 --- !ruby/object:Gem::Specification
 name: relaton-w3c
 version: !ruby/object:Gem::Version
-  version: 2.1.4
+  version: 2.2.0.pre.alpha.1
 platform: ruby
 authors:
 - Ribose Inc.
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2026-06-04 00:00:00.000000000 Z
+date: 2026-06-26 00:00:00.000000000 Z
 dependencies:
+- !ruby/object:Gem::Dependency
+  name: concurrent-ruby
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.0'
 - !ruby/object:Gem::Dependency
   name: relaton-bib
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: 2.1.0
+        version: 2.2.0.pre.alpha.1
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: 2.1.0
+        version: 2.2.0.pre.alpha.1
 - !ruby/object:Gem::Dependency
   name: relaton-core
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: 0.0.13
+        version: 2.2.0.pre.alpha.1
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: 0.0.13
+        version: 2.2.0.pre.alpha.1
 - !ruby/object:Gem::Dependency
   name: relaton-index
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: 0.2.8
+        version: 2.2.0.pre.alpha.1
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: 0.2.8
+        version: 2.2.0.pre.alpha.1
 - !ruby/object:Gem::Dependency
   name: w3c_api
   requirement: !ruby/object:Gem::Requirement
@@ -79,7 +93,6 @@ files:
 - ".gitignore"
 - ".hound.yml"
 - ".rspec"
-- ".rubocop.yml"
 - CLAUDE.md
 - Gemfile
 - LICENSE.txt
@@ -117,7 +130,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
-      version: 3.2.0
+      version: 3.3.0
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="

data/.rubocop.yml DELETED

@@ -1,12 +0,0 @@
-# This project follows the Ribose OSS style guide.
-# https://github.com/riboseinc/oss-guides
-# All project-specific additions and overrides should be specified in this file.
-require: rubocop-rails
-inherit_from:
-  - https://raw.githubusercontent.com/riboseinc/oss-guides/master/ci/rubocop.yml
-AllCops:
-  TargetRubyVersion: 3.2
-Rails:
-  Enabled: false