RubyGems - relaton-w3c - Versions diffs - 2.1.3 → 2.1.4 - Mend

relaton-w3c 2.1.3 → 2.1.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (7)

checksums.yaml +4 -4
data/CLAUDE.md +10 -1
data/README.adoc +16 -0
data/lib/relaton/w3c/data_fetcher.rb +108 -23
data/lib/relaton/w3c/safe_realize.rb +6 -2
data/lib/relaton/w3c/version.rb +1 -1
metadata +2 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: ac21e91a675f1a0c33ea6a0610edb38895ffe87d9b810593331f9a79cdbe0324
-  data.tar.gz: 74993548a097428280e01147eea4dd604adf181ed33c9d55b1db33e8a994c489
+  metadata.gz: 7cdd6ed3f2403c63011b2f3d023afca6d1fde795b09e65d39e9044384ed52b2b
+  data.tar.gz: 4086a9ec931c36512084c11c7327be1cd5d21c85106bbe312e6c142dbe9639d4
 SHA512:
-  metadata.gz: 6851b1f389210dfe5bbef588b1d9de910c7f0d8c595e896a1d3419a937d20bbc643dbe08af40bd896e9f8f8149f59bea736b2f338512046f6d11f197e21b0999
-  data.tar.gz: 6a6db0027bde29086eff8f66240a9343288b393e2c25515e6e92b667d31b0fe89ea100de94adbb4bd8d244f7b527e7d05b3be1edf97917fad54512a88ca79132
+  metadata.gz: d2a193173054fd17d1b1e30218887722b58f3a12192923e7abb36e167314c6378e4255a87632bb9fac21035f7bc33e6b4f032ff8dd180fef938ef1977ab6cbd6
+  data.tar.gz: c3b52a5693c876cee96728332e694824350a50a2d4945c04c941e84d7c85d874216a02472f9121c3ca4446420cf686a6d8b8b042397deafcb4dcc89870e5a8b7

data/CLAUDE.md CHANGED

@@ -47,7 +47,7 @@ All classes live under `lib/relaton/w3c/` in the `Relaton::W3c` namespace:
 - **`Processor`** (`processor.rb`) — extends `Relaton::Core::Processor`, registers the W3C flavor (prefix `W3C`, dataset `w3c-api`)
 **Data fetching:**
-- **`DataFetcher`** (`data_fetcher.rb`) — extends `Core::DataFetcher`, fetches all W3C specs via the W3C API
+- **`DataFetcher`** (`data_fetcher.rb`) — extends `Core::DataFetcher`, fetches all W3C specs via the W3C API. Fetches the specification index with `embed: true` so each spec is realized from the page's embedded payload instead of a per-spec HTTP request, and paginates by page number (only the `fetch` path repopulates `_embedded`, unlike realizing the `next` link). Runs `fetch_spec` across a small thread pool. A SIGINT (Ctrl-C) is handled gracefully — the producer stops queuing and workers stop after their in-flight spec, then the index of everything fetched so far is saved (the prior INT handler is restored afterwards, so the trap doesn't leak into the host process). See **Crawler tuning** for the env-var knobs.
 - **`DataParser`** (`data_parser.rb`) — converts W3C API spec objects into `Relaton::W3c::Item` instances
 - **`SafeRealize`** (`safe_realize.rb`) — mixin that, on a terminal error, skips the resource (returns `nil`) so one bad link doesn't abort the crawl (see Rate limiting & retries). It does not retry or cache successes — those live upstream.
 - **`PubId`** (`pubid.rb`) — parses and compares W3C document identifiers (stage, code, date parts)
@@ -57,6 +57,15 @@ All classes live under `lib/relaton/w3c/` in the `Relaton::W3c` namespace:
 The entry module is defined in `lib/relaton/w3c.rb` and exposes `grammar_hash`.
+### Crawler tuning
+`DataFetcher` is tunable via environment variables (read by class methods, so they apply to the whole crawl):
+- **`RELATON_W3C_FETCH_CONCURRENCY`** (default `8`) — number of `fetch_spec` worker threads. Lower it to lighten load on api.w3.org or for debugging.
+- **`RELATON_W3C_FETCH_VERSIONS`** (default enabled) — set to `false`/`0`/`no`/`off` for a faster, shallower crawl that emits only the top-level specifications and skips each spec's version-history fan-out (version_history, predecessor/successor versions — the bulk of the API requests). Leave it set (the default) for a complete dataset.
+`embed: true` (always on) inlines each specification into its index page, so the per-spec realize is served from memory rather than an HTTP request — the largest single reduction in request count.
 ### Rate limiting & retries
 Transient-failure resilience is layered upstream, not in this gem:
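
The two environment knobs described under Crawler tuning can be sketched in isolation as follows. This is a minimal illustration, not the gem's code: the `CrawlTuning` module and its `env` parameter are hypothetical names introduced here so the parsing rules can be tried without touching the real `ENV`.

```ruby
# Hypothetical helper (not part of relaton-w3c); it mirrors the documented rules.
module CrawlTuning
  DEFAULT_CONCURRENCY = 8
  FALSY = %w[0 false no off].freeze

  # Worker thread count; falls back to the default when the variable is unset.
  def self.concurrency(env = ENV)
    (env["RELATON_W3C_FETCH_CONCURRENCY"] || DEFAULT_CONCURRENCY).to_i
  end

  # Version crawling stays enabled unless the value is an explicit "off" word.
  def self.fetch_versions?(env = ENV)
    val = env["RELATON_W3C_FETCH_VERSIONS"]
    return true if val.nil? || val.empty?

    !FALSY.include?(val.strip.downcase)
  end
end

p CrawlTuning.concurrency({})                                            # => 8
p CrawlTuning.fetch_versions?({})                                        # => true
p CrawlTuning.fetch_versions?("RELATON_W3C_FETCH_VERSIONS" => " OFF ")  # => false
```

Note that any value outside the four "off" words, including a typo such as `flase`, leaves the version crawl enabled.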

data/README.adoc CHANGED

@@ -118,6 +118,22 @@ require 'relaton/w3c/data_fetcher'
 Relaton::W3c::DataFetcher.fetch
 ----
+The crawl is tunable via environment variables:
+- `RELATON_W3C_FETCH_CONCURRENCY` (default `8`) - number of parallel worker threads. Lower it to lighten load on `api.w3.org`.
+- `RELATON_W3C_FETCH_VERSIONS` (default enabled) - set to `false` for a faster, shallower crawl that fetches only the top-level specifications and skips each spec's version history (the bulk of the API requests). Leave it unset for a complete dataset.
+The fetcher requests the specifications index with embedded specification data, so each specification is read from the page already in memory instead of issuing a separate HTTP request.
+A full crawl is long-running, so it handles `Ctrl-C` gracefully: it stops fetching, lets in-flight work finish, and saves the index of everything collected so far rather than losing the run.
+[source,sh]
+----
+# Fast, shallow refresh: top-level specs only, 4 workers
+RELATON_W3C_FETCH_VERSIONS=false RELATON_W3C_FETCH_CONCURRENCY=4 \
+  ruby -r relaton/w3c/data_fetcher -e 'Relaton::W3c::DataFetcher.fetch'
+----
 === Logging
 RelatonW3c uses the relaton-logger gem for logging. By default, it logs to STDOUT. To change the log levels and add other loggers, read the https://github.com/relaton/relaton-logger#usage[relaton-logger] documentation.

data/lib/relaton/w3c/data_fetcher.rb CHANGED

@@ -19,9 +19,22 @@ module Relaton
         (ENV["RELATON_W3C_FETCH_CONCURRENCY"] || DEFAULT_CONCURRENCY).to_i
       end
+      # Whether to crawl each specification's version history (version_history,
+      # predecessor_versions, successor_versions). Enabled by default for a
+      # complete dataset. Set RELATON_W3C_FETCH_VERSIONS=false for a faster,
+      # shallower crawl that emits only the top-level specifications and skips
+      # the per-spec version fan-out (the bulk of the API requests).
+      def self.fetch_versions?
+        val = ENV["RELATON_W3C_FETCH_VERSIONS"]
+        return true if val.nil? || val.empty?
+        !%w[0 false no off].include?(val.strip.downcase)
+      end
       def initialize(*args)
         super
         @mutex = Mutex.new
+        @interrupted = false
       end
       def index
@@ -39,41 +52,83 @@ module Relaton
       #
       # Parse documents in parallel. The crawler is heavily I/O-bound on
       # api.w3.org round-trips (~30-50k requests per run), so a small thread
-      # pool gives a near-linear speedup. Pagination still happens serially
-      # because each page depends on the previous response's `next` link.
+      # pool gives a near-linear speedup. Pagination still happens serially:
+      # each page's `next?` flag gates whether the next page is requested.
+      #
+      # A SIGINT (Ctrl-C) is handled gracefully: the producer stops queuing and
+      # the workers stop processing after their in-flight spec, then the index
+      # of everything fetched so far is saved rather than the run being lost.
       #
       def fetch(_source = nil)
         n_workers = self.class.concurrency
         queue = SizedQueue.new(n_workers * 4)
         workers = Array.new(n_workers) { spawn_worker(queue) }
-        specs = client.specifications
+        with_interrupt_handler do
+          enqueue_specs(queue)
+          n_workers.times { queue << nil } # poison pills
+          workers.each(&:join)
+          Util.warn "Crawl interrupted — saving progress collected so far." if @interrupted
+          index.save
+        end
+        report_errors
+      end
+      #
+      # Page through the specifications index, feeding each spec (paired with
+      # its embedded page) to the worker queue. Returns early when interrupted.
+      #
+      # embed: true inlines each specification's full payload into the index
+      # page's `_embedded` block, so a spec link realizes from that page in
+      # memory instead of making its own HTTP request — one request per page
+      # rather than one per specification. The page is queued alongside each
+      # link so the worker can hand it back to realize as the parent_resource.
+      #
+      def enqueue_specs(queue)
+        specs = client.specifications(embed: true)
         loop do
-          specs.links.specifications.each { |spec| queue << spec }
-          break unless specs.next?
+          page = specs
+          page.links.specifications.each do |spec|
+            break if @interrupted
-          # Route pagination through realize so transient 403/5xx on the
-          # next-page link retry with backoff instead of crashing the crawl.
-          next_page = realize(specs.links.next)
+            queue << [spec, page]
+          end
+          break if @interrupted || !page.next?
+          # Fetch the next page through the client's fetch path rather than
+          # realizing the `next` link: only fetch populates the page's
+          # embedded_data, so this keeps embed working past page 1. Realizing
+          # the `next` link drops `_embedded` and forces a per-spec HTTP
+          # request for every specification on every later page.
+          next_page = fetch_specifications_page(page.page + 1)
           break unless next_page
           specs = next_page
         end
-        n_workers.times { queue << nil } # poison pills
-        workers.each(&:join)
-        index.save
-        report_errors
       end
-      def fetch_spec(unrealized_spec)
-        spec = realize unrealized_spec
+      def fetch_spec(unrealized_spec, page = nil)
+        # When `page` came from an embed:true fetch, realizing against it as the
+        # parent_resource serves the spec from embedded data (no HTTP request).
+        spec = realize(unrealized_spec, parent_resource: page)
         return unless spec
         local_errors = Hash.new(true)
         save_doc DataParser.parse(spec, local_errors)
+        fetch_versions(spec) if self.class.fetch_versions?
+        @mutex.synchronize { local_errors.each { |k, v| @errors[k] &&= v } }
+      end
+      #
+      # Crawl a specification's version history: its dated editions plus the
+      # predecessor/successor version chains. Each entry is a separate HTTP
+      # request, so this is the bulk of a run and can be skipped via
+      # RELATON_W3C_FETCH_VERSIONS=false (see .fetch_versions?).
+      #
+      def fetch_versions(spec)
         if spec.links.respond_to?(:version_history) && spec.links.version_history
           version_history = realize spec.links.version_history
           version_history&.links&.spec_versions&.each { |version| parse_and_save version }
@@ -84,12 +139,10 @@ module Relaton
           predecessor_versions&.links&.predecessor_versions&.each { |version| parse_and_save version }
         end
-        if spec.links.respond_to?(:successor_versions) && spec.links.successor_versions
-          successor_versions = realize spec.links.successor_versions
-          successor_versions&.links&.successor_versions&.each { |version| parse_and_save version }
-        end
+        return unless spec.links.respond_to?(:successor_versions) && spec.links.successor_versions
-        @mutex.synchronize { local_errors.each { |k, v| @errors[k] &&= v } }
+        successor_versions = realize spec.links.successor_versions
+        successor_versions&.links&.successor_versions&.each { |version| parse_and_save version }
       end
       #
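
The page-number pagination introduced in `enqueue_specs` can be sketched with a stand-in client. Everything below is a toy under stated assumptions: `FakeClient`, `IndexPage`, and the `interrupted:` callable are hypothetical and do not reproduce the `w3c_api` client. The sketch only shows the control flow: each spec is queued together with its page, and the next page is requested by number until `next?` is false, a page fails to load, or the crawl is interrupted.

```ruby
# Toy stand-ins for the API client and an index page (hypothetical names).
IndexPage = Struct.new(:page, :specs, :final) do
  def next?
    !final
  end
end

class FakeClient
  attr_reader :requests

  def initialize(pages)
    @pages = pages # one array of spec names per index page
    @requests = 0
  end

  # Returns the requested page, or nil past the end (standing in for a
  # terminal fetch error that stops pagination).
  def specifications(page: 1)
    @requests += 1
    specs = @pages[page - 1]
    return nil unless specs

    IndexPage.new(page, specs, page == @pages.size)
  end
end

# Walk the index by page number, pairing every spec with the page it came from.
def enqueue_specs(client, queue, interrupted: -> { false })
  page = client.specifications(page: 1)
  while page
    page.specs.each do |spec|
      return if interrupted.call

      queue << [spec, page]
    end
    break unless page.next?

    page = client.specifications(page: page.page + 1)
  end
end

queue = Queue.new
client = FakeClient.new([%w[a b], %w[c], %w[d e]])
enqueue_specs(client, queue)
puts "queued #{queue.size} specs in #{client.requests} index requests"
```

With three index pages the sketch issues three index requests in total, which is the "one request per page rather than one per specification" property the diff comments describe.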
@@ -139,11 +192,43 @@ module Relaton
       private
+      # Install a SIGINT handler for the duration of the crawl so Ctrl-C sets
+      # the @interrupted flag (observed by the producer loop and the workers)
+      # instead of killing the process mid-write. The trap body is kept minimal
+      # (no I/O or locking) because trap context is restricted; the user-facing
+      # notice is printed from the main thread once the crawl winds down. The
+      # previous handler is restored on the way out so the trap doesn't leak
+      # into the host process.
+      def with_interrupt_handler
+        previous = Signal.trap("INT") { @interrupted = true }
+        yield
+      ensure
+        Signal.trap("INT", previous || "DEFAULT")
+      end
+      # Fetch one page of the specifications index with embed enabled. Goes
+      # through the client (the register's fetch path) so the page's
+      # embedded_data is populated. Transient 403/5xx/connection failures are
+      # already retried upstream (w3c_api/lutaml-hal); a terminal error here
+      # stops pagination gracefully rather than crashing the crawl.
+      def fetch_specifications_page(number)
+        client.specifications(embed: true, page: number)
+      rescue Lutaml::Hal::Error, Faraday::Error => e
+        log_error "Failed to fetch specifications page #{number}: " \
+                  "#{e.class}: #{e.message}"
+        nil
+      end
       def spawn_worker(queue)
         Thread.new do
-          while (spec = queue.pop)
+          while (item = queue.pop)
+            # Once interrupted, drain the queue without processing so the
+            # producer unblocks and the pool reaches its poison pills quickly.
+            next if @interrupted
+            spec, page = item
             begin
-              fetch_spec spec
+              fetch_spec spec, page
             rescue StandardError => e
               log_error "fetch_spec failed: #{e.class}: #{e.message}\n" \
                         "#{e.backtrace.first(5).join("\n")}"
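
The interplay between the INT trap, the producer loop, and the draining workers can be sketched as a miniature crawl. This is an illustrative sketch, not the gem's implementation: the `MiniCrawl` class, its `interrupt!` method (a stand-in for pressing Ctrl-C), and the doubling "work" are all hypothetical. Only `Signal.trap`, `SizedQueue`, `Thread`, and `Mutex` are real Ruby APIs.

```ruby
# Hypothetical miniature of the crawl loop; names are illustrative only.
class MiniCrawl
  def initialize(workers: 2)
    @n_workers = workers
    @interrupted = false
    @mutex = Mutex.new
    @done = []
  end

  # Stand-in for what the INT trap does: flip a flag and nothing else.
  def interrupt!
    @interrupted = true
  end

  def run(items)
    queue = SizedQueue.new(@n_workers * 4)
    workers = Array.new(@n_workers) { spawn_worker(queue) }
    with_interrupt_handler do
      items.each do |item|
        break if @interrupted # producer stops queuing

        queue << item
      end
      @n_workers.times { queue << nil } # poison pills end each worker loop
      workers.each(&:join)
    end
    @done
  end

  private

  # Trap INT for the duration of the block, then restore the prior handler.
  def with_interrupt_handler
    previous = Signal.trap("INT") { @interrupted = true }
    yield
  ensure
    Signal.trap("INT", previous || "DEFAULT")
  end

  def spawn_worker(queue)
    Thread.new do
      while (item = queue.pop)
        next if @interrupted # drain without processing once interrupted

        result = process(item)
        @mutex.synchronize { @done << result }
      end
    end
  end

  # Placeholder for fetch_spec: the real work would be an API round-trip.
  def process(item)
    item * 2
  end
end

p MiniCrawl.new(workers: 3).run((1..5).to_a).sort # => [2, 4, 6, 8, 10]
```

Keeping the trap body to a single assignment follows the reasoning in the diff's comment: trap context is restricted, so logging and saving happen on the main thread after the workers have joined.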

data/lib/relaton/w3c/safe_realize.rb CHANGED

@@ -22,11 +22,15 @@ module Relaton
         @skipped
       end
-      def realize(obj)
+      # @param parent_resource [Object, nil] the index/page the link came from.
+      #   When the page was fetched with `embed: true`, its inlined `_embedded`
+      #   payload lets the link realize from memory instead of issuing an HTTP
+      #   request. nil (the default) preserves the plain remote-fetch behavior.
+      def realize(obj, parent_resource: nil)
         href = resolve_href(obj)
         return nil if SafeRealize.skipped.key?(href)
-        obj.realize
+        obj.realize(parent_resource: parent_resource)
       rescue Lutaml::Hal::ConnectionError, Lutaml::Hal::TimeoutError, Faraday::Error, Net::OpenTimeout => e
         # Network-level failure (already retried by w3c_api). The resource itself
         # is fine, so don't skip it permanently — a later reference can try again.
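
The effect of passing `parent_resource:` through `realize` can be illustrated with a toy model. The classes below are hypothetical and deliberately simplified; they do not reproduce how lutaml-hal resolves embedded data, and the blanket `rescue StandardError` is a simplification of the specific error classes the real mixin handles. The point is only the shape of the call: a link realized against a page that carries embedded data is served from memory, while the default `nil` falls back to a remote fetch.

```ruby
# Toy model only: hypothetical classes, not lutaml-hal's implementation.
class ToyLink
  @remote_fetches = 0
  class << self
    attr_accessor :remote_fetches
  end

  attr_reader :href

  def initialize(href)
    @href = href
  end

  # Serve from the parent's embedded payload when present, else "fetch" remotely.
  def realize(parent_resource: nil)
    embedded = parent_resource&.embedded&.fetch(href, nil)
    return embedded if embedded

    self.class.remote_fetches += 1
    "remote:#{href}"
  end
end

# A page whose embedded hash maps hrefs to already-loaded payloads.
ToyPage = Struct.new(:embedded)

module ToySafeRealize
  # Forward parent_resource; on an error, skip the resource by returning nil.
  def realize(obj, parent_resource: nil)
    obj.realize(parent_resource: parent_resource)
  rescue StandardError
    nil
  end
end

fetcher = Object.new.extend(ToySafeRealize)
page = ToyPage.new({ "/specs/a" => "embedded:/specs/a" })
p fetcher.realize(ToyLink.new("/specs/a"), parent_resource: page) # => "embedded:/specs/a"
p fetcher.realize(ToyLink.new("/specs/a"))                        # => "remote:/specs/a"
p ToyLink.remote_fetches                                          # => 1
```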

data/lib/relaton/w3c/version.rb CHANGED

@@ -1,5 +1,5 @@
 module Relaton
   module W3c
-    VERSION = "2.1.3".freeze
+    VERSION = "2.1.4".freeze
   end
 end

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: relaton-w3c
 version: !ruby/object:Gem::Version
-  version: 2.1.3
+  version: 2.1.4
 platform: ruby
 authors:
 - Ribose Inc.
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2026-06-03 00:00:00.000000000 Z
+date: 2026-06-04 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: relaton-bib