relaton-w3c 2.1.3 → 2.1.4

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: ac21e91a675f1a0c33ea6a0610edb38895ffe87d9b810593331f9a79cdbe0324
-   data.tar.gz: 74993548a097428280e01147eea4dd604adf181ed33c9d55b1db33e8a994c489
+   metadata.gz: 7cdd6ed3f2403c63011b2f3d023afca6d1fde795b09e65d39e9044384ed52b2b
+   data.tar.gz: 4086a9ec931c36512084c11c7327be1cd5d21c85106bbe312e6c142dbe9639d4
  SHA512:
-   metadata.gz: 6851b1f389210dfe5bbef588b1d9de910c7f0d8c595e896a1d3419a937d20bbc643dbe08af40bd896e9f8f8149f59bea736b2f338512046f6d11f197e21b0999
-   data.tar.gz: 6a6db0027bde29086eff8f66240a9343288b393e2c25515e6e92b667d31b0fe89ea100de94adbb4bd8d244f7b527e7d05b3be1edf97917fad54512a88ca79132
+   metadata.gz: d2a193173054fd17d1b1e30218887722b58f3a12192923e7abb36e167314c6378e4255a87632bb9fac21035f7bc33e6b4f032ff8dd180fef938ef1977ab6cbd6
+   data.tar.gz: c3b52a5693c876cee96728332e694824350a50a2d4945c04c941e84d7c85d874216a02472f9121c3ca4446420cf686a6d8b8b042397deafcb4dcc89870e5a8b7
data/CLAUDE.md CHANGED
@@ -47,7 +47,7 @@ All classes live under `lib/relaton/w3c/` in the `Relaton::W3c` namespace:
  - **`Processor`** (`processor.rb`) — extends `Relaton::Core::Processor`, registers the W3C flavor (prefix `W3C`, dataset `w3c-api`)

  **Data fetching:**
- - **`DataFetcher`** (`data_fetcher.rb`) — extends `Core::DataFetcher`, fetches all W3C specs via the W3C API
+ - **`DataFetcher`** (`data_fetcher.rb`) — extends `Core::DataFetcher`, fetches all W3C specs via the W3C API. Fetches the specification index with `embed: true` so each spec is realized from the page's embedded payload instead of a per-spec HTTP request, and paginates by page number (only the `fetch` path repopulates `_embedded`, unlike realizing the `next` link). Runs `fetch_spec` across a small thread pool. A SIGINT (Ctrl-C) is handled gracefully — the producer stops queuing and workers stop after their in-flight spec, then the index of everything fetched so far is saved (the prior INT handler is restored afterwards, so the trap doesn't leak into the host process). See **Crawler tuning** for the env-var knobs.
  - **`DataParser`** (`data_parser.rb`) — converts W3C API spec objects into `Relaton::W3c::Item` instances
  - **`SafeRealize`** (`safe_realize.rb`) — mixin that, on a terminal error, skips the resource (returns `nil`) so one bad link doesn't abort the crawl (see Rate limiting & retries). It does not retry or cache successes — those live upstream.
  - **`PubId`** (`pubid.rb`) — parses and compares W3C document identifiers (stage, code, date parts)
@@ -57,6 +57,15 @@ All classes live under `lib/relaton/w3c/` in the `Relaton::W3c` namespace:

  The entry module is defined in `lib/relaton/w3c.rb` and exposes `grammar_hash`.

+ ### Crawler tuning
+
+ `DataFetcher` is tunable via environment variables (read by class methods, so they apply to the whole crawl):
+
+ - **`RELATON_W3C_FETCH_CONCURRENCY`** (default `8`) — number of `fetch_spec` worker threads. Lower it to lighten load on api.w3.org or for debugging.
+ - **`RELATON_W3C_FETCH_VERSIONS`** (default enabled) — set to `false`/`0`/`no`/`off` for a faster, shallower crawl that emits only the top-level specifications and skips each spec's version-history fan-out (version_history, predecessor/successor versions — the bulk of the API requests). Leave it set (the default) for a complete dataset.
+
+ `embed: true` (always on) inlines each specification into its index page, so the per-spec realize is served from memory rather than an HTTP request — the largest single reduction in request count.
+
  ### Rate limiting & retries

  Transient-failure resilience is layered upstream, not in this gem:
data/README.adoc CHANGED
@@ -118,6 +118,22 @@ require 'relaton/w3c/data_fetcher'
  Relaton::W3c::DataFetcher.fetch
  ----

+ The crawl is tunable via environment variables:
+
+ - `RELATON_W3C_FETCH_CONCURRENCY` (default `8`) - number of parallel worker threads. Lower it to lighten load on `api.w3.org`.
+ - `RELATON_W3C_FETCH_VERSIONS` (default enabled) - set to `false` for a faster, shallower crawl that fetches only the top-level specifications and skips each spec's version history (the bulk of the API requests). Leave it unset for a complete dataset.
+
+ The fetcher requests the specifications index with embedded specification data, so each specification is read from the page already in memory instead of issuing a separate HTTP request.
+
+ A full crawl is long-running, so it handles `Ctrl-C` gracefully: it stops fetching, lets in-flight work finish, and saves the index of everything collected so far rather than losing the run.
+
+ [source,sh]
+ ----
+ # Fast, shallow refresh: top-level specs only, 4 workers
+ RELATON_W3C_FETCH_VERSIONS=false RELATON_W3C_FETCH_CONCURRENCY=4 \
+ ruby -r relaton/w3c/data_fetcher -e 'Relaton::W3c::DataFetcher.fetch'
+ ----
+
  === Logging

  RelatonW3c uses the relaton-logger gem for logging. By default, it logs to STDOUT. To change the log levels and add other loggers, read the https://github.com/relaton/relaton-logger#usage[relaton-logger] documentation.
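
The two environment variables documented in the README change above follow a simple parsing rule. As a minimal sketch, assuming a hypothetical `TuningSketch` module that takes the environment as a plain Hash argument (the gem itself reads `ENV` in `DataFetcher` class methods), the behavior looks roughly like this:

```ruby
# Hypothetical stand-in for the crawler's tuning knobs; not the gem's code.
# The environment is injected as a Hash-like object so the sketch can be
# exercised without mutating the real ENV.
module TuningSketch
  DEFAULT_CONCURRENCY = 8

  module_function

  # Worker count: the variable wins when present, otherwise the default.
  def concurrency(env = ENV)
    (env["RELATON_W3C_FETCH_CONCURRENCY"] || DEFAULT_CONCURRENCY).to_i
  end

  # Version crawling stays enabled unless the variable holds an explicit
  # "off" spelling. Unset and empty values both mean "enabled".
  def fetch_versions?(env = ENV)
    val = env["RELATON_W3C_FETCH_VERSIONS"]
    return true if val.nil? || val.empty?

    !%w[0 false no off].include?(val.strip.downcase)
  end
end

# Print the decision for a few representative settings.
[nil, "", "false", " Off ", "yes"].each do |value|
  env = value.nil? ? {} : { "RELATON_W3C_FETCH_VERSIONS" => value }
  puts "#{value.inspect} => #{TuningSketch.fetch_versions?(env)}"
end
```

Only `0`, `false`, `no`, and `off` (in any case, with surrounding whitespace ignored) disable the version crawl; any other value keeps it on. Note that `String#to_i` turns a non-numeric concurrency value such as `abc` into `0`, so a caller may want to validate that variable before relying on it.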
data/lib/relaton/w3c/data_fetcher.rb CHANGED
@@ -19,9 +19,22 @@ module Relaton
          (ENV["RELATON_W3C_FETCH_CONCURRENCY"] || DEFAULT_CONCURRENCY).to_i
        end

+       # Whether to crawl each specification's version history (version_history,
+       # predecessor_versions, successor_versions). Enabled by default for a
+       # complete dataset. Set RELATON_W3C_FETCH_VERSIONS=false for a faster,
+       # shallower crawl that emits only the top-level specifications and skips
+       # the per-spec version fan-out (the bulk of the API requests).
+       def self.fetch_versions?
+         val = ENV["RELATON_W3C_FETCH_VERSIONS"]
+         return true if val.nil? || val.empty?
+
+         !%w[0 false no off].include?(val.strip.downcase)
+       end
+
        def initialize(*args)
          super
          @mutex = Mutex.new
+         @interrupted = false
        end

        def index
@@ -39,41 +52,83 @@ module Relaton
        #
        # Parse documents in parallel. The crawler is heavily I/O-bound on
        # api.w3.org round-trips (~30-50k requests per run), so a small thread
-       # pool gives a near-linear speedup. Pagination still happens serially
-       # because each page depends on the previous response's `next` link.
+       # pool gives a near-linear speedup. Pagination still happens serially:
+       # each page's `next?` flag gates whether the next page is requested.
+       #
+       # A SIGINT (Ctrl-C) is handled gracefully: the producer stops queuing and
+       # the workers stop processing after their in-flight spec, then the index
+       # of everything fetched so far is saved rather than the run being lost.
        #
        def fetch(_source = nil)
          n_workers = self.class.concurrency
          queue = SizedQueue.new(n_workers * 4)
          workers = Array.new(n_workers) { spawn_worker(queue) }

-         specs = client.specifications
+         with_interrupt_handler do
+           enqueue_specs(queue)
+           n_workers.times { queue << nil } # poison pills
+           workers.each(&:join)
+           Util.warn "Crawl interrupted — saving progress collected so far." if @interrupted
+           index.save
+         end
+
+         report_errors
+       end
+
+       #
+       # Page through the specifications index, feeding each spec (paired with
+       # its embedded page) to the worker queue. Returns early when interrupted.
+       #
+       # embed: true inlines each specification's full payload into the index
+       # page's `_embedded` block, so a spec link realizes from that page in
+       # memory instead of making its own HTTP request — one request per page
+       # rather than one per specification. The page is queued alongside each
+       # link so the worker can hand it back to realize as the parent_resource.
+       #
+       def enqueue_specs(queue)
+         specs = client.specifications(embed: true)
          loop do
-           specs.links.specifications.each { |spec| queue << spec }
-           break unless specs.next?
+           page = specs
+           page.links.specifications.each do |spec|
+             break if @interrupted

-           # Route pagination through realize so transient 403/5xx on the
-           # next-page link retry with backoff instead of crashing the crawl.
-           next_page = realize(specs.links.next)
+             queue << [spec, page]
+           end
+           break if @interrupted || !page.next?
+
+           # Fetch the next page through the client's fetch path rather than
+           # realizing the `next` link: only fetch populates the page's
+           # embedded_data, so this keeps embed working past page 1. Realizing
+           # the `next` link drops `_embedded` and forces a per-spec HTTP
+           # request for every specification on every later page.
+           next_page = fetch_specifications_page(page.page + 1)
            break unless next_page

            specs = next_page
          end
-
-         n_workers.times { queue << nil } # poison pills
-         workers.each(&:join)
-
-         index.save
-         report_errors
        end

-       def fetch_spec(unrealized_spec)
-         spec = realize unrealized_spec
+       def fetch_spec(unrealized_spec, page = nil)
+         # When `page` came from an embed:true fetch, realizing against it as the
+         # parent_resource serves the spec from embedded data (no HTTP request).
+         spec = realize(unrealized_spec, parent_resource: page)
          return unless spec

          local_errors = Hash.new(true)
          save_doc DataParser.parse(spec, local_errors)

+         fetch_versions(spec) if self.class.fetch_versions?
+
+         @mutex.synchronize { local_errors.each { |k, v| @errors[k] &&= v } }
+       end
+
+       #
+       # Crawl a specification's version history: its dated editions plus the
+       # predecessor/successor version chains. Each entry is a separate HTTP
+       # request, so this is the bulk of a run and can be skipped via
+       # RELATON_W3C_FETCH_VERSIONS=false (see .fetch_versions?).
+       #
+       def fetch_versions(spec)
          if spec.links.respond_to?(:version_history) && spec.links.version_history
            version_history = realize spec.links.version_history
            version_history&.links&.spec_versions&.each { |version| parse_and_save version }
@@ -84,12 +139,10 @@ module Relaton
            predecessor_versions&.links&.predecessor_versions&.each { |version| parse_and_save version }
          end

-         if spec.links.respond_to?(:successor_versions) && spec.links.successor_versions
-           successor_versions = realize spec.links.successor_versions
-           successor_versions&.links&.successor_versions&.each { |version| parse_and_save version }
-         end
+         return unless spec.links.respond_to?(:successor_versions) && spec.links.successor_versions

-         @mutex.synchronize { local_errors.each { |k, v| @errors[k] &&= v } }
+         successor_versions = realize spec.links.successor_versions
+         successor_versions&.links&.successor_versions&.each { |version| parse_and_save version }
        end

        #
@@ -139,11 +192,43 @@ module Relaton

        private

+       # Install a SIGINT handler for the duration of the crawl so Ctrl-C sets
+       # the @interrupted flag (observed by the producer loop and the workers)
+       # instead of killing the process mid-write. The trap body is kept minimal
+       # (no I/O or locking) because trap context is restricted; the user-facing
+       # notice is printed from the main thread once the crawl winds down. The
+       # previous handler is restored on the way out so the trap doesn't leak
+       # into the host process.
+       def with_interrupt_handler
+         previous = Signal.trap("INT") { @interrupted = true }
+         yield
+       ensure
+         Signal.trap("INT", previous || "DEFAULT")
+       end
+
+       # Fetch one page of the specifications index with embed enabled. Goes
+       # through the client (the register's fetch path) so the page's
+       # embedded_data is populated. Transient 403/5xx/connection failures are
+       # already retried upstream (w3c_api/lutaml-hal); a terminal error here
+       # stops pagination gracefully rather than crashing the crawl.
+       def fetch_specifications_page(number)
+         client.specifications(embed: true, page: number)
+       rescue Lutaml::Hal::Error, Faraday::Error => e
+         log_error "Failed to fetch specifications page #{number}: " \
+                   "#{e.class}: #{e.message}"
+         nil
+       end
+
        def spawn_worker(queue)
          Thread.new do
-           while (spec = queue.pop)
+           while (item = queue.pop)
+             # Once interrupted, drain the queue without processing so the
+             # producer unblocks and the pool reaches its poison pills quickly.
+             next if @interrupted
+
+             spec, page = item
              begin
-               fetch_spec spec
+               fetch_spec spec, page
              rescue StandardError => e
                log_error "fetch_spec failed: #{e.class}: #{e.message}\n" \
                          "#{e.backtrace.first(5).join("\n")}"
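
The interplay between the bounded queue, the poison pills, and the interrupt flag in the hunks above can be tried in isolation. The following is only a sketch under stated assumptions: `MiniCrawler`, its `interrupt!` helper, and the block-based work items are hypothetical names invented for illustration, and the flag is flipped programmatically instead of by a real Ctrl-C.

```ruby
# Hypothetical miniature of the crawl loop; not part of relaton-w3c.
class MiniCrawler
  def initialize(n_workers: 2)
    @n_workers = n_workers
    @interrupted = false
    @mutex = Mutex.new
    @done = []
  end

  # Feed items to the pool; returns the results collected before any interrupt.
  def run(items, &work)
    queue = SizedQueue.new(@n_workers * 4) # bounded, so the producer backs off
    workers = Array.new(@n_workers) { spawn_worker(queue, work) }

    with_interrupt_handler do
      items.each do |item|
        break if @interrupted # producer stops queuing

        queue << item
      end
      @n_workers.times { queue << nil } # poison pills, one per worker
      workers.each(&:join)
    end
    @done
  end

  # Same effect as the trap body: it only flips a flag.
  def interrupt!
    @interrupted = true
  end

  private

  # Install the INT trap for the duration of the block, then restore
  # whatever handler was there before.
  def with_interrupt_handler
    previous = Signal.trap("INT") { @interrupted = true }
    yield
  ensure
    Signal.trap("INT", previous || "DEFAULT")
  end

  def spawn_worker(queue, work)
    Thread.new do
      while (item = queue.pop)
        next if @interrupted # drain without processing

        result = work.call(item)
        @mutex.synchronize { @done << result }
      end
    end
  end
end

# One worker keeps the order deterministic; the "interrupt" fires on item 3.
crawler = MiniCrawler.new(n_workers: 1)
results = crawler.run(1..20) do |i|
  crawler.interrupt! if i == 3
  i
end
puts results.inspect
```

The item that is in flight when the flag flips still completes, which mirrors the "workers stop after their in-flight spec" wording. Everything queued after it is popped and discarded, so the producer never stays blocked on a full queue and each worker still reaches its poison pill.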
data/lib/relaton/w3c/safe_realize.rb CHANGED
@@ -22,11 +22,15 @@ module Relaton
          @skipped
        end

-       def realize(obj)
+       # @param parent_resource [Object, nil] the index/page the link came from.
+       # When the page was fetched with `embed: true`, its inlined `_embedded`
+       # payload lets the link realize from memory instead of issuing an HTTP
+       # request. nil (the default) preserves the plain remote-fetch behavior.
+       def realize(obj, parent_resource: nil)
          href = resolve_href(obj)
          return nil if SafeRealize.skipped.key?(href)

-         obj.realize
+         obj.realize(parent_resource: parent_resource)
        rescue Lutaml::Hal::ConnectionError, Lutaml::Hal::TimeoutError, Faraday::Error, Net::OpenTimeout => e
          # Network-level failure (already retried by w3c_api). The resource itself
          # is fine, so don't skip it permanently — a later reference can try again.
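
To see why forwarding `parent_resource` matters, here is a toy model of the idea. It is a sketch only: `StubPage`, `StubLink`, and `safe_realize` are hypothetical stand-ins, and the hash lookup merely imitates "serve from the page's embedded payload". How lutaml-hal actually resolves embedded data is not shown in this diff.

```ruby
# Hypothetical stand-ins that mimic the shape of a HAL page and link.
# They are not Lutaml::Hal or w3c_api classes.
StubPage = Struct.new(:embedded) # href => payload inlined by an embed-style fetch

class StubLink
  @remote_requests = 0
  class << self
    attr_accessor :remote_requests # counts simulated HTTP round-trips
  end

  attr_reader :href

  def initialize(href)
    @href = href
  end

  # Serve from the parent's embedded payload when available; otherwise
  # simulate a remote fetch and count it.
  def realize(parent_resource: nil)
    embedded = parent_resource&.embedded&.fetch(href, nil)
    return embedded if embedded

    self.class.remote_requests += 1
    "remote:#{href}"
  end
end

# Wrapper in the spirit of the patched method: pass the keyword through
# unchanged, so nil keeps the plain remote-fetch behavior.
def safe_realize(obj, parent_resource: nil)
  obj.realize(parent_resource: parent_resource)
end

page = StubPage.new({ "/specs/a" => "embedded:/specs/a" })
link = StubLink.new("/specs/a")

puts safe_realize(link, parent_resource: page) # served from the page
puts safe_realize(link)                        # falls back to a "request"
puts StubLink.remote_requests
```

In this model a page of N links realized against their parent costs no extra round-trips, while the same N links realized without it cost N, which is the saving the `embed: true` comments describe.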
data/lib/relaton/w3c/version.rb CHANGED
@@ -1,5 +1,5 @@
  module Relaton
    module W3c
-     VERSION = "2.1.3".freeze
+     VERSION = "2.1.4".freeze
    end
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: relaton-w3c
  version: !ruby/object:Gem::Version
-   version: 2.1.3
+   version: 2.1.4
  platform: ruby
  authors:
  - Ribose Inc.
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2026-06-03 00:00:00.000000000 Z
+ date: 2026-06-04 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: relaton-bib