relaton-w3c 2.1.3 → 2.1.4
- checksums.yaml +4 -4
- data/CLAUDE.md +10 -1
- data/README.adoc +16 -0
- data/lib/relaton/w3c/data_fetcher.rb +108 -23
- data/lib/relaton/w3c/safe_realize.rb +6 -2
- data/lib/relaton/w3c/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 7cdd6ed3f2403c63011b2f3d023afca6d1fde795b09e65d39e9044384ed52b2b
|
|
4
|
+
data.tar.gz: 4086a9ec931c36512084c11c7327be1cd5d21c85106bbe312e6c142dbe9639d4
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: d2a193173054fd17d1b1e30218887722b58f3a12192923e7abb36e167314c6378e4255a87632bb9fac21035f7bc33e6b4f032ff8dd180fef938ef1977ab6cbd6
|
|
7
|
+
data.tar.gz: c3b52a5693c876cee96728332e694824350a50a2d4945c04c941e84d7c85d874216a02472f9121c3ca4446420cf686a6d8b8b042397deafcb4dcc89870e5a8b7
|
data/CLAUDE.md
CHANGED
|
@@ -47,7 +47,7 @@ All classes live under `lib/relaton/w3c/` in the `Relaton::W3c` namespace:
|
|
|
47
47
|
- **`Processor`** (`processor.rb`) — extends `Relaton::Core::Processor`, registers the W3C flavor (prefix `W3C`, dataset `w3c-api`)
|
|
48
48
|
|
|
49
49
|
**Data fetching:**
|
|
50
|
-
- **`DataFetcher`** (`data_fetcher.rb`) — extends `Core::DataFetcher`, fetches all W3C specs via the W3C API
|
|
50
|
+
- **`DataFetcher`** (`data_fetcher.rb`) — extends `Core::DataFetcher`, fetches all W3C specs via the W3C API. Fetches the specification index with `embed: true` so each spec is realized from the page's embedded payload instead of a per-spec HTTP request, and paginates by page number (only the `fetch` path repopulates `_embedded`, unlike realizing the `next` link). Runs `fetch_spec` across a small thread pool. A SIGINT (Ctrl-C) is handled gracefully — the producer stops queuing and workers stop after their in-flight spec, then the index of everything fetched so far is saved (the prior INT handler is restored afterwards, so the trap doesn't leak into the host process). See **Crawler tuning** for the env-var knobs.
|
|
51
51
|
- **`DataParser`** (`data_parser.rb`) — converts W3C API spec objects into `Relaton::W3c::Item` instances
|
|
52
52
|
- **`SafeRealize`** (`safe_realize.rb`) — mixin that, on a terminal error, skips the resource (returns `nil`) so one bad link doesn't abort the crawl (see Rate limiting & retries). It does not retry or cache successes — those live upstream.
|
|
53
53
|
- **`PubId`** (`pubid.rb`) — parses and compares W3C document identifiers (stage, code, date parts)
|
|
@@ -57,6 +57,15 @@ All classes live under `lib/relaton/w3c/` in the `Relaton::W3c` namespace:
|
|
|
57
57
|
|
|
58
58
|
The entry module is defined in `lib/relaton/w3c.rb` and exposes `grammar_hash`.
|
|
59
59
|
|
|
60
|
+
### Crawler tuning
|
|
61
|
+
|
|
62
|
+
`DataFetcher` is tunable via environment variables (read by class methods, so they apply to the whole crawl):
|
|
63
|
+
|
|
64
|
+
- **`RELATON_W3C_FETCH_CONCURRENCY`** (default `8`) — number of `fetch_spec` worker threads. Lower it to lighten load on api.w3.org or for debugging.
|
|
65
|
+
- **`RELATON_W3C_FETCH_VERSIONS`** (default enabled) — set to `false`/`0`/`no`/`off` for a faster, shallower crawl that emits only the top-level specifications and skips each spec's version-history fan-out (version_history and predecessor/successor versions, which account for the bulk of the API requests). Leave it unset (the default) for a complete dataset.
|
|
66
|
+
|
|
67
|
+
`embed: true` (always on) inlines each specification into its index page, so the per-spec realize is served from memory rather than an HTTP request — the largest single reduction in request count.
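
The two environment knobs described above can be sketched in isolation as follows. This is a minimal illustration, not the gem's code: the `CrawlerEnv` module and the `env` parameter are hypothetical (introduced so the parsing can be exercised with a plain Hash), while the variable names, the default of `8`, and the list of "off" values come from the text above.

```ruby
# Hypothetical helper module for illustration; not part of relaton-w3c.
module CrawlerEnv
  DEFAULT_CONCURRENCY = 8
  OFF_VALUES = %w[0 false no off].freeze

  # Worker-thread count: the env value when present, otherwise the default.
  def self.concurrency(env = ENV)
    (env["RELATON_W3C_FETCH_CONCURRENCY"] || DEFAULT_CONCURRENCY).to_i
  end

  # The version crawl stays enabled unless the variable holds an "off" value.
  # Unset or empty means enabled; matching ignores case and outer whitespace.
  def self.fetch_versions?(env = ENV)
    val = env["RELATON_W3C_FETCH_VERSIONS"]
    return true if val.nil? || val.empty?

    !OFF_VALUES.include?(val.strip.downcase)
  end
end

shallow = { "RELATON_W3C_FETCH_VERSIONS" => "Off", "RELATON_W3C_FETCH_CONCURRENCY" => "4" }
puts CrawlerEnv.concurrency(shallow)      # 4
puts CrawlerEnv.fetch_versions?(shallow)  # false
puts CrawlerEnv.fetch_versions?({})       # true
```

Passing the environment in as an argument is only a convenience for the sketch; it keeps the example free of global state.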
|
|
68
|
+
|
|
60
69
|
### Rate limiting & retries
|
|
61
70
|
|
|
62
71
|
Transient-failure resilience is layered upstream, not in this gem:
|
data/README.adoc
CHANGED
|
@@ -118,6 +118,22 @@ require 'relaton/w3c/data_fetcher'
|
|
|
118
118
|
Relaton::W3c::DataFetcher.fetch
|
|
119
119
|
----
|
|
120
120
|
|
|
121
|
+
The crawl is tunable via environment variables:
|
|
122
|
+
|
|
123
|
+
- `RELATON_W3C_FETCH_CONCURRENCY` (default `8`) - number of parallel worker threads. Lower it to lighten load on `api.w3.org`.
|
|
124
|
+
- `RELATON_W3C_FETCH_VERSIONS` (default enabled) - set to `false` for a faster, shallower crawl that fetches only the top-level specifications and skips each spec's version history (the bulk of the API requests). Leave it unset for a complete dataset.
|
|
125
|
+
|
|
126
|
+
The fetcher requests the specifications index with embedded specification data, so each specification is read from the page already in memory instead of issuing a separate HTTP request.
|
|
127
|
+
|
|
128
|
+
A full crawl is long-running, so it handles `Ctrl-C` gracefully: it stops fetching, lets in-flight work finish, and saves the index of everything collected so far rather than losing the run.
|
|
129
|
+
|
|
130
|
+
[source,sh]
|
|
131
|
+
----
|
|
132
|
+
# Fast, shallow refresh: top-level specs only, 4 workers
|
|
133
|
+
RELATON_W3C_FETCH_VERSIONS=false RELATON_W3C_FETCH_CONCURRENCY=4 \
|
|
134
|
+
ruby -r relaton/w3c/data_fetcher -e 'Relaton::W3c::DataFetcher.fetch'
|
|
135
|
+
----
|
|
136
|
+
|
|
121
137
|
=== Logging
|
|
122
138
|
|
|
123
139
|
RelatonW3c uses the relaton-logger gem for logging. By default, it logs to STDOUT. To change the log levels and add other loggers, read the https://github.com/relaton/relaton-logger#usage[relaton-logger] documentation.
|
data/lib/relaton/w3c/data_fetcher.rb
CHANGED
|
@@ -19,9 +19,22 @@ module Relaton
|
|
|
19
19
|
(ENV["RELATON_W3C_FETCH_CONCURRENCY"] || DEFAULT_CONCURRENCY).to_i
|
|
20
20
|
end
|
|
21
21
|
|
|
22
|
+
# Whether to crawl each specification's version history (version_history,
|
|
23
|
+
# predecessor_versions, successor_versions). Enabled by default for a
|
|
24
|
+
# complete dataset. Set RELATON_W3C_FETCH_VERSIONS=false for a faster,
|
|
25
|
+
# shallower crawl that emits only the top-level specifications and skips
|
|
26
|
+
# the per-spec version fan-out (the bulk of the API requests).
|
|
27
|
+
def self.fetch_versions?
|
|
28
|
+
val = ENV["RELATON_W3C_FETCH_VERSIONS"]
|
|
29
|
+
return true if val.nil? || val.empty?
|
|
30
|
+
|
|
31
|
+
!%w[0 false no off].include?(val.strip.downcase)
|
|
32
|
+
end
|
|
33
|
+
|
|
22
34
|
def initialize(*args)
|
|
23
35
|
super
|
|
24
36
|
@mutex = Mutex.new
|
|
37
|
+
@interrupted = false
|
|
25
38
|
end
|
|
26
39
|
|
|
27
40
|
def index
|
|
@@ -39,41 +52,83 @@ module Relaton
|
|
|
39
52
|
#
|
|
40
53
|
# Parse documents in parallel. The crawler is heavily I/O-bound on
|
|
41
54
|
# api.w3.org round-trips (~30-50k requests per run), so a small thread
|
|
42
|
-
# pool gives a near-linear speedup. Pagination still happens serially
|
|
43
|
-
#
|
|
55
|
+
# pool gives a near-linear speedup. Pagination still happens serially:
|
|
56
|
+
# each page's `next?` flag gates whether the next page is requested.
|
|
57
|
+
#
|
|
58
|
+
# A SIGINT (Ctrl-C) is handled gracefully: the producer stops queuing and
|
|
59
|
+
# the workers stop processing after their in-flight spec, then the index
|
|
60
|
+
# of everything fetched so far is saved rather than the run being lost.
|
|
44
61
|
#
|
|
45
62
|
def fetch(_source = nil)
|
|
46
63
|
n_workers = self.class.concurrency
|
|
47
64
|
queue = SizedQueue.new(n_workers * 4)
|
|
48
65
|
workers = Array.new(n_workers) { spawn_worker(queue) }
|
|
49
66
|
|
|
50
|
-
|
|
67
|
+
with_interrupt_handler do
|
|
68
|
+
enqueue_specs(queue)
|
|
69
|
+
n_workers.times { queue << nil } # poison pills
|
|
70
|
+
workers.each(&:join)
|
|
71
|
+
Util.warn "Crawl interrupted — saving progress collected so far." if @interrupted
|
|
72
|
+
index.save
|
|
73
|
+
end
|
|
74
|
+
|
|
75
|
+
report_errors
|
|
76
|
+
end
|
|
77
|
+
|
|
78
|
+
#
|
|
79
|
+
# Page through the specifications index, feeding each spec (paired with
|
|
80
|
+
# its embedded page) to the worker queue. Returns early when interrupted.
|
|
81
|
+
#
|
|
82
|
+
# embed: true inlines each specification's full payload into the index
|
|
83
|
+
# page's `_embedded` block, so a spec link realizes from that page in
|
|
84
|
+
# memory instead of making its own HTTP request — one request per page
|
|
85
|
+
# rather than one per specification. The page is queued alongside each
|
|
86
|
+
# link so the worker can hand it back to realize as the parent_resource.
|
|
87
|
+
#
|
|
88
|
+
def enqueue_specs(queue)
|
|
89
|
+
specs = client.specifications(embed: true)
|
|
51
90
|
loop do
|
|
52
|
-
|
|
53
|
-
|
|
91
|
+
page = specs
|
|
92
|
+
page.links.specifications.each do |spec|
|
|
93
|
+
break if @interrupted
|
|
54
94
|
|
|
55
|
-
|
|
56
|
-
|
|
57
|
-
|
|
95
|
+
queue << [spec, page]
|
|
96
|
+
end
|
|
97
|
+
break if @interrupted || !page.next?
|
|
98
|
+
|
|
99
|
+
# Fetch the next page through the client's fetch path rather than
|
|
100
|
+
# realizing the `next` link: only fetch populates the page's
|
|
101
|
+
# embedded_data, so this keeps embed working past page 1. Realizing
|
|
102
|
+
# the `next` link drops `_embedded` and forces a per-spec HTTP
|
|
103
|
+
# request for every specification on every later page.
|
|
104
|
+
next_page = fetch_specifications_page(page.page + 1)
|
|
58
105
|
break unless next_page
|
|
59
106
|
|
|
60
107
|
specs = next_page
|
|
61
108
|
end
|
|
62
|
-
|
|
63
|
-
n_workers.times { queue << nil } # poison pills
|
|
64
|
-
workers.each(&:join)
|
|
65
|
-
|
|
66
|
-
index.save
|
|
67
|
-
report_errors
|
|
68
109
|
end
|
|
69
110
|
|
|
70
|
-
def fetch_spec(unrealized_spec)
|
|
71
|
-
|
|
111
|
+
def fetch_spec(unrealized_spec, page = nil)
|
|
112
|
+
# When `page` came from an embed:true fetch, realizing against it as the
|
|
113
|
+
# parent_resource serves the spec from embedded data (no HTTP request).
|
|
114
|
+
spec = realize(unrealized_spec, parent_resource: page)
|
|
72
115
|
return unless spec
|
|
73
116
|
|
|
74
117
|
local_errors = Hash.new(true)
|
|
75
118
|
save_doc DataParser.parse(spec, local_errors)
|
|
76
119
|
|
|
120
|
+
fetch_versions(spec) if self.class.fetch_versions?
|
|
121
|
+
|
|
122
|
+
@mutex.synchronize { local_errors.each { |k, v| @errors[k] &&= v } }
|
|
123
|
+
end
|
|
124
|
+
|
|
125
|
+
#
|
|
126
|
+
# Crawl a specification's version history: its dated editions plus the
|
|
127
|
+
# predecessor/successor version chains. Each entry is a separate HTTP
|
|
128
|
+
# request, so this is the bulk of a run and can be skipped via
|
|
129
|
+
# RELATON_W3C_FETCH_VERSIONS=false (see .fetch_versions?).
|
|
130
|
+
#
|
|
131
|
+
def fetch_versions(spec)
|
|
77
132
|
if spec.links.respond_to?(:version_history) && spec.links.version_history
|
|
78
133
|
version_history = realize spec.links.version_history
|
|
79
134
|
version_history&.links&.spec_versions&.each { |version| parse_and_save version }
|
|
@@ -84,12 +139,10 @@ module Relaton
|
|
|
84
139
|
predecessor_versions&.links&.predecessor_versions&.each { |version| parse_and_save version }
|
|
85
140
|
end
|
|
86
141
|
|
|
87
|
-
|
|
88
|
-
successor_versions = realize spec.links.successor_versions
|
|
89
|
-
successor_versions&.links&.successor_versions&.each { |version| parse_and_save version }
|
|
90
|
-
end
|
|
142
|
+
return unless spec.links.respond_to?(:successor_versions) && spec.links.successor_versions
|
|
91
143
|
|
|
92
|
-
|
|
144
|
+
successor_versions = realize spec.links.successor_versions
|
|
145
|
+
successor_versions&.links&.successor_versions&.each { |version| parse_and_save version }
|
|
93
146
|
end
|
|
94
147
|
|
|
95
148
|
#
|
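
The producer/worker shape used by `fetch` and `enqueue_specs` in the hunk above (a bounded queue, one poison pill per worker, and an interrupt flag checked by both sides) can be sketched on its own. This is a simplified stand-in under stated assumptions: `crawl`, `pages`, `interrupted`, and the block are hypothetical names, and the pages are plain arrays rather than W3C API resources.

```ruby
# Illustrative sketch only; `crawl` and its parameters are hypothetical.
# pages:       an enumerable of pages, each an enumerable of items
# interrupted: a callable returning true once the crawl should wind down
def crawl(pages, n_workers: 2, interrupted: -> { false }, &handle)
  # Bounded queue: the producer blocks when the workers fall behind.
  queue = SizedQueue.new(n_workers * 4)

  workers = Array.new(n_workers) do
    Thread.new do
      # A nil item (poison pill) ends the loop.
      while (item = queue.pop)
        # Once interrupted, drain without processing so the producer unblocks.
        next if interrupted.call

        handle.call(item)
      end
    end
  end

  pages.each do |page|
    break if interrupted.call

    page.each do |item|
      break if interrupted.call

      queue << item
    end
  end

  n_workers.times { queue << nil } # one poison pill per worker
  workers.each(&:join)
end

results = Queue.new
crawl([%w[a b], %w[c]]) { |item| results << item.upcase }
puts results.size # 3
```

The pills are pushed even after an interrupt, so every worker still terminates and `join` returns.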
|
@@ -139,11 +192,43 @@ module Relaton
|
|
|
139
192
|
|
|
140
193
|
private
|
|
141
194
|
|
|
195
|
+
# Install a SIGINT handler for the duration of the crawl so Ctrl-C sets
|
|
196
|
+
# the @interrupted flag (observed by the producer loop and the workers)
|
|
197
|
+
# instead of killing the process mid-write. The trap body is kept minimal
|
|
198
|
+
# (no I/O or locking) because trap context is restricted; the user-facing
|
|
199
|
+
# notice is printed from the main thread once the crawl winds down. The
|
|
200
|
+
# previous handler is restored on the way out so the trap doesn't leak
|
|
201
|
+
# into the host process.
|
|
202
|
+
def with_interrupt_handler
|
|
203
|
+
previous = Signal.trap("INT") { @interrupted = true }
|
|
204
|
+
yield
|
|
205
|
+
ensure
|
|
206
|
+
Signal.trap("INT", previous || "DEFAULT")
|
|
207
|
+
end
|
|
208
|
+
|
|
209
|
+
# Fetch one page of the specifications index with embed enabled. Goes
|
|
210
|
+
# through the client (the register's fetch path) so the page's
|
|
211
|
+
# embedded_data is populated. Transient 403/5xx/connection failures are
|
|
212
|
+
# already retried upstream (w3c_api/lutaml-hal); a terminal error here
|
|
213
|
+
# stops pagination gracefully rather than crashing the crawl.
|
|
214
|
+
def fetch_specifications_page(number)
|
|
215
|
+
client.specifications(embed: true, page: number)
|
|
216
|
+
rescue Lutaml::Hal::Error, Faraday::Error => e
|
|
217
|
+
log_error "Failed to fetch specifications page #{number}: " \
|
|
218
|
+
"#{e.class}: #{e.message}"
|
|
219
|
+
nil
|
|
220
|
+
end
|
|
221
|
+
|
|
142
222
|
def spawn_worker(queue)
|
|
143
223
|
Thread.new do
|
|
144
|
-
while (
|
|
224
|
+
while (item = queue.pop)
|
|
225
|
+
# Once interrupted, drain the queue without processing so the
|
|
226
|
+
# producer unblocks and the pool reaches its poison pills quickly.
|
|
227
|
+
next if @interrupted
|
|
228
|
+
|
|
229
|
+
spec, page = item
|
|
145
230
|
begin
|
|
146
|
-
fetch_spec spec
|
|
231
|
+
fetch_spec spec, page
|
|
147
232
|
rescue StandardError => e
|
|
148
233
|
log_error "fetch_spec failed: #{e.class}: #{e.message}\n" \
|
|
149
234
|
"#{e.backtrace.first(5).join("\n")}"
|
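
The trap-and-restore pattern behind `with_interrupt_handler` in the hunk above can be tried stand-alone. The sketch below assumes a Unix-like platform; `with_interrupt_flag` and the `flag` Hash are hypothetical names used in place of the fetcher's `@interrupted` instance variable.

```ruby
# Hypothetical stand-alone version of the pattern; not the gem's code.
def with_interrupt_flag(flag)
  # Keep the trap body to a bare assignment: trap context is restricted,
  # so no logging or Mutex use belongs here.
  previous = Signal.trap("INT") { flag[:interrupted] = true }
  yield
ensure
  # Restore whatever handler was installed before, so the trap does not leak.
  Signal.trap("INT", previous || "DEFAULT")
end

flag = { interrupted: false }
with_interrupt_flag(flag) do
  # Long-running work would go here, checking flag[:interrupted] between steps.
end
puts flag[:interrupted] # false, since no Ctrl-C arrived during the block
```

Because the restore sits in `ensure`, the previous handler comes back even when the block raises.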
|
data/lib/relaton/w3c/safe_realize.rb
CHANGED
|
@@ -22,11 +22,15 @@ module Relaton
|
|
|
22
22
|
@skipped
|
|
23
23
|
end
|
|
24
24
|
|
|
25
|
-
|
|
25
|
+
# @param parent_resource [Object, nil] the index/page the link came from.
|
|
26
|
+
# When the page was fetched with `embed: true`, its inlined `_embedded`
|
|
27
|
+
# payload lets the link realize from memory instead of issuing an HTTP
|
|
28
|
+
# request. nil (the default) preserves the plain remote-fetch behavior.
|
|
29
|
+
def realize(obj, parent_resource: nil)
|
|
26
30
|
href = resolve_href(obj)
|
|
27
31
|
return nil if SafeRealize.skipped.key?(href)
|
|
28
32
|
|
|
29
|
-
obj.realize
|
|
33
|
+
obj.realize(parent_resource: parent_resource)
|
|
30
34
|
rescue Lutaml::Hal::ConnectionError, Lutaml::Hal::TimeoutError, Faraday::Error, Net::OpenTimeout => e
|
|
31
35
|
# Network-level failure (already retried by w3c_api). The resource itself
|
|
32
36
|
# is fine, so don't skip it permanently — a later reference can try again.
|
data/lib/relaton/w3c/version.rb
CHANGED
metadata
CHANGED
|
@@ -1,14 +1,14 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: relaton-w3c
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 2.1.
|
|
4
|
+
version: 2.1.4
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- Ribose Inc.
|
|
8
8
|
autorequire:
|
|
9
9
|
bindir: exe
|
|
10
10
|
cert_chain: []
|
|
11
|
-
date: 2026-06-
|
|
11
|
+
date: 2026-06-04 00:00:00.000000000 Z
|
|
12
12
|
dependencies:
|
|
13
13
|
- !ruby/object:Gem::Dependency
|
|
14
14
|
name: relaton-bib
|