relaton-w3c 2.1.4 → 2.2.0.pre.alpha.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 7cdd6ed3f2403c63011b2f3d023afca6d1fde795b09e65d39e9044384ed52b2b
-   data.tar.gz: 4086a9ec931c36512084c11c7327be1cd5d21c85106bbe312e6c142dbe9639d4
+   metadata.gz: 3e7055aa24f2e33a4eb5825a57fca822732a5ddfeec7f7c484d44afe86dbacaa
+   data.tar.gz: 415b74a5cd3ecbee917d601794721fb9e207b427ea4fcd9dd1ce8420a439a5b3
  SHA512:
-   metadata.gz: d2a193173054fd17d1b1e30218887722b58f3a12192923e7abb36e167314c6378e4255a87632bb9fac21035f7bc33e6b4f032ff8dd180fef938ef1977ab6cbd6
-   data.tar.gz: c3b52a5693c876cee96728332e694824350a50a2d4945c04c941e84d7c85d874216a02472f9121c3ca4446420cf686a6d8b8b042397deafcb4dcc89870e5a8b7
+   metadata.gz: 87deba90b8ae19bb20c67eefd794dcb0b29171573c48cecb04445f73e66921953c228ef5e759be7742d8c072a18dad2cecf3f9214993de1988ea4a64562310ee
+   data.tar.gz: c1a9d9c598da26dbb05611415b655bdcb8adac22a3d1d1488a15f00d06b68dfbbcca0ad4bfa1d30366a466ae2948cf99ffc2690e5c61309fd3fca3247af00274
data/CLAUDE.md CHANGED
@@ -47,7 +47,7 @@ All classes live under `lib/relaton/w3c/` in the `Relaton::W3c` namespace:
  - **`Processor`** (`processor.rb`) — extends `Relaton::Core::Processor`, registers the W3C flavor (prefix `W3C`, dataset `w3c-api`)
 
  **Data fetching:**
- - **`DataFetcher`** (`data_fetcher.rb`) — extends `Core::DataFetcher`, fetches all W3C specs via the W3C API. Fetches the specification index with `embed: true` so each spec is realized from the page's embedded payload instead of a per-spec HTTP request, and paginates by page number (only the `fetch` path repopulates `_embedded`, unlike realizing the `next` link). Runs `fetch_spec` across a small thread pool. A SIGINT (Ctrl-C) is handled gracefully — the producer stops queuing and workers stop after their in-flight spec, then the index of everything fetched so far is saved (the prior INT handler is restored afterwards, so the trap doesn't leak into the host process). See **Crawler tuning** for the env-var knobs.
+ - **`DataFetcher`** (`data_fetcher.rb`) — extends `Core::DataFetcher`, fetches all W3C specs via the W3C API. Fetches the specification index with `embed: true` so each spec is realized from the page's embedded payload instead of a per-spec HTTP request, and paginates by page number (only the `fetch` path repopulates `_embedded`, unlike realizing the `next` link). Runs `fetch_spec` across a small thread pool. A SIGINT (Ctrl-C) is handled gracefully — the producer stops queuing and workers stop after their in-flight spec, then the index of everything fetched so far is saved (the prior INT handler is restored afterwards, so the trap doesn't leak into the host process). If an index page fails to fetch after retries, or pagination ends before the API's advertised last page, `enqueue_specs` raises `CrawlIncompleteError` and the crawl aborts **without** saving the index — a transient rate-limit must never silently truncate the dataset (`crawler.rb` wipes `data/` before each run, so a partial crawl would otherwise commit mass deletions). The worker pool is still drained in an `ensure` so the abort doesn't deadlock. See **Crawler tuning** for the env-var knobs.
  - **`DataParser`** (`data_parser.rb`) — converts W3C API spec objects into `Relaton::W3c::Item` instances
  - **`SafeRealize`** (`safe_realize.rb`) — mixin that, on a terminal error, skips the resource (returns `nil`) so one bad link doesn't abort the crawl (see Rate limiting & retries). It does not retry or cache successes — those live upstream.
  - **`PubId`** (`pubid.rb`) — parses and compares W3C document identifiers (stage, code, date parts)
@@ -61,7 +61,7 @@ The entry module is defined in `lib/relaton/w3c.rb` and exposes `grammar_hash`.
 
  `DataFetcher` is tunable via environment variables (read by class methods, so they apply to the whole crawl):
 
- - **`RELATON_W3C_FETCH_CONCURRENCY`** (default `8`) — number of `fetch_spec` worker threads. Lower it to lighten load on api.w3.org or for debugging.
+ - **`RELATON_W3C_FETCH_CONCURRENCY`** (default `4`) — number of `fetch_spec` worker threads. Kept conservative so the version-history requests don't burst fast enough to trip the W3C API rate limiter (429s); raise it for a faster run, lower it for debugging or if 429 skips appear.
  - **`RELATON_W3C_FETCH_VERSIONS`** (default enabled) — set to `false`/`0`/`no`/`off` for a faster, shallower crawl that emits only the top-level specifications and skips each spec's version-history fan-out (version_history, predecessor/successor versions — the bulk of the API requests). Leave it set (the default) for a complete dataset.
 
  `embed: true` (always on) inlines each specification into its index page, so the per-spec realize is served from memory rather than an HTTP request — the largest single reduction in request count.
@@ -76,7 +76,7 @@ Successful objects are cached by **w3c_api** (lutaml-hal caches realized objects
 
  ### Key Dependencies
 
- - **relaton-bib** (~> 2.1.0) — provides base `Bib::Item`, `Bib::Ext`, `Bib::Doctype` and serialization mixins (LutaML model layer)
+ - **relaton-bib** (~> 2.2.0) — provides base `Bib::Item`, `Bib::Ext`, `Bib::Doctype` and serialization mixins (LutaML model layer)
  - **relaton-core** — provides base `Core::Processor` and `Core::DataFetcher`
  - **relaton-index** — index-based search for bibliographic references; also unpacks the index zip at runtime
  - **w3c_api** (~> 0.3.2) — W3C API (HAL/REST) client used by `DataFetcher` to retrieve specifications; owns rate-limit and transient-error retries, and the (thread-safe) object cache
data/Gemfile CHANGED
@@ -3,6 +3,14 @@ source "https://rubygems.org"
  # Specify your gem's dependencies in relaton_w3c.gemspec
  gemspec
 
+ # Use local monorepo sibling gems where available.
+ Dir["../*/"].each do |dir|
+   name = File.basename(dir)
+   next if name == File.basename(__dir__)
+   next unless File.exist?(File.join(dir, "#{name}.gemspec"))
+   gem name, path: dir
+ end
+
 
  gem "rake", "~> 13.0"
  gem "rspec", "~> 3.0"
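
The added Gemfile loop can be read as a small discovery routine. A minimal sketch, assuming a hypothetical `sibling_gems` helper (not part of the gem) that returns the names the loop would pass to `gem name, path: dir`; the temporary directory layout in `demo_sibling_gems` is invented for illustration:

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper: mirrors the Gemfile loop, but returns the matching
# names instead of calling Bundler's `gem` DSL method.
def sibling_gems(project_dir)
  Dir[File.join(project_dir, "..", "*/")].filter_map do |dir|
    name = File.basename(dir)
    next if name == File.basename(project_dir) # skip the project itself
    next unless File.exist?(File.join(dir, "#{name}.gemspec"))

    name
  end.sort
end

# Invented layout: two gems plus one directory that has no gemspec.
def demo_sibling_gems
  Dir.mktmpdir do |root|
    %w[relaton-w3c relaton-bib notes].each do |d|
      FileUtils.mkdir_p(File.join(root, d))
    end
    FileUtils.touch(File.join(root, "relaton-w3c", "relaton-w3c.gemspec"))
    FileUtils.touch(File.join(root, "relaton-bib", "relaton-bib.gemspec"))
    sibling_gems(File.join(root, "relaton-w3c"))
  end
end

p demo_sibling_gems
```

Only directories whose name matches a `<name>.gemspec` inside them are picked up, so unrelated sibling folders are ignored.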
data/README.adoc CHANGED
@@ -120,17 +120,19 @@ Relaton::W3c::DataFetcher.fetch
 
  The crawl is tunable via environment variables:
 
- - `RELATON_W3C_FETCH_CONCURRENCY` (default `8`) - number of parallel worker threads. Lower it to lighten load on `api.w3.org`.
+ - `RELATON_W3C_FETCH_CONCURRENCY` (default `4`) - number of parallel worker threads. The default is kept conservative so the version-history requests don't burst fast enough to trip the W3C API rate limiter; raise it for a faster run, lower it if you still see rate-limit (429) skips.
  - `RELATON_W3C_FETCH_VERSIONS` (default enabled) - set to `false` for a faster, shallower crawl that fetches only the top-level specifications and skips each spec's version history (the bulk of the API requests). Leave it unset for a complete dataset.
 
  The fetcher requests the specifications index with embedded specification data, so each specification is read from the page already in memory instead of issuing a separate HTTP request.
 
  A full crawl is long-running, so it handles `Ctrl-C` gracefully: it stops fetching, lets in-flight work finish, and saves the index of everything collected so far rather than losing the run.
 
+ If a specifications-index page can't be fetched (e.g. a rate-limit that outlasts the retries) the crawl aborts with `CrawlIncompleteError` rather than treating the failure as the end of the list. This is deliberate: a truncated crawl is never saved, so a transient API hiccup can't silently drop most of the dataset.
+
  [source,sh]
  ----
- # Fast, shallow refresh: top-level specs only, 4 workers
- RELATON_W3C_FETCH_VERSIONS=false RELATON_W3C_FETCH_CONCURRENCY=4 \
+ # Fast, shallow refresh: top-level specs only, 8 workers
+ RELATON_W3C_FETCH_VERSIONS=false RELATON_W3C_FETCH_CONCURRENCY=8 \
  ruby -r relaton/w3c/data_fetcher -e 'Relaton::W3c::DataFetcher.fetch'
  ----
 
data/lib/relaton/w3c/data_fetcher.rb CHANGED
@@ -10,11 +10,26 @@ module Relaton
      class DataFetcher < Core::DataFetcher
        include Relaton::W3c::SafeRealize
 
-       DEFAULT_CONCURRENCY = 8
+       # Raised when pagination over the specifications index stops before the
+       # last page (e.g. a page fetch fails after retries, or the API reports
+       # more pages than were reached). It aborts the whole crawl so a truncated
+       # dataset is never saved or committed — see #fetch and #enqueue_specs.
+       class CrawlIncompleteError < StandardError; end
+
+       # Conservative default: too many parallel workers burst the per-spec
+       # version-history requests fast enough to trip the W3C API rate limiter
+       # (429s), which is what silently truncated the dataset before the crawl
+       # learned to abort on incomplete pagination. Raise it via the env var on
+       # a faster/shallower run; lower it further if 429s still appear.
+       DEFAULT_CONCURRENCY = 4
+
+       # How many times #fetch_specifications_page retries a transient failure
+       # (rate-limit/connection) before giving up and aborting the crawl.
+       PAGE_FETCH_ATTEMPTS = 3
 
        # Number of fetch_spec worker threads. Tunable via env var so CI or
-       # local runs can dial it down (e.g. for debugging or to lighten load
-       # on api.w3.org).
+       # local runs can dial it up for speed or down to lighten load on
+       # api.w3.org (or for debugging).
        def self.concurrency
          (ENV["RELATON_W3C_FETCH_CONCURRENCY"] || DEFAULT_CONCURRENCY).to_i
        end
@@ -65,9 +80,15 @@ module Relaton
          workers = Array.new(n_workers) { spawn_worker(queue) }
 
          with_interrupt_handler do
-           enqueue_specs(queue)
-           n_workers.times { queue << nil } # poison pills
-           workers.each(&:join)
+           # The poison pills + join run in `ensure` so an exception raised while
+           # enqueuing (e.g. CrawlIncompleteError) still unblocks the producer
+           # and drains the workers instead of deadlocking on queue.pop.
+           begin
+             enqueue_specs(queue)
+           ensure
+             n_workers.times { queue << nil } # poison pills
+             workers.each(&:join)
+           end
            Util.warn "Crawl interrupted — saving progress collected so far." if @interrupted
            index.save
          end
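
The `ensure` change in this hunk follows a common producer/worker pattern. A minimal sketch, assuming a hypothetical `run_pool` helper and plain integers as work items (neither appears in the gem), of why sending the poison pills from `ensure` lets the workers exit even when the producer raises:

```ruby
# Hypothetical helper, not the gem's code: a producer feeds a queue that
# n_workers threads consume. `fail_at` simulates the producer raising.
def run_pool(items, n_workers: 2, fail_at: nil)
  queue = Queue.new
  processed = Queue.new # thread-safe collector for finished items
  workers = Array.new(n_workers) do
    Thread.new do
      while (item = queue.pop) # a nil poison pill ends the loop
        processed << item
      end
    end
  end

  error = nil
  begin
    items.each_with_index do |item, i|
      raise "producer stopped at item #{i}" if i == fail_at

      queue << item
    end
  rescue RuntimeError => e
    error = e # the real crawl lets the error propagate; kept here to report it
  ensure
    # Runs on success and on failure, so no worker stays blocked on queue.pop.
    n_workers.times { queue << nil }
    workers.each(&:join)
  end

  done = []
  done << processed.pop until processed.empty?
  { done: done.sort, error: error, alive: workers.count(&:alive?) }
end

result = run_pool([1, 2, 3, 4], fail_at: 2)
puts "processed=#{result[:done].inspect} alive=#{result[:alive]}"
```

Without the `ensure`, a raise in the producer would skip the pills and leave every worker waiting on `queue.pop`.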
@@ -87,6 +108,8 @@ module Relaton
        #
        def enqueue_specs(queue)
          specs = client.specifications(embed: true)
+         expected_pages = specs.pages
+         last_page = nil
          loop do
            page = specs
            page.links.specifications.each do |spec|
@@ -94,7 +117,10 @@ module Relaton
 
              queue << [spec, page]
            end
-           break if @interrupted || !page.next?
+           break if @interrupted
+
+           last_page = page.page
+           break unless page.next?
 
            # Fetch the next page through the client's fetch path rather than
            # realizing the `next` link: only fetch populates the page's
@@ -102,10 +128,35 @@ module Relaton
            # the `next` link drops `_embedded` and forces a per-spec HTTP
            # request for every specification on every later page.
            next_page = fetch_specifications_page(page.page + 1)
-           break unless next_page
+           # A nil here means the page fetch failed after retries (not the end
+           # of the list — that is `!page.next?` above). Aborting rather than
+           # `break`ing prevents a rate-limit blip from silently truncating the
+           # dataset: a partial crawl must never be saved/committed.
+           unless next_page
+             raise CrawlIncompleteError,
+                   "specifications pagination stopped at page #{page.page}: " \
+                   "failed to fetch page #{page.page + 1}"
+           end
 
            specs = next_page
          end
+
+         return if @interrupted
+
+         guard_complete_pagination(last_page, expected_pages)
+       end
+
+       # Defense in depth: even when no page fetch raised, make sure pagination
+       # actually reached the last page the API advertised. Catches truncation
+       # modes other than a failed fetch (e.g. a `next` link that goes missing).
+       # Only enforced when the index reported a positive page count.
+       def guard_complete_pagination(last_page, expected_pages)
+         return unless expected_pages.is_a?(Integer) && expected_pages.positive?
+         return unless last_page.is_a?(Integer) && last_page < expected_pages
+
+         raise CrawlIncompleteError,
+               "specifications pagination ended at page #{last_page} of " \
+               "#{expected_pages}; refusing to save a partial dataset"
        end
 
        def fetch_spec(unrealized_spec, page = nil)
@@ -209,14 +260,26 @@ module Relaton
        # Fetch one page of the specifications index with embed enabled. Goes
        # through the client (the register's fetch path) so the page's
        # embedded_data is populated. Transient 403/5xx/connection failures are
-       # already retried upstream (w3c_api/lutaml-hal); a terminal error here
-       # stops pagination gracefully rather than crashing the crawl.
+       # already retried upstream (w3c_api/lutaml-hal), but losing an index page
+       # drops every spec on it, so retry a few more times here with backoff to
+       # ride out a brief rate-limit window. Returns nil only once the attempts
+       # are exhausted; the caller turns that into a CrawlIncompleteError so the
+       # crawl aborts instead of committing a truncated dataset.
        def fetch_specifications_page(number)
-         client.specifications(embed: true, page: number)
-       rescue Lutaml::Hal::Error, Faraday::Error => e
-         log_error "Failed to fetch specifications page #{number}: " \
-                   "#{e.class}: #{e.message}"
-         nil
+         attempt = 0
+         begin
+           attempt += 1
+           client.specifications(embed: true, page: number)
+         rescue Lutaml::Hal::Error, Faraday::Error => e
+           log_error "Failed to fetch specifications page #{number} " \
+                     "(attempt #{attempt}/#{PAGE_FETCH_ATTEMPTS}): " \
+                     "#{e.class}: #{e.message}"
+           if attempt < PAGE_FETCH_ATTEMPTS
+             sleep(2**attempt)
+             retry
+           end
+           nil
+         end
        end
 
        def spawn_worker(queue)
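
The retry loop added to `fetch_specifications_page` can be isolated as a generic pattern. A minimal sketch, assuming a hypothetical `with_retries` helper, an injectable `sleeper` so the example runs without real delays, and `IOError` standing in for the `Lutaml::Hal::Error` / `Faraday::Error` classes rescued in the diff:

```ruby
# Hypothetical generic form of the retry loop; not the gem's API.
# Yields the attempt number and returns the block's value, or nil once
# all attempts have failed.
def with_retries(attempts:, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue IOError => e
    warn "attempt #{attempt}/#{attempts} failed: #{e.class}: #{e.message}"
    if attempt < attempts
      sleeper.call(2**attempt) # same backoff expression as in the diff
      retry
    end
    nil # exhausted: the caller decides whether to abort
  end
end

# Simulates a fetch that fails `fail_times` times before succeeding and
# records the delays that would have been slept.
def demo_retry(fail_times)
  delays = []
  calls = 0
  result = with_retries(attempts: 3, sleeper: ->(s) { delays << s }) do
    calls += 1
    raise IOError, "simulated rate limit" if calls <= fail_times

    :page
  end
  [result, delays]
end

p demo_retry(2)
p demo_retry(3)
```

With three attempts and `2**attempt` as the delay, the loop waits 2 and then 4 seconds before giving up, so the extra retries add at most 6 seconds per failing page.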
@@ -1,5 +1,5 @@
  module Relaton
    module W3c
-     VERSION = "2.1.4".freeze
+     VERSION = "2.2.0.pre.alpha.1".freeze
    end
  end
data/relaton-w3c.gemspec CHANGED
@@ -14,7 +14,7 @@ Gem::Specification.new do |spec|
      "using the IsoBibliographicItem model"
    spec.homepage = "https://github.com/relaton/relaton-wc3"
    spec.license = "BSD-2-Clause"
-   spec.required_ruby_version = Gem::Requirement.new(">= 3.2.0")
+   spec.required_ruby_version = Gem::Requirement.new(">= 3.3.0")
 
    # spec.metadata["allowed_push_host"] = "TODO: Set to 'http://mygemserver.com'"
 
@@ -31,8 +31,9 @@ Gem::Specification.new do |spec|
    spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
    spec.require_paths = ["lib"]
 
-   spec.add_dependency "relaton-bib", "~> 2.1.0"
-   spec.add_dependency "relaton-core", "~> 0.0.13"
-   spec.add_dependency "relaton-index", "~> 0.2.8"
+   spec.add_dependency "concurrent-ruby", "~> 1.0"
+   spec.add_dependency "relaton-bib", "~> 2.2.0.pre.alpha.1"
+   spec.add_dependency "relaton-core", "~> 2.2.0.pre.alpha.1"
+   spec.add_dependency "relaton-index", "~> 2.2.0.pre.alpha.1"
    spec.add_dependency "w3c_api", "~> 0.3.2"
  end
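
The switch from `~> 2.1.0` to `~> 2.2.0.pre.alpha.1` changes which versions of the sibling gems can resolve. A minimal sketch using RubyGems' own `Gem::Requirement` and `Gem::Version`; the `satisfied?` helper and the sample version numbers are hypothetical and chosen only to illustrate the ranges:

```ruby
require "rubygems"

# Hypothetical helper: does `version` satisfy the requirement string?
def satisfied?(requirement, version)
  Gem::Requirement.new(requirement).satisfied_by?(Gem::Version.new(version))
end

old_req = "~> 2.1.0"
new_req = "~> 2.2.0.pre.alpha.1"

# Sample versions are illustrative, not a list of published releases.
%w[2.1.4 2.2.0.pre.alpha.1 2.2.0 2.2.5 2.3.0].each do |version|
  puts format("%-18s old=%-5s new=%s",
              version,
              satisfied?(old_req, version),
              satisfied?(new_req, version))
end
```

The old constraint does not admit the prerelease, while the new one admits the prerelease and later 2.2.x versions but stops before 2.3.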
metadata CHANGED
@@ -1,57 +1,71 @@
  --- !ruby/object:Gem::Specification
  name: relaton-w3c
  version: !ruby/object:Gem::Version
-   version: 2.1.4
+   version: 2.2.0.pre.alpha.1
  platform: ruby
  authors:
  - Ribose Inc.
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2026-06-04 00:00:00.000000000 Z
+ date: 2026-06-26 00:00:00.000000000 Z
  dependencies:
+ - !ruby/object:Gem::Dependency
+   name: concurrent-ruby
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.0'
  - !ruby/object:Gem::Dependency
    name: relaton-bib
    requirement: !ruby/object:Gem::Requirement
      requirements:
      - - "~>"
        - !ruby/object:Gem::Version
-         version: 2.1.0
+         version: 2.2.0.pre.alpha.1
    type: :runtime
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
      requirements:
      - - "~>"
        - !ruby/object:Gem::Version
-         version: 2.1.0
+         version: 2.2.0.pre.alpha.1
  - !ruby/object:Gem::Dependency
    name: relaton-core
    requirement: !ruby/object:Gem::Requirement
      requirements:
      - - "~>"
        - !ruby/object:Gem::Version
-         version: 0.0.13
+         version: 2.2.0.pre.alpha.1
    type: :runtime
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
      requirements:
      - - "~>"
        - !ruby/object:Gem::Version
-         version: 0.0.13
+         version: 2.2.0.pre.alpha.1
  - !ruby/object:Gem::Dependency
    name: relaton-index
    requirement: !ruby/object:Gem::Requirement
      requirements:
      - - "~>"
        - !ruby/object:Gem::Version
-         version: 0.2.8
+         version: 2.2.0.pre.alpha.1
    type: :runtime
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
      requirements:
      - - "~>"
        - !ruby/object:Gem::Version
-         version: 0.2.8
+         version: 2.2.0.pre.alpha.1
  - !ruby/object:Gem::Dependency
    name: w3c_api
    requirement: !ruby/object:Gem::Requirement
@@ -79,7 +93,6 @@ files:
  - ".gitignore"
  - ".hound.yml"
  - ".rspec"
- - ".rubocop.yml"
  - CLAUDE.md
  - Gemfile
  - LICENSE.txt
@@ -117,7 +130,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
-       version: 3.2.0
+       version: 3.3.0
  required_rubygems_version: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
data/.rubocop.yml DELETED
@@ -1,12 +0,0 @@
- # This project follows the Ribose OSS style guide.
- # https://github.com/riboseinc/oss-guides
- # All project-specific additions and overrides should be specified in this file.
-
- require: rubocop-rails
-
- inherit_from:
-   - https://raw.githubusercontent.com/riboseinc/oss-guides/master/ci/rubocop.yml
- AllCops:
-   TargetRubyVersion: 3.2
- Rails:
-   Enabled: false