relaton-w3c 2.1.2 → 2.1.3

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 46f27195b30d4b285034731e6749b4ed4aff9ed607d41530a1d5b9d07c859b22
- data.tar.gz: a263250dc4bda6a871af9addecbe37c5e3d62ab748797fe064dc9545e90eec8a
+ metadata.gz: ac21e91a675f1a0c33ea6a0610edb38895ffe87d9b810593331f9a79cdbe0324
+ data.tar.gz: 74993548a097428280e01147eea4dd604adf181ed33c9d55b1db33e8a994c489
  SHA512:
- metadata.gz: 973bcf91864d27cb1f19f6001733dc0b46c66738d1e6614ff91313a61dcf03b6a721cbc8c82fb3bd64a00dccc8af50dd2efbf23d051ada503d4b8c698719bb68
- data.tar.gz: 89fb5bcce023f69488932d901fbe260450bf1fcd6164c46be7731fb107bcb47a9aec30b9eb08a7fe534822727dcdb48c262e2386cf7f921ce018f0537fb33377
+ metadata.gz: 6851b1f389210dfe5bbef588b1d9de910c7f0d8c595e896a1d3419a937d20bbc643dbe08af40bd896e9f8f8149f59bea736b2f338512046f6d11f197e21b0999
+ data.tar.gz: 6a6db0027bde29086eff8f66240a9343288b393e2c25515e6e92b667d31b0fe89ea100de94adbb4bd8d244f7b527e7d05b3be1edf97917fad54512a88ca79132
data/CLAUDE.md CHANGED
@@ -49,7 +49,7 @@ All classes live under `lib/relaton/w3c/` in the `Relaton::W3c` namespace:
  **Data fetching:**
  - **`DataFetcher`** (`data_fetcher.rb`) — extends `Core::DataFetcher`, fetches all W3C specs via the W3C API
  - **`DataParser`** (`data_parser.rb`) — converts W3C API spec objects into `Relaton::W3c::Item` instances
- - **`RateLimitHandler`** (`rate_limit_handler.rb`) — mixin for retry logic and caching of fetched API objects
+ - **`SafeRealize`** (`safe_realize.rb`) — mixin that, on a terminal error, skips the resource (returns `nil`) so one bad link doesn't abort the crawl (see Rate limiting & retries). It does not retry or cache successes; those live upstream.
  - **`PubId`** (`pubid.rb`) — parses and compares W3C document identifiers (stage, code, date parts)
  
  **Utilities:**
@@ -57,13 +57,22 @@ All classes live under `lib/relaton/w3c/` in the `Relaton::W3c` namespace:
  
  The entry module is defined in `lib/relaton/w3c.rb` and exposes `grammar_hash`.
  
+ ### Rate limiting & retries
+
+ Transient-failure resilience is layered upstream, not in this gem:
+ - **w3c_api** builds its HAL client with `faraday-retry` to retry HTTP 403 (the W3C rate-limit signal) and connection/timeout errors.
+ - **lutaml-hal** (beneath w3c_api) retries 429 and 5xx with exponential backoff.
+
+ Successful objects are cached by **w3c_api** (lutaml-hal caches realized objects keyed by URL, thread-safely as of lutaml-hal 0.2.1), so `SafeRealize` doesn't cache them. It **retries nothing**; it only remembers hrefs that failed terminally (in a `Concurrent::Map`) and returns `nil` for them, so one bad link doesn't abort the crawl and isn't re-fetched on every reference. Network errors are not remembered, so a later reference can try again.
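The retry layering described in the CLAUDE.md section above can be illustrated with a minimal Ruby sketch of retry with exponential backoff. This is not the code of faraday-retry or lutaml-hal: `TransientError`, `with_backoff`, its keyword arguments, and the delay values are hypothetical names and numbers chosen for illustration.

```ruby
# Hypothetical error class standing in for a transient HTTP failure
# (for example a 429 or 5xx response).
class TransientError < StandardError; end

# Runs the block, retrying on TransientError up to max_retries times.
# The wait before retry n (counting from 0) is base * 2**n seconds.
# `sleeper` is injectable so the sketch can run without real delays.
def with_backoff(max_retries: 3, base: 0.5, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    yield attempt
  rescue TransientError
    raise if attempt >= max_retries # retries exhausted: the error is now terminal

    sleeper.call(base * (2**attempt))
    attempt += 1
    retry
  end
end

delays = []
calls = 0
result = with_backoff(sleeper: ->(seconds) { delays << seconds }) do
  calls += 1
  raise TransientError, "HTTP 429" if calls < 3

  :ok
end
# result is :ok after three calls; delays is [0.5, 1.0]
```

Once the retries are exhausted the error propagates to the caller, which is the point at which a wrapper such as `SafeRealize` would treat it as terminal.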
+
  ### Key Dependencies
  
- - **relaton-bib** (~> 2.0.0-alpha) — provides base `Bib::Item`, `Bib::Ext`, `Bib::Doctype` and serialization mixins (LutaML model layer)
+ - **relaton-bib** (~> 2.1.0) — provides base `Bib::Item`, `Bib::Ext`, `Bib::Doctype` and serialization mixins (LutaML model layer)
  - **relaton-core** — provides base `Core::Processor` and `Core::DataFetcher`
- - **relaton-index** — index-based search for bibliographic references
- - **w3c_api** — W3C API client used by `DataFetcher` to retrieve specifications
- - **linkeddata/rdf/sparql** — legacy RDF dependencies (still in gemspec)
+ - **relaton-index** — index-based search for bibliographic references; also unpacks the index zip at runtime
+ - **w3c_api** (~> 0.3.2) — W3C API (HAL/REST) client used by `DataFetcher` to retrieve specifications; owns rate-limit and transient-error retries, and the (thread-safe) object cache
+
+ The W3C data is fetched entirely through `w3c_api`; the older RDF/SPARQL/scraping stack (linkeddata, rdf, sparql, shex, mechanize, …) has been removed.
  
  ### Schema Validation
  
@@ -82,7 +91,7 @@ Tests use RSpec with:
  - **VCR** — recorded HTTP cassettes in `spec/vcr_cassettes/` (7-day re-record interval)
  - **WebMock** — disables external HTTP in tests
  
- Test fixtures live in `spec/fixtures/` (YAML, XML, RDF files).
+ Test fixtures live in `spec/fixtures/` (YAML and XML files).
  
  ## Style
  
data/Gemfile CHANGED
@@ -9,6 +9,7 @@ gem "rspec", "~> 3.0"
  
  gem "equivalent-xml"
  gem "ruby-jing"
+ gem "rubyzip" # test-only: spec/support reads a zipped index fixture
  gem "simplecov"
  gem "vcr"
  gem "webmock"
data/lib/relaton/w3c/data_fetcher.rb CHANGED
@@ -1,14 +1,14 @@
  require "relaton/core"
  require "w3c_api"
  require_relative "../w3c"
- require_relative "rate_limit_handler"
+ require_relative "safe_realize"
  require_relative "data_parser"
  require_relative "pubid"
  
  module Relaton
  module W3c
  class DataFetcher < Core::DataFetcher
- include Relaton::W3c::RateLimitHandler
+ include Relaton::W3c::SafeRealize
  
  DEFAULT_CONCURRENCY = 8
  
@@ -52,7 +52,12 @@ module Relaton
  specs.links.specifications.each { |spec| queue << spec }
  break unless specs.next?
  
- specs = specs.next
+ # Route pagination through realize so transient 403/5xx on the
+ # next-page link retry with backoff instead of crashing the crawl.
+ next_page = realize(specs.links.next)
+ break unless next_page
+
+ specs = next_page
  end
  
  n_workers.times { queue << nil } # poison pills
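The hunk above combines two patterns: a pagination loop that stops cleanly when the next page cannot be realized, and worker threads that are shut down with `nil` "poison pills". A minimal, self-contained sketch of both follows. It does not use w3c_api or lutaml-hal: `Page`, `safe_realize`, and `crawl` are hypothetical stand-ins, and the "work" done per item is just upcasing a string.

```ruby
# Hypothetical stand-in for a HAL page: `items` to process and `next_link`,
# a callable that returns the next page (nil on the last page).
Page = Struct.new(:items, :next_link)

# Returns the next page, or nil if the link is absent or fetching it fails.
def safe_realize(link)
  link&.call
rescue StandardError => e
  warn "Skipping next page: #{e.message}"
  nil
end

def crawl(first_page, n_workers: 2)
  queue = Queue.new
  results = Queue.new

  workers = Array.new(n_workers) do
    Thread.new do
      # queue.pop blocks until an item arrives; nil (a poison pill) ends the loop.
      while (item = queue.pop)
        results << item.upcase
      end
    end
  end

  page = first_page
  while page
    page.items.each { |item| queue << item }
    page = safe_realize(page.next_link) # nil stops pagination instead of raising
  end

  n_workers.times { queue << nil } # one poison pill per worker
  workers.each(&:join)
  Array.new(results.size) { results.pop }.sort
end

last = Page.new(%w[c], nil)
middle = Page.new(%w[b], -> { last })
healthy = crawl(Page.new(%w[a], -> { middle }))
broken = crawl(Page.new(%w[a], -> { raise "HTTP 500" }))
# healthy is ["A", "B", "C"]; broken is ["A"] (items queued before the failure are kept)
```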
data/lib/relaton/w3c/data_parser.rb CHANGED
@@ -1,7 +1,7 @@
  module Relaton
  module W3c
  class DataParser
- include Relaton::W3c::RateLimitHandler
+ include Relaton::W3c::SafeRealize
  
  USED_TYPES = %w[WD NOTE PER PR REC CR].freeze
  
data/lib/relaton/w3c/safe_realize.rb ADDED
@@ -0,0 +1,55 @@
+ require "concurrent/map"
+
+ module Relaton
+   module W3c
+     # Thin wrapper over lutaml-hal's `realize`. Successful objects are cached by
+     # w3c_api (it caches realized objects keyed by URL), so this only remembers
+     # resources that failed terminally and returns nil for them — so one broken
+     # link doesn't abort the crawl and isn't re-fetched on every reference.
+     #
+     # Transient failures are retried upstream: w3c_api retries HTTP 403 (the
+     # W3C rate-limit signal) and connection/timeout errors, and lutaml-hal
+     # retries 429 and 5xx. By the time an error surfaces here it is terminal.
+     module SafeRealize
+       # Hrefs that failed terminally — one map shared by every includer
+       # (DataFetcher and DataParser) since a broken resource is broken for the
+       # whole crawl. Initialized eagerly (at load, single-threaded) so the
+       # parallel fetcher's first concurrent access can't race a lazy `||=`;
+       # Concurrent::Map then handles the concurrent reads/writes.
+       @skipped = Concurrent::Map.new
+
+       def self.skipped
+         @skipped
+       end
+
+       def realize(obj)
+         href = resolve_href(obj)
+         return nil if SafeRealize.skipped.key?(href)
+
+         obj.realize
+       rescue Lutaml::Hal::ConnectionError, Lutaml::Hal::TimeoutError, Faraday::Error, Net::OpenTimeout => e
+         # Network-level failure (already retried by w3c_api). The resource itself
+         # is fine, so don't skip it permanently — a later reference can try again.
+         Util.warn "Failed to realize object: #{href}, error: #{e.message}"
+         nil
+       rescue Lutaml::Hal::NotFoundError
+         Util.warn "Object not found: #{href}"
+         SafeRealize.skipped[href] = true
+         nil
+       rescue Lutaml::Hal::Error => e
+         # Definitive upstream error (403 rate-limit, 5xx, 429) already retried by
+         # w3c_api / lutaml-hal. Skip the broken/unavailable resource rather than
+         # re-hitting it for every link that references it.
+         Util.warn "Skipping #{href}, upstream error after retries: #{e.message}"
+         SafeRealize.skipped[href] = true
+         nil
+       end
+
+       private
+
+       def resolve_href(obj)
+         obj.href || obj.links.self.href
+       end
+     end
+   end
+ end
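The behaviour of the new mixin (remember terminal failures, forget network failures, never raise) can be shown in isolation with a standard-library-only sketch. It is an illustration under assumptions, not the gem's code: `TerminalError`, `NetworkError`, `SkipOnTerminalError`, and `fetch_or_nil` are hypothetical names, and a Mutex-guarded Hash stands in for `Concurrent::Map` so the example runs without concurrent-ruby.

```ruby
# Hypothetical error classes standing in for the upstream ones.
class TerminalError < StandardError; end # e.g. not found, or still failing after retries
class NetworkError < StandardError; end  # e.g. timeout; the resource itself may be fine

module SkipOnTerminalError
  # Shared by every includer, initialized eagerly at load time.
  # A Mutex-guarded Hash is an assumption standing in for Concurrent::Map.
  @skipped = {}
  @lock = Mutex.new

  def self.skipped?(href)
    @lock.synchronize { @skipped.key?(href) }
  end

  def self.skip!(href)
    @lock.synchronize { @skipped[href] = true }
  end

  # Yields href to the caller's fetch block and returns nil instead of raising.
  def fetch_or_nil(href)
    return nil if SkipOnTerminalError.skipped?(href)

    yield href
  rescue NetworkError
    nil # not remembered, so a later call tries again
  rescue TerminalError
    SkipOnTerminalError.skip!(href) # remembered, so it is never fetched again
    nil
  end
end

class Crawler
  include SkipOnTerminalError
end

attempts = Hash.new(0)
fetch = lambda do |href|
  attempts[href] += 1
  raise TerminalError, "404" if href == "/gone"
  raise NetworkError, "timeout" if href == "/flaky" && attempts[href] == 1

  "body of #{href}"
end

crawler = Crawler.new
gone_results = 2.times.map { crawler.fetch_or_nil("/gone", &fetch) }
flaky_results = 2.times.map { crawler.fetch_or_nil("/flaky", &fetch) }
# "/gone" is fetched once and skipped afterwards; "/flaky" succeeds on the second try.
```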
data/lib/relaton/w3c/version.rb CHANGED
@@ -1,5 +1,5 @@
  module Relaton
  module W3c
- VERSION = "2.1.2".freeze
+ VERSION = "2.1.3".freeze
  end
  end
data/relaton-w3c.gemspec CHANGED
@@ -31,16 +31,8 @@ Gem::Specification.new do |spec|
  spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
  spec.require_paths = ["lib"]
  
- spec.add_dependency "linkeddata", "~> 3.2"
- spec.add_dependency "mechanize", "~> 2.10"
- spec.add_dependency "rdf", "~> 3.2"
- spec.add_dependency "rdf-normalize", "~> 0.6"
  spec.add_dependency "relaton-bib", "~> 2.1.0"
  spec.add_dependency "relaton-core", "~> 0.0.13"
  spec.add_dependency "relaton-index", "~> 0.2.8"
- spec.add_dependency "rubyzip", "~> 2.3"
- spec.add_dependency "shex", "~> 0.7"
- spec.add_dependency "csv", "~> 3.0"
- spec.add_dependency "sparql", "~> 3.2"
- spec.add_dependency "w3c_api", "~> 0.1.3"
+ spec.add_dependency "w3c_api", "~> 0.3.2"
  end
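The `w3c_api` constraint moves from `~> 0.1.3` to `~> 0.3.2`. As a quick check of what those pessimistic constraints admit, the sketch below uses RubyGems' own `Gem::Requirement` and `Gem::Version` classes; the helper `allows?` and the sample version numbers are only illustrative.

```ruby
require "rubygems"

old_req = Gem::Requirement.new("~> 0.1.3")
new_req = Gem::Requirement.new("~> 0.3.2")

# Hypothetical helper: does the requirement accept this version string?
def allows?(requirement, version)
  requirement.satisfied_by?(Gem::Version.new(version))
end

allows?(new_req, "0.3.2") # => true
allows?(new_req, "0.3.9") # => true  (patch releases within 0.3 are accepted)
allows?(new_req, "0.4.0") # => false (the next minor release is excluded)
allows?(old_req, "0.3.2") # => false (the old constraint stayed within 0.1.x)
```

So the two constraints do not overlap: a lockfile resolved against 2.1.2 has to pick up a newer `w3c_api` when upgrading to 2.1.3.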
metadata CHANGED
@@ -1,71 +1,15 @@
  --- !ruby/object:Gem::Specification
  name: relaton-w3c
  version: !ruby/object:Gem::Version
- version: 2.1.2
+ version: 2.1.3
  platform: ruby
  authors:
  - Ribose Inc.
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2026-05-15 00:00:00.000000000 Z
+ date: 2026-06-03 00:00:00.000000000 Z
  dependencies:
- - !ruby/object:Gem::Dependency
- name: linkeddata
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: '3.2'
- type: :runtime
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: '3.2'
- - !ruby/object:Gem::Dependency
- name: mechanize
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: '2.10'
- type: :runtime
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: '2.10'
- - !ruby/object:Gem::Dependency
- name: rdf
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: '3.2'
- type: :runtime
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: '3.2'
- - !ruby/object:Gem::Dependency
- name: rdf-normalize
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: '0.6'
- type: :runtime
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: '0.6'
  - !ruby/object:Gem::Dependency
  name: relaton-bib
  requirement: !ruby/object:Gem::Requirement
@@ -108,76 +52,20 @@ dependencies:
  - - "~>"
  - !ruby/object:Gem::Version
  version: 0.2.8
- - !ruby/object:Gem::Dependency
- name: rubyzip
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: '2.3'
- type: :runtime
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: '2.3'
- - !ruby/object:Gem::Dependency
- name: shex
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: '0.7'
- type: :runtime
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: '0.7'
- - !ruby/object:Gem::Dependency
- name: csv
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: '3.0'
- type: :runtime
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: '3.0'
- - !ruby/object:Gem::Dependency
- name: sparql
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: '3.2'
- type: :runtime
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: '3.2'
  - !ruby/object:Gem::Dependency
  name: w3c_api
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: 0.1.3
+ version: 0.3.2
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: 0.1.3
+ version: 0.3.2
  description: 'Relaton::W3c: retrieve W3C Standards for bibliographic using the IsoBibliographicItem
  model'
  email:
@@ -200,11 +88,6 @@ files:
  - bin/console
  - bin/rspec
  - bin/setup
- - grammars/basicdoc.rng
- - grammars/biblio-standoc.rng
- - grammars/biblio.rng
- - grammars/relaton-w3c-compile.rng
- - grammars/relaton-w3c.rng
  - lib/relaton/w3c.rb
  - lib/relaton/w3c/bibdata.rb
  - lib/relaton/w3c/bibitem.rb
@@ -217,7 +100,7 @@ files:
  - lib/relaton/w3c/item_data.rb
  - lib/relaton/w3c/processor.rb
  - lib/relaton/w3c/pubid.rb
- - lib/relaton/w3c/rate_limit_handler.rb
+ - lib/relaton/w3c/safe_realize.rb
  - lib/relaton/w3c/util.rb
  - lib/relaton/w3c/version.rb
  - relaton-w3c.gemspec