relaton-w3c 2.1.2 → 2.1.3
- checksums.yaml +4 -4
- data/CLAUDE.md +15 -6
- data/Gemfile +1 -0
- data/lib/relaton/w3c/data_fetcher.rb +8 -3
- data/lib/relaton/w3c/data_parser.rb +1 -1
- data/lib/relaton/w3c/safe_realize.rb +55 -0
- data/lib/relaton/w3c/version.rb +1 -1
- data/relaton-w3c.gemspec +1 -9
- metadata +5 -122
- data/grammars/basicdoc.rng +0 -2140
- data/grammars/biblio-standoc.rng +0 -268
- data/grammars/biblio.rng +0 -2125
- data/grammars/relaton-w3c-compile.rng +0 -11
- data/grammars/relaton-w3c.rng +0 -11
- data/lib/relaton/w3c/rate_limit_handler.rb +0 -62
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ac21e91a675f1a0c33ea6a0610edb38895ffe87d9b810593331f9a79cdbe0324
+  data.tar.gz: 74993548a097428280e01147eea4dd604adf181ed33c9d55b1db33e8a994c489
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 6851b1f389210dfe5bbef588b1d9de910c7f0d8c595e896a1d3419a937d20bbc643dbe08af40bd896e9f8f8149f59bea736b2f338512046f6d11f197e21b0999
+  data.tar.gz: 6a6db0027bde29086eff8f66240a9343288b393e2c25515e6e92b667d31b0fe89ea100de94adbb4bd8d244f7b527e7d05b3be1edf97917fad54512a88ca79132
data/CLAUDE.md
CHANGED
@@ -49,7 +49,7 @@ All classes live under `lib/relaton/w3c/` in the `Relaton::W3c` namespace:
 **Data fetching:**
 - **`DataFetcher`** (`data_fetcher.rb`) — extends `Core::DataFetcher`, fetches all W3C specs via the W3C API
 - **`DataParser`** (`data_parser.rb`) — converts W3C API spec objects into `Relaton::W3c::Item` instances
-- **`
+- **`SafeRealize`** (`safe_realize.rb`) — mixin that, on a terminal error, skips the resource (returns `nil`) so one bad link doesn't abort the crawl (see Rate limiting & retries). It does not retry or cache successes — those live upstream.
 - **`PubId`** (`pubid.rb`) — parses and compares W3C document identifiers (stage, code, date parts)
 
 **Utilities:**
@@ -57,13 +57,22 @@ All classes live under `lib/relaton/w3c/` in the `Relaton::W3c` namespace:
 
 The entry module is defined in `lib/relaton/w3c.rb` and exposes `grammar_hash`.
 
+### Rate limiting & retries
+
+Transient-failure resilience is layered upstream, not in this gem:
+- **w3c_api** builds its HAL client with `faraday-retry` to retry HTTP 403 (the W3C rate-limit signal) and connection/timeout errors.
+- **lutaml-hal** (beneath w3c_api) retries 429 and 5xx with exponential backoff.
+
+Successful objects are cached by **w3c_api** (lutaml-hal caches realized objects keyed by URL, thread-safely as of lutaml-hal 0.2.1), so `SafeRealize` doesn't cache them. It only **retries nothing** and remembers hrefs that failed terminally (in a `Concurrent::Map`), returning `nil` for them so one bad link doesn't abort the crawl and isn't re-fetched on every reference. Network errors are not remembered, so a later reference can try again.
+
 ### Key Dependencies
 
-- **relaton-bib** (~> 2.
+- **relaton-bib** (~> 2.1.0) — provides base `Bib::Item`, `Bib::Ext`, `Bib::Doctype` and serialization mixins (LutaML model layer)
 - **relaton-core** — provides base `Core::Processor` and `Core::DataFetcher`
-- **relaton-index** — index-based search for bibliographic references
-- **w3c_api** — W3C API client used by `DataFetcher` to retrieve specifications
-
+- **relaton-index** — index-based search for bibliographic references; also unpacks the index zip at runtime
+- **w3c_api** (~> 0.3.2) — W3C API (HAL/REST) client used by `DataFetcher` to retrieve specifications; owns rate-limit and transient-error retries, and the (thread-safe) object cache
+
+The W3C data is fetched entirely through `w3c_api`; the older RDF/SPARQL/scraping stack (linkeddata, rdf, sparql, shex, mechanize, …) has been removed.
 
 ### Schema Validation
 
@@ -82,7 +91,7 @@ Tests use RSpec with:
 - **VCR** — recorded HTTP cassettes in `spec/vcr_cassettes/` (7-day re-record interval)
 - **WebMock** — disables external HTTP in tests
 
-Test fixtures live in `spec/fixtures/` (YAML
+Test fixtures live in `spec/fixtures/` (YAML and XML files).
 
 ## Style
 
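The "Rate limiting & retries" text added to CLAUDE.md says that transient statuses are retried upstream with exponential backoff. The following is a minimal, stdlib-only sketch of that general idea. `TransientError`, `with_backoff`, and the delay values are hypothetical names and numbers chosen for illustration; this is not the actual w3c_api or lutaml-hal implementation.

```ruby
# Hypothetical error class standing in for a retryable HTTP failure
# (for example a 403 rate-limit response or a 5xx).
class TransientError < StandardError; end

# Retry the block up to `max_attempts` times, doubling the delay each time.
# `sleeper` is injectable so the sketch can run without real waiting.
def with_backoff(max_attempts: 4, base_delay: 0.5, sleeper: ->(s) { sleep(s) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue TransientError
    raise if attempt >= max_attempts # out of retries: the error is now terminal

    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Usage: fail twice, then succeed; record the delays instead of sleeping.
delays = []
result = with_backoff(sleeper: ->(s) { delays << s }) do |attempt|
  raise TransientError, "rate limited" if attempt < 3

  "ok on attempt #{attempt}"
end
```

Once the attempts are exhausted the error is re-raised, which corresponds to the point where, per the diff, an error that reaches `SafeRealize` is treated as terminal.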
data/Gemfile
CHANGED
data/lib/relaton/w3c/data_fetcher.rb
CHANGED
@@ -1,14 +1,14 @@
 require "relaton/core"
 require "w3c_api"
 require_relative "../w3c"
-require_relative "
+require_relative "safe_realize"
 require_relative "data_parser"
 require_relative "pubid"
 
 module Relaton
   module W3c
     class DataFetcher < Core::DataFetcher
-      include Relaton::W3c::
+      include Relaton::W3c::SafeRealize
 
       DEFAULT_CONCURRENCY = 8
 
@@ -52,7 +52,12 @@ module Relaton
           specs.links.specifications.each { |spec| queue << spec }
           break unless specs.next?
 
-
+          # Route pagination through realize so transient 403/5xx on the
+          # next-page link retry with backoff instead of crashing the crawl.
+          next_page = realize(specs.links.next)
+          break unless next_page
+
+          specs = next_page
         end
 
         n_workers.times { queue << nil } # poison pills
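The second hunk above changes the pagination loop so that the next page is obtained through `realize`, and the loop ends when that call returns `nil`. A minimal sketch of the control flow, under stated assumptions: `Page`, the stubbed `realize`, and `enqueue_all` are hypothetical stand-ins, not the gem's actual classes or methods.

```ruby
# Hypothetical stand-in for a paginated API response.
Page = Struct.new(:items, :next_link) do
  def next?
    !next_link.nil?
  end
end

# Stub for a SafeRealize-style `realize`: returns the next page, or nil when
# the link failed terminally (simulated here by the symbol :broken).
def realize(link)
  link == :broken ? nil : link
end

def enqueue_all(first_page, queue, n_workers)
  page = first_page
  loop do
    page.items.each { |item| queue << item }
    break unless page.next?

    next_page = realize(page.next_link)
    break unless next_page # a failed next-page link ends pagination quietly

    page = next_page
  end
  n_workers.times { queue << nil } # poison pills tell each worker to stop
end

last  = Page.new(%w[c], nil)
first = Page.new(%w[a b], last)
queue = Queue.new
enqueue_all(first, queue, 2)

drained = []
drained << queue.pop until queue.empty?
```

The point of the design is that a broken next-page link costs the remaining pages but still lets the workers finish the items already queued and receive their poison pills.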
data/lib/relaton/w3c/safe_realize.rb
ADDED
@@ -0,0 +1,55 @@
+require "concurrent/map"
+
+module Relaton
+  module W3c
+    # Thin wrapper over lutaml-hal's `realize`. Successful objects are cached by
+    # w3c_api (it caches realized objects keyed by URL), so this only remembers
+    # resources that failed terminally and returns nil for them — so one broken
+    # link doesn't abort the crawl and isn't re-fetched on every reference.
+    #
+    # Transient failures are retried upstream: w3c_api retries HTTP 403 (the
+    # W3C rate-limit signal) and connection/timeout errors, and lutaml-hal
+    # retries 429 and 5xx. By the time an error surfaces here it is terminal.
+    module SafeRealize
+      # Hrefs that failed terminally — one map shared by every includer
+      # (DataFetcher and DataParser) since a broken resource is broken for the
+      # whole crawl. Initialized eagerly (at load, single-threaded) so the
+      # parallel fetcher's first concurrent access can't race a lazy `||=`;
+      # Concurrent::Map then handles the concurrent reads/writes.
+      @skipped = Concurrent::Map.new
+
+      def self.skipped
+        @skipped
+      end
+
+      def realize(obj)
+        href = resolve_href(obj)
+        return nil if SafeRealize.skipped.key?(href)
+
+        obj.realize
+      rescue Lutaml::Hal::ConnectionError, Lutaml::Hal::TimeoutError, Faraday::Error, Net::OpenTimeout => e
+        # Network-level failure (already retried by w3c_api). The resource itself
+        # is fine, so don't skip it permanently — a later reference can try again.
+        Util.warn "Failed to realize object: #{href}, error: #{e.message}"
+        nil
+      rescue Lutaml::Hal::NotFoundError
+        Util.warn "Object not found: #{href}"
+        SafeRealize.skipped[href] = true
+        nil
+      rescue Lutaml::Hal::Error => e
+        # Definitive upstream error (403 rate-limit, 5xx, 429) already retried by
+        # w3c_api / lutaml-hal. Skip the broken/unavailable resource rather than
+        # re-hitting it for every link that references it.
+        Util.warn "Skipping #{href}, upstream error after retries: #{e.message}"
+        SafeRealize.skipped[href] = true
+        nil
+      end
+
+      private
+
+      def resolve_href(obj)
+        obj.href || obj.links.self.href
+      end
+    end
+  end
+end
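The new `SafeRealize` module remembers only terminal failures and lets network-level failures be tried again. The sketch below illustrates that pattern in isolation so it can run with the standard library alone. It swaps `Concurrent::Map` for a `Mutex`-guarded `Hash`, and `SkipOnFailure`, `NetworkError`, `TerminalError`, `FakeLink`, and `Crawler` are all hypothetical names, not part of the gem or of lutaml-hal.

```ruby
class NetworkError < StandardError; end   # hypothetical: transient, not remembered
class TerminalError < StandardError; end  # hypothetical: e.g. not found after retries

module SkipOnFailure
  # Created eagerly at load time, so no thread can race a lazy `||=`.
  @skipped = {}
  @lock = Mutex.new

  def self.skipped?(href)
    @lock.synchronize { @skipped.key?(href) }
  end

  def self.skip!(href)
    @lock.synchronize { @skipped[href] = true }
  end

  # Returns the realized object, or nil when the resource cannot be fetched.
  def realize(obj)
    return nil if SkipOnFailure.skipped?(obj.href)

    obj.realize
  rescue NetworkError
    nil # do not remember: a later reference may succeed
  rescue TerminalError
    SkipOnFailure.skip!(obj.href)
    nil
  end
end

# Hypothetical link object that counts how often it is actually fetched.
class FakeLink
  attr_reader :href, :calls

  def initialize(href, error)
    @href = href
    @error = error
    @calls = 0
  end

  def realize
    @calls += 1
    raise @error, href
  end
end

class Crawler
  include SkipOnFailure
end

crawler = Crawler.new
gone  = FakeLink.new("https://example.test/gone", TerminalError)
flaky = FakeLink.new("https://example.test/flaky", NetworkError)
2.times { crawler.realize(gone) }  # fetched once, then skipped
2.times { crawler.realize(flaky) } # fetched on every reference
```

Keeping the skip set at module level, rather than per instance, mirrors the comment in the diff that one map is shared by every includer because a broken resource is broken for the whole crawl.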
data/lib/relaton/w3c/version.rb
CHANGED
data/relaton-w3c.gemspec
CHANGED
@@ -31,16 +31,8 @@ Gem::Specification.new do |spec|
   spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
   spec.require_paths = ["lib"]
 
-  spec.add_dependency "linkeddata", "~> 3.2"
-  spec.add_dependency "mechanize", "~> 2.10"
-  spec.add_dependency "rdf", "~> 3.2"
-  spec.add_dependency "rdf-normalize", "~> 0.6"
   spec.add_dependency "relaton-bib", "~> 2.1.0"
   spec.add_dependency "relaton-core", "~> 0.0.13"
   spec.add_dependency "relaton-index", "~> 0.2.8"
-  spec.add_dependency "
-  spec.add_dependency "shex", "~> 0.7"
-  spec.add_dependency "csv", "~> 3.0"
-  spec.add_dependency "sparql", "~> 3.2"
-  spec.add_dependency "w3c_api", "~> 0.1.3"
+  spec.add_dependency "w3c_api", "~> 0.3.2"
 end
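The gemspec hunk moves the `w3c_api` constraint from `~> 0.1.3` to `~> 0.3.2`. As a quick illustration of what the pessimistic operator admits, the sketch below uses RubyGems' own `Gem::Requirement` and `Gem::Version`; the helper `ok?` and the sample version numbers are chosen here for illustration only.

```ruby
require "rubygems" # normally loaded already; explicit here for clarity

old_req = Gem::Requirement.new("~> 0.1.3")
new_req = Gem::Requirement.new("~> 0.3.2")

# Hypothetical helper: does `version` satisfy `req`?
def ok?(req, version)
  req.satisfied_by?(Gem::Version.new(version))
end

checks = {
  "new accepts 0.3.2" => ok?(new_req, "0.3.2"),
  "new accepts 0.3.9" => ok?(new_req, "0.3.9"),
  "new accepts 0.4.0" => ok?(new_req, "0.4.0"),
  "old accepts 0.3.2" => ok?(old_req, "0.3.2"),
}
checks.each { |label, result| puts "#{label}: #{result}" }
```

`~> 0.3.2` allows patch releases from 0.3.2 up to, but not including, 0.4.0, so the old `~> 0.1.3` constraint could not have resolved to the 0.3.x line.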
metadata
CHANGED
@@ -1,71 +1,15 @@
 --- !ruby/object:Gem::Specification
 name: relaton-w3c
 version: !ruby/object:Gem::Version
-  version: 2.1.
+  version: 2.1.3
 platform: ruby
 authors:
 - Ribose Inc.
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2026-
+date: 2026-06-03 00:00:00.000000000 Z
 dependencies:
-- !ruby/object:Gem::Dependency
-  name: linkeddata
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '3.2'
-  type: :runtime
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '3.2'
-- !ruby/object:Gem::Dependency
-  name: mechanize
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '2.10'
-  type: :runtime
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '2.10'
-- !ruby/object:Gem::Dependency
-  name: rdf
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '3.2'
-  type: :runtime
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '3.2'
-- !ruby/object:Gem::Dependency
-  name: rdf-normalize
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '0.6'
-  type: :runtime
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '0.6'
 - !ruby/object:Gem::Dependency
   name: relaton-bib
   requirement: !ruby/object:Gem::Requirement
@@ -108,76 +52,20 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: 0.2.8
-- !ruby/object:Gem::Dependency
-  name: rubyzip
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '2.3'
-  type: :runtime
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '2.3'
-- !ruby/object:Gem::Dependency
-  name: shex
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '0.7'
-  type: :runtime
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '0.7'
-- !ruby/object:Gem::Dependency
-  name: csv
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '3.0'
-  type: :runtime
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '3.0'
-- !ruby/object:Gem::Dependency
-  name: sparql
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '3.2'
-  type: :runtime
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '3.2'
 - !ruby/object:Gem::Dependency
   name: w3c_api
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: 0.
+        version: 0.3.2
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: 0.
+        version: 0.3.2
 description: 'Relaton::W3c: retrieve W3C Standards for bibliographic using the IsoBibliographicItem
   model'
 email:
@@ -200,11 +88,6 @@ files:
 - bin/console
 - bin/rspec
 - bin/setup
-- grammars/basicdoc.rng
-- grammars/biblio-standoc.rng
-- grammars/biblio.rng
-- grammars/relaton-w3c-compile.rng
-- grammars/relaton-w3c.rng
 - lib/relaton/w3c.rb
 - lib/relaton/w3c/bibdata.rb
 - lib/relaton/w3c/bibitem.rb
@@ -217,7 +100,7 @@ files:
 - lib/relaton/w3c/item_data.rb
 - lib/relaton/w3c/processor.rb
 - lib/relaton/w3c/pubid.rb
-- lib/relaton/w3c/
+- lib/relaton/w3c/safe_realize.rb
 - lib/relaton/w3c/util.rb
 - lib/relaton/w3c/version.rb
 - relaton-w3c.gemspec