relaton-iso 2.1.1 → 2.1.2
- checksums.yaml +4 -4
- data/CLAUDE.md +5 -2
- data/README.adoc +21 -1
- data/lib/relaton/iso/bibliography.rb +11 -1
- data/lib/relaton/iso/data_fetcher.rb +220 -151
- data/lib/relaton/iso/data_parser.rb +443 -0
- data/lib/relaton/iso/model/docidentifier.rb +7 -2
- data/lib/relaton/iso/processor.rb +8 -5
- data/lib/relaton/iso/type/pubid.rb +50 -0
- data/lib/relaton/iso/version.rb +1 -1
- metadata +5 -12
- data/grammars/basicdoc.rng +0 -2140
- data/grammars/biblio-standoc.rng +0 -268
- data/grammars/biblio.rng +0 -2125
- data/grammars/relaton-iso-compile.rng +0 -11
- data/grammars/relaton-iso.rng +0 -165
- data/lib/relaton/iso/queue.rb +0 -63
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 767dfed024aec3fc3c96c2322ef0fe9514fdd99d7ea5602c9caa878ba9ff95f6
|
|
4
|
+
data.tar.gz: 427f2b4fb8c58791acd025fcaa7bc6ba7d1928314c6ca9e04ea583d3063247e8
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: a99c1fb6fd7ed6cd9f11784d1851353224f9af378705ed3cfcbafd4a20d1eb40f33036521bdcabebe4ab720a6d628ab01a41af914652e3effbe0ba4d5176882f
|
|
7
|
+
data.tar.gz: 52c952079694c2f0b43a23a9eb767442c53c2fd32183317616eafff960c14e8b997a6819b407933c1d584e15f690fc531b4c8fea7a4dc7add4bac8eb6a237e90
|
data/CLAUDE.md
CHANGED
|
@@ -20,8 +20,11 @@ relaton-iso retrieves ISO standard bibliographic data. The core retrieval flow:
|
|
|
20
20
|
2. **HitCollection** (`lib/relaton/iso/hit_collection.rb`) — searches a pre-built YAML index (`index-v1.zip` from relaton-data-iso) using `Relaton::Index`. Matches on `id_keys`: publisher, number, copublisher, part, year, edition, type, stage, iteration. Returns sorted Hit array.
|
|
21
21
|
3. **Hit** (`lib/relaton/iso/hit.rb`) — wraps an index result. The `item` attribute lazy-loads the full document from GitHub raw content (relaton-data-iso repo). `sort_weight` prioritizes published over withdrawn/deleted.
|
|
22
22
|
4. **ItemData** / **Model::Item** — ISO-specific bibliographic item extending `Relaton::Bib::ItemData`.
|
|
23
|
-
5. **Scraper** (`lib/relaton/iso/scraper.rb`) — parses ISO website pages
|
|
24
|
-
6. **DataFetcher** (`lib/relaton/iso/data_fetcher.rb`) —
|
|
23
|
+
5. **Scraper** (`lib/relaton/iso/scraper.rb`) — parses individual ISO website pages. Used only by `Bibliography.get` as a fallback when an item is missing from the curated index; no longer drives bulk ingest.
|
|
24
|
+
6. **DataFetcher** (`lib/relaton/iso/data_fetcher.rb`) — streams the ISO Open Data programme JSONL feeds (`iso_deliverables_metadata.jsonl` for documents, `iso_technical_committees.jsonl` for committee titles) and writes one YAML per primary docid into `@output`. Short-circuits on upstream `Last-Modified`; falls back to a full pass when `data/` or `index-v1.yaml` is missing. Two source modes:
|
|
25
|
+
- `iso-open-data` (default) — incremental, skip when upstream is unchanged.
|
|
26
|
+
- `iso-open-data-all` — wipe `@output` and re-emit every record.
|
|
27
|
+
7. **DataParser** (`lib/relaton/iso/data_parser.rb`) — converts one Open Data record (`Hash`) into a `Relaton::Iso::ItemData`. Takes a `ref_index` (id → reference) for resolving `replaces`/`replacedBy` and a `tc_index` (reference → `{ "en"/"fr" => title }`) for resolving committee labels.
|
|
25
28
|
|
|
26
29
|
Key dependency: `pubid-iso` gem handles ISO publication identifier parsing and comparison.
|
|
27
30
|
|
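The two lookup tables that the CLAUDE.md text describes for `DataParser` can be pictured as plain hashes. What follows is a minimal sketch, not the gem's code: the helper names `resolve_references` and `committee_title` and all sample ids, references and titles are hypothetical. Only the hash shapes (id → reference, reference → `{ "en"/"fr" => title }`) are taken from the description.

```ruby
# Hypothetical sample data; only the hash shapes follow the description.
REF_INDEX = {
  1001 => "ISO 99999:2020",
  1002 => "ISO 99999:2015",
}.freeze

TC_INDEX = {
  "ISO/TC 999" => { "en" => "Sample committee", "fr" => "Exemple" },
}.freeze

# Map a list of record ids to references, dropping ids absent from the index.
def resolve_references(ids, ref_index)
  Array(ids).filter_map { |id| ref_index[id] }
end

# Look up a committee title by language, falling back to English.
def committee_title(reference, tc_index, lang = "en")
  titles = tc_index[reference]
  return nil unless titles

  titles[lang] || titles["en"]
end
```

Building both tables once and passing them to every parse call keeps each record lookup a constant-time hash access.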
data/README.adoc
CHANGED
|
@@ -352,6 +352,26 @@ item.source
|
|
|
352
352
|
@type="rss">]
|
|
353
353
|
----
|
|
354
354
|
|
|
355
|
+
[[bulk-data-ingest]]
|
|
356
|
+
=== Bulk data ingest
|
|
357
|
+
|
|
358
|
+
The curated dataset under https://github.com/relaton/relaton-data-iso[relaton-data-iso] is rebuilt daily by `Relaton::Iso::DataFetcher`, which streams the
|
|
359
|
+
https://www.iso.org/open-data.html[ISO Open Data programme] JSONL feeds — `iso_deliverables_metadata.jsonl` for documents (~80,000 records) and `iso_technical_committees.jsonl` for committee titles — and writes one YAML per primary docid.
|
|
360
|
+
|
|
361
|
+
Two source modes are exposed (also reachable via `relaton-cli`'s dataset list and the GitHub Actions workflow input):
|
|
362
|
+
|
|
363
|
+
[source,ruby]
|
|
364
|
+
----
|
|
365
|
+
# Incremental: skip the run if upstream `Last-Modified` matches the local
|
|
366
|
+
# `last_modified.txt`. Falls back to a full pass when `data/` or
|
|
367
|
+
# `index-v1.yaml` is missing.
|
|
368
|
+
Relaton::Iso::DataFetcher.fetch("iso-open-data", output: "data", format: "yaml")
|
|
369
|
+
|
|
370
|
+
# Full refresh: wipe `output` and re-emit every record. Use when the local
|
|
371
|
+
# tree is suspect or after a parser change that affects emitted YAML.
|
|
372
|
+
Relaton::Iso::DataFetcher.fetch("iso-open-data-all", output: "data", format: "yaml")
|
|
373
|
+
----
|
|
374
|
+
|
|
355
375
|
=== Logging
|
|
356
376
|
|
|
357
377
|
RelatonIso uses the relaton-logger gem for logging. By default, it logs to STDOUT. To change the log levels and add other loggers, read the https://github.com/relaton/relaton-logger#usage[relaton-logger] documentation.
|
|
@@ -367,7 +387,7 @@ To install this gem onto your local machine, run `bundle exec rake install`. To
|
|
|
367
387
|
|
|
368
388
|
== Exceptional Citations
|
|
369
389
|
|
|
370
|
-
|
|
390
|
+
Single-document lookups via `Bibliography.get` first consult the curated relaton-data-iso index (regenerated daily from the ISO Open Data programme — see <<bulk-data-ingest>>) and fall back to scraping individual pages on http://www.iso.org for items not yet indexed. The following documents are not returned as search results from the ISO website, and the gem returns manually generated references to them.
|
|
371
391
|
|
|
372
392
|
* `IEV`: used in the metanorma-iso gem to reference Electropedia entries generically. Is resolved to an "all parts" reference to IEC 60050, which in turn is resolved into the specific documents cited by their top-level clause.
|
|
373
393
|
|
data/lib/relaton/iso/bibliography.rb
CHANGED
|
@@ -186,10 +186,16 @@ module Relaton
|
|
|
186
186
|
end
|
|
187
187
|
|
|
188
188
|
# Extract year from a hit as an integer.
|
|
189
|
+
#
|
|
190
|
+
# Amendments, corrigendums and supplements carry no year on their own
|
|
191
|
+
# identifier; the year lives on the underlying standard reachable via
|
|
192
|
+
# `root` (which walks the full base chain, however deeply nested). Fall
|
|
193
|
+
# back to it so a date filter does not drop such references (issue #181).
|
|
194
|
+
#
|
|
189
195
|
# @param hit [Relaton::Iso::Hit]
|
|
190
196
|
# @return [Integer]
|
|
191
197
|
def hit_year(hit)
|
|
192
|
-
yr = hit.pubid&.year || hit.hit[:year]
|
|
198
|
+
yr = hit.pubid&.year || hit.hit[:year] || hit.pubid&.root&.year
|
|
193
199
|
yr.to_i
|
|
194
200
|
end
|
|
195
201
|
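The fallback order in the changed `hit_year` line can be tried in isolation. This is a sketch with stand-in objects: `PubidStub` and `HitStub` are hypothetical and model only the `year`, `root` and `hit` members that the line touches, not the real `Pubid::Iso::Identifier` or `Relaton::Iso::Hit` classes.

```ruby
# Stand-ins for the real objects; only the members used below are modelled.
PubidStub = Struct.new(:year, :root, keyword_init: true)
HitStub   = Struct.new(:pubid, :hit, keyword_init: true)

# Same fallback order as the changed line: the identifier's own year, then
# the raw index entry, then the year of the underlying base standard.
def hit_year(hit)
  yr = hit.pubid&.year || hit.hit[:year] || hit.pubid&.root&.year
  yr.to_i
end

base      = PubidStub.new(year: 2019)
amendment = PubidStub.new(year: nil, root: base)

hit_year(HitStub.new(pubid: amendment, hit: {})) # year comes from `root`
hit_year(HitStub.new(pubid: nil, hit: {}))       # nothing found, nil.to_i is 0
```

Because every step uses safe navigation, a hit without a parsed identifier still yields an integer instead of raising.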
|
|
@@ -236,6 +242,10 @@ module Relaton
|
|
|
236
242
|
# @return [Relaton::Iso::ItemData, nil]
|
|
237
243
|
def fetch_and_check_date(hit, pubid, opts)
|
|
238
244
|
ret = hit.item
|
|
245
|
+
# A data file that fails to load (e.g. the index references a file that
|
|
246
|
+
# 404s) yields an item with no docidentifier; skip it rather than crash.
|
|
247
|
+
return unless ret&.docidentifier&.first
|
|
248
|
+
|
|
239
249
|
if publication_date_in_range?(ret, opts)
|
|
240
250
|
Util.info "Found: `#{ret.docidentifier.first.content}`", key: pubid.to_s
|
|
241
251
|
ret
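The guard added to `fetch_and_check_date` reduces to one question: does the loaded item carry at least one identifier? A minimal sketch follows, assuming a hypothetical `ItemStub` in place of `Relaton::Iso::ItemData` and a hypothetical predicate name `usable_item?`.

```ruby
# Hypothetical stand-in that models only `docidentifier`.
ItemStub = Struct.new(:docidentifier)

# True only when the item exists and has a first identifier. A nil item,
# a nil identifier list and an empty list all count as unusable.
def usable_item?(item)
  first_id = item&.docidentifier&.first
  !first_id.nil?
end

usable_item?(nil)                           # => false
usable_item?(ItemStub.new([]))              # => false
usable_item?(ItemStub.new(["ISO 99999"]))   # => true
```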
|
data/lib/relaton/iso/data_fetcher.rb
CHANGED
|
@@ -1,185 +1,251 @@
|
|
|
1
|
+
require "fileutils"
|
|
2
|
+
require "json"
|
|
3
|
+
require "net/http"
|
|
4
|
+
require "tmpdir"
|
|
1
5
|
require_relative "../iso"
|
|
2
|
-
require_relative "
|
|
3
|
-
require_relative "scraper"
|
|
6
|
+
require_relative "data_parser"
|
|
4
7
|
|
|
5
8
|
module Relaton
|
|
6
9
|
module Iso
|
|
7
|
-
#
|
|
10
|
+
#
|
|
11
|
+
# Fetch ISO documents from the ISO Open Data programme bulk JSONL
|
|
12
|
+
# (see https://www.iso.org/open-data.html) and write each one as a YAML
|
|
13
|
+
# file under `@output`.
|
|
14
|
+
#
|
|
15
|
+
# `source` modes (matching the `Relaton::Core::DataFetcher.fetch` arg):
|
|
16
|
+
#
|
|
17
|
+
# * `"iso-open-data"` (default) - skip the run if the upstream
|
|
18
|
+
# `Last-Modified` header matches `LAST_MODIFIED_FILE`.
|
|
19
|
+
# * `"iso-open-data-all"` - clear `@output` and re-emit every record.
|
|
20
|
+
#
|
|
8
21
|
class DataFetcher < Core::DataFetcher
|
|
9
|
-
|
|
10
|
-
|
|
11
|
-
|
|
12
|
-
|
|
13
|
-
|
|
14
|
-
|
|
15
|
-
|
|
16
|
-
|
|
17
|
-
|
|
18
|
-
def mutex
|
|
19
|
-
@mutex ||= Mutex.new
|
|
20
|
-
end
|
|
22
|
+
OPEN_DATA_URL = "https://isopublicstorageprod.blob.core.windows.net/" \
|
|
23
|
+
"opendata/_latest/iso_deliverables_metadata/json/" \
|
|
24
|
+
"iso_deliverables_metadata.jsonl".freeze
|
|
25
|
+
TC_DATA_URL = "https://isopublicstorageprod.blob.core.windows.net/" \
|
|
26
|
+
"opendata/_latest/iso_technical_committees/json/" \
|
|
27
|
+
"iso_technical_committees.jsonl".freeze
|
|
28
|
+
LAST_MODIFIED_FILE = "last_modified.txt".freeze
|
|
29
|
+
MAX_DOWNLOAD_RETRIES = 4
|
|
30
|
+
RETRY_BACKOFF_BASE = 30
|
|
21
31
|
|
|
22
32
|
def log_error(msg)
|
|
23
33
|
Util.error msg
|
|
24
34
|
end
|
|
25
35
|
|
|
26
36
|
def index
|
|
27
|
-
@index ||= Relaton::Index.find_or_create
|
|
28
|
-
|
|
29
|
-
|
|
30
|
-
|
|
31
|
-
|
|
32
|
-
|
|
33
|
-
|
|
34
|
-
|
|
35
|
-
|
|
36
|
-
|
|
37
|
-
|
|
38
|
-
|
|
39
|
-
|
|
40
|
-
|
|
41
|
-
|
|
42
|
-
|
|
43
|
-
|
|
44
|
-
|
|
45
|
-
|
|
46
|
-
|
|
47
|
-
Util.info "Scrapping ICS pages..."
|
|
48
|
-
fetch_ics
|
|
49
|
-
Util.info "(#{Time.now}) Scrapping documents..."
|
|
50
|
-
fetch_docs
|
|
51
|
-
iso_queue.save
|
|
52
|
-
# index.sort! { |a, b| compare_docids a, b }
|
|
37
|
+
@index ||= Relaton::Index.find_or_create(
|
|
38
|
+
:iso, file: "#{INDEXFILE}.yaml", pubid_class: ::Pubid::Iso::Identifier,
|
|
39
|
+
)
|
|
40
|
+
end
|
|
41
|
+
|
|
42
|
+
def fetch(source = nil)
|
|
43
|
+
@source = source || "iso-open-data"
|
|
44
|
+
@full_refresh = @source == "iso-open-data-all"
|
|
45
|
+
|
|
46
|
+
Util.info "Fetching ISO Open Data (mode: #{@source})..."
|
|
47
|
+
last_modified = fetch_last_modified
|
|
48
|
+
return if up_to_date?(last_modified)
|
|
49
|
+
|
|
50
|
+
prepare_output
|
|
51
|
+
jsonl_path = download_dataset
|
|
52
|
+
ref_index, amend_index, date_index = build_ref_index(jsonl_path)
|
|
53
|
+
tc_index = build_tc_index
|
|
54
|
+
ingest_records(jsonl_path, ref_index, tc_index, amend_index, date_index)
|
|
55
|
+
merge_static_files
|
|
56
|
+
|
|
53
57
|
index.save
|
|
58
|
+
save_last_modified(last_modified)
|
|
54
59
|
report_errors
|
|
60
|
+
rescue StandardError => e
|
|
61
|
+
Util.error "#{e.message}\n#{e.backtrace.join("\n")}"
|
|
62
|
+
raise
|
|
55
63
|
end
|
|
56
64
|
|
|
57
65
|
private
|
|
58
66
|
|
|
59
|
-
#
|
|
60
|
-
|
|
61
|
-
|
|
62
|
-
|
|
63
|
-
|
|
64
|
-
|
|
65
|
-
|
|
66
|
-
|
|
67
|
-
|
|
68
|
-
|
|
69
|
-
|
|
70
|
-
|
|
71
|
-
|
|
72
|
-
|
|
73
|
-
|
|
74
|
-
|
|
75
|
-
Util.
|
|
76
|
-
|
|
67
|
+
# --- HTTP / state -----------------------------------------------------
|
|
68
|
+
|
|
69
|
+
def fetch_last_modified
|
|
70
|
+
uri = URI(OPEN_DATA_URL)
|
|
71
|
+
resp = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
|
|
72
|
+
http.request(Net::HTTP::Head.new(uri.request_uri))
|
|
73
|
+
end
|
|
74
|
+
resp["last-modified"]
|
|
75
|
+
end
|
|
76
|
+
|
|
77
|
+
def up_to_date?(last_modified)
|
|
78
|
+
return false if @full_refresh || last_modified.nil?
|
|
79
|
+
return false unless File.exist?(LAST_MODIFIED_FILE)
|
|
80
|
+
return false unless output_populated?
|
|
81
|
+
|
|
82
|
+
if File.read(LAST_MODIFIED_FILE, encoding: "UTF-8").strip == last_modified.strip
|
|
83
|
+
Util.info "ISO Open Data is up to date (Last-Modified: #{last_modified}); nothing to do."
|
|
84
|
+
true
|
|
85
|
+
else
|
|
86
|
+
false
|
|
77
87
|
end
|
|
88
|
+
end
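The core of the `up_to_date?` short-circuit can be exercised without any network access. This is a simplified sketch: `unchanged?` and `demo_unchanged` are hypothetical names, the header values are made up, and the `@full_refresh` and `output_populated?` checks from the method above are deliberately left out.

```ruby
require "tmpdir"

# True when a stored stamp exists and equals the upstream header value.
# `stamp_path` and `upstream` are illustrative parameters.
def unchanged?(stamp_path, upstream)
  return false if upstream.nil?
  return false unless File.exist?(stamp_path)

  File.read(stamp_path, encoding: "UTF-8").strip == upstream.strip
end

# Walk through three situations: no stamp yet, matching stamp, newer upstream.
def demo_unchanged
  Dir.mktmpdir do |dir|
    stamp  = File.join(dir, "last_modified.txt")
    header = "Wed, 01 Jan 2025 00:00:00 GMT" # made-up header value
    before = unchanged?(stamp, header)
    File.write(stamp, "#{header}\n", encoding: "UTF-8")
    [before,
     unchanged?(stamp, header),
     unchanged?(stamp, "Thu, 02 Jan 2025 00:00:00 GMT")]
  end
end

demo_unchanged # => [false, true, false]
```

Comparing the header as an opaque string avoids date parsing entirely; any change in the upstream value triggers a new run.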
|
|
89
|
+
|
|
90
|
+
# Guard against an external wipe (or a fresh checkout) — if the YAML tree
|
|
91
|
+
# or the index file is gone, force a refresh instead of trusting
|
|
92
|
+
# `LAST_MODIFIED_FILE`.
|
|
93
|
+
def output_populated?
|
|
94
|
+
return false unless Dir.exist?(@output)
|
|
95
|
+
return false unless File.exist?("#{INDEXFILE}.yaml")
|
|
78
96
|
|
|
79
|
-
|
|
80
|
-
parse_doc_links page
|
|
81
|
-
parse_ics_links page
|
|
97
|
+
Dir.children(@output).any? { |f| f.end_with?(".yaml") }
|
|
82
98
|
end
|
|
83
99
|
|
|
84
|
-
def
|
|
85
|
-
|
|
86
|
-
|
|
87
|
-
|
|
100
|
+
def save_last_modified(last_modified)
|
|
101
|
+
return unless last_modified
|
|
102
|
+
|
|
103
|
+
File.write(LAST_MODIFIED_FILE, last_modified, encoding: "UTF-8")
|
|
104
|
+
end
|
|
105
|
+
|
|
106
|
+
def prepare_output
|
|
107
|
+
FileUtils.rm_rf(@output) if @full_refresh
|
|
108
|
+
FileUtils.mkdir_p(@output)
|
|
88
109
|
end
|
|
89
110
|
|
|
90
|
-
def
|
|
91
|
-
|
|
92
|
-
@errors[:ics_links] &&= ics_links.empty?
|
|
93
|
-
ics_links.each { |item| queue << item[:href] }
|
|
111
|
+
def download_dataset
|
|
112
|
+
download_jsonl(OPEN_DATA_URL, "iso_deliverables_metadata.jsonl")
|
|
94
113
|
end
|
|
95
114
|
|
|
96
|
-
def
|
|
97
|
-
|
|
115
|
+
def download_tc_dataset
|
|
116
|
+
download_jsonl(TC_DATA_URL, "iso_technical_committees.jsonl")
|
|
98
117
|
end
|
|
99
118
|
|
|
100
|
-
|
|
101
|
-
|
|
102
|
-
|
|
103
|
-
|
|
104
|
-
|
|
105
|
-
#
|
|
106
|
-
# @return [Net::HTTPOK, nil] HTTP response
|
|
107
|
-
#
|
|
108
|
-
def get_redirection(path) # rubocop:disable Metrics/MethodLength
|
|
109
|
-
try = 0
|
|
110
|
-
uri = URI url(path)
|
|
119
|
+
def download_jsonl(url, filename)
|
|
120
|
+
path = File.join(Dir.tmpdir, filename)
|
|
121
|
+
Util.info "Downloading #{url}..."
|
|
122
|
+
uri = URI(url)
|
|
123
|
+
attempt = 0
|
|
111
124
|
begin
|
|
112
|
-
|
|
113
|
-
|
|
114
|
-
|
|
115
|
-
|
|
125
|
+
File.open(path, "wb") do |f|
|
|
126
|
+
Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
|
|
127
|
+
http.request_get(uri.request_uri) do |resp|
|
|
128
|
+
raise "Open Data download failed: HTTP #{resp.code}" unless resp.code == "200"
|
|
116
129
|
|
|
117
|
-
|
|
130
|
+
resp.read_body { |chunk| f.write(chunk) }
|
|
131
|
+
end
|
|
132
|
+
end
|
|
133
|
+
end
|
|
134
|
+
rescue StandardError => e
|
|
135
|
+
attempt += 1
|
|
136
|
+
raise if attempt > MAX_DOWNLOAD_RETRIES
|
|
137
|
+
|
|
138
|
+
delay = RETRY_BACKOFF_BASE * (2**(attempt - 1))
|
|
139
|
+
Util.warn "Download attempt #{attempt}/#{MAX_DOWNLOAD_RETRIES} failed (#{e.message}). Retrying in #{delay}s..."
|
|
140
|
+
sleep delay
|
|
141
|
+
retry
|
|
118
142
|
end
|
|
143
|
+
Util.info "Downloaded #{File.size(path) / 1024 / 1024} MB to #{path}."
|
|
144
|
+
path
|
|
119
145
|
end
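The retry schedule in `download_jsonl` doubles the wait after each failure: with `RETRY_BACKOFF_BASE = 30` and `MAX_DOWNLOAD_RETRIES = 4` the delays are 30, 60, 120 and 240 seconds, and the fifth failure is re-raised. The sketch below isolates that pattern. The names `with_backoff`, `demo_backoff` and the injected `sleeper` are hypothetical and exist only so the example runs without actually waiting.

```ruby
# Retry the block, waiting base * 2**(attempt - 1) seconds between tries.
# `sleeper` is injected so callers can record delays instead of sleeping.
def with_backoff(max_retries: 4, base: 30, sleeper: ->(s) { sleep s })
  attempt = 0
  begin
    yield
  rescue StandardError
    attempt += 1
    raise if attempt > max_retries

    sleeper.call(base * (2**(attempt - 1)))
    retry
  end
end

# Simulate an operation that fails `failures` times before succeeding.
def demo_backoff(failures)
  delays = []
  calls = 0
  result = with_backoff(sleeper: ->(s) { delays << s }) do
    calls += 1
    raise "transient failure" if calls <= failures

    :ok
  end
  [result, delays, calls]
end

demo_backoff(3) # => [:ok, [30, 60, 120], 4]
```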
|
|
120
146
|
|
|
121
|
-
|
|
122
|
-
resp = Net::HTTP.get_response(uri)
|
|
123
|
-
resp.code == "302" ? get_redirection(resp["location"]) : resp
|
|
124
|
-
end
|
|
147
|
+
# --- ingestion --------------------------------------------------------
|
|
125
148
|
|
|
126
|
-
def
|
|
127
|
-
|
|
128
|
-
|
|
129
|
-
|
|
130
|
-
|
|
149
|
+
def build_ref_index(path)
|
|
150
|
+
Util.info "Indexing references and amendments..."
|
|
151
|
+
ref_map = {}
|
|
152
|
+
amend_map = Hash.new { |h, k| h[k] = [] }
|
|
153
|
+
date_map = {}
|
|
154
|
+
File.foreach(path, encoding: "UTF-8") do |line|
|
|
155
|
+
rec = JSON.parse(line)
|
|
156
|
+
id = rec["id"]
|
|
157
|
+
ref = normalize_reference(rec["reference"])
|
|
158
|
+
next unless ref
|
|
159
|
+
|
|
160
|
+
ref_map[id] = ref if id
|
|
161
|
+
pub_date = rec["publicationDate"]
|
|
162
|
+
date_map[ref] = pub_date if pub_date && !pub_date.empty?
|
|
163
|
+
if rec["supplementType"] && (base = amend_base(ref))
|
|
164
|
+
amend_map[base] << ref
|
|
165
|
+
end
|
|
166
|
+
rescue JSON::ParserError
|
|
167
|
+
next
|
|
131
168
|
end
|
|
169
|
+
Util.info "Indexed #{ref_map.size} references; " \
|
|
170
|
+
"#{amend_map.values.sum(&:size)} amendments across #{amend_map.size} bases; " \
|
|
171
|
+
"#{date_map.size} publication dates."
|
|
172
|
+
[ref_map, amend_map, date_map]
|
|
132
173
|
end
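The line-by-line indexing pass in `build_ref_index` can be reproduced on a small file. This is a reduced sketch: it builds only the id → reference map, the helper names `build_ref_map` and `demo_ref_map` are hypothetical, and the sample records are invented. The `id` and `reference` field names and the `Withdrawn ` prefix rule come from the diff.

```ruby
require "json"
require "tempfile"

# Build id => reference from a JSONL file, skipping malformed lines, records
# without a usable reference, and "Withdrawn " stub records.
def build_ref_map(path)
  ref_map = {}
  File.foreach(path, encoding: "UTF-8") do |line|
    rec = JSON.parse(line)
    ref = rec["reference"]
    next if ref.nil? || ref.empty? || ref.start_with?("Withdrawn ")

    ref_map[rec["id"]] = ref if rec["id"]
  rescue JSON::ParserError
    next
  end
  ref_map
end

# Write three sample lines (one valid, one malformed, one stub) and index them.
def demo_ref_map
  Tempfile.create(["sample", ".jsonl"]) do |f|
    f.puts({ "id" => 1, "reference" => "ISO 99999:2020" }.to_json)
    f.puts "not json"
    f.puts({ "id" => 2, "reference" => "Withdrawn ISO 99998" }.to_json)
    f.flush
    build_ref_map(f.path)
  end
end

demo_ref_map # => {1=>"ISO 99999:2020"}
```

`File.foreach` reads one line at a time, so memory use stays flat regardless of how many records the feed contains.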
|
|
133
174
|
|
|
134
|
-
def
|
|
135
|
-
|
|
136
|
-
|
|
137
|
-
|
|
138
|
-
|
|
175
|
+
def amend_base(ref)
|
|
176
|
+
pubid = ::Pubid::Iso::Identifier.parse(ref)
|
|
177
|
+
return nil unless pubid.respond_to?(:base) && pubid.base
|
|
178
|
+
|
|
179
|
+
pubid.base.to_s
|
|
180
|
+
rescue StandardError
|
|
181
|
+
nil
|
|
139
182
|
end
|
|
140
183
|
|
|
141
|
-
#
|
|
142
|
-
#
|
|
143
|
-
#
|
|
144
|
-
|
|
145
|
-
|
|
146
|
-
|
|
147
|
-
|
|
148
|
-
|
|
149
|
-
doc = Scraper.parse_page docpath, errors: @errors
|
|
150
|
-
mutex.synchronize { save_doc doc, docpath }
|
|
151
|
-
rescue StandardError => e
|
|
152
|
-
Util.warn "Fail fetching document: #{url(docpath)}\n#{e.message}\n#{e.backtrace}"
|
|
184
|
+
# Open Data emits stub records for deleted/abandoned projects with a
|
|
185
|
+
# "Withdrawn" publisher prefix. They have no publicationDate, no edition,
|
|
186
|
+
# and sit on stage *.98 (deleted). Skip them entirely.
|
|
187
|
+
def normalize_reference(ref)
|
|
188
|
+
return nil if ref.nil? || ref.empty?
|
|
189
|
+
return nil if ref.start_with?("Withdrawn ")
|
|
190
|
+
|
|
191
|
+
ref
|
|
153
192
|
end
|
|
154
193
|
|
|
155
|
-
|
|
156
|
-
|
|
157
|
-
|
|
194
|
+
def ingestable?(ref)
|
|
195
|
+
!ref.nil? && !ref.empty? && !ref.start_with?("Withdrawn ")
|
|
196
|
+
end
|
|
197
|
+
|
|
198
|
+
def build_tc_index
|
|
199
|
+
Util.info "Indexing technical committees..."
|
|
200
|
+
path = download_tc_dataset
|
|
201
|
+
map = {}
|
|
202
|
+
File.foreach(path, encoding: "UTF-8") do |line|
|
|
203
|
+
rec = JSON.parse(line)
|
|
204
|
+
ref = rec["reference"]
|
|
205
|
+
title = rec["title"]
|
|
206
|
+
map[ref] = title if ref && title.is_a?(Hash)
|
|
207
|
+
rescue JSON::ParserError
|
|
208
|
+
next
|
|
209
|
+
end
|
|
210
|
+
Util.info "Indexed #{map.size} committees."
|
|
211
|
+
map
|
|
212
|
+
end
|
|
213
|
+
|
|
214
|
+
def ingest_records(path, ref_index, tc_index, amend_index = {}, date_index = {})
|
|
215
|
+
Util.info "Parsing records..."
|
|
216
|
+
count = 0
|
|
217
|
+
File.foreach(path, encoding: "UTF-8") do |line|
|
|
218
|
+
rec = JSON.parse(line)
|
|
219
|
+
next unless ingestable?(rec["reference"])
|
|
220
|
+
|
|
221
|
+
fetch_pub(rec, ref_index, tc_index, amend_index, date_index)
|
|
222
|
+
count += 1
|
|
223
|
+
Util.info "Processed #{count} records..." if (count % 5_000).zero?
|
|
224
|
+
rescue StandardError => e
|
|
225
|
+
Util.warn "Failed record `#{rec && rec['reference']}`: #{e.message}"
|
|
226
|
+
end
|
|
227
|
+
Util.info "Finished: #{count} records."
|
|
228
|
+
end
|
|
158
229
|
|
|
159
|
-
|
|
160
|
-
|
|
161
|
-
#
|
|
162
|
-
# @param [RelatonIsoBib::IsoBibliographicItem] doc document
|
|
163
|
-
#
|
|
164
|
-
# @return [void]
|
|
165
|
-
#
|
|
166
|
-
def save_doc(doc, docpath) # rubocop:disable Metrics/AbcSize,Metrics/MethodLength
|
|
230
|
+
def fetch_pub(rec, ref_index, tc_index = {}, amend_index = {}, date_index = {})
|
|
231
|
+
doc = DataParser.new(rec, ref_index, @errors, tc_index, amend_index, date_index).parse
|
|
167
232
|
docid = doc.docidentifier.detect(&:primary)
|
|
168
|
-
|
|
233
|
+
return unless docid
|
|
234
|
+
|
|
235
|
+
file = output_file(docid.content.to_s)
|
|
169
236
|
if File.exist?(file)
|
|
170
|
-
rewrite_with_same_or_newer
|
|
237
|
+
rewrite_with_same_or_newer(doc, docid, file)
|
|
171
238
|
else
|
|
172
|
-
write_file
|
|
239
|
+
write_file(file, doc, docid)
|
|
173
240
|
end
|
|
174
|
-
iso_queue.move_last docpath
|
|
175
241
|
end
|
|
176
242
|
|
|
177
|
-
def rewrite_with_same_or_newer(doc, docid, file
|
|
178
|
-
|
|
179
|
-
if edition_greater?(doc,
|
|
180
|
-
write_file
|
|
181
|
-
elsif @files.include?(file) && !edition_greater?(
|
|
182
|
-
Util.warn "Duplicate file `#{file}` for `#{docid.content}`
|
|
243
|
+
def rewrite_with_same_or_newer(doc, docid, file)
|
|
244
|
+
existing = Item.from_yaml(File.read(file, encoding: "UTF-8"))
|
|
245
|
+
if edition_greater?(doc, existing) || replace_substage98?(doc, existing)
|
|
246
|
+
write_file(file, doc, docid)
|
|
247
|
+
elsif @files.include?(file) && !edition_greater?(existing, doc)
|
|
248
|
+
Util.warn "Duplicate file `#{file}` for `#{docid.content}`"
|
|
183
249
|
end
|
|
184
250
|
end
|
|
185
251
|
|
|
@@ -187,35 +253,38 @@ module Relaton
|
|
|
187
253
|
doc.edition && bib.edition && doc.edition.content.to_i > bib.edition.content.to_i
|
|
188
254
|
end
|
|
189
255
|
|
|
190
|
-
def replace_substage98?(doc, bib)
|
|
256
|
+
def replace_substage98?(doc, bib)
|
|
191
257
|
doc.edition&.content == bib.edition&.content &&
|
|
192
258
|
(doc.status&.substage&.content != "98" || bib.status&.substage&.content == "98")
|
|
193
259
|
end
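The `replace_substage98?` condition is easiest to read as a truth table: for equal editions, an incoming record replaces the stored one unless the incoming record is on substage 98 while the stored one is not. The sketch below copies the predicate and feeds it stand-in objects. `ContentStub`, `StatusStub`, `DocStub` and `build_doc` are hypothetical helpers that model only the attributes the predicate reads.

```ruby
# Hypothetical stand-ins for the edition/status objects used by the predicate.
ContentStub = Struct.new(:content)
StatusStub  = Struct.new(:substage)
DocStub     = Struct.new(:edition, :status)

def build_doc(edition, substage)
  DocStub.new(ContentStub.new(edition), StatusStub.new(ContentStub.new(substage)))
end

# Predicate body as it appears in the diff.
def replace_substage98?(doc, bib)
  doc.edition&.content == bib.edition&.content &&
    (doc.status&.substage&.content != "98" || bib.status&.substage&.content == "98")
end

replace_substage98?(build_doc("1", "60"), build_doc("1", "98")) # => true
replace_substage98?(build_doc("1", "98"), build_doc("1", "60")) # => false
replace_substage98?(build_doc("1", "98"), build_doc("1", "98")) # => true
replace_substage98?(build_doc("2", "60"), build_doc("1", "60")) # => false
```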
|
|
194
260
|
|
|
195
261
|
def write_file(file, doc, docid)
|
|
196
262
|
@files << file
|
|
197
|
-
index.add_or_update
|
|
198
|
-
File.write
|
|
263
|
+
index.add_or_update(docid.pubid || docid.content.to_s, file)
|
|
264
|
+
File.write(file, serialize(doc), encoding: "UTF-8")
|
|
199
265
|
end
|
|
200
266
|
|
|
201
|
-
|
|
267
|
+
# --- static merge -----------------------------------------------------
|
|
202
268
|
|
|
203
|
-
def
|
|
269
|
+
def merge_static_files
|
|
270
|
+
return unless Dir.exist?("static")
|
|
204
271
|
|
|
205
|
-
|
|
272
|
+
Dir["static/**/*.yaml"].each do |f|
|
|
273
|
+
item = Item.from_yaml(File.read(f, encoding: "UTF-8"))
|
|
274
|
+
did = item.docidentifier.detect(&:primary)
|
|
275
|
+
next unless did
|
|
206
276
|
|
|
207
|
-
|
|
208
|
-
# Create thread worker
|
|
209
|
-
#
|
|
210
|
-
# @return [Thread] thread
|
|
211
|
-
#
|
|
212
|
-
def thread
|
|
213
|
-
Thread.new do
|
|
214
|
-
while (path = queue.pop) != :END
|
|
215
|
-
yield path
|
|
216
|
-
end
|
|
277
|
+
index.add_or_update(did.pubid || did.content.to_s, f)
|
|
217
278
|
end
|
|
218
279
|
end
|
|
280
|
+
|
|
281
|
+
# --- serialization ---------------------------------------------------
|
|
282
|
+
|
|
283
|
+
def to_yaml(doc) = doc.to_yaml
|
|
284
|
+
|
|
285
|
+
def to_xml(doc) = doc.to_xml(bibxml: true)
|
|
286
|
+
|
|
287
|
+
def to_bibxml(doc) = doc.to_rfcxml
|
|
219
288
|
end
|
|
220
289
|
end
|
|
221
290
|
end
|