relaton-iso 2.1.1 → 2.1.2

Files changed (17)

checksums.yaml +4 -4
data/CLAUDE.md +5 -2
data/README.adoc +21 -1
data/lib/relaton/iso/bibliography.rb +11 -1
data/lib/relaton/iso/data_fetcher.rb +220 -151
data/lib/relaton/iso/data_parser.rb +443 -0
data/lib/relaton/iso/model/docidentifier.rb +7 -2
data/lib/relaton/iso/processor.rb +8 -5
data/lib/relaton/iso/type/pubid.rb +50 -0
data/lib/relaton/iso/version.rb +1 -1
metadata +5 -12
data/grammars/basicdoc.rng +0 -2140
data/grammars/biblio-standoc.rng +0 -268
data/grammars/biblio.rng +0 -2125
data/grammars/relaton-iso-compile.rng +0 -11
data/grammars/relaton-iso.rng +0 -165
data/lib/relaton/iso/queue.rb +0 -63

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 1996226d20bb1e528b2a5a2cabce6446bcddc1036b0c00f40da8b56d39b57142
-  data.tar.gz: abe31e6602e8846f4154ced3900cd289c3b11fa2ce1ece03cb1cc325590736c6
+  metadata.gz: 767dfed024aec3fc3c96c2322ef0fe9514fdd99d7ea5602c9caa878ba9ff95f6
+  data.tar.gz: 427f2b4fb8c58791acd025fcaa7bc6ba7d1928314c6ca9e04ea583d3063247e8
 SHA512:
-  metadata.gz: fb51cc0479e8a79c2e395d98a119d19aa75c40ce046c80c64151ac3620560dd408d0a78902de1e8b8be5c6531da6649284fc6e0cb97fb5ab7860132713559716
-  data.tar.gz: 4b70e291c451f833abc4c97346398638c09df0658c6fa7b7a7eadf318f0d0b80d5e08c0dd1254b17837af5195e1f3029fd3efff2d35a8b3207a1de0275320dad
+  metadata.gz: a99c1fb6fd7ed6cd9f11784d1851353224f9af378705ed3cfcbafd4a20d1eb40f33036521bdcabebe4ab720a6d628ab01a41af914652e3effbe0ba4d5176882f
+  data.tar.gz: 52c952079694c2f0b43a23a9eb767442c53c2fd32183317616eafff960c14e8b997a6819b407933c1d584e15f690fc531b4c8fea7a4dc7add4bac8eb6a237e90

data/CLAUDE.md CHANGED

@@ -20,8 +20,11 @@ relaton-iso retrieves ISO standard bibliographic data. The core retrieval flow:
 2. **HitCollection** (`lib/relaton/iso/hit_collection.rb`) — searches a pre-built YAML index (`index-v1.zip` from relaton-data-iso) using `Relaton::Index`. Matches on `id_keys`: publisher, number, copublisher, part, year, edition, type, stage, iteration. Returns sorted Hit array.
 3. **Hit** (`lib/relaton/iso/hit.rb`) — wraps an index result. The `item` attribute lazy-loads the full document from GitHub raw content (relaton-data-iso repo). `sort_weight` prioritizes published over withdrawn/deleted.
 4. **ItemData** / **Model::Item** — ISO-specific bibliographic item extending `Relaton::Bib::ItemData`.
-5. **Scraper** (`lib/relaton/iso/scraper.rb`) — parses ISO website pages for metadata (used by DataFetcher for bulk operations, not the normal lookup path).
-6. **DataFetcher** (`lib/relaton/iso/data_fetcher.rb`) — bulk fetches from ISO.org ICS pages using 3 threads with a persistent queue for resumability.
+5. **Scraper** (`lib/relaton/iso/scraper.rb`) — parses individual ISO website pages. Used only by `Bibliography.get` as a fallback when an item is missing from the curated index; no longer drives bulk ingest.
+6. **DataFetcher** (`lib/relaton/iso/data_fetcher.rb`) — streams the ISO Open Data programme JSONL feeds (`iso_deliverables_metadata.jsonl` for documents, `iso_technical_committees.jsonl` for committee titles) and writes one YAML per primary docid into `@output`. Short-circuits on upstream `Last-Modified`; falls back to a full pass when `data/` or `index-v1.yaml` is missing. Two source modes:
+   - `iso-open-data` (default) — incremental, skip when upstream is unchanged.
+   - `iso-open-data-all` — wipe `@output` and re-emit every record.
+7. **DataParser** (`lib/relaton/iso/data_parser.rb`) — converts one Open Data record (`Hash`) into a `Relaton::Iso::ItemData`. Takes a `ref_index` (id → reference) for resolving `replaces`/`replacedBy` and a `tc_index` (reference → `{ "en"/"fr" => title }`) for resolving committee labels.
 Key dependency: `pubid-iso` gem handles ISO publication identifier parsing and comparison.
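
The skip-or-run decision that item 6 describes for the two source modes can be sketched as a pure function. This is only an illustration under stated assumptions: `skip_run?` and its keyword arguments are hypothetical names, not part of the gem, and the real fetcher reads the stored stamp from `last_modified.txt` and checks the output tree itself.

```ruby
# Hypothetical helper (not part of relaton-iso): decide whether an
# ingest run can be skipped, following the rules described above.
def skip_run?(mode:, upstream:, stored:, output_populated:)
  return false if mode == "iso-open-data-all"  # full refresh never skips
  return false if upstream.nil? || stored.nil? # nothing to compare against
  return false unless output_populated         # data/ or the index is missing

  # Compare the stored stamp with the upstream Last-Modified header,
  # ignoring surrounding whitespace such as a trailing newline.
  stored.strip == upstream.strip
end

stamp = "Mon, 01 Jan 2024 00:00:00 GMT" # invented value for illustration
puts skip_run?(mode: "iso-open-data", upstream: stamp,
               stored: "#{stamp}\n", output_populated: true)
puts skip_run?(mode: "iso-open-data-all", upstream: stamp,
               stored: stamp, output_populated: true)
```

Keeping the decision free of file and network access makes each of the fallback conditions easy to check in isolation.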

data/README.adoc CHANGED

@@ -352,6 +352,26 @@ item.source
   @type="rss">]
 ----
+[[bulk-data-ingest]]
+=== Bulk data ingest
+The curated dataset under https://github.com/relaton/relaton-data-iso[relaton-data-iso] is rebuilt daily by `Relaton::Iso::DataFetcher`, which streams the
+https://www.iso.org/open-data.html[ISO Open Data programme] JSONL feeds — `iso_deliverables_metadata.jsonl` for documents (~80,000 records) and `iso_technical_committees.jsonl` for committee titles — and writes one YAML per primary docid.
+Two source modes are exposed (also reachable via `relaton-cli`'s dataset list and the GitHub Actions workflow input):
+[source,ruby]
+----
+# Incremental: skip the run if upstream `Last-Modified` matches the local
+# `last_modified.txt`. Falls back to a full pass when `data/` or
+# `index-v1.yaml` is missing.
+Relaton::Iso::DataFetcher.fetch("iso-open-data", output: "data", format: "yaml")
+# Full refresh: wipe `output` and re-emit every record. Use when the local
+# tree is suspect or after a parser change that affects emitted YAML.
+Relaton::Iso::DataFetcher.fetch("iso-open-data-all", output: "data", format: "yaml")
+----
 === Logging
 RelatonIso uses the relaton-logger gem for logging. By default, it logs to STDOUT. To change the log levels and add other loggers, read the https://github.com/relaton/relaton-logger#usage[relaton-logger] documentation.
@@ -367,7 +387,7 @@ To install this gem onto your local machine, run `bundle exec rake install`. To
 == Exceptional Citations
-This gem retrieves bibliographic descriptions of ISO documents by doing searches on the ISO website, http://www.iso.org, and screenscraping the document that matches the queried document identifier. The following documents are not returned as search results from the ISO website, and the gem returns manually generated references to them.
+Single-document lookups via `Bibliography.get` first consult the curated relaton-data-iso index (regenerated daily from the ISO Open Data programme — see <<bulk-data-ingest>>) and fall back to scraping individual pages on http://www.iso.org for items not yet indexed. The following documents are not returned as search results from the ISO website, and the gem returns manually generated references to them.
 * `IEV`: used in the metanorma-iso gem to reference Electropedia entries generically. Is resolved to an "all parts" reference to IEC 60050, which in turn is resolved into the specific documents cited by their top-level clause.
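
The "Bulk data ingest" section added in this hunk says the fetcher streams JSONL feeds line by line. A minimal sketch of that reading pattern follows, assuming a tiny in-memory feed: the field names `id` and `reference` appear in the diff, but the sample values and the `build_ref_map` helper are invented for illustration.

```ruby
require "json"
require "stringio"

# Invented sample feed: one JSON object per line, plus one malformed line.
FEED = <<~JSONL
  {"id": 1, "reference": "ISO 1234:2020"}
  {"id": 2, "reference": "Withdrawn ISO 5678"}
  not valid json
  {"id": 3, "reference": "ISO 1234:2020/Amd 1:2022"}
JSONL

# Hypothetical helper: build an id => reference map one line at a time,
# so the whole feed never has to be held in memory as parsed objects.
def build_ref_map(io)
  io.each_line.with_object({}) do |line, map|
    rec = JSON.parse(line)
    ref = rec["reference"]
    # Skip empty references and "Withdrawn " stub records.
    next if ref.nil? || ref.empty? || ref.start_with?("Withdrawn ")

    map[rec["id"]] = ref if rec["id"]
  rescue JSON::ParserError
    next # skip malformed lines instead of aborting the pass
  end
end

map = build_ref_map(StringIO.new(FEED))
puts map.size
```

With a real feed, `File.foreach(path, encoding: "UTF-8")` would replace the `StringIO`, as the fetcher in this diff does.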

data/lib/relaton/iso/bibliography.rb CHANGED

@@ -186,10 +186,16 @@ module Relaton
       end
       # Extract year from a hit as an integer.
+      #
+      # Amendments, corrigendums and supplements carry no year on their own
+      # identifier; the year lives on the underlying standard reachable via
+      # `root` (which walks the full base chain, however deeply nested). Fall
+      # back to it so a date filter does not drop such references (issue #181).
+      #
       # @param hit [Relaton::Iso::Hit]
       # @return [Integer]
       def hit_year(hit)
-        yr = hit.pubid&.year || hit.hit[:year]
+        yr = hit.pubid&.year || hit.hit[:year] || hit.pubid&.root&.year
         yr.to_i
       end
@@ -236,6 +242,10 @@ module Relaton
       # @return [Relaton::Iso::ItemData, nil]
       def fetch_and_check_date(hit, pubid, opts)
         ret = hit.item
+        # A data file that fails to load (e.g. the index references a file that
+        # 404s) yields an item with no docidentifier; skip it rather than crash.
+        return unless ret&.docidentifier&.first
         if publication_date_in_range?(ret, opts)
           Util.info "Found: `#{ret.docidentifier.first.content}`", key: pubid.to_s
           ret
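
The new fallback chain in `hit_year` can be exercised in isolation. The sketch below copies the method body from the hunk above and uses stand-in structs: `FakePubid` and `FakeHit` are hypothetical substitutes for the pubid-iso identifier and `Relaton::Iso::Hit`, modelling only the `year`, `root`, `pubid` and `hit` members the method touches.

```ruby
# Hypothetical stand-ins for the real identifier and hit objects.
FakePubid = Struct.new(:year, :root)
FakeHit = Struct.new(:pubid, :hit)

# Same fallback order as the changed method: the identifier's own year,
# then the index entry's year, then the year of the root standard.
def hit_year(hit)
  yr = hit.pubid&.year || hit.hit[:year] || hit.pubid&.root&.year
  yr.to_i
end

base = FakePubid.new(2015, nil)
amendment = FakeHit.new(FakePubid.new(nil, base), {}) # no year of its own
indexed = FakeHit.new(nil, { year: "2019" })          # year only in the index
unknown = FakeHit.new(nil, {})                        # no year anywhere

puts hit_year(amendment)
puts hit_year(indexed)
puts hit_year(unknown)
```

Because `nil.to_i` is `0`, a hit with no year anywhere yields `0` rather than raising.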

data/lib/relaton/iso/data_fetcher.rb CHANGED

@@ -1,185 +1,251 @@
+require "fileutils"
+require "json"
+require "net/http"
+require "tmpdir"
 require_relative "../iso"
-require_relative "queue"
-require_relative "scraper"
+require_relative "data_parser"
 module Relaton
   module Iso
-    # Fetch all the documents from ISO website.
+    #
+    # Fetch ISO documents from the ISO Open Data programme bulk JSONL
+    # (see https://www.iso.org/open-data.html) and write each one as a YAML
+    # file under `@output`.
+    #
+    # `source` modes (matching the `Relaton::Core::DataFetcher.fetch` arg):
+    #
+    # * `"iso-open-data"` (default) - skip the run if the upstream
+    #   `Last-Modified` header matches `LAST_MODIFIED_FILE`.
+    # * `"iso-open-data-all"` - clear `@output` and re-emit every record.
+    #
     class DataFetcher < Core::DataFetcher
-      #
-      # The queue is used to store the ICS page paths beeing fetching in the current run.
-      #
-      # @return [Queue] queue
-      #
-      def queue
-        @queue ||= ::Queue.new
-      end
-      def mutex
-        @mutex ||= Mutex.new
-      end
+      OPEN_DATA_URL = "https://isopublicstorageprod.blob.core.windows.net/" \
+                      "opendata/_latest/iso_deliverables_metadata/json/" \
+                      "iso_deliverables_metadata.jsonl".freeze
+      TC_DATA_URL = "https://isopublicstorageprod.blob.core.windows.net/" \
+                    "opendata/_latest/iso_technical_committees/json/" \
+                    "iso_technical_committees.jsonl".freeze
+      LAST_MODIFIED_FILE = "last_modified.txt".freeze
+      MAX_DOWNLOAD_RETRIES = 4
+      RETRY_BACKOFF_BASE = 30
       def log_error(msg)
         Util.error msg
       end
       def index
-        @index ||= Relaton::Index.find_or_create :iso, file: "#{INDEXFILE}.yaml"
-      end
-      #
-      # ISO has too many docs. GHA can't get them all in one run.
-      # So, we need to split the process into several runs.
-      # The iso_queue is used to store the doc paths that have not been fetched.
-      #
-      # @return [Relaton::Iso::Queue] queue
-      #
-      def iso_queue
-        @iso_queue ||= Relaton::Iso::Queue.new
-      end
-      #
-      # Go through all ICS and fetch all documents.
-      #
-      # @return [void]
-      #
-      def fetch # rubocop:disable Metrics/AbcSize
-        Util.info "Scrapping ICS pages..."
-        fetch_ics
-        Util.info "(#{Time.now}) Scrapping documents..."
-        fetch_docs
-        iso_queue.save
-        # index.sort! { |a, b| compare_docids a, b }
+        @index ||= Relaton::Index.find_or_create(
+          :iso, file: "#{INDEXFILE}.yaml", pubid_class: ::Pubid::Iso::Identifier,
+        )
+      end
+      def fetch(source = nil)
+        @source = source || "iso-open-data"
+        @full_refresh = @source == "iso-open-data-all"
+        Util.info "Fetching ISO Open Data (mode: #{@source})..."
+        last_modified = fetch_last_modified
+        return if up_to_date?(last_modified)
+        prepare_output
+        jsonl_path = download_dataset
+        ref_index, amend_index, date_index = build_ref_index(jsonl_path)
+        tc_index = build_tc_index
+        ingest_records(jsonl_path, ref_index, tc_index, amend_index, date_index)
+        merge_static_files
         index.save
+        save_last_modified(last_modified)
         report_errors
+      rescue StandardError => e
+        Util.error "#{e.message}\n#{e.backtrace.join("\n")}"
+        raise
       end
       private
-      #
-      # Fetch ICS page recursively and store all the links to documents in the iso_queue.
-      #
-      # @param [String] path path to ICS page
-      #
-      def fetch_ics
-        threads = Array.new(3) { thread { |path| fetch_ics_page(path) } }
-        fetch_ics_page "/standards-catalogue/browse-by-ics.html"
-        sleep(1) until queue.empty?
-        threads.size.times { queue << :END }
-        threads.each(&:join)
-      end
-      def fetch_ics_page(path)
-        resp = get_redirection path
-        unless resp
-          Util.error "Failed fetching ICS page #{url(path)}"
-          return
+      # --- HTTP / state -----------------------------------------------------
+      def fetch_last_modified
+        uri = URI(OPEN_DATA_URL)
+        resp = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
+          http.request(Net::HTTP::Head.new(uri.request_uri))
+        end
+        resp["last-modified"]
+      end
+      def up_to_date?(last_modified)
+        return false if @full_refresh || last_modified.nil?
+        return false unless File.exist?(LAST_MODIFIED_FILE)
+        return false unless output_populated?
+        if File.read(LAST_MODIFIED_FILE, encoding: "UTF-8").strip == last_modified.strip
+          Util.info "ISO Open Data is up to date (Last-Modified: #{last_modified}); nothing to do."
+          true
+        else
+          false
         end
+      end
+      # Guard against an external wipe (or a fresh checkout) — if the YAML tree
+      # or the index file is gone, force a refresh instead of trusting
+      # `LAST_MODIFIED_FILE`.
+      def output_populated?
+        return false unless Dir.exist?(@output)
+        return false unless File.exist?("#{INDEXFILE}.yaml")
-        page = Nokogiri::HTML(resp.body)
-        parse_doc_links page
-        parse_ics_links page
+        Dir.children(@output).any? { |f| f.end_with?(".yaml") }
       end
-      def parse_doc_links(page)
-        doc_links = page.xpath "//td[@data-title='Standard and/or project']/div/div/a"
-        @errors[:doc_links] &&= doc_links.empty?
-        doc_links.each { |item| iso_queue.add_first item[:href].split("?").first }
+      def save_last_modified(last_modified)
+        return unless last_modified
+        File.write(LAST_MODIFIED_FILE, last_modified, encoding: "UTF-8")
+      end
+      def prepare_output
+        FileUtils.rm_rf(@output) if @full_refresh
+        FileUtils.mkdir_p(@output)
       end
-      def parse_ics_links(page)
-        ics_links = page.xpath("//td[@data-title='ICS']/a")
-        @errors[:ics_links] &&= ics_links.empty?
-        ics_links.each { |item| queue << item[:href] }
+      def download_dataset
+        download_jsonl(OPEN_DATA_URL, "iso_deliverables_metadata.jsonl")
       end
-      def url(path)
-        Scraper::DOMAIN + path
+      def download_tc_dataset
+        download_jsonl(TC_DATA_URL, "iso_technical_committees.jsonl")
       end
-      #
-      # Get the page from the given path. If the page is redirected, get the
-      # page from the new path.
-      #
-      # @param [String] path path to the page
-      #
-      # @return [Net::HTTPOK, nil] HTTP response
-      #
-      def get_redirection(path) # rubocop:disable Metrics/MethodLength
-        try = 0
-        uri = URI url(path)
+      def download_jsonl(url, filename)
+        path = File.join(Dir.tmpdir, filename)
+        Util.info "Downloading #{url}..."
+        uri = URI(url)
+        attempt = 0
         begin
-          get_response uri
-        rescue Net::OpenTimeout, Net::ReadTimeout, Errno::ECONNREFUSED => e
-          try += 1
-          retry if check_try try, uri
+          File.open(path, "wb") do |f|
+            Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
+              http.request_get(uri.request_uri) do |resp|
+                raise "Open Data download failed: HTTP #{resp.code}" unless resp.code == "200"
-          Util.warn "Failed fetching #{uri}, #{e.message}"
+                resp.read_body { |chunk| f.write(chunk) }
+              end
+            end
+          end
+        rescue StandardError => e
+          attempt += 1
+          raise if attempt > MAX_DOWNLOAD_RETRIES
+          delay = RETRY_BACKOFF_BASE * (2**(attempt - 1))
+          Util.warn "Download attempt #{attempt}/#{MAX_DOWNLOAD_RETRIES} failed (#{e.message}). Retrying in #{delay}s..."
+          sleep delay
+          retry
         end
+        Util.info "Downloaded #{File.size(path) / 1024 / 1024} MB to #{path}."
+        path
       end
-      def get_response(uri)
-        resp = Net::HTTP.get_response(uri)
-        resp.code == "302" ? get_redirection(resp["location"]) : resp
-      end
+      # --- ingestion --------------------------------------------------------
-      def check_try(try, uri)
-        if try < 3
-          Util.warn "Timeout fetching #{uri}, retrying..."
-          sleep 1
-          true
+      def build_ref_index(path)
+        Util.info "Indexing references and amendments..."
+        ref_map = {}
+        amend_map = Hash.new { |h, k| h[k] = [] }
+        date_map = {}
+        File.foreach(path, encoding: "UTF-8") do |line|
+          rec = JSON.parse(line)
+          id = rec["id"]
+          ref = normalize_reference(rec["reference"])
+          next unless ref
+          ref_map[id] = ref if id
+          pub_date = rec["publicationDate"]
+          date_map[ref] = pub_date if pub_date && !pub_date.empty?
+          if rec["supplementType"] && (base = amend_base(ref))
+            amend_map[base] << ref
+          end
+        rescue JSON::ParserError
+          next
         end
+        Util.info "Indexed #{ref_map.size} references; " \
+                  "#{amend_map.values.sum(&:size)} amendments across #{amend_map.size} bases; " \
+                  "#{date_map.size} publication dates."
+        [ref_map, amend_map, date_map]
       end
-      def fetch_docs
-        threads = Array.new(3) { thread { |path| fetch_doc(path) } }
-        iso_queue[0..10_000].each { |docpath| queue << docpath }
-        threads.size.times { queue << :END }
-        threads.each(&:join)
+      def amend_base(ref)
+        pubid = ::Pubid::Iso::Identifier.parse(ref)
+        return nil unless pubid.respond_to?(:base) && pubid.base
+        pubid.base.to_s
+      rescue StandardError
+        nil
       end
-      #
-      # Fetch document from ISO website.
-      #
-      # @param [String] docpath document page path
-      #
-      # @return [void]
-      #
-      def fetch_doc(docpath)
-        doc = Scraper.parse_page docpath, errors: @errors
-        mutex.synchronize { save_doc doc, docpath }
-      rescue StandardError => e
-        Util.warn "Fail fetching document: #{url(docpath)}\n#{e.message}\n#{e.backtrace}"
+      # Open Data emits stub records for deleted/abandoned projects with a
+      # "Withdrawn" publisher prefix. They have no publicationDate, no edition,
+      # and sit on stage *.98 (deleted). Skip them entirely.
+      def normalize_reference(ref)
+        return nil if ref.nil? || ref.empty?
+        return nil if ref.start_with?("Withdrawn ")
+        ref
       end
-      # def compare_docids(id1, id2)
-      #   Pubid::Iso::Identifier.create(**id1).to_s <=> Pubid::Iso::Identifier.create(**id2).to_s
-      # end
+      def ingestable?(ref)
+        !ref.nil? && !ref.empty? && !ref.start_with?("Withdrawn ")
+      end
+      def build_tc_index
+        Util.info "Indexing technical committees..."
+        path = download_tc_dataset
+        map = {}
+        File.foreach(path, encoding: "UTF-8") do |line|
+          rec = JSON.parse(line)
+          ref = rec["reference"]
+          title = rec["title"]
+          map[ref] = title if ref && title.is_a?(Hash)
+        rescue JSON::ParserError
+          next
+        end
+        Util.info "Indexed #{map.size} committees."
+        map
+      end
+      def ingest_records(path, ref_index, tc_index, amend_index = {}, date_index = {})
+        Util.info "Parsing records..."
+        count = 0
+        File.foreach(path, encoding: "UTF-8") do |line|
+          rec = JSON.parse(line)
+          next unless ingestable?(rec["reference"])
+          fetch_pub(rec, ref_index, tc_index, amend_index, date_index)
+          count += 1
+          Util.info "Processed #{count} records..." if (count % 5_000).zero?
+        rescue StandardError => e
+          Util.warn "Failed record `#{rec && rec['reference']}`: #{e.message}"
+        end
+        Util.info "Finished: #{count} records."
+      end
-      #
-      # save document to file.
-      #
-      # @param [RelatonIsoBib::IsoBibliographicItem] doc document
-      #
-      # @return [void]
-      #
-      def save_doc(doc, docpath) # rubocop:disable Metrics/AbcSize,Metrics/MethodLength
+      def fetch_pub(rec, ref_index, tc_index = {}, amend_index = {}, date_index = {})
+        doc = DataParser.new(rec, ref_index, @errors, tc_index, amend_index, date_index).parse
         docid = doc.docidentifier.detect(&:primary)
-        file = output_file docid.content.to_s
+        return unless docid
+        file = output_file(docid.content.to_s)
         if File.exist?(file)
-          rewrite_with_same_or_newer doc, docid, file, docpath
+          rewrite_with_same_or_newer(doc, docid, file)
         else
-          write_file file, doc, docid
+          write_file(file, doc, docid)
         end
-        iso_queue.move_last docpath
       end
-      def rewrite_with_same_or_newer(doc, docid, file, docpath)
-        bib = Item.from_yaml File.read(file, encoding: "UTF-8")
-        if edition_greater?(doc, bib) || replace_substage98?(doc, bib)
-          write_file file, doc, docid
-        elsif @files.include?(file) && !edition_greater?(bib, doc)
-          Util.warn "Duplicate file `#{file}` for `#{docid.content}` from #{url(docpath)}"
+      def rewrite_with_same_or_newer(doc, docid, file)
+        existing = Item.from_yaml(File.read(file, encoding: "UTF-8"))
+        if edition_greater?(doc, existing) || replace_substage98?(doc, existing)
+          write_file(file, doc, docid)
+        elsif @files.include?(file) && !edition_greater?(existing, doc)
+          Util.warn "Duplicate file `#{file}` for `#{docid.content}`"
         end
       end
@@ -187,35 +253,38 @@ module Relaton
         doc.edition && bib.edition && doc.edition.content.to_i > bib.edition.content.to_i
       end
-      def replace_substage98?(doc, bib) # rubocop:disable Metrics/CyclomaticComplexity,Metrics/PerceivedComplexity
+      def replace_substage98?(doc, bib)
         doc.edition&.content == bib.edition&.content &&
           (doc.status&.substage&.content != "98" || bib.status&.substage&.content == "98")
       end
       def write_file(file, doc, docid)
         @files << file
-        index.add_or_update docid.pubid.to_h, file
-        File.write file, serialize(doc), encoding: "UTF-8"
+        index.add_or_update(docid.pubid || docid.content.to_s, file)
+        File.write(file, serialize(doc), encoding: "UTF-8")
       end
-      def to_yaml(doc) = doc.to_yaml
+      # --- static merge -----------------------------------------------------
-      def to_xml(doc) = doc.to_xml bibxml: true
+      def merge_static_files
+        return unless Dir.exist?("static")
-      def to_bibxml(doc) = doc.to_rfcxml
+        Dir["static/**/*.yaml"].each do |f|
+          item = Item.from_yaml(File.read(f, encoding: "UTF-8"))
+          did = item.docidentifier.detect(&:primary)
+          next unless did
-      #
-      # Create thread worker
-      #
-      # @return [Thread] thread
-      #
-      def thread
-        Thread.new do
-          while (path = queue.pop) != :END
-            yield path
-          end
+          index.add_or_update(did.pubid || did.content.to_s, f)
         end
       end
+      # --- serialization ---------------------------------------------------
+      def to_yaml(doc) = doc.to_yaml
+      def to_xml(doc) = doc.to_xml(bibxml: true)
+      def to_bibxml(doc) = doc.to_rfcxml
     end
   end
 end
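
The retry logic in `download_jsonl` doubles its delay on each failed attempt. The sketch below reproduces only that arithmetic, using the two constants from the diff. `backoff_delays` is a hypothetical helper, not part of the gem, and nothing here sleeps or touches the network.

```ruby
# Constants as defined in the new DataFetcher.
MAX_DOWNLOAD_RETRIES = 4
RETRY_BACKOFF_BASE = 30

# Hypothetical helper: list the delay (in seconds) before each retry,
# using the same formula as the rescue block: base * 2**(attempt - 1).
def backoff_delays(retries: MAX_DOWNLOAD_RETRIES, base: RETRY_BACKOFF_BASE)
  (1..retries).map { |attempt| base * (2**(attempt - 1)) }
end

delays = backoff_delays
puts delays.inspect # [30, 60, 120, 240]
puts delays.sum     # 450
```

Read this way, a download that keeps failing is tried five times in total (the first try plus four retries) and spends up to 450 seconds waiting before the error is re-raised.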