RubyGems - identifiers - Versions diffs - 0.14.0 → 0.15.0 - Mend

identifiers 0.14.0 → 0.15.0

This diff shows the contents of publicly available package versions released to one of the supported registries. It is provided for informational purposes only and reflects the changes between package versions as they appear in their public registries.

Files changed (6)

checksums.yaml +4 -4
data/CHANGELOG.md +8 -0
data/README.md +4 -2
data/lib/identifiers/isbn.rb +53 -16
data/spec/identifiers/isbn_spec.rb +60 -5
metadata +7 -11

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 911c52af8788814413165b25ec5dfdace4cd213906dd193216c720c31cb6d3de
-  data.tar.gz: fb75b4b356a1e87711f7b0d9a820d938d4dbb2d28730c5c8738d105f947cae00
+  metadata.gz: f0cc5842531ab0baf207b1aba207fa1fc89e338b3bb6a36c689346e7b473fdd6
+  data.tar.gz: bba49e3c29eeb837cf62f0dcc20256d024ee2690e65ce1df36d7f619a4afc25f
 SHA512:
-  metadata.gz: 72881f5981cec05c2273e8c76912a8853edaa5b6a0d0cc69a298daedde4037ee98080903fcf9c80f8290a9fe79d7e0ced24ec680dd9f92c7756c9e76c70e3bbe
-  data.tar.gz: 1255cb693b80e63ccd19e2ed0c2c5962319467c6635e8713cf7185e5e6c91dc7f952e6b614f5c1131cef45586c06fff12c4eb38e7803b8bd0cb1dfec8b0134e0
+  metadata.gz: b629b7898e27360f8b0b45c537b735f9e3a2cfaa37cb0c8a4b00b226af0d16c861fefd00c3d630658c38447d05aa2e25dcd6b4f8a52e257762b3ed3469c53b73
+  data.tar.gz: ea4edebb501806152ac4affc90d766bd31b93526606cbd90c64c1c8fa62722da45ca40514e6ade5229878390ab857390d8f4502cb87b27ab6390561e7ec98a1c
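
The values above are hex digests of the two archives packed inside the `.gem` file. As a rough sketch of how one such digest could be checked by hand, assuming the archive bytes are already in memory (the `digest_matches?` helper and the sample data are hypothetical and not part of RubyGems or this gem):

```ruby
# frozen_string_literal: true

require 'digest'

# Hypothetical helper: compare the SHA-256 hex digest of some bytes with an
# expected value, such as one listed under "SHA256:" in checksums.yaml.
def digest_matches?(bytes, expected_hex)
  Digest::SHA256.hexdigest(bytes) == expected_hex.downcase
end

# Sample data only. For a real check, pass File.binread('data.tar.gz') and the
# digest recorded for data.tar.gz.
sample = 'abc'
expected = 'ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad'
puts digest_matches?(sample, expected)
```

The same shape works for the SHA512 entries by swapping in `Digest::SHA512`.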

data/CHANGELOG.md CHANGED

@@ -2,6 +2,14 @@
 All notable changes to this project will be documented in this file. This
 project adheres to [Semantic Versioning](http://semver.org/).
+## [0.15.0] - 2026-06-12
+### Added
+- Extract ISBNs that mix different dash characters, including the U+2212 minus
+  sign, within a single ISBN (#31)
+### Fixed
+- Stop extracting spurious ISBNs from a long run of digits separated by hyphens (#34)
 ## [0.14.0] - 2024-07-30
 ### Added
 - Added optional prefixes argument to ISBNs extraction.
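
The "Added" entry for 0.15.0 comes down to mapping every dash variant onto the ASCII hyphen before matching. A minimal standalone sketch of that idea follows; the constant and method names are illustrative and are not the gem's API, though the character class mirrors the one introduced in `isbn.rb`:

```ruby
# frozen_string_literal: true

# Illustrative names. The class matches any Unicode dash punctuation (\p{Pd})
# or U+2212 MINUS SIGN, minus the ASCII hyphen itself (the && intersection).
NON_ASCII_DASHES = /[[\p{Pd}\u2212]&&[^-]]/.freeze

def normalise_dashes(text)
  text.gsub(NON_ASCII_DASHES, '-')
end

# En dash, minus sign, en dash, en dash: the mix used in the new spec.
mixed = "978\u20133\u2212200\u201301908\u20139"
puts normalise_dashes(mixed)
```

U+2212 has to be listed explicitly because it is classed as a math symbol rather than dash punctuation, so `\p{Pd}` alone would miss it.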

data/README.md CHANGED

@@ -1,4 +1,4 @@
-# Identifiers [![Build Status](https://travis-ci.org/altmetric/identifiers.svg?branch=master)](https://travis-ci.org/altmetric/identifiers) [![Gem Version](https://badge.fury.io/rb/identifiers.svg)](https://badge.fury.io/rb/identifiers)
+# Identifiers [![Gem Version](https://badge.fury.io/rb/identifiers.svg)](https://badge.fury.io/rb/identifiers)
 Collection of utilities related to the extraction, validation and normalization of various scholarly identifiers. The supported list is:
@@ -13,6 +13,8 @@ Collection of utilities related to the extraction, validation and normalization
 - [URNs](https://en.wikipedia.org/wiki/Uniform_Resource_Name)
 - [ORCID identifiers](http://orcid.org/)
+**Supported Ruby versions**: >= 2.7
 ## Installation
 Add this line to your application's `Gemfile`:
@@ -106,6 +108,6 @@ We also maintain [a version of this library for PHP](https://github.com/altmetri
 ## License
-Copyright © 2016-2018 Altmetric LLP
+Copyright © 2016-2024 Altmetric LLP
 Distributed under the [MIT License](http://opensource.org/licenses/MIT).

data/lib/identifiers/isbn.rb CHANGED

@@ -1,6 +1,9 @@
+# frozen_string_literal: true
 module Identifiers
   class ISBN
     ISBN_13_REGEXP = /
+      (?<!\p{Pd})         # Not part of a longer hyphen-separated run of digits
       \b
       (
         97[89]            # ISBN (GS1) Bookland prefix
@@ -12,12 +15,14 @@ module Identifiers
         \d                # Check digit
       )
       \b
-    /x
+      (?!\p{Pd})          # Not part of a longer hyphen-separated run of digits
+    /x.freeze
     ISBN_10_REGEXP = /
       (?<!              # Don't match a hyphenated or spaced ISBN-13
         97[89]
         [\p{Pd}\p{Zs}]
       )
+      (?<!\p{Pd})         # Not part of a longer hyphen-separated run of digits
       \b
       (
         \d{1,5}           # Registration group identifier
@@ -29,7 +34,8 @@ module Identifiers
         [\dX]             # Check digit
       )
       \b
-    /x
+      (?!\p{Pd})          # Not part of a longer hyphen-separated run of digits
+    /x.freeze
     ISBN_A_REGEXP = %r{
       \b
       (?<=10\.) # Directory indicator (always 10)
@@ -38,24 +44,26 @@ module Identifiers
       /         # Prefix/suffix divider
       \d{1,7}   # ISBN title enumerator and check digit
       \b
-    }x
-    TEXT_AFTER_PREFIX_REGEXP = ':?\s*(\d.*)$'.freeze
+    }x.freeze
+    # Dashes other than the ASCII hyphen we normalise to, including U+2212
+    # MINUS SIGN, which is not part of \p{Pd}
+    NON_CANONICAL_DASHES_REGEXP = /[[\p{Pd}−]&&[^-]]/.freeze
-    def self.extract(str , prefixes = [])
-      str = match_strings_with_prefix(str , prefixes) if prefixes.any?
+    def self.extract(str, prefixes = [])
+      str = str.to_s
+      # Normalise dashes to a single ASCII hyphen so one ISBN can mix them
+      str = str.gsub(NON_CANONICAL_DASHES_REGEXP, '-') if str.match?(NON_CANONICAL_DASHES_REGEXP)
+      return extract_with_prefix(str, prefixes) if prefixes.any?
       extract_isbn_as(str) + extract_thirteen_digit_isbns(str) + extract_ten_digit_isbns(str)
     end
-    def self.match_strings_with_prefix(str, prefixes)
-      prefix_regexp = prefixes.join('|')
+    def self.extract_with_prefix(str, prefixes)
+      prefix_regexp = generate_prefix_regexp(prefixes)
-      str
-        .to_s
-        .scan(/(#{prefix_regexp})#{TEXT_AFTER_PREFIX_REGEXP}/i)
-        .inject('') do |acum, (_prefix, match)|
-          acum + "#{match} \n "
-        end
+      [isbn_a_candidate_matcher, ISBN_13_REGEXP, ISBN_10_REGEXP].inject([]) do |matches, isbn_regexp|
+        matches | isbn_with_prefix_candidates(str, prefix_regexp, isbn_regexp)
+      end
     end
     def self.extract_isbn_as(str)
@@ -78,13 +86,13 @@ module Identifiers
         .select { |isbn, hyphen| !hyphen || isbn.count(hyphen) == 3 }
         .map { |isbn, hyphen| isbn.delete(hyphen.to_s) }
         .select { |isbn| valid_isbn_10?(isbn) }
-        .map { |isbn|
+        .map do |isbn|
           isbn.chop!
           isbn.prepend('978')
           isbn << isbn_13_check_digit(isbn).to_s
           isbn
-        }
+        end
     end
     def self.isbn_13_check_digit(isbn)
@@ -117,5 +125,34 @@ module Identifiers
     def self.digits_of(isbn)
       isbn.to_s.each_char.map { |char| char == 'X' ? 10 : Integer(char) }.to_enum
     end
+    def self.isbn_with_prefix_candidates(str, prefix_regexp, isbn_regexp)
+      regexp = Regexp.new("#{prefix_regexp}#{isbn_regexp}", Regexp::IGNORECASE | Regexp::EXTENDED)
+      str
+        .to_s
+        .scan(regexp)
+        .filter_map do |match|
+          extract(Array(match).first)&.first
+        end
+    end
+    def self.generate_prefix_regexp(prefixes)
+      joined_prefixes = Regexp.union(prefixes).source
+      Regexp.new(
+        "(?<=                 # Lookbehind for a prefix
+           #{joined_prefixes} # ie:p1|p2|p3
+         )
+         :?                   # Optional colon. If you want to use a different separator, you can add it as a prefix
+         \\s*                 # Optional whitespaces
+         ", Regexp::IGNORECASE | Regexp::EXTENDED
+      )
+    end
+    def self.isbn_a_candidate_matcher
+      # We capture the ISBN-A prefix for the ISBN-A regexp to work correctly when extracting ISBN-As
+      Regexp.new(ISBN_A_REGEXP.source.gsub('(?<=10\\.)', '10\.'), Regexp::IGNORECASE | Regexp::EXTENDED)
+    end
   end
 end
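
The new `(?<!\p{Pd})` and `(?!\p{Pd})` guards reject a candidate that has a dash immediately before or after it. The sketch below isolates that effect with deliberately simplified, hypothetical patterns (unseparated 13-digit ISBNs only, no check-digit validation); it is not the gem's actual regexp:

```ruby
# frozen_string_literal: true

# Simplified, hypothetical patterns for illustration only.
UNGUARDED = /\b(97[89]\d{10})\b/.freeze
# Same pattern, but refusing a dash immediately before or after the digits.
GUARDED = /(?<!\p{Pd})\b(97[89]\d{10})\b(?!\p{Pd})/.freeze

joined = '9780805069099-9780671879198'
alone = 'ISBN 9780805069099.'

p joined.scan(UNGUARDED).flatten
p joined.scan(GUARDED).flatten
p alone.scan(GUARDED).flatten
```

Without the guards both halves of the joined string are reported; with them neither is, while a standalone ISBN is still found.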

data/spec/identifiers/isbn_spec.rb CHANGED

@@ -1,3 +1,5 @@
+# frozen_string_literal: true
 require 'identifiers/isbn'
 RSpec.describe Identifiers::ISBN do
@@ -31,6 +33,10 @@ RSpec.describe Identifiers::ISBN do
     expect(described_class.extract('ISBN: 978–0–80–506909–9')).to contain_exactly('9780805069099')
   end
+  it 'extracts ISBNs that mix different dash characters' do
+    expect(described_class.extract('978–3−200–01908–9')).to contain_exactly('9783200019089')
+  end
   it 'extracts ISBNs with spaces' do
     expect(described_class.extract('ISBN: 978 0 80 506909 9')).to contain_exactly('9780805069099')
   end
@@ -119,6 +125,14 @@ RSpec.describe Identifiers::ISBN do
     expect(described_class.extract('0-80506909-7')).to be_empty
   end
+  it 'does not extract ISBNs from a long run of digits separated by hyphens' do
+    expect(described_class.extract('0-1884-0-3140-0-4396-0-5652-0-4396-0-2826')).to be_empty
+  end
+  it 'does not extract ISBNs joined to other digits by a hyphen' do
+    expect(described_class.extract('9780805069099-9780671879198')).to be_empty
+  end
   it 'extracts ISBN-10s with variable width registration group identifiers' do
     expect(described_class.extract('99921-58-10-7 9971-5-0210-0 960-425-059-0 80-902734-1-6'))
       .to contain_exactly('9789992158104', '9789971502102', '9789604250592', '9788090273412')
@@ -126,16 +140,57 @@ RSpec.describe Identifiers::ISBN do
   context 'when passing prefixes' do
     it 'extracts only prefixed ISBNs' do
-      text = "ISBN:9789992158104  \n ISBN-10 9789971502102 \n IsbN-13: 9789604250592 \n 9788090273412"
-      prefixes = ['IsBn', 'ISBN-10', 'ISBN-13']
+      text = "ISBN:9789992158104  ISBN-10 9789971502102 \n IsbN-13: 9789604250592 \n 9788090273412"
+      prefixes = %w[IsBn ISBN-10 ISBN-13]
       expect(described_class.extract(text, prefixes))
         .to contain_exactly('9789992158104', '9789971502102', '9789604250592')
     end
+    it 'extracts ISBNs with special characters in the prefixes' do
+      text = 'ISB*N:99921-58-10-7 IS?BN-10 9971-5-0210-0 Is$bN-13: 978-0-80-506909-9 80-902734-1-6'
+      prefixes = ['IsB*n', 'IS?BN-10', 'IS$BN-13']
+      expect(described_class.extract(text, prefixes))
+        .to contain_exactly('9789992158104', '9789971502102', '9780805069099')
+    end
+    it 'extracts ISBNs with Unicode dashes' do
+      text = 'ISB*N:99921-58-10-7 IS?BN-10 9971-5-0210-0 Is$bN-13: 978–0–80–506909–9 80-902734-1-6'
+      prefixes = ['IsB*n', 'IS?BN-10', 'IS$BN-13']
+      expect(described_class.extract(text, prefixes))
+        .to contain_exactly('9789992158104', '9789971502102', '9780805069099')
+    end
+    it 'extracts ISBNs with Unicode spaces' do
+      text = 'ISBN-13: 978 0 80 506909 9'
+      prefixes = ['ISBN-13']
+      expect(described_class.extract(text, prefixes)).to contain_exactly('9780805069099')
+    end
+    it 'normalizes 10-digit ISBNs with hyphens and a check digit of X' do
+      expect(described_class.extract('ISBN:2-7594-0269-X', ['ISBN'])).to contain_exactly('9782759402694')
+    end
+    it 'normalizes 10-digit ISBNs with spaces and a check digit of X' do
+      text = 'ISBN-10 2 7594 0269 X'
+      prefixes = ['ISBN-10']
+      expect(described_class.extract(text, prefixes)).to contain_exactly('9782759402694')
+    end
+    it 'extracts ISBN-13s from ISBN-As' do
+      text = 'ISBN 10.978.8898392/315'
+      prefixes = %w[ISBN ISBN-10]
+      expect(described_class.extract(text, prefixes)).to contain_exactly('9788898392315')
+    end
     it 'does not extract ISBNs with different prefixes' do
-      text = "ISBN:9789992158104 \n ISBN-10 9789971502102  \n IsbN-13: 9789604250592 \n 9788090273412"
-      prefixes = ['IsBn', 'ISBN-10']
+      text = "ISBN:9789992158104 \n ISBN-10 9789971502102  IsbN-13: 9789604250592  9788090273412"
+      prefixes = %w[IsBn ISBN-10]
       expect(described_class.extract(text, prefixes))
         .to contain_exactly('9789992158104', '9789971502102')
@@ -143,7 +198,7 @@ RSpec.describe Identifiers::ISBN do
     it 'does not extract ISBNs without prefixes' do
       text = "9789992158104 9789971502102 9789604250592 \n 9788090273412"
-      prefixes = ['IsBn', 'ISBN-10', 'ISBN-13']
+      prefixes = %w[IsBn ISBN-10 ISBN-13]
       expect(described_class.extract(text, prefixes)).to be_empty
     end
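
The specs with `*`, `?` and `$` in the prefixes exercise the move from `prefixes.join('|')` to `Regexp.union`. A small sketch of the difference, using a hypothetical and much simplified candidate pattern (a prefix, an optional colon, optional whitespace, then a run of digits) in place of the real ISBN regexps:

```ruby
# frozen_string_literal: true

prefixes = ['IsB*n', 'IS?BN-10']

# Joining raw strings lets "*" and "?" act as regexp operators.
naive = Regexp.new(prefixes.join('|'), Regexp::IGNORECASE)
# Regexp.union escapes each string, so every prefix is matched literally.
escaped = Regexp.new(Regexp.union(prefixes).source, Regexp::IGNORECASE)

# Hypothetical, simplified candidate matcher; not a full ISBN pattern.
candidate = /(?:#{escaped.source}):?\s*(\d+)/i

text = 'ISB*N:9789992158104 IS?BN-10 9971502102 12345'
p naive.match?('ISB*N')
p escaped.match?('ISB*N')
p text.scan(candidate).flatten
```

Only the two prefixed digit runs are picked up; the trailing unprefixed number is ignored.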

metadata CHANGED

@@ -1,16 +1,15 @@
 --- !ruby/object:Gem::Specification
 name: identifiers
 version: !ruby/object:Gem::Version
-  version: 0.14.0
+  version: 0.15.0
 platform: ruby
 authors:
 - Jonathan Hernandez
 - Paul Mucur
 - PatoSoft
-autorequire:
 bindir: exe
 cert_chain: []
-date: 2024-08-01 00:00:00.000000000 Z
+date: 1980-01-02 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: urn
@@ -32,29 +31,28 @@ dependencies:
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '10.0'
+        version: '13.2'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '10.0'
+        version: '13.2'
 - !ruby/object:Gem::Dependency
   name: rspec
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '3.4'
+        version: '3.13'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '3.4'
-description:
+        version: '3.13'
 email:
 - support@altmetric.com
 executables: []
@@ -90,7 +88,6 @@ homepage: https://github.com/altmetric/identifiers
 licenses:
 - MIT
 metadata: {}
-post_install_message:
 rdoc_options: []
 require_paths:
 - lib
@@ -105,8 +102,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.3.27
-signing_key:
+rubygems_version: 3.6.9
 specification_version: 4
 summary: Utilities library for various scholarly identifiers used by Altmetric
 test_files:
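
The development dependency bumps above use the pessimistic `~>` operator. As a quick illustration of what the new rspec constraint admits, here is a sketch using RubyGems' own version classes; the sample version numbers are arbitrary:

```ruby
# frozen_string_literal: true

require 'rubygems'

# "~> 3.13" allows 3.13 and later 3.x releases, but not 4.0.
rspec = Gem::Requirement.new('~> 3.13')

%w[3.4 3.13.0 3.14.2 4.0].each do |v|
  puts "#{v}: #{rspec.satisfied_by?(Gem::Version.new(v))}"
end
```

Version segments compare numerically, so 3.4 sorts before 3.13 and no longer satisfies the requirement.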