neu-mods 0.1.0
- checksums.yaml +7 -0
- data/.version +1 -0
- data/Gemfile +8 -0
- data/README.md +83 -0
- data/Rakefile +12 -0
- data/lib/neu/mods/canonicalize.rb +143 -0
- data/lib/neu/mods/document.rb +38 -0
- data/lib/neu/mods/projection.rb +273 -0
- data/lib/neu/mods/selectors.rb +52 -0
- data/lib/neu/mods/version.rb +10 -0
- data/lib/neu-mods.rb +37 -0
- metadata +99 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
---
SHA256:
  metadata.gz: ab1cb07bf4122e98017ded99c790824865f68fb8ccdfdcbb3b5fc8e7d72cbf55
  data.tar.gz: 6d3b467da6fde096e0ea26f9e76c447ea24c563c6cc5fd1a47e06b8f18d72aca
SHA512:
  metadata.gz: 4c9b5a6006cf0ab3faced2d60f49f703ab482f28f2208d39c5225bd7af769c4f6bba634b0808d83d3b5ac3f2210efca07f8b1ad37976e0ed17d1175d666fcccf
  data.tar.gz: 2c92060d9faddce439bb9b462982fbb186f6d792f5fc86cfa9702a7cf1c2118c06a3236a5e0875c2499e5593083f531a5668c841696168d330c6527728ca56a7
data/.version
ADDED
@@ -0,0 +1 @@
0.1.0
data/Gemfile
ADDED
data/README.md
ADDED
@@ -0,0 +1,83 @@
# neu-mods

Northeastern-flavored MODS v3 **projection + selection** for the DRS, shared by
[Cerberus](https://github.com/NEU-Libraries/cerberus) (front end) and
[Atlas](https://github.com/NEU-Libraries/atlas) (API backend).

It is a **Nokogiri-native, dependency-light contract over MODS documents** — pure
functions over a parsed document, nothing else. No Rails, no persistence, no
HTTP. It answers two questions:

- **"Where is X?"** — `Selectors` return *live Nokogiri nodes*, so they serve both
  the read path (projection reads their text) and the write path (an editor
  mutates the returned node in place). The node an editor changes is provably the
  node the projection reads.
- **"What does this project to?"** — `Projection` returns *plain data*
  (hashes/strings/arrays — never opaque typed objects) for indexing/display.

It depends on **Nokogiri alone** — deliberately *not* the `sul-dlss/mods` +
`nom-xml` stack (which is sunsetting alongside Stanford's move to Cocina). See
the design note in the DRS gap-reports for the full rationale.

## Usage

```ruby
require "neu-mods"

doc = NEU::MODS::Document.parse(xml_string)

# Projection (plain data)
doc.plain_title       # => "What's New - How We Respond to Disaster, Episode 1"
doc.title_parts       # => { non_sort:, subtitle:, title:, part_name:, part_number: }
doc.abstract          # => normalized, paragraph-joined String
doc.topical_subjects  # => ["Civil society", ...] (every <topic>, for the access copy)
doc.keywords          # => [...] (only the editable attribute-free keyword subjects)
doc.to_h              # => full projection, keyed to Atlas's Metadata::MODS attributes

# Selectors (live nodes — for editing)
node = doc.primary_title_info.at_xpath("mods:title", NEU::MODS::NAMESPACE)
node.content = "New Title" unless NEU::MODS.whitespace_equivalent?(node.text, "New Title")
doc.to_xml
```

## Two normalizers, two jobs

- `NEU::MODS.whitespace_equivalent?` / `.canonical_ws` — the **no-op guard**: did an
  edit change anything, or only insignificant whitespace? (Used to avoid minting
  an unchanged OCFL MODS version.)
- `NEU::MODS.normalize_paragraphs` / `.normalize` — clean **curator freetext** for
  the JSON/Solr access copy (dash/smart-punctuation transliteration, control
  stripping, paragraph handling). The XML preservation copy is never touched.
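
As a rough illustration of the no-op guard, the sketch below re-implements `canonical_ws` and `whitespace_equivalent?` as standalone top-level methods (the gem exposes them on `NEU::MODS`). The sample strings are invented for the example; only the folding logic is taken from the gem's source.

```ruby
# Standalone sketch of the whitespace no-op guard (not the gem's packaging).
NBSP = [0xA0].pack("U") # U+00A0, built from its codepoint to keep the source ASCII

# Fold NBSP to a plain space, collapse whitespace runs, trim both ends.
def canonical_ws(str)
  str.to_s.tr(NBSP, " ").gsub(/\s+/, " ").strip
end

# True when two values differ only by insignificant whitespace.
def whitespace_equivalent?(current, incoming)
  canonical_ws(current) == canonical_ws(incoming)
end

# Hypothetical stored value and incoming edits.
stored   = "Civil#{NBSP}society  and\n disaster "
incoming = "Civil society and disaster"

puts whitespace_equivalent?(stored, incoming)          # whitespace-only difference
puts whitespace_equivalent?(stored, "Civil societies") # a real change
```

A writer would skip the mutation when the guard returns true, which is how an unchanged version is avoided.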

## Behavior fidelity & known caveats

The projection is **behavior-preserving** with Atlas's prior `mods`-gem-based
extraction, pinned by `spec/conformance_spec.rb` against `work-mods.xml`. Two
intentional notes:

- **Name display** reproduces the `mods` gem's `display_value_w_date` *including
  its quirks* (e.g. multiple `given` nameParts concatenate with no separator),
  to preserve existing Solr/display output. Cleanups are a deliberate future
  contract change, not a silent one.
- **Roles & languages** read the `type="text"` term and fall back to the **raw
  code** — they are *not* MARC-relator / ISO-639 translated. Records carrying
  text forms (the norm) are unaffected; code-only records would differ. Vendoring
  those lookup tables (or depending on `iso-639`) is deferred to keep the gem
  Nokogiri-only and small.
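
Both caveats can be sketched without any XML. The example below is an assumption-laden model: a name is represented as a plain hash of `parts` and `role_terms` (a hypothetical stand-in for a `<mods:name>` node), and the two helpers mimic the joining and fallback rules described above.

```ruby
# Hypothetical stand-in for a <mods:name>: nameParts and roleTerms as plain hashes.
name = {
  parts: [
    { type: "family", text: "Doe" },
    { type: "given",  text: "A." },
    { type: "given",  text: "(B)" }
  ],
  role_terms: [{ type: "code", text: "cre" }]
}

# Same-typed parts are concatenated with no separator (the preserved quirk).
def joined_parts(name, type)
  name[:parts].select { |part| part[:type] == type }.map { |part| part[:text] }.join
end

# Prefer the type="text" term; otherwise return the raw code, untranslated.
def role_term_value(name)
  %w[text code].each do |type|
    term = name[:role_terms].find { |t| t[:type] == type }
    text = term ? term[:text].to_s.strip : ""
    return text unless text.empty?
  end
  nil
end

display = "#{joined_parts(name, 'family')}, #{joined_parts(name, 'given')}"
puts display               # Doe, A.(B)
puts role_term_value(name) # cre
```

A code-only record therefore surfaces `cre` rather than a translated label such as "Creator".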

## Source convention

Every character-class regex in `TextNormalizer` is built **programmatically from
codepoints**, so the source stays pure ASCII (no literal smart-quotes/dashes, no
raw control bytes). A spec enforces this. Keep it that way.
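
A minimal sketch of that convention, assuming a three-codepoint subset chosen only for illustration (the gem's own dash list is longer): the helper formats each codepoint as a `\uXXXX` escape, so the pattern text stays ASCII while still matching non-ASCII characters.

```ruby
# Build a character-class Regexp from integer codepoints using \uXXXX escapes,
# so no non-ASCII character appears in the source file.
def char_class(codepoints)
  Regexp.new("[#{codepoints.map { |cp| format('\\u%04X', cp) }.join}]")
end

# Illustrative subset only: U+2013, U+2014, U+2212.
DASH_SUBSET_RE = char_class([0x2013, 0x2014, 0x2212])

# Build the sample from a codepoint as well, for the same reason.
sample = "1914#{[0x2013].pack('U')}1918"

puts sample.gsub(DASH_SUBSET_RE, "-")   # 1914-1918
puts DASH_SUBSET_RE.source.ascii_only?  # true
```

The `ascii_only?` check on the pattern source mirrors, in miniature, what a spec over the whole file could assert.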

## Development

```bash
bundle install
bundle exec rspec
bundle exec rubocop
```

Versioned via the `.version` file (read by `lib/neu/mods/version.rb`); released
with `bundler/gem_tasks` (`rake release`), mirroring `atlas_rb`.
data/Rakefile
ADDED
data/lib/neu/mods/canonicalize.rb
ADDED
@@ -0,0 +1,143 @@
# frozen_string_literal: true

module NEU
  module MODS
    # Lightweight whitespace canonicalization used by the *no-op guard* -- does an
    # edit actually change anything, or only insignificant whitespace? (Cerberus's
    # MODSMerge uses this to avoid minting an unchanged OCFL MODS version.) This is
    # deliberately distinct from TextNormalizer below: this one only folds
    # whitespace; TextNormalizer cleans curator freetext for the access copy.
    module Canonicalize
      module_function

      NBSP = [0xA0].pack("U") # U+00A0 non-breaking space, built from codepoint

      # \s doesn't match U+00A0 (NBSP) in Ruby's default mode, so fold NBSP to a
      # plain space first, then collapse any whitespace run to one space + strip.
      def canonical_ws(str)
        str.to_s.tr(NBSP, " ").gsub(/\s+/, " ").strip
      end

      # Treat values differing only by insignificant whitespace (NBSP vs space,
      # collapsible runs, leading/trailing) as equal.
      def whitespace_equivalent?(current, incoming)
        canonical_ws(current) == canonical_ws(incoming)
      end
    end

    # Normalises curator-authored freetext on the way into the JSON access copy
    # (and Solr); the XML preservation copy stays untouched. Ported from Atlas's
    # TextNormalizer (which carries DRS v1 prior art) so the gem reproduces Atlas's
    # projection byte-for-byte.
    #
    # IMPORTANT: every character-class regex is built *programmatically* from
    # codepoint lists via `format('\\u%04X', cp)`, so this source file stays pure
    # ASCII -- no literal smart-quotes, dashes, or (critically) raw control bytes
    # land on disk. Keep it that way.
    #
    # Pipeline: force UTF-8 + scrub invalid bytes; NFC; map Unicode dashes to '-'
    # (swung-dash to '~'); transliterate the General Punctuation block (smart
    # quotes, ellipsis, etc.) to ASCII; strip C0/C1 controls (keeping tab/newline);
    # collapse horizontal-whitespace runs to one space; for paragraph fields,
    # collapse 2+ newlines to exactly two; strip.
    #
    #   .normalize(str)            -- single-line fields (newlines -> spaces)
    #   .normalize_paragraphs(str) -- fields that may carry paragraph breaks
    #                                 (abstract, accessCondition)
    module TextNormalizer
      module_function

      # Build a character-class Regexp from an array of integer codepoints, as
      # \uXXXX escapes (keeps this source ASCII).
      def self.char_class(codepoints, prefix: "")
        Regexp.new("[#{prefix}#{codepoints.map { |cp| format('\\u%04X', cp) }.join}]")
      end

      # NOTE: U+2053 (swung dash) is intentionally excluded from dashes -- it is
      # named "dash" but conventionally maps to ASCII '~', not '-' (V1 prior art).
      DASH_CODEPOINTS = [
        0x002D, 0x00AD, 0x058A, 0x05BE, 0x1400, 0x1806,
        0x2010, 0x2011, 0x2012, 0x2013, 0x2014, 0x2015,
        0x2043, 0x207B, 0x208B, 0x2212,
        0x2E17, 0x2E1A, 0x2E3A, 0x2E3B, 0x2E40,
        0x301C, 0x3030, 0x30A0, 0xFE31, 0xFE32, 0xFE58,
        0xFE63, 0xFF0D
      ].freeze
      DASH_RE = char_class(DASH_CODEPOINTS).freeze

      SWUNG_DASH_RE = Regexp.new(format('\\u%04X', 0x2053)).freeze

      # C0 controls (U+0000..U+0008, U+000B..U+001F), DEL (U+007F), and C1
      # controls (U+0080..U+009F). U+0009 (tab) and U+000A (newline) are preserved.
      CONTROL_CODEPOINTS = ((0x0000..0x0008).to_a + (0x000B..0x001F).to_a + (0x007F..0x009F).to_a).freeze
      CONTROL_RE = char_class(CONTROL_CODEPOINTS).freeze

      HORIZONTAL_WS_CODEPOINTS = [
        0x0009, 0x00A0, 0x1680,
        0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006,
        0x2007, 0x2008, 0x2009, 0x200A, 0x202F, 0x205F, 0x3000
      ].freeze
      # Leading literal space included in the class (the " " prefix); `+` so a run
      # of horizontal whitespace collapses to a single space.
      HORIZONTAL_WS_RE = Regexp.new("#{char_class(HORIZONTAL_WS_CODEPOINTS, prefix: " ").source}+").freeze

      PARAGRAPH_RUN_RE = /\n{2,}/

      # General Punctuation block (U+2000..U+206F). Codepoints not listed pass
      # through unchanged. Empty-string values deliberately drop invisible/bidi/
      # format marks so they cannot leak into the access copy.
      GENERAL_PUNCTUATION = {
        0x2000 => " ", 0x2001 => " ", 0x2002 => " ", 0x2003 => " ",
        0x2004 => " ", 0x2005 => " ", 0x2006 => " ", 0x2007 => " ",
        0x2008 => " ", 0x2009 => " ", 0x200A => " ",
        0x200B => "", 0x200C => "", 0x200D => "",
        0x200E => "", 0x200F => "",
        0x2018 => "'", 0x2019 => "'", 0x201A => ",", 0x201B => "'",
        0x201C => '"', 0x201D => '"', 0x201E => '"', 0x201F => '"',
        0x2020 => "+", 0x2021 => "+",
        0x2022 => "*", 0x2023 => "*", 0x2024 => ".", 0x2025 => "..",
        0x2026 => "...",
        0x2028 => "\n", 0x2029 => "\n\n",
        0x202A => "", 0x202B => "", 0x202C => "", 0x202D => "",
        0x202E => "", 0x202F => " ",
        0x2030 => "%", 0x2032 => "'", 0x2033 => '"', 0x2035 => "'",
        0x2036 => '"',
        0x2039 => "<", 0x203A => ">", 0x203C => "!!", 0x203D => "?",
        0x2044 => "/", 0x2052 => "%",
        0x205F => " ", 0x2060 => "", 0x2061 => "", 0x2062 => "",
        0x2063 => "", 0x2064 => "",
        0x206A => "", 0x206B => "", 0x206C => "", 0x206D => "",
        0x206E => "", 0x206F => ""
      }.transform_keys { |cp| [cp].pack("U") }.freeze
      GENERAL_PUNCTUATION_RE = Regexp.new("[#{format('\\u%04X-\\u%04X', 0x2000, 0x206F)}]").freeze

      def normalize(str)
        return "" if str.nil?

        s = base_normalize(str.to_s)
        s = s.tr("\n", " ")
        s.gsub(HORIZONTAL_WS_RE, " ").strip
      end

      def normalize_paragraphs(str)
        return "" if str.nil?

        s = base_normalize(str.to_s)
        s = s.gsub(HORIZONTAL_WS_RE, " ")
        s = s.gsub(/ *\n */, "\n")
        s.split(PARAGRAPH_RUN_RE).map { |p| p.tr("\n", " ").strip }
         .reject(&:empty?).join("\n\n")
      end

      def base_normalize(str)
        s = str.dup.force_encoding("UTF-8")
        s = s.scrub("")
        s = s.unicode_normalize(:nfc)
        s = s.gsub(DASH_RE, "-")
        s = s.gsub(SWUNG_DASH_RE, "~")
        s = s.gsub(GENERAL_PUNCTUATION_RE) { |c| GENERAL_PUNCTUATION.fetch(c, c) }
        s.gsub(CONTROL_RE, "")
      end
    end
  end
end
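
The paragraph handling in `normalize_paragraphs` can be followed on a small input. The sketch below is a simplified, ASCII-only approximation: it assumes the input is already free of Unicode spaces, dashes, and control characters (which the real pipeline handles in `base_normalize`), and the sample text is invented.

```ruby
# Simplified sketch of paragraph handling, ASCII whitespace only.
def normalize_paragraphs(str)
  return "" if str.nil?

  # Collapse runs of spaces/tabs, then trim spaces around each newline.
  s = str.to_s.gsub(/[ \t]+/, " ")
  s = s.gsub(/ *\n */, "\n")

  # Two or more newlines separate paragraphs; a single newline inside a
  # paragraph becomes a space. Empty paragraphs are dropped.
  paragraphs = s.split(/\n{2,}/).map { |para| para.tr("\n", " ").strip }
  paragraphs.reject(&:empty?).join("\n\n")
end

raw = "First  line\nstill first.\n\n\n\n  Second\tparagraph.  \n"
puts normalize_paragraphs(raw).inspect
# "First line still first.\n\nSecond paragraph."
```

However many blank lines the curator typed, the output separates paragraphs with exactly one blank line.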
data/lib/neu/mods/document.rb
ADDED
@@ -0,0 +1,38 @@
# frozen_string_literal: true

require "nokogiri"

module NEU
  module MODS
    # The gem's main entry point: a thin facade over a parsed MODS document.
    #
    #   doc = NEU::MODS::Document.parse(xml)
    #   doc.plain_title        # => composed display title
    #   doc.to_h               # => full read projection
    #   doc.primary_title_info # => a live Nokogiri node (for editing)
    #
    # Selectors return live nodes (shared by read and write); projection methods
    # return plain data. Parsing uses `&:noblanks` to match Atlas's read and to
    # avoid spurious whitespace-only text nodes.
    class Document
      include Selectors
      include Projection

      attr_reader :doc

      def self.parse(xml)
        new(Nokogiri::XML(xml.to_s, &:noblanks))
      end

      # Wrap an already-parsed Nokogiri document (used by writers that own the doc
      # they're mutating, so selectors and serialization share one instance).
      def initialize(nokogiri_doc)
        @doc = nokogiri_doc
      end

      def to_xml(...)
        doc.to_xml(...)
      end
    end
  end
end
data/lib/neu/mods/projection.rb
ADDED
@@ -0,0 +1,273 @@
# frozen_string_literal: true

require "date"

module NEU
  module MODS
    # Node -> plain data. The read contract: what a MODS document *projects to* for
    # indexing/display. Behavior-preserving with Atlas's prior `mods`-gem-based
    # extraction (verified by the conformance corpus), reimplemented in Nokogiri so
    # DRS depends on Nokogiri alone. Mixed into Document; operates on `doc`.
    #
    # Empty-value conventions mirror Atlas: scalar fields are "" when absent
    # (matching `.text.squish` on an empty node set), except `permanent_url` and
    # `date_created`, which are nil when their node is absent. Arrays are [].
    module Projection
      # --- Title ---------------------------------------------------------------

      # Structured primary-title parts. nil for an absent part (the Cerberus form
      # treats nil as "not present"); to_h coerces to "" for the Atlas main_title.
      def title_parts
        ti = primary_title_info
        {
          non_sort: child_text(ti, "mods:nonSort"),
          subtitle: child_text(ti, "mods:subTitle"),
          title: child_text(ti, "mods:title"),
          part_name: child_text(ti, "mods:partName"),
          part_number: child_text(ti, "mods:partNumber")
        }
      end

      # Composed display title (the former Atlas MODSDecoration#plain_title), driven
      # off the scoped primary title.
      def plain_title
        p = title_parts
        return "" if blank?(p[:title])

        "#{p[:non_sort]}#{p[:title]}" +
          prefix(": ", p[:subtitle]) +
          prefix(" - ", p[:part_name]) +
          prefix(", ", p[:part_number])
      end

      # --- Abstract / access ---------------------------------------------------

      def abstract
        join_paragraphs(abstract_nodes)
      end

      def access_condition
        join_paragraphs(doc.xpath("/mods:mods/mods:accessCondition", NAMESPACE))
      end

      # --- Subjects ------------------------------------------------------------

      # The editable free-text keyword set (Cerberus simple form): topics under the
      # attribute-free keyword subjects only.
      def keywords
        keyword_subjects.flat_map { |s| s.xpath("mods:topic", NAMESPACE).map { |t| t.text.strip } }
      end

      # Every <topic> under any top-level <subject> (the access-copy projection,
      # equivalent to Atlas's extract_topical_subjects).
      def topical_subjects
        doc.xpath("/mods:mods/mods:subject/mods:topic", NAMESPACE).map { |t| clean(t.text) }
      end

      # --- Names ---------------------------------------------------------------

      # All top-level names as { name:, role: }. `name` reproduces the `mods` gem's
      # display_value_w_date (including its quirks -- faithfully, so existing Solr/
      # display output is preserved). `role` prefers the type="text" roleTerm,
      # falling back to the raw code (NOT MARC-relator-translated -- see README).
      def names
        doc.xpath("/mods:mods/mods:name", NAMESPACE).map do |node|
          { name: name_display_value_w_date(node), role: name_role(node) }
        end
      end

      # --- Scalars / simple arrays --------------------------------------------

      def languages
        doc.xpath("/mods:mods/mods:language", NAMESPACE).map do |lang|
          term = lang.at_xpath("mods:languageTerm[@type='text']", NAMESPACE) ||
                 lang.at_xpath("mods:languageTerm", NAMESPACE)
          clean(term&.text)
        end.compact
      end

      def resource_type = text_at("/mods:mods/mods:typeOfResource")
      def format = text_at("/mods:mods/mods:physicalDescription/mods:form")
      def extent = text_at("/mods:mods/mods:physicalDescription/mods:extent")
      def digital_origin = text_at("/mods:mods/mods:physicalDescription/mods:digitalOrigin")

      def genres
        doc.xpath("/mods:mods/mods:genre", NAMESPACE).map { |g| clean(g.text) }
      end

      def related_series
        doc.xpath("/mods:mods/mods:relatedItem[@type='series']/mods:titleInfo/mods:title", NAMESPACE)
           .map { |t| clean(t.text) }
      end

      def identifiers
        doc.xpath("/mods:mods/mods:identifier", NAMESPACE).map { |i| clean(i.text) }
      end

      def permanent_url
        node = doc.at_xpath("/mods:mods/mods:identifier[@type='hdl']", NAMESPACE)
        node && clean(node.text)
      end

      # Parsed dateCreated, or nil if no originInfo/dateCreated, or "" if present
      # but unparseable (mirrors Atlas's safe_date_parse rescue).
      def date_created
        node = doc.at_xpath("/mods:mods/mods:originInfo/mods:dateCreated", NAMESPACE)
        return nil unless node

        str = NEU::MODS.canonical_ws(node.text)
        return nil if str.empty?

        begin
          DateTime.parse(str)
        rescue Date::Error
          ""
        end
      end

      # --- Full projection -----------------------------------------------------

      # The complete read projection, keyed to Atlas's Metadata::MODS attribute
      # names -- a drop-in source for `convert_xml_to_json`.
      def to_h
        {
          main_title: title_parts.transform_values(&:to_s),
          names: names,
          languages: languages,
          date_created: date_created,
          resource_type: resource_type,
          genres: genres,
          format: format,
          extent: extent,
          digital_origin: digital_origin,
          abstract: abstract,
          related_series: related_series,
          topical_subjects: topical_subjects,
          identifiers: identifiers,
          permanent_url: permanent_url,
          access_condition: access_condition
        }
      end

      private

      # --- helpers -------------------------------------------------------------

      def text_at(xpath)
        node = doc.at_xpath(xpath, NAMESPACE)
        node ? NEU::MODS.canonical_ws(node.text) : ""
      end

      def child_text(parent, xpath)
        return nil unless parent

        node = parent.at_xpath(xpath, NAMESPACE)
        return nil unless node

        v = NEU::MODS.canonical_ws(node.text)
        v.empty? ? nil : v
      end

      # canonical_ws, but nil for blank (used where an absent member must drop out).
      def clean(str)
        return nil if str.nil?

        v = NEU::MODS.canonical_ws(str)
        v.empty? ? nil : v
      end

      def blank?(str)
        str.nil? || str.strip.empty?
      end

      def prefix(sep, val)
        blank?(val) ? "" : "#{sep}#{val}"
      end

      def join_paragraphs(nodes)
        nodes.map { |n| NEU::MODS.normalize_paragraphs(n.text) }.reject(&:empty?).join("\n\n")
      end

      # --- name display (faithful port of mods gem display_value_w_date) -------

      def name_display_value_w_date(node)
        dv = name_display_value(node)
        node.xpath("mods:namePart[@type='date']", NAMESPACE).each do |np|
          d = np.text
          dv += ", #{d}" unless d.empty? || dv.end_with?(d)
        end
        dv = dv.sub(/\A, /, "")
        dv.strip.empty? ? nil : dv.strip
      end

      def name_display_value(node)
        display_form = node.at_xpath("mods:displayForm", NAMESPACE)
        return display_form.text if display_form && !display_form.text.empty?

        if node["type"] == "personal"
          personal_display_value(node)
        else
          non_date_parts_joined(node)
        end
      end

      def personal_display_value(node)
        family = joined_parts(node, "family")
        given = joined_parts(node, "given")
        dv =
          if family.empty?
            given
          else
            given.empty? ? family : "#{family}, #{given}"
          end

        return non_date_parts_joined(node) if dv.empty?

        append_terms_of_address(node, dv)
      end

      def append_terms_of_address(node, dv)
        first = true
        node.xpath("mods:namePart[@type='termsOfAddress']", NAMESPACE).each do |np|
          next if np.text.empty?

          dv += first ? " #{np.text}" : ", #{np.text}"
          first = false
        end
        dv
      end

      # NodeSet-style concatenation: the `mods` gem joins same-typed nameParts via
      # NodeSet#text (no separator) -- e.g. two `given` parts become "A.(B)". We
      # reproduce that (quirk included) to stay behavior-preserving.
      def joined_parts(node, type)
        node.xpath("mods:namePart[@type='#{type}']", NAMESPACE).map(&:text).join
      end

      def non_date_parts_joined(node)
        node.xpath("mods:namePart", NAMESPACE)
            .reject { |np| np["type"] == "date" || np.text.empty? }
            .map(&:text).join(" ")
      end

      def name_role(node)
        node.xpath("mods:role", NAMESPACE).each do |role|
          val = role_term_value(role)
          return val if val
        end
        nil
      end

      # Prefer the type="text" roleTerm; fall back to the raw type="code" term
      # (NOT MARC-relator-translated -- see README). nil if neither is present.
      def role_term_value(role)
        %w[text code].each do |type|
          term = role.at_xpath("mods:roleTerm[@type='#{type}']", NAMESPACE)
          text = term&.text.to_s.strip
          return text unless text.empty?
        end
        nil
      end
    end
  end
end
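
The title composition in `plain_title` is easy to check in isolation. The sketch below lifts `blank?`, `prefix`, and the composition expression into top-level methods that take a parts hash directly; that calling convention is an assumption made for the example (in the gem, the parts come from `title_parts`). The values are chosen to reproduce the title shown in the README.

```ruby
# Standalone sketch: compose a display title from a title_parts-shaped hash.
def blank?(str)
  str.nil? || str.strip.empty?
end

# Emit the separator only when the value is present.
def prefix(sep, val)
  blank?(val) ? "" : "#{sep}#{val}"
end

def plain_title(parts)
  return "" if blank?(parts[:title])

  "#{parts[:non_sort]}#{parts[:title]}" +
    prefix(": ", parts[:subtitle]) +
    prefix(" - ", parts[:part_name]) +
    prefix(", ", parts[:part_number])
end

parts = {
  non_sort: nil,
  subtitle: nil,
  title: "What's New",
  part_name: "How We Respond to Disaster",
  part_number: "Episode 1"
}

puts plain_title(parts)                  # What's New - How We Respond to Disaster, Episode 1
puts plain_title({ title: nil }).inspect # ""
```

Absent parts contribute nothing, including their separator, and a missing main title short-circuits to the empty string.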
data/lib/neu/mods/selectors.rb
ADDED
@@ -0,0 +1,52 @@
# frozen_string_literal: true

module NEU
  module MODS
    # Node LOCATION over a parsed MODS document. These return live Nokogiri nodes,
    # so they serve BOTH the read path (projection reads their text) AND the write
    # path (Cerberus's MODSMerge mutates the returned nodes in place). That shared
    # definition is the point: the node an editor changes is provably the node the
    # projection reads. Mixed into Document; operates on `doc`.
    module Selectors
      # Top-level primary titleInfo, falling back to the first top-level titleInfo.
      # Scoped to direct children of <mods:mods> so a relatedItem's nested
      # titleInfo (e.g. a series title) is never matched.
      def primary_title_info
        doc.at_xpath("/mods:mods/mods:titleInfo[@usage='primary']", NAMESPACE) ||
          doc.at_xpath("/mods:mods/mods:titleInfo", NAMESPACE)
      end

      # All top-level <abstract> elements (MODS permits several).
      def abstract_nodes
        doc.xpath("/mods:mods/mods:abstract", NAMESPACE)
      end

      # The "keyword" subjects the simple form manages: attribute-free <subject>
      # elements whose element children are all <topic>. Anything with an
      # authority/valueURI (or a non-topic child, e.g. a <name> subject) is curated
      # and left untouched. (Distinct from the projection's #topical_subjects,
      # which harvests *every* <topic> for the access copy.)
      def keyword_subjects
        doc.xpath("/mods:mods/mods:subject", NAMESPACE).select { |s| keyword_subject?(s) }
      end

      # Build a namespaced MODS element reusing the document's existing `mods:`
      # namespace declaration (so new nodes never re-declare xmlns).
      def build_node(name, text = nil)
        node = Nokogiri::XML::Node.new(name, doc)
        node.namespace = doc.root.namespace_definitions.find { |d| d.prefix == "mods" }
        node.content = text unless text.nil?
        node
      end

      private

      def keyword_subject?(subject)
        return false if subject.attributes.any?

        topics = subject.element_children
        topics.any? && topics.all? { |c| c.name == "topic" }
      end
    end
  end
end
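
The `keyword_subject?` rule can be exercised without Nokogiri. The sketch below assumes a hypothetical `Subject` struct that records only a subject's attributes and the names of its element children; the gem itself inspects live nodes, so this models the predicate, not the API.

```ruby
# Hypothetical stand-in for a <mods:subject>: its attributes and the names of
# its element children. The gem inspects live Nokogiri nodes instead.
Subject = Struct.new(:attributes, :child_names)

# A keyword subject has no attributes and at least one child, all of them <topic>.
def keyword_subject?(subject)
  return false if subject.attributes.any?

  subject.child_names.any? && subject.child_names.all? { |n| n == "topic" }
end

subjects = [
  Subject.new({}, %w[topic topic]),                  # editable keywords
  Subject.new({ "authority" => "lcsh" }, %w[topic]), # curated: carries an authority
  Subject.new({}, %w[topic geographic]),             # curated: mixed children
  Subject.new({}, [])                                # empty: nothing to edit
]

puts subjects.map { |s| keyword_subject?(s) }.inspect # [true, false, false, false]
```

Only the first subject would be offered to the simple form; the rest stay untouched by an edit, while every `<topic>` among them still appears in `topical_subjects`.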
data/lib/neu/mods/version.rb
ADDED
@@ -0,0 +1,10 @@
# frozen_string_literal: true

module NEU
  module MODS
    # Current gem version, read from the `.version` file at the repo root at load
    # time (mirrors the atlas_rb convention so a single `.version` bump drives the
    # gem version + `bundler/gem_tasks` release).
    VERSION = File.read(File.expand_path("../../../.version", __dir__)).strip
  end
end
data/lib/neu-mods.rb
ADDED
@@ -0,0 +1,37 @@
# frozen_string_literal: true

require_relative "neu/mods/version"
require_relative "neu/mods/canonicalize"
require_relative "neu/mods/selectors"
require_relative "neu/mods/projection"
require_relative "neu/mods/document"

# Northeastern-flavored MODS v3 projection + selection for the DRS.
#
# A Nokogiri-native, dependency-light *reading/projection contract* over MODS
# documents, shared by Cerberus (front end) and Atlas (API backend). Pure
# functions over a document -- no Rails, no persistence, no HTTP. It answers two
# questions and nothing else:
#
#   * "Where is X?"                 -> Selectors  (live nodes; serve read AND write)
#   * "What does this project to?"  -> Projection (plain data; for index/display)
#
# Top-level conveniences delegate to the canonicalization helpers so callers can
# write `NEU::MODS.whitespace_equivalent?(a, b)` etc. without reaching into the
# submodules.
module NEU
  module MODS
    # The MODS v3 namespace, as a Nokogiri xpath namespace hash.
    NAMESPACE = { "mods" => "http://www.loc.gov/mods/v3" }.freeze

    module_function

    # Whitespace no-op guard (see Canonicalize).
    def canonical_ws(str) = Canonicalize.canonical_ws(str)
    def whitespace_equivalent?(current, incoming) = Canonicalize.whitespace_equivalent?(current, incoming)

    # Curator-freetext normalization for the access copy (see TextNormalizer).
    def normalize(str) = TextNormalizer.normalize(str)
    def normalize_paragraphs(str) = TextNormalizer.normalize_paragraphs(str)
  end
end
metadata
ADDED
@@ -0,0 +1,99 @@
--- !ruby/object:Gem::Specification
name: neu-mods
version: !ruby/object:Gem::Version
  version: 0.1.0
platform: ruby
authors:
- David Cliff
autorequire:
bindir: bin
cert_chain: []
date: 2026-06-09 00:00:00.000000000 Z
dependencies:
- !ruby/object:Gem::Dependency
  name: nokogiri
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - '>='
      - !ruby/object:Gem::Version
        version: '1.13'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - '>='
      - !ruby/object:Gem::Version
        version: '1.13'
- !ruby/object:Gem::Dependency
  name: rspec
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - ~>
      - !ruby/object:Gem::Version
        version: '3.12'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - ~>
      - !ruby/object:Gem::Version
        version: '3.12'
- !ruby/object:Gem::Dependency
  name: rubocop
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - ~>
      - !ruby/object:Gem::Version
        version: '1.60'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - ~>
      - !ruby/object:Gem::Version
        version: '1.60'
description: 'Nokogiri-native, dependency-light reading/projection contract over MODS
  v3 documents, shared by the DRS front end (Cerberus) and API backend (Atlas). Pure
  functions over a document: selectors that locate nodes (for editing) and projections
  that return plain data (for indexing/display). No Rails, no persistence, no HTTP.'
email:
- d.cliff@northeastern.edu
executables: []
extensions: []
extra_rdoc_files: []
files:
- .version
- Gemfile
- README.md
- Rakefile
- lib/neu-mods.rb
- lib/neu/mods/canonicalize.rb
- lib/neu/mods/document.rb
- lib/neu/mods/projection.rb
- lib/neu/mods/selectors.rb
- lib/neu/mods/version.rb
homepage:
licenses:
- MIT
metadata:
  rubygems_mfa_required: 'false'
post_install_message:
rdoc_options: []
require_paths:
- lib
required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - '>='
    - !ruby/object:Gem::Version
      version: '3.0'
required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - '>='
    - !ruby/object:Gem::Version
      version: '0'
requirements: []
rubygems_version: 3.0.9
signing_key:
specification_version: 4
summary: Northeastern-flavored MODS XML projection + selection for the DRS.
test_files: []