neu-mods 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: ab1cb07bf4122e98017ded99c790824865f68fb8ccdfdcbb3b5fc8e7d72cbf55
+   data.tar.gz: 6d3b467da6fde096e0ea26f9e76c447ea24c563c6cc5fd1a47e06b8f18d72aca
+ SHA512:
+   metadata.gz: 4c9b5a6006cf0ab3faced2d60f49f703ab482f28f2208d39c5225bd7af769c4f6bba634b0808d83d3b5ac3f2210efca07f8b1ad37976e0ed17d1175d666fcccf
+   data.tar.gz: 2c92060d9faddce439bb9b462982fbb186f6d792f5fc86cfa9702a7cf1c2118c06a3236a5e0875c2499e5593083f531a5668c841696168d330c6527728ca56a7
data/.version ADDED
@@ -0,0 +1 @@
+ 0.1.0
data/Gemfile ADDED
@@ -0,0 +1,8 @@
+ # frozen_string_literal: true
+
+ source "https://rubygems.org"
+
+ # Specify your gem's dependencies in neu-mods.gemspec
+ gemspec
+
+ gem "rake", "~> 13.0"
data/README.md ADDED
@@ -0,0 +1,83 @@
+ # neu-mods
+
+ Northeastern-flavored MODS v3 **projection + selection** for the DRS, shared by
+ [Cerberus](https://github.com/NEU-Libraries/cerberus) (front end) and
+ [Atlas](https://github.com/NEU-Libraries/atlas) (API backend).
+
+ It is a **Nokogiri-native, dependency-light contract over MODS documents** — pure
+ functions over a parsed document, nothing else. No Rails, no persistence, no
+ HTTP. It answers two questions:
+
+ - **"Where is X?"** — `Selectors` return *live Nokogiri nodes*, so they serve both
+ the read path (projection reads their text) and the write path (an editor
+ mutates the returned node in place). The node an editor changes is provably the
+ node the projection reads.
+ - **"What does this project to?"** — `Projection` returns *plain data*
+ (hashes/strings/arrays — never opaque typed objects) for indexing/display.
+
+ It depends on **Nokogiri alone** — deliberately *not* the `sul-dlss/mods` +
+ `nom-xml` stack (which is sunsetting alongside Stanford's move to Cocina). See
+ the design note in the DRS gap-reports for the full rationale.
+
+ ## Usage
+
+ ```ruby
+ require "neu-mods"
+
+ doc = NEU::MODS::Document.parse(xml_string)
+
+ # Projection (plain data)
+ doc.plain_title # => "What's New - How We Respond to Disaster, Episode 1"
+ doc.title_parts # => { non_sort:, subtitle:, title:, part_name:, part_number: }
+ doc.abstract # => normalized, paragraph-joined String
+ doc.topical_subjects # => ["Civil society", ...] (every <topic>, for the access copy)
+ doc.keywords # => [...] (only the editable attribute-free keyword subjects)
+ doc.to_h # => full projection, keyed to Atlas's Metadata::MODS attributes
+
+ # Selectors (live nodes — for editing)
+ node = doc.primary_title_info.at_xpath("mods:title", NEU::MODS::NAMESPACE)
+ node.content = "New Title" unless NEU::MODS.whitespace_equivalent?(node.text, "New Title")
+ doc.to_xml
+ ```
+
+ ## Two normalizers, two jobs
+
+ - `NEU::MODS.whitespace_equivalent?` / `.canonical_ws` — the **no-op guard**: did an
+ edit change anything, or only insignificant whitespace? (Used to avoid minting
+ an unchanged OCFL MODS version.)
+ - `NEU::MODS.normalize_paragraphs` / `.normalize` — clean **curator freetext** for
+ the JSON/Solr access copy (dash/smart-punctuation transliteration, control
+ stripping, paragraph handling). The XML preservation copy is never touched.
+
+ ## Behavior fidelity & known caveats
+
+ The projection is **behavior-preserving** with Atlas's prior `mods`-gem-based
+ extraction, pinned by `spec/conformance_spec.rb` against `work-mods.xml`. Two
+ intentional notes:
+
+ - **Name display** reproduces the `mods` gem's `display_value_w_date` *including
+ its quirks* (e.g. multiple `given` nameParts concatenate with no separator),
+ to preserve existing Solr/display output. Cleanups are a deliberate future
+ contract change, not a silent one.
+ - **Roles & languages** read the `type="text"` term and fall back to the **raw
+ code** — they are *not* MARC-relator / ISO-639 translated. Records carrying
+ text forms (the norm) are unaffected; code-only records would differ. Vendoring
+ those lookup tables (or depending on `iso-639`) is deferred to keep the gem
+ Nokogiri-only and small.
+
+ ## Source convention
+
+ Every character-class regex in `TextNormalizer` is built **programmatically from
+ codepoints**, so the source stays pure ASCII (no literal smart-quotes/dashes, no
+ raw control bytes). A spec enforces this. Keep it that way.
+
+ ## Development
+
+ ```bash
+ bundle install
+ bundle exec rspec
+ bundle exec rubocop
+ ```
+
+ Versioned via the `.version` file (read by `lib/neu/mods/version.rb`); released
+ with `bundler/gem_tasks` (`rake release`), mirroring `atlas_rb`.
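
Note on the README's no-op guard: the behavior can be tried without installing the gem. The following is a minimal sketch, assuming a hypothetical `WsGuardSketch` module that copies the `canonical_ws` logic shown in `lib/neu/mods/canonicalize.rb`. The `apply_edit` helper and its return shape are illustrative assumptions and are not part of the package.

```ruby
# Sketch only: WsGuardSketch and apply_edit are hypothetical names for illustration.
module WsGuardSketch
  NBSP = [0xA0].pack("U") # U+00A0, built from its codepoint so the source stays ASCII

  module_function

  # Fold NBSP to a plain space, collapse whitespace runs, trim both ends.
  def canonical_ws(str)
    str.to_s.tr(NBSP, " ").gsub(/\s+/, " ").strip
  end

  def whitespace_equivalent?(current, incoming)
    canonical_ws(current) == canonical_ws(incoming)
  end

  # Return the value to keep plus a flag saying whether anything really changed.
  def apply_edit(current, incoming)
    return [current, false] if whitespace_equivalent?(current, incoming)

    [incoming, true]
  end
end

stored = "What's  New\u00A0- Episode 1 "
p WsGuardSketch.apply_edit(stored, "What's New - Episode 1") # whitespace-only difference
p WsGuardSketch.apply_edit(stored, "What's New - Episode 2") # real change
```

In this sketch the first call keeps the stored value and reports no change, which is the case where a caller would skip writing a new version.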
data/Rakefile ADDED
@@ -0,0 +1,12 @@
+ # frozen_string_literal: true
+
+ require "bundler/gem_tasks"
+ require "rspec/core/rake_task"
+
+ RSpec::Core::RakeTask.new(:spec)
+
+ require "rubocop/rake_task"
+
+ RuboCop::RakeTask.new
+
+ task default: %i[spec rubocop]
data/lib/neu/mods/canonicalize.rb ADDED
@@ -0,0 +1,143 @@
+ # frozen_string_literal: true
+
+ module NEU
+   module MODS
+     # Lightweight whitespace canonicalization used by the *no-op guard* -- does an
+     # edit actually change anything, or only insignificant whitespace? (Cerberus's
+     # MODSMerge uses this to avoid minting an unchanged OCFL MODS version.) This is
+     # deliberately distinct from TextNormalizer below: this one only folds
+     # whitespace; TextNormalizer cleans curator freetext for the access copy.
+     module Canonicalize
+       module_function
+
+       NBSP = [0xA0].pack("U") # U+00A0 non-breaking space, built from codepoint
+
+       # \s doesn't match U+00A0 (NBSP) in Ruby's default mode, so fold NBSP to a
+       # plain space first, then collapse any whitespace run to one space + strip.
+       def canonical_ws(str)
+         str.to_s.tr(NBSP, " ").gsub(/\s+/, " ").strip
+       end
+
+       # Treat values differing only by insignificant whitespace (NBSP vs space,
+       # collapsible runs, leading/trailing) as equal.
+       def whitespace_equivalent?(current, incoming)
+         canonical_ws(current) == canonical_ws(incoming)
+       end
+     end
+
+     # Normalises curator-authored freetext on the way into the JSON access copy
+     # (and Solr); the XML preservation copy stays untouched. Ported from Atlas's
+     # TextNormalizer (which carries DRS v1 prior art) so the gem reproduces Atlas's
+     # projection byte-for-byte.
+     #
+     # IMPORTANT: every character-class regex is built *programmatically* from
+     # codepoint lists via `format('\\u%04X', cp)`, so this source file stays pure
+     # ASCII -- no literal smart-quotes, dashes, or (critically) raw control bytes
+     # land on disk. Keep it that way.
+     #
+     # Pipeline: force UTF-8 + scrub invalid bytes; NFC; map Unicode dashes to '-'
+     # (swung-dash to '~'); transliterate the General Punctuation block (smart
+     # quotes, ellipsis, etc.) to ASCII; strip C0/C1 controls (keeping tab/newline);
+     # collapse horizontal-whitespace runs to one space; for paragraph fields,
+     # collapse 2+ newlines to exactly two; strip.
+     #
+     #   .normalize(str)            -- single-line fields (newlines -> spaces)
+     #   .normalize_paragraphs(str) -- fields that may carry paragraph breaks
+     #                                 (abstract, accessCondition)
+     module TextNormalizer
+       module_function
+
+       # Build a character-class Regexp from an array of integer codepoints, as
+       # \uXXXX escapes (keeps this source ASCII).
+       def self.char_class(codepoints, prefix: "")
+         Regexp.new("[#{prefix}#{codepoints.map { |cp| format('\\u%04X', cp) }.join}]")
+       end
+
+       # NOTE: U+2053 (swung dash) is intentionally excluded from dashes -- it is
+       # named "dash" but conventionally maps to ASCII '~', not '-' (V1 prior art).
+       DASH_CODEPOINTS = [
+         0x002D, 0x00AD, 0x058A, 0x05BE, 0x1400, 0x1806,
+         0x2010, 0x2011, 0x2012, 0x2013, 0x2014, 0x2015,
+         0x2043, 0x207B, 0x208B, 0x2212,
+         0x2E17, 0x2E1A, 0x2E3A, 0x2E3B, 0x2E40,
+         0x301C, 0x3030, 0x30A0, 0xFE31, 0xFE32, 0xFE58,
+         0xFE63, 0xFF0D
+       ].freeze
+       DASH_RE = char_class(DASH_CODEPOINTS).freeze
+
+       SWUNG_DASH_RE = Regexp.new(format('\\u%04X', 0x2053)).freeze
+
+       # C0 (U+0000..U+0008, U+000B..U+001F), DEL (U+007F), and C1 (U+0080..U+009F).
+       # U+0009 (tab) and U+000A (newline) are preserved.
+       CONTROL_CODEPOINTS = ((0x0000..0x0008).to_a + (0x000B..0x001F).to_a + (0x007F..0x009F).to_a).freeze
+       CONTROL_RE = char_class(CONTROL_CODEPOINTS).freeze
+
+       HORIZONTAL_WS_CODEPOINTS = [
+         0x0009, 0x00A0, 0x1680,
+         0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006,
+         0x2007, 0x2008, 0x2009, 0x200A, 0x202F, 0x205F, 0x3000
+       ].freeze
+       # Leading literal space included in the class (the " " prefix); `+` so a run
+       # of horizontal whitespace collapses to a single space.
+       HORIZONTAL_WS_RE = Regexp.new("#{char_class(HORIZONTAL_WS_CODEPOINTS, prefix: " ").source}+").freeze
+
+       PARAGRAPH_RUN_RE = /\n{2,}/
+
+       # General Punctuation block (U+2000..U+206F). Codepoints not listed pass
+       # through unchanged. Empty-string values deliberately drop invisible/bidi/
+       # format marks so they cannot leak into the access copy.
+       GENERAL_PUNCTUATION = {
+         0x2000 => " ", 0x2001 => " ", 0x2002 => " ", 0x2003 => " ",
+         0x2004 => " ", 0x2005 => " ", 0x2006 => " ", 0x2007 => " ",
+         0x2008 => " ", 0x2009 => " ", 0x200A => " ",
+         0x200B => "", 0x200C => "", 0x200D => "",
+         0x200E => "", 0x200F => "",
+         0x2018 => "'", 0x2019 => "'", 0x201A => ",", 0x201B => "'",
+         0x201C => '"', 0x201D => '"', 0x201E => '"', 0x201F => '"',
+         0x2020 => "+", 0x2021 => "+",
+         0x2022 => "*", 0x2023 => "*", 0x2024 => ".", 0x2025 => "..",
+         0x2026 => "...",
+         0x2028 => "\n", 0x2029 => "\n\n",
+         0x202A => "", 0x202B => "", 0x202C => "", 0x202D => "",
+         0x202E => "", 0x202F => " ",
+         0x2030 => "%", 0x2032 => "'", 0x2033 => '"', 0x2035 => "'",
+         0x2036 => '"',
+         0x2039 => "<", 0x203A => ">", 0x203C => "!!", 0x203D => "?",
+         0x2044 => "/", 0x2052 => "%",
+         0x205F => " ", 0x2060 => "", 0x2061 => "", 0x2062 => "",
+         0x2063 => "", 0x2064 => "",
+         0x206A => "", 0x206B => "", 0x206C => "", 0x206D => "",
+         0x206E => "", 0x206F => ""
+       }.transform_keys { |cp| [cp].pack("U") }.freeze
+       GENERAL_PUNCTUATION_RE = Regexp.new("[#{format('\\u%04X-\\u%04X', 0x2000, 0x206F)}]").freeze
+
+       def normalize(str)
+         return "" if str.nil?
+
+         s = base_normalize(str.to_s)
+         s = s.tr("\n", " ")
+         s.gsub(HORIZONTAL_WS_RE, " ").strip
+       end
+
+       def normalize_paragraphs(str)
+         return "" if str.nil?
+
+         s = base_normalize(str.to_s)
+         s = s.gsub(HORIZONTAL_WS_RE, " ")
+         s = s.gsub(/ *\n */, "\n")
+         s.split(PARAGRAPH_RUN_RE).map { |p| p.tr("\n", " ").strip }
+           .reject(&:empty?).join("\n\n")
+       end
+
+       def base_normalize(str)
+         s = str.dup.force_encoding("UTF-8")
+         s = s.scrub("")
+         s = s.unicode_normalize(:nfc)
+         s = s.gsub(DASH_RE, "-")
+         s = s.gsub(SWUNG_DASH_RE, "~")
+         s = s.gsub(GENERAL_PUNCTUATION_RE) { |c| GENERAL_PUNCTUATION.fetch(c, c) }
+         s.gsub(CONTROL_RE, "")
+       end
+     end
+   end
+ end
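
Note on the codepoint-built regexes: the technique in `TextNormalizer` (building character classes from integer codepoints so the file stays ASCII, then splitting on paragraph runs) can be seen in a much smaller form. The following is a rough sketch under stated assumptions: `NormalizerSketch` is a hypothetical, cut-down stand-in that handles only three dashes and four quote characters, and it omits the scrub, control-stripping, and full General Punctuation steps of the real pipeline.

```ruby
# Sketch only: NormalizerSketch is a hypothetical, reduced stand-in for TextNormalizer.
module NormalizerSketch
  # Build a character-class Regexp from integer codepoints as \uXXXX escapes.
  def self.char_class(codepoints)
    Regexp.new("[#{codepoints.map { |cp| format('\\u%04X', cp) }.join}]")
  end

  DASH_RE = char_class([0x2013, 0x2014, 0x2212]) # en dash, em dash, minus sign

  # Keys are converted from codepoints to one-character strings.
  QUOTES = {
    0x2018 => "'", 0x2019 => "'", 0x201C => '"', 0x201D => '"'
  }.transform_keys { |cp| [cp].pack("U") }.freeze
  QUOTE_RE = char_class(QUOTES.keys.map(&:ord))

  module_function

  def normalize_paragraphs(str)
    s = str.to_s.unicode_normalize(:nfc)
    s = s.gsub(DASH_RE, "-").gsub(QUOTE_RE) { |c| QUOTES.fetch(c, c) }
    s = s.gsub(/[ \t]+/, " ").gsub(/ *\n */, "\n")
    # Two or more newlines separate paragraphs; single newlines become spaces.
    s.split(/\n{2,}/).map { |para| para.tr("\n", " ").strip }.reject(&:empty?).join("\n\n")
  end
end

text = "\u201CFirst\u201D \u2014 line\none\n\n\n\nSecond  paragraph"
puts NormalizerSketch.normalize_paragraphs(text)
```

The input string uses `\u` escapes for the same reason the package does: no literal smart punctuation appears in the source text of the example.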
data/lib/neu/mods/document.rb ADDED
@@ -0,0 +1,38 @@
+ # frozen_string_literal: true
+
+ require "nokogiri"
+
+ module NEU
+   module MODS
+     # The gem's main entry point: a thin facade over a parsed MODS document.
+     #
+     #   doc = NEU::MODS::Document.parse(xml)
+     #   doc.plain_title        # => composed display title
+     #   doc.to_h               # => full read projection
+     #   doc.primary_title_info # => a live Nokogiri node (for editing)
+     #
+     # Selectors return live nodes (shared by read and write); projection methods
+     # return plain data. Parsing uses `&:noblanks` to match Atlas's read and to
+     # avoid spurious whitespace-only text nodes.
+     class Document
+       include Selectors
+       include Projection
+
+       attr_reader :doc
+
+       def self.parse(xml)
+         new(Nokogiri::XML(xml.to_s, &:noblanks))
+       end
+
+       # Wrap an already-parsed Nokogiri document (used by writers that own the doc
+       # they're mutating, so selectors and serialization share one instance).
+       def initialize(nokogiri_doc)
+         @doc = nokogiri_doc
+       end
+
+       def to_xml(...)
+         doc.to_xml(...)
+       end
+     end
+   end
+ end
data/lib/neu/mods/projection.rb ADDED
@@ -0,0 +1,273 @@
+ # frozen_string_literal: true
+
+ require "date"
+
+ module NEU
+   module MODS
+     # Node -> plain data. The read contract: what a MODS document *projects to* for
+     # indexing/display. Behavior-preserving with Atlas's prior `mods`-gem-based
+     # extraction (verified by the conformance corpus), reimplemented in Nokogiri so
+     # DRS depends on Nokogiri alone. Mixed into Document; operates on `doc`.
+     #
+     # Empty-value conventions mirror Atlas: scalar fields are "" when absent
+     # (matching `.text.squish` on an empty node set), except `permanent_url` and
+     # `date_created`, which are nil when their node is absent. Arrays are [].
+     module Projection
+       # --- Title ---------------------------------------------------------------
+
+       # Structured primary-title parts. nil for an absent part (the Cerberus form
+       # treats nil as "not present"); to_h coerces to "" for the Atlas main_title.
+       def title_parts
+         ti = primary_title_info
+         {
+           non_sort: child_text(ti, "mods:nonSort"),
+           subtitle: child_text(ti, "mods:subTitle"),
+           title: child_text(ti, "mods:title"),
+           part_name: child_text(ti, "mods:partName"),
+           part_number: child_text(ti, "mods:partNumber")
+         }
+       end
+
+       # Composed display title (the former Atlas MODSDecoration#plain_title), driven
+       # off the scoped primary title.
+       def plain_title
+         p = title_parts
+         return "" if blank?(p[:title])
+
+         "#{p[:non_sort]}#{p[:title]}" +
+           prefix(": ", p[:subtitle]) +
+           prefix(" - ", p[:part_name]) +
+           prefix(", ", p[:part_number])
+       end
+
+       # --- Abstract / access ---------------------------------------------------
+
+       def abstract
+         join_paragraphs(abstract_nodes)
+       end
+
+       def access_condition
+         join_paragraphs(doc.xpath("/mods:mods/mods:accessCondition", NAMESPACE))
+       end
+
+       # --- Subjects ------------------------------------------------------------
+
+       # The editable free-text keyword set (Cerberus simple form): topics under the
+       # attribute-free keyword subjects only.
+       def keywords
+         keyword_subjects.flat_map { |s| s.xpath("mods:topic", NAMESPACE).map { |t| t.text.strip } }
+       end
+
+       # Every <topic> under any top-level <subject> (the access-copy projection,
+       # equivalent to Atlas's extract_topical_subjects).
+       def topical_subjects
+         doc.xpath("/mods:mods/mods:subject/mods:topic", NAMESPACE).map { |t| clean(t.text) }
+       end
+
+       # --- Names ---------------------------------------------------------------
+
+       # All top-level names as { name:, role: }. `name` reproduces the `mods` gem's
+       # display_value_w_date (including its quirks -- faithfully, so existing Solr/
+       # display output is preserved). `role` prefers the type="text" roleTerm,
+       # falling back to the raw code (NOT MARC-relator-translated -- see README).
+       def names
+         doc.xpath("/mods:mods/mods:name", NAMESPACE).map do |node|
+           { name: name_display_value_w_date(node), role: name_role(node) }
+         end
+       end
+
+       # --- Scalars / simple arrays --------------------------------------------
+
+       def languages
+         doc.xpath("/mods:mods/mods:language", NAMESPACE).map do |lang|
+           term = lang.at_xpath("mods:languageTerm[@type='text']", NAMESPACE) ||
+                  lang.at_xpath("mods:languageTerm", NAMESPACE)
+           clean(term&.text)
+         end.compact
+       end
+
+       def resource_type = text_at("/mods:mods/mods:typeOfResource")
+       def format = text_at("/mods:mods/mods:physicalDescription/mods:form")
+       def extent = text_at("/mods:mods/mods:physicalDescription/mods:extent")
+       def digital_origin = text_at("/mods:mods/mods:physicalDescription/mods:digitalOrigin")
+
+       def genres
+         doc.xpath("/mods:mods/mods:genre", NAMESPACE).map { |g| clean(g.text) }
+       end
+
+       def related_series
+         doc.xpath("/mods:mods/mods:relatedItem[@type='series']/mods:titleInfo/mods:title", NAMESPACE)
+            .map { |t| clean(t.text) }
+       end
+
+       def identifiers
+         doc.xpath("/mods:mods/mods:identifier", NAMESPACE).map { |i| clean(i.text) }
+       end
+
+       def permanent_url
+         node = doc.at_xpath("/mods:mods/mods:identifier[@type='hdl']", NAMESPACE)
+         node && clean(node.text)
+       end
+
+       # Parsed dateCreated; nil if originInfo/dateCreated is absent or blank, or ""
+       # if present but unparseable (mirrors Atlas's safe_date_parse rescue).
+       def date_created
+         node = doc.at_xpath("/mods:mods/mods:originInfo/mods:dateCreated", NAMESPACE)
+         return nil unless node
+
+         str = NEU::MODS.canonical_ws(node.text)
+         return nil if str.empty?
+
+         begin
+           DateTime.parse(str)
+         rescue Date::Error
+           ""
+         end
+       end
+
+       # --- Full projection -----------------------------------------------------
+
+       # The complete read projection, keyed to Atlas's Metadata::MODS attribute
+       # names -- a drop-in source for `convert_xml_to_json`.
+       def to_h
+         {
+           main_title: title_parts.transform_values(&:to_s),
+           names: names,
+           languages: languages,
+           date_created: date_created,
+           resource_type: resource_type,
+           genres: genres,
+           format: format,
+           extent: extent,
+           digital_origin: digital_origin,
+           abstract: abstract,
+           related_series: related_series,
+           topical_subjects: topical_subjects,
+           identifiers: identifiers,
+           permanent_url: permanent_url,
+           access_condition: access_condition
+         }
+       end
+
+       private
+
+       # --- helpers -------------------------------------------------------------
+
+       def text_at(xpath)
+         node = doc.at_xpath(xpath, NAMESPACE)
+         node ? NEU::MODS.canonical_ws(node.text) : ""
+       end
+
+       def child_text(parent, xpath)
+         return nil unless parent
+
+         node = parent.at_xpath(xpath, NAMESPACE)
+         return nil unless node
+
+         v = NEU::MODS.canonical_ws(node.text)
+         v.empty? ? nil : v
+       end
+
+       # canonical_ws, but nil for blank (used where an absent member must drop out).
+       def clean(str)
+         return nil if str.nil?
+
+         v = NEU::MODS.canonical_ws(str)
+         v.empty? ? nil : v
+       end
+
+       def blank?(str)
+         str.nil? || str.strip.empty?
+       end
+
+       def prefix(sep, val)
+         blank?(val) ? "" : "#{sep}#{val}"
+       end
+
+       def join_paragraphs(nodes)
+         nodes.map { |n| NEU::MODS.normalize_paragraphs(n.text) }.reject(&:empty?).join("\n\n")
+       end
+
+       # --- name display (faithful port of mods gem display_value_w_date) -------
+
+       def name_display_value_w_date(node)
+         dv = name_display_value(node)
+         node.xpath("mods:namePart[@type='date']", NAMESPACE).each do |np|
+           d = np.text
+           dv += ", #{d}" unless d.empty? || dv.end_with?(d)
+         end
+         dv = dv.sub(/\A, /, "")
+         dv.strip.empty? ? nil : dv.strip
+       end
+
+       def name_display_value(node)
+         display_form = node.at_xpath("mods:displayForm", NAMESPACE)
+         return display_form.text if display_form && !display_form.text.empty?
+
+         if node["type"] == "personal"
+           personal_display_value(node)
+         else
+           non_date_parts_joined(node)
+         end
+       end
+
+       def personal_display_value(node)
+         family = joined_parts(node, "family")
+         given = joined_parts(node, "given")
+         dv =
+           if family.empty?
+             given
+           else
+             given.empty? ? family : "#{family}, #{given}"
+           end
+
+         return non_date_parts_joined(node) if dv.empty?
+
+         append_terms_of_address(node, dv)
+       end
+
+       def append_terms_of_address(node, dv)
+         first = true
+         node.xpath("mods:namePart[@type='termsOfAddress']", NAMESPACE).each do |np|
+           next if np.text.empty?
+
+           dv += first ? " #{np.text}" : ", #{np.text}"
+           first = false
+         end
+         dv
+       end
+
+       # NodeSet-style concatenation: the `mods` gem joins same-typed nameParts via
+       # NodeSet#text (no separator) -- e.g. two `given` parts become "A.(B)". We
+       # reproduce that (quirk included) to stay behavior-preserving.
+       def joined_parts(node, type)
+         node.xpath("mods:namePart[@type='#{type}']", NAMESPACE).map(&:text).join
+       end
+
+       def non_date_parts_joined(node)
+         node.xpath("mods:namePart", NAMESPACE)
+             .reject { |np| np["type"] == "date" || np.text.empty? }
+             .map(&:text).join(" ")
+       end
+
+       def name_role(node)
+         node.xpath("mods:role", NAMESPACE).each do |role|
+           val = role_term_value(role)
+           return val if val
+         end
+         nil
+       end
+
+       # Prefer the type="text" roleTerm; fall back to the raw type="code" term
+       # (NOT MARC-relator-translated -- see README). nil if neither is present.
+       def role_term_value(role)
+         %w[text code].each do |type|
+           term = role.at_xpath("mods:roleTerm[@type='#{type}']", NAMESPACE)
+           text = term&.text.to_s.strip
+           return text unless text.empty?
+         end
+         nil
+       end
+     end
+   end
+ end
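
Note on title composition: `plain_title` joins the structured parts with fixed separators and skips any part that is blank. The following is a minimal sketch of that rule, assuming a hypothetical `TitleSketch` module that takes an already-extracted parts hash instead of a parsed MODS document. The sample values are assumptions chosen to resemble the README's example title, not data taken from the conformance fixture.

```ruby
# Sketch only: TitleSketch is a hypothetical stand-in for Projection#plain_title.
module TitleSketch
  module_function

  def blank?(str)
    str.nil? || str.strip.empty?
  end

  # Emit the separator only when the part is present.
  def prefix(sep, val)
    blank?(val) ? "" : "#{sep}#{val}"
  end

  def plain_title(parts)
    return "" if blank?(parts[:title])

    # non_sort is joined to the title with no separator, as in the package code.
    "#{parts[:non_sort]}#{parts[:title]}" +
      prefix(": ", parts[:subtitle]) +
      prefix(" - ", parts[:part_name]) +
      prefix(", ", parts[:part_number])
  end
end

parts = {
  non_sort: nil, title: "What's New", subtitle: nil,
  part_name: "How We Respond to Disaster", part_number: "Episode 1"
}
puts TitleSketch.plain_title(parts)
```

A missing `title` short-circuits to an empty string even if other parts are present, which matches the early return in the package's method.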
data/lib/neu/mods/selectors.rb ADDED
@@ -0,0 +1,52 @@
+ # frozen_string_literal: true
+
+ module NEU
+   module MODS
+     # Node LOCATION over a parsed MODS document. These return live Nokogiri nodes,
+     # so they serve BOTH the read path (projection reads their text) AND the write
+     # path (Cerberus's MODSMerge mutates the returned nodes in place). That shared
+     # definition is the point: the node an editor changes is provably the node the
+     # projection reads. Mixed into Document; operates on `doc`.
+     module Selectors
+       # Top-level primary titleInfo, falling back to the first top-level titleInfo.
+       # Scoped to direct children of <mods:mods> so a relatedItem's nested
+       # titleInfo (e.g. a series title) is never matched.
+       def primary_title_info
+         doc.at_xpath("/mods:mods/mods:titleInfo[@usage='primary']", NAMESPACE) ||
+           doc.at_xpath("/mods:mods/mods:titleInfo", NAMESPACE)
+       end
+
+       # All top-level <abstract> elements (MODS permits several).
+       def abstract_nodes
+         doc.xpath("/mods:mods/mods:abstract", NAMESPACE)
+       end
+
+       # The "keyword" subjects the simple form manages: attribute-free <subject>
+       # elements whose element children are all <topic>. Anything with an
+       # authority/valueURI (or a non-topic child, e.g. a <name> subject) is curated
+       # and left untouched. (Distinct from the projection's #topical_subjects,
+       # which harvests *every* <topic> for the access copy.)
+       def keyword_subjects
+         doc.xpath("/mods:mods/mods:subject", NAMESPACE).select { |s| keyword_subject?(s) }
+       end
+
+       # Build a namespaced MODS element reusing the document's existing `mods:`
+       # namespace declaration (so new nodes never re-declare xmlns).
+       def build_node(name, text = nil)
+         node = Nokogiri::XML::Node.new(name, doc)
+         node.namespace = doc.root.namespace_definitions.find { |d| d.prefix == "mods" }
+         node.content = text unless text.nil?
+         node
+       end
+
+       private
+
+       def keyword_subject?(subject)
+         return false if subject.attributes.any?
+
+         topics = subject.element_children
+         topics.any? && topics.all? { |c| c.name == "topic" }
+       end
+     end
+   end
+ end
data/lib/neu/mods/version.rb ADDED
@@ -0,0 +1,10 @@
+ # frozen_string_literal: true
+
+ module NEU
+   module MODS
+     # Current gem version, read from the `.version` file at the repo root at load
+     # time (mirrors the atlas_rb convention so a single `.version` bump drives the
+     # gem version + `bundler/gem_tasks` release).
+     VERSION = File.read(File.expand_path("../../../.version", __dir__)).strip
+   end
+ end
data/lib/neu-mods.rb ADDED
@@ -0,0 +1,37 @@
+ # frozen_string_literal: true
+
+ require_relative "neu/mods/version"
+ require_relative "neu/mods/canonicalize"
+ require_relative "neu/mods/selectors"
+ require_relative "neu/mods/projection"
+ require_relative "neu/mods/document"
+
+ # Northeastern-flavored MODS v3 projection + selection for the DRS.
+ #
+ # A Nokogiri-native, dependency-light *reading/projection contract* over MODS
+ # documents, shared by Cerberus (front end) and Atlas (API backend). Pure
+ # functions over a document -- no Rails, no persistence, no HTTP. It answers two
+ # questions and nothing else:
+ #
+ #   * "Where is X?"                -> Selectors  (live nodes; serve read AND write)
+ #   * "What does this project to?" -> Projection (plain data; for index/display)
+ #
+ # Top-level conveniences delegate to the canonicalization helpers so callers can
+ # write `NEU::MODS.whitespace_equivalent?(a, b)` etc. without reaching into the
+ # submodules.
+ module NEU
+   module MODS
+     # The MODS v3 namespace, as a Nokogiri xpath namespace hash.
+     NAMESPACE = { "mods" => "http://www.loc.gov/mods/v3" }.freeze
+
+     module_function
+
+     # Whitespace no-op guard (see Canonicalize).
+     def canonical_ws(str) = Canonicalize.canonical_ws(str)
+     def whitespace_equivalent?(current, incoming) = Canonicalize.whitespace_equivalent?(current, incoming)
+
+     # Curator-freetext normalization for the access copy (see TextNormalizer).
+     def normalize(str) = TextNormalizer.normalize(str)
+     def normalize_paragraphs(str) = TextNormalizer.normalize_paragraphs(str)
+   end
+ end
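
Note on the top-level conveniences: `lib/neu-mods.rb` combines `module_function` with endless method definitions (Ruby 3.0+) to forward calls to a submodule. The following is a small sketch of that pattern in isolation; `Outer`, `Outer::Helper`, and `squeeze` are hypothetical names used only for illustration.

```ruby
# Sketch only: Outer, Outer::Helper and squeeze are hypothetical names.
module Outer
  module Helper
    module_function

    def squeeze(str)
      str.to_s.gsub(/\s+/, " ").strip
    end
  end

  module_function

  # Endless method definition forwarding to the submodule. Because it follows
  # `module_function`, it is callable as Outer.squeeze and is also copied as a
  # private instance method for anything that includes Outer.
  def squeeze(str) = Helper.squeeze(str)
end

puts Outer.squeeze("  a   b ")
```

One consequence of this design is that callers never need to name the submodule, so the helper modules can be reorganized without changing call sites.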
metadata ADDED
@@ -0,0 +1,99 @@
+ --- !ruby/object:Gem::Specification
+ name: neu-mods
+ version: !ruby/object:Gem::Version
+   version: 0.1.0
+ platform: ruby
+ authors:
+ - David Cliff
+ autorequire:
+ bindir: bin
+ cert_chain: []
+ date: 2026-06-09 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: nokogiri
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - '>='
+       - !ruby/object:Gem::Version
+         version: '1.13'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - '>='
+       - !ruby/object:Gem::Version
+         version: '1.13'
+ - !ruby/object:Gem::Dependency
+   name: rspec
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: '3.12'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: '3.12'
+ - !ruby/object:Gem::Dependency
+   name: rubocop
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: '1.60'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: '1.60'
+ description: 'Nokogiri-native, dependency-light reading/projection contract over MODS
+   v3 documents, shared by the DRS front end (Cerberus) and API backend (Atlas). Pure
+   functions over a document: selectors that locate nodes (for editing) and projections
+   that return plain data (for indexing/display). No Rails, no persistence, no HTTP.'
+ email:
+ - d.cliff@northeastern.edu
+ executables: []
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - .version
+ - Gemfile
+ - README.md
+ - Rakefile
+ - lib/neu-mods.rb
+ - lib/neu/mods/canonicalize.rb
+ - lib/neu/mods/document.rb
+ - lib/neu/mods/projection.rb
+ - lib/neu/mods/selectors.rb
+ - lib/neu/mods/version.rb
+ homepage:
+ licenses:
+ - MIT
+ metadata:
+   rubygems_mfa_required: 'false'
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - '>='
+     - !ruby/object:Gem::Version
+       version: '3.0'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - '>='
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubygems_version: 3.0.9
+ signing_key:
+ specification_version: 4
+ summary: Northeastern-flavored MODS XML projection + selection for the DRS.
+ test_files: []