traject 3.1.0.rc1 → 3.5.0
- checksums.yaml +4 -4
- data/.github/workflows/ruby.yml +26 -0
- data/CHANGES.md +46 -0
- data/README.md +3 -1
- data/doc/settings.md +5 -1
- data/doc/xml.md +10 -0
- data/lib/traject/command_line.rb +34 -43
- data/lib/traject/indexer.rb +12 -4
- data/lib/traject/macros/marc21.rb +3 -3
- data/lib/traject/macros/marc21_semantics.rb +15 -12
- data/lib/traject/macros/nokogiri_macros.rb +9 -3
- data/lib/traject/marc_extractor.rb +3 -3
- data/lib/traject/nokogiri_reader.rb +8 -1
- data/lib/traject/oai_pmh_nokogiri_reader.rb +9 -3
- data/lib/traject/solr_json_writer.rb +58 -17
- data/lib/traject/version.rb +1 -1
- data/lib/translation_maps/marc_languages.yaml +77 -48
- data/test/command_line_test.rb +51 -0
- data/test/delimited_writer_test.rb +14 -16
- data/test/indexer/class_level_configuration_test.rb +23 -0
- data/test/indexer/macros/macros_marc21_semantics_test.rb +4 -0
- data/test/indexer/nokogiri_indexer_test.rb +35 -0
- data/test/nokogiri_reader_test.rb +10 -0
- data/test/solr_json_writer_test.rb +65 -0
- data/test/test_support/date_resort_to_264.marc +1 -0
- data/traject.gemspec +3 -3
- metadata +32 -23
- data/.travis.yml +0 -16
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 01ca968682bb3fc2a8313131ef6344bfc9e5418b767b2900c3d799caa356d016
+  data.tar.gz: a3fd6c9a3bec88c6ba592500ea170357b059533f28c5ba3fb2fe72de39702a2a
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 93547927e90b7947588c1983bd37de4651722bebaff7aaad2d3965ec46eed8c647971f1ce093beb86a5c47c2efa999516d00fb6956682c98913cd54ea5a1a2b8
+  data.tar.gz: 128ea6e2517711f2324541215f814430f963b0042ae321ebc3f29646398990dc09d2a307acc32fb89a70142d74763fdfee6a61d220043622b5b8b11fbcd645d8
data/.github/workflows/ruby.yml
ADDED
@@ -0,0 +1,26 @@
+name: CI
+
+on:
+  push:
+    branches: [ master ]
+  pull_request:
+    branches: ['**']
+
+jobs:
+  tests:
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        ruby: [ '2.4', '2.5', '2.6', '2.7', 'jruby-9.1', 'jruby-9.2' ]
+    name: Ruby ${{ matrix.ruby }}
+    steps:
+      - uses: actions/checkout@v2
+      - name: Set up Ruby
+        uses: ruby/setup-ruby@v1
+        with:
+          ruby-version: ${{ matrix.ruby }}
+      - name: Install dependencies
+        run: bundle install --jobs 4 --retry 3
+      - name: Run tests
+        run: bundle exec rake
data/CHANGES.md
CHANGED
@@ -1,5 +1,47 @@
 # Changes
 
+## Next
+
+*
+
+*
+
+*
+
+## 3.5.0
+
+* `traject -v` and `traject -h` correctly return 0 exit code indicating success.
+
+* upgrade to slop gem 4.x, which carries with it a slightly different format of human-readable command-line arg errors, should be otherwise invisible.
+
+* the SolrJsonWriter now supports HTTP basic auth credentials embedded in `solr.url` or `solr.update_url`, eg `http://user:pass@example.org/solr` https://github.com/traject/traject/pull/262
+
+
+## 3.4.0
+
+* XML-mode `extract_xpath` now supports extracting attribute values with xpath @attr syntax.
+
+## 3.3.0
+
+* `Traject::Macros::Marc21Semantics.publication_date` now gets date from 264 before 260. https://github.com/traject/traject/pull/233
+
+* Allow hashie 4.x in gemspec https://github.com/traject/traject/pull/234
+
+* Allow `http` gem 4.x versions. https://github.com/traject/traject/pull/236
+
+* Can now call class-level Indexer.configure multiple times https://github.com/sciencehistory/scihist_digicoll/pull/525
+
+## 3.2.0
+
+* NokogiriReader has a "nokogiri.strict_mode" setting. Set to true or string 'true' to ask Nokogiri to parse in strict mode, so it will immediately raise on ill-formed XML, instead of nokogiri's default to do what it can with it. https://github.com/traject/traject/pull/226
+
+* SolrJsonWriter
+
+  * Utility method `delete_all!` sends a delete all query to the Solr URL endpoint. https://github.com/traject/traject/pull/227
+
+  * Allow basic auth configuration of the default http client via `solr_writer.basic_auth_user` and `solr_writer.basic_auth_password`. https://github.com/traject/traject/pull/231
+
+
 ## 3.1.0
 
 ### Added
@@ -24,6 +66,10 @@
 
 * SolrJsonWriter now respects a `solr_writer.http_timeout` setting, in seconds, to be passed to HTTPClient instance. https://github.com/traject/traject/pull/219
 
+* Only runs thread pool shutdown code (and logging) if there is a `solr_writer.batch_size` greater than 0. Keep it out of the logs if it was a no-op anyway.
+
+* Logs at DEBUG level every time it sends an update request to solr
+
 * Nokogiri dependency for the NokogiriReader increased to `~> 1.9`. When using Jruby `each_record_xpath`, resulting yielded documents may have xmlns declarations on different nodes than in MRI (and previous versions of nokogiri), but we could find no way around this with nokogiri >= 1.9.0. The documents should still be semantically equivalent for namespace use. This was necessary to keep JRuby Nokogiri XML working with recent Nokogiri releases. https://github.com/traject/traject/pull/209
 
 * LineWriter guesses better about when to auto-close, and provides an optional explicit setting in case it guesses wrong. (thanks @justinlittman) https://github.com/traject/traject/pull/211
data/README.md
CHANGED
@@ -9,7 +9,7 @@ Traject can also be generalized to a set of tools for getting structured data fr
 **Traject is stable, mature software, that is already being used in production by its authors and several other institutions.**
 
 [](http://badge.fury.io/rb/traject)
-[](https://github.com/traject/traject/actions?query=workflow%3ACI+branch%3Amaster)
 
 
 ## Background/Goals
@@ -175,6 +175,8 @@ TranslationMap use above is just one example of a transformation macro, that tra
 * `append("--after each value")`
 * `gsub(/regex/, "replacement")`
 * `split(" ")`: take values and split them, possibly result in multiple values.
+* `transform(proc)`: transform each existing macro using a proc, kind of like `map`.
+  eg `to_field "something", extract_xml("//author"), transform( ->(author) { "#{author.last}, #{author.first}" })`
 
 You can add on as many transformation macros as you want, they will be applied to output in order.
 
data/doc/settings.md
CHANGED
@@ -83,7 +83,8 @@ settings are applied first of all. It's recommended you use `provide`.
 ### Writing to solr
 
 * `json_writer.pretty_print`: used by the JsonWriter, if set to true, will output pretty printed json (with added whitespace) for easier human readability. Default false.
-
+
+* `solr.url`: URL to connect to a solr instance for indexing, eg http://example.org:8983/solr . Command-line short-cut `-u`. (Can include embedded HTTP basic auth as eg `http://user:pass@example.org/solr`)
 
 * `solr.version`: Set to eg "1.4.0", "4.3.0"; currently un-used, but in the future will control some default settings, and/or sanity check and warn you if you're doing something that might not work with that version of solr. Set now for help in the future.
 
@@ -93,6 +94,9 @@ settings are applied first of all. It's recommended you use `provide`.
 
 * `solr_writer.thread_pool`: defaults to 1 (single bg thread). A thread pool is used for submitting docs to solr. Set to 0 or nil to disable threading. Set to 1, there will still be a single bg thread doing the adds. May make sense to set higher than number of cores on your indexing machine, as these threads will mostly be waiting on Solr. Speed/capacity of your solr might be more relevant. Note that processing_thread_pool threads can end up submitting to solr too, if solr_json_writer.thread_pool is full.
 
+* `solr_writer.basic_auth_user`, `solr_writer.basic_auth_password`: Not set by default but when both are set the default writer is configured with basic auth. You can also just embed basic
+  auth credentials in `solr.url` using standard URI syntax.
+
 
 ### Dealing with MARC data
 
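The embedded-credential form of `solr.url` described above follows standard URI syntax, so the parts can be separated with Ruby's standard `uri` library. The following is only a minimal sketch under that assumption: `split_basic_auth` is a hypothetical helper, not part of traject, and the writer itself may process the URL differently.

```ruby
require 'uri'

# Hypothetical helper (not part of traject): separate HTTP basic auth
# credentials embedded in a URL from the rest of the URL.
def split_basic_auth(url)
  uri = URI.parse(url)
  # Rebuild the URL from its parts, leaving out the userinfo section.
  bare = "#{uri.scheme}://#{uri.host}:#{uri.port}#{uri.path}"
  bare += "?#{uri.query}" if uri.query
  { url: bare, user: uri.user, password: uri.password }
end

parts = split_basic_auth("http://user:pass@example.org/solr")
puts parts[:url]      # the URL without credentials (port made explicit)
puts parts[:user]
puts parts[:password]
```

Keeping credentials out of the URL that gets logged or displayed is the main reason to split them this way; the explicit `solr_writer.basic_auth_user` and `solr_writer.basic_auth_password` settings achieve the same separation at the configuration level.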
data/doc/xml.md
CHANGED
@@ -72,6 +72,16 @@ You can use all the standard transforation macros in Traject::Macros::Transforma
 to_field "something", extract_xpath("//value"), first_only, translation_map("some_map"), default("no value")
 ```
 
+### selecting attribute values
+
+Just works, using xpath syntax for selecting an attribute:
+
+
+```ruby
+# gets status value in: <oai:header status="something">
+to_field "status", extract_xpath("//oai:record/oai:header/@status")
+```
+
 
 ### selecting non-text nodes
 
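The attribute-versus-element distinction documented above can be illustrated without traject. The sketch below deliberately swaps in REXML for Nokogiri so that it has no extra dependency beyond a standard Ruby install; `xpath_values` and `text_pieces` are hypothetical helpers, and the sample XML omits namespaces for brevity. It shows the idea only (attribute matches yield their value, element matches yield their joined non-blank text) and is not traject's implementation.

```ruby
require 'rexml/document'

# Hypothetical helper: collect all text beneath a node, in document order.
def text_pieces(node)
  node.children.flat_map do |child|
    case child
    when REXML::Text    then [child.value]
    when REXML::Element then text_pieces(child)
    else []
    end
  end
end

# Hypothetical helper: string values for every match of an XPath.
def xpath_values(doc, xpath)
  REXML::XPath.match(doc, xpath).map do |node|
    if node.kind_of?(REXML::Attribute)
      node.value # attribute match: use the attribute value
    else
      # element match: join text, skipping whitespace-only pieces
      text_pieces(node).reject { |s| s =~ /\A\s*\z/ }.join(" ")
    end
  end
end

xml = '<record><header status="deleted"><identifier>oai:1</identifier></header></record>'
doc = REXML::Document.new(xml)
p xpath_values(doc, "//header/@status")
p xpath_values(doc, "//header/identifier")
```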
data/lib/traject/command_line.rb
CHANGED
@@ -29,10 +29,10 @@ module Traject
       self.console = $stderr
 
       self.orig_argv = argv.dup
-      self.remaining_argv = argv
 
-      self.slop = create_slop!
-      self.options =
+      self.slop = create_slop!(argv)
+      self.options = self.slop
+      self.remaining_argv = self.slop.arguments
     end
 
     # Returns true on success or false on failure; may also raise exceptions;
@@ -40,11 +40,11 @@ module Traject
     def execute
       if options[:version]
         self.console.puts "traject version #{Traject::VERSION}"
-        return
+        return true
       end
       if options[:help]
-        self.console.puts slop.
-        return
+        self.console.puts slop.to_s
+        return true
       end
 
 
@@ -179,11 +179,11 @@ module Traject
     end
 
     def arg_check!
-      if options[:command] == "process" && (options[:conf]
+      if options[:command] == "process" && (!options[:conf] || options[:conf].length == 0)
         self.console.puts "Error: Missing required configuration file"
         self.console.puts "Exiting..."
         self.console.puts
-        self.console.puts self.slop.
+        self.console.puts self.slop.to_s
         exit 2
       end
     end
@@ -234,28 +234,36 @@ module Traject
     end
 
 
-    def create_slop!
-
-        banner "traject [options] -c configuration.rb [-c config2.rb] file.mrc"
+    def create_slop!(argv)
+      options = Slop::Options.new do |o|
+        o.banner = "traject [options] -c configuration.rb [-c config2.rb] file.mrc"
 
-        on 'v', 'version', "print version information to stderr"
-        on 'd', 'debug', "Include debug log, -s log.level=debug"
-        on 'h', 'help', "print usage information to stderr"
-
-
-
-
-
-
-
-
-
+        o.on '-v', '--version', "print version information to stderr"
+        o.on '-d', '--debug', "Include debug log, -s log.level=debug"
+        o.on '-h', '--help', "print usage information to stderr"
+        o.array '-c', '--conf', 'configuration file path (repeatable)', :delimiter => nil
+        o.string "-i", '--indexer', "Traject indexer class name or shortcut", :default => "marc"
+        o.array "-s", "--setting", "settings: `-s key=value` (repeatable)", :delimiter => nil
+        o.string "-r", "--reader", "Set reader class, shortcut for -s reader_class_name="
+        o.string "-o", "--output_file", "output file for Writer classes that write to files"
+        o.string "-w", "--writer", "Set writer class, shortcut for -s writer_class_name="
+        o.string "-u", "--solr", "Set solr url, shortcut for -s solr.url="
+        o.string "-t", "--marc_type", "xml, json or binary. shortcut for -s marc_source.type="
+        o.array "-I", "--load_path", "append paths to ruby $LOAD_PATH", :delimiter => ":"
 
-
+        o.string "-x", "--command", "alternate traject command: process (default); marcout; commit", :default => "process"
 
-        on "stdin", "read input from stdin"
-        on "debug-mode", "debug logging, single threaded, output human readable hashes"
+        o.on "--stdin", "read input from stdin"
+        o.on "--debug-mode", "debug logging, single threaded, output human readable hashes"
       end
+
+      options.parse(argv)
+    rescue Slop::Error => e
+      self.console.puts "Error: #{e.message}"
+      self.console.puts "Exiting..."
+      self.console.puts
+      self.console.puts options.to_s
+      exit 1
     end
 
     def initialize_indexer!
@@ -267,22 +275,5 @@ module Traject
 
       return indexer
     end
-
-    def parse_options(argv)
-
-      begin
-        self.slop.parse!(argv)
-      rescue Slop::Error => e
-        self.console.puts "Error: #{e.message}"
-        self.console.puts "Exiting..."
-        self.console.puts
-        self.console.puts slop.help
-        exit 1
-      end
-
-      return self.slop.to_hash
-    end
-
-
   end
 end
data/lib/traject/indexer.rb
CHANGED
@@ -190,7 +190,7 @@ class Traject::Indexer
     instance_eval(&block)
   end
 
-  ## Class level configure block accepted too, and applied at instantiation
+  ## Class level configure block(s) accepted too, and applied at instantiation
   # before instance-level configuration.
   #
   # EXPERIMENTAL, implementation may change in ways that affect some uses.
@@ -199,8 +199,14 @@ class Traject::Indexer
   # Note that settings set by 'provide' in subclass can not really be overridden
   # by 'provide' in a next level subclass. Use self.default_settings instead, with
   # call to super.
+  #
+  # You can call this .configure multiple times, blocks are added to a list, and
+  # will be used to initialize an instance in order.
+  #
+  # The main downside of this workaround implementation is performance, even though
+  # defined at load-time on class level, blocks are all executed on every instantiation.
   def self.configure(&block)
-    @
+    (@class_configure_blocks ||= []) << block
   end
 
   def self.apply_class_configure_block(instance)
@@ -208,8 +214,10 @@ class Traject::Indexer
     if self.superclass.respond_to?(:apply_class_configure_block)
       self.superclass.apply_class_configure_block(instance)
     end
-    if @
-
+    if @class_configure_blocks && !@class_configure_blocks.empty?
+      @class_configure_blocks.each do |block|
+        instance.configure(&block)
+      end
     end
   end
 
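The accumulate-then-apply pattern in this file's changes can be tried in isolation. The sketch below is a simplified model with hypothetical class names (`MiniIndexer`, `BaseIndexer`, `ChildIndexer`) and a hypothetical `steps` list; it mirrors the idea that class-level blocks are stored in a list, applied superclass-first, and re-run on every instantiation, but it is not the `Traject::Indexer` code itself.

```ruby
# Hypothetical stand-in for an indexer class; it only models how
# class-level configure blocks accumulate and are applied in order.
class MiniIndexer
  attr_reader :steps

  # Each call appends a block to a per-class list.
  def self.configure(&block)
    (@class_configure_blocks ||= []) << block
  end

  def self.apply_class_configure_blocks(instance)
    # Superclass blocks run first, so subclasses can build on them.
    if superclass.respond_to?(:apply_class_configure_blocks)
      superclass.apply_class_configure_blocks(instance)
    end
    (@class_configure_blocks || []).each { |block| instance.configure(&block) }
  end

  def initialize
    @steps = []
    # Blocks are executed again for every new instance.
    self.class.apply_class_configure_blocks(self)
  end

  def configure(&block)
    instance_eval(&block)
  end
end

class BaseIndexer < MiniIndexer
  configure { steps << :base_one }
  configure { steps << :base_two }
end

class ChildIndexer < BaseIndexer
  configure { steps << :child }
end

p ChildIndexer.new.steps
```

Because the blocks are stored per class and replayed per instance, the cost noted in the source comment (all blocks executed on every instantiation) is visible here too.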
data/lib/traject/macros/marc21.rb
CHANGED
@@ -15,7 +15,7 @@ module Traject::Macros
     # field/substring specification.
     #
     # First argument is a string spec suitable for the MarcExtractor, see
-    # MarcExtractor::
+    # Traject::MarcExtractor::Spec.
     #
     # Second arg is optional options, including options valid on MarcExtractor.new,
     # and others. By default, will de-duplicate results, but see :allow_duplicates
@@ -42,11 +42,11 @@ module Traject::Macros
     #
     # * :translation_map => String: translate with named translation map looked up in load
     # path, uses Traject::TranslationMap.new(translation_map_arg).
-    # **Instead**, use `extract_marc(whatever), translation_map(translation_map_arg)
+    # **Instead**, use `extract_marc(whatever), translation_map(translation_map_arg)`
     #
     # * :trim_punctuation => true; trims leading/trailing punctuation using standard algorithms that
     # have shown themselves useful with Marc, using Marc21.trim_punctuation. **Instead**, use
-    # `extract_marc(whatever), trim_punctuation
+    # `extract_marc(whatever), trim_punctuation`
     #
     # * :default => String: if otherwise empty, add default value. **Instead**, use `extract_marc(whatever), default("default value")`
     #
data/lib/traject/macros/marc21_semantics.rb
CHANGED
@@ -26,19 +26,19 @@ module Traject::Macros
         accumulator.concat list.uniq if list
       end
     end
-
+
     # If a num begins with a known OCLC prefix, return it without the prefix.
     # otherwise nil.
     #
-    # Allow (OCoLC) and/or ocn/ocm/on
-
+    # Allow (OCoLC) and/or ocn/ocm/on
+
     OCLCPAT = /
       \A\s*
       (?:(?:\(OCoLC\)) |
       (?:\(OCoLC\))?(?:(?:ocm)|(?:ocn)|(?:on))
       )(\d+)
     /x
-
+
     def self.oclcnum_extract(num)
       if m = OCLCPAT.match(num)
         return m[1]
@@ -364,13 +364,16 @@ module Traject::Macros
           end
         end
       end
-      # Okay, nothing from 008, try 260
+      # Okay, nothing from 008, first try 264, then try 260
       if found_date.nil?
+        v264c = MarcExtractor.cached("264c", :separator => nil).extract(record).first
         v260c = MarcExtractor.cached("260c", :separator => nil).extract(record).first
         # just try to take the first four digits out of there, we're not going to try
         # anything crazy.
-        if m = /(\d{4})/.match(
+        if m = /(\d{4})/.match(v264c)
           found_date = m[1].to_i
+        elsif m = /(\d{4})/.match(v260c)
+          found_date = m[1].to_i
         end
       end
 
@@ -519,11 +522,11 @@ module Traject::Macros
 
     # Extracts LCSH-carrying fields, and formatting them
     # as a pre-coordinated LCSH string, for instance suitable for including
-    # in a facet.
+    # in a facet.
     #
     # You can supply your own list of fields as a spec, but for significant
     # customization you probably just want to write your own method in
-    # terms of the Marc21Semantics.assemble_lcsh method.
+    # terms of the Marc21Semantics.assemble_lcsh method.
     def marc_lcsh_formatted(options = {})
       spec = options[:spec] || "600:610:611:630:648:650:651:654:662"
       subd_separator = options[:subdivison_separator] || " — "
@@ -540,17 +543,17 @@ module Traject::Macros
     end
 
     # Takes a MARC::Field and formats it into a pre-coordinated LCSH string
-    # with subdivision separators in the right place.
+    # with subdivision separators in the right place.
     #
     # For 600 fields especially, need to not just join with subdivision separator
     # to take account of $a$d$t -- for other fields, might be able to just
-    # join subfields, not sure.
+    # join subfields, not sure.
     #
     # WILL strip trailing period from generated string, contrary to some LCSH practice.
     # Our data is inconsistent on whether it has period or not, this was
-    # the easiest way to standardize.
+    # the easiest way to standardize.
     #
-    # Default subdivision separator is em-dash with spaces, set to '--' if you want.
+    # Default subdivision separator is em-dash with spaces, set to '--' if you want.
     #
     # Cite: "Dash (-) that precedes a subdivision in an extended 600 subject heading
     # is not carried in the MARC record. It may be system generated as a display constant
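The `publication_date` change earlier in this file (prefer 264$c, then fall back to 260$c) comes down to a two-step regex match. The sketch below isolates that fallback in plain Ruby; `year_from_264_then_260` is a hypothetical helper that takes the two subfield strings directly, so it leaves out the 008 handling and the `MarcExtractor` calls shown in the diff.

```ruby
# Hypothetical helper: take the first four-digit run from the 264$c value
# if there is one, otherwise from the 260$c value. Returns nil if neither
# value contains four consecutive digits.
def year_from_264_then_260(v264c, v260c)
  if (m = /(\d{4})/.match(v264c.to_s))
    m[1].to_i
  elsif (m = /(\d{4})/.match(v260c.to_s))
    m[1].to_i
  end
end

p year_from_264_then_260("[2013]", "1999.")
p year_from_264_then_260(nil, "c1987")
p year_from_264_then_260(nil, nil)
```

Calling `to_s` on each argument is a choice made for this sketch so that a missing subfield (`nil`) is handled the same way as an empty string.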
data/lib/traject/macros/nokogiri_macros.rb
CHANGED
@@ -26,9 +26,15 @@ module Traject
             # Make sure to avoid text content that was all blank, which is "between the children"
             # whitespace.
             result = result.collect do |n|
-              n.
-
-
+              if n.kind_of?(Nokogiri::XML::Attr)
+                # attribute value
+                n.value
+              else
+                # text from node
+                n.xpath('.//text()').collect(&:text).tap do |arr|
+                  arr.reject! { |s| s =~ (/\A\s+\z/) }
+                end.join(" ")
+              end
             end
           else
             # just put all matches in accumulator as Nokogiri::XML::Node's
data/lib/traject/marc_extractor.rb
CHANGED
@@ -2,9 +2,9 @@ require 'traject/marc_extractor_spec'
 
 module Traject
   # MarcExtractor is a class for extracting lists of strings from a MARC::Record,
-  # according to specifications. See
-  # string arguments used to specify extraction. See #initialize for
-  # that can be set controlling extraction.
+  # according to specifications. See Traject::MarcExtractor::Spec for description
+  # of string arguments used to specify extraction. See #initialize for
+  # options that can be set controlling extraction.
   #
   # Examples:
   #
data/lib/traject/nokogiri_reader.rb
CHANGED
@@ -21,6 +21,9 @@ module Traject
   # If you need to use namespaces here, you need to have them registered with
   # `nokogiri.default_namespaces`. If your source docs use namespaces, you DO need
   # to use them in your each_record_xpath.
+  # * nokogiri.strict_mode: if set to `true` or `"true"`, ask Nokogiri to parse in 'strict'
+  # mode, it will raise a `Nokogiri::XML::SyntaxError` if the XML is not well-formed, instead
+  # of trying to take its best-guess correction. https://nokogiri.org/tutorials/ensuring_well_formed_markup.html
   # * nokogiri_reader.extra_xpath_hooks: Experimental in progress, see below.
   #
   # ## nokogiri_reader.extra_xpath_hooks: For handling nodes outside of your each_record_xpath
@@ -87,7 +90,11 @@ module Traject
     end
 
     def each
-
+      config_proc = if settings["nokogiri.strict_mode"]
+        proc { |config| config.strict }
+      end
+
+      whole_input_doc = Nokogiri::XML.parse(input_stream, &config_proc)
 
       if each_record_xpath
         whole_input_doc.xpath(each_record_xpath, default_namespaces).each do |matching_node|