traject 3.1.0.rc1 → 3.5.0
Sign up to get free protection for your applications and to get access to all the features.
- checksums.yaml +4 -4
- data/.github/workflows/ruby.yml +26 -0
- data/CHANGES.md +46 -0
- data/README.md +3 -1
- data/doc/settings.md +5 -1
- data/doc/xml.md +10 -0
- data/lib/traject/command_line.rb +34 -43
- data/lib/traject/indexer.rb +12 -4
- data/lib/traject/macros/marc21.rb +3 -3
- data/lib/traject/macros/marc21_semantics.rb +15 -12
- data/lib/traject/macros/nokogiri_macros.rb +9 -3
- data/lib/traject/marc_extractor.rb +3 -3
- data/lib/traject/nokogiri_reader.rb +8 -1
- data/lib/traject/oai_pmh_nokogiri_reader.rb +9 -3
- data/lib/traject/solr_json_writer.rb +58 -17
- data/lib/traject/version.rb +1 -1
- data/lib/translation_maps/marc_languages.yaml +77 -48
- data/test/command_line_test.rb +51 -0
- data/test/delimited_writer_test.rb +14 -16
- data/test/indexer/class_level_configuration_test.rb +23 -0
- data/test/indexer/macros/macros_marc21_semantics_test.rb +4 -0
- data/test/indexer/nokogiri_indexer_test.rb +35 -0
- data/test/nokogiri_reader_test.rb +10 -0
- data/test/solr_json_writer_test.rb +65 -0
- data/test/test_support/date_resort_to_264.marc +1 -0
- data/traject.gemspec +3 -3
- metadata +32 -23
- data/.travis.yml +0 -16
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 01ca968682bb3fc2a8313131ef6344bfc9e5418b767b2900c3d799caa356d016
|
4
|
+
data.tar.gz: a3fd6c9a3bec88c6ba592500ea170357b059533f28c5ba3fb2fe72de39702a2a
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 93547927e90b7947588c1983bd37de4651722bebaff7aaad2d3965ec46eed8c647971f1ce093beb86a5c47c2efa999516d00fb6956682c98913cd54ea5a1a2b8
|
7
|
+
data.tar.gz: 128ea6e2517711f2324541215f814430f963b0042ae321ebc3f29646398990dc09d2a307acc32fb89a70142d74763fdfee6a61d220043622b5b8b11fbcd645d8
|
@@ -0,0 +1,26 @@
|
|
1
|
+
name: CI
|
2
|
+
|
3
|
+
on:
|
4
|
+
push:
|
5
|
+
branches: [ master ]
|
6
|
+
pull_request:
|
7
|
+
branches: ['**']
|
8
|
+
|
9
|
+
jobs:
|
10
|
+
tests:
|
11
|
+
runs-on: ubuntu-latest
|
12
|
+
strategy:
|
13
|
+
fail-fast: false
|
14
|
+
matrix:
|
15
|
+
ruby: [ '2.4', '2.5', '2.6', '2.7', 'jruby-9.1', 'jruby-9.2' ]
|
16
|
+
name: Ruby ${{ matrix.ruby }}
|
17
|
+
steps:
|
18
|
+
- uses: actions/checkout@v2
|
19
|
+
- name: Set up Ruby
|
20
|
+
uses: ruby/setup-ruby@v1
|
21
|
+
with:
|
22
|
+
ruby-version: ${{ matrix.ruby }}
|
23
|
+
- name: Install dependencies
|
24
|
+
run: bundle install --jobs 4 --retry 3
|
25
|
+
- name: Run tests
|
26
|
+
run: bundle exec rake
|
data/CHANGES.md
CHANGED
@@ -1,5 +1,47 @@
|
|
1
1
|
# Changes
|
2
2
|
|
3
|
+
## Next
|
4
|
+
|
5
|
+
*
|
6
|
+
|
7
|
+
*
|
8
|
+
|
9
|
+
*
|
10
|
+
|
11
|
+
## 3.5.0
|
12
|
+
|
13
|
+
* `traject -v` and `traject -h` correctly return 0 exit code indicating success.
|
14
|
+
|
15
|
+
* upgrade to slop gem 4.x, which carries with it a slightly different format of human-readable command-line arg errors, should be otherwise invisible.
|
16
|
+
|
17
|
+
* the SolrJsonWriter now supports HTTP basic auth credentials embedded in `solr.url` or `solr.update_url`, eg `http://user:pass@example.org/solr` https://github.com/traject/traject/pull/262
|
18
|
+
|
19
|
+
|
20
|
+
## 3.4.0
|
21
|
+
|
22
|
+
* XML-mode `extract_xpath` now supports extracting attribute values with xpath @attr syntax.
|
23
|
+
|
24
|
+
## 3.3.0
|
25
|
+
|
26
|
+
* `Traject::Macros::Marc21Semantics.publication_date` now gets date from 264 before 260. https://github.com/traject/traject/pull/233
|
27
|
+
|
28
|
+
* Allow hashie 4.x in gemspec https://github.com/traject/traject/pull/234
|
29
|
+
|
30
|
+
* Allow `http` gem 4.x versions. https://github.com/traject/traject/pull/236
|
31
|
+
|
32
|
+
* Can now call class-level Indexer.configure multiple times https://github.com/sciencehistory/scihist_digicoll/pull/525
|
33
|
+
|
34
|
+
## 3.2.0
|
35
|
+
|
36
|
+
* NokogiriReader has a "nokogiri.strict_mode" setting. Set to true or string 'true' to ask Nokogori to parse in strict mode, so it will immediately raise on ill-formed XML, instead of nokogiri's default to do what it can with it. https://github.com/traject/traject/pull/226
|
37
|
+
|
38
|
+
* SolrJsonWriter
|
39
|
+
|
40
|
+
* Utility method `delete_all!` sends a delete all query to the Solr URL endpoint. https://github.com/traject/traject/pull/227
|
41
|
+
|
42
|
+
* Allow basic auth configuration of the default http client via `solr_writer.basic_auth_user` and `solr_writer.basic_auth_password`. https://github.com/traject/traject/pull/231
|
43
|
+
|
44
|
+
|
3
45
|
## 3.1.0
|
4
46
|
|
5
47
|
### Added
|
@@ -24,6 +66,10 @@
|
|
24
66
|
|
25
67
|
* SolrJsonWriter now respects a `solr_writer.http_timeout` setting, in seconds, to be passed to HTTPClient instance. https://github.com/traject/traject/pull/219
|
26
68
|
|
69
|
+
* Only runs thread pool shutdown code (and logging) if there is a `solr_writer.batch_size` greater than 0. Keep it out of the logs if it was a no-op anyway.
|
70
|
+
|
71
|
+
* Logs at DEBUG level every time it sends an update request to solr
|
72
|
+
|
27
73
|
* Nokogiri dependency for the NokogiriReader increased to `~> 1.9`. When using Jruby `each_record_xpath`, resulting yielded documents may have xmlns declarations on different nodes than in MRI (and previous versions of nokogiri), but we could find now way around this with nokogiri >= 1.9.0. The documents should still be semantically equivalent for namespace use. This was necessary to keep JRuby Nokogiri XML working with recent Nokogiri releases. https://github.com/traject/traject/pull/209
|
28
74
|
|
29
75
|
* LineWriter guesses better about when to auto-close, and provides an optional explicit setting in case it guesses wrong. (thanks @justinlittman) https://github.com/traject/traject/pull/211
|
data/README.md
CHANGED
@@ -9,7 +9,7 @@ Traject can also be generalized to a set of tools for getting structured data fr
|
|
9
9
|
**Traject is stable, mature software, that is already being used in production by its authors and several other institutions.**
|
10
10
|
|
11
11
|
[![Gem Version](https://badge.fury.io/rb/traject.png)](http://badge.fury.io/rb/traject)
|
12
|
-
[![
|
12
|
+
[![CI Status](https://github.com/traject/traject/workflows/CI/badge.svg?branch=master)](https://github.com/traject/traject/actions?query=workflow%3ACI+branch%3Amaster)
|
13
13
|
|
14
14
|
|
15
15
|
## Background/Goals
|
@@ -175,6 +175,8 @@ TranslationMap use above is just one example of a transformation macro, that tra
|
|
175
175
|
* `append("--after each value")`
|
176
176
|
* `gsub(/regex/, "replacement")`
|
177
177
|
* `split(" ")`: take values and split them, possibly result in multiple values.
|
178
|
+
* `transform(proc)`: transform each existing macro using a proc, kind of like `map`.
|
179
|
+
eg `to_field "something", extract_xml("//author"), transform( ->(author) { "#{author.last}, #{author.first}" })
|
178
180
|
|
179
181
|
You can add on as many transformation macros as you want, they will be applied to output in order.
|
180
182
|
|
data/doc/settings.md
CHANGED
@@ -83,7 +83,8 @@ settings are applied first of all. It's recommended you use `provide`.
|
|
83
83
|
### Writing to solr
|
84
84
|
|
85
85
|
* `json_writer.pretty_print`: used by the JsonWriter, if set to true, will output pretty printed json (with added whitespace) for easier human readability. Default false.
|
86
|
-
|
86
|
+
|
87
|
+
* `solr.url`: URL to connect to a solr instance for indexing, eg http://example.org:8983/solr . Command-line short-cut `-u`. (Can include embedded HTTP basic auth as eg `http://user:pass@example.org/solr`)
|
87
88
|
|
88
89
|
* `solr.version`: Set to eg "1.4.0", "4.3.0"; currently un-used, but in the future will control some default settings, and/or sanity check and warn you if you're doing something that might not work with that version of solr. Set now for help in the future.
|
89
90
|
|
@@ -93,6 +94,9 @@ settings are applied first of all. It's recommended you use `provide`.
|
|
93
94
|
|
94
95
|
* `solr_writer.thread_pool`: defaults to 1 (single bg thread). A thread pool is used for submitting docs to solr. Set to 0 or nil to disable threading. Set to 1, there will still be a single bg thread doing the adds. May make sense to set higher than number of cores on your indexing machine, as these threads will mostly be waiting on Solr. Speed/capacity of your solr might be more relevant. Note that processing_thread_pool threads can end up submitting to solr too, if solr_json_writer.thread_pool is full.
|
95
96
|
|
97
|
+
* `solr_writer.basic_auth_user`, `solr_writer.basic_auth_password`: Not set by default but when both are set the default writer is configured with basic auth. You can also just embed basic
|
98
|
+
auth credentials in `solr.url` using standard URI syntax.
|
99
|
+
|
96
100
|
|
97
101
|
### Dealing with MARC data
|
98
102
|
|
data/doc/xml.md
CHANGED
@@ -72,6 +72,16 @@ You can use all the standard transforation macros in Traject::Macros::Transforma
|
|
72
72
|
to_field "something", extract_xpath("//value"), first_only, translation_map("some_map"), default("no value")
|
73
73
|
```
|
74
74
|
|
75
|
+
### selecting attribute values
|
76
|
+
|
77
|
+
Just works, using xpath syntax for selecting an attribute:
|
78
|
+
|
79
|
+
|
80
|
+
```ruby
|
81
|
+
# gets status value in: <oai:header status="something">
|
82
|
+
to_field "status", extract_xpath("//oai:record/oai:header/@status")
|
83
|
+
```
|
84
|
+
|
75
85
|
|
76
86
|
### selecting non-text nodes
|
77
87
|
|
data/lib/traject/command_line.rb
CHANGED
@@ -29,10 +29,10 @@ module Traject
|
|
29
29
|
self.console = $stderr
|
30
30
|
|
31
31
|
self.orig_argv = argv.dup
|
32
|
-
self.remaining_argv = argv
|
33
32
|
|
34
|
-
self.slop = create_slop!
|
35
|
-
self.options =
|
33
|
+
self.slop = create_slop!(argv)
|
34
|
+
self.options = self.slop
|
35
|
+
self.remaining_argv = self.slop.arguments
|
36
36
|
end
|
37
37
|
|
38
38
|
# Returns true on success or false on failure; may also raise exceptions;
|
@@ -40,11 +40,11 @@ module Traject
|
|
40
40
|
def execute
|
41
41
|
if options[:version]
|
42
42
|
self.console.puts "traject version #{Traject::VERSION}"
|
43
|
-
return
|
43
|
+
return true
|
44
44
|
end
|
45
45
|
if options[:help]
|
46
|
-
self.console.puts slop.
|
47
|
-
return
|
46
|
+
self.console.puts slop.to_s
|
47
|
+
return true
|
48
48
|
end
|
49
49
|
|
50
50
|
|
@@ -179,11 +179,11 @@ module Traject
|
|
179
179
|
end
|
180
180
|
|
181
181
|
def arg_check!
|
182
|
-
if options[:command] == "process" && (options[:conf]
|
182
|
+
if options[:command] == "process" && (!options[:conf] || options[:conf].length == 0)
|
183
183
|
self.console.puts "Error: Missing required configuration file"
|
184
184
|
self.console.puts "Exiting..."
|
185
185
|
self.console.puts
|
186
|
-
self.console.puts self.slop.
|
186
|
+
self.console.puts self.slop.to_s
|
187
187
|
exit 2
|
188
188
|
end
|
189
189
|
end
|
@@ -234,28 +234,36 @@ module Traject
|
|
234
234
|
end
|
235
235
|
|
236
236
|
|
237
|
-
def create_slop!
|
238
|
-
|
239
|
-
banner "traject [options] -c configuration.rb [-c config2.rb] file.mrc"
|
237
|
+
def create_slop!(argv)
|
238
|
+
options = Slop::Options.new do |o|
|
239
|
+
o.banner = "traject [options] -c configuration.rb [-c config2.rb] file.mrc"
|
240
240
|
|
241
|
-
on 'v', 'version', "print version information to stderr"
|
242
|
-
on 'd', 'debug', "Include debug log, -s log.level=debug"
|
243
|
-
on 'h', 'help', "print usage information to stderr"
|
244
|
-
|
245
|
-
|
246
|
-
|
247
|
-
|
248
|
-
|
249
|
-
|
250
|
-
|
251
|
-
|
252
|
-
|
241
|
+
o.on '-v', '--version', "print version information to stderr"
|
242
|
+
o.on '-d', '--debug', "Include debug log, -s log.level=debug"
|
243
|
+
o.on '-h', '--help', "print usage information to stderr"
|
244
|
+
o.array '-c', '--conf', 'configuration file path (repeatable)', :delimiter => nil
|
245
|
+
o.string "-i", '--indexer', "Traject indexer class name or shortcut", :default => "marc"
|
246
|
+
o.array "-s", "--setting", "settings: `-s key=value` (repeatable)", :delimiter => nil
|
247
|
+
o.string "-r", "--reader", "Set reader class, shortcut for -s reader_class_name="
|
248
|
+
o.string "-o", "--output_file", "output file for Writer classes that write to files"
|
249
|
+
o.string "-w", "--writer", "Set writer class, shortcut for -s writer_class_name="
|
250
|
+
o.string "-u", "--solr", "Set solr url, shortcut for -s solr.url="
|
251
|
+
o.string "-t", "--marc_type", "xml, json or binary. shortcut for -s marc_source.type="
|
252
|
+
o.array "-I", "--load_path", "append paths to ruby $LOAD_PATH", :delimiter => ":"
|
253
253
|
|
254
|
-
|
254
|
+
o.string "-x", "--command", "alternate traject command: process (default); marcout; commit", :default => "process"
|
255
255
|
|
256
|
-
on "stdin", "read input from stdin"
|
257
|
-
on "debug-mode", "debug logging, single threaded, output human readable hashes"
|
256
|
+
o.on "--stdin", "read input from stdin"
|
257
|
+
o.on "--debug-mode", "debug logging, single threaded, output human readable hashes"
|
258
258
|
end
|
259
|
+
|
260
|
+
options.parse(argv)
|
261
|
+
rescue Slop::Error => e
|
262
|
+
self.console.puts "Error: #{e.message}"
|
263
|
+
self.console.puts "Exiting..."
|
264
|
+
self.console.puts
|
265
|
+
self.console.puts options.to_s
|
266
|
+
exit 1
|
259
267
|
end
|
260
268
|
|
261
269
|
def initialize_indexer!
|
@@ -267,22 +275,5 @@ module Traject
|
|
267
275
|
|
268
276
|
return indexer
|
269
277
|
end
|
270
|
-
|
271
|
-
def parse_options(argv)
|
272
|
-
|
273
|
-
begin
|
274
|
-
self.slop.parse!(argv)
|
275
|
-
rescue Slop::Error => e
|
276
|
-
self.console.puts "Error: #{e.message}"
|
277
|
-
self.console.puts "Exiting..."
|
278
|
-
self.console.puts
|
279
|
-
self.console.puts slop.help
|
280
|
-
exit 1
|
281
|
-
end
|
282
|
-
|
283
|
-
return self.slop.to_hash
|
284
|
-
end
|
285
|
-
|
286
|
-
|
287
278
|
end
|
288
279
|
end
|
data/lib/traject/indexer.rb
CHANGED
@@ -190,7 +190,7 @@ class Traject::Indexer
|
|
190
190
|
instance_eval(&block)
|
191
191
|
end
|
192
192
|
|
193
|
-
## Class level configure block accepted too, and applied at instantiation
|
193
|
+
## Class level configure block(s) accepted too, and applied at instantiation
|
194
194
|
# before instance-level configuration.
|
195
195
|
#
|
196
196
|
# EXPERIMENTAL, implementation may change in ways that effect some uses.
|
@@ -199,8 +199,14 @@ class Traject::Indexer
|
|
199
199
|
# Note that settings set by 'provide' in subclass can not really be overridden
|
200
200
|
# by 'provide' in a next level subclass. Use self.default_settings instead, with
|
201
201
|
# call to super.
|
202
|
+
#
|
203
|
+
# You can call this .configure multiple times, blocks are added to a list, and
|
204
|
+
# will be used to initialize an instance in order.
|
205
|
+
#
|
206
|
+
# The main downside of this workaround implementation is performance, even though
|
207
|
+
# defined at load-time on class level, blocks are all executed on every instantiation.
|
202
208
|
def self.configure(&block)
|
203
|
-
@
|
209
|
+
(@class_configure_blocks ||= []) << block
|
204
210
|
end
|
205
211
|
|
206
212
|
def self.apply_class_configure_block(instance)
|
@@ -208,8 +214,10 @@ class Traject::Indexer
|
|
208
214
|
if self.superclass.respond_to?(:apply_class_configure_block)
|
209
215
|
self.superclass.apply_class_configure_block(instance)
|
210
216
|
end
|
211
|
-
if @
|
212
|
-
|
217
|
+
if @class_configure_blocks && !@class_configure_blocks.empty?
|
218
|
+
@class_configure_blocks.each do |block|
|
219
|
+
instance.configure(&block)
|
220
|
+
end
|
213
221
|
end
|
214
222
|
end
|
215
223
|
|
@@ -15,7 +15,7 @@ module Traject::Macros
|
|
15
15
|
# field/substring specification.
|
16
16
|
#
|
17
17
|
# First argument is a string spec suitable for the MarcExtractor, see
|
18
|
-
# MarcExtractor::
|
18
|
+
# Traject::MarcExtractor::Spec.
|
19
19
|
#
|
20
20
|
# Second arg is optional options, including options valid on MarcExtractor.new,
|
21
21
|
# and others. By default, will de-duplicate results, but see :allow_duplicates
|
@@ -42,11 +42,11 @@ module Traject::Macros
|
|
42
42
|
#
|
43
43
|
# * :translation_map => String: translate with named translation map looked up in load
|
44
44
|
# path, uses Tranject::TranslationMap.new(translation_map_arg).
|
45
|
-
# **Instead**, use `extract_marc(whatever), translation_map(translation_map_arg)
|
45
|
+
# **Instead**, use `extract_marc(whatever), translation_map(translation_map_arg)`
|
46
46
|
#
|
47
47
|
# * :trim_punctuation => true; trims leading/trailing punctuation using standard algorithms that
|
48
48
|
# have shown themselves useful with Marc, using Marc21.trim_punctuation. **Instead**, use
|
49
|
-
# `extract_marc(whatever), trim_punctuation
|
49
|
+
# `extract_marc(whatever), trim_punctuation`
|
50
50
|
#
|
51
51
|
# * :default => String: if otherwise empty, add default value. **Instead**, use `extract_marc(whatever), default("default value")`
|
52
52
|
#
|
@@ -26,19 +26,19 @@ module Traject::Macros
|
|
26
26
|
accumulator.concat list.uniq if list
|
27
27
|
end
|
28
28
|
end
|
29
|
-
|
29
|
+
|
30
30
|
# If a num begins with a known OCLC prefix, return it without the prefix.
|
31
31
|
# otherwise nil.
|
32
32
|
#
|
33
|
-
# Allow (OCoLC) and/or ocn/ocm/on
|
34
|
-
|
33
|
+
# Allow (OCoLC) and/or ocn/ocm/on
|
34
|
+
|
35
35
|
OCLCPAT = /
|
36
36
|
\A\s*
|
37
37
|
(?:(?:\(OCoLC\)) |
|
38
38
|
(?:\(OCoLC\))?(?:(?:ocm)|(?:ocn)|(?:on))
|
39
39
|
)(\d+)
|
40
40
|
/x
|
41
|
-
|
41
|
+
|
42
42
|
def self.oclcnum_extract(num)
|
43
43
|
if m = OCLCPAT.match(num)
|
44
44
|
return m[1]
|
@@ -364,13 +364,16 @@ module Traject::Macros
|
|
364
364
|
end
|
365
365
|
end
|
366
366
|
end
|
367
|
-
# Okay, nothing from 008, try 260
|
367
|
+
# Okay, nothing from 008, first try 264, then try 260
|
368
368
|
if found_date.nil?
|
369
|
+
v264c = MarcExtractor.cached("264c", :separator => nil).extract(record).first
|
369
370
|
v260c = MarcExtractor.cached("260c", :separator => nil).extract(record).first
|
370
371
|
# just try to take the first four digits out of there, we're not going to try
|
371
372
|
# anything crazy.
|
372
|
-
if m = /(\d{4})/.match(
|
373
|
+
if m = /(\d{4})/.match(v264c)
|
373
374
|
found_date = m[1].to_i
|
375
|
+
elsif m = /(\d{4})/.match(v260c)
|
376
|
+
found_date = m[1].to_i
|
374
377
|
end
|
375
378
|
end
|
376
379
|
|
@@ -519,11 +522,11 @@ module Traject::Macros
|
|
519
522
|
|
520
523
|
# Extracts LCSH-carrying fields, and formatting them
|
521
524
|
# as a pre-coordinated LCSH string, for instance suitable for including
|
522
|
-
# in a facet.
|
525
|
+
# in a facet.
|
523
526
|
#
|
524
527
|
# You can supply your own list of fields as a spec, but for significant
|
525
528
|
# customization you probably just want to write your own method in
|
526
|
-
# terms of the Marc21Semantics.assemble_lcsh method.
|
529
|
+
# terms of the Marc21Semantics.assemble_lcsh method.
|
527
530
|
def marc_lcsh_formatted(options = {})
|
528
531
|
spec = options[:spec] || "600:610:611:630:648:650:651:654:662"
|
529
532
|
subd_separator = options[:subdivison_separator] || " — "
|
@@ -540,17 +543,17 @@ module Traject::Macros
|
|
540
543
|
end
|
541
544
|
|
542
545
|
# Takes a MARC::Field and formats it into a pre-coordinated LCSH string
|
543
|
-
# with subdivision seperators in the right place.
|
546
|
+
# with subdivision seperators in the right place.
|
544
547
|
#
|
545
548
|
# For 600 fields especially, need to not just join with subdivision seperator
|
546
549
|
# to take acount of $a$d$t -- for other fields, might be able to just
|
547
|
-
# join subfields, not sure.
|
550
|
+
# join subfields, not sure.
|
548
551
|
#
|
549
552
|
# WILL strip trailing period from generated string, contrary to some LCSH practice.
|
550
553
|
# Our data is inconsistent on whether it has period or not, this was
|
551
|
-
# the easiest way to standardize.
|
554
|
+
# the easiest way to standardize.
|
552
555
|
#
|
553
|
-
# Default subdivision seperator is em-dash with spaces, set to '--' if you want.
|
556
|
+
# Default subdivision seperator is em-dash with spaces, set to '--' if you want.
|
554
557
|
#
|
555
558
|
# Cite: "Dash (-) that precedes a subdivision in an extended 600 subject heading
|
556
559
|
# is not carried in the MARC record. It may be system generated as a display constant
|
@@ -26,9 +26,15 @@ module Traject
|
|
26
26
|
# Make sure to avoid text content that was all blank, which is "between the children"
|
27
27
|
# whitespace.
|
28
28
|
result = result.collect do |n|
|
29
|
-
n.
|
30
|
-
|
31
|
-
|
29
|
+
if n.kind_of?(Nokogiri::XML::Attr)
|
30
|
+
# attribute value
|
31
|
+
n.value
|
32
|
+
else
|
33
|
+
# text from node
|
34
|
+
n.xpath('.//text()').collect(&:text).tap do |arr|
|
35
|
+
arr.reject! { |s| s =~ (/\A\s+\z/) }
|
36
|
+
end.join(" ")
|
37
|
+
end
|
32
38
|
end
|
33
39
|
else
|
34
40
|
# just put all matches in accumulator as Nokogiri::XML::Node's
|
@@ -2,9 +2,9 @@ require 'traject/marc_extractor_spec'
|
|
2
2
|
|
3
3
|
module Traject
|
4
4
|
# MarcExtractor is a class for extracting lists of strings from a MARC::Record,
|
5
|
-
# according to specifications. See
|
6
|
-
# string arguments used to specify extraction. See #initialize for
|
7
|
-
# that can be set controlling extraction.
|
5
|
+
# according to specifications. See Traject::MarcExtractor::Spec for description
|
6
|
+
# of string string arguments used to specify extraction. See #initialize for
|
7
|
+
# options that can be set controlling extraction.
|
8
8
|
#
|
9
9
|
# Examples:
|
10
10
|
#
|
@@ -21,6 +21,9 @@ module Traject
|
|
21
21
|
# If you need to use namespaces here, you need to have them registered with
|
22
22
|
# `nokogiri.default_namespaces`. If your source docs use namespaces, you DO need
|
23
23
|
# to use them in your each_record_xpath.
|
24
|
+
# * nokogiri.strict_mode: if set to `true` or `"true"`, ask Nokogiri to parse in 'strict'
|
25
|
+
# mode, it will raise a `Nokogiri::XML::SyntaxError` if the XML is not well-formed, instead
|
26
|
+
# of trying to take it's best-guess correction. https://nokogiri.org/tutorials/ensuring_well_formed_markup.html
|
24
27
|
# * nokogiri_reader.extra_xpath_hooks: Experimental in progress, see below.
|
25
28
|
#
|
26
29
|
# ## nokogiri_reader.extra_xpath_hooks: For handling nodes outside of your each_record_xpath
|
@@ -87,7 +90,11 @@ module Traject
|
|
87
90
|
end
|
88
91
|
|
89
92
|
def each
|
90
|
-
|
93
|
+
config_proc = if settings["nokogiri.strict_mode"]
|
94
|
+
proc { |config| config.strict }
|
95
|
+
end
|
96
|
+
|
97
|
+
whole_input_doc = Nokogiri::XML.parse(input_stream, &config_proc)
|
91
98
|
|
92
99
|
if each_record_xpath
|
93
100
|
whole_input_doc.xpath(each_record_xpath, default_namespaces).each do |matching_node|
|