traject 3.3.0 → 3.7.0
- checksums.yaml +4 -4
- data/.github/workflows/ruby.yml +35 -0
- data/CHANGES.md +23 -2
- data/README.md +23 -2
- data/doc/settings.md +4 -2
- data/doc/xml.md +12 -0
- data/examples/marc/tiny.xml +35 -0
- data/lib/traject/command_line.rb +34 -43
- data/lib/traject/debug_writer.rb +1 -1
- data/lib/traject/macros/marc21.rb +3 -3
- data/lib/traject/macros/marc21_semantics.rb +7 -3
- data/lib/traject/macros/nokogiri_macros.rb +9 -3
- data/lib/traject/macros/transformation.rb +30 -0
- data/lib/traject/marc_extractor.rb +3 -3
- data/lib/traject/nokogiri_reader.rb +2 -0
- data/lib/traject/solr_json_writer.rb +28 -10
- data/lib/traject/version.rb +1 -1
- data/lib/translation_maps/marc_languages.yaml +77 -48
- data/test/command_line_test.rb +52 -0
- data/test/debug_writer_test.rb +13 -0
- data/test/indexer/macros/macros_marc21_semantics_test.rb +4 -0
- data/test/indexer/macros/transformation_test.rb +110 -0
- data/test/indexer/nokogiri_indexer_test.rb +35 -0
- data/test/indexer/read_write_test.rb +14 -3
- data/test/solr_json_writer_test.rb +45 -10
- data/test/test_support/missing-second-date.marc +1 -0
- data/traject.gemspec +3 -3
- metadata +19 -21
- data/.travis.yml +0 -16
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 7ffc677e0ebb13e01b852a1d59ddfdd3cd9906142520e0c296f69ebb0eeb7429
|
4
|
+
data.tar.gz: 61b0e966f6ecd4d27e757e4cfc1057c72ac6deca5ad119c78ff883c246744814
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 8240b450b27df011c2ff998c24c612f44bdd21a2fde5fbab996ffe509f3fc45cec7a8a8947e385a35d06f5bd8ed19732287e8f2d3dab682cc6295f7320f8dfab
|
7
|
+
data.tar.gz: f5dbcb44edb8d37a4e74cd1255aa1b05b638913337577c92d4fa276150c023b9e29823cc2b20ef050b33879ec63d76194d0202697d7c858a7aacd3d08241dcce
|
data/.github/workflows/ruby.yml
ADDED
@@ -0,0 +1,35 @@
|
|
1
|
+
name: CI
|
2
|
+
|
3
|
+
on:
|
4
|
+
push:
|
5
|
+
branches: [ master ]
|
6
|
+
pull_request:
|
7
|
+
branches: ['**']
|
8
|
+
|
9
|
+
jobs:
|
10
|
+
tests:
|
11
|
+
runs-on: ubuntu-latest
|
12
|
+
strategy:
|
13
|
+
fail-fast: false
|
14
|
+
matrix:
|
15
|
+
ruby: [ '2.4', '2.5', '2.6', '2.7', '3.0', 'jruby-9.1', 'jruby-9.2' ]
|
16
|
+
name: Ruby ${{ matrix.ruby }}
|
17
|
+
steps:
|
18
|
+
- uses: actions/checkout@v2
|
19
|
+
|
20
|
+
- name: Set up Ruby
|
21
|
+
uses: ruby/setup-ruby@v1
|
22
|
+
with:
|
23
|
+
ruby-version: ${{ matrix.ruby }}
|
24
|
+
|
25
|
+
- name: set JAVA_OPTS for jruby-9.1
|
26
|
+
run: echo 'JAVA_OPTS="--add-opens java.base/java.security.cert=ALL-UNNAMED --add-opens java.base/java.security=ALL-UNNAMED --add-opens java.base/java.util.zip=ALL-UNNAMED"' >> $GITHUB_ENV
|
27
|
+
if: ${{ matrix.ruby == 'jruby-9.1' }}
|
28
|
+
# https://github.com/jruby/jruby/issues/4834
|
29
|
+
# Still seems to be an issue in jruby-9.1, but not 9.2
|
30
|
+
# https://github.community/t/conditional-setting-of-env-variables-in-gh-actions/179650
|
31
|
+
|
32
|
+
- name: Install dependencies
|
33
|
+
run: bundle install --jobs 4 --retry 3
|
34
|
+
- name: Run tests
|
35
|
+
run: bundle exec rake
|
data/CHANGES.md
CHANGED
@@ -1,12 +1,33 @@
|
|
1
1
|
# Changes
|
2
2
|
|
3
|
-
##
|
3
|
+
## NEXT
|
4
4
|
|
5
5
|
*
|
6
6
|
|
7
7
|
*
|
8
8
|
|
9
|
-
|
9
|
+
## 3.7.0
|
10
|
+
|
11
|
+
* Add two new transformation macros, `Traject::Macros::Transformation.delete_if` and `Traject::Macros::Transformation.select`.
|
12
|
+
|
13
|
+
## 3.6.0
|
14
|
+
|
15
|
+
* Tiny backward compat changes for ruby 3.0 compat. https://github.com/traject/traject/pull/263
|
16
|
+
|
17
|
+
* Allow gem `http` 5.x in gemspec. https://github.com/traject/traject/pull/269
|
18
|
+
|
19
|
+
## 3.5.0
|
20
|
+
|
21
|
+
* `traject -v` and `traject -h` correctly return 0 exit code indicating success.
|
22
|
+
|
23
|
+
* upgrade to slop gem 4.x, which carries with it a slightly different format of human-readable command-line arg errors, should be otherwise invisible.
|
24
|
+
|
25
|
+
* the SolrJsonWriter now supports HTTP basic auth credentials embedded in `solr.url` or `solr.update_url`, eg `http://user:pass@example.org/solr` https://github.com/traject/traject/pull/262
|
26
|
+
|
27
|
+
|
28
|
+
## 3.4.0
|
29
|
+
|
30
|
+
* XML-mode `extract_xpath` now supports extracting attribute values with xpath @attr syntax.
|
10
31
|
|
11
32
|
## 3.3.0
|
12
33
|
|
data/README.md
CHANGED
@@ -8,8 +8,8 @@ Traject can also be generalized to a set of tools for getting structured data fr
|
|
8
8
|
|
9
9
|
**Traject is stable, mature software, that is already being used in production by its authors and several other institutions.**
|
10
10
|
|
11
|
-
[](http://badge.fury.io/rb/traject)
|
12
|
+
[](https://github.com/traject/traject/actions?query=workflow%3ACI+branch%3Amaster)
|
13
13
|
|
14
14
|
|
15
15
|
## Background/Goals
|
@@ -177,6 +177,11 @@ TranslationMap use above is just one example of a transformation macro, that tra
|
|
177
177
|
* `split(" ")`: take values and split them, possibly result in multiple values.
|
178
178
|
* `transform(proc)`: transform each existing macro using a proc, kind of like `map`.
|
179
179
|
eg `to_field "something", extract_xml("//author"), transform( ->(author) { "#{author.last}, #{author.first}" })
|
180
|
+
* `delete_if(["a", "b"])`: remove a value from accumulated values if it is included in the passed-in argument.
|
181
|
+
* Can also take a string, proc or regex as an argument. See [tests](test/indexer/macros/transformation_test.rb) for full functionality.
|
182
|
+
* `select(proc)`: selects (keeps) a value from accumulated values if the proc evaluates to true for that specific value.
|
183
|
+
* Can also take an array, set or regex as an argument. See [tests](test/indexer/macros/transformation_test.rb) for full functionality.
|
184
|
+
|
180
185
|
|
181
186
|
You can add on as many transformation macros as you want, they will be applied to output in order.
|
182
187
|
|
@@ -468,6 +473,22 @@ Also see `-I load_path` option and suggestions for Bundler use under Extending W
|
|
468
473
|
See also [Hints for batch and cronjob use](./doc/batch_execution.md) of traject.
|
469
474
|
|
470
475
|
|
476
|
+
## A small but complete example
|
477
|
+
|
478
|
+
To process a MARC XML file with the data shown in [./examples/marc/tiny.xml](./examples/marc/tiny.xml) you can save the following configuration as `config.rb`:
|
479
|
+
|
480
|
+
```
|
481
|
+
to_field 'title', extract_marc('245a', first: true)
|
482
|
+
```
|
483
|
+
|
484
|
+
and run Traject as follows:
|
485
|
+
|
486
|
+
```
|
487
|
+
traject -t xml -c config.rb -w Traject::DebugWriter tiny.xml
|
488
|
+
```
|
489
|
+
|
490
|
+
`-t xml` indicates that the file is a MARC XML file. `-w Traject::DebugWriter` outputs the results to the console (i.e. without saving to Solr).
|
491
|
+
|
471
492
|
## Extending With Your Own Code
|
472
493
|
|
473
494
|
Traject config files are full live ruby files, where you can do anything,
|
data/doc/settings.md
CHANGED
@@ -83,7 +83,8 @@ settings are applied first of all. It's recommended you use `provide`.
|
|
83
83
|
### Writing to solr
|
84
84
|
|
85
85
|
* `json_writer.pretty_print`: used by the JsonWriter, if set to true, will output pretty printed json (with added whitespace) for easier human readability. Default false.
|
86
|
-
|
86
|
+
|
87
|
+
* `solr.url`: URL to connect to a solr instance for indexing, eg http://example.org:8983/solr . Command-line short-cut `-u`. (Can include embedded HTTP basic auth as eg `http://user:pass@example.org/solr`)
|
87
88
|
|
88
89
|
* `solr.version`: Set to eg "1.4.0", "4.3.0"; currently un-used, but in the future will control some default settings, and/or sanity check and warn you if you're doing something that might not work with that version of solr. Set now for help in the future.
|
89
90
|
|
@@ -93,7 +94,8 @@ settings are applied first of all. It's recommended you use `provide`.
|
|
93
94
|
|
94
95
|
* `solr_writer.thread_pool`: defaults to 1 (single bg thread). A thread pool is used for submitting docs to solr. Set to 0 or nil to disable threading. Set to 1, there will still be a single bg thread doing the adds. May make sense to set higher than number of cores on your indexing machine, as these threads will mostly be waiting on Solr. Speed/capacity of your solr might be more relevant. Note that processing_thread_pool threads can end up submitting to solr too, if solr_json_writer.thread_pool is full.
|
95
96
|
|
96
|
-
* `solr_writer.basic_auth_user`, `solr_writer.basic_auth_password`: Not set by default but when both are set the default writer is configured with basic auth.
|
97
|
+
* `solr_writer.basic_auth_user`, `solr_writer.basic_auth_password`: Not set by default but when both are set the default writer is configured with basic auth. You can also just embed basic
|
98
|
+
auth credentials in `solr.url` using standard URI syntax.
|
97
99
|
|
98
100
|
|
99
101
|
### Dealing with MARC data
|
data/doc/xml.md
CHANGED
@@ -4,6 +4,8 @@ The [NokogiriIndexer](../lib/traject/nokogiri_indexer.md) is a Traject::Indexer
|
|
4
4
|
|
5
5
|
It by default uses the NokogiriReader to read XML and read Nokogiri::XML::Documents, and includes the NokogiriMacros mix-in, with some macros for operating on Nokogiri::XML::Documents.
|
6
6
|
|
7
|
+
Please note that the recommended mechanism for parsing MARC XML files with Traject is the `-t` parameter (or the `provide "marc_source.type", "xml"` setting). The documentation on this page is for those parsing other (non-MARC) XML files.
|
8
|
+
|
7
9
|
## On the command-line
|
8
10
|
|
9
11
|
You can tell the traject command-line to use the NokogiriIndexer with the `-i xml` flag:
|
@@ -72,6 +74,16 @@ You can use all the standard transforation macros in Traject::Macros::Transforma
|
|
72
74
|
to_field "something", extract_xpath("//value"), first_only, translation_map("some_map"), default("no value")
|
73
75
|
```
|
74
76
|
|
77
|
+
### selecting attribute values
|
78
|
+
|
79
|
+
Just works, using xpath syntax for selecting an attribute:
|
80
|
+
|
81
|
+
|
82
|
+
```ruby
|
83
|
+
# gets status value in: <oai:header status="something">
|
84
|
+
to_field "status", extract_xpath("//oai:record/oai:header/@status")
|
85
|
+
```
|
86
|
+
|
75
87
|
|
76
88
|
### selecting non-text nodes
|
77
89
|
|
data/examples/marc/tiny.xml
ADDED
@@ -0,0 +1,35 @@
|
|
1
|
+
<?xml version="1.0" encoding="UTF-8"?>
|
2
|
+
<collection xmlns="http://www.loc.gov/MARC21/slim" xmlns:marc="http://www.loc.gov/MARC21/slim">
|
3
|
+
<record>
|
4
|
+
<leader>01352cam a2200349 a 4500</leader>
|
5
|
+
<datafield tag="245" ind1="0" ind2="0">
|
6
|
+
<subfield code="6">880-01</subfield>
|
7
|
+
<subfield code="a">Kazoku kankei no shakai shinrigaku /</subfield>
|
8
|
+
<subfield code="c">Osada Masayoshi hen.</subfield>
|
9
|
+
</datafield>
|
10
|
+
</record>
|
11
|
+
<record>
|
12
|
+
<leader>01121ccm a2200289z 4500</leader>
|
13
|
+
<datafield tag="245" ind1="1" ind2="0">
|
14
|
+
<subfield code="a">Powhatan's daughter :</subfield>
|
15
|
+
<subfield code="b">march</subfield>
|
16
|
+
</datafield>
|
17
|
+
<datafield tag="100" ind1="1" ind2=" ">
|
18
|
+
<subfield code="a">Sousa, John Philip,</subfield>
|
19
|
+
<subfield code="d">1854-1932,</subfield>
|
20
|
+
<subfield code="e">composer.</subfield>
|
21
|
+
</datafield>
|
22
|
+
</record>
|
23
|
+
<record>
|
24
|
+
<leader>01137cam a2200301 a 4500</leader>
|
25
|
+
<datafield tag="245" ind1="1" ind2="0">
|
26
|
+
<subfield code="a">Two pieces /</subfield>
|
27
|
+
<subfield code="c">by Frank O'Hara.</subfield>
|
28
|
+
</datafield>
|
29
|
+
<datafield tag="100" ind1="1" ind2=" ">
|
30
|
+
<subfield code="a">O'Hara, Frank,</subfield>
|
31
|
+
<subfield code="d">1926-1966.</subfield>
|
32
|
+
<subfield code="0">http://id.loc.gov/authorities/names/n79042130</subfield>
|
33
|
+
</datafield>
|
34
|
+
</record>
|
35
|
+
</collection>
|
data/lib/traject/command_line.rb
CHANGED
@@ -29,10 +29,10 @@ module Traject
|
|
29
29
|
self.console = $stderr
|
30
30
|
|
31
31
|
self.orig_argv = argv.dup
|
32
|
-
self.remaining_argv = argv
|
33
32
|
|
34
|
-
self.slop = create_slop!
|
35
|
-
self.options =
|
33
|
+
self.slop = create_slop!(argv)
|
34
|
+
self.options = self.slop
|
35
|
+
self.remaining_argv = self.slop.arguments
|
36
36
|
end
|
37
37
|
|
38
38
|
# Returns true on success or false on failure; may also raise exceptions;
|
@@ -40,11 +40,11 @@ module Traject
|
|
40
40
|
def execute
|
41
41
|
if options[:version]
|
42
42
|
self.console.puts "traject version #{Traject::VERSION}"
|
43
|
-
return
|
43
|
+
return true
|
44
44
|
end
|
45
45
|
if options[:help]
|
46
|
-
self.console.puts slop.
|
47
|
-
return
|
46
|
+
self.console.puts slop.to_s
|
47
|
+
return true
|
48
48
|
end
|
49
49
|
|
50
50
|
|
@@ -179,11 +179,11 @@ module Traject
|
|
179
179
|
end
|
180
180
|
|
181
181
|
def arg_check!
|
182
|
-
if options[:command] == "process" && (options[:conf]
|
182
|
+
if options[:command] == "process" && (!options[:conf] || options[:conf].length == 0)
|
183
183
|
self.console.puts "Error: Missing required configuration file"
|
184
184
|
self.console.puts "Exiting..."
|
185
185
|
self.console.puts
|
186
|
-
self.console.puts self.slop.
|
186
|
+
self.console.puts self.slop.to_s
|
187
187
|
exit 2
|
188
188
|
end
|
189
189
|
end
|
@@ -234,28 +234,36 @@ module Traject
|
|
234
234
|
end
|
235
235
|
|
236
236
|
|
237
|
-
def create_slop!
|
238
|
-
|
239
|
-
banner "traject [options] -c configuration.rb [-c config2.rb] file.mrc"
|
237
|
+
def create_slop!(argv)
|
238
|
+
options = Slop::Options.new do |o|
|
239
|
+
o.banner = "traject [options] -c configuration.rb [-c config2.rb] file.mrc"
|
240
240
|
|
241
|
-
on 'v', 'version', "print version information to stderr"
|
242
|
-
on 'd', 'debug', "Include debug log, -s log.level=debug"
|
243
|
-
on 'h', 'help', "print usage information to stderr"
|
244
|
-
|
245
|
-
|
246
|
-
|
247
|
-
|
248
|
-
|
249
|
-
|
250
|
-
|
251
|
-
|
252
|
-
|
241
|
+
o.on '-v', '--version', "print version information to stderr"
|
242
|
+
o.on '-d', '--debug', "Include debug log, -s log.level=debug"
|
243
|
+
o.on '-h', '--help', "print usage information to stderr"
|
244
|
+
o.array '-c', '--conf', 'configuration file path (repeatable)', :delimiter => nil
|
245
|
+
o.string "-i", '--indexer', "Traject indexer class name or shortcut", :default => "marc"
|
246
|
+
o.array "-s", "--setting", "settings: `-s key=value` (repeatable)", :delimiter => nil
|
247
|
+
o.string "-r", "--reader", "Set reader class, shortcut for -s reader_class_name="
|
248
|
+
o.string "-o", "--output_file", "output file for Writer classes that write to files"
|
249
|
+
o.string "-w", "--writer", "Set writer class, shortcut for -s writer_class_name="
|
250
|
+
o.string "-u", "--solr", "Set solr url, shortcut for -s solr.url="
|
251
|
+
o.string "-t", "--marc_type", "xml, json or binary. shortcut for -s marc_source.type="
|
252
|
+
o.array "-I", "--load_path", "append paths to ruby $LOAD_PATH", :delimiter => ":"
|
253
253
|
|
254
|
-
|
254
|
+
o.string "-x", "--command", "alternate traject command: process (default); marcout; commit", :default => "process"
|
255
255
|
|
256
|
-
on "stdin", "read input from stdin"
|
257
|
-
on "debug-mode", "debug logging, single threaded, output human readable hashes"
|
256
|
+
o.on "--stdin", "read input from stdin"
|
257
|
+
o.on "--debug-mode", "debug logging, single threaded, output human readable hashes"
|
258
258
|
end
|
259
|
+
|
260
|
+
options.parse(argv)
|
261
|
+
rescue Slop::Error => e
|
262
|
+
self.console.puts "Error: #{e.message}"
|
263
|
+
self.console.puts "Exiting..."
|
264
|
+
self.console.puts
|
265
|
+
self.console.puts options.to_s
|
266
|
+
exit 1
|
259
267
|
end
|
260
268
|
|
261
269
|
def initialize_indexer!
|
@@ -267,22 +275,5 @@ module Traject
|
|
267
275
|
|
268
276
|
return indexer
|
269
277
|
end
|
270
|
-
|
271
|
-
def parse_options(argv)
|
272
|
-
|
273
|
-
begin
|
274
|
-
self.slop.parse!(argv)
|
275
|
-
rescue Slop::Error => e
|
276
|
-
self.console.puts "Error: #{e.message}"
|
277
|
-
self.console.puts "Exiting..."
|
278
|
-
self.console.puts
|
279
|
-
self.console.puts slop.help
|
280
|
-
exit 1
|
281
|
-
end
|
282
|
-
|
283
|
-
return self.slop.to_hash
|
284
|
-
end
|
285
|
-
|
286
|
-
|
287
278
|
end
|
288
279
|
end
|
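The rewritten `create_slop!(argv)` hands back the parsed options and exposes leftover arguments through `slop.arguments`, while `arg_check!` rejects a `process` run that has no `-c` files. As a rough illustration of that shape only, the sketch below uses Ruby's standard-library `OptionParser` instead of Slop (a deliberate substitution so that it runs without extra gems). `parse_cli` and `missing_conf?` are hypothetical names and are not part of traject.

```ruby
require 'optparse'

# Hypothetical helper: parse a traject-like argv with stdlib OptionParser
# (NOT Slop). Returns [options_hash, remaining_arguments].
def parse_cli(argv)
  options = { conf: [], command: "process" }
  parser = OptionParser.new do |o|
    o.banner = "traject [options] -c configuration.rb [-c config2.rb] file.mrc"
    # Repeatable option: each -c appends one path to the list.
    o.on("-c", "--conf PATH", "configuration file path (repeatable)") { |v| options[:conf] << v }
    o.on("-x", "--command NAME", "process (default); marcout; commit") { |v| options[:command] = v }
  end
  # parse (without the bang) leaves argv untouched and returns the leftover arguments.
  remaining = parser.parse(argv)
  [options, remaining]
end

# Mirrors the condition in arg_check!: a "process" run needs at least one config file.
def missing_conf?(options)
  options[:command] == "process" && (!options[:conf] || options[:conf].length == 0)
end

options, remaining = parse_cli(%w[-c a.rb -c b.rb file.mrc])
puts options[:conf].inspect
puts remaining.inspect
puts missing_conf?(options)
```

Keeping the parse step free of side effects on `argv` is what lets the caller hold on to both the original arguments and the remaining ones, as the diff does with `orig_argv` and `remaining_argv`.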
data/lib/traject/debug_writer.rb
CHANGED
@@ -62,7 +62,7 @@ All records are assumed to have a unique id. You can set which field to look in
|
|
62
62
|
def serialize(context)
|
63
63
|
h = context.output_hash
|
64
64
|
rec_key = record_number(context)
|
65
|
-
lines = h.keys.sort.map { |k| @format % [rec_key, k, h[k].join(' | ')] }
|
65
|
+
lines = h.keys.sort.map { |k| @format % [rec_key, k, (h[k] || []).join(' | ')] }
|
66
66
|
lines.push "\n"
|
67
67
|
lines.join("\n")
|
68
68
|
end
|
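The one-line change above guards `serialize` against a field whose value is `nil` rather than an array. A minimal sketch of the same guard follows, assuming a simplified format string; `DEBUG_FORMAT` and `debug_lines` are hypothetical stand-ins, since the real `@format` is built elsewhere in `DebugWriter`.

```ruby
# Hypothetical format string; the real DebugWriter computes @format itself.
DEBUG_FORMAT = "%s %s %s"

# Build one output line per field, sorted by field name.
def debug_lines(rec_key, output_hash)
  output_hash.keys.sort.map do |k|
    # (value || []) turns a nil value into an empty list, so join cannot raise NoMethodError.
    DEBUG_FORMAT % [rec_key, k, (output_hash[k] || []).join(' | ')]
  end
end

puts debug_lines("rec1", { "title" => ["A", "B"], "author" => nil })
```

Without the `|| []` fallback, the `nil` value for `author` would raise `NoMethodError` on `join`.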
data/lib/traject/macros/marc21.rb
CHANGED
@@ -15,7 +15,7 @@ module Traject::Macros
|
|
15
15
|
# field/substring specification.
|
16
16
|
#
|
17
17
|
# First argument is a string spec suitable for the MarcExtractor, see
|
18
|
-
# MarcExtractor::
|
18
|
+
# Traject::MarcExtractor::Spec.
|
19
19
|
#
|
20
20
|
# Second arg is optional options, including options valid on MarcExtractor.new,
|
21
21
|
# and others. By default, will de-duplicate results, but see :allow_duplicates
|
@@ -42,11 +42,11 @@ module Traject::Macros
|
|
42
42
|
#
|
43
43
|
# * :translation_map => String: translate with named translation map looked up in load
|
44
44
|
# path, uses Tranject::TranslationMap.new(translation_map_arg).
|
45
|
-
# **Instead**, use `extract_marc(whatever), translation_map(translation_map_arg)
|
45
|
+
# **Instead**, use `extract_marc(whatever), translation_map(translation_map_arg)`
|
46
46
|
#
|
47
47
|
# * :trim_punctuation => true; trims leading/trailing punctuation using standard algorithms that
|
48
48
|
# have shown themselves useful with Marc, using Marc21.trim_punctuation. **Instead**, use
|
49
|
-
# `extract_marc(whatever), trim_punctuation
|
49
|
+
# `extract_marc(whatever), trim_punctuation`
|
50
50
|
#
|
51
51
|
# * :default => String: if otherwise empty, add default value. **Instead**, use `extract_marc(whatever), default("default value")`
|
52
52
|
#
|
data/lib/traject/macros/marc21_semantics.rb
CHANGED
@@ -327,10 +327,14 @@ module Traject::Macros
|
|
327
327
|
if field008 && field008.length >= 11
|
328
328
|
date_type = field008.slice(6)
|
329
329
|
date1_str = field008.slice(7,4)
|
330
|
-
|
330
|
+
if field008.length > 15
|
331
|
+
date2_str = field008.slice(11, 4)
|
332
|
+
else
|
333
|
+
date2_str = date1_str
|
334
|
+
end
|
331
335
|
|
332
|
-
# for date_type q=questionable, we have a range.
|
333
|
-
if
|
336
|
+
# for date_type q=questionable, we expect to have a range.
|
337
|
+
if date_type == 'q' and date1_str != date2_str
|
334
338
|
# make unknown digits at the beginning or end of range,
|
335
339
|
date1 = date1_str.sub("u", "0").to_i
|
336
340
|
date2 = date2_str.sub("u", "9").to_i
|
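The hunk above adds a length check so that an 008 field too short to carry a second date falls back to the first date. The slicing can be restated as a small standalone sketch; `dates_from_008` and `questionable_range?` are hypothetical helper names, and only the logic visible in the hunk is reproduced.

```ruby
# Hypothetical helper restating the slicing shown in the hunk above.
# Returns [date_type, date1_str, date2_str], or nil when the 008 is too short.
def dates_from_008(field008)
  return nil unless field008 && field008.length >= 11

  date_type = field008.slice(6)
  date1_str = field008.slice(7, 4)
  # Only read the second date when the field is long enough to contain it.
  date2_str = if field008.length > 15
    field008.slice(11, 4)
  else
    date1_str
  end
  [date_type, date1_str, date2_str]
end

# Same condition as the new `if`: a "questionable" date is only treated as a
# range when the two dates actually differ.
def questionable_range?(date_type, date1_str, date2_str)
  date_type == 'q' && date1_str != date2_str
end

puts dates_from_008("860506q19801985nyu").inspect
puts dates_from_008("860506q1980").inspect
```

With the fallback in place, a truncated 008 yields identical first and second dates, so the range branch is skipped.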
data/lib/traject/macros/nokogiri_macros.rb
CHANGED
@@ -26,9 +26,15 @@ module Traject
|
|
26
26
|
# Make sure to avoid text content that was all blank, which is "between the children"
|
27
27
|
# whitespace.
|
28
28
|
result = result.collect do |n|
|
29
|
-
n.
|
30
|
-
|
31
|
-
|
29
|
+
if n.kind_of?(Nokogiri::XML::Attr)
|
30
|
+
# attribute value
|
31
|
+
n.value
|
32
|
+
else
|
33
|
+
# text from node
|
34
|
+
n.xpath('.//text()').collect(&:text).tap do |arr|
|
35
|
+
arr.reject! { |s| s =~ (/\A\s+\z/) }
|
36
|
+
end.join(" ")
|
37
|
+
end
|
32
38
|
end
|
33
39
|
else
|
34
40
|
# just put all matches in accumulator as Nokogiri::XML::Node's
|
data/lib/traject/macros/transformation.rb
CHANGED
@@ -157,6 +157,36 @@ module Traject
|
|
157
157
|
acc.collect! { |v| v.gsub(pattern, replace) }
|
158
158
|
end
|
159
159
|
end
|
160
|
+
|
161
|
+
# Run ruby `delete_if` on the accumulator for values that include or are equal to arg.
|
162
|
+
# It will also accept an array, set, regex pattern, proc or lambda as an argument.
|
163
|
+
#
|
164
|
+
# @example
|
165
|
+
# to_field "creator_facet", extract_marc("100abcdq"), delete_if(/foo/)
|
166
|
+
def delete_if(arg)
|
167
|
+
p = if arg.respond_to? :include?
|
168
|
+
proc { |v| arg.include?(v) }
|
169
|
+
else
|
170
|
+
proc { |v| arg === v }
|
171
|
+
end
|
172
|
+
|
173
|
+
->(_, acc) { acc.delete_if(&p) }
|
174
|
+
end
|
175
|
+
|
176
|
+
# Run ruby `select!` on the accumulator for values that include or are equal to arg.
|
177
|
+
# It accepts an array, set, regex pattern, proc or lambda as an argument.
|
178
|
+
#
|
179
|
+
# @example
|
180
|
+
# to_field "creator_facet", extract_marc("100abcdq"), select(->(v) { v != "foo" })
|
181
|
+
def select(arg)
|
182
|
+
p = if arg.respond_to? :include?
|
183
|
+
proc { |v| arg.include?(v) }
|
184
|
+
else
|
185
|
+
proc { |v| arg === v }
|
186
|
+
end
|
187
|
+
|
188
|
+
->(_, acc) { acc.select!(&p) }
|
189
|
+
end
|
160
190
|
end
|
161
191
|
end
|
162
192
|
end
|
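Both new macros build the same predicate: if the argument responds to `include?` it is used as a membership test, otherwise it is matched with `===` (which covers regexes, procs and lambdas). The sketch below restates that logic outside traject so it can be run on a plain array; `membership_predicate`, `delete_if_step` and `select_step` are hypothetical names, not the library's API.

```ruby
require 'set'

# Same branching as in the hunk: collections use include?, everything else uses ===.
def membership_predicate(arg)
  if arg.respond_to?(:include?)
    proc { |v| arg.include?(v) }
  else
    proc { |v| arg === v }
  end
end

# Returns a step with the (record, accumulator) shape used by the macros.
def delete_if_step(arg)
  pred = membership_predicate(arg)
  ->(_record, acc) { acc.delete_if(&pred) }
end

def select_step(arg)
  pred = membership_predicate(arg)
  ->(_record, acc) { acc.select!(&pred) }
end

acc = %w[foo bar baz]
delete_if_step(/\Aba/).call(nil, acc)             # regex goes through ===
puts acc.inspect                                  # ["foo"]

acc = %w[foo bar baz]
select_step(Set.new(%w[foo baz])).call(nil, acc)  # set goes through include?
puts acc.inspect                                  # ["foo", "baz"]
```

One consequence of the `include?` branch is that a plain string argument is tested as `arg.include?(value)`, so a value matches when it is a substring of the argument, not only when it is equal to it.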
data/lib/traject/marc_extractor.rb
CHANGED
@@ -2,9 +2,9 @@ require 'traject/marc_extractor_spec'
|
|
2
2
|
|
3
3
|
module Traject
|
4
4
|
# MarcExtractor is a class for extracting lists of strings from a MARC::Record,
|
5
|
-
# according to specifications. See
|
6
|
-
# string arguments used to specify extraction. See #initialize for
|
7
|
-
# that can be set controlling extraction.
|
5
|
+
# according to specifications. See Traject::MarcExtractor::Spec for description
|
6
|
+
# of string arguments used to specify extraction. See #initialize for
|
7
|
+
# options that can be set controlling extraction.
|
8
8
|
#
|
9
9
|
# Examples:
|
10
10
|
#
|
data/lib/traject/solr_json_writer.rb
CHANGED
@@ -41,10 +41,12 @@ require 'concurrent' # for atomic_fixnum
|
|
41
41
|
#
|
42
42
|
# ## Relevant settings
|
43
43
|
#
|
44
|
-
# * solr.url (optional if solr.update_url is set) The URL to the solr core to index into
|
44
|
+
# * solr.url (optional if solr.update_url is set) The URL to the solr core to index into.
|
45
|
+
# (Can include embedded HTTP basic auth as eg `http://user:pass@host/solr`)
|
45
46
|
#
|
46
47
|
# * solr.update_url: The actual update url. If unset, we'll first see if
|
47
|
-
# "#{solr.url}/update/json" exists, and if not use "#{solr.url}/update"
|
48
|
+
# "#{solr.url}/update/json" exists, and if not use "#{solr.url}/update". (Can include
|
49
|
+
# embedded HTTP basic auth as eg `http://user:pass@host/solr`)
|
48
50
|
#
|
49
51
|
# * solr_writer.batch_size: How big a batch to send to solr. Default is 100.
|
50
52
|
# My tests indicate that this setting doesn't change overall index speed by a ton.
|
@@ -101,12 +103,17 @@ class Traject::SolrJsonWriter
|
|
101
103
|
def initialize(argSettings)
|
102
104
|
@settings = Traject::Indexer::Settings.new(argSettings)
|
103
105
|
|
106
|
+
|
104
107
|
# Set max errors
|
105
108
|
@max_skipped = (@settings['solr_writer.max_skipped'] || DEFAULT_MAX_SKIPPED).to_i
|
106
109
|
if @max_skipped < 0
|
107
110
|
@max_skipped = nil
|
108
111
|
end
|
109
112
|
|
113
|
+
|
114
|
+
# Figure out where to send updates, and if with basic auth
|
115
|
+
@solr_update_url, basic_auth_user, basic_auth_password = self.determine_solr_update_url
|
116
|
+
|
110
117
|
@http_client = if @settings["solr_json_writer.http_client"]
|
111
118
|
@settings["solr_json_writer.http_client"]
|
112
119
|
else
|
@@ -115,9 +122,8 @@ class Traject::SolrJsonWriter
|
|
115
122
|
client.connect_timeout = client.receive_timeout = client.send_timeout = @settings["solr_writer.http_timeout"]
|
116
123
|
end
|
117
124
|
|
118
|
-
if
|
119
|
-
|
120
|
-
client.set_auth(@settings["solr.url"], @settings["solr_writer.basic_auth_user"], @settings["solr_writer.basic_auth_password"])
|
125
|
+
if basic_auth_user || basic_auth_password
|
126
|
+
client.set_auth(@solr_update_url, basic_auth_user, basic_auth_password)
|
121
127
|
end
|
122
128
|
|
123
129
|
client
|
@@ -143,13 +149,11 @@ class Traject::SolrJsonWriter
|
|
143
149
|
# this the new default writer.
|
144
150
|
@commit_on_close = (settings["solr_writer.commit_on_close"] || settings["solrj_writer.commit_on_close"]).to_s == "true"
|
145
151
|
|
146
|
-
# Figure out where to send updates
|
147
|
-
@solr_update_url = self.determine_solr_update_url
|
148
152
|
|
149
153
|
@solr_update_args = settings["solr_writer.solr_update_args"]
|
150
154
|
@commit_solr_update_args = settings["solr_writer.commit_solr_update_args"]
|
151
155
|
|
152
|
-
logger.info(" #{self.class.name} writing to '#{@solr_update_url}' in batches of #{@batch_size} with #{@thread_pool_size} bg threads")
|
156
|
+
logger.info(" #{self.class.name} writing to '#{@solr_update_url}' #{"(with HTTP basic auth)" if basic_auth_user || basic_auth_password}in batches of #{@batch_size} with #{@thread_pool_size} bg threads")
|
153
157
|
end
|
154
158
|
|
155
159
|
|
@@ -368,13 +372,27 @@ class Traject::SolrJsonWriter
|
|
368
372
|
end
|
369
373
|
|
370
374
|
|
371
|
-
# Relatively complex logic to determine if we have a valid URL and what it is
|
375
|
+
# Relatively complex logic to determine if we have a valid URL and what it is,
|
376
|
+
# and if we have basic_auth info
|
377
|
+
#
|
378
|
+
# Empties out user and password embedded in URI returned, to help avoid logging it.
|
379
|
+
#
|
380
|
+
# @returns [update_url, basic_auth_user, basic_auth_password]
|
372
381
|
def determine_solr_update_url
|
373
|
-
if settings['solr.update_url']
|
382
|
+
url = if settings['solr.update_url']
|
374
383
|
check_solr_update_url(settings['solr.update_url'])
|
375
384
|
else
|
376
385
|
derive_solr_update_url_from_solr_url(settings['solr.url'])
|
377
386
|
end
|
387
|
+
|
388
|
+
parsed_uri = URI.parse(url)
|
389
|
+
user_from_uri, password_from_uri = parsed_uri.user, parsed_uri.password
|
390
|
+
parsed_uri.user, parsed_uri.password = nil, nil
|
391
|
+
|
392
|
+
basic_auth_user = @settings["solr_writer.basic_auth_user"] || user_from_uri
|
393
|
+
basic_auth_password = @settings["solr_writer.basic_auth_password"] || password_from_uri
|
394
|
+
|
395
|
+
return [parsed_uri.to_s, basic_auth_user, basic_auth_password]
|
378
396
|
end
|
379
397
|
|
380
398
|
|
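The reworked `determine_solr_update_url` pulls any credentials out of the URL, blanks them in the URL it returns (so they are less likely to be logged), and lets the explicit `solr_writer.basic_auth_*` settings take precedence. A standalone sketch of that handling, using only Ruby's `uri` library, could look like this; `split_basic_auth` is a hypothetical name and the plain hash stands in for traject's settings object.

```ruby
require 'uri'

# Hypothetical helper restating the credential handling in the hunk above.
# Returns [url_without_credentials, user, password].
def split_basic_auth(url, settings = {})
  parsed_uri = URI.parse(url)
  user_from_uri, password_from_uri = parsed_uri.user, parsed_uri.password
  # Blank the credentials so the returned URL is safe to log.
  parsed_uri.user, parsed_uri.password = nil, nil

  # Explicit settings win; URL-embedded values are the fallback.
  user     = settings["solr_writer.basic_auth_user"]     || user_from_uri
  password = settings["solr_writer.basic_auth_password"] || password_from_uri
  [parsed_uri.to_s, user, password]
end

url, user, password = split_basic_auth("http://user:pass@example.org/solr")
puts url       # http://example.org/solr
puts user      # user
puts password  # pass
```

Because each setting falls back independently, it is possible to take the user from a setting and the password from the URL, which matches the `||` chains in the diff.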
data/lib/traject/version.rb
CHANGED