traject 3.3.0 → 3.7.0
- checksums.yaml +4 -4
- data/.github/workflows/ruby.yml +35 -0
- data/CHANGES.md +23 -2
- data/README.md +23 -2
- data/doc/settings.md +4 -2
- data/doc/xml.md +12 -0
- data/examples/marc/tiny.xml +35 -0
- data/lib/traject/command_line.rb +34 -43
- data/lib/traject/debug_writer.rb +1 -1
- data/lib/traject/macros/marc21.rb +3 -3
- data/lib/traject/macros/marc21_semantics.rb +7 -3
- data/lib/traject/macros/nokogiri_macros.rb +9 -3
- data/lib/traject/macros/transformation.rb +30 -0
- data/lib/traject/marc_extractor.rb +3 -3
- data/lib/traject/nokogiri_reader.rb +2 -0
- data/lib/traject/solr_json_writer.rb +28 -10
- data/lib/traject/version.rb +1 -1
- data/lib/translation_maps/marc_languages.yaml +77 -48
- data/test/command_line_test.rb +52 -0
- data/test/debug_writer_test.rb +13 -0
- data/test/indexer/macros/macros_marc21_semantics_test.rb +4 -0
- data/test/indexer/macros/transformation_test.rb +110 -0
- data/test/indexer/nokogiri_indexer_test.rb +35 -0
- data/test/indexer/read_write_test.rb +14 -3
- data/test/solr_json_writer_test.rb +45 -10
- data/test/test_support/missing-second-date.marc +1 -0
- data/traject.gemspec +3 -3
- metadata +19 -21
- data/.travis.yml +0 -16
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 7ffc677e0ebb13e01b852a1d59ddfdd3cd9906142520e0c296f69ebb0eeb7429
|
4
|
+
data.tar.gz: 61b0e966f6ecd4d27e757e4cfc1057c72ac6deca5ad119c78ff883c246744814
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 8240b450b27df011c2ff998c24c612f44bdd21a2fde5fbab996ffe509f3fc45cec7a8a8947e385a35d06f5bd8ed19732287e8f2d3dab682cc6295f7320f8dfab
|
7
|
+
data.tar.gz: f5dbcb44edb8d37a4e74cd1255aa1b05b638913337577c92d4fa276150c023b9e29823cc2b20ef050b33879ec63d76194d0202697d7c858a7aacd3d08241dcce
|
@@ -0,0 +1,35 @@
|
|
1
|
+
name: CI
|
2
|
+
|
3
|
+
on:
|
4
|
+
push:
|
5
|
+
branches: [ master ]
|
6
|
+
pull_request:
|
7
|
+
branches: ['**']
|
8
|
+
|
9
|
+
jobs:
|
10
|
+
tests:
|
11
|
+
runs-on: ubuntu-latest
|
12
|
+
strategy:
|
13
|
+
fail-fast: false
|
14
|
+
matrix:
|
15
|
+
ruby: [ '2.4', '2.5', '2.6', '2.7', '3.0', 'jruby-9.1', 'jruby-9.2' ]
|
16
|
+
name: Ruby ${{ matrix.ruby }}
|
17
|
+
steps:
|
18
|
+
- uses: actions/checkout@v2
|
19
|
+
|
20
|
+
- name: Set up Ruby
|
21
|
+
uses: ruby/setup-ruby@v1
|
22
|
+
with:
|
23
|
+
ruby-version: ${{ matrix.ruby }}
|
24
|
+
|
25
|
+
- name: set JAVA_OPTS for jruby-9.1
|
26
|
+
run: echo 'JAVA_OPTS="--add-opens java.base/java.security.cert=ALL-UNNAMED --add-opens java.base/java.security=ALL-UNNAMED --add-opens java.base/java.util.zip=ALL-UNNAMED"' >> $GITHUB_ENV
|
27
|
+
if: ${{ matrix.ruby == 'jruby-9.1' }}
|
28
|
+
# https://github.com/jruby/jruby/issues/4834
|
29
|
+
# Still seems to be an issue in jruby-9.1, but not 9.2
|
30
|
+
# https://github.community/t/conditional-setting-of-env-variables-in-gh-actions/179650
|
31
|
+
|
32
|
+
- name: Install dependencies
|
33
|
+
run: bundle install --jobs 4 --retry 3
|
34
|
+
- name: Run tests
|
35
|
+
run: bundle exec rake
|
data/CHANGES.md
CHANGED
@@ -1,12 +1,33 @@
|
|
1
1
|
# Changes
|
2
2
|
|
3
|
-
##
|
3
|
+
## NEXT
|
4
4
|
|
5
5
|
*
|
6
6
|
|
7
7
|
*
|
8
8
|
|
9
|
-
|
9
|
+
## 3.7.0
|
10
|
+
|
11
|
+
* Add two new transformation macros, `Traject::Macros::Transformation.delete_if` and `Traject::Macros::Transformation.select`.
|
12
|
+
|
13
|
+
## 3.6.0
|
14
|
+
|
15
|
+
* Tiny backward compat changes for ruby 3.0 compat. https://github.com/traject/traject/pull/263
|
16
|
+
|
17
|
+
* Allow gem `http` 5.x in gemspec. https://github.com/traject/traject/pull/269
|
18
|
+
|
19
|
+
## 3.5.0
|
20
|
+
|
21
|
+
* `traject -v` and `traject -h` correctly return 0 exit code indicating success.
|
22
|
+
|
23
|
+
* upgrade to slop gem 4.x, which carries with it a slightly different format of human-readable command-line arg errors, should be otherwise invisible.
|
24
|
+
|
25
|
+
* the SolrJsonWriter now supports HTTP basic auth credentials embedded in `solr.url` or `solr.update_url`, eg `http://user:pass@example.org/solr` https://github.com/traject/traject/pull/262
|
26
|
+
|
27
|
+
|
28
|
+
## 3.4.0
|
29
|
+
|
30
|
+
* XML-mode `extract_xpath` now supports extracting attribute values with xpath @attr syntax.
|
10
31
|
|
11
32
|
## 3.3.0
|
12
33
|
|
data/README.md
CHANGED
@@ -8,8 +8,8 @@ Traject can also be generalized to a set of tools for getting structured data fr
|
|
8
8
|
|
9
9
|
**Traject is stable, mature software, that is already being used in production by its authors and several other institutions.**
|
10
10
|
|
11
|
-
[![Gem Version](https://badge.fury.io/rb/traject.
|
12
|
-
[![
|
11
|
+
[![Gem Version](https://badge.fury.io/rb/traject.svg)](http://badge.fury.io/rb/traject)
|
12
|
+
[![CI Status](https://github.com/traject/traject/workflows/CI/badge.svg?branch=master)](https://github.com/traject/traject/actions?query=workflow%3ACI+branch%3Amaster)
|
13
13
|
|
14
14
|
|
15
15
|
## Background/Goals
|
@@ -177,6 +177,11 @@ TranslationMap use above is just one example of a transformation macro, that tra
|
|
177
177
|
* `split(" ")`: take values and split them, possibly result in multiple values.
|
178
178
|
* `transform(proc)`: transform each existing macro using a proc, kind of like `map`.
|
179
179
|
eg `to_field "something", extract_xml("//author"), transform( ->(author) { "#{author.last}, #{author.first}" })
|
180
|
+
* `delete_if(["a", "b"])`: remove a value from accumulated values if it is included in the passed-in argument.
|
181
|
+
* Can also take a string, proc or regex as an argument. See [tests](test/indexer/macros/transformation_test.rb) for full functionality.
|
182
|
+
* `select(proc)`: selects (keeps) values from accumulated values if proc evaluates to true for the specific value.
|
183
|
+
* Can also take an array, set, or regex as an argument. See [tests](test/indexer/macros/transformation_test.rb) for full functionality.
|
184
|
+
|
180
185
|
|
181
186
|
You can add on as many transformation macros as you want, they will be applied to output in order.
|
182
187
|
|
@@ -468,6 +473,22 @@ Also see `-I load_path` option and suggestions for Bundler use under Extending W
|
|
468
473
|
See also [Hints for batch and cronjob use](./doc/batch_execution.md) of traject.
|
469
474
|
|
470
475
|
|
476
|
+
## A small but complete example
|
477
|
+
|
478
|
+
To process a MARC XML file with the data shown in [./examples/marc/tiny.xml](./examples/marc/tiny.xml) you can save the following configuration as `config.rb`:
|
479
|
+
|
480
|
+
```
|
481
|
+
to_field 'title', extract_marc('245a', first: true)
|
482
|
+
```
|
483
|
+
|
484
|
+
and run Traject as follows:
|
485
|
+
|
486
|
+
```
|
487
|
+
traject -t xml -c config.rb -w Traject::DebugWriter tiny.xml
|
488
|
+
```
|
489
|
+
|
490
|
+
`-t xml` indicates that the file is a MARC XML file. `-w Traject::DebugWriter` outputs the results to the console (i.e. without saving to Solr).
|
491
|
+
|
471
492
|
## Extending With Your Own Code
|
472
493
|
|
473
494
|
Traject config files are full live ruby files, where you can do anything,
|
data/doc/settings.md
CHANGED
@@ -83,7 +83,8 @@ settings are applied first of all. It's recommended you use `provide`.
|
|
83
83
|
### Writing to solr
|
84
84
|
|
85
85
|
* `json_writer.pretty_print`: used by the JsonWriter, if set to true, will output pretty printed json (with added whitespace) for easier human readability. Default false.
|
86
|
-
|
86
|
+
|
87
|
+
* `solr.url`: URL to connect to a solr instance for indexing, eg http://example.org:8983/solr . Command-line short-cut `-u`. (Can include embedded HTTP basic auth as eg `http://user:pass@example.org/solr`)
|
87
88
|
|
88
89
|
* `solr.version`: Set to eg "1.4.0", "4.3.0"; currently un-used, but in the future will control some default settings, and/or sanity check and warn you if you're doing something that might not work with that version of solr. Set now for help in the future.
|
89
90
|
|
@@ -93,7 +94,8 @@ settings are applied first of all. It's recommended you use `provide`.
|
|
93
94
|
|
94
95
|
* `solr_writer.thread_pool`: defaults to 1 (single bg thread). A thread pool is used for submitting docs to solr. Set to 0 or nil to disable threading. Set to 1, there will still be a single bg thread doing the adds. May make sense to set higher than number of cores on your indexing machine, as these threads will mostly be waiting on Solr. Speed/capacity of your solr might be more relevant. Note that processing_thread_pool threads can end up submitting to solr too, if solr_json_writer.thread_pool is full.
|
95
96
|
|
96
|
-
* `solr_writer.basic_auth_user`, `solr_writer.basic_auth_password`: Not set by default but when both are set the default writer is configured with basic auth.
|
97
|
+
* `solr_writer.basic_auth_user`, `solr_writer.basic_auth_password`: Not set by default but when both are set the default writer is configured with basic auth. You can also just embed basic
|
98
|
+
auth credentials in `solr.url` using standard URI syntax.
|
97
99
|
|
98
100
|
|
99
101
|
### Dealing with MARC data
|
data/doc/xml.md
CHANGED
@@ -4,6 +4,8 @@ The [NokogiriIndexer](../lib/traject/nokogiri_indexer.md) is a Traject::Indexer
|
|
4
4
|
|
5
5
|
It by default uses the NokogiriReader to read XML and read Nokogiri::XML::Documents, and includes the NokogiriMacros mix-in, with some macros for operating on Nokogiri::XML::Documents.
|
6
6
|
|
7
|
+
Please note that the recommended mechanism to parse MARC XML files with Traject is via the `-t` parameter (or via the `provide "marc_source.type", "xml"` setting). The documentation on this page is for those parsing other (non-MARC) XML files.
|
8
|
+
|
7
9
|
## On the command-line
|
8
10
|
|
9
11
|
You can tell the traject command-line to use the NokogiriIndexer with the `-i xml` flag:
|
@@ -72,6 +74,16 @@ You can use all the standard transforation macros in Traject::Macros::Transforma
|
|
72
74
|
to_field "something", extract_xpath("//value"), first_only, translation_map("some_map"), default("no value")
|
73
75
|
```
|
74
76
|
|
77
|
+
### selecting attribute values
|
78
|
+
|
79
|
+
Just works, using xpath syntax for selecting an attribute:
|
80
|
+
|
81
|
+
|
82
|
+
```ruby
|
83
|
+
# gets status value in: <oai:header status="something">
|
84
|
+
to_field "status", extract_xpath("//oai:record/oai:header/@status")
|
85
|
+
```
|
86
|
+
|
75
87
|
|
76
88
|
### selecting non-text nodes
|
77
89
|
|
@@ -0,0 +1,35 @@
|
|
1
|
+
<?xml version="1.0" encoding="UTF-8"?>
|
2
|
+
<collection xmlns="http://www.loc.gov/MARC21/slim" xmlns:marc="http://www.loc.gov/MARC21/slim">
|
3
|
+
<record>
|
4
|
+
<leader>01352cam a2200349 a 4500</leader>
|
5
|
+
<datafield tag="245" ind1="0" ind2="0">
|
6
|
+
<subfield code="6">880-01</subfield>
|
7
|
+
<subfield code="a">Kazoku kankei no shakai shinrigaku /</subfield>
|
8
|
+
<subfield code="c">Osada Masayoshi hen.</subfield>
|
9
|
+
</datafield>
|
10
|
+
</record>
|
11
|
+
<record>
|
12
|
+
<leader>01121ccm a2200289z 4500</leader>
|
13
|
+
<datafield tag="245" ind1="1" ind2="0">
|
14
|
+
<subfield code="a">Powhatan's daughter :</subfield>
|
15
|
+
<subfield code="b">march</subfield>
|
16
|
+
</datafield>
|
17
|
+
<datafield tag="100" ind1="1" ind2=" ">
|
18
|
+
<subfield code="a">Sousa, John Philip,</subfield>
|
19
|
+
<subfield code="d">1854-1932,</subfield>
|
20
|
+
<subfield code="e">composer.</subfield>
|
21
|
+
</datafield>
|
22
|
+
</record>
|
23
|
+
<record>
|
24
|
+
<leader>01137cam a2200301 a 4500</leader>
|
25
|
+
<datafield tag="245" ind1="1" ind2="0">
|
26
|
+
<subfield code="a">Two pieces /</subfield>
|
27
|
+
<subfield code="c">by Frank O'Hara.</subfield>
|
28
|
+
</datafield>
|
29
|
+
<datafield tag="100" ind1="1" ind2=" ">
|
30
|
+
<subfield code="a">O'Hara, Frank,</subfield>
|
31
|
+
<subfield code="d">1926-1966.</subfield>
|
32
|
+
<subfield code="0">http://id.loc.gov/authorities/names/n79042130</subfield>
|
33
|
+
</datafield>
|
34
|
+
</record>
|
35
|
+
</collection>
|
data/lib/traject/command_line.rb
CHANGED
@@ -29,10 +29,10 @@ module Traject
|
|
29
29
|
self.console = $stderr
|
30
30
|
|
31
31
|
self.orig_argv = argv.dup
|
32
|
-
self.remaining_argv = argv
|
33
32
|
|
34
|
-
self.slop = create_slop!
|
35
|
-
self.options =
|
33
|
+
self.slop = create_slop!(argv)
|
34
|
+
self.options = self.slop
|
35
|
+
self.remaining_argv = self.slop.arguments
|
36
36
|
end
|
37
37
|
|
38
38
|
# Returns true on success or false on failure; may also raise exceptions;
|
@@ -40,11 +40,11 @@ module Traject
|
|
40
40
|
def execute
|
41
41
|
if options[:version]
|
42
42
|
self.console.puts "traject version #{Traject::VERSION}"
|
43
|
-
return
|
43
|
+
return true
|
44
44
|
end
|
45
45
|
if options[:help]
|
46
|
-
self.console.puts slop.
|
47
|
-
return
|
46
|
+
self.console.puts slop.to_s
|
47
|
+
return true
|
48
48
|
end
|
49
49
|
|
50
50
|
|
@@ -179,11 +179,11 @@ module Traject
|
|
179
179
|
end
|
180
180
|
|
181
181
|
def arg_check!
|
182
|
-
if options[:command] == "process" && (options[:conf]
|
182
|
+
if options[:command] == "process" && (!options[:conf] || options[:conf].length == 0)
|
183
183
|
self.console.puts "Error: Missing required configuration file"
|
184
184
|
self.console.puts "Exiting..."
|
185
185
|
self.console.puts
|
186
|
-
self.console.puts self.slop.
|
186
|
+
self.console.puts self.slop.to_s
|
187
187
|
exit 2
|
188
188
|
end
|
189
189
|
end
|
@@ -234,28 +234,36 @@ module Traject
|
|
234
234
|
end
|
235
235
|
|
236
236
|
|
237
|
-
def create_slop!
|
238
|
-
|
239
|
-
banner "traject [options] -c configuration.rb [-c config2.rb] file.mrc"
|
237
|
+
def create_slop!(argv)
|
238
|
+
options = Slop::Options.new do |o|
|
239
|
+
o.banner = "traject [options] -c configuration.rb [-c config2.rb] file.mrc"
|
240
240
|
|
241
|
-
on 'v', 'version', "print version information to stderr"
|
242
|
-
on 'd', 'debug', "Include debug log, -s log.level=debug"
|
243
|
-
on 'h', 'help', "print usage information to stderr"
|
244
|
-
|
245
|
-
|
246
|
-
|
247
|
-
|
248
|
-
|
249
|
-
|
250
|
-
|
251
|
-
|
252
|
-
|
241
|
+
o.on '-v', '--version', "print version information to stderr"
|
242
|
+
o.on '-d', '--debug', "Include debug log, -s log.level=debug"
|
243
|
+
o.on '-h', '--help', "print usage information to stderr"
|
244
|
+
o.array '-c', '--conf', 'configuration file path (repeatable)', :delimiter => nil
|
245
|
+
o.string "-i", '--indexer', "Traject indexer class name or shortcut", :default => "marc"
|
246
|
+
o.array "-s", "--setting", "settings: `-s key=value` (repeatable)", :delimiter => nil
|
247
|
+
o.string "-r", "--reader", "Set reader class, shortcut for -s reader_class_name="
|
248
|
+
o.string "-o", "--output_file", "output file for Writer classes that write to files"
|
249
|
+
o.string "-w", "--writer", "Set writer class, shortcut for -s writer_class_name="
|
250
|
+
o.string "-u", "--solr", "Set solr url, shortcut for -s solr.url="
|
251
|
+
o.string "-t", "--marc_type", "xml, json or binary. shortcut for -s marc_source.type="
|
252
|
+
o.array "-I", "--load_path", "append paths to ruby $LOAD_PATH", :delimiter => ":"
|
253
253
|
|
254
|
-
|
254
|
+
o.string "-x", "--command", "alternate traject command: process (default); marcout; commit", :default => "process"
|
255
255
|
|
256
|
-
on "stdin", "read input from stdin"
|
257
|
-
on "debug-mode", "debug logging, single threaded, output human readable hashes"
|
256
|
+
o.on "--stdin", "read input from stdin"
|
257
|
+
o.on "--debug-mode", "debug logging, single threaded, output human readable hashes"
|
258
258
|
end
|
259
|
+
|
260
|
+
options.parse(argv)
|
261
|
+
rescue Slop::Error => e
|
262
|
+
self.console.puts "Error: #{e.message}"
|
263
|
+
self.console.puts "Exiting..."
|
264
|
+
self.console.puts
|
265
|
+
self.console.puts options.to_s
|
266
|
+
exit 1
|
259
267
|
end
|
260
268
|
|
261
269
|
def initialize_indexer!
|
@@ -267,22 +275,5 @@ module Traject
|
|
267
275
|
|
268
276
|
return indexer
|
269
277
|
end
|
270
|
-
|
271
|
-
def parse_options(argv)
|
272
|
-
|
273
|
-
begin
|
274
|
-
self.slop.parse!(argv)
|
275
|
-
rescue Slop::Error => e
|
276
|
-
self.console.puts "Error: #{e.message}"
|
277
|
-
self.console.puts "Exiting..."
|
278
|
-
self.console.puts
|
279
|
-
self.console.puts slop.help
|
280
|
-
exit 1
|
281
|
-
end
|
282
|
-
|
283
|
-
return self.slop.to_hash
|
284
|
-
end
|
285
|
-
|
286
|
-
|
287
278
|
end
|
288
279
|
end
|
data/lib/traject/debug_writer.rb
CHANGED
@@ -62,7 +62,7 @@ All records are assumed to have a unique id. You can set which field to look in
|
|
62
62
|
def serialize(context)
|
63
63
|
h = context.output_hash
|
64
64
|
rec_key = record_number(context)
|
65
|
-
lines = h.keys.sort.map { |k| @format % [rec_key, k, h[k].join(' | ')] }
|
65
|
+
lines = h.keys.sort.map { |k| @format % [rec_key, k, (h[k] || []).join(' | ')] }
|
66
66
|
lines.push "\n"
|
67
67
|
lines.join("\n")
|
68
68
|
end
|
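The `(h[k] || [])` guard in the hunk above keeps `serialize` from raising when an output field is `nil`. A minimal sketch of the effect, assuming a made-up format string and sample hash (neither is taken from the gem):

```ruby
# Hypothetical format string and data; only the `(h[k] || [])` guard
# comes from the diff above.
format = "%s %s %s"
h = { "title" => ["Two pieces /"], "author" => nil }

# With the guard, a nil value is rendered as an empty string instead of
# raising NoMethodError on nil.join.
lines = h.keys.sort.map { |k| format % ["rec1", k, (h[k] || []).join(' | ')] }

puts lines
```

Without the guard, the `"author"` entry would call `join` on `nil` and abort serialization of the whole record.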
@@ -15,7 +15,7 @@ module Traject::Macros
|
|
15
15
|
# field/substring specification.
|
16
16
|
#
|
17
17
|
# First argument is a string spec suitable for the MarcExtractor, see
|
18
|
-
# MarcExtractor::
|
18
|
+
# Traject::MarcExtractor::Spec.
|
19
19
|
#
|
20
20
|
# Second arg is optional options, including options valid on MarcExtractor.new,
|
21
21
|
# and others. By default, will de-duplicate results, but see :allow_duplicates
|
@@ -42,11 +42,11 @@ module Traject::Macros
|
|
42
42
|
#
|
43
43
|
# * :translation_map => String: translate with named translation map looked up in load
|
44
44
|
# path, uses Traject::TranslationMap.new(translation_map_arg).
|
45
|
-
# **Instead**, use `extract_marc(whatever), translation_map(translation_map_arg)
|
45
|
+
# **Instead**, use `extract_marc(whatever), translation_map(translation_map_arg)`
|
46
46
|
#
|
47
47
|
# * :trim_punctuation => true; trims leading/trailing punctuation using standard algorithms that
|
48
48
|
# have shown themselves useful with Marc, using Marc21.trim_punctuation. **Instead**, use
|
49
|
-
# `extract_marc(whatever), trim_punctuation
|
49
|
+
# `extract_marc(whatever), trim_punctuation`
|
50
50
|
#
|
51
51
|
# * :default => String: if otherwise empty, add default value. **Instead**, use `extract_marc(whatever), default("default value")`
|
52
52
|
#
|
@@ -327,10 +327,14 @@ module Traject::Macros
|
|
327
327
|
if field008 && field008.length >= 11
|
328
328
|
date_type = field008.slice(6)
|
329
329
|
date1_str = field008.slice(7,4)
|
330
|
-
|
330
|
+
if field008.length > 15
|
331
|
+
date2_str = field008.slice(11, 4)
|
332
|
+
else
|
333
|
+
date2_str = date1_str
|
334
|
+
end
|
331
335
|
|
332
|
-
# for date_type q=questionable, we have a range.
|
333
|
-
if
|
336
|
+
# for date_type q=questionable, we expect to have a range.
|
337
|
+
if date_type == 'q' and date1_str != date2_str
|
334
338
|
# make unknown digits at the beginning or end of range,
|
335
339
|
date1 = date1_str.sub("u", "0").to_i
|
336
340
|
date2 = date2_str.sub("u", "9").to_i
|
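The slicing change above can be tried in isolation. This is a minimal sketch under stated assumptions: `slice_008_dates` is a hypothetical helper name, the sample 008 strings are invented, and only the slicing and fallback shown in the hunk are mirrored, not the rest of the date logic.

```ruby
# Hypothetical helper (not part of Traject) mirroring only the slicing
# shown in the hunk above.
def slice_008_dates(field008)
  return nil unless field008 && field008.length >= 11

  date_type = field008.slice(6)
  date1_str = field008.slice(7, 4)
  # Fall back to date1 when the field is too short to carry a second date.
  date2_str = field008.length > 15 ? field008.slice(11, 4) : date1_str

  [date_type, date1_str, date2_str]
end

full      = slice_008_dates("850101q19701975nyu") # both dates present
truncated = slice_008_dates("850101q1970")        # date2 falls back to date1

p full
p truncated
```

Because the fallback makes `date1_str` and `date2_str` equal, a truncated `q` record no longer enters the range branch guarded by `date_type == 'q' and date1_str != date2_str`.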
@@ -26,9 +26,15 @@ module Traject
|
|
26
26
|
# Make sure to avoid text content that was all blank, which is "between the children"
|
27
27
|
# whitespace.
|
28
28
|
result = result.collect do |n|
|
29
|
-
n.
|
30
|
-
|
31
|
-
|
29
|
+
if n.kind_of?(Nokogiri::XML::Attr)
|
30
|
+
# attribute value
|
31
|
+
n.value
|
32
|
+
else
|
33
|
+
# text from node
|
34
|
+
n.xpath('.//text()').collect(&:text).tap do |arr|
|
35
|
+
arr.reject! { |s| s =~ (/\A\s+\z/) }
|
36
|
+
end.join(" ")
|
37
|
+
end
|
32
38
|
end
|
33
39
|
else
|
34
40
|
# just put all matches in accumulator as Nokogiri::XML::Node's
|
@@ -157,6 +157,36 @@ module Traject
|
|
157
157
|
acc.collect! { |v| v.gsub(pattern, replace) }
|
158
158
|
end
|
159
159
|
end
|
160
|
+
|
161
|
+
# Run ruby `delete_if` on the accumulator for values that include or are equal to arg.
|
162
|
+
# It will also accept an array, set, regex pattern, proc or lambda as an argument.
|
163
|
+
#
|
164
|
+
# @example
|
165
|
+
# to_field "creator_facet", extract_marc("100abcdq"), delete_if(/foo/)
|
166
|
+
def delete_if(arg)
|
167
|
+
p = if arg.respond_to? :include?
|
168
|
+
proc { |v| arg.include?(v) }
|
169
|
+
else
|
170
|
+
proc { |v| arg === v }
|
171
|
+
end
|
172
|
+
|
173
|
+
->(_, acc) { acc.delete_if(&p) }
|
174
|
+
end
|
175
|
+
|
176
|
+
# Run ruby `select!` on the accumulator for values that include or are equal to arg.
|
177
|
+
# It accepts an array, set, regex pattern, proc or lambda as an argument.
|
178
|
+
#
|
179
|
+
# @example
|
180
|
+
# to_field "creator_facet", extract_marc("100abcdq"), select(->(v) { v != "foo" })
|
181
|
+
def select(arg)
|
182
|
+
p = if arg.respond_to? :include?
|
183
|
+
proc { |v| arg.include?(v) }
|
184
|
+
else
|
185
|
+
proc { |v| arg === v }
|
186
|
+
end
|
187
|
+
|
188
|
+
->(_, acc) { acc.select!(&p) }
|
189
|
+
end
|
160
190
|
end
|
161
191
|
end
|
162
192
|
end
|
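The two macros added in the hunk above can be exercised without Traject. The following is a minimal sketch that copies their logic into a stand-alone module; the module name `MacroSketch`, the shared `matcher` helper, and the sample values are assumptions for illustration and are not part of the gem.

```ruby
require 'set'

# Hypothetical stand-alone module (not part of Traject) mirroring the
# delete_if/select logic shown in the hunk above.
module MacroSketch
  module_function

  # Predicate shared by both macros: a membership test for anything that
  # responds to include? (Array, Set, String, Range), otherwise ===
  # (a Regexp match or a Proc call).
  def matcher(arg)
    if arg.respond_to?(:include?)
      proc { |v| arg.include?(v) }
    else
      proc { |v| arg === v }
    end
  end

  # Returns a lambda that removes matching values from the accumulator.
  def delete_if(arg)
    p = matcher(arg)
    ->(_record, acc) { acc.delete_if(&p) }
  end

  # Returns a lambda that keeps only matching values in the accumulator.
  def select(arg)
    p = matcher(arg)
    ->(_record, acc) { acc.select!(&p) }
  end
end

acc = ["a", "b", "c"]
MacroSketch.delete_if(["a", "b"]).call(nil, acc)

langs = ["eng", "fre", "zxx"]
MacroSketch.delete_if(Set["zxx"]).call(nil, langs)

names = ["Sousa, John Philip", "foo", "O'Hara, Frank"]
MacroSketch.select(->(v) { v != "foo" }).call(nil, names)

p acc, langs, names
```

One consequence of the `respond_to? :include?` check is that a String argument is treated as a container: the test becomes `arg.include?(value)`, so a value matches when it is a substring of the argument, not the other way around.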
@@ -2,9 +2,9 @@ require 'traject/marc_extractor_spec'
|
|
2
2
|
|
3
3
|
module Traject
|
4
4
|
# MarcExtractor is a class for extracting lists of strings from a MARC::Record,
|
5
|
-
# according to specifications. See
|
6
|
-
# string arguments used to specify extraction. See #initialize for
|
7
|
-
# that can be set controlling extraction.
|
5
|
+
# according to specifications. See Traject::MarcExtractor::Spec for description
|
6
|
+
# of string arguments used to specify extraction. See #initialize for
|
7
|
+
# options that can be set controlling extraction.
|
8
8
|
#
|
9
9
|
# Examples:
|
10
10
|
#
|
@@ -41,10 +41,12 @@ require 'concurrent' # for atomic_fixnum
|
|
41
41
|
#
|
42
42
|
# ## Relevant settings
|
43
43
|
#
|
44
|
-
# * solr.url (optional if solr.update_url is set) The URL to the solr core to index into
|
44
|
+
# * solr.url (optional if solr.update_url is set) The URL to the solr core to index into.
|
45
|
+
# (Can include embedded HTTP basic auth as eg `http://user:pass@host/solr`)
|
45
46
|
#
|
46
47
|
# * solr.update_url: The actual update url. If unset, we'll first see if
|
47
|
-
# "#{solr.url}/update/json" exists, and if not use "#{solr.url}/update"
|
48
|
+
# "#{solr.url}/update/json" exists, and if not use "#{solr.url}/update". (Can include
|
49
|
+
# embedded HTTP basic auth as eg `http://user:pass@host/solr`)
|
48
50
|
#
|
49
51
|
# * solr_writer.batch_size: How big a batch to send to solr. Default is 100.
|
50
52
|
# My tests indicate that this setting doesn't change overall index speed by a ton.
|
@@ -101,12 +103,17 @@ class Traject::SolrJsonWriter
|
|
101
103
|
def initialize(argSettings)
|
102
104
|
@settings = Traject::Indexer::Settings.new(argSettings)
|
103
105
|
|
106
|
+
|
104
107
|
# Set max errors
|
105
108
|
@max_skipped = (@settings['solr_writer.max_skipped'] || DEFAULT_MAX_SKIPPED).to_i
|
106
109
|
if @max_skipped < 0
|
107
110
|
@max_skipped = nil
|
108
111
|
end
|
109
112
|
|
113
|
+
|
114
|
+
# Figure out where to send updates, and if with basic auth
|
115
|
+
@solr_update_url, basic_auth_user, basic_auth_password = self.determine_solr_update_url
|
116
|
+
|
110
117
|
@http_client = if @settings["solr_json_writer.http_client"]
|
111
118
|
@settings["solr_json_writer.http_client"]
|
112
119
|
else
|
@@ -115,9 +122,8 @@ class Traject::SolrJsonWriter
|
|
115
122
|
client.connect_timeout = client.receive_timeout = client.send_timeout = @settings["solr_writer.http_timeout"]
|
116
123
|
end
|
117
124
|
|
118
|
-
if
|
119
|
-
|
120
|
-
client.set_auth(@settings["solr.url"], @settings["solr_writer.basic_auth_user"], @settings["solr_writer.basic_auth_password"])
|
125
|
+
if basic_auth_user || basic_auth_password
|
126
|
+
client.set_auth(@solr_update_url, basic_auth_user, basic_auth_password)
|
121
127
|
end
|
122
128
|
|
123
129
|
client
|
@@ -143,13 +149,11 @@ class Traject::SolrJsonWriter
|
|
143
149
|
# this the new default writer.
|
144
150
|
@commit_on_close = (settings["solr_writer.commit_on_close"] || settings["solrj_writer.commit_on_close"]).to_s == "true"
|
145
151
|
|
146
|
-
# Figure out where to send updates
|
147
|
-
@solr_update_url = self.determine_solr_update_url
|
148
152
|
|
149
153
|
@solr_update_args = settings["solr_writer.solr_update_args"]
|
150
154
|
@commit_solr_update_args = settings["solr_writer.commit_solr_update_args"]
|
151
155
|
|
152
|
-
logger.info(" #{self.class.name} writing to '#{@solr_update_url}' in batches of #{@batch_size} with #{@thread_pool_size} bg threads")
|
156
|
+
logger.info(" #{self.class.name} writing to '#{@solr_update_url}' #{"(with HTTP basic auth)" if basic_auth_user || basic_auth_password}in batches of #{@batch_size} with #{@thread_pool_size} bg threads")
|
153
157
|
end
|
154
158
|
|
155
159
|
|
@@ -368,13 +372,27 @@ class Traject::SolrJsonWriter
|
|
368
372
|
end
|
369
373
|
|
370
374
|
|
371
|
-
# Relatively complex logic to determine if we have a valid URL and what it is
|
375
|
+
# Relatively complex logic to determine if we have a valid URL and what it is,
|
376
|
+
# and if we have basic_auth info
|
377
|
+
#
|
378
|
+
# Empties out user and password embedded in URI returned, to help avoid logging it.
|
379
|
+
#
|
380
|
+
# @return [update_url, basic_auth_user, basic_auth_password]
|
372
381
|
def determine_solr_update_url
|
373
|
-
if settings['solr.update_url']
|
382
|
+
url = if settings['solr.update_url']
|
374
383
|
check_solr_update_url(settings['solr.update_url'])
|
375
384
|
else
|
376
385
|
derive_solr_update_url_from_solr_url(settings['solr.url'])
|
377
386
|
end
|
387
|
+
|
388
|
+
parsed_uri = URI.parse(url)
|
389
|
+
user_from_uri, password_from_uri = parsed_uri.user, parsed_uri.password
|
390
|
+
parsed_uri.user, parsed_uri.password = nil, nil
|
391
|
+
|
392
|
+
basic_auth_user = @settings["solr_writer.basic_auth_user"] || user_from_uri
|
393
|
+
basic_auth_password = @settings["solr_writer.basic_auth_password"] || password_from_uri
|
394
|
+
|
395
|
+
return [parsed_uri.to_s, basic_auth_user, basic_auth_password]
|
378
396
|
end
|
379
397
|
|
380
398
|
|
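The credential handling added to `determine_solr_update_url` can be sketched with Ruby's standard `uri` library alone. The helper name `split_basic_auth`, the plain Hash standing in for Traject settings, and the example URLs are assumptions for illustration; the precedence (explicit settings first, then credentials embedded in the URL) follows the hunk above.

```ruby
require 'uri'

# Hypothetical helper (not the gem's method) mirroring the credential
# handling in the hunk above. `settings` is a plain Hash standing in for
# Traject's settings object.
def split_basic_auth(url, settings = {})
  parsed = URI.parse(url)
  user_from_uri, password_from_uri = parsed.user, parsed.password

  # Blank out embedded credentials so the returned URL is safer to log.
  parsed.user = nil
  parsed.password = nil

  user     = settings["solr_writer.basic_auth_user"] || user_from_uri
  password = settings["solr_writer.basic_auth_password"] || password_from_uri

  [parsed.to_s, user, password]
end

url, user, password = split_basic_auth("http://user:pass@example.org/solr")
puts url

# Explicit settings take precedence over credentials embedded in the URL.
_, user2, password2 = split_basic_auth(
  "http://user:pass@example.org/solr",
  "solr_writer.basic_auth_user" => "admin",
  "solr_writer.basic_auth_password" => "secret"
)
```

Returning the URL with the credentials stripped is what lets the writer log `@solr_update_url` without exposing the password.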
data/lib/traject/version.rb
CHANGED