traject 3.3.0 → 3.7.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 1700077d5c2d3c667fc9520b659c3ca986b8ab34aee233f62bd7f73fdef91977
4
- data.tar.gz: 736b217f209ed08faba9c1d20c006b29586aa3ebdf088a89e37f5f3b7400de06
3
+ metadata.gz: 7ffc677e0ebb13e01b852a1d59ddfdd3cd9906142520e0c296f69ebb0eeb7429
4
+ data.tar.gz: 61b0e966f6ecd4d27e757e4cfc1057c72ac6deca5ad119c78ff883c246744814
5
5
  SHA512:
6
- metadata.gz: 21877d6cd5b03f7ffbbac316a6d58a3bc65b534cb7457e57d39ba470ad49d99c8677e5e6ede25c650bba5ac3f0b22f9b348ebabb36ac4047433eb8a76379ef1d
7
- data.tar.gz: 4ec1938d2d7b60a61ebde4e9c4e763e511c2896788b56b38ad6f22615dffb57449e29c9ef40e261952e125b90f5fe491fa447b077c7dcb1c55f57d6ef603fd5b
6
+ metadata.gz: 8240b450b27df011c2ff998c24c612f44bdd21a2fde5fbab996ffe509f3fc45cec7a8a8947e385a35d06f5bd8ed19732287e8f2d3dab682cc6295f7320f8dfab
7
+ data.tar.gz: f5dbcb44edb8d37a4e74cd1255aa1b05b638913337577c92d4fa276150c023b9e29823cc2b20ef050b33879ec63d76194d0202697d7c858a7aacd3d08241dcce
@@ -0,0 +1,35 @@
1
+ name: CI
2
+
3
+ on:
4
+ push:
5
+ branches: [ master ]
6
+ pull_request:
7
+ branches: ['**']
8
+
9
+ jobs:
10
+ tests:
11
+ runs-on: ubuntu-latest
12
+ strategy:
13
+ fail-fast: false
14
+ matrix:
15
+ ruby: [ '2.4', '2.5', '2.6', '2.7', '3.0', 'jruby-9.1', 'jruby-9.2' ]
16
+ name: Ruby ${{ matrix.ruby }}
17
+ steps:
18
+ - uses: actions/checkout@v2
19
+
20
+ - name: Set up Ruby
21
+ uses: ruby/setup-ruby@v1
22
+ with:
23
+ ruby-version: ${{ matrix.ruby }}
24
+
25
+ - name: set JAVA_OPTS for jruby-9.1
26
+ run: echo 'JAVA_OPTS="--add-opens java.base/java.security.cert=ALL-UNNAMED --add-opens java.base/java.security=ALL-UNNAMED --add-opens java.base/java.util.zip=ALL-UNNAMED"' >> $GITHUB_ENV
27
+ if: ${{ matrix.ruby == 'jruby-9.1' }}
28
+ # https://github.com/jruby/jruby/issues/4834
29
+ # Still seems to be an issue in jruby-9.1, but not 9.2
30
+ # https://github.community/t/conditional-setting-of-env-variables-in-gh-actions/179650
31
+
32
+ - name: Install dependencies
33
+ run: bundle install --jobs 4 --retry 3
34
+ - name: Run tests
35
+ run: bundle exec rake
data/CHANGES.md CHANGED
@@ -1,12 +1,33 @@
1
1
  # Changes
2
2
 
3
- ## Next
3
+ ## NEXT
4
4
 
5
5
  *
6
6
 
7
7
  *
8
8
 
9
- *
9
+ ## 3.7.0
10
+
11
+ * Add two new transformation macros, `Traject::Macros::Transformation.delete_if` and `Traject::Macros::Transformation.select`.
12
+
13
+ ## 3.6.0
14
+
15
+ * Tiny backward compat changes for ruby 3.0 compat. https://github.com/traject/traject/pull/263
16
+
17
+ * Allow gem `http` 5.x in gemspec. https://github.com/traject/traject/pull/269
18
+
19
+ ## 3.5.0
20
+
21
+ * `traject -v` and `traject -h` correctly return 0 exit code indicating success.
22
+
23
+ * upgrade to slop gem 4.x, which carries with it a slightly different format of human-readable command-line arg errors, should be otherwise invisible.
24
+
25
+ * the SolrJsonWriter now supports HTTP basic auth credentials embedded in `solr.url` or `solr.update_url`, eg `http://user:pass@example.org/solr` https://github.com/traject/traject/pull/262
26
+
27
+
28
+ ## 3.4.0
29
+
30
+ * XML-mode `extract_xpath` now supports extracting attribute values with xpath @attr syntax.
10
31
 
11
32
  ## 3.3.0
12
33
 
data/README.md CHANGED
@@ -8,8 +8,8 @@ Traject can also be generalized to a set of tools for getting structured data fr
8
8
 
9
9
  **Traject is stable, mature software, that is already being used in production by its authors and several other institutions.**
10
10
 
11
- [![Gem Version](https://badge.fury.io/rb/traject.png)](http://badge.fury.io/rb/traject)
12
- [![Build Status](https://travis-ci.org/traject/traject.png)](https://travis-ci.org/traject/traject)
11
+ [![Gem Version](https://badge.fury.io/rb/traject.svg)](http://badge.fury.io/rb/traject)
12
+ [![CI Status](https://github.com/traject/traject/workflows/CI/badge.svg?branch=master)](https://github.com/traject/traject/actions?query=workflow%3ACI+branch%3Amaster)
13
13
 
14
14
 
15
15
  ## Background/Goals
@@ -177,6 +177,11 @@ TranslationMap use above is just one example of a transformation macro, that tra
177
177
  * `split(" ")`: take values and split them, possibly result in multiple values.
178
178
  * `transform(proc)`: transform each existing macro using a proc, kind of like `map`.
179
179
  eg `to_field "something", extract_xml("//author"), transform( ->(author) { "#{author.last}, #{author.first}" })
180
+ * `delete_if(["a", "b"])`: remove a value from accumulated values if it is included in the passed-in argument.
181
+ * Can also take a string, proc or regex as an argument. See [tests](test/indexer/macros/transformation_test.rb) for full functionality.
182
+ * `select(proc)`: selects (keeps) a value from accumulated values if the proc evaluates to true for that value.
183
+ * Can also take an array, set or regex as an argument. See [tests](test/indexer/macros/transformation_test.rb) for full functionality.
184
+
180
185
 
181
186
  You can add on as many transformation macros as you want, they will be applied to output in order.
182
187
 
@@ -468,6 +473,22 @@ Also see `-I load_path` option and suggestions for Bundler use under Extending W
468
473
  See also [Hints for batch and cronjob use](./doc/batch_execution.md) of traject.
469
474
 
470
475
 
476
+ ## A small but complete example
477
+
478
+ To process a MARC XML file with the data shown in [./examples/marc/tiny.xml](./examples/marc/tiny.xml) you can save the following configuration as `config.rb`:
479
+
480
+ ```
481
+ to_field 'title', extract_marc('245a', first: true)
482
+ ```
483
+
484
+ and run Traject as follows:
485
+
486
+ ```
487
+ traject -t xml -c config.rb -w Traject::DebugWriter tiny.xml
488
+ ```
489
+
490
+ `-t xml` indicates that the file is a MARC XML file. `-w Traject::DebugWriter` outputs the results to the console (i.e. without saving to Solr).
491
+
471
492
  ## Extending With Your Own Code
472
493
 
473
494
  Traject config files are full live ruby files, where you can do anything,
data/doc/settings.md CHANGED
@@ -83,7 +83,8 @@ settings are applied first of all. It's recommended you use `provide`.
83
83
  ### Writing to solr
84
84
 
85
85
  * `json_writer.pretty_print`: used by the JsonWriter, if set to true, will output pretty printed json (with added whitespace) for easier human readability. Default false.
86
- * `solr.url`: URL to connect to a solr instance for indexing, eg http://example.org:8983/solr . Command-line short-cut `-u`.
86
+
87
+ * `solr.url`: URL to connect to a solr instance for indexing, eg http://example.org:8983/solr . Command-line short-cut `-u`. (Can include embedded HTTP basic auth as eg `http://user:pass@example.org/solr`)
87
88
 
88
89
  * `solr.version`: Set to eg "1.4.0", "4.3.0"; currently un-used, but in the future will control some default settings, and/or sanity check and warn you if you're doing something that might not work with that version of solr. Set now for help in the future.
89
90
 
@@ -93,7 +94,8 @@ settings are applied first of all. It's recommended you use `provide`.
93
94
 
94
95
  * `solr_writer.thread_pool`: defaults to 1 (single bg thread). A thread pool is used for submitting docs to solr. Set to 0 or nil to disable threading. Set to 1, there will still be a single bg thread doing the adds. May make sense to set higher than number of cores on your indexing machine, as these threads will mostly be waiting on Solr. Speed/capacity of your solr might be more relevant. Note that processing_thread_pool threads can end up submitting to solr too, if solr_json_writer.thread_pool is full.
95
96
 
96
- * `solr_writer.basic_auth_user`, `solr_writer.basic_auth_password`: Not set by default but when both are set the default writer is configured with basic auth.
97
+ * `solr_writer.basic_auth_user`, `solr_writer.basic_auth_password`: Not set by default but when both are set the default writer is configured with basic auth. You can also just embed basic
98
+ auth credentials in `solr.url` using standard URI syntax.
97
99
 
98
100
 
99
101
  ### Dealing with MARC data
data/doc/xml.md CHANGED
@@ -4,6 +4,8 @@ The [NokogiriIndexer](../lib/traject/nokogiri_indexer.md) is a Traject::Indexer
4
4
 
5
5
  It by default uses the NokogiriReader to read XML and read Nokogiri::XML::Documents, and includes the NokogiriMacros mix-in, with some macros for operating on Nokogiri::XML::Documents.
6
6
 
7
+ Please note that the recommended mechanism to parse MARC XML files with Traject is via the `-t` parameter (or via the `provide "marc_source.type", "xml"` setting). The documentation on this page is for those parsing other (non-MARC) XML files.
8
+
7
9
  ## On the command-line
8
10
 
9
11
  You can tell the traject command-line to use the NokogiriIndexer with the `-i xml` flag:
@@ -72,6 +74,16 @@ You can use all the standard transforation macros in Traject::Macros::Transforma
72
74
  to_field "something", extract_xpath("//value"), first_only, translation_map("some_map"), default("no value")
73
75
  ```
74
76
 
77
+ ### selecting attribute values
78
+
79
+ Just works, using xpath syntax for selecting an attribute:
80
+
81
+
82
+ ```ruby
83
+ # gets status value in: <oai:header status="something">
84
+ to_field "status", extract_xpath("//oai:record/oai:header/@status")
85
+ ```
86
+
75
87
 
76
88
  ### selecting non-text nodes
77
89
 
@@ -0,0 +1,35 @@
1
+ <?xml version="1.0" encoding="UTF-8"?>
2
+ <collection xmlns="http://www.loc.gov/MARC21/slim" xmlns:marc="http://www.loc.gov/MARC21/slim">
3
+ <record>
4
+ <leader>01352cam a2200349 a 4500</leader>
5
+ <datafield tag="245" ind1="0" ind2="0">
6
+ <subfield code="6">880-01</subfield>
7
+ <subfield code="a">Kazoku kankei no shakai shinrigaku /</subfield>
8
+ <subfield code="c">Osada Masayoshi hen.</subfield>
9
+ </datafield>
10
+ </record>
11
+ <record>
12
+ <leader>01121ccm a2200289z 4500</leader>
13
+ <datafield tag="245" ind1="1" ind2="0">
14
+ <subfield code="a">Powhatan&#39;s daughter :</subfield>
15
+ <subfield code="b">march</subfield>
16
+ </datafield>
17
+ <datafield tag="100" ind1="1" ind2=" ">
18
+ <subfield code="a">Sousa, John Philip,</subfield>
19
+ <subfield code="d">1854-1932,</subfield>
20
+ <subfield code="e">composer.</subfield>
21
+ </datafield>
22
+ </record>
23
+ <record>
24
+ <leader>01137cam a2200301 a 4500</leader>
25
+ <datafield tag="245" ind1="1" ind2="0">
26
+ <subfield code="a">Two pieces /</subfield>
27
+ <subfield code="c">by Frank O&#39;Hara.</subfield>
28
+ </datafield>
29
+ <datafield tag="100" ind1="1" ind2=" ">
30
+ <subfield code="a">O&#39;Hara, Frank,</subfield>
31
+ <subfield code="d">1926-1966.</subfield>
32
+ <subfield code="0">http://id.loc.gov/authorities/names/n79042130</subfield>
33
+ </datafield>
34
+ </record>
35
+ </collection>
@@ -29,10 +29,10 @@ module Traject
29
29
  self.console = $stderr
30
30
 
31
31
  self.orig_argv = argv.dup
32
- self.remaining_argv = argv
33
32
 
34
- self.slop = create_slop!
35
- self.options = parse_options(self.remaining_argv)
33
+ self.slop = create_slop!(argv)
34
+ self.options = self.slop
35
+ self.remaining_argv = self.slop.arguments
36
36
  end
37
37
 
38
38
  # Returns true on success or false on failure; may also raise exceptions;
@@ -40,11 +40,11 @@ module Traject
40
40
  def execute
41
41
  if options[:version]
42
42
  self.console.puts "traject version #{Traject::VERSION}"
43
- return
43
+ return true
44
44
  end
45
45
  if options[:help]
46
- self.console.puts slop.help
47
- return
46
+ self.console.puts slop.to_s
47
+ return true
48
48
  end
49
49
 
50
50
 
@@ -179,11 +179,11 @@ module Traject
179
179
  end
180
180
 
181
181
  def arg_check!
182
- if options[:command] == "process" && (options[:conf].nil? || options[:conf].length == 0)
182
+ if options[:command] == "process" && (!options[:conf] || options[:conf].length == 0)
183
183
  self.console.puts "Error: Missing required configuration file"
184
184
  self.console.puts "Exiting..."
185
185
  self.console.puts
186
- self.console.puts self.slop.help
186
+ self.console.puts self.slop.to_s
187
187
  exit 2
188
188
  end
189
189
  end
@@ -234,28 +234,36 @@ module Traject
234
234
  end
235
235
 
236
236
 
237
- def create_slop!
238
- return Slop.new(:strict => true) do
239
- banner "traject [options] -c configuration.rb [-c config2.rb] file.mrc"
237
+ def create_slop!(argv)
238
+ options = Slop::Options.new do |o|
239
+ o.banner = "traject [options] -c configuration.rb [-c config2.rb] file.mrc"
240
240
 
241
- on 'v', 'version', "print version information to stderr"
242
- on 'd', 'debug', "Include debug log, -s log.level=debug"
243
- on 'h', 'help', "print usage information to stderr"
244
- on 'c', 'conf', 'configuration file path (repeatable)', :argument => true, :as => Array
245
- on :i, 'indexer', "Traject indexer class name or shortcut", :argument => true, default: "marc"
246
- on :s, :setting, "settings: `-s key=value` (repeatable)", :argument => true, :as => Array
247
- on :r, :reader, "Set reader class, shortcut for -s reader_class_name=", :argument => true
248
- on :o, "output_file", "output file for Writer classes that write to files", :argument => true
249
- on :w, :writer, "Set writer class, shortcut for -s writer_class_name=", :argument => true
250
- on :u, :solr, "Set solr url, shortcut for -s solr.url=", :argument => true
251
- on :t, :marc_type, "xml, json or binary. shortcut for -s marc_source.type=", :argument => true
252
- on :I, "load_path", "append paths to ruby $LOAD_PATH", :argument => true, :as => Array, :delimiter => ":"
241
+ o.on '-v', '--version', "print version information to stderr"
242
+ o.on '-d', '--debug', "Include debug log, -s log.level=debug"
243
+ o.on '-h', '--help', "print usage information to stderr"
244
+ o.array '-c', '--conf', 'configuration file path (repeatable)', :delimiter => nil
245
+ o.string "-i", '--indexer', "Traject indexer class name or shortcut", :default => "marc"
246
+ o.array "-s", "--setting", "settings: `-s key=value` (repeatable)", :delimiter => nil
247
+ o.string "-r", "--reader", "Set reader class, shortcut for -s reader_class_name="
248
+ o.string "-o", "--output_file", "output file for Writer classes that write to files"
249
+ o.string "-w", "--writer", "Set writer class, shortcut for -s writer_class_name="
250
+ o.string "-u", "--solr", "Set solr url, shortcut for -s solr.url="
251
+ o.string "-t", "--marc_type", "xml, json or binary. shortcut for -s marc_source.type="
252
+ o.array "-I", "--load_path", "append paths to ruby $LOAD_PATH", :delimiter => ":"
253
253
 
254
- on :x, "command", "alternate traject command: process (default); marcout; commit", :argument => true, :default => "process"
254
+ o.string "-x", "--command", "alternate traject command: process (default); marcout; commit", :default => "process"
255
255
 
256
- on "stdin", "read input from stdin"
257
- on "debug-mode", "debug logging, single threaded, output human readable hashes"
256
+ o.on "--stdin", "read input from stdin"
257
+ o.on "--debug-mode", "debug logging, single threaded, output human readable hashes"
258
258
  end
259
+
260
+ options.parse(argv)
261
+ rescue Slop::Error => e
262
+ self.console.puts "Error: #{e.message}"
263
+ self.console.puts "Exiting..."
264
+ self.console.puts
265
+ self.console.puts options.to_s
266
+ exit 1
259
267
  end
260
268
 
261
269
  def initialize_indexer!
@@ -267,22 +275,5 @@ module Traject
267
275
 
268
276
  return indexer
269
277
  end
270
-
271
- def parse_options(argv)
272
-
273
- begin
274
- self.slop.parse!(argv)
275
- rescue Slop::Error => e
276
- self.console.puts "Error: #{e.message}"
277
- self.console.puts "Exiting..."
278
- self.console.puts
279
- self.console.puts slop.help
280
- exit 1
281
- end
282
-
283
- return self.slop.to_hash
284
- end
285
-
286
-
287
278
  end
288
279
  end
@@ -62,7 +62,7 @@ All records are assumed to have a unique id. You can set which field to look in
62
62
  def serialize(context)
63
63
  h = context.output_hash
64
64
  rec_key = record_number(context)
65
- lines = h.keys.sort.map { |k| @format % [rec_key, k, h[k].join(' | ')] }
65
+ lines = h.keys.sort.map { |k| @format % [rec_key, k, (h[k] || []).join(' | ')] }
66
66
  lines.push "\n"
67
67
  lines.join("\n")
68
68
  end
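The `serialize` change above guards against output-hash keys whose value is `nil`, which would otherwise raise `NoMethodError` on `join`. A minimal standalone sketch of that guard follows. The `FORMAT` string and the `serialize_lines` helper are hypothetical stand-ins, since the writer's real `@format` is not shown in this diff.

```ruby
# Hypothetical format string; the DebugWriter's actual @format is not shown in the diff.
FORMAT = "%s %s %s"

# Hypothetical helper mirroring the patched line in `serialize`:
# a nil value is treated as an empty array before joining.
def serialize_lines(rec_key, output_hash)
  output_hash.keys.sort.map do |k|
    FORMAT % [rec_key, k, (output_hash[k] || []).join(' | ')]
  end
end

# "empty" maps to nil; without the `|| []` guard this would raise NoMethodError.
puts serialize_lines("rec1", { "title" => ["A", "B"], "empty" => nil })
```

With the guard, a key that maps to `nil` is printed with an empty value instead of aborting the whole record.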
@@ -15,7 +15,7 @@ module Traject::Macros
15
15
  # field/substring specification.
16
16
  #
17
17
  # First argument is a string spec suitable for the MarcExtractor, see
18
- # MarcExtractor::parse_string_spec.
18
+ # Traject::MarcExtractor::Spec.
19
19
  #
20
20
  # Second arg is optional options, including options valid on MarcExtractor.new,
21
21
  # and others. By default, will de-duplicate results, but see :allow_duplicates
@@ -42,11 +42,11 @@ module Traject::Macros
42
42
  #
43
43
  # * :translation_map => String: translate with named translation map looked up in load
44
44
  # path, uses Tranject::TranslationMap.new(translation_map_arg).
45
- # **Instead**, use `extract_marc(whatever), translation_map(translation_map_arg)
45
+ # **Instead**, use `extract_marc(whatever), translation_map(translation_map_arg)`
46
46
  #
47
47
  # * :trim_punctuation => true; trims leading/trailing punctuation using standard algorithms that
48
48
  # have shown themselves useful with Marc, using Marc21.trim_punctuation. **Instead**, use
49
- # `extract_marc(whatever), trim_punctuation
49
+ # `extract_marc(whatever), trim_punctuation`
50
50
  #
51
51
  # * :default => String: if otherwise empty, add default value. **Instead**, use `extract_marc(whatever), default("default value")`
52
52
  #
@@ -327,10 +327,14 @@ module Traject::Macros
327
327
  if field008 && field008.length >= 11
328
328
  date_type = field008.slice(6)
329
329
  date1_str = field008.slice(7,4)
330
- date2_str = field008.slice(11, 4) if field008.length > 15
330
+ if field008.length > 15
331
+ date2_str = field008.slice(11, 4)
332
+ else
333
+ date2_str = date1_str
334
+ end
331
335
 
332
- # for date_type q=questionable, we have a range.
333
- if (date_type == 'q')
336
+ # for date_type q=questionable, we expect to have a range.
337
+ if date_type == 'q' and date1_str != date2_str
334
338
  # make unknown digits at the beginning or end of range,
335
339
  date1 = date1_str.sub("u", "0").to_i
336
340
  date2 = date2_str.sub("u", "9").to_i
@@ -26,9 +26,15 @@ module Traject
26
26
  # Make sure to avoid text content that was all blank, which is "between the children"
27
27
  # whitespace.
28
28
  result = result.collect do |n|
29
- n.xpath('.//text()').collect(&:text).tap do |arr|
30
- arr.reject! { |s| s =~ (/\A\s+\z/) }
31
- end.join(" ")
29
+ if n.kind_of?(Nokogiri::XML::Attr)
30
+ # attribute value
31
+ n.value
32
+ else
33
+ # text from node
34
+ n.xpath('.//text()').collect(&:text).tap do |arr|
35
+ arr.reject! { |s| s =~ (/\A\s+\z/) }
36
+ end.join(" ")
37
+ end
32
38
  end
33
39
  else
34
40
  # just put all matches in accumulator as Nokogiri::XML::Node's
@@ -157,6 +157,36 @@ module Traject
157
157
  acc.collect! { |v| v.gsub(pattern, replace) }
158
158
  end
159
159
  end
160
+
161
+ # Run ruby `delete_if` on the accumulator for values that include or are equal to arg.
162
+ # It will also accept an array, set, regex pattern, proc or lambda as an argument.
163
+ #
164
+ # @example
165
+ # to_field "creator_facet", extract_marc("100abcdq"), delete_if(/foo/)
166
+ def delete_if(arg)
167
+ p = if arg.respond_to? :include?
168
+ proc { |v| arg.include?(v) }
169
+ else
170
+ proc { |v| arg === v }
171
+ end
172
+
173
+ ->(_, acc) { acc.delete_if(&p) }
174
+ end
175
+
176
+ # Run ruby `select!` on the accumulator for values that include or are equal to arg.
177
+ # It accepts an array, set, regex pattern, proc or lambda as an argument.
178
+ #
179
+ # @example
180
+ # to_field "creator_facet", extract_marc("100abcdq"), select(->(v) { v != "foo" })
181
+ def select(arg)
182
+ p = if arg.respond_to? :include?
183
+ proc { |v| arg.include?(v) }
184
+ else
185
+ proc { |v| arg === v }
186
+ end
187
+
188
+ ->(_, acc) { acc.select!(&p) }
189
+ end
160
190
  end
161
191
  end
162
192
  end
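To make the dispatch in `delete_if` and `select` concrete, here is a standalone sketch that reproduces the same predicate logic outside of Traject. The helper names `build_predicate`, `delete_if_step` and `select_step` are hypothetical and exist only for this illustration; in a real config you would call the macros themselves inside `to_field`.

```ruby
require 'set'

# Same branching as in the diff: anything that responds to include?
# (Array, Set, String, Range) is used for a membership test; everything
# else (Regexp, Proc, lambda) is matched with ===.
def build_predicate(arg)
  if arg.respond_to?(:include?)
    proc { |v| arg.include?(v) }
  else
    proc { |v| arg === v }
  end
end

# Hypothetical stand-in for the delete_if macro: returns a step lambda
# that mutates the accumulator in place.
def delete_if_step(arg)
  pred = build_predicate(arg)
  ->(_record, acc) { acc.delete_if(&pred) }
end

# Hypothetical stand-in for the select macro.
def select_step(arg)
  pred = build_predicate(arg)
  ->(_record, acc) { acc.select!(&pred) }
end

acc = ["foo", "bar", "baz"]
delete_if_step(/\Aba/).call(nil, acc)  # the regex matches "bar" and "baz"
p acc

acc = ["foo", "bar", "baz"]
select_step(Set["bar"]).call(nil, acc) # only members of the set are kept
p acc
```

One consequence of the `include?` branch is worth noting: a plain String also responds to `include?`, so passing `"foobar"` removes (or keeps) any accumulated value that is a substring of `"foobar"`, not only an exact match.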
@@ -2,9 +2,9 @@ require 'traject/marc_extractor_spec'
2
2
 
3
3
  module Traject
4
4
  # MarcExtractor is a class for extracting lists of strings from a MARC::Record,
5
- # according to specifications. See #parse_string_spec for description of string
6
- # string arguments used to specify extraction. See #initialize for options
7
- # that can be set controlling extraction.
5
+ # according to specifications. See Traject::MarcExtractor::Spec for description
6
+ # of string arguments used to specify extraction. See #initialize for
7
+ # options that can be set controlling extraction.
8
8
  #
9
9
  # Examples:
10
10
  #
@@ -1,3 +1,5 @@
1
+ require 'nokogiri'
2
+
1
3
  module Traject
2
4
  # A Trajet reader which reads XML, and yields zero to many Nokogiri::XML::Document
3
5
  # objects as source records in the traject pipeline.
@@ -41,10 +41,12 @@ require 'concurrent' # for atomic_fixnum
41
41
  #
42
42
  # ## Relevant settings
43
43
  #
44
- # * solr.url (optional if solr.update_url is set) The URL to the solr core to index into
44
+ # * solr.url (optional if solr.update_url is set) The URL to the solr core to index into.
45
+ # (Can include embedded HTTP basic auth as eg `http://user:pass@host/solr`)
45
46
  #
46
47
  # * solr.update_url: The actual update url. If unset, we'll first see if
47
- # "#{solr.url}/update/json" exists, and if not use "#{solr.url}/update"
48
+ # "#{solr.url}/update/json" exists, and if not use "#{solr.url}/update". (Can include
49
+ # embedded HTTP basic auth as eg `http://user:pass@host/solr`)
48
50
  #
49
51
  # * solr_writer.batch_size: How big a batch to send to solr. Default is 100.
50
52
  # My tests indicate that this setting doesn't change overall index speed by a ton.
@@ -101,12 +103,17 @@ class Traject::SolrJsonWriter
101
103
  def initialize(argSettings)
102
104
  @settings = Traject::Indexer::Settings.new(argSettings)
103
105
 
106
+
104
107
  # Set max errors
105
108
  @max_skipped = (@settings['solr_writer.max_skipped'] || DEFAULT_MAX_SKIPPED).to_i
106
109
  if @max_skipped < 0
107
110
  @max_skipped = nil
108
111
  end
109
112
 
113
+
114
+ # Figure out where to send updates, and if with basic auth
115
+ @solr_update_url, basic_auth_user, basic_auth_password = self.determine_solr_update_url
116
+
110
117
  @http_client = if @settings["solr_json_writer.http_client"]
111
118
  @settings["solr_json_writer.http_client"]
112
119
  else
@@ -115,9 +122,8 @@ class Traject::SolrJsonWriter
115
122
  client.connect_timeout = client.receive_timeout = client.send_timeout = @settings["solr_writer.http_timeout"]
116
123
  end
117
124
 
118
- if @settings["solr_writer.basic_auth_user"] &&
119
- @settings["solr_writer.basic_auth_password"]
120
- client.set_auth(@settings["solr.url"], @settings["solr_writer.basic_auth_user"], @settings["solr_writer.basic_auth_password"])
125
+ if basic_auth_user || basic_auth_password
126
+ client.set_auth(@solr_update_url, basic_auth_user, basic_auth_password)
121
127
  end
122
128
 
123
129
  client
@@ -143,13 +149,11 @@ class Traject::SolrJsonWriter
143
149
  # this the new default writer.
144
150
  @commit_on_close = (settings["solr_writer.commit_on_close"] || settings["solrj_writer.commit_on_close"]).to_s == "true"
145
151
 
146
- # Figure out where to send updates
147
- @solr_update_url = self.determine_solr_update_url
148
152
 
149
153
  @solr_update_args = settings["solr_writer.solr_update_args"]
150
154
  @commit_solr_update_args = settings["solr_writer.commit_solr_update_args"]
151
155
 
152
- logger.info(" #{self.class.name} writing to '#{@solr_update_url}' in batches of #{@batch_size} with #{@thread_pool_size} bg threads")
156
+ logger.info(" #{self.class.name} writing to '#{@solr_update_url}' #{"(with HTTP basic auth) " if basic_auth_user || basic_auth_password}in batches of #{@batch_size} with #{@thread_pool_size} bg threads")
153
157
  end
154
158
 
155
159
 
@@ -368,13 +372,27 @@ class Traject::SolrJsonWriter
368
372
  end
369
373
 
370
374
 
371
- # Relatively complex logic to determine if we have a valid URL and what it is
375
+ # Relatively complex logic to determine if we have a valid URL and what it is,
376
+ # and if we have basic_auth info
377
+ #
378
+ # Empties out user and password embedded in URI returned, to help avoid logging it.
379
+ #
380
+ # @returns [update_url, basic_auth_user, basic_auth_password]
372
381
  def determine_solr_update_url
373
- if settings['solr.update_url']
382
+ url = if settings['solr.update_url']
374
383
  check_solr_update_url(settings['solr.update_url'])
375
384
  else
376
385
  derive_solr_update_url_from_solr_url(settings['solr.url'])
377
386
  end
387
+
388
+ parsed_uri = URI.parse(url)
389
+ user_from_uri, password_from_uri = parsed_uri.user, parsed_uri.password
390
+ parsed_uri.user, parsed_uri.password = nil, nil
391
+
392
+ basic_auth_user = @settings["solr_writer.basic_auth_user"] || user_from_uri
393
+ basic_auth_password = @settings["solr_writer.basic_auth_password"] || password_from_uri
394
+
395
+ return [parsed_uri.to_s, basic_auth_user, basic_auth_password]
378
396
  end
379
397
 
380
398
 
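The credential handling in `determine_solr_update_url` can be tried in isolation with Ruby's standard `uri` library. The sketch below follows the same steps as the diff (parse, read user and password, blank them out, let explicit settings take precedence). The `split_basic_auth` helper and the plain Hash used for settings are assumptions made for illustration, not part of Traject.

```ruby
require 'uri'

# Hypothetical helper mirroring the steps shown in determine_solr_update_url.
# `settings` is a plain Hash here; Traject uses its own Settings object.
def split_basic_auth(url, settings = {})
  parsed = URI.parse(url)
  user_from_uri, password_from_uri = parsed.user, parsed.password

  # Strip the credentials so the returned URL can be logged more safely.
  parsed.user = nil
  parsed.password = nil

  # Explicit settings win over credentials embedded in the URL,
  # and each of the two values falls back independently.
  user     = settings["solr_writer.basic_auth_user"]     || user_from_uri
  password = settings["solr_writer.basic_auth_password"] || password_from_uri

  [parsed.to_s, user, password]
end

url, user, password = split_basic_auth("http://user:pass@example.org/solr/update/json")
puts url      # http://example.org/solr/update/json
puts user     # user
puts password # pass
```

Because user and password fall back independently, setting only `solr_writer.basic_auth_user` while embedding credentials in the URL yields the user from settings combined with the password from the URL.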
@@ -1,3 +1,3 @@
1
1
  module Traject
2
- VERSION = "3.3.0"
2
+ VERSION = "3.7.0"
3
3
  end