traject 3.3.0 → 3.7.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 1700077d5c2d3c667fc9520b659c3ca986b8ab34aee233f62bd7f73fdef91977
4
- data.tar.gz: 736b217f209ed08faba9c1d20c006b29586aa3ebdf088a89e37f5f3b7400de06
3
+ metadata.gz: 7ffc677e0ebb13e01b852a1d59ddfdd3cd9906142520e0c296f69ebb0eeb7429
4
+ data.tar.gz: 61b0e966f6ecd4d27e757e4cfc1057c72ac6deca5ad119c78ff883c246744814
5
5
  SHA512:
6
- metadata.gz: 21877d6cd5b03f7ffbbac316a6d58a3bc65b534cb7457e57d39ba470ad49d99c8677e5e6ede25c650bba5ac3f0b22f9b348ebabb36ac4047433eb8a76379ef1d
7
- data.tar.gz: 4ec1938d2d7b60a61ebde4e9c4e763e511c2896788b56b38ad6f22615dffb57449e29c9ef40e261952e125b90f5fe491fa447b077c7dcb1c55f57d6ef603fd5b
6
+ metadata.gz: 8240b450b27df011c2ff998c24c612f44bdd21a2fde5fbab996ffe509f3fc45cec7a8a8947e385a35d06f5bd8ed19732287e8f2d3dab682cc6295f7320f8dfab
7
+ data.tar.gz: f5dbcb44edb8d37a4e74cd1255aa1b05b638913337577c92d4fa276150c023b9e29823cc2b20ef050b33879ec63d76194d0202697d7c858a7aacd3d08241dcce
@@ -0,0 +1,35 @@
1
+ name: CI
2
+
3
+ on:
4
+ push:
5
+ branches: [ master ]
6
+ pull_request:
7
+ branches: ['**']
8
+
9
+ jobs:
10
+ tests:
11
+ runs-on: ubuntu-latest
12
+ strategy:
13
+ fail-fast: false
14
+ matrix:
15
+ ruby: [ '2.4', '2.5', '2.6', '2.7', '3.0', 'jruby-9.1', 'jruby-9.2' ]
16
+ name: Ruby ${{ matrix.ruby }}
17
+ steps:
18
+ - uses: actions/checkout@v2
19
+
20
+ - name: Set up Ruby
21
+ uses: ruby/setup-ruby@v1
22
+ with:
23
+ ruby-version: ${{ matrix.ruby }}
24
+
25
+ - name: set JAVA_OPTS for jruby-9.1
26
+ run: echo 'JAVA_OPTS="--add-opens java.base/java.security.cert=ALL-UNNAMED --add-opens java.base/java.security=ALL-UNNAMED --add-opens java.base/java.util.zip=ALL-UNNAMED"' >> $GITHUB_ENV
27
+ if: ${{ matrix.ruby == 'jruby-9.1' }}
28
+ # https://github.com/jruby/jruby/issues/4834
29
+ # Still seems to be an issue in jruby-9.1, but not 9.2
30
+ # https://github.community/t/conditional-setting-of-env-variables-in-gh-actions/179650
31
+
32
+ - name: Install dependencies
33
+ run: bundle install --jobs 4 --retry 3
34
+ - name: Run tests
35
+ run: bundle exec rake
data/CHANGES.md CHANGED
@@ -1,12 +1,33 @@
1
1
  # Changes
2
2
 
3
- ## Next
3
+ ## NEXT
4
4
 
5
5
  *
6
6
 
7
7
  *
8
8
 
9
- *
9
+ ## 3.7.0
10
+
11
+ * Add two new transformation macros, `Traject::Macros::Transformation.delete_if` and `Traject::Macros::Transformation.select`.
12
+
13
+ ## 3.6.0
14
+
15
+ * Tiny backward compat changes for ruby 3.0 compat. https://github.com/traject/traject/pull/263
16
+
17
+ * Allow gem `http` 5.x in gemspec. https://github.com/traject/traject/pull/269
18
+
19
+ ## 3.5.0
20
+
21
+ * `traject -v` and `traject -h` correctly return 0 exit code indicating success.
22
+
23
+ * upgrade to slop gem 4.x, which carries with it a slightly different format of human-readable command-line arg errors, should be otherwise invisible.
24
+
25
+ * the SolrJsonWriter now supports HTTP basic auth credentials embedded in `solr.url` or `solr.update_url`, eg `http://user:pass@example.org/solr` https://github.com/traject/traject/pull/262
26
+
27
+
28
+ ## 3.4.0
29
+
30
+ * XML-mode `extract_xpath` now supports extracting attribute values with xpath @attr syntax.
10
31
 
11
32
  ## 3.3.0
12
33
 
data/README.md CHANGED
@@ -8,8 +8,8 @@ Traject can also be generalized to a set of tools for getting structured data fr
8
8
 
9
9
  **Traject is stable, mature software, that is already being used in production by its authors and several other institutions.**
10
10
 
11
- [![Gem Version](https://badge.fury.io/rb/traject.png)](http://badge.fury.io/rb/traject)
12
- [![Build Status](https://travis-ci.org/traject/traject.png)](https://travis-ci.org/traject/traject)
11
+ [![Gem Version](https://badge.fury.io/rb/traject.svg)](http://badge.fury.io/rb/traject)
12
+ [![CI Status](https://github.com/traject/traject/workflows/CI/badge.svg?branch=master)](https://github.com/traject/traject/actions?query=workflow%3ACI+branch%3Amaster)
13
13
 
14
14
 
15
15
  ## Background/Goals
@@ -177,6 +177,11 @@ TranslationMap use above is just one example of a transformation macro, that tra
177
177
  * `split(" ")`: take values and split them, possibly result in multiple values.
178
178
  * `transform(proc)`: transform each existing macro using a proc, kind of like `map`.
179
179
  eg `to_field "something", extract_xml("//author"), transform( ->(author) { "#{author.last}, #{author.first}" })
180
+ * `delete_if(["a", "b"])`: remove a value from accumulated values if it is included in the passed-in argument.
181
+ * Can also take a string, proc or regex as an argument. See [tests](test/indexer/macros/transformation_test.rb) for full functionality.
182
+ * `select(proc)`: selects (keeps) a value from accumulated values only if the proc evaluates to true for that value.
183
+ * Can also take an array, set or regex as an argument. See [tests](test/indexer/macros/transformation_test.rb) for full functionality.
184
+
180
185
 
181
186
  You can add on as many transformation macros as you want, they will be applied to output in order.
182
187
 
@@ -468,6 +473,22 @@ Also see `-I load_path` option and suggestions for Bundler use under Extending W
468
473
  See also [Hints for batch and cronjob use](./doc/batch_execution.md) of traject.
469
474
 
470
475
 
476
+ ## A small but complete example
477
+
478
+ To process a MARC XML file with the data shown in [./examples/marc/tiny.xml](./examples/marc/tiny.xml) you can save the following configuration as `config.rb`:
479
+
480
+ ```
481
+ to_field 'title', extract_marc('245a', first: true)
482
+ ```
483
+
484
+ and run Traject as follows:
485
+
486
+ ```
487
+ traject -t xml -c config.rb -w Traject::DebugWriter tiny.xml
488
+ ```
489
+
490
+ `-t xml` indicates that the file is a MARC XML file. `-w Traject::DebugWriter` outputs the results to the console (i.e. without saving to Solr).
491
+
471
492
  ## Extending With Your Own Code
472
493
 
473
494
  Traject config files are full live ruby files, where you can do anything,
data/doc/settings.md CHANGED
@@ -83,7 +83,8 @@ settings are applied first of all. It's recommended you use `provide`.
83
83
  ### Writing to solr
84
84
 
85
85
  * `json_writer.pretty_print`: used by the JsonWriter, if set to true, will output pretty printed json (with added whitespace) for easier human readability. Default false.
86
- * `solr.url`: URL to connect to a solr instance for indexing, eg http://example.org:8983/solr . Command-line short-cut `-u`.
86
+
87
+ * `solr.url`: URL to connect to a solr instance for indexing, eg http://example.org:8983/solr . Command-line short-cut `-u`. (Can include embedded HTTP basic auth as eg `http://user:pass@example.org/solr`)
87
88
 
88
89
  * `solr.version`: Set to eg "1.4.0", "4.3.0"; currently un-used, but in the future will control some default settings, and/or sanity check and warn you if you're doing something that might not work with that version of solr. Set now for help in the future.
89
90
 
@@ -93,7 +94,8 @@ settings are applied first of all. It's recommended you use `provide`.
93
94
 
94
95
  * `solr_writer.thread_pool`: defaults to 1 (single bg thread). A thread pool is used for submitting docs to solr. Set to 0 or nil to disable threading. Set to 1, there will still be a single bg thread doing the adds. May make sense to set higher than number of cores on your indexing machine, as these threads will mostly be waiting on Solr. Speed/capacity of your solr might be more relevant. Note that processing_thread_pool threads can end up submitting to solr too, if solr_json_writer.thread_pool is full.
95
96
 
96
- * `solr_writer.basic_auth_user`, `solr_writer.basic_auth_password`: Not set by default but when both are set the default writer is configured with basic auth.
97
+ * `solr_writer.basic_auth_user`, `solr_writer.basic_auth_password`: Not set by default but when both are set the default writer is configured with basic auth. You can also just embed basic
98
+ auth credentials in `solr.url` using standard URI syntax.
97
99
 
98
100
 
99
101
  ### Dealing with MARC data
data/doc/xml.md CHANGED
@@ -4,6 +4,8 @@ The [NokogiriIndexer](../lib/traject/nokogiri_indexer.md) is a Traject::Indexer
4
4
 
5
5
  It by default uses the NokogiriReader to read XML and read Nokogiri::XML::Documents, and includes the NokogiriMacros mix-in, with some macros for operating on Nokogiri::XML::Documents.
6
6
 
7
+ Please note that the recommended mechanism to parse MARC XML files with Traject is via the `-t` parameter (or via the `provide "marc_source.type", "xml"` setting). The documentation on this page is for those parsing other (non-MARC) XML files.
8
+
7
9
  ## On the command-line
8
10
 
9
11
  You can tell the traject command-line to use the NokogiriIndexer with the `-i xml` flag:
@@ -72,6 +74,16 @@ You can use all the standard transforation macros in Traject::Macros::Transforma
72
74
  to_field "something", extract_xpath("//value"), first_only, translation_map("some_map"), default("no value")
73
75
  ```
74
76
 
77
+ ### selecting attribute values
78
+
79
+ Just works, using xpath syntax for selecting an attribute:
80
+
81
+
82
+ ```ruby
83
+ # gets status value in: <oai:header status="something">
84
+ to_field "status", extract_xpath("//oai:record/oai:header/@status")
85
+ ```
86
+
75
87
 
76
88
  ### selecting non-text nodes
77
89
 
@@ -0,0 +1,35 @@
1
+ <?xml version="1.0" encoding="UTF-8"?>
2
+ <collection xmlns="http://www.loc.gov/MARC21/slim" xmlns:marc="http://www.loc.gov/MARC21/slim">
3
+ <record>
4
+ <leader>01352cam a2200349 a 4500</leader>
5
+ <datafield tag="245" ind1="0" ind2="0">
6
+ <subfield code="6">880-01</subfield>
7
+ <subfield code="a">Kazoku kankei no shakai shinrigaku /</subfield>
8
+ <subfield code="c">Osada Masayoshi hen.</subfield>
9
+ </datafield>
10
+ </record>
11
+ <record>
12
+ <leader>01121ccm a2200289z 4500</leader>
13
+ <datafield tag="245" ind1="1" ind2="0">
14
+ <subfield code="a">Powhatan&#39;s daughter :</subfield>
15
+ <subfield code="b">march</subfield>
16
+ </datafield>
17
+ <datafield tag="100" ind1="1" ind2=" ">
18
+ <subfield code="a">Sousa, John Philip,</subfield>
19
+ <subfield code="d">1854-1932,</subfield>
20
+ <subfield code="e">composer.</subfield>
21
+ </datafield>
22
+ </record>
23
+ <record>
24
+ <leader>01137cam a2200301 a 4500</leader>
25
+ <datafield tag="245" ind1="1" ind2="0">
26
+ <subfield code="a">Two pieces /</subfield>
27
+ <subfield code="c">by Frank O&#39;Hara.</subfield>
28
+ </datafield>
29
+ <datafield tag="100" ind1="1" ind2=" ">
30
+ <subfield code="a">O&#39;Hara, Frank,</subfield>
31
+ <subfield code="d">1926-1966.</subfield>
32
+ <subfield code="0">http://id.loc.gov/authorities/names/n79042130</subfield>
33
+ </datafield>
34
+ </record>
35
+ </collection>
@@ -29,10 +29,10 @@ module Traject
29
29
  self.console = $stderr
30
30
 
31
31
  self.orig_argv = argv.dup
32
- self.remaining_argv = argv
33
32
 
34
- self.slop = create_slop!
35
- self.options = parse_options(self.remaining_argv)
33
+ self.slop = create_slop!(argv)
34
+ self.options = self.slop
35
+ self.remaining_argv = self.slop.arguments
36
36
  end
37
37
 
38
38
  # Returns true on success or false on failure; may also raise exceptions;
@@ -40,11 +40,11 @@ module Traject
40
40
  def execute
41
41
  if options[:version]
42
42
  self.console.puts "traject version #{Traject::VERSION}"
43
- return
43
+ return true
44
44
  end
45
45
  if options[:help]
46
- self.console.puts slop.help
47
- return
46
+ self.console.puts slop.to_s
47
+ return true
48
48
  end
49
49
 
50
50
 
@@ -179,11 +179,11 @@ module Traject
179
179
  end
180
180
 
181
181
  def arg_check!
182
- if options[:command] == "process" && (options[:conf].nil? || options[:conf].length == 0)
182
+ if options[:command] == "process" && (!options[:conf] || options[:conf].length == 0)
183
183
  self.console.puts "Error: Missing required configuration file"
184
184
  self.console.puts "Exiting..."
185
185
  self.console.puts
186
- self.console.puts self.slop.help
186
+ self.console.puts self.slop.to_s
187
187
  exit 2
188
188
  end
189
189
  end
@@ -234,28 +234,36 @@ module Traject
234
234
  end
235
235
 
236
236
 
237
- def create_slop!
238
- return Slop.new(:strict => true) do
239
- banner "traject [options] -c configuration.rb [-c config2.rb] file.mrc"
237
+ def create_slop!(argv)
238
+ options = Slop::Options.new do |o|
239
+ o.banner = "traject [options] -c configuration.rb [-c config2.rb] file.mrc"
240
240
 
241
- on 'v', 'version', "print version information to stderr"
242
- on 'd', 'debug', "Include debug log, -s log.level=debug"
243
- on 'h', 'help', "print usage information to stderr"
244
- on 'c', 'conf', 'configuration file path (repeatable)', :argument => true, :as => Array
245
- on :i, 'indexer', "Traject indexer class name or shortcut", :argument => true, default: "marc"
246
- on :s, :setting, "settings: `-s key=value` (repeatable)", :argument => true, :as => Array
247
- on :r, :reader, "Set reader class, shortcut for -s reader_class_name=", :argument => true
248
- on :o, "output_file", "output file for Writer classes that write to files", :argument => true
249
- on :w, :writer, "Set writer class, shortcut for -s writer_class_name=", :argument => true
250
- on :u, :solr, "Set solr url, shortcut for -s solr.url=", :argument => true
251
- on :t, :marc_type, "xml, json or binary. shortcut for -s marc_source.type=", :argument => true
252
- on :I, "load_path", "append paths to ruby $LOAD_PATH", :argument => true, :as => Array, :delimiter => ":"
241
+ o.on '-v', '--version', "print version information to stderr"
242
+ o.on '-d', '--debug', "Include debug log, -s log.level=debug"
243
+ o.on '-h', '--help', "print usage information to stderr"
244
+ o.array '-c', '--conf', 'configuration file path (repeatable)', :delimiter => nil
245
+ o.string "-i", '--indexer', "Traject indexer class name or shortcut", :default => "marc"
246
+ o.array "-s", "--setting", "settings: `-s key=value` (repeatable)", :delimiter => nil
247
+ o.string "-r", "--reader", "Set reader class, shortcut for -s reader_class_name="
248
+ o.string "-o", "--output_file", "output file for Writer classes that write to files"
249
+ o.string "-w", "--writer", "Set writer class, shortcut for -s writer_class_name="
250
+ o.string "-u", "--solr", "Set solr url, shortcut for -s solr.url="
251
+ o.string "-t", "--marc_type", "xml, json or binary. shortcut for -s marc_source.type="
252
+ o.array "-I", "--load_path", "append paths to ruby $LOAD_PATH", :delimiter => ":"
253
253
 
254
- on :x, "command", "alternate traject command: process (default); marcout; commit", :argument => true, :default => "process"
254
+ o.string "-x", "--command", "alternate traject command: process (default); marcout; commit", :default => "process"
255
255
 
256
- on "stdin", "read input from stdin"
257
- on "debug-mode", "debug logging, single threaded, output human readable hashes"
256
+ o.on "--stdin", "read input from stdin"
257
+ o.on "--debug-mode", "debug logging, single threaded, output human readable hashes"
258
258
  end
259
+
260
+ options.parse(argv)
261
+ rescue Slop::Error => e
262
+ self.console.puts "Error: #{e.message}"
263
+ self.console.puts "Exiting..."
264
+ self.console.puts
265
+ self.console.puts options.to_s
266
+ exit 1
259
267
  end
260
268
 
261
269
  def initialize_indexer!
@@ -267,22 +275,5 @@ module Traject
267
275
 
268
276
  return indexer
269
277
  end
270
-
271
- def parse_options(argv)
272
-
273
- begin
274
- self.slop.parse!(argv)
275
- rescue Slop::Error => e
276
- self.console.puts "Error: #{e.message}"
277
- self.console.puts "Exiting..."
278
- self.console.puts
279
- self.console.puts slop.help
280
- exit 1
281
- end
282
-
283
- return self.slop.to_hash
284
- end
285
-
286
-
287
278
  end
288
279
  end
@@ -62,7 +62,7 @@ All records are assumed to have a unique id. You can set which field to look in
62
62
  def serialize(context)
63
63
  h = context.output_hash
64
64
  rec_key = record_number(context)
65
- lines = h.keys.sort.map { |k| @format % [rec_key, k, h[k].join(' | ')] }
65
+ lines = h.keys.sort.map { |k| @format % [rec_key, k, (h[k] || []).join(' | ')] }
66
66
  lines.push "\n"
67
67
  lines.join("\n")
68
68
  end
@@ -15,7 +15,7 @@ module Traject::Macros
15
15
  # field/substring specification.
16
16
  #
17
17
  # First argument is a string spec suitable for the MarcExtractor, see
18
- # MarcExtractor::parse_string_spec.
18
+ # Traject::MarcExtractor::Spec.
19
19
  #
20
20
  # Second arg is optional options, including options valid on MarcExtractor.new,
21
21
  # and others. By default, will de-duplicate results, but see :allow_duplicates
@@ -42,11 +42,11 @@ module Traject::Macros
42
42
  #
43
43
  # * :translation_map => String: translate with named translation map looked up in load
44
44
  # path, uses Tranject::TranslationMap.new(translation_map_arg).
45
- # **Instead**, use `extract_marc(whatever), translation_map(translation_map_arg)
45
+ # **Instead**, use `extract_marc(whatever), translation_map(translation_map_arg)`
46
46
  #
47
47
  # * :trim_punctuation => true; trims leading/trailing punctuation using standard algorithms that
48
48
  # have shown themselves useful with Marc, using Marc21.trim_punctuation. **Instead**, use
49
- # `extract_marc(whatever), trim_punctuation
49
+ # `extract_marc(whatever), trim_punctuation`
50
50
  #
51
51
  # * :default => String: if otherwise empty, add default value. **Instead**, use `extract_marc(whatever), default("default value")`
52
52
  #
@@ -327,10 +327,14 @@ module Traject::Macros
327
327
  if field008 && field008.length >= 11
328
328
  date_type = field008.slice(6)
329
329
  date1_str = field008.slice(7,4)
330
- date2_str = field008.slice(11, 4) if field008.length > 15
330
+ if field008.length > 15
331
+ date2_str = field008.slice(11, 4)
332
+ else
333
+ date2_str = date1_str
334
+ end
331
335
 
332
- # for date_type q=questionable, we have a range.
333
- if (date_type == 'q')
336
+ # for date_type q=questionable, we expect to have a range.
337
+ if date_type == 'q' and date1_str != date2_str
334
338
  # make unknown digits at the beginning or end of range,
335
339
  date1 = date1_str.sub("u", "0").to_i
336
340
  date2 = date2_str.sub("u", "9").to_i
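The date handling changed in this hunk can be restated as a small stand-alone sketch. It covers only the lines visible above: `date_parts_from_008` and its hash keys are hypothetical names introduced for illustration, the sample 008 strings are made up, and whatever traject does with the values after this point is not shown in the diff and is not modeled here.

```ruby
# Hypothetical helper (not part of traject) restating the visible logic:
# slice the date type and both date strings out of an 008 field, falling
# back to date1 when the field is too short to carry a second date.
def date_parts_from_008(field008)
  return nil unless field008 && field008.length >= 11

  date_type = field008.slice(6)
  date1_str = field008.slice(7, 4)
  date2_str = field008.length > 15 ? field008.slice(11, 4) : date1_str

  {
    date_type: date_type,
    date1_str: date1_str,
    date2_str: date2_str,
    # Only treat a 'q' (questionable) date as a range when the two dates differ.
    range: date_type == 'q' && date1_str != date2_str
  }
end

# Made-up sample values: a full-length 008 and a truncated one.
full  = date_parts_from_008("850101q19801989nyu           000 0 eng d")
short = date_parts_from_008("850101q1980")
```

In the removed version, `date2_str` stayed `nil` for an 008 of 15 characters or fewer, so the `date2_str.sub` call in the `q` branch would have raised `NoMethodError`. With the fallback, a short field no longer enters the range branch at all.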
@@ -26,9 +26,15 @@ module Traject
26
26
  # Make sure to avoid text content that was all blank, which is "between the children"
27
27
  # whitespace.
28
28
  result = result.collect do |n|
29
- n.xpath('.//text()').collect(&:text).tap do |arr|
30
- arr.reject! { |s| s =~ (/\A\s+\z/) }
31
- end.join(" ")
29
+ if n.kind_of?(Nokogiri::XML::Attr)
30
+ # attribute value
31
+ n.value
32
+ else
33
+ # text from node
34
+ n.xpath('.//text()').collect(&:text).tap do |arr|
35
+ arr.reject! { |s| s =~ (/\A\s+\z/) }
36
+ end.join(" ")
37
+ end
32
38
  end
33
39
  else
34
40
  # just put all matches in accumulator as Nokogiri::XML::Node's
@@ -157,6 +157,36 @@ module Traject
157
157
  acc.collect! { |v| v.gsub(pattern, replace) }
158
158
  end
159
159
  end
160
+
161
+ # Run ruby `delete_if` on the accumulator for values that include or are equal to arg.
162
+ # It will also accept an array, set, regex pattern, proc or lambda as an argument.
163
+ #
164
+ # @example
165
+ # to_field "creator_facet", extract_marc("100abcdq"), delete_if(/foo/)
166
+ def delete_if(arg)
167
+ p = if arg.respond_to? :include?
168
+ proc { |v| arg.include?(v) }
169
+ else
170
+ proc { |v| arg === v }
171
+ end
172
+
173
+ ->(_, acc) { acc.delete_if(&p) }
174
+ end
175
+
176
+ # Run ruby `select!` on the accumulator for values that include or are equal to arg.
177
+ # It accepts an array, set, regex pattern, proc or lambda as an argument.
178
+ #
179
+ # @example
180
+ # to_field "creator_facet", extract_marc("100abcdq"), select(->(v) { v != "foo" })
181
+ def select(arg)
182
+ p = if arg.respond_to? :include?
183
+ proc { |v| arg.include?(v) }
184
+ else
185
+ proc { |v| arg === v }
186
+ end
187
+
188
+ ->(_, acc) { acc.select!(&p) }
189
+ end
160
190
  end
161
191
  end
162
192
  end
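The argument dispatch shared by `delete_if` and `select` can be tried without traject. The sketch below copies the `respond_to? :include?` / `===` logic from the hunk above into plain Ruby. `build_matcher`, `delete_if_step` and `select_step` are hypothetical names used only here, and the lambdas are called directly with `nil` as the record rather than through `to_field`.

```ruby
require 'set'

# Same predicate construction as in the hunk: collections (and anything else
# that responds to include?) are matched by membership, everything else by ===.
def build_matcher(arg)
  if arg.respond_to?(:include?)
    proc { |v| arg.include?(v) }
  else
    proc { |v| arg === v }
  end
end

# Hypothetical stand-ins for the two macros; each returns a two-argument
# lambda that mutates the accumulator in place.
def delete_if_step(arg)
  matcher = build_matcher(arg)
  ->(_record, acc) { acc.delete_if(&matcher) }
end

def select_step(arg)
  matcher = build_matcher(arg)
  ->(_record, acc) { acc.select!(&matcher) }
end

array_result = ["foo", "bar", "baz"]
delete_if_step(["foo", "bar"]).call(nil, array_result)    # membership in an Array

regex_result = ["foo", "bar", "baz"]
delete_if_step(/\Aba/).call(nil, regex_result)            # Regexp#===

proc_result = ["foo", "bar", "baz"]
select_step(->(v) { v != "foo" }).call(nil, proc_result)  # Proc#=== calls the lambda

set_result = ["foo", "bar", "baz"]
select_step(Set.new(["baz"])).call(nil, set_result)       # membership in a Set

# A String also responds to include?, so a value is matched when it is a
# substring of the argument, not only when it is equal to it.
string_result = ["foo", "fo", "bar"]
delete_if_step("foo").call(nil, string_result)
```

The last case is worth noting when passing a plain string: because `String#include?` is a substring test, `"fo"` is removed along with `"foo"`.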
@@ -2,9 +2,9 @@ require 'traject/marc_extractor_spec'
2
2
 
3
3
  module Traject
4
4
  # MarcExtractor is a class for extracting lists of strings from a MARC::Record,
5
- # according to specifications. See #parse_string_spec for description of string
6
- # string arguments used to specify extraction. See #initialize for options
7
- # that can be set controlling extraction.
5
+ # according to specifications. See Traject::MarcExtractor::Spec for description
6
+ # of string arguments used to specify extraction. See #initialize for
7
+ # options that can be set controlling extraction.
8
8
  #
9
9
  # Examples:
10
10
  #
@@ -1,3 +1,5 @@
1
+ require 'nokogiri'
2
+
1
3
  module Traject
2
4
  # A Trajet reader which reads XML, and yields zero to many Nokogiri::XML::Document
3
5
  # objects as source records in the traject pipeline.
@@ -41,10 +41,12 @@ require 'concurrent' # for atomic_fixnum
41
41
  #
42
42
  # ## Relevant settings
43
43
  #
44
- # * solr.url (optional if solr.update_url is set) The URL to the solr core to index into
44
+ # * solr.url (optional if solr.update_url is set) The URL to the solr core to index into.
45
+ # (Can include embedded HTTP basic auth as eg `http://user:pass@host/solr`)
45
46
  #
46
47
  # * solr.update_url: The actual update url. If unset, we'll first see if
47
- # "#{solr.url}/update/json" exists, and if not use "#{solr.url}/update"
48
+ # "#{solr.url}/update/json" exists, and if not use "#{solr.url}/update". (Can include
49
+ # embedded HTTP basic auth as eg `http://user:pass@host/solr`)
48
50
  #
49
51
  # * solr_writer.batch_size: How big a batch to send to solr. Default is 100.
50
52
  # My tests indicate that this setting doesn't change overall index speed by a ton.
@@ -101,12 +103,17 @@ class Traject::SolrJsonWriter
101
103
  def initialize(argSettings)
102
104
  @settings = Traject::Indexer::Settings.new(argSettings)
103
105
 
106
+
104
107
  # Set max errors
105
108
  @max_skipped = (@settings['solr_writer.max_skipped'] || DEFAULT_MAX_SKIPPED).to_i
106
109
  if @max_skipped < 0
107
110
  @max_skipped = nil
108
111
  end
109
112
 
113
+
114
+ # Figure out where to send updates, and if with basic auth
115
+ @solr_update_url, basic_auth_user, basic_auth_password = self.determine_solr_update_url
116
+
110
117
  @http_client = if @settings["solr_json_writer.http_client"]
111
118
  @settings["solr_json_writer.http_client"]
112
119
  else
@@ -115,9 +122,8 @@ class Traject::SolrJsonWriter
115
122
  client.connect_timeout = client.receive_timeout = client.send_timeout = @settings["solr_writer.http_timeout"]
116
123
  end
117
124
 
118
- if @settings["solr_writer.basic_auth_user"] &&
119
- @settings["solr_writer.basic_auth_password"]
120
- client.set_auth(@settings["solr.url"], @settings["solr_writer.basic_auth_user"], @settings["solr_writer.basic_auth_password"])
125
+ if basic_auth_user || basic_auth_password
126
+ client.set_auth(@solr_update_url, basic_auth_user, basic_auth_password)
121
127
  end
122
128
 
123
129
  client
@@ -143,13 +149,11 @@ class Traject::SolrJsonWriter
143
149
  # this the new default writer.
144
150
  @commit_on_close = (settings["solr_writer.commit_on_close"] || settings["solrj_writer.commit_on_close"]).to_s == "true"
145
151
 
146
- # Figure out where to send updates
147
- @solr_update_url = self.determine_solr_update_url
148
152
 
149
153
  @solr_update_args = settings["solr_writer.solr_update_args"]
150
154
  @commit_solr_update_args = settings["solr_writer.commit_solr_update_args"]
151
155
 
152
- logger.info(" #{self.class.name} writing to '#{@solr_update_url}' in batches of #{@batch_size} with #{@thread_pool_size} bg threads")
156
+ logger.info(" #{self.class.name} writing to '#{@solr_update_url}' #{"(with HTTP basic auth) " if basic_auth_user || basic_auth_password}in batches of #{@batch_size} with #{@thread_pool_size} bg threads")
153
157
  end
154
158
 
155
159
 
@@ -368,13 +372,27 @@ class Traject::SolrJsonWriter
368
372
  end
369
373
 
370
374
 
371
- # Relatively complex logic to determine if we have a valid URL and what it is
375
+ # Relatively complex logic to determine if we have a valid URL and what it is,
376
+ # and if we have basic_auth info
377
+ #
378
+ # Empties out user and password embedded in URI returned, to help avoid logging it.
379
+ #
380
+ # @returns [update_url, basic_auth_user, basic_auth_password]
372
381
  def determine_solr_update_url
373
- if settings['solr.update_url']
382
+ url = if settings['solr.update_url']
374
383
  check_solr_update_url(settings['solr.update_url'])
375
384
  else
376
385
  derive_solr_update_url_from_solr_url(settings['solr.url'])
377
386
  end
387
+
388
+ parsed_uri = URI.parse(url)
389
+ user_from_uri, password_from_uri = parsed_uri.user, parsed_uri.password
390
+ parsed_uri.user, parsed_uri.password = nil, nil
391
+
392
+ basic_auth_user = @settings["solr_writer.basic_auth_user"] || user_from_uri
393
+ basic_auth_password = @settings["solr_writer.basic_auth_password"] || password_from_uri
394
+
395
+ return [parsed_uri.to_s, basic_auth_user, basic_auth_password]
378
396
  end
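The credential handling in `determine_solr_update_url` relies only on Ruby's standard `uri` library, so it can be sketched in isolation. `split_basic_auth` and its keyword arguments are hypothetical names standing in for the method and the `solr_writer.basic_auth_user` / `solr_writer.basic_auth_password` settings. The URL validation done by `check_solr_update_url` and `derive_solr_update_url_from_solr_url` is not reproduced.

```ruby
require 'uri'

# Hypothetical stand-alone version of the logic above: pull user and password
# out of the URL, blank them in the URI so they are not logged, and let
# explicitly configured credentials take precedence over embedded ones.
def split_basic_auth(url, settings_user: nil, settings_password: nil)
  parsed_uri = URI.parse(url)
  user_from_uri = parsed_uri.user
  password_from_uri = parsed_uri.password
  parsed_uri.user = nil
  parsed_uri.password = nil

  [
    parsed_uri.to_s,
    settings_user || user_from_uri,
    settings_password || password_from_uri
  ]
end

embedded = split_basic_auth("http://user:pass@example.org:8983/solr/update/json")
override = split_basic_auth("http://user:pass@example.org:8983/solr/update/json",
                            settings_user: "admin")
plain    = split_basic_auth("http://example.org:8983/solr/update/json")
```

As in the hunk, each credential falls back independently, so a configured user can end up paired with a password taken from the URL (the `override` case above).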
379
397
 
380
398
 
@@ -1,3 +1,3 @@
1
1
  module Traject
2
- VERSION = "3.3.0"
2
+ VERSION = "3.7.0"
3
3
  end