traject 3.0.0 → 3.1.0.rc1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: cf92e5467d32d37b681a36ae1ffbd2995bbf3e0def938b13d74831a939b68632
4
- data.tar.gz: 7c4693ded4a9a8b0e9c599e7489aaefdf9806dfffce6b20ae6054def9ba8c156
3
+ metadata.gz: 06c28d37f9aafafe709a146c7612e5b5d8a5c58a61fd1502823a38dc52b9d05b
4
+ data.tar.gz: 2e38b2b8c4030456f3757ae6062231268110d68ef07e10cab722b4074ccd570c
5
5
  SHA512:
6
- metadata.gz: 9e12113a6f53aa9c7629c072df80b1e347f432d069bd30dbb35d73373fccc3fa341682b281a65c778aa2a3eae9fb7b2d52c81c2f39aa17d348074ecb8b9c2512
7
- data.tar.gz: 6f2294bce5deb181a20db0977f8ab7e73e8e1cda6e86d8ee562fabd7a8cce2c683011be8f3955ccafd0165787dbaf774e7c3571220f5d6e797eaf6fe8a02577d
6
+ metadata.gz: 04561a77a3e6f2073198983b5bf7d4e35cc9f52bccc1211487cc4c850b0f0b0fc9395a7c87e6ed90061f4a15af57516434d260c649fbc43ea65a0c6435194818
7
+ data.tar.gz: c7312156c3be556218e319e35ae76aa97fbae5fad6720dbce2e4a046ec90603f5de34fe2cb055425fb3da499922fba50c7d4a6445858793bb0a4fb26cf8f7b29
@@ -6,13 +6,11 @@ sudo: true
6
6
  rvm:
7
7
  - 2.4.4
8
8
  - 2.5.1
9
- - "2.6.0-preview2"
9
+ - 2.6.1
10
10
  # avoid having travis install jdk on MRI builds where we don't need it.
11
11
  matrix:
12
12
  include:
13
13
  - jdk: openjdk8
14
14
  rvm: jruby-9.1.17.0
15
15
  - jdk: openjdk8
16
- rvm: jruby-9.2.0.0
17
- allow_failures:
18
- - rvm: "2.6.0-preview2"
16
+ rvm: jruby-9.2.6.0
data/CHANGES.md CHANGED
@@ -1,5 +1,35 @@
1
1
  # Changes
2
2
 
3
+ ## 3.1.0
4
+
5
+ ### Added
6
+
7
+ * Context#add_output is added, convenient for custom ruby code.
8
+
9
+ each_record do |record, context|
10
+ context.add_output "key", something_from(record)
11
+ end
12
+
13
+ https://github.com/traject/traject/pull/220
14
+
15
+ * SolrJsonWriter
16
+
17
+ * Class-level indexer configuration, for custom indexer subclasses, now available with class-level `configure` method. Warning, Indexers are still expensive to instantiate though. https://github.com/traject/traject/pull/213
18
+
19
+ * SolrJsonWriter has new settings to control commit semantics: `solr_writer.solr_update_args` and `solr_writer.commit_solr_update_args`, both of which take hash values of Solr update handler query params. https://github.com/traject/traject/pull/215
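  For example, in a configuration file (the specific values shown here are just illustrative):

        settings do
          provide "solr_writer.solr_update_args", { commitWithin: 1000 }
          provide "solr_writer.commit_solr_update_args", { softCommit: true }
        end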
20
+
21
+ * SolrJsonWriter has a `delete(solr-unique-key)` method. Does not currently use any batching or threading. https://github.com/traject/traject/pull/214
22
+
23
+ * SolrJsonWriter: when MaxSkippedRecordsExceeded is raised, it will have a #cause set to the last underlying error that triggered it. Some error reporting systems, including Rails, will automatically log #cause, so that's helpful. https://github.com/traject/traject/pull/216
24
+
25
+ * SolrJsonWriter now respects a `solr_writer.http_timeout` setting, in seconds, to be passed to the HTTPClient instance. https://github.com/traject/traject/pull/219
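  For example (the timeout value shown is illustrative):

        settings do
          provide "solr_writer.http_timeout", 30 # seconds
        end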
26
+
27
+ * Nokogiri dependency for the NokogiriReader increased to `~> 1.9`. When using JRuby `each_record_xpath`, resulting yielded documents may have xmlns declarations on different nodes than in MRI (and previous versions of nokogiri), but we could find no way around this with nokogiri >= 1.9.0. The documents should still be semantically equivalent for namespace use. This was necessary to keep JRuby Nokogiri XML working with recent Nokogiri releases. https://github.com/traject/traject/pull/209
28
+
29
+ * LineWriter guesses better about when to auto-close, and provides an optional explicit setting in case it guesses wrong. (thanks @justinlittman) https://github.com/traject/traject/pull/211
30
+
31
+ * Traject::Indexer will now use a Logger(-compatible) instance passed in via the `logger` setting. https://github.com/traject/traject/pull/217
32
+
3
33
  ## 3.0.0
4
34
 
5
35
  ### Changed/Backwards Incompatibilities
data/README.md CHANGED
@@ -19,7 +19,7 @@ Initially by Jonathan Rochkind (Johns Hopkins Libraries) and Bill Dueber (Univer
19
19
  * Basic configuration files can be easily written even by non-rubyists, with a few simple directives traject provides. But config files are 'ruby all the way down', so we can provide a gradual slope to more complex needs, with the full power of ruby.
20
20
  * Easy to program, easy to read, easy to modify.
21
21
  * Fast. Traject by default indexes using multiple threads, on multiple cpu cores, when the underlying ruby implementation (i.e., JRuby) allows it, and can use a separate thread for communication with solr even under MRI. Traject is intended to be usable to process millions of records.
22
- * Composed of decoupled components, for flexibility and extensibility.
22
+ * Composed of decoupled components, for flexibility and extensibility.
23
23
  * Designed to support local code and configuration that's maintainable and testable, and can be shared between projects as ruby gems.
24
24
  * Easy to split configuration between multiple files, for simple "pick-and-choose" command line options that can combine to deal with any of your local needs.
25
25
 
@@ -135,7 +135,7 @@ For the syntax and complete possibilities of the specification string argument t
135
135
 
136
136
  To see all options for `extract_marc`, see the [extract_marc](http://rdoc.info/gems/traject/Traject/Macros/Marc21:extract_marc) method documentation.
137
137
 
138
- ### XML mode, extract_xml
138
+ ### XML mode, extract_xpath
139
139
 
140
140
  See our [xml guide](./doc/xml.md) for more XML examples, but you will usually use extract_xpath.
141
141
 
@@ -311,12 +311,15 @@ like `to_field`, is executed for every record, but without being tied
311
311
  to a specific output field.
312
312
 
313
313
  `each_record` can be used for logging or notifying, computing intermediate
314
- results, or writing to more than one field at once.
314
+ results, or more complex ruby logic.
315
315
 
316
316
  ~~~ruby
317
317
  each_record do |record|
318
318
  some_custom_logging(record)
319
319
  end
320
+ each_record do |record, context|
321
+ context.add_output(:some_value, extract_some_value_from_record(record))
322
+ end
320
323
  ~~~
321
324
 
322
325
  For more on `each_record`, see [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md).
@@ -405,7 +408,7 @@ writer class in question.
405
408
 
406
409
  ## The traject command Line
407
410
 
408
- (If you are interested in running traject in an embedded/programmatic context instead of as a standalone command-line batch process, please see docs on [Programmatic Use](./docs/programmatic_use.md) )
411
+ (If you are interested in running traject in an embedded/programmatic context instead of as a standalone command-line batch process, please see docs on [Programmatic Use](./doc/programmatic_use.md) )
409
412
 
410
413
  The simplest invocation is:
411
414
 
@@ -247,13 +247,12 @@ each_record do |record, context|
247
247
  end
248
248
 
249
249
  each_record do |record, context|
250
- (val1, val2) = calculate_two_things_from(record)
250
+ if eligible_for_things?(record)
251
+ (val1, val2) = calculate_two_things_from(record)
251
252
 
252
- context.output_hash["first_field"] ||= []
253
- context.output_hash["first_field"] << val1
254
-
255
- context.output_hash["second_field"] ||= []
256
- context.output_hash["second_field"] << val2
253
+ context.add_output("first_field", val1)
254
+ context.add_output("second_field", val2)
255
+ end
257
256
  end
258
257
  ~~~
259
258
 
@@ -48,6 +48,30 @@ indexer = Traject::Indexer.new(settings) do
48
48
  end
49
49
  ```
50
50
 
51
+ ### Configuring indexer subclasses
52
+
53
+ Indexing step configuration is historically done in traject at the indexer _instance_ level, either programmatically or by applying a "configuration file" to an indexer instance.
54
+
55
+ But you can also define your own indexer sub-class with indexing steps built-in, using the class-level `configure` method.
56
+
57
+ This is an EXPERIMENTAL feature; the implementation may change. https://github.com/traject/traject/pull/213
58
+
59
+ ```ruby
60
+ class MyIndexer < Traject::Indexer
61
+ configure do
62
+ settings do
63
+ provide "solr.url", Rails.application.config.my_solr_url
64
+ end
65
+
66
+ to_field "our_name", literal("University of Whatever")
67
+ end
68
+ end
69
+ ```
70
+
71
+ These settings and indexing steps are now "hard-coded" into that subclass. You can still provide additional configuration at the instance level, as normal. You can also make a subclass of that `MyIndexer` class, which will inherit configuration from MyIndexer and can supply its own additional class-level configuration too.
72
+
73
+ Note that due to how the implementation works, instantiating an indexer is still _relatively_ expensive (class-level configuration is only actually executed on instantiation). You will still get better performance by re-using a global instance of your indexer subclass instead of, say, instantiating one per object to be indexed.
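For illustration, a minimal sketch of further subclassing and of re-using one configured instance -- the class names, field, and helper method here are hypothetical:

```ruby
class MySpecializedIndexer < MyIndexer
  configure do
    # inherits "solr.url" and the "our_name" field from MyIndexer
    to_field "collection", literal("Special Collection")
  end
end

# Re-use one configured instance rather than instantiating one per record:
INDEXER = MySpecializedIndexer.new

def index_document(record)
  INDEXER.process_record(record)
end
```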
74
+
51
75
  ## Running the indexer
52
76
 
53
77
  ### process: probably not what you want
@@ -157,7 +181,7 @@ You may want to consider instead creating one or more configured "global" indexe
157
181
 
158
182
  * Readers, and the Indexer#process method, are not thread-safe. That is why using Indexer#process, which uses a fixed reader, is not thread-safe, and why when sharing a global indexer we want to use `process_record`, `map_record`, or `process_with` as above.
159
183
 
160
- It ought to be safe to use a global Indexer concurrently in several threads, with the `map_record`, `process_record` or `process_with` methods -- so long as your indexing rules and writers are thread-safe, as they usually will be and always ought to be.
184
+ It ought to be safe to use a global Indexer concurrently in several threads, with the `map_record`, `process_record` or `process_with` methods -- so long as your indexing rules and writers are thread-safe, as they usually will be and always ought to be.
161
185
 
162
186
  ### An example
163
187
 
@@ -119,6 +119,8 @@ settings are applied first of all. It's recommended you use `provide`.
119
119
 
120
120
  * `log.batch_size.severity`: If `log.batch_size` is set, what logger severity level to log to. Default "INFO", set to "DEBUG" etc if desired.
121
121
 
122
+ * `logger`: Ignores all the other logger settings; just pass a `Logger`-compatible logger instance in directly.
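For example, a minimal sketch (the logger target shown is illustrative):

```ruby
require 'logger'

settings do
  # hand traject an already-constructed logger instead of having it build one
  provide "logger", Logger.new($stderr)
end
```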
123
+
122
124
 
123
125
 
124
126
 
data/doc/xml.md CHANGED
@@ -133,6 +133,8 @@ The NokogiriReader parser should be relatively performant though, allowing you t
133
133
 
134
134
  (There is a half-finished `ExperimentalStreamingNokogiriReader` available, but it is experimental, half-finished, may disappear or change in backwards compat at any time, problematic, not recommended for production use, etc.)
135
135
 
136
+ Note also that in JRuby, when using `each_record_xpath` with the NokogiriReader, the extracted individual documents may have xmlns declarations in different places than you may expect, although they will still be semantically equivalent for namespace processing. This is due to the Nokogiri JRuby implementation, and we could find no good way to ensure consistent behavior with MRI. See: https://github.com/sparklemotion/nokogiri/issues/1875
137
+
136
138
  ### Jruby
137
139
 
138
140
  It may be that nokogiri JRuby is just much slower than nokogiri MRI (at least when namespaces are involved?) It may be that our workaround to a [JRuby bug involving namespaces on moving nodes](https://github.com/sparklemotion/nokogiri/issues/1774) doesn't help.
@@ -180,6 +180,7 @@ class Traject::Indexer
180
180
  @index_steps = []
181
181
  @after_processing_steps = []
182
182
 
183
+ self.class.apply_class_configure_block(self)
183
184
  instance_eval(&block) if block
184
185
  end
185
186
 
@@ -189,6 +190,30 @@ class Traject::Indexer
189
190
  instance_eval(&block)
190
191
  end
191
192
 
193
+ ## Class-level configure block accepted too, and applied at instantiation
194
+ # before instance-level configuration.
195
+ #
196
+ # EXPERIMENTAL, implementation may change in ways that affect some uses.
197
+ # https://github.com/traject/traject/pull/213
198
+ #
199
+ # Note that settings set by 'provide' in a subclass cannot really be overridden
200
+ # by 'provide' in a next-level subclass. Use self.default_settings instead, with
201
+ # call to super.
202
+ def self.configure(&block)
203
+ @class_configure_block = block
204
+ end
205
+
206
+ def self.apply_class_configure_block(instance)
207
+ # Make sure we inherit from superclass that has a class-level ivar @class_configure_block
208
+ if self.superclass.respond_to?(:apply_class_configure_block)
209
+ self.superclass.apply_class_configure_block(instance)
210
+ end
211
+ if @class_configure_block
212
+ instance.configure(&@class_configure_block)
213
+ end
214
+ end
215
+
216
+
192
217
 
193
218
  # Pass a string file path, a Pathname, or a File object, for
194
219
  # a config file to load into indexer.
@@ -258,10 +283,9 @@ class Traject::Indexer
258
283
  "log.batch_size.severity" => "info",
259
284
 
260
285
  # how to post-process the accumulator
261
- "allow_nil_values" => false,
262
- "allow_duplicate_values" => true,
263
-
264
- "allow_empty_fields" => false
286
+ Traject::Indexer::ToFieldStep::ALLOW_NIL_VALUES => false,
287
+ Traject::Indexer::ToFieldStep::ALLOW_DUPLICATE_VALUES => true,
288
+ Traject::Indexer::ToFieldStep::ALLOW_EMPTY_FIELDS => false
265
289
  }.freeze
266
290
  end
267
291
 
@@ -349,6 +373,10 @@ class Traject::Indexer
349
373
 
350
374
  # Create logger according to settings
351
375
  def create_logger
376
+ if settings["logger"]
377
+ # none of the other settings matter, we just got a logger
378
+ return settings["logger"]
379
+ end
352
380
 
353
381
  logger_level = settings["log.level"] || "info"
354
382
 
@@ -82,6 +82,51 @@ class Traject::Indexer
82
82
  str
83
83
  end
84
84
 
85
+ # Add values to an array in context.output_hash with the specified key/field_name(s).
86
+ # Creates array in output_hash if currently nil.
87
+ #
88
+ # Post-processing/filtering:
89
+ #
90
+ # * uniqs the accumulator, unless settings["allow_duplicate_values"] is set.
91
+ # * Removes nil values unless settings["allow_nil_values"] is set.
92
+ # * Will not add an empty array to output_hash (will leave it nil instead)
93
+ # unless settings["allow_empty_fields"] is set.
94
+ #
95
+ # Multiple values can be added with multiple arguments (we avoid an array argument meaning
96
+ # multiple values, to accommodate odd use cases where an array itself is desired as the output_hash value)
97
+ #
98
+ # @param field_name [String,Symbol,Array<String>,Array<Symbol>] A key to set in output_hash, or
99
+ # an array of such keys.
100
+ #
101
+ # @example add one value
102
+ # context.add_output(:additional_title, "a title")
103
+ #
104
+ # @example add multiple values as multiple params
105
+ # context.add_output("additional_title", "a title", "another title")
106
+ #
107
+ # @example add multiple values as multiple params from array using ruby spread operator
108
+ # context.add_output(:some_key, *array_of_values)
109
+ #
110
+ # @example add to multiple keys in output hash
111
+ # context.add_output(["key1", "key2"], "value")
112
+ #
113
+ # @return [Traject::Context] self
114
+ #
115
+ # Note: for historical reasons, the relevant settings key *names* are in constants in Traject::Indexer::ToFieldStep,
116
+ # but the settings don't just apply to ToFieldSteps
117
+ def add_output(field_name, *values)
118
+ values.compact! unless self.settings && self.settings[Traject::Indexer::ToFieldStep::ALLOW_NIL_VALUES]
119
+
120
+ return self if values.empty? and not (self.settings && self.settings[Traject::Indexer::ToFieldStep::ALLOW_EMPTY_FIELDS])
121
+
122
+ Array(field_name).each do |key|
123
+ accumulator = (self.output_hash[key.to_s] ||= [])
124
+ accumulator.concat values
125
+ accumulator.uniq! unless self.settings && self.settings[Traject::Indexer::ToFieldStep::ALLOW_DUPLICATE_VALUES]
126
+ end
127
+
128
+ return self
129
+ end
85
130
  end
86
131
 
87
132
 
@@ -145,24 +145,20 @@ class Traject::Indexer
145
145
  return accumulator
146
146
  end
147
147
 
148
- # Add the accumulator to the context with the correct field name
149
- # Do post-processing on the accumulator (remove nil values, allow empty
150
- # fields, etc)
148
+
149
+ # These constants are here for historical/legacy reasons; they really ought to
150
+ # live in Traject::Context, but in case anyone is referring to them
151
+ # we'll leave them here for now.
151
152
  ALLOW_NIL_VALUES = "allow_nil_values".freeze
152
153
  ALLOW_EMPTY_FIELDS = "allow_empty_fields".freeze
153
154
  ALLOW_DUPLICATE_VALUES = "allow_duplicate_values".freeze
154
155
 
156
+ # Add the accumulator to the context with the correct field name(s).
157
+ # Do post-processing on the accumulator (remove nil values, allow empty
158
+ # fields, etc)
155
159
  def add_accumulator_to_context!(accumulator, context)
156
- accumulator.compact! unless context.settings[ALLOW_NIL_VALUES]
157
- return if accumulator.empty? and not (context.settings[ALLOW_EMPTY_FIELDS])
158
-
159
160
  # field_name can actually be an array of field names
160
- Array(field_name).each do |a_field_name|
161
- context.output_hash[a_field_name] ||= []
162
-
163
- existing_accumulator = context.output_hash[a_field_name].concat(accumulator)
164
- existing_accumulator.uniq! unless context.settings[ALLOW_DUPLICATE_VALUES]
165
- end
161
+ context.add_output(field_name, *accumulator)
166
162
  end
167
163
  end
168
164
 
@@ -8,12 +8,35 @@ require 'thread'
8
8
  # This does not seem to affect performance much, as far as I could tell from
9
9
  # benchmarking.
10
10
  #
11
- # Output will be sent to `settings["output_file"]` string path, or else
12
- # `settings["output_stream"]` (ruby IO object), or else stdout.
13
- #
14
11
  # This class can be sub-classed to write out different serialized
15
12
  # representations -- subclasses will just override the #serialize
16
13
  # method. For instance, see JsonWriter.
14
+ #
15
+ # ## Output
16
+ #
17
+ # The main functionality this class provides is logic for choosing, based on
18
+ # settings, what file or bytestream to send output to.
19
+ #
20
+ # You can supply `settings["output_file"]` with a _file path_. LineWriter
21
+ # will open up a `File` to write to.
22
+ #
23
+ # Or you can supply `settings["output_stream"]` with any ruby IO object, such as an
24
+ # open `File` object or anything else.
25
+ #
26
+ # If neither are supplied, will write to `$stdout`.
27
+ #
28
+ # ## Closing the output stream
29
+ #
30
+ # The LineWriter tries to guess whether it should call `close` on the output
31
+ # stream it's writing to, when the LineWriter instance is closed. For instance,
32
+ # if you passed in a `settings["output_file"]` with a path, and the LineWriter
33
+ # opened up a `File` object for you, it should close it for you.
34
+ #
35
+ # But for historical reasons, LineWriter doesn't just use that signal, but tries
36
+ # to guess generally on when to call close. If for some reason it gets it wrong,
37
+ # just use `settings["close_output_on_close"]` set to `true` or `false`.
38
+ # (String `"true"` or `"false"` are also acceptable, for convenience in setting
39
+ # options on command line)
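  #
  # For example, in an indexer config file -- a minimal sketch, with an illustrative
  # output path:
  #
  #     settings do
  #       provide "writer_class_name", "Traject::LineWriter"
  #       provide "output_file", "out.txt"
  #       provide "close_output_on_close", "true"
  #     end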
17
40
  class Traject::LineWriter
18
41
  attr_reader :settings
19
42
  attr_reader :write_mutex, :output_file
@@ -57,7 +80,16 @@ class Traject::LineWriter
57
80
  end
58
81
 
59
82
  def close
60
- @output_file.close unless (@output_file.nil? || @output_file.tty?)
83
+ @output_file.close if should_close_stream?
84
+ end
85
+
86
+ def should_close_stream?
87
+ if settings["close_output_on_close"].nil?
88
+ (@output_file.nil? || @output_file.tty? || @output_file == $stdout || @output_file == $stderr)
89
+ else
90
+ settings["close_output_on_close"].to_s == "true"
91
+ end
61
92
  end
62
93
 
94
+
63
95
  end
@@ -118,35 +118,26 @@ module Traject
118
118
  private
119
119
 
120
120
 
121
- # In MRI Nokogiri, this is as simple as `new_parent_doc.root = node`
121
+ # We simply do `new_parent_doc.root = node`
122
122
  # It seemed maybe safer to dup the node as well as remove the original from the original doc,
123
123
  # but I believe this will result in double memory usage, as unlinked nodes aren't GC'd until
124
124
  # their doc is. I am hoping this pattern results in less memory usage.
125
125
  # https://github.com/sparklemotion/nokogiri/issues/1703
126
126
  #
127
- # However, in JRuby it's a different story, JRuby doesn't properly preserve namespaces
128
- # when re-parenting a node.
127
+ # We used to have to do something different in Jruby to work around bug:
129
128
  # https://github.com/sparklemotion/nokogiri/issues/1774
130
129
  #
131
- # The nodes within the tree re-parented _know_ they are in the correct namespaces,
132
- # and xpath queries require that namespace, but the appropriate xmlns attributes
133
- # aren't included in the serialized XML. This JRuby-specific code seems to get
134
- # things back to a consistent state.
130
+ # But as of nokogiri 1.9, that does not work, and is not necessary if we accept
131
+ # that Jruby nokogiri may put xmlns declarations on different elements than MRI,
132
+ # although it should be semantically equivalent for a namespace-aware parser.
133
+ # https://github.com/sparklemotion/nokogiri/issues/1875
134
+ #
135
+ # This now exists as a separate method largely as a historical artifact, and for this
136
+ # documentation.
135
137
  def reparent_node_to_root(new_parent_doc, node)
136
- if Traject::Util.is_jruby?
137
- original_ns_scopes = node.namespace_scopes
138
- end
139
138
 
140
139
  new_parent_doc.root = node
141
140
 
142
- if Traject::Util.is_jruby?
143
- original_ns_scopes.each do |ns|
144
- if new_parent_doc.at_xpath("//#{ns.prefix}:*", ns.prefix => ns.href)
145
- new_parent_doc.root.add_namespace(ns.prefix, ns.href)
146
- end
147
- end
148
- end
149
-
150
141
  return new_parent_doc
151
142
  end
152
143
 
@@ -16,7 +16,30 @@ require 'concurrent' # for atomic_fixnum
16
16
  # This should work under both MRI and JRuby, with JRuby getting much
17
17
  # better performance due to the threading model.
18
18
  #
19
- # Relevant settings
19
+ # Solr updates are by default sent with no commit params. This will definitely
20
+ # maximize your performance, and is *especially* recommended for bulk/batch indexing --
21
+ # use Solr auto commit in your Solr configuration instead, possibly with `commit_on_close`
22
+ # setting here.
23
+ #
24
+ # However, if you want the writer to send `commitWithin=1000`, `commit=true`,
25
+ # `softCommit=true`, or any other URL parameters valid for Solr update handlers,
26
+ # you can configure this with `solr_writer.solr_update_args` setting. See:
27
+ # https://lucene.apache.org/solr/guide/7_0/near-real-time-searching.html#passing-commit-and-commitwithin-parameters-as-part-of-the-url
28
+ # Eg:
29
+ #
30
+ # settings do
31
+ # provide "solr_writer.solr_update_args", { commitWithin: 1000 }
32
+ # end
33
+ #
34
+ # (That it's a hash makes it infeasible to set/override on the command line; if this is
35
+ # annoying for you, let us know)
36
+ #
37
+ # `solr_update_args` will apply to batch and individual update requests, but
38
+ # not to the commit sent by `commit_on_close`. You can also instead set
39
+ # `solr_writer.commit_solr_update_args` for that (or pass in an arg to #commit if calling
40
+ # manually)
41
+ #
42
+ # ## Relevant settings
20
43
  #
21
44
  # * solr.url (optional if solr.update_url is set) The URL to the solr core to index into
22
45
  #
@@ -35,19 +58,32 @@ require 'concurrent' # for atomic_fixnum
35
58
  #
36
59
  # * solr_writer.skippable_exceptions: List of classes that will be rescued internal to
37
60
  # SolrJsonWriter, and handled with max_skipped logic. Defaults to
38
- # `[HTTPClient::TimeoutError, SocketError, Errno::ECONNREFUSED]`
61
+ # `[HTTPClient::TimeoutError, SocketError, Errno::ECONNREFUSED, Traject::SolrJsonWriter::BadHttpResponse]`
62
+ #
63
+ # * solr_writer.solr_update_args: A _hash_ of query params to send to solr update url.
64
+ # Will be sent with every update request. Eg `{ softCommit: true }` or `{ commitWithin: 1000 }`.
65
+ # See also `solr_writer.solr_commit_args`
39
66
  #
40
67
  # * solr_writer.commit_on_close: Set to true (or "true") if you want to commit at the
41
68
  # end of the indexing run. (Old "solrj_writer.commit_on_close" supported for backwards
42
69
  # compat only.)
43
70
  #
71
+ # * solr_writer.commit_solr_update_args: A hash of query params to send when committing.
72
+ # Will be used for the automatic `commit_on_close` commit, as well as any manual calls to #commit.
73
+ # If set, must include {"commit" => "true"} or { "softCommit" => "true" } if you actually
74
+ # want commits to happen when SolrJsonWriter tries to commit! But can be used to switch to softCommits
75
+ # (hard commit is the default), or specify additional params like optimize etc.
76
+ #
77
+ # * solr_writer.http_timeout: Value in seconds, will be set on the httpclient as connect/receive/send
78
+ # timeout. No way to set them individually at present. Default nil, use HTTPClient defaults
79
+ # (60 for connect/receive, 120 for send).
80
+ #
44
81
  # * solr_writer.commit_timeout: If commit_on_close, how long to wait for Solr before
45
- # giving up as a timeout. Default 10 minutes. Solr can be slow.
82
+ # giving up as a timeout (http client receive_timeout). Default 10 minutes. Solr can be slow at commits. Overrides solr_writer.timeout
46
83
  #
47
84
  # * solr_json_writer.http_client Mainly intended for testing, set your own HTTPClient
48
85
  # or mock object to be used for HTTP.
49
-
50
-
86
+ #
51
87
  class Traject::SolrJsonWriter
52
88
  include Traject::QualifiedConstGet
53
89
 
@@ -71,7 +107,15 @@ class Traject::SolrJsonWriter
71
107
  @max_skipped = nil
72
108
  end
73
109
 
74
- @http_client = @settings["solr_json_writer.http_client"] || HTTPClient.new
110
+ @http_client = if @settings["solr_json_writer.http_client"]
111
+ @settings["solr_json_writer.http_client"]
112
+ else
113
+ client = HTTPClient.new
114
+ if @settings["solr_writer.http_timeout"]
115
+ client.connect_timeout = client.receive_timeout = client.send_timeout = @settings["solr_writer.http_timeout"]
116
+ end
117
+ client
118
+ end
75
119
 
76
120
  @batch_size = (settings["solr_writer.batch_size"] || DEFAULT_BATCH_SIZE).to_i
77
121
  @batch_size = 1 if @batch_size < 1
@@ -96,6 +140,9 @@ class Traject::SolrJsonWriter
96
140
  # Figure out where to send updates
97
141
  @solr_update_url = self.determine_solr_update_url
98
142
 
143
+ @solr_update_args = settings["solr_writer.solr_update_args"]
144
+ @commit_solr_update_args = settings["solr_writer.commit_solr_update_args"]
145
+
99
146
  logger.info(" #{self.class.name} writing to '#{@solr_update_url}' in batches of #{@batch_size} with #{@thread_pool_size} bg threads")
100
147
  end
101
148
 
@@ -123,14 +170,25 @@ class Traject::SolrJsonWriter
123
170
  send_batch( Traject::Util.drain_queue(@batched_queue) )
124
171
  end
125
172
 
173
+ # configured update url, with either settings @solr_update_args or passed in
174
+ # query_params added to it
175
+ def solr_update_url_with_query(query_params)
176
+ if query_params
177
+ @solr_update_url + '?' + URI.encode_www_form(query_params)
178
+ else
179
+ @solr_update_url
180
+ end
181
+ end
182
+
126
183
  # Send the given batch of contexts. If something goes wrong, send
127
184
  # them one at a time.
128
185
  # @param [Array<Traject::Indexer::Context>] an array of contexts
129
186
  def send_batch(batch)
130
187
  return if batch.empty?
131
188
  json_package = JSON.generate(batch.map { |c| c.output_hash })
189
+
132
190
  begin
133
- resp = @http_client.post @solr_update_url, json_package, "Content-type" => "application/json"
191
+ resp = @http_client.post solr_update_url_with_query(@solr_update_args), json_package, "Content-type" => "application/json"
134
192
  rescue StandardError => exception
135
193
  end
136
194
 
@@ -153,30 +211,55 @@ class Traject::SolrJsonWriter
153
211
  def send_single(c)
154
212
  json_package = JSON.generate([c.output_hash])
155
213
  begin
156
- resp = @http_client.post @solr_update_url, json_package, "Content-type" => "application/json"
157
- # Catch Timeouts and network errors as skipped records, but otherwise
158
- # allow unexpected errors to propagate up.
159
- rescue *skippable_exceptions => exception
160
- # no body, local variable exception set above will be used below
161
- end
214
+ resp = @http_client.post solr_update_url_with_query(@solr_update_args), json_package, "Content-type" => "application/json"
162
215
 
163
- if exception || resp.status != 200
164
- if exception
165
- msg = Traject::Util.exception_to_log_message(exception)
216
+ unless resp.status == 200
217
+ raise BadHttpResponse.new("Unexpected HTTP response status #{resp.status}", resp)
218
+ end
219
+
220
+ # Catch Timeouts and network errors -- as well as non-200 http responses --
221
+ # as skipped records, but otherwise allow unexpected errors to propagate up.
222
+ rescue *skippable_exceptions => exception
223
+ msg = if exception.kind_of?(BadHttpResponse)
224
+ "Solr error response: #{exception.response.status}: #{exception.response.body}"
166
225
  else
167
- msg = "Solr error response: #{resp.status}: #{resp.body}"
226
+ Traject::Util.exception_to_log_message(exception)
168
227
  end
228
+
169
229
  logger.error "Could not add record #{c.record_inspect}: #{msg}"
170
230
  logger.debug("\t" + exception.backtrace.join("\n\t")) if exception
171
231
  logger.debug(c.source_record.to_s) if c.source_record
172
232
 
173
233
  @skipped_record_incrementer.increment
174
234
  if @max_skipped and skipped_record_count > @max_skipped
235
+ # re-raising in rescue means the last encountered error will be available as #cause
236
+ # on raised exception, a feature in ruby 2.1+.
175
237
  raise MaxSkippedRecordsExceeded.new("#{self.class.name}: Exceeded maximum number of skipped records (#{@max_skipped}): aborting")
176
238
  end
177
-
178
239
  end
240
+ end
241
+
179
242
 
243
+ # Very beginning of a delete implementation. POSTs a delete request to solr
244
+ # for the id given in the argument (value of the Solr unique key field, usually the `id` field).
245
+ #
246
+ # Right now, does it inline and immediately, no use of background threads or batching.
247
+ # This could change.
248
+ #
249
+ # Right now, if unsuccessful for any reason, this will raise immediately out of here.
250
+ # It could raise any of the `skippable_exceptions` (timeouts, network errors); such an
251
+ # exception will be raised right out of here rather than being skipped.
252
+ #
253
+ # Will use `solr_writer.solr_update_args` settings.
254
+ #
255
+ # There is no built-in way to direct a record to be deleted from an indexing config
256
+ # file at the moment; this is just a loose method on the writer.
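  #
  # @example Delete a single record by its Solr unique key (the id shown is illustrative)
  #   writer.delete("123456")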
257
+ def delete(id)
258
+ json_package = {delete: id}
259
+ resp = @http_client.post solr_update_url_with_query(@solr_update_args), JSON.generate(json_package), "Content-type" => "application/json"
260
+ if resp.status != 200
261
+ raise RuntimeError.new("Could not delete #{id.inspect}, http response #{resp.status}: #{resp.body}")
262
+ end
180
263
  end
181
264
 
182
265
 
@@ -220,14 +303,32 @@ class Traject::SolrJsonWriter
220
303
 
221
304
 
222
305
  # Send a commit
223
- def commit
306
+ #
307
+ # Called automatically by the `commit_on_close` setting, but also can be called manually.
308
+ #
309
+ # If settings `solr_writer.commit_solr_update_args` is set, will be used by default.
310
+ # That setting needs `{ commit: true }` or `{softCommit: true}` if you want it to
311
+ # actually do a commit!
312
+ #
313
+ # Optional query_params argument is the actual args to send, you must be sure
314
+ # to make it include "commit: true" or "softCommit: true" for it to actually commit!
315
+ # But you may want to include other params too, like optimize etc. query_param
316
+ # argument replaces setting `solr_writer.commit_solr_update_args`, they are not merged.
317
+ #
318
+ # @param [Hash] query_params optional query params to send to solr update. Default {"commit" => "true"}
319
+ #
320
+ # @example @writer.commit
321
+ # @example @writer.commit(softCommit: true)
322
+ # @example @writer.commit(commit: true, optimize: true, waitFlush: false)
323
+ def commit(query_params = nil)
324
+ query_params ||= @commit_solr_update_args || {"commit" => "true"}
224
325
  logger.info "#{self.class.name} sending commit to solr at url #{@solr_update_url}..."
225
326
 
226
327
  original_timeout = @http_client.receive_timeout
227
328
 
228
329
  @http_client.receive_timeout = (settings["commit_timeout"] || (10 * 60)).to_i
229
330
 
230
- resp = @http_client.get(@solr_update_url, {"commit" => 'true'})
331
+ resp = @http_client.get(solr_update_url_with_query(query_params))
231
332
  unless resp.status == 200
232
333
  raise RuntimeError.new("Could not commit to Solr: #{resp.status} #{resp.body}")
233
334
  end
@@ -279,10 +380,24 @@ class Traject::SolrJsonWriter
279
380
 
280
381
  class MaxSkippedRecordsExceeded < RuntimeError ; end
281
382
 
383
+ # Adapted from HTTPClient::BadResponseError.
384
+ # It's got a #response accessor that will give you the HTTPClient
385
+ # Response object that had a bad status, although relying on that
386
+ # would tie you to our HTTPClient implementation that maybe should
387
+ # be considered an implementation detail, so I dunno.
388
+ class BadHttpResponse < RuntimeError
389
+ # HTTP::Message:: a response
390
+ attr_reader :response
391
+
392
+ def initialize(msg, response = nil) # :nodoc:
393
+ super(msg)
394
+ @response = response
395
+ end
396
+ end
282
397
 
283
398
  private
284
399
 
285
400
  def skippable_exceptions
286
- @skippable_exceptions ||= (settings["solr_writer.skippable_exceptions"] || [HTTPClient::TimeoutError, SocketError, Errno::ECONNREFUSED])
401
+ @skippable_exceptions ||= (settings["solr_writer.skippable_exceptions"] || [HTTPClient::TimeoutError, SocketError, Errno::ECONNREFUSED, Traject::SolrJsonWriter::BadHttpResponse])
287
402
  end
288
403
  end
@@ -1,3 +1,3 @@
1
1
  module Traject
2
- VERSION = "3.0.0"
2
+ VERSION = "3.1.0.rc1"
3
3
  end
@@ -0,0 +1,104 @@
1
+ require 'test_helper'
2
+
3
+ describe "Class-level configuration of Indexer sub-class" do
4
+ # Declaring a class inline in minitest isn't great, this really is a globally
5
+ # available class now, other tests shouldn't re-use this class name. But it works
6
+ # for testing for now.
7
+ class TestIndexerSubclass < Traject::Indexer
8
+ configure do
9
+ settings do
10
+ provide "class_level", "TestIndexerSubclass"
11
+ end
12
+
13
+ to_field "field", literal("value")
14
+ each_record do |rec, context|
15
+ context.output_hash["from_each_record"] ||= []
16
+ context.output_hash["from_each_record"] << "value"
17
+ end
18
+ end
19
+
20
+ def self.default_settings
21
+ @default_settings ||= super.merge(
22
+ "set_by_default_setting_no_override" => "TestIndexerSubclass",
23
+ "set_by_default_setting" => "TestIndexerSubclass"
24
+ )
25
+ end
26
+ end
27
+
28
+ before do
29
+ @indexer = TestIndexerSubclass.new
30
+ end
31
+
32
+ it "uses class-level configuration" do
33
+ result = @indexer.map_record(Object.new)
34
+
35
+ assert_equal ['value'], result['field']
36
+ assert_equal ['value'], result['from_each_record']
37
+ end
38
+
39
+ it "uses class-level configuration and instance-level configuration" do
40
+ @indexer.configure do
41
+ to_field "field", literal("from-instance-config")
42
+ to_field "instance_field", literal("from-instance-config")
43
+ end
44
+
45
+ result = @indexer.map_record(Object.new)
46
+ assert_equal ['value', 'from-instance-config'], result['field']
47
+ assert_equal ['from-instance-config'], result["instance_field"]
48
+ end
49
+
50
+ describe "with multi-level subclass" do
51
+ class TestIndexerSubclassSubclass < TestIndexerSubclass
52
+ configure do
53
+ settings do
54
+ provide "class_level", "TestIndexerSubclassSubclass"
55
+ end
56
+
57
+ to_field "field", literal("from-sub-subclass")
58
+ to_field "subclass_field", literal("from-sub-subclass")
59
+ end
60
+
61
+ def self.default_settings
62
+ @default_settings ||= super.merge(
63
+ "set_by_default_setting" => "TestIndexerSubclassSubclass"
64
+ )
65
+ end
66
+
67
+ end
68
+
69
+ before do
70
+ @indexer = TestIndexerSubclassSubclass.new
71
+ end
72
+
73
+ it "lets subclass override settings 'provide'" do
74
+ skip("This would be nice but is currently architecturally hard")
75
+ assert_equal "TestIndexerSubclassSubclass", @indexer.settings["class_level"]
76
+ end
77
+
78
+ it "lets subclass override default settings" do
79
+ assert_equal "TestIndexerSubclassSubclass", @indexer.settings["set_by_default_setting"]
80
+ assert_equal "TestIndexerSubclass", @indexer.settings["set_by_default_setting_no_override"]
81
+ end
82
+
83
+ it "uses configuraton from all inheritance" do
84
+ result = @indexer.map_record(Object.new)
85
+
86
+ assert_equal ['value', 'from-sub-subclass'], result['field']
87
+ assert_equal ['value'], result['from_each_record']
88
+ assert_equal ['from-sub-subclass'], result['subclass_field']
89
+ end
90
+
91
+ it "uses configuraton from all inheritance plus instance" do
92
+ @indexer.configure do
93
+ to_field "field", literal("from-instance")
94
+ to_field "instance_field", literal("from-instance")
95
+ end
96
+
97
+ result = @indexer.map_record(Object.new)
98
+
99
+ assert_equal ['value', 'from-sub-subclass', 'from-instance'], result['field']
100
+ assert_equal ['from-instance'], result['instance_field']
101
+ end
102
+ end
103
+
104
+ end
@@ -38,8 +38,71 @@ describe "Traject::Indexer::Context" do
38
38
 
39
39
  assert_equal "<record ##{@position} (#{@input_name} ##{@position_in_input}), source_id:#{@record_001} output_id:output_id>", @context.record_inspect
40
40
  end
41
-
42
41
  end
43
42
 
43
+ describe "#add_output" do
44
+ before do
45
+ @context = Traject::Indexer::Context.new
46
+ end
47
+ it "adds one value to nil" do
48
+ @context.add_output(:key, "value")
49
+ assert_equal @context.output_hash, { "key" => ["value"] }
50
+ end
51
+
52
+ it "adds multiple values to nil" do
53
+ @context.add_output(:key, "value1", "value2")
54
+ assert_equal @context.output_hash, { "key" => ["value1", "value2"] }
55
+ end
56
+
57
+ it "adds one value to existing accumulator" do
58
+ @context.output_hash["key"] = ["value1"]
59
+ @context.add_output(:key, "value2")
60
+ assert_equal @context.output_hash, { "key" => ["value1", "value2"] }
61
+ end
62
+
63
+ it "uniqs by default" do
64
+ @context.output_hash["key"] = ["value1"]
65
+ @context.add_output(:key, "value1")
66
+ assert_equal @context.output_hash, { "key" => ["value1"] }
67
+ end
68
+
69
+ it "does not unique if allow_duplicate_values" do
70
+ @context.settings = { Traject::Indexer::ToFieldStep::ALLOW_DUPLICATE_VALUES => true }
71
+ @context.output_hash["key"] = ["value1"]
72
+
73
+ @context.add_output(:key, "value1")
74
+ assert_equal @context.output_hash, { "key" => ["value1", "value1"] }
75
+ end
76
+
77
+ it "ignores nil values by default" do
78
+ @context.add_output(:key, "value1", nil, "value2")
79
+ assert_equal @context.output_hash, { "key" => ["value1", "value2"] }
80
+ end
81
+
82
+ it "allows nil values if allow_nil_values" do
83
+ @context.settings = { Traject::Indexer::ToFieldStep::ALLOW_NIL_VALUES => true }
44
84
 
85
+ @context.add_output(:key, "value1", nil, "value2")
86
+ assert_equal @context.output_hash, { "key" => ["value1", nil, "value2"] }
87
+ end
88
+
89
+ it "ignores empty array by default" do
90
+ @context.add_output(:key)
91
+ @context.add_output(:key, nil)
92
+
93
+ assert_nil @context.output_hash["key"]
94
+ end
95
+
96
+ it "allows empty field if allow_empty_fields" do
97
+ @context.settings = { Traject::Indexer::ToFieldStep::ALLOW_EMPTY_FIELDS => true }
98
+
99
+ @context.add_output(:key, nil)
100
+ assert_equal @context.output_hash, { "key" => [] }
101
+ end
102
+
103
+ it "can add to multiple fields" do
104
+ @context.add_output(["field1", "field2"], "value1", "value2")
105
+ assert_equal @context.output_hash, { "field1" => ["value1", "value2"], "field2" => ["value1", "value2"] }
106
+ end
107
+ end
45
108
  end
@@ -56,4 +56,22 @@ describe 'Custom mapping error handler' do
56
56
 
57
57
  assert_nil indexer.map_record({})
58
58
  end
59
+
60
+ it "uses logger from settings" do
61
+ desired_logger = Logger.new("/dev/null")
62
+ set_logger = nil
63
+ indexer.configure do
64
+ settings do
65
+ provide "logger", desired_logger
66
+ provide "mapping_rescue", -> (ctx, e) {
67
+ set_logger = ctx.logger
68
+ }
69
+ end
70
+ to_field 'id' do |_context , _exception|
71
+ raise 'this was always going to fail'
72
+ end
73
+ end
74
+ indexer.map_record({})
75
+ assert_equal desired_logger.object_id, set_logger.object_id
76
+ end
59
77
  end
@@ -1,6 +1,12 @@
1
1
  require 'test_helper'
2
2
  require 'traject/nokogiri_reader'
3
3
 
4
+ # Note that JRuby Nokogiri can treat namespaces differently than MRI nokogiri.
5
+ # Particularly when we extract elements from a larger document with `each_record_xpath`,
6
+ # and put them in their own document, in JRuby nokogiri the xmlns declarations
7
+ # can end up on different elements than expected, although the document should
8
+ # be semantically equivalent to an XML-namespace-aware processor. See:
9
+ # https://github.com/sparklemotion/nokogiri/issues/1875
4
10
  describe "Traject::NokogiriReader" do
5
11
  describe "with namespaces" do
6
12
  before do
@@ -80,8 +86,22 @@ describe "Traject::NokogiriReader" do
80
86
  assert yielded_records.length > 0
81
87
 
82
88
  expected_namespaces = {"xmlns"=>"http://example.org/top", "xmlns:a"=>"http://example.org/a", "xmlns:b"=>"http://example.org/b"}
83
- yielded_records.each do |rec|
84
- assert_equal expected_namespaces, rec.namespaces
89
+
90
+ if !Traject::Util.is_jruby?
91
+ yielded_records.each do |rec|
92
+ assert_equal expected_namespaces, rec.namespaces
93
+ end
94
+ else
95
+ # jruby nokogiri shuffles things around; all we can really do is test that the namespaces
96
+ # are somewhere in the doc :( We rely on other tests to test semantic equivalence.
97
+ yielded_records.each do |rec|
98
+ assert_equal expected_namespaces, rec.collect_namespaces
99
+ end
100
+
101
+ whole_doc = Nokogiri::XML.parse(File.open(support_file_path("namespace-test.xml")))
102
+ whole_doc.xpath("//mytop:record", mytop: "http://example.org/top").each_with_index do |original_el, i|
103
+ assert ns_semantic_equivalent_xml?(original_el, yielded_records[i])
104
+ end
85
105
  end
86
106
  end
87
107
  end
@@ -139,7 +159,40 @@ describe "Traject::NokogiriReader" do
139
159
 
140
160
  assert_length manually_extracted.size, yielded_records
141
161
  assert yielded_records.all? {|r| r.kind_of? Nokogiri::XML::Document }
142
- assert_equal manually_extracted.collect(&:to_xml), yielded_records.collect(&:root).collect(&:to_xml)
162
+
163
+ expected_xml = manually_extracted
164
+ actual_xml = yielded_records.collect(&:root)
165
+
166
+ expected_xml.size.times do |i|
167
+ if !Traject::Util.is_jruby?
168
+ assert_equal expected_xml[i-1].to_xml, actual_xml[i-1].to_xml
169
+ else
170
+ # jruby shuffles the xmlns declarations around, but they should
171
+ # be semantically equivalent to a namespace-aware processor
172
+ assert ns_semantic_equivalent_xml?(expected_xml[i-1], actual_xml[i-1])
173
+ end
174
+ end
175
+ end
176
+
177
+ # Jruby nokogiri can shuffle around where the `xmlns:ns` declarations appear, although it
178
+ # _ought_ not to be semantically different for a namespace-aware parser -- nodes are still in
179
+ # same namespaces. JRuby may differ from what MRI does with same code, and may differ from
180
+ # the way an element appeared in input when extracting records from a larger input doc.
181
+ # There isn't much we can do about this, but we can write a recursive method
182
+ # that hopefully compares XML to make sure it really is semantically equivalent to
183
+ # a namespace-aware parser, and hope we got that right.
184
+ def ns_semantic_equivalent_xml?(noko_a, noko_b)
185
+ noko_a = noko_a.root if noko_a.kind_of?(Nokogiri::XML::Document)
186
+ noko_b = noko_b.root if noko_b.kind_of?(Nokogiri::XML::Document)
187
+
188
+ noko_a.name == noko_b.name &&
189
+ noko_a.namespace&.prefix == noko_b.namespace&.prefix &&
190
+ noko_a.namespace&.href == noko_b.namespace&.href &&
191
+ noko_a.attributes == noko_b.attributes &&
192
+ noko_a.children.length == noko_b.children.length &&
193
+ noko_a.children.each_with_index.all? do |a_child, index|
194
+ ns_semantic_equivalent_xml?(a_child, noko_b.children[index])
195
+ end
143
196
  end
144
197
 
145
198
  describe "without each_record_xpath" do
@@ -137,6 +137,26 @@ describe "Traject::SolrJsonWriter" do
137
137
  assert_length 1, JSON.parse(post_args[1][1]), "second batch posted with last remaining doc"
138
138
  end
139
139
 
140
+ it "retries batch as individual records on failure" do
141
+ @writer = create_writer("solr_writer.batch_size" => 2, "solr_writer.max_skipped" => 10)
142
+ @fake_http_client.response_status = 500
143
+
144
+ 2.times do |i|
145
+ @writer.put context_with({"id" => "doc_#{i}", "key" => "value"})
146
+ end
147
+ @writer.close
148
+
149
+ # 1 batch, then 2 for re-trying each individually
150
+ assert_length 3, @fake_http_client.post_args
151
+
152
+ batch_update = @fake_http_client.post_args.first
153
+ assert_length 2, JSON.parse(batch_update[1])
154
+
155
+ individual_update1, individual_update2 = @fake_http_client.post_args[1], @fake_http_client.post_args[2]
156
+ assert_length 1, JSON.parse(individual_update1[1])
157
+ assert_length 1, JSON.parse(individual_update2[1])
158
+ end
159
+
140
160
  it "can #flush" do
141
161
  2.times do |i|
142
162
  doc = {"id" => "doc_#{i}", "key" => "value"}
@@ -150,15 +170,116 @@ describe "Traject::SolrJsonWriter" do
150
170
  assert_length 1, @fake_http_client.post_args, "Has flushed to solr"
151
171
  end
152
172
 
153
- it "commits on close when set" do
154
- @writer = create_writer("solr.url" => "http://example.com", "solr_writer.commit_on_close" => "true")
155
- @writer.put context_with({"id" => "one", "key" => ["value1", "value2"]})
156
- @writer.close
173
+ describe "commit" do
174
+ it "commits on close when set" do
175
+ @writer = create_writer("solr.url" => "http://example.com", "solr_writer.commit_on_close" => "true")
176
+ @writer.put context_with({"id" => "one", "key" => ["value1", "value2"]})
177
+ @writer.close
178
+
179
+ last_solr_get = @fake_http_client.get_args.last
180
+
181
+ assert_equal "http://example.com/update/json?commit=true", last_solr_get[0]
182
+ end
183
+
184
+ it "commits on close with commit_solr_update_args" do
185
+ @writer = create_writer(
186
+ "solr.url" => "http://example.com",
187
+ "solr_writer.commit_on_close" => "true",
188
+ "solr_writer.commit_solr_update_args" => { softCommit: true }
189
+ )
190
+ @writer.put context_with({"id" => "one", "key" => ["value1", "value2"]})
191
+ @writer.close
192
+
193
+ last_solr_get = @fake_http_client.get_args.last
194
+
195
+ assert_equal "http://example.com/update/json?softCommit=true", last_solr_get[0]
196
+ end
157
197
 
158
- last_solr_get = @fake_http_client.get_args.last
198
+ it "can manually send commit" do
199
+ @writer = create_writer("solr.url" => "http://example.com")
200
+ @writer.commit
201
+
202
+ last_solr_get = @fake_http_client.get_args.last
203
+ assert_equal "http://example.com/update/json?commit=true", last_solr_get[0]
204
+ end
205
+
206
+ it "can manually send commit with specified args" do
207
+ @writer = create_writer("solr.url" => "http://example.com", "solr_writer.commit_solr_update_args" => { softCommit: true })
208
+ @writer.commit(commit: true, optimize: true, waitFlush: false)
209
+ last_solr_get = @fake_http_client.get_args.last
210
+ assert_equal "http://example.com/update/json?commit=true&optimize=true&waitFlush=false", last_solr_get[0]
211
+ end
212
+
213
+ it "uses commit_solr_update_args settings by default" do
214
+ @writer = create_writer(
215
+ "solr.url" => "http://example.com",
216
+ "solr_writer.commit_solr_update_args" => { softCommit: true }
217
+ )
218
+ @writer.commit
219
+
220
+ last_solr_get = @fake_http_client.get_args.last
221
+ assert_equal "http://example.com/update/json?softCommit=true", last_solr_get[0]
222
+ end
223
+
224
+ it "overrides commit_solr_update_args with method arg" do
225
+ @writer = create_writer(
226
+ "solr.url" => "http://example.com",
227
+ "solr_writer.commit_solr_update_args" => { softCommit: true, foo: "bar" }
228
+ )
229
+ @writer.commit(commit: true)
159
230
 
160
- assert_equal "http://example.com/update/json", last_solr_get[0]
161
- assert_equal( {"commit" => "true"}, last_solr_get[1] )
231
+ last_solr_get = @fake_http_client.get_args.last
232
+ assert_equal "http://example.com/update/json?commit=true", last_solr_get[0]
233
+ end
234
+ end
235
+
236
+ describe "solr_writer.solr_update_args" do
237
+ before do
238
+ @writer = create_writer("solr_writer.solr_update_args" => { softCommit: true } )
239
+ end
240
+
241
+ it "sends update args" do
242
+ @writer.put context_with({"id" => "one", "key" => ["value1", "value2"]})
243
+ @writer.close
244
+
245
+ assert_equal 1, @fake_http_client.post_args.count
246
+
247
+ post_args = @fake_http_client.post_args.first
248
+
249
+ assert_equal "http://example.com/solr/update/json?softCommit=true", post_args[0]
250
+ end
251
+
252
+ it "sends update args with delete" do
253
+ @writer.delete("test-id")
254
+ @writer.close
255
+
256
+ assert_equal 1, @fake_http_client.post_args.count
257
+
258
+ post_args = @fake_http_client.post_args.first
259
+
260
+ assert_equal "http://example.com/solr/update/json?softCommit=true", post_args[0]
261
+ end
262
+
263
+ it "sends update args on individual-retry after batch failure" do
264
+ @writer = create_writer(
265
+ "solr_writer.batch_size" => 2,
266
+ "solr_writer.max_skipped" => 10,
267
+ "solr_writer.solr_update_args" => { softCommit: true }
268
+ )
269
+ @fake_http_client.response_status = 500
270
+
271
+ 2.times do |i|
272
+ @writer.put context_with({"id" => "doc_#{i}", "key" => "value"})
273
+ end
274
+ @writer.close
275
+
276
+ # 1 batch, then 2 for re-trying each individually
277
+ assert_length 3, @fake_http_client.post_args
278
+
279
+ individual_update1, individual_update2 = @fake_http_client.post_args[1], @fake_http_client.post_args[2]
280
+ assert_equal "http://example.com/solr/update/json?softCommit=true", individual_update1[0]
281
+ assert_equal "http://example.com/solr/update/json?softCommit=true", individual_update2[0]
282
+ end
162
283
  end
163
284
 
164
285
  describe "skipped records" do
@@ -225,6 +346,23 @@ describe "Traject::SolrJsonWriter" do
225
346
  logged = strio.string
226
347
  assert_includes logged, 'ArgumentError: bad stuff'
227
348
  end
349
+ end
350
+
351
+ describe "#delete" do
352
+ it "deletes" do
353
+ id = "123456"
354
+ @writer.delete(id)
355
+
356
+ post_args = @fake_http_client.post_args.first
357
+ assert_equal "http://example.com/solr/update/json", post_args[0]
358
+ assert_equal JSON.generate({"delete" => id}), post_args[1]
359
+ end
228
360
 
361
+ it "raises on non-200 http response" do
362
+ @fake_http_client.response_status = 500
363
+ assert_raises(RuntimeError) do
364
+ @writer.delete("12345")
365
+ end
366
+ end
229
367
  end
230
368
  end
@@ -31,9 +31,9 @@ Gem::Specification.new do |spec|
31
31
  spec.add_dependency "httpclient", "~> 2.5"
32
32
  spec.add_dependency "http", "~> 3.0" # used in oai_pmh_reader, may use more extensively in future instead of httpclient
33
33
  spec.add_dependency 'marc-fastxmlwriter', '~>1.0' # fast marc->xml
34
- spec.add_dependency "nokogiri", "~> 1.0" # NokogiriIndexer
34
+ spec.add_dependency "nokogiri", "~> 1.9" # NokogiriIndexer
35
35
 
36
- spec.add_development_dependency "bundler", '~> 1.7'
36
+ spec.add_development_dependency 'bundler', '>= 1.7', '< 3'
37
37
 
38
38
  spec.add_development_dependency "rake"
39
39
  spec.add_development_dependency "minitest"
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: traject
3
3
  version: !ruby/object:Gem::Version
4
- version: 3.0.0
4
+ version: 3.1.0.rc1
5
5
  platform: ruby
6
6
  authors:
7
7
  - Jonathan Rochkind
@@ -9,7 +9,7 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2018-10-12 00:00:00.000000000 Z
12
+ date: 2019-04-10 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: concurrent-ruby
@@ -149,28 +149,34 @@ dependencies:
149
149
  requirements:
150
150
  - - "~>"
151
151
  - !ruby/object:Gem::Version
152
- version: '1.0'
152
+ version: '1.9'
153
153
  type: :runtime
154
154
  prerelease: false
155
155
  version_requirements: !ruby/object:Gem::Requirement
156
156
  requirements:
157
157
  - - "~>"
158
158
  - !ruby/object:Gem::Version
159
- version: '1.0'
159
+ version: '1.9'
160
160
  - !ruby/object:Gem::Dependency
161
161
  name: bundler
162
162
  requirement: !ruby/object:Gem::Requirement
163
163
  requirements:
164
- - - "~>"
164
+ - - ">="
165
165
  - !ruby/object:Gem::Version
166
166
  version: '1.7'
167
+ - - "<"
168
+ - !ruby/object:Gem::Version
169
+ version: '3'
167
170
  type: :development
168
171
  prerelease: false
169
172
  version_requirements: !ruby/object:Gem::Requirement
170
173
  requirements:
171
- - - "~>"
174
+ - - ">="
172
175
  - !ruby/object:Gem::Version
173
176
  version: '1.7'
177
+ - - "<"
178
+ - !ruby/object:Gem::Version
179
+ version: '3'
174
180
  - !ruby/object:Gem::Dependency
175
181
  name: rake
176
182
  requirement: !ruby/object:Gem::Requirement
@@ -292,6 +298,7 @@ files:
292
298
  - test/debug_writer_test.rb
293
299
  - test/delimited_writer_test.rb
294
300
  - test/experimental_nokogiri_streaming_reader_test.rb
301
+ - test/indexer/class_level_configuration_test.rb
295
302
  - test/indexer/context_test.rb
296
303
  - test/indexer/each_record_test.rb
297
304
  - test/indexer/error_handler_test.rb
@@ -381,12 +388,12 @@ required_ruby_version: !ruby/object:Gem::Requirement
381
388
  version: '0'
382
389
  required_rubygems_version: !ruby/object:Gem::Requirement
383
390
  requirements:
384
- - - ">="
391
+ - - ">"
385
392
  - !ruby/object:Gem::Version
386
- version: '0'
393
+ version: 1.3.1
387
394
  requirements: []
388
395
  rubyforge_project:
389
- rubygems_version: 2.7.7
396
+ rubygems_version: 2.7.6
390
397
  signing_key:
391
398
  specification_version: 4
392
399
  summary: An easy to use, high-performance, flexible and extensible metadata transformation
@@ -395,6 +402,7 @@ test_files:
395
402
  - test/debug_writer_test.rb
396
403
  - test/delimited_writer_test.rb
397
404
  - test/experimental_nokogiri_streaming_reader_test.rb
405
+ - test/indexer/class_level_configuration_test.rb
398
406
  - test/indexer/context_test.rb
399
407
  - test/indexer/each_record_test.rb
400
408
  - test/indexer/error_handler_test.rb