traject 3.0.0 → 3.1.0.rc1
- checksums.yaml +4 -4
- data/.travis.yml +2 -4
- data/CHANGES.md +30 -0
- data/README.md +7 -4
- data/doc/indexing_rules.md +5 -6
- data/doc/programmatic_use.md +25 -1
- data/doc/settings.md +2 -0
- data/doc/xml.md +2 -0
- data/lib/traject/indexer.rb +32 -4
- data/lib/traject/indexer/context.rb +45 -0
- data/lib/traject/indexer/step.rb +8 -12
- data/lib/traject/line_writer.rb +36 -4
- data/lib/traject/nokogiri_reader.rb +9 -18
- data/lib/traject/solr_json_writer.rb +136 -21
- data/lib/traject/version.rb +1 -1
- data/test/indexer/class_level_configuration_test.rb +104 -0
- data/test/indexer/context_test.rb +64 -1
- data/test/indexer/error_handler_test.rb +18 -0
- data/test/nokogiri_reader_test.rb +56 -3
- data/test/solr_json_writer_test.rb +145 -7
- data/traject.gemspec +2 -2
- metadata +17 -9
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 06c28d37f9aafafe709a146c7612e5b5d8a5c58a61fd1502823a38dc52b9d05b
+  data.tar.gz: 2e38b2b8c4030456f3757ae6062231268110d68ef07e10cab722b4074ccd570c
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 04561a77a3e6f2073198983b5bf7d4e35cc9f52bccc1211487cc4c850b0f0b0fc9395a7c87e6ed90061f4a15af57516434d260c649fbc43ea65a0c6435194818
+  data.tar.gz: c7312156c3be556218e319e35ae76aa97fbae5fad6720dbce2e4a046ec90603f5de34fe2cb055425fb3da499922fba50c7d4a6445858793bb0a4fb26cf8f7b29
data/.travis.yml
CHANGED
@@ -6,13 +6,11 @@ sudo: true
 rvm:
   - 2.4.4
   - 2.5.1
-  - 2.6.0-preview2
+  - 2.6.1
 # avoid having travis install jdk on MRI builds where we don't need it.
 matrix:
   include:
     - jdk: openjdk8
       rvm: jruby-9.1.17.0
     - jdk: openjdk8
-      rvm: jruby-9.2.
-  allow_failures:
-    - rvm: "2.6.0-preview2"
+      rvm: jruby-9.2.6.0
data/CHANGES.md
CHANGED
@@ -1,5 +1,35 @@
 # Changes
 
+## 3.1.0
+
+### Added
+
+* Context#add_output is added, convenient for custom ruby code.
+
+      each_record do |record, context|
+        context.add_output "key", something_from(record)
+      end
+
+  https://github.com/traject/traject/pull/220
+
+* SolrJsonWriter
+
+  * Class-level indexer configuration, for custom indexer subclasses, now available with a class-level `configure` method. Warning: Indexers are still expensive to instantiate though. https://github.com/traject/traject/pull/213
+
+  * SolrJsonWriter has new settings to control commit semantics: `solr_writer.solr_update_args` and `solr_writer.commit_solr_update_args`, both with hash values that are Solr update handler query params. https://github.com/traject/traject/pull/215
+
+  * SolrJsonWriter has a `delete(solr-unique-key)` method. Does not currently use any batching or threading. https://github.com/traject/traject/pull/214
+
+  * SolrJsonWriter, when MaxSkippedRecordsExceeded is raised, it will have a #cause that is the last error which resulted in MaxSkippedRecordsExceeded. Some error reporting systems, including Rails, will automatically log #cause, so that's helpful. https://github.com/traject/traject/pull/216
+
+  * SolrJsonWriter now respects a `solr_writer.http_timeout` setting, in seconds, to be passed to the HTTPClient instance. https://github.com/traject/traject/pull/219
+
+* Nokogiri dependency for the NokogiriReader increased to `~> 1.9`. When using JRuby `each_record_xpath`, resulting yielded documents may have xmlns declarations on different nodes than in MRI (and previous versions of Nokogiri), but we could find no way around this with Nokogiri >= 1.9.0. The documents should still be semantically equivalent for namespace use. This was necessary to keep JRuby Nokogiri XML working with recent Nokogiri releases. https://github.com/traject/traject/pull/209
+
+* LineWriter guesses better about when to auto-close, and provides an optional explicit setting in case it guesses wrong. (thanks @justinlittman) https://github.com/traject/traject/pull/211
+
+* Traject::Indexer will now use a Logger(-compatible) instance passed in via the setting 'logger'. https://github.com/traject/traject/pull/217
+
 ## 3.0.0
 
 ### Changed/Backwards Incompatibilities
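For orientation, here is a minimal sketch (not taken from the gem's own docs) combining several of the 3.1.0 additions above in one configuration; the solr URL and logger target are placeholders.

```ruby
require 'logger'

# Hypothetical traject configuration exercising new 3.1.0 settings.
settings do
  provide "solr.url", "http://localhost:8983/solr/my_core"        # placeholder URL
  provide "solr_writer.solr_update_args", { commitWithin: 1000 }  # sent with every update request
  provide "solr_writer.http_timeout", 30                          # seconds, set on the HTTPClient
  provide "logger", Logger.new($stderr)                           # bypasses other log.* settings
end
```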
data/README.md
CHANGED
@@ -19,7 +19,7 @@ Initially by Jonathan Rochkind (Johns Hopkins Libraries) and Bill Dueber (Univer
 * Basic configuration files can be easily written even by non-rubyists, with a few simple directives traject provides. But config files are 'ruby all the way down', so we can provide a gradual slope to more complex needs, with the full power of ruby.
 * Easy to program, easy to read, easy to modify.
 * Fast. Traject by default indexes using multiple threads, on multiple cpu cores, when the underlying ruby implementation (i.e., JRuby) allows it, and can use a separate thread for communication with solr even under MRI. Traject is intended to be usable to process millions of records.
-* Composed of decoupled components, for flexibility and extensibility.
+* Composed of decoupled components, for flexibility and extensibility.f?
 * Designed to support local code and configuration that's maintainable and testable, and can be shared between projects as ruby gems.
 * Easy to split configuration between multiple files, for simple "pick-and-choose" command line options that can combine to deal with any of your local needs.
 
@@ -135,7 +135,7 @@ For the syntax and complete possibilities of the specification string argument t
 
 To see all options for `extract_marc`, see the [extract_marc](http://rdoc.info/gems/traject/Traject/Macros/Marc21:extract_marc) method documentation.
 
-### XML mode,
+### XML mode, extract_xpath
 
 See our [xml guide](./doc/xml.md) for more XML examples, but you will usually use extract_xpath.
 
@@ -311,12 +311,15 @@ like `to_field`, is executed for every record, but without being tied
 to a specific output field.
 
 `each_record` can be used for logging or notifying, computing intermediate
-results, or
+results, or more complex ruby logic.
 
 ~~~ruby
 each_record do |record|
   some_custom_logging(record)
 end
+each_record do |record, context|
+  context.add_output(:some_value, extract_some_value_from_record(record))
+end
 ~~~
 
 For more on `each_record`, see [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md).
@@ -405,7 +408,7 @@ writer class in question.
 
 ## The traject command Line
 
-(If you are interested in running traject in an embedded/programmatic context instead of as a standalone command-line batch process, please see docs on [Programmatic Use](./
+(If you are interested in running traject in an embedded/programmatic context instead of as a standalone command-line batch process, please see docs on [Programmatic Use](./doc/programmatic_use.md) )
 
 The simplest invocation is:
 
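The renamed README heading above points at `extract_xpath`. As a quick illustration, a minimal XML-mode rule might look like the following sketch; the field name, xpath, and namespace mapping are invented for illustration and are not from the diff.

```ruby
# Hypothetical XML-mode indexing rule using extract_xpath with a namespace prefix.
to_field "title",
  extract_xpath("//dc:title", ns: { "dc" => "http://purl.org/dc/elements/1.1/" })
```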
data/doc/indexing_rules.md
CHANGED
@@ -247,13 +247,12 @@ each_record do |record, context|
 end
 
 each_record do |record, context|
-
+  if eligible_for_things?(record)
+    (val1, val2) = calculate_two_things_from(record)
 
-
-
-
-    context.output_hash["second_field"] ||= []
-    context.output_hash["second_field"] << val2
+    context.add_output("first_field", val1)
+    context.add_output("second_field", val2)
+  end
 end
 ~~~
 
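The rewrite above replaces manual `output_hash` bookkeeping with `Context#add_output`. A minimal sketch of the equivalence (field name and value are invented):

```ruby
each_record do |record, context|
  # manual style: create the array yourself, then append
  context.output_hash["subject"] ||= []
  context.output_hash["subject"] << "History"

  # add_output style: creates the array if needed, appends, and uniqs by default
  context.add_output("subject", "History")
end
```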
data/doc/programmatic_use.md
CHANGED
@@ -48,6 +48,30 @@ indexer = Traject::Indexer.new(settings) do
 end
 ```
 
+### Configuring indexer subclasses
+
+Indexing step configuration is historically done in traject at the indexer _instance_ level. Either programmatically or by applying a "configuration file" to an indexer instance.
+
+But you can also define your own indexer sub-class with indexing steps built-in, using the class-level `configure` method.
+
+This is an EXPERIMENTAL feature, implementation may change. https://github.com/traject/traject/pull/213
+
+```ruby
+class MyIndexer < Traject::Indexer
+  configure do
+    settings do
+      provide "solr.url", Rails.application.config.my_solr_url
+    end
+
+    to_field "our_name", literal("University of Whatever")
+  end
+end
+```
+
+These settings and indexing steps are now "hard-coded" into that subclass. You can still provide additional configuration at the instance level, as normal. You can also make a subclass of that `MyIndexer` class, which will inherit configuration from MyIndexer, and can supply its own additional class-level configuration too.
+
+Note that due to how implementation is done, instantiating an indexer is still _relatively_ expensive. (Class-level configuration is only actually executed on instantiation.) You will still get better performance by re-using a global instance of your indexer subclass, instead of, say, instantiating one per object to be indexed.
+
 ## Running the indexer
 
 ### process: probably not what you want
@@ -157,7 +181,7 @@ You may want to consider instead creating one or more configured "global" indexers...
 
 * Readers, and the Indexer#process method, are not thread-safe. Which is why using Indexer#process, which uses a fixed reader, is not thread-safe, and why when sharing a global indexer we want to use `process_record`, `map_record`, or `process_with` as above.
 
-It ought to be safe to use a global Indexer concurrently in several threads, with the `map_record`, `process_record` or `process_with` methods -- so long as your indexing rules and writers are thread-safe, as they usually will be and always ought to be.
+It ought to be safe to use a global Indexer concurrently in several threads, with the `map_record`, `process_record` or `process_with` methods -- so long as your indexing rules and writers are thread-safe, as they usually will be and always ought to be.
 
 ### An example
 
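Following the thread-safety notes above, a sketch of the "global indexer" pattern; the class name and record source are placeholders:

```ruby
# Hypothetical: instantiate the (expensive) indexer once, then reuse it.
MY_INDEXER = MyIndexer.new

records.each do |record|       # `records` is a placeholder enumerable
  # process_record/map_record are described above as safe for concurrent use,
  # unlike #process, which owns a fixed, non-thread-safe reader.
  MY_INDEXER.process_record(record)
end
```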
data/doc/settings.md
CHANGED
@@ -119,6 +119,8 @@ settings are applied first of all. It's recommended you use `provide`.
 
 * `log.batch_size.severity`: If `log.batch_size` is set, what logger severity level to log to. Default "INFO", set to "DEBUG" etc if desired.
 
+* 'logger': Ignore all the other logger settings, just pass a `Logger`-compatible logger instance in directly.
+
 
 
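A one-line sketch of the new `logger` setting described above (log file path is a placeholder):

```ruby
settings do
  provide "logger", Logger.new("traject.log")  # all other log.* settings are then ignored
end
```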
data/doc/xml.md
CHANGED
@@ -133,6 +133,8 @@ The NokogiriReader parser should be relatively performant though, allowing you t
 
 (There is a half-finished `ExperimentalStreamingNokogiriReader` available, but it is experimental, half-finished, may disappear or change in backwards compat at any time, problematic, not recommended for production use, etc.)
 
+Note also that in Jruby, when using `each_record_xpath` with the NokogiriReader, the extracted individual documents may have xmlns declarations in different places than you may expect, although they will still be semantically equivalent for namespace processing. This is due to the Nokogiri JRuby implementation, and we could find no good way to ensure consistent behavior with MRI. See: https://github.com/sparklemotion/nokogiri/issues/1875
+
 ### Jruby
 
 It may be that nokogiri JRuby is just much slower than nokogiri MRI (at least when namespaces are involved?) It may be that our workaround to a [JRuby bug involving namespaces on moving nodes](https://github.com/sparklemotion/nokogiri/issues/1774) doesn't help.
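For context, a sketch of a NokogiriReader configuration that triggers the behavior described in this note. The setting names follow traject's xml guide from memory and the xpath/namespace values are placeholders, so treat all of them as assumptions:

```ruby
settings do
  provide "reader_class_name", "Traject::NokogiriReader"
  provide "nokogiri.each_record_xpath", "//t:record"                 # assumed setting name
  provide "nokogiri.namespaces", { "t" => "http://example.org/top" } # assumed setting name
end
```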
data/lib/traject/indexer.rb
CHANGED
@@ -180,6 +180,7 @@ class Traject::Indexer
     @index_steps = []
     @after_processing_steps = []
 
+    self.class.apply_class_configure_block(self)
     instance_eval(&block) if block
   end
 
@@ -189,6 +190,30 @@
     instance_eval(&block)
   end
 
+  ## Class level configure block accepted too, and applied at instantiation
+  # before instance-level configuration.
+  #
+  # EXPERIMENTAL, implementation may change in ways that affect some uses.
+  # https://github.com/traject/traject/pull/213
+  #
+  # Note that settings set by 'provide' in subclass can not really be overridden
+  # by 'provide' in a next level subclass. Use self.default_settings instead, with
+  # call to super.
+  def self.configure(&block)
+    @class_configure_block = block
+  end
+
+  def self.apply_class_configure_block(instance)
+    # Make sure we inherit from superclass that has a class-level ivar @class_configure_block
+    if self.superclass.respond_to?(:apply_class_configure_block)
+      self.superclass.apply_class_configure_block(instance)
+    end
+    if @class_configure_block
+      instance.configure(&@class_configure_block)
+    end
+  end
+
+
 
   # Pass a string file path, a Pathname, or a File object, for
   # a config file to load into indexer.
@@ -258,10 +283,9 @@
       "log.batch_size.severity" => "info",
 
       # how to post-process the accumulator
-
-
-
-      "allow_empty_fields" => false
+      Traject::Indexer::ToFieldStep::ALLOW_NIL_VALUES => false,
+      Traject::Indexer::ToFieldStep::ALLOW_DUPLICATE_VALUES => true,
+      Traject::Indexer::ToFieldStep::ALLOW_EMPTY_FIELDS => false
     }.freeze
   end
 
@@ -349,6 +373,10 @@
 
   # Create logger according to settings
   def create_logger
+    if settings["logger"]
+      # none of the other settings matter, we just got a logger
+      return settings["logger"]
+    end
 
     logger_level = settings["log.level"] || "info"
 
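Since `apply_class_configure_block` above recurses into the superclass first, class-level configuration applies ancestor-first, before any instance-level configuration. A sketch (class names invented) of the resulting ordering:

```ruby
# Hypothetical subclasses illustrating the application order of configure blocks.
class BaseIndexer < Traject::Indexer
  configure { to_field "who", literal("base") }
end

class ChildIndexer < BaseIndexer
  configure { to_field "who", literal("child") }
end

ChildIndexer.new.map_record(Object.new)
# => {"who" => ["base", "child"]}   # superclass block applied first
```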
data/lib/traject/indexer/context.rb
CHANGED
@@ -82,6 +82,51 @@
     str
   end
 
+  # Add values to an array in context.output_hash with the specified key/field_name(s).
+  # Creates array in output_hash if currently nil.
+  #
+  # Post-processing/filtering:
+  #
+  # * uniqs accumulator, unless settings["allow_duplicate_values"] is set.
+  # * Removes nil values unless settings["allow_nil_values"] is set.
+  # * Will not add an empty array to output_hash (will leave it nil instead)
+  #   unless settings["allow_empty_fields"] is set.
+  #
+  # Multiple values can be added with multiple arguments (we avoid an array argument meaning
+  # multiple values to accommodate odd use cases where array itself is desired in output_hash value)
+  #
+  # @param field_name [String,Symbol,Array<String>,Array<Symbol>] A key to set in output_hash, or
+  #   an array of such keys.
+  #
+  # @example add one value
+  #   context.add_output(:additional_title, "a title")
+  #
+  # @example add multiple values as multiple params
+  #   context.add_output("additional_title", "a title", "another title")
+  #
+  # @example add multiple values as multiple params from array using ruby spread operator
+  #   context.add_output(:some_key, *array_of_values)
+  #
+  # @example add to multiple keys in output hash
+  #   context.add_output(["key1", "key2"], "value")
+  #
+  # @return [Traject::Context] self
+  #
+  # Note for historical reasons relevant settings key *names* are in constants in Traject::Indexer::ToFieldStep,
+  # but the settings don't just apply to ToFieldSteps
+  def add_output(field_name, *values)
+    values.compact! unless self.settings && self.settings[Traject::Indexer::ToFieldStep::ALLOW_NIL_VALUES]
+
+    return self if values.empty? and not (self.settings && self.settings[Traject::Indexer::ToFieldStep::ALLOW_EMPTY_FIELDS])
+
+    Array(field_name).each do |key|
+      accumulator = (self.output_hash[key.to_s] ||= [])
+      accumulator.concat values
+      accumulator.uniq! unless self.settings && self.settings[Traject::Indexer::ToFieldStep::ALLOW_DUPLICATE_VALUES]
+    end
+
+    return self
+  end
 end
 
 
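A usage sketch of `Context#add_output` with the settings documented above (keys and values invented):

```ruby
ctx = Traject::Indexer::Context.new
ctx.add_output(:key, "v", "v")
# => {"key" => ["v"]}  -- uniq'd by default

ctx = Traject::Indexer::Context.new
ctx.settings = { Traject::Indexer::ToFieldStep::ALLOW_DUPLICATE_VALUES => true }
ctx.add_output(:key, "v", "v")
# => {"key" => ["v", "v"]}
```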
data/lib/traject/indexer/step.rb
CHANGED
@@ -145,24 +145,20 @@ class Traject::Indexer
       return accumulator
     end
 
-
-    #
-    #
+
+    # These constants here for historical/legacy reasons, they really oughta
+    # live in Traject::Context, but in case anyone is referring to them
+    # we'll leave them here for now.
     ALLOW_NIL_VALUES = "allow_nil_values".freeze
     ALLOW_EMPTY_FIELDS = "allow_empty_fields".freeze
     ALLOW_DUPLICATE_VALUES = "allow_duplicate_values".freeze
 
+    # Add the accumulator to the context with the correct field name(s).
+    # Do post-processing on the accumulator (remove nil values, allow empty
+    # fields, etc)
     def add_accumulator_to_context!(accumulator, context)
-      accumulator.compact! unless context.settings[ALLOW_NIL_VALUES]
-      return if accumulator.empty? and not (context.settings[ALLOW_EMPTY_FIELDS])
-
       # field_name can actually be an array of field names
-
-      context.output_hash[a_field_name] ||= []
-
-      existing_accumulator = context.output_hash[a_field_name].concat(accumulator)
-      existing_accumulator.uniq! unless context.settings[ALLOW_DUPLICATE_VALUES]
-    end
+      context.add_output(field_name, *accumulator)
     end
   end
 
data/lib/traject/line_writer.rb
CHANGED
@@ -8,12 +8,35 @@ require 'thread'
 # This does not seem to effect performance much, as far as I could tell
 # benchmarking.
 #
-# Output will be sent to `settings["output_file"]` string path, or else
-# `settings["output_stream"]` (ruby IO object), or else stdout.
-#
 # This class can be sub-classed to write out different serialized
 # representations -- subclasses will just override the #serialize
 # method. For instance, see JsonWriter.
+#
+# ## Output
+#
+# The main functionality this class provides is logic for choosing based on
+# settings what file or bytestream to send output to.
+#
+# You can supply `settings["output_file"]` with a _file path_. LineWriter
+# will open up a `File` to write to.
+#
+# Or you can supply `settings["output_stream"]` with any ruby IO object, such as an
+# open `File` object or anything else.
+#
+# If neither are supplied, will write to `$stdout`.
+#
+# ## Closing the output stream
+#
+# The LineWriter tries to guess on whether it should call `close` on the output
+# stream it's writing to, when the LineWriter instance is closed. For instance,
+# if you passed in a `settings["output_file"]` with a path, and the LineWriter
+# opened up a `File` object for you, it should close it for you.
+#
+# But for historical reasons, LineWriter doesn't just use that signal, but tries
+# to guess generally on when to call close. If for some reason it gets it wrong,
+# just use `settings["close_output_on_close"]` set to `true` or `false`.
+# (String `"true"` or `"false"` are also acceptable, for convenience in setting
+# options on command line)
 class Traject::LineWriter
   attr_reader :settings
   attr_reader :write_mutex, :output_file
@@ -57,7 +80,16 @@
   end
 
   def close
-    @output_file.close
+    @output_file.close if should_close_stream?
+  end
+
+  def should_close_stream?
+    if settings["close_output_on_close"].nil?
+      (@output_file.nil? || @output_file.tty? || @output_file == $stdout || $output_file == $stderr)
+    else
+      settings["close_output_on_close"].to_s == "true"
+    end
   end
 
+
 end
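A sketch of overriding the auto-close guess described above; the file name is a placeholder:

```ruby
settings do
  provide "output_file", "records.json"
  provide "close_output_on_close", "false"  # string "false"/"true" accepted for command-line use
end
```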
data/lib/traject/nokogiri_reader.rb
CHANGED
@@ -118,35 +118,26 @@ module Traject
     private
 
 
-    #
+    # We simply do `new_parent_doc.root = node`
     # It seemed maybe safer to dup the node as well as remove the original from the original doc,
     # but I believe this will result in double memory usage, as unlinked nodes aren't GC'd until
     # their doc is. I am hoping this pattern results in less memory usage.
     # https://github.com/sparklemotion/nokogiri/issues/1703
     #
-    #
-    # when re-parenting a node.
+    # We used to have to do something different in Jruby to work around bug:
     # https://github.com/sparklemotion/nokogiri/issues/1774
     #
-    #
-    #
-    #
-    #
+    # But as of nokogiri 1.9, that does not work, and is not necessary if we accept
+    # that Jruby nokogiri may put xmlns declarations on different elements than MRI,
+    # although it should be semantically equivalent for a namespace-aware parser.
+    # https://github.com/sparklemotion/nokogiri/issues/1875
+    #
+    # This as a separate method now exists largely as a historical artifact, and for this
+    # documentation.
    def reparent_node_to_root(new_parent_doc, node)
-      if Traject::Util.is_jruby?
-        original_ns_scopes = node.namespace_scopes
-      end
 
       new_parent_doc.root = node
 
-      if Traject::Util.is_jruby?
-        original_ns_scopes.each do |ns|
-          if new_parent_doc.at_xpath("//#{ns.prefix}:*", ns.prefix => ns.href)
-            new_parent_doc.root.add_namespace(ns.prefix, ns.href)
-          end
-        end
-      end
-
       return new_parent_doc
     end
 
data/lib/traject/solr_json_writer.rb
CHANGED
@@ -16,7 +16,30 @@ require 'concurrent' # for atomic_fixnum
 # This should work under both MRI and JRuby, with JRuby getting much
 # better performance due to the threading model.
 #
-#
+# Solr updates are by default sent with no commit params. This will definitely
+# maximize your performance, and *especially* for bulk/batch indexing is recommended --
+# use Solr auto commit in your Solr configuration instead, possibly with `commit_on_close`
+# setting here.
+#
+# However, if you want the writer to send `commitWithin=true`, `commit=true`,
+# `softCommit=true`, or any other URL parameters valid for Solr update handlers,
+# you can configure this with `solr_writer.solr_update_args` setting. See:
+# https://lucene.apache.org/solr/guide/7_0/near-real-time-searching.html#passing-commit-and-commitwithin-parameters-as-part-of-the-url
+# Eg:
+#
+#     settings do
+#       provide "solr_writer.solr_update_args", { commitWithin: 1000 }
+#     end
+#
+# (That it's a hash makes it infeasible to set/override on command line, if this is
+# annoying for you let us know)
+#
+# `solr_update_args` will apply to batch and individual update requests, but
+# not to commit sent if `commit_on_close`. You can also instead set
+# `solr_writer.solr_commit_args` for that (or pass in an arg to #commit if calling
+# manually)
+#
+# ## Relevant settings
 #
 # * solr.url (optional if solr.update_url is set) The URL to the solr core to index into
 #
@@ -35,19 +58,32 @@ require 'concurrent' # for atomic_fixnum
 #
 # * solr_writer.skippable_exceptions: List of classes that will be rescued internal to
 #   SolrJsonWriter, and handled with max_skipped logic. Defaults to
-#   `[HTTPClient::TimeoutError, SocketError, Errno::ECONNREFUSED]`
+#   `[HTTPClient::TimeoutError, SocketError, Errno::ECONNREFUSED, Traject::SolrJsonWriter::BadHttpResponse]`
+#
+# * solr_writer.solr_update_args: A _hash_ of query params to send to solr update url.
+#   Will be sent with every update request. Eg `{ softCommit: true }` or `{ commitWithin: 1000 }`.
+#   See also `solr_writer.solr_commit_args`
 #
 # * solr_writer.commit_on_close: Set to true (or "true") if you want to commit at the
 #   end of the indexing run. (Old "solrj_writer.commit_on_close" supported for backwards
 #   compat only.)
 #
+# * solr_writer.commit_solr_update_args: A hash of query params to send when committing.
+#   Will be used for automatic `close_on_commit`, as well as any manual calls to #commit.
+#   If set, must include {"commit" => "true"} or { "softCommit" => "true" } if you actually
+#   want commits to happen when SolrJsonWriter tries to commit! But can be used to switch to softCommits
+#   (hard commits default), or specify additional params like optimize etc.
+#
+# * solr_writer.http_timeout: Value in seconds, will be set on the httpclient as connect/receive/send
+#   timeout. No way to set them individually at present. Default nil, use HTTPClient defaults
+#   (60 for connect/receive, 120 for send).
+#
 # * solr_writer.commit_timeout: If commit_on_close, how long to wait for Solr before
-#   giving up as a timeout. Default 10 minutes. Solr can be slow.
+#   giving up as a timeout (http client receive_timeout). Default 10 minutes. Solr can be slow at commits. Overrides solr_writer.timeout
 #
 # * solr_json_writer.http_client Mainly intended for testing, set your own HTTPClient
 #   or mock object to be used for HTTP.
-
-
+#
 class Traject::SolrJsonWriter
   include Traject::QualifiedConstGet
 
@@ -71,7 +107,15 @@ class Traject::SolrJsonWriter
       @max_skipped = nil
     end
 
-    @http_client = @settings["solr_json_writer.http_client"]
+    @http_client = if @settings["solr_json_writer.http_client"]
+      @settings["solr_json_writer.http_client"]
+    else
+      client = HTTPClient.new
+      if @settings["solr_writer.http_timeout"]
+        client.connect_timeout = client.receive_timeout = client.send_timeout = @settings["solr_writer.http_timeout"]
+      end
+      client
+    end
 
     @batch_size = (settings["solr_writer.batch_size"] || DEFAULT_BATCH_SIZE).to_i
     @batch_size = 1 if @batch_size < 1
@@ -96,6 +140,9 @@ class Traject::SolrJsonWriter
     # Figure out where to send updates
     @solr_update_url = self.determine_solr_update_url
 
+    @solr_update_args = settings["solr_writer.solr_update_args"]
+    @commit_solr_update_args = settings["solr_writer.commit_solr_update_args"]
+
     logger.info("   #{self.class.name} writing to '#{@solr_update_url}' in batches of #{@batch_size} with #{@thread_pool_size} bg threads")
   end
 
@@ -123,14 +170,25 @@ class Traject::SolrJsonWriter
     send_batch( Traject::Util.drain_queue(@batched_queue) )
   end
 
+  # configured update url, with either settings @solr_update_args or passed in
+  # query_params added to it
+  def solr_update_url_with_query(query_params)
+    if query_params
+      @solr_update_url + '?' + URI.encode_www_form(query_params)
+    else
+      @solr_update_url
+    end
+  end
+
   # Send the given batch of contexts. If something goes wrong, send
   # them one at a time.
   # @param [Array<Traject::Indexer::Context>] an array of contexts
   def send_batch(batch)
     return if batch.empty?
+
     json_package = JSON.generate(batch.map { |c| c.output_hash })
+
     begin
-      resp = @http_client.post @
+      resp = @http_client.post solr_update_url_with_query(@solr_update_args), json_package, "Content-type" => "application/json"
     rescue StandardError => exception
     end
 
@@ -153,30 +211,55 @@ class Traject::SolrJsonWriter
   def send_single(c)
     json_package = JSON.generate([c.output_hash])
     begin
-      resp = @http_client.post @
-      # Catch Timeouts and network errors as skipped records, but otherwise
-      # allow unexpected errors to propagate up.
-    rescue *skippable_exceptions => exception
-      # no body, local variable exception set above will be used below
-    end
+      resp = @http_client.post solr_update_url_with_query(@solr_update_args), json_package, "Content-type" => "application/json"
 
-
-
-
+      unless resp.status == 200
+        raise BadHttpResponse.new("Unexpected HTTP response status #{resp.status}", resp)
+      end
+
+    # Catch Timeouts and network errors -- as well as non-200 http responses --
+    # as skipped records, but otherwise allow unexpected errors to propagate up.
+    rescue *skippable_exceptions => exception
+      msg = if exception.kind_of?(BadHttpResponse)
+        "Solr error response: #{exception.response.status}: #{exception.response.body}"
       else
-
+        Traject::Util.exception_to_log_message(exception)
       end
+
       logger.error "Could not add record #{c.record_inspect}: #{msg}"
       logger.debug("\t" + exception.backtrace.join("\n\t")) if exception
       logger.debug(c.source_record.to_s) if c.source_record
 
       @skipped_record_incrementer.increment
       if @max_skipped and skipped_record_count > @max_skipped
+        # re-raising in rescue means the last encountered error will be available as #cause
+        # on raised exception, a feature in ruby 2.1+.
         raise MaxSkippedRecordsExceeded.new("#{self.class.name}: Exceeded maximum number of skipped records (#{@max_skipped}): aborting")
       end
-
     end
+  end
 
+  # Very beginning of a delete implementation. POSTs a delete request to solr
+  # for id in arg (value of Solr UniqueID field, usually `id` field).
+  #
+  # Right now, does it inline and immediately, no use of background threads or batching.
+  # This could change.
+  #
+  # Right now, if unsuccessful for any reason, will raise immediately out of here.
+  # Could raise any of the `skippable_exceptions` (timeouts, network errors), an
+  # exception will be raised right out of here.
+  #
+  # Will use `solr_writer.solr_update_args` settings.
+  #
+  # There is no built-in way to direct a record to be deleted from an indexing config
+  # file at the moment, this is just a loose method on the writer.
+  def delete(id)
+    json_package = {delete: id}
+    resp = @http_client.post solr_update_url_with_query(@solr_update_args), JSON.generate(json_package), "Content-type" => "application/json"
+    if resp.status != 200
+      raise RuntimeError.new("Could not delete #{id.inspect}, http response #{resp.status}: #{resp.body}")
+    end
   end
 
@@ -220,14 +303,32 @@ class Traject::SolrJsonWriter
 
 
   # Send a commit
-
+  #
+  # Called automatically by `close_on_commit` setting, but also can be called manually.
+  #
+  # If settings `solr_writer.commit_solr_update_args` is set, will be used by default.
+  # That setting needs `{ commit: true }` or `{softCommit: true}` if you want it to
+  # actually do a commit!
+  #
+  # Optional query_params argument is the actual args to send, you must be sure
+  # to make it include "commit: true" or "softCommit: true" for it to actually commit!
+  # But you may want to include other params too, like optimize etc. query_param
+  # argument replaces setting `solr_writer.commit_solr_update_args`, they are not merged.
+  #
+  # @param [Hash] query_params optional query params to send to solr update. Default {"commit" => "true"}
+  #
+  # @example @writer.commit
+  # @example @writer.commit(softCommit: true)
+  # @example @writer.commit(commit: true, optimize: true, waitFlush: false)
+  def commit(query_params = nil)
+    query_params ||= @commit_solr_update_args || {"commit" => "true"}
     logger.info "#{self.class.name} sending commit to solr at url #{@solr_update_url}..."
 
     original_timeout = @http_client.receive_timeout
 
     @http_client.receive_timeout = (settings["commit_timeout"] || (10 * 60)).to_i
 
-    resp = @http_client.get(
+    resp = @http_client.get(solr_update_url_with_query(query_params))
     unless resp.status == 200
       raise RuntimeError.new("Could not commit to Solr: #{resp.status} #{resp.body}")
     end
@@ -279,10 +380,24 @@ class Traject::SolrJsonWriter
 
   class MaxSkippedRecordsExceeded < RuntimeError ; end
 
+  # Adapted from HTTPClient::BadResponseError.
+  # It's got a #response accessor that will give you the HTTPClient
+  # Response object that had a bad status, although relying on that
+  # would tie you to our HTTPClient implementation that maybe should
+  # be considered an implementation detail, so I dunno.
+  class BadHttpResponse < RuntimeError
+    # HTTP::Message:: a response
+    attr_reader :response
+
+    def initialize(msg, response = nil) # :nodoc:
+      super(msg)
+      @response = response
+    end
+  end
 
   private
 
   def skippable_exceptions
-    @skippable_exceptions ||= (settings["solr_writer.skippable_exceptions"] || [HTTPClient::TimeoutError, SocketError, Errno::ECONNREFUSED])
+    @skippable_exceptions ||= (settings["solr_writer.skippable_exceptions"] || [HTTPClient::TimeoutError, SocketError, Errno::ECONNREFUSED, Traject::SolrJsonWriter::BadHttpResponse])
   end
 end
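A usage sketch of the new writer methods shown above; construction of the writer is elided and the id is a placeholder:

```ruby
writer.delete("some-unique-id")   # immediate, unbatched delete by Solr unique key
writer.commit(softCommit: true)   # manual commit; arg replaces commit_solr_update_args
```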
data/lib/traject/version.rb
CHANGED
@@ -1,3 +1,3 @@
 module Traject
-  VERSION = "3.0.0"
+  VERSION = "3.1.0.rc1"
 end

data/test/indexer/class_level_configuration_test.rb
ADDED
@@ -0,0 +1,104 @@
+require 'test_helper'
+
+describe "Class-level configuration of Indexer sub-class" do
+  # Declaring a class inline in minitest isn't great, this really is a globally
+  # available class now, other tests shouldn't re-use this class name. But it works
+  # for testing for now.
+  class TestIndexerSubclass < Traject::Indexer
+    configure do
+      settings do
+        provide "class_level", "TestIndexerSubclass"
+      end
+
+      to_field "field", literal("value")
+      each_record do |rec, context|
+        context.output_hash["from_each_record"] ||= []
+        context.output_hash["from_each_record"] << "value"
+      end
+    end
+
+    def self.default_settings
+      @default_settings ||= super.merge(
+        "set_by_default_setting_no_override" => "TestIndexerSubclass",
+        "set_by_default_setting" => "TestIndexerSubclass"
+      )
+    end
+  end
+
+  before do
+    @indexer = TestIndexerSubclass.new
+  end
+
+  it "uses class-level configuration" do
+    result = @indexer.map_record(Object.new)
+
+    assert_equal ['value'], result['field']
+    assert_equal ['value'], result['from_each_record']
+  end
+
+  it "uses class-level configuration and instance-level configuration" do
+    @indexer.configure do
+      to_field "field", literal("from-instance-config")
+      to_field "instance_field", literal("from-instance-config")
+    end
+
+    result = @indexer.map_record(Object.new)
+    assert_equal ['value', 'from-instance-config'], result['field']
+    assert_equal ['from-instance-config'], result["instance_field"]
+  end
+
+  describe "with multi-level subclass" do
+    class TestIndexerSubclassSubclass < TestIndexerSubclass
+      configure do
+        settings do
+          provide "class_level", "TestIndexerSubclassSubclass"
+        end
+
+        to_field "field", literal("from-sub-subclass")
+        to_field "subclass_field", literal("from-sub-subclass")
+      end
+
+      def self.default_settings
+        @default_settings ||= super.merge(
+          "set_by_default_setting" => "TestIndexerSubclassSubclass"
+        )
+      end
+    end
+
+    before do
+      @indexer = TestIndexerSubclassSubclass.new
+    end
+
+    it "lets subclass override settings 'provide'" do
+      skip("This would be nice but is currently architecturally hard")
+      assert_equal "TestIndexerSubclassSubclass", @indexer.settings["class_level"]
+    end
+
+    it "lets subclass override default settings" do
+      assert_equal "TestIndexerSubclassSubclass", @indexer.settings["set_by_default_setting"]
+      assert_equal "TestIndexerSubclass", @indexer.settings["set_by_default_setting_no_override"]
+    end
+
+    it "uses configuration from all inheritance" do
+      result = @indexer.map_record(Object.new)
+
+      assert_equal ['value', 'from-sub-subclass'], result['field']
+      assert_equal ['value'], result['from_each_record']
+      assert_equal ['from-sub-subclass'], result['subclass_field']
+    end
+
+    it "uses configuration from all inheritance plus instance" do
+      @indexer.configure do
+        to_field "field", literal("from-instance")
+        to_field "instance_field", literal("from-instance")
+      end
+
+      result = @indexer.map_record(Object.new)
+
+      assert_equal ['value', 'from-sub-subclass', 'from-instance'], result['field']
+      assert_equal ['from-instance'], result['instance_field']
+    end
+  end
+
+end
data/test/indexer/context_test.rb
CHANGED
@@ -38,8 +38,71 @@ describe "Traject::Indexer::Context" do
 
     assert_equal "<record ##{@position} (#{@input_name} ##{@position_in_input}), source_id:#{@record_001} output_id:output_id>", @context.record_inspect
   end
-
 end
 
+  describe "#add_output" do
+    before do
+      @context = Traject::Indexer::Context.new
+    end
+
+    it "adds one value to nil" do
+      @context.add_output(:key, "value")
+      assert_equal @context.output_hash, { "key" => ["value"] }
+    end
+
+    it "adds multiple values to nil" do
+      @context.add_output(:key, "value1", "value2")
+      assert_equal @context.output_hash, { "key" => ["value1", "value2"] }
+    end
+
+    it "adds one value to existing accumulator" do
+      @context.output_hash["key"] = ["value1"]
+      @context.add_output(:key, "value2")
+      assert_equal @context.output_hash, { "key" => ["value1", "value2"] }
+    end
+
+    it "uniqs by default" do
+      @context.output_hash["key"] = ["value1"]
+      @context.add_output(:key, "value1")
+      assert_equal @context.output_hash, { "key" => ["value1"] }
+    end
+
+    it "does not unique if allow_duplicate_values" do
+      @context.settings = { Traject::Indexer::ToFieldStep::ALLOW_DUPLICATE_VALUES => true }
+      @context.output_hash["key"] = ["value1"]
+
+      @context.add_output(:key, "value1")
+      assert_equal @context.output_hash, { "key" => ["value1", "value1"] }
+    end
+
+    it "ignores nil values by default" do
+      @context.add_output(:key, "value1", nil, "value2")
+      assert_equal @context.output_hash, { "key" => ["value1", "value2"] }
+    end
+
+    it "allows nil values if allow_nil_values" do
+      @context.settings = { Traject::Indexer::ToFieldStep::ALLOW_NIL_VALUES => true }
 
+      @context.add_output(:key, "value1", nil, "value2")
+      assert_equal @context.output_hash, { "key" => ["value1", nil, "value2"] }
+    end
+
+    it "ignores empty array by default" do
+      @context.add_output(:key)
+      @context.add_output(:key, nil)
+
+      assert_nil @context.output_hash["key"]
+    end
+
+    it "allows empty field if allow_empty_fields" do
+      @context.settings = { Traject::Indexer::ToFieldStep::ALLOW_EMPTY_FIELDS => true }
+
+      @context.add_output(:key, nil)
+      assert_equal @context.output_hash, { "key" => [] }
+    end
+
+    it "can add to multiple fields" do
+      @context.add_output(["field1", "field2"], "value1", "value2")
+      assert_equal @context.output_hash, { "field1" => ["value1", "value2"], "field2" => ["value1", "value2"] }
+    end
+  end
 end
data/test/indexer/error_handler_test.rb
CHANGED
@@ -56,4 +56,22 @@ describe 'Custom mapping error handler'
 
     assert_nil indexer.map_record({})
   end
+
+  it "uses logger from settings" do
+    desired_logger = Logger.new("/dev/null")
+    set_logger = nil
+    indexer.configure do
+      settings do
+        provide "logger", desired_logger
+        provide "mapping_rescue", -> (ctx, e) {
+          set_logger = ctx.logger
+        }
+      end
+      to_field 'id' do |_context , _exception|
+        raise 'this was always going to fail'
+      end
+    end
+    indexer.map_record({})
+    assert_equal desired_logger.object_id, set_logger.object_id
+  end
 end
data/test/nokogiri_reader_test.rb
CHANGED
@@ -1,6 +1,12 @@
 require 'test_helper'
 require 'traject/nokogiri_reader'
 
+# Note that JRuby Nokogiri can treat namespaces differently than MRI nokogiri.
+# Particularly when we extract elements from a larger document with `each_record_xpath`,
+# and put them in their own document, in JRuby nokogiri the xmlns declarations
+# can end up on different elements than expected, although the document should
+# be semantically equivalent to an XML-namespace-aware processor. See:
+# https://github.com/sparklemotion/nokogiri/issues/1875
 describe "Traject::NokogiriReader" do
   describe "with namespaces" do
     before do
@@ -80,8 +86,22 @@ describe "Traject::NokogiriReader" do
       assert yielded_records.length > 0
 
       expected_namespaces = {"xmlns"=>"http://example.org/top", "xmlns:a"=>"http://example.org/a", "xmlns:b"=>"http://example.org/b"}
-
-
+
+      if !Traject::Util.is_jruby?
+        yielded_records.each do |rec|
+          assert_equal expected_namespaces, rec.namespaces
+        end
+      else
+        # jruby nokogiri shuffles things around, all we can really do is test that the namespaces
+        # are somewhere in the doc :( We rely on other tests to test semantic equivalence.
+        yielded_records.each do |rec|
+          assert_equal expected_namespaces, rec.collect_namespaces
+        end
+
+        whole_doc = Nokogiri::XML.parse(File.open(support_file_path("namespace-test.xml")))
+        whole_doc.xpath("//mytop:record", mytop: "http://example.org/top").each_with_index do |original_el, i|
+          assert ns_semantic_equivalent_xml?(original_el, yielded_records[i])
+        end
       end
     end
   end
@@ -139,7 +159,40 @@ describe "Traject::NokogiriReader" do
 
     assert_length manually_extracted.size, yielded_records
     assert yielded_records.all? {|r| r.kind_of? Nokogiri::XML::Document }
-
+
+    expected_xml = manually_extracted
+    actual_xml = yielded_records.collect(&:root)
+
+    expected_xml.size.times do |i|
+      if !Traject::Util.is_jruby?
+        assert_equal expected_xml[i-1].to_xml, actual_xml[i-1].to_xml
+      else
+        # jruby shuffles the xmlns declarations around, but they should
+        # be semantically equivalent to a namespace-aware processor
+        assert ns_semantic_equivalent_xml?(expected_xml[i-1], actual_xml[i-1])
+      end
+    end
+  end
+
+  # Jruby nokogiri can shuffle around where the `xmlns:ns` declarations appear, although it
+  # _ought_ not to be semantically different for a namespace-aware parser -- nodes are still in
+  # same namespaces. JRuby may differ from what MRI does with same code, and may differ from
+  # the way an element appeared in input when extracting records from a larger input doc.
+  # There isn't much we can do about this, but we can write a recursive method
+  # that hopefully compares XML to make sure it really is semantically equivalent to
+  # a namespace-aware parser, and hope we got that right.
+  def ns_semantic_equivalent_xml?(noko_a, noko_b)
+    noko_a = noko_a.root if noko_a.kind_of?(Nokogiri::XML::Document)
+    noko_b = noko_b.root if noko_b.kind_of?(Nokogiri::XML::Document)
+
+    noko_a.name == noko_b.name &&
+      noko_a.namespace&.prefix == noko_b.namespace&.prefix &&
+      noko_a.namespace&.href == noko_b.namespace&.href &&
+      noko_a.attributes == noko_b.attributes &&
+      noko_a.children.length == noko_b.children.length &&
+      noko_a.children.each_with_index.all? do |a_child, index|
+        ns_semantic_equivalent_xml?(a_child, noko_b.children[index])
+      end
   end
 
   describe "without each_record_xpath" do
data/test/solr_json_writer_test.rb
CHANGED
@@ -137,6 +137,26 @@ describe "Traject::SolrJsonWriter" do
     assert_length 1, JSON.parse(post_args[1][1]), "second batch posted with last remaining doc"
   end
 
+  it "retries batch as individual records on failure" do
+    @writer = create_writer("solr_writer.batch_size" => 2, "solr_writer.max_skipped" => 10)
+    @fake_http_client.response_status = 500
+
+    2.times do |i|
+      @writer.put context_with({"id" => "doc_#{i}", "key" => "value"})
+    end
+    @writer.close
+
+    # 1 batch, then 2 for re-trying each individually
+    assert_length 3, @fake_http_client.post_args
+
+    batch_update = @fake_http_client.post_args.first
+    assert_length 2, JSON.parse(batch_update[1])
+
+    individual_update1, individual_update2 = @fake_http_client.post_args[1], @fake_http_client.post_args[2]
+    assert_length 1, JSON.parse(individual_update1[1])
+    assert_length 1, JSON.parse(individual_update2[1])
+  end
+
   it "can #flush" do
     2.times do |i|
       doc = {"id" => "doc_#{i}", "key" => "value"}
@@ -150,15 +170,116 @@ describe "Traject::SolrJsonWriter" do
     assert_length 1, @fake_http_client.post_args, "Has flushed to solr"
   end
 
-
-
-
-
+  describe "commit" do
+    it "commits on close when set" do
+      @writer = create_writer("solr.url" => "http://example.com", "solr_writer.commit_on_close" => "true")
+      @writer.put context_with({"id" => "one", "key" => ["value1", "value2"]})
+      @writer.close
+
+      last_solr_get = @fake_http_client.get_args.last
+
+      assert_equal "http://example.com/update/json?commit=true", last_solr_get[0]
+    end
+
+    it "commits on close with commit_solr_update_args" do
+      @writer = create_writer(
+        "solr.url" => "http://example.com",
+        "solr_writer.commit_on_close" => "true",
+        "solr_writer.commit_solr_update_args" => { softCommit: true }
+      )
+      @writer.put context_with({"id" => "one", "key" => ["value1", "value2"]})
+      @writer.close
+
+      last_solr_get = @fake_http_client.get_args.last
+
+      assert_equal "http://example.com/update/json?softCommit=true", last_solr_get[0]
+    end
 
-
+    it "can manually send commit" do
+      @writer = create_writer("solr.url" => "http://example.com")
+      @writer.commit
+
+      last_solr_get = @fake_http_client.get_args.last
+      assert_equal "http://example.com/update/json?commit=true", last_solr_get[0]
+    end
+
+    it "can manually send commit with specified args" do
+      @writer = create_writer("solr.url" => "http://example.com", "solr_writer.commit_solr_update_args" => { softCommit: true })
+      @writer.commit(commit: true, optimize: true, waitFlush: false)
+      last_solr_get = @fake_http_client.get_args.last
+      assert_equal "http://example.com/update/json?commit=true&optimize=true&waitFlush=false", last_solr_get[0]
+    end
+
+    it "uses commit_solr_update_args settings by default" do
+      @writer = create_writer(
+        "solr.url" => "http://example.com",
+        "solr_writer.commit_solr_update_args" => { softCommit: true }
+      )
+      @writer.commit
+
+      last_solr_get = @fake_http_client.get_args.last
+      assert_equal "http://example.com/update/json?softCommit=true", last_solr_get[0]
+    end
+
+    it "overrides commit_solr_update_args with method arg" do
+      @writer = create_writer(
+        "solr.url" => "http://example.com",
+        "solr_writer.commit_solr_update_args" => { softCommit: true, foo: "bar" }
+      )
+      @writer.commit(commit: true)
 
-
-
+      last_solr_get = @fake_http_client.get_args.last
+      assert_equal "http://example.com/update/json?commit=true", last_solr_get[0]
+    end
+  end
+
+  describe "solr_writer.solr_update_args" do
+    before do
+      @writer = create_writer("solr_writer.solr_update_args" => { softCommit: true } )
+    end
+
+    it "sends update args" do
+      @writer.put context_with({"id" => "one", "key" => ["value1", "value2"]})
+      @writer.close
+
+      assert_equal 1, @fake_http_client.post_args.count
+
+      post_args = @fake_http_client.post_args.first
+
+      assert_equal "http://example.com/solr/update/json?softCommit=true", post_args[0]
+    end
+
+    it "sends update args with delete" do
+      @writer.delete("test-id")
+      @writer.close
+
+      assert_equal 1, @fake_http_client.post_args.count
+
+      post_args = @fake_http_client.post_args.first
+
+      assert_equal "http://example.com/solr/update/json?softCommit=true", post_args[0]
+    end
+
+    it "sends update args on individual-retry after batch failure" do
+      @writer = create_writer(
+        "solr_writer.batch_size" => 2,
+        "solr_writer.max_skipped" => 10,
+        "solr_writer.solr_update_args" => { softCommit: true }
+      )
+      @fake_http_client.response_status = 500
+
+      2.times do |i|
+        @writer.put context_with({"id" => "doc_#{i}", "key" => "value"})
+      end
+      @writer.close
+
+      # 1 batch, then 2 for re-trying each individually
+      assert_length 3, @fake_http_client.post_args
+
+      individual_update1, individual_update2 = @fake_http_client.post_args[1], @fake_http_client.post_args[2]
+      assert_equal "http://example.com/solr/update/json?softCommit=true", individual_update1[0]
+      assert_equal "http://example.com/solr/update/json?softCommit=true", individual_update2[0]
+    end
   end
 
   describe "skipped records" do
@@ -225,6 +346,23 @@ describe "Traject::SolrJsonWriter" do
       logged = strio.string
       assert_includes logged, 'ArgumentError: bad stuff'
     end
+  end
+
+  describe "#delete" do
+    it "deletes" do
+      id = "123456"
+      @writer.delete(id)
+
+      post_args = @fake_http_client.post_args.first
+      assert_equal "http://example.com/solr/update/json", post_args[0]
+      assert_equal JSON.generate({"delete" => id}), post_args[1]
+    end
 
+    it "raises on non-200 http response" do
+      @fake_http_client.response_status = 500
+      assert_raises(RuntimeError) do
+        @writer.delete("12345")
+      end
+    end
   end
 end
data/traject.gemspec
CHANGED
@@ -31,9 +31,9 @@ Gem::Specification.new do |spec|
   spec.add_dependency "httpclient", "~> 2.5"
   spec.add_dependency "http", "~> 3.0" # used in oai_pmh_reader, may use more extensively in future instead of httpclient
   spec.add_dependency 'marc-fastxmlwriter', '~>1.0' # fast marc->xml
-  spec.add_dependency "nokogiri", "~> 1.
+  spec.add_dependency "nokogiri", "~> 1.9" # NokogiriIndexer
 
-  spec.add_development_dependency
+  spec.add_development_dependency 'bundler', '>= 1.7', '< 3'
 
   spec.add_development_dependency "rake"
   spec.add_development_dependency "minitest"
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: traject
 version: !ruby/object:Gem::Version
-  version: 3.0.
+  version: 3.1.0.rc1
 platform: ruby
 authors:
 - Jonathan Rochkind
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2019-04-10 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: concurrent-ruby
@@ -149,28 +149,34 @@ dependencies:
   requirements:
   - - "~>"
     - !ruby/object:Gem::Version
-      version: '1.
+      version: '1.9'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.
+        version: '1.9'
 - !ruby/object:Gem::Dependency
   name: bundler
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - ">="
       - !ruby/object:Gem::Version
         version: '1.7'
+    - - "<"
+      - !ruby/object:Gem::Version
+        version: '3'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - ">="
       - !ruby/object:Gem::Version
         version: '1.7'
+    - - "<"
+      - !ruby/object:Gem::Version
+        version: '3'
 - !ruby/object:Gem::Dependency
   name: rake
   requirement: !ruby/object:Gem::Requirement
@@ -292,6 +298,7 @@ files:
 - test/debug_writer_test.rb
 - test/delimited_writer_test.rb
 - test/experimental_nokogiri_streaming_reader_test.rb
+- test/indexer/class_level_configuration_test.rb
 - test/indexer/context_test.rb
 - test/indexer/each_record_test.rb
 - test/indexer/error_handler_test.rb
@@ -381,12 +388,12 @@ required_ruby_version: !ruby/object:Gem::Requirement
     version: '0'
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
-  - - "
+  - - ">"
     - !ruby/object:Gem::Version
-      version:
+      version: 1.3.1
 requirements: []
 rubyforge_project:
-rubygems_version: 2.7.
+rubygems_version: 2.7.6
 signing_key:
 specification_version: 4
 summary: An easy to use, high-performance, flexible and extensible metadata transformation
@@ -395,6 +402,7 @@ test_files:
 - test/debug_writer_test.rb
 - test/delimited_writer_test.rb
 - test/experimental_nokogiri_streaming_reader_test.rb
+- test/indexer/class_level_configuration_test.rb
 - test/indexer/context_test.rb
 - test/indexer/each_record_test.rb
 - test/indexer/error_handler_test.rb