traject 3.0.0 → 3.4.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: cf92e5467d32d37b681a36ae1ffbd2995bbf3e0def938b13d74831a939b68632
- data.tar.gz: 7c4693ded4a9a8b0e9c599e7489aaefdf9806dfffce6b20ae6054def9ba8c156
+ metadata.gz: c30572335810dc620f9a169df6f8f374512d3c472ea34bc03068106959fd1463
+ data.tar.gz: 3181c37e41e80416487d730e1983bc647daf480a3a308db30e294c7587adc644
  SHA512:
- metadata.gz: 9e12113a6f53aa9c7629c072df80b1e347f432d069bd30dbb35d73373fccc3fa341682b281a65c778aa2a3eae9fb7b2d52c81c2f39aa17d348074ecb8b9c2512
- data.tar.gz: 6f2294bce5deb181a20db0977f8ab7e73e8e1cda6e86d8ee562fabd7a8cce2c683011be8f3955ccafd0165787dbaf774e7c3571220f5d6e797eaf6fe8a02577d
+ metadata.gz: 83b73a10113e75106a0fb7af9bec79802d2e3f5c8f3e07742f33a52642a9441c20769072f8ea5bd532011b7d172db6ca007121d6874f705008e1a5a511ca1ff8
+ data.tar.gz: d9c53588e8adbd76764c20012baf702591276d84c2cd64ed0bb0d5b742699607a2287d30a2fb4f70c1bf6f8a7338d716d989b208c8983c2a87eeddbd6d96dd3d
@@ -6,13 +6,12 @@ sudo: true
  rvm:
  - 2.4.4
  - 2.5.1
- - "2.6.0-preview2"
+ - 2.6.1
+ - 2.7.0
  # avoid having travis install jdk on MRI builds where we don't need it.
  matrix:
  include:
  - jdk: openjdk8
  rvm: jruby-9.1.17.0
  - jdk: openjdk8
- rvm: jruby-9.2.0.0
- allow_failures:
- - rvm: "2.6.0-preview2"
+ rvm: jruby-9.2.6.0
data/CHANGES.md CHANGED
@@ -1,5 +1,70 @@
  # Changes
 
+ ## Next
+
+ *
+
+ *
+
+ ## 3.4.0
+
+ * XML-mode `extract_xpath` now supports extracting attribute values with xpath @attr syntax.
+
+ ## 3.3.0
+
+ * `Traject::Macros::Marc21Semantics.publication_date` now gets the date from 264 before 260. https://github.com/traject/traject/pull/233
+
+ * Allow hashie 4.x in gemspec. https://github.com/traject/traject/pull/234
+
+ * Allow `http` gem 4.x versions. https://github.com/traject/traject/pull/236
+
+ * Can now call class-level Indexer.configure multiple times. https://github.com/sciencehistory/scihist_digicoll/pull/525
+
+ ## 3.2.0
+
+ * NokogiriReader has a "nokogiri.strict_mode" setting. Set to true or string 'true' to ask Nokogiri to parse in strict mode, so it will immediately raise on ill-formed XML, instead of Nokogiri's default of doing what it can with it. https://github.com/traject/traject/pull/226
+
+ * SolrJsonWriter
+
+ * Utility method `delete_all!` sends a delete-all query to the Solr URL endpoint. https://github.com/traject/traject/pull/227
+
+ * Allow basic auth configuration of the default http client via `solr_writer.basic_auth_user` and `solr_writer.basic_auth_password`. https://github.com/traject/traject/pull/231
+
+
+ ## 3.1.0
+
+ ### Added
+
+ * Context#add_output is added, convenient for custom ruby code.
+
+ each_record do |record, context|
+ context.add_output "key", something_from(record)
+ end
+
+ https://github.com/traject/traject/pull/220
+
+ * SolrJsonWriter
+
+ * Class-level indexer configuration, for custom indexer subclasses, is now available with the class-level `configure` method. Warning: Indexers are still expensive to instantiate. https://github.com/traject/traject/pull/213
+
+ * SolrJsonWriter has new settings to control commit semantics: `solr_writer.solr_update_args` and `solr_writer.commit_solr_update_args`, both with hash values that are Solr update handler query params. https://github.com/traject/traject/pull/215
+
+ * SolrJsonWriter has a `delete(solr-unique-key)` method. Does not currently use any batching or threading. https://github.com/traject/traject/pull/214
+
+ * SolrJsonWriter, when MaxSkippedRecordsExceeded is raised, it will have a #cause that is the last error that resulted in MaxSkippedRecordsExceeded. Some error reporting systems, including Rails, will automatically log #cause, so that's helpful. https://github.com/traject/traject/pull/216
+
+ * SolrJsonWriter now respects a `solr_writer.http_timeout` setting, in seconds, to be passed to the HTTPClient instance. https://github.com/traject/traject/pull/219
+
+ * Only runs thread pool shutdown code (and logging) if there is a `solr_writer.batch_size` greater than 0. Keeps it out of the logs if it was a no-op anyway.
+
+ * Logs at DEBUG level every time it sends an update request to solr
+
+ * Nokogiri dependency for the NokogiriReader increased to `~> 1.9`. When using JRuby `each_record_xpath`, resulting yielded documents may have xmlns declarations on different nodes than in MRI (and previous versions of nokogiri), but we could find no way around this with nokogiri >= 1.9.0. The documents should still be semantically equivalent for namespace use. This was necessary to keep JRuby Nokogiri XML working with recent Nokogiri releases. https://github.com/traject/traject/pull/209
+
+ * LineWriter guesses better about when to auto-close, and provides an optional explicit setting in case it guesses wrong. (thanks @justinlittman) https://github.com/traject/traject/pull/211
+
+ * Traject::Indexer will now use a Logger(-compatible) instance passed in via setting 'logger' https://github.com/traject/traject/pull/217
+
  ## 3.0.0
 
  ### Changed/Backwards Incompatibilities
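[Editor's note] The 3.2.0 "nokogiri.strict_mode" setting described above would be enabled in a traject configuration file roughly like this (a sketch, not taken from the diff; requires the traject gem and a NokogiriReader-based indexer):

```ruby
# traject configuration file fragment -- illustrative only
settings do
  provide "nokogiri.strict_mode", true  # raise immediately on ill-formed XML
end
```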
data/README.md CHANGED
@@ -19,7 +19,7 @@ Initially by Jonathan Rochkind (Johns Hopkins Libraries) and Bill Dueber (Univer
  * Basic configuration files can be easily written even by non-rubyists, with a few simple directives traject provides. But config files are 'ruby all the way down', so we can provide a gradual slope to more complex needs, with the full power of ruby.
  * Easy to program, easy to read, easy to modify.
  * Fast. Traject by default indexes using multiple threads, on multiple cpu cores, when the underlying ruby implementation (i.e., JRuby) allows it, and can use a separate thread for communication with solr even under MRI. Traject is intended to be usable to process millions of records.
- * Composed of decoupled components, for flexibility and extensibility.
+ * Composed of decoupled components, for flexibility and extensibility.
  * Designed to support local code and configuration that's maintainable and testable, and can be shared between projects as ruby gems.
  * Easy to split configuration between multiple files, for simple "pick-and-choose" command line options that can combine to deal with any of your local needs.
 
@@ -135,7 +135,7 @@ For the syntax and complete possibilities of the specification string argument t
 
  To see all options for `extract_marc`, see the [extract_marc](http://rdoc.info/gems/traject/Traject/Macros/Marc21:extract_marc) method documentation.
 
- ### XML mode, extract_xml
+ ### XML mode, extract_xpath
 
  See our [xml guide](./doc/xml.md) for more XML examples, but you will usually use extract_xpath.
 
@@ -175,6 +175,8 @@ TranslationMap use above is just one example of a transformation macro, that tra
  * `append("--after each value")`
  * `gsub(/regex/, "replacement")`
  * `split(" ")`: take values and split them, possibly resulting in multiple values.
+ * `transform(proc)`: transform each existing value with a proc, kind of like `map`.
+ eg `to_field "something", extract_xpath("//author"), transform( ->(author) { "#{author.last}, #{author.first}" })`
 
  You can add as many transformation macros as you want; they will be applied to the output in order.
 
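[Editor's note] Outside of the traject DSL, the `transform(proc)` macro added above is essentially a per-value `map` over the accumulator. A plain-ruby sketch of that behavior (illustrative names only, not traject's implementation):

```ruby
# Each extracted value is passed through the proc, and the result replaces it.
to_last_first = ->(author) { author.split(" ").reverse.join(", ") }

values = ["Jane Doe", "John Smith"]
transformed = values.map(&to_last_first)
# transformed => ["Doe, Jane", "Smith, John"]
```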
@@ -311,12 +313,15 @@ like `to_field`, is executed for every record, but without being tied
  to a specific output field.
 
  `each_record` can be used for logging or notifying, computing intermediate
- results, or writing to more than one field at once.
+ results, or more complex ruby logic.
 
  ~~~ruby
  each_record do |record|
  some_custom_logging(record)
  end
+ each_record do |record, context|
+ context.add_output(:some_value, extract_some_value_from_record(record))
+ end
  ~~~
 
  For more on `each_record`, see [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md).
@@ -405,7 +410,7 @@ writer class in question.
 
  ## The traject command line
 
- (If you are interested in running traject in an embedded/programmatic context instead of as a standalone command-line batch process, please see the docs on [Programmatic Use](./docs/programmatic_use.md).)
+ (If you are interested in running traject in an embedded/programmatic context instead of as a standalone command-line batch process, please see the docs on [Programmatic Use](./doc/programmatic_use.md).)
 
  The simplest invocation is:
 
@@ -247,13 +247,12 @@ each_record do |record, context|
  end
 
  each_record do |record, context|
- (val1, val2) = calculate_two_things_from(record)
+ if eligible_for_things?(record)
+ (val1, val2) = calculate_two_things_from(record)
 
- context.output_hash["first_field"] ||= []
- context.output_hash["first_field"] << val1
-
- context.output_hash["second_field"] ||= []
- context.output_hash["second_field"] << val2
+ context.add_output("first_field", val1)
+ context.add_output("second_field", val2)
+ end
  end
  ~~~
 
@@ -48,6 +48,30 @@ indexer = Traject::Indexer.new(settings) do
  end
  ```
 
+ ### Configuring indexer subclasses
+
+ Indexing step configuration is historically done in traject at the indexer _instance_ level, either programmatically or by applying a "configuration file" to an indexer instance.
+
+ But you can also define your own indexer sub-class with indexing steps built in, using the class-level `configure` method.
+
+ This is an EXPERIMENTAL feature; implementation may change. https://github.com/traject/traject/pull/213
+
+ ```ruby
+ class MyIndexer < Traject::Indexer
+ configure do
+ settings do
+ provide "solr.url", Rails.application.config.my_solr_url
+ end
+
+ to_field "our_name", literal("University of Whatever")
+ end
+ end
+ ```
+
+ These settings and indexing steps are now "hard-coded" into that subclass. You can still provide additional configuration at the instance level, as normal. You can also make a subclass of that `MyIndexer` class, which will inherit configuration from MyIndexer, and can supply its own additional class-level configuration too.
+
+ Note that due to how the implementation is done, instantiating an indexer is still _relatively_ expensive (class-level configuration is only actually executed on instantiation). You will still get better performance by re-using a global instance of your indexer subclass, instead of, say, instantiating one per object to be indexed.
+
  ## Running the indexer
 
  ### process: probably not what you want
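[Editor's note] The class-level configure behavior described above -- blocks stored on the class and replayed, superclass first, on each instantiation -- can be sketched self-contained in plain ruby. Class and method names here are illustrative, not the actual traject code:

```ruby
class ConfigurableBase
  # Store configure blocks on the class; they run against each new instance.
  def self.configure(&block)
    (@configure_blocks ||= []) << block
  end

  def self.apply_configure_blocks(instance)
    # Replay superclass blocks first, so subclasses layer on top.
    superclass.apply_configure_blocks(instance) if superclass.respond_to?(:apply_configure_blocks)
    (@configure_blocks || []).each { |block| instance.instance_eval(&block) }
  end

  attr_reader :steps

  def initialize
    @steps = []
    self.class.apply_configure_blocks(self)
  end

  def to_field(name)
    @steps << name
  end
end

class ParentIndexer < ConfigurableBase
  configure { to_field "our_name" }
end

class ChildIndexer < ParentIndexer
  configure { to_field "child_field" }
end

ChildIndexer.new.steps # => ["our_name", "child_field"]
```

Note the cost mentioned above is visible here: the blocks run on every `new`, not once at class-definition time.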
@@ -157,7 +181,7 @@ You may want to consider instead creating one or more configured "global" indexe
 
  * Readers, and the Indexer#process method, are not thread-safe. That is why using Indexer#process, which uses a fixed reader, is not thread-safe, and why when sharing a global indexer we want to use `process_record`, `map_record`, or `process_with` as above.
 
- It ought to be safe to use a global Indexer concurrently in several threads, with the `map_record`, `process_record` or `process_with` methods -- so long as your indexing rules and writers are thread-safe, as they usually will be and always ought to be.
+ It ought to be safe to use a global Indexer concurrently in several threads, with the `map_record`, `process_record` or `process_with` methods -- so long as your indexing rules and writers are thread-safe, as they usually will be and always ought to be.
 
  ### An example
@@ -93,6 +93,8 @@ settings are applied first of all. It's recommended you use `provide`.
 
  * `solr_writer.thread_pool`: defaults to 1 (single bg thread). A thread pool is used for submitting docs to solr. Set to 0 or nil to disable threading. Set to 1, there will still be a single bg thread doing the adds. May make sense to set higher than the number of cores on your indexing machine, as these threads will mostly be waiting on Solr. Speed/capacity of your solr might be more relevant. Note that processing_thread_pool threads can end up submitting to solr too, if solr_json_writer.thread_pool is full.
 
+ * `solr_writer.basic_auth_user`, `solr_writer.basic_auth_password`: Not set by default, but when both are set the default writer is configured with basic auth.
+
 
  ### Dealing with MARC data
 
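[Editor's note] A configuration-file sketch of the basic auth settings above (URL and credential values are placeholders, not from the diff):

```ruby
# traject configuration file fragment -- illustrative only
settings do
  provide "solr.url", "http://localhost:8983/solr/mycollection"
  provide "solr_writer.basic_auth_user", "indexer"
  provide "solr_writer.basic_auth_password", ENV["SOLR_PASSWORD"]
end
```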
@@ -119,6 +121,8 @@ settings are applied first of all. It's recommended you use `provide`.
 
  * `log.batch_size.severity`: If `log.batch_size` is set, what logger severity level to log at. Default "INFO"; set to "DEBUG" etc. if desired.
 
+ * `logger`: Ignore all the other logger settings and just pass in a `Logger`-compatible logger instance directly.
+
 
 
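[Editor's note] The `logger` setting above takes a pre-built instance; a minimal configuration-file sketch using ruby's stdlib Logger:

```ruby
require 'logger'

# traject configuration file fragment -- illustrative only.
# A pre-built logger bypasses the log.file / log.level settings entirely.
settings do
  provide "logger", Logger.new($stderr)
end
```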
data/doc/xml.md CHANGED
@@ -72,6 +72,16 @@ You can use all the standard transformation macros in Traject::Macros::Transforma
  to_field "something", extract_xpath("//value"), first_only, translation_map("some_map"), default("no value")
  ```
 
+ ### selecting attribute values
+
+ Just works, using xpath syntax for selecting an attribute:
+
+
+ ```ruby
+ # gets status value in: <oai:header status="something">
+ to_field "status", extract_xpath("//oai:record/oai:header/@status")
+ ```
+
 
  ### selecting non-text nodes
 
@@ -133,6 +143,8 @@ The NokogiriReader parser should be relatively performant though, allowing you t
 
  (There is a half-finished `ExperimentalStreamingNokogiriReader` available, but it is experimental, half-finished, may disappear or change in backwards compat at any time, problematic, not recommended for production use, etc.)
 
+ Note also that in JRuby, when using `each_record_xpath` with the NokogiriReader, the extracted individual documents may have xmlns declarations in different places than you may expect, although they will still be semantically equivalent for namespace processing. This is due to the Nokogiri JRuby implementation, and we could find no good way to ensure consistent behavior with MRI. See: https://github.com/sparklemotion/nokogiri/issues/1875
+
  ### Jruby
 
  It may be that nokogiri JRuby is just much slower than nokogiri MRI (at least when namespaces are involved?) It may be that our workaround to a [JRuby bug involving namespaces on moving nodes](https://github.com/sparklemotion/nokogiri/issues/1774) doesn't help.
@@ -180,6 +180,7 @@ class Traject::Indexer
  @index_steps = []
  @after_processing_steps = []
 
+ self.class.apply_class_configure_block(self)
  instance_eval(&block) if block
  end
 
@@ -189,6 +190,38 @@ class Traject::Indexer
  instance_eval(&block)
  end
 
+ ## Class-level configure block(s) are accepted too, and applied at instantiation
+ # before instance-level configuration.
+ #
+ # EXPERIMENTAL, implementation may change in ways that affect some uses.
+ # https://github.com/traject/traject/pull/213
+ #
+ # Note that settings set by 'provide' in a subclass can not really be overridden
+ # by 'provide' in a next-level subclass. Use self.default_settings instead, with
+ # a call to super.
+ #
+ # You can call this .configure multiple times; blocks are added to a list, and
+ # will be used to initialize an instance in order.
+ #
+ # The main downside of this workaround implementation is performance: even though
+ # defined at load time on the class level, blocks are all executed on every instantiation.
+ def self.configure(&block)
+ (@class_configure_blocks ||= []) << block
+ end
+
+ def self.apply_class_configure_block(instance)
+ # Make sure we inherit from a superclass that has a class-level ivar @class_configure_blocks
+ if self.superclass.respond_to?(:apply_class_configure_block)
+ self.superclass.apply_class_configure_block(instance)
+ end
+ if @class_configure_blocks && !@class_configure_blocks.empty?
+ @class_configure_blocks.each do |block|
+ instance.configure(&block)
+ end
+ end
+ end
+
+
 
  # Pass a string file path, a Pathname, or a File object, for
  # a config file to load into indexer.
@@ -258,10 +291,9 @@ class Traject::Indexer
  "log.batch_size.severity" => "info",
 
  # how to post-process the accumulator
- "allow_nil_values" => false,
- "allow_duplicate_values" => true,
-
- "allow_empty_fields" => false
+ Traject::Indexer::ToFieldStep::ALLOW_NIL_VALUES => false,
+ Traject::Indexer::ToFieldStep::ALLOW_DUPLICATE_VALUES => true,
+ Traject::Indexer::ToFieldStep::ALLOW_EMPTY_FIELDS => false
  }.freeze
  end
 
@@ -349,6 +381,10 @@ class Traject::Indexer
 
  # Create logger according to settings
  def create_logger
+ if settings["logger"]
+ # none of the other settings matter, we just got a logger
+ return settings["logger"]
+ end
 
  logger_level = settings["log.level"] || "info"
 
@@ -82,6 +82,51 @@ class Traject::Indexer
  str
  end
 
+ # Add values to an array in context.output_hash with the specified key/field_name(s).
+ # Creates the array in output_hash if currently nil.
+ #
+ # Post-processing/filtering:
+ #
+ # * uniqs accumulator, unless settings["allow_duplicate_values"] is set.
+ # * Removes nil values unless settings["allow_nil_values"] is set.
+ # * Will not add an empty array to output_hash (will leave it nil instead)
+ # unless settings["allow_empty_fields"] is set.
+ #
+ # Multiple values can be added with multiple arguments (we avoid an array argument meaning
+ # multiple values, to accommodate odd use cases where an array itself is desired as an output_hash value)
+ #
+ # @param field_name [String,Symbol,Array<String>,Array<Symbol>] A key to set in output_hash, or
+ # an array of such keys.
+ #
+ # @example add one value
+ # context.add_output(:additional_title, "a title")
+ #
+ # @example add multiple values as multiple params
+ # context.add_output("additional_title", "a title", "another title")
+ #
+ # @example add multiple values as multiple params from an array using the ruby splat operator
+ # context.add_output(:some_key, *array_of_values)
+ #
+ # @example add to multiple keys in output hash
+ # context.add_output(["key1", "key2"], "value")
+ #
+ # @return [Traject::Context] self
+ #
+ # Note: for historical reasons the relevant settings key *names* are in constants in Traject::Indexer::ToFieldStep,
+ # but the settings don't just apply to ToFieldSteps
+ def add_output(field_name, *values)
+ values.compact! unless self.settings && self.settings[Traject::Indexer::ToFieldStep::ALLOW_NIL_VALUES]
+
+ return self if values.empty? && !(self.settings && self.settings[Traject::Indexer::ToFieldStep::ALLOW_EMPTY_FIELDS])
+
+ Array(field_name).each do |key|
+ accumulator = (self.output_hash[key.to_s] ||= [])
+ accumulator.concat values
+ accumulator.uniq! unless self.settings && self.settings[Traject::Indexer::ToFieldStep::ALLOW_DUPLICATE_VALUES]
+ end
+
+ return self
+ end
  end
 
 
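[Editor's note] The post-processing rules documented above (compact nils, skip empty, uniq, multiple keys) can be sketched as a self-contained plain-ruby function; this is an illustration of the semantics, not the real `Traject::Context#add_output`:

```ruby
# Keyword options stand in for the allow_* settings consulted by the real method.
def add_output(output_hash, field_name, *values,
               allow_nil: false, allow_empty: false, allow_duplicates: false)
  values = values.compact unless allow_nil
  return output_hash if values.empty? && !allow_empty

  Array(field_name).each do |key|
    accumulator = (output_hash[key.to_s] ||= [])
    accumulator.concat(values)
    accumulator.uniq! unless allow_duplicates
  end
  output_hash
end

h = {}
add_output(h, :title, "a title", nil, "a title")
add_output(h, ["key1", "key2"], "shared value")
# h => {"title" => ["a title"], "key1" => ["shared value"], "key2" => ["shared value"]}
```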
@@ -145,24 +145,20 @@ class Traject::Indexer
  return accumulator
  end
 
- # Add the accumulator to the context with the correct field name
- # Do post-processing on the accumulator (remove nil values, allow empty
- # fields, etc)
+
+ # These constants are here for historical/legacy reasons; they really oughta
+ # live in Traject::Context, but in case anyone is referring to them
+ # we'll leave them here for now.
  ALLOW_NIL_VALUES = "allow_nil_values".freeze
  ALLOW_EMPTY_FIELDS = "allow_empty_fields".freeze
  ALLOW_DUPLICATE_VALUES = "allow_duplicate_values".freeze
 
+ # Add the accumulator to the context with the correct field name(s).
+ # Do post-processing on the accumulator (remove nil values, allow empty
+ # fields, etc)
  def add_accumulator_to_context!(accumulator, context)
- accumulator.compact! unless context.settings[ALLOW_NIL_VALUES]
- return if accumulator.empty? and not (context.settings[ALLOW_EMPTY_FIELDS])
-
  # field_name can actually be an array of field names
- Array(field_name).each do |a_field_name|
- context.output_hash[a_field_name] ||= []
-
- existing_accumulator = context.output_hash[a_field_name].concat(accumulator)
- existing_accumulator.uniq! unless context.settings[ALLOW_DUPLICATE_VALUES]
- end
+ context.add_output(field_name, *accumulator)
  end
  end
 
@@ -8,12 +8,35 @@ require 'thread'
  # This does not seem to affect performance much, as far as I could tell
  # benchmarking.
  #
- # Output will be sent to `settings["output_file"]` string path, or else
- # `settings["output_stream"]` (ruby IO object), or else stdout.
- #
  # This class can be sub-classed to write out different serialized
  # representations -- subclasses will just override the #serialize
  # method. For instance, see JsonWriter.
+ #
+ # ## Output
+ #
+ # The main functionality this class provides is logic for choosing, based on
+ # settings, what file or bytestream to send output to.
+ #
+ # You can supply `settings["output_file"]` with a _file path_. LineWriter
+ # will open up a `File` to write to.
+ #
+ # Or you can supply `settings["output_stream"]` with any ruby IO object, such as an
+ # open `File` object or anything else.
+ #
+ # If neither is supplied, output will be written to `$stdout`.
+ #
+ # ## Closing the output stream
+ #
+ # The LineWriter tries to guess whether it should call `close` on the output
+ # stream it's writing to, when the LineWriter instance is closed. For instance,
+ # if you passed in a `settings["output_file"]` with a path, and the LineWriter
+ # opened up a `File` object for you, it should close it for you.
+ #
+ # But for historical reasons, LineWriter doesn't just use that signal; it tries
+ # to guess generally about when to call close. If for some reason it gets it wrong,
+ # just use `settings["close_output_on_close"]` set to `true` or `false`.
+ # (String `"true"` or `"false"` are also acceptable, for convenience in setting
+ # options on the command line)
  class Traject::LineWriter
  attr_reader :settings
  attr_reader :write_mutex, :output_file
@@ -57,7 +80,16 @@ class Traject::LineWriter
  end
 
  def close
- @output_file.close unless (@output_file.nil? || @output_file.tty?)
+ @output_file.close if should_close_stream?
+ end
+
+ def should_close_stream?
+ if settings["close_output_on_close"].nil?
+ !(@output_file.nil? || @output_file.tty? || @output_file == $stdout || @output_file == $stderr)
+ else
+ settings["close_output_on_close"].to_s == "true"
+ end
  end
 
+
  end
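[Editor's note] A configuration-file sketch of the LineWriter output and close-behavior settings described above (the file path is hypothetical):

```ruby
# traject configuration file fragment -- illustrative only
settings do
  provide "output_file", "/tmp/output.ndjson"
  # or: provide "output_stream", some_open_io_object
  # override the auto-close guess if it is wrong for your stream:
  provide "close_output_on_close", "false"
end
```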