traject 3.0.0 → 3.4.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: cf92e5467d32d37b681a36ae1ffbd2995bbf3e0def938b13d74831a939b68632
- data.tar.gz: 7c4693ded4a9a8b0e9c599e7489aaefdf9806dfffce6b20ae6054def9ba8c156
+ metadata.gz: c30572335810dc620f9a169df6f8f374512d3c472ea34bc03068106959fd1463
+ data.tar.gz: 3181c37e41e80416487d730e1983bc647daf480a3a308db30e294c7587adc644
  SHA512:
- metadata.gz: 9e12113a6f53aa9c7629c072df80b1e347f432d069bd30dbb35d73373fccc3fa341682b281a65c778aa2a3eae9fb7b2d52c81c2f39aa17d348074ecb8b9c2512
- data.tar.gz: 6f2294bce5deb181a20db0977f8ab7e73e8e1cda6e86d8ee562fabd7a8cce2c683011be8f3955ccafd0165787dbaf774e7c3571220f5d6e797eaf6fe8a02577d
+ metadata.gz: 83b73a10113e75106a0fb7af9bec79802d2e3f5c8f3e07742f33a52642a9441c20769072f8ea5bd532011b7d172db6ca007121d6874f705008e1a5a511ca1ff8
+ data.tar.gz: d9c53588e8adbd76764c20012baf702591276d84c2cd64ed0bb0d5b742699607a2287d30a2fb4f70c1bf6f8a7338d716d989b208c8983c2a87eeddbd6d96dd3d
@@ -6,13 +6,12 @@ sudo: true
  rvm:
  - 2.4.4
  - 2.5.1
- - "2.6.0-preview2"
+ - 2.6.1
+ - 2.7.0
  # avoid having travis install jdk on MRI builds where we don't need it.
  matrix:
  include:
  - jdk: openjdk8
  rvm: jruby-9.1.17.0
  - jdk: openjdk8
- rvm: jruby-9.2.0.0
- allow_failures:
- - rvm: "2.6.0-preview2"
+ rvm: jruby-9.2.6.0
data/CHANGES.md CHANGED
@@ -1,5 +1,70 @@
  # Changes
 
+ ## Next
+
+ *
+
+ *
+
+ ## 3.4.0
+
+ * XML-mode `extract_xpath` now supports extracting attribute values with xpath @attr syntax.
+
+ ## 3.3.0
+
+ * `Traject::Macros::Marc21Semantics.publication_date` now gets the date from 264 before 260. https://github.com/traject/traject/pull/233
+
+ * Allow hashie 4.x in gemspec. https://github.com/traject/traject/pull/234
+
+ * Allow `http` gem 4.x versions. https://github.com/traject/traject/pull/236
+
+ * Can now call class-level Indexer.configure multiple times. https://github.com/sciencehistory/scihist_digicoll/pull/525
+
+ ## 3.2.0
+
+ * NokogiriReader has a "nokogiri.strict_mode" setting. Set to true or string 'true' to ask Nokogiri to parse in strict mode, so it will immediately raise on ill-formed XML, instead of Nokogiri's default of doing what it can with it. https://github.com/traject/traject/pull/226
+
+ * SolrJsonWriter
+
+ * Utility method `delete_all!` sends a delete-all query to the Solr URL endpoint. https://github.com/traject/traject/pull/227
+
+ * Allow basic auth configuration of the default http client via `solr_writer.basic_auth_user` and `solr_writer.basic_auth_password`. https://github.com/traject/traject/pull/231
+
+
+ ## 3.1.0
+
+ ### Added
+
+ * Context#add_output is added, convenient for custom ruby code.
+
+ each_record do |record, context|
+ context.add_output "key", something_from(record)
+ end
+
+ https://github.com/traject/traject/pull/220
+
+ * SolrJsonWriter
+
+ * Class-level indexer configuration, for custom indexer subclasses, is now available with the class-level `configure` method. Warning: Indexers are still expensive to instantiate. https://github.com/traject/traject/pull/213
+
+ * SolrJsonWriter has new settings to control commit semantics: `solr_writer.solr_update_args` and `solr_writer.commit_solr_update_args`, both with hash values that are Solr update handler query params. https://github.com/traject/traject/pull/215
+
+ * SolrJsonWriter has a `delete(solr-unique-key)` method. Does not currently use any batching or threading. https://github.com/traject/traject/pull/214
+
+ * SolrJsonWriter, when MaxSkippedRecordsExceeded is raised, it will have a #cause that is the last error that resulted in MaxSkippedRecordsExceeded. Some error reporting systems, including Rails, will automatically log #cause, so that's helpful. https://github.com/traject/traject/pull/216
+
+ * SolrJsonWriter now respects a `solr_writer.http_timeout` setting, in seconds, to be passed to the HTTPClient instance. https://github.com/traject/traject/pull/219
+
+ * Only runs thread pool shutdown code (and logging) if there is a `solr_writer.batch_size` greater than 0. Keeps it out of the logs if it was a no-op anyway.
+
+ * Logs at DEBUG level every time it sends an update request to solr
+
+ * Nokogiri dependency for the NokogiriReader increased to `~> 1.9`. When using JRuby `each_record_xpath`, resulting yielded documents may have xmlns declarations on different nodes than in MRI (and previous versions of nokogiri), but we could find no way around this with nokogiri >= 1.9.0. The documents should still be semantically equivalent for namespace use. This was necessary to keep JRuby Nokogiri XML working with recent Nokogiri releases. https://github.com/traject/traject/pull/209
+
+ * LineWriter guesses better about when to auto-close, and provides an optional explicit setting in case it guesses wrong. (thanks @justinlittman) https://github.com/traject/traject/pull/211
+
+ * Traject::Indexer will now use a Logger(-compatible) instance passed in via setting 'logger' https://github.com/traject/traject/pull/217
+
  ## 3.0.0
 
  ### Changed/Backwards Incompatibilities
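[Editor's note] The 3.2.0 "nokogiri.strict_mode" setting described above would be enabled in a traject configuration file roughly like this (a sketch, not taken from the diff; requires the traject gem and a NokogiriReader-based indexer):

```ruby
# traject configuration file fragment -- illustrative only
settings do
  provide "nokogiri.strict_mode", true  # raise immediately on ill-formed XML
end
```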
data/README.md CHANGED
@@ -19,7 +19,7 @@ Initially by Jonathan Rochkind (Johns Hopkins Libraries) and Bill Dueber (Univer
  * Basic configuration files can be easily written even by non-rubyists, with a few simple directives traject provides. But config files are 'ruby all the way down', so we can provide a gradual slope to more complex needs, with the full power of ruby.
  * Easy to program, easy to read, easy to modify.
  * Fast. Traject by default indexes using multiple threads, on multiple cpu cores, when the underlying ruby implementation (i.e., JRuby) allows it, and can use a separate thread for communication with solr even under MRI. Traject is intended to be usable to process millions of records.
- * Composed of decoupled components, for flexibility and extensibility.
+ * Composed of decoupled components, for flexibility and extensibility.
  * Designed to support local code and configuration that's maintainable and testable, and can be shared between projects as ruby gems.
  * Easy to split configuration between multiple files, for simple "pick-and-choose" command line options that can combine to deal with any of your local needs.
 
@@ -135,7 +135,7 @@ For the syntax and complete possibilities of the specification string argument t
 
  To see all options for `extract_marc`, see the [extract_marc](http://rdoc.info/gems/traject/Traject/Macros/Marc21:extract_marc) method documentation.
 
- ### XML mode, extract_xml
+ ### XML mode, extract_xpath
 
  See our [xml guide](./doc/xml.md) for more XML examples, but you will usually use extract_xpath.
 
@@ -175,6 +175,8 @@ TranslationMap use above is just one example of a transformation macro, that tra
  * `append("--after each value")`
  * `gsub(/regex/, "replacement")`
  * `split(" ")`: take values and split them, possibly resulting in multiple values.
+ * `transform(proc)`: transform each existing value with a proc, kind of like `map`.
+ eg `to_field "something", extract_xpath("//author"), transform( ->(author) { "#{author.last}, #{author.first}" })`
 
  You can add as many transformation macros as you want; they will be applied to the output in order.
 
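[Editor's note] Outside of the traject DSL, the `transform(proc)` macro added above is essentially a per-value `map` over the accumulator. A plain-ruby sketch of that behavior (illustrative names only, not traject's implementation):

```ruby
# Each extracted value is passed through the proc, and the result replaces it.
to_last_first = ->(author) { author.split(" ").reverse.join(", ") }

values = ["Jane Doe", "John Smith"]
transformed = values.map(&to_last_first)
# transformed => ["Doe, Jane", "Smith, John"]
```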
@@ -311,12 +313,15 @@ like `to_field`, is executed for every record, but without being tied
  to a specific output field.
 
  `each_record` can be used for logging or notifying, computing intermediate
- results, or writing to more than one field at once.
+ results, or more complex ruby logic.
 
  ~~~ruby
  each_record do |record|
  some_custom_logging(record)
  end
+ each_record do |record, context|
+ context.add_output(:some_value, extract_some_value_from_record(record))
+ end
  ~~~
 
  For more on `each_record`, see [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md).
@@ -405,7 +410,7 @@ writer class in question.
 
  ## The traject command line
 
- (If you are interested in running traject in an embedded/programmatic context instead of as a standalone command-line batch process, please see the docs on [Programmatic Use](./docs/programmatic_use.md).)
+ (If you are interested in running traject in an embedded/programmatic context instead of as a standalone command-line batch process, please see the docs on [Programmatic Use](./doc/programmatic_use.md).)
 
  The simplest invocation is:
 
@@ -247,13 +247,12 @@ each_record do |record, context|
  end
 
  each_record do |record, context|
- (val1, val2) = calculate_two_things_from(record)
+ if eligible_for_things?(record)
+ (val1, val2) = calculate_two_things_from(record)
 
- context.output_hash["first_field"] ||= []
- context.output_hash["first_field"] << val1
-
- context.output_hash["second_field"] ||= []
- context.output_hash["second_field"] << val2
+ context.add_output("first_field", val1)
+ context.add_output("second_field", val2)
+ end
  end
  ~~~
 
@@ -48,6 +48,30 @@ indexer = Traject::Indexer.new(settings) do
  end
  ```
 
+ ### Configuring indexer subclasses
+
+ Indexing step configuration is historically done in traject at the indexer _instance_ level, either programmatically or by applying a "configuration file" to an indexer instance.
+
+ But you can also define your own indexer sub-class with indexing steps built in, using the class-level `configure` method.
+
+ This is an EXPERIMENTAL feature; implementation may change. https://github.com/traject/traject/pull/213
+
+ ```ruby
+ class MyIndexer < Traject::Indexer
+ configure do
+ settings do
+ provide "solr.url", Rails.application.config.my_solr_url
+ end
+
+ to_field "our_name", literal("University of Whatever")
+ end
+ end
+ ```
+
+ These settings and indexing steps are now "hard-coded" into that subclass. You can still provide additional configuration at the instance level, as normal. You can also make a subclass of that `MyIndexer` class, which will inherit configuration from MyIndexer, and can supply its own additional class-level configuration too.
+
+ Note that due to how the implementation is done, instantiating an indexer is still _relatively_ expensive (class-level configuration is only actually executed on instantiation). You will still get better performance by re-using a global instance of your indexer subclass, instead of, say, instantiating one per object to be indexed.
+
  ## Running the indexer
 
  ### process: probably not what you want
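[Editor's note] The class-level configure behavior described above -- blocks stored on the class and replayed, superclass first, on each instantiation -- can be sketched self-contained in plain ruby. Class and method names here are illustrative, not the actual traject code:

```ruby
class ConfigurableBase
  # Store configure blocks on the class; they run against each new instance.
  def self.configure(&block)
    (@configure_blocks ||= []) << block
  end

  def self.apply_configure_blocks(instance)
    # Replay superclass blocks first, so subclasses layer on top.
    superclass.apply_configure_blocks(instance) if superclass.respond_to?(:apply_configure_blocks)
    (@configure_blocks || []).each { |block| instance.instance_eval(&block) }
  end

  attr_reader :steps

  def initialize
    @steps = []
    self.class.apply_configure_blocks(self)
  end

  def to_field(name)
    @steps << name
  end
end

class ParentIndexer < ConfigurableBase
  configure { to_field "our_name" }
end

class ChildIndexer < ParentIndexer
  configure { to_field "child_field" }
end

ChildIndexer.new.steps # => ["our_name", "child_field"]
```

Note the cost mentioned above is visible here: the blocks run on every `new`, not once at class-definition time.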
@@ -157,7 +181,7 @@ You may want to consider instead creating one or more configured "global" indexe
 
  * Readers, and the Indexer#process method, are not thread-safe. That is why using Indexer#process, which uses a fixed reader, is not thread-safe, and why when sharing a global indexer we want to use `process_record`, `map_record`, or `process_with` as above.
 
- It ought to be safe to use a global Indexer concurrently in several threads, with the `map_record`, `process_record` or `process_with` methods -- so long as your indexing rules and writers are thread-safe, as they usually will be and always ought to be.
+ It ought to be safe to use a global Indexer concurrently in several threads, with the `map_record`, `process_record` or `process_with` methods -- so long as your indexing rules and writers are thread-safe, as they usually will be and always ought to be.
 
  ### An example
@@ -93,6 +93,8 @@ settings are applied first of all. It's recommended you use `provide`.
 
  * `solr_writer.thread_pool`: defaults to 1 (single bg thread). A thread pool is used for submitting docs to solr. Set to 0 or nil to disable threading. Set to 1, there will still be a single bg thread doing the adds. May make sense to set higher than the number of cores on your indexing machine, as these threads will mostly be waiting on Solr. Speed/capacity of your solr might be more relevant. Note that processing_thread_pool threads can end up submitting to solr too, if solr_json_writer.thread_pool is full.
 
+ * `solr_writer.basic_auth_user`, `solr_writer.basic_auth_password`: Not set by default, but when both are set the default writer is configured with basic auth.
+
 
  ### Dealing with MARC data
 
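[Editor's note] A configuration-file sketch of the basic auth settings above (URL and credential values are placeholders, not from the diff):

```ruby
# traject configuration file fragment -- illustrative only
settings do
  provide "solr.url", "http://localhost:8983/solr/mycollection"
  provide "solr_writer.basic_auth_user", "indexer"
  provide "solr_writer.basic_auth_password", ENV["SOLR_PASSWORD"]
end
```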
@@ -119,6 +121,8 @@ settings are applied first of all. It's recommended you use `provide`.
 
  * `log.batch_size.severity`: If `log.batch_size` is set, what logger severity level to log at. Default "INFO"; set to "DEBUG" etc. if desired.
 
+ * `logger`: Ignore all the other logger settings and just pass in a `Logger`-compatible logger instance directly.
+
 
 
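[Editor's note] The `logger` setting above takes a pre-built instance; a minimal configuration-file sketch using ruby's stdlib Logger:

```ruby
require 'logger'

# traject configuration file fragment -- illustrative only.
# A pre-built logger bypasses the log.file / log.level settings entirely.
settings do
  provide "logger", Logger.new($stderr)
end
```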
data/doc/xml.md CHANGED
@@ -72,6 +72,16 @@ You can use all the standard transformation macros in Traject::Macros::Transforma
  to_field "something", extract_xpath("//value"), first_only, translation_map("some_map"), default("no value")
  ```
 
+ ### selecting attribute values
+
+ Just works, using xpath syntax for selecting an attribute:
+
+
+ ```ruby
+ # gets status value in: <oai:header status="something">
+ to_field "status", extract_xpath("//oai:record/oai:header/@status")
+ ```
+
 
  ### selecting non-text nodes
 
@@ -133,6 +143,8 @@ The NokogiriReader parser should be relatively performant though, allowing you t
 
  (There is a half-finished `ExperimentalStreamingNokogiriReader` available, but it is experimental, half-finished, may disappear or change in backwards compat at any time, problematic, not recommended for production use, etc.)
 
+ Note also that in JRuby, when using `each_record_xpath` with the NokogiriReader, the extracted individual documents may have xmlns declarations in different places than you may expect, although they will still be semantically equivalent for namespace processing. This is due to the Nokogiri JRuby implementation, and we could find no good way to ensure consistent behavior with MRI. See: https://github.com/sparklemotion/nokogiri/issues/1875
+
  ### Jruby
 
  It may be that nokogiri JRuby is just much slower than nokogiri MRI (at least when namespaces are involved?) It may be that our workaround to a [JRuby bug involving namespaces on moving nodes](https://github.com/sparklemotion/nokogiri/issues/1774) doesn't help.
@@ -180,6 +180,7 @@ class Traject::Indexer
  @index_steps = []
  @after_processing_steps = []
 
+ self.class.apply_class_configure_block(self)
  instance_eval(&block) if block
  end
 
@@ -189,6 +190,38 @@ class Traject::Indexer
  instance_eval(&block)
  end
 
+ ## Class-level configure block(s) are accepted too, and applied at instantiation
+ # before instance-level configuration.
+ #
+ # EXPERIMENTAL, implementation may change in ways that affect some uses.
+ # https://github.com/traject/traject/pull/213
+ #
+ # Note that settings set by 'provide' in a subclass can not really be overridden
+ # by 'provide' in a next-level subclass. Use self.default_settings instead, with
+ # a call to super.
+ #
+ # You can call this .configure multiple times; blocks are added to a list, and
+ # will be used to initialize an instance in order.
+ #
+ # The main downside of this workaround implementation is performance: even though
+ # defined at load time on the class level, blocks are all executed on every instantiation.
+ def self.configure(&block)
+ (@class_configure_blocks ||= []) << block
+ end
+
+ def self.apply_class_configure_block(instance)
+ # Make sure we inherit from a superclass that has a class-level ivar @class_configure_blocks
+ if self.superclass.respond_to?(:apply_class_configure_block)
+ self.superclass.apply_class_configure_block(instance)
+ end
+ if @class_configure_blocks && !@class_configure_blocks.empty?
+ @class_configure_blocks.each do |block|
+ instance.configure(&block)
+ end
+ end
+ end
+
+
 
  # Pass a string file path, a Pathname, or a File object, for
  # a config file to load into indexer.
@@ -258,10 +291,9 @@ class Traject::Indexer
  "log.batch_size.severity" => "info",
 
  # how to post-process the accumulator
- "allow_nil_values" => false,
- "allow_duplicate_values" => true,
-
- "allow_empty_fields" => false
+ Traject::Indexer::ToFieldStep::ALLOW_NIL_VALUES => false,
+ Traject::Indexer::ToFieldStep::ALLOW_DUPLICATE_VALUES => true,
+ Traject::Indexer::ToFieldStep::ALLOW_EMPTY_FIELDS => false
  }.freeze
  end
 
@@ -349,6 +381,10 @@ class Traject::Indexer
 
  # Create logger according to settings
  def create_logger
+ if settings["logger"]
+ # none of the other settings matter, we just got a logger
+ return settings["logger"]
+ end
 
  logger_level = settings["log.level"] || "info"
 
@@ -82,6 +82,51 @@ class Traject::Indexer
  str
  end
 
+ # Add values to an array in context.output_hash with the specified key/field_name(s).
+ # Creates the array in output_hash if currently nil.
+ #
+ # Post-processing/filtering:
+ #
+ # * uniqs accumulator, unless settings["allow_duplicate_values"] is set.
+ # * Removes nil values unless settings["allow_nil_values"] is set.
+ # * Will not add an empty array to output_hash (will leave it nil instead)
+ # unless settings["allow_empty_fields"] is set.
+ #
+ # Multiple values can be added with multiple arguments (we avoid an array argument meaning
+ # multiple values, to accommodate odd use cases where an array itself is desired as an output_hash value)
+ #
+ # @param field_name [String,Symbol,Array<String>,Array<Symbol>] A key to set in output_hash, or
+ # an array of such keys.
+ #
+ # @example add one value
+ # context.add_output(:additional_title, "a title")
+ #
+ # @example add multiple values as multiple params
+ # context.add_output("additional_title", "a title", "another title")
+ #
+ # @example add multiple values as multiple params from an array using the ruby splat operator
+ # context.add_output(:some_key, *array_of_values)
+ #
+ # @example add to multiple keys in output hash
+ # context.add_output(["key1", "key2"], "value")
+ #
+ # @return [Traject::Context] self
+ #
+ # Note: for historical reasons the relevant settings key *names* are in constants in Traject::Indexer::ToFieldStep,
+ # but the settings don't just apply to ToFieldSteps
+ def add_output(field_name, *values)
+ values.compact! unless self.settings && self.settings[Traject::Indexer::ToFieldStep::ALLOW_NIL_VALUES]
+
+ return self if values.empty? && !(self.settings && self.settings[Traject::Indexer::ToFieldStep::ALLOW_EMPTY_FIELDS])
+
+ Array(field_name).each do |key|
+ accumulator = (self.output_hash[key.to_s] ||= [])
+ accumulator.concat values
+ accumulator.uniq! unless self.settings && self.settings[Traject::Indexer::ToFieldStep::ALLOW_DUPLICATE_VALUES]
+ end
+
+ return self
+ end
  end
 
 
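[Editor's note] The post-processing rules documented above (compact nils, skip empty, uniq, multiple keys) can be sketched as a self-contained plain-ruby function; this is an illustration of the semantics, not the real `Traject::Context#add_output`:

```ruby
# Keyword options stand in for the allow_* settings consulted by the real method.
def add_output(output_hash, field_name, *values,
               allow_nil: false, allow_empty: false, allow_duplicates: false)
  values = values.compact unless allow_nil
  return output_hash if values.empty? && !allow_empty

  Array(field_name).each do |key|
    accumulator = (output_hash[key.to_s] ||= [])
    accumulator.concat(values)
    accumulator.uniq! unless allow_duplicates
  end
  output_hash
end

h = {}
add_output(h, :title, "a title", nil, "a title")
add_output(h, ["key1", "key2"], "shared value")
# h => {"title" => ["a title"], "key1" => ["shared value"], "key2" => ["shared value"]}
```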
@@ -145,24 +145,20 @@ class Traject::Indexer
  return accumulator
  end
 
- # Add the accumulator to the context with the correct field name
- # Do post-processing on the accumulator (remove nil values, allow empty
- # fields, etc)
+
+ # These constants are here for historical/legacy reasons; they really oughta
+ # live in Traject::Context, but in case anyone is referring to them
+ # we'll leave them here for now.
  ALLOW_NIL_VALUES = "allow_nil_values".freeze
  ALLOW_EMPTY_FIELDS = "allow_empty_fields".freeze
  ALLOW_DUPLICATE_VALUES = "allow_duplicate_values".freeze
 
+ # Add the accumulator to the context with the correct field name(s).
+ # Do post-processing on the accumulator (remove nil values, allow empty
+ # fields, etc)
  def add_accumulator_to_context!(accumulator, context)
- accumulator.compact! unless context.settings[ALLOW_NIL_VALUES]
- return if accumulator.empty? and not (context.settings[ALLOW_EMPTY_FIELDS])
-
  # field_name can actually be an array of field names
- Array(field_name).each do |a_field_name|
- context.output_hash[a_field_name] ||= []
-
- existing_accumulator = context.output_hash[a_field_name].concat(accumulator)
- existing_accumulator.uniq! unless context.settings[ALLOW_DUPLICATE_VALUES]
- end
+ context.add_output(field_name, *accumulator)
  end
  end
 
@@ -8,12 +8,35 @@ require 'thread'
  # This does not seem to affect performance much, as far as I could tell
  # benchmarking.
  #
- # Output will be sent to `settings["output_file"]` string path, or else
- # `settings["output_stream"]` (ruby IO object), or else stdout.
- #
  # This class can be sub-classed to write out different serialized
  # representations -- subclasses will just override the #serialize
  # method. For instance, see JsonWriter.
+ #
+ # ## Output
+ #
+ # The main functionality this class provides is logic for choosing, based on
+ # settings, what file or bytestream to send output to.
+ #
+ # You can supply `settings["output_file"]` with a _file path_. LineWriter
+ # will open up a `File` to write to.
+ #
+ # Or you can supply `settings["output_stream"]` with any ruby IO object, such as an
+ # open `File` object or anything else.
+ #
+ # If neither is supplied, output will be written to `$stdout`.
+ #
+ # ## Closing the output stream
+ #
+ # The LineWriter tries to guess whether it should call `close` on the output
+ # stream it's writing to, when the LineWriter instance is closed. For instance,
+ # if you passed in a `settings["output_file"]` with a path, and the LineWriter
+ # opened up a `File` object for you, it should close it for you.
+ #
+ # But for historical reasons, LineWriter doesn't just use that signal; it tries
+ # to guess generally about when to call close. If for some reason it gets it wrong,
+ # just use `settings["close_output_on_close"]` set to `true` or `false`.
+ # (String `"true"` or `"false"` are also acceptable, for convenience in setting
+ # options on the command line)
  class Traject::LineWriter
  attr_reader :settings
  attr_reader :write_mutex, :output_file
@@ -57,7 +80,16 @@ class Traject::LineWriter
  end
 
  def close
- @output_file.close unless (@output_file.nil? || @output_file.tty?)
+ @output_file.close if should_close_stream?
+ end
+
+ def should_close_stream?
+ if settings["close_output_on_close"].nil?
+ !(@output_file.nil? || @output_file.tty? || @output_file == $stdout || @output_file == $stderr)
+ else
+ settings["close_output_on_close"].to_s == "true"
+ end
  end
 
+
  end
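[Editor's note] A configuration-file sketch of the LineWriter output and close-behavior settings described above (the file path is hypothetical):

```ruby
# traject configuration file fragment -- illustrative only
settings do
  provide "output_file", "/tmp/output.ndjson"
  # or: provide "output_stream", some_open_io_object
  # override the auto-close guess if it is wrong for your stream:
  provide "close_output_on_close", "false"
end
```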