RubyGems - traject - Versions diffs - 0.16.0 → 0.17.0 - Mend

traject 0.16.0 → 0.17.0


Files changed (53)

checksums.yaml +7 -0
data/.yardopts +1 -0
data/README.md +183 -191
data/bench/bench.rb +1 -1
data/doc/batch_execution.md +14 -0
data/doc/extending.md +14 -12
data/doc/indexing_rules.md +265 -0
data/lib/traject/command_line.rb +12 -41
data/lib/traject/debug_writer.rb +32 -13
data/lib/traject/indexer.rb +101 -24
data/lib/traject/indexer/settings.rb +18 -17
data/lib/traject/json_writer.rb +32 -11
data/lib/traject/line_writer.rb +6 -6
data/lib/traject/macros/basic.rb +1 -1
data/lib/traject/macros/marc21.rb +17 -13
data/lib/traject/macros/marc21_semantics.rb +27 -25
data/lib/traject/macros/marc_format_classifier.rb +39 -25
data/lib/traject/marc4j_reader.rb +36 -22
data/lib/traject/marc_extractor.rb +79 -75
data/lib/traject/marc_reader.rb +33 -25
data/lib/traject/mock_reader.rb +9 -10
data/lib/traject/ndj_reader.rb +7 -7
data/lib/traject/null_writer.rb +1 -1
data/lib/traject/qualified_const_get.rb +12 -2
data/lib/traject/solrj_writer.rb +61 -52
data/lib/traject/thread_pool.rb +45 -45
data/lib/traject/translation_map.rb +59 -27
data/lib/traject/util.rb +3 -3
data/lib/traject/version.rb +1 -1
data/lib/traject/yaml_writer.rb +1 -1
data/test/debug_writer_test.rb +7 -7
data/test/indexer/each_record_test.rb +4 -4
data/test/indexer/macros_marc21_semantics_test.rb +12 -12
data/test/indexer/macros_marc21_test.rb +10 -10
data/test/indexer/macros_test.rb +1 -1
data/test/indexer/map_record_test.rb +6 -6
data/test/indexer/read_write_test.rb +43 -4
data/test/indexer/settings_test.rb +2 -2
data/test/indexer/to_field_test.rb +8 -8
data/test/marc4j_reader_test.rb +4 -4
data/test/marc_extractor_test.rb +33 -25
data/test/marc_format_classifier_test.rb +3 -3
data/test/marc_reader_test.rb +2 -2
data/test/test_helper.rb +3 -3
data/test/test_support/demo_config.rb +52 -48
data/test/translation_map_test.rb +22 -4
data/test/translation_maps/bad_ruby.rb +2 -2
data/test/translation_maps/both_map.rb +1 -1
data/test/translation_maps/default_literal.rb +1 -1
data/test/translation_maps/default_passthrough.rb +1 -1
data/test/translation_maps/ruby_map.rb +1 -1
metadata +7 -31
data/doc/macros.md +0 -103

data/bench/bench.rb CHANGED

@@ -27,4 +27,4 @@ Benchmark.bmbm do |x|
   end
 end

data/doc/batch_execution.md CHANGED

@@ -99,6 +99,20 @@ Now any account, in a crontab, in an interactive shell, wherever,
 can just execute `jruby-traject {arguments}`, and execute traject
 in a jruby environment.
+### Bundler too?
+If you're running with bundler too, you could make a wrapper file specific to
+a particular traject project and its Gemfile, by combining the `bundle exec` into
+your wrapper file. For instance, for chruby, this works:
+    #!/usr/bin/env bash
+    chruby-exec jruby -- BUNDLE_GEMFILE=/path/to/Gemfile bundle exec traject "$@"
+Now you can call your wrapper script from anywhere and with any active ruby,
+and execute it in jruby and with the dependencies specified in the Gemfile
+for your project.
 ## Exit codes
 Traject tries to always return a well-behaved unix exit code -- 0 for success,

data/doc/extending.md CHANGED

@@ -19,9 +19,9 @@ of a couple traject features meant to make it easier.
   * translation map files found in a
     "./translation_maps" subdir on the load path will be found
     for Traject translation maps.
-* Traject `-G` command line can be used to tell traject to use
-  bundler with a `Gemfile` located at current working dirctory
-  (or give an argument to `-G ./some/myGemfile`)
+* You can use Bundler with traject simply by creating a Gemfile with `bundle init`,
+  and then running the command line with `bundle exec traject` or
+  even `BUNDLE_GEMFILE=path/to/Gemfile bundle exec traject`
 ## Custom code local to your project
@@ -160,19 +160,21 @@ possibly with version restrictions, in the [Gemfile](http://bundler.io/v1.3/gemf
 Run `bundle install` from the directory with the Gemfile, on any system
 at any time, to make sure specified gems are installed.
-**Run traject** with the `-G` flag to tell it to use the Gemfile, for instance if
-your working directory is the one that includes your Gemfile:
+**Run traject** with `bundle exec` to have bundler set up the environment
+from your Gemfile. You can `cd` into the directory containing the Gemfile,
+so bundler can find it:
-    traject -G -c some_traject_config.rb ...
+    $ cd /some/where
+    $ bundle exec traject -c some_traject_config.rb ...
-Or explicitly specify a Gemfile somewhere else:
+Or you can use the BUNDLE_GEMFILE environment variable to tell bundler where
+to find the Gemfile, and run from any directory at all:
-    traject -G /some/path/Gemfile -c some_config.rb ...
+    $ BUNDLE_GEMFILE=/path/to/Gemfile bundle exec traject -c /path/to/some_config.rb ...
-Traject will use bundler to setup with the Gemfile, making sure
-the specified versions of all gems are used (and also making sure
-no gems except those specified in the gemfile are available to
-the program).
+Bundler will make sure the specified versions of all gems are used by
+traject, and also make sure no gems except those specified in the gemfile
+are available to the program, for a reliable reproducible environment.
 You should still `require` the gem in your traject config file,
 then just refer to what it provides in your config code as usual.

data/doc/indexing_rules.md ADDED

@@ -0,0 +1,265 @@
+# Details on Traject Indexing: from custom logic to Macros
+Traject macros are a way of providing re-usable index mapping rules. Before we discuss how they work, we need to remind ourselves of the basic/direct Traject `to_field` indexing method.
+## How direct indexing logic works
+Here's the simplest possible direct Traject mapping logic, duplicating the effects of the `literal` macro:
+~~~ruby
+to_field("title") do |record, accumulator, context|
+  accumulator << "FIXED LITERAL"
+end
+~~~
+That `do` is just ruby `block` syntax, whereby we can pass a block of ruby code as an argument to a ruby method. We pass a block taking three arguments, labeled `record`, `accumulator`, and `context`, to the `to_field` method. The third `context` argument is optional: you can declare it in your block or not, depending on whether you want to use it.
+The block is then stored by the Traject::Indexer, and called for each record indexed, with three arguments provided.
+### record argument
+The record that gets passed to your block is a MARC::Record object (or, theoretically, any object that gets returned by a traject Reader). Your logic will usually examine the record to calculate the desired output.
+### accumulator argument
+The accumulator argument is an array. At the end of your custom code, the accumulator
+array should hold the output you want to send off, to the field specified in the `to_field`.
+The accumulator is a reference to a ruby array, and you need to **modify** that array,
+manipulating it in place with Array methods that mutate the array, like `concat`, `<<`,
+`map!` or even `replace`.
+You can't simply assign the accumulator variable to a different array; that won't work.
+You need to modify the array in-place.
+    # Won't work, assigning variable
+    to_field('foo') do |rec, acc|
+      acc = ["some constant"] # WRONG!
+    end
+    # Won't work, assigning variable
+    to_field('foo') do |rec, acc|
+      acc << 'bill'
+      acc << 'dueber'
+      acc = acc.map{|str| str.upcase}
+    end   # WRONG! WRONG! WRONG! WRONG! WRONG!
+    # Instead, modify the array in place
+    to_field('foo') {|rec, acc| acc << "some constant" }
+    to_field('foo') do |rec, acc|
+      acc << 'bill'
+      acc << 'dueber'
+      acc.map!{|str| str.upcase} # notice using "map!" not just "map"
+    end
+### context argument
+The third optional argument is a
+[Traject::Indexer::Context](./lib/traject/indexer/context.rb)  ([rdoc](http://rdoc.info/github/jrochkind/traject/Traject/Indexer/Context))
+object. Most of the time you don't need it, but you can use it for
+some sophisticated functionality, for example using these Context methods:
+* `context.clipboard` A hash into which you can stuff values that you want to pass from one indexing step to another. For example, if you go through a bunch of work to query a database and get a result you'll need more than once, stick the results somewhere in the clipboard.
+* `context.position` The position of the record in the input file (e.g., was it the first record, second, etc.). Useful for error reporting.
+* `context.output_hash` A hash mapping the field names (generally defined in `to_field` calls) to an array of values to be sent to the writer associated with that field. This allows you to modify what goes to the writer without going through a `to_field` call -- you can just set `context.output_hash['myfield'] = ['my', 'values']` and you're set. See below for more examples
+* `context.skip!(msg)` An assertion that this record should be ignored. No more indexing steps will be called, no results will be sent to the writer, and a `debug`-level log message will be written stating that the record was skipped.
+## Gotcha: Use closures to make your code more efficient
+A _closure_ is a computer-science term that means "a piece of code
+that remembers all the variables that were in scope when it was
+created." In ruby, lambdas and blocks are closures. Method definitions
+are not, which most of us have run across much to our chagrin.
+Within the context of `traject`, this means you can define a variable
+outside of a `to_field` or `each_record` block and it will be available
+inside those blocks. And you only have to define it once.
+That's useful to do for any object that is even a bit expensive
+to create -- we can maximize the performance of our traject
+indexing by creating those objects once outside the block,
+instead of inside the block where it will be created
+once per-record (every time the block is executed):
+Compare:
+```ruby
+# Create the transformer for every single record
+to_field 'normalized_title' do |rec, acc|
+  transformer = My::Custom::Format::Transformer.new # Oh no! I'm doing this for each of my 10M records!
+  acc << transformer.transform(rec['245'].value)
+end
+# Create the transformer exactly once
+transformer = My::Custom::Format::Transformer.new # Ahhh. Do it once.
+to_field 'normalized_title' do |rec, acc|
+  acc << transformer.transform(rec['245'].value)
+end
+```
+Certain built-in traject calls have been optimized to be high performance
+so it's safe to do them inside 'inner loop' blocks though.
+That includes `Traject::TranslationMap.new` and `Traject::MarcExtractor.cached("xxx")`
+(note #cached rather than #new there)
+## From block to lambda
+In the ruby language, in addition to creating a code block as an argument
+to a method with `do |args| ... end` or `{|arg| ...  }`, we can also create
+a code block to hold in a variable, with the `lambda` keyword:
+    always_output_foo = lambda do |record, accumulator|
+      accumulator << "FOO"
+    end
+traject `to_field` is written so, as a convenience, it can take a lambda expression
+stored in a variable as an alternative to a block:
+    to_field("always_has_foo", always_output_foo)
+Why is this a convenience? Well, ordinarily it's not something we
+need, but in fact it's what allows traject 'macros' as re-useable
+code templates.
+## Macros
+A Traject macro is a way to automatically create indexing rules via re-usable "templates".
+Traject macros are simply methods that return ruby lambda/proc objects, possibly creating
+them based on parameters passed in.
+Here is in fact how the `literal` function is implemented:
+~~~ruby
+def literal(value)
+  return lambda do |record, accumulator, context|
+     # because a lambda is a closure, we can define it in terms
+     # of the 'value' from the scope it's defined in!
+     accumulator << value
+  end
+end
+to_field("something", literal("something"))
+~~~
+It's really as simple as that; that's all a Traject macro is: a function that takes parameters and, based on those parameters, returns a lambda. The lambda is then passed to the `to_field` indexing method, or similar methods.
+How do you make these methods available to the indexer?
+Define it in a module:
+~~~ruby
+# in a file literal_macro.rb
+module LiteralMacro
+  def literal(value)
+    return lambda do |record, accumulator, context|
+       # because a lambda is a closure, we can define it in terms
+       # of the 'value' from the scope it's defined in!
+       accumulator << value
+    end
+  end
+end
+~~~
+And then use ordinary ruby `require` and `extend` to add it to the current Indexer file, by simply including this
+in one of your config files:
+~~~
+require 'literal_macro'
+extend LiteralMacro
+to_field ...
+~~~
+That's it.  You can use the traject command line `-I` option to set the ruby load path, so your file will be findable via `require`.  Or you can distribute it in a gem, and use straight rubygems and the `gem` command in your configuration file, or Bundler with `bundle exec traject`.
+## Using a lambda _and_ a block
+Traject macros (such as `extract_marc`) create and return a lambda. If
+you include a lambda _and_ a block on a `to_field` call, the latter
+gets the accumulator as it was filled in by the former.
+```ruby
+# Get the titles and lowercase them
+to_field 'lc_title', extract_marc('245') do |rec, acc, context|
+  acc.map!{|title| title.downcase}
+end
+# Build my own lambda and use it
+mylam = lambda {|rec, acc|  acc << 'one'} # just add a constant
+to_field('foo', mylam) do |rec, acc, context|
+  acc << 'two'
+end #=> context.output_hash['foo'] == ['one', 'two']
+# You might also want to do something like this
+to_field('foo', my_macro_that_doesnt_dedup) do |rec, acc|
+  acc.uniq!
+end
+```
+## Manipulating `context.output_hash` directly
+If you ask for the context argument, a [Traject::Indexer::Context](./lib/traject/indexer/context.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Indexer/Context)), you have access to `context.output_hash`, which is
+the hash of transformed output that will be sent to Solr (or any other Writer).
+You can look in there to see any already transformed output and use it as the source
+for new output. You can actually *write* to there manually, which can be useful
+to write routines that affect more than one output field at once.
+**Note**: Make sure you always assign an _array_ to, e.g., `context.output_hash['foo']`, not a single value!
+## each_record
+All the previous discussion was in terms of `to_field` -- `each_record` is a similar
+routine, to define logic that is executed for each record, but isn't fixed to write
+to a single output field.
+So `each_record` blocks have no `accumulator` argument, instead they either take a single
+`record` argument; or both a `record` and a `context`.
+`each_record` can be used for logging or notifying; computing intermediate
+results; or writing to more than one field at once.
+~~~ruby
+each_record do |record, context|
+  if is_it_bad?(record)
+    context.skip!("Skipping bad record")
+  else
+    context.clipboard[:expensive_result] = calculate_expensive_thing(record)
+  end
+end
+each_record do |record, context|
+  (one, two) = calculate_two_things_from(record)
+  context.output_hash["first_field"] ||= []
+  context.output_hash["first_field"] << one
+  context.output_hash["second_field"] ||= []
+  context.output_hash["second_field"] << two
+end
+~~~
+traject doesn't come with any macros written for use with
+`each_record`, but they could be created if useful --
+just methods that return lambdas taking the right
+args for `each_record`.
+## More tips and gotchas about indexing steps
+* **All your `to_field` and `each_record` steps are run _in the order in which they were initially evaluated_**. That means that the order you call your config files can potentially make a difference if you're screwing around stuffing stuff into the context clipboard or whatnot.
+* **`to_field` can be called multiple times on the same field name.** If you call the same field name multiple times, all the values will be sent to the writer.
+* **Once you call `context.skip!(msg)` no more index steps will be run for that record**. So if you have any cleanup code, you'll need to make sure to call it yourself.
+* **By default, `traject` indexing runs multi-threaded**. In the current implementation, the indexing steps for one record are *not* split across threads, but different records can be processed simultaneously by more than one thread. That means you need to make sure your code is thread-safe (or always set `processing_thread_pool` to 0).
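
The accumulator and macro behaviour described in `doc/indexing_rules.md` above can be reproduced in plain Ruby, without traject installed. This is a minimal sketch, not traject's implementation: `run_steps` is a hypothetical stand-in for the indexer, the hash is a stand-in for a `MARC::Record`, and the `literal` method here only mirrors the shape of the macro shown in the added documentation.

```ruby
# Hypothetical stand-in for the indexer (not part of traject): calls every
# step with the same record and the same accumulator array, then returns it.
def run_steps(record, *steps)
  accumulator = []
  steps.each { |step| step.call(record, accumulator) }
  accumulator
end

# A "macro" in the sense used above: a method that returns a lambda which
# closes over the argument it was created with.
def literal(value)
  lambda { |record, accumulator| accumulator << value }
end

# Reassigning the lambda's local variable only rebinds that local name;
# the array held by the caller is left untouched.
reassign = lambda { |record, accumulator| accumulator = ["lost"] }

# Mutating in place (map!, <<, concat, replace) changes the caller's array.
upcase_in_place = lambda { |record, accumulator| accumulator.map! { |s| s.upcase } }

record = { "245" => "a title" } # stand-in for a MARC::Record

wrong = run_steps(record, reassign)
right = run_steps(record, literal("bill"), literal("dueber"), upcase_in_place)

p wrong # the reassigned value never reaches the caller
p right # both literals, upcased in place
```

Running later steps against the array filled by earlier ones is also the shape of the "lambda and a block" case: the block sees whatever the lambda already appended.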

data/lib/traject/command_line.rb CHANGED

@@ -1,7 +1,6 @@
-# Require as little as possible at top, so we can bundle require later
-# if needed, before requiring anything from the bundle. Can't avoid slop
-# though, to get our bundle arg out, sorry.
 require 'slop'
+require 'traject'
+require 'traject/indexer'
 module Traject
   # The class that executes for the Traject command line utility.
@@ -33,21 +32,6 @@ module Traject
     # Returns true on success or false on failure; may also raise exceptions;
     # may also exit program directly itself (yeah, could use some normalization)
     def execute
-      # Do bundler setup FIRST to try and initialize all gems from gemfile
-      # if requested.
-      # have to use Slop object to tell diff between
-      # no arg supplied and no option -g given at all
-      if slop.present? :Gemfile
-        require_bundler_setup(options[:Gemfile])
-      end
-      # We require them here instead of top of file,
-      # so we have done bundler require before we require these.
-      require 'traject'
-      require 'traject/indexer'
       if options[:version]
         self.console.puts "traject version #{Traject::VERSION}"
         return
@@ -92,6 +76,10 @@ module Traject
         end
       return result
+    rescue Exception => e
+      # Try to log unexpected exceptions if possible
+      indexer && indexer.logger && indexer.logger.fatal("Traject::CommandLine: Unexpected exception, terminating execution: #{e.inspect}") rescue nil
+      raise e
     end
     def command_commit!
@@ -117,19 +105,21 @@ module Traject
         $stdout
       end
+      indexer.logger.info("   marcout writing type:#{output_type} to file:#{output_arg}")
       case output_type
       when "binary"
         writer = MARC::Writer.new(output_arg)
         allow_oversized = indexer.settings["marcout.allow_oversized"]
         if allow_oversized
-          allow_oversized = (allow_oversized.to_s == "true")
+          allow_oversized = (allow_oversized.to_s == "true")
           writer.allow_oversized = allow_oversized
         end
       when "xml"
         writer = MARC::XMLWriter.new(output_arg)
       when "human"
-        writer = output_arg.kind_of?(String) ? File.open(output_arg, "w:binary") : output_arg
+        writer = output_arg.kind_of?(String) ? File.open(output_arg, "w:binary") : output_arg
       else
         raise ArgumentError.new("traject marcout unrecognized marcout.type: #{output_type}")
       end
@@ -174,7 +164,7 @@ module Traject
         filename = argv.first
         indexer.logger.info "Reading from #{filename}"
       end
       return io, filename
     end
@@ -215,24 +205,6 @@ module Traject
       end
     end
-    # requires bundler/setup, optionally first setting ENV["BUNDLE_GEMFILE"]
-    # to tell bundler to use a specific gemfile. Gemfile arg can be relative
-    # to current working directory.
-    def require_bundler_setup(gemfile=nil)
-      if gemfile
-        # tell bundler what gemfile to use
-        gem_path = File.expand_path( gemfile )
-        # bundler not good at error reporting, we check ourselves
-        unless File.exists? gem_path
-          self.console.puts "Gemfile `#{gemfile}` does not exist, exiting..."
-          self.console.puts
-          self.console.puts slop.help
-          exit 2
-        end
-        ENV["BUNDLE_GEMFILE"] = gem_path
-      end
-      require 'bundler/setup'
-    end
     def assemble_settings_hash(options)
       settings = {}
@@ -256,7 +228,7 @@ module Traject
       if options[:'debug-mode']
         require 'traject/debug_writer'
         settings["writer_class_name"] = "Traject::DebugWriter"
-        settings["log.level"] = "debug"
+        settings["log.level"] = "debug"
         settings["processing_thread_pool"] = 0
       end
       if options[:writer]
@@ -294,7 +266,6 @@ module Traject
         on :u, :solr, "Set solr url, shortcut for -s solr.url=", :argument => true
         on :t, :marc_type, "xml, json or binary. shortcut for -s marc_source.type=", :argument => true
         on :I, "load_path", "append paths to ruby $LOAD_PATH", :argument => true, :as => Array, :delimiter => ":"
-        on :G, "Gemfile", "run with bundler and optionally specified Gemfile", :argument => :optional, :default => nil
         on :x, "command", "alternate traject command: process (default); marcout; commit", :argument => true, :default => "process"