traject 0.9.1 → 0.10.0

Files changed (36)

data/.travis.yml +7 -0
data/Gemfile +5 -1
data/README.md +65 -17
data/bench/bench.rb +30 -0
data/bin/traject +4 -169
data/doc/batch_execution.md +177 -0
data/doc/extending.md +182 -0
data/doc/other_commands.md +49 -0
data/doc/settings.md +6 -2
data/lib/traject.rb +1 -0
data/lib/traject/command_line.rb +296 -0
data/lib/traject/debug_writer.rb +28 -0
data/lib/traject/indexer.rb +84 -20
data/lib/traject/indexer/settings.rb +9 -1
data/lib/traject/json_writer.rb +15 -38
data/lib/traject/line_writer.rb +59 -0
data/lib/traject/macros/marc21.rb +10 -5
data/lib/traject/macros/marc21_semantics.rb +57 -25
data/lib/traject/marc4j_reader.rb +9 -26
data/lib/traject/marc_extractor.rb +121 -48
data/lib/traject/mock_reader.rb +87 -0
data/lib/traject/mock_writer.rb +34 -0
data/lib/traject/solrj_writer.rb +1 -22
data/lib/traject/util.rb +107 -1
data/lib/traject/version.rb +1 -1
data/lib/traject/yaml_writer.rb +9 -0
data/test/debug_writer_test.rb +38 -0
data/test/indexer/each_record_test.rb +27 -2
data/test/indexer/macros_marc21_semantics_test.rb +12 -1
data/test/indexer/settings_test.rb +9 -2
data/test/indexer/to_field_test.rb +35 -5
data/test/marc4j_reader_test.rb +3 -0
data/test/marc_extractor_test.rb +94 -20
data/test/test_support/demo_config.rb +6 -3
data/traject.gemspec +1 -2
metadata +17 -20

data/.travis.yml ADDED

@@ -0,0 +1,7 @@
+language: ruby
+rvm:
+  - jruby-19mode
+jdk:
+  - openjdk7
+  - openjdk6
+bundler_args: --without debug

data/Gemfile CHANGED

@@ -5,4 +5,8 @@ gemspec
 group :development do
   gem "nokogiri" # used only for rake tasks load_maps:
-end
+end
+group :debug do
+  gem "pry"
+end

data/README.md CHANGED

@@ -7,9 +7,13 @@ them somewhere.
 **Currently under development, not production ready**
+[![Gem Version](https://badge.fury.io/rb/traject.png)](http://badge.fury.io/rb/traject)
+[![Build Status](https://travis-ci.org/jrochkind/traject.png)](https://travis-ci.org/jrochkind/traject)
 ## Background/Goals
-Existing tools for indexing Marc to Solr exist, and have served many of us for many years. But I was having more and more difficulty working with the existing tools, and difficulty providing the custom logic I needed in a maintainable way. I realized that for me, to create a tool with the flexibility, maintainability, and performance I wanted, I would need to do it in jruby (ruby on the JVM).
+Existing tools for indexing Marc to Solr exist, and have served us well for many years, and have many useful things about them -- which I've tried to preserve in traject.  But I was having more and more difficulty working with the existing tools, including difficulty providing the custom logic I needed in a maintainable way. I realized that for me, to create a tool with the flexibility, maintainability, and performance I wanted, I would need to do it in jruby (ruby on the JVM).
 Some goals:
@@ -19,11 +23,13 @@ Some goals:
 * Built of modular and composable elements: If you want to change part of what traject does, you should be able to do so without having to reimplement other things you don't want to change.
 * A maintainable internal architecture, well-factored with separated concerns and DRY logic. Aim to be comprehensible to newcomer developers, and well-covered by tests.
 * High performance, using multi-threaded concurrency where appropriate to maximize throughput. Actual throughput can depend on complexity of your mapping rules and capacity of your server(s), but I am getting throughput 2-5x greater than previous solutions.
+* Cooperate well in unix batch/pipeline, with control over output/logging of errors, proper exit codes, use of stdin/stdout, etc.
 ## Installation
-Traject runs under jruby (ruby on the JVM). I recommend [chruby](https://github.com/postmodern/chruby) and [ruby-install](https://github.com/postmodern/ruby-install#readme) for installing and managing ruby installations.
+Traject runs under jruby (ruby on the JVM). I recommend [chruby](https://github.com/postmodern/chruby) and [ruby-install](https://github.com/postmodern/ruby-install#readme) for installing and managing ruby installations. (traject is tested
+and supported for ruby 1.9 -- recent versions of jruby should run under 1.9 mode by default).
 Then just `gem install traject`.
@@ -151,6 +157,11 @@ Other examples of the specification string, which can include multiple tag menti
   # "*" is a wildcard in indicator spec.  So
   # 856 with first indicator '0', subfield u.
   to_field "email_addresses", extract_marc("856|0*|u")
+  # Instead of joining subfields from the same field
+  # into one string, joined by spaces, leave them
+  # each in separate strings:
+  to_field "isbn", extract_marc("020az", :seperator => nil)
 ~~~
 The `extract_marc` function *by default* includes any linked
@@ -214,9 +225,14 @@ end
 # marc_extract does, you may want to use the Traject::MarcExtractor
 # class
 to_field "weirdo" do |record, accumulator, context|
-   list = MarcExtractor.extract_by_spec(record, "700a")
+   # use MarcExtractor.cached for performance, globally
+   # caching the MarcExtractor we create. See docs
+   # at MarcExtractor.
+   list = MarcExtractor.cached("700a").extract(record)
    # combine all the 700a's in ONE string, cause we're weird
    list = list.join(" ")
    accumulator << list
 end
 ~~~
@@ -264,6 +280,10 @@ in order.
   to_field("foo") {...}  # and will be called after each of the preceding for each record
 ~~~
+#### Sample config
+A fairly complex sample config file can be found at [./test/test_support/demo_config.rb](./test/test_support/demo_config.rb)
 #### Built-in MARC21 Semantics
 There is another package of 'macros' that comes with Traject for extracting semantics
@@ -292,7 +312,7 @@ The simplest invocation is:
     traject -c conf_file.rb marc_file.mrc
 Traject assumes marc files are in ISO 2709 binary format; it is not
-currently able to buess marc format type. If you are reading
+currently able to guess marc format type from filenames. If you are reading
 marc files in another format, you need to tell traject either with the `marc_source.type` or the command-line shortcut:
     traject -c conf.rb -t xml marc_file.xml
@@ -323,21 +343,45 @@ Use `-u` as a shortcut for `s solr.url=X`
     traject -c conf_file.rb -u http://example.com/solr marc_file.mrc
-Also see `-I load_path` and `-g Gemfile` options under Extending Logic
+Also see `-I load_path` and `-g Gemfile` options under Extending With Your Own Code.
-## Extending Logic
+See also [Hints for batch and cronjob use](./doc/batch_execution.md) of traject.
-TODO fill out nicer.
+## Extending With Your Own Code
-Basically:
+Traject config files are full live ruby files, where you can do anything,
+including declaring new classes, etc.
-command line `-I` can be used to append to the ruby $LOAD_PATH, and then you can simply `require` your local files, and then use them for
-whatever. Macros, utility functions, translation maps, whatever.
+However, beyond limited trivial logic, you'll want to organize your
+code reasonably into separate files, not jam everything into config
+files.
-If you want to use logic from other gems in your configuration mapping, you can do that too. This works for traject-specific
-functionality like translation maps and macros, or for anything else.
-To use gems, you can _either_ use straight rubygems, simply by
-installing gems in your system and using `require` or `gem` commands... **or** you can use Bundler for dependency locking and other dependency management. To have traject use Bundler, create a `Gemfile` and then call traject command line with the `-g` option. With the `-g` option alone, Bundler will look in the CWD and parents for the first `Gemfile` it finds. Or supply `-g ./somewhere/MyGemfile` to anywhere.
+Traject wants to make sure it makes it convenient for you to do so,
+whether project-specific logic in files local to the traject project,
+or in ruby gems that can be shared between projects.
+There are standard ruby mechanisms you can use to do this, and
+traject provides a couple features to make sure this remains
+convenient with the traject command line.
+For more information, see documentation page on [Extending With Your
+Own Code](./doc/extending.md)
+**Expert summary** :
+* The traject `-I` command line argument can be used to list directories to
+  add to the load path, similar to the `ruby -I` argument. You
+  can then 'require' local project files from the load path.
+  * translation map files found on the load path or in a
+    "./translation_maps" subdir on the load path will be found
+    for Traject translation maps.
+* The traject `-g` command line argument can be used to tell traject to use
+  bundler with a `Gemfile` located in the current working directory
+  (or give an argument to `-g ./some/myGemfile`)
+## More
+* [Other traject commands](./doc/other_commands.md) including `marcout`, and `commit`
+* [Hints for batch and cronjob use](./doc/batch_execution.md) of  traject.
 # Development
@@ -351,6 +395,9 @@ instance is baked in.  You can provide your own solr instance to test against an
 "solr_url", and the tests will use it. Otherwise, tests will
 use a mocked up Solr instance.
+To make a pull request, please make a feature branch *created from the master branch*, not from an existing feature branch. (If you need to do a feature branch dependent on an existing not-yet merged feature branch... discuss
+this with other developers first!)
 Pull requests should come with tests, as well as docs where applicable. Docs can be inline rdoc-style, edits to this README,
 and/or extra files in ./docs -- as appropriate for what needs to be docs.
@@ -364,8 +411,9 @@ and/or extra files in ./docs -- as appropriate for what needs to be docs.
   * Either way, all optional/configurable of course. based
     on Settings.
-* Command line code. It's only 150 lines, but it's kind of messy
-jammed into one file *and lacks tests*. I couldn't figure out
-what to do with it or how to test it. Needs a bit of love.
+* CommandLine class isn't covered by tests -- it's written using functionality
+from Indexer and other classes that are well-covered, but the CommandLine itself
+probably needs some tests -- especially covering error handling, which probably
+needs a bit more attention and using exceptions instead of exits, etc.
 * Optional built-in jetty stop/start to allow indexing to Solr that wasn't running before. maybe https://github.com/projecthydra/jettywrapper ?
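
Note on the README hunk that switches to `MarcExtractor.cached("700a").extract(record)`: the idea of "globally caching" an extractor per spec string can be illustrated with the minimal sketch below. This is not traject's implementation, which is not shown in this section. `SpecExtractor` is a hypothetical stand-in class, and the mutex-guarded hash is an assumption chosen only to show the pattern of building an object once per spec and reusing it on every record.

```ruby
# Hypothetical stand-in for an extractor that is expensive to build
# (for example because it has to parse its spec string).
class SpecExtractor
  @cache = {}
  @lock  = Mutex.new

  attr_reader :spec

  def initialize(spec)
    @spec = spec
  end

  # Return the one shared instance for this spec, creating it on first use.
  # The mutex keeps concurrent callers from building duplicates.
  def self.cached(spec)
    @lock.synchronize { @cache[spec] ||= new(spec) }
  end
end

first  = SpecExtractor.cached("700a")
second = SpecExtractor.cached("700a")
puts first.equal?(second)
```

Inside a `to_field` block that runs once per record, a lookup like this avoids re-parsing the same spec for every record.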

data/bench/bench.rb ADDED

@@ -0,0 +1,30 @@
+#!/usr/bin/env jruby
+$:.unshift File.expand_path('../../lib', __FILE__)
+require 'traject/command_line'
+require 'benchmark'
+unless ARGV.size >= 2
+  STDERR.puts "\n     Benchmark two (or more) different config files with both 0 and 3 threads against the given marc file\n"
+  STDERR.puts "\n     Usage:"
+  STDERR.puts "         jruby --server bench.rb config1.rb config2.rb [...configN.rb] filename.mrc\n\n"
+  exit
+end
+filename = ARGV.pop
+config_files = ARGV
+puts RUBY_DESCRIPTION
+Benchmark.bmbm do |x|
+  [0, 3].each do |threads|
+    config_files.each do |cf|
+      x.report("#{cf} (#{threads})") do
+        cmdline = Traject::CommandLine.new(["-c", cf, '-s', 'log.file=bench.log', '-s', "processing_thread_pool=#{threads}", filename])
+        cmdline.execute
+      end
+    end
+  end
+end

data/bin/traject CHANGED

@@ -1,7 +1,5 @@
 #!/usr/bin/env ruby
-require 'slop'
 # If we're loading from source instead of a gem, rubygems
 # isn't setting load paths for us, so we need to set it ourselves
@@ -10,172 +8,9 @@ unless $LOAD_PATH.include? self_load_path
   $LOAD_PATH << self_load_path
 end
-require 'traject'
-require 'traject/indexer'
-orig_argv = ARGV.dup
-opts = Slop.new(:strict => true) do
-  banner "traject [options] -c configuration.rb [-c config2.rb] file.mrc"
-  on 'v', 'version', "print version information to stderr"
-  on 'd', 'debug', "Include debug log, -s log.level=debug"
-  on 'h', 'help', "print usage information to stderr"
-  on 'c', 'conf', 'configuration file path (repeatable)', :argument => true, :as => Array
-  on :s, :setting, "settings: `-s key=value` (repeatable)", :argument => true, :as => Array
-  on :r, :reader, "Set reader class, shortcut for `-s reader_class_name=*`", :argument => true
-  on :w, :writer, "Set writer class, shortcut for `-s writer_class_name=*`", :argument => true
-  on :u, :solr, "Set solr url, shortcut for `-s solr.url=*`", :argument => true
-  on :j, "output as pretty printed json, shortcut for `-s writer_class_name=JsonWriter -s json_writer.pretty_print=true`"
-  on :t, :marc_type, "xml, json or binary. shortcut for `-s marc_source.type=*`", :argument => true
-  on :I, "load_path", "append paths to ruby $LOAD_PATH", :argument => true, :as => Array, :delimiter => ":"
-  on :g, "gemfile", "run with bundler and optionally specified Gemfile", :argument => :optional, :default => ""
-end
-begin
-  opts.parse!
-rescue Slop::Error => e
-  $stderr.puts "Error: #{e.message}"
-  $stderr.puts "Exiting..."
-  $stderr.puts
-  $stderr.puts opts.help
-  exit 1
-end
-options = opts.to_hash
-if options[:version]
-  $stderr.puts "traject version #{Traject::VERSION}"
-  exit 1
-end
-if options[:help]
-  $stderr.puts opts.help
-  exit 1
-end
-# have to use Slop object to tell diff between
-# no arg supplied and no option -g given at all
-if opts.present? :gemfile
-  if options[:gemfile]
-    # tell bundler what gemfile to use
-    gem_path = File.expand_path( options[:gemfile] )
-    # bundler not good at error reporting, we check ourselves
-    unless File.exists? gem_path
-      $stderr.puts "Gemfile `#{options[:gemfile]}` does not exist, exiting..."
-      $stderr.puts
-      $stderr.puts opts.help
-      exit 2
-    end
-    ENV["BUNDLE_GEMFILE"] = gem_path
-  end
-  require 'bundler/setup'
-end
-settings = {}
-(options[:setting] || []).each do |setting_pair|
-  if setting_pair =~ /\A([^=]+)\=([^=]*)\Z/
-    key, value = $1, $2
-    settings[key] = value
-  else
-    $stderr.puts "Unrecognized setting argument '#{setting_pair}':"
-    $stderr.puts "Should be of format -s key=value"
-    exit 3
-  end
-end
-if options[:debug]
-  settings["log.level"] = "debug"
-end
-if options[:writer]
-  settings["writer_class_name"] = options[:writer]
-end
-if options[:reader]
-  settings["reader_class_name"] = options[:reader]
-end
-if options[:solr]
-  settings["solr.url"] = options[:solr]
-end
-if options[:j]
-  settings["writer_class_name"] = "JsonWriter"
-  settings["json_writer.pretty_print"] = "true"
-end
-if options[:marc_type]
-  settings["marc_source.type"] = options[:marc_type]
-end
-(options[:load_path] || []).each do |path|
-  $LOAD_PATH << path unless $LOAD_PATH.include? path
-end
-indexer = Traject::Indexer.new
-indexer.settings( settings )
-unless options[:conf] && options[:conf].length > 0
-  $stderr.puts "Error: Missing required configuration file"
-  $stderr.puts "Exiting..."
-  $stderr.puts
-  $stderr.puts opts.help
-  exit 2
-end
-options[:conf].each do |conf_path|
-  begin
-    indexer.instance_eval(File.open(conf_path).read, conf_path)
-  rescue Errno::ENOENT => e
-    $stderr.puts "Could not find configuration file '#{conf_path}', exiting..."
-    exit 2
-  rescue Exception => e
-    $stderr.puts "Could not parse configuration file '#{conf_path}'"
-    $stderr.puts "  #{e.message}"
-    if e.backtrace.first =~ /\A(.*)\:in/
-      $stderr.puts "  #{$1}"
-    end
-    exit 3
-  end
-end
-## SAFE TO LOG STARTING HERE.
-#
-#  Shoudln't log before config files are read above, because
-#  config files set up logger
-##############
-indexer.logger.info("executing with arguments: `#{orig_argv.join(' ')}`")
-# ARGF might be perfect for this, but problems with it include:
-# * jruby is broken, no way to set it's encoding, leads to encoding errors reading non-ascii
-#   https://github.com/jruby/jruby/issues/891
-# * It's apparently not enough like an IO object for at least one of the ruby-marc XML
-#   readers:
-#   NoMethodError: undefined method `to_inputstream' for ARGF:Object
-#      init at /Users/jrochkind/.gem/jruby/1.9.3/gems/marc-0.5.1/lib/marc/xml_parsers.rb:369
-#
-# * It INSISTS on reading from ARGFV, making it hard to test, or use when you want to give
-#   it a list of files on something other than ARGV.
-#
-# So for now we do just one file, or stdin if none given. Sorry!
-if ARGV.length > 1
-  $stderr.puts "Sorry, traject can only handle one input file at a time right now. `#{ARGV}` Exiting..."
-  exit 1
-end
-if ARGV.length == 0
-  indexer.logger.info "Reading from STDIN..."
-  io = $stdin
-else
-  indexer.logger.info "Reading from #{ARGV.first}"
-  io = File.open(ARGV.first, 'r')
-end
+require 'traject/command_line'
-result = indexer.process(io)
+cmdline = Traject::CommandLine.new(ARGV)
+result = cmdline.execute
-exit 1 unless result # non-zero exit status on process telling us there's problems.
+exit 1 unless result # non-zero exit status on process telling us there's problems.
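
Note on the removed `-s key=value` handling: the parsing rule from the deleted lines can be sketched as a standalone method, which is also easier to test than the old inline script. The regular expression is copied from the removed code. The method name `parse_settings` and the choice to raise `ArgumentError` instead of calling `exit 3` are assumptions for illustration; how `Traject::CommandLine` actually does this is not shown in this section.

```ruby
# Pattern copied from the removed bin/traject code: a key with no "=",
# then one "=", then a (possibly empty) value with no "=".
SETTING_PATTERN = /\A([^=]+)\=([^=]*)\Z/

# Hypothetical helper: turn ["key=value", ...] into {"key" => "value", ...}.
def parse_settings(pairs)
  pairs.each_with_object({}) do |pair, settings|
    match = SETTING_PATTERN.match(pair)
    unless match
      raise ArgumentError,
            "Unrecognized setting argument '#{pair}': should be of format key=value"
    end
    settings[match[1]] = match[2]
  end
end

p parse_settings(["log.level=debug", "solr.url=http://example.com/solr"])
```

Raising rather than exiting lets the caller decide on the exit code, which matches the TODO in the README hunk about using exceptions instead of exits.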

data/doc/batch_execution.md ADDED

@@ -0,0 +1,177 @@
+# Hints for running traject as a batch job
+Maybe as a cronjob. Maybe via a batch shell script that executes
+traject, and maybe even pipelines it together with other commands.
+These are things you might want to do with traject. Some potential problem points
+with suggested solutions, and additional hints.
+## Ruby version setting
+traject ordinarily needs to run under jruby. You will
+ordinarily have jruby installed under a ruby version switcher -- we
+highly recommend [chruby](https://github.com/postmodern/chruby) over other choices,
+but other popular choices include rvm and rbenv.
+Remember that traject needs to run in 1.9.x mode in jruby. With
+jruby 1.7.x or later this should be the default, so we recommend
+you use jruby 1.7.x.
+Especially when running under a cron job, it can be difficult to
+set things up so traject runs under jruby.
+It can sometimes be useful to create a wrapper script for traject
+that takes care of making sure it's running under the right ruby
+version.
+### for chruby
+Simply run with:
+    chruby-exec jruby -- traject {other arguments}
+Whether specifying that directly in a crontab, or in a shell script
+that needs to call traject, etc. So simple you might not need
+a wrapper script, but it might still be convenient to create one. Say
+you put a `jruby-traject` at `/usr/local/bin/jruby-traject`, that
+looks like this:
+    #!/usr/bin/env bash
+    chruby-exec jruby -- traject "$@"
+Now any account, in a crontab, in an interactive shell, wherever,
+can just execute `jruby-traject {arguments}`, and execute traject
+in a jruby environment.
+### for rbenv
+If running in an interactive shell that has had rbenv set up for
+it, you can use rbenv's standard mechanism to say to execute
+something in jruby:
+    RBENV_VERSION=jruby-1.7.2 traject {args}
+You do need to specify the exact version of jruby, I don't think
+there's any way to say 'latest installed jruby'. You could do the
+same thing for any batch scripts you're writing -- just have
+them set that `RBENV_VERSION` environment variable before
+executing traject.
+If you're running inside a cronjob, things get a bit trickier,
+because rbenv isn't normally set up in the limited environment
+of cron tasks. One way to deal with this is to have your
+cronjob explicitly execute in a bash login shell, that
+will then have rbenv set up so long as it's running
+under an account with rbenv set up properly!
+    # in a cronfile
+    # 10 * * * * /bin/bash -l -c 'RBENV_VERSION=jruby-1.7.2 traject {args}'
+(Better way? Doc pull requests welcome.)
+### for rvm
+See rvm's [own docs on use with cron](http://rvm.io/integration/cron), it gets a bit confusing.
+But here's one way, using a wrapper script. It does require you to
+identify and hard-code in where your rvm is installed, and exactly which
+version of jruby you want to execute with (will have to be updated if you upgrade
+jruby). (Is there a better way? Doc pull requests welcome! rvm confuses me!)
+Make a file at `/usr/local/bin/jruby-traject` that looks like this:
+~~~bash
+#!/usr/bin/env bash
+# load rvm ruby
+source /home/MY_ACCT/.rvm/environments/jruby-1.7.3
+traject "$@"
+~~~
+You have to use your actual account rvm is installed in for MY_ACCT.
+Or, if you have a global install of rvm instead of a user-account one,
+it might be at `/usr/local/rvm/environments`... instead.
+Now any account, in a crontab, in an interactive shell, wherever,
+can just execute `jruby-traject {arguments}`, and execute traject
+in a jruby environment.
+## Exit codes
+Traject tries to always return a well-behaved unix exit code -- 0 for success,
+non-0 for error.
+You should be able to rely on this in your batch bash scripts: if you want to abort
+further processing when traject has failed for some reason, check traject's
+exit code.
+If an uncaught exception happens, traject will return non-0.
+There are some kinds of errors which prevent traject from indexing
+one or more records, but traject may still continue processing
+the other records. If any records have been skipped in this way,
+traject will _also_ return a non-0 failure exit code. (Is this good?
+Does it need to be configurable?)
+In these cases, information about errors that led to skipped records should
+be output as ERROR level in the logs.
+## Logs and Error Reporting
+By default, traject outputs all logging to stderr.  This is often just what
+you want for a batch or automated process, where there might be some wrapper
+script which captures stderr and puts it where you want it.
+However, it's easy enough to tell traject to log somewhere else. Either on
+the command-line:
+    traject -s log.file=/some/other/file/log {other args}
+Or in a traject configuration file, setting the `log.file` configuration setting.
+### Separate error log
+You can also separately have a duplicate log file created with ONLY log messages of
+level ERROR and higher (meaning ERROR and FATAL), with the `log.error_file` setting.
+Then, if there are any lines in this error log file at all, you know something bad
+happened, maybe your batch process needs to notify someone, or abort further
+steps in the batch process.
+    traject -s log.file=/var/log/traject.log -s log.error_file=/var/log/traject_error.log {more args}
+The error lines will be in the main log file, and also duplicated in the error
+log file.
+### Completely customizable logging with yell
+Traject uses the [yell](https://github.com/rudionrails/yell) gem for logging.
+You can configure the logger directly to implement whatever crazy logging rules you might
+want, so long as yell supports them. But yell is pretty flexible.
+Recall that traject config files are just ruby, executed in the context
+of a Traject::Indexer. You can set the Indexer's `logger` to a yell logger
+object you configure yourself however you like:
+~~~ruby
+  # inside a traject configuration file
+  logger = Yell.new do |l|
+    l.level = 'gte.info' # will only pass :info and above to the adapters
+    l.adapter :datefile, 'production.log', level: 'lte.warn' # anything lower or equal to :warn
+    l.adapter :datefile, 'error.log', level: 'gte.error' # anything greater or equal to :error
+  end
+~~~
+See the [yell](https://github.com/rudionrails/yell) docs for more; you can
+do whatever yell supports, just write ruby.
+### Bundler
+For automated batch execution, we recommend you consider using
+bundler to manage any gem dependencies. See the [Extending
+With Your Own Code](./extending.md) traject docs for
+information on how traject integrates with bundler.
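
Note on the "Exit codes" and "Separate error log" sections of batch_execution.md: a batch wrapper could combine both signals roughly as sketched below. The method name `batch_status` and the returned symbols are hypothetical. The only behaviour taken from the document is that a non-zero exit means failure and that any content in the `log.error_file` means something went wrong.

```ruby
# Hypothetical wrapper: run a command, then inspect the error log.
# Returns :failed if the command exits non-zero or cannot be started,
# :errors_logged if it succeeded but the error log is non-empty,
# and :ok otherwise.
def batch_status(command, error_log_path)
  succeeded = system(*command) # true on exit 0, false or nil otherwise
  return :failed unless succeeded
  # File.size? is nil for a missing or empty file.
  return :errors_logged if File.size?(error_log_path)
  :ok
end

# Example call (placeholder paths, not executed here):
#   batch_status(
#     ["traject", "-c", "conf.rb",
#      "-s", "log.error_file=/var/log/traject_error.log", "file.mrc"],
#     "/var/log/traject_error.log"
#   )
```

Since the document says traject also exits non-zero when records are skipped, the error log check is mostly a second line of defence; a job could use it to decide whether to notify someone.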