RubyGems - traject - Versions diffs - 0.9.1 → 0.10.0 - Mend

traject 0.9.1 → 0.10.0

Files changed (36)

data/.travis.yml +7 -0
data/Gemfile +5 -1
data/README.md +65 -17
data/bench/bench.rb +30 -0
data/bin/traject +4 -169
data/doc/batch_execution.md +177 -0
data/doc/extending.md +182 -0
data/doc/other_commands.md +49 -0
data/doc/settings.md +6 -2
data/lib/traject.rb +1 -0
data/lib/traject/command_line.rb +296 -0
data/lib/traject/debug_writer.rb +28 -0
data/lib/traject/indexer.rb +84 -20
data/lib/traject/indexer/settings.rb +9 -1
data/lib/traject/json_writer.rb +15 -38
data/lib/traject/line_writer.rb +59 -0
data/lib/traject/macros/marc21.rb +10 -5
data/lib/traject/macros/marc21_semantics.rb +57 -25
data/lib/traject/marc4j_reader.rb +9 -26
data/lib/traject/marc_extractor.rb +121 -48
data/lib/traject/mock_reader.rb +87 -0
data/lib/traject/mock_writer.rb +34 -0
data/lib/traject/solrj_writer.rb +1 -22
data/lib/traject/util.rb +107 -1
data/lib/traject/version.rb +1 -1
data/lib/traject/yaml_writer.rb +9 -0
data/test/debug_writer_test.rb +38 -0
data/test/indexer/each_record_test.rb +27 -2
data/test/indexer/macros_marc21_semantics_test.rb +12 -1
data/test/indexer/settings_test.rb +9 -2
data/test/indexer/to_field_test.rb +35 -5
data/test/marc4j_reader_test.rb +3 -0
data/test/marc_extractor_test.rb +94 -20
data/test/test_support/demo_config.rb +6 -3
data/traject.gemspec +1 -2
metadata +17 -20

data/doc/extending.md ADDED

@@ -0,0 +1,182 @@
+# Extending With Your Own Code
+Beyond very simple logic, you'll want to write your own ruby code,
+organize it in files other than traject config files, but then
+use it in traject config files.
+You might want to have code local to your traject project; or you
+might want to use ruby gems with shared code in your traject project.
+A given project may use both of these techniques.
+Here are some suggestions for how to do this, along with mention
+of a couple of traject features meant to make it easier.
+## Expert Summary
+* The traject `-I` command line argument can be used to list directories to
+  add to the load path, similar to the `ruby -I` argument. You
+  can then 'require' local project files from the load path.
+  * translation map files found on the load path or in a
+    "./translation_maps" subdir on the load path will be found
+    for Traject translation maps.
+* The traject `-g` command line argument can be used to tell traject to use
+  bundler with a `Gemfile` located in the current working directory
+  (or give it an argument, as in `-g ./some/myGemfile`).
+## Custom code local to your project
+You might want local translation maps, or local ruby
+code. Here's a standard way you might lay out
+this extra code in the file system, using a 'lib'
+directory kept next to your traject config files:
+~~~
+- my_traject/
+  * config_file.rb
+  - lib/
+    * my_macros.rb
+    * my_utility.rb
+    - translation_maps/
+      * my_map.yaml
+~~~
+The `my_macros.rb` file might contain a simple [macro](./macros.md)
+in a module called `MyMacros`.
+The `my_utility.rb` file might contain, say, a module of utility
+methods, `MyUtility.some_utility`, etc.
+To refer to ruby code from another file, we use the standard
+ruby `require` statement to bring in the files:
+~~~ruby
+# config_file.rb
+require 'my_macros'
+require 'my_utility'
+# Now that MyMacros is available, extend it into the indexer,
+# and use it:
+extend MyMacros
+to_field "title", my_some_macro
+# And likewise, we can use our utility methods:
+to_field "title" do |record, accumulator, context|
+  accumulator << MyUtility.some_utility(record)
+end
+~~~
+**But wait!** This won't work yet, because ruby won't be
+able to find the file in `require 'my_macros'`. To fix
+that, we want to add our local `lib` directory to the
+ruby `$LOAD_PATH`, a standard ruby feature.
+Traject provides a way for you to add to the load path
+from the traject command line, the `-I` flag:
+    traject -I ./lib -c ./config_file.rb ...
+Or, you can hard-code a `$LOAD_PATH` change directly in your
+config file. You'll have to use some weird looking
+ruby code to create a file path relative to the current
+file (the config_file.rb), and then make sure it's
+an absolute path. (Should we add a traject utility
+method for this?)
+~~~ruby
+# at top of config_file.rb...
+$LOAD_PATH.unshift File.expand_path(File.join(File.dirname(__FILE__), './lib'))
+~~~
+That's pretty much it!
+What about that translation map? The `$LOAD_PATH` modification
+took care of that too: Traject::TranslationMap will look
+up translation map definition files on the load path, or
+in a `./translation_maps` subdir on the load path.
+## Using gems in your traject project
+If there is certain logic that is common between (traject or other)
+projects, it makes sense to put it in a ruby gem.
+We won't go into detail about creating ruby gems, but we
+do recommend you use the `bundle gem my_gem_name` command to create
+a skeleton of your gem
+([one tutorial here](http://railscasts.com/episodes/245-new-gem-with-bundler?view=asciicast)).
+This will also make available rake commands to install your gem locally
+(`rake install`), or release it to the rubygems server (`rake release`).
+There are two main methods to use a gem in your traject project:
+with straight rubygems, or with bundler.
+Going without bundler is simpler. Simply `gem install some_gem` from the
+command line, and now you can `require` that gem in your traject
+config file, and use what it provides:
+~~~ruby
+#some_traject_config.rb
+require 'some_gem'
+SomeGem.whatever!
+~~~
+Any gem can provide traject translation map definitions
+in its `lib` directory, or in a `lib/translation_maps`
+sub-directory, and traject will be able to find those
+translation maps when the gem is loaded. (Because gems'
+`./lib` directories are added to the ruby load path.)
+### Or, with bundler:
+However, if you then move your traject project to another system,
+where you haven't yet installed the `some_gem`, then running
+traject with this config file will, of course, fail. Or if you
+move your traject project to another system with a slightly
+different version of `some_gem`, your traject indexing could
+behave differently in confusing ways. As the number of gems
+you are using increases, managing this gets increasingly
+confusing.
+[bundler](http://bundler.io/) was invented to make this kind of dependency management
+more straightforward and reliable. We recommend you consider using
+bundler, especially for traject installations where traject will
+be run via automated batch jobs on production servers.
+Bundler's behavior is based on a `Gemfile` that lists your
+project dependencies. You can create a starter skeleton
+by running `bundle init`, probably in the directory
+right next to your traject config files.
+Then specify what gems your traject project will use,
+possibly with version restrictions, in the [Gemfile](http://bundler.io/v1.3/gemfile.html).
+Run `bundle install` from the directory with the Gemfile, on any system
+at any time, to make sure specified gems are installed.
+**Run traject** with the `-g` flag to tell it to use the Gemfile:
+    traject -g -c some_traject_config.rb ...
+Traject will use bundler to setup with the Gemfile, making sure
+the specified versions of all gems are used (and also making sure
+no gems except those specified in the gemfile are available to
+the program).
+You should still `require` the gem in your traject config file,
+then just refer to what it provides in your config code as usual.
+You should check both the `Gemfile` and the `Gemfile.lock`
+that bundler creates into your source control repo. The
+`Gemfile.lock` specifies _exactly_ what versions of
+gem dependencies are currently being used, so you can get the exact
+same dependency environment on different servers.
+See the [bundler documentation](http://bundler.io/#getting-started), or google, for more information.
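
The load-path step described in `doc/extending.md` above can be sketched in plain Ruby. This is a minimal illustration, not traject code: `add_local_lib` is a hypothetical helper name, and the temporary directory only stands in for a `my_traject/` project with a `lib/my_utility.rb` file.

```ruby
require 'tmpdir'

# Hypothetical helper (not part of traject): prepend a project-local
# `lib` directory to Ruby's load path, resolved relative to base_dir.
def add_local_lib(base_dir)
  lib_dir = File.expand_path('lib', base_dir)
  $LOAD_PATH.unshift(lib_dir) unless $LOAD_PATH.include?(lib_dir)
  lib_dir
end

# Demonstration in a throwaway directory that mirrors the documented layout.
Dir.mktmpdir do |project|
  Dir.mkdir(File.join(project, 'lib'))
  File.write(File.join(project, 'lib', 'my_utility.rb'), <<~RUBY)
    module MyUtility
      def self.some_utility(record)
        record.to_s.upcase
      end
    end
  RUBY

  add_local_lib(project)
  require 'my_utility'   # found because lib/ is now on $LOAD_PATH
  puts MyUtility.some_utility('title')
end
```

In a real config file the equivalent call would presumably be `add_local_lib(File.dirname(__FILE__))` at the top of the file, which has the same effect as passing `-I ./lib` on the command line.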

data/doc/other_commands.md ADDED

@@ -0,0 +1,49 @@
+# Other traject command-line commands
+The traject command line supports a few other miscellaneous commands with
+the "-x command" switch. The usual traject command line is actually
+the `process` command, `traject -x process ...` is the same as leaving out
+the `-x process`.
+## Commit
+`traject -x commit` will send a 'commit' message to the Solr server
+specified in setting `solr.url`.  Other parts of configuration will
+be ignored, but don't hurt.
+    traject -x commit -s solr.url=http://some.com/solr
+Or with a config file that includes a solr.url setting:
+    traject -x commit -c config_file.rb
+## marcout
+The `marcout` command will skip all processing/mapping, and simply
+serialize marc out to a file stream.
+This is mainly useful when you're using a custom reader to read
+marc from a database or something, but could also be used to
+convert marc from one format to another or something.
+Will write to stdout, or set the `output_file` setting (`-o` shortcut).
+Set the `marcout.type` setting to 'xml' or 'binary' for type of output.
+Or to `human` for human readable display of marc (that is not meant for
+machine readability, but can be good for manual diagnostics.)
+If outputting type binary, set `marcout.allow_oversized` to
+true or false (boolean or string) to pass that to the MARC::Writer.
+If set to true, then oversized MARC records can still be serialized,
+with length bytes zero'd out -- technically illegal, but can
+be read by MARC::Reader in permissive mode.
+As the standard Marc4JReader always converts to UTF8,
+output will always be in UTF8. With the standard MARC readers,
+you do need to set the `marc_source.type` setting to `xml`
+for xml input.
+~~~bash
+traject -x marcout somefile.marc -o output.xml -s marcout.type=xml
+traject -x marcout -s marc_source.type=xml somefile.xml -c configuration.rb
+~~~
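
Judging from `command_commit!` in `lib/traject/command_line.rb` in this same diff, the `commit` command appears to issue a GET against `<solr.url>/update?commit=true`. The sketch below only builds that URL. `solr_commit_url` is a hypothetical name, and stripping a trailing slash is an addition that the released code does not perform.

```ruby
# Hypothetical helper (not traject API): build the URL that a
# `traject -x commit` style request would target for a given solr.url.
def solr_commit_url(solr_url)
  base = solr_url.to_s.strip
  raise ArgumentError, 'No solr.url setting provided' if base.empty?

  # Dropping one trailing slash is an assumption added here, to avoid "//update".
  "#{base.chomp('/')}/update?commit=true"
end

puts solr_commit_url('http://some.com/solr')
```

Keeping the URL construction separate from the HTTP call makes it possible to check the target before anything is sent to a production Solr.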

data/doc/settings.md CHANGED

@@ -46,14 +46,18 @@ for commonly used settings, see `traject -h`.
 * `marc_source.type`: default 'binary'. Can also set to 'xml' or (not yet implemented todo) 'json'. Command line shortcut `-t`
-* `marc4j_reader.jar_dir`:   Path to a directory containing Marc4J jar file to use. All .jar's in dir will
-                           be loaded. If unset, uses marc4j.jar bundled with traject.
+* `marc4j.jar_dir`:   Path to a directory containing Marc4J jar file to use. All .jar's in dir will
+                      be loaded. If unset, uses marc4j.jar bundled with traject.
 * `marc4j_reader.permissive`: Used by Marc4JReader only when marc.source_type is 'binary', boolean, argument to the underlying MarcPermissiveStreamReader. Default true.
 * `marc4j_reader.source_encoding`: Used by Marc4JReader only when marc.source_type is 'binary', encoding strings accepted
   by marc4j MarcPermissiveStreamReader. Default "BESTGUESS", also "UTF-8", "MARC"
+* `output_file`: Output file to write to for operations that write to files: For instance the `marcout` command,
+                 or Writer classes that write to files, like Traject::JsonWriter. Has a shortcut
+                 `-o` on command line.
 * `processing_thread_pool` Default 3. Main thread pool used for processing records with input rules. Choose a
    pool size based on size of your machine, and complexity of your indexing rules.
    Probably no reason for it ever to be more than number of cores on indexing machine.

data/lib/traject.rb CHANGED

@@ -1,6 +1,7 @@
 require "traject/version"
 require 'traject/indexer'
+require 'traject/util'
 require 'traject/macros/basic'
 require 'traject/macros/marc21'

data/lib/traject/command_line.rb ADDED

@@ -0,0 +1,296 @@
+require 'slop'
+require 'traject'
+require 'traject/indexer'
+module Traject
+  # The class that executes for the Traject command line utility.
+  #
+  # Warning, does do things like exit entire program on error at present.
+  # You probably don't want to use this class for anything but an actual
+  # shell command line, if you want to execute indexing directly, just
+  # use the Traject::Indexer directly.
+  #
+  # A CommandLine object has a single persistent Indexer object it uses
+  class CommandLine
+    # orig_argv is the original one passed in, remaining_argv is after destructive
+    # processing by slop, still has file args in it etc.
+    attr_accessor :orig_argv, :remaining_argv
+    attr_accessor :slop, :options
+    attr_accessor :indexer
+    attr_accessor :console
+    def initialize(argv=ARGV)
+      self.console = $stderr
+      self.orig_argv      = argv.dup
+      self.remaining_argv = argv
+      self.slop    = create_slop!
+      self.options = parse_options(self.remaining_argv)
+    end
+    # Returns true on success or false on failure; may also raise exceptions;
+    # may also exit program directly itself (yeah, could use some normalization)
+    def execute
+      if options[:version]
+        self.console.puts "traject version #{Traject::VERSION}"
+        return
+      end
+      if options[:help]
+        self.console.puts slop.help
+        return
+      end
+      # have to use Slop object to tell diff between
+      # no arg supplied and no option -g given at all
+      if slop.present? :gemfile
+        require_bundler_setup(options[:gemfile])
+      end
+      (options[:load_path] || []).each do |path|
+        $LOAD_PATH << path unless $LOAD_PATH.include? path
+      end
+      arg_check!
+      self.indexer = initialize_indexer!
+      ######
+      # SAFE TO LOG to indexer.logger starting here, after indexer is set up from conf files
+      # with logging config.
+      #####
+      indexer.logger.info("traject executing with: `#{orig_argv.join(' ')}`")
+      # Okay, actual command process! All command_ methods should return true
+      # on success, or false on failure.
+      result =
+        case options[:command]
+        when "process"
+          indexer.process get_input_io(self.remaining_argv)
+        when "marcout"
+          command_marcout! get_input_io(self.remaining_argv)
+        when "commit"
+          command_commit!
+        else
+          raise ArgumentError.new("Unrecognized traject command: #{options[:command]}")
+        end
+      return result
+    end
+    def command_commit!
+      require 'open-uri'
+      raise ArgumentError.new("No solr.url setting provided") if indexer.settings['solr.url'].to_s.empty?
+      url = "#{indexer.settings['solr.url']}/update?commit=true"
+      indexer.logger.info("Sending commit to: #{url}")
+      indexer.logger.info(  open(url).read )
+      return true
+    end
+    def command_marcout!(io)
+      require 'marc'
+      output_type = indexer.settings["marcout.type"].to_s
+      output_type = "binary" if output_type.empty?
+      output_arg      = unless indexer.settings["output_file"].to_s.empty?
+        indexer.settings["output_file"]
+      else
+        $stdout
+      end
+      case output_type
+      when "binary"
+        writer = MARC::Writer.new(output_arg)
+        allow_oversized = indexer.settings["marcout.allow_oversized"]
+        if allow_oversized
+          allow_oversized = (allow_oversized.to_s == "true")
+          writer.allow_oversized = allow_oversized
+        end
+      when "xml"
+        writer = MARC::XMLWriter.new(output_arg)
+      when "human"
+        writer = output_arg.kind_of?(String) ? File.open(output_arg, "w:binary") : output_arg
+      else
+        raise ArgumentError.new("traject marcout unrecognized marcout.type: #{output_type}")
+      end
+      reader      = indexer.reader!(io)
+      reader.each do |record|
+        writer.write record
+      end
+      writer.close
+      return true
+    end
+    def get_input_io(argv)
+      # ARGF might be perfect for this, but problems with it include:
+      # * jruby is broken, no way to set its encoding, leads to encoding errors reading non-ascii
+      #   https://github.com/jruby/jruby/issues/891
+      # * It's apparently not enough like an IO object for at least one of the ruby-marc XML
+      #   readers:
+      #   NoMethodError: undefined method `to_inputstream' for ARGF:Object
+      #      init at /Users/jrochkind/.gem/jruby/1.9.3/gems/marc-0.5.1/lib/marc/xml_parsers.rb:369
+      #
+      # * It INSISTS on reading from ARGV, making it hard to test, or use when you want to give
+      #   it a list of files on something other than ARGV.
+      #
+      # So for now we do just one file, or stdin if none given. Sorry!
+      if argv.length > 1
+        self.console.puts "Sorry, traject can only handle one input file at a time right now. `#{argv}` Exiting..."
+        exit 1
+      end
+      if argv.length == 0
+        indexer.logger.info "Reading from STDIN..."
+        io = $stdin
+      else
+        indexer.logger.info "Reading from #{argv.first}"
+        io = File.open(argv.first, 'r')
+      end
+      return io
+    end
+    def load_configuration_files!(my_indexer, conf_files)
+      conf_files.each do |conf_path|
+        begin
+          my_indexer.instance_eval(File.open(conf_path).read, conf_path)
+        rescue Errno::ENOENT => e
+          self.console.puts "Could not find configuration file '#{conf_path}', exiting..."
+          exit 2
+        rescue Exception => e
+          self.console.puts "Could not parse configuration file '#{conf_path}'"
+          self.console.puts "  #{e.message}"
+          if e.backtrace.first =~ /\A(.*)\:in/
+            self.console.puts "  #{$1}"
+          end
+          exit 3
+        end
+      end
+    end
+    def arg_check!
+      if options[:command] == "process" && (options[:conf].nil? || options[:conf].length == 0)
+        self.console.puts "Error: Missing required configuration file"
+        self.console.puts "Exiting..."
+        self.console.puts
+        self.console.puts self.slop.help
+        exit 2
+      end
+    end
+    # requires bundler/setup, optionally first setting ENV["BUNDLE_GEMFILE"]
+    # to tell bundler to use a specific gemfile. Gemfile arg can be relative
+    # to current working directory.
+    def require_bundler_setup(gemfile=nil)
+      if gemfile
+        # tell bundler what gemfile to use
+        gem_path = File.expand_path( options[:gemfile] )
+        # bundler not good at error reporting, we check ourselves
+        unless File.exists? gem_path
+          self.console.puts "Gemfile `#{options[:gemfile]}` does not exist, exiting..."
+          self.console.puts
+          self.console.puts slop.help
+          exit 2
+        end
+        ENV["BUNDLE_GEMFILE"] = gem_path
+      end
+      require 'bundler/setup'
+    end
+    def assemble_settings_hash(options)
+      settings = {}
+      # `-s key=value` command line
+      (options[:setting] || []).each do |setting_pair|
+        if setting_pair =~ /\A([^=]+)\=(.*)\Z/
+          key, value = $1, $2
+          settings[key] = value
+        else
+          self.console.puts "Unrecognized setting argument '#{setting_pair}':"
+          self.console.puts "Should be of format -s key=value"
+          exit 3
+        end
+      end
+      # other command line shortcuts for settings
+      if options[:debug]
+        settings["log.level"] = "debug"
+      end
+      if options[:writer]
+        settings["writer_class_name"] = options[:writer]
+      end
+      if options[:reader]
+        settings["reader_class_name"] = options[:reader]
+      end
+      if options[:solr]
+        settings["solr.url"] = options[:solr]
+      end
+      if options[:j]
+        settings["writer_class_name"] = "JsonWriter"
+        settings["json_writer.pretty_print"] = "true"
+      end
+      if options[:marc_type]
+        settings["marc_source.type"] = options[:marc_type]
+      end
+      if options[:output_file]
+        settings["output_file"] = options[:output_file]
+      end
+      return settings
+    end
+    def create_slop!
+      return Slop.new(:strict => true) do
+        banner "traject [options] -c configuration.rb [-c config2.rb] file.mrc"
+        on 'v', 'version', "print version information to stderr"
+        on 'd', 'debug', "Include debug log, -s log.level=debug"
+        on 'h', 'help', "print usage information to stderr"
+        on 'c', 'conf', 'configuration file path (repeatable)', :argument => true, :as => Array
+        on :s, :setting, "settings: `-s key=value` (repeatable)", :argument => true, :as => Array
+        on :r, :reader, "Set reader class, shortcut for -s reader_class_name=", :argument => true
+        on :o, "output_file", "output file for Writer classes that write to files", :argument => true
+        on :w, :writer, "Set writer class, shortcut for -s writer_class_name=", :argument => true
+        on :u, :solr, "Set solr url, shortcut for -s solr.url=", :argument => true
+        on :j, "output as pretty printed json, shortcut for -s writer_class_name=JsonWriter -s json_writer.pretty_print=true"
+        on :t, :marc_type, "xml, json or binary. shortcut for -s marc_source.type=", :argument => true
+        on :I, "load_path", "append paths to ruby $LOAD_PATH", :argument => true, :as => Array, :delimiter => ":"
+        on :g, "gemfile", "run with bundler and optionally specified Gemfile", :argument => :optional, :default => ""
+        on :x, "command", "alternate traject command: process (default); marcout; commit", :argument => true, :default => "process"
+      end
+    end
+    def initialize_indexer!
+      indexer = Traject::Indexer.new self.assemble_settings_hash(self.options)
+      load_configuration_files!(indexer, options[:conf])
+      return indexer
+    end
+    def parse_options(argv)
+      begin
+        self.slop.parse!(argv)
+      rescue Slop::Error => e
+        self.console.puts "Error: #{e.message}"
+        self.console.puts "Exiting..."
+        self.console.puts
+        self.console.puts slop.help
+        exit 1
+      end
+      return self.slop.to_hash
+    end
+  end
+end
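
The `-s key=value` handling in `assemble_settings_hash` above can be illustrated in isolation. This is a sketch under stated assumptions rather than the released code: `parse_setting_pairs` is a hypothetical name, and it raises `ArgumentError` instead of calling `exit 3`, so it can be exercised outside a shell command.

```ruby
# Hypothetical standalone version of the `-s key=value` parsing.
# Splits each pair on the FIRST "=", so values may themselves contain "=".
def parse_setting_pairs(pairs)
  pairs.each_with_object({}) do |pair, settings|
    match = /\A([^=]+)=(.*)\z/m.match(pair)
    unless match
      raise ArgumentError, "Unrecognized setting argument #{pair.inspect}, should be key=value"
    end

    # Later pairs overwrite earlier ones with the same key.
    settings[match[1]] = match[2]
  end
end

p parse_setting_pairs(['solr.url=http://some.com/solr?wt=json', 'log.level=debug'])
```

Because the key pattern is `[^=]+`, everything after the first `=` is kept as the value, which matters for settings such as URLs with query strings.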