traject 0.0.1
- data/.gitignore +18 -0
- data/Gemfile +4 -0
- data/LICENSE.txt +22 -0
- data/README.md +346 -0
- data/Rakefile +16 -0
- data/bin/traject +153 -0
- data/doc/macros.md +103 -0
- data/doc/settings.md +34 -0
- data/lib/traject.rb +10 -0
- data/lib/traject/indexer.rb +196 -0
- data/lib/traject/json_writer.rb +51 -0
- data/lib/traject/macros/basic.rb +9 -0
- data/lib/traject/macros/marc21.rb +145 -0
- data/lib/traject/marc_extractor.rb +206 -0
- data/lib/traject/marc_reader.rb +61 -0
- data/lib/traject/qualified_const_get.rb +30 -0
- data/lib/traject/solrj_writer.rb +120 -0
- data/lib/traject/translation_map.rb +184 -0
- data/lib/traject/version.rb +3 -0
- data/test/indexer/macros_marc21_test.rb +146 -0
- data/test/indexer/macros_test.rb +40 -0
- data/test/indexer/map_record_test.rb +120 -0
- data/test/indexer/read_write_test.rb +47 -0
- data/test/indexer/settings_test.rb +65 -0
- data/test/marc_extractor_test.rb +168 -0
- data/test/marc_reader_test.rb +29 -0
- data/test/solrj_writer_test.rb +106 -0
- data/test/test_helper.rb +28 -0
- data/test/test_support/hebrew880s.marc +1 -0
- data/test/test_support/manufacturing_consent.marc +1 -0
- data/test/test_support/test_data.utf8.marc.xml +2609 -0
- data/test/test_support/test_data.utf8.mrc +1 -0
- data/test/translation_map_test.rb +98 -0
- data/test/translation_maps/bad_ruby.rb +8 -0
- data/test/translation_maps/bad_yaml.yaml +1 -0
- data/test/translation_maps/both_map.rb +1 -0
- data/test/translation_maps/both_map.yaml +1 -0
- data/test/translation_maps/default_literal.rb +10 -0
- data/test/translation_maps/default_passthrough.rb +10 -0
- data/test/translation_maps/marc_040a_translate_test.yaml +1 -0
- data/test/translation_maps/ruby_map.rb +10 -0
- data/test/translation_maps/translate_array_test.yaml +8 -0
- data/test/translation_maps/yaml_map.yaml +7 -0
- data/traject.gemspec +30 -0
- data/vendor/solrj/README +8 -0
- data/vendor/solrj/build.xml +39 -0
- data/vendor/solrj/ivy.xml +16 -0
- data/vendor/solrj/lib/commons-codec-1.7.jar +0 -0
- data/vendor/solrj/lib/commons-io-2.1.jar +0 -0
- data/vendor/solrj/lib/httpclient-4.2.3.jar +0 -0
- data/vendor/solrj/lib/httpcore-4.2.2.jar +0 -0
- data/vendor/solrj/lib/httpmime-4.2.3.jar +0 -0
- data/vendor/solrj/lib/jcl-over-slf4j-1.6.6.jar +0 -0
- data/vendor/solrj/lib/jul-to-slf4j-1.6.6.jar +0 -0
- data/vendor/solrj/lib/log4j-1.2.16.jar +0 -0
- data/vendor/solrj/lib/noggit-0.5.jar +0 -0
- data/vendor/solrj/lib/slf4j-api-1.6.6.jar +0 -0
- data/vendor/solrj/lib/slf4j-log4j12-1.6.6.jar +0 -0
- data/vendor/solrj/lib/solr-solrj-4.3.1-javadoc.jar +0 -0
- data/vendor/solrj/lib/solr-solrj-4.3.1-sources.jar +0 -0
- data/vendor/solrj/lib/solr-solrj-4.3.1.jar +0 -0
- data/vendor/solrj/lib/wstx-asl-3.2.7.jar +0 -0
- data/vendor/solrj/lib/zookeeper-3.4.5.jar +0 -0
- metadata +264 -0
data/doc/macros.md
ADDED
@@ -0,0 +1,103 @@
# Traject Indexing 'Macros'

Traject macros are a way of providing re-usable index mapping rules. Before we discuss how they work, we need to remind ourselves of the basic/direct Traject `to_field` indexing method.

## Review and details of direct indexing logic

Here's the simplest possible direct Traject mapping logic, duplicating the effects of the `literal` function:

~~~ruby
to_field("title") do |record, accumulator, context|
  accumulator << "FIXED LITERAL"
end
~~~

That `do` is just ruby `block` syntax, whereby we can pass a block of ruby code as an argument to a ruby method. We pass a block taking three arguments, labelled `record`, `accumulator`, and `context`, to the `to_field` method.

The block is then stored by the Traject::Indexer, and called for each record indexed. When it's called, it's passed the particular record at hand as the first argument, an Array used as an 'accumulator' as the second argument, and a Traject::Indexer::Context as the third argument.

The code in the block can add values to the accumulator array, which the Traject::Indexer then adds to the field specified by `to_field`.

It's also worth pointing out that ruby blocks are `closures`, so they can "capture" and use values from outside the block. So this would work too:

~~~ruby
my_var = "FIXED LITERAL"
to_field("title") do |record, accumulator, context|
  accumulator << my_var
end
~~~

So that's the way to provide direct logic for mapping rules.

## Macros

A Traject macro is a way to automatically create indexing rules via re-usable "templates".

Traject macros are simply methods that return ruby lambda/proc objects. A ruby lambda is just another syntax for creating blocks of ruby logic that can be passed around as data.

So, for instance, we could capture that fixed literal block in a lambda like this:

~~~ruby
always_add_black = lambda do |record, accumulator, context|
  accumulator << "BLACK"
end
~~~

Then, knowing that the `to_field` ruby method takes a block, we can use the ruby `&` operator to convert our lambda to a block argument. This would in fact work:

~~~ruby
to_field "color", &always_add_black
~~~

However, for convenience, the `to_field` method can also take a lambda directly (without having to use `&` to convert it to a block argument) as a second argument. So this would work too:

~~~ruby
to_field "color", always_add_black
~~~

A macro is just one more step, using a method to create lambdas dynamically: a Traject macro is just a ruby method that **returns** a lambda -- a three-argument lambda like `to_field` wants.

Here is in fact how the `literal` function is implemented:

~~~ruby
def literal(value)
  return lambda do |record, accumulator, context|
    # because a lambda is a closure, we can define it in terms
    # of the 'value' from the scope it's defined in!
    accumulator << value
  end
end

to_field "something", literal("something")
~~~

It's really as simple as that; that's all a Traject macro is. A function that takes parameters, and based on those parameters returns a lambda; the lambda is then passed to the `to_field` indexing method, or similar methods.

How do you make these methods available to the indexer?

Define them in a module:

~~~ruby
# in a file literal_macro.rb
module LiteralMacro
  def literal(value)
    return lambda do |record, accumulator, context|
      # because a lambda is a closure, we can define it in terms
      # of the 'value' from the scope it's defined in!
      accumulator << value
    end
  end
end
~~~

And then use ordinary ruby `require` and `extend` to add it to the current Indexer, by simply including this in one of your config files:

~~~ruby
require 'literal_macro.rb'
extend LiteralMacro

to_field ...
~~~

That's it. You can use the traject command line `-I` option to set the ruby load path, so your file will be findable via `require`. Or you can distribute it in a gem, and use straight rubygems and the `gem` command in your configuration file, or Bundler with the traject command-line `-g` option.
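As a further illustration (not part of the gem itself), a hypothetical parameterized macro might look like the following sketch. The `PrefixMacro` module and `add_prefix` method names are made up for this example; the call at the bottom simulates what Traject::Indexer does when it invokes an indexing step.

```ruby
# Hypothetical example macro: prefixes each accumulated value.
# Module and method names here are illustrative, not part of traject.
module PrefixMacro
  def add_prefix(prefix)
    lambda do |record, accumulator, context|
      # mutate the accumulator in place, as traject indexing steps do
      accumulator.collect! { |v| "#{prefix}#{v}" }
    end
  end
end

# Simulate the indexer calling the lambda with (record, accumulator, context):
include PrefixMacro
acc = ["title one", "title two"]
add_prefix("pre: ").call(nil, acc, nil)
# acc is now ["pre: title one", "pre: title two"]
```

The macro method runs once at configuration time; only the returned lambda runs per record, which is why per-call setup (like building a TranslationMap) belongs outside the lambda.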
data/doc/settings.md
ADDED
@@ -0,0 +1,34 @@
# Traject settings

Traject settings are a flat list of key/value pairs -- a single Hash, not nested. Keys are always strings, and dots (".") can be used for grouping and namespacing.

Values are usually strings, but occasionally something else.

Settings can be set in configuration files, or on the command line.

## Known settings

* json_writer.pretty_print: used by the JsonWriter; if set to true, will output pretty-printed json (with added whitespace) for easier human readability. Default false.

* marc_source.type: default 'binary'. Can also be set to 'xml' or (not yet implemented, TODO) 'json'. Command-line shortcut `-t`.

* reader_class_name: a Traject Reader class, used by the indexer as a source of records. Default Traject::MarcReader. See Traject::Indexer for more info. Command-line shortcut `-r`.

* solr.url: URL to connect to a solr instance for indexing, eg http://example.org:8983/solr . Command-line shortcut `-u`.

* solrj.jar_dir: SolrJWriter needs to load Java .jar files with SolrJ. It will load from a packaged SolrJ, but you can load your own SolrJ (different version, etc.) by specifying a directory. All *.jar in the directory will be loaded.

* solr.version: Set to eg "1.4.0", "4.3.0"; currently un-used, but in the future will change some default settings and/or sanity check and warn you if you're doing something that might not work with that version of solr. Set it now for help in the future.

* solrj_writer.commit_on_close: default false; set to true to have SolrJWriter send an explicit commit message to Solr after indexing.

* solrj_writer.parser_class_name: Set to "XMLResponseParser" or "BinaryResponseParser". Will be instantiated and passed to the solrj.SolrServer with setResponseParser. Default nil, which uses the SolrServer default. To talk to a solr 1.x, you will want to set this to "XMLResponseParser".

* solrj_writer.server_class_name: String name of a solrj.SolrServer subclass to be used by SolrJWriter. Default "HttpSolrServer".

* writer_class_name: a Traject Writer class, used by the indexer to send processed dictionaries off. Default Traject::SolrJWriter; Traject::JsonWriter is also available. See Traject::Indexer for more info. Command-line shortcut `-w`.
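To make the flat key/value shape concrete, here's a minimal plain-Ruby sketch (not using traject itself) of settings as a single un-nested Hash with dotted string keys; later calls merge onto the existing hash, as `indexer.settings` does:

```ruby
# A flat settings hash: one level, string keys, dots used only for namespacing.
settings = {
  "json_writer.pretty_print" => false,
  "solr.url"                 => "http://example.org:8983/solr",
  "writer_class_name"        => "Traject::JsonWriter"
}

# Later calls merge onto the existing hash.
settings.merge!("json_writer.pretty_print" => true)

# The dots are plain characters in the key, not nesting:
settings["json_writer.pretty_print"]  # => true
```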
data/lib/traject/indexer.rb
ADDED
@@ -0,0 +1,196 @@
require 'hashie'

require 'traject'
require 'traject/qualified_const_get'
require 'traject/marc_reader'
require 'traject/json_writer'
require 'traject/solrj_writer'

#
# == Readers and Writers
#
# The Indexer has a modularized architecture for readers and writers, for where
# source records come from (reader), and where output is sent to (writer).
#
# A Reader is any class that:
#   1) Has a two-argument initializer taking an IO stream and a Settings hash
#   2) Responds to the usual ruby #each, returning a source record from each #each.
#      (Including Enumerable is prob a good idea too)
#
# The default reader is the Traject::MarcReader, whose behavior is
# further customized by several settings in the Settings hash.
#
# Alternate readers can be set directly with the #reader_class= method, or
# with the "reader_class_name" Setting, a String name of a class
# meeting the reader contract.
#
#
# A Writer is any class that:
#   1) Has a one-argument initializer taking a Settings hash.
#   2) Responds to a one-argument #put method, where the argument is
#      a hash of mapped keys/values. The writer should write them
#      to the appropriate place.
#   3) Responds to a #close method, called when we're done.
#
# The default writer (will be) the SolrWriter, which is configured
# through additional Settings as well. A JsonWriter is also available,
# which can be useful for debugging your index mappings.
#
# You can set alternate writers by setting a Class object directly
# with the #writer_class method, or by the 'writer_class_name' Setting,
# with a String name of a class meeting the Writer contract.
#
class Traject::Indexer
  include Traject::QualifiedConstGet

  attr_writer :reader_class, :writer_class

  def initialize
    @settings = Settings.new(self.class.default_settings)
    @index_steps = []
  end

  # The Indexer's settings are a hash of key/values -- not
  # nested, just one level -- of configuration settings. Keys
  # are strings.
  #
  # The settings method with no arguments returns that hash.
  #
  # With a hash and/or block argument, can be used to set
  # new key/values. Each call merges onto the existing settings
  # hash.
  #
  #    indexer.settings("a" => "a", "b" => "b")
  #
  #    indexer.settings do
  #      store "b", "new b"
  #    end
  #
  #    indexer.settings #=> {"a" => "a", "b" => "new b"}
  #
  # Even with arguments, returns the settings hash too, so it can
  # be chained.
  def settings(new_settings = nil, &block)
    @settings.merge!(new_settings) if new_settings

    @settings.instance_eval &block if block

    return @settings
  end

  # Used to define an indexing mapping.
  def to_field(field_name, aLambda = nil, &block)
    @index_steps << {
      :field_name => field_name.to_s,
      :lambda => aLambda,
      :block => block
    }
  end

  # Processes a single record, according to indexing rules
  # set up in this Indexer. Returns a hash whose keys are
  # strings and whose values are Arrays.
  #
  def map_record(record)
    context = Context.new(:source_record => record, :settings => settings)

    @index_steps.each do |index_step|
      accumulator = []
      field_name = index_step[:field_name]
      context.field_name = field_name

      # Might have a lambda arg AND a block, we execute in order,
      # with same accumulator.
      [index_step[:lambda], index_step[:block]].each do |aProc|
        if aProc
          case aProc.arity
          when 1 then aProc.call(record)
          when 2 then aProc.call(record, accumulator)
          else        aProc.call(record, accumulator, context)
          end
        end
      end

      (context.output_hash[field_name] ||= []).concat accumulator
      context.field_name = nil
    end

    return context.output_hash
  end

  # Processes a stream of records, reading from the configured Reader,
  # mapping according to configured mapping rules, and then writing
  # to the configured Writer.
  def process(io_stream)
    reader = self.reader!(io_stream)
    writer = self.writer!

    reader.each do |record|
      writer.put map_record(record)
    end
    writer.close if writer.respond_to?(:close)
  end

  def reader_class
    unless defined? @reader_class
      @reader_class = qualified_const_get(settings["reader_class_name"])
    end
    return @reader_class
  end

  def writer_class
    unless defined? @writer_class
      @writer_class = qualified_const_get(settings["writer_class_name"])
    end
    return @writer_class
  end

  # Instantiate a Traject Reader, using class set
  # in #reader_class, initialized with the io_stream passed in
  def reader!(io_stream)
    return reader_class.new(io_stream, settings)
  end

  # Instantiate a Traject Writer, using class set in #writer_class
  def writer!
    return writer_class.new(settings)
  end

  def self.default_settings
    {
      "reader_class_name" => "Traject::MarcReader",
      "writer_class_name" => "Traject::SolrJWriter"
    }
  end


  # Enhanced with a few features from Hashie, to make it for
  # instance string/symbol indifferent
  class Settings < Hash
    include Hashie::Extensions::MergeInitializer # can init with hash
    include Hashie::Extensions::IndifferentAccess

    # Hashie bug Issue #100 https://github.com/intridea/hashie/pull/100
    alias_method :store, :indifferent_writer
  end

  # Represents the context of a specific record being indexed, passed
  # to indexing logic blocks
  #
  class Traject::Indexer::Context
    def initialize(hash_init = {})
      # TODO, argument checking for required args?

      self.clipboard = {}
      self.output_hash = {}

      hash_init.each_pair do |key, value|
        self.send("#{key}=", value)
      end
    end

    attr_accessor :clipboard, :output_hash
    attr_accessor :field_name, :source_record, :settings
  end
end
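To see the `to_field` / `map_record` accumulator flow from Traject::Indexer in isolation, here is a minimal self-contained re-implementation sketch (plain Ruby, no traject or hashie dependencies; the `MiniIndexer` name and the sample record are made up, and arity dispatch is simplified to the three-argument case):

```ruby
# Minimal sketch of the to_field / map_record accumulator flow.
class MiniIndexer
  def initialize
    @index_steps = []
  end

  # Store the field name plus the lambda and/or block,
  # just like Traject::Indexer#to_field.
  def to_field(field_name, aLambda = nil, &block)
    @index_steps << {:field_name => field_name.to_s, :lambda => aLambda, :block => block}
  end

  # Run every step against the record, each with a fresh accumulator,
  # and collect results into a hash of Arrays keyed by field name.
  def map_record(record)
    output = {}
    @index_steps.each do |step|
      accumulator = []
      [step[:lambda], step[:block]].each do |aProc|
        aProc.call(record, accumulator, nil) if aProc
      end
      (output[step[:field_name]] ||= []).concat accumulator
    end
    output
  end
end

indexer = MiniIndexer.new
indexer.to_field("title") { |record, accumulator, context| accumulator << record[:title] }
indexer.to_field("color", lambda { |record, accumulator, context| accumulator << "BLACK" })

result = indexer.map_record(:title => "Manufacturing Consent")
# result == {"title" => ["Manufacturing Consent"], "color" => ["BLACK"]}
```

Each step gets its own empty accumulator, and a step's lambda and block share it, mirroring the "might have a lambda arg AND a block" comment in the real `map_record`.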
data/lib/traject/json_writer.rb
ADDED
@@ -0,0 +1,51 @@
require 'json'

# A writer for Traject::Indexer, that just writes out
# all the output as Json. It's newline-delimited json, but
# right now there are no checks to make sure there are no internal newlines
# as whitespace in the json. TODO, add that.
#
# Not currently thread-safe (we have to make sure the whole object and newline
# get written without a context switch. Can be made so.)
#
# You can force pretty-printing with the setting 'json_writer.pretty_print' set to boolean
# true or string 'true'. Useful mostly for human checking of output.
#
# Output will be sent to settings["output_file"] string path, or else
# settings["output_stream"] (ruby IO object), or else stdout.
class Traject::JsonWriter
  attr_reader :settings

  def initialize(argSettings)
    @settings = argSettings
  end

  def put(hash)
    serialized =
      if settings["json_writer.pretty_print"]
        JSON.pretty_generate(hash)
      else
        JSON.generate(hash)
      end
    output_file.puts(serialized)
  end

  def output_file
    unless defined? @output_file
      @output_file =
        if settings["output_file"]
          File.open(settings["output_file"], "w")
        elsif settings["output_stream"]
          settings["output_stream"]
        else
          $stdout
        end
    end
    return @output_file
  end

  def close
    @output_file.close unless (@output_file.nil? || @output_file.tty?)
  end

end
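The writer contract described above (one-argument initializer taking settings, `#put(hash)`, `#close`) can be exercised with a small plain-Ruby stand-in writing to a StringIO. The `MiniJsonWriter` name is illustrative, not part of traject:

```ruby
require 'json'
require 'stringio'

# Minimal writer honoring the contract: initialize(settings), #put(hash), #close.
class MiniJsonWriter
  def initialize(settings)
    @settings = settings
    @out = settings["output_stream"] || $stdout
  end

  def put(hash)
    if @settings["json_writer.pretty_print"]
      @out.puts JSON.pretty_generate(hash)
    else
      @out.puts JSON.generate(hash)
    end
  end

  def close
    # nothing to do for a StringIO
  end
end

io = StringIO.new
writer = MiniJsonWriter.new("output_stream" => io)
writer.put("title" => ["Manufacturing Consent"])
writer.close
io.string  # => "{\"title\":[\"Manufacturing Consent\"]}\n"
```

Because the indexer only depends on this three-method contract, any such class can be dropped in via the `writer_class_name` setting.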
data/lib/traject/macros/marc21.rb
ADDED
@@ -0,0 +1,145 @@
require 'traject/marc_extractor'
require 'traject/translation_map'
require 'base64'
require 'json'

module Traject::Macros
  # Some of these may be generic for any MARC, but we haven't done
  # the analytical work to think it through; some of this is
  # specific to Marc21.
  module Marc21

    # A combo function macro that will extract data from marc according to a string
    # field/substring specification, then apply various optional post-processing to it too.
    #
    # First argument is a string spec suitable for the MarcExtractor, see
    # MarcExtractor::parse_string_spec.
    #
    # Second arg is optional options, including options valid on MarcExtractor.new,
    # and others. (TODO)
    #
    # Examples:
    #
    #    to_field "title", extract_marc("245abcd", :trim_punctuation => true)
    #    to_field "id",    extract_marc("001", :first => true)
    #    to_field "geo",   extract_marc("040a", :seperator => nil, :translation_map => "marc040")
    def extract_marc(spec, options = {})
      only_first = options.delete(:first)
      trim_punctuation = options.delete(:trim_punctuation)

      # We create the TranslationMap here on load, not inside the closure
      # where it'll be called for every record. Since TranslationMap is supposed
      # to cache, it prob doesn't matter, but doesn't hurt. Also causes any syntax
      # exceptions to raise on load.
      if translation_map_arg = options.delete(:translation_map)
        translation_map = Traject::TranslationMap.new(translation_map_arg)
      end

      lambda do |record, accumulator, context|
        accumulator.concat Traject::MarcExtractor.extract_by_spec(record, spec, options)

        if only_first
          Marc21.first! accumulator
        end

        if translation_map
          translation_map.translate_array! accumulator
        end

        if trim_punctuation
          accumulator.collect! {|s| Marc21.trim_punctuation(s)}
        end
      end
    end

    # Serializes the complete marc record to a serialization format.
    # Required param :format:
    #
    #    serialize_marc(:format => :binary)
    #
    # formats:
    # [xml]    MarcXML
    # [json]   marc-in-json (http://dilettantes.code4lib.org/blog/2010/09/a-proposal-to-serialize-marc-in-json/)
    # [binary] Standard ISO 2709 binary marc. By default WILL be base64-encoded,
    #          assuming the destination is a solr 'binary' field.
    #          Add option `:binary_escape => false` to do straight binary -- it's unclear
    #          what Solr's documented behavior is when you do this, and add a string
    #          with binary control chars to solr. May do different things in different
    #          Solr versions, including raising exceptions.
    def serialized_marc(options)
      options[:format] = options[:format].to_s
      raise ArgumentError.new("Need :format => [binary|xml|json] arg") unless %w{binary xml json}.include?(options[:format])

      lambda do |record, accumulator, context|
        case options[:format]
        when "binary"
          binary = record.to_marc
          binary = Base64.encode64(binary) unless options[:binary_escape] == false
          accumulator << binary
        when "xml"
          # ruby-marc #to_xml returns a REXML object at time of this writing, bah!
          # call #to_s on it. Hopefully that'll be forward compatible.
          accumulator << record.to_xml.to_s
        when "json"
          accumulator << JSON.dump(record.to_hash)
        end
      end
    end

    # Takes the whole record, by default from tags 100 to 899 inclusive,
    # all subfields, and adds them to output. Subfields within a field are all
    # joined by space by default.
    #
    # options:
    # [:from]      default "100", only tags >= lexicographically
    # [:to]        default "899", only tags <= lexicographically
    # [:seperator] how to join subfields; default space, nil means don't join
    #
    # All fields in from-to must be marc DATA fields (not control fields), or weirdness.
    #
    # You can always run this thing multiple times on the same field if you need
    # non-contiguous ranges of fields.
    def extract_all_marc_values(options = {})
      options = {:from => "100", :to => "899", :seperator => ' '}.merge(options)

      lambda do |record, accumulator, context|
        record.each do |field|
          next unless field.tag >= options[:from] && field.tag <= options[:to]
          subfield_values = field.subfields.collect {|sf| sf.value}
          next unless subfield_values.length > 0

          if options[:seperator]
            accumulator << subfield_values.join( options[:seperator] )
          else
            accumulator.concat subfield_values
          end
        end
      end
    end


    # Trims punctuation mostly from the end, and occasionally from the beginning,
    # of a string. Not nearly as complex logic as SolrMarc's version, just
    # pretty simple.
    #
    # Removes
    # * trailing: comma, slash, semicolon, colon (possibly followed by whitespace)
    # * trailing period if it is preceded by at least three letters (possibly followed by whitespace)
    # * single square bracket characters if they are the start and/or end
    #   chars and there are no internal square brackets.
    #
    # Returns the altered string; doesn't change the original arg.
    def self.trim_punctuation(str)
      str = str.sub(/[ ,\/;:] *\Z/, '')
      str = str.sub(/(\w\w\w)\. *\Z/, '\1')
      str = str.sub(/\A\[?([^\[\]]+)\]?\Z/, '\1')
      return str
    end

    def self.first!(arr)
      # kind of esoteric, but slice used this way does a mutating #first, yep
      arr.slice!(1, arr.length)
    end

  end
end
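To illustrate the three substitutions in `trim_punctuation`, here is a standalone copy of that logic applied to some made-up sample strings:

```ruby
# Standalone copy of the trim_punctuation substitutions, for illustration.
def trim_punctuation(str)
  str = str.sub(/[ ,\/;:] *\Z/, '')           # trailing comma/slash/semicolon/colon
  str = str.sub(/(\w\w\w)\. *\Z/, '\1')       # trailing period after >= 3 word chars
  str = str.sub(/\A\[?([^\[\]]+)\]?\Z/, '\1') # outer square brackets, no internal ones
  return str
end

trim_punctuation("Chomsky, Noam,")         # => "Chomsky, Noam"
trim_punctuation("a history of the book.") # => "a history of the book"
trim_punctuation("[electronic resource]")  # => "electronic resource"
trim_punctuation("U.S.")                   # => "U.S." (kept: only "S" precedes the final period)
```

The third example shows why the period rule requires three preceding word characters: abbreviations like "U.S." survive while ordinary sentence-final periods are stripped.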