RubyGems - traject - Versions diffs - 2.0.2 → 2.1.0 - Mend

traject 2.0.2 → 2.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (18)

checksums.yaml +4 -4
data/README.md +8 -8
data/doc/indexing_rules.md +1 -1
data/doc/settings.md +4 -1
data/lib/traject/command_line.rb +9 -18
data/lib/traject/indexer.rb +162 -35
data/lib/traject/macros/marc21.rb +23 -0
data/lib/traject/macros/marc_format_classifier.rb +2 -2
data/lib/traject/solr_json_writer.rb +2 -16
data/lib/traject/util.rb +75 -0
data/lib/traject/version.rb +1 -1
data/test/indexer/context_test.rb +35 -0
data/test/indexer/load_config_file_test.rb +89 -0
data/test/indexer/macros_marc21_test.rb +5 -0
data/test/indexer/writer_test.rb +54 -0
data/test/solr_json_writer_test.rb +0 -29
data/traject.gemspec +1 -1
metadata +9 -3

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: f738d7b32e8ccc9ac89b1ed735a154790a36c656
-  data.tar.gz: 79d32ec442d5569b50ac4c8c2f25d588de335a79
+  metadata.gz: 68f2f404b1bcae2d73ad2d947d40178ddeaf3912
+  data.tar.gz: 3aa18f9032b5406059bf8093416df04655ae16e2
 SHA512:
-  metadata.gz: 3042d06bc09ffc3421a334595a6393a047e0425f0b41221ab17b6881de7e96e07040f51ea1f3b4a18dd24580158eebaa341c6e640cc31263385f497584b33f75
-  data.tar.gz: 811378d18249d07db3c85dc31205e2d70a98b86b6edd57fb97789f3aaa98c7b6fd7b1045966674c06eced1c6cc87848ef1efc361994f383a5c05488046fc286a
+  metadata.gz: e0521cadc05b454787f642726a6f537b6d09eb46f88389e7f036e9808e680e593ec09b594f2fa85480cb5e350c2f5c966549561cbfab580f9e38300365658e3f
+  data.tar.gz: 5fe31273d6cbe1cf46cf2a186f8d3b4dbc64bdca86e39cb6a9e3c3de08289a1e06f4ca0bc66e4e80df622591fa6f95bdb2b1c66e3a9728890155e642dfdb49fd

data/README.md CHANGED

@@ -11,7 +11,7 @@ for debugging by a human.
 **Traject is stable, mature software, that is already being used in production by its authors.**
 [![Gem Version](https://badge.fury.io/rb/traject.png)](http://badge.fury.io/rb/traject)
-[![Build Status](https://travis-ci.org/traject-project/traject.png)](https://travis-ci.org/traject-project/traject)
+[![Build Status](https://travis-ci.org/traject/traject.png)](https://travis-ci.org/traject/traject)
 ## Background/Goals
@@ -137,7 +137,7 @@ data out of a MARC record according to a tag/subfield specification.
     # For MARC Control ('fixed') fields, you can optionally
     # use square brackets to take a byte offset.
-    to_field "langauge_code", extract_marc("008[35-37]")
+    to_field "language_code", extract_marc("008[35-37]")
 ~~~
 `extract_marc` by default includes all 'alternate script' linked fields correspoinding
@@ -189,7 +189,7 @@ The current record serialized back out as MARC, in binary, XML, or json:
 Text of all fields in a range:
 ~~~ruby
-    to_field "text", extract_all_marc_values(:from => 100, :to => 899)
+    to_field "text", extract_all_marc_values(:from => "100", :to => "899")
 ~~~
 All of these methods are defined at [Traject::Macros::Marc21](./lib/traject/macros/marc21.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Macros/Marc21))
@@ -324,7 +324,7 @@ Several other writers are also built-in:
 You set which writer is being used in settings (`provide "writer_class_name", "Traject::DebugWriter"`),
 or with the shortcut command line argument  `-w Traject::DebugWriter`.
-The [SolrJWriter](https://github.com/traject-project/traject-solrj_writer) is packaged separately,
+The [SolrJWriter](https://github.com/traject/traject-solrj_writer) is packaged separately,
 and will be useful if you need to index to Solr's older than version 3.2. It requires Jruby.
 You can easily write your own Readers and Writers if you'd like, see comments at top
@@ -413,11 +413,11 @@ Own Code](./doc/extending.md)
 * [Other traject commands](./doc/other_commands.md) including `marcout`, and `commit`
 * [Hints for batch and cronjob use](./doc/batch_execution.md) of  traject.
 * Plugin extensions: Gems that add functionality to traject
-  * [traject_alephsequential_reader](https://github.com/traject-project/traject_alephsequential_reader/): read MARC files serialized in the AlephSequential format, as output by Ex Libris's Alpeh ILS.
+  * [traject_alephsequential_reader](https://github.com/traject/traject_alephsequential_reader/): read MARC files serialized in the AlephSequential format, as output by Ex Libris's Alpeh ILS.
   * [traject_horizon](https://github.com/jrochkind/traject_horizon): Export MARC records directly from a Horizon ILS rdbms, as serialized MARC or to  index into Solr.
   * [traject_umich_format](https://github.com/billdueber/traject_umich_format/): opinionated code and associated macros to extract format (book, audio file, etc.) and types (bibliography, conference report, etc.) from a MARC record. Code mirrors that used by the University of Michigan, and is an alternate approach to that taken by the `marc_formats` macro in `Traject::Macros::MarcFormatClassifier`.
-  * [traject-solrj_writer](https://github.com/traject-project/traject-solrj_writer): a jruby-only writer that uses the solrj .jar to talk directly to solr. Your only option for speaking to a solr version < 3.2, which is when the json handler was added to solr.
-  * [traject_marc4j_reader](https://github.com/billdueber/traject_marc4j_reader): Packaged with traject automatically on jruby. A JRuby-only reader for
+  * [traject-solrj_writer](https://github.com/traject/traject-solrj_writer): a jruby-only writer that uses the solrj .jar to talk directly to solr. Your only option for speaking to a solr version < 3.2, which is when the json handler was added to solr.
+  * [traject_marc4j_reader](https://github.com/traject/traject-marc4j_reader): Packaged with traject automatically on jruby. A JRuby-only reader for
   reading marc records using the Marc4J library, fastest MARC reading on JRuby.
 # Development
@@ -454,7 +454,7 @@ a gemspec dependency on the Marc4JReader gem.
     on Settings.
 * CommandLine class isn't covered by tests -- it's written using functionality
-from Indexer and other classes taht are well-covered, but the CommandLine itself
+from Indexer and other classes that are well-covered, but the CommandLine itself
 probably needs some tests -- especially covering error handling, which probably
 needs a bit more attention and using exceptions instead of exits, etc.

data/doc/indexing_rules.md CHANGED

@@ -58,7 +58,7 @@ you need to modify the array in-place.
 The third optional context argument
 The third optional argument is a
-[Traject::Indexer::Context](./lib/traject/indexer/context.rb)  ([rdoc](http://rdoc.info/github/traject-project/traject/Traject/Indexer/Context))
+[Traject::Indexer::Context](./lib/traject/indexer/context.rb)  ([rdoc](http://rdoc.info/github/traject/traject/Traject/Indexer/Context))
 object. Most of the time you don't need it, but you can use it for
 some sophisticated functionality, for example using these Context methods:

data/doc/settings.md CHANGED

@@ -98,4 +98,7 @@ settings are applied first of all. It's recommended you use `provide`.
                                     Note that processing_thread_pool threads can end up submitting
                                     to solr too, if solr_json_writer.thread_pool is full.
-* `writer_class_name`: a Traject Writer class, used by indexer to send processed dictionaries off. Default Traject::SolrJsonWriter, other writers for debugging or writing to files are also available. See Traject::Indexer for more info. Command line shortcut `-w`
+* `writer`: An object that implements the Traject Writer interface. If set, takes precedence
+            over `writer_class_name`.
+* `writer_class_name`: a Traject Writer class, used by indexer to send processed dictionaries off. Will be used if no explicit `writer` setting or `#writer=` is set. Default Traject::SolrJsonWriter, other writers for debugging or writing to files are also available. See Traject::Indexer for more info. Command line shortcut `-w`
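The precedence documented in this settings change (an explicit `writer` object wins, otherwise `writer_class_name` is resolved and instantiated) can be sketched without traject itself. `MiniIndexer` and `NullWriter` are hypothetical stand-ins, not traject classes, and `Object.const_get` is assumed here in place of traject's own constant lookup.

```ruby
# Hypothetical stand-ins for illustration only; not part of traject.
class NullWriter
  def initialize(settings)
    @settings = settings
  end
end

class MiniIndexer
  attr_writer :writer
  attr_reader :settings

  def initialize(settings = {})
    @settings = { "writer_class_name" => "NullWriter" }.merge(settings)
  end

  # An explicitly assigned writer wins, then settings["writer"];
  # only after that is a class looked up by name and instantiated.
  def writer
    @writer ||= settings["writer"] ||
                Object.const_get(settings["writer_class_name"]).new(settings)
  end
end

explicit = Object.new
puts MiniIndexer.new("writer" => explicit).writer.equal?(explicit)
puts MiniIndexer.new.writer.class
```

Memoizing with `||=` means the writer is built at most once per indexer, which is why the class can then be reported as `writer.class`.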

data/lib/traject/command_line.rb CHANGED

@@ -171,25 +171,16 @@ module Traject
     def load_configuration_files!(my_indexer, conf_files)
       conf_files.each do |conf_path|
         begin
-          file_io = File.open(conf_path)
-        rescue Errno::ENOENT => e
-          self.console.puts "Could not find configuration file '#{conf_path}', exiting..."
+          my_indexer.load_config_file(conf_path)
+        rescue Errno::ENOENT, Errno::EACCES => e
+          self.console.puts "Could not read configuration file '#{conf_path}', exiting..."
           exit 2
-        end
-        begin
-          my_indexer.instance_eval(file_io.read, conf_path)
-        rescue Exception => e
-          if (conf_trace = e.backtrace.find {|l| l.start_with? conf_path}) &&
-             (conf_trace =~ /\A.*\:(\d+)\:in/)
-            line_number = $1
-          end
-          self.console.puts "Error processing configuration file '#{conf_path}' at line #{line_number}"
-          self.console.puts "  #{e.class}: #{e.message}"
-          if e.backtrace.first =~ /\A(.*)\:in/
-            self.console.puts "  from #{$1}"
-          end
+        rescue Traject::Indexer::ConfigLoadError => e
+          self.console.puts "\n"
+          self.console.puts e.message
+          self.console.puts e.config_file_backtrace
+          self.console.puts "\n"
+          self.console.puts "Exiting..."
           exit 3
         end
       end
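The command-line change above separates two failure modes: the file cannot be read, or the file reads fine but fails when evaluated. A minimal self-contained sketch of that split follows. `MiniHost`, `MiniConfigError`, and `try_load` are hypothetical names for illustration and are not traject's API.

```ruby
# Hypothetical wrapper error, loosely modelled on the pattern in this diff.
class MiniConfigError < StandardError
  attr_reader :original, :config_file

  def initialize(config_file, original)
    @original    = original
    @config_file = config_file
    super("Error loading #{config_file}: #{original.class}: #{original.message}")
  end
end

# Hypothetical DSL host: the config file is evaluated with `self` set to this object.
class MiniHost
  attr_reader :fields

  def initialize
    @fields = []
  end

  def to_field(name)
    @fields << name
  end

  def load_config_file(path)
    # File.open sits outside the begin/rescue, so Errno errors propagate unwrapped.
    File.open(path) do |file|
      begin
        instance_eval(file.read, path)
      rescue ScriptError, StandardError => e
        raise MiniConfigError.new(path, e)
      end
    end
  end
end

# Caller side: distinguish "cannot read" from "cannot evaluate".
def try_load(host, path)
  host.load_config_file(path)
  :ok
rescue Errno::ENOENT, Errno::EACCES
  :unreadable
rescue MiniConfigError
  :bad_config
end
```

Rescuing `ScriptError` alongside `StandardError` matters because `SyntaxError` descends from `ScriptError` and would otherwise escape a bare `rescue`.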

data/lib/traject/indexer.rb CHANGED

@@ -23,7 +23,7 @@ end
 # Traject config files are `instance_eval`d in an Indexer object, so `self` in
 # a config file is an Indexer, and any Indexer methods can be called.
 #
-# However, certain Indexer methods exist almost entirely for the purpose of
+# However, certain Indexer methods exist mainly for the purpose of
 # being called in config files; these methods are part of the expected
 # Domain-Specific Language ("DSL") for config files, and will ordinarily
 # form the bulk or entirety of config files:
@@ -34,18 +34,6 @@ end
 # * #after_procesing
 # * #logger (rarely used in config files, but in some cases to set up custom logging config)
 #
-# If accessing a Traject::Indexer programmatically (instead of via command line with
-# config files), additional methods of note include:
-#
-#     # to process a stream of input records from configured Reader,
-#     # to configured Writer:
-#     indexer.process(io_stream)
-#
-#     # To map a single input record manually to an ouput_hash,
-#     # ignoring Readers and Writers
-#     hash = indexer.map_record(record)
-#
-#
 #  ## Readers and Writers
 #
 #  The Indexer has a modularized architecture for readers and writers, for where
@@ -93,6 +81,77 @@ end
 #  * traject/delimited_writer and traject/csv_writer -- write character-delimited files
 #    (default is tab-delimited) or comma-separated-value files.
 #
+# ## Creating and Using an Indexer programmatically
+#
+# Normally the Traject::Indexer is created and used by a Traject::Command object.
+# However, you can also create and use a Traject::Indexer programmatically, for embeddeding
+# in your own ruby software. (Note, you will get best performance under Jruby only)
+#
+#      indexer = Traject::Indexer.new
+#
+# You can load a config file from disk, using standard ruby `instance_eval`.
+# One benefit of loading one or more ordinary traject config files saved separately
+# on disk is that these config files could also be used with the standard
+# traject command line.
+#
+#      indexer.load_config_file(path_to_config)
+#
+# This may raise if the file is not readable. Or if the config file
+# can't be evaluated, it will raise a Traject::Indexer::ConfigLoadError
+# with a bunch of contextual information useful to reporting to developer.
+#
+# You can also instead, or in addition, write configuration inline using
+# standard ruby `instance_eval`:
+#
+#     indexer.instance_eval do
+#        to_field "something", literal("something")
+#        # etc
+#     end
+#
+# Or even load configuration from an existing lambda/proc object:
+#
+#     config = proc do
+#       to_field "something", literal("something")
+#     end
+#     indexer.instance_eval &config
+#
+# It is least confusing to provide settings after you load
+# config files, so you can determine if your settings should
+# be defaults (taking effect only if not provided in earlier config),
+# or should force themselves, potentially overwriting earlier config:
+#
+#      indexer.settings do
+#         # default, won't overwrite if already set by earlier config
+#         provide "solr.url", "http://example.org/solr"
+#         provide "reader", "Traject::MarcReader"
+#
+#         # or force over any previous config
+#         store "solr.url", "http://example.org/solr"
+#      end
+#
+# Once your indexer is set up, you could use it to transform individual
+# input records to output hashes. This method will ignore any readers
+# and writers, and won't use thread pools, it just maps. Under
+# standard MARC setup, `record` should be a `MARC::Record`:
+#
+#      output_hash = indexer.map_record(record)
+#
+# Or you could process an entire stream of input records from the
+# configured reader, to the configured writer, as the traject command line
+# does:
+#
+#      indexer.process(io_stream)
+#      # or, eg:
+#      File.open("path/to/input") do |file|
+#        indexer.process(file)
+#      end
+#
+# At present, you can only call #process _once_ on an indexer,
+# but let us know if that's a problem, we could enhance.
+#
+# Please do let us know if there is some part of this API that is
+# inconveient for you, we'd like to know your use case and improve things.
+#
 class Traject::Indexer
   # Arity error on a passed block
@@ -103,7 +162,7 @@ class Traject::Indexer
   include Traject::QualifiedConstGet
-  attr_writer :reader_class, :writer_class
+  attr_writer :reader_class, :writer_class, :writer
   # For now we hard-code these basic macro's included
   # TODO, make these added with extend per-indexer,
@@ -120,6 +179,24 @@ class Traject::Indexer
     @after_processing_steps = []
   end
+  # Pass a string file path, or a File object, for
+  # a config file to load into indexer.
+  #
+  # Can raise:
+  # * Errno::ENOENT or Errno::EACCES if file path is not accessible
+  # * Traject::Indexer::ConfigLoadError if exception is raised evaluating
+  #   the config. A ConfigLoadError has information in it about original
+  #   exception, and exactly what config file and line number triggered it.
+  def load_config_file(file_path)
+    File.open(file_path) do |file|
+      begin
+        self.instance_eval(file.read, file_path)
+      rescue ScriptError, StandardError => e
+        raise ConfigLoadError.new(file_path, e)
+      end
+    end
+  end
   # Part of the config file DSL, for writing settings values.
   #
   # The Indexer's settings consist of a hash-like Traject::Settings
@@ -282,7 +359,7 @@ class Traject::Indexer
     begin
       yield
     rescue Exception => e
-      msg =  "Unexpected error on record id `#{id_string(context.source_record)}` at file position #{context.position}\n"
+      msg =  "Unexpected error on record id `#{context.source_record_id}` at file position #{context.position}\n"
       msg += "    while executing #{index_step.inspect}\n"
       msg += Traject::Util.exception_to_log_message(e)
@@ -297,11 +374,6 @@ class Traject::Indexer
     end
   end
-  # get a printable id from record for error logging.
-  # Maybe override this for a future XML version.
-  def id_string(record)
-    record && record['001'] && record['001'].value.to_s
-  end
   # Processes a stream of records, reading from the configured Reader,
   # mapping according to configured mapping rules, and then writing
@@ -320,8 +392,6 @@ class Traject::Indexer
     logger.debug "beginning Indexer#process with settings: #{settings.inspect}"
     reader = self.reader!(io_stream)
-    writer = self.writer!
     processing_threads = settings["processing_thread_pool"].to_i
     thread_pool = Traject::ThreadPool.new(processing_threads)
@@ -343,20 +413,24 @@ class Traject::Indexer
         $stderr.write "." if count % settings["solr_writer.batch_size"].to_i == 0
       end
+      context = Context.new(
+        :source_record => record,
+        :settings => settings,
+        :position => position,
+        :logger => logger
+      )
       if log_batch_size && (count % log_batch_size == 0)
         batch_rps = log_batch_size / (Time.now - batch_start_time)
         overall_rps = count / (Time.now - start_time)
-        logger.send(settings["log.batch_size.severity"].downcase.to_sym, "Traject::Indexer#process, read #{count} records at id:#{id_string(record)}; #{'%.0f' % batch_rps}/s this batch, #{'%.0f' % overall_rps}/s overall")
+        logger.send(settings["log.batch_size.severity"].downcase.to_sym, "Traject::Indexer#process, read #{count} records at id:#{context.source_record_id}; #{'%.0f' % batch_rps}/s this batch, #{'%.0f' % overall_rps}/s overall")
         batch_start_time = Time.now
       end
-      # we have to use this weird lambda to properly "capture" the count, instead
-      # of having it be bound to the original variable in a non-threadsafe way.
-      # This is confusing, I might not be understanding things properly, but that's where i am.
-      #thread_pool.maybe_in_thread_pool &make_lambda(count, record, writer)
-      thread_pool.maybe_in_thread_pool(record, settings, position) do |record, settings, position|
-        context = Context.new(:source_record => record, :settings => settings, :position => position)
-        context.logger = logger
+      # We pass context in a block arg to properly 'capture' it, so
+      # we don't accidentally share the local var under closure between
+      # threads.
+      thread_pool.maybe_in_thread_pool(context) do |context|
         map_to_context!(context)
         if context.skip?
           log_skip(context)
@@ -413,10 +487,7 @@ class Traject::Indexer
   end
   def writer_class
-    unless defined? @writer_class
-      @writer_class = qualified_const_get(settings["writer_class_name"])
-    end
-    return @writer_class
+    writer.class
   end
   # Instantiate a Traject Reader, using class set
@@ -427,7 +498,12 @@ class Traject::Indexer
   # Instantiate a Traject Writer, suing class set in #writer_class
   def writer!
-    return writer_class.new(settings.merge("logger" => logger))
+    writer_class = @writer_class || qualified_const_get(settings["writer_class_name"])
+    writer_class.new(settings.merge("logger" => logger))
+  end
+  def writer
+    @writer ||= settings["writer"] || writer!
   end
   # Represents the context of a specific record being indexed, passed
@@ -467,6 +543,26 @@ class Traject::Indexer
       @skip
     end
+    # Useful for describing a record in a log or especially
+    # error message. May be useful to combine with #position
+    # in output messages, especially since this method may sometimes
+    # return empty string if info on record id is not available.
+    #
+    # Returns MARC 001, then a slash, then output_hash["id"] -- if both
+    # are present. Otherwise may return just one, or even an empty string.
+    #
+    # Likely override this for a future XML or other source format version.
+    def source_record_id
+      marc_id = if self.source_record &&
+                   self.source_record.kind_of?(MARC::Record) &&
+                   self.source_record['001']
+        self.source_record['001'].value
+      end
+      output_id = self.output_hash["id"]
+      return [marc_id, output_id].compact.join("/")
+    end
   end
@@ -607,6 +703,37 @@ class Traject::Indexer
     end
   end
+  # Raised by #load_config_file when config file can not
+  # be processed.
+  #
+  # The exception #message includes an error message formatted
+  # for good display to the developer, in the console.
+  #
+  # Original exception raised when processing config file
+  # can be found in #original. Original exception should ordinarily
+  # have a good stack trace, including the file path of the config
+  # file in question.
+  #
+  # Original config path in #config_file, and line number in config
+  # file that triggered the exception in #config_file_lineno (may be nil)
+  #
+  # A filtered backtrace just DOWN from config file (not including trace
+  # from traject loading config file itself) can be found in
+  # #config_file_backtrace
+  class ConfigLoadError < StandardError
+    # We'd have #cause in ruby 2.1, filled out for us, but we want
+    # to work before then, so we use our own 'original'
+    attr_reader :original, :config_file, :config_file_lineno, :config_file_backtrace
+    def initialize(config_file_path, original_exception)
+      @original               = original_exception
+      @config_file            = config_file_path
+      @config_file_lineno     = Traject::Util.backtrace_lineno_for_config(config_file_path, original_exception)
+      @config_file_backtrace  = Traject::Util.backtrace_from_config(config_file_path, original_exception)
+      message = "Error loading configuration file #{self.config_file}:#{self.config_file_lineno} #{original_exception.class}:#{original_exception.message}"
+      super(message)
+    end
+  end
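The comment in the `process` hunk above explains that the context is handed to the thread pool as a block argument so that threads do not share one local variable through the closure. The same idea can be shown with plain `Thread.new`, which is used here only as a stand-in; that `Traject::ThreadPool#maybe_in_thread_pool` forwards its arguments in a comparable way is an assumption drawn from that comment.

```ruby
# Each thread receives its own value through a block argument, so
# reassigning the outer local afterwards cannot affect a running thread.
threads = []
current = nil

[1, 2, 3].each do |n|
  current = n # one local variable, reassigned on every iteration
  threads << Thread.new(current) { |mine| mine * 10 }
end

# Thread#value joins the thread and returns the block's result.
results = threads.map(&:value)
```

Had the block read `current` directly, each thread would see whatever value the variable held when the thread happened to run.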

data/lib/traject/macros/marc21.rb CHANGED

@@ -39,6 +39,9 @@ module Traject::Macros
     #     to_field("title"), extract_marc("245abcd", :trim_punctuation => true)
     #     to_field("id"),    extract_marc("001", :first => true)
     #     to_field("geo"),   extract_marc("040a", :separator => nil, :translation_map => "marc040")
+    #
+    # If you'd like extract_marc functionality but you're not creating an indexer
+    # step, see Traject::Macros::Marc21.extract_marc_from module method.
     def extract_marc(spec, options = {})
       # Raise an error if there are any invalid options, indicating a
@@ -70,6 +73,26 @@ module Traject::Macros
         Marc21.apply_extraction_options(accumulator, options, translation_map)
       end
     end
+    module_function :extract_marc
+    # Convenience method when you want extract_marc behavior, but NOT
+    # to create a lambda for an Indexer step, but instead just give
+    # it a record directly and get back an array of values.
+    #
+    #     array = Traject::Indexer::Marc21.extract_marc_from(record, "245ab", :trim_punctuation => true)
+    #
+    # If you have a Traject::Indexer::Context and want to pass it in, you can:
+    #
+    #    array = Traject::Indexer::Marc21.extract_marc_from(record, "245ab", :trim_punctuation => true, :context => existing_context)
+    def self.extract_marc_from(record, spec, options = {})
+      output  = []
+      # Nil context works, but if caller wants to pass one in
+      # for better error reporting that's cool too.
+      context = options.delete(:context) || nil
+      extract_marc(spec, options).call(record, output, context)
+      return output
+    end
     # Side-effect the accumulator with the options
     def self.apply_extraction_options(accumulator, options, translation_map=nil)
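The `extract_marc_from` addition follows a simple pattern: build the same lambda an indexing step would use, then call it at once with a fresh accumulator. A sketch of that pattern on plain hashes, where `extract_key` and `extract_key_from` are hypothetical names and the record is an ordinary `Hash` and not a `MARC::Record`:

```ruby
# Hypothetical macro: returns a lambda with the (record, accumulator, context)
# shape that indexing steps use in this diff.
def extract_key(key)
  lambda do |record, accumulator, _context|
    accumulator.concat(Array(record[key]))
  end
end

# Convenience wrapper: call the lambda immediately with a fresh array.
def extract_key_from(record, key, options = {})
  output  = []
  context = options[:context] # nil is acceptable in this sketch
  extract_key(key).call(record, output, context)
  output
end
```

The wrapper keeps one code path for extraction, so the step form and the direct form cannot drift apart.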

data/lib/traject/macros/marc_format_classifier.rb CHANGED

@@ -2,10 +2,10 @@ module Traject
   module Macros
     # To use the marc_format macro, in your configuration file:
     #
-    #     require 'traject/macros/marc_formats
+    #     require 'traject/macros/marc_format_classifier'
     #     extend Traject::Macros::MarcFormats
     #
-    #     to_field("format_s") marc_formats
+    #     to_field "format", marc_formats
     #
     # See also MarcClassifier which can be used directly for a bit more
     # control.

data/lib/traject/solr_json_writer.rb CHANGED

@@ -144,11 +144,11 @@ class Traject::SolrJsonWriter
     if exception || resp.status != 200
       if exception
-        msg = Traject::Util.exception_to_log_message(e)
+        msg = Traject::Util.exception_to_log_message(exception)
       else
         msg = "Solr error response: #{resp.status}: #{resp.body}"
       end
-      logger.error "Could not add record #{record_id_from_context c} at source file position #{c.position}: #{msg}"
+      logger.error "Could not add record #{c.source_record_id} at source file position #{c.position}: #{msg}"
       logger.debug(c.source_record.to_s)
       @skipped_record_incrementer.increment
@@ -166,20 +166,6 @@ class Traject::SolrJsonWriter
     settings["logger"] ||= Yell.new(STDERR, :level => "gt.fatal") # null logger
   end
-  # Returns MARC 001, then a slash, then output_hash["id"] -- if both
-  # are present. Otherwise may return just one, or even an empty string.
-  def record_id_from_context(context)
-    marc_id = if context.source_record &&
-                 context.source_record.kind_of?(MARC::Record) &&
-                 context.source_record['001']
-      context.source_record['001'].value
-    end
-    output_id = context.output_hash["id"]
-    return [marc_id, output_id].compact.join("/")
-  end
   # On close, we need to (a) raise any exceptions we might have, (b) send off
   # the last (possibly empty) batch, and (c) commit if instructed to do so
   # via the solr_writer.commit_on_close setting.

data/lib/traject/util.rb CHANGED

@@ -26,6 +26,81 @@ module Traject
       str.split(':in `').first
     end
+    # Provide a config source file path, and an exception.
+    #
+    # Returns the line number from the first line in the stack
+    # trace of the exception that matches your file path.
+    # of the first line in the backtrace matching that file_path.
+    #
+    # Returns `nil` if no suitable backtrace line can be found.
+    #
+    # Has special logic to try and grep the info out of a SyntaxError, bah.
+    def self.backtrace_lineno_for_config(file_path, exception)
+      # For a SyntaxError, we really need to grep it from the
+      # exception message, it really appears to be nowhere else. Ugh.
+      if exception.kind_of? SyntaxError
+        if exception.message =~ /:(\d+):/
+          return $1.to_i
+        end
+      end
+      # Otherwise we try to fish it out of the backtrace, first
+      # line matching the config file path.
+      # exception.backtrace_locations exists in MRI 2.1+, which makes
+      # our task a lot easier. But not yet in JRuby 1.7.x, so we got to
+      # handle the old way of having to parse the strings in backtrace too.
+      if ( exception.respond_to?(:backtrace_locations) &&
+           exception.backtrace_locations &&
+           exception.backtrace_locations.length > 0 )
+        location = exception.backtrace_locations.find do |bt|
+          bt.path == file_path
+        end
+        return location ? location.lineno : nil
+      else # have to parse string backtrace
+        exception.backtrace.each do |line|
+          if line.start_with?(file_path)
+            return $1.to_i if line =~ /\A.*\:(\d+)\:in/
+            break
+          end
+        end
+        # if we got here, we have nothing
+        return nil
+      end
+    end
+    # Extract just the part of the backtrace that is "below"
+    # the config file mentioned. If we can't find the config file
+    # in the stack trace, we might return empty array.
+    #
+    # If the ruby supports Exception#backtrace_locations, the
+    # returned array will actually be of Thread::Backtrace::Location elements.
+    def self.backtrace_from_config(file_path, exception)
+      filtered_trace = []
+      found = false
+      # MRI 2.1+ has exception.backtrace_locations which makes
+      # this a lot easier, but JRuby 1.7.x doesn't yet, so we
+      # need to do it both ways.
+      if ( exception.respond_to?(:backtrace_locations) &&
+           exception.backtrace_locations &&
+           exception.backtrace_locations.length > 0 )
+        exception.backtrace_locations.each do |location|
+          filtered_trace << location
+          (found=true and break) if location.path == file_path
+        end
+      else
+        filtered_trace = []
+        exception.backtrace.each do |line|
+          filtered_trace << line
+          (found=true and break) if line.start_with?(file_path)
+        end
+      end
+      return found ? filtered_trace : []
+    end
     # Ruby stdlib queue lacks a 'drain' function, we write one.
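The string-parsing branch of the two `Traject::Util` helpers above can be reduced to a small sketch that works on an array of backtrace strings. `trace_down_to` and `lineno_for` are hypothetical names; unlike the helpers in the diff, they take the backtrace array directly and skip the `SyntaxError` and `backtrace_locations` branches.

```ruby
# Keep backtrace lines up to and including the first one that starts
# with file_path; return [] if the path never appears.
def trace_down_to(file_path, backtrace)
  index = backtrace.index { |line| line.start_with?(file_path) }
  index ? backtrace[0..index] : []
end

# Line number of the first backtrace line for file_path, or nil.
def lineno_for(file_path, backtrace)
  line  = backtrace.find { |l| l.start_with?(file_path) }
  match = line && line.match(/:(\d+):in/)
  match ? match[1].to_i : nil
end
```

Cutting the trace at the config file hides the frames belonging to the loader itself, which are rarely useful to whoever wrote the config.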

data/lib/traject/version.rb CHANGED

@@ -1,3 +1,3 @@
 module Traject
-  VERSION = "2.0.2"
+  VERSION = "2.1.0"
 end

data/test/indexer/context_test.rb ADDED

@@ -0,0 +1,35 @@
+require 'test_helper'
+describe "Traject::Indexer::Context" do
+  describe "source_record_id" do
+    before do
+      @record = MARC::Reader.new(support_file_path('test_data.utf8.mrc')).first
+      @context = Traject::Indexer::Context.new
+      @record_001 = "   00282214 " # from the mrc file
+    end
+    it "gets it from 001" do
+      @context.source_record = @record
+      assert_equal @record_001, @context.source_record_id
+    end
+    it "gets it from the id" do
+      @context.output_hash['id'] = 'the_record_id'
+      assert_equal 'the_record_id', @context.source_record_id
+    end
+    it "gets from the id with non-MARC source" do
+      @context.source_record = Object.new
+      @context.output_hash['id'] = 'the_record_id'
+      assert_equal 'the_record_id', @context.source_record_id
+    end
+    it "gets it from both 001 and id" do
+      @context.output_hash['id'] = 'the_record_id'
+      @context.source_record = @record
+      assert_equal [@record_001, 'the_record_id'].join('/'), @context.source_record_id
+    end
+  end
+end
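The four cases in the test above all come down to one expression, `[a, b].compact.join("/")`. A standalone sketch, with `describe_record` as a hypothetical helper name:

```ruby
# Join whichever identifiers are present with a slash; nil entries are dropped,
# so one id yields just that id and no ids yields an empty string.
def describe_record(source_id, output_id)
  [source_id, output_id].compact.join("/")
end
```

Because an empty string is a possible result, log messages are easier to trace when they also include the record's position, as the comment on `source_record_id` suggests.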

data/test/indexer/load_config_file_test.rb ADDED

@@ -0,0 +1,89 @@
+require 'test_helper'
+require 'tempfile'
+describe "Traject::Indexer#load_config_path" do
+  before do
+    @indexer = Traject::Indexer.new
+  end
+  describe "with bad path" do
+    it "raises ENOENT on non-existing path" do
+      assert_raises(Errno::ENOENT) { @indexer.load_config_file("does/not/exist.rb") }
+    end
+    it "raises EACCES on non-readable path" do
+      file = Tempfile.new('traject_test')
+      FileUtils.chmod("ugo-r", file.path)
+      assert_raises(Errno::EACCES) { @indexer.load_config_file(file.path) }
+      file.unlink
+    end
+  end
+  describe "with good config" do
+    before do
+      @config_file = tmp_config_file_with(%Q{
+        settings do
+          provide "our_key", "our_value"
+        end
+        to_field "literal", literal("literal")
+      })
+    end
+    after do
+      @config_file.unlink
+    end
+    it "loads config file by path" do
+      @indexer.load_config_file(@config_file.path)
+      assert_equal "our_value", @indexer.settings["our_key"]
+    end
+  end
+  describe "with error in config" do
+    after do
+      @config_file.unlink if @config_file
+    end
+    it "raises good error on SyntaxError type" do
+      @config_file = tmp_config_file_with(%Q{
+        puts "foo"
+        # Intentional syntax error missing comma
+        to_field "foo" extract_marc("245")
+      })
+      e = assert_raises(Traject::Indexer::ConfigLoadError) do
+        @indexer.load_config_file(@config_file.path)
+      end
+      assert_kind_of SyntaxError, e.original
+      assert_equal @config_file.path, e.config_file
+      assert_equal 4,  e.config_file_lineno
+    end
+    it "raises good error on StandardError type" do
+      @config_file = tmp_config_file_with(%Q{
+        # Intentional non-syntax error, bad extract_marc spec
+        to_field "foo", extract_marc("#%^%^%^")
+      })
+      e = assert_raises(Traject::Indexer::ConfigLoadError) do
+        @indexer.load_config_file(@config_file.path)
+      end
+      assert_kind_of StandardError, e.original
+      assert_equal @config_file.path, e.config_file
+      assert_equal 3,  e.config_file_lineno
+    end
+  end
+  def tmp_config_file_with(str)
+    file = Tempfile.new('traject_test_config')
+    file.write(str)
+    file.rewind
+    return file
+  end
+end

data/test/indexer/macros_marc21_test.rb CHANGED

@@ -118,6 +118,11 @@ describe "Traject::Macros::Marc21" do
     end
   end
+  it "supports #extract_marc_from module method" do
+    output_arr = ::Traject::Macros::Marc21.extract_marc_from(@record, "245ab", :trim_punctuation => true)
+    assert_equal ["Manufacturing consent : the political economy of the mass media"], output_arr
+  end
   describe "serialized_marc" do
     it "serializes xml" do
       @indexer.instance_eval do

data/test/indexer/writer_test.rb ADDED

@@ -0,0 +1,54 @@
+require 'test_helper'
+require 'traject/yaml_writer'
+describe "The writer on Traject::Indexer" do
+  let(:indexer) { Traject::Indexer.new("solr.url" => "http://example.com") }
+  it "has a default" do
+    assert_instance_of Traject::SolrJsonWriter, indexer.writer
+    assert_equal Traject::SolrJsonWriter, indexer.writer_class
+  end
+  describe "when the writer is set in config" do
+    let(:writer) { Traject::YamlWriter.new({}) }
+    let(:indexer) { Traject::Indexer.new(
+      "solr.url" => "http://example.com",
+      "writer_class" => 'Traject::SolrJsonWriter',
+      "writer"   => writer
+      )}
+    it "uses writer from config" do
+      assert_equal writer, indexer.writer
+      assert_equal writer.class, indexer.writer_class
+    end
+  end
+  describe "when writer_class is set directly" do
+    let(:writer_class) { Traject::YamlWriter }
+    before do
+      indexer.writer_class = writer_class
+    end
+    it "uses writer_class set directly" do
+      assert_kind_of writer_class, indexer.writer
+      assert_equal writer_class, indexer.writer_class
+    end
+  end
+  describe "when the writer is set directly" do
+    let(:writer) { Traject::YamlWriter.new({}) }
+    before do
+      indexer.writer = writer
+    end
+    it "uses the set value" do
+      assert_equal writer, indexer.writer
+      assert_equal writer.class, indexer.writer_class
+    end
+  end
+end

data/test/solr_json_writer_test.rb CHANGED

@@ -215,34 +215,5 @@ describe "Traject::SolrJsonWriter" do
       assert_equal "http://example.com/solr/update", @writer.determine_solr_update_url
     end
   end
-  describe "Record id from context" do
-    before do
-      @record = MARC::Reader.new(support_file_path('test_data.utf8.mrc')).first
-      @context = Traject::Indexer::Context.new
-      @writer = create_writer
-      @record_001 = "   00282214 " # from the mrc file
-    end
-    it "gets it from 001" do
-      @context.source_record = @record
-      assert_equal @record_001, @writer.record_id_from_context(@context)
-    end
-    it "gets it from the id" do
-      @context.output_hash['id'] = 'the_record_id'
-      assert_equal 'the_record_id', @writer.record_id_from_context(@context)
-    end
-    it "gets it from both 001 and id" do
-      @context.output_hash['id'] = 'the_record_id'
-      @context.source_record = @record
-      assert_equal [@record_001, 'the_record_id'].join('/'), @writer.record_id_from_context(@context)
-    end
-  end
 end

data/traject.gemspec CHANGED

@@ -9,7 +9,7 @@ Gem::Specification.new do |spec|
   spec.authors       = ["Jonathan Rochkind", "Bill Dueber"]
   spec.email         = ["none@nowhere.org"]
   spec.summary       = %q{Index MARC to Solr; or generally process source records to hash-like structures}
-  spec.homepage      = "http://github.com/traject-project/traject"
+  spec.homepage      = "http://github.com/traject/traject"
   spec.license       = "MIT"
   spec.files         = `git ls-files`.split($/)

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: traject
 version: !ruby/object:Gem::Version
-  version: 2.0.2
+  version: 2.1.0
 platform: ruby
 authors:
 - Jonathan Rochkind
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-02-20 00:00:00.000000000 Z
+date: 2015-07-02 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: concurrent-ruby
@@ -233,7 +233,9 @@ files:
 - lib/translation_maps/marc_languages.yaml
 - test/debug_writer_test.rb
 - test/delimited_writer_test.rb
+- test/indexer/context_test.rb
 - test/indexer/each_record_test.rb
+- test/indexer/load_config_file_test.rb
 - test/indexer/macros_marc21_semantics_test.rb
 - test/indexer/macros_marc21_test.rb
 - test/indexer/macros_test.rb
@@ -241,6 +243,7 @@ files:
 - test/indexer/read_write_test.rb
 - test/indexer/settings_test.rb
 - test/indexer/to_field_test.rb
+- test/indexer/writer_test.rb
 - test/marc_extractor_test.rb
 - test/marc_format_classifier_test.rb
 - test/marc_reader_test.rb
@@ -287,7 +290,7 @@ files:
 - test/translation_maps/translate_array_test.yaml
 - test/translation_maps/yaml_map.yaml
 - traject.gemspec
-homepage: http://github.com/traject-project/traject
+homepage: http://github.com/traject/traject
 licenses:
 - MIT
 metadata: {}
@@ -314,7 +317,9 @@ summary: Index MARC to Solr; or generally process source records to hash-like st
 test_files:
 - test/debug_writer_test.rb
 - test/delimited_writer_test.rb
+- test/indexer/context_test.rb
 - test/indexer/each_record_test.rb
+- test/indexer/load_config_file_test.rb
 - test/indexer/macros_marc21_semantics_test.rb
 - test/indexer/macros_marc21_test.rb
 - test/indexer/macros_test.rb
@@ -322,6 +327,7 @@ test_files:
 - test/indexer/read_write_test.rb
 - test/indexer/settings_test.rb
 - test/indexer/to_field_test.rb
+- test/indexer/writer_test.rb
 - test/marc_extractor_test.rb
 - test/marc_format_classifier_test.rb
 - test/marc_reader_test.rb