RubyGems - traject - Versions diffs - 2.1.0-java → 2.2.0-java

traject 2.1.0-java → 2.2.0-java

Files changed (18)

checksums.yaml +4 -4
data/.gitignore +2 -0
data/.travis.yml +8 -20
data/CHANGES.md +14 -0
data/README.md +35 -56
data/doc/extending.md +20 -27
data/doc/indexing_rules.md +46 -57
data/doc/settings.md +17 -48
data/lib/traject/debug_writer.rb +31 -5
data/lib/traject/indexer.rb +6 -4
data/lib/traject/marc_extractor.rb +37 -157
data/lib/traject/marc_extractor_spec.rb +229 -0
data/lib/traject/version.rb +1 -1
data/test/debug_writer_test.rb +41 -0
data/test/marc_extractor_test.rb +24 -24
data/test/test_support/demo_config.rb +1 -1
data/traject.gemspec +5 -5
metadata +74 -73

data/doc/settings.md CHANGED

@@ -25,80 +25,49 @@ settings are applied first of all. It's recommended you use `provide`.
 ## Known settings
-* `debug_ascii_progress`: true/'true' to print ascii characters to STDERR indicating progress. Note,
-                          yes, this is fixed to STDERR, regardless of your logging setup.
-                          * `.` for every batch of records read and parsed
-                          * `^` for every batch of records batched and queued for adding to solr
-                                (possibly in thread pool)
-                          * `%` for completing of a Solr 'add'
-                          * `!` when threadpool for solr add has a full queue, so solr add is
-                                going to happen in calling queue -- means solr adding can't
-                                keep up with production.
+* `debug_ascii_progress`: true/'true' to print ascii characters to STDERR indicating progress. Yes, this is fixed to STDERR, regardless of your logging setup.
+  * `.` for every batch of records read and parsed
+  * `^` for every batch of records batched and queued for adding to solr (possibly in thread pool)
+  * `%` for completing of a Solr 'add'
+  * `!` when threadpool for solr add has a full queue, so solr add is going to happen in calling queue -- means solr adding can't keep up with production.
 * `json_writer.pretty_print`: used by the JsonWriter, if set to true, will output pretty printed json (with added whitespace) for easier human readability. Default false.
 * `log.file`: filename to send logging, or 'STDOUT' or 'STDERR' for those streams. Default STDERR
-* `log.error_file`: Default nil, if set then all log lines of ERROR and higher will be _additionally_
-                  sent to error file named.
+* `log.error_file`: Default nil, if set then all log lines of ERROR and higher will be _additionally_ sent to error file named.
 * `log.format`: Formatting string used by Yell logger. https://github.com/rudionrails/yell/wiki/101-formatting-log-messages
-* `log.level`:  Log this level and above. Default 'info', set to eg 'debug' to get potentially more logging info,
-              or 'error' to get less. https://github.com/rudionrails/yell/wiki/101-setting-the-log-level
+* `log.level`:  Log this level and above. Default 'info', set to eg 'debug' to get potentially more logging info, or 'error' to get less. https://github.com/rudionrails/yell/wiki/101-setting-the-log-level
-* `log.batch_size`: If set to a number N (or string representation), will output a progress line to
-   log. (by default as INFO, but see log.batch_size.severity)
+* `log.batch_size`: If set to a number N (or string representation), will output a progress line to log. (by default as INFO, but see log.batch_size.severity)
 * `log.batch_size.severity`: If `log.batch_size` is set, what logger severity level to log to. Default "INFO", set to "DEBUG" etc if desired.
 * `marc_source.type`: default 'binary'. Can also set to 'xml' or (not yet implemented todo) 'json'. Command line shortcut `-t`
-* `marcout.allow_oversized`: Used with `-x marcout` command to output marc when outputting
-     as ISO 2709 binary, set to true or string "true", and the MARC::Writer will have
-     allow_oversized=true set, allowing oversized records to be serialized with length
-    bytes zero'd out -- technically illegal, but can be read by MARC::Reader in permissive mode.
+* `marcout.allow_oversized`: Used with `-x marcout` command to output marc when outputting as ISO 2709 binary, set to true or string "true", and the MARC::Writer will have  allow_oversized=true set, allowing oversized records to be serialized with length bytes zero'd out -- technically illegal, but can be read by MARC::Reader in permissive mode.
-* `output_file`: Output file to write to for operations that write to files: For instance the `marcout` command,
-                 or Writer classes that write to files, like Traject::JsonWriter. Has an shortcut
-                 `-o` on command line.
+* `output_file`: Output file to write to for operations that write to files: For instance the `marcout` command, or Writer classes that write to files, like Traject::JsonWriter. Has an shortcut `-o` on command line.
-* `processing_thread_pool` Number of threads in the main thread pool used for processing
-   records with input rules. On JRuby or Rubinius, defaults to 1 less than the number of processors detected on your machine. On other ruby platforms, defaults to 1. Set to 0 or nil
-   to disable thread pool, and do all processing in main thread.
+* `processing_thread_pool` Number of threads in the main thread pool used for processing records with input rules. On JRuby or Rubinius, defaults to 1 less than the number of processors detected on your machine. On other ruby platforms, defaults to 1. Set to 0 or nil to disable thread pool, and do all processing in main thread.
-   Choose a pool size based on size of your machine, and complexity of your indexing rules, you
-   might want to try different sizes and measure which works best for you.
-   Probably no reason for it ever to be more than number of cores on indexing machine.
+  Choose a pool size based on size of your machine, and complexity of your indexing rules, you might want to try different sizes and measure which works best for you. Probably no reason for it ever to be more than number of cores on indexing machine.
-* `reader_class_name`: a Traject Reader class, used by the indexer as a source
-    of records.   Defaults to Traject::Marc4JReader (using the Java Marc4J
-    library) on JRuby; Traject::MarcReader (using the ruby marc gem) otherwise.
-    Command-line shortcut `-r`
+* `reader_class_name`: a Traject Reader class, used by the indexer as a source of records.   Defaults to Traject::Marc4JReader (using the Java Marc4J library) on JRuby; Traject::MarcReader (using the ruby marc gem) otherwise. Command-line shortcut `-r`
 * `solr.url`: URL to connect to a solr instance for indexing, eg http://example.org:8983/solr . Command-line short-cut `-u`.
-* `solr.version`: Set to eg "1.4.0", "4.3.0"; currently un-used, but in the future will control
-  change some default settings, and/or sanity check and warn you if you're doing something
-  that might not work with that version of solr. Set now for help in the future.
+* `solr.version`: Set to eg "1.4.0", "4.3.0"; currently un-used, but in the future will control some default settings, and/or sanity check and warn you if you're doing something that might not work with that version of solr. Set now for help in the future.
-* `solr_writer.batch_size`: size of batches that SolrJsonWriter will send docs to Solr in. Default 100. Set to nil,
-  0, or 1, and SolrJsonWriter will do one http transaction per document, no batching.
+* `solr_writer.batch_size`: size of batches that SolrJsonWriter will send docs to Solr in. Default 100. Set to nil, 0, or 1, and SolrJsonWriter will do one http transaction per document, no batching.
 * `solr_writer.commit_on_close`: default false, set to true to have the solr writer send an explicit commit message to Solr after indexing.
+* `solr_writer.thread_pool`: defaults to 1 (single bg thread). A thread pool is used for submitting docs to solr. Set to 0 or nil to disable threading. Set to 1, there will still be a single bg thread doing the adds. May make sense to set higher than number of cores on your indexing machine, as these threads will mostly be waiting on Solr. Speed/capacity of your solr might be more relevant. Note that processing_thread_pool threads can end up submitting to solr too, if solr_json_writer.thread_pool is full.
-* `solr_writer.thread_pool`:       Defaults to 1 (single bg thread). A thread pool is used for submitting docs
-                                    to solr. Set to 0 or nil to disable threading. Set to 1,
-                                    there will still be a single bg thread doing the adds.
-                                    May make sense to set higher than number of cores on your
-                                    indexing machine, as these threads will mostly be waiting
-                                    on Solr. Speed/capacity of your solr might be more relevant.
-                                    Note that processing_thread_pool threads can end up submitting
-                                    to solr too, if solr_json_writer.thread_pool is full.
-* `writer`: An object that implements the Traject Writer interface. If set, takes precedence
-            over `writer_class_name`.
+* `writer`: An object that implements the Traject Writer interface. If set, takes precedence over `writer_class_name`.
 * `writer_class_name`: a Traject Writer class, used by indexer to send processed dictionaries off. Will be used if no explicit `writer` setting or `#writer=` is set. Default Traject::SolrJsonWriter, other writers for debugging or writing to files are also available. See Traject::Indexer for more info. Command line shortcut `-w`
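
Note on the hunk above: the settings it lists are normally supplied with `provide`. The sketch below is not traject's own settings class. `SketchSettings` is a hypothetical stand-in, and it assumes that `provide` keeps the first value given for a key, which is the usual reason for recommending it over direct assignment.

```ruby
# Hypothetical stand-in for a settings object; NOT the gem's implementation.
class SketchSettings < Hash
  # Assumption: store the value only when the key has no value yet,
  # so an earlier source (e.g. the command line) wins over a later one.
  def provide(key, value)
    self[key] = value unless key?(key)
  end
end

settings = SketchSettings.new
settings.provide "solr_writer.batch_size", 100
settings.provide "solr_writer.batch_size", 500  # ignored: already provided
settings.provide "log.level", "debug"
settings.provide "processing_thread_pool", 0    # 0 or nil disables the pool, per the doc text

puts settings.inspect
```

With this shape, a default provided late in a config file cannot clobber a value that was set earlier.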

data/lib/traject/debug_writer.rb CHANGED

@@ -32,14 +32,40 @@ require 'traject/line_writer'
 #       provide "output_file", "out.txt"
 #     end
 class Traject::DebugWriter < Traject::LineWriter
-  DEFAULT_FORMAT = '%-12s %-25s %s'
   DEFAULT_IDFIELD = 'id'
+  DEFAULT_FORMAT  = '%-12s %-25s %s'
+  def initialize(*)
+    super
+    @idfield = settings["debug_writer.idfield"] || DEFAULT_IDFIELD
+    @format  = settings['debug_writer.format'] || DEFAULT_FORMAT
+    if @idfield == 'record_position' then
+      @use_position = true
+    end
+    @already_threw_warning_about_missing_id = false
+  end
+  def record_number(context)
+    return context.position if @use_position
+    if context.output_hash.has_key?(@idfield)
+      context.output_hash[@idfield].first
+    else
+      unless @already_threw_warning_about_missing_id
+        context.logger.warn "At least one record (##{context.position}) doesn't define field '#{@idfield}'.
+All records are assumed to have a unique id. You can set which field to look in via the setting 'debug_writer.idfield'"
+        @already_threw_warning_about_missing_id = true
+      end
+      "record_num_#{context.position}"
+    end
+  end
   def serialize(context)
-    idfield = settings["debug_writer.idfield"] || DEFAULT_IDFIELD
-    format  = settings['debug_writer.format']  || DEFAULT_FORMAT
-    h = context.output_hash
-    lines = h.keys.sort.map {|k| format % [h[idfield].first, k, h[k].join(' | ')] }
+    h       = context.output_hash
+    rec_key = record_number(context)
+    lines   = h.keys.sort.map { |k| @format % [rec_key, k, h[k].join(' | ')] }
     lines.push "\n"
     lines.join("\n")
   end
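
Note on the hunk above: the behavioural change is the fallback key used when a record has no id field. The following is a minimal, self-contained sketch of that logic, not the gem's code. `SketchContext`, `record_key`, and `sketch_serialize` are hypothetical names, and unlike the diff, this sketch logs a warning on every missing id rather than only the first.

```ruby
require 'logger'
require 'stringio'

# Hypothetical stand-in for the context object a writer receives.
SketchContext = Struct.new(:output_hash, :position, :logger)

SKETCH_FORMAT = '%-12s %-25s %s'

# Pick the value printed in the first column of each output line.
def record_key(context, idfield = 'id')
  return context.position if idfield == 'record_position'
  if context.output_hash.key?(idfield)
    context.output_hash[idfield].first
  else
    # The diff warns only once; this sketch warns each time for brevity.
    context.logger.warn "record ##{context.position} has no '#{idfield}' field"
    "record_num_#{context.position}"
  end
end

# One line per output field, sorted by field name, values joined with ' | '.
def sketch_serialize(context, idfield = 'id', format = SKETCH_FORMAT)
  h     = context.output_hash
  key   = record_key(context, idfield)
  lines = h.keys.sort.map { |k| format % [key, k, h[k].join(' | ')] }
  lines.push "\n"
  lines.join("\n")
end

log        = Logger.new(StringIO.new)
with_id    = SketchContext.new({ 'id' => ['rec1'], 'title' => ['A', 'B'] }, 1, log)
without_id = SketchContext.new({ 'title' => ['C'] }, 2, log)

puts sketch_serialize(with_id)
puts sketch_serialize(without_id)
```

In 2.1.0 the second record would have raised, because `h[idfield]` is nil and `.first` is called on it.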

data/lib/traject/indexer.rb CHANGED

@@ -8,6 +8,8 @@ require 'traject/indexer/settings'
 require 'traject/marc_reader'
 require 'traject/json_writer'
 require 'traject/solr_json_writer'
+require 'traject/debug_writer'
 require 'traject/macros/marc21'
 require 'traject/macros/basic'
@@ -98,7 +100,7 @@ end
 #
 # This may raise if the file is not readable. Or if the config file
 # can't be evaluated, it will raise a Traject::Indexer::ConfigLoadError
-# with a bunch of contextual information useful to reporting to developer.
+# with a bunch of contextual information useful to reporting to developer.
 #
 # You can also instead, or in addition, write configuration inline using
 # standard ruby `instance_eval`:
@@ -704,15 +706,15 @@ class Traject::Indexer
   end
   # Raised by #load_config_file when config file can not
-  # be processed.
+  # be processed.
   #
   # The exception #message includes an error message formatted
-  # for good display to the developer, in the console.
+  # for good display to the developer, in the console.
   #
   # Original exception raised when processing config file
   # can be found in #original. Original exception should ordinarily
   # have a good stack trace, including the file path of the config
-  # file in question.
+  # file in question.
   #
   # Original config path in #config_file, and line number in config
   # file that triggered the exception in #config_file_lineno (may be nil)

data/lib/traject/marc_extractor.rb CHANGED

@@ -1,3 +1,5 @@
+require 'traject/marc_extractor_spec'
 module Traject
   # MarcExtractor is a class for extracting lists of strings from a MARC::Record,
   # according to specifications. See #parse_string_spec for description of string
@@ -36,7 +38,7 @@ module Traject
   # and includes a tag and a a byte slice specification.
   #
   #      "008[35-37]:007[5]""
-  #      => bytes 35-37 inclusive of any field 008, and byte 5 of any field 007
+  #      => bytes 35-37 inclusive of any field 008, and byte 5 of any field 007
   #
   # * subfields and indicators can only be provided for marc data/variable fields
   # * byte slice can only be provided for marc control fields (generally tags less than 010)
@@ -105,7 +107,9 @@ module Traject
   # lazily create and then re-use a MarcExtractor object with
   # particular initialization arguments.
   class MarcExtractor
-    attr_accessor :options, :spec_hash
+    attr_accessor :options, :spec_set
+    ALTERNATE_SCRIPT_TAG = '880'
     # First arg is a specification for extraction of data from a MARC record.
     # Specification can be given in two forms:
@@ -126,30 +130,48 @@ module Traject
     #                     * :only => only include linked 880s, not original
     def initialize(spec, options = {})
       self.options = {
-        :separator => ' ',
-        :alternate_script => :include
+          :separator        => ' ',
+          :alternate_script => :include
       }.merge(options)
-      self.spec_hash = spec.kind_of?(Hash) ? spec : self.class.parse_string_spec(spec)
+      self.spec_set = SpecSet.new(spec)
       # Tags are "interesting" if we have a spec that might cover it
-      @interesting_tags_hash = {}
       # By default, interesting tags are those represented by keys in spec_hash.
       # Add them unless we only care about alternate scripts.
       unless options[:alternate_script] == :only
-        self.spec_hash.keys.each {|tag| @interesting_tags_hash[tag] = true}
+        self.spec_set.tags.each { |tag| show_interest_in_tag(tag) }
       end
       # If we *are* interested in alternate scripts, add the 880
       if options[:alternate_script] != false
-        @interesting_tags_hash['880'] = true
+        @fetch_alternate_script = true
+        show_interest_in_tag(ALTERNATE_SCRIPT_TAG)
       end
       self.freeze
     end
+    # Declare that we're interested in a tag
+    def show_interest_in_tag(tag)
+      @interesting_tags_hash      ||= {}
+      @interesting_tags_hash[tag] = true
+    end
+    # Check to see if a tag is interesting (meaning it may be covered by a spec
+    # and the passed-in options about alternate scripts)
+    def interesting_tag?(tag)
+      return @interesting_tags_hash.include?(tag)
+    end
+    # All the "interesting" tags
+    def interesting_tags
+      @interesting_tags_hash.keys
+    end
     # Takes the same arguments as MarcExtractor.new, but will re-use an existing
     # cached MarcExtractor already created with given initialization arguments,
     # if available.
@@ -169,80 +191,11 @@ module Traject
     #     extractor = MarcExtractor.cached("245abc:700a", :separator => nil)
     def self.cached(*args)
       cache = (Thread.current[:marc_extractor_cached] ||= Hash.new)
-      return ( cache[args] ||= Traject::MarcExtractor.new(*args).freeze )
+      return (cache[args] ||= Traject::MarcExtractor.new(*args).freeze)
     end
-    # Check to see if a tag is interesting (meaning it may be covered by a spec
-    # and the passed-in options about alternate scripts)
-    def interesting_tag?(tag)
-      return @interesting_tags_hash.include?(tag)
-    end
-    # Converts from a string marc spec like "008[35]:245abc:700a" to a hash used internally
-    # to represent the specification. See comments at head of class for
-    # documentation of string specification format.
-    #
-    #
-    # ## Return value
-    #
-    # The hash returned is keyed by tag, and has as values an array of 0 or
-    # or more MarcExtractor::Spec objects representing the specified extraction
-    # operations for that tag.
-    #
-    # It's an array of possibly more than one, because you can specify
-    # multiple extractions on the same tag: for instance "245a:245abc"
-    #
-    # See tests for more examples.
-    def self.parse_string_spec(spec_string)
-      # hash defaults to []
-      hash = Hash.new
-      spec_strings = spec_string.is_a?(Array) ? spec_string.map{|s| s.split(/\s*:\s*/)}.flatten : spec_string.split(/s*:\s*/)
-      spec_strings.each do |part|
-        if (part =~ /\A([a-zA-Z0-9]{3})(\|([a-z0-9\ \*]{2})\|)?([a-z0-9]*)?\Z/)
-          # variable field
-          tag, indicators, subfields = $1, $3, $4
-          spec = Spec.new(:tag => tag)
-          if subfields and !subfields.empty?
-            spec.subfields = subfields.split('')
-          end
-          if indicators
-           # if specified as '*', leave nil
-           spec.indicator1 = indicators[0] if indicators[0] != "*"
-           spec.indicator2 = indicators[1] if indicators[1] != "*"
-          end
-          hash[spec.tag] ||= []
-          hash[spec.tag] << spec
-        elsif (part =~ /\A([a-zA-Z0-9]{3})(\[(\d+)(-(\d+))?\])\Z/) # control field, "005[4-5]"
-          tag, byte1, byte2 = $1, $3, $5
-          spec = Spec.new(:tag => tag)
-          if byte1 && byte2
-            spec.bytes = ((byte1.to_i)..(byte2.to_i))
-          elsif byte1
-           spec.bytes = byte1.to_i
-          end
-          hash[spec.tag] ||= []
-          hash[spec.tag] << spec
-        else
-          raise ArgumentError.new("Unrecognized marc extract specification: #{part}")
-        end
-      end
-      return hash
-    end
-    # Returns array of strings, extracted values. Maybe empty array.
+    # Returns array of strings from a MARC::Record, extracted values. May be empty array.
     def extract(marc_record)
       results = []
@@ -265,14 +218,10 @@ module Traject
     # Third (optional) arg to block is self, the MarcExtractor object, useful for custom
     # implementations.
     def each_matching_line(marc_record)
-      marc_record.fields(@interesting_tags_hash.keys).each do |field|
+      marc_record.fields(interesting_tags).each do |field|
-        # Make sure it matches indicators too, specs_covering_field
-        # doesn't check that.
         specs_covering_field(field).each do |spec|
-          if spec.matches_indicators?(field)
             yield(field, spec, self)
-          end
         end
       end
@@ -314,29 +263,13 @@ module Traject
     end
     # Find Spec objects, if any, covering extraction from this field.
     # Returns an array of 0 or more MarcExtractor::Spec objects
     #
-    # When given an 880, will return the spec (if any) for the linked tag iff
-    # we have a $6 and we want the alternate script.
-    #
     # Returns an empty array in case of no matching extraction specs.
     def specs_covering_field(field)
-      tag = field.tag
-      # Short-circuit the unintersting stuff
-      return [] unless interesting_tag?(tag)
-      # Due to bug in jruby https://github.com/jruby/jruby/issues/886 , we need
-      # to do this weird encode gymnastics, which fixes it for mysterious reasons.
-      if tag == "880" && field['6']
-        tag = field["6"].encode(field["6"].encoding).byteslice(0,3)
-      end
-      # Take the resulting tag and get the spec from it (or the default nil if there isn't a spec for this tag)
-      spec = self.spec_hash[tag] || []
+      return [] unless interesting_tag?(field.tag)
+      self.spec_set.specs_matching_field(field, @fetch_alternate_script)
     end
@@ -348,63 +281,10 @@ module Traject
     def freeze
       self.options.freeze
-      self.spec_hash.freeze
+      self.spec_set.freeze
       super
     end
-    # Represents a single specification for extracting data
-    # from a marc field, like "600abc" or "600|1*|x".
-    #
-    # Includes the tag for reference, although this is redundant and not actually used
-    # in logic, since the tag is also implicit in the overall spec_hash
-    # with tag => [spec1, spec2]
-    class Spec
-      attr_accessor :tag, :subfields, :indicator1, :indicator2, :bytes
-      def initialize(hash = {})
-        hash.each_pair do |key, value|
-          self.send("#{key}=", value)
-        end
-      end
-      #  Should subfields extracted by joined, if we have a seperator?
-      #  * '630' no subfields specified => join all subfields
-      #  * '630abc' multiple subfields specified = join all subfields
-      #  * '633a' one subfield => do not join, return one value for each $a in the field
-      #  * '633aa' one subfield, doubled => do join after all, will return a single string joining all the values of all the $a's.
-      #
-      # Last case is handled implicitly at the moment when subfields == ['a', 'a']
-      def joinable?
-        (self.subfields.nil? || self.subfields.size != 1)
-      end
-      # Pass in a MARC field, do it's indicators match indicators
-      # in this spec? nil indicators in spec mean we don't care, everything
-      # matches.
-      def matches_indicators?(field)
-        return (self.indicator1.nil? || self.indicator1 == field.indicator1) &&
-          (self.indicator2.nil? || self.indicator2 == field.indicator2)
-      end
-      # Pass in a string subfield code like 'a'; does this
-      # spec include it?
-      def includes_subfield_code?(code)
-        # subfields nil means include them all
-        self.subfields.nil? || self.subfields.include?(code)
-      end
-      def ==(spec)
-        return false unless spec.kind_of?(Spec)
-        return (self.tag == spec.tag) &&
-          (self.subfields == spec.subfields) &&
-          (self.indicator1 == spec.indicator1) &&
-          (self.indicator1 == spec.indicator2) &&
-          (self.bytes == spec.bytes)
-      end
-    end
   end
 end
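
Note on the hunks above: the string-spec parsing removed from this file is replaced by `SpecSet` in `marc_extractor_spec.rb`, whose body is not shown in this excerpt. The sketch below is adapted from the removed `parse_string_spec` and only illustrates the spec grammar. `SketchSpec` and `parse_spec_string` are hypothetical names, and the split pattern here uses `\s*` on both sides of the colon, where the removed line has a literal `s*` on the left.

```ruby
# Hypothetical struct for illustration; the gem's own Spec class differs.
SketchSpec = Struct.new(:tag, :subfields, :indicator1, :indicator2, :bytes)

# Patterns copied from the removed parse_string_spec.
VARIABLE_FIELD = /\A([a-zA-Z0-9]{3})(\|([a-z0-9\ \*]{2})\|)?([a-z0-9]*)?\Z/
CONTROL_FIELD  = /\A([a-zA-Z0-9]{3})(\[(\d+)(-(\d+))?\])\Z/

# Returns a hash of tag => array of specs, since one tag may appear
# in several parts (for instance "245a:245abc").
def parse_spec_string(spec_string)
  specs = Hash.new { |hash, tag| hash[tag] = [] }
  spec_string.split(/\s*:\s*/).each do |part|
    if (m = VARIABLE_FIELD.match(part))
      spec = SketchSpec.new(m[1])
      spec.subfields = m[4].chars unless m[4].nil? || m[4].empty?
      if m[3]
        # '*' means "any value", which is represented as nil
        spec.indicator1 = m[3][0] unless m[3][0] == '*'
        spec.indicator2 = m[3][1] unless m[3][1] == '*'
      end
    elsif (m = CONTROL_FIELD.match(part))
      spec = SketchSpec.new(m[1])
      spec.bytes = m[5] ? (m[3].to_i..m[5].to_i) : m[3].to_i
    else
      raise ArgumentError, "Unrecognized marc extract specification: #{part}"
    end
    specs[spec.tag] << spec
  end
  specs
end

parsed = parse_spec_string("008[35-37]:245abc:600|1*|x")
parsed.each { |tag, list| puts "#{tag}: #{list.map(&:to_a).inspect}" }
```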