traject_horizon 0.0.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/.gitignore +17 -0
- data/Gemfile +4 -0
- data/LICENSE.txt +22 -0
- data/README.md +173 -0
- data/Rakefile +11 -0
- data/lib/traject/horizon_bib_auth_merge.rb +124 -0
- data/lib/traject/horizon_reader.rb +641 -0
- data/lib/traject_horizon.rb +6 -0
- data/lib/traject_horizon/version.rb +3 -0
- data/test/horizon_bib_auth_merge_test.rb +58 -0
- data/test/test_helper.rb +16 -0
- data/traject_horizon.gemspec +24 -0
- data/vendor/jtds/.DS_Store +0 -0
- data/vendor/jtds/jtds-1.2.8.jar +0 -0
- metadata +110 -0
data/.gitignore
ADDED
data/Gemfile
ADDED
data/LICENSE.txt
ADDED
@@ -0,0 +1,22 @@
+Copyright (c) 2013 Jonathan Rochkind
+
+MIT License
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,173 @@
+# Traject::Horizon
+
+Export MARC records directly from a Horizon ILS rdbms, either as serialized MARC,
+or to then index to Solr.
+
+traject-horizon is a plugin for [traject](http://github.com/jrochkind/traject), and
+requires jruby to be installed.
+
+Supports embedding copy/item holdings information in exported MARC.
+
+Fairly high-performance, should have higher throughput than most existing
+Horizon MARC export options, including the vendor-supplied Windows-only
+'marcout'. There are probably opportunities for increasing performance
+yet further with more development of multi-threaded processing.
+
+## Installation
+
+traject_horizon is a plugin for [traject](http://github.com/jrochkind/traject); install
+them both:
+
+    $ gem install traject traject_horizon
+
+### Or, if using a Gemfile with your traject project
+
+Add this line to your [traject project's Gemfile](https://github.com/jrochkind/traject/blob/master/doc/extending.md#or-with-bundler):
+
+    gem 'traject_horizon'
+
+And then execute:
+
+    $ bundle install
+
+## Usage
+
+I recommend creating a separate traject configuration file just with
+settings for the Horizon export.
+
+~~~ruby
+# horizon_conf.rb
+
+# Require traject/horizon to load the gem, including
+# the Traject::HorizonReader we'll subsequently
+# configure to be used
+require 'traject/horizon'
+
+settings do
+  store "reader_class_name", "Traject::HorizonReader"
+
+  # JDBC URL starting with "jdbc:jtds", and either "sybase:"
+  # or "sqlserver:", including username on the end but not password:
+  provide "horizon.jdbc_url", "jdbc:jtds:sybase://horizonserver.university.edu:2025/horizon_db;user=esys"
+
+  # DB password in separate setting
+  provide "horizon.jdbc_password", "drilg53"
+
+  # Do you want to include copy/item holdings information?
+  # this setting says to include "top-level" holdings,
+  # copy or item but not both. Holdings will be included
+  # in tags 991 and 937, although the tags and nature
+  # of included holdings is configurable.
+  provide "horizon.include_holdings", "direct"
+
+  # Would you like to exclude certain tags from
+  # your Horizon db? If you are including holdings,
+  # then it's recommended to exclude 991 and 937 to
+  # avoid any collision with the tags we add to represent holdings.
+  provide "horizon.exclude_tags", "991,937"
+end
+~~~
+
+There are a variety of additional settings that apply to the HorizonReader,
+especially settings for customizing the item/copy holdings information
+included. See [HorizonReader] inline comment docs.
+
+Note by default `staff-only` records are _not_ included in the export,
+but this can be changed in settings.
+
+As with all traject settings, string-valued settings can also be supplied
+on the traject command line with `-s setting=value`.
+
+### Export MARC records
+
+    $ traject -x marcout -c horizon_conf.rb -o marc_files.marc
+
+That will export your entire horizon database,
+using the connection details and configuration from horizon_conf.rb, exporting
+in ISO 2709 binary format to `marc_files.marc`.
+
+You can also specify specific ranges of bib#'s to export:
+
+    $ traject -x marcout -c horizon_conf.rb -o marc_files.marc -s horizon.first_bib=10000 -s horizon.last_bib=10100
+    $ traject -x marcout -c horizon_conf.rb -o marc_files.marc -s horizon.only_bib=12345
+
+You can export in MarcXML, or in a human readable format for debugging,
+using standard traject `-x marcout` functionality:
+
+    $ traject -x marcout -c horizon_conf.rb -s marcout.type=xml -o marc_files.xml
+
+    # leave off the `-o` argument to write to stdout, and view bib# 12345 in
+    # human-readable format:
+    $ traject -x marcout -c horizon_conf.rb -s marcout.type=human -s horizon.only_bib=12345
+
+### Indexing records to solr
+
+Traject is primarily a tool for indexing to solr. You can use `traject-horizon` to
+export from Horizon and send directly through the indexing pipeline, without
+having to serialize MARC to disk first.
+
+You would have one or more additional traject configuration files specifying
+your indexing mapping rules, and Solr connection details. See traject
+documentation.
+
+Then, simply:
+
+    $ traject -c horizon_conf.rb -c other_traject_conf.rb
+
+## Note on character encodings
+
+By default, traject-horizon assumes the data in your Horizon database is stored
+in the Marc8 encoding. (I think this is true of all Horizon databases?). And by
+default, traject-horizon will transcode it to UTF-8, marking leader byte 9 in any
+exported MARC appropriately (Using the Marc4J AnselConverter class).
+
+If you'd like traject to avoid this transcode, you can set the traject
+setting `horizon.destination_encoding` to nil or the empty string, either
+on the command line:
+
+    traject -x marcout -s horizon.destination_encoding= -c horizon_conf.rb
+
+Or in your traject configuration file:
+
+    settings do
+      #...
+      provide "horizon.destination_encoding", nil
+    end
+
+You might want to do this with `marcout` use, perhaps for diagnostics, but
+it shouldn't ever be appropriate for indexing-to-solr use, as there are limited
+facilities for dealing with Marc8 encoding in ruby.
+
+Currently, item/copy information may not be treated entirely consistently here;
+there may be edge-case encoding bugs related to non-ascii item/copy notes etc.,
+and it may not be possible to output them in Marc8. Sorry.
+
+## Challenges
+
+I had to reverse engineer the Horizon database to figure out how to turn it into
+MARC records. I believe I have been successful, and traject-horizon seems to produce
+the same output as Horizon's own marcout.
+
+Hopefully this will remain true in future Horizon versions; I don't think relevant
+aspects of Horizon architecture change very much, but it's always a risk.
+
+The two biggest challenges were dealing with character encoding, and dealing
+with merging information from the Horizon bib and auth tables.
+
+The translation from Marc8 to UTF8 appears to work properly, _except_ for
+some known issues with item/copy holding information. Item/copy holding
+information may occasionally not transcode properly, and it may not
+be possible to keep item/copy holding info in Marc8. If these become
+an actual problem in practice for anyone, further development can
+probably resolve these issues.
+
+
+## Development
+
+There is only limited test coverage at the moment, sorry. I couldn't
+quite figure out how to easily provide test coverage when so much
+functionality interacts with a Horizon database.
+
+There is some test coverage of the bib/auth merging routines.
+
+Tests are provided with minitest, and can be run with `rake test`.
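Editor's note: the README's "Indexing records to solr" section above assumes a companion configuration file carrying the mapping rules and Solr connection details. As a rough sketch only (not one of this gem's files), such an `other_traject_conf.rb` might look like the following; the field names and the Solr URL are placeholder assumptions, and it relies on traject's built-in MARC21 macros:

~~~ruby
# other_traject_conf.rb -- illustrative sketch, not part of traject_horizon.
# Field names and the Solr URL are placeholders; see the traject docs for
# the full list of settings and mapping macros.
settings do
  provide "solr.url", "http://localhost:8983/solr/collection1"
end

# Map the MARC records streamed out of Traject::HorizonReader to Solr fields.
to_field "id",    extract_marc("001", :first => true)
to_field "title", extract_marc("245abk")
~~~

It would be combined with the Horizon settings by passing both files with `-c`, as in the README example above.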
data/lib/traject/horizon_bib_auth_merge.rb
ADDED
@@ -0,0 +1,124 @@
+module Traject
+
+  # Merges 'bib text' and 'auth text' lines from Horizon, using bib text as
+  # template when neccesary.
+  #
+  #     merged_str = HorizonBibAuthMerge.new(tag, bib_text_str, auth_text_str).merge!
+  #
+  # Strings passed in may be mutated for efficiency. So you can only call merge! once, it's just
+  # utility.
+  class HorizonBibAuthMerge
+    attr_reader :bibtext, :authtext, :tag
+
+    # Pass in bibtext and authtext as String -- you probably need to get
+    # column values from JDBC as bytes and then use String.from_java_bytes
+    # to avoid messing up possible Marc8 encoding.
+    #
+    # bibtext is either text or longtext column from fullbib, preferring
+    # longtext. authtext is either xref_text or xref_longtext from fullbib,
+    # preferring xref_longtext.
+    def initialize(tag, bibtext, authtext)
+      @merged = false
+
+      @tag = tag
+      @bibtext = bibtext
+      @authtext = authtext
+
+      # remove terminal MARC Field Terminator if present.
+      @bibtext.chomp!("\x1E") if @bibtext
+      @authtext.chomp!("\x1E") if @authtext
+    end
+
+    # Returns merged string, composed of a marc 'field', with subfields
+    # seperated by seperator control chars. Does not include terminal
+    # MARC Field Seperator.
+    #
+    # Will mutate bibtext and authtext for efficiency.
+    def merge!
+      raise Exception.new("Can only call `merge!` once, already called.") if @merged
+      @merged = true
+
+      # just one? (Or neither?) Just return it.
+      return authtext if bibtext.nil?
+      return bibtext if authtext.nil?
+
+
+
+      # We need to do a crazy combination of template in text with values in authtext.
+      # horizon, you so crazy. text template is like:
+      #"\x1Fa.\x1Fp ;\x1Fv81."
+      # which means each subfield after the \x1F, merge in
+      # the subfield value from the auth record if it's present,
+      # otherwise don't.
+      #
+      # plus some weird as hell stuff with punctuation and spaces, I can't
+      # even explain it, just trial and error'd it comparing to marcout.
+      bibtext.gsub!(/\x1F([^\x1F\x1E])( ?)([[:punct:] ]*)/) do
+
+        subfield = $1
+        space = $2
+        maybe_punct = $3
+
+
+        # okay this is crazy hacky reverse engineering, I don't really
+        # know what's going on but for 240 and 243, 'a' in template
+        # is filled by 't' in auth tag.
+        auth_subfield = if subfield == "a" && (tag == "240" || tag == "243")
+          "t"
+        else
+          subfield
+        end
+
+        # Find substitute fill-in value from authtext, if it can
+        # be found -- first subfield indicated. Then we REMOVE
+        # it from authtext, so next time this subfield is asked for,
+        # subsequent subfield with that code will be used.
+        substitute = nil
+        authtext.sub!(/\x1F#{auth_subfield}([^\x1F\x1E]*)/) do
+          substitute = $1
+          ''
+        end
+
+        if substitute
+
+
+          # Dealing with punctuation is REALLY CONFUSING -- reverse engineering
+          # HIP/Horizon, which does WEIRD THINGS.
+          # But we seem to have arrived at something that appears to match all cases
+          # we can find of what HIP/Horizon does.
+          #
+          # If the auth value already ends up with the same punctuation from the template,
+          # _leave it alone_ -- including preserving all spaces near the punct in the auth
+          # value.
+          #
+          # Otherwise, remove all punct from the auth value, then add in the punct from the template,
+          # along with any spaces before the punct in the template.
+          if maybe_punct && maybe_punct.length > 0
+            # remove all punctuation from end of auth value? to use punct from template instead?
+            # But preserve initial spaces from template? Unless it already ends
+            # with the punctuation, in which case don't touch it, to avoid
+            # messing up spaces? WEIRD, yeah.
+            unless substitute.end_with? maybe_punct
+              substitute.gsub!(/[[:punct:]]+\Z/, "")
+              # This adding the #{space} back in, is consistent with what HIP does.
+              # I have no idea if it's right or a bug in HIP, but being consistent.
+              # neither leaving it in nor taking it out is exactly consistent with HznExportMarc,
+              # which seems to have bugs.
+              substitute << "#{space}#{maybe_punct}"
+            end
+          end
+
+          "\x1F#{subfield}#{substitute}"
+        else # just keep original, which has no maybe_punct
+          "\x1F#{subfield}"
+        end
+      end
+
+      # We mutated bibtext to fill in template, now just return it.
+      return bibtext
+    end
+
+
+
+  end
+end
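Editor's note: for orientation, here is the first test case from `test/horizon_bib_auth_merge_test.rb` (further down in this diff) written out with explicit escapes. The 0x1F subfield delimiters are control characters that do not survive the diff rendering, so the exact delimiter placement shown here is an inference rather than a literal copy of the test strings:

~~~ruby
require 'traject/horizon_bib_auth_merge'

# Bib-side template and auth-side values; "\x1F" marks each subfield.
bib_template = "\x1Fa \x1Fv."
auth_text    = "\x1FaOsmoregulation\x1FvCongresses."

# merge! mutates its arguments and may only be called once per instance.
merged = Traject::HorizonBibAuthMerge.new("650", bib_template, auth_text).merge!
merged # => "\x1FaOsmoregulation\x1FvCongresses."
~~~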
data/lib/traject/horizon_reader.rb
ADDED
@@ -0,0 +1,641 @@
+require 'traject'
+require 'traject/util'
+require 'traject/indexer/settings'
+
+require 'traject/horizon_bib_auth_merge'
+
+require 'marc'
+
+module Traject
+  #
+  # = Settings
+  #
+  # == Connection
+  #
+  # [horizon.jdbc_url] JDBC connection URL using jtds. Should include username, but not password.
+  #                    See `horizon.jdbc_password` setting, kept seperate so we can try to suppress
+  #                    it from logging. Eg: "jdbc:jtds:sybase://horizon.lib.univ.edu:2025/dbname;user=dbuser"
+  #                    * In command line, you'll have to use quotes: -s 'horizon.jdbc_url=jdbc:jtds:sybase://horizon.lib.univ.edu:2025/dbname;user=dbuser'
+  #
+  # [horizon.jdbc_password] Password to use for JDBC connection. We'll try to suppress it from being logged.
+  #
+  # == What to export
+  #
+  # Normally exports the entire horizon database, for diagnostic or batch purposes you
+  # can export just one bib, or a range of bibs instead.
+  #
+  # [horizon.first_bib] Greater than equal to this bib number. Can be combined with horizon.last_bib
+  # [horizon.last_bib] Less than or equal to this bib number. Can be combined with horizon.first_bib
+  # [horizon.only_bib] Only this single bib number.
+  #
+  # You can also control whether to export staff-only bibs, copies, and items.
+  #
+  # [horizon.public_only] Default true. If set to true, only includes bibs that are NOT staff_only,
+  #                       also only include copy/item that are not staff-only if including copy/item.
+  #
+  # You can also exclude certain tags:
+  #
+  # [horizon.exclude_tags] Default nil. A comma-seperated string (so easy to supply on command line)
+  #                        of tag names to exclude from export. You probably want to at least include the tags
+  #                        you are using for horizon.item_tag and horizon.copy_tag, to avoid collision
+  #                        from tags already in record.
+  #
+  # == Item/Copy Inclusion
+  #
+  # The HorizonReader can export MARC with holdings information (horizon items and copies) included
+  # in the MARC. Each item or copy record will be represented as one marc field -- the tags
+  # used are configurable. You can configure how individual columns from item or copy tables
+  # map to MARC subfields in that field -- and also include columns from other tables joined
+  # to item or copy.
+  #
+  # [horizon.include_holdings] * false, nil, or empty string: Do not include holdings. (DEFAULT)
+  #                            * all: include copies and items
+  #                            * items: only include items
+  #                            * copies: only include copies
+  #                            * direct: only include copies OR items, but not both; if bib has
+  #                              copies, include copies, otherwise include items if present.
+  #
+  # Each item or copy will be one marc field, you can configure what tags these fields
+  # will have.
+  #
+  # [horizon.item_tag] Default "991".
+  # [horizon.copy_tag] Default "937"
+  #
+  # Which columns from item or copy tables will be mapped to which subfields in those
+  # fields is controlled by hashes in settings, hash from column name (with table prefix)
+  # to subfield code. There are defaults, see HorizonReader.default_settings. Example for
+  # item_map default:
+  #
+  #     "horizon.item_map" => {
+  #       "item.call_reconstructed" => "a",
+  #       "collection.call_type" => "b",
+  #       "item.copy_reconstructed" => "c",
+  #       "call_type.processor" => "f",
+  #       "item.item#" => "i",
+  #       "item.collection" => "l",
+  #       "item.location" => "m",
+  #       "item.notes" => "n",
+  #       "item.staff_only" => "q"
+  #     }
+  #
+  # [horizon.item_map]
+  # [horizon.copy_map]
+  #
+  # The column-to-subfield maps can include columns from other tables
+  # joined in, with a join clause configured in settings too.
+  # By default both item and copy join to: collection, and call_type --
+  # using some clever SQL to join to call_type on the item/copy fk, OR the
+  # associated collection fk if no specific item/copy one is defined.
+  #
+  # [horizon.item_join_clause]
+  # [horizon.copy_join_clause]
+  #
+  # == Character Encoding
+  #
+  # The HorizonReader can convert from Marc8 to UTF8. By default `horizon.source_encoding` is set to "MARC8"
+  # and `horizon.destination_encoding` is set to "UTF8", which will make it do that conversion, as well
+  # as set the leader byte for char encoding properly.
+  #
+  # Any other configuration of those settings, and no transcoding will take place, HorizonReader
+  # is not currently capable of doing any other transcoding. Set `horizon.source_encoding`
+  # or `horizon.destination_encoding` to nil if you don't want any transcoding to happen --
+  # you'd only want this for diagnostic purposes, or if your horizon db is already utf8 (is
+  # that possible? We don't know.)
+  #
+  # [horizon.codepoint_translate] translates from Horizon's weird <U+nnnn> unicode
+  #                               codepoint escaping to actual UTF-8 bytes. Defaults to true. Will be ignored
+  #                               unless horizon.destination_encoding is UTF8 though.
+  #
+  # == Misc
+  #
+  # [horizon.batch_size] Batch size to use for fetching item/copy info on each bib. Default 400.
+  # [debug_ascii_progress] if true, will output a "<" and a ">" to stderr around every copy/item
+  #                        subsidiary fetch. See description of this setting in docs/settings.md
+  #
+  # [jtds.jar_path] Normally we'll use a distribution of jtds bundled with this gem.
+  #                 But specify a filepath of a directory containing jtds jar(s),
+  #                 and all jars in that dir will be loaded instead of our bundled jtds.
+  #
+  #
+  # Note: Could probably make this even faster by using a thread pool -- the bottleneck
+  # is probably processing into MARC, not the database query and streaming. But it's a
+  # bit tricky to refactor for concurrency there. Perhaps pull all the raw
+  # row values out and batch them in groups by bib#, then feed those lists
+  # to a threadpool. And then we'd just be fighting for CPU time with the
+  # threadpool for mapping, not sure if overall throughput increase would happen, would
+  # depend on exact environment.
+  class HorizonReader
+    attr_reader :settings
+    attr_reader :things_to_close
+
+    # We ignore the iostream even though we get one, we're gonna
+    # read from a Horizon DB!
+    def initialize(iostream, settings)
+      # we ignore the iostream, we're fetching from Horizon db
+
+      @settings = Traject::Indexer::Settings.new( self.class.default_settings).merge(settings)
+
+      require_jars!
+    end
+
+    # Requires marc4j and jtds, and java_import's some classes.
+    def require_jars!
+      Traject::Util.jruby_ensure_init!("Traject::HorizonReader")
+
+      Traject::Util.require_marc4j_jars(settings)
+
+      # For some reason we seem to need to java_import it, and use
+      # a string like this. can't just refer to it by full
+      # qualified name, not sure why, but this seems to work.
+      java_import "org.marc4j.converter.impl.AnselToUnicode"
+
+      unless defined? Java::net.sourceforge.jtds.jdbc.Driver
+        jtds_jar_dir = settings["jtds.jar_path"] || File.expand_path("../../vendor/jtds", File.dirname(__FILE__))
+
+        Dir.glob("#{jtds_jar_dir}/*.jar") do |x|
+          require x
+        end
+
+        # For confusing reasons, in normal Java need to
+        # Class.forName("net.sourceforge.jtds.jdbc.Driver")
+        # to get the jtds driver to actually be recognized by JDBC.
+        #
+        # In Jruby, Class.forName doesn't work, but this seems
+        # to do the same thing:
+        Java::net.sourceforge.jtds.jdbc.Driver
+      end
+
+      # So we can refer to these classes as just ResultSet, etc.
+      java_import java.sql.ResultSet, java.sql.PreparedStatement, java.sql.Driver
+    end
+
+    def fetch_result_set!(conn)
+      #fullbib is a view in Horizon, I think it was an SD default view, that pulls
+      #in stuff from multiple tables, including authority tables, to get actual
+      # text.
+      # You might think need an ORDER BY, but doing so makes it incredibly slow
+      # to retrieve results, can't do it. We just count on the view returning
+      # the rows properly. (ORDER BY bib#, tagord)
+      #
+      # We start with the fullbib view defined out of the box in Horizon, but
+      # need to join in bib_control to have access to the staff_only column.
+      #
+      sql = <<-EOS
+        SELECT b.bib#, b.tagord, b.tag,
+          indicators = substring(b.indicators+' ',1,2)+a.indicators,
+          b.text, b.cat_link_type#, b.cat_link_xref#, b.link_type,
+          bl.longtext, xref_text = a.text, xref_longtext = al.longtext,
+          b.timestamp, auth_timestamp = a.timestamp,
+          bc.staff_only
+        FROM bib b
+        left join bib_control bc on b.bib# = bc.bib#
+        left join bib_longtext bl on b.bib# = bl.bib# and b.tag = bl.tag and b.tagord = bl.tagord
+        left join auth a on b.cat_link_xref# = a.auth# and a.tag like '1[0-9][0-9]'
+        left join auth_longtext al on b.cat_link_xref# = al.auth# and al.tag like '1[0-9][0-9]'
+        WHERE 1 = 1
+      EOS
+
+      sql = <<-EOS
+        SELECT b.*, bc.staff_only
+        FROM fullbib b
+        JOIN bib_control bc on b.bib# = bc.bib#
+        WHERE 1 = 1
+      EOS
+
+      if settings["horizon.public_only"].to_s == "true"
+        sql += " AND staff_only != 1"
+      end
+
+      # settings should not be coming from untrusted user input not going
+      # to bother worrying about sql injection.
+      if settings.has_key? "horizon.only_bib"
+        sql += " AND b.bib# = #{settings['horizon.only_bib']} "
+      elsif settings.has_key?("horizon.first_bib") || settings.has_key?("horizon.last_bib")
+        clauses = []
+        clauses << " b.bib# >= #{settings['horizon.first_bib']}" if settings['horizon.first_bib']
+        clauses << " b.bib# <= #{settings['horizon.last_bib']}" if settings['horizon.last_bib']
+        sql += " AND " + clauses.join(" AND ") + " "
+      end
+
+      pstmt = conn.prepareStatement(sql);
+
+      # this may be what's neccesary to keep the driver from fetching
+      # entire result set into memory.
+      pstmt.setFetchSize(10000)
+
+
+      logger.debug("HorizonReader: Executing query: #{sql}")
+      rs = pstmt.executeQuery
+      logger.debug("HorizonReader: Executed!")
+      return rs
+    end
+
+    # Converts from Marc8 to UTF8 if neccesary.
+    # Also replaces horizon <U+nnnn> codes if needed.
+    def convert_text!(text, error_handler)
+      text = AnselToUnicode.new(error_handler, true).convert(text) if convert_marc8_to_utf8?
+
+      # Turn Horizon's weird escaping into UTF8: <U+nnnn> where nnnn is a hex unicode
+      # codepoint, turn it UTF8 for that codepoint
+      if settings["horizon.codepoint_translate"].to_s == "true" && settings["horizon.destination_encoding"] == "UTF8"
+        text.gsub!(/\<U\+([0-9A-F]{4})\>/) do
+          [$1.hex].pack("U")
+        end
+      end
+
+      return text
+    end
+
+    # Read rows from horizon database, assemble them into MARC::Record's, and yield each
+    # MARC::Record to caller.
+    def each
+      # Need to close the connection, teh result_set, AND the result_set.getStatement when
+      # we're done.
+      connection = open_connection!
+
+      # We're going to need to ask for item/copy info while in the
+      # middle of streaming our results. JDBC is happier and more performant
+      # if we use a seperate connection for this.
+      extra_connection = open_connection! if include_some_holdings?
+
+      # We're going to make our marc records in batches, and only yield
+      # them to caller in batches, so we can fetch copy/item info in batches
+      # for efficiency.
+      batch_size = settings["horizon.batch_size"]
+      record_batch = []
+
+      exclude_tags = (settings["horizon.exclude_tags"] || "").split(",")
+
+
+      rs = self.fetch_result_set!(connection)
+
+      current_bib_id = nil
+      record = nil
+      record_count = 0
+
+      error_handler = org.marc4j.ErrorHandler.new
+
+      while(rs.next)
+        bib_id = rs.getInt("bib#");
+
+        if bib_id != current_bib_id
+          record_count += 1
+
+          if settings["debug_ascii_progress"] && (record_count % settings["solrj_writer.batch_size"] == 0)
+            $stderr.write ","
+          end
+
+          # new record! Put old one on batch queue.
+          record_batch << record if record
+
+          # prepare and yield batch?
+          if (record_count % batch_size == 0)
+            enhance_batch!(extra_connection, record_batch)
+            record_batch.each do |r|
+              # set current_bib_id for error logging
+              current_bib_id = r['001'].value
+              yield r
+            end
+            record_batch.clear
+          end
+
+          # And start new record we've encountered.
+          error_handler = org.marc4j.ErrorHandler.new
+          current_bib_id = bib_id
+          record = MARC::Record.new
+          record.append MARC::ControlField.new("001", bib_id.to_s)
+        end
+
+
+        tagord = rs.getInt("tagord");
+        tag = rs.getString("tag")
+
+        # just silently skip it, some weird row in the horizon db, it happens.
+        # plus any of our exclude_tags.
+        next if tag.nil? || tag == "" || exclude_tags.include?(tag)
+
+        numeric_tag = tag.to_i if tag =~ /\A\d+\Z/
+
+        indicators = rs.getString("indicators")
+
+        # a packed byte array could be in various columns, in order of preference...
+        # the xref stuff is joined in from the auth table
+        # Have to get it as bytes and then convert it to String to avoid JDBC messing
+        # up the encoding marc8 grr
+        authtext = rs.getBytes("xref_longtext") || rs.getBytes("xref_text")
+        if authtext
+          authtext = String.from_java_bytes(authtext)
+          authtext.force_encoding("binary")
+        end
+
+        text = rs.getBytes("longtext") || rs.getBytes("text")
+        if text
+          text = String.from_java_bytes(text)
+          text.force_encoding("binary")
+        end
+
+        text = Traject::HorizonBibAuthMerge.new(tag, text, authtext).merge!
+
+        next if text.nil? # sometimes there's nothing there, skip it.
+
+        # convert from MARC8 to UTF8 if needed
+        text = convert_text!(text, error_handler)
+
+        if numeric_tag && numeric_tag == 0
+          record.leader = text
+          fix_leader!(record.leader)
+        elsif numeric_tag && numeric_tag == 1
+          # nothing, we add the 001 ourselves first
+        elsif numeric_tag && numeric_tag < 10
+          # control field
+          record.append MARC::ControlField.new(tag, text )
+        else
+          # data field
+          indicator1 = indicators.slice(0)
+          indicator2 = indicators.slice(1)
+
+          data_field = MARC::DataField.new( tag, indicator1, indicator2 )
+          record.append data_field
+
+          subfields = text.split("\x1F")
+
+          subfields.each do |subfield|
+            next if subfield.empty?
+
+            subfield_code = subfield.slice(0)
+            subfield_text = subfield.slice(1, subfield.length)
+
+            data_field.append MARC::Subfield.new(subfield_code, subfield_text)
+          end
+        end
+      end
+      # last one
+      record_batch << record if record
+
+      # yield last batch
+      enhance_batch!(extra_connection, record_batch)
+      record_batch.each do |r|
+        # reset bib_id for error message logging
+        current_bib_id = (f = r['001']) && f.value
+        yield r
+      end
+      record_batch.clear
+
+    rescue Exception => e
+      logger.fatal "HorizonReader, unexpected exception at bib id:#{current_bib_id}: #{Traject::Util.exception_to_log_message(e)}"
+      raise e
+    ensure
+      logger.info("HorizonReader: Closing all JDBC objects...")
+
+      # have to cancel the statement to keep us from waiting on entire
+      # result set when exception is raised in the middle of stream.
+      statement = rs && rs.getStatement
+      if statement
+        statement.cancel
+        statement.close
+      end
+
+      rs.close if rs
+
+      # shouldn't actually need to close the resultset and statement if we cancel, I think.
+      connection.close if connection
+
+      extra_connection.close if extra_connection
+
+      logger.info("HorizonReader: Closed JDBC objects")
+    end
+
+    def process_batch(batch)
+
+    end
+
+    # Pass in an array of MARC::Records', adds fields for copy and item
+    # info if so configured. Returns record_batch so you can chain if you want.
+    def enhance_batch!(conn, record_batch)
+      return record_batch if record_batch.nil? || record_batch.empty?
+
+      copy_info = get_joined_table(
+        conn, record_batch,
+        :table_name => "copy",
+        :column_map => settings['horizon.copy_map'],
+        :join_clause => settings['horizon.copy_join_clause'],
+        :public_only => (settings['horizon.public_only'].to_s == "true")
+      ) if %w{all copies direct}.include? settings['horizon.include_holdings'].to_s
+
+
+
+      item_info = get_joined_table(
+        conn, record_batch,
+        :table_name => "item",
+        :column_map => settings['horizon.item_map'],
+        :join_clause => settings['horizon.item_join_clause'],
+        :public_only => (settings['horizon.public_only'].to_s == "true")
+      ) if %w{all items direct}.include? settings['horizon.include_holdings'].to_s
+
+
+
+      if item_info || copy_info
+        record_batch.each do |record|
+          id = record['001'].value.to_s
+          record_copy_info = copy_info && copy_info[id]
+          record_item_info = item_info && item_info[id]
+
+          record_copy_info.each do |copy_row|
+            field = MARC::DataField.new( settings["horizon.copy_tag"] )
+            copy_row.each_pair do |subfield, value|
+              field.append MARC::Subfield.new(subfield, value)
+            end
+            record.append field
+          end if record_copy_info
+
+          record_item_info.each do |item_row|
+            field = MARC::DataField.new( settings["horizon.item_tag"] )
+            item_row.each_pair do |subfield, value|
+              field.append MARC::Subfield.new(subfield, value)
+            end
+            record.append field
+          end if record_item_info && ((settings['horizon.include_holdings'].to_s != "direct") || record_copy_info.empty?)
+        end
+      end
+
+      return record_batch
+    end
+
+    # Can be used to fetch a batch of subsidiary info from other tables:
+    # Used to fetch item or copy information. Can fetch with joins too.
+    # Usually called by passing in settings, but a literal call might look something
+    # like this for items:
+    #
+    #     get_joined_table(jdbc_conn, array_of_marc_records,
+    #       :table_name => "item",
+    #       :column_map => {"item.item#" => "i", "call_type.processor" => "k"},
+    #       :join_clause => "JOIN call_type ON item.call_type = call_type.call_type"
+    #     )
+    #
+    # Returns a hash keyed by bibID, value is an array of hashes of subfield->value, eg:
+    #
+    #     {'343434' => [
+    #       {
+    #         'i' => "012124" # item.item#
+    #         'k' => 'lccn' # call_type.processor
+    #       }
+    #     ]
+    #     }
+    #
+    # Can also pass in a `:public_only => true` option, will add on a staff_only != 1
+    # where clause, assumes primary table has a staff_only column.
+    def get_joined_table(conn, batch, options = {})
+      table_name = options[:table_name] or raise ArgumentError.new("Need a :table_name option")
+      column_map = options[:column_map] or raise ArgumentError.new("Need a :column_map option")
+      join_clause = options[:join_clause] || ""
+      public_only = options[:public_only]
+
+
+      results = Hash.new {|h, k| h[k] = [] }
+
+      bib_ids_joined = batch.collect do |record|
+        record['001'].value.to_s
+      end.join(",")
+
+      # We include the column name with prefix as an "AS", so we can fetch it out
+      # of the result set later just like that.
+      columns_clause = column_map.keys.collect {|c| "#{c} AS '#{c}'"}.join(",")
+      sql = <<-EOS
+        SELECT bib#, #{columns_clause}
+        FROM #{table_name}
+        #{join_clause}
+        WHERE bib# IN (#{bib_ids_joined})
+      EOS
+
+      if public_only
+        sql += " AND staff_only != 1"
+      end
+
+      $stderr.write "<" if settings["debug_ascii_progress"]
+
+      # It might be higher performance to refactor to re-use the same prepared statement
+      # for each item/copy fetch... but appears to be no great way to do that in JDBC3
+      # where you need to parameterize "IN" values. JDBC4 has got it, but jTDS is just JDBC3.
+      pstmt = conn.prepareStatement(sql);
+      rs = pstmt.executeQuery
+
+
+      while (rs.next)
+        bib_id = rs.getString("bib#")
+        row_hash = {}
+
+        column_map.each_pair do |column, subfield|
+          value = rs.getString( column )
+
+          if value
+            # Okay, total hack to deal with the fact that holding notes
+            # seem to be in UTF8 even though records are in MARC... which
+            # ends up causing problems for exporting as marc8, which is
+            # handled kind of not very well anyway.
+            # I don't even totally understand what I'm doing, after 6 hours working on it,
+            # sorry, just a hack.
+            value.force_encoding("BINARY") unless settings["horizon.destination_encoding"] == "UTF8"
+
+            row_hash[subfield] = value
+          end
+        end
+
+        results[bib_id] << row_hash
+      end
+
+      return results
+    ensure
+      pstmt.cancel if pstmt
+      pstmt.close if pstmt
+      rs.close if rs
+      $stderr.write ">" if settings["debug_ascii_progress"]
+    end
+
+    # Mutate string passed in to fix leader bytes for marc21
+    def fix_leader!(leader)
+      if leader.length < 24
+        # pad it to 24 bytes, leader is supposed to be 24 bytes
+        leader.replace( leader.ljust(24, ' ') )
+      end
+      # http://www.loc.gov/marc/bibliographic/ecbdldrd.html
+      leader[10..11] = '22'
+      leader[20..23] = '4500'
+
+      if settings['horizon.destination_encoding'] == "UTF8"
+        leader[9] = 'a'
+      end
+    end
+
+    def include_some_holdings?
+      ! [false, nil, ""].include?(settings['horizon.include_holdings'])
+    end
+
+    def convert_marc8_to_utf8?
+      settings['horizon.source_encoding'] == "MARC8" && settings['horizon.destination_encoding'] == "UTF8"
+    end
+
+
+    def open_connection!
+      logger.debug("HorizonReader: Opening JDBC Connection at #{settings["horizon.jdbc_url"]} ...")
+
+      url = settings["horizon.jdbc_url"]
+      if settings["horizon.jdbc_password"]
+        url += ";password=#{settings['horizon.jdbc_password']}"
+      end
+
+      conn = java.sql.DriverManager.getConnection( url )
+      # If autocommit on, fetchSize later has no effect, and JDBC slurps
+      # the whole result set into memory, which we can not handle.
+      conn.setAutoCommit false
+      logger.debug("HorizonReader: Opened JDBC Connection.")
+      return conn
+    end
+
+    def logger
+      settings["logger"] || Yell::Logger.new(STDERR, :level => "gt.fatal") # null logger
+    end
+
+    def self.default_settings
+      {
+        "horizon.batch_size" => 400,
+
+        "horizon.public_only" => true,
+
+        "horizon.source_encoding" => "MARC8",
+        "horizon.destination_encoding" => "UTF8",
+        "horizon.codepoint_translate" => true,
+
+        "horizon.item_tag" => "991",
+        # Crazy isnull() in the call_type join to join to call_type directly on item
+        # if specified otherwise calltype on colleciton. Phew!
+        "horizon.item_join_clause" => "LEFT OUTER JOIN collection ON item.collection = collection.collection LEFT OUTER JOIN call_type ON isnull(item.call_type, collection.call_type) = call_type.call_type",
+        "horizon.item_map" => {
+          "item.call_reconstructed" => "a",
+          "call_type.processor" => "f",
+          "call_type.call_type" => "b",
+          "item.copy_reconstructed" => "c",
+          "item.staff_only" => "q",
+          "item.item#" => "i",
+          "item.collection" => "l",
+          "item.notes" => "n",
+          "item.location" => "m"
+        },
+
+        "horizon.copy_tag" => "937",
+        # Crazy isnull() in the call_type join to join to call_type directly on item
+        # if specified otherwise calltype on colleciton. Phew!
+        "horizon.copy_join_clause" => "LEFT OUTER JOIN collection ON copy.collection = collection.collection LEFT OUTER JOIN call_type ON isnull(copy.call_type, collection.call_type) = call_type.call_type",
+        "horizon.copy_map" => {
+          "copy.copy#" => "8",
+          "copy.call" => "a",
+          "copy.copy_number" => "c",
+          "call_type.processor" => "f",
+          "copy.staff_only" => "q",
+          "copy.location" => "m",
+          "copy.collection" => "l",
+          "copy.pac_note" => "n"
+        }
+      }
+    end
+  end
+end
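Editor's note: the `horizon.codepoint_translate` behavior documented above comes down to the small gsub in `convert_text!`. A standalone illustration of just that transformation (the sample string is invented; the regex is the one used in the reader):

~~~ruby
# Same regex as HorizonReader#convert_text!; only applied when
# horizon.codepoint_translate is on and the destination encoding is UTF8.
raw  = "<U+00E9>tude g<U+00E9>n<U+00E9>rale"
utf8 = raw.gsub(/\<U\+([0-9A-F]{4})\>/) { [$1.hex].pack("U") }
utf8 # => "étude générale"
~~~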
data/test/horizon_bib_auth_merge_test.rb
ADDED
@@ -0,0 +1,58 @@
+# Encoding: ASCII-8BIT
+
+require 'test_helper'
+
+require 'traject'
+require 'traject/horizon_bib_auth_merge'
+
+describe "HorizonBibAuthMerge" do
+  HzMerge = Traject::HorizonBibAuthMerge # shortcut
+
+  it "does simple example" do
+    assert_equal "aOsmoregulationvCongresses.", HzMerge.new("650", "a v.", "aOsmoregulationvCongresses.").merge!
+  end
+
+  it "adds on simple trailing punctuation" do
+    assert_equal "aHomeostasisvCongresses.", HzMerge.new("650", "a v.", "aHomeostasisvCongresses").merge!
+  end
+
+  it "handles weirder punctuation" do
+    assert_equal "aEastaugh, Steven R.,d1952-", HzMerge.new("100", "a.,d-", "aEastaugh, Steven R.,d1952-").merge!
+  end
+
+  it "merges non-controlled values" do
+    assert_equal "aNational League for Nursing publication ;vno. 52-1870.", HzMerge.new("830", "a ;vno. 52-1870.", "aNational League for Nursing publication ;").merge!
+  end
+
+  it "handles multiple templated subfield with same code" do
+    assert_equal "aMedical carexUtilizationzMarylandzBaltimore.", HzMerge.new("650", "a x z z.", "aMedical carexUtilizationzMarylandzBaltimore.").merge!
+  end
+
+  it "handles tag 240 weirdness" do
+    assert_equal "aProblemy radiaÙtýsionnoµi genetiki.lEnglish", HzMerge.new("240", "a.l ", "aDubinin, Nikolaµi Petrovich,d1907-1998.tProblemy radiaÙtýsionnoµi genetiki.lEnglish").merge!
+  end
+
+  it "preserves space before semi-colon in 830" do
+    # this is actually something Alpha-G's HznExportMarc does differently
+    # than HIP/Horizon -- we try to stick with HIP/Horizon, not entirely
+    # sure if this is a bug in HIP we're reproducing, maybe there shouldn't
+    # be space before the semi-colon?
+    assert_equal "aActa ophthalmologica.pSupplementum ;v81.", HzMerge.new("830", "a.p ;v81.", "aActa ophthalmologica.pSupplementum").merge!
+  end
+
+  it "handles non-matching ending punct" do
+    # Yes, current HIP behavior, as well as marcout and HznMarcOut, ends in
+    # period. I don't know if it's really right, but we'll match current behavior.
+    assert_equal "aWessel, Rosa,d1897.", HzMerge.new("100", "a,d.", "aWessel, Rosa,d1897-").merge!
+  end
+
+  it "a weird non-matching ending punct" do
+    # in this one, HIP and Alpha-G HznMarcOut actually didn't match! We go with HIP.
+    assert_equal "aGreat Britain.bParliament.tPapers by Command ;vCd. 4671.", HzMerge.new("810", "a.b.t ;vCd. 4671.", "aGreat Britain.bParliament.tPapers by Command.").merge!
+  end
+
+  it "handles weird internal multi punct with spaces" do
+    assert_equal "aMiscellaneous publications (Pan American Sanitary Bureau) ;vno. 79.", HzMerge.new("830", "a) ;vno. 79.", "aMiscellaneous publications (Pan American Sanitary Bureau) ;").merge!
+  end
+
+end
data/test/test_helper.rb
ADDED
@@ -0,0 +1,16 @@
+gem 'minitest' # I feel like this messes with bundler, but only way to get minitest to shut up
+require 'minitest/autorun'
+require 'minitest/spec'
+
+require 'traject'
+require 'marc'
+
+# keeps things from complaining about "yell-1.4.0/lib/yell/adapters/io.rb:66 warning: syswrite for buffered IO"
+# for reasons I don't entirely understand, involving yell using syswrite and tests sometimes
+# using $stderr.puts. https://github.com/TwP/logging/issues/31
+STDERR.sync = true
+
+# Hacky way to turn off Indexer logging by default, say only
+# log things higher than fatal, which is nothing.
+require 'traject/indexer/settings'
+Traject::Indexer::Settings.defaults["log.level"] = "gt.fatal"
data/traject_horizon.gemspec
ADDED
@@ -0,0 +1,24 @@
+# coding: utf-8
+lib = File.expand_path('../lib', __FILE__)
+$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+require 'traject_horizon/version'
+
+Gem::Specification.new do |spec|
+  spec.name          = "traject_horizon"
+  spec.version       = TrajectHorizon::VERSION
+  spec.authors       = ["Jonathan Rochkind"]
+  spec.email         = ["jonathan@dnil.net"]
+  spec.summary       = %q{Horizon ILS MARC Exporter, a plugin for the traject tool}
+  spec.homepage      = "http://github.com/jrochkind/traject_horizon"
+  spec.license       = "MIT"
+
+  spec.files         = `git ls-files`.split($/)
+  spec.executables   = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
+  spec.test_files    = spec.files.grep(%r{^(test|spec|features)/})
+  spec.require_paths = ["lib"]
+
+  spec.add_dependency "traject"
+
+  spec.add_development_dependency "bundler", "~> 1.3"
+  spec.add_development_dependency "rake"
+end
data/vendor/jtds/.DS_Store
Binary file
data/vendor/jtds/jtds-1.2.8.jar
Binary file
metadata
ADDED
@@ -0,0 +1,110 @@
+--- !ruby/object:Gem::Specification
+name: traject_horizon
+version: !ruby/object:Gem::Version
+  prerelease:
+  version: 0.0.1
+platform: ruby
+authors:
+- Jonathan Rochkind
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2013-08-28 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: traject
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+    none: false
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+    none: false
+  prerelease: false
+  type: :runtime
+- !ruby/object:Gem::Dependency
+  name: bundler
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.3'
+    none: false
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.3'
+    none: false
+  prerelease: false
+  type: :development
+- !ruby/object:Gem::Dependency
+  name: rake
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+    none: false
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+    none: false
+  prerelease: false
+  type: :development
+description:
+email:
+- jonathan@dnil.net
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- .gitignore
+- Gemfile
+- LICENSE.txt
+- README.md
+- Rakefile
+- lib/traject/horizon_bib_auth_merge.rb
+- lib/traject/horizon_reader.rb
+- lib/traject_horizon.rb
+- lib/traject_horizon/version.rb
+- test/horizon_bib_auth_merge_test.rb
+- test/test_helper.rb
+- traject_horizon.gemspec
+- vendor/jtds/.DS_Store
+- vendor/jtds/jtds-1.2.8.jar
+homepage: http://github.com/jrochkind/traject_horizon
+licenses:
+- MIT
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+  none: false
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+  none: false
+requirements: []
+rubyforge_project:
+rubygems_version: 1.8.24
+signing_key:
+specification_version: 3
+summary: Horizon ILS MARC Exporter, a plugin for the traject tool
+test_files:
+- test/horizon_bib_auth_merge_test.rb
+- test/test_helper.rb