RubyGems - marc - Versions diffs - 0.4.4 → 0.5.0 - Mend

marc 0.4.4 → 0.5.0

Files changed (24)

data/Changes +13 -0
data/README.md +88 -0
data/Rakefile +2 -26
data/lib/marc.rb +1 -1
data/lib/marc/reader.rb +270 -50
data/lib/marc/version.rb +3 -0
data/lib/marc/writer.rb +11 -3
data/test/bare_cp866.txt +1 -0
data/test/cp866_multirecord.marc +1 -0
data/test/cp866_unimarc.marc +1 -0
data/test/jruby_bad_transcode.rb +52 -0
data/test/jruby_just_string.rb +39 -0
data/test/marc8_accented_chars.marc +1 -0
data/test/tc_bare_ruby_strings.rb +43 -0
data/test/tc_reader.rb +21 -6
data/test/tc_reader_char_encodings.rb +256 -0
data/test/tc_writer.rb +14 -2
data/test/test_cp866.txt +1 -0
data/test/{000039829.marc → utf8.marc} +0 -0
data/test/utf8_multirecord.marc +1 -0
data/test/utf8_with_bad_bytes.marc +1 -0
metadata +73 -41
data/README +0 -55
data/test/t +0 -1

data/Changes CHANGED

@@ -1,3 +1,16 @@
+v0.5.0 April 2012
+- Extensive rewrite of MARC::Reader (ISO 2709 binary reader) to provide a
+  fairly complete and consistent handling of char encoding issues in ruby 1.9.
+  - This code is well covered by automated tests, but it ended up complex, so
+    there may be bugs; please report them.
+  - May not work properly under jruby with non-unicode source encodings.
+  - Still can't handle Marc8 encoding.
+  - Behavior with regard to char encodings under ruby 1.9.x may not be entirely
+    backwards compatible with previous 0.4.x versions. Test your code.
+    In particular, previous versions may have automatically _transcoded_
+    non-unicode encodings to UTF-8 for you. This version will not do
+    so unless you ask it to with correct arguments.
 v0.4.4 Sat Mar 03 14:55:00 EDT 2012
 - Fixed performance regression: strict reader will parse about 5x faster now
 - Updated CHANGES file for first time in a long time :-)
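
The compatibility note for v0.5.0 turns on the difference between tagging a string with an encoding and transcoding it. A minimal sketch in plain Ruby follows; it makes no ruby-marc calls, and the three cp866 sample bytes are an assumption chosen only for illustration:

```ruby
# Three cp866 bytes spelling a short Cyrillic word (illustrative sample data,
# not taken from the gem's test files).
raw = "\x8C\xA8\xE0".b

# Tagging: the bytes are unchanged, only the encoding label changes.
tagged = raw.dup.force_encoding("cp866")

# Transcoding: the bytes are rewritten into the target encoding.
transcoded = tagged.encode("UTF-8")

puts tagged.bytesize      # 3
puts transcoded.bytesize  # 6
```

Code that relied on receiving UTF-8 values from cp866 input would, under the new behavior described above, need to request the transcode explicitly.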

data/README.md ADDED

@@ -0,0 +1,88 @@
+marc is a ruby library for reading and writing MAchine Readable Cataloging
+(MARC). More information about MARC can be found at <http://www.loc.gov/marc>.
+## Usage
+    require 'marc'
+    # reading records from a batch file
+    reader = MARC::Reader.new('marc.dat')
+    for record in reader
+      # print out field 245 subfield a
+      puts record['245']['a']
+    end
+    # creating a record
+    record = MARC::Record.new()
+    record.append(MARC::DataField.new('100', '0',  ' ', ['a', 'John Doe']))
+    # writing a record
+    writer = MARC::Writer.new('marc.dat')
+    writer.write(record)
+    writer.close()
+    # writing a record as XML
+    writer = MARC::XMLWriter.new('marc.xml')
+    writer.write(record)
+    writer.close()
+    # encoding a record
+    MARC::Writer.encode(record) # or record.to_marc
+MARC::Record provides `#to_hash` and `#from_hash` implementations that deal in ruby
+hashes that are compatible with the
+[marc-in-json](http://dilettantes.code4lib.org/blog/2010/09/a-proposal-to-serialize-marc-in-json/)
+serialization format. You are responsible for serializing the hash to/from JSON yourself.
+## Installation
+    gem install marc
+Or if you're using bundler, add to your Gemfile
+    gem 'marc'
+## Character Encodings
+Dealing with character encoding issues is one of the most confusing programming areas in general, and dealing with MARC (esp 'binary' ISO 2709 marc) can make it even more confusing.
+In ruby 1.8, if you get your character encodings wrong, you may find what look like garbage characters in your output. In ruby 1.9, you may also cause exceptions to be raised in your code.  ruby-marc as of 0.5.0 has a fairly complete and consistent featureset for helping you deal with character encodings in 'binary' MARC.
+There are no tools in ruby for transcoding or dealing with the 'marc8' encoding, used in Marc21 in the US and other countries.  If you have to deal with MARC with marc8 encoding, your best bet is using an external tool to convert between MARC8 and UTF8 before the ruby app even sees it. Options include [MarcEdit](http://people.oregonstate.edu/~reeset/marcedit/html/index.php), the [yaz-marcdump command line tool](http://www.indexdata.com/yaz), and the [Marc4J java library](http://marc4j.tigris.org/).
+### 'binary' ISO 2709 MARC
+The Marc binary (ISO 2709) Reader (MARC::Reader) has some features for helping you deal with character encodings in ruby 1.9. It should often do the right thing, especially if you are working only in unicode. See documentation in that class for details, including additional features you can use.   Note it does NOT currently determine encoding based on internal leader bytes in the marc file.
+The MARC binary Writer (MARC::Writer) does not have any such features. It's up to you, the developer, to make sure you create MARC::Records with consistent and expected char encodings. MARC::Writer will write out a legal ISO 2709 file either way, but it might have corrupted encodings.
+#### jruby note
+Not all of our char encoding tests currently pass on jruby in ruby 1.9 mode; if you are using binary MARC records in a non-UTF8 encoding, you may have trouble in jruby. We believe it's a jruby bug. https://jira.codehaus.org/browse/JRUBY-6637
+### xml or json
+For XML or json use, things should probably work right if your input is in UTF-8, but this hasn't been extensively tested. Feel free to file issues if you run into any.
+## Miscellany
+Source code at: https://github.com/ruby-marc/ruby-marc/
+Find generated API docs at: http://rubydoc.info/gems/marc/frames
+Run automated tests in source with `rake test`.
+Developers, release new version of gem to rubygems with `rake release`
+(bundler-supplied task). Note that one nice thing this will do is automatically
+tag the version in git, very important for later figuring out what's going on.
+Please send bugs, requests and comments to Code4Lib Mailing list (https://listserv.nd.edu/cgi-bin/wa?A0=CODE4LIB).
+## Authors
+Kevin Clarke <ksclarke@gmail.com>
+Bill Dueber <bill@dueber.com>
+William Groppe <will.groppe@gmail.com>
+Ross Singer <rossfsinger@gmail.com>
+Ed Summers <ehs@pobox.com>
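
The reader changes in this diff take the encoding from the File handle's `external_encoding` and apply it to field data with `force_encoding`. A minimal sketch of why that re-tagging step is needed, in plain Ruby with no ruby-marc calls (the scratch file and its three cp866 bytes are assumptions for illustration only):

```ruby
require "tempfile"

# Write three cp866 bytes to a scratch file (illustrative sample data).
file = Tempfile.new("cp866_sample")
file.binmode
file.write("\x8C\xA8\xE0".b)
file.close

handle = File.new(file.path, "r:cp866")
declared = handle.external_encoding  # encoding declared when opening the file

# IO#read with a length argument returns a binary (ASCII-8BIT) string,
# whatever external encoding the handle declares.
chunk = handle.read(3)
handle.close

# Re-tag the bytes with the declared encoding; no bytes are changed.
field = chunk.dup.force_encoding(declared)

puts chunk.encoding == Encoding::BINARY  # true
puts field.valid_encoding?               # true

file.unlink
```

Because length-based reads come back untagged, a byte-offset parser has to carry the intended encoding separately and apply it after slicing, which is the pattern the new `:external_encoding` option follows.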

data/Rakefile CHANGED

@@ -3,9 +3,8 @@ RUBY_MARC_VERSION = '0.4.4'
 require 'rubygems'
 require 'rake'
 require 'rake/testtask'
-require 'rake/rdoctask'
-require 'rake/packagetask'
-require 'rake/gempackagetask'
+require 'rdoc/task'
+require 'bundler/gem_tasks'
 task :default => [:test]
@@ -16,29 +15,6 @@ Rake::TestTask.new('test') do |t|
   t.ruby_opts = ['-r marc', '-r test/unit']
 end
-spec = Gem::Specification.new do |s|
-  s.name = 'marc'
-  s.version = RUBY_MARC_VERSION
-  s.author = 'Ed Summers'
-  s.email = 'ehs@pobox.com'
-  s.homepage = 'http://marc.rubyforge.org/'
-  s.platform = Gem::Platform::RUBY
-  s.summary = 'A ruby library for working with Machine Readable Cataloging'
-  s.files = Dir.glob("{lib,test}/**/*") + ["Rakefile", "README", "Changes",
-    "LICENSE"]
-  s.require_path = 'lib'
-  s.autorequire = 'marc'
-  s.has_rdoc = true
-  s.required_ruby_version = '>= 1.8.6'
-  s.authors = ["Kevin Clarke", "Bill Dueber", "William Groppe", "Ross Singer", "Ed Summers"]
-  s.test_file = 'test/ts_marc.rb'
-  s.bindir = 'bin'
-end
-Rake::GemPackageTask.new(spec) do |pkg|
-  pkg.need_zip = true
-  pkg.need_tar = true
-end
 Rake::RDocTask.new('doc') do |rd|
   rd.rdoc_files.include("README", "Changes", "LICENSE", "lib/**/*.rb")

data/lib/marc.rb CHANGED

@@ -31,7 +31,7 @@
 #    record.add_field(MARC::ControlField.new('FMT', 'Book')) # doesn't throw an error
+require File.dirname(__FILE__) + '/marc/version'
 require File.dirname(__FILE__) + '/marc/constants'
 require File.dirname(__FILE__) + '/marc/record'
 require File.dirname(__FILE__) + '/marc/datafield'

data/lib/marc/reader.rb CHANGED

@@ -1,12 +1,126 @@
 module MARC
+  # A class for reading MARC binary (ISO 2709) files.
+  #
+  # == Character Encoding
+  #
+  # In ruby 1.8, if you mess up your character encodings, you may get
+  # garbage bytes. MARC::Reader takes no special action to determine or
+  # correct character encodings in ruby 1.8.
+  #
+  # In ruby 1.9, if character encodings get confused, you will likely get an
+  # exception raised at some point, either from inside MARC::Reader or in your
+  # own code. If your marc records are not in UTF-8, you will have to make sure
+  # MARC::Reader knows what character encoding to expect. For UTF-8, normally
+  # it will just work.
+  #
+  # Note that if your source data includes characters that are invalid
+  # for its encoding, while it _may_ not cause MARC::Reader to raise an
+  # exception, it will likely result in an exception at a later point in
+  # your own code. You can ask MARC::Reader to remove invalid bytes from data,
+  # see :invalid and :replace options below.
+  #
+  # In ruby 1.9, it's important strings are tagged with their proper encoding.
+  # **MARC::Reader does _not_ at present look inside the MARC file to see what
+  # encoding it claims for itself** -- real world MARC records are so unreliable
+  # here as to limit utility; and we have international users and international
+  # MARC uses several conventions for this. Instead, MARC::Reader uses ordinary
+  # ruby conventions.  If your data is in UTF-8, it'll probably Just Work,
+  # otherwise you simply have to tell MARC::Reader what the source encoding is:
+  #
+  #     Encoding.default_external # => usually "UTF-8" for most people
+  #     # marc data will be considered UTF-8, as per Encoding.default_external
+  #     MARC::Reader.new("path/to/file.marc")
+  #
+  #     # marc data will have same encoding as string.encoding:
+  #     MARC::Reader.decode( string )
+  #
+  #     # Same, values will have encoding of string.encoding:
+  #     MARC::Reader.new(StringIO.new(string))
+  #
+  #     # data values will have cp866 encoding, per external_encoding of
+  #     # File object passed in
+  #     MARC::Reader.new(File.new("myfile.marc", "r:cp866"))
+  #
+  #     # explicitly tell MARC::Reader the encoding
+  #     MARC::Reader.new("myfile.marc", :external_encoding => "cp866")
+  #
+  #     # If you have Marc8 data, you _really_ want to convert it
+  #     # to UTF8 outside of ruby, but if you can't:
+  #     MARC::Reader.new("marc8.marc", :external_encoding => "binary")
+  #     # But you probably _will_ have problems subsequently in your own
+  #     # code using the MARC::Record.
+  #
+  # One way or another, you have to tell MARC::Reader what the external
+  # encoding is, if it's not the default for your system (usually UTF-8).
+  # It won't guess from internal MARC leader etc.
+  #
+  # == Additional Options
+  # These options can all be used on MARC::Reader.new _or_ MARC::Reader.decode
+  # to specify external encoding, ask for a transcode to a different
+  # encoding on read, or validate or replace bad bytes in source.
+  #
+  # [:external_encoding]
+  #    What encoding to consider the MARC record's values to be in. This option
+  #    takes precedence over the File handle or String argument's encodings.
+  # [:internal_encoding]
+  #    Ask MARC::Reader to transcode to this encoding in memory after reading
+  #    the file in.
+  # [:validate_encoding]
+  #    If you pass in `true`, MARC::Reader will promise to raise an Encoding::InvalidByteSequenceError
+  #    if there are illegal bytes in the source for the :external_encoding. There is
+  #    a performance penalty for this check. Without this option, an exception
+  #    _may_ or _may not_ be raised, and whether an exception is raised (or
+  #    what class the exception has) may change in future ruby-marc versions
+  #    without warning.
+  # [:invalid]
+  #    Just like String#encode, set to :replace and any bytes in source data
+  #    illegal for the source encoding will be replaced with the unicode
+  #    replacement character (when in unicode encodings), or else '?'. Overrides
+  #    :validate_encoding. This can help you sanitize your input and
+  #    avoid ruby "invalid UTF-8 byte" exceptions later.
+  # [:replace]
+  #    Just like String#encode, combine with `:invalid=>:replace`, set
+  #    your own replacement string for invalid bytes. You may use the
+  #    empty string to simply eliminate invalid bytes.
+  #
+  # == Warning on ruby File's own :internal_encoding, and unsafe transcoding from ruby
+  #
+  # Be careful with using an explicit File object with the File's own
+  # :internal_encoding set -- it can cause ruby to transcode your data
+  # _before_ MARC::Reader gets it, changing the bytecount and making the
+  # marc record unreadable in some cases. This
+  # applies to Encoding.default_internal too!
+  #
+  #    # May in some cases result in unreadable marc and an exception
+  #    MARC::Reader.new(  File.new("marc_in_cp866.mrc", "r:cp866:utf-8") )
+  #
+  #    # May in some cases result in unreadable marc and an exception
+  #    Encoding.default_internal = "utf-8"
+  #    MARC::Reader.new(  File.new("marc_in_cp866.mrc", "r:cp866") )
+  #
+  #    # However this should be safe:
+  #    MARC::Reader.new(  "marc_in_cp866.mrc", :external_encoding => "cp866")
+  #
+  #    # And this should be safe, if you do want to transcode:
+  #    MARC::Reader.new(  "marc_in_cp866.mrc", :external_encoding => "cp866",
+  #       :internal_encoding => "utf-8")
+  #
+  #    # And this should ALWAYS be safe, with or without an internal_encoding
+  #    MARC::Reader.new( File.new("marc_in_cp866.mrc", "r:binary:binary"),
+  #       :external_encoding => "cp866",
+  #       :internal_encoding => "utf-8")
+  # == jruby note
+  # Not all of our char encoding tests currently pass on jruby in ruby 1.9
+  # mode; if you are using binary MARC records in a non-UTF8 encoding, you may
+  # have trouble in jruby. We believe it's a jruby bug.
+  # https://jira.codehaus.org/browse/JRUBY-6637
   class Reader
     include Enumerable
-    # The constructor which you may pass either a path
+    # The constructor which you may pass either a path
     #
     #   reader = MARC::Reader.new('marc.dat')
-    #
+    #
     # or, if it's more convenient a File object:
     #
     #   fh = File.new('marc.dat')
@@ -15,33 +129,54 @@ module MARC
     # or really any object that responds to read(n)
     #
     #   # marc is a string with a bunch of records in it
-    #   reader = MARC::Reader.new(StringIO.new(reader))
+    #   reader = MARC::Reader.new(StringIO.new(marc))
     #
     # If your data have non-standard control fields in them
     # (e.g., Aleph's 'FMT') you need to add them specifically
     # to the MARC::ControlField.control_tags Set object
-    #
+    #
     #   MARC::ControlField.control_tags << 'FMT'
-    def initialize(file)
-      if file.is_a?(String)
+    #
+    # Also, if your data is encoded in something other than ascii/utf-8
+    # (e.g. when reading RUSMARC data) and you use ruby 1.9,
+    # you can specify the source data encoding with an option.
+    #
+    #   reader = MARC::Reader.new('marc.dat', :external_encoding => 'cp866')
+    #
+    # or you can pass an IO opened in the corresponding encoding
+    #
+    #   reader = MARC::Reader.new(File.new('marc.dat', 'r:cp866'))
+    def initialize(file, options = {})
+      @encoding_options = {}
+      # all can be nil
+      [:internal_encoding, :external_encoding, :invalid, :replace, :validate_encoding].each do |key|
+        @encoding_options[key] = options[key] if options.has_key?(key)
+      end
+      if file.is_a?(String)
         @handle = File.new(file)
       elsif file.respond_to?("read", 5)
         @handle = file
       else
         throw "must pass in path or file"
       end
+      if (! @encoding_options[:external_encoding] ) && @handle.respond_to?(:external_encoding)
+        # use file encoding only if we didn't already have an explicit one,
+        # explicit one takes precedence.
+        #
+        # Note, please don't use ruby's own internal_encoding transcode
+        # with binary marc data, the transcode can mess up the byte count
+        # and make it unreadable.
+        @encoding_options[:external_encoding] ||= @handle.external_encoding
+      end
     end
     # to support iteration:
     #   for record in reader
     #     print record
     #   end
-    #
-    # and even searching:
-    #   record.find { |f| f['245'] =~ /Huckleberry/ }
-    def each
+    def each
       # while there is data left in the file
       while rec_length_s = @handle.read(5)
         # make sure the record length looks like an integer
@@ -53,24 +188,34 @@ module MARC
         # get the raw MARC21 for a record back from the file
         # using the record length
         raw = rec_length_s + @handle.read(rec_length_i-5)
-        # Ruby 1.9 will try to set the encoding to ASCII-8BIT, which we don't want.
-        # Not entirely sure what happens for MARC-8 encoded records, but, technically,
-        # ruby-marc doesn't support MARC-8, anyway.
-        raw.force_encoding('utf-8') if raw.respond_to?(:force_encoding)
         # create a record from the data and return it
         #record = MARC::Record.new_from_marc(raw)
-        record = MARC::Reader.decode(raw)
-        yield record
+        record = MARC::Reader.decode(raw, @encoding_options)
+        yield record
       end
     end
     # A static method for turning raw MARC data in transmission
     # format into a MARC::Record object.
+    # First argument is a String
+    # options include:
+    #   [:external_encoding]  encoding of MARC record data values
+    #   [:forgiving]          needs more docs, true is some kind of forgiving
+    #                         of certain kinds of bad MARC.
     def self.decode(marc, params={})
+      if params.has_key?(:encoding)
+        $stderr.puts "DEPRECATION WARNING: MARC::Reader.decode :encoding option deprecated, please use :external_encoding"
+        params[:external_encoding] = params.delete(:encoding)
+      end
+      if (! params.has_key? :external_encoding ) && marc.respond_to?(:encoding)
+        # If no forced external_encoding given, respect the encoding
+        # declared on the string passed in.
+        params[:external_encoding] = marc.encoding
+      end
       record = Record.new()
       record.leader = marc[0..LEADER_LENGTH-1]
@@ -82,15 +227,21 @@ module MARC
       throw "invalid directory in record" if directory == nil
-      # the number of fields in the record corresponds to
+      # the number of fields in the record corresponds to
       # how many directory entries there are
       num_fields = directory.length / DIRECTORY_ENTRY_LENGTH
       # when operating in forgiving mode we just split on end of
-      # field instead of using calculated byte offsets from the
+      # field instead of using calculated byte offsets from the
       # directory
-      if params[:forgiving]
-        all_fields = marc[base_address..-1].split(END_OF_FIELD)
+      if params[:forgiving]
+        marc_field_data = marc[base_address..-1]
+        # It won't let us do the split on bad utf8 data, but
+        # we haven't yet set the 'proper' encoding or used
+        # our correction/replace options. So call it binary for now.
+        marc_field_data.force_encoding("binary") if marc_field_data.respond_to?(:force_encoding)
+        all_fields = marc_field_data.split(END_OF_FIELD)
       else
         mba =  marc.bytes.to_a
       end
@@ -101,19 +252,19 @@ module MARC
         entry_start = field_num * DIRECTORY_ENTRY_LENGTH
         entry_end = entry_start + DIRECTORY_ENTRY_LENGTH
         entry = directory[entry_start..entry_end]
         # extract the tag
         tag = entry[0..2]
         # get the actual field data
         # if we were told to be forgiving we just use the
-        # next available chunk of field data that we
+        # next available chunk of field data that we
         # split apart based on the END_OF_FIELD
         field_data = ''
         if params[:forgiving]
           field_data = all_fields.shift()
-        # otherwise we actually use the byte offsets in
+        # otherwise we actually use the byte offsets in
         # directory to figure out what field data to extract
         else
           length = entry[3..6].to_i
@@ -125,7 +276,29 @@ module MARC
         # remove end of field
         field_data.delete!(END_OF_FIELD)
+        if field_data.respond_to?(:force_encoding)
+          if params[:external_encoding]
+            field_data = field_data.force_encoding(params[:external_encoding])
+          end
+          # If we're transcoding anyway, pass our invalid/replace options
+          # on to String#encode, which will take care of them -- or raise
+          # with illegal bytes without :invalid=>:replace.
+          #
+          # If we're NOT transcoding, we need to use our own pure-ruby
+          # implementation to do invalid byte replacements. OR to raise
+          # a predictable exception iff :validate_encoding, otherwise
+          # for performance we won't check, and you may or may not
+          # get an exception from inside ruby-marc, and it may change
+          # in future implementations.
+          if params[:internal_encoding]
+            field_data = field_data.encode(params[:internal_encoding], params)
+          elsif (params[:invalid] || params[:replace] || (params[:validate_encoding] == true))
+            field_data = MARC::Reader.validate_encoding(field_data,  params)
+          end
+        end
         # add a control field or data field
         if MARC::ControlField.control_tag?(tag)
           record.append(MARC::ControlField.new(tag,field_data))
@@ -156,40 +329,87 @@ module MARC
       end
       return record
+    end
+    # Pass in a string, will raise an Encoding::InvalidByteSequenceError
+    # if it contains an invalid byte for its encoding; otherwise
+    # returns an equivalent string. Surprisingly not built into
+    # ruby 1.9.3 (yet?). https://bugs.ruby-lang.org/issues/6321
+    #
+    # The InvalidByteSequenceError will NOT be filled out
+    # with the usual error metadata, sorry.
+    #
+    # OR, like String#encode, pass in option `:invalid => :replace`
+    # to replace invalid bytes with a replacement string in the
+    # returned string.  Pass in the
+    # char you'd like with option `:replace`, or will, like String#encode
+    # use the unicode replacement char if it thinks it's a unicode encoding,
+    # else ascii '?'.
+    #
+    # in any case, method will raise, or return a new string
+    # that is #valid_encoding?
+    def self.validate_encoding(str, options = {})
+      return str unless str.respond_to?(:encoding)
+      if str.valid_encoding?
+        return str
+      elsif options[:invalid] != :replace
+        # If we're not replacing, just raise right away without going through
+        # chars for performance.
+        #
+        # That does mean we're not able to say exactly what byte was bad though.
+        # And the exception isn't filled out with all its usual attributes,
+        # which would be hard even if we were going through all the chars/bytes.
+        raise  Encoding::InvalidByteSequenceError.new("invalid byte in string for source encoding #{str.encoding.name}")
+      else
+        # :invalid => :replace,
+        # actually need to go through chars to replace bad ones
+        return str.chars.collect do |c|
+          if c.valid_encoding?
+            c
+          else
+            options[:replace] || (
+             # surely there's a better way to tell if
+             # an encoding is a 'Unicode encoding form'
+             # than this? What's wrong with you ruby 1.9?
+             str.encoding.name.start_with?('UTF') ?
+                "\uFFFD" :
+                "?" )
+          end
+        end.join
+      end
     end
   end
   # Like Reader ForgivingReader lets you read in a batch of MARC21 records
-  # but it does not use record lengths and field byte offsets found in the
+  # but it does not use record lengths and field byte offsets found in the
   # leader and directory. It is not unusual to run across MARC records
   # which have had their offsets calculated wrong. In situations like this
   # the vanilla Reader may fail, and you can try to use ForgivingReader.
+  #
   # The one downside to this is that ForgivingReader will assume that the
   # order of the fields in the directory is the same as the order of fields
-  # in the field data. Hopefully this will be the case, but it is not
+  # in the field data. Hopefully this will be the case, but it is not
   # 100% guaranteed, which is why the normal behavior of Reader is encouraged.
+  #
+  # **NOTE**: ForgivingReader _may_ have unpredictable results when used
+  # with marc records with char encoding other than system default (usually
+  # UTF8), _especially_ if you have Encoding.default_internal set.
+  #
+  # Implemented as a subclass of Reader overriding #each, so we still
+  # get Reader's #initialize (DRY) with proper char encoding options
+  # and handling.
+  class ForgivingReader < Reader
-  class ForgivingReader
-    include Enumerable
-    def initialize(file)
-      if file.class == String
-        @handle = File.new(file)
-      elsif file.respond_to?("read", 5)
-        @handle = file
-      else
-        throw "must pass in path or File object"
-      end
-    end
-    def each
-      @handle.each_line(END_OF_RECORD) do |raw|
+    def each
+      @handle.each_line(END_OF_RECORD) do |raw|
         begin
-          record = MARC::Reader.decode(raw, :forgiving => true)
-          yield record
+          record = MARC::Reader.decode(raw, @encoding_options.merge(:forgiving => true))
+          yield record
         rescue StandardError => e
           # caught exception just keep barrelling along
           # TODO add logging