marc 0.4.4 → 0.5.0

data/Changes CHANGED
@@ -1,3 +1,16 @@
+ v0.5.0 April 2012
+ - Extensive rewrite of MARC::Reader (ISO 2709 binary reader) to provide a
+ fairly complete and consistent handling of char encoding issues in ruby 1.9.
+ - This code is well covered by automated tests, but it is complex and there
+ may be bugs; please report them.
+ - May not work properly under jruby with non-unicode source encodings.
+ - Still can't handle Marc8 encoding.
+ - Behavior with regard to char encodings under ruby 1.9.x may not be entirely
+ backwards compatible with previous 0.4.x versions. Test your code.
+ In particular, previous versions may have automatically _transcoded_
+ non-unicode encodings to UTF-8 for you. This version will not do
+ so unless you ask it to with correct arguments.
+
  v0.4.4 Sat Mar 03 14:55:00 EDT 2012
  - Fixed performance regression: strict reader will parse about 5x faster now
  - Updated CHANGES file for first time in a long time :-)
@@ -0,0 +1,88 @@
+ marc is a ruby library for reading and writing MAchine Readable Cataloging
+ (MARC). More information about MARC can be found at <http://www.loc.gov/marc>.
+
+ ## Usage
+
+     require 'marc'
+
+     # reading records from a batch file
+     reader = MARC::Reader.new('marc.dat')
+     for record in reader
+       # print out field 245 subfield a
+       puts record['245']['a']
+     end
+
+     # creating a record
+     record = MARC::Record.new()
+     record.append(MARC::DataField.new('100', '0', ' ', ['a', 'John Doe']))
+
+     # writing a record
+     writer = MARC::Writer.new('marc.dat')
+     writer.write(record)
+     writer.close()
+
+     # writing a record as XML
+     writer = MARC::XMLWriter.new('marc.xml')
+     writer.write(record)
+     writer.close()
+
+     # encoding a record
+     MARC::Writer.encode(record) # or record.to_marc
+
+ MARC::Record provides `#to_hash` and `#from_hash` implementations that deal in ruby
+ hashes that are compatible with the
+ [marc-in-json](http://dilettantes.code4lib.org/blog/2010/09/a-proposal-to-serialize-marc-in-json/)
+ serialization format. You are responsible for serializing the hash to/from JSON yourself.
+
+ ## Installation
+
+     gem install marc
+
+ Or if you're using bundler, add to your Gemfile
+
+     gem 'marc'
+
+ ## Character Encodings
+
+ Dealing with character encoding issues is one of the most confusing programming areas in general, and dealing with MARC (especially 'binary' ISO 2709 MARC) can make it even more confusing.
+
+ In ruby 1.8, if you get your character encodings wrong, you may find what look like garbage characters in your output. In ruby 1.9, you may also cause exceptions to be raised in your code. ruby-marc as of 0.5.0 has a fairly complete and consistent feature set for helping you deal with character encodings in 'binary' MARC.
+
+ There are no tools in ruby for transcoding or dealing with the 'marc8' encoding, used in Marc21 in the US and other countries. If you have to deal with MARC with marc8 encoding, your best bet is using an external tool to convert between MARC8 and UTF8 before the ruby app even sees it. Options include [MarcEdit](http://people.oregonstate.edu/~reeset/marcedit/html/index.php), the [yaz-marcdump command line tool](http://www.indexdata.com/yaz), and the [Marc4J java library](http://marc4j.tigris.org/).
+
+ ### 'binary' ISO 2709 MARC
+
+ The MARC binary (ISO 2709) Reader (MARC::Reader) has some features for helping you deal with character encodings in ruby 1.9. It should often do the right thing, especially if you are working only in unicode. See documentation in that class for details, including additional features you can use. Note it does NOT currently determine encoding based on internal leader bytes in the marc file.
+
+ The MARC binary Writer (MARC::Writer) does not have any such features. It's up to you, the developer, to make sure you create MARC::Records with consistent and expected char encodings. MARC::Writer will write out a legal ISO 2709 record either way; it just might have corrupted encodings.
+
+ #### jruby note
+
+ Not all of our char encoding tests currently pass on jruby in ruby 1.9 mode; if you are using binary MARC records in a non-UTF8 encoding, you may have trouble in jruby. We believe it's a jruby bug: https://jira.codehaus.org/browse/JRUBY-6637
+
+
+ ### xml or json
+
+ For XML or json use, things should probably work right if your input is in UTF-8, but this hasn't been extensively tested. Feel free to file issues if you run into any.
+
+ ## Miscellany
+
+ Source code at: https://github.com/ruby-marc/ruby-marc/
+
+ Find generated API docs at: http://rubydoc.info/gems/marc/frames
+
+ Run automated tests in source with `rake test`.
+
+ Developers: release a new version of the gem to rubygems with `rake release`
+ (bundler-supplied task). Note that one nice thing this will do is automatically
+ tag the version in git, very important for later figuring out what's going on.
+
+ Please send bugs, requests and comments to the Code4Lib mailing list (https://listserv.nd.edu/cgi-bin/wa?A0=CODE4LIB).
+
+ ## Authors
+
+ Kevin Clarke <ksclarke@gmail.com>
+ Bill Dueber <bill@dueber.com>
+ William Groppe <will.groppe@gmail.com>
+ Ross Singer <rossfsinger@gmail.com>
+ Ed Summers <ehs@pobox.com>
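
Reviewer note on the README's marc-in-json paragraph: the README leaves JSON serialization of the `#to_hash` output to the caller. A minimal sketch of that round trip with ruby's standard `json` library follows. The two helper names are hypothetical, and the hash literal is hand-written to stand in for what `record.to_hash` returns; its shape is an assumption based on the marc-in-json proposal, so check it against the output of your installed gem version.

```ruby
require 'json'

# Hypothetical helper: serialize a marc-in-json style hash to JSON text.
def marc_hash_to_json(hash)
  JSON.generate(hash)
end

# Hypothetical helper: parse JSON text back into a plain ruby hash.
def marc_hash_from_json(text)
  JSON.parse(text)
end

# Hand-written stand-in for the output of record.to_hash (assumed shape:
# a leader string plus a list of single-key field hashes).
record_hash = {
  'leader' => '00000nam a2200000 a 4500',
  'fields' => [
    { '001' => 'ocm12345' },
    { '100' => { 'ind1' => '0', 'ind2' => ' ',
                 'subfields' => [{ 'a' => 'John Doe' }] } }
  ]
}

json_text = marc_hash_to_json(record_hash)
puts json_text
puts marc_hash_from_json(json_text) == record_hash
```

Because the hash uses string keys throughout, the parsed result compares equal to the original without any key conversion.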
data/Rakefile CHANGED
@@ -3,9 +3,8 @@ RUBY_MARC_VERSION = '0.4.4'
3
3
  require 'rubygems'
4
4
  require 'rake'
5
5
  require 'rake/testtask'
6
- require 'rake/rdoctask'
7
- require 'rake/packagetask'
8
- require 'rake/gempackagetask'
6
+ require 'rdoc/task'
7
+ require 'bundler/gem_tasks'
9
8
 
10
9
  task :default => [:test]
11
10
 
@@ -16,29 +15,6 @@ Rake::TestTask.new('test') do |t|
16
15
  t.ruby_opts = ['-r marc', '-r test/unit']
17
16
  end
18
17
 
19
- spec = Gem::Specification.new do |s|
20
- s.name = 'marc'
21
- s.version = RUBY_MARC_VERSION
22
- s.author = 'Ed Summers'
23
- s.email = 'ehs@pobox.com'
24
- s.homepage = 'http://marc.rubyforge.org/'
25
- s.platform = Gem::Platform::RUBY
26
- s.summary = 'A ruby library for working with Machine Readable Cataloging'
27
- s.files = Dir.glob("{lib,test}/**/*") + ["Rakefile", "README", "Changes",
28
- "LICENSE"]
29
- s.require_path = 'lib'
30
- s.autorequire = 'marc'
31
- s.has_rdoc = true
32
- s.required_ruby_version = '>= 1.8.6'
33
- s.authors = ["Kevin Clarke", "Bill Dueber", "William Groppe", "Ross Singer", "Ed Summers"]
34
- s.test_file = 'test/ts_marc.rb'
35
- s.bindir = 'bin'
36
- end
37
-
38
- Rake::GemPackageTask.new(spec) do |pkg|
39
- pkg.need_zip = true
40
- pkg.need_tar = true
41
- end
42
18
 
43
19
  Rake::RDocTask.new('doc') do |rd|
44
20
  rd.rdoc_files.include("README", "Changes", "LICENSE", "lib/**/*.rb")
@@ -31,7 +31,7 @@
31
31
  # record.add_field(MARC::ControlField.new('FMT', 'Book')) # doesn't throw an error
32
32
 
33
33
 
34
-
34
+ require File.dirname(__FILE__) + '/marc/version'
35
35
  require File.dirname(__FILE__) + '/marc/constants'
36
36
  require File.dirname(__FILE__) + '/marc/record'
37
37
  require File.dirname(__FILE__) + '/marc/datafield'
@@ -1,12 +1,126 @@
1
1
  module MARC
2
-
2
+ # A class for reading MARC binary (ISO 2709) files.
3
+ #
4
+ # == Character Encoding
5
+ #
6
+ # In ruby 1.8, if you mess up your character encodings, you may get
7
+ # garbage bytes. MARC::Reader takes no special action to determine or
8
+ # correct character encodings in ruby 1.8.
9
+ #
10
+ # In ruby 1.9, if character encodings get confused, you will likely get an
11
+ # exception raised at some point, either from inside MARC::Reader or in your
12
+ # own code. If your marc records are not in UTF-8, you will have to make sure
13
+ # MARC::Reader knows what character encoding to expect. For UTF-8, normally
14
+ # it will just work.
15
+ #
16
+ # Note that if your source data includes illegal characters
17
+ # for its encoding, while it _may_ not cause MARC::Reader to raise an
18
+ # exception, it will likely result in an exception at a later point in
19
+ # your own code. You can ask MARC::Reader to remove invalid bytes from data,
20
+ # see :invalid and :replace options below.
21
+ #
22
+ # In ruby 1.9, it's important strings are tagged with their proper encoding.
23
+ # **MARC::Reader does _not_ at present look inside the MARC file to see what
24
+ # encoding it claims for itself** -- real world MARC records are so unreliable
25
+ # here as to limit utility; and we have international users and international
26
+ # MARC uses several conventions for this. Instead, MARC::Reader uses ordinary
27
+ # ruby conventions. If your data is in UTF-8, it'll probably Just Work,
28
+ # otherwise you simply have to tell MARC::Reader what the source encoding is:
29
+ #
30
+ # Encoding.default_external # => usually "UTF-8" for most people
31
+ # # marc data will be considered UTF-8, as per Encoding.default_external
32
+ # MARC::Reader.new("path/to/file.marc")
33
+ #
34
+ # # marc data will have same encoding as string.encoding:
35
+ # MARC::Reader.decode( string )
36
+ #
37
+ # # Same, values will have encoding of string.encoding:
38
+ # MARC::Reader.new(StringIO.new(string))
39
+ #
40
+ # # data values will have cp866 encoding, per external_encoding of
41
+ # # File object passed in
42
+ # MARC::Reader.new(File.new("myfile.marc", "r:cp866"))
43
+ #
44
+ # # explicitly tell MARC::Reader the encoding
45
+ # MARC::Reader.new("myfile.marc", :external_encoding => "cp866")
46
+ #
47
+ # # If you have Marc8 data, you _really_ want to convert it
48
+ # # to UTF8 outside of ruby, but if you can't:
49
+ # MARC::Reader.new("marc8.marc", :external_encoding => "binary")
50
+ # # But you probably _will_ have problems subsequently in your own
51
+ # # code using the MARC::Record.
52
+ #
53
+ # One way or another, you have to tell MARC::Reader what the external
54
+ # encoding is, if it's not the default for your system (usually UTF-8).
55
+ # It won't guess from internal MARC leader etc.
56
+ #
57
+ # == Additional Options
58
+ # These options can all be used on MARC::Reader.new _or_ MARC::Reader.decode
59
+ # to specify external encoding, ask for a transcode to a different
60
+ # encoding on read, or validate or replace bad bytes in source.
61
+ #
62
+ # [:external_encoding]
63
+ # What encoding to consider the MARC record's values to be in. This option
64
+ # takes precedence over the File handle or String argument's encodings.
65
+ # [:internal_encoding]
66
+ # Ask MARC::Reader to transcode to this encoding in memory after reading
67
+ # the file in.
68
+ # [:validate_encoding]
69
+ # If you pass in `true`, MARC::Reader will promise to raise an Encoding::InvalidByteSequenceError
70
+ # if there are illegal bytes in the source for the :external_encoding. There is
71
+ # a performance penalty for this check. Without this option, an exception
72
+ # _may_ or _may not_ be raised, and whether an exception is raised (or
73
+ # what class the exception has) may change in future ruby-marc versions
74
+ # without warning.
75
+ # [:invalid]
76
+ # Just like String#encode, set to :replace and any bytes in source data
77
+ # illegal for the source encoding will be replaced with the unicode
78
+ # replacement character (when in unicode encodings), or else '?'. Overrides
79
+ # :validate_encoding. This can help you sanitize your input and
80
+ # avoid ruby "invalid UTF-8 byte" exceptions later.
81
+ # [:replace]
82
+ # Just like String#encode, combine with `:invalid=>:replace`, set
83
+ # your own replacement string for invalid bytes. You may use the
84
+ # empty string to simply eliminate invalid bytes.
85
+ #
86
+ # == Warning on ruby File's own :internal_encoding, and unsafe transcoding from ruby
87
+ #
88
+ # Be careful with using an explicit File object with the File's own
89
+ # :internal_encoding set -- it can cause ruby to transcode your data
90
+ # _before_ MARC::Reader gets it, changing the bytecount and making the
91
+ # marc record unreadable in some cases. This
92
+ # applies to Encoding.default_internal too!
93
+ #
94
+ # # May in some cases result in unreadable marc and an exception
95
+ # MARC::Reader.new( File.new("marc_in_cp866.mrc", "r:cp866:utf-8") )
96
+ #
97
+ # # May in some cases result in unreadable marc and an exception
98
+ # Encoding.default_internal = "utf-8"
99
+ # MARC::Reader.new( File.new("marc_in_cp866.mrc", "r:cp866") )
100
+ #
101
+ # # However this should be safe:
102
+ # MARC::Reader.new( "marc_in_cp866.mrc", :external_encoding => "cp866")
103
+ #
104
+ # # And this should be safe, if you do want to transcode:
105
+ # MARC::Reader.new( "marc_in_cp866.mrc", :external_encoding => "cp866",
106
+ # :internal_encoding => "utf-8")
107
+ #
108
+ # # And this should ALWAYS be safe, with or without an internal_encoding
109
+ # MARC::Reader.new( File.new("marc_in_cp866.mrc", "r:binary:binary"),
110
+ # :external_encoding => "cp866",
111
+ # :internal_encoding => "utf-8")
112
+ # == jruby note
113
+ # Not all of our char encoding tests currently pass on jruby in ruby 1.9
114
+ # mode; if you are using binary MARC records in a non-UTF8 encoding, you may
115
+ # have trouble in jruby. We believe it's a jruby bug.
116
+ # https://jira.codehaus.org/browse/JRUBY-6637
3
117
  class Reader
4
118
  include Enumerable
5
119
 
6
- # The constructor which you may pass either a path
120
+ # The constructor which you may pass either a path
7
121
  #
8
122
  # reader = MARC::Reader.new('marc.dat')
9
- #
123
+ #
10
124
  # or, if it's more convenient a File object:
11
125
  #
12
126
  # fh = File.new('marc.dat')
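
Reviewer note on the transcoding warning in the class documentation above: the problem comes down to byte arithmetic, since ISO 2709 lengths and offsets count bytes in the source encoding. A small sketch in plain ruby (no ruby-marc calls; the sample text and both helper names are made up for illustration) shows a cp866 payload growing when transcoded to UTF-8, so a length written before the transcode no longer matches.

```ruby
# Hypothetical helper: does the 5-digit length prefix still describe
# the number of bytes in the whole chunk?
def length_prefix_matches?(data)
  data.byteslice(0, 5).to_i == data.bytesize
end

# Hypothetical sample: a 5-byte prefix plus a six-letter Cyrillic word
# in cp866, where every letter takes exactly one byte.
def sample_cp866_chunk
  text = "\u041F\u0440\u0438\u0432\u0435\u0442".encode("cp866")
  "00011".encode("cp866") + text
end

original   = sample_cp866_chunk
transcoded = original.encode("UTF-8") # each Cyrillic letter now takes 2 bytes

puts original.bytesize                  # 11
puts transcoded.bytesize                # 17
puts length_prefix_matches?(original)   # true
puts length_prefix_matches?(transcoded) # false
```

This is why the documentation recommends passing `:external_encoding` and `:internal_encoding` to MARC::Reader rather than letting the File object transcode: the reader can then slice by byte offsets first and transcode each field afterwards.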
@@ -15,33 +129,54 @@ module MARC
15
129
  # or really any object that responds to read(n)
16
130
  #
17
131
  # # marc is a string with a bunch of records in it
18
- # reader = MARC::Reader.new(StringIO.new(reader))
132
+ # reader = MARC::Reader.new(StringIO.new(marc))
19
133
  #
20
134
  # If your data have non-standard control fields in them
21
135
  # (e.g., Aleph's 'FMT') you need to add them specifically
22
136
  # to the MARC::ControlField.control_tags Set object
23
- #
137
+ #
24
138
  # MARC::ControlField.control_tags << 'FMT'
25
-
26
- def initialize(file)
27
- if file.is_a?(String)
139
+ #
140
+ # Also, if your data is encoded in a non ascii/utf-8 encoding
141
+ # (for ex. when reading RUSMARC data) and you use ruby 1.9
142
+ # you can specify source data encoding with an option.
143
+ #
144
+ # reader = MARC::Reader.new('marc.dat', :external_encoding => 'cp866')
145
+ #
146
+ # or, you can pass IO, opened in the corresponding encoding
147
+ #
148
+ # reader = MARC::Reader.new(File.new('marc.dat', 'r:cp866'))
149
+ def initialize(file, options = {})
150
+ @encoding_options = {}
151
+ # all can be nil
152
+ [:internal_encoding, :external_encoding, :invalid, :replace, :validate_encoding].each do |key|
153
+ @encoding_options[key] = options[key] if options.has_key?(key)
154
+ end
155
+
156
+ if file.is_a?(String)
28
157
  @handle = File.new(file)
29
158
  elsif file.respond_to?("read", 5)
30
159
  @handle = file
31
160
  else
32
161
  throw "must pass in path or file"
33
162
  end
163
+
164
+ if (! @encoding_options[:external_encoding] ) && @handle.respond_to?(:external_encoding)
165
+ # use file encoding only if we didn't already have an explicit one,
166
+ # explicit one takes precedence.
167
+ #
168
+ # Note, please don't use ruby's own internal_encoding transcode
169
+ # with binary marc data, the transcode can mess up the byte count
170
+ # and make it unreadable.
171
+ @encoding_options[:external_encoding] ||= @handle.external_encoding
172
+ end
34
173
  end
35
174
 
36
175
  # to support iteration:
37
176
  # for record in reader
38
177
  # print record
39
178
  # end
40
- #
41
- # and even searching:
42
- # record.find { |f| f['245'] =~ /Huckleberry/ }
43
-
44
- def each
179
+ def each
45
180
  # while there is data left in the file
46
181
  while rec_length_s = @handle.read(5)
47
182
  # make sure the record length looks like an integer
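
Reviewer note on the `each` loop in this hunk: it relies on the ISO 2709 convention that every record starts with a five-digit, zero-padded length counting the whole record, prefix included. A stand-alone sketch of that framing follows, using a StringIO and two made-up payloads rather than real MARC data. `each_raw_record` and `frame` are hypothetical names, not part of ruby-marc.

```ruby
require 'stringio'

# Yield each length-prefixed chunk read from an IO-like object.
def each_raw_record(io)
  while (length_s = io.read(5))
    unless length_s =~ /\A\d{5}\z/
      raise ArgumentError, "bad record length: #{length_s.inspect}"
    end
    # The declared length includes the 5 prefix bytes already consumed.
    rest = io.read(length_s.to_i - 5).to_s
    yield length_s + rest
  end
end

# Build a fake chunk whose prefix counts itself plus the payload bytes.
def frame(payload)
  format('%05d', payload.bytesize + 5) + payload
end

io = StringIO.new(frame('first record') + frame('second'))
each_raw_record(io) { |raw| puts raw }
```

Reading a fixed number of bytes is what makes the strict reader sensitive to anything that changes the byte count before it sees the data.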
@@ -53,24 +188,34 @@ module MARC
53
188
  # get the raw MARC21 for a record back from the file
54
189
  # using the record length
55
190
  raw = rec_length_s + @handle.read(rec_length_i-5)
56
-
57
- # Ruby 1.9 will try to set the encoding to ASCII-8BIT, which we don't want.
58
- # Not entirely sure what happens for MARC-8 encoded records, but, technically,
59
- # ruby-marc doesn't support MARC-8, anyway.
60
- raw.force_encoding('utf-8') if raw.respond_to?(:force_encoding)
61
191
 
62
192
  # create a record from the data and return it
63
193
  #record = MARC::Record.new_from_marc(raw)
64
- record = MARC::Reader.decode(raw)
65
- yield record
194
+ record = MARC::Reader.decode(raw, @encoding_options)
195
+ yield record
66
196
  end
67
197
  end
68
198
 
69
199
 
70
200
  # A static method for turning raw MARC data in transmission
71
201
  # format into a MARC::Record object.
72
-
202
+ # First argument is a String
203
+ # options include:
204
+ # [:external_encoding] encoding of MARC record data values
205
+ # [:forgiving] needs more docs, true is some kind of forgiving
206
+ # of certain kinds of bad MARC.
73
207
  def self.decode(marc, params={})
208
+ if params.has_key?(:encoding)
209
+ $stderr.puts "DEPRECATION WARNING: MARC::Reader.decode :encoding option deprecated, please use :external_encoding"
210
+ params[:external_encoding] = params.delete(:encoding)
211
+ end
212
+
213
+ if (! params.has_key? :external_encoding ) && marc.respond_to?(:encoding)
214
+ # If no forced external_encoding given, respect the encoding
215
+ # declared on the string passed in.
216
+ params[:external_encoding] = marc.encoding
217
+ end
218
+
74
219
  record = Record.new()
75
220
  record.leader = marc[0..LEADER_LENGTH-1]
76
221
 
@@ -82,15 +227,21 @@ module MARC
82
227
 
83
228
  throw "invalid directory in record" if directory == nil
84
229
 
85
- # the number of fields in the record corresponds to
230
+ # the number of fields in the record corresponds to
86
231
  # how many directory entries there are
87
232
  num_fields = directory.length / DIRECTORY_ENTRY_LENGTH
88
233
 
89
234
  # when operating in forgiving mode we just split on end of
90
- # field instead of using calculated byte offsets from the
235
+ # field instead of using calculated byte offsets from the
91
236
  # directory
92
- if params[:forgiving]
93
- all_fields = marc[base_address..-1].split(END_OF_FIELD)
237
+ if params[:forgiving]
238
+ marc_field_data = marc[base_address..-1]
239
+ # It won't let us do the split on bad utf8 data, but
240
+ # we haven't yet set the 'proper' encoding or used
241
+ # our correction/replace options. So call it binary for now.
242
+ marc_field_data.force_encoding("binary") if marc_field_data.respond_to?(:force_encoding)
243
+
244
+ all_fields = marc_field_data.split(END_OF_FIELD)
94
245
  else
95
246
  mba = marc.bytes.to_a
96
247
  end
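
Reviewer note on the forgiving branch in this hunk: the field data is retagged as binary before the split because ruby string operations can raise on bytes that are invalid for the string's current encoding, and a binary string is always valid. A sketch in plain ruby follows. `\x1E` is the ISO 2709 field terminator, which is presumably what the library's END_OF_FIELD constant holds; `FIELD_TERMINATOR` and `split_fields` are hypothetical names.

```ruby
# Assumed value of END_OF_FIELD, tagged as binary.
FIELD_TERMINATOR = "\x1E".b

# Split raw field data on the terminator, ignoring the string's
# declared encoding. dup keeps the caller's encoding tag unchanged.
def split_fields(field_data)
  field_data.dup.force_encoding(Encoding::BINARY).split(FIELD_TERMINATOR)
end

# 0xFF can never appear in valid UTF-8, so this string is broken.
raw = "good\x1Eb\xFFd\x1Elast"
puts raw.valid_encoding?   # false

fields = split_fields(raw)
puts fields.length         # 3
puts fields[1].bytes.inspect
```

The pieces come back tagged as binary, which matches the diff: the proper encoding and any replacement options are applied to each field afterwards.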
@@ -101,19 +252,19 @@ module MARC
101
252
  entry_start = field_num * DIRECTORY_ENTRY_LENGTH
102
253
  entry_end = entry_start + DIRECTORY_ENTRY_LENGTH
103
254
  entry = directory[entry_start..entry_end]
104
-
255
+
105
256
  # extract the tag
106
257
  tag = entry[0..2]
107
258
 
108
259
  # get the actual field data
109
260
  # if we were told to be forgiving we just use the
110
- # next available chunk of field data that we
261
+ # next available chunk of field data that we
111
262
  # split apart based on the END_OF_FIELD
112
263
  field_data = ''
113
264
  if params[:forgiving]
114
265
  field_data = all_fields.shift()
115
266
 
116
- # otherwise we actually use the byte offsets in
267
+ # otherwise we actually use the byte offsets in
117
268
  # directory to figure out what field data to extract
118
269
  else
119
270
  length = entry[3..6].to_i
@@ -125,7 +276,29 @@ module MARC
125
276
 
126
277
  # remove end of field
127
278
  field_data.delete!(END_OF_FIELD)
128
-
279
+
280
+ if field_data.respond_to?(:force_encoding)
281
+ if params[:external_encoding]
282
+ field_data = field_data.force_encoding(params[:external_encoding])
283
+ end
284
+
285
+ # If we're transcoding anyway, pass our invalid/replace options
286
+ # on to String#encode, which will take care of them -- or raise
287
+ # with illegal bytes without :replace=>:invalid.
288
+ #
289
+ # If we're NOT transcoding, we need to use our own pure-ruby
290
+ # implementation to do invalid byte replacements. OR to raise
291
+ # a predictable exception iff :validate_encoding, otherwise
292
+ # for performance we won't check, and you may or may not
293
+ # get an exception from inside ruby-marc, and it may change
294
+ # in future implementations.
295
+ if params[:internal_encoding]
296
+ field_data = field_data.encode(params[:internal_encoding], params)
297
+ elsif (params[:invalid] || params[:replace] || (params[:validate_encoding] == true))
298
+ field_data = MARC::Reader.validate_encoding(field_data, params)
299
+ end
300
+
301
+ end
129
302
  # add a control field or data field
130
303
  if MARC::ControlField.control_tag?(tag)
131
304
  record.append(MARC::ControlField.new(tag,field_data))
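
Reviewer note on the per-field logic added in this hunk: there are two paths. When an `:internal_encoding` is requested, String#encode does both the transcoding and the bad-byte handling; otherwise the bytes are only retagged, and any validation or replacement has to be done by hand. A stand-alone sketch of both paths follows. `clean_field` is a hypothetical helper that borrows the option names from the diff; it is a simplification, not ruby-marc's implementation. On ruby 2.1 and later, String#scrub covers the replace-only path.

```ruby
# Hypothetical helper mirroring the option names used in the diff.
def clean_field(bytes, external_encoding:, internal_encoding: nil,
                invalid: nil, replace: nil, validate_encoding: false)
  # Retag only: the bytes themselves are left untouched.
  str = bytes.dup.force_encoding(external_encoding)

  if internal_encoding
    # Transcoding path: String#encode handles bad bytes itself.
    opts = {}
    opts[:invalid] = invalid if invalid
    opts[:replace] = replace if replace
    return str.encode(internal_encoding, **opts)
  end

  return str if str.valid_encoding?

  if invalid == :replace
    # Replace-only path: walk the chars and swap out the broken ones.
    default = str.encoding.name.start_with?('UTF') ? "\uFFFD" : '?'
    substitute = replace || default
    str.chars.map { |c| c.valid_encoding? ? c : substitute }.join
  elsif validate_encoding
    raise Encoding::InvalidByteSequenceError,
          "invalid byte in string for source encoding #{str.encoding.name}"
  else
    str # unchecked: may raise later in the caller's own code
  end
end

bad = "caf\xC3\xA9 \xFF".b # valid UTF-8 bytes followed by a stray 0xFF
puts clean_field(bad, external_encoding: 'UTF-8', invalid: :replace, replace: '')
latin1 = clean_field(bad, external_encoding: 'UTF-8',
                          internal_encoding: 'ISO-8859-1',
                          invalid: :replace, replace: '?')
puts latin1.encode('UTF-8')
```

Passing the empty string as `replace` simply drops the invalid bytes, which is the behavior the `:replace` option documentation describes.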
@@ -156,40 +329,87 @@ module MARC
156
329
  end
157
330
 
158
331
  return record
332
+ end
333
+
334
+ # Pass in a string, will raise an Encoding::InvalidByteSequenceError
335
+ # if it contains an invalid byte for its encoding; otherwise
336
+ # returns an equivalent string. Surprisingly not built into
337
+ # ruby 1.9.3 (yet?). https://bugs.ruby-lang.org/issues/6321
338
+ #
339
+ # The InvalidByteSequenceError will NOT be filled out
340
+ # with the usual error metadata, sorry.
341
+ #
342
+ # OR, like String#encode, pass in option `:invalid => :replace`
343
+ # to replace invalid bytes with a replacement string in the
344
+ # returned string. Pass in the
345
+ # char you'd like with option `:replace`, or it will, like String#encode,
346
+ # use the unicode replacement char if it thinks it's a unicode encoding,
347
+ # else ascii '?'.
348
+ #
349
+ # in any case, method will raise, or return a new string
350
+ # that is #valid_encoding?
351
+ def self.validate_encoding(str, options = {})
352
+ return str unless str.respond_to?(:encoding)
353
+
354
+ if str.valid_encoding?
355
+ return str
356
+ elsif options[:invalid] != :replace
357
+ # If we're not replacing, just raise right away without going through
358
+ # chars for performance.
359
+ #
360
+ # That does mean we're not able to say exactly what byte was bad though.
361
+ # And the exception isn't filled out with all its usual attributes,
362
+ # which would be hard even if we were going through all the chars/bytes.
363
+ raise Encoding::InvalidByteSequenceError.new("invalid byte in string for source encoding #{str.encoding.name}")
364
+ else
365
+ # :invalid => :replace,
366
+ # actually need to go through chars to replace bad ones
367
+ return str.chars.collect do |c|
368
+ if c.valid_encoding?
369
+ c
370
+ else
371
+ options[:replace] || (
372
+ # surely there's a better way to tell if
373
+ # an encoding is a 'Unicode encoding form'
374
+ # than this? What's wrong with you ruby 1.9?
375
+ str.encoding.name.start_with?('UTF') ?
376
+ "\uFFFD" :
377
+ "?" )
378
+ end
379
+ end.join
380
+ end
159
381
  end
382
+
160
383
  end
161
384
 
162
385
 
386
+
387
+
163
388
  # Like Reader ForgivingReader lets you read in a batch of MARC21 records
164
- # but it does not use record lengths and field byte offsets found in the
389
+ # but it does not use record lengths and field byte offsets found in the
165
390
  # leader and directory. It is not unusual to run across MARC records
166
391
  # which have had their offsets calcualted wrong. In situations like this
167
392
  # the vanilla Reader may fail, and you can try to use ForgivingReader.
168
-
393
+ #
169
394
  # The one downside to this is that ForgivingReader will assume that the
170
395
  # order of the fields in the directory is the same as the order of fields
171
- # in the field data. Hopefully this will be the case, but it is not
396
+ # in the field data. Hopefully this will be the case, but it is not
172
397
  # 100% guaranteed which is why the normal behavior of Reader is encouraged.
398
+ #
399
+ # **NOTE**: ForgivingReader _may_ have unpredictable results when used
400
+ # with marc records with char encoding other than system default (usually
401
+ # UTF8), _especially_ if you have Encoding.default_internal set.
402
+ #
403
+ # Implemented as a sub-class of Reader over-riding #each, so we still
404
+ # get DRY Reader's #initialize with proper char encoding options
405
+ # and handling.
406
+ class ForgivingReader < Reader
173
407
 
174
- class ForgivingReader
175
- include Enumerable
176
-
177
- def initialize(file)
178
- if file.class == String
179
- @handle = File.new(file)
180
- elsif file.respond_to?("read", 5)
181
- @handle = file
182
- else
183
- throw "must pass in path or File object"
184
- end
185
- end
186
-
187
-
188
- def each
189
- @handle.each_line(END_OF_RECORD) do |raw|
408
+ def each
409
+ @handle.each_line(END_OF_RECORD) do |raw|
190
410
  begin
191
- record = MARC::Reader.decode(raw, :forgiving => true)
192
- yield record
411
+ record = MARC::Reader.decode(raw, @encoding_options.merge(:forgiving => true))
412
+ yield record
193
413
  rescue StandardError => e
194
414
  # caught exception just keep barrelling along
195
415
  # TODO add logging