marc 1.1.1 → 1.3.0

Files changed (52)
  1. checksums.yaml +4 -4
  2. data/.github/ISSUE_TEMPLATE/bug_report.md +30 -0
  3. data/.github/workflows/ruby.yml +24 -0
  4. data/.gitignore +17 -0
  5. data/.standard.yml +1 -0
  6. data/{Changes → CHANGELOG.md} +116 -30
  7. data/Gemfile +5 -0
  8. data/README.md +239 -46
  9. data/Rakefile +14 -14
  10. data/bin/marc +14 -0
  11. data/bin/marc2xml +17 -0
  12. data/examples/xml2marc.rb +10 -0
  13. data/lib/marc/constants.rb +3 -3
  14. data/lib/marc/controlfield.rb +35 -23
  15. data/lib/marc/datafield.rb +70 -63
  16. data/lib/marc/dublincore.rb +59 -41
  17. data/lib/marc/exception.rb +9 -1
  18. data/lib/marc/jsonl_reader.rb +33 -0
  19. data/lib/marc/jsonl_writer.rb +44 -0
  20. data/lib/marc/marc8/map_to_unicode.rb +16417 -16420
  21. data/lib/marc/marc8/to_unicode.rb +80 -87
  22. data/lib/marc/reader.rb +116 -124
  23. data/lib/marc/record.rb +72 -62
  24. data/lib/marc/subfield.rb +12 -10
  25. data/lib/marc/unsafe_xmlwriter.rb +93 -0
  26. data/lib/marc/version.rb +1 -1
  27. data/lib/marc/writer.rb +27 -30
  28. data/lib/marc/xml_parsers.rb +222 -197
  29. data/lib/marc/xmlreader.rb +131 -114
  30. data/lib/marc/xmlwriter.rb +93 -82
  31. data/lib/marc.rb +20 -18
  32. data/marc.gemspec +28 -0
  33. data/test/marc8/tc_marc8_mapping.rb +3 -3
  34. data/test/marc8/tc_to_unicode.rb +28 -34
  35. data/test/messed_up_leader.xml +9 -0
  36. data/test/tc_controlfield.rb +37 -34
  37. data/test/tc_datafield.rb +65 -60
  38. data/test/tc_dublincore.rb +9 -11
  39. data/test/tc_hash.rb +10 -13
  40. data/test/tc_jsonl.rb +19 -0
  41. data/test/tc_marchash.rb +17 -21
  42. data/test/tc_parsers.rb +108 -144
  43. data/test/tc_reader.rb +35 -36
  44. data/test/tc_reader_char_encodings.rb +149 -169
  45. data/test/tc_record.rb +143 -148
  46. data/test/tc_subfield.rb +14 -13
  47. data/test/tc_unsafe_xml.rb +95 -0
  48. data/test/tc_writer.rb +101 -108
  49. data/test/tc_xml.rb +101 -94
  50. data/test/tc_xml_error_handling.rb +7 -8
  51. data/test/ts_marc.rb +8 -8
  52. metadata +129 -22
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 53c1e338a00e1dbd7c09ed14edc916edd1211abe0edbabcf757b8b00a5aa209c
4
- data.tar.gz: 80b4c48c2fc95887216194d264583302bacae6c606616a087c888668ba2bfb68
3
+ metadata.gz: faaea90f6a961fd419d41cd3fc789a7967ddd41866b82cb2075be3a139e73e58
4
+ data.tar.gz: dacb3e74659b39f4774205231c07a09802af24bf15726cd8eecee1c424199a56
5
5
  SHA512:
6
- metadata.gz: b6dd17fa76ff33ef0da68946d29ec2079c495423559c47391cabf80b59393f0d481ef6b1467b839465d3225c1a52aec2d4945994986e8e956eccfc2ea73ada5b
7
- data.tar.gz: a745e41aa2cbe87c70f9a2cbfe0de0dd12f47c167f867127353d86ff771ada14327dbce1ec461fcb81cacb3cf3f60f821cdb4c0087213f37a0ef42dc7ade78e8
6
+ metadata.gz: c5860c18fac9062dc15cea515e925a31e652bb599ddb5ef96281ac1840860b607a0d5c2bc66e514ab7d0ea17d1f08219253f54eeff128a9d6251a73627a4a3ce
7
+ data.tar.gz: 3936a4dc7fd88038e9794cec0427e0f0ee1264639f700d7ae8cf37c9710bc7ff4bba0c7679c8c301034ca3ea98c36e6c5df81c77caff8b6dfe2a9cdf6928dcaf
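The `checksums.yaml` entries above are hex digests keyed by archive name (`metadata.gz`, `data.tar.gz`). A minimal sketch of how one such digest could be checked with Ruby's standard library follows. The `verify_digest` helper and the sample bytes are hypothetical, and the digest is quoted in the sample YAML only to keep the sketch robust; the real file leaves it unquoted.

```ruby
require "digest"
require "yaml"

# Hypothetical helper (not part of RubyGems or ruby-marc): compare the SHA256
# of some bytes with the value recorded for an archive in a checksums document.
def verify_digest(checksums_yaml, archive_name, bytes)
  expected = YAML.safe_load(checksums_yaml).fetch("SHA256").fetch(archive_name)
  Digest::SHA256.hexdigest(bytes) == expected
end

# Stand-in for the contents of data.tar.gz; a real check would read the file.
sample_bytes = "not a real gem archive"

checksums_yaml = <<~YAML
  ---
  SHA256:
    data.tar.gz: "#{Digest::SHA256.hexdigest(sample_bytes)}"
YAML

puts verify_digest(checksums_yaml, "data.tar.gz", sample_bytes)
puts verify_digest(checksums_yaml, "data.tar.gz", sample_bytes + " (tampered)")
```

The second call appends bytes to the input, so its digest no longer matches the recorded one.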
data/.github/ISSUE_TEMPLATE/bug_report.md ADDED
@@ -0,0 +1,30 @@
1
+ ---
2
+ name: Bug report
3
+ about: Create a report to help us improve
4
+ title: ''
5
+ labels: ''
6
+ assignees: ''
7
+
8
+ ---
9
+
10
+ **Describe the bug**
11
+ A clear and concise description of what the bug is.
12
+
13
+ **To Reproduce**
14
+ Steps to reproduce the behavior:
15
+ 1. Include, if possible, a sample file that exhibits the behavior
16
+ 2. Include minimal but relevant ruby-marc code that exhibits the behavior
17
+
18
+ **Expected behavior**
19
+ A clear and concise description of what you expected to happen.
20
+
21
+ **Program Output**
22
+ If applicable, add program output and/or backtraces
23
+
24
+ **Environment (please complete the following information):**
25
+ - ruby-marc version (from `MARC::VERSION`)
26
+ - ruby runtime and version (best: the output of `ruby -e 'puts RUBY_DESCRIPTION'`)
27
+ - operating system, if not included in output of `RUBY_DESCRIPTION`
28
+
29
+ **Additional context**
30
+ Add any other context about the problem here.
data/.github/workflows/ruby.yml ADDED
@@ -0,0 +1,24 @@
1
+ name: CI
2
+
3
+ on: [push, pull_request]
4
+
5
+ env:
6
+ # See https://github.com/jruby/jruby/issues/5509
7
+ JAVA_OPTS: "--add-opens java.xml/com.sun.org.apache.xerces.internal.impl=org.jruby.dist"
8
+
9
+ jobs:
10
+ tests:
11
+ runs-on: ubuntu-latest
12
+ strategy:
13
+ matrix:
14
+ ruby: [2.7, 3.0, 3.1, 3.2, 3.3, 3.4, jruby, truffleruby, "truffleruby+graalvm"]
15
+ steps:
16
+ - uses: actions/checkout@v4
17
+ - name: Set up Ruby
18
+ uses: ruby/setup-ruby@v1
19
+ with:
20
+ ruby-version: ${{ matrix.ruby }}
21
+ - name: Install dependencies
22
+ run: bundle install --without documentation
23
+ - name: Run tests
24
+ run: bundle exec rake
data/.gitignore ADDED
@@ -0,0 +1,17 @@
1
+ *.gem
2
+ *.rbc
3
+ .bundle
4
+ .config
5
+ .yardoc
6
+ Gemfile.lock
7
+ InstalledFiles
8
+ _yardoc
9
+ coverage
10
+ doc/
11
+ lib/bundler/man
12
+ pkg
13
+ rdoc
14
+ spec/reports
15
+ test/tmp
16
+ test/version_tmp
17
+ tmp
data/.standard.yml ADDED
@@ -0,0 +1 @@
1
+ ruby_version: 2.3
@@ -1,20 +1,106 @@
1
- v1.1.1 June 2021
1
+ # Changelog
2
+
3
+ All notable changes to this project will be documented in this file.
4
+
5
+ ## [1.3] - 2025-01-09
6
+
7
+ ### Breaking Change
8
+
9
+ **ruby >= 2.2 is now required**. Removed no-longer-necessary `unf` gem in
10
+ favor of built-in string methods for dealing with encodings.
11
+
12
+ ### Non-user-facing changes
13
+
14
+ - Pulled everything into the gemspec instead of having some stuff
15
+ hanging out in the Gemfile
16
+ - Changed tested rubies to only include 2.7, 3.[0,1,2,3,4], latest
17
+ jruby, and latest truffleruby (with and without graalvm)
18
+
19
+ ## [1.2] - 2022-08-02
20
+
21
+ ### Added
22
+
23
+ * New XML writer `MARC::UnsafeXMLWriter` which is 15-20 times faster than the
24
+ default (rexml-based) writer. It mirrors code from the old
25
+ [`MARC::FastXMLWriter` gem](https://github.com/billdueber/marc-fastxmlwriter)
26
+ in a way that integrates better with the existing writer framework. It can
27
+ be used like any other writer,
28
+ e.g., `writer = MARC::UnsafeXMLWriter.new(filename)`. Note that while it
29
+ is "unsafe" in that it doesn't do checks for valid XML going out (its speed
30
+ comes from the fact that it's just concatenating strings together),
31
+ the `FastXMLWriter` gem has been used "in the wild" for years and doesn't
32
+ seem to cause anyone any problems.
33
+ * Added a new method, `MARC::Record.to_xml_string` which produces a
34
+ valid `<record>...</record>` XML snippet. It takes an optional keyword
35
+ argument to include namespace attributes on the
36
+ `<record>` tag, and another to use the new unsafe generator as
37
+ `record.to_xml_string(fast_but_unsafe: true)`.
38
+ * Added first-class support for `.jsonl` (aka "newline-delimited json")
39
+ files using the marc-in-json format via `MARC::JSONLReader` and
40
+ `MARC::JSONLWriter` which read and write marc-in-json. `ruby-marc` has
41
+ supported `#to_hash` and `#from_hash` to deal with this format at the
42
+ individual record level for a long time; this just provides the
43
+ reader/writer scaffolding.
44
+ * Also added `MARC::Record.to_json_string` to get a marc-in-json string
45
+ representation (parallel to the new `#to_xml_string`)
46
+ * New option to xml readers to ignore any namespaces
47
+ via `reader = MARC::XMLReader.new(filename, ignore_namespace: true)`. While
48
+ the REXML MARC-XML reader can't handle
49
+ (and thus has always ignored) XML namespaces, the Nokogiri-based version
50
+ will enforce namespaces if present. Useful only when you have
51
+ poorly-generated files where the XML namespace attributes are wonky.
52
+ * All writers will now self-close if used with a block (e.g.,
53
+ `MARC::Writer.new(filename) {|w| w.write(record)}`), parallel to the way
54
+ `File.open` works in regular ruby.
55
+ * XML writers will now take an optional keyword argument,
56
+ `include_namespace`, on both `#new` and `.encode`.
57
+
58
+ ### Changed
59
+ * Remove the `JREXML` parser, which apparently hasn't worked for years yet
60
+ also wasn't running in CI because the tests are running under bundler,
61
+ which didn't load `jrexml`. Set to emit a warning to use nokogiri
62
+ instead and fall back to REXML.
63
+ * 10-15% speed improvement when parsing MARC-XML with nokogiri (PR #97,
64
+ billdueber)
65
+ * Added deprecation warnings when using the `libxml`, `jstax`, or `jrexml`
66
+ xml parsers. When introduced, Nokogiri under JRuby was iffy. It's now
67
+ stable on both MRI and JRuby and faster than any of the other
68
+ included options and should be preferred. (PR #98, billdueber)
69
+ * MARC fields are now validated in their own post-creation stage (PR #66,
70
+ cbeer)
71
+ * Reduce the noise when running tests (billdueber)
72
+ * Reformatted this CHANGELOG.md file and added examples/structure to
73
+ README.md.
74
+
75
+
76
+ ### Fixed
77
+ * MARC-XML has requirements on the leader that are applied when writing out
78
+ MARC-XML by `MARC::XMLWriter.encode`. Previous versions would actually
79
+ mutate the record being written, resulting in a silent modification to
80
+ a record just because you were writing it out. Changed to use a duplicate
81
+ (PR #73, cbeer)
82
+ * Guard against multiple character calls when parsing XML (PR #74, cbeer)
83
+ * Minor Dublin Core code fixes (PRs #83 and #84, fjorba)
84
+ * `JRubyStaxReader` now supports Java 9+ / JRuby 9.3+ (PR #87, dmolesUC)
85
+
86
+ ## [1.1.1] - 2021-06-07
87
+
2
88
  - Fix a regression when normalizing indicator values when serializing marcxml
3
89
 
4
- v1.1.0 June 2021
90
+ ## [1.1.0] - 2021-06-01
5
91
  - Add support for additional valid subfield codes in marcxml
6
92
 
7
- v1.0.2 July 2017
93
+ ## [1.0.2] - 2017-08-01
8
94
  - Now (correctly) throw an error if datafield string is the empty string
9
95
  (thanks to @bibliotechy)
10
96
 
11
- v1.0.1 February 2016
97
+ ## [1.0.1] - 2016-02-29
12
98
  - Non-user-facing change in implementation of FieldMap strictly for performance
13
99
 
14
- v1.0.0 January 2015
100
+ ## [1.0.0] - 2015-01-28
15
101
  - Mostly changes that deal with encoding, plus the plunge to a 1.0 release
16
102
 
17
- v0.5.0 April 2012
103
+ ## [0.5.0] April 2012
18
104
  - Extensive rewrite of MARC::Reader (ISO 2709 binary reader) to provide a
19
105
  fairly complete and consistent handing of char encoding issues in ruby 1.9.
20
106
  - This code is well covered by automated tests, but ends up complex, there
@@ -27,75 +113,75 @@ v0.5.0 April 2012
27
113
  non-unicode encodings to UTF-8 for you. This version will not do
28
114
  so unless you ask it to with correct arguments.
29
115
 
30
- v0.4.4 Sat Mar 03 14:55:00 EDT 2012
116
+ ## [0.4.4] Sat Mar 03 14:55:00 EDT 2012
31
117
  - Fixed performance regression: strict reader will parse about 5x faster now
32
118
  - Updated CHANGES file for first time in a long time :-)
33
119
 
34
- v0.3.0 Wed Sep 23 21:51:00 EDT 2009
120
+ ## [0.3.0] Wed Sep 23 21:51:00 EDT 2009
35
121
  - Nokogiri and jrexml parser integration added as well as Ruby 1.9 support
36
122
 
37
- v0.2.2 Tue Dec 30 09:50:33 EST 2008
123
+ ## [0.2.2] Tue Dec 30 09:50:33 EST 2008
38
124
  - DataField tags that are all numeric are now padded with leading zeros
39
125
 
40
- v0.2.1 Mon Aug 18 14:14:16 EDT 2008
126
+ ## [0.2.1] Mon Aug 18 14:14:16 EDT 2008
41
127
  - can now process records that have fields tags that are non-numeric (thanks
42
128
  Ross Singer)
43
129
 
44
- v0.2.0 Wed Jun 11 12:42:20 EDT 2008
130
+ ## [0.2.0] Wed Jun 11 12:42:20 EDT 2008
45
131
  - added newline to output generated by REXML::Formatters::Default to make
46
132
  it a bit more friendly. REXML::Formatters::Pretty and Transitive just
47
133
  don't do what I want (whitespace in weird places).
48
134
 
49
- v0.1.9 Thu Jun 5 12:00:01 EDT 2008
135
+ ## [0.1.9] - Thu Jun 5 12:00:01 EDT 2008
50
136
  - small docfix change in XMLReader
51
137
  - use REXML::Formatters::Default instead of deprecated REXML::Element.write
52
138
 
53
- v0.1.8 Tue Nov 13 22:51:03 EST 2007
139
+ ## [0.1.8] - Tue Nov 13 22:51:03 EST 2007
54
140
  - added examples directory
55
141
  - fixed problem with leading whitespace and the leader in xml reader
56
142
  (thanks Morgan Cundiff)
57
143
 
58
- v0.1.7 Mon Nov 12 09:33:57 EST 2007
144
+ ## [0.1.7] - Mon Nov 12 09:33:57 EST 2007
59
145
  - updated Record.to_marc documentation to be a bit more precise
60
146
  - removed doc references to MARC::Field which is no longer around
61
147
  - changed from Artistic to MIT License
62
148
 
63
- v0.1.6 Fri May 4 12:37:33 EDT 2007
149
+ ## [0.1.6] - Fri May 4 12:37:33 EDT 2007
64
150
  - fixed bad record length test
65
151
  - removed MARC::XMLWriter convert_to_utf8 which wasn't really working and
66
152
  shouldn't be there if it isn't good
67
153
  - added unescaping of entities to MARC::XMLReader
68
154
 
69
- v0.1.5 Tue May 1 16:50:02 EDT 2007
155
+ ## [0.1.5] - Tue May 1 16:50:02 EDT 2007
70
156
  - docfix in MARC::DataField (thanks Jason Ronallo)
71
157
  - multiple docfixes (thanks Jonathan Rochkind)
72
158
 
73
- v0.1.4 Tue Jan 2 15:45:53 EST 2007
159
+ ## [0.1.4] - Tue Jan 2 15:45:53 EST 2007
74
160
  - fixed bug in MARC::XMLWriter that was outputting all control field tags as 00z
75
161
  (thanks Ross Singer)
76
162
  - added :include_namespace option to MARC::XMLWriter::encode to include the
77
163
  marcxml namespace, which allows MARC::Record::to_xml to emit the namespace
78
164
  for a single record.
79
165
 
80
- v0.1.3 Tue Jan 2 12:56:36 EST 2007
166
+ ## [0.1.3] - Tue Jan 2 12:56:36 EST 2007
81
167
  - added ability to map a MARC record to the Dublin Core fields. Calling
82
168
  to_dublin_core on a MARC::Record returns a hash that has Dublin Core fields
83
169
  as the hash keys.
84
170
 
85
- v0.1.2 Thu Dec 21 18:46:01 EST 2007
171
+ ## [0.1.2] - Thu Dec 21 18:46:01 EST 2007
86
172
  - fixed MARC::Record::to_xml so that it actually is tested and works (thanks
87
173
  Ross Singer)
88
174
 
89
- v0.1.1
175
+ ## [0.1.1] -
90
176
  - added ability to pass File like objects to the constructor for
91
177
  MARC::XMLReader like MARC::Reader (thanks Jake Glenn)
92
178
 
93
- v0.1.0 Wed Dec 6 15:40:40 EST 2006
179
+ ## [0.1.0] - Wed Dec 6 15:40:40 EST 2006
94
180
  - fixed pretty xml when stylesheet is used
95
181
  - added value() to MARC::DataField
96
182
  - added Rakefile for testing/building
97
183
 
98
- v0.0.9 Tue Mar 28 10:02:16 CST 2006
184
+ ## [0.0.9] - Tue Mar 28 10:02:16 CST 2006
99
185
  - changed XMLWriter.write to output pretty-printed XML
100
186
  - normalized Text in XML output
101
187
  - added XMLWriter checks and replacements for bad subfield codes and indicator
@@ -108,7 +194,7 @@ v0.0.9 Tue Mar 28 10:02:16 CST 2006
108
194
  test.
109
195
  - added :stylesheet argument to XMLWriter.new
110
196
 
111
- v0.0.8 Mon Jan 16 22:31:00 EST 2006
197
+ ## [0.0.8] - Mon Jan 16 22:31:00 EST 2006
112
198
  - removed control tests out of tc_field.rb into tc_control.rb
113
199
  - fixed some formatting
114
200
  - changed control/field to controlfield/datafield
@@ -120,7 +206,7 @@ v0.0.8 Mon Jan 16 22:31:00 EST 2006
120
206
  - fixed xmlreader strip_ns which was returning Nil when no namespace
121
207
  was found on an element (exposed by namespace changes).
122
208
 
123
- v0.0.7 Mon Jan 2 21:39:28 CST 2006
209
+ ## [0.0.7] - Mon Jan 2 21:39:28 CST 2006
124
210
  - MARC::XMLWriter added
125
211
  - removed encode/decode methods in MARC::MARC21 into MARC::Writer and
126
212
  MARC::Reader respectively. This required pushing MARC21 specific constants
@@ -131,25 +217,25 @@ v0.0.7 Mon Jan 2 21:39:28 CST 2006
131
217
  - added xml reading tests
132
218
  - fixed indentation to be two spaces
133
219
 
134
- v0.0.6 Tue Oct 18 09:33:12 CDT 2005
220
+ ## [0.0.6] - Tue Oct 18 09:33:12 CDT 2005
135
221
  - MARC::MARC21::decode throws an exception when a directory can't be found.
136
222
  Exception is caught and ignored in MARC::ForgivingReader
137
223
 
138
- v0.0.5 Tue Oct 18 01:50:40 CDT 2005
224
+ ## [0.0.5] - Tue Oct 18 01:50:40 CDT 2005
139
225
  - when unspecified field indicators are forced to blanks
140
226
  - checking for when a field appears to not have indicators and subfields in
141
227
  which case the field is skipped entirely
142
228
 
143
- v0.0.4 Tue Oct 18 00:39:50 CDT 2005
229
+ ## [0.0.4] - Tue Oct 18 00:39:50 CDT 2005
144
230
  - fixed off by one error when reading in leader, previous versions were
145
231
  reading an extra character
146
232
 
147
- v0.0.3 Mon Oct 17 22:51:23 CDT 2005
233
+ ## [0.0.3] - Mon Oct 17 22:51:23 CDT 2005
148
234
  - added ForgivingReader class and support for reading records without using
149
235
  possibly faulty offsets when the user needs them.
150
236
 
151
- v0.0.2 Mon Oct 17 17:42:57 CDT 2005
237
+ ## [0.0.2] - Mon Oct 17 17:42:57 CDT 2005
152
238
  - updated version string to see if it'll fix some gem oddness
153
239
 
154
- v0.0.1 Mon Oct 10 10:29:20 CDT 2005
240
+ ## [0.0.1] - Mon Oct 10 10:29:20 CDT 2005
155
241
  - initial release
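The 1.2 changelog entries above describe two ideas that are easy to picture in isolation: `.jsonl` files hold one marc-in-json document per line, and writers close themselves when constructed with a block. The sketch below illustrates both using only Ruby's standard library. `SketchJSONLWriter` and the sample records are hypothetical stand-ins and are not the gem's `MARC::JSONLWriter` or `MARC::JSONLReader`, whose internals are not shown in this diff excerpt.

```ruby
require "json"
require "stringio"

# Hypothetical stand-in for a JSONL writer (NOT the gem's implementation).
# It writes one JSON document per line and, when given a block, closes
# itself afterwards, in the same spirit as File.open with a block.
class SketchJSONLWriter
  attr_reader :closed

  def initialize(io)
    @io = io
    @closed = false
    return unless block_given?

    begin
      yield self
    ensure
      close
    end
  end

  def write(record_hash)
    @io.puts(JSON.generate(record_hash))
  end

  def close
    # A file-backed writer would also close its file handle here.
    @closed = true
  end
end

# Two made-up records in the marc-in-json hash shape.
records = [
  {"leader" => "00000nam a2200000 a 4500",
   "fields" => [{"001" => "sample-1"},
                {"245" => {"ind1" => "0", "ind2" => "4",
                           "subfields" => [{"a" => "The Texas ranger"}]}}]},
  {"leader" => "00000nam a2200000 a 4500",
   "fields" => [{"001" => "sample-2"},
                {"245" => {"ind1" => "1", "ind2" => "0",
                           "subfields" => [{"a" => "A second sample title"}]}}]}
]

buffer = StringIO.new
writer = SketchJSONLWriter.new(buffer) do |w|
  records.each { |rec| w.write(rec) }
end

# Reading back: every line is a complete JSON document.
titles = buffer.string.each_line.map do |line|
  fields = JSON.parse(line).fetch("fields")
  field_245 = fields.find { |f| f.key?("245") }
  field_245["245"]["subfields"].find { |sf| sf.key?("a") }["a"]
end

puts titles
puts writer.closed
```

Wrapping the `yield` in `ensure` means the writer is closed even if the block raises, which is the main practical benefit of the block form.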
data/Gemfile ADDED
@@ -0,0 +1,5 @@
1
+ source "https://rubygems.org"
2
+
3
+
4
+ # Specify your gem's dependencies in ..gemspec
5
+ gemspec
data/README.md CHANGED
@@ -1,64 +1,230 @@
1
1
  [![Gem Version](https://badge.fury.io/rb/marc.png)](http://badge.fury.io/rb/marc)
2
- ![Build Status](https://github.com/ruby-marc/ruby-marc/workflows/CI/badge.svg) |
2
+ ![Build Status](https://github.com/ruby-marc/ruby-marc/workflows/CI/badge.svg)
3
+ |
3
4
 
4
5
  marc is a ruby library for reading and writing MAchine Readable Cataloging
5
6
  (MARC). More information about MARC can be found at <http://www.loc.gov/marc>.
6
7
 
7
- ## Usage
8
+ ## Usage
8
9
 
9
- require 'marc'
10
-
11
- # reading records from a batch file
12
- reader = MARC::Reader.new('marc.dat', :external_encoding => "MARC-8")
13
- for record in reader
14
- # print out field 245 subfield a
15
- puts record['245']['a']
16
- end
10
+ ### Basics
11
+
12
+ ```ruby
13
+
14
+ reader = MARC::Reader.new("myfile.mrc")
15
+ reader.each do |record|
16
+ first_245 = record["245"] #=> #<MARC::DataField...>
17
+ first_245.to_s #=> "245 04 $a The Texas ranger $h [sound recording] / $c Sung by Beale D. Taylor. "
18
+ first_245.value #=> "The Texas ranger[sound recording] /Sung by Beale D. Taylor."
19
+ first_245.codes #=> ["a", "h", "c"]
20
+ first_245["a"] #=> "The Texas ranger"
17
21
 
18
- # creating a record
19
- record = MARC::Record.new()
20
- record.append(MARC::DataField.new('100', '0', ' ', ['a', 'John Doe']))
22
+ # A record is an enumerable over its fields and thus can use things like
23
+ # #each, #select, #find, etc.
21
24
 
22
- # writing a record
23
- writer = MARC::Writer.new('marc.dat')
24
- writer.write(record)
25
- writer.close()
25
+ subject_fields = record.select{|f| f.tag =~ /\A6/}
26
26
 
27
- # writing a record as XML
28
- writer = MARC::XMLWriter.new('marc.xml')
29
- writer.write(record)
30
- writer.close()
31
-
32
- # encoding a record
33
- MARC::Writer.encode(record) # or record.to_marc
34
-
35
- MARC::Record provides `#to_hash` and `#from_hash` implementations that deal in ruby
36
- hash's that are compatible with the
27
+ # Get author fields by supplying a list of tags
28
+ record.fields.each_by_tag(["100", "110", "111"]) do |field|
29
+ puts field.value
30
+ end
31
+ end
32
+ ```
33
+
34
+
35
+ ### Reading / Writing MARC21 binary data
36
+
37
+ ```ruby
38
+ require 'marc'
39
+
40
+ # marc21 binary format uses MARC::Reader and MARC::Writer
41
+
42
+ reader = MARC::Reader.new('marc.dat')
43
+ reader.each do |record|
44
+ title = record["245"].value
45
+ puts title
46
+ end
47
+ ```
48
+
49
+ If you know you have another encoding, you can specify it
50
+
51
+ ```ruby
52
+ reader = MARC::Reader.new("marc.dat", external_encoding: "MARC-8")
53
+ ```
54
+
55
+ While generally used with files, you can also give a reader an IO object
56
+ (usually an already-opened file or a StringIO object)
57
+
58
+ ```ruby
59
+ marc_data = File.open("marc.dat")
60
+ reader = MARC::Reader.new(marc_data)
61
+ ```
62
+
63
+ Similarly, you can write to either a file or an IO-like object
64
+
65
+ ```ruby
66
+ writer = MARC::Writer.new("myfile.dat")
67
+ # writer = MARC::Writer.new(Zlib::GzipWriter.open("myfile.dat.gz"))
68
+
69
+ myrecords.each do |rec|
70
+ writer.write(rec)
71
+ end
72
+ writer.close
73
+ ```
74
+
75
+ ### Reading/Writing marc-in-json
76
+
37
77
  [marc-in-json](https://rossfsinger.com/blog/2010/09/a-proposal-to-serialize-marc-in-json/)
38
- serialization format. You are responsible for serializing the hash to/from JSON yourself.
78
+ is a simple hash-based serialization format for MARC, often used with the
79
+ [jsonl](https://jsonlines.org/) (aka jsonlines or newline-delimited-json)
80
+ file format which puts one json structure on each line.
39
81
 
40
- ## Installation
82
+ ```ruby
41
83
 
42
- gem install marc
84
+ reader = MARC::JSONLReader.new("myfile.jsonl")
85
+ writer = MARC::JSONLWriter.new("my_other_file.jsonl")
86
+ reader.each do |record|
87
+ writer.write(record)
88
+ end
89
+ writer.close
43
90
 
44
- Or if you're using bundler, add to your Gemfile
91
+ ```
45
92
 
46
- gem 'marc'
47
-
48
- ## Character Encodings in 'binary' ISO-2709 MARC
93
+ ### Reading/Writing MARC-XML
94
+
95
+ MARC-XML is an XML-based serialization format for MARC records. It is,
96
+ generally speaking, a lot slower than using MARC21 or marc-in-json.
97
+
98
+ There are two XML parsers supported going forwards within the ruby-marc code
99
+ base: REXML (the first, and for a long time only, ruby XML parser based on
100
+ regular expressions) and Nokogiri. Both are compatible with both MRI ("normal") ruby and JRuby.
101
+
102
+ The Nokogiri parser is about 6x faster than using REXML. See performance
103
+ numbers, below.
104
+
105
+ At one time, it was difficult to install Nokogiri under MRI and impossible
106
+ under JRuby. Because of this historical blip, nokogiri is _not_
107
+ automatically included when doing `require "marc"` in your code. If you want
108
+ to use the Nokogiri-based parser, you must include it explicitly.
109
+
110
+ ```ruby
111
+ require "nokogiri"
112
+ require "marc"
113
+
114
+ reader = MARC::XMLReader.new("myfile.xml", parser: "nokogiri")
115
+ ```
116
+
117
+ The `parser` argument works as follows:
118
+
119
+ * if not included, REXML is used
120
+ * if "rexml" or "nokogiri", the appropriate parser will be used
121
+ * if "magic", the Nokogiri parser will be used if Nokogiri has been loaded;
122
+ otherwise it will fall back to using REXML.
123
+
124
+ ```ruby
125
+ # Use the best available
126
+ reader = MARC::XMLReader.new("my_file.xml", parser: "magic")
127
+ ```
128
+
129
+ ### "Self-closing" writers
49
130
 
50
- The Marc binary (ISO 2709) Reader (MARC::Reader) has some features for helping you deal with character encodings in ruby 1.9. It is always recommended to supply an explicit :external_encoding option to MARC::Reader; either any valid ruby encoding, _or_ the string "MARC-8". MARC-8 input will by default be transcoded to a UTF-8 internal representation.
131
+ Much like one can [open a file and have it automatically close at the end
132
+ of a block](https://ruby-doc.org/core-2.5.0/File.html#method-c-open) in
133
+ standard ruby, the various writers will do the same.
51
134
 
52
- MARC::Reader does _not_ currently have any facilities for guessing encoding from MARC21 leader byte 9, that is
53
- ignored.
135
+ ```ruby
54
136
 
55
- Consult the MARC::Reader class docs for a more complete discussion and range of options.
137
+ # separate writer and #close
138
+ reader = MARC::Reader.new("my_marc.mrc")
139
+ writer = MARC::UnsafeXMLWriter.new("my_marc.xml")
140
+ reader.each do |record|
141
+ writer.write(record)
142
+ end
143
+ writer.close
56
144
 
57
- The MARC binary Writer (MARC::Writer) does not have any encoding-related features -- it's up to you the developer to make sure you create MARC::Records with consistent and expected char encodings, although MARC::Writer will write out a legal ISO 2709 either way, it just might have corrupted encodings.
145
+ # "self-closing" equivalent
146
+ reader = MARC::Reader.new("my_marc.mrc")
147
+ MARC::UnsafeXMLWriter.new("my_marc.xml") do |w|
148
+ reader.each do |record|
149
+ w.write(record)
150
+ end
151
+ end
152
+ # no need to close the writer here
153
+ ```
154
+
155
+ ### Serializing a single record
156
+
157
+ The `MARC::Record` class has utility functions to serialize to the various
158
+ formats. These are generally thin wrappers around the `encode` class
159
+ methods (e.g., `MARC::Writer.encode`, `MARC::XMLWriter.encode`, etc.)
160
+
161
+ * `record.to_marc` will produce a marc21 binary string
162
+ * `record.to_json_string` returns a string containing the JSON document
163
+ for the marc-in-json serialization
164
+ * This just json-ifies `record.to_hash`, which returns a hash compatible
165
+ with the marc-in-json format.
166
+ * `record.to_xml_string` returns the actual XML string, with the following
167
+ options:
168
+ * `include_namespace: true` (default: `true`) will include the MARC namespace
169
+ attributes
170
+ * `fast_but_unsafe: true` (default: `false`) will use the much faster
171
+ `MARC::UnsafeXMLWriter` code, which produces the XML by string
172
+ concatenation. See that class for more information, but in general, if
173
+ your MARC isn't wildly invalid, it works fine and is roughly 15x faster.
174
+ The default (REXML) simply does `record.to_xml.to_s`
175
+
176
+ Note that `record.to_xml`, for historical reasons, returns an REXML document of
177
+ the XML serialization and _not_ an XML string as one might expect.
178
+
179
+
180
+ ## Benchmarking reading MARC in various formats
181
+
182
+ A simple benchmark run on a single thread on a 2017-era x64 Macintosh
183
+ gives the numbers below.
184
+
185
+ ```
186
+ With mri 3.1.0 and jruby 9.3.6.0
187
+
188
+ Format Implementation Ruby r/sec x Slower compared to fastest
189
+ ===================================================================
190
+ jsonl stdlib JSON mri 6512 1.0
191
+ jsonl Oj mri 6199 1.0
192
+ marc21 MARC::Reader mri 2889 2.3
193
+ marc-xml Nokogiri mri 1451 4.6
194
+ marc-xml REXML mri 239 28.0
195
+
196
+ marc21 MARC::Reader jruby 5455 1.2
197
+ jsonl stdlib JSON jruby 5437 1.2
198
+ marc-xml Nokogiri jruby 1631 4.1
199
+ marc-xml REXML jruby 253 26.5
200
+
201
+ ```
202
+
203
+ Note especially that if you're using MARC-XML, Nokogiri will read in
204
+ records 4-5 times faster.
205
+
206
+ ## Character Encoding issues
207
+
208
+ The Marc binary (ISO 2709) Reader (MARC::Reader) has some features for helping
209
+ you deal with character encodings in ruby 1.9. It is always recommended to
210
+ supply an explicit :external_encoding option to MARC::Reader; either any valid
211
+ ruby encoding, _or_ the string "MARC-8".
212
+ MARC-8 input will by default be transcoded to a UTF-8 internal representation.
213
+
214
+ MARC::Reader does _not_ currently have any facilities for guessing encoding
215
+ from MARC21 leader byte 9, that is ignored.
216
+
217
+ Consult the MARC::Reader class docs for a more complete discussion and range
218
+ of options.
219
+
220
+ The MARC binary Writer (MARC::Writer) does not have any encoding-related
221
+ features -- it's up to you the developer to make sure you create MARC::Records
222
+ with consistent and expected char encodings, although MARC::Writer will write
223
+ out a legal ISO 2709 either way, it just might have corrupted encodings.
58
224
 
59
225
  When parsing MARCXML _with Nokogiri as your XML parser implementation_ up to
60
- and including version `1.0.2` of this gem, if the XML was badly formed, parsing
61
- would stop and no error would be reported to your code.
226
+ and including version `1.0.2` of this gem, if the XML was badly formed,
227
+ parsing would stop and no error would be reported to your code.
62
228
 
63
229
  If you are using a version > `1.0.2` of `ruby-marc` with MRI + Nokogiri, XML
64
230
  syntax errors will be thrown (and you may need to adjust your code to account
@@ -67,17 +233,44 @@ using Nokogiri as an XML parser with JRuby as your ruby implementation, XML
67
233
  syntax errors will still be ignored unless you have Nokogiri version `1.10.2`
68
234
  or later.
69
235
 
70
- ## Miscellany
236
+ ## JRubySTAXReader caveats
237
+
238
+ NOTE: The JRubyStaxReader is deprecated. Nokogiri should be used instead.
239
+
240
+ - Under Java 9+, MARC::JRubySTAXReader requires adding the following
241
+ to `JAVA_OPTS`
242
+ in order to work
243
+ around [Java module system](https://openjdk.java.net/jeps/261)
244
+ restrictions:
245
+
246
+ ```sh
247
+ --add-opens java.xml/com.sun.org.apache.xerces.internal.impl=org.jruby.dist
248
+ ```
249
+
250
+ - MARC::JRubySTAXReader is deprecated and will be removed in a future version
251
+ of
252
+ `ruby-marc`. Please use MARC::NokogiriReader instead.
253
+
254
+ ## Miscellany
71
255
 
72
256
  Source code at: https://github.com/ruby-marc/ruby-marc/
73
257
 
74
258
  Find generated API docs at: http://rubydoc.info/gems/marc/frames
75
259
 
76
- Run automated tests in source with `rake test`.
260
+ Run automated tests in source with `rake test`.
261
+
262
+ Developers, release new version of gem to rubygems with `rake release`
263
+ (bundler-supplied task). Note that one nice thing this will do is
264
+ automatically tag the version in git, very important for later figuring out
265
+ what's going on.
77
266
 
78
- Developers, release new version of gem to rubygems with `rake release`
79
- (bundler-supplied task). Note that one nice thing this will do is automatically
80
- tag the version in git, very important for later figuring out what's going on.
267
+ ## Installation
268
+
269
+ gem install marc
270
+
271
+ Or if you're using bundler, add to your Gemfile
272
+
273
+ gem 'marc'
81
274
 
82
275
  ## Authors
83
276