traject-marc4j_reader 1.0.0-java
- checksums.yaml +7 -0
- data/.gitignore +22 -0
- data/.travis.yml +3 -0
- data/Gemfile +4 -0
- data/LICENSE.txt +22 -0
- data/README.md +100 -0
- data/Rakefile +9 -0
- data/lib/traject/marc4j_reader/version.rb +5 -0
- data/lib/traject/marc4j_reader.rb +153 -0
- data/test/marc4j_reader_test.rb +138 -0
- data/test/test_helper.rb +88 -0
- data/test/test_support/bad_subfield_code.marc +1 -0
- data/test/test_support/bad_utf_byte.utf8.marc +1 -0
- data/test/test_support/escaped_character_reference.marc8.marc +1 -0
- data/test/test_support/one-marc8.mrc +1 -0
- data/test/test_support/test_data.utf8.marc.xml +2609 -0
- data/test/test_support/test_data.utf8.mrc +1 -0
- data/test/test_traject_marc4j_reader.rb +11 -0
- data/traject-marc4j_reader.gemspec +31 -0
- metadata +157 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
|
|
1
|
+
---
|
2
|
+
SHA1:
|
3
|
+
metadata.gz: 781377f3f2adb35b8345224b0d04f424355b899f
|
4
|
+
data.tar.gz: 5941faff6ae8c86282d1acad116c2079f6840d65
|
5
|
+
SHA512:
|
6
|
+
metadata.gz: 2deabe30e44a00b5f232660bf36db22bddeb0f0577ba75bfaa662d5ad98aeaca36493b2d69104e7ab3b6362b1f1ac2daf76a6c5cccd31cb0e4bae9a362901468
|
7
|
+
data.tar.gz: b9259af133a130101c0b3e6a2a47bd4eac2a6dac790c79ff9fea3062cf1ab8222873761684c641d2a7145dab20ed8a6acf263e054be925575d17c15bebe39815
|
data/.gitignore
ADDED
@@ -0,0 +1,22 @@
|
|
1
|
+
*.gem
|
2
|
+
*.rbc
|
3
|
+
.bundle
|
4
|
+
.config
|
5
|
+
.yardoc
|
6
|
+
Gemfile.lock
|
7
|
+
InstalledFiles
|
8
|
+
_yardoc
|
9
|
+
coverage
|
10
|
+
doc/
|
11
|
+
lib/bundler/man
|
12
|
+
pkg
|
13
|
+
rdoc
|
14
|
+
spec/reports
|
15
|
+
test/tmp
|
16
|
+
test/version_tmp
|
17
|
+
tmp
|
18
|
+
*.bundle
|
19
|
+
*.so
|
20
|
+
*.o
|
21
|
+
*.a
|
22
|
+
mkmf.log
|
data/.travis.yml
ADDED
data/Gemfile
ADDED
data/LICENSE.txt
ADDED
@@ -0,0 +1,22 @@
|
|
1
|
+
Copyright (c) 2014 Bill Dueber
|
2
|
+
|
3
|
+
MIT License
|
4
|
+
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining
|
6
|
+
a copy of this software and associated documentation files (the
|
7
|
+
"Software"), to deal in the Software without restriction, including
|
8
|
+
without limitation the rights to use, copy, modify, merge, publish,
|
9
|
+
distribute, sublicense, and/or sell copies of the Software, and to
|
10
|
+
permit persons to whom the Software is furnished to do so, subject to
|
11
|
+
the following conditions:
|
12
|
+
|
13
|
+
The above copyright notice and this permission notice shall be
|
14
|
+
included in all copies or substantial portions of the Software.
|
15
|
+
|
16
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
|
17
|
+
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
|
18
|
+
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
|
19
|
+
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
|
20
|
+
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
|
21
|
+
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
|
22
|
+
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
data/README.md
ADDED
@@ -0,0 +1,100 @@
|
|
1
|
+
# Traject::Marc4JReader
|
2
|
+
|
3
|
+
**Note**: `Traject::Marc4JReader` is for JRuby only.
|
4
|
+
|
5
|
+
`Traject::Marc4JReader` is a reader for the [traject](https://github.com/traject-project/traject) ETL system
|
6
|
+
that allows the use of [marc4j](https://github.com/marc4j/marc4j) as a reader when dealing with MARC
|
7
|
+
binary or MARC-XML files. It is of no use outside of `traject` run under JRuby.
|
8
|
+
|
9
|
+
It leverages [marc-marc4j](https://github.com/billdueber/ruby-marc-marc4j), which is a paper-thin wrapper around
|
10
|
+
the Marc4J `.jar` that is shipped with it.
|
11
|
+
|
12
|
+
**The output of the reader is a vanilla ruby-marc object**. You can hang onto the
|
13
|
+
original marc4j java object with the `marc4j_reader.keep_marc4j` setting.
|
14
|
+
|
15
|
+
|
16
|
+
## Why use this?
|
17
|
+
|
18
|
+
|
19
|
+
The biggest reason would be for faster MARC/MARC-XML parsing and generation than the vanilla
|
20
|
+
[marc](https://github.com/ruby-marc/ruby-marc) gem can provide, or if you need to do something wacky with the marc4j
|
21
|
+
internal structure (such as feed it to legacy java code you have lying around).
|
22
|
+
|
23
|
+
In general, the marc4j library will parse marc21 (binary) and MARC-XML roughly twice
|
24
|
+
as fast as the pure-ruby library. While MARC parsing tends to not be a huge part
|
25
|
+
of the workload in a traject run, you'll almost certainly see performance gains.
|
26
|
+
|
27
|
+
## Installation
|
28
|
+
|
29
|
+
Add this line to your application's Gemfile:
|
30
|
+
|
31
|
+
gem 'traject-marc4j_reader'
|
32
|
+
|
33
|
+
And then execute:
|
34
|
+
|
35
|
+
$ bundle
|
36
|
+
|
37
|
+
Or install it yourself as:
|
38
|
+
|
39
|
+
$ gem install traject-marc4j_reader
|
40
|
+
|
41
|
+
## Traject::Marc4JReader settings
|
42
|
+
|
43
|
+
For more about the traject `settings` object, see [the traject settings documentation](https://github.com/traject-project/traject/blob/master/doc/settings.md)
|
44
|
+
|
45
|
+
|
46
|
+
Note that the standard Marc4JReader always converts to UTF8,
|
47
|
+
so output will always reflect that conversion.
|
48
|
+
|
49
|
+
* `marc4j_reader.jar_dir`: Path to a directory containing the Marc4J jar file to use. All `.jar` files in the directory will
|
50
|
+
be loaded. If unset, uses the marc4j jar bundled with marc-marc4j.
|
51
|
+
|
52
|
+
* `marc4j_reader.permissive`: Boolean, default true. Used by Marc4JReader only when `marc_source.type` is 'binary'; passed as the `permissive` argument to the underlying MarcPermissiveStreamReader.
|
53
|
+
|
54
|
+
* `marc_source.encoding`: Used by Marc4JReader only when `marc_source.type` is 'binary'. One of the encoding strings accepted
|
55
|
+
by marc4j's MarcPermissiveStreamReader: "BESTGUESS" (default), "UTF-8", "MARC-8" (or "MARC8"), "ISO-8859-1".
|
56
|
+
|
57
|
+
* `marc4j_reader.keep_marc4j`: When true, the original marc4j record is kept after translation into a normal ruby-marc object
|
58
|
+
and is available via `record#original_marc4j`.
|
59
|
+
|
60
|
+
|
61
|
+
## Sample use
|
62
|
+
|
63
|
+
A simple example that reads in via marc4j and outputs to the newline-delimited-json writer.
|
64
|
+
|
65
|
+
Use would be:
|
66
|
+
|
67
|
+
```bash
|
68
|
+
traject -c id_title.rb my_marc_file.mrc
|
69
|
+
```
|
70
|
+
|
71
|
+
```ruby
|
72
|
+
# File id_title.rb
|
73
|
+
|
74
|
+
require 'traject'
|
75
|
+
require 'traject/marc4j_reader'
|
76
|
+
require 'traject/json_writer'
|
77
|
+
|
78
|
+
require 'traject/macros/marc21_semantics'
|
79
|
+
extend Traject::Macros::Marc21Semantics
|
80
|
+
|
81
|
+
settings do
|
82
|
+
provide "reader_class_name", "Traject::Marc4JReader"
|
83
|
+
provide "marc4j_reader.keep_marc4j", true
|
84
|
+
provide "writer_class_name", "Traject::JsonWriter"
|
85
|
+
provide "output_file", "ids_and_titles.ndj"
|
86
|
+
end
|
87
|
+
|
88
|
+
to_field "id", extract_marc("001", :first => true)
|
89
|
+
to_field "title", extract_marc_filing_version('245abdefghknp', :include_original => true)
|
90
|
+
|
91
|
+
```
|
92
|
+
|
93
|
+
|
94
|
+
## Contributing
|
95
|
+
|
96
|
+
1. Fork it ( https://github.com/[my-github-username]/traject_marc4j_reader/fork )
|
97
|
+
2. Create your feature branch (`git checkout -b my-new-feature`)
|
98
|
+
3. Commit your changes (`git commit -am 'Add some feature'`)
|
99
|
+
4. Push to the branch (`git push origin my-new-feature`)
|
100
|
+
5. Create a new Pull Request
|
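As a rough illustration of the `marc4j_reader.keep_marc4j` setting described in the README, the sketch below shows one way an accessor can be added to a record class only when it is missing, and then used to carry the original object along. `FakeRecord` and `attach_original` are hypothetical stand-ins so the example runs on any Ruby; the real reader works with `MARC::Record` and a marc4j Java object under JRuby.

```ruby
# Stand-in for MARC::Record (hypothetical; the real class comes from ruby-marc).
class FakeRecord
  attr_reader :fields

  def initialize(fields)
    @fields = fields
  end
end

# Add an `original_marc4j` accessor only if the class lacks it, then
# attach the source object when `keep` is true.
def attach_original(record, source, keep)
  return record unless keep

  klass = record.class
  unless klass.method_defined?(:original_marc4j) &&
         klass.method_defined?(:original_marc4j=)
    klass.class_eval { attr_accessor :original_marc4j }
  end

  record.original_marc4j = source
  record
end

source = { "001" => "2196384" } # pretend this is the marc4j record
record = attach_original(FakeRecord.new(["001"]), source, true)
puts record.original_marc4j["001"]
```

Checking for the accessor first means the class is patched at most once, and not at all when the setting is off.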
data/Rakefile
ADDED
data/lib/traject/marc4j_reader/version.rb
ADDED
data/lib/traject/marc4j_reader.rb
ADDED
@@ -0,0 +1,153 @@
|
|
1
|
+
require 'traject'
|
2
|
+
require 'marc'
|
3
|
+
require 'marc/marc4j'
|
4
|
+
|
5
|
+
# `Traject::Marc4JReader` uses the marc4j java package to parse the MARC records
|
6
|
+
# into standard ruby-marc MARC::Record objects. This reader may be faster than
|
7
|
+
# Traject::MarcReader, especially for XML.
|
8
|
+
#
|
9
|
+
# Marc4JReader can read MARC ISO 2709 ("binary") or MARCXML. We use the Marc4J MarcPermissiveStreamReader
|
10
|
+
# for reading binary, but sometimes in non-permissive mode, according to settings. We use the Marc4j MarcXmlReader
|
11
|
+
# for reading xml. The actual code for dealing with Marc4J is in the separate
|
12
|
+
# [marc-marc4j gem](https://github.com/billdueber/ruby-marc-marc4j).
|
13
|
+
#
|
14
|
+
# See also the pure ruby Traject::MarcReader as an alternative, if you need to read
|
15
|
+
# marc-in-json, or if you don't need binary Marc8 support, it may in some cases
|
16
|
+
# be faster.
|
17
|
+
#
|
18
|
+
# ## Settings
|
19
|
+
#
|
20
|
+
# * marc_source.type: serialization type. default 'binary', also 'xml' (TODO: json/marc-in-json)
|
21
|
+
#
|
22
|
+
# * marc4j_reader.permissive: default true, false to turn off permissive reading. Used as
|
23
|
+
# value to 'permissive' arg of MarcPermissiveStreamReader constructor.
|
24
|
+
# Only used for 'binary'
|
25
|
+
#
|
26
|
+
# * marc_source.encoding: Only used for 'binary', otherwise always UTF-8.
|
27
|
+
# String of the values MarcPermissiveStreamReader accepts:
|
28
|
+
# * BESTGUESS (default: not entirely clear what Marc4J does with this)
|
29
|
+
# * ISO-8859-1 (also accepted: ISO8859_1)
|
30
|
+
# * UTF-8
|
31
|
+
# * MARC-8 (also accepted: MARC8)
|
32
|
+
# Default 'BESTGUESS', but we HIGHLY recommend setting this explicitly,
|
33
|
+
# because Marc4J's "BESTGUESS" can be unpredictable
|
34
|
+
# in a variety of ways.
|
35
|
+
# (will ALWAYS be transcoded to UTF-8 on the way out. We insist.)
|
36
|
+
#
|
37
|
+
# * marc4j_reader.jar_dir: Path to a directory containing Marc4J jar file to use. All .jar's in dir will
|
38
|
+
# be loaded. If unset, uses the marc4j jar bundled with marc-marc4j.
|
39
|
+
#
|
40
|
+
# * marc4j_reader.keep_marc4j: Keeps the original marc4j record accessible from
|
41
|
+
# the eventual ruby-marc record via record#original_marc4j. Intended for
|
42
|
+
# those that have legacy java code for which a marc4j object is needed.
|
43
|
+
#
|
44
|
+
#
|
45
|
+
# ## Example
|
46
|
+
#
|
47
|
+
# In a configuration file:
|
48
|
+
#
|
49
|
+
# require 'traject/marc4j_reader'
|
50
|
+
# settings do
|
51
|
+
# provide "reader_class_name", "Traject::Marc4JReader"
|
52
|
+
#
|
53
|
+
# #for MarcXML:
|
54
|
+
# # provide "marc_source.type", "xml"
|
55
|
+
#
|
56
|
+
# # Or instead for binary:
|
57
|
+
# provide "marc4j_reader.permissive", true
|
58
|
+
# provide "marc_source.encoding", "MARC8"
|
59
|
+
# end
|
60
|
+
class Traject::Marc4JReader
|
61
|
+
include Enumerable
|
62
|
+
|
63
|
+
attr_reader :settings, :input_stream
|
64
|
+
|
65
|
+
def initialize(input_stream, settings)
|
66
|
+
@settings = Traject::Indexer::Settings.new settings
|
67
|
+
@input_stream = input_stream
|
68
|
+
|
69
|
+
if @settings['marc4j_reader.keep_marc4j'] &&
|
70
|
+
! (MARC::Record.instance_methods.include?(:original_marc4j) &&
|
71
|
+
MARC::Record.instance_methods.include?(:"original_marc4j="))
|
72
|
+
MARC::Record.class_eval('attr_accessor :original_marc4j')
|
73
|
+
end
|
74
|
+
|
75
|
+
# Creating a converter will do the following:
|
76
|
+
# - nothing, if it detects that the marc4j jar is already loaded
|
77
|
+
# - load all the .jar files in settings['marc4j_reader.jar_dir'] if set
|
78
|
+
# - load the marc4j jar file bundled with MARC::MARC4J otherwise
|
79
|
+
|
80
|
+
@converter = MARC::MARC4J.new(:jardir => settings['marc4j_reader.jar_dir'], :logger => logger)
|
81
|
+
|
82
|
+
# Convenience
|
83
|
+
java_import org.marc4j.MarcPermissiveStreamReader
|
84
|
+
java_import org.marc4j.MarcXmlReader
|
85
|
+
|
86
|
+
end
|
87
|
+
|
88
|
+
|
89
|
+
def internal_reader
|
90
|
+
@internal_reader ||= create_marc_reader!
|
91
|
+
end
|
92
|
+
|
93
|
+
def input_type
|
94
|
+
# maybe later add some guessing somehow
|
95
|
+
settings["marc_source.type"]
|
96
|
+
end
|
97
|
+
|
98
|
+
def specified_source_encoding
|
99
|
+
#settings["marc4j_reader.source_encoding"]
|
100
|
+
enc = settings["marc_source.encoding"]
|
101
|
+
|
102
|
+
# one is standard for ruby and we want to support,
|
103
|
+
# the other is used by Marc4J and we have to pass it to Marc4J
|
104
|
+
enc = "ISO8859_1" if enc == "ISO-8859-1"
|
105
|
+
|
106
|
+
# default
|
107
|
+
enc = "BESTGUESS" if enc.nil? || enc.empty?
|
108
|
+
|
109
|
+
return enc
|
110
|
+
end
|
111
|
+
|
112
|
+
def create_marc_reader!
|
113
|
+
case input_type
|
114
|
+
when "binary"
|
115
|
+
permissive = settings["marc4j_reader.permissive"].to_s == "true"
|
116
|
+
|
117
|
+
# #to_inputstream turns our ruby IO into a Java InputStream
|
118
|
+
# third arg means 'convert to UTF-8, yes'
|
119
|
+
MarcPermissiveStreamReader.new(input_stream.to_inputstream, permissive, true, specified_source_encoding)
|
120
|
+
when "xml"
|
121
|
+
MarcXmlReader.new(input_stream.to_inputstream)
|
122
|
+
else
|
123
|
+
raise ArgumentError.new("Unrecognized marc_source.type: #{input_type}")
|
124
|
+
end
|
125
|
+
end
|
126
|
+
|
127
|
+
def each
|
128
|
+
while (internal_reader.hasNext)
|
129
|
+
begin
|
130
|
+
marc4j = internal_reader.next
|
131
|
+
rubymarc = @converter.marc4j_to_rubymarc(marc4j)
|
132
|
+
if @settings['marc4j_reader.keep_marc4j']
|
133
|
+
rubymarc.original_marc4j = marc4j
|
134
|
+
end
|
135
|
+
rescue Exception => e
|
136
|
+
msg = "MARC4JReader: Error reading MARC, fatal, re-raising"
|
137
|
+
if marc4j
|
138
|
+
msg += "\n 001 id: #{marc4j.getControlNumber}"
|
139
|
+
end
|
140
|
+
msg += "\n #{Traject::Util.exception_to_log_message(e)}"
|
141
|
+
logger.fatal msg
|
142
|
+
raise e
|
143
|
+
end
|
144
|
+
|
145
|
+
yield rubymarc
|
146
|
+
end
|
147
|
+
end
|
148
|
+
|
149
|
+
def logger
|
150
|
+
@logger ||= (settings[:logger] || Yell.new(STDERR, :level => "gt.fatal")) # null logger
|
151
|
+
end
|
152
|
+
|
153
|
+
end
|
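The `each` method above adapts a Java-style `hasNext`/`next` iterator to Ruby's `Enumerable`. A minimal sketch of that pattern, runnable without JRuby, follows. `FakeJavaReader` and `IteratorAdapter` are hypothetical names, and the conversion block merely stands in for `marc4j_to_rubymarc`.

```ruby
# Hypothetical stand-in for a marc4j reader: exposes hasNext/next the way
# a Java iterator would appear under JRuby.
class FakeJavaReader
  def initialize(items)
    @items = items.dup
  end

  def hasNext
    !@items.empty?
  end

  def next
    @items.shift
  end
end

# Wraps a hasNext/next source and yields converted records, so all of
# Enumerable (to_a, map, first, ...) becomes available.
class IteratorAdapter
  include Enumerable

  def initialize(reader, &convert)
    @reader  = reader
    @convert = convert
  end

  def each
    return enum_for(:each) unless block_given?

    while @reader.hasNext
      raw = @reader.next
      yield @convert.call(raw)
    end
  end
end

adapter = IteratorAdapter.new(FakeJavaReader.new(%w[a b c])) { |raw| raw.upcase }
p adapter.to_a
```

Like the reader in the diff, this adapter consumes its underlying stream, so a second pass over the same instance yields nothing.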
data/test/marc4j_reader_test.rb
ADDED
@@ -0,0 +1,138 @@
|
|
1
|
+
# Encoding: UTF-8
|
2
|
+
|
3
|
+
require 'test_helper'
|
4
|
+
|
5
|
+
require 'traject'
|
6
|
+
require 'traject/indexer'
|
7
|
+
require 'traject/marc4j_reader'
|
8
|
+
|
9
|
+
require 'marc'
|
10
|
+
|
11
|
+
describe "Marc4JReader" do
|
12
|
+
it "reads Marc binary" do
|
13
|
+
file = File.new(support_file_path("test_data.utf8.mrc"))
|
14
|
+
settings = Traject::Indexer::Settings.new() # binary type is default
|
15
|
+
reader = Traject::Marc4JReader.new(file, settings)
|
16
|
+
|
17
|
+
array = reader.to_a
|
18
|
+
|
19
|
+
assert_equal 30, array.length
|
20
|
+
first = array.first
|
21
|
+
|
22
|
+
assert_kind_of MARC::Record, first
|
23
|
+
assert_equal "UTF-8", first['245']['a'].encoding.name
|
24
|
+
end
|
25
|
+
|
26
|
+
it "can skip a bad subfield code" do
|
27
|
+
file = File.new(support_file_path("bad_subfield_code.marc"))
|
28
|
+
settings = Traject::Indexer::Settings.new() # binary type is default
|
29
|
+
reader = Traject::Marc4JReader.new(file, settings)
|
30
|
+
|
31
|
+
array = reader.to_a
|
32
|
+
|
33
|
+
assert_equal 1, array.length
|
34
|
+
assert_kind_of MARC::Record, array.first
|
35
|
+
assert_length 2, array.first['260'].subfields
|
36
|
+
end
|
37
|
+
|
38
|
+
it "reads Marc binary in Marc8 encoding" do
|
39
|
+
file = File.new(support_file_path("one-marc8.mrc"))
|
40
|
+
settings = Traject::Indexer::Settings.new("marc_source.encoding" => "MARC8")
|
41
|
+
reader = Traject::Marc4JReader.new(file, settings)
|
42
|
+
|
43
|
+
array = reader.to_a
|
44
|
+
|
45
|
+
assert_length 1, array
|
46
|
+
|
47
|
+
|
48
|
+
assert_kind_of MARC::Record, array.first
|
49
|
+
a245a = array.first['245']['a']
|
50
|
+
|
51
|
+
assert_equal "UTF-8", a245a.encoding.name
|
52
|
+
assert a245a.valid_encoding?
|
53
|
+
# marc4j converts to denormalized unicode, bah. Although
|
54
|
+
# it's legal, it probably looks weird as a string literal
|
55
|
+
# below, depending on your editor.
|
56
|
+
assert_equal "Por uma outra globalização :", a245a
|
57
|
+
|
58
|
+
# Set leader byte to proper for unicode
|
59
|
+
assert_equal 'a', array.first.leader[9]
|
60
|
+
end
|
61
|
+
|
62
|
+
|
63
|
+
it "reads XML" do
|
64
|
+
file = File.new(support_file_path "test_data.utf8.marc.xml")
|
65
|
+
settings = Traject::Indexer::Settings.new("marc_source.type" => "xml")
|
66
|
+
reader = Traject::Marc4JReader.new(file, settings)
|
67
|
+
|
68
|
+
array = reader.to_a
|
69
|
+
|
70
|
+
assert_equal 30, array.length
|
71
|
+
|
72
|
+
first = array.first
|
73
|
+
|
74
|
+
assert_kind_of MARC::Record, first
|
75
|
+
assert_equal "UTF-8", first['245']['a'].encoding.name
|
76
|
+
assert_equal "Fikr-i Ayāz /", first['245']['a']
|
77
|
+
end
|
78
|
+
|
79
|
+
it "keeps marc4j object when asked" do
|
80
|
+
file = File.new(support_file_path "test_data.utf8.marc.xml")
|
81
|
+
settings = Traject::Indexer::Settings.new("marc_source.type" => "xml", 'marc4j_reader.keep_marc4j' => true)
|
82
|
+
record = Traject::Marc4JReader.new(file, settings).to_a.first
|
83
|
+
assert_kind_of MARC::Record, record
|
84
|
+
assert_kind_of Java::org.marc4j.marc.impl::RecordImpl, record.original_marc4j
|
85
|
+
end
|
86
|
+
|
87
|
+
it "replaces unicode character reference in Marc8 transcode" do
|
88
|
+
file = File.new(support_file_path "escaped_character_reference.marc8.marc")
|
89
|
+
# due to marc4j idiosyncrasies, this test will NOT pass with default source_encoding
|
90
|
+
# of "BESTGUESS", it only works if you explicitly set to MARC8. Doh.
|
91
|
+
settings = Traject::Indexer::Settings.new("marc_source.encoding" => "MARC8") # binary type is default
|
92
|
+
record = Traject::Marc4JReader.new(file, settings).to_a.first
|
93
|
+
|
94
|
+
assert_equal "Rio de Janeiro escaped replacement char: \uFFFD .", record['260']['a']
|
95
|
+
end
|
96
|
+
|
97
|
+
describe "Marc4J Java Permissive Stream Reader" do
|
98
|
+
# needed for sanity check when our tests fail to see if Marc4J
|
99
|
+
# is not behaving how we think it should.
|
100
|
+
it "converts character references" do
|
101
|
+
settings = Traject::Indexer::Settings.new("marc4j_reader.permissive" => true)
|
102
|
+
|
103
|
+
file = File.new(support_file_path "escaped_character_reference.marc8.marc")
|
104
|
+
reader = MarcPermissiveStreamReader.new(file.to_inputstream, true, true, "MARC-8")
|
105
|
+
record = reader.next
|
106
|
+
|
107
|
+
field = record.getVariableField("260")
|
108
|
+
subfield = field.getSubfield('a'.ord)
|
109
|
+
value = subfield.getData
|
110
|
+
|
111
|
+
assert_equal "Rio de Janeiro escaped replacement char: \uFFFD .", value
|
112
|
+
end
|
113
|
+
end
|
114
|
+
|
115
|
+
it "replaces bad byte in UTF8 marc" do
|
116
|
+
|
117
|
+
# Note this only works because the marc file DOES correctly
|
118
|
+
# have leader byte 9 set to 'a' for UTF8, otherwise Marc4J can't do it.
|
119
|
+
file = File.new(support_file_path "bad_utf_byte.utf8.marc")
|
120
|
+
|
121
|
+
# marc4j doesn't do this in permissive mode
|
122
|
+
settings = Traject::Indexer::Settings.new("marc4j_reader.permissive" => false)
|
123
|
+
reader = Traject::Marc4JReader.new(file, settings)
|
124
|
+
|
125
|
+
record = reader.to_a.first
|
126
|
+
|
127
|
+
value = record['300']['a']
|
128
|
+
|
129
|
+
assert_equal "UTF-8", value.encoding.name
|
130
|
+
assert value.valid_encoding?, "Has valid encoding"
|
131
|
+
assert_equal "This is a bad byte: '\uFFFD' and another: '\uFFFD'", record['300']['a']
|
132
|
+
end
|
133
|
+
|
134
|
+
|
135
|
+
|
136
|
+
|
137
|
+
|
138
|
+
end
|
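The tests above set `marc_source.encoding` explicitly rather than relying on "BESTGUESS". Before that value reaches Marc4J, the reader normalizes it in `specified_source_encoding`. The sketch below reproduces that normalization in isolation, assuming a plain Hash in place of the traject settings object; `normalize_source_encoding` is a hypothetical name, not part of the gem.

```ruby
# Hypothetical stand-alone version of the normalization performed by
# Traject::Marc4JReader#specified_source_encoding. A plain Hash stands in
# for the traject settings object (an assumption for illustration).
def normalize_source_encoding(settings)
  enc = settings["marc_source.encoding"]

  # Ruby spells it "ISO-8859-1"; Marc4J expects "ISO8859_1".
  enc = "ISO8859_1" if enc == "ISO-8859-1"

  # Fall back to Marc4J's guessing mode when nothing usable is set.
  enc = "BESTGUESS" if enc.nil? || enc.empty?

  enc
end

puts normalize_source_encoding({})
puts normalize_source_encoding("marc_source.encoding" => "ISO-8859-1")
puts normalize_source_encoding("marc_source.encoding" => "MARC8")
```

Any other value, such as "MARC8" or "UTF-8", passes through unchanged.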
data/test/test_helper.rb
ADDED
@@ -0,0 +1,88 @@
|
|
1
|
+
gem 'minitest' # I feel like this messes with bundler, but only way to get minitest to shut up
|
2
|
+
require 'minitest/autorun'
|
3
|
+
require 'minitest/spec'
|
4
|
+
|
5
|
+
require 'traject'
|
6
|
+
require 'marc'
|
7
|
+
|
8
|
+
# keeps things from complaining about "yell-1.4.0/lib/yell/adapters/io.rb:66 warning: syswrite for buffered IO"
|
9
|
+
# for reasons I don't entirely understand, involving yell using syswrite and tests sometimes
|
10
|
+
# using $stderr.puts. https://github.com/TwP/logging/issues/31
|
11
|
+
STDERR.sync = true
|
12
|
+
|
13
|
+
# Hacky way to turn off Indexer logging by default, say only
|
14
|
+
# log things higher than fatal, which is nothing.
|
15
|
+
require 'traject/indexer/settings'
|
16
|
+
Traject::Indexer::Settings.defaults["log.level"] = "gt.fatal"
|
17
|
+
|
18
|
+
def support_file_path(relative_path)
|
19
|
+
return File.expand_path(File.join("test_support", relative_path), File.dirname(__FILE__))
|
20
|
+
end
|
21
|
+
|
22
|
+
# An 'assert_length' helper; I don't know why minitest doesn't already provide one
|
23
|
+
def assert_length(length, obj, msg = nil)
|
24
|
+
unless obj.respond_to? :length
|
25
|
+
raise ArgumentError, "object with assert_length must respond_to? :length: #{obj.inspect}"
|
26
|
+
end
|
27
|
+
|
28
|
+
|
29
|
+
msg ||= "Expected length of #{obj} to be #{length}, but was #{obj.length}"
|
30
|
+
|
31
|
+
assert_equal(length, obj.length, msg.to_s )
|
32
|
+
end
|
33
|
+
|
34
|
+
def assert_start_with(start_with, obj, msg = nil)
|
35
|
+
msg ||= "expected #{obj} to start with #{start_with}"
|
36
|
+
|
37
|
+
assert obj.start_with?(start_with), msg
|
38
|
+
end
|
39
|
+
|
40
|
+
|
41
|
+
# An empty record, for making sure extractors and macros work when
|
42
|
+
# the fields they're looking for aren't there
|
43
|
+
|
44
|
+
def empty_record
|
45
|
+
rec = MARC::Record.new
|
46
|
+
rec.append(MARC::ControlField.new('001', '000000000'))
|
47
|
+
rec
|
48
|
+
end
|
49
|
+
|
50
|
+
# pretends to be a SolrJ HTTPServer-like thing, just kind of mocks it up
|
51
|
+
# and records what happens and simulates errors in some cases.
|
52
|
+
class MockSolrServer
|
53
|
+
attr_accessor :things_added, :url, :committed, :parser, :shutted_down
|
54
|
+
|
55
|
+
def initialize(url)
|
56
|
+
@url = url
|
57
|
+
@things_added = []
|
58
|
+
@add_mutex = Mutex.new
|
59
|
+
end
|
60
|
+
|
61
|
+
def add(thing)
|
62
|
+
@add_mutex.synchronize do # easy peasy threadsafety for our mock
|
63
|
+
if @url == "http://no.such.place"
|
64
|
+
raise org.apache.solr.client.solrj.SolrServerException.new("mock bad uri", java.io.IOException.new)
|
65
|
+
end
|
66
|
+
|
67
|
+
# simulate a multiple id error please
|
68
|
+
if [thing].flatten.find {|doc| doc.getField("id").getValueCount() != 1}
|
69
|
+
raise org.apache.solr.client.solrj.SolrServerException.new("mock non-1 size of 'id'")
|
70
|
+
else
|
71
|
+
things_added << thing
|
72
|
+
end
|
73
|
+
end
|
74
|
+
end
|
75
|
+
|
76
|
+
def commit
|
77
|
+
@committed = true
|
78
|
+
end
|
79
|
+
|
80
|
+
def setParser(parser)
|
81
|
+
@parser = parser
|
82
|
+
end
|
83
|
+
|
84
|
+
def shutdown
|
85
|
+
@shutted_down = true
|
86
|
+
end
|
87
|
+
|
88
|
+
end
|
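`MockSolrServer#add` above guards its array with a `Mutex` so that concurrent writers cannot interleave. The sketch below shows the same idea in isolation; `ThreadSafeCollector` is a hypothetical class, not part of this gem or of traject.

```ruby
# Hypothetical collector illustrating the Mutex pattern used by the mock.
class ThreadSafeCollector
  attr_reader :items

  def initialize
    @items = []
    @mutex = Mutex.new
  end

  # Only one thread at a time may append.
  def add(item)
    @mutex.synchronize { @items << item }
  end
end

collector = ThreadSafeCollector.new
threads = 4.times.map do |t|
  Thread.new { 100.times { |i| collector.add([t, i]) } }
end
threads.each(&:join)
puts collector.items.length
```

Each of the four threads adds 100 distinct pairs, so the collector should end up holding 400 unique items.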
data/test/test_support/bad_subfield_code.marc
ADDED
@@ -0,0 +1 @@
|
|
1
|
+
00582cam 2200217 4500001000800000005001700008008004100025035001200066035001400078035001900092035002300111040001300134049000800147050002000155100001800175245003300193260003900226300001000265910002600275991006300301117499920020328125800.0670101s1959 fr 000 0 fre a1174999 aABK6934EI a(MdBJ)99005528 a(CaOTULAS)00190819 aOTUbENG00aJHE 4aPQ2605 A873bC61 aCayrol, Jean.14aLes corps �etrangers, roman. aParis��bEditions du Seuilc[1959] a188p. a1174999bHorizon bib# aPH4605.C36 C6 1959flcbelccc. 1q0i2240984leoffsmelsc
|
data/test/test_support/bad_utf_byte.utf8.marc
ADDED
@@ -0,0 +1 @@
|
|
1
|
+
00083 a2200037 4500300004500000 aThis is a bad byte: '�' and another: '�'
|
data/test/test_support/escaped_character_reference.marc8.marc
ADDED
@@ -0,0 +1 @@
|
|
1
|
+
00138cam 2200049Ia 45000010008000002600080000082196384 aRio de Janeiro escaped replacement char: � .bEditora Record,c2000.
|
data/test/test_support/one-marc8.mrc
ADDED
@@ -0,0 +1 @@
|
|
1
|
+
00663cam 2200217Ia 4500001000800000005001700008008004100025020001500066035001600081040001300097049000900110090002200119100002000141245010100161260004500262300002100307650001800328910002600346991006100372994001200433219638420010919155148.0010626s2000 bl 000 0 por d a8501058785 aocm47198253 aIXAcIXA aJHEE aJZ1308b.S26 20001 aSantos, M�ilton10aPor uma outra globaliza�c�ao :bdo pensamemto �unico �a consci�encia universal /cMilton Santos. aRio de Janeiro :bEditora Record,c2000. a174 p. ;c21 cm. 0aGlobalization a2196384bHorizon bib# aJZ1308.S26 2000flcbelccc. 1q0i3557489lemainmemsel aX0bJHE
|
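The tests check leader byte 9, which in MARC 21 flags the character coding scheme (`a` for UCS/Unicode, blank for MARC-8). A minimal sketch of reading that byte from the 24-character leader follows. `leader_info` is a hypothetical helper, and the sample leader is hand-built for illustration rather than copied from the fixture files, whose whitespace may not have survived in this rendering.

```ruby
# Hypothetical helper: pull a few fixed-position values out of a
# 24-character MARC 21 leader.
def leader_info(leader)
  raise ArgumentError, "leader must be 24 characters" unless leader.length == 24

  scheme =
    case leader[9]
    when "a" then :unicode
    when " " then :marc8
    else :unknown
    end

  {
    record_length: leader[0, 5].to_i,  # positions 00-04
    coding_scheme: scheme,             # position 09
    base_address:  leader[12, 5].to_i  # positions 12-16
  }
end

# Hand-built sample leader (not taken from the fixtures).
sample = "00083" + " " * 4 + "a22" + "00037" + " " * 3 + "4500"
info = leader_info(sample)
puts info[:record_length]
puts info[:coding_scheme]
puts info[:base_address]
```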