traject-marc4j_reader 1.0.0-java
- checksums.yaml +7 -0
- data/.gitignore +22 -0
- data/.travis.yml +3 -0
- data/Gemfile +4 -0
- data/LICENSE.txt +22 -0
- data/README.md +100 -0
- data/Rakefile +9 -0
- data/lib/traject/marc4j_reader/version.rb +5 -0
- data/lib/traject/marc4j_reader.rb +153 -0
- data/test/marc4j_reader_test.rb +138 -0
- data/test/test_helper.rb +88 -0
- data/test/test_support/bad_subfield_code.marc +1 -0
- data/test/test_support/bad_utf_byte.utf8.marc +1 -0
- data/test/test_support/escaped_character_reference.marc8.marc +1 -0
- data/test/test_support/one-marc8.mrc +1 -0
- data/test/test_support/test_data.utf8.marc.xml +2609 -0
- data/test/test_support/test_data.utf8.mrc +1 -0
- data/test/test_traject_marc4j_reader.rb +11 -0
- data/traject-marc4j_reader.gemspec +31 -0
- metadata +157 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
---
SHA1:
  metadata.gz: 781377f3f2adb35b8345224b0d04f424355b899f
  data.tar.gz: 5941faff6ae8c86282d1acad116c2079f6840d65
SHA512:
  metadata.gz: 2deabe30e44a00b5f232660bf36db22bddeb0f0577ba75bfaa662d5ad98aeaca36493b2d69104e7ab3b6362b1f1ac2daf76a6c5cccd31cb0e4bae9a362901468
  data.tar.gz: b9259af133a130101c0b3e6a2a47bd4eac2a6dac790c79ff9fea3062cf1ab8222873761684c641d2a7145dab20ed8a6acf263e054be925575d17c15bebe39815
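The digests in checksums.yaml let a consumer confirm that the packaged archives were not altered. The following is a minimal sketch of such a check, assuming the YAML layout shown above; `checksum_matches?` and the sample bytes are hypothetical and are not part of RubyGems or of this gem.

```ruby
require 'digest'
require 'yaml'

# Hypothetical helper: compare +bytes+ against the SHA512 entry recorded
# for +name+ in a parsed checksums.yaml structure.
def checksum_matches?(checksums, name, bytes)
  expected = checksums.fetch("SHA512").fetch(name).to_s
  Digest::SHA512.hexdigest(bytes) == expected
end

# Stand-in data: a real check would read data.tar.gz from the unpacked gem.
sample_bytes = "example archive bytes"
sample_yaml  = <<~YAML
  ---
  SHA512:
    data.tar.gz: #{Digest::SHA512.hexdigest(sample_bytes)}
YAML

checksums = YAML.safe_load(sample_yaml)
puts checksum_matches?(checksums, "data.tar.gz", sample_bytes)
puts checksum_matches?(checksums, "data.tar.gz", "tampered bytes")
```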
data/.gitignore
ADDED
@@ -0,0 +1,22 @@
*.gem
*.rbc
.bundle
.config
.yardoc
Gemfile.lock
InstalledFiles
_yardoc
coverage
doc/
lib/bundler/man
pkg
rdoc
spec/reports
test/tmp
test/version_tmp
tmp
*.bundle
*.so
*.o
*.a
mkmf.log
data/.travis.yml
ADDED
data/Gemfile
ADDED
data/LICENSE.txt
ADDED
@@ -0,0 +1,22 @@
Copyright (c) 2014 Bill Dueber

MIT License

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,100 @@
# Traject::Marc4JReader

**Note**: `Traject::Marc4JReader` is for JRuby only.

`Traject::Marc4JReader` is a reader for the [traject](https://github.com/traject-project/traject) ETL system
that allows the use of [marc4j](https://github.com/marc4j/marc4j) as a reader when dealing with MARC
binary or MARC-XML files. It is of no use outside of `traject` run under JRuby.

It leverages [marc-marc4j](https://github.com/billdueber/ruby-marc-marc4j), which is a paper-thin wrapper around
the Marc4J `.jar` that is shipped with it.

**The output of the reader is a vanilla ruby-marc object**. You can hang onto the
original marc4j java object with the `marc4j_reader.keep_marc4j` setting.

## Why use this?

The biggest reason would be faster MARC/MARC-XML parsing and generation than the vanilla
[marc](https://github.com/ruby-marc/ruby-marc) gem can provide. Another is needing to do something wacky with the marc4j
internal structure (such as feeding it to legacy java code you have lying around).

In general, the marc4j library will parse marc21 (binary) and MARC-XML roughly twice
as fast as the pure-ruby library. While MARC parsing tends to not be a huge part
of the workload in a traject run, you'll almost certainly see performance gains.

## Installation

Add this line to your application's Gemfile:

    gem 'traject-marc4j_reader'

And then execute:

    $ bundle

Or install it yourself as:

    $ gem install traject-marc4j_reader

## Traject::Marc4JReader settings

For more about the traject `settings` object, see [the traject settings documentation](https://github.com/traject-project/traject/blob/master/doc/settings.md)

Note that Marc4JReader always converts to UTF-8,
so output will always reflect that conversion.

* `marc4j_reader.jar_dir`: Path to a directory containing the Marc4J jar file to use. All `.jar` files in the directory will
  be loaded. If unset, uses the marc4j jar bundled with marc-marc4j.

* `marc4j_reader.permissive`: Boolean, default true. Used only when `marc_source.type` is 'binary'; passed as the permissive argument to the underlying MarcPermissiveStreamReader.

* `marc_source.encoding`: Used only when `marc_source.type` is 'binary'. One of the encoding strings accepted
  by the marc4j MarcPermissiveStreamReader. Default "BESTGUESS"; also "UTF-8", "ISO-8859-1", "MARC8".

* `marc4j_reader.keep_marc4j`: After translating the marc4j record into a normal ruby-marc object,
  provides access to the former via `record#original_marc4j`.

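How the encoding and permissive settings are interpreted can be sketched in plain Ruby. The two helpers below mirror the logic in `lib/traject/marc4j_reader.rb` (which reads `marc_source.encoding` and `marc4j_reader.permissive`), but `resolve_source_encoding` and `permissive?` are hypothetical names and take a plain Hash rather than a traject settings object, so traject's own defaults are not applied here.

```ruby
# Hypothetical stand-alone helpers; not part of the gem's API.

# Resolve the encoding name handed to Marc4J.
def resolve_source_encoding(settings)
  enc = settings["marc_source.encoding"]
  enc = "ISO8859_1" if enc == "ISO-8859-1"     # Ruby's spelling -> Marc4J's spelling
  enc = "BESTGUESS" if enc.nil? || enc.empty?  # fallback when nothing is set
  enc
end

# The permissive flag is compared as a string, so true and "true" both enable it.
def permissive?(settings)
  settings["marc4j_reader.permissive"].to_s == "true"
end

puts resolve_source_encoding({})                                       # BESTGUESS
puts resolve_source_encoding("marc_source.encoding" => "ISO-8859-1")   # ISO8859_1
puts resolve_source_encoding("marc_source.encoding" => "MARC8")        # MARC8
puts permissive?("marc4j_reader.permissive" => "true")                 # true
puts permissive?("marc4j_reader.permissive" => false)                  # false
```

Setting the encoding explicitly sidesteps the "BESTGUESS" fallback, which the reader's own comments describe as unpredictable.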
## Sample use

A simple example that reads in via marc4j and outputs to the newline-delimited-json writer.

Run it with:

```bash
traject -c id_title.rb my_marc_file.mrc
```

```ruby
# File id_title.rb

require 'traject'
require 'traject/marc4j_reader'
require 'traject/json_writer'

require 'traject/macros/marc21_semantics'
extend Traject::Macros::Marc21Semantics

settings do
  provide "reader_class_name", "Traject::Marc4JReader"
  provide "marc4j_reader.keep_marc4j", true
  provide "writer_class_name", "Traject::JsonWriter"
  provide "output_file", "ids_and_titles.ndj"
end

to_field "id", extract_marc("001", :first => true)
to_field "title", extract_marc_filing_version('245abdefghknp', :include_original => true)

```

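With `marc4j_reader.keep_marc4j` enabled, the original Java record is reachable from each ruby-marc record. The fragment below is a sketch of how a traject configuration file might use it; the field name `control_number_from_marc4j` is illustrative, and the fragment only runs inside traject under JRuby.

```ruby
# Fragment for a traject configuration file such as id_title.rb.
settings do
  provide "reader_class_name", "Traject::Marc4JReader"
  provide "marc4j_reader.keep_marc4j", true
end

# "control_number_from_marc4j" is a made-up field name for illustration.
to_field "control_number_from_marc4j" do |record, accumulator|
  java_record = record.original_marc4j   # the marc4j record kept by the reader
  accumulator << java_record.getControlNumber if java_record
end
```

Any legacy Java code that expects a marc4j record could be handed `java_record` in the same block.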
## Contributing

1. Fork it ( https://github.com/[my-github-username]/traject_marc4j_reader/fork )
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Commit your changes (`git commit -am 'Add some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
5. Create a new Pull Request
data/Rakefile
ADDED
data/lib/traject/marc4j_reader/version.rb
ADDED
data/lib/traject/marc4j_reader.rb
ADDED
@@ -0,0 +1,153 @@
require 'traject'
require 'marc'
require 'marc/marc4j'

# `Traject::Marc4JReader` uses the marc4j java package to parse the MARC records
# into standard ruby-marc MARC::Record objects. This reader may be faster than
# Traject::MarcReader, especially for XML.
#
# Marc4JReader can read MARC ISO 2709 ("binary") or MARCXML. We use the Marc4J MarcPermissiveStreamReader
# for reading binary, but sometimes in non-permissive mode, according to settings. We use the Marc4j MarcXmlReader
# for reading xml. The actual code for dealing with Marc4J is in the separate
# [marc-marc4j gem](https://github.com/billdueber/ruby-marc-marc4j).
#
# See also the pure ruby Traject::MarcReader as an alternative if you need to read
# marc-in-json. If you don't need binary Marc8 support, it may in some cases
# be faster.
#
# ## Settings
#
# * marc_source.type: serialization type. default 'binary', also 'xml' (TODO: json/marc-in-json)
#
# * marc4j_reader.permissive: default true, false to turn off permissive reading. Used as
#   value to 'permissive' arg of MarcPermissiveStreamReader constructor.
#   Only used for 'binary'
#
# * marc_source.encoding: Only used for 'binary', otherwise always UTF-8.
#   String of the values MarcPermissiveStreamReader accepts:
#   * BESTGUESS (default: not entirely clear what Marc4J does with this)
#   * ISO-8859-1 (also accepted: ISO8859_1)
#   * UTF-8
#   * MARC-8 (also accepted: MARC8)
#   Default 'BESTGUESS', but we HIGHLY recommend setting this explicitly,
#   because Marc4J "BESTGUESS" can be unpredictable in a variety of ways.
#   (will ALWAYS be transcoded to UTF-8 on the way out. We insist.)
#
# * marc4j_reader.jar_dir: Path to a directory containing Marc4J jar file to use. All .jar's in dir will
#   be loaded. If unset, uses the marc4j jar bundled with marc-marc4j.
#
# * marc4j_reader.keep_marc4j: Keeps the original marc4j record accessible from
#   the eventual ruby-marc record via record#original_marc4j. Intended for
#   those that have legacy java code for which a marc4j object is needed.
#
#
# ## Example
#
# In a configuration file:
#
#     require 'traject/marc4j_reader'
#     settings do
#       provide "reader_class_name", "Traject::Marc4JReader"
#
#       #for MarcXML:
#       # provide "marc_source.type", "xml"
#
#       # Or instead for binary:
#       provide "marc4j_reader.permissive", true
#       provide "marc_source.encoding", "MARC8"
#     end
class Traject::Marc4JReader
  include Enumerable

  attr_reader :settings, :input_stream

  def initialize(input_stream, settings)
    @settings     = Traject::Indexer::Settings.new settings
    @input_stream = input_stream

    # Add the accessor to MARC::Record only when asked to, and only once.
    if @settings['marc4j_reader.keep_marc4j'] &&
      ! (MARC::Record.instance_methods.include?(:original_marc4j) &&
         MARC::Record.instance_methods.include?(:"original_marc4j="))
      MARC::Record.class_eval('attr_accessor :original_marc4j')
    end

    # Creating a converter will do the following:
    #  - nothing, if it detects that the marc4j jar is already loaded
    #  - load all the .jar files in settings['marc4j_reader.jar_dir'] if set
    #  - load the marc4j jar file bundled with MARC::MARC4J otherwise

    @converter = MARC::MARC4J.new(:jardir => settings['marc4j_reader.jar_dir'], :logger => logger)

    # Convenience
    java_import org.marc4j.MarcPermissiveStreamReader
    java_import org.marc4j.MarcXmlReader

  end


  def internal_reader
    @internal_reader ||= create_marc_reader!
  end

  def input_type
    # maybe later add some guessing somehow
    settings["marc_source.type"]
  end

  def specified_source_encoding
    #settings["marc4j_reader.source_encoding"]
    enc = settings["marc_source.encoding"]

    # one is standard for ruby and we want to support,
    # the other is used by Marc4J and we have to pass it to Marc4J
    enc = "ISO8859_1" if enc == "ISO-8859-1"

    # default
    enc = "BESTGUESS" if enc.nil? || enc.empty?

    return enc
  end

  def create_marc_reader!
    case input_type
    when "binary"
      permissive = settings["marc4j_reader.permissive"].to_s == "true"

      # #to_inputstream turns our ruby IO into a Java InputStream
      # third arg means 'convert to UTF-8, yes'
      MarcPermissiveStreamReader.new(input_stream.to_inputstream, permissive, true, specified_source_encoding)
    when "xml"
      MarcXmlReader.new(input_stream.to_inputstream)
    else
      # Ruby has no IllegalArgument class; ArgumentError is the standard one.
      raise ArgumentError.new("Unrecognized marc_source.type: #{input_type}")
    end
  end

  def each
    while (internal_reader.hasNext)
      # Reset so the rescue block never reports the previous record's id.
      marc4j = nil
      begin
        marc4j = internal_reader.next
        rubymarc = @converter.marc4j_to_rubymarc(marc4j)
        if @settings['marc4j_reader.keep_marc4j']
          rubymarc.original_marc4j = marc4j
        end
      rescue Exception => e
        msg = "MARC4JReader: Error reading MARC, fatal, re-raising"
        if marc4j
          msg += "\n    001 id: #{marc4j.getControlNumber}"
        end
        msg += "\n    #{Traject::Util.exception_to_log_message(e)}"
        logger.fatal msg
        raise e
      end

      yield rubymarc
    end
  end

  def logger
    # Falls back to a logger that only logs above fatal, i.e. a null logger.
    @logger ||= (settings[:logger] || Yell.new(STDERR, :level => "gt.fatal"))
  end

end
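The reader's central idea is to wrap a Java-style `hasNext`/`next` source in a Ruby `each`, so that `Enumerable` supplies `to_a`, `map`, and friends. Below is a minimal sketch of that pattern that runs without JRuby; `FakeJavaIterator` and `IteratorAdapter` are hypothetical stand-ins, not classes from this gem or from Marc4J.

```ruby
# Stand-in for a Marc4J reader: exposes the same hasNext/next protocol.
class FakeJavaIterator
  def initialize(items)
    @items = items.dup
  end

  def hasNext
    !@items.empty?
  end

  def next
    @items.shift
  end
end

# Hypothetical adapter: pulls items from a hasNext/next source, converts
# each one with the given block, and yields the result.
class IteratorAdapter
  include Enumerable

  def initialize(source, &convert)
    @source  = source
    @convert = convert
  end

  def each
    return enum_for(:each) unless block_given?

    while @source.hasNext
      yield @convert.call(@source.next)
    end
  end
end

adapter = IteratorAdapter.new(FakeJavaIterator.new(%w[rec1 rec2 rec3])) { |raw| raw.upcase }
p adapter.to_a   # ["REC1", "REC2", "REC3"]
```

As with a reader over a stream, the adapter is single-pass: once the underlying source is exhausted, iterating again yields nothing.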
data/test/marc4j_reader_test.rb
ADDED
@@ -0,0 +1,138 @@
# Encoding: UTF-8

require 'test_helper'

require 'traject'
require 'traject/indexer'
require 'traject/marc4j_reader'

require 'marc'

describe "Marc4JReader" do
  it "reads Marc binary" do
    file = File.new(support_file_path("test_data.utf8.mrc"))
    settings = Traject::Indexer::Settings.new() # binary type is default
    reader = Traject::Marc4JReader.new(file, settings)

    array = reader.to_a

    assert_equal 30, array.length
    first = array.first

    assert_kind_of MARC::Record, first
    assert_equal "UTF-8", first['245']['a'].encoding.name
  end

  it "can skip a bad subfield code" do
    file = File.new(support_file_path("bad_subfield_code.marc"))
    settings = Traject::Indexer::Settings.new() # binary type is default
    reader = Traject::Marc4JReader.new(file, settings)

    array = reader.to_a

    assert_equal 1, array.length
    assert_kind_of MARC::Record, array.first
    assert_length 2, array.first['260'].subfields
  end

  it "reads Marc binary in Marc8 encoding" do
    file = File.new(support_file_path("one-marc8.mrc"))
    settings = Traject::Indexer::Settings.new("marc_source.encoding" => "MARC8")
    reader = Traject::Marc4JReader.new(file, settings)

    array = reader.to_a

    assert_length 1, array

    assert_kind_of MARC::Record, array.first
    a245a = array.first['245']['a']

    # assert_equal, not assert: a bare assert would treat "UTF-8" as the
    # failure message and pass for any truthy encoding name.
    assert_equal "UTF-8", a245a.encoding.name
    assert a245a.valid_encoding?
    # marc4j converts to denormalized unicode, bah. Although
    # it's legal, it probably looks weird as a string literal
    # below, depending on your editor.
    assert_equal "Por uma outra globalização :", a245a

    # Set leader byte to proper for unicode
    assert_equal 'a', array.first.leader[9]
  end


  it "reads XML" do
    file = File.new(support_file_path "test_data.utf8.marc.xml")
    settings = Traject::Indexer::Settings.new("marc_source.type" => "xml")
    reader = Traject::Marc4JReader.new(file, settings)

    array = reader.to_a

    assert_equal 30, array.length

    first = array.first

    assert_kind_of MARC::Record, first
    assert_equal "UTF-8", first['245']['a'].encoding.name
    assert_equal "Fikr-i Ayāz /", first['245']['a']
  end

  it "keeps marc4j object when asked" do
    file = File.new(support_file_path "test_data.utf8.marc.xml")
    settings = Traject::Indexer::Settings.new("marc_source.type" => "xml", 'marc4j_reader.keep_marc4j' => true)
    record = Traject::Marc4JReader.new(file, settings).to_a.first
    assert_kind_of MARC::Record, record
    assert_kind_of Java::org.marc4j.marc.impl::RecordImpl, record.original_marc4j
  end

  it "replaces unicode character reference in Marc8 transcode" do
    file = File.new(support_file_path "escaped_character_reference.marc8.marc")
    # due to marc4j idiosyncrasies, this test will NOT pass with default source_encoding
    # of "BESTGUESS", it only works if you explicitly set to MARC8. Doh.
    settings = Traject::Indexer::Settings.new("marc_source.encoding" => "MARC8") # binary type is default
    record = Traject::Marc4JReader.new(file, settings).to_a.first

    assert_equal "Rio de Janeiro escaped replacement char: \uFFFD .", record['260']['a']
  end

  describe "Marc4J Java Permissive Stream Reader" do
    # needed for sanity check when our tests fail to see if Marc4J
    # is not behaving how we think it should.
    it "converts character references" do
      settings = Traject::Indexer::Settings.new("marc4j_reader.permissive" => true)

      file = File.new(support_file_path "escaped_character_reference.marc8.marc")
      reader = MarcPermissiveStreamReader.new(file.to_inputstream, true, true, "MARC-8")
      record = reader.next

      field = record.getVariableField("260")
      subfield = field.getSubfield('a'.ord)
      value = subfield.getData

      assert_equal "Rio de Janeiro escaped replacement char: \uFFFD .", value
    end
  end

  it "replaces bad byte in UTF8 marc" do

    # Note this only works because the marc file DOES correctly
    # have leader byte 9 set to 'a' for UTF8, otherwise Marc4J can't do it.
    file = File.new(support_file_path "bad_utf_byte.utf8.marc")

    # marc4j doesn't do this in permissive mode
    settings = Traject::Indexer::Settings.new("marc4j_reader.permissive" => false)
    reader = Traject::Marc4JReader.new(file, settings)

    record = reader.to_a.first

    value = record['300']['a']

    assert_equal "UTF-8", value.encoding.name
    assert value.valid_encoding?, "Has valid encoding"
    assert_equal "This is a bad byte: '\uFFFD' and another: '\uFFFD'", record['300']['a']
  end

end
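The Marc8 test's comment about "denormalized unicode" refers to decomposed text: an accented letter stored as a base letter plus a combining mark. The sketch below, in plain Ruby with hand-written sample strings rather than Marc4J output, shows why such a string fails a naive equality check against its composed form and how `String#unicode_normalize` reconciles the two.

```ruby
composed   = "globaliza\u00E7\u00E3o"     # ç and ã as single code points (NFC)
decomposed = "globalizac\u0327a\u0303o"   # base letters followed by combining marks (NFD)

puts composed.length                        # 12
puts decomposed.length                      # 14
puts composed == decomposed                 # false
puts composed == decomposed.unicode_normalize(:nfc)   # true
puts composed.unicode_normalize(:nfd) == decomposed   # true
```

A test that compares against a string literal therefore depends on which form the editor saved; normalizing both sides before comparing is one way to avoid that dependency.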
data/test/test_helper.rb
ADDED
@@ -0,0 +1,88 @@
|
gem 'minitest' # I feel like this messes with bundler, but only way to get minitest to shut up
require 'minitest/autorun'
require 'minitest/spec'

require 'traject'
require 'marc'

# keeps things from complaining about "yell-1.4.0/lib/yell/adapters/io.rb:66 warning: syswrite for buffered IO"
# for reasons I don't entirely understand, involving yell using syswrite and tests sometimes
# using $stderr.puts. https://github.com/TwP/logging/issues/31
STDERR.sync = true

# Hacky way to turn off Indexer logging by default, say only
# log things higher than fatal, which is nothing.
require 'traject/indexer/settings'
Traject::Indexer::Settings.defaults["log.level"] = "gt.fatal"

def support_file_path(relative_path)
  return File.expand_path(File.join("test_support", relative_path), File.dirname(__FILE__))
end

# minitest does not provide an assert_length, so define one here.
def assert_length(length, obj, msg = nil)
  unless obj.respond_to? :length
    # The third argument to raise must be a backtrace, so obj is not passed.
    raise ArgumentError, "object with assert_length must respond_to? :length"
  end


  msg ||= "Expected length of #{obj} to be #{length}, but was #{obj.length}"

  assert_equal(length, obj.length, msg.to_s )
end

def assert_start_with(start_with, obj, msg = nil)
  msg ||= "expected #{obj} to start with #{start_with}"

  assert obj.start_with?(start_with), msg
end


# An empty record, for making sure extractors and macros work when
# the fields they're looking for aren't there

def empty_record
  rec = MARC::Record.new
  rec.append(MARC::ControlField.new('001', '000000000'))
  rec
end

# pretends to be a SolrJ HTTPServer-like thing, just kind of mocks it up
# and records what happens and simulates errors in some cases.
class MockSolrServer
  attr_accessor :things_added, :url, :committed, :parser, :shutted_down

  def initialize(url)
    @url =  url
    @things_added = []
    @add_mutex = Mutex.new
  end

  def add(thing)
    @add_mutex.synchronize do # easy peasy threadsafety for our mock
      if @url == "http://no.such.place"
        raise org.apache.solr.client.solrj.SolrServerException.new("mock bad uri", java.io.IOException.new)
      end

      # simulate a multiple id error please
      if [thing].flatten.find {|doc| doc.getField("id").getValueCount() != 1}
        raise org.apache.solr.client.solrj.SolrServerException.new("mock non-1 size of 'id'")
      else
        things_added << thing
      end
    end
  end

  def commit
    @committed = true
  end

  def setParser(parser)
    @parser = parser
  end

  def shutdown
    @shutted_down = true
  end

end
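The mock's "easy peasy threadsafety" comes from funnelling every `add` through one `Mutex`. A stripped-down sketch of that idea follows; `RecordingSink` is a hypothetical class written for this illustration and has none of the Solr-specific behaviour of `MockSolrServer`.

```ruby
# Hypothetical minimal sink: records everything added, guarding the
# shared array with a mutex so concurrent writers cannot interleave.
class RecordingSink
  attr_reader :things_added

  def initialize
    @things_added = []
    @add_mutex    = Mutex.new
  end

  def add(thing)
    @add_mutex.synchronize { @things_added << thing }
  end
end

sink = RecordingSink.new

# Eight writer threads, one hundred additions each.
threads = 8.times.map do |t|
  Thread.new { 100.times { |i| sink.add([t, i]) } }
end
threads.each(&:join)

puts sink.things_added.length   # 800
```

On JRuby, where threads run in parallel, the lock is what keeps the count reliable.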
data/test/test_support/bad_subfield_code.marc
ADDED
@@ -0,0 +1 @@
00582cam 2200217 4500001000800000005001700008008004100025035001200066035001400078035001900092035002300111040001300134049000800147050002000155100001800175245003300193260003900226300001000265910002600275991006300301117499920020328125800.0670101s1959 fr 000 0 fre a1174999 aABK6934EI a(MdBJ)99005528 a(CaOTULAS)00190819 aOTUbENG00aJHE 4aPQ2605 A873bC61 aCayrol, Jean.14aLes corps �etrangers, roman. aParis��bEditions du Seuilc[1959] a188p. a1174999bHorizon bib# aPH4605.C36 C6 1959flcbelccc. 1q0i2240984leoffsmelsc
data/test/test_support/bad_utf_byte.utf8.marc
ADDED
@@ -0,0 +1 @@
00083 a2200037 4500300004500000 aThis is a bad byte: '�' and another: '�'
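The record above carries two bytes that are not valid UTF-8, and the test suite expects each to come back as U+FFFD. Marc4J does that substitution itself; the snippet below is only a plain-Ruby analogy using `String#scrub` on a hand-built string, not the code path the reader takes.

```ruby
# Build a UTF-8 tagged string containing two invalid bytes (0xFF and 0xFE).
raw = "This is a bad byte: '\xFF' and another: '\xFE'".b.force_encoding(Encoding::UTF_8)

puts raw.valid_encoding?       # false

# Replace each invalid byte sequence with the Unicode replacement character.
cleaned = raw.scrub("\uFFFD")

puts cleaned.valid_encoding?   # true
puts cleaned == "This is a bad byte: '\uFFFD' and another: '\uFFFD'"   # true
```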
data/test/test_support/escaped_character_reference.marc8.marc
ADDED
@@ -0,0 +1 @@
00138cam 2200049Ia 45000010008000002600080000082196384 aRio de Janeiro escaped replacement char: � .bEditora Record,c2000.
data/test/test_support/one-marc8.mrc
ADDED
@@ -0,0 +1 @@
00663cam 2200217Ia 4500001000800000005001700008008004100025020001500066035001600081040001300097049000900110090002200119100002000141245010100161260004500262300002100307650001800328910002600346991006100372994001200433219638420010919155148.0010626s2000 bl 000 0 por d a8501058785 aocm47198253 aIXAcIXA aJHEE aJZ1308b.S26 20001 aSantos, M�ilton10aPor uma outra globaliza�c�ao :bdo pensamemto �unico �a consci�encia universal /cMilton Santos. aRio de Janeiro :bEditora Record,c2000. a174 p. ;c21 cm. 0aGlobalization a2196384bHorizon bib# aJZ1308.S26 2000flcbelccc. 1q0i3557489lemainmemsel aX0bJHE