traject 0.13.1 → 0.13.2
- data/README.md +12 -4
- data/doc/batch_execution.md +2 -2
- data/doc/settings.md +2 -2
- data/lib/traject/command_line.rb +21 -7
- data/lib/traject/indexer.rb +15 -8
- data/lib/traject/macros/marc21.rb +11 -5
- data/lib/traject/macros/marc21_semantics.rb +12 -12
- data/lib/traject/marc_extractor.rb +7 -7
- data/lib/traject/marc_reader.rb +2 -2
- data/lib/traject/{mock_writer.rb → null_writer.rb} +1 -1
- data/lib/traject/solrj_writer.rb +2 -0
- data/lib/traject/version.rb +1 -1
- data/test/indexer/macros_marc21_test.rb +21 -1
- data/test/marc_extractor_test.rb +7 -7
- data/test/test_support/demo_config.rb +4 -4
- data/test/test_support/test_data.utf8.mrc.gz +0 -0
- metadata +5 -3
data/README.md
CHANGED
@@ -20,7 +20,7 @@ Existing tools for indexing Marc to Solr exist, and have served us well for many
|
|
20
20
|
logic, should be very easy. More sophisticated and even complex customization use cases should still be possible,
|
21
21
|
changing just the parts of traject you want to change.
|
22
22
|
* *Maintainable local logic*, including supporting sharing of reusable logic via ruby gems.
|
23
|
-
* *Maintainable understandable internal logic*; well-covered by tests, well-factored
|
23
|
+
* *Maintainable understandable internal logic*; well-covered by tests, well-factored separation of concerns,
|
24
24
|
easy for newcomer developers who know ruby to understand the codebase.
|
25
25
|
* *High performance*, using multi-threaded concurrency where appropriate to maximize throughput.
|
26
26
|
While it depends on your configuration and the size of your server(s), traject is likely higher
|
@@ -164,8 +164,12 @@ Other examples of the specification string, which can include multiple tag menti
|
|
164
164
|
|
165
165
|
# Instead of joining subfields from the same field
|
166
166
|
# into one string, joined by spaces, leave them
|
167
|
-
# each in
|
168
|
-
to_field "isbn", extract_marc("020az", :
|
167
|
+
# each in separate strings:
|
168
|
+
to_field "isbn", extract_marc("020az", :separator => nil)
|
169
|
+
|
170
|
+
# Make sure that you don't get any duplicates
|
171
|
+
# by passing in ":deduplicate => true"
|
172
|
+
to_field 'language008', extract_marc('008[35-37]', :deduplicate=>true)
|
169
173
|
~~~
|
170
174
|
|
171
175
|
The `extract_marc` function *by default* includes any linked
|
@@ -347,18 +351,22 @@ checking.
|
|
347
351
|
Use `-u` as a shortcut for `-s solr.url=X`
|
348
352
|
|
349
353
|
traject -c conf_file.rb -u http://example.com/solr marc_file.mrc
|
354
|
+
|
355
|
+
Run `traject -h` to see the command line help screen listing all available options.
|
350
356
|
|
351
357
|
Also see `-I load_path` and `-G Gemfile` options under Extending With Your Own Code.
|
352
358
|
|
353
359
|
See also [Hints for batch and cronjob use](./doc/batch_execution.md) of traject.
|
354
360
|
|
361
|
+
|
362
|
+
|
355
363
|
## Extending With Your Own Code
|
356
364
|
|
357
365
|
Traject config files are full live ruby files, where you can do anything,
|
358
366
|
including declaring new classes, etc.
|
359
367
|
|
360
368
|
However, beyond limited trivial logic, you'll want to organize your
|
361
|
-
code reasonably into
|
369
|
+
code reasonably into separate files, not jam everything into config
|
362
370
|
files.
|
363
371
|
|
364
372
|
Traject wants to make sure it makes it convenient for you to do so,
|
data/doc/batch_execution.md
CHANGED
@@ -132,9 +132,9 @@ the command-line:
|
|
132
132
|
|
133
133
|
Or in a traject configuration file, setting the `log.file` configuration setting.
|
134
134
|
|
135
|
-
###
|
135
|
+
### separate error log
|
136
136
|
|
137
|
-
You can also
|
137
|
+
You can also separately have a duplicate log file created with ONLY log messages of
|
138
138
|
level ERROR and higher (meaning ERROR and FATAL), with the `log.error_file` setting.
|
139
139
|
Then, if there's any lines in this error log file at all, you know something bad
|
140
140
|
happened, maybe your batch process needs to notify someone, or abort further
|
data/doc/settings.md
CHANGED
@@ -41,8 +41,8 @@ for commonly used settings, see `traject -h`.
|
|
41
41
|
* `log.level`: Log this level and above. Default 'info', set to eg 'debug' to get potentially more logging info,
|
42
42
|
or 'error' to get less. https://github.com/rudionrails/yell/wiki/101-setting-the-log-level
|
43
43
|
|
44
|
-
* `log.
|
45
|
-
log, every N records.
|
44
|
+
* `log.batch_size`: If set to a number N (or string representation), will output a progress line to INFO
|
45
|
+
log, every N records.
|
46
46
|
|
47
47
|
* `marc_source.type`: default 'binary'. Can also set to 'xml' or (not yet implemented todo) 'json'. Command line shortcut `-t`
|
48
48
|
|
data/lib/traject/command_line.rb
CHANGED
@@ -78,9 +78,13 @@ module Traject
|
|
78
78
|
result =
|
79
79
|
case options[:command]
|
80
80
|
when "process"
|
81
|
-
|
81
|
+
(io, filename) = get_input_io(self.remaining_argv)
|
82
|
+
indexer.settings['command_line.filename'] = filename if filename
|
83
|
+
indexer.process(io)
|
82
84
|
when "marcout"
|
83
|
-
|
85
|
+
(io, filename) = get_input_io(self.remaining_argv)
|
86
|
+
indexer.settings['command_line.filename'] = filename if filename
|
87
|
+
command_marcout!(io)
|
84
88
|
when "commit"
|
85
89
|
command_commit!
|
86
90
|
else
|
@@ -155,20 +159,23 @@ module Traject
|
|
155
159
|
#
|
156
160
|
# So for now we do just one file, or stdin if specified. Sorry!
|
157
161
|
|
162
|
+
filename = nil
|
158
163
|
if options[:stdin]
|
159
|
-
indexer.logger.info
|
164
|
+
indexer.logger.info("Reading from standard input")
|
160
165
|
io = $stdin
|
161
166
|
elsif argv.length > 1
|
162
167
|
self.console.puts "Sorry, traject can only handle one input file at a time right now. `#{argv}` Exiting..."
|
163
168
|
exit 1
|
164
169
|
elsif argv.length == 0
|
165
|
-
indexer.logger.warn "Warning, no file input given..."
|
166
170
|
io = File.open(File::NULL, 'r')
|
171
|
+
indexer.logger.info("Warning, no file input given. Use command-line argument '--stdin' to use standard input ")
|
167
172
|
else
|
168
|
-
indexer.logger.info "Reading from #{argv.first}"
|
169
173
|
io = File.open(argv.first, 'r')
|
174
|
+
filename = argv.first
|
175
|
+
indexer.logger.info "Reading from #{filename}"
|
170
176
|
end
|
171
|
-
|
177
|
+
|
178
|
+
return io, filename
|
172
179
|
end
|
173
180
|
|
174
181
|
def load_configuration_files!(my_indexer, conf_files)
|
@@ -246,6 +253,12 @@ module Traject
|
|
246
253
|
if options[:debug]
|
247
254
|
settings["log.level"] = "debug"
|
248
255
|
end
|
256
|
+
if options[:'debug-mode']
|
257
|
+
require 'traject/debug_writer'
|
258
|
+
settings["writer_class_name"] = "Traject::DebugWriter"
|
259
|
+
settings["log.level"] = "debug"
|
260
|
+
settings["processing_thread_pool"] = 0
|
261
|
+
end
|
249
262
|
if options[:writer]
|
250
263
|
settings["writer_class_name"] = options[:writer]
|
251
264
|
end
|
@@ -291,6 +304,7 @@ module Traject
|
|
291
304
|
on :x, "command", "alternate traject command: process (default); marcout", :argument => true, :default => "process"
|
292
305
|
|
293
306
|
on "stdin", "read input from stdin"
|
307
|
+
on "debug-mode", "debug logging, single threaded, output human readable hashes"
|
294
308
|
end
|
295
309
|
end
|
296
310
|
|
@@ -318,4 +332,4 @@ module Traject
|
|
318
332
|
|
319
333
|
|
320
334
|
end
|
321
|
-
end
|
335
|
+
end
|
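The `command_line.rb` hunks above change `get_input_io` to return both the IO and the filename, so the caller can record the filename in settings. A minimal sketch of that pattern in plain Ruby follows. `choose_input` and its parameters are hypothetical names used only for illustration, not traject API, and the sketch raises where the real command line prints a message and exits.

```ruby
# Hypothetical stand-in for the get_input_io logic in the diff above;
# choose_input and use_stdin are illustrative names, not traject API.
def choose_input(argv, use_stdin = false)
  filename = nil
  if use_stdin
    io = $stdin
  elsif argv.length > 1
    # The real command line prints a message and exits; this sketch raises.
    raise ArgumentError, "only one input file at a time: #{argv.inspect}"
  elsif argv.length == 0
    # No input given: open the null device, so no records are read.
    io = File.open(File::NULL, 'r')
  else
    filename = argv.first
    io = File.open(filename, 'r')
  end
  return io, filename
end

# Destructure the pair and record the filename only when there is one.
settings = {}
io, filename = choose_input([])
settings['command_line.filename'] = filename if filename
io.close
```

Returning the filename separately keeps the "no file" cases (stdin, null device) distinguishable from a real path without inspecting the IO object.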
data/lib/traject/indexer.rb
CHANGED
@@ -226,6 +226,7 @@ class Traject::Indexer
|
|
226
226
|
end
|
227
227
|
end
|
228
228
|
end
|
229
|
+
accumulator.compact!
|
229
230
|
(context.output_hash[context.field_name] ||= []).concat accumulator unless accumulator.empty?
|
230
231
|
context.field_name = nil
|
231
232
|
|
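The added `accumulator.compact!` call drops `nil` values before they reach the output hash. A small sketch of the effect, assuming a hypothetical `append_values` helper that wraps the two lines from the hunk above:

```ruby
# Hypothetical helper mirroring the compact!/concat lines in the hunk above.
def append_values(output_hash, field_name, accumulator)
  # Remove nils in place. compact! returns nil when nothing was removed,
  # so its return value is deliberately ignored here.
  accumulator.compact!
  (output_hash[field_name] ||= []).concat(accumulator) unless accumulator.empty?
  output_hash
end

append_values({}, "title", [nil, "A title", nil])
append_values({}, "title", [nil])
```

Because the emptiness check runs after compaction, an accumulator holding only `nil` values no longer creates the field at all.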
@@ -264,7 +265,7 @@ class Traject::Indexer
|
|
264
265
|
def log_mapping_errors(context, index_step, aProc)
|
265
266
|
begin
|
266
267
|
yield
|
267
|
-
rescue Exception => e
|
268
|
+
rescue Exception => e
|
268
269
|
msg = "Unexpected error on record id `#{id_string(context.source_record)}` at file position #{context.position}\n"
|
269
270
|
|
270
271
|
conf = context.field_name ? "to_field '#{context.field_name}'" : "each_record"
|
@@ -272,10 +273,14 @@ class Traject::Indexer
|
|
272
273
|
msg += " while executing #{conf} defined at #{index_step[:source_location]}\n"
|
273
274
|
msg += Traject::Util.exception_to_log_message(e)
|
274
275
|
|
275
|
-
logger.error msg
|
276
|
-
|
276
|
+
logger.error msg
|
277
|
+
begin
|
278
|
+
logger.debug "Record: " + context.source_record.to_s
|
279
|
+
rescue Exception => marc_to_s_exception
|
280
|
+
logger.debug "(Could not log record, #{marc_to_s_exception})"
|
281
|
+
end
|
277
282
|
|
278
|
-
raise e
|
283
|
+
raise e
|
279
284
|
end
|
280
285
|
end
|
281
286
|
|
@@ -293,14 +298,16 @@ class Traject::Indexer
|
|
293
298
|
|
294
299
|
count = 0
|
295
300
|
start_time = batch_start_time = Time.now
|
296
|
-
logger.
|
301
|
+
logger.debug "beginning Indexer#process with settings: #{settings.inspect}"
|
297
302
|
|
298
303
|
reader = self.reader!(io_stream)
|
299
304
|
writer = self.writer!
|
300
305
|
|
301
306
|
thread_pool = Traject::ThreadPool.new(settings["processing_thread_pool"].to_i)
|
302
307
|
|
303
|
-
logger.info " with reader: #{reader.class.name} and writer: #{writer.class.name}"
|
308
|
+
logger.info " Indexer with reader: #{reader.class.name} and writer: #{writer.class.name}"
|
309
|
+
|
310
|
+
log_batch_size = settings["log.batch_size"] && settings["log.batch_size"].to_i
|
304
311
|
|
305
312
|
reader.each do |record; position|
|
306
313
|
count += 1
|
@@ -315,8 +322,8 @@ class Traject::Indexer
|
|
315
322
|
$stderr.write "." if count % settings["solrj_writer.batch_size"] == 0
|
316
323
|
end
|
317
324
|
|
318
|
-
if
|
319
|
-
batch_rps =
|
325
|
+
if log_batch_size && (count % log_batch_size == 0)
|
326
|
+
batch_rps = log_batch_size / (Time.now - batch_start_time)
|
320
327
|
overall_rps = count / (Time.now - start_time)
|
321
328
|
logger.info "Traject::Indexer#process, read #{count} records at id:#{id_string(record)}; #{'%.0f' % batch_rps}/s this batch, #{'%.0f' % overall_rps}/s overall"
|
322
329
|
batch_start_time = Time.now
|
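The `log.batch_size` handling above can be read as two small calculations: whether a progress line is due, and the records-per-second rate for the batch just finished. A sketch under that reading, with hypothetical helper names `progress_due?` and `batch_rate`:

```ruby
# Hypothetical helper: is a progress line due at this record count?
# Follows the diff: the setting may be unset, a number, or a numeric string.
def progress_due?(settings, count)
  log_batch_size = settings["log.batch_size"] && settings["log.batch_size"].to_i
  !!(log_batch_size && (count % log_batch_size == 0))
end

# Hypothetical helper: records per second for the batch that just ended.
# Subtracting two Time objects yields a Float number of seconds.
def batch_rate(log_batch_size, batch_start_time, now = Time.now)
  log_batch_size / (now - batch_start_time)
end

progress_due?({ "log.batch_size" => "50" }, 100)
batch_rate(100, Time.at(0), Time.at(4))
```

One thing to keep in mind with this shape: a setting that converts to `0` (including a non-numeric string) would make the modulo raise `ZeroDivisionError`, so the setting is presumably expected to be a positive number.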
data/lib/traject/macros/marc21.rb
CHANGED
@@ -29,11 +29,12 @@ module Traject::Macros
|
|
29
29
|
#
|
30
30
|
# to_field("title"), extract_marc("245abcd", :trim_punctuation => true)
|
31
31
|
# to_field("id"), extract_marc("001", :first => true)
|
32
|
-
# to_field("geo"), extract_marc("040a", :
|
32
|
+
# to_field("geo"), extract_marc("040a", :separator => nil, :translation_map => "marc040")
|
33
33
|
def extract_marc(spec, options = {})
|
34
34
|
only_first = options.delete(:first)
|
35
35
|
trim_punctuation = options.delete(:trim_punctuation)
|
36
36
|
default_value = options.delete(:default)
|
37
|
+
deduplicate = options.delete(:deduplicate) || options.delete(:uniq)
|
37
38
|
|
38
39
|
# We create the TranslationMap and the MarcExtractor here
|
39
40
|
# on load, so the lambda can just refer to already created
|
@@ -62,10 +63,15 @@ module Traject::Macros
|
|
62
63
|
if trim_punctuation
|
63
64
|
accumulator.collect! {|s| Marc21.trim_punctuation(s)}
|
64
65
|
end
|
66
|
+
|
67
|
+
if deduplicate
|
68
|
+
accumulator.uniq!
|
69
|
+
end
|
65
70
|
|
66
71
|
if default_value && accumulator.empty?
|
67
72
|
accumulator << default_value
|
68
73
|
end
|
74
|
+
|
69
75
|
end
|
70
76
|
end
|
71
77
|
|
@@ -117,14 +123,14 @@ module Traject::Macros
|
|
117
123
|
# options
|
118
124
|
# [:from] default 100, only tags >= lexicographically
|
119
125
|
# [:to] default 899, only tags <= lexicographically
|
120
|
-
# [:
|
126
|
+
# [:separator] how to join subfields, default space, nil means don't join
|
121
127
|
#
|
122
128
|
# All fields in from-to must be marc DATA (not control fields), or weirdness
|
123
129
|
#
|
124
130
|
# Can always run this thing multiple times on the same field if you need
|
125
131
|
# non-contiguous ranges of fields.
|
126
132
|
def extract_all_marc_values(options = {})
|
127
|
-
options = {:from => "100", :to => "899", :
|
133
|
+
options = {:from => "100", :to => "899", :separator => ' '}.merge(options)
|
128
134
|
|
129
135
|
lambda do |record, accumulator, context|
|
130
136
|
record.each do |field|
|
@@ -132,8 +138,8 @@ module Traject::Macros
|
|
132
138
|
subfield_values = field.subfields.collect {|sf| sf.value}
|
133
139
|
next unless subfield_values.length > 0
|
134
140
|
|
135
|
-
if options[:
|
136
|
-
accumulator << subfield_values.join( options[:
|
141
|
+
if options[:separator]
|
142
|
+
accumulator << subfield_values.join( options[:separator])
|
137
143
|
else
|
138
144
|
accumulator.concat subfield_values
|
139
145
|
end
|
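The new `:deduplicate` option (with `:uniq` as an alias) is applied to the accumulator before the default value is considered. A plain-Ruby sketch of just those two steps, assuming a hypothetical `finish_accumulator` helper; in the real macro this logic runs inside the lambda built by `extract_marc`:

```ruby
# Hypothetical helper showing the option handling from the hunks above.
def finish_accumulator(accumulator, options = {})
  options = options.dup # the sketch avoids mutating the caller's hash
  default_value = options.delete(:default)
  deduplicate = options.delete(:deduplicate) || options.delete(:uniq)

  # Remove repeated values in place when requested.
  accumulator.uniq! if deduplicate

  # Fall back to the default only when nothing was extracted.
  accumulator << default_value if default_value && accumulator.empty?
  accumulator
end

finish_accumulator(["eng", "eng"], :deduplicate => true)
finish_accumulator(["eng", "eng"], :uniq => true)
```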
data/lib/traject/macros/marc21_semantics.rb
CHANGED
@@ -14,7 +14,7 @@ module Traject::Macros
|
|
14
14
|
# Extract OCLC numbers from, by default 035a's by known prefixes, then stripped
|
15
15
|
# just the num, and de-dup.
|
16
16
|
def oclcnum(extract_fields = "035a")
|
17
|
-
extractor = MarcExtractor.new(extract_fields, :
|
17
|
+
extractor = MarcExtractor.new(extract_fields, :separator => nil)
|
18
18
|
|
19
19
|
lambda do |record, accumulator|
|
20
20
|
list = extractor.extract(record).collect! do |o|
|
@@ -118,7 +118,7 @@ module Traject::Macros
|
|
118
118
|
def marc_languages(spec = "008[35-37]:041a:041d")
|
119
119
|
translation_map = Traject::TranslationMap.new("marc_languages")
|
120
120
|
|
121
|
-
extractor = MarcExtractor.new(spec, :
|
121
|
+
extractor = MarcExtractor.new(spec, :separator => nil)
|
122
122
|
|
123
123
|
lambda do |record, accumulator|
|
124
124
|
codes = extractor.collect_matching_lines(record) do |field, spec, extractor|
|
@@ -127,7 +127,7 @@ module Traject::Macros
|
|
127
127
|
else
|
128
128
|
extractor.collect_subfields(field, spec).collect do |value|
|
129
129
|
# sometimes multiple language codes are jammed together in one subfield, and
|
130
|
-
# we need to
|
130
|
+
# we need to separate ourselves. sigh.
|
131
131
|
unless value.length == 3
|
132
132
|
value = value.scan(/.{1,3}/) # split into an array of 3-length substrs
|
133
133
|
end
|
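The comment in the hunk above refers to language codes packed together in one subfield. A short sketch of the splitting step on its own, assuming a hypothetical `split_language_codes` helper; the rest of the `marc_languages` block is not shown in the diff and is not reproduced here:

```ruby
# Hypothetical helper isolating the scan step from the hunk above.
# A value that is already exactly three characters is kept as one code;
# anything else is cut into chunks of up to three characters.
def split_language_codes(value)
  value.length == 3 ? [value] : value.scan(/.{1,3}/)
end

split_language_codes("eng")
split_language_codes("engfre")
```

Since the pattern is `.{1,3}`, a trailing remainder shorter than three characters is kept as its own chunk rather than dropped.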
@@ -162,11 +162,11 @@ module Traject::Macros
|
|
162
162
|
# Takes marc 048ab instrument code, and translates it to human-displayable
|
163
163
|
# string. Takes first two chars of 048a or b, to translate (ignores numeric code)
|
164
164
|
#
|
165
|
-
# Pass in custom spec if you want just a or b, to
|
165
|
+
# Pass in custom spec if you want just a or b, to separate soloists or whatever.
|
166
166
|
def marc_instrumentation_humanized(spec = "048ab", options = {})
|
167
167
|
translation_map = Traject::TranslationMap.new(options[:translation_map] || "marc_instruments")
|
168
168
|
|
169
|
-
extractor = MarcExtractor.new(spec, :
|
169
|
+
extractor = MarcExtractor.new(spec, :separator => nil)
|
170
170
|
|
171
171
|
lambda do |record, accumulator|
|
172
172
|
values = extractor.extract(record)
|
@@ -189,7 +189,7 @@ module Traject::Macros
|
|
189
189
|
def marc_instrument_codes_normalized(spec = "048")
|
190
190
|
soloist_suffix = ".s"
|
191
191
|
|
192
|
-
extractor = MarcExtractor.new("048", :
|
192
|
+
extractor = MarcExtractor.new("048", :separator => nil)
|
193
193
|
|
194
194
|
return lambda do |record, accumulator|
|
195
195
|
accumulator.concat(
|
@@ -286,7 +286,7 @@ module Traject::Macros
|
|
286
286
|
end
|
287
287
|
# Okay, nothing from 008, try 260
|
288
288
|
if found_date.nil?
|
289
|
-
v260c = MarcExtractor.cached("260c", :
|
289
|
+
v260c = MarcExtractor.cached("260c", :separator => nil).extract(record).first
|
290
290
|
# just try to take the first four digits out of there, we're not going to try
|
291
291
|
# anything crazy.
|
292
292
|
if v260c =~ /(\d{4})/
|
@@ -320,7 +320,7 @@ module Traject::Macros
|
|
320
320
|
default_value = options.has_key?(:default) ? options[:default] : "Unknown"
|
321
321
|
translation_map = Traject::TranslationMap.new("lcc_top_level")
|
322
322
|
|
323
|
-
extractor = MarcExtractor.new(spec, :
|
323
|
+
extractor = MarcExtractor.new(spec, :separator => nil)
|
324
324
|
|
325
325
|
lambda do |record, accumulator|
|
326
326
|
candidates = extractor.extract(record)
|
@@ -352,8 +352,8 @@ module Traject::Macros
|
|
352
352
|
a_fields_spec = options[:geo_a_fields] || "651a:691a"
|
353
353
|
z_fields_spec = options[:geo_z_fields] || "600:610:611:630:648:650:654:655:656:690:651:691"
|
354
354
|
|
355
|
-
extractor_043a = MarcExtractor.new("043a", :
|
356
|
-
extractor_a_fields = MarcExtractor.new(a_fields_spec, :
|
355
|
+
extractor_043a = MarcExtractor.new("043a", :separator => nil)
|
356
|
+
extractor_a_fields = MarcExtractor.new(a_fields_spec, :separator => nil)
|
357
357
|
extractor_z_fields = MarcExtractor.new(z_fields_spec)
|
358
358
|
|
359
359
|
lambda do |record, accumulator|
|
@@ -403,7 +403,7 @@ module Traject::Macros
|
|
403
403
|
def marc_era_facet
|
404
404
|
ordinary_fields_spec = "600y:610y:611y:630y:648ay:650y:654y:656y:690y"
|
405
405
|
special_fields_spec = "651:691"
|
406
|
-
|
406
|
+
separator = ": "
|
407
407
|
|
408
408
|
extractor_ordinary_fields = MarcExtractor.new(ordinary_fields_spec)
|
409
409
|
extractor_special_fields = MarcExtractor.new(special_fields_spec)
|
@@ -423,7 +423,7 @@ module Traject::Macros
|
|
423
423
|
next unless sf.code == 'y'
|
424
424
|
if sf.value =~ /\A\s*.+,\s+(ca.\s+)?\d\d\d\d?(-\d\d\d\d?)?( B\.C\.)?[.,; ]*\Z/
|
425
425
|
# it's our pattern, add the $a in please
|
426
|
-
accumulator << "#{field['a']}#{
|
426
|
+
accumulator << "#{field['a']}#{separator}#{sf.value.sub(/\. *\Z/, '')}"
|
427
427
|
else
|
428
428
|
accumulator << sf.value.sub(/\. *\Z/, '')
|
429
429
|
end
|
data/lib/traject/marc_extractor.rb
CHANGED
@@ -7,7 +7,7 @@ module Traject
|
|
7
7
|
# Examples:
|
8
8
|
#
|
9
9
|
# array_of_stuff = MarcExtractor.new("001:245abc:700a").extract(marc_record)
|
10
|
-
# values = MarcExtractor.new("040a", :
|
10
|
+
# values = MarcExtractor.new("040a", :separator => nil).extract(marc_record)
|
11
11
|
#
|
12
12
|
#
|
13
13
|
# == Note on Performance and MarcExtractor creation and reuse
|
@@ -46,7 +46,7 @@ module Traject
|
|
46
46
|
#
|
47
47
|
# options:
|
48
48
|
#
|
49
|
-
# [:
|
49
|
+
# [:separator] default ' ' (space), what to use to separate
|
50
50
|
# subfield values when joining strings
|
51
51
|
#
|
52
52
|
# [:alternate_script] default :include, include linked 880s for tags
|
@@ -55,7 +55,7 @@ module Traject
|
|
55
55
|
# * :only => only include linked 880s, not original
|
56
56
|
def initialize(spec, options = {})
|
57
57
|
self.options = {
|
58
|
-
:
|
58
|
+
:separator => ' ',
|
59
59
|
:alternate_script => :include
|
60
60
|
}.merge(options)
|
61
61
|
|
@@ -93,7 +93,7 @@ module Traject
|
|
93
93
|
# although if you try hard enough you can surely find a way to do something
|
94
94
|
# you shouldn't.
|
95
95
|
#
|
96
|
-
# extractor = MarcExtractor.cached("245abc:700a", :
|
96
|
+
# extractor = MarcExtractor.cached("245abc:700a", :separator => nil)
|
97
97
|
def self.cached(*args)
|
98
98
|
cache = (Thread.current[:marc_extractor_cached] ||= Hash.new)
|
99
99
|
extractor = (cache[args] ||= begin
|
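`MarcExtractor.cached` keeps a per-thread cache keyed by the argument list. The body of the `begin` block is not visible in this diff, so the following is only a sketch of the visible pattern, using a hypothetical `SketchExtractor` class and cache key name rather than the real implementation:

```ruby
# Hypothetical stand-in class; only the caching pattern follows the diff.
class SketchExtractor
  attr_reader :spec, :options

  def initialize(spec, options = {})
    @spec = spec
    @options = { :separator => ' ' }.merge(options)
  end

  # Thread-local memoization keyed by the full argument array, so equal
  # arguments in the same thread return the same object.
  def self.cached(*args)
    cache = (Thread.current[:sketch_extractor_cached] ||= {})
    cache[args] ||= new(*args)
  end
end

first = SketchExtractor.cached("245abc", :separator => nil)
second = SketchExtractor.cached("245abc", :separator => nil)
first.equal?(second)
```

Keeping the cache in `Thread.current` means threads do not share extractor instances, which sidesteps locking at the cost of one instance per thread.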
@@ -118,7 +118,7 @@ module Traject
|
|
118
118
|
# to represent the specification.
|
119
119
|
#
|
120
120
|
# a String specification is a string (or array of strings) of form:
|
121
|
-
# {tag}{|indicators|}{subfields}
|
121
|
+
# {tag}{|indicators|}{subfields} separated by colons
|
122
122
|
# tag is three chars (usually but not necessarily numeric),
|
123
123
|
# indicators are optional two chars prefixed by hyphen,
|
124
124
|
# subfields are optional list of chars (alphanumeric)
|
@@ -239,7 +239,7 @@ module Traject
|
|
239
239
|
# Pass in a marc data field and a hash spec, returns
|
240
240
|
# an ARRAY of one or more strings, subfields extracted
|
241
241
|
# and processed per spec. Takes account of options such
|
242
|
-
# as :
|
242
|
+
# as :separator
|
243
243
|
#
|
244
244
|
# Always returns array, sometimes empty array.
|
245
245
|
def collect_subfields(field, spec)
|
@@ -249,7 +249,7 @@ module Traject
|
|
249
249
|
|
250
250
|
return subfields if subfields.empty? # empty array, just return it.
|
251
251
|
|
252
|
-
return options[:
|
252
|
+
return options[:separator] ? [ subfields.join( options[:separator]) ] : subfields
|
253
253
|
end
|
254
254
|
|
255
255
|
|
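The `:separator` option decides whether subfield values from one field are joined into a single string or returned individually. A plain-Ruby sketch of the return logic shown in `collect_subfields` above, assuming a hypothetical `join_subfields` helper that takes the already-extracted values:

```ruby
# Hypothetical helper; the default and the final line follow the diff above.
def join_subfields(subfields, options = {})
  options = { :separator => ' ' }.merge(options)
  return subfields if subfields.empty? # empty array, just return it

  # A non-nil separator yields one joined string; nil leaves values separate.
  options[:separator] ? [subfields.join(options[:separator])] : subfields
end

join_subfields(["Manufacturing consent :", "the political economy"])
join_subfields(["0140449132", "9780140449136"], :separator => nil)
```

Either way the result is an array, which lets callers concatenate it onto an accumulator without checking which mode was used.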
data/lib/traject/marc_reader.rb
CHANGED
@@ -16,8 +16,8 @@ require 'marc'
|
|
16
16
|
# ["marc_source.type"] serialization type. default 'binary'
|
17
17
|
# * "binary". Actual marc.
|
18
18
|
# * "xml", MarcXML
|
19
|
-
# * "json". (NOT YET IMPLEMENTED) The "marc-in-json" format, encoded as newline-
|
20
|
-
# json. A simplistic newline-
|
19
|
+
# * "json". (NOT YET IMPLEMENTED) The "marc-in-json" format, encoded as newline-separated
|
20
|
+
# json. A simplistic newline-separated json, with no comments
|
21
21
|
# allowed, and no unescaped internal newlines allowed in the json
|
22
22
|
# objects -- we just read line by line, and assume each line is a
|
23
23
|
# marc-in-json. http://dilettantes.code4lib.org/blog/2010/09/a-proposal-to-serialize-marc-in-json/
|
data/lib/traject/solrj_writer.rb
CHANGED
@@ -108,6 +108,8 @@ class Traject::SolrJWriter
|
|
108
108
|
@thread_pool = Traject::ThreadPool.new( @settings["solrj_writer.thread_pool"].to_i )
|
109
109
|
|
110
110
|
@debug_ascii_progress = (@settings["debug_ascii_progress"].to_s == "true")
|
111
|
+
|
112
|
+
logger.info(" SolrJWriter writing to '#{settings['solr.url']}'")
|
111
113
|
end
|
112
114
|
|
113
115
|
# Loads solrj if not already loaded. By loading all jars found
|
data/lib/traject/version.rb
CHANGED
data/test/indexer/macros_marc21_test.rb
CHANGED
@@ -56,6 +56,26 @@ describe "Traject::Macros::Marc21" do
|
|
56
56
|
|
57
57
|
assert_equal ["DEFAULT VALUE"], output["only_default"]
|
58
58
|
end
|
59
|
+
|
60
|
+
it "respects the :deduplicate option (and its alias 'uniq')" do
|
61
|
+
# Add a second 008
|
62
|
+
f = @record.fields('008').first
|
63
|
+
@record.append(f)
|
64
|
+
|
65
|
+
@indexer.instance_eval do
|
66
|
+
to_field "lang1", extract_marc('008[35-37]')
|
67
|
+
to_field "lang2", extract_marc('008[35-37]', :deduplicate=>true)
|
68
|
+
to_field "lang3", extract_marc('008[35-37]', :uniq=>true)
|
69
|
+
end
|
70
|
+
|
71
|
+
output = @indexer.map_record(@record)
|
72
|
+
assert_equal ["eng", "eng"], output['lang1']
|
73
|
+
assert_equal ["eng"], output['lang2']
|
74
|
+
assert_equal ["eng"], output['lang3']
|
75
|
+
|
76
|
+
end
|
77
|
+
|
78
|
+
|
59
79
|
|
60
80
|
it "Marc21::trim_punctuation class method" do
|
61
81
|
assert_equal "one two three", Marc21.trim_punctuation("one two three")
|
@@ -75,7 +95,7 @@ describe "Traject::Macros::Marc21" do
|
|
75
95
|
|
76
96
|
it "uses :translation_map" do
|
77
97
|
@indexer.instance_eval do
|
78
|
-
to_field "cataloging_agency", extract_marc("040a", :
|
98
|
+
to_field "cataloging_agency", extract_marc("040a", :separator => nil, :translation_map => "marc_040a_translate_test")
|
79
99
|
end
|
80
100
|
output = @indexer.map_record(@record)
|
81
101
|
|
data/test/marc_extractor_test.rb
CHANGED
@@ -173,17 +173,17 @@ describe "Traject::MarcExtractor" do
|
|
173
173
|
end
|
174
174
|
end
|
175
175
|
|
176
|
-
describe "
|
176
|
+
describe "separator argument" do
|
177
177
|
it "causes non-join when nil" do
|
178
178
|
parsed_spec = Traject::MarcExtractor.parse_string_spec("245")
|
179
|
-
values = Traject::MarcExtractor.new(parsed_spec, :
|
179
|
+
values = Traject::MarcExtractor.new(parsed_spec, :separator => nil).extract(@record)
|
180
180
|
|
181
181
|
assert_length 3, values
|
182
182
|
end
|
183
183
|
|
184
184
|
it "can be non-default" do
|
185
185
|
parsed_spec = Traject::MarcExtractor.parse_string_spec("245")
|
186
|
-
values = Traject::MarcExtractor.new(parsed_spec, :
|
186
|
+
values = Traject::MarcExtractor.new(parsed_spec, :separator => "!! ").extract(@record)
|
187
187
|
|
188
188
|
assert_length 1, values
|
189
189
|
assert_equal "Manufacturing consent :!! the political economy of the mass media /!! Edward S. Herman and Noam Chomsky ; with a new introduction by the authors.", values.first
|
@@ -288,13 +288,13 @@ describe "Traject::MarcExtractor" do
|
|
288
288
|
|
289
289
|
describe "MarcExtractor.cached" do
|
290
290
|
it "creates" do
|
291
|
-
ext = Traject::MarcExtractor.cached("245abc", :
|
291
|
+
ext = Traject::MarcExtractor.cached("245abc", :separator => nil)
|
292
292
|
assert_equal({"245"=>{:subfields=>["a", "b", "c"]}}, ext.spec_hash)
|
293
|
-
assert ext.options[:
|
293
|
+
assert ext.options[:separator].nil?, "extractor options[:separator] is nil"
|
294
294
|
end
|
295
295
|
it "caches" do
|
296
|
-
ext1 = Traject::MarcExtractor.cached("245abc", :
|
297
|
-
ext2 = Traject::MarcExtractor.cached("245abc", :
|
296
|
+
ext1 = Traject::MarcExtractor.cached("245abc", :separator => nil)
|
297
|
+
ext2 = Traject::MarcExtractor.cached("245abc", :separator => nil)
|
298
298
|
|
299
299
|
assert_same ext1, ext2
|
300
300
|
end
|
data/test/test_support/demo_config.rb
CHANGED
@@ -50,7 +50,7 @@ to_field "format", marc_formats
|
|
50
50
|
to_field "isbn_t", extract_marc("020a:773z:776z:534z:556z")
|
51
51
|
to_field "lccn", extract_marc("010a")
|
52
52
|
|
53
|
-
to_field "material_type_display", extract_marc("300a", :
|
53
|
+
to_field "material_type_display", extract_marc("300a", :separator => nil, :trim_punctuation => true)
|
54
54
|
|
55
55
|
to_field "title_t", extract_marc("245ak")
|
56
56
|
to_field "title1_t", extract_marc("245abk")
|
@@ -107,7 +107,7 @@ to_field "pub_date", marc_publication_date
|
|
107
107
|
# call numbers.
|
108
108
|
lcc_map = Traject::TranslationMap.new("lcc_top_level")
|
109
109
|
holdings_extractor = Traject::MarcExtractor.new("991:937")
|
110
|
-
sudoc_extractor = Traject::MarcExtractor.new("086a", :
|
110
|
+
sudoc_extractor = Traject::MarcExtractor.new("086a", :separator =>nil)
|
111
111
|
|
112
112
|
to_field "discipline_facet", marc_lcc_to_broad_category(:default => nil) do |record, accumulator|
|
113
113
|
# add in our local call numbers
|
@@ -147,8 +147,8 @@ end
|
|
147
147
|
to_field "instrumentation_facet", marc_instrumentation_humanized
|
148
148
|
to_field "instrumentation_code_unstem", marc_instrument_codes_normalized
|
149
149
|
|
150
|
-
to_field "issn", extract_marc("022a:022l:022y:773x:774x:776x", :
|
151
|
-
to_field "issn_related", extract_marc("490x:440x:800x:400x:410x:411x:810x:811x:830x:700x:710x:711x:730x:780x:785x:777x:543x:760x:762x:765x:767x:770x:772x:775x:786x:787x", :
|
150
|
+
to_field "issn", extract_marc("022a:022l:022y:773x:774x:776x", :separator => nil)
|
151
|
+
to_field "issn_related", extract_marc("490x:440x:800x:400x:410x:411x:810x:811x:830x:700x:710x:711x:730x:780x:785x:777x:543x:760x:762x:765x:767x:770x:772x:775x:786x:787x", :separator => nil)
|
152
152
|
|
153
153
|
to_field "oclcnum_t", oclcnum
|
154
154
|
|
data/test/test_support/test_data.utf8.mrc.gz
Binary file
|
metadata
CHANGED
@@ -2,14 +2,14 @@
|
|
2
2
|
name: traject
|
3
3
|
version: !ruby/object:Gem::Version
|
4
4
|
prerelease:
|
5
|
-
version: 0.13.
|
5
|
+
version: 0.13.2
|
6
6
|
platform: ruby
|
7
7
|
authors:
|
8
8
|
- Jonathan Rochkind
|
9
9
|
autorequire:
|
10
10
|
bindir: bin
|
11
11
|
cert_chain: []
|
12
|
-
date: 2013-09-
|
12
|
+
date: 2013-09-23 00:00:00.000000000 Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
15
|
name: marc
|
@@ -194,7 +194,7 @@ files:
|
|
194
194
|
- lib/traject/marc_extractor.rb
|
195
195
|
- lib/traject/marc_reader.rb
|
196
196
|
- lib/traject/mock_reader.rb
|
197
|
-
- lib/traject/
|
197
|
+
- lib/traject/null_writer.rb
|
198
198
|
- lib/traject/qualified_const_get.rb
|
199
199
|
- lib/traject/solrj_writer.rb
|
200
200
|
- lib/traject/thread_pool.rb
|
@@ -244,6 +244,7 @@ files:
|
|
244
244
|
- test/test_support/packed_041a_lang.marc
|
245
245
|
- test/test_support/test_data.utf8.marc.xml
|
246
246
|
- test/test_support/test_data.utf8.mrc
|
247
|
+
- test/test_support/test_data.utf8.mrc.gz
|
247
248
|
- test/test_support/the_business_ren.marc
|
248
249
|
- test/translation_map_test.rb
|
249
250
|
- test/translation_maps/bad_ruby.rb
|
@@ -345,6 +346,7 @@ test_files:
|
|
345
346
|
- test/test_support/packed_041a_lang.marc
|
346
347
|
- test/test_support/test_data.utf8.marc.xml
|
347
348
|
- test/test_support/test_data.utf8.mrc
|
349
|
+
- test/test_support/test_data.utf8.mrc.gz
|
348
350
|
- test/test_support/the_business_ren.marc
|
349
351
|
- test/translation_map_test.rb
|
350
352
|
- test/translation_maps/bad_ruby.rb
|