traject 0.13.1 → 0.13.2
- data/README.md +12 -4
- data/doc/batch_execution.md +2 -2
- data/doc/settings.md +2 -2
- data/lib/traject/command_line.rb +21 -7
- data/lib/traject/indexer.rb +15 -8
- data/lib/traject/macros/marc21.rb +11 -5
- data/lib/traject/macros/marc21_semantics.rb +12 -12
- data/lib/traject/marc_extractor.rb +7 -7
- data/lib/traject/marc_reader.rb +2 -2
- data/lib/traject/{mock_writer.rb → null_writer.rb} +1 -1
- data/lib/traject/solrj_writer.rb +2 -0
- data/lib/traject/version.rb +1 -1
- data/test/indexer/macros_marc21_test.rb +21 -1
- data/test/marc_extractor_test.rb +7 -7
- data/test/test_support/demo_config.rb +4 -4
- data/test/test_support/test_data.utf8.mrc.gz +0 -0
- metadata +5 -3
data/README.md
CHANGED
@@ -20,7 +20,7 @@ Existing tools for indexing Marc to Solr exist, and have served us well for many
|
|
20
20
|
logic, should be very easy. More sophisticated and even complex customization use cases should still be possible,
|
21
21
|
changing just the parts of traject you want to change.
|
22
22
|
* *Maintainable local logic*, including supporting sharing of reusable logic via ruby gems.
|
23
|
-
* *Maintainable understandable internal logic*; well-covered by tests, well-factored
|
23
|
+
* *Maintainable understandable internal logic*; well-covered by tests, well-factored separation of concerns,
|
24
24
|
easy for newcomer developers who know ruby to understand the codebase.
|
25
25
|
* *High performance*, using multi-threaded concurrency where appropriate to maximize throughput.
|
26
26
|
While it depends on your configuration and the size of your server(s), traject is likely higher
|
@@ -164,8 +164,12 @@ Other examples of the specification string, which can include multiple tag menti
|
|
164
164
|
|
165
165
|
# Instead of joining subfields from the same field
|
166
166
|
# into one string, joined by spaces, leave them
|
167
|
-
# each in
|
168
|
-
to_field "isbn", extract_marc("020az", :
|
167
|
+
# each in separate strings:
|
168
|
+
to_field "isbn", extract_marc("020az", :separator => nil)
|
169
|
+
|
170
|
+
# Make sure that you don't get any duplicates
|
171
|
+
# by passing in ":deduplicate => true"
|
172
|
+
to_field 'language008', extract_marc('008[35-37]', :deduplicate=>true)
|
169
173
|
~~~
|
170
174
|
|
171
175
|
The `extract_marc` function *by default* includes any linked
|
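The two options shown in this README hunk can be illustrated without traject itself. What follows is a minimal sketch in plain Ruby, assuming the subfield values of one field have already been pulled out; `join_subfields` is a hypothetical helper for illustration, not part of traject's API.

```ruby
# Hypothetical helper (not traject API): mimics how a separator option
# either joins one field's subfield values or leaves them separate.
def join_subfields(values, separator)
  return values if values.empty?
  separator ? [values.join(separator)] : values
end

# Made-up subfield values standing in for 020$a and 020$z.
isbn_subfields = ["0123456789", "9876543210"]

joined   = join_subfields(isbn_subfields, " ") # one string per field
separate = join_subfields(isbn_subfields, nil) # one string per subfield

# Deduplication, as with :deduplicate => true, amounts to dropping repeats.
languages        = ["eng", "eng"]
unique_languages = languages.uniq
```

With a separator, each field contributes a single joined string; with `nil`, each subfield value becomes its own output value, which is usually what you want for identifiers such as ISBNs.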
@@ -347,18 +351,22 @@ checking.
|
|
347
351
|
Use `-u` as a shortcut for `s solr.url=X`
|
348
352
|
|
349
353
|
traject -c conf_file.rb -u http://example.com/solr marc_file.mrc
|
354
|
+
|
355
|
+
Run `traject -h` to see the command line help screen listing all available options.
|
350
356
|
|
351
357
|
Also see `-I load_path` and `-G Gemfile` options under Extending With Your Own Code.
|
352
358
|
|
353
359
|
See also [Hints for batch and cronjob use](./doc/batch_execution.md) of traject.
|
354
360
|
|
361
|
+
|
362
|
+
|
355
363
|
## Extending With Your Own Code
|
356
364
|
|
357
365
|
Traject config files are full live ruby files, where you can do anything,
|
358
366
|
including declaring new classes, etc.
|
359
367
|
|
360
368
|
However, beyond limited trivial logic, you'll want to organize your
|
361
|
-
code reasonably into
|
369
|
+
code reasonably into separate files, not jam everything into config
|
362
370
|
files.
|
363
371
|
|
364
372
|
Traject wants to make sure it makes it convenient for you to do so,
|
data/doc/batch_execution.md
CHANGED
@@ -132,9 +132,9 @@ the command-line:
|
|
132
132
|
|
133
133
|
Or in a traject configuration file, setting the `log.file` configuration setting.
|
134
134
|
|
135
|
-
###
|
135
|
+
### separate error log
|
136
136
|
|
137
|
-
You can also
|
137
|
+
You can also separately have a duplicate log file created with ONLY log messages of
|
138
138
|
level ERROR and higher (meaning ERROR and FATAL), with the `log.error_file` setting.
|
139
139
|
Then, if there's any lines in this error log file at all, you know something bad
|
140
140
|
happened, maybe your batch process needs to notify someone, or abort further
|
data/doc/settings.md
CHANGED
@@ -41,8 +41,8 @@ for commonly used settings, see `traject -h`.
|
|
41
41
|
* `log.level`: Log this level and above. Default 'info', set to eg 'debug' to get potentially more logging info,
|
42
42
|
or 'error' to get less. https://github.com/rudionrails/yell/wiki/101-setting-the-log-level
|
43
43
|
|
44
|
-
* `log.
|
45
|
-
log, every N records.
|
44
|
+
* `log.batch_size`: If set to a number N (or string representation), will output a progress line to INFO
|
45
|
+
log, every N records.
|
46
46
|
|
47
47
|
* `marc_source.type`: default 'binary'. Can also set to 'xml' or (not yet implemented todo) 'json'. Command line shortcut `-t`
|
48
48
|
|
data/lib/traject/command_line.rb
CHANGED
@@ -78,9 +78,13 @@ module Traject
|
|
78
78
|
result =
|
79
79
|
case options[:command]
|
80
80
|
when "process"
|
81
|
-
|
81
|
+
(io, filename) = get_input_io(self.remaining_argv)
|
82
|
+
indexer.settings['command_line.filename'] = filename if filename
|
83
|
+
indexer.process(io)
|
82
84
|
when "marcout"
|
83
|
-
|
85
|
+
(io, filename) = get_input_io(self.remaining_argv)
|
86
|
+
indexer.settings['command_line.filename'] = filename if filename
|
87
|
+
command_marcout!(io)
|
84
88
|
when "commit"
|
85
89
|
command_commit!
|
86
90
|
else
|
@@ -155,20 +159,23 @@ module Traject
|
|
155
159
|
#
|
156
160
|
# So for now we do just one file, or stdin if specified. Sorry!
|
157
161
|
|
162
|
+
filename = nil
|
158
163
|
if options[:stdin]
|
159
|
-
indexer.logger.info
|
164
|
+
indexer.logger.info("Reading from standard input")
|
160
165
|
io = $stdin
|
161
166
|
elsif argv.length > 1
|
162
167
|
self.console.puts "Sorry, traject can only handle one input file at a time right now. `#{argv}` Exiting..."
|
163
168
|
exit 1
|
164
169
|
elsif argv.length == 0
|
165
|
-
indexer.logger.warn "Warning, no file input given..."
|
166
170
|
io = File.open(File::NULL, 'r')
|
171
|
+
indexer.logger.info("Warning, no file input given. Use command-line argument '--stdin' to use standard input ")
|
167
172
|
else
|
168
|
-
indexer.logger.info "Reading from #{argv.first}"
|
169
173
|
io = File.open(argv.first, 'r')
|
174
|
+
filename = argv.first
|
175
|
+
indexer.logger.info "Reading from #{filename}"
|
170
176
|
end
|
171
|
-
|
177
|
+
|
178
|
+
return io, filename
|
172
179
|
end
|
173
180
|
|
174
181
|
def load_configuration_files!(my_indexer, conf_files)
|
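The hunk above changes the input helper so that it returns both the IO object and the filename, which the caller then unpacks with parallel assignment. A simplified sketch of that pattern, assuming a hypothetical `open_input` method (the name and the `use_stdin` keyword are illustrative, and the "more than one file" branch is left out):

```ruby
# Hypothetical stand-in for the method in this hunk: returns the input IO
# together with the filename it came from (nil when there is no file).
def open_input(argv, use_stdin: false)
  filename = nil
  if use_stdin
    io = $stdin
  elsif argv.empty?
    # No input given: read from the null device, which yields no data.
    io = File.open(File::NULL, "r")
  else
    filename = argv.first
    io = File.open(filename, "r")
  end
  return io, filename
end

# Parallel assignment unpacks the two-element array.
io, filename = open_input([])
first_read = io.read
io.close
```

Returning `nil` for the filename lets the caller write `settings['command_line.filename'] = filename if filename` without a separate check for standard input.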
@@ -246,6 +253,12 @@ module Traject
|
|
246
253
|
if options[:debug]
|
247
254
|
settings["log.level"] = "debug"
|
248
255
|
end
|
256
|
+
if options[:'debug-mode']
|
257
|
+
require 'traject/debug_writer'
|
258
|
+
settings["writer_class_name"] = "Traject::DebugWriter"
|
259
|
+
settings["log.level"] = "debug"
|
260
|
+
settings["processing_thread_pool"] = 0
|
261
|
+
end
|
249
262
|
if options[:writer]
|
250
263
|
settings["writer_class_name"] = options[:writer]
|
251
264
|
end
|
@@ -291,6 +304,7 @@ module Traject
|
|
291
304
|
on :x, "command", "alternate traject command: process (default); marcout", :argument => true, :default => "process"
|
292
305
|
|
293
306
|
on "stdin", "read input from stdin"
|
307
|
+
on "debug-mode", "debug logging, single threaded, output human readable hashes"
|
294
308
|
end
|
295
309
|
end
|
296
310
|
|
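Because the `debug-mode` block runs before the `options[:writer]` check, an explicitly requested writer still overrides the debug writer. The sketch below reduces that ordering to a plain Hash; `apply_options` and the writer name `"MyWriter"` are hypothetical and used only for illustration.

```ruby
# Hypothetical reduction of the option handling above to a plain Hash.
def apply_options(options, settings = {})
  settings["log.level"] = "debug" if options[:debug]

  if options[:'debug-mode']
    settings["writer_class_name"]      = "Traject::DebugWriter"
    settings["log.level"]              = "debug"
    settings["processing_thread_pool"] = 0
  end

  # Handled after debug-mode, so an explicit writer takes precedence.
  settings["writer_class_name"] = options[:writer] if options[:writer]
  settings
end

debug_only  = apply_options({ :'debug-mode' => true })
with_writer = apply_options({ :'debug-mode' => true, :writer => "MyWriter" })
```

In the second call the log level and thread pool settings from debug mode remain in effect; only the writer class differs.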
@@ -318,4 +332,4 @@ module Traject
|
|
318
332
|
|
319
333
|
|
320
334
|
end
|
321
|
-
end
|
335
|
+
end
|
data/lib/traject/indexer.rb
CHANGED
@@ -226,6 +226,7 @@ class Traject::Indexer
|
|
226
226
|
end
|
227
227
|
end
|
228
228
|
end
|
229
|
+
accumulator.compact!
|
229
230
|
(context.output_hash[context.field_name] ||= []).concat accumulator unless accumulator.empty?
|
230
231
|
context.field_name = nil
|
231
232
|
|
@@ -264,7 +265,7 @@ class Traject::Indexer
|
|
264
265
|
def log_mapping_errors(context, index_step, aProc)
|
265
266
|
begin
|
266
267
|
yield
|
267
|
-
rescue Exception => e
|
268
|
+
rescue Exception => e
|
268
269
|
msg = "Unexpected error on record id `#{id_string(context.source_record)}` at file position #{context.position}\n"
|
269
270
|
|
270
271
|
conf = context.field_name ? "to_field '#{context.field_name}'" : "each_record"
|
@@ -272,10 +273,14 @@ class Traject::Indexer
|
|
272
273
|
msg += " while executing #{conf} defined at #{index_step[:source_location]}\n"
|
273
274
|
msg += Traject::Util.exception_to_log_message(e)
|
274
275
|
|
275
|
-
logger.error msg
|
276
|
-
|
276
|
+
logger.error msg
|
277
|
+
begin
|
278
|
+
logger.debug "Record: " + context.source_record.to_s
|
279
|
+
rescue Exception => marc_to_s_exception
|
280
|
+
logger.debug "(Could not log record, #{marc_to_s_exception})"
|
281
|
+
end
|
277
282
|
|
278
|
-
raise e
|
283
|
+
raise e
|
279
284
|
end
|
280
285
|
end
|
281
286
|
|
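The inner `begin`/`rescue` added in this hunk guards the debug logging itself, so that a record which cannot be serialized does not hide the original mapping error. A self-contained sketch of the same idea follows. It makes two substitutions that should be read as assumptions: it uses Ruby's standard `Logger` rather than traject's logger, and it rescues `StandardError` rather than `Exception`. `BrokenRecord` and `log_and_reraise` are hypothetical names.

```ruby
require "logger"
require "stringio"

# Hypothetical record whose to_s fails, standing in for a damaged MARC record.
class BrokenRecord
  def to_s
    raise "cannot serialize"
  end
end

def log_and_reraise(logger, record)
  yield
rescue StandardError => e
  logger.error "Unexpected error: #{e.message}"
  begin
    logger.debug "Record: " + record.to_s
  rescue StandardError => to_s_error
    # A failure while logging must not mask the original error.
    logger.debug "(Could not log record, #{to_s_error})"
  end
  raise e
end

log_output = StringIO.new
logger     = Logger.new(log_output)

raised = begin
  log_and_reraise(logger, BrokenRecord.new) { raise ArgumentError, "bad field" }
  nil
rescue ArgumentError => e
  e
end
```

The caller still sees the original `ArgumentError`, while the log records both the error and the fact that the record itself could not be printed.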
@@ -293,14 +298,16 @@ class Traject::Indexer
|
|
293
298
|
|
294
299
|
count = 0
|
295
300
|
start_time = batch_start_time = Time.now
|
296
|
-
logger.
|
301
|
+
logger.debug "beginning Indexer#process with settings: #{settings.inspect}"
|
297
302
|
|
298
303
|
reader = self.reader!(io_stream)
|
299
304
|
writer = self.writer!
|
300
305
|
|
301
306
|
thread_pool = Traject::ThreadPool.new(settings["processing_thread_pool"].to_i)
|
302
307
|
|
303
|
-
logger.info " with reader: #{reader.class.name} and writer: #{writer.class.name}"
|
308
|
+
logger.info " Indexer with reader: #{reader.class.name} and writer: #{writer.class.name}"
|
309
|
+
|
310
|
+
log_batch_size = settings["log.batch_size"] && settings["log.batch_size"].to_i
|
304
311
|
|
305
312
|
reader.each do |record; position|
|
306
313
|
count += 1
|
@@ -315,8 +322,8 @@ class Traject::Indexer
|
|
315
322
|
$stderr.write "." if count % settings["solrj_writer.batch_size"] == 0
|
316
323
|
end
|
317
324
|
|
318
|
-
if
|
319
|
-
batch_rps =
|
325
|
+
if log_batch_size && (count % log_batch_size == 0)
|
326
|
+
batch_rps = log_batch_size / (Time.now - batch_start_time)
|
320
327
|
overall_rps = count / (Time.now - start_time)
|
321
328
|
logger.info "Traject::Indexer#process, read #{count} records at id:#{id_string(record)}; #{'%.0f' % batch_rps}/s this batch, #{'%.0f' % overall_rps}/s overall"
|
322
329
|
batch_start_time = Time.now
|
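The progress line is only emitted when `log.batch_size` is set, and the per-batch rate is the batch size divided by the time elapsed since the previous progress line. A small sketch of that arithmetic, with fixed timestamps so the numbers are reproducible; `progress_rates` is a hypothetical helper and the record counts are made up.

```ruby
# Hypothetical helper: records-per-second figures for a progress line,
# following the arithmetic in the hunk above.
def progress_rates(count, log_batch_size, start_time, batch_start_time, now)
  batch_rps   = log_batch_size / (now - batch_start_time)
  overall_rps = count / (now - start_time)
  [batch_rps, overall_rps]
end

# A setting may arrive as a String, hence to_i; nil means "no progress lines".
setting        = "1000"
log_batch_size = setting && setting.to_i

start_time       = Time.at(0)
batch_start_time = Time.at(8)
now              = Time.at(10)

batch_rps, overall_rps =
  progress_rates(3000, log_batch_size, start_time, batch_start_time, now)

message = "read 3000 records; #{'%.0f' % batch_rps}/s this batch, " \
          "#{'%.0f' % overall_rps}/s overall"
```

Subtracting two `Time` objects yields a Float number of seconds, so the integer division pitfall does not arise here.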
data/lib/traject/macros/marc21.rb
CHANGED
@@ -29,11 +29,12 @@ module Traject::Macros
|
|
29
29
|
#
|
30
30
|
# to_field("title"), extract_marc("245abcd", :trim_punctuation => true)
|
31
31
|
# to_field("id"), extract_marc("001", :first => true)
|
32
|
-
# to_field("geo"), extract_marc("040a", :
|
32
|
+
# to_field("geo"), extract_marc("040a", :separator => nil, :translation_map => "marc040")
|
33
33
|
def extract_marc(spec, options = {})
|
34
34
|
only_first = options.delete(:first)
|
35
35
|
trim_punctuation = options.delete(:trim_punctuation)
|
36
36
|
default_value = options.delete(:default)
|
37
|
+
deduplicate = options.delete(:deduplicate) || options.delete(:uniq)
|
37
38
|
|
38
39
|
# We create the TranslationMap and the MarcExtractor here
|
39
40
|
# on load, so the lambda can just refer to already created
|
@@ -62,10 +63,15 @@ module Traject::Macros
|
|
62
63
|
if trim_punctuation
|
63
64
|
accumulator.collect! {|s| Marc21.trim_punctuation(s)}
|
64
65
|
end
|
66
|
+
|
67
|
+
if deduplicate
|
68
|
+
accumulator.uniq!
|
69
|
+
end
|
65
70
|
|
66
71
|
if default_value && accumulator.empty?
|
67
72
|
accumulator << default_value
|
68
73
|
end
|
74
|
+
|
69
75
|
end
|
70
76
|
end
|
71
77
|
|
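The `:deduplicate` option and its `:uniq` alias are read with a single `||` expression. The sketch below isolates that expression in a hypothetical `take_deduplicate` method to show how it behaves on a plain options Hash. One detail worth noticing is that `||` short-circuits, so when both keys are supplied only `:deduplicate` is removed from the Hash; whether a leftover `:uniq` key matters downstream is not something this diff shows.

```ruby
# Same expression as in the hunk: accept :deduplicate or its alias :uniq.
def take_deduplicate(options)
  options.delete(:deduplicate) || options.delete(:uniq)
end

opts_alias = { :uniq => true, :first => true }
flag_alias = take_deduplicate(opts_alias)

# Because || short-circuits, :uniq stays in the hash when both are given.
opts_both = { :deduplicate => true, :uniq => true }
flag_both = take_deduplicate(opts_both)

# Applying the flag: drop repeated values in place.
accumulator = ["eng", "eng", "fre"]
accumulator.uniq! if flag_alias
```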
@@ -117,14 +123,14 @@ module Traject::Macros
|
|
117
123
|
# options
|
118
124
|
# [:from] default 100, only tags >= lexicographically
|
119
125
|
# [:to] default 899, only tags <= lexicographically
|
120
|
-
# [:
|
126
|
+
# [:separator] how to join subfields, default space, nil means don't join
|
121
127
|
#
|
122
128
|
# All fields in from-to must be marc DATA (not control fields), or weirdness
|
123
129
|
#
|
124
130
|
# Can always run this thing multiple times on the same field if you need
|
125
131
|
# non-contiguous ranges of fields.
|
126
132
|
def extract_all_marc_values(options = {})
|
127
|
-
options = {:from => "100", :to => "899", :
|
133
|
+
options = {:from => "100", :to => "899", :separator => ' '}.merge(options)
|
128
134
|
|
129
135
|
lambda do |record, accumulator, context|
|
130
136
|
record.each do |field|
|
@@ -132,8 +138,8 @@ module Traject::Macros
|
|
132
138
|
subfield_values = field.subfields.collect {|sf| sf.value}
|
133
139
|
next unless subfield_values.length > 0
|
134
140
|
|
135
|
-
if options[:
|
136
|
-
accumulator << subfield_values.join( options[:
|
141
|
+
if options[:separator]
|
142
|
+
accumulator << subfield_values.join( options[:separator])
|
137
143
|
else
|
138
144
|
accumulator.concat subfield_values
|
139
145
|
end
|
data/lib/traject/macros/marc21_semantics.rb
CHANGED
@@ -14,7 +14,7 @@ module Traject::Macros
|
|
14
14
|
# Extract OCLC numbers from, by default 035a's by known prefixes, then stripped
|
15
15
|
# just the num, and de-dup.
|
16
16
|
def oclcnum(extract_fields = "035a")
|
17
|
-
extractor = MarcExtractor.new(extract_fields, :
|
17
|
+
extractor = MarcExtractor.new(extract_fields, :separator => nil)
|
18
18
|
|
19
19
|
lambda do |record, accumulator|
|
20
20
|
list = extractor.extract(record).collect! do |o|
|
@@ -118,7 +118,7 @@ module Traject::Macros
|
|
118
118
|
def marc_languages(spec = "008[35-37]:041a:041d")
|
119
119
|
translation_map = Traject::TranslationMap.new("marc_languages")
|
120
120
|
|
121
|
-
extractor = MarcExtractor.new(spec, :
|
121
|
+
extractor = MarcExtractor.new(spec, :separator => nil)
|
122
122
|
|
123
123
|
lambda do |record, accumulator|
|
124
124
|
codes = extractor.collect_matching_lines(record) do |field, spec, extractor|
|
@@ -127,7 +127,7 @@ module Traject::Macros
|
|
127
127
|
else
|
128
128
|
extractor.collect_subfields(field, spec).collect do |value|
|
129
129
|
# sometimes multiple language codes are jammed together in one subfield, and
|
130
|
-
# we need to
|
130
|
+
# we need to separate ourselves. sigh.
|
131
131
|
unless value.length == 3
|
132
132
|
value = value.scan(/.{1,3}/) # split into an array of 3-length substrs
|
133
133
|
end
|
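The `scan(/.{1,3}/)` call in this hunk is what separates language codes that were packed into one subfield. A short standalone sketch, using a hypothetical `split_language_codes` wrapper and made-up input values:

```ruby
# Split a subfield that packs several 3-letter language codes together.
def split_language_codes(value)
  value.length == 3 ? [value] : value.scan(/.{1,3}/)
end

single = split_language_codes("eng")
packed = split_language_codes("engfreger")
ragged = split_language_codes("engfr") # a trailing partial code is kept as-is
```

Because the pattern is `{1,3}` rather than `{3}`, a value whose length is not a multiple of three still yields its leftover characters as a final, shorter element.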
@@ -162,11 +162,11 @@ module Traject::Macros
|
|
162
162
|
# Takes marc 048ab instrument code, and translates it to human-displayable
|
163
163
|
# string. Takes first two chars of 048a or b, to translate (ignores numeric code)
|
164
164
|
#
|
165
|
-
# Pass in custom spec if you want just a or b, to
|
165
|
+
# Pass in custom spec if you want just a or b, to separate soloists or whatever.
|
166
166
|
def marc_instrumentation_humanized(spec = "048ab", options = {})
|
167
167
|
translation_map = Traject::TranslationMap.new(options[:translation_map] || "marc_instruments")
|
168
168
|
|
169
|
-
extractor = MarcExtractor.new(spec, :
|
169
|
+
extractor = MarcExtractor.new(spec, :separator => nil)
|
170
170
|
|
171
171
|
lambda do |record, accumulator|
|
172
172
|
values = extractor.extract(record)
|
@@ -189,7 +189,7 @@ module Traject::Macros
|
|
189
189
|
def marc_instrument_codes_normalized(spec = "048")
|
190
190
|
soloist_suffix = ".s"
|
191
191
|
|
192
|
-
extractor = MarcExtractor.new("048", :
|
192
|
+
extractor = MarcExtractor.new("048", :separator => nil)
|
193
193
|
|
194
194
|
return lambda do |record, accumulator|
|
195
195
|
accumulator.concat(
|
@@ -286,7 +286,7 @@ module Traject::Macros
|
|
286
286
|
end
|
287
287
|
# Okay, nothing from 008, try 260
|
288
288
|
if found_date.nil?
|
289
|
-
v260c = MarcExtractor.cached("260c", :
|
289
|
+
v260c = MarcExtractor.cached("260c", :separator => nil).extract(record).first
|
290
290
|
# just try to take the first four digits out of there, we're not going to try
|
291
291
|
# anything crazy.
|
292
292
|
if v260c =~ /(\d{4})/
|
@@ -320,7 +320,7 @@ module Traject::Macros
|
|
320
320
|
default_value = options.has_key?(:default) ? options[:default] : "Unknown"
|
321
321
|
translation_map = Traject::TranslationMap.new("lcc_top_level")
|
322
322
|
|
323
|
-
extractor = MarcExtractor.new(spec, :
|
323
|
+
extractor = MarcExtractor.new(spec, :separator => nil)
|
324
324
|
|
325
325
|
lambda do |record, accumulator|
|
326
326
|
candidates = extractor.extract(record)
|
@@ -352,8 +352,8 @@ module Traject::Macros
|
|
352
352
|
a_fields_spec = options[:geo_a_fields] || "651a:691a"
|
353
353
|
z_fields_spec = options[:geo_z_fields] || "600:610:611:630:648:650:654:655:656:690:651:691"
|
354
354
|
|
355
|
-
extractor_043a = MarcExtractor.new("043a", :
|
356
|
-
extractor_a_fields = MarcExtractor.new(a_fields_spec, :
|
355
|
+
extractor_043a = MarcExtractor.new("043a", :separator => nil)
|
356
|
+
extractor_a_fields = MarcExtractor.new(a_fields_spec, :separator => nil)
|
357
357
|
extractor_z_fields = MarcExtractor.new(z_fields_spec)
|
358
358
|
|
359
359
|
lambda do |record, accumulator|
|
@@ -403,7 +403,7 @@ module Traject::Macros
|
|
403
403
|
def marc_era_facet
|
404
404
|
ordinary_fields_spec = "600y:610y:611y:630y:648ay:650y:654y:656y:690y"
|
405
405
|
special_fields_spec = "651:691"
|
406
|
-
|
406
|
+
separator = ": "
|
407
407
|
|
408
408
|
extractor_ordinary_fields = MarcExtractor.new(ordinary_fields_spec)
|
409
409
|
extractor_special_fields = MarcExtractor.new(special_fields_spec)
|
@@ -423,7 +423,7 @@ module Traject::Macros
|
|
423
423
|
next unless sf.code == 'y'
|
424
424
|
if sf.value =~ /\A\s*.+,\s+(ca.\s+)?\d\d\d\d?(-\d\d\d\d?)?( B\.C\.)?[.,; ]*\Z/
|
425
425
|
# it's our pattern, add the $a in please
|
426
|
-
accumulator << "#{field['a']}#{
|
426
|
+
accumulator << "#{field['a']}#{separator}#{sf.value.sub(/\. *\Z/, '')}"
|
427
427
|
else
|
428
428
|
accumulator << sf.value.sub(/\. *\Z/, '')
|
429
429
|
end
|
data/lib/traject/marc_extractor.rb
CHANGED
@@ -7,7 +7,7 @@ module Traject
|
|
7
7
|
# Examples:
|
8
8
|
#
|
9
9
|
# array_of_stuff = MarcExtractor.new("001:245abc:700a").extract(marc_record)
|
10
|
-
# values = MarcExtractor.new("040a", :
|
10
|
+
# values = MarcExtractor.new("040a", :separator => nil).extract(marc_record)
|
11
11
|
#
|
12
12
|
#
|
13
13
|
# == Note on Performance and MarcExtractor creation and reuse
|
@@ -46,7 +46,7 @@ module Traject
|
|
46
46
|
#
|
47
47
|
# options:
|
48
48
|
#
|
49
|
-
# [:
|
49
|
+
# [:separator] default ' ' (space), what to use to separate
|
50
50
|
# subfield values when joining strings
|
51
51
|
#
|
52
52
|
# [:alternate_script] default :include, include linked 880s for tags
|
@@ -55,7 +55,7 @@ module Traject
|
|
55
55
|
# * :only => only include linked 880s, not original
|
56
56
|
def initialize(spec, options = {})
|
57
57
|
self.options = {
|
58
|
-
:
|
58
|
+
:separator => ' ',
|
59
59
|
:alternate_script => :include
|
60
60
|
}.merge(options)
|
61
61
|
|
@@ -93,7 +93,7 @@ module Traject
|
|
93
93
|
# although if you try hard enough you can surely find a way to do something
|
94
94
|
# you shouldn't.
|
95
95
|
#
|
96
|
-
# extractor = MarcExtractor.cached("245abc:700a", :
|
96
|
+
# extractor = MarcExtractor.cached("245abc:700a", :separator => nil)
|
97
97
|
def self.cached(*args)
|
98
98
|
cache = (Thread.current[:marc_extractor_cached] ||= Hash.new)
|
99
99
|
extractor = (cache[args] ||= begin
|
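The `cached` method memoizes extractors in a Hash stored on the current thread and keyed by the full argument list. The following sketch shows that pattern in isolation; the `Extractor` class is a hypothetical stand-in, and the body of the cache miss (a plain `new(*args)`) is an assumption, since the diff context cuts off before that part of the method.

```ruby
# Hypothetical stand-in for an object that is expensive to construct.
class Extractor
  attr_reader :spec, :options

  def initialize(spec, options = {})
    @spec    = spec
    @options = options
  end

  # One cache Hash per thread, keyed by the full argument list.
  def self.cached(*args)
    cache = (Thread.current[:extractor_cached] ||= {})
    cache[args] ||= new(*args)
  end
end

first  = Extractor.cached("245abc", { :separator => nil })
second = Extractor.cached("245abc", { :separator => nil })

# A different thread has its own cache and therefore its own instance.
other_thread =
  Thread.new { Extractor.cached("245abc", { :separator => nil }) }.value
```

Keeping the cache per thread avoids any locking, at the cost of building one instance per thread for each distinct argument list.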
@@ -118,7 +118,7 @@ module Traject
|
|
118
118
|
# to represent the specification.
|
119
119
|
#
|
120
120
|
# a String specification is a string (or array of strings) of form:
|
121
|
-
# {tag}{|indicators|}{subfields}
|
121
|
+
# {tag}{|indicators|}{subfields} separated by colons
|
122
122
|
# tag is three chars (usually but not neccesarily numeric),
|
123
123
|
# indicators are optional two chars prefixed by hyphen,
|
124
124
|
# subfields are optional list of chars (alphanumeric)
|
@@ -239,7 +239,7 @@ module Traject
|
|
239
239
|
# Pass in a marc data field and a hash spec, returns
|
240
240
|
# an ARRAY of one or more strings, subfields extracted
|
241
241
|
# and processed per spec. Takes account of options such
|
242
|
-
# as :
|
242
|
+
# as :separator
|
243
243
|
#
|
244
244
|
# Always returns array, sometimes empty array.
|
245
245
|
def collect_subfields(field, spec)
|
@@ -249,7 +249,7 @@ module Traject
|
|
249
249
|
|
250
250
|
return subfields if subfields.empty? # empty array, just return it.
|
251
251
|
|
252
|
-
return options[:
|
252
|
+
return options[:separator] ? [ subfields.join( options[:separator]) ] : subfields
|
253
253
|
end
|
254
254
|
|
255
255
|
|
data/lib/traject/marc_reader.rb
CHANGED
@@ -16,8 +16,8 @@ require 'marc'
|
|
16
16
|
# ["marc_source.type"] serialization type. default 'binary'
|
17
17
|
# * "binary". Actual marc.
|
18
18
|
# * "xml", MarcXML
|
19
|
-
# * "json". (NOT YET IMPLEMENTED) The "marc-in-json" format, encoded as newline-
|
20
|
-
# json. A simplistic newline-
|
19
|
+
# * "json". (NOT YET IMPLEMENTED) The "marc-in-json" format, encoded as newline-separated
|
20
|
+
# json. A simplistic newline-separated json, with no comments
|
21
21
|
# allowed, and no unescpaed internal newlines allowed in the json
|
22
22
|
# objects -- we just read line by line, and assume each line is a
|
23
23
|
# marc-in-json. http://dilettantes.code4lib.org/blog/2010/09/a-proposal-to-serialize-marc-in-json/
|
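The comment above describes the "json" source type as newline-separated JSON and marks it as not yet implemented. Purely as an illustration of the format being described, here is a sketch of reading one JSON object per line with Ruby's standard `json` library. `each_json_line` is a hypothetical method, not traject code, and the two sample records are made up.

```ruby
require "json"
require "stringio"

# Hypothetical reader: one JSON object per line, no comments,
# no unescaped newlines inside an object.
def each_json_line(io)
  return enum_for(:each_json_line, io) unless block_given?

  io.each_line do |line|
    line = line.strip
    next if line.empty?
    yield JSON.parse(line)
  end
end

# Made-up sample input in the marc-in-json shape (leader plus fields).
input = StringIO.new(<<~NDJSON)
  {"leader":"00000nam a2200000 a 4500","fields":[{"001":"rec1"}]}
  {"leader":"00000nam a2200000 a 4500","fields":[{"001":"rec2"}]}
NDJSON

records = each_json_line(input).to_a
ids     = records.map { |record| record["fields"].first["001"] }
```

Reading line by line keeps memory use flat for large files, which is the reason the format forbids newlines inside a record.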
data/lib/traject/solrj_writer.rb
CHANGED
@@ -108,6 +108,8 @@ class Traject::SolrJWriter
|
|
108
108
|
@thread_pool = Traject::ThreadPool.new( @settings["solrj_writer.thread_pool"].to_i )
|
109
109
|
|
110
110
|
@debug_ascii_progress = (@settings["debug_ascii_progress"].to_s == "true")
|
111
|
+
|
112
|
+
logger.info(" SolrJWriter writing to '#{settings['solr.url']}'")
|
111
113
|
end
|
112
114
|
|
113
115
|
# Loads solrj if not already loaded. By loading all jars found
|
data/lib/traject/version.rb
CHANGED
data/test/indexer/macros_marc21_test.rb
CHANGED
@@ -56,6 +56,26 @@ describe "Traject::Macros::Marc21" do
|
|
56
56
|
|
57
57
|
assert_equal ["DEFAULT VALUE"], output["only_default"]
|
58
58
|
end
|
59
|
+
|
60
|
+
it "respects the :deduplicate option (and its alias 'uniq')" do
|
61
|
+
# Add a second 008
|
62
|
+
f = @record.fields('008').first
|
63
|
+
@record.append(f)
|
64
|
+
|
65
|
+
@indexer.instance_eval do
|
66
|
+
to_field "lang1", extract_marc('008[35-37]')
|
67
|
+
to_field "lang2", extract_marc('008[35-37]', :deduplicate=>true)
|
68
|
+
to_field "lang3", extract_marc('008[35-37]', :uniq=>true)
|
69
|
+
end
|
70
|
+
|
71
|
+
output = @indexer.map_record(@record)
|
72
|
+
assert_equal ["eng", "eng"], output['lang1']
|
73
|
+
assert_equal ["eng"], output['lang2']
|
74
|
+
assert_equal ["eng"], output['lang3']
|
75
|
+
|
76
|
+
end
|
77
|
+
|
78
|
+
|
59
79
|
|
60
80
|
it "Marc21::trim_punctuation class method" do
|
61
81
|
assert_equal "one two three", Marc21.trim_punctuation("one two three")
|
@@ -75,7 +95,7 @@ describe "Traject::Macros::Marc21" do
|
|
75
95
|
|
76
96
|
it "uses :translation_map" do
|
77
97
|
@indexer.instance_eval do
|
78
|
-
to_field "cataloging_agency", extract_marc("040a", :
|
98
|
+
to_field "cataloging_agency", extract_marc("040a", :separator => nil, :translation_map => "marc_040a_translate_test")
|
79
99
|
end
|
80
100
|
output = @indexer.map_record(@record)
|
81
101
|
|
data/test/marc_extractor_test.rb
CHANGED
@@ -173,17 +173,17 @@ describe "Traject::MarcExtractor" do
|
|
173
173
|
end
|
174
174
|
end
|
175
175
|
|
176
|
-
describe "
|
176
|
+
describe "separator argument" do
|
177
177
|
it "causes non-join when nil" do
|
178
178
|
parsed_spec = Traject::MarcExtractor.parse_string_spec("245")
|
179
|
-
values = Traject::MarcExtractor.new(parsed_spec, :
|
179
|
+
values = Traject::MarcExtractor.new(parsed_spec, :separator => nil).extract(@record)
|
180
180
|
|
181
181
|
assert_length 3, values
|
182
182
|
end
|
183
183
|
|
184
184
|
it "can be non-default" do
|
185
185
|
parsed_spec = Traject::MarcExtractor.parse_string_spec("245")
|
186
|
-
values = Traject::MarcExtractor.new(parsed_spec, :
|
186
|
+
values = Traject::MarcExtractor.new(parsed_spec, :separator => "!! ").extract(@record)
|
187
187
|
|
188
188
|
assert_length 1, values
|
189
189
|
assert_equal "Manufacturing consent :!! the political economy of the mass media /!! Edward S. Herman and Noam Chomsky ; with a new introduction by the authors.", values.first
|
@@ -288,13 +288,13 @@ describe "Traject::MarcExtractor" do
|
|
288
288
|
|
289
289
|
describe "MarcExtractor.cached" do
|
290
290
|
it "creates" do
|
291
|
-
ext = Traject::MarcExtractor.cached("245abc", :
|
291
|
+
ext = Traject::MarcExtractor.cached("245abc", :separator => nil)
|
292
292
|
assert_equal({"245"=>{:subfields=>["a", "b", "c"]}}, ext.spec_hash)
|
293
|
-
assert ext.options[:
|
293
|
+
assert ext.options[:separator].nil?, "extractor options[:separator] is nil"
|
294
294
|
end
|
295
295
|
it "caches" do
|
296
|
-
ext1 = Traject::MarcExtractor.cached("245abc", :
|
297
|
-
ext2 = Traject::MarcExtractor.cached("245abc", :
|
296
|
+
ext1 = Traject::MarcExtractor.cached("245abc", :separator => nil)
|
297
|
+
ext2 = Traject::MarcExtractor.cached("245abc", :separator => nil)
|
298
298
|
|
299
299
|
assert_same ext1, ext2
|
300
300
|
end
|
data/test/test_support/demo_config.rb
CHANGED
@@ -50,7 +50,7 @@ to_field "format", marc_formats
|
|
50
50
|
to_field "isbn_t", extract_marc("020a:773z:776z:534z:556z")
|
51
51
|
to_field "lccn", extract_marc("010a")
|
52
52
|
|
53
|
-
to_field "material_type_display", extract_marc("300a", :
|
53
|
+
to_field "material_type_display", extract_marc("300a", :separator => nil, :trim_punctuation => true)
|
54
54
|
|
55
55
|
to_field "title_t", extract_marc("245ak")
|
56
56
|
to_field "title1_t", extract_marc("245abk")
|
@@ -107,7 +107,7 @@ to_field "pub_date", marc_publication_date
|
|
107
107
|
# call numbers.
|
108
108
|
lcc_map = Traject::TranslationMap.new("lcc_top_level")
|
109
109
|
holdings_extractor = Traject::MarcExtractor.new("991:937")
|
110
|
-
sudoc_extractor = Traject::MarcExtractor.new("086a", :
|
110
|
+
sudoc_extractor = Traject::MarcExtractor.new("086a", :separator =>nil)
|
111
111
|
|
112
112
|
to_field "discipline_facet", marc_lcc_to_broad_category(:default => nil) do |record, accumulator|
|
113
113
|
# add in our local call numbers
|
@@ -147,8 +147,8 @@ end
|
|
147
147
|
to_field "instrumentation_facet", marc_instrumentation_humanized
|
148
148
|
to_field "instrumentation_code_unstem", marc_instrument_codes_normalized
|
149
149
|
|
150
|
-
to_field "issn", extract_marc("022a:022l:022y:773x:774x:776x", :
|
151
|
-
to_field "issn_related", extract_marc("490x:440x:800x:400x:410x:411x:810x:811x:830x:700x:710x:711x:730x:780x:785x:777x:543x:760x:762x:765x:767x:770x:772x:775x:786x:787x", :
|
150
|
+
to_field "issn", extract_marc("022a:022l:022y:773x:774x:776x", :separator => nil)
|
151
|
+
to_field "issn_related", extract_marc("490x:440x:800x:400x:410x:411x:810x:811x:830x:700x:710x:711x:730x:780x:785x:777x:543x:760x:762x:765x:767x:770x:772x:775x:786x:787x", :separator => nil)
|
152
152
|
|
153
153
|
to_field "oclcnum_t", oclcnum
|
154
154
|
|
data/test/test_support/test_data.utf8.mrc.gz
Binary file
|
metadata
CHANGED
@@ -2,14 +2,14 @@
|
|
2
2
|
name: traject
|
3
3
|
version: !ruby/object:Gem::Version
|
4
4
|
prerelease:
|
5
|
-
version: 0.13.
|
5
|
+
version: 0.13.2
|
6
6
|
platform: ruby
|
7
7
|
authors:
|
8
8
|
- Jonathan Rochkind
|
9
9
|
autorequire:
|
10
10
|
bindir: bin
|
11
11
|
cert_chain: []
|
12
|
-
date: 2013-09-
|
12
|
+
date: 2013-09-23 00:00:00.000000000 Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
15
|
name: marc
|
@@ -194,7 +194,7 @@ files:
|
|
194
194
|
- lib/traject/marc_extractor.rb
|
195
195
|
- lib/traject/marc_reader.rb
|
196
196
|
- lib/traject/mock_reader.rb
|
197
|
-
- lib/traject/
|
197
|
+
- lib/traject/null_writer.rb
|
198
198
|
- lib/traject/qualified_const_get.rb
|
199
199
|
- lib/traject/solrj_writer.rb
|
200
200
|
- lib/traject/thread_pool.rb
|
@@ -244,6 +244,7 @@ files:
|
|
244
244
|
- test/test_support/packed_041a_lang.marc
|
245
245
|
- test/test_support/test_data.utf8.marc.xml
|
246
246
|
- test/test_support/test_data.utf8.mrc
|
247
|
+
- test/test_support/test_data.utf8.mrc.gz
|
247
248
|
- test/test_support/the_business_ren.marc
|
248
249
|
- test/translation_map_test.rb
|
249
250
|
- test/translation_maps/bad_ruby.rb
|
@@ -345,6 +346,7 @@ test_files:
|
|
345
346
|
- test/test_support/packed_041a_lang.marc
|
346
347
|
- test/test_support/test_data.utf8.marc.xml
|
347
348
|
- test/test_support/test_data.utf8.mrc
|
349
|
+
- test/test_support/test_data.utf8.mrc.gz
|
348
350
|
- test/test_support/the_business_ren.marc
|
349
351
|
- test/translation_map_test.rb
|
350
352
|
- test/translation_maps/bad_ruby.rb
|