traject 2.0.2 → 2.1.0
- checksums.yaml +4 -4
- data/README.md +8 -8
- data/doc/indexing_rules.md +1 -1
- data/doc/settings.md +4 -1
- data/lib/traject/command_line.rb +9 -18
- data/lib/traject/indexer.rb +162 -35
- data/lib/traject/macros/marc21.rb +23 -0
- data/lib/traject/macros/marc_format_classifier.rb +2 -2
- data/lib/traject/solr_json_writer.rb +2 -16
- data/lib/traject/util.rb +75 -0
- data/lib/traject/version.rb +1 -1
- data/test/indexer/context_test.rb +35 -0
- data/test/indexer/load_config_file_test.rb +89 -0
- data/test/indexer/macros_marc21_test.rb +5 -0
- data/test/indexer/writer_test.rb +54 -0
- data/test/solr_json_writer_test.rb +0 -29
- data/traject.gemspec +1 -1
- metadata +9 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 68f2f404b1bcae2d73ad2d947d40178ddeaf3912
|
4
|
+
data.tar.gz: 3aa18f9032b5406059bf8093416df04655ae16e2
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: e0521cadc05b454787f642726a6f537b6d09eb46f88389e7f036e9808e680e593ec09b594f2fa85480cb5e350c2f5c966549561cbfab580f9e38300365658e3f
|
7
|
+
data.tar.gz: 5fe31273d6cbe1cf46cf2a186f8d3b4dbc64bdca86e39cb6a9e3c3de08289a1e06f4ca0bc66e4e80df622591fa6f95bdb2b1c66e3a9728890155e642dfdb49fd
|
data/README.md
CHANGED
@@ -11,7 +11,7 @@ for debugging by a human.
|
|
11
11
|
**Traject is stable, mature software, that is already being used in production by its authors.**
|
12
12
|
|
13
13
|
[![Gem Version](https://badge.fury.io/rb/traject.png)](http://badge.fury.io/rb/traject)
|
14
|
-
[![Build Status](https://travis-ci.org/traject
|
14
|
+
[![Build Status](https://travis-ci.org/traject/traject.png)](https://travis-ci.org/traject/traject)
|
15
15
|
|
16
16
|
|
17
17
|
## Background/Goals
|
@@ -137,7 +137,7 @@ data out of a MARC record according to a tag/subfield specification.
|
|
137
137
|
|
138
138
|
# For MARC Control ('fixed') fields, you can optionally
|
139
139
|
# use square brackets to take a byte offset.
|
140
|
-
to_field "
|
140
|
+
to_field "language_code", extract_marc("008[35-37]")
|
141
141
|
~~~
|
142
142
|
|
143
143
|
`extract_marc` by default includes all 'alternate script' linked fields corresponding
|
@@ -189,7 +189,7 @@ The current record serialized back out as MARC, in binary, XML, or json:
|
|
189
189
|
Text of all fields in a range:
|
190
190
|
|
191
191
|
~~~ruby
|
192
|
-
to_field "text", extract_all_marc_values(:from => 100, :to => 899)
|
192
|
+
to_field "text", extract_all_marc_values(:from => "100", :to => "899")
|
193
193
|
~~~
|
194
194
|
|
195
195
|
All of these methods are defined at [Traject::Macros::Marc21](./lib/traject/macros/marc21.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Macros/Marc21))
|
@@ -324,7 +324,7 @@ Several other writers are also built-in:
|
|
324
324
|
You set which writer is being used in settings (`provide "writer_class_name", "Traject::DebugWriter"`),
|
325
325
|
or with the shortcut command line argument `-w Traject::DebugWriter`.
|
326
326
|
|
327
|
-
The [SolrJWriter](https://github.com/traject
|
327
|
+
The [SolrJWriter](https://github.com/traject/traject-solrj_writer) is packaged separately,
|
328
328
|
and will be useful if you need to index to Solr versions older than 3.2. It requires JRuby.
|
329
329
|
|
330
330
|
You can easily write your own Readers and Writers if you'd like, see comments at top
|
@@ -413,11 +413,11 @@ Own Code](./doc/extending.md)
|
|
413
413
|
* [Other traject commands](./doc/other_commands.md) including `marcout`, and `commit`
|
414
414
|
* [Hints for batch and cronjob use](./doc/batch_execution.md) of traject.
|
415
415
|
* Plugin extensions: Gems that add functionality to traject
|
416
|
-
* [traject_alephsequential_reader](https://github.com/traject
|
416
|
+
* [traject_alephsequential_reader](https://github.com/traject/traject_alephsequential_reader/): read MARC files serialized in the AlephSequential format, as output by Ex Libris's Aleph ILS.
|
417
417
|
* [traject_horizon](https://github.com/jrochkind/traject_horizon): Export MARC records directly from a Horizon ILS rdbms, as serialized MARC or to index into Solr.
|
418
418
|
* [traject_umich_format](https://github.com/billdueber/traject_umich_format/): opinionated code and associated macros to extract format (book, audio file, etc.) and types (bibliography, conference report, etc.) from a MARC record. Code mirrors that used by the University of Michigan, and is an alternate approach to that taken by the `marc_formats` macro in `Traject::Macros::MarcFormatClassifier`.
|
419
|
-
* [traject-solrj_writer](https://github.com/traject
|
420
|
-
* [traject_marc4j_reader](https://github.com/
|
419
|
+
* [traject-solrj_writer](https://github.com/traject/traject-solrj_writer): a jruby-only writer that uses the solrj .jar to talk directly to solr. Your only option for speaking to a solr version < 3.2, which is when the json handler was added to solr.
|
420
|
+
* [traject_marc4j_reader](https://github.com/traject/traject-marc4j_reader): Packaged with traject automatically on jruby. A JRuby-only reader for
|
421
421
|
reading marc records using the Marc4J library, fastest MARC reading on JRuby.
|
422
422
|
|
423
423
|
# Development
|
@@ -454,7 +454,7 @@ a gemspec dependency on the Marc4JReader gem.
|
|
454
454
|
on Settings.
|
455
455
|
|
456
456
|
* CommandLine class isn't covered by tests -- it's written using functionality
|
457
|
-
from Indexer and other classes
|
457
|
+
from Indexer and other classes that are well-covered, but the CommandLine itself
|
458
458
|
probably needs some tests -- especially covering error handling, which probably
|
459
459
|
needs a bit more attention and using exceptions instead of exits, etc.
|
460
460
|
|
data/doc/indexing_rules.md
CHANGED
@@ -58,7 +58,7 @@ you need to modify the array in-place.
|
|
58
58
|
The third optional context argument
|
59
59
|
|
60
60
|
The third optional argument is a
|
61
|
-
[Traject::Indexer::Context](./lib/traject/indexer/context.rb) ([rdoc](http://rdoc.info/github/traject
|
61
|
+
[Traject::Indexer::Context](./lib/traject/indexer/context.rb) ([rdoc](http://rdoc.info/github/traject/traject/Traject/Indexer/Context))
|
62
62
|
object. Most of the time you don't need it, but you can use it for
|
63
63
|
some sophisticated functionality, for example using these Context methods:
|
64
64
|
|
data/doc/settings.md
CHANGED
@@ -98,4 +98,7 @@ settings are applied first of all. It's recommended you use `provide`.
|
|
98
98
|
Note that processing_thread_pool threads can end up submitting
|
99
99
|
to solr too, if solr_json_writer.thread_pool is full.
|
100
100
|
|
101
|
-
* `
|
101
|
+
* `writer`: An object that implements the Traject Writer interface. If set, takes precedence
|
102
|
+
over `writer_class_name`.
|
103
|
+
|
104
|
+
* `writer_class_name`: a Traject Writer class, used by indexer to send processed dictionaries off. Will be used if no explicit `writer` setting or `#writer=` is set. Default Traject::SolrJsonWriter, other writers for debugging or writing to files are also available. See Traject::Indexer for more info. Command line shortcut `-w`
|
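The precedence described in the settings.md change above can be sketched in plain Ruby. This is a minimal, hypothetical stand-in, not traject's own code: `ToyIndexer`, `ToyWriter`, `OtherToyWriter` and the `CLASSES` lookup table are invented for illustration. Only the resolution order (an explicitly assigned writer, then the `writer` setting, then a class named by `writer_class_name`) is taken from the `Traject::Indexer#writer` change shown later in this diff.

```ruby
# Hypothetical stand-ins; none of these names exist in traject.
class ToyWriter
  attr_reader :settings

  def initialize(settings)
    @settings = settings
  end
end

class OtherToyWriter < ToyWriter; end

class ToyIndexer
  # Lookup table standing in for traject's qualified_const_get.
  CLASSES = { "ToyWriter" => ToyWriter, "OtherToyWriter" => OtherToyWriter }

  attr_reader :settings
  attr_writer :writer

  def initialize(settings = {})
    @settings = { "writer_class_name" => "ToyWriter" }.merge(settings)
  end

  # An explicitly assigned writer wins, then the "writer" setting, and
  # only then is a class instantiated from "writer_class_name".
  def writer
    @writer ||= settings["writer"] ||
                CLASSES.fetch(settings["writer_class_name"]).new(settings)
  end

  # Derived from whichever writer was resolved above.
  def writer_class
    writer.class
  end
end

preset = OtherToyWriter.new({})
indexer = ToyIndexer.new("writer" => preset, "writer_class_name" => "ToyWriter")
puts indexer.writer_class
```

Because `writer_class` is derived from the resolved writer, it reports the class of the object that will actually be used rather than whatever `writer_class_name` happens to say.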
data/lib/traject/command_line.rb
CHANGED
@@ -171,25 +171,16 @@ module Traject
|
|
171
171
|
def load_configuration_files!(my_indexer, conf_files)
|
172
172
|
conf_files.each do |conf_path|
|
173
173
|
begin
|
174
|
-
|
175
|
-
rescue Errno::ENOENT => e
|
176
|
-
self.console.puts "Could not
|
174
|
+
my_indexer.load_config_file(conf_path)
|
175
|
+
rescue Errno::ENOENT, Errno::EACCES => e
|
176
|
+
self.console.puts "Could not read configuration file '#{conf_path}', exiting..."
|
177
177
|
exit 2
|
178
|
-
|
179
|
-
|
180
|
-
|
181
|
-
|
182
|
-
|
183
|
-
|
184
|
-
(conf_trace =~ /\A.*\:(\d+)\:in/)
|
185
|
-
line_number = $1
|
186
|
-
end
|
187
|
-
|
188
|
-
self.console.puts "Error processing configuration file '#{conf_path}' at line #{line_number}"
|
189
|
-
self.console.puts " #{e.class}: #{e.message}"
|
190
|
-
if e.backtrace.first =~ /\A(.*)\:in/
|
191
|
-
self.console.puts " from #{$1}"
|
192
|
-
end
|
178
|
+
rescue Traject::Indexer::ConfigLoadError => e
|
179
|
+
self.console.puts "\n"
|
180
|
+
self.console.puts e.message
|
181
|
+
self.console.puts e.config_file_backtrace
|
182
|
+
self.console.puts "\n"
|
183
|
+
self.console.puts "Exiting..."
|
193
184
|
exit 3
|
194
185
|
end
|
195
186
|
end
|
data/lib/traject/indexer.rb
CHANGED
@@ -23,7 +23,7 @@ end
|
|
23
23
|
# Traject config files are `instance_eval`d in an Indexer object, so `self` in
|
24
24
|
# a config file is an Indexer, and any Indexer methods can be called.
|
25
25
|
#
|
26
|
-
# However, certain Indexer methods exist
|
26
|
+
# However, certain Indexer methods exist mainly for the purpose of
|
27
27
|
# being called in config files; these methods are part of the expected
|
28
28
|
# Domain-Specific Language ("DSL") for config files, and will ordinarily
|
29
29
|
# form the bulk or entirety of config files:
|
@@ -34,18 +34,6 @@ end
|
|
34
34
|
# * #after_processing
|
35
35
|
# * #logger (rarely used in config files, but in some cases to set up custom logging config)
|
36
36
|
#
|
37
|
-
# If accessing a Traject::Indexer programmatically (instead of via command line with
|
38
|
-
# config files), additional methods of note include:
|
39
|
-
#
|
40
|
-
# # to process a stream of input records from configured Reader,
|
41
|
-
# # to configured Writer:
|
42
|
-
# indexer.process(io_stream)
|
43
|
-
#
|
44
|
-
# # To map a single input record manually to an ouput_hash,
|
45
|
-
# # ignoring Readers and Writers
|
46
|
-
# hash = indexer.map_record(record)
|
47
|
-
#
|
48
|
-
#
|
49
37
|
# ## Readers and Writers
|
50
38
|
#
|
51
39
|
# The Indexer has a modularized architecture for readers and writers, for where
|
@@ -93,6 +81,77 @@ end
|
|
93
81
|
# * traject/delimited_writer and traject/csv_writer -- write character-delimited files
|
94
82
|
# (default is tab-delimited) or comma-separated-value files.
|
95
83
|
#
|
84
|
+
# ## Creating and Using an Indexer programmatically
|
85
|
+
#
|
86
|
+
# Normally the Traject::Indexer is created and used by a Traject::CommandLine object.
|
87
|
+
# However, you can also create and use a Traject::Indexer programmatically, for embedding
|
88
|
+
# in your own ruby software. (Note: you will get the best performance under JRuby.)
|
89
|
+
#
|
90
|
+
# indexer = Traject::Indexer.new
|
91
|
+
#
|
92
|
+
# You can load a config file from disk, using standard ruby `instance_eval`.
|
93
|
+
# One benefit of loading one or more ordinary traject config files saved separately
|
94
|
+
# on disk is that these config files could also be used with the standard
|
95
|
+
# traject command line.
|
96
|
+
#
|
97
|
+
# indexer.load_config_file(path_to_config)
|
98
|
+
#
|
99
|
+
# This may raise if the file is not readable. Or if the config file
|
100
|
+
# can't be evaluated, it will raise a Traject::Indexer::ConfigLoadError
|
101
|
+
# with a bunch of contextual information useful to reporting to developer.
|
102
|
+
#
|
103
|
+
# You can also instead, or in addition, write configuration inline using
|
104
|
+
# standard ruby `instance_eval`:
|
105
|
+
#
|
106
|
+
# indexer.instance_eval do
|
107
|
+
# to_field "something", literal("something")
|
108
|
+
# # etc
|
109
|
+
# end
|
110
|
+
#
|
111
|
+
# Or even load configuration from an existing lambda/proc object:
|
112
|
+
#
|
113
|
+
# config = proc do
|
114
|
+
# to_field "something", literal("something")
|
115
|
+
# end
|
116
|
+
# indexer.instance_eval &config
|
117
|
+
#
|
118
|
+
# It is least confusing to provide settings after you load
|
119
|
+
# config files, so you can determine if your settings should
|
120
|
+
# be defaults (taking effect only if not provided in earlier config),
|
121
|
+
# or should force themselves, potentially overwriting earlier config:
|
122
|
+
#
|
123
|
+
# indexer.settings do
|
124
|
+
# # default, won't overwrite if already set by earlier config
|
125
|
+
# provide "solr.url", "http://example.org/solr"
|
126
|
+
# provide "reader", "Traject::MarcReader"
|
127
|
+
#
|
128
|
+
# # or force over any previous config
|
129
|
+
# store "solr.url", "http://example.org/solr"
|
130
|
+
# end
|
131
|
+
#
|
132
|
+
# Once your indexer is set up, you could use it to transform individual
|
133
|
+
# input records to output hashes. This method will ignore any readers
|
134
|
+
# and writers, and won't use thread pools, it just maps. Under
|
135
|
+
# standard MARC setup, `record` should be a `MARC::Record`:
|
136
|
+
#
|
137
|
+
# output_hash = indexer.map_record(record)
|
138
|
+
#
|
139
|
+
# Or you could process an entire stream of input records from the
|
140
|
+
# configured reader, to the configured writer, as the traject command line
|
141
|
+
# does:
|
142
|
+
#
|
143
|
+
# indexer.process(io_stream)
|
144
|
+
# # or, eg:
|
145
|
+
# File.open("path/to/input") do |file|
|
146
|
+
# indexer.process(file)
|
147
|
+
# end
|
148
|
+
#
|
149
|
+
# At present, you can only call #process _once_ on an indexer,
|
150
|
+
# but let us know if that's a problem, we could enhance.
|
151
|
+
#
|
152
|
+
# Please do let us know if there is some part of this API that is
|
153
|
+
# inconvenient for you, we'd like to know your use case and improve things.
|
154
|
+
#
|
96
155
|
class Traject::Indexer
|
97
156
|
|
98
157
|
# Arity error on a passed block
|
@@ -103,7 +162,7 @@ class Traject::Indexer
|
|
103
162
|
|
104
163
|
include Traject::QualifiedConstGet
|
105
164
|
|
106
|
-
attr_writer :reader_class, :writer_class
|
165
|
+
attr_writer :reader_class, :writer_class, :writer
|
107
166
|
|
108
167
|
# For now we hard-code these basic macro's included
|
109
168
|
# TODO, make these added with extend per-indexer,
|
@@ -120,6 +179,24 @@ class Traject::Indexer
|
|
120
179
|
@after_processing_steps = []
|
121
180
|
end
|
122
181
|
|
182
|
+
# Pass a string file path, or a File object, for
|
183
|
+
# a config file to load into indexer.
|
184
|
+
#
|
185
|
+
# Can raise:
|
186
|
+
# * Errno::ENOENT or Errno::EACCES if file path is not accessible
|
187
|
+
# * Traject::Indexer::ConfigLoadError if exception is raised evaluating
|
188
|
+
# the config. A ConfigLoadError has information in it about original
|
189
|
+
# exception, and exactly what config file and line number triggered it.
|
190
|
+
def load_config_file(file_path)
|
191
|
+
File.open(file_path) do |file|
|
192
|
+
begin
|
193
|
+
self.instance_eval(file.read, file_path)
|
194
|
+
rescue ScriptError, StandardError => e
|
195
|
+
raise ConfigLoadError.new(file_path, e)
|
196
|
+
end
|
197
|
+
end
|
198
|
+
end
|
199
|
+
|
123
200
|
# Part of the config file DSL, for writing settings values.
|
124
201
|
#
|
125
202
|
# The Indexer's settings consist of a hash-like Traject::Settings
|
@@ -282,7 +359,7 @@ class Traject::Indexer
|
|
282
359
|
begin
|
283
360
|
yield
|
284
361
|
rescue Exception => e
|
285
|
-
msg = "Unexpected error on record id `#{
|
362
|
+
msg = "Unexpected error on record id `#{context.source_record_id}` at file position #{context.position}\n"
|
286
363
|
msg += " while executing #{index_step.inspect}\n"
|
287
364
|
msg += Traject::Util.exception_to_log_message(e)
|
288
365
|
|
@@ -297,11 +374,6 @@ class Traject::Indexer
|
|
297
374
|
end
|
298
375
|
end
|
299
376
|
|
300
|
-
# get a printable id from record for error logging.
|
301
|
-
# Maybe override this for a future XML version.
|
302
|
-
def id_string(record)
|
303
|
-
record && record['001'] && record['001'].value.to_s
|
304
|
-
end
|
305
377
|
|
306
378
|
# Processes a stream of records, reading from the configured Reader,
|
307
379
|
# mapping according to configured mapping rules, and then writing
|
@@ -320,8 +392,6 @@ class Traject::Indexer
|
|
320
392
|
logger.debug "beginning Indexer#process with settings: #{settings.inspect}"
|
321
393
|
|
322
394
|
reader = self.reader!(io_stream)
|
323
|
-
writer = self.writer!
|
324
|
-
|
325
395
|
|
326
396
|
processing_threads = settings["processing_thread_pool"].to_i
|
327
397
|
thread_pool = Traject::ThreadPool.new(processing_threads)
|
@@ -343,20 +413,24 @@ class Traject::Indexer
|
|
343
413
|
$stderr.write "." if count % settings["solr_writer.batch_size"].to_i == 0
|
344
414
|
end
|
345
415
|
|
416
|
+
context = Context.new(
|
417
|
+
:source_record => record,
|
418
|
+
:settings => settings,
|
419
|
+
:position => position,
|
420
|
+
:logger => logger
|
421
|
+
)
|
422
|
+
|
346
423
|
if log_batch_size && (count % log_batch_size == 0)
|
347
424
|
batch_rps = log_batch_size / (Time.now - batch_start_time)
|
348
425
|
overall_rps = count / (Time.now - start_time)
|
349
|
-
logger.send(settings["log.batch_size.severity"].downcase.to_sym, "Traject::Indexer#process, read #{count} records at id:#{
|
426
|
+
logger.send(settings["log.batch_size.severity"].downcase.to_sym, "Traject::Indexer#process, read #{count} records at id:#{context.source_record_id}; #{'%.0f' % batch_rps}/s this batch, #{'%.0f' % overall_rps}/s overall")
|
350
427
|
batch_start_time = Time.now
|
351
428
|
end
|
352
429
|
|
353
|
-
#
|
354
|
-
#
|
355
|
-
#
|
356
|
-
|
357
|
-
thread_pool.maybe_in_thread_pool(record, settings, position) do |record, settings, position|
|
358
|
-
context = Context.new(:source_record => record, :settings => settings, :position => position)
|
359
|
-
context.logger = logger
|
430
|
+
# We pass context in a block arg to properly 'capture' it, so
|
431
|
+
# we don't accidentally share the local var under closure between
|
432
|
+
# threads.
|
433
|
+
thread_pool.maybe_in_thread_pool(context) do |context|
|
360
434
|
map_to_context!(context)
|
361
435
|
if context.skip?
|
362
436
|
log_skip(context)
|
@@ -413,10 +487,7 @@ class Traject::Indexer
|
|
413
487
|
end
|
414
488
|
|
415
489
|
def writer_class
|
416
|
-
|
417
|
-
@writer_class = qualified_const_get(settings["writer_class_name"])
|
418
|
-
end
|
419
|
-
return @writer_class
|
490
|
+
writer.class
|
420
491
|
end
|
421
492
|
|
422
493
|
# Instantiate a Traject Reader, using class set
|
@@ -427,7 +498,12 @@ class Traject::Indexer
|
|
427
498
|
|
428
499
|
# Instantiate a Traject Writer, using class set in #writer_class
|
429
500
|
def writer!
|
430
|
-
|
501
|
+
writer_class = @writer_class || qualified_const_get(settings["writer_class_name"])
|
502
|
+
writer_class.new(settings.merge("logger" => logger))
|
503
|
+
end
|
504
|
+
|
505
|
+
def writer
|
506
|
+
@writer ||= settings["writer"] || writer!
|
431
507
|
end
|
432
508
|
|
433
509
|
# Represents the context of a specific record being indexed, passed
|
@@ -467,6 +543,26 @@ class Traject::Indexer
|
|
467
543
|
@skip
|
468
544
|
end
|
469
545
|
|
546
|
+
# Useful for describing a record in a log or especially
|
547
|
+
# error message. May be useful to combine with #position
|
548
|
+
# in output messages, especially since this method may sometimes
|
549
|
+
# return empty string if info on record id is not available.
|
550
|
+
#
|
551
|
+
# Returns MARC 001, then a slash, then output_hash["id"] -- if both
|
552
|
+
# are present. Otherwise may return just one, or even an empty string.
|
553
|
+
#
|
554
|
+
# Likely override this for a future XML or other source format version.
|
555
|
+
def source_record_id
|
556
|
+
marc_id = if self.source_record &&
|
557
|
+
self.source_record.kind_of?(MARC::Record) &&
|
558
|
+
self.source_record['001']
|
559
|
+
self.source_record['001'].value
|
560
|
+
end
|
561
|
+
output_id = self.output_hash["id"]
|
562
|
+
|
563
|
+
return [marc_id, output_id].compact.join("/")
|
564
|
+
end
|
565
|
+
|
470
566
|
end
|
471
567
|
|
472
568
|
|
@@ -607,6 +703,37 @@ class Traject::Indexer
|
|
607
703
|
end
|
608
704
|
end
|
609
705
|
|
706
|
+
# Raised by #load_config_file when config file can not
|
707
|
+
# be processed.
|
708
|
+
#
|
709
|
+
# The exception #message includes an error message formatted
|
710
|
+
# for good display to the developer, in the console.
|
711
|
+
#
|
712
|
+
# Original exception raised when processing config file
|
713
|
+
# can be found in #original. Original exception should ordinarily
|
714
|
+
# have a good stack trace, including the file path of the config
|
715
|
+
# file in question.
|
716
|
+
#
|
717
|
+
# Original config path in #config_file, and line number in config
|
718
|
+
# file that triggered the exception in #config_file_lineno (may be nil)
|
719
|
+
#
|
720
|
+
# A filtered backtrace just DOWN from config file (not including trace
|
721
|
+
# from traject loading config file itself) can be found in
|
722
|
+
# #config_file_backtrace
|
723
|
+
class ConfigLoadError < StandardError
|
724
|
+
# We'd have #cause in ruby 2.1, filled out for us, but we want
|
725
|
+
# to work before then, so we use our own 'original'
|
726
|
+
attr_reader :original, :config_file, :config_file_lineno, :config_file_backtrace
|
727
|
+
def initialize(config_file_path, original_exception)
|
728
|
+
@original = original_exception
|
729
|
+
@config_file = config_file_path
|
730
|
+
@config_file_lineno = Traject::Util.backtrace_lineno_for_config(config_file_path, original_exception)
|
731
|
+
@config_file_backtrace = Traject::Util.backtrace_from_config(config_file_path, original_exception)
|
732
|
+
message = "Error loading configuration file #{self.config_file}:#{self.config_file_lineno} #{original_exception.class}:#{original_exception.message}"
|
733
|
+
|
734
|
+
super(message)
|
735
|
+
end
|
736
|
+
end
|
610
737
|
|
611
738
|
|
612
739
|
|
data/lib/traject/macros/marc21.rb
CHANGED
@@ -39,6 +39,9 @@ module Traject::Macros
|
|
39
39
|
# to_field("title"), extract_marc("245abcd", :trim_punctuation => true)
|
40
40
|
# to_field("id"), extract_marc("001", :first => true)
|
41
41
|
# to_field("geo"), extract_marc("040a", :separator => nil, :translation_map => "marc040")
|
42
|
+
#
|
43
|
+
# If you'd like extract_marc functionality but you're not creating an indexer
|
44
|
+
# step, see Traject::Macros::Marc21.extract_marc_from module method.
|
42
45
|
def extract_marc(spec, options = {})
|
43
46
|
|
44
47
|
# Raise an error if there are any invalid options, indicating a
|
@@ -70,6 +73,26 @@ module Traject::Macros
|
|
70
73
|
Marc21.apply_extraction_options(accumulator, options, translation_map)
|
71
74
|
end
|
72
75
|
end
|
76
|
+
module_function :extract_marc
|
77
|
+
|
78
|
+
# Convenience method when you want extract_marc behavior, but NOT
|
79
|
+
# to create a lambda for an Indexer step, but instead just give
|
80
|
+
# it a record directly and get back an array of values.
|
81
|
+
#
|
82
|
+
# array = Traject::Macros::Marc21.extract_marc_from(record, "245ab", :trim_punctuation => true)
|
83
|
+
#
|
84
|
+
# If you have a Traject::Indexer::Context and want to pass it in, you can:
|
85
|
+
#
|
86
|
+
# array = Traject::Macros::Marc21.extract_marc_from(record, "245ab", :trim_punctuation => true, :context => existing_context)
|
87
|
+
def self.extract_marc_from(record, spec, options = {})
|
88
|
+
output = []
|
89
|
+
# Nil context works, but if caller wants to pass one in
|
90
|
+
# for better error reporting that's cool too.
|
91
|
+
context = options.delete(:context) || nil
|
92
|
+
|
93
|
+
extract_marc(spec, options).call(record, output, context)
|
94
|
+
return output
|
95
|
+
end
|
73
96
|
|
74
97
|
# Side-effect the accumulator with the options
|
75
98
|
def self.apply_extraction_options(accumulator, options, translation_map=nil)
|
data/lib/traject/macros/marc_format_classifier.rb
CHANGED
@@ -2,10 +2,10 @@ module Traject
|
|
2
2
|
module Macros
|
3
3
|
# To use the marc_format macro, in your configuration file:
|
4
4
|
#
|
5
|
-
# require 'traject/macros/
|
5
|
+
# require 'traject/macros/marc_format_classifier'
|
6
6
|
# extend Traject::Macros::MarcFormats
|
7
7
|
#
|
8
|
-
# to_field
|
8
|
+
# to_field "format", marc_formats
|
9
9
|
#
|
10
10
|
# See also MarcClassifier which can be used directly for a bit more
|
11
11
|
# control.
|
data/lib/traject/solr_json_writer.rb
CHANGED
@@ -144,11 +144,11 @@ class Traject::SolrJsonWriter
|
|
144
144
|
|
145
145
|
if exception || resp.status != 200
|
146
146
|
if exception
|
147
|
-
msg = Traject::Util.exception_to_log_message(
|
147
|
+
msg = Traject::Util.exception_to_log_message(exception)
|
148
148
|
else
|
149
149
|
msg = "Solr error response: #{resp.status}: #{resp.body}"
|
150
150
|
end
|
151
|
-
logger.error "Could not add record #{
|
151
|
+
logger.error "Could not add record #{c.source_record_id} at source file position #{c.position}: #{msg}"
|
152
152
|
logger.debug(c.source_record.to_s)
|
153
153
|
|
154
154
|
@skipped_record_incrementer.increment
|
@@ -166,20 +166,6 @@ class Traject::SolrJsonWriter
|
|
166
166
|
settings["logger"] ||= Yell.new(STDERR, :level => "gt.fatal") # null logger
|
167
167
|
end
|
168
168
|
|
169
|
-
# Returns MARC 001, then a slash, then output_hash["id"] -- if both
|
170
|
-
# are present. Otherwise may return just one, or even an empty string.
|
171
|
-
def record_id_from_context(context)
|
172
|
-
marc_id = if context.source_record &&
|
173
|
-
context.source_record.kind_of?(MARC::Record) &&
|
174
|
-
context.source_record['001']
|
175
|
-
context.source_record['001'].value
|
176
|
-
end
|
177
|
-
output_id = context.output_hash["id"]
|
178
|
-
|
179
|
-
return [marc_id, output_id].compact.join("/")
|
180
|
-
end
|
181
|
-
|
182
|
-
|
183
169
|
# On close, we need to (a) raise any exceptions we might have, (b) send off
|
184
170
|
# the last (possibly empty) batch, and (c) commit if instructed to do so
|
185
171
|
# via the solr_writer.commit_on_close setting.
|
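The id-joining behaviour that this change moves out of the writer and onto `Context#source_record_id` can be sketched without MARC at all. The helper below is hypothetical (`describe_record` is not a traject method) and takes the two candidate ids as plain arguments, whereas the real method reads them from the source record and the output hash.

```ruby
# Hypothetical free-standing version of the id-joining logic.
# marc_001:  value of the MARC 001 field, or nil if unavailable
# output_id: value of output_hash["id"], or nil if not yet mapped
def describe_record(marc_001, output_id)
  # compact drops nil entries, so a missing id leaves no stray slash.
  [marc_001, output_id].compact.join("/")
end

puts describe_record("00282214", "the_record_id")
puts describe_record(nil, "the_record_id")
```

Since the result can be an empty string when neither id is known, log messages that use it are easier to act on when they also include the record's position, as the updated messages in this diff do.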
data/lib/traject/util.rb
CHANGED
@@ -26,6 +26,81 @@ module Traject
|
|
26
26
|
str.split(':in `').first
|
27
27
|
end
|
28
28
|
|
29
|
+
# Provide a config source file path, and an exception.
|
30
|
+
#
|
31
|
+
# Returns the line number from the first line in the stack
|
32
|
+
# trace of the exception that matches your file path.
|
33
|
+
# (that is, the first line in the backtrace matching that file_path).
|
34
|
+
#
|
35
|
+
# Returns `nil` if no suitable backtrace line can be found.
|
36
|
+
#
|
37
|
+
# Has special logic to try and grep the info out of a SyntaxError, bah.
|
38
|
+
def self.backtrace_lineno_for_config(file_path, exception)
|
39
|
+
# For a SyntaxError, we really need to grep it from the
|
40
|
+
# exception message, it really appears to be nowhere else. Ugh.
|
41
|
+
if exception.kind_of? SyntaxError
|
42
|
+
if exception.message =~ /:(\d+):/
|
43
|
+
return $1.to_i
|
44
|
+
end
|
45
|
+
end
|
46
|
+
|
47
|
+
# Otherwise we try to fish it out of the backtrace, first
|
48
|
+
# line matching the config file path.
|
49
|
+
|
50
|
+
# exception.backtrace_locations exists in MRI 2.1+, which makes
|
51
|
+
# our task a lot easier. But not yet in JRuby 1.7.x, so we got to
|
52
|
+
# handle the old way of having to parse the strings in backtrace too.
|
53
|
+
if ( exception.respond_to?(:backtrace_locations) &&
|
54
|
+
exception.backtrace_locations &&
|
55
|
+
exception.backtrace_locations.length > 0 )
|
56
|
+
location = exception.backtrace_locations.find do |bt|
|
57
|
+
bt.path == file_path
|
58
|
+
end
|
59
|
+
return location ? location.lineno : nil
|
60
|
+
else # have to parse string backtrace
|
61
|
+
exception.backtrace.each do |line|
|
62
|
+
if line.start_with?(file_path)
|
63
|
+
return $1.to_i if line =~ /\A.*\:(\d+)\:in/
|
64
|
+
break
|
65
|
+
end
|
66
|
+
end
|
67
|
+
# if we got here, we have nothing
|
68
|
+
return nil
|
69
|
+
end
|
70
|
+
end
|
71
|
+
|
72
|
+
# Extract just the part of the backtrace that is "below"
|
73
|
+
# the config file mentioned. If we can't find the config file
|
74
|
+
# in the stack trace, we might return empty array.
|
75
|
+
#
|
76
|
+
# If the ruby supports Exception#backtrace_locations, the
|
77
|
+
# returned array will actually be of Thread::Backtrace::Location elements.
|
78
|
+
def self.backtrace_from_config(file_path, exception)
|
79
|
+
filtered_trace = []
|
80
|
+
found = false
|
81
|
+
|
82
|
+
# MRI 2.1+ has exception.backtrace_locations which makes
|
83
|
+
# this a lot easier, but JRuby 1.7.x doesn't yet, so we
|
84
|
+
# need to do it both ways.
|
85
|
+
if ( exception.respond_to?(:backtrace_locations) &&
|
86
|
+
exception.backtrace_locations &&
|
87
|
+
exception.backtrace_locations.length > 0 )
|
88
|
+
|
89
|
+
exception.backtrace_locations.each do |location|
|
90
|
+
filtered_trace << location
|
91
|
+
(found=true and break) if location.path == file_path
|
92
|
+
end
|
93
|
+
else
|
94
|
+
filtered_trace = []
|
95
|
+
exception.backtrace.each do |line|
|
96
|
+
filtered_trace << line
|
97
|
+
(found=true and break) if line.start_with?(file_path)
|
98
|
+
end
|
99
|
+
end
|
100
|
+
|
101
|
+
return found ? filtered_trace : []
|
102
|
+
end
|
103
|
+
|
29
104
|
|
30
105
|
|
31
106
|
# Ruby stdlib queue lacks a 'drain' function, we write one.
|
data/lib/traject/version.rb
CHANGED
data/test/indexer/context_test.rb
ADDED
@@ -0,0 +1,35 @@
|
|
1
|
+
require 'test_helper'
|
2
|
+
|
3
|
+
describe "Traject::Indexer::Context" do
|
4
|
+
|
5
|
+
describe "source_record_id" do
|
6
|
+
before do
|
7
|
+
@record = MARC::Reader.new(support_file_path('test_data.utf8.mrc')).first
|
8
|
+
@context = Traject::Indexer::Context.new
|
9
|
+
@record_001 = " 00282214 " # from the mrc file
|
10
|
+
end
|
11
|
+
|
12
|
+
it "gets it from 001" do
|
13
|
+
@context.source_record = @record
|
14
|
+
assert_equal @record_001, @context.source_record_id
|
15
|
+
end
|
16
|
+
|
17
|
+
it "gets it from the id" do
|
18
|
+
@context.output_hash['id'] = 'the_record_id'
|
19
|
+
assert_equal 'the_record_id', @context.source_record_id
|
20
|
+
end
|
21
|
+
|
22
|
+
it "gets from the id with non-MARC source" do
|
23
|
+
@context.source_record = Object.new
|
24
|
+
@context.output_hash['id'] = 'the_record_id'
|
25
|
+
assert_equal 'the_record_id', @context.source_record_id
|
26
|
+
end
|
27
|
+
|
28
|
+
it "gets it from both 001 and id" do
|
29
|
+
@context.output_hash['id'] = 'the_record_id'
|
30
|
+
@context.source_record = @record
|
31
|
+
assert_equal [@record_001, 'the_record_id'].join('/'), @context.source_record_id
|
32
|
+
end
|
33
|
+
end
|
34
|
+
|
35
|
+
end
|
data/test/indexer/load_config_file_test.rb
ADDED
@@ -0,0 +1,89 @@
|
|
1
|
+
require 'test_helper'
|
2
|
+
require 'tempfile'
|
3
|
+
|
4
|
+
describe "Traject::Indexer#load_config_file" do
|
5
|
+
before do
|
6
|
+
@indexer = Traject::Indexer.new
|
7
|
+
end
|
8
|
+
|
9
|
+
describe "with bad path" do
|
10
|
+
it "raises ENOENT on non-existing path" do
|
11
|
+
assert_raises(Errno::ENOENT) { @indexer.load_config_file("does/not/exist.rb") }
|
12
|
+
end
|
13
|
+
it "raises EACCES on non-readable path" do
|
14
|
+
file = Tempfile.new('traject_test')
|
15
|
+
FileUtils.chmod("ugo-r", file.path)
|
16
|
+
|
17
|
+
assert_raises(Errno::EACCES) { @indexer.load_config_file(file.path) }
|
18
|
+
|
19
|
+
file.unlink
|
20
|
+
end
|
21
|
+
end
|
22
|
+
|
23
|
+
describe "with good config" do
|
24
|
+
before do
|
25
|
+
@config_file = tmp_config_file_with(%Q{
|
26
|
+
settings do
|
27
|
+
provide "our_key", "our_value"
|
28
|
+
end
|
29
|
+
to_field "literal", literal("literal")
|
30
|
+
})
|
31
|
+
end
|
32
|
+
after do
|
33
|
+
@config_file.unlink
|
34
|
+
end
|
35
|
+
it "loads config file by path" do
|
36
|
+
@indexer.load_config_file(@config_file.path)
|
37
|
+
|
38
|
+
assert_equal "our_value", @indexer.settings["our_key"]
|
39
|
+
end
|
40
|
+
end
|
41
|
+
|
42
|
+
describe "with error in config" do
|
43
|
+
after do
|
44
|
+
@config_file.unlink if @config_file
|
45
|
+
end
|
46
|
+
|
47
|
+
it "raises good error on SyntaxError type" do
|
48
|
+
@config_file = tmp_config_file_with(%Q{
|
49
|
+
puts "foo"
|
50
|
+
# Intentional syntax error missing comma
|
51
|
+
to_field "foo" extract_marc("245")
|
52
|
+
})
|
53
|
+
|
54
|
+
e = assert_raises(Traject::Indexer::ConfigLoadError) do
|
55
|
+
@indexer.load_config_file(@config_file.path)
|
56
|
+
end
|
57
|
+
|
58
|
+
assert_kind_of SyntaxError, e.original
|
59
|
+
assert_equal @config_file.path, e.config_file
|
60
|
+
assert_equal 4, e.config_file_lineno
|
61
|
+
end
|
62
|
+
|
63
|
+
it "raises good error on StandardError type" do
|
64
|
+
@config_file = tmp_config_file_with(%Q{
|
65
|
+
# Intentional non-syntax error, bad extract_marc spec
|
66
|
+
to_field "foo", extract_marc("#%^%^%^")
|
67
|
+
})
|
68
|
+
|
69
|
+
e = assert_raises(Traject::Indexer::ConfigLoadError) do
|
70
|
+
@indexer.load_config_file(@config_file.path)
|
71
|
+
end
|
72
|
+
|
73
|
+
assert_kind_of StandardError, e.original
|
74
|
+
assert_equal @config_file.path, e.config_file
|
75
|
+
assert_equal 3, e.config_file_lineno
|
76
|
+
end
|
77
|
+
end
|
78
|
+
|
79
|
+
|
80
|
+
def tmp_config_file_with(str)
|
81
|
+
file = Tempfile.new('traject_test_config')
|
82
|
+
file.write(str)
|
83
|
+
file.rewind
|
84
|
+
|
85
|
+
return file
|
86
|
+
end
|
87
|
+
|
88
|
+
|
89
|
+
end
data/test/indexer/macros_marc21_test.rb
CHANGED
@@ -118,6 +118,11 @@ describe "Traject::Macros::Marc21" do
     end
   end
 
+  it "supports #extract_marc_from module method" do
+    output_arr = ::Traject::Macros::Marc21.extract_marc_from(@record, "245ab", :trim_punctuation => true)
+    assert_equal ["Manufacturing consent : the political economy of the mass media"], output_arr
+  end
+
   describe "serialized_marc" do
     it "serializes xml" do
       @indexer.instance_eval do
data/test/indexer/writer_test.rb
ADDED
@@ -0,0 +1,54 @@
+require 'test_helper'
+require 'traject/yaml_writer'
+
+describe "The writer on Traject::Indexer" do
+  let(:indexer) { Traject::Indexer.new("solr.url" => "http://example.com") }
+
+  it "has a default" do
+    assert_instance_of Traject::SolrJsonWriter, indexer.writer
+    assert_equal Traject::SolrJsonWriter, indexer.writer_class
+  end
+
+  describe "when the writer is set in config" do
+    let(:writer) { Traject::YamlWriter.new({}) }
+
+    let(:indexer) { Traject::Indexer.new(
+      "solr.url" => "http://example.com",
+      "writer_class" => 'Traject::SolrJsonWriter',
+      "writer" => writer
+    )}
+
+    it "uses writer from config" do
+      assert_equal writer, indexer.writer
+      assert_equal writer.class, indexer.writer_class
+    end
+  end
+
+  describe "when writer_class is set directly" do
+    let(:writer_class) { Traject::YamlWriter }
+
+    before do
+      indexer.writer_class = writer_class
+    end
+
+    it "uses writer_class set directly" do
+      assert_kind_of writer_class, indexer.writer
+      assert_equal writer_class, indexer.writer_class
+    end
+
+  end
+
+  describe "when the writer is set directly" do
+    let(:writer) { Traject::YamlWriter.new({}) }
+
+    before do
+      indexer.writer = writer
+    end
+
+    it "uses the set value" do
+      assert_equal writer, indexer.writer
+      assert_equal writer.class, indexer.writer_class
+    end
+  end
+
+end
data/test/solr_json_writer_test.rb
CHANGED
@@ -215,34 +215,5 @@ describe "Traject::SolrJsonWriter" do
       assert_equal "http://example.com/solr/update", @writer.determine_solr_update_url
     end
   end
-
-  describe "Record id from context" do
-    before do
-      @record = MARC::Reader.new(support_file_path('test_data.utf8.mrc')).first
-      @context = Traject::Indexer::Context.new
-      @writer = create_writer
-      @record_001 = " 00282214 " # from the mrc file
-    end
-
-    it "gets it from 001" do
-      @context.source_record = @record
-      assert_equal @record_001, @writer.record_id_from_context(@context)
-    end
-
-    it "gets it from the id" do
-      @context.output_hash['id'] = 'the_record_id'
-      assert_equal 'the_record_id', @writer.record_id_from_context(@context)
-    end
-
-    it "gets it from both 001 and id" do
-      @context.output_hash['id'] = 'the_record_id'
-      @context.source_record = @record
-      assert_equal [@record_001, 'the_record_id'].join('/'), @writer.record_id_from_context(@context)
-    end
-
-
-
-  end
-
 
 end
data/traject.gemspec
CHANGED
@@ -9,7 +9,7 @@ Gem::Specification.new do |spec|
   spec.authors = ["Jonathan Rochkind", "Bill Dueber"]
   spec.email = ["none@nowhere.org"]
   spec.summary = %q{Index MARC to Solr; or generally process source records to hash-like structures}
-  spec.homepage = "http://github.com/traject
+  spec.homepage = "http://github.com/traject/traject"
   spec.license = "MIT"
 
   spec.files = `git ls-files`.split($/)
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: traject
 version: !ruby/object:Gem::Version
-  version: 2.0
+  version: 2.1.0
 platform: ruby
 authors:
 - Jonathan Rochkind
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-02
+date: 2015-07-02 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: concurrent-ruby
@@ -233,7 +233,9 @@ files:
 - lib/translation_maps/marc_languages.yaml
 - test/debug_writer_test.rb
 - test/delimited_writer_test.rb
+- test/indexer/context_test.rb
 - test/indexer/each_record_test.rb
+- test/indexer/load_config_file_test.rb
 - test/indexer/macros_marc21_semantics_test.rb
 - test/indexer/macros_marc21_test.rb
 - test/indexer/macros_test.rb
@@ -241,6 +243,7 @@ files:
 - test/indexer/read_write_test.rb
 - test/indexer/settings_test.rb
 - test/indexer/to_field_test.rb
+- test/indexer/writer_test.rb
 - test/marc_extractor_test.rb
 - test/marc_format_classifier_test.rb
 - test/marc_reader_test.rb
@@ -287,7 +290,7 @@ files:
 - test/translation_maps/translate_array_test.yaml
 - test/translation_maps/yaml_map.yaml
 - traject.gemspec
-homepage: http://github.com/traject
+homepage: http://github.com/traject/traject
 licenses:
 - MIT
 metadata: {}
@@ -314,7 +317,9 @@ summary: Index MARC to Solr; or generally process source records to hash-like st
 test_files:
 - test/debug_writer_test.rb
 - test/delimited_writer_test.rb
+- test/indexer/context_test.rb
 - test/indexer/each_record_test.rb
+- test/indexer/load_config_file_test.rb
 - test/indexer/macros_marc21_semantics_test.rb
 - test/indexer/macros_marc21_test.rb
 - test/indexer/macros_test.rb
@@ -322,6 +327,7 @@ test_files:
 - test/indexer/read_write_test.rb
 - test/indexer/settings_test.rb
 - test/indexer/to_field_test.rb
+- test/indexer/writer_test.rb
 - test/marc_extractor_test.rb
 - test/marc_format_classifier_test.rb
 - test/marc_reader_test.rb