traject 2.0.2 → 2.1.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: f738d7b32e8ccc9ac89b1ed735a154790a36c656
- data.tar.gz: 79d32ec442d5569b50ac4c8c2f25d588de335a79
+ metadata.gz: 68f2f404b1bcae2d73ad2d947d40178ddeaf3912
+ data.tar.gz: 3aa18f9032b5406059bf8093416df04655ae16e2
  SHA512:
- metadata.gz: 3042d06bc09ffc3421a334595a6393a047e0425f0b41221ab17b6881de7e96e07040f51ea1f3b4a18dd24580158eebaa341c6e640cc31263385f497584b33f75
- data.tar.gz: 811378d18249d07db3c85dc31205e2d70a98b86b6edd57fb97789f3aaa98c7b6fd7b1045966674c06eced1c6cc87848ef1efc361994f383a5c05488046fc286a
+ metadata.gz: e0521cadc05b454787f642726a6f537b6d09eb46f88389e7f036e9808e680e593ec09b594f2fa85480cb5e350c2f5c966549561cbfab580f9e38300365658e3f
+ data.tar.gz: 5fe31273d6cbe1cf46cf2a186f8d3b4dbc64bdca86e39cb6a9e3c3de08289a1e06f4ca0bc66e4e80df622591fa6f95bdb2b1c66e3a9728890155e642dfdb49fd
data/README.md CHANGED
@@ -11,7 +11,7 @@ for debugging by a human.
  **Traject is stable, mature software, that is already being used in production by its authors.**

  [![Gem Version](https://badge.fury.io/rb/traject.png)](http://badge.fury.io/rb/traject)
- [![Build Status](https://travis-ci.org/traject-project/traject.png)](https://travis-ci.org/traject-project/traject)
+ [![Build Status](https://travis-ci.org/traject/traject.png)](https://travis-ci.org/traject/traject)


  ## Background/Goals
@@ -137,7 +137,7 @@ data out of a MARC record according to a tag/subfield specification.

  # For MARC Control ('fixed') fields, you can optionally
  # use square brackets to take a byte offset.
- to_field "langauge_code", extract_marc("008[35-37]")
+ to_field "language_code", extract_marc("008[35-37]")
  ~~~

  `extract_marc` by default includes all 'alternate script' linked fields correspoinding
@@ -189,7 +189,7 @@ The current record serialized back out as MARC, in binary, XML, or json:
  Text of all fields in a range:

  ~~~ruby
- to_field "text", extract_all_marc_values(:from => 100, :to => 899)
+ to_field "text", extract_all_marc_values(:from => "100", :to => "899")
  ~~~

  All of these methods are defined at [Traject::Macros::Marc21](./lib/traject/macros/marc21.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Macros/Marc21))
@@ -324,7 +324,7 @@ Several other writers are also built-in:
  You set which writer is being used in settings (`provide "writer_class_name", "Traject::DebugWriter"`),
  or with the shortcut command line argument `-w Traject::DebugWriter`.

- The [SolrJWriter](https://github.com/traject-project/traject-solrj_writer) is packaged separately,
+ The [SolrJWriter](https://github.com/traject/traject-solrj_writer) is packaged separately,
  and will be useful if you need to index to Solr's older than version 3.2. It requires Jruby.

  You can easily write your own Readers and Writers if you'd like, see comments at top
@@ -413,11 +413,11 @@ Own Code](./doc/extending.md)
  * [Other traject commands](./doc/other_commands.md) including `marcout`, and `commit`
  * [Hints for batch and cronjob use](./doc/batch_execution.md) of traject.
  * Plugin extensions: Gems that add functionality to traject
- * [traject_alephsequential_reader](https://github.com/traject-project/traject_alephsequential_reader/): read MARC files serialized in the AlephSequential format, as output by Ex Libris's Alpeh ILS.
+ * [traject_alephsequential_reader](https://github.com/traject/traject_alephsequential_reader/): read MARC files serialized in the AlephSequential format, as output by Ex Libris's Alpeh ILS.
  * [traject_horizon](https://github.com/jrochkind/traject_horizon): Export MARC records directly from a Horizon ILS rdbms, as serialized MARC or to index into Solr.
  * [traject_umich_format](https://github.com/billdueber/traject_umich_format/): opinionated code and associated macros to extract format (book, audio file, etc.) and types (bibliography, conference report, etc.) from a MARC record. Code mirrors that used by the University of Michigan, and is an alternate approach to that taken by the `marc_formats` macro in `Traject::Macros::MarcFormatClassifier`.
- * [traject-solrj_writer](https://github.com/traject-project/traject-solrj_writer): a jruby-only writer that uses the solrj .jar to talk directly to solr. Your only option for speaking to a solr version < 3.2, which is when the json handler was added to solr.
- * [traject_marc4j_reader](https://github.com/billdueber/traject_marc4j_reader): Packaged with traject automatically on jruby. A JRuby-only reader for
+ * [traject-solrj_writer](https://github.com/traject/traject-solrj_writer): a jruby-only writer that uses the solrj .jar to talk directly to solr. Your only option for speaking to a solr version < 3.2, which is when the json handler was added to solr.
+ * [traject_marc4j_reader](https://github.com/traject/traject-marc4j_reader): Packaged with traject automatically on jruby. A JRuby-only reader for
  reading marc records using the Marc4J library, fastest MARC reading on JRuby.

  # Development
@@ -454,7 +454,7 @@ a gemspec dependency on the Marc4JReader gem.
  on Settings.

  * CommandLine class isn't covered by tests -- it's written using functionality
- from Indexer and other classes taht are well-covered, but the CommandLine itself
+ from Indexer and other classes that are well-covered, but the CommandLine itself
  probably needs some tests -- especially covering error handling, which probably
  needs a bit more attention and using exceptions instead of exits, etc.

@@ -58,7 +58,7 @@ you need to modify the array in-place.
  The third optional context argument

  The third optional argument is a
- [Traject::Indexer::Context](./lib/traject/indexer/context.rb) ([rdoc](http://rdoc.info/github/traject-project/traject/Traject/Indexer/Context))
+ [Traject::Indexer::Context](./lib/traject/indexer/context.rb) ([rdoc](http://rdoc.info/github/traject/traject/Traject/Indexer/Context))
  object. Most of the time you don't need it, but you can use it for
  some sophisticated functionality, for example using these Context methods:

@@ -98,4 +98,7 @@ settings are applied first of all. It's recommended you use `provide`.
  Note that processing_thread_pool threads can end up submitting
  to solr too, if solr_json_writer.thread_pool is full.

- * `writer_class_name`: a Traject Writer class, used by indexer to send processed dictionaries off. Default Traject::SolrJsonWriter, other writers for debugging or writing to files are also available. See Traject::Indexer for more info. Command line shortcut `-w`
+ * `writer`: An object that implements the Traject Writer interface. If set, takes precedence
+ over `writer_class_name`.
+
+ * `writer_class_name`: a Traject Writer class, used by indexer to send processed dictionaries off. Will be used if no explicit `writer` setting or `#writer=` is set. Default Traject::SolrJsonWriter, other writers for debugging or writing to files are also available. See Traject::Indexer for more info. Command line shortcut `-w`
@@ -171,25 +171,16 @@ module Traject
  def load_configuration_files!(my_indexer, conf_files)
  conf_files.each do |conf_path|
  begin
- file_io = File.open(conf_path)
- rescue Errno::ENOENT => e
- self.console.puts "Could not find configuration file '#{conf_path}', exiting..."
+ my_indexer.load_config_file(conf_path)
+ rescue Errno::ENOENT, Errno::EACCES => e
+ self.console.puts "Could not read configuration file '#{conf_path}', exiting..."
  exit 2
- end
-
- begin
- my_indexer.instance_eval(file_io.read, conf_path)
- rescue Exception => e
- if (conf_trace = e.backtrace.find {|l| l.start_with? conf_path}) &&
- (conf_trace =~ /\A.*\:(\d+)\:in/)
- line_number = $1
- end
-
- self.console.puts "Error processing configuration file '#{conf_path}' at line #{line_number}"
- self.console.puts " #{e.class}: #{e.message}"
- if e.backtrace.first =~ /\A(.*)\:in/
- self.console.puts " from #{$1}"
- end
+ rescue Traject::Indexer::ConfigLoadError => e
+ self.console.puts "\n"
+ self.console.puts e.message
+ self.console.puts e.config_file_backtrace
+ self.console.puts "\n"
+ self.console.puts "Exiting..."
  exit 3
  end
  end
@@ -23,7 +23,7 @@ end
  # Traject config files are `instance_eval`d in an Indexer object, so `self` in
  # a config file is an Indexer, and any Indexer methods can be called.
  #
- # However, certain Indexer methods exist almost entirely for the purpose of
+ # However, certain Indexer methods exist mainly for the purpose of
  # being called in config files; these methods are part of the expected
  # Domain-Specific Language ("DSL") for config files, and will ordinarily
  # form the bulk or entirety of config files:
@@ -34,18 +34,6 @@ end
  # * #after_procesing
  # * #logger (rarely used in config files, but in some cases to set up custom logging config)
  #
- # If accessing a Traject::Indexer programmatically (instead of via command line with
- # config files), additional methods of note include:
- #
- # # to process a stream of input records from configured Reader,
- # # to configured Writer:
- # indexer.process(io_stream)
- #
- # # To map a single input record manually to an ouput_hash,
- # # ignoring Readers and Writers
- # hash = indexer.map_record(record)
- #
- #
  # ## Readers and Writers
  #
  # The Indexer has a modularized architecture for readers and writers, for where
@@ -93,6 +81,77 @@ end
  # * traject/delimited_writer and traject/csv_writer -- write character-delimited files
  # (default is tab-delimited) or comma-separated-value files.
  #
+ # ## Creating and Using an Indexer programmatically
+ #
+ # Normally the Traject::Indexer is created and used by a Traject::Command object.
+ # However, you can also create and use a Traject::Indexer programmatically, for embeddeding
+ # in your own ruby software. (Note, you will get best performance under Jruby only)
+ #
+ # indexer = Traject::Indexer.new
+ #
+ # You can load a config file from disk, using standard ruby `instance_eval`.
+ # One benefit of loading one or more ordinary traject config files saved separately
+ # on disk is that these config files could also be used with the standard
+ # traject command line.
+ #
+ # indexer.load_config_file(path_to_config)
+ #
+ # This may raise if the file is not readable. Or if the config file
+ # can't be evaluated, it will raise a Traject::Indexer::ConfigLoadError
+ # with a bunch of contextual information useful to reporting to developer.
+ #
+ # You can also instead, or in addition, write configuration inline using
+ # standard ruby `instance_eval`:
+ #
+ # indexer.instance_eval do
+ # to_field "something", literal("something")
+ # # etc
+ # end
+ #
+ # Or even load configuration from an existing lambda/proc object:
+ #
+ # config = proc do
+ # to_field "something", literal("something")
+ # end
+ # indexer.instance_eval &config
+ #
+ # It is least confusing to provide settings after you load
+ # config files, so you can determine if your settings should
+ # be defaults (taking effect only if not provided in earlier config),
+ # or should force themselves, potentially overwriting earlier config:
+ #
+ # indexer.settings do
+ # # default, won't overwrite if already set by earlier config
+ # provide "solr.url", "http://example.org/solr"
+ # provide "reader", "Traject::MarcReader"
+ #
+ # # or force over any previous config
+ # store "solr.url", "http://example.org/solr"
+ # end
+ #
+ # Once your indexer is set up, you could use it to transform individual
+ # input records to output hashes. This method will ignore any readers
+ # and writers, and won't use thread pools, it just maps. Under
+ # standard MARC setup, `record` should be a `MARC::Record`:
+ #
+ # output_hash = indexer.map_record(record)
+ #
+ # Or you could process an entire stream of input records from the
+ # configured reader, to the configured writer, as the traject command line
+ # does:
+ #
+ # indexer.process(io_stream)
+ # # or, eg:
+ # File.open("path/to/input") do |file|
+ # indexer.process(file)
+ # end
+ #
+ # At present, you can only call #process _once_ on an indexer,
+ # but let us know if that's a problem, we could enhance.
+ #
+ # Please do let us know if there is some part of this API that is
+ # inconveient for you, we'd like to know your use case and improve things.
+ #
  class Traject::Indexer

  # Arity error on a passed block
@@ -103,7 +162,7 @@ class Traject::Indexer

  include Traject::QualifiedConstGet

- attr_writer :reader_class, :writer_class
+ attr_writer :reader_class, :writer_class, :writer

  # For now we hard-code these basic macro's included
  # TODO, make these added with extend per-indexer,
@@ -120,6 +179,24 @@ class Traject::Indexer
  @after_processing_steps = []
  end

+ # Pass a string file path, or a File object, for
+ # a config file to load into indexer.
+ #
+ # Can raise:
+ # * Errno::ENOENT or Errno::EACCES if file path is not accessible
+ # * Traject::Indexer::ConfigLoadError if exception is raised evaluating
+ # the config. A ConfigLoadError has information in it about original
+ # exception, and exactly what config file and line number triggered it.
+ def load_config_file(file_path)
+ File.open(file_path) do |file|
+ begin
+ self.instance_eval(file.read, file_path)
+ rescue ScriptError, StandardError => e
+ raise ConfigLoadError.new(file_path, e)
+ end
+ end
+ end
+
  # Part of the config file DSL, for writing settings values.
  #
  # The Indexer's settings consist of a hash-like Traject::Settings
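The `load_config_file` method added in the hunk above evaluates the file with `instance_eval(source, path)` and wraps any failure in a dedicated error. A minimal, self-contained sketch of that pattern follows. `SketchConfigError`, `SketchIndexer` and `sketch_load` are hypothetical stand-ins written for illustration, not traject's classes, and the one-method `to_field` DSL is an assumption made to keep the example small.

```ruby
require "tempfile"

# Hypothetical stand-in for a config-load error; keeps the original exception.
class SketchConfigError < StandardError
  attr_reader :original, :config_file

  def initialize(config_file, original)
    @original    = original
    @config_file = config_file
    super("Error loading #{config_file}: #{original.class}: #{original.message}")
  end
end

# Hypothetical stand-in for an indexer with a tiny config DSL.
class SketchIndexer
  attr_reader :fields

  def initialize
    @fields = []
  end

  # The only "DSL" method a config file can call in this sketch.
  def to_field(name)
    @fields << name
  end

  def load_config_file(path)
    File.open(path) do |file|
      begin
        # Passing the path as the second argument makes backtraces and
        # SyntaxError messages refer to the config file itself.
        instance_eval(file.read, path)
      rescue ScriptError, StandardError => e
        raise SketchConfigError.new(path, e)
      end
    end
  end
end

# Writes the given source to a temp file and loads it into a new indexer.
def sketch_load(source)
  file = Tempfile.new("sketch_config")
  file.write(source)
  file.flush
  indexer = SketchIndexer.new
  indexer.load_config_file(file.path)
  indexer
ensure
  file.close! if file
end

good = sketch_load(%(to_field "title"\n))

bad_error = begin
  sketch_load(%(to_field "title"\nno_such_method_here\n))
rescue SketchConfigError => e
  e
end
```

Rescuing `ScriptError` alongside `StandardError` is what lets a `SyntaxError` in the config be wrapped too, since `SyntaxError` does not descend from `StandardError`.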
@@ -282,7 +359,7 @@ class Traject::Indexer
  begin
  yield
  rescue Exception => e
- msg = "Unexpected error on record id `#{id_string(context.source_record)}` at file position #{context.position}\n"
+ msg = "Unexpected error on record id `#{context.source_record_id}` at file position #{context.position}\n"
  msg += " while executing #{index_step.inspect}\n"
  msg += Traject::Util.exception_to_log_message(e)

@@ -297,11 +374,6 @@ class Traject::Indexer
  end
  end

- # get a printable id from record for error logging.
- # Maybe override this for a future XML version.
- def id_string(record)
- record && record['001'] && record['001'].value.to_s
- end

  # Processes a stream of records, reading from the configured Reader,
  # mapping according to configured mapping rules, and then writing
@@ -320,8 +392,6 @@ class Traject::Indexer
  logger.debug "beginning Indexer#process with settings: #{settings.inspect}"

  reader = self.reader!(io_stream)
- writer = self.writer!
-

  processing_threads = settings["processing_thread_pool"].to_i
  thread_pool = Traject::ThreadPool.new(processing_threads)
@@ -343,20 +413,24 @@ class Traject::Indexer
  $stderr.write "." if count % settings["solr_writer.batch_size"].to_i == 0
  end

+ context = Context.new(
+ :source_record => record,
+ :settings => settings,
+ :position => position,
+ :logger => logger
+ )
+
  if log_batch_size && (count % log_batch_size == 0)
  batch_rps = log_batch_size / (Time.now - batch_start_time)
  overall_rps = count / (Time.now - start_time)
- logger.send(settings["log.batch_size.severity"].downcase.to_sym, "Traject::Indexer#process, read #{count} records at id:#{id_string(record)}; #{'%.0f' % batch_rps}/s this batch, #{'%.0f' % overall_rps}/s overall")
+ logger.send(settings["log.batch_size.severity"].downcase.to_sym, "Traject::Indexer#process, read #{count} records at id:#{context.source_record_id}; #{'%.0f' % batch_rps}/s this batch, #{'%.0f' % overall_rps}/s overall")
  batch_start_time = Time.now
  end

- # we have to use this weird lambda to properly "capture" the count, instead
- # of having it be bound to the original variable in a non-threadsafe way.
- # This is confusing, I might not be understanding things properly, but that's where i am.
- #thread_pool.maybe_in_thread_pool &make_lambda(count, record, writer)
- thread_pool.maybe_in_thread_pool(record, settings, position) do |record, settings, position|
- context = Context.new(:source_record => record, :settings => settings, :position => position)
- context.logger = logger
+ # We pass context in a block arg to properly 'capture' it, so
+ # we don't accidentally share the local var under closure between
+ # threads.
+ thread_pool.maybe_in_thread_pool(context) do |context|
  map_to_context!(context)
  if context.skip?
  log_skip(context)
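The new comment in the hunk above explains that `context` is handed to the thread pool as a block argument so threads do not share one closed-over local. The same idea can be sketched with plain `Thread.new`, which also forwards its arguments to the block. This is an illustration only: it uses Ruby core threads rather than `Traject::ThreadPool`, and the outer `current` variable is an assumption introduced to make the sharing hazard visible.

```ruby
require "thread"

results = Queue.new
threads = []
current = nil

3.times do |i|
  # One outer variable, reassigned on every iteration.
  current = "record-#{i}"

  # Passing `current` as an argument binds its value at this moment to the
  # block-local parameter `captured`. Reassigning the outer variable later
  # cannot change what this thread sees.
  threads << Thread.new(current) do |captured|
    sleep 0.01
    results << captured
  end
end

threads.each(&:join)

seen = []
seen << results.pop until results.empty?
seen.sort # => ["record-0", "record-1", "record-2"]
```

Had the block read `current` directly instead of `captured`, every thread could wake up after the loop finished and observe only the last value.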
@@ -413,10 +487,7 @@ class Traject::Indexer
  end

  def writer_class
- unless defined? @writer_class
- @writer_class = qualified_const_get(settings["writer_class_name"])
- end
- return @writer_class
+ writer.class
  end

  # Instantiate a Traject Reader, using class set
@@ -427,7 +498,12 @@ class Traject::Indexer

  # Instantiate a Traject Writer, suing class set in #writer_class
  def writer!
- return writer_class.new(settings.merge("logger" => logger))
+ writer_class = @writer_class || qualified_const_get(settings["writer_class_name"])
+ writer_class.new(settings.merge("logger" => logger))
+ end
+
+ def writer
+ @writer ||= settings["writer"] || writer!
  end

  # Represents the context of a specific record being indexed, passed
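The `writer` / `writer!` pair in the hunk above resolves a writer in a fixed order: an object assigned with `#writer=`, then the `"writer"` setting, then an instance built from `#writer_class=` or `writer_class_name`. A self-contained sketch of that lookup order is below. All class names are hypothetical stand-ins, and `Object.const_get` is used here in place of traject's `qualified_const_get`.

```ruby
# Hypothetical writer classes, for illustration only.
class SketchDefaultWriter
  def initialize(settings); end
end

class SketchOtherWriter
  def initialize(settings); end
end

# Hypothetical stand-in that mirrors only the writer lookup logic.
class WriterSketchIndexer
  attr_writer :writer, :writer_class
  attr_reader :settings

  def initialize(settings = {})
    @settings = { "writer_class_name" => "SketchDefaultWriter" }.merge(settings)
  end

  # Build a fresh writer from an explicitly set class, else from the
  # class name stored in settings.
  def writer!
    klass = @writer_class || Object.const_get(settings["writer_class_name"])
    klass.new(settings)
  end

  # Precedence: #writer=, then settings["writer"], then a lazily built writer!.
  def writer
    @writer ||= settings["writer"] || writer!
  end

  def writer_class
    writer.class
  end
end

explicit = SketchOtherWriter.new({})

default_indexer = WriterSketchIndexer.new
setting_indexer = WriterSketchIndexer.new("writer" => explicit)

class_indexer = WriterSketchIndexer.new
class_indexer.writer_class = SketchOtherWriter

assigned = SketchDefaultWriter.new({})
assigned_indexer = WriterSketchIndexer.new("writer" => explicit)
assigned_indexer.writer = assigned
```

Because `writer` memoizes with `||=`, the first resolved writer sticks for the life of the indexer, and `writer_class` always reports the class of that resolved object.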
@@ -467,6 +543,26 @@ class Traject::Indexer
  @skip
  end

+ # Useful for describing a record in a log or especially
+ # error message. May be useful to combine with #position
+ # in output messages, especially since this method may sometimes
+ # return empty string if info on record id is not available.
+ #
+ # Returns MARC 001, then a slash, then output_hash["id"] -- if both
+ # are present. Otherwise may return just one, or even an empty string.
+ #
+ # Likely override this for a future XML or other source format version.
+ def source_record_id
+ marc_id = if self.source_record &&
+ self.source_record.kind_of?(MARC::Record) &&
+ self.source_record['001']
+ self.source_record['001'].value
+ end
+ output_id = self.output_hash["id"]
+
+ return [marc_id, output_id].compact.join("/")
+ end
+
  end


@@ -607,6 +703,37 @@ class Traject::Indexer
  end
  end

+ # Raised by #load_config_file when config file can not
+ # be processed.
+ #
+ # The exception #message includes an error message formatted
+ # for good display to the developer, in the console.
+ #
+ # Original exception raised when processing config file
+ # can be found in #original. Original exception should ordinarily
+ # have a good stack trace, including the file path of the config
+ # file in question.
+ #
+ # Original config path in #config_file, and line number in config
+ # file that triggered the exception in #config_file_lineno (may be nil)
+ #
+ # A filtered backtrace just DOWN from config file (not including trace
+ # from traject loading config file itself) can be found in
+ # #config_file_backtrace
+ class ConfigLoadError < StandardError
+ # We'd have #cause in ruby 2.1, filled out for us, but we want
+ # to work before then, so we use our own 'original'
+ attr_reader :original, :config_file, :config_file_lineno, :config_file_backtrace
+ def initialize(config_file_path, original_exception)
+ @original = original_exception
+ @config_file = config_file_path
+ @config_file_lineno = Traject::Util.backtrace_lineno_for_config(config_file_path, original_exception)
+ @config_file_backtrace = Traject::Util.backtrace_from_config(config_file_path, original_exception)
+ message = "Error loading configuration file #{self.config_file}:#{self.config_file_lineno} #{original_exception.class}:#{original_exception.message}"
+
+ super(message)
+ end
+ end



@@ -39,6 +39,9 @@ module Traject::Macros
  # to_field("title"), extract_marc("245abcd", :trim_punctuation => true)
  # to_field("id"), extract_marc("001", :first => true)
  # to_field("geo"), extract_marc("040a", :separator => nil, :translation_map => "marc040")
+ #
+ # If you'd like extract_marc functionality but you're not creating an indexer
+ # step, see Traject::Macros::Marc21.extract_marc_from module method.
  def extract_marc(spec, options = {})

  # Raise an error if there are any invalid options, indicating a
@@ -70,6 +73,26 @@ module Traject::Macros
  Marc21.apply_extraction_options(accumulator, options, translation_map)
  end
  end
+ module_function :extract_marc
+
+ # Convenience method when you want extract_marc behavior, but NOT
+ # to create a lambda for an Indexer step, but instead just give
+ # it a record directly and get back an array of values.
+ #
+ # array = Traject::Indexer::Marc21.extract_marc_from(record, "245ab", :trim_punctuation => true)
+ #
+ # If you have a Traject::Indexer::Context and want to pass it in, you can:
+ #
+ # array = Traject::Indexer::Marc21.extract_marc_from(record, "245ab", :trim_punctuation => true, :context => existing_context)
+ def self.extract_marc_from(record, spec, options = {})
+ output = []
+ # Nil context works, but if caller wants to pass one in
+ # for better error reporting that's cool too.
+ context = options.delete(:context) || nil
+
+ extract_marc(spec, options).call(record, output, context)
+ return output
+ end

  # Side-effect the accumulator with the options
  def self.apply_extraction_options(accumulator, options, translation_map=nil)
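`extract_marc_from` in the hunk above works by building the usual indexer-step lambda and calling it once with a fresh accumulator. The shape of that wrapper can be sketched without traject or MARC at all. In the example below, `sketch_upcase_field` and `sketch_upcase_from` are hypothetical names, and a plain Hash stands in for the record.

```ruby
# Hypothetical macro in the style of an indexer step: it returns a lambda
# that appends values to the accumulator it is given.
def sketch_upcase_field(key)
  lambda do |record, accumulator, _context|
    value = record[key]
    accumulator << value.upcase if value
  end
end

# Convenience wrapper: run the lambda once against a record and hand back
# the collected values instead of registering an indexer step.
def sketch_upcase_from(record, key, options = {})
  output  = []
  context = options[:context] # nil is acceptable in this sketch
  sketch_upcase_field(key).call(record, output, context)
  output
end

sketch_upcase_from({ "title" => "manufacturing consent" }, "title")
# => ["MANUFACTURING CONSENT"]
```

The test added later in this diff calls the real method as `Traject::Macros::Marc21.extract_marc_from(@record, "245ab", :trim_punctuation => true)`.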
@@ -2,10 +2,10 @@ module Traject
  module Macros
  # To use the marc_format macro, in your configuration file:
  #
- # require 'traject/macros/marc_formats
+ # require 'traject/macros/marc_format_classifier'
  # extend Traject::Macros::MarcFormats
  #
- # to_field("format_s") marc_formats
+ # to_field "format", marc_formats
  #
  # See also MarcClassifier which can be used directly for a bit more
  # control.
@@ -144,11 +144,11 @@ class Traject::SolrJsonWriter

  if exception || resp.status != 200
  if exception
- msg = Traject::Util.exception_to_log_message(e)
+ msg = Traject::Util.exception_to_log_message(exception)
  else
  msg = "Solr error response: #{resp.status}: #{resp.body}"
  end
- logger.error "Could not add record #{record_id_from_context c} at source file position #{c.position}: #{msg}"
+ logger.error "Could not add record #{c.source_record_id} at source file position #{c.position}: #{msg}"
  logger.debug(c.source_record.to_s)

  @skipped_record_incrementer.increment
@@ -166,20 +166,6 @@ class Traject::SolrJsonWriter
  settings["logger"] ||= Yell.new(STDERR, :level => "gt.fatal") # null logger
  end

- # Returns MARC 001, then a slash, then output_hash["id"] -- if both
- # are present. Otherwise may return just one, or even an empty string.
- def record_id_from_context(context)
- marc_id = if context.source_record &&
- context.source_record.kind_of?(MARC::Record) &&
- context.source_record['001']
- context.source_record['001'].value
- end
- output_id = context.output_hash["id"]
-
- return [marc_id, output_id].compact.join("/")
- end
-
-
  # On close, we need to (a) raise any exceptions we might have, (b) send off
  # the last (possibly empty) batch, and (c) commit if instructed to do so
  # via the solr_writer.commit_on_close setting.
@@ -26,6 +26,81 @@ module Traject
  str.split(':in `').first
  end

+ # Provide a config source file path, and an exception.
+ #
+ # Returns the line number from the first line in the stack
+ # trace of the exception that matches your file path.
+ # of the first line in the backtrace matching that file_path.
+ #
+ # Returns `nil` if no suitable backtrace line can be found.
+ #
+ # Has special logic to try and grep the info out of a SyntaxError, bah.
+ def self.backtrace_lineno_for_config(file_path, exception)
+ # For a SyntaxError, we really need to grep it from the
+ # exception message, it really appears to be nowhere else. Ugh.
+ if exception.kind_of? SyntaxError
+ if exception.message =~ /:(\d+):/
+ return $1.to_i
+ end
+ end
+
+ # Otherwise we try to fish it out of the backtrace, first
+ # line matching the config file path.
+
+ # exception.backtrace_locations exists in MRI 2.1+, which makes
+ # our task a lot easier. But not yet in JRuby 1.7.x, so we got to
+ # handle the old way of having to parse the strings in backtrace too.
+ if ( exception.respond_to?(:backtrace_locations) &&
+ exception.backtrace_locations &&
+ exception.backtrace_locations.length > 0 )
+ location = exception.backtrace_locations.find do |bt|
+ bt.path == file_path
+ end
+ return location ? location.lineno : nil
+ else # have to parse string backtrace
+ exception.backtrace.each do |line|
+ if line.start_with?(file_path)
+ return $1.to_i if line =~ /\A.*\:(\d+)\:in/
+ break
+ end
+ end
+ # if we got here, we have nothing
+ return nil
+ end
+ end
+
+ # Extract just the part of the backtrace that is "below"
+ # the config file mentioned. If we can't find the config file
+ # in the stack trace, we might return empty array.
+ #
+ # If the ruby supports Exception#backtrace_locations, the
+ # returned array will actually be of Thread::Backtrace::Location elements.
+ def self.backtrace_from_config(file_path, exception)
+ filtered_trace = []
+ found = false
+
+ # MRI 2.1+ has exception.backtrace_locations which makes
+ # this a lot easier, but JRuby 1.7.x doesn't yet, so we
+ # need to do it both ways.
+ if ( exception.respond_to?(:backtrace_locations) &&
+ exception.backtrace_locations &&
+ exception.backtrace_locations.length > 0 )
+
+ exception.backtrace_locations.each do |location|
+ filtered_trace << location
+ (found=true and break) if location.path == file_path
+ end
+ else
+ filtered_trace = []
+ exception.backtrace.each do |line|
+ filtered_trace << line
+ (found=true and break) if line.start_with?(file_path)
+ end
+ end
+
+ return found ? filtered_trace : []
+ end
+


  # Ruby stdlib queue lacks a 'drain' function, we write one.
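The string-parsing fallback in `backtrace_lineno_for_config` above can be tried in isolation. The sketch below reimplements only that branch on a hand-written backtrace. `sketch_lineno_for` is a hypothetical helper, and every path in `trace` is made up for illustration.

```ruby
# Returns the line number of the first backtrace entry that starts with
# file_path, or nil when no entry matches. Works on plain string backtraces.
def sketch_lineno_for(file_path, backtrace)
  backtrace.each do |line|
    next unless line.start_with?(file_path)
    # Entries look like "path:LINE:in `method'"; capture LINE.
    return Regexp.last_match(1).to_i if line =~ /\A.*:(\d+):in/
    break
  end
  nil
end

# Invented backtrace, innermost frame first, as Ruby reports them.
trace = [
  "/gems/traject/lib/traject/macros/marc21.rb:50:in `extract_marc'",
  "/etc/traject/config.rb:3:in `block in load_config_file'",
  "/gems/traject/lib/traject/indexer.rb:193:in `instance_eval'"
]

sketch_lineno_for("/etc/traject/config.rb", trace) # => 3
sketch_lineno_for("/nowhere.rb", trace)            # => nil
```

Scanning from the innermost frame outward means the reported line is the place in the config file that led to the failure, even when the exception itself was raised deeper inside library code.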
@@ -1,3 +1,3 @@
  module Traject
- VERSION = "2.0.2"
+ VERSION = "2.1.0"
  end
@@ -0,0 +1,35 @@
+ require 'test_helper'
+
+ describe "Traject::Indexer::Context" do
+
+ describe "source_record_id" do
+ before do
+ @record = MARC::Reader.new(support_file_path('test_data.utf8.mrc')).first
+ @context = Traject::Indexer::Context.new
+ @record_001 = " 00282214 " # from the mrc file
+ end
+
+ it "gets it from 001" do
+ @context.source_record = @record
+ assert_equal @record_001, @context.source_record_id
+ end
+
+ it "gets it from the id" do
+ @context.output_hash['id'] = 'the_record_id'
+ assert_equal 'the_record_id', @context.source_record_id
+ end
+
+ it "gets from the id with non-MARC source" do
+ @context.source_record = Object.new
+ @context.output_hash['id'] = 'the_record_id'
+ assert_equal 'the_record_id', @context.source_record_id
+ end
+
+ it "gets it from both 001 and id" do
+ @context.output_hash['id'] = 'the_record_id'
+ @context.source_record = @record
+ assert_equal [@record_001, 'the_record_id'].join('/'), @context.source_record_id
+ end
+ end
+
+ end
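The four cases tested above all come down to the `[marc_id, output_id].compact.join("/")` expression in `source_record_id`. A standalone sketch of that expression is below. `sketch_record_id` is a hypothetical helper that takes the two ids directly instead of reading them from a context.

```ruby
# Joins whichever ids are present with a slash; nil entries are dropped
# by #compact before joining, so a lone id gets no stray separator.
def sketch_record_id(marc_id, output_id)
  [marc_id, output_id].compact.join("/")
end

sketch_record_id("00282214", "the_record_id") # => "00282214/the_record_id"
sketch_record_id(nil, "the_record_id")        # => "the_record_id"
sketch_record_id("00282214", nil)             # => "00282214"
sketch_record_id(nil, nil)                    # => ""
```

The empty-string result for two missing ids is why the doc comment suggests pairing the id with the record position in log messages.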
@@ -0,0 +1,89 @@
+ require 'test_helper'
+ require 'tempfile'
+
+ describe "Traject::Indexer#load_config_path" do
+ before do
+ @indexer = Traject::Indexer.new
+ end
+
+ describe "with bad path" do
+ it "raises ENOENT on non-existing path" do
+ assert_raises(Errno::ENOENT) { @indexer.load_config_file("does/not/exist.rb") }
+ end
+ it "raises EACCES on non-readable path" do
+ file = Tempfile.new('traject_test')
+ FileUtils.chmod("ugo-r", file.path)
+
+ assert_raises(Errno::EACCES) { @indexer.load_config_file(file.path) }
+
+ file.unlink
+ end
+ end
+
+ describe "with good config" do
+ before do
+ @config_file = tmp_config_file_with(%Q{
+ settings do
+ provide "our_key", "our_value"
+ end
+ to_field "literal", literal("literal")
+ })
+ end
+ after do
+ @config_file.unlink
+ end
+ it "loads config file by path" do
+ @indexer.load_config_file(@config_file.path)
+
+ assert_equal "our_value", @indexer.settings["our_key"]
+ end
+ end
+
+ describe "with error in config" do
+ after do
+ @config_file.unlink if @config_file
+ end
+
+ it "raises good error on SyntaxError type" do
+ @config_file = tmp_config_file_with(%Q{
+ puts "foo"
+ # Intentional syntax error missing comma
+ to_field "foo" extract_marc("245")
+ })
+
+ e = assert_raises(Traject::Indexer::ConfigLoadError) do
+ @indexer.load_config_file(@config_file.path)
+ end
+
+ assert_kind_of SyntaxError, e.original
+ assert_equal @config_file.path, e.config_file
+ assert_equal 4, e.config_file_lineno
+ end
+
+ it "raises good error on StandardError type" do
+ @config_file = tmp_config_file_with(%Q{
+ # Intentional non-syntax error, bad extract_marc spec
+ to_field "foo", extract_marc("#%^%^%^")
+ })
+
+ e = assert_raises(Traject::Indexer::ConfigLoadError) do
+ @indexer.load_config_file(@config_file.path)
+ end
+
+ assert_kind_of StandardError, e.original
+ assert_equal @config_file.path, e.config_file
+ assert_equal 3, e.config_file_lineno
+ end
+ end
+
+
+ def tmp_config_file_with(str)
+ file = Tempfile.new('traject_test_config')
+ file.write(str)
+ file.rewind
+
+ return file
+ end
+
+
+ end
@@ -118,6 +118,11 @@ describe "Traject::Macros::Marc21" do
  end
  end

+ it "supports #extract_marc_from module method" do
+ output_arr = ::Traject::Macros::Marc21.extract_marc_from(@record, "245ab", :trim_punctuation => true)
+ assert_equal ["Manufacturing consent : the political economy of the mass media"], output_arr
+ end
+
  describe "serialized_marc" do
  it "serializes xml" do
  @indexer.instance_eval do
@@ -0,0 +1,54 @@
+ require 'test_helper'
+ require 'traject/yaml_writer'
+
+ describe "The writer on Traject::Indexer" do
+   let(:indexer) { Traject::Indexer.new("solr.url" => "http://example.com") }
+
+   it "has a default" do
+     assert_instance_of Traject::SolrJsonWriter, indexer.writer
+     assert_equal Traject::SolrJsonWriter, indexer.writer_class
+   end
+
+   describe "when the writer is set in config" do
+     let(:writer) { Traject::YamlWriter.new({}) }
+
+     let(:indexer) { Traject::Indexer.new(
+       "solr.url" => "http://example.com",
+       "writer_class" => 'Traject::SolrJsonWriter',
+       "writer" => writer
+     )}
+
+     it "uses writer from config" do
+       assert_equal writer, indexer.writer
+       assert_equal writer.class, indexer.writer_class
+     end
+   end
+
+   describe "when writer_class is set directly" do
+     let(:writer_class) { Traject::YamlWriter }
+
+     before do
+       indexer.writer_class = writer_class
+     end
+
+     it "uses writer_class set directly" do
+       assert_kind_of writer_class, indexer.writer
+       assert_equal writer_class, indexer.writer_class
+     end
+
+   end
+
+   describe "when the writer is set directly" do
+     let(:writer) { Traject::YamlWriter.new({}) }
+
+     before do
+       indexer.writer = writer
+     end
+
+     it "uses the set value" do
+       assert_equal writer, indexer.writer
+       assert_equal writer.class, indexer.writer_class
+     end
+   end
+
+ end
@@ -215,34 +215,5 @@ describe "Traject::SolrJsonWriter" do
        assert_equal "http://example.com/solr/update", @writer.determine_solr_update_url
      end
    end
-
-   describe "Record id from context" do
-     before do
-       @record = MARC::Reader.new(support_file_path('test_data.utf8.mrc')).first
-       @context = Traject::Indexer::Context.new
-       @writer = create_writer
-       @record_001 = " 00282214 " # from the mrc file
-     end
-
-     it "gets it from 001" do
-       @context.source_record = @record
-       assert_equal @record_001, @writer.record_id_from_context(@context)
-     end
-
-     it "gets it from the id" do
-       @context.output_hash['id'] = 'the_record_id'
-       assert_equal 'the_record_id', @writer.record_id_from_context(@context)
-     end
-
-     it "gets it from both 001 and id" do
-       @context.output_hash['id'] = 'the_record_id'
-       @context.source_record = @record
-       assert_equal [@record_001, 'the_record_id'].join('/'), @writer.record_id_from_context(@context)
-     end
-
-
-
-   end
-

  end
@@ -9,7 +9,7 @@ Gem::Specification.new do |spec|
    spec.authors = ["Jonathan Rochkind", "Bill Dueber"]
    spec.email = ["none@nowhere.org"]
    spec.summary = %q{Index MARC to Solr; or generally process source records to hash-like structures}
-   spec.homepage = "http://github.com/traject-project/traject"
+   spec.homepage = "http://github.com/traject/traject"
    spec.license = "MIT"

    spec.files = `git ls-files`.split($/)
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: traject
  version: !ruby/object:Gem::Version
-   version: 2.0.2
+   version: 2.1.0
  platform: ruby
  authors:
  - Jonathan Rochkind
@@ -9,7 +9,7 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2015-02-20 00:00:00.000000000 Z
+ date: 2015-07-02 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: concurrent-ruby
@@ -233,7 +233,9 @@ files:
  - lib/translation_maps/marc_languages.yaml
  - test/debug_writer_test.rb
  - test/delimited_writer_test.rb
+ - test/indexer/context_test.rb
  - test/indexer/each_record_test.rb
+ - test/indexer/load_config_file_test.rb
  - test/indexer/macros_marc21_semantics_test.rb
  - test/indexer/macros_marc21_test.rb
  - test/indexer/macros_test.rb
@@ -241,6 +243,7 @@ files:
  - test/indexer/read_write_test.rb
  - test/indexer/settings_test.rb
  - test/indexer/to_field_test.rb
+ - test/indexer/writer_test.rb
  - test/marc_extractor_test.rb
  - test/marc_format_classifier_test.rb
  - test/marc_reader_test.rb
@@ -287,7 +290,7 @@ files:
  - test/translation_maps/translate_array_test.yaml
  - test/translation_maps/yaml_map.yaml
  - traject.gemspec
- homepage: http://github.com/traject-project/traject
+ homepage: http://github.com/traject/traject
  licenses:
  - MIT
  metadata: {}
@@ -314,7 +317,9 @@ summary: Index MARC to Solr; or generally process source records to hash-like st
  test_files:
  - test/debug_writer_test.rb
  - test/delimited_writer_test.rb
+ - test/indexer/context_test.rb
  - test/indexer/each_record_test.rb
+ - test/indexer/load_config_file_test.rb
  - test/indexer/macros_marc21_semantics_test.rb
  - test/indexer/macros_marc21_test.rb
  - test/indexer/macros_test.rb
@@ -322,6 +327,7 @@ test_files:
  - test/indexer/read_write_test.rb
  - test/indexer/settings_test.rb
  - test/indexer/to_field_test.rb
+ - test/indexer/writer_test.rb
  - test/marc_extractor_test.rb
  - test/marc_format_classifier_test.rb
  - test/marc_reader_test.rb