traject 2.1.0-java → 2.2.0-java

data/doc/settings.md CHANGED
@@ -25,80 +25,49 @@ settings are applied first of all. It's recommended you use `provide`.
 
  ## Known settings
 
- * `debug_ascii_progress`: true/'true' to print ascii characters to STDERR indicating progress. Note,
- yes, this is fixed to STDERR, regardless of your logging setup.
-     * `.` for every batch of records read and parsed
-     * `^` for every batch of records batched and queued for adding to solr
-       (possibly in thread pool)
-     * `%` for completing of a Solr 'add'
-     * `!` when threadpool for solr add has a full queue, so solr add is
-       going to happen in calling queue -- means solr adding can't
-       keep up with production.
+ * `debug_ascii_progress`: true/'true' to print ascii characters to STDERR indicating progress. Yes, this is fixed to STDERR, regardless of your logging setup.
+     * `.` for every batch of records read and parsed
+     * `^` for every batch of records batched and queued for adding to solr (possibly in thread pool)
+     * `%` for completing of a Solr 'add'
+     * `!` when threadpool for solr add has a full queue, so solr add is going to happen in calling queue -- means solr adding can't keep up with production.
 
  * `json_writer.pretty_print`: used by the JsonWriter, if set to true, will output pretty printed json (with added whitespace) for easier human readability. Default false.
 
  * `log.file`: filename to send logging, or 'STDOUT' or 'STDERR' for those streams. Default STDERR
 
- * `log.error_file`: Default nil, if set then all log lines of ERROR and higher will be _additionally_
- sent to error file named.
+ * `log.error_file`: Default nil, if set then all log lines of ERROR and higher will be _additionally_ sent to error file named.
 
  * `log.format`: Formatting string used by Yell logger. https://github.com/rudionrails/yell/wiki/101-formatting-log-messages
 
- * `log.level`: Log this level and above. Default 'info', set to eg 'debug' to get potentially more logging info,
- or 'error' to get less. https://github.com/rudionrails/yell/wiki/101-setting-the-log-level
+ * `log.level`: Log this level and above. Default 'info', set to eg 'debug' to get potentially more logging info, or 'error' to get less. https://github.com/rudionrails/yell/wiki/101-setting-the-log-level
 
- * `log.batch_size`: If set to a number N (or string representation), will output a progress line to
- log. (by default as INFO, but see log.batch_size.severity)
+ * `log.batch_size`: If set to a number N (or string representation), will output a progress line to log. (by default as INFO, but see log.batch_size.severity)
 
  * `log.batch_size.severity`: If `log.batch_size` is set, what logger severity level to log to. Default "INFO", set to "DEBUG" etc if desired.
 
  * `marc_source.type`: default 'binary'. Can also set to 'xml' or (not yet implemented todo) 'json'. Command line shortcut `-t`
 
- * `marcout.allow_oversized`: Used with `-x marcout` command to output marc when outputting
- as ISO 2709 binary, set to true or string "true", and the MARC::Writer will have
- allow_oversized=true set, allowing oversized records to be serialized with length
- bytes zero'd out -- technically illegal, but can be read by MARC::Reader in permissive mode.
+ * `marcout.allow_oversized`: Used with `-x marcout` command to output marc when outputting as ISO 2709 binary, set to true or string "true", and the MARC::Writer will have allow_oversized=true set, allowing oversized records to be serialized with length bytes zero'd out -- technically illegal, but can be read by MARC::Reader in permissive mode.
 
- * `output_file`: Output file to write to for operations that write to files: For instance the `marcout` command,
- or Writer classes that write to files, like Traject::JsonWriter. Has an shortcut
- `-o` on command line.
+ * `output_file`: Output file to write to for operations that write to files: For instance the `marcout` command, or Writer classes that write to files, like Traject::JsonWriter. Has an shortcut `-o` on command line.
 
- * `processing_thread_pool` Number of threads in the main thread pool used for processing
- records with input rules. On JRuby or Rubinius, defaults to 1 less than the number of processors detected on your machine. On other ruby platforms, defaults to 1. Set to 0 or nil
- to disable thread pool, and do all processing in main thread.
+ * `processing_thread_pool` Number of threads in the main thread pool used for processing records with input rules. On JRuby or Rubinius, defaults to 1 less than the number of processors detected on your machine. On other ruby platforms, defaults to 1. Set to 0 or nil to disable thread pool, and do all processing in main thread.
 
- Choose a pool size based on size of your machine, and complexity of your indexing rules, you
- might want to try different sizes and measure which works best for you.
- Probably no reason for it ever to be more than number of cores on indexing machine.
+ Choose a pool size based on size of your machine, and complexity of your indexing rules, you might want to try different sizes and measure which works best for you. Probably no reason for it ever to be more than number of cores on indexing machine.
 
 
- * `reader_class_name`: a Traject Reader class, used by the indexer as a source
- of records. Defaults to Traject::Marc4JReader (using the Java Marc4J
- library) on JRuby; Traject::MarcReader (using the ruby marc gem) otherwise.
- Command-line shortcut `-r`
+ * `reader_class_name`: a Traject Reader class, used by the indexer as a source of records. Defaults to Traject::Marc4JReader (using the Java Marc4J library) on JRuby; Traject::MarcReader (using the ruby marc gem) otherwise. Command-line shortcut `-r`
 
  * `solr.url`: URL to connect to a solr instance for indexing, eg http://example.org:8983/solr . Command-line short-cut `-u`.
 
- * `solr.version`: Set to eg "1.4.0", "4.3.0"; currently un-used, but in the future will control
- change some default settings, and/or sanity check and warn you if you're doing something
- that might not work with that version of solr. Set now for help in the future.
+ * `solr.version`: Set to eg "1.4.0", "4.3.0"; currently un-used, but in the future will control some default settings, and/or sanity check and warn you if you're doing something that might not work with that version of solr. Set now for help in the future.
 
- * `solr_writer.batch_size`: size of batches that SolrJsonWriter will send docs to Solr in. Default 100. Set to nil,
- 0, or 1, and SolrJsonWriter will do one http transaction per document, no batching.
+ * `solr_writer.batch_size`: size of batches that SolrJsonWriter will send docs to Solr in. Default 100. Set to nil, 0, or 1, and SolrJsonWriter will do one http transaction per document, no batching.
 
  * `solr_writer.commit_on_close`: default false, set to true to have the solr writer send an explicit commit message to Solr after indexing.
 
+ * `solr_writer.thread_pool`: defaults to 1 (single bg thread). A thread pool is used for submitting docs to solr. Set to 0 or nil to disable threading. Set to 1, there will still be a single bg thread doing the adds. May make sense to set higher than number of cores on your indexing machine, as these threads will mostly be waiting on Solr. Speed/capacity of your solr might be more relevant. Note that processing_thread_pool threads can end up submitting to solr too, if solr_json_writer.thread_pool is full.
 
- * `solr_writer.thread_pool`: Defaults to 1 (single bg thread). A thread pool is used for submitting docs
- to solr. Set to 0 or nil to disable threading. Set to 1,
- there will still be a single bg thread doing the adds.
- May make sense to set higher than number of cores on your
- indexing machine, as these threads will mostly be waiting
- on Solr. Speed/capacity of your solr might be more relevant.
- Note that processing_thread_pool threads can end up submitting
- to solr too, if solr_json_writer.thread_pool is full.
-
- * `writer`: An object that implements the Traject Writer interface. If set, takes precedence
- over `writer_class_name`.
+ * `writer`: An object that implements the Traject Writer interface. If set, takes precedence over `writer_class_name`.
 
  * `writer_class_name`: a Traject Writer class, used by indexer to send processed dictionaries off. Will be used if no explicit `writer` setting or `#writer=` is set. Default Traject::SolrJsonWriter, other writers for debugging or writing to files are also available. See Traject::Indexer for more info. Command line shortcut `-w`
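
The hunk header above recommends `provide` because settings given earlier (for instance on the command line) are applied first. A minimal sketch of that idea follows, assuming `provide` only stores a value when the key is still unset. `SketchSettings` is a hypothetical stand-in written for this illustration, not traject's own settings class, and the URLs are placeholders.

```ruby
# Hypothetical stand-in for a settings hash; NOT traject's actual class.
class SketchSettings < Hash
  # Store the value only when the key has not been set yet (assumed behavior).
  def provide(key, value)
    self[key] = value unless key?(key)
    self[key]
  end
end

settings = SketchSettings.new

# Pretend this value arrived earlier, e.g. from the `-u` command-line shortcut.
settings["solr.url"] = "http://localhost:8983/solr"

# Config-file defaults: these only fill in keys that are still missing.
settings.provide "solr.url", "http://example.org:8983/solr"
settings.provide "solr_writer.batch_size", 100
settings.provide "log.batch_size", 10_000
```

Under that assumption the earlier `solr.url` survives, while the two keys that were never set pick up the values given to `provide`.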
@@ -32,14 +32,40 @@ require 'traject/line_writer'
  # provide "output_file", "out.txt"
  # end
  class Traject::DebugWriter < Traject::LineWriter
-   DEFAULT_FORMAT = '%-12s %-25s %s'
    DEFAULT_IDFIELD = 'id'
+   DEFAULT_FORMAT = '%-12s %-25s %s'
+
+   def initialize(*)
+     super
+     @idfield = settings["debug_writer.idfield"] || DEFAULT_IDFIELD
+     @format = settings['debug_writer.format'] || DEFAULT_FORMAT
+
+     if @idfield == 'record_position' then
+       @use_position = true
+     end
+
+     @already_threw_warning_about_missing_id = false
+
+   end
+
+   def record_number(context)
+     return context.position if @use_position
+     if context.output_hash.has_key?(@idfield)
+       context.output_hash[@idfield].first
+     else
+       unless @already_threw_warning_about_missing_id
+         context.logger.warn "At least one record (##{context.position}) doesn't define field '#{@idfield}'.
+ All records are assumed to have a unique id. You can set which field to look in via the setting 'debug_writer.idfield'"
+         @already_threw_warning_about_missing_id = true
+       end
+       "record_num_#{context.position}"
+     end
+   end
 
    def serialize(context)
-     idfield = settings["debug_writer.idfield"] || DEFAULT_IDFIELD
-     format = settings['debug_writer.format'] || DEFAULT_FORMAT
-     h = context.output_hash
-     lines = h.keys.sort.map {|k| format % [h[idfield].first, k, h[k].join(' | ')] }
+     h = context.output_hash
+     rec_key = record_number(context)
+     lines = h.keys.sort.map { |k| @format % [rec_key, k, h[k].join(' | ')] }
      lines.push "\n"
      lines.join("\n")
    end
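
The new `record_number` fallback in this hunk can be tried outside traject with a small stand-alone sketch. `SketchContext` and `SketchDebugFormatter` are hypothetical names invented here, and the warning text is shortened; only the control flow mirrors the added lines.

```ruby
require 'logger'
require 'stringio'

# Hypothetical stand-in for the context object passed to a writer.
SketchContext = Struct.new(:output_hash, :position, :logger)

class SketchDebugFormatter
  DEFAULT_IDFIELD = 'id'
  DEFAULT_FORMAT  = '%-12s %-25s %s'

  def initialize(settings = {})
    @idfield      = settings['debug_writer.idfield'] || DEFAULT_IDFIELD
    @format       = settings['debug_writer.format'] || DEFAULT_FORMAT
    @use_position = (@idfield == 'record_position')
    @warned       = false
  end

  # Prefer the position when asked to, then the id field, then a synthetic key.
  def record_number(context)
    return context.position if @use_position
    return context.output_hash[@idfield].first if context.output_hash.key?(@idfield)

    unless @warned # warn only once per run
      context.logger.warn "Record ##{context.position} has no '#{@idfield}' field"
      @warned = true
    end
    "record_num_#{context.position}"
  end

  def serialize(context)
    h   = context.output_hash
    key = record_number(context)
    lines = h.keys.sort.map { |k| @format % [key, k, h[k].join(' | ')] }
    lines.push "\n"
    lines.join("\n")
  end
end

log    = StringIO.new
ctx    = SketchContext.new({ 'title' => ['Moby Dick'] }, 7, Logger.new(log))
output = SketchDebugFormatter.new.serialize(ctx)
by_pos = SketchDebugFormatter.new('debug_writer.idfield' => 'record_position').record_number(ctx)
```

Here the record has no `id` field, so `output` starts with the synthetic key `record_num_7` and one warning is logged; with `debug_writer.idfield` set to `record_position`, the position itself is used as the key.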
@@ -8,6 +8,8 @@ require 'traject/indexer/settings'
  require 'traject/marc_reader'
  require 'traject/json_writer'
  require 'traject/solr_json_writer'
+ require 'traject/debug_writer'
+
 
  require 'traject/macros/marc21'
  require 'traject/macros/basic'
@@ -98,7 +100,7 @@ end
  #
  # This may raise if the file is not readable. Or if the config file
  # can't be evaluated, it will raise a Traject::Indexer::ConfigLoadError
- # with a bunch of contextual information useful to reporting to developer.
+ # with a bunch of contextual information useful to reporting to developer.
  #
  # You can also instead, or in addition, write configuration inline using
  # standard ruby `instance_eval`:
@@ -704,15 +706,15 @@ class Traject::Indexer
  end
 
  # Raised by #load_config_file when config file can not
- # be processed.
+ # be processed.
  #
  # The exception #message includes an error message formatted
- # for good display to the developer, in the console.
+ # for good display to the developer, in the console.
  #
  # Original exception raised when processing config file
  # can be found in #original. Original exception should ordinarily
  # have a good stack trace, including the file path of the config
- # file in question.
+ # file in question.
  #
  # Original config path in #config_file, and line number in config
  # file that triggered the exception in #config_file_lineno (may be nil)
@@ -1,3 +1,5 @@
+ require 'traject/marc_extractor_spec'
+
  module Traject
    # MarcExtractor is a class for extracting lists of strings from a MARC::Record,
    # according to specifications. See #parse_string_spec for description of string
@@ -36,7 +38,7 @@ module Traject
    # and includes a tag and a a byte slice specification.
    #
    # "008[35-37]:007[5]""
-   # => bytes 35-37 inclusive of any field 008, and byte 5 of any field 007
+   # => bytes 35-37 inclusive of any field 008, and byte 5 of any field 007
    #
    # * subfields and indicators can only be provided for marc data/variable fields
    # * byte slice can only be provided for marc control fields (generally tags less than 010)
@@ -105,7 +107,9 @@ module Traject
    # lazily create and then re-use a MarcExtractor object with
    # particular initialization arguments.
    class MarcExtractor
-     attr_accessor :options, :spec_hash
+     attr_accessor :options, :spec_set
+
+     ALTERNATE_SCRIPT_TAG = '880'
 
      # First arg is a specification for extraction of data from a MARC record.
      # Specification can be given in two forms:
@@ -126,30 +130,48 @@ module Traject
      # * :only => only include linked 880s, not original
      def initialize(spec, options = {})
        self.options = {
-         :separator => ' ',
-         :alternate_script => :include
+         :separator => ' ',
+         :alternate_script => :include
        }.merge(options)
 
-       self.spec_hash = spec.kind_of?(Hash) ? spec : self.class.parse_string_spec(spec)
+       self.spec_set = SpecSet.new(spec)
 
 
        # Tags are "interesting" if we have a spec that might cover it
-       @interesting_tags_hash = {}
-
        # By default, interesting tags are those represented by keys in spec_hash.
        # Add them unless we only care about alternate scripts.
        unless options[:alternate_script] == :only
-         self.spec_hash.keys.each {|tag| @interesting_tags_hash[tag] = true}
+         self.spec_set.tags.each { |tag| show_interest_in_tag(tag) }
        end
 
        # If we *are* interested in alternate scripts, add the 880
        if options[:alternate_script] != false
-         @interesting_tags_hash['880'] = true
+         @fetch_alternate_script = true
+         show_interest_in_tag(ALTERNATE_SCRIPT_TAG)
        end
 
        self.freeze
      end
 
+
+     # Declare that we're interested in a tag
+     def show_interest_in_tag(tag)
+       @interesting_tags_hash ||= {}
+       @interesting_tags_hash[tag] = true
+     end
+
+     # Check to see if a tag is interesting (meaning it may be covered by a spec
+     # and the passed-in options about alternate scripts)
+     def interesting_tag?(tag)
+       return @interesting_tags_hash.include?(tag)
+     end
+
+     # All the "interesting" tags
+     def interesting_tags
+       @interesting_tags_hash.keys
+     end
+
+
      # Takes the same arguments as MarcExtractor.new, but will re-use an existing
      # cached MarcExtractor already created with given initialization arguments,
      # if available.
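
The way the `:alternate_script` option shapes the set of "interesting" tags in this hunk can be sketched in isolation. The method name `sketch_interesting_tags` and its plain-array input are assumptions for illustration; in the diff the tags come from `spec_set.tags`.

```ruby
ALTERNATE_SCRIPT_TAG = '880'

# spec_tags: tags named in the extraction spec, e.g. ["245", "700"].
# alternate_script: :include, :only, or false, as in the options above.
def sketch_interesting_tags(spec_tags, alternate_script = :include)
  interesting = {}

  # Tags from the spec count unless only alternate scripts are wanted.
  spec_tags.each { |tag| interesting[tag] = true } unless alternate_script == :only

  # The 880 tag counts unless alternate scripts are switched off entirely.
  interesting[ALTERNATE_SCRIPT_TAG] = true if alternate_script != false

  interesting.keys
end

with_include = sketch_interesting_tags(%w[245 700])
only_880     = sketch_interesting_tags(%w[245 700], :only)
no_alternate = sketch_interesting_tags(%w[245 700], false)
```

With `:include` the result holds the spec tags plus `880`, with `:only` just `880`, and with `false` just the spec tags.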
@@ -169,80 +191,11 @@ module Traject
      # extractor = MarcExtractor.cached("245abc:700a", :separator => nil)
      def self.cached(*args)
        cache = (Thread.current[:marc_extractor_cached] ||= Hash.new)
-       return ( cache[args] ||= Traject::MarcExtractor.new(*args).freeze )
+       return (cache[args] ||= Traject::MarcExtractor.new(*args).freeze)
      end
 
-     # Check to see if a tag is interesting (meaning it may be covered by a spec
-     # and the passed-in options about alternate scripts)
-     def interesting_tag?(tag)
-       return @interesting_tags_hash.include?(tag)
-     end
-
-
-     # Converts from a string marc spec like "008[35]:245abc:700a" to a hash used internally
-     # to represent the specification. See comments at head of class for
-     # documentation of string specification format.
-     #
-     #
-     # ## Return value
-     #
-     # The hash returned is keyed by tag, and has as values an array of 0 or
-     # or more MarcExtractor::Spec objects representing the specified extraction
-     # operations for that tag.
-     #
-     # It's an array of possibly more than one, because you can specify
-     # multiple extractions on the same tag: for instance "245a:245abc"
-     #
-     # See tests for more examples.
-     def self.parse_string_spec(spec_string)
-       # hash defaults to []
-       hash = Hash.new
-
-       spec_strings = spec_string.is_a?(Array) ? spec_string.map{|s| s.split(/\s*:\s*/)}.flatten : spec_string.split(/s*:\s*/)
-
-       spec_strings.each do |part|
-         if (part =~ /\A([a-zA-Z0-9]{3})(\|([a-z0-9\ \*]{2})\|)?([a-z0-9]*)?\Z/)
-           # variable field
-           tag, indicators, subfields = $1, $3, $4
-
-           spec = Spec.new(:tag => tag)
-
-           if subfields and !subfields.empty?
-             spec.subfields = subfields.split('')
-           end
-
-           if indicators
-             # if specified as '*', leave nil
-             spec.indicator1 = indicators[0] if indicators[0] != "*"
-             spec.indicator2 = indicators[1] if indicators[1] != "*"
-           end
-
-           hash[spec.tag] ||= []
-           hash[spec.tag] << spec
 
-         elsif (part =~ /\A([a-zA-Z0-9]{3})(\[(\d+)(-(\d+))?\])\Z/) # control field, "005[4-5]"
-           tag, byte1, byte2 = $1, $3, $5
-
-           spec = Spec.new(:tag => tag)
-
-           if byte1 && byte2
-             spec.bytes = ((byte1.to_i)..(byte2.to_i))
-           elsif byte1
-             spec.bytes = byte1.to_i
-           end
-
-           hash[spec.tag] ||= []
-           hash[spec.tag] << spec
-         else
-           raise ArgumentError.new("Unrecognized marc extract specification: #{part}")
-         end
-       end
-
-       return hash
-     end
-
-
-     # Returns array of strings, extracted values. Maybe empty array.
+     # Returns array of strings from a MARC::Record, extracted values. May be empty array.
      def extract(marc_record)
        results = []
 
@@ -265,14 +218,10 @@ module Traject
      # Third (optional) arg to block is self, the MarcExtractor object, useful for custom
      # implementations.
      def each_matching_line(marc_record)
-       marc_record.fields(@interesting_tags_hash.keys).each do |field|
+       marc_record.fields(interesting_tags).each do |field|
 
-         # Make sure it matches indicators too, specs_covering_field
-         # doesn't check that.
          specs_covering_field(field).each do |spec|
-           if spec.matches_indicators?(field)
              yield(field, spec, self)
-           end
          end
 
        end
@@ -314,29 +263,13 @@ module Traject
      end
 
 
-
      # Find Spec objects, if any, covering extraction from this field.
      # Returns an array of 0 or more MarcExtractor::Spec objects
      #
-     # When given an 880, will return the spec (if any) for the linked tag iff
-     # we have a $6 and we want the alternate script.
-     #
      # Returns an empty array in case of no matching extraction specs.
      def specs_covering_field(field)
-       tag = field.tag
-
-       # Short-circuit the unintersting stuff
-       return [] unless interesting_tag?(tag)
-
-       # Due to bug in jruby https://github.com/jruby/jruby/issues/886 , we need
-       # to do this weird encode gymnastics, which fixes it for mysterious reasons.
-
-       if tag == "880" && field['6']
-         tag = field["6"].encode(field["6"].encoding).byteslice(0,3)
-       end
-
-       # Take the resulting tag and get the spec from it (or the default nil if there isn't a spec for this tag)
-       spec = self.spec_hash[tag] || []
+       return [] unless interesting_tag?(field.tag)
+       self.spec_set.specs_matching_field(field, @fetch_alternate_script)
      end
 
 
@@ -348,63 +281,10 @@ module Traject
 
      def freeze
        self.options.freeze
-       self.spec_hash.freeze
+       self.spec_set.freeze
        super
      end
 
 
-     # Represents a single specification for extracting data
-     # from a marc field, like "600abc" or "600|1*|x".
-     #
-     # Includes the tag for reference, although this is redundant and not actually used
-     # in logic, since the tag is also implicit in the overall spec_hash
-     # with tag => [spec1, spec2]
-     class Spec
-       attr_accessor :tag, :subfields, :indicator1, :indicator2, :bytes
-
-       def initialize(hash = {})
-         hash.each_pair do |key, value|
-           self.send("#{key}=", value)
-         end
-       end
-
-
-       # Should subfields extracted by joined, if we have a seperator?
-       # * '630' no subfields specified => join all subfields
-       # * '630abc' multiple subfields specified = join all subfields
-       # * '633a' one subfield => do not join, return one value for each $a in the field
-       # * '633aa' one subfield, doubled => do join after all, will return a single string joining all the values of all the $a's.
-       #
-       # Last case is handled implicitly at the moment when subfields == ['a', 'a']
-       def joinable?
-         (self.subfields.nil? || self.subfields.size != 1)
-       end
-
-       # Pass in a MARC field, do it's indicators match indicators
-       # in this spec? nil indicators in spec mean we don't care, everything
-       # matches.
-       def matches_indicators?(field)
-         return (self.indicator1.nil? || self.indicator1 == field.indicator1) &&
-           (self.indicator2.nil? || self.indicator2 == field.indicator2)
-       end
-
-       # Pass in a string subfield code like 'a'; does this
-       # spec include it?
-       def includes_subfield_code?(code)
-         # subfields nil means include them all
-         self.subfields.nil? || self.subfields.include?(code)
-       end
-
-       def ==(spec)
-         return false unless spec.kind_of?(Spec)
-
-         return (self.tag == spec.tag) &&
-           (self.subfields == spec.subfields) &&
-           (self.indicator1 == spec.indicator1) &&
-           (self.indicator1 == spec.indicator2) &&
-           (self.bytes == spec.bytes)
-       end
-     end
-
    end
  end
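
The string spec format handled by the code removed in this release ("008[35-37]:245abc:600|1*|x") can be explored with a stand-alone sketch. `SketchSpec` and `sketch_parse_spec` are hypothetical names for this illustration, the regular expressions are adapted from the removed `parse_string_spec`, and `matches_indicators?` takes two indicator strings here instead of a MARC field. This is not the new `SpecSet` implementation, which the diff does not show.

```ruby
# Hypothetical stand-in for a single extraction spec.
SketchSpec = Struct.new(:tag, :subfields, :indicator1, :indicator2, :bytes) do
  # Join subfield values unless exactly one subfield code was given.
  def joinable?
    subfields.nil? || subfields.size != 1
  end

  # A nil indicator in the spec matches anything.
  def matches_indicators?(ind1, ind2)
    (indicator1.nil? || indicator1 == ind1) &&
      (indicator2.nil? || indicator2 == ind2)
  end
end

# Returns a hash of tag => [SketchSpec, ...].
def sketch_parse_spec(spec_string)
  specs = Hash.new { |hash, tag| hash[tag] = [] }

  spec_string.split(/\s*:\s*/).each do |part|
    if (m = part.match(/\A([a-zA-Z0-9]{3})(\|([a-z0-9 *]{2})\|)?([a-z0-9]*)\z/))
      # Variable field: tag, optional |indicators|, optional subfield codes.
      spec = SketchSpec.new(m[1])
      spec.subfields = m[4].chars unless m[4].empty?
      if m[3]
        spec.indicator1 = m[3][0] unless m[3][0] == '*' # '*' means "any"
        spec.indicator2 = m[3][1] unless m[3][1] == '*'
      end
    elsif (m = part.match(/\A([a-zA-Z0-9]{3})\[(\d+)(?:-(\d+))?\]\z/))
      # Control field: tag plus a single byte or an inclusive byte range.
      spec = SketchSpec.new(m[1])
      spec.bytes = m[3] ? (m[2].to_i..m[3].to_i) : m[2].to_i
    else
      raise ArgumentError, "Unrecognized marc extract specification: #{part}"
    end
    specs[spec.tag] << spec
  end

  specs
end

specs = sketch_parse_spec("008[35-37]:245abc:600|1*|x:633a")
```

In this sketch `specs['008']` carries a byte range, `specs['600']` matches any second indicator, and `specs['633']` is not joinable because it names a single subfield.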