traject 0.15.0 → 0.16.0

data/README.md CHANGED
@@ -49,8 +49,9 @@ The traject command-line utility requires you to supply it with a configuration
49
49
 
50
50
  Configuration files are actually just ruby -- so by convention they end in `.rb`.
51
51
 
52
- Don't worry, you don't neccesarily need to know ruby well to write them, they give you a subset of ruby to work with. But the full power
53
- of ruby is available to you.
52
+ We hope you can write basic useful configuration files without being a ruby expert;
53
+ they give you a subset of ruby to work with. But the full power
54
+ of ruby is available to you if needed.
54
55
 
55
56
  **rubyist tip**: Technically, config files are executed with `instance_eval` in a Traject::Indexer instance, so the special commands you see are just methods on Traject::Indexer (or mixed into it). But you can
56
57
  call ordinary ruby `require` in config files, etc., too, to load
@@ -84,9 +85,6 @@ settings do
84
85
  # you have to tell it.
85
86
  provide "marc_source.type", "xml"
86
87
 
87
- # settings can be set on command line instead of
88
- # config file too.
89
-
90
88
  # various others...
91
89
  provide "solrj_writer.commit_on_close", "true"
92
90
 
@@ -163,39 +161,44 @@ Other examples of the specification string, which can include multiple tag menti
163
161
  # "*" is a wildcard in indicator spec. So
164
162
  # 856 with first indicator '0', subfield u.
165
163
  to_field "email_addresses", extract_marc("856|0*|u")
166
-
167
- # Instead of joining subfields from the same field
168
- # into one string, joined by spaces, leave them
169
- # each in separate strings:
170
- to_field "isbn", extract_marc("020az", :separator => nil)
171
-
172
- # Same thing, but more explicit
173
- to_field "isbn", extract_marc("020a:020z")
174
164
 
175
-
176
- # Make sure that you don't get any duplicates
177
- # by passing in ":deduplicate => true"
178
- to_field 'language008', extract_marc('008[35-37]', :deduplicate=>true)
165
+ # Can list tag twice with different field combinations
166
+ # to extract separately
167
+ to_field "isbn", extract_marc("245a:245abcde")
179
168
  ~~~
180
169
 
181
170
  The `extract_marc` function *by default* includes any linked
182
171
  MARC `880` fields with alternate-script versions. Another reason
183
172
  to use the `:first` option if you really only want one.
184
173
 
174
+ By default, specifications with multiple subfields (like "240abc") will produce
175
+ one single string of output for each matching field. Specifications
176
+ with single subfields (like "020a") will split subfields and produce
177
+ an output string for each matching subfield.
178
+
185
179
  For MARC control (aka 'fixed') fields, you can use square
186
180
  brackets to take a slice by byte offset.
187
181
 
182
+ ~~~ruby
188
183
  to_field "language_code", extract_marc("008[35-37]")
184
+ ~~~
185
+
186
+ For more information on extraction specifications, see
187
+ the [MarcExtractor class](./lib/traject/marc_extractor.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/MarcExtractor)).
189
188
 
190
189
  `extract_marc` also supports `translation maps` similar
191
- to SolrMarc's. There will be some translation maps built in,
192
- and you can provide your own. translation maps can be supplied
190
+ to SolrMarc's. There are some translation maps provided by traject,
191
+ and you can also define your own. Translation maps can be supplied
193
192
  in yaml or ruby. Translation maps are especially useful
194
- for mapping form MARC codes to user-displayable strings. See Traject::TranslationMap for more info:
193
+ for mapping from MARC codes to user-displayable strings:
195
194
 
195
+ ~~~ruby
196
196
  # "translation_map" will be passed to Traject::TranslationMap.new
197
197
  # and the created map used to translate all values
198
198
  to_field "language", extract_marc("008[35-37]:041a:041d", :translation_map => "marc_language_code")
199
+ ~~~
200
+
201
+ See [Traject::TranslationMap](./lib/traject/translation_map.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/TranslationMap)) for more info on translation mapping.
199
202
 
200
203
  #### Direct indexing logic vs. Macros
201
204
 
@@ -348,11 +351,11 @@ This will over-ride any settings set with `provide` in conf files.
348
351
  There are some built-in command-line option shortcuts for useful
349
352
  settings:
350
353
 
351
- Use `-j` to output as pretty-printed JSON
352
- hashes, instead of sending to solr. Useful for debugging or sanity
353
- checking.
354
+ Use `--debug-mode` to output in a human-readable format, instead of sending to solr.
355
+ Also turns on debug logging and restricts processing to single-threaded. Useful for
356
+ debugging or sanity checking.
354
357
 
355
- traject -j -c conf_file.rb marc_file
358
+ traject --debug-mode -c conf_file.rb marc_file
356
359
 
357
360
  Use `-u` as a shortcut for `-s solr.url=X`
358
361
 
@@ -4,7 +4,8 @@ Traject settings are a flat list of key/value pairs -- a single
4
4
  Hash, not nested. Keys are always strings, and dots (".") can be
5
5
  used for grouping and namespacing.
6
6
 
7
- Values are usually strings, but occasionally something else.
7
+ Values are usually strings, but occasionally something else. String values can be easily
8
+ set via the command line.
8
9
 
9
10
  Settings can be set in configuration files, usually like:
10
11
 
@@ -17,6 +18,11 @@ end
17
18
  or on the command line: `-s key=value`. There are also some command line shortcuts
18
19
  for commonly used settings, see `traject -h`.
19
20
 
21
+ `provide` will only set the key if it was previously unset, so the first call to set a key 'wins'. Command-line
22
+ settings are applied before any others. It's recommended you use `provide`.
23
+
24
+ `store` is also available; it forces the new value, overriding any value previously set.
25
+
20
26
  ## Known settings
21
27
 
22
28
  * `debug_ascii_progress`: true/'true' to print ascii characters to STDERR indicating progress. Note,
@@ -101,4 +107,4 @@ for commonly used settings, see `traject -h`.
101
107
  Note that processing_thread_pool threads can end up submitting
102
108
  to solr too, if solrj_writer.thread_pool is full.
103
109
 
104
- * `writer_class_name`: a Traject Writer class, used by indexer to send processed dictionaries off. Default Traject::SolrJWriter, also available Traject::JsonWriter. See Traject::Indexer for more info. Command line shortcut `-w`
110
+ * `writer_class_name`: a Traject Writer class, used by indexer to send processed dictionaries off. Default Traject::SolrJWriter, also available Traject::JsonWriter. See Traject::Indexer for more info. Command line shortcut `-w`
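The `provide` and `store` semantics described in the settings documentation above can be sketched in plain Ruby. This is only an illustration under stated assumptions: `SketchSettings` is a hypothetical stand-in rather than traject's actual settings class, and the URLs are made up. It assumes only what the text says: `provide` sets a key if it is still unset, `store` always overwrites, and command-line settings are applied first.

```ruby
# Hypothetical stand-in for a settings hash; not traject's real class.
class SketchSettings < Hash
  # Set the key only if nothing has set it yet: the first writer wins.
  def provide(key, value)
    self[key] = value unless key?(key)
  end

  # Always set the key, overriding any previous value.
  def store(key, value)
    self[key] = value
  end
end

settings = SketchSettings.new

# Command-line settings (-s key=value) are applied first...
settings.provide("solr.url", "http://cli.example.org/solr")
# ...so a later `provide` in a config file does not override them.
settings.provide("solr.url", "http://conf.example.org/solr")

settings.provide("marc_source.type", "xml")
# `store` forces the new value regardless of what was set before.
settings.store("marc_source.type", "binary")
```

Under these assumptions the command-line URL survives the later `provide`, while `marc_source.type` ends up as the stored value.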
@@ -268,10 +268,6 @@ module Traject
268
268
  if options[:solr]
269
269
  settings["solr.url"] = options[:solr]
270
270
  end
271
- if options[:j]
272
- settings["writer_class_name"] = "JsonWriter"
273
- settings["json_writer.pretty_print"] = "true"
274
- end
275
271
  if options[:marc_type]
276
272
  settings["marc_source.type"] = options[:marc_type]
277
273
  end
@@ -296,12 +292,11 @@ module Traject
296
292
  on :o, "output_file", "output file for Writer classes that write to files", :argument => true
297
293
  on :w, :writer, "Set writer class, shortcut for -s writer_class_name=", :argument => true
298
294
  on :u, :solr, "Set solr url, shortcut for -s solr.url=", :argument => true
299
- on :j, "output as pretty printed json, shortcut for -s writer_class_name=JsonWriter -s json_writer.pretty_print=true"
300
295
  on :t, :marc_type, "xml, json or binary. shortcut for -s marc_source.type=", :argument => true
301
296
  on :I, "load_path", "append paths to ruby $LOAD_PATH", :argument => true, :as => Array, :delimiter => ":"
302
297
  on :G, "Gemfile", "run with bundler and optionally specified Gemfile", :argument => :optional, :default => nil
303
298
 
304
- on :x, "command", "alternate traject command: process (default); marcout", :argument => true, :default => "process"
299
+ on :x, "command", "alternate traject command: process (default); marcout; commit", :argument => true, :default => "process"
305
300
 
306
301
  on "stdin", "read input from stdin"
307
302
  on "debug-mode", "debug logging, single threaded, output human readable hashes"
@@ -144,7 +144,7 @@ module Traject::Macros
144
144
  # return the filing version (i.e., the string without the
145
145
  # non-filing characters)
146
146
 
147
- def self.filing_version(field, str, spechash)
147
+ def self.filing_version(field, str, spec)
148
148
  # Control fields don't have non-filing characters
149
149
  return str if field.kind_of? MARC::ControlField
150
150
 
@@ -155,7 +155,7 @@ module Traject::Macros
155
155
  # The spechash must either (a) have no subfields specified, or
156
156
  # (b) include the first subfield in the record
157
157
 
158
- subs = spechash[:subfields]
158
+ subs = spec.subfields
159
159
  return str unless subs && subs.include?(field.subfields[0].code)
160
160
 
161
161
  # OK. If we got this far we actually need to strip characters off the string
@@ -183,7 +183,7 @@ module Traject::Macros
183
183
  lambda do |record, accumulator|
184
184
  codes = extractor.collect_matching_lines(record) do |field, spec, extractor|
185
185
  if extractor.control_field?(field)
186
- (spec[:bytes] ? field.value.byteslice(spec[:bytes]) : field.value)
186
+ (spec.bytes ? field.value.byteslice(spec.bytes) : field.value)
187
187
  else
188
188
  extractor.collect_subfields(field, spec).collect do |value|
189
189
  # sometimes multiple language codes are jammed together in one subfield, and
@@ -212,9 +212,16 @@ module Traject::Macros
212
212
  extractor = MarcExtractor.new(spec)
213
213
 
214
214
  lambda do |record, accumulator|
215
- accumulator.concat( extractor.collect_matching_lines(record) do |field, spec, extractor|
215
+ values = extractor.collect_matching_lines(record) do |field, spec, extractor|
216
216
  extractor.collect_subfields(field, spec) unless (field.tag == "490" && field.indicator1 == "1")
217
- end.compact)
217
+ end.compact
218
+
219
+ # trim punctuation
220
+ values.collect! do |s|
221
+ Marc21.trim_punctuation(s)
222
+ end
223
+
224
+ accumulator.concat( values )
218
225
  end
219
226
  end
220
227
 
@@ -6,9 +6,79 @@ module Traject
6
6
  #
7
7
  # Examples:
8
8
  #
9
- # array_of_stuff = MarcExtractor.new("001:245abc:700a").extract(marc_record)
10
- # values = MarcExtractor.new("040a", :separator => nil).extract(marc_record)
9
+ # array_of_stuff = MarcExtractor.new("001:245abc:700a").extract(marc_record)
10
+ # values = MarcExtractor.new("245a:245abc").extract(marc_record)
11
+ # separated_values = MarcExtractor.new("020a:020z").extract(marc_record)
12
+ # bytes = MarcExtractor.new("008[35-37]").extract(marc_record)
11
13
  #
14
+ # == String extraction specifications
15
+ #
16
+ # Extraction directions are supplied in strings, usually as the first
17
+ # parameter to MarcExtractor.new or MarcExtractor.cached. These specifications
18
+ # are also the first parameter to the #extract_marc macro.
19
+ #
20
+ # A String specification is a string (or array of strings) which consists
21
+ # of one or more Data and Control Field Specifications separated by colons.
22
+ #
23
+ # A Data Field Specification is of the form:
24
+ # `{tag}{|indicators|}{subfields}`
25
+ # * {tag} is three chars (usually but not necessarily numeric)
26
+ # * {indicators} are optional two chars enclosed in pipe ('|') characters,
27
+ # * {subfields} are optional list of chars (alphanumeric)
28
+ #
29
+ # indicator spec must be two chars, but one can be * meaning "don't care".
30
+ # Use a space to mean 'blank'.
31
+ #
32
+ # "245|01|abc65:345abc:700|*5|:800"
33
+ #
34
+ # A Control Field Specification is used with tags for control (fixed) fields (ordinarily fields 001-010)
35
+ # and includes a tag and a a byte slice specification.
36
+ #
37
+ # "008[35-37]:007[5]"
38
+ # => bytes 35-37 inclusive of any field 008, and byte 5 of any field 007 (TODO: Should we support
39
+ # "LDR" as a pseudo-tag to take byte slices of leader?)
40
+ #
41
+ # * subfields and indicators can only be provided for marc data/variable fields
42
+ # * byte slice can only be provided for marc control fields (generally tags less than 010)
43
+ #
44
+ # == Subfield concatenation
45
+ #
46
+ # Normally, for a spec including multiple subfield codes, multiple subfields
47
+ # from the same MARC field will be concatenated into one string separated by spaces:
48
+ #
49
+ # 600 a| Chomsky, Noam x| Philosophy.
50
+ # 600 a| Chomsky, Noam x| Political and social views.
51
+ # MarcExtractor.new("600ax").extract(record)
52
+ # # results in two values sent to Solr:
53
+ # "Chomsky, Noam Philosophy."
54
+ # "Chomsky, Noam Political and social views."
55
+ #
56
+ # You can turn off this concatenation and leave individual subfields in separate
57
+ # strings by setting the `separator` option to nil:
58
+ #
59
+ # MarcExtractor.new("600ax", :separator => nil).extract(record)
60
+ # # Results in four values being sent to Solr (or 3 if you de-dup):
61
+ # "Chomsky, Noam"
62
+ # "Philosophy."
63
+ # "Chomsky, Noam"
64
+ # "Political and social views."
65
+ #
66
+ # However, **the default is different for specifications with only a single
67
+ # subfield**; these are by default kept separated:
68
+ #
69
+ # 020 a| 285197145X a| 9782851971456
70
+ # MarcExtractor.new("020a:020z").extract(record)
71
+ # # two separate strings sent to Solr:
72
+ # "285197145X"
73
+ # "9782851971456"
74
+ #
75
+ # For single subfield specifications, you force concatenation by
76
+ # repeating the subfield specification:
77
+ #
78
+ # MarcExtractor.new("020aa:020zz").extract(record)
79
+ # # would result in a single string sent to solr for
80
+ # # the single field, by default space-separated:
81
+ # "285197145X 9782851971456"
12
82
  #
13
83
  # == Note on Performance and MarcExtractor creation and reuse
14
84
  #
@@ -37,14 +107,15 @@ module Traject
37
107
  class MarcExtractor
38
108
  attr_accessor :options, :spec_hash
39
109
 
40
- # Take a hash that's the output of #parse_string_spec, return
41
- # an array of strings extracted from a marc record accordingly
110
+ # First arg is a specification for extraction of data from a MARC record.
111
+ # Specification can be given in two forms:
42
112
  #
43
- # Second arg can either be a string specification that will be passed
44
- # to MarcExtractor.parse_string_spec, or a Hash that's
45
- # already been created by it.
113
+ # * a string specification like "008[35]:020a:245abc", see top of class
114
+ # for examples. A string specification is the most typical argument.
115
+ # * The output of a previous call to MarcExtractor.parse_string_spec(string_spec),
116
+ # a 'pre-parsed' specification.
46
117
  #
47
- # options:
118
+ # Second arg is options:
48
119
  #
49
120
  # [:separator] default ' ' (space), what to use to separate
50
121
  # subfield values when joining strings
@@ -108,57 +179,30 @@ module Traject
108
179
 
109
180
  # Check to see if a tag is interesting (meaning it may be covered by a spec
110
181
  # and the passed-in options about alternate scripts)
111
-
112
182
  def interesting_tag?(tag)
113
183
  return @interesting_tags_hash.include?(tag)
114
184
  end
115
185
 
116
186
 
117
- # Converts from a string marc spec like "245abc:700a" to a nested hash used internally
118
- # to represent the specification.
119
- #
120
- # a String specification is a string (or array of strings) of form:
121
- # {tag}{|indicators|}{subfields} separated by colons
122
- # tag is three chars (usually but not neccesarily numeric),
123
- # indicators are optional two chars enclosed in pipe ('|') characters,
124
- # subfields are optional list of chars (alphanumeric)
125
- #
126
- # indicator spec must be two chars, but one can be * meaning "don't care".
127
- # space to mean 'blank'
128
- #
129
- # "245|01|abc65:345abc:700|*5|:800"
130
- #
131
- # Or, for control (fixed) fields (ordinarily fields 001-010), you can include a byte slice specification,
132
- # but can NOT include subfield or indicator specifications. Plus can use special tag "LDR" for
133
- # the marc leader. (TODO)
134
- #
135
- # "008[35-37]:LDR[5]"
136
- # => bytes 35-37 inclusive of field 008, and byte 5 of the marc leader.
187
+ # Converts from a string marc spec like "008[35]:245abc:700a" to a hash used internally
188
+ # to represent the specification. See comments at head of class for
189
+ # documentation of string specification format.
137
190
  #
138
- # Returns a nested hash whose keys are tags and whose value is an array
139
- # of hash structures indicating what indicators and subfields (or
140
- # byte-offsets for control fields) are needed, e.g.
141
191
  #
142
- # '245|1*|a:245ab:110:008[15-17]:008[17]' would give us
192
+ # == Return value
143
193
  #
144
- # {
145
- # '245' => [
146
- # {:indicators => ['1', nil], :subfields=>['a']},
147
- # {:subfields => ['a', 'b']}
148
- # ]
149
- # '110' => [{}] # all subfields, indicators don't matter
150
- # '008' => [
151
- # {:bytes => (15..17)}
152
- # {:bytes => 17}
153
- # ]
154
- # }
194
+ # The hash returned is keyed by tag, and has as values an array of 0 or
195
+ # more MarcExtractor::Spec objects representing the specified extraction
196
+ # operations for that tag.
155
197
  #
156
- # * subfields and indicators can only be provided for marc data/variable fields
157
- # * byte slice can only be provided for marc control fields (generally tags less than 010)
198
+ # It's an array of possibly more than one, because you can specify
199
+ # multiple extractions on the same tag: for instance "245a:245abc"
158
200
  #
159
201
  # See tests for more examples.
160
202
  def self.parse_string_spec(spec_string)
161
- hash = {}
203
+ # hash defaults to []
204
+ hash = Hash.new {|hash,key| hash[key] = []}
205
+
162
206
  spec_strings = spec_string.is_a?(Array) ? spec_string.map{|s| s.split(/\s*:\s*/)}.flatten : spec_string.split(/\s*:\s*/)
163
207
 
164
208
  spec_strings.each do |part|
@@ -166,31 +210,32 @@ module Traject
166
210
  # variable field
167
211
  tag, indicators, subfields = $1, $3, $4
168
212
 
169
- hash[tag] ||= []
170
- spec = {}
213
+ spec = Spec.new(:tag => tag)
171
214
 
172
215
  if subfields and !subfields.empty?
173
- spec[:subfields] = subfields.split('')
216
+ spec.subfields = subfields.split('')
174
217
  end
175
218
 
176
219
  if indicators
177
- spec[:indicators] = [ (indicators[0] if indicators[0] != "*"), (indicators[1] if indicators[1] != "*") ]
220
+ # if specified as '*', leave nil
221
+ spec.indicator1 = indicators[0] if indicators[0] != "*"
222
+ spec.indicator2 = indicators[1] if indicators[1] != "*"
178
223
  end
224
+
225
+ hash[spec.tag] << spec
179
226
 
180
- hash[tag] << spec
181
-
182
- elsif (part =~ /\A([a-zA-Z0-9]{3})(\[(\d+)(-(\d+))?\])\Z/) # "005[4-5]"
227
+ elsif (part =~ /\A([a-zA-Z0-9]{3})(\[(\d+)(-(\d+))?\])\Z/) # control field, "005[4-5]"
183
228
  tag, byte1, byte2 = $1, $3, $5
184
- hash[tag] ||= []
185
- spec = {}
229
+
230
+ spec = Spec.new(:tag => tag)
186
231
 
187
232
  if byte1 && byte2
188
- spec[:bytes] = ((byte1.to_i)..(byte2.to_i))
233
+ spec.bytes = ((byte1.to_i)..(byte2.to_i))
189
234
  elsif byte1
190
- spec[:bytes] = byte1.to_i
235
+ spec.bytes = byte1.to_i
191
236
  end
192
237
 
193
- hash[tag] << spec
238
+ hash[spec.tag] << spec
194
239
  else
195
240
  raise ArgumentError.new("Unrecognized marc extract specification: #{part}")
196
241
  end
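The move to `Hash.new {|hash,key| hash[key] = []}` in `parse_string_spec` replaces the repeated `hash[tag] ||= []` calls. A minimal sketch of that idiom follows. It is not the library's parser: `SketchSpec` and `parse_control_specs` are hypothetical names, and the sketch handles only control-field byte slices, reusing the control-field regular expression shown in the hunk above.

```ruby
# Hypothetical, simplified stand-in for MarcExtractor::Spec.
SketchSpec = Struct.new(:tag, :bytes)

def parse_control_specs(spec_string)
  # Missing keys are initialized to an empty array on first access,
  # so each branch can append without an explicit `||= []`.
  specs_by_tag = Hash.new { |hash, key| hash[key] = [] }

  spec_string.split(/\s*:\s*/).each do |part|
    unless part =~ /\A([a-zA-Z0-9]{3})(\[(\d+)(-(\d+))?\])\Z/
      raise ArgumentError, "Unrecognized control field specification: #{part}"
    end

    tag, byte1, byte2 = $1, $3, $5
    # A range for "008[15-17]", a single integer offset for "008[17]".
    bytes = byte2 ? (byte1.to_i..byte2.to_i) : byte1.to_i
    specs_by_tag[tag] << SketchSpec.new(tag, bytes)
  end

  specs_by_tag
end

parsed = parse_control_specs("008[15-17]:008[17]:005[5]")
```

One consequence of this idiom is that merely reading a missing key creates an empty entry, so code that wants to test for presence should use `key?` rather than `[]`.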
@@ -206,7 +251,7 @@ module Traject
206
251
 
207
252
  self.each_matching_line(marc_record) do |field, spec|
208
253
  if control_field?(field)
209
- results << (spec[:bytes] ? field.value.byteslice(spec[:bytes]) : field.value)
254
+ results << (spec.bytes ? field.value.byteslice(spec.bytes) : field.value)
210
255
  else
211
256
  results.concat collect_subfields(field, spec)
212
257
  end
@@ -217,7 +262,7 @@ module Traject
217
262
 
218
263
  # Yields a block for every line in source record that matches
219
264
  # spec. First arg to block is MARC::DataField or ControlField, second
220
- # is the hash specification that it matched on. May take account
265
+ # is the MarcExtractor::Spec that it matched on. May take account
221
266
  # of options such as :alternate_script
222
267
  #
223
268
  # Third (optional) arg to block is self, the MarcExtractor object, useful for custom
@@ -225,19 +270,14 @@ module Traject
225
270
  def each_matching_line(marc_record)
226
271
  marc_record.fields(@interesting_tags_hash.keys).each do |field|
227
272
 
228
- specs = spec_covering_field(field)
229
-
230
- # Don't have a spec that addresses this field? Move on.
231
- next unless specs
232
-
233
- # Make sure it matches indicators too, spec_covering_field
234
- # doens't check that.
235
-
236
- specs.each do |spec|
237
- if matches_indicators(field, spec)
273
+ # Make sure it matches indicators too, specs_covering_field
274
+ # doesn't check that.
275
+ specs_covering_field(field).each do |spec|
276
+ if spec.matches_indicators?(field)
238
277
  yield(field, spec, self)
239
278
  end
240
279
  end
280
+
241
281
  end
242
282
  end
243
283
 
@@ -245,6 +285,8 @@ module Traject
245
285
  # but collects results of block into an array -- flattens any subarrays for you!
246
286
  #
247
287
  # Useful for re-use of this class for custom processing
288
+ #
289
+ # yields the MARC Field, the MarcExtractor::Spec object, the MarcExtractor object.
248
290
  def collect_matching_lines(marc_record)
249
291
  results = []
250
292
  self.each_matching_line(marc_record) do |field, spec, extractor|
@@ -254,31 +296,36 @@ module Traject
254
296
  end
255
297
 
256
298
 
257
- # Pass in a marc data field and a hash spec, returns
258
- # an ARRAY of one or more strings, subfields extracted
299
+ # Pass in a marc data field and a Spec object with extraction
300
+ # instructions, returns an ARRAY of one or more strings, subfields extracted
259
301
  # and processed per spec. Takes account of options such
260
302
  # as :separator
261
303
  #
262
304
  # Always returns array, sometimes empty array.
263
305
  def collect_subfields(field, spec)
264
306
  subfields = field.subfields.collect do |subfield|
265
- subfield.value if spec[:subfields].nil? || spec[:subfields].include?(subfield.code)
307
+ subfield.value if spec.includes_subfield_code?(subfield.code)
266
308
  end.compact
267
309
 
268
310
  return subfields if subfields.empty? # empty array, just return it.
269
311
 
270
- return options[:separator] ? [ subfields.join( options[:separator]) ] : subfields
312
+ if options[:separator] && spec.joinable?
313
+ subfields = [subfields.join(options[:separator])]
314
+ end
315
+
316
+ return subfields
271
317
  end
272
318
 
273
319
 
274
- # Find a spec, if any, covering extraction from this field
320
+
321
+ # Find Spec objects, if any, covering extraction from this field.
322
+ # Returns an array of 0 or more MarcExtractor::Spec objects
275
323
  #
276
324
  # When given an 880, will return the spec (if any) for the linked tag iff
277
325
  # we have a $6 and we want the alternate script.
278
326
  #
279
- # Returns nil if no matching spec is found
280
-
281
- def spec_covering_field(field)
327
+ # Returns an empty array in case of no matching extraction specs.
328
+ def specs_covering_field(field)
282
329
  tag = field.tag
283
330
 
284
331
  # Short-circuit the uninteresting stuff
@@ -301,13 +348,60 @@ module Traject
301
348
  # define #control_field? on both ControlField and DataField?
302
349
  return field.kind_of? MARC::ControlField
303
350
  end
351
+
304
352
 
305
- # a marc field, and an individual spec hash, {:subfields => array, :indicators => array}
306
- def matches_indicators(field, spec)
307
- return true if spec[:indicators].nil?
353
+ # Represents a single specification for extracting data
354
+ # from a marc field, like "600abc" or "600|1*|x".
355
+ #
356
+ # Includes the tag for reference, although this is redundant and not actually used
357
+ # in logic, since the tag is also implicit in the overall spec_hash
358
+ # with tag => [spec1, spec2]
359
+ class Spec
360
+ attr_accessor :tag, :subfields, :indicator1, :indicator2, :bytes
361
+
362
+ def initialize(hash = {})
363
+ hash.each_pair do |key, value|
364
+ self.send("#{key}=", value)
365
+ end
366
+ end
367
+
368
+
369
+ # Should extracted subfields be joined, if we have a separator?
370
+ # * '630' no subfields specified => join all subfields
371
+ # * '630abc' multiple subfields specified => join all subfields
372
+ # * '633a' one subfield => do not join, return one value for each $a in the field
373
+ # * '633aa' one subfield, doubled => do join after all, will return a single string joining all the values of all the $a's.
374
+ #
375
+ # Last case is handled implicitly at the moment when subfields == ['a', 'a']
376
+ def joinable?
377
+ (self.subfields.nil? || self.subfields.size != 1)
378
+ end
379
+
380
+ # Pass in a MARC field, do its indicators match indicators
381
+ # in this spec? nil indicators in spec mean we don't care, everything
382
+ # matches.
383
+ def matches_indicators?(field)
384
+ return (self.indicator1.nil? || self.indicator1 == field.indicator1) &&
385
+ (self.indicator2.nil? || self.indicator2 == field.indicator2)
386
+ end
308
387
 
309
- return (spec[:indicators][0].nil? || spec[:indicators][0] == field.indicator1) &&
310
- (spec[:indicators][1].nil? || spec[:indicators][1] == field.indicator2)
388
+ # Pass in a string subfield code like 'a'; does this
389
+ # spec include it?
390
+ def includes_subfield_code?(code)
391
+ # subfields nil means include them all
392
+ self.subfields.nil? || self.subfields.include?(code)
393
+ end
394
+
395
+ def ==(spec)
396
+ return false unless spec.kind_of?(Spec)
397
+
398
+ return (self.tag == spec.tag) &&
399
+ (self.subfields == spec.subfields) &&
400
+ (self.indicator1 == spec.indicator1) &&
401
+ (self.indicator2 == spec.indicator2) &&
402
+ (self.bytes == spec.bytes)
403
+ end
311
404
  end
405
+
312
406
  end
313
407
  end
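The joining rule documented for `Spec#joinable?` and applied in `collect_subfields` can be sketched without the marc gem. The names below are hypothetical stand-ins: `SketchSubfield` plays the role of a MARC subfield, and the two free functions mirror the logic shown in the diff rather than the library's actual method signatures.

```ruby
# Hypothetical stand-in for a MARC subfield (code plus value).
SketchSubfield = Struct.new(:code, :value)

# codes is nil (meaning all subfields) or an array of one-character codes.
# Exactly one code means "keep values separate"; anything else joins.
def joinable?(codes)
  codes.nil? || codes.size != 1
end

def collect_subfields(subfields, codes, separator = " ")
  values = subfields
    .select { |sf| codes.nil? || codes.include?(sf.code) }
    .map(&:value)

  return values if values.empty?

  # Join only when a separator is set and the spec is joinable.
  values = [values.join(separator)] if separator && joinable?(codes)
  values
end

isbn_field = [
  SketchSubfield.new("a", "285197145X"),
  SketchSubfield.new("a", "9782851971456")
]

separate = collect_subfields(isbn_field, %w[a])     # like "020a"
joined   = collect_subfields(isbn_field, %w[a a])   # like "020aa"
unjoined = collect_subfields(isbn_field, nil, nil)  # like "020", :separator => nil
```

In this sketch the single-code spec yields one string per matching subfield, the doubled code yields one space-joined string, and a nil separator disables joining even for an otherwise joinable spec.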
@@ -7,7 +7,7 @@ module Traject
7
7
  # A TranslationMap is basically just something that has a hash-like #[]
8
8
  # method to map from input strings to output strings:
9
9
  #
10
- # translation_map["some_input"] #=> some_output
10
+ # translation_map["some_input"] #=> some_output
11
11
  #
12
12
  # Input is assumed to always be string, output is either string
13
13
  # or array of strings.
@@ -17,10 +17,10 @@ module Traject
17
17
  # yaml, or java .properties. (Limited basic .properties, don't try any fancy escaping please,
18
18
  # no = or : in key names, no split lines.)
19
19
  #
20
- # TranslationMap.new("dir/some_file")
20
+ # TranslationMap.new("dir/some_file")
21
21
  #
22
- # Will look through the entire ruby $LOAD_PATH, for a translation_maps subdir
23
- # that contains either some_file.rb OR some_file.yaml OR some_file.properties.
22
+ # Will look for a file named `some_file.rb` or `some_file.yaml` or `some_file.properties`,
23
+ # somewhere in the ruby $LOAD_PATH in a `/translation_maps` subdir.
24
24
  # * Looks for "/translation_maps" subdir in load paths, so
25
25
  # for instance you can have a gem that keeps translation maps
26
26
  # in ./lib/translation_maps, and it Just Works.
@@ -47,12 +47,12 @@ module Traject
47
47
  # Or, when calling TranslationMap.new(), you can pass in options over-riding special
48
48
  # key too:
49
49
  #
50
- # TranslationMap.new("something", :default => "foo")
51
- # TranslationMap.new("something", :default => :passthrough)
50
+ # TranslationMap.new("something", :default => "foo")
51
+ # TranslationMap.new("something", :default => :passthrough)
52
52
  #
53
53
  # == Output: String or array of strings
54
54
  #
55
- # The output can be a string or an array of strings, or nil. It should not be anything
55
+ # The output can be a string or an array of strings, or nil. It should not be anything else.
56
56
  # When used with the #translate_array! method, one string can be replaced by multiple values
57
57
  # (array of strings) or removed (nil)
58
58
  #
@@ -1,3 +1,3 @@
1
1
  module Traject
2
- VERSION = "0.15.0"
2
+ VERSION = "0.16.0"
3
3
  end
@@ -36,7 +36,8 @@ describe "Traject::Macros::Marc21Semantics" do
36
36
  end
37
37
  output = @indexer.map_record(@record)
38
38
 
39
- assert_equal ["Big bands."], output["series_facet"]
39
+ # trims punctuation too
40
+ assert_equal ["Big bands"], output["series_facet"]
40
41
  end
41
42
 
42
43
  describe "marc_sortable_author" do
@@ -51,6 +51,27 @@ describe "Traject::Indexer.to_field" do
51
51
  flunk("Should only fail with a NamingError")
52
52
  end
53
53
  end
54
-
55
54
 
56
- end
55
+ # Just verifying this is how it works
56
+ it "doesn't allow wholesale assignment to the accumulator" do
57
+ @indexer.to_field('foo') do |rec, acc|
58
+ acc = ['hello']
59
+ end
60
+ output = @indexer.map_record('never looked at')
61
+ assert_equal nil, output['foo']
62
+ end
63
+
64
+ it "allows use of accumulator.replace" do
65
+ @indexer.to_field('foo') do |rec, acc|
66
+ acc.replace ['hello']
67
+ end
68
+ output = @indexer.map_record('never looked at')
69
+ assert_equal ['hello'], output['foo']
70
+ end
71
+
72
+
73
+ end
74
+
75
+
76
+
77
+
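The two tests added above rest on ordinary Ruby variable semantics: assigning to a block parameter rebinds a local name, while `Array#replace` mutates the array the caller still holds. A small sketch, assuming a hypothetical `run_field_block` helper that imitates a caller passing in an accumulator (it is not traject's indexer):

```ruby
# Hypothetical helper: passes an accumulator array to the block and
# afterwards returns that same array object.
def run_field_block
  accumulator = []
  yield(:record_placeholder, accumulator)
  accumulator
end

# Assignment only rebinds the block-local variable `acc`;
# the caller's array is left untouched.
reassigned = run_field_block { |rec, acc| acc = ["hello"] }

# Array#replace changes the contents of the array object itself,
# so the caller sees the new values.
replaced = run_field_block { |rec, acc| acc.replace(["hello"]) }
```

Here `reassigned` stays empty and `replaced` holds `["hello"]`, which matches the behavior the specs above verify through the indexer's output.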
@@ -13,15 +13,12 @@ describe "Traject::MarcExtractor" do
13
13
  assert_kind_of Hash, parsed
14
14
  assert_equal 1, parsed.keys.length
15
15
  spec = parsed['245'].first
16
- assert_kind_of Hash, spec
17
-
18
- assert_kind_of Array, spec[:indicators]
19
- assert_equal 2, spec[:indicators].length
20
- assert_equal "1", spec[:indicators][0]
21
- assert_nil spec[:indicators][1]
22
-
23
- assert_kind_of Array, spec[:subfields]
16
+ assert_kind_of Traject::MarcExtractor::Spec, spec
24
17
 
18
+ assert_equal "1", spec.indicator1
19
+ assert_nil spec.indicator2
20
+
21
+ assert_kind_of Array, spec.subfields
25
22
  end
26
23
 
27
24
  it "parses a mixed bag" do
@@ -34,25 +31,28 @@ describe "Traject::MarcExtractor" do
34
31
 
35
32
  #245abcde
36
33
  assert spec245
37
- assert_nil spec245[:indicators]
38
- assert_equal %w{a b c d e}, spec245[:subfields]
34
+ assert_nil spec245.indicator1
35
+ assert_nil spec245.indicator2
36
+ assert_equal %w{a b c d e}, spec245.subfields
39
37
 
40
38
  #810
41
39
  assert spec810
42
- assert_nil spec810[:indicators]
43
- assert_nil spec810[:subfields], "No subfields"
40
+ assert_nil spec810.indicator1
41
+ assert_nil spec810.indicator2
42
+ assert_nil spec810.subfields, "No subfields"
44
43
 
45
44
  #700-*4bcd
46
45
  assert spec700
47
- assert_equal [nil, "4"], spec700[:indicators]
48
- assert_equal %w{b c d}, spec700[:subfields]
46
+ assert_nil spec700.indicator1
47
+ assert_equal "4", spec700.indicator2
48
+ assert_equal %w{b c d}, spec700.subfields
49
49
  end
50
50
 
51
51
  it "parses fixed field byte offsets" do
52
52
  parsed = Traject::MarcExtractor.parse_string_spec("005[5]:008[7-10]")
53
53
 
54
- assert_equal 5, parsed["005"].first[:bytes]
55
- assert_equal 7..10, parsed["008"].first[:bytes]
54
+ assert_equal 5, parsed["005"].first.bytes
55
+ assert_equal 7..10, parsed["008"].first.bytes
56
56
  end
57
57
 
58
58
  it "allows arrays of specs" do
@@ -79,7 +79,7 @@ describe "Traject::MarcExtractor" do
79
79
 
80
80
  # Mostly an internal method, not necessarily API, but
81
81
  # an important one, so we unit test some parts of it.
82
- describe "#spec_covering_field" do
82
+ describe "#specs_covering_field" do
83
83
  describe "for alternate script tags" do
84
84
  before do
85
85
  @record = MARC::Reader.new(support_file_path "hebrew880s.marc").to_a.first
@@ -102,17 +102,17 @@ describe "Traject::MarcExtractor" do
102
102
  assert ! @a880_100.nil?, "Found an 880-100 to test"
103
103
  end
104
104
  it "finds spec for relevant 880" do
105
- assert_equal( [{}], @extractor.spec_covering_field(@a880_245) )
106
- assert_nil @extractor.spec_covering_field(@a880_100)
105
+ assert_equal( [Traject::MarcExtractor::Spec.new(:tag => "245")], @extractor.specs_covering_field(@a880_245) )
106
+ assert_equal [], @extractor.specs_covering_field(@a880_100)
107
107
  end
108
108
  it "does not find spec for 880 if disabled" do
109
109
  @extractor = Traject::MarcExtractor.new("245", :alternate_script => false)
110
- assert_nil @extractor.spec_covering_field(@a880_245)
110
+ assert_nil @extractor.specs_covering_field(@a880_245)
111
111
  end
112
112
  it "finds only 880 if so configured" do
113
113
  @extractor = Traject::MarcExtractor.new("245", :alternate_script => :only)
114
- assert_nil @extractor.spec_covering_field(@a245)
115
- assert_equal([{}], @extractor.spec_covering_field(@a880_245))
114
+ assert_nil @extractor.specs_covering_field(@a245)
115
+ assert_equal([Traject::MarcExtractor::Spec.new(:tag => "245")], @extractor.specs_covering_field(@a880_245))
116
116
  end
117
117
  end
118
118
  end
@@ -260,7 +260,7 @@ describe "Traject::MarcExtractor" do
260
260
  @extractor.each_matching_line(@record) do |field, spec|
261
261
  called = true
262
262
  assert_kind_of MARC::DataField, field
263
- assert_kind_of Hash, spec
263
+ assert_kind_of Traject::MarcExtractor::Spec, spec
264
264
  end
265
265
  assert called, "calls block"
266
266
  end
@@ -269,7 +269,7 @@ describe "Traject::MarcExtractor" do
269
269
  @extractor.each_matching_line(@record) do |field, spec, extractor|
270
270
  called = true
271
271
  assert_kind_of MARC::DataField, field
272
- assert_kind_of Hash, spec
272
+ assert_kind_of Traject::MarcExtractor::Spec, spec
273
273
  assert_kind_of Traject::MarcExtractor, extractor
274
274
  assert_same @extractor, extractor
275
275
  end
@@ -292,9 +292,11 @@ describe "Traject::MarcExtractor" do
292
292
 
293
293
  describe "MarcExtractor.cached" do
294
294
  it "creates" do
295
- ext = Traject::MarcExtractor.cached("245abc", :separator => nil)
296
- assert_equal({"245"=>[{:subfields=>["a", "b", "c"]}]}, ext.spec_hash)
297
- assert ext.options[:separator].nil?, "extractor options[:separator] is nil"
295
+ extractor = Traject::MarcExtractor.cached("245abc", :separator => nil)
296
+ spec_hash = extractor.spec_hash
297
+
298
+ assert extractor.options[:separator].nil?, "extractor options[:separator] is nil"
299
+ assert_equal({"245"=>[Traject::MarcExtractor::Spec.new(:tag => "245", :subfields=>["a", "b", "c"])]}, spec_hash)
298
300
  end
299
301
  it "caches" do
300
302
  ext1 = Traject::MarcExtractor.cached("245abc", :separator => nil)
@@ -326,11 +328,45 @@ describe "Traject::MarcExtractor" do
326
328
 
327
329
 
328
330
 
329
- it "works the same as ::separator=>nil" do
330
- ex1 = Traject::MarcExtractor.new("245a:245b")
331
- ex2 = Traject::MarcExtractor.new("245ab", :separator=>nil)
332
- assert_equal ex1.extract(@record), ex2.extract(@record)
331
+ it "provides multiple values for repeated subfields with single specified subfield" do
332
+ ex = Traject::MarcExtractor.new("245a")
333
+ f = @record.fields('245').first
334
+ title_a = f['a']
335
+ f.append(MARC::Subfield.new('a', title_a))
336
+ results = ex.extract(@record)
337
+ assert_equal [title_a, title_a], results
338
+ end
339
+
340
+ it "concats single subfield spec when given as eg 245aa" do
341
+ ex = Traject::MarcExtractor.new("245aa")
342
+ f = @record.fields('245').first
343
+ title_a = f['a']
344
+ f.append(MARC::Subfield.new('a', title_a))
345
+ results = ex.extract(@record)
346
+ assert_equal ["#{title_a} #{title_a}"], results
347
+ end
348
+
349
+ it "provides single value for repeated subfields with multiple specified subfields" do
350
+ ex = Traject::MarcExtractor.new("245ab")
351
+ f = @record.fields('245').first
352
+ title_a = f['a']
353
+ title_b = f['b']
354
+ f.append(MARC::Subfield.new('a', title_a))
355
+ results = ex.extract(@record)
356
+ assert_equal ["#{title_a} #{title_b} #{title_a}"], results
357
+
358
+ end
359
+
360
+ it "provides single value for repeated subfields with no specified subfield" do
361
+ ex = Traject::MarcExtractor.new("245")
362
+ f = @record.fields('245').first
363
+ title_a = f['a']
364
+ f.append(MARC::Subfield.new('a', title_a))
365
+ results = ex.extract(@record)
366
+ assert_equal 1, results.size
333
367
  end
368
+
369
+
334
370
 
335
371
 
336
372
  it "allows repeated tags for a control field" do
@@ -352,6 +388,17 @@ describe "Traject::MarcExtractor" do
352
388
  end
353
389
 
354
390
  end
391
+
392
+ describe "MarcExtractor::Spec" do
393
+ describe "==" do
394
+ it "equals when equal" do
395
+ assert_equal Traject::MarcExtractor::Spec.new(:subfields => %w{a b c}), Traject::MarcExtractor::Spec.new(:subfields => %w{a b c})
396
+ end
397
+ it "does not equal when not" do
398
+ refute_equal Traject::MarcExtractor::Spec.new(:subfields => %w{a b c}), Traject::MarcExtractor::Spec.new(:subfields => %w{a b c}, :indicator2 => '1')
399
+ end
400
+ end
401
+ end
355
402
 
356
403
 
357
- end
404
+ end
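Note on the spec changes above: the updated assertions read parsed specs through accessors (`spec.indicator1`, `spec.indicator2`, `spec.subfields`, `spec.bytes`) instead of hash keys (`spec[:indicators]`, `spec[:subfields]`, `spec[:bytes]`), and compare specs by value with `==`. The following is a minimal sketch of that shape, assuming nothing beyond what the tests show. `SpecSketch` is a hypothetical stand-in written for illustration; it is not the implementation of `Traject::MarcExtractor::Spec`.

```ruby
# Hypothetical stand-in for Traject::MarcExtractor::Spec, written only to
# illustrate the accessor-style API the updated tests exercise.
# It is NOT the library's implementation.
class SpecSketch
  # Attribute names taken from the accessors and constructor keys
  # that appear in the tests.
  ATTRIBUTES = [:tag, :subfields, :indicator1, :indicator2, :bytes].freeze
  attr_accessor(*ATTRIBUTES)

  # Accepts a hash of attributes, e.g. SpecSketch.new(:tag => "245").
  def initialize(attrs = {})
    attrs.each_pair { |name, value| public_send("#{name}=", value) }
  end

  # Two specs are equal when every attribute matches.
  def ==(other)
    other.is_a?(SpecSketch) &&
      ATTRIBUTES.all? { |name| public_send(name) == other.public_send(name) }
  end
end

# 0.15.0-style entry for the 700 spec in the tests: a plain hash,
# with both indicators packed into one array.
old_spec = { :indicators => [nil, "4"], :subfields => %w{b c d} }

# 0.16.0-style equivalent: one object, one accessor per indicator.
new_spec = SpecSketch.new(:tag => "700", :indicator2 => "4", :subfields => %w{b c d})

puts old_spec[:indicators].inspect  # hash lookup, as the removed assertions did
puts new_spec.indicator2.inspect    # accessor call, as the added assertions do
```

Callers that indexed into the old hashes, for example inside an `each_matching_line` block, would presumably need the same key-to-accessor change when moving to 0.16.0.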
metadata CHANGED
@@ -2,7 +2,7 @@
2
2
  name: traject
3
3
  version: !ruby/object:Gem::Version
4
4
  prerelease:
5
- version: 0.15.0
5
+ version: 0.16.0
6
6
  platform: ruby
7
7
  authors:
8
8
  - Jonathan Rochkind
@@ -10,7 +10,7 @@ authors:
10
10
  autorequire:
11
11
  bindir: bin
12
12
  cert_chain: []
13
- date: 2013-09-25 00:00:00.000000000 Z
13
+ date: 2013-09-30 00:00:00.000000000 Z
14
14
  dependencies:
15
15
  - !ruby/object:Gem::Dependency
16
16
  name: marc