traject 0.16.0 → 0.17.0

Files changed (53)
  1. checksums.yaml +7 -0
  2. data/.yardopts +1 -0
  3. data/README.md +183 -191
  4. data/bench/bench.rb +1 -1
  5. data/doc/batch_execution.md +14 -0
  6. data/doc/extending.md +14 -12
  7. data/doc/indexing_rules.md +265 -0
  8. data/lib/traject/command_line.rb +12 -41
  9. data/lib/traject/debug_writer.rb +32 -13
  10. data/lib/traject/indexer.rb +101 -24
  11. data/lib/traject/indexer/settings.rb +18 -17
  12. data/lib/traject/json_writer.rb +32 -11
  13. data/lib/traject/line_writer.rb +6 -6
  14. data/lib/traject/macros/basic.rb +1 -1
  15. data/lib/traject/macros/marc21.rb +17 -13
  16. data/lib/traject/macros/marc21_semantics.rb +27 -25
  17. data/lib/traject/macros/marc_format_classifier.rb +39 -25
  18. data/lib/traject/marc4j_reader.rb +36 -22
  19. data/lib/traject/marc_extractor.rb +79 -75
  20. data/lib/traject/marc_reader.rb +33 -25
  21. data/lib/traject/mock_reader.rb +9 -10
  22. data/lib/traject/ndj_reader.rb +7 -7
  23. data/lib/traject/null_writer.rb +1 -1
  24. data/lib/traject/qualified_const_get.rb +12 -2
  25. data/lib/traject/solrj_writer.rb +61 -52
  26. data/lib/traject/thread_pool.rb +45 -45
  27. data/lib/traject/translation_map.rb +59 -27
  28. data/lib/traject/util.rb +3 -3
  29. data/lib/traject/version.rb +1 -1
  30. data/lib/traject/yaml_writer.rb +1 -1
  31. data/test/debug_writer_test.rb +7 -7
  32. data/test/indexer/each_record_test.rb +4 -4
  33. data/test/indexer/macros_marc21_semantics_test.rb +12 -12
  34. data/test/indexer/macros_marc21_test.rb +10 -10
  35. data/test/indexer/macros_test.rb +1 -1
  36. data/test/indexer/map_record_test.rb +6 -6
  37. data/test/indexer/read_write_test.rb +43 -4
  38. data/test/indexer/settings_test.rb +2 -2
  39. data/test/indexer/to_field_test.rb +8 -8
  40. data/test/marc4j_reader_test.rb +4 -4
  41. data/test/marc_extractor_test.rb +33 -25
  42. data/test/marc_format_classifier_test.rb +3 -3
  43. data/test/marc_reader_test.rb +2 -2
  44. data/test/test_helper.rb +3 -3
  45. data/test/test_support/demo_config.rb +52 -48
  46. data/test/translation_map_test.rb +22 -4
  47. data/test/translation_maps/bad_ruby.rb +2 -2
  48. data/test/translation_maps/both_map.rb +1 -1
  49. data/test/translation_maps/default_literal.rb +1 -1
  50. data/test/translation_maps/default_passthrough.rb +1 -1
  51. data/test/translation_maps/ruby_map.rb +1 -1
  52. metadata +7 -31
  53. data/doc/macros.md +0 -103
@@ -27,4 +27,4 @@ Benchmark.bmbm do |x|
27
27
  end
28
28
  end
29
29
 
30
-
30
+
@@ -99,6 +99,20 @@ Now any account, in a crontab, in an interactive shell, wherever,
99
99
  can just execute `jruby-traject {arguments}`, and execute traject
100
100
  in a jruby environment.
101
101
 
102
+ ### Bundler too?
103
+
104
+ If you're running with bundler too, you could make a wrapper file specific to
105
+ a particular traject project and its Gemfile, by combining the `bundle exec` into
106
+ your wrapper file. For instance, for chruby, this works:
107
+
108
+ #!/usr/bin/env bash
109
+
110
+ chruby-exec jruby -- BUNDLE_GEMFILE=/path/to/Gemfile bundle exec traject "$@"
111
+
112
+ Now you can call your wrapper script from anywhere and with any active ruby,
113
+ and execute it in jruby and with the dependencies specified in the Gemfile
114
+ for your project.
115
+
102
116
  ## Exit codes
103
117
 
104
118
  Traject tries to always return a well-behaved unix exit code -- 0 for success,
@@ -19,9 +19,9 @@ of a couple traject features meant to make it easier.
19
19
  * translation map files found in a
20
20
  "./translation_maps" subdir on the load path will be found
21
21
  for Traject translation maps.
22
- * Traject `-G` command line can be used to tell traject to use
23
- bundler with a `Gemfile` located at current working dirctory
24
- (or give an argument to `-G ./some/myGemfile`)
22
+ * You can use Bundler with traject simply by creating a Gemfile with `bundler init`,
23
+ and then running command line with `bundle exec traject` or
24
+ even `BUNDLE_GEMFILE=path/to/Gemfile bundle exec traject`
25
25
 
26
26
  ## Custom code local to your project
27
27
 
@@ -160,19 +160,21 @@ possibly with version restrictions, in the [Gemfile](http://bundler.io/v1.3/gemf
160
160
  Run `bundle install` from the directory with the Gemfile, on any system
161
161
  at any time, to make sure specified gems are installed.
162
162
 
163
- **Run traject** with the `-G` flag to tell it to use the Gemfile, for instance if
164
- your working directory is the one that includes your Gemfile:
163
+ **Run traject** with `bundle exec` to have bundler set up the environment
164
+ from your Gemfile. You can `cd` into the directory containing the Gemfile,
165
+ so bundler can find it:
165
166
 
166
- traject -G -c some_traject_config.rb ...
167
+ $ cd /some/where
168
+ $ bundle exec traject -c some_traject_config.rb ...
167
169
 
168
- Or explicitly specify a Gemfile somewhere else:
170
+ Or you can use the BUNDLE_GEMFILE environment variable to tell bundler where
171
+ to find the Gemfile, and run from any directory at all:
169
172
 
170
- traject -G /some/path/Gemfile -c some_config.rb ...
173
+ $ BUNDLE_GEMFILE=/path/to/Gemfile bundle exec traject -c /path/to/some_config.rb ...
171
174
 
172
- Traject will use bundler to setup with the Gemfile, making sure
173
- the specified versions of all gems are used (and also making sure
174
- no gems except those specified in the gemfile are available to
175
- the program).
175
+ Bundler will make sure the specified versions of all gems are used by
176
+ traject, and also make sure no gems except those specified in the gemfile
177
+ are available to the program, for a reliable reproducible environment.
176
178
 
177
179
  You should still `require` the gem in your traject config file,
178
180
  then just refer to what it provides in your config code as usual.
@@ -0,0 +1,265 @@
1
+ # Details on Traject Indexing: from custom logic to Macros
2
+
3
+ Traject macros are a way of providing re-usable index mapping rules. Before we discuss how they work, we need to remind ourselves of the basic/direct Traject `to_field` indexing method.
4
+
5
+ ## How direct indexing logic works
6
+
7
+ Here's the simplest possible direct Traject mapping logic, duplicating the effects of the `literal` macro:
8
+
9
+ ~~~ruby
10
+ to_field("title") do |record, accumulator, context|
11
+ accumulator << "FIXED LITERAL"
12
+ end
13
+ ~~~
14
+
15
+ That `do` is just ruby block syntax, whereby we can pass a block of ruby code as an argument to a ruby method. We pass a block taking three arguments, labeled `record`, `accumulator`, and `context`, to the `to_field` method. The third `context` argument is optional: you can declare it in your block or not, depending on whether you want to use it.
16
+
17
+ The block is then stored by the Traject::Indexer, and called for each record indexed, with three arguments provided.
18
+
19
+ ### record argument
20
+
21
+ The record that gets passed to your block is a MARC::Record object (or, theoretically, any object that gets returned by a traject Reader). Your logic will usually examine the record to calculate the desired output.
22
+
23
+ ### accumulator argument
24
+
25
+ The accumulator argument is an array. At the end of your custom code, the accumulator
26
+ array should hold the output you want to send off, to the field specified in the `to_field`.
27
+
28
+ The accumulator is a reference to a ruby array, and you need to **modify** that array,
29
+ manipulating it in place with Array methods that mutate the array, like `concat`, `<<`,
30
+ `map!` or even `replace`.
31
+
32
+ You can't simply assign the accumulator variable to a different array, that won't work,
33
+ you need to modify the array in-place.
34
+
35
+ # Won't work, assigning variable
36
+ to_field('foo') do |rec, acc|
37
+ acc = ["some constant"] # WRONG!
38
+ end
39
+
40
+ # Won't work, assigning variable
41
+ to_field('foo') do |rec, acc|
42
+ acc << 'bill'
43
+ acc << 'dueber'
44
+ acc = acc.map{|str| str.upcase}
45
+ end # WRONG! WRONG! WRONG! WRONG! WRONG!
46
+
47
+
48
+ # Instead, modify the array in place
49
+ to_field('foo') {|rec, acc| acc << "some constant" }
50
+ to_field('foo') do |rec, acc|
51
+ acc << 'bill'
52
+ acc << 'dueber'
53
+ acc.map!{|str| str.upcase} # notice using "map!" not just "map"
54
+ end
55
+
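The difference between rebinding and mutating can be checked outside traject. The following is a minimal sketch in plain Ruby, assuming the indexer creates the array, passes it to the block, and then reads that same array back; `run_step` is a hypothetical stand-in, not part of traject.

```ruby
# Hypothetical stand-in for how an indexer might call a to_field block:
# it creates the array, hands it to the block, then reads the same array back.
def run_step(&block)
  accumulator = []
  block.call(:fake_record, accumulator)
  accumulator
end

# Reassignment only rebinds the block-local name; the caller's array is untouched.
reassigned = run_step { |rec, acc| acc = ["some constant"] }

# Mutation changes the one array that both sides share.
mutated = run_step do |rec, acc|
  acc << "bill" << "dueber"
  acc.map! { |str| str.upcase }
end

p reassigned  # the caller still sees an empty array
p mutated     # the caller sees the upcased values
```

Because the caller keeps its own reference to the original array, only in-place methods such as `<<`, `concat`, `map!` and `replace` are visible to it.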
56
+ ### context argument
57
+
58
+
59
+
60
+ The third optional argument is a
61
+ [Traject::Indexer::Context](./lib/traject/indexer/context.rb) ([rdoc](http://rdoc.info/github/jrochkind/traject/Traject/Indexer/Context))
62
+ object. Most of the time you don't need it, but you can use it for
63
+ some sophisticated functionality, for example using these Context methods:
64
+
65
+ * `context.clipboard` A hash into which you can stuff values that you want to pass from one indexing step to another. For example, if you go through a bunch of work to query a database and get a result you'll need more than once, stick the results somewhere in the clipboard.
66
+ * `context.position` The position of the record in the input file (e.g., was it the first record, second, etc.). Useful for error reporting.
67
+ * `context.output_hash` A hash mapping the field names (generally defined in `to_field` calls) to an array of values to be sent to the writer associated with that field. This allows you to modify what goes to the writer without going through a `to_field` call -- you can just set `context.output_hash['myfield'] = ['my', 'values']` and you're set. See below for more examples
68
+ * `context.skip!(msg)` An assertion that this record should be ignored. No more indexing steps will be called, no results will be sent to the writer, and a `debug`-level log message will be written stating that the record was skipped.
69
+
70
+
71
+ ## Gotcha: Use closures to make your code more efficient
72
+
73
+ A _closure_ is a computer-science term that means "a piece of code
74
+ that remembers all the variables that were in scope when it was
75
+ created." In ruby, lambdas and blocks are closures. Method definitions
76
+ are not, which most of us have run across much to our chagrin.
77
+
78
+ Within the context of `traject`, this means you can define a variable
79
+ outside of a `to_field` or `each_record` block and it will be available
80
+ inside those blocks. And you only have to define it once.
81
+
82
+ That's useful to do for any object that is even a bit expensive
83
+ to create -- we can maximize the performance of our traject
84
+ indexing by creating those objects once outside the block,
85
+ instead of inside the block where it will be created
86
+ once per-record (every time the block is executed):
87
+
88
+ Compare:
89
+
90
+ ```ruby
91
+ # Create the transformer for every single record
92
+ to_field 'normalized_title' do |rec, acc|
93
+ transformer = My::Custom::Format::Transformer.new # Oh no! I'm doing this for each of my 10M records!
94
+ acc << transformer.transform(rec['245'].value)
95
+ end
96
+
97
+ # Create the transformer exactly once
98
+ transformer = My::Custom::Format::Transformer.new # Ahhh. Do it once.
99
+ to_field 'normalized_title' do |rec, acc|
100
+ acc << transformer.transform(rec['245'].value)
101
+ end
102
+ ```
103
+
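The cost difference can be made visible by counting constructor calls. This is a sketch in plain Ruby; `CountingTransformer` is a hypothetical class invented for the illustration, and `Array#map` stands in for the indexer calling a block once per record.

```ruby
# Hypothetical transformer that counts how often it is constructed.
class CountingTransformer
  @instances = 0
  class << self
    attr_accessor :instances
  end

  def initialize
    self.class.instances += 1
  end

  def transform(str)
    str.strip.downcase
  end
end

titles = ["  First Title ", "Second TITLE", "Third"]

# Constructed inside the block: one new object per record.
per_record = titles.map { |t| CountingTransformer.new.transform(t) }
built_inside = CountingTransformer.instances

# Constructed once outside; the block closes over the local variable.
CountingTransformer.instances = 0
transformer = CountingTransformer.new
shared = titles.map { |t| transformer.transform(t) }
built_outside = CountingTransformer.instances

puts "inside: #{built_inside}, outside: #{built_outside}"
```

Both variants produce the same values; only the number of objects built differs, and that number grows with the record count in the first variant.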
104
+ Certain built-in traject calls have been optimized to be high performance
105
+ so it's safe to do them inside 'inner loop' blocks though.
106
+ That includes `Traject::TranslationMap.new` and `Traject::MarcExtractor.cached("xxx")`
107
+ (note #cached rather than #new there)
108
+
109
+
110
+ ## From block to lambda
111
+
112
+ In the ruby language, in addition to creating a code block as an argument
113
+ to a method with `do |args| ... end` or `{|arg| ... }`, we can also create
114
+ a code block to hold in a variable, with the `lambda` method:
115
+
116
+ always_output_foo = lambda do |record, accumulator|
117
+ accumulator << "FOO"
118
+ end
119
+
120
+ traject `to_field` is written so, as a convenience, it can take a lambda expression
121
+ stored in a variable as an alternative to a block:
122
+
123
+ to_field "always_has_foo", always_output_foo
124
+
125
+ Why is this a convenience? Well, ordinarily it's not something we
126
+ need, but in fact it's what allows traject 'macros' as re-usable
127
+ code templates.
128
+
129
+
130
+ ## Macros
131
+
132
+ A Traject macro is a way to automatically create indexing rules via re-usable "templates".
133
+
134
+ Traject macros are simply methods that return ruby lambda/proc objects, possibly creating
135
+ them based on parameters passed in.
136
+
137
+ Here is in fact how the `literal` function is implemented:
138
+
139
+ ~~~ruby
140
+ def literal(value)
141
+ return lambda do |record, accumulator, context|
142
+ # because a lambda is a closure, we can define it in terms
143
+ # of the 'value' from the scope it's defined in!
144
+ accumulator << value
145
+ end
146
+ end
147
+ to_field "something", literal("something")
148
+ ~~~
149
+
150
+ It's really as simple as that; that's all a Traject macro is: a function that takes parameters and, based on those parameters, returns a lambda. The lambda is then passed to the `to_field` indexing method, or similar methods.
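To see the mechanism end to end without traject installed, here is a minimal sketch. `TinyIndexer` is a hypothetical miniature written only for this illustration; it is not how `Traject::Indexer` is implemented, and it supports nothing beyond storing lambdas and calling them with a record and a fresh accumulator.

```ruby
# Hypothetical miniature of an indexer, only to show the macro mechanism.
class TinyIndexer
  def initialize
    @steps = []
  end

  # Accepts a field name and a lambda, such as one returned by a macro.
  def to_field(name, a_lambda)
    @steps << [name, a_lambda]
  end

  # Runs every stored step against one record and collects the output.
  def map_record(record)
    output = {}
    @steps.each do |name, step|
      accumulator = []
      step.call(record, accumulator)
      (output[name] ||= []).concat(accumulator) unless accumulator.empty?
    end
    output
  end
end

# A macro: a method whose return value is a lambda built from its arguments.
module LiteralMacro
  def literal(value)
    lambda { |record, accumulator| accumulator << value }
  end
end

indexer = TinyIndexer.new
indexer.extend(LiteralMacro)
indexer.to_field("source", indexer.literal("LIBRARY CATALOG"))
indexer.to_field("source", indexer.literal("SECOND VALUE"))

p indexer.map_record(:any_record)
```

Each call to `literal` returns a separate lambda that remembers its own `value`, which is why the two steps contribute different strings to the same field.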
151
+
152
+ How do you make these methods available to the indexer?
153
+
154
+ Define it in a module:
155
+
156
+ ~~~ruby
157
+ # in a file literal_macro.rb
158
+ module LiteralMacro
159
+ def literal(value)
160
+ return lambda do |record, accumulator, context|
161
+ # because a lambda is a closure, we can define it in terms
162
+ # of the 'value' from the scope it's defined in!
163
+ accumulator << value
164
+ end
165
+ end
166
+ end
167
+ ~~~
168
+
169
+ And then use ordinary ruby `require` and `extend` to add it to the current Indexer file, by simply including this
170
+ in one of your config files:
171
+
172
+ ~~~
173
+ require 'literal_macro'
174
+ extend LiteralMacro
175
+
176
+ to_field ...
177
+ ~~~
178
+
179
+ That's it. You can use the traject command line `-I` option to set the ruby load path, so your file will be findable via `require`. Or you can distribute it in a gem, and use straight rubygems and the `gem` command in your configuration file, or Bundler with `bundle exec traject`.
180
+
181
+ ## Using a lambda _and_ a block
182
+
183
+ Traject macros (such as `extract_marc`) create and return a lambda. If
184
+ you include a lambda _and_ a block on a `to_field` call, the latter
185
+ gets the accumulator as it was filled in by the former.
186
+
187
+ ```ruby
188
+ # Get the titles and lowercase them
189
+ to_field 'lc_title', extract_marc('245') do |rec, acc, context|
190
+ acc.map!{|title| title.downcase}
191
+ end
192
+
193
+ # Build my own lambda and use it
194
+ mylam = lambda {|rec, acc| acc << 'one'} # just add a constant
195
+ to_field 'foo', mylam do |rec, acc, context|
196
+ acc << 'two'
197
+ end #=> context.output_hash['foo'] == ['one', 'two']
198
+
199
+
200
+ # You might also want to do something like this
201
+
202
+ to_field 'foo', my_macro_that_doesnt_dedup do |rec, acc|
203
+ acc.uniq!
204
+ end
205
+ ```
206
+
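The lambda-then-block ordering described above can be sketched in plain Ruby. `run_field_step` is a hypothetical helper that mirrors the described behavior (lambda first, block second, one shared accumulator); it is an assumption for illustration, not traject's actual code.

```ruby
# Hypothetical runner: calls the lambda first, then the block, passing both
# the same accumulator array.
def run_field_step(record, a_lambda = nil, &block)
  accumulator = []
  [a_lambda, block].compact.each do |step|
    step.call(record, accumulator)
  end
  accumulator
end

# Plays the role of a macro-produced lambda that does no clean-up itself.
add_titles = lambda { |rec, acc| acc.concat(rec[:titles]) }

record = { titles: ["Moby Dick", "MOBY DICK", "moby dick"] }

# The block post-processes whatever the lambda left in the accumulator.
result = run_field_step(record, add_titles) do |rec, acc|
  acc.map!(&:downcase)
  acc.uniq!
end

p result
```

The block never has to know how the values were extracted; it only normalizes and de-duplicates what is already in the array.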
207
+ ## Manipulating `context.output_hash` directly
208
+
209
+ If you ask for the context argument, a [Traject::Indexer::Context](./lib/traject/indexer/context.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Indexer/Context)), you have access to `context.output_hash`, which is
210
+ the hash of transformed output that will be sent to Solr (or any other Writer).
211
+
212
+ You can look in there to see any already transformed output and use it as the source
213
+ for new output. You can actually *write* to there manually, which can be useful
214
+ to write routines that affect more than one output field at once.
215
+
216
+ **Note**: Make sure you always assign an _array_ to, e.g., `context.output_hash['foo']`, not a single value!
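As a rough illustration of reading and writing the output hash, the sketch below uses a hypothetical `FakeContext` struct in place of `Traject::Indexer::Context`; the field names are invented for the example.

```ruby
# Hypothetical stand-in for the context object: only the two hashes used here.
FakeContext = Struct.new(:output_hash, :clipboard)

context = FakeContext.new({}, {})
context.output_hash["title"] = ["Moby Dick"]  # as if an earlier to_field had run

# Read already-transformed output and derive two more fields from it.
titles = context.output_hash.fetch("title", [])
context.output_hash["title_sort"]  = titles.map { |t| t.downcase }  # always an array
context.output_hash["title_count"] = [titles.length.to_s]           # even for one value

p context.output_hash
```

Wrapping the single count in an array keeps every value in the hash the same shape, which is what the note above asks for.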
217
+
218
+
219
+
220
+ ## each_record
221
+
222
+ All the previous discussion was in terms of `to_field` -- `each_record` is a similar
223
+ routine, to define logic that is executed for each record, but isn't fixed to write
224
+ to a single output field.
225
+
226
+ So `each_record` blocks have no `accumulator` argument, instead they either take a single
227
+ `record` argument; or both a `record` and a `context`.
228
+
229
+ `each_record` can be used for logging or notifying; computing intermediate
230
+ results; or writing to more than one field at once.
231
+
232
+ ~~~ruby
233
+ each_record do |record, context|
234
+ if is_it_bad?(record)
235
+ context.skip!("Skipping bad record")
236
+ else
237
+ context.clipboard[:expensive_result] = calculate_expensive_thing(record)
238
+ end
239
+ end
240
+
241
+ each_record do |record, context|
242
+ (one, two) = calculate_two_things_from(record)
243
+
244
+ context.output_hash["first_field"] ||= []
245
+ context.output_hash["first_field"] << one
246
+
247
+ context.output_hash["second_field"] ||= []
248
+ context.output_hash["second_field"] << two
249
+ end
250
+ ~~~
251
+
252
+ traject doesn't come with any macros written for use with
253
+ `each_record`, but they could be created if useful --
254
+ just methods that return lambdas taking the right
255
+ args for `each_record`.
256
+
257
+ ## More tips and gotchas about indexing steps
258
+
259
+ * **All your `to_field` and `each_record` steps are run _in the order in which they were initially evaluated_**. That means that the order you call your config files can potentially make a difference if you're screwing around stuffing stuff into the context clipboard or whatnot.
260
+
261
+ * **`to_field` can be called multiple times on the same field name.** If you call the same field name multiple times, all the values will be sent to the writer.
262
+
263
+ * **Once you call `context.skip!(msg)` no more index steps will be run for that record**. So if you have any cleanup code, you'll need to make sure to call it yourself.
264
+
265
+ * **By default, `traject` indexing runs multi-threaded**. In the current implementation, the indexing steps for one record are *not* split across threads, but different records can be processed simultaneously by more than one thread. That means you need to make sure your code is thread-safe (or always set `processing_thread_pool` to 0).
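One common way to keep shared state safe when several records are processed at once is to guard it with a `Mutex`. The sketch below is plain Ruby, with hand-made threads standing in for the processing thread pool; the `tally` lambda and the record hashes are hypothetical.

```ruby
# Shared state that several indexing threads might touch.
seen_formats = Hash.new(0)
lock = Mutex.new

# Hypothetical stand-in for the body of an each_record block.
tally = lambda do |record|
  lock.synchronize { seen_formats[record[:format]] += 1 }
end

records = Array.new(1000) { |i| { format: i.even? ? "book" : "serial" } }

# Simulate a pool of four worker threads, each handling a slice of records.
threads = records.each_slice(250).map do |slice|
  Thread.new { slice.each { |rec| tally.call(rec) } }
end
threads.each(&:join)

p seen_formats
```

Objects that are only read inside the blocks usually need no lock; the lock matters for anything the blocks modify across records, such as counters or caches.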
@@ -1,7 +1,6 @@
1
- # Require as little as possible at top, so we can bundle require later
2
- # if needed, before requiring anything from the bundle. Can't avoid slop
3
- # though, to get our bundle arg out, sorry.
4
1
  require 'slop'
2
+ require 'traject'
3
+ require 'traject/indexer'
5
4
 
6
5
  module Traject
7
6
  # The class that executes for the Traject command line utility.
@@ -33,21 +32,6 @@ module Traject
33
32
  # Returns true on success or false on failure; may also raise exceptions;
34
33
  # may also exit program directly itself (yeah, could use some normalization)
35
34
  def execute
36
- # Do bundler setup FIRST to try and initialize all gems from gemfile
37
- # if requested.
38
-
39
- # have to use Slop object to tell diff between
40
- # no arg supplied and no option -g given at all
41
- if slop.present? :Gemfile
42
- require_bundler_setup(options[:Gemfile])
43
- end
44
-
45
-
46
- # We require them here instead of top of file,
47
- # so we have done bundler require before we require these.
48
- require 'traject'
49
- require 'traject/indexer'
50
-
51
35
  if options[:version]
52
36
  self.console.puts "traject version #{Traject::VERSION}"
53
37
  return
@@ -92,6 +76,10 @@ module Traject
92
76
  end
93
77
 
94
78
  return result
79
+ rescue Exception => e
80
+ # Try to log unexpected exceptions if possible
81
+ indexer && indexer.logger && indexer.logger.fatal("Traject::CommandLine: Unexpected exception, terminating execution: #{e.inspect}") rescue nil
82
+ raise e
95
83
  end
96
84
 
97
85
  def command_commit!
@@ -117,19 +105,21 @@ module Traject
117
105
  $stdout
118
106
  end
119
107
 
108
+ indexer.logger.info(" marcout writing type:#{output_type} to file:#{output_arg}")
109
+
120
110
  case output_type
121
111
  when "binary"
122
112
  writer = MARC::Writer.new(output_arg)
123
113
 
124
114
  allow_oversized = indexer.settings["marcout.allow_oversized"]
125
115
  if allow_oversized
126
- allow_oversized = (allow_oversized.to_s == "true")
116
+ allow_oversized = (allow_oversized.to_s == "true")
127
117
  writer.allow_oversized = allow_oversized
128
118
  end
129
119
  when "xml"
130
120
  writer = MARC::XMLWriter.new(output_arg)
131
121
  when "human"
132
- writer = output_arg.kind_of?(String) ? File.open(output_arg, "w:binary") : output_arg
122
+ writer = output_arg.kind_of?(String) ? File.open(output_arg, "w:binary") : output_arg
133
123
  else
134
124
  raise ArgumentError.new("traject marcout unrecognized marcout.type: #{output_type}")
135
125
  end
@@ -174,7 +164,7 @@ module Traject
174
164
  filename = argv.first
175
165
  indexer.logger.info "Reading from #{filename}"
176
166
  end
177
-
167
+
178
168
  return io, filename
179
169
  end
180
170
 
@@ -215,24 +205,6 @@ module Traject
215
205
  end
216
206
  end
217
207
 
218
- # requires bundler/setup, optionally first setting ENV["BUNDLE_GEMFILE"]
219
- # to tell bundler to use a specific gemfile. Gemfile arg can be relative
220
- # to current working directory.
221
- def require_bundler_setup(gemfile=nil)
222
- if gemfile
223
- # tell bundler what gemfile to use
224
- gem_path = File.expand_path( gemfile )
225
- # bundler not good at error reporting, we check ourselves
226
- unless File.exists? gem_path
227
- self.console.puts "Gemfile `#{gemfile}` does not exist, exiting..."
228
- self.console.puts
229
- self.console.puts slop.help
230
- exit 2
231
- end
232
- ENV["BUNDLE_GEMFILE"] = gem_path
233
- end
234
- require 'bundler/setup'
235
- end
236
208
 
237
209
  def assemble_settings_hash(options)
238
210
  settings = {}
@@ -256,7 +228,7 @@ module Traject
256
228
  if options[:'debug-mode']
257
229
  require 'traject/debug_writer'
258
230
  settings["writer_class_name"] = "Traject::DebugWriter"
259
- settings["log.level"] = "debug"
231
+ settings["log.level"] = "debug"
260
232
  settings["processing_thread_pool"] = 0
261
233
  end
262
234
  if options[:writer]
@@ -294,7 +266,6 @@ module Traject
294
266
  on :u, :solr, "Set solr url, shortcut for -s solr.url=", :argument => true
295
267
  on :t, :marc_type, "xml, json or binary. shortcut for -s marc_source.type=", :argument => true
296
268
  on :I, "load_path", "append paths to ruby $LOAD_PATH", :argument => true, :as => Array, :delimiter => ":"
297
- on :G, "Gemfile", "run with bundler and optionally specified Gemfile", :argument => :optional, :default => nil
298
269
 
299
270
  on :x, "command", "alternate traject command: process (default); marcout; commit", :argument => true, :default => "process"
300
271