traject 0.16.0 → 0.17.0

Files changed (53)
  1. checksums.yaml +7 -0
  2. data/.yardopts +1 -0
  3. data/README.md +183 -191
  4. data/bench/bench.rb +1 -1
  5. data/doc/batch_execution.md +14 -0
  6. data/doc/extending.md +14 -12
  7. data/doc/indexing_rules.md +265 -0
  8. data/lib/traject/command_line.rb +12 -41
  9. data/lib/traject/debug_writer.rb +32 -13
  10. data/lib/traject/indexer.rb +101 -24
  11. data/lib/traject/indexer/settings.rb +18 -17
  12. data/lib/traject/json_writer.rb +32 -11
  13. data/lib/traject/line_writer.rb +6 -6
  14. data/lib/traject/macros/basic.rb +1 -1
  15. data/lib/traject/macros/marc21.rb +17 -13
  16. data/lib/traject/macros/marc21_semantics.rb +27 -25
  17. data/lib/traject/macros/marc_format_classifier.rb +39 -25
  18. data/lib/traject/marc4j_reader.rb +36 -22
  19. data/lib/traject/marc_extractor.rb +79 -75
  20. data/lib/traject/marc_reader.rb +33 -25
  21. data/lib/traject/mock_reader.rb +9 -10
  22. data/lib/traject/ndj_reader.rb +7 -7
  23. data/lib/traject/null_writer.rb +1 -1
  24. data/lib/traject/qualified_const_get.rb +12 -2
  25. data/lib/traject/solrj_writer.rb +61 -52
  26. data/lib/traject/thread_pool.rb +45 -45
  27. data/lib/traject/translation_map.rb +59 -27
  28. data/lib/traject/util.rb +3 -3
  29. data/lib/traject/version.rb +1 -1
  30. data/lib/traject/yaml_writer.rb +1 -1
  31. data/test/debug_writer_test.rb +7 -7
  32. data/test/indexer/each_record_test.rb +4 -4
  33. data/test/indexer/macros_marc21_semantics_test.rb +12 -12
  34. data/test/indexer/macros_marc21_test.rb +10 -10
  35. data/test/indexer/macros_test.rb +1 -1
  36. data/test/indexer/map_record_test.rb +6 -6
  37. data/test/indexer/read_write_test.rb +43 -4
  38. data/test/indexer/settings_test.rb +2 -2
  39. data/test/indexer/to_field_test.rb +8 -8
  40. data/test/marc4j_reader_test.rb +4 -4
  41. data/test/marc_extractor_test.rb +33 -25
  42. data/test/marc_format_classifier_test.rb +3 -3
  43. data/test/marc_reader_test.rb +2 -2
  44. data/test/test_helper.rb +3 -3
  45. data/test/test_support/demo_config.rb +52 -48
  46. data/test/translation_map_test.rb +22 -4
  47. data/test/translation_maps/bad_ruby.rb +2 -2
  48. data/test/translation_maps/both_map.rb +1 -1
  49. data/test/translation_maps/default_literal.rb +1 -1
  50. data/test/translation_maps/default_passthrough.rb +1 -1
  51. data/test/translation_maps/ruby_map.rb +1 -1
  52. metadata +7 -31
  53. data/doc/macros.md +0 -103
@@ -27,4 +27,4 @@ Benchmark.bmbm do |x|
27
27
  end
28
28
  end
29
29
 
30
-
30
+
@@ -99,6 +99,20 @@ Now any account, in a crontab, in an interactive shell, wherever,
99
99
  can just execute `jruby-traject {arguments}`, and execute traject
100
100
  in a jruby environment.
101
101
 
102
+ ### Bundler too?
103
+
104
+ If you're running with bundler too, you could make a wrapper file specific to
105
+ a particular traject project and its Gemfile, by combining the `bundle exec` into
106
+ your wrapper file. For instance, for chruby, this works:
107
+
108
+ #!/usr/bin/env bash
109
+
110
+ chruby-exec jruby -- BUNDLE_GEMFILE=/path/to/Gemfile bundle exec traject "$@"
111
+
112
+ Now you can call your wrapper script from anywhere and with any active ruby,
113
+ and execute it in jruby and with the dependencies specified in the Gemfile
114
+ for your project.
115
+
102
116
  ## Exit codes
103
117
 
104
118
  Traject tries to always return a well-behaved unix exit code -- 0 for success,
@@ -19,9 +19,9 @@ of a couple traject features meant to make it easier.
19
19
  * translation map files found in a
20
20
  "./translation_maps" subdir on the load path will be found
21
21
  for Traject translation maps.
22
- * Traject `-G` command line can be used to tell traject to use
23
- bundler with a `Gemfile` located at current working dirctory
24
- (or give an argument to `-G ./some/myGemfile`)
22
+ * You can use Bundler with traject simply by creating a Gemfile with `bundle init`,
23
+ and then running the command line with `bundle exec traject` or
24
+ even `BUNDLE_GEMFILE=path/to/Gemfile bundle exec traject`
25
25
 
26
26
  ## Custom code local to your project
27
27
 
@@ -160,19 +160,21 @@ possibly with version restrictions, in the [Gemfile](http://bundler.io/v1.3/gemf
160
160
  Run `bundle install` from the directory with the Gemfile, on any system
161
161
  at any time, to make sure specified gems are installed.
162
162
 
163
- **Run traject** with the `-G` flag to tell it to use the Gemfile, for instance if
164
- your working directory is the one that includes your Gemfile:
163
+ **Run traject** with `bundle exec` to have bundler set up the environment
164
+ from your Gemfile. You can `cd` into the directory containing the Gemfile,
165
+ so bundler can find it:
165
166
 
166
- traject -G -c some_traject_config.rb ...
167
+ $ cd /some/where
168
+ $ bundle exec traject -c some_traject_config.rb ...
167
169
 
168
- Or explicitly specify a Gemfile somewhere else:
170
+ Or you can use the BUNDLE_GEMFILE environment variable to tell bundler where
171
+ to find the Gemfile, and run from any directory at all:
169
172
 
170
- traject -G /some/path/Gemfile -c some_config.rb ...
173
+ $ BUNDLE_GEMFILE=/path/to/Gemfile bundle exec traject -c /path/to/some_config.rb ...
171
174
 
172
- Traject will use bundler to setup with the Gemfile, making sure
173
- the specified versions of all gems are used (and also making sure
174
- no gems except those specified in the gemfile are available to
175
- the program).
175
+ Bundler will make sure the specified versions of all gems are used by
176
+ traject, and also make sure no gems except those specified in the gemfile
177
+ are available to the program, for a reliable reproducible environment.
176
178
 
177
179
  You should still `require` the gem in your traject config file,
178
180
  then just refer to what it provides in your config code as usual.
@@ -0,0 +1,265 @@
1
+ # Details on Traject Indexing: from custom logic to Macros
2
+
3
+ Traject macros are a way of providing re-usable index mapping rules. Before we discuss how they work, we need to remind ourselves of the basic/direct Traject `to_field` indexing method.
4
+
5
+ ## How direct indexing logic works
6
+
7
+ Here's the simplest possible direct Traject mapping logic, duplicating the effects of the `literal` macro:
8
+
9
+ ~~~ruby
10
+ to_field("title") do |record, accumulator, context|
11
+ accumulator << "FIXED LITERAL"
12
+ end
13
+ ~~~
14
+
15
+ That `do` is just ruby `block` syntax, whereby we can pass a block of ruby code as an argument to a ruby method. We pass a block taking three arguments, labeled `record`, `accumulator`, and `context`, to the `to_field` method. The third `context` argument is optional: you can declare it in your block or not, depending on whether you want to use it.
16
+
17
+ The block is then stored by the Traject::Indexer, and called for each record indexed, with three arguments provided.
18
+
19
+ ### record argument
20
+
21
+ The record that gets passed to your block is a MARC::Record object (or, theoretically, any object that gets returned by a traject Reader). Your logic will usually examine the record to calculate the desired output.
22
+
23
+ ### accumulator argument
24
+
25
+ The accumulator argument is an array. At the end of your custom code, the accumulator
26
+ array should hold the output you want to send off, to the field specified in the `to_field`.
27
+
28
+ The accumulator is a reference to a ruby array, and you need to **modify** that array,
29
+ manipulating it in place with Array methods that mutate the array, like `concat`, `<<`,
30
+ `map!` or even `replace`.
31
+
32
+ You can't simply assign the accumulator variable to a different array, that won't work,
33
+ you need to modify the array in-place.
34
+
35
+ # Won't work, assigning variable
36
+ to_field('foo') do |rec, acc|
37
+ acc = ["some constant"] # WRONG!
38
+ end
39
+
40
+ # Won't work, assigning variable
41
+ to_field('foo') do |rec, acc|
42
+ acc << 'bill'
43
+ acc << 'dueber'
44
+ acc = acc.map{|str| str.upcase}
45
+ end # WRONG! WRONG! WRONG! WRONG! WRONG!
46
+
47
+
48
+ # Instead, do, modify array in place
49
+ to_field('foo') {|rec, acc| acc << "some constant" }
50
+ to_field('foo') do |rec, acc|
51
+ acc << 'bill'
52
+ acc << 'dueber'
53
+ acc.map!{|str| str.upcase} # notice using "map!" not just "map"
54
+ end
55
+
56
+ ### context argument
57
+
58
+ The third optional context argument
59
+
60
+ The third optional argument is a
61
+ [Traject::Indexer::Context](./lib/traject/indexer/context.rb) ([rdoc](http://rdoc.info/github/jrochkind/traject/Traject/Indexer/Context))
62
+ object. Most of the time you don't need it, but you can use it for
63
+ some sophisticated functionality, for example using these Context methods:
64
+
65
+ * `context.clipboard` A hash into which you can stuff values that you want to pass from one indexing step to another. For example, if you go through a bunch of work to query a database and get a result you'll need more than once, stick the results somewhere in the clipboard.
66
+ * `context.position` The position of the record in the input file (e.g., was it the first record, second, etc.). Useful for error reporting.
67
+ * `context.output_hash` A hash mapping the field names (generally defined in `to_field` calls) to an array of values to be sent to the writer associated with that field. This allows you to modify what goes to the writer without going through a `to_field` call -- you can just set `context.output_hash['myfield'] = ['my', 'values']` and you're set. See below for more examples
68
+ * `context.skip!(msg)` An assertion that this record should be ignored. No more indexing steps will be called, no results will be sent to the writer, and a `debug`-level log message will be written stating that the record was skipped.
69
+
70
+
71
+ ## Gotcha: Use closures to make your code more efficient
72
+
73
+ A _closure_ is a computer-science term that means "a piece of code
74
+ that remembers all the variables that were in scope when it was
75
+ created." In ruby, lambdas and blocks are closures. Method definitions
76
+ are not, which most of us have run across much to our chagrin.
77
+
78
+ Within the context of `traject`, this means you can define a variable
79
+ outside of a `to_field` or `each_record` block and it will be available
80
+ inside those blocks. And you only have to define it once.
81
+
82
+ That's useful to do for any object that is even a bit expensive
83
+ to create -- we can maximize the performance of our traject
84
+ indexing by creating those objects once outside the block,
85
+ instead of inside the block where it will be created
86
+ once per-record (every time the block is executed):
87
+
88
+ Compare:
89
+
90
+ ```ruby
91
+ # Create the transformer for every single record
92
+ to_field 'normalized_title' do |rec, acc|
93
+ transformer = My::Custom::Format::Transformer.new # Oh no! I'm doing this for each of my 10M records!
94
+ acc << transformer.transform(rec['245'].value)
95
+ end
96
+
97
+ # Create the transformer exactly once
98
+ transformer = My::Custom::Format::Transformer.new # Ahhh. Do it once.
99
+ to_field 'normalized_title' do |rec, acc|
100
+ acc << transformer.transform(rec['245'].value)
101
+ end
102
+ ```
103
+
104
+ Certain built-in traject calls have been optimized to be high performance
105
+ so it's safe to do them inside 'inner loop' blocks though.
106
+ That includes `Traject::TranslationMap.new` and `Traject::MarcExtractor.cached("xxx")`
107
+ (note #cached rather than #new there)
108
+
109
+
110
+ ## From block to lambda
111
+
112
+ In the ruby language, in addition to creating a code block as an argument
113
+ to a method with `do |args| ... end` or `{|arg| ... }`, we can also create
114
+ a code block to hold in a variable, with the `lambda` method:
115
+
116
+ always_output_foo = lambda do |record, accumulator|
117
+ accumulator << "FOO"
118
+ end
119
+
120
+ traject `to_field` is written so, as a convenience, it can take a lambda expression
121
+ stored in a variable as an alternative to a block:
122
+
123
+ to_field("always_has_foo", always_output_foo)
124
+
125
+ Why is this a convenience? Well, ordinarily it's not something we
126
+ need, but in fact it's what allows traject 'macros' as re-useable
127
+ code templates.
128
+
129
+
130
+ ## Macros
131
+
132
+ A Traject macro is a way to automatically create indexing rules via re-usable "templates".
133
+
134
+ Traject macros are simply methods that return ruby lambda/proc objects, possibly creating
135
+ them based on parameters passed in.
136
+
137
+ Here is in fact how the `literal` function is implemented:
138
+
139
+ ~~~ruby
140
+ def literal(value)
141
+ return lambda do |record, accumulator, context|
142
+ # because a lambda is a closure, we can define it in terms
143
+ # of the 'value' from the scope it's defined in!
144
+ accumulator << value
145
+ end
146
+ end
147
+ to_field("something", literal("something"))
148
+ ~~~
149
+
150
+ It's really as simple as that, that's all a Traject macro is. A function that takes parameters, and based on those parameters returns a lambda; the lambda is then passed to the `to_field` indexing method, or similar methods.
151
+
152
+ How do you make these methods available to the indexer?
153
+
154
+ Define it in a module:
155
+
156
+ ~~~ruby
157
+ # in a file literal_macro.rb
158
+ module LiteralMacro
159
+ def literal(value)
160
+ return lambda do |record, accumulator, context|
161
+ # because a lambda is a closure, we can define it in terms
162
+ # of the 'value' from the scope it's defined in!
163
+ accumulator << value
164
+ end
165
+ end
166
+ end
167
+ ~~~
168
+
169
+ And then use ordinary ruby `require` and `extend` to add it to the current Indexer file, by simply including this
170
+ in one of your config files:
171
+
172
+ ~~~
173
+ require 'literal_macro'
174
+ extend LiteralMacro
175
+
176
+ to_field ...
177
+ ~~~
178
+
179
+ That's it. You can use the traject command line `-I` option to set the ruby load path, so your file will be findable via `require`. Or you can distribute it in a gem, and use straight rubygems and the `gem` command in your configuration file, or Bundler with `bundle exec traject`.
180
+
181
+ ## Using a lambda _and_ a block
182
+
183
+ Traject macros (such as `extract_marc`) create and return a lambda. If
184
+ you include a lambda _and_ a block on a `to_field` call, the latter
185
+ gets the accumulator as it was filled in by the former.
186
+
187
+ ```ruby
188
+ # Get the titles and lowercase them
189
+ to_field 'lc_title', extract_marc('245') do |rec, acc, context|
190
+ acc.map!{|title| title.downcase}
191
+ end
192
+
193
+ # Build my own lambda and use it
194
+ mylam = lambda {|rec, acc| acc << 'one'} # just add a constant
195
+ to_field('foo', mylam) do |rec, acc, context|
196
+ acc << 'two'
197
+ end #=> context.output_hash['foo'] == ['one', 'two']
198
+
199
+
200
+ # You might also want to do something like this
201
+
202
+ to_field('foo', my_macro_that_doesnt_dedup) do |rec, acc|
203
+ acc.uniq!
204
+ end
205
+ ```
206
+
207
+ ## Manipulating `context.output_hash` directly
208
+
209
+ If you ask for the context argument, a [Traject::Indexer::Context](./lib/traject/indexer/context.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Indexer/Context)), you have access to `context.output_hash`, which is
210
+ the hash of transformed output that will be sent to Solr (or any other Writer)
211
+
212
+ You can look in there to see any already transformed output and use it as the source
213
+ for new output. You can actually *write* to there manually, which can be useful
214
+ to write routines that affect more than one output field at once.
215
+
216
+ **Note**: Make sure you always assign an _array_ to, e.g., `context.output_hash['foo']`, not a single value!
217
+
218
+
219
+
220
+ ## each_record
221
+
222
+ All the previous discussion was in terms of `to_field` -- `each_record` is a similar
223
+ routine, to define logic that is executed for each record, but isn't fixed to write
224
+ to a single output field.
225
+
226
+ So `each_record` blocks have no `accumulator` argument, instead they either take a single
227
+ `record` argument; or both a `record` and a `context`.
228
+
229
+ `each_record` can be used for logging or notifying; computing intermediate
230
+ results; or writing to more than one field at once.
231
+
232
+ ~~~ruby
233
+ each_record do |record, context|
234
+ if is_it_bad?(record)
235
+ context.skip!("Skipping bad record")
236
+ else
237
+ context.clipboard[:expensive_result] = calculate_expensive_thing(record)
238
+ end
239
+ end
240
+
241
+ each_record do |record, context|
242
+ (one, two) = calculate_two_things_from(record)
243
+
244
+ context.output_hash["first_field"] ||= []
245
+ context.output_hash["first_field"] << one
246
+
247
+ context.output_hash["second_field"] ||= []
248
+ context.output_hash["second_field"] << two
249
+ end
250
+ ~~~
251
+
252
+ traject doesn't come with any macros written for use with
253
+ `each_record`, but they could be created if useful --
254
+ just methods that return lambdas taking the right
255
+ args for `each_record`.
256
+
257
+ ## More tips and gotchas about indexing steps
258
+
259
+ * **All your `to_field` and `each_record` steps are run _in the order in which they were initially evaluated_**. That means that the order you call your config files can potentially make a difference if you're screwing around stuffing stuff into the context clipboard or whatnot.
260
+
261
+ * **`to_field` can be called multiple times on the same field name.** If you call the same field name multiple times, all the values will be sent to the writer.
262
+
263
+ * **Once you call `context.skip!(msg)` no more index steps will be run for that record**. So if you have any cleanup code, you'll need to make sure to call it yourself.
264
+
265
+ * **By default, `traject` indexing runs multi-threaded**. In the current implementation, the indexing steps for one record are *not* split across threads, but different records can be processed simultaneously by more than one thread. That means you need to make sure your code is thread-safe (or always set `processing_thread_pool` to 0).
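
The rules described in the new `doc/indexing_rules.md` above can be tried out in plain Ruby, without traject installed. What follows is only a minimal sketch under stated assumptions: `ToyIndexer` is a hypothetical stand-in written for this illustration, not Traject's actual implementation, and it always calls each step with just two arguments (`record`, `accumulator`), ignoring the optional context. It illustrates three points from the document: a macro is a method that returns a lambda, a lambda and a block on the same `to_field` call run in order on the same accumulator, and reassigning the accumulator variable has no effect.

```ruby
# Hypothetical stand-in for an indexer. This is NOT Traject::Indexer;
# it only imitates the calling convention described in the document.
class ToyIndexer
  def initialize
    @steps = []
  end

  # Accepts an optional lambda and/or a block for one output field.
  def to_field(name, callable = nil, &block)
    @steps << [name, [callable, block].compact]
  end

  # Runs every step in the order it was defined and collects the values.
  def map_record(record)
    output = {}
    @steps.each do |name, callables|
      accumulator = []
      # The lambda (if any) runs first, then the block, on the SAME array.
      callables.each { |c| c.call(record, accumulator) }
      (output[name] ||= []).concat(accumulator)
    end
    output
  end
end

# A "macro": a method that builds and returns a lambda from its parameters.
def literal(value)
  lambda { |_record, accumulator| accumulator << value }
end

indexer = ToyIndexer.new

# Macro only.
indexer.to_field("source", literal("catalog"))

# Lambda plus block: the block sees what the lambda accumulated.
add_names = lambda { |_rec, acc| acc << "bill" << "dueber" << "bill" }
indexer.to_field("names", add_names) do |_rec, acc|
  acc.map!(&:upcase) # mutates in place
  acc.uniq!          # mutates in place
end

# Reassignment only rebinds the block-local variable; the caller's array
# is untouched, so nothing is output for this field.
indexer.to_field("lost") do |_rec, acc|
  acc = ["never seen"]
end

result = indexer.map_record({})
```

After this runs, `result["source"]` holds `["catalog"]`, `result["names"]` holds `["BILL", "DUEBER"]`, and `result["lost"]` is empty, because the third block rebound its local variable instead of mutating the array it was given.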
@@ -1,7 +1,6 @@
1
- # Require as little as possible at top, so we can bundle require later
2
- # if needed, before requiring anything from the bundle. Can't avoid slop
3
- # though, to get our bundle arg out, sorry.
4
1
  require 'slop'
2
+ require 'traject'
3
+ require 'traject/indexer'
5
4
 
6
5
  module Traject
7
6
  # The class that executes for the Traject command line utility.
@@ -33,21 +32,6 @@ module Traject
33
32
  # Returns true on success or false on failure; may also raise exceptions;
34
33
  # may also exit program directly itself (yeah, could use some normalization)
35
34
  def execute
36
- # Do bundler setup FIRST to try and initialize all gems from gemfile
37
- # if requested.
38
-
39
- # have to use Slop object to tell diff between
40
- # no arg supplied and no option -g given at all
41
- if slop.present? :Gemfile
42
- require_bundler_setup(options[:Gemfile])
43
- end
44
-
45
-
46
- # We require them here instead of top of file,
47
- # so we have done bundler require before we require these.
48
- require 'traject'
49
- require 'traject/indexer'
50
-
51
35
  if options[:version]
52
36
  self.console.puts "traject version #{Traject::VERSION}"
53
37
  return
@@ -92,6 +76,10 @@ module Traject
92
76
  end
93
77
 
94
78
  return result
79
+ rescue Exception => e
80
+ # Try to log unexpected exceptions if possible
81
+ indexer && indexer.logger && indexer.logger.fatal("Traject::CommandLine: Unexpected exception, terminating execution: #{e.inspect}") rescue nil
82
+ raise e
95
83
  end
96
84
 
97
85
  def command_commit!
@@ -117,19 +105,21 @@ module Traject
117
105
  $stdout
118
106
  end
119
107
 
108
+ indexer.logger.info(" marcout writing type:#{output_type} to file:#{output_arg}")
109
+
120
110
  case output_type
121
111
  when "binary"
122
112
  writer = MARC::Writer.new(output_arg)
123
113
 
124
114
  allow_oversized = indexer.settings["marcout.allow_oversized"]
125
115
  if allow_oversized
126
- allow_oversized = (allow_oversized.to_s == "true")
116
+ allow_oversized = (allow_oversized.to_s == "true")
127
117
  writer.allow_oversized = allow_oversized
128
118
  end
129
119
  when "xml"
130
120
  writer = MARC::XMLWriter.new(output_arg)
131
121
  when "human"
132
- writer = output_arg.kind_of?(String) ? File.open(output_arg, "w:binary") : output_arg
122
+ writer = output_arg.kind_of?(String) ? File.open(output_arg, "w:binary") : output_arg
133
123
  else
134
124
  raise ArgumentError.new("traject marcout unrecognized marcout.type: #{output_type}")
135
125
  end
@@ -174,7 +164,7 @@ module Traject
174
164
  filename = argv.first
175
165
  indexer.logger.info "Reading from #{filename}"
176
166
  end
177
-
167
+
178
168
  return io, filename
179
169
  end
180
170
 
@@ -215,24 +205,6 @@ module Traject
215
205
  end
216
206
  end
217
207
 
218
- # requires bundler/setup, optionally first setting ENV["BUNDLE_GEMFILE"]
219
- # to tell bundler to use a specific gemfile. Gemfile arg can be relative
220
- # to current working directory.
221
- def require_bundler_setup(gemfile=nil)
222
- if gemfile
223
- # tell bundler what gemfile to use
224
- gem_path = File.expand_path( gemfile )
225
- # bundler not good at error reporting, we check ourselves
226
- unless File.exists? gem_path
227
- self.console.puts "Gemfile `#{gemfile}` does not exist, exiting..."
228
- self.console.puts
229
- self.console.puts slop.help
230
- exit 2
231
- end
232
- ENV["BUNDLE_GEMFILE"] = gem_path
233
- end
234
- require 'bundler/setup'
235
- end
236
208
 
237
209
  def assemble_settings_hash(options)
238
210
  settings = {}
@@ -256,7 +228,7 @@ module Traject
256
228
  if options[:'debug-mode']
257
229
  require 'traject/debug_writer'
258
230
  settings["writer_class_name"] = "Traject::DebugWriter"
259
- settings["log.level"] = "debug"
231
+ settings["log.level"] = "debug"
260
232
  settings["processing_thread_pool"] = 0
261
233
  end
262
234
  if options[:writer]
@@ -294,7 +266,6 @@ module Traject
294
266
  on :u, :solr, "Set solr url, shortcut for -s solr.url=", :argument => true
295
267
  on :t, :marc_type, "xml, json or binary. shortcut for -s marc_source.type=", :argument => true
296
268
  on :I, "load_path", "append paths to ruby $LOAD_PATH", :argument => true, :as => Array, :delimiter => ":"
297
- on :G, "Gemfile", "run with bundler and optionally specified Gemfile", :argument => :optional, :default => nil
298
269
 
299
270
  on :x, "command", "alternate traject command: process (default); marcout; commit", :argument => true, :default => "process"
300
271