traject 0.9.1 → 0.10.0
- data/.travis.yml +7 -0
- data/Gemfile +5 -1
- data/README.md +65 -17
- data/bench/bench.rb +30 -0
- data/bin/traject +4 -169
- data/doc/batch_execution.md +177 -0
- data/doc/extending.md +182 -0
- data/doc/other_commands.md +49 -0
- data/doc/settings.md +6 -2
- data/lib/traject.rb +1 -0
- data/lib/traject/command_line.rb +296 -0
- data/lib/traject/debug_writer.rb +28 -0
- data/lib/traject/indexer.rb +84 -20
- data/lib/traject/indexer/settings.rb +9 -1
- data/lib/traject/json_writer.rb +15 -38
- data/lib/traject/line_writer.rb +59 -0
- data/lib/traject/macros/marc21.rb +10 -5
- data/lib/traject/macros/marc21_semantics.rb +57 -25
- data/lib/traject/marc4j_reader.rb +9 -26
- data/lib/traject/marc_extractor.rb +121 -48
- data/lib/traject/mock_reader.rb +87 -0
- data/lib/traject/mock_writer.rb +34 -0
- data/lib/traject/solrj_writer.rb +1 -22
- data/lib/traject/util.rb +107 -1
- data/lib/traject/version.rb +1 -1
- data/lib/traject/yaml_writer.rb +9 -0
- data/test/debug_writer_test.rb +38 -0
- data/test/indexer/each_record_test.rb +27 -2
- data/test/indexer/macros_marc21_semantics_test.rb +12 -1
- data/test/indexer/settings_test.rb +9 -2
- data/test/indexer/to_field_test.rb +35 -5
- data/test/marc4j_reader_test.rb +3 -0
- data/test/marc_extractor_test.rb +94 -20
- data/test/test_support/demo_config.rb +6 -3
- data/traject.gemspec +1 -2
- metadata +17 -20
data/.travis.yml
ADDED
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -7,9 +7,13 @@ them somewhere.
|
|
7
7
|
|
8
8
|
**Currently under development, not production ready**
|
9
9
|
|
10
|
+
[![Gem Version](https://badge.fury.io/rb/traject.png)](http://badge.fury.io/rb/traject)
|
11
|
+
[![Build Status](https://travis-ci.org/jrochkind/traject.png)](https://travis-ci.org/jrochkind/traject)
|
12
|
+
|
13
|
+
|
10
14
|
## Background/Goals
|
11
15
|
|
12
|
-
Existing tools for indexing Marc to Solr exist, and have served many
|
16
|
+
Existing tools for indexing Marc to Solr exist, and have served us well for many years, and have many useful things about them -- which I've tried to preserve in traject. But I was having more and more difficulty working with the existing tools, including difficulty providing the custom logic I needed in a maintainable way. I realized that for me, to create a tool with the flexibility, maintainability, and performance I wanted, I would need to do it in jruby (ruby on the JVM).
|
13
17
|
|
14
18
|
Some goals:
|
15
19
|
|
@@ -19,11 +23,13 @@ Some goals:
|
|
19
23
|
* Built of modular and composable elements: If you want to change part of what traject does, you should be able to do so without having to reimplement other things you don't want to change.
|
20
24
|
* A maintainable internal architecture, well-factored with separated concerns and DRY logic. Aim to be comprehensible to newcomer developers, and well-covered by tests.
|
21
25
|
* High performance, using multi-threaded concurrency where appropriate to maximize throughput. Actual throughput can depend on complexity of your mapping rules and capacity of your server(s), but I am getting throughput 2-5x greater than previous solutions.
|
26
|
+
* Cooperate well in unix batch/pipeline, with control over output/logging of errors, proper exit codes, use of stdin/stdout, etc.
|
22
27
|
|
23
28
|
|
24
29
|
## Installation
|
25
30
|
|
26
|
-
Traject runs under jruby (ruby on the JVM). I recommend [chruby](https://github.com/postmodern/chruby) and [ruby-install](https://github.com/postmodern/ruby-install#readme) for installing and managing ruby installations.
|
31
|
+
Traject runs under jruby (ruby on the JVM). I recommend [chruby](https://github.com/postmodern/chruby) and [ruby-install](https://github.com/postmodern/ruby-install#readme) for installing and managing ruby installations. (traject is tested
|
32
|
+
and supported for ruby 1.9 -- recent versions of jruby should run under 1.9 mode by default).
|
27
33
|
|
28
34
|
Then just `gem install traject`.
|
29
35
|
|
@@ -151,6 +157,11 @@ Other examples of the specification string, which can include multiple tag menti
|
|
151
157
|
# "*" is a wildcard in indicator spec. So
|
152
158
|
# 856 with first indicator '0', subfield u.
|
153
159
|
to_field "email_addresses", extract_marc("856|0*|u")
|
160
|
+
|
161
|
+
# Instead of joining subfields from the same field
|
162
|
+
# into one string, joined by spaces, leave them
|
163
|
+
# each in separate strings:
|
164
|
+
to_field "isbn", extract_marc("020az", :seperator => nil)
|
154
165
|
~~~
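To make the difference concrete, here is a minimal plain-ruby sketch of the two output shapes. It does not call traject at all; the subfield values are made-up sample data, and the variable names are only for illustration.

```ruby
# Values of subfields a and z from a single 020 field (made-up sample data).
subfields = ["0123456789", "9876543210"]

# Default behaviour described above: one string per field, joined by spaces.
joined = [subfields.join(" ")]

# With a nil separator: one string per subfield.
separate = subfields.dup

p joined    # => ["0123456789 9876543210"]
p separate  # => ["0123456789", "9876543210"]
```

Leaving the values separate is probably what you want for identifiers such as ISBNs, where each value should end up as its own entry in the output field.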
|
155
166
|
|
156
167
|
The `extract_marc` function *by default* includes any linked
|
@@ -214,9 +225,14 @@ end
|
|
214
225
|
# marc_extract does, you may want to use the Traject::MarcExtractor
|
215
226
|
# class
|
216
227
|
to_field "weirdo" do |record, accumulator, context|
|
217
|
-
|
228
|
+
# use MarcExtractor.cached for performance, globally
|
229
|
+
# caching the MarcExtractor we create. See docs
|
230
|
+
# at MarcExtractor.
|
231
|
+
list = MarcExtractor.cached("700a").extract(record)
|
232
|
+
|
218
233
|
# combine all the 700a's in ONE string, cause we're weird
|
219
234
|
list = list.join(" ")
|
235
|
+
|
220
236
|
accumulator << list
|
221
237
|
end
|
222
238
|
~~~
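The idea behind a `cached` class method can be sketched as follows, assuming a plain Hash keyed by the spec string. `SpecParser` is a hypothetical stand-in class for illustration only; this is not traject's actual `MarcExtractor` implementation, so see the MarcExtractor docs for the real behavior.

```ruby
# Hypothetical stand-in for an object that is costly to build from a spec string.
class SpecParser
  attr_reader :spec

  def initialize(spec)
    @spec = spec
  end

  # Return one shared instance per distinct spec string.
  # Note: this simple Hash is not synchronized; multi-threaded
  # code would need to guard access to it.
  def self.cached(spec)
    @cache ||= {}
    @cache[spec] ||= new(spec)
  end
end

a = SpecParser.cached("700a")
b = SpecParser.cached("700a")
puts a.equal?(b)  # the same object is reused for the same spec
```

The benefit is that a block called once per record does not rebuild the same object for every record.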
|
@@ -264,6 +280,10 @@ in order.
|
|
264
280
|
to_field("foo") {...} # and will be called after each of the preceding for each record
|
265
281
|
~~~
|
266
282
|
|
283
|
+
#### Sample config
|
284
|
+
|
285
|
+
A fairly complex sample config file can be found at [./test/test_support/demo_config.rb](./test/test_support/demo_config.rb)
|
286
|
+
|
267
287
|
#### Built-in MARC21 Semantics
|
268
288
|
|
269
289
|
There is another package of 'macros' that comes with Traject for extracting semantics
|
@@ -292,7 +312,7 @@ The simplest invocation is:
|
|
292
312
|
traject -c conf_file.rb marc_file.mrc
|
293
313
|
|
294
314
|
Traject assumes marc files are in ISO 2709 binary format; it is not
|
295
|
-
currently able to
|
315
|
+
currently able to guess marc format type from filenames. If you are reading
|
296
316
|
marc files in another format, you need to tell traject either with the `marc_source.type` or the command-line shortcut:
|
297
317
|
|
298
318
|
traject -c conf.rb -t xml marc_file.xml
|
@@ -323,21 +343,45 @@ Use `-u` as a shortcut for `-s solr.url=X`
|
|
323
343
|
|
324
344
|
traject -c conf_file.rb -u http://example.com/solr marc_file.mrc
|
325
345
|
|
326
|
-
Also see `-I load_path` and `-g Gemfile` options under Extending
|
346
|
+
Also see `-I load_path` and `-g Gemfile` options under Extending With Your Own Code.
|
327
347
|
|
328
|
-
|
348
|
+
See also [Hints for batch and cronjob use](./doc/batch_execution.md) of traject.
|
329
349
|
|
330
|
-
|
350
|
+
## Extending With Your Own Code
|
331
351
|
|
332
|
-
|
352
|
+
Traject config files are full live ruby files, where you can do anything,
|
353
|
+
including declaring new classes, etc.
|
333
354
|
|
334
|
-
|
335
|
-
|
355
|
+
However, beyond limited trivial logic, you'll want to organize your
|
356
|
+
code reasonably into separate files, not jam everything into config
|
357
|
+
files.
|
336
358
|
|
337
|
-
|
338
|
-
|
339
|
-
|
340
|
-
|
359
|
+
Traject wants to make sure it makes it convenient for you to do so,
|
360
|
+
whether project-specific logic in files local to the traject project,
|
361
|
+
or in ruby gems that can be shared between projects.
|
362
|
+
|
363
|
+
There are standard ruby mechanisms you can use to do this, and
|
364
|
+
traject provides a couple features to make sure this remains
|
365
|
+
convenient with the traject command line.
|
366
|
+
|
367
|
+
For more information, see documentation page on [Extending With Your
|
368
|
+
Own Code](./doc/extending.md)
|
369
|
+
|
370
|
+
**Expert summary** :
|
371
|
+
* The traject `-I` command-line argument can be used to list directories to
|
372
|
+
add to the load path, similar to the `ruby -I` argument. You
|
373
|
+
can then 'require' local project files from the load path.
|
374
|
+
* translation map files found on the load path or in a
|
375
|
+
"./translation_maps" subdir on the load path will be found
|
376
|
+
for Traject translation maps.
|
377
|
+
* The traject `-g` command-line argument can be used to tell traject to use
|
378
|
+
bundler with a `Gemfile` located in the current working directory
|
379
|
+
(or give an argument to `-g ./some/myGemfile`)
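As a rough illustration of the load path point, the sketch below does in plain ruby what adding a directory with `-I` amounts to, then requires a file from it. The append-unless-present line mirrors the load path loop in the 0.9.1 `bin/traject`; the file name `my_local_logic.rb` and the constant in it are hypothetical.

```ruby
require 'tmpdir'

Dir.mktmpdir do |dir|
  # A hypothetical project-local file that a config file might want to require.
  File.write(File.join(dir, "my_local_logic.rb"), "MY_LOCAL_LOGIC_LOADED = true\n")

  # Roughly what `-I dir` amounts to: append the directory unless already present.
  $LOAD_PATH << dir unless $LOAD_PATH.include?(dir)

  # The file can now be required by name, without a path.
  require 'my_local_logic'
  puts MY_LOCAL_LOGIC_LOADED
end
```

In a real project the directory would be a permanent one, such as a `lib` directory next to your config files, rather than a temporary directory.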
|
380
|
+
|
381
|
+
## More
|
382
|
+
|
383
|
+
* [Other traject commands](./doc/other_commands.md) including `marcout`, and `commit`
|
384
|
+
* [Hints for batch and cronjob use](./doc/batch_execution.md) of traject.
|
341
385
|
|
342
386
|
|
343
387
|
# Development
|
@@ -351,6 +395,9 @@ instance is baked in. You can provide your own solr instance to test against an
|
|
351
395
|
"solr_url", and the tests will use it. Otherwise, tests will
|
352
396
|
use a mocked up Solr instance.
|
353
397
|
|
398
|
+
To make a pull request, please make a feature branch *created from the master branch*, not from an existing feature branch. (If you need to do a feature branch dependent on an existing not-yet merged feature branch... discuss
|
399
|
+
this with other developers first!)
|
400
|
+
|
354
401
|
Pull requests should come with tests, as well as docs where applicable. Docs can be inline rdoc-style, edits to this README,
|
355
402
|
and/or extra files in ./docs -- as appropriate for what needs to be docs.
|
356
403
|
|
@@ -364,8 +411,9 @@ and/or extra files in ./docs -- as appropriate for what needs to be docs.
|
|
364
411
|
* Either way, all optional/configurable of course. based
|
365
412
|
on Settings.
|
366
413
|
|
367
|
-
*
|
368
|
-
|
369
|
-
|
414
|
+
* CommandLine class isn't covered by tests -- it's written using functionality
|
415
|
+
from Indexer and other classes that are well-covered, but the CommandLine itself
|
416
|
+
probably needs some tests -- especially covering error handling, which probably
|
417
|
+
needs a bit more attention and using exceptions instead of exits, etc.
|
370
418
|
|
371
419
|
* Optional built-in jetty stop/start to allow indexing to Solr that wasn't running before. maybe https://github.com/projecthydra/jettywrapper ?
|
data/bench/bench.rb
ADDED
@@ -0,0 +1,30 @@
|
|
1
|
+
#!/usr/bin/env jruby
|
2
|
+
$:.unshift File.expand_path('../../lib', __FILE__)
|
3
|
+
|
4
|
+
require 'traject/command_line'
|
5
|
+
|
6
|
+
require 'benchmark'
|
7
|
+
|
8
|
+
unless ARGV.size >= 2
|
9
|
+
STDERR.puts "\n Benchmark two (or more) different config files with both 0 and 3 threads against the given marc file\n"
|
10
|
+
STDERR.puts "\n Usage:"
|
11
|
+
STDERR.puts " jruby --server bench.rb config1.rb config2.rb [...configN.rb] filename.mrc\n\n"
|
12
|
+
exit
|
13
|
+
end
|
14
|
+
|
15
|
+
filename = ARGV.pop
|
16
|
+
config_files = ARGV
|
17
|
+
|
18
|
+
puts RUBY_DESCRIPTION
|
19
|
+
Benchmark.bmbm do |x|
|
20
|
+
[0, 3].each do |threads|
|
21
|
+
config_files.each do |cf|
|
22
|
+
x.report("#{cf} (#{threads})") do
|
23
|
+
cmdline = Traject::CommandLine.new(["-c", cf, '-s', 'log.file=bench.log', '-s', "processing_thread_pool=#{threads}", filename])
|
24
|
+
cmdline.execute
|
25
|
+
end
|
26
|
+
end
|
27
|
+
end
|
28
|
+
end
|
29
|
+
|
30
|
+
|
data/bin/traject
CHANGED
@@ -1,7 +1,5 @@
|
|
1
1
|
#!/usr/bin/env ruby
|
2
2
|
|
3
|
-
require 'slop'
|
4
|
-
|
5
3
|
|
6
4
|
# If we're loading from source instead of a gem, rubygems
|
7
5
|
# isn't setting load paths for us, so we need to set it ourselves
|
@@ -10,172 +8,9 @@ unless $LOAD_PATH.include? self_load_path
|
|
10
8
|
$LOAD_PATH << self_load_path
|
11
9
|
end
|
12
10
|
|
13
|
-
require 'traject'
|
14
|
-
require 'traject/indexer'
|
15
|
-
|
16
|
-
|
17
|
-
orig_argv = ARGV.dup
|
18
|
-
|
19
|
-
|
20
|
-
opts = Slop.new(:strict => true) do
|
21
|
-
banner "traject [options] -c configuration.rb [-c config2.rb] file.mrc"
|
22
|
-
|
23
|
-
on 'v', 'version', "print version information to stderr"
|
24
|
-
on 'd', 'debug', "Include debug log, -s log.level=debug"
|
25
|
-
on 'h', 'help', "print usage information to stderr"
|
26
|
-
on 'c', 'conf', 'configuration file path (repeatable)', :argument => true, :as => Array
|
27
|
-
on :s, :setting, "settings: `-s key=value` (repeatable)", :argument => true, :as => Array
|
28
|
-
on :r, :reader, "Set reader class, shortcut for `-s reader_class_name=*`", :argument => true
|
29
|
-
on :w, :writer, "Set writer class, shortcut for `-s writer_class_name=*`", :argument => true
|
30
|
-
on :u, :solr, "Set solr url, shortcut for `-s solr.url=*`", :argument => true
|
31
|
-
on :j, "output as pretty printed json, shortcut for `-s writer_class_name=JsonWriter -s json_writer.pretty_print=true`"
|
32
|
-
on :t, :marc_type, "xml, json or binary. shortcut for `-s marc_source.type=*`", :argument => true
|
33
|
-
on :I, "load_path", "append paths to ruby $LOAD_PATH", :argument => true, :as => Array, :delimiter => ":"
|
34
|
-
on :g, "gemfile", "run with bundler and optionally specified Gemfile", :argument => :optional, :default => ""
|
35
|
-
end
|
36
|
-
|
37
|
-
begin
|
38
|
-
opts.parse!
|
39
|
-
rescue Slop::Error => e
|
40
|
-
$stderr.puts "Error: #{e.message}"
|
41
|
-
$stderr.puts "Exiting..."
|
42
|
-
$stderr.puts
|
43
|
-
$stderr.puts opts.help
|
44
|
-
exit 1
|
45
|
-
end
|
46
|
-
|
47
|
-
|
48
|
-
options = opts.to_hash
|
49
|
-
|
50
|
-
|
51
|
-
|
52
|
-
if options[:version]
|
53
|
-
$stderr.puts "traject version #{Traject::VERSION}"
|
54
|
-
exit 1
|
55
|
-
end
|
56
|
-
|
57
|
-
if options[:help]
|
58
|
-
$stderr.puts opts.help
|
59
|
-
exit 1
|
60
|
-
end
|
61
|
-
|
62
|
-
# have to use Slop object to tell diff between
|
63
|
-
# no arg supplied and no option -g given at all
|
64
|
-
if opts.present? :gemfile
|
65
|
-
if options[:gemfile]
|
66
|
-
# tell bundler what gemfile to use
|
67
|
-
gem_path = File.expand_path( options[:gemfile] )
|
68
|
-
# bundler not good at error reporting, we check ourselves
|
69
|
-
unless File.exists? gem_path
|
70
|
-
$stderr.puts "Gemfile `#{options[:gemfile]}` does not exist, exiting..."
|
71
|
-
$stderr.puts
|
72
|
-
$stderr.puts opts.help
|
73
|
-
exit 2
|
74
|
-
end
|
75
|
-
|
76
|
-
ENV["BUNDLE_GEMFILE"] = gem_path
|
77
|
-
end
|
78
|
-
require 'bundler/setup'
|
79
|
-
end
|
80
|
-
|
81
|
-
settings = {}
|
82
|
-
(options[:setting] || []).each do |setting_pair|
|
83
|
-
|
84
|
-
if setting_pair =~ /\A([^=]+)\=([^=]*)\Z/
|
85
|
-
key, value = $1, $2
|
86
|
-
settings[key] = value
|
87
|
-
else
|
88
|
-
$stderr.puts "Unrecognized setting argument '#{setting_pair}':"
|
89
|
-
$stderr.puts "Should be of format -s key=value"
|
90
|
-
exit 3
|
91
|
-
end
|
92
|
-
end
|
93
|
-
|
94
|
-
|
95
|
-
if options[:debug]
|
96
|
-
settings["log.level"] = "debug"
|
97
|
-
end
|
98
|
-
if options[:writer]
|
99
|
-
settings["writer_class_name"] = options[:writer]
|
100
|
-
end
|
101
|
-
if options[:reader]
|
102
|
-
settings["reader_class_name"] = options[:reader]
|
103
|
-
end
|
104
|
-
if options[:solr]
|
105
|
-
settings["solr.url"] = options[:solr]
|
106
|
-
end
|
107
|
-
if options[:j]
|
108
|
-
settings["writer_class_name"] = "JsonWriter"
|
109
|
-
settings["json_writer.pretty_print"] = "true"
|
110
|
-
end
|
111
|
-
if options[:marc_type]
|
112
|
-
settings["marc_source.type"] = options[:marc_type]
|
113
|
-
end
|
114
|
-
|
115
|
-
|
116
|
-
(options[:load_path] || []).each do |path|
|
117
|
-
$LOAD_PATH << path unless $LOAD_PATH.include? path
|
118
|
-
end
|
119
|
-
|
120
|
-
indexer = Traject::Indexer.new
|
121
|
-
indexer.settings( settings )
|
122
|
-
|
123
|
-
unless options[:conf] && options[:conf].length > 0
|
124
|
-
$stderr.puts "Error: Missing required configuration file"
|
125
|
-
$stderr.puts "Exiting..."
|
126
|
-
$stderr.puts
|
127
|
-
$stderr.puts opts.help
|
128
|
-
exit 2
|
129
|
-
end
|
130
|
-
|
131
|
-
options[:conf].each do |conf_path|
|
132
|
-
begin
|
133
|
-
indexer.instance_eval(File.open(conf_path).read, conf_path)
|
134
|
-
rescue Errno::ENOENT => e
|
135
|
-
$stderr.puts "Could not find configuration file '#{conf_path}', exiting..."
|
136
|
-
exit 2
|
137
|
-
rescue Exception => e
|
138
|
-
$stderr.puts "Could not parse configuration file '#{conf_path}'"
|
139
|
-
$stderr.puts " #{e.message}"
|
140
|
-
if e.backtrace.first =~ /\A(.*)\:in/
|
141
|
-
$stderr.puts " #{$1}"
|
142
|
-
end
|
143
|
-
exit 3
|
144
|
-
end
|
145
|
-
end
|
146
|
-
|
147
|
-
## SAFE TO LOG STARTING HERE.
|
148
|
-
#
|
149
|
-
# Shoudln't log before config files are read above, because
|
150
|
-
# config files set up logger
|
151
|
-
##############
|
152
|
-
indexer.logger.info("executing with arguments: `#{orig_argv.join(' ')}`")
|
153
|
-
|
154
|
-
|
155
|
-
# ARGF might be perfect for this, but problems with it include:
|
156
|
-
# * jruby is broken, no way to set it's encoding, leads to encoding errors reading non-ascii
|
157
|
-
# https://github.com/jruby/jruby/issues/891
|
158
|
-
# * It's apparently not enough like an IO object for at least one of the ruby-marc XML
|
159
|
-
# readers:
|
160
|
-
# NoMethodError: undefined method `to_inputstream' for ARGF:Object
|
161
|
-
# init at /Users/jrochkind/.gem/jruby/1.9.3/gems/marc-0.5.1/lib/marc/xml_parsers.rb:369
|
162
|
-
#
|
163
|
-
# * It INSISTS on reading from ARGFV, making it hard to test, or use when you want to give
|
164
|
-
# it a list of files on something other than ARGV.
|
165
|
-
#
|
166
|
-
# So for now we do just one file, or stdin if none given. Sorry!
|
167
|
-
if ARGV.length > 1
|
168
|
-
$stderr.puts "Sorry, traject can only handle one input file at a time right now. `#{ARGV}` Exiting..."
|
169
|
-
exit 1
|
170
|
-
end
|
171
|
-
if ARGV.length == 0
|
172
|
-
indexer.logger.info "Reading from STDIN..."
|
173
|
-
io = $stdin
|
174
|
-
else
|
175
|
-
indexer.logger.info "Reading from #{ARGV.first}"
|
176
|
-
io = File.open(ARGV.first, 'r')
|
177
|
-
end
|
11
|
+
require 'traject/command_line'
|
178
12
|
|
179
|
-
|
13
|
+
cmdline = Traject::CommandLine.new(ARGV)
|
14
|
+
result = cmdline.execute
|
180
15
|
|
181
|
-
exit 1 unless result # non-zero exit status on process telling us there's problems.
|
16
|
+
exit 1 unless result # non-zero exit status on process telling us there's problems.
|
data/doc/batch_execution.md
ADDED
@@ -0,0 +1,177 @@
|
|
1
|
+
# Hints for running traject as a batch job
|
2
|
+
|
3
|
+
Maybe as a cronjob. Maybe via a batch shell script that executes
|
4
|
+
traject, and maybe even pipelines it together with other commands.
|
5
|
+
|
6
|
+
These are things you might want to do with traject. Some potential problem points
|
7
|
+
with suggested solutions, and additional hints.
|
8
|
+
|
9
|
+
## Ruby version setting
|
10
|
+
|
11
|
+
traject ordinarily needs to run under jruby. You will
|
12
|
+
ordinarily have jruby installed under a ruby version switcher -- we
|
13
|
+
highly recommend [chruby](https://github.com/postmodern/chruby) over other choices,
|
14
|
+
but other popular choices include rvm and rbenv.
|
15
|
+
|
16
|
+
Remember that traject needs to run in 1.9.x mode in jruby--
|
17
|
+
with jruby 1.7.x or later, this should be the default; we recommend
|
18
|
+
you use jruby 1.7.x.
|
19
|
+
|
20
|
+
Especially when running under a cron job, it can be difficult to
|
21
|
+
set things up so traject runs under jruby.
|
22
|
+
|
23
|
+
It can sometimes be useful to create a wrapper script for traject
|
24
|
+
that takes care of making sure it's running under the right ruby
|
25
|
+
version.
|
26
|
+
|
27
|
+
### for chruby
|
28
|
+
|
29
|
+
Simply run with:
|
30
|
+
|
31
|
+
chruby-exec jruby -- traject {other arguments}
|
32
|
+
|
33
|
+
Whether specifying that directly in a crontab, or in a shell script
|
34
|
+
that needs to call traject, etc. So simple you might not need
|
35
|
+
a wrapper script, but it might still be convenient to create one. Say
|
36
|
+
you put a `jruby-traject` at `/usr/local/bin/jruby-traject`, that
|
37
|
+
looks like this:
|
38
|
+
|
39
|
+
#!/usr/bin/env bash
|
40
|
+
|
41
|
+
chruby-exec jruby -- traject "$@"
|
42
|
+
|
43
|
+
Now any account, in a crontab, in an interactive shell, wherever,
|
44
|
+
can just execute `jruby-traject {arguments}`, and execute traject
|
45
|
+
in a jruby environment.
|
46
|
+
|
47
|
+
### for rbenv
|
48
|
+
|
49
|
+
If running in an interactive shell that has had rbenv set up for
|
50
|
+
it, you can use rbenv's standard mechanism to say to execute
|
51
|
+
something in jruby:
|
52
|
+
|
53
|
+
RBENV_VERSION=jruby-1.7.2 traject {args}
|
54
|
+
|
55
|
+
You do need to specify the exact version of jruby, I don't think
|
56
|
+
there's any way to say 'latest installed jruby'. You could do the
|
57
|
+
same thing for any batch scripts you're writing -- just have
|
58
|
+
them set that `RBENV_VERSION` environment variable before
|
59
|
+
executing traject.
|
60
|
+
|
61
|
+
If you're running inside a cronjob, things get a bit trickier,
|
62
|
+
because rbenv isn't normally set up in the limited environment
|
63
|
+
of cron tasks. One way to deal with this is to have your
|
64
|
+
cronjob explicitly execute in a bash login shell, that
|
65
|
+
will then have rbenv set up so long as it's running
|
66
|
+
under an account with rbenv set up properly!
|
67
|
+
|
68
|
+
# in a cronfile
|
69
|
+
# 10 * * * * /bin/bash -l -c 'RBENV_VERSION=jruby-1.7.2 traject {args}'
|
70
|
+
|
71
|
+
(Better way? Doc pull requests welcome.)
|
72
|
+
|
73
|
+
|
74
|
+
### for rvm
|
75
|
+
|
76
|
+
See rvm's [own docs on use with cron](http://rvm.io/integration/cron), it gets a bit confusing.
|
77
|
+
But here's one way, using a wrapper script. It does require you to
|
78
|
+
identify and hard-code in where your rvm is installed, and exactly which
|
79
|
+
version of jruby you want to execute with (will have to be updated if you upgrade
|
80
|
+
jruby). (Is there a better way? Doc pull requests welcome! rvm confuses me!)
|
81
|
+
|
82
|
+
Make a file at `/usr/local/bin/jruby-traject` that looks like this:
|
83
|
+
|
84
|
+
|
85
|
+
~~~bash
|
86
|
+
#!/usr/bin/env bash
|
87
|
+
|
88
|
+
# load rvm ruby
|
89
|
+
source /home/MY_ACCT/.rvm/environments/jruby-1.7.3
|
90
|
+
|
91
|
+
traject "$@"
|
92
|
+
~~~
|
93
|
+
|
94
|
+
Replace MY_ACCT with the actual account rvm is installed in.
|
95
|
+
Or, if you have a global install of rvm instead of a user-account one,
|
96
|
+
it might be at `/usr/local/rvm/environments`... instead.
|
97
|
+
|
98
|
+
Now any account, in a crontab, in an interactive shell, wherever,
|
99
|
+
can just execute `jruby-traject {arguments}`, and execute traject
|
100
|
+
in a jruby environment.
|
101
|
+
|
102
|
+
## Exit codes
|
103
|
+
|
104
|
+
Traject tries to always return a well-behaved unix exit code -- 0 for success,
|
105
|
+
non-0 for error.
|
106
|
+
|
107
|
+
You should be able to rely on this in your batch bash scripts: if you want to abort
|
108
|
+
further processing if traject failed for some reason, you can check traject's
|
109
|
+
exit code.
|
110
|
+
|
111
|
+
If an uncaught exception happens, traject will return non-0.
|
112
|
+
|
113
|
+
There are some kinds of errors which prevent traject from indexing
|
114
|
+
one or more records, but traject may still continue processing
|
115
|
+
the other records. If any records have been skipped in this way,
|
116
|
+
traject will _also_ return a non-0 failure exit code. (Is this good?
|
117
|
+
Does it need to be configurable?)
|
118
|
+
|
119
|
+
In these cases, information about errors that led to skipped records should
|
120
|
+
be output as ERROR level in the logs.
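If your batch driver is itself written in ruby rather than bash, checking the exit status could look like the sketch below. It only relies on the standard `Kernel#system`; the `run_step` helper is a hypothetical name, and the commands are stand-ins that exit 0 or 1 so the sketch runs anywhere. In real use the command would be traject with its arguments.

```ruby
require 'rbconfig'

# Run a command and report whether it exited with status 0.
# Kernel#system returns true for exit status 0, false for non-zero,
# and nil if the command could not be run at all.
def run_step(*command)
  system(*command) == true
end

# Stand-in commands; a real script would run traject here instead.
ruby = RbConfig.ruby
ok     = run_step(ruby, "-e", "exit 0")
failed = run_step(ruby, "-e", "exit 1")

puts "first step ok: #{ok}"
puts "second step ok: #{failed}"

# A batch script would typically stop here rather than carry on:
# exit 1 unless ok
```

Treating "could not be run at all" the same as a non-zero exit keeps the check to a single true/false answer, which is usually all a batch script needs.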
|
121
|
+
|
122
|
+
## Logs and Error Reporting
|
123
|
+
|
124
|
+
By default, traject outputs all logging to stderr. This is often just what
|
125
|
+
you want for a batch or automated process, where there might be some wrapper
|
126
|
+
script which captures stderr and puts it where you want it.
|
127
|
+
|
128
|
+
However, it's easy enough to tell traject to log somewhere else. Either on
|
129
|
+
the command-line:
|
130
|
+
|
131
|
+
traject -s log.file=/some/other/file/log {other args}
|
132
|
+
|
133
|
+
Or in a traject configuration file, setting the `log.file` configuration setting.
|
134
|
+
|
135
|
+
### Separate error log
|
136
|
+
|
137
|
+
You can also separately have a duplicate log file created with ONLY log messages of
|
138
|
+
level ERROR and higher (meaning ERROR and FATAL), with the `log.error_file` setting.
|
139
|
+
Then, if there are any lines in this error log file at all, you know something bad
|
140
|
+
happened; maybe your batch process needs to notify someone, or abort further
|
141
|
+
steps in the batch process.
|
142
|
+
|
143
|
+
traject -s log.file=/var/log/traject.log -s log.error_file=/var/log/traject_error.log {more args}
|
144
|
+
|
145
|
+
The error lines will be in the main log file, and also duplicated in the error
|
146
|
+
log file.
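A batch script could test for "any lines at all" along these lines. This is a minimal sketch using only ruby's standard library; the `errors_logged?` helper is a hypothetical name, and temporary files stand in for the real error log path.

```ruby
require 'tempfile'

# True if the error log exists and contains at least one byte.
# File.size? returns nil for a missing or empty file.
def errors_logged?(path)
  !File.size?(path).nil?
end

# Demonstration with temporary files standing in for the real error log.
empty_log = Tempfile.new("traject_error")
empty_log.close

full_log = Tempfile.new("traject_error")
full_log.write("ERROR something went wrong\n")
full_log.close

puts errors_logged?(empty_log.path)  # false
puts errors_logged?(full_log.path)   # true
```

If the error log is reused between runs, you would want to remove or truncate it before each run, so that old errors are not mistaken for new ones.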
|
147
|
+
|
148
|
+
### Completely customizable logging with yell
|
149
|
+
|
150
|
+
Traject uses the [yell](https://github.com/rudionrails/yell) gem for logging.
|
151
|
+
You can configure the logger directly to implement whatever crazy logging rules you might
|
152
|
+
want, so long as yell supports them. But yell is pretty flexible.
|
153
|
+
|
154
|
+
Recall that traject config files are just ruby, executed in the context
|
155
|
+
of a Traject::Indexer. You can set the Indexer's `logger` to a yell logger
|
156
|
+
object you configure yourself however you like:
|
157
|
+
|
158
|
+
~~~ruby
|
159
|
+
# inside a traject configuration file
|
160
|
+
|
161
|
+
logger = Yell.new do |l|
|
162
|
+
l.level = 'gte.info' # will only pass :info and above to the adapters
|
163
|
+
|
164
|
+
l.adapter :datefile, 'production.log', level: 'lte.warn' # anything lower or equal to :warn
|
165
|
+
l.adapter :datefile, 'error.log', level: 'gte.error' # anything greater or equal to :error
|
166
|
+
end
|
167
|
+
~~~
|
168
|
+
|
169
|
+
See [yell](https://github.com/rudionrails/yell) docs for more, you can
|
170
|
+
do whatever you can make yell do; just write ruby.
|
171
|
+
|
172
|
+
### Bundler
|
173
|
+
|
174
|
+
For automated batch execution, we recommend you consider using
|
175
|
+
bundler to manage any gem dependencies. See the [Extending
|
176
|
+
With Your Own Code](./extending.md) traject docs for
|
177
|
+
information on how traject integrates with bundler.
|