traject 0.9.1 → 0.10.0
- data/.travis.yml +7 -0
- data/Gemfile +5 -1
- data/README.md +65 -17
- data/bench/bench.rb +30 -0
- data/bin/traject +4 -169
- data/doc/batch_execution.md +177 -0
- data/doc/extending.md +182 -0
- data/doc/other_commands.md +49 -0
- data/doc/settings.md +6 -2
- data/lib/traject.rb +1 -0
- data/lib/traject/command_line.rb +296 -0
- data/lib/traject/debug_writer.rb +28 -0
- data/lib/traject/indexer.rb +84 -20
- data/lib/traject/indexer/settings.rb +9 -1
- data/lib/traject/json_writer.rb +15 -38
- data/lib/traject/line_writer.rb +59 -0
- data/lib/traject/macros/marc21.rb +10 -5
- data/lib/traject/macros/marc21_semantics.rb +57 -25
- data/lib/traject/marc4j_reader.rb +9 -26
- data/lib/traject/marc_extractor.rb +121 -48
- data/lib/traject/mock_reader.rb +87 -0
- data/lib/traject/mock_writer.rb +34 -0
- data/lib/traject/solrj_writer.rb +1 -22
- data/lib/traject/util.rb +107 -1
- data/lib/traject/version.rb +1 -1
- data/lib/traject/yaml_writer.rb +9 -0
- data/test/debug_writer_test.rb +38 -0
- data/test/indexer/each_record_test.rb +27 -2
- data/test/indexer/macros_marc21_semantics_test.rb +12 -1
- data/test/indexer/settings_test.rb +9 -2
- data/test/indexer/to_field_test.rb +35 -5
- data/test/marc4j_reader_test.rb +3 -0
- data/test/marc_extractor_test.rb +94 -20
- data/test/test_support/demo_config.rb +6 -3
- data/traject.gemspec +1 -2
- metadata +17 -20
data/.travis.yml
ADDED
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -7,9 +7,13 @@ them somewhere.
|
|
7
7
|
|
8
8
|
**Currently under development, not production ready**
|
9
9
|
|
10
|
+
[](http://badge.fury.io/rb/traject)
|
11
|
+
[](https://travis-ci.org/jrochkind/traject)
|
12
|
+
|
13
|
+
|
10
14
|
## Background/Goals
|
11
15
|
|
12
|
-
Existing tools for indexing Marc to Solr exist, and have served many
|
16
|
+
Existing tools for indexing Marc to Solr exist, and have served us well for many years, and have many useful things about them -- which I've tried to preserve in traject. But I was having more and more difficulty working with the existing tools, including difficulty providing the custom logic I needed in a maintainable way. I realized that for me, to create a tool with the flexibility, maintainability, and performance I wanted, I would need to do it in jruby (ruby on the JVM).
|
13
17
|
|
14
18
|
Some goals:
|
15
19
|
|
@@ -19,11 +23,13 @@ Some goals:
|
|
19
23
|
* Built of modular and composable elements: If you want to change part of what traject does, you should be able to do so without having to reimplement other things you don't want to change.
|
20
24
|
* A maintainable internal architecture, well-factored with separated concerns and DRY logic. Aim to be comprehensible to newcomer developers, and well-covered by tests.
|
21
25
|
* High performance, using multi-threaded concurrency where appropriate to maximize throughput. Actual throughput can depend on complexity of your mapping rules and capacity of your server(s), but I am getting throughput 2-5x greater than previous solutions.
|
26
|
+
* Cooperate well in unix batch/pipeline, with control over output/logging of errors, proper exit codes, use of stdin/stdout, etc.
|
22
27
|
|
23
28
|
|
24
29
|
## Installation
|
25
30
|
|
26
|
-
Traject runs under jruby (ruby on the JVM). I recommend [chruby](https://github.com/postmodern/chruby) and [ruby-install](https://github.com/postmodern/ruby-install#readme) for installing and managing ruby installations.
|
31
|
+
Traject runs under jruby (ruby on the JVM). I recommend [chruby](https://github.com/postmodern/chruby) and [ruby-install](https://github.com/postmodern/ruby-install#readme) for installing and managing ruby installations. (traject is tested
|
32
|
+
and supported for ruby 1.9 -- recent versions of jruby should run under 1.9 mode by default).
|
27
33
|
|
28
34
|
Then just `gem install traject`.
|
29
35
|
|
@@ -151,6 +157,11 @@ Other examples of the specification string, which can include multiple tag menti
|
|
151
157
|
# "*" is a wildcard in indicator spec. So
|
152
158
|
# 856 with first indicator '0', subfield u.
|
153
159
|
to_field "email_addresses", extract_marc("856|0*|u")
|
160
|
+
|
161
|
+
# Instead of joining subfields from the same field
|
162
|
+
# into one string, joined by spaces, leave them
|
163
|
+
# each in separate strings:
|
164
|
+
to_field "isbn", extract_marc("020az", :seperator => nil)
|
154
165
|
~~~
|
155
166
|
|
156
167
|
The `extract_marc` function *by default* includes any linked
|
@@ -214,9 +225,14 @@ end
|
|
214
225
|
# marc_extract does, you may want to use the Traject::MarcExtractor
|
215
226
|
# class
|
216
227
|
to_field "weirdo" do |record, accumulator, context|
|
217
|
-
|
228
|
+
# use MarcExtractor.cached for performance, globally
|
229
|
+
# caching the MarcExtractor we create. See docs
|
230
|
+
# at MarcExtractor.
|
231
|
+
list = MarcExtractor.cached("700a").extract(record)
|
232
|
+
|
218
233
|
# combine all the 700a's in ONE string, cause we're weird
|
219
234
|
list = list.join(" ")
|
235
|
+
|
220
236
|
accumulator << list
|
221
237
|
end
|
222
238
|
~~~
|
@@ -264,6 +280,10 @@ in order.
|
|
264
280
|
to_field("foo") {...} # and will be called after each of the preceding for each record
|
265
281
|
~~~
|
266
282
|
|
283
|
+
#### Sample config
|
284
|
+
|
285
|
+
A fairly complex sample config file can be found at [./test/test_support/demo_config.rb](./test/test_support/demo_config.rb)
|
286
|
+
|
267
287
|
#### Built-in MARC21 Semantics
|
268
288
|
|
269
289
|
There is another package of 'macros' that comes with Traject for extracting semantics
|
@@ -292,7 +312,7 @@ The simplest invocation is:
|
|
292
312
|
traject -c conf_file.rb marc_file.mrc
|
293
313
|
|
294
314
|
Traject assumes marc files are in ISO 2709 binary format; it is not
|
295
|
-
currently able to
|
315
|
+
currently able to guess marc format type from filenames. If you are reading
|
296
316
|
marc files in another format, you need to tell traject either with the `marc_source.type` or the command-line shortcut:
|
297
317
|
|
298
318
|
traject -c conf.rb -t xml marc_file.xml
|
@@ -323,21 +343,45 @@ Use `-u` as a shortcut for `-s solr.url=X`
|
|
323
343
|
|
324
344
|
traject -c conf_file.rb -u http://example.com/solr marc_file.mrc
|
325
345
|
|
326
|
-
Also see `-I load_path` and `-g Gemfile` options under Extending
|
346
|
+
Also see `-I load_path` and `-g Gemfile` options under Extending With Your Own Code.
|
327
347
|
|
328
|
-
|
348
|
+
See also [Hints for batch and cronjob use](./doc/batch_execution.md) of traject.
|
329
349
|
|
330
|
-
|
350
|
+
## Extending With Your Own Code
|
331
351
|
|
332
|
-
|
352
|
+
Traject config files are full live ruby files, where you can do anything,
|
353
|
+
including declaring new classes, etc.
|
333
354
|
|
334
|
-
|
335
|
-
|
355
|
+
However, beyond limited trivial logic, you'll want to organize your
|
356
|
+
code reasonably into separate files, not jam everything into config
|
357
|
+
files.
|
336
358
|
|
337
|
-
|
338
|
-
|
339
|
-
|
340
|
-
|
359
|
+
Traject wants to make sure it makes it convenient for you to do so,
|
360
|
+
whether project-specific logic in files local to the traject project,
|
361
|
+
or in ruby gems that can be shared between projects.
|
362
|
+
|
363
|
+
There are standard ruby mechanisms you can use to do this, and
|
364
|
+
traject provides a couple features to make sure this remains
|
365
|
+
convenient with the traject command line.
|
366
|
+
|
367
|
+
For more information, see documentation page on [Extending With Your
|
368
|
+
Own Code](./doc/extending.md)
|
369
|
+
|
370
|
+
**Expert summary** :
|
371
|
+
* Traject `-I` argument command line can be used to list directories to
|
372
|
+
add to the load path, similar to the `ruby -I` argument. You
|
373
|
+
can then 'require' local project files from the load path.
|
374
|
+
* translation map files found on the load path or in a
|
375
|
+
"./translation_maps" subdir on the load path will be found
|
376
|
+
for Traject translation maps.
|
377
|
+
* Traject `-g` command line can be used to tell traject to use
|
378
|
+
bundler with a `Gemfile` located at current working directory
|
379
|
+
(or give an argument to `-g ./some/myGemfile`)
|
380
|
+
|
381
|
+
## More
|
382
|
+
|
383
|
+
* [Other traject commands](./doc/other_commands.md) including `marcout`, and `commit`
|
384
|
+
* [Hints for batch and cronjob use](./doc/batch_execution.md) of traject.
|
341
385
|
|
342
386
|
|
343
387
|
# Development
|
@@ -351,6 +395,9 @@ instance is baked in. You can provide your own solr instance to test against an
|
|
351
395
|
"solr_url", and the tests will use it. Otherwise, tests will
|
352
396
|
use a mocked up Solr instance.
|
353
397
|
|
398
|
+
To make a pull request, please make a feature branch *created from the master branch*, not from an existing feature branch. (If you need to do a feature branch dependent on an existing not-yet merged feature branch... discuss
|
399
|
+
this with other developers first!)
|
400
|
+
|
354
401
|
Pull requests should come with tests, as well as docs where applicable. Docs can be inline rdoc-style, edits to this README,
|
355
402
|
and/or extra files in ./docs -- as appropriate for what needs to be docs.
|
356
403
|
|
@@ -364,8 +411,9 @@ and/or extra files in ./docs -- as appropriate for what needs to be docs.
|
|
364
411
|
* Either way, all optional/configurable of course. based
|
365
412
|
on Settings.
|
366
413
|
|
367
|
-
*
|
368
|
-
|
369
|
-
|
414
|
+
* CommandLine class isn't covered by tests -- it's written using functionality
|
415
|
+
from Indexer and other classes that are well-covered, but the CommandLine itself
|
416
|
+
probably needs some tests -- especially covering error handling, which probably
|
417
|
+
needs a bit more attention and using exceptions instead of exits, etc.
|
370
418
|
|
371
419
|
* Optional built-in jetty stop/start to allow indexing to Solr that wasn't running before. maybe https://github.com/projecthydra/jettywrapper ?
|
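The README hunk above documents that `extract_marc` joins the subfields of one field with spaces by default and leaves them separate when `:seperator => nil` is passed (the option is spelled that way in this release). A minimal plain-Ruby illustration of that difference, assuming a hypothetical helper `collect_values` that is not part of traject and that works on a bare array of subfield values rather than a real MARC record:

```ruby
# Plain-Ruby illustration only; `collect_values` is a hypothetical helper,
# not traject API. `values` stands in for the subfield values matched
# within ONE MARC field.
def collect_values(values, separator = " ")
  return [] if values.empty?

  if separator
    [values.join(separator)]   # one joined string per field
  else
    values.dup                 # each subfield value kept as its own string
  end
end

isbn_subfields = ["0394502884", "039450289X"]  # made-up sample values

p collect_values(isbn_subfields)       # default: joined with a space
p collect_values(isbn_subfields, nil)  # nil separator: left as separate strings
```

The sketch only models the join-or-not decision; tag, indicator, and subfield matching are left to the real `Traject::MarcExtractor`.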
data/bench/bench.rb
ADDED
@@ -0,0 +1,30 @@
|
|
1
|
+
#!/usr/bin/env jruby
|
2
|
+
$:.unshift File.expand_path('../../lib', __FILE__)
|
3
|
+
|
4
|
+
require 'traject/command_line'
|
5
|
+
|
6
|
+
require 'benchmark'
|
7
|
+
|
8
|
+
unless ARGV.size >= 2
|
9
|
+
STDERR.puts "\n Benchmark two (or more) different config files with both 0 and 3 threads against the given marc file\n"
|
10
|
+
STDERR.puts "\n Usage:"
|
11
|
+
STDERR.puts " jruby --server bench.rb config1.rb config2.rb [...configN.rb] filename.mrc\n\n"
|
12
|
+
exit
|
13
|
+
end
|
14
|
+
|
15
|
+
filename = ARGV.pop
|
16
|
+
config_files = ARGV
|
17
|
+
|
18
|
+
puts RUBY_DESCRIPTION
|
19
|
+
Benchmark.bmbm do |x|
|
20
|
+
[0, 3].each do |threads|
|
21
|
+
config_files.each do |cf|
|
22
|
+
x.report("#{cf} (#{threads})") do
|
23
|
+
cmdline = Traject::CommandLine.new(["-c", cf, '-s', 'log.file=bench.log', '-s', "processing_thread_pool=#{threads}", filename])
|
24
|
+
cmdline.execute
|
25
|
+
end
|
26
|
+
end
|
27
|
+
end
|
28
|
+
end
|
29
|
+
|
30
|
+
|
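`bench.rb` above crosses every config file with two thread-pool sizes inside `Benchmark.bmbm`. A minimal sketch of the same loop shape that runs without traject or a MARC file, assuming a hypothetical `fake_run` method in place of `Traject::CommandLine.new([...]).execute` and placeholder config file names:

```ruby
require 'benchmark'

# Hypothetical stand-in for `Traject::CommandLine.new([...]).execute`,
# so the sketch runs without traject installed.
def fake_run(config_file, threads)
  (1..20_000).reduce(:+)
end

config_files = ["config1.rb", "config2.rb"]  # placeholder names
labels = []

Benchmark.bmbm do |x|
  [0, 3].each do |threads|
    config_files.each do |cf|
      label = "#{cf} (#{threads})"
      labels << label
      # bmbm runs every reported job twice: a rehearsal, then the measured run.
      x.report(label) { fake_run(cf, threads) }
    end
  end
end
```

Because `Benchmark.bmbm` executes each job in a rehearsal pass before the measured pass, every config and thread combination in the real script indexes the input file twice.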
data/bin/traject
CHANGED
@@ -1,7 +1,5 @@
|
|
1
1
|
#!/usr/bin/env ruby
|
2
2
|
|
3
|
-
require 'slop'
|
4
|
-
|
5
3
|
|
6
4
|
# If we're loading from source instead of a gem, rubygems
|
7
5
|
# isn't setting load paths for us, so we need to set it ourselves
|
@@ -10,172 +8,9 @@ unless $LOAD_PATH.include? self_load_path
|
|
10
8
|
$LOAD_PATH << self_load_path
|
11
9
|
end
|
12
10
|
|
13
|
-
require 'traject'
|
14
|
-
require 'traject/indexer'
|
15
|
-
|
16
|
-
|
17
|
-
orig_argv = ARGV.dup
|
18
|
-
|
19
|
-
|
20
|
-
opts = Slop.new(:strict => true) do
|
21
|
-
banner "traject [options] -c configuration.rb [-c config2.rb] file.mrc"
|
22
|
-
|
23
|
-
on 'v', 'version', "print version information to stderr"
|
24
|
-
on 'd', 'debug', "Include debug log, -s log.level=debug"
|
25
|
-
on 'h', 'help', "print usage information to stderr"
|
26
|
-
on 'c', 'conf', 'configuration file path (repeatable)', :argument => true, :as => Array
|
27
|
-
on :s, :setting, "settings: `-s key=value` (repeatable)", :argument => true, :as => Array
|
28
|
-
on :r, :reader, "Set reader class, shortcut for `-s reader_class_name=*`", :argument => true
|
29
|
-
on :w, :writer, "Set writer class, shortcut for `-s writer_class_name=*`", :argument => true
|
30
|
-
on :u, :solr, "Set solr url, shortcut for `-s solr.url=*`", :argument => true
|
31
|
-
on :j, "output as pretty printed json, shortcut for `-s writer_class_name=JsonWriter -s json_writer.pretty_print=true`"
|
32
|
-
on :t, :marc_type, "xml, json or binary. shortcut for `-s marc_source.type=*`", :argument => true
|
33
|
-
on :I, "load_path", "append paths to ruby $LOAD_PATH", :argument => true, :as => Array, :delimiter => ":"
|
34
|
-
on :g, "gemfile", "run with bundler and optionally specified Gemfile", :argument => :optional, :default => ""
|
35
|
-
end
|
36
|
-
|
37
|
-
begin
|
38
|
-
opts.parse!
|
39
|
-
rescue Slop::Error => e
|
40
|
-
$stderr.puts "Error: #{e.message}"
|
41
|
-
$stderr.puts "Exiting..."
|
42
|
-
$stderr.puts
|
43
|
-
$stderr.puts opts.help
|
44
|
-
exit 1
|
45
|
-
end
|
46
|
-
|
47
|
-
|
48
|
-
options = opts.to_hash
|
49
|
-
|
50
|
-
|
51
|
-
|
52
|
-
if options[:version]
|
53
|
-
$stderr.puts "traject version #{Traject::VERSION}"
|
54
|
-
exit 1
|
55
|
-
end
|
56
|
-
|
57
|
-
if options[:help]
|
58
|
-
$stderr.puts opts.help
|
59
|
-
exit 1
|
60
|
-
end
|
61
|
-
|
62
|
-
# have to use Slop object to tell diff between
|
63
|
-
# no arg supplied and no option -g given at all
|
64
|
-
if opts.present? :gemfile
|
65
|
-
if options[:gemfile]
|
66
|
-
# tell bundler what gemfile to use
|
67
|
-
gem_path = File.expand_path( options[:gemfile] )
|
68
|
-
# bundler not good at error reporting, we check ourselves
|
69
|
-
unless File.exists? gem_path
|
70
|
-
$stderr.puts "Gemfile `#{options[:gemfile]}` does not exist, exiting..."
|
71
|
-
$stderr.puts
|
72
|
-
$stderr.puts opts.help
|
73
|
-
exit 2
|
74
|
-
end
|
75
|
-
|
76
|
-
ENV["BUNDLE_GEMFILE"] = gem_path
|
77
|
-
end
|
78
|
-
require 'bundler/setup'
|
79
|
-
end
|
80
|
-
|
81
|
-
settings = {}
|
82
|
-
(options[:setting] || []).each do |setting_pair|
|
83
|
-
|
84
|
-
if setting_pair =~ /\A([^=]+)\=([^=]*)\Z/
|
85
|
-
key, value = $1, $2
|
86
|
-
settings[key] = value
|
87
|
-
else
|
88
|
-
$stderr.puts "Unrecognized setting argument '#{setting_pair}':"
|
89
|
-
$stderr.puts "Should be of format -s key=value"
|
90
|
-
exit 3
|
91
|
-
end
|
92
|
-
end
|
93
|
-
|
94
|
-
|
95
|
-
if options[:debug]
|
96
|
-
settings["log.level"] = "debug"
|
97
|
-
end
|
98
|
-
if options[:writer]
|
99
|
-
settings["writer_class_name"] = options[:writer]
|
100
|
-
end
|
101
|
-
if options[:reader]
|
102
|
-
settings["reader_class_name"] = options[:reader]
|
103
|
-
end
|
104
|
-
if options[:solr]
|
105
|
-
settings["solr.url"] = options[:solr]
|
106
|
-
end
|
107
|
-
if options[:j]
|
108
|
-
settings["writer_class_name"] = "JsonWriter"
|
109
|
-
settings["json_writer.pretty_print"] = "true"
|
110
|
-
end
|
111
|
-
if options[:marc_type]
|
112
|
-
settings["marc_source.type"] = options[:marc_type]
|
113
|
-
end
|
114
|
-
|
115
|
-
|
116
|
-
(options[:load_path] || []).each do |path|
|
117
|
-
$LOAD_PATH << path unless $LOAD_PATH.include? path
|
118
|
-
end
|
119
|
-
|
120
|
-
indexer = Traject::Indexer.new
|
121
|
-
indexer.settings( settings )
|
122
|
-
|
123
|
-
unless options[:conf] && options[:conf].length > 0
|
124
|
-
$stderr.puts "Error: Missing required configuration file"
|
125
|
-
$stderr.puts "Exiting..."
|
126
|
-
$stderr.puts
|
127
|
-
$stderr.puts opts.help
|
128
|
-
exit 2
|
129
|
-
end
|
130
|
-
|
131
|
-
options[:conf].each do |conf_path|
|
132
|
-
begin
|
133
|
-
indexer.instance_eval(File.open(conf_path).read, conf_path)
|
134
|
-
rescue Errno::ENOENT => e
|
135
|
-
$stderr.puts "Could not find configuration file '#{conf_path}', exiting..."
|
136
|
-
exit 2
|
137
|
-
rescue Exception => e
|
138
|
-
$stderr.puts "Could not parse configuration file '#{conf_path}'"
|
139
|
-
$stderr.puts " #{e.message}"
|
140
|
-
if e.backtrace.first =~ /\A(.*)\:in/
|
141
|
-
$stderr.puts " #{$1}"
|
142
|
-
end
|
143
|
-
exit 3
|
144
|
-
end
|
145
|
-
end
|
146
|
-
|
147
|
-
## SAFE TO LOG STARTING HERE.
|
148
|
-
#
|
149
|
-
# Shoudln't log before config files are read above, because
|
150
|
-
# config files set up logger
|
151
|
-
##############
|
152
|
-
indexer.logger.info("executing with arguments: `#{orig_argv.join(' ')}`")
|
153
|
-
|
154
|
-
|
155
|
-
# ARGF might be perfect for this, but problems with it include:
|
156
|
-
# * jruby is broken, no way to set it's encoding, leads to encoding errors reading non-ascii
|
157
|
-
# https://github.com/jruby/jruby/issues/891
|
158
|
-
# * It's apparently not enough like an IO object for at least one of the ruby-marc XML
|
159
|
-
# readers:
|
160
|
-
# NoMethodError: undefined method `to_inputstream' for ARGF:Object
|
161
|
-
# init at /Users/jrochkind/.gem/jruby/1.9.3/gems/marc-0.5.1/lib/marc/xml_parsers.rb:369
|
162
|
-
#
|
163
|
-
# * It INSISTS on reading from ARGFV, making it hard to test, or use when you want to give
|
164
|
-
# it a list of files on something other than ARGV.
|
165
|
-
#
|
166
|
-
# So for now we do just one file, or stdin if none given. Sorry!
|
167
|
-
if ARGV.length > 1
|
168
|
-
$stderr.puts "Sorry, traject can only handle one input file at a time right now. `#{ARGV}` Exiting..."
|
169
|
-
exit 1
|
170
|
-
end
|
171
|
-
if ARGV.length == 0
|
172
|
-
indexer.logger.info "Reading from STDIN..."
|
173
|
-
io = $stdin
|
174
|
-
else
|
175
|
-
indexer.logger.info "Reading from #{ARGV.first}"
|
176
|
-
io = File.open(ARGV.first, 'r')
|
177
|
-
end
|
11
|
+
require 'traject/command_line'
|
178
12
|
|
179
|
-
|
13
|
+
cmdline = Traject::CommandLine.new(ARGV)
|
14
|
+
result = cmdline.execute
|
180
15
|
|
181
|
-
exit 1 unless result # non-zero exit status on process telling us there's problems.
|
16
|
+
exit 1 unless result # non-zero exit status on process telling us there's problems.
|
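The removed inline script parsed each `-s key=value` argument with the pattern `/\A([^=]+)\=([^=]*)\Z/`. A minimal sketch of that step in isolation, assuming a hypothetical helper named `parse_settings` (not traject API) that raises `ArgumentError` where the script printed a message and called `exit 3`. Whether `Traject::CommandLine` keeps exactly this pattern is not visible in this part of the diff:

```ruby
# Hypothetical helper, not traject API: mirrors the `-s key=value`
# parsing loop from the removed bin/traject script.
SETTING_PATTERN = /\A([^=]+)\=([^=]*)\Z/

def parse_settings(pairs)
  pairs.each_with_object({}) do |pair, settings|
    if pair =~ SETTING_PATTERN
      settings[$1] = $2
    else
      # The original script wrote to stderr and called `exit 3` here.
      raise ArgumentError,
            "Unrecognized setting argument '#{pair}': should be of format -s key=value"
    end
  end
end

settings = parse_settings(["log.level=debug", "solr.url=http://example.com/solr"])
puts settings.inspect
```

With this pattern the value part is `[^=]*`, so a value that itself contains `=` (for example a URL with a query string) does not match and is reported as unrecognized.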
data/doc/batch_execution.md
ADDED
@@ -0,0 +1,177 @@
|
|
1
|
+
# Hints for running traject as a batch job
|
2
|
+
|
3
|
+
Maybe as a cronjob. Maybe via a batch shell script that executes
|
4
|
+
traject, and maybe even pipelines it together with other commands.
|
5
|
+
|
6
|
+
These are things you might want to do with traject. Some potential problem points
|
7
|
+
with suggested solutions, and additional hints.
|
8
|
+
|
9
|
+
## Ruby version setting
|
10
|
+
|
11
|
+
traject ordinarily needs to run under jruby. You will
|
12
|
+
ordinarily have jruby installed under a ruby version switcher -- we
|
13
|
+
highly recommend [chruby](https://github.com/postmodern/chruby) over other choices,
|
14
|
+
but other popular choices include rvm and rbenv.
|
15
|
+
|
16
|
+
Remember that traject needs to run in 1.9.x mode in jruby--
|
17
|
+
with jruby 1.7.x or later, this should be default, recommend
|
18
|
+
you use jruby 1.7.x.
|
19
|
+
|
20
|
+
Especially when running under a cron job, it can be difficult to
|
21
|
+
set things up so traject runs under jruby.
|
22
|
+
|
23
|
+
It can sometimes be useful to create a wrapper script for traject
|
24
|
+
that takes care of making sure it's running under the right ruby
|
25
|
+
version.
|
26
|
+
|
27
|
+
### for chruby
|
28
|
+
|
29
|
+
Simply run with:
|
30
|
+
|
31
|
+
chruby-exec jruby -- traject {other arguments}
|
32
|
+
|
33
|
+
Whether specifying that directly in a crontab, or in a shell script
|
34
|
+
that needs to call traject, etc. So simple you might not need
|
35
|
+
a wrapper script, but it might still be convenient to create one. Say
|
36
|
+
you put a `jruby-traject` at `/usr/local/bin/jruby-traject`, that
|
37
|
+
looks like this:
|
38
|
+
|
39
|
+
#!/usr/bin/env bash
|
40
|
+
|
41
|
+
chruby-exec jruby -- traject "$@"
|
42
|
+
|
43
|
+
Now any account, in a crontab, in an interactive shell, wherever,
|
44
|
+
can just execute `jruby-traject {arguments}`, and execute traject
|
45
|
+
in a jruby environment.
|
46
|
+
|
47
|
+
### for rbenv
|
48
|
+
|
49
|
+
If running in an interactive shell that has had rbenv set up for
|
50
|
+
it, you can use rbenv's standard mechanism to say to execute
|
51
|
+
something in jruby:
|
52
|
+
|
53
|
+
RBENV_VERSION=jruby-1.7.2 traject {args}
|
54
|
+
|
55
|
+
You do need to specify the exact version of jruby, I don't think
|
56
|
+
there's any way to say 'latest install jruby'. You could do the
|
57
|
+
same thing for any batch scripts you're writing -- just have
|
58
|
+
them set that `RBENV_VERSION` environment variable before
|
59
|
+
executing traject.
|
60
|
+
|
61
|
+
If you're running inside a cronjob, things get a bit trickier,
|
62
|
+
because rbenv isn't normally set up in the limited environment
|
63
|
+
of cron tasks. One way to deal with this is to have your
|
64
|
+
cronjob explicitly execute in a bash login shell, that
|
65
|
+
will then have rbenv set up so long as it's running
|
66
|
+
under an account with rbenv set up properly!
|
67
|
+
|
68
|
+
# in a cronfile
|
69
|
+
# 10 * * * * /bin/bash -l -c 'RBENV_VERSION=jruby-1.7.2 traject {args}'
|
70
|
+
|
71
|
+
(Better way? Doc pull requests welcome.)
|
72
|
+
|
73
|
+
|
74
|
+
### for rvm
|
75
|
+
|
76
|
+
See rvm's [own docs on use with cron](http://rvm.io/integration/cron), it gets a bit confusing.
|
77
|
+
But here's one way, using a wrapper script. It does require you to
|
78
|
+
identify and hard-code in where your rvm is installed, and exactly which
|
79
|
+
version of jruby you want to execute with (will have to be updated if you upgrade
|
80
|
+
jruby). (Is there a better way? Doc pull requests welcome! rvm confuses me!)
|
81
|
+
|
82
|
+
Make a file at `/usr/local/bin/jruby-traject` that looks like this:
|
83
|
+
|
84
|
+
|
85
|
+
~~~bash
|
86
|
+
#!/usr/bin/env bash
|
87
|
+
|
88
|
+
# load rvm ruby
|
89
|
+
source /home/MY_ACCT/.rvm/environments/jruby-1.7.3
|
90
|
+
|
91
|
+
traject "$@"
|
92
|
+
~~~
|
93
|
+
|
94
|
+
You have to use your actual account rvm is installed in for MY_ACCT.
|
95
|
+
Or, if you have a global install of rvm instead of a user-account one,
|
96
|
+
it might be at `/usr/local/rvm/environments`... instead.
|
97
|
+
|
98
|
+
Now any account, in a crontab, in an interactive shell, wherever,
|
99
|
+
can just execute `jruby-traject {arguments}`, and execute traject
|
100
|
+
in a jruby environment.
|
101
|
+
|
102
|
+
## Exit codes
|
103
|
+
|
104
|
+
Traject tries to always return a well-behaved unix exit code -- 0 for success,
|
105
|
+
non-0 for error.
|
106
|
+
|
107
|
+
You should be able to rely on this in your batch bash scripts: if you want to abort
|
108
|
+
further processing if traject failed for some reason, you can check traject's
|
109
|
+
exit code.
|
110
|
+
|
111
|
+
If an uncaught exception happens, traject will return non-0.
|
112
|
+
|
113
|
+
There are some kinds of errors which prevent traject from indexing
|
114
|
+
one or more records, but traject may still continue processing
|
115
|
+
the other records. If any records have been skipped in this way,
|
116
|
+
traject will _also_ return a non-0 failure exit code. (Is this good?
|
117
|
+
Does it need to be configurable?)
|
118
|
+
|
119
|
+
In these cases, information about errors that led to skipped records should
|
120
|
+
be output as ERROR level in the logs.
|
121
|
+
|
122
|
+
## Logs and Error Reporting
|
123
|
+
|
124
|
+
By default, traject outputs all logging to stderr. This is often just what
|
125
|
+
you want for a batch or automated process, where there might be some wrapper
|
126
|
+
script which captures stderr and puts it where you want it.
|
127
|
+
|
128
|
+
However, it's easy enough to tell traject to log somewhere else. Either on
|
129
|
+
the command-line:
|
130
|
+
|
131
|
+
traject -s log.file=/some/other/file/log {other args}
|
132
|
+
|
133
|
+
Or in a traject configuration file, setting the `log.file` configuration setting.
|
134
|
+
|
135
|
+
### Separate error log
|
136
|
+
|
137
|
+
You can also separately have a duplicate log file created with ONLY log messages of
|
138
|
+
level ERROR and higher (meaning ERROR and FATAL), with the `log.error_file` setting.
|
139
|
+
Then, if there are any lines in this error log file at all, you know something bad
|
140
|
+
happened, maybe your batch process needs to notify someone, or abort further
|
141
|
+
steps in the batch process.
|
142
|
+
|
143
|
+
traject -s log.file=/var/log/traject.log -s log.error_file=/var/log/traject_error.log {more args}
|
144
|
+
|
145
|
+
The error lines will be in the main log file, and also duplicated in the error
|
146
|
+
log file.
|
147
|
+
|
148
|
+
### Completely customizable logging with yell
|
149
|
+
|
150
|
+
Traject uses the [yell](https://github.com/rudionrails/yell) gem for logging.
|
151
|
+
You can configure the logger directly to implement whatever crazy logging rules you might
|
152
|
+
want, so long as yell supports them. But yell is pretty flexible.
|
153
|
+
|
154
|
+
Recall that traject config files are just ruby, executed in the context
|
155
|
+
of a Traject::Indexer. You can set the Indexer's `logger` to a yell logger
|
156
|
+
object you configure yourself however you like:
|
157
|
+
|
158
|
+
~~~ruby
|
159
|
+
# inside a traject configuration file
|
160
|
+
|
161
|
+
logger = Yell.new do |l|
|
162
|
+
l.level = 'gte.info' # will only pass :info and above to the adapters
|
163
|
+
|
164
|
+
l.adapter :datefile, 'production.log', level: 'lte.warn' # anything lower or equal to :warn
|
165
|
+
l.adapter :datefile, 'error.log', level: 'gte.error' # anything greater or equal to :error
|
166
|
+
end
|
167
|
+
~~~
|
168
|
+
|
169
|
+
See [yell](https://github.com/rudionrails/yell) docs for more, you can
|
170
|
+
do whatever you can make yell do, just write ruby.
|
171
|
+
|
172
|
+
### Bundler
|
173
|
+
|
174
|
+
For automated batch execution, we recommend you consider using
|
175
|
+
bundler to manage any gem dependencies. See the [Extending
|
176
|
+
With Your Own Code](./extending.md) traject docs for
|
177
|
+
information on how traject integrates with bundler.
|
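The batch hints above give two signals that a run went wrong: a non-zero exit code, and any content in the file named by `log.error_file`. A minimal sketch of a batch-side check that combines both, assuming a hypothetical helper `batch_step_ok?` (not part of traject) and using a temporary file in place of a real error log:

```ruby
require 'tempfile'

# Hypothetical batch helper, not part of traject.
# `exit_ok` is the result of running traject, e.g. from Kernel#system,
# which returns true for exit status 0, false for non-zero, and nil if
# the command could not be run at all.
# `error_log_path` is the file passed as `-s log.error_file=...`.
def batch_step_ok?(exit_ok, error_log_path)
  return false unless exit_ok

  # File.size? returns nil for a missing or zero-length file.
  File.size?(error_log_path).nil?
end

# Demonstration with a temp file standing in for the error log.
Tempfile.create("traject_error") do |log|
  puts batch_step_ok?(true, log.path)   # error log still empty
  log.write("ERROR could not index record\n")
  log.flush
  puts batch_step_ok?(true, log.path)   # error log now has content
end
```

In a real wrapper the first argument would come from something like `system("traject", "-c", "conf.rb", ...)`, with the remaining arguments depending on your own setup.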