traject 0.9.1 → 0.10.0

@@ -0,0 +1,182 @@
1
+ # Extending With Your Own Code
2
+
3
+ Beyond very simple logic, you'll want to write your own ruby code,
4
+ organize it in files other than traject config files, but then
5
+ use it in traject config files.
6
+
7
+ You might want to have code local to your traject project; or you
8
+ might want to use ruby gems with shared code in your traject project.
9
+ A given project may use both of these techniques.
10
+
11
+ Here are some suggestions for how to do this, along with mention
12
+ of a couple traject features meant to make it easier.
13
+
14
+ ## Expert Summary
15
+
16
+ * The traject `-I` command line argument can be used to list directories to
17
+ add to the load path, similar to the `ruby -I` argument. You
18
+ can then 'require' local project files from the load path.
19
+ * translation map files found on the load path or in a
20
+ "./translation_maps" subdir on the load path will be found
21
+ for Traject translation maps.
22
+ * The traject `-g` command line argument can be used to tell traject to use
23
+ bundler with a `Gemfile` located in the current working directory
24
+ (or give a path as an argument: `-g ./some/myGemfile`)
25
+
26
+ ## Custom code local to your project
27
+
28
+ You might want local translation maps, or local ruby
29
+ code. Here's a standard way you might lay out
30
+ this extra code in the file system, using a 'lib'
31
+ directory kept next to your traject config files:
32
+
33
+ ~~~
34
+ - my_traject/
35
+ * config_file.rb
36
+ - lib/
37
+ * my_macros.rb
38
+ * my_utility.rb
39
+ - translation_maps/
40
+ * my_map.yaml
41
+ ~~~
42
+
43
+
44
+ The `my_macros.rb` file might contain a simple [macro](./macros.md)
45
+ in a module called `MyMacros`.
46
+
47
+ The `my_utility.rb` file might contain, say, a module of utility
48
+ methods, `MyUtility.some_utility`, etc.
49
+
50
+ To refer to ruby code from another file, we use the standard
51
+ ruby `require` statement to bring in the files:
52
+
53
+ ~~~ruby
54
+ # config_file.rb
55
+
56
+ require 'my_macros'
57
+ require 'my_utility'
58
+
59
+ # Now that MyMacros is available, extend it into the indexer,
60
+ # and use it:
61
+
62
+ extend MyMacros
63
+
64
+ to_field "title", my_some_macro
65
+
66
+ # And likewise, we can use our utility methods:
67
+
68
+ to_field "title" do |record, accumulator, context|
69
+ accumulator << MyUtility.some_utility(record)
70
+ end
71
+ ~~~
72
+
73
+ **But wait!** This won't work yet, because ruby won't be
74
+ able to find the file in `require 'my_macros'`. To fix
75
+ that, we want to add our local `lib` directory to the
76
+ ruby `$LOAD_PATH`, a standard ruby feature.
77
+
78
+ Traject provides a way for you to add to the load path
79
+ from the traject command line, the `-I` flag:
80
+
81
+ traject -I ./lib -c ./config_file.rb ...
82
+
83
+ Or, you can hard-code a `$LOAD_PATH` change directly in your
84
+ config file. You'll have to use some weird looking
85
+ ruby code to create a file path relative to the current
86
+ file (the config_file.rb), and then make sure it's
87
+ an absolute path. (Should we add a traject utility
88
+ method for this?)
89
+
90
+ ~~~ruby
91
+ # at top of config_file.rb...
92
+
93
+ $LOAD_PATH.unshift File.expand_path(File.join(File.dirname(__FILE__), './lib'))
94
+ ~~~
95
+
96
+ That's pretty much it!
97
+
98
+ What about that translation map? The `$LOAD_PATH` modification
99
+ took care of that too: Traject::TranslationMap will look
100
+ up translation map definition files on the load path, or
101
+ in a `./translation_maps` subdir on the load path.
102
+
103
+
104
+ ## Using gems in your traject project
105
+
106
+ If there is certain logic that is common between (traject or other)
107
+ projects, it makes sense to put it in a ruby gem.
108
+
109
+ We won't go into detail about creating ruby gems, but we
110
+ do recommend you use the `bundle gem my_gem_name` command to create
111
+ a skeleton of your gem
112
+ ([one tutorial here](http://railscasts.com/episodes/245-new-gem-with-bundler?view=asciicast)).
113
+ This will also make available rake commands to install your gem locally
114
+ (`rake install`), or release it to the rubygems server (`rake release`).
115
+
116
+ There are two main ways to use a gem in your traject project:
117
+ with straight rubygems, or with bundler.
118
+
119
+ Going without bundler is simpler. Simply `gem install some_gem` from the
120
+ command line, and now you can `require` that gem in your traject
121
+ config file, and use what it provides:
122
+
123
+ ~~~ruby
124
+ #some_traject_config.rb
125
+
126
+ require 'some_gem'
127
+
128
+ SomeGem.whatever!
129
+ ~~~
130
+
131
+ Any gem can provide traject translation map definitions
132
+ in its `lib` directory, or in a `lib/translation_maps`
133
+ sub-directory, and traject will be able to find those
134
+ translation maps when the gem is loaded. (Because gems'
135
+ `./lib` directories are added to the ruby load path.)
136
+
137
+ ### Or, with bundler:
138
+
139
+ However, if you then move your traject project to another system,
140
+ where you haven't yet installed the `some_gem`, then running
141
+ traject with this config file will, of course, fail. Or if you
142
+ move your traject project to another system with a slightly
143
+ different version of `some_gem`, your traject indexing could
144
+ behave differently in confusing ways. As the number of gems
145
+ you are using increases, managing this gets increasingly
146
+ confusing.
147
+
148
+ [bundler](http://bundler.io/) was invented to make this kind of dependency management
149
+ more straightforward and reliable. We recommend you consider using
150
+ bundler, especially for traject installations where traject will
151
+ be run via automated batch jobs on production servers.
152
+
153
+ Bundler's behavior is based on a `Gemfile` that lists your
154
+ project dependencies. You can create a starter skeleton
155
+ by running `bundle init`, probably in the directory
156
+ right next to your traject config files.
157
+
158
+ Then specify what gems your traject project will use,
159
+ possibly with version restrictions, in the [Gemfile](http://bundler.io/v1.3/gemfile.html).
160
+
161
+ Run `bundle install` from the directory with the Gemfile, on any system
162
+ at any time, to make sure specified gems are installed.
163
+
164
+ **Run traject** with the `-g` flag to tell it to use the Gemfile:
165
+
166
+ traject -g -c some_traject_config.rb ...
167
+
168
+ Traject will use bundler to setup with the Gemfile, making sure
169
+ the specified versions of all gems are used (and also making sure
170
+ no gems except those specified in the `Gemfile` are available to
171
+ the program).
172
+
173
+ You should still `require` the gem in your traject config file,
174
+ then just refer to what it provides in your config code as usual.
175
+
176
+ You should check both the `Gemfile` and the `Gemfile.lock`
177
+ that bundler creates into your source control repo. The
178
+ `Gemfile.lock` specifies _exactly_ what versions of
179
+ gem dependencies are currently being used, so you can get the exact
180
+ same dependency environment on different servers.
181
+
182
+ See the [bundler documentation](http://bundler.io/#getting-started), or google, for more information.
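
As a rough illustration of the local-code layout described in this document, here is a minimal sketch of what the two files under `lib/` might contain. The module and method names (`MyMacros`, `my_some_macro`, `MyUtility.some_utility`) come from the document; the method bodies are invented for illustration, and the three-argument lambda is an assumption that mirrors the `|record, accumulator, context|` block signature used in the config example.

```ruby
# lib/my_macros.rb (hypothetical contents)
module MyMacros
  # A macro is sketched here as a method returning a lambda. The lambda
  # receives a record, an accumulator array, and a context, matching the
  # block signature shown in config_file.rb.
  def my_some_macro
    lambda do |record, accumulator, context|
      # Illustrative logic only: store a whitespace-trimmed string form.
      accumulator << record.to_s.strip
    end
  end
end

# lib/my_utility.rb (hypothetical contents)
module MyUtility
  # Illustrative utility: upper-case the string form of a record.
  def self.some_utility(record)
    record.to_s.upcase
  end
end
```

With `./lib` on the load path (via `-I ./lib` or the `$LOAD_PATH.unshift` line), the config file's `require 'my_macros'` and `extend MyMacros` would then make `my_some_macro` callable inside `to_field`.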
@@ -0,0 +1,49 @@
1
+ # Other traject command-line commands
2
+
3
+ The traject command line supports a few other miscellaneous commands with
4
+ the "-x command" switch. The usual traject command line is actually
5
+ the `process` command: `traject -x process ...` is the same as leaving out
6
+ the `-x process`.
7
+
8
+ ## Commit
9
+
10
+ `traject -x commit` will send a 'commit' message to the Solr server
11
+ specified in setting `solr.url`. Other parts of configuration will
12
+ be ignored, but don't hurt.
13
+
14
+ traject -x commit -s solr.url=http://some.com/solr
15
+
16
+ Or with a config file that includes a solr.url setting:
17
+
18
+ traject -x commit -c config_file.rb
19
+
20
+ ## marcout
21
+
22
+ The `marcout` command will skip all processing/mapping, and simply
23
+ serialize marc out to a file stream.
24
+
25
+ This is mainly useful when you're using a custom reader to read
26
+ marc from a database or something, but could also be used to
27
+ convert marc from one format to another or something.
28
+
29
+ Writes to stdout by default; set the `output_file` setting (`-o` shortcut) to write to a file.
30
+
31
+ Set the `marcout.type` setting to 'xml' or 'binary' for type of output.
32
+ Or to `human` for human readable display of marc (that is not meant for
33
+ machine readability, but can be good for manual diagnostics.)
34
+
35
+ If outputting type binary, set `marcout.allow_oversized` to
36
+ true or false (boolean or string), to pass that to the MARC::Writer.
37
+ If set to true, then oversized MARC records can still be serialized,
38
+ with length bytes zero'd out -- technically illegal, but can
39
+ be read by MARC::Reader in permissive mode.
40
+
41
+ As the standard Marc4JReader always converts to UTF8,
42
+ output will always be in UTF8. For standard readers, you
43
+ do need to set the `marc_source.type` setting to XML for xml input
44
+ using the standard MARC readers.
45
+
46
+ ~~~bash
47
+ traject -x marcout somefile.marc -o output.xml -s marcout.type=xml
48
+ traject -x marcout -s marc_source.type=xml somefile.xml -c configuration.rb
49
+ ~~~
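
The "boolean or string" handling of `marcout.allow_oversized` can be sketched as a small coercion. The helper name `setting_true?` is hypothetical and is not part of traject; the comparison itself follows the `to_s == "true"` check used in the `marcout` command implementation included in this release.

```ruby
# Hypothetical helper (not a traject API): treat a setting as true only
# when its string form is exactly "true". Settings given with `-s` on the
# command line arrive as strings, while config files may set real booleans.
def setting_true?(value)
  value.to_s == "true"
end

setting_true?(true)     # boolean from a config file
setting_true?("true")   # string from `-s marcout.allow_oversized=true`
setting_true?("false")  # any other string counts as false
```

Under this sketch, an unset (`nil`) setting also counts as false, which leaves the writer's default behavior in place.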
@@ -46,14 +46,18 @@ for commonly used settings, see `traject -h`.
46
46
 
47
47
  * `marc_source.type`: default 'binary'. Can also set to 'xml' or (not yet implemented todo) 'json'. Command line shortcut `-t`
48
48
 
49
- * `marc4j_reader.jar_dir`: Path to a directory containing Marc4J jar file to use. All .jar's in dir will
50
- be loaded. If unset, uses marc4j.jar bundled with traject.
49
+ * `marc4j.jar_dir`: Path to a directory containing Marc4J jar file to use. All .jar's in dir will
50
+ be loaded. If unset, uses marc4j.jar bundled with traject.
51
51
 
52
52
  * `marc4j_reader.permissive`: Used by Marc4JReader only when `marc_source.type` is 'binary', boolean, argument to the underlying MarcPermissiveStreamReader. Default true.
53
53
 
54
54
  * `marc4j_reader.source_encoding`: Used by Marc4JReader only when `marc_source.type` is 'binary', encoding strings accepted
55
55
  by marc4j MarcPermissiveStreamReader. Default "BESTGUESS", also "UTF-8", "MARC"
56
56
 
57
+ * `output_file`: Output file to write to for operations that write to files: For instance the `marcout` command,
58
+ or Writer classes that write to files, like Traject::JsonWriter. Has a shortcut
59
+ `-o` on command line.
60
+
57
61
  * `processing_thread_pool` Default 3. Main thread pool used for processing records with input rules. Choose a
58
62
  pool size based on size of your machine, and complexity of your indexing rules.
59
63
  Probably no reason for it ever to be more than number of cores on indexing machine.
@@ -1,6 +1,7 @@
1
1
  require "traject/version"
2
2
 
3
3
  require 'traject/indexer'
4
+ require 'traject/util'
4
5
 
5
6
  require 'traject/macros/basic'
6
7
  require 'traject/macros/marc21'
@@ -0,0 +1,296 @@
1
+ require 'slop'
2
+ require 'traject'
3
+ require 'traject/indexer'
4
+
5
+ module Traject
6
+ # The class that executes for the Traject command line utility.
7
+ #
8
+ # Warning, does do things like exit entire program on error at present.
9
+ # You probably don't want to use this class for anything but an actual
10
+ # shell command line, if you want to execute indexing directly, just
11
+ # use the Traject::Indexer directly.
12
+ #
13
+ # A CommandLine object has a single persistent Indexer object it uses
14
+ class CommandLine
15
+ # orig_argv is the original one passed in, remaining_argv is after destructive
16
+ # processing by slop, still has file args in it etc.
17
+ attr_accessor :orig_argv, :remaining_argv
18
+ attr_accessor :slop, :options
19
+ attr_accessor :indexer
20
+ attr_accessor :console
21
+
22
+ def initialize(argv=ARGV)
23
+ self.console = $stderr
24
+
25
+ self.orig_argv = argv.dup
26
+ self.remaining_argv = argv
27
+
28
+ self.slop = create_slop!
29
+ self.options = parse_options(self.remaining_argv)
30
+ end
31
+
32
+ # Returns true on success or false on failure; may also raise exceptions;
33
+ # may also exit program directly itself (yeah, could use some normalization)
34
+ def execute
35
+ if options[:version]
36
+ self.console.puts "traject version #{Traject::VERSION}"
37
+ return
38
+ end
39
+ if options[:help]
40
+ self.console.puts slop.help
41
+ return
42
+ end
43
+
44
+ # have to use Slop object to tell diff between
45
+ # no arg supplied and no option -g given at all
46
+ if slop.present? :gemfile
47
+ require_bundler_setup(options[:gemfile])
48
+ end
49
+
50
+ (options[:load_path] || []).each do |path|
51
+ $LOAD_PATH << path unless $LOAD_PATH.include? path
52
+ end
53
+
54
+ arg_check!
55
+
56
+ self.indexer = initialize_indexer!
57
+
58
+ ######
59
+ # SAFE TO LOG to indexer.logger starting here, after indexer is set up from conf files
60
+ # with logging config.
61
+ #####
62
+
63
+ indexer.logger.info("traject executing with: `#{orig_argv.join(' ')}`")
64
+
65
+ # Okay, actual command process! All command_ methods should return true
66
+ # on success, or false on failure.
67
+ result =
68
+ case options[:command]
69
+ when "process"
70
+ indexer.process get_input_io(self.remaining_argv)
71
+ when "marcout"
72
+ command_marcout! get_input_io(self.remaining_argv)
73
+ when "commit"
74
+ command_commit!
75
+ else
76
+ raise ArgumentError.new("Unrecognized traject command: #{options[:command]}")
77
+ end
78
+
79
+ return result
80
+ end
81
+
82
+ def command_commit!
83
+ require 'open-uri'
84
+ raise ArgumentError.new("No solr.url setting provided") if indexer.settings['solr.url'].to_s.empty?
85
+
86
+ url = "#{indexer.settings['solr.url']}/update?commit=true"
87
+ indexer.logger.info("Sending commit to: #{url}")
88
+ indexer.logger.info( open(url).read )
89
+
90
+ return true
91
+ end
92
+
93
+ def command_marcout!(io)
94
+ require 'marc'
95
+
96
+ output_type = indexer.settings["marcout.type"].to_s
97
+ output_type = "binary" if output_type.empty?
98
+
99
+ output_arg = unless indexer.settings["output_file"].to_s.empty?
100
+ indexer.settings["output_file"]
101
+ else
102
+ $stdout
103
+ end
104
+
105
+ case output_type
106
+ when "binary"
107
+ writer = MARC::Writer.new(output_arg)
108
+
109
+ allow_oversized = indexer.settings["marcout.allow_oversized"]
110
+ if allow_oversized
111
+ allow_oversized = (allow_oversized.to_s == "true")
112
+ writer.allow_oversized = allow_oversized
113
+ end
114
+ when "xml"
115
+ writer = MARC::XMLWriter.new(output_arg)
116
+ when "human"
117
+ writer = output_arg.kind_of?(String) ? File.open(output_arg, "w:binary") : output_arg
118
+ else
119
+ raise ArgumentError.new("traject marcout unrecognized marcout.type: #{output_type}")
120
+ end
121
+
122
+ reader = indexer.reader!(io)
123
+
124
+ reader.each do |record|
125
+ writer.write record
126
+ end
127
+
128
+ writer.close
129
+
130
+ return true
131
+ end
132
+
133
+ def get_input_io(argv)
134
+ # ARGF might be perfect for this, but problems with it include:
135
+ # * jruby is broken, no way to set its encoding, leads to encoding errors reading non-ascii
136
+ # https://github.com/jruby/jruby/issues/891
137
+ # * It's apparently not enough like an IO object for at least one of the ruby-marc XML
138
+ # readers:
139
+ # NoMethodError: undefined method `to_inputstream' for ARGF:Object
140
+ # init at /Users/jrochkind/.gem/jruby/1.9.3/gems/marc-0.5.1/lib/marc/xml_parsers.rb:369
141
+ #
142
+ # * It INSISTS on reading from ARGV, making it hard to test, or use when you want to give
143
+ # it a list of files on something other than ARGV.
144
+ #
145
+ # So for now we do just one file, or stdin if none given. Sorry!
146
+ if argv.length > 1
147
+ self.console.puts "Sorry, traject can only handle one input file at a time right now. `#{argv}` Exiting..."
148
+ exit 1
149
+ end
150
+ if argv.length == 0
151
+ indexer.logger.info "Reading from STDIN..."
152
+ io = $stdin
153
+ else
154
+ indexer.logger.info "Reading from #{argv.first}"
155
+ io = File.open(argv.first, 'r')
156
+ end
157
+ return io
158
+ end
159
+
160
+ def load_configuration_files!(my_indexer, conf_files)
161
+ conf_files.each do |conf_path|
162
+ begin
163
+ my_indexer.instance_eval(File.open(conf_path).read, conf_path)
164
+ rescue Errno::ENOENT => e
165
+ self.console.puts "Could not find configuration file '#{conf_path}', exiting..."
166
+ exit 2
167
+ rescue Exception => e
168
+ self.console.puts "Could not parse configuration file '#{conf_path}'"
169
+ self.console.puts " #{e.message}"
170
+ if e.backtrace.first =~ /\A(.*)\:in/
171
+ self.console.puts " #{$1}"
172
+ end
173
+ exit 3
174
+ end
175
+ end
176
+ end
177
+
178
+ def arg_check!
179
+ if options[:command] == "process" && (options[:conf].nil? || options[:conf].length == 0)
180
+ self.console.puts "Error: Missing required configuration file"
181
+ self.console.puts "Exiting..."
182
+ self.console.puts
183
+ self.console.puts self.slop.help
184
+ exit 2
185
+ end
186
+ end
187
+
188
+ # requires bundler/setup, optionally first setting ENV["BUNDLE_GEMFILE"]
189
+ # to tell bundler to use a specific gemfile. Gemfile arg can be relative
190
+ # to current working directory.
191
+ def require_bundler_setup(gemfile=nil)
192
+ if gemfile
193
+ # tell bundler what gemfile to use
194
+ gem_path = File.expand_path( options[:gemfile] )
195
+ # bundler not good at error reporting, we check ourselves
196
+ unless File.exists? gem_path
197
+ self.console.puts "Gemfile `#{options[:gemfile]}` does not exist, exiting..."
198
+ self.console.puts
199
+ self.console.puts slop.help
200
+ exit 2
201
+ end
202
+ ENV["BUNDLE_GEMFILE"] = gem_path
203
+ end
204
+ require 'bundler/setup'
205
+ end
206
+
207
+ def assemble_settings_hash(options)
208
+ settings = {}
209
+
210
+ # `-s key=value` command line
211
+ (options[:setting] || []).each do |setting_pair|
212
+ if setting_pair =~ /\A([^=]+)\=(.*)\Z/
213
+ key, value = $1, $2
214
+ settings[key] = value
215
+ else
216
+ self.console.puts "Unrecognized setting argument '#{setting_pair}':"
217
+ self.console.puts "Should be of format -s key=value"
218
+ exit 3
219
+ end
220
+ end
221
+
222
+ # other command line shortcuts for settings
223
+ if options[:debug]
224
+ settings["log.level"] = "debug"
225
+ end
226
+ if options[:writer]
227
+ settings["writer_class_name"] = options[:writer]
228
+ end
229
+ if options[:reader]
230
+ settings["reader_class_name"] = options[:reader]
231
+ end
232
+ if options[:solr]
233
+ settings["solr.url"] = options[:solr]
234
+ end
235
+ if options[:j]
236
+ settings["writer_class_name"] = "JsonWriter"
237
+ settings["json_writer.pretty_print"] = "true"
238
+ end
239
+ if options[:marc_type]
240
+ settings["marc_source.type"] = options[:marc_type]
241
+ end
242
+ if options[:output_file]
243
+ settings["output_file"] = options[:output_file]
244
+ end
245
+
246
+ return settings
247
+ end
248
+
249
+
250
+ def create_slop!
251
+ return Slop.new(:strict => true) do
252
+ banner "traject [options] -c configuration.rb [-c config2.rb] file.mrc"
253
+
254
+ on 'v', 'version', "print version information to stderr"
255
+ on 'd', 'debug', "Include debug log, -s log.level=debug"
256
+ on 'h', 'help', "print usage information to stderr"
257
+ on 'c', 'conf', 'configuration file path (repeatable)', :argument => true, :as => Array
258
+ on :s, :setting, "settings: `-s key=value` (repeatable)", :argument => true, :as => Array
259
+ on :r, :reader, "Set reader class, shortcut for -s reader_class_name=", :argument => true
260
+ on :o, "output_file", "output file for Writer classes that write to files", :argument => true
261
+ on :w, :writer, "Set writer class, shortcut for -s writer_class_name=", :argument => true
262
+ on :u, :solr, "Set solr url, shortcut for -s solr.url=", :argument => true
263
+ on :j, "output as pretty printed json, shortcut for -s writer_class_name=JsonWriter -s json_writer.pretty_print=true"
264
+ on :t, :marc_type, "xml, json or binary. shortcut for -s marc_source.type=", :argument => true
265
+ on :I, "load_path", "append paths to ruby $LOAD_PATH", :argument => true, :as => Array, :delimiter => ":"
266
+ on :g, "gemfile", "run with bundler and optionally specified Gemfile", :argument => :optional, :default => ""
267
+
268
+ on :x, "command", "alternate traject command: process (default); marcout", :argument => true, :default => "process"
269
+ end
270
+ end
271
+
272
+ def initialize_indexer!
273
+ indexer = Traject::Indexer.new self.assemble_settings_hash(self.options)
274
+ load_configuration_files!(indexer, options[:conf])
275
+
276
+ return indexer
277
+ end
278
+
279
+ def parse_options(argv)
280
+
281
+ begin
282
+ self.slop.parse!(argv)
283
+ rescue Slop::Error => e
284
+ self.console.puts "Error: #{e.message}"
285
+ self.console.puts "Exiting..."
286
+ self.console.puts
287
+ self.console.puts slop.help
288
+ exit 1
289
+ end
290
+
291
+ return self.slop.to_hash
292
+ end
293
+
294
+
295
+ end
296
+ end
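
The `-s key=value` handling in `assemble_settings_hash` can be tried in isolation with a sketch like the following. `parse_setting_pairs` is a hypothetical standalone helper, not part of traject; it reuses the same regular expression as the method above but raises `ArgumentError` instead of printing a message and exiting, so that it can be exercised outside the command line.

```ruby
# Hypothetical helper modeled on the `-s key=value` loop in
# Traject::CommandLine#assemble_settings_hash. Raises on malformed input
# rather than calling `exit`, which the real method does.
def parse_setting_pairs(pairs)
  pairs.each_with_object({}) do |pair, settings|
    # Key is everything before the first "=", value is everything after it,
    # so values may themselves contain "=" characters.
    match = /\A([^=]+)\=(.*)\Z/.match(pair)
    raise ArgumentError, "Should be of format key=value: '#{pair}'" unless match
    settings[match[1]] = match[2]
  end
end

settings = parse_setting_pairs(["solr.url=http://some.com/solr", "marcout.type=xml"])
puts settings.inspect
```

Because the key pattern stops at the first `=`, a pair such as `a=b=c` yields the key `a` with the value `b=c`, and `k=` yields an empty string value.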