traject 0.0.1

Files changed (64)
  1. data/.gitignore +18 -0
  2. data/Gemfile +4 -0
  3. data/LICENSE.txt +22 -0
  4. data/README.md +346 -0
  5. data/Rakefile +16 -0
  6. data/bin/traject +153 -0
  7. data/doc/macros.md +103 -0
  8. data/doc/settings.md +34 -0
  9. data/lib/traject.rb +10 -0
  10. data/lib/traject/indexer.rb +196 -0
  11. data/lib/traject/json_writer.rb +51 -0
  12. data/lib/traject/macros/basic.rb +9 -0
  13. data/lib/traject/macros/marc21.rb +145 -0
  14. data/lib/traject/marc_extractor.rb +206 -0
  15. data/lib/traject/marc_reader.rb +61 -0
  16. data/lib/traject/qualified_const_get.rb +30 -0
  17. data/lib/traject/solrj_writer.rb +120 -0
  18. data/lib/traject/translation_map.rb +184 -0
  19. data/lib/traject/version.rb +3 -0
  20. data/test/indexer/macros_marc21_test.rb +146 -0
  21. data/test/indexer/macros_test.rb +40 -0
  22. data/test/indexer/map_record_test.rb +120 -0
  23. data/test/indexer/read_write_test.rb +47 -0
  24. data/test/indexer/settings_test.rb +65 -0
  25. data/test/marc_extractor_test.rb +168 -0
  26. data/test/marc_reader_test.rb +29 -0
  27. data/test/solrj_writer_test.rb +106 -0
  28. data/test/test_helper.rb +28 -0
  29. data/test/test_support/hebrew880s.marc +1 -0
  30. data/test/test_support/manufacturing_consent.marc +1 -0
  31. data/test/test_support/test_data.utf8.marc.xml +2609 -0
  32. data/test/test_support/test_data.utf8.mrc +1 -0
  33. data/test/translation_map_test.rb +98 -0
  34. data/test/translation_maps/bad_ruby.rb +8 -0
  35. data/test/translation_maps/bad_yaml.yaml +1 -0
  36. data/test/translation_maps/both_map.rb +1 -0
  37. data/test/translation_maps/both_map.yaml +1 -0
  38. data/test/translation_maps/default_literal.rb +10 -0
  39. data/test/translation_maps/default_passthrough.rb +10 -0
  40. data/test/translation_maps/marc_040a_translate_test.yaml +1 -0
  41. data/test/translation_maps/ruby_map.rb +10 -0
  42. data/test/translation_maps/translate_array_test.yaml +8 -0
  43. data/test/translation_maps/yaml_map.yaml +7 -0
  44. data/traject.gemspec +30 -0
  45. data/vendor/solrj/README +8 -0
  46. data/vendor/solrj/build.xml +39 -0
  47. data/vendor/solrj/ivy.xml +16 -0
  48. data/vendor/solrj/lib/commons-codec-1.7.jar +0 -0
  49. data/vendor/solrj/lib/commons-io-2.1.jar +0 -0
  50. data/vendor/solrj/lib/httpclient-4.2.3.jar +0 -0
  51. data/vendor/solrj/lib/httpcore-4.2.2.jar +0 -0
  52. data/vendor/solrj/lib/httpmime-4.2.3.jar +0 -0
  53. data/vendor/solrj/lib/jcl-over-slf4j-1.6.6.jar +0 -0
  54. data/vendor/solrj/lib/jul-to-slf4j-1.6.6.jar +0 -0
  55. data/vendor/solrj/lib/log4j-1.2.16.jar +0 -0
  56. data/vendor/solrj/lib/noggit-0.5.jar +0 -0
  57. data/vendor/solrj/lib/slf4j-api-1.6.6.jar +0 -0
  58. data/vendor/solrj/lib/slf4j-log4j12-1.6.6.jar +0 -0
  59. data/vendor/solrj/lib/solr-solrj-4.3.1-javadoc.jar +0 -0
  60. data/vendor/solrj/lib/solr-solrj-4.3.1-sources.jar +0 -0
  61. data/vendor/solrj/lib/solr-solrj-4.3.1.jar +0 -0
  62. data/vendor/solrj/lib/wstx-asl-3.2.7.jar +0 -0
  63. data/vendor/solrj/lib/zookeeper-3.4.5.jar +0 -0
  64. metadata +264 -0
data/.gitignore ADDED
@@ -0,0 +1,18 @@
1
+ *.gem
2
+ *.rbc
3
+ .bundle
4
+ .config
5
+ .yardoc
6
+ .DS_Store
7
+ Gemfile.lock
8
+ InstalledFiles
9
+ _yardoc
10
+ coverage
11
+ lib/bundler/man
12
+ pkg
13
+ rdoc
14
+ spec/reports
15
+ test/tmp
16
+ test/version_tmp
17
+ tmp
18
+ vendor/solrj/ivy
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in traject.gemspec
4
+ gemspec
data/LICENSE.txt ADDED
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2013 TODO: Write your name
2
+
3
+ MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,346 @@
1
+ # Traject
2
+
3
+ Tools for indexing MARC records to Solr.
4
+
5
+ Generalizable to tools for configuring mapping records to associative array data structures, and sending
6
+ them somewhere.
7
+
8
+ **Currently under development, not production ready**
9
+
10
+ ## Background/Goals
11
+
12
+ Based on both the successes and failures of previous MARC indexing attempts -- including the venerable SolrMarc which we greatly appreciate and from which we've learned a lot -- I decided that a solution addressing the remaining pain points needed jruby -- ruby on the JVM.
13
+
14
+ Traject aims to:
15
+
16
+ * Be simple and straightforward for simple use cases, hopefully being accessible even to non-rubyists, although it's in ruby
17
+ * Be composed of modular and recomposable elements, to provide flexibility for uncommon use cases. You should be able to
18
+ do your own thing wherever you want, without having to give up
19
+ the already implemented parts you still do want, mixing and matching at will.
20
+ * Have a parsimonious or 'elegant' internal architecture, only a few architectural concepts to understand that everything else is built on, to hopefully make the internals easy to work with and maintain.
21
+
22
+
23
+ ## Installation
24
+
25
+ Traject runs under jruby (ruby on the JVM). I recommend [chruby](https://github.com/postmodern/chruby) and [ruby-install](https://github.com/postmodern/ruby-install#readme) for installing and managing ruby installations.
26
+
27
+ Then just `gem install traject`.
28
+
29
+ ( **Note**: We may later provide an all-in-one .jar distribution, which would not require you to install or use jruby on your system. This is hypothetically possible. Is it a good idea?)
30
+
31
+ # Usage
32
+
33
+ ## Configuration file format
34
+
35
+ The traject command-line utility requires you to supply it with a configuration file. So let's start by describing the configuration file.
36
+
37
+ Configuration files are actually just ruby -- so by convention they end in `.rb`.
38
+
39
+ Don't worry, you don't necessarily need to know ruby well to write them, they give you a subset of ruby to work with. But the full power
40
+ of ruby is available to you.
41
+
42
+ **rubyist tip**: Technically, config files are executed with `instance_eval` in a Traject::Indexer instance, so the special commands you see are just methods on Traject::Indexer (or mixed into it). But you can
43
+ call ordinary ruby `require` in config files, etc., too, to load
44
+ external functionality. See more at Extending Logic below.
45
+
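The `instance_eval` tip above can be illustrated with a toy class. This is a minimal sketch, not Traject::Indexer itself: `ToyIndexer`, `SettingsProxy` and the `rules` reader are hypothetical names invented for the example, and only `settings`, `store` and `to_field` mirror the directives this README describes.

```ruby
# Toy stand-in for an indexer; NOT Traject::Indexer itself.
class ToyIndexer
  attr_reader :rules

  def initialize
    @settings = {}
    @rules    = []
  end

  # With a block: evaluate it against a proxy so `store` works.
  # Always returns the settings hash.
  def settings(&block)
    SettingsProxy.new(@settings).instance_eval(&block) if block
    @settings
  end

  # Record an indexing rule: an output field name plus its logic.
  def to_field(name, &block)
    @rules << [name, block]
  end

  # Hypothetical helper giving the `store "key", "value"` syntax.
  class SettingsProxy
    def initialize(hash)
      @hash = hash
    end

    def store(key, value)
      @hash[key.to_s] = value
    end
  end
end

# A config "file" held in a string, so the sketch is self-contained.
config_source = <<~'CONFIG'
  settings do
    store "solr.url", "http://example.org/solr"
  end

  to_field "source" do |record, accumulator, context|
    accumulator << "LIB_CATALOG"
  end
CONFIG

indexer = ToyIndexer.new
# The second argument names the "file" so errors point at the config.
indexer.instance_eval(config_source, "configuration_file.rb")

puts indexer.settings["solr.url"]
puts indexer.rules.map(&:first).inspect
```

Because the config text is evaluated with the indexer as `self`, the bare `settings` and `to_field` calls resolve to ordinary instance methods, which is why no special file format is needed.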
46
+ There are two main categories of directives in your configuration files: _Settings_, and _Indexing Rules_.
47
+
48
+ ### Settings
49
+
50
+ Settings are a flat list of key/value pairs, where the keys are always strings and the values usually are. They look like this
51
+ in a config file:
52
+
53
+ ~~~ruby
54
+ # configuration_file.rb
55
+ # Note that "#" is a comment, cause it's just ruby
56
+
57
+ settings do
58
+ # Where to find solr server to write to
59
+ store "solr.url", "http://example.org/solr"
60
+
61
+ # solr.version doesn't currently do anything, but set it
62
+ # anyway, in the future it will warn you if you have settings
63
+ # that may not work with your version.
64
+ store "solr.version", "4.3.0"
65
+
66
+ # default source type is binary, traject can't guess
67
+ # you have to tell it.
68
+ store "marc_source.type", "xml"
69
+
70
+ # settings can be set on command line instead of
71
+ # config file too.
72
+
73
+ # various others...
74
+ store "solrj_writer.commit_on_close", "true"
75
+ end
76
+ ~~~
77
+
78
+ See the docs page on [Settings](./doc/settings.md) for a list
79
+ of all standardized settings.
80
+
81
+ ### Indexing Rules
82
+
83
+ You can keep your settings and indexing rules in one config file,
84
+ or split them across multiple config files however you like. (Connection details vs indexing? Common things vs environment-specific things?)
85
+
86
+ The main tool for indexing rules is the `to_field` command,
87
+ which can be used with a few standard functions.
88
+
89
+ ~~~ruby
90
+ # configuration.rb
91
+
92
+ # The first argument, 'source' in this case, is what Solr field we're
93
+ # sending to. And the 'literal' function supplies a hard-coded
94
+ # constant string literal.
95
+ to_field "source", literal("LIB_CATALOG")
96
+
97
+ # you can call 'to_field' multiple times, additional values
98
+ # are concatenated
99
+ to_field "source", literal("ANOTHER ONE")
100
+
101
+ # Serialize the marc record back out and
102
+ # put it in a solr field.
103
+ to_field "marc_record", serialized_marc(:format => "xml")
104
+
105
+ # or :format => "json" for marc-in-json
106
+ # or :format => "binary", by default Base64-encoded for Solr
107
+ # 'binary' field, or, for more like what SolrMarc did, without
108
+ # escaping:
109
+ to_field "marc_record_raw", serialized_marc(:format => "binary", :binary_escape => false)
110
+
111
+ # Take ALL of the text from the marc record, useful for
112
+ # a catch-all field. Actually by default only takes
113
+ # from tags 100 to 899.
114
+ to_field "text", extract_all_marc_values
115
+
116
+ # Now we have a simple example of the general utility function
117
+ # `extract_marc`
118
+ to_field "id", extract_marc("001", :first => true)
119
+ ~~~
120
+
121
+ `extract_marc` takes a marc tag/subfield specification, and optional
122
+ arguments. `:first => true` means if the specification returned multiple values, ignore all but the first. It is wise to use this
123
+ *whenever you have a non-multi-valued solr field* even if you think "There should only be one 001 field anyway!", to deal with unexpected
124
+ data properly.
125
+
126
+ Other examples of the specification string, which can include multiple tag mentions, as well as subfields and indicators:
127
+
128
+ ~~~ruby
129
+ # 245 subfields n, p, and s. 130, all subfields.
130
+ # built-in punctuation trimming routine.
131
+ to_field "title_t", extract_marc("245nps:130", :trim_punctuation => true)
132
+
133
+ # Can limit to certain indicators with || chars.
134
+ # "*" is a wildcard in indicator spec. So
135
+ # 856 with first indicator '0', subfield u.
136
+ to_field "email_addresses", extract_marc("856|0*|u")
137
+ ~~~
138
+
139
+ The `extract_marc` function *by default* includes any linked
140
+ MARC `880` fields with alternate-script versions. Another reason
141
+ to use the `:first` option if you really only want one.
142
+
143
+ For MARC control (aka 'fixed') fields, you can use square
144
+ brackets to take a slice by byte offset.
145
+
146
+ to_field "language_code", extract_marc("008[35-37]")
147
+
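To make the specification syntax concrete, here is a hedged sketch of how such strings could be broken apart. It is not Traject::MarcExtractor's actual parsing code: `ParsedSpec`, `SPEC_PATTERN` and `parse_spec` are hypothetical names, and the grammar is inferred only from the three examples above (tags, `|xy|` indicators with `*` as wildcard, subfield codes, and `[n-m]` byte slices).

```ruby
# Hypothetical parser for spec strings such as "245nps:130",
# "856|0*|u" and "008[35-37]". Not Traject::MarcExtractor's real code.
ParsedSpec = Struct.new(:tag, :indicator1, :indicator2, :subfields, :bytes)

SPEC_PATTERN = /\A
  (?<tag>\d{3})                           # three-digit MARC tag
  (?:\|(?<i1>.)(?<i2>.)\|)?               # optional indicator pair between pipes
  (?<subfields>[a-z0-9]*)                 # optional subfield codes
  (?:\[(?<from>\d+)(?:-(?<to>\d+))?\])?   # optional byte slice
\z/x

def parse_spec(spec_string)
  spec_string.split(":").map do |part|
    m = SPEC_PATTERN.match(part)
    raise ArgumentError, "bad spec: #{part.inspect}" unless m

    # "*" is a wildcard, represented here as nil ("match anything").
    wildcard = ->(char) { char == "*" ? nil : char }

    bytes = if m[:from]
      first = m[:from].to_i
      last  = (m[:to] || m[:from]).to_i
      (first..last)
    end

    # An empty subfield list is taken to mean "all subfields".
    ParsedSpec.new(m[:tag], wildcard.(m[:i1]), wildcard.(m[:i2]),
                   m[:subfields].chars, bytes)
  end
end

# Pretend 008 control field value, 40 bytes long.
fixed_field = ("0" * 35) + "eng" + "  "
spec = parse_spec("008[35-37]").first
puts fixed_field[spec.bytes]
```

Representing the byte slice as a Ruby Range means it can be handed straight to `String#[]`, as the last lines show.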
148
+ `extract_marc` also supports `translation maps` similar
149
+ to SolrMarc's. There will be some translation maps built in,
150
+ and you can provide your own. Translation maps can be supplied
151
+ in yaml or ruby. Translation maps are especially useful
152
+ for mapping from MARC codes to user-displayable strings. See Traject::TranslationMap for more info:
153
+
154
+ # "translation_map" will be passed to Traject::TranslationMap.new
155
+ # and the created map used to translate all values
156
+ to_field "language", extract_marc("008[35-37]:041a:041d", :translation_map => "marc_language_code")
157
+
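The idea behind a translation map can be sketched with a plain hash loaded from YAML. This is only an illustration, not the Traject::TranslationMap API: `LANGUAGE_MAP`, `translate_values` and the `default:` keyword are hypothetical, and the three language codes are made-up sample data.

```ruby
require "yaml"

# Hypothetical map contents; real maps would live in YAML or ruby files.
LANGUAGE_MAP = YAML.load(<<~MAP)
  eng: English
  fre: French
  ger: German
MAP

# Translate each value through the map. Unknown codes fall back to
# `default`: :passthrough keeps the raw value, nil drops it, and any
# other value is used as a literal replacement.
def translate_values(values, map, default: nil)
  translated = values.map do |value|
    map.fetch(value) { default == :passthrough ? value : default }
  end
  translated.compact
end

puts translate_values(%w[eng fre], LANGUAGE_MAP).inspect
puts translate_values(%w[eng xxx], LANGUAGE_MAP, default: :passthrough).inspect
```

Dropping unmapped values by default keeps stray codes out of a user-facing field, while the passthrough and literal defaults cover the cases where losing data silently would be worse.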
158
+ #### Direct indexing logic vs. Macros
159
+
160
+ It turns out all those functions we saw above used with `to_field` -- `literal`, `serialized_marc`, `extract_all_marc_values`, and `extract_marc` -- are what Traject calls 'macros'.
161
+
162
+ They are all actually built based upon a more basic element of
163
+ indexing functionality, which you can always drop down to, and
164
+ which is used to build the macros. The basic use of `to_field`,
165
+ with directly specified logic instead of using a macro, looks like this:
166
+
167
+ ~~~ruby
168
+ to_field "source" do |record, accumulator, context|
169
+ accumulator << "LIB_CATALOG"
170
+ end
171
+ ~~~~
172
+
173
+ That's actually equivalent to the macro we used earlier: `to_field "source", literal("LIB_CATALOG")`.
174
+
175
+ This direct use of to_field happens to be a ruby "block", which is
176
+ used to define a block of logic that can be stored and executed later. When the block is called, the first argument (`record` above) is the marc_record being indexed (a ruby-marc MARC::Record object), and the second argument (`accumulator`) is a ruby array used to accumulate output values.
177
+
178
+ The third argument is a `Traject::Indexer::Context` object that can
179
+ be used for more advanced functionality, including caching expensive
180
+ per-record calculations, writing out to more than one output field at a time (TODO example), or taking account of current Traject Settings in your logic.
181
+
182
+ You can always drop out to this basic direct use whenever you need
183
+ special purpose logic, directly in the config file, writing in
184
+ ruby:
185
+
186
+ ~~~ruby
187
+ # this is more or less nonsense, just an example
188
+ to_field "weird_title" do |record, accumulator, context|
189
+ field = record['245']
190
+ title = field['a']
191
+ title.upcase! if field.indicator1 == '1'
192
+ accumulator << title
193
+ end
194
+
195
+ # To make use of marc extraction by specification, just like
196
+ # extract_marc does, you may want to use the Traject::MarcExtractor
197
+ # class
198
+ to_field "weirdo" do |record, accumulator, context|
199
+ list = Traject::MarcExtractor.new(record, "700a").extract
200
+ # combine all the 700a's in ONE string, cause we're weird
201
+ list = list.join(" ")
202
+ accumulator << list
203
+ end
204
+ ~~~~
205
+
206
+ You can also *combine* a macro and a direct block for some
207
+ post-processing. In this case, the `accumulator` parameter
208
+ in our block will start out with the values left by
209
+ the `extract_marc`:
210
+
211
+ ~~~ruby
212
+ to_field "subjects", extract_marc("600:650:610") do |record, accumulator, context|
213
+ # for some reason we want to uppercase all our subjects
214
+ accumulator.collect! {|s| s.upcase }
215
+ end
216
+ ~~~
217
+
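One way to picture how a macro and a direct block can share an accumulator is the toy mapper below. It is a sketch under stated assumptions, not Traject's implementation: `ToyMapper` and its `map_record` method are hypothetical, and a "macro" is modeled simply as a method that returns a lambda taking the same three arguments as a block.

```ruby
# Toy mapper; NOT Traject::Indexer. Shows macros as lambdas.
class ToyMapper
  def initialize
    # field name => ordered list of callables to run for that field
    @steps = Hash.new { |hash, field| hash[field] = [] }
  end

  # A "macro" here is just a method returning a lambda.
  def literal(value)
    lambda { |record, accumulator, context| accumulator << value }
  end

  # Accept a macro lambda, a block, or both; both are stored in order.
  def to_field(name, macro = nil, &block)
    @steps[name] << macro if macro
    @steps[name] << block if block
  end

  # Run every step for every field against one record.
  def map_record(record)
    @steps.each_with_object({}) do |(name, procs), output|
      accumulator = []
      procs.each { |step| step.call(record, accumulator, {}) }
      output[name] = accumulator unless accumulator.empty?
    end
  end
end

mapper = ToyMapper.new
mapper.to_field "source", mapper.literal("LIB_CATALOG")
mapper.to_field "source", mapper.literal("ANOTHER ONE")
mapper.to_field("shout", mapper.literal("quiet")) do |record, accumulator, context|
  # Post-process whatever the macro left in the accumulator.
  accumulator.collect!(&:upcase)
end

p mapper.map_record(nil)
```

Since the macro runs first and the block second on the same array, the block sees the macro's values and can rewrite them in place, which matches the post-processing pattern described above.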
218
+ If you find yourself repeating code a lot in direct blocks, you
219
+ can supply your _own_ macros, for local use, or even to share
220
+ with others in a ruby gem. See docs [Macros](./doc/macros.md)
221
+
222
+ ## Command Line
223
+
224
+ The simplest invocation is:
225
+
226
+ traject -c conf_file.rb marc_file.mrc
227
+
228
+ Traject assumes marc files are in ISO 2709 binary format; it is not
229
+ currently able to guess marc format type. If you are reading
230
+ marc files in another format, you need to tell traject either with the `marc_source.type` setting or the `-t` command-line shortcut:
231
+
232
+ traject -c conf.rb -t xml marc_file.xml
233
+
234
+ You can supply more than one conf file with repeated `-c` arguments.
235
+
236
+ traject -c connection_conf.rb -c indexing_conf.rb marc_file.mrc
237
+
238
+ If you leave off the marc_file, traject will try to read from stdin. You can only supply one marc file at a time, but we can take advantage of stdin to get around this:
239
+
240
+ cat some/dir/*.marc | traject -c conf_file.rb
241
+
242
+ You can set any setting on the command line with `-s key=value`.
243
+ This will over-ride any settings from conf files. (TODO, I don't
244
+ think over-riding works, it's actually a bit tricky)
245
+
246
+ traject -c conf_file.rb marc_file -s solr.url=http://example.com/solr -s solrj_writer.commit_on_close=true
247
+
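The intended `-s key=value` behavior can be sketched as follows. This is an illustration of the idea rather than the code in `bin/traject`: `parse_setting_pairs` and `merge_settings` are hypothetical helpers, and unlike the script's regular expression this version splits on the first `=` only, so values may themselves contain `=`. The override order shown is the one this README says is intended, which (per the TODO above) may not yet be what actually happens.

```ruby
# Split "key=value" strings into a settings hash; later pairs win.
def parse_setting_pairs(pairs)
  pairs.each_with_object({}) do |pair, settings|
    # Limit of 2 keeps any further "=" characters inside the value.
    key, value = pair.split("=", 2)
    if key.nil? || key.empty? || value.nil?
      raise ArgumentError, "Should be of format key=value: #{pair.inspect}"
    end
    settings[key] = value
  end
end

# Intended precedence: command-line settings override config-file ones.
def merge_settings(from_config, from_command_line)
  from_config.merge(from_command_line)
end

config_settings = { "solr.url" => "http://example.org/solr" }
cli_settings    = parse_setting_pairs([
  "solr.url=http://example.com/solr",
  "solrj_writer.commit_on_close=true"
])

p merge_settings(config_settings, cli_settings)
```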
248
+ There are some built-in command-line option shortcuts for useful
249
+ settings:
250
+
251
+ Use `-j` to output as pretty-printed JSON
252
+ hashes, instead of sending to solr. Useful for debugging or sanity
253
+ checking.
254
+
255
+ traject -j -c conf_file.rb marc_file
256
+
257
+ Use `-u` as a shortcut for `-s solr.url=X`
258
+
259
+ traject -c conf_file.rb -u http://example.com/solr marc_file.mrc
260
+
261
+ Also see `-I load_path` and `-g Gemfile` options under Extending Logic
262
+
263
+ ## Extending Logic
264
+
265
+ TODO fill out nicer.
266
+
267
+ Basically:
268
+
269
+ command line `-I` can be used to append to the ruby $LOAD_PATH, and then you can simply `require` your local files, and then use them for
270
+ whatever. Macros, utility functions, translation maps, whatever.
271
+
272
+ If you want to use logic from other gems in your configuration mapping, you can do that too. This works for traject-specific
273
+ functionality like translation maps and macros, or for anything else.
274
+ To use gems, you can _either_ use straight rubygems, simply by
275
+ installing gems in your system and using `require` or `gem` commands... **or** you can use Bundler for dependency locking and other dependency management. To have traject use Bundler, create a `Gemfile` and then call traject command line with the `-g` option. With the `-g` option alone, Bundler will look in the CWD and parents for the first `Gemfile` it finds. Or supply `-g ./somewhere/MyGemfile` to point at a Gemfile anywhere.
276
+
277
+
278
+ # Development
279
+
280
+ Run tests with `rake test` or just `rake`. Tests are written using Minitest (please, no rspec). We use the spec-style describe/it to
281
+ list the tests -- but generally prefer unit-style "assert_*" methods
282
+ to make actual assertions, for clarity.
283
+
284
+ Some tests need to run against a solr instance. Currently no solr
285
+ instance is baked in. You can provide your own solr instance to test against and set shell ENV variable
286
+ "solr_url", and the tests will use it. Otherwise, tests will
287
+ use a mocked up Solr instance.
288
+
289
+ Pull requests should come with tests, as well as docs where applicable. Docs can be inline rdoc-style, edits to this README,
290
+ and/or extra files in ./doc -- as appropriate for what needs to be documented.
291
+
292
+ ## TODO
293
+
294
+ * Logging
295
+ * it's doing no logging of its own
296
+ * It's not properly setting up the solrj logging
297
+ * Making solrj and its own logging go to same place, across jruby bridge, not sure
298
+ (I want all of this code BUT the Solr writing stuff to be usable under MRI too,
299
+ I want to repurpose the mapping code for DISPLAY too)
300
+
301
+ * Error handling. Related to logging. Catch errors indexing
302
+ particular records, make
303
+ sure they are logged in an obvious place, make sure processing proceeds with other
304
+ records (if it should!) etc.
305
+
306
+ * Distro and the SolrJ jars. Right now the SolrJ jars are included in the gem (although they
307
+ aren't actually loaded until you try to use the SolrJWriter). This is not necessarily
308
+ best. Other possibilities:
309
+ * Put them in their own gem
310
+ * Make the end-user download them themselves, possibly providing the ivy.xml's to do so for
311
+ them.
312
+
313
+ * Various performance improvements, this is not optimized yet. Some improvements
314
+ may challenge architecture, when they involve threading.
315
+ * Profile and optimize marc loading -- right now just using ruby-marc, always.
316
+ * Profile/optimize marc serialization back to stored field, right now it uses
317
+ known-to-be-slow rexml as part of ruby-marc.
318
+ * Use threads for the mapping step? With celluloid, or threach, or other? Does
319
+ this require thinking more about thread safety of existing code?
320
+ * Use threads for writing to solr?
321
+ * I am not sure about using the solrj ConcurrentUpdateSolrServer -- among other
322
+ things, it seems to swallow solr errors, that I'm not sure we want to do.
323
+ * But we can batch docs ourselves before HttpSolrServer#add'ing them -- every
324
+ solrj HttpSolrServer#add is an http transaction, but you can give it an ARRAY
325
+ to load multiple at once -- and still get the errors, I think. (Have to test)
326
+ Could be perf nearly as good as concurrentupdate? Or do that, but then make each
327
+ HttpSolrServer#add in one of our own manual threads (Celluloid? Or raw?), so
328
+ continued processing doesn't block?
329
+
330
+ * Reading Marc8. It can't do it yet. Easiest way would be using Marc4j to read, or using it as a transcoder anyway. Don't really want to write marc8 transcoder in ruby.
331
+
332
+ * We need something like `to_field`, but without actually being
333
+ for mapping to a specific output field. For generic pre or post-processing, or multi-output-field logic. `before_record do &block`, `after_record do &block` , `on_each_record do &block`, one or more of those.
334
+
335
+ * Unicode normalization. Has to normalize to NFKC on way out to index. Except for serialized marc field and other exceptions? Except maybe don't have to, rely on solr analyzer to do it?
336
+
337
+ * Should it normalize to NFC on the way in, to make sure translation maps and other string comparisons match properly?
338
+
339
+ * Either way, all optional/configurable of course. based
340
+ on Settings.
341
+
342
+ * More macros. Not all the built-in functionality that comes with SolrMarc is here yet. It can be provided as macros, either built in, or distro'd in other gems. If really needed as macros, and not just something local configs build themselves as needed out of the parts already here.
343
+
344
+ * Command line code. It's only 150 lines, but it's kind of messy
345
+ jammed into one file *and lacks tests*. I couldn't figure out
346
+ what to do with it or how to test it. Needs a bit of love.
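The Unicode normalization item in the TODO list can be illustrated with core ruby. This is a sketch of the concern, not anything Traject currently does: `TRANSLATIONS`, `lookup_normalized` and `normalize_for_index` are hypothetical, and `String#unicode_normalize` assumes Ruby 2.2 or newer (whether the jruby in use provides it is an assumption to check).

```ruby
# Hypothetical translation map keyed by a precomposed (NFC) string.
TRANSLATIONS = { "caf\u00E9" => "Coffee shop" }

# Normalize to NFC on the way in, so composed and decomposed
# forms of the same text hit the same map key.
def lookup_normalized(map, value)
  map[value.unicode_normalize(:nfc)]
end

# Normalize to NFKC on the way out, folding compatibility
# characters such as ligatures into their plain equivalents.
def normalize_for_index(values)
  values.map { |value| value.unicode_normalize(:nfkc) }
end

decomposed = "cafe\u0301" # "e" followed by a combining acute accent

puts TRANSLATIONS[decomposed].inspect
puts lookup_normalized(TRANSLATIONS, decomposed).inspect
puts normalize_for_index(["\uFB01le"]).inspect
```

The raw lookup misses because the decomposed string is a different byte sequence from the map key, even though both render identically; normalizing first makes the comparison behave as a reader would expect.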
data/Rakefile ADDED
@@ -0,0 +1,16 @@
1
+ begin
2
+ require 'bundler/setup'
3
+ require "bundler/gem_tasks"
4
+ rescue LoadError
5
+ puts "You must `gem install bundler` and `bundle install` to run rake tasks"
6
+ end
7
+
8
+ require 'rake'
9
+ require 'rake/testtask'
10
+
11
+ task :default => [:test]
12
+
13
+ Rake::TestTask.new do |t|
14
+ t.pattern = 'test/**/*_test.rb'
15
+ t.libs.push 'test', 'test_support'
16
+ end
data/bin/traject ADDED
@@ -0,0 +1,153 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require 'slop'
4
+
5
+
6
+ # If we're loading from source instead of a gem, rubygems
7
+ # isn't setting load paths for us, so we need to set it ourselves
8
+ self_load_path = File.expand_path("../lib", File.dirname(__FILE__))
9
+ unless $LOAD_PATH.include? self_load_path
10
+ $LOAD_PATH << self_load_path
11
+ end
12
+
13
+ require 'traject'
14
+ require 'traject/indexer'
15
+
16
+
17
+
18
+ opts = Slop.new(:strict => true) do
19
+ banner "traject [options] -c configuration.rb [-c config2.rb] file.mrc"
20
+
21
+ on 'v', 'version', "print version information to stderr"
22
+ on 'h', 'help', "print usage information to stderr"
23
+ on 'c', 'conf', 'configuration file path (repeatable)', :argument => true, :as => Array, :required => true
24
+ on :s, :setting, "settings: `-s key=value` (repeatable)", :argument => true, :as => Array
25
+ on :r, :reader, "Set reader class, shortcut for `-s reader_class_name=*`", :argument => true
26
+ on :w, :writer, "Set writer class, shortcut for `-s writer_class_name=*`", :argument => true
27
+ on :u, :solr, "Set solr url, shortcut for `-s solr.url=*`", :argument => true
28
+ on :j, "output as pretty printed json, shortcut for `-s writer_class_name=JsonWriter -s json_writer.pretty_print=true`"
29
+ on :t, :marc_type, "xml, json or binary. shortcut for `-s marc_source.type=*`", :argument => true
30
+ on :I, "load_path", "append paths to ruby $LOAD_PATH", :argument => true, :as => Array, :delimiter => ":"
31
+ on :g, "gemfile", "run with bundler and optionally specified Gemfile", :argument => :optional, :default => ""
32
+ end
33
+
34
+ begin
35
+ opts.parse!
36
+ rescue Slop::Error => e
37
+ $stderr.puts "Error: #{e.message}"
38
+ $stderr.puts "Exiting..."
39
+ $stderr.puts
40
+ $stderr.puts opts.help
41
+ exit 1
42
+ end
43
+
44
+
45
+ options = opts.to_hash
46
+
47
+
48
+
49
+ if options[:version]
50
+ $stderr.puts "traject version #{Traject::VERSION}"
51
+ end
52
+
53
+ if options[:help]
54
+ $stderr.puts opts.help
55
+ exit 0
56
+ end
57
+
58
+ # have to use Slop object to tell diff between
59
+ # no arg supplied and no option -g given at all
60
+ if opts.present? :gemfile
61
+ if options[:gemfile].is_a?(String) && !options[:gemfile].empty?
62
+ # tell bundler what gemfile to use
63
+ gem_path = File.expand_path( options[:gemfile] )
64
+ # bundler not good at error reporting, we check ourselves
65
+ unless File.exist? gem_path
66
+ $stderr.puts "Gemfile `#{options[:gemfile]}` does not exist, exiting..."
67
+ $stderr.puts
68
+ $stderr.puts opts.help
69
+ exit 2
70
+ end
71
+
72
+ ENV["BUNDLE_GEMFILE"] = gem_path
73
+ end
74
+ require 'bundler/setup'
75
+ end
76
+
77
+ settings = {}
78
+ (options[:setting] || []).each do |setting_pair|
79
+
80
+ if setting_pair =~ /\A([^=]+)\=([^=]*)\Z/
81
+ key, value = $1, $2
82
+ settings[key] = value
83
+ else
84
+ $stderr.puts "Unrecognized setting argument '#{setting_pair}':"
85
+ $stderr.puts "Should be of format -s key=value"
86
+ exit 3
87
+ end
88
+ end
89
+
90
+ if options[:writer]
91
+ settings["writer_class_name"] = options[:writer]
92
+ end
93
+ if options[:reader]
94
+ settings["reader_class_name"] = options[:reader]
95
+ end
96
+ if options[:solr]
97
+ settings["solr.url"] = options[:solr]
98
+ end
99
+ if options[:j]
100
+ settings["writer_class_name"] = "JsonWriter"
101
+ settings["json_writer.pretty_print"] = "true"
102
+ end
103
+ if options[:marc_type]
104
+ settings["marc_source.type"] = options[:marc_type]
105
+ end
106
+
107
+
108
+ (options[:load_path] || []).each do |path|
109
+ $LOAD_PATH << path unless $LOAD_PATH.include? path
110
+ end
111
+
112
+ indexer = Traject::Indexer.new
113
+ indexer.settings( settings )
114
+
115
+ options[:conf].each do |conf_path|
116
+ begin
117
+ indexer.instance_eval(File.read(conf_path), conf_path)
118
+ rescue Errno::ENOENT => e
119
+ $stderr.puts "Could not find configuration file '#{conf_path}', exiting..."
120
+ exit 2
121
+ rescue Exception => e
122
+ $stderr.puts "Could not parse configuration file '#{conf_path}'"
123
+ $stderr.puts " #{e.message}"
124
+ if e.backtrace.first =~ /\A(.*)\:in/
125
+ $stderr.puts " #{$1}"
126
+ end
127
+ exit 3
128
+ end
129
+ end
130
+
131
+ # ARGF might be perfect for this, but problems with it include:
132
+ # * jruby is broken, no way to set its encoding, leads to encoding errors reading non-ascii
133
+ # https://github.com/jruby/jruby/issues/891
134
+ # * It's apparently not enough like an IO object for at least one of the ruby-marc XML
135
+ # readers:
136
+ # NoMethodError: undefined method `to_inputstream' for ARGF:Object
137
+ # init at /Users/jrochkind/.gem/jruby/1.9.3/gems/marc-0.5.1/lib/marc/xml_parsers.rb:369
138
+ #
139
+ # * It INSISTS on reading from ARGV, making it hard to test, or use when you want to give
140
+ # it a list of files on something other than ARGV.
141
+ #
142
+ # So for now we do just one file, or stdin if none given. Sorry!
143
+ if ARGV.length > 1
144
+ $stderr.puts "Sorry, traject can only handle one input file at a time right now. `#{ARGV}` Exiting..."
145
+ exit 1
146
+ end
147
+ if ARGV.length == 0
148
+ io = $stdin
149
+ else
150
+ io = File.open(ARGV.first, 'r')
151
+ end
152
+
153
+ indexer.process(io)