traject 1.1.0 → 2.0.0.rc.1

Files changed (51)
  1. checksums.yaml +4 -4
  2. data/.travis.yml +20 -0
  3. data/README.md +85 -73
  4. data/doc/batch_execution.md +2 -6
  5. data/doc/other_commands.md +3 -5
  6. data/doc/settings.md +27 -38
  7. data/lib/traject/command_line.rb +1 -1
  8. data/lib/traject/csv_writer.rb +34 -0
  9. data/lib/traject/delimited_writer.rb +110 -0
  10. data/lib/traject/indexer.rb +29 -11
  11. data/lib/traject/indexer/settings.rb +39 -13
  12. data/lib/traject/line_writer.rb +10 -6
  13. data/lib/traject/marc_reader.rb +2 -1
  14. data/lib/traject/solr_json_writer.rb +277 -0
  15. data/lib/traject/thread_pool.rb +38 -48
  16. data/lib/traject/translation_map.rb +3 -0
  17. data/lib/traject/util.rb +13 -51
  18. data/lib/traject/version.rb +1 -1
  19. data/lib/translation_maps/marc_geographic.yaml +2 -2
  20. data/test/delimited_writer_test.rb +104 -0
  21. data/test/indexer/read_write_test.rb +0 -22
  22. data/test/indexer/settings_test.rb +24 -0
  23. data/test/solr_json_writer_test.rb +248 -0
  24. data/test/test_helper.rb +5 -3
  25. data/test/test_support/demo_config.rb +0 -5
  26. data/test/translation_map_test.rb +9 -0
  27. data/traject.gemspec +18 -5
  28. metadata +77 -87
  29. data/lib/traject/marc4j_reader.rb +0 -153
  30. data/lib/traject/solrj_writer.rb +0 -351
  31. data/test/marc4j_reader_test.rb +0 -136
  32. data/test/solrj_writer_test.rb +0 -209
  33. data/vendor/solrj/README +0 -8
  34. data/vendor/solrj/build.xml +0 -39
  35. data/vendor/solrj/ivy.xml +0 -16
  36. data/vendor/solrj/lib/commons-codec-1.7.jar +0 -0
  37. data/vendor/solrj/lib/commons-io-2.1.jar +0 -0
  38. data/vendor/solrj/lib/httpclient-4.2.3.jar +0 -0
  39. data/vendor/solrj/lib/httpcore-4.2.2.jar +0 -0
  40. data/vendor/solrj/lib/httpmime-4.2.3.jar +0 -0
  41. data/vendor/solrj/lib/jcl-over-slf4j-1.6.6.jar +0 -0
  42. data/vendor/solrj/lib/jul-to-slf4j-1.6.6.jar +0 -0
  43. data/vendor/solrj/lib/log4j-1.2.16.jar +0 -0
  44. data/vendor/solrj/lib/noggit-0.5.jar +0 -0
  45. data/vendor/solrj/lib/slf4j-api-1.6.6.jar +0 -0
  46. data/vendor/solrj/lib/slf4j-log4j12-1.6.6.jar +0 -0
  47. data/vendor/solrj/lib/solr-solrj-4.3.1-javadoc.jar +0 -0
  48. data/vendor/solrj/lib/solr-solrj-4.3.1-sources.jar +0 -0
  49. data/vendor/solrj/lib/solr-solrj-4.3.1.jar +0 -0
  50. data/vendor/solrj/lib/wstx-asl-3.2.7.jar +0 -0
  51. data/vendor/solrj/lib/zookeeper-3.4.5.jar +0 -0
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 4ae9c6a2d87868021cae1b48637592238387d8a1
-  data.tar.gz: 578c645162da3560ff5e01a28cf43b36e82734f8
+  metadata.gz: 1e875abe713a3200de4fb424c9570a3b89d5ddc0
+  data.tar.gz: cbb7b0f4fd9bb293af55afff48b52f92ce2b4dfb
 SHA512:
-  metadata.gz: e0bf13c4ff3cab492b6be8922ae22e33311f701756910526eb3206522774b1519db07324531ebd2c366d5907854c78bb6cde0d65eeff78258513d37fae1a3a57
-  data.tar.gz: dd251b15afafe2a11cbefe8493e145d9b9b883183b5f884a5b61557046c74b699427fe2766d5df587e27729f6c22052b2f672b756aba8fee33149f6abbcc4f40
+  metadata.gz: b7f911f43784275b0a7782e642788bb57a492c1660b9e3a461bf9b03c96882b887dd62e18f385ee9ffb3488c60afda44fb77ea9fc619cd4f3061c471b5bc7227
+  data.tar.gz: 105737429ce7778ae57a3182671fa9c41f72d17579b82f80b38a8ec373b634c1d22394e17c404e23961b5d2f557442e56d69700aaad3091e950d36aa2dc050c5
data/.travis.yml CHANGED
@@ -1,7 +1,27 @@
 language: ruby
 rvm:
   - jruby-19mode
+  - jruby-head
+  - 1.9
+  - 2.1
+  - 2.2
+  - rbx-2
 jdk:
   - openjdk7
   - openjdk6
+matrix:
+  exclude:
+    - rvm: 1.9
+      jdk: openjdk7
+    - rvm: 2.1
+      jdk: openjdk7
+    - rvm: rbx-2
+      jdk: openjdk7
+    - rvm: jruby-head
+      jdk: openjdk6
+    - rvm: 2.2
+      jdk: openjdk6
+  allow_failures:
+    - rvm: jruby-head
+
 bundler_args: --without debug
data/README.md CHANGED
@@ -1,10 +1,12 @@
 # Traject
 
-Tools for reading MARC records, transforming them with indexing rules, and indexing to Solr.
-Might be used to index MARC data for a Solr-based discovery product like [Blacklight](https://github.com/projectblacklight/blacklight) or [VUFind](http://vufind.org/).
+An easy to use, high-performance, flexible and extensible MARC to Solr indexer.
 
-Traject might also be generalized to a set of tools for getting structured data from a source, and transforming it to a hash-like object to send to a destination.
+You might use traject to index MARC data for a Solr-based discovery product like [Blacklight](https://github.com/projectblacklight/blacklight) or [VUFind](http://vufind.org/).
 
+Traject can also be generalized to a set of tools for getting structured data from a source, and transforming it to a hash-like object to send to a destination. In addition to sending data
+to solr, Traject can produce json or yaml files, tab-delimited files, CSV files, and output suitable
+for debugging by a human.
 
 **Traject is stable, mature software, that is already being used in production by its authors.**
 
@@ -14,42 +16,46 @@ Traject might also be generalized to a set of tools for getting structured data
 
 ## Background/Goals
 
-Initially by Jonathan Rochkind (Johns Hopkins Libraries) and Bill Dueber (University of Michigan Libraries).
+Initially by Jonathan Rochkind (Johns Hopkins Libraries) and Bill Dueber (University of Michigan Libraries).
 
-Traject was born out of our experience with similar tools, including the very popular and useful [solrmarc](https://code.google.com/p/solrmarc/) by Bob Haschart; and Bill Dueber's own [marc2solr](http://github.com/billdueber/marc2solr/).
+* Basic configuration files can be easily written even by non-rubyists, with a few simple directives traject provides. But config files are 'ruby all the way down', so we can provide a gradual slope to more complex needs, with the full power of ruby.
+* Easy to program, easy to read, easy to modify.
+* Fast. Traject by default indexes using multiple threads, on multiple cpu cores, when the underlying
+  ruby implementation (i.e., JRuby) allows it, and can use a separate thread for communication with
+  solr even under MRI.
+* Composed of decoupled components, for flexibility and extensibility.
+* Designed to support local code and configuration that's maintainable and testable, and can be shared between projects as ruby gems.
+* Easy to split configuration between multiple files, for simple "pick-and-choose" command line options
+  that can combine to deal with any of your local needs.
 
-We're comfortable programming (especially in a dynamic language), and want to be able to experiment with different indexing patterns quickly, easily, and testably; but are admittedly less comfortable in Java. In order to have a tool with the API's and usage patterns convenient for us, we found we could do it better in JRuby -- Ruby on the JVM.
 
-* Basic configuration files can be easily written even by non-rubyists, with a few simple directives traject provides. But config files are 'ruby all the way down', so we can provide a gradual slope to more complex needs, with the full power of ruby.
-* Easy to program, easy to read, easy to modify.
-* Fast. Traject by default indexes using multiple threads, on multiple cpu cores.
-* Composed of decoupled components, for flexibility and extensibility. The whole code base is only 6400 lines of code, more than a third of which is tests.
-* Designed to support local code and configuration that's maintainable and testable, an can be shared between projects as ruby gems.
-* Designed with batch execution in mind: flexible logging, good exit codes, good use of stdin/stdout/stderr.
+## Installation
 
+Traject runs under MRI ruby (1.9 through 2.2), jruby 1.7.x, or rubinius.
 
-## Installation
+For high-volume indexing in production, traject performs **much** better when run with **JRuby** (ruby on the JVM).
+Standard MRI ruby can't use multiple CPU cores at once, but on JRuby traject can use
+multiple cores for much better performance.
 
-Traject runs under jruby (ruby on the JVM). I recommend [chruby](https://github.com/postmodern/chruby) and [ruby-install](https://github.com/postmodern/ruby-install#readme) for installing and managing ruby installations. (traject is tested
-and supported for ruby 1.9 -- recent versions of jruby should run under 1.9 mode by default).
+Some options for installing a ruby other than your system-provided one are [chruby](https://github.com/postmodern/chruby) and [ruby-install](https://github.com/postmodern/ruby-install#readme).
 
-Then just `gem install traject`.
+Once you have ruby, just `$ gem install traject`.
 
-( **Note**: We may later provide an all-in-one .jar distribution, which does not require you to install jruby or use on your system. This is hypothetically possible. Is it a good idea?)
+( **Note**: We might in the future provide an all-in-one .jar distribution, which does not require you to install jruby on your system, for those who want the multi-threading of jruby without having to actually install it. Let us know if interested.).
 
 
 ## Configuration files
 
 traject is configured using configuration files. To get a sense of what they look like, you can
-take a look at our sample non-trivial configuration file,
-[demo_config.rb](./test/test_support/demo_config.rb), which you'd run like
-`traject -c path/to/demo_config.rb marc_file.marc`.
+take a look at our sample basic configuration file,
+[demo_config.rb](./test/test_support/demo_config.rb). You could run traject with that configuration file
+as: `traject -c path/to/demo_config.rb marc_file.marc`.
 
 Configuration files are actually just ruby -- so by convention they end in `.rb`.
 
-We hope you can write basic useful configuration files without being a ruby expert,
-traject gives you some easy functions to use for common diretives. But the full power
-of ruby is available to you if needed.
+We hope you can write basic useful configuration files without much ruby experience, since
+traject gives you some easy functions to use for common directives. But the full power
+of ruby is available to you if needed.
 
 **rubyist tip**: Technically, config files are executed with `instance_eval` in a Traject::Indexer instance, so the special commands you see are just methods on Traject::Indexer (or mixed into it). But you can
 call ordinary ruby `require` in config files, etc., too, to load
@@ -73,10 +79,6 @@ settings do
   # Where to find solr server to write to
   provide "solr.url", "http://example.org/solr"
 
-  # If you are connecting to Solr 1.x, you need to set
-  # for SolrJ compatibility:
-  # provide "solrj_writer.parser_class_name", "XMLResponseParser"
-
   # solr.version doesn't currently do anything, but set it
   # anyway, in the future it will warn you if you have settings
   # that may not work with your version.
@@ -87,13 +89,11 @@ settings do
   provide "marc_source.type", "xml"
 
   # various others...
-  provide "solrj_writer.commit_on_close", "true"
+  provide "solr_writer.commit_on_close", "true"
 
-  # By default, we use the Traject::MarcReader
-  # One altenrnative is the Marc4JReader, using Marc4J.
-  # provide "reader_class_name", "Traject::Marc4Reader"
-  # If we're reading binary MARC, it's best to tell it the encoding.
-  provide "marc4j_reader.source_encoding", "MARC-8" # or 'UTF-8' or 'ISO-8859-1' or whatever.
+  # The default writer is the Traject::SolrJsonWriter. The default
+  # reader is Marc4JReader (using Java Marc4J library) on Jruby,
+  # MarcReader (using ruby-marc) otherwise.
 end
 ~~~
 
@@ -105,17 +105,17 @@ See, docs page on [Settings](./doc/settings.md) for list
 of all standardized settings.
 
 
-## Indexing rules: Let's start with `to_field` and `extract_marc`
+## Indexing rules: Let's start with 'to_field' and 'extract_marc'
 
 There are a few methods that can be used to create indexing rules, but the
 one you'll most common is called `to_field`, and establishes a rule
-to extract content to a particular named output field.
+to extract content to a particular named output field.
 
-The extraction rule can use built-in 'macros', or, as we'll see later,
-entirely custom logic.
+A `to_field` extraction rule can use built-in 'macros', or, as we'll see later,
+entirely custom logic.
 
 The built-in macro you'll use the most is `extract_marc`, to extract
-data out of a MARC record according to a tag/subfield specification.
+data out of a MARC record according to a tag/subfield specification.
 
 ~~~ruby
 # Take the value of the first 001 field, and put
@@ -128,7 +128,7 @@ data out of a MARC record according to a tag/subfield specification.
 to_field "title_t", extract_marc("245nps:130", :trim_punctuation => true)
 
 # Can limit to certain indicators with || chars.
-# "*" is a wildcard in indicator spec. So
+# "*" is a wildcard in indicator spec. So this is
 # 856 with first indicator '0', subfield u.
 to_field "email_addresses", extract_marc("856|0*|u")
 
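The indicator-delimited spec in the hunk above has three parts: a tag, an indicator constraint between `|` characters, and a list of subfield codes. A rough plain-ruby sketch of that decomposition (the regex is hypothetical; traject's real parsing lives in `Traject::MarcExtractor` and handles much more):

```ruby
# Hypothetical decomposition of an extract_marc spec string; traject's
# actual parser (Traject::MarcExtractor) is more thorough than this regex.
spec = "856|0*|u"
tag, indicators, subfields = spec.match(/\A(\d{3})(?:\|(..)\|)?(.*)\z/).captures

tag        # "856" -- the MARC field tag
indicators # "0*"  -- first indicator '0', '*' wildcard for the second
subfields  # "u"   -- subfield code(s) to extract
```

A spec without indicators, like `245abc`, leaves the middle capture `nil` and puts all remaining characters in the subfield list.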
@@ -137,20 +137,20 @@ data out of a MARC record according to a tag/subfield specification.
 to_field "isbn", extract_marc("245a:245abcde")
 
 # For MARC Control ('fixed') fields, you can optionally
-# use square brackets to take a byte offset.
+# use square brackets to take a byte offset.
 to_field "langauge_code", extract_marc("008[35-37]")
 ~~~
 
 `extract_marc` by default includes all 'alternate script' linked fields correspoinding
 to matched specifications, but you can turn that off, or extract *only* corresponding
-880s.
+880s.
 
 ~~~ruby
 to_field "title", extract_marc("245abc", :alternate_script => false)
 to_field "title_vernacular", extract_marc("245abc", :alternate_script => :only)
 ~~~
 
-By default, specifications with multiple subfields (like "240abc") will produce one single string of output for each matching field. Specifications with single subfields (like "020a") will split subfields and produce an output string for each matching subfield.
+By default, specifications with multiple subfields (like "240abc") will produce one single string of output for each matching field. Specifications with single subfields (like "020a") will split subfields and produce an output string for each matching subfield.
 
 For the syntax and complete possibilities of the specification
 string argument to extract_marc, see docs at the [MarcExtractor class](./lib/traject/marc_extractor.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/MarcExtractor)).
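The `008[35-37]` byte-offset form shown in the hunk above is, conceptually, an inclusive string slice over the control field's value. A minimal sketch (the padded string below is a stand-in, not a real 008 field):

```ruby
# Conceptual sketch: "008[35-37]" takes bytes 35 through 37 (inclusive)
# of the control field value. The value below is a fake 008, padded so
# that the language code 'eng' sits at byte 35.
control_field_value = (" " * 35) + "eng"
language_code = control_field_value[35..37]
language_code # "eng"
```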
@@ -199,7 +199,7 @@ All of these methods are defined at [Traject::Macros::Marc21](./lib/traject/macr
 
 Some more complex (and opinionated/subjective) algorithms for deriving semantics
 from Marc are also packaged with Traject, but not available by default. To make
-them available to your indexing, you just need to use ruby `require` and `extend`.
+them available to your indexing, you just need to use ruby `require` and `extend`.
 
 A number of methods are in [Traject::Macros::Marc21Semantics](./lib/traject/macros/marc21_semantics.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Macros/Marc21Semantics))
 
@@ -223,6 +223,9 @@ format/genre/type vocabulary:
 to_field 'format_facet', marc_formats
 ~~~
 
+(Alternately, see the [traject_umich_format](https://github.com/billdueber/traject_umich_format) gem for the often-ridiculously-complex
+logic used at the University of Michigan.)
+
 ## Custom logic
 
 The built-in routines are there for your convenience, but if you need
@@ -240,12 +243,12 @@ in a configuration file, using a ruby block, which looks like this:
 end
 ~~~
 
-`do |record, accumulator|` is the definition of a ruby block taking
+`do |record, accumulator| ... ` is the definition of a ruby block taking
 two arguments. The first one passed in will be a MARC record. The
 second is an array, you add values to the array to send them to
-output.
+output.
 
-Here's a more realistic example that shows how you'd get the
+Here's another example that shows how you'd get the
 record type byte 06 out of a MARC leader, then translate it
 to a human-readable string with a TranslationMap
 
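The leader-byte-06 step described in the hunk above amounts to a one-character lookup. A plain-ruby sketch, with a hash standing in for a real `Traject::TranslationMap` (the sample leader and the two labels are illustrative):

```ruby
# Sketch of the described step: take byte 06 of the MARC leader and
# translate it to a label. RECORD_TYPES is a stand-in for a real
# Traject::TranslationMap; only two codes are shown.
RECORD_TYPES = {
  "a" => "Language material",
  "e" => "Cartographic material",
}

leader = "00714cam a2200205 a 4500"  # sample leader; byte 06 is 'a'
accumulator = []                     # plays the role of to_field's accumulator
accumulator << RECORD_TYPES[leader[6]]
accumulator # ["Language material"]
```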
@@ -257,37 +260,34 @@ to a human-readable string with a TranslationMap
 end
 ~~~
 
-You can also add a block onto the end of a built-in 'macro', to
+You can also add a block onto the end of a built-in 'macro', to
 further customize the output. The `accumulator` passed to your block
 will already have values in it from the first step, and you can
 use ruby methods like `map!` to modify it:
 
 ~~~ruby
 to_field "big_title", extract_marc("245abcdefg") do |record, accumulator|
-  # put it all in all uppercase, I don't know why.
+  # put it all in all uppercase, I don't know why.
   accumulator.map! {|v| v.upcase}
 end
 ~~~
 
-There are many more things you can do with custom logic blocks like this too,
-including additional features we haven't discussed yet.
-
 If you find yourself repeating boilerplate code in your custom logic, you can
 even create your own 'macros' (like `extract_marc`). `extract_marc` and other
 macros are nothing more than methods that return ruby lambda objects of
-the same format as the blocks you write for custom logic.
+the same format as the blocks you write for custom logic.
 
 For tips, gotchas, and a more complete explanation of how this works, see
 additional documentation page on [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md)
 
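The "macros are methods returning lambdas" point above fits in a few lines of plain ruby. `constant_field` here is invented for illustration; it is not a traject built-in:

```ruby
# A 'macro' is just a method that returns a lambda with the same
# (record, accumulator) signature as a hand-written custom logic block.
# `constant_field` is made up for this sketch, not shipped with traject.
def constant_field(value)
  lambda do |record, accumulator|
    accumulator << value
  end
end

step = constant_field("everything")
accumulator = []
step.call(nil, accumulator)  # this macro ignores the record argument
accumulator # ["everything"]
```

In a config file you would then write `to_field "topic", constant_field("everything")`, exactly as you would use `extract_marc`.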
 ## each_record and after_processing
 
-In addition to `to_field`, an `each_record` method is available, which,
+In addition to `to_field`, an `each_record` method is available, which,
 like `to_field`, is executed for every record, but without being tied
-to a specific field.
+to a specific field.
 
 `each_record` can be used for logging or notifiying; computing intermediate
-results; or writing to more than one field at once.
+results; or writing to more than one field at once.
 
 ~~~ruby
 each_record do |record|
@@ -303,26 +303,33 @@ ruby code you might want for your app (send an email? Clean up a log file? Trigg
 a Solr replication?)
 
 ~~~ruby
-after_processing do
+after_processing do
   whatever_ruby_code
 end
 ~~~
 
 
-## Writers
+## Readers and Writers
 
 Traject uses modular 'Writer' classes to take the output hashes from transformation, and
-send them somewhere or do something useful with them.
+send them somewhere or do something useful with them.
 
-By default traject uses the [Traject::SolrJWriter](lib/traject/solrj_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/SolrJWriter)) to send to Solr for indexing.
-A couple other writers are available too, mostly for debugging purposes:
-[Traject::DebugWriter](lib/traject/debug_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/DebugWriter))
-and [Traject::JsonWriter](lib/traject/json_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/JsonWriter))
+By default traject uses the [Traject::SolrJsonWriter](lib/traject/solr_json_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/SolrJsonWriter)) to send to Solr for indexing.
+Several other writers are also built-in:
+* [Traject::DebugWriter](lib/traject/debug_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/DebugWriter))
+* [Traject::JsonWriter](lib/traject/json_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/JsonWriter))
+* [Traject::YamlWriter](lib/traject/yaml_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/YamlWriter))
+* [Traject::DelimitedWriter](lib/traject/delimited_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/DelimitedWriter))
+* [Traject::CSVWriter](lib/traject/csv_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/CSVWriter))
 
 You set which writer is being used in settings (`provide "writer_class_name", "Traject::DebugWriter"`),
-or on the command-line as a shortcut with `-w Traject::DebugWriter`.
+or with the shortcut command line argument `-w Traject::DebugWriter`.
+
+The [SolrJWriter](https://github.com/traject-project/traject-solrj_writer) is packaged separately,
+and will be useful if you need to index to Solr's older than version 3.2. It requires Jruby.
+
+You can easily write your own Readers and Writers if you'd like, see comments at top
 
-You can write your own Readers and Writers if you'd like, see comments at top
 of [Traject::Indexer](lib/traject/indexer.rb).
 
 ## The traject command Line
@@ -331,13 +338,13 @@ The simplest invocation is:
 
     traject -c conf_file.rb marc_file.mrc
 
-Traject assumes marc files are in ISO 2709 binary format; it is not
-currently able to guess marc format type from filenames. If you are reading
+Traject assumes marc files are in ISO 2709 MARC 'binary' format; it is not
+currently able to guess other marc format types like XML from filenames or content. If you are reading
 marc files in another format, you need to tell traject either with the `marc_source.type` or the command-line shortcut:
 
     traject -c conf.rb -t xml marc_file.xml
 
-You can supply more than one conf file with repeated `-c` arguments.
+You can supply more than one conf file to traject with repeated `-c` arguments.
 
     traject -c connection_conf.rb -c indexing_conf.rb marc_file.mrc
 
@@ -349,7 +356,7 @@ You can only supply one marc file at a time, but we can take advantage of stdin
 
 You can set any setting on the command line with `-s key=value`.
 This will over-ride any settings set with `provide` in conf files.
 
-    traject -c conf_file.rb marc_file -s solr.url=http://somehere/solr -s solr.url=http://example.com/solr -s solrj_writer.commit_on_close=true
+    traject -c conf_file.rb marc_file -s solr.url=http://somehere/solr -s solrj_writer.commit_on_close=true
 
 There are some built-in command-line option shortcuts for useful
 settings:
@@ -363,8 +370,8 @@ debugging or sanity checking.
 Use `-u` as a shortcut for `s solr.url=X`
 
     traject -c conf_file.rb -u http://example.com/solr marc_file.mrc
-
-Run `traject -h` to see the command line help screen listing all available options.
+
+Run `traject -h` to see the command line help screen listing all available options.
 
 Also see `-I load_path` option and suggestions for Bundler use under Extending With Your Own Code.
 
@@ -399,7 +406,7 @@ Own Code](./doc/extending.md)
   "./translation_maps" subdir on the load path will be found
   for Traject translation maps.
 * Use [Bundler](http://bundler.io/) with traject simply by creating a Gemfile with `bundler init`,
-  and then running command line with `bundle exec traject` or
+  and then running command line with `bundle exec traject` or
   even `BUNDLE_GEMFILE=path/to/Gemfile bundle exec traject`
 
 ## More
@@ -410,7 +417,9 @@ Own Code](./doc/extending.md)
 * [traject_alephsequential_reader](https://github.com/traject-project/traject_alephsequential_reader/): read MARC files serialized in the AlephSequential format, as output by Ex Libris's Alpeh ILS.
 * [traject_horizon](https://github.com/jrochkind/traject_horizon): Export MARC records directly from a Horizon ILS rdbms, as serialized MARC or to index into Solr.
 * [traject_umich_format](https://github.com/billdueber/traject_umich_format/): opinionated code and associated macros to extract format (book, audio file, etc.) and types (bibliography, conference report, etc.) from a MARC record. Code mirrors that used by the University of Michigan, and is an alternate approach to that taken by the `marc_formats` macro in `Traject::Macros::MarcFormatClassifier`.
-
+* [traject-solrj_writer](https://github.com/traject-project/traject-solrj_writer): a jruby-only writer that uses the solrj .jar to talk directly to solr. Your only option for speaking to a solr version < 3.2, which is when the json handler was added to solr.
+* [traject_marc4j_reader](https://github.com/billdueber/traject_marc4j_reader): Packaged with traject automatically on jruby. A JRuby-only reader for
+  reading marc records using the Marc4J library, fastest MARC reading on JRuby.
 
 # Development
 
@@ -430,12 +439,15 @@ Pull requests should come with tests, as well as docs where applicable. Docs can
 and/or extra files in ./docs -- as appropriate for what needs to be docs.
 
 **Inline api docs** Note that our [`.yardopts` file](./.yardopts) used by rdoc.info to generate
-online api docs has a `--markup markdown` specified -- inline class/method docs are in markdown, not rdoc.
+online api docs has a `--markup markdown` specified -- inline class/method docs are in markdown, not rdoc.
 
 Bundler rake tasks included for gem releases: `rake release`
 
 ## TODO
 
+* Readers and index rules helpers for reading XML files as input? Maybe.
+
+* Writers for writing to stores other than Solr? ElasticSearch? Maybe.
 
 * Unicode normalization. Has to normalize to NFKC on way out to index. Except for serialized marc field and other exceptions? Except maybe don't have to, rely on solr analyzer to do it?
 
data/doc/batch_execution.md CHANGED
@@ -8,15 +8,11 @@ with suggested solutions, and additional hints.
 
 ## Ruby version setting
 
-traject ordinarily needs to run under jruby. You will
+For best performance, traject should run under jruby. You will
 ordinarily have jruby installed under a ruby version switcher -- we
-highly recommend [chruby](https://github.com/postmodern/chruby) over other choices,
+recommend [chruby](https://github.com/postmodern/chruby) over other choices,
 but other popular choices include rvm and rbenv.
 
-Remember that traject needs to run in 1.9.x mode in jruby--
-with jruby 1.7.x or later, this should be default, recommend
-you use jruby 1.7.x.
-
 Especially when running under a cron job, it can be difficult to
 set things up so traject runs under jruby -- and then when you add
 bundler into it, things can get positively byzantine. It's not you,
data/doc/other_commands.md CHANGED
@@ -38,12 +38,10 @@ If set to true, then oversized MARC records can still be serialized,
 with length bytes zero'd out -- technically illegal, but can
 be read by MARC::Reader in permissive mode.
 
-As the standard Marc4JReader always convert to UTF8,
-output will always be in UTF8. For standard readeres, you
-do need to set the `marc_source.type` setting to XML for xml input
-using the standard MARC readers.
+If you have MARC-XML *input*, you need to
+set the `marc_source.type` setting to XML for xml input.
 
 ~~~bash
 traject -x marcout somefile.marc -o output.xml -s marcout.type=xml
 traject -x marcout -s marc_source.type=xml somefile.xml -c configuration.rb
-~~~
+~~~
data/doc/settings.md CHANGED
@@ -5,7 +5,7 @@ Hash, not nested. Keys are always strings, and dots (".") can be
 used for grouping and namespacing.
 
 Values are usually strings, but occasionally something else. String values can be easily
-set via the command line.
+set via the command line.
 
 Settings can be set in configuration files, usually like:
 
@@ -16,24 +16,24 @@ end
 ~~~~
 
 or on the command line: `-s key=value`. There are also some command line shortcuts
-for commonly used settings, see `traject -h`.
+for commonly used settings, see `traject -h`.
 
-`provide` will only set the key if it was previously unset, so first time to set 'wins'. And command-line
-settings are applied first of all. It's recommended you use `provide`.
+`provide` will only set the key if it was previously unset, so first time to set 'wins'. And command-line
+settings are applied first of all. It's recommended you use `provide`.
 
-`store` is also available, and forces setting of the new value overriding any previous value set.
+`store` is also available, and forces setting of the new value overriding any previous value set.
 
 ## Known settings
 
 * `debug_ascii_progress`: true/'true' to print ascii characters to STDERR indicating progress. Note,
-  yes, this is fixed to STDERR, regardless of your logging setup.
+  yes, this is fixed to STDERR, regardless of your logging setup.
   * `.` for every batch of records read and parsed
   * `^` for every batch of records batched and queued for adding to solr
     (possibly in thread pool)
   * `%` for completing of a Solr 'add'
   * `!` when threadpool for solr add has a full queue, so solr add is
     going to happen in calling queue -- means solr adding can't
-    keep up with production.
+    keep up with production.
 
 * `json_writer.pretty_print`: used by the JsonWriter, if set to true, will output pretty printed json (with added whitespace) for easier human readability. Default false.
 
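The `provide`/`store` semantics described in the hunk above can be modeled with a small hash wrapper. This is a sketch only; the real class is `Traject::Indexer::Settings` and does considerably more:

```ruby
# Sketch of provide vs store semantics over a plain hash.
class SettingsSketch
  def initialize
    @hash = {}
  end

  # first setter wins: no-op if the key is already set
  def provide(key, value)
    @hash[key] = value unless @hash.key?(key)
  end

  # always sets, overriding any previous value
  def store(key, value)
    @hash[key] = value
  end

  def [](key)
    @hash[key]
  end
end

settings = SettingsSketch.new
settings.store("solr.url", "http://cli.example.org/solr")     # e.g. applied first, from -s
settings.provide("solr.url", "http://conf.example.org/solr")  # ignored: key already set
settings["solr.url"] # "http://cli.example.org/solr"
```

Because command-line settings are applied before config files run, `provide` in a config file naturally yields to `-s key=value` overrides, which is why `provide` is the recommended call.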
@@ -50,18 +50,10 @@ settings are applied first of all. It's recommended you use `provide`.
 * `log.batch_size`: If set to a number N (or string representation), will output a progress line to
   log. (by default as INFO, but see log.batch_size.severity)
 
-* `log.batch_size.severity`: If `log.batch_size` is set, what logger severity level to log to. Default "INFO", set to "DEBUG" etc if desired.
+* `log.batch_size.severity`: If `log.batch_size` is set, what logger severity level to log to. Default "INFO", set to "DEBUG" etc if desired.
 
 * `marc_source.type`: default 'binary'. Can also set to 'xml' or (not yet implemented todo) 'json'. Command line shortcut `-t`
 
-* `marc4j.jar_dir`: Path to a directory containing Marc4J jar file to use. All .jar's in dir will
-  be loaded. If unset, uses marc4j.jar bundled with traject.
-
-* `marc4j_reader.permissive`: Used by Marc4JReader only when marc.source_type is 'binary', boolean, argument to the underlying MarcPermissiveStreamReader. Default true.
-
-* `marc4j_reader.source_encoding`: Used by Marc4JReader only when marc.source_type is 'binary', encoding strings accepted
-  by marc4j MarcPermissiveStreamReader. Default "BESTGUESS", also "UTF-8", "MARC"
-
 * `marcout.allow_oversized`: Used with `-x marcout` command to output marc when outputting
   as ISO 2709 binary, set to true or string "true", and the MARC::Writer will have
   allow_oversized=true set, allowing oversized records to be serialized with length
@@ -69,44 +61,41 @@ settings are applied first of all. It's recommended you use `provide`.

  * `output_file`: Output file to write to for operations that write to files: For instance the `marcout` command,
  or Writer classes that write to files, like Traject::JsonWriter. Has a shortcut
- `-o` on command line.
+ `-o` on command line.

- * `processing_thread_pool` Default 3. Main thread pool used for processing records with input rules. Choose a
- pool size based on size of your machine, and complexity of your indexing rules.
- Probably no reason for it ever to be more than number of cores on indexing machine.
- But this is the first thread_pool to try increasing for better performance on a multi-core machine.
-
- A pool here can sometimes result in multi-threaded commiting to Solr too with the
- SolrJWriter, as processing worker threads will do their own commits to solr if the
- solrj_writer.thread_pool is full. Having a multi-threaded pool here can help even out throughput
- through Solr's pauses for committing too.
+ * `processing_thread_pool`: Number of threads in the main thread pool used for processing
+ records with input rules. On JRuby or Rubinius, defaults to 1 less than the number of processors detected on your machine. On other ruby platforms, defaults to 1. Set to 0 or nil
+ to disable thread pool, and do all processing in main thread.

- * `reader_class_name`: a Traject Reader class, used by the indexer as a source of records. Default Traject::Marc4jReader. If you don't need to read marc binary with Marc8 encoding, the pure ruby MarcReader may give you better performance. Command-line shortcut `-r`
+ Choose a pool size based on size of your machine, and complexity of your indexing rules; you
+ might want to try different sizes and measure which works best for you.
+ Probably no reason for it ever to be more than number of cores on indexing machine.

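As a sketch of tuning the thread-pool settings described above in a traject configuration file (the values here are illustrative, not recommendations; `settings`/`provide` are the standard traject configuration calls):

```ruby
# Illustrative thread-pool tuning in a traject config file.
settings do
  # Illustrative value; the default depends on your ruby platform.
  provide "processing_thread_pool", 3
  # Writer threads mostly wait on Solr, so more than one can help
  # even when cores are scarce.
  provide "solr_writer.thread_pool", 2
end
```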
- * `solr.url`: URL to connect to a solr instance for indexing, eg http://example.org:8983/solr . Command-line short-cut `-u`.

- * `solrj.jar_dir`: SolrJWriter needs to load Java .jar files with SolrJ. It will load from a packaged SolrJ, but you can load your own SolrJ (different version etc) by specifying a directory. All *.jar in directory will be loaded.
+ * `reader_class_name`: a Traject Reader class, used by the indexer as a source
+ of records. Defaults to Traject::Marc4JReader (using the Java Marc4J
+ library) on JRuby; Traject::MarcReader (using the ruby marc gem) otherwise.
+ Command-line shortcut `-r`
+
+ * `solr.url`: URL to connect to a solr instance for indexing, eg http://example.org:8983/solr . Command-line short-cut `-u`.

  * `solr.version`: Set to eg "1.4.0", "4.3.0"; currently un-used, but in the future will control
  change some default settings, and/or sanity check and warn you if you're doing something
  that might not work with that version of solr. Set now for help in the future.

- * `solrj_writer.batch_size`: size of batches that SolrJWriter will send docs to Solr in. Default 200. Set to nil,
- 0, or 1, and SolrJWriter will do one http transaction per document, no batching.
-
- * `solrj_writer.commit_on_close`: default false, set to true to have SolrJWriter send an explicit commit message to Solr after indexing.
+ * `solr_writer.batch_size`: size of batches that SolrJsonWriter will send docs to Solr in. Default 100. Set to nil,
+ 0, or 1, and SolrJsonWriter will do one http transaction per document, no batching.

- * `solrj_writer.parser_class_name`: Set to "XMLResponseParser" or "BinaryResponseParser". Will be instantiated and passed to the solrj.SolrServer with setResponseParser. Default nil, use SolrServer default. To talk to a solr 1.x, you will want to set to "XMLResponseParser"
+ * `solr_writer.commit_on_close`: default false, set to true to have the solr writer send an explicit commit message to Solr after indexing.

- * `solrj_writer.server_class_name`: String name of a solrj.SolrServer subclass to be used by SolrJWriter. Default "HttpSolrServer"

- * `solrj_writer.thread_pool`: Defaults to 1 (single bg thread). A thread pool is used for submitting docs
+ * `solr_writer.thread_pool`: Defaults to 1 (single bg thread). A thread pool is used for submitting docs
  to solr. Set to 0 or nil to disable threading. Set to 1,
  there will still be a single bg thread doing the adds.
  May make sense to set higher than number of cores on your
  indexing machine, as these threads will mostly be waiting
  on Solr. Speed/capacity of your solr might be more relevant.
  Note that processing_thread_pool threads can end up submitting
- to solr too, if solrj_writer.thread_pool is full.
+ to solr too, if solr_writer.thread_pool is full.

- * `writer_class_name`: a Traject Writer class, used by indexer to send processed dictionaries off. Default Traject::SolrJWriter, also available Traject::JsonWriter. See Traject::Indexer for more info. Command line shortcut `-w`
+ * `writer_class_name`: a Traject Writer class, used by indexer to send processed dictionaries off. Default Traject::SolrJsonWriter, other writers for debugging or writing to files are also available. See Traject::Indexer for more info. Command line shortcut `-w`
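For reference, many of the settings above can also be supplied on the command line, either through the shortcuts mentioned (`-u`, `-w`, `-r`, `-t`, `-o`) or with `-s name=value`. A hypothetical invocation (the config file name, Solr URL, and input file are placeholders):

```shell
# Hypothetical invocation: index a binary MARC file to a local Solr,
# with a config file and individual setting overrides.
traject -c conf.rb \
  -u http://localhost:8983/solr/collection1 \
  -s solr_writer.batch_size=100 \
  -s solr_writer.commit_on_close=true \
  records.mrc
```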