traject 1.1.0 → 2.0.0.rc.1
- checksums.yaml +4 -4
- data/.travis.yml +20 -0
- data/README.md +85 -73
- data/doc/batch_execution.md +2 -6
- data/doc/other_commands.md +3 -5
- data/doc/settings.md +27 -38
- data/lib/traject/command_line.rb +1 -1
- data/lib/traject/csv_writer.rb +34 -0
- data/lib/traject/delimited_writer.rb +110 -0
- data/lib/traject/indexer.rb +29 -11
- data/lib/traject/indexer/settings.rb +39 -13
- data/lib/traject/line_writer.rb +10 -6
- data/lib/traject/marc_reader.rb +2 -1
- data/lib/traject/solr_json_writer.rb +277 -0
- data/lib/traject/thread_pool.rb +38 -48
- data/lib/traject/translation_map.rb +3 -0
- data/lib/traject/util.rb +13 -51
- data/lib/traject/version.rb +1 -1
- data/lib/translation_maps/marc_geographic.yaml +2 -2
- data/test/delimited_writer_test.rb +104 -0
- data/test/indexer/read_write_test.rb +0 -22
- data/test/indexer/settings_test.rb +24 -0
- data/test/solr_json_writer_test.rb +248 -0
- data/test/test_helper.rb +5 -3
- data/test/test_support/demo_config.rb +0 -5
- data/test/translation_map_test.rb +9 -0
- data/traject.gemspec +18 -5
- metadata +77 -87
- data/lib/traject/marc4j_reader.rb +0 -153
- data/lib/traject/solrj_writer.rb +0 -351
- data/test/marc4j_reader_test.rb +0 -136
- data/test/solrj_writer_test.rb +0 -209
- data/vendor/solrj/README +0 -8
- data/vendor/solrj/build.xml +0 -39
- data/vendor/solrj/ivy.xml +0 -16
- data/vendor/solrj/lib/commons-codec-1.7.jar +0 -0
- data/vendor/solrj/lib/commons-io-2.1.jar +0 -0
- data/vendor/solrj/lib/httpclient-4.2.3.jar +0 -0
- data/vendor/solrj/lib/httpcore-4.2.2.jar +0 -0
- data/vendor/solrj/lib/httpmime-4.2.3.jar +0 -0
- data/vendor/solrj/lib/jcl-over-slf4j-1.6.6.jar +0 -0
- data/vendor/solrj/lib/jul-to-slf4j-1.6.6.jar +0 -0
- data/vendor/solrj/lib/log4j-1.2.16.jar +0 -0
- data/vendor/solrj/lib/noggit-0.5.jar +0 -0
- data/vendor/solrj/lib/slf4j-api-1.6.6.jar +0 -0
- data/vendor/solrj/lib/slf4j-log4j12-1.6.6.jar +0 -0
- data/vendor/solrj/lib/solr-solrj-4.3.1-javadoc.jar +0 -0
- data/vendor/solrj/lib/solr-solrj-4.3.1-sources.jar +0 -0
- data/vendor/solrj/lib/solr-solrj-4.3.1.jar +0 -0
- data/vendor/solrj/lib/wstx-asl-3.2.7.jar +0 -0
- data/vendor/solrj/lib/zookeeper-3.4.5.jar +0 -0
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 1e875abe713a3200de4fb424c9570a3b89d5ddc0
+  data.tar.gz: cbb7b0f4fd9bb293af55afff48b52f92ce2b4dfb
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: b7f911f43784275b0a7782e642788bb57a492c1660b9e3a461bf9b03c96882b887dd62e18f385ee9ffb3488c60afda44fb77ea9fc619cd4f3061c471b5bc7227
+  data.tar.gz: 105737429ce7778ae57a3182671fa9c41f72d17579b82f80b38a8ec373b634c1d22394e17c404e23961b5d2f557442e56d69700aaad3091e950d36aa2dc050c5
data/.travis.yml
CHANGED
@@ -1,7 +1,27 @@
 language: ruby
 rvm:
   - jruby-19mode
+  - jruby-head
+  - 1.9
+  - 2.1
+  - 2.2
+  - rbx-2
 jdk:
   - openjdk7
   - openjdk6
+matrix:
+  exclude:
+    - rvm: 1.9
+      jdk: openjdk7
+    - rvm: 2.1
+      jdk: openjdk7
+    - rvm: rbx-2
+      jdk: openjdk7
+    - rvm: jruby-head
+      jdk: openjdk6
+    - rvm: 2.2
+      jdk: openjdk6
+  allow_failures:
+    - rvm: jruby-head
+
 bundler_args: --without debug
data/README.md
CHANGED
@@ -1,10 +1,12 @@
 # Traject

-
-Might be used to index MARC data for a Solr-based discovery product like [Blacklight](https://github.com/projectblacklight/blacklight) or [VUFind](http://vufind.org/).
+An easy to use, high-performance, flexible and extensible MARC to Solr indexer.

-
+You might use traject to index MARC data for a Solr-based discovery product like [Blacklight](https://github.com/projectblacklight/blacklight) or [VUFind](http://vufind.org/).

+Traject can also be generalized to a set of tools for getting structured data from a source, and transforming it to a hash-like object to send to a destination. In addition to sending data
+to solr, Traject can produce json or yaml files, tab-delimited files, CSV files, and output suitable
+for debugging by a human.

 **Traject is stable, mature software, that is already being used in production by its authors.**

@@ -14,42 +16,46 @@ Traject might also be generalized to a set of tools for getting structured data

 ## Background/Goals

-Initially by Jonathan Rochkind (Johns Hopkins Libraries) and Bill Dueber (University of Michigan Libraries).
+Initially by Jonathan Rochkind (Johns Hopkins Libraries) and Bill Dueber (University of Michigan Libraries).

-
+* Basic configuration files can be easily written even by non-rubyists, with a few simple directives traject provides. But config files are 'ruby all the way down', so we can provide a gradual slope to more complex needs, with the full power of ruby.
+* Easy to program, easy to read, easy to modify.
+* Fast. Traject by default indexes using multiple threads, on multiple cpu cores, when the underlying
+ruby implementation (i.e., JRuby) allows it, and can use a separate thread for communication with
+solr even under MRI.
+* Composed of decoupled components, for flexibility and extensibility.
+* Designed to support local code and configuration that's maintainable and testable, and can be shared between projects as ruby gems.
+* Easy to split configuration between multiple files, for simple "pick-and-choose" command line options
+that can combine to deal with any of your local needs.

-We're comfortable programming (especially in a dynamic language), and want to be able to experiment with different indexing patterns quickly, easily, and testably; but are admittedly less comfortable in Java. In order to have a tool with the API's and usage patterns convenient for us, we found we could do it better in JRuby -- Ruby on the JVM.

-
-* Easy to program, easy to read, easy to modify.
-* Fast. Traject by default indexes using multiple threads, on multiple cpu cores.
-* Composed of decoupled components, for flexibility and extensibility. The whole code base is only 6400 lines of code, more than a third of which is tests.
-* Designed to support local code and configuration that's maintainable and testable, an can be shared between projects as ruby gems.
-* Designed with batch execution in mind: flexible logging, good exit codes, good use of stdin/stdout/stderr.
+## Installation

+Traject runs under MRI ruby (1.9 through 2.2), jruby 1.7.x, or rubinius.

-
+For high-volume indexing in production, traject performs **much** better when run with **JRuby** (ruby on the JVM).
+Standard MRI ruby can't use multiple CPU cores at once, but on JRuby traject can use
+multiple cores for much better performance.

-
-and supported for ruby 1.9 -- recent versions of jruby should run under 1.9 mode by default).
+Some options for installing a ruby other than your system-provided one are [chruby](https://github.com/postmodern/chruby) and [ruby-install](https://github.com/postmodern/ruby-install#readme).

-
+Once you have ruby, just `$ gem install traject`.

-( **Note**: We
+( **Note**: We might in the future provide an all-in-one .jar distribution, which does not require you to install jruby on your system, for those who want the multi-threading of jruby without having to actually install it. Let us know if interested.).


 ## Configuration files

 traject is configured using configuration files. To get a sense of what they look like, you can
-take a look at our sample
-[demo_config.rb](./test/test_support/demo_config.rb)
-`traject -c path/to/demo_config.rb marc_file.marc`.
+take a look at our sample basic configuration file,
+[demo_config.rb](./test/test_support/demo_config.rb). You could run traject with that configuration file
+as: `traject -c path/to/demo_config.rb marc_file.marc`.

 Configuration files are actually just ruby -- so by convention they end in `.rb`.

-We hope you can write basic useful configuration files without
-traject gives you some easy functions to use for common
-of ruby is available to you if needed.
+We hope you can write basic useful configuration files without much ruby experience, since
+traject gives you some easy functions to use for common directives. But the full power
+of ruby is available to you if needed.

 **rubyist tip**: Technically, config files are executed with `instance_eval` in a Traject::Indexer instance, so the special commands you see are just methods on Traject::Indexer (or mixed into it). But you can
 call ordinary ruby `require` in config files, etc., too, to load
@@ -73,10 +79,6 @@ settings do
 # Where to find solr server to write to
 provide "solr.url", "http://example.org/solr"

-# If you are connecting to Solr 1.x, you need to set
-# for SolrJ compatibility:
-# provide "solrj_writer.parser_class_name", "XMLResponseParser"
-
 # solr.version doesn't currently do anything, but set it
 # anyway, in the future it will warn you if you have settings
 # that may not work with your version.
@@ -87,13 +89,11 @@ settings do
 provide "marc_source.type", "xml"

 # various others...
-provide "
+provide "solr_writer.commit_on_close", "true"

-#
-#
-#
-# If we're reading binary MARC, it's best to tell it the encoding.
-provide "marc4j_reader.source_encoding", "MARC-8" # or 'UTF-8' or 'ISO-8859-1' or whatever.
+# The default writer is the Traject::SolrJsonWriter. The default
+# reader is Marc4JReader (using Java Marc4J library) on Jruby,
+# MarcReader (using ruby-marc) otherwise.
 end
 ~~~

@@ -105,17 +105,17 @@ See, docs page on [Settings](./doc/settings.md) for list
 of all standardized settings.


-## Indexing rules: Let's start with
+## Indexing rules: Let's start with 'to_field' and 'extract_marc'

 There are a few methods that can be used to create indexing rules, but the
 one you'll most common is called `to_field`, and establishes a rule
-to extract content to a particular named output field.
+to extract content to a particular named output field.

-
-entirely custom logic.
+A `to_field` extraction rule can use built-in 'macros', or, as we'll see later,
+entirely custom logic.

 The built-in macro you'll use the most is `extract_marc`, to extract
-data out of a MARC record according to a tag/subfield specification.
+data out of a MARC record according to a tag/subfield specification.

 ~~~ruby
 # Take the value of the first 001 field, and put
@@ -128,7 +128,7 @@ data out of a MARC record according to a tag/subfield specification.
 to_field "title_t", extract_marc("245nps:130", :trim_punctuation => true)

 # Can limit to certain indicators with || chars.
-# "*" is a wildcard in indicator spec. So
+# "*" is a wildcard in indicator spec. So this is
 # 856 with first indicator '0', subfield u.
 to_field "email_addresses", extract_marc("856|0*|u")

@@ -137,20 +137,20 @@ data out of a MARC record according to a tag/subfield specification.
 to_field "isbn", extract_marc("245a:245abcde")

 # For MARC Control ('fixed') fields, you can optionally
-# use square brackets to take a byte offset.
+# use square brackets to take a byte offset.
 to_field "langauge_code", extract_marc("008[35-37]")
 ~~~

 `extract_marc` by default includes all 'alternate script' linked fields correspoinding
 to matched specifications, but you can turn that off, or extract *only* corresponding
-880s.
+880s.

 ~~~ruby
 to_field "title", extract_marc("245abc", :alternate_script => false)
 to_field "title_vernacular", extract_marc("245abc", :alternate_script => :only)
 ~~~

-By default, specifications with multiple subfields (like "240abc") will produce one single string of output for each matching field. Specifications with single subfields (like "020a") will split subfields and produce an output string for each matching subfield.
+By default, specifications with multiple subfields (like "240abc") will produce one single string of output for each matching field. Specifications with single subfields (like "020a") will split subfields and produce an output string for each matching subfield.

 For the syntax and complete possibilities of the specification
 string argument to extract_marc, see docs at the [MarcExtractor class](./lib/traject/marc_extractor.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/MarcExtractor)).
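Reviewer note: the specification strings in the two hunks above (`"245nps:130"`, `"856|0*|u"`, `"008[35-37]"`) follow a small grammar: a three-digit tag, optional `|ii|` indicators with `*` as a wildcard, an optional `[from-to]` byte range, then subfield codes, with `:` separating alternatives. The following is only a sketch of how such a string could be broken apart in plain Ruby. It is not the code in `Traject::MarcExtractor`, and `SPEC_PATTERN`, `parse_spec` and `wildcard_to_nil` are hypothetical names invented for this illustration.

```ruby
# Illustrative only: a hypothetical parser for the spec strings shown in the
# README hunks above. This is NOT Traject::MarcExtractor's implementation.
SPEC_PATTERN = /\A(?<tag>\d{3})(?:\|(?<ind1>.)(?<ind2>.)\|)?(?:\[(?<from>\d+)(?:-(?<to>\d+))?\])?(?<subfields>[a-z0-9]*)\z/

# A "*" indicator is a wildcard, represented here as nil (no constraint).
def wildcard_to_nil(indicator)
  indicator == "*" ? nil : indicator
end

# Splits on ":" and returns one Hash per alternative in the spec string.
def parse_spec(spec_string)
  spec_string.split(":").map do |element|
    m = SPEC_PATTERN.match(element)
    raise ArgumentError, "unparseable spec element: #{element.inspect}" unless m

    # A byte range is only present for control-field specs like "008[35-37]".
    bytes = m[:from] && (m[:from].to_i..(m[:to] || m[:from]).to_i)
    {
      tag:        m[:tag],
      indicator1: wildcard_to_nil(m[:ind1]),
      indicator2: wildcard_to_nil(m[:ind2]),
      bytes:      bytes,
      subfields:  m[:subfields].chars
    }
  end
end

p parse_spec("245nps:130")
p parse_spec("856|0*|u")
p parse_spec("008[35-37]")
```

Representing the wildcard as `nil` keeps a later matching step simple: a `nil` indicator means "do not filter on this position".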
@@ -199,7 +199,7 @@ All of these methods are defined at [Traject::Macros::Marc21](./lib/traject/macr

 Some more complex (and opinionated/subjective) algorithms for deriving semantics
 from Marc are also packaged with Traject, but not available by default. To make
-them available to your indexing, you just need to use ruby `require` and `extend`.
+them available to your indexing, you just need to use ruby `require` and `extend`.

 A number of methods are in [Traject::Macros::Marc21Semantics](./lib/traject/macros/marc21_semantics.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Macros/Marc21Semantics))

@@ -223,6 +223,9 @@ format/genre/type vocabulary:
 to_field 'format_facet', marc_formats
 ~~~

+(Alternately, see the [traject_umich_format](https://github.com/billdueber/traject_umich_format) gem for the often-ridiculously-complex
+logic used at the University of Michigan.)
+
 ## Custom logic

 The built-in routines are there for your convenience, but if you need
@@ -240,12 +243,12 @@ in a configuration file, using a ruby block, which looks like this:
 end
 ~~~

-`do |record, accumulator
+`do |record, accumulator| ... ` is the definition of a ruby block taking
 two arguments. The first one passed in will be a MARC record. The
 second is an array, you add values to the array to send them to
-output.
+output.

-Here's
+Here's another example that shows how you'd get the
 record type byte 06 out of a MARC leader, then translate it
 to a human-readable string with a TranslationMap

@@ -257,37 +260,34 @@ to a human-readable string with a TranslationMap
 end
 ~~~

-You can also add a block onto the end of a built-in 'macro', to
+You can also add a block onto the end of a built-in 'macro', to
 further customize the output. The `accumulator` passed to your block
 will already have values in it from the first step, and you can
 use ruby methods like `map!` to modify it:

 ~~~ruby
 to_field "big_title", extract_marc("245abcdefg") do |record, accumulator|
-# put it all in all uppercase, I don't know why.
+# put it all in all uppercase, I don't know why.
 accumulator.map! {|v| v.upcase}
 end
 ~~~

-There are many more things you can do with custom logic blocks like this too,
-including additional features we haven't discussed yet.
-
 If you find yourself repeating boilerplate code in your custom logic, you can
 even create your own 'macros' (like `extract_marc`). `extract_marc` and other
 macros are nothing more than methods that return ruby lambda objects of
-the same format as the blocks you write for custom logic.
+the same format as the blocks you write for custom logic.

 For tips, gotchas, and a more complete explanation of how this works, see
 additional documentation page on [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md)

 ## each_record and after_processing

-In addition to `to_field`, an `each_record` method is available, which,
+In addition to `to_field`, an `each_record` method is available, which,
 like `to_field`, is executed for every record, but without being tied
-to a specific field.
+to a specific field.

 `each_record` can be used for logging or notifiying; computing intermediate
-results; or writing to more than one field at once.
+results; or writing to more than one field at once.

 ~~~ruby
 each_record do |record|
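Reviewer note: the hunk above says that macros are "nothing more than methods that return ruby lambda objects" taking `(record, accumulator)`, and that a trailing block receives the accumulator already filled by the macro. A minimal plain-Ruby sketch of that idea follows. It does not use traject at all: `extract_key` and `run_rule` are hypothetical helpers, and a Hash stands in for the MARC record that traject would really pass.

```ruby
# Hypothetical stand-in for a macro: a method that returns a lambda taking
# (record, accumulator), the shape the README text describes.
# A plain Hash plays the part of a MARC record in this sketch.
def extract_key(key)
  lambda do |record, accumulator|
    accumulator.concat(Array(record[key]))
  end
end

# Runs the macro's lambda first, then an optional follow-up block on the same
# accumulator, imitating `to_field "x", some_macro do |rec, acc| ... end`.
def run_rule(record, macro, &block)
  accumulator = []
  macro.call(record, accumulator)
  block.call(record, accumulator) if block
  accumulator
end

record = { "title" => ["Moby Dick", "The Whale"] }

big_title = run_rule(record, extract_key("title")) do |_record, accumulator|
  # The accumulator already holds the macro's values; modify it in place.
  accumulator.map!(&:upcase)
end

p big_title
```

Mutating the accumulator in place (`map!`, `concat`) matters in this sketch because both steps share one array. Reassigning the block's local variable would not change what `run_rule` returns.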
@@ -303,26 +303,33 @@ ruby code you might want for your app (send an email? Clean up a log file? Trigg
 a Solr replication?)

 ~~~ruby
-after_processing do
+after_processing do
 whatever_ruby_code
 end
 ~~~


-## Writers
+## Readers and Writers

 Traject uses modular 'Writer' classes to take the output hashes from transformation, and
-send them somewhere or do something useful with them.
+send them somewhere or do something useful with them.

-By default traject uses the [Traject::
-
-[Traject::DebugWriter](lib/traject/debug_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/DebugWriter))
-
+By default traject uses the [Traject::SolrJsonWriter](lib/traject/solr_json_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/SolrJsonWriter)) to send to Solr for indexing.
+Several other writers are also built-in:
+* [Traject::DebugWriter](lib/traject/debug_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/DebugWriter))
+* [Traject::JsonWriter](lib/traject/json_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/JsonWriter))
+* [Traject::YamlWriter](lib/traject/yaml_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/YamlWriter))
+* [Traject::DelimitedWriter](lib/traject/delimited_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/DelimitedWriter))
+* [Traject::CSVWriter](lib/traject/csv_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/CSVWriter))

 You set which writer is being used in settings (`provide "writer_class_name", "Traject::DebugWriter"`),
-or
+or with the shortcut command line argument `-w Traject::DebugWriter`.
+
+The [SolrJWriter](https://github.com/traject-project/traject-solrj_writer) is packaged separately,
+and will be useful if you need to index to Solr's older than version 3.2. It requires Jruby.
+
+You can easily write your own Readers and Writers if you'd like, see comments at top

-You can write your own Readers and Writers if you'd like, see comments at top
 of [Traject::Indexer](lib/traject/indexer.rb).

 ## The traject command Line
@@ -331,13 +338,13 @@ The simplest invocation is:

 traject -c conf_file.rb marc_file.mrc

-Traject assumes marc files are in ISO 2709 binary format; it is not
-currently able to guess marc format
+Traject assumes marc files are in ISO 2709 MARC 'binary' format; it is not
+currently able to guess other marc format types like XML from filenames or content. If you are reading
 marc files in another format, you need to tell traject either with the `marc_source.type` or the command-line shortcut:

 traject -c conf.rb -t xml marc_file.xml

-You can supply more than one conf file with repeated `-c` arguments.
+You can supply more than one conf file to traject with repeated `-c` arguments.

 traject -c connection_conf.rb -c indexing_conf.rb marc_file.mrc

@@ -349,7 +356,7 @@ You can only supply one marc file at a time, but we can take advantage of stdin
 You can set any setting on the command line with `-s key=value`.
 This will over-ride any settings set with `provide` in conf files.

-traject -c conf_file.rb marc_file -s solr.url=http://somehere/solr -s
+traject -c conf_file.rb marc_file -s solr.url=http://somehere/solr -s solrj_writer.commit_on_close=true

 There are some built-in command-line option shortcuts for useful
 settings:
@@ -363,8 +370,8 @@ debugging or sanity checking.
 Use `-u` as a shortcut for `s solr.url=X`

 traject -c conf_file.rb -u http://example.com/solr marc_file.mrc
-
-Run `traject -h` to see the command line help screen listing all available options.
+
+Run `traject -h` to see the command line help screen listing all available options.

 Also see `-I load_path` option and suggestions for Bundler use under Extending With Your Own Code.

@@ -399,7 +406,7 @@ Own Code](./doc/extending.md)
 "./translation_maps" subdir on the load path will be found
 for Traject translation maps.
 * Use [Bundler](http://bundler.io/) with traject simply by creating a Gemfile with `bundler init`,
-and then running command line with `bundle exec traject` or
+and then running command line with `bundle exec traject` or
 even `BUNDLE_GEMFILE=path/to/Gemfile bundle exec traject`

 ## More
@@ -410,7 +417,9 @@ Own Code](./doc/extending.md)
 * [traject_alephsequential_reader](https://github.com/traject-project/traject_alephsequential_reader/): read MARC files serialized in the AlephSequential format, as output by Ex Libris's Alpeh ILS.
 * [traject_horizon](https://github.com/jrochkind/traject_horizon): Export MARC records directly from a Horizon ILS rdbms, as serialized MARC or to index into Solr.
 * [traject_umich_format](https://github.com/billdueber/traject_umich_format/): opinionated code and associated macros to extract format (book, audio file, etc.) and types (bibliography, conference report, etc.) from a MARC record. Code mirrors that used by the University of Michigan, and is an alternate approach to that taken by the `marc_formats` macro in `Traject::Macros::MarcFormatClassifier`.
-
+* [traject-solrj_writer](https://github.com/traject-project/traject-solrj_writer): a jruby-only writer that uses the solrj .jar to talk directly to solr. Your only option for speaking to a solr version < 3.2, which is when the json handler was added to solr.
+* [traject_marc4j_reader](https://github.com/billdueber/traject_marc4j_reader): Packaged with traject automatically on jruby. A JRuby-only reader for
+reading marc records using the Marc4J library, fastest MARC reading on JRuby.

 # Development

@@ -430,12 +439,15 @@ Pull requests should come with tests, as well as docs where applicable. Docs can
 and/or extra files in ./docs -- as appropriate for what needs to be docs.

 **Inline api docs** Note that our [`.yardopts` file](./.yardopts) used by rdoc.info to generate
-online api docs has a `--markup markdown` specified -- inline class/method docs are in markdown, not rdoc.
+online api docs has a `--markup markdown` specified -- inline class/method docs are in markdown, not rdoc.

 Bundler rake tasks included for gem releases: `rake release`

 ## TODO

+* Readers and index rules helpers for reading XML files as input? Maybe.
+
+* Writers for writing to stores other than Solr? ElasticSearch? Maybe.

 * Unicode normalization. Has to normalize to NFKC on way out to index. Except for serialized marc field and other exceptions? Except maybe don't have to, rely on solr analyzer to do it?

data/doc/batch_execution.md
CHANGED
@@ -8,15 +8,11 @@ with suggested solutions, and additional hints.

 ## Ruby version setting

-
+For best performance, traject should run under jruby. You will
 ordinarily have jruby installed under a ruby version switcher -- we
-
+recommend [chruby](https://github.com/postmodern/chruby) over other choices,
 but other popular choices include rvm and rbenv.

-Remember that traject needs to run in 1.9.x mode in jruby--
-with jruby 1.7.x or later, this should be default, recommend
-you use jruby 1.7.x.
-
 Especially when running under a cron job, it can be difficult to
 set things up so traject runs under jruby -- and then when you add
 bundler into it, things can get positively byzantine. It's not you,
data/doc/other_commands.md
CHANGED
@@ -38,12 +38,10 @@ If set to true, then oversized MARC records can still be serialized,
 with length bytes zero'd out -- technically illegal, but can
 be read by MARC::Reader in permissive mode.

-
-
-do need to set the `marc_source.type` setting to XML for xml input
-using the standard MARC readers.
+If you have MARC-XML *input*, you need to
+set the `marc_source.type` setting to XML for xml input.

 ~~~bash
 traject -x marcout somefile.marc -o output.xml -s marcout.type=xml
 traject -x marcout -s marc_source.type=xml somefile.xml -c configuration.rb
-~~~
+~~~
data/doc/settings.md
CHANGED
@@ -5,7 +5,7 @@ Hash, not nested. Keys are always strings, and dots (".") can be
 used for grouping and namespacing.

 Values are usually strings, but occasionally something else. String values can be easily
-set via the command line.
+set via the command line.

 Settings can be set in configuration files, usually like:

@@ -16,24 +16,24 @@ end
 ~~~~

 or on the command line: `-s key=value`. There are also some command line shortcuts
-for commonly used settings, see `traject -h`.
+for commonly used settings, see `traject -h`.

-`provide` will only set the key if it was previously unset, so first time to set 'wins'. And command-line
-settings are applied first of all. It's recommended you use `provide`.
+`provide` will only set the key if it was previously unset, so first time to set 'wins'. And command-line
+settings are applied first of all. It's recommended you use `provide`.

-`store` is also available, and forces setting of the new value overriding any previous value set.
+`store` is also available, and forces setting of the new value overriding any previous value set.

 ## Known settings

 * `debug_ascii_progress`: true/'true' to print ascii characters to STDERR indicating progress. Note,
-yes, this is fixed to STDERR, regardless of your logging setup.
+yes, this is fixed to STDERR, regardless of your logging setup.
 * `.` for every batch of records read and parsed
 * `^` for every batch of records batched and queued for adding to solr
 (possibly in thread pool)
 * `%` for completing of a Solr 'add'
 * `!` when threadpool for solr add has a full queue, so solr add is
 going to happen in calling queue -- means solr adding can't
-keep up with production.
+keep up with production.

 * `json_writer.pretty_print`: used by the JsonWriter, if set to true, will output pretty printed json (with added whitespace) for easier human readability. Default false.

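Reviewer note: the hunk above describes two setters with different precedence: `provide` keeps the first value set for a key (which is why command-line settings, applied first, win), while `store` always overwrites. The sketch below imitates only that documented behaviour in plain Ruby. `SketchSettings` is a hypothetical class written for this note, not `Traject::Indexer::Settings`, and the URLs are placeholders.

```ruby
# Hypothetical illustration of the documented provide/store semantics.
# This is not Traject::Indexer::Settings; it only mirrors the prose above.
class SketchSettings
  def initialize
    @values = {}
  end

  # Sets the key only if nothing has set it yet, so the first caller wins.
  def provide(key, value)
    @values[key] = value unless @values.key?(key)
  end

  # Always overwrites whatever value was there before.
  def store(key, value)
    @values[key] = value
  end

  def [](key)
    @values[key]
  end
end

settings = SketchSettings.new

# Applied first, as a command-line `-s solr.url=...` would be.
settings.provide("solr.url", "http://cli.example.org/solr")
# A later `provide` from a conf file is ignored because the key is set.
settings.provide("solr.url", "http://conf.example.org/solr")
puts settings["solr.url"]

# `store` forces the new value regardless of what came before.
settings.store("solr.url", "http://forced.example.org/solr")
puts settings["solr.url"]
```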
@@ -50,18 +50,10 @@ settings are applied first of all. It's recommended you use `provide`.
 * `log.batch_size`: If set to a number N (or string representation), will output a progress line to
 log. (by default as INFO, but see log.batch_size.severity)

-* `log.batch_size.severity`: If `log.batch_size` is set, what logger severity level to log to. Default "INFO", set to "DEBUG" etc if desired.
+* `log.batch_size.severity`: If `log.batch_size` is set, what logger severity level to log to. Default "INFO", set to "DEBUG" etc if desired.

 * `marc_source.type`: default 'binary'. Can also set to 'xml' or (not yet implemented todo) 'json'. Command line shortcut `-t`

-* `marc4j.jar_dir`: Path to a directory containing Marc4J jar file to use. All .jar's in dir will
-be loaded. If unset, uses marc4j.jar bundled with traject.
-
-* `marc4j_reader.permissive`: Used by Marc4JReader only when marc.source_type is 'binary', boolean, argument to the underlying MarcPermissiveStreamReader. Default true.
-
-* `marc4j_reader.source_encoding`: Used by Marc4JReader only when marc.source_type is 'binary', encoding strings accepted
-by marc4j MarcPermissiveStreamReader. Default "BESTGUESS", also "UTF-8", "MARC"
-
 * `marcout.allow_oversized`: Used with `-x marcout` command to output marc when outputting
 as ISO 2709 binary, set to true or string "true", and the MARC::Writer will have
 allow_oversized=true set, allowing oversized records to be serialized with length
@@ -69,44 +61,41 @@ settings are applied first of all. It's recommended you use `provide`.

 * `output_file`: Output file to write to for operations that write to files: For instance the `marcout` command,
 or Writer classes that write to files, like Traject::JsonWriter. Has an shortcut
-`-o` on command line.
+`-o` on command line.

-* `processing_thread_pool`
-
-
-But this is the first thread_pool to try increasing for better performance on a multi-core machine.
-
-A pool here can sometimes result in multi-threaded commiting to Solr too with the
-SolrJWriter, as processing worker threads will do their own commits to solr if the
-solrj_writer.thread_pool is full. Having a multi-threaded pool here can help even out throughput
-through Solr's pauses for committing too.
+* `processing_thread_pool` Number of threads in the main thread pool used for processing
+records with input rules. On JRuby or Rubinius, defaults to 1 less than the number of processors detected on your machine. On other ruby platforms, defaults to 1. Set to 0 or nil
+to disable thread pool, and do all processing in main thread.

-
+Choose a pool size based on size of your machine, and complexity of your indexing rules, you
+might want to try different sizes and measure which works best for you.
+Probably no reason for it ever to be more than number of cores on indexing machine.

-* `solr.url`: URL to connect to a solr instance for indexing, eg http://example.org:8983/solr . Command-line short-cut `-u`.

-* `
+* `reader_class_name`: a Traject Reader class, used by the indexer as a source
+of records. Defaults to Traject::Marc4JReader (using the Java Marc4J
+library) on JRuby; Traject::MarcReader (using the ruby marc gem) otherwise.
+Command-line shortcut `-r`
+
+* `solr.url`: URL to connect to a solr instance for indexing, eg http://example.org:8983/solr . Command-line short-cut `-u`.

 * `solr.version`: Set to eg "1.4.0", "4.3.0"; currently un-used, but in the future will control
 change some default settings, and/or sanity check and warn you if you're doing something
 that might not work with that version of solr. Set now for help in the future.

-* `
-0, or 1, and
-
-* `solrj_writer.commit_on_close`: default false, set to true to have SolrJWriter send an explicit commit message to Solr after indexing.
+* `solr_writer.batch_size`: size of batches that SolrJsonWriter will send docs to Solr in. Default 100. Set to nil,
+0, or 1, and SolrJsonWriter will do one http transaction per document, no batching.

-* `
+* `solr_writer.commit_on_close`: default false, set to true to have the solr writer send an explicit commit message to Solr after indexing.

-* `solrj_writer.server_class_name`: String name of a solrj.SolrServer subclass to be used by SolrJWriter. Default "HttpSolrServer"

-* `
+* `solr_writer.thread_pool`: Defaults to 1 (single bg thread). A thread pool is used for submitting docs
 to solr. Set to 0 or nil to disable threading. Set to 1,
 there will still be a single bg thread doing the adds.
 May make sense to set higher than number of cores on your
 indexing machine, as these threads will mostly be waiting
 on Solr. Speed/capacity of your solr might be more relevant.
 Note that processing_thread_pool threads can end up submitting
-to solr too, if
+to solr too, if solr_json_writer.thread_pool is full.

-* `writer_class_name`: a Traject Writer class, used by indexer to send processed dictionaries off. Default Traject::
+* `writer_class_name`: a Traject Writer class, used by indexer to send processed dictionaries off. Default Traject::SolrJsonWriter, other writers for debugging or writing to files are also available. See Traject::Indexer for more info. Command line shortcut `-w`
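Reviewer note: the new `solr_writer.batch_size` entry above says documents go to Solr in batches of N (default 100), and that nil, 0, or 1 means one HTTP transaction per document. The grouping rule alone can be sketched with `Enumerable#each_slice`. This is an assumption-laden illustration, not the code in `lib/traject/solr_json_writer.rb`, which this diff does not show. `batches_for` is a hypothetical helper, and integers stand in for output documents.

```ruby
# Hypothetical helper mirroring only the documented grouping rule for
# `solr_writer.batch_size`; the real SolrJsonWriter is not reproduced here.
def batches_for(documents, batch_size)
  size = batch_size.to_i   # nil.to_i is 0, and "100".to_i is 100
  size = 1 if size < 1     # nil or 0 behaves like 1: one document per request
  documents.each_slice(size).to_a
end

docs = [1, 2, 3, 4, 5]

p batches_for(docs, 2)    # slices of two, with a shorter final slice
p batches_for(docs, nil)  # one document per slice
p batches_for(docs, "3")  # string values, as set via `-s key=value`, also work
```

Accepting strings matters because settings given on the command line arrive as strings, so the sketch normalises with `to_i` before slicing.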