traject 2.1.0-java → 2.2.0-java
- checksums.yaml +4 -4
- data/.gitignore +2 -0
- data/.travis.yml +8 -20
- data/CHANGES.md +14 -0
- data/README.md +35 -56
- data/doc/extending.md +20 -27
- data/doc/indexing_rules.md +46 -57
- data/doc/settings.md +17 -48
- data/lib/traject/debug_writer.rb +31 -5
- data/lib/traject/indexer.rb +6 -4
- data/lib/traject/marc_extractor.rb +37 -157
- data/lib/traject/marc_extractor_spec.rb +229 -0
- data/lib/traject/version.rb +1 -1
- data/test/debug_writer_test.rb +41 -0
- data/test/marc_extractor_test.rb +24 -24
- data/test/test_support/demo_config.rb +1 -1
- data/traject.gemspec +5 -5
- metadata +74 -73
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 31b0c8daf3b5365e6f76172e9af9d8ec7fd842fe
|
4
|
+
data.tar.gz: 9286f5626eb34bd4df3e89aa1197fc3b1810e601
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 0495e94238704ab066c86c40e1ef65c86d4229178795031a59dfaf42ceb5efa01e00943b0447291a666d504b42f3cca62d56735240a56004541fa24b8efc138a
|
7
|
+
data.tar.gz: e002525a16a48c0897548f526df1c63bccdf5d23d888cd799afb2518b52a73b3832ba0f5e8a8748f8649ef45b43c6efe85b0f48fa883d9540480f15dd6d19423
|
data/.travis.yml
CHANGED
@@ -1,27 +1,15 @@
|
|
1
1
|
language: ruby
|
2
|
+
cache: bundler
|
3
|
+
sudo: false
|
2
4
|
rvm:
|
3
5
|
- jruby-19mode
|
4
|
-
- jruby-
|
6
|
+
- jruby-9.0.4.0
|
5
7
|
- 1.9
|
6
|
-
- 2.1
|
7
8
|
- 2.2
|
9
|
+
- 2.3.0
|
8
10
|
- rbx-2
|
11
|
+
before_install:
|
12
|
+
- gem update --system
|
13
|
+
- gem install bundler
|
9
14
|
jdk:
|
10
|
-
-
|
11
|
-
- openjdk6
|
12
|
-
matrix:
|
13
|
-
exclude:
|
14
|
-
- rvm: 1.9
|
15
|
-
jdk: openjdk7
|
16
|
-
- rvm: 2.1
|
17
|
-
jdk: openjdk7
|
18
|
-
- rvm: rbx-2
|
19
|
-
jdk: openjdk7
|
20
|
-
- rvm: jruby-head
|
21
|
-
jdk: openjdk6
|
22
|
-
- rvm: 2.2
|
23
|
-
jdk: openjdk6
|
24
|
-
allow_failures:
|
25
|
-
- rvm: jruby-head
|
26
|
-
|
27
|
-
bundler_args: --without debug
|
15
|
+
- oraclejdk8
|
data/CHANGES.md
CHANGED
@@ -1,5 +1,19 @@
|
|
1
1
|
# Changes
|
2
2
|
|
3
|
+
## 2.2.0
|
4
|
+
* Change DebugWriter to be more forgiving (and informative) about missing record-id fields
|
5
|
+
* Automatically require DebugWriter for easier use on the command line
|
6
|
+
* Refactor MarcExtractor to be easier to read
|
7
|
+
* Fix .travis file to actually work, and target more recent rubies.
|
8
|
+
|
9
|
+
## 2.1.0
|
10
|
+
* update some docs (typos)
|
11
|
+
* Make the indexer's `writer` r/w so it can be set at runtime (#110)
|
12
|
+
* Allow `extract_marc` to be callable from anywhere (#111)
|
13
|
+
* Add doc instructions/examples for programmatic Indexer use
|
14
|
+
* _Much_ better error reporting; easier to find which record went wrong
|
15
|
+
|
16
|
+
|
3
17
|
## 2.0.2
|
4
18
|
|
5
19
|
* Guard against assumption of MARC data when indexing using SolrJsonWriter ([#94](https://github.com/traject-project/traject/issues/94))
|
data/README.md
CHANGED
@@ -1,12 +1,10 @@
|
|
1
1
|
# Traject
|
2
2
|
|
3
|
-
An easy to use, high-performance, flexible and extensible MARC to Solr indexer.
|
3
|
+
An easy to use, high-performance, flexible and extensible MARC to Solr indexer.
|
4
4
|
|
5
|
-
You might use traject to index MARC data for a Solr-based discovery product like [Blacklight](https://github.com/projectblacklight/blacklight) or [VUFind](http://vufind.org/).
|
5
|
+
You might use [traject](https://github.com/traject/traject) to index MARC data for a Solr-based discovery product like [Blacklight](https://github.com/projectblacklight/blacklight) or [VUFind](http://vufind.org/).
|
6
6
|
|
7
|
-
Traject can also be generalized to a set of tools for getting structured data from a source, and transforming it to a hash-like object to send to a destination. In addition to sending data
|
8
|
-
to solr, Traject can produce json or yaml files, tab-delimited files, CSV files, and output suitable
|
9
|
-
for debugging by a human.
|
7
|
+
Traject can also be generalized to a set of tools for getting structured data from a source, and transforming it to a hash-like object to send to a destination. In addition to sending data to Solr, Traject can produce json or yaml files, tab-delimited files, CSV files, and output suitable for debugging by a human.
|
10
8
|
|
11
9
|
**Traject is stable, mature software, that is already being used in production by its authors.**
|
12
10
|
|
@@ -23,7 +21,7 @@ Initially by Jonathan Rochkind (Johns Hopkins Libraries) and Bill Dueber (Univer
|
|
23
21
|
* Fast. Traject by default indexes using multiple threads, on multiple cpu cores, when the underlying
|
24
22
|
ruby implementation (i.e., JRuby) allows it, and can use a separate thread for communication with
|
25
23
|
solr even under MRI.
|
26
|
-
* Composed of decoupled components, for flexibility and extensibility.
|
24
|
+
* Composed of decoupled components, for flexibility and extensibility.
|
27
25
|
* Designed to support local code and configuration that's maintainable and testable, and can be shared between projects as ruby gems.
|
28
26
|
* Easy to split configuration between multiple files, for simple "pick-and-choose" command line options
|
29
27
|
that can combine to deal with any of your local needs.
|
@@ -33,42 +31,36 @@ that can combine to deal with any of your local needs.
|
|
33
31
|
|
34
32
|
Traject runs under jruby (1.7.x or higher), MRI ruby (1.9.3 or higher), or probably any other ruby platform.
|
35
33
|
|
36
|
-
**Traject runs much faster on JRuby** where it can use multi-core parallelism, and the Java
|
37
|
-
Marc4J marc reader. If performance is a concern, you should run traject on JRuby.
|
34
|
+
**Traject runs much faster on JRuby** where it can use multi-core parallelism, and the Java Marc4J marc reader. If performance is a concern, you should run traject on JRuby.
|
38
35
|
|
39
36
|
Some options for installing a ruby other than your system-provided one are [chruby](https://github.com/postmodern/chruby) and [ruby-install](https://github.com/postmodern/ruby-install#readme).
|
40
37
|
|
41
38
|
Once you have ruby, just `$ gem install traject`.
|
42
39
|
|
43
|
-
(
|
40
|
+
(**Note**: We might in the future provide an all-in-one .jar distribution, which will not require you to install jruby on your system, for those who want the multi-threading of jruby without having to actually install it. Let us know if interested.)
|
44
41
|
|
45
42
|
|
46
43
|
## Configuration files
|
47
44
|
|
48
|
-
traject is configured using configuration files. To get a sense of what they look like, you can
|
49
|
-
|
50
|
-
[demo_config.rb](./test/test_support/demo_config.rb). You could run traject with that configuration file
|
51
|
-
as: `traject -c path/to/demo_config.rb marc_file.marc`.
|
45
|
+
traject is configured using configuration files. To get a sense of what they look like, you can take a look at our sample basic configuration file,
|
46
|
+
[demo_config.rb](./test/test_support/demo_config.rb). You could run traject with that configuration file as: `traject -c path/to/demo_config.rb marc_file.marc`.
|
52
47
|
|
53
48
|
Configuration files are actually just ruby -- so by convention they end in `.rb`.
|
54
49
|
|
55
|
-
We hope you can write basic useful configuration files without much ruby experience, since
|
56
|
-
traject gives you some easy functions to use for common directives. But the full power
|
57
|
-
of ruby is available to you if needed.
|
50
|
+
We hope you can write basic useful configuration files without much ruby experience, since traject gives you some easy functions to use for common directives. But the full power of ruby is available to you if needed.
|
58
51
|
|
59
52
|
**rubyist tip**: Technically, config files are executed with `instance_eval` in a Traject::Indexer instance, so the special commands you see are just methods on Traject::Indexer (or mixed into it). But you can
|
60
53
|
call ordinary ruby `require` in config files, etc., too, to load
|
61
54
|
external functionality. See more at Extending Logic below.
|
62
55
|
|
63
56
|
You can keep your settings and indexing rules in one config file,
|
64
|
-
or split them
|
57
|
+
or split them across multiple config files however you like. (Connection details vs indexing? Common things vs environmental specific things?)
|
65
58
|
|
66
59
|
There are two main categories of directives in your configuration files: _Settings_, and _Indexing Rules_.
|
67
60
|
|
68
61
|
## Settings
|
69
62
|
|
70
|
-
Settings are a flat list of key/value pairs, where the keys are always strings and the values usually are. They look like this
|
71
|
-
in a config file:
|
63
|
+
Settings are a flat list of key/value pairs, where the keys are always strings and the values usually are too. They look like this in a config file:
|
72
64
|
|
73
65
|
~~~ruby
|
74
66
|
# configuration_file.rb
|
@@ -98,20 +90,17 @@ end
|
|
98
90
|
|
99
91
|
`provide` will only set the key if it was previously unset, so first
|
100
92
|
setting wins, and command-line comes first of all and overrides everything.
|
101
|
-
You can also use `store` if you want to force-set
|
93
|
+
You can also use `store` if you want to force-set: last set wins.
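The difference between `provide` (first setting wins) and `store` (last setting wins) can be sketched in plain Ruby. `SketchSettings` below is a hypothetical stand-in written for illustration, not traject's own settings class; only the two method names and their described behavior come from the README text above.

```ruby
# Hypothetical stand-in for traject's settings object, for illustration only.
class SketchSettings
  def initialize
    @values = {}
  end

  # Set the key only if it has not been set yet: first setting wins.
  def provide(key, value)
    @values[key] = value unless @values.key?(key)
  end

  # Always set the key: last setting wins.
  def store(key, value)
    @values[key] = value
  end

  def [](key)
    @values[key]
  end
end

settings = SketchSettings.new

settings.provide "solr.url", "http://example.org/solr/first"
settings.provide "solr.url", "http://example.org/solr/second" # ignored: key already set

settings.store "log.level", "info"
settings.store "log.level", "debug" # overrides the earlier value
```

Under this model, a value given first (for example on the command line) survives later `provide` calls, while `store` is the way to force a value regardless of what came before.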
|
102
94
|
|
103
95
|
See the docs page on [Settings](./doc/settings.md) for a list
|
104
96
|
of all standardized settings.
|
105
97
|
|
106
98
|
|
107
|
-
## Indexing rules:
|
99
|
+
## Indexing rules: 'to_field' and 'extract_marc'
|
108
100
|
|
109
|
-
There are a few methods that can be used to create indexing rules
|
110
|
-
one you'll most common is called `to_field`, and establishes a rule
|
111
|
-
to extract content to a particular named output field.
|
101
|
+
There are a few methods that can be used to create indexing rules. We will touch on the two most commonly used methods here. More information is available in [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md)
|
112
102
|
|
113
|
-
A `to_field` extraction rule can use built-in 'macros', or, as we'll see later,
|
114
|
-
entirely custom logic.
|
103
|
+
`to_field` establishes a rule to extract content to a particular named output field. A `to_field` extraction rule can use built-in 'macros', or, as we'll see later, entirely custom logic.
|
115
104
|
|
116
105
|
The built-in macro you'll use the most is `extract_marc`, to extract
|
117
106
|
data out of a MARC record according to a tag/subfield specification.
|
@@ -140,24 +129,18 @@ data out of a MARC record according to a tag/subfield specification.
|
|
140
129
|
to_field "language_code", extract_marc("008[35-37]")
|
141
130
|
~~~
|
142
131
|
|
143
|
-
`extract_marc` by default includes all 'alternate script' linked fields corresponding
|
144
|
-
to matched specifications, but you can turn that off, or extract *only* corresponding
|
145
|
-
880s.
|
132
|
+
`extract_marc` by default includes all 'alternate script' linked fields corresponding to matched specifications, but you can turn that off, or extract *only* corresponding 880s.
|
146
133
|
|
147
134
|
~~~ruby
|
148
135
|
to_field "title", extract_marc("245abc", :alternate_script => false)
|
149
136
|
to_field "title_vernacular", extract_marc("245abc", :alternate_script => :only)
|
150
137
|
~~~
|
151
138
|
|
152
|
-
By default, specifications with multiple subfields (
|
139
|
+
By default, specifications with multiple subfields (e.g. "240abc") will produce one single string of output per field (for each '240' field in the record), with the concatenation of each matched subfield. Specifications with single subfields (like "020a") will split subfields and produce an output string for each matching subfield (i.e. two output strings for a single '020' with two subfield 'a').
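The default joining and splitting behavior described above can be illustrated with a small plain-Ruby model. `sketch_extract` is a hypothetical helper, not traject's `MarcExtractor`: fields are modelled as arrays of `[subfield_code, value]` pairs, and the single-space separator is an assumption made for the sketch.

```ruby
# Hypothetical model of the default behavior, for illustration only.
# Each field is an array of [subfield_code, value] pairs.
def sketch_extract(fields, codes, separator = " ")
  fields.flat_map do |subfields|
    values = subfields.select { |code, _| codes.include?(code) }.map { |_, value| value }
    next [] if values.empty?

    # Several subfield codes in the spec: one joined string per field.
    # A single subfield code: one output string per matching subfield.
    codes.length > 1 ? [values.join(separator)] : values
  end
end

field_240 = [["a", "Title"], ["b", "subtitle"], ["c", "author"]]
field_020 = [["a", "0123456789"], ["a", "9876543210"]]

joined = sketch_extract([field_240], %w[a b c])
split  = sketch_extract([field_020], %w[a])
```

Here `joined` holds a single string for the one '240' field, while `split` holds two strings, one per subfield 'a' of the single '020' field.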
|
153
140
|
|
154
|
-
For the syntax and complete possibilities of the specification
|
155
|
-
string argument to extract_marc, see docs at the [MarcExtractor class](./lib/traject/marc_extractor.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/MarcExtractor)).
|
141
|
+
For the syntax and complete possibilities of the specification string argument to extract_marc, see docs at the [MarcExtractor class](./lib/traject/marc_extractor.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/MarcExtractor)).
|
156
142
|
|
157
|
-
`extract_marc` also supports `translation maps` similar
|
158
|
-
to SolrMarc's. There are some translation maps provided by traject,
|
159
|
-
and you can also define your own, in yaml or ruby. Translation maps are especially useful
|
160
|
-
for mapping from MARC codes to user-displayable strings:
|
143
|
+
`extract_marc` also supports `translation maps` similar to SolrMarc's. There are some translation maps provided by traject, and you can also define your own, in yaml or ruby. Translation maps are especially useful for mapping from MARC codes to user-displayable strings:
|
161
144
|
|
162
145
|
~~~ruby
|
163
146
|
# "translation_map" will be passed to Traject::TranslationMap.new
|
@@ -165,18 +148,19 @@ for mapping form MARC codes to user-displayable strings:
|
|
165
148
|
to_field "language", extract_marc("008[35-37]:041a:041d", :translation_map => "marc_language_code")
|
166
149
|
~~~
|
167
150
|
|
168
|
-
To see all options for `extract_marc`, see the [
|
151
|
+
To see all options for `extract_marc`, see the [extract_marc](http://rdoc.info/gems/traject/Traject/Macros/Marc21:extract_marc) method documentation.
|
169
152
|
|
170
|
-
##
|
153
|
+
## Other built-in utility macros
|
171
154
|
|
172
|
-
Other built-in methods that can be used with `to_field` include
|
173
|
-
|
155
|
+
Other built-in methods that can be used with `to_field` include:
|
156
|
+
|
157
|
+
a hard-coded literal string:
|
174
158
|
|
175
159
|
~~~ruby
|
176
160
|
to_field "source", literal("LIB_CATALOG")
|
177
161
|
~~~
|
178
162
|
|
179
|
-
|
163
|
+
the current record serialized back out as MARC, in binary, XML, or json:
|
180
164
|
|
181
165
|
~~~ruby
|
182
166
|
# or :format => "json" for marc-in-json
|
@@ -186,7 +170,7 @@ The current record serialized back out as MARC, in binary, XML, or json:
|
|
186
170
|
to_field "marc_record_raw", serialized_marc(:format => "binary", :binary_escape => false, :allow_oversized => true)
|
187
171
|
~~~
|
188
172
|
|
189
|
-
|
173
|
+
text of all fields in a range:
|
190
174
|
|
191
175
|
~~~ruby
|
192
176
|
to_field "text", extract_all_marc_values(:from => "100", :to => "899")
|
@@ -194,11 +178,9 @@ Text of all fields in a range:
|
|
194
178
|
|
195
179
|
All of these methods are defined at [Traject::Macros::Marc21](./lib/traject/macros/marc21.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Macros/Marc21))
|
196
180
|
|
197
|
-
##
|
181
|
+
## More complex canned MARC semantic logic
|
198
182
|
|
199
|
-
Some more complex (and opinionated/subjective) algorithms for deriving semantics
|
200
|
-
from Marc are also packaged with Traject, but not available by default. To make
|
201
|
-
them available to your indexing, you just need to use ruby `require` and `extend`.
|
183
|
+
Some more complex (and opinionated/subjective) algorithms for deriving semantics from Marc are also packaged with Traject, but not available by default. To make them available to your indexing, you just need to use ruby `require` and `extend`.
|
202
184
|
|
203
185
|
A number of methods are in [Traject::Macros::Marc21Semantics](./lib/traject/macros/marc21_semantics.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Macros/Marc21Semantics))
|
204
186
|
|
@@ -283,10 +265,10 @@ additional documentation page on [Indexing Rules: Macros and Custom Logic](./doc
|
|
283
265
|
|
284
266
|
In addition to `to_field`, an `each_record` method is available, which,
|
285
267
|
like `to_field`, is executed for every record, but without being tied
|
286
|
-
to a specific field.
|
268
|
+
to a specific output field.
|
287
269
|
|
288
|
-
`each_record` can be used for logging or notifying
|
289
|
-
results
|
270
|
+
`each_record` can be used for logging or notifying, computing intermediate
|
271
|
+
results, or writing to more than one field at once.
|
290
272
|
|
291
273
|
~~~ruby
|
292
274
|
each_record do |record|
|
@@ -294,12 +276,9 @@ results; or writing to more than one field at once.
|
|
294
276
|
end
|
295
277
|
~~~
|
296
278
|
|
297
|
-
For more on `each_record`, see
|
279
|
+
For more on `each_record`, see [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md).
|
298
280
|
|
299
|
-
There is also an `after_processing` method that can be used to register
|
300
|
-
logic that will be called after the entire has been processed. You can use it for whatever custom
|
301
|
-
ruby code you might want for your app (send an email? Clean up a log file? Trigger
|
302
|
-
a Solr replication?)
|
281
|
+
There is also an `after_processing` method that can be used to register logic that will be called after the entire has been processed. You can use it for whatever custom ruby code you might want for your app (send an email? Clean up a log file? Trigger a Solr replication?)
|
303
282
|
|
304
283
|
~~~ruby
|
305
284
|
after_processing do
|
@@ -310,8 +289,7 @@ end
|
|
310
289
|
|
311
290
|
## Readers and Writers
|
312
291
|
|
313
|
-
Traject uses modular 'Writer' classes to take the output hashes from transformation
|
314
|
-
send them somewhere or do something useful with them.
|
292
|
+
Traject uses modular 'Writer' classes to take the output hashes from transformation and send them somewhere or do something useful with them.
|
315
293
|
|
316
294
|
By default traject uses the [Traject::SolrJsonWriter](lib/traject/solr_json_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/SolrJsonWriter)) to send to Solr for indexing.
|
317
295
|
Several other writers are also built-in:
|
@@ -419,6 +397,7 @@ Own Code](./doc/extending.md)
|
|
419
397
|
* [traject-solrj_writer](https://github.com/traject/traject-solrj_writer): a jruby-only writer that uses the solrj .jar to talk directly to solr. Your only option for speaking to a solr version < 3.2, which is when the json handler was added to solr.
|
420
398
|
* [traject_marc4j_reader](https://github.com/traject/traject-marc4j_reader): Packaged with traject automatically on jruby. A JRuby-only reader for
|
421
399
|
reading marc records using the Marc4J library, fastest MARC reading on JRuby.
|
400
|
+
* [traject_sequel_writer](https://github.com/traject/traject_sequel_writer): a writer for sending to an rdbms via [Sequel](https://github.com/jeremyevans/sequel)
|
422
401
|
|
423
402
|
# Development
|
424
403
|
|
data/doc/extending.md
CHANGED
@@ -13,13 +13,12 @@ of a couple traject features meant to make it easier.
|
|
13
13
|
|
14
14
|
## Expert Summary
|
15
15
|
|
16
|
-
*
|
16
|
+
* Load Path options:
|
17
|
+
* The traject command-line `-I` argument can be used to list directories to
|
17
18
|
add to the load path, similar to the `ruby -I` argument. You
|
18
19
|
can then 'require' local project files from the load path.
|
19
|
-
*
|
20
|
-
* translation map files
|
21
|
-
"./translation_maps" subdir on the load path will be found
|
22
|
-
for Traject translation maps.
|
20
|
+
* Modify the ruby `$LOAD_PATH` manually at the top of a traject config file you are loading.
|
21
|
+
* NOTE: translation map files in a "./translation_maps" subdir on the load path will be available to traject.
|
23
22
|
* You can use Bundler with traject simply by creating a Gemfile with `bundler init`,
|
24
23
|
and then running command line with `bundle exec traject` or
|
25
24
|
even `BUNDLE_GEMFILE=path/to/Gemfile bundle exec traject`
|
@@ -114,11 +113,11 @@ a skeleton of your gem
|
|
114
113
|
This will also make available rake commands to install your gem locally
|
115
114
|
(`rake install`), or release it to the rubygems server (`rake release`).
|
116
115
|
|
117
|
-
There are two main methods to use a gem in your traject project,
|
118
|
-
with straight rubygems, or with bundler.
|
116
|
+
There are two main methods to use a gem in your traject project: with straight rubygems, or with bundler.
|
119
117
|
|
120
|
-
|
121
|
-
|
118
|
+
### without bundler (straight rubygems):
|
119
|
+
|
120
|
+
Going without bundler may be simpler, at least at first. Simply `gem install some_gem` from the command line, and now you can `require` that gem in your traject
|
122
121
|
config file, and use what it provides:
|
123
122
|
|
124
123
|
~~~ruby
|
@@ -129,25 +128,20 @@ require 'some_gem'
|
|
129
128
|
SomeGem.whatever!
|
130
129
|
~~~
|
131
130
|
|
132
|
-
A gem can provide traject translation map definitions
|
133
|
-
in a `lib/translation_maps` sub-directory, and traject will be able to find those
|
134
|
-
translation maps when the gem is loaded. (Because gems'
|
135
|
-
`./lib` directories are by default added to the ruby load path.)
|
131
|
+
A gem can provide traject translation map definitions in a `lib/translation_maps` sub-directory, and traject will be able to find those translation maps when the gem is loaded (because gems' `./lib` directories are by default added to the ruby load path).
|
136
132
|
|
137
|
-
###
|
133
|
+
### with bundler:
|
138
134
|
|
139
|
-
|
135
|
+
If you move your traject project to another system,
|
140
136
|
where you haven't yet installed the `some_gem`, then running
|
141
|
-
traject with
|
137
|
+
traject with the above config file will, of course, fail. Or if you
|
142
138
|
move your traject project to another system with a slightly
|
143
139
|
different version of `some_gem`, your traject indexing could
|
144
140
|
behave differently in confusing ways. As the number of gems
|
145
|
-
you are using increases, managing
|
141
|
+
you are using increases, managing the gems and gem versions gets increasingly
|
146
142
|
confusing.
|
147
143
|
|
148
|
-
[bundler](http://bundler.io/) was invented to make this kind of dependency management
|
149
|
-
more straightforward and reliable. We recommend you consider using
|
150
|
-
bundler, especially for traject installations where traject will
|
144
|
+
[bundler](http://bundler.io/) was invented to make this kind of dependency management in ruby more straightforward and reliable. We recommend you consider using bundler, especially for traject installations where traject will
|
151
145
|
be run via automated batch jobs on production servers.
|
152
146
|
|
153
147
|
Bundler's behavior is based on a `Gemfile` that lists your
|
@@ -156,15 +150,14 @@ by running `bundler init`, probably in the directory
|
|
156
150
|
right next to your traject config files.
|
157
151
|
|
158
152
|
Then specify what gems your traject project will use,
|
159
|
-
possibly with version restrictions, in the [Gemfile](http://bundler.io/v1.3/gemfile.html)
|
160
|
-
|
153
|
+
possibly with version restrictions, in the [Gemfile](http://bundler.io/v1.3/gemfile.html)
|
154
|
+
|
155
|
+
Be sure to include `gem 'traject'` in the Gemfile.
|
161
156
|
|
162
157
|
Run `bundle install` from the directory with the Gemfile, on any system
|
163
|
-
at any time, to make sure specified gems are installed.
|
158
|
+
at any time, to make sure specified gems are installed. (The bundler gem must be already installed on the system.)
|
164
159
|
|
165
|
-
**Run traject** with `bundle exec` to have bundler set up the environment
|
166
|
-
from your Gemfile. You can `cd` into the directory containing the Gemfile,
|
167
|
-
so bundler can find it:
|
160
|
+
**Run traject** with `bundle exec` to have bundler set up the traject environment from your Gemfile. You can `cd` into the directory containing the Gemfile, so bundler can find it:
|
168
161
|
|
169
162
|
$ cd /some/where
|
170
163
|
$ bundle exec traject -c some_traject_config.rb ...
|
@@ -178,7 +171,7 @@ Bundler will make sure the specified versions of all gems are used by
|
|
178
171
|
traject, and also make sure no gems except those specified in the gemfile
|
179
172
|
are available to the program, for a reliable reproducible environment.
|
180
173
|
|
181
|
-
You
|
174
|
+
You still need to `require` the gem in your traject config file;
|
182
175
|
then just refer to what it provides in your config code as usual.
|
183
176
|
|
184
177
|
You should check both the `Gemfile` and the `Gemfile.lock`
|
data/doc/indexing_rules.md
CHANGED
@@ -16,21 +16,17 @@ That `do` is just ruby `block` syntax, whereby we can pass a block of ruby code
|
|
16
16
|
|
17
17
|
The block is then stored by the Traject::Indexer, and called for each record indexed, with three arguments provided.
|
18
18
|
|
19
|
-
|
19
|
+
### record argument
|
20
20
|
|
21
21
|
The record that gets passed to your block is a MARC::Record object (or, theoretically, any object that gets returned by a traject Reader). Your logic will usually examine the record to calculate the desired output.
|
22
22
|
|
23
23
|
### accumulator argument
|
24
24
|
|
25
|
-
The accumulator argument is an
|
26
|
-
array should hold the output you want to send off, to the field specified in the `to_field`.
|
25
|
+
The accumulator argument is an Array. At the end of your custom code, the accumulator Array should hold the output you want to send off to the field specified in `to_field`.
|
27
26
|
|
28
|
-
The accumulator is a reference to a ruby
|
29
|
-
manipulating it in place with Array methods that mutate the array, like `concat`, `<<`,
|
30
|
-
`map!` or even `replace`.
|
27
|
+
The accumulator is a reference to a ruby Array, and you need to **modify** that Array, manipulating it in place with Array methods that mutate the array, like `concat`, `<<`, `map!` or even `replace`.
|
31
28
|
|
32
|
-
You can't simply assign the accumulator variable to a different
|
33
|
-
you need to modify the array in-place.
|
29
|
+
You can't simply assign the accumulator variable to a different Array; you need to modify the Array *in place*.
|
34
30
|
|
35
31
|
# Won't work, assigning variable
|
36
32
|
to_field('foo') do |rec, acc|
|
@@ -50,21 +46,16 @@ you need to modify the array in-place.
|
|
50
46
|
to_field('foo') do |rec, acc|
|
51
47
|
acc << 'bill'
|
52
48
|
acc << 'dueber'
|
53
|
-
acc
|
49
|
+
acc.map!{|str| str.upcase} # NOTE: "map!" not "map"
|
54
50
|
end
|
55
51
|
|
56
52
|
### context argument
|
57
53
|
|
58
|
-
The third optional context
|
59
|
-
|
60
|
-
The third optional argument is a
|
61
|
-
[Traject::Indexer::Context](./lib/traject/indexer/context.rb) ([rdoc](http://rdoc.info/github/traject/traject/Traject/Indexer/Context))
|
62
|
-
object. Most of the time you don't need it, but you can use it for
|
63
|
-
some sophisticated functionality, for example using these Context methods:
|
54
|
+
The third optional argument is a [Traject::Indexer::Context](./lib/traject/indexer/context.rb) ([rdoc](http://rdoc.info/github/traject/traject/Traject/Indexer/Context)) object. Most of the time you don't need it, but you can use it for some sophisticated functionality. These are some useful methods available:
|
64
55
|
|
65
56
|
* `context.clipboard` A hash into which you can stuff values that you want to pass from one indexing step to another. For example, if you go through a bunch of work to query a database and get a result you'll need more than once, stick the results somewhere in the clipboard. This clipboard is record-specific, and won't persist between records.
|
66
|
-
* `context.position` The position of the record in the input file (e.g., was it the first record,
|
67
|
-
* `context.output_hash` A hash mapping the field names (generally defined in `to_field` calls) to an array of values to be sent to the writer associated with that field. This allows you to modify what goes to the writer without going through a `to_field` call -- you can just set `context.output_hash['myfield'] = ['my', 'values']` and you're set. See below for more examples
|
57
|
+
* `context.position` The position of the record in the input file (e.g., was it the first record, second, etc.). Useful for error reporting.
|
58
|
+
* `context.output_hash` A hash mapping the field names (generally defined in `to_field` calls) to an array of values to be sent to the writer associated with that field. This allows you to modify what goes to the writer without going through a `to_field` call -- you can just set `context.output_hash['myfield'] = ['my', 'values']` and you're set. See below for more examples.
|
68
59
|
* `context.skip!(msg)` An assertion that this record should be ignored. No more indexing steps will be called, no results will be sent to the writer, and a `debug`-level log message will be written stating that the record was skipped.
|
69
60
|
|
70
61
|
|
@@ -102,28 +93,26 @@ end
|
|
102
93
|
```
|
103
94
|
|
104
95
|
Certain built-in traject calls have been optimized to be high performance
|
105
|
-
so it's safe to do them inside 'inner loop' blocks
|
106
|
-
|
107
|
-
(note #cached rather than #new there)
|
96
|
+
so it's safe to do them inside 'inner loop' blocks. That includes `Traject::TranslationMap.new` and `Traject::MarcExtractor.cached("xxx")`
|
97
|
+
(NOTE: #cached rather than #new there)
|
108
98
|
|
109
99
|
|
110
100
|
## From block to lambda
|
111
101
|
|
112
102
|
In the ruby language, in addition to creating a code block as an argument
|
113
|
-
to a method with `do |args| ... end` or `{|arg| ... }
|
103
|
+
to a method with `do |args| ... end` or `{|arg| ... }`, we can also create
|
114
104
|
a code block to hold in a variable, with the `lambda` keyword:
|
115
105
|
|
116
106
|
always_output_foo = lambda do |record, accumulator|
|
117
107
|
accumulator << "FOO"
|
118
108
|
end
|
119
109
|
|
120
|
-
traject `to_field` is written so, as a convenience, it can take a lambda expression
|
121
|
-
stored in a variable as an alternative to a block:
|
110
|
+
In traject, `to_field` is written so that, as a convenience, it can take a lambda expression stored in a variable as an alternative to a block:
|
122
111
|
|
123
112
|
to_field "always_has_foo", always_output_foo
|
124
113
|
|
125
114
|
Why is this a convenience? Well, ordinarily it's not something we
|
126
|
-
need, but in fact it's what allows traject 'macros'
|
115
|
+
need, but in fact it's what allows traject 'macros' to be re-useable
|
127
116
|
code templates.
|
128
117
|
|
129
118
|
|
@@ -131,10 +120,9 @@ code templates.
|
|
131
120
|
|
132
121
|
A Traject macro is a way to automatically create indexing rules via re-usable "templates".
|
133
122
|
|
134
|
-
Traject macros are
|
135
|
-
them based on parameters passed in.
|
123
|
+
Traject macros are methods that return ruby lambda/proc objects, possibly creating them based on parameters passed in.
|
136
124
|
|
137
|
-
|
125
|
+
For example, here is the implementation of the `literal` method/macro:
|
138
126
|
|
139
127
|
~~~ruby
|
140
128
|
def literal(value)
|
@@ -144,12 +132,12 @@ def literal(value)
|
|
144
132
|
accumulator << value
|
145
133
|
end
|
146
134
|
end
|
147
|
-
to_field("
|
135
|
+
to_field "fieldname", literal("my_fav_literal")
|
148
136
|
~~~
|
149
137
|
|
150
|
-
|
138
|
+
So a Traject macro is a method that may have parameters and, based on those parameters, returns a lambda; the lambda is then passed to the `to_field` indexing method, or similar methods.
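To make the calling side concrete, here is a minimal plain-Ruby sketch of what happens to the lambda a macro returns. The `literal` method mirrors the one shown in the document; the surrounding "indexer" is only an assumption for illustration (a bare `call` with `nil` standing in for the record), not traject's actual implementation.

```ruby
# The macro: a method that builds and returns a lambda from its parameter.
def literal(value)
  lambda do |record, accumulator|
    accumulator << value
  end
end

# Stand-in for what an indexer does with the stored lambda: call it once per
# record with a fresh accumulator Array. No real MARC record is needed here,
# so nil is passed in its place.
step = literal("LIB_CATALOG")
accumulator = []
step.call(nil, accumulator)
```

After the call, `accumulator` contains the literal value, which is what would be sent on to the named output field.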
|
151
139
|
|
152
|
-
How do you make these methods available to the indexer?
|
140
|
+
How do you make these methods available to the traject indexer?
|
153
141
|
|
154
142
|
Define it in a module:
|
155
143
|
|
@@ -173,15 +161,15 @@ in one of your config files:
|
|
173
161
|
require 'literal_macro.rb'
|
174
162
|
extend LiteralMacro
|
175
163
|
|
176
|
-
to_field
|
164
|
+
to_field "fieldname", literal("my_fav_literal")
|
177
165
|
~~~
|
178
166
|
|
179
167
|
That's it. You can use the traject command line `-I` option to set the ruby load path, so your file will be findable via `require`. Or you can distribute it in a gem, and use straight rubygems and the `gem` command in your configuration file, or Bundler with traject command-line `-g` option.
|
180
168
|
|
181
|
-
## Using a lambda _and_
|
169
|
+
## Using a lambda _and_ a block
|
182
170
|
|
183
171
|
Traject macros (such as `extract_marc`) create and return a lambda. If
|
184
|
-
you include a lambda _and_ a block on a `to_field` call, the
|
172
|
+
you include a lambda _and_ a block on a `to_field` call, the block
|
185
173
|
gets the accumulator as it was filled in by the former.
|
186
174
|
|
187
175
|
```ruby
|
@@ -196,38 +184,42 @@ to_field('foo'), mylam do |rec, acc, context|
|
|
196
184
|
acc << 'two'
|
197
185
|
end #=> context.output_hash['foo'] == ['one', 'two']
|
198
186
|
|
199
|
-
|
200
187
|
# You might also want to do something like this
|
201
|
-
|
202
|
-
to_field('foo'), my_macro_that_doesn't_dedup_ do |rec, acc|
|
188
|
+
to_field 'foo', macro_returning_dup_values do |rec, acc|
|
203
189
|
acc.uniq!
|
204
190
|
end
|
205
191
|
```
|
206
192
|
|
207
193
|
## Manipulating `context.output_hash` directly
|
208
194
|
|
209
|
-
If you ask for the context argument, a [Traject::Indexer::Context](./lib/traject/indexer/context.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Indexer/Context)), you have access to context.output_hash
|
210
|
-
the hash of transformed output that will be sent to Solr (or any other Writer)
|
195
|
+
If you ask for the context argument, a [Traject::Indexer::Context](./lib/traject/indexer/context.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Indexer/Context)), you have access to `context.output_hash`, which is
|
196
|
+
the hash of already transformed output that will be sent to Solr (or any other Writer).
|
197
|
+
|
198
|
+
You can examine `context.output_hash` to see any already transformed output and use it as the source for new output.
|
199
|
+
|
200
|
+
You can *write* to `context.output_hash` directly, which can be useful for computations that affect more than one output field at once.
|
211
201
|
|
212
|
-
|
213
|
-
for new output. You can actually *write* to there manually, which can be useful
|
214
|
-
to write routines that effect more than one output field at once.
|
202
|
+
**Note**: Always assign an _Array_ to each key of `context.output_hash` (e.g., `context.output_hash['foo']`), never a single value!
|
215
203
|
|
216
|
-
|
204
|
+
```ruby
|
205
|
+
|
206
|
+
# Wrong - do NOT assign a value of anything other than an Array
|
207
|
+
context.output_hash['fieldname'] = 'fuzzy_wuzzies'
|
208
|
+
|
209
|
+
# Correct
|
210
|
+
context.output_hash['fieldname'] = ['fuzzy_wuzzies']
|
211
|
+
```
|
217
212
|
|
218
213
|
|
219
214
|
|
220
215
|
## each_record
|
221
216
|
|
222
|
-
|
223
|
-
routine, to define logic that is executed for each record, but isn't fixed to write
|
224
|
-
to a single output field.
|
217
|
+
`each_record` is similar to `to_field` in that it defines logic executed for each record. It differs from `to_field` because the output of `each_record` is not associated with a specific output field.
|
225
218
|
|
226
|
-
|
227
|
-
`record` argument; or both a `record` and a `context`.
|
219
|
+
Thus, `each_record` blocks have no `accumulator` argument: instead they either take a single `record` argument; or both a `record` and a `context`.
|
228
220
|
|
229
|
-
`each_record`
|
230
|
-
results
|
221
|
+
`each_record` is useful for logging or notifying, computing intermediate
|
222
|
+
results, or writing to more than one field at once.
|
231
223
|
|
232
224
|
~~~ruby
|
233
225
|
each_record do |record, context|
|
@@ -239,20 +231,17 @@ each_record do |record, context|
|
|
239
231
|
end
|
240
232
|
|
241
233
|
each_record do |record, context|
|
242
|
-
(
|
234
|
+
(val1, val2) = calculate_two_things_from(record)
|
243
235
|
|
244
236
|
context.output_hash["first_field"] ||= []
|
245
|
-
context.output_hash["first_field"] <<
|
237
|
+
context.output_hash["first_field"] << val1
|
246
238
|
|
247
239
|
context.output_hash["second_field"] ||= []
|
248
|
-
context.output_hash["second_field"] <<
|
240
|
+
context.output_hash["second_field"] << val2
|
249
241
|
end
|
250
242
|
~~~
|
251
243
|
|
252
|
-
traject doesn't come with any macros written for use with
|
253
|
-
`each_record`, but they could be created if useful --
|
254
|
-
just methods that return lambda's taking the right
|
255
|
-
args for `each_record`.
|
244
|
+
traject doesn't come with any macros written for use with `each_record`, but they could be created: such macros would be methods that return a lambda taking the arguments `each_record` passes to its block (a `record`, or a `record` and a `context`).
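A rough sketch of what such a macro might look like follows. The `copy_field` macro is hypothetical, and `FakeContext` is only a stand-in for `Traject::Indexer::Context` so that the example runs without traject loaded:

```ruby
# Stand-in for Traject::Indexer::Context, just enough for this sketch.
FakeContext = Struct.new(:output_hash)

# Hypothetical macro: returns a lambda taking (record, context),
# copying already-computed values from one output field to another.
def copy_field(from, to)
  lambda do |_record, context|
    values = context.output_hash[from]
    next if values.nil?             # nothing to copy for this record

    context.output_hash[to] ||= []  # always keep an Array as the value
    context.output_hash[to].concat(values)
  end
end

# Call the lambda by hand, as the indexer would once per record.
context = FakeContext.new({ "title" => ["Moby Dick"] })
copy_field("title", "title_sort").call(nil, context)
# context.output_hash["title_sort"] is now ["Moby Dick"]
```

In a real configuration the lambda would presumably be passed to `each_record` rather than called directly; check the `each_record` signature in your traject version before relying on that.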
|
256
245
|
|
257
246
|
## More tips and gotchas about indexing steps
|
258
247
|
|
@@ -262,4 +251,4 @@ args for `each_record`.
|
|
262
251
|
|
263
252
|
* **Once you call `context.skip!(msg)` no more index steps will be run for that record**. So if you have any cleanup code, you'll need to make sure to call it yourself.
|
264
253
|
|
265
|
-
* **By default, `
|
254
|
+
* **By default, `traject` indexing runs multi-threaded**. In the current implementation, the indexing steps for one record are *not* split across threads, but different records can be processed simultaneously by more than one thread. That means you need to make sure your code is thread-safe (or always set `processing_thread_pool` to 0).
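To illustrate the thread-safety point, here is a small sketch in plain Ruby (no traject involved) of an indexing-step-like lambda that updates shared state. The counter and the thread counts are assumptions chosen for the example; the point is only that shared mutable state touched from a step should be guarded, for instance with a `Mutex`:

```ruby
# Shared state that several worker threads will update.
records_seen = 0
lock = Mutex.new

# A step-like lambda; the record and context arguments are unused here.
count_step = lambda do |_record, _context|
  # Guard the read-modify-write so concurrent calls cannot interleave.
  lock.synchronize { records_seen += 1 }
end

# Simulate four worker threads each processing 250 records.
workers = 4.times.map do
  Thread.new { 250.times { count_step.call(nil, nil) } }
end
workers.each(&:join)
# records_seen is now 1000
```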
|