traject 0.16.0 → 0.17.0
- checksums.yaml +7 -0
- data/.yardopts +1 -0
- data/README.md +183 -191
- data/bench/bench.rb +1 -1
- data/doc/batch_execution.md +14 -0
- data/doc/extending.md +14 -12
- data/doc/indexing_rules.md +265 -0
- data/lib/traject/command_line.rb +12 -41
- data/lib/traject/debug_writer.rb +32 -13
- data/lib/traject/indexer.rb +101 -24
- data/lib/traject/indexer/settings.rb +18 -17
- data/lib/traject/json_writer.rb +32 -11
- data/lib/traject/line_writer.rb +6 -6
- data/lib/traject/macros/basic.rb +1 -1
- data/lib/traject/macros/marc21.rb +17 -13
- data/lib/traject/macros/marc21_semantics.rb +27 -25
- data/lib/traject/macros/marc_format_classifier.rb +39 -25
- data/lib/traject/marc4j_reader.rb +36 -22
- data/lib/traject/marc_extractor.rb +79 -75
- data/lib/traject/marc_reader.rb +33 -25
- data/lib/traject/mock_reader.rb +9 -10
- data/lib/traject/ndj_reader.rb +7 -7
- data/lib/traject/null_writer.rb +1 -1
- data/lib/traject/qualified_const_get.rb +12 -2
- data/lib/traject/solrj_writer.rb +61 -52
- data/lib/traject/thread_pool.rb +45 -45
- data/lib/traject/translation_map.rb +59 -27
- data/lib/traject/util.rb +3 -3
- data/lib/traject/version.rb +1 -1
- data/lib/traject/yaml_writer.rb +1 -1
- data/test/debug_writer_test.rb +7 -7
- data/test/indexer/each_record_test.rb +4 -4
- data/test/indexer/macros_marc21_semantics_test.rb +12 -12
- data/test/indexer/macros_marc21_test.rb +10 -10
- data/test/indexer/macros_test.rb +1 -1
- data/test/indexer/map_record_test.rb +6 -6
- data/test/indexer/read_write_test.rb +43 -4
- data/test/indexer/settings_test.rb +2 -2
- data/test/indexer/to_field_test.rb +8 -8
- data/test/marc4j_reader_test.rb +4 -4
- data/test/marc_extractor_test.rb +33 -25
- data/test/marc_format_classifier_test.rb +3 -3
- data/test/marc_reader_test.rb +2 -2
- data/test/test_helper.rb +3 -3
- data/test/test_support/demo_config.rb +52 -48
- data/test/translation_map_test.rb +22 -4
- data/test/translation_maps/bad_ruby.rb +2 -2
- data/test/translation_maps/both_map.rb +1 -1
- data/test/translation_maps/default_literal.rb +1 -1
- data/test/translation_maps/default_passthrough.rb +1 -1
- data/test/translation_maps/ruby_map.rb +1 -1
- metadata +7 -31
- data/doc/macros.md +0 -103
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: ab462aadfb1252846b617cf1adb288eeb519b353
+  data.tar.gz: 7eac38dd8ac32e1dbfd417686ff04f95c108f011
+SHA512:
+  metadata.gz: 331350a2a93083b10710943e71bdf31b30bb3c6aeed9dde97f05fd232eaa34681a7ac0bcdf0d7aae9e37fd6ea7b9d3e4da1c840f036ed9abc6d57a01aea02e12
+  data.tar.gz: 381e2c56dc2b92e0b91330bf20275e47462b86c1301ef723f31297cae1702b1f2fb77e6b8a016cf9213879f4c593ce001b68e854649613f12ba9a96238dc9da2
data/.yardopts
CHANGED
data/README.md
CHANGED
@@ -1,11 +1,12 @@
|
|
1
1
|
# Traject
|
2
2
|
|
3
|
-
Tools for
|
3
|
+
Tools for reading MARC records, transforming them with indexing rules, and indexing to Solr.
|
4
|
+
Might be used to index MARC data for a Solr-based discovery product like [Blacklight](https://github.com/projectblacklight/blacklight) or [VUFind](http://vufind.org/).
|
4
5
|
|
5
|
-
|
6
|
-
them somewhere.
|
6
|
+
Traject might also be generalized to a set of tools for getting structured data from a source, and sending it to a destination.
|
7
7
|
|
8
|
-
|
8
|
+
|
9
|
+
**Traject is nearing 1.0. It is robust, feature-rich and ready for trial use.**
|
9
10
|
|
10
11
|
[![Gem Version](https://badge.fury.io/rb/traject.png)](http://badge.fury.io/rb/traject)
|
11
12
|
[![Build Status](https://travis-ci.org/jrochkind/traject.png)](https://travis-ci.org/jrochkind/traject)
|
@@ -13,23 +14,18 @@ them somewhere.
|
|
13
14
|
|
14
15
|
## Background/Goals
|
15
16
|
|
16
|
-
|
17
|
-
|
18
|
-
|
19
|
-
job with jruby (ruby on the JVM).
|
17
|
+
Initially by Jonathan Rochkind (Johns Hopkins Libraries) and Bill Dueber (University of Michigan Libraries).
|
18
|
+
|
19
|
+
Traject was born out of our experience with similar tools, including the very popular and useful [solrmarc](https://code.google.com/p/solrmarc/) by Bob Haschart; and Bill Dueber's own [marc2solr](http://github.com/billdueber/marc2solr/).
|
20
20
|
|
21
|
-
|
22
|
-
* **Support customization and flexibility**, common customization use cases, including simple local
|
23
|
-
logic, should be very easy. More sophisticated and even complex customization use cases should still be possible,
|
24
|
-
changing just the parts of traject you want to change.
|
25
|
-
* **Maintainable local logic**, supporting sharing of reusable logic via ruby gems.
|
26
|
-
* **Comprehensible internal logic**; well-covered by tests, well-factored separation of concerns,
|
27
|
-
easy for newcomer developers who know ruby to understand the codebase.
|
28
|
-
* **High performance**, using multi-threaded concurrency where appropriate to maximize throughput.
|
29
|
-
traject likely will provide higher throughput than other similar solutions.
|
30
|
-
* **Well-behaved shell script**, for painless integration in batch processes and cronjobs, with
|
31
|
-
exit codes, sufficiently flexible control of logging, proper use of stderr, etc.
|
21
|
+
We're comfortable programming (especially in a dynamic language), and want to be able to experiment with different indexing patterns quickly, easily, and testably, but are admittedly less comfortable in Java. To get a tool with the APIs and usage patterns convenient for us, we found we could do better in JRuby -- Ruby on the JVM.
|
32
22
|
|
23
|
+
* Basic configuration files can be easily written even by non-rubyists, with a few simple directives traject provides. But config files are 'ruby all the way down', so we can provide a gradual slope to more complex needs, with the full power of ruby.
|
24
|
+
* Easy to program, easy to read, easy to modify.
|
25
|
+
* Fast. Traject by default indexes using multiple threads, on multiple cpu cores.
|
26
|
+
* Composed of decoupled components, for flexibility and extensibility. The whole code base is only 6400 lines of code, more than a third of which is tests.
|
27
|
+
* Designed to support local code and configuration that's maintainable and testable, and can be shared between projects as ruby gems.
|
28
|
+
* Designed with batch execution in mind: flexible logging, good exit codes, good use of stdin/stdout/stderr.
|
33
29
|
|
34
30
|
|
35
31
|
## Installation
|
@@ -41,25 +37,30 @@ Then just `gem install traject`.
|
|
41
37
|
|
42
38
|
( **Note**: We may later provide an all-in-one .jar distribution, which does not require you to install or use jruby on your system. This is hypothetically possible. Is it a good idea?)
|
43
39
|
|
44
|
-
# Usage
|
45
40
|
|
46
|
-
## Configuration
|
41
|
+
## Configuration files
|
47
42
|
|
48
|
-
|
43
|
+
traject is configured using configuration files. To get a sense of what they look like, you can
|
44
|
+
take a look at our sample non-trivial configuration file,
|
45
|
+
[demo_config.rb](./test/test_support/demo_config.rb), which you'd run like
|
46
|
+
`traject -c path/to/demo_config.rb marc_file.marc`.
|
49
47
|
|
50
48
|
Configuration files are actually just ruby -- so by convention they end in `.rb`.
|
51
49
|
|
52
50
|
We hope you can write basic useful configuration files without being a ruby expert,
|
53
|
-
|
51
|
+
traject gives you some easy functions to use for common directives. But the full power
|
54
52
|
of ruby is available to you if needed.
|
55
53
|
|
56
54
|
**rubyist tip**: Technically, config files are executed with `instance_eval` in a Traject::Indexer instance, so the special commands you see are just methods on Traject::Indexer (or mixed into it). But you can
|
57
55
|
call ordinary ruby `require` in config files, etc., too, to load
|
58
56
|
external functionality. See more at Extending Logic below.
|
59
57
|
|
58
|
+
You can keep your settings and indexing rules in one config file,
|
59
|
+
or split them across multiple config files however you like. (Connection details vs indexing? Common things vs environmental specific things?)
|
60
|
+
|
60
61
|
There are two main categories of directives in your configuration files: _Settings_, and _Indexing Rules_.
|
61
62
|
|
62
|
-
|
63
|
+
## Settings
|
63
64
|
|
64
65
|
Settings are a flat list of key/value pairs, where the keys are always strings and the values usually are. They look like this
|
65
66
|
in a config file:
|
@@ -105,91 +106,58 @@ You can also use `store` if you want to force-set, last set wins.
|
|
105
106
|
See the docs page on [Settings](./doc/settings.md) for a list
|
106
107
|
of all standardized settings.
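The difference between setting a value normally and force-setting it with `store` can be modelled with a plain Hash. This is only a sketch of the semantics, assuming that `provide` leaves an already-set key alone (first set wins) while `store` always overwrites (last set wins). `SketchSettings` and the key `example.key` are hypothetical names used for illustration, not Traject's own settings class or a standardized setting.

```ruby
# Hypothetical stand-in for a settings object; not Traject's implementation.
class SketchSettings < Hash
  # Set the value only if the key has no value yet (first call wins).
  def provide(key, value)
    self[key.to_s] = value unless key?(key.to_s)
    self
  end

  # Always set the value, replacing any earlier one (last call wins).
  def store(key, value)
    self[key.to_s] = value
    self
  end
end

settings = SketchSettings.new
settings.provide("solr.url", "http://localhost:8983/solr")
settings.provide("solr.url", "http://example.org/solr") # ignored, already set
settings.store("example.key", "first")
settings.store("example.key", "second")                 # overrides "first"
```

Under this model, a value given early (for instance in the first config file loaded) survives later `provide` calls for the same key, which is what makes layering several config files predictable.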
|
107
108
|
|
108
|
-
### Indexing Rules
|
109
|
-
|
110
|
-
You can keep your settings and indexing rules in one config file,
|
111
|
-
or split them across multiple config files however you like. (Connection details vs indexing? Common things vs environmental specific things?)
|
112
109
|
|
113
|
-
|
114
|
-
Which can be used with a few standard functions.
|
110
|
+
## Indexing rules: Let's start with `to_field` and `extract_marc`
|
115
111
|
|
116
|
-
|
117
|
-
|
118
|
-
|
119
|
-
# The first argument, 'source' in this case, is what Solr field we're
|
120
|
-
# sending to. And the 'literal' function supplies a hard-coded
|
121
|
-
# constant string literal.
|
122
|
-
to_field "source", literal("LIB_CATALOG")
|
123
|
-
|
124
|
-
# you can call 'to_field' multiple times, additional values
|
125
|
-
# are concatenated
|
126
|
-
to_field "source", literal("ANOTHER ONE")
|
127
|
-
|
128
|
-
# Serialize the marc record back out and
|
129
|
-
# put it in a solr field.
|
130
|
-
to_field "marc_record", serialized_marc(:format => "xml")
|
131
|
-
|
132
|
-
# or :format => "json" for marc-in-json
|
133
|
-
# or :format => "binary", by default Base64-encoded for Solr
|
134
|
-
# 'binary' field, or, for more like what SolrMarc did, without
|
135
|
-
# escaping:
|
136
|
-
to_field "marc_record_raw", serialized_marc(:format => "binary", :binary_escape => false)
|
137
|
-
|
138
|
-
# Take ALL of the text from the marc record, useful for
|
139
|
-
# a catch-all field. Actually by default only takes
|
140
|
-
# from tags 100 to 899.
|
141
|
-
to_field "text", extract_all_marc_values
|
142
|
-
|
143
|
-
# Now we have a simple example of the general utility function
|
144
|
-
# `extract_marc`
|
145
|
-
to_field "id", extract_marc("001", :first => true)
|
146
|
-
~~~
|
112
|
+
There are a few methods that can be used to create indexing rules, but the
|
113
|
+
one you'll most commonly use is called `to_field`, and establishes a rule
|
114
|
+
to extract content to a particular named output field.
|
147
115
|
|
148
|
-
|
149
|
-
|
150
|
-
*whenever you have a non-multi-valued solr field* even if you think "There should only be one 001 field anyway!", to deal with unexpected
|
151
|
-
data properly.
|
116
|
+
The extraction rule can use built-in 'macros', or, as we'll see later,
|
117
|
+
entirely custom logic.
|
152
118
|
|
153
|
-
|
119
|
+
The built-in macro you'll use the most is `extract_marc`, to extract
|
120
|
+
data out of a MARC record according to a tag/subfield specification.
|
154
121
|
|
155
122
|
~~~ruby
|
156
|
-
|
157
|
-
|
158
|
-
|
159
|
-
|
160
|
-
|
161
|
-
|
162
|
-
|
163
|
-
|
164
|
-
|
165
|
-
|
166
|
-
|
167
|
-
|
123
|
+
# Take the value of the first 001 field, and put
|
124
|
+
# it in output field 'id', to be indexed in Solr
|
125
|
+
# field 'id'
|
126
|
+
to_field "id", extract_marc("001", :first => true)
|
127
|
+
|
128
|
+
# 245 subfields n, p, and s. 130, all subfields.
|
129
|
+
# built-in punctuation trimming routine.
|
130
|
+
to_field "title_t", extract_marc("245nps:130", :trim_punctuation => true)
|
131
|
+
|
132
|
+
# Can limit to certain indicators with || chars.
|
133
|
+
# "*" is a wildcard in indicator spec. So
|
134
|
+
# 856 with first indicator '0', subfield u.
|
135
|
+
to_field "email_addresses", extract_marc("856|0*|u")
|
136
|
+
|
137
|
+
# Can list tag twice with different field combinations
|
138
|
+
# to extract separately
|
139
|
+
to_field "isbn", extract_marc("245a:245abcde")
|
140
|
+
|
141
|
+
# For MARC Control ('fixed') fields, you can optionally
|
142
|
+
# use square brackets to take a byte offset.
|
143
|
+
to_field "language_code", extract_marc("008[35-37]")
|
168
144
|
~~~
|
169
145
|
|
170
|
-
|
171
|
-
|
172
|
-
|
146
|
+
`extract_marc` by default includes all 'alternate script' linked fields corresponding
|
147
|
+
to matched specifications, but you can turn that off, or extract *only* corresponding
|
148
|
+
880s.
|
173
149
|
|
174
|
-
|
175
|
-
|
176
|
-
with single subfields (like "020a") will split subfields and produce
|
177
|
-
an output string for each matching subfield.
|
150
|
+
to_field "title", extract_marc("245abc", :alternate_script => false)
|
151
|
+
to_field "title_vernacular", extract_marc("245abc", :alternate_script => :only)
|
178
152
|
|
179
|
-
|
180
|
-
brackets to take a slice by byte offset.
|
153
|
+
By default, specifications with multiple subfields (like "240abc") will produce one single string of output for each matching field. Specifications with single subfields (like "020a") will split subfields and produce an output string for each matching subfield.
|
181
154
|
|
182
|
-
|
183
|
-
|
184
|
-
~~~
|
185
|
-
|
186
|
-
For more information on extraction specifications, see
|
187
|
-
the [MarcExtractor class](./lib/traject/marc_extractor.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/MarcExtractor)).
|
155
|
+
For the syntax and complete possibilities of the specification
|
156
|
+
string argument to extract_marc, see docs at the [MarcExtractor class](./lib/traject/marc_extractor.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/MarcExtractor)).
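To make the specification strings used in these examples concrete, here is a rough sketch of how one element such as `245abc`, `856|0*|u` or `008[35-37]` could be broken into its parts. `parse_spec_element` and `parse_spec` are hypothetical helpers written for illustration only. They are not Traject's parser, and the MarcExtractor docs remain the authority on the real syntax.

```ruby
# Hypothetical pattern: tag, optional |ind1 ind2|, optional [byte range],
# optional subfield codes. Illustration only, not Traject's own grammar.
SPEC_PATTERN = /\A(\d{3})(?:\|([a-z0-9 *])([a-z0-9 *])\|)?(?:\[(\d+)(?:-(\d+))?\])?([a-z0-9]*)\z/

# Split one spec element into a hash of its parts.
def parse_spec_element(element)
  match = SPEC_PATTERN.match(element)
  raise ArgumentError, "unrecognised spec element: #{element}" unless match

  tag, ind1, ind2, first_byte, last_byte, subfields = match.captures

  bytes = nil
  if first_byte
    last_byte ||= first_byte
    bytes = (first_byte.to_i..last_byte.to_i)
  end

  {
    :tag        => tag,
    # "*" is the wildcard; it is represented here as nil (matches anything)
    :indicator1 => (ind1 == "*" ? nil : ind1),
    :indicator2 => (ind2 == "*" ? nil : ind2),
    # no subfield codes means "all subfields", represented as nil
    :subfields  => (subfields.empty? ? nil : subfields.chars),
    :bytes      => bytes
  }
end

# A full specification is a colon-separated list of elements.
def parse_spec(spec)
  spec.split(":").map { |element| parse_spec_element(element) }
end
```

For example, `parse_spec("245nps:130")` yields two elements in this sketch, the second with `:subfields` left as `nil` to stand for "all subfields".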
|
188
157
|
|
189
158
|
`extract_marc` also supports `translation maps` similar
|
190
159
|
to SolrMarc's. There are some translation maps provided by traject,
|
191
|
-
and you can also define your own.
|
192
|
-
in yaml or ruby. Translation maps are especially useful
|
160
|
+
and you can also define your own, in yaml or ruby. Translation maps are especially useful
|
193
161
|
for mapping from MARC codes to user-displayable strings:
|
194
162
|
|
195
163
|
~~~ruby
|
@@ -198,131 +166,152 @@ for mapping form MARC codes to user-displayable strings:
|
|
198
166
|
to_field "language", extract_marc("008[35-37]:041a:041d", :translation_map => "marc_language_code")
|
199
167
|
~~~
|
200
168
|
|
201
|
-
|
169
|
+
To see all options for `extract_marc`, see the [method documentation](http://rdoc.info/gems/traject/Traject/Macros/Marc21:extract_marc)
|
202
170
|
|
203
|
-
|
171
|
+
## Other built-in utility macros
|
204
172
|
|
205
|
-
|
173
|
+
Other built-in methods that can be used with `to_field` include a hard-coded
|
174
|
+
literal string:
|
206
175
|
|
207
|
-
|
208
|
-
indexing functionality, which you can always drop down to, and
|
209
|
-
which is used to build the macros. The basic use of `to_field`,
|
210
|
-
with directly specified logic instead of using a macro, looks like this:
|
176
|
+
to_field "source", literal("LIB_CATALOG")
|
211
177
|
|
212
|
-
|
213
|
-
to_field "source" do |record, accumulator, context|
|
214
|
-
accumulator << "LIB CATALOG"
|
215
|
-
end
|
216
|
-
~~~~
|
178
|
+
The current record serialized back out as MARC, in binary, XML, or json:
|
217
179
|
|
218
|
-
|
180
|
+
# or :format => "json" for marc-in-json
|
181
|
+
# or :format => "binary", by default Base64-encoded for Solr
|
182
|
+
# 'binary' field, or, for more like what SolrMarc did, without
|
183
|
+
# escaping:
|
184
|
+
to_field "marc_record_raw", serialized_marc(:format => "binary", :binary_escape => false, :allow_oversized => true)
|
219
185
|
|
220
|
-
|
221
|
-
used to define a block of logic that can be stored and executed later. When the block is called, first argument (`record` above) is the marc_record being indexed (a ruby-marc MARC::Record object), and the second argument (`accumulator`) is a ruby array used to accumulate output values.
|
186
|
+
Text of all fields in a range:
|
222
187
|
|
223
|
-
|
224
|
-
be used for more advanced functionality, including caching expensive
|
225
|
-
per-record calculations, writing out to more than one output field at a time, or taking account of current Traject Settings in your logic. The third argument is optional, you can supply
|
226
|
-
a two-argument block too.
|
188
|
+
to_field "text", extract_all_marc_values(:from => 100, :to => 899)
|
227
189
|
|
228
|
-
You can always drop out to this basic direct use whenever you need
|
229
|
-
special purpose logic, directly in the config file, writing in
|
230
|
-
ruby:
|
231
190
|
|
232
|
-
|
233
|
-
# this is more or less nonsense, just an example
|
234
|
-
to_field "weird_title" do |record, accumulator, context|
|
235
|
-
field = record['245']
|
236
|
-
title = field['a']
|
237
|
-
title.upcase! if field.indicator1 == '1'
|
238
|
-
accumulator << title
|
239
|
-
end
|
191
|
+
All of these methods are defined at [Traject::Macros::Marc21](./lib/traject/macros/marc21.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Macros/Marc21))
|
240
192
|
|
241
|
-
|
242
|
-
# marc_extract does, you may want to use the Traject::MarcExtractor
|
243
|
-
# class
|
244
|
-
to_field "weirdo" do |record, accumulator, context|
|
245
|
-
# use MarcExtractor.cached for performance, globally
|
246
|
-
# caching the MarcExtractor we create. See docs
|
247
|
-
# at MarcExtractor.
|
248
|
-
list = MarcExtractor.cached("700a").extract(record)
|
193
|
+
## More complex canned MARC semantic logic
|
249
194
|
|
250
|
-
|
251
|
-
|
195
|
+
Some more complex (and opinionated/subjective) algorithms for deriving semantics
|
196
|
+
from Marc are also packaged with Traject, but not available by default. To make
|
197
|
+
them available to your indexing, you just need to use ruby `require` and `extend`.
|
252
198
|
|
253
|
-
|
254
|
-
end
|
255
|
-
~~~
|
199
|
+
A number of methods are in [Traject::Macros::Marc21Semantics](./lib/traject/macros/marc21_semantics.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Macros/Marc21Semantics))
|
256
200
|
|
257
|
-
|
258
|
-
|
259
|
-
in our block will start out with the values left by
|
260
|
-
the `extract_marc`:
|
201
|
+
require 'traject/macros/marc21_semantics'
|
202
|
+
extend Traject::Macros::Marc21Semantics
|
261
203
|
|
262
|
-
|
263
|
-
to_field
|
264
|
-
|
265
|
-
|
266
|
-
|
267
|
-
|
204
|
+
to_field 'title_sort', marc_sortable_title
|
205
|
+
to_field 'broad_subject', marc_lcc_to_broad_category
|
206
|
+
to_field "geographic_facet", marc_geo_facet
|
207
|
+
# And several more
|
208
|
+
|
209
|
+
And, there's a routine for classifying MARC to an internal
|
210
|
+
format/genre/type vocabulary:
|
211
|
+
|
212
|
+
require 'traject/macros/marc_format_classifier'
|
213
|
+
extend Traject::Macros::MarcFormats
|
214
|
+
|
215
|
+
to_field 'format_facet', marc_formats
|
216
|
+
|
217
|
+
|
218
|
+
## Custom logic
|
219
|
+
|
220
|
+
The built-in routines are there for your convenience, but if you need
|
221
|
+
something local or custom, you can write ruby logic directly
|
222
|
+
in a configuration file, using a ruby block, which looks like this:
|
223
|
+
|
224
|
+
to_field "id" do |record, accumulator|
|
225
|
+
# take the record's 001, prefix it with "bib_",
|
226
|
+
# and then add it to the 'accumulator' argument,
|
227
|
+
# to send it to the specified output field
|
228
|
+
value = record['001'].value
|
229
|
+
value = "bib_#{value}"
|
230
|
+
accumulator << value
|
231
|
+
end
|
232
|
+
|
233
|
+
`do |record, accumulator|` is the definition of a ruby block taking
|
234
|
+
two arguments. The first one passed in will be a MARC record. The
|
235
|
+
second is an array; you add values to the array to send them to
|
236
|
+
output.
|
237
|
+
|
238
|
+
Here's a more realistic example that shows how you'd get the
|
239
|
+
record type byte 06 out of a MARC leader, then translate it
|
240
|
+
to a human-readable string with a TranslationMap
|
241
|
+
|
242
|
+
to_field "marc_type" do |record, accumulator|
|
243
|
+
leader06 = record.leader.byteslice(6)
|
244
|
+
# this translation map doesn't actually exist, but could
|
245
|
+
accumulator << TranslationMap.new("marc_leader")[ leader06 ]
|
246
|
+
end
|
268
247
|
|
269
|
-
|
270
|
-
|
271
|
-
|
248
|
+
You can also add a block onto the end of a built-in 'macro', to
|
249
|
+
further customize the output. The `accumulator` passed to your block
|
250
|
+
will already have values in it from the first step, and you can
|
251
|
+
use ruby methods like `map!` to modify it:
|
272
252
|
|
273
|
-
|
253
|
+
to_field "big_title", extract_marc("245abcdefg") do |record, accumulator|
|
254
|
+
# put it all in all uppercase, I don't know why.
|
255
|
+
accumulator.map! {|v| v.upcase}
|
256
|
+
end
|
274
257
|
|
275
|
-
There
|
276
|
-
|
277
|
-
|
258
|
+
There are many more things you can do with custom logic blocks like this too,
|
259
|
+
including additional features we haven't discussed yet.
|
260
|
+
|
261
|
+
If you find yourself repeating boilerplate code in your custom logic, you can
|
262
|
+
even create your own 'macros' (like `extract_marc`). `extract_marc` and other
|
263
|
+
macros are nothing more than methods that return ruby lambda objects of
|
264
|
+
the same format as the blocks you write for custom logic.
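The idea that a macro is just a method returning a lambda can be sketched without Traject at all. In the sketch below, `MiniIndexer`, `fixed_value`, `take_key` and the Hash used as a record are hypothetical stand-ins chosen for illustration. The real `to_field` and the real macros live in Traject::Indexer and Traject::Macros, and may work differently in detail.

```ruby
# Simplified model of to_field and macros: illustration only, not Traject's code.
class MiniIndexer
  def initialize
    @rules = []
  end

  # Register a rule: an optional lambda (e.g. from a macro) and/or a block.
  def to_field(name, macro = nil, &block)
    @rules << [name, [macro, block].compact]
  end

  # Run every rule against one record and collect the output hash.
  def map_record(record)
    @rules.each_with_object({}) do |(name, steps), output|
      accumulator = []
      # each step sees the values left in the accumulator by the previous one
      steps.each { |step| step.call(record, accumulator) }
      (output[name] ||= []).concat(accumulator) unless accumulator.empty?
    end
  end

  # A "macro": a method returning a lambda that takes the same
  # (record, accumulator) arguments as a custom logic block.
  def fixed_value(value)
    lambda { |_record, accumulator| accumulator << value }
  end

  # Another hypothetical macro: copy the values stored under one key.
  def take_key(key)
    lambda { |record, accumulator| accumulator.concat(Array(record[key])) }
  end
end

indexer = MiniIndexer.new
# instance_eval mimics how a config file's directives are evaluated
indexer.instance_eval do
  to_field "source", fixed_value("LIB_CATALOG")
  to_field "big_title", take_key("title") do |_record, accumulator|
    accumulator.map! { |v| v.upcase }
  end
end

output = indexer.map_record({ "title" => "Moby Dick" })
```

Because the macro's lambda and the trailing block share one accumulator, the block can post-process whatever the macro extracted, which is the same pattern as the `big_title` rule in the README.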
|
265
|
+
|
266
|
+
For tips, gotchas, and a more complete explanation of how this works, see
|
267
|
+
the additional documentation page on [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md)
|
268
|
+
|
269
|
+
## each_record and after_processing
|
270
|
+
|
271
|
+
In addition to `to_field`, an `each_record` method is available, which,
|
272
|
+
like `to_field`, is executed for every record, but without being tied
|
273
|
+
to a specific field.
|
274
|
+
|
275
|
+
`each_record` can be used for logging or notifying; computing intermediate
|
276
|
+
results; or writing to more than one field at once.
|
278
277
|
|
279
278
|
~~~ruby
|
280
|
-
each_record do |record
|
281
|
-
|
282
|
-
(x, y) = Something.do_stuff
|
283
|
-
(context["one_field"] ||= []) << x
|
284
|
-
(context["another_field"] ||= []) << y
|
279
|
+
each_record do |record|
|
280
|
+
some_custom_logging(record)
|
285
281
|
end
|
286
282
|
~~~
|
287
283
|
|
288
|
-
|
289
|
-
such a macro take the field names it will effect as arguments (example?)
|
284
|
+
For more on `each_record`, see the documentation page on [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md).
|
290
285
|
|
291
|
-
|
292
|
-
|
286
|
+
There is also an `after_processing` method that can be used to register
|
287
|
+
logic that will be called after all records have been processed. You can use it for whatever custom
|
288
|
+
ruby code you might want for your app (send an email? Clean up a log file? Trigger
|
289
|
+
a Solr replication?)
|
293
290
|
|
294
291
|
~~~ruby
|
295
|
-
|
296
|
-
|
297
|
-
|
292
|
+
after_processing do
|
293
|
+
whatever_ruby_code
|
294
|
+
end
|
298
295
|
~~~
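How `each_record` and `after_processing` relate in time can be shown with a small model: per-record callbacks run once for every record, and the after-processing callbacks run once at the very end. `MiniRunner` and its `process` method are hypothetical names for this sketch and say nothing about how Traject schedules the real callbacks internally.

```ruby
# Simplified model of callback ordering: illustration only, not Traject's code.
class MiniRunner
  def initialize
    @each_record_steps = []
    @after_processing_steps = []
  end

  # Register logic to run for every record.
  def each_record(&block)
    @each_record_steps << block
  end

  # Register logic to run once, after all records are done.
  def after_processing(&block)
    @after_processing_steps << block
  end

  def process(records)
    records.each do |record|
      @each_record_steps.each { |step| step.call(record) }
    end
    @after_processing_steps.each(&:call)
  end
end

seen = []
events = []

runner = MiniRunner.new
runner.each_record { |record| seen << record["id"] }
runner.after_processing { events << "processed #{seen.length} records" }
runner.process([{ "id" => "1" }, { "id" => "2" }])
```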
|
299
296
|
|
300
|
-
#### Sample config
|
301
297
|
|
302
|
-
|
298
|
+
## Writers
|
303
299
|
|
304
|
-
|
300
|
+
Traject uses modular 'Writer' classes to take the output hashes from transformation, and
|
301
|
+
send them somewhere or do something useful with them.
|
305
302
|
|
306
|
-
|
307
|
-
|
308
|
-
|
303
|
+
By default traject uses the [Traject::SolrJWriter](lib/traject/solrj_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/SolrJWriter)) to send to Solr for indexing.
|
304
|
+
A couple other writers are available too, mostly for debugging purposes:
|
305
|
+
[Traject::DebugWriter](lib/traject/debug_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/DebugWriter))
|
306
|
+
and [Traject::JsonWriter](lib/traject/json_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/JsonWriter))
|
309
307
|
|
310
|
-
|
311
|
-
|
308
|
+
You set which writer is being used in settings (`provide "writer_class_name", "Traject::DebugWriter"`),
|
309
|
+
or on the command-line as a shortcut with `-w Traject::DebugWriter`.
|
312
310
|
|
313
|
-
|
314
|
-
|
315
|
-
require 'traject/macros/marc21_semantics'
|
316
|
-
extend Traject::Macros::Marc21Semantics
|
317
|
-
|
318
|
-
to_field "date", marc_publication_date
|
319
|
-
to_field "author_sort", marc_sortable_author
|
320
|
-
to_field "inst_facet", marc_instrumentation_humanized
|
321
|
-
~~~
|
311
|
+
You can write your own Readers and Writers if you'd like, see comments at top
|
312
|
+
of [Traject::Indexer](lib/traject/indexer.rb).
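As a rough idea of what a small custom writer might look like, the sketch below assumes a writer is constructed with the settings, receives one call per record, and is closed at the end. The method names `put` and `close`, the plain output hash passed to `put`, and the class name `LineCountingWriter` are all assumptions made for illustration. Check the comments at the top of Traject::Indexer for the interface your traject version actually expects.

```ruby
require "stringio"

# Hypothetical writer: prints one line per record and counts them.
# The interface shown here is an assumption, not Traject's documented API.
class LineCountingWriter
  attr_reader :count

  def initialize(settings, io = $stdout)
    @settings = settings
    @io = io
    @count = 0
  end

  # Receive the output hash for one record (field name => array of values).
  def put(output_hash)
    line = output_hash.map { |field, values| "#{field}=#{Array(values).join('|')}" }.join(" ")
    @io.puts(line)
    @count += 1
  end

  # Called once when processing is finished.
  def close
    @io.flush
  end
end

buffer = StringIO.new
writer = LineCountingWriter.new({}, buffer)
writer.put({ "id" => ["1"], "title_t" => ["Moby Dick"] })
writer.close
```

Taking the IO as a constructor argument is a design choice for this sketch: it lets the writer be exercised against a `StringIO` in tests instead of real stdout.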
|
322
313
|
|
323
|
-
|
324
|
-
|
325
|
-
## Command Line
|
314
|
+
## The traject command line
|
326
315
|
|
327
316
|
The simplest invocation is:
|
328
317
|
|
@@ -363,7 +352,7 @@ Use `-u` as a shortcut for `s solr.url=X`
|
|
363
352
|
|
364
353
|
Run `traject -h` to see the command line help screen listing all available options.
|
365
354
|
|
366
|
-
Also see `-I load_path` and
|
355
|
+
Also see `-I load_path` option and suggestions for Bundler use under Extending With Your Own Code.
|
367
356
|
|
368
357
|
See also [Hints for batch and cronjob use](./doc/batch_execution.md) of traject.
|
369
358
|
|
@@ -396,9 +385,9 @@ Own Code](./doc/extending.md)
|
|
396
385
|
* translation map files found on the load path or in a
|
397
386
|
"./translation_maps" subdir on the load path will be found
|
398
387
|
for Traject translation maps.
|
399
|
-
*
|
400
|
-
|
401
|
-
|
388
|
+
* Use [Bundler](http://bundler.io/) with traject simply by creating a Gemfile with `bundler init`,
|
389
|
+
and then running command line with `bundle exec traject` or
|
390
|
+
even `BUNDLE_GEMFILE=path/to/Gemfile bundle exec traject`
|
402
391
|
|
403
392
|
## More
|
404
393
|
|
@@ -423,6 +412,9 @@ this with other developers first!)
|
|
423
412
|
Pull requests should come with tests, as well as docs where applicable. Docs can be inline rdoc-style, edits to this README,
|
424
413
|
and/or extra files in ./doc -- as appropriate for what needs to be documented.
|
425
414
|
|
415
|
+
**Inline api docs** Note that our [`.yardopts` file](./.yardopts) used by rdoc.info to generate
|
416
|
+
online api docs has a `--markup markdown` specified -- inline class/method docs are in markdown, not rdoc.
|
417
|
+
|
426
418
|
## TODO
|
427
419
|
|
428
420
|
|