traject 2.3.4 → 3.0.0.alpha.1
- checksums.yaml +5 -5
- data/.travis.yml +16 -9
- data/CHANGES.md +74 -1
- data/Gemfile +2 -1
- data/README.md +104 -53
- data/Rakefile +8 -1
- data/doc/indexing_rules.md +79 -63
- data/doc/programmatic_use.md +218 -0
- data/doc/settings.md +28 -1
- data/doc/xml.md +134 -0
- data/lib/traject.rb +5 -0
- data/lib/traject/array_writer.rb +34 -0
- data/lib/traject/command_line.rb +18 -22
- data/lib/traject/debug_writer.rb +2 -5
- data/lib/traject/experimental_nokogiri_streaming_reader.rb +276 -0
- data/lib/traject/hashie/indifferent_access_fix.rb +25 -0
- data/lib/traject/indexer.rb +321 -92
- data/lib/traject/indexer/context.rb +39 -13
- data/lib/traject/indexer/marc_indexer.rb +30 -0
- data/lib/traject/indexer/nokogiri_indexer.rb +30 -0
- data/lib/traject/indexer/settings.rb +36 -53
- data/lib/traject/indexer/step.rb +27 -33
- data/lib/traject/macros/marc21.rb +37 -12
- data/lib/traject/macros/nokogiri_macros.rb +43 -0
- data/lib/traject/macros/transformation.rb +162 -0
- data/lib/traject/marc_extractor.rb +2 -0
- data/lib/traject/ndj_reader.rb +1 -1
- data/lib/traject/nokogiri_reader.rb +179 -0
- data/lib/traject/oai_pmh_nokogiri_reader.rb +159 -0
- data/lib/traject/solr_json_writer.rb +19 -12
- data/lib/traject/thread_pool.rb +13 -0
- data/lib/traject/util.rb +14 -2
- data/lib/traject/version.rb +1 -1
- data/test/debug_writer_test.rb +3 -3
- data/test/delimited_writer_test.rb +3 -3
- data/test/experimental_nokogiri_streaming_reader_test.rb +169 -0
- data/test/indexer/context_test.rb +23 -13
- data/test/indexer/error_handler_test.rb +59 -0
- data/test/indexer/macros/macros_marc21_semantics_test.rb +46 -46
- data/test/indexer/macros/marc21/extract_all_marc_values_test.rb +1 -1
- data/test/indexer/macros/marc21/extract_marc_test.rb +19 -9
- data/test/indexer/macros/marc21/serialize_marc_test.rb +4 -4
- data/test/indexer/macros/to_field_test.rb +2 -2
- data/test/indexer/macros/transformation_test.rb +177 -0
- data/test/indexer/map_record_test.rb +2 -3
- data/test/indexer/nokogiri_indexer_test.rb +103 -0
- data/test/indexer/process_record_test.rb +55 -0
- data/test/indexer/process_with_test.rb +148 -0
- data/test/indexer/read_write_test.rb +52 -2
- data/test/indexer/settings_test.rb +34 -24
- data/test/indexer/to_field_test.rb +27 -2
- data/test/marc_extractor_test.rb +7 -7
- data/test/marc_reader_test.rb +4 -4
- data/test/nokogiri_reader_test.rb +158 -0
- data/test/oai_pmh_nokogiri_reader_test.rb +23 -0
- data/test/solr_json_writer_test.rb +24 -28
- data/test/test_helper.rb +8 -2
- data/test/test_support/namespace-test.xml +7 -0
- data/test/test_support/nokogiri_demo_config.rb +17 -0
- data/test/test_support/oai-pmh-one-record-2.xml +24 -0
- data/test/test_support/oai-pmh-one-record-first.xml +24 -0
- data/test/test_support/sample-oai-no-namespace.xml +197 -0
- data/test/test_support/sample-oai-pmh.xml +197 -0
- data/test/thread_pool_test.rb +38 -0
- data/test/translation_map_test.rb +3 -3
- data/test/translation_maps/ruby_map.rb +2 -1
- data/test/translation_maps/yaml_map.yaml +2 -1
- data/traject.gemspec +4 -11
- metadata +92 -6
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
|
-
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
2
|
+
SHA256:
|
3
|
+
metadata.gz: 176864633191a53e32c9563227072f16d83e8bc27f2e0d1c9a436fc3d281fc21
|
4
|
+
data.tar.gz: 468120d2066634a98aa18438820e24e4bb099125886bfca86f61e53d080070a3
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 156d479792897beac99ecbc84a6da06470b7baf1edb1f4b3008deb3e26621fca1723112bb2e46f20406c320b486a679b8c2247f77283a358f6f17f7b4444c7e4
|
7
|
+
data.tar.gz: 6e9e0f3c80ca6c2a36adb455a416700f2008cffb5fb66093be1f4e1fb3828a236da4bdf7e1fd26e9d3762b8554c544779138d8636fba6241d3ae682b2df468d9
|
data/.travis.yml
CHANGED
@@ -1,12 +1,19 @@
|
|
1
1
|
language: ruby
|
2
2
|
cache: bundler
|
3
|
-
sudo:
|
3
|
+
# we don't really need `sudo: true`, but for some reason travis docker-based systems are unreliable
|
4
|
+
# at downloading jruby, and
|
5
|
+
sudo: true
|
4
6
|
rvm:
|
5
|
-
-
|
6
|
-
-
|
7
|
-
- 1
|
8
|
-
- 2.
|
9
|
-
|
10
|
-
|
11
|
-
|
12
|
-
|
7
|
+
- 2.3.6
|
8
|
+
- 2.4.3
|
9
|
+
- 2.5.1
|
10
|
+
- "2.6.0-preview2"
|
11
|
+
# avoid having travis install jdk on MRI builds where we don't need it.
|
12
|
+
matrix:
|
13
|
+
include:
|
14
|
+
- jdk: openjdk8
|
15
|
+
rvm: jruby-9.0.5.0
|
16
|
+
- jdk: openjdk8
|
17
|
+
rvm: jruby-9.2.0.0
|
18
|
+
allow_failures:
|
19
|
+
- rvm: "2.6.0-preview2"
|
data/CHANGES.md
CHANGED
@@ -1,5 +1,78 @@
|
|
1
1
|
# Changes
|
2
2
|
|
3
|
+
## 3.0.0
|
4
|
+
|
5
|
+
### Changed/Backwards Incompatibilities
|
6
|
+
|
7
|
+
* JRuby traject no longer includes `traject-marc4j_reader` as a dependency or default reader, although it may provide faster MARC-XML reading on JRuby. To use it manually, see https://github.com/traject/traject-marc4j_reader . See https://github.com/traject/traject/pull/187
|
8
|
+
|
9
|
+
* `map_record` now returns `nil` if record was skipped.
|
10
|
+
|
11
|
+
* The `Traject::Indexer` class no longer includes marc-specific settings and modules.
|
12
|
+
* If you are using command-line `traject`, this should make no difference to you, as command-line now defaults to the new `Traject::Indexer::MarcIndexer` with those removed things.
|
13
|
+
* If you are using Traject::Indexer programmatically and want those features, switch to using `Traject::Indexer::MarcIndexer`.
|
14
|
+
* If necessary, as a hopefully temporary backwards compat shim, call `Traject::Indexer.legacy_marc_mode!`, which injects the old marc-specific behavior into Traject::Indexer again, globally and permanently.
|
15
|
+
|
16
|
+
* Traject::Indexer::Settings no longer has its own global defaults. Instead, it can be given a set of defaults with #with_defaults, usually right after instantiation, to support different defaults for different Indexers.
|
17
|
+
|
18
|
+
* SolrJsonWriter now assumes an /update/json convenience url is available in solr instead of trying to verify it. If you are using an older Solr (before 4?) or otherwise want a different update url, just use setting `solr.update_url`
|
19
|
+
|
20
|
+
|
21
|
+
### Added
|
22
|
+
|
23
|
+
* Traject::Indexer#configure is available, and recommended instead of raw `instance_eval`. It just does an instance_eval, but is clearer and safer for future changes.
|
24
|
+
|
25
|
+
* traject command line can now take multiple input files. And underlying it, Traject::Indexer#process can take an array of input streams.
|
26
|
+
|
27
|
+
* There is now a built-in mode for XML source records, see docs at [xml.md](./doc/xml.md)
|
28
|
+
|
29
|
+
* new setting `mapping_rescue` is available, to supply custom logic for handling errors. See docs at [settings.md](./doc/settings.md)
|
30
|
+
|
31
|
+
* Call Traject::ThreadPool.disable_concurrency! to force all pool sizes to be 0, and work to be performed inline. All threading will be disabled.
|
32
|
+
|
33
|
+
* `to_field` can now take an array as a first argument, to send values to multiple fields mentioned, eg:
|
34
|
+
|
35
|
+
to_field ["field1", "field2"], extract_marc("240")
|
36
|
+
|
37
|
+
* `to_field` can take multiple transformation procs (all with the same form). https://github.com/traject/traject/pull/153
|
38
|
+
|
39
|
+
* There is a new set of standard transformation macros included in `Traject::Indexer`, from [Traject::Macros::Transformation](./lib/traject/macros/transformation.rb). It includes an extraction of previous/existing arguments from `extract_marc`, along with some additional macros. https://github.com/traject/traject/pull/154
|
40
|
+
* This is the new preferred way to do post-processing with the `extract_marc` options, but the existing options are not deprecated and there is no current plan for them to be removed.
|
41
|
+
* before:
|
42
|
+
|
43
|
+
to_field "some_field", extract_marc("800",
|
44
|
+
translation_map: "marc_800_map",
|
45
|
+
allow_duplicates: true,
|
46
|
+
first: true,
|
47
|
+
default: "default value")
|
48
|
+
* now preferred:
|
49
|
+
|
50
|
+
to_field "some_field", extract_marc("800", allow_duplicates: true),
|
51
|
+
translation_map("marc_800_map"),
|
52
|
+
first_only,
|
53
|
+
default("default value")
|
54
|
+
|
55
|
+
(still need `allow_duplicates: true` cause extract_marc defaults to false, but see also `unique` macro)
|
56
|
+
|
57
|
+
* So, these transformation steps can now be used with non-MARC formats as well. See also new transformation macros: `strip`, `split`, `append`, `prepend`, `gsub`, and `transform`. And for MARC use, `trim_punctuation`.
|
58
|
+
|
59
|
+
|
60
|
+
* Traject::Indexer new api, for more convenient programmatic/embedded use.
|
61
|
+
|
62
|
+
* `Traject::Indexer.new` takes a block for config
|
63
|
+
|
64
|
+
* `Traject::Indexer#process_record`
|
65
|
+
|
66
|
+
* `Traject::Indexer#process_with`
|
67
|
+
|
68
|
+
* `Traject::Indexer#complete` and `#run_after_processing_steps` public API.
|
69
|
+
|
70
|
+
* `Traject::SolrJsonWriter#flush`, flush to solr without closing, may be useful for direct programmatic use.
|
71
|
+
|
72
|
+
* Traject::Indexer sub-classes can implement a #source_record_id_proc, which is passed to Context, for source-format-specific logic for getting an ID to use in logging.
|
73
|
+
|
74
|
+
* command line takes an `-i` flag for choice of indexer.
|
75
|
+
|
3
76
|
## 2.3.4
|
4
77
|
* Totally internal change to provide easier hooks into indexing process
|
5
78
|
|
@@ -35,7 +108,7 @@
|
|
35
108
|
|
36
109
|
## 2.2.1
|
37
110
|
* Had inadvertently broken use of arrays as extract_marc specifications. Fixed.
|
38
|
-
|
111
|
+
|
39
112
|
## 2.2.0
|
40
113
|
* Change DebugWriter to be more forgiving (and informative) about missing record-id fields
|
41
114
|
* Automatically require DebugWriter for easier use on the command line
|
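The CHANGES.md entries above say that `to_field` can now take several transformation procs, applied in order. A minimal plain-Ruby sketch of that idea follows. It does not use traject itself: `extract_key`, `strip_values`, `first_value_only`, `default_value`, and `run_steps` are hypothetical names invented for illustration. Only the proc shape `(record, accumulator, context)` and the in-order execution are taken from the documentation in this diff.

```ruby
# Hypothetical stand-ins for macros: each method returns a lambda that
# receives (record, accumulator, context) and mutates the accumulator.
def extract_key(key)
  lambda { |record, accumulator, _context| accumulator.concat(Array(record[key])) }
end

def strip_values
  lambda { |_record, accumulator, _context| accumulator.map!(&:strip) }
end

def first_value_only
  lambda { |_record, accumulator, _context| accumulator.replace(accumulator.take(1)) }
end

def default_value(value)
  lambda { |_record, accumulator, _context| accumulator << value if accumulator.empty? }
end

# Hypothetical runner: applies the procs in order to one shared accumulator.
def run_steps(record, procs, context = {})
  accumulator = []
  procs.each { |step| step.call(record, accumulator, context) }
  accumulator
end

steps = [extract_key("title"), strip_values, first_value_only, default_value("untitled")]
with_title    = run_steps({ "title" => ["  Moby Dick ", "The Whale"] }, steps)
without_title = run_steps({}, steps)
```

Because every step shares one accumulator, the order matters: placing the default step last means it only fires when earlier steps produced nothing.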
data/Gemfile
CHANGED
@@ -4,9 +4,10 @@ source 'https://rubygems.org'
|
|
4
4
|
gemspec
|
5
5
|
|
6
6
|
group :development do
|
7
|
-
gem "
|
7
|
+
gem "webmock", "~> 3.4"
|
8
8
|
end
|
9
9
|
|
10
10
|
group :debug do
|
11
11
|
gem "ruby-debug", :platform => "jruby"
|
12
|
+
gem "byebug", :platform => "mri"
|
12
13
|
end
|
data/README.md
CHANGED
@@ -1,10 +1,8 @@
|
|
1
1
|
# Traject
|
2
2
|
|
3
|
-
An easy to use, high-performance, flexible and extensible
|
3
|
+
An easy to use, high-performance, flexible and extensible metadata transformation system, focused on library-archives-museums input, and indexing to Solr as output.
|
4
4
|
|
5
|
-
|
6
|
-
|
7
|
-
You might use [traject](https://github.com/traject/traject) to index MARC data for a Solr-based discovery product like [Blacklight](https://github.com/projectblacklight/blacklight) or [VUFind](http://vufind.org/).
|
5
|
+
You might use [traject](https://github.com/traject/traject) to index MARC or XML data for a Solr-based discovery product like [Blacklight](https://github.com/projectblacklight/blacklight) or [VUFind](http://vufind.org/).
|
8
6
|
|
9
7
|
Traject can also be generalized to a set of tools for getting structured data from a source, and transforming it to a hash-like object to send to a destination. In addition to sending data to Solr, Traject can produce json or yaml files, tab-delimited files, CSV files, and output suitable for debugging by a human.
|
10
8
|
|
@@ -20,43 +18,34 @@ Initially by Jonathan Rochkind (Johns Hopkins Libraries) and Bill Dueber (Univer
|
|
20
18
|
|
21
19
|
* Basic configuration files can be easily written even by non-rubyists, with a few simple directives traject provides. But config files are 'ruby all the way down', so we can provide a gradual slope to more complex needs, with the full power of ruby.
|
22
20
|
* Easy to program, easy to read, easy to modify.
|
23
|
-
* Fast. Traject by default indexes using multiple threads, on multiple cpu cores, when the underlying
|
24
|
-
ruby implementation (i.e., JRuby) allows it, and can use a separate thread for communication with
|
25
|
-
solr even under MRI.
|
21
|
+
* Fast. Traject by default indexes using multiple threads, on multiple cpu cores, when the underlying ruby implementation (i.e., JRuby) allows it, and can use a separate thread for communication with solr even under MRI. Traject is intended to be usable to process millions of records.
|
26
22
|
* Composed of decoupled components, for flexibility and extensibility.
|
27
23
|
* Designed to support local code and configuration that's maintainable and testable, and can be shared between projects as ruby gems.
|
28
|
-
* Easy to split configuration between multiple files, for simple "pick-and-choose" command line options
|
29
|
-
that can combine to deal with any of your local needs.
|
24
|
+
* Easy to split configuration between multiple files, for simple "pick-and-choose" command line options that can combine to deal with any of your local needs.
|
30
25
|
|
31
26
|
|
32
27
|
## Installation
|
33
28
|
|
34
|
-
Traject runs under jruby (
|
35
|
-
|
36
|
-
**Traject runs much faster on JRuby** where it can use multi-core parallelism, and the Java Marc4J marc reader. If performance is a concern, you should run traject on JRuby.
|
29
|
+
Traject runs under jruby (9.0.x or higher), MRI ruby (2.3.x or higher), or probably any other ruby platform.
|
37
30
|
|
38
|
-
|
31
|
+
Once you have ruby installed, just `$ gem install traject`.
|
39
32
|
|
40
|
-
|
33
|
+
**If you are processing MARC input, you will probably get significant performance improvements on JRuby.** If you are processing Marc-XML (rather than binary Marc21), you should additionally get even more performance improvements by using the [traject-marc4j_reader](https://github.com/traject/traject-marc4j_reader) gem on JRuby (see installation instructions there). If performance is a concern, and you are processing MARC, we recommend benchmarking traject on JRuby. (JRuby is not currently recommended for non-MARC XML input.)
|
41
34
|
|
42
|
-
|
35
|
+
Some options for installing a ruby other than your system-provided one are [chruby](https://github.com/postmodern/chruby) and [ruby-install](https://github.com/postmodern/ruby-install#readme).
|
43
36
|
|
44
37
|
|
45
38
|
## Configuration files
|
46
39
|
|
47
|
-
traject is configured using configuration files. To get a sense of what they look like, you can take a look at our sample basic configuration file
|
48
|
-
[demo_config.rb](./test/test_support/demo_config.rb). You could run traject with that configuration file as: `traject -c path/to/demo_config.rb marc_file.marc`.
|
40
|
+
traject is configured using configuration files. To get a sense of what they look like, you can take a look at our sample basic [MARC configuration file](./test/test_support/demo_config.rb) or [XML configuration file](./test/test_support/nokogiri_demo_config.rb). You could run traject with that configuration file as: `traject -c path/to/demo_config.rb marc_file.marc`.
|
49
41
|
|
50
42
|
Configuration files are actually just ruby -- so by convention they end in `.rb`.
|
51
43
|
|
52
44
|
We hope you can write basic useful configuration files without much ruby experience, since traject gives you some easy functions to use for common directives. But the full power of ruby is available to you if needed.
|
53
45
|
|
54
|
-
**rubyist tip**: Technically, config files are executed with `instance_eval` in a Traject::Indexer instance, so the special commands you see are just methods on Traject::Indexer (or mixed into it). But you can
|
55
|
-
call ordinary ruby `require` in config files, etc., too, to load
|
56
|
-
external functionality. See more at Extending Logic below.
|
46
|
+
**rubyist tip**: Technically, config files are executed with `instance_eval` in a Traject::Indexer instance, so the special commands you see are just methods on Traject::Indexer (or mixed into it). But you can call ordinary ruby `require` in config files, etc., too, to load external functionality. See more at Extending Logic below.
|
57
47
|
|
58
|
-
You can keep your settings and indexing rules in one config file,
|
59
|
-
or split them across multiple config files however you like. (Connection details vs indexing? Common things vs environmental specific things?)
|
48
|
+
You can keep your settings and indexing rules in one config file, or split them across multiple config files however you like. (Connection details vs indexing? Common things vs environmental specific things?)
|
60
49
|
|
61
50
|
There are two main categories of directives in your configuration files: _Settings_, and _Indexing Rules_.
|
62
51
|
|
@@ -84,9 +73,9 @@ settings do
|
|
84
73
|
# various others...
|
85
74
|
provide "solr_writer.commit_on_close", "true"
|
86
75
|
|
87
|
-
# The default writer is the Traject::SolrJsonWriter.
|
88
|
-
# reader is
|
89
|
-
#
|
76
|
+
# The default writer is the Traject::SolrJsonWriter. In the default MARC mode,
|
77
|
+
# the default reader is MarcReader (using ruby-marc).
|
78
|
+
# In XML mode, it is the NokogiriReader.
|
90
79
|
end
|
91
80
|
~~~
|
92
81
|
|
@@ -98,24 +87,26 @@ See, docs page on [Settings](./doc/settings.md) for list
|
|
98
87
|
of all standardized settings.
|
99
88
|
|
100
89
|
|
101
|
-
## Indexing rules: 'to_field'
|
90
|
+
## Indexing rules: 'to_field'
|
102
91
|
|
103
|
-
There are a few methods that can be used to create indexing rules. We will touch on the two most commonly used methods here. More information
|
92
|
+
There are a few methods that can be used to create indexing rules. We will touch on the two most commonly used methods here. More information and technical details are available in [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md)
|
104
93
|
|
105
94
|
`to_field` establishes a rule to extract content to a particular named output field. A `to_field` extraction rule can use built-in 'macros', or, as we'll see later, entirely custom logic.
|
106
95
|
|
107
|
-
The built-in
|
108
|
-
|
96
|
+
The built-in macros most commonly used are:
|
97
|
+
|
98
|
+
* In MARC mode, `extract_marc`, to extract data out of a MARC record according to a tag/subfield specification.
|
99
|
+
* In XML mode, `extract_xpath`. For more on XML use of traject, see the [XML guide](./doc/xml.md).
|
100
|
+
|
101
|
+
### MARC examples: extract_marc
|
109
102
|
|
110
103
|
~~~ruby
|
111
104
|
# Take the value of the first 001 field, and put
|
112
105
|
# it in output field 'id', to be indexed in Solr
|
113
106
|
# field 'id'
|
114
|
-
to_field "id", extract_marc("001"
|
107
|
+
to_field "id", extract_marc("001")
|
115
108
|
|
116
|
-
|
117
|
-
# built-in punctuation trimming routine.
|
118
|
-
to_field "title_t", extract_marc("245aps:130", :trim_punctuation => true)
|
109
|
+
to_field "title_t", extract_marc("245aps:130")
|
119
110
|
|
120
111
|
# Can limit to certain indicators with || chars.
|
121
112
|
# "*" is a wildcard in indicator spec. So this is
|
@@ -142,17 +133,64 @@ By default, specifications with multiple subfields (e.g. "240abc") will produce
|
|
142
133
|
|
143
134
|
For the syntax and complete possibilities of the specification string argument to extract_marc, see docs at the [MarcExtractor class](./lib/traject/marc_extractor.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/MarcExtractor)).
|
144
135
|
|
145
|
-
|
136
|
+
To see all options for `extract_marc`, see the [extract_marc](http://rdoc.info/gems/traject/Traject/Macros/Marc21:extract_marc) method documentation.
|
137
|
+
|
138
|
+
There is one special MARC-specific transformation macro, that strips punctuation from beginning and end of values using heuristics designed for AACR2 in MARC:
|
139
|
+
|
140
|
+
```ruby
|
141
|
+
to_field "title", extract_marc("245abc"), trim_punctuation
|
142
|
+
```
|
143
|
+
|
144
|
+
### XML mode, extract_xpath
|
145
|
+
|
146
|
+
See our [xml guide](./doc/xml.md) for more XML examples, but you will usually use extract_xpath.
|
147
|
+
|
148
|
+
to_field "title", extract_xpath("//title")
|
149
|
+
|
150
|
+
### Translation maps
|
151
|
+
|
152
|
+
|
153
|
+
Traject supports `translation maps` similar to SolrMarc's. There are some translation maps provided by traject, and you can also define your own, in yaml or ruby. Translation maps are especially useful for mapping from MARC codes to user-displayable strings. Translation maps are invoked in a second arg to `to_field`.
|
146
154
|
|
147
155
|
~~~ruby
|
148
156
|
# "translation_map" will be passed to Traject::TranslationMap.new
|
149
157
|
# and the created map used to translate all values
|
150
|
-
to_field "language", extract_marc("008[35-37]:041a:041d",
|
158
|
+
to_field "language", extract_marc("008[35-37]:041a:041d"), translation_map("marc_language_code")
|
151
159
|
~~~
|
152
160
|
|
153
|
-
|
161
|
+
The argument(s) to `translation_map` are passed to `TranslationMap.new`, see comment docs at [TranslationMap](./lib/traject/translation_map.rb) for documentation.
|
162
|
+
|
163
|
+
The `translation_map` macro also allows you to specify _multiple_ translation maps, with the latter ones overriding earlier ones:
|
164
|
+
|
165
|
+
```ruby
|
166
|
+
to_field "language", extract_marc("008[35-37]:041a:041d"),
|
167
|
+
translation_map("marc_language_code",
|
168
|
+
"local_marc_language_code_overrides",
|
169
|
+
{"inline_hash" => "even local more overrides"})
|
170
|
+
```
|
171
|
+
|
172
|
+
### Additional transformation macros
|
173
|
+
|
174
|
+
TranslationMap use above is just one example of a transformation macro, that transforms output values. Other built-in transformation macros are defined in Traject::Macros::Transformation, and include:
|
154
175
|
|
155
|
-
|
176
|
+
* `default("some value")`: provide a default value if no extracted value exists
|
177
|
+
* `first_only`: limit output to a single value, the first one extracted.
|
178
|
+
* `unique`: reduce output values to only unique values
|
179
|
+
* `strip`: remove leading or trailing whitespace
|
180
|
+
* `prepend("before each value:")`
|
181
|
+
* `append("--after each value")`
|
182
|
+
* `gsub(/regex/, "replacement")`
|
183
|
+
* `split(" ")`: take values and split them, possibly result in multiple values.
|
184
|
+
|
185
|
+
You can add on as many transformation macros as you want, they will be applied to output in order.
|
186
|
+
|
187
|
+
Example:
|
188
|
+
|
189
|
+
```ruby
|
190
|
+
to_field "something", extract_xpath("//value"), strip, default("no value"), prepend("Extracted value: ")
|
191
|
+
```
|
192
|
+
|
193
|
+
### Other built-in utility macros
|
156
194
|
|
157
195
|
Other built-in methods that can be used with `to_field` include:
|
158
196
|
|
@@ -256,9 +294,7 @@ use ruby methods like `map!` to modify it:
|
|
256
294
|
~~~
|
257
295
|
|
258
296
|
If you find yourself repeating boilerplate code in your custom logic, you can
|
259
|
-
even create your own 'macros' (like `extract_marc`). `extract_marc` and other
|
260
|
-
macros are nothing more than methods that return ruby lambda objects of
|
261
|
-
the same format as the blocks you write for custom logic.
|
297
|
+
even create your own 'macros' (like `extract_marc`). `extract_marc`, `translation_map`, `first_only` and other macros are nothing more than methods that return ruby lambda objects of the same format as the blocks you write for custom logic.
|
262
298
|
|
263
299
|
For tips, gotchas, and a more complete explanation of how this works, see
|
264
300
|
additional documentation page on [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md)
|
@@ -304,11 +340,10 @@ You set which writer is being used in settings (`provide "writer_class_name", "T
|
|
304
340
|
or with the shortcut command line argument `-w Traject::DebugWriter`.
|
305
341
|
|
306
342
|
The [SolrJWriter](https://github.com/traject/traject-solrj_writer) is packaged separately,
|
307
|
-
and will be useful if you need to index to Solr's older than version 3.2. It requires Jruby.
|
343
|
+
and will be useful if you need to index to Solr's older than version 3.2. It requires Jruby.
|
308
344
|
|
309
345
|
You can easily write your own Readers and Writers if you'd like, see comments at top
|
310
|
-
of [Traject::Indexer](lib/traject/indexer.rb).
|
311
|
-
|
346
|
+
of [Traject::Indexer](lib/traject/indexer.rb). A reader is simply an object that initializes with a ruby IO and traject Settings, and provides an `each` method. The simplest Writer class initializes with a traject Settings, and provides a `put(traject_context)` method.
|
312
347
|
|
313
348
|
|
314
349
|
## Duplicate, `nil`, and empty values
|
@@ -365,13 +400,16 @@ writer class in question.
|
|
365
400
|
|
366
401
|
## The traject command Line
|
367
402
|
|
403
|
+
(If you are interested in running traject in an embedded/programmatic context instead of as a standalone command-line batch process, please see docs on [Programmatic Use](./doc/programmatic_use.md).)
|
404
|
+
|
368
405
|
The simplest invocation is:
|
369
406
|
|
370
407
|
traject -c conf_file.rb marc_file.mrc
|
371
408
|
|
409
|
+
By default, and for legacy reasons, the traject command line uses the MarcIndexer, with default marc reader and macros. If you want to use a different indexer for a different file format, use the `-i` flag: `traject -i xml`, the NokogiriReader; `traject -i basic`, the base Traject::Indexer with no format-specific behavior; or `traject -i Your::Own::Class`.
|
410
|
+
|
372
411
|
Traject assumes marc files are in ISO 2709 MARC 'binary' format; it is not
|
373
|
-
currently able to guess other marc format types like XML from filenames or content. If you are reading
|
374
|
-
marc files in another format, you need to tell traject either with the `marc_source.type` or the command-line shortcut:
|
412
|
+
currently able to guess other marc format types like XML from filenames or content. If you are reading marc files in another format, you need to tell traject either with the `marc_source.type` or the command-line shortcut:
|
375
413
|
|
376
414
|
traject -c conf.rb -t xml marc_file.xml
|
377
415
|
|
@@ -389,6 +427,17 @@ This will over-ride any settings set with `provide` in conf files.
|
|
389
427
|
|
390
428
|
traject -c conf_file.rb marc_file -s solr.url=http://somehere/solr -s solrj_writer.commit_on_close=true
|
391
429
|
|
430
|
+
|
431
|
+
When using the Traject::Indexer::MarcIndexer (default), it assumes marc files are in ISO 2709 MARC 'binary' format; it is not currently able to guess other marc format types like XML from filenames or content. If you are reading marc files in another format, you need to tell traject either with the `marc_source.type` or the command-line shortcut:
|
432
|
+
|
433
|
+
traject -c conf.rb -t xml marc_file.xml
|
434
|
+
|
435
|
+
To use XML mode instead (with the Traject::NokogiriReader and suitable config files), use the `-i` flag:
|
436
|
+
|
437
|
+
traject -i xml -c xml_suitable_config_file.rb
|
438
|
+
|
439
|
+
(You can also pass the full name of a custom indexer class to `-i`)
|
440
|
+
|
392
441
|
There are some built-in command-line option shortcuts for useful
|
393
442
|
settings:
|
394
443
|
|
@@ -442,15 +491,16 @@ Own Code](./doc/extending.md)
|
|
442
491
|
|
443
492
|
## More
|
444
493
|
|
494
|
+
* [Traject XML guide](./doc/xml.md)
|
445
495
|
* [Other traject commands](./doc/other_commands.md) including `marcout`, and `commit`
|
446
496
|
* [Hints for batch and cronjob use](./doc/batch_execution.md) of traject.
|
497
|
+
* [Traject Programmatic Use guide](./doc/programmatic_use.md)
|
447
498
|
* Plugin extensions: Gems that add functionality to traject
|
448
499
|
* [traject_alephsequential_reader](https://github.com/traject/traject_alephsequential_reader/): read MARC files serialized in the AlephSequential format, as output by Ex Libris's Aleph ILS.
|
449
500
|
* [traject_horizon](https://github.com/jrochkind/traject_horizon): Export MARC records directly from a Horizon ILS rdbms, as serialized MARC or to index into Solr.
|
450
501
|
* [traject_umich_format](https://github.com/billdueber/traject_umich_format/): opinionated code and associated macros to extract format (book, audio file, etc.) and types (bibliography, conference report, etc.) from a MARC record. Code mirrors that used by the University of Michigan, and is an alternate approach to that taken by the `marc_formats` macro in `Traject::Macros::MarcFormatClassifier`.
|
451
502
|
* [traject-solrj_writer](https://github.com/traject/traject-solrj_writer): a jruby-only writer that uses the solrj .jar to talk directly to solr. Your only option for speaking to a solr version < 3.2, which is when the json handler was added to solr.
|
452
|
-
* [traject_marc4j_reader](https://github.com/traject/traject-marc4j_reader):
|
453
|
-
reading marc records using the Marc4J library, fastest MARC reading on JRuby.
|
503
|
+
* [traject_marc4j_reader](https://github.com/traject/traject-marc4j_reader): A JRuby-only reader for reading marc records using the Marc4J library, fastest MARC-XML reading on JRuby.
|
454
504
|
* [traject_sequel_writer](https://github.com/traject/traject_sequel_writer) A writer for sending to an rdbms via [Sequel](https://github.com/jeremyevans/sequel)
|
455
505
|
|
456
506
|
# Development
|
@@ -469,15 +519,16 @@ and/or extra files in ./docs -- as appropriate for what needs to be docs.
|
|
469
519
|
online api docs has a `--markup markdown` specified -- inline class/method docs are in markdown, not rdoc.
|
470
520
|
|
471
521
|
Bundler rake tasks included for gem releases: `rake release`
|
472
|
-
* Every traject release needs to be done once when running MRI, and switch to JRuby
|
473
|
-
and do the same release again. The JRuby release is identical but for including
|
474
|
-
a gemspec dependency on the Marc4JReader gem.
|
475
522
|
|
476
|
-
|
523
|
+
The standard [bundle console](https://bundler.io/v1.7/bundle_console.html) command may be useful for getting an `irb` console with the gem and its dependencies loaded.
|
524
|
+
|
525
|
+
## TODO: Possible future improvements
|
526
|
+
|
527
|
+
* Incorporate more inspired by [TrajectPlus](https://github.com/sul-dlss/traject_plus), possibly including `compose` for building nested hash output.
|
477
528
|
|
478
|
-
*
|
529
|
+
* Incorporate functionality to write multiple output records based on a single input record. Likely will share implementation details with a trajectplus-style `compose`.
|
479
530
|
|
480
|
-
* Writers for writing to stores other than Solr? ElasticSearch? Maybe.
|
531
|
+
* Writers for writing to stores other than Solr? ElasticSearch? Maybe.
|
481
532
|
|
482
533
|
* Unicode normalization. Has to normalize to NFKC on way out to index. Except for serialized marc field and other exceptions? Except maybe don't have to, rely on solr analyzer to do it?
|
483
534
|
|
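The README text above describes a reader as an object initialized with a ruby IO and traject Settings that provides `each`, and the simplest writer as an object initialized with Settings that provides `put(traject_context)`. A rough plain-Ruby sketch of that shape follows, under stated assumptions: `LineReader`, `MemoryWriter`, and `FakeContext` are hypothetical names, a plain Hash stands in for traject Settings, and a Struct stands in for the real context object. Nothing here is loaded from traject.

```ruby
require "stringio"

# Hypothetical reader: initialized with an IO and settings, yields one
# record per non-blank input line from #each.
class LineReader
  include Enumerable

  def initialize(io, settings)
    @io = io
    @settings = settings
  end

  def each
    return enum_for(:each) unless block_given?

    @io.each_line do |line|
      line = line.chomp
      yield line unless line.empty?
    end
  end
end

# Hypothetical stand-in for the context object handed to a writer.
FakeContext = Struct.new(:output_hash)

# Hypothetical writer: initialized with settings, collects each context's
# output hash in memory when #put is called.
class MemoryWriter
  attr_reader :written

  def initialize(settings)
    @settings = settings
    @written = []
  end

  def put(context)
    @written << context.output_hash
  end
end

settings = {} # plain Hash standing in for traject Settings (assumption)
reader = LineReader.new(StringIO.new("first\n\nsecond\n"), settings)
writer = MemoryWriter.new(settings)
reader.each { |record| writer.put(FakeContext.new({ "id" => [record] })) }
```

In real use the indexer, not your own loop, would drive the reader and hand contexts to the writer; the loop on the last line only exists to exercise the two interfaces.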
data/Rakefile
CHANGED
@@ -11,6 +11,13 @@ require 'rake/testtask'
|
|
11
11
|
task :default => [:test]
|
12
12
|
|
13
13
|
Rake::TestTask.new do |t|
|
14
|
+
# Rake 11 makes warnings on by default, but there is so much noise, including
|
15
|
+
# from our dependencies, and from things I think are silly warnings like
|
16
|
+
# "shadowing outer local variable"
|
17
|
+
# Possibly could turn back on in the future using https://rubygems.org/gems/warning/versions/0.10.0
|
18
|
+
# gem to customize.
|
19
|
+
t.warning = false
|
20
|
+
|
14
21
|
t.pattern = 'test/**/*_test.rb'
|
15
22
|
t.libs.push 'test', 'test_support'
|
16
23
|
end
|
@@ -18,4 +25,4 @@ end
|
|
18
25
|
# Not documented well, but this seems to be
|
19
26
|
# the way to load rake tasks from other files
|
20
27
|
#import "lib/tasks/load_map.rake"
|
21
|
-
Dir.glob('lib/tasks/*.rake').each { |r| import r}
|
28
|
+
Dir.glob('lib/tasks/*.rake').each { |r| import r}
|
data/doc/indexing_rules.md
CHANGED
@@ -1,35 +1,59 @@
|
|
1
1
|
# Details on Traject Indexing: from custom logic to Macros
|
2
2
|
|
3
|
-
|
3
|
+
We will explain the architecture of indexing rules, to help you use them more effectively, and create 'macros' which are re-usable index mapping rules.
|
4
4
|
|
5
|
-
## How
|
5
|
+
## How to_field works
|
6
6
|
|
7
|
-
|
7
|
+
A `to_field` invocation might look like this:
|
8
8
|
|
9
|
-
|
9
|
+
```ruby
|
10
|
+
to_field "title", extract_marc("245abc"), first_only
|
11
|
+
```
|
12
|
+
|
13
|
+
In fact, both `extract_marc("245abc")` and `first_only` are invocations of methods that return ruby [Proc](https://ruby-doc.org/core-2.2.0/Proc.html) or lambda objects. A method that is included in an indexer and returns a `Proc` object suitable as an argument to `to_field` is what we call a **macro** in traject.
|
14
|
+
|
15
|
+
The `to_field` method, which establishes an indexing rule, simply takes a first argument that is a field name, and then one or more arguments that are procs. During indexing, the procs registered with the indexing rule are executed in order to provide and transform output values. An additional proc can be provided as a block argument to the `to_field` method.
|
16
|
+
|
17
|
+
By providing macro methods in the indexer that return procs, we can use this simple ruby method signature to create something that looks like a "domain specific language," where you might not even realize it's all based on procs. `extract_marc` is a method defined in the MarcIndexer (via including the `Traject::Macros::Marc21` mixin), while `first_only` is a method included in all Indexers (via the base Indexer class including the `Traject::Macros::Transformation` mixin).
|
18
|
+
|
19
|
+
These proc arguments themselves take three arguments, of which the third is optional.
|
20
|
+
|
21
|
+
1. the source record
|
22
|
+
2. an "accumulator" array of output values, to which the procs add or transform values
|
23
|
+
3. a traject "context"
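
The calling convention in this list can be imitated in plain Ruby without loading traject. This is only a sketch: `run_rule`, the Hash standing in for a source record, and the Hash standing in for the context are hypothetical names chosen for illustration, not traject's API or its actual implementation.

```ruby
# Hypothetical stand-in for what an indexer does with one indexing rule:
# call each proc in order, handing all of them the same accumulator Array.
def run_rule(procs, record, context = {})
  accumulator = []
  procs.each { |step| step.call(record, accumulator, context) }
  accumulator
end

# An "extraction" step: reads the record and adds values to the accumulator.
# It declares only two parameters; a proc silently ignores the extra context.
extract_titles = proc { |record, accumulator| accumulator.concat(record[:titles]) }

# A "transformation" step: ignores the record and changes the accumulator in place.
keep_first = proc { |_record, accumulator| accumulator.replace(accumulator.take(1)) }

record = { titles: ["First title", "Second title"] }
result = run_rule([extract_titles, keep_first], record)
p result # => ["First title"]
```

Because both steps receive the same Array object, the second step sees whatever the first one added.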
|
24
|
+
|
25
|
+
Here's the simplest possible direct Traject mapping logic, duplicating the effects of the literal macro:
|
26
|
+
|
27
|
+
```ruby
|
10
28
|
to_field("title") do |record, accumulator, context|
|
11
29
|
accumulator << "FIXED LITERAL"
|
12
30
|
end
|
13
|
-
|
31
|
+
```
|
14
32
|
|
15
|
-
That `do` is just ruby
|
33
|
+
That `do` is just ruby block syntax, whereby we can pass a block of ruby code as an argument to a ruby method. We pass a block taking three arguments, labeled record, accumulator, and context, to the `to_field` method. The third 'context' argument is optional: you can declare it in your block or not, depending on whether you want to use it.
|
16
34
|
|
17
|
-
The block is then stored by the Traject::Indexer, and called for each record indexed, with three arguments provided.
|
35
|
+
The block is then stored by the Traject::Indexer, and called for each record indexed, with three arguments provided.
|
18
36
|
|
19
37
|
### record argument
|
20
38
|
|
21
|
-
The record that gets passed to your block is
|
39
|
+
The record that gets passed to your block is the source record for the current indexing: A `MARC::Record` when using the MarcIndexer, a `Nokogiri::XML::Document` using the NokogiriIndexer, or whatever source record type is used by a given indexer.
|
40
|
+
|
41
|
+
An "extraction" proc, like that returned by `extract_marc`, is usually the first one given to `to_field`, and usually examines the record to calculate the desired output.
|
42
|
+
|
43
|
+
Logic for a "transformation" proc, such as that returned by `first_only`, usually ignores the record argument.
|
44
|
+
|
45
|
+
"Extraction" vs "transformation" are just names for procs that either examine the source_record to add something to the accumulator ("extraction") or transform values already in the accumulator ("transformation") -- a proc can actually do these things in any combination, but it usually makes sense to design some procs for extraction and others for transformation.
|
22
46
|
|
23
47
|
### accumulator argument
|
24
48
|
|
25
|
-
The accumulator argument is an Array. At the end of your custom code, the accumulator Array should hold the output you want send off to the field specified in `to_field`.
|
49
|
+
The accumulator argument is an Array. At the end of your custom code, the accumulator Array should hold the output you want to send off to the field specified in `to_field`.
|
26
50
|
|
27
|
-
The accumulator is a reference to a ruby Array, and you need to **modify** that Array, manipulating it in place with Array methods that mutate the array, like `concat`, `<<`, `map!` or even `replace`.
|
51
|
+
The accumulator is a reference to a ruby Array, and you need to **modify** that Array, manipulating it in place with Array methods that mutate the array, like `concat`, `<<`, `map!` or even `replace`.
|
28
52
|
|
29
|
-
You can't simply assign the accumulator variable to a different Array; you need to modify the Array *in place*.
|
53
|
+
You can't simply assign the accumulator variable to a different Array; you need to modify the Array *in place*.
|
30
54
|
|
31
55
|
# Won't work, assigning variable
|
32
|
-
to_field('foo') do |rec, acc|
|
56
|
+
to_field('foo') do |rec, acc|
|
33
57
|
acc = ["some constant"] # WRONG!
|
34
58
|
end
|
35
59
|
|
@@ -38,7 +62,7 @@ You can't simply assign the accumulator variable to a different Array; you need
|
|
38
62
|
acc << 'bill'
|
39
63
|
acc << 'dueber'
|
40
64
|
acc = acc.map{|str| str.upcase}
|
41
|
-
end # WRONG! WRONG! WRONG! WRONG! WRONG!
|
65
|
+
end # WRONG! WRONG! WRONG! WRONG! WRONG!
|
42
66
|
|
43
67
|
|
44
68
|
# Instead, do, modify array in place
|
@@ -49,11 +73,13 @@ You can't simply assign the accumulator variable to a different Array; you need
|
|
49
73
|
acc.map!{|str| str.upcase} # NOTE: "map!" not "map"
|
50
74
|
end
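
The difference between rebinding the variable and mutating the Array can be checked in plain Ruby, with no traject involved. In this sketch `rebind` and `mutate` are hypothetical procs that mimic the two styles shown above.

```ruby
# Rebinding: `acc = ...` only points the block-local name at a new Array.
# The Array the caller passed in is left as it was.
rebind = proc { |_rec, acc| acc = acc.map(&:upcase) }

# Mutating: `map!` changes the very Array the caller is holding.
mutate = proc { |_rec, acc| acc.map!(&:upcase) }

first = ["bill", "dueber"]
rebind.call(nil, first)

second = ["bill", "dueber"]
mutate.call(nil, second)

p first  # => ["bill", "dueber"]
p second # => ["BILL", "DUEBER"]
```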
|
51
75
|
|
76
|
+
If you have multiple calls to `to_field` for the same field, each invocation begins with an empty accumulator, to help keep them independent.
|
77
|
+
|
52
78
|
### context argument
|
53
79
|
|
54
80
|
The third optional argument is a [Traject::Indexer::Context](./lib/traject/indexer/context.rb) ([rdoc](http://rdoc.info/github/traject/traject/Traject/Indexer/Context)) object. Most of the time you don't need it, but you can use it for some sophisticated functionality. These are some useful methods available:
|
55
81
|
|
56
|
-
* `context.clipboard` A hash into which you can stuff values that you want to pass from one indexing step to another. For example, if you go through a bunch of work to query a database and get a result you'll need more than once, stick the results somewhere in the clipboard. This clipboard is record-specific, and won't persist between records.
|
82
|
+
* `context.clipboard` A hash into which you can stuff values that you want to pass from one indexing step to another. For example, if you go through a bunch of work to query a database and get a result you'll need more than once, stick the results somewhere in the clipboard. This clipboard is record-specific, and won't persist between records.
|
57
83
|
* `context.position` The position of the record in the input file (e.g., was it the first record, second, etc.). Useful for error reporting.
|
58
84
|
* `context.output_hash` A hash mapping the field names (generally defined in `to_field` calls) to an array of values to be sent to the writer associated with that field. This allows you to modify what goes to the writer without going through a `to_field` call -- you can just set `context.output_hash['myfield'] = ['my', 'values']` and you're set. See below for more examples.
|
59
85
|
* `context.skip!(msg)` An assertion that this record should be ignored. No more indexing steps will be called, no results will be sent to the writer, and a `debug`-level log message will be written stating that the record was skipped.
|
@@ -92,50 +118,37 @@ to_field 'normalized_title' do |rec, acc|
|
|
92
118
|
end
|
93
119
|
```
|
94
120
|
|
121
|
+
Traject macros similarly capture some values in local variables outside the proc they return, which the returned proc can then use.
|
122
|
+
|
95
123
|
Certain built-in traject calls have been optimized to be high performance
|
96
|
-
so it's safe to do them inside 'inner loop' blocks. That includes `Traject::TranslationMap.new` and `Traject::MarcExtractor.cached("xxx")`
|
124
|
+
so it's safe to do them inside 'inner loop' blocks. That includes `Traject::TranslationMap.new` and `Traject::MarcExtractor.cached("xxx")`
|
97
125
|
(NOTE: #cached rather than #new there)
|
98
126
|
|
99
127
|
|
100
|
-
##
|
101
|
-
|
102
|
-
In the ruby language, in addition to creating a code block as an argument
|
103
|
-
to a method with `do |args| ... end` or `{|arg| ... }`, we can also create
|
104
|
-
a code block to hold in a variable, with the `lambda` keyword:
|
105
|
-
|
106
|
-
always_output_foo = lambda do |record, accumulator|
|
107
|
-
accumulator << "FOO"
|
108
|
-
end
|
109
|
-
|
110
|
-
In traject, `to_field` is written so that, as a convenience, it can take a lambda expression stored in a variable as an alternative to a block:
|
111
|
-
|
112
|
-
to_field("always_has_foo"), always_output_foo
|
113
|
-
|
114
|
-
Why is this a convenience? Well, ordinarily it's not something we
|
115
|
-
need, but in fact it's what allows traject 'macros' to be re-useable
|
116
|
-
code templates.
|
117
|
-
|
118
|
-
|
119
|
-
## Macros
|
128
|
+
## Back to macros
|
120
129
|
|
121
130
|
A Traject macro is a way to automatically create indexing rules via re-usable "templates".
|
122
131
|
|
123
|
-
Traject macros are methods that return ruby lambda/proc objects, possibly creating them based on parameters passed in.
|
132
|
+
Traject macros are simply methods that return ruby lambda/proc objects, possibly creating them based on parameters passed in.
|
124
133
|
|
125
|
-
For example, here is the implementation of the
|
134
|
+
For example, here is the implementation of the `literal` logic, as a macro method returning a proc, instead of as an inline proc.
|
126
135
|
|
127
136
|
~~~ruby
|
137
|
+
# This method is included in an Indexer, possibly as a module mix-in.
|
128
138
|
def literal(value)
|
129
|
-
return
|
139
|
+
return proc do |record, accumulator, context|
|
130
140
|
# because a proc is a closure, we can define it in terms
|
131
141
|
# of the 'value' from the scope it's defined in!
|
132
142
|
accumulator << value
|
133
143
|
end
|
134
144
|
end
|
145
|
+
|
146
|
+
# then it would be called on the indexer, typically in a traject configuration file,
|
147
|
+
# when setting up an indexing rule:
|
135
148
|
to_field "fieldname", literal("my_fav_literal")
|
136
149
|
~~~
|
137
150
|
|
138
|
-
So a Traject macro is a method that may have parameters and, based on those parameters, returns a
|
151
|
+
So a Traject macro is a method that may have parameters and, based on those parameters, returns a proc; the proc is then passed to the `to_field` indexing method, or similar methods.
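
To see that a macro only *builds* a proc and does no indexing itself, the returned procs can be called by hand in plain Ruby. This is a sketch outside traject: `add_suffix` is a hypothetical macro invented for illustration, and the manual `call` loop stands in for what an indexer would do for each record.

```ruby
# Extraction-style macro, as in the text: returns a proc that closes over `value`.
def literal(value)
  proc { |_record, accumulator, _context| accumulator << value }
end

# Hypothetical transformation-style macro, for illustration only.
def add_suffix(suffix)
  proc { |_record, accumulator, _context| accumulator.map! { |v| v + suffix } }
end

# Calling the macros happens once, when the rule is set up.
steps = [literal("my_fav_literal"), add_suffix("!")]

# Calling the returned procs happens once per record.
accumulator = []
steps.each { |step| step.call(nil, accumulator, nil) }
p accumulator # => ["my_fav_literal!"]
```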
|
139
152
|
|
140
153
|
How do you make these methods available to the traject indexer?
|
141
154
|
|
@@ -145,7 +158,7 @@ Define it in a module:
|
|
145
158
|
# in a file literal_macro.rb
|
146
159
|
module LiteralMacro
|
147
160
|
def literal(value)
|
148
|
-
return
|
161
|
+
return proc do |record, accumulator, context|
|
149
162
|
# because a proc is a closure, we can define it in terms
|
150
163
|
# of the 'value' from the scope it's defined in!
|
151
164
|
accumulator << value
|
@@ -165,39 +178,43 @@ to_field("fieldname"), literal("my_fav_literal")
|
|
165
178
|
~~~
|
166
179
|
|
167
180
|
That's it. You can use the traject command line `-I` option to set the ruby load path, so your file will be findable via `require`. Or you can distribute it in a gem, and use straight rubygems and the `gem` command in your configuration file, or Bundler with traject command-line `-g` option.
|
181
|
+
See the [Extending with your own code](./extending.md) guide for various methods for including custom code in a traject command-line invocation.
|
182
|
+
|
183
|
+
## Combining multiple macros, lambdas and blocks
|
168
184
|
|
169
|
-
|
185
|
+
Traject macros (such as `extract_marc`) create and return a proc. If
|
186
|
+
you include a proc _and_ a block (or multiple procs) on a `to_field` call, subsequent procs
|
187
|
+
or code blocks get the accumulator as it was filled in by earlier procs or code blocks, and can *transform* values in the accumulator.
|
170
188
|
|
171
|
-
|
172
|
-
you include a lambda _and_ a block on a `to_field` call, the block
|
173
|
-
gets the accumulator as it was filled in by the former.
|
189
|
+
Here is an example of passing `to_field` procs returned by macros, procs held in variables, and blocks.
|
174
190
|
|
175
191
|
```ruby
|
176
|
-
# Get the titles and lowercase them
|
177
|
-
to_field 'lc_title', extract_marc('245') do |rec, acc, context|
|
178
|
-
acc.map!{|title| title.downcase}
|
179
|
-
end
|
180
192
|
|
181
|
-
|
182
|
-
|
183
|
-
|
184
|
-
acc << 'two'
|
185
|
-
end #=> context.output_hash['foo'] == ['one', 'two']
|
193
|
+
titlecase = proc do |rec, acc|
|
194
|
+
acc.map! { |value| value.titlecase } # String#titlecase comes from ActiveSupport, not core Ruby
|
195
|
+
end
|
186
196
|
|
187
|
-
|
188
|
-
|
189
|
-
acc.uniq!
|
197
|
+
to_field 'lc_title', extract_marc('245'), titlecase, unique do |rec, acc, context|
|
198
|
+
acc.delete_if { |v| v == "value_to_eliminate" }
|
190
199
|
end
|
191
200
|
```
|
192
201
|
|
202
|
+
`extract_marc` and `unique` are "macro" methods returning a proc.
|
203
|
+
|
204
|
+
`titlecase` is just a local variable, defined in the indexing file itself, holding a proc.
|
205
|
+
|
206
|
+
Then finally there is a block arg, taking the same arguments as the procs would.
|
207
|
+
|
208
|
+
All of these can be combined, and will be executed in order to transform output values.
|
209
|
+
|
193
210
|
## Manipulating `context.output_hash` directly
|
194
211
|
|
195
212
|
If you ask for the context argument, a [Traject::Indexer::Context](./lib/traject/indexer/context.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Indexer/Context)), you have access to `context.output_hash`, which is
|
196
213
|
the hash of already transformed output that will be sent to Solr (or any other Writer).
|
197
214
|
|
198
|
-
You can examine `context.output_hash` to see any already transformed output and use it as the source for new output.
|
215
|
+
You can examine `context.output_hash` to see any already transformed output and use it as the source for new output.
|
199
216
|
|
200
|
-
You can *write* to `context.output_hash` directly, which can be useful for computations that affect more than one output field at once.
|
217
|
+
You can *write* to `context.output_hash` directly, which can be useful for computations that affect more than one output field at once.
|
201
218
|
|
202
219
|
**Note**: Make sure you always assign an _Array_ to each `context.output_hash` value, e.g., `context.output_hash['foo']`, not a single value!
|
203
220
|
|
@@ -211,15 +228,14 @@ context.output_hash['fieldname'] = ['fuzzy_wuzzies']
|
|
211
228
|
```
|
212
229
|
|
213
230
|
|
214
|
-
|
215
231
|
## each_record
|
216
232
|
|
217
233
|
`each_record` is similar to `to_field` in that it defines logic executed for each record. It differs from `to_field` because the output of `each_record` is not associated with a specific output field.
|
218
234
|
|
219
|
-
Thus, `each_record` blocks have no `accumulator` argument: instead they either take a single `record` argument; or both a `record` and a `context`.
|
235
|
+
Thus, `each_record` blocks have no `accumulator` argument: instead they either take a single `record` argument; or both a `record` and a `context`.
|
220
236
|
|
221
237
|
`each_record` is useful for logging or notifying, computing intermediate
|
222
|
-
results, or writing to more than one field at once.
|
238
|
+
results, or writing to more than one field at once.
|
223
239
|
|
224
240
|
~~~ruby
|
225
241
|
each_record do |record, context|
|
@@ -241,7 +257,7 @@ each_record do |record, context|
|
|
241
257
|
end
|
242
258
|
~~~
|
243
259
|
|
244
|
-
traject doesn't come with any macros written for use with `each_record`, but they could be created: such macros would be methods that return a lambda given the appropriate args from `each_record`.
|
260
|
+
traject doesn't come with any macros written for use with `each_record`, but they could be created: such macros would be methods that return a lambda taking the arguments `each_record` supplies (a `record`, and optionally a `context`).
|
245
261
|
|
246
262
|
## More tips and gotchas about indexing steps
|
247
263
|
|
@@ -251,4 +267,4 @@ traject doesn't come with any macros written for use with `each_record`, but the
|
|
251
267
|
|
252
268
|
* **Once you call `context.skip!(msg)` no more index steps will be run for that record**. So if you have any cleanup code, you'll need to make sure to call it yourself.
|
253
269
|
|
254
|
-
* **By default, `traject` indexing runs multi-threaded**. In the current implementation, the indexing steps for one record are *not* split across threads, but different records can be processed simultaneously by more than one thread. That means you need to make sure your code is thread-safe (or always set `processing_thread_pool` to 0).
|
270
|
+
* **By default, `traject` indexing runs multi-threaded**. In the current implementation, the indexing steps for one record are *not* split across threads, but different records can be processed simultaneously by more than one thread. That means you need to make sure your code is thread-safe (or always set `processing_thread_pool` to 0).
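
One common way to make shared state safe is to guard it with a `Mutex`. The sketch below is plain Ruby with hypothetical names (`seen_count`, `count_record`); the threads merely stand in for traject's processing threads, and nothing here is traject API.

```ruby
# Shared state that several processing threads might touch.
seen_count = 0
lock = Mutex.new

# Stand-in for an indexing step that updates the shared state.
# The read-modify-write of `seen_count` happens only while holding the lock.
count_record = proc do |_record, _accumulator|
  lock.synchronize { seen_count += 1 }
end

# Four threads, each "processing" 250 records.
threads = 4.times.map do
  Thread.new { 250.times { count_record.call(nil, []) } }
end
threads.each(&:join)

p seen_count # => 1000
```

State that belongs to a single record, such as `context.clipboard`, does not need this kind of locking, since the steps for one record are not split across threads.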
|