traject 2.1.0-java → 2.2.0-java

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 1e497f1fbf0507bf5c427dc49caaf177eaf44cbe
4
- data.tar.gz: e9822f0ab83b9645172aa56b4219709f302cad1b
3
+ metadata.gz: 31b0c8daf3b5365e6f76172e9af9d8ec7fd842fe
4
+ data.tar.gz: 9286f5626eb34bd4df3e89aa1197fc3b1810e601
5
5
  SHA512:
6
- metadata.gz: 69e90b97d27248d62d17e7b3e6cd7088ff1f15e20f7205cd2e633965294efaa68ec991c86a181c0ad0b71ea3c601a165ce159acbac3eb9010682b8bc0d6d8e16
7
- data.tar.gz: 6bab099581a57947ef8d14191bff72602c9c428fc836597b6e33a3baeec2f14ec5cc316d6d5f555e5c4c62d862966ff703992fedeed9307e6a45f6ef4d6f86c1
6
+ metadata.gz: 0495e94238704ab066c86c40e1ef65c86d4229178795031a59dfaf42ceb5efa01e00943b0447291a666d504b42f3cca62d56735240a56004541fa24b8efc138a
7
+ data.tar.gz: e002525a16a48c0897548f526df1c63bccdf5d23d888cd799afb2518b52a73b3832ba0f5e8a8748f8649ef45b43c6efe85b0f48fa883d9540480f15dd6d19423
data/.gitignore CHANGED
@@ -1,3 +1,5 @@
1
+ .idea
2
+ bench
1
3
  *.gem
2
4
  *.rbc
3
5
  .bundle
data/.travis.yml CHANGED
@@ -1,27 +1,15 @@
1
1
  language: ruby
2
+ cache: bundler
3
+ sudo: false
2
4
  rvm:
3
5
  - jruby-19mode
4
- - jruby-head
6
+ - jruby-9.0.4.0
5
7
  - 1.9
6
- - 2.1
7
8
  - 2.2
9
+ - 2.3.0
8
10
  - rbx-2
11
+ before_install:
12
+ - gem update --system
13
+ - gem install bundler
9
14
  jdk:
10
- - openjdk7
11
- - openjdk6
12
- matrix:
13
- exclude:
14
- - rvm: 1.9
15
- jdk: openjdk7
16
- - rvm: 2.1
17
- jdk: openjdk7
18
- - rvm: rbx-2
19
- jdk: openjdk7
20
- - rvm: jruby-head
21
- jdk: openjdk6
22
- - rvm: 2.2
23
- jdk: openjdk6
24
- allow_failures:
25
- - rvm: jruby-head
26
-
27
- bundler_args: --without debug
15
+ - oraclejdk8
data/CHANGES.md CHANGED
@@ -1,5 +1,19 @@
1
1
  # Changes
2
2
 
3
+ ## 2.2.0
4
+ * Change DebugWriter to be more forgiving (and informative) about missing record-id fields
5
+ * Automatically require DebugWriter for easier use on the command line
6
+ * Refactor MarcExtractor to be easier to read
7
+ * Fix .travis file to actually work, and target more recent rubies.
8
+
9
+ ## 2.1.0
10
+ * update some docs (typos)
11
+ * Make the indexer's `writer` r/w so it can be set at runtime (#110)
12
+ * Allow `extract_marc` to be callable from anywhere (#111)
13
+ * Add doc instructions/examples for programmatic Indexer use
14
+ * _Much_ better error reporting; easier to find which record went wrong
15
+
16
+
3
17
  ## 2.0.2
4
18
 
5
19
  * Guard against assumption of MARC data when indexing using SolrJsonWriter ([#94](https://github.com/traject-project/traject/issues/94))
data/README.md CHANGED
@@ -1,12 +1,10 @@
1
1
  # Traject
2
2
 
3
- An easy to use, high-performance, flexible and extensible MARC to Solr indexer.
3
+ An easy to use, high-performance, flexible and extensible MARC to Solr indexer.
4
4
 
5
- You might use traject to index MARC data for a Solr-based discovery product like [Blacklight](https://github.com/projectblacklight/blacklight) or [VUFind](http://vufind.org/).
5
+ You might use [traject](https://github.com/traject/traject) to index MARC data for a Solr-based discovery product like [Blacklight](https://github.com/projectblacklight/blacklight) or [VUFind](http://vufind.org/).
6
6
 
7
- Traject can also be generalized to a set of tools for getting structured data from a source, and transforming it to a hash-like object to send to a destination. In addition to sending data
8
- to solr, Traject can produce json or yaml files, tab-delimited files, CSV files, and output suitable
9
- for debugging by a human.
7
+ Traject can also be generalized to a set of tools for getting structured data from a source, and transforming it to a hash-like object to send to a destination. In addition to sending data to Solr, Traject can produce json or yaml files, tab-delimited files, CSV files, and output suitable for debugging by a human.
10
8
 
11
9
  **Traject is stable, mature software, that is already being used in production by its authors.**
12
10
 
@@ -23,7 +21,7 @@ Initially by Jonathan Rochkind (Johns Hopkins Libraries) and Bill Dueber (Univer
23
21
  * Fast. Traject by default indexes using multiple threads, on multiple cpu cores, when the underlying
24
22
  ruby implementation (i.e., JRuby) allows it, and can use a separate thread for communication with
25
23
  solr even under MRI.
26
- * Composed of decoupled components, for flexibility and extensibility.
24
+ * Composed of decoupled components, for flexibility and extensibility.
27
25
  * Designed to support local code and configuration that's maintainable and testable, and can be shared between projects as ruby gems.
28
26
  * Easy to split configuration between multiple files, for simple "pick-and-choose" command line options
29
27
  that can combine to deal with any of your local needs.
@@ -33,42 +31,36 @@ that can combine to deal with any of your local needs.
33
31
 
34
32
  Traject runs under jruby (1.7.x or higher), MRI ruby (1.9.3 or higher), or probably any other ruby platform.
35
33
 
36
- **Traject runs much faster on JRuby** where it can use multi-core parallelism, and the Java
37
- Marc4J marc reader. If performance is a concern, you should run traject on JRuby.
34
+ **Traject runs much faster on JRuby** where it can use multi-core parallelism, and the Java Marc4J marc reader. If performance is a concern, you should run traject on JRuby.
38
35
 
39
36
  Some options for installing a ruby other than your system-provided one are [chruby](https://github.com/postmodern/chruby) and [ruby-install](https://github.com/postmodern/ruby-install#readme).
40
37
 
41
38
  Once you have ruby, just `$ gem install traject`.
42
39
 
43
- ( **Note**: We might in the future provide an all-in-one .jar distribution, which does not require you to install jruby on your system, for those who want the multi-threading of jruby without having to actually install it. Let us know if interested.).
40
+ (**Note**: We might in the future provide an all-in-one .jar distribution, which will not require you to install jruby on your system, for those who want the multi-threading of jruby without having to actually install it. Let us know if interested.)
44
41
 
45
42
 
46
43
  ## Configuration files
47
44
 
48
- traject is configured using configuration files. To get a sense of what they look like, you can
49
- take a look at our sample basic configuration file,
50
- [demo_config.rb](./test/test_support/demo_config.rb). You could run traject with that configuration file
51
- as: `traject -c path/to/demo_config.rb marc_file.marc`.
45
+ traject is configured using configuration files. To get a sense of what they look like, you can take a look at our sample basic configuration file,
46
+ [demo_config.rb](./test/test_support/demo_config.rb). You could run traject with that configuration file as: `traject -c path/to/demo_config.rb marc_file.marc`.
52
47
 
53
48
  Configuration files are actually just ruby -- so by convention they end in `.rb`.
54
49
 
55
- We hope you can write basic useful configuration files without much ruby experience, since
56
- traject gives you some easy functions to use for common directives. But the full power
57
- of ruby is available to you if needed.
50
+ We hope you can write basic useful configuration files without much ruby experience, since traject gives you some easy functions to use for common directives. But the full power of ruby is available to you if needed.
58
51
 
59
52
  **rubyist tip**: Technically, config files are executed with `instance_eval` in a Traject::Indexer instance, so the special commands you see are just methods on Traject::Indexer (or mixed into it). But you can
60
53
  call ordinary ruby `require` in config files, etc., too, to load
61
54
  external functionality. See more at Extending Logic below.
62
55
 
63
56
  You can keep your settings and indexing rules in one config file,
64
- or split them accross multiple config files however you like. (Connection details vs indexing? Common things vs environmental specific things?)
57
+ or split them across multiple config files however you like. (Connection details vs indexing? Common things vs environmental specific things?)
65
58
 
66
59
  There are two main categories of directives in your configuration files: _Settings_, and _Indexing Rules_.
67
60
 
68
61
  ## Settings
69
62
 
70
- Settings are a flat list of key/value pairs, where the keys are always strings and the values usually are. They look like this
71
- in a config file:
63
+ Settings are a flat list of key/value pairs, where the keys are always strings and the values usually are too. They look like this in a config file:
72
64
 
73
65
  ~~~ruby
74
66
  # configuration_file.rb
@@ -98,20 +90,17 @@ end
98
90
 
99
91
  `provide` will only set the key if it was previously unset, so first
100
92
  setting wins, and command-line comes first of all and overrides everything.
101
- You can also use `store` if you want to force-set, last set wins.
93
+ You can also use `store` if you want to force-set: last set wins.
102
94
 
103
95
  See, docs page on [Settings](./doc/settings.md) for list
104
96
  of all standardized settings.
105
97
 
106
98
 
107
- ## Indexing rules: Let's start with 'to_field' and 'extract_marc'
99
+ ## Indexing rules: 'to_field' and 'extract_marc'
108
100
 
109
- There are a few methods that can be used to create indexing rules, but the
110
- one you'll most common is called `to_field`, and establishes a rule
111
- to extract content to a particular named output field.
101
+ There are a few methods that can be used to create indexing rules. We will touch on the two most commonly used methods here. More information is available in [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md)
112
102
 
113
- A `to_field` extraction rule can use built-in 'macros', or, as we'll see later,
114
- entirely custom logic.
103
+ `to_field` establishes a rule to extract content to a particular named output field. A `to_field` extraction rule can use built-in 'macros', or, as we'll see later, entirely custom logic.
115
104
 
116
105
  The built-in macro you'll use the most is `extract_marc`, to extract
117
106
  data out of a MARC record according to a tag/subfield specification.
@@ -140,24 +129,18 @@ data out of a MARC record according to a tag/subfield specification.
140
129
  to_field "language_code", extract_marc("008[35-37]")
141
130
  ~~~
142
131
 
143
- `extract_marc` by default includes all 'alternate script' linked fields correspoinding
144
- to matched specifications, but you can turn that off, or extract *only* corresponding
145
- 880s.
132
+ `extract_marc` by default includes all 'alternate script' linked fields correspoinding to matched specifications, but you can turn that off, or extract *only* corresponding 880s.
146
133
 
147
134
  ~~~ruby
148
135
  to_field "title", extract_marc("245abc", :alternate_script => false)
149
136
  to_field "title_vernacular", extract_marc("245abc", :alternate_script => :only)
150
137
  ~~~
151
138
 
152
- By default, specifications with multiple subfields (like "240abc") will produce one single string of output per field (for each '240'), with the concatenation of each matched subfield. Specifications with single subfields (like "020a") will split subfields and produce an output string for each matching subfield.
139
+ By default, specifications with multiple subfields (e.g. "240abc") will produce one single string of output per field (for each '240' field in the record), with the concatenation of each matched subfield. Specifications with single subfields (like "020a") will split subfields and produce an output string for each matching subfield (i.e. two output strings for a single '020' with two subfield 'a').
153
140
 
154
- For the syntax and complete possibilities of the specification
155
- string argument to extract_marc, see docs at the [MarcExtractor class](./lib/traject/marc_extractor.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/MarcExtractor)).
141
+ For the syntax and complete possibilities of the specification string argument to extract_marc, see docs at the [MarcExtractor class](./lib/traject/marc_extractor.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/MarcExtractor)).
156
142
 
157
- `extract_marc` also supports `translation maps` similar
158
- to SolrMarc's. There are some translation maps provided by traject,
159
- and you can also define your own, in yaml or ruby. Translation maps are especially useful
160
- for mapping form MARC codes to user-displayable strings:
143
+ `extract_marc` also supports `translation maps` similar to SolrMarc's. There are some translation maps provided by traject, and you can also define your own, in yaml or ruby. Translation maps are especially useful for mapping form MARC codes to user-displayable strings:
161
144
 
162
145
  ~~~ruby
163
146
  # "translation_map" will be passed to Traject::TranslationMap.new
@@ -165,18 +148,19 @@ for mapping form MARC codes to user-displayable strings:
165
148
  to_field "language", extract_marc("008[35-37]:041a:041d", :translation_map => "marc_language_code")
166
149
  ~~~
167
150
 
168
- To see all options for `extract_marc`, see the [method documentation](http://rdoc.info/gems/traject/Traject/Macros/Marc21:extract_marc)
151
+ To see all options for `extract_marc`, see the [extract_marc](http://rdoc.info/gems/traject/Traject/Macros/Marc21:extract_marc) method documentation.
169
152
 
170
- ## other built-in utility macros
153
+ ## Other built-in utility macros
171
154
 
172
- Other built-in methods that can be used with `to_field` include a hard-coded
173
- literal string:
155
+ Other built-in methods that can be used with `to_field` include:
156
+
157
+ a hard-coded literal string:
174
158
 
175
159
  ~~~ruby
176
160
  to_field "source", literal("LIB_CATALOG")
177
161
  ~~~
178
162
 
179
- The current record serialized back out as MARC, in binary, XML, or json:
163
+ the current record serialized back out as MARC, in binary, XML, or json:
180
164
 
181
165
  ~~~ruby
182
166
  # or :format => "json" for marc-in-json
@@ -186,7 +170,7 @@ The current record serialized back out as MARC, in binary, XML, or json:
186
170
  to_field "marc_record_raw", serialized_marc(:format => "binary", :binary_escape => false, :allow_oversized => true)
187
171
  ~~~
188
172
 
189
- Text of all fields in a range:
173
+ text of all fields in a range:
190
174
 
191
175
  ~~~ruby
192
176
  to_field "text", extract_all_marc_values(:from => "100", :to => "899")
@@ -194,11 +178,9 @@ Text of all fields in a range:
194
178
 
195
179
  All of these methods are defined at [Traject::Macros::Marc21](./lib/traject/macros/marc21.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Macros/Marc21))
196
180
 
197
- ## more complex canned MARC semantic logic
181
+ ## More complex canned MARC semantic logic
198
182
 
199
- Some more complex (and opinionated/subjective) algorithms for deriving semantics
200
- from Marc are also packaged with Traject, but not available by default. To make
201
- them available to your indexing, you just need to use ruby `require` and `extend`.
183
+ Some more complex (and opinionated/subjective) algorithms for deriving semantics from Marc are also packaged with Traject, but not available by default. To make them available to your indexing, you just need to use ruby `require` and `extend`.
202
184
 
203
185
  A number of methods are in [Traject::Macros::Marc21Semantics](./lib/traject/macros/marc21_semantics.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Macros/Marc21Semantics))
204
186
 
@@ -283,10 +265,10 @@ additional documentation page on [Indexing Rules: Macros and Custom Logic](./doc
283
265
 
284
266
  In addition to `to_field`, an `each_record` method is available, which,
285
267
  like `to_field`, is executed for every record, but without being tied
286
- to a specific field.
268
+ to a specific output field.
287
269
 
288
- `each_record` can be used for logging or notifiying; computing intermediate
289
- results; or writing to more than one field at once.
270
+ `each_record` can be used for logging or notifiying, computing intermediate
271
+ results, or writing to more than one field at once.
290
272
 
291
273
  ~~~ruby
292
274
  each_record do |record|
@@ -294,12 +276,9 @@ results; or writing to more than one field at once.
294
276
  end
295
277
  ~~~
296
278
 
297
- For more on `each_record`, see documentation page on [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md).
279
+ For more on `each_record`, see [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md).
298
280
 
299
- There is also an `after_processing` method that can be used to register
300
- logic that will be called after the entire has been processed. You can use it for whatever custom
301
- ruby code you might want for your app (send an email? Clean up a log file? Trigger
302
- a Solr replication?)
281
+ There is also an `after_processing` method that can be used to register logic that will be called after the entire has been processed. You can use it for whatever custom ruby code you might want for your app (send an email? Clean up a log file? Trigger a Solr replication?)
303
282
 
304
283
  ~~~ruby
305
284
  after_processing do
@@ -310,8 +289,7 @@ end
310
289
 
311
290
  ## Readers and Writers
312
291
 
313
- Traject uses modular 'Writer' classes to take the output hashes from transformation, and
314
- send them somewhere or do something useful with them.
292
+ Traject uses modular 'Writer' classes to take the output hashes from transformation and send them somewhere or do something useful with them.
315
293
 
316
294
  By default traject uses the [Traject::SolrJsonWriter](lib/traject/solr_json_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/SolrJsonWriter)) to send to Solr for indexing.
317
295
  Several other writers are also built-in:
@@ -419,6 +397,7 @@ Own Code](./doc/extending.md)
419
397
  * [traject-solrj_writer](https://github.com/traject/traject-solrj_writer): a jruby-only writer that uses the solrj .jar to talk directly to solr. Your only option for speaking to a solr version < 3.2, which is when the json handler was added to solr.
420
398
  * [traject_marc4j_reader](https://github.com/traject/traject-marc4j_reader): Packaged with traject automatically on jruby. A JRuby-only reader for
421
399
  reading marc records using the Marc4J library, fastest MARC reading on JRuby.
400
+ * [traject_sequel_writer](https://github.com/traject/traject_sequel_writer) A writer for sending to an rdbms via [Sequel](https://github.com/jeremyevans/sequel)
422
401
 
423
402
  # Development
424
403
 
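Reviewer sketch (not part of the gem diff): the README hunk above describes `provide` as "first setting wins" and `store` as "last set wins". A minimal plain-Ruby illustration of those two semantics follows. It assumes a hypothetical `SketchSettings` class written only for this note; it is not traject's own settings implementation, and the keys and values are example data.

```ruby
# Hypothetical stand-in for illustration only; NOT traject's settings class.
class SketchSettings
  def initialize
    @values = {}
  end

  # Set the key only if it has not been set before (first setting wins).
  def provide(key, value)
    @values[key] = value unless @values.key?(key)
  end

  # Always set the key, replacing any earlier value (last setting wins).
  def store(key, value)
    @values[key] = value
  end

  def [](key)
    @values[key]
  end
end

settings = SketchSettings.new

# Two provide calls for the same key: the second one is ignored.
settings.provide("solr.url", "http://example.org/solr/first")
settings.provide("solr.url", "http://example.org/solr/second")

# Two store calls for the same key: the second one overrides the first.
settings.store("log.level", "info")
settings.store("log.level", "debug")

puts settings["solr.url"]
puts settings["log.level"]
```

Under these assumptions, a value given first (for example on the command line) survives later `provide` calls in config files, while `store` is the way to override it deliberately.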
data/doc/extending.md CHANGED
@@ -13,13 +13,12 @@ of a couple traject features meant to make it easier.
13
13
 
14
14
  ## Expert Summary
15
15
 
16
- * Traject `-I` argument command line can be used to list directories to
16
+ * Load Path options:
17
+ * Traject `-I` argument command line can be used to list directories to
17
18
  add to the load path, similar to the `ruby -I` argument. You
18
19
  can then 'require' local project files from the load path.
19
- * Or modify the ruby `$LOAD_PATH` manually at the top of a traject config file you are loading.
20
- * translation map files found in a
21
- "./translation_maps" subdir on the load path will be found
22
- for Traject translation maps.
20
+ * Modify the ruby `$LOAD_PATH` manually at the top of a traject config file you are loading.
21
+ * NOTE: translation map files in a "./translation_maps" subdir on the load path will be available for to traject.
23
22
  * You can use Bundler with traject simply by creating a Gemfile with `bundler init`,
24
23
  and then running command line with `bundle exec traject` or
25
24
  even `BUNDLE_GEMFILE=path/to/Gemfile bundle exec traject`
@@ -114,11 +113,11 @@ a skeleton of your gem
114
113
  This will also make available rake commands to install your gem locally
115
114
  (`rake install`), or release it to the rubygems server (`rake release`).
116
115
 
117
- There are two main methods to use a gem in your traject project,
118
- with straight rubygems, or with bundler.
116
+ There are two main methods to use a gem in your traject project: with straight rubygems, or with bundler.
119
117
 
120
- Without bundler is simpler. Simply `gem install some_gem` from the
121
- command line, and now you can `require` that gem in your traject
118
+ ### without bundler (straight rubygems):
119
+
120
+ Without bundler may be simpler, at least at first. Simply `gem install some_gem` from the command line, and now you can `require` that gem in your traject
122
121
  config file, and use what it provides:
123
122
 
124
123
  ~~~ruby
@@ -129,25 +128,20 @@ require 'some_gem'
129
128
  SomeGem.whatever!
130
129
  ~~~
131
130
 
132
- A gem can provide traject translation map definitions
133
- in a `lib/translation_maps` sub-directory, and traject will be able to find those
134
- translation maps when the gem is loaded. (Because gems'
135
- `./lib` directories are by default added to the ruby load path.)
131
+ A gem can provide traject translation map definitions in a `lib/translation_maps` sub-directory, and traject will be able to find those translation maps when the gem is loaded (because gems' `./lib` directories are by default added to the ruby load path).
136
132
 
137
- ### Or, with bundler:
133
+ ### with bundler:
138
134
 
139
- However, if you then move your traject project to another system,
135
+ If you move your traject project to another system,
140
136
  where you haven't yet installed the `some_gem`, then running
141
- traject with this config file will, of course, fail. Or if you
137
+ traject with the above config file will, of course, fail. Or if you
142
138
  move your traject project to another system with a slightly
143
139
  different version of `some_gem`, your traject indexing could
144
140
  behave differently in confusing ways. As the number of gems
145
- you are using increases, managing this gets increasingly
141
+ you are using increases, managing the gems and gem versions gets increasingly
146
142
  confusing.
147
143
 
148
- [bundler](http://bundler.io/) was invented to make this kind of dependency management
149
- more straightforward and reliable. We recommend you consider using
150
- bundler, especially for traject installations where traject will
144
+ [bundler](http://bundler.io/) was invented to make this kind of dependency management in ruby more straightforward and reliable. We recommend you consider using bundler, especially for traject installations where traject will
151
145
  be run via automated batch jobs on production servers.
152
146
 
153
147
  Bundler's behavior is based on a `Gemfile` that lists your
@@ -156,15 +150,14 @@ by running `bundler init`, probably in the directory
156
150
  right next to your traject config files.
157
151
 
158
152
  Then specify what gems your traject project will use,
159
- possibly with version restrictions, in the [Gemfile](http://bundler.io/v1.3/gemfile.html) --
160
- **do** include `gem 'traject'` in the Gemfile.
153
+ possibly with version restrictions, in the [Gemfile](http://bundler.io/v1.3/gemfile.html)
154
+
155
+ Be sure to include `gem 'traject'` in the Gemfile.
161
156
 
162
157
  Run `bundle install` from the directory with the Gemfile, on any system
163
- at any time, to make sure specified gems are installed.
158
+ at any time, to make sure specified gems are installed. (The bundler gem must be already installed on the system.)
164
159
 
165
- **Run traject** with `bundle exec` to have bundler set up the environment
166
- from your Gemfile. You can `cd` into the directory containing the Gemfile,
167
- so bundler can find it:
160
+ **Run traject** with `bundle exec` to have bundler set up the traject environment from your Gemfile. You can `cd` into the directory containing the Gemfile, so bundler can find it:
168
161
 
169
162
  $ cd /some/where
170
163
  $ bundle exec traject -c some_traject_config.rb ...
@@ -178,7 +171,7 @@ Bundler will make sure the specified versions of all gems are used by
178
171
  traject, and also make sure no gems except those specified in the gemfile
179
172
  are available to the program, for a reliable reproducible environment.
180
173
 
181
- You should still `require` the gem in your traject config file,
174
+ You still need to `require` the gem in your traject config file;
182
175
  then just refer to what it provides in your config code as usual.
183
176
 
184
177
  You should check both the `Gemfile` and the `Gemfile.lock`
data/doc/indexing_rules.md CHANGED
@@ -16,21 +16,17 @@ That `do` is just ruby `block` syntax, whereby we can pass a block of ruby code
16
16
 
17
17
  The block is then stored by the Traject::Indexer, and called for each record indexed, with three arguments provided.
18
18
 
19
- #### record argument
19
+ ### record argument
20
20
 
21
21
  The record that gets passed to your block is a MARC::Record object (or, theoretically, any object that gets returned by a traject Reader). Your logic will usually examine the record to calculate the desired output.
22
22
 
23
23
  ### accumulator argument
24
24
 
25
- The accumulator argument is an array. At the end of your custom code, the accumulator
26
- array should hold the output you want to send off, to the field specified in the `to_field`.
25
+ The accumulator argument is an Array. At the end of your custom code, the accumulator Array should hold the output you want send off to the field specified in `to_field`.
27
26
 
28
- The accumulator is a reference to a ruby array, and you need to **modify** that array,
29
- manipulating it in place with Array methods that mutate the array, like `concat`, `<<`,
30
- `map!` or even `replace`.
27
+ The accumulator is a reference to a ruby Array, and you need to **modify** that Array, manipulating it in place with Array methods that mutate the array, like `concat`, `<<`, `map!` or even `replace`.
31
28
 
32
- You can't simply assign the accumulator variable to a different array, that won't work,
33
- you need to modify the array in-place.
29
+ You can't simply assign the accumulator variable to a different Array; you need to modify the Array *in place*.
34
30
 
35
31
  # Won't work, assigning variable
36
32
  to_field('foo') do |rec, acc|
@@ -50,21 +46,16 @@ you need to modify the array in-place.
50
46
  to_field('foo') do |rec, acc|
51
47
  acc << 'bill'
52
48
  acc << 'dueber'
53
- acc = acc.map!{|str| str.upcase} #notice using "map!" not just "map"
49
+ acc.map!{|str| str.upcase} # NOTE: "map!" not "map"
54
50
  end
55
51
 
56
52
  ### context argument
57
53
 
58
- The third optional context argument
59
-
60
- The third optional argument is a
61
- [Traject::Indexer::Context](./lib/traject/indexer/context.rb) ([rdoc](http://rdoc.info/github/traject/traject/Traject/Indexer/Context))
62
- object. Most of the time you don't need it, but you can use it for
63
- some sophisticated functionality, for example using these Context methods:
54
+ The third optional argument is a [Traject::Indexer::Context](./lib/traject/indexer/context.rb) ([rdoc](http://rdoc.info/github/traject/traject/Traject/Indexer/Context)) object. Most of the time you don't need it, but you can use it for some sophisticated functionality. These are some useful methods available:
64
55
 
65
56
  * `context.clipboard` A hash into which you can stuff values that you want to pass from one indexing step to another. For example, if you go through a bunch of work to query a database and get a result you'll need more than once, stick the results somewhere in the clipboard. This clipboard is record-specific, and won't persist between records.
66
- * `context.position` The position of the record in the input file (e.g., was it the first record, seoncd, etc.). Useful for error reporting
67
- * `context.output_hash` A hash mapping the field names (generally defined in `to_field` calls) to an array of values to be sent to the writer associated with that field. This allows you to modify what goes to the writer without going through a `to_field` call -- you can just set `context.output_hash['myfield'] = ['my', 'values']` and you're set. See below for more examples
57
+ * `context.position` The position of the record in the input file (e.g., was it the first record, second, etc.). Useful for error reporting.
58
+ * `context.output_hash` A hash mapping the field names (generally defined in `to_field` calls) to an array of values to be sent to the writer associated with that field. This allows you to modify what goes to the writer without going through a `to_field` call -- you can just set `context.output_hash['myfield'] = ['my', 'values']` and you're set. See below for more examples.
68
59
  * `context.skip!(msg)` An assertion that this record should be ignored. No more indexing steps will be called, no results will be sent to the writer, and a `debug`-level log message will be written stating that the record was skipped.
69
60
 
70
61
 
@@ -102,28 +93,26 @@ end
102
93
  ```
103
94
 
104
95
  Certain built-in traject calls have been optimized to be high performance
105
- so it's safe to do them inside 'inner loop' blocks though.
106
- That includes `Traject::TranslationMap.new` and `Traject::MarcExtractor.cached("xxx")`
107
- (note #cached rather than #new there)
96
+ so it's safe to do them inside 'inner loop' blocks. That includes `Traject::TranslationMap.new` and `Traject::MarcExtractor.cached("xxx")`
97
+ (NOTE: #cached rather than #new there)
108
98
 
109
99
 
110
100
  ## From block to lambda
111
101
 
112
102
  In the ruby language, in addition to creating a code block as an argument
113
- to a method with `do |args| ... end` or `{|arg| ... }, we can also create
103
+ to a method with `do |args| ... end` or `{|arg| ... }`, we can also create
114
104
  a code block to hold in a variable, with the `lambda` keyword:
115
105
 
116
106
  always_output_foo = lambda do |record, accumulator|
117
107
  accumulator << "FOO"
118
108
  end
119
109
 
120
- traject `to_field` is written so, as a convenience, it can take a lambda expression
121
- stored in a variable as an alternative to a block:
110
+ In traject, `to_field` is written so that, as a convenience, it can take a lambda expression stored in a variable as an alternative to a block:
122
111
 
123
112
  to_field("always_has_foo"), always_output_foo
124
113
 
125
114
  Why is this a convenience? Well, ordinarily it's not something we
126
- need, but in fact it's what allows traject 'macros' as re-useable
115
+ need, but in fact it's what allows traject 'macros' to be re-useable
127
116
  code templates.
128
117
 
129
118
 
@@ -131,10 +120,9 @@ code templates.
131
120
 
132
121
  A Traject macro is a way to automatically create indexing rules via re-usable "templates".
133
122
 
134
- Traject macros are simply methods that return ruby lambda/proc objects, possibly creating
135
- them based on parameters passed in.
123
+ Traject macros are methods that return ruby lambda/proc objects, possibly creating them based on parameters passed in.
136
124
 
137
- Here is in fact how the `literal` function is implemented:
125
+ For example, here is the implementation of the `literal` method/macro:
138
126
 
139
127
  ~~~ruby
140
128
  def literal(value)
@@ -144,12 +132,12 @@ def literal(value)
144
132
  accumulator << value
145
133
  end
146
134
  end
147
- to_field("something"), literal("something")
135
+ to_field("fieldname"), literal("my_fav_literal")
148
136
  ~~~
149
137
 
150
- It's really as simple as that, that's all a Traject macro is. A function that takes parameters, and based on those parameters returns a lambda; the lambda is then passed to the `to_field` indexing method, or similar methods.
138
+ So a Traject macro is a method that may have parameters and, based on those parameters, returns a lambda; the lambda is then passed to the `to_field` indexing method, or similar methods.
151
139
 
152
- How do you make these methods available to the indexer?
140
+ How do you make these methods available to the traject indexer?
153
141
 
154
142
  Define it in a module:
155
143
 
@@ -173,15 +161,15 @@ in one of your config files:
173
161
  require 'literal_macro'
174
162
  extend LiteralMacro
175
163
 
176
- to_field ...
164
+ to_field "fieldname", literal("my_fav_literal")
177
165
  ~~~
178
166
 
179
167
  That's it. You can use the traject command line `-I` option to set the ruby load path, so your file will be findable via `require`. Or you can distribute it in a gem, and use straight rubygems and the `gem` command in your configuration file, or Bundler with traject command-line `-g` option.
180
168
 
181
- ## Using a lambda _and_ and block
169
+ ## Using a lambda _and_ a block
182
170
 
183
171
  Traject macros (such as `extract_marc`) create and return a lambda. If
184
- you include a lambda _and_ a block on a `to_field` call, the latter
172
+ you include a lambda _and_ a block on a `to_field` call, the block
185
173
  gets the accumulator as it was filled in by the former.
186
174
 
187
175
  ```ruby
@@ -196,38 +184,42 @@ to_field('foo'), mylam do |rec, acc, context|
196
184
  acc << 'two'
197
185
  end #=> context.output_hash['foo'] == ['one', 'two']
198
186
 
199
-
200
187
  # You might also want to do something like this
201
-
202
- to_field('foo'), my_macro_that_doesn't_dedup_ do |rec, acc|
188
+ to_field 'foo', macro_returning_dup_values do |rec, acc|
203
189
  acc.uniq!
204
190
  end
205
191
  ```
206
192
 
207
193
  ## Manipulating `context.output_hash` directly
208
194
 
209
- If you ask for the context argument, a [Traject::Indexer::Context](./lib/traject/indexer/context.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Indexer/Context)), you have access to context.output_hash, with is
210
- the hash of transformed output that will be sent to Solr (or any other Writer)
195
+ If you ask for the context argument, a [Traject::Indexer::Context](./lib/traject/indexer/context.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Indexer/Context)), you have access to `context.output_hash`, which is
196
+ the hash of already transformed output that will be sent to Solr (or any other Writer).
197
+
198
+ You can examine `context.output_hash` to see any already transformed output and use it as the source for new output.
199
+
200
+ You can *write* to `context.output_hash` directly, which can be useful for computations that affect more than one output field at once.
211
201
 
212
- You can look in there to see any already transformed output and use it as the source
213
- for new output. You can actually *write* to there manually, which can be useful
214
- to write routines that effect more than one output field at once.
202
+ **Note**: Make sure you always assign an _Array_ to each `context.output_hash` value, e.g., `context.output_hash['foo']`, not a single value!
215
203
 
216
- **Note**: Make sure you always assign an _array_ to, e.g., `context.output_hash['foo']`, not a single value!
204
+ ```ruby
205
+
206
+ # Wrong - do NOT assign a value of anything other than an Array
207
+ context.output_hash['fieldname'] = 'fuzzy_wuzzies'
208
+
209
+ # Correct
210
+ context.output_hash['fieldname'] = ['fuzzy_wuzzies']
211
+ ```
217
212
 
218
213
 
219
214
 
220
215
  ## each_record
221
216
 
222
- All the previous discussion was in terms of `to_field` -- `each_record` is a similar
223
- routine, to define logic that is executed for each record, but isn't fixed to write
224
- to a single output field.
217
+ `each_record` is similar to `to_field` in that it defines logic executed for each record. It differs from `to_field` because the output of `each_record` is not associated with a specific output field.
225
218
 
226
- So `each_record` blocks have no `accumulator` argument, instead they either take a single
227
- `record` argument; or both a `record` and a `context`.
219
+ Thus, `each_record` blocks have no `accumulator` argument; instead, they take either a single `record` argument or both a `record` and a `context`.
228
220
 
229
- `each_record` can be used for logging or notifiying; computing intermediate
230
- results; or writing to more than one field at once.
221
+ `each_record` is useful for logging or notifying, computing intermediate
222
+ results, or writing to more than one field at once.
231
223
 
232
224
  ~~~ruby
233
225
  each_record do |record, context|
@@ -239,20 +231,17 @@ each_record do |record, context|
239
231
  end
240
232
 
241
233
  each_record do |record, context|
242
- (one, two) = calculate_two_things_from(record)
234
+ (val1, val2) = calculate_two_things_from(record)
243
235
 
244
236
  context.output_hash["first_field"] ||= []
245
- context.output_hash["first_field"] << one
237
+ context.output_hash["first_field"] << val1
246
238
 
247
239
  context.output_hash["second_field"] ||= []
248
- context.output_hash["second_field"] << one
240
+ context.output_hash["second_field"] << val2
249
241
  end
250
242
  ~~~
251
243
 
252
- traject doesn't come with any macros written for use with
253
- `each_record`, but they could be created if useful --
254
- just methods that return lambda's taking the right
255
- args for `each_record`.
244
+ traject doesn't come with any macros written for use with `each_record`, but they could be created: such macros would be methods that return a lambda taking the arguments `each_record` passes (a `record`, and optionally a `context`).
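
As a rough illustration of what such an `each_record` macro might look like, here is a minimal sketch. The `copy_field` method and the `FakeContext` stand-in are hypothetical names invented for this example, not part of traject; the only assumption about the real API is that the context exposes `output_hash`, as described above.

```ruby
# Hypothetical macro (not part of traject): returns a lambda with the
# (record, context) signature that each_record blocks use.
def copy_field(from, to)
  lambda do |_record, context|
    values = context.output_hash[from]
    # Nothing to copy: leave the output hash untouched
    next if values.nil? || values.empty?

    # Always keep an Array as the value, as the note above requires
    context.output_hash[to] ||= []
    context.output_hash[to].concat(values)
  end
end

# Stand-in for Traject::Indexer::Context so the sketch runs on its own
FakeContext = Struct.new(:output_hash)

context = FakeContext.new({ "title" => ["Moby Dick"] })
step = copy_field("title", "title_sort")
step.call(nil, context)

puts context.output_hash.inspect
```

In a real configuration file the returned lambda would presumably be handed to `each_record` in place of a block, after any `to_field` steps that fill in the source field.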
256
245
 
257
246
  ## More tips and gotchas about indexing steps
258
247
 
@@ -262,4 +251,4 @@ args for `each_record`.
262
251
 
263
252
  * **Once you call `context.skip!(msg)` no more index steps will be run for that record**. So if you have any cleanup code, you'll need to make sure to call it yourself.
264
253
 
265
- * **By default, `trajcet` indexing runs multi-threaded**. In the current implementation, the indexing steps for one record are *not* split across threads, but different records can be processed simultaneously by more than one thread. That means you need to make sure your code is thread-safe (or always set `processing_thread_pool` to 0).
254
+ * **By default, `traject` indexing runs multi-threaded**. In the current implementation, the indexing steps for one record are *not* split across threads, but different records can be processed simultaneously by more than one thread. That means you need to make sure your code is thread-safe (or always set `processing_thread_pool` to 0).
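
To make the thread-safety point concrete, the sketch below guards shared state with a `Mutex`. It does not use traject itself: the `tally_format` lambda, the hash-based records, and the hand-rolled threads are all assumptions standing in for an indexing step and traject's processing thread pool.

```ruby
# Shared state touched by every record, so it needs a lock
counts = Hash.new(0)
lock = Mutex.new

# Stand-in for an indexing step; a real step would receive a record
# and a Traject::Indexer::Context instead of a plain Hash and nil.
tally_format = lambda do |record, _context|
  # Without the lock, the read-increment-write on `counts` could
  # interleave between threads and lose updates.
  lock.synchronize { counts[record[:format]] += 1 }
end

# Simulate several worker threads, each processing its own records
records = Array.new(200) { |i| { format: i.even? ? "book" : "map" } }
threads = records.each_slice(50).map do |slice|
  Thread.new { slice.each { |rec| tally_format.call(rec, nil) } }
end
threads.each(&:join)

puts counts.inspect
```

Steps that only read the record and write to their own accumulator or `context` need no such locking, because one record's steps are not split across threads; the lock matters only for state shared between records.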