traject 2.3.4 → 3.0.0.alpha.1

Files changed (69)
  1. checksums.yaml +5 -5
  2. data/.travis.yml +16 -9
  3. data/CHANGES.md +74 -1
  4. data/Gemfile +2 -1
  5. data/README.md +104 -53
  6. data/Rakefile +8 -1
  7. data/doc/indexing_rules.md +79 -63
  8. data/doc/programmatic_use.md +218 -0
  9. data/doc/settings.md +28 -1
  10. data/doc/xml.md +134 -0
  11. data/lib/traject.rb +5 -0
  12. data/lib/traject/array_writer.rb +34 -0
  13. data/lib/traject/command_line.rb +18 -22
  14. data/lib/traject/debug_writer.rb +2 -5
  15. data/lib/traject/experimental_nokogiri_streaming_reader.rb +276 -0
  16. data/lib/traject/hashie/indifferent_access_fix.rb +25 -0
  17. data/lib/traject/indexer.rb +321 -92
  18. data/lib/traject/indexer/context.rb +39 -13
  19. data/lib/traject/indexer/marc_indexer.rb +30 -0
  20. data/lib/traject/indexer/nokogiri_indexer.rb +30 -0
  21. data/lib/traject/indexer/settings.rb +36 -53
  22. data/lib/traject/indexer/step.rb +27 -33
  23. data/lib/traject/macros/marc21.rb +37 -12
  24. data/lib/traject/macros/nokogiri_macros.rb +43 -0
  25. data/lib/traject/macros/transformation.rb +162 -0
  26. data/lib/traject/marc_extractor.rb +2 -0
  27. data/lib/traject/ndj_reader.rb +1 -1
  28. data/lib/traject/nokogiri_reader.rb +179 -0
  29. data/lib/traject/oai_pmh_nokogiri_reader.rb +159 -0
  30. data/lib/traject/solr_json_writer.rb +19 -12
  31. data/lib/traject/thread_pool.rb +13 -0
  32. data/lib/traject/util.rb +14 -2
  33. data/lib/traject/version.rb +1 -1
  34. data/test/debug_writer_test.rb +3 -3
  35. data/test/delimited_writer_test.rb +3 -3
  36. data/test/experimental_nokogiri_streaming_reader_test.rb +169 -0
  37. data/test/indexer/context_test.rb +23 -13
  38. data/test/indexer/error_handler_test.rb +59 -0
  39. data/test/indexer/macros/macros_marc21_semantics_test.rb +46 -46
  40. data/test/indexer/macros/marc21/extract_all_marc_values_test.rb +1 -1
  41. data/test/indexer/macros/marc21/extract_marc_test.rb +19 -9
  42. data/test/indexer/macros/marc21/serialize_marc_test.rb +4 -4
  43. data/test/indexer/macros/to_field_test.rb +2 -2
  44. data/test/indexer/macros/transformation_test.rb +177 -0
  45. data/test/indexer/map_record_test.rb +2 -3
  46. data/test/indexer/nokogiri_indexer_test.rb +103 -0
  47. data/test/indexer/process_record_test.rb +55 -0
  48. data/test/indexer/process_with_test.rb +148 -0
  49. data/test/indexer/read_write_test.rb +52 -2
  50. data/test/indexer/settings_test.rb +34 -24
  51. data/test/indexer/to_field_test.rb +27 -2
  52. data/test/marc_extractor_test.rb +7 -7
  53. data/test/marc_reader_test.rb +4 -4
  54. data/test/nokogiri_reader_test.rb +158 -0
  55. data/test/oai_pmh_nokogiri_reader_test.rb +23 -0
  56. data/test/solr_json_writer_test.rb +24 -28
  57. data/test/test_helper.rb +8 -2
  58. data/test/test_support/namespace-test.xml +7 -0
  59. data/test/test_support/nokogiri_demo_config.rb +17 -0
  60. data/test/test_support/oai-pmh-one-record-2.xml +24 -0
  61. data/test/test_support/oai-pmh-one-record-first.xml +24 -0
  62. data/test/test_support/sample-oai-no-namespace.xml +197 -0
  63. data/test/test_support/sample-oai-pmh.xml +197 -0
  64. data/test/thread_pool_test.rb +38 -0
  65. data/test/translation_map_test.rb +3 -3
  66. data/test/translation_maps/ruby_map.rb +2 -1
  67. data/test/translation_maps/yaml_map.yaml +2 -1
  68. data/traject.gemspec +4 -11
  69. metadata +92 -6
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
- SHA1:
3
- metadata.gz: 70d8b265e2e00866a63fdc067172ee1174efb068
4
- data.tar.gz: 6588b8231b636d268765a5428a607b731a34fe7f
2
+ SHA256:
3
+ metadata.gz: 176864633191a53e32c9563227072f16d83e8bc27f2e0d1c9a436fc3d281fc21
4
+ data.tar.gz: 468120d2066634a98aa18438820e24e4bb099125886bfca86f61e53d080070a3
5
5
  SHA512:
6
- metadata.gz: 564a834087b4b5d0b032a9a0797cd1587f241afcbf35bf0c3aa928724f9f26fb8d3e7220d43e909c9bf4c03dd91873113e4d2c081ee68ad12ed151ff3ba34284
7
- data.tar.gz: 9cbc506f6f2ab5bcbf77811c5018a91d8b588eff2d0c75960b3aada514756e6d90bce441da3db59d1aaa0a9a0f62b3e0c56d1be2c494aae091b06a61f94de46c
6
+ metadata.gz: 156d479792897beac99ecbc84a6da06470b7baf1edb1f4b3008deb3e26621fca1723112bb2e46f20406c320b486a679b8c2247f77283a358f6f17f7b4444c7e4
7
+ data.tar.gz: 6e9e0f3c80ca6c2a36adb455a416700f2008cffb5fb66093be1f4e1fb3828a236da4bdf7e1fd26e9d3762b8554c544779138d8636fba6241d3ae682b2df468d9
data/.travis.yml CHANGED
@@ -1,12 +1,19 @@
1
1
  language: ruby
2
2
  cache: bundler
3
- sudo: false
3
+ # we don't really need `sudo: true`, but for some reason travis docker-based systems are unreliable
4
+ # at downloading jruby, and
5
+ sudo: true
4
6
  rvm:
5
- - jruby-19mode
6
- - jruby-9.0.4.0
7
- - 1.9
8
- - 2.2.8
9
- - 2.3.5
10
- - 2.4.2
11
- jdk:
12
- - oraclejdk8
7
+ - 2.3.6
8
+ - 2.4.3
9
+ - 2.5.1
10
+ - "2.6.0-preview2"
11
+ # avoid having travis install jdk on MRI builds where we don't need it.
12
+ matrix:
13
+ include:
14
+ - jdk: openjdk8
15
+ rvm: jruby-9.0.5.0
16
+ - jdk: openjdk8
17
+ rvm: jruby-9.2.0.0
18
+ allow_failures:
19
+ - rvm: "2.6.0-preview2"
data/CHANGES.md CHANGED
@@ -1,5 +1,78 @@
1
1
  # Changes
2
2
 
3
+ ## 3.0.0
4
+
5
+ ### Changed/Backwards Incompatibilities
6
+
7
+ * JRuby traject no longer includes `traject-marc4j_reader` as a dependency or default reader, although it may provide faster MARC-XML reading on JRuby. To use it manually, see https://github.com/traject/traject-marc4j_reader . See https://github.com/traject/traject/pull/187
8
+
9
+ * `map_record` now returns `nil` if record was skipped.
10
+
11
+ * The `Traject::Indexer` class no longer includes marc-specific settings and modules.
12
+ * If you are using command-line `traject`, this should make no difference to you, as command-line now defaults to the new `Traject::Indexer::MarcIndexer` with those removed things.
13
+ * If you are using Traject::Indexer programmatically and want those features, switch to using `Traject::Indexer::MarcIndexer`.
14
+ * If necessary, as a hopefully temporary backwards-compat shim, call `Traject::Indexer.legacy_marc_mode!`, which injects the old marc-specific behavior into Traject::Indexer again, globally and permanently.
15
+
16
+ * Traject::Indexer::Settings no longer has its own global defaults. Instead, it can be given a set of defaults with #with_defaults, usually right after instantiation, to support different defaults for different Indexers.
17
+
18
+ * SolrJsonWriter now assumes an /update/json convenience url is available in solr instead of trying to verify it. If you are using an older Solr (before 4?) or otherwise want a different update url, just use setting `solr.update_url`
19
+
20
+
21
+ ### Added
22
+
23
+ * Traject::Indexer#configure is available, and recommended instead of raw `instance_eval`. It just does an instance_eval, but is clearer and safer for future changes.
24
+
25
+ * traject command line can now take multiple input files. And underlying it, Traject::Indexer#process can take an array of input streams.
26
+
27
+ * There is now a built-in mode for XML source records, see docs at [xml.md](./doc/xml.md)
28
+
29
+ * New setting `mapping_rescue` is available, to supply custom logic for handling errors. See docs at [settings.md](./doc/settings.md)
30
+
31
+ * Call Traject::ThreadPool.disable_concurrency! to force all pool sizes to be 0, and work to be performed inline. All threading will be disabled.
32
+
33
+ * `to_field` can now take an array as a first argument, to send values to multiple fields mentioned, eg:
34
+
35
+ to_field ["field1", "field2"], extract_marc("240")
36
+
37
+ * `to_field` can take multiple transformation procs (all with the same form). https://github.com/traject/traject/pull/153
38
+
39
+ * There is a new set of standard transformation macros included in `Traject::Indexer`, from [Traject::Macros::Transformation](./lib/traject/macros/transformation.rb). It includes an extraction of previous/existing arguments from `extract_marc`, along with some additional macros. https://github.com/traject/traject/pull/154
40
+ * This is the new preferred way to do post-processing, rather than with the `extract_marc` options; the existing options are not deprecated and there is no current plan for them to be removed.
41
+ * before:
42
+
43
+ to_field "some_field", extract_marc("800",
44
+ translation_map: "marc_800_map",
45
+ allow_duplicates: true,
46
+ first: true,
47
+ default: "default value")
48
+ * now preferred:
49
+
50
+ to_field "some_field", extract_marc("800", allow_duplicates: true),
51
+ translation_map("marc_800_map"),
52
+ first_only,
53
+ default("default value")
54
+
55
+ (`allow_duplicates: true` is still needed because it defaults to false in extract_marc, but see also the `unique` macro)
56
+
57
+ * So, these transformation steps can now be used with non-MARC formats as well. See also new transformation macros: `strip`, `split`, `append`, `prepend`, `gsub`, and `transform`. And for MARC use, `trim_punctuation`.
58
+
59
+
60
+ * Traject::Indexer new api, for more convenient programmatic/embedded use.
61
+
62
+ * `Traject::Indexer.new` takes a block for config
63
+
64
+ * `Traject::Indexer#process_record`
65
+
66
+ * `Traject::Indexer#process_with`
67
+
68
+ * `Traject::Indexer#complete` and `#run_after_processing_steps` public API.
69
+
70
+ * `Traject::SolrJsonWriter#flush`, flush to solr without closing, may be useful for direct programmatic use.
71
+
72
+ * Traject::Indexer sub-classes can implement a #source_record_id_proc, which is passed to Context, for source-format-specific logic for getting an ID to use in logging.
73
+
74
+ * command line takes an `-i` flag for choice of indexer.
75
+
3
76
  ## 2.3.4
4
77
  * Totally internal change to provide easier hooks into indexing process
5
78
 
@@ -35,7 +108,7 @@
35
108
 
36
109
  ## 2.2.1
37
110
  * Had inadvertently broken use of arrays as extract_marc specifications. Fixed.
38
-
111
+
39
112
  ## 2.2.0
40
113
  * Change DebugWriter to be more forgiving (and informative) about missing record-id fields
41
114
  * Automatically require DebugWriter for easier use on the command line
data/Gemfile CHANGED
@@ -4,9 +4,10 @@ source 'https://rubygems.org'
4
4
  gemspec
5
5
 
6
6
  group :development do
7
- gem "nokogiri" # used only for rake tasks load_maps:
7
+ gem "webmock", "~> 3.4"
8
8
  end
9
9
 
10
10
  group :debug do
11
11
  gem "ruby-debug", :platform => "jruby"
12
+ gem "byebug", :platform => "mri"
12
13
  end
data/README.md CHANGED
@@ -1,10 +1,8 @@
1
1
  # Traject
2
2
 
3
- An easy to use, high-performance, flexible and extensible MARC to Solr indexer.
3
+ An easy to use, high-performance, flexible and extensible metadata transformation system, focused on library-archives-museums input, and indexing to Solr as output.
4
4
 
5
- (Questions about use are welcome here or on the [google group](https://groups.google.com/forum/#!forum/traject-users))
6
-
7
- You might use [traject](https://github.com/traject/traject) to index MARC data for a Solr-based discovery product like [Blacklight](https://github.com/projectblacklight/blacklight) or [VUFind](http://vufind.org/).
5
+ You might use [traject](https://github.com/traject/traject) to index MARC or XML data for a Solr-based discovery product like [Blacklight](https://github.com/projectblacklight/blacklight) or [VUFind](http://vufind.org/).
8
6
 
9
7
  Traject can also be generalized to a set of tools for getting structured data from a source, and transforming it to a hash-like object to send to a destination. In addition to sending data to Solr, Traject can produce json or yaml files, tab-delimited files, CSV files, and output suitable for debugging by a human.
10
8
 
@@ -20,43 +18,34 @@ Initially by Jonathan Rochkind (Johns Hopkins Libraries) and Bill Dueber (Univer
20
18
 
21
19
  * Basic configuration files can be easily written even by non-rubyists, with a few simple directives traject provides. But config files are 'ruby all the way down', so we can provide a gradual slope to more complex needs, with the full power of ruby.
22
20
  * Easy to program, easy to read, easy to modify.
23
- * Fast. Traject by default indexes using multiple threads, on multiple cpu cores, when the underlying
24
- ruby implementation (i.e., JRuby) allows it, and can use a separate thread for communication with
25
- solr even under MRI.
21
+ * Fast. Traject by default indexes using multiple threads, on multiple cpu cores, when the underlying ruby implementation (i.e., JRuby) allows it, and can use a separate thread for communication with solr even under MRI. Traject is intended to be usable to process millions of records.
26
22
  * Composed of decoupled components, for flexibility and extensibility.
27
23
  * Designed to support local code and configuration that's maintainable and testable, and can be shared between projects as ruby gems.
28
- * Easy to split configuration between multiple files, for simple "pick-and-choose" command line options
29
- that can combine to deal with any of your local needs.
24
+ * Easy to split configuration between multiple files, for simple "pick-and-choose" command line options that can combine to deal with any of your local needs.
30
25
 
31
26
 
32
27
  ## Installation
33
28
 
34
- Traject runs under jruby (1.7.x or higher), MRI ruby (1.9.3 or higher), or probably any other ruby platform.
35
-
36
- **Traject runs much faster on JRuby** where it can use multi-core parallelism, and the Java Marc4J marc reader. If performance is a concern, you should run traject on JRuby.
29
+ Traject runs under jruby (9.0.x or higher), MRI ruby (2.3.x or higher), or probably any other ruby platform.
37
30
 
38
- Some options for installing a ruby other than your system-provided one are [chruby](https://github.com/postmodern/chruby) and [ruby-install](https://github.com/postmodern/ruby-install#readme).
31
+ Once you have ruby installed, just `$ gem install traject`.
39
32
 
40
- Once you have ruby, just `$ gem install traject`.
33
+ **If you are processing MARC input, you will probably get significant performance improvements on JRuby.** If you are processing Marc-XML (rather than binary Marc21), you should additionally get even more performance improvements by using the [traject-marc4j_reader](https://github.com/traject/traject-marc4j_reader) gem on JRuby (see installation instructions there). If performance is a concern, and you are processing MARC, we recommend benchmarking traject on JRuby. (JRuby is not currently recommended for non-MARC XML input.)
41
34
 
42
- (**Note**: We might in the future provide an all-in-one .jar distribution, which will not require you to install jruby on your system, for those who want the multi-threading of jruby without having to actually install it. Let us know if interested.)
35
+ Some options for installing a ruby other than your system-provided one are [chruby](https://github.com/postmodern/chruby) and [ruby-install](https://github.com/postmodern/ruby-install#readme).
43
36
 
44
37
 
45
38
  ## Configuration files
46
39
 
47
- traject is configured using configuration files. To get a sense of what they look like, you can take a look at our sample basic configuration file,
48
- [demo_config.rb](./test/test_support/demo_config.rb). You could run traject with that configuration file as: `traject -c path/to/demo_config.rb marc_file.marc`.
40
+ traject is configured using configuration files. To get a sense of what they look like, you can take a look at our sample basic [MARC configuration file](./test/test_support/demo_config.rb) or [XML configuration file](./test/test_support/nokogiri_demo_config.rb). You could run traject with that configuration file as: `traject -c path/to/demo_config.rb marc_file.marc`.
49
41
 
50
42
  Configuration files are actually just ruby -- so by convention they end in `.rb`.
51
43
 
52
44
  We hope you can write basic useful configuration files without much ruby experience, since traject gives you some easy functions to use for common directives. But the full power of ruby is available to you if needed.
53
45
 
54
- **rubyist tip**: Technically, config files are executed with `instance_eval` in a Traject::Indexer instance, so the special commands you see are just methods on Traject::Indexer (or mixed into it). But you can
55
- call ordinary ruby `require` in config files, etc., too, to load
56
- external functionality. See more at Extending Logic below.
46
+ **rubyist tip**: Technically, config files are executed with `instance_eval` in a Traject::Indexer instance, so the special commands you see are just methods on Traject::Indexer (or mixed into it). But you can call ordinary ruby `require` in config files, etc., too, to load external functionality. See more at Extending Logic below.
57
47
 
58
- You can keep your settings and indexing rules in one config file,
59
- or split them across multiple config files however you like. (Connection details vs indexing? Common things vs environmental specific things?)
48
+ You can keep your settings and indexing rules in one config file, or split them across multiple config files however you like. (Connection details vs indexing? Common things vs environmental specific things?)
60
49
 
61
50
  There are two main categories of directives in your configuration files: _Settings_, and _Indexing Rules_.
62
51
 
@@ -84,9 +73,9 @@ settings do
84
73
  # various others...
85
74
  provide "solr_writer.commit_on_close", "true"
86
75
 
87
- # The default writer is the Traject::SolrJsonWriter. The default
88
- # reader is Marc4JReader (using Java Marc4J library) on Jruby,
89
- # MarcReader (using ruby-marc) otherwise.
76
+ # The default writer is the Traject::SolrJsonWriter. In MARC mode (the default),
77
+ # the default reader is MarcReader (using ruby-marc).
78
+ # In XML mode, it is the NokogiriReader.
90
79
  end
91
80
  ~~~
92
81
 
@@ -98,24 +87,26 @@ See, docs page on [Settings](./doc/settings.md) for list
98
87
  of all standardized settings.
99
88
 
100
89
 
101
- ## Indexing rules: 'to_field' and 'extract_marc'
90
+ ## Indexing rules: 'to_field'
102
91
 
103
- There are a few methods that can be used to create indexing rules. We will touch on the two most commonly used methods here. More information is available in [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md)
92
+ There are a few methods that can be used to create indexing rules. We will touch on the two most commonly used methods here. More information and technical details are available in [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md)
104
93
 
105
94
  `to_field` establishes a rule to extract content to a particular named output field. A `to_field` extraction rule can use built-in 'macros', or, as we'll see later, entirely custom logic.
106
95
 
107
- The built-in macro you'll use the most is `extract_marc`, to extract
108
- data out of a MARC record according to a tag/subfield specification.
96
+ The built-in macros most commonly used are:
97
+
98
+ * In MARC mode, `extract_marc`, to extract data out of a MARC record according to a tag/subfield specification.
99
+ * In XML mode, `extract_xpath`. For more on XML use of traject, see the [XML guide](./doc/xml.md).
100
+
101
+ ### MARC examples: extract_marc
109
102
 
110
103
  ~~~ruby
111
104
  # Take the value of the first 001 field, and put
112
105
  # it in output field 'id', to be indexed in Solr
113
106
  # field 'id'
114
- to_field "id", extract_marc("001", :first => true)
107
+ to_field "id", extract_marc("001")
115
108
 
116
- # 245 subfields a, p, and s. 130, all subfields.
117
- # built-in punctuation trimming routine.
118
- to_field "title_t", extract_marc("245aps:130", :trim_punctuation => true)
109
+ to_field "title_t", extract_marc("245aps:130")
119
110
 
120
111
  # Can limit to certain indicators with || chars.
121
112
  # "*" is a wildcard in indicator spec. So this is
@@ -142,17 +133,64 @@ By default, specifications with multiple subfields (e.g. "240abc") will produce
142
133
 
143
134
  For the syntax and complete possibilities of the specification string argument to extract_marc, see docs at the [MarcExtractor class](./lib/traject/marc_extractor.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/MarcExtractor)).
144
135
 
145
- `extract_marc` also supports `translation maps` similar to SolrMarc's. There are some translation maps provided by traject, and you can also define your own, in yaml or ruby. Translation maps are especially useful for mapping from MARC codes to user-displayable strings:
136
+ To see all options for `extract_marc`, see the [extract_marc](http://rdoc.info/gems/traject/Traject/Macros/Marc21:extract_marc) method documentation.
137
+
138
+ There is one special MARC-specific transformation macro, that strips punctuation from beginning and end of values using heuristics designed for AACR2 in MARC:
139
+
140
+ ```ruby
141
+ to_field "title", extract_marc("245abc"), trim_punctuation
142
+ ```
143
+
144
+ ### XML mode, extract_xpath
145
+
146
+ See our [xml guide](./doc/xml.md) for more XML examples, but you will usually use extract_xpath.
147
+
148
+ to_field "title", extract_xpath("//title")
149
+
150
+ ### Translation maps
151
+
152
+
153
+ Traject supports `translation maps` similar to SolrMarc's. There are some translation maps provided by traject, and you can also define your own, in yaml or ruby. Translation maps are especially useful for mapping from MARC codes to user-displayable strings. Translation maps are invoked as a second argument to `to_field`.
146
154
 
147
155
  ~~~ruby
148
156
  # "translation_map" will be passed to Traject::TranslationMap.new
149
157
  # and the created map used to translate all values
150
- to_field "language", extract_marc("008[35-37]:041a:041d", :translation_map => "marc_language_code")
158
+ to_field "language", extract_marc("008[35-37]:041a:041d"), translation_map("marc_language_code")
151
159
  ~~~
152
160
 
153
- To see all options for `extract_marc`, see the [extract_marc](http://rdoc.info/gems/traject/Traject/Macros/Marc21:extract_marc) method documentation.
161
+ The argument(s) to `translation_map` are passed to `TranslationMap.new`, see comment docs at [TranslationMap](./lib/traject/translation_map.rb) for documentation.
162
+
163
+ The `translation_map` macro also allows you to specify _multiple_ translation maps, with the latter ones overriding earlier ones:
164
+
165
+ ```ruby
166
+ to_field "language", extract_marc("008[35-37]:041a:041d"),
167
+ translation_map("marc_language_code",
168
+ "local_marc_language_code_overrides",
169
+ {"inline_hash" => "even local more overrides"})
170
+ ```
171
+
172
+ ### Additional transformation macros
173
+
174
+ TranslationMap use above is just one example of a transformation macro, that transforms output values. Other built-in transformation macros are defined in Traject::Macros::Transformation, and include:
154
175
 
155
- ## Other built-in utility macros
176
+ * `default("some value")`: provide a default value if no extracted value exists
177
+ * `first_only`: limit output to a single value, the first one extracted.
178
+ * `unique`: reduce output values to only unique values
179
+ * `strip`: remove leading or trailing whitespace
180
+ * `prepend("before each value:")`
181
+ * `append("--after each value")`
182
+ * `gsub(/regex/, "replacement")`
183
+ * `split(" ")`: take values and split them, possibly result in multiple values.
184
+
185
+ You can add on as many transformation macros as you want, they will be applied to output in order.
186
+
187
+ Example:
188
+
189
+ ```ruby
190
+ to_field "something", extract_xpath("//value"), strip, default("no value"), prepend("Extracted value: ")
191
+ ```
192
+
193
+ ### Other built-in utility macros
156
194
 
157
195
  Other built-in methods that can be used with `to_field` include:
158
196
 
@@ -256,9 +294,7 @@ use ruby methods like `map!` to modify it:
256
294
  ~~~
257
295
 
258
296
  If you find yourself repeating boilerplate code in your custom logic, you can
259
- even create your own 'macros' (like `extract_marc`). `extract_marc` and other
260
- macros are nothing more than methods that return ruby lambda objects of
261
- the same format as the blocks you write for custom logic.
297
+ even create your own 'macros' (like `extract_marc`). `extract_marc`, `translation_map`, `first_only` and other macros are nothing more than methods that return ruby lambda objects of the same format as the blocks you write for custom logic.
262
298
 
263
299
  For tips, gotchas, and a more complete explanation of how this works, see
264
300
  additional documentation page on [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md)
@@ -304,11 +340,10 @@ You set which writer is being used in settings (`provide "writer_class_name", "T
304
340
  or with the shortcut command line argument `-w Traject::DebugWriter`.
305
341
 
306
342
  The [SolrJWriter](https://github.com/traject/traject-solrj_writer) is packaged separately,
307
- and will be useful if you need to index to Solr's older than version 3.2. It requires Jruby.
343
+ and will be useful if you need to index to Solr's older than version 3.2. It requires Jruby.
308
344
 
309
345
  You can easily write your own Readers and Writers if you'd like, see comments at top
310
- of [Traject::Indexer](lib/traject/indexer.rb).
311
-
346
+ of [Traject::Indexer](lib/traject/indexer.rb). A reader is simply an object that initializes with a ruby IO and traject Settings, and provides an `each` method. The simplest Writer class initializes with a traject Settings, and provides a `put(traject_context)` method.
312
347
 
313
348
 
314
349
  ## Duplicate, `nil`, and empty values
@@ -365,13 +400,16 @@ writer class in question.
365
400
 
366
401
  ## The traject command Line
367
402
 
403
+ (If you are interested in running traject in an embedded/programmatic context instead of as a standalone command-line batch process, please see docs on [Programmatic Use](./doc/programmatic_use.md).)
404
+
368
405
  The simplest invocation is:
369
406
 
370
407
  traject -c conf_file.rb marc_file.mrc
371
408
 
409
+ By default, and for legacy reasons, the traject command line uses the MarcIndexer, with default marc reader and macros. If you want to use a different indexer for a different file format, use the `-i` flag: `traject -i xml`, the NokogiriReader; `traject -i basic`, the base Traject::Indexer with no format-specific behavior; or `traject -i Your::Own::Class`.
410
+
372
411
  Traject assumes marc files are in ISO 2709 MARC 'binary' format; it is not
373
- currently able to guess other marc format types like XML from filenames or content. If you are reading
374
- marc files in another format, you need to tell traject either with the `marc_source.type` or the command-line shortcut:
412
+ currently able to guess other marc format types like XML from filenames or content. If you are reading marc files in another format, you need to tell traject either with the `marc_source.type` or the command-line shortcut:
375
413
 
376
414
  traject -c conf.rb -t xml marc_file.xml
377
415
 
@@ -389,6 +427,17 @@ This will over-ride any settings set with `provide` in conf files.
389
427
 
390
428
  traject -c conf_file.rb marc_file -s solr.url=http://somehere/solr -s solrj_writer.commit_on_close=true
391
429
 
430
+
431
+ When using the Traject::MarcIndexer (default), it assumes marc files are in ISO 2709 MARC 'binary' format; it is not currently able to guess other marc format types like XML from filenames or content. If you are reading marc files in another format, you need to tell traject either with the `marc_source.type` or the command-line shortcut:
432
+
433
+ traject -c conf.rb -t xml marc_file.xml
434
+
435
+ To use XML mode instead (with the Traject::NokogiriReader and suitable config files), use the `-i` flag:
436
+
437
+ traject -i xml -c xml_suitable_config_file.rb
438
+
439
+ (You can also pass the full name of a custom indexer class to `-i`)
440
+
392
441
  There are some built-in command-line option shortcuts for useful
393
442
  settings:
394
443
 
@@ -442,15 +491,16 @@ Own Code](./doc/extending.md)
442
491
 
443
492
  ## More
444
493
 
494
+ * [Traject XML guide](./doc/xml.md)
445
495
  * [Other traject commands](./doc/other_commands.md) including `marcout`, and `commit`
446
496
  * [Hints for batch and cronjob use](./doc/batch_execution.md) of traject.
497
+ * [Traject Programmatic Use guide](./doc/programmatic_use.md)
447
498
  * Plugin extensions: Gems that add functionality to traject
448
499
  * [traject_alephsequential_reader](https://github.com/traject/traject_alephsequential_reader/): read MARC files serialized in the AlephSequential format, as output by Ex Libris's Aleph ILS.
449
500
  * [traject_horizon](https://github.com/jrochkind/traject_horizon): Export MARC records directly from a Horizon ILS rdbms, as serialized MARC or to index into Solr.
450
501
  * [traject_umich_format](https://github.com/billdueber/traject_umich_format/): opinionated code and associated macros to extract format (book, audio file, etc.) and types (bibliography, conference report, etc.) from a MARC record. Code mirrors that used by the University of Michigan, and is an alternate approach to that taken by the `marc_formats` macro in `Traject::Macros::MarcFormatClassifier`.
451
502
  * [traject-solrj_writer](https://github.com/traject/traject-solrj_writer): a jruby-only writer that uses the solrj .jar to talk directly to solr. Your only option for speaking to a solr version < 3.2, which is when the json handler was added to solr.
452
- * [traject_marc4j_reader](https://github.com/traject/traject-marc4j_reader): Packaged with traject automatically on jruby. A JRuby-only reader for
453
- reading marc records using the Marc4J library, fastest MARC reading on JRuby.
503
+ * [traject_marc4j_reader](https://github.com/traject/traject-marc4j_reader): A JRuby-only reader for reading marc records using the Marc4J library, fastest MARC-XML reading on JRuby.
454
504
  * [traject_sequel_writer](https://github.com/traject/traject_sequel_writer) A writer for sending to an rdbms via [Sequel](https://github.com/jeremyevans/sequel)
455
505
 
456
506
  # Development
@@ -469,15 +519,16 @@ and/or extra files in ./docs -- as appropriate for what needs to be docs.
469
519
  online api docs has a `--markup markdown` specified -- inline class/method docs are in markdown, not rdoc.
470
520
 
471
521
  Bundler rake tasks included for gem releases: `rake release`
472
- * Every traject release needs to be done once when running MRI, and switch to JRuby
473
- and do the same release again. The JRuby release is identical but for including
474
- a gemspec dependency on the Marc4JReader gem.
475
522
 
476
- ## TODO
523
+ The standard [bundle console](https://bundler.io/v1.7/bundle_console.html) command may be useful for getting an `irb` console with the gem and its dependencies loaded.
524
+
525
+ ## TODO: Possible future improvements
526
+
527
+ * Incorporate more inspired by [TrajectPlus](https://github.com/sul-dlss/traject_plus), possibly including `compose` for building nested hash output.
477
528
 
478
- * Readers and index rules helpers for reading XML files as input? Maybe.
529
+ * Incorporate functionality to write multiple output records based on a single input record. Likely will share implementation details with a trajectplus-style `compose`.
479
530
 
480
- * Writers for writing to stores other than Solr? ElasticSearch? Maybe.
531
+ * Writers for writing to stores other than Solr? ElasticSearch? Maybe.
481
532
 
482
533
  * Unicode normalization. Has to normalize to NFKC on way out to index. Except for serialized marc field and other exceptions? Except maybe don't have to, rely on solr analyzer to do it?
483
534
 
data/Rakefile CHANGED
@@ -11,6 +11,13 @@ require 'rake/testtask'
11
11
  task :default => [:test]
12
12
 
13
13
  Rake::TestTask.new do |t|
14
+ # Rake 11 makes warnings on by default, but there is so much noise, including
15
+ # from our dependencies, and from things I think are silly warnings like
16
+ # "shadowing outer local variable"
17
+ # Possibly could turn back on in the future using https://rubygems.org/gems/warning/versions/0.10.0
18
+ # gem to customize.
19
+ t.warning = false
20
+
14
21
  t.pattern = 'test/**/*_test.rb'
15
22
  t.libs.push 'test', 'test_support'
16
23
  end
@@ -18,4 +25,4 @@ end
18
25
  # Not documented well, but this seems to be
19
26
  # the way to load rake tasks from other files
20
27
  #import "lib/tasks/load_map.rake"
21
- Dir.glob('lib/tasks/*.rake').each { |r| import r}
28
+ Dir.glob('lib/tasks/*.rake').each { |r| import r}
@@ -1,35 +1,59 @@
1
1
  # Details on Traject Indexing: from custom logic to Macros
2
2
 
3
- Traject macros are a way of providing re-usable index mapping rules. Before we discuss how they work, we need to remind ourselves of the basic/direct Traject `to_field` indexing method.
3
+ We will explain the architecture of indexing rules, to help you use them more effectively, and create 'macros' which are re-usable index mapping rules.
4
4
 
5
- ## How direct indexing logic works
5
+ ## How to_field works
6
6
 
7
- Here's the simplest possible direct Traject mapping logic, duplicating the effects of the `literal` macro:
7
+ A `to_field` invocation might look like this:
8
8
 
9
- ~~~ruby
9
+ ```ruby
10
+ to_field "title", extract_marc("245abc"), first_only
11
+ ```
12
+
13
+ In fact, both `extract_marc("245abc")` and `first_only` are invocations of methods that return ruby [Proc](https://ruby-doc.org/core-2.2.0/Proc.html) or lambda objects. A method that is included in an indexer and returns a `Proc` object suitable as an argument to `to_field` is what traject calls a **macro**.
14
+
15
+ The `to_field` method, which establishes an indexing rule, simply takes a field name as its first argument, followed by one or more procs. During indexing, the procs registered with the indexing rule are executed in the order given, each providing or transforming output values. An additional proc can be provided as a block argument to the `to_field` method.
16
+
17
+ By providing macro methods in the indexer that return procs, we can use this simple ruby method signature to create something that looks like a "domain specific language," where you might not even realize it's all based on procs. `extract_marc` is a method defined in the MarcIndexer (via including the `Traject::Macros::Marc21` mixin), while `first_only` is a method included in all Indexers (via the base Indexer class including the `Traject::Macros::Transformation` mixin).
18
+
19
+ These proc arguments themselves take three arguments, of which the third is optional.
20
+
21
+ 1. the source record
22
+ 2. an "accumulator" array of output values, to which the procs add or transform values
23
+ 3. a traject "context"
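The calling convention above can be sketched in plain Ruby, without traject loaded. This is only an illustration: `run_procs` is a hypothetical helper, not part of traject's API, and the arity check is an assumption about how an optional third argument might be handled. A plain Hash stands in for both the source record and the context.

```ruby
# Hypothetical stand-in for how an indexing rule might run its procs.
# `run_procs` is NOT a traject method; it only illustrates the convention.
def run_procs(procs, record, context = {})
  accumulator = []
  procs.each do |procedure|
    if procedure.arity == 2
      # the proc declared only (record, accumulator)
      procedure.call(record, accumulator)
    else
      # the proc also wants the context
      procedure.call(record, accumulator, context)
    end
  end
  accumulator
end

# An "extraction" proc reads the record and adds a value.
extract_title = proc { |record, accumulator| accumulator << record[:title] }

# A "transformation" proc ignores the record and changes existing values in place.
upcase_all = proc { |_record, accumulator, _context| accumulator.map!(&:upcase) }

result = run_procs([extract_title, upcase_all], { title: "traject" })
```

Each proc receives the same accumulator array, which is why later procs see what earlier ones added.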
24
+
25
+ Here's the simplest possible direct Traject mapping logic, duplicating the effects of the `literal` macro:
26
+
27
+ ```ruby
10
28
  to_field("title") do |record, accumulator, context|
11
29
  accumulator << "FIXED LITERAL"
12
30
  end
13
- ~~~
31
+ ```
14
32
 
15
- That `do` is just ruby `block` syntax, whereby we can pass a block of ruby code as an argument to to a ruby method. We pass a block taking three arguments, labeled `record`, `accumulator`, and `context`, to the `to_field` method. The third 'context' object is optional, you can define it in your block or not, depending on if you want to use it.
33
+ That `do` is just ruby block syntax, whereby we can pass a block of ruby code as an argument to a ruby method. We pass a block taking three arguments, labeled `record`, `accumulator`, and `context`, to the `to_field` method. The third `context` argument is optional: you can declare it in your block or not, depending on whether you want to use it.
16
34
 
17
- The block is then stored by the Traject::Indexer, and called for each record indexed, with three arguments provided.
35
+ The block is then stored by the Traject::Indexer, and called for each record indexed, with three arguments provided.
18
36
 
19
37
  ### record argument
20
38
 
21
- The record that gets passed to your block is a MARC::Record object (or, theoretically, any object that gets returned by a traject Reader). Your logic will usually examine the record to calculate the desired output.
39
+ The record that gets passed to your block is the source record for the current indexing: A `MARC::Record` when using the MarcIndexer, a `Nokogiri::XML::Document` using the NokogiriIndexer, or whatever source record type is used by a given indexer.
40
+
41
+ An "extraction" proc, like that returned by `extract_marc`, is typically the first one given to `to_field`, and usually examines the record to calculate the desired output.
42
+
43
+ Logic for a "transformation" proc, such as that returned by `first_only`, usually ignores the record argument.
44
+
45
+ "Extraction" vs "transformation" are just names for procs that either examine the source_record to add something to the accumulator ("extraction") or transform values already in the accumulator ("transformation") -- a proc can actually do these things in any combination, but it usually makes sense to design some procs for extraction and others for transformation.
22
46
 
23
47
  ### accumulator argument
24
48
 
25
- The accumulator argument is an Array. At the end of your custom code, the accumulator Array should hold the output you want send off to the field specified in `to_field`.
49
+ The accumulator argument is an Array. At the end of your custom code, the accumulator Array should hold the output you want to send off to the field specified in `to_field`.
26
50
 
27
- The accumulator is a reference to a ruby Array, and you need to **modify** that Array, manipulating it in place with Array methods that mutate the array, like `concat`, `<<`, `map!` or even `replace`.
51
+ The accumulator is a reference to a ruby Array, and you need to **modify** that Array, manipulating it in place with Array methods that mutate the array, like `concat`, `<<`, `map!` or even `replace`.
28
52
 
29
- You can't simply assign the accumulator variable to a different Array; you need to modify the Array *in place*.
53
+ You can't simply assign the accumulator variable to a different Array; you need to modify the Array *in place*.
30
54
 
31
55
  # Won't work, assigning variable
32
- to_field('foo') do |rec, acc|
56
+ to_field('foo') do |rec, acc|
33
57
  acc = ["some constant"] # WRONG!
34
58
  end
35
59
 
@@ -38,7 +62,7 @@ You can't simply assign the accumulator variable to a different Array; you need
38
62
  acc << 'bill'
39
63
  acc << 'dueber'
40
64
  acc = acc.map{|str| str.upcase}
41
- end # WRONG! WRONG! WRONG! WRONG! WRONG!
65
+ end # WRONG! WRONG! WRONG! WRONG! WRONG!
42
66
 
43
67
 
44
68
  # Instead, do, modify array in place
@@ -49,11 +73,13 @@ You can't simply assign the accumulator variable to a different Array; you need
49
73
  acc.map!{|str| str.upcase} # NOTE: "map!" not "map"
50
74
  end
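The reason reassignment fails can be seen in plain Ruby, with no traject involved. In this sketch, `call_step` is a hypothetical stand-in for whatever code invokes your block: it creates the array, passes it in, and then reads its own reference back.

```ruby
# Hypothetical caller: it owns the accumulator and only sees in-place changes.
def call_step(step)
  accumulator = []
  step.call(nil, accumulator)
  accumulator
end

# Rebinds the block-local name `acc`; the caller's array is untouched.
reassigning = proc { |_rec, acc| acc = ["some constant"] }

# Mutates the shared array, so the caller sees the new contents.
mutating = proc { |_rec, acc| acc.replace(["some constant"]) }

lost = call_step(reassigning)
kept = call_step(mutating)
```

Here `lost` ends up empty while `kept` holds the value, since only `replace` changes the array object that the caller is holding.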
51
75
 
76
+ If you have multiple calls to `to_field` for the same field, each invocation begins with an empty accumulator, to help keep them independent.
77
+
52
78
  ### context argument
53
79
 
54
80
  The third optional argument is a [Traject::Indexer::Context](./lib/traject/indexer/context.rb) ([rdoc](http://rdoc.info/github/traject/traject/Traject/Indexer/Context)) object. Most of the time you don't need it, but you can use it for some sophisticated functionality. These are some useful methods available:
55
81
 
56
- * `context.clipboard` A hash into which you can stuff values that you want to pass from one indexing step to another. For example, if you go through a bunch of work to query a database and get a result you'll need more than once, stick the results somewhere in the clipboard. This clipboard is record-specific, and won't persist between records.
82
+ * `context.clipboard` A hash into which you can stuff values that you want to pass from one indexing step to another. For example, if you go through a bunch of work to query a database and get a result you'll need more than once, stick the results somewhere in the clipboard. This clipboard is record-specific, and won't persist between records.
57
83
  * `context.position` The position of the record in the input file (e.g., was it the first record, second, etc.). Useful for error reporting.
58
84
  * `context.output_hash` A hash mapping the field names (generally defined in `to_field` calls) to an array of values to be sent to the writer associated with that field. This allows you to modify what goes to the writer without going through a `to_field` call -- you can just set `context.output_hash['myfield'] = ['my', 'values']` and you're set. See below for more examples.
59
85
  * `context.skip!(msg)` An assertion that this record should be ignored. No more indexing steps will be called, no results will be sent to the writer, and a `debug`-level log message will be written stating that the record was skipped.
@@ -92,50 +118,37 @@ to_field 'normalized_title' do |rec, acc|
92
118
  end
93
119
  ```
94
120
 
121
+ Traject macros similarly capture values in local variables defined outside the proc they return, which the returned proc can then use.
122
+
95
123
  Certain built-in traject calls have been optimized to be high performance
96
- so it's safe to do them inside 'inner loop' blocks. That includes `Traject::TranslationMap.new` and `Traject::MarcExtractor.cached("xxx")`
124
+ so it's safe to do them inside 'inner loop' blocks. That includes `Traject::TranslationMap.new` and `Traject::MarcExtractor.cached("xxx")`
97
125
  (NOTE: #cached rather than #new there)
98
126
 
99
127
 
100
- ## From block to lambda
101
-
102
- In the ruby language, in addition to creating a code block as an argument
103
- to a method with `do |args| ... end` or `{|arg| ... }`, we can also create
104
- a code block to hold in a variable, with the `lambda` keyword:
105
-
106
- always_output_foo = lambda do |record, accumulator|
107
- accumulator << "FOO"
108
- end
109
-
110
- In traject, `to_field` is written so that, as a convenience, it can take a lambda expression stored in a variable as an alternative to a block:
111
-
112
- to_field("always_has_foo"), always_output_foo
113
-
114
- Why is this a convenience? Well, ordinarily it's not something we
115
- need, but in fact it's what allows traject 'macros' to be re-useable
116
- code templates.
117
-
118
-
119
- ## Macros
128
+ ## Back to macros
120
129
 
121
130
  A Traject macro is a way to automatically create indexing rules via re-usable "templates".
122
131
 
123
- Traject macros are methods that return ruby lambda/proc objects, possibly creating them based on parameters passed in.
132
+ Traject macros are simply methods that return ruby lambda/proc objects, possibly creating them based on parameters passed in.
124
133
 
125
- For example, here is the implementation of the `literal` method/macro:
134
+ For example, here is the implementation of the `literal` logic, as a macro method returning a proc, instead of as an inline proc.
126
135
 
127
136
  ~~~ruby
137
+ # This method is included in an Indexer, possibly as a module mix-in.
128
138
  def literal(value)
129
- return lambda do |record, accumulator, context|
139
+ return proc do |record, accumulator, context|
130
140
  # because a proc is a closure, we can define it in terms
131
141
  # of the 'value' from the scope it's defined in!
132
142
  accumulator << value
133
143
  end
134
144
  end
145
+
146
+ # then it would be called on the indexer, typically in a traject configuration file,
147
+ # when setting up an indexing rule:
135
148
  to_field "fieldname", literal("my_fav_literal")
136
149
  ~~~
137
150
 
138
- So a Traject macro is a method that may have parameters and, based on those parameters, returns a lambda; the lambda is then passed to the `to_field` indexing method, or similar methods.
151
+ So a Traject macro is a method that may have parameters and, based on those parameters, returns a proc; the proc is then passed to the `to_field` indexing method, or similar methods.
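As a further illustration of a parameterized macro, here is a minimal sketch of a transformation-style macro. The name `truncate_values` is hypothetical and not a traject built-in. It is defined at the top level and called directly on a plain array so that the sketch runs without traject.

```ruby
# Hypothetical macro: returns a proc that shortens every value
# already in the accumulator to at most `max_length` characters.
def truncate_values(max_length)
  proc do |_record, accumulator|
    # `max_length` is captured from the enclosing method call (closure)
    accumulator.map! { |value| value[0, max_length] }
  end
end

shorten = truncate_values(5)

# Simulate what an indexing rule would do: hand the proc an accumulator.
accumulator = ["Moby Dick", "Emma"]
shorten.call(nil, accumulator)
```

After the call, the first value has been cut to five characters and the second, already shorter, is unchanged. In a real configuration such a proc would be passed to `to_field` after an extraction macro.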
139
152
 
140
153
  How do you make these methods available to the traject indexer?
141
154
 
@@ -145,7 +158,7 @@ Define it in a module:
145
158
  # in a file literal_macro.rb
146
159
  module LiteralMacro
147
160
  def literal(value)
148
- return lambda do |record, accumulator, context|
161
+ return proc do |record, accumulator, context|
149
162
  # because a proc is a closure, we can define it in terms
150
163
  # of the 'value' from the scope it's defined in!
151
164
  accumulator << value
@@ -165,39 +178,43 @@ to_field("fieldname"), literal("my_fav_literal")
165
178
  ~~~
166
179
 
167
180
  That's it. You can use the traject command line `-I` option to set the ruby load path, so your file will be findable via `require`. Or you can distribute it in a gem, and use straight rubygems and the `gem` command in your configuration file, or Bundler with traject command-line `-g` option.
181
+ See the [Extending with your own code](./extending.md) guide for various methods for including custom code in a traject command-line invocation.
182
+
183
+ ## Combining multiple macros, lambdas and blocks
168
184
 
169
- ## Using a lambda _and_ a block
185
+ Traject macros (such as `extract_marc`) create and return a proc. If
186
+ you include a proc _and_ a block (or multiple procs) on a `to_field` call, subsequent procs
187
+ or code blocks get the accumulator as it was filled in by earlier procs or code blocks, and can *transform* values in the accumulator.
170
188
 
171
- Traject macros (such as `extract_marc`) create and return a lambda. If
172
- you include a lambda _and_ a block on a `to_field` call, the block
173
- gets the accumulator as it was filled in by the former.
189
+ Here is an example of passing `to_field` procs returned by macros, procs held in variables, and blocks.
174
190
 
175
191
  ```ruby
176
- # Get the titles and lowercase them
177
- to_field 'lc_title', extract_marc('245') do |rec, acc, context|
178
- acc.map!{|title| title.downcase}
179
- end
180
192
 
181
- # Build my own lambda and use it
182
- mylam = lambda {|rec, acc| acc << 'one'} # just add a constant
183
- to_field('foo'), mylam do |rec, acc, context|
184
- acc << 'two'
185
- end #=> context.output_hash['foo'] == ['one', 'two']
193
+ titlecase = proc do |rec, acc|
194
+ acc.map! { |value| value.titlecase }
195
+ end
186
196
 
187
- # You might also want to do something like this
188
- to_field('foo'), macro_returning_dup_values do |rec, acc|
189
- acc.uniq!
197
+ to_field 'lc_title', extract_marc('245'), titlecase, unique do |rec, acc, context|
198
+ acc.delete_if { |v| v == "value_to_eliminate" }
190
199
  end
191
200
  ```
192
201
 
202
+ `extract_marc` and `unique` are "macro" methods returning a proc.
203
+
204
+ `titlecase` is just a local variable, defined in the indexing file itself, holding a proc.
205
+
206
+ Then finally there is a block arg, taking the same arguments as the procs would.
207
+
208
+ All of these can be combined, and will be executed in order to transform output values.
209
+
193
210
  ## Manipulating `context.output_hash` directly
194
211
 
195
212
  If you ask for the context argument, a [Traject::Indexer::Context](./lib/traject/indexer/context.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Indexer/Context)), you have access to `context.output_hash`, which is
196
213
  the hash of already transformed output that will be sent to Solr (or any other Writer).
197
214
 
198
- You can examine `context.output_hash` to see any already transformed output and use it as the source for new output.
215
+ You can examine `context.output_hash` to see any already transformed output and use it as the source for new output.
199
216
 
200
- You can *write* to `context.output_hash` directly, which can be useful for computations that affect more than one output field at once.
217
+ You can *write* to `context.output_hash` directly, which can be useful for computations that affect more than one output field at once.
201
218
 
202
219
  **Note**: Make sure you always assign an _Array_ to each `context.output_hash` value, e.g., `context.output_hash['foo']`, not a single value!
203
220
 
@@ -211,15 +228,14 @@ context.output_hash['fieldname'] = ['fuzzy_wuzzies']
211
228
  ```
212
229
 
213
230
 
214
-
215
231
  ## each_record
216
232
 
217
233
  `each_record` is similar to `to_field` in that it defines logic executed for each record. It differs from `to_field` because the output of `each_record` is not associated with a specific output field.
218
234
 
219
- Thus, `each_record` blocks have no `accumulator` argument: instead they either take a single `record` argument; or both a `record` and a `context`.
235
+ Thus, `each_record` blocks have no `accumulator` argument: instead they either take a single `record` argument; or both a `record` and a `context`.
220
236
 
221
237
  `each_record` is useful for logging or notifying, computing intermediate
222
- results, or writing to more than one field at once.
238
+ results, or writing to more than one field at once.
223
239
 
224
240
  ~~~ruby
225
241
  each_record do |record, context|
@@ -241,7 +257,7 @@ each_record do |record, context|
241
257
  end
242
258
  ~~~
243
259
 
244
- traject doesn't come with any macros written for use with `each_record`, but they could be created: such macros would be methods that return a lambda given the appropriate args from `each_record`.
260
+ traject doesn't come with any macros written for use with `each_record`, but they could be created: such macros would be methods that return a lambda accepting the arguments that `each_record` passes (a `record`, and optionally a `context`).
245
261
 
246
262
  ## More tips and gotchas about indexing steps
247
263
 
@@ -251,4 +267,4 @@ traject doesn't come with any macros written for use with `each_record`, but the
251
267
 
252
268
  * **Once you call `context.skip!(msg)` no more index steps will be run for that record**. So if you have any cleanup code, you'll need to make sure to call it yourself.
253
269
 
254
- * **By default, `traject` indexing runs multi-threaded**. In the current implementation, the indexing steps for one record are *not* split across threads, but different records can be processed simultaneously by more than one thread. That means you need to make sure your code is thread-safe (or always set `processing_thread_pool` to 0).
270
+ * **By default, `traject` indexing runs multi-threaded**. In the current implementation, the indexing steps for one record are *not* split across threads, but different records can be processed simultaneously by more than one thread. That means you need to make sure your code is thread-safe (or always set `processing_thread_pool` to 0).
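One common way to keep shared state safe is to guard it with a `Mutex`. The sketch below is plain Ruby and makes no claims about traject internals: `counting_macro` is a hypothetical macro, and the threads are started by hand only to imitate several records being processed at once.

```ruby
# Hypothetical macro whose proc updates state shared across threads.
def counting_macro(counter, lock)
  proc do |_record, accumulator|
    # Only one thread at a time may update the shared counter.
    lock.synchronize { counter[:seen] += 1 }
    accumulator << "seen"
  end
end

counter = { seen: 0 }
lock = Mutex.new
step = counting_macro(counter, lock)

# Imitate four worker threads each handling 100 records,
# every call getting its own fresh accumulator.
threads = 4.times.map do
  Thread.new { 100.times { step.call(nil, []) } }
end
threads.each(&:join)

total_seen = counter[:seen]
```

State that is local to a single call, such as the accumulator, needs no locking; only values captured by the closure and shared between records do.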