
traject 2.3.4 → 3.0.0.alpha.1

Files changed (69)

checksums.yaml +5 -5
data/.travis.yml +16 -9
data/CHANGES.md +74 -1
data/Gemfile +2 -1
data/README.md +104 -53
data/Rakefile +8 -1
data/doc/indexing_rules.md +79 -63
data/doc/programmatic_use.md +218 -0
data/doc/settings.md +28 -1
data/doc/xml.md +134 -0
data/lib/traject.rb +5 -0
data/lib/traject/array_writer.rb +34 -0
data/lib/traject/command_line.rb +18 -22
data/lib/traject/debug_writer.rb +2 -5
data/lib/traject/experimental_nokogiri_streaming_reader.rb +276 -0
data/lib/traject/hashie/indifferent_access_fix.rb +25 -0
data/lib/traject/indexer.rb +321 -92
data/lib/traject/indexer/context.rb +39 -13
data/lib/traject/indexer/marc_indexer.rb +30 -0
data/lib/traject/indexer/nokogiri_indexer.rb +30 -0
data/lib/traject/indexer/settings.rb +36 -53
data/lib/traject/indexer/step.rb +27 -33
data/lib/traject/macros/marc21.rb +37 -12
data/lib/traject/macros/nokogiri_macros.rb +43 -0
data/lib/traject/macros/transformation.rb +162 -0
data/lib/traject/marc_extractor.rb +2 -0
data/lib/traject/ndj_reader.rb +1 -1
data/lib/traject/nokogiri_reader.rb +179 -0
data/lib/traject/oai_pmh_nokogiri_reader.rb +159 -0
data/lib/traject/solr_json_writer.rb +19 -12
data/lib/traject/thread_pool.rb +13 -0
data/lib/traject/util.rb +14 -2
data/lib/traject/version.rb +1 -1
data/test/debug_writer_test.rb +3 -3
data/test/delimited_writer_test.rb +3 -3
data/test/experimental_nokogiri_streaming_reader_test.rb +169 -0
data/test/indexer/context_test.rb +23 -13
data/test/indexer/error_handler_test.rb +59 -0
data/test/indexer/macros/macros_marc21_semantics_test.rb +46 -46
data/test/indexer/macros/marc21/extract_all_marc_values_test.rb +1 -1
data/test/indexer/macros/marc21/extract_marc_test.rb +19 -9
data/test/indexer/macros/marc21/serialize_marc_test.rb +4 -4
data/test/indexer/macros/to_field_test.rb +2 -2
data/test/indexer/macros/transformation_test.rb +177 -0
data/test/indexer/map_record_test.rb +2 -3
data/test/indexer/nokogiri_indexer_test.rb +103 -0
data/test/indexer/process_record_test.rb +55 -0
data/test/indexer/process_with_test.rb +148 -0
data/test/indexer/read_write_test.rb +52 -2
data/test/indexer/settings_test.rb +34 -24
data/test/indexer/to_field_test.rb +27 -2
data/test/marc_extractor_test.rb +7 -7
data/test/marc_reader_test.rb +4 -4
data/test/nokogiri_reader_test.rb +158 -0
data/test/oai_pmh_nokogiri_reader_test.rb +23 -0
data/test/solr_json_writer_test.rb +24 -28
data/test/test_helper.rb +8 -2
data/test/test_support/namespace-test.xml +7 -0
data/test/test_support/nokogiri_demo_config.rb +17 -0
data/test/test_support/oai-pmh-one-record-2.xml +24 -0
data/test/test_support/oai-pmh-one-record-first.xml +24 -0
data/test/test_support/sample-oai-no-namespace.xml +197 -0
data/test/test_support/sample-oai-pmh.xml +197 -0
data/test/thread_pool_test.rb +38 -0
data/test/translation_map_test.rb +3 -3
data/test/translation_maps/ruby_map.rb +2 -1
data/test/translation_maps/yaml_map.yaml +2 -1
data/traject.gemspec +4 -11
metadata +92 -6

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
-SHA1:
-  metadata.gz: 70d8b265e2e00866a63fdc067172ee1174efb068
-  data.tar.gz: 6588b8231b636d268765a5428a607b731a34fe7f
+SHA256:
+  metadata.gz: 176864633191a53e32c9563227072f16d83e8bc27f2e0d1c9a436fc3d281fc21
+  data.tar.gz: 468120d2066634a98aa18438820e24e4bb099125886bfca86f61e53d080070a3
 SHA512:
-  metadata.gz: 564a834087b4b5d0b032a9a0797cd1587f241afcbf35bf0c3aa928724f9f26fb8d3e7220d43e909c9bf4c03dd91873113e4d2c081ee68ad12ed151ff3ba34284
-  data.tar.gz: 9cbc506f6f2ab5bcbf77811c5018a91d8b588eff2d0c75960b3aada514756e6d90bce441da3db59d1aaa0a9a0f62b3e0c56d1be2c494aae091b06a61f94de46c
+  metadata.gz: 156d479792897beac99ecbc84a6da06470b7baf1edb1f4b3008deb3e26621fca1723112bb2e46f20406c320b486a679b8c2247f77283a358f6f17f7b4444c7e4
+  data.tar.gz: 6e9e0f3c80ca6c2a36adb455a416700f2008cffb5fb66093be1f4e1fb3828a236da4bdf7e1fd26e9d3762b8554c544779138d8636fba6241d3ae682b2df468d9
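
The new checksums replace SHA1 with SHA256 alongside SHA512. A minimal sketch of how such a file could be checked by hand, assuming each hex value is a digest over the raw bytes of the named archive member; the method name `checksum_mismatches` and the `contents` hash are hypothetical helpers for illustration, not part of RubyGems:

```ruby
require "digest"
require "yaml"

# Map the algorithm names used in checksums.yaml to Ruby digest classes.
DIGESTS = { "SHA256" => Digest::SHA256, "SHA512" => Digest::SHA512 }.freeze

# checksums_yaml: the text of a checksums.yaml-style file.
# contents: hash of file name => raw bytes (hypothetical input for this sketch).
# Returns [algorithm, file name] pairs whose digest did not match.
def checksum_mismatches(checksums_yaml, contents)
  mismatches = []
  YAML.safe_load(checksums_yaml).each do |algorithm, files|
    digest = DIGESTS.fetch(algorithm)
    files.each do |name, expected_hex|
      actual_hex = digest.hexdigest(contents.fetch(name))
      mismatches << [algorithm, name] unless actual_hex == expected_hex
    end
  end
  mismatches
end
```

An empty return value means every listed digest matched; `fetch` raises if the file names an algorithm or archive member the sketch does not know about.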

data/.travis.yml CHANGED

@@ -1,12 +1,19 @@
 language: ruby
 cache: bundler
-sudo: false
+# we don't really need `sudo: true`, but for some reason travis docker-based systems are unreliable
+# at downloading jruby, and
+sudo: true
 rvm:
-  - jruby-19mode
-  - jruby-9.0.4.0
-  - 1.9
-  - 2.2.8
-  - 2.3.5
-  - 2.4.2
-jdk:
-  - oraclejdk8
+  - 2.3.6
+  - 2.4.3
+  - 2.5.1
+  - "2.6.0-preview2"
+# avoid having travis install jdk on MRI builds where we don't need it.
+matrix:
+  include:
+    - jdk: openjdk8
+      rvm: jruby-9.0.5.0
+    - jdk: openjdk8
+      rvm: jruby-9.2.0.0
+  allow_failures:
+    - rvm: "2.6.0-preview2"

data/CHANGES.md CHANGED

@@ -1,5 +1,78 @@
 # Changes
+## 3.0.0
+### Changed/Backwards Incompatibilities
+* JRuby traject no longer includes `traject-marc4j_reader` as a dependency or default reader, although it may provide faster MARC-XML reading on JRuby. To use it manually, see https://github.com/traject/traject-marc4j_reader . See https://github.com/traject/traject/pull/187
+* `map_record` now returns `nil` if record was skipped.
+* The `Traject::Indexer` class no longer includes marc-specific settings and modules.
+  * If you are using command-line `traject`, this should make no difference to you, as command-line now defaults to the new `Traject::Indexer::MarcIndexer` with those removed things.
+  * If you are using Traject::Indexer programmatically and want those features, switch to using `Traject::Indexer::MarcIndexer`.
+  * If necessary, as a hopefully temporary backwards compat shim, call `Traject::Indexer.legacy_marc_mode!`, which injects the old marc-specific behavior into Traject::Indexer again, globally and permanently.
+* Traject::Indexer::Settings no longer has its own global defaults. Instead it can be given a set of defaults with #with_defaults, usually right after instantiation, to support different defaults for different Indexers.
+* SolrJsonWriter now assumes an /update/json convenience url is available in solr instead of trying to verify it.  If you are using an older Solr (before 4?) or otherwise want a different update url, just use setting `solr.update_url`
+### Added
+* Traject::Indexer#configure is available, and recommended instead of raw `instance_eval`. It just does an instance_eval, but is clearer and safer for future changes.
+* traject command line can now take multiple input files. And underlying it, Traject::Indexer#process can take an array of input streams.
+* There is now a built-in mode for XML source records, see docs at [xml.md](./doc/xml.md)
+* new setting `mapping_rescue` is available, to supply custom logic for handling errors. See docs at [settings.md](../doc/settings.md)
+* Call Traject::ThreadPool.disable_concurrency! to force all pool sizes to be 0, and work to be performed inline. All threading will be disabled.
+* `to_field` can now take an array as a first argument, to send values to multiple fields mentioned, eg:
+      to_field ["field1", "field2"], extract_marc("240")
+* `to_field` can take multiple transformation procs (all with the same form). https://github.com/traject/traject/pull/153
+* There is a new set of standard transformation macros included in `Traject::Indexer`, from [Traject::Macros::Transformation](./lib/traject/macros/transformation.rb). It includes an extraction of previous/existing options from `extract_marc`, along with some additional macros. https://github.com/traject/traject/pull/154
+  * This is the new preferred way to do the post-processing previously done with `extract_marc` options, but the existing options are not deprecated and there is no current plan for them to be removed.
+  * before:
+        to_field "some_field", extract_marc("800",
+                                translation_map: "marc_800_map",
+                                allow_duplicates: true,
+                                first: true,
+                                default: "default value")
+  * now preferred:
+        to_field "some_field", extract_marc("800", allow_duplicates: true),
+            translation_map("marc_800_map"),
+            first_only,
+            default("default value")
+    (still need `allow_duplicates: true` cause extract_marc defaults to false, but see also `unique` macro)
+  * So, these transformation steps can now be used with non-MARC formats as well. See also new transformation macros: `strip`, `split`, `append`, `prepend`, `gsub`, and `transform`. And for MARC use, `trim_punctuation`.
+* Traject::Indexer new api, for more convenient programmatic/embedded use.
+  * `Traject::Indexer.new` takes a block for config
+  * `Traject::Indexer#process_record`
+  * `Traject::Indexer#process_with`
+  * `Traject::Indexer#complete` and `#run_after_processing_steps` public API.
+* `Traject::SolrJsonWriter#flush`, flush to solr without closing, may be useful for direct programmatic use.
+* Traject::Indexer sub-classes can implement a #source_record_id_proc, which is passed to Context, for source-format-specific logic for getting an ID to use in logging.
+* command line takes an `-i` flag for choice of indexer.
 ## 2.3.4
   * Totally internal change to provide easier hooks into indexing process
@@ -35,7 +108,7 @@
 ## 2.2.1
   * Had inadvertently broken use of arrays as extract_marc specifications. Fixed.
 ## 2.2.0
   * Change DebugWriter to be more forgiving (and informative) about missing record-id fields
   * Automatically require DebugWriter for easier use on the command line
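
The 3.0.0 notes above describe `to_field` taking several transformation procs and an array of field names. The idea can be sketched in plain Ruby without traject. Everything here (`extract_key`, `strip_values`, `first_value_only`, `default_value`, `apply_rule`) is a hypothetical stand-in for illustration, not traject's own implementation; only the `(record, accumulator, context)` proc shape comes from the documentation in this diff:

```ruby
# Each "macro" is a method that returns a lambda taking
# (record, accumulator, context). Names are hypothetical.

def extract_key(key)
  lambda { |record, accumulator, _context| accumulator.concat(Array(record[key])) }
end

def strip_values
  lambda { |_record, accumulator, _context| accumulator.map!(&:strip) }
end

def first_value_only
  lambda { |_record, accumulator, _context| accumulator.replace(accumulator.take(1)) }
end

def default_value(value)
  lambda { |_record, accumulator, _context| accumulator << value if accumulator.empty? }
end

# Run the procs in order against one shared accumulator, then store the
# result under every requested field name (a single name or an array).
def apply_rule(field_names, procs, record, output = {})
  accumulator = []
  procs.each { |step| step.call(record, accumulator, nil) }
  unless accumulator.empty?
    Array(field_names).each { |name| output[name] = accumulator.dup }
  end
  output
end

record = { "title" => ["  Moby Dick ", "The Whale"] }
result = apply_rule(
  ["title_t", "title_display"],
  [extract_key("title"), strip_values, first_value_only, default_value("untitled")],
  record
)
```

Because every step mutates the same accumulator, order matters: placing `default_value` before `first_value_only` would still yield one value, but placing it before `extract_key` would add the default to every record.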

data/Gemfile CHANGED

@@ -4,9 +4,10 @@ source 'https://rubygems.org'
 gemspec
 group :development do
-  gem "nokogiri" # used only for rake tasks load_maps:
+  gem "webmock", "~> 3.4"
 end
 group :debug do
   gem "ruby-debug", :platform => "jruby"
+  gem "byebug", :platform => "mri"
 end

data/README.md CHANGED

@@ -1,10 +1,8 @@
 # Traject
-An easy to use, high-performance, flexible and extensible MARC to Solr indexer.
+An easy to use, high-performance, flexible and extensible metadata transformation system, focused on library-archives-museums input, and indexing to Solr as output.
-(Questions about use are welcome here or on the [google group](https://groups.google.com/forum/#!forum/traject-users))
-You might use [traject](https://github.com/traject/traject) to index MARC data for a Solr-based discovery product like [Blacklight](https://github.com/projectblacklight/blacklight) or [VUFind](http://vufind.org/).
+You might use [traject](https://github.com/traject/traject) to index MARC or XML data for a Solr-based discovery product like [Blacklight](https://github.com/projectblacklight/blacklight) or [VUFind](http://vufind.org/).
 Traject can also be generalized to a set of tools for getting structured data from a source, and transforming it to a hash-like object to send to a destination. In addition to sending data to Solr, Traject can produce json or yaml files, tab-delimited files, CSV files, and output suitable for debugging by a human.
@@ -20,43 +18,34 @@ Initially by Jonathan Rochkind (Johns Hopkins Libraries) and Bill Dueber (Univer
 * Basic configuration files can be easily written even by non-rubyists,  with a few simple directives traject provides. But config files are 'ruby all the way down', so we can provide a gradual slope to more complex needs, with the full power of ruby.
 * Easy to program, easy to read, easy to modify.
-* Fast. Traject by default indexes using multiple threads, on multiple cpu cores, when the underlying
-ruby implementation (i.e., JRuby) allows it, and can use a separate thread for communication with
-solr even under MRI.
+* Fast. Traject by default indexes using multiple threads, on multiple cpu cores, when the underlying ruby implementation (i.e., JRuby) allows it, and can use a separate thread for communication with solr even under MRI. Traject is intended to be usable to process millions of records.
 * Composed of decoupled components, for flexibility and extensibility.
 * Designed to support local code and configuration that's maintainable and testable, and can be shared between projects as ruby gems.
-* Easy to split configuration between multiple files, for simple "pick-and-choose" command line options
-that can combine to deal with any of your local needs.
+* Easy to split configuration between multiple files, for simple "pick-and-choose" command line options that can combine to deal with any of your local needs.
 ## Installation
-Traject runs under jruby (1.7.x or higher), MRI ruby (1.9.3 or higher), or probably any other ruby platform.
-**Traject runs much faster on JRuby** where it can use multi-core parallelism, and the Java Marc4J marc reader. If performance is a concern, you should run traject on JRuby.
+Traject runs under jruby (9.0.x or higher), MRI ruby (2.3.x or higher), or probably any other ruby platform.
-Some options for installing a ruby other than your system-provided one are [chruby](https://github.com/postmodern/chruby) and [ruby-install](https://github.com/postmodern/ruby-install#readme).
+Once you have ruby installed, just `$ gem install traject`.
-Once you have ruby, just `$ gem install traject`.
+**If you are processing MARC input, you will probably get significant performance improvements on JRuby.** If you are processing Marc-XML (rather than binary Marc21), you should additionally get even more performance improvements by using the [traject-marc4j_reader](https://github.com/traject/traject-marc4j_reader) gem on JRuby (see installation instructions there). If performance is a concern, and you are processing MARC, we recommend benchmarking traject on JRuby. (JRuby is not currently recommended for non-MARC XML input.)
-(**Note**: We might in the future provide an all-in-one .jar distribution, which will not require you to install jruby on your system, for those who want the multi-threading of jruby without having to actually install it. Let us know if interested.)
+Some options for installing a ruby other than your system-provided one are [chruby](https://github.com/postmodern/chruby) and [ruby-install](https://github.com/postmodern/ruby-install#readme).
 ## Configuration files
-traject is configured using configuration files. To get a sense of what they look like, you can take a look at our sample basic configuration file,
-[demo_config.rb](./test/test_support/demo_config.rb). You could run traject with that configuration file as: `traject -c path/to/demo_config.rb marc_file.marc`.
+traject is configured using configuration files. To get a sense of what they look like, you can take a look at our sample basic [MARC configuration file](./test/test_support/demo_config.rb) or [XML configuration file](./test/test_support/nokogiri_demo_config.rb). You could run traject with that configuration file as: `traject -c path/to/demo_config.rb marc_file.marc`.
 Configuration files are actually just ruby -- so by convention they end in `.rb`.
 We hope you can write basic useful configuration files without much ruby experience, since traject gives you some easy functions to use for common directives. But the full power of ruby is available to you if needed.
-**rubyist tip**: Technically, config files are executed with `instance_eval` in a Traject::Indexer instance, so the special commands you see are just methods on Traject::Indexer (or mixed into it). But you can
-call ordinary ruby `require` in config files, etc., too, to load
-external functionality. See more at Extending Logic below.
+**rubyist tip**: Technically, config files are executed with `instance_eval` in a Traject::Indexer instance, so the special commands you see are just methods on Traject::Indexer (or mixed into it). But you can call ordinary ruby `require` in config files, etc., too, to load external functionality. See more at Extending Logic below.
-You can keep your settings and indexing rules in one config file,
-or split them across multiple config files however you like. (Connection details vs indexing? Common things vs environmental specific things?)
+You can keep your settings and indexing rules in one config file, or split them across multiple config files however you like. (Connection details vs indexing? Common things vs environmental specific things?)
 There are two main categories of directives in your configuration files: _Settings_, and _Indexing Rules_.
@@ -84,9 +73,9 @@ settings do
   # various others...
   provide "solr_writer.commit_on_close", "true"
-  # The default writer is the Traject::SolrJsonWriter. The default
-  # reader is Marc4JReader (using Java Marc4J library) on Jruby,
-  # MarcReader (using ruby-marc) otherwise.
+  # The default writer is the Traject::SolrJsonWriter. In the default MARC mode,
+  # the default reader is MarcReader (using ruby-marc).
+  # In XML mode, it is the NokogiriReader.
 end
 ~~~
@@ -98,24 +87,26 @@ See, docs page on [Settings](./doc/settings.md) for list
 of all standardized settings.
-## Indexing rules: 'to_field' and 'extract_marc'
+## Indexing rules: 'to_field'
-There are a few methods that can be used to create indexing rules.  We will touch on the two most commonly used methods here.  More information is available in [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md)
+There are a few methods that can be used to create indexing rules.  We will touch on the two most commonly used methods here.  More information and technical details are available in [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md)
 `to_field` establishes a rule to extract content to a particular named output field.  A `to_field` extraction rule can use built-in 'macros', or, as we'll see later, entirely custom logic.
-The built-in macro you'll use the most is `extract_marc`, to extract
-data out of a MARC record according to a tag/subfield specification.
+The built-in macros most commonly used are:
+* In MARC mode, `extract_marc`, to extract data out of a MARC record according to a tag/subfield specification.
+* In XML mode, `extract_xpath`. For more on XML use of traject, see the [XML guide](./doc/xml.md).
+### MARC examples: extract_marc
 ~~~ruby
     # Take the value of the first 001 field, and put
     # it in output field 'id', to be indexed in Solr
     # field 'id'
-    to_field "id", extract_marc("001", :first => true)
+    to_field "id", extract_marc("001")
-    # 245 subfields a, p, and s. 130, all subfields.
-    # built-in punctuation trimming routine.
-    to_field "title_t", extract_marc("245aps:130", :trim_punctuation => true)
+    to_field "title_t", extract_marc("245aps:130")
     # Can limit to certain indicators with || chars.
     # "*" is a wildcard in indicator spec.  So this is
@@ -142,17 +133,64 @@ By default, specifications with multiple subfields (e.g. "240abc") will produce
 For the syntax and complete possibilities of the specification string argument to extract_marc, see docs at the [MarcExtractor class](./lib/traject/marc_extractor.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/MarcExtractor)).
-`extract_marc` also supports `translation maps` similar to SolrMarc's. There are some translation maps provided by traject, and you can also define your own, in yaml or ruby. Translation maps are especially useful for mapping from MARC codes to user-displayable strings:
+To see all options for `extract_marc`, see the [extract_marc](http://rdoc.info/gems/traject/Traject/Macros/Marc21:extract_marc) method documentation.
+There is one special MARC-specific transformation macro, that strips punctuation from beginning and end of values using heuristics designed for AACR2 in MARC:
+```ruby
+    to_field "title", extract_marc("245abc"), trim_punctuation
+```
+### XML mode, extract_xpath
+See our [xml guide](./doc/xml.md) for more XML examples, but you will usually use extract_xpath.
+    to_field "title", extract_xpath("//title")
+### Translation maps
+Traject supports `translation maps` similar to SolrMarc's. There are some translation maps provided by traject, and you can also define your own, in yaml or ruby. Translation maps are especially useful for mapping from MARC codes to user-displayable strings. Translation maps are invoked in a second arg to `to_field`.
 ~~~ruby
     # "translation_map" will be passed to Traject::TranslationMap.new
     # and the created map used to translate all values
-    to_field "language", extract_marc("008[35-37]:041a:041d", :translation_map => "marc_language_code")
+    to_field "language", extract_marc("008[35-37]:041a:041d"), translation_map("marc_language_code")
 ~~~
-To see all options for `extract_marc`, see the [extract_marc](http://rdoc.info/gems/traject/Traject/Macros/Marc21:extract_marc) method documentation.
+The argument(s) to `translation_map` are  passed to `TranslationMap.new`, see comment docs at [TranslationMap](./lib/traject/translation_map.rb) for documentation.
+The `translation_map` macro also allows you to specify _multiple_ translation maps, with the latter ones overriding earlier ones:
+```ruby
+    to_field "language", extract_marc("008[35-37]:041a:041d"),
+      translation_map("marc_language_code",
+                      "local_marc_language_code_overrides",
+                      {"inline_hash" => "even local more overrides"})
+```
+### Additional transformation macros
+TranslationMap use above is just one example of a transformation macro, that transforms output values. Other built-in transformation macros are defined in Traject::Macros::Transformation, and include:
-## Other built-in utility macros
+* `default("some value")`: provide a default value if no extracted value exists
+* `first_only`: limit output to a single value, the first one extracted.
+* `unique`: reduce output values to only unique values
+* `strip`: remove leading or trailing whitespace
+* `prepend("before each value:")`
+* `append("--after each value")`
+* `gsub(/regex/, "replacement")`
+* `split(" ")`: take values and split them, possibly resulting in multiple values.
+You can add on as many transformation macros as you want; they will be applied to output in order.
+Example:
+```ruby
+to_field "something", extract_xpath("//value"), strip, default("no value"), prepend("Extracted value: ")
+```
+### Other built-in utility macros
 Other built-in methods that can be used with `to_field` include:
@@ -256,9 +294,7 @@ use ruby methods like `map!` to modify it:
 ~~~
 If you find yourself repeating boilerplate code in your custom logic, you can
-even create your own 'macros' (like `extract_marc`). `extract_marc` and other
-macros are nothing more than methods that return ruby lambda objects of
-the same format as the blocks you write for custom logic.
+even create your own 'macros' (like `extract_marc`). `extract_marc`, `translation_map`, `first_only` and other macros are nothing more than methods that return ruby lambda objects of the same format as the blocks you write for custom logic.
 For tips, gotchas, and a more complete explanation of how this works, see
 additional documentation page on [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md)
@@ -304,11 +340,10 @@ You set which writer is being used in settings (`provide "writer_class_name", "T
 or with the shortcut command line argument  `-w Traject::DebugWriter`.
 The [SolrJWriter](https://github.com/traject/traject-solrj_writer) is packaged separately,
-and will be useful if you need to index to Solr's older than version 3.2. It requires Jruby.
+and will be useful if you need to index to Solr's older than version 3.2. It requires Jruby.
 You can easily write your own Readers and Writers if you'd like, see comments at top
-of [Traject::Indexer](lib/traject/indexer.rb).
+of [Traject::Indexer](lib/traject/indexer.rb). A reader is simply an object that initializes with a ruby IO and traject Settings, and provides an `each` method. The simplest Writer class initializes with a traject Settings, and provides a `put(traject_context)` method.
 ## Duplicate, `nil`, and empty values
@@ -365,13 +400,16 @@ writer class in question.
 ## The traject command Line
+(If you are interested in running traject in an embedded/programmatic context instead of as a standalone command-line batch process, please see docs on [Programmatic Use](./doc/programmatic_use.md).)
 The simplest invocation is:
     traject -c conf_file.rb marc_file.mrc
+By default, and for legacy reasons, the traject command line uses the MarcIndexer, with default marc reader and macros.  If you want to use a different indexer for a different file format, use the `-i` flag:  `traject -i xml`, the NokogiriIndexer; `traject -i basic`, the base Traject::Indexer with no format-specific behavior; or `traject -i Your::Own::Class`.
 Traject assumes marc files are in ISO 2709 MARC 'binary' format; it is not
-currently able to guess other marc format types like XML from filenames or content. If you are reading
-marc files in another format, you need to tell traject either with the `marc_source.type` or the command-line shortcut:
+currently able to guess other marc format types like XML from filenames or content. If you are reading marc files in another format, you need to tell traject either with the `marc_source.type` or the command-line shortcut:
     traject -c conf.rb -t xml marc_file.xml
@@ -389,6 +427,17 @@ This will over-ride any settings set with `provide` in conf files.
     traject -c conf_file.rb marc_file -s solr.url=http://somehere/solr -s solrj_writer.commit_on_close=true
+When using the default `Traject::Indexer::MarcIndexer`, traject assumes marc files are in ISO 2709 MARC 'binary' format; it is not currently able to guess other marc format types like XML from filenames or content. If you are reading marc files in another format, you need to tell traject either with the `marc_source.type` setting or the command-line shortcut:
+    traject -c conf.rb -t xml marc_file.xml
+To use XML mode instead (with the Traject::NokogiriReader and suitable config files), use the `-i` flag:
+    traject -i xml -c xml_suitable_config_file.rb
+(You can also pass the full name of a custom indexer class to `-i`)
 There are some built-in command-line option shortcuts for useful
 settings:
@@ -442,15 +491,16 @@ Own Code](./doc/extending.md)
 ## More
+* [Traject XML guide](./doc/xml.md)
 * [Other traject commands](./doc/other_commands.md) including `marcout`, and `commit`
 * [Hints for batch and cronjob use](./doc/batch_execution.md) of  traject.
+* [Traject Programmatic Use guide](./doc/programmatic_use.md)
 * Plugin extensions: Gems that add functionality to traject
   * [traject_alephsequential_reader](https://github.com/traject/traject_alephsequential_reader/): read MARC files serialized in the AlephSequential format, as output by Ex Libris's Aleph ILS.
   * [traject_horizon](https://github.com/jrochkind/traject_horizon): Export MARC records directly from a Horizon ILS rdbms, as serialized MARC or to  index into Solr.
   * [traject_umich_format](https://github.com/billdueber/traject_umich_format/): opinionated code and associated macros to extract format (book, audio file, etc.) and types (bibliography, conference report, etc.) from a MARC record. Code mirrors that used by the University of Michigan, and is an alternate approach to that taken by the `marc_formats` macro in `Traject::Macros::MarcFormatClassifier`.
   * [traject-solrj_writer](https://github.com/traject/traject-solrj_writer): a jruby-only writer that uses the solrj .jar to talk directly to solr. Your only option for speaking to a solr version < 3.2, which is when the json handler was added to solr.
-  * [traject_marc4j_reader](https://github.com/traject/traject-marc4j_reader): Packaged with traject automatically on jruby. A JRuby-only reader for
-  reading marc records using the Marc4J library, fastest MARC reading on JRuby.
+  * [traject_marc4j_reader](https://github.com/traject/traject-marc4j_reader): A JRuby-only reader for reading marc records using the Marc4J library, fastest MARC-XML reading on JRuby.
   * [traject_sequel_writer](https://github.com/traject/traject_sequel_writer) A writer for sending to an rdbms via [Sequel](https://github.com/jeremyevans/sequel)
 # Development
@@ -469,15 +519,16 @@ and/or extra files in ./docs -- as appropriate for what needs to be docs.
 online api docs has a `--markup markdown` specified -- inline class/method docs are in markdown, not rdoc.
 Bundler rake tasks included for gem releases: `rake release`
-* Every traject release needs to be done once when running MRI, and switch to JRuby
-and do the same release again. The JRuby release is identical but for including
-a gemspec dependency on the Marc4JReader gem.
-## TODO
+The standard [bundle console](https://bundler.io/v1.7/bundle_console.html) command may be useful for getting an `irb` console with the gem and its dependencies loaded.
+## TODO: Possible future improvements
+* Incorporate more inspired by [TrajectPlus](https://github.com/sul-dlss/traject_plus), possibly including `compose` for building nested hash output.
-* Readers and index rules helpers for reading XML files as input? Maybe.
+* Incorporate functionality to write multiple output records based on a single input record. Likely will share implementation details with a trajectplus-style `compose`.
-* Writers for writing to stores other than Solr? ElasticSearch? Maybe.
+* Writers for writing to stores other than Solr? ElasticSearch? Maybe.
 * Unicode normalization. Has to normalize to NFKC on way out to index. Except for serialized marc field and other exceptions? Except maybe don't have to, rely on solr analyzer to do it?
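
The README changes above describe a reader as an object initialized with a ruby IO and traject Settings that provides `each`, and the simplest writer as an object initialized with Settings that provides `put(traject_context)`. A minimal sketch of those two shapes in plain Ruby follows. `LineReader`, `MemoryWriter` and `FakeContext` are hypothetical names for illustration; the sketch also assumes the context exposes its mapped output through an `output_hash` method, and a plain Hash stands in for traject Settings:

```ruby
require "stringio"

# Hypothetical reader: yields one source record per non-blank input line.
class LineReader
  include Enumerable

  def initialize(input_stream, settings)
    @input_stream = input_stream
    @settings = settings
  end

  def each
    return enum_for(:each) unless block_given?

    @input_stream.each_line do |line|
      line = line.chomp
      yield line unless line.empty?
    end
  end
end

# Hypothetical writer: collects each context's output hash in memory.
class MemoryWriter
  attr_reader :records

  def initialize(settings)
    @settings = settings
    @records = []
  end

  def put(context)
    @records << context.output_hash
  end
end

# Stand-in for the traject context object, used only in this sketch.
FakeContext = Struct.new(:output_hash)

reader = LineReader.new(StringIO.new("a\n\nb\n"), {})
writer = MemoryWriter.new({})
reader.each do |line|
  output = { "id" => [line] }
  writer.put(FakeContext.new(output))
end
```

In real use the indexer, not your own loop, would call `each` on the reader and `put` on the writer; the loop here only shows how the two interfaces meet.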

data/Rakefile CHANGED

@@ -11,6 +11,13 @@ require 'rake/testtask'
 task :default => [:test]
 Rake::TestTask.new do |t|
+  # Rake 11 makes warnings on by default, but there is so much noise, including
+  # from our dependencies, and from things I think are silly warnings like
+  # "shadowing outer local variable"
+  # Possibly could turn back on in the future using https://rubygems.org/gems/warning/versions/0.10.0
+  # gem to customize.
+  t.warning = false
   t.pattern = 'test/**/*_test.rb'
   t.libs.push 'test', 'test_support'
 end
@@ -18,4 +25,4 @@ end
 # Not documented well, but this seems to be
 # the way to load rake tasks from other files
 #import "lib/tasks/load_map.rake"
-Dir.glob('lib/tasks/*.rake').each { |r| import r}
+Dir.glob('lib/tasks/*.rake').each { |r| import r}

data/doc/indexing_rules.md CHANGED

@@ -1,35 +1,59 @@
 # Details on Traject Indexing: from custom logic to Macros
-Traject macros are a way of providing re-usable index mapping rules. Before we discuss how they work, we need to remind ourselves of the basic/direct Traject `to_field` indexing method.
+We will explain the architecture of indexing rules, to help you use them more effectively, and create 'macros' which are re-usable index mapping rules.
-## How direct indexing logic works
+## How to_field works
-Here's the simplest possible direct Traject mapping logic, duplicating the effects of the `literal` macro:
+A `to_field` invocation might look like this:
-~~~ruby
+```ruby
+to_field "title", extract_marc("245abc"), first_only
+```
+In fact, both `extract_marc("245abc")` and `first_only` are invocations of methods that return ruby [Proc](https://ruby-doc.org/core-2.2.0/Proc.html) or lambda objects. A method that is included in an indexer and returns a `Proc` object suitable as an arg to `to_field` is what we call a **macro** in traject.
+The `to_field` method for establishing an indexing rule is defined to simply take a first argument that is a field name, and then one or more arguments that are procs. During indexing, the procs registered with the indexing rule are executed in order to provide and transform output values.  There can be an additional proc provided as a block argument to the `to_field` method.
+By providing macro methods in the indexer that return procs, we can use this simple ruby method signature to create something that looks like a "domain specific language," where you might not even realize it's all based on procs.  `extract_marc` is a method defined in the MarcIndexer (via including the `Traject::Macros::Marc21` mixin), while `first_only` is a method included in all Indexers (via the base Indexer class including the `Traject::Macros::Transformation` mixin).
+These proc arguments themselves take three arguments, of which the third is optional.
+1. the source record
+2. an "accumulator" array of output values, to which the procs add or transform values
+3. a traject "context"
+Here's the simplest possible direct Traject mapping logic, duplicating the effects of the literal macro:
+```ruby
 to_field("title") do |record, accumulator, context|
   accumulator << "FIXED LITERAL"
 end
-~~~
+```
-That `do` is just ruby `block` syntax, whereby we can pass a block of ruby code as an argument to to a ruby method. We pass a block taking three arguments, labeled `record`, `accumulator`, and `context`, to the `to_field` method. The third 'context' object is optional, you can define it in your block or not, depending on if you want to use it.
+That `do` is just ruby block syntax, whereby we can pass a block of ruby code as an argument to a ruby method. We pass a block taking three arguments, labeled record, accumulator, and context, to the to_field method. The third 'context' argument is optional; you can declare it in your block or not, depending on whether you want to use it.
-The block is then stored by the Traject::Indexer, and called for each record indexed, with three arguments provided.
+The block is then stored by the Traject::Indexer, and called for each record indexed, with three arguments provided.
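To make the mechanics concrete, here is a minimal plain-Ruby sketch of how an indexer could store the procs and block given to `to_field` and call them for each record. `TinyIndexer` and its `map_record` method are hypothetical stand-ins written for this illustration; they are not traject's actual implementation, and the "context" here is just a Hash.

```ruby
# Hypothetical stand-in for an indexer; NOT traject's real implementation.
class TinyIndexer
  def initialize
    @rules = [] # each rule is [field_name, array_of_procs]
  end

  # Accepts a field name, any number of procs, and an optional block.
  def to_field(field_name, *procs, &block)
    procs << block if block
    @rules << [field_name, procs]
  end

  # Runs every stored rule against one record and returns the output hash.
  def map_record(record)
    output_hash = {}
    context = { output_hash: output_hash } # simplified stand-in for a context
    @rules.each do |field_name, procs|
      accumulator = [] # every rule starts with an empty accumulator
      procs.each do |step|
        # The third argument is optional, so pass only what the step declares.
        if step.arity == 2
          step.call(record, accumulator)
        else
          step.call(record, accumulator, context)
        end
      end
      next if accumulator.empty?
      (output_hash[field_name] ||= []).concat(accumulator)
    end
    output_hash
  end
end

indexer = TinyIndexer.new
indexer.to_field("title") do |record, accumulator, context|
  accumulator << "FIXED LITERAL"
end

p indexer.map_record(:any_record)["title"] # => ["FIXED LITERAL"]
```

The arity check is one possible way to honor the optional third argument; it matters mainly for lambdas, which raise an error when called with the wrong number of arguments.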
 ### record argument
-The record that gets passed to your block is a MARC::Record object (or, theoretically, any object that gets returned by a traject Reader). Your logic will usually examine the record to calculate the desired output.
+The record that gets passed to your block is the source record for the current indexing: A `MARC::Record` when using the MarcIndexer, a `Nokogiri::XML::Document` using the NokogiriIndexer, or whatever source record type is used by a given indexer.
+Logic for an "extraction" proc, like that returned by `extract_marc`, is typically the first one given to `to_field`, and will usually examine the record to calculate the desired output.
+Logic for a "transformation" proc, such as that returned by `first_only`, usually ignores the record argument.
+"Extraction" vs "transformation" are just names for procs that either examine the source_record to add something to the accumulator ("extraction") or transform values already in the accumulator ("transformation") -- a proc can actually do these things in any combination, but it usually makes sense to design some procs for extraction and others for transformation.
 ### accumulator argument
-The accumulator argument is an Array. At the end of your custom code, the accumulator Array should hold the output you want send off to the field specified in `to_field`.
+The accumulator argument is an Array. At the end of your custom code, the accumulator Array should hold the output you want to send off to the field specified in `to_field`.
-The accumulator is a reference to a ruby Array, and you need to **modify** that Array, manipulating it in place with Array methods that mutate the array, like `concat`, `<<`, `map!` or even `replace`.
+The accumulator is a reference to a ruby Array, and you need to **modify** that Array, manipulating it in place with Array methods that mutate the array, like `concat`, `<<`, `map!` or even `replace`.
-You can't simply assign the accumulator variable to a different Array; you need to modify the Array *in place*.
+You can't simply assign the accumulator variable to a different Array; you need to modify the Array *in place*.
     # Won't work, assigning variable
-    to_field('foo') do |rec, acc|
+    to_field('foo') do |rec, acc|
      acc = ["some constant"] # WRONG!
     end
@@ -38,7 +62,7 @@ You can't simply assign the accumulator variable to a different Array; you need
       acc << 'bill'
       acc << 'dueber'
       acc = acc.map{|str| str.upcase}
-    end   # WRONG! WRONG! WRONG! WRONG! WRONG!
+    end   # WRONG! WRONG! WRONG! WRONG! WRONG!
     # Instead, modify the array in place
@@ -49,11 +73,13 @@ You can't simply assign the accumulator variable to a different Array; you need
       acc.map!{|str| str.upcase} # NOTE: "map!" not "map"
     end
+If you have multiple calls to `to_field` for the same field, each invocation begins with an empty accumulator, to help keep them independent.
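The difference between reassigning and mutating can be checked in plain Ruby, without traject. In this sketch, `run_step`, `reassigning_step`, and `mutating_step` are hypothetical helpers invented for the illustration; `run_step` plays the role of the caller that owns the accumulator.

```ruby
# Hypothetical helpers for illustration; none of these are traject methods.

# Reassigning the block-local name leaves the caller's Array untouched.
def reassigning_step
  proc do |_record, acc|
    acc = ["some constant"] # rebinds the local variable only
  end
end

# Mutating methods change the very Array the caller passed in.
def mutating_step
  proc do |_record, acc|
    acc << "bill" << "dueber"
    acc.map!(&:upcase) # "map!" mutates in place; "map" would not
  end
end

# Plays the role of the caller that owns the accumulator.
def run_step(step)
  accumulator = []
  step.call(nil, accumulator)
  accumulator
end

p run_step(reassigning_step) # => []
p run_step(mutating_step)    # => ["BILL", "DUEBER"]
```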
 ### context argument
 The third optional argument is a [Traject::Indexer::Context](./lib/traject/indexer/context.rb)  ([rdoc](http://rdoc.info/github/traject/traject/Traject/Indexer/Context)) object. Most of the time you don't need it, but you can use it for some sophisticated functionality.  These are some useful methods available:
-* `context.clipboard` A hash into which you can stuff values that you want to pass from one indexing step to another. For example, if you go through a bunch of work to query a database and get a result you'll need more than once, stick the results somewhere in the clipboard. This clipboard is record-specific, and won't persist between records.
+* `context.clipboard` A hash into which you can stuff values that you want to pass from one indexing step to another. For example, if you go through a bunch of work to query a database and get a result you'll need more than once, stick the results somewhere in the clipboard. This clipboard is record-specific, and won't persist between records.
 * `context.position` The position of the record in the input file (e.g., was it the first record, second, etc.). Useful for error reporting.
 * `context.output_hash` A hash mapping the field names (generally defined in `to_field` calls) to an array of values to be sent to the writer associated with that field. This allows you to modify what goes to the writer without going through a `to_field` call -- you can just set `context.output_hash['myfield'] = ['my', 'values']` and you're set. See below for more examples.
 * `context.skip!(msg)` An assertion that this record should be ignored. No more indexing steps will be called, no results will be sent to the writer, and a `debug`-level log message will be written stating that the record was skipped.
@@ -92,50 +118,37 @@ to_field 'normalized_title' do |rec, acc|
 end
 ```
+Traject macros similarly capture values in local variables outside the proc they return, and the returned proc can then use those values.
 Certain built-in traject calls have been optimized to be high performance
-so it's safe to do them inside 'inner loop' blocks. That includes `Traject::TranslationMap.new` and `Traject::MarcExtractor.cached("xxx")`
+so it's safe to do them inside 'inner loop' blocks. That includes `Traject::TranslationMap.new` and `Traject::MarcExtractor.cached("xxx")`
 (NOTE: #cached rather than #new there)
-## From block to lambda
-In the ruby language, in addition to creating a code block as an argument
-to a method with `do |args| ... end` or `{|arg| ...  }`, we can also create
-a code block to hold in a variable, with the `lambda` keyword:
-    always_output_foo = lambda do |record, accumulator|
-      accumulator << "FOO"
-    end
-In traject, `to_field` is written so that, as a convenience, it can take a lambda expression stored in a variable as an alternative to a block:
-    to_field("always_has_foo"), always_output_foo
-Why is this a convenience? Well, ordinarily it's not something we
-need, but in fact it's what allows traject 'macros' to be re-useable
-code templates.
-## Macros
+## Back to macros
 A Traject macro is a way to automatically create indexing rules via re-usable "templates".
-Traject macros are methods that return ruby lambda/proc objects, possibly creating them based on parameters passed in.
+Traject macros are simply methods that return ruby lambda/proc objects, possibly creating them based on parameters passed in.
-For example, here is the implementation of the  `literal` method/macro:
+For example, here is the implementation of the `literal` logic, as a macro method returning a proc, instead of as an inline proc.
 ~~~ruby
+# This method is included in an Indexer, possibly as a module mix-in.
 def literal(value)
-  return lambda do |record, accumulator, context|
+  return proc do |record, accumulator, context|
      # because a lambda is a closure, we can define it in terms
      # of the 'value' from the scope it's defined in!
      accumulator << value
   end
 end
+# then it would be called on the indexer, typically in a traject configuration file,
+# when setting up an indexing rule:
 to_field "fieldname", literal("my_fav_literal")
 ~~~
-So a Traject macro is a method that may have parameters and, based on those parameters, returns a lambda; the lambda is then passed to the `to_field` indexing method, or similar methods.
+So a Traject macro is a method that may have parameters and, based on those parameters, returns a proc; the proc is then passed to the `to_field` indexing method, or similar methods.
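As a further sketch of the same pattern, the example below shows a parameterized macro whose method body runs once, when the rule is set up, while the proc it returns runs once per record. `DemoMacros`, `append_suffix`, and `DemoConfig` are hypothetical names made up for this illustration and are not part of traject; the proc is called by hand here rather than through `to_field`.

```ruby
# Hypothetical macro module for illustration; not part of traject.
module DemoMacros
  def append_suffix(suffix)
    # This part runs once, when the indexing rule is set up.
    frozen_suffix = suffix.dup.freeze
    proc do |_record, accumulator, _context|
      # This part runs once per record; the closure captures frozen_suffix.
      accumulator.map! { |value| "#{value}#{frozen_suffix}" }
    end
  end
end

# Stand-in for an object that mixes the macro in, as an indexer would.
class DemoConfig
  include DemoMacros
end

step = DemoConfig.new.append_suffix(" [online]")
accumulator = ["Moby Dick", "Walden"]
step.call(nil, accumulator, nil)
p accumulator # => ["Moby Dick [online]", "Walden [online]"]
```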
 How do you make these methods available to the traject indexer?
@@ -145,7 +158,7 @@ Define it in a module:
 # in a file literal_macro.rb
 module LiteralMacro
   def literal(value)
-    return lambda do |record, accumulator, context|
+    return proc do |record, accumulator, context|
        # because a lambda is a closure, we can define it in terms
        # of the 'value' from the scope it's defined in!
        accumulator << value
@@ -165,39 +178,43 @@ to_field("fieldname"), literal("my_fav_literal")
 ~~~
 That's it.  You can use the traject command line `-I` option to set the ruby load path, so your file will be findable via `require`.  Or you can distribute it in a gem, and use straight rubygems and the `gem` command in your configuration file, or Bundler with traject command-line `-g` option.
+See the [Extending with your own code](./extending.md) guide for various methods for including custom code in a traject command-line invocation.
+## Combining multiple macros, lambdas and blocks
-## Using a lambda _and_ a block
+Traject macros (such as `extract_marc`) create and return a proc. If
+you include a proc _and_ a block (or multiple procs) on a `to_field` call, each subsequent proc
+or code block gets the accumulator as it was filled in by the earlier ones, and can *transform* the values in the accumulator.
-Traject macros (such as `extract_marc`) create and return a lambda. If
-you include a lambda _and_ a block on a `to_field` call, the block
-gets the accumulator as it was filled in by the former.
+Here is an example of passing `to_field` procs returned by macros, procs held in variables, and blocks.
 ```ruby
-# Get the titles and lowercase them
-to_field 'lc_title', extract_marc('245') do |rec, acc, context|
-  acc.map!{|title| title.downcase}
-end
-# Build my own lambda and use it
-mylam = lambda {|rec, acc|  acc << 'one'} # just add a constant
-to_field('foo'), mylam do |rec, acc, context|
-  acc << 'two'
-end #=> context.output_hash['foo'] == ['one', 'two']
+titlecase = proc do |rec, acc|
+  acc.map! { |value| value.titlecase }
+end
-# You might also want to do something like this
-to_field('foo'), macro_returning_dup_values do |rec, acc|
-  acc.uniq!
+to_field 'lc_title', extract_marc('245'), titlecase, unique do |rec, acc, context|
+  acc.delete_if { |v| v == "value_to_eliminate" }
 end
 ```
+`extract_marc` and `unique` are "macro" methods, each returning a proc.
+`titlecase` is just a local variable, defined in the indexing file itself, holding a proc.
+Then finally there is a block arg, taking the same arguments as the procs would.
+All of these can be combined, and will be executed in order to transform output values.
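Because the steps run in order against one shared accumulator, their order can change the result. The plain-Ruby sketch below illustrates this with a hypothetical `run_steps` helper standing in for the indexer. It uses `String#capitalize` from core Ruby, since `titlecase` in the example above is not a core Ruby method and is assumed to come from a library such as ActiveSupport.

```ruby
# Hypothetical helper standing in for the indexer; not a traject method.
def run_steps(record, *steps)
  accumulator = []
  steps.each { |step| step.call(record, accumulator) }
  accumulator
end

extract        = proc { |rec, acc| acc.concat(rec[:titles]) } # extraction
capitalize_all = proc { |_rec, acc| acc.map!(&:capitalize) }  # transformation
dedupe         = proc { |_rec, acc| acc.uniq! }               # transformation

record = { titles: ["walden", "Walden"] }

# Capitalize first, then de-duplicate: the two values collapse into one.
p run_steps(record, extract, capitalize_all, dedupe) # => ["Walden"]

# De-duplicate first: the values still differ, so both survive.
p run_steps(record, extract, dedupe, capitalize_all) # => ["Walden", "Walden"]
```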
 ## Manipulating `context.output_hash` directly
 If you ask for the context argument, a [Traject::Indexer::Context](./lib/traject/indexer/context.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Indexer/Context)), you have access to `context.output_hash`, which is
 the hash of already transformed output that will be sent to Solr (or any other Writer).
-You can examine `context.output_hash` to see any already transformed output and use it as the source for new output.
+You can examine `context.output_hash` to see any already transformed output and use it as the source for new output.
-You can *write* to `context.output_hash` directly, which can be useful for computations that affect more than one output field at once.
+You can *write* to `context.output_hash` directly, which can be useful for computations that affect more than one output field at once.
 **Note**: Make sure you always assign an _Array_ to each `context.output_hash` value, e.g., `context.output_hash['foo']`, not a single value!
@@ -211,15 +228,14 @@ context.output_hash['fieldname'] = ['fuzzy_wuzzies']
 ```
 ## each_record
 `each_record` is similar to `to_field` in that it defines logic executed for each record.  It differs from `to_field` because the output of `each_record` is not associated with a specific output field.
-Thus, `each_record` blocks have no `accumulator` argument: instead they either take a single `record` argument; or both a `record` and a `context`.
+Thus, `each_record` blocks have no `accumulator` argument: instead they either take a single `record` argument; or both a `record` and a `context`.
 `each_record` is useful for logging or notifying, computing intermediate
-results, or writing to more than one field at once.
+results, or writing to more than one field at once.
 ~~~ruby
 each_record do |record, context|
@@ -241,7 +257,7 @@ each_record do |record, context|
 end
 ~~~
-traject doesn't come with any macros written for use with `each_record`, but they could be created:  such macros would be methods that return a lambda given the appropriate args from `each_record`.
+traject doesn't come with any macros written for use with `each_record`, but they could be created:  such macros would be methods that return a lambda given the appropriate args from `each_record`.
 ## More tips and gotchas about indexing steps
@@ -251,4 +267,4 @@ traject doesn't come with any macros written for use with `each_record`, but the
 * **Once you call `context.skip!(msg)` no more index steps will be run for that record**. So if you have any cleanup code, you'll need to make sure to call it yourself.
-* **By default, `traject` indexing runs multi-threaded**. In the current implementation, the indexing steps for one record are *not* split across threads, but different records can be processed simultaneously by more than one thread. That means you need to make sure your code is thread-safe (or always set `processing_thread_pool` to 0).
+* **By default, `traject` indexing runs multi-threaded**. In the current implementation, the indexing steps for one record are *not* split across threads, but different records can be processed simultaneously by more than one thread. That means you need to make sure your code is thread-safe (or always set `processing_thread_pool` to 0).
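As one illustration of what thread-safe step code can look like, the sketch below guards shared state with a `Mutex` so that steps running on several threads cannot corrupt it. `SafeCounter` is a hypothetical helper written for this example, and the threads are started by hand to mimic concurrent record processing; nothing here uses traject's own thread pool.

```ruby
# Hypothetical shared counter for use from indexing steps; not part of traject.
class SafeCounter
  def initialize
    @mutex = Mutex.new
    @counts = Hash.new(0)
  end

  # Only one thread at a time may update the shared Hash.
  def increment(key)
    @mutex.synchronize { @counts[key] += 1 }
  end

  def [](key)
    @mutex.synchronize { @counts[key] }
  end
end

counter = SafeCounter.new

# A step with the same shape as a to_field block, touching shared state.
step = proc { |record, _acc| counter.increment(record[:format]) }

# Simulate four worker threads, each processing 250 records.
threads = 4.times.map do
  Thread.new { 250.times { step.call({ format: "book" }, []) } }
end
threads.each(&:join)

p counter["book"] # => 1000
```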