RubyGems - traject - Versions diffs - 1.1.0 → 2.0.0.rc.1 - Mend

traject 1.1.0 → 2.0.0.rc.1

Files changed (51)

checksums.yaml +4 -4
data/.travis.yml +20 -0
data/README.md +85 -73
data/doc/batch_execution.md +2 -6
data/doc/other_commands.md +3 -5
data/doc/settings.md +27 -38
data/lib/traject/command_line.rb +1 -1
data/lib/traject/csv_writer.rb +34 -0
data/lib/traject/delimited_writer.rb +110 -0
data/lib/traject/indexer.rb +29 -11
data/lib/traject/indexer/settings.rb +39 -13
data/lib/traject/line_writer.rb +10 -6
data/lib/traject/marc_reader.rb +2 -1
data/lib/traject/solr_json_writer.rb +277 -0
data/lib/traject/thread_pool.rb +38 -48
data/lib/traject/translation_map.rb +3 -0
data/lib/traject/util.rb +13 -51
data/lib/traject/version.rb +1 -1
data/lib/translation_maps/marc_geographic.yaml +2 -2
data/test/delimited_writer_test.rb +104 -0
data/test/indexer/read_write_test.rb +0 -22
data/test/indexer/settings_test.rb +24 -0
data/test/solr_json_writer_test.rb +248 -0
data/test/test_helper.rb +5 -3
data/test/test_support/demo_config.rb +0 -5
data/test/translation_map_test.rb +9 -0
data/traject.gemspec +18 -5
metadata +77 -87
data/lib/traject/marc4j_reader.rb +0 -153
data/lib/traject/solrj_writer.rb +0 -351
data/test/marc4j_reader_test.rb +0 -136
data/test/solrj_writer_test.rb +0 -209
data/vendor/solrj/README +0 -8
data/vendor/solrj/build.xml +0 -39
data/vendor/solrj/ivy.xml +0 -16
data/vendor/solrj/lib/commons-codec-1.7.jar +0 -0
data/vendor/solrj/lib/commons-io-2.1.jar +0 -0
data/vendor/solrj/lib/httpclient-4.2.3.jar +0 -0
data/vendor/solrj/lib/httpcore-4.2.2.jar +0 -0
data/vendor/solrj/lib/httpmime-4.2.3.jar +0 -0
data/vendor/solrj/lib/jcl-over-slf4j-1.6.6.jar +0 -0
data/vendor/solrj/lib/jul-to-slf4j-1.6.6.jar +0 -0
data/vendor/solrj/lib/log4j-1.2.16.jar +0 -0
data/vendor/solrj/lib/noggit-0.5.jar +0 -0
data/vendor/solrj/lib/slf4j-api-1.6.6.jar +0 -0
data/vendor/solrj/lib/slf4j-log4j12-1.6.6.jar +0 -0
data/vendor/solrj/lib/solr-solrj-4.3.1-javadoc.jar +0 -0
data/vendor/solrj/lib/solr-solrj-4.3.1-sources.jar +0 -0
data/vendor/solrj/lib/solr-solrj-4.3.1.jar +0 -0
data/vendor/solrj/lib/wstx-asl-3.2.7.jar +0 -0
data/vendor/solrj/lib/zookeeper-3.4.5.jar +0 -0

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 4ae9c6a2d87868021cae1b48637592238387d8a1
-  data.tar.gz: 578c645162da3560ff5e01a28cf43b36e82734f8
+  metadata.gz: 1e875abe713a3200de4fb424c9570a3b89d5ddc0
+  data.tar.gz: cbb7b0f4fd9bb293af55afff48b52f92ce2b4dfb
 SHA512:
-  metadata.gz: e0bf13c4ff3cab492b6be8922ae22e33311f701756910526eb3206522774b1519db07324531ebd2c366d5907854c78bb6cde0d65eeff78258513d37fae1a3a57
-  data.tar.gz: dd251b15afafe2a11cbefe8493e145d9b9b883183b5f884a5b61557046c74b699427fe2766d5df587e27729f6c22052b2f672b756aba8fee33149f6abbcc4f40
+  metadata.gz: b7f911f43784275b0a7782e642788bb57a492c1660b9e3a461bf9b03c96882b887dd62e18f385ee9ffb3488c60afda44fb77ea9fc619cd4f3061c471b5bc7227
+  data.tar.gz: 105737429ce7778ae57a3182671fa9c41f72d17579b82f80b38a8ec373b634c1d22394e17c404e23961b5d2f557442e56d69700aaad3091e950d36aa2dc050c5

data/.travis.yml CHANGED

@@ -1,7 +1,27 @@
 language: ruby
 rvm:
   - jruby-19mode
+  - jruby-head
+  - 1.9
+  - 2.1
+  - 2.2
+  - rbx-2
 jdk:
   - openjdk7
   - openjdk6
+matrix:
+  exclude:
+    - rvm: 1.9
+      jdk: openjdk7
+    - rvm: 2.1
+      jdk: openjdk7
+    - rvm: rbx-2
+      jdk: openjdk7
+    - rvm: jruby-head
+      jdk: openjdk6
+    - rvm: 2.2
+      jdk: openjdk6
+  allow_failures:
+    - rvm: jruby-head
 bundler_args: --without debug
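
The `matrix.exclude` entries added above prune the full `rvm` x `jdk` grid. As a rough illustration (not part of the gem), the following Ruby sketch expands that grid and removes the excluded pairs, assuming the usual Travis behaviour where every `rvm` value is paired with every `jdk` value before exclusions are applied:

```ruby
# Sketch: expand the rvm x jdk matrix from the .travis.yml above and
# drop the combinations listed under matrix.exclude.
rvms = %w[jruby-19mode jruby-head 1.9 2.1 2.2 rbx-2]
jdks = %w[openjdk7 openjdk6]

excluded = [
  %w[1.9 openjdk7],
  %w[2.1 openjdk7],
  %w[rbx-2 openjdk7],
  %w[jruby-head openjdk6],
  %w[2.2 openjdk6]
]

# Every rvm paired with every jdk, minus the excluded pairs.
jobs = rvms.product(jdks) - excluded

jobs.each { |rvm, jdk| puts "#{rvm} on #{jdk}" }
```

Under that assumption, 7 of the 12 possible jobs remain. Only `jruby-19mode` is built on both JDKs, and each of the other rubies runs once. The `jruby-head` job is additionally listed under `allow_failures`, so it still runs but should not fail the build.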

data/README.md CHANGED

@@ -1,10 +1,12 @@
 # Traject
-Tools for reading MARC records, transforming them with indexing rules, and indexing to Solr.
-Might be used to index MARC data for a Solr-based discovery product like [Blacklight](https://github.com/projectblacklight/blacklight) or [VUFind](http://vufind.org/).
+An easy to use, high-performance, flexible and extensible MARC to Solr indexer.
-Traject might also be generalized to a set of tools for getting structured data from a source, and transforming it to a hash-like object to send to a destination.
+You might use traject to index MARC data for a Solr-based discovery product like [Blacklight](https://github.com/projectblacklight/blacklight) or [VUFind](http://vufind.org/).
+Traject can also be generalized to a set of tools for getting structured data from a source, and transforming it to a hash-like object to send to a destination. In addition to sending data
+to solr, Traject can produce json or yaml files, tab-delimited files, CSV files, and output suitable
+for debugging by a human.
 **Traject is stable, mature software, that is already being used in production by its authors.**
@@ -14,42 +16,46 @@ Traject might also be generalized to a set of tools for getting structured data
 ## Background/Goals
-Initially by Jonathan Rochkind (Johns Hopkins Libraries) and Bill Dueber (University of Michigan Libraries).
+Initially by Jonathan Rochkind (Johns Hopkins Libraries) and Bill Dueber (University of Michigan Libraries).
-Traject was born out of our experience with similar tools, including the very popular and useful [solrmarc](https://code.google.com/p/solrmarc/) by Bob Haschart; and Bill Dueber's own [marc2solr](http://github.com/billdueber/marc2solr/).
+* Basic configuration files can be easily written even by non-rubyists,  with a few simple directives traject provides. But config files are 'ruby all the way down', so we can provide a gradual slope to more complex needs, with the full power of ruby.
+* Easy to program, easy to read, easy to modify.
+* Fast. Traject by default indexes using multiple threads, on multiple cpu cores, when the underlying
+ruby implementation (i.e., JRuby) allows it, and can use a separate thread for communication with
+solr even under MRI.
+* Composed of decoupled components, for flexibility and extensibility.
+* Designed to support local code and configuration that's maintainable and testable, and can be shared between projects as ruby gems.
+* Easy to split configuration between multiple files, for simple "pick-and-choose" command line options
+that can combine to deal with any of your local needs.
-We're comfortable programming (especially in a dynamic language), and want to be able to experiment with different indexing patterns quickly, easily, and testably; but are admittedly less comfortable in Java.  In order to have a tool with the API's and usage patterns convenient for us, we found we could do it better in JRuby -- Ruby on the JVM.
-* Basic configuration files can be easily written even by non-rubyists,  with a few simple directives traject provides. But config files are 'ruby all the way down', so we can provide a gradual slope to more complex needs, with the full power of ruby.
-* Easy to program, easy to read, easy to modify.
-* Fast. Traject by default indexes using multiple threads, on multiple cpu cores.
-* Composed of decoupled components, for flexibility and extensibility. The whole code base is only 6400 lines of code, more than a third of which is tests.
-* Designed to support local code and configuration that's maintainable and testable, an can be shared between projects as ruby gems.
-* Designed with batch execution in mind: flexible logging, good exit codes, good use of stdin/stdout/stderr.
+## Installation
+Traject runs under MRI ruby (1.9 through 2.2), jruby 1.7.x, or rubinius.
-## Installation
+For high-volume indexing in production, traject performs **much** better when run with **JRuby** (ruby on the JVM).
+Standard MRI ruby can't use multiple CPU cores at once, but on JRuby traject can use
+multiple cores for much better performance.
-Traject runs under jruby (ruby on the JVM). I recommend [chruby](https://github.com/postmodern/chruby) and [ruby-install](https://github.com/postmodern/ruby-install#readme) for installing and managing ruby installations. (traject is tested
-and supported for ruby 1.9 -- recent versions of jruby should run under 1.9 mode by default).
+Some options for installing a ruby other than your system-provided one are [chruby](https://github.com/postmodern/chruby) and [ruby-install](https://github.com/postmodern/ruby-install#readme).
-Then just `gem install traject`.
+Once you have ruby, just `$ gem install traject`.
-( **Note**: We may later provide an all-in-one .jar distribution, which does not require you to install jruby or use on your system. This is hypothetically possible. Is it a good idea?)
+( **Note**: We might in the future provide an all-in-one .jar distribution, which does not require you to install jruby  on your system, for those who want the multi-threading of jruby without having to actually install it. Let us know if interested.).
 ## Configuration files
 traject is configured using configuration files. To get a sense of what they look like, you can
-take a look at our sample non-trivial configuration file,
-[demo_config.rb](./test/test_support/demo_config.rb), which you'd run like
-`traject -c path/to/demo_config.rb marc_file.marc`.
+take a look at our sample basic configuration file,
+[demo_config.rb](./test/test_support/demo_config.rb). You could run traject with that configuration file
+as: `traject -c path/to/demo_config.rb marc_file.marc`.
 Configuration files are actually just ruby -- so by convention they end in `.rb`.
-We hope you can write basic useful configuration files without being a ruby expert,
-traject gives you some easy functions to use for common diretives. But the full power
-of ruby is available to you if needed.
+We hope you can write basic useful configuration files without much ruby experience, since
+traject gives you some easy functions to use for common directives. But the full power
+of ruby is available to you if needed.
 **rubyist tip**: Technically, config files are executed with `instance_eval` in a Traject::Indexer instance, so the special commands you see are just methods on Traject::Indexer (or mixed into it). But you can
 call ordinary ruby `require` in config files, etc., too, to load
@@ -73,10 +79,6 @@ settings do
   # Where to find solr server to write to
   provide "solr.url", "http://example.org/solr"
-  # If you are connecting to Solr 1.x, you need to set
-  # for SolrJ compatibility:
-  # provide "solrj_writer.parser_class_name", "XMLResponseParser"
   # solr.version doesn't currently do anything, but set it
   # anyway, in the future it will warn you if you have settings
   # that may not work with your version.
@@ -87,13 +89,11 @@ settings do
   provide "marc_source.type", "xml"
   # various others...
-  provide "solrj_writer.commit_on_close", "true"
+  provide "solr_writer.commit_on_close", "true"
-  # By default, we use the Traject::MarcReader
-  # One altenrnative is the Marc4JReader, using Marc4J.
-  # provide "reader_class_name", "Traject::Marc4Reader"
-  # If we're reading binary MARC, it's best to tell it the encoding.
-  provide "marc4j_reader.source_encoding", "MARC-8" # or 'UTF-8' or 'ISO-8859-1' or whatever.
+  # The default writer is the Traject::SolrJsonWriter. The default
+  # reader is Marc4JReader (using Java Marc4J library) on Jruby,
+  # MarcReader (using ruby-marc) otherwise.
 end
 ~~~
@@ -105,17 +105,17 @@ See, docs page on [Settings](./doc/settings.md) for list
 of all standardized settings.
-## Indexing rules: Let's start with `to_field` and `extract_marc`
+## Indexing rules: Let's start with 'to_field' and 'extract_marc'
 There are a few methods that can be used to create indexing rules, but the
 one you'll most common is called `to_field`, and establishes a rule
-to extract content to a particular named output field.
+to extract content to a particular named output field.
-The extraction rule can use built-in 'macros', or, as we'll see later,
-entirely custom logic.
+A `to_field` extraction rule can use built-in 'macros', or, as we'll see later,
+entirely custom logic.
 The built-in macro you'll use the most is `extract_marc`, to extract
-data out of a MARC record according to a tag/subfield specification.
+data out of a MARC record according to a tag/subfield specification.
 ~~~ruby
     # Take the value of the first 001 field, and put
@@ -128,7 +128,7 @@ data out of a MARC record according to a tag/subfield specification.
     to_field "title_t", extract_marc("245nps:130", :trim_punctuation => true)
     # Can limit to certain indicators with || chars.
-    # "*" is a wildcard in indicator spec.  So
+    # "*" is a wildcard in indicator spec.  So this is
     # 856 with first indicator '0', subfield u.
     to_field "email_addresses", extract_marc("856|0*|u")
@@ -137,20 +137,20 @@ data out of a MARC record according to a tag/subfield specification.
     to_field "isbn", extract_marc("245a:245abcde")
     # For MARC Control ('fixed') fields, you can optionally
-    # use square brackets to take a byte offset.
+    # use square brackets to take a byte offset.
     to_field "langauge_code", extract_marc("008[35-37]")
 ~~~
 `extract_marc` by default includes all 'alternate script' linked fields correspoinding
 to matched specifications, but you can turn that off, or extract *only* corresponding
-880s.
+880s.
 ~~~ruby
     to_field "title", extract_marc("245abc", :alternate_script => false)
     to_field "title_vernacular", extract_marc("245abc", :alternate_script => :only)
 ~~~
-By default, specifications with multiple subfields (like "240abc") will produce one single string of output for each matching field. Specifications with single subfields (like "020a") will split subfields and produce an output string for each matching subfield.
+By default, specifications with multiple subfields (like "240abc") will produce one single string of output for each matching field. Specifications with single subfields (like "020a") will split subfields and produce an output string for each matching subfield.
 For the syntax and complete possibilities of the specification
 string argument to extract_marc, see docs at the [MarcExtractor class](./lib/traject/marc_extractor.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/MarcExtractor)).
@@ -199,7 +199,7 @@ All of these methods are defined at [Traject::Macros::Marc21](./lib/traject/macr
 Some more complex (and opinionated/subjective) algorithms for deriving semantics
 from Marc are also packaged with Traject, but not available by default. To make
-them available to your indexing, you just need to use ruby `require` and `extend`.
+them available to your indexing, you just need to use ruby `require` and `extend`.
 A number of methods are in [Traject::Macros::Marc21Semantics](./lib/traject/macros/marc21_semantics.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Macros/Marc21Semantics))
@@ -223,6 +223,9 @@ format/genre/type vocabulary:
     to_field 'format_facet',    marc_formats
 ~~~
+(Alternately, see the [traject_umich_format](https://github.com/billdueber/traject_umich_format) gem for the often-ridiculously-complex
+logic used at the University of Michigan.)
 ## Custom logic
 The built-in routines are there for your convenience, but if you need
@@ -240,12 +243,12 @@ in a configuration file, using a ruby block, which looks like this:
     end
 ~~~
-`do |record, accumulator|` is the definition of a ruby block taking
+`do |record, accumulator| ... ` is the definition of a ruby block taking
 two arguments.  The first one passed in will be a MARC record. The
 second is an array, you add values to the array to send them to
-output.
+output.
-Here's a more realistic example that shows how you'd get the
+Here's another example that shows how you'd get the
 record type byte 06 out of a MARC leader, then translate it
 to a human-readable string with a TranslationMap
@@ -257,37 +260,34 @@ to a human-readable string with a TranslationMap
     end
 ~~~
-You can also add a block onto the end of a built-in 'macro', to
+You can also add a block onto the end of a built-in 'macro', to
 further customize the output. The `accumulator` passed to your block
 will already have values in it from the first step, and you can
 use ruby methods like `map!` to modify it:
 ~~~ruby
     to_field "big_title", extract_marc("245abcdefg") do |record, accumulator|
-      # put it all in all uppercase, I don't know why.
+      # put it all in all uppercase, I don't know why.
       accumulator.map! {|v| v.upcase}
     end
 ~~~
-There are many more things you can do with custom logic blocks like this too,
-including additional features we haven't discussed yet.
 If you find yourself repeating boilerplate code in your custom logic, you can
 even create your own 'macros' (like `extract_marc`). `extract_marc` and other
 macros are nothing more than methods that return ruby lambda objects of
-the same format as the blocks you write for custom logic.
+the same format as the blocks you write for custom logic.
 For tips, gotchas, and a more complete explanation of how this works, see
 additional documentation page on [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md)
 ## each_record and after_processing
-In addition to `to_field`, an `each_record` method is available, which,
+In addition to `to_field`, an `each_record` method is available, which,
 like `to_field`, is executed for every record, but without being tied
-to a specific field.
+to a specific field.
 `each_record` can be used for logging or notifiying; computing intermediate
-results; or writing to more than one field at once.
+results; or writing to more than one field at once.
 ~~~ruby
   each_record do |record|
@@ -303,26 +303,33 @@ ruby code you might want for your app (send an email? Clean up a log file? Trigg
 a Solr replication?)
 ~~~ruby
-after_processing do
+after_processing do
   whatever_ruby_code
 end
 ~~~
-## Writers
+## Readers and Writers
 Traject uses modular 'Writer' classes to take the output hashes from transformation, and
-send them somewhere or do something useful with them.
+send them somewhere or do something useful with them.
-By default traject uses the [Traject::SolrJWriter](lib/traject/solrj_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/SolrJWriter)) to send to Solr for indexing.
-A couple other writers are available too, mostly for debugging purposes:
-[Traject::DebugWriter](lib/traject/debug_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/DebugWriter))
-and [Traject::JsonWriter](lib/traject/json_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/JsonWriter))
+By default traject uses the [Traject::SolrJsonWriter](lib/traject/solr_json_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/SolrJsonWriter)) to send to Solr for indexing.
+Several other writers are also built-in:
+* [Traject::DebugWriter](lib/traject/debug_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/DebugWriter))
+* [Traject::JsonWriter](lib/traject/json_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/JsonWriter))
+* [Traject::YamlWriter](lib/traject/yaml_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/YamlWriter))
+* [Traject::DelimitedWriter](lib/traject/delimited_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/DelimitedWriter))
+* [Traject::CSVWriter](lib/traject/csv_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/CSVWriter))
 You set which writer is being used in settings (`provide "writer_class_name", "Traject::DebugWriter"`),
-or on the command-line as a shortcut with `-w Traject::DebugWriter`.
+or with the shortcut command line argument  `-w Traject::DebugWriter`.
+The [SolrJWriter](https://github.com/traject-project/traject-solrj_writer) is packaged separately,
+and will be useful if you need to index to Solr's older than version 3.2. It requires Jruby.
+You can easily write your own Readers and Writers if you'd like, see comments at top
-You can write your own Readers and Writers if you'd like, see comments at top
 of [Traject::Indexer](lib/traject/indexer.rb).
 ## The traject command Line
@@ -331,13 +338,13 @@ The simplest invocation is:
     traject -c conf_file.rb marc_file.mrc
-Traject assumes marc files are in ISO 2709 binary format; it is not
-currently able to guess marc format type from filenames. If you are reading
+Traject assumes marc files are in ISO 2709 MARC 'binary' format; it is not
+currently able to guess other marc format types like XML from filenames or content. If you are reading
 marc files in another format, you need to tell traject either with the `marc_source.type` or the command-line shortcut:
     traject -c conf.rb -t xml marc_file.xml
-You can supply more than one conf file with repeated `-c` arguments.
+You can supply more than one conf file to traject with repeated `-c` arguments.
     traject -c connection_conf.rb -c indexing_conf.rb marc_file.mrc
@@ -349,7 +356,7 @@ You can only supply one marc file at a time, but we can take advantage of stdin
 You can set any setting on the command line with `-s key=value`.
 This will over-ride any settings set with `provide` in conf files.
-    traject -c conf_file.rb marc_file -s solr.url=http://somehere/solr -s solr.url=http://example.com/solr -s solrj_writer.commit_on_close=true
+    traject -c conf_file.rb marc_file -s solr.url=http://somehere/solr -s solrj_writer.commit_on_close=true
 There are some built-in command-line option shortcuts for useful
 settings:
@@ -363,8 +370,8 @@ debugging or sanity checking.
 Use `-u` as a shortcut for `s solr.url=X`
     traject -c conf_file.rb -u http://example.com/solr marc_file.mrc
-Run `traject -h` to see the command line help screen listing all available options.
+Run `traject -h` to see the command line help screen listing all available options.
 Also see `-I load_path` option and suggestions for Bundler use under Extending With Your Own Code.
@@ -399,7 +406,7 @@ Own Code](./doc/extending.md)
     "./translation_maps" subdir on the load path will be found
     for Traject translation maps.
 * Use [Bundler](http://bundler.io/) with traject simply by creating a Gemfile with `bundler init`,
-  and then running command line with `bundle exec traject` or
+  and then running command line with `bundle exec traject` or
   even `BUNDLE_GEMFILE=path/to/Gemfile bundle exec traject`
 ## More
@@ -410,7 +417,9 @@ Own Code](./doc/extending.md)
   * [traject_alephsequential_reader](https://github.com/traject-project/traject_alephsequential_reader/): read MARC files serialized in the AlephSequential format, as output by Ex Libris's Alpeh ILS.
   * [traject_horizon](https://github.com/jrochkind/traject_horizon): Export MARC records directly from a Horizon ILS rdbms, as serialized MARC or to  index into Solr.
   * [traject_umich_format](https://github.com/billdueber/traject_umich_format/): opinionated code and associated macros to extract format (book, audio file, etc.) and types (bibliography, conference report, etc.) from a MARC record. Code mirrors that used by the University of Michigan, and is an alternate approach to that taken by the `marc_formats` macro in `Traject::Macros::MarcFormatClassifier`.
+  * [traject-solrj_writer](https://github.com/traject-project/traject-solrj_writer): a jruby-only writer that uses the solrj .jar to talk directly to solr. Your only option for speaking to a solr version < 3.2, which is when the json handler was added to solr.
+  * [traject_marc4j_reader](https://github.com/billdueber/traject_marc4j_reader): Packaged with traject automatically on jruby. A JRuby-only reader for
+  reading marc records using the Marc4J library, fastest MARC reading on JRuby.
 # Development
@@ -430,12 +439,15 @@ Pull requests should come with tests, as well as docs where applicable. Docs can
 and/or extra files in ./docs -- as appropriate for what needs to be docs.
 **Inline api docs** Note that our [`.yardopts` file](./.yardopts) used by rdoc.info to generate
-online api docs has a `--markup markdown` specified -- inline class/method docs are in markdown, not rdoc.
+online api docs has a `--markup markdown` specified -- inline class/method docs are in markdown, not rdoc.
 Bundler rake tasks included for gem releases: `rake release`
 ## TODO
+* Readers and index rules helpers for reading XML files as input? Maybe.
+* Writers for writing to stores other than Solr? ElasticSearch? Maybe.
 * Unicode normalization. Has to normalize to NFKC on way out to index. Except for serialized marc field and other exceptions? Except maybe don't have to, rely on solr analyzer to do it?

data/doc/batch_execution.md CHANGED

@@ -8,15 +8,11 @@ with suggested solutions, and additional hints.
 ## Ruby version setting
-traject ordinarily needs to run under jruby. You will
+For best performance, traject should run under jruby. You will
 ordinarily have jruby installed under a ruby version switcher -- we
-highly recommend [chruby](https://github.com/postmodern/chruby) over other choices,
+recommend [chruby](https://github.com/postmodern/chruby) over other choices,
 but other popular choices include rvm and rbenv.
-Remember that traject needs to run in 1.9.x mode in jruby--
-with jruby 1.7.x or later, this should be default, recommend
-you use jruby 1.7.x.
 Especially when running under a cron job, it can be difficult to
 set things up so traject runs under jruby -- and then when you add
 bundler into it, things can get positively byzantine. It's not you,

data/doc/other_commands.md CHANGED

@@ -38,12 +38,10 @@ If set to true, then oversized MARC records can still be serialized,
 with length bytes zero'd out -- technically illegal, but can
 be read by MARC::Reader in permissive mode.
-As the standard Marc4JReader always convert to UTF8,
-output will always be in UTF8. For standard readeres, you
-do need to set the `marc_source.type` setting to XML for xml input
-using the standard MARC readers.
+If you have MARC-XML *input*, you need to
+set the `marc_source.type` setting to XML for xml input.
 ~~~bash
 traject -x marcout somefile.marc -o output.xml -s marcout.type=xml
 traject -x marcout -s marc_source.type=xml somefile.xml -c configuration.rb
-~~~
+~~~

data/doc/settings.md CHANGED

@@ -5,7 +5,7 @@ Hash, not nested. Keys are always strings, and dots (".") can be
 used for grouping and namespacing.
 Values are usually strings, but occasionally something else. String values can be easily
-set via the command line.
+set via the command line.
 Settings can be set in configuration files, usually like:
@@ -16,24 +16,24 @@ end
 ~~~~
 or on the command line: `-s key=value`.  There are also some command line shortcuts
-for commonly used settings, see `traject -h`.
+for commonly used settings, see `traject -h`.
-`provide` will only set the key if it was previously unset, so first time to set 'wins'. And command-line
-settings are applied first of all. It's recommended you use `provide`.
+`provide` will only set the key if it was previously unset, so first time to set 'wins'. And command-line
+settings are applied first of all. It's recommended you use `provide`.
-`store` is also available, and forces setting of the new value overriding any previous value set.
+`store` is also available, and forces setting of the new value overriding any previous value set.
 ## Known settings
 * `debug_ascii_progress`: true/'true' to print ascii characters to STDERR indicating progress. Note,
-                          yes, this is fixed to STDERR, regardless of your logging setup.
+                          yes, this is fixed to STDERR, regardless of your logging setup.
                           * `.` for every batch of records read and parsed
                           * `^` for every batch of records batched and queued for adding to solr
                                 (possibly in thread pool)
                           * `%` for completing of a Solr 'add'
                           * `!` when threadpool for solr add has a full queue, so solr add is
                                 going to happen in calling queue -- means solr adding can't
-                                keep up with production.
+                                keep up with production.
 * `json_writer.pretty_print`: used by the JsonWriter, if set to true, will output pretty printed json (with added whitespace) for easier human readability. Default false.
@@ -50,18 +50,10 @@ settings are applied first of all. It's recommended you use `provide`.
 * `log.batch_size`: If set to a number N (or string representation), will output a progress line to
    log. (by default as INFO, but see log.batch_size.severity)
-* `log.batch_size.severity`: If `log.batch_size` is set, what logger severity level to log to. Default "INFO", set to "DEBUG" etc if desired.
+* `log.batch_size.severity`: If `log.batch_size` is set, what logger severity level to log to. Default "INFO", set to "DEBUG" etc if desired.
 * `marc_source.type`: default 'binary'. Can also set to 'xml' or (not yet implemented todo) 'json'. Command line shortcut `-t`
-* `marc4j.jar_dir`:   Path to a directory containing Marc4J jar file to use. All .jar's in dir will
-                      be loaded. If unset, uses marc4j.jar bundled with traject.
-* `marc4j_reader.permissive`: Used by Marc4JReader only when marc.source_type is 'binary', boolean, argument to the underlying MarcPermissiveStreamReader. Default true.
-* `marc4j_reader.source_encoding`: Used by Marc4JReader only when marc.source_type is 'binary', encoding strings accepted
-  by marc4j MarcPermissiveStreamReader. Default "BESTGUESS", also "UTF-8", "MARC"
 * `marcout.allow_oversized`: Used with `-x marcout` command to output marc when outputting
      as ISO 2709 binary, set to true or string "true", and the MARC::Writer will have
      allow_oversized=true set, allowing oversized records to be serialized with length
@@ -69,44 +61,41 @@ settings are applied first of all. It's recommended you use `provide`.
 * `output_file`: Output file to write to for operations that write to files: For instance the `marcout` command,
                  or Writer classes that write to files, like Traject::JsonWriter. Has an shortcut
-                 `-o` on command line.
+                 `-o` on command line.
-* `processing_thread_pool` Default 3. Main thread pool used for processing records with input rules. Choose a
-   pool size based on size of your machine, and complexity of your indexing rules.
-   Probably no reason for it ever to be more than number of cores on indexing machine.
-   But this is the first thread_pool to try increasing for better performance on a multi-core machine.
-   A pool here can sometimes result in multi-threaded commiting to Solr too with the
-   SolrJWriter, as processing worker threads will do their own commits to solr if the
-   solrj_writer.thread_pool is full. Having a multi-threaded pool here can help even out throughput
-   through Solr's pauses for committing too.
+* `processing_thread_pool` Number of threads in the main thread pool used for processing
+   records with input rules. On JRuby or Rubinius, defaults to 1 less than the number of processors detected on your machine. On other ruby platforms, defaults to 1. Set to 0 or nil
+   to disable thread pool, and do all processing in main thread.
-* `reader_class_name`: a Traject Reader class, used by the indexer as a source of records. Default Traject::Marc4jReader. If you don't need to read marc binary with Marc8 encoding, the pure ruby MarcReader may give you better performance.  Command-line shortcut `-r`
+   Choose a pool size based on size of your machine, and complexity of your indexing rules, you
+   might want to try different sizes and measure which works best for you.
+   Probably no reason for it ever to be more than number of cores on indexing machine.
-* `solr.url`: URL to connect to a solr instance for indexing, eg http://example.org:8983/solr . Command-line short-cut `-u`.
-* `solrj.jar_dir`: SolrJWriter needs to load Java .jar files with SolrJ. It will load from a packaged SolrJ, but you can load your own SolrJ (different version etc) by specifying a directory. All *.jar in directory will be loaded.
+* `reader_class_name`: a Traject Reader class, used by the indexer as a source
+    of records.   Defaults to Traject::Marc4JReader (using the Java Marc4J
+    library) on JRuby; Traject::MarcReader (using the ruby marc gem) otherwise.
+    Command-line shortcut `-r`
+* `solr.url`: URL to connect to a solr instance for indexing, eg http://example.org:8983/solr . Command-line short-cut `-u`.
 * `solr.version`: Set to eg "1.4.0", "4.3.0"; currently un-used, but in the future will control
   change some default settings, and/or sanity check and warn you if you're doing something
   that might not work with that version of solr. Set now for help in the future.
-* `solrj_writer.batch_size`: size of batches that SolrJWriter will send docs to Solr in. Default 200. Set to nil,
-  0, or 1, and SolrJWriter will do one http transaction per document, no batching.
-* `solrj_writer.commit_on_close`: default false, set to true to have SolrJWriter send an explicit commit message to Solr after indexing.
+* `solr_writer.batch_size`: size of batches that SolrJsonWriter will send docs to Solr in. Default 100. Set to nil,
+  0, or 1, and SolrJsonWriter will do one http transaction per document, no batching.
-* `solrj_writer.parser_class_name`: Set to "XMLResponseParser" or "BinaryResponseParser". Will be instantiated and passed to the solrj.SolrServer with setResponseParser. Default nil, use SolrServer default. To talk to a solr 1.x, you will want to set to "XMLResponseParser"
+* `solr_writer.commit_on_close`: default false, set to true to have the solr writer send an explicit commit message to Solr after indexing.
-* `solrj_writer.server_class_name`: String name of a solrj.SolrServer subclass to be used by SolrJWriter. Default "HttpSolrServer"
-* `solrj_writer.thread_pool`:       Defaults to 1 (single bg thread). A thread pool is used for submitting docs
+* `solr_writer.thread_pool`:       Defaults to 1 (single bg thread). A thread pool is used for submitting docs
                                     to solr. Set to 0 or nil to disable threading. Set to 1,
                                     there will still be a single bg thread doing the adds.
                                     May make sense to set higher than number of cores on your
                                     indexing machine, as these threads will mostly be waiting
                                     on Solr. Speed/capacity of your solr might be more relevant.
                                     Note that processing_thread_pool threads can end up submitting
-                                    to solr too, if solrj_writer.thread_pool is full.
+                                    to solr too, if solr_json_writer.thread_pool is full.
-* `writer_class_name`: a Traject Writer class, used by indexer to send processed dictionaries off. Default Traject::SolrJWriter, also available Traject::JsonWriter. See Traject::Indexer for more info. Command line shortcut `-w`
+* `writer_class_name`: a Traject Writer class, used by indexer to send processed dictionaries off. Default Traject::SolrJsonWriter, other writers for debugging or writing to files are also available. See Traject::Indexer for more info. Command line shortcut `-w`
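
The settings.md text above says that `provide` only sets a key that was previously unset (so the first setter wins, and command-line settings are applied first), while `store` always overrides. A minimal sketch of those semantics follows. `SketchSettings` is a hypothetical stand-in written for illustration, not the `Traject::Indexer::Settings` class itself:

```ruby
# SketchSettings is a hypothetical stand-in, not Traject::Indexer::Settings.
class SketchSettings
  def initialize
    @values = {}
  end

  # Set the key only if it has not been set yet: the first value wins.
  def provide(key, value)
    @values[key] = value unless @values.key?(key)
  end

  # Always set the key, replacing any earlier value.
  def store(key, value)
    @values[key] = value
  end

  def [](key)
    @values[key]
  end
end

settings = SketchSettings.new

# Command-line settings are applied first...
settings.provide("solr.url", "http://example.com/solr")
# ...so a later provide in a config file does not override them.
settings.provide("solr.url", "http://example.org/solr")
settings.provide("solr_writer.commit_on_close", "true")
puts settings["solr.url"]

# store forces the new value regardless of what was set before.
settings.store("solr.url", "http://localhost:8983/solr")
puts settings["solr.url"]
```

This ordering is why a `-s key=value` argument can override a `provide` in a conf file: the command-line value is already in place when the conf file is evaluated.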