traject-solrj_writer 1.0.0-java
Sign up to get free protection for your applications and to get access to all the features.
- checksums.yaml +7 -0
- data/.gitignore +14 -0
- data/.travis.yml +3 -0
- data/Gemfile +4 -0
- data/LICENSE.txt +22 -0
- data/README.md +116 -0
- data/Rakefile +11 -0
- data/lib/traject/solrj_writer.rb +427 -0
- data/lib/traject/solrj_writer/version.rb +5 -0
- data/spec/minitest_helper.rb +85 -0
- data/spec/solrj_writer_spec.rb +223 -0
- data/spec/test_support/manufacturing_consent.marc +1 -0
- data/traject-solrj_writer.gemspec +27 -0
- data/vendor/solrj/README +8 -0
- data/vendor/solrj/build.xml +39 -0
- data/vendor/solrj/ivy.xml +16 -0
- data/vendor/solrj/lib/commons-io-2.3.jar +0 -0
- data/vendor/solrj/lib/httpclient-4.3.1.jar +0 -0
- data/vendor/solrj/lib/httpcore-4.3.jar +0 -0
- data/vendor/solrj/lib/httpmime-4.3.1.jar +0 -0
- data/vendor/solrj/lib/jcl-over-slf4j-1.6.6.jar +0 -0
- data/vendor/solrj/lib/log4j-1.2.16.jar +0 -0
- data/vendor/solrj/lib/noggit-0.5.jar +0 -0
- data/vendor/solrj/lib/slf4j-api-1.7.6.jar +0 -0
- data/vendor/solrj/lib/slf4j-log4j12-1.6.6.jar +0 -0
- data/vendor/solrj/lib/solr-solrj-4.3.1-javadoc.jar +0 -0
- data/vendor/solrj/lib/solr-solrj-4.3.1-sources.jar +0 -0
- data/vendor/solrj/lib/solr-solrj-4.3.1.jar +0 -0
- data/vendor/solrj/lib/wstx-asl-3.2.7.jar +0 -0
- data/vendor/solrj/lib/zookeeper-3.4.6.jar +0 -0
- metadata +133 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
|
|
1
|
+
---
|
2
|
+
SHA1:
|
3
|
+
metadata.gz: f7d301bb6262198a78ec629ad45985652b152cb0
|
4
|
+
data.tar.gz: 99196e4d061051b040f99ea553fd5a75328f9990
|
5
|
+
SHA512:
|
6
|
+
metadata.gz: e4441cbe2d7a76dc4274c760d8d832c2ea7027063e4e1b8e10db310ae0839ef9c4a5c556aa4366c431a0d4441b1cbd9f3c1a97b46f9aaaea0faad3c44824123b
|
7
|
+
data.tar.gz: 812e57b618b13f27f5f2024df39db2777c58503aa1fb9ec021066a176ebb851686c809844cefd7a22ea86c922dc2c8aacbe0acd2a7439468fb790212f223ad4e
|
data/.gitignore
ADDED
data/.travis.yml
ADDED
data/Gemfile
ADDED
data/LICENSE.txt
ADDED
@@ -0,0 +1,22 @@
|
|
1
|
+
Copyright (c) 2015 Bill Dueber
|
2
|
+
|
3
|
+
MIT License
|
4
|
+
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining
|
6
|
+
a copy of this software and associated documentation files (the
|
7
|
+
"Software"), to deal in the Software without restriction, including
|
8
|
+
without limitation the rights to use, copy, modify, merge, publish,
|
9
|
+
distribute, sublicense, and/or sell copies of the Software, and to
|
10
|
+
permit persons to whom the Software is furnished to do so, subject to
|
11
|
+
the following conditions:
|
12
|
+
|
13
|
+
The above copyright notice and this permission notice shall be
|
14
|
+
included in all copies or substantial portions of the Software.
|
15
|
+
|
16
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
|
17
|
+
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
|
18
|
+
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
|
19
|
+
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
|
20
|
+
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
|
21
|
+
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
|
22
|
+
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
data/README.md
ADDED
@@ -0,0 +1,116 @@
|
|
1
|
+
# Traject::SolrJWriter
|
2
|
+
|
3
|
+
Use [Traject](http://github.com/traject-project/traject) to write to
|
4
|
+
a Solr index using the `solrj` java library.
|
5
|
+
|
6
|
+
**This gem requires JRuby and Traject >= 2.0**
|
7
|
+
|
8
|
+
**This gem is not yet released**
|
9
|
+
|
10
|
+
## Notes on using this gem
|
11
|
+
* Our benchmarking indicates that `Traject::SolrJsonWriter` (included with Traject) outperforms
|
12
|
+
this library by a notable swath. Use that if you can.
|
13
|
+
* If you're running a version of Solr < 3.2, you can't use `SolrJsonWriter` at all; this
|
14
|
+
becomes your best bet.
|
15
|
+
* Given its reliance on loading `.jar` files, `Traject::SolrJWriter` obviously require JRuby.
|
16
|
+
|
17
|
+
## Usage
|
18
|
+
|
19
|
+
You'll need to make sure this gem is available (e.g., by putting it in your gemfile)
|
20
|
+
and then have code like this:
|
21
|
+
|
22
|
+
```ruby
|
23
|
+
# Sample traject configuration for using solrj
|
24
|
+
require 'traject'
|
25
|
+
require 'traject/solrj_writer'
|
26
|
+
|
27
|
+
|
28
|
+
settings do
|
29
|
+
# Arguments for any solr writer
|
30
|
+
provide "solr.url", ENV["SOLR_URL"] | 'http://localhost:8983/solr/core1'
|
31
|
+
provide "solr_writer.commit_on_close", "true"
|
32
|
+
provide "solr_writer.thread_pool", 2
|
33
|
+
provide "solr_writer.batch_size", 50
|
34
|
+
|
35
|
+
# SolrJ Specific stuff
|
36
|
+
provide "solrj_writer.parser_class_name", "XMLResponseParser"
|
37
|
+
provide "writer_class_name", "Traject::SolrJWriter"
|
38
|
+
|
39
|
+
store 'processing_thread_pool', 5
|
40
|
+
store "log.batch_size", 25_000
|
41
|
+
|
42
|
+
```
|
43
|
+
|
44
|
+
...and then use Traject as normal.
|
45
|
+
|
46
|
+
|
47
|
+
## Full list of settings
|
48
|
+
|
49
|
+
### Generic Solr settings (used for both SolrJWriter and SolrJsonWriter)
|
50
|
+
|
51
|
+
* `solr.url`: Your solr url (required)
|
52
|
+
* `solr_writer.commit_on_close`: If true (or string 'true'), send a commit to solr
|
53
|
+
at end of #process.
|
54
|
+
|
55
|
+
* `solr_writer.batch_size`: If non-nil and more than 1, send documents to
|
56
|
+
solr in batches of solrj_writer.batch_size. If nil/1,
|
57
|
+
however, an http transaction with solr will be done
|
58
|
+
per doc. DEFAULT to 100, which seems to be a sweet spot.
|
59
|
+
|
60
|
+
* `solr_writer.thread_pool`: Defaults to 1. A thread pool is used for submitting docs
|
61
|
+
to solr. Set to 0 or nil to disable threading. Set to 1,
|
62
|
+
there will still be a single bg thread doing the adds. For
|
63
|
+
very fast Solr servers and very fast indexing processes, may
|
64
|
+
make sense to increase this value to throw at Solr as fast as it
|
65
|
+
can catch.
|
66
|
+
|
67
|
+
### SolrJ-specific settings
|
68
|
+
|
69
|
+
* `solrj_writer.server_class_name`: Defaults to "HttpSolrServer". You can specify
|
70
|
+
another Solr Server sub-class, but it has
|
71
|
+
to take a one-arg url constructor. Maybe
|
72
|
+
subclass this writer class and overwrite
|
73
|
+
instantiate_solr_server! otherwise
|
74
|
+
|
75
|
+
* `solrj.jar_dir`: Custom directory containing all of the SolrJ jars. All
|
76
|
+
jars in this dir will be loaded. Otherwise,
|
77
|
+
we load our own packaged solrj jars. This setting
|
78
|
+
can't really be used differently in the same app instance,
|
79
|
+
since jars are loaded globally.
|
80
|
+
|
81
|
+
* `solrj_writer.parser_class_name`: A String name of a class in package
|
82
|
+
org.apache.solr.client.solrj.impl,
|
83
|
+
we'll instantiate one with a zero-arg
|
84
|
+
constructor, and pass it as an arg to setParser on
|
85
|
+
the SolrServer instance, if present.
|
86
|
+
NOTE: For contacting a Solr 1.x server, with the
|
87
|
+
recent version of SolrJ used by default, set to
|
88
|
+
"XMLResponseParser"
|
89
|
+
|
90
|
+
|
91
|
+
|
92
|
+
|
93
|
+
## Installation
|
94
|
+
|
95
|
+
Add this line to your application's Gemfile:
|
96
|
+
|
97
|
+
```ruby
|
98
|
+
gem 'traject-solrj_writer'
|
99
|
+
```
|
100
|
+
|
101
|
+
And then execute:
|
102
|
+
|
103
|
+
$ bundle
|
104
|
+
|
105
|
+
Or install it yourself as:
|
106
|
+
|
107
|
+
$ gem install traject-solrj_writer
|
108
|
+
|
109
|
+
|
110
|
+
## Contributing
|
111
|
+
|
112
|
+
1. Fork it ( https://github.com/traject-project/traject-solrj_writer/fork )
|
113
|
+
2. Create your feature branch (`git checkout -b my-new-feature`)
|
114
|
+
3. Commit your changes (`git commit -am 'Add some feature'`)
|
115
|
+
4. Push to the branch (`git push origin my-new-feature`)
|
116
|
+
5. Create a new Pull Request
|
data/Rakefile
ADDED
@@ -0,0 +1,427 @@
|
|
1
|
+
require "traject/solrj_writer/version"
|
2
|
+
require 'yell'
|
3
|
+
|
4
|
+
require 'traject'
|
5
|
+
require 'traject/util'
|
6
|
+
require 'traject/qualified_const_get'
|
7
|
+
require 'traject/thread_pool'
|
8
|
+
|
9
|
+
require 'uri'
|
10
|
+
require 'thread' # for Mutex
|
11
|
+
|
12
|
+
#
|
13
|
+
# Writes to a Solr using SolrJ, and the SolrJ HttpSolrServer.
|
14
|
+
#
|
15
|
+
# After you call #close, you can check #skipped_record_count if you want
|
16
|
+
# for an integer count of skipped records.
|
17
|
+
#
|
18
|
+
# For fatal errors that raise... async processing with thread_pool means that
|
19
|
+
# you may not get a raise immediately after calling #put, you may get it on
|
20
|
+
# a FUTURE #put or #close. You should get it eventually though.
|
21
|
+
#
|
22
|
+
# ## Settings
|
23
|
+
#
|
24
|
+
# * solr.url: Your solr url (required)
|
25
|
+
#
|
26
|
+
# * solrj_writer.server_class_name: Defaults to "HttpSolrServer". You can specify
|
27
|
+
# another Solr Server sub-class, but it has
|
28
|
+
# to take a one-arg url constructor. Maybe
|
29
|
+
# subclass this writer class and overwrite
|
30
|
+
# instantiate_solr_server! otherwise
|
31
|
+
#
|
32
|
+
# * solrj.jar_dir: Custom directory containing all of the SolrJ jars. All
|
33
|
+
# jars in this dir will be loaded. Otherwise,
|
34
|
+
# we load our own packaged solrj jars. This setting
|
35
|
+
# can't really be used differently in the same app instance,
|
36
|
+
# since jars are loaded globally.
|
37
|
+
#
|
38
|
+
# * solrj_writer.parser_class_name: A String name of a class in package
|
39
|
+
# org.apache.solr.client.solrj.impl,
|
40
|
+
# we'll instantiate one with a zero-arg
|
41
|
+
# constructor, and pass it as an arg to setParser on
|
42
|
+
# the SolrServer instance, if present.
|
43
|
+
# NOTE: For contacting a Solr 1.x server, with the
|
44
|
+
# recent version of SolrJ used by default, set to
|
45
|
+
# "XMLResponseParser"
|
46
|
+
#
|
47
|
+
# * solr_writer.commit_on_close: If true (or string 'true'), send a commit to solr
|
48
|
+
# at end of #process.
|
49
|
+
#
|
50
|
+
# * solr_writer.batch_size: If non-nil and more than 1, send documents to
|
51
|
+
# solr in batches of solrj_writer.batch_size. If nil/1,
|
52
|
+
# however, an http transaction with solr will be done
|
53
|
+
# per doc. DEFAULT to 100, which seems to be a sweet spot.
|
54
|
+
#
|
55
|
+
# * solr_writer.thread_pool: Defaults to 1. A thread pool is used for submitting docs
|
56
|
+
# to solr. Set to 0 or nil to disable threading. Set to 1,
|
57
|
+
# there will still be a single bg thread doing the adds. For
|
58
|
+
# very fast Solr servers and very fast indexing processes, may
|
59
|
+
# make sense to increase this value to throw at Solr as fast as it
|
60
|
+
# can catch.
|
61
|
+
#
|
62
|
+
# ## Example
|
63
|
+
#
|
64
|
+
# settings do
|
65
|
+
# provide "writer_class_name", "Traject::SolrJWriter"
|
66
|
+
#
|
67
|
+
# # This is just regular ruby, so don't be afraid to have conditionals!
|
68
|
+
# # Switch on hostname, for test and production server differences
|
69
|
+
# if Socket.gethostname =~ /devhost/
|
70
|
+
# provide "solr.url", "http://my.dev.machine:9033/catalog"
|
71
|
+
# else
|
72
|
+
# provide "solr.url", "http://my.production.machine:9033/catalog"
|
73
|
+
# end
|
74
|
+
#
|
75
|
+
# provide "solrj_writer.parser_class_name", "BinaryResponseParser" # for Solr 4.x
|
76
|
+
# # provide "solrj_writer.parser_class_name", "XMLResponseParser" # For solr 1.x or 3.x
|
77
|
+
#
|
78
|
+
# provide "solrj_writer.commit_on_close", "true"
|
79
|
+
# end
|
80
|
+
class Traject::SolrJWriter
|
81
|
+
|
82
|
+
|
83
|
+
# just a tuple of a SolrInputDocument
|
84
|
+
# and a Traject::Indexer::Context it came from
|
85
|
+
class UpdatePackage
|
86
|
+
attr_accessor :solr_document, :context
|
87
|
+
def initialize(doc, ctx)
|
88
|
+
self.solr_document = doc
|
89
|
+
self.context = ctx
|
90
|
+
end
|
91
|
+
end
|
92
|
+
|
93
|
+
# Class method to load up the jars from vendor if we need to
|
94
|
+
# Requires solrj jar(s) from settings['solrj.jar_dir'] if given, otherwise
|
95
|
+
# uses jars bundled with traject gem in ./vendor
|
96
|
+
#
|
97
|
+
# Have to pass in a settings arg, so we can check it for specified jar dir.
|
98
|
+
#
|
99
|
+
# Tries not to do the dirglob and require if solrj has already been loaded.
|
100
|
+
# Will define global constants with classes HttpSolrServer and SolrInputDocument
|
101
|
+
# if not already defined.
|
102
|
+
#
|
103
|
+
# This is all a bit janky, maybe there's a better way to do this? We do want
|
104
|
+
# a 'require' method defined somewhere utility, so multiple classes can
|
105
|
+
# use it, including extra gems. This method may be used by extra gems, so should
|
106
|
+
# be considered part of the API -- after it's called, those top-level
|
107
|
+
# globals should be available, and solrj should be loaded.
|
108
|
+
def self.require_solrj_jars(settings)
|
109
|
+
jruby_ensure_init!
|
110
|
+
|
111
|
+
tries = 0
|
112
|
+
begin
|
113
|
+
tries += 1
|
114
|
+
|
115
|
+
org.apache.solr
|
116
|
+
org.apache.solr.client.solrj
|
117
|
+
|
118
|
+
# java_import which we'd normally use weirdly doesn't work
|
119
|
+
# from a class method. https://github.com/jruby/jruby/issues/975
|
120
|
+
Object.const_set("HttpSolrServer", org.apache.solr.client.solrj.impl.HttpSolrServer) unless defined? ::HttpSolrServer
|
121
|
+
Object.const_set("SolrInputDocument", org.apache.solr.common.SolrInputDocument) unless defined? ::SolrInputDocument
|
122
|
+
rescue NameError => e
|
123
|
+
included_jar_dir = File.expand_path("../../vendor/solrj/lib", File.dirname(__FILE__))
|
124
|
+
|
125
|
+
jardir = settings["solrj.jar_dir"] || included_jar_dir
|
126
|
+
Dir.glob("#{jardir}/*.jar") do |x|
|
127
|
+
require x
|
128
|
+
end
|
129
|
+
if tries > 1
|
130
|
+
raise LoadError.new("Can not find SolrJ java classes")
|
131
|
+
else
|
132
|
+
retry
|
133
|
+
end
|
134
|
+
end
|
135
|
+
end
|
136
|
+
|
137
|
+
# just does a `require 'java'` but rescues the exception if we
|
138
|
+
# aren't jruby, and raises a better error message.
|
139
|
+
# Pass in a developer-presentable name of a feature to include in the error
|
140
|
+
# message if you want.
|
141
|
+
def self.jruby_ensure_init!(feature = nil)
|
142
|
+
begin
|
143
|
+
require 'java'
|
144
|
+
rescue LoadError => e
|
145
|
+
feature ||= "A traject feature is in use that"
|
146
|
+
msg = if feature
|
147
|
+
"#{feature} requires jruby, but you do not appear to be running under jruby. We recommend `chruby` for managing multiple ruby installs."
|
148
|
+
end
|
149
|
+
raise LoadError.new(msg)
|
150
|
+
end
|
151
|
+
end
|
152
|
+
|
153
|
+
|
154
|
+
|
155
|
+
include Traject::QualifiedConstGet
|
156
|
+
|
157
|
+
attr_reader :settings
|
158
|
+
|
159
|
+
attr_reader :batched_queue
|
160
|
+
|
161
|
+
def initialize(argSettings)
|
162
|
+
@settings = Traject::Indexer::Settings.new(argSettings)
|
163
|
+
|
164
|
+
# Let's go ahead an alias out the old solrj_writer settings to the
|
165
|
+
# newer solr_writer settings, so the old config files still work
|
166
|
+
|
167
|
+
%w[commit_on_close batch_size thread_pool].each do |s|
|
168
|
+
swkey = "solr_writer.#{s}"
|
169
|
+
sjwkey = "solrj_writer.#{s}"
|
170
|
+
@settings[swkey] = @settings[sjwkey] unless @settings[sjwkey].nil?
|
171
|
+
|
172
|
+
end
|
173
|
+
|
174
|
+
settings_check!(settings)
|
175
|
+
|
176
|
+
ensure_solrj_loaded!
|
177
|
+
|
178
|
+
solr_server # init
|
179
|
+
|
180
|
+
@batched_queue = java.util.concurrent.LinkedBlockingQueue.new
|
181
|
+
|
182
|
+
# when multi-threaded exceptions raised in threads are held here
|
183
|
+
# we need a HIGH performance queue here to try and avoid slowing things down,
|
184
|
+
# since we need to check it frequently.
|
185
|
+
@async_exception_queue = java.util.concurrent.ConcurrentLinkedQueue.new
|
186
|
+
|
187
|
+
# Store error count in an AtomicInteger, so multi threads can increment
|
188
|
+
# it safely, if we're threaded.
|
189
|
+
@skipped_record_incrementer = java.util.concurrent.atomic.AtomicInteger.new(0)
|
190
|
+
|
191
|
+
# if our thread pool settings are 0, it'll just create a null threadpool that
|
192
|
+
# executes in calling context.
|
193
|
+
@thread_pool = Traject::ThreadPool.new( @settings['solr_writer.thread_pool'].to_i )
|
194
|
+
|
195
|
+
@debug_ascii_progress = (@settings["debug_ascii_progress"].to_s == "true")
|
196
|
+
|
197
|
+
logger.info(" #{self.class.name} writing to '#{settings['solr.url']}'")
|
198
|
+
end
|
199
|
+
|
200
|
+
# Loads solrj if not already loaded. By loading all jars found
|
201
|
+
# in settings["solrj.jar_dir"]
|
202
|
+
def ensure_solrj_loaded!
|
203
|
+
unless defined?(HttpSolrServer) && defined?(SolrInputDocument)
|
204
|
+
self.class.require_solrj_jars(settings)
|
205
|
+
end
|
206
|
+
|
207
|
+
# And for now, SILENCE SolrJ logging
|
208
|
+
org.apache.log4j.Logger.getRootLogger().addAppender(org.apache.log4j.varia.NullAppender.new)
|
209
|
+
end
|
210
|
+
|
211
|
+
# Method IS thread-safe, can be called concurrently by multi-threads.
|
212
|
+
#
|
213
|
+
# Why? If not using batched add, we just use the SolrServer, which is already
|
214
|
+
# thread safe itself.
|
215
|
+
#
|
216
|
+
# If we are using batch add, we surround all access to our shared state batch queue
|
217
|
+
# in a mutex -- just a naive implementation. May be able to improve performance
|
218
|
+
# with more sophisticated java.util.concurrent data structure (blocking queue etc)
|
219
|
+
# I did try a java ArrayBlockingQueue or LinkedBlockingQueue instead of our own
|
220
|
+
# mutex -- I did not see consistently different performance. May want to
|
221
|
+
# change so doesn't use a mutex at all if multiple mapping threads aren't being
|
222
|
+
# used.
|
223
|
+
#
|
224
|
+
# this class does not at present use any threads itself, all work will be done
|
225
|
+
# in the calling thread, including actual http transactions to solr via solrj SolrServer
|
226
|
+
# if using batches, then not every #put is a http transaction, but when it is,
|
227
|
+
# it's in the calling thread, synchronously.
|
228
|
+
def put(context)
|
229
|
+
@thread_pool.raise_collected_exception!
|
230
|
+
|
231
|
+
# package the SolrInputDocument along with the context, so we have
|
232
|
+
# the context for error reporting when we actually add.
|
233
|
+
|
234
|
+
package = UpdatePackage.new(hash_to_solr_document(context.output_hash), context)
|
235
|
+
|
236
|
+
if settings["solr_writer.batch_size"].to_i > 1
|
237
|
+
ready_batch = []
|
238
|
+
|
239
|
+
batched_queue.add(package)
|
240
|
+
if batched_queue.size >= settings["solr_writer.batch_size"].to_i
|
241
|
+
batched_queue.drain_to(ready_batch)
|
242
|
+
end
|
243
|
+
|
244
|
+
if ready_batch.length > 0
|
245
|
+
if @debug_ascii_progress
|
246
|
+
$stderr.write("^")
|
247
|
+
if @thread_pool.queue && (@thread_pool.queue.size >= @thread_pool.queue_capacity)
|
248
|
+
$stderr.write "!"
|
249
|
+
end
|
250
|
+
end
|
251
|
+
|
252
|
+
@thread_pool.maybe_in_thread_pool { batch_add_document_packages(ready_batch) }
|
253
|
+
end
|
254
|
+
else # non-batched add, add one at a time.
|
255
|
+
@thread_pool.maybe_in_thread_pool { add_one_document_package(package) }
|
256
|
+
end
|
257
|
+
end
|
258
|
+
|
259
|
+
def hash_to_solr_document(hash)
|
260
|
+
doc = SolrInputDocument.new
|
261
|
+
hash.each_pair do |key, value_array|
|
262
|
+
value_array.each do |value|
|
263
|
+
doc.addField( key, value )
|
264
|
+
end
|
265
|
+
end
|
266
|
+
return doc
|
267
|
+
end
|
268
|
+
|
269
|
+
# Takes array and batch adds it to solr -- array of UpdatePackage tuples of
|
270
|
+
# SolrInputDocument and context.
|
271
|
+
#
|
272
|
+
# Catches error in batch add, logs, and re-tries docs individually
|
273
|
+
#
|
274
|
+
# Is thread-safe, because SolrServer is thread-safe, and we aren't
|
275
|
+
# referencing any other shared state. Important that CALLER passes
|
276
|
+
# in a doc array that is not shared state, extracting it from
|
277
|
+
# shared state batched_queue in a mutex.
|
278
|
+
def batch_add_document_packages(current_batch)
|
279
|
+
begin
|
280
|
+
a = current_batch.collect {|package| package.solr_document }
|
281
|
+
solr_server.add( a )
|
282
|
+
|
283
|
+
$stderr.write "%" if @debug_ascii_progress
|
284
|
+
rescue Exception => e
|
285
|
+
# Error in batch, none of the docs got added, let's try to re-add
|
286
|
+
# em all individually, so those that CAN get added get added, and those
|
287
|
+
# that can't get individually logged.
|
288
|
+
logger.warn "Error encountered in batch solr add, will re-try documents individually, at a performance penalty...\n" + Traject::Util.exception_to_log_message(e)
|
289
|
+
current_batch.each do |package|
|
290
|
+
add_one_document_package(package)
|
291
|
+
end
|
292
|
+
end
|
293
|
+
end
|
294
|
+
|
295
|
+
|
296
|
+
# Adds a single SolrInputDocument passed in as an UpdatePackage combo of SolrInputDocument
|
297
|
+
# and context.
|
298
|
+
#
|
299
|
+
# Rescues exceptions thrown by SolrServer.add, logs them, and then raises them
|
300
|
+
# again if deemed fatal and should stop indexing. Only intended to be used on a SINGLE
|
301
|
+
# document add. If we get an exception on a multi-doc batch add, we need to recover
|
302
|
+
# differently.
|
303
|
+
def add_one_document_package(package)
|
304
|
+
begin
|
305
|
+
solr_server.add(package.solr_document)
|
306
|
+
# Honestly not sure what the difference is between those types, but SolrJ raises both
|
307
|
+
rescue org.apache.solr.common.SolrException, org.apache.solr.client.solrj.SolrServerException => e
|
308
|
+
id = package.context.source_record && package.context.source_record['001'] && package.context.source_record['001'].value
|
309
|
+
id_str = id ? "001:#{id}" : ""
|
310
|
+
|
311
|
+
position = package.context.position
|
312
|
+
position_str = position ? "at file position #{position} (starting at 1)" : ""
|
313
|
+
|
314
|
+
logger.error("Could not index record #{id_str} #{position_str}\n" + Traject::Util.exception_to_log_message(e) )
|
315
|
+
logger.debug(package.context.source_record.to_s)
|
316
|
+
|
317
|
+
@skipped_record_incrementer.getAndIncrement() # AtomicInteger, thread-safe increment.
|
318
|
+
|
319
|
+
if fatal_exception? e
|
320
|
+
logger.fatal ("SolrJ exception judged fatal, raising...")
|
321
|
+
raise e
|
322
|
+
end
|
323
|
+
end
|
324
|
+
end
|
325
|
+
|
326
|
+
def logger
|
327
|
+
settings["logger"] ||= Yell.new(STDERR, :level => "gt.fatal") # null logger
|
328
|
+
end
|
329
|
+
|
330
|
+
# If an exception is encountered talking to Solr, is it one we should
|
331
|
+
# entirely give up on? SolrJ doesn't use a useful exception class hieararchy,
|
332
|
+
# we have to look into it's details and guess.
|
333
|
+
def fatal_exception?(e)
|
334
|
+
|
335
|
+
|
336
|
+
root_cause = e.respond_to?(:getRootCause) && e.getRootCause
|
337
|
+
|
338
|
+
# Various kinds of inability to actually talk to the
|
339
|
+
# server look like this:
|
340
|
+
if root_cause.kind_of? java.io.IOException
|
341
|
+
return true
|
342
|
+
end
|
343
|
+
|
344
|
+
# Consider Solr server returning HTTP 500 Internal Server Error to be fatal.
|
345
|
+
# This can mean, for instance, that disk space is exhausted on solr server.
|
346
|
+
if e.kind_of?(Java::OrgApacheSolrCommon::SolrException) && e.code == 500
|
347
|
+
return true
|
348
|
+
end
|
349
|
+
|
350
|
+
return false
|
351
|
+
end
|
352
|
+
|
353
|
+
def close
|
354
|
+
@thread_pool.raise_collected_exception!
|
355
|
+
|
356
|
+
# Any leftovers in batch buffer? Send em to the threadpool too.
|
357
|
+
if batched_queue.length > 0
|
358
|
+
packages = []
|
359
|
+
batched_queue.drain_to(packages)
|
360
|
+
|
361
|
+
# we do it in the thread pool for consistency, and so
|
362
|
+
# it goes to the end of the queue behind any outstanding
|
363
|
+
# work in the pool.
|
364
|
+
@thread_pool.maybe_in_thread_pool { batch_add_document_packages( packages ) }
|
365
|
+
end
|
366
|
+
|
367
|
+
# Wait for shutdown, and time it.
|
368
|
+
logger.debug "SolrJWriter: Shutting down thread pool, waiting if needed..."
|
369
|
+
elapsed = @thread_pool.shutdown_and_wait
|
370
|
+
if elapsed > 60
|
371
|
+
logger.warn "Waited #{elapsed} seconds for all SolrJWriter threads, you may want to increase solr_writer.thread_pool (currently #{@settings["solr_writer.thread_pool"]})"
|
372
|
+
end
|
373
|
+
logger.debug "SolrJWriter: Thread pool shutdown complete"
|
374
|
+
logger.warn "SolrJWriter: #{skipped_record_count} skipped records" if skipped_record_count > 0
|
375
|
+
|
376
|
+
# check again now that we've waited, there could still be some
|
377
|
+
# that didn't show up before.
|
378
|
+
@thread_pool.raise_collected_exception!
|
379
|
+
|
380
|
+
if settings["solrj_writer.commit_on_close"].to_s == "true"
|
381
|
+
logger.info "SolrJWriter: Sending commit to solr..."
|
382
|
+
solr_server.commit
|
383
|
+
end
|
384
|
+
|
385
|
+
solr_server.shutdown
|
386
|
+
@solr_server = nil
|
387
|
+
end
|
388
|
+
|
389
|
+
# Return count of encountered skipped records. Most accurate to call
|
390
|
+
# it after #close, in which case it should include full count, even
|
391
|
+
# under async thread_pool.
|
392
|
+
def skipped_record_count
|
393
|
+
@skipped_record_incrementer.get
|
394
|
+
end
|
395
|
+
|
396
|
+
|
397
|
+
def solr_server
|
398
|
+
@solr_server ||= instantiate_solr_server!
|
399
|
+
end
|
400
|
+
attr_writer :solr_server # mainly for testing
|
401
|
+
|
402
|
+
# Instantiates a solr server of class settings["solrj_writer.server_class_name"] or "HttpSolrServer"
|
403
|
+
# and initializes it with settings["solr.url"]
|
404
|
+
def instantiate_solr_server!
|
405
|
+
server_class = qualified_const_get( settings["solrj_writer.server_class_name"] || "HttpSolrServer" )
|
406
|
+
server = server_class.new( settings["solr.url"].to_s );
|
407
|
+
|
408
|
+
if parser_name = settings["solrj_writer.parser_class_name"]
|
409
|
+
#parser = org.apache.solr.client.solrj.impl.const_get(parser_name).new
|
410
|
+
parser = Java::JavaClass.for_name("org.apache.solr.client.solrj.impl.#{parser_name}").ruby_class.new
|
411
|
+
server.setParser( parser )
|
412
|
+
end
|
413
|
+
|
414
|
+
server
|
415
|
+
end
|
416
|
+
|
417
|
+
def settings_check!(settings)
|
418
|
+
unless settings.has_key?("solr.url") && ! settings["solr.url"].nil?
|
419
|
+
raise ArgumentError.new("SolrJWriter requires a 'solr.url' solr url in settings")
|
420
|
+
end
|
421
|
+
|
422
|
+
unless settings["solr.url"] =~ /^#{URI::regexp}$/
|
423
|
+
raise ArgumentError.new("SolrJWriter requires a 'solr.url' setting that looks like a URL, not: `#{settings['solr.url']}`")
|
424
|
+
end
|
425
|
+
end
|
426
|
+
|
427
|
+
end
|
@@ -0,0 +1,85 @@
|
|
1
|
+
$LOAD_PATH.unshift File.expand_path('../../lib', __FILE__)
|
2
|
+
require 'traject/solrj_writer'
|
3
|
+
|
4
|
+
require 'minitest/spec'
|
5
|
+
require 'minitest/autorun'
|
6
|
+
|
7
|
+
|
8
|
+
# Get a traject context with the given data
|
9
|
+
def context_with(hash)
|
10
|
+
Traject::Indexer::Context.new(:output_hash => hash)
|
11
|
+
end
|
12
|
+
|
13
|
+
|
14
|
+
# pretends to be a SolrJ HTTPServer-like thing, just kind of mocks it up
|
15
|
+
# and records what happens and simulates errors in some cases.
|
16
|
+
class MockSolrServer
|
17
|
+
attr_accessor :things_added, :url, :committed, :parser, :shutted_down
|
18
|
+
|
19
|
+
def initialize(url)
|
20
|
+
@url = url
|
21
|
+
@things_added = []
|
22
|
+
@add_mutex = Mutex.new
|
23
|
+
end
|
24
|
+
|
25
|
+
def add(thing)
|
26
|
+
@add_mutex.synchronize do # easy peasy threadsafety for our mock
|
27
|
+
if @url == "http://no.such.place"
|
28
|
+
raise org.apache.solr.client.solrj.SolrServerException.new("mock bad uri", java.io.IOException.new)
|
29
|
+
end
|
30
|
+
|
31
|
+
# simulate a multiple id error please
|
32
|
+
if [thing].flatten.find {|doc| doc.getField("id").getValueCount() != 1}
|
33
|
+
raise org.apache.solr.client.solrj.SolrServerException.new("mock non-1 size of 'id'")
|
34
|
+
else
|
35
|
+
things_added << thing
|
36
|
+
end
|
37
|
+
end
|
38
|
+
end
|
39
|
+
|
40
|
+
def commit
|
41
|
+
@committed = true
|
42
|
+
end
|
43
|
+
|
44
|
+
def setParser(parser)
|
45
|
+
@parser = parser
|
46
|
+
end
|
47
|
+
|
48
|
+
def shutdown
|
49
|
+
@shutted_down = true
|
50
|
+
end
|
51
|
+
|
52
|
+
end
|
53
|
+
|
54
|
+
# keeps things from complaining about "yell-1.4.0/lib/yell/adapters/io.rb:66 warning: syswrite for buffered IO"
|
55
|
+
# for reasons I don't entirely understand, involving yell using syswrite and tests sometimes
|
56
|
+
# using $stderr.puts. https://github.com/TwP/logging/issues/31
|
57
|
+
STDERR.sync = true
|
58
|
+
|
59
|
+
# Hacky way to turn off Indexer logging by default, say only
|
60
|
+
# log things higher than fatal, which is nothing.
|
61
|
+
require 'traject/indexer/settings'
|
62
|
+
Traject::Indexer::Settings.defaults["log.level"] = "gt.fatal"
|
63
|
+
|
64
|
+
def support_file_path(relative_path)
|
65
|
+
return File.expand_path(File.join("test_support", relative_path), File.dirname(__FILE__))
|
66
|
+
end
|
67
|
+
|
68
|
+
# The 'assert' method I don't know why it's not there
|
69
|
+
def assert_length(length, obj, msg = nil)
|
70
|
+
unless obj.respond_to? :length
|
71
|
+
raise ArgumentError, "object with assert_length must respond_to? :length", obj
|
72
|
+
end
|
73
|
+
|
74
|
+
|
75
|
+
msg ||= "Expected length of #{obj} to be #{length}, but was #{obj.length}"
|
76
|
+
|
77
|
+
assert_equal(length, obj.length, msg.to_s )
|
78
|
+
end
|
79
|
+
|
80
|
+
def assert_start_with(start_with, obj, msg = nil)
|
81
|
+
msg ||= "expected #{obj} to start with #{start_with}"
|
82
|
+
|
83
|
+
assert obj.start_with?(start_with), msg
|
84
|
+
end
|
85
|
+
|
@@ -0,0 +1,223 @@
|
|
1
|
+
require 'minitest_helper'
|
2
|
+
|
3
|
+
require 'traject/solrj_writer'
|
4
|
+
|
5
|
+
# It's crazy hard to test this effectively, especially under threading.
|
6
|
+
# we do our best to test decently, and keep the tests readable,
|
7
|
+
# but some things aren't quite reliable under threading, sorry.
|
8
|
+
|
9
|
+
# create's a solrj_writer, maybe with MockSolrServer, maybe
|
10
|
+
# with a real one. With settings in @settings, set or change
|
11
|
+
# in before blocks
|
12
|
+
#
|
13
|
+
# writer left in @writer, with maybe mock solr server in @mock
|
14
|
+
def create_solrj_writer
|
15
|
+
@writer = Traject::SolrJWriter.new(@settings)
|
16
|
+
|
17
|
+
if @settings["solrj_writer.server_class_name"] == "MockSolrServer"
|
18
|
+
# so we can test it later
|
19
|
+
@mock = @writer.solr_server
|
20
|
+
end
|
21
|
+
end
|
22
|
+
|
23
|
+
def context_with(hash)
|
24
|
+
Traject::Indexer::Context.new(:output_hash => hash)
|
25
|
+
end
|
26
|
+
|
27
|
+
|
28
|
+
# Some tests we need to run multiple ties in multiple batch/thread scenarios,
|
29
|
+
# we DRY them up by creating a method to add the tests in different describe blocks
|
30
|
+
def test_handles_errors
|
31
|
+
it "errors but does not raise on multiple ID's" do
|
32
|
+
@writer.put context_with("id" => ["one", "two"])
|
33
|
+
@writer.close
|
34
|
+
assert_equal 1, @writer.skipped_record_count, "counts skipped record"
|
35
|
+
end
|
36
|
+
|
37
|
+
it "errors and raises on connection error" do
|
38
|
+
@settings.merge!("solr.url" => "http://no.such.place")
|
39
|
+
create_solrj_writer
|
40
|
+
assert_raises org.apache.solr.client.solrj.SolrServerException do
|
41
|
+
@writer.put context_with("id" => ["one"])
|
42
|
+
# in batch and/or thread scenarios, sometimes no exception raised until close
|
43
|
+
@writer.close
|
44
|
+
end
|
45
|
+
end
|
46
|
+
end
|
47
|
+
|
48
|
+
$stderr.puts "\n======\nWARNING: Testing SolrJWriter with mock instance, set ENV 'solr_url' to test against real solr\n======\n\n" unless ENV["solr_url"]
|
49
|
+
# WARNING. The SolrJWriter talks to a running Solr server.
|
50
|
+
#
|
51
|
+
# set ENV['solr_url'] to run tests against a real solr server
|
52
|
+
# OR
|
53
|
+
# the tests will run against a mock SolrJ server instead.
|
54
|
+
#
|
55
|
+
#
|
56
|
+
# This is pretty limited test right now.
|
57
|
+
describe "Traject::SolrJWriter" do
|
58
|
+
before do
  # Baseline settings shared by every example; individual describes
  # merge! their own batch/thread/commit overrides on top.
  @settings = {
    # Use XMLResponseParser just to test, and so it will work
    # with a solr 1.4 test server
    "solrj_writer.parser_class_name" => "XMLResponseParser",
    "solrj_writer.commit_on_close" => "false", # real solr is way too slow if we always have it commit on close
    "solrj_writer.batch_size" => nil
  }

  # Use a real solr if the environment provides one; otherwise fall back
  # to the mock server so the suite can run standalone.
  if ENV["solr_url"]
    @settings["solr.url"] = ENV["solr_url"]
  else
    @settings["solr.url"] = "http://example.org/solr"
    @settings["solrj_writer.server_class_name"] = "MockSolrServer"
  end
end
|
74
|
+
|
75
|
+
it "raises on missing url" do
  # No settings at all must be rejected...
  assert_raises(ArgumentError) do
    Traject::SolrJWriter.new
  end
  # ...and so must an explicitly nil url.
  assert_raises(ArgumentError) do
    Traject::SolrJWriter.new("solr.url" => nil)
  end
end
|
79
|
+
|
80
|
+
it "raises on malformed URL" do
  # Neither an empty string nor a non-URL string is an acceptable solr.url.
  ["", "adfadf"].each do |bad_url|
    assert_raises(ArgumentError) { Traject::SolrJWriter.new("solr.url" => bad_url) }
  end
end
|
84
|
+
|
85
|
+
it "defaults to solrj_writer.batch_size more than 1" do
  # NOTE(review): reads the "solr_writer.batch_size" key (no "j") even though
  # the example name says solrj_writer — presumably the writer aliases
  # solrj_writer.* settings onto solr_writer.*; confirm against the
  # aliasing example in this file.
  assert 1 < Traject::SolrJWriter.new("solr.url" => "http://example.org/solr").settings["solr_writer.batch_size"].to_i
end
|
88
|
+
|
89
|
+
describe "with no threading or batching" do
  before do
    # NOTE(review): this block uses the "solrj_writer.*" key spelling while the
    # batching describes use "solr_writer.*" — presumably both spellings are
    # honored via aliasing; confirm.
    @settings.merge!("solrj_writer.batch_size" => nil, "solrj_writer.thread_pool" => nil)
    create_solrj_writer
  end

  it "writes a simple document" do
    @writer.put context_with("title_t" => ["MY TESTING TITLE"], "id" => ["TEST_TEST_TEST_0001"])
    @writer.close


    if @mock
      # The configured parser class and url should have been handed to the server.
      assert_kind_of org.apache.solr.client.solrj.impl.XMLResponseParser, @mock.parser
      assert_equal @settings["solr.url"], @mock.url

      # Exactly one add call, whose first element is a SolrInputDocument.
      assert_equal 1, @mock.things_added.length
      assert_kind_of SolrInputDocument, @mock.things_added.first.first

      # close should shut the underlying server down.
      assert @mock.shutted_down
    end
  end

  it "commits on close when so set" do
    @settings.merge!("solrj_writer.commit_on_close" => "true")
    create_solrj_writer

    @writer.put context_with("title_t" => ["MY TESTING TITLE"], "id" => ["TEST_TEST_TEST_0001"])
    @writer.close

    # if it's not a mock, we don't really test anything, except that
    # no exception was raised. oh well. If it's a mock, we can
    # ask it.
    if @mock
      assert @mock.committed, "mock gets commit called on it"
    end
  end

  test_handles_errors


  # I got to see what serialized marc binary does against a real solr server,
  # sorry this is a bit out of place, but this is the class that talks to real
  # solr server right now. This test won't do much unless you have
  # real solr server set up.
  #
  # Not really a good test right now, just manually checking my solr server,
  # using this to make the add reproducible at least.
  describe "Serialized MARC" do
    it "goes to real solr somehow" do
      record = MARC::Reader.new(support_file_path "manufacturing_consent.marc").to_a.first

      serialized = record.to_marc # straight binary
      @writer.put context_with("marc_record_t" => [serialized], "id" => ["TEST_TEST_TEST_MARC_BINARY"])
      @writer.close
    end
  end
end
|
146
|
+
|
147
|
+
describe "with batching but no threading" do
  before do
    @settings.merge!("solr_writer.batch_size" => 5, "solr_writer.thread_pool" => nil)
    create_solrj_writer
  end

  it "sends all documents" do
    # 17 docs with batch_size 5 should yield batches of 5,5,5,2.
    docs = Array(1..17).collect do |i|
      {"id" => ["item_#{i}"], "title" => ["To be #{i} again!"]}
    end

    docs.each do |doc|
      @writer.put context_with(doc)
    end
    @writer.close

    if @mock
      # 3 batches of 5, and the leftover 2 (16, 17)
      assert_length 4, @mock.things_added

      # Synchronous (no thread pool), so batch order is deterministic.
      assert_length 5, @mock.things_added[0]
      assert_length 5, @mock.things_added[1]
      assert_length 5, @mock.things_added[2]
      assert_length 2, @mock.things_added[3]
    end
  end

  test_handles_errors
end
|
176
|
+
|
177
|
+
describe "with batching and threading" do
  before do
    @settings.merge!("solr_writer.batch_size" => 5, "solr_writer.thread_pool" => 2)
    create_solrj_writer
  end

  it "sends all documents" do
    # Same 17-doc fixture as the non-threaded batching case.
    docs = Array(1..17).collect do |i|
      {"id" => ["item_#{i}"], "title" => ["To be #{i} again!"]}
    end

    docs.each do |doc|
      @writer.put context_with(doc)
    end
    @writer.close

    if @mock
      # 3 batches of 5, and the leftover 2 (16, 17)
      assert_length 4, @mock.things_added

      # we can't be sure of the order under async,
      # just three of 5 and one of 2
      assert_length 3, @mock.things_added.find_all {|array| array.length == 5}
      assert_length 1, @mock.things_added.find_all {|array| array.length == 2}
    end
  end

  test_handles_errors
end
|
206
|
+
|
207
|
+
# Fixed typo in the describe name ("alises" -> "aliases"). This describe is
# currently empty; the covered settings are asserted by the
# "aliases as needed" example in the surrounding block.
describe "aliases solrj_writer* to solr_writer* settings" do
  # commit_on_close batch_size thread_pool
end
|
211
|
+
it "aliases as needed" do
  # Legacy solrj_writer.* settings should be copied over to their
  # solr_writer.* equivalents by the writer's constructor.
  @settings.merge!("solrj_writer.commit_on_close" => true, "solrj_writer.batch_size" => 5, "solrj_writer.thread_pool" => 2)
  create_solrj_writer
  assert_equal(true, @writer.settings['solr_writer.commit_on_close'], "commit_on_close")
  assert_equal(5, @writer.settings['solr_writer.batch_size'], "batch_size")
  assert_equal(2, @writer.settings['solr_writer.thread_pool'], "thread_pool")


end
|
220
|
+
|
221
|
+
end
|
222
|
+
|
223
|
+
require 'thread' # Mutex
|
@@ -0,0 +1 @@
|
|
1
|
+
02067cam a2200469 a 4500001000800000005001700008008004100025010001700066020001500083020001800098029002100116029001900137029001700156035001200173035001200185035001600197035002000213040006400233049000900297050002200306082002100328084001500349100002200364245014800386246002600534260003900560300002700599500005800626504006600684505032800750650002701078650003101105700001901136856009901155856009101254910002601345938007101371938004001442938003901482991006401521994001201585271018320080307152200.0010831s2002 nyu b 001 0 eng a 2001050014 a0375714499 a97803757144981 aNLGGCb2461901591 aYDXCPb18130101 aNZ1b6504593 a2710183 a2710183 aocm47971712 a(OCoLC)47971712 aDLCcDLCdUSXdBAKERdNLGGCdNPLdYDXCPdOCLCQdBTCTAdMdBJ aJHEE00aP96.E25bH47 200200a381/.4530223221 a05.302bcl1 aHerman, Edward S.10aManufacturing consent :bthe political economy of the mass media /cEdward S. Herman and Noam Chomsky ; with a new introduction by the authors.14aManugacturing content aNew York :bPantheon Books,c2002. alxiv, 412 p. ;c24 cm. aUpdated ed. of: Manufacturing consent. 1st ed. c1988. aIncludes bibliographical references (p. [331]-393) and index.0 aA propaganda model -- Worthy and unworthy victims -- Legitimizing versus meaningless third world elections: El Salvador, Guatemala, and Nicaragua -- The KGB-Bulgarian plot to kill the Pope: free-market disinformation as "news" -- The Indochina wars (I): Vietnam -- The Indochina wars (II): Laos and Cambodia -- Conclusions. 0aMass mediaxOwnership. 0aMass media and propaganda.1 aChomsky, Noam.423Contributor biographical informationuhttp://www.loc.gov/catdir/bios/random051/2001050014.html423Publisher descriptionuhttp://www.loc.gov/catdir/description/random044/2001050014.html a2710183bHorizon bib# aBaker & TaylorbBKTYc18.95d14.21i0375714499n0003788716sactive aYBP Library ServicesbYANKn1813010 aBaker and TaylorbBTCPn2001050014 aP96.E25 H47 2002flcbelc1cc. 1q0i4659750lembluememsel aC0bJHE
|
@@ -0,0 +1,27 @@
|
|
1
|
+
# coding: utf-8
# Gem specification for traject-solrj_writer (JRuby/solrj writer for Traject).
lib = File.expand_path('../lib', __FILE__)
$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
require 'traject/solrj_writer/version'

Gem::Specification.new do |spec|
  spec.platform      = 'java'
  spec.name          = "traject-solrj_writer"
  spec.version       = Traject::SolrJWriter::VERSION
  spec.authors       = ["Bill Dueber"]
  spec.email         = ["bill@dueber.com"]
  # Fixed wording of the summary: "Use Traject into index" -> "Use Traject to index".
  spec.summary       = %q{Use Traject to index data into Solr using solrj under JRuby}
  spec.homepage      = "https://github.com/traject-project/traject-solrj_writer"
  spec.license       = "MIT"

  spec.files         = `git ls-files -z`.split("\x0")
  spec.executables   = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
  spec.test_files    = spec.files.grep(%r{^(test|spec|features)/})
  spec.require_paths = ["lib"]

  spec.add_development_dependency "bundler", "~> 1.7"
  spec.add_development_dependency "rake", "~> 10.0"
  spec.add_development_dependency "minitest"
  spec.add_development_dependency 'simple_solr_client', '>=0.1.2'
end
|
data/vendor/solrj/README
ADDED
@@ -0,0 +1,8 @@
|
|
1
|
+
Inside ./lib are all the jar files necessary for solrj. They are used by the SolrJWriter.
|
2
|
+
|
3
|
+
The build.xml and ivy.xml file included here were used to download the jars, and
|
4
|
+
can be used to re-download them. Just run `ant` in this directory, and the contents of `./lib` will be replaced by the current latest release of solrj. Or edit ivy.xml to download a specific solrj version (perhaps change ivy.xml to use a java prop for release, defaulting to latest! ha.) And then commit changes to repo, etc, to update solrj distro'd with traject.
|
5
|
+
|
6
|
+
This is not necessarily a great way to provide access to solrj .jars. It's just what we're doing now, and it works. See main project README.md for discussion and other potential ideas.
|
7
|
+
|
8
|
+
Note, the ivy.xml in here currently downloads a bit MORE than we really need, like .jars of docs and source. Haven't yet figured out how to tell it to download all maven-specified solrj jars that we really need, but not the ones we don't need. (we DO need logging-related ones to properly get logging working!) If you can figure it out, it'd be an improvement, as ALL jars in this dir are by default loaded by traject at runtime.
|
@@ -0,0 +1,39 @@
|
|
1
|
+
<?xml version="1.0" encoding="utf-8"?>
|
2
|
+
<project xmlns:ivy="antlib:org.apache.ivy.ant" name="traject-fetch-jars" default="prepare" basedir=".">
|
3
|
+
|
4
|
+
|
5
|
+
|
6
|
+
|
7
|
+
|
8
|
+
<target name="prepare" depends="setup-ivy">
|
9
|
+
<mkdir dir="lib"/>
|
10
|
+
<ivy:retrieve sync="true"/>
|
11
|
+
</target>
|
12
|
+
|
13
|
+
<target name="clean">
|
14
|
+
<delete dir="lib"/>
|
15
|
+
</target>
|
16
|
+
|
17
|
+
|
18
|
+
|
19
|
+
<property name="ivy.install.version" value="2.3.0"/>
|
20
|
+
<property name="ivy.jar.dir" value="ivy"/>
|
21
|
+
<property name="ivy.jar.file" value="${ivy.jar.dir}/ivy.jar"/>
|
22
|
+
|
23
|
+
<available file="${ivy.jar.file}" property="skip.download"/>
|
24
|
+
|
25
|
+
<target name="download-ivy" unless="skip.download">
|
26
|
+
<mkdir dir="${ivy.jar.dir}"/>
|
27
|
+
|
28
|
+
<echo message="installing ivy..."/>
|
29
|
+
<get src="http://repo1.maven.org/maven2/org/apache/ivy/ivy/${ivy.install.version}/ivy-${ivy.install.version}.jar" dest="${ivy.jar.file}" usetimestamp="true"/>
|
30
|
+
</target>
|
31
|
+
|
32
|
+
<target name="setup-ivy" depends="download-ivy" description="--> setup ivy">
|
33
|
+
<path id="ivy.lib.path">
|
34
|
+
<fileset dir="${ivy.jar.dir}" includes="*.jar"/>
|
35
|
+
</path>
|
36
|
+
<taskdef resource="org/apache/ivy/ant/antlib.xml" uri="antlib:org.apache.ivy.ant" classpathref="ivy.lib.path"/>
|
37
|
+
</target>
|
38
|
+
|
39
|
+
</project>
|
@@ -0,0 +1,16 @@
|
|
1
|
+
<ivy-module version="2.0">
|
2
|
+
<info organisation="org.code4lib" module="traject"/>
|
3
|
+
|
4
|
+
<dependencies>
|
5
|
+
<!-- downloads EVERYTHING including docs and source we don't need. Oh well, it
|
6
|
+
works for prototyping at least... -->
|
7
|
+
<dependency org="org.apache.solr" name="solr-solrj" rev="latest.release"/>
|
8
|
+
|
9
|
+
|
10
|
+
<!-- Attempts to give us just what we need, including working logging, still
|
11
|
+
not quite right, but leaving here for thinking... -->
|
12
|
+
<!-- <dependency org="org.apache.solr" name="solr-solrj" rev="latest.release" conf="default" />
|
13
|
+
<dependency org="org.slf4j" name="slf4j-simple" rev="latest.release"/> -->
|
14
|
+
</dependencies>
|
15
|
+
</ivy-module>
|
16
|
+
|
Binary file
|
Binary file
|
Binary file
|
Binary file
|
Binary file
|
Binary file
|
Binary file
|
Binary file
|
Binary file
|
Binary file
|
Binary file
|
Binary file
|
Binary file
|
Binary file
|
metadata
ADDED
@@ -0,0 +1,133 @@
|
|
1
|
+
--- !ruby/object:Gem::Specification
|
2
|
+
name: traject-solrj_writer
|
3
|
+
version: !ruby/object:Gem::Version
|
4
|
+
version: 1.0.0
|
5
|
+
platform: java
|
6
|
+
authors:
|
7
|
+
- Bill Dueber
|
8
|
+
autorequire:
|
9
|
+
bindir: bin
|
10
|
+
cert_chain: []
|
11
|
+
date: 2015-02-10 00:00:00.000000000 Z
|
12
|
+
dependencies:
|
13
|
+
- !ruby/object:Gem::Dependency
|
14
|
+
requirement: !ruby/object:Gem::Requirement
|
15
|
+
requirements:
|
16
|
+
- - ~>
|
17
|
+
- !ruby/object:Gem::Version
|
18
|
+
version: '1.7'
|
19
|
+
name: bundler
|
20
|
+
prerelease: false
|
21
|
+
type: :development
|
22
|
+
version_requirements: !ruby/object:Gem::Requirement
|
23
|
+
requirements:
|
24
|
+
- - ~>
|
25
|
+
- !ruby/object:Gem::Version
|
26
|
+
version: '1.7'
|
27
|
+
- !ruby/object:Gem::Dependency
|
28
|
+
requirement: !ruby/object:Gem::Requirement
|
29
|
+
requirements:
|
30
|
+
- - ~>
|
31
|
+
- !ruby/object:Gem::Version
|
32
|
+
version: '10.0'
|
33
|
+
name: rake
|
34
|
+
prerelease: false
|
35
|
+
type: :development
|
36
|
+
version_requirements: !ruby/object:Gem::Requirement
|
37
|
+
requirements:
|
38
|
+
- - ~>
|
39
|
+
- !ruby/object:Gem::Version
|
40
|
+
version: '10.0'
|
41
|
+
- !ruby/object:Gem::Dependency
|
42
|
+
requirement: !ruby/object:Gem::Requirement
|
43
|
+
requirements:
|
44
|
+
- - '>='
|
45
|
+
- !ruby/object:Gem::Version
|
46
|
+
version: '0'
|
47
|
+
name: minitest
|
48
|
+
prerelease: false
|
49
|
+
type: :development
|
50
|
+
version_requirements: !ruby/object:Gem::Requirement
|
51
|
+
requirements:
|
52
|
+
- - '>='
|
53
|
+
- !ruby/object:Gem::Version
|
54
|
+
version: '0'
|
55
|
+
- !ruby/object:Gem::Dependency
|
56
|
+
requirement: !ruby/object:Gem::Requirement
|
57
|
+
requirements:
|
58
|
+
- - '>='
|
59
|
+
- !ruby/object:Gem::Version
|
60
|
+
version: 0.1.2
|
61
|
+
name: simple_solr_client
|
62
|
+
prerelease: false
|
63
|
+
type: :development
|
64
|
+
version_requirements: !ruby/object:Gem::Requirement
|
65
|
+
requirements:
|
66
|
+
- - '>='
|
67
|
+
- !ruby/object:Gem::Version
|
68
|
+
version: 0.1.2
|
69
|
+
description:
|
70
|
+
email:
|
71
|
+
- bill@dueber.com
|
72
|
+
executables: []
|
73
|
+
extensions: []
|
74
|
+
extra_rdoc_files: []
|
75
|
+
files:
|
76
|
+
- .gitignore
|
77
|
+
- .travis.yml
|
78
|
+
- Gemfile
|
79
|
+
- LICENSE.txt
|
80
|
+
- README.md
|
81
|
+
- Rakefile
|
82
|
+
- lib/traject/solrj_writer.rb
|
83
|
+
- lib/traject/solrj_writer/version.rb
|
84
|
+
- spec/minitest_helper.rb
|
85
|
+
- spec/solrj_writer_spec.rb
|
86
|
+
- spec/test_support/manufacturing_consent.marc
|
87
|
+
- traject-solrj_writer.gemspec
|
88
|
+
- vendor/solrj/README
|
89
|
+
- vendor/solrj/build.xml
|
90
|
+
- vendor/solrj/ivy.xml
|
91
|
+
- vendor/solrj/lib/commons-io-2.3.jar
|
92
|
+
- vendor/solrj/lib/httpclient-4.3.1.jar
|
93
|
+
- vendor/solrj/lib/httpcore-4.3.jar
|
94
|
+
- vendor/solrj/lib/httpmime-4.3.1.jar
|
95
|
+
- vendor/solrj/lib/jcl-over-slf4j-1.6.6.jar
|
96
|
+
- vendor/solrj/lib/log4j-1.2.16.jar
|
97
|
+
- vendor/solrj/lib/noggit-0.5.jar
|
98
|
+
- vendor/solrj/lib/slf4j-api-1.7.6.jar
|
99
|
+
- vendor/solrj/lib/slf4j-log4j12-1.6.6.jar
|
100
|
+
- vendor/solrj/lib/solr-solrj-4.3.1-javadoc.jar
|
101
|
+
- vendor/solrj/lib/solr-solrj-4.3.1-sources.jar
|
102
|
+
- vendor/solrj/lib/solr-solrj-4.3.1.jar
|
103
|
+
- vendor/solrj/lib/wstx-asl-3.2.7.jar
|
104
|
+
- vendor/solrj/lib/zookeeper-3.4.6.jar
|
105
|
+
homepage: https://github.com/traject-project/traject-solrj_writer
|
106
|
+
licenses:
|
107
|
+
- MIT
|
108
|
+
metadata: {}
|
109
|
+
post_install_message:
|
110
|
+
rdoc_options: []
|
111
|
+
require_paths:
|
112
|
+
- lib
|
113
|
+
required_ruby_version: !ruby/object:Gem::Requirement
|
114
|
+
requirements:
|
115
|
+
- - '>='
|
116
|
+
- !ruby/object:Gem::Version
|
117
|
+
version: '0'
|
118
|
+
required_rubygems_version: !ruby/object:Gem::Requirement
|
119
|
+
requirements:
|
120
|
+
- - '>='
|
121
|
+
- !ruby/object:Gem::Version
|
122
|
+
version: '0'
|
123
|
+
requirements: []
|
124
|
+
rubyforge_project:
|
125
|
+
rubygems_version: 2.1.9
|
126
|
+
signing_key:
|
127
|
+
specification_version: 4
|
128
|
+
summary: Use Traject to index data into Solr using solrj under JRuby
|
129
|
+
test_files:
|
130
|
+
- spec/minitest_helper.rb
|
131
|
+
- spec/solrj_writer_spec.rb
|
132
|
+
- spec/test_support/manufacturing_consent.marc
|
133
|
+
has_rdoc:
|