sunspot_index_queue 1.0.0
- data/.gitignore +1 -0
- data/MIT_LICENSE +20 -0
- data/README.rdoc +133 -0
- data/Rakefile +52 -0
- data/VERSION +1 -0
- data/lib/sunspot/index_queue.rb +168 -0
- data/lib/sunspot/index_queue/batch.rb +120 -0
- data/lib/sunspot/index_queue/entry.rb +137 -0
- data/lib/sunspot/index_queue/entry/active_record_impl.rb +140 -0
- data/lib/sunspot/index_queue/entry/data_mapper_impl.rb +122 -0
- data/lib/sunspot/index_queue/entry/mongo_impl.rb +276 -0
- data/lib/sunspot/index_queue/session_proxy.rb +111 -0
- data/lib/sunspot_index_queue.rb +5 -0
- data/spec/active_record_impl_spec.rb +44 -0
- data/spec/batch_spec.rb +118 -0
- data/spec/data_mapper_impl_spec.rb +37 -0
- data/spec/entry_impl_examples.rb +184 -0
- data/spec/entry_spec.rb +148 -0
- data/spec/index_queue_spec.rb +150 -0
- data/spec/integration_spec.rb +110 -0
- data/spec/mongo_impl_spec.rb +35 -0
- data/spec/session_proxy_spec.rb +174 -0
- data/spec/spec_helper.rb +94 -0
- data/sunspot_index_queue.gemspec +99 -0
- metadata +237 -0
data/.gitignore
ADDED
@@ -0,0 +1 @@
pkg
data/MIT_LICENSE
ADDED
@@ -0,0 +1,20 @@
Copyright (c) 2010 Brian Durand

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.rdoc
ADDED
@@ -0,0 +1,133 @@
= Sunspot Index Queue

This gem provides support for asynchronously updating a Solr index with the sunspot gem.

== Why asynchronous

=== If Solr is down, your application won't (necessarily) be down

Since Solr runs as a different part of your application infrastructure, there is always the chance that it isn't working right while the rest of your application is working fine. By queueing up changes for Solr when your models change, the update parts of your application can remain up even during a Solr outage. This could be critical if it keeps your staff working or your customers placing orders.

=== Better consistency when something goes wrong

If your application stores data in a relational database, there's always the chance that a record update could succeed while the transaction it was in fails. This can result in inconsistent data between your search index and your database.

If you use the tactic of batching Solr updates to commit them all at once, you could have a problem if an exception is encountered prior to the batch being committed.

Queueing the updates in the same datastore that is used for persisting your models can prevent these sorts of inconsistencies from happening.

=== Spread out load peaks

If you get a particularly large spike of updates to your indexed models, you could be taxing your Solr server with lots of single document updates. This could lead to downstream performance issues in your application. By queueing the Solr updates, they will be sent to the server in larger batches, providing better performance. Furthermore, if Solr gets backed up updating the index, the slowdown will be isolated to the background job processing the queue.

This library uses a dedicated work queue for processing Solr requests instead of building off of delayed_job or another background processing library in order to take advantage of efficiently batching requests to Solr. This can give orders of magnitude better performance over indexing and committing individual documents.

== Usage

To use asynchronous indexing, you'll have to set up three things.

=== Session

Your application should be initialized with a Sunspot::IndexQueue::SessionProxy session. This will send queries immediately to Solr but route updates to a queue for processing later.

This will set up a session proxy using the default Solr session:

  Sunspot.session = Sunspot::IndexQueue::SessionProxy.new

If you have custom configuration settings for your Solr session or need to wrap it with additional proxies, you can pass it to the constructor. For example, if you have a master/slave setup running on specific ports:

  master_session = Sunspot::Session.new{|config| config.solr.url = 'http://master.solr.example.com/solr'}
  slave_session = Sunspot::Session.new{|config| config.solr.url = 'http://slave01.solr.example.com/solr'}
  master_slave_session = Sunspot::SessionProxy::MasterSlaveSessionProxy.new(master_session, slave_session)
  queue = Sunspot::IndexQueue.new(:session => master_slave_session)
  Sunspot.session = Sunspot::IndexQueue::SessionProxy.new(queue)

=== Queue Implementation

The queue component is designed to be modular so that you can plug in a datastore that fits with your application architecture. You can set the implementation to one of the included ones:

  # Queue implementation backed by ActiveRecord
  Sunspot::IndexQueue::Entry.implementation = :active_record

  # Queue implementation backed by DataMapper
  Sunspot::IndexQueue::Entry.implementation = :data_mapper

  # Queue implementation backed by MongoDB
  Sunspot::IndexQueue::Entry.implementation = :mongo

  # You can also provide your own queue implementation
  Sunspot::IndexQueue::Entry.implementation = MyQueueImplementation

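The contract a custom implementation has to satisfy isn't documented here; judging from how Entry delegates to its backing implementation elsewhere in this gem, a minimal in-memory sketch might look like the following. All method names and signatures are inferred from the gem's internals, so treat them as assumptions rather than an official interface:

```ruby
# Hypothetical in-memory queue backend, sketching the class-level interface
# that Sunspot::IndexQueue::Entry appears to expect (names inferred).
class MyQueueImplementation
  QueueEntry = Struct.new(:record_class_name, :record_id, :is_delete, :priority, :run_at, :error)

  class << self
    def entries
      @entries ||= []
    end

    # Add one entry per id for the given class.
    def add(klass, ids, delete, priority)
      Array(ids).each do |id|
        entries << QueueEntry.new(klass.to_s, id, delete, priority, Time.now, nil)
      end
    end

    def total_count(queue)
      entries.size
    end

    def ready_count(queue)
      entries.count { |e| e.run_at <= Time.now }
    end

    def error_count(queue)
      entries.count { |e| e.error }
    end

    # Pull the highest priority entries first, up to the queue's batch size.
    def next_batch!(queue)
      batch = entries.sort_by { |e| -e.priority }.first(queue.batch_size)
      batch.each { |e| entries.delete(e) }
      batch
    end
  end
end
```

This is only a sketch for a single process; a real implementation also needs the locking and error-tracking behavior the bundled ActiveRecord, DataMapper, and MongoDB backends provide.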
=== Process The Queue

To process the queue:

  queue = Sunspot::IndexQueue.new
  queue.process

This will process all entries currently in the queue. Of course, you'll probably want to wrap this up in some sort of daemon process. Here is a sample daemon script you could run for a Rails application. You'll need to customize it if your setup is more complicated.

  #!/usr/bin/env ruby

  require 'rubygems'
  gem 'daemons-mikehale'
  require 'daemons'

  # Assumes the script is located in a subdirectory of the Rails root directory
  rails_root = File.expand_path(File.join(File.dirname(__FILE__), '..'))

  Daemons.run_proc(File.basename($0), :dir_mode => :normal, :dir => File.join(rails_root, 'log'), :force_kill_wait => 30) do
    require File.join(rails_root, 'config', 'environment')

    # Use the default queue settings.
    queue = Sunspot::IndexQueue.new

    # Don't want the daemon to fill up the log with SQL queries in development mode
    Rails.logger.level = Logger::INFO if Rails.logger.level == Logger::DEBUG

    loop do
      begin
        queue.process
        sleep(2)
      rescue Exception => e
        # Make sure we can exit the loop on a shutdown
        raise e if e.is_a?(SystemExit) || e.is_a?(Interrupt)
        # If Solr isn't responding, wait a while to give it time to get back up
        if e.is_a?(Sunspot::IndexQueue::SolrNotResponding)
          sleep(30)
        else
          Rails.logger.error(e)
        end
      end
    end
  end

The logic in the queue is designed to allow concurrent processing by multiple processes; however, some documents may end up getting submitted to Solr multiple times. Concurrency can speed things up when you need to index a large data set, but forking too many processes to handle your queue will result in more documents being processed multiple times.

== Features

=== Multiple Queues

If you have multiple models segmented into multiple Solr indexes, you can set up multiple queues. They will share the same persistence backend, but can be configured with different Solr sessions.

  # Index all your content in one index
  content_queue = Sunspot::IndexQueue.new(:session => content_session, :class_names => [BlogPost, Review])
  content_session_proxy = Sunspot::IndexQueue::SessionProxy.new(content_queue)

  # And all your products in another
  product_queue = Sunspot::IndexQueue.new(:session => product_session, :class_names => Product)
  product_session_proxy = Sunspot::IndexQueue::SessionProxy.new(product_queue)

=== Priority

When you have updates coming from multiple sources, it is often good to set a priority on when they should get processed so that any backlog becomes less noticeable. For example, in an application whose models are updated both by human interaction and by an automated feed, you probably want the human-updated items to be indexed first. That way, when there is a backlog, your human customers won't notice it. You can control the priority of updates inside a block with the Sunspot::IndexQueue.set_priority method.

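Under the hood, set_priority stores the default priority in a thread-local variable for the duration of the block. The following standalone sketch mirrors that mechanism (reimplemented without the gem so it can run on its own; the module and key names here are illustrative):

```ruby
# Standalone sketch of the thread-local priority mechanism behind
# Sunspot::IndexQueue.set_priority.
module PriorityDemo
  KEY = :demo_index_queue_priority

  # Run the block with the given default priority; restore the old value after,
  # even if the block raises.
  def self.set_priority(priority)
    saved = Thread.current[KEY]
    begin
      Thread.current[KEY] = priority.to_i
      yield
    ensure
      Thread.current[KEY] = saved
    end
  end

  # The default priority used when none is set explicitly.
  def self.default_priority
    Thread.current[KEY] || 0
  end
end

PriorityDemo.set_priority(10) do
  # Entries enqueued here would pick up priority 10.
  PriorityDemo.default_priority  # => 10
end
PriorityDemo.default_priority    # => 0 outside the block
```

Because the value is thread-local and restored in an ensure clause, nested blocks and multi-threaded apps each see the priority they set.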
=== Recoverability

When an entry in the queue cannot be sent to Solr, it will automatically be rescheduled to be tried again later. The amount of time is controlled by the +retry_interval+ setting on IndexQueue (defaults to 1 minute). Each time an entry is tried and fails, the interval increases again (i.e. wait 1 minute after the first try, 2 minutes after the second, 3 minutes after the third, etc.).

Error messages and stack traces are stored with the queue entries so they can be debugged.

The exception to this is that if Solr is down altogether, the queue will stop processing and entries will be restored to be tried again immediately. This error is not logged on the entries but is instead raised as a Sunspot::IndexQueue::SolrNotResponding exception.
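In other words, the retry delay grows linearly with the failure count. A quick standalone sketch of that schedule (60 seconds being the gem's default retry_interval; the helper name is illustrative):

```ruby
# An entry that has failed n times waits roughly n * retry_interval seconds
# before its next attempt.
def next_retry_delay(failure_count, retry_interval = 60)
  failure_count * retry_interval
end

(1..3).map { |n| next_retry_delay(n) }  # => [60, 120, 180]
```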
data/Rakefile
ADDED
@@ -0,0 +1,52 @@
require 'rubygems'
require 'rake'
require 'rake/rdoctask'

desc 'Default: run unit tests.'
task :default => :test

begin
  require 'spec/rake/spectask'
  desc 'Test the gem.'
  Spec::Rake::SpecTask.new(:test) do |t|
    t.spec_files = FileList.new('spec/**/*_spec.rb')
  end
rescue LoadError
  task :test do
    STDERR.puts "You must have rspec >= 1.3.0 to run the tests"
  end
end

desc 'Generate documentation for sunspot_index_queue.'
Rake::RDocTask.new(:rdoc) do |rdoc|
  rdoc.rdoc_dir = 'rdoc'
  rdoc.options << '--title' << 'Sunspot Index Queue' << '--line-numbers' << '--inline-source' << '--main' << 'README.rdoc'
  rdoc.rdoc_files.include('README.rdoc')
  rdoc.rdoc_files.include('lib/**/*.rb')
end

begin
  require 'jeweler'
  Jeweler::Tasks.new do |gem|
    gem.name = "sunspot_index_queue"
    gem.summary = %Q{Asynchronous Solr indexing support for the sunspot gem with an emphasis on reliability and throughput.}
    gem.description = %Q(This gem provides asynchronous indexing to Solr for the sunspot gem. It uses a pluggable model for the backing queue and provides support for ActiveRecord, DataMapper, and MongoDB out of the box.)
    gem.email = "brian@embellishedvisions.com"
    gem.homepage = "http://github.com/bdurand/sunspot_index_queue"
    gem.authors = ["Brian Durand"]
    gem.rdoc_options = ["--charset=UTF-8", "--main", "README.rdoc"]

    gem.add_dependency('sunspot', '>= 1.1.0')
    gem.add_development_dependency('sqlite3')
    gem.add_development_dependency('activerecord', '>= 2.2')
    gem.add_development_dependency('dm-core', '>= 1.0.0')
    gem.add_development_dependency('dm-aggregates', '>= 1.0.0')
    gem.add_development_dependency('dm-migrations', '>= 1.0.0')
    gem.add_development_dependency('mongo')
    gem.add_development_dependency('rspec', '>= 1.3.0')
    gem.add_development_dependency('jeweler')
  end

  Jeweler::GemcutterTasks.new
rescue LoadError
end
data/VERSION
ADDED
@@ -0,0 +1 @@
1.0.0
data/lib/sunspot/index_queue.rb
ADDED
@@ -0,0 +1,168 @@
module Sunspot
  # Implementation of an asynchronous queue for indexing records with Solr. Entries are added to the queue
  # defining which records should be indexed or removed. The queue will then process those entries and
  # send them to the Solr server in batches. This has two advantages over just updating in place. First,
  # problems with Solr will not cause your application to stop functioning. Second, batching the commits
  # to Solr is more efficient and it should be able to handle more throughput when you have a lot of records
  # to index.
  class IndexQueue
    autoload :Batch, File.expand_path('../index_queue/batch', __FILE__)
    autoload :Entry, File.expand_path('../index_queue/entry', __FILE__)
    autoload :SessionProxy, File.expand_path('../index_queue/session_proxy', __FILE__)

    # This exception will be thrown if Solr is not responding.
    class SolrNotResponding < StandardError
    end

    attr_accessor :retry_interval, :batch_size
    attr_reader :session, :class_names

    class << self
      # Set the default priority for indexing items within a block. Higher priority items will be processed first.
      def set_priority (priority, &block)
        save_val = Thread.current[:sunspot_index_queue_priority]
        begin
          Thread.current[:sunspot_index_queue_priority] = priority.to_i
          yield
        ensure
          Thread.current[:sunspot_index_queue_priority] = save_val
        end
      end

      # Get the default indexing priority. Defaults to zero.
      def default_priority
        Thread.current[:sunspot_index_queue_priority] || 0
      end
    end

    # Create a new IndexQueue. Available options:
    #
    # +:retry_interval+ - The number of seconds to wait before retrying indexing when an attempt fails
    # (defaults to 1 minute). If an entry fails multiple times, it will be delayed for the interval times
    # the number of failures. For example, if the interval is 1 minute and it has failed twice, the record
    # won't be attempted again for 2 minutes.
    #
    # +:batch_size+ - The maximum number of records to try submitting to solr at one time (defaults to 100).
    #
    # +:class_names+ - A list of class names that the queue will process. This can be used to have different
    # queues process different classes of records when they need different configurations.
    #
    # +:session+ - The Sunspot::Session object to use for communicating with Solr (defaults to a session with the default config).
    def initialize (options = {})
      @retry_interval = options[:retry_interval] || 60
      @batch_size = options[:batch_size] || 100
      @batch_handler = nil
      @class_names = []
      if options[:class_names].is_a?(Array)
        @class_names.concat(options[:class_names].collect{|name| name.to_s})
      elsif options[:class_names]
        @class_names << options[:class_names].to_s
      end
      @session = options[:session] || Sunspot::Session.new
    end

    # Provide a block that will handle submitting batches of records. The block will take a Batch object and must call
    # +submit!+ on it. This can be useful for doing things such as providing an identity map for records in the batch.
    # Example:
    #
    #   # Use the ActiveRecord identity map for each batch submitted to reduce database activity.
    #   queue.batch_handler do |batch|
    #     ActiveRecord::Base.cache do
    #       batch.submit!
    #     end
    #   end
    def batch_handler (&block)
      @batch_handler = block
    end

    # Add a record to be indexed to the queue. The record can be specified as either an indexable object or
    # as a hash with :class and :id keys. The priority to be indexed can be passed in the options as +:priority+
    # (defaults to 0).
    def index (record_or_hash, options = {})
      klass, id = class_and_id(record_or_hash)
      Entry.enqueue(self, klass, id, false, options[:priority] || self.class.default_priority)
    end

    # Add a record to be removed to the queue. The record can be specified as either an indexable object or
    # as a hash with :class and :id keys. The priority to be indexed can be passed in the options as +:priority+
    # (defaults to 0).
    def remove (record_or_hash, options = {})
      klass, id = class_and_id(record_or_hash)
      Entry.enqueue(self, klass, id, true, options[:priority] || self.class.default_priority)
    end

    # Add a list of records to be indexed to the queue. The priority to be indexed can be passed in the
    # options as +:priority+ (defaults to 0).
    def index_all (klass, ids, options = {})
      Entry.enqueue(self, klass, ids, false, options[:priority] || self.class.default_priority)
    end

    # Add a list of records to be removed to the queue. The priority to be indexed can be passed in the
    # options as +:priority+ (defaults to 0).
    def remove_all (klass, ids, options = {})
      Entry.enqueue(self, klass, ids, true, options[:priority] || self.class.default_priority)
    end

    # Get the number of entries to be processed in the queue.
    def total_count
      Entry.total_count(self)
    end

    # Get the number of entries in the queue that are ready to be processed.
    def ready_count
      Entry.ready_count(self)
    end

    # Get the number of entries that have errors in the queue.
    def error_count
      Entry.error_count(self)
    end

    # Get the entries in the queue that have errors. Supported options are +:limit+ (default 50) and +:offset+ (default 0).
    def errors (options = {})
      limit = options[:limit] ? options[:limit].to_i : 50
      Entry.errors(self, limit, options[:offset].to_i)
    end

    # Reset all entries in the queue to clear errors and set them to be indexed immediately.
    def reset!
      Entry.reset!(self)
    end

    # Process the queue. Exits when there are no more entries to process at the current time.
    # Returns the number of entries processed.
    #
    # If any errors are encountered while processing the queue, they will be logged with the entries so they can
    # be fixed and tried again later. However, if Solr is refusing connections, the processing is stopped right
    # away and a Sunspot::IndexQueue::SolrNotResponding exception is raised.
    def process
      count = 0
      loop do
        entries = Entry.next_batch!(self)
        if entries.nil? || entries.empty?
          break if Entry.ready_count(self) == 0
        else
          batch = Batch.new(self, entries)
          if defined?(@batch_handler) && @batch_handler
            @batch_handler.call(batch)
          else
            batch.submit!
          end
          count += entries.select{|e| e.processed? }.size
        end
      end
      count
    end

    private

    # Get the class and id for either a record or a hash containing +:class+ and +:id+ options
    def class_and_id (record_or_hash)
      if record_or_hash.is_a?(Hash)
        [record_or_hash[:class], record_or_hash[:id]]
      else
        [record_or_hash.class, Sunspot::Adapters::InstanceAdapter.adapt(record_or_hash).id]
      end
    end
  end
end
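The index and remove methods above accept either a record object or a hash. The dispatch in class_and_id can be sketched standalone, substituting a plain #id call for Sunspot's InstanceAdapter lookup (so this runs without the gem; the Record struct is illustrative):

```ruby
# Standalone sketch of class_and_id: accept either a record or a
# {:class => ..., :id => ...} hash and normalize to [class, id].
def class_and_id(record_or_hash)
  if record_or_hash.is_a?(Hash)
    [record_or_hash[:class], record_or_hash[:id]]
  else
    # The real code asks Sunspot's InstanceAdapter for the id; here we
    # just assume the record responds to #id.
    [record_or_hash.class, record_or_hash.id]
  end
end

Record = Struct.new(:id)
class_and_id(:class => Record, :id => 42)  # => [Record, 42]
class_and_id(Record.new(7))                # => [Record, 7]
```

The hash form is what lets you enqueue a removal for a record that has already been deleted from the database.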
data/lib/sunspot/index_queue/batch.rb
ADDED
@@ -0,0 +1,120 @@
module Sunspot
  class IndexQueue
    # Batch of entries to be indexed with Solr.
    class Batch
      attr_reader :entries

      # Errors that cause batch processing to stop and are immediately passed on to the caller. All others
      # are logged on the entry on the assumption that they can be fixed later while other entries can still
      # be processed.
      PASS_THROUGH_EXCEPTIONS = [SystemExit, NoMemoryError, Interrupt, SignalException, Errno::ECONNREFUSED]

      def initialize (queue, entries = nil)
        @queue = queue
        @entries = []
        @entries.concat(entries) if entries
        @delete_entries = []
      end

      # Submit the entries to solr. If they are successfully committed, the entries will be deleted.
      # Otherwise, any entries that generated errors will be updated with the error messages and
      # set to be processed again in the future.
      def submit!
        Entry.load_all_records(entries)
        clear_processed(entries)
        begin
          # First try submitting the entries in a batch since that's the most efficient.
          # If there are errors, try each entry individually in case there's a bad document.
          session.batch do
            entries.each do |entry|
              submit_entry(entry)
            end
          end
          commit!
        rescue Exception => e
          if PASS_THROUGH_EXCEPTIONS.include?(e.class)
            raise e
          else
            submit_each_entry
          end
        end
      rescue Exception => e
        begin
          clear_processed(entries)
          entries.each{|entry| entry.reset!} if PASS_THROUGH_EXCEPTIONS.include?(e.class)
        ensure
          # Use a more specific error to indicate Solr is down.
          e = SolrNotResponding.new(e.message) if e.is_a?(Errno::ECONNREFUSED)
          raise e
        end
      end

      private

      def session
        @queue.session
      end

      # Clear the processed flag on all entries.
      def clear_processed (entries)
        entries.each{|entry| entry.processed = false}
      end

      # Send the Solr commit command and delete the entries if it succeeds.
      def commit!
        session.commit
        Entry.delete_entries(@delete_entries) unless @delete_entries.empty?
      rescue Exception => e
        clear_processed(entries)
        @delete_entries.clear
        raise e
      end

      # Submit all entries to Solr individually and then commit.
      def submit_each_entry
        entries.each do |entry|
          submit_entry(entry)
        end

        begin
          commit!
        rescue Exception => e
          if PASS_THROUGH_EXCEPTIONS.include?(e.class)
            raise e
          else
            entries.each do |entry|
              entry.set_error!(e, @queue.retry_interval)
            end
          end
        end
      end

      # Send an entry to Solr doing an update or delete as necessary.
      def submit_entry (entry)
        log_entry_error(entry) do
          if entry.is_delete?
            session.remove_by_id(entry.record_class_name, entry.record_id)
          else
            record = entry.record
            session.index(record) if record
          end
        end
      end

      # Update an entry with an error message if a block fails.
      def log_entry_error (entry)
        begin
          yield
          entry.processed = true
          @delete_entries << entry
        rescue Exception => e
          if PASS_THROUGH_EXCEPTIONS.include?(e.class)
            raise e
          else
            entry.set_error!(e, @queue.retry_interval)
          end
        end
      end
    end
  end
end
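The PASS_THROUGH_EXCEPTIONS list is the core of the batch error policy: process-level failures and refused connections propagate to the caller, while anything else is recorded on the entry so it can be retried later. A standalone sketch of that classification (reimplemented here so it runs without the gem; note the exact-class match, which is how the gem checks it):

```ruby
# Standalone sketch of the error policy in Batch: exceptions in this list
# abort processing and propagate; anything else is logged on the entry and
# retried on the next pass.
PASS_THROUGH_EXCEPTIONS = [SystemExit, NoMemoryError, Interrupt, SignalException, Errno::ECONNREFUSED]

def pass_through?(exception)
  # Exact class match, mirroring the include?(e.class) check in Batch.
  PASS_THROUGH_EXCEPTIONS.include?(exception.class)
end

pass_through?(Interrupt.new)                     # => true  (the process is shutting down)
pass_through?(RuntimeError.new("bad document"))  # => false (log on the entry, retry later)
```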