sunspot_index_queue 1.0.0

data/.gitignore ADDED
@@ -0,0 +1 @@
+ pkg
data/MIT_LICENSE ADDED
@@ -0,0 +1,20 @@
+ Copyright (c) 2010 Brian Durand
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.rdoc ADDED
@@ -0,0 +1,133 @@
+ = Sunspot Index Queue
+
+ This gem provides support for asynchronously updating a Solr index with the sunspot gem.
+
+
+ == Why asynchronous
+
+ === If Solr is down, your application won't (necessarily) be down
+
+ Since Solr runs as a separate part of your application infrastructure, there is always a chance that it isn't working right while the rest of your application is working fine. By queueing up changes for Solr when your models change, the update paths of your application can remain up even during a Solr outage. This could be critical if it keeps your staff working or your customers placing orders.
+
+ === Better consistency when something goes wrong
+
+ If your application stores data in a relational database, there's always the chance that a record update could succeed while the transaction it was in fails. This can result in inconsistent data between your search index and your database.
+
+ If you use the tactic of batching Solr updates so they are committed at once, you could have a problem if an exception is encountered before the batch is committed.
+
+ Queueing the updates in the same datastore that is used for persisting your models can prevent these sorts of inconsistencies from happening.
+
+ === Spread out load peaks
+
+ If you get a particularly large spike of updates to your indexed models, you could be taxing your Solr server with lots of single document updates. This could lead to downstream performance issues in your application. By queueing the Solr updates, they will be sent to the server in larger batches, providing better performance. Furthermore, if Solr gets backed up updating the index, the slowdown will be isolated to the background job processing the queue.
+
+ This library uses a dedicated work queue for processing Solr requests instead of building on delayed_job or another background processing library in order to take advantage of efficiently batching requests to Solr. This can give orders of magnitude better performance than indexing and committing individual documents.
+
+
+ == Usage
+
+ To use asynchronous indexing, you'll need to set up three things.
+
+ === Session
+
+ Your application should be initialized with a Sunspot::IndexQueue::SessionProxy session. This proxy sends queries to Solr immediately but routes updates to a queue for processing later.
+
+ This will set up a session proxy using the default Solr session:
+
+   Sunspot.session = Sunspot::IndexQueue::SessionProxy.new
+
+ If you have custom configuration settings for your Solr session or need to wrap it with additional proxies, you can pass it to the constructor. For example, if you have a master/slave setup running on specific hosts:
+
+   master_session = Sunspot::Session.new{|config| config.solr.url = 'http://master.solr.example.com/solr'}
+   slave_session = Sunspot::Session.new{|config| config.solr.url = 'http://slave01.solr.example.com/solr'}
+   master_slave_session = Sunspot::SessionProxy::MasterSlaveSessionProxy.new(master_session, slave_session)
+   queue = Sunspot::IndexQueue.new(:session => master_slave_session)
+   Sunspot.session = Sunspot::IndexQueue::SessionProxy.new(queue)
+
+ === Queue Implementation
+
+ The queue component is designed to be modular so that you can plug in a datastore that fits your application architecture. You can set the implementation to one of the included ones:
+
+   # Queue implementation backed by ActiveRecord
+   Sunspot::IndexQueue::Entry.implementation = :active_record
+
+   # Queue implementation backed by DataMapper
+   Sunspot::IndexQueue::Entry.implementation = :data_mapper
+
+   # Queue implementation backed by MongoDB
+   Sunspot::IndexQueue::Entry.implementation = :mongo
+
+   # You can also provide your own queue implementation
+   Sunspot::IndexQueue::Entry.implementation = MyQueueImplementation
+
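A custom implementation is a class that responds to the same class-level methods the included backends provide. The sketch below is hypothetical: the method names are inferred from the call sites in this gem's IndexQueue and Batch classes (enqueue, total_count, ready_count, next_batch!, etc.), and an in-memory array stands in for a real datastore so it runs standalone. Treat the signatures as assumptions, not a definitive contract.

```ruby
# Hypothetical skeleton for MyQueueImplementation backed by an in-memory
# array. A real backend would persist entries and honor retry times.
class MyQueueImplementation
  @entries = []

  class << self
    # Add entries for the given ids; delete marks removal requests.
    def enqueue(queue, klass, ids, delete, priority)
      Array(ids).each do |id|
        @entries << {:class => klass.to_s, :id => id, :delete => delete, :priority => priority}
      end
    end

    # Number of entries still waiting in the queue.
    def total_count(queue)
      @entries.size
    end

    # A real backend would only count entries whose retry time has passed.
    def ready_count(queue)
      @entries.size
    end

    # Claim and return the next batch of entries to process.
    def next_batch!(queue)
      @entries.slice!(0, 100) || []
    end
  end
end
```

A real implementation would also need the error-handling hooks the gem calls on entries (errors, reset!, and so on); see the included backends for reference.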
+ === Process The Queue
+
+ To process the queue:
+
+   queue = Sunspot::IndexQueue.new
+   queue.process
+
+ This will process all entries currently in the queue. Of course, you'll probably want to wrap this up in some sort of daemon process. Here is a sample daemon script you could run for a Rails application. You'll need to customize it if your setup is more complicated.
+
+   #!/usr/bin/env ruby
+
+   require 'rubygems'
+   gem 'daemons-mikehale'
+   require 'daemons'
+
+   # Assumes the script is located in a subdirectory of the Rails root directory
+   rails_root = File.expand_path(File.join(File.dirname(__FILE__), '..'))
+
+   Daemons.run_proc(File.basename($0), :dir_mode => :normal, :dir => File.join(rails_root, 'log'), :force_kill_wait => 30) do
+     require File.join(rails_root, 'config', 'environment')
+
+     # Use the default queue settings.
+     queue = Sunspot::IndexQueue.new
+
+     # Don't let the daemon fill up the log with SQL queries in development mode.
+     Rails.logger.level = Logger::INFO if Rails.logger.level == Logger::DEBUG
+
+     loop do
+       begin
+         queue.process
+         sleep(2)
+       rescue Exception => e
+         # Make sure we can exit the loop on a shutdown.
+         raise e if e.is_a?(SystemExit) || e.is_a?(Interrupt)
+         # If Solr isn't responding, wait a while to give it time to get back up.
+         if e.is_a?(Sunspot::IndexQueue::SolrNotResponding)
+           sleep(30)
+         else
+           Rails.logger.error(e)
+         end
+       end
+     end
+   end
+
+ The logic in the queue is designed to allow concurrent processing of the queue by multiple processes; however, some documents may end up being submitted to Solr multiple times. Running several processes can speed things up when you need to index a large data set, but forking too many processes will result in more documents being processed multiple times.
+
+ == Features
+
+ === Multiple Queues
+
+ If you have multiple models segmented into multiple Solr indexes, you can set up multiple queues. They will share the same persistence backend, but can be configured with different Solr sessions.
+
+   # Index all your content in one index
+   content_queue = Sunspot::IndexQueue.new(:session => content_session, :class_names => [BlogPost, Review])
+   content_session_proxy = Sunspot::IndexQueue::SessionProxy.new(content_queue)
+
+   # And all your products in another
+   product_queue = Sunspot::IndexQueue.new(:session => product_session, :class_names => Product)
+   product_session_proxy = Sunspot::IndexQueue::SessionProxy.new(product_queue)
+
+ === Priority
+
+ When you have updates coming from multiple sources, it is often good to set a priority on when they should be processed so that a backlog is less noticeable. For example, in an application whose models are updated both by human interaction and by an automated feed, you probably want the human-updated items to be indexed first. That way, when there is a backlog, your human customers won't notice it. You can control the priority of updates inside a block with the Sunspot::IndexQueue.set_priority method.
+
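Under the hood, set_priority stores the priority in a thread-local value that newly enqueued entries pick up as their default. A minimal, standalone sketch of that pattern (mirroring the implementation in this gem's IndexQueue class; no gem required):

```ruby
# Standalone sketch of the thread-local priority mechanism behind
# Sunspot::IndexQueue.set_priority.
module PrioritySketch
  # Run a block with a default priority; restore the previous value
  # afterward, even if the block raises.
  def self.set_priority(priority)
    saved = Thread.current[:sketch_priority]
    begin
      Thread.current[:sketch_priority] = priority.to_i
      yield
    ensure
      Thread.current[:sketch_priority] = saved
    end
  end

  # The priority newly enqueued entries would get. Defaults to zero.
  def self.default_priority
    Thread.current[:sketch_priority] || 0
  end
end

PrioritySketch.set_priority(10) do
  PrioritySketch.default_priority  # => 10 inside the block
end
PrioritySketch.default_priority    # => 0 again outside the block
```

Because the value is thread-local, one request handler can raise the priority of its updates without affecting updates enqueued concurrently from other threads.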
+ === Recoverability
128
+
129
+ When an entry in the queue cannot be sent to Solr, it will be automatically rescheduled to be tried again later. The amount of time is controlled by the +retry_interval+ setting on IndexQueue (defaults to 1 minute). Every time it is tried and fails, the interval will be increased yet again (i.e. first try wait 1 minute, second try wait 2 minutes, third try wait 3 minutes, etc.).
130
+
131
+ Error messages and stack traces are stored with the queue entries so the can be debugged.
132
+
133
+ sThe exception to this is that if Solr is down altogether, the queue will stop processing and entries will be restored to try again immediately. This error is not logged on the entries but is rather thrown as a Sunspot::IndexQueue::SolrNotResponding exception.
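The retry delay described above grows linearly: the wait before the next attempt is the retry interval multiplied by the number of failures so far, which can be sketched as:

```ruby
# Linear back-off for failed entries: delay = retry_interval * failures.
retry_interval = 60  # the default: 1 minute
delays = (1..3).map { |failures| retry_interval * failures }
delays  # => [60, 120, 180] seconds (1, 2, then 3 minutes)
```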
data/Rakefile ADDED
@@ -0,0 +1,52 @@
+ require 'rubygems'
+ require 'rake'
+ require 'rake/rdoctask'
+
+ desc 'Default: run unit tests.'
+ task :default => :test
+
+ begin
+   require 'spec/rake/spectask'
+   desc 'Test the gem.'
+   Spec::Rake::SpecTask.new(:test) do |t|
+     t.spec_files = FileList.new('spec/**/*_spec.rb')
+   end
+ rescue LoadError
+   task :test do
+     STDERR.puts "You must have rspec >= 1.3.0 to run the tests"
+   end
+ end
+
+ desc 'Generate documentation for sunspot_index_queue.'
+ Rake::RDocTask.new(:rdoc) do |rdoc|
+   rdoc.rdoc_dir = 'rdoc'
+   rdoc.options << '--title' << 'Sunspot Index Queue' << '--line-numbers' << '--inline-source' << '--main' << 'README.rdoc'
+   rdoc.rdoc_files.include('README.rdoc')
+   rdoc.rdoc_files.include('lib/**/*.rb')
+ end
+
+ begin
+   require 'jeweler'
+   Jeweler::Tasks.new do |gem|
+     gem.name = "sunspot_index_queue"
+     gem.summary = %Q{Asynchronous Solr indexing support for the sunspot gem with an emphasis on reliability and throughput.}
+     gem.description = %Q(This gem provides asynchronous indexing to Solr for the sunspot gem. It uses a pluggable model for the backing queue and provides support for ActiveRecord, DataMapper, and MongoDB out of the box.)
+     gem.email = "brian@embellishedvisions.com"
+     gem.homepage = "http://github.com/bdurand/sunspot_index_queue"
+     gem.authors = ["Brian Durand"]
+     gem.rdoc_options = ["--charset=UTF-8", "--main", "README.rdoc"]
+
+     gem.add_dependency('sunspot', '>= 1.1.0')
+     gem.add_development_dependency('sqlite3')
+     gem.add_development_dependency('activerecord', '>= 2.2')
+     gem.add_development_dependency('dm-core', '>= 1.0.0')
+     gem.add_development_dependency('dm-aggregates', '>= 1.0.0')
+     gem.add_development_dependency('dm-migrations', '>= 1.0.0')
+     gem.add_development_dependency('mongo')
+     gem.add_development_dependency('rspec', '>= 1.3.0')
+     gem.add_development_dependency('jeweler')
+   end
+
+   Jeweler::GemcutterTasks.new
+ rescue LoadError
+ end
data/VERSION ADDED
@@ -0,0 +1 @@
+ 1.0.0
@@ -0,0 +1,168 @@
+ module Sunspot
+   # Implementation of an asynchronous queue for indexing records with Solr. Entries are added to the queue
+   # defining which records should be indexed or removed. The queue will then process those entries and
+   # send them to the Solr server in batches. This has two advantages over just updating in place. First,
+   # problems with Solr will not cause your application to stop functioning. Second, batching the commits
+   # to Solr is more efficient and it should be able to handle more throughput when you have a lot of records
+   # to index.
+   class IndexQueue
+     autoload :Batch, File.expand_path('../index_queue/batch', __FILE__)
+     autoload :Entry, File.expand_path('../index_queue/entry', __FILE__)
+     autoload :SessionProxy, File.expand_path('../index_queue/session_proxy', __FILE__)
+
+     # This exception will be thrown if Solr is not responding.
+     class SolrNotResponding < StandardError
+     end
+
+     attr_accessor :retry_interval, :batch_size
+     attr_reader :session, :class_names
+
+     class << self
+       # Set the default priority for indexing items within a block. Higher priority items will be processed first.
+       def set_priority (priority, &block)
+         save_val = Thread.current[:sunspot_index_queue_priority]
+         begin
+           Thread.current[:sunspot_index_queue_priority] = priority.to_i
+           yield
+         ensure
+           Thread.current[:sunspot_index_queue_priority] = save_val
+         end
+       end
+
+       # Get the default indexing priority. Defaults to zero.
+       def default_priority
+         Thread.current[:sunspot_index_queue_priority] || 0
+       end
+     end
+
+     # Create a new IndexQueue. Available options:
+     #
+     # +:retry_interval+ - The number of seconds to wait before retrying indexing when an attempt fails
+     # (defaults to 1 minute). If an entry fails multiple times, it will be delayed for the interval times
+     # the number of failures. For example, if the interval is 1 minute and it has failed twice, the record
+     # won't be attempted again for 2 minutes.
+     #
+     # +:batch_size+ - The maximum number of records to try submitting to solr at one time (defaults to 100).
+     #
+     # +:class_names+ - A list of class names that the queue will process. This can be used to have different
+     # queues process different classes of records when they need different configurations.
+     #
+     # +:session+ - The Sunspot::Session object to use for communicating with Solr (defaults to a session with the default config).
+     def initialize (options = {})
+       @retry_interval = options[:retry_interval] || 60
+       @batch_size = options[:batch_size] || 100
+       @batch_handler = nil
+       @class_names = []
+       if options[:class_names].is_a?(Array)
+         @class_names.concat(options[:class_names].collect{|name| name.to_s})
+       elsif options[:class_names]
+         @class_names << options[:class_names].to_s
+       end
+       @session = options[:session] || Sunspot::Session.new
+     end
+
+     # Provide a block that will handle submitting batches of records. The block will take a Batch object and must call
+     # +submit!+ on it. This can be useful for doing things such as providing an identity map for records in the batch.
+     # Example:
+     #
+     #   # Use the ActiveRecord query cache for each batch submitted to reduce database activity.
+     #   queue.batch_handler do |batch|
+     #     ActiveRecord::Base.cache do
+     #       batch.submit!
+     #     end
+     #   end
+     def batch_handler (&block)
+       @batch_handler = block
+     end
+
+     # Add a record to be indexed to the queue. The record can be specified as either an indexable object or
+     # as a hash with :class and :id keys. The priority to index with can be passed in the options as +:priority+
+     # (defaults to 0).
+     def index (record_or_hash, options = {})
+       klass, id = class_and_id(record_or_hash)
+       Entry.enqueue(self, klass, id, false, options[:priority] || self.class.default_priority)
+     end
+
+     # Add a record to be removed to the queue. The record can be specified as either an indexable object or
+     # as a hash with :class and :id keys. The priority to index with can be passed in the options as +:priority+
+     # (defaults to 0).
+     def remove (record_or_hash, options = {})
+       klass, id = class_and_id(record_or_hash)
+       Entry.enqueue(self, klass, id, true, options[:priority] || self.class.default_priority)
+     end
+
+     # Add a list of records to be indexed to the queue. The priority to index with can be passed in the
+     # options as +:priority+ (defaults to 0).
+     def index_all (klass, ids, options = {})
+       Entry.enqueue(self, klass, ids, false, options[:priority] || self.class.default_priority)
+     end
+
+     # Add a list of records to be removed to the queue. The priority to index with can be passed in the
+     # options as +:priority+ (defaults to 0).
+     def remove_all (klass, ids, options = {})
+       Entry.enqueue(self, klass, ids, true, options[:priority] || self.class.default_priority)
+     end
+
+     # Get the number of entries to be processed in the queue.
+     def total_count
+       Entry.total_count(self)
+     end
+
+     # Get the number of entries in the queue that are ready to be processed.
+     def ready_count
+       Entry.ready_count(self)
+     end
+
+     # Get the number of entries that have errors in the queue.
+     def error_count
+       Entry.error_count(self)
+     end
+
+     # Get the entries in the queue that have errors. Supported options are +:limit+ (default 50) and +:offset+ (default 0).
+     def errors (options = {})
+       limit = options[:limit] ? options[:limit].to_i : 50
+       Entry.errors(self, limit, options[:offset].to_i)
+     end
+
+     # Reset all entries in the queue to clear errors and set them to be indexed immediately.
+     def reset!
+       Entry.reset!(self)
+     end
+
+     # Process the queue. Exits when there are no more entries to process at the current time.
+     # Returns the number of entries processed.
+     #
+     # If any errors are encountered while processing the queue, they will be logged with the entries so they can
+     # be fixed and tried again later. However, if Solr is refusing connections, the processing is stopped right
+     # away and a Sunspot::IndexQueue::SolrNotResponding exception is raised.
+     def process
+       count = 0
+       loop do
+         entries = Entry.next_batch!(self)
+         if entries.nil? || entries.empty?
+           break if Entry.ready_count(self) == 0
+         else
+           batch = Batch.new(self, entries)
+           if defined?(@batch_handler) && @batch_handler
+             @batch_handler.call(batch)
+           else
+             batch.submit!
+           end
+           count += entries.select{|e| e.processed?}.size
+         end
+       end
+       count
+     end
+
+     private
+
+     # Get the class and id for either a record or a hash containing +:class+ and +:id+ options
+     def class_and_id (record_or_hash)
+       if record_or_hash.is_a?(Hash)
+         [record_or_hash[:class], record_or_hash[:id]]
+       else
+         [record_or_hash.class, Sunspot::Adapters::InstanceAdapter.adapt(record_or_hash).id]
+       end
+     end
+   end
+ end
@@ -0,0 +1,120 @@
+ module Sunspot
+   class IndexQueue
+     # Batch of entries to be indexed with Solr.
+     class Batch
+       attr_reader :entries
+
+       # Errors that cause batch processing to stop and are immediately passed on to the caller. All others
+       # are logged on the entry on the assumption that they can be fixed later while other entries can still
+       # be processed.
+       PASS_THROUGH_EXCEPTIONS = [SystemExit, NoMemoryError, Interrupt, SignalException, Errno::ECONNREFUSED]
+
+       def initialize (queue, entries = nil)
+         @queue = queue
+         @entries = []
+         @entries.concat(entries) if entries
+         @delete_entries = []
+       end
+
+       # Submit the entries to Solr. If they are successfully committed, the entries will be deleted.
+       # Otherwise, any entries that generated errors will be updated with the error messages and
+       # set to be processed again in the future.
+       def submit!
+         Entry.load_all_records(entries)
+         clear_processed(entries)
+         begin
+           # First try submitting the entries in a batch since that's the most efficient.
+           # If there are errors, try each entry individually in case there's a bad document.
+           session.batch do
+             entries.each do |entry|
+               submit_entry(entry)
+             end
+           end
+           commit!
+         rescue Exception => e
+           if PASS_THROUGH_EXCEPTIONS.include?(e.class)
+             raise e
+           else
+             submit_each_entry
+           end
+         end
+       rescue Exception => e
+         begin
+           clear_processed(entries)
+           entries.each{|entry| entry.reset!} if PASS_THROUGH_EXCEPTIONS.include?(e.class)
+         ensure
+           # Use a more specific error to indicate Solr is down.
+           e = SolrNotResponding.new(e.message) if e.is_a?(Errno::ECONNREFUSED)
+           raise e
+         end
+       end
+
+       private
+
+       def session
+         @queue.session
+       end
+
+       # Clear the processed flag on all entries.
+       def clear_processed (entries)
+         entries.each{|entry| entry.processed = false}
+       end
+
+       # Send the Solr commit command and delete the entries if it succeeds.
+       def commit!
+         session.commit
+         Entry.delete_entries(@delete_entries) unless @delete_entries.empty?
+       rescue Exception => e
+         clear_processed(entries)
+         @delete_entries.clear
+         raise e
+       end
+
+       # Submit all entries to Solr individually and then commit.
+       def submit_each_entry
+         entries.each do |entry|
+           submit_entry(entry)
+         end
+
+         begin
+           commit!
+         rescue Exception => e
+           if PASS_THROUGH_EXCEPTIONS.include?(e.class)
+             raise e
+           else
+             entries.each do |entry|
+               entry.set_error!(e, @queue.retry_interval)
+             end
+           end
+         end
+       end
+
+       # Send an entry to Solr doing an update or delete as necessary.
+       def submit_entry (entry)
+         log_entry_error(entry) do
+           if entry.is_delete?
+             session.remove_by_id(entry.record_class_name, entry.record_id)
+           else
+             record = entry.record
+             session.index(record) if record
+           end
+         end
+       end
+
+       # Update an entry with an error message if a block fails.
+       def log_entry_error (entry)
+         begin
+           yield
+           entry.processed = true
+           @delete_entries << entry
+         rescue Exception => e
+           if PASS_THROUGH_EXCEPTIONS.include?(e.class)
+             raise e
+           else
+             entry.set_error!(e, @queue.retry_interval)
+           end
+         end
+       end
+     end
+   end
+ end