xapian_db 0.3.2 → 0.3.3

data/CHANGELOG.md CHANGED
@@ -1,3 +1,17 @@
1
+ ##0.3.3 (December 13th, 2010)
2
+
3
+ Features:
4
+
5
+ - Support for multi-language stop words. The implementation was inspired by John Leach's xapian-fu gem
6
+ - Support for query spelling correction (similar to Google's 'did you mean...'). This feature is only
7
+ available for persistent databases (due to a limitation of Xapian)
8
+
9
+ Changes:
10
+
11
+ - Languages must be configured by the ISO language code (:en, :de, ...). No more support for the English
12
+ language names (:english, :german, ...)
13
+ - Reduced the memory footprint when reindexing large tables
14
+
1
15
  ##0.3.2 (December 10th, 2010)
2
16
 
3
17
  Features:
data/README.rdoc CHANGED
@@ -114,8 +114,7 @@ you can configure the blueprint to use the language of the object when indexing:
114
114
  blueprint.language_method :language
115
115
  end
116
116
 
117
- Don't worry if you have languages in your database that are not supported by Xapian. If the language is not supported, XapianDb
118
- will fall back to the global language configuration or none, if you haven't configured one.
117
+ The method must return the ISO code for the language (:en, :de, ...) as a symbol or a string. Don't worry if you have languages in your database that are not supported by Xapian. If the language is not supported, XapianDb will fall back to the global language configuration, or to none if you haven't configured one.
119
118
 
120
119
  You can place this configuration anywhere, e.g. in an initializer.
121
120
 
@@ -145,7 +144,13 @@ You can query attributes:
145
144
 
146
145
  <code>XapianDb.search</code> returns a resultset object. You can access the number of hits directly:
147
146
 
148
- result.size # Very fast, does not load the resulting documents
147
+ results.size # Very fast, does not load the resulting documents
148
+
149
+ If you use a persistent database, the resultset may contain a spelling correction:
150
+
151
+ # Assuming you have at least one document containing "mouse"
152
+ results = XapianDb.search("moose")
153
+ results.corrected_query # "mouse"
149
154
 
150
155
  To access the found documents, get a page from the resultset:
151
156
 
@@ -162,7 +167,6 @@ Now you can access the documents:
162
167
 
163
168
  == What to expect from future releases
164
169
 
165
- * multi language support (spelling correction, stop words)
166
170
  * facet support
167
171
  * will_paginate support
168
172
  * asynchronous index writer based on {resque}[https://github.com/defunkt/resque] for production environments
@@ -30,7 +30,7 @@ module XapianDb
30
30
  end
31
31
 
32
32
  # Install delegates for the config instance variables
33
- [:database, :adapter, :writer, :stemmer].each do |attr|
33
+ [:database, :adapter, :writer, :stemmer, :stopper].each do |attr|
34
34
  define_method attr do
35
35
  @config.nil? ? nil : @config.instance_variable_get("@_#{attr}")
36
36
  end
@@ -42,7 +42,7 @@ module XapianDb
42
42
  # ---------------------------------------------------------------------------------
43
43
 
44
44
  #
45
- attr_reader :_database, :_adapter, :_writer, :_stemmer
45
+ attr_reader :_database, :_adapter, :_writer, :_stemmer, :_stopper
46
46
 
47
47
  # Set the global database to use
48
48
  # @param [String] path The path to the database. Either apply a file system path or :memory
@@ -87,20 +87,17 @@ module XapianDb
87
87
  @_writer = XapianDb::IndexWriters.const_get("#{camelize(type.to_s)}Writer")
88
88
  end
89
89
 
90
- # Set the language
91
- # @param [Symbol] lang The language; either apply the english name of the language
92
- # or the two letter IS639 code
93
- # @example Use the english name of the language
94
- # XapianDb::Config.setup do |config|
95
- # config.language :german
96
- # end
97
- # @example Use the iso code of the language
90
+ # Set the language.
91
+ # @param [Symbol] lang The language; apply the two letter ISO639 code for the language
92
+ # @example
98
93
  # XapianDb::Config.setup do |config|
99
94
  # config.language :de
100
95
  # end
101
- # see http://xapian.org/docs/apidoc/html/classXapian_1_1Stem.html for supported languages
96
+ # see {LANGUAGE_MAP} for supported languages
102
97
  def language(lang)
103
- @_stemmer = Xapian::Stem.new(lang.to_s)
98
+ key = lang.to_sym
99
+ @_stemmer = XapianDb::Repositories::Stemmer.stemmer_for key
100
+ @_stopper = key == :none ? nil : XapianDb::Repositories::Stopper.stopper_for(key)
104
101
  end
105
102
 
106
103
  private
@@ -59,6 +59,7 @@ module XapianDb
59
59
  query = @query_parser.parse(expression)
60
60
  enquiry = Xapian::Enquire.new(reader)
61
61
  enquiry.query = query
62
+ opts[:corrected_query] = @query_parser.corrected_query
62
63
  Resultset.new(enquiry, opts)
63
64
  end
64
65
 
@@ -30,7 +30,6 @@ module XapianDb
30
30
  def setup(klass, &block)
31
31
  @blueprints ||= {}
32
32
  blueprint = DocumentBlueprint.new
33
- blueprint.indexer = Indexer.new(blueprint)
34
33
  yield blueprint if block_given? # configure the blueprint through the block
35
34
  @blueprints[klass] = blueprint
36
35
  @adapter = blueprint.adapter || XapianDb::Config.adapter || Adapters::GenericAdapter
@@ -62,10 +61,6 @@ module XapianDb
62
61
  # Instance methods
63
62
  # ---------------------------------------------------------------------------------
64
63
 
65
- # Set / get the indexer
66
- # @return [XapianDb::Indexer]
67
- attr_accessor :indexer
68
-
69
64
  # Return an array of all configured text methods in this blueprint
70
65
  # @return [Array<String>] All searchable prefixes
71
66
  def searchable_prefixes
@@ -102,7 +97,7 @@ module XapianDb
102
97
  # Blueprint DSL methods
103
98
  # ---------------------------------------------------------------------------------
104
99
 
105
- # The name of the method that returns a Xapian compliant language code. The
100
+ # The name of the method that returns an iso language code. The
106
101
  # configured class must implement this method.
107
102
  attr_reader :lang_method
108
103
 
@@ -127,8 +122,8 @@ module XapianDb
127
122
  end
128
123
 
129
124
  # Set the name of the method to get the language for an indexed object
130
- # @param [Symbol] lang The method name. The method must return a language supported
131
- # by Xapian (see http://xapian.org/docs/apidoc/html/classXapian_1_1Stem.html for supported languages)
125
+ # @param [Symbol] lang The method name. The method must return an iso language code (:en, :de, ...)
126
+ # see LANGUAGE_MAP for the supported languages
132
127
  def language_method(lang)
133
128
  @lang_method = lang
134
129
  end
@@ -18,7 +18,8 @@ module XapianDb
18
18
  # @param [Object] obj An instance of a class with a blueprint configuration
19
19
  def index(obj)
20
20
  blueprint = XapianDb::DocumentBlueprint.blueprint_for(obj.class)
21
- doc = blueprint.indexer.build_document_for(obj)
21
+ indexer = XapianDb::Indexer.new(XapianDb.database, blueprint)
22
+ doc = indexer.build_document_for(obj)
22
23
  XapianDb.database.store_doc(doc)
23
24
  XapianDb.database.commit
24
25
  end
@@ -39,19 +40,25 @@ module XapianDb
39
40
  # First, delete all docs of this class
40
41
  XapianDb.database.delete_docs_of_class(klass)
41
42
  blueprint = XapianDb::DocumentBlueprint.blueprint_for(klass)
43
+ indexer = XapianDb::Indexer.new(XapianDb.database, blueprint)
42
44
  show_progressbar = false
45
+ obj_count = klass.count
43
46
  if opts[:verbose]
44
47
  if defined?(ProgressBar)
45
48
  show_progressbar = true
46
49
  end
47
- obj_count = klass.count
48
50
  puts "Reindexing #{obj_count} objects..."
49
51
  pbar = ProgressBar.new("Status", obj_count) if show_progressbar
50
52
  end
51
- klass.all.each do |obj|
52
- doc = blueprint.indexer.build_document_for(obj)
53
- XapianDb.database.store_doc(doc)
54
- pbar.inc if show_progressbar
53
+
54
+ # Process the objects in batches to reduce the memory footprint
55
+ nr_of_batches = (obj_count / 1000) + 1
56
+ nr_of_batches.times do |batch|
57
+ klass.all(:offset => batch * 1000, :limit => 1000).each do |obj|
58
+ doc = indexer.build_document_for(obj)
59
+ XapianDb.database.store_doc(doc)
60
+ pbar.inc if show_progressbar
61
+ end
55
62
  end
56
63
  XapianDb.database.commit
57
64
  end
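The batching in the hunk above can be tried out without a database. The following is a minimal sketch, not the gem's code: `FakeModel` and `each_in_batches` are hypothetical names, and `FakeModel.all` only imitates the `:offset` / `:limit` options used in the diff. The sketch uses ceiling division for the batch count, a small variation on `(obj_count / 1000) + 1`, which runs one extra (empty) batch when the count is an exact multiple of the batch size.

```ruby
# Hypothetical stand-in for a model class; `all` imitates the
# :offset / :limit options used in the diff.
class FakeModel
  RECORDS = (1..2500).to_a

  def self.count
    RECORDS.size
  end

  def self.all(opts = {})
    RECORDS[opts[:offset] || 0, opts[:limit] || RECORDS.size] || []
  end
end

BATCH_SIZE = 1000

# Walk the table in fixed-size slices so that only one batch is
# held in memory at a time. Returns the number of objects visited.
def each_in_batches(klass, batch_size = BATCH_SIZE)
  obj_count = klass.count
  # Ceiling division: no trailing empty batch for exact multiples.
  nr_of_batches = (obj_count + batch_size - 1) / batch_size
  visited = 0
  nr_of_batches.times do |batch|
    klass.all(:offset => batch * batch_size, :limit => batch_size).each do |obj|
      yield obj
      visited += 1
    end
  end
  visited
end

seen = []
each_in_batches(FakeModel) { |obj| seen << obj }
puts seen.size
```

Whether offset/limit paging is stable depends on the adapter returning rows in a consistent order between calls; the sketch assumes it does.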
@@ -7,28 +7,11 @@ module XapianDb
7
7
  # @author Gernot Kogler
8
8
  class Indexer
9
9
 
10
- # Supported languages and mapping to the stemmer to use
11
- LANGUAGE_MAP = {:da => :danish,
12
- :nl => :dutch,
13
- :en => :english,
14
- :fi => :finnish,
15
- :fr => :french,
16
- :de => :german2, # Normalises umlauts and ß
17
- :hu => :hungarian,
18
- :it => :italian,
19
- :nb => :norwegian,
20
- :nn => :norwegian,
21
- :no => :norwegian,
22
- :pt => :portuguese,
23
- :ro => :romanian,
24
- :ru => :russian,
25
- :es => :spanish,
26
- :sv => :swedish,
27
- :tr => :turkish}
28
10
  # Constructor
11
+ # @param [XapianDb::Database] database The database to use (needed to build a spelling index)
29
12
  # @param [XapianDb::DocumentBlueprint] document_blueprint The blueprint to use
30
- def initialize(document_blueprint)
31
- @document_blueprint = document_blueprint
13
+ def initialize(database, document_blueprint)
14
+ @database, @document_blueprint = database, document_blueprint
32
15
  end
33
16
 
34
17
  # Build the document for an object. The object must respond to 'xapian_id'.
@@ -62,12 +45,18 @@ module XapianDb
62
45
 
63
46
  # Index all configured text methods
64
47
  def index_text
48
+ setup_language_helpers
65
49
  term_generator = Xapian::TermGenerator.new
50
+ term_generator.database = @database.writer
66
51
  term_generator.document = @xapian_doc
67
- term_generator.stemmer = get_stemmer
68
- # TODO: Configure and enable these features
69
- # tg.stopper = stopper if stopper
70
- # tg.set_flags Xapian::TermGenerator::FLAG_SPELLING if db.spelling
52
+ if @stemmer
53
+ term_generator.stemmer = @stemmer
54
+ term_generator.stopper = @stopper unless @stopper.nil?
55
+ # Enable the creation of a spelling index if the database is not in memory
56
+ if @database.is_a? XapianDb::PersistentDatabase
57
+ term_generator.set_flags Xapian::TermGenerator::FLAG_SPELLING
58
+ end
59
+ end
71
60
 
72
61
  # Always index the class and the primary key
73
62
  @xapian_doc.add_term("C#{@obj.class}")
@@ -89,16 +78,22 @@ module XapianDb
89
78
 
90
79
  private
91
80
 
92
- # Configure the stemmer to use
93
- def get_stemmer
81
+ # Configure the stemmer and stopper to use
82
+ def setup_language_helpers
94
83
  # Do we have a language config on the blueprint?
95
84
  if @blueprint.lang_method
96
85
  lang = @obj.send(@blueprint.lang_method)
97
- return Xapian::Stem.new(LANGUAGE_MAP[lang.to_sym].to_s) if lang && LANGUAGE_MAP.has_key?(lang.to_sym)
86
+ if lang && LANGUAGE_MAP.has_key?(lang.to_sym)
87
+ @stemmer = XapianDb::Repositories::Stemmer.stemmer_for lang.to_sym
88
+ @stopper = XapianDb::Repositories::Stopper.stopper_for lang.to_sym
89
+ return
90
+ end
98
91
  end
99
- # Do we have a global stemmer?
100
- return XapianDb::Config.stemmer if XapianDb::Config.stemmer
101
- return Xapian::Stem.new("none") # No language config
92
+
93
+ # Use the global config
94
+ @stemmer = XapianDb::Config.stemmer
95
+ @stopper = XapianDb::Config.stopper
96
+
102
97
  end
103
98
 
104
99
  end
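The fallback order used by `setup_language_helpers` (the object's own language first, then the global configuration) can be illustrated in plain Ruby. This is a sketch under assumptions: `SUPPORTED_LANGUAGES` is a hypothetical subset standing in for the keys of `LANGUAGE_MAP`, and `resolve_language` is an illustrative helper that does not exist in the gem.

```ruby
# Hypothetical subset of the gem's LANGUAGE_MAP keys, for illustration only.
SUPPORTED_LANGUAGES = [:da, :de, :en].freeze

# Pick the language to index an object with: the object's own ISO code if it
# is supported, otherwise the globally configured one (which may be nil).
def resolve_language(obj_lang, global_lang)
  candidate = obj_lang.to_s.strip
  if !candidate.empty? && SUPPORTED_LANGUAGES.include?(candidate.to_sym)
    return candidate.to_sym
  end
  global_lang
end

puts resolve_language("de", :en).inspect  # object language is supported
puts resolve_language(:xx, :en).inspect   # unsupported, falls back to global
puts resolve_language(nil, nil).inspect   # no language configured at all
```

Accepting both strings and symbols mirrors the README statement that the language method may return either.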
@@ -6,6 +6,10 @@ module XapianDb
6
6
  # @author Gernot Kogler
7
7
  class QueryParser
8
8
 
9
+ # The spelling corrected query (if a language is configured)
10
+ # @return [String]
11
+ attr_reader :corrected_query
12
+
9
13
  # Constructor
10
14
  # @param [XapianDb::Database] database The database to query
11
15
  def initialize(database)
@@ -13,9 +17,10 @@ module XapianDb
13
17
 
14
18
  # Set the parser options
15
19
  @query_flags = 0
16
- @query_flags |= Xapian::QueryParser::FLAG_WILDCARD # enable wildcards
17
- @query_flags |= Xapian::QueryParser::FLAG_BOOLEAN # enable boolean operators
18
- @query_flags |= Xapian::QueryParser::FLAG_BOOLEAN_ANY_CASE # enable case insensitive boolean operators
20
+ @query_flags |= Xapian::QueryParser::FLAG_WILDCARD # enable wildcards
21
+ @query_flags |= Xapian::QueryParser::FLAG_BOOLEAN # enable boolean operators
22
+ @query_flags |= Xapian::QueryParser::FLAG_BOOLEAN_ANY_CASE # enable case insensitive boolean operators
23
+ @query_flags |= Xapian::QueryParser::FLAG_SPELLING_CORRECTION # enable spelling corrections
19
24
  end
20
25
 
21
26
  # Parse an expression
@@ -24,12 +29,18 @@ module XapianDb
24
29
  parser = Xapian::QueryParser.new
25
30
  parser.database = @db.reader
26
31
  parser.default_op = Xapian::Query::OP_AND # Could be made configurable
27
- # TODO: Setup stopper, stemmer, defaults and fields
32
+ if XapianDb::Config.stemmer
33
+ parser.stemmer = XapianDb::Config.stemmer
34
+ parser.stemming_strategy = Xapian::QueryParser::STEM_SOME
35
+ parser.stopper = XapianDb::Config.stopper
36
+ end
28
37
 
29
38
  # Add the searchable prefixes to allow searches by field
30
39
  # (like "name:Kogler")
31
40
  XapianDb::DocumentBlueprint.searchable_prefixes.each{|prefix| parser.add_prefix(prefix.to_s.downcase, "X#{prefix.to_s.upcase}") }
32
- parser.parse_query(expression, @query_flags)
41
+ query = parser.parse_query(expression, @query_flags)
42
+ @corrected_query = parser.get_corrected_query_string
43
+ query
33
44
  end
34
45
 
35
46
  end
@@ -0,0 +1,38 @@
1
+ # encoding: utf-8
2
+
3
+ module XapianDb
4
+ module Repositories
5
+
6
+ # The stemmer is a repository that manages stemmers for the supported
7
+ # languages
8
+ # @author Gernot Kogler
9
+ class Stemmer
10
+
11
+ class << self
12
+
13
+ # Get or build the stemmer for a language
14
+ # @param [Symbol, String] iso_cd The iso code for the language (:en, :de ...)
15
+ # @return [Xapian::Stem] The Stemmer for this language
16
+ def stemmer_for(iso_cd)
17
+ @stemmers ||= {}
18
+ return nil if iso_cd.nil?
19
+ key = iso_cd.to_sym
20
+
21
+ # Do we already have a stemmer for this language?
22
+ return @stemmers[key] unless @stemmers[key].nil?
23
+
24
+ # Do we support this language?
25
+ unless (LANGUAGE_MAP.keys + [:none]).include?(key)
26
+ raise ArgumentError.new "Language #{iso_cd} is not supported by XapianDb (remember to use the language iso codes)"
27
+ end
28
+
29
+ # Let's build the stemmer
30
+ @stemmers[key] = Xapian::Stem.new(key.to_s)
31
+ end
32
+
33
+ end
34
+
35
+ end
36
+
37
+ end
38
+ end
@@ -0,0 +1,45 @@
1
+ # encoding: utf-8
2
+
3
+ module XapianDb
4
+ module Repositories
5
+
6
+ # The stopper is a repository that manages stoppers for the supported
7
+ # languages
8
+ # @author Gernot Kogler
9
+ class Stopper
10
+
11
+ class << self
12
+
13
+ # Get or build the stopper for a language
14
+ # @param [Symbol, String] iso_cd The iso code for the language (:en, :de ...)
15
+ # @return [Xapian::SimpleStopper] The Stopper for this language
16
+ def stopper_for(iso_cd)
17
+ @stoppers ||= {}
18
+ return nil if iso_cd.nil?
19
+ key = iso_cd.to_sym
20
+
21
+ # Do we already have a stopper for this language?
22
+ return @stoppers[key] unless @stoppers[key].nil?
23
+
24
+ # Do we support this language?
25
+ unless (LANGUAGE_MAP.keys + [:none]).include?(key)
26
+ raise ArgumentError.new "Language #{iso_cd} is not supported by XapianDb (remember to use the language iso codes)"
27
+ end
28
+
29
+ # build the stopper
30
+ stopper = Xapian::SimpleStopper.new
31
+ stopwords_file = File.join(File.dirname(__FILE__), '../stopwords', "#{iso_cd}.txt")
32
+ open(stopwords_file, "r") do |file|
33
+ file.each do |word|
34
+ stopper.add word.chomp
35
+ end
36
+ end
37
+ @stoppers[key] = stopper
38
+ end
39
+
40
+ end
41
+
42
+ end
43
+
44
+ end
45
+ end
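The get-or-build pattern of the Stopper repository can be sketched in plain Ruby. This is an illustration under assumptions, not the gem's implementation: `StopwordRepository` is a hypothetical class, a `Set` stands in for `Xapian::SimpleStopper` so the code runs without the Xapian bindings, and the stop word directory is passed in instead of being derived from `__FILE__`. Unlike the code in the diff, the sketch skips blank lines; the Danish list added below contains one, which the original loop appears to add as an empty word.

```ruby
require "set"
require "tmpdir"

# Hypothetical pure-Ruby stand-in for the Stopper repository.
class StopwordRepository
  def initialize(stopwords_dir)
    @stopwords_dir = stopwords_dir
    @stoppers = {}
  end

  # Get or build the stop word set for an ISO language code.
  def stopper_for(iso_cd)
    return nil if iso_cd.nil?
    key = iso_cd.to_sym
    @stoppers[key] ||= load_words(key)
  end

  private

  # Read one word per line, ignoring blank lines.
  def load_words(key)
    path = File.join(@stopwords_dir, "#{key}.txt")
    words = Set.new
    File.foreach(path) do |line|
      word = line.strip
      words << word unless word.empty?
    end
    words
  end
end

Dir.mktmpdir do |dir|
  File.write(File.join(dir, "da.txt"), "og\ni\n\njeg\n")
  repo = StopwordRepository.new(dir)
  stopper = repo.stopper_for("da")
  puts stopper.include?("jeg")
  puts stopper.size
end
```

Building the file name from the normalized key rather than the raw argument keeps the cache key and the file lookup consistent when callers mix strings and symbols.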
@@ -15,15 +15,21 @@ module XapianDb
15
15
  # @return [Integer]
16
16
  attr_reader :size
17
17
 
18
+ # The spelling corrected query (if a language is configured)
19
+ # @return [String]
20
+ attr_reader :corrected_query
21
+
18
22
  # Constructor
19
23
  # @param [Xapian::Enquire] enquiry a Xapian query result (see http://xapian.org/docs/apidoc/html/classXapian_1_1Enquire.html)
20
24
  # @param [Hash] options
21
25
  # @option options [Integer] :per_page (10) How many docs per page?
26
+ # @option options [String] :corrected_query (nil) The spelling corrected query (if a language is configured)
22
27
  def initialize(enquiry, options)
23
28
  @enquiry = enquiry
24
29
  # By passing 0 as the max parameter to the mset method,
25
30
  # we only get statistics about the query, no results
26
- @size = enquiry.mset(0, 0).matches_estimated
31
+ @size = enquiry.mset(0, 0).matches_estimated
32
+ @corrected_query = options[:corrected_query]
27
33
  @per_page = options[:per_page]
28
34
  end
29
35
 
@@ -0,0 +1,5 @@
1
+ These stopword lists are from the Snowball library which is covered by the BSD License, with Copyright (c) 2001, Dr Martin Porter, and (for the Java developments) Copyright (c) 2002, Richard Boulton.
2
+
3
+ http://snowball.tartarus.org/
4
+
5
+ Run ruby update.rb from this directory to update the stop word files.
@@ -0,0 +1,94 @@
1
+ og
2
+ i
3
+ jeg
4
+ det
5
+ at
6
+ en
7
+ den
8
+ til
9
+ er
10
+ som
11
+
12
+ de
13
+ med
14
+ han
15
+ af
16
+ for
17
+ ikke
18
+ der
19
+ var
20
+ mig
21
+ sig
22
+ men
23
+ et
24
+ har
25
+ om
26
+ vi
27
+ min
28
+ havde
29
+ ham
30
+ hun
31
+ nu
32
+ over
33
+ da
34
+ fra
35
+ du
36
+ ud
37
+ sin
38
+ dem
39
+ os
40
+ op
41
+ man
42
+ hans
43
+ hvor
44
+ eller
45
+ hvad
46
+ skal
47
+ selv
48
+ her
49
+ alle
50
+ vil
51
+ blev
52
+ kunne
53
+ ind
54
+ når
55
+ være
56
+ dog
57
+ noget
58
+ ville
59
+ jo
60
+ deres
61
+ efter
62
+ ned
63
+ skulle
64
+ denne
65
+ end
66
+ dette
67
+ mit
68
+ også
69
+ under
70
+ have
71
+ dig
72
+ anden
73
+ hende
74
+ mine
75
+ alt
76
+ meget
77
+ sit
78
+ sine
79
+ vor
80
+ mod
81
+ disse
82
+ hvis
83
+ din
84
+ nogle
85
+ hos
86
+ blive
87
+ mange
88
+ ad
89
+ bliver
90
+ hendes
91
+ været
92
+ thi
93
+ jer
94
+ sådan