ferret 0.1.0
- data/MIT-LICENSE +20 -0
- data/README +109 -0
- data/Rakefile +275 -0
- data/TODO +9 -0
- data/TUTORIAL +197 -0
- data/ext/extconf.rb +3 -0
- data/ext/ferret.c +23 -0
- data/ext/ferret.h +85 -0
- data/ext/index_io.c +543 -0
- data/ext/priority_queue.c +227 -0
- data/ext/ram_directory.c +316 -0
- data/ext/segment_merge_queue.c +41 -0
- data/ext/string_helper.c +42 -0
- data/ext/tags +240 -0
- data/ext/term.c +261 -0
- data/ext/term_buffer.c +299 -0
- data/ext/util.c +12 -0
- data/lib/ferret.rb +41 -0
- data/lib/ferret/analysis.rb +11 -0
- data/lib/ferret/analysis/analyzers.rb +93 -0
- data/lib/ferret/analysis/standard_tokenizer.rb +65 -0
- data/lib/ferret/analysis/token.rb +79 -0
- data/lib/ferret/analysis/token_filters.rb +86 -0
- data/lib/ferret/analysis/token_stream.rb +26 -0
- data/lib/ferret/analysis/tokenizers.rb +107 -0
- data/lib/ferret/analysis/word_list_loader.rb +27 -0
- data/lib/ferret/document.rb +2 -0
- data/lib/ferret/document/document.rb +152 -0
- data/lib/ferret/document/field.rb +304 -0
- data/lib/ferret/index.rb +26 -0
- data/lib/ferret/index/compound_file_io.rb +343 -0
- data/lib/ferret/index/document_writer.rb +288 -0
- data/lib/ferret/index/field_infos.rb +259 -0
- data/lib/ferret/index/fields_io.rb +175 -0
- data/lib/ferret/index/index.rb +228 -0
- data/lib/ferret/index/index_file_names.rb +33 -0
- data/lib/ferret/index/index_reader.rb +462 -0
- data/lib/ferret/index/index_writer.rb +488 -0
- data/lib/ferret/index/multi_reader.rb +363 -0
- data/lib/ferret/index/multiple_term_doc_pos_enum.rb +105 -0
- data/lib/ferret/index/segment_infos.rb +130 -0
- data/lib/ferret/index/segment_merge_info.rb +47 -0
- data/lib/ferret/index/segment_merge_queue.rb +16 -0
- data/lib/ferret/index/segment_merger.rb +337 -0
- data/lib/ferret/index/segment_reader.rb +380 -0
- data/lib/ferret/index/segment_term_enum.rb +178 -0
- data/lib/ferret/index/segment_term_vector.rb +58 -0
- data/lib/ferret/index/term.rb +49 -0
- data/lib/ferret/index/term_buffer.rb +88 -0
- data/lib/ferret/index/term_doc_enum.rb +283 -0
- data/lib/ferret/index/term_enum.rb +52 -0
- data/lib/ferret/index/term_info.rb +41 -0
- data/lib/ferret/index/term_infos_io.rb +312 -0
- data/lib/ferret/index/term_vector_offset_info.rb +20 -0
- data/lib/ferret/index/term_vectors_io.rb +552 -0
- data/lib/ferret/query_parser.rb +274 -0
- data/lib/ferret/query_parser/query_parser.tab.rb +819 -0
- data/lib/ferret/search.rb +49 -0
- data/lib/ferret/search/boolean_clause.rb +100 -0
- data/lib/ferret/search/boolean_query.rb +303 -0
- data/lib/ferret/search/boolean_scorer.rb +294 -0
- data/lib/ferret/search/caching_wrapper_filter.rb +40 -0
- data/lib/ferret/search/conjunction_scorer.rb +99 -0
- data/lib/ferret/search/disjunction_sum_scorer.rb +203 -0
- data/lib/ferret/search/exact_phrase_scorer.rb +32 -0
- data/lib/ferret/search/explanation.rb +41 -0
- data/lib/ferret/search/field_cache.rb +216 -0
- data/lib/ferret/search/field_doc.rb +31 -0
- data/lib/ferret/search/field_sorted_hit_queue.rb +184 -0
- data/lib/ferret/search/filter.rb +11 -0
- data/lib/ferret/search/filtered_query.rb +130 -0
- data/lib/ferret/search/filtered_term_enum.rb +79 -0
- data/lib/ferret/search/fuzzy_query.rb +153 -0
- data/lib/ferret/search/fuzzy_term_enum.rb +244 -0
- data/lib/ferret/search/hit_collector.rb +34 -0
- data/lib/ferret/search/hit_queue.rb +11 -0
- data/lib/ferret/search/index_searcher.rb +173 -0
- data/lib/ferret/search/match_all_docs_query.rb +104 -0
- data/lib/ferret/search/multi_phrase_query.rb +204 -0
- data/lib/ferret/search/multi_term_query.rb +65 -0
- data/lib/ferret/search/non_matching_scorer.rb +22 -0
- data/lib/ferret/search/phrase_positions.rb +55 -0
- data/lib/ferret/search/phrase_query.rb +217 -0
- data/lib/ferret/search/phrase_scorer.rb +153 -0
- data/lib/ferret/search/prefix_query.rb +47 -0
- data/lib/ferret/search/query.rb +111 -0
- data/lib/ferret/search/query_filter.rb +51 -0
- data/lib/ferret/search/range_filter.rb +103 -0
- data/lib/ferret/search/range_query.rb +139 -0
- data/lib/ferret/search/req_excl_scorer.rb +125 -0
- data/lib/ferret/search/req_opt_sum_scorer.rb +70 -0
- data/lib/ferret/search/score_doc.rb +38 -0
- data/lib/ferret/search/score_doc_comparator.rb +114 -0
- data/lib/ferret/search/scorer.rb +91 -0
- data/lib/ferret/search/similarity.rb +278 -0
- data/lib/ferret/search/sloppy_phrase_scorer.rb +47 -0
- data/lib/ferret/search/sort.rb +105 -0
- data/lib/ferret/search/sort_comparator.rb +60 -0
- data/lib/ferret/search/sort_field.rb +87 -0
- data/lib/ferret/search/spans.rb +12 -0
- data/lib/ferret/search/spans/near_spans_enum.rb +304 -0
- data/lib/ferret/search/spans/span_first_query.rb +79 -0
- data/lib/ferret/search/spans/span_near_query.rb +108 -0
- data/lib/ferret/search/spans/span_not_query.rb +130 -0
- data/lib/ferret/search/spans/span_or_query.rb +176 -0
- data/lib/ferret/search/spans/span_query.rb +25 -0
- data/lib/ferret/search/spans/span_scorer.rb +74 -0
- data/lib/ferret/search/spans/span_term_query.rb +105 -0
- data/lib/ferret/search/spans/span_weight.rb +84 -0
- data/lib/ferret/search/spans/spans_enum.rb +44 -0
- data/lib/ferret/search/term_query.rb +128 -0
- data/lib/ferret/search/term_scorer.rb +181 -0
- data/lib/ferret/search/top_docs.rb +24 -0
- data/lib/ferret/search/top_field_docs.rb +17 -0
- data/lib/ferret/search/weight.rb +54 -0
- data/lib/ferret/search/wildcard_query.rb +26 -0
- data/lib/ferret/search/wildcard_term_enum.rb +61 -0
- data/lib/ferret/stemmers.rb +1 -0
- data/lib/ferret/stemmers/porter_stemmer.rb +218 -0
- data/lib/ferret/store.rb +5 -0
- data/lib/ferret/store/buffered_index_io.rb +191 -0
- data/lib/ferret/store/directory.rb +139 -0
- data/lib/ferret/store/fs_store.rb +338 -0
- data/lib/ferret/store/index_io.rb +259 -0
- data/lib/ferret/store/ram_store.rb +282 -0
- data/lib/ferret/utils.rb +7 -0
- data/lib/ferret/utils/bit_vector.rb +105 -0
- data/lib/ferret/utils/date_tools.rb +138 -0
- data/lib/ferret/utils/number_tools.rb +91 -0
- data/lib/ferret/utils/parameter.rb +41 -0
- data/lib/ferret/utils/priority_queue.rb +120 -0
- data/lib/ferret/utils/string_helper.rb +47 -0
- data/lib/ferret/utils/weak_key_hash.rb +51 -0
- data/rake_utils/code_statistics.rb +106 -0
- data/setup.rb +1551 -0
- data/test/benchmark/tb_ram_store.rb +76 -0
- data/test/benchmark/tb_rw_vint.rb +26 -0
- data/test/longrunning/tc_numbertools.rb +60 -0
- data/test/longrunning/tm_store.rb +19 -0
- data/test/test_all.rb +9 -0
- data/test/test_helper.rb +6 -0
- data/test/unit/analysis/tc_analyzer.rb +21 -0
- data/test/unit/analysis/tc_letter_tokenizer.rb +20 -0
- data/test/unit/analysis/tc_lower_case_filter.rb +20 -0
- data/test/unit/analysis/tc_lower_case_tokenizer.rb +27 -0
- data/test/unit/analysis/tc_per_field_analyzer_wrapper.rb +39 -0
- data/test/unit/analysis/tc_porter_stem_filter.rb +16 -0
- data/test/unit/analysis/tc_standard_analyzer.rb +20 -0
- data/test/unit/analysis/tc_standard_tokenizer.rb +20 -0
- data/test/unit/analysis/tc_stop_analyzer.rb +20 -0
- data/test/unit/analysis/tc_stop_filter.rb +14 -0
- data/test/unit/analysis/tc_white_space_analyzer.rb +21 -0
- data/test/unit/analysis/tc_white_space_tokenizer.rb +20 -0
- data/test/unit/analysis/tc_word_list_loader.rb +32 -0
- data/test/unit/document/tc_document.rb +47 -0
- data/test/unit/document/tc_field.rb +80 -0
- data/test/unit/index/tc_compound_file_io.rb +107 -0
- data/test/unit/index/tc_field_infos.rb +119 -0
- data/test/unit/index/tc_fields_io.rb +167 -0
- data/test/unit/index/tc_index.rb +140 -0
- data/test/unit/index/tc_index_reader.rb +622 -0
- data/test/unit/index/tc_index_writer.rb +57 -0
- data/test/unit/index/tc_multiple_term_doc_pos_enum.rb +80 -0
- data/test/unit/index/tc_segment_infos.rb +74 -0
- data/test/unit/index/tc_segment_term_docs.rb +17 -0
- data/test/unit/index/tc_segment_term_enum.rb +60 -0
- data/test/unit/index/tc_segment_term_vector.rb +71 -0
- data/test/unit/index/tc_term.rb +22 -0
- data/test/unit/index/tc_term_buffer.rb +57 -0
- data/test/unit/index/tc_term_info.rb +19 -0
- data/test/unit/index/tc_term_infos_io.rb +192 -0
- data/test/unit/index/tc_term_vector_offset_info.rb +18 -0
- data/test/unit/index/tc_term_vectors_io.rb +108 -0
- data/test/unit/index/th_doc.rb +244 -0
- data/test/unit/query_parser/tc_query_parser.rb +84 -0
- data/test/unit/search/tc_filter.rb +113 -0
- data/test/unit/search/tc_fuzzy_query.rb +136 -0
- data/test/unit/search/tc_index_searcher.rb +188 -0
- data/test/unit/search/tc_search_and_sort.rb +98 -0
- data/test/unit/search/tc_similarity.rb +37 -0
- data/test/unit/search/tc_sort.rb +48 -0
- data/test/unit/search/tc_sort_field.rb +27 -0
- data/test/unit/search/tc_spans.rb +153 -0
- data/test/unit/store/tc_fs_store.rb +84 -0
- data/test/unit/store/tc_ram_store.rb +35 -0
- data/test/unit/store/tm_store.rb +180 -0
- data/test/unit/store/tm_store_lock.rb +68 -0
- data/test/unit/ts_analysis.rb +16 -0
- data/test/unit/ts_document.rb +4 -0
- data/test/unit/ts_index.rb +18 -0
- data/test/unit/ts_query_parser.rb +3 -0
- data/test/unit/ts_search.rb +10 -0
- data/test/unit/ts_store.rb +6 -0
- data/test/unit/ts_utils.rb +10 -0
- data/test/unit/utils/tc_bit_vector.rb +65 -0
- data/test/unit/utils/tc_date_tools.rb +50 -0
- data/test/unit/utils/tc_number_tools.rb +59 -0
- data/test/unit/utils/tc_parameter.rb +40 -0
- data/test/unit/utils/tc_priority_queue.rb +62 -0
- data/test/unit/utils/tc_string_helper.rb +21 -0
- data/test/unit/utils/tc_weak_key_hash.rb +25 -0
- metadata +251 -0
data/MIT-LICENSE
ADDED
@@ -0,0 +1,20 @@
+Copyright (c) 2005 David Balmain
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README
ADDED
@@ -0,0 +1,109 @@
+= Ferret
+
+Ferret is a Ruby port of the Java Lucene search engine.
+(http://jakarta.apache.org/lucene/) In the same way as Lucene, it is not a
+standalone application, but a library you can use to index documents and
+search for things in them later.
+
+== Requirements
+
+* Ruby 1.8
+* (C compiler to build the extension but not required to use Ferret)
+
+== Installation
+
+De-compress the archive and enter its top directory.
+
+  tar zxpvf ferret-0.1.tar.gz
+  cd ferret-0.1
+
+Run the setup config;
+
+  $ ruby setup.rb config
+
+Then to compile the C extension (optional) type:
+
+  $ ruby setup.rb setup
+
+If you don't have a C compiler, never mind. Just go straight to the next step.
+On *nix you'll need to run this with root privileges. Type;
+
+  # ruby setup.rb install
+
+These simple steps install ferret in the default location of Ruby libraries.
+You can also install files into your favorite directory by supplying setup.rb
+some options. Try;
+
+  $ ruby setup.rb --help
+
+
+== Usage
+
+You can read the TUTORIAL which you'll find in the same directory as this
+README. You can also check the following modules for more specific
+documentation.
+
+* Ferret::Analysis: for more information on how the data is processed when it
+  is tokenized. There are a number of things you can do with your data such as
+  adding stop lists or perhaps a porter stemmer. There are also a number of
+  analyzers already available and it is almost trivial to create a new one
+  with a simple regular expression.
+
+* Ferret::Search: for more information on querying the index. There are a
+  number of already available queries and it's unlikely you'll need to create
+  your own. You may however want to take advantage of the sorting or filtering
+  abilities of Ferret to present your data the best way you see fit.
+
+* Ferret::Document: to find out how to create documents. This part of Ferret
+  is relatively straightforward. The main thing that we haven't gone into here
+  is the use of term vectors. These allow you to store and retrieve the
+  positions and offsets of the data which can be very useful in document
+  comparison among other things.
+
+* Ferret::QueryParser: if you want to find out more about what you can do with
+  Ferret's Query Parser, this is the place to look. The query parser is one
+  area that could use a bit of work so please send your suggestions.
+
+* Ferret::Index: for more advanced access to the index you'll probably want to
+  use the Ferret::Index::IndexWriter and Ferret::Index::IndexReader. This is
+  the place to look for more information on them.
+
+* Ferret::Store: This is the module used to access the actual index storage
+  and won't be of much interest to most people.
+
+=== Performance
+
+Currently Ferret is an order of magnitude slower than Java Lucene which can be
+quite a pain at times. I have written some basic C extensions which may or may
+not have installed when you installed Ferret. These double the speed but still
+leave it a lot slower than the Java version. I have, however, ported the
+indexing part of Java Lucene to C and it is an order of magnitude faster than
+the Java version. Once I'm pretty certain that the API of Ferret has settled
+and won't be changing much, I'll integrate my C version. So expect to see
+Ferret running faster than Java Lucene some time in the future. If you'd like
+to try cferret and test my claims, let me know (if you haven't already found
+it in my subversion repository). It's not currently portable and will probably
+only run on Linux.
+
+== Contact
+
+Bug reports, patches, queries, discussion etc should be addressed to
+the mailing list. More information on the list can be found at:
+
+http://ferret.davebalmain.com/
+
+Of course, since Ferret is almost a straight port of Java Lucene,
+everything said about Lucene at http://jakarta.apache.org/lucene/ should
+be true about Ferret. Apart from the bits about it being in Java.
+
+== Authors
+
+[<b>David Balmain</b>] Port to Ruby
+
+[<b>Doug Cutting and friends</b>] Original Java Lucene
+
+== License
+
+Ferret is available under an MIT-style license.
+
+:include: MIT-LICENSE
data/Rakefile
ADDED
@@ -0,0 +1,275 @@
+$:. << 'lib'
+# Some parts of this Rakefile were taken from Jim Weirich's Rakefile for
+# Rake. Other parts were stolen from David Heinemeier Hansson's Rails
+# Rakefile. Both are under MIT-LICENSE. Thanks to both for their excellent
+# projects.
+
+require 'rake'
+require 'rake/testtask'
+require 'rake/rdoctask'
+require 'rake/clean'
+require 'rake_utils/code_statistics'
+require 'lib/ferret'
+
+begin
+  require 'rubygems'
+  require 'rake/gempackagetask'
+rescue Exception
+  nil
+end
+
+CURRENT_VERSION = Ferret::VERSION
+if ENV['REL']
+  PKG_VERSION = ENV['REL']
+else
+  PKG_VERSION = CURRENT_VERSION
+end
+
+def announce(msg='')
+  STDERR.puts msg
+end
+
+$VERBOSE = nil
+CLEAN.include(FileList['**/*.o', 'InstalledFiles', '.config'])
+CLOBBER.include(FileList['**/*.so'], 'ext/Makefile')
+
+task :default => :all_tests
+desc "Run all tests"
+task :all_tests => [ :test_units, :test_functional ]
+
+desc "Generate API documentation, and show coding stats"
+task :doc => [ :stats, :appdoc ]
+
+desc "run unit tests in test/unit"
+Rake::TestTask.new("test_units" => :parsers) do |t|
+  t.libs << "test/unit"
+  t.pattern = 'test/unit/t[cs]_*.rb'
+  t.verbose = true
+end
+
+desc "run unit tests in test/unit"
+Rake::TestTask.new("test_long") do |t|
+  t.libs << "test"
+  t.libs << "test/unit"
+  t.test_files = FileList["test/longrunning/tm_store.rb"]
+  t.pattern = 'test/unit/t[cs]_*.rb'
+  t.verbose = true
+end
+
+desc "run functional tests in test/functional"
+Rake::TestTask.new("test_functional") do |t|
+  t.libs << "test"
+  t.pattern = 'test/functional/tc_*.rb'
+  t.verbose = true
+end
+
+desc "Report code statistics (KLOCS, etc) from application"
+task :stats do
+  CodeStatistics.new(
+    ["Ferret", "lib/ferret"],
+    ["Units", "test/unit"],
+    ["Units-extended", "test/longrunning"]
+  ).to_s
+end
+
+desc "Generate documentation for the application"
+rd = Rake::RDocTask.new("appdoc") do |rdoc|
+  rdoc.rdoc_dir = 'doc/api'
+  rdoc.title = "Ferret Search Library Documentation"
+  rdoc.options << '--line-numbers --inline-source'
+  rdoc.rdoc_files.include('README')
+  rdoc.rdoc_files.include('TODO')
+  rdoc.rdoc_files.include('TUTORIAL')
+  rdoc.rdoc_files.include('MIT-LICENSE')
+  rdoc.rdoc_files.include('lib/**/*.rb')
+end
+
+EXT = "ferret_ext.so"
+
+desc "Build the extension"
+task :ext => "ext/#{EXT}"
+
+file "ext/#{EXT}" => "ext/Makefile" do
+  sh "cd ext; make"
+end
+
+file "ext/Makefile" do
+  sh "cd ext; ruby extconf.rb"
+end
+
+# Make Parsers ---------------------------------------------------------------
+
+RACC_SRC = FileList["**/*.y"]
+RACC_OUT = RACC_SRC.collect { |fn| fn.sub(/\.y$/, '.tab.rb') }
+
+task :parsers => RACC_OUT
+rule(/\.tab\.rb$/ => [proc {|tn| tn.sub(/\.tab\.rb$/, '.y')}]) do |t|
+  sh "racc #{t.source}"
+end
+
+# Create Packages ------------------------------------------------------------
+
+PKG_FILES = FileList[
+  'setup.rb',
+  '[-A-Z]*',
+  'ext/**/*',
+  'lib/**/*.rb',
+  'test/**/*.rb',
+  'rake_utils/**/*.rb',
+  'Rakefile'
+]
+PKG_FILES.exclude('**/*.o')
+
+
+if ! defined?(Gem)
+  puts "Package Target requires RubyGEMs"
+else
+  spec = Gem::Specification.new do |s|
+
+    #### Basic information.
+
+    s.name = 'ferret'
+    s.version = PKG_VERSION
+    s.summary = "Ruby indexing library."
+    s.description = <<-EOF
+      Ferret is a port of the Java Lucene project. It is a powerful
+      indexing and search library.
+    EOF
+
+    #### Dependencies and requirements.
+
+    #s.add_dependency('log4r', '> 1.0.4')
+    #s.requirements << ""
+
+    #### Which files are to be included in this gem? Everything! (Except CVS directories.)
+
+    s.files = PKG_FILES.to_a
+
+    #### C code extensions.
+
+    s.extensions << "ext/extconf.rb"
+
+    #### Load-time details: library and application (you will need one or both).
+
+    s.require_path = 'lib' # Use these for libraries.
+
+    #s.bindir = "bin" # Use these for applications.
+    #s.executables = ["rake"]
+    #s.default_executable = "rake"
+
+    #### Documentation and testing.
+
+    s.has_rdoc = true
+    s.extra_rdoc_files = rd.rdoc_files.reject { |fn| fn =~ /\.rb$/ }.to_a
+    s.rdoc_options <<
+      '--title' << 'Ferret -- Ruby Indexer' <<
+      '--main' << 'README' << '--line-numbers' <<
+      'TUTORIAL' << 'TODO'
+
+    #### Author and project details.
+
+    s.author = "David Balmain"
+    s.email = "dbalmain@gmail.com"
+    s.homepage = "http://ferret.davebalmain.com"
+    s.rubyforge_project = "ferret"
+    # if ENV['CERT_DIR']
+    #   s.signing_key = File.join(ENV['CERT_DIR'], 'gem-private_key.pem')
+    #   s.cert_chain = [File.join(ENV['CERT_DIR'], 'gem-public_cert.pem')]
+    # end
+  end
+
+  package_task = Rake::GemPackageTask.new(spec) do |pkg|
+    pkg.need_zip = true
+    pkg.need_tar = true
+  end
+end
+
+# Support Tasks ------------------------------------------------------
+
+desc "Look for TODO and FIXME tags in the code"
+task :todo do
+  FileList['**/*.rb'].egrep /#.*(FIXME|TODO|TBD)/
+end
+# --------------------------------------------------------------------
+# Creating a release
+
+desc "Make a new release"
+task :prerelease => [:clobber, :all_tests, :parsers]
+task :package => [:prerelease]
+task :tag => [:prerelease]
+task :update_version => [:prerelease]
+task :release => [:tag, :update_version, :package] do
+  announce
+  announce "**************************************************************"
+  announce "* Release #{PKG_VERSION} Complete."
+  announce "* Packages ready to upload."
+  announce "**************************************************************"
+  announce
+end
+
+# Validate that everything is ready to go for a release.
+task :prerelease do
+  announce
+  announce "**************************************************************"
+  announce "* Making RubyGem Release #{PKG_VERSION}"
+  announce "* (current version #{CURRENT_VERSION})"
+  announce "**************************************************************"
+  announce
+
+  # Is a release number supplied?
+  unless ENV['REL']
+    fail "Usage: rake release REL=x.y.z [REUSE=tag_suffix]"
+  end
+
+  # Is the release different than the current release.
+  # (or is REUSE set?)
+  if PKG_VERSION == CURRENT_VERSION && ! ENV['REUSE']
+    fail "Current version is #{PKG_VERSION}, must specify REUSE=tag_suffix to reuse version"
+  end
+
+  # Are all source files checked in?
+  data = `svn -q status`
+  unless data =~ /^$/
+    fail "'svn -q status' is not clean ... do you have unchecked-in files?"
+  end
+
+  announce "No outstanding checkins found ... OK"
+end
+
+task :update_version => [:prerelease] do
+  if PKG_VERSION == CURRENT_VERSION
+    announce "No version change ... skipping version update"
+  else
+    announce "Updating Ferret version to #{PKG_VERSION}"
+    open("lib/ferret.rb") do |ferret_in|
+      open("lib/ferret.rb.new", "w") do |ferret_out|
+        ferret_in.each do |line|
+          if line =~ /^ VERSION\s*=\s*/
+            ferret_out.puts " VERSION = '#{PKG_VERSION}'"
+          else
+            ferret_out.puts line
+          end
+        end
+      end
+    end
+    if ENV['RELTEST']
+      announce "Release Task Testing, skipping committing of new version"
+    else
+      mv "lib/ferret.rb.new", "lib/ferret.rb"
+    end
+    sh %{svn ci -m "Updated to version #{PKG_VERSION}" lib/ferret.rb}
+  end
+end
+
+desc "Tag all the SVN files with the latest release number (REL=x.y.z)"
+task :tag => [:prerelease] do
+  reltag = "REL-#{PKG_VERSION}"
+  reltag << ENV['REUSE'] if ENV['REUSE']
+  announce "Tagging SVN with [#{reltag}]"
+  if ENV['RELTEST']
+    announce "Release Task Testing, skipping SVN tagging. Would do the following;"
+    announce %{svn copy -m "creating release #{reltag}" svn://www.davebalmain.com/ferret/trunk svn://www.davebalmain.com/ferret/tags/#{reltag}}
+  else
+    sh %{svn copy -m "creating release #{reltag}" svn://www.davebalmain.com/ferret/trunk svn://www.davebalmain.com/ferret/tags/#{reltag}}
+  end
+end
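The `update_version` task in the Rakefile rewrites the `VERSION` line of lib/ferret.rb while copying every other line unchanged. The same transformation can be sketched as a pure function on a string, which is easier to try out without touching files or Subversion. This is only an illustration: `bump_version` is a hypothetical name, and the regexp is loosened to accept any leading indentation.

```ruby
# Hypothetical helper mirroring the loop in the :update_version task:
# replace the VERSION assignment, copy every other line unchanged.
def bump_version(source, new_version)
  source.lines.map do |line|
    if line =~ /^\s*VERSION\s*=\s*/
      "  VERSION = '#{new_version}'\n"
    else
      line
    end
  end.join
end

source = "module Ferret\n  VERSION = '0.1.0'\nend\n"
puts bump_version(source, "0.1.1")
```

Keeping the rewrite separate from the file handling and the `svn ci` call would also make a dry run (as with `RELTEST`) simpler to reason about.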
data/TODO
ADDED
data/TUTORIAL
ADDED
@@ -0,0 +1,197 @@
+= Quick Introduction to Ferret
+
+The simplest way to use Ferret is through the Ferret::Index::Index class.
+Start by including the Ferret module.
+
+  require 'ferret'
+  include Ferret
+
+=== Creating an index
+
+Creating an in-memory index is very simple;
+
+  index = Index::Index.new()
+
+To create a persistent index;
+
+  index = Index::Index.new(:path => '/path/to/index')
+
+Both of these methods create new Indexes with the StandardAnalyzer. An
+analyzer is what you use to divide the input data up into tokens which you can
+search for later. If you'd like to use a different analyzer you can specify it
+here, eg;
+
+  index = Index::Index.new(:path => '/path/to/index',
+                           :analyzer => WhiteSpaceAnalyzer.new)
+
+For more options when creating an Index refer to Ferret::Index::Index.
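To make the idea of an analyzer concrete, here is a minimal plain-Ruby sketch of tokenization. It does not use Ferret's classes; `whitespace_tokens` and `letter_tokens` are hypothetical helpers that only illustrate how the choice of tokenization rule changes which tokens become searchable.

```ruby
# Hypothetical helpers, not part of Ferret: they only illustrate how
# different tokenization rules turn the same input into different tokens.

# Split on runs of whitespace and keep everything else untouched.
def whitespace_tokens(text)
  text.split(/\s+/).reject { |t| t.empty? }
end

# Keep only runs of ASCII letters and lower-case them.
def letter_tokens(text)
  text.scan(/[A-Za-z]+/).map { |t| t.downcase }
end

text = "Quick-Brown fox, jumped!"
p whitespace_tokens(text)  # whitespace-separated chunks, punctuation kept
p letter_tokens(text)      # letter runs only, lower-cased
```

With the first rule a search for "fox" would miss the token "fox,", which is the kind of difference to keep in mind when picking an analyzer.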
+
+=== Adding Documents
+
+To add a document you can simply add a string or an array of strings.
+
+  index << "This is a new document to be indexed"
+  index << ["And here", "is another", "new document", "to be indexed"]
+
+But these are pretty simple documents. If this is all you want to index you
+could probably just use SimpleSearch. So let's give our documents some fields;
+
+  index << {:title => "Programming Ruby", :content => "blah blah blah"}
+  index << {:title => "Programming Ruby", :content => "yada yada yada"}
+
+Or if you are indexing data stored in a database, you'll probably want to
+store the id;
+
+  index << {:id => row.id, :title => row.title, :date => row.date}
+
+The methods above will store all of the input data as well as tokenizing and
+indexing it. Sometimes we won't want to tokenize (divide the string into
+tokens) the data. For example, we might want to leave the title as a complete
+string and only allow searches for that complete string. Sometimes we won't
+want to store the data as it's already stored in the database so it'll be a
+waste to store it in the index. Or perhaps we are doing without a database and
+using Ferret to store all of our data, in which case we might not want to
+index it. For example, if we are storing images in the index, we won't want to
+index them. All of this can be done using Ferret's Ferret::Document module.
+eg;
+
+  include Ferret::Document
+  doc = Document.new
+  doc << Field.new("id", row.id, Field::Store::NO, Field::Index::UNTOKENIZED)
+  doc << Field.new("title", row.title, Field::Store::YES, Field::Index::UNTOKENIZED)
+  doc << Field.new("data", row.data, Field::Store::YES, Field::Index::TOKENIZED)
+  doc << Field.new("image", row.image, Field::Store::YES, Field::Index::NO)
+  index << doc
+
+You can also compress the data that you are storing or store term vectors with
+the data. Read more about this in Ferret::Document::Field.
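The store and index choices described above can be pictured with a toy model. The sketch below is plain Ruby and does not use Ferret's Field class; `ToyField` and `indexed_terms` are hypothetical names, and the lower-casing in the tokenized branch is an assumption about what an analyzer might do.

```ruby
# Toy model, not Ferret's Field class: each field records whether its value
# is stored (retrievable later) and how it is indexed (searchable or not).
ToyField = Struct.new(:name, :value, :stored, :index_mode)

# Terms a field contributes to the search index under each mode.
def indexed_terms(field)
  case field.index_mode
  when :tokenized   then field.value.downcase.split(/\s+/)
  when :untokenized then [field.value]   # the whole string is one term
  else []                                # :no means not searchable at all
  end
end

fields = [
  ToyField.new("id",    "42",               false, :untokenized),
  ToyField.new("title", "Programming Ruby", true,  :untokenized),
  ToyField.new("data",  "blah blah blah",   true,  :tokenized),
  ToyField.new("image", "<binary data>",    true,  :no)
]

fields.each do |f|
  puts "#{f.name}: stored=#{f.stored} terms=#{indexed_terms(f).inspect}"
end
```

The untokenized title can only be matched as the complete string, while the image is retrievable but contributes no terms.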
+
+=== Searching
+
+Now that we have data in our index, how do we actually use this index to
+search the data? The Index offers two search methods, Index#search and
+Index#search_each. The first method returns a Ferret::Search::TopDocs object.
+The second we'll show here. Let's say we wanted to find all documents with the
+phrase "quick brown fox" in the content field. We'd write;
+
+  index.search_each('content:"quick brown fox"') do |doc, score|
+    puts "Document #{doc} found with a score of #{score}"
+  end
+
+But "fast" has a pretty similar meaning to "quick" and we don't mind if the
+fox is a little red. So we could expand our search like this;
+
+  index.search_each('content:"quick|fast brown|red fox"') do |doc, score|
+    puts "Document #{doc} found with a score of #{score}"
+  end
+
+What if we want to find all documents entered on or after 5th of September,
+2005 with the words "ruby" or "rails" in it. We could type something like;
+
+  index.search_each('date:( >= 20050905) content:(ruby OR rails)') do |doc, score|
+    puts "Document #{doc} found with a score of #{score}"
+  end
+
+Ferret has quite a complex query language. To find out more about Ferret's
+query language, see Ferret::QueryParser. You can also construct even more
+complex queries like Ferret::Search::Spans by hand. See Ferret::Search::Query
+for more information.
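A range such as `>= 20050905` can work on a date field if dates are written as fixed-width YYYYMMDD strings, because such strings sort in the same order as the dates they represent. The sketch below assumes that encoding (the query above suggests it, but the tutorial does not say how `row.date` is formatted) and uses plain Ruby rather than Ferret's query parser; `date_key` is a hypothetical helper.

```ruby
# Assumption: dates are indexed as fixed-width "YYYYMMDD" strings.
def date_key(time)
  time.strftime('%Y%m%d')
end

docs = [
  { :title => "old post",  :date => date_key(Time.utc(2005, 8, 31)) },
  { :title => "same day",  :date => date_key(Time.utc(2005, 9, 5)) },
  { :title => "newer one", :date => date_key(Time.utc(2005, 10, 1)) }
]

# Plain string comparison is enough because the keys are zero-padded.
recent = docs.select { |d| d[:date] >= "20050905" }
p recent.map { |d| d[:title] }
```

Without zero-padding ("2005-9-5" style keys) the string order and the date order would diverge, so the fixed width is the important part.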
+
+=== Accessing Documents
+
+You may have noticed that when we run a search we only get the document number
+back. By itself this isn't much use to us. Getting the data from the index is
+very straightforward. For example if we want the title field from the 3rd
+document type;
+
+  index[2]["title"]
+
+NOTE: documents are indexed from 0.
+
+Let's go back to the database example above. If we store all of our documents
+with an id then we can access that field using the id. As long as we called
+our id field "id" we can do this;
+
+  id = "89721347"
+  index[id]["title"]
+
+If however we called our id field "key" we'll have to do this;
+
+  id = Index::Term.new("key", "89721347")
+  index[id]["title"]
+
+Pretty simple huh? You should note though that if there is more than one
+document with the same *id* or *key* then only the first one will be returned
+so it is probably better that you ensure the key is unique somehow. (Ferret
+cannot do that for you)
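Since only the first match comes back when keys collide, one way to keep keys unique is to check before adding. The sketch below is a toy built on a plain Hash, not Ferret's API; `UniqueKeyStore` is a hypothetical name and a real application would have to perform the equivalent check against its own index or database.

```ruby
# Hypothetical wrapper, not part of Ferret: refuses a second document
# with a key that is already present.
class UniqueKeyStore
  def initialize
    @docs = {}
  end

  # Returns true if the document was added, false if the key was taken.
  def add(key, doc)
    return false if @docs.key?(key)
    @docs[key] = doc
    true
  end

  def [](key)
    @docs[key]
  end
end

store = UniqueKeyStore.new
store.add("89721347", { "title" => "Programming Ruby" })
store.add("89721347", { "title" => "Duplicate" })   # rejected, key exists
p store["89721347"]["title"]
```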
+
+
+=== Modifying and Deleting Documents
+
+What if we want to change the data in the index? Ferret doesn't actually let
+you change the data once it is in the index. But you can delete documents so
+the standard way to modify data is to delete it and re-add it again with the
+modifications made. It is important to note that when doing this the documents
+will get a new document number so you should be careful not to use a document
+number after the document has been deleted. Here is an example of modifying a
+document;
+
+  index << {:title => "Programing Rbuy", :content => "blah blah blah"}
+  doc_num = nil
+  index.search_each('title:"Programing Rbuy"') {|doc, score| doc_num = doc}
+  return unless doc_num
+  doc = index[doc_num]
+  index.delete(doc_num)
+
+  # modify doc
+  doc["title"] = "Programming Ruby"
+
+  index << doc
+
+Again, we can use the id field as above. This time though every document
+that matches the id will be deleted. Again, it is probably a good idea if you
+somehow ensure that your *ids* are kept unique.
+
+  id = "23453422"
+  index.delete(id)
+
+Or;
+
+  id = Index::Term.new("key", "23452345")
+  index.delete(id)
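The warning about document numbers can be illustrated with a toy append-only store. This is plain Ruby, not Ferret's Index; `ToyIndex` is hypothetical and only models the behaviour the tutorial describes, namely that a re-added document receives a new number and the old number must not be reused.

```ruby
# Toy append-only store, not Ferret's Index: deleting marks a slot as gone,
# and re-adding always appends, so the document gets a new number.
class ToyIndex
  def initialize
    @docs = []
  end

  # Appends a document and returns its document number.
  def <<(doc)
    @docs << doc
    @docs.size - 1
  end

  def [](doc_num)
    @docs[doc_num]
  end

  def delete(doc_num)
    @docs[doc_num] = nil
  end
end

index = ToyIndex.new
old_num = (index << { "title" => "Programing Rbuy" })
doc = index[old_num].dup
index.delete(old_num)
doc["title"] = "Programming Ruby"
new_num = (index << doc)
puts "old number #{old_num}, new number #{new_num}"
```

Any code that cached `old_num` before the update would now point at a deleted slot, which is why looking documents up by a unique id field is the safer habit.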
+
+=== Onwards
+
+This is just a small sampling of what Ferret allows you to do. Ferret, like
+Lucene, is designed to be extended, and allows you to construct your own query
+types, analyzers, and so on. Future versions of Ferret will contain more of
+these, as well as instructions for how to subclass the base modules to create
+your own. For now you can look in the following places for more documentation;
+
+* Ferret::Analysis: for more information on how the data is processed when it
+  is tokenized. There are a number of things you can do with your data such as
+  adding stop lists or perhaps a porter stemmer. There are also a number of
+  analyzers already available and it is almost trivial to create a new one
+  with a simple regular expression.
+
+* Ferret::Search: for more information on querying the index. There are a
+  number of already available queries and it's unlikely you'll need to create
+  your own. You may however want to take advantage of the sorting or filtering
+  abilities of Ferret to present your data the best way you see fit.
+
+* Ferret::Document: to find out how to create documents. This part of Ferret
+  is relatively straightforward. The main thing that we haven't gone into here
+  is the use of term vectors. These allow you to store and retrieve the
+  positions and offsets of the data which can be very useful in document
+  comparison among other things.
+
+* Ferret::QueryParser: if you want to find out more about what you can do with
+  Ferret's Query Parser, this is the place to look. The query parser is one
+  area that could use a bit of work so please send your suggestions.
+
+* Ferret::Index: for more advanced access to the index you'll probably want to
+  use the Ferret::Index::IndexWriter and Ferret::Index::IndexReader. This is
+  the place to look for more information on them.
+
+* Ferret::Store: This is the module used to access the actual index storage
+  and won't be of much interest to most people.