ferret 0.1.0
- data/MIT-LICENSE +20 -0
- data/README +109 -0
- data/Rakefile +275 -0
- data/TODO +9 -0
- data/TUTORIAL +197 -0
- data/ext/extconf.rb +3 -0
- data/ext/ferret.c +23 -0
- data/ext/ferret.h +85 -0
- data/ext/index_io.c +543 -0
- data/ext/priority_queue.c +227 -0
- data/ext/ram_directory.c +316 -0
- data/ext/segment_merge_queue.c +41 -0
- data/ext/string_helper.c +42 -0
- data/ext/tags +240 -0
- data/ext/term.c +261 -0
- data/ext/term_buffer.c +299 -0
- data/ext/util.c +12 -0
- data/lib/ferret.rb +41 -0
- data/lib/ferret/analysis.rb +11 -0
- data/lib/ferret/analysis/analyzers.rb +93 -0
- data/lib/ferret/analysis/standard_tokenizer.rb +65 -0
- data/lib/ferret/analysis/token.rb +79 -0
- data/lib/ferret/analysis/token_filters.rb +86 -0
- data/lib/ferret/analysis/token_stream.rb +26 -0
- data/lib/ferret/analysis/tokenizers.rb +107 -0
- data/lib/ferret/analysis/word_list_loader.rb +27 -0
- data/lib/ferret/document.rb +2 -0
- data/lib/ferret/document/document.rb +152 -0
- data/lib/ferret/document/field.rb +304 -0
- data/lib/ferret/index.rb +26 -0
- data/lib/ferret/index/compound_file_io.rb +343 -0
- data/lib/ferret/index/document_writer.rb +288 -0
- data/lib/ferret/index/field_infos.rb +259 -0
- data/lib/ferret/index/fields_io.rb +175 -0
- data/lib/ferret/index/index.rb +228 -0
- data/lib/ferret/index/index_file_names.rb +33 -0
- data/lib/ferret/index/index_reader.rb +462 -0
- data/lib/ferret/index/index_writer.rb +488 -0
- data/lib/ferret/index/multi_reader.rb +363 -0
- data/lib/ferret/index/multiple_term_doc_pos_enum.rb +105 -0
- data/lib/ferret/index/segment_infos.rb +130 -0
- data/lib/ferret/index/segment_merge_info.rb +47 -0
- data/lib/ferret/index/segment_merge_queue.rb +16 -0
- data/lib/ferret/index/segment_merger.rb +337 -0
- data/lib/ferret/index/segment_reader.rb +380 -0
- data/lib/ferret/index/segment_term_enum.rb +178 -0
- data/lib/ferret/index/segment_term_vector.rb +58 -0
- data/lib/ferret/index/term.rb +49 -0
- data/lib/ferret/index/term_buffer.rb +88 -0
- data/lib/ferret/index/term_doc_enum.rb +283 -0
- data/lib/ferret/index/term_enum.rb +52 -0
- data/lib/ferret/index/term_info.rb +41 -0
- data/lib/ferret/index/term_infos_io.rb +312 -0
- data/lib/ferret/index/term_vector_offset_info.rb +20 -0
- data/lib/ferret/index/term_vectors_io.rb +552 -0
- data/lib/ferret/query_parser.rb +274 -0
- data/lib/ferret/query_parser/query_parser.tab.rb +819 -0
- data/lib/ferret/search.rb +49 -0
- data/lib/ferret/search/boolean_clause.rb +100 -0
- data/lib/ferret/search/boolean_query.rb +303 -0
- data/lib/ferret/search/boolean_scorer.rb +294 -0
- data/lib/ferret/search/caching_wrapper_filter.rb +40 -0
- data/lib/ferret/search/conjunction_scorer.rb +99 -0
- data/lib/ferret/search/disjunction_sum_scorer.rb +203 -0
- data/lib/ferret/search/exact_phrase_scorer.rb +32 -0
- data/lib/ferret/search/explanation.rb +41 -0
- data/lib/ferret/search/field_cache.rb +216 -0
- data/lib/ferret/search/field_doc.rb +31 -0
- data/lib/ferret/search/field_sorted_hit_queue.rb +184 -0
- data/lib/ferret/search/filter.rb +11 -0
- data/lib/ferret/search/filtered_query.rb +130 -0
- data/lib/ferret/search/filtered_term_enum.rb +79 -0
- data/lib/ferret/search/fuzzy_query.rb +153 -0
- data/lib/ferret/search/fuzzy_term_enum.rb +244 -0
- data/lib/ferret/search/hit_collector.rb +34 -0
- data/lib/ferret/search/hit_queue.rb +11 -0
- data/lib/ferret/search/index_searcher.rb +173 -0
- data/lib/ferret/search/match_all_docs_query.rb +104 -0
- data/lib/ferret/search/multi_phrase_query.rb +204 -0
- data/lib/ferret/search/multi_term_query.rb +65 -0
- data/lib/ferret/search/non_matching_scorer.rb +22 -0
- data/lib/ferret/search/phrase_positions.rb +55 -0
- data/lib/ferret/search/phrase_query.rb +217 -0
- data/lib/ferret/search/phrase_scorer.rb +153 -0
- data/lib/ferret/search/prefix_query.rb +47 -0
- data/lib/ferret/search/query.rb +111 -0
- data/lib/ferret/search/query_filter.rb +51 -0
- data/lib/ferret/search/range_filter.rb +103 -0
- data/lib/ferret/search/range_query.rb +139 -0
- data/lib/ferret/search/req_excl_scorer.rb +125 -0
- data/lib/ferret/search/req_opt_sum_scorer.rb +70 -0
- data/lib/ferret/search/score_doc.rb +38 -0
- data/lib/ferret/search/score_doc_comparator.rb +114 -0
- data/lib/ferret/search/scorer.rb +91 -0
- data/lib/ferret/search/similarity.rb +278 -0
- data/lib/ferret/search/sloppy_phrase_scorer.rb +47 -0
- data/lib/ferret/search/sort.rb +105 -0
- data/lib/ferret/search/sort_comparator.rb +60 -0
- data/lib/ferret/search/sort_field.rb +87 -0
- data/lib/ferret/search/spans.rb +12 -0
- data/lib/ferret/search/spans/near_spans_enum.rb +304 -0
- data/lib/ferret/search/spans/span_first_query.rb +79 -0
- data/lib/ferret/search/spans/span_near_query.rb +108 -0
- data/lib/ferret/search/spans/span_not_query.rb +130 -0
- data/lib/ferret/search/spans/span_or_query.rb +176 -0
- data/lib/ferret/search/spans/span_query.rb +25 -0
- data/lib/ferret/search/spans/span_scorer.rb +74 -0
- data/lib/ferret/search/spans/span_term_query.rb +105 -0
- data/lib/ferret/search/spans/span_weight.rb +84 -0
- data/lib/ferret/search/spans/spans_enum.rb +44 -0
- data/lib/ferret/search/term_query.rb +128 -0
- data/lib/ferret/search/term_scorer.rb +181 -0
- data/lib/ferret/search/top_docs.rb +24 -0
- data/lib/ferret/search/top_field_docs.rb +17 -0
- data/lib/ferret/search/weight.rb +54 -0
- data/lib/ferret/search/wildcard_query.rb +26 -0
- data/lib/ferret/search/wildcard_term_enum.rb +61 -0
- data/lib/ferret/stemmers.rb +1 -0
- data/lib/ferret/stemmers/porter_stemmer.rb +218 -0
- data/lib/ferret/store.rb +5 -0
- data/lib/ferret/store/buffered_index_io.rb +191 -0
- data/lib/ferret/store/directory.rb +139 -0
- data/lib/ferret/store/fs_store.rb +338 -0
- data/lib/ferret/store/index_io.rb +259 -0
- data/lib/ferret/store/ram_store.rb +282 -0
- data/lib/ferret/utils.rb +7 -0
- data/lib/ferret/utils/bit_vector.rb +105 -0
- data/lib/ferret/utils/date_tools.rb +138 -0
- data/lib/ferret/utils/number_tools.rb +91 -0
- data/lib/ferret/utils/parameter.rb +41 -0
- data/lib/ferret/utils/priority_queue.rb +120 -0
- data/lib/ferret/utils/string_helper.rb +47 -0
- data/lib/ferret/utils/weak_key_hash.rb +51 -0
- data/rake_utils/code_statistics.rb +106 -0
- data/setup.rb +1551 -0
- data/test/benchmark/tb_ram_store.rb +76 -0
- data/test/benchmark/tb_rw_vint.rb +26 -0
- data/test/longrunning/tc_numbertools.rb +60 -0
- data/test/longrunning/tm_store.rb +19 -0
- data/test/test_all.rb +9 -0
- data/test/test_helper.rb +6 -0
- data/test/unit/analysis/tc_analyzer.rb +21 -0
- data/test/unit/analysis/tc_letter_tokenizer.rb +20 -0
- data/test/unit/analysis/tc_lower_case_filter.rb +20 -0
- data/test/unit/analysis/tc_lower_case_tokenizer.rb +27 -0
- data/test/unit/analysis/tc_per_field_analyzer_wrapper.rb +39 -0
- data/test/unit/analysis/tc_porter_stem_filter.rb +16 -0
- data/test/unit/analysis/tc_standard_analyzer.rb +20 -0
- data/test/unit/analysis/tc_standard_tokenizer.rb +20 -0
- data/test/unit/analysis/tc_stop_analyzer.rb +20 -0
- data/test/unit/analysis/tc_stop_filter.rb +14 -0
- data/test/unit/analysis/tc_white_space_analyzer.rb +21 -0
- data/test/unit/analysis/tc_white_space_tokenizer.rb +20 -0
- data/test/unit/analysis/tc_word_list_loader.rb +32 -0
- data/test/unit/document/tc_document.rb +47 -0
- data/test/unit/document/tc_field.rb +80 -0
- data/test/unit/index/tc_compound_file_io.rb +107 -0
- data/test/unit/index/tc_field_infos.rb +119 -0
- data/test/unit/index/tc_fields_io.rb +167 -0
- data/test/unit/index/tc_index.rb +140 -0
- data/test/unit/index/tc_index_reader.rb +622 -0
- data/test/unit/index/tc_index_writer.rb +57 -0
- data/test/unit/index/tc_multiple_term_doc_pos_enum.rb +80 -0
- data/test/unit/index/tc_segment_infos.rb +74 -0
- data/test/unit/index/tc_segment_term_docs.rb +17 -0
- data/test/unit/index/tc_segment_term_enum.rb +60 -0
- data/test/unit/index/tc_segment_term_vector.rb +71 -0
- data/test/unit/index/tc_term.rb +22 -0
- data/test/unit/index/tc_term_buffer.rb +57 -0
- data/test/unit/index/tc_term_info.rb +19 -0
- data/test/unit/index/tc_term_infos_io.rb +192 -0
- data/test/unit/index/tc_term_vector_offset_info.rb +18 -0
- data/test/unit/index/tc_term_vectors_io.rb +108 -0
- data/test/unit/index/th_doc.rb +244 -0
- data/test/unit/query_parser/tc_query_parser.rb +84 -0
- data/test/unit/search/tc_filter.rb +113 -0
- data/test/unit/search/tc_fuzzy_query.rb +136 -0
- data/test/unit/search/tc_index_searcher.rb +188 -0
- data/test/unit/search/tc_search_and_sort.rb +98 -0
- data/test/unit/search/tc_similarity.rb +37 -0
- data/test/unit/search/tc_sort.rb +48 -0
- data/test/unit/search/tc_sort_field.rb +27 -0
- data/test/unit/search/tc_spans.rb +153 -0
- data/test/unit/store/tc_fs_store.rb +84 -0
- data/test/unit/store/tc_ram_store.rb +35 -0
- data/test/unit/store/tm_store.rb +180 -0
- data/test/unit/store/tm_store_lock.rb +68 -0
- data/test/unit/ts_analysis.rb +16 -0
- data/test/unit/ts_document.rb +4 -0
- data/test/unit/ts_index.rb +18 -0
- data/test/unit/ts_query_parser.rb +3 -0
- data/test/unit/ts_search.rb +10 -0
- data/test/unit/ts_store.rb +6 -0
- data/test/unit/ts_utils.rb +10 -0
- data/test/unit/utils/tc_bit_vector.rb +65 -0
- data/test/unit/utils/tc_date_tools.rb +50 -0
- data/test/unit/utils/tc_number_tools.rb +59 -0
- data/test/unit/utils/tc_parameter.rb +40 -0
- data/test/unit/utils/tc_priority_queue.rb +62 -0
- data/test/unit/utils/tc_string_helper.rb +21 -0
- data/test/unit/utils/tc_weak_key_hash.rb +25 -0
- metadata +251 -0
data/MIT-LICENSE
ADDED
@@ -0,0 +1,20 @@
+Copyright (c) 2005 David Balmain
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README
ADDED
@@ -0,0 +1,109 @@
+= Ferret
+
+Ferret is a Ruby port of the Java Lucene search engine
+(http://jakarta.apache.org/lucene/). In the same way as Lucene, it is not a
+standalone application, but a library you can use to index documents and
+search for things in them later.
+
+== Requirements
+
+* Ruby 1.8
+* A C compiler, to build the optional extension (not required to use Ferret)
+
+== Installation
+
+De-compress the archive and enter its top directory.
+
+  tar zxpvf ferret-0.1.tar.gz
+  cd ferret-0.1
+
+Run the setup config;
+
+  $ ruby setup.rb config
+
+Then to compile the C extension (optional) type:
+
+  $ ruby setup.rb setup
+
+If you don't have a C compiler, never mind. Just go straight to the next step.
+On *nix you'll need to run this with root privileges. Type;
+
+  # ruby setup.rb install
+
+These simple steps install ferret in the default location of Ruby libraries.
+You can also install files into your favorite directory by supplying setup.rb
+some options. Try;
+
+  $ ruby setup.rb --help
+
+
+== Usage
+
+You can read the TUTORIAL which you'll find in the same directory as this
+README. You can also check the following modules for more specific
+documentation.
+
+* Ferret::Analysis: for more information on how the data is processed when it
+  is tokenized. There are a number of things you can do with your data such as
+  adding stop lists or perhaps a porter stemmer. There are also a number of
+  analyzers already available and it is almost trivial to create a new one
+  with a simple regular expression.
+
+* Ferret::Search: for more information on querying the index. There are a
+  number of already available queries and it's unlikely you'll need to create
+  your own. You may however want to take advantage of the sorting or filtering
+  abilities of Ferret to present your data the best way you see fit.
+
+* Ferret::Document: to find out how to create documents. This part of Ferret
+  is relatively straightforward. The main thing that we haven't gone into here
+  is the use of term vectors. These allow you to store and retrieve the
+  positions and offsets of the data which can be very useful in document
+  comparison among other things.
+
+* Ferret::QueryParser: if you want to find out more about what you can do with
+  Ferret's Query Parser, this is the place to look. The query parser is one
+  area that could use a bit of work so please send your suggestions.
+
+* Ferret::Index: for more advanced access to the index you'll probably want to
+  use the Ferret::Index::IndexWriter and Ferret::Index::IndexReader. This is
+  the place to look for more information on them.
+
+* Ferret::Store: This is the module used to access the actual index storage
+  and won't be of much interest to most people.
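The Ferret::Analysis entry above mentions stop lists and building an analyzer from a simple regular expression. The following is a rough sketch of that idea in plain Ruby, so it runs without Ferret installed. ToyRegexAnalyzer, its default pattern and its stop list are hypothetical names chosen for illustration; they are not Ferret classes, and Ferret's own analyzers may well work differently.

```ruby
# Illustrative only: a regex-driven analyzer with a stop list, in plain Ruby.
# ToyRegexAnalyzer and STOP_WORDS are hypothetical, not part of Ferret.
class ToyRegexAnalyzer
  STOP_WORDS = %w[a an and the of to in is].freeze

  def initialize(pattern = /[[:alpha:]]+/, stop_words = STOP_WORDS)
    @pattern = pattern
    @stop_words = stop_words
  end

  # Split the text into tokens with the pattern, lower-case them,
  # then drop every token that appears in the stop list.
  def tokens(text)
    text.scan(@pattern).map { |t| t.downcase }.reject { |t| @stop_words.include?(t) }
  end
end

analyzer = ToyRegexAnalyzer.new
p analyzer.tokens("The Quick brown fox, in a hurry")
# prints ["quick", "brown", "fox", "hurry"]
```

Passing a different pattern, for example /\S+/, would give whitespace-style tokenization with the same stop list handling.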
+
+=== Performance
+
+Currently Ferret is an order of magnitude slower than Java Lucene which can be
+quite a pain at times. I have written some basic C extensions which may or may
+not have been installed when you installed Ferret. These double the speed but
+still leave it a lot slower than the Java version. I have, however, ported the
+indexing part of Java Lucene to C and it is an order of magnitude faster than
+the Java version. Once I'm pretty certain that the API of Ferret has settled
+and won't be changing much, I'll integrate my C version. So expect to see
+Ferret running faster than Java Lucene some time in the future. If you'd like
+to try cferret and test my claims, let me know (if you haven't already found
+it in my Subversion repository). It's not currently portable and will probably
+only run on Linux.
+
+== Contact
+
+Bug reports, patches, queries, discussion etc should be addressed to
+the mailing list. More information on the list can be found at:
+
+http://ferret.davebalmain.com/
+
+Of course, since Ferret is almost a straight port of Java Lucene,
+everything said about Lucene at http://jakarta.apache.org/lucene/ should
+be true about Ferret. Apart from the bits about it being in Java.
+
+== Authors
+
+[<b>David Balmain</b>] Port to Ruby
+
+[<b>Doug Cutting and friends</b>] Original Java Lucene
+
+== License
+
+Ferret is available under an MIT-style license.
+
+:include: MIT-LICENSE
data/Rakefile
ADDED
@@ -0,0 +1,275 @@
+$:. << 'lib'
+# Some parts of this Rakefile were taken from Jim Weirich's Rakefile for
+# Rake. Other parts were stolen from David Heinemeier Hansson's Rails
+# Rakefile. Both are under MIT-LICENSE. Thanks to both for their excellent
+# projects.
+
+require 'rake'
+require 'rake/testtask'
+require 'rake/rdoctask'
+require 'rake/clean'
+require 'rake_utils/code_statistics'
+require 'lib/ferret'
+
+begin
+  require 'rubygems'
+  require 'rake/gempackagetask'
+rescue Exception
+  nil
+end
+
+CURRENT_VERSION = Ferret::VERSION
+if ENV['REL']
+  PKG_VERSION = ENV['REL']
+else
+  PKG_VERSION = CURRENT_VERSION
+end
+
+def announce(msg='')
+  STDERR.puts msg
+end
+
+$VERBOSE = nil
+CLEAN.include(FileList['**/*.o', 'InstalledFiles', '.config'])
+CLOBBER.include(FileList['**/*.so'], 'ext/Makefile')
+
+task :default => :all_tests
+desc "Run all tests"
+task :all_tests => [ :test_units, :test_functional ]
+
+desc "Generate API documentation, and show coding stats"
+task :doc => [ :stats, :appdoc ]
+
+desc "run unit tests in test/unit"
+Rake::TestTask.new("test_units" => :parsers) do |t|
+  t.libs << "test/unit"
+  t.pattern = 'test/unit/t[cs]_*.rb'
+  t.verbose = true
+end
+
+desc "run unit tests in test/unit"
+Rake::TestTask.new("test_long") do |t|
+  t.libs << "test"
+  t.libs << "test/unit"
+  t.test_files = FileList["test/longrunning/tm_store.rb"]
+  t.pattern = 'test/unit/t[cs]_*.rb'
+  t.verbose = true
+end
+
+desc "run funtional tests in test/funtional"
+Rake::TestTask.new("test_functional") do |t|
+  t.libs << "test"
+  t.pattern = 'test/funtional/tc_*.rb'
+  t.verbose = true
+end
+
+desc "Report code statistics (KLOCS, etc) from application"
+task :stats do
+  CodeStatistics.new(
+    ["Ferret", "lib/ferret"],
+    ["Units", "test/unit"],
+    ["Units-extended", "test/longrunning"]
+  ).to_s
+end
+
+desc "Generate documentation for the application"
+rd = Rake::RDocTask.new("appdoc") do |rdoc|
+  rdoc.rdoc_dir = 'doc/api'
+  rdoc.title = "Ferret Search Library Documentation"
+  rdoc.options << '--line-numbers --inline-source'
+  rdoc.rdoc_files.include('README')
+  rdoc.rdoc_files.include('TODO')
+  rdoc.rdoc_files.include('TUTORIAL')
+  rdoc.rdoc_files.include('MIT-LICENSE')
+  rdoc.rdoc_files.include('lib/**/*.rb')
+end
+
+EXT = "ferret_ext.so"
+
+desc "Build the extension"
+task :ext => "ext/#{EXT}"
+
+file "ext/#{EXT}" => "ext/Makefile" do
+  sh "cd ext; make"
+end
+
+file "ext/Makefile" do
+  sh "cd ext; ruby extconf.rb"
+end
+
+# Make Parsers ---------------------------------------------------------------
+
+RACC_SRC = FileList["**/*.y"]
+RACC_OUT = RACC_SRC.collect { |fn| fn.sub(/\.y$/, '.tab.rb') }
+
+task :parsers => RACC_OUT
+rule(/\.tab\.rb$/ => [proc {|tn| tn.sub(/\.tab\.rb$/, '.y')}]) do |t|
+  sh "racc #{t.source}"
+end
+
+# Create Packages ------------------------------------------------------------
+
+PKG_FILES = FileList[
+  'setup.rb',
+  '[-A-Z]*',
+  'ext/**/*',
+  'lib/**/*.rb',
+  'test/**/*.rb',
+  'rake_utils/**/*.rb',
+  'Rakefile'
+]
+PKG_FILES.exclude('**/*.o')
+
+
+if ! defined?(Gem)
+  puts "Package Target requires RubyGEMs"
+else
+  spec = Gem::Specification.new do |s|
+
+    #### Basic information.
+
+    s.name = 'ferret'
+    s.version = PKG_VERSION
+    s.summary = "Ruby indexing library."
+    s.description = <<-EOF
+      Ferret is a port of the Java Lucene project. It is a powerful
+      indexing and search library.
+    EOF
+
+    #### Dependencies and requirements.
+
+    #s.add_dependency('log4r', '> 1.0.4')
+    #s.requirements << ""
+
+    #### Which files are to be included in this gem? Everything! (Except CVS directories.)
+
+    s.files = PKG_FILES.to_a
+
+    #### C code extensions.
+
+    s.extensions << "ext/extconf.rb"
+
+    #### Load-time details: library and application (you will need one or both).
+
+    s.require_path = 'lib' # Use these for libraries.
+
+    #s.bindir = "bin" # Use these for applications.
+    #s.executables = ["rake"]
+    #s.default_executable = "rake"
+
+    #### Documentation and testing.
+
+    s.has_rdoc = true
+    s.extra_rdoc_files = rd.rdoc_files.reject { |fn| fn =~ /\.rb$/ }.to_a
+    s.rdoc_options <<
+      '--title' << 'Ferret -- Ruby Indexer' <<
+      '--main' << 'README' << '--line-numbers' <<
+      'TUTORIAL' << 'TODO'
+
+    #### Author and project details.
+
+    s.author = "David Balmain"
+    s.email = "dbalmain@gmail.com"
+    s.homepage = "http://ferret.davebalmain.com"
+    s.rubyforge_project = "ferret"
+    # if ENV['CERT_DIR']
+    #   s.signing_key = File.join(ENV['CERT_DIR'], 'gem-private_key.pem')
+    #   s.cert_chain = [File.join(ENV['CERT_DIR'], 'gem-public_cert.pem')]
+    # end
+  end
+
+  package_task = Rake::GemPackageTask.new(spec) do |pkg|
+    pkg.need_zip = true
+    pkg.need_tar = true
+  end
+end
+
+# Support Tasks ------------------------------------------------------
+
+desc "Look for TODO and FIXME tags in the code"
+task :todo do
+  FileList['**/*.rb'].egrep /#.*(FIXME|TODO|TBD)/
+end
+# --------------------------------------------------------------------
+# Creating a release
+
+desc "Make a new release"
+task :prerelease => [:clobber, :all_tests, :parsers]
+task :package => [:prerelease]
+task :tag => [:prerelease]
+task :update_version => [:prerelease]
+task :release => [:tag, :update_version, :package] do
+  announce
+  announce "**************************************************************"
+  announce "* Release #{PKG_VERSION} Complete."
+  announce "* Packages ready to upload."
+  announce "**************************************************************"
+  announce
+end
+
+# Validate that everything is ready to go for a release.
+task :prerelease do
+  announce
+  announce "**************************************************************"
+  announce "* Making RubyGem Release #{PKG_VERSION}"
+  announce "* (current version #{CURRENT_VERSION})"
+  announce "**************************************************************"
+  announce
+
+  # Is a release number supplied?
+  unless ENV['REL']
+    fail "Usage: rake release REL=x.y.z [REUSE=tag_suffix]"
+  end
+
+  # Is the release different from the current release?
+  # (or is REUSE set?)
+  if PKG_VERSION == CURRENT_VERSION && ! ENV['REUSE']
+    fail "Current version is #{PKG_VERSION}, must specify REUSE=tag_suffix to reuse version"
+  end
+
+  # Are all source files checked in?
+  data = `svn -q status`
+  unless data =~ /^$/
+    fail "'svn -q status' is not clean ... do you have unchecked-in files?"
+  end
+
+  announce "No outstanding checkins found ... OK"
+end
+
+task :update_version => [:prerelease] do
+  if PKG_VERSION == CURRENT_VERSION
+    announce "No version change ... skipping version update"
+  else
+    announce "Updating Ferret version to #{PKG_VERSION}"
+    open("lib/ferret.rb") do |ferret_in|
+      open("lib/ferret.rb.new", "w") do |ferret_out|
+        ferret_in.each do |line|
+          if line =~ /^ VERSION\s*=\s*/
+            ferret_out.puts " VERSION = '#{PKG_VERSION}'"
+          else
+            ferret_out.puts line
+          end
+        end
+      end
+    end
+    if ENV['RELTEST']
+      announce "Release Task Testing, skipping committing of new version"
+    else
+      mv "lib/ferret.rb.new", "lib/ferret.rb"
+    end
+    sh %{svn ci -m "Updated to version #{PKG_VERSION}" lib/ferret.rb}
+  end
+end
+
+desc "Tag all the SVN files with the latest release number (REL=x.y.z)"
+task :tag => [:prerelease] do
+  reltag = "REL-#{PKG_VERSION}"
+  reltag << ENV['REUSE'] if ENV['REUSE']
+  announce "Tagging SVN with [#{reltag}]"
+  if ENV['RELTEST']
+    announce "Release Task Testing, skipping SVN tagging. Would do the following;"
+    announce %{svn copy -m "creating release #{reltag}" svn://www.davebalmain.com/ferret/trunk svn://www.davebalmain.com/ferret/tags/#{reltag}}
+  else
+    sh %{svn copy -m "creating release #{reltag}" svn://www.davebalmain.com/ferret/trunk svn://www.davebalmain.com/ferret/tags/#{reltag}}
+  end
+end
data/TODO
ADDED
data/TUTORIAL
ADDED
@@ -0,0 +1,197 @@
+= Quick Introduction to Ferret
+
+The simplest way to use Ferret is through the Ferret::Index::Index class.
+Start by including the Ferret module.
+
+  require 'ferret'
+  include Ferret
+
+=== Creating an index
+
+Creating an in-memory index is very simple;
+
+  index = Index::Index.new()
+
+To create a persistent index;
+
+  index = Index::Index.new(:path => '/path/to/index')
+
+Both of these methods create new Indexes with the StandardAnalyzer. An
+analyzer is what you use to divide the input data up into tokens which you can
+search for later. If you'd like to use a different analyzer you can specify it
+here, eg;
+
+  index = Index::Index.new(:path => '/path/to/index',
+                           :analyzer => Analysis::WhiteSpaceAnalyzer.new)
+
+For more options when creating an Index refer to Ferret::Index::Index.
+
+=== Adding Documents
+
+To add a document you can simply add a string or an array of strings.
+
+  index << "This is a new document to be indexed"
+  index << ["And here", "is another", "new document", "to be indexed"]
+
+But these are pretty simple documents. If this is all you want to index you
+could probably just use SimpleSearch. So let's give our documents some fields;
+
+  index << {:title => "Programming Ruby", :content => "blah blah blah"}
+  index << {:title => "Programming Ruby", :content => "yada yada yada"}
+
+Or if you are indexing data stored in a database, you'll probably want to
+store the id;
+
+  index << {:id => row.id, :title => row.title, :date => row.date}
+
+The methods above will store all of the input data as well as tokenizing and
+indexing it. Sometimes we won't want to tokenize (divide the string into
+tokens) the data. For example, we might want to leave the title as a complete
+string and only allow searches for that complete string. Sometimes we won't
+want to store the data as it's already stored in the database so it'll be a
+waste to store it in the index. Or perhaps we are doing without a database and
+using Ferret to store all of our data, in which case we might not want to
+index it. For example, if we are storing images in the index, we won't want to
+index them. All of this can be done using Ferret's Ferret::Document module.
+eg;
+
+  include Ferret::Document
+  doc = Document.new
+  doc << Field.new("id", row.id, Field::Store::NO, Field::Index::UNTOKENIZED)
+  doc << Field.new("title", row.title, Field::Store::YES, Field::Index::UNTOKENIZED)
+  doc << Field.new("data", row.data, Field::Store::YES, Field::Index::TOKENIZED)
+  doc << Field.new("image", row.image, Field::Store::YES, Field::Index::NO)
+  index << doc
+
+You can also compress the data that you are storing or store term vectors with
+the data. Read more about this in Ferret::Document::Field.
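To make the difference between storing and indexing a field more concrete, here is a minimal sketch in plain Ruby that mimics the four field settings used above. ToyField and ToyIndex are hypothetical stand-ins invented for this illustration; they only model which values can be read back and which can be searched, and say nothing about how Ferret implements either.

```ruby
# Illustrative only: how "store" and "index" choices differ, in plain Ruby.
# ToyField and ToyIndex are hypothetical stand-ins, not Ferret classes.
ToyField = Struct.new(:name, :value, :store, :index)

class ToyIndex
  def initialize
    @stored = []                             # doc number -> { field => value }
    @terms  = Hash.new { |h, k| h[k] = [] }  # [field, term] -> doc numbers
  end

  # Adds a document (an array of ToyFields) under the next document number.
  def <<(fields)
    doc_num = @stored.size
    @stored << {}
    fields.each do |f|
      @stored[doc_num][f.name] = f.value if f.store
      case f.index
      when :tokenized   then f.value.downcase.split.each { |t| add_term(f.name, t, doc_num) }
      when :untokenized then add_term(f.name, f.value, doc_num)
      end                                    # :no => never searchable
    end
    self
  end

  # Returns the document numbers whose field contains exactly this term.
  def search(field, term)
    @terms.fetch([field, term], [])
  end

  # Returns only the stored fields of a document.
  def [](doc_num)
    @stored[doc_num]
  end

  private

  def add_term(field, term, doc_num)
    @terms[[field, term]] << doc_num unless @terms[[field, term]].include?(doc_num)
  end
end

index = ToyIndex.new
index << [
  ToyField.new("id",    "42",               false, :untokenized),
  ToyField.new("title", "Programming Ruby", true,  :untokenized),
  ToyField.new("data",  "blah blah blah",   true,  :tokenized),
  ToyField.new("image", "<binary data>",    true,  :no)
]

p index.search("id", "42")                   # [0]: searchable although not stored
p index.search("title", "Programming Ruby")  # [0]: only the complete string matches
p index.search("title", "ruby")              # []
p index[0].key?("id")                        # false: the id was never stored
```

In this toy model the image can be read back with index[0]["image"] but no search will ever find it, which is the behaviour the tutorial describes for a stored, unindexed field.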
+
+=== Searching
+
+Now that we have data in our index, how do we actually use this index to
+search the data? The Index offers two search methods, Index#search and
+Index#search_each. The first method returns a Ferret::Search::TopDocs object.
+The second we'll show here. Let's say we wanted to find all documents with the
+phrase "quick brown fox" in the content field. We'd write;
+
+  index.search_each('content:"quick brown fox"') do |doc, score|
+    puts "Document #{doc} found with a score of #{score}"
+  end
+
+But "fast" has a pretty similar meaning to "quick" and we don't mind if the
+fox is a little red. So we could expand our search like this;
+
+  index.search_each('content:"quick|fast brown|red fox"') do |doc, score|
+    puts "Document #{doc} found with a score of #{score}"
+  end
+
+What if we want to find all documents entered on or after 5th of September,
+2005 with the words "ruby" or "rails" in them? We could type something like;
+
+  index.search_each('date:( >= 20050905) content:(ruby OR rails)') do |doc, score|
+    puts "Document #{doc} found with a score of #{score}"
+  end
+
+Ferret has quite a complex query language. To find out more about Ferret's
+query language, see Ferret::QueryParser. You can also construct even more
+complex queries like Ferret::Search::Spans by hand. See Ferret::Search::Query
+for more information.
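For readers curious how a phrase query with alternatives such as "quick|fast brown|red fox" can be answered at all, the sketch below builds a tiny positional inverted index in plain Ruby. ToyPhraseIndex and phrase_search are hypothetical names made up for this illustration. The sketch ignores scoring, fields and the query syntax itself, and should not be read as a description of Ferret's phrase scorers.

```ruby
# Illustrative only: a tiny positional inverted index in plain Ruby.
# ToyPhraseIndex is a hypothetical name; it does no scoring at all.
class ToyPhraseIndex
  def initialize
    # term -> { doc number => [positions of the term in that doc] }
    @postings = Hash.new { |h, term| h[term] = Hash.new { |d, doc| d[doc] = [] } }
    @doc_count = 0
  end

  # Tokenizes the text and records the position of every term.
  def <<(text)
    doc_num = @doc_count
    text.downcase.scan(/\w+/).each_with_index do |term, pos|
      @postings[term][doc_num] << pos
    end
    @doc_count += 1
    self
  end

  # Each slot of the phrase is an array of alternatives, e.g.
  # [["quick", "fast"], ["brown", "red"], ["fox"]].
  # Returns the numbers of the documents containing the phrase.
  def phrase_search(slots)
    return [] if slots.empty?
    (0...@doc_count).select do |doc|
      positions(slots.first, doc).any? do |start|
        slots.each_with_index.all? { |alts, i| positions(alts, doc).include?(start + i) }
      end
    end
  end

  private

  # All positions in doc at which any of the alternative terms occurs.
  def positions(alternatives, doc)
    alternatives.flat_map do |term|
      @postings.key?(term) ? @postings[term].fetch(doc, []) : []
    end
  end
end

index = ToyPhraseIndex.new
index << "The quick brown fox"     # document 0
index << "A fast red fox ran by"   # document 1
index << "The brown quick fox"     # document 2

p index.phrase_search([%w[quick], %w[brown], %w[fox]])            # [0]
p index.phrase_search([%w[quick fast], %w[brown red], %w[fox]])   # [0, 1]
```

Document 2 contains all three words but in the wrong order, so neither phrase matches it. Recording positions rather than mere term presence is what makes that distinction possible.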
+
+=== Accessing Documents
+
+You may have noticed that when we run a search we only get the document number
+back. By itself this isn't much use to us. Getting the data from the index is
+very straightforward. For example, if we want the title field from the 3rd
+document, type;
+
+  index[2]["title"]
+
+NOTE: documents are indexed from 0.
+
+Let's go back to the database example above. If we store all of our documents
+with an id then we can access a document using its id. As long as we called
+our id field "id" we can do this;
+
+  id = "89721347"
+  index[id]["title"]
+
+If however we called our id field "key" we'll have to do this;
+
+  id = Index::Term.new("key", "89721347")
+  index[id]["title"]
+
+Pretty simple huh? You should note though that if there is more than one
+document with the same *id* or *key* then only the first one will be returned
+so it is probably better that you ensure the key is unique somehow. (Ferret
+cannot do that for you)
+
+
+=== Modifying and Deleting Documents
+
+What if we want to change the data in the index? Ferret doesn't actually let
+you change the data once it is in the index. But you can delete documents so
+the standard way to modify data is to delete it and re-add it again with the
+modifications made. It is important to note that when doing this the documents
+will get a new document number so you should be careful not to use a document
+number after the document has been deleted. Here is an example of modifying a
+document;
+
+  index << {:title => "Programing Rbuy", :content => "blah blah blah"}
+  doc_num = nil
+  index.search_each('title:"Programing Rbuy"') {|doc, score| doc_num = doc}
+  return unless doc_num
+  doc = index[doc_num]
+  index.delete(doc_num)
+
+  # modify doc
+  doc["title"] = "Programming Ruby"
+
+  index << doc
+
+Again, we can use the id field as above. This time though every document
+that matches the id will be deleted. Again, it is probably a good idea if you
+somehow ensure that your *ids* are kept unique.
+
+  id = "23453422"
+  index.delete(id)
+
+Or;
+
+  id = Index::Term.new("key", "23452345")
+  index.delete(id)
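The delete and re-add cycle described in this section, and the fact that a re-added document receives a new number, can be pictured with a small plain-Ruby model. ToyStore and delete_where are hypothetical names used only for this sketch; the model assumes document numbers are simply append positions, which is an assumption made for illustration rather than a statement about Ferret's internals.

```ruby
# Illustrative only: why a document's number changes when it is "modified".
# ToyStore is a hypothetical stand-in, not Ferret's index implementation.
class ToyStore
  def initialize
    @docs = []       # position in this array is the document number
    @deleted = {}    # document numbers that have been marked as deleted
  end

  # Appends a document and returns its new document number.
  def add(doc)
    @docs << doc
    @docs.size - 1
  end

  # Returns the document, or nil once it has been deleted.
  def [](doc_num)
    @deleted[doc_num] ? nil : @docs[doc_num]
  end

  # Marks a single document number as deleted.
  def delete(doc_num)
    @deleted[doc_num] = true
  end

  # Deletes every live document whose field equals value; returns the count.
  def delete_where(field, value)
    hits = (0...@docs.size).select { |n| self[n] && self[n][field] == value }
    hits.each { |n| delete(n) }
    hits.size
  end
end

store = ToyStore.new
old_num = store.add({:id => "1", :title => "Programing Rbuy"})

doc = store[old_num].dup          # copy the document before deleting it
store.delete(old_num)
doc[:title] = "Programming Ruby"  # modify the copy
new_num = store.add(doc)          # re-adding assigns a fresh number

p [old_num, new_num]              # [0, 1]
p store[old_num]                  # nil: the old number no longer resolves
```

Calling store.delete_where(:id, "1") afterwards removes every remaining document with that id, which mirrors the warning above that deleting by id affects all matching documents.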
+
+=== Onwards
+
+This is just a small sampling of what Ferret allows you to do. Ferret, like
+Lucene, is designed to be extended, and allows you to construct your own query
+types, analyzers, and so on. Future versions of Ferret will contain more of
+these, as well as instructions for how to subclass the base modules to create
+your own. For now you can look in the following places for more documentation;
+
+* Ferret::Analysis: for more information on how the data is processed when it
+  is tokenized. There are a number of things you can do with your data such as
+  adding stop lists or perhaps a porter stemmer. There are also a number of
+  analyzers already available and it is almost trivial to create a new one
+  with a simple regular expression.
+
+* Ferret::Search: for more information on querying the index. There are a
+  number of already available queries and it's unlikely you'll need to create
+  your own. You may however want to take advantage of the sorting or filtering
+  abilities of Ferret to present your data the best way you see fit.
+
+* Ferret::Document: to find out how to create documents. This part of Ferret
+  is relatively straightforward. The main thing that we haven't gone into here
+  is the use of term vectors. These allow you to store and retrieve the
+  positions and offsets of the data which can be very useful in document
+  comparison among other things.
+
+* Ferret::QueryParser: if you want to find out more about what you can do with
+  Ferret's Query Parser, this is the place to look. The query parser is one
+  area that could use a bit of work so please send your suggestions.
+
+* Ferret::Index: for more advanced access to the index you'll probably want to
+  use the Ferret::Index::IndexWriter and Ferret::Index::IndexReader. This is
+  the place to look for more information on them.
+
+* Ferret::Store: This is the module used to access the actual index storage
+  and won't be of much interest to most people.