thomaspeklak-OfflineSearch 0.2.2

data/README ADDED
@@ -0,0 +1,100 @@
+ ==OfflineSearch
+ OfflineSearch is a semantic offline search generator. It scans a directory of HTML files and generates a JavaScript search data file. It was primarily written for offline HTML documentation, which is also the main target group. Of course, it can be useful on small websites, too.
+
+ A word frequency file can be output. This may help to find typos and gives statistical information that can be used to tweak the stop word list. Maintaining the stop words is crucial to keeping the index small: for example, the index of 11MB of HTML files is about 1.2MB (including double metaphone data and using a tweaked stop word list).
+
+ The double metaphone algorithm can be used to find similar terms if the search did not find any matches. It does not perform too well, but no other (better) algorithm was found that performed well on a large index.
+
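+ Conceptually, this fallback works on a phonetic index. A minimal Ruby sketch of the idea (the codes and data are illustrative, not the algorithm's exact output):
+
+   # hypothetical phonetic index: double metaphone code => indexed terms
+   phonetic_index = { 'SRX' => ['search', 'searches'] }
+   query_code = 'SRX' # e.g. computed for the misspelled query 'serch'
+   suggestions = phonetic_index[query_code] || [] # => ['search', 'searches']
+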
+ Default config files, stop word lists and search templates can be generated. Please see the section Usage for details.
+
+ The search does not include any boolean logic (OR, NOT, XOR, ...). It always tries to find all specified terms: a search for "crawler storage", for instance, only returns documents that contain both terms.
+
+ REMARK: Currently the search index is always written in UTF-8. Future releases will support other encodings.
+
+ ===Executable:
+ OfflineSearch
+
+ ===Usage:
+ OfflineSearch [options]
+ -c, --config=CONFIG_FILE configuration file for the offline search
+
+ Generators
+ -g, --generate-default-config creates a default config file in the current directory
+ -w, --generate-default-stopwords creates a default stopword list in the current directory. The language flag is required.
+ -t, --generate-template creates search template files in the current directory
+ -o, --generate-search-data crawls the documents in the given docpath and generates the search data file
+
+ Optional arguments
+ These can also be specified in the config file; command line arguments override any value given in the config file.
+ -d, --docpath=DOCPATH path of the documents
+ -f=SEARCH_DATA_FILE path and name of the search data file
+ --search-data-file
+ -s=STOPWORD_LIST stopword list; if none is specified, the default stop word list is used
+ --stopword-list
+ -l, --language=LANGUAGE required if you want to generate a default stopword list
+
+ -h, --help Show this message
+
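+ A typical workflow might look like this (the config file name is only an illustration):
+
+   OfflineSearch --generate-default-config
+   OfflineSearch --generate-default-stopwords --language=english
+   OfflineSearch --generate-search-data --config=offline_search.yml
+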
+ ===Config-File example:
+
+   language: english
+   storage: memory
+   crawler:
+     docpath: ../docs
+     docs: [html, htm]
+     exceptions:
+     stopwords:
+     tags:
+       title: 150
+       h1: 50
+       h2: 25
+       h3: 18
+       h4: 13
+       h5: 11
+       h6: 9
+       strong: 7
+       b: 7
+       em: 5
+       i: 5
+       dt: 9
+       u: 4
+       a: 3
+   logger:
+     file: STDOUT
+     level: info
+   search_generator:
+     search_data_file: search_data.js
+     output_encoding: utf-8
+     template: base
+     relative_path_to_files: ../docs/
+     output_frequency_to: frequency.txt
+     use_double_metaphone: true
+
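+ The numbers under tags are rank weights. A minimal sketch of how they combine, mirroring the semantic_value method in the crawler source below (weights taken from the example above):
+
+   weights = { 'h2' => 25, 'strong' => 7 }
+   rank = 1                                          # every text block starts at 1
+   %w[strong h2].each { |tag| rank += weights[tag] } # text nested as <h2><strong>...</strong></h2>
+   rank # => 33
+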
+ ===Templates
+
+ The shipped templates are very basic, as it is assumed that you integrate the search into your own site. They only provide guidance on how to implement the search on your site.
+
+
+ ___________________________________________________________________________
+ The MIT License
+
+ Copyright (c) 2008 Thomas Peklak
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
@@ -0,0 +1,3 @@
+ #!/usr/local/bin/ruby -w
+
+ require 'offline_search'
@@ -0,0 +1,70 @@
+ # ACTION CONTROLLER
+ # checks which option is specified and executes the required scripts
+ #
+ # * $Author$
+ # * $Rev$
+ # * $LastChangedDate$
+ #
+
+ class ActionController
+   # checks the value of the global variable $action set by the option parser
+   # valid actions are
+   # * generating a default stopword list
+   # * generating a default config file
+   # * generating a template
+   # * generating the search database
+   def initialize()
+     require "log_init"
+     unless defined?($action)
+       $logger.fatal("No action is defined. Please choose one of the following options:\n\t\t-o generate search index\n\t\t-t generate template files\n\t\t-w generate default stop words\n\t\t-g generate default config")
+       exit
+     end
+     case $action
+     when 'generate_default_stopwords'
+       generate_stopwords
+     when 'generate_default_config'
+       generate_config
+     when 'generate_template'
+       unless (['base', 'base+double_metaphone'].include?($config['template']))
+         $logger.error('Template not found')
+         $logger.info('Available templates: base, base+double_metaphone')
+         exit
+       end
+       generate_template
+     when 'generate_search'
+       verify_search_parameters
+       start_search
+     end
+   end
+
+   private
+
+   # generates a default stopword list
+   def generate_stopwords
+     $logger.info("generating default stopwords")
+     require 'generate_default_stopwords'
+   end
+
+   # generates a default config file
+   def generate_config
+     $logger.info("generating default config")
+     require 'generate_default_config'
+   end
+
+   # generates the search template files
+   def generate_template
+     $logger.info("generating default template")
+     require 'generate_default_template'
+     TemplateGenerator.new($config['template'])
+   end
+
+   # validates the options required for index generation
+   def verify_search_parameters
+     require 'option_validator'
+     OptionValidator.new
+   end
+
+   # crawls the documents and generates the search data file
+   def start_search
+     require "stop_words"
+     require "crawler"
+     require "search_generator"
+
+     crawler = Crawler.new
+     crawler.find_files
+     crawler.parse_files
+
+     generator = SearchGenerator.new(crawler.get_stored_files, crawler.get_terms)
+     generator.generate
+   end
+ end
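+
+ # A minimal usage sketch (normally the option parser sets these globals;
+ # the values below are only an illustration):
+ #   $action = 'generate_default_config'
+ #   ActionController.new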
@@ -0,0 +1,31 @@
+ language: german
+ storage: memory
+ crawler:
+   docpath: ../docs
+   docs: [html, htm]
+   exceptions: _dab
+   stopwords:
+   tags:
+     title: 60
+     h1: 40
+     h2: 25
+     h3: 18
+     h4: 13
+     h5: 11
+     h6: 9
+     strong: 7
+     b: 7
+     em: 5
+     i: 5
+     dt: 9
+     u: 4
+     a: 3
+ logger:
+   file: STDOUT
+   level: info
+ search_generator:
+   search_data_file: search_data.js
+   output_encoding: utf-8
+   template: base
+   relative_path_to_files: D
+   output_frequency_to: frequency.txt
@@ -0,0 +1,32 @@
+ language: english
+ storage: memory
+ crawler:
+   docpath: ../docs
+   docs: [html, htm]
+   exceptions:
+   stopwords:
+   tags:
+     title: 60
+     h1: 40
+     h2: 25
+     h3: 18
+     h4: 13
+     h5: 11
+     h6: 9
+     strong: 7
+     b: 7
+     em: 5
+     i: 5
+     dt: 9
+     u: 4
+     a: 3
+ logger:
+   file: STDOUT
+   level: info
+ search_generator:
+   search_data_file: search_data.js
+   output_encoding: utf-8
+   template: base
+   relative_path_to_files:
+   output_frequency_to: frequency.txt
+   use_double_metaphone: true
@@ -0,0 +1,237 @@
+ # CRAWLER
+ # searches a directory for files
+ # parses files for keywords, semantic keyword rank and pagerank
+ #
+ # * $Author$
+ # * $Rev$
+ # * $LastChangedDate$
+
+ require 'rexml/document'
+ require 'rubygems'
+ require 'hpricot'
+ require 'kconv'
+ require 'entity_converter'
+ require 'filefinder'
+ require 'progressbar'
+
+ class Crawler
+   attr_writer :resource
+
+   # requires a docpath set in the config file and a temporary storage handler
+   def initialize
+     @resource = $config['crawler']['docpath']
+     require "temporary_storage"
+     @storage = Temporary_Storage.new($config['storage'])
+   end
+
+   # searches the given docpath for files with a valid extension and excludes files that should not be indexed
+   # returns an array of files
+   def find_files()
+     @files = FileFinder::find(@resource, :types => $config['crawler']['docs'], :excludes => $config['crawler']['exceptions'])
+     if (@files.empty?)
+       $logger.error('no files found in directory')
+       exit
+     end
+     @files_size = @files.length
+     @files
+   end
+
+   # takes an array of files and iterates through it. each file is read into a string and sent to a doc crawler for further processing
+   # all files are currently processed with the Hpricot-based crawler; the REXML-based XmlCrawler below handles valid XHTML documents but is not wired in here
+   # no value is returned
+   def parse_files
+     i = 0
+     pbar = ProgressBar.new('indexing', @files_size)
+     @files.each do |file|
+       $logger.info("processing #{file}")
+       File.open(file, 'r') do |f|
+         lines = f.read().gsub("\n", ' ').gsub("\r", '')
+         # convert entities before a new Hpricot doc is created, otherwise the entities are not converted correctly
+         doc = HpricotCrawler.new(lines.decode_html_entities, file, @storage)
+         doc.crawler_and_store
+         i += 1
+         pbar.set(i)
+       end
+     end
+     @storage.calculate_pageranks_from_links
+   end
+
+   # returns a hash of the parsed documents
+   def get_stored_files
+     @storage.get_files
+   end
+
+   # returns a hash of the indexed terms with ranks and links to the documents
+   def get_terms
+     @storage.get_terms
+   end
+
+   ###### HELPER METHODS
+   private
+
+   # This abstract class parses a file and tries to extract semantic information
+   class DocCrawler
+     # ignores external links and converts internal links to paths relative to the docpath
+     def resolve_link(link, dir)
+       # external links must be detected before File.basename strips the scheme and host
+       return nil if link =~ /^(http|ftp|mailto)/
+       link = File.basename(link)
+       case
+       when link =~ /^[\/a-zA-Z0-9_-]/
+         return (File.expand_path(dir + '/' + link)).gsub(@expanded_doc_path, '')
+       when link =~ /^\./
+         return (File.expand_path(dir + '/' + link)).gsub(@expanded_doc_path, '')
+       else
+         return nil
+       end
+     end
+
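+     # Illustrative behaviour of resolve_link (paths are hypothetical):
+     #   resolve_link('http://example.com/a.html', dir) # => nil (external link)
+     #   resolve_link('sub/page.html', '/abs/docs')     # => '/page.html' when the
+     #                                                  #    expanded docpath is '/abs/docs'
+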
+     # invokes other methods to get certain information about the document. These methods are implemented in the child classes
+     def crawler_and_store
+       @storage.store_file(resolve_link(@file, File.dirname(@file)), get_title)
+       @storage.store_link(get_hrefs)
+       split_and_store
+     end
+
+     private
+
+     # splits text blocks and stores terms in the storage. splitting happens on every character that is not alphanumeric (umlauts included)
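+     # e.g. the text block "Stop Words 101" yields the terms 'stop', 'words' and '101';
+     # single-character non-numeric terms and stop-listed terms are skipped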
+     def split_and_store()
+       numbers = '0'..'9'
+       get_texts.each do |text_block|
+         rank = text_block.semantic_value
+         unless (rank.nil?)
+           text_block.to_s.downcase.umlaut_to_downcase.decode_html_entities.split(/[^a-zäöüß0-9]+/).each do |term|
+             @storage.store_term(term, rank) unless ((term.size < 2 && !numbers.include?(term)) || $stop_words.has_key?(term))
+           end
+         end
+       end
+     end
+   end
+
+   # parses valid XHTML documents and extracts information
+   class XmlCrawler < DocCrawler
+     def initialize(lines, file, storage)
+       @file = file
+       @storage = storage
+       begin
+         @xml = REXML::Document.new(lines)
+       rescue REXML::ParseException
+         raise
+       end
+       @expanded_doc_path = File.expand_path($config['crawler']['docpath'])
+     end
+
+     private
+
+     # extracts and returns the title of the document
+     def get_title
+       @xml.elements.each('//head//title/text()')
+     end
+
+     # extracts all texts and returns an array of REXML::Texts if no block is given
+     # if a block is given then the texts are passed to it
+     def get_texts
+       texts = @xml.elements.each("//body//text()")
+       texts.delete_if { |t| t.to_s.lstrip.empty? }
+       if block_given?
+         yield texts
+       else
+         texts
+       end
+     end
+
+     # returns an array of internal links in the document
+     def get_hrefs
+       a = @xml.elements.to_a("//a")
+       href = Array.new
+       a.each do |anker|
+         link = resolve_link(anker.attributes.get_attribute('href').value, File.dirname(@file)) if anker.attributes.get_attribute('href')
+         href << link unless link.nil?
+       end
+       href
+     end
+
+     # extends REXML::Text with the functionality to extract semantic information
+     class REXML::Text
+       # stores an array of meaningful tags with their rank value
+       def self.store_semantics(tags)
+         @@semantic_tags = tags
+       end
+
+       # extracts the semantic value of a text block by walking up the enclosing tags and summing their weights
+       def semantic_value
+         REXML::Text.store_semantics($config['crawler']['tags'].keys) unless defined?(@@semantic_tags)
+         rank = 1
+         node = parent
+         return nil if (node.name == 'script')
+         while @@semantic_tags.include?(node.name)
+           rank += $config['crawler']['tags'][node.name]
+           node = node.parent
+           return nil if (node.name == 'script')
+         end
+         rank
+       end
+     end
+   end
+
+   # parses non-valid XHTML documents and extracts information
+   class HpricotCrawler < DocCrawler
+     def initialize(lines, file, storage)
+       @doc = Hpricot(lines)
+       @file = file
+       @storage = storage
+       @expanded_doc_path = File.expand_path($config['crawler']['docpath'])
+     end
+
+     private
+
+     # extracts and returns the title of the document (an empty string if the document has no title)
+     def get_title
+       title = @doc.at('//head/title')
+       title.nil? ? '' : title.inner_text
+     end
+
+     # extracts all texts and returns an array of Hpricot::Texts if no block is given
+     # if a block is given then the texts are passed to it
+     def get_texts
+       texts = Array.new
+       @doc.traverse_text { |text| texts << text unless text.to_s.strip.empty? }
+       if block_given?
+         yield texts
+       else
+         texts
+       end
+     end
+
+     # returns an array of internal links in the document, skipping anchors and javascript links
+     def get_hrefs
+       links = Array.new
+       (@doc/'a[@href]').each { |a|
+         href = a[:href]
+         next if href[0, 1] == '#' || href[0, 10] == 'javascript'
+         link = resolve_link(href, File.dirname(@file))
+         links << link unless link.nil?
+       }
+       links
+     end
+
+     # extends Hpricot::Text with the functionality to extract semantic information
+     class Hpricot::Text
+       # stores an array of meaningful tags with their rank value
+       def self.store_semantics(tags)
+         @@semantic_tags = tags
+       end
+
+       # extracts the semantic value of a text block by walking up the enclosing tags and summing their weights
+       def semantic_value
+         Hpricot::Text.store_semantics($config['crawler']['tags'].keys) unless defined?(@@semantic_tags)
+         rank = 1
+         node = parent
+         return nil if (node.name == 'script')
+         while @@semantic_tags.include?(node.name)
+           rank += $config['crawler']['tags'][node.name]
+           node = node.parent
+           return nil if (node.name == 'script')
+         end
+         rank
+       end
+     end
+   end
+ end