keyword_prospector 0.8.0

data/History.txt ADDED
@@ -0,0 +1,4 @@
+ == 0.8.0 2008-08-07
+
+ * 1 major enhancement:
+   * Initial release
data/License.txt ADDED
@@ -0,0 +1,20 @@
+ Copyright (c) 2008 Los Angeles Times
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/Manifest.txt ADDED
@@ -0,0 +1,40 @@
+ History.txt
+ License.txt
+ Manifest.txt
+ README.txt
+ Rakefile
+ config/hoe.rb
+ config/requirements.rb
+ lib/hyperlink_strategy.rb
+ lib/keyword_decorator.rb
+ lib/keyword_linker.rb
+ lib/keyword_prospector
+ lib/keyword_prospector.rb
+ lib/lookup_chain.rb
+ lib/match.rb
+ lib/profile.rb
+ lib/search_and_replace.rb
+ lib/state.rb
+ script/console
+ script/destroy
+ script/generate
+ script/txt2html
+ setup.rb
+ spec/hyperlink_strategy_spec.rb
+ spec/keyword_linker_spec.rb
+ spec/keyword_prospector_spec.rb
+ spec/lookup_chain_spec.rb
+ spec/match_spec.rb
+ spec/search_and_replace_spec.rb
+ spec/spec.opts
+ spec/spec_helper.rb
+ spec/state_spec.rb
+ tasks/deployment.rake
+ tasks/environment.rake
+ tasks/rspec.rake
+ tasks/website.rake
+ website/index.html
+ website/index.txt
+ website/javascripts/rounded_corners_lite.inc.js
+ website/stylesheets/screen.css
+ website/template.html.erb
data/README.txt ADDED
@@ -0,0 +1,95 @@
+ = keyword_prospector
+
+ * http://github.com/latimes/keyword_prospector
+
+ == DESCRIPTION:
+
+ KeywordProspector is a gem for associating keywords in text with arbitrary
+ output objects. It uses an Aho-Corasick tree to provide matching that scales
+ linearly with the length of the text, regardless of the number of keywords.
+
+ == FEATURES:
+
+ The core KeywordProspector engine has the following properties:
+ * Once a tree is built, matching against the tree is O(n) where n is the length
+   of your text, regardless of how many keywords you have in the tree.
+ * Arbitrary output objects can be associated with any keyword or set of
+   keywords.
+
+ KeywordLinker can be used to create links to designated URLs:
+ * You can specify a single keyword to associate with a URL.
+ * You can specify an array of keywords to associate with a URL.
+ * Each keyword or group of keywords will be linked only once to the URL
+   provided.
+ * Hyperlinks are not created within existing hyperlinks.
+ * Hyperlinks are not generated anywhere they would be illegal in HTML, such as
+   within attribute values.
+
+ == SYNOPSIS:
+
+ Use KeywordLinker to create links in HTML text. KeywordLinker will link only
+ the first alternative that appears in text:
+
+   require 'keyword_linker'
+
+   linker = KeywordLinker.new
+   linker.add_url("http://www.latimes.com", ["L.A. Times", "Los Angeles Times"])
+   linker.link_text("'L.A. Times' or 'Los Angeles Times'?")
+   => "'<a href=\"http://www.latimes.com\">L.A. Times</a>' or 'Los Angeles Times'?"
+
+   linker.link_text("'Los Angeles Times' or 'L.A. Times'?")
+   => "'<a href=\"http://www.latimes.com\">Los Angeles Times</a>' or 'L.A. Times'?"
+
+
+ You can provide HTML options when adding URLs:
+
+   linker = KeywordLinker.new
+   linker.add_url("http://www.latimes.com", "Los Angeles Times",
+                  :title => "Visit the Los Angeles Times")
+   linker.add_url("http://www.google.com", ["Google", "The Google"],
+                  :title => "Go check it out on The Google!")
+
+   linker.link_text("Do you prefer The Google or the Los Angeles Times?")
+   => "Do you prefer <a href=\"http://www.google.com\" title=\"Go check it out on The Google!\">The Google</a> or the <a href=\"http://www.latimes.com\" title=\"Visit the Los Angeles Times\">Los Angeles Times</a>?"
+
+ == DEPENDENCIES:
+
+ * Hpricot
+
+ == INSTALL:
+
+   sudo gem install keyword_prospector
+
+ == AUTHOR:
+
+ Alf Mikula <amikula@gmail.com>
+
+ == ACKNOWLEDGEMENTS:
+
+ Thanks to David Johnson for telling me about the Aho-Corasick algorithm, and
+ for providing the original Aho-Corasick Ruby implementation.
+
+ == LICENSE:
+
+ (The MIT License)
+
+ Copyright (c) 2008 Los Angeles Times
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ 'Software'), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+ IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
+ CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
+ TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
+ SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/Rakefile ADDED
@@ -0,0 +1,4 @@
+ require 'config/requirements'
+ require 'config/hoe' # setup Hoe + all gem configuration
+
+ Dir['tasks/**/*.rake'].each { |rake| load rake }
data/config/hoe.rb ADDED
@@ -0,0 +1,73 @@
+ require 'keyword_prospector/version'
+
+ AUTHOR = 'FIXME full name' # can also be an array of Authors
+ EMAIL = "FIXME email"
+ DESCRIPTION = "description of gem"
+ GEM_NAME = 'keyword_prospector' # what ppl will type to install your gem
+ RUBYFORGE_PROJECT = 'latimes' # The unix name for your project
+ HOMEPATH = "http://#{RUBYFORGE_PROJECT}.rubyforge.org"
+ DOWNLOAD_PATH = "http://rubyforge.org/projects/#{RUBYFORGE_PROJECT}"
+ EXTRA_DEPENDENCIES = [
+   ['hpricot', '>= 0.6']
+ ] # An array of rubygem dependencies [name, version]
+
+ @config_file = "~/.rubyforge/user-config.yml"
+ @config = nil
+ RUBYFORGE_USERNAME = "unknown"
+ def rubyforge_username
+   unless @config
+     begin
+       @config = YAML.load(File.read(File.expand_path(@config_file)))
+     rescue
+       puts <<-EOS
+ ERROR: No rubyforge config file found: #{@config_file}
+ Run 'rubyforge setup' to prepare your env for access to Rubyforge
+  - See http://newgem.rubyforge.org/rubyforge.html for more details
+       EOS
+       exit
+     end
+   end
+   RUBYFORGE_USERNAME.replace @config["username"]
+ end
+
+
+ REV = nil
+ # UNCOMMENT IF REQUIRED:
+ # REV = YAML.load(`svn info`)['Revision']
+ VERS = KeywordProspector::VERSION::STRING + (REV ? ".#{REV}" : "")
+ RDOC_OPTS = ['--quiet', '--title', 'keyword_prospector documentation',
+     "--opname", "index.html",
+     "--line-numbers",
+     "--main", "README",
+     "--inline-source"]
+
+ class Hoe
+   def extra_deps
+     @extra_deps.reject! { |x| Array(x).first == 'hoe' }
+     @extra_deps
+   end
+ end
+
+ # Generate all the Rake tasks
+ # Run 'rake -T' to see list of generated tasks (from gem root directory)
+ $hoe = Hoe.new(GEM_NAME, VERS) do |p|
+   p.developer(AUTHOR, EMAIL)
+   p.description = DESCRIPTION
+   p.summary = DESCRIPTION
+   p.url = HOMEPATH
+   p.rubyforge_name = RUBYFORGE_PROJECT if RUBYFORGE_PROJECT
+   p.test_globs = ["test/**/test_*.rb"]
+   p.clean_globs |= ['**/.*.sw?', '*.gem', '.config', '**/.DS_Store'] #An array of file patterns to delete on clean.
+
+   # == Optional
+   p.changes = p.paragraphs_of("History.txt", 0..1).join("\n\n")
+   #p.extra_deps = EXTRA_DEPENDENCIES
+
+   #p.spec_extras = {} # A hash of extra values to set in the gemspec.
+ end
+
+ CHANGES = $hoe.paragraphs_of('History.txt', 0..1).join("\\n\\n")
+ PATH = (RUBYFORGE_PROJECT == GEM_NAME) ? RUBYFORGE_PROJECT : "#{RUBYFORGE_PROJECT}/#{GEM_NAME}"
+ $hoe.remote_rdoc_dir = File.join(PATH.gsub(/^#{RUBYFORGE_PROJECT}\/?/,''), 'rdoc')
+ $hoe.rsync_args = '-av --delete --ignore-errors'
+ $hoe.spec.post_install_message = File.open(File.dirname(__FILE__) + "/../PostInstall.txt").read rescue ""
data/config/requirements.rb ADDED
@@ -0,0 +1,15 @@
+ require 'fileutils'
+ include FileUtils
+
+ require 'rubygems'
+ %w[rake hoe newgem rubigen].each do |req_gem|
+   begin
+     require req_gem
+   rescue LoadError
+     puts "This Rakefile requires the '#{req_gem}' RubyGem."
+     puts "Installation: gem install #{req_gem} -y"
+     exit
+   end
+ end
+
+ $:.unshift(File.join(File.dirname(__FILE__), %w[.. lib]))
data/lib/hyperlink_strategy.rb ADDED
@@ -0,0 +1,56 @@
+ #
+ # (C) 2008 Los Angeles Times
+ #
+ require 'set'
+
+ class HyperlinkStrategy
+   attr_reader :url
+   attr_reader :options
+
+   def initialize(url=nil, options={})
+     @keywords = Set.new
+     self.options = options
+     self.url = url
+   end
+
+   def keywords=(*keywords)
+     @keywords = Set.new(keywords.flatten)
+   end
+
+   def keywords
+     @keywords
+   end
+
+   def url=(url)
+     @url = url
+     merge_options(@options)
+   end
+
+   def options=(options)
+     merge_options(options)
+   end
+
+   def add_keyword(keyword)
+     @keywords.add(keyword)
+     self
+   end
+
+   def decorate(keyword)
+     attributes = ""
+     options.each_pair do |key, value|
+       attributes += " " unless attributes.length == 0
+       attributes += "#{key}=\"#{value}\""
+     end
+
+     "<a " + attributes + ">#{keyword}</a>"
+   end
+
+   private
+   def merge_options(options)
+     if @url
+       @options = {:href => @url}.merge(options)
+     else
+       @options = options
+     end
+   end
+ end
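Note on `decorate` in the file above: it interpolates each option value into the tag as-is, so a value containing a double quote (for example in a `:title`) would end the attribute early. The following is a minimal sketch of one way to guard against that, assuming a hand-rolled escape helper. `HTML_ESCAPES`, `escape_html`, and `decorate_escaped` are hypothetical names for illustration and are not part of the gem.

```ruby
# Hypothetical helper, not part of the gem: builds an <a> tag the way
# HyperlinkStrategy#decorate does, but escapes attribute values first.
HTML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

# Replace the four characters that are unsafe inside a double-quoted attribute.
def escape_html(value)
  value.to_s.gsub(/[&<>"]/, HTML_ESCAPES)
end

# Mirrors merge_options: :href comes first and the caller's options follow.
def decorate_escaped(keyword, url, options = {})
  attributes = { :href => url }.merge(options)
  attribute_text = attributes.map { |key, value| %(#{key}="#{escape_html(value)}") }.join(' ')
  "<a #{attribute_text}>#{keyword}</a>"
end

puts decorate_escaped('Los Angeles Times', 'http://www.latimes.com',
                      :title => 'Visit the "Times"')
# prints: <a href="http://www.latimes.com" title="Visit the &quot;Times&quot;">Los Angeles Times</a>
```

The keyword itself is left unescaped here because, in the gem, it is taken from text that is already part of the HTML document.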
data/lib/keyword_decorator.rb ADDED
@@ -0,0 +1,7 @@
+ #
+ # (C) 2008 Los Angeles Times
+ #
+ class KeywordDecorator
+   attr_accessor :keywords
+   attr_accessor :decorator
+ end
data/lib/keyword_linker.rb ADDED
@@ -0,0 +1,174 @@
+ #
+ # (C) 2008 Los Angeles Times
+ #
+ require 'set'
+ require 'lookup_chain'
+ require 'keyword_prospector'
+ require 'hyperlink_strategy'
+ require 'rubygems'
+ require 'hpricot'
+
+ #
+ # Given a set of keywords and URLs, and optionally HTML attributes to set
+ # on links, takes text and adds hyperlinks from the specified keywords to
+ # their associated URLs. Example:
+ #
+ #   linker = KeywordLinker.new
+ #   linker.add_url('http://www.latimes.com', 'Los Angeles Times')
+ #   linker.link_text("Let's check out the Los Angeles Times!")
+ #   => "Let's check out the <a href=\"http://www.latimes.com\">Los Angeles Times</a>!"
+ #
+ # KeywordLinker depends on hpricot for parsing HTML. This is done to prevent
+ # hyperlinks from being added inside of other hyperlinks and inside of
+ # attribute text.
+ #
+ class KeywordLinker
+   @@blacklist_strategy = Object.new
+   class << @@blacklist_strategy
+     def decorate(keyword)
+       keyword
+     end
+   end
+
+   # Takes an optional array of lookup objects. A lookup object is anything
+   # that responds to the process method and returns an array of Match objects,
+   # including KeywordLinker, KeywordProspector, and LookupChain objects. If
+   # multiple objects are specified, a LookupChain is created that gives highest
+   # priority to matches from objects closer to the end of the array.
+   def initialize(*lookups)
+     @tree_initialized=true
+
+     if(lookups)
+       @lookup = LookupChain.new(lookups)
+     end
+   end
+
+   # Takes a URL and a keyword String or Array of keywords, and adds it to the
+   # tree of keywords in the KeywordLinker. Takes an optional hash of HTML
+   # attributes to be associated with this URL.
+   #
+   # Each URL is linked only once. If multiple keywords are specified, only
+   # the first occurrence of any of the keywords is linked to the target URL;
+   # i.e., if multiple keywords match for this URL, only one instance of one
+   # keyword will be linked.
+   def add_url(url, keyword, html_attributes={})
+     init_lookup
+
+     strategy = HyperlinkStrategy.new(url, html_attributes)
+     strategy.keywords = keyword
+
+     @dl.add(strategy)
+   end
+
+   # Blacklist this keyword or array of keywords. If a keyword is blacklisted,
+   # it will not be linked. For example, if the "Los Angeles" part of
+   # "Los Angeles Times" is getting linked, you can blacklist
+   # "Los Angeles Times" so that the whole phrase is left unlinked.
+   def blacklist_keyword(keyword)
+     init_lookup
+
+     @dl.add(keyword, @@blacklist_strategy)
+   end
+
+   # Initialize the tree after _all_ URLs have been added. This needs to be
+   # called once. If you don't call init_tree, it will be called automatically
+   # on the first call to the process or link_text method. You may find this
+   # annoying or inconvenient if it happens on the first request to your
+   # application and you've constructed a large set of links. Adding URLs
+   # after calling init_tree, process, or link_text is not supported.
+   def init_tree
+     unless @tree_initialized
+       @dl.construct_fail
+       @tree_initialized = true
+     end
+   end
+
+   # Adds links to known URLs into the text provided. Only the first instance
+   # of each keyword or set of keywords associated with a URL is linked. In
+   # cases of overlap, the longest keyword is chosen to resolve the overlap.
+   def link_text(text)
+     init_tree unless @tree_initialized
+
+     linked_outputs = Set.new
+
+     htext = Hpricot(text)
+
+     link_text_in_elem(htext, linked_outputs)
+
+     return htext.to_s
+   end
+
+   # Returns an array of matches in the specified text. Doesn't filter overlaps
+   # or parse HTML to prevent matches in attribute text or inside of existing
+   # hyperlinks. Primarily for internal use.
+   def process(text)
+     init_tree unless @tree_initialized
+
+     @lookup.process(text)
+   end
+
+   private
+   # Initialize the KeywordProspector object if needed. Called only when adding
+   # our own URLs, not when aggregating other lookup objects.
+   def init_lookup
+     unless @dl
+       @dl = KeywordProspector.new
+       @tree_initialized = false
+
+       if @lookup
+         @lookup << @dl
+       else
+         @lookup = @dl
+       end
+     end
+   end
+
+   # Given a single hpricot element, link text inside of all child elements.
+   def link_text_in_elem(elem, linked_outputs)
+     elem.children.each do |e|
+       case e
+       when Hpricot::Text
+         text = e.to_s
+
+         results = process(text)
+
+         results.sort!
+         KeywordProspector.filter_overlaps(results)
+
+         unless (results.nil? || results.empty?)
+           e.content = link_text_internal(text, results, linked_outputs)
+         end
+       when Hpricot::Elem
+         link_text_in_elem(e, linked_outputs) if e.stag.name != "a"
+       end
+     end
+   end
+
+   # Called internally to substitute links in element text.
+   def link_text_internal(text, results, linked_outputs = nil)
+     linked_outputs ||= Set.new
+
+     retval = ""
+     cursor = 0
+     results.each do |result|
+       unless linked_outputs.include?(result.output)
+         if(result.start_idx > cursor)
+           retval += text[cursor, result.start_idx - cursor]
+           cursor = result.start_idx
+         end
+
+         retval += result.output.decorate(result.keyword)
+         cursor = result.end_idx
+
+         linked_outputs.add(result.output)
+       end
+     end
+
+     if(cursor < text.size)
+       retval += text[cursor, text.size-cursor]
+     end
+
+     return retval
+   end
+ end
+
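The cursor-based substitution in `link_text_internal` can be tried in isolation. Below is a minimal, self-contained sketch under stated assumptions: `SketchMatch` and `SketchLink` are hypothetical stand-ins for the gem's Match and HyperlinkStrategy classes (match.rb is not shown in this diff), and only the four fields that `link_text_internal` reads are modeled. Matches are assumed to arrive sorted by start position with overlaps already removed.

```ruby
require 'set'

# Stand-in for the gem's Match class; field names mirror the calls made in
# link_text_internal (keyword, start_idx, end_idx, output).
SketchMatch = Struct.new(:keyword, :start_idx, :end_idx, :output)

# Stand-in for HyperlinkStrategy: anything that responds to decorate.
SketchLink = Struct.new(:url) do
  def decorate(keyword)
    %(<a href="#{url}">#{keyword}</a>)
  end
end

# Walk the matches left to right, copying untouched text up to each match
# and decorating only the first match seen for each output object.
def substitute_links(text, matches, linked_outputs = Set.new)
  result = String.new
  cursor = 0
  matches.each do |match|
    next if linked_outputs.include?(match.output)

    result << text[cursor...match.start_idx]
    result << match.output.decorate(match.keyword)
    cursor = match.end_idx
    linked_outputs.add(match.output)
  end
  result << text[cursor..-1].to_s
end

text = "L.A. Times or Los Angeles Times?"
link = SketchLink.new("http://www.latimes.com")
matches = ["L.A. Times", "Los Angeles Times"].map do |keyword|
  start = text.index(keyword)
  SketchMatch.new(keyword, start, start + keyword.length, link)
end

puts substitute_links(text, matches)
# prints: <a href="http://www.latimes.com">L.A. Times</a> or Los Angeles Times?
```

Because both keywords share one output object, the second match is skipped and its text is copied through unchanged, which is the "link only once per URL" behavior described in the comments above.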
data/lib/keyword_prospector.rb ADDED
@@ -0,0 +1,171 @@
+ #
+ # (C) 2008 Los Angeles Times
+ #
+ $:.unshift(File.dirname(__FILE__)) unless
+   $:.include?(File.dirname(__FILE__)) || $:.include?(File.expand_path(File.dirname(__FILE__)))
+
+ require 'state'
+ require 'match'
+
+ class Position < Struct.new(:begin, :end); end
+
+ # KeywordProspector takes a collection of words, and optionally their
+ # associated outputs, and builds a match tree for running matches of the
+ # keywords against provided text. While construction of the Aho-Corasick
+ # tree takes a long time when there are many keywords, matching runs in time
+ # proportional to the length of the text provided. So, even if you have
+ # tens of thousands of keywords to match against, matching will still be
+ # very fast.
+ class KeywordProspector
+   # If words is provided, each word is added to the tree and the tree is
+   # initialized. Otherwise, call add for each word to place in the dictionary.
+   def initialize(words=nil)
+     @start = State.new(0, 0, [])
+     if(words)
+       words.each{|word| add word}
+
+       construct_fail
+     end
+   end
+
+   # Add an entry to the tree. The optional output parameter can be any object,
+   # and will be returned when this keyword is matched. If the entry has a
+   # _keywords_ method, it should return a collection of keywords. In this
+   # case, the output will be added for each keyword provided. If output is
+   # not provided, the entry is returned.
+   def add(entry, output=nil)
+     output ||= entry
+
+     if (entry.respond_to?(:keywords))
+       entry.keywords.each do |keyword|
+         add_internal(keyword, output)
+       end
+     else
+       add_internal(entry, output)
+     end
+   end
+
+   # Call once after adding all entries. This constructs failure links in the
+   # tree, which allow the state machine to move a single step with every input
+   # character instead of backtracking back to the beginning of the tree when
+   # a partial match fails to match the next character.
+   def construct_fail
+     queue = Queue.new
+     @start.values.each do |value|
+       value.fail = @start
+       value.fail_increment = 1
+       queue.push value
+     end
+
+     prepare_root
+
+     while !queue.empty?
+       r = queue.pop
+       r.keys.each do |char|
+         s = r[char]
+         queue.push s
+         state = r.fail
+         increment = 0
+         while !state[char]
+           increment += state.fail_increment
+           state = state.fail
+         end
+         s.fail = state[char]
+         s.fail_increment = increment
+       end
+     end
+   end
+
+   # Process the provided text for matches. Returns an array of Match objects.
+   # Each Match object contains the keyword matched, the start and end position
+   # in the text, and the output object specified when the keyword was added
+   # to the search tree. The end position is actually the position of the
+   # character immediately following the end of the keyword, such that end
+   # position minus start position equals the length of the keyword string.
+   #
+   # Options:
+   # * :filter_overlaps - When multiple keywords overlap, filter out overlaps
+   #   by choosing the longest match.
+   def process(bytes, options={})
+     retval = [] unless block_given?
+     state = @start
+     position = Position.new(0, 0)
+     bytes.each_byte do |a|
+       state = state.transition(a, position)
+       if state.keyword && ((position.begin == 0 ||
+           KeywordProspector.word_delimiter?(bytes[position.begin-1])) &&
+          (position.end == bytes.length ||
+           KeywordProspector.word_delimiter?(bytes[position.end])))
+         match = Match.new(state.keyword, position.begin, position.end, state.output)
+
+         # do something with the found item
+         if block_given?
+           yield match
+         else
+           retval << match
+         end
+       end
+     end
+
+     if retval
+       if (options[:filter_overlaps])
+         KeywordProspector.filter_overlaps(retval)
+       end
+     end
+
+     return retval
+   end
+
+   # Filters overlaps from an array of results. If two results overlap, the
+   # shorter result is removed. If both results have the same length, the
+   # second result is removed.
+   def self.filter_overlaps(results)
+     i = 0
+     while (i < results.size-1)
+       a = results[i]
+       b = results[i+1]
+       if a.overlap?(b)
+         if (a.length < b.length)
+           results.delete_at(i)
+         else
+           results.delete_at(i+1)
+         end
+       end
+       i += 1
+     end
+   end
+
+   private
+   WORD_CHARS=[?a..?z, ?A..?Z, ?0..?9, ?_]
+
+   # Returns true if the character provided is a word character.
+   def self.word_char?(char)
+     WORD_CHARS.each do |spec|
+       return true if spec === char
+     end
+
+     return false
+   end
+
+   # Returns true if the character provided is not a word character.
+   def self.word_delimiter?(char)
+     return !word_char?(char)
+   end
+
+   # Add a single keyword to the tree.
+   def add_internal(keyword, output=nil)
+     cur_state = @start
+     # assuming a string here
+     keyword.each_byte {|c| cur_state = cur_state.insert_next_state(c)}
+     cur_state.keyword = keyword
+     cur_state.output = output
+   end
+
+   # Used internally to create links from root back to itself for all states
+   # that are not beginnings of known keywords.
+   def prepare_root
+     0.upto(255) do |i|
+       @start[i] = @start if !@start[i]
+     end
+   end
+ end
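For readers who want to see the failure-link idea from `construct_fail` without the gem's State class (state.rb is not shown in this diff), here is a minimal, self-contained Aho-Corasick sketch. It is an illustration under stated assumptions, not the gem's implementation: `SketchNode` and `SketchMatcher` are hypothetical names, transitions are keyed by character rather than byte, and the word-delimiter check and `fail_increment` bookkeeping from the code above are left out.

```ruby
# One trie node: child transitions, a failure link, and the keywords that
# end at this node (including those inherited through failure links).
class SketchNode
  attr_accessor :children, :fail, :keywords

  def initialize
    @children = {}
    @fail = nil
    @keywords = []
  end
end

class SketchMatcher
  def initialize(keywords)
    @root = SketchNode.new
    keywords.each { |keyword| insert(keyword) }
    build_failure_links
  end

  # Returns [keyword, start, end] triples, where end is exclusive.
  def scan(text)
    matches = []
    node = @root
    text.each_char.with_index do |char, index|
      # Follow failure links until some node accepts this character.
      node = node.fail while node != @root && !node.children.key?(char)
      node = node.children.fetch(char, @root)
      node.keywords.each do |keyword|
        matches << [keyword, index - keyword.length + 1, index + 1]
      end
    end
    matches
  end

  private

  def insert(keyword)
    node = @root
    keyword.each_char { |char| node = (node.children[char] ||= SketchNode.new) }
    node.keywords << keyword
  end

  # Breadth-first pass: each node's failure link points at the longest
  # proper suffix of its path that is also a path from the root.
  def build_failure_links
    queue = []
    @root.children.each_value do |child|
      child.fail = @root
      queue << child
    end
    until queue.empty?
      current = queue.shift
      current.children.each do |char, child|
        fallback = current.fail
        fallback = fallback.fail while fallback != @root && !fallback.children.key?(char)
        child.fail = fallback.children.fetch(char, @root)
        child.keywords.concat(child.fail.keywords)
        queue << child
      end
    end
  end
end

matcher = SketchMatcher.new(%w[he she his hers])
p matcher.scan("ushers")
# prints: [["she", 1, 4], ["he", 2, 4], ["hers", 2, 6]]
```

The text is consumed one character at a time and never rewound, which is where the linear-time matching described in the class comment comes from. The sketch reports every match, including overlapping ones, so a caller would still need something like `filter_overlaps` to keep only the longest.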