fast_fuzzy 0.1.0-java
- checksums.yaml +7 -0
- data/LICENSE +13 -0
- data/README.md +140 -0
- data/fast_fuzzy.gemspec +30 -0
- data/lib/fast_fuzzy.rb +27 -0
- data/lib/fast_fuzzy/analyzer.rb +49 -0
- data/lib/fast_fuzzy/fastfuzzy-0.1.0.jar +0 -0
- data/lib/fast_fuzzy/lucene.rb +42 -0
- data/lib/fast_fuzzy/ngram.rb +18 -0
- data/lib/fast_fuzzy/percolator.rb +79 -0
- data/lib/fast_fuzzy/version.rb +3 -0
- data/lib/fast_fuzzy_jars.rb +5 -0
- data/lib/org/apache/lucene/lucene-analyzers-common/5.4.1/lucene-analyzers-common-5.4.1.jar +0 -0
- data/lib/org/apache/lucene/lucene-core/5.4.1/lucene-core-5.4.1.jar +0 -0
- data/spec/fast_fuzzy/analyzer_spec.rb +116 -0
- data/spec/fast_fuzzy/lucene_spec.rb +11 -0
- data/spec/fast_fuzzy/ngram_spec.rb +25 -0
- data/spec/fast_fuzzy/percolator_spec.rb +41 -0
- data/spec/fast_fuzzy_spec.rb +7 -0
- data/spec/spec_helper.rb +2 -0
- metadata +127 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
---
SHA1:
  metadata.gz: 9027df9eb5741a88f7cb6e86217bf10f15e3600a
  data.tar.gz: 638da9aba144ab4cc92b721e07e7fffb23ba9fa4
SHA512:
  metadata.gz: b179f35244d348ea5e2476707e25cf091975e8fdbf3a955f48d8155e4937fc530e869f8d9373f9268368ec444302b3bc34b208eac2cb102b8a3f724bae9cac92
  data.tar.gz: 193d92a94d5611c9b1d687eb3377ef058e99922dad99ac0e4b5a3a605b393942ba73005939fced7dc617296fef6c3cd068f83cce97dd28f729e8b9cf0ef4d03f
data/LICENSE
ADDED
@@ -0,0 +1,13 @@
Copyright (c) 2014-2016 Colin Surprenant colin.surprenant@gmail.com

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
data/README.md
ADDED
@@ -0,0 +1,140 @@
# FastFuzzy

This gem only supports [JRuby](http://jruby.org/).

FastFuzzy performs fast, fuzzy text pattern matching. It uses Lucene analyzers to tokenize the text through a configurable analyzer chain; the extracted tokens are then matched against the searched text using ngram matching, and a resulting match score is computed.

The original intent of this code was to perform on-the-fly matching of specific categories of expressions or sentences in social media. Using an analyzer chain to tokenize, remove stop words, etc. allows the matching to operate only on the relevant text tokens, while ngram scoring provides a fuzziness confidence score that finds approximately matching text despite typos, spelling differences, or any number of other variations. With some experimentation, an "acceptable" score can be chosen to decide whether the searched text matches what we are looking for.

Note that this gem also includes a custom Lucene Twitter tokenizer; see the usage examples below.

## Installation

This gem only supports [JRuby](http://jruby.org/).

Add this line to your application's Gemfile:

```ruby
gem 'fast_fuzzy'
```

And then execute:

    $ bundle

Or install it yourself as:

    $ gem install fast_fuzzy

## Usage

### Percolator

- First, configure the `Percolator` with any number of text strings which represent what we are looking for, i.e. the queries:

```ruby
p = FastFuzzy::Percolator.new

p << "looking for a restaurant"
p << "recommend a restaurant"
```

- Then run the `Percolator` against some text. The result is the list of matching queries, as `[query index, score]` pairs sorted in ascending score order.

```ruby
p.percolate("hey! anyone can recomment a good restaurant in montreal tonight?")
=> [[0, 0.5294117647058824], [1, 0.8421052631578947]]
```

In this example the last query, "recommend a restaurant", matched with a 0.842 or ~84% score despite the "recomment" typo.

```ruby
p.percolate("I am looking for a good suchi restaurant")
=> [[1, 0.47368421052631576], [0, 0.8823529411764706]]
```

In this example the first query, "looking for a restaurant", matched with a 0.882 or ~88% score.
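The score itself is just the fraction of a query's unique character ngrams (4-grams over the space-padded token phrase) that also appear in the analyzed text. Here is a minimal pure-Ruby sketch of that computation; it assumes tokenization and stop-word removal have already happened (the gem does this part with Lucene analyzers), so it takes plain token arrays:

```ruby
N = 4 # gram size used by the Percolator

# character n-grams of a string, e.g. "abcd" => ["ab", "bc", "cd"] for n = 2
def ngrams(str, n = N)
  return [str] if str.size <= n
  (0..str.size - n).map { |i| str[i, n] }
end

# fraction of the query's unique grams found among the text's grams
def score(query_tokens, text_tokens, n = N)
  q = ngrams(" #{query_tokens.join(' ')} ", n).uniq
  t = ngrams(" #{text_tokens.join(' ')} ", n).uniq
  (q & t).size.to_f / q.size
end

# "looking for a restaurant" with stop words removed, matched against the
# tokens of "I am looking for a good suchi restaurant": 15 of 17 grams hit
puts score(%w[looking restaurant], %w[i am looking good suchi restaurant])
# => 0.8823529411764706, the same ~88% as in the example above
```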
### Custom Analyzer Chain

A custom Twitter tokenizer is included; it can be used by defining an analyzer chain for the `Percolator`:

```ruby
p = FastFuzzy::Percolator.new(:analyzer_chain => [
  [Lucene::TwitterTokenizer],
  [Lucene::LowerCaseFilter],
  [Lucene::StopFilter, Lucene::StandardAnalyzer::STOP_WORDS_SET],
])

p << "looking for a restaurant"
p << "recommend a restaurant"
```

```ruby
p.percolate("RT yo lookin for a good restaurant #montreal #foodie")
=> [[1, 0.5263157894736842], [0, 0.7647058823529411]]
```

## Custom Twitter Tokenizer

The included custom Twitter tokenizer was created to classify more Twitter-specific tokens and to help match only on actual text words. YMMV.

## Development

After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).

### Building

- Install the gradle-jflex-plugin

This project uses the gradle-jflex-plugin. The original version does not support JFlex 1.6.1, so you can use my forked version at https://github.com/colinsurprenant/gradle-jflex-plugin, or the original if/when https://github.com/thomaslee/gradle-jflex-plugin/pull/2 is merged.

```sh
$ git clone https://github.com/colinsurprenant/gradle-jflex-plugin
# or
$ git clone https://github.com/thomaslee/gradle-jflex-plugin
```

```sh
$ cd gradle-jflex-plugin
$ gradle build
$ gradle install
# or use gradlew if you do not have Gradle installed
$ ./gradlew build
$ ./gradlew install
```

- Build the Java sources

```sh
$ gradle build
# or use gradlew if you do not have Gradle installed
$ ./gradlew build
```

### Tests / Specs

```sh
$ bundle install
```

```sh
$ bundle exec rspec
```

## Author

**Colin Surprenant** on [GitHub](https://github.com/colinsurprenant) and [Twitter](https://twitter.com/colinsurprenant).

## Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/colinsurprenant/fast_fuzzy.

## License and Copyright

*FastFuzzy* is released under the Apache License, Version 2.0.
data/fast_fuzzy.gemspec
ADDED
@@ -0,0 +1,30 @@
# coding: utf-8
lib = File.expand_path('../lib', __FILE__)
$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
require 'fast_fuzzy/version'

Gem::Specification.new do |spec|
  spec.name = "fast_fuzzy"
  spec.version = FastFuzzy::VERSION
  spec.authors = ["Colin Surprenant"]
  spec.email = ["colin.surprenant@gmail.com"]

  spec.summary = "JRuby Fast and fuzzy text pattern matching using Lucene analyzers"
  spec.description = "JRuby Fast and fuzzy text pattern matching using Lucene analyzers"
  spec.homepage = "https://github.com/colinsurprenant/fast_fuzzy"
  spec.licenses = ["Apache-2.0"]

  spec.files = Dir.glob(["*.gemspec", "lib/**/*.jar", "lib/**/*.rb", "spec/**/*.rb", "LICENSE", "README.md"])
  spec.test_files = spec.files.grep(%r{^(test|spec|features)/})

  spec.platform = "java"

  spec.add_runtime_dependency "jar-dependencies"

  spec.requirements << "jar org.apache.lucene:lucene-core, 5.4.1"
  spec.requirements << "jar org.apache.lucene:lucene-analyzers-common, 5.4.1"

  spec.add_development_dependency "bundler", "~> 1.10"
  spec.add_development_dependency "rake", "~> 10.0"
  spec.add_development_dependency "rspec"
end
data/lib/fast_fuzzy.rb
ADDED
@@ -0,0 +1,27 @@
# encoding: utf-8

require "java"

module FastFuzzy
  # local dev setup
  classes_dir = File.expand_path("../../build/classes/main", __FILE__)

  if File.directory?(classes_dir)
    # if in local dev setup, add target to classpath
    $CLASSPATH << classes_dir unless $CLASSPATH.include?(classes_dir)
  else
    jar_path = "fast_fuzzy/fastfuzzy-#{VERSION}.jar"
    begin
      require jar_path
    rescue Exception => e
      raise("Error loading #{jar_path}, cause: #{e.message}")
    end
  end
end

require "fast_fuzzy_jars"
require "fast_fuzzy/version"
require "fast_fuzzy/lucene"
require "fast_fuzzy/analyzer"
require "fast_fuzzy/ngram"
require "fast_fuzzy/percolator"
data/lib/fast_fuzzy/analyzer.rb
ADDED
@@ -0,0 +1,49 @@
module FastFuzzy

  class Analyzer
    include Enumerable

    STANDARD_CHAIN = [
      [Lucene::StandardTokenizer],
      [Lucene::StandardFilter],
      [Lucene::LowerCaseFilter],
      [Lucene::StopFilter, Lucene::StandardAnalyzer::STOP_WORDS_SET],
    ]

    def initialize(str, chain_definition = STANDARD_CHAIN)
      @str = str

      # the first chain element must be the tokenizer; initialize it and set its reader
      definition = chain_definition.first
      clazz = definition.first
      params = definition[1..-1]
      tokenizer = clazz.new(*params)
      tokenizer.set_reader(java.io.StringReader.new(str))

      # initialize the following filters and build the stream chain
      stream = chain_definition[1..-1].inject(tokenizer) do |result, definition|
        clazz = definition.first
        params = definition[1..-1]
        clazz.new(result, *params)
      end

      # use CachingTokenFilter to allow traversing the stream multiple times
      @stream = Lucene::CachingTokenFilter.new(stream)
      @term_attr = @stream.addAttribute(Lucene::CharTermAttribute.java_class)
      @type_attr = @stream.addAttribute(Lucene::TypeAttribute.java_class)
    end

    # implement each for Enumerable
    def each(&block)
      @stream.reset
      while @stream.incrementToken
        yield [@term_attr.to_string, @type_attr.type]
      end
      @stream.end
    end

    def close
      @stream.close
    end
  end
end
data/lib/fast_fuzzy/fastfuzzy-0.1.0.jar
ADDED
Binary file
data/lib/fast_fuzzy/lucene.rb
ADDED
@@ -0,0 +1,42 @@
require 'java'

module Lucene
  Version = org.apache.lucene.util.Version
  VERSION = Version::LATEST.to_string

  ALPHANUM_TYPE = "<ALPHANUM>".freeze

  module Analysis
    include_package "org.apache.lucene.analysis"

    module Standard
      include_package "org.apache.lucene.analysis.standard"
    end

    module TokenAttributes
      include_package 'org.apache.lucene.analysis.tokenattributes'
    end

    module Core
      include_package 'org.apache.lucene.analysis.core'
    end
  end

  module FastFuzzy
    # custom TwitterTokenizer
    include_package "com.colinsurprenant.fastfuzzy"
  end

  StandardTokenizer = Analysis::Standard::StandardTokenizer
  ClassicTokenizer = Analysis::Standard::ClassicTokenizer
  StandardAnalyzer = Analysis::Standard::StandardAnalyzer
  StandardFilter = Analysis::Standard::StandardFilter
  LowerCaseFilter = Analysis::Core::LowerCaseFilter
  StopFilter = Analysis::Core::StopFilter
  CharTermAttribute = Analysis::TokenAttributes::CharTermAttribute
  CachingTokenFilter = Analysis::CachingTokenFilter
  TypeAttribute = Analysis::TokenAttributes::TypeAttribute
  FlagsAttribute = Analysis::TokenAttributes::FlagsAttribute

  TwitterTokenizer = FastFuzzy::TwitterTokenizer
end
data/lib/fast_fuzzy/ngram.rb
ADDED
@@ -0,0 +1,18 @@
module FastFuzzy

  class Ngram

    # generate the ngrams for either the character grams of a string or the item grams of an array
    # example: generate("abcd") => ["ab", "bc", "cd"], generate([:a, :b, :c, :d]) => [[:a, :b], [:b, :c], [:c, :d]]
    # @param str_or_ary [String|Array] input string or array of anything
    # @return [Array] array of ngrams
    def self.generate(str_or_ary, n = 2)
      if (s = str_or_ary.size) <= n
        return [] if s.zero?
        return [str_or_ary]
      end

      (0..str_or_ary.size - n).map{|i| str_or_ary[i, n]}
    end
  end
end
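`Ngram.generate` treats strings and arrays uniformly by sliding a window of size `n` over their elements, with inputs shorter than `n` returned whole. A standalone sketch of the same logic, for illustration outside the gem:

```ruby
# re-implementation of the FastFuzzy::Ngram.generate sliding-window logic
# (illustrative sketch; works on a string's characters or an array's items)
def generate(str_or_ary, n = 2)
  size = str_or_ary.size
  return [] if size.zero?              # nothing to gram
  return [str_or_ary] if size <= n     # shorter than n: returned whole
  (0..size - n).map { |i| str_or_ary[i, n] }
end

generate("abcd")            # => ["ab", "bc", "cd"]
generate([:a, :b, :c, :d])  # => [[:a, :b], [:b, :c], [:c, :d]]
generate("a")               # => ["a"]
```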
data/lib/fast_fuzzy/percolator.rb
ADDED
@@ -0,0 +1,79 @@
require 'fast_fuzzy/analyzer'
require 'fast_fuzzy/ngram'

module FastFuzzy

  class Percolator
    attr_reader :queries, :index
    N = 4 # >= 3 is necessary to weight the consecutiveness of words (ex "this car" => thi, his, is_, s_c, _ca, car)

    def initialize(options = {})
      @queries = options.delete(:queries) || []
      @analyzer_chain = options.delete(:analyzer_chain) || Analyzer::STANDARD_CHAIN

      @index = Hash.new{|h, k| h[k] = []}
      @query_terms_count = []

      build_index unless queries.empty?
    end

    def <<(query)
      @queries << query
      build_index
    end

    def percolate(str)
      analyser = Analyzer.new(str, @analyzer_chain)

      terms_phrase = analyser.select{|term| term[1] == Lucene::ALPHANUM_TYPE}.map{|term| term[0]}.join(' ')
      terms_phrase = " #{terms_phrase} "
      grams = Ngram.generate(terms_phrase, N).uniq
      score(lookup(grams))
    end

    private

    # build an inverted index into @index that lists all queries containing each term
    # {"term" => [query id, query id, ...]}
    def build_index
      @index.clear
      @query_terms_count.clear

      @queries.each_index do |query_id|
        analyser = Analyzer.new(@queries[query_id], @analyzer_chain)

        terms_phrase = analyser.select{|term| term[1] == Lucene::ALPHANUM_TYPE}.map{|term| term[0]}.join(' ')
        terms_phrase = " #{terms_phrase} "
        grams = Ngram.generate(terms_phrase, N).uniq
        @query_terms_count[query_id] = grams.size
        grams.each{|gram| @index[gram] << query_id}
      end
      @index.each_key{|k| @index[k] = @index[k].uniq}

      self
    end

    # for each search term, look up the index and keep count of the number of terms that matched each query
    # @param terms [Array<String>] search terms
    # @return [Hash] number of terms that matched each query, {query id => terms hit count}
    def lookup(terms)
      results = Hash.new(0)

      terms.each do |term|
        # @index[term] has the list of query ids matching this term
        @index[term].each{|query_id| results[query_id] += 1}
      end

      results
    end


    # compute the match score per query for a given lookup result
    # @param results [Hash] number of terms that matched each query, {query id => terms hit count}
    # @return [Array<Array>] array of [query id, score] sorted by ascending score; includes only queries with score > 0, [] if no match
    def score(results)
      results.map{|k, v| [k, v.to_f / @query_terms_count[k]]}.sort_by{|r| r[1]}
    end

  end
end
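The inverted index at the heart of the percolator maps each gram to the list of query ids containing it, so percolating costs one hash lookup per gram of the incoming text rather than a scan of every query. A minimal sketch of that index-and-count step, independent of the Lucene analysis (the grams here are hypothetical placeholders):

```ruby
# inverted index: gram => list of query ids whose grams include it
index = Hash.new { |h, k| h[k] = [] }
queries = [["ab", "bc"], ["bc", "cd"]] # pre-computed unique grams per query
queries.each_with_index do |grams, query_id|
  grams.each { |gram| index[gram] << query_id }
end

# count, per query, how many of the incoming text's grams hit it;
# unknown grams ("de") simply hit an empty list and contribute nothing
text_grams = ["bc", "cd", "de"]
hits = Hash.new(0)
text_grams.each { |gram| index[gram].each { |id| hits[id] += 1 } }
hits # => {0=>1, 1=>2}
```

Dividing each hit count by that query's total gram count then yields the 0..1 score returned by `percolate`.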
data/lib/org/apache/lucene/lucene-analyzers-common/5.4.1/lucene-analyzers-common-5.4.1.jar
ADDED
Binary file

data/lib/org/apache/lucene/lucene-core/5.4.1/lucene-core-5.4.1.jar
ADDED
Binary file
data/spec/fast_fuzzy/analyzer_spec.rb
ADDED
@@ -0,0 +1,116 @@
require_relative '../spec_helper'

describe FastFuzzy::Analyzer do

  it "should support to_a" do
    analyzer = FastFuzzy::Analyzer.new("aa bb")
    expect(analyzer.to_a).to be_a(Array)
    expect(analyzer.to_a.size).to eq(2)
  end

  it "should support each" do
    analyzer = FastFuzzy::Analyzer.new("aa bb")
    result = []
    analyzer.each{|token| result << token}
    expect(result.size).to eq(2)
  end

  shared_examples :analyzer do
    it "should analyse and extract correct type" do
      tests.each do |test|
        analyzer = FastFuzzy::Analyzer.new(test.first, chain)
        expect(analyzer.to_a).to eq(test.last)
      end
    end

    it "should support each" do
      analyzer = FastFuzzy::Analyzer.new("aa bb cc", chain)
      results = []
      analyzer.each{|term| results << term}
      expect(results).to eq([["aa", "<ALPHANUM>"], ["bb", "<ALPHANUM>"], ["cc", "<ALPHANUM>"]])
    end

    it "should support Enumerable" do
      analyzer = FastFuzzy::Analyzer.new("aa bb 12", chain)
      expect(analyzer.map{|term| term.first}).to eq(["aa", "bb", "12"])
      expect(analyzer.map{|term| term.last}).to eq(["<ALPHANUM>", "<ALPHANUM>", "<NUM>"])
    end
  end

  context "using default analyzer chain" do
    it_behaves_like :analyzer do
      let(:tests) {[
        # the following tests are copied from the TwitterTokenizer tests below and
        # adjusted to the StandardTokenizer chain results.
        # TODO: make better tests for StandardTokenizer & default chain

        ["2012 $80K/year", [["2012", "<NUM>"], ["80k", "<ALPHANUM>"], ["year", "<ALPHANUM>"]]],
        ["is @ 7.9% and 1% not 4th", [["7.9", "<NUM>"], ["1", "<NUM>"], ["4th", "<ALPHANUM>"]]],

        ["it's o'reilly o'reilly's hey", [["it's", "<ALPHANUM>"], ["o'reilly", "<ALPHANUM>"], ["o'reilly's", "<ALPHANUM>"], ["hey", "<ALPHANUM>"]]],

        ["at&t excite@home", [["t", "<ALPHANUM>"], ["excite", "<ALPHANUM>"], ["home", "<ALPHANUM>"]]],

        ["I.B.M. is big", [["i.b.m", "<ALPHANUM>"], ["big", "<ALPHANUM>"]]],

        ["see @colinsurprenant do #win yeah", [["see", "<ALPHANUM>"], ["colinsurprenant", "<ALPHANUM>"], ["do", "<ALPHANUM>"], ["win", "<ALPHANUM>"], ["yeah", "<ALPHANUM>"]]],

        ["go http://colinsurprenant.com/ http://a go", [["go", "<ALPHANUM>"], ["http", "<ALPHANUM>"], ["colinsurprenant.com", "<ALPHANUM>"], ["http", "<ALPHANUM>"], ["go", "<ALPHANUM>"]]],

        ["colin.surprenant@gmail.com", [["colin.surprenant", "<ALPHANUM>"], ["gmail.com", "<ALPHANUM>"]]],

        ["Face-Lift - Graphic", [["face", "<ALPHANUM>"], ["lift", "<ALPHANUM>"], ["graphic", "<ALPHANUM>"]]],

        ["DUDZ :) :P :( 8-(", [["dudz", "<ALPHANUM>"], ["p", "<ALPHANUM>"], ["8", "<NUM>"]]],

        ["RT stoopid #FF ", [["rt", "<ALPHANUM>"], ["stoopid", "<ALPHANUM>"], ["ff", "<ALPHANUM>"]]],
      ]}

      let(:chain) { FastFuzzy::Analyzer::STANDARD_CHAIN }
    end
  end

  context "using TwitterTokenizer chain" do
    it_behaves_like :analyzer do
      let(:tests) {[
        # NUM
        ["2012 $80K/year", [["2012", "<NUM>"], ["$80k", "<NUM>"], ["year", "<ALPHANUM>"]]],
        ["is @ 7.9% and 1% not 4th", [["7.9%", "<NUM>"], ["1%", "<NUM>"], ["4th", "<ALPHANUM>"]]],

        # APOSTROPHE
        ["it's o'reilly o'reilly's hey", [["it's", "<APOSTROPHE>"], ["o'reilly", "<APOSTROPHE>"], ["o'reilly's", "<APOSTROPHE>"], ["hey", "<ALPHANUM>"]]],

        # COMPANY
        ["at&t excite@home", [["at&t", "<COMPANY>"], ["excite@home", "<COMPANY>"]]],

        # ACRONYM
        ["I.B.M. is big", [["i.b.m.", "<ACRONYM>"], ["big", "<ALPHANUM>"]]],

        # USERNAME & HASHTAG
        ["see @colinsurprenant do #win yeah", [["see", "<ALPHANUM>"], ["@colinsurprenant", "<USERNAME>"], ["do", "<ALPHANUM>"], ["#win", "<HASHTAG>"], ["yeah", "<ALPHANUM>"]]],

        # URL
        ["go http://colinsurprenant.com/ http://a go", [["go", "<ALPHANUM>"], ["http://colinsurprenant.com/", "<URL>"], ["http://a", "<URL>"], ["go", "<ALPHANUM>"]]],

        # EMAIL
        ["colin.surprenant@gmail.com", [["colin.surprenant@gmail.com", "<EMAIL>"]]],

        # DASH
        ["Face-Lift - Graphic", [["face-lift", "<DASH>"], ["graphic", "<ALPHANUM>"]]],

        # EMOTICON
        ["DUDZ :) :P :( 8-(", [["dudz", "<ALPHANUM>"], [":)", "<EMOTICON>"], [":p", "<EMOTICON>"], [":(", "<EMOTICON>"], ["8-(", "<EMOTICON>"]]],

        # HASHTAG & RT
        ["RT stoopid #FF ", [["rt", "<RT>"], ["stoopid", "<ALPHANUM>"], ["#ff", "<HASHTAG>"]]],
      ]}

      let(:chain) {[
        [Lucene::TwitterTokenizer],
        [Lucene::LowerCaseFilter],
        [Lucene::StopFilter, Lucene::StandardAnalyzer::STOP_WORDS_SET],
      ]}
    end
  end

end
data/spec/fast_fuzzy/ngram_spec.rb
ADDED
@@ -0,0 +1,25 @@
require_relative '../spec_helper'

describe FastFuzzy::Ngram do

  it "should generate bigrams" do
    expect(FastFuzzy::Ngram.generate("allo")).to eq(["al", "ll", "lo"])
    expect(FastFuzzy::Ngram.generate("allo", 2)).to eq(["al", "ll", "lo"])
    expect(FastFuzzy::Ngram.generate([:a, :b, :c, :d], 2)).to eq([[:a, :b], [:b, :c], [:c, :d]])
  end

  it "should generate trigrams" do
    expect(FastFuzzy::Ngram.generate("allotoi", 3)).to eq(["all", "llo", "lot", "oto", "toi"])
    expect(FastFuzzy::Ngram.generate([:a, :b, :c, :d], 3)).to eq([[:a, :b, :c], [:b, :c, :d]])
  end

  it "should generate gram <= n" do
    expect(FastFuzzy::Ngram.generate("al", 2)).to eq(["al"])
    expect(FastFuzzy::Ngram.generate([:a, :b], 2)).to eq([[:a, :b]])
  end

  it "should generate gram < n" do
    expect(FastFuzzy::Ngram.generate("a", 2)).to eq(["a"])
    expect(FastFuzzy::Ngram.generate([:a], 2)).to eq([[:a]])
  end
end
data/spec/fast_fuzzy/percolator_spec.rb
ADDED
@@ -0,0 +1,41 @@
require_relative '../spec_helper'

describe FastFuzzy::Percolator do

  round4 = lambda {|x| x.round(4)}

  it "should add queries" do
    p = FastFuzzy::Percolator.new
    p << "test1"
    p << "test2"
    expect(p.queries).to eq(["test1", "test2"])

    p = FastFuzzy::Percolator.new(:queries => ["test1", "test2"])
    expect(p.queries).to eq(["test1", "test2"])
  end

  it "should percolate using standard chain by default" do
    p = FastFuzzy::Percolator.new
    p << "faire une chose"
    p << "etre subtil"

    expect(p.percolate("veux tu faire quelque chose?").map{|r| [r[0], r[1].round(4)]}).to eq([[0, 0.6429]])
    expect(p.percolate("allo toi").map{|r| [r[0], r[1].round(4)]}).to be_empty
    expect(p.percolate("il faut etre quelque peu subtil").map{|r| [r[0], r[1].round(4)]}).to eq([[1, 0.8]])

    p = FastFuzzy::Percolator.new
    p << "looking for a job"
    expect(p.percolate("I hate looking for a job").map{|r| [r[0], r[1].round(4)]}).to eq([[0, 1.0]])
  end

  it "should support configurable chain" do
    p = FastFuzzy::Percolator.new(:analyzer_chain => [[Lucene::StandardTokenizer], [Lucene::StandardFilter]])
    p << "foo bar"
    expect(p.percolate("FOO BAR")).to eq([])

    p = FastFuzzy::Percolator.new(:analyzer_chain => [[Lucene::StandardTokenizer], [Lucene::StandardFilter], [Lucene::LowerCaseFilter]])
    p << "foo bar"
    expect(p.percolate("FOO BAR")).to eq([[0, 1.0]])
  end

end
data/spec/spec_helper.rb
ADDED
metadata
ADDED
@@ -0,0 +1,127 @@
--- !ruby/object:Gem::Specification
name: fast_fuzzy
version: !ruby/object:Gem::Version
  version: 0.1.0
platform: java
authors:
- Colin Surprenant
autorequire:
bindir: bin
cert_chain: []
date: 2016-03-08 00:00:00.000000000 Z
dependencies:
- !ruby/object:Gem::Dependency
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
  name: jar-dependencies
  prerelease: false
  type: :runtime
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
- !ruby/object:Gem::Dependency
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: '1.10'
  name: bundler
  prerelease: false
  type: :development
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: '1.10'
- !ruby/object:Gem::Dependency
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: '10.0'
  name: rake
  prerelease: false
  type: :development
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: '10.0'
- !ruby/object:Gem::Dependency
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
  name: rspec
  prerelease: false
  type: :development
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
description: JRuby Fast and fuzzy text pattern matching using Lucene analyzers
email:
- colin.surprenant@gmail.com
executables: []
extensions: []
extra_rdoc_files: []
files:
- LICENSE
- README.md
- fast_fuzzy.gemspec
- lib/fast_fuzzy.rb
- lib/fast_fuzzy/analyzer.rb
- lib/fast_fuzzy/fastfuzzy-0.1.0.jar
- lib/fast_fuzzy/lucene.rb
- lib/fast_fuzzy/ngram.rb
- lib/fast_fuzzy/percolator.rb
- lib/fast_fuzzy/version.rb
- lib/fast_fuzzy_jars.rb
- lib/org/apache/lucene/lucene-analyzers-common/5.4.1/lucene-analyzers-common-5.4.1.jar
- lib/org/apache/lucene/lucene-core/5.4.1/lucene-core-5.4.1.jar
- spec/fast_fuzzy/analyzer_spec.rb
- spec/fast_fuzzy/lucene_spec.rb
- spec/fast_fuzzy/ngram_spec.rb
- spec/fast_fuzzy/percolator_spec.rb
- spec/fast_fuzzy_spec.rb
- spec/spec_helper.rb
homepage: https://github.com/colinsurprenant/fast_fuzzy
licenses:
- Apache-2.0
metadata: {}
post_install_message:
rdoc_options: []
require_paths:
- lib
required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      version: '0'
required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      version: '0'
requirements:
- jar org.apache.lucene:lucene-core, 5.4.1
- jar org.apache.lucene:lucene-analyzers-common, 5.4.1
rubyforge_project:
rubygems_version: 2.4.8
signing_key:
specification_version: 4
summary: JRuby Fast and fuzzy text pattern matching using Lucene analyzers
test_files:
- spec/fast_fuzzy/analyzer_spec.rb
- spec/fast_fuzzy/lucene_spec.rb
- spec/fast_fuzzy/ngram_spec.rb
- spec/fast_fuzzy/percolator_spec.rb
- spec/fast_fuzzy_spec.rb
- spec/spec_helper.rb