fast_fuzzy 0.1.0-java

checksums.yaml.gz ADDED

---
SHA1:
  metadata.gz: 9027df9eb5741a88f7cb6e86217bf10f15e3600a
  data.tar.gz: 638da9aba144ab4cc92b721e07e7fffb23ba9fa4
SHA512:
  metadata.gz: b179f35244d348ea5e2476707e25cf091975e8fdbf3a955f48d8155e4937fc530e869f8d9373f9268368ec444302b3bc34b208eac2cb102b8a3f724bae9cac92
  data.tar.gz: 193d92a94d5611c9b1d687eb3377ef058e99922dad99ac0e4b5a3a605b393942ba73005939fced7dc617296fef6c3cd068f83cce97dd28f729e8b9cf0ef4d03f
data/LICENSE ADDED
Copyright (c) 2014-2016 Colin Surprenant colin.surprenant@gmail.com

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
data/README.md ADDED

# FastFuzzy

This gem only supports [JRuby](http://jruby.org/).

FastFuzzy performs fast, fuzzy text pattern matching. It uses Lucene analyzers to tokenize text through a configurable analyzer chain; the extracted tokens are then matched against the searched text using ngram matching, and a resulting match score is computed.

The original intent of this code was to perform on-the-fly matching of specific categories of expressions or sentences in social media. Using an analyzer chain to tokenize, remove stop words, etc., restricts the matching to the relevant text tokens. Ngram scoring provides a fuzziness confidence score, so approximate matches are found despite typos, spelling differences, or any number of other variations. With experimentation, an "acceptable" score can be chosen to decide whether the searched text matches what we are looking for.
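As a rough illustration of the idea (a standalone sketch, not the gem's actual implementation), an ngram-overlap score can be computed as the fraction of the query's ngrams that also occur in the searched text:

```ruby
# Standalone sketch of ngram-overlap scoring; this is NOT the gem's code,
# just an illustration of the fuzziness idea described above.
# Phrases are space-padded so word boundaries contribute grams.
def ngrams(str, n = 4)
  padded = " #{str} "
  return [padded] if padded.size <= n
  (0..padded.size - n).map { |i| padded[i, n] }.uniq
end

def match_score(query, text, n = 4)
  query_grams = ngrams(query, n)
  text_grams  = ngrams(text, n)
  hits = (query_grams & text_grams).size
  hits.to_f / query_grams.size
end

match_score("recommend a restaurant", "anyone can recomment a good restaurant tonight")
# the "recomment" typo only costs a few grams, so the score stays high
```

An identical phrase scores 1.0, and small typos only remove the few grams that straddle the changed characters, which is what makes the matching tolerant.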

Note that this gem also includes a custom Lucene Twitter tokenizer; see the usage examples below.

## Installation

This gem only supports [JRuby](http://jruby.org/).

Add this line to your application's Gemfile:

```ruby
gem 'fast_fuzzy'
```

And then execute:

    $ bundle

Or install it yourself as:

    $ gem install fast_fuzzy

## Usage

### Percolator

- First configure the `Percolator` with any number of text strings representing what we are looking for, i.e. the queries:

```ruby
p = FastFuzzy::Percolator.new

p << "looking for a restaurant"
p << "recommend a restaurant"
```

- Run the `Percolator` against some text. The result is the list of matching "queries" sorted in ascending score order.

```ruby
p.percolate("hey! anyone can recomment a good restaurant in montreal tonight?")
=> [[0, 0.5294117647058824], [1, 0.8421052631578947]]
```

In this example the last query, "recommend a restaurant", matched with a score of 0.842, or ~84%.

```ruby
p.percolate("I am looking for a good suchi restaurant")
=> [[1, 0.47368421052631576], [0, 0.8823529411764706]]
```

In this example the first query, "looking for a restaurant", matched with a score of 0.882, or ~88%.
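Since `percolate` returns every query with a non-zero score, callers typically keep only the results above an experimentally chosen cutoff. For example (0.8 here is an arbitrary threshold for illustration, not a gem default):

```ruby
# Filter percolate-style results ([[query_id, score], ...]) by a score cutoff.
# The threshold 0.8 is an arbitrary example value chosen by experimentation.
results = [[1, 0.47368421052631576], [0, 0.8823529411764706]]
matches = results.select { |_query_id, score| score >= 0.8 }
matches.map { |query_id, _score| query_id }
# => [0]
```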

### Custom Analyzer Chain

A custom Twitter tokenizer is included; it can be used by defining an analyzer chain for the `Percolator`:

```ruby
p = FastFuzzy::Percolator.new(:analyzer_chain => [
  [Lucene::TwitterTokenizer],
  [Lucene::LowerCaseFilter],
  [Lucene::StopFilter, Lucene::StandardAnalyzer::STOP_WORDS_SET],
])

p << "looking for a restaurant"
p << "recommend a restaurant"
```

```ruby
p.percolate("RT yo lookin for a good restaurant #montreal #foodie")
=> [[1, 0.5263157894736842], [0, 0.7647058823529411]]
```

## Custom Twitter Tokenizer

The included custom Twitter tokenizer was created to classify more Twitter-specific tokens and to help match only on actual text words. YMMV.

## Development

After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).

### Building

- Install gradle-jflex-plugin

This project uses the gradle-jflex-plugin. The original version does not support JFlex 1.6.1, so you can use my forked version at https://github.com/colinsurprenant/gradle-jflex-plugin, or the original if/when https://github.com/thomaslee/gradle-jflex-plugin/pull/2 is merged.

```sh
$ git clone https://github.com/colinsurprenant/gradle-jflex-plugin
# or
$ git clone https://github.com/thomaslee/gradle-jflex-plugin
```

```sh
$ cd gradle-jflex-plugin
$ gradle build
$ gradle install
# or use gradlew if you do not have Gradle installed
$ ./gradlew build
$ ./gradlew install
```

- Build Java sources

```sh
$ gradle build
# or use gradlew if you do not have Gradle installed
$ ./gradlew build
```

### Tests / Specs

```sh
$ bundle install
```

```sh
$ bundle exec rspec
```

## Author

**Colin Surprenant** on [GitHub](https://github.com/colinsurprenant) and [Twitter](https://twitter.com/colinsurprenant).

## Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/colinsurprenant/fast_fuzzy.

## License and Copyright

*FastFuzzy* is released under the Apache License, Version 2.0.
data/fast_fuzzy.gemspec ADDED

# coding: utf-8
lib = File.expand_path('../lib', __FILE__)
$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
require 'fast_fuzzy/version'

Gem::Specification.new do |spec|
  spec.name = "fast_fuzzy"
  spec.version = FastFuzzy::VERSION
  spec.authors = ["Colin Surprenant"]
  spec.email = ["colin.surprenant@gmail.com"]

  spec.summary = "JRuby Fast and fuzzy text pattern matching using Lucene analyzers"
  spec.description = "JRuby Fast and fuzzy text pattern matching using Lucene analyzers"
  spec.homepage = "https://github.com/colinsurprenant/fast_fuzzy"
  spec.licenses = ["Apache-2.0"]

  spec.files = Dir.glob(["*.gemspec", "lib/**/*.jar", "lib/**/*.rb", "spec/**/*.rb", "LICENSE", "README.md"])
  spec.test_files = spec.files.grep(%r{^(test|spec|features)/})

  spec.platform = "java"

  spec.add_runtime_dependency "jar-dependencies"

  spec.requirements << "jar org.apache.lucene:lucene-core, 5.4.1"
  spec.requirements << "jar org.apache.lucene:lucene-analyzers-common, 5.4.1"

  spec.add_development_dependency "bundler", "~> 1.10"
  spec.add_development_dependency "rake", "~> 10.0"
  spec.add_development_dependency "rspec"
end
data/lib/fast_fuzzy.rb ADDED

# encoding: utf-8

require "java"

# VERSION must be loaded before it is used to build the jar path below
require "fast_fuzzy/version"

module FastFuzzy
  # local dev setup
  classes_dir = File.expand_path("../../build/classes/main", __FILE__)

  if File.directory?(classes_dir)
    # if in local dev setup, add target to classpath
    $CLASSPATH << classes_dir unless $CLASSPATH.include?(classes_dir)
  else
    jar_path = "fast_fuzzy/fastfuzzy-#{VERSION}.jar"
    begin
      require jar_path
    rescue Exception => e
      raise("Error loading #{jar_path}, cause: #{e.message}")
    end
  end
end

require "fast_fuzzy_jars"
require "fast_fuzzy/version"
require "fast_fuzzy/lucene"
require "fast_fuzzy/analyzer"
require "fast_fuzzy/ngram"
require "fast_fuzzy/percolator"
data/lib/fast_fuzzy/analyzer.rb ADDED

module FastFuzzy

  class Analyzer
    include Enumerable

    STANDARD_CHAIN = [
      [Lucene::StandardTokenizer],
      [Lucene::StandardFilter],
      [Lucene::LowerCaseFilter],
      [Lucene::StopFilter, Lucene::StandardAnalyzer::STOP_WORDS_SET],
    ]

    def initialize(str, chain_definition = STANDARD_CHAIN)
      @str = str

      # first chain element must be the tokenizer; initialize it and set its reader
      definition = chain_definition.first
      clazz = definition.first
      params = definition[1..-1]
      tokenizer = clazz.new(*params)
      tokenizer.set_reader(java.io.StringReader.new(str))

      # initialize the following filters and build the stream chain
      stream = chain_definition[1..-1].inject(tokenizer) do |result, definition|
        clazz = definition.first
        params = definition[1..-1]
        clazz.new(result, *params)
      end

      # use CachingTokenFilter to allow multiple stream traversals
      @stream = Lucene::CachingTokenFilter.new(stream)
      @term_attr = @stream.addAttribute(Lucene::CharTermAttribute.java_class)
      @type_attr = @stream.addAttribute(Lucene::TypeAttribute.java_class)
    end

    # implement each for Enumerable
    def each(&block)
      @stream.reset
      while @stream.incrementToken
        yield [@term_attr.to_string, @type_attr.type]
      end
      @stream.end
    end

    def close
      @stream.close
    end
  end
end
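The analyzer chain format above (`[[TokenizerClass, *params], [FilterClass, *params], ...]`) and the inject-based construction can be illustrated in plain Ruby; the `Source`, `Upcase`, and `Suffix` classes below are hypothetical stand-ins for the Lucene tokenizer and filter classes, not part of the gem:

```ruby
# Pure-Ruby illustration of the chain-building pattern in Analyzer#initialize.
# Source/Upcase/Suffix are hypothetical stand-ins for Lucene classes.
Source = Struct.new(:value)
Upcase = Struct.new(:input) { def value; input.value.upcase; end }
Suffix = Struct.new(:input, :suffix) { def value; input.value + suffix; end }

chain_definition = [
  [Source, "hello"], # first element: the source, like the tokenizer
  [Upcase],          # following elements: filters wrapping the previous stage
  [Suffix, "!"],
]

# initialize the source stage from the first definition
clazz, *params = chain_definition.first
source = clazz.new(*params)

# wrap each subsequent filter around the previous stage
stream = chain_definition[1..-1].inject(source) do |result, (filter_class, *filter_params)|
  filter_class.new(result, *filter_params)
end

stream.value # => "HELLO!"
```

The design choice is that each chain entry carries its own constructor parameters, so arbitrary filters (like `StopFilter` with its stop-word set) can be configured without the `Analyzer` knowing their signatures.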
data/lib/fast_fuzzy/lucene.rb ADDED

require 'java'

module Lucene
  Version = org.apache.lucene.util.Version
  VERSION = Version::LATEST.to_string

  ALPHANUM_TYPE = "<ALPHANUM>".freeze

  module Analysis
    include_package "org.apache.lucene.analysis"

    module Standard
      include_package "org.apache.lucene.analysis.standard"
    end

    module TokenAttributes
      include_package 'org.apache.lucene.analysis.tokenattributes'
    end

    module Core
      include_package 'org.apache.lucene.analysis.core'
    end
  end

  module FastFuzzy
    # custom TwitterTokenizer
    include_package "com.colinsurprenant.fastfuzzy"
  end

  StandardTokenizer = Analysis::Standard::StandardTokenizer
  ClassicTokenizer = Analysis::Standard::ClassicTokenizer
  StandardAnalyzer = Analysis::Standard::StandardAnalyzer
  StandardFilter = Analysis::Standard::StandardFilter
  LowerCaseFilter = Analysis::Core::LowerCaseFilter
  StopFilter = Analysis::Core::StopFilter
  CharTermAttribute = Analysis::TokenAttributes::CharTermAttribute
  CachingTokenFilter = Analysis::CachingTokenFilter
  TypeAttribute = Analysis::TokenAttributes::TypeAttribute
  FlagsAttribute = Analysis::TokenAttributes::FlagsAttribute

  TwitterTokenizer = FastFuzzy::TwitterTokenizer
end
data/lib/fast_fuzzy/ngram.rb ADDED

module FastFuzzy

  class Ngram

    # generate the ngrams for either the character grams of a string or the item grams of an array
    # example: generate("abcd") => ["ab", "bc", "cd"], generate([:a, :b, :c, :d]) => [[:a, :b], [:b, :c], [:c, :d]]
    # @param str_or_ary [String|Array] input string or array of anything
    # @return [Array] array of ngrams
    def self.generate(str_or_ary, n = 2)
      if (s = str_or_ary.size) <= n
        return [] if s.zero?
        return [str_or_ary]
      end

      (0..str_or_ary.size - n).map{|i| str_or_ary[i, n]}
    end
  end
end
data/lib/fast_fuzzy/percolator.rb ADDED

require 'fast_fuzzy/analyzer'
require 'fast_fuzzy/ngram'

module FastFuzzy

  class Percolator
    attr_reader :queries, :index
    N = 4 # >= 3 is necessary to weight the consecutiveness of words (ex "this car" => thi, his, is_, s_c, _ca, car)

    def initialize(options = {})
      @queries = options.delete(:queries) || []
      @analyzer_chain = options.delete(:analyzer_chain) || Analyzer::STANDARD_CHAIN

      @index = Hash.new{|h, k| h[k] = []}
      @query_terms_count = []

      build_index unless queries.empty?
    end

    def <<(query)
      @queries << query
      build_index
    end

    def percolate(str)
      analyser = Analyzer.new(str, @analyzer_chain)

      terms_phrase = analyser.select{|term| term[1] == Lucene::ALPHANUM_TYPE}.map{|term| term[0]}.join(' ')
      terms_phrase = " #{terms_phrase} "
      grams = Ngram.generate(terms_phrase, N).uniq
      score(lookup(grams))
    end

    private

    # build an inverted index into @index that lists all queries that contain each term
    # {"term" => [query id, query id, ...]}
    def build_index
      @index.clear
      @query_terms_count.clear

      @queries.each_index do |query_id|
        analyser = Analyzer.new(@queries[query_id], @analyzer_chain)

        terms_phrase = analyser.select{|term| term[1] == Lucene::ALPHANUM_TYPE}.map{|term| term[0]}.join(' ')
        terms_phrase = " #{terms_phrase} "
        grams = Ngram.generate(terms_phrase, N).uniq
        @query_terms_count[query_id] = grams.size
        grams.each{|gram| @index[gram] << query_id}
      end
      @index.each_key{|k| @index[k] = @index[k].uniq}

      self
    end

    # for each search term, look up the index and keep count of the number of terms that matched each query
    # @param terms [Array<String>] search terms
    # @return [Hash] number of terms that matched each query, {query id => terms hit count}
    def lookup(terms)
      results = Hash.new(0)

      terms.each do |term|
        # @index[term] has the list of query_id matching this term
        @index[term].each{|query_id| results[query_id] += 1}
      end

      results
    end

    # compute match score per query for given lookup result
    # @param results [Hash] number of terms that matched (> 0.0 score) each query, {query id => terms hit count}
    # @return [Array<Array>] array of [query id, score] sorted by score; includes only queries with score > 0, [] if no match
    def score(results)
      results.map{|k, v| [k, v.to_f / @query_terms_count[k]]}.sort_by{|r| r[1]}
    end

  end
end
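The `build_index` / `lookup` / `score` pipeline above can be mirrored in plain Ruby to show how the inverted index drives the hits-over-total-grams score. This sketch skips the Lucene analysis step and, for readability, uses whole words as "grams" instead of the gem's character 4-grams:

```ruby
# Plain-Ruby mirror of build_index / lookup / score.
# Whole words stand in for the gem's character 4-grams here.
queries = ["looking for a restaurant", "recommend a restaurant"]

# build_index: inverted index {gram => [query_id, ...]} plus per-query gram counts
index = Hash.new { |h, k| h[k] = [] }
query_terms_count = []
queries.each_with_index do |query, query_id|
  grams = query.split.uniq
  query_terms_count[query_id] = grams.size
  grams.each { |gram| index[gram] << query_id }
end

# lookup: count how many search grams hit each query
results = Hash.new(0)
"can you recommend a restaurant".split.uniq.each do |gram|
  index[gram].each { |query_id| results[query_id] += 1 }
end

# score: hits / total grams per query, sorted ascending by score
scores = results.map { |query_id, hits| [query_id, hits.to_f / query_terms_count[query_id]] }
                .sort_by { |_id, score| score }
scores # => [[0, 0.5], [1, 1.0]]
```

Query 1 ("recommend a restaurant") has 3 grams and all 3 hit, giving 1.0; query 0 has 4 grams of which 2 hit ("a", "restaurant"), giving 0.5.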
data/lib/fast_fuzzy/version.rb ADDED

module FastFuzzy
  VERSION = "0.1.0"
end
data/lib/fast_fuzzy_jars.rb ADDED

# this is a generated file, to avoid over-writing it just delete this comment
require 'jar_dependencies'

require_jar( 'org.apache.lucene', 'lucene-core', '5.4.1' )
require_jar( 'org.apache.lucene', 'lucene-analyzers-common', '5.4.1' )
data/spec/fast_fuzzy/analyzer_spec.rb ADDED

require_relative '../spec_helper'

describe FastFuzzy::Analyzer do

  it "should support to_a" do
    analyzer = FastFuzzy::Analyzer.new("aa bb")
    expect(analyzer.to_a).to be_a(Array)
    expect(analyzer.to_a.size).to eq(2)
  end

  it "should support each" do
    analyzer = FastFuzzy::Analyzer.new("aa bb")
    result = []
    analyzer.each{|token| result << token}
    expect(result.size).to eq(2)
  end

  shared_examples :analyzer do
    it "should analyse and extract correct type" do
      tests.each do |test|
        analyzer = FastFuzzy::Analyzer.new(test.first, chain)
        expect(analyzer.to_a).to eq(test.last)
      end
    end

    it "should support each" do
      analyzer = FastFuzzy::Analyzer.new("aa bb cc", chain)
      results = []
      analyzer.each{|term| results << term}
      expect(results).to eq([["aa", "<ALPHANUM>"], ["bb", "<ALPHANUM>"], ["cc", "<ALPHANUM>"]])
    end

    it "should support Enumerable" do
      analyzer = FastFuzzy::Analyzer.new("aa bb 12", chain)
      expect(analyzer.map{|term| term.first}).to eq(["aa", "bb", "12"])
      expect(analyzer.map{|term| term.last}).to eq(["<ALPHANUM>", "<ALPHANUM>", "<NUM>"])
    end
  end

  context "using default analyzer chain" do
    it_behaves_like :analyzer do
      let(:tests) {[
        # the following tests are actually copied from the TwitterTokenizer tests below and
        # adjusted to the StandardTokenizer chain results.
        # TODO: make better tests for StandardTokenizer & default chain

        ["2012 $80K/year", [["2012", "<NUM>"], ["80k", "<ALPHANUM>"], ["year", "<ALPHANUM>"]]],
        ["is @ 7.9% and 1% not 4th", [["7.9", "<NUM>"], ["1", "<NUM>"], ["4th", "<ALPHANUM>"]]],

        ["it's o'reilly o'reilly's hey", [["it's", "<ALPHANUM>"], ["o'reilly", "<ALPHANUM>"], ["o'reilly's", "<ALPHANUM>"], ["hey", "<ALPHANUM>"]]],

        ["at&t excite@home", [["t", "<ALPHANUM>"], ["excite", "<ALPHANUM>"], ["home", "<ALPHANUM>"]]],

        ["I.B.M. is big", [["i.b.m", "<ALPHANUM>"], ["big", "<ALPHANUM>"]]],

        ["see @colinsurprenant do #win yeah", [["see", "<ALPHANUM>"], ["colinsurprenant", "<ALPHANUM>"], ["do", "<ALPHANUM>"], ["win", "<ALPHANUM>"], ["yeah", "<ALPHANUM>"]]],

        ["go http://colinsurprenant.com/ http://a go", [["go", "<ALPHANUM>"], ["http", "<ALPHANUM>"], ["colinsurprenant.com", "<ALPHANUM>"], ["http", "<ALPHANUM>"], ["go", "<ALPHANUM>"]]],

        ["colin.surprenant@gmail.com", [["colin.surprenant", "<ALPHANUM>"], ["gmail.com", "<ALPHANUM>"]]],

        ["Face-Lift - Graphic", [["face", "<ALPHANUM>"], ["lift", "<ALPHANUM>"], ["graphic", "<ALPHANUM>"]]],

        ["DUDZ :) :P :( 8-(", [["dudz", "<ALPHANUM>"], ["p", "<ALPHANUM>"], ["8", "<NUM>"]]],

        ["RT stoopid #FF ", [["rt", "<ALPHANUM>"], ["stoopid", "<ALPHANUM>"], ["ff", "<ALPHANUM>"]]],
      ]}

      let(:chain) { FastFuzzy::Analyzer::STANDARD_CHAIN }
    end
  end

  context "using TwitterTokenizer chain" do
    it_behaves_like :analyzer do
      let(:tests) {[
        # NUM
        ["2012 $80K/year", [["2012", "<NUM>"], ["$80k", "<NUM>"], ["year", "<ALPHANUM>"]]],
        ["is @ 7.9% and 1% not 4th", [["7.9%", "<NUM>"], ["1%", "<NUM>"], ["4th", "<ALPHANUM>"]]],

        # APOSTROPHE
        ["it's o'reilly o'reilly's hey", [["it's", "<APOSTROPHE>"], ["o'reilly", "<APOSTROPHE>"], ["o'reilly's", "<APOSTROPHE>"], ["hey", "<ALPHANUM>"]]],

        # COMPANY
        ["at&t excite@home", [["at&t", "<COMPANY>"], ["excite@home", "<COMPANY>"]]],

        # ACRONYM
        ["I.B.M. is big", [["i.b.m.", "<ACRONYM>"], ["big", "<ALPHANUM>"]]],

        # USERNAME & HASHTAG
        ["see @colinsurprenant do #win yeah", [["see", "<ALPHANUM>"], ["@colinsurprenant", "<USERNAME>"], ["do", "<ALPHANUM>"], ["#win", "<HASHTAG>"], ["yeah", "<ALPHANUM>"]]],

        # URL
        ["go http://colinsurprenant.com/ http://a go", [["go", "<ALPHANUM>"], ["http://colinsurprenant.com/", "<URL>"], ["http://a", "<URL>"], ["go", "<ALPHANUM>"]]],

        # EMAIL
        ["colin.surprenant@gmail.com", [["colin.surprenant@gmail.com", "<EMAIL>"]]],

        # DASH
        ["Face-Lift - Graphic", [["face-lift", "<DASH>"], ["graphic", "<ALPHANUM>"]]],

        # EMOTICON
        ["DUDZ :) :P :( 8-(", [["dudz", "<ALPHANUM>"], [":)", "<EMOTICON>"], [":p", "<EMOTICON>"], [":(", "<EMOTICON>"], ["8-(", "<EMOTICON>"]]],

        # HASHTAG & RT
        ["RT stoopid #FF ", [["rt", "<RT>"], ["stoopid", "<ALPHANUM>"], ["#ff", "<HASHTAG>"]]],
      ]}

      let(:chain) {[
        [Lucene::TwitterTokenizer],
        [Lucene::LowerCaseFilter],
        [Lucene::StopFilter, Lucene::StandardAnalyzer::STOP_WORDS_SET],
      ]}
    end
  end

end
data/spec/fast_fuzzy/lucene_spec.rb ADDED

require_relative "../spec_helper"

describe Lucene do
  it 'has a version number' do
    expect(Lucene::VERSION).not_to be nil
  end

  it 'has the expected version' do
    expect(Lucene::VERSION).to eq("5.4.1")
  end
end
data/spec/fast_fuzzy/ngram_spec.rb ADDED

require_relative '../spec_helper'

describe FastFuzzy::Ngram do

  it "should generate bigrams" do
    expect(FastFuzzy::Ngram.generate("allo")).to eq(["al", "ll", "lo"])
    expect(FastFuzzy::Ngram.generate("allo", 2)).to eq(["al", "ll", "lo"])
    expect(FastFuzzy::Ngram.generate([:a, :b, :c, :d], 2)).to eq([[:a, :b], [:b, :c], [:c, :d]])
  end

  it "should generate trigrams" do
    expect(FastFuzzy::Ngram.generate("allotoi", 3)).to eq(["all", "llo", "lot", "oto", "toi"])
    expect(FastFuzzy::Ngram.generate([:a, :b, :c, :d], 3)).to eq([[:a, :b, :c], [:b, :c, :d]])
  end

  it "should generate gram <= n" do
    expect(FastFuzzy::Ngram.generate("al", 2)).to eq(["al"])
    expect(FastFuzzy::Ngram.generate([:a, :b], 2)).to eq([[:a, :b]])
  end

  it "should generate gram < n" do
    expect(FastFuzzy::Ngram.generate("a", 2)).to eq(["a"])
    expect(FastFuzzy::Ngram.generate([:a], 2)).to eq([[:a]])
  end
end
data/spec/fast_fuzzy/percolator_spec.rb ADDED

require_relative '../spec_helper'

describe FastFuzzy::Percolator do

  round4 = lambda {|x| x.round(4)}

  it "should add queries" do
    p = FastFuzzy::Percolator.new
    p << "test1"
    p << "test2"
    expect(p.queries).to eq(["test1", "test2"])

    p = FastFuzzy::Percolator.new(:queries => ["test1", "test2"])
    expect(p.queries).to eq(["test1", "test2"])
  end

  it "should percolate using standard chain by default" do
    p = FastFuzzy::Percolator.new
    p << "faire une chose"
    p << "etre subtil"

    expect(p.percolate("veux tu faire quelque chose?").map{|r| [r[0], round4.call(r[1])]}).to eq([[0, 0.6429]])
    expect(p.percolate("allo toi").map{|r| [r[0], round4.call(r[1])]}).to be_empty
    expect(p.percolate("il faut etre quelque peu subtil").map{|r| [r[0], round4.call(r[1])]}).to eq([[1, 0.8]])

    p = FastFuzzy::Percolator.new
    p << "looking for a job"
    expect(p.percolate("I hate looking for a job").map{|r| [r[0], round4.call(r[1])]}).to eq([[0, 1.0]])
  end

  it "should support configurable chain" do
    p = FastFuzzy::Percolator.new(:analyzer_chain => [[Lucene::StandardTokenizer], [Lucene::StandardFilter]])
    p << "foo bar"
    expect(p.percolate("FOO BAR")).to eq([])

    p = FastFuzzy::Percolator.new(:analyzer_chain => [[Lucene::StandardTokenizer], [Lucene::StandardFilter], [Lucene::LowerCaseFilter]])
    p << "foo bar"
    expect(p.percolate("FOO BAR")).to eq([[0, 1.0]])
  end

end
data/spec/fast_fuzzy_spec.rb ADDED

require 'spec_helper'

describe FastFuzzy do
  it 'has a version number' do
    expect(FastFuzzy::VERSION).not_to be nil
  end
end
data/spec/spec_helper.rb ADDED

$LOAD_PATH.unshift File.expand_path('../../lib', __FILE__)
require 'fast_fuzzy'
metadata ADDED
--- !ruby/object:Gem::Specification
name: fast_fuzzy
version: !ruby/object:Gem::Version
  version: 0.1.0
platform: java
authors:
- Colin Surprenant
autorequire:
bindir: bin
cert_chain: []
date: 2016-03-08 00:00:00.000000000 Z
dependencies:
- !ruby/object:Gem::Dependency
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
  name: jar-dependencies
  prerelease: false
  type: :runtime
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
- !ruby/object:Gem::Dependency
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: '1.10'
  name: bundler
  prerelease: false
  type: :development
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: '1.10'
- !ruby/object:Gem::Dependency
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: '10.0'
  name: rake
  prerelease: false
  type: :development
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: '10.0'
- !ruby/object:Gem::Dependency
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
  name: rspec
  prerelease: false
  type: :development
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
description: JRuby Fast and fuzzy text pattern matching using Lucene analyzers
email:
- colin.surprenant@gmail.com
executables: []
extensions: []
extra_rdoc_files: []
files:
- LICENSE
- README.md
- fast_fuzzy.gemspec
- lib/fast_fuzzy.rb
- lib/fast_fuzzy/analyzer.rb
- lib/fast_fuzzy/fastfuzzy-0.1.0.jar
- lib/fast_fuzzy/lucene.rb
- lib/fast_fuzzy/ngram.rb
- lib/fast_fuzzy/percolator.rb
- lib/fast_fuzzy/version.rb
- lib/fast_fuzzy_jars.rb
- lib/org/apache/lucene/lucene-analyzers-common/5.4.1/lucene-analyzers-common-5.4.1.jar
- lib/org/apache/lucene/lucene-core/5.4.1/lucene-core-5.4.1.jar
- spec/fast_fuzzy/analyzer_spec.rb
- spec/fast_fuzzy/lucene_spec.rb
- spec/fast_fuzzy/ngram_spec.rb
- spec/fast_fuzzy/percolator_spec.rb
- spec/fast_fuzzy_spec.rb
- spec/spec_helper.rb
homepage: https://github.com/colinsurprenant/fast_fuzzy
licenses:
- Apache-2.0
metadata: {}
post_install_message:
rdoc_options: []
require_paths:
- lib
required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      version: '0'
required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      version: '0'
requirements:
- jar org.apache.lucene:lucene-core, 5.4.1
- jar org.apache.lucene:lucene-analyzers-common, 5.4.1
rubyforge_project:
rubygems_version: 2.4.8
signing_key:
specification_version: 4
summary: JRuby Fast and fuzzy text pattern matching using Lucene analyzers
test_files:
- spec/fast_fuzzy/analyzer_spec.rb
- spec/fast_fuzzy/lucene_spec.rb
- spec/fast_fuzzy/ngram_spec.rb
- spec/fast_fuzzy/percolator_spec.rb
- spec/fast_fuzzy_spec.rb
- spec/spec_helper.rb