ngramdistance-ffi 1.0.0 → 1.0.1
- data/.gitignore +0 -1
- data/CHANGELOG.markdown +1 -1
- data/Gemfile +1 -1
- data/README.markdown +120 -4
- data/Rakefile +2 -2
- data/VERSION +1 -1
- data/ext/ngramdistance/ngramdistance.c +142 -96
- data/ext/ngramdistance/ngramdistance.h +1 -1
- data/lib/ngramdistance.rb +47 -3
- data/ngramdistance-ffi.gemspec +3 -3
- data/spec/ngramdistance_spec.rb +30 -3
- data/spec/spec_helper.rb +2 -2
- metadata +2 -2
data/.gitignore
CHANGED
data/CHANGELOG.markdown
CHANGED
data/Gemfile
CHANGED
data/README.markdown
CHANGED
@@ -1,9 +1,14 @@
|
|
1
|
-
|
1
|
+
ngramdistance-ffi
|
2
2
|
===============
|
3
3
|
|
4
4
|
Converted to FFI by Bali for Ruby portability.
|
5
5
|
|
6
|
-
This gem originally based on ngram distance
|
6
|
+
This gem was originally based on the n-gram distance and similarity algorithm from the Proceedings of the Twelfth International
|
7
|
+
Conference on String Processing and Information Retrieval (SPIRE 2005).
|
8
|
+
|
9
|
+
* This program was ported by hand from lucene-3.0.2. (Lucene is a Java product.)
|
10
|
+
* It supports only the n-gram distance algorithm.
|
11
|
+
|
7
12
|
|
8
13
|
Tested on:
|
9
14
|
|
@@ -16,6 +21,117 @@ Known Issues
|
|
16
21
|
Including in Gemfile
|
17
22
|
====================
|
18
23
|
|
19
|
-
gem '
|
24
|
+
gem 'ngramdistance-ffi', :require => 'ngramdistance'
|
25
|
+
Original README
|
26
|
+
===============
|
27
|
+
|
28
|
+
N-Gram version of edit distance based on paper by Grzegorz Kondrak,
|
29
|
+
"N-gram similarity and distance". Proceedings of the Twelfth International
|
30
|
+
Conference on String Processing and Information Retrieval (SPIRE 2005), pp. 115-126,
|
31
|
+
Buenos Aires, Argentina, November 2005.
|
32
|
+
http://www.cs.ualberta.ca/~kondrak/papers/spire05.pdf
|
33
|
+
|
34
|
+
This implementation uses the position-based optimization to compute partial
|
35
|
+
matches of n-gram sub-strings and adds a null-character prefix of size n-1
|
36
|
+
so that the first character is contained in the same number of n-grams as
|
37
|
+
a middle character. Null-character prefix matches are discounted so that
|
38
|
+
strings with no matching characters will return a distance of 0.
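
As a rough illustration of the prefixing described above, the following Ruby sketch pads a string with n-1 null characters and lists the resulting n-grams, one ending at each original character. The helper name `prefixed_ngrams` is hypothetical and not part of the gem, and the sketch assumes single-byte (ASCII) input.

```ruby
# Hypothetical helper, not part of the gem: pad with n-1 NUL characters,
# then take one n-gram per character of the original string.
def prefixed_ngrams(str, n)
  padded = ("\0" * (n - 1)) + str
  (0...str.length).map { |i| padded[i, n] }
end

prefixed_ngrams("abc", 3).each { |gram| p gram }
```

With n = 3, the first character appears in three n-grams, the same number as a middle character of a longer string.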
|
39
|
+
|
40
|
+
The module has the
|
41
|
+
following functions:
|
42
|
+
|
43
|
+
* distance
|
44
|
+
----------
|
45
|
+
require 'ngramdistance'
|
46
|
+
NGramDistance.distance("string1", "string1") # returns 1.0
|
47
|
+
NGramDistance.distance("university", "univearsitty",3) # returns 0.750000
|
48
|
+
|
49
|
+
method signature is distance(source,target,ngram=3,mode=PROPOSITIONAL)
|
50
|
+
|
51
|
+
Third argument is the ngram granularity.
|
52
|
+
* 1 -> unigram
|
53
|
+
* 2 -> bigram
|
54
|
+
* 3 -> trigram
|
55
|
+
|
56
|
+
default argument is 3 (tri-gram)
|
57
|
+
|
58
|
+
The fourth argument is the mode of operation. Three modes are defined:
|
59
|
+
* PROPOSITIONAL=0
|
60
|
+
Computes the distance between n-grams with partial (position-by-position) matching
|
61
|
+
* BINARY=1
|
62
|
+
Considers two n-grams a match only when they are exactly the same
|
63
|
+
* COMPLEX=2
|
64
|
+
Scores each pair of n-grams by running ngramdistance on the two n-grams themselves with n = 1 in BINARY mode.
|
65
|
+
|
66
|
+
default is PROPOSITIONAL mode.
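
The difference between the PROPOSITIONAL and BINARY modes can be sketched in Ruby as the cost assigned to a single pair of n-grams. This is an illustrative reading of the C code in this diff, assuming its `eq` macro is plain character equality; `positional_cost` and `binary_cost` are hypothetical names, and COMPLEX mode is left out.

```ruby
NUL = "\0"

# Partial matching: the fraction of mismatched positions. Positions that
# match on the NUL prefix are discounted from the denominator.
def positional_cost(a, b)
  mismatches = 0
  counted = a.length
  a.each_char.with_index do |ch, k|
    if ch != b[k]
      mismatches += 1
    elsif ch == NUL
      counted -= 1
    end
  end
  mismatches.to_f / counted
end

# Exact matching: cost 1 as soon as any position is a prefix NUL in the
# first n-gram or differs from the second n-gram, otherwise 0.
def binary_cost(a, b)
  a.each_char.with_index.any? { |ch, k| ch == NUL || ch != b[k] } ? 1.0 : 0.0
end

puts positional_cost("abc", "abd")
puts binary_cost("abc", "abd")
```

In this reading, one differing character out of three costs a third in the partial-matching mode but a full 1 in BINARY mode.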
|
67
|
+
|
68
|
+
* token_sort_distance
|
69
|
+
---------------------
|
70
|
+
require 'ngramdistance'
|
71
|
+
NGramDistance.token_sort_distance("hello bello", "bello hello") # returns 1.0
|
72
|
+
|
73
|
+
The token sort approach involves tokenizing the string in question, sorting the tokens alphabetically, and then joining them back into a string. For example:
|
74
|
+
|
75
|
+
"new york mets vs atlanta braves" →→ "atlanta braves mets new vs york"
|
76
|
+
|
77
|
+
We then compare the transformed strings with a simple distance(). That nicely solves our ordering problem, as our helper function below indicates:
|
78
|
+
|
79
|
+
NGramDistance.token_sort_distance("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets") ⇒ 1.0
|
80
|
+
|
81
|
+
method signature is token_sort_distance(source,target,regex=' ',ngram=3,mode=PROPOSITIONAL)
|
82
|
+
The arguments are the same as for distance, plus an extra regex argument: the character or string used to split source and target.
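
The normalization step behind token_sort_distance can be sketched in plain Ruby as follows. The helper name `token_sort_key` is hypothetical; the sketch assumes a single space as the separator and, like Ruby's default string sort, is case-sensitive.

```ruby
# Hypothetical helper: split on the separator, sort the tokens,
# and join them back with single spaces.
def token_sort_key(text, separator = ' ')
  text.split(separator).sort.join(' ')
end

puts token_sort_key("new york mets vs atlanta braves")
```

Two strings that contain the same tokens in a different order produce the same key, so a plain distance comparison of the keys no longer depends on word order.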
|
83
|
+
|
84
|
+
* token_set_distance
|
85
|
+
--------------------
|
86
|
+
require 'ngramdistance'
|
87
|
+
NGramDistance.token_set_distance("hello", "hello hello") # returns 1.0
|
88
|
+
|
89
|
+
|
90
|
+
The token set approach is similar, but a little bit more flexible. Here, we tokenize both strings, but instead of immediately sorting and comparing, we split the tokens into two groups: intersection and remainder. We use those sets to build up a comparison string.
|
91
|
+
Here is an illustrative example:
|
92
|
+
s1 = "mariners vs angels"
|
93
|
+
s2 = "los angeles angels of anaheim at seattle mariners"
|
94
|
+
|
95
|
+
Using the token sort method isn't that helpful, because the second (longer) string has too many extra tokens that get interleaved with the sort. We'd end up comparing:
|
96
|
+
|
97
|
+
t1 = "angels mariners vs"
|
98
|
+
t2 = "anaheim angeles angels los mariners of seattle vs"
|
99
|
+
|
100
|
+
Not very useful. Instead, the set method allows us to detect that "angels" and "mariners" are common to both strings, and separate those out (the set intersection). Now we construct and compare strings of the following form
|
101
|
+
|
102
|
+
t0 = [SORTED_INTERSECTION]
|
103
|
+
t1 = [SORTED_INTERSECTION] + [SORTED_REST_OF_STRING1]
|
104
|
+
t2 = [SORTED_INTERSECTION] + [SORTED_REST_OF_STRING2]
|
105
|
+
|
106
|
+
And then compare each pair.
|
107
|
+
|
108
|
+
The intuition here is that because the SORTED_INTERSECTION component is always exactly the same, the scores increase when (a) that makes up a larger percentage of the full string, and (b) the string remainders are more similar. In our example
|
109
|
+
|
110
|
+
t0 = "angels mariners"
|
111
|
+
t1 = "angels mariners vs"
|
112
|
+
t2 = "angels mariners anaheim angels"
|
113
|
+
NGramDistance.distance(t0, t1) ⇒ 0.83333
|
114
|
+
NGramDistance.distance(t0, t2) ⇒ 0.4838
|
115
|
+
NGramDistance.distance(t1, t2) ⇒ 0.54828
|
116
|
+
NGramDistance.token_set_distance(t0,t2) ⇒ 0.6521
|
117
|
+
|
118
|
+
|
119
|
+
NGramDistance.token_set_distance("hello hello world", "hello world") ⇒ 1.0
|
120
|
+
|
121
|
+
method signature is token_set_distance(source,target,regex=' ',ngram=3,mode=PROPOSITIONAL)
|
122
|
+
The arguments are the same as for distance, plus an extra regex argument: the character or string used to split source and target.
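
The intersection/remainder construction described in the prose above can be sketched in plain Ruby as follows. The helper name `token_set_strings` is hypothetical and follows the description only; the `token_set` helper in lib/ngramdistance.rb shown in this diff takes a simpler route and just deduplicates and sorts the tokens of each string.

```ruby
# Hypothetical helper following the prose description: build the sorted
# intersection (t0) and the intersection plus each sorted remainder (t1, t2).
def token_set_strings(s1, s2, separator = ' ')
  a = s1.split(separator).uniq
  b = s2.split(separator).uniq
  common = (a & b).sort
  t0 = common.join(' ')
  t1 = (common + (a - b).sort).join(' ')
  t2 = (common + (b - a).sort).join(' ')
  [t0, t1, t2]
end

t0, t1, t2 = token_set_strings("mariners vs angels",
                               "los angeles angels of anaheim at seattle mariners")
puts t0, t1, t2
```

Each of the three strings could then be compared pairwise with NGramDistance.distance.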
|
123
|
+
|
124
|
+
|
125
|
+
|
126
|
+
|
127
|
+
|
128
|
+
Author
|
129
|
+
======
|
130
|
+
|
131
|
+
* Copyright (c) 2012, Bali (mailbali@gmail.com)
|
132
|
+
* Ported from the Java source code of lucene-3.0.2.
|
133
|
+
|
134
|
+
License
|
135
|
+
=======
|
136
|
+
* Apache 2.0 LICENSE
|
20
137
|
|
21
|
-
- end -
|
data/Rakefile
CHANGED
@@ -4,8 +4,8 @@ begin
|
|
4
4
|
gemspec.name = "ngramdistance-ffi"
|
5
5
|
gemspec.summary = "An FFI version of the ngramdistance gem."
|
6
6
|
gemspec.description = "Provides a fast, cross-Ruby implementation of the ngramdistance distance algorithm."
|
7
|
-
gemspec.email = "
|
8
|
-
gemspec.homepage = "
|
7
|
+
gemspec.email = "mailbali@gmail.com"
|
8
|
+
gemspec.homepage = "https://github.com/pecbali/ngramdistance-ffi"
|
9
9
|
gemspec.authors = ["Bali"]
|
10
10
|
gemspec.add_dependency "ffi"
|
11
11
|
gemspec.add_development_dependency "rspec"
|
data/VERSION
CHANGED
@@ -1 +1 @@
|
|
1
|
-
1.0.
|
1
|
+
1.0.1
|
@@ -16,6 +16,9 @@
|
|
16
16
|
#define min( a, b ) ( ((a) < (b)) ? (a) : (b) )
|
17
17
|
#endif
|
18
18
|
|
19
|
+
#define POSITIONAL 0
|
20
|
+
#define BINARY 1
|
21
|
+
#define COMPLEX 2
|
19
22
|
|
20
23
|
char* substring(const char* str, size_t begin, size_t len)
|
21
24
|
{
|
@@ -26,104 +29,147 @@ char* substring(const char* str, size_t begin, size_t len)
|
|
26
29
|
}
|
27
30
|
|
28
31
|
|
29
|
-
float ngramdistance (const char *source, const char *target,int n) {
|
30
|
-
|
31
|
-
|
32
|
-
|
33
|
-
|
34
|
-
|
35
|
-
|
36
|
-
|
37
|
-
|
38
|
-
|
39
|
-
|
32
|
+
float ngramdistance (const char *source, const char *target,int n, int mode) {
|
33
|
+
int sl = strlen(source);
|
34
|
+
int tl = strlen(target);
|
35
|
+
char* word_for_complex1;
|
36
|
+
char* word_for_complex2;
|
37
|
+
|
38
|
+
if (mode == COMPLEX) {
|
39
|
+
word_for_complex1 = calloc(n + 1, sizeof(char)); /* +1 keeps a NUL terminator after strncpy copies n bytes */
|
40
|
+
word_for_complex2 = calloc(n + 1, sizeof(char));
|
41
|
+
}
|
42
|
+
|
43
|
+
if (sl == 0 || tl == 0) {
|
44
|
+
if (sl == tl) {
|
45
|
+
return 1;
|
40
46
|
}
|
41
|
-
|
42
|
-
|
43
|
-
int ni = min(sl,tl);
|
44
|
-
int i=0;
|
45
|
-
for (i=0;i<ni;i++) {
|
46
|
-
if (eq(source[i],target[i])) {
|
47
|
-
cost++;
|
48
|
-
}
|
49
|
-
}
|
50
|
-
return (float) cost/(float)max(sl, tl);
|
47
|
+
else {
|
48
|
+
return 0;
|
51
49
|
}
|
52
|
-
|
53
|
-
|
54
|
-
|
55
|
-
|
56
|
-
|
57
|
-
|
58
|
-
|
59
|
-
|
60
|
-
for (i=0;i<char_len;i++) {
|
61
|
-
if (i < n-1) {
|
62
|
-
sa[i]=0; //add prefix
|
63
|
-
}
|
64
|
-
else {
|
65
|
-
sa[i] = source[i-n+1];
|
66
|
-
}
|
50
|
+
}
|
51
|
+
int cost = 0;
|
52
|
+
if (sl < n || tl < n) {
|
53
|
+
int ni = min(sl,tl);
|
54
|
+
int i=0;
|
55
|
+
for (i=0;i<ni;i++) {
|
56
|
+
if (eq(source[i],target[i])) {
|
57
|
+
cost++;
|
67
58
|
}
|
68
|
-
|
69
|
-
|
70
|
-
|
71
|
-
|
72
|
-
|
73
|
-
|
74
|
-
|
75
|
-
|
76
|
-
|
77
|
-
|
78
|
-
|
59
|
+
}
|
60
|
+
return (float) cost/(float)max(sl, tl);
|
61
|
+
}
|
62
|
+
int char_len = sl+n-1;
|
63
|
+
char* sa = calloc(char_len+1,sizeof(char));
|
64
|
+
float* p; //'previous' cost array, horizontally
|
65
|
+
float* d; // cost array, horizontally
|
66
|
+
float* _d; //placeholder to assist in swapping p and d
|
67
|
+
|
68
|
+
//construct sa with prefix
|
69
|
+
int i=0;
|
70
|
+
for (i=0;i<char_len;i++) {
|
71
|
+
if (i < n-1) {
|
72
|
+
sa[i]=0; //add prefix
|
73
|
+
}
|
74
|
+
else {
|
75
|
+
sa[i] = source[i-n+1];
|
76
|
+
}
|
77
|
+
}
|
78
|
+
int float_arr_len = sl+1;
|
79
|
+
p = calloc( float_arr_len+1 , sizeof( float));
|
80
|
+
d = calloc( float_arr_len+1 , sizeof( float));
|
81
|
+
|
82
|
+
// indexes into strings s and t
|
83
|
+
i=0; // iterates through source
|
84
|
+
int j=0; // iterates through target
|
85
|
+
|
86
|
+
char* t_j = calloc(n+1,sizeof(char)); // jth n-gram of t
|
87
|
+
for (i = 0; i<=sl; i++) {
|
88
|
+
p[i] = i;
|
89
|
+
}
|
90
|
+
|
91
|
+
|
92
|
+
for (j = 1; j<=tl; j++) {
|
93
|
+
//construct t_j n-gram
|
94
|
+
if (j < n) {
|
95
|
+
int ti=0;
|
96
|
+
for (ti=0;ti<n-j;ti++) {
|
97
|
+
t_j[ti]=0; //add prefix
|
98
|
+
}
|
99
|
+
for (ti=n-j;ti<n;ti++) {
|
100
|
+
t_j[ti]=target[ti-(n-j)];
|
79
101
|
}
|
80
|
-
|
81
|
-
|
82
|
-
|
83
|
-
|
84
|
-
|
85
|
-
|
86
|
-
|
87
|
-
|
88
|
-
|
89
|
-
|
90
|
-
|
91
|
-
|
92
|
-
|
93
|
-
|
94
|
-
|
95
|
-
|
96
|
-
|
97
|
-
|
98
|
-
|
99
|
-
|
100
|
-
|
101
|
-
|
102
|
-
|
103
|
-
|
104
|
-
|
105
|
-
|
106
|
-
|
107
|
-
|
108
|
-
|
109
|
-
|
110
|
-
|
111
|
-
|
112
|
-
|
113
|
-
|
114
|
-
|
115
|
-
|
116
|
-
|
117
|
-
|
102
|
+
}
|
103
|
+
else {
|
104
|
+
free (t_j);
|
105
|
+
t_j = substring(target,j-n, n);
|
106
|
+
}
|
107
|
+
d[0] = j;
|
108
|
+
for (i=1; i<=sl; i++) {
|
109
|
+
cost = 0;
|
110
|
+
int tn=n;
|
111
|
+
//compare sa to t_j
|
112
|
+
int ni=0;
|
113
|
+
float ec = 0;
|
114
|
+
switch(mode) {
|
115
|
+
case BINARY:
|
116
|
+
for (ni=0;ni<n;ni++) {
|
117
|
+
if (eq(sa[i-1+ni], 0) || !eq(sa[i-1+ni] , t_j[ni]) ) {
|
118
|
+
cost++;
|
119
|
+
break;
|
120
|
+
}
|
121
|
+
}
|
122
|
+
if (cost > 0.1) {
|
123
|
+
ec = 1;
|
124
|
+
}
|
125
|
+
break;
|
126
|
+
case COMPLEX:
|
127
|
+
for (ni=0;ni<n;ni++) {
|
128
|
+
if (eq(sa[i-1+ni], 0) || (eq(t_j[ni], 0))) {
|
129
|
+
cost++;
|
130
|
+
break;
|
131
|
+
}
|
132
|
+
}
|
133
|
+
if ( cost > 0.1 ) {
|
134
|
+
ec = 1;
|
135
|
+
} else {
|
136
|
+
strncpy(word_for_complex1, &(sa[i-1]), n);
|
137
|
+
strncpy(word_for_complex2, t_j, n);
|
138
|
+
ec = 1-ngramdistance (word_for_complex1, word_for_complex2, 1, BINARY);
|
139
|
+
}
|
140
|
+
break;
|
141
|
+
default: // default positional
|
142
|
+
for (ni=0;ni<n;ni++) {
|
143
|
+
if (!eq(sa[i-1+ni] , t_j[ni])) {
|
144
|
+
cost++;
|
145
|
+
}
|
146
|
+
else if (eq(sa[i-1+ni], 0)) { //discount matches on prefix
|
147
|
+
tn--;
|
148
|
+
}
|
149
|
+
}
|
150
|
+
ec = (float) cost/(float)tn;
|
151
|
+
break;
|
118
152
|
}
|
119
|
-
|
120
|
-
|
121
|
-
|
122
|
-
|
123
|
-
|
124
|
-
|
125
|
-
|
126
|
-
|
153
|
+
// minimum of cell to the left+1, to the top+1, diagonally left and up +cost
|
154
|
+
d[i] = min(min(d[i-1]+1, p[i]+1), p[i-1]+ec);
|
155
|
+
}
|
156
|
+
// copy current distance counts to 'previous row' distance counts
|
157
|
+
_d = p;
|
158
|
+
p = d;
|
159
|
+
d = _d;
|
160
|
+
}
|
161
|
+
float val = p[sl];
|
162
|
+
free(p);
|
163
|
+
free(d);
|
164
|
+
free(t_j);
|
165
|
+
free(sa);
|
166
|
+
if ( mode == COMPLEX ) {
|
167
|
+
free(word_for_complex1);
|
168
|
+
free(word_for_complex2);
|
169
|
+
}
|
170
|
+
// our last action in the above loop was to switch d and p, so p now
|
171
|
+
// actually has the most recent cost counts
|
172
|
+
return 1.0f - ((float) val / (float)max(tl, sl));
|
127
173
|
}
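
To cross-check the default (positional) mode without the native library, the dynamic-programming loop above can be ported to plain Ruby roughly as follows. This is a sketch, not the gem's implementation: the name `ngram_similarity` is hypothetical, it assumes `eq` is plain character equality and single-byte (ASCII) input, and it omits the BINARY and COMPLEX modes.

```ruby
# Hypothetical pure-Ruby port of the positional mode, for illustration only.
def ngram_similarity(source, target, n = 3)
  sl = source.length
  tl = target.length
  if sl.zero? || tl.zero?
    return sl == tl ? 1.0 : 0.0
  end

  # Strings shorter than n: fall back to counting equal characters by position.
  if sl < n || tl < n
    matches = (0...[sl, tl].min).count { |i| source[i] == target[i] }
    return matches.to_f / [sl, tl].max
  end

  nul = "\0"
  sa = (nul * (n - 1)) + source        # source with NUL prefix
  prev = (0..sl).map(&:to_f)           # previous row of the cost table
  cur = Array.new(sl + 1, 0.0)         # current row of the cost table

  (1..tl).each do |j|
    # j-th n-gram of target, NUL-padded while j < n
    t_j = j < n ? (nul * (n - j)) + target[0, j] : target[j - n, n]
    cur[0] = j.to_f
    (1..sl).each do |i|
      mismatches = 0
      counted = n
      n.times do |k|
        if sa[i - 1 + k] != t_j[k]
          mismatches += 1
        elsif sa[i - 1 + k] == nul     # discount matches on the prefix
          counted -= 1
        end
      end
      ec = mismatches.to_f / counted
      # minimum of left + 1, top + 1, and diagonal + n-gram cost
      cur[i] = [cur[i - 1] + 1, prev[i] + 1, prev[i - 1] + ec].min
    end
    prev, cur = cur, prev              # the finished row becomes 'previous'
  end

  1.0 - prev[sl] / [sl, tl].max
end

puts ngram_similarity("university", "university")
puts ngram_similarity("hello", "saint")
```

Identical strings score 1.0 and strings with no characters in common score 0.0, which agrees with the fixtures in the spec file of this diff.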
|
128
174
|
|
129
175
|
|
@@ -133,8 +179,8 @@ float ngramdistance (const char *source, const char *target,int n) {
|
|
133
179
|
|
134
180
|
int main (int argc, char **argv) {
|
135
181
|
float distance;
|
136
|
-
if (argc <
|
137
|
-
distance = ngramdistance(argv[1], argv[2],3);
|
138
|
-
printf("%s vs %s: %f\n", argv[1], argv[2],distance);
|
182
|
+
if (argc < 5) return -1;
|
183
|
+
distance = ngramdistance(argv[1], argv[2], atoi(argv[3]), atoi(argv[4]));
|
184
|
+
printf("%s vs %s: %f using %s length and %s method \n", argv[1], argv[2],distance, argv[3], argv[4]);
|
139
185
|
}
|
140
186
|
#endif
|
@@ -1 +1 @@
|
|
1
|
-
|
1
|
+
float ngramdistance(const char *, const char *, int n, int mode);
|
data/lib/ngramdistance.rb
CHANGED
@@ -1,12 +1,56 @@
|
|
1
1
|
require 'ffi'
|
2
|
-
|
2
|
+
require 'set'
|
3
3
|
module NGramDistance
|
4
|
+
PROPOSITIONAL=0
|
5
|
+
BINARY=1
|
6
|
+
COMPLEX=2
|
4
7
|
extend FFI::Library
|
5
8
|
|
6
9
|
# Try loading in order.
|
7
10
|
library = File.dirname(__FILE__) + "/../ext/ngramdistance/ngramdistance"
|
8
|
-
candidates = ['.bundle', '.
|
11
|
+
candidates = ['.bundle', '.dylib', ''].map { |ext| library + ext }
|
9
12
|
ffi_lib(candidates)
|
10
13
|
|
11
|
-
attach_function :
|
14
|
+
attach_function :distance_private, :ngramdistance, [:string, :string, :int, :int], :float
|
15
|
+
|
16
|
+
def self.distance(source,target,ngram=3,mode=PROPOSITIONAL)
|
17
|
+
distance_private(source,target,ngram,mode)
|
18
|
+
end
|
19
|
+
|
20
|
+
def self.token_set_distance(source,target,regex=' ',ngram=3,mode=PROPOSITIONAL)
|
21
|
+
if source.nil? or target.nil?
|
22
|
+
if(source == target)
|
23
|
+
return 1.0
|
24
|
+
end
|
25
|
+
return 0.0
|
26
|
+
end
|
27
|
+
return distance(token_set(source,regex),token_set(target,regex),ngram,mode)
|
28
|
+
end
|
29
|
+
|
30
|
+
def self.token_sort_distance(source, target,regex=' ',ngram=3,mode=PROPOSITIONAL)
|
31
|
+
if source.nil? or target.nil?
|
32
|
+
if(source == target)
|
33
|
+
return 1.0
|
34
|
+
end
|
35
|
+
return 0.0
|
36
|
+
end
|
37
|
+
return distance(token_sort(source,regex),token_sort(target,regex),ngram,mode)
|
38
|
+
end
|
39
|
+
|
40
|
+
private
|
41
|
+
|
42
|
+
def self.token_set(source,regex)
|
43
|
+
set = SortedSet.new
|
44
|
+
tokens = source.split(regex)
|
45
|
+
tokens.each do |token|
|
46
|
+
set.add token
|
47
|
+
end
|
48
|
+
return set.to_a.join(' ')
|
49
|
+
end
|
50
|
+
|
51
|
+
def self.token_sort(source, regex)
|
52
|
+
tokens = source.split(regex)
|
53
|
+
tokens.sort!
|
54
|
+
return tokens.join(' ')
|
55
|
+
end
|
12
56
|
end
|
data/ngramdistance-ffi.gemspec
CHANGED
@@ -5,11 +5,11 @@
|
|
5
5
|
|
6
6
|
Gem::Specification.new do |s|
|
7
7
|
s.name = %q{ngramdistance-ffi}
|
8
|
-
s.version = "1.0.
|
8
|
+
s.version = "1.0.1"
|
9
9
|
|
10
10
|
s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
|
11
11
|
s.authors = ["Bali"]
|
12
|
-
s.date = %q{2012-06
|
12
|
+
s.date = %q{2012-07-06}
|
13
13
|
s.description = %q{Provides a fast, cross-Ruby implementation of the ngramdistance distance algorithm.}
|
14
14
|
s.email = %q{mailbali@gmail.com}
|
15
15
|
s.extensions = ["ext/ngramdistance/extconf.rb"]
|
@@ -35,7 +35,7 @@ Gem::Specification.new do |s|
|
|
35
35
|
s.homepage = %q{https://github.com/pecbali/ngramdistance-ffi}
|
36
36
|
s.rdoc_options = ["--charset=UTF-8"]
|
37
37
|
s.require_paths = ["lib"]
|
38
|
-
s.rubygems_version = %q{1.
|
38
|
+
s.rubygems_version = %q{1.8.21}
|
39
39
|
s.summary = %q{An FFI version of the ngramdistance gem.}
|
40
40
|
s.test_files = [
|
41
41
|
"spec/ngramdistance_spec.rb",
|
data/spec/ngramdistance_spec.rb
CHANGED
@@ -2,18 +2,45 @@ require 'spec_helper'
|
|
2
2
|
|
3
3
|
describe NGramDistance do
|
4
4
|
fixtures = [
|
5
|
-
["university", "univearsitty", 0.750000,
|
5
|
+
["university", "univearsitty", 0.750000],
|
6
6
|
["university", "university", 1.0],
|
7
7
|
["hello", "jello",0.633333],
|
8
8
|
["hello", "heloll", 0.666667],
|
9
9
|
["hello", "saint", 0.0000],
|
10
10
|
["hello", "", 0.0000]
|
11
11
|
]
|
12
|
+
|
13
|
+
fixtures2 = [
|
14
|
+
["hello hello", "hello",1.0],
|
15
|
+
["hello world hello", "hello world", 1.0]
|
16
|
+
]
|
17
|
+
|
18
|
+
fixtures3 = [
|
19
|
+
["hello world", "world hello",1.0],
|
20
|
+
["hello world hello", "hello world", 0.647]
|
21
|
+
]
|
12
22
|
|
13
23
|
fixtures.each do |w1, w2, d|
|
14
24
|
it "should calculate a distance of #{d} between #{w1} and #{w2}" do
|
15
|
-
NGramDistance.distance(w1, w2,3).should
|
16
|
-
NGramDistance.distance(w2, w1,3).should
|
25
|
+
NGramDistance.distance(w1, w2, 3).should be_within(0.05).of(d)
|
26
|
+
NGramDistance.distance(w2, w1, 3).should be_within(0.05).of(d)
|
27
|
+
end
|
28
|
+
end
|
29
|
+
|
30
|
+
|
31
|
+
fixtures2.each do |w1, w2, d|
|
32
|
+
it "should calculate a distance of #{d} between #{w1} and #{w2}" do
|
33
|
+
NGramDistance.token_set_distance(w1, w2).should be_within(0.05).of( d)
|
34
|
+
NGramDistance.token_set_distance(w2, w1).should be_within(0.05).of( d)
|
35
|
+
end
|
36
|
+
end
|
37
|
+
|
38
|
+
|
39
|
+
fixtures3.each do |w1, w2, d|
|
40
|
+
it "should calculate a distance of #{d} between #{w1} and #{w2}" do
|
41
|
+
NGramDistance.token_sort_distance(w1, w2).should be_within(0.05).of( d)
|
42
|
+
NGramDistance.token_sort_distance(w2, w1).should be_within(0.05).of( d)
|
17
43
|
end
|
18
44
|
end
|
45
|
+
|
19
46
|
end
|
data/spec/spec_helper.rb
CHANGED
@@ -1,3 +1,3 @@
|
|
1
|
-
require '
|
1
|
+
require 'rspec'
|
2
2
|
|
3
|
-
require File.dirname(__FILE__) + "/../lib/
|
3
|
+
require File.dirname(__FILE__) + "/../lib/ngramdistance"
|
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: ngramdistance-ffi
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.0.
|
4
|
+
version: 1.0.1
|
5
5
|
prerelease:
|
6
6
|
platform: ruby
|
7
7
|
authors:
|
@@ -9,7 +9,7 @@ authors:
|
|
9
9
|
autorequire:
|
10
10
|
bindir: bin
|
11
11
|
cert_chain: []
|
12
|
-
date: 2012-06
|
12
|
+
date: 2012-07-06 00:00:00.000000000 Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
15
|
name: ffi
|