ngramdistance-ffi 1.0.0 → 1.0.1

data/.gitignore CHANGED
@@ -1,4 +1,3 @@
1
- README.markdown.html
2
1
  *.rbc
3
2
  pkg
4
3
  .*.sw?
@@ -1,3 +1,3 @@
1
- 1.0.0
1
+ 1.0.1
2
2
  -----
3
3
  * started
data/Gemfile CHANGED
@@ -3,6 +3,6 @@ source :rubygems
3
3
  gem 'ffi'
4
4
 
5
5
  group :test do
6
- gem 'rspec', '1.3.1'
6
+ gem 'rspec', '2.10.0'
7
7
  gem 'jeweler'
8
8
  end
@@ -1,9 +1,14 @@
1
- ngramdistance-ffi
1
+ ngramdistance-ffi
2
2
  ===============
3
3
 
4
4
  Converted to FFI by Bali for Ruby portability.
5
5
 
6
- This gem originally based on ngram distance.
6
+ This gem was originally based on the n-gram distance or similarity algorithm from the Proceedings of the Twelfth International
7
+ Conference on String Processing and Information Retrieval (SPIRE 2005).
8
+
9
+ * This program was ported by hand from lucene-3.0.2. (Lucene is a Java product.)
10
+ * It supports only ngram distance algorithm.
11
+
7
12
 
8
13
  Tested on:
9
14
 
@@ -16,6 +21,117 @@ Known Issues
16
21
  Including in Gemfile
17
22
  ====================
18
23
 
19
- gem ' ngramdistance-ffi', :require => ' ngramdistance'
24
+ gem 'ngramdistance-ffi', :require => 'ngramdistance'
25
+ Original README
26
+ ===============
27
+
28
+ N-Gram version of edit distance based on paper by Grzegorz Kondrak,
29
+ "N-gram similarity and distance". Proceedings of the Twelfth International
30
+ Conference on String Processing and Information Retrieval (SPIRE 2005), pp. 115-126,
31
+ Buenos Aires, Argentina, November 2005.
32
+ http://www.cs.ualberta.ca/~kondrak/papers/spire05.pdf
33
+
34
+ This implementation uses the position-based optimization to compute partial
35
+ matches of n-gram sub-strings and adds a null-character prefix of size n-1
36
+ so that the first character is contained in the same number of n-grams as
37
+ a middle character. Null-character prefix matches are discounted so that
38
+ strings with no matching characters will return a distance of 0.
39
+
40
+ The module has the
41
+ following functions:
42
+
43
+ * distance
44
+ ----------
45
+ require 'ngramdistance'
46
+ NGramDistance.distance("string1", "string1") # returns 1.0
47
+ NGramDistance.distance("university", "univearsitty",3) # returns 0.750000
48
+
49
+ method signature is distance(source,target,ngram=3,mode=PROPOSITIONAL)
50
+
51
+ Third argument is the ngram granularity.
52
+ * 1 -> unigram
53
+ * 2 -> bigram
54
+ * 3 -> trigram
55
+
56
+ default argument is 3 (tri-gram)
57
+
58
+ Fourth argument is mode of operation. We have 3 modes defined
59
+ * PROPOSITIONAL=0
60
+ Propositional Computes the distance between n-grams with partial matching
61
+ * BINARY=1
62
+ Only considers 2 n-grams match when they are exactly the same
63
+ * COMPLEX=2
64
+ Uses ngramdistance with Binary option and 1-gram match for the n-gram match.
65
+
66
+ default is PROPOSITIONAL mode.
67
+
68
+ * token_sort_distance
69
+ ---------------------
70
+ require 'ngramdistance'
71
+ NGramDistance.token_sort_distance("hello bello", "bello hello") # returns 1.0
72
+
73
+ The token sort approach involves tokenizing the string in question, sorting the tokens alphabetically, and then joining them back into a string. For example:
74
+
75
+ "new york mets vs atlanta braves" → "atlanta braves mets new vs york"
76
+
77
+ We then compare the transformed strings with a simple distance(). That nicely solves our ordering problem, as our helper function below indicates:
78
+
79
+ NGramDistance.token_sort_distance("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets") ⇒ 1.0
80
+
81
+ method signature is token_sort_distance(source,target,regex=' ',ngram=3,mode=PROPOSITIONAL)
82
+ The arguments are the same as for distance, plus an extra regex argument: the character or string used to split source and target.
83
+
84
+ * token_set_distance
85
+ --------------------
86
+ require 'ngramdistance'
87
+ NGramDistance.token_set_distance("hello", "hello hello") # returns 1.0
88
+
89
+
90
+ The token set approach is similar, but a little bit more flexible. Here, we tokenize both strings, but instead of immediately sorting and comparing, we split the tokens into two groups: intersection and remainder. We use those sets to build up a comparison string.
91
+ Here is an illustrative example:
92
+ s1 = "mariners vs angels"
93
+ s2 = "los angeles angels of anaheim at seattle mariners"
94
+
95
+ Using the token sort method isn't that helpful, because the second (longer) string has too many extra tokens that get interleaved with the sort. We'd end up comparing:
96
+
97
+ t1 = "angels mariners vs"
98
+ t2 = "anaheim angeles angels los mariners of seattle vs"
99
+
100
+ Not very useful. Instead, the set method allows us to detect that "angels" and "mariners" are common to both strings, and separate those out (the set intersection). Now we construct and compare strings of the following form
101
+
102
+ t0 = [SORTED_INTERSECTION]
103
+ t1 = [SORTED_INTERSECTION] + [SORTED_REST_OF_STRING1]
104
+ t2 = [SORTED_INTERSECTION] + [SORTED_REST_OF_STRING2]
105
+
106
+ And then compare each pair.
107
+
108
+ The intuition here is that because the SORTED_INTERSECTION component is always exactly the same, the scores increase when (a) that makes up a larger percentage of the full string, and (b) the string remainders are more similar. In our example
109
+
110
+ t0 = "angels mariners"
111
+ t1 = "angels mariners vs"
112
+ t2 = "angels mariners anaheim angels"
113
+ NGramDistance.distance(t0, t1) ⇒ 0.83333
114
+ NGramDistance.distance(t0, t2) ⇒ 0.4838
115
+ NGramDistance.distance(t1, t2) ⇒ 0.54828
116
+ NGramDistance.token_set_distance(t0,t2) ⇒ 0.6521
117
+
118
+
119
+ NGramDistance.token_set_distance("hello hello world", "hello world") ⇒ 1.0
120
+
121
+ method signature is token_set_distance(source,target,regex=' ',ngram=3,mode=PROPOSITIONAL)
122
+ The arguments are the same as for distance, plus an extra regex argument: the character or string used to split source and target.
123
+
124
+
125
+
126
+
127
+
128
+ Author
129
+ ======
130
+
131
+ * Copyright (c) 2012, Bali (mailbali@gmail.com)
132
+ * I ported from java source code of lucene-3.0.2.
133
+
134
+ License
135
+ =======
136
+ * Apache 2.0 LICENSE
20
137
 
21
- - end -
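The token sort and token set descriptions in the README hunk above can be sketched in plain Ruby. This is only an illustration of the README text, not the gem's code: the helper names `sorted_tokens` and `token_set_strings` are hypothetical, and the split separator is assumed to be a single space.

```ruby
# Hypothetical helpers illustrating the README's description; not part of the gem.

# Token sort: split, sort alphabetically, join back into one string.
def sorted_tokens(text, separator = ' ')
  text.split(separator).sort.join(' ')
end

# Token set: build the three comparison strings t0, t1, t2 from the
# sorted intersection and the sorted remainders of each input.
def token_set_strings(s1, s2, separator = ' ')
  a = s1.split(separator).uniq
  b = s2.split(separator).uniq
  common = (a & b).sort
  t0 = common.join(' ')
  t1 = (common + (a - b).sort).join(' ')
  t2 = (common + (b - a).sort).join(' ')
  [t0, t1, t2]
end

puts sorted_tokens("new york mets vs atlanta braves")
puts token_set_strings("mariners vs angels",
                       "los angeles angels of anaheim at seattle mariners").inspect
```

Each pair of the three strings would then be scored with `NGramDistance.distance`. Note that the `token_set` helper added to the library in this release takes a simpler route (it deduplicates and sorts the tokens of each string), so it does not build these three strings.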
data/Rakefile CHANGED
@@ -4,8 +4,8 @@ begin
4
4
  gemspec.name = " ngramdistance-ffi"
5
5
  gemspec.summary = "An FFI version of the ngramdistance gem."
6
6
  gemspec.description = "Provides a fast, cross-Ruby implementation of the ngramdistance distance algorithm."
7
- gemspec.email = "dbalatero@gmail.com"
8
- gemspec.homepage = "http://github.com/dbalatero/ ngramdistance-ffi"
7
+ gemspec.email = "mailbali@gmail.com"
8
+ gemspec.homepage = "https://github.com/pecbali/ngramdistance-ffi"
9
9
  gemspec.authors = ["Bali"]
10
10
  gemspec.add_dependency "ffi"
11
11
  gemspec.add_development_dependency "rspec"
data/VERSION CHANGED
@@ -1 +1 @@
1
- 1.0.0
1
+ 1.0.1
@@ -16,6 +16,9 @@
16
16
  #define min( a, b ) ( ((a) < (b)) ? (a) : (b) )
17
17
  #endif
18
18
 
19
+ #define POSITIONAL 0
20
+ #define BINARY 1
21
+ #define COMPLEX 2
19
22
 
20
23
  char* substring(const char* str, size_t begin, size_t len)
21
24
  {
@@ -26,104 +29,147 @@ char* substring(const char* str, size_t begin, size_t len)
26
29
  }
27
30
 
28
31
 
29
- float ngramdistance (const char *source, const char *target,int n) {
30
- int sl = strlen(source);
31
- int tl = strlen(target);
32
-
33
- if (sl == 0 || tl == 0) {
34
- if (sl == tl) {
35
- return 1;
36
- }
37
- else {
38
- return 0;
39
- }
32
+ float ngramdistance (const char *source, const char *target,int n, int mode) {
33
+ int sl = strlen(source);
34
+ int tl = strlen(target);
35
+ char* word_for_complex1;
36
+ char* word_for_complex2;
37
+
38
+ if (mode == COMPLEX) {
39
+ word_for_complex1 = calloc(n + 1, sizeof(char)); // +1 so the n-gram copied by strncpy stays NUL-terminated
40
+ word_for_complex2 = calloc(n + 1, sizeof(char));
41
+ }
42
+
43
+ if (sl == 0 || tl == 0) {
44
+ if (sl == tl) {
45
+ return 1;
40
46
  }
41
- int cost = 0;
42
- if (sl < n || tl < n) {
43
- int ni = min(sl,tl);
44
- int i=0;
45
- for (i=0;i<ni;i++) {
46
- if (eq(source[i],target[i])) {
47
- cost++;
48
- }
49
- }
50
- return (float) cost/(float)max(sl, tl);
47
+ else {
48
+ return 0;
51
49
  }
52
- int char_len = sl+n-1;
53
- char* sa = calloc(char_len+1,sizeof(char));
54
- float* p; //'previous' cost array, horizontally
55
- float* d; // cost array, horizontally
56
- float* _d; //placeholder to assist in swapping p and d
57
-
58
- //construct sa with prefix
59
- int i=0;
60
- for (i=0;i<char_len;i++) {
61
- if (i < n-1) {
62
- sa[i]=0; //add prefix
63
- }
64
- else {
65
- sa[i] = source[i-n+1];
66
- }
50
+ }
51
+ int cost = 0;
52
+ if (sl < n || tl < n) {
53
+ int ni = min(sl,tl);
54
+ int i=0;
55
+ for (i=0;i<ni;i++) {
56
+ if (eq(source[i],target[i])) {
57
+ cost++;
67
58
  }
68
- int float_arr_len = sl+1;
69
- p = calloc( float_arr_len+1 , sizeof( float));
70
- d = calloc( float_arr_len+1 , sizeof( float));
71
-
72
- // indexes into strings s and t
73
- i=0; // iterates through source
74
- int j=0; // iterates through target
75
-
76
- char* t_j = calloc(n+1,sizeof(char)); // jth n-gram of t
77
- for (i = 0; i<=sl; i++) {
78
- p[i] = i;
59
+ }
60
+ return (float) cost/(float)max(sl, tl);
61
+ }
62
+ int char_len = sl+n-1;
63
+ char* sa = calloc(char_len+1,sizeof(char));
64
+ float* p; //'previous' cost array, horizontally
65
+ float* d; // cost array, horizontally
66
+ float* _d; //placeholder to assist in swapping p and d
67
+
68
+ //construct sa with prefix
69
+ int i=0;
70
+ for (i=0;i<char_len;i++) {
71
+ if (i < n-1) {
72
+ sa[i]=0; //add prefix
73
+ }
74
+ else {
75
+ sa[i] = source[i-n+1];
76
+ }
77
+ }
78
+ int float_arr_len = sl+1;
79
+ p = calloc( float_arr_len+1 , sizeof( float));
80
+ d = calloc( float_arr_len+1 , sizeof( float));
81
+
82
+ // indexes into strings s and t
83
+ i=0; // iterates through source
84
+ int j=0; // iterates through target
85
+
86
+ char* t_j = calloc(n+1,sizeof(char)); // jth n-gram of t
87
+ for (i = 0; i<=sl; i++) {
88
+ p[i] = i;
89
+ }
90
+
91
+
92
+ for (j = 1; j<=tl; j++) {
93
+ //construct t_j n-gram
94
+ if (j < n) {
95
+ int ti=0;
96
+ for (ti=0;ti<n-j;ti++) {
97
+ t_j[ti]=0; //add prefix
98
+ }
99
+ for (ti=n-j;ti<n;ti++) {
100
+ t_j[ti]=target[ti-(n-j)];
79
101
  }
80
-
81
- for (j = 1; j<=tl; j++) {
82
- //construct t_j n-gram
83
- if (j < n) {
84
- int ti=0;
85
- for (ti=0;ti<n-j;ti++) {
86
- t_j[ti]=0; //add prefix
87
- }
88
- for (ti=n-j;ti<n;ti++) {
89
- t_j[ti]=target[ti-(n-j)];
90
- }
91
- }
92
- else {
93
- free (t_j);
94
- t_j = substring(target,j-n, n);
95
- }
96
- d[0] = j;
97
- for (i=1; i<=sl; i++) {
98
- cost = 0;
99
- int tn=n;
100
- //compare sa to t_j
101
- int ni=0;
102
- for (ni=0;ni<n;ni++) {
103
- if (!eq(sa[i-1+ni] , t_j[ni])) {
104
- cost++;
105
- }
106
- else if (eq(sa[i-1+ni], 0)) { //discount matches on prefix
107
- tn--;
108
- }
109
- }
110
- float ec = (float) cost/(float)tn;
111
- // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
112
- d[i] = min(min(d[i-1]+1, p[i]+1), p[i-1]+ec);
113
- }
114
- // copy current distance counts to 'previous row' distance counts
115
- _d = p;
116
- p = d;
117
- d = _d;
102
+ }
103
+ else {
104
+ free (t_j);
105
+ t_j = substring(target,j-n, n);
106
+ }
107
+ d[0] = j;
108
+ for (i=1; i<=sl; i++) {
109
+ cost = 0;
110
+ int tn=n;
111
+ //compare sa to t_j
112
+ int ni=0;
113
+ float ec = 0;
114
+ switch(mode) {
115
+ case BINARY:
116
+ for (ni=0;ni<n;ni++) {
117
+ if (eq(sa[i-1+ni], 0) || !eq(sa[i-1+ni] , t_j[ni]) ) {
118
+ cost++;
119
+ break;
120
+ }
121
+ }
122
+ if (cost > 0.1) {
123
+ ec = 1;
124
+ }
125
+ break;
126
+ case COMPLEX:
127
+ for (ni=0;ni<n;ni++) {
128
+ if (eq(sa[i-1+ni], 0) || (eq(t_j[ni], 0))) {
129
+ cost++;
130
+ break;
131
+ }
132
+ }
133
+ if ( cost > 0.1 ) {
134
+ ec = 1;
135
+ } else {
136
+ strncpy(word_for_complex1, &(sa[i-1]), n);
137
+ strncpy(word_for_complex2, t_j, n);
138
+ ec = 1-ngramdistance (word_for_complex1, word_for_complex2, 1, BINARY);
139
+ }
140
+ break;
141
+ default: // default positional
142
+ for (ni=0;ni<n;ni++) {
143
+ if (!eq(sa[i-1+ni] , t_j[ni])) {
144
+ cost++;
145
+ }
146
+ else if (eq(sa[i-1+ni], 0)) { //discount matches on prefix
147
+ tn--;
148
+ }
149
+ }
150
+ ec = (float) cost/(float)tn;
151
+ break;
118
152
  }
119
- float val = p[sl];
120
- free(p);
121
- free(d);
122
- free(t_j);
123
- free(sa);
124
- // our last action in the above loop was to switch d and p, so p now
125
- // actually has the most recent cost counts
126
- return 1.0f - ((float) val / (float)max(tl, sl));
153
+ // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
154
+ d[i] = min(min(d[i-1]+1, p[i]+1), p[i-1]+ec);
155
+ }
156
+ // copy current distance counts to 'previous row' distance counts
157
+ _d = p;
158
+ p = d;
159
+ d = _d;
160
+ }
161
+ float val = p[sl];
162
+ free(p);
163
+ free(d);
164
+ free(t_j);
165
+ free(sa);
166
+ if ( mode == COMPLEX ) {
167
+ free(word_for_complex1);
168
+ free(word_for_complex2);
169
+ }
170
+ // our last action in the above loop was to switch d and p, so p now
171
+ // actually has the most recent cost counts
172
+ return 1.0f - ((float) val / (float)max(tl, sl));
127
173
  }
128
174
 
129
175
 
@@ -133,8 +179,8 @@ float ngramdistance (const char *source, const char *target,int n) {
133
179
 
134
180
  int main (int argc, char **argv) {
135
181
  float distance;
136
- if (argc < 3) return -1;
137
- distance = ngramdistance(argv[1], argv[2],3);
138
- printf("%s vs %s: %f\n", argv[1], argv[2],distance);
182
+ if (argc < 5) return -1;
183
+ distance = ngramdistance(argv[1], argv[2], atoi(argv[3]), atoi(argv[4]));
184
+ printf("%s vs %s: %f using %s length and %s method \n", argv[1], argv[2],distance, argv[3], argv[4]);
139
185
  }
140
186
  #endif
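For readers who want to follow the default (positional) branch of the C function without the diff noise, here is a rough pure-Ruby rendering of it. It is a reading aid under stated assumptions: the name `positional_ngram_similarity` is hypothetical, `nil` stands in for the C code's NUL prefix character, and lengths are counted in characters rather than bytes. The BINARY and COMPLEX branches are not covered.

```ruby
# Hypothetical name; a reading aid for the positional branch, not the gem's API.
def positional_ngram_similarity(source, target, n = 3)
  sl = source.length
  tl = target.length
  return (sl == tl ? 1.0 : 0.0) if sl.zero? || tl.zero?

  # Inputs shorter than n: fall back to counting same-position characters.
  if sl < n || tl < n
    matches = (0...[sl, tl].min).count { |i| source[i] == target[i] }
    return matches.to_f / [sl, tl].max
  end

  sa = Array.new(n - 1) + source.chars   # source with a nil prefix of size n-1
  prev = (0..sl).map(&:to_f)             # previous row of the cost table
  curr = Array.new(sl + 1, 0.0)          # current row of the cost table

  (1..tl).each do |j|
    # j-th n-gram of target, padded with nil while j < n
    t_j = j < n ? Array.new(n - j) + target.chars[0, j] : target.chars[j - n, n]
    curr[0] = j.to_f
    (1..sl).each do |i|
      cost = 0
      tn = n
      n.times do |k|
        if sa[i - 1 + k] != t_j[k]
          cost += 1
        elsif sa[i - 1 + k].nil?
          tn -= 1                        # discount matches on the prefix
        end
      end
      ec = cost.to_f / tn
      # minimum of left+1, top+1, and diagonal + partial n-gram cost
      curr[i] = [curr[i - 1] + 1, prev[i] + 1, prev[i - 1] + ec].min
    end
    prev, curr = curr, prev
  end
  1.0 - prev[sl] / [sl, tl].max
end

puts positional_ngram_similarity("university", "university")
puts positional_ngram_similarity("hello", "saint")
```

The two calls print 1.0 and 0.0, matching the corresponding fixtures in the spec hunk further down.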
@@ -1 +1 @@
1
- unsigned ngramdistance(const char *, const char *, int n);
1
+ float ngramdistance(const char *, const char *, int n, int mode);
@@ -1,12 +1,56 @@
1
1
  require 'ffi'
2
-
2
+ require 'set'
3
3
  module NGramDistance
4
+ PROPOSITIONAL=0
5
+ BINARY=1
6
+ COMPLEX=2
4
7
  extend FFI::Library
5
8
 
6
9
  # Try loading in order.
7
10
  library = File.dirname(__FILE__) + "/../ext/ngramdistance/ngramdistance"
8
- candidates = ['.bundle', '.so', '.dylib', ''].map { |ext| library + ext }
11
+ candidates = ['.bundle', '.dylib', ''].map { |ext| library + ext }
9
12
  ffi_lib(candidates)
10
13
 
11
- attach_function :distance, :ngramdistance, [:string, :string, :int], :float
14
+ attach_function :distance_private, :ngramdistance, [:string, :string, :int, :int], :float
15
+
16
+ def self.distance(source,target,ngram=3,mode=PROPOSITIONAL)
17
+ distance_private(source,target,ngram,mode)
18
+ end
19
+
20
+ def self.token_set_distance(source,target,regex=' ',ngram=3,mode=PROPOSITIONAL)
21
+ if source.nil? or target.nil?
22
+ if(source == target)
23
+ return 1.0
24
+ end
25
+ return 0.0
26
+ end
27
+ return distance(token_set(source,regex),token_set(target,regex),ngram,mode)
28
+ end
29
+
30
+ def self.token_sort_distance(source, target,regex=' ',ngram=3,mode=PROPOSITIONAL)
31
+ if source.nil? or target.nil?
32
+ if(source == target)
33
+ return 1.0
34
+ end
35
+ return 0.0
36
+ end
37
+ return distance(token_sort(source,regex),token_sort(target,regex),ngram,mode)
38
+ end
39
+
40
+ private
41
+
42
+ def self.token_set(source,regex)
43
+ set = SortedSet.new
44
+ tokens = source.split(regex)
45
+ tokens.each do |token|
46
+ set.add token
47
+ end
48
+ return set.to_a.join(' ')
49
+ end
50
+
51
+ def self.token_sort(source, regex)
52
+ tokens = source.split(regex)
53
+ tokens.sort!
54
+ return tokens.join(' ')
55
+ end
12
56
  end
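The private `token_set` helper in the hunk above relies on `SortedSet`, which was dropped from Ruby's bundled `set` library in Ruby 3.0 and moved to the separate `sorted_set` gem. A minimal sketch of the same dedupe-and-sort behavior using only core `Array` methods follows; the name `dedupe_and_sort_tokens` is hypothetical and the separator is assumed to be a plain string.

```ruby
# Hypothetical stand-in for the private token_set helper, without SortedSet.
def dedupe_and_sort_tokens(source, separator = ' ')
  source.split(separator).uniq.sort.join(' ')
end

puts dedupe_and_sort_tokens("hello world hello")
```

Because repeated tokens collapse, "hello hello" and "hello" normalize to the same string, which is why the token set fixtures in the spec expect a score of 1.0 for that pair.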
@@ -5,11 +5,11 @@
5
5
 
6
6
  Gem::Specification.new do |s|
7
7
  s.name = %q{ngramdistance-ffi}
8
- s.version = "1.0.0"
8
+ s.version = "1.0.1"
9
9
 
10
10
  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
11
11
  s.authors = ["Bali"]
12
- s.date = %q{2012-06-18}
12
+ s.date = %q{2012-07-06}
13
13
  s.description = %q{Provides a fast, cross-Ruby implementation of the ngramdistance distance algorithm.}
14
14
  s.email = %q{mailbali@gmail.com}
15
15
  s.extensions = ["ext/ngramdistance/extconf.rb"]
@@ -35,7 +35,7 @@ Gem::Specification.new do |s|
35
35
  s.homepage = %q{https://github.com/pecbali/ngramdistance-ffi}
36
36
  s.rdoc_options = ["--charset=UTF-8"]
37
37
  s.require_paths = ["lib"]
38
- s.rubygems_version = %q{1.3.7}
38
+ s.rubygems_version = %q{1.8.21}
39
39
  s.summary = %q{An FFI version of the ngramdistance gem.}
40
40
  s.test_files = [
41
41
  "spec/ngramdistance_spec.rb",
@@ -2,18 +2,45 @@ require 'spec_helper'
2
2
 
3
3
  describe NGramDistance do
4
4
  fixtures = [
5
- ["university", "univearsitty", 0.750000,
5
+ ["university", "univearsitty", 0.750000],
6
6
  ["university", "university", 1.0],
7
7
  ["hello", "jello",0.633333],
8
8
  ["hello", "heloll", 0.666667],
9
9
  ["hello", "saint", 0.0000],
10
10
  ["hello", "", 0.0000]
11
11
  ]
12
+
13
+ fixtures2 = [
14
+ ["hello hello", "hello",1.0],
15
+ ["hello world hello", "hello world", 1.0]
16
+ ]
17
+
18
+ fixtures3 = [
19
+ ["hello world", "world hello",1.0],
20
+ ["hello world hello", "hello world", 0.647]
21
+ ]
12
22
 
13
23
  fixtures.each do |w1, w2, d|
14
24
  it "should calculate a distance of #{d} between #{w1} and #{w2}" do
15
- NGramDistance.distance(w1, w2,3).should be_close( d,0.05)
16
- NGramDistance.distance(w2, w1,3).should be_close d,0.05)
25
+ NGramDistance.distance(w1, w2,ngram=3).should be_within(0.05).of( d)
26
+ NGramDistance.distance(w2, w1,ngram=3).should be_within(0.05).of( d)
27
+ end
28
+ end
29
+
30
+
31
+ fixtures2.each do |w1, w2, d|
32
+ it "should calculate a distance of #{d} between #{w1} and #{w2}" do
33
+ NGramDistance.token_set_distance(w1, w2).should be_within(0.05).of( d)
34
+ NGramDistance.token_set_distance(w2, w1).should be_within(0.05).of( d)
35
+ end
36
+ end
37
+
38
+
39
+ fixtures3.each do |w1, w2, d|
40
+ it "should calculate a distance of #{d} between #{w1} and #{w2}" do
41
+ NGramDistance.token_sort_distance(w1, w2).should be_within(0.05).of( d)
42
+ NGramDistance.token_sort_distance(w2, w1).should be_within(0.05).of( d)
17
43
  end
18
44
  end
45
+
19
46
  end
@@ -1,3 +1,3 @@
1
- require 'spec'
1
+ require 'rspec'
2
2
 
3
- require File.dirname(__FILE__) + "/../lib/levenshtein"
3
+ require File.dirname(__FILE__) + "/../lib/ngramdistance"
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: ngramdistance-ffi
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.0.0
4
+ version: 1.0.1
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,7 +9,7 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2012-06-18 00:00:00.000000000 Z
12
+ date: 2012-07-06 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: ffi