ngramdistance-ffi 1.0.0 → 1.0.1

Files changed (13)

data/.gitignore +0 -1
data/CHANGELOG.markdown +1 -1
data/Gemfile +1 -1
data/README.markdown +120 -4
data/Rakefile +2 -2
data/VERSION +1 -1
data/ext/ngramdistance/ngramdistance.c +142 -96
data/ext/ngramdistance/ngramdistance.h +1 -1
data/lib/ngramdistance.rb +47 -3
data/ngramdistance-ffi.gemspec +3 -3
data/spec/ngramdistance_spec.rb +30 -3
data/spec/spec_helper.rb +2 -2
metadata +2 -2

data/.gitignore CHANGED

@@ -1,4 +1,3 @@
-README.markdown.html
 *.rbc
 pkg
 .*.sw?

data/CHANGELOG.markdown CHANGED

@@ -1,3 +1,3 @@
-1.0.0
+1.0.1
 -----
 * started

data/Gemfile CHANGED

@@ -3,6 +3,6 @@ source :rubygems
 gem 'ffi'
 group :test do
-  gem 'rspec', '1.3.1'
+  gem 'rspec', '2.10.0'
   gem 'jeweler'
 end

data/README.markdown CHANGED

@@ -1,9 +1,14 @@
- ngramdistance-ffi
+ngramdistance-ffi
 ===============
 Converted to FFI by Bali for Ruby portability.
-This gem originally based on ngram distance.
+This gem originally based on ngram distance or similarity algorithm from Proceedings of the Twelfth International
+Conference on String Processing and Information Retrieval (SPIRE 2005).
+* This program was ported by hand from lucene-3.0.2. (lucene is Java product)
+* It supports only ngram distance algorithm.
 Tested on:
@@ -16,6 +21,117 @@ Known Issues
 Including in Gemfile
 ====================
-    gem ' ngramdistance-ffi', :require => ' ngramdistance'
+    gem 'ngramdistance-ffi', :require => 'ngramdistance'
+Original README
+===============
+N-Gram version of edit distance based on paper by Grzegorz Kondrak,
+"N-gram similarity and distance". Proceedings of the Twelfth International
+Conference on String Processing and Information Retrieval (SPIRE 2005), pp. 115-126,
+Buenos Aires, Argentina, November 2005.
+http://www.cs.ualberta.ca/~kondrak/papers/spire05.pdf
+This implementation uses the position-based optimization to compute partial
+matches of n-gram sub-strings and adds a null-character prefix of size n-1
+so that the first character is contained in the same number of n-grams as
+a middle character.  Null-character prefix matches are discounted so that
+strings with no matching characters will return a distance of 0.
+The module has a
+following functions:
+* distance
+----------
+    require 'ngramdistance'
+    NGramDistance.distance("string1", "string1") # returns 1.0
+    NGramDistance.distance("university", "univearsitty",3) # returns 0.750000
+method signature is distance(source,target,ngram=3,mode=PROPOSITIONAL)
+Third argument is the ngram granularity.
+* 1 -> unigram
+* 2 -> bigram
+* 3 -> trigram
+default argument is 3 (tri-gram)
+Fourth argument is mode of operation. We have 3 modes defined
+* PROPOSITIONAL=0
+	Propositional Computes the distance between n-grams with partial matching
+* BINARY=1
+	Only considers 2 n-grams match when they are exactly the same
+* COMPLEX=2
+	Uses ngramdistance with Binary option and 1-gram match for the n-gram match.
+default is PROPOSITIONAL mode.
+* token_sort_distance
+---------------------
+	require 'ngramdistance'
+    NGramDistance.token_sort_distance("hello bello", "bello hello") # returns 1.0
+The token sort approach involves tokenizing the string in question, sorting the tokens alphabetically, and then joining them back into a string. For example:
+	"new york mets vs atlanta braves"   →→  "atlanta braves mets new vs york"
+We then compare the transformed strings with a simple distance(). That nicely solves our ordering problem, as our helper function below indicates:
+	NGramDistance.token_sort_distance("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets") ⇒ 1.0
+method signature is token_sort_distance(source,target,regex=' ',ngram=3,mode=PROPOSITIONAL)
+the arguments are same as distance function but with extra argument regex, character or string to split source and target
+* token_set_distance
+--------------------
+	require 'ngramdistance'
+    NGramDistance.token_sort_distance("hello", "hello hello") # returns 1.0
+The token set approach is similar, but a little bit more flexible. Here, we tokenize both strings, but instead of immediately sorting and comparing, we split the tokens into two groups: intersection and remainder. We use those sets to build up a comparison string.
+Here is an illustrative example:
+	s1 = "mariners vs angels"
+	s2 = "los angeles angels of anaheim at seattle mariners"
+Using the token sort method isn't that helpful, because the second (longer) string has too many extra tokens that get interleaved with the sort. We'd end up comparing:
+	t1 = "angels mariners vs"
+	t2 = "anaheim angeles angels los mariners of seattle vs"
+Not very useful. Instead, the set method allows us to detect that "angels" and "mariners" are common to both strings, and separate those out (the set intersection). Now we construct and compare strings of the following form
+	t0 = [SORTED_INTERSECTION]
+	t1 = [SORTED_INTERSECTION] + [SORTED_REST_OF_STRING1]
+	t2 = [SORTED_INTERSECTION] + [SORTED_REST_OF_STRING2]
+And then compare each pair.
+The intuition here is that because the SORTED_INTERSECTION component is always exactly the same, the scores increase when (a) that makes up a larger percentage of the full string, and (b) the string remainders are more similar. In our example
+	t0 = "angels mariners"
+	t1 = "angels mariners vs"
+	t2 = "angels mariners anaheim angels"
+	NGramDistance.distance(t0, t1) ⇒ 0.83333
+	NGramDistance.distance(t0, t2) ⇒ 0.4838
+	NGramDistance.distance(t1, t2) ⇒ 0.54828
+	NGramDistance.token_set_distance(t0,t2) ⇒ 0.6521
+	NGramDistance.token_set_distance("hello hello world", "hello world") ⇒ 1.0
+method signature is token_set_distance(source,target,regex=' ',ngram=3,mode=PROPOSITIONAL)
+the arguments are same as distance function but with extra argument regex, character or string to split source and target
+Author
+======
+* Copyright (c) 2012, Bali (mailbali@gmail.com)
+* I ported from java source code of lucene-3.0.2.
+License
+=======
+* Apache  2.0 LICENSE
-- end -
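
The token-sort and token-set sections added to the README describe how the comparison strings are built before `distance` is called. The following is a minimal Ruby sketch of that description only; `token_set_strings` is a hypothetical helper that is not part of the gem, and it assumes tokens are split on a single separator and de-duplicated first.

```ruby
# Token sort, as described in the README: split, sort, join.
sorted = "new york mets vs atlanta braves".split(' ').sort.join(' ')

# Hypothetical helper (not in the gem): builds the three comparison strings
# t0, t1, t2 from the README's token-set description.
def token_set_strings(s1, s2, sep = ' ')
  a = s1.split(sep).uniq
  b = s2.split(sep).uniq
  common = (a & b).sort          # SORTED_INTERSECTION
  rest1  = (a - b).sort          # SORTED_REST_OF_STRING1
  rest2  = (b - a).sort          # SORTED_REST_OF_STRING2
  [common.join(' '),
   (common + rest1).join(' '),
   (common + rest2).join(' ')]
end

t0, t1, t2 = token_set_strings(
  "mariners vs angels",
  "los angeles angels of anaheim at seattle mariners"
)
puts sorted  # atlanta braves mets new vs york
puts t0      # angels mariners
puts t1      # angels mariners vs
puts t2      # angels mariners anaheim angeles at los of seattle
```

Each pair of these strings would then be scored with `NGramDistance.distance`; the sketch stops short of that call so it runs without the compiled extension.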

data/Rakefile CHANGED

@@ -4,8 +4,8 @@ begin
     gemspec.name = " ngramdistance-ffi"
     gemspec.summary = "An FFI version of the  ngramdistance gem."
     gemspec.description = "Provides a fast, cross-Ruby implementation of the  ngramdistance distance algorithm."
-    gemspec.email = "dbalatero@gmail.com"
-    gemspec.homepage = "http://github.com/dbalatero/ ngramdistance-ffi"
+    gemspec.email = "mailbali@gmail.com"
+    gemspec.homepage = "https://github.com/pecbali/ngramdistance-ffi"
     gemspec.authors = ["Bali"]
     gemspec.add_dependency "ffi"
     gemspec.add_development_dependency "rspec"

data/VERSION CHANGED

@@ -1 +1 @@
-1.0.0
+1.0.1

data/ext/ngramdistance/ngramdistance.c CHANGED

@@ -16,6 +16,9 @@
   #define min( a, b ) ( ((a) < (b)) ? (a) : (b) )
 #endif
+#define POSITIONAL 0
+#define BINARY 1
+#define COMPLEX 2
 char* substring(const char* str, size_t begin, size_t len)
 {
@@ -26,104 +29,147 @@ char* substring(const char* str, size_t begin, size_t len)
 }
-float ngramdistance (const char *source, const char *target,int n) {
-    int sl = strlen(source);
-    int tl = strlen(target);
-    if (sl == 0 || tl == 0) {
-        if (sl == tl) {
-            return 1;
-        }
-        else {
-            return 0;
-        }
+float ngramdistance (const char *source, const char *target,int n, int mode) {
+  int sl = strlen(source);
+  int tl = strlen(target);
+  char* word_for_complex1;
+  char* word_for_complex2;
+  if (mode == COMPLEX) {
+    word_for_complex1 = malloc(sizeof(char) * n);
+    word_for_complex2 = malloc(sizeof(char) * n);
+  }
+  if (sl == 0 || tl == 0) {
+    if (sl == tl) {
+      return 1;
     }
-    int cost = 0;
-    if (sl < n || tl < n) {
-        int ni = min(sl,tl);
-        int i=0;
-        for (i=0;i<ni;i++) {
-            if (eq(source[i],target[i])) {
-                cost++;
-            }
-        }
-        return (float) cost/(float)max(sl, tl);
+    else {
+      return 0;
     }
-    int char_len = sl+n-1;
-    char* sa = calloc(char_len+1,sizeof(char));
-      float* p; //'previous' cost array, horizontally
-      float* d; // cost array, horizontally
-      float* _d; //placeholder to assist in swapping p and d
-      //construct sa with prefix
-      int i=0;
-      for (i=0;i<char_len;i++) {
-          if (i < n-1) {
-              sa[i]=0; //add prefix
-          }
-          else {
-              sa[i] = source[i-n+1];
-          }
+  }
+  int cost = 0;
+  if (sl < n || tl < n) {
+    int ni = min(sl,tl);
+    int i=0;
+    for (i=0;i<ni;i++) {
+      if (eq(source[i],target[i])) {
+	cost++;
       }
-      int float_arr_len = sl+1;
-      p = calloc( float_arr_len+1 , sizeof( float));
-      d = calloc( float_arr_len+1 , sizeof( float));
-      // indexes into strings s and t
-      i=0; // iterates through source
-      int j=0; // iterates through target
-      char* t_j = calloc(n+1,sizeof(char)); // jth n-gram of t
-      for (i = 0; i<=sl; i++) {
-          p[i] = i;
+    }
+    return (float) cost/(float)max(sl, tl);
+  }
+  int char_len = sl+n-1;
+  char* sa = calloc(char_len+1,sizeof(char));
+  float* p; //'previous' cost array, horizontally
+  float* d; // cost array, horizontally
+  float* _d; //placeholder to assist in swapping p and d
+  //construct sa with prefix
+  int i=0;
+  for (i=0;i<char_len;i++) {
+    if (i < n-1) {
+      sa[i]=0; //add prefix
+    }
+    else {
+      sa[i] = source[i-n+1];
+    }
+  }
+  int float_arr_len = sl+1;
+  p = calloc( float_arr_len+1 , sizeof( float));
+  d = calloc( float_arr_len+1 , sizeof( float));
+  // indexes into strings s and t
+  i=0; // iterates through source
+  int j=0; // iterates through target
+  char* t_j = calloc(n+1,sizeof(char)); // jth n-gram of t
+  for (i = 0; i<=sl; i++) {
+    p[i] = i;
+  }
+  for (j = 1; j<=tl; j++) {
+    //construct t_j n-gram
+    if (j < n) {
+      int ti=0;
+      for (ti=0;ti<n-j;ti++) {
+	t_j[ti]=0; //add prefix
+      }
+      for (ti=n-j;ti<n;ti++) {
+	t_j[ti]=target[ti-(n-j)];
       }
-      for (j = 1; j<=tl; j++) {
-          //construct t_j n-gram
-          if (j < n) {
-              int ti=0;
-              for (ti=0;ti<n-j;ti++) {
-                  t_j[ti]=0; //add prefix
-              }
-              for (ti=n-j;ti<n;ti++) {
-                  t_j[ti]=target[ti-(n-j)];
-              }
-          }
-          else {
-              free (t_j);
-              t_j = substring(target,j-n, n);
-          }
-          d[0] = j;
-          for (i=1; i<=sl; i++) {
-              cost = 0;
-              int tn=n;
-              //compare sa to t_j
-              int ni=0;
-              for (ni=0;ni<n;ni++) {
-                  if (!eq(sa[i-1+ni] , t_j[ni])) {
-                      cost++;
-                  }
-                  else if (eq(sa[i-1+ni], 0)) { //discount matches on prefix
-                      tn--;
-                  }
-              }
-              float ec = (float) cost/(float)tn;
-              // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
-              d[i] = min(min(d[i-1]+1, p[i]+1),  p[i-1]+ec);
-          }
-          // copy current distance counts to 'previous row' distance counts
-          _d = p;
-          p = d;
-          d = _d;
+    }
+    else {
+      free (t_j);
+      t_j = substring(target,j-n, n);
+    }
+    d[0] = j;
+    for (i=1; i<=sl; i++) {
+      cost = 0;
+      int tn=n;
+      //compare sa to t_j
+      int ni=0;
+      float ec = 0;
+      switch(mode) {
+      case BINARY:
+	for (ni=0;ni<n;ni++) {
+	  if (eq(sa[i-1+ni], 0) || !eq(sa[i-1+ni] , t_j[ni]) ) {
+	    cost++;
+	    break;
+	  }
+	}
+	if (cost > 0.1) {
+	  ec = 1;
+	}
+	break;
+      case COMPLEX:
+	for (ni=0;ni<n;ni++) {
+	  if (eq(sa[i-1+ni], 0) || (eq(t_j[ni], 0))) {
+	    cost++;
+	    break;
+	  }
+	}
+	if ( cost > 0.1 ) {
+	  ec = 1;
+	} else {
+	  strncpy(word_for_complex1, &(sa[i-1]), n);
+	  strncpy(word_for_complex2, t_j, n);
+	  ec = 1-ngramdistance (word_for_complex1, word_for_complex2, 1, BINARY);
+	}
+	break;
+      default: // default positional
+	for (ni=0;ni<n;ni++) {
+	  if (!eq(sa[i-1+ni] , t_j[ni])) {
+	    cost++;
+	  }
+	  else if (eq(sa[i-1+ni], 0)) { //discount matches on prefix
+	    tn--;
+	  }
+	}
+	ec = (float) cost/(float)tn;
+	break;
       }
-      float val = p[sl];
-      free(p);
-      free(d);
-      free(t_j);
-      free(sa);
-      // our last action in the above loop was to switch d and p, so p now
-      // actually has the most recent cost counts
-      return 1.0f -  ((float) val / (float)max(tl, sl));
+      // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
+      d[i] = min(min(d[i-1]+1, p[i]+1),  p[i-1]+ec);
+    }
+    // copy current distance counts to 'previous row' distance counts
+    _d = p;
+    p = d;
+    d = _d;
+  }
+  float val = p[sl];
+  free(p);
+  free(d);
+  free(t_j);
+  free(sa);
+  if ( mode == COMPLEX ) {
+    free(word_for_complex1);
+    free(word_for_complex2);
+  }
+  // our last action in the above loop was to switch d and p, so p now
+  // actually has the most recent cost counts
+  return 1.0f -  ((float) val / (float)max(tl, sl));
 }
@@ -133,8 +179,8 @@ float ngramdistance (const char *source, const char *target,int n) {
   int main (int argc, char **argv) {
       float distance;
-      if (argc < 3) return -1;
-      distance = ngramdistance(argv[1], argv[2],3);
-      printf("%s vs %s: %f\n", argv[1], argv[2],distance);
+      if (argc < 5) return -1;
+      distance = ngramdistance(argv[1], argv[2], atoi(argv[3]), atoi(argv[4]));
+      printf("%s vs %s: %f using %s length and %s method \n", argv[1], argv[2],distance, argv[3], argv[4]);
   }
 #endif
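
For readers who want to trace the default (positional) branch of `ngramdistance()` without compiling the extension, here is a pure-Ruby sketch of the same dynamic program. It is an illustration under assumptions, not the gem's code: `ngram_similarity` is a hypothetical name, `nil` stands in for the null-character prefix, and the BINARY and COMPLEX modes are omitted.

```ruby
# Hypothetical pure-Ruby sketch of the positional mode of ngramdistance().
def ngram_similarity(source, target, n = 3)
  s = source.chars
  t = target.chars
  sl = s.length
  tl = t.length
  return (sl == tl ? 1.0 : 0.0) if sl.zero? || tl.zero?

  if sl < n || tl < n
    # Strings shorter than n: count position-wise character matches.
    matches = (0...[sl, tl].min).count { |i| s[i] == t[i] }
    return matches.to_f / [sl, tl].max
  end

  sa = Array.new(n - 1) + s   # source padded with n-1 nil "prefix" slots
  ta = Array.new(n - 1) + t   # target padded the same way
  prev = (0..sl).map(&:to_f)  # previous row of the cost table

  (1..tl).each do |j|
    t_j = ta[j - 1, n]        # j-th n-gram of the padded target
    cur = [j.to_f]
    (1..sl).each do |i|
      cost = 0
      tn = n
      n.times do |k|
        if sa[i - 1 + k] != t_j[k]
          cost += 1           # mismatch inside the n-gram
        elsif sa[i - 1 + k].nil?
          tn -= 1             # discount matches on the prefix
        end
      end
      ec = cost.to_f / tn
      # Minimum of left + 1, top + 1, and diagonal + n-gram cost.
      cur[i] = [cur[i - 1] + 1, prev[i] + 1, prev[i - 1] + ec].min
    end
    prev = cur
  end

  1.0 - prev[sl] / [sl, tl].max
end

puts ngram_similarity("university", "university")  # 1.0
puts ngram_similarity("hello", "saint")            # 0.0
```

As in the C code, the returned value is a similarity: 1.0 for identical strings and 0.0 when nothing matches.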

data/ext/ngramdistance/ngramdistance.h CHANGED

@@ -1 +1 @@
-unsigned ngramdistance(const char *, const char *, int n);
+float ngramdistance(const char *, const char *, int n, int mode);

data/lib/ngramdistance.rb CHANGED

@@ -1,12 +1,56 @@
 require 'ffi'
+require 'set'
 module NGramDistance
+  PROPOSITIONAL=0
+  BINARY=1
+  COMPLEX=2
   extend FFI::Library
   # Try loading in order.
   library = File.dirname(__FILE__) + "/../ext/ngramdistance/ngramdistance"
-  candidates = ['.bundle', '.so', '.dylib', ''].map { |ext| library + ext }
+  candidates = ['.bundle', '.dylib', ''].map { |ext| library + ext }
   ffi_lib(candidates)
-  attach_function :distance, :ngramdistance, [:string, :string, :int], :float
+  attach_function :distance_private, :ngramdistance, [:string, :string, :int, :int], :float
+  def self.distance(source,target,ngram=3,mode=PROPOSITIONAL)
+    distance_private(source,target,ngram,mode)
+  end
+  def self.token_set_distance(source,target,regex=' ',ngram=3,mode=PROPOSITIONAL)
+    if source.nil? or target.nil?
+      if(source == targer)
+        return 1.0
+      end
+      return 0.0
+    end
+    return distance(token_set(source,regex),token_set(target,regex),ngram,mode)
+  end
+  def self.token_sort_distance(source, target,regex=' ',ngram=3,mode=PROPOSITIONAL)
+    if source.nil? or target.nil?
+      if(source == targer)
+        return 1.0
+      end
+      return 0.0
+    end
+    return distance(token_sort(source,regex),token_sort(target,regex),ngram,mode)
+  end
+  private
+  def self.token_set(source,regex)
+    set = SortedSet.new
+    tokens = source.split(regex)
+    tokens.each do |token|
+      set.add token
+    end
+    return set.to_a.join(' ')
+  end
+  def self.token_sort(source, regex)
+    tokens = source.split(regex)
+    tokens.sort!
+    return tokens.join(' ')
+  end
 end
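
Two details of the new Ruby wrapper are worth noting. First, the nil guard compares `source == targer`; `targer` is not defined anywhere in the file, so that branch would raise a `NameError` if either argument were nil. Second, `token_set` as released only de-duplicates and sorts the tokens, and it relies on `SortedSet`, which recent Ruby versions no longer ship in the standard `set` library. The sketch below shows one possible shape of both pieces in plain Ruby; `nil_score` and this `token_set` are hypothetical stand-ins, not the gem's methods.

```ruby
# Hypothetical stand-in for the private token_set helper: de-duplicate,
# sort, and re-join the tokens without depending on SortedSet.
def token_set(source, sep = ' ')
  source.split(sep).uniq.sort.join(' ')
end

# Hypothetical nil guard with the variable name spelled correctly.
# Returns 1.0 or 0.0 when either input is nil, and nil when both are
# present (meaning the caller should go on to compute the real distance).
def nil_score(source, target)
  return nil unless source.nil? || target.nil?
  source == target ? 1.0 : 0.0
end

puts token_set("hello world hello")  # hello world
puts nil_score(nil, nil)             # 1.0
puts nil_score(nil, "hello")         # 0.0
```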

data/ngramdistance-ffi.gemspec CHANGED

@@ -5,11 +5,11 @@
 Gem::Specification.new do |s|
   s.name = %q{ngramdistance-ffi}
-  s.version = "1.0.0"
+  s.version = "1.0.1"
   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
   s.authors = ["Bali"]
-  s.date = %q{2012-06-18}
+  s.date = %q{2012-07-06}
   s.description = %q{Provides a fast, cross-Ruby implementation of the  ngramdistance distance algorithm.}
   s.email = %q{mailbali@gmail.com}
   s.extensions = ["ext/ngramdistance/extconf.rb"]
@@ -35,7 +35,7 @@ Gem::Specification.new do |s|
   s.homepage = %q{https://github.com/pecbali/ngramdistance-ffi}
   s.rdoc_options = ["--charset=UTF-8"]
   s.require_paths = ["lib"]
-  s.rubygems_version = %q{1.3.7}
+  s.rubygems_version = %q{1.8.21}
   s.summary = %q{An FFI version of the ngramdistance gem.}
   s.test_files = [
     "spec/ngramdistance_spec.rb",

data/spec/ngramdistance_spec.rb CHANGED

@@ -2,18 +2,45 @@ require 'spec_helper'
 describe NGramDistance do
   fixtures = [
-    ["university", "univearsitty", 0.750000,
+    ["university", "univearsitty", 0.750000],
     ["university", "university", 1.0],
     ["hello", "jello",0.633333],
     ["hello", "heloll", 0.666667],
     ["hello", "saint", 0.0000],
     ["hello", "", 0.0000]
   ]
+  fixtures2 = [
+    ["hello hello", "hello",1.0],
+    ["hello world hello", "hello world", 1.0]
+  ]
+  fixtures3 = [
+    ["hello world", "world hello",1.0],
+    ["hello world hello", "hello world", 0.647]
+  ]
   fixtures.each do |w1, w2, d|
     it "should calculate a distance of #{d} between #{w1} and #{w2}" do
-      NGramDistance.distance(w1, w2,3).should be_close( d,0.05)
-      NGramDistance.distance(w2, w1,3).should be_close d,0.05)
+      NGramDistance.distance(w1, w2,ngram=3).should be_within(0.05).of( d)
+      NGramDistance.distance(w2, w1,ngram=3).should be_within(0.05).of( d)
+    end
+  end
+  fixtures2.each do |w1, w2, d|
+    it "should calculate a distance of #{d} between #{w1} and #{w2}" do
+      NGramDistance.token_set_distance(w1, w2).should be_within(0.05).of( d)
+      NGramDistance.token_set_distance(w2, w1).should be_within(0.05).of( d)
+    end
+  end
+  fixtures3.each do |w1, w2, d|
+    it "should calculate a distance of #{d} between #{w1} and #{w2}" do
+      NGramDistance.token_sort_distance(w1, w2).should be_within(0.05).of( d)
+      NGramDistance.token_sort_distance(w2, w1).should be_within(0.05).of( d)
     end
   end
 end

data/spec/spec_helper.rb CHANGED

@@ -1,3 +1,3 @@
-require 'spec'
+require 'rspec'
-require File.dirname(__FILE__) + "/../lib/levenshtein"
+require File.dirname(__FILE__) + "/../lib/ngramdistance"

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: ngramdistance-ffi
 version: !ruby/object:Gem::Version
-  version: 1.0.0
+  version: 1.0.1
   prerelease:
 platform: ruby
 authors:
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-06-18 00:00:00.000000000 Z
+date: 2012-07-06 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: ffi