RubyGems - levenshtein_rb - Versions diffs - 0.0.1 - Mend

levenshtein_rb 0.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (16)

checksums.yaml +7 -0
data/.gitignore +22 -0
data/.rspec +2 -0
data/.travis.yml +3 -0
data/Gemfile +4 -0
data/LICENSE.txt +23 -0
data/README.md +104 -0
data/Rakefile +7 -0
data/levenshtein_rb.gemspec +24 -0
data/lib/levenshtein_rb.rb +2 -0
data/lib/levenshtein_rb/levenshtein_distance.rb +70 -0
data/lib/levenshtein_rb/version.rb +3 -0
data/spec/levenshtein_distance_spec.rb +27 -0
data/spec/levenshtein_rb_spec.rb +7 -0
data/spec/spec_helper.rb +2 -0
metadata +106 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 61dd3ffea0bc6ffbaef4dd792cb165384101b4fd
+  data.tar.gz: ee9770dc4d4cc7e1a3644ee06b3c4c92be8ce58e
+SHA512:
+  metadata.gz: 6395246d52816851679edc4ab168899641fcdc021ff7d9339f352be36fbab343c3462032ae882e645f810d2ee0d7e605461cef002f2f2241fc7647fd1e680762
+  data.tar.gz: 316860dd0a260afdda38c6bdab205e11f00af3c0deffe11b167e50ef8599be99159b0fdae4d2982ef5c0d6ee59725d4fa06cd3f0d647c9d022a1a4700901c5f8

data/.gitignore ADDED

@@ -0,0 +1,22 @@
+*.gem
+*.rbc
+.bundle
+.config
+.yardoc
+Gemfile.lock
+InstalledFiles
+_yardoc
+coverage
+doc/
+lib/bundler/man
+pkg
+rdoc
+spec/reports
+test/tmp
+test/version_tmp
+tmp
+*.bundle
+*.so
+*.o
+*.a
+mkmf.log

data/.rspec ADDED

@@ -0,0 +1,2 @@
+--format documentation
+--color

data/.travis.yml ADDED

@@ -0,0 +1,3 @@
+language: ruby
+rvm:
+  - 2.1.2

data/Gemfile ADDED

@@ -0,0 +1,4 @@
+source 'https://rubygems.org'
+# Specify your gem's dependencies in levenshtein_rb.gemspec
+gemspec

data/LICENSE.txt ADDED

@@ -0,0 +1,23 @@
+Copyright (c) 2015 Robin Neumann
+MIT License
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,104 @@
+# levenshtein_rb
+Plain Ruby implementation of Levenshtein distance algorithm
+that measures the similarity of two strings.
+## Disclaimer
+This implementation is intended for educational use: studying the algorithm itself through the
+clean syntactic structure of Ruby. Use it in production with caution. To compare really large strings you may want to use a Ruby gem that relies on native C extensions or similar, such as
+* [levenshtein-ffi](https://github.com/dbalatero/levenshtein-ffi)
+* [damerau-levenshtein](https://github.com/GlobalNamesArchitecture/damerau-levenshtein)
+levenshtein_rb was inspired by [this example implementation](http://rosettacode.org/wiki/Levenshtein_distance#Ruby).
+During my heuristically flavoured (and therefore not really meaningful) benchmark it took 1.28 seconds to compute the Levenshtein distance of strings of length 1000 and ~30 seconds to compute the Levenshtein distance
+of strings of length 10000. In theory the algorithm has complexity *O(mn)*, where *m* and *n* are the lengths of the input strings.
+## Installation
+Add this line to your application's Gemfile:
+    gem 'levenshtein_rb'
+And then execute:
+    $ bundle
+Or install it yourself as:
+    $ gem install levenshtein_rb
+## Usage
+    LevenshteinRb::LevenshteinDistance.new('Tor', 'Tier').to_i # => 2
+# The Levenshtein algorithm
+The mathematical formalism is well explained on [Wikipedia](https://en.wikipedia.org/wiki/Levenshtein_distance), so let's concentrate on an example here and walk through the steps of the algorithm for the word pair (Tor, Tier).
+For a human being it is convenient to perform the
+algorithm by "filling a table". The table of numbers is the
+"recurrence matrix" in the Ruby code. In the mathematical literature, the recurrence matrix usually is denoted by *D*.
+One starts with the upper left corner, which refers to the pair of words (ε,ε). The corresponding matrix entry *D[0,0]* is set to zero, which means: zero changes are
+necessary to turn the word *ε* into *ε*.
+Let's now have a look at the first row:
+```
+   | ε T o r
+ -----------
+ ε | 0 1 2 3
+```
+The semantics of the entries *D[0,1]*, *D[0,2]* and *D[0,3]* are as follows:
+It takes one elementary change (namely an addition) to turn the word *ε* into *εT*. It takes 2 additions to turn *ε* into *εTo* and 3 additions to turn *ε* into *εTor*. The whole first column works similarly.
+More interesting are the values (a.k.a. "costs") of an entry that corresponds to an inner element of the matrix.
+Take, for example, the matrix entry *D[1,1]*, assuming that the whole first row and
+the whole first column have already been calculated (which is easy because it's the same procedure for any word-word combination). So concentrate on the submatrix:
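The border initialization described above can be sketched in a few lines of plain Ruby. This is a minimal illustration, not code from the gem: `border_table` and its parameter names are hypothetical, and the table layout (first word along the columns, second word along the rows) is an assumption chosen to match the README tables.

```ruby
# Hypothetical helper, not part of levenshtein_rb: builds a table and fills
# only its first row and first column, leaving inner cells as nil.
def border_table(column_word, row_word)
  rows = row_word.length + 1
  cols = column_word.length + 1
  table = Array.new(rows) { Array.new(cols) }
  (0...cols).each { |c| table[0][c] = c } # c additions turn ε into the first c characters
  (0...rows).each { |r| table[r][0] = r } # same idea along the first column
  table
end

table = border_table('Tor', 'Tier')
p table[0]           # first row:    [0, 1, 2, 3]
p table.map(&:first) # first column: [0, 1, 2, 3, 4]
```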
+```
+   | ε T
+ -------
+ ε | 0 1
+ T | 1 ?
+       ↑
+       This is D[1,1]
+```
+*D[1,1]* corresponds to the question: How many steps does it take to turn *εT* into *εT*?
+And the answer is easy: We've added the same character ("*T*") to the word *ε* on both dimensions. So the answer is: Zero.
+Now that we have obtained *D[1,1]*, let's continue and fill the whole second row:
+```
+   | ε T o r
+ -----------
+ ε | 0 1 2 3
+ T | 1 0 ? ?
+```
+How do we get *D[1,2]*? The "same character logic" does not apply here. *D[1,2]* corresponds
+to an addition of the character "o" (or a deletion if you see it vice versa). One "o" needs to be added, so it costs exactly one more, but starting from which value? We choose "the best path", i.e. the minimal value among the neighbors that have already been computed. In our case that is *D[1,1]*. *D[1,2]* costs exactly 1 more, and therefore *D[1,2] = D[1,1] + 1 = 1*.
+This is the central trick: One always looks for the "minimum value" among the neighbor entries calculated before and adds a cost (in our case we define all costs as *+1*).
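The rule for a single inner cell can be written as a small function. This is a sketch under the README's cost model (every change costs +1); `cell_cost` and its argument names are hypothetical and do not appear in the gem.

```ruby
# Hypothetical helper: cost of one inner cell, given its three neighbours
# that were computed before it.
def cell_cost(left, above, diagonal, same_character)
  # Same character added on both sides: no extra cost over the diagonal.
  return diagonal if same_character

  # Otherwise take the cheapest neighbour and pay one more change.
  [left, above, diagonal].min + 1
end

p cell_cost(1, 1, 0, true)  # D[1,1]: "T" against "T" => 0
p cell_cost(0, 2, 1, false) # D[1,2]: cheapest neighbour is D[1,1] = 0 => 1
```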
+If one follows the strategy successively one obtains:
+```
+   | ε T o r
+ -----------
+ ε | 0 1 2 3
+ T | 1 0 1 2
+ i | 2 1 1 2
+ e | 3 2 2 2
+ r | 4 3 3 2
+           ↑
+         This last value is the Levenshtein distance
+ ```
+The entry *D[m,n]* is called the Levenshtein distance of the two input words.
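To reproduce the finished table from the walkthrough, the following standalone sketch fills every cell and prints the rows. It is an illustration only: `levenshtein_table` is a hypothetical name, and the layout (first argument along the columns, second along the rows) is an assumption made to match the README tables rather than the gem's internal matrix.

```ruby
# Hypothetical helper, not part of levenshtein_rb: returns the full
# recurrence table as a nested array.
def levenshtein_table(column_word, row_word)
  table = Array.new(row_word.length + 1) { Array.new(column_word.length + 1, 0) }
  (0..column_word.length).each { |c| table[0][c] = c } # first row
  (0..row_word.length).each { |r| table[r][0] = r }    # first column

  (1..row_word.length).each do |r|
    (1..column_word.length).each do |c|
      table[r][c] =
        if row_word[r - 1] == column_word[c - 1]
          table[r - 1][c - 1] # same character: carry the diagonal value over
        else
          [table[r][c - 1], table[r - 1][c], table[r - 1][c - 1]].min + 1
        end
    end
  end
  table
end

levenshtein_table('Tor', 'Tier').each { |row| puts row.join(' ') }
# 0 1 2 3
# 1 0 1 2
# 2 1 1 2
# 3 2 2 2
# 4 3 3 2
```

The bottom-right entry of the returned table is the distance, here 2.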

data/Rakefile ADDED

@@ -0,0 +1,7 @@
+require "bundler/gem_tasks"
+require "rspec/core/rake_task"
+RSpec::Core::RakeTask.new(:spec)
+task :default => :spec

data/levenshtein_rb.gemspec ADDED

@@ -0,0 +1,24 @@
+# coding: utf-8
+lib = File.expand_path('../lib', __FILE__)
+$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+require 'levenshtein_rb/version'
+Gem::Specification.new do |spec|
+  spec.name          = "levenshtein_rb"
+  spec.version       = LevenshteinRb::VERSION
+  spec.authors       = ["Robin Neumann\n"]
+  spec.email         = ["robin.neumann@absolventa.de"]
+  spec.summary       = %q{Implementation of Levenshtein algorithm to determine the similarity of two strings using only Ruby}
+  spec.description   = %q{Implementation of Levenshtein algorithm to determine the similarity of two strings using only Ruby}
+  spec.homepage      = ""
+  spec.license       = "MIT"
+  spec.files         = `git ls-files -z`.split("\x0")
+  spec.executables   = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
+  spec.test_files    = spec.files.grep(%r{^(test|spec|features)/})
+  spec.require_paths = ["lib"]
+  spec.add_development_dependency "bundler", "~> 1.6"
+  spec.add_development_dependency "rake"
+  spec.add_development_dependency "rspec"
+end

data/lib/levenshtein_rb.rb ADDED

@@ -0,0 +1,2 @@
+require "levenshtein_rb/version"
+require "levenshtein_rb/levenshtein_distance"

data/lib/levenshtein_rb/levenshtein_distance.rb ADDED

@@ -0,0 +1,70 @@
+module LevenshteinRb
+  class LevenshteinDistance
+    class RecurrenceMatrix
+      attr_reader :store
+      def [](index)
+        store[index]
+      end
+      def []=(index, value)
+        store[index] = value
+      end
+      def initialize(m, n)
+        @store = Array.new(m+1) { Array.new(n+1) }
+        (0..m).each { |i| store[i][0] = i }
+        (0..n).each { |j| store[0][j] = j }
+      end
+    end
+    attr_reader :s, :t, :m, :n, :d
+    def initialize(a_string, another_string)
+      @s, @t = a_string, another_string
+      @m, @n = s.length.to_i, t.length.to_i
+      @d     = RecurrenceMatrix.new(m, n) # d is simply the default notation for a recurrence matrix in literature
+    end
+    def to_i
+      return m if n == 0
+      return n if m == 0
+      (1..n).each do |j|
+        (1..m).each do |i|
+          d[i][j] = costs_for_step(i, j)
+        end
+      end
+      d[m][n]
+    end
+    private
+    def same_character_for_both_words_is_added?(i, j)
+      s[i-1] == t[j-1]
+    end
+    def obtain_minimal_value_from_neighbors(i, j)
+      [
+        d[i-1][j], # Look left in the matrix. This would be a "deletion". It costs +1 more than d[i-1][j]
+        d[i][j-1], # Look directly above the current entry in the matrix. This corresponds to an insertion, which costs +1 more than d[i][j-1]
+        d[i-1][j-1], # Substitute the character. This also costs +1 more than d[i-1][j-1]
+      ].min
+    end
+    def costs_for_step(i, j)
+      if same_character_for_both_words_is_added?(i, j)
+        d[i-1][j-1]
+      else
+        obtain_minimal_value_from_neighbors(i, j) + 1
+      end
+    end
+  end
+end
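The class above keeps the full (m+1) × (n+1) matrix in memory, which is convenient for studying the table. Since each cell depends only on the current and the previous row, a variant that keeps just two rows is possible. The sketch below shows that alternative technique; it is not the gem's implementation, and `levenshtein_two_rows` is a hypothetical name.

```ruby
# Two-row variant (an alternative to the gem's full matrix, shown for
# illustration): memory grows with t.length only, time is still O(mn).
def levenshtein_two_rows(s, t)
  previous = (0..t.length).to_a # row for the empty prefix of s

  s.each_char.with_index(1) do |s_char, i|
    current = [i] # first column: i deletions turn s[0, i] into ε
    t.each_char.with_index(1) do |t_char, j|
      cost =
        if s_char == t_char
          previous[j - 1]
        else
          [previous[j], current[j - 1], previous[j - 1]].min + 1
        end
      current << cost
    end
    previous = current
  end

  previous.last
end

p levenshtein_two_rows('Tor', 'Tier') # => 2
```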

data/lib/levenshtein_rb/version.rb ADDED

@@ -0,0 +1,3 @@
+module LevenshteinRb
+  VERSION = "0.0.1"
+end

data/spec/levenshtein_distance_spec.rb ADDED

@@ -0,0 +1,27 @@
+require 'spec_helper'
+RSpec.describe LevenshteinRb::LevenshteinDistance do
+  describe '#to_i' do
+    it { expect(described_class.new('', '').to_i).to eql 0 }
+    context 'with insertion/deletion' do
+      it { expect(described_class.new('', 'a').to_i).to eql 1 }
+      it { expect(described_class.new('a', '').to_i).to eql 1 }
+      it { expect(described_class.new('bbb', 'bbba').to_i).to eql 1 }
+      it { expect(described_class.new('bbba', 'bbb').to_i).to eql 1 }
+    end
+    context 'with substitution' do
+      it { expect(described_class.new('a', 'b').to_i).to eql 1 }
+      it { expect(described_class.new('ba', 'bb').to_i).to eql 1 }
+      it { expect(described_class.new('ab', 'bb').to_i).to eql 1 }
+    end
+    context 'with case studies' do
+      it { expect(described_class.new('Tier', 'Tor').to_i).to eql 2 }
+    end
+  end
+end

data/spec/levenshtein_rb_spec.rb ADDED

@@ -0,0 +1,7 @@
+require 'spec_helper'
+describe LevenshteinRb do
+  it 'has a version number' do
+    expect(LevenshteinRb::VERSION).not_to be nil
+  end
+end

data/spec/spec_helper.rb ADDED

@@ -0,0 +1,2 @@
+$LOAD_PATH.unshift File.expand_path('../../lib', __FILE__)
+require 'levenshtein_rb'

metadata ADDED

@@ -0,0 +1,106 @@
+--- !ruby/object:Gem::Specification
+name: levenshtein_rb
+version: !ruby/object:Gem::Version
+  version: 0.0.1
+platform: ruby
+authors:
+- |
+  Robin Neumann
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2015-11-23 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: bundler
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.6'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.6'
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+description: Implementation of Levenshtein algorithm to determine the similarity of
+  two strings using only Ruby
+email:
+- robin.neumann@absolventa.de
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- ".gitignore"
+- ".rspec"
+- ".travis.yml"
+- Gemfile
+- LICENSE.txt
+- README.md
+- Rakefile
+- levenshtein_rb.gemspec
+- lib/levenshtein_rb.rb
+- lib/levenshtein_rb/levenshtein_distance.rb
+- lib/levenshtein_rb/version.rb
+- spec/levenshtein_distance_spec.rb
+- spec/levenshtein_rb_spec.rb
+- spec/spec_helper.rb
+homepage: ''
+licenses:
+- MIT
+metadata: {}
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 2.2.2
+signing_key:
+specification_version: 4
+summary: Implementation of Levenshtein algorithm to determine the similarity of two
+  strings using only Ruby
+test_files:
+- spec/levenshtein_distance_spec.rb
+- spec/levenshtein_rb_spec.rb
+- spec/spec_helper.rb