levenshtein_rb 0.0.1

@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 61dd3ffea0bc6ffbaef4dd792cb165384101b4fd
4
+ data.tar.gz: ee9770dc4d4cc7e1a3644ee06b3c4c92be8ce58e
5
+ SHA512:
6
+ metadata.gz: 6395246d52816851679edc4ab168899641fcdc021ff7d9339f352be36fbab343c3462032ae882e645f810d2ee0d7e605461cef002f2f2241fc7647fd1e680762
7
+ data.tar.gz: 316860dd0a260afdda38c6bdab205e11f00af3c0deffe11b167e50ef8599be99159b0fdae4d2982ef5c0d6ee59725d4fa06cd3f0d647c9d022a1a4700901c5f8
@@ -0,0 +1,22 @@
1
+ *.gem
2
+ *.rbc
3
+ .bundle
4
+ .config
5
+ .yardoc
6
+ Gemfile.lock
7
+ InstalledFiles
8
+ _yardoc
9
+ coverage
10
+ doc/
11
+ lib/bundler/man
12
+ pkg
13
+ rdoc
14
+ spec/reports
15
+ test/tmp
16
+ test/version_tmp
17
+ tmp
18
+ *.bundle
19
+ *.so
20
+ *.o
21
+ *.a
22
+ mkmf.log
data/.rspec ADDED
@@ -0,0 +1,2 @@
1
+ --format documentation
2
+ --color
@@ -0,0 +1,3 @@
1
+ language: ruby
2
+ rvm:
3
+ - 2.1.2
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in levenshtein_rb.gemspec
4
+ gemspec
@@ -0,0 +1,23 @@
1
+ Copyright (c) 2015 Robin Neumann
2
+
3
+
4
+ MIT License
5
+
6
+ Permission is hereby granted, free of charge, to any person obtaining
7
+ a copy of this software and associated documentation files (the
8
+ "Software"), to deal in the Software without restriction, including
9
+ without limitation the rights to use, copy, modify, merge, publish,
10
+ distribute, sublicense, and/or sell copies of the Software, and to
11
+ permit persons to whom the Software is furnished to do so, subject to
12
+ the following conditions:
13
+
14
+ The above copyright notice and this permission notice shall be
15
+ included in all copies or substantial portions of the Software.
16
+
17
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
18
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
19
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
20
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
21
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
22
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
23
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,104 @@
1
+ # levenshtein_rb
2
+
3
+ A plain Ruby implementation of the Levenshtein distance algorithm
4
+ that measures the similarity of two strings.
5
+
6
+ ## Disclaimer
7
+
8
+ This implementation is intended for educational use: Ruby's clean syntax makes the algorithm itself easy to study.
9
+ Use it in production with caution. To compare very large strings, you may prefer a Ruby gem that relies on native C extensions or similar, such as:
10
+
11
+ * [levenshtein-ffi](https://github.com/dbalatero/levenshtein-ffi)
12
+ * [damerau-levenshtein](https://github.com/GlobalNamesArchitecture/damerau-levenshtein)
13
+
14
+ levenshtein_rb was inspired by [this example implementation](http://rosettacode.org/wiki/Levenshtein_distance#Ruby).
15
+
16
+ In my informal (and therefore not really meaningful) benchmark, it took 1.28 seconds to compute the Levenshtein distance of two strings of length 1000 and about 30 seconds
17
+ for two strings of length 10000. In theory, the algorithm has time complexity *O(mn)*, where *m* and *n* are the lengths of the input strings.
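A rough way to repeat such a measurement is sketched below. It is only an illustration: `full_table_distance` is a hypothetical stand-in written for this example (not the gem's class), the string lengths are arbitrary, and the timings depend entirely on the machine and the Ruby version. Since the work grows with *m·n*, doubling both lengths should roughly quadruple the running time.

```ruby
require "benchmark"

# Stand-in for the gem's implementation (an assumption, not the gem's code):
# a plain full-table Levenshtein distance.
def full_table_distance(s, t)
  m, n = s.length, t.length
  # Row i starts with i (first column); the rest is filled in below.
  d = Array.new(m + 1) { |i| [i] + Array.new(n, 0) }
  (0..n).each { |j| d[0][j] = j } # first row

  (1..m).each do |i|
    (1..n).each do |j|
      d[i][j] = if s[i - 1] == t[j - 1]
                  d[i - 1][j - 1]
                else
                  [d[i - 1][j], d[i][j - 1], d[i - 1][j - 1]].min + 1
                end
    end
  end
  d[m][n]
end

# Time the computation for a few (arbitrarily chosen) string lengths.
timings = [100, 200, 400].map do |length|
  a = "a" * length
  b = "b" * length
  seconds = Benchmark.realtime { full_table_distance(a, b) }
  puts format("length %4d: %.4f s", length, seconds)
  [length, seconds]
end
```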
18
+
19
+ ## Installation
20
+
21
+ Add this line to your application's Gemfile:
22
+
23
+ gem 'levenshtein_rb'
24
+
25
+ And then execute:
26
+
27
+ $ bundle
28
+
29
+ Or install it yourself as:
30
+
31
+ $ gem install levenshtein_rb
32
+
33
+ ## Usage
34
+
35
+ LevenshteinRb::LevenshteinDistance.new('Tor', 'Tier').to_i # => 2
36
+
37
+ ## The Levenshtein algorithm
38
+
39
+ The mathematical formalism is well explained on [Wikipedia](https://en.wikipedia.org/wiki/Levenshtein_distance), so let's concentrate on an example here and walk through the algorithm step by step, using the pair of words shown in the tables below.
40
+ For a human being it is convenient to perform the
41
+ algorithm by "filling a table". The table of numbers is the
42
+ "recurrence matrix" in the Ruby code. In the mathematical literature, the recurrence matrix is usually denoted by *D*.
43
+
44
+ One starts in the upper left corner, which refers to the pair of words (ε, ε), where ε denotes the empty word. The corresponding matrix entry *D[0,0]* is set to zero, which means: zero changes are
45
+ necessary to turn the word *ε* into *ε*.
46
+
47
+ Let's now have a look at the first row:
48
+
49
+ ```
50
+ | ε T o r
51
+ -----------
52
+ ε | 0 1 2 3
53
+ ```
54
+
55
+ The semantics of the entries *D[0,1]*, *D[0,2]* and *D[0,3]* are as follows:
56
+ It takes one elementary change (namely an addition) to turn the word *ε* into *εT*, 2 additions to turn *ε* into *εTo*, and 3 additions to turn *ε* into *εTor*. The first column works the same way.
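As a minimal sketch of this first step, the snippet below builds the table with only its first row and first column filled in. The helper name `initial_table` is made up for this illustration and is not part of the gem; rows are assumed to stand for the prefixes of "Tier" and columns for the prefixes of "Tor", as in the tables of this README.

```ruby
# Hypothetical helper (not part of the gem): an (m+1) x (n+1) table in which
# only the first row and the first column are filled in.
def initial_table(m, n)
  table = Array.new(m + 1) { Array.new(n + 1) } # inner entries start as nil
  (0..m).each { |i| table[i][0] = i }           # first column: 0, 1, ..., m
  (0..n).each { |j| table[0][j] = j }           # first row:    0, 1, ..., n
  table
end

table = initial_table("Tier".length, "Tor".length)
table[0]           # => [0, 1, 2, 3]
table.map(&:first) # => [0, 1, 2, 3, 4]
```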
57
+
58
+ More interesting are the values (a.k.a. "costs") of the inner entries of the matrix.
59
+
60
+ Take, for example, the matrix entry *D[1,1]*, assuming that the first row and
61
+ the first column have already been calculated (which is easy, because the procedure is the same for any pair of words). So concentrate on the submatrix:
62
+
63
+ ```
64
+ | ε T
65
+ -------
66
+ ε | 0 1
67
+ T | 1 ?
68
+
69
+ This is D[1,1]
70
+ ```
71
+
72
+ *D[1,1]* corresponds to the question: How many steps does it take to turn *εT* into *εT*?
73
+ And the answer is easy: we've added the same character ("*T*") to the word *ε* in both dimensions, so no additional change is needed and *D[1,1] = D[0,0] = 0*.
74
+
75
+ Now that we have obtained *D[1,1]*, let's continue and fill the whole second row:
76
+
77
+ ```
78
+ | ε T o r
79
+ -----------
80
+ ε | 0 1 2 3
81
+ T | 1 0 ? ?
82
+
83
+ ```
84
+
85
+ How do we get *D[1,2]*? The "same character" logic does not apply here, because "T" and "o" differ. *D[1,2]* corresponds
86
+ to an addition of the character "o" (or a deletion, if you look at it the other way round), so it costs exactly one more than the entry we start from. Which entry is that? We choose "the best path", that is, the minimal value among the neighbors that have already been computed. In our case that is *D[1,1]*, and therefore *D[1,2] = D[1,1] + 1 = 1*.
87
+
88
+ This is the central trick: one always looks for the minimum value among the previously calculated neighbor entries and adds a cost (in our case every cost is *+1*). Only when the two current characters are equal is the diagonal neighbor taken over unchanged.
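The rule for a single inner entry can be sketched as follows. `cell_cost` is a hypothetical helper written for this explanation; it mirrors the idea described above but is not the gem's private method. The table is again assumed to have the prefixes of "Tier" as rows and the prefixes of "Tor" as columns.

```ruby
# Hypothetical helper (not part of the gem): cost of one inner cell, given
# that its upper, left and diagonal neighbors are already filled in.
# row_word labels the rows of the table, col_word labels the columns.
def cell_cost(table, row_word, col_word, i, j)
  # Same character on both sides: nothing to pay, take the diagonal value.
  return table[i - 1][j - 1] if row_word[i - 1] == col_word[j - 1]

  # Otherwise pay +1 on top of the cheapest already-computed neighbor.
  [table[i - 1][j], table[i][j - 1], table[i - 1][j - 1]].min + 1
end

# First two rows of the example table (rows: ε, T; columns: ε, T, o, r).
table = [
  [0, 1, 2, 3],
  [1, nil, nil, nil]
]

table[1][1] = cell_cost(table, "Tier", "Tor", 1, 1) # => 0
table[1][2] = cell_cost(table, "Tier", "Tor", 1, 2) # => 1
table[1][3] = cell_cost(table, "Tier", "Tor", 1, 3) # => 2
```

Filling the row from left to right matters here: each call relies on the entry to its left having been written already.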
89
+
90
+ Applying this strategy to every remaining entry, one obtains:
91
+
92
+ ```
93
+ | ε T o r
94
+ -----------
95
+ ε | 0 1 2 3
96
+ T | 1 0 1 2
97
+ i | 2 1 1 2
98
+ e | 3 2 2 2
99
+ r | 4 3 3 2
100
+
101
+ This last value is the Levenshtein distance
102
+ ```
103
+
104
+ The entry *D[m,n]* is called the Levenshtein distance of the two input words.
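To check the table above by machine, the whole procedure can be put together in a few lines. This is a sketch under the same assumptions as the walkthrough (rows for "Tier", columns for "Tor", every change costs 1); `levenshtein_table` is a name chosen for this example and is not the gem's API.

```ruby
# Hypothetical helper (not part of the gem): returns the complete recurrence
# table for two words, with row_word along the rows and col_word along the columns.
def levenshtein_table(row_word, col_word)
  m, n = row_word.length, col_word.length
  table = Array.new(m + 1) { Array.new(n + 1, 0) }
  (0..m).each { |i| table[i][0] = i } # first column
  (0..n).each { |j| table[0][j] = j } # first row

  (1..m).each do |i|
    (1..n).each do |j|
      table[i][j] =
        if row_word[i - 1] == col_word[j - 1]
          table[i - 1][j - 1] # same character: no extra cost
        else
          [table[i - 1][j], table[i][j - 1], table[i - 1][j - 1]].min + 1
        end
    end
  end
  table
end

table = levenshtein_table("Tier", "Tor")
table.each { |row| puts row.join(" ") }
# 0 1 2 3
# 1 0 1 2
# 2 1 1 2
# 3 2 2 2
# 4 3 3 2

table.last.last # => 2, the Levenshtein distance
```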
@@ -0,0 +1,7 @@
1
+ require "bundler/gem_tasks"
2
+ require "rspec/core/rake_task"
3
+
4
+ RSpec::Core::RakeTask.new(:spec)
5
+
6
+ task :default => :spec
7
+
@@ -0,0 +1,24 @@
1
+ # coding: utf-8
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'levenshtein_rb/version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = "levenshtein_rb"
8
+ spec.version = LevenshteinRb::VERSION
9
+ spec.authors = ["Robin Neumann\n"]
10
+ spec.email = ["robin.neumann@absolventa.de"]
11
+ spec.summary = %q{Implementation of Levenshtein algorithm to determine the similarity of two strings using only Ruby}
12
+ spec.description = %q{Implementation of Levenshtein algorithm to determine the similarity of two strings using only Ruby}
13
+ spec.homepage = ""
14
+ spec.license = "MIT"
15
+
16
+ spec.files = `git ls-files -z`.split("\x0")
17
+ spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
18
+ spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
19
+ spec.require_paths = ["lib"]
20
+
21
+ spec.add_development_dependency "bundler", "~> 1.6"
22
+ spec.add_development_dependency "rake"
23
+ spec.add_development_dependency "rspec"
24
+ end
@@ -0,0 +1,2 @@
1
+ require "levenshtein_rb/version"
2
+ require "levenshtein_rb/levenshtein_distance"
@@ -0,0 +1,70 @@
1
+ module LevenshteinRb
2
+ class LevenshteinDistance
3
+
4
+ class RecurrenceMatrix
5
+
6
+ attr_reader :store
7
+
8
+ def [](index)
9
+ store[index]
10
+ end
11
+
12
+ def []=(index, value)
13
+ store[index] = value
14
+ end
15
+
16
+ def initialize(m, n)
17
+ @store = Array.new(m+1) { Array.new(n+1) }
18
+
19
+ (0..m).each { |i| store[i][0] = i }
20
+ (0..n).each { |j| store[0][j] = j }
21
+ end
22
+
23
+ end
24
+
25
+ attr_reader :s, :t, :m, :n, :d
26
+
27
+ def initialize(a_string, another_string)
28
+ @s, @t = a_string, another_string
29
+ @m, @n = s.length.to_i, t.length.to_i
30
+ @d = RecurrenceMatrix.new(m, n) # d is the conventional name for the recurrence matrix in the literature
31
+ end
32
+
33
+ def to_i
34
+ return m if n == 0
35
+ return n if m == 0
36
+
37
+ (1..n).each do |j|
38
+ (1..m).each do |i|
39
+ d[i][j] = costs_for_step(i, j)
40
+ end
41
+ end
42
+
43
+ d[m][n]
44
+ end
45
+
46
+
47
+ private
48
+
49
+ def same_character_for_both_words_is_added?(i, j)
50
+ s[i-1] == t[j-1]
51
+ end
52
+
53
+ def obtain_minimal_value_from_neighbors(i, j)
54
+ [
55
+ d[i-1][j], # Look left in the matrix. This corresponds to a deletion, which costs +1 more than d[i-1][j]
56
+ d[i][j-1], # Look directly above the current entry in the matrix. This corresponds to an insertion, which costs +1 more than d[i][j-1]
57
+ d[i-1][j-1], # Substitute the character. This also costs +1 more than d[i-1][j-1]
58
+ ].min
59
+ end
60
+
61
+ def costs_for_step(i, j)
62
+ if same_character_for_both_words_is_added?(i, j)
63
+ d[i-1][j-1]
64
+ else
65
+ obtain_minimal_value_from_neighbors(i, j) + 1
66
+ end
67
+ end
68
+
69
+ end
70
+ end
@@ -0,0 +1,3 @@
1
+ module LevenshteinRb
2
+ VERSION = "0.0.1"
3
+ end
@@ -0,0 +1,27 @@
1
+ require 'spec_helper'
2
+
3
+ RSpec.describe LevenshteinRb::LevenshteinDistance do
4
+
5
+ describe '#to_i' do
6
+ it { expect(described_class.new('', '').to_i).to eql 0 }
7
+
8
+ context 'with insertion/deletion' do
9
+ it { expect(described_class.new('', 'a').to_i).to eql 1 }
10
+ it { expect(described_class.new('a', '').to_i).to eql 1 }
11
+
12
+ it { expect(described_class.new('bbb', 'bbba').to_i).to eql 1 }
13
+ it { expect(described_class.new('bbba', 'bbb').to_i).to eql 1 }
14
+ end
15
+
16
+ context 'with substitution' do
17
+ it { expect(described_class.new('a', 'b').to_i).to eql 1 }
18
+ it { expect(described_class.new('ba', 'bb').to_i).to eql 1 }
19
+ it { expect(described_class.new('ab', 'bb').to_i).to eql 1 }
20
+ end
21
+
22
+ context 'with case studies' do
23
+ it { expect(described_class.new('Tier', 'Tor').to_i).to eql 2 }
24
+ end
25
+
26
+ end
27
+ end
@@ -0,0 +1,7 @@
1
+ require 'spec_helper'
2
+
3
+ describe LevenshteinRb do
4
+ it 'has a version number' do
5
+ expect(LevenshteinRb::VERSION).not_to be nil
6
+ end
7
+ end
@@ -0,0 +1,2 @@
1
+ $LOAD_PATH.unshift File.expand_path('../../lib', __FILE__)
2
+ require 'levenshtein_rb'
metadata ADDED
@@ -0,0 +1,106 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: levenshtein_rb
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.0.1
5
+ platform: ruby
6
+ authors:
7
+ - |
8
+ Robin Neumann
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+ date: 2015-11-23 00:00:00.000000000 Z
13
+ dependencies:
14
+ - !ruby/object:Gem::Dependency
15
+ name: bundler
16
+ requirement: !ruby/object:Gem::Requirement
17
+ requirements:
18
+ - - "~>"
19
+ - !ruby/object:Gem::Version
20
+ version: '1.6'
21
+ type: :development
22
+ prerelease: false
23
+ version_requirements: !ruby/object:Gem::Requirement
24
+ requirements:
25
+ - - "~>"
26
+ - !ruby/object:Gem::Version
27
+ version: '1.6'
28
+ - !ruby/object:Gem::Dependency
29
+ name: rake
30
+ requirement: !ruby/object:Gem::Requirement
31
+ requirements:
32
+ - - ">="
33
+ - !ruby/object:Gem::Version
34
+ version: '0'
35
+ type: :development
36
+ prerelease: false
37
+ version_requirements: !ruby/object:Gem::Requirement
38
+ requirements:
39
+ - - ">="
40
+ - !ruby/object:Gem::Version
41
+ version: '0'
42
+ - !ruby/object:Gem::Dependency
43
+ name: rspec
44
+ requirement: !ruby/object:Gem::Requirement
45
+ requirements:
46
+ - - ">="
47
+ - !ruby/object:Gem::Version
48
+ version: '0'
49
+ type: :development
50
+ prerelease: false
51
+ version_requirements: !ruby/object:Gem::Requirement
52
+ requirements:
53
+ - - ">="
54
+ - !ruby/object:Gem::Version
55
+ version: '0'
56
+ description: Implementation of Levenshtein algorithm to determine the similarity of
57
+ two strings using only Ruby
58
+ email:
59
+ - robin.neumann@absolventa.de
60
+ executables: []
61
+ extensions: []
62
+ extra_rdoc_files: []
63
+ files:
64
+ - ".gitignore"
65
+ - ".rspec"
66
+ - ".travis.yml"
67
+ - Gemfile
68
+ - LICENSE.txt
69
+ - README.md
70
+ - Rakefile
71
+ - levenshtein_rb.gemspec
72
+ - lib/levenshtein_rb.rb
73
+ - lib/levenshtein_rb/levenshtein_distance.rb
74
+ - lib/levenshtein_rb/version.rb
75
+ - spec/levenshtein_distance_spec.rb
76
+ - spec/levenshtein_rb_spec.rb
77
+ - spec/spec_helper.rb
78
+ homepage: ''
79
+ licenses:
80
+ - MIT
81
+ metadata: {}
82
+ post_install_message:
83
+ rdoc_options: []
84
+ require_paths:
85
+ - lib
86
+ required_ruby_version: !ruby/object:Gem::Requirement
87
+ requirements:
88
+ - - ">="
89
+ - !ruby/object:Gem::Version
90
+ version: '0'
91
+ required_rubygems_version: !ruby/object:Gem::Requirement
92
+ requirements:
93
+ - - ">="
94
+ - !ruby/object:Gem::Version
95
+ version: '0'
96
+ requirements: []
97
+ rubyforge_project:
98
+ rubygems_version: 2.2.2
99
+ signing_key:
100
+ specification_version: 4
101
+ summary: Implementation of Levenshtein algorithm to determine the similarity of two
102
+ strings using only Ruby
103
+ test_files:
104
+ - spec/levenshtein_distance_spec.rb
105
+ - spec/levenshtein_rb_spec.rb
106
+ - spec/spec_helper.rb