levenshtein_rb 0.0.1
- checksums.yaml +7 -0
- data/.gitignore +22 -0
- data/.rspec +2 -0
- data/.travis.yml +3 -0
- data/Gemfile +4 -0
- data/LICENSE.txt +23 -0
- data/README.md +104 -0
- data/Rakefile +7 -0
- data/levenshtein_rb.gemspec +24 -0
- data/lib/levenshtein_rb.rb +2 -0
- data/lib/levenshtein_rb/levenshtein_distance.rb +70 -0
- data/lib/levenshtein_rb/version.rb +3 -0
- data/spec/levenshtein_distance_spec.rb +27 -0
- data/spec/levenshtein_rb_spec.rb +7 -0
- data/spec/spec_helper.rb +2 -0
- metadata +106 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
|
|
1
|
+
---
|
2
|
+
SHA1:
|
3
|
+
metadata.gz: 61dd3ffea0bc6ffbaef4dd792cb165384101b4fd
|
4
|
+
data.tar.gz: ee9770dc4d4cc7e1a3644ee06b3c4c92be8ce58e
|
5
|
+
SHA512:
|
6
|
+
metadata.gz: 6395246d52816851679edc4ab168899641fcdc021ff7d9339f352be36fbab343c3462032ae882e645f810d2ee0d7e605461cef002f2f2241fc7647fd1e680762
|
7
|
+
data.tar.gz: 316860dd0a260afdda38c6bdab205e11f00af3c0deffe11b167e50ef8599be99159b0fdae4d2982ef5c0d6ee59725d4fa06cd3f0d647c9d022a1a4700901c5f8
|
data/.gitignore
ADDED
@@ -0,0 +1,22 @@
|
|
1
|
+
*.gem
|
2
|
+
*.rbc
|
3
|
+
.bundle
|
4
|
+
.config
|
5
|
+
.yardoc
|
6
|
+
Gemfile.lock
|
7
|
+
InstalledFiles
|
8
|
+
_yardoc
|
9
|
+
coverage
|
10
|
+
doc/
|
11
|
+
lib/bundler/man
|
12
|
+
pkg
|
13
|
+
rdoc
|
14
|
+
spec/reports
|
15
|
+
test/tmp
|
16
|
+
test/version_tmp
|
17
|
+
tmp
|
18
|
+
*.bundle
|
19
|
+
*.so
|
20
|
+
*.o
|
21
|
+
*.a
|
22
|
+
mkmf.log
|
data/.rspec
ADDED
data/.travis.yml
ADDED
data/Gemfile
ADDED
data/LICENSE.txt
ADDED
@@ -0,0 +1,23 @@
|
|
1
|
+
Copyright (c) 2015 Robin Neumann
|
2
|
+
|
3
|
+
|
4
|
+
MIT License
|
5
|
+
|
6
|
+
Permission is hereby granted, free of charge, to any person obtaining
|
7
|
+
a copy of this software and associated documentation files (the
|
8
|
+
"Software"), to deal in the Software without restriction, including
|
9
|
+
without limitation the rights to use, copy, modify, merge, publish,
|
10
|
+
distribute, sublicense, and/or sell copies of the Software, and to
|
11
|
+
permit persons to whom the Software is furnished to do so, subject to
|
12
|
+
the following conditions:
|
13
|
+
|
14
|
+
The above copyright notice and this permission notice shall be
|
15
|
+
included in all copies or substantial portions of the Software.
|
16
|
+
|
17
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
|
18
|
+
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
|
19
|
+
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
|
20
|
+
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
|
21
|
+
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
|
22
|
+
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
|
23
|
+
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
data/README.md
ADDED
@@ -0,0 +1,104 @@
|
|
1
|
+
# levenshtein_rb
|
2
|
+
|
3
|
+
Plain Ruby implementation of Levenshtein distance algorithm
|
4
|
+
that measures the similarity of two strings.
|
5
|
+
|
6
|
+
## Disclaimer
|
7
|
+
|
8
|
+
This implementation is intended for educational use: studying the algorithm itself through the
|
9
|
+
clean syntactic structure of Ruby. Use it in production with caution. To compare really large strings, you may want to use a Ruby gem that relies on native C extensions or similar, such as
|
10
|
+
|
11
|
+
* [levenshtein-ffi](https://github.com/dbalatero/levenshtein-ffi)
|
12
|
+
* [damerau-levenshtein](https://github.com/GlobalNamesArchitecture/damerau-levenshtein)
|
13
|
+
|
14
|
+
levenshtein_rb was inspired by [this example implementation](http://rosettacode.org/wiki/Levenshtein_distance#Ruby).
|
15
|
+
|
16
|
+
In my informal (and therefore not really meaningful) benchmark, it took 1.28 seconds to compute the Levenshtein distance of strings of length 1000 and ~30 seconds to compute the Levenshtein distance
|
17
|
+
of strings of length 10000. In theory, the algorithm has time complexity *O(mn)*, where *m* and *n* are the lengths of the input strings.
|
18
|
+
|
19
|
+
## Installation
|
20
|
+
|
21
|
+
Add this line to your application's Gemfile:
|
22
|
+
|
23
|
+
gem 'levenshtein_rb'
|
24
|
+
|
25
|
+
And then execute:
|
26
|
+
|
27
|
+
$ bundle
|
28
|
+
|
29
|
+
Or install it yourself as:
|
30
|
+
|
31
|
+
$ gem install levenshtein_rb
|
32
|
+
|
33
|
+
## Usage
|
34
|
+
|
35
|
+
LevenshteinRb::LevenshteinDistance.new('Tor', 'Tier').to_i # => 2
|
36
|
+
|
37
|
+
# The Levenshtein algorithm
|
38
|
+
|
39
|
+
The mathematical formalism is well explained on [Wikipedia](https://en.wikipedia.org/wiki/Levenshtein_distance), so let's concentrate on an example here. Let's walk through the steps of the algorithm using the word pair from the tables below, "Tor" and "Tier".
|
40
|
+
For a human being it is convenient to perform the
|
41
|
+
algorithm by "filling a table". The table of numbers is the
|
42
|
+
"recurrence matrix" in the Ruby code. In the mathematical literature, the recurrence matrix usually is denoted by *D*.
|
43
|
+
|
44
|
+
One starts with the upper-left corner, which refers to the pair of words (ε,ε), where ε denotes the empty word. The corresponding matrix entry *D[0,0]* is set to zero, which means: zero changes are
|
45
|
+
necessary to turn the word *ε* into *ε*.
|
46
|
+
|
47
|
+
Let's now have a look at the first row:
|
48
|
+
|
49
|
+
```
|
50
|
+
| ε T o r
|
51
|
+
-----------
|
52
|
+
ε | 0 1 2 3
|
53
|
+
```
|
54
|
+
|
55
|
+
The semantics of the entries *D[0,1]*, *D[0,2]* and *D[0,3]* are as follows:
|
56
|
+
It takes one elementary change (namely an addition) to turn the word *ε* into *εT*. It takes 2 additions to turn *ε* into *εTo* and 3 additions to turn *ε* into *εTor*. The whole first column works similarly.
|
57
|
+
|
58
|
+
More interesting are the values (a.k.a. "costs") of entries that correspond to inner elements of the matrix.
|
59
|
+
|
60
|
+
Take, for example, the matrix entry *D[1,1]*, assuming that the first row and
|
61
|
+
the whole first column have already been calculated (which is easy because it's the same procedure for any word-word combination). So concentrate on the submatrix:
|
62
|
+
|
63
|
+
```
|
64
|
+
| ε T
|
65
|
+
-------
|
66
|
+
ε | 0 1
|
67
|
+
T | 1 ?
|
68
|
+
↑
|
69
|
+
This is D[1,1]
|
70
|
+
```
|
71
|
+
|
72
|
+
*D[1,1]* corresponds to the question: How many steps does it take to turn *εT* into *εT*?
|
73
|
+
And the answer is easy: We've added the same character ("*T*") to the word *ε* in both dimensions, so no additional change is needed and the cost stays at *D[0,0]*. So the answer is: zero.
|
74
|
+
|
75
|
+
Now that we have obtained *D[1,1]*, let's continue and fill the whole second row:
|
76
|
+
|
77
|
+
```
|
78
|
+
| ε T o r
|
79
|
+
-----------
|
80
|
+
ε | 0 1 2 3
|
81
|
+
T | 1 0 ? ?
|
82
|
+
|
83
|
+
```
|
84
|
+
|
85
|
+
How do we get *D[1,2]*? The "same character" logic does not apply here. *D[1,2]* corresponds
|
86
|
+
to an addition of the character "o" (or a deletion, seen the other way round). One "o" needs to be added, so it costs exactly one more than a value we already know. But starting from which one? We choose "the best path", that is, the minimal value among the neighbors that have already been computed. In our case that is *D[1,1]*. *D[1,2]* costs exactly one more, and therefore *D[1,2] = D[1,1] + 1 = 1*.
|
87
|
+
|
88
|
+
This is the central trick: whenever the two characters differ, one looks for the minimum value among the neighboring entries calculated before and adds a cost (in our case, every cost is *+1*).
|
89
|
+
|
90
|
+
Following this strategy entry by entry, one obtains:
|
91
|
+
|
92
|
+
```
|
93
|
+
| ε T o r
|
94
|
+
-----------
|
95
|
+
ε | 0 1 2 3
|
96
|
+
T | 1 0 1 2
|
97
|
+
i | 2 1 1 2
|
98
|
+
e | 3 2 2 2
|
99
|
+
r | 4 3 3 2
|
100
|
+
↑
|
101
|
+
This last value is the Levenshtein distance
|
102
|
+
```
|
103
|
+
|
104
|
+
The entry *D[m,n]* is called the Levenshtein distance of the two input words.
|
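The table-filling walkthrough in the README above can be reproduced with a few lines of plain Ruby. The following is a minimal sketch, not the gem's own code: the helper names `levenshtein_table` and `format_table` are hypothetical, and the layout (columns for "Tor", rows for "Tier") is chosen only to match the README tables.

```ruby
# Illustrative sketch (hypothetical helpers, not part of levenshtein_rb):
# fill the recurrence matrix for two words and print it like the README tables.
def levenshtein_table(top, side)
  # Rows follow `side`, columns follow `top`, as in the README layout.
  table = Array.new(side.length + 1) { Array.new(top.length + 1, 0) }
  (0..side.length).each { |row| table[row][0] = row } # first column
  (0..top.length).each { |col| table[0][col] = col }  # first row

  (1..side.length).each do |row|
    (1..top.length).each do |col|
      table[row][col] =
        if side[row - 1] == top[col - 1]
          # Same character added to both words: no extra cost.
          table[row - 1][col - 1]
        else
          # Cheapest already-computed neighbor plus one elementary change.
          [table[row - 1][col], table[row][col - 1], table[row - 1][col - 1]].min + 1
        end
    end
  end
  table
end

def format_table(top, side, table)
  header = "  | " + (["ε"] + top.chars).join(" ")
  lines = table.each_with_index.map do |row, index|
    label = index.zero? ? "ε" : side[index - 1]
    "#{label} | #{row.join(' ')}"
  end
  ([header, "-" * header.length] + lines).join("\n")
end

table = levenshtein_table("Tor", "Tier")
puts format_table("Tor", "Tier", table)
```

The bottom-right entry, `table.last.last`, is the Levenshtein distance of the two words; for this pair it is 2, as in the README's final table.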
data/Rakefile
ADDED
data/levenshtein_rb.gemspec
ADDED
@@ -0,0 +1,24 @@
|
|
1
|
+
# coding: utf-8
|
2
|
+
lib = File.expand_path('../lib', __FILE__)
|
3
|
+
$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
|
4
|
+
require 'levenshtein_rb/version'
|
5
|
+
|
6
|
+
Gem::Specification.new do |spec|
|
7
|
+
spec.name = "levenshtein_rb"
|
8
|
+
spec.version = LevenshteinRb::VERSION
|
9
|
+
spec.authors = ["Robin Neumann\n"]
|
10
|
+
spec.email = ["robin.neumann@absolventa.de"]
|
11
|
+
spec.summary = %q{Implementation of Levenshtein algorithm to determine the similarity of two strings using only Ruby}
|
12
|
+
spec.description = %q{Implementation of Levenshtein algorithm to determine the similarity of two strings using only Ruby}
|
13
|
+
spec.homepage = ""
|
14
|
+
spec.license = "MIT"
|
15
|
+
|
16
|
+
spec.files = `git ls-files -z`.split("\x0")
|
17
|
+
spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
|
18
|
+
spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
|
19
|
+
spec.require_paths = ["lib"]
|
20
|
+
|
21
|
+
spec.add_development_dependency "bundler", "~> 1.6"
|
22
|
+
spec.add_development_dependency "rake"
|
23
|
+
spec.add_development_dependency "rspec"
|
24
|
+
end
|
data/lib/levenshtein_rb.rb
ADDED
data/lib/levenshtein_rb/levenshtein_distance.rb
ADDED
@@ -0,0 +1,70 @@
|
|
1
|
+
module LevenshteinRb
|
2
|
+
class LevenshteinDistance
|
3
|
+
|
4
|
+
class RecurrenceMatrix
|
5
|
+
|
6
|
+
attr_reader :store
|
7
|
+
|
8
|
+
def [](index)
|
9
|
+
store[index]
|
10
|
+
end
|
11
|
+
|
12
|
+
def []=(index, value)
|
13
|
+
store[index] = value
|
14
|
+
end
|
15
|
+
|
16
|
+
def initialize(m, n)
|
17
|
+
@store = Array.new(m+1) { Array.new(n+1) }
|
18
|
+
|
19
|
+
(0..m).each { |i| store[i][0] = i }
|
20
|
+
(0..n).each { |j| store[0][j] = j }
|
21
|
+
end
|
22
|
+
|
23
|
+
end
|
24
|
+
|
25
|
+
attr_reader :s, :t, :m, :n, :d
|
26
|
+
|
27
|
+
def initialize(a_string, another_string)
|
28
|
+
@s, @t = a_string, another_string
|
29
|
+
@m, @n = s.length.to_i, t.length.to_i
|
30
|
+
@d = RecurrenceMatrix.new(m, n) # d is simply the default notation for a recurrence matrix in literature
|
31
|
+
end
|
32
|
+
|
33
|
+
def to_i
|
34
|
+
return m if n == 0
|
35
|
+
return n if m == 0
|
36
|
+
|
37
|
+
(1..n).each do |j|
|
38
|
+
(1..m).each do |i|
|
39
|
+
d[i][j] = costs_for_step(i, j)
|
40
|
+
end
|
41
|
+
end
|
42
|
+
|
43
|
+
d[m][n]
|
44
|
+
end
|
45
|
+
|
46
|
+
|
47
|
+
private
|
48
|
+
|
49
|
+
def same_character_for_both_words_is_added?(i, j)
|
50
|
+
s[i-1] == t[j-1]
|
51
|
+
end
|
52
|
+
|
53
|
+
def obtain_minimal_value_from_neighbors(i, j)
|
54
|
+
[
|
55
|
+
d[i-1][j], # Look left in the matrix. This would be a "deletion". It costs +1 more than d[i-1][j]
|
56
|
+
d[i][j-1], # Look directly above the current entry in matrix. This would correspond to an insertion which costs +1 additionally
|
57
|
+
d[i-1][j-1], # Completely substitute the character, this also costs +1 more than d[i-1][j-1]
|
58
|
+
].min
|
59
|
+
end
|
60
|
+
|
61
|
+
def costs_for_step(i, j)
|
62
|
+
if same_character_for_both_words_is_added?(i, j)
|
63
|
+
d[i-1][j-1]
|
64
|
+
else
|
65
|
+
obtain_minimal_value_from_neighbors(i, j) + 1
|
66
|
+
end
|
67
|
+
end
|
68
|
+
|
69
|
+
end
|
70
|
+
end
|
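The class above keeps the full (m+1) x (n+1) matrix in memory, which is convenient for studying the algorithm but grows quickly for long inputs. As a hedged illustration only (a different technique from the one the gem ships), the same recurrence can be evaluated while keeping just two rows. The method name `levenshtein_two_rows` is hypothetical.

```ruby
# Hypothetical helper, not part of levenshtein_rb: same recurrence as
# costs_for_step above, but only the previous and the current row are stored.
def levenshtein_two_rows(s, t)
  return t.length if s.empty?
  return s.length if t.empty?

  # Distances from the empty prefix of s to every prefix of t.
  previous = (0..t.length).to_a

  s.each_char.with_index(1) do |s_char, i|
    current = [i] # turning the first i characters of s into the empty word
    t.each_char.with_index(1) do |t_char, j|
      cost =
        if s_char == t_char
          previous[j - 1] # same character on both sides: no extra cost
        else
          [previous[j], current[j - 1], previous[j - 1]].min + 1
        end
      current << cost
    end
    previous = current
  end

  previous.last
end

puts levenshtein_two_rows("Tor", "Tier") # prints 2
```

This trades the inspectable matrix for O(n) extra memory; the running time is still O(mn).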
data/lib/levenshtein_rb/version.rb
ADDED
data/spec/levenshtein_distance_spec.rb
ADDED
@@ -0,0 +1,27 @@
|
|
1
|
+
require 'spec_helper'
|
2
|
+
|
3
|
+
RSpec.describe LevenshteinRb::LevenshteinDistance do
|
4
|
+
|
5
|
+
describe '#to_i' do
|
6
|
+
it { expect(described_class.new('', '').to_i).to eql 0 }
|
7
|
+
|
8
|
+
context 'with insertion/deletion' do
|
9
|
+
it { expect(described_class.new('', 'a').to_i).to eql 1 }
|
10
|
+
it { expect(described_class.new('a', '').to_i).to eql 1 }
|
11
|
+
|
12
|
+
it { expect(described_class.new('bbb', 'bbba').to_i).to eql 1 }
|
13
|
+
it { expect(described_class.new('bbba', 'bbb').to_i).to eql 1 }
|
14
|
+
end
|
15
|
+
|
16
|
+
context 'with substitution' do
|
17
|
+
it { expect(described_class.new('a', 'b').to_i).to eql 1 }
|
18
|
+
it { expect(described_class.new('ba', 'bb').to_i).to eql 1 }
|
19
|
+
it { expect(described_class.new('ab', 'bb').to_i).to eql 1 }
|
20
|
+
end
|
21
|
+
|
22
|
+
context 'with case studies' do
|
23
|
+
it { expect(described_class.new('Tier', 'Tor').to_i).to eql 2 }
|
24
|
+
end
|
25
|
+
|
26
|
+
end
|
27
|
+
end
|
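The spec above checks individual insertions, deletions and substitutions. A complementary idea, sketched here under assumptions, is to check general properties that any Levenshtein distance should satisfy: identity, symmetry, and simple bounds. So that the sketch runs without the gem installed, it uses a hypothetical stand-in method `distance` based on the plain recursive definition; in a real spec one would presumably call `LevenshteinRb::LevenshteinDistance.new(a, b).to_i` instead.

```ruby
# Stand-in for the gem's distance (hypothetical, memoized recursive definition).
def distance(a, b, memo = {})
  return b.length if a.empty?
  return a.length if b.empty?

  memo[[a, b]] ||=
    if a[-1] == b[-1]
      distance(a[0...-1], b[0...-1], memo)
    else
      1 + [
        distance(a[0...-1], b, memo),        # deletion
        distance(a, b[0...-1], memo),        # insertion
        distance(a[0...-1], b[0...-1], memo) # substitution
      ].min
    end
end

words = ["", "a", "Tor", "Tier", "bbb", "bbba"]

words.product(words).each do |a, b|
  d = distance(a, b)
  lower = (a.length - b.length).abs
  upper = [a.length, b.length].max

  raise "identity failed for #{a.inspect}" if a == b && d != 0
  raise "symmetry failed for #{a.inspect}, #{b.inspect}" unless d == distance(b, a)
  raise "bounds failed for #{a.inspect}, #{b.inspect}" unless d.between?(lower, upper)
end

puts "all properties hold"
```

The word list reuses inputs from the spec above; any other sample words would serve equally well.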
data/spec/levenshtein_rb_spec.rb
ADDED
data/spec/spec_helper.rb
ADDED
metadata
ADDED
@@ -0,0 +1,106 @@
|
|
1
|
+
--- !ruby/object:Gem::Specification
|
2
|
+
name: levenshtein_rb
|
3
|
+
version: !ruby/object:Gem::Version
|
4
|
+
version: 0.0.1
|
5
|
+
platform: ruby
|
6
|
+
authors:
|
7
|
+
- |
|
8
|
+
Robin Neumann
|
9
|
+
autorequire:
|
10
|
+
bindir: bin
|
11
|
+
cert_chain: []
|
12
|
+
date: 2015-11-23 00:00:00.000000000 Z
|
13
|
+
dependencies:
|
14
|
+
- !ruby/object:Gem::Dependency
|
15
|
+
name: bundler
|
16
|
+
requirement: !ruby/object:Gem::Requirement
|
17
|
+
requirements:
|
18
|
+
- - "~>"
|
19
|
+
- !ruby/object:Gem::Version
|
20
|
+
version: '1.6'
|
21
|
+
type: :development
|
22
|
+
prerelease: false
|
23
|
+
version_requirements: !ruby/object:Gem::Requirement
|
24
|
+
requirements:
|
25
|
+
- - "~>"
|
26
|
+
- !ruby/object:Gem::Version
|
27
|
+
version: '1.6'
|
28
|
+
- !ruby/object:Gem::Dependency
|
29
|
+
name: rake
|
30
|
+
requirement: !ruby/object:Gem::Requirement
|
31
|
+
requirements:
|
32
|
+
- - ">="
|
33
|
+
- !ruby/object:Gem::Version
|
34
|
+
version: '0'
|
35
|
+
type: :development
|
36
|
+
prerelease: false
|
37
|
+
version_requirements: !ruby/object:Gem::Requirement
|
38
|
+
requirements:
|
39
|
+
- - ">="
|
40
|
+
- !ruby/object:Gem::Version
|
41
|
+
version: '0'
|
42
|
+
- !ruby/object:Gem::Dependency
|
43
|
+
name: rspec
|
44
|
+
requirement: !ruby/object:Gem::Requirement
|
45
|
+
requirements:
|
46
|
+
- - ">="
|
47
|
+
- !ruby/object:Gem::Version
|
48
|
+
version: '0'
|
49
|
+
type: :development
|
50
|
+
prerelease: false
|
51
|
+
version_requirements: !ruby/object:Gem::Requirement
|
52
|
+
requirements:
|
53
|
+
- - ">="
|
54
|
+
- !ruby/object:Gem::Version
|
55
|
+
version: '0'
|
56
|
+
description: Implementation of Levenshtein algorithm to determine the similarity of
|
57
|
+
two strings using only Ruby
|
58
|
+
email:
|
59
|
+
- robin.neumann@absolventa.de
|
60
|
+
executables: []
|
61
|
+
extensions: []
|
62
|
+
extra_rdoc_files: []
|
63
|
+
files:
|
64
|
+
- ".gitignore"
|
65
|
+
- ".rspec"
|
66
|
+
- ".travis.yml"
|
67
|
+
- Gemfile
|
68
|
+
- LICENSE.txt
|
69
|
+
- README.md
|
70
|
+
- Rakefile
|
71
|
+
- levenshtein_rb.gemspec
|
72
|
+
- lib/levenshtein_rb.rb
|
73
|
+
- lib/levenshtein_rb/levenshtein_distance.rb
|
74
|
+
- lib/levenshtein_rb/version.rb
|
75
|
+
- spec/levenshtein_distance_spec.rb
|
76
|
+
- spec/levenshtein_rb_spec.rb
|
77
|
+
- spec/spec_helper.rb
|
78
|
+
homepage: ''
|
79
|
+
licenses:
|
80
|
+
- MIT
|
81
|
+
metadata: {}
|
82
|
+
post_install_message:
|
83
|
+
rdoc_options: []
|
84
|
+
require_paths:
|
85
|
+
- lib
|
86
|
+
required_ruby_version: !ruby/object:Gem::Requirement
|
87
|
+
requirements:
|
88
|
+
- - ">="
|
89
|
+
- !ruby/object:Gem::Version
|
90
|
+
version: '0'
|
91
|
+
required_rubygems_version: !ruby/object:Gem::Requirement
|
92
|
+
requirements:
|
93
|
+
- - ">="
|
94
|
+
- !ruby/object:Gem::Version
|
95
|
+
version: '0'
|
96
|
+
requirements: []
|
97
|
+
rubyforge_project:
|
98
|
+
rubygems_version: 2.2.2
|
99
|
+
signing_key:
|
100
|
+
specification_version: 4
|
101
|
+
summary: Implementation of Levenshtein algorithm to determine the similarity of two
|
102
|
+
strings using only Ruby
|
103
|
+
test_files:
|
104
|
+
- spec/levenshtein_distance_spec.rb
|
105
|
+
- spec/levenshtein_rb_spec.rb
|
106
|
+
- spec/spec_helper.rb
|