levenshtein_rb 0.0.1
- checksums.yaml +7 -0
- data/.gitignore +22 -0
- data/.rspec +2 -0
- data/.travis.yml +3 -0
- data/Gemfile +4 -0
- data/LICENSE.txt +23 -0
- data/README.md +104 -0
- data/Rakefile +7 -0
- data/levenshtein_rb.gemspec +24 -0
- data/lib/levenshtein_rb.rb +2 -0
- data/lib/levenshtein_rb/levenshtein_distance.rb +70 -0
- data/lib/levenshtein_rb/version.rb +3 -0
- data/spec/levenshtein_distance_spec.rb +27 -0
- data/spec/levenshtein_rb_spec.rb +7 -0
- data/spec/spec_helper.rb +2 -0
- metadata +106 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
---
SHA1:
  metadata.gz: 61dd3ffea0bc6ffbaef4dd792cb165384101b4fd
  data.tar.gz: ee9770dc4d4cc7e1a3644ee06b3c4c92be8ce58e
SHA512:
  metadata.gz: 6395246d52816851679edc4ab168899641fcdc021ff7d9339f352be36fbab343c3462032ae882e645f810d2ee0d7e605461cef002f2f2241fc7647fd1e680762
  data.tar.gz: 316860dd0a260afdda38c6bdab205e11f00af3c0deffe11b167e50ef8599be99159b0fdae4d2982ef5c0d6ee59725d4fa06cd3f0d647c9d022a1a4700901c5f8
data/.gitignore
ADDED
@@ -0,0 +1,22 @@
*.gem
*.rbc
.bundle
.config
.yardoc
Gemfile.lock
InstalledFiles
_yardoc
coverage
doc/
lib/bundler/man
pkg
rdoc
spec/reports
test/tmp
test/version_tmp
tmp
*.bundle
*.so
*.o
*.a
mkmf.log
data/.rspec
ADDED
data/.travis.yml
ADDED
data/Gemfile
ADDED
data/LICENSE.txt
ADDED
@@ -0,0 +1,23 @@
Copyright (c) 2015 Robin Neumann


MIT License

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,104 @@
# levenshtein_rb

A plain Ruby implementation of the Levenshtein distance algorithm,
which measures the similarity of two strings.

## Disclaimer

This implementation is intended for educational use: Ruby's clean syntax makes it easy to study the algorithm itself. Use it in production with caution. To compare really large strings you may want to use a Ruby gem that relies on native C extensions or similar, such as

* [levenshtein-ffi](https://github.com/dbalatero/levenshtein-ffi)
* [damerau-levenshtein](https://github.com/GlobalNamesArchitecture/damerau-levenshtein)

levenshtein_rb was inspired by [this example implementation](http://rosettacode.org/wiki/Levenshtein_distance#Ruby).

In my heuristically flavoured (and therefore not really meaningful) benchmark, it took 1.28 seconds to compute the Levenshtein distance of strings of length 1000 and about 30 seconds to compute the Levenshtein distance
of strings of length 10000. Theoretically, the algorithm itself has complexity *O(mn)*, where *m* and *n* are the lengths of the input strings.
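
The quadratic growth can be made visible with a small timing sketch. The helper `naive_levenshtein` below is a hypothetical stand-in (not the gem's API) so that the snippet runs without the gem installed; the string lengths and the random seed are arbitrary assumptions. Doubling the length should roughly quadruple the runtime, although absolute numbers depend on the machine.

```ruby
require 'benchmark'

# Hypothetical stand-in for the gem's computation: fills the full
# (m+1) x (n+1) table, which is where the O(mn) cost comes from.
def naive_levenshtein(s, t)
  m, n = s.length, t.length
  d = Array.new(m + 1) { Array.new(n + 1, 0) }
  (0..m).each { |i| d[i][0] = i }
  (0..n).each { |j| d[0][j] = j }

  (1..m).each do |i|
    (1..n).each do |j|
      d[i][j] =
        if s[i - 1] == t[j - 1]
          d[i - 1][j - 1]
        else
          [d[i - 1][j], d[i][j - 1], d[i - 1][j - 1]].min + 1
        end
    end
  end
  d[m][n]
end

rng = Random.new(42)          # fixed seed, chosen arbitrarily
alphabet = ('a'..'z').to_a

timings = [100, 200, 400].map do |length|
  a = Array.new(length) { alphabet.sample(random: rng) }.join
  b = Array.new(length) { alphabet.sample(random: rng) }.join
  seconds = Benchmark.realtime { naive_levenshtein(a, b) }
  puts format('length %4d: %.4f s', length, seconds)
  [length, seconds]
end
```

Like the figures quoted above, such timings are only a rough indication.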

## Installation

Add this line to your application's Gemfile:

    gem 'levenshtein_rb'

And then execute:

    $ bundle

Or install it yourself as:

    $ gem install levenshtein_rb

## Usage

    LevenshteinRb::LevenshteinDistance.new('Tor', 'Tier').to_i # => 2

# The Levenshtein algorithm

The mathematical formalism is well explained on [Wikipedia](https://en.wikipedia.org/wiki/Levenshtein_distance), so let's concentrate on an example here and walk through the steps of the algorithm for the word pair ("Tor", "Tier") from the usage example.
For a human being it is convenient to perform the
algorithm by "filling a table". The table of numbers is the
"recurrence matrix" in the Ruby code. In the mathematical literature, the recurrence matrix is usually denoted by *D*.

One starts with the upper left corner, which refers to the pair of words (ε,ε), where ε denotes the empty word. The corresponding matrix entry *D[0,0]* gets the value zero, which means: zero changes are
necessary to turn the word *ε* into *ε*.

Let's now have a look at the first row:

```
  | ε T o r
-----------
ε | 0 1 2 3
```

The semantics of the entries *D[0,1]*, *D[0,2]* and *D[0,3]* are as follows:
It takes one elementary change (namely an addition) to turn the word *ε* into *εT*. It takes 2 additions to turn *ε* into *εTo* and 3 additions to turn *ε* into *εTor*. The whole first column works similarly.
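
As a minimal sketch of this initialization, assuming the same layout as the tables in this README (rows belong to "Tier", columns to "Tor"), the first row and first column can be filled before any character is compared. The variable names are illustrative and not taken from the gem.

```ruby
columns_word = 'Tor'   # letters along the top of the table
rows_word    = 'Tier'  # letters down the left side of the table

# One extra row and column for the empty word ε; inner entries stay nil for now.
table = Array.new(rows_word.length + 1) { Array.new(columns_word.length + 1) }

# First row: col additions turn ε into the first col letters of "Tor".
(0..columns_word.length).each { |col| table[0][col] = col }
# First column: row additions turn ε into the first row letters of "Tier".
(0..rows_word.length).each { |row| table[row][0] = row }

p table[0]             # => [0, 1, 2, 3]
p table.map(&:first)   # => [0, 1, 2, 3, 4]
```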

More interesting are the values (a.k.a. "costs") of the entries that correspond to inner elements of the matrix.

Take, for example, the matrix entry *D[1,1]*, assuming that the whole first row and
the whole first column have already been calculated (which is easy because the procedure is the same for any word-word combination). So concentrate on the submatrix:

```
  | ε T
-------
ε | 0 1
T | 1 ?
      ↑
This is D[1,1]
```

*D[1,1]* corresponds to the question: how many steps does it take to turn *εT* into *εT*?
The answer is easy: we've added the same character ("*T*") to the word *ε* in both dimensions, so no additional change is needed and *D[1,1] = D[0,0] = 0*.

Now that we have obtained *D[1,1]*, let's continue and fill the whole second row:

```
  | ε T o r
-----------
ε | 0 1 2 3
T | 1 0 ? ?
```

How do we get *D[1,2]*? The "same character" logic does not apply here, because "T" and "o" differ. *D[1,2]* corresponds
to an addition of the character "o" (or a deletion if you look at it the other way round). One "o" needs to be added, so the entry costs exactly one more than its starting point. But which starting point? We choose "the best path", that is, the minimal value among the neighbors that have already been computed. Here the neighbors are *D[0,1] = 1*, *D[0,2] = 2* and *D[1,1] = 0*, so the minimum is *D[1,1]* and therefore *D[1,2] = D[1,1] + 1 = 1*.

This is the central trick: whenever the two characters differ, one looks for the minimum value among the neighbor entries calculated before and adds a cost (in our case every cost is defined as *+1*).
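
The rule for a single inner entry can be sketched as a small function. `entry_cost` is a hypothetical helper introduced only for illustration; it takes the three already computed neighbors and a flag telling whether the two current characters are equal.

```ruby
# Hypothetical helper: value of one inner entry of the table.
# left, above and diagonal are the neighbors that were computed before.
def entry_cost(left, above, diagonal, same_character)
  # Equal characters need no additional change: reuse the diagonal value.
  return diagonal if same_character

  # Otherwise take the cheapest neighbor and pay one elementary change.
  [left, above, diagonal].min + 1
end

p entry_cost(1, 1, 0, true)    # D[1,1]: "T" vs "T" => 0
p entry_cost(0, 2, 1, false)   # D[1,2]: "T" vs "o" => 1
```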

If one follows this strategy successively, one obtains:

```
  | ε T o r
-----------
ε | 0 1 2 3
T | 1 0 1 2
i | 2 1 1 2
e | 3 2 2 2
r | 4 3 3 2
          ↑
This last value is the Levenshtein distance
```

The entry *D[m,n]* in the lower right corner, where *m* and *n* are the lengths of the two input words, is called the Levenshtein distance of the two words.
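
To reproduce the complete table, the following sketch fills it row by row in the same layout as the walkthrough (rows for "Tier", columns for "Tor"). `levenshtein_table` is a hypothetical helper written for this illustration, not the gem's `LevenshteinDistance` class, which keeps its matrix internal and only exposes `to_i`.

```ruby
# Hypothetical helper: returns the whole recurrence table as nested arrays.
def levenshtein_table(columns_word, rows_word)
  rows, cols = rows_word.length, columns_word.length
  table = Array.new(rows + 1) { Array.new(cols + 1, 0) }

  # Base cases: first row and first column.
  (0..cols).each { |c| table[0][c] = c }
  (0..rows).each { |r| table[r][0] = r }

  (1..rows).each do |r|
    (1..cols).each do |c|
      table[r][c] =
        if rows_word[r - 1] == columns_word[c - 1]
          table[r - 1][c - 1]   # same character: no additional cost
        else
          [table[r][c - 1], table[r - 1][c], table[r - 1][c - 1]].min + 1
        end
    end
  end
  table
end

table = levenshtein_table('Tor', 'Tier')
table.each { |row| puts row.join(' ') }
# 0 1 2 3
# 1 0 1 2
# 2 1 1 2
# 3 2 2 2
# 4 3 3 2
puts table.last.last   # => 2
```

Keeping the whole table is convenient for studying the algorithm; the distance itself is only the last entry.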
data/Rakefile
ADDED
data/levenshtein_rb.gemspec
ADDED
@@ -0,0 +1,24 @@
# coding: utf-8
lib = File.expand_path('../lib', __FILE__)
$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
require 'levenshtein_rb/version'

Gem::Specification.new do |spec|
  spec.name          = "levenshtein_rb"
  spec.version       = LevenshteinRb::VERSION
  spec.authors       = ["Robin Neumann\n"]
  spec.email         = ["robin.neumann@absolventa.de"]
  spec.summary       = %q{Implementation of Levenshtein algorithm to determine the similarity of two strings using only Ruby}
  spec.description   = %q{Implementation of Levenshtein algorithm to determine the similarity of two strings using only Ruby}
  spec.homepage      = ""
  spec.license       = "MIT"

  spec.files         = `git ls-files -z`.split("\x0")
  spec.executables   = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
  spec.test_files    = spec.files.grep(%r{^(test|spec|features)/})
  spec.require_paths = ["lib"]

  spec.add_development_dependency "bundler", "~> 1.6"
  spec.add_development_dependency "rake"
  spec.add_development_dependency "rspec"
end
data/lib/levenshtein_rb/levenshtein_distance.rb
ADDED
@@ -0,0 +1,70 @@
module LevenshteinRb
  class LevenshteinDistance

    # Table of edit distances between all prefixes of the two strings.
    class RecurrenceMatrix

      attr_reader :store

      def [](index)
        store[index]
      end

      def []=(index, value)
        store[index] = value
      end

      def initialize(m, n)
        @store = Array.new(m+1) { Array.new(n+1) }

        # Base cases: i changes separate a prefix of length i from the empty word.
        (0..m).each { |i| store[i][0] = i }
        (0..n).each { |j| store[0][j] = j }
      end

    end

    attr_reader :s, :t, :m, :n, :d

    def initialize(a_string, another_string)
      @s, @t = a_string, another_string
      @m, @n = s.length.to_i, t.length.to_i
      @d = RecurrenceMatrix.new(m, n) # d is simply the default notation for a recurrence matrix in literature
    end

    def to_i
      return m if n == 0
      return n if m == 0

      # Fill all inner entries; each one only depends on entries computed before.
      (1..n).each do |j|
        (1..m).each do |i|
          d[i][j] = costs_for_step(i, j)
        end
      end

      d[m][n]
    end


    private

    def same_character_for_both_words_is_added?(i, j)
      s[i-1] == t[j-1]
    end

    def obtain_minimal_value_from_neighbors(i, j)
      [
        d[i-1][j],   # Look left in the matrix (README layout). This would be a "deletion". It costs +1 more than d[i-1][j]
        d[i][j-1],   # Look directly above the current entry in the matrix. This would correspond to an insertion, which costs +1 more than d[i][j-1]
        d[i-1][j-1], # Completely substitute the character. This also costs +1 more than d[i-1][j-1]
      ].min
    end

    def costs_for_step(i, j)
      if same_character_for_both_words_is_added?(i, j)
        d[i-1][j-1]
      else
        obtain_minimal_value_from_neighbors(i, j) + 1
      end
    end

  end
end
data/spec/levenshtein_distance_spec.rb
ADDED
@@ -0,0 +1,27 @@
require 'spec_helper'

RSpec.describe LevenshteinRb::LevenshteinDistance do

  describe '#to_i' do
    it { expect(described_class.new('', '').to_i).to eql 0 }

    context 'with insertion/deletion' do
      it { expect(described_class.new('', 'a').to_i).to eql 1 }
      it { expect(described_class.new('a', '').to_i).to eql 1 }

      it { expect(described_class.new('bbb', 'bbba').to_i).to eql 1 }
      it { expect(described_class.new('bbba', 'bbb').to_i).to eql 1 }
    end

    context 'with substitution' do
      it { expect(described_class.new('a', 'b').to_i).to eql 1 }
      it { expect(described_class.new('ba', 'bb').to_i).to eql 1 }
      it { expect(described_class.new('ab', 'bb').to_i).to eql 1 }
    end

    context 'with case studies' do
      it { expect(described_class.new('Tier', 'Tor').to_i).to eql 2 }
    end

  end
end
data/spec/spec_helper.rb
ADDED
metadata
ADDED
@@ -0,0 +1,106 @@
--- !ruby/object:Gem::Specification
name: levenshtein_rb
version: !ruby/object:Gem::Version
  version: 0.0.1
platform: ruby
authors:
- |
  Robin Neumann
autorequire:
bindir: bin
cert_chain: []
date: 2015-11-23 00:00:00.000000000 Z
dependencies:
- !ruby/object:Gem::Dependency
  name: bundler
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: '1.6'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: '1.6'
- !ruby/object:Gem::Dependency
  name: rake
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
- !ruby/object:Gem::Dependency
  name: rspec
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
description: Implementation of Levenshtein algorithm to determine the similarity of
  two strings using only Ruby
email:
- robin.neumann@absolventa.de
executables: []
extensions: []
extra_rdoc_files: []
files:
- ".gitignore"
- ".rspec"
- ".travis.yml"
- Gemfile
- LICENSE.txt
- README.md
- Rakefile
- levenshtein_rb.gemspec
- lib/levenshtein_rb.rb
- lib/levenshtein_rb/levenshtein_distance.rb
- lib/levenshtein_rb/version.rb
- spec/levenshtein_distance_spec.rb
- spec/levenshtein_rb_spec.rb
- spec/spec_helper.rb
homepage: ''
licenses:
- MIT
metadata: {}
post_install_message:
rdoc_options: []
require_paths:
- lib
required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      version: '0'
required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      version: '0'
requirements: []
rubyforge_project:
rubygems_version: 2.2.2
signing_key:
specification_version: 4
summary: Implementation of Levenshtein algorithm to determine the similarity of two
  strings using only Ruby
test_files:
- spec/levenshtein_distance_spec.rb
- spec/levenshtein_rb_spec.rb
- spec/spec_helper.rb