levenshtein_rb 0.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 61dd3ffea0bc6ffbaef4dd792cb165384101b4fd
4
+ data.tar.gz: ee9770dc4d4cc7e1a3644ee06b3c4c92be8ce58e
5
+ SHA512:
6
+ metadata.gz: 6395246d52816851679edc4ab168899641fcdc021ff7d9339f352be36fbab343c3462032ae882e645f810d2ee0d7e605461cef002f2f2241fc7647fd1e680762
7
+ data.tar.gz: 316860dd0a260afdda38c6bdab205e11f00af3c0deffe11b167e50ef8599be99159b0fdae4d2982ef5c0d6ee59725d4fa06cd3f0d647c9d022a1a4700901c5f8
@@ -0,0 +1,22 @@
1
+ *.gem
2
+ *.rbc
3
+ .bundle
4
+ .config
5
+ .yardoc
6
+ Gemfile.lock
7
+ InstalledFiles
8
+ _yardoc
9
+ coverage
10
+ doc/
11
+ lib/bundler/man
12
+ pkg
13
+ rdoc
14
+ spec/reports
15
+ test/tmp
16
+ test/version_tmp
17
+ tmp
18
+ *.bundle
19
+ *.so
20
+ *.o
21
+ *.a
22
+ mkmf.log
data/.rspec ADDED
@@ -0,0 +1,2 @@
1
+ --format documentation
2
+ --color
@@ -0,0 +1,3 @@
1
+ language: ruby
2
+ rvm:
3
+ - 2.1.2
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in levenshtein_rb.gemspec
4
+ gemspec
@@ -0,0 +1,23 @@
1
+ Copyright (c) 2015 Robin Neumann
2
+
3
+
4
+ MIT License
5
+
6
+ Permission is hereby granted, free of charge, to any person obtaining
7
+ a copy of this software and associated documentation files (the
8
+ "Software"), to deal in the Software without restriction, including
9
+ without limitation the rights to use, copy, modify, merge, publish,
10
+ distribute, sublicense, and/or sell copies of the Software, and to
11
+ permit persons to whom the Software is furnished to do so, subject to
12
+ the following conditions:
13
+
14
+ The above copyright notice and this permission notice shall be
15
+ included in all copies or substantial portions of the Software.
16
+
17
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
18
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
19
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
20
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
21
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
22
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
23
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,104 @@
1
+ # levenshtein_rb
2
+
3
+ Plain Ruby implementation of the Levenshtein distance algorithm
4
+ that measures the similarity of two strings.
5
+
6
+ ## Disclaimer
7
+
8
+ This implementation is intended for educational use: it lets you study the algorithm itself through the
9
+ clean syntax of Ruby. Use it in production with caution. To compare very large strings, you may want to use a Ruby gem that relies on native C extensions or similar, such as
10
+
11
+ * [levenshtein-ffi](https://github.com/dbalatero/levenshtein-ffi)
12
+ * [damerau-levenshtein](https://github.com/GlobalNamesArchitecture/damerau-levenshtein)
13
+
14
+ levenshtein_rb was inspired by [this example implementation](http://rosettacode.org/wiki/Levenshtein_distance#Ruby).
15
+
16
+ During my heuristically flavoured (and therefore not really meaningful) benchmark, it took 1.28 seconds to compute the Levenshtein distance of strings of length 1000 and about 30 seconds to compute the Levenshtein distance
17
+ of strings of length 10000. In theory, the algorithm itself has complexity *O(mn)*, where *m* and *n* are the lengths of the input strings.
18
+
19
+ ## Installation
20
+
21
+ Add this line to your application's Gemfile:
22
+
23
+     gem 'levenshtein_rb'
24
+
25
+ And then execute:
26
+
27
+     $ bundle
28
+
29
+ Or install it yourself as:
30
+
31
+     $ gem install levenshtein_rb
32
+
33
+ ## Usage
34
+
35
+     LevenshteinRb::LevenshteinDistance.new('Tor', 'Tier').to_i # => 2
36
+
37
+ # The Levenshtein algorithm
38
+
39
+ The mathematical formalism is well explained on [Wikipedia](https://en.wikipedia.org/wiki/Levenshtein_distance), so let's concentrate on an example here and walk through the steps of the algorithm for the word pair *Tor* and *Tier*.
40
+ For a human being it is convenient to perform the
41
+ algorithm by "filling a table". The table of numbers is the
42
+ "recurrence matrix" in the Ruby code. In the mathematical literature, the recurrence matrix usually is denoted by *D*.
43
+
44
+ One starts with the upper left corner, which refers to the pair of words (ε,ε), where *ε* denotes the empty word. The corresponding matrix entry *D[0,0]* is set to zero, which means: zero changes are
45
+ necessary to turn the word *ε* into *ε*.
46
+
47
+ Let's now have a look at the first row:
48
+
49
+ ```
50
+   | ε T o r
51
+ -----------
52
+ ε | 0 1 2 3
53
+ ```
54
+
55
+ The semantics of the entries *D[0,1]*, *D[0,2]* and *D[0,3]* are as follows:
56
+ It takes one elementary change (namely an addition) to turn the word *ε* into *εT*. It takes 2 additions to turn *ε* into *εTo* and 3 additions to turn *ε* into *εTor*. The whole first column works the same way.
57
+
58
+ More interesting are the values (a.k.a. "costs") of the entries in the interior of the matrix.
59
+
60
+ Take, for example, the matrix entry *D[1,1]*, assuming that the first row and
61
+ the first column have already been calculated (which is easy because the procedure is the same for any word-word combination). So concentrate on the submatrix:
62
+
63
+ ```
64
+   | ε T
65
+ -------
66
+ ε | 0 1
67
+ T | 1 ?
68
+
69
+ This is D[1,1]
70
+ ```
71
+
72
+ *D[1,1]* corresponds to the question: How many steps does it take to turn *εT* into *εT*?
73
+ And the answer is easy: we've added the same character ("*T*") to the word *ε* in both dimensions, so no further change is needed and the value is carried over from *D[0,0]*. So the answer is: zero.
74
+
75
+ Now that we have obtained *D[1,1]*, let's continue and fill the whole second row:
76
+
77
+ ```
78
+   | ε T o r
79
+ -----------
80
+ ε | 0 1 2 3
81
+ T | 1 0 ? ?
82
+
83
+ ```
84
+
85
+ How do we get *D[1,2]*? The "same character logic" does not apply here, because "T" and "o" differ. *D[1,2]* corresponds
86
+ to an addition of the character "o" (or a deletion if you look at it the other way round). One "o" needs to be added, so the entry costs exactly one more than its starting point. But which starting point? We choose "the best path", that is, the minimal value among the neighbors that have already been computed. In our case that is *D[1,1]*. *D[1,2]* costs exactly 1 more, and therefore *D[1,2] = D[1,1] + 1 = 1*.
87
+
88
+ This is the central trick: one always looks for the minimum value among the neighboring entries calculated before (left, above, and diagonally above-left) and adds a cost (in our case every step costs *+1*).
89
+
90
+ If one applies this strategy entry by entry, one obtains:
91
+
92
+ ```
93
+   | ε T o r
94
+ -----------
95
+ ε | 0 1 2 3
96
+ T | 1 0 1 2
97
+ i | 2 1 1 2
98
+ e | 3 2 2 2
99
+ r | 4 3 3 2
100
+
101
+ This last value is the Levenshtein distance
102
+ ```
103
+
104
+ The entry *D[m,n]* is called the Levenshtein distance of the two input words.
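The table-filling walkthrough in the README can be reproduced with a few lines of plain Ruby. The following is only a minimal sketch: `levenshtein_table` is a hypothetical helper name that is not part of the gem, and it assumes the same rule the README describes (copy the diagonal value when the characters match, otherwise take the minimum neighbor plus 1). Rows correspond to the first word, columns to the second.

```ruby
# Hypothetical helper (not part of levenshtein_rb): builds the full
# recurrence matrix D, with rows for `s` and columns for `t`.
def levenshtein_table(s, t)
  m = s.length
  n = t.length
  d = Array.new(m + 1) { Array.new(n + 1, 0) }

  (0..m).each { |i| d[i][0] = i } # first column: i deletions
  (0..n).each { |j| d[0][j] = j } # first row: j additions

  (1..m).each do |i|
    (1..n).each do |j|
      d[i][j] =
        if s[i - 1] == t[j - 1]
          # Same character on both sides: carry over the diagonal value.
          d[i - 1][j - 1]
        else
          # Cheapest already-computed neighbor plus one elementary change.
          [d[i - 1][j], d[i][j - 1], d[i - 1][j - 1]].min + 1
        end
    end
  end

  d
end

table = levenshtein_table('Tier', 'Tor')
table.each { |row| puts row.join(' ') }
puts table.last.last # bottom-right entry, the Levenshtein distance: 2
```

Printing the rows gives the same numbers as the final table in the README (`0 1 2 3` down to `4 3 3 2`), which makes it easy to check each step of the explanation by hand.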
@@ -0,0 +1,7 @@
1
+ require "bundler/gem_tasks"
2
+ require "rspec/core/rake_task"
3
+
4
+ RSpec::Core::RakeTask.new(:spec)
5
+
6
+ task :default => :spec
7
+
@@ -0,0 +1,24 @@
1
+ # coding: utf-8
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'levenshtein_rb/version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = "levenshtein_rb"
8
+ spec.version = LevenshteinRb::VERSION
9
+ spec.authors = ["Robin Neumann\n"]
10
+ spec.email = ["robin.neumann@absolventa.de"]
11
+ spec.summary = %q{Implementation of Levenshtein algorithm to determine the similarity of two strings using only Ruby}
12
+ spec.description = %q{Implementation of Levenshtein algorithm to determine the similarity of two strings using only Ruby}
13
+ spec.homepage = ""
14
+ spec.license = "MIT"
15
+
16
+ spec.files = `git ls-files -z`.split("\x0")
17
+ spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
18
+ spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
19
+ spec.require_paths = ["lib"]
20
+
21
+ spec.add_development_dependency "bundler", "~> 1.6"
22
+ spec.add_development_dependency "rake"
23
+ spec.add_development_dependency "rspec"
24
+ end
@@ -0,0 +1,2 @@
1
+ require "levenshtein_rb/version"
2
+ require "levenshtein_rb/levenshtein_distance"
@@ -0,0 +1,70 @@
1
+ module LevenshteinRb
2
+ class LevenshteinDistance
3
+
4
+ class RecurrenceMatrix
5
+
6
+ attr_reader :store
7
+
8
+ def [](index)
9
+ store[index]
10
+ end
11
+
12
+ def []=(index, value)
13
+ store[index] = value
14
+ end
15
+
16
+ def initialize(m, n)
17
+ @store = Array.new(m+1) { Array.new(n+1) }
18
+
19
+ (0..m).each { |i| store[i][0] = i }
20
+ (0..n).each { |j| store[0][j] = j }
21
+ end
22
+
23
+ end
24
+
25
+ attr_reader :s, :t, :m, :n, :d
26
+
27
+ def initialize(a_string, another_string)
28
+ @s, @t = a_string, another_string
29
+ @m, @n = s.length.to_i, t.length.to_i
30
+ @d = RecurrenceMatrix.new(m, n) # d is simply the default notation for a recurrence matrix in literature
31
+ end
32
+
33
+ def to_i
34
+ return m if n == 0
35
+ return n if m == 0
36
+
37
+ (1..n).each do |j|
38
+ (1..m).each do |i|
39
+ d[i][j] = costs_for_step(i, j)
40
+ end
41
+ end
42
+
43
+ d[m][n]
44
+ end
45
+
46
+
47
+ private
48
+
49
+ def same_character_for_both_words_is_added?(i, j)
50
+ s[i-1] == t[j-1]
51
+ end
52
+
53
+ def obtain_minimal_value_from_neighbors(i, j)
54
+ [
55
+ d[i-1][j], # Look left in the matrix. This would be a "deletion". It costs +1 more than d[i-1][j]
56
+ d[i][j-1], # Look directly above the current entry in the matrix. This corresponds to an insertion, which costs +1 more than d[i][j-1]
57
+ d[i-1][j-1], # Substitute the character. This also costs +1 more than d[i-1][j-1]
58
+ ].min
59
+ end
60
+
61
+ def costs_for_step(i, j)
62
+ if same_character_for_both_words_is_added?(i, j)
63
+ d[i-1][j-1]
64
+ else
65
+ obtain_minimal_value_from_neighbors(i, j) + 1
66
+ end
67
+ end
68
+
69
+ end
70
+ end
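The class above keeps the whole (m+1) x (n+1) matrix in memory, which is what makes it easy to study. As a point of comparison only, here is a sketch of a different, commonly used variant that keeps just two rows at a time. It is not what the gem does; the method name `levenshtein_two_rows` is hypothetical, and the cost rule is assumed to be the same as in `costs_for_step`.

```ruby
# Hypothetical two-row variant (not part of levenshtein_rb).
# Same recurrence as the full matrix, but only the previous and the
# current row are kept, so memory grows with t.length only.
def levenshtein_two_rows(s, t)
  previous = (0..t.length).to_a # row for the empty prefix of s

  s.each_char.with_index(1) do |s_char, i|
    current = [i] # first column: i deletions

    t.each_char.with_index(1) do |t_char, j|
      cost =
        if s_char == t_char
          previous[j - 1] # same character: carry over the diagonal
        else
          [previous[j], current[j - 1], previous[j - 1]].min + 1
        end
      current << cost
    end

    previous = current
  end

  previous.last
end

puts levenshtein_two_rows('Tor', 'Tier') # 2, matching the README usage example
```

The running time is still proportional to m times n; only the memory footprint changes, so the README's advice to prefer a native extension for very large strings still applies.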
@@ -0,0 +1,3 @@
1
+ module LevenshteinRb
2
+ VERSION = "0.0.1"
3
+ end
@@ -0,0 +1,27 @@
1
+ require 'spec_helper'
2
+
3
+ RSpec.describe LevenshteinRb::LevenshteinDistance do
4
+
5
+ describe '#to_i' do
6
+ it { expect(described_class.new('', '').to_i).to eql 0 }
7
+
8
+ context 'with insertion/deletion' do
9
+ it { expect(described_class.new('', 'a').to_i).to eql 1 }
10
+ it { expect(described_class.new('a', '').to_i).to eql 1 }
11
+
12
+ it { expect(described_class.new('bbb', 'bbba').to_i).to eql 1 }
13
+ it { expect(described_class.new('bbba', 'bbb').to_i).to eql 1 }
14
+ end
15
+
16
+ context 'with substitution' do
17
+ it { expect(described_class.new('a', 'b').to_i).to eql 1 }
18
+ it { expect(described_class.new('ba', 'bb').to_i).to eql 1 }
19
+ it { expect(described_class.new('ab', 'bb').to_i).to eql 1 }
20
+ end
21
+
22
+ context 'with case studies' do
23
+ it { expect(described_class.new('Tier', 'Tor').to_i).to eql 2 }
24
+ end
25
+
26
+ end
27
+ end
@@ -0,0 +1,7 @@
1
+ require 'spec_helper'
2
+
3
+ describe LevenshteinRb do
4
+ it 'has a version number' do
5
+ expect(LevenshteinRb::VERSION).not_to be nil
6
+ end
7
+ end
@@ -0,0 +1,2 @@
1
+ $LOAD_PATH.unshift File.expand_path('../../lib', __FILE__)
2
+ require 'levenshtein_rb'
metadata ADDED
@@ -0,0 +1,106 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: levenshtein_rb
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.0.1
5
+ platform: ruby
6
+ authors:
7
+ - |
8
+ Robin Neumann
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+ date: 2015-11-23 00:00:00.000000000 Z
13
+ dependencies:
14
+ - !ruby/object:Gem::Dependency
15
+ name: bundler
16
+ requirement: !ruby/object:Gem::Requirement
17
+ requirements:
18
+ - - "~>"
19
+ - !ruby/object:Gem::Version
20
+ version: '1.6'
21
+ type: :development
22
+ prerelease: false
23
+ version_requirements: !ruby/object:Gem::Requirement
24
+ requirements:
25
+ - - "~>"
26
+ - !ruby/object:Gem::Version
27
+ version: '1.6'
28
+ - !ruby/object:Gem::Dependency
29
+ name: rake
30
+ requirement: !ruby/object:Gem::Requirement
31
+ requirements:
32
+ - - ">="
33
+ - !ruby/object:Gem::Version
34
+ version: '0'
35
+ type: :development
36
+ prerelease: false
37
+ version_requirements: !ruby/object:Gem::Requirement
38
+ requirements:
39
+ - - ">="
40
+ - !ruby/object:Gem::Version
41
+ version: '0'
42
+ - !ruby/object:Gem::Dependency
43
+ name: rspec
44
+ requirement: !ruby/object:Gem::Requirement
45
+ requirements:
46
+ - - ">="
47
+ - !ruby/object:Gem::Version
48
+ version: '0'
49
+ type: :development
50
+ prerelease: false
51
+ version_requirements: !ruby/object:Gem::Requirement
52
+ requirements:
53
+ - - ">="
54
+ - !ruby/object:Gem::Version
55
+ version: '0'
56
+ description: Implementation of Levenshtein algorithm to determine the similarity of
57
+ two strings using only Ruby
58
+ email:
59
+ - robin.neumann@absolventa.de
60
+ executables: []
61
+ extensions: []
62
+ extra_rdoc_files: []
63
+ files:
64
+ - ".gitignore"
65
+ - ".rspec"
66
+ - ".travis.yml"
67
+ - Gemfile
68
+ - LICENSE.txt
69
+ - README.md
70
+ - Rakefile
71
+ - levenshtein_rb.gemspec
72
+ - lib/levenshtein_rb.rb
73
+ - lib/levenshtein_rb/levenshtein_distance.rb
74
+ - lib/levenshtein_rb/version.rb
75
+ - spec/levenshtein_distance_spec.rb
76
+ - spec/levenshtein_rb_spec.rb
77
+ - spec/spec_helper.rb
78
+ homepage: ''
79
+ licenses:
80
+ - MIT
81
+ metadata: {}
82
+ post_install_message:
83
+ rdoc_options: []
84
+ require_paths:
85
+ - lib
86
+ required_ruby_version: !ruby/object:Gem::Requirement
87
+ requirements:
88
+ - - ">="
89
+ - !ruby/object:Gem::Version
90
+ version: '0'
91
+ required_rubygems_version: !ruby/object:Gem::Requirement
92
+ requirements:
93
+ - - ">="
94
+ - !ruby/object:Gem::Version
95
+ version: '0'
96
+ requirements: []
97
+ rubyforge_project:
98
+ rubygems_version: 2.2.2
99
+ signing_key:
100
+ specification_version: 4
101
+ summary: Implementation of Levenshtein algorithm to determine the similarity of two
102
+ strings using only Ruby
103
+ test_files:
104
+ - spec/levenshtein_distance_spec.rb
105
+ - spec/levenshtein_rb_spec.rb
106
+ - spec/spec_helper.rb