bentley_mcilroy 0.0.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/LICENSE +21 -0
- data/README.md +133 -0
- data/bentley_mcilroy.gemspec +14 -0
- data/lib/bentley_mcilroy.rb +236 -0
- data/lib/rolling_hash.rb +101 -0
- data/rakefile +11 -0
- data/test/bentley_mcilroy_test.rb +99 -0
- data/test/rolling_hash_test.rb +20 -0
- data/test/test_helper.rb +1 -0
- metadata +90 -0
data/LICENSE
ADDED
@@ -0,0 +1,21 @@
(MIT License)

Copyright (c) 2013 Adam Prescott

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,133 @@
A Ruby implementation of Bentley-McIlroy's data compression scheme, used to
encode compressed versions of strings and to compute deltas between a source
and a target.

Note that the compression and delta encodings are simply represented with Ruby
objects, and are independent of any particular binary format.

The fingerprinting algorithm is the rolling hash frequently used for Rabin-Karp
string matching.

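For example, the `RollingHash` class that ships with this gem rolls a
fingerprint forward one character at a time, so updating the hash for the next
window is a constant amount of work instead of re-hashing the whole window:

    hasher = RollingHash.new
    hasher.hash("abc")          #=> 6432038
    hasher.next_hash("d")       #=> 6498345
    RollingHash.new.hash("bcd") #=> 6498345
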
# Usage

To compress a string, pass the input and block size.

    codec = BentleyMcIlroy::Codec
    codec.compress("aaaaaa", 3)     #=> ["a", [0, 5]]
    codec.compress("abcabcabc", 3)  #=> ["abc", [0, 6]]
    codec.compress("xabcdabcdy", 2) #=> ["xabcda", [2, 3], "y"]
    codec.compress("xabcdabcdy", 1) #=> ["xabcd", [1, 4], "y"]

# Modes of operation

This library supports two modes of operation: compression and delta encoding.
With compression, a single input is compressed. With delta encoding, there is a
(non-empty) source and a target, and the result is a delta which can be used to
reconstruct the target, given the source. Compression is a special case of
delta encoding where there is no source.

With compression, the source data is everything to the left of the position
we've reached along the string. With delta encoding, the source data is fixed
for the entire time we move left-to-right through the target string.

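The difference shows up in what the `[offset, length]` pairs refer to: in
compression they point back into the output decompressed so far (which is the
input itself), while in delta encoding they point into the source. Both
results below come from the test suite:

    codec.compress("abcdefabcdef", 3)      #=> ["abcdef", [0, 6]]
    codec.encode("abcdef", "defghiabc", 3) #=> [[3, 3], "ghi", [0, 3]]

In the delta, `[3, 3]` copies `"def"` and `[0, 3]` copies `"abc"` out of the
source `"abcdef"`.
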
Compression:

    codec.compress("aaaaaa", 3)     #=> ["a", [0, 5]]
    codec.compress("abcabcabc", 3)  #=> ["abc", [0, 6]]
    codec.compress("xabcdabcdy", 2) #=> ["xabcda", [2, 3], "y"]
    codec.compress("xabcdabcdy", 1) #=> ["xabcd", [1, 4], "y"]

Delta encoding is similar:

    codec.encode("abcd", "xabcdyabcdz", 1) #=> ["x", [0, 4], "y", [0, 4], "z"]
    codec.encode("xyz", "xyz", 3)          #=> []

To decompress:

    codec.decompress(["xabcd", [1, 4], "y"]) #=> "xabcdabcdy"

To decode a delta against a source:

    codec.decode("abcd", ["x", [0, 4], "y", [0, 4], "z"]) #=> "xabcdyabcdz"

# About Bentley-McIlroy

The Bentley-McIlroy compression scheme is an algorithm for compressing a
string by finding long common substrings. The algorithm and its properties
are described in greater detail in their [1999 paper][bentley-mcilroy paper].
The technique, with a source dictionary and a target string, is used in
Google's implementation of a VCDIFF encoder,
[open-vcdiff][open-vcdiff project], as part of encoding deltas.

[bentley-mcilroy paper]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.11.8470&rep=rep1&type=pdf
[open-vcdiff project]: http://code.google.com/p/open-vcdiff/

To give a brief summary, the algorithm works by fixing a window of block size
b and then sliding it over the string, storing the fingerprint of every b-th
window. These stored fingerprints are then used to detect repetitions later
on in the string.

The algorithm in pseudocode, as given in the paper, is:

    initialize fp
    for (i = b; i < n; i++)
        if (i % b == 0)
            store(fp, i)
        update fp to include a[i] and exclude a[i-b]
        checkformatch(fp, i)

In the algorithm above, `checkformatch(fp, i)` looks up the fingerprint `fp` in
a hash table and then encodes a match if one is found.

`checkformatch(fp, i)` is the core piece of this algorithm, and "encodes a
match" is not fully described in the paper. The rest of the algorithm simply
describes moving through the string with a sliding window, looking at
substrings and storing fingerprints whenever we cross a block boundary.

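Here is a rough Ruby sketch of that loop, written as delta encoding against a
fixed source since that avoids the extra bookkeeping compression needs to skip
blocks at or after the current position. The `naive_encode` name is just for
illustration; unlike the real `Codec` in `lib/bentley_mcilroy.rb`, it re-hashes
each window instead of rolling the fingerprint, keeps only one block per
fingerprint, and never extends a match.

    require "rolling_hash"

    # Fingerprint every b-sized block of the source up front, then scan the
    # target window by window, emitting [offset, length] copies whenever a
    # fingerprint (and the underlying text) matches.
    def naive_encode(source, target, b)
      fingerprints = {}
      (0...source.length).step(b) do |i|
        block = source[i, b]
        fingerprints[RollingHash.new.hash(block)] = i if block.length == b
      end

      output = []
      buffer = ""
      pos = 0
      while pos + b <= target.length
        window = target[pos, b]
        offset = fingerprints[RollingHash.new.hash(window)]
        if offset && source[offset, b] == window # guard against hash collisions
          output << buffer unless buffer.empty?
          buffer = ""
          output << [offset, b]
          pos += b
        else
          buffer << target[pos, 1]
          pos += 1
        end
      end

      remainder = buffer + target[pos..-1]
      output << remainder unless remainder.empty?
      output
    end

    naive_encode("abcd", "xabcdyabcdz", 2) #=> ["x", [0, 2], [2, 2], "y", [0, 2], [2, 2], "z"]

Without extension, adjacent copies stay separate, as in the `[0, 2], [2, 2]`
pairs above; extending a match forward as far as it will go is what lets the
actual implementation produce `["x", [0, 4], "y", [0, 4], "z"]` for
`encode("abcd", "xabcdyabcdz", 1)` in the Usage section.
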
As described in the paper, suppose b = 100 and that the current block matches
block 56 (i.e., bytes 5600 through 5699). This current block could then be
encoded as <5600,100>.

There are two similar improvements which can be made, so as to prevent
`"ababab"` from compressing into `"ab<0,2><0,2>"`, both of which are also in
the paper. When we know that the current block matches block 56, we can extend
the match as far back as possible, not exceeding b - 1 bytes. Similarly, we
can move the match as far forward as possible, without limitation.

The reason there is a limit of b - 1 bytes when moving backwards is that if
there were more to match beyond b - 1 bytes, it would have been found in a
previous iteration of the loop.

This library implementation moves matches forward, but does not move matches
backwards.

To be more explicit about what extending the match means, consider

    xabcdabcdy (the string)
    0123456789 (indices)

with a block size of b = 2. Moving left to right, the fingerprints of `"xa"`,
`"ab"`, `"bc"`, ..., are computed, but only `"xa"`, `"bc"`, `"da"`, ... are
stored. When `"ab"` is seen at `5..6`, there is no corresponding entry in the
hash table, so nothing is done, yet. On the next substring of length 2, `"bc"`,
at positions `6..7`, there _is_ a corresponding entry in the hash table, so
there's a match, which we could encode as `<2, 2>`, say. However, we'd like to
_actually_ produce `<1, 4>`, which is more efficient. So starting with
`<2, 2>`, we move the match back 1 character for both the `"bc"` at `6..7` and
the `"bc"` at `2..3`, then check if `1..3` matches `5..7`, which it does. This
is moving the match backwards.

For moving the match forwards, simply do the same thing. Check if `1..4`
matches `5..8`, which it does. `1..5` does not match `5..9`, so we use `<1, 4>`
and we're done.

The resulting encoding, with both backward and forward extension, is
`xabcd<1, 4>y`. In the case of no backward extension, it is `xabcda<2, 3>y`.

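Either encoding round-trips back to the original string; the backward-extended
form just carries one less literal character. Both of these calls appear in
the test suite:

    codec.decompress(["xabcda", [2, 3], "y"]) #=> "xabcdabcdy"
    codec.decompress(["xabcd", [1, 4], "y"])  #=> "xabcdabcdy"
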
# License

Copyright (c) Adam Prescott, released under the MIT license. See the license file.

# TODO

    compress("abcaaaaaa", 1) #=> ["abc", [0, 1], [0, 1], [0, 1], [0, 1], [0, 1], [0, 1]]

Can this be fixed to be `["abc", [0, 1], [3, 5]]`? That would essentially
follow the paper and pick the longest match on a clash (here, both index 0 and
index 3 are hit for index 4, but index 3 leads to a better result when the
match is extended forward).
data/bentley_mcilroy.gemspec
ADDED
@@ -0,0 +1,14 @@
Gem::Specification.new do |s|
  s.name          = "bentley_mcilroy"
  s.version       = "0.0.1"
  s.authors       = ["Adam Prescott"]
  s.email         = ["adam@aprescott.com"]
  s.homepage      = "https://github.com/aprescott/bentley_mcilroy"
  s.summary       = "Bentley-McIlroy compression scheme implementation in Ruby."
  s.description   = "A compression scheme using the Bentley-McIlroy data compression technique of finding long common substrings."
  s.files         = Dir["{lib/**/*,test/**/*}"] + %w[LICENSE README.md bentley_mcilroy.gemspec rakefile]
  s.test_files    = Dir["test/*"]
  s.require_path  = "lib"
  s.add_development_dependency "rake"
  s.add_development_dependency "rspec"
end
data/lib/bentley_mcilroy.rb
ADDED
@@ -0,0 +1,236 @@
require "rolling_hash"

module BentleyMcIlroy
  # A fixed block of text, appearing in the original text at one of
  # 0..b-1, b..2b-1, 2b..3b-1, ...
  class Block
    attr_reader :text, :position

    def initialize(text, position)
      @text = text
      @position = position
    end

    def hash
      RollingHash.new.hash(text)
    end
  end

  # A container for the original text we're processing. Divides the text into
  # Block objects.
  class BlockSequencedText
    attr_reader :blocks, :text

    def initialize(text, block_size)
      @text = text
      @block_size = block_size
      @blocks = []

      # "onetwothree" -> ["one", "two", "thr", "ee"]
      @text.scan(/.(?:.?){#{@block_size-1}}/).each.with_index do |text_block, index|
        @blocks << Block.new(text_block, index * @block_size)
      end
    end
  end

  # Look-up table with a #find method which finds an appropriate block and then
  # modifies the match to extend it to more characters.
  class BlockFingerprintTable
    def initialize(block_sequenced_text)
      @blocked_text = block_sequenced_text
      @hash = {}

      @blocked_text.blocks.each do |block|
        (@hash[block.hash] ||= []) << block
      end
    end

    def find_for_compress(fingerprint, block_size, target, position)
      source = @blocked_text.text
      find(fingerprint, block_size, source, target, position)
    end

    def find_for_diff(fingerprint, block_size, target)
      source = @blocked_text.text
      find(fingerprint, block_size, source, target)
    end

    private

    def find(fingerprint, block_size, source, target, position = nil)
      blocks = @hash[fingerprint]
      return nil unless blocks

      blocks.each do |block|
        next unless block.text == target[0, block_size]

        # in compression, since we don't have true source and target strings as
        # separate things, we have to ensure that we don't use a fingerprinted
        # block which appears _after_ the current position, otherwise
        #
        #   a<x, 0> with x > 0
        #
        # might happen, or similar. since blocks are ordered left to right in the
        # string, we can just return nil, because we know there's not going to be
        # a valid block for compression.
        if position && block.position >= position
          return nil
        end

        # we know that block matches, so cut it from the beginning,
        # so we can then see how much of the rest also matches
        source_match = source[block.position + block_size..-1]
        target_match = target[block_size..-1]

        # in a backwards extension, we could see how many of the characters
        # before +position+ (up to the previous block we covered) match the
        # characters before block.position (up to b-1 of them). In other words,
        # we could find the maximum i such that
        #
        #   original_text[position-k, 1] == original_text[block.position-k, 1]
        #
        # for all k in {1, 2, ..., i}, where i <= b-1.
        # (Backwards extension is not currently implemented; see the README.)

        # it may be that the block we've matched on reaches to the end of the
        # string, in which case, bail
        if source_match.empty? || target_match.empty?
          return block
        end

        end_index = find_end_index(source_match, target_match)
        match = produce_match(end_index, block, source)
        return match
      end

      nil
    end

    def find_end_index(source, target)
      end_index = 0
      any_match = false
      while end_index < source.length && end_index < target.length && source[end_index, 1] == target[end_index, 1]
        any_match = true
        end_index += 1
      end
      # end_index is now the number of matching characters; step back to the
      # index of the last matching character
      end_index -= 1

      any_match ? end_index : nil
    end

    def produce_match(end_index, block, source)
      text = block.text
      if end_index # we have more to grab in the string
        text += source[0..end_index]
      end
      Block.new(text, block.position)
    end
  end

  class Codec
    # Expands a compressed sequence (strings and [offset, length] pairs, with
    # offsets into the output built so far) back into the original string.
    def self.decompress(sequence)
      sequence.inject("") do |result, i|
        if i.is_a?(Array)
          index, length = i
          length.times do |k|
            result << result[index+k, 1]
          end
          result
        else
          result << i
        end
      end
    end

    # Applies a delta (strings and [offset, length] pairs, with offsets into
    # +source+) to reconstruct the target string.
    def self.decode(source, delta)
      delta.inject("") do |result, i|
        if i.is_a?(Array)
          index, length = i
          result << source[index, length]
        else
          result << i
        end
      end
    end

    def self.compress(text, block_size)
      __compress_encode__(text, nil, block_size)
    end

    def self.encode(source, target, block_size)
      __compress_encode__(source, target, block_size)
    end

    private

    def self.__compress_encode__(source, target, block_size)
      return [] if source == target

      block_sequenced_text = BlockSequencedText.new(source, block_size)
      table = BlockFingerprintTable.new(block_sequenced_text)
      output = []
      buffer = ""
      current_hash = nil
      hasher = RollingHash.new

      mode = (target ? :diff : :compress)

      if mode == :compress
        # it's the source we're compressing, there is no target
        text = source
      else
        # it's the target we're compressing against the source
        text = target
      end

      position = 0
      while position < text.length

        if text.length - position < block_size
          # if there isn't a block-sized substring in the remaining text, stop.
          # note that we could add the buffer to the output here, but if block_size
          # is 1, text.length - position < 1 can't be true, so the final character
          # would go missing. so appending to the buffer goes below, outside the
          # while loop.
          break
        end

        # if we've recently found a block of text which matches and added that to
        # the output, current_hash will be reset to nil, so get the new hash. note
        # that we can't just use next_hash, because we might have skipped several
        # characters in one go, which breaks the rolling aspect of the hash
        if !current_hash
          current_hash = hasher.hash(text[position, block_size])
        else
          # position-1 is the previous position, + block_size to get the last
          # character of the current block
          current_hash = hasher.next_hash(text[position-1 + block_size, 1])
        end

        match = target ? table.find_for_diff(current_hash, block_size, target[position..-1]) :
                         table.find_for_compress(current_hash, block_size, text[position..-1], position)

        if match
          if !buffer.empty?
            output << buffer
            buffer = ""
          end

          output << [match.position, match.text.length]
          position += match.text.length
          current_hash = nil
          # get a new hasher, because we've skipped over by match.text.length
          # characters, so the rolling hash's next_hash won't work
          hasher = RollingHash.new
        else
          buffer << text[position, 1]
          position += 1
        end
      end

      remainder = buffer + text[position..-1]
      output << remainder if !remainder.empty?
      output
    end
  end
end
data/lib/rolling_hash.rb
ADDED
@@ -0,0 +1,101 @@
if RUBY_VERSION < "1.9"
  class String
    def ord
      self[0]
    end
  end
end

# Rolling hash as used in Rabin-Karp.
#
#   hasher = RollingHash.new
#   hasher.hash("abc")    #=> 6432038
#   hasher.next_hash("d") #=> 6498345
#                             ||
#   hasher.hash("bcd")    #=> 6498345
class RollingHash
  def initialize(hash = {})
    hash = { :base => 257, # prime
             :mod  => 1000000007
           }.merge!(hash)
    @base = hash[:base]
    @mod = hash[:mod]
  end

  # Compute @base**power working modulo @mod
  def modulo_exp(power)
    self.class.modulo_exp(@base, power, @mod)
  end

  # Given a string "abc...xyz" with length len,
  # return the hash using @base as
  #
  #   "a".ord * @base**(len - 1) +
  #   "b".ord * @base**(len - 2) +
  #   ... +
  #   "y".ord * @base**(1) +
  #   "z".ord * @base**0 (== "z".ord)
  def hash(input)
    hash = 0
    characters = input.split("")
    input_length = characters.length

    characters.each_with_index do |character, index|
      hash += character.ord * modulo_exp(input_length - 1 - index) % @mod
      hash = hash % @mod
    end
    @prev_hash = hash
    @prev_input = input
    @highest_power = input_length - 1
    hash
  end

  # Returns the hash of (@prev_input[1..-1] + character)
  # by using @prev_hash, so that the sum turns from
  #
  #   "a".ord * @base**(len - 1) +
  #   "b".ord * @base**(len - 2) +
  #   ... +
  #   "y".ord * @base**(1) +
  #   "z".ord * @base**0 (== "z".ord)
  #
  # into
  #
  #   "b".ord * @base**(len - 1) +
  #   ... +
  #   "y".ord * @base**(2) +
  #   "z".ord * @base**1 +
  #   character.ord * @base**0
  def next_hash(character)
    # the leading value of the computed sum
    char_to_subtract = @prev_input.chars.first
    hash = @prev_hash

    # subtract the leading value
    hash = hash - char_to_subtract.ord * @base**@highest_power

    # shift everything over to the left by 1, and add the
    # new character as the lowest value
    hash = (hash * @base) + character.ord
    hash = hash % @mod

    # trim off the first character
    @prev_input.slice!(0)
    @prev_input << character
    @prev_hash = hash

    hash
  end

  private

  # Returns n**power but reduced modulo mod
  # at each step of the calculation.
  def self.modulo_exp(n, power, mod)
    value = 1
    power.times do
      value = (n * value) % mod
    end
    value
  end
end
data/test/bentley_mcilroy_test.rb
ADDED
@@ -0,0 +1,99 @@
require "test_helper"

describe BentleyMcIlroy::Codec do
  describe ".compress" do
    it "compresses strings" do
      codec = BentleyMcIlroy::Codec
      str = "aaaaaaaaaaaaaaaaaaaaaaa"

      (1..10).each { |i| codec.compress(str, i).should == [str[0, 1], [0, str.length-1]] }

      codec.compress("abcabcabc", 3).should == ["abc", [0, 6]]
      codec.compress("abababab", 2).should == ["ab", [0, 6]]
      codec.compress("abcdefabc", 3).should == ["abcdef", [0, 3]]
      codec.compress("abcdefabcdef", 3).should == ["abcdef", [0, 6]]
      codec.compress("abcabcabc", 2).should == ["abc", [0, 6]]
      codec.compress("xabcdabcdy", 2).should == ["xabcda", [2, 3], "y"]
      codec.compress("xabcdabcdy", 1).should == ["xabcd", [1, 4], "y"]
      codec.compress("xabcabcy", 2).should == ["xabca", [2, 2], "y"]
    end

    # "aaaa" should compress down to ["a", [0, 3]]
    it "picks the longest match on clashes"

    #                       11
    #          0123    45678901
    # encode("xaby", "abababab", 1) would be more efficiently encoded as
    #
    #   ["x", [1, 2], [4, 6]]
    #
    # where [4, 6] refers to the decoded target itself, in the style of
    # VCDIFF. See RFC3284 section 3, where COPY 4, 4 + COPY 12, 24 is used.
    #
    # this should probably only be allowed with a flag or something.
    #
    # note that compress is more efficient for this type of input,
    # since the "source" is everything to the left of the current position:
    #
    #   compress("abababab", 1) #=> ["ab", [0, 6]]
    it "can refer to its own target"

    it "handles binary" do
      codec = BentleyMcIlroy::Codec
      str = ("\x52\303\x66" * 3)
      str.force_encoding("BINARY") if str.respond_to?(:force_encoding)

      codec.compress(str, 3).should == ["\x52\303\x66", [0, 6]]
    end
  end

  describe ".decompress" do
    it "converts arrays representing compressed strings into the full string" do
      codec = BentleyMcIlroy::Codec
      codec.decompress(["abc", [0, 6]]).should == "abcabcabc"
      codec.decompress(["abcdef", [0, 3]]).should == "abcdefabc"
      codec.decompress(["xabcda", [2, 3], "y"]).should == "xabcdabcdy"
      codec.decompress(["xabcd", [1, 4], "y"]).should == "xabcdabcdy"
      codec.decompress(["xabca", [2, 2], "y"]).should == "xabcabcy"
    end

    it "round-trips with the compression method" do
      codec = BentleyMcIlroy::Codec
      %w[aaaaaaaaa abcabcabcabc abababab abcdefabc abcdefabcdef abcabcabc xabcdabcdy xabcabcy].each do |s|
        (1..4).each do |n|
          codec.decompress(codec.compress(s, n)).should == s
        end
      end
    end
  end

  describe ".encode" do
    it "encodes strings" do
      codec = BentleyMcIlroy::Codec
      codec.encode("abcdef", "defghiabc", 3).should == [[3, 3], "ghi", [0, 3]]
      codec.encode("abcdef", "defghiabc", 2).should == ["d", [4, 2], "ghi", [0, 3]]
      codec.encode("abcdef", "defghiabc", 1).should == [[3, 3], "ghi", [0, 3]]
      codec.encode("abc", "d", 3).should == ["d"]
      codec.encode("abc", "defghi", 3).should == ["defghi"]
      codec.encode("abcdef", "abcdef", 3).should == []
      codec.encode("abc", "abcdef", 3).should == [[0, 3], "def"]
      codec.encode("aaaaa", "aaaaaaaaaa", 3).should == [[0, 5], [0, 5]]
    end
  end

  describe ".decode" do
    it "applies the given delta to the given source" do
      codec = BentleyMcIlroy::Codec
      codec.decode("aaaaa", [[0, 5], [0, 5]]).should == "aaaaaaaaaa"
      codec.decode("abcdef", [[3, 3], "ghi", [0, 3]]).should == "defghiabc"
    end

    it "round-trips with the delta method" do
      codec = BentleyMcIlroy::Codec
      (1..4).each do |n|
        codec.decode("abcdef", codec.encode("abcdef", "defghiabc", n)).should == "defghiabc"
      end
    end
  end
end
data/test/rolling_hash_test.rb
ADDED
@@ -0,0 +1,20 @@
require "test_helper"

describe RollingHash do
  describe "#hash(input)" do
    it "hashes the input using a polynomial" do
      hasher = RollingHash.new
      hasher.hash("abc").should == 6432038
      hasher.hash("bcd").should == 6498345
    end
  end

  describe "#next_hash(next_input)" do
    it "takes the previous hash and the given next input and computes the new hash" do
      hasher = RollingHash.new
      h = hasher.hash("abc")
      new_h = hasher.next_hash("d")
      new_h.should == RollingHash.new.hash("bcd")
    end
  end
end
data/test/test_helper.rb
ADDED
@@ -0,0 +1 @@
require "bentley_mcilroy"
metadata
ADDED
@@ -0,0 +1,90 @@
--- !ruby/object:Gem::Specification
name: bentley_mcilroy
version: !ruby/object:Gem::Version
  version: 0.0.1
prerelease:
platform: ruby
authors:
- Adam Prescott
autorequire:
bindir: bin
cert_chain: []
date: 2013-09-09 00:00:00.000000000 Z
dependencies:
- !ruby/object:Gem::Dependency
  name: rake
  requirement: !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ! '>='
      - !ruby/object:Gem::Version
        version: '0'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ! '>='
      - !ruby/object:Gem::Version
        version: '0'
- !ruby/object:Gem::Dependency
  name: rspec
  requirement: !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ! '>='
      - !ruby/object:Gem::Version
        version: '0'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ! '>='
      - !ruby/object:Gem::Version
        version: '0'
description: A compression scheme using the Bentley-McIlroy data compression technique
  of finding long common substrings.
email:
- adam@aprescott.com
executables: []
extensions: []
extra_rdoc_files: []
files:
- lib/bentley_mcilroy.rb
- lib/rolling_hash.rb
- test/test_helper.rb
- test/bentley_mcilroy_test.rb
- test/rolling_hash_test.rb
- LICENSE
- README.md
- bentley_mcilroy.gemspec
- rakefile
homepage: https://github.com/aprescott/bentley_mcilroy
licenses: []
post_install_message:
rdoc_options: []
require_paths:
- lib
required_ruby_version: !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ! '>='
    - !ruby/object:Gem::Version
      version: '0'
required_rubygems_version: !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ! '>='
    - !ruby/object:Gem::Version
      version: '0'
requirements: []
rubyforge_project:
rubygems_version: 1.8.24
signing_key:
specification_version: 3
summary: Bentley-McIlroy compression scheme implementation in Ruby.
test_files:
- test/test_helper.rb
- test/bentley_mcilroy_test.rb
- test/rolling_hash_test.rb