bentley_mcilroy 0.0.1
- data/LICENSE +21 -0
- data/README.md +133 -0
- data/bentley_mcilroy.gemspec +14 -0
- data/lib/bentley_mcilroy.rb +236 -0
- data/lib/rolling_hash.rb +101 -0
- data/rakefile +11 -0
- data/test/bentley_mcilroy_test.rb +99 -0
- data/test/rolling_hash_test.rb +20 -0
- data/test/test_helper.rb +1 -0
- metadata +90 -0
data/LICENSE
ADDED
@@ -0,0 +1,21 @@
(MIT License)

Copyright (c) 2013 Adam Prescott

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,133 @@
A Ruby implementation of Bentley-McIlroy's data compression scheme to encode
compressed versions of strings, and compute deltas between source and target.

Note the compression and delta encodings are simply represented with Ruby
objects, and are independent of any particular binary format.

The fingerprinting algorithm is the rolling hash frequently used for Rabin-Karp
string matching.

# Usage

To compress a string, pass the input and block size.

    codec = BentleyMcIlroy::Codec
    codec.compress("aaaaaa", 3)     #=> ["a", [0, 5]]
    codec.compress("abcabcabc", 3)  #=> ["abc", [0, 6]]
    codec.compress("xabcdabcdy", 2) #=> ["xabcda", [2, 3], "y"]
    codec.compress("xabcdabcdy", 1) #=> ["xabcd", [1, 4], "y"]

# Modes of operation

This library supports two modes of operation: compression and delta encoding.
With compression, a single input is compressed. With delta encoding, there is a
(non-empty) source and a target, and the result is a delta which can be
used to reconstruct the target, given the source. Compression is a special
case of delta encoding where there is no source.

With compression, the source data is everything to the left of the position we've
reached along the string. With delta encoding, the source data is fixed for the
entire time we move left-to-right through the target string.

Compression:

    codec.compress("aaaaaa", 3)     #=> ["a", [0, 5]]
    codec.compress("abcabcabc", 3)  #=> ["abc", [0, 6]]
    codec.compress("xabcdabcdy", 2) #=> ["xabcda", [2, 3], "y"]
    codec.compress("xabcdabcdy", 1) #=> ["xabcd", [1, 4], "y"]

Delta encoding is similar:

    codec.encode("abcd", "xabcdyabcdz", 1) #=> ["x", [0, 4], "y", [0, 4], "z"]
    codec.encode("xyz", "xyz", 3)          #=> []

To decompress:

    codec.decompress(["xabcd", [1, 4], "y"]) #=> "xabcdabcdy"

To decode a delta against a source:

    codec.decode("abcd", ["x", [0, 4], "y", [0, 4], "z"]) #=> "xabcdyabcdz"

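The copy semantics of decompression and decoding can be sketched as two standalone functions. This is a simplified illustration (hypothetical top-level `decompress`/`decode` helpers, not the gem's `Codec` methods, though it mirrors their behavior): a plain string in the sequence is appended as-is, while an `[index, length]` pair copies either from the output built so far, or from the fixed source.

```ruby
# Minimal sketch of the decoding semantics described above.
# decompress copies refer back into the output built so far;
# decode copies refer into a fixed, separate source string.

def decompress(sequence)
  sequence.inject("") do |result, item|
    if item.is_a?(Array)
      index, length = item
      # copy one character at a time, so a copy can overlap the
      # region it is itself producing (e.g. ["a", [0, 5]])
      length.times { |k| result << result[index + k, 1] }
      result
    else
      result << item
    end
  end
end

def decode(source, delta)
  delta.inject("") do |result, item|
    if item.is_a?(Array)
      index, length = item
      result << source[index, length]
    else
      result << item
    end
  end
end

decompress(["xabcd", [1, 4], "y"])              #=> "xabcdabcdy"
decode("abcd", ["x", [0, 4], "y", [0, 4], "z"]) #=> "xabcdyabcdz"
```

Copying character by character matters for decompression: in `["a", [0, 5]]` the copy reads characters it has just written, which is how a one-character literal expands into `"aaaaaa"`.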
# About Bentley-McIlroy

The Bentley-McIlroy compression scheme is an algorithm for compressing a
string by finding long common substrings. The algorithm and its properties
are described in greater detail in their [1999 paper][bentley-mcilroy paper]. The technique, with a
source dictionary and a target string, is used in Google's implementation of
a VCDIFF encoder, [open-vcdiff][open-vcdiff project], as part of encoding deltas.

[bentley-mcilroy paper]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.11.8470&rep=rep1&type=pdf
[open-vcdiff project]: http://code.google.com/p/open-vcdiff/

To give a brief summary, the algorithm works by fixing a window of block size
b, then sliding it over the string, storing the fingerprint of every b-th
window. These stored fingerprints are then used to detect repetitions later
on in the string.

The algorithm in pseudocode, as given in the paper, is:

    initialize fp
    for (i = b; i < n; i++)
        if (i % b == 0)
            store(fp, i)
        update fp to include a[i] and exclude a[i-b]
        checkformatch(fp, i)

In the algorithm above, `checkformatch(fp, i)` looks up the fingerprint `fp` in a
hash table and then encodes a match if one is found.

`checkformatch(fp, i)` is the core piece of this algorithm, and "encodes a
match" is not fully described in the paper. The rest of the algorithm simply
describes moving through the string with a sliding window, looking at
substrings and storing fingerprints whenever we cross a block boundary.

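The `store(fp, i)` half of that loop can be sketched standalone. As a simplifying assumption, `store_windows` (a hypothetical helper) keys the table on the window's text itself rather than on a rolling fingerprint; structurally it records exactly the same thing, namely where each b-th window starts:

```ruby
# Build the table of every b-th window of the input, as in the
# store(fp, i) step of the pseudocode. Real implementations key the
# table on a rolling fingerprint; using the window text as the key
# keeps this sketch short without changing the structure.
def store_windows(text, b)
  table = Hash.new { |h, k| h[k] = [] }
  position = 0
  while position + b <= text.length
    table[text[position, b]] << position
    position += b # only block-aligned windows are stored
  end
  table
end

table = store_windows("xabcdabcdy", 2)
table["bc"] #=> [2, 6]
table["xa"] #=> [0]
```

With b = 2 only the windows at positions 0, 2, 4, 6, 8 (`"xa"`, `"bc"`, `"da"`, `"bc"`, `"dy"`) are stored, which is why the later `"ab"` at an odd offset finds no entry but `"bc"` does, as in the worked example below.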
As described in the paper, suppose b = 100 and that the current block matches
block 56 (i.e., bytes 5600 through to 5699). This current block could then be
encoded as <5600,100>.

There are two similar improvements which can be made, so as to prevent
`"ababab"` from compressing into `"ab<0,2><0,2>"`, both of which are also in the
paper. When we know that the current block matches block 56, we can extend
the match as far back as possible, not exceeding b - 1 bytes. Similarly, we
can move the match as far forward as possible, without limitation.

The reason there is a limit of b - 1 bytes when moving backwards is that if
there were more to match beyond b - 1 bytes, it would've been found in a
previous iteration of the loop.

This library implementation moves matches forward, but does not move matches
backwards.

To be more explicit about what extending the match means, consider

    xabcdabcdy (the string)
    0123456789 (indices)

with a block size of b = 2. Moving left to right, the fingerprints of `"xa"`,
`"ab"`, `"bc"`, ..., are computed, but only `"xa"`, `"bc"`, `"da"`, ... are stored. When
`"ab"` is seen at `5..6`, there is no corresponding entry in the hash table, so
nothing is done, yet. On the next substring of length 2, `"bc"`, at positions
`6..7`, there _is_ a corresponding entry in the hash table, so there's a match,
which we could encode as `<2, 2>`, say. However, we'd like to _actually_ produce
`<1, 4>`, which is more efficient. So starting with `<2, 2>`, we move the match
back 1 character for both the `"bc"` at `6..7` and the `"bc"` at `2..3`, then check
if `1..3` matches `5..7`, which it does. This is moving the match backwards.

For moving the match forwards, simply do the same thing. Check if `1..4` matches
`6..8`, which it does. `1..5` does not match `6..9`, so we use `<1, 4>` and we're done.

The resulting string, with backward- and forward-extension, is `xabcd<1, 4>y`. In
the case of no backward extension, it is `xabcda<2, 3>y`.

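The extension just walked through can be sketched as one standalone function. `extend_match` is a hypothetical helper for illustration only (the gem itself extends matches forwards but not backwards): given a block match of length `len` between a stored occurrence at `src` and the current occurrence at `tgt`, it widens the match backwards by at most b - 1 characters, then forwards without limit.

```ruby
# Widen a block match of length len between source position src and a
# later occurrence at tgt: at most b - 1 characters backwards, then as
# far forwards as the characters keep matching. Returns [start, length].
def extend_match(text, src, tgt, len, b)
  # backward extension, limited to b - 1 characters
  steps = 0
  while steps < b - 1 && src > 0 && tgt > 0 && text[src - 1] == text[tgt - 1]
    src  -= 1
    tgt  -= 1
    len  += 1
    steps += 1
  end
  # forward extension, unlimited; the copy may overlap the region it
  # produces, as in the decompression semantics
  while tgt + len < text.length && text[src + len] == text[tgt + len]
    len += 1
  end
  [src, len]
end

# "bc" at 6..7 matches the stored "bc" at 2..3, with b = 2:
extend_match("xabcdabcdy", 2, 6, 2, 2) #=> [1, 4]
```

This reproduces the example above: `<2, 2>` is first pulled back one character to cover `1..3` against `5..7`, then pushed forward one character to give `<1, 4>`.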
# License

Copyright (c) Adam Prescott, released under the MIT license. See the license file.

# TODO

    compress("abcaaaaaa", 1) -> ["abc", [0, 1], [0, 1], [0, 1], [0, 1], [0, 1], [0, 1]]

Can this be fixed to be `["abc", [0, 1], [3, 5]]`? Essentially following the paper
and picking the longest match on a clash (here, index 0 and index 3 are hit for
index 4, but index 3 leads to a better result when the match is extended forward).
data/bentley_mcilroy.gemspec
ADDED
@@ -0,0 +1,14 @@
Gem::Specification.new do |s|
  s.name = "bentley_mcilroy"
  s.version = "0.0.1"
  s.authors = ["Adam Prescott"]
  s.email = ["adam@aprescott.com"]
  s.homepage = "https://github.com/aprescott/bentley_mcilroy"
  s.summary = "Bentley-McIlroy compression scheme implementation in Ruby."
  s.description = "A compression scheme using the Bentley-McIlroy data compression technique of finding long common substrings."
  s.files = Dir["{lib/**/*,test/**/*}"] + %w[LICENSE README.md bentley_mcilroy.gemspec rakefile]
  s.test_files = Dir["test/*"]
  s.require_path = "lib"
  s.add_development_dependency "rake"
  s.add_development_dependency "rspec"
end
data/lib/bentley_mcilroy.rb
ADDED
@@ -0,0 +1,236 @@
require "rolling_hash"

module BentleyMcIlroy
  # A fixed block of text, appearing in the original text at one of
  # 0..b-1, b..2b-1, 2b..3b-1, ...
  class Block
    attr_reader :text, :position

    def initialize(text, position)
      @text = text
      @position = position
    end

    def hash
      RollingHash.new.hash(text)
    end
  end

  # A container for the original text we're processing. Divides the text into
  # Block objects.
  class BlockSequencedText
    attr_reader :blocks, :text

    def initialize(text, block_size)
      @text = text
      @block_size = block_size
      @blocks = []

      # "onetwothree" -> ["one", "two", "thr", "ee"]
      @text.scan(/.(?:.?){#{@block_size - 1}}/).each.with_index do |text_block, index|
        @blocks << Block.new(text_block, index * @block_size)
      end
    end
  end

  # Look-up table with a #find method which finds an appropriate block and then
  # modifies the match to extend it to more characters.
  class BlockFingerprintTable
    def initialize(block_sequenced_text)
      @blocked_text = block_sequenced_text
      @hash = {}

      @blocked_text.blocks.each do |block|
        (@hash[block.hash] ||= []) << block
      end
    end

    def find_for_compress(fingerprint, block_size, target, position)
      source = @blocked_text.text
      find(fingerprint, block_size, source, target, position)
    end

    def find_for_diff(fingerprint, block_size, target)
      source = @blocked_text.text
      find(fingerprint, block_size, source, target)
    end

    private

    def find(fingerprint, block_size, source, target, position = nil)
      blocks = @hash[fingerprint]
      return nil unless blocks

      blocks.each do |block|
        next unless block.text == target[0, block_size]

        # in compression, since we don't have true source and target strings as
        # separate things, we have to ensure that we don't use a fingerprinted
        # block which appears _after_ the current position, otherwise
        #
        #   a<x, 0> with x > 0
        #
        # might happen, or similar. since blocks are ordered left to right in the
        # string, we can just return nil, because we know there's not going to be
        # a valid block for compression.
        if position && block.position >= position
          return nil
        end

        # we know that block matches, so cut it from the beginning,
        # so we can then see how much of the rest also matches
        source_match = source[block.position + block_size..-1]
        target_match = target[block_size..-1]

        # in a backwards extension, we can see how many of the characters before
        # +position+ (up to the previous block we covered, which is +limit+) match
        # the characters before block.position (up to b-1 of them). In other
        # words, we can find the maximum i such that
        #
        #   original_text[position-k, 1] == original_text[block.position-k, 1]
        #
        # for all k in {1, 2, ..., i}, where i <= b-1

        # it may be that the block we've matched on reaches to the end of the
        # string, in which case, bail
        if source_match.empty? || target_match.empty?
          return block
        end

        end_index = find_end_index(source_match, target_match)
        # extend using the characters after the matched block (source_match),
        # not the full source string
        match = produce_match(end_index, block, source_match)
        return match
      end

      nil
    end

    def find_end_index(source, target)
      end_index = 0
      any_match = false
      while end_index < source.length && end_index < target.length && source[end_index, 1] == target[end_index, 1]
        any_match = true
        end_index += 1
      end
      # undo the final increment, since that's where it failed the equality check
      end_index -= 1

      any_match ? end_index : nil
    end

    def produce_match(end_index, block, source)
      text = block.text
      if end_index # we have more to grab in the string
        text += source[0..end_index]
      end
      Block.new(text, block.position)
    end
  end

  class Codec
    def self.decompress(sequence)
      sequence.inject("") do |result, i|
        if i.is_a?(Array)
          index, length = i
          length.times do |k|
            result << result[index + k, 1]
          end
          result
        else
          result << i
        end
      end
    end

    def self.decode(source, delta)
      delta.inject("") do |result, i|
        if i.is_a?(Array)
          index, length = i
          result << source[index, length]
        else
          result << i
        end
      end
    end

    def self.compress(text, block_size)
      __compress_encode__(text, nil, block_size)
    end

    def self.encode(source, target, block_size)
      __compress_encode__(source, target, block_size)
    end

    private

    def self.__compress_encode__(source, target, block_size)
      return [] if source == target

      block_sequenced_text = BlockSequencedText.new(source, block_size)
      table = BlockFingerprintTable.new(block_sequenced_text)
      output = []
      buffer = ""
      current_hash = nil
      hasher = RollingHash.new

      mode = (target ? :diff : :compress)

      if mode == :compress
        # it's the source we're compressing, there is no target
        text = source
      else
        # it's the target we're compressing against the source
        text = target
      end

      position = 0
      while position < text.length
        if text.length - position < block_size
          # if there isn't a block-sized substring in the remaining text, stop.
          # note that we could add the buffer to the output here, but if block_size
          # is 1, text.length - position < 1 can't be true, so the final character
          # would go missing. so appending to the buffer goes below, outside the
          # while loop.
          break
        end

        # if we've recently found a block of text which matches and added that to
        # the output, current_hash will be reset to nil, so get the new hash. note
        # that we can't just use next_hash, because we might have skipped several
        # characters in one go, which breaks the rolling aspect of the hash
        if !current_hash
          current_hash = hasher.hash(text[position, block_size])
        else
          # position-1 is the previous position, + block_size to get the last
          # character of the current block
          current_hash = hasher.next_hash(text[position - 1 + block_size, 1])
        end

        match = target ? table.find_for_diff(current_hash, block_size, target[position..-1]) :
                         table.find_for_compress(current_hash, block_size, text[position..-1], position)

        if match
          if !buffer.empty?
            output << buffer
            buffer = ""
          end

          output << [match.position, match.text.length]
          position += match.text.length
          current_hash = nil
          # get a new hasher, because we've skipped over by match.text.length
          # characters, so the rolling hash's next_hash won't work
          hasher = RollingHash.new
        else
          buffer << text[position, 1]
          position += 1
        end
      end

      remainder = buffer + text[position..-1]
      output << remainder if !remainder.empty?
      output
    end
  end
end
data/lib/rolling_hash.rb
ADDED
@@ -0,0 +1,101 @@
if RUBY_VERSION < "1.9"
  class String
    def ord
      self[0]
    end
  end
end

# Rolling hash as used in Rabin-Karp.
#
#   hasher = RollingHash.new
#   hasher.hash("abc")    #=> 6432038
#   hasher.next_hash("d") #=> 6498345
#                          ||
#   hasher.hash("bcd")    #=> 6498345
class RollingHash
  def initialize(hash = {})
    hash = { :base => 257, # prime
             :mod  => 1000000007
           }.merge!(hash)
    @base = hash[:base]
    @mod = hash[:mod]
  end

  # Compute @base**power working modulo @mod
  def modulo_exp(power)
    self.class.modulo_exp(@base, power, @mod)
  end

  # Given a string "abc...xyz" with length len,
  # return the hash using @base as
  #
  #   "a".ord * @base**(len - 1) +
  #   "b".ord * @base**(len - 2) +
  #   ... +
  #   "y".ord * @base**(1) +
  #   "z".ord * @base**0 (== "z".ord)
  def hash(input)
    hash = 0
    characters = input.split("")
    input_length = characters.length

    characters.each_with_index do |character, index|
      hash += character.ord * modulo_exp(input_length - 1 - index) % @mod
      hash = hash % @mod
    end
    @prev_hash = hash
    @prev_input = input
    @highest_power = input_length - 1
    hash
  end

  # Returns the hash of (@prev_input[1..-1] + character)
  # by using @prev_hash, so that the sum turns from
  #
  #   "a".ord * @base**(len - 1) +
  #   "b".ord * @base**(len - 2) +
  #   ... +
  #   "y".ord * @base**(1) +
  #   "z".ord * @base**0 (== "z".ord)
  #
  # into
  #
  #   "b".ord * @base**(len - 1) +
  #   ... +
  #   "y".ord * @base**(2) +
  #   "z".ord * @base**1 +
  #   character.ord * @base**0
  def next_hash(character)
    # the leading value of the computed sum
    char_to_subtract = @prev_input.chars.first
    hash = @prev_hash

    # subtract the leading value
    hash = hash - char_to_subtract.ord * @base**@highest_power

    # shift everything over to the left by 1, and add the
    # new character as the lowest value
    hash = (hash * @base) + character.ord
    hash = hash % @mod

    # trim off the first character
    @prev_input.slice!(0)
    @prev_input << character
    @prev_hash = hash

    hash
  end

  private

  # Returns n**power but reduced modulo mod
  # at each step of the calculation.
  def self.modulo_exp(n, power, mod)
    value = 1
    power.times do
      value = (n * value) % mod
    end
    value
  end
end
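The rolling update above can be sanity-checked against a direct polynomial hash with a short standalone sketch. This is a reimplementation for illustration (hypothetical `poly_hash`/`roll` helpers, not the gem's API), using the same base and modulus as the defaults above, so the values match the class comment:

```ruby
BASE = 257           # prime base, as in RollingHash's default
MOD  = 1_000_000_007

# Direct polynomial hash of a whole string, reduced modulo MOD.
def poly_hash(s)
  s.each_char.inject(0) { |h, c| (h * BASE + c.ord) % MOD }
end

# Roll the hash of an n-character window one step: subtract the
# outgoing character's leading term, shift, and append the new one.
def roll(hash, outgoing, incoming, n)
  lead = outgoing.ord * BASE.pow(n - 1, MOD) % MOD
  ((hash - lead) * BASE + incoming.ord) % MOD
end

h = poly_hash("abc")                 #=> 6432038
roll(h, "a", "d", 3)                 #=> 6498345
roll(h, "a", "d", 3) == poly_hash("bcd") #=> true
```

The rolled value equals the direct hash of the next window because subtracting the leading term and multiplying by the base shifts every remaining term up one power, exactly as the `next_hash` comment describes.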
data/test/bentley_mcilroy_test.rb
ADDED
@@ -0,0 +1,99 @@
require "test_helper"

describe BentleyMcIlroy::Codec do
  describe ".compress" do
    it "compresses strings" do
      codec = BentleyMcIlroy::Codec
      str = "aaaaaaaaaaaaaaaaaaaaaaa"

      (1..10).each { |i| codec.compress(str, i).should == [str[0, 1], [0, str.length-1]] }

      codec.compress("abcabcabc", 3).should == ["abc", [0, 6]]
      codec.compress("abababab", 2).should == ["ab", [0, 6]]
      codec.compress("abcdefabc", 3).should == ["abcdef", [0, 3]]
      codec.compress("abcdefabcdef", 3).should == ["abcdef", [0, 6]]
      codec.compress("abcabcabc", 2).should == ["abc", [0, 6]]
      codec.compress("xabcdabcdy", 2).should == ["xabcda", [2, 3], "y"]
      codec.compress("xabcdabcdy", 1).should == ["xabcd", [1, 4], "y"]
      codec.compress("xabcabcy", 2).should == ["xabca", [2, 2], "y"]
    end

    # "aaaa" should compress down to ["a", [0, 3]]
    it "picks the longest match on clashes"

    #           11
    # 0123 45678901
    # encode("xaby", "abababab", 1) would be more efficiently encoded as
    #
    #   ["x", [1, 2], [4, 6]]
    #
    # where [4, 6] refers to the decoded target itself, in the style of
    # VCDIFF. See RFC 3284 section 3, where COPY 4, 4 + COPY 12, 24 is used.
    #
    # this should probably only be allowed with a flag or something.
    #
    # note that compress is more efficient for this type of input,
    # since the "source" is everything to the left of the current position:
    #
    #   compress("abababab", 1) #=> ["ab", [0, 6]]
    it "can refer to its own target"

    it "handles binary" do
      codec = BentleyMcIlroy::Codec
      str = ("\x52\303\x66" * 3)
      str.force_encoding("BINARY") if str.respond_to?(:force_encoding)

      codec.compress(str, 3).should == ["\x52\303\x66", [0, 6]]
    end
  end

  describe ".decompress" do
    it "converts arrays representing compressed strings into the full string" do
      codec = BentleyMcIlroy::Codec
      codec.decompress(["abc", [0, 6]]).should == "abcabcabc"
      codec.decompress(["abcdef", [0, 3]]).should == "abcdefabc"
      codec.decompress(["xabcda", [2, 3], "y"]).should == "xabcdabcdy"
      codec.decompress(["xabcd", [1, 4], "y"]).should == "xabcdabcdy"
      codec.decompress(["xabca", [2, 2], "y"]).should == "xabcabcy"
    end

    it "round-trips with the compression method" do
      codec = BentleyMcIlroy::Codec
      %w[aaaaaaaaa abcabcabcabc abababab abcdefabc abcdefabcdef abcabcabc xabcdabcdy xabcabcy].each do |s|
        (1..4).each do |n|
          codec.decompress(codec.compress(s, n)).should == s
        end
      end
    end
  end

  describe ".encode" do
    it "encodes strings" do
      codec = BentleyMcIlroy::Codec
      codec.encode("abcdef", "defghiabc", 3).should == [[3, 3], "ghi", [0, 3]]
      codec.encode("abcdef", "defghiabc", 2).should == ["d", [4, 2], "ghi", [0, 3]]
      codec.encode("abcdef", "defghiabc", 1).should == [[3, 3], "ghi", [0, 3]]
      codec.encode("abc", "d", 3).should == ["d"]
      codec.encode("abc", "defghi", 3).should == ["defghi"]
      codec.encode("abcdef", "abcdef", 3).should == []
      codec.encode("abc", "abcdef", 3).should == [[0, 3], "def"]
      codec.encode("aaaaa", "aaaaaaaaaa", 3).should == [[0, 5], [0, 5]]
    end
  end

  describe ".decode" do
    it "applies the given delta to the given source" do
      codec = BentleyMcIlroy::Codec
      codec.decode("aaaaa", [[0, 5], [0, 5]]).should == "aaaaaaaaaa"
      codec.decode("abcdef", [[3, 3], "ghi", [0, 3]]).should == "defghiabc"
    end

    it "round-trips with the delta method" do
      codec = BentleyMcIlroy::Codec
      (1..4).each do |n|
        codec.decode("abcdef", codec.encode("abcdef", "defghiabc", n)).should == "defghiabc"
      end
    end
  end
end
data/test/rolling_hash_test.rb
ADDED
@@ -0,0 +1,20 @@
require "test_helper"

describe RollingHash do
  describe "#hash(input)" do
    it "hashes the input using a polynomial" do
      hasher = RollingHash.new
      hasher.hash("abc").should == 6432038
      hasher.hash("bcd").should == 6498345
    end
  end

  describe "#next_hash(next_input)" do
    it "takes the previous hash and the given next input and computes the new hash" do
      hasher = RollingHash.new
      hasher.hash("abc")
      new_h = hasher.next_hash("d")
      new_h.should == RollingHash.new.hash("bcd")
    end
  end
end
data/test/test_helper.rb
ADDED
@@ -0,0 +1 @@
require "bentley_mcilroy"
metadata
ADDED
@@ -0,0 +1,90 @@
--- !ruby/object:Gem::Specification
name: bentley_mcilroy
version: !ruby/object:Gem::Version
  version: 0.0.1
prerelease:
platform: ruby
authors:
- Adam Prescott
autorequire:
bindir: bin
cert_chain: []
date: 2013-09-09 00:00:00.000000000 Z
dependencies:
- !ruby/object:Gem::Dependency
  name: rake
  requirement: !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ! '>='
      - !ruby/object:Gem::Version
        version: '0'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ! '>='
      - !ruby/object:Gem::Version
        version: '0'
- !ruby/object:Gem::Dependency
  name: rspec
  requirement: !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ! '>='
      - !ruby/object:Gem::Version
        version: '0'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ! '>='
      - !ruby/object:Gem::Version
        version: '0'
description: A compression scheme using the Bentley-McIlroy data compression technique
  of finding long common substrings.
email:
- adam@aprescott.com
executables: []
extensions: []
extra_rdoc_files: []
files:
- lib/bentley_mcilroy.rb
- lib/rolling_hash.rb
- test/test_helper.rb
- test/bentley_mcilroy_test.rb
- test/rolling_hash_test.rb
- LICENSE
- README.md
- bentley_mcilroy.gemspec
- rakefile
homepage: https://github.com/aprescott/bentley_mcilroy
licenses: []
post_install_message:
rdoc_options: []
require_paths:
- lib
required_ruby_version: !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ! '>='
    - !ruby/object:Gem::Version
      version: '0'
required_rubygems_version: !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ! '>='
    - !ruby/object:Gem::Version
      version: '0'
requirements: []
rubyforge_project:
rubygems_version: 1.8.24
signing_key:
specification_version: 3
summary: Bentley-McIlroy compression scheme implementation in Ruby.
test_files:
- test/test_helper.rb
- test/bentley_mcilroy_test.rb
- test/rolling_hash_test.rb