bentley_mcilroy 0.0.1

data/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ (MIT License)
2
+
3
+ Copyright (c) 2013 Adam Prescott
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,133 @@
1
+ A Ruby implementation of Bentley-McIlroy's data compression scheme to encode
2
+ compressed versions of strings, and compute deltas between source and target.
3
+
4
+ Note the compression and delta encodings are simply represented with Ruby
5
+ objects, and are independent of any particular binary format.
6
+
7
+ The fingerprinting algorithm is the rolling hash frequently used for Rabin-Karp
8
+ string matching.
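As a rough illustration of the rolling hash idea, here is a minimal sketch. The method names `window_hash` and `roll` are hypothetical and are not part of this library's API; the default base and modulus are simply reasonable choices for a polynomial hash, and the sketch assumes single-character strings for `outgoing` and `incoming`.

```ruby
# Hypothetical helpers for illustration only (not this library's API).

# Polynomial hash of a whole window:
#   s[0].ord * base**(n-1) + ... + s[n-1].ord * base**0, reduced modulo mod.
def window_hash(str, base: 257, mod: 1_000_000_007)
  str.each_char.inject(0) { |h, ch| (h * base + ch.ord) % mod }
end

# Slide the window one character to the right: remove the contribution of
# `outgoing` (the old first character) and append `incoming`.
def roll(hash, outgoing, incoming, window_size, base: 257, mod: 1_000_000_007)
  top = base.pow(window_size - 1, mod) # weight of the outgoing character
  ((hash - outgoing.ord * top) * base + incoming.ord) % mod
end

h = window_hash("abc")
roll(h, "a", "d", 3) == window_hash("bcd") #=> true
```

The point of rolling is that moving the window by one character costs a constant amount of work, instead of rehashing all `window_size` characters.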
9
+
10
+ # Usage
11
+
12
+ To compress a string, pass the input and block size.
13
+
14
+ codec = BentleyMcIlroy::Codec
15
+ codec.compress("aaaaaa", 3) #=> ["a", [0, 5]]
16
+ codec.compress("abcabcabc", 3) #=> ["abc", [0, 6]]
17
+ codec.compress("xabcdabcdy", 2) #=> ["xabcda", [2, 3], "y"]
18
+ codec.compress("xabcdabcdy", 1) #=> ["xabcd", [1, 4], "y"]
19
+
20
+ # Modes of operation
21
+
22
+ This library supports two modes of operation: compression and delta encoding.
23
+ With compression, a single input is compressed. With delta encoding, there is a
24
+ (non-empty) source and a target, and the result is a delta which can be
25
+ used to reconstruct the target, given the source. Compression is a special
26
+ case of delta encoding where there is no source.
27
+
28
+ With compression, the source data is everything to the left of the position we've
29
+ reached along the string. With delta encoding, the source data is fixed for the
30
+ entire time we move left-to-right through the target string.
31
+
32
+ Compression:
33
+
34
+ codec.compress("aaaaaa", 3) #=> ["a", [0, 5]]
35
+ codec.compress("abcabcabc", 3) #=> ["abc", [0, 6]]
36
+ codec.compress("xabcdabcdy", 2) #=> ["xabcda", [2, 3], "y"]
37
+ codec.compress("xabcdabcdy", 1) #=> ["xabcd", [1, 4], "y"]
38
+
39
+ Delta encoding is similar:
40
+
41
+ codec.encode("abcd", "xabcdyabcdz", 1) #=> ["x", [0, 4], "y", [0, 4], "z"]
42
+ codec.encode("xyz", "xyz", 3) #=> []
43
+
44
+ To decompress:
45
+
46
+ codec.decompress(["xabcd", [1, 4], "y"]) #=> "xabcdabcdy"
47
+
48
+ To decode a delta against a source:
49
+
50
+ codec.decode("abcd", ["x", [0, 4], "y", [0, 4], "z"]) #=> "xabcdyabcdz"
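The difference between the two decoders can be sketched as follows. This is a stand-alone illustration with hypothetical method names (`expand` and `apply_delta`), not a copy of the library's code. In compression, an `[index, length]` pair points back into the output produced so far, so the copy may overlap itself and is done one character at a time. In delta decoding, the pair points into the fixed source.

```ruby
# Hypothetical stand-alone decoders for illustration only.

# Compression: copies refer to the output built so far and may overlap it.
def expand(sequence)
  sequence.each_with_object(+"") do |op, out|
    if op.is_a?(Array)
      index, length = op
      length.times { |k| out << out[index + k] } # character by character
    else
      out << op # literal text
    end
  end
end

# Delta encoding: copies refer to the fixed source string.
def apply_delta(source, delta)
  delta.each_with_object(+"") do |op, out|
    if op.is_a?(Array)
      index, length = op
      out << source[index, length]
    else
      out << op # literal text
    end
  end
end

expand(["a", [0, 5]])                                    #=> "aaaaaa"
apply_delta("abcd", ["x", [0, 4], "y", [0, 4], "z"])     #=> "xabcdyabcdz"
```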
51
+
52
+ # About Bentley-McIlroy
53
+
54
+ The Bentley-McIlroy compression scheme is an algorithm for compressing a
55
+ string by finding long common substrings. The algorithm and its properties
56
+ are described in greater detail in their [1999 paper][bentley-mcilroy paper]. The technique, with a
57
+ source dictionary and a target string, is used in Google's implementation of
58
+ a VCDIFF encoder, [open-vcdiff][open-vcdiff project], as part of encoding deltas.
59
+
60
+ [bentley-mcilroy paper]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.11.8470&rep=rep1&type=pdf
61
+ [open-vcdiff project]: http://code.google.com/p/open-vcdiff/
62
+
63
+ To give a brief summary, the algorithm works by fixing a window of block size
64
+ b and then sliding over the string, storing the fingerprint of every b-th
65
+ window. These stored fingerprints are then used to detect repetitions later
66
+ on in the string.
67
+
68
+ The algorithm in pseudocode, as given in the paper, is:
69
+
70
+ initialize fp
71
+ for (i = b; i < n; i++)
72
+ if (i % b == 0)
73
+ store(fp, i)
74
+ update fp to include a[i] and exclude a[i-b]
75
+ checkformatch(fp, i)
76
+
77
+ In the algorithm above, `checkformatch(fp, i)` looks up the fingerprint `fp` in a
78
+ hash table and then encodes a match if one is found.
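The loop can be sketched in Ruby as below. This is a simplified illustration, not the paper's or this library's code: `block_matches` is a hypothetical name, the window's own contents are used as the hash key in place of a rolling fingerprint, and matches are only reported, not encoded or skipped over.

```ruby
# Hypothetical sketch: report [stored_position, current_position] for every
# window whose contents equal an earlier window that started on a block
# boundary (a multiple of b).
def block_matches(text, b)
  table = {}   # window contents => start index of the stored block
  matches = []

  (0..text.length - b).each do |start|
    window = text[start, b]

    # "checkformatch": look the window up among the stored blocks
    seen_at = table[window]
    matches << [seen_at, start] if seen_at

    # only every b-th window is stored; keep the earliest occurrence
    table[window] ||= start if (start % b).zero?
  end

  matches
end

block_matches("xabcdabcdy", 2) #=> [[2, 6]]
```

Storing only every b-th window keeps the table small, at the cost of missing repeats shorter than b and of finding longer repeats somewhere in their middle rather than at their start.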
79
+
80
+ `checkformatch(fp, i)` is the core piece of this algorithm, and "encodes a
81
+ match" is not fully described in the paper. The rest of the algorithm simply
82
+ describes moving through the string with a sliding window, looking at
83
+ substrings and storing fingerprints whenever we cross a block boundary.
84
+
85
+ As described in the paper, suppose b = 100 and that the current block matches
86
+ block 56 (i.e., bytes 5600 through to 5699). This current block could then be
87
+ encoded as <5600,100>.
88
+
89
+ There are two similar improvements which can be made, so as to prevent
90
+ `"ababab"` from compressing into `"ab<0,2><0,2>"`, both of which are also in the
91
+ paper. When we know that the current block matches block 56, we can extend
92
+ the match as far back as possible, not exceeding b - 1 bytes. Similarly, we
93
+ can move the match as far forward as possible without limitation.
94
+
95
+ The reason there is a limit of b-1 bytes when moving backwards is that if
96
+ there were more to match beyond b-1 bytes, it would've been found in a
97
+ previous iteration of the loop.
98
+
99
+ This library implementation moves matches forward, but does not move matches
100
+ backwards.
101
+
102
+ To be more explicit about what extending the match means, consider
103
+
104
+ xabcdabcdy (the string)
105
+ 0123456789 (indices)
106
+
107
+ with a block size of b = 2. Moving left to right, the fingerprints of `"xa"`,
108
+ `"ab"`, `"bc"`, ..., are computed, but only `"xa"`, `"bc"`, `"da"`, ... are stored. When
109
+ `"ab"` is seen at `5..6`, there is no corresponding entry in the hash table, so
110
+ nothing is done, yet. On the next substring of length 2, `"bc"`, at positions
111
+ `6..7`, there _is_ a corresponding entry in the hash table, so there's a match,
112
+ which we could encode as `<2, 2>`, say. However, we'd like to _actually_ produce
113
+ `<1, 4>`, which is more efficient. So starting with `<2, 2>`, we move the match
114
+ back 1 character for both the `"bc"` at `6..7` and the `"bc"` at `2..3`, then check
115
+ if `1..3` matches `5..7`, which it does. This is moving the match backwards.
116
+
117
+ For moving the match forwards, simply do the same thing. Check if `1..4` matches
118
+ `5..8`, which it does. `1..5` does not match `5..9`, so we use `<1, 4>` and we're done.
119
+
120
+ The resulting string, with both backward and forward extension, is `xabcd<1, 4>y`. In
121
+ the case of no backward extensions, it is `xabcda<2, 3>y`.
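The extension step described above can be sketched like this. `extend_match` is a hypothetical helper, not part of this library, and for brevity it ignores the extra constraint that a backward extension must not reach into text already covered by a previously encoded match.

```ruby
# Hypothetical helper: given that text[src, b] == text[cur, b] with src < cur,
# grow the match and return [start_in_source, length].
def extend_match(text, src, cur, b, backward: true)
  back = 0
  if backward
    # step left while the preceding characters agree, at most b - 1 times
    while back < b - 1 && src - back - 1 >= 0 &&
          text[src - back - 1] == text[cur - back - 1]
      back += 1
    end
  end

  length = b
  # step right while the following characters agree, with no upper limit
  while cur + length < text.length &&
        text[src + length] == text[cur + length]
    length += 1
  end

  [src - back, length + back]
end

extend_match("xabcdabcdy", 2, 6, 2)                  #=> [1, 4]
extend_match("xabcdabcdy", 2, 6, 2, backward: false) #=> [2, 3]
```

The two results correspond to `xabcd<1, 4>y` and `xabcda<2, 3>y` respectively.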
122
+
123
+ # License
124
+
125
+ Copyright (c) Adam Prescott, released under the MIT license. See the license file.
126
+
127
+ # TODO
128
+
129
+ compress("abcaaaaaa", 1) -> ["abc", [0, 1], [0, 1], [0, 1], [0, 1], [0, 1], [0, 1]]
130
+
131
+ Can this be fixed to be: ["abc", [0, 1], [3, 5]] ? Essentially following the paper
132
+ and picking the longest match on a clash (here, index 0 and index 3 are hit for
133
+ index 4, but index 3 leads to a better result when the match is extended forward)
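One possible direction, sketched here under assumptions rather than taken from the library: when several stored positions share a fingerprint, measure how far each would extend forward and keep the longest. The helper names `forward_length` and `longest_candidate` are hypothetical.

```ruby
# Hypothetical helpers for illustration only.

# Number of characters for which text[src..] and text[cur..] agree.
def forward_length(text, src, cur)
  length = 0
  length += 1 while cur + length < text.length &&
                    text[src + length] == text[cur + length]
  length
end

# Among candidate source positions to the left of cur, return the
# [position, length] pair with the longest forward match (nil if none).
def longest_candidate(text, candidates, cur)
  candidates.select { |src| src < cur }
            .map { |src| [src, forward_length(text, src, cur)] }
            .max_by { |_, len| len }
end

longest_candidate("abcaaaaaa", [0, 3], 4) #=> [3, 5]
```

Checking every candidate costs more time per lookup, so whether this is worthwhile probably depends on how often fingerprints clash in real inputs.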
data/bentley_mcilroy.gemspec ADDED
@@ -0,0 +1,14 @@
1
+ Gem::Specification.new do |s|
2
+ s.name = "bentley_mcilroy"
3
+ s.version = "0.0.1"
4
+ s.authors = ["Adam Prescott"]
5
+ s.email = ["adam@aprescott.com"]
6
+ s.homepage = "https://github.com/aprescott/bentley_mcilroy"
7
+ s.summary = "Bentley-McIlroy compression scheme implementation in Ruby."
8
+ s.description = "A compression scheme using the Bentley-McIlroy data compression technique of finding long common substrings."
9
+ s.files = Dir["{lib/**/*,test/**/*}"] + %w[LICENSE README.md bentley_mcilroy.gemspec rakefile]
10
+ s.test_files = Dir["test/*"]
11
+ s.require_path = "lib"
12
+ s.add_development_dependency "rake"
13
+ s.add_development_dependency "rspec"
14
+ end
data/lib/bentley_mcilroy.rb ADDED
@@ -0,0 +1,236 @@
1
+ require "rolling_hash"
2
+
3
+ module BentleyMcIlroy
4
+ # A fixed block of text, appearing in the original text at one of
5
+ # 0..b-1, b..2b-1, 2b..3b-1, ...
6
+ class Block
7
+ attr_reader :text, :position
8
+
9
+ def initialize(text, position)
10
+ @text = text
11
+ @position = position
12
+ end
13
+
14
+ def hash
15
+ RollingHash.new.hash(text)
16
+ end
17
+ end
18
+
19
+ # A container for the original text we're processing. Divides the text into
20
+ # Block objects.
21
+ class BlockSequencedText
22
+ attr_reader :blocks, :text
23
+
24
+ def initialize(text, block_size)
25
+ @text = text
26
+ @block_size = block_size
27
+ @blocks = []
28
+
29
+ # "onetwothree" -> ["one", "two", "thr", "ee"]
30
+ @text.scan(/.(?:.?){#{@block_size-1}}/).each.with_index do |text_block, index|
31
+ @blocks << Block.new(text_block, index * @block_size)
32
+ end
33
+ end
34
+ end
35
+
36
+ # Look-up table with a #find method which finds an appropriate block and then
37
+ # modifies the match to extend it to more characters.
38
+ class BlockFingerprintTable
39
+ def initialize(block_sequenced_text)
40
+ @blocked_text = block_sequenced_text
41
+ @hash = {}
42
+
43
+ @blocked_text.blocks.each do |block|
44
+ (@hash[block.hash] ||= []) << block
45
+ end
46
+ end
47
+
48
+ def find_for_compress(fingerprint, block_size, target, position)
49
+ source = @blocked_text.text
50
+ find(fingerprint, block_size, source, target, position)
51
+ end
52
+
53
+ def find_for_diff(fingerprint, block_size, target)
54
+ source = @blocked_text.text
55
+ find(fingerprint, block_size, source, target)
56
+ end
57
+
58
+ private
59
+
60
+ def find(fingerprint, block_size, source, target, position = nil)
61
+ blocks = @hash[fingerprint]
62
+ return nil unless blocks
63
+
64
+ blocks.each do |block|
65
+ next unless block.text == target[0, block_size]
66
+
67
+ # in compression, since we don't have true source and target strings as
68
+ # separate things, we have to ensure that we don't use a fingerprinted
69
+ # block which appears _after_ the current position, otherwise
70
+ #
71
+ # a<x, 0> with x > 0
72
+ #
73
+ # might happen, or similar. since blocks are ordered left to right in the
74
+ # string, we can just return nil, because we know there's not going to be
75
+ # a valid block for compression.
76
+ if position && block.position >= position
77
+ return nil
78
+ end
79
+
80
+ # we know that block matches, so cut it from the beginning,
81
+ # so we can then see how much of the rest also matches
82
+ source_match = source[block.position + block_size..-1]
83
+ target_match = target[block_size..-1]
84
+
85
+ # in a backwards extension, we can see how many of the characters before
86
+ # +position+ (up to the previous block we covered, which is +limit+) match
87
+ # the characters before block.position (up to b-1 characters). In other words, we can
88
+ # find the maximum i such that
89
+ #
90
+ # original_text[position-k, 1] == original_text[block.position-k, 1]
91
+ #
92
+ # for all k in {1, 2, ..., i}, where i <= b-1
93
+
94
+ # it may be that the block we've matched on reaches to the end of the
95
+ # string, in which case, bail
96
+ if source_match.empty? || target_match.empty?
97
+ return block
98
+ end
99
+
100
+ end_index = find_end_index(source_match, target_match)
101
+ match = produce_match(end_index, block, source)
102
+ return match
103
+ end
104
+
105
+ nil
106
+ end
107
+
108
+ def find_end_index(source, target)
109
+ end_index = 0
110
+ any_match = false
111
+ while end_index < source.length && end_index < target.length && source[end_index, 1] == target[end_index, 1]
112
+ any_match = true
113
+ end_index += 1
114
+ end
115
+ # undo the final increment, since that's where it failed the equality check
116
+ end_index -= 1
117
+
118
+ any_match ? end_index : nil
119
+ end
120
+
121
+ def produce_match(end_index, block, source)
122
+ text = block.text
123
+ if end_index # we have more to grab in the string
124
+ text += source[0..end_index]
125
+ end
126
+ Block.new(text, block.position)
127
+ end
128
+ end
129
+
130
+ class Codec
131
+ def self.decompress(sequence)
132
+ sequence.inject("") do |result, i|
133
+ if i.is_a?(Array)
134
+ index, length = i
135
+ length.times do |k|
136
+ result << result[index+k, 1]
137
+ end
138
+ result
139
+ else
140
+ result << i
141
+ end
142
+ end
143
+ end
144
+
145
+ def self.decode(source, delta)
146
+ delta.inject("") do |result, i|
147
+ if i.is_a?(Array)
148
+ index, length = i
149
+ result << source[index, length]
150
+ else
151
+ result << i
152
+ end
153
+ end
154
+ end
155
+
156
+ def self.compress(text, block_size)
157
+ __compress_encode__(text, nil, block_size)
158
+ end
159
+
160
+ def self.encode(source, target, block_size)
161
+ __compress_encode__(source, target, block_size)
162
+ end
163
+
164
+ private
165
+
166
+ def self.__compress_encode__(source, target, block_size)
167
+ return [] if source == target
168
+
169
+ block_sequenced_text = BlockSequencedText.new(source, block_size)
170
+ table = BlockFingerprintTable.new(block_sequenced_text)
171
+ output = []
172
+ buffer = ""
173
+ current_hash = nil
174
+ hasher = RollingHash.new
175
+
176
+ mode = (target ? :diff : :compress)
177
+
178
+ if mode == :compress
179
+ # it's the source we're compressing, there is no target
180
+ text = source
181
+ else
182
+ # it's the target we're compressing against the source
183
+ text = target
184
+ end
185
+
186
+ position = 0
187
+ while position < text.length
188
+
189
+ if text.length - position < block_size
190
+ # if there isn't a block-sized substring in the remaining text, stop.
191
+ # note that we could add the buffer to the output here, but if block_size
192
+ # is 1, text.length - position < 1 can't be true, so the final character
193
+ # would go missing. so appending to the buffer goes below, outside the
194
+ # while loop.
195
+ break
196
+ end
197
+
198
+ # if we've recently found a block of text which matches and added that to
199
+ # the output, current_hash will be reset to nil, so get the new hash. note
200
+ # that we can't just use next_hash, because we might have skipped several
201
+ # characters in one go, which breaks the rolling aspect of the hash
202
+ if !current_hash
203
+ current_hash = hasher.hash(text[position, block_size])
204
+ else
205
+ # position-1 is the previous position, + block_size to get the last
206
+ # character of the current block
207
+ current_hash = hasher.next_hash(text[position-1 + block_size, 1])
208
+ end
209
+
210
+ match = target ? table.find_for_diff(current_hash, block_size, target[position..-1]) :
211
+ table.find_for_compress(current_hash, block_size, text[position..-1], position)
212
+
213
+ if match
214
+ if !buffer.empty?
215
+ output << buffer
216
+ buffer = ""
217
+ end
218
+
219
+ output << [match.position, match.text.length]
220
+ position += match.text.length
221
+ current_hash = nil
222
+ # get a new hasher, because we've skipped over by match.text.length
223
+ # characters, so the rolling hash's next_hash won't work
224
+ hasher = RollingHash.new
225
+ else
226
+ buffer << text[position, 1]
227
+ position += 1
228
+ end
229
+ end
230
+
231
+ remainder = buffer + text[position..-1]
232
+ output << remainder if !remainder.empty?
233
+ output
234
+ end
235
+ end
236
+ end
data/lib/rolling_hash.rb ADDED
@@ -0,0 +1,101 @@
1
+ if RUBY_VERSION < "1.9"
2
+ class String
3
+ def ord
4
+ self[0]
5
+ end
6
+ end
7
+ end
8
+
9
+ # Rolling hash as used in Rabin-Karp.
10
+ #
11
+ # hasher = RollingHash.new
12
+ # hasher.hash("abc") #=> 6432038
13
+ # hasher.next_hash("d") #=> 6498345
14
+ # ||
15
+ # hasher.hash("bcd") #=> 6498345
16
+ class RollingHash
17
+ def initialize(hash = {})
18
+ hash = { :base => 257, # prime
19
+ :mod => 1000000007
20
+ }.merge!(hash)
21
+ @base = hash[:base]
22
+ @mod = hash[:mod]
23
+ end
24
+
25
+ # Compute @base**power working modulo @mod
26
+ def modulo_exp(power)
27
+ self.class.modulo_exp(@base, power, @mod)
28
+ end
29
+
30
+ # Given a string "abc...xyz" with length len,
31
+ # return the hash using @base as
32
+ #
33
+ # "a".ord * @base**(len - 1) +
34
+ # "b".ord * @base**(len - 2) +
35
+ # ... +
36
+ # "y".ord * @base**(1) +
37
+ # "z".ord * @base**0 (== "z".ord)
38
+ def hash(input)
39
+ hash = 0
40
+ characters = input.split("")
41
+ input_length = characters.length
42
+
43
+ characters.each_with_index do |character, index|
44
+ hash += character.ord * modulo_exp(input_length - 1 - index) % @mod
45
+ hash = hash % @mod
46
+ end
47
+ @prev_hash = hash
48
+ @prev_input = input
49
+ @highest_power = input_length - 1
50
+ hash
51
+ end
52
+
53
+ # Returns the hash of (@prev_input[1..-1] + character)
54
+ # by using @prev_hash, so that the sum turns from
55
+ #
56
+ # "a".ord * @base**(len - 1) +
57
+ # "b".ord * @base**(len - 2) +
58
+ # ... +
59
+ # "y".ord * @base**(1) +
60
+ # "z".ord * @base**0 (== "z".ord)
61
+ #
62
+ # into
63
+ #
64
+ # "b".ord * @base**(len - 1) +
65
+ # ... +
66
+ # "y".ord * @base**(2) +
67
+ # "z".ord * @base**1 +
68
+ # character.ord * @base**0
69
+ def next_hash(character)
70
+ # the leading value of the computed sum
71
+ char_to_subtract = @prev_input.chars.first
72
+ hash = @prev_hash
73
+
74
+ # subtract the leading value
75
+ hash = hash - char_to_subtract.ord * @base**@highest_power
76
+
77
+ # shift everything over to the left by 1, and add the
78
+ # new character as the lowest value
79
+ hash = (hash * @base) + character.ord
80
+ hash = hash % @mod
81
+
82
+ # trim off the first character
83
+ @prev_input.slice!(0)
84
+ @prev_input << character
85
+ @prev_hash = hash
86
+
87
+ hash
88
+ end
89
+
90
+ private
91
+
92
+ # Returns n**power but reduced modulo mod
93
+ # at each step of the calculation.
94
+ def self.modulo_exp(n, power, mod)
95
+ value = 1
96
+ power.times do
97
+ value = (n * value) % mod
98
+ end
99
+ value
100
+ end
101
+ end
data/rakefile ADDED
@@ -0,0 +1,11 @@
1
+ require "rake"
2
+ require "rspec/core/rake_task"
3
+
4
+ RSpec::Core::RakeTask.new(:test) do |t|
5
+ t.rspec_opts = "-I test --color --format nested"
6
+ t.pattern = "test/**/*_test.rb"
7
+ t.verbose = false
8
+ t.fail_on_error = true
9
+ end
10
+
11
+ task :default => :test
data/test/bentley_mcilroy_test.rb ADDED
@@ -0,0 +1,99 @@
1
+ require "test_helper"
2
+
3
+ describe BentleyMcIlroy::Codec do
4
+ describe ".compress" do
5
+ it "compresses strings" do
6
+ codec = BentleyMcIlroy::Codec
7
+ str = "aaaaaaaaaaaaaaaaaaaaaaa"
8
+
9
+ (1..10).each { |i| codec.compress(str, i).should == [str[0, 1], [0, str.length-1]] }
10
+
11
+ codec.compress("abcabcabc", 3).should == ["abc", [0, 6]]
12
+ codec.compress("abababab", 2).should == ["ab", [0, 6]]
13
+ codec.compress("abcdefabc", 3).should == ["abcdef", [0, 3]]
14
+ codec.compress("abcdefabcdef", 3).should == ["abcdef", [0, 6]]
15
+ codec.compress("abcabcabc", 2).should == ["abc", [0, 6]]
16
+ codec.compress("xabcdabcdy", 2).should == ["xabcda", [2, 3], "y"]
17
+ codec.compress("xabcdabcdy", 1).should == ["xabcd", [1, 4], "y"]
18
+ codec.compress("xabcabcy", 2).should == ["xabca", [2, 2], "y"]
19
+ end
20
+
21
+ # "aaaa" should compress down to ["a", [0, 3]]
22
+ it "picks the longest match on clashes"
23
+
24
+ # 11
25
+ # 0123 45678901
26
+ # encode("xaby", "abababab", 1) would be more efficiently encoded as
27
+ #
28
+ # ["x", [1, 2], [4, 6]]
29
+ #
30
+ # where [4, 6] refers to the decoded target itself, in the style of
31
+ # VCDIFF. See RFC3284 section 3, where COPY 4, 4 + COPY 12, 24 is used.
32
+ #
33
+ # this should probably only be allowed with a flag or something.
34
+ #
35
+ # note that compress is more efficient for this type of input,
36
+ # since the "source" is everything to the left of the current position:
37
+ #
38
+ # compress("abababab", 1) #=> ["ab", [0, 6]]
39
+ it "can refer to its own target"
40
+
41
+ it "handles binary" do
42
+ codec = BentleyMcIlroy::Codec
43
+ str = ("\x52\303\x66" * 3)
44
+ str.force_encoding("BINARY") if str.respond_to?(:force_encoding)
45
+
46
+ codec.compress(str, 3).should == ["\x52\303\x66", [0, 6]]
47
+ end
48
+ end
49
+
50
+ describe ".decompress" do
51
+ it "converts arrays representing compressed strings into the full string" do
52
+ codec = BentleyMcIlroy::Codec
53
+ codec.decompress(["abc", [0, 6]]).should == "abcabcabc"
54
+ codec.decompress(["abcdef", [0, 3]]).should == "abcdefabc"
55
+ codec.decompress(["xabcda", [2, 3], "y"]).should == "xabcdabcdy"
56
+ codec.decompress(["xabcd", [1, 4], "y"]).should == "xabcdabcdy"
57
+ codec.decompress(["xabca", [2, 2], "y"]).should == "xabcabcy"
58
+ end
59
+
60
+ it "round-trips with the compression method" do
61
+ codec = BentleyMcIlroy::Codec
62
+ %w[aaaaaaaaa abcabcabcabc abababab abcdefabc abcdefabcdef abcabcabc xabcdabcdy xabcabcy].each do |s|
63
+ (1..4).each do |n|
64
+ codec.decompress(codec.compress(s, n)).should == s
65
+ end
66
+ end
67
+ end
68
+ end
69
+
70
+ describe ".encode" do
71
+ it "encodes strings" do
72
+ codec = BentleyMcIlroy::Codec
73
+ codec.encode("abcdef", "defghiabc", 3).should == [[3, 3], "ghi", [0, 3]]
74
+ codec.encode("abcdef", "defghiabc", 2).should == ["d", [4, 2], "ghi", [0, 3]]
75
+ codec.encode("abcdef", "defghiabc", 1).should == [[3, 3], "ghi", [0, 3]]
76
+ codec.encode("abc", "d", 3).should == ["d"]
77
+ codec.encode("abc", "defghi", 3).should == ["defghi"]
78
+ codec.encode("abcdef", "abcdef", 3).should == []
79
+ codec.encode("abc", "abcdef", 3).should == [[0, 3], "def"]
80
+ codec.encode("aaaaa", "aaaaaaaaaa", 3).should == [[0, 5], [0, 5]]
81
+ end
82
+ end
83
+
84
+ describe ".decode" do
85
+ it "applies the given delta to the given source" do
86
+ codec = BentleyMcIlroy::Codec
87
+ codec.decode("aaaaa", [[0, 5], [0, 5]]).should == "aaaaaaaaaa"
88
+ codec.decode("abcdef", [[3, 3], "ghi", [0, 3]]).should == "defghiabc"
89
+ end
90
+
91
+ it "round-trips with the delta method" do
92
+ codec = BentleyMcIlroy::Codec
93
+ (1..4).each do |n|
94
+ codec.decode("abcdef", codec.encode("abcdef", "defghiabc", n)).should == "defghiabc"
95
+ end
96
+ end
97
+ end
98
+ end
99
+
data/test/rolling_hash_test.rb ADDED
@@ -0,0 +1,20 @@
1
+ require "test_helper"
2
+
3
+ describe RollingHash do
4
+ describe "#hash(input)" do
5
+ it "hashes the input using a polynomial" do
6
+ hasher = RollingHash.new
7
+ hasher.hash("abc").should == 6432038
8
+ hasher.hash("bcd").should == 6498345
9
+ end
10
+ end
11
+
12
+ describe "#next_hash(next_input)" do
13
+ it "takes the previous hash and the given next input, and computes the new hash" do
14
+ hasher = RollingHash.new
15
+ h = hasher.hash("abc")
16
+ new_h = hasher.next_hash("d")
17
+ new_h.should == RollingHash.new.hash("bcd")
18
+ end
19
+ end
20
+ end
data/test/test_helper.rb ADDED
@@ -0,0 +1 @@
1
+ require "bentley_mcilroy"
metadata ADDED
@@ -0,0 +1,90 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: bentley_mcilroy
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.0.1
5
+ prerelease:
6
+ platform: ruby
7
+ authors:
8
+ - Adam Prescott
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+ date: 2013-09-09 00:00:00.000000000 Z
13
+ dependencies:
14
+ - !ruby/object:Gem::Dependency
15
+ name: rake
16
+ requirement: !ruby/object:Gem::Requirement
17
+ none: false
18
+ requirements:
19
+ - - ! '>='
20
+ - !ruby/object:Gem::Version
21
+ version: '0'
22
+ type: :development
23
+ prerelease: false
24
+ version_requirements: !ruby/object:Gem::Requirement
25
+ none: false
26
+ requirements:
27
+ - - ! '>='
28
+ - !ruby/object:Gem::Version
29
+ version: '0'
30
+ - !ruby/object:Gem::Dependency
31
+ name: rspec
32
+ requirement: !ruby/object:Gem::Requirement
33
+ none: false
34
+ requirements:
35
+ - - ! '>='
36
+ - !ruby/object:Gem::Version
37
+ version: '0'
38
+ type: :development
39
+ prerelease: false
40
+ version_requirements: !ruby/object:Gem::Requirement
41
+ none: false
42
+ requirements:
43
+ - - ! '>='
44
+ - !ruby/object:Gem::Version
45
+ version: '0'
46
+ description: A compression scheme using the Bentley-McIlroy data compression technique
47
+ of finding long common substrings.
48
+ email:
49
+ - adam@aprescott.com
50
+ executables: []
51
+ extensions: []
52
+ extra_rdoc_files: []
53
+ files:
54
+ - lib/bentley_mcilroy.rb
55
+ - lib/rolling_hash.rb
56
+ - test/test_helper.rb
57
+ - test/bentley_mcilroy_test.rb
58
+ - test/rolling_hash_test.rb
59
+ - LICENSE
60
+ - README.md
61
+ - bentley_mcilroy.gemspec
62
+ - rakefile
63
+ homepage: https://github.com/aprescott/bentley_mcilroy
64
+ licenses: []
65
+ post_install_message:
66
+ rdoc_options: []
67
+ require_paths:
68
+ - lib
69
+ required_ruby_version: !ruby/object:Gem::Requirement
70
+ none: false
71
+ requirements:
72
+ - - ! '>='
73
+ - !ruby/object:Gem::Version
74
+ version: '0'
75
+ required_rubygems_version: !ruby/object:Gem::Requirement
76
+ none: false
77
+ requirements:
78
+ - - ! '>='
79
+ - !ruby/object:Gem::Version
80
+ version: '0'
81
+ requirements: []
82
+ rubyforge_project:
83
+ rubygems_version: 1.8.24
84
+ signing_key:
85
+ specification_version: 3
86
+ summary: Bentley-McIlroy compression scheme implementation in Ruby.
87
+ test_files:
88
+ - test/test_helper.rb
89
+ - test/bentley_mcilroy_test.rb
90
+ - test/rolling_hash_test.rb