bentley_mcilroy 0.0.1

data/LICENSE ADDED
@@ -0,0 +1,21 @@
(MIT License)

Copyright (c) 2013 Adam Prescott

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,133 @@
A Ruby implementation of Bentley-McIlroy's data compression scheme to encode
compressed versions of strings, and compute deltas between source and target.

Note that the compression and delta encodings are simply represented with Ruby
objects, independent of any particular binary format.

The fingerprinting algorithm is the rolling hash frequently used for Rabin-Karp
string matching.

# Usage

To compress a string, pass the input and block size.

    codec = BentleyMcIlroy::Codec
    codec.compress("aaaaaa", 3) #=> ["a", [0, 5]]
    codec.compress("abcabcabc", 3) #=> ["abc", [0, 6]]
    codec.compress("xabcdabcdy", 2) #=> ["xabcda", [2, 3], "y"]
    codec.compress("xabcdabcdy", 1) #=> ["xabcd", [1, 4], "y"]

# Modes of operation

This library supports two modes of operation: compression and delta encoding.
With compression, a single input is compressed. With delta encoding, there is a
(non-empty) source and a target, and the result is a delta which can be
used to reconstruct the target, given the source. Compression is a special
case of delta encoding where there is no source.

With compression, the source data is everything to the left of the position we've
reached along the string. With delta encoding, the source data is fixed for the
entire time we move left-to-right through the target string.

Compression:

    codec.compress("aaaaaa", 3) #=> ["a", [0, 5]]
    codec.compress("abcabcabc", 3) #=> ["abc", [0, 6]]
    codec.compress("xabcdabcdy", 2) #=> ["xabcda", [2, 3], "y"]
    codec.compress("xabcdabcdy", 1) #=> ["xabcd", [1, 4], "y"]

Delta encoding is similar:

    codec.encode("abcd", "xabcdyabcdz", 1) #=> ["x", [0, 4], "y", [0, 4], "z"]
    codec.encode("xyz", "xyz", 3) #=> []

To decompress:

    codec.decompress(["xabcd", [1, 4], "y"]) #=> "xabcdabcdy"

To decode a delta against a source:

    codec.decode("abcd", ["x", [0, 4], "y", [0, 4], "z"]) #=> "xabcdyabcdz"

# About Bentley-McIlroy

The Bentley-McIlroy compression scheme is an algorithm for compressing a
string by finding long common substrings. The algorithm and its properties
are described in greater detail in their [1999 paper][bentley-mcilroy paper]. The technique, with a
source dictionary and a target string, is used in Google's implementation of
a VCDIFF encoder, [open-vcdiff][open-vcdiff project], as part of encoding deltas.

[bentley-mcilroy paper]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.11.8470&rep=rep1&type=pdf
[open-vcdiff project]: http://code.google.com/p/open-vcdiff/

To give a brief summary, the algorithm works by fixing a window of block size
b and sliding it over the string, storing the fingerprint of every b-th
window. These stored fingerprints are then used to detect repetitions later
on in the string.

The algorithm in pseudocode, as given in the paper, is:

    initialize fp
    for (i = b; i < n; i++)
        if (i % b == 0)
            store(fp, i)
        update fp to include a[i] and exclude a[i-b]
        checkformatch(fp, i)

In the algorithm above, `checkformatch(fp, i)` looks up the fingerprint `fp` in a
hash table and then encodes a match if one is found.

`checkformatch(fp, i)` is the core piece of this algorithm, and "encodes a
match" is not fully described in the paper. The rest of the algorithm simply
describes moving through the string with a sliding window, looking at
substrings and storing fingerprints whenever we cross a block boundary.
+
85
+ As described in the paper, suppose b = 100 and that the current block matches
86
+ block 56 (i.e., bytes 5600 through to 5699). This current block could then be
87
+ encoded as <5600,100>.
88
+
89
+ There are two similar improvements which can be made, so as to prevent
90
+ `"ababab"` from compressing into `"ab<0,2><0,2>"`, both of which are also in the
91
+ paper. When we know that the current block matches block 56, we can extend
92
+ the match as far back as possible, not exceeding b - 1 bytes. Similarly, we
93
+ can move the match far forward as possible without limitation.
94
+
95
+ The reason there is a limit of b-1 bytes when moving backwards is that if
96
+ there were more to match beyond b-1 bytes, it would've been found in a
97
+ previous iteration of the loop.
98
+
99
+ This library implementation moves matches forward, but does not move matches
100
+ backwards.
101
+
102
+ To be more explicit about what extending the match means, consider
103
+
104
+ xabcdabcdy (the string)
105
+ 0123456789 (indices)
106
+
107
+ with a block size of b = 2. Moving left to right, the fingerprints of `"xa"`,
108
+ `"ab"`, `"bc"`, ..., are computed, but only `"xa"`, `"bc"`, `"da"`, ... are stored. When
109
+ `"ab"` is seen at `5..6`, there is no corresponding entry in the hash table, so
110
+ nothing is done, yet. On the next substring of length 2, `"bc"`, at positions
111
+ `6..7`, there _is_ a corresponding entry in the hash table, so there's a match,
112
+ which we could encode as `<2, 2>`, say. However, we'd like to _actually_ produce
113
+ `<1, 4>`, which is more efficient. So starting with `<2, 2>`, we move the match
114
+ back 1 character for both the `"bc"` at `6..7` and the `"bc"` at `2..3`, then check
115
+ if `1..3` matches `5..7`, which it does. This is moving the match backwards.
116
+
117
+ For moving the match forwards, simply do the same thing. Check if `1..4` matches
118
+ `6..8`, which it does. `1..5` does not match `6..9`, so we use `<1, 4>` and we're done.
119
+
120
+ The resulting string, with backward- and forward-extension is `xabcd<1, 4>y`. In
121
+ the case of no backward extensions, it is `xabcda<2, 3>y`.
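
The forward step can be written out directly. The helper below is a small
illustration, not part of the library (which does the equivalent work inside
`BlockFingerprintTable#find_end_index`); it grows a match whose first `length`
characters are already known to agree:

    # +source_start+ is where the earlier copy begins, +target_start+ is where
    # the repeat being encoded begins, and +length+ characters already match.
    def extend_forward(text, source_start, target_start, length)
      while target_start + length < text.length &&
            text[source_start + length, 1] == text[target_start + length, 1]
        length += 1
      end
      length
    end

    text = "xabcdabcdy"
    extend_forward(text, 2, 6, 2) #=> 3, i.e. <2, 3> when no backward step is taken
    extend_forward(text, 1, 5, 3) #=> 4, i.e. <1, 4> after stepping back to 1..3 and 5..7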

# License

Copyright (c) Adam Prescott, released under the MIT license. See the license file.

# TODO

    compress("abcaaaaaa", 1) -> ["abc", [0, 1], [0, 1], [0, 1], [0, 1], [0, 1], [0, 1]]

Can this be fixed to be `["abc", [0, 1], [3, 5]]`? Essentially following the paper
and picking the longest match on a clash (here, index 0 and index 3 are both hit for
index 4, but index 3 leads to a better result when the match is extended forward).
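
One way to get there, sketched here against the current internals rather than as
something the gem does today, is for `BlockFingerprintTable#find` to extend every
candidate block for a clashing fingerprint and keep the longest result, reusing the
existing `find_end_index` and `produce_match` helpers:

    # Hypothetical replacement for BlockFingerprintTable#find: extend every
    # candidate block and keep the longest, instead of returning the first hit.
    def find_longest(fingerprint, block_size, source, target, position = nil)
      best = nil
      (@hash[fingerprint] || []).each do |block|
        next unless block.text == target[0, block_size]
        next if position && block.position >= position
        source_match = source[block.position + block_size..-1]
        target_match = target[block_size..-1]
        candidate = if source_match.empty? || target_match.empty?
                      block
                    else
                      produce_match(find_end_index(source_match, target_match), block, source_match)
                    end
        best = candidate if best.nil? || candidate.text.length > best.text.length
      end
      best
    end

With a lookup like this, the clash at index 4 resolves in favor of the block at
index 3 and the output becomes `["abc", [0, 1], [3, 5]]`.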
data/bentley_mcilroy.gemspec ADDED
@@ -0,0 +1,14 @@
Gem::Specification.new do |s|
  s.name = "bentley_mcilroy"
  s.version = "0.0.1"
  s.authors = ["Adam Prescott"]
  s.email = ["adam@aprescott.com"]
  s.homepage = "https://github.com/aprescott/bentley_mcilroy"
  s.summary = "Bentley-McIlroy compression scheme implementation in Ruby."
  s.description = "A compression scheme using the Bentley-McIlroy data compression technique of finding long common substrings."
  s.files = Dir["{lib/**/*,test/**/*}"] + %w[LICENSE README.md bentley_mcilroy.gemspec rakefile]
  s.test_files = Dir["test/*"]
  s.require_path = "lib"
  s.add_development_dependency "rake"
  s.add_development_dependency "rspec"
end
data/lib/bentley_mcilroy.rb ADDED
@@ -0,0 +1,236 @@
require "rolling_hash"

module BentleyMcIlroy
  # A fixed block of text, appearing in the original text at one of
  # 0..b-1, b..2b-1, 2b..3b-1, ...
  class Block
    attr_reader :text, :position

    def initialize(text, position)
      @text = text
      @position = position
    end

    def hash
      RollingHash.new.hash(text)
    end
  end

  # A container for the original text we're processing. Divides the text into
  # Block objects.
  class BlockSequencedText
    attr_reader :blocks, :text

    def initialize(text, block_size)
      @text = text
      @block_size = block_size
      @blocks = []

      # "onetwothree" -> ["one", "two", "thr", "ee"]
      @text.scan(/.(?:.?){#{@block_size-1}}/).each.with_index do |text_block, index|
        @blocks << Block.new(text_block, index * @block_size)
      end
    end
  end

  # Look-up table with a #find method which finds an appropriate block and then
  # modifies the match to extend it to more characters.
  class BlockFingerprintTable
    def initialize(block_sequenced_text)
      @blocked_text = block_sequenced_text
      @hash = {}

      @blocked_text.blocks.each do |block|
        (@hash[block.hash] ||= []) << block
      end
    end

    def find_for_compress(fingerprint, block_size, target, position)
      source = @blocked_text.text
      find(fingerprint, block_size, source, target, position)
    end

    def find_for_diff(fingerprint, block_size, target)
      source = @blocked_text.text
      find(fingerprint, block_size, source, target)
    end

    private

    def find(fingerprint, block_size, source, target, position = nil)
      blocks = @hash[fingerprint]
      return nil unless blocks

      blocks.each do |block|
        next unless block.text == target[0, block_size]

        # in compression, since we don't have true source and target strings as
        # separate things, we have to ensure that we don't use a fingerprinted
        # block which appears _after_ the current position, otherwise
        #
        #   a<x, 0> with x > 0
        #
        # might happen, or similar. since blocks are ordered left to right in the
        # string, we can just return nil, because we know there's not going to be
        # a valid block for compression.
        if position && block.position >= position
          return nil
        end

        # we know that block matches, so cut it from the beginning,
        # so we can then see how much of the rest also matches
        source_match = source[block.position + block_size..-1]
        target_match = target[block_size..-1]

        # in a backwards extension, we could also see how many of the characters
        # before +position+ (back as far as the previous block we covered) match
        # the characters before block.position, up to b-1 of them. In other words,
        # we could find the maximum i such that
        #
        #   original_text[position-k, 1] == original_text[block.position-k, 1]
        #
        # for all k in {1, 2, ..., i}, where i <= b-1

        # it may be that the block we've matched on reaches to the end of the
        # string, in which case, bail
        if source_match.empty? || target_match.empty?
          return block
        end

        end_index = find_end_index(source_match, target_match)
        # only the length of the produced match's text is used by the caller, so
        # it doesn't matter that the extension characters are taken from the start
        # of +source+ rather than from source_match
        match = produce_match(end_index, block, source)
        return match
      end

      nil
    end

    def find_end_index(source, target)
      end_index = 0
      any_match = false
      while end_index < source.length && end_index < target.length && source[end_index, 1] == target[end_index, 1]
        any_match = true
        end_index += 1
      end
      # undo the final increment, since that's where it failed the equality check
      end_index -= 1

      any_match ? end_index : nil
    end

    def produce_match(end_index, block, source)
      text = block.text
      if end_index # we have more to grab in the string
        text += source[0..end_index]
      end
      Block.new(text, block.position)
    end
  end

  class Codec
    def self.decompress(sequence)
      sequence.inject("") do |result, i|
        if i.is_a?(Array)
          index, length = i
          # copy one character at a time so that the copy can overlap text the
          # expansion has just produced itself, e.g. ["a", [0, 5]] expands to "aaaaaa"
          length.times do |k|
            result << result[index+k, 1]
          end
          result
        else
          result << i
        end
      end
    end

    def self.decode(source, delta)
      delta.inject("") do |result, i|
        if i.is_a?(Array)
          index, length = i
          result << source[index, length]
        else
          result << i
        end
      end
    end

    def self.compress(text, block_size)
      __compress_encode__(text, nil, block_size)
    end

    def self.encode(source, target, block_size)
      __compress_encode__(source, target, block_size)
    end

    private

    def self.__compress_encode__(source, target, block_size)
      return [] if source == target

      block_sequenced_text = BlockSequencedText.new(source, block_size)
      table = BlockFingerprintTable.new(block_sequenced_text)
      output = []
      buffer = ""
      current_hash = nil
      hasher = RollingHash.new

      mode = (target ? :diff : :compress)

      if mode == :compress
        # it's the source we're compressing, there is no target
        text = source
      else
        # it's the target we're compressing against the source
        text = target
      end

      position = 0
      while position < text.length

        if text.length - position < block_size
          # if there isn't a block-sized substring in the remaining text, stop.
          # note that we could add the buffer to the output here, but if block_size
          # is 1, text.length - position < 1 can't be true, so the final character
          # would go missing. so appending to the buffer goes below, outside the
          # while loop.
          break
        end

        # if we've recently found a block of text which matches and added that to
        # the output, current_hash will be reset to nil, so get the new hash. note
        # that we can't just use next_hash, because we might have skipped several
        # characters in one go, which breaks the rolling aspect of the hash
        if !current_hash
          current_hash = hasher.hash(text[position, block_size])
        else
          # position-1 is the previous position, + block_size to get the last
          # character of the current block
          current_hash = hasher.next_hash(text[position-1 + block_size, 1])
        end

        match = target ? table.find_for_diff(current_hash, block_size, target[position..-1]) :
                         table.find_for_compress(current_hash, block_size, text[position..-1], position)

        if match
          if !buffer.empty?
            output << buffer
            buffer = ""
          end

          output << [match.position, match.text.length]
          position += match.text.length
          current_hash = nil
          # get a new hasher, because we've skipped over by match.text.length
          # characters, so the rolling hash's next_hash won't work
          hasher = RollingHash.new
        else
          buffer << text[position, 1]
          position += 1
        end
      end

      remainder = buffer + text[position..-1]
      output << remainder if !remainder.empty?
      output
    end
  end
end
data/lib/rolling_hash.rb ADDED
@@ -0,0 +1,101 @@
if RUBY_VERSION < "1.9"
  class String
    def ord
      self[0]
    end
  end
end

# Rolling hash as used in Rabin-Karp.
#
#   hasher = RollingHash.new
#   hasher.hash("abc")    #=> 6432038
#   hasher.next_hash("d") #=> 6498345
#                               ||
#   hasher.hash("bcd")    #=> 6498345
class RollingHash
  def initialize(hash = {})
    hash = { :base => 257, # prime
             :mod => 1000000007
           }.merge!(hash)
    @base = hash[:base]
    @mod = hash[:mod]
  end

  # Compute @base**power working modulo @mod
  def modulo_exp(power)
    self.class.modulo_exp(@base, power, @mod)
  end

  # Given a string "abc...xyz" with length len,
  # return the hash using @base as
  #
  #   "a".ord * @base**(len - 1) +
  #   "b".ord * @base**(len - 2) +
  #   ... +
  #   "y".ord * @base**(1) +
  #   "z".ord * @base**0 (== "z".ord)
  def hash(input)
    hash = 0
    characters = input.split("")
    input_length = characters.length

    characters.each_with_index do |character, index|
      hash += character.ord * modulo_exp(input_length - 1 - index) % @mod
      hash = hash % @mod
    end
    @prev_hash = hash
    @prev_input = input
    @highest_power = input_length - 1
    hash
  end

  # Returns the hash of (@prev_input[1..-1] + character)
  # by using @prev_hash, so that the sum turns from
  #
  #   "a".ord * @base**(len - 1) +
  #   "b".ord * @base**(len - 2) +
  #   ... +
  #   "y".ord * @base**(1) +
  #   "z".ord * @base**0 (== "z".ord)
  #
  # into
  #
  #   "b".ord * @base**(len - 1) +
  #   ... +
  #   "y".ord * @base**(2) +
  #   "z".ord * @base**1 +
  #   character.ord * @base**0
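  #
  # With the numbers from the example above (base 257; the values stay below
  # the modulus, so the reduction never kicks in here):
  #
  #   hash("abc") = 97*257**2 + 98*257 + 99 = 6432038   (97, 98, 99 = "abc" ords)
  #   6432038 - 97*257**2                   = 25285     (drop the leading term)
  #   25285 * 257 + 100                     = 6498345   = hash("bcd") (100 = "d".ord)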
  def next_hash(character)
    # the leading value of the computed sum
    char_to_subtract = @prev_input.chars.first
    hash = @prev_hash

    # subtract the leading value
    hash = hash - char_to_subtract.ord * @base**@highest_power

    # shift everything over to the left by 1, and add the
    # new character as the lowest value
    hash = (hash * @base) + character.ord
    hash = hash % @mod

    # trim off the first character
    @prev_input.slice!(0)
    @prev_input << character
    @prev_hash = hash

    hash
  end

  private

  # Returns n**power but reduced modulo mod
  # at each step of the calculation.
  def self.modulo_exp(n, power, mod)
    value = 1
    power.times do
      value = (n * value) % mod
    end
    value
  end
end
data/rakefile ADDED
@@ -0,0 +1,11 @@
require "rake"
require "rspec/core/rake_task"

RSpec::Core::RakeTask.new(:test) do |t|
  t.rspec_opts = "-I test --color --format nested"
  t.pattern = "test/**/*_test.rb"
  t.verbose = false
  t.fail_on_error = true
end

task :default => :test
data/test/bentley_mcilroy_test.rb ADDED
@@ -0,0 +1,99 @@
require "test_helper"

describe BentleyMcIlroy::Codec do
  describe ".compress" do
    it "compresses strings" do
      codec = BentleyMcIlroy::Codec
      str = "aaaaaaaaaaaaaaaaaaaaaaa"

      (1..10).each { |i| codec.compress(str, i).should == [str[0, 1], [0, str.length-1]] }

      codec.compress("abcabcabc", 3).should == ["abc", [0, 6]]
      codec.compress("abababab", 2).should == ["ab", [0, 6]]
      codec.compress("abcdefabc", 3).should == ["abcdef", [0, 3]]
      codec.compress("abcdefabcdef", 3).should == ["abcdef", [0, 6]]
      codec.compress("abcabcabc", 2).should == ["abc", [0, 6]]
      codec.compress("xabcdabcdy", 2).should == ["xabcda", [2, 3], "y"]
      codec.compress("xabcdabcdy", 1).should == ["xabcd", [1, 4], "y"]
      codec.compress("xabcabcy", 2).should == ["xabca", [2, 2], "y"]
    end

    # "aaaa" should compress down to ["a", [0, 3]]
    it "picks the longest match on clashes"

    #             11
    # 0123 45678901
    # encode("xaby", "abababab", 1) would be more efficiently encoded as
    #
    #   ["x", [1, 2], [4, 6]]
    #
    # where [4, 6] refers to the decoded target itself, in the style of
    # VCDIFF. See RFC3284 section 3, where COPY 4, 4 + COPY 12, 24 is used.
    #
    # this should probably only be allowed with a flag or something.
    #
    # note that compress is more efficient for this type of input,
    # since the "source" is everything to the left of the current position:
    #
    #   compress("abababab", 1) #=> ["ab", [0, 6]]
    it "can refer to its own target"

    it "handles binary" do
      codec = BentleyMcIlroy::Codec
      str = ("\x52\303\x66" * 3)
      str.force_encoding("BINARY") if str.respond_to?(:force_encoding)

      codec.compress(str, 3).should == ["\x52\303\x66", [0, 6]]
    end
  end

  describe ".decompress" do
    it "converts arrays representing compressed strings into the full string" do
      codec = BentleyMcIlroy::Codec
      codec.decompress(["abc", [0, 6]]).should == "abcabcabc"
      codec.decompress(["abcdef", [0, 3]]).should == "abcdefabc"
      codec.decompress(["xabcda", [2, 3], "y"]).should == "xabcdabcdy"
      codec.decompress(["xabcd", [1, 4], "y"]).should == "xabcdabcdy"
      codec.decompress(["xabca", [2, 2], "y"]).should == "xabcabcy"
    end

    it "round-trips with the compression method" do
      codec = BentleyMcIlroy::Codec
      %w[aaaaaaaaa abcabcabcabc abababab abcdefabc abcdefabcdef abcabcabc xabcdabcdy xabcabcy].each do |s|
        (1..4).each do |n|
          codec.decompress(codec.compress(s, n)).should == s
        end
      end
    end
  end

  describe ".encode" do
    it "encodes strings" do
      codec = BentleyMcIlroy::Codec
      codec.encode("abcdef", "defghiabc", 3).should == [[3, 3], "ghi", [0, 3]]
      codec.encode("abcdef", "defghiabc", 2).should == ["d", [4, 2], "ghi", [0, 3]]
      codec.encode("abcdef", "defghiabc", 1).should == [[3, 3], "ghi", [0, 3]]
      codec.encode("abc", "d", 3).should == ["d"]
      codec.encode("abc", "defghi", 3).should == ["defghi"]
      codec.encode("abcdef", "abcdef", 3).should == []
      codec.encode("abc", "abcdef", 3).should == [[0, 3], "def"]
      codec.encode("aaaaa", "aaaaaaaaaa", 3).should == [[0, 5], [0, 5]]
    end
  end

  describe ".decode" do
    it "applies the given delta to the given source" do
      codec = BentleyMcIlroy::Codec
      codec.decode("aaaaa", [[0, 5], [0, 5]]).should == "aaaaaaaaaa"
      codec.decode("abcdef", [[3, 3], "ghi", [0, 3]]).should == "defghiabc"
    end

    it "round-trips with the delta method" do
      codec = BentleyMcIlroy::Codec
      (1..4).each do |n|
        codec.decode("abcdef", codec.encode("abcdef", "defghiabc", n)).should == "defghiabc"
      end
    end
  end
end
data/test/rolling_hash_test.rb ADDED
@@ -0,0 +1,20 @@
require "test_helper"

describe RollingHash do
  describe "#hash(input)" do
    it "hashes the input using a polynomial" do
      hasher = RollingHash.new
      hasher.hash("abc").should == 6432038
      hasher.hash("bcd").should == 6498345
    end
  end

  describe "#next_hash(next_input)" do
    it "takes the previous hash and the given next input and computes the new hash" do
      hasher = RollingHash.new
      h = hasher.hash("abc")
      new_h = hasher.next_hash("d")
      new_h.should == RollingHash.new.hash("bcd")
    end
  end
end
data/test/test_helper.rb ADDED
@@ -0,0 +1 @@
require "bentley_mcilroy"
metadata ADDED
@@ -0,0 +1,90 @@
--- !ruby/object:Gem::Specification
name: bentley_mcilroy
version: !ruby/object:Gem::Version
  version: 0.0.1
  prerelease:
platform: ruby
authors:
- Adam Prescott
autorequire:
bindir: bin
cert_chain: []
date: 2013-09-09 00:00:00.000000000 Z
dependencies:
- !ruby/object:Gem::Dependency
  name: rake
  requirement: !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ! '>='
      - !ruby/object:Gem::Version
        version: '0'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ! '>='
      - !ruby/object:Gem::Version
        version: '0'
- !ruby/object:Gem::Dependency
  name: rspec
  requirement: !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ! '>='
      - !ruby/object:Gem::Version
        version: '0'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    none: false
    requirements:
    - - ! '>='
      - !ruby/object:Gem::Version
        version: '0'
description: A compression scheme using the Bentley-McIlroy data compression technique
  of finding long common substrings.
email:
- adam@aprescott.com
executables: []
extensions: []
extra_rdoc_files: []
files:
- lib/bentley_mcilroy.rb
- lib/rolling_hash.rb
- test/test_helper.rb
- test/bentley_mcilroy_test.rb
- test/rolling_hash_test.rb
- LICENSE
- README.md
- bentley_mcilroy.gemspec
- rakefile
homepage: https://github.com/aprescott/bentley_mcilroy
licenses: []
post_install_message:
rdoc_options: []
require_paths:
- lib
required_ruby_version: !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ! '>='
    - !ruby/object:Gem::Version
      version: '0'
required_rubygems_version: !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ! '>='
    - !ruby/object:Gem::Version
      version: '0'
requirements: []
rubyforge_project:
rubygems_version: 1.8.24
signing_key:
specification_version: 3
summary: Bentley-McIlroy compression scheme implementation in Ruby.
test_files:
- test/test_helper.rb
- test/bentley_mcilroy_test.rb
- test/rolling_hash_test.rb