uhc-0.1.0.tar.gz

uhc-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 UHC Contributors
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
uhc-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,75 @@
+ Metadata-Version: 2.4
+ Name: uhc
+ Version: 0.1.0
+ Summary: Unified Hash-Compression Engine: compressed-domain hashing over LZ77 streams
+ Author: UHC Contributors
+ License: MIT
+ Project-URL: Homepage, https://github.com/uhc-framework/uhc
+ Project-URL: Documentation, https://github.com/uhc-framework/uhc
+ Project-URL: Repository, https://github.com/uhc-framework/uhc
+ Project-URL: Issues, https://github.com/uhc-framework/uhc/issues
+ Keywords: compression,hashing,lz77,deduplication,integrity,polynomial-hash,deflate,zstandard,lz4
+ Classifier: Development Status :: 2 - Pre-Alpha
+ Classifier: Intended Audience :: Developers
+ Classifier: Intended Audience :: Science/Research
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: OS Independent
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Programming Language :: Python :: 3.13
+ Classifier: Topic :: Scientific/Engineering
+ Classifier: Topic :: System :: Archiving :: Compression
+ Classifier: Topic :: Security :: Cryptography
+ Requires-Python: >=3.10
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Provides-Extra: dev
+ Requires-Dist: pytest>=8.0; extra == "dev"
+ Requires-Dist: pytest-cov>=5.0; extra == "dev"
+ Provides-Extra: formats
+ Requires-Dist: lz4>=4.0; extra == "formats"
+ Requires-Dist: zstandard>=0.22; extra == "formats"
+ Requires-Dist: blake3>=1.0; extra == "formats"
+ Dynamic: license-file
+
+ # UHC — Unified Hash-Compression Engine
+
+ A Python framework for compressed-domain hashing over LZ77 streams.
+
+ UHC computes the polynomial hash of uncompressed data by operating directly on compressed token streams (DEFLATE, LZ4, Zstandard), without ever materializing the decompressed bytes.
+
+ ## Key Features
+
+ - **Compressed-domain hashing:** Compute integrity hashes without decompression
+ - **Hash-augmented rope:** Novel data structure with RepeatNode for O(k·log q) overlapping back-reference resolution
+ - **Multi-hash collision resistance:** k-wise independent polynomial hashes with configurable security levels
+ - **Format compatible:** DEFLATE, LZ4, Zstandard support via unified token abstraction
+ - **Sliding window:** O(1) memory relative to decompressed size for bounded-window formats
+
+ ## Installation
+
+ ```bash
+ pip install uhc
+ ```
+
+ ## Quick Start
+
+ ```python
+ import uhc
+
+ # Coming soon — Phase 1 implementation in progress
+ ```
+
+ ## Mathematical Foundation
+
+ The complete mathematical framework (24 theorems, 14 lemmas, 5 corollaries) with full proofs is available in `theory/compressed_domain_hashing_framework.md`.
+
+ ## Project Status
+
+ **Phase 1:** Algebraic core — polynomial hash, naive LZ77, compressed-domain verifier
+
+ ## License
+
+ MIT
uhc-0.1.0/README.md ADDED
@@ -0,0 +1,39 @@
+ # UHC — Unified Hash-Compression Engine
+
+ A Python framework for compressed-domain hashing over LZ77 streams.
+
+ UHC computes the polynomial hash of uncompressed data by operating directly on compressed token streams (DEFLATE, LZ4, Zstandard), without ever materializing the decompressed bytes.
+
+ ## Key Features
+
+ - **Compressed-domain hashing:** Compute integrity hashes without decompression
+ - **Hash-augmented rope:** Novel data structure with RepeatNode for O(k·log q) overlapping back-reference resolution
+ - **Multi-hash collision resistance:** k-wise independent polynomial hashes with configurable security levels
+ - **Format compatible:** DEFLATE, LZ4, Zstandard support via unified token abstraction
+ - **Sliding window:** O(1) memory relative to decompressed size for bounded-window formats
+
+ ## Installation
+
+ ```bash
+ pip install uhc
+ ```
+
+ ## Quick Start
+
+ ```python
+ import uhc
+
+ # Coming soon — Phase 1 implementation in progress
+ ```
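Reviewer's sketch, not part of the released README: the archive already ships `uhc/core/polynomial_hash.py`, whose `PolynomialHash.hash` docstring defines H(s) = Σ (s_i + 1)·x^(n-1-i) mod p. The standalone snippet below re-implements that definition instead of importing `uhc`, so the names `poly_hash` and `concat_hash` are hypothetical. It uses the prime 2^61 - 1 and base 131 that the test suite uses, and checks the concatenation identity the tests label Theorem 1.

```python
P61 = (1 << 61) - 1  # Mersenne prime used throughout the test suite
BASE = 131           # base used by the tests; any value in [2, p-1] is accepted


def poly_hash(data: bytes, x: int = BASE, p: int = P61) -> int:
    """H(s) = sum((s_i + 1) * x^(n-1-i)) mod p, evaluated by Horner's rule."""
    h = 0
    for byte in data:
        h = (h * x + byte + 1) % p
    return h


def concat_hash(h_a: int, len_b: int, h_b: int, x: int = BASE, p: int = P61) -> int:
    """H(A || B) = H(A) * x^|B| + H(B) mod p."""
    return (h_a * pow(x, len_b, p) + h_b) % p


a, b = b"Hello", b"World"
combined = concat_hash(poly_hash(a), len(b), poly_hash(b))
print(combined == poly_hash(a + b))  # prints True
```

Only the two hashes and the length of the right-hand operand are needed to combine them, which is the property the rest of the package builds on.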
+
+ ## Mathematical Foundation
+
+ The complete mathematical framework (24 theorems, 14 lemmas, 5 corollaries) with full proofs is available in `theory/compressed_domain_hashing_framework.md`.
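Reviewer's note: `SOURCES.txt` in this archive does not list the `theory/` document, so for orientation the identities that the shipped tests exercise are collected here. They are assembled from the docstrings in `tests/test_polynomial_hash.py` and `uhc/core/polynomial_hash.py`, and the numbering follows those docstrings rather than the framework document itself.

```latex
\begin{align*}
H(s) &= \sum_{i=0}^{n-1} (s_i + 1)\, x^{\,n-1-i} \pmod{p},
  \qquad H(\varepsilon) = 0
  && \text{(Definition 2)} \\
H(A \,\|\, B) &= H(A)\, x^{|B|} + H(B) \pmod{p}
  && \text{(Theorem 1)} \\
\Phi(q, \alpha) &= \sum_{i=0}^{q-1} \alpha^{i} \pmod{p}
  && \text{(Definition 3)} \\
H(S^{q}) &= H(S)\, \Phi(q, x^{d}) \pmod{p}, \qquad d = |S|
  && \text{(Theorem 2)} \\
H(W) &= H(P)\, \Phi(q, x^{d})\, x^{r} + H(P[0..r-1]) \pmod{p}
  && \text{(Theorem 5)}
\end{align*}
```

In Theorem 5, $W = P^{q} \,\|\, P[0..r-1]$ with $d = |P|$, $q = \lfloor l/d \rfloor$ and $r = l \bmod d$, where $l$ is the back-reference length.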
+
+ ## Project Status
+
+ **Phase 1:** Algebraic core — polynomial hash, naive LZ77, compressed-domain verifier
+
+ ## License
+
+ MIT
uhc-0.1.0/pyproject.toml ADDED
@@ -0,0 +1,69 @@
+ [build-system]
+ requires = ["setuptools>=68.0", "wheel"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "uhc"
+ version = "0.1.0"
+ description = "Unified Hash-Compression Engine: compressed-domain hashing over LZ77 streams"
+ readme = "README.md"
+ license = {text = "MIT"}
+ requires-python = ">=3.10"
+ authors = [
+     {name = "UHC Contributors"},
+ ]
+ keywords = [
+     "compression",
+     "hashing",
+     "lz77",
+     "deduplication",
+     "integrity",
+     "polynomial-hash",
+     "deflate",
+     "zstandard",
+     "lz4",
+ ]
+ classifiers = [
+     "Development Status :: 2 - Pre-Alpha",
+     "Intended Audience :: Developers",
+     "Intended Audience :: Science/Research",
+     "License :: OSI Approved :: MIT License",
+     "Operating System :: OS Independent",
+     "Programming Language :: Python :: 3",
+     "Programming Language :: Python :: 3.10",
+     "Programming Language :: Python :: 3.11",
+     "Programming Language :: Python :: 3.12",
+     "Programming Language :: Python :: 3.13",
+     "Topic :: Scientific/Engineering",
+     "Topic :: System :: Archiving :: Compression",
+     "Topic :: Security :: Cryptography",
+ ]
+ dependencies = []
+
+ [project.optional-dependencies]
+ dev = [
+     "pytest>=8.0",
+     "pytest-cov>=5.0",
+ ]
+ formats = [
+     "lz4>=4.0",
+     "zstandard>=0.22",
+     "blake3>=1.0",
+ ]
+
+ [project.urls]
+ Homepage = "https://github.com/uhc-framework/uhc"
+ Documentation = "https://github.com/uhc-framework/uhc"
+ Repository = "https://github.com/uhc-framework/uhc"
+ Issues = "https://github.com/uhc-framework/uhc/issues"
+
+ [tool.setuptools.packages.find]
+ include = ["uhc*"]
+
+ [tool.pytest.ini_options]
+ testpaths = ["tests"]
+ addopts = "-v --tb=short"
+
+ [tool.coverage.run]
+ source = ["uhc"]
+ omit = ["tests/*"]
uhc-0.1.0/setup.cfg ADDED
@@ -0,0 +1,4 @@
+ [egg_info]
+ tag_build =
+ tag_date = 0
+
uhc-0.1.0/tests/test_polynomial_hash.py ADDED
@@ -0,0 +1,411 @@
+ """
+ Tests for uhc.core.polynomial_hash
+
+ Every test is traceable to a specific definition, lemma, or theorem
+ in the compressed-domain hashing framework.
+ """
+
+ import pytest
+
+ # Primitives under test; implemented in uhc/core/polynomial_hash.py.
+ from uhc.core.polynomial_hash import (
+     PolynomialHash,
+     mersenne_mod,
+     mersenne_mul,
+     phi,
+ )
+
+ # ---------------------------------------------------------------------------
+ # Constants
+ # ---------------------------------------------------------------------------
+
+ P61 = (1 << 61) - 1  # Mersenne prime 2^61 - 1
+
+
+ # ---------------------------------------------------------------------------
+ # Mersenne arithmetic tests (Lemma 14)
+ # ---------------------------------------------------------------------------
+
+ class TestMersenneArithmetic:
+     """Tests for modular arithmetic with p = 2^61 - 1."""
+
+     def test_mersenne_mod_zero(self):
+         """0 mod p = 0."""
+         assert mersenne_mod(0) == 0
+
+     def test_mersenne_mod_small(self):
+         """Small values are unchanged."""
+         assert mersenne_mod(42) == 42
+
+     def test_mersenne_mod_p(self):
+         """p mod p = 0."""
+         assert mersenne_mod(P61) == 0
+
+     def test_mersenne_mod_p_plus_one(self):
+         """(p + 1) mod p = 1."""
+         assert mersenne_mod(P61 + 1) == 1
+
+     def test_mersenne_mod_2p(self):
+         """2p mod p = 0."""
+         assert mersenne_mod(2 * P61) == 0
+
+     def test_mersenne_mod_large(self):
+         """Large value reduces correctly."""
+         val = (1 << 120) + 17
+         assert mersenne_mod(val) == val % P61
+
+     def test_mersenne_mul_commutative(self):
+         """a * b = b * a (mod p)."""
+         a, b = 123456789, 987654321
+         assert mersenne_mul(a, b) == mersenne_mul(b, a)
+
+     def test_mersenne_mul_identity(self):
+         """a * 1 = a (mod p)."""
+         a = 999999999
+         assert mersenne_mul(a, 1) == mersenne_mod(a)
+
+     def test_mersenne_mul_zero(self):
+         """a * 0 = 0 (mod p)."""
+         assert mersenne_mul(123456, 0) == 0
+
+     def test_mersenne_mul_correctness(self):
+         """Multiply matches Python's native modular arithmetic."""
+         a, b = 2**60 - 3, 2**60 + 7
+         assert mersenne_mul(a, b) == (a * b) % P61
+
+
+ # ---------------------------------------------------------------------------
+ # Definition 2: Polynomial hash function
+ # ---------------------------------------------------------------------------
+
+ class TestHashDefinition:
+     """Tests for H(s) = Σ (s_i + 1) · x^(n-1-i) (mod p)."""
+
+     def test_empty_string_hashes_to_zero(self):
+         """Definition 2: H(ε) = 0."""
+         h = PolynomialHash(prime=P61, base=131)
+         assert h.hash(b"") == 0
+
+     def test_single_byte(self):
+         """H((c)) = c + 1 (mod p)."""
+         h = PolynomialHash(prime=P61, base=131)
+         assert h.hash(b"\x00") == 1    # 0 + 1
+         assert h.hash(b"\x01") == 2    # 1 + 1
+         assert h.hash(b"\xff") == 256  # 255 + 1
+
+     def test_two_bytes_manual(self):
+         """H((a, b)) = (a+1)·x + (b+1) (mod p)."""
+         x = 131
+         h = PolynomialHash(prime=P61, base=x)
+         a, b = 3, 7
+         expected = ((a + 1) * x + (b + 1)) % P61
+         assert h.hash(bytes([a, b])) == expected
+
+     def test_three_bytes_manual(self):
+         """H((a,b,c)) = (a+1)·x² + (b+1)·x + (c+1) (mod p)."""
+         x = 131
+         h = PolynomialHash(prime=P61, base=x)
+         a, b, c = 10, 20, 30
+         expected = ((a + 1) * x * x + (b + 1) * x + (c + 1)) % P61
+         assert h.hash(bytes([a, b, c])) == expected
+
+
+ # ---------------------------------------------------------------------------
+ # Lemma 1: Nonzero coefficients
+ # ---------------------------------------------------------------------------
+
+ class TestLemma1:
+     """The +1 offset ensures no byte maps to zero in Z/pZ."""
+
+     def test_null_byte_nonzero(self):
+         """H((0)) ≠ 0 — the zero-padding fix."""
+         h = PolynomialHash(prime=P61, base=131)
+         assert h.hash(b"\x00") != 0
+         assert h.hash(b"\x00") == 1
+
+     def test_all_null_strings_distinct(self):
+         """Strings of different lengths composed of null bytes hash differently."""
+         h = PolynomialHash(prime=P61, base=131)
+         hashes = [h.hash(b"\x00" * n) for n in range(1, 10)]
+         assert len(set(hashes)) == 9  # all distinct
+
+
+ # ---------------------------------------------------------------------------
+ # Lemma 2: Distinctness of single-byte hashes
+ # ---------------------------------------------------------------------------
+
+ class TestLemma2:
+     """For distinct bytes a ≠ b, H((a)) ≠ H((b))."""
+
+     def test_all_single_bytes_distinct(self):
+         h = PolynomialHash(prime=P61, base=131)
+         hashes = [h.hash(bytes([b])) for b in range(256)]
+         assert len(set(hashes)) == 256
+
+
+ # ---------------------------------------------------------------------------
+ # Theorem 1: Concatenation composability
+ # ---------------------------------------------------------------------------
+
+ class TestTheorem1:
+     """H(A ‖ B) = H(A) · x^|B| + H(B) (mod p)."""
+
+     def test_concat_two_strings(self):
+         h = PolynomialHash(prime=P61, base=131)
+         A = b"Hello"
+         B = b"World"
+         h_ab = h.hash(A + B)
+         h_a = h.hash(A)
+         h_b = h.hash(B)
+         x_pow_b = h.power(len(B))
+         expected = (h_a * x_pow_b + h_b) % P61
+         assert h_ab == expected
+
+     def test_concat_empty_left(self):
+         """H(ε ‖ B) = H(B)."""
+         h = PolynomialHash(prime=P61, base=131)
+         B = b"test"
+         assert (h.hash(b"") * h.power(len(B)) + h.hash(B)) % P61 == h.hash(B)
+
+     def test_concat_empty_right(self):
+         """H(A ‖ ε) = H(A)."""
+         h = PolynomialHash(prime=P61, base=131)
+         A = b"test"
+         h_a = h.hash(A)
+         x_pow_0 = h.power(0)  # x^0 = 1
+         assert (h_a * x_pow_0 + 0) % P61 == h_a
+
+     def test_concat_associativity(self):
+         """H(A ‖ B ‖ C) is consistent regardless of grouping."""
+         h = PolynomialHash(prime=P61, base=131)
+         A, B, C = b"aa", b"bb", b"cc"
+         # Theorem 1 applied twice, under both groupings:
+         h_a, h_b, h_c = h.hash(A), h.hash(B), h.hash(C)
+         left = h.hash_concat(h.hash_concat(h_a, len(B), h_b), len(C), h_c)  # (A ‖ B) ‖ C
+         right = h.hash_concat(h_a, len(B + C), h.hash_concat(h_b, len(C), h_c))  # A ‖ (B ‖ C)
+         assert h.hash(A + B + C) == left == right
+
+     def test_concat_random_data(self):
+         """Theorem 1 holds for random byte sequences."""
+         import os
+         h = PolynomialHash(prime=P61, base=131)
+         for _ in range(20):
+             A = os.urandom(50)
+             B = os.urandom(50)
+             h_ab = h.hash(A + B)
+             expected = (h.hash(A) * h.power(len(B)) + h.hash(B)) % P61
+             assert h_ab == expected
+
+
+ # ---------------------------------------------------------------------------
+ # Definition 3 + Theorem 3: Geometric accumulator Φ
+ # ---------------------------------------------------------------------------
+
+ class TestPhi:
+     """Φ(q, α) = Σ_{i=0}^{q-1} α^i computed by repeated doubling."""
+
+     def test_phi_zero(self):
+         """Φ(0, α) = 0."""
+         assert phi(0, 5, P61) == 0
+
+     def test_phi_one(self):
+         """Φ(1, α) = 1."""
+         assert phi(1, 5, P61) == 1
+
+     def test_phi_two(self):
+         """Φ(2, α) = 1 + α."""
+         alpha = 5
+         assert phi(2, alpha, P61) == (1 + alpha) % P61
+
+     def test_phi_three(self):
+         """Φ(3, α) = 1 + α + α²."""
+         alpha = 5
+         expected = (1 + alpha + alpha * alpha) % P61
+         assert phi(3, alpha, P61) == expected
+
+     def test_phi_matches_naive(self):
+         """Doubling recurrence matches direct summation for small q."""
+         alpha = 7
+         for q in range(20):
+             naive = sum(pow(alpha, i, P61) for i in range(q)) % P61
+             assert phi(q, alpha, P61) == naive, f"Failed at q={q}"
+
+     def test_phi_large_q(self):
+         """Φ works for large q without overflow or timeout."""
+         alpha = 131
+         q = 1_000_000
+         # Just verify it completes and returns a value in [0, p)
+         result = phi(q, alpha, P61)
+         assert 0 <= result < P61
+
+     def test_phi_degeneracy_alpha_one(self):
+         """Lemma 3: Φ(q, 1) = q (mod p) for all q."""
+         for q in [0, 1, 2, 5, 10, 100, 999]:
+             assert phi(q, 1, P61) == q % P61
+
+     def test_phi_power_of_two_q(self):
+         """Φ(2^k, α) — pure even-case recursion."""
+         alpha = 13
+         for k in range(1, 15):
+             q = 1 << k
+             naive = sum(pow(alpha, i, P61) for i in range(q)) % P61
+             assert phi(q, alpha, P61) == naive
+
+
+ # ---------------------------------------------------------------------------
+ # Corollary 3: Closed-form equivalence
+ # ---------------------------------------------------------------------------
+
+ class TestCorollary3:
+     """When α ≢ 1, Φ(q,α) = (α^q - 1) / (α - 1)."""
+
+     def test_closed_form_matches_doubling(self):
+         alpha = 7
+         for q in [1, 2, 3, 5, 10, 50, 100]:
+             doubling = phi(q, alpha, P61)
+             # Closed form: (α^q - 1) · (α - 1)^(-1) mod p
+             alpha_q = pow(alpha, q, P61)
+             numerator = (alpha_q - 1) % P61
+             denominator_inv = pow(alpha - 1, P61 - 2, P61)  # Fermat inverse
+             closed = (numerator * denominator_inv) % P61
+             assert doubling == closed, f"Failed at q={q}"
+
+
+ # ---------------------------------------------------------------------------
+ # Theorem 2: Geometric repetition hash
+ # ---------------------------------------------------------------------------
+
+ class TestTheorem2:
+     """H(S^q) = H(S) · Φ(q, x^d) (mod p)."""
+
+     def test_repeat_single_byte(self):
+         """b"A" repeated q times."""
+         h = PolynomialHash(prime=P61, base=131)
+         S = b"A"
+         for q in [1, 2, 3, 5, 10, 50]:
+             direct = h.hash(S * q)
+             h_s = h.hash(S)
+             x_d = h.power(len(S))
+             algebraic = (h_s * phi(q, x_d, P61)) % P61
+             assert direct == algebraic, f"Failed at q={q}"
+
+     def test_repeat_multi_byte_pattern(self):
+         """b"abc" repeated q times."""
+         h = PolynomialHash(prime=P61, base=131)
+         S = b"abc"
+         for q in [1, 2, 3, 7, 20]:
+             direct = h.hash(S * q)
+             h_s = h.hash(S)
+             x_d = h.power(len(S))
+             algebraic = (h_s * phi(q, x_d, P61)) % P61
+             assert direct == algebraic, f"Failed at q={q}"
+
+     def test_repeat_zero(self):
+         """S^0 = ε, H(ε) = 0."""
+         h = PolynomialHash(prime=P61, base=131)
+         h_s = h.hash(b"abc")
+         x_d = h.power(3)
+         algebraic = (h_s * phi(0, x_d, P61)) % P61
+         assert algebraic == 0
+
+
+ # ---------------------------------------------------------------------------
+ # Theorem 5: Overlapping back-reference hash
+ # ---------------------------------------------------------------------------
+
+ class TestTheorem5:
+     """H(W) = H(P) · Φ(q, x^d) · x^r + H(P[0..r-1])."""
+
+     def test_exact_repetition_no_remainder(self):
+         """l is exact multiple of d: r = 0."""
+         h = PolynomialHash(prime=P61, base=131)
+         P = b"abc"  # d = 3
+         l = 12      # q = 4, r = 0
+         d = len(P)
+         q, r = divmod(l, d)
+
+         # Direct: hash the fully expanded string
+         W = P * q
+         direct = h.hash(W)
+
+         # Algebraic
+         h_p = h.hash(P)
+         x_d = h.power(d)
+         x_r = h.power(r)  # x^0 = 1
+         algebraic = (h_p * phi(q, x_d, P61) % P61 * x_r + h.hash(b"")) % P61
+         assert direct == algebraic
+
+     def test_repetition_with_remainder(self):
+         """l is not a multiple of d: r > 0."""
+         h = PolynomialHash(prime=P61, base=131)
+         P = b"abcd"  # d = 4
+         l = 11       # q = 2, r = 3
+         d = len(P)
+         q, r = divmod(l, d)
+         assert q == 2 and r == 3
+
+         # Direct
+         W = (P * q) + P[:r]  # "abcdabcdabc"
+         direct = h.hash(W)
+
+         # Algebraic (Theorem 5)
+         h_p = h.hash(P)
+         h_prefix = h.hash(P[:r])
+         x_d = h.power(d)
+         x_r = h.power(r)
+         algebraic = (h_p * phi(q, x_d, P61) % P61 * x_r % P61 + h_prefix) % P61
+         assert direct == algebraic
+
+     def test_single_byte_run_length(self):
+         """RLE case: d=1, pattern is single byte."""
+         h = PolynomialHash(prime=P61, base=131)
+         P = b"\x42"  # d = 1
+         l = 1000
+         d = 1
+         q, r = divmod(l, d)  # q = 1000, r = 0
+
+         direct = h.hash(P * l)
+
+         h_p = h.hash(P)
+         x_d = h.power(d)
+         algebraic = (h_p * phi(q, x_d, P61)) % P61
+         assert direct == algebraic
+
+     def test_various_patterns_and_lengths(self):
+         """Sweep over multiple pattern sizes and repetition lengths."""
+         h = PolynomialHash(prime=P61, base=131)
+         import os
+         for d in [1, 2, 3, 5, 8, 13]:
+             P = os.urandom(d)
+             for l in [d, d + 1, 2 * d, 2 * d + 1, 5 * d, 5 * d + 3]:
+                 q, r = divmod(l, d)
+                 W = (P * q) + P[:r]
+                 direct = h.hash(W)
+
+                 h_p = h.hash(P)
+                 h_prefix = h.hash(P[:r]) if r > 0 else 0
+                 x_d = h.power(d)
+                 x_r = h.power(r)
+                 algebraic = (h_p * phi(q, x_d, P61) % P61 * x_r % P61 + h_prefix) % P61
+                 assert direct == algebraic, f"Failed: d={d}, l={l}, q={q}, r={r}"
+
+
+ # ---------------------------------------------------------------------------
+ # Power computation (Lemma 13)
+ # ---------------------------------------------------------------------------
+
+ class TestPowerComputation:
+     """x^n computed via repeated squaring."""
+
+     def test_power_zero(self):
+         h = PolynomialHash(prime=P61, base=131)
+         assert h.power(0) == 1
+
+     def test_power_one(self):
+         h = PolynomialHash(prime=P61, base=131)
+         assert h.power(1) == 131
+
+     def test_power_matches_builtin(self):
+         h = PolynomialHash(prime=P61, base=131)
+         for n in [2, 3, 10, 50, 100, 1000]:
+             assert h.power(n) == pow(131, n, P61)
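Reviewer's sketch: the `TestPhi` and `TestCorollary3` cases above compare the doubling recurrence against direct summation and against the closed form. A compact standalone version of the same check follows. `phi_recursive` is a hypothetical recursive rendering of the recurrence quoted in the `phi` docstring (Φ(2k) = Φ(k)·(1 + α^k), Φ(2k+1) = Φ(2k)·α + 1), not the iterative routine the package ships, and `phi_closed_form` is likewise an illustrative name.

```python
P61 = (1 << 61) - 1


def phi_recursive(q: int, alpha: int, p: int = P61) -> int:
    """Sum of alpha^i for i in [0, q), mod p, with O(log q) recursion depth."""
    if q == 0:
        return 0
    if q == 1:
        return 1 % p
    if q % 2 == 1:
        # Phi(2k+1) = Phi(2k) * alpha + 1
        return (phi_recursive(q - 1, alpha, p) * alpha + 1) % p
    k = q // 2
    # Phi(2k) = Phi(k) * (1 + alpha^k)
    return (phi_recursive(k, alpha, p) * (1 + pow(alpha, k, p))) % p


def phi_closed_form(q: int, alpha: int, p: int = P61) -> int:
    """(alpha^q - 1) / (alpha - 1) mod p; only valid when alpha is not 1 mod p."""
    inv = pow(alpha - 1, p - 2, p)  # Fermat inverse, requires p prime
    return (pow(alpha, q, p) - 1) * inv % p


alpha = 7
checks = [phi_recursive(q, alpha) == phi_closed_form(q, alpha) for q in (1, 2, 3, 50, 10**6)]
print(all(checks))  # prints True
```

This rendering recomputes α^k with `pow` at every level, so it spends more multiplications than the shipped iterative version, which carries α^k along as it scans the bits of q. The recurrence needs no modular inverse, which is why it still works when α ≡ 1 (mod p) and the closed form does not.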
uhc-0.1.0/uhc/__init__.py ADDED
@@ -0,0 +1,10 @@
+ """
+ UHC — Unified Hash-Compression Engine
+
+ Compressed-domain hashing over LZ77 streams.
+ Computes polynomial hashes of uncompressed data directly from
+ compressed token representations without decompression.
+ """
+
+ __version__ = "0.1.0"
+ __author__ = "UHC Contributors"
uhc-0.1.0/uhc/core/__init__.py ADDED
@@ -0,0 +1,3 @@
+ """
+ UHC Core — algebraic primitives for compressed-domain hashing.
+ """
uhc-0.1.0/uhc/core/polynomial_hash.py ADDED
@@ -0,0 +1,240 @@
+ """
+ Polynomial hash over Z/pZ with Mersenne prime arithmetic.
+
+ Implements Definitions 1-3, Lemmas 1-2, 13-14, Theorems 1-3, 5
+ from the compressed-domain hashing framework.
+
+ All arithmetic is performed modulo a Mersenne prime p = 2^61 - 1
+ using bit-shift reduction (Lemma 14) to avoid expensive division.
+ """
+
+ from __future__ import annotations
+
+ # ---------------------------------------------------------------------------
+ # Constants
+ # ---------------------------------------------------------------------------
+
+ MERSENNE_61: int = (1 << 61) - 1  # 2^61 - 1
+
+
+ # ---------------------------------------------------------------------------
+ # Mersenne prime arithmetic (Lemma 14)
+ # ---------------------------------------------------------------------------
+
+ def mersenne_mod(a: int, p: int = MERSENNE_61) -> int:
+     """
+     Reduce a non-negative integer a modulo a Mersenne prime p = 2^k - 1.
+
+     Uses the congruence a ≡ (a >> k) + (a & ((1 << k) - 1)) (mod 2^k - 1),
+     applied repeatedly, followed by at most one conditional subtraction.
+
+     Proof (Lemma 14): Write a = q·2^k + r. Then a = q·(p+1) + r = q·p + (q+r),
+     so a ≡ q + r (mod p). For a < 2^(2k), q + r ≤ 2p, and one more fold leaves a value ≤ p.
+     """
+     # Determine k from p: p = 2^k - 1, so k = p.bit_length()
+     k = p.bit_length()
+     # Repeated folding for very large values
+     while a >= (1 << (2 * k)):
+         a = (a >> k) + (a & p)
+     # Final fold
+     a = (a >> k) + (a & p)
+     # At most one more fold needed
+     a = (a >> k) + (a & p)
+     # Conditional subtraction
+     if a >= p:
+         a -= p
+     return a
+
+
+ def mersenne_mul(a: int, b: int, p: int = MERSENNE_61) -> int:
+     """
+     Compute (a * b) mod p for Mersenne prime p.
+
+     Python's arbitrary-precision integers make this straightforward:
+     multiply natively, then reduce via mersenne_mod.
+     """
+     return mersenne_mod(a * b, p)
+
+
+ # ---------------------------------------------------------------------------
+ # Geometric accumulator Φ (Definition 3, Theorem 3)
+ # ---------------------------------------------------------------------------
+
+ def phi(q: int, alpha: int, p: int = MERSENNE_61) -> int:
+     """
+     Compute Φ(q, α) = Σ_{i=0}^{q-1} α^i (mod p) via inverse-free
+     repeated doubling.
+
+     Recurrence (Theorem 3):
+         Φ(0, α) = 0
+         Φ(1, α) = 1
+         Φ(2k, α) = Φ(k, α) · (1 + α^k)
+         Φ(2k+1, α) = Φ(2k, α) · α + 1
+
+     Handles the degenerate case α ≡ 1 (mod p) correctly,
+     yielding Φ(q, 1) = q (mod p) (Lemma 3).
+
+     Time: O(log q) multiplications in Z/pZ.
+
+     Implementation note: We scan q's bits left-to-right (MSB to LSB).
+     We maintain alpha_power = α^n where n is the current accumulated
+     value of q built so far. At each step:
+         - Doubling (n → 2n): alpha_power squares (α^n → α^(2n))
+         - Odd step (n → n+1): alpha_power multiplies by α (α^n → α^(n+1))
+     This ensures the correct α^k is used in each doubling step.
+     """
+     if q == 0:
+         return 0
+     if q == 1:
+         return 1
+
+     bits = q.bit_length()
+
+     phi_val = 1          # Φ(1, α) = 1
+     alpha_power = alpha  # α^1 — tracks α^(current n)
+
+     # Scan from the second-most-significant bit down to bit 0
+     for i in range(bits - 2, -1, -1):
+         # Even step: Φ(2k, α) = Φ(k, α) · (1 + α^k)
+         # alpha_power currently holds α^k
+         phi_val = mersenne_mul(phi_val, (1 + alpha_power) % p, p)
+         # α^k → α^(2k)
+         alpha_power = mersenne_mul(alpha_power, alpha_power, p)
+
+         # If current bit is 1: odd step (2k → 2k+1)
+         if (q >> i) & 1:
+             # Φ(2k+1, α) = Φ(2k, α) · α + 1
+             phi_val = (mersenne_mul(phi_val, alpha, p) + 1) % p
+             # α^(2k) → α^(2k+1)
+             alpha_power = mersenne_mul(alpha_power, alpha, p)
+
+     return phi_val
+
+
+ # ---------------------------------------------------------------------------
+ # Polynomial Hash (Definition 2, Theorems 1-2, 5)
+ # ---------------------------------------------------------------------------
+
+ class PolynomialHash:
+     """
+     Polynomial hash function over Z/pZ.
+
+     H(s) = Σ_{i=0}^{n-1} (s_i + 1) · x^(n-1-i) (mod p)
+
+     The +1 offset (Definition 2) ensures no byte maps to zero in Z/pZ,
+     preventing the zero-padding collision (Lemma 1).
+
+     Parameters
+     ----------
+     prime : int
+         A Mersenne prime. Default: 2^61 - 1.
+     base : int
+         The hash base x, chosen from {2, ..., p-1}.
+     """
+
+     __slots__ = ("_p", "_x", "_power_cache")
+
+     def __init__(self, prime: int = MERSENNE_61, base: int = 131) -> None:
+         if base < 2 or base >= prime:
+             raise ValueError(f"Base must be in [2, p-1], got {base}")
+         self._p = prime
+         self._x = base
+         self._power_cache: dict[int, int] = {0: 1, 1: base}
+
+     @property
+     def prime(self) -> int:
+         return self._p
+
+     @property
+     def base(self) -> int:
+         return self._x
+
+     def power(self, n: int) -> int:
+         """
+         Compute x^n mod p with caching.
+
+         Uses Python's built-in pow(x, n, p) which implements
+         binary exponentiation (Lemma 13).
+         """
+         if n in self._power_cache:
+             return self._power_cache[n]
+         result = pow(self._x, n, self._p)
+         self._power_cache[n] = result
+         return result
+
+     def hash(self, data: bytes) -> int:
+         """
+         Compute H(data) = Σ (data[i] + 1) · x^(n-1-i) (mod p).
+
+         For the empty string, returns 0 (Definition 2).
+
+         Parameters
+         ----------
+         data : bytes
+             The byte string to hash.
+
+         Returns
+         -------
+         int
+             Hash value in [0, p-1].
+         """
+         if len(data) == 0:
+             return 0
+
+         p = self._p
+         x = self._x
+         h = 0
+         for byte in data:
+             # h = h · x + (byte + 1)
+             h = mersenne_mod(h * x + byte + 1, p)
+         return h
+
+     def hash_concat(self, h_a: int, len_b: int, h_b: int) -> int:
+         """
+         Compute H(A ‖ B) from H(A), |B|, H(B) via Theorem 1.
+
+         H(A ‖ B) = H(A) · x^|B| + H(B) (mod p)
+         """
+         x_pow = self.power(len_b)
+         return mersenne_mod(h_a * x_pow + h_b, self._p)
+
+     def hash_repeat(self, h_s: int, d: int, q: int) -> int:
+         """
+         Compute H(S^q) from H(S) and |S| = d via Theorem 2.
+
+         H(S^q) = H(S) · Φ(q, x^d) (mod p)
+         """
+         x_d = self.power(d)
+         phi_val = phi(q, x_d, self._p)
+         return mersenne_mul(h_s, phi_val, self._p)
+
+     def hash_overlap(self, h_p: int, d: int, l: int, h_prefix: int) -> int:
+         """
+         Compute H(W) for overlapping back-reference via Theorem 5.
+
+         W = P^q ‖ P[0..r-1] where q = ⌊l/d⌋, r = l mod d.
+
+         H(W) = H(P) · Φ(q, x^d) · x^r + H(P[0..r-1]) (mod p)
+
+         Parameters
+         ----------
+         h_p : int
+             H(P), hash of the pattern of length d.
+         d : int
+             Pattern length.
+         l : int
+             Total back-reference length (d < l for overlapping).
+         h_prefix : int
+             H(P[0..r-1]) where r = l mod d. Pass 0 if r = 0.
+         """
+         q, r = divmod(l, d)
+         x_d = self.power(d)
+         x_r = self.power(r)
+
+         # H(P) · Φ(q, x^d)
+         phi_val = phi(q, x_d, self._p)
+         h_repeated = mersenne_mul(h_p, phi_val, self._p)
+
+         # · x^r + H(P[0..r-1])
+         result = mersenne_mod(h_repeated * x_r + h_prefix, self._p)
+         return result
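Reviewer's sketch of the overlapping back-reference case that `hash_overlap` targets: in LZ77, a match with distance d and length l > d copies bytes it has itself just written, so the output is the last d bytes repeated. The standalone code below expands such a match byte by byte and compares its hash with the Theorem 5 formula, which only needs H(P), H(P[0..r-1]), d, and l. The helper names `poly_hash`, `geometric_sum`, `lz77_copy`, and `overlap_hash` are hypothetical and do not come from the package, and `geometric_sum` uses plain summation rather than the shipped doubling routine.

```python
P61 = (1 << 61) - 1
BASE = 131


def poly_hash(data: bytes, x: int = BASE, p: int = P61) -> int:
    """H(s) = sum((s_i + 1) * x^(n-1-i)) mod p."""
    h = 0
    for byte in data:
        h = (h * x + byte + 1) % p
    return h


def geometric_sum(q: int, alpha: int, p: int = P61) -> int:
    """1 + alpha + ... + alpha^(q-1) mod p by direct summation (fine for small q)."""
    total, term = 0, 1
    for _ in range(q):
        total = (total + term) % p
        term = term * alpha % p
    return total


def lz77_copy(history: bytes, distance: int, length: int) -> bytes:
    """Naive LZ77 match expansion: copy one byte at a time so overlaps work."""
    out = bytearray(history)
    for _ in range(length):
        out.append(out[-distance])
    return bytes(out[len(history):])


def overlap_hash(h_p: int, d: int, l: int, h_prefix: int,
                 x: int = BASE, p: int = P61) -> int:
    """H(P^q || P[:r]) = H(P) * Phi(q, x^d) * x^r + H(P[:r]) mod p."""
    q, r = divmod(l, d)
    return (h_p * geometric_sum(q, pow(x, d, p), p) % p * pow(x, r, p) + h_prefix) % p


pattern = b"abcd"        # the last d = 4 bytes already emitted
d, l = len(pattern), 11  # match length 11 > distance 4, so the copy overlaps
expanded = lz77_copy(pattern, d, l)
r = l % d
print(expanded)  # prints b'abcdabcdabc'
print(overlap_hash(poly_hash(pattern), d, l, poly_hash(pattern[:r])) == poly_hash(expanded))  # prints True
```

The algebraic path never builds the 11-byte expansion. It touches only the 4-byte pattern and its 3-byte prefix, which is the saving the README describes as hashing without materializing the decompressed bytes.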
uhc-0.1.0/uhc.egg-info/PKG-INFO ADDED
@@ -0,0 +1,75 @@
+ Metadata-Version: 2.4
+ Name: uhc
+ Version: 0.1.0
+ Summary: Unified Hash-Compression Engine: compressed-domain hashing over LZ77 streams
+ Author: UHC Contributors
+ License: MIT
+ Project-URL: Homepage, https://github.com/uhc-framework/uhc
+ Project-URL: Documentation, https://github.com/uhc-framework/uhc
+ Project-URL: Repository, https://github.com/uhc-framework/uhc
+ Project-URL: Issues, https://github.com/uhc-framework/uhc/issues
+ Keywords: compression,hashing,lz77,deduplication,integrity,polynomial-hash,deflate,zstandard,lz4
+ Classifier: Development Status :: 2 - Pre-Alpha
+ Classifier: Intended Audience :: Developers
+ Classifier: Intended Audience :: Science/Research
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: OS Independent
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Programming Language :: Python :: 3.13
+ Classifier: Topic :: Scientific/Engineering
+ Classifier: Topic :: System :: Archiving :: Compression
+ Classifier: Topic :: Security :: Cryptography
+ Requires-Python: >=3.10
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Provides-Extra: dev
+ Requires-Dist: pytest>=8.0; extra == "dev"
+ Requires-Dist: pytest-cov>=5.0; extra == "dev"
+ Provides-Extra: formats
+ Requires-Dist: lz4>=4.0; extra == "formats"
+ Requires-Dist: zstandard>=0.22; extra == "formats"
+ Requires-Dist: blake3>=1.0; extra == "formats"
+ Dynamic: license-file
+
+ # UHC — Unified Hash-Compression Engine
+
+ A Python framework for compressed-domain hashing over LZ77 streams.
+
+ UHC computes the polynomial hash of uncompressed data by operating directly on compressed token streams (DEFLATE, LZ4, Zstandard), without ever materializing the decompressed bytes.
+
+ ## Key Features
+
+ - **Compressed-domain hashing:** Compute integrity hashes without decompression
+ - **Hash-augmented rope:** Novel data structure with RepeatNode for O(k·log q) overlapping back-reference resolution
+ - **Multi-hash collision resistance:** k-wise independent polynomial hashes with configurable security levels
+ - **Format compatible:** DEFLATE, LZ4, Zstandard support via unified token abstraction
+ - **Sliding window:** O(1) memory relative to decompressed size for bounded-window formats
+
+ ## Installation
+
+ ```bash
+ pip install uhc
+ ```
+
+ ## Quick Start
+
+ ```python
+ import uhc
+
+ # Coming soon — Phase 1 implementation in progress
+ ```
+
+ ## Mathematical Foundation
+
+ The complete mathematical framework (24 theorems, 14 lemmas, 5 corollaries) with full proofs is available in `theory/compressed_domain_hashing_framework.md`.
+
+ ## Project Status
+
+ **Phase 1:** Algebraic core — polynomial hash, naive LZ77, compressed-domain verifier
+
+ ## License
+
+ MIT
uhc-0.1.0/uhc.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,12 @@
+ LICENSE
+ README.md
+ pyproject.toml
+ tests/test_polynomial_hash.py
+ uhc/__init__.py
+ uhc.egg-info/PKG-INFO
+ uhc.egg-info/SOURCES.txt
+ uhc.egg-info/dependency_links.txt
+ uhc.egg-info/requires.txt
+ uhc.egg-info/top_level.txt
+ uhc/core/__init__.py
+ uhc/core/polynomial_hash.py
uhc-0.1.0/uhc.egg-info/requires.txt ADDED
@@ -0,0 +1,9 @@
+
+ [dev]
+ pytest>=8.0
+ pytest-cov>=5.0
+
+ [formats]
+ lz4>=4.0
+ zstandard>=0.22
+ blake3>=1.0
uhc-0.1.0/uhc.egg-info/top_level.txt ADDED
@@ -0,0 +1 @@
+ uhc