uhc-0.1.0.tar.gz
- uhc-0.1.0/LICENSE +21 -0
- uhc-0.1.0/PKG-INFO +75 -0
- uhc-0.1.0/README.md +39 -0
- uhc-0.1.0/pyproject.toml +69 -0
- uhc-0.1.0/setup.cfg +4 -0
- uhc-0.1.0/tests/test_polynomial_hash.py +411 -0
- uhc-0.1.0/uhc/__init__.py +10 -0
- uhc-0.1.0/uhc/core/__init__.py +3 -0
- uhc-0.1.0/uhc/core/polynomial_hash.py +240 -0
- uhc-0.1.0/uhc.egg-info/PKG-INFO +75 -0
- uhc-0.1.0/uhc.egg-info/SOURCES.txt +12 -0
- uhc-0.1.0/uhc.egg-info/dependency_links.txt +1 -0
- uhc-0.1.0/uhc.egg-info/requires.txt +9 -0
- uhc-0.1.0/uhc.egg-info/top_level.txt +1 -0
uhc-0.1.0/LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 UHC Contributors
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
uhc-0.1.0/PKG-INFO
ADDED
@@ -0,0 +1,75 @@
+Metadata-Version: 2.4
+Name: uhc
+Version: 0.1.0
+Summary: Unified Hash-Compression Engine: compressed-domain hashing over LZ77 streams
+Author: UHC Contributors
+License: MIT
+Project-URL: Homepage, https://github.com/uhc-framework/uhc
+Project-URL: Documentation, https://github.com/uhc-framework/uhc
+Project-URL: Repository, https://github.com/uhc-framework/uhc
+Project-URL: Issues, https://github.com/uhc-framework/uhc/issues
+Keywords: compression,hashing,lz77,deduplication,integrity,polynomial-hash,deflate,zstandard,lz4
+Classifier: Development Status :: 2 - Pre-Alpha
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Topic :: Scientific/Engineering
+Classifier: Topic :: System :: Archiving :: Compression
+Classifier: Topic :: Security :: Cryptography
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Provides-Extra: dev
+Requires-Dist: pytest>=8.0; extra == "dev"
+Requires-Dist: pytest-cov>=5.0; extra == "dev"
+Provides-Extra: formats
+Requires-Dist: lz4>=4.0; extra == "formats"
+Requires-Dist: zstandard>=0.22; extra == "formats"
+Requires-Dist: blake3>=1.0; extra == "formats"
+Dynamic: license-file
+
+# UHC — Unified Hash-Compression Engine
+
+A Python framework for compressed-domain hashing over LZ77 streams.
+
+UHC computes the polynomial hash of uncompressed data by operating directly on compressed token streams (DEFLATE, LZ4, Zstandard), without ever materializing the decompressed bytes.
+
+## Key Features
+
+- **Compressed-domain hashing:** Compute integrity hashes without decompression
+- **Hash-augmented rope:** Novel data structure with RepeatNode for O(k·log q) overlapping back-reference resolution
+- **Multi-hash collision resistance:** k-wise independent polynomial hashes with configurable security levels
+- **Format compatible:** DEFLATE, LZ4, Zstandard support via unified token abstraction
+- **Sliding window:** O(1) memory relative to decompressed size for bounded-window formats
+
+## Installation
+
+```bash
+pip install uhc
+```
+
+## Quick Start
+
+```python
+import uhc
+
+# Coming soon — Phase 1 implementation in progress
+```
+
+## Mathematical Foundation
+
+The complete mathematical framework (24 theorems, 14 lemmas, 5 corollaries) with full proofs is available in `theory/compressed_domain_hashing_framework.md`.
+
+## Project Status
+
+**Phase 1:** Algebraic core — polynomial hash, naive LZ77, compressed-domain verifier
+
+## License
+
+MIT
uhc-0.1.0/README.md
ADDED
@@ -0,0 +1,39 @@
+# UHC — Unified Hash-Compression Engine
+
+A Python framework for compressed-domain hashing over LZ77 streams.
+
+UHC computes the polynomial hash of uncompressed data by operating directly on compressed token streams (DEFLATE, LZ4, Zstandard), without ever materializing the decompressed bytes.
+
+## Key Features
+
+- **Compressed-domain hashing:** Compute integrity hashes without decompression
+- **Hash-augmented rope:** Novel data structure with RepeatNode for O(k·log q) overlapping back-reference resolution
+- **Multi-hash collision resistance:** k-wise independent polynomial hashes with configurable security levels
+- **Format compatible:** DEFLATE, LZ4, Zstandard support via unified token abstraction
+- **Sliding window:** O(1) memory relative to decompressed size for bounded-window formats
+
+## Installation
+
+```bash
+pip install uhc
+```
+
+## Quick Start
+
+```python
+import uhc
+
+# Coming soon — Phase 1 implementation in progress
+```
+
+## Mathematical Foundation
+
+The complete mathematical framework (24 theorems, 14 lemmas, 5 corollaries) with full proofs is available in `theory/compressed_domain_hashing_framework.md`.
+
+## Project Status
+
+**Phase 1:** Algebraic core — polynomial hash, naive LZ77, compressed-domain verifier
+
+## License
+
+MIT
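The README's Quick Start is still a placeholder, so here is a minimal standalone sketch of the composition idea it describes: hashing pieces separately and combining the results without rehashing the joined bytes. It does not use the package's API. The names `poly_hash`, `combine`, `P`, and `X` are illustrative assumptions; the formulas are the ones stated in the package's test docstrings (the +1 byte offset and H(A ‖ B) = H(A) · x^|B| + H(B)).

```python
# Illustrative sketch only: names below are assumptions, not the uhc API.
P = (1 << 61) - 1  # Mersenne prime 2^61 - 1, the modulus used by the package
X = 131            # hash base used throughout the package's tests


def poly_hash(data: bytes) -> int:
    """H(s) = sum of (s_i + 1) * X^(n-1-i) mod P, evaluated with Horner's rule."""
    h = 0
    for byte in data:
        h = (h * X + byte + 1) % P
    return h


def combine(h_a: int, h_b: int, len_b: int) -> int:
    """H(A || B) = H(A) * X^|B| + H(B) mod P (the concatenation identity)."""
    return (h_a * pow(X, len_b, P) + h_b) % P


# Fold per-chunk hashes left to right; the joined bytes are only built
# in the final assert, to check the result.
chunks = [b"compressed-", b"domain ", b"hashing"]
running = 0  # hash of the empty string
for chunk in chunks:
    running = combine(running, poly_hash(chunk), len(chunk))

assert running == poly_hash(b"".join(chunks))
```

Only each chunk's hash and length are needed to extend the running value, which is what lets a token-by-token pass avoid holding the decompressed output.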
uhc-0.1.0/pyproject.toml
ADDED
@@ -0,0 +1,69 @@
+[build-system]
+requires = ["setuptools>=68.0", "wheel"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "uhc"
+version = "0.1.0"
+description = "Unified Hash-Compression Engine: compressed-domain hashing over LZ77 streams"
+readme = "README.md"
+license = {text = "MIT"}
+requires-python = ">=3.10"
+authors = [
+    {name = "UHC Contributors"},
+]
+keywords = [
+    "compression",
+    "hashing",
+    "lz77",
+    "deduplication",
+    "integrity",
+    "polynomial-hash",
+    "deflate",
+    "zstandard",
+    "lz4",
+]
+classifiers = [
+    "Development Status :: 2 - Pre-Alpha",
+    "Intended Audience :: Developers",
+    "Intended Audience :: Science/Research",
+    "License :: OSI Approved :: MIT License",
+    "Operating System :: OS Independent",
+    "Programming Language :: Python :: 3",
+    "Programming Language :: Python :: 3.10",
+    "Programming Language :: Python :: 3.11",
+    "Programming Language :: Python :: 3.12",
+    "Programming Language :: Python :: 3.13",
+    "Topic :: Scientific/Engineering",
+    "Topic :: System :: Archiving :: Compression",
+    "Topic :: Security :: Cryptography",
+]
+dependencies = []
+
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.0",
+    "pytest-cov>=5.0",
+]
+formats = [
+    "lz4>=4.0",
+    "zstandard>=0.22",
+    "blake3>=1.0",
+]
+
+[project.urls]
+Homepage = "https://github.com/uhc-framework/uhc"
+Documentation = "https://github.com/uhc-framework/uhc"
+Repository = "https://github.com/uhc-framework/uhc"
+Issues = "https://github.com/uhc-framework/uhc/issues"
+
+[tool.setuptools.packages.find]
+include = ["uhc*"]
+
+[tool.pytest.ini_options]
+testpaths = ["tests"]
+addopts = "-v --tb=short"
+
+[tool.coverage.run]
+source = ["uhc"]
+omit = ["tests/*"]
uhc-0.1.0/setup.cfg
ADDED
uhc-0.1.0/tests/test_polynomial_hash.py
ADDED
@@ -0,0 +1,411 @@
+"""
+Tests for uhc.core.polynomial_hash
+
+Every test is traceable to a specific definition, lemma, or theorem
+in the compressed-domain hashing framework.
+"""
+
+import pytest
+
+# Will fail until implementation exists — that's TDD.
+from uhc.core.polynomial_hash import (
+    PolynomialHash,
+    mersenne_mod,
+    mersenne_mul,
+    phi,
+)
+
+# ---------------------------------------------------------------------------
+# Constants
+# ---------------------------------------------------------------------------
+
+P61 = (1 << 61) - 1  # Mersenne prime 2^61 - 1
+
+
+# ---------------------------------------------------------------------------
+# Mersenne arithmetic tests (Lemma 14)
+# ---------------------------------------------------------------------------
+
+class TestMersenneArithmetic:
+    """Tests for modular arithmetic with p = 2^61 - 1."""
+
+    def test_mersenne_mod_zero(self):
+        """0 mod p = 0."""
+        assert mersenne_mod(0) == 0
+
+    def test_mersenne_mod_small(self):
+        """Small values are unchanged."""
+        assert mersenne_mod(42) == 42
+
+    def test_mersenne_mod_p(self):
+        """p mod p = 0."""
+        assert mersenne_mod(P61) == 0
+
+    def test_mersenne_mod_p_plus_one(self):
+        """(p + 1) mod p = 1."""
+        assert mersenne_mod(P61 + 1) == 1
+
+    def test_mersenne_mod_2p(self):
+        """2p mod p = 0."""
+        assert mersenne_mod(2 * P61) == 0
+
+    def test_mersenne_mod_large(self):
+        """Large value reduces correctly."""
+        val = (1 << 120) + 17
+        assert mersenne_mod(val) == val % P61
+
+    def test_mersenne_mul_commutative(self):
+        """a * b = b * a (mod p)."""
+        a, b = 123456789, 987654321
+        assert mersenne_mul(a, b) == mersenne_mul(b, a)
+
+    def test_mersenne_mul_identity(self):
+        """a * 1 = a (mod p)."""
+        a = 999999999
+        assert mersenne_mul(a, 1) == mersenne_mod(a)
+
+    def test_mersenne_mul_zero(self):
+        """a * 0 = 0 (mod p)."""
+        assert mersenne_mul(123456, 0) == 0
+
+    def test_mersenne_mul_correctness(self):
+        """Multiply matches Python's native modular arithmetic."""
+        a, b = 2**60 - 3, 2**60 + 7
+        assert mersenne_mul(a, b) == (a * b) % P61
+
+
+# ---------------------------------------------------------------------------
+# Definition 2: Polynomial hash function
+# ---------------------------------------------------------------------------
+
+class TestHashDefinition:
+    """Tests for H(s) = Σ (s_i + 1) · x^(n-1-i) (mod p)."""
+
+    def test_empty_string_hashes_to_zero(self):
+        """Definition 2: H(ε) = 0."""
+        h = PolynomialHash(prime=P61, base=131)
+        assert h.hash(b"") == 0
+
+    def test_single_byte(self):
+        """H((c)) = c + 1 (mod p)."""
+        h = PolynomialHash(prime=P61, base=131)
+        assert h.hash(b"\x00") == 1  # 0 + 1
+        assert h.hash(b"\x01") == 2  # 1 + 1
+        assert h.hash(b"\xff") == 256  # 255 + 1
+
+    def test_two_bytes_manual(self):
+        """H((a, b)) = (a+1)·x + (b+1) (mod p)."""
+        x = 131
+        h = PolynomialHash(prime=P61, base=x)
+        a, b = 3, 7
+        expected = ((a + 1) * x + (b + 1)) % P61
+        assert h.hash(bytes([a, b])) == expected
+
+    def test_three_bytes_manual(self):
+        """H((a,b,c)) = (a+1)·x² + (b+1)·x + (c+1) (mod p)."""
+        x = 131
+        h = PolynomialHash(prime=P61, base=x)
+        a, b, c = 10, 20, 30
+        expected = ((a + 1) * x * x + (b + 1) * x + (c + 1)) % P61
+        assert h.hash(bytes([a, b, c])) == expected
+
+
+# ---------------------------------------------------------------------------
+# Lemma 1: Nonzero coefficients
+# ---------------------------------------------------------------------------
+
+class TestLemma1:
+    """The +1 offset ensures no byte maps to zero in Z/pZ."""
+
+    def test_null_byte_nonzero(self):
+        """H((0)) ≠ 0 — the zero-padding fix."""
+        h = PolynomialHash(prime=P61, base=131)
+        assert h.hash(b"\x00") != 0
+        assert h.hash(b"\x00") == 1
+
+    def test_all_null_strings_distinct(self):
+        """Strings of different lengths composed of null bytes hash differently."""
+        h = PolynomialHash(prime=P61, base=131)
+        hashes = [h.hash(b"\x00" * n) for n in range(1, 10)]
+        assert len(set(hashes)) == 9  # all distinct
+
+
+# ---------------------------------------------------------------------------
+# Lemma 2: Distinctness of single-byte hashes
+# ---------------------------------------------------------------------------
+
+class TestLemma2:
+    """For distinct bytes a ≠ b, H((a)) ≠ H((b))."""
+
+    def test_all_single_bytes_distinct(self):
+        h = PolynomialHash(prime=P61, base=131)
+        hashes = [h.hash(bytes([b])) for b in range(256)]
+        assert len(set(hashes)) == 256
+
+
+# ---------------------------------------------------------------------------
+# Theorem 1: Concatenation composability
+# ---------------------------------------------------------------------------
+
+class TestTheorem1:
+    """H(A ‖ B) = H(A) · x^|B| + H(B) (mod p)."""
+
+    def test_concat_two_strings(self):
+        h = PolynomialHash(prime=P61, base=131)
+        A = b"Hello"
+        B = b"World"
+        h_ab = h.hash(A + B)
+        h_a = h.hash(A)
+        h_b = h.hash(B)
+        x_pow_b = h.power(len(B))
+        expected = (h_a * x_pow_b + h_b) % P61
+        assert h_ab == expected
+
+    def test_concat_empty_left(self):
+        """H(ε ‖ B) = H(B)."""
+        h = PolynomialHash(prime=P61, base=131)
+        B = b"test"
+        assert h.hash(b"" + B) == h.hash(B)
+
+    def test_concat_empty_right(self):
+        """H(A ‖ ε) = H(A)."""
+        h = PolynomialHash(prime=P61, base=131)
+        A = b"test"
+        h_a = h.hash(A)
+        x_pow_0 = h.power(0)  # x^0 = 1
+        assert (h_a * x_pow_0 + 0) % P61 == h_a
+
+    def test_concat_associativity(self):
+        """H(A ‖ B ‖ C) is consistent regardless of grouping."""
+        h = PolynomialHash(prime=P61, base=131)
+        A, B, C = b"aa", b"bb", b"cc"
+        assert h.hash(A + B + C) == h.hash(A + B + C)
+        # Verify via Theorem 1 applied twice:
+        h_ab = (h.hash(A) * h.power(len(B)) + h.hash(B)) % P61
+        h_abc = (h_ab * h.power(len(C)) + h.hash(C)) % P61
+        assert h.hash(A + B + C) == h_abc
+
+    def test_concat_random_data(self):
+        """Theorem 1 holds for random byte sequences."""
+        import os
+        h = PolynomialHash(prime=P61, base=131)
+        for _ in range(20):
+            A = os.urandom(50)
+            B = os.urandom(50)
+            h_ab = h.hash(A + B)
+            expected = (h.hash(A) * h.power(len(B)) + h.hash(B)) % P61
+            assert h_ab == expected
+
+
+# ---------------------------------------------------------------------------
+# Definition 3 + Theorem 3: Geometric accumulator Φ
+# ---------------------------------------------------------------------------
+
+class TestPhi:
+    """Φ(q, α) = Σ_{i=0}^{q-1} α^i computed by repeated doubling."""
+
+    def test_phi_zero(self):
+        """Φ(0, α) = 0."""
+        assert phi(0, 5, P61) == 0
+
+    def test_phi_one(self):
+        """Φ(1, α) = 1."""
+        assert phi(1, 5, P61) == 1
+
+    def test_phi_two(self):
+        """Φ(2, α) = 1 + α."""
+        alpha = 5
+        assert phi(2, alpha, P61) == (1 + alpha) % P61
+
+    def test_phi_three(self):
+        """Φ(3, α) = 1 + α + α²."""
+        alpha = 5
+        expected = (1 + alpha + alpha * alpha) % P61
+        assert phi(3, alpha, P61) == expected
+
+    def test_phi_matches_naive(self):
+        """Doubling recurrence matches direct summation for small q."""
+        alpha = 7
+        for q in range(20):
+            naive = sum(pow(alpha, i, P61) for i in range(q)) % P61
+            assert phi(q, alpha, P61) == naive, f"Failed at q={q}"
+
+    def test_phi_large_q(self):
+        """Φ works for large q without overflow or timeout."""
+        alpha = 131
+        q = 1_000_000
+        # Just verify it completes and returns a value in [0, p)
+        result = phi(q, alpha, P61)
+        assert 0 <= result < P61
+
+    def test_phi_degeneracy_alpha_one(self):
+        """Lemma 3: Φ(q, 1) = q (mod p) for all q."""
+        for q in [0, 1, 2, 5, 10, 100, 999]:
+            assert phi(q, 1, P61) == q % P61
+
+    def test_phi_power_of_two_q(self):
+        """Φ(2^k, α) — pure even-case recursion."""
+        alpha = 13
+        for k in range(1, 15):
+            q = 1 << k
+            naive = sum(pow(alpha, i, P61) for i in range(q)) % P61
+            assert phi(q, alpha, P61) == naive
+
+
+# ---------------------------------------------------------------------------
+# Corollary 3: Closed-form equivalence
+# ---------------------------------------------------------------------------
+
+class TestCorollary3:
+    """When α ≢ 1, Φ(q,α) = (α^q - 1) / (α - 1)."""
+
+    def test_closed_form_matches_doubling(self):
+        alpha = 7
+        for q in [1, 2, 3, 5, 10, 50, 100]:
+            doubling = phi(q, alpha, P61)
+            # Closed form: (α^q - 1) · (α - 1)^(-1) mod p
+            alpha_q = pow(alpha, q, P61)
+            numerator = (alpha_q - 1) % P61
+            denominator_inv = pow(alpha - 1, P61 - 2, P61)  # Fermat inverse
+            closed = (numerator * denominator_inv) % P61
+            assert doubling == closed, f"Failed at q={q}"
+
+
+# ---------------------------------------------------------------------------
+# Theorem 2: Geometric repetition hash
+# ---------------------------------------------------------------------------
+
+class TestTheorem2:
+    """H(S^q) = H(S) · Φ(q, x^d) (mod p)."""
+
+    def test_repeat_single_byte(self):
+        """b"A" repeated q times."""
+        h = PolynomialHash(prime=P61, base=131)
+        S = b"A"
+        for q in [1, 2, 3, 5, 10, 50]:
+            direct = h.hash(S * q)
+            h_s = h.hash(S)
+            x_d = h.power(len(S))
+            algebraic = (h_s * phi(q, x_d, P61)) % P61
+            assert direct == algebraic, f"Failed at q={q}"
+
+    def test_repeat_multi_byte_pattern(self):
+        """b"abc" repeated q times."""
+        h = PolynomialHash(prime=P61, base=131)
+        S = b"abc"
+        for q in [1, 2, 3, 7, 20]:
+            direct = h.hash(S * q)
+            h_s = h.hash(S)
+            x_d = h.power(len(S))
+            algebraic = (h_s * phi(q, x_d, P61)) % P61
+            assert direct == algebraic, f"Failed at q={q}"
+
+    def test_repeat_zero(self):
+        """S^0 = ε, H(ε) = 0."""
+        h = PolynomialHash(prime=P61, base=131)
+        h_s = h.hash(b"abc")
+        x_d = h.power(3)
+        algebraic = (h_s * phi(0, x_d, P61)) % P61
+        assert algebraic == 0
+
+
+# ---------------------------------------------------------------------------
+# Theorem 5: Overlapping back-reference hash
+# ---------------------------------------------------------------------------
+
+class TestTheorem5:
+    """H(W) = H(P) · Φ(q, x^d) · x^r + H(P[0..r-1])."""
+
+    def test_exact_repetition_no_remainder(self):
+        """l is exact multiple of d: r = 0."""
+        h = PolynomialHash(prime=P61, base=131)
+        P = b"abc"  # d = 3
+        l = 12  # q = 4, r = 0
+        d = len(P)
+        q, r = divmod(l, d)
+
+        # Direct: hash the fully expanded string
+        W = P * q
+        direct = h.hash(W)
+
+        # Algebraic
+        h_p = h.hash(P)
+        x_d = h.power(d)
+        x_r = h.power(r)  # x^0 = 1
+        algebraic = (h_p * phi(q, x_d, P61) % P61 * x_r + h.hash(b"")) % P61
+        assert direct == algebraic
+
+    def test_repetition_with_remainder(self):
+        """l is not a multiple of d: r > 0."""
+        h = PolynomialHash(prime=P61, base=131)
+        P = b"abcd"  # d = 4
+        l = 11  # q = 2, r = 3
+        d = len(P)
+        q, r = divmod(l, d)
+        assert q == 2 and r == 3
+
+        # Direct
+        W = (P * q) + P[:r]  # "abcdabcdabc"
+        direct = h.hash(W)
+
+        # Algebraic (Theorem 5)
+        h_p = h.hash(P)
+        h_prefix = h.hash(P[:r])
+        x_d = h.power(d)
+        x_r = h.power(r)
+        algebraic = (h_p * phi(q, x_d, P61) % P61 * x_r % P61 + h_prefix) % P61
+        assert direct == algebraic
+
+    def test_single_byte_run_length(self):
+        """RLE case: d=1, pattern is single byte."""
+        h = PolynomialHash(prime=P61, base=131)
+        P = b"\x42"  # d = 1
+        l = 1000
+        d = 1
+        q, r = divmod(l, d)  # q = 1000, r = 0
+
+        direct = h.hash(P * l)
+
+        h_p = h.hash(P)
+        x_d = h.power(d)
+        algebraic = (h_p * phi(q, x_d, P61)) % P61
+        assert direct == algebraic
+
+    def test_various_patterns_and_lengths(self):
+        """Sweep over multiple pattern sizes and repetition lengths."""
+        h = PolynomialHash(prime=P61, base=131)
+        import os
+        for d in [1, 2, 3, 5, 8, 13]:
+            P = os.urandom(d)
+            for l in [d, d + 1, 2 * d, 2 * d + 1, 5 * d, 5 * d + 3]:
+                q, r = divmod(l, d)
+                W = (P * q) + P[:r]
+                direct = h.hash(W)
+
+                h_p = h.hash(P)
+                h_prefix = h.hash(P[:r]) if r > 0 else 0
+                x_d = h.power(d)
+                x_r = h.power(r)
+                algebraic = (h_p * phi(q, x_d, P61) % P61 * x_r % P61 + h_prefix) % P61
+                assert direct == algebraic, f"Failed: d={d}, l={l}, q={q}, r={r}"
+
+
+# ---------------------------------------------------------------------------
+# Power computation (Lemma 13)
+# ---------------------------------------------------------------------------
+
+class TestPowerComputation:
+    """x^n computed via repeated squaring."""
+
+    def test_power_zero(self):
+        h = PolynomialHash(prime=P61, base=131)
+        assert h.power(0) == 1
+
+    def test_power_one(self):
+        h = PolynomialHash(prime=P61, base=131)
+        assert h.power(1) == 131
+
+    def test_power_matches_builtin(self):
+        h = PolynomialHash(prime=P61, base=131)
+        for n in [2, 3, 10, 50, 100, 1000]:
+            assert h.power(n) == pow(131, n, P61)
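The Theorem 5 tests above check the identity against `P * q + P[:r]`. The sketch below is a minimal standalone illustration of why that string is what an overlapping LZ77 copy produces: the same hash is obtained once by expanding the copy byte by byte and once from the pattern alone. It does not import the package. `poly_hash`, `geometric_sum`, `overlapping_copy_hash`, and `expand_copy` are illustrative names, and the recursive `geometric_sum` is an assumed stand-in for the package's iterative `phi`.

```python
# Illustrative sketch only: names below are assumptions, not the uhc API.
P = (1 << 61) - 1  # Mersenne prime 2^61 - 1
X = 131            # hash base used in the tests


def poly_hash(data: bytes) -> int:
    """H(s) = sum of (s_i + 1) * X^(n-1-i) mod P, via Horner's rule."""
    h = 0
    for byte in data:
        h = (h * X + byte + 1) % P
    return h


def geometric_sum(q: int, alpha: int) -> int:
    """Phi(q, alpha) = 1 + alpha + ... + alpha^(q-1) mod P, with no modular inverse."""
    if q == 0:
        return 0
    if q % 2 == 1:
        # Phi(2k+1) = Phi(2k) * alpha + 1
        return (geometric_sum(q - 1, alpha) * alpha + 1) % P
    # Phi(2k) = Phi(k) * (1 + alpha^k)
    half = q // 2
    return geometric_sum(half, alpha) * (1 + pow(alpha, half, P)) % P


def overlapping_copy_hash(pattern: bytes, length: int) -> int:
    """Hash of a copy of `length` bytes at distance d = len(pattern), from the pattern alone."""
    d = len(pattern)
    q, r = divmod(length, d)
    repeated = poly_hash(pattern) * geometric_sum(q, pow(X, d, P)) % P
    return (repeated * pow(X, r, P) + poly_hash(pattern[:r])) % P


def expand_copy(pattern: bytes, length: int) -> bytes:
    """Reference expansion: an LZ77 copy that reads bytes it has just written."""
    d = len(pattern)
    out = bytearray(pattern)  # the last d bytes of the window
    for _ in range(length):
        out.append(out[-d])   # copy the byte d positions back
    return bytes(out[d:])


assert expand_copy(b"abcd", 11) == b"abcdabcdabc"
assert overlapping_copy_hash(b"abcd", 11) == poly_hash(expand_copy(b"abcd", 11))
```

The algebraic path costs O(d) for the pattern hash plus O(log q) multiplications, whereas the expansion costs O(l); the gap is largest for long runs such as the d = 1 run-length case in the tests.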
uhc-0.1.0/uhc/__init__.py
ADDED
@@ -0,0 +1,10 @@
+"""
+UHC — Unified Hash-Compression Engine
+
+Compressed-domain hashing over LZ77 streams.
+Computes polynomial hashes of uncompressed data directly from
+compressed token representations without decompression.
+"""
+
+__version__ = "0.1.0"
+__author__ = "UHC Contributors"
@@ -0,0 +1,240 @@
|
|
|
1
|
+
"""
|
|
2
|
+
Polynomial hash over Z/pZ with Mersenne prime arithmetic.
|
|
3
|
+
|
|
4
|
+
Implements Definitions 1-3, Lemmas 1-2, 13-14, Theorems 1-3, 5
|
|
5
|
+
from the compressed-domain hashing framework.
|
|
6
|
+
|
|
7
|
+
All arithmetic is performed modulo a Mersenne prime p = 2^61 - 1
|
|
8
|
+
using bit-shift reduction (Lemma 14) to avoid expensive division.
|
|
9
|
+
"""
|
|
10
|
+
|
|
11
|
+
from __future__ import annotations
|
|
12
|
+
|
|
13
|
+
# ---------------------------------------------------------------------------
|
|
14
|
+
# Constants
|
|
15
|
+
# ---------------------------------------------------------------------------
|
|
16
|
+
|
|
17
|
+
MERSENNE_61: int = (1 << 61) - 1 # 2^61 - 1
|
|
18
|
+
|
|
19
|
+
|
|
20
|
+
# ---------------------------------------------------------------------------
|
|
21
|
+
# Mersenne prime arithmetic (Lemma 14)
|
|
22
|
+
# ---------------------------------------------------------------------------
|
|
23
|
+
|
|
24
|
+
def mersenne_mod(a: int, p: int = MERSENNE_61) -> int:
|
|
25
|
+
"""
|
|
26
|
+
Reduce a non-negative integer a modulo a Mersenne prime p = 2^k - 1.
|
|
27
|
+
|
|
28
|
+
Uses the identity: a mod (2^k - 1) = (a >> k) + (a & ((1 << k) - 1))
|
|
29
|
+
with at most one conditional subtraction.
|
|
30
|
+
|
|
31
|
+
Proof (Lemma 14): Write a = q·2^k + r. Then a = q·(p+1) + r = q·p + (q+r),
|
|
32
|
+
so a ≡ q + r (mod p). Since q + r < 2^(k+1), at most one subtraction suffices.
|
|
33
|
+
"""
|
|
34
|
+
# Determine k from p: p = 2^k - 1, so k = p.bit_length()
|
|
35
|
+
k = p.bit_length()
|
|
36
|
+
# Repeated folding for very large values
|
|
37
|
+
while a >= (1 << (2 * k)):
|
|
38
|
+
a = (a >> k) + (a & p)
|
|
39
|
+
# Final fold
|
|
40
|
+
a = (a >> k) + (a & p)
|
|
41
|
+
# At most one more fold needed
|
|
42
|
+
a = (a >> k) + (a & p)
|
|
43
|
+
# Conditional subtraction
|
|
44
|
+
if a >= p:
|
|
45
|
+
a -= p
|
|
46
|
+
return a
|
|
47
|
+
|
|
48
|
+
|
|
49
|
+
def mersenne_mul(a: int, b: int, p: int = MERSENNE_61) -> int:
|
|
50
|
+
"""
|
|
51
|
+
Compute (a * b) mod p for Mersenne prime p.
|
|
52
|
+
|
|
53
|
+
Python's arbitrary-precision integers make this straightforward:
|
|
54
|
+
multiply natively, then reduce via mersenne_mod.
|
|
55
|
+
"""
|
|
56
|
+
return mersenne_mod(a * b, p)
|
|
57
|
+
|
|
58
|
+
|
|
59
|
+
# ---------------------------------------------------------------------------
|
|
60
|
+
# Geometric accumulator Φ (Definition 3, Theorem 3)
|
|
61
|
+
# ---------------------------------------------------------------------------
|
|
62
|
+
|
|
63
|
+
def phi(q: int, alpha: int, p: int = MERSENNE_61) -> int:
|
|
64
|
+
"""
|
|
65
|
+
Compute Φ(q, α) = Σ_{i=0}^{q-1} α^i (mod p) via inverse-free
|
|
66
|
+
repeated doubling.
|
|
67
|
+
|
|
68
|
+
Recurrence (Theorem 3):
|
|
69
|
+
Φ(0, α) = 0
|
|
70
|
+
Φ(1, α) = 1
|
|
71
|
+
Φ(2k, α) = Φ(k, α) · (1 + α^k)
|
|
72
|
+
Φ(2k+1, α) = Φ(2k, α) · α + 1
|
|
73
|
+
|
|
74
|
+
Handles the degenerate case α ≡ 1 (mod p) correctly,
|
|
75
|
+
yielding Φ(q, 1) = q (mod p) (Lemma 3).
|
|
76
|
+
|
|
77
|
+
Time: O(log q) multiplications in Z/pZ.
|
|
78
|
+
|
|
79
|
+
Implementation note: We scan q's bits left-to-right (MSB to LSB).
|
|
80
|
+
We maintain alpha_power = α^n where n is the current accumulated
|
|
81
|
+
value of q built so far. At each step:
|
|
82
|
+
- Doubling (n → 2n): alpha_power squares (α^n → α^(2n))
|
|
83
|
+
- Odd step (n → n+1): alpha_power multiplies by α (α^n → α^(n+1))
|
|
84
|
+
This ensures the correct α^k is used in each doubling step.
|
|
85
|
+
"""
|
|
86
|
+
if q == 0:
|
|
87
|
+
return 0
|
|
88
|
+
if q == 1:
|
|
89
|
+
return 1
|
|
90
|
+
|
|
91
|
+
bits = q.bit_length()
|
|
92
|
+
|
|
93
|
+
phi_val = 1 # Φ(1, α) = 1
|
|
94
|
+
alpha_power = alpha # α^1 — tracks α^(current n)
|
|
95
|
+
|
|
96
|
+
# Scan from the second-most-significant bit down to bit 0
|
|
97
|
+
for i in range(bits - 2, -1, -1):
|
|
98
|
+
# Even step: Φ(2k, α) = Φ(k, α) · (1 + α^k)
|
|
99
|
+
# alpha_power currently holds α^k
|
|
100
|
+
phi_val = mersenne_mul(phi_val, (1 + alpha_power) % p, p)
|
|
101
|
+
# α^k → α^(2k)
|
|
102
|
+
alpha_power = mersenne_mul(alpha_power, alpha_power, p)
|
|
103
|
+
|
|
104
|
+
# If current bit is 1: odd step (2k → 2k+1)
|
|
105
|
+
if (q >> i) & 1:
|
|
106
|
+
# Φ(2k+1, α) = Φ(2k, α) · α + 1
|
|
107
|
+
phi_val = (mersenne_mul(phi_val, alpha, p) + 1) % p
|
|
108
|
+
# α^(2k) → α^(2k+1)
|
|
109
|
+
alpha_power = mersenne_mul(alpha_power, alpha, p)
|
|
110
|
+
|
|
111
|
+
return phi_val
+
+
+# ---------------------------------------------------------------------------
+# Polynomial Hash (Definition 2, Theorems 1-2, 5)
+# ---------------------------------------------------------------------------
+
+class PolynomialHash:
+    """
+    Polynomial hash function over Z/pZ.
+
+    H(s) = Σ_{i=0}^{n-1} (s_i + 1) · x^(n-1-i) (mod p)
+
+    The +1 offset (Definition 2) ensures no byte maps to zero in Z/pZ,
+    preventing the zero-padding collision (Lemma 1).
+
+    Parameters
+    ----------
+    prime : int
+        A Mersenne prime. Default: 2^61 - 1.
+    base : int
+        The hash base x, chosen from {2, ..., p-1}.
+    """
+
+    __slots__ = ("_p", "_x", "_power_cache")
+
+    def __init__(self, prime: int = MERSENNE_61, base: int = 131) -> None:
+        if base < 2 or base >= prime:
+            raise ValueError(f"Base must be in [2, p-1], got {base}")
+        self._p = prime
+        self._x = base
+        self._power_cache: dict[int, int] = {0: 1, 1: base}
+
+    @property
+    def prime(self) -> int:
+        return self._p
+
+    @property
+    def base(self) -> int:
+        return self._x
+
+    def power(self, n: int) -> int:
+        """
+        Compute x^n mod p with caching.
+
+        Uses Python's built-in pow(x, n, p) which implements
+        binary exponentiation (Lemma 13).
+        """
+        if n in self._power_cache:
+            return self._power_cache[n]
+        result = pow(self._x, n, self._p)
+        self._power_cache[n] = result
+        return result
+
+    def hash(self, data: bytes) -> int:
+        """
+        Compute H(data) = Σ (data[i] + 1) · x^(n-1-i) (mod p).
+
+        For the empty string, returns 0 (Definition 2).
+
+        Parameters
+        ----------
+        data : bytes
+            The byte string to hash.
+
+        Returns
+        -------
+        int
+            Hash value in [0, p-1].
+        """
+        if len(data) == 0:
+            return 0
+
+        p = self._p
+        x = self._x
+        h = 0
+        for byte in data:
+            # h = h · x + (byte + 1)
+            h = mersenne_mod(h * x + byte + 1, p)
+        return h
+
+    def hash_concat(self, h_a: int, len_b: int, h_b: int) -> int:
+        """
+        Compute H(A ‖ B) from H(A), |B|, H(B) via Theorem 1.
+
+        H(A ‖ B) = H(A) · x^|B| + H(B) (mod p)
+        """
+        x_pow = self.power(len_b)
+        return mersenne_mod(h_a * x_pow + h_b, self._p)
+
+    def hash_repeat(self, h_s: int, d: int, q: int) -> int:
+        """
+        Compute H(S^q) from H(S) and |S| = d via Theorem 2.
+
+        H(S^q) = H(S) · Φ(q, x^d) (mod p)
+        """
+        x_d = self.power(d)
+        phi_val = phi(q, x_d, self._p)
+        return mersenne_mul(h_s, phi_val, self._p)
+
+    def hash_overlap(self, h_p: int, d: int, l: int, h_prefix: int) -> int:
+        """
+        Compute H(W) for overlapping back-reference via Theorem 5.
+
+        W = P^q ‖ P[0..r-1] where q = ⌊l/d⌋, r = l mod d.
+
+        H(W) = H(P) · Φ(q, x^d) · x^r + H(P[0..r-1]) (mod p)
+
+        Parameters
+        ----------
+        h_p : int
+            H(P), hash of the pattern of length d.
+        d : int
+            Pattern length.
+        l : int
+            Total back-reference length (d < l for overlapping).
+        h_prefix : int
+            H(P[0..r-1]) where r = l mod d. Pass 0 if r = 0.
+        """
+        q, r = divmod(l, d)
+        x_d = self.power(d)
+        x_r = self.power(r)
+
+        # H(P) · Φ(q, x^d)
+        phi_val = phi(q, x_d, self._p)
+        h_repeated = mersenne_mul(h_p, phi_val, self._p)
+
+        # · x^r + H(P[0..r-1])
+        result = mersenne_mod(h_repeated * x_r + h_prefix, self._p)
+        return result
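Review note: the three identities quoted in the docstrings of `hash_concat`, `hash_repeat` and `hash_overlap` can be checked against direct hashing of the materialized bytes. The sketch below is self-contained and does not import the package: `poly_hash` and `phi_sum` are hypothetical helpers that use `%` and the built-in `pow` in place of `mersenne_mod`, `mersenne_mul` and `phi`, and the constants are assumed to match the defaults shown in `__init__` (prime 2^61 - 1, base 131).

```python
PRIME = (1 << 61) - 1  # 2^61 - 1, the default prime named in the class docstring
BASE = 131             # default base from __init__


def poly_hash(data: bytes) -> int:
    """H(data) = sum of (byte + 1) * BASE^(n-1-i) mod PRIME, via Horner's rule."""
    h = 0
    for byte in data:
        h = (h * BASE + byte + 1) % PRIME
    return h


def phi_sum(q: int, alpha: int) -> int:
    """Naive geometric sum 1 + alpha + ... + alpha^(q-1) mod PRIME (illustration only)."""
    return sum(pow(alpha, k, PRIME) for k in range(q)) % PRIME


a, b = b"hello, ", b"world"
pattern = b"abc"
d = len(pattern)
x_d = pow(BASE, d, PRIME)

# Concatenation: H(A || B) = H(A) * x^|B| + H(B)
assert poly_hash(a + b) == (poly_hash(a) * pow(BASE, len(b), PRIME) + poly_hash(b)) % PRIME

# Repetition: H(S^q) = H(S) * Phi(q, x^d)
assert poly_hash(pattern * 5) == poly_hash(pattern) * phi_sum(5, x_d) % PRIME

# Overlapping back-reference of total length l: W = P^q || P[0..r-1]
l = 11
q, r = divmod(l, d)  # q = 3 full copies, r = 2 trailing bytes
w = pattern * q + pattern[:r]
h_w = (poly_hash(pattern) * phi_sum(q, x_d) * pow(BASE, r, PRIME) + poly_hash(pattern[:r])) % PRIME
assert poly_hash(w) == h_w

# The +1 offset keeps a leading zero byte from vanishing.
assert poly_hash(b"\x00a") != poly_hash(b"a")
```

`phi_sum` costs O(q) modular exponentiations and is only meant for cross-checking; the `phi` function in this file reaches the same value in O(log q) multiplications.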
@@ -0,0 +1,75 @@
+Metadata-Version: 2.4
+Name: uhc
+Version: 0.1.0
+Summary: Unified Hash-Compression Engine: compressed-domain hashing over LZ77 streams
+Author: UHC Contributors
+License: MIT
+Project-URL: Homepage, https://github.com/uhc-framework/uhc
+Project-URL: Documentation, https://github.com/uhc-framework/uhc
+Project-URL: Repository, https://github.com/uhc-framework/uhc
+Project-URL: Issues, https://github.com/uhc-framework/uhc/issues
+Keywords: compression,hashing,lz77,deduplication,integrity,polynomial-hash,deflate,zstandard,lz4
+Classifier: Development Status :: 2 - Pre-Alpha
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Topic :: Scientific/Engineering
+Classifier: Topic :: System :: Archiving :: Compression
+Classifier: Topic :: Security :: Cryptography
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Provides-Extra: dev
+Requires-Dist: pytest>=8.0; extra == "dev"
+Requires-Dist: pytest-cov>=5.0; extra == "dev"
+Provides-Extra: formats
+Requires-Dist: lz4>=4.0; extra == "formats"
+Requires-Dist: zstandard>=0.22; extra == "formats"
+Requires-Dist: blake3>=1.0; extra == "formats"
+Dynamic: license-file
+
+# UHC — Unified Hash-Compression Engine
+
+A Python framework for compressed-domain hashing over LZ77 streams.
+
+UHC computes the polynomial hash of uncompressed data by operating directly on compressed token streams (DEFLATE, LZ4, Zstandard), without ever materializing the decompressed bytes.
+
+## Key Features
+
+- **Compressed-domain hashing:** Compute integrity hashes without decompression
+- **Hash-augmented rope:** Novel data structure with RepeatNode for O(k·log q) overlapping back-reference resolution
+- **Multi-hash collision resistance:** k-wise independent polynomial hashes with configurable security levels
+- **Format compatible:** DEFLATE, LZ4, Zstandard support via unified token abstraction
+- **Sliding window:** O(1) memory relative to decompressed size for bounded-window formats
+
+## Installation
+
+```bash
+pip install uhc
+```
+
+## Quick Start
+
+```python
+import uhc
+
+# Coming soon — Phase 1 implementation in progress
+```
+
+## Mathematical Foundation
+
+The complete mathematical framework (24 theorems, 14 lemmas, 5 corollaries) with full proofs is available in `theory/compressed_domain_hashing_framework.md`.
+
+## Project Status
+
+**Phase 1:** Algebraic core — polynomial hash, naive LZ77, compressed-domain verifier
+
+## License
+
+MIT
@@ -0,0 +1,12 @@
+LICENSE
+README.md
+pyproject.toml
+tests/test_polynomial_hash.py
+uhc/__init__.py
+uhc.egg-info/PKG-INFO
+uhc.egg-info/SOURCES.txt
+uhc.egg-info/dependency_links.txt
+uhc.egg-info/requires.txt
+uhc.egg-info/top_level.txt
+uhc/core/__init__.py
+uhc/core/polynomial_hash.py
@@ -0,0 +1 @@
+
@@ -0,0 +1 @@
+uhc
|