min-mphash 0.5.0
- package/README.en.md +321 -0
- package/README.md +335 -0
- package/README.zh.md +323 -0
- package/dist/MinMPHash.d.ts +140 -0
- package/dist/MinMPLookup.d.ts +115 -0
- package/dist/index.d.ts +4 -0
- package/dist/index.js +937 -0
- package/dist/util.d.ts +26 -0
- package/package.json +37 -0
package/README.en.md
ADDED

# MinMPHash & MinMPLookup

> Mini Minimal Perfect Hash & Mini Minimal Perfect Lookup

[中文 README](./README.zh.md)

`MinMPHash` can map a set of n strings to the integer range `[0, n-1]` without any collisions.

`MinMPLookup` is a minimal perfect lookup table tool built on top of `MinMPHash`.

It can minimize the storage of maps of the form `{ key1: [value1, value2, ...], key2: [value3, value4, ...] }`, making it possible to look up the corresponding `key` from a `value`.

Compared to raw storage, the minimized lookup table can shrink the data to less than 10% of its original size (the exact compression ratio depends on the information entropy of the values in the dataset; the higher the entropy, the better the compression).

## What is Minimal Perfect Hash?

Hash functions map data onto a range, producing a fixed-length "fingerprint". However, ordinary hash functions have two common problems:

- Space waste caused by sparsity: the hash range is usually much larger than the actual amount of data, so the resulting hash table is very sparse.
- Collisions: different inputs may map to the same hash value, and reducing the collision rate usually requires longer hash values, which wastes space.

A Minimal Perfect Hash Function (MPHF) is a special class of hash function:

- It guarantees no collisions for a given set of n distinct inputs;
- Its output range is exactly `[0, n-1]`, so space utilization is optimal.

In other words, if you have n different strings, an MPHF maps them one-to-one onto the integer range `[0, n-1]`, with each string corresponding to a unique index.

```
 text set             Hash                Hash Table (Sparse)
+----------+      +------------+      +-------------------+
| Apple    | ---> |    h(x)    | ---> | 0: [ Apple  ]     |
| Banana   |      +------------+      | 1:                | <--- Gap
| Cherry   |                          | 2: [ Banana ]     |
+----------+                          | 3:                | <--- Gap
                                      | ...               | <--- Gap
                                      | 9: [ Cherry ]     |
                                      +-------------------+
                                       (Waste of Space)


 text set     🤩 Minimal Perfect Hash     Hash Table (Compact)
+----------+      +--------------+      +-------------------+
| Apple    | ---> |   mmph(x)    | ---> | 0: [ Apple  ]     |
| Banana   |      +--------------+      | 1: [ Banana ]     |
| Cherry   |                            | 2: [ Cherry ]     |
+----------+                            +-------------------+
                                        ( 100% Space Utilization )
```
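
To make the property concrete, the snippet below hashes every member of a small set and checks that the results form a permutation of `[0, n-1]`. It is an informal sketch using the `createMinMPHashDict` and `MinMPHash` APIs documented in the Usage section below, not part of the library's own examples.

```js
import { createMinMPHashDict, MinMPHash } from "min-mphash";

// A small input set; n = 3.
const fruits = ["Apple", "Banana", "Cherry"];

// Build the dictionary (default JSON output) and a hash instance,
// exactly as shown in the Usage section below.
const dict = createMinMPHashDict(fruits);
const mph = new MinMPHash(dict);

// Every member maps to a distinct index in [0, n-1].
const indices = fruits.map((s) => mph.hash(s));
console.log(indices);                                 // e.g. [2, 0, 1] -- the order depends on the dictionary
console.log(new Set(indices).size === fruits.length); // true: no collisions
```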

## When to use Minimal Perfect Hash?

Minimal perfect hashing is suitable for scenarios where a fixed set of deterministic keys needs to be mapped to compact integer indices. Compared with an ordinary mapping table, an MPHF avoids the cost of storing the complete keys: only the integer index corresponding to each key needs to be stored to achieve the same mapping.

In other words, as long as the keys in the dataset are deterministic (they will not change), each key, no matter how long, can be stored and uniquely identified by a single `number` sized to the dataset, without conflicts.

This is exactly what common hashes (such as MD5 or SHA-1) cannot do: the hash values they produce are long (MD5 is 16 bytes, SHA-1 is 20 bytes), while an MPHF can choose the smallest integer range that fits the dataset, achieving much higher space utilization.

For example, suppose you have a list of font names and want to map each font name to its font family name.
The usual approach is a plain mapping table:

```js
let FontMap = {
  "Source Sans": "Source Sans",
  "Source Sans Black": "Source Sans",
  "Source Sans Bold": "Source Sans",
  "思源黑体 CN ExtraLight": "Source Sans",
  "思源黑体 TW Light": "Source Sans",
  // ... 6000+
};

let query = "思源黑体 TW Light";
let found = FontMap[query]; // 'Source Sans'
```

Such a mapping table has to store every key (every font name), which takes up a lot of space when the number of keys is large. With a minimal perfect hash, we can store only the index (hash value) of each font name instead of the full name:

```js
import { createMinMPHashDict, MinMPHash } from "min-mphash";

// Create a set containing all font names
let values = [
  "Source Sans",
  "Source Sans Black",
  "Source Sans Bold",
  "思源黑体 CN ExtraLight",
  "思源黑体 TW Light",
  // ... 6000+
];

// Create a minimal perfect hash dictionary from the values
let dict = createMinMPHashDict(values);
// Create a hash function instance from the dictionary
let minHash = new MinMPHash(dict);

// Now hash values replace the full font names:
let FontMapWithHash = {
  "Source Sans": [1, 2, 3, 21 /* ... */],
  Heiti: [12, 12 /* ... */],
  "JetBrains Mono": [32, 112 /* ... */],
  // ...
};

// When querying, first compute the hash value, then find the corresponding
// font family name through that hash value
let query = "思源黑体 TW Light";
let query_hash = minHash.hash(query); // e.g. 42

let found = Object.entries(FontMapWithHash).find(([family, hashes]) =>
  hashes.includes(query_hash)
)[0]; // 'Source Sans'
```
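
Because the hash values are dense indices in `[0, n-1]`, the linear `find` scan above can also be replaced by a flat array keyed by hash value. The following sketch continues the snippet above; `familyByHash` is an illustrative helper structure, not a min-mphash API:

```js
// Build once: a flat array of length n (= values.length), indexed by the
// minimal perfect hash value.
let familyByHash = new Array(values.length);
for (let [family, hashes] of Object.entries(FontMapWithHash)) {
  for (let h of hashes) familyByHash[h] = family;
}

// Query in O(1): hash the name, then index into the array.
let family = familyByHash[minHash.hash("思源黑体 TW Light")]; // 'Source Sans'
```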

This can significantly reduce storage space, because there is no longer any need to store the full key text, only a much shorter integer index.

You might think that hash functions like MD5 or SHA-1 could also generate such identifiers, but their hash values are long (MD5 is 16 bytes, SHA-1 is 20 bytes). FNV-1a can be squeezed down to 4 bytes in some scenarios, but its collision rate is then high. A minimal perfect hash can pick the smallest range that fits the dataset, guaranteeing no collisions while achieving extreme space utilization.

## Usage

### Installation

```bash
npm install min-mphash
```

### MinMPHash Usage

This is the core feature: mapping a set of strings to integers in `[0, n-1]`.

#### Step 1: Create the Dictionary (Build Time)

Generate the hash dictionary in your build script or server-side code.

```typescript
import { createMinMPHashDict } from "min-mphash";
import * as fs from "fs";

// Example string set
const mySet = ["Apple", "Banana", "Cherry", "Date", "Elderberry"];

// Create the dictionary as binary data
// outputBinary: true returns a Uint8Array, suitable for storage or network transmission
const dictBuffer = createMinMPHashDict(mySet, {
  outputBinary: true,
  level: 5, // Optimization level [1-10]; higher is smaller but slower to build
});

fs.writeFileSync("mph-dict.bin", dictBuffer);
```

#### Step 2: Use the Dictionary to Generate Hashes (Runtime)

Load the dictionary and perform hash queries in your application (e.g., in the browser).

```typescript
import { MinMPHash } from "min-mphash";

// Assume the binary data has already been loaded
const dictBuffer = await fetch("mph-dict.bin").then((res) => res.arrayBuffer());
const dict = new Uint8Array(dictBuffer);

const mph = new MinMPHash(dict);

console.log(mph.hash("Apple")); // 0 (or another unique value between 0-4)
console.log(mph.hash("Banana")); // 2
console.log(mph.hash("Cherry")); // 4

// Note: for strings not in the set, this also returns a value in [0, n-1]
// (a property of MPHF), unless you enable Validation Mode (see below).
console.log(mph.hash("sdfsd94jx#*")); // may return 1
```
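
If the runtime is Node rather than a browser, the same dictionary file can simply be read from disk. This is an illustrative variation of the example above, assuming `mph-dict.bin` was written in Step 1:

```typescript
import { MinMPHash } from "min-mphash";
import * as fs from "fs";

// Read the binary dictionary produced in Step 1 from disk.
const dict = new Uint8Array(fs.readFileSync("mph-dict.bin"));

const mph = new MinMPHash(dict);
console.log(mph.hash("Date")); // a unique index in [0, 4]
```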

#### Validation Mode `onlySet`

A standard minimal perfect hash function also returns an index within the range for inputs **not in the set** (this is a property of MPHF). If your application needs to treat queries outside the set as "misses", enable validation mode `onlySet` when creating the dictionary.

`onlySet` stores a fingerprint of each key at the cost of extra space. At query time the fingerprint is verified: if verification fails, `-1` is returned to indicate a miss.

```typescript
let dict = createMinMPHashDict(mySet, { onlySet: "8" });
```

| onlySet | Space Usage (per key) | False Positive Rate |
| ------- | --------------------- | ------------------- |
| 2       | 0.25 byte             | ~25%                |
| 4       | 0.5 byte              | ~6.25%              |
| 8       | 1 byte                | ~0.39%              |
| 16      | 2 bytes               | ~0.0015%            |
| 32      | 4 bytes               | ~0.00000002%        |

Note: the "False Positive Rate" in the table is the probability that an input **not** in the set is incorrectly judged to be in the set; if the key really is in the set, verification always succeeds.

#### Dictionary Format: JSON/CBOR/CBOR.Gzip

`createMinMPHashDict` can output dictionaries in several formats:

- **Binary**
  `{ outputBinary: true }`
  Returns a CBOR `Uint8Array`.
- **Compressed Binary**
  `{ outputBinary: true, enableCompression: true }`
  Returns a Gzip-compressed CBOR `Uint8Array`.
- **JSON**
  Default.
  Returns a plain JavaScript object, convenient for debugging and inspection.

In general, the compressed binary format is recommended because it is the smallest. JSON is more convenient during development, and if your server/CDN applies transparent compression, you can ship the JSON format directly; the final size difference is not large.
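
To check this trade-off on your own dataset, you can compare the default JSON output, the binary output, and gzip applied to the JSON (roughly what a CDN's transparent compression would transfer). This is a rough measurement sketch, not an official benchmark tool:

```typescript
import { createMinMPHashDict } from "min-mphash";
import * as zlib from "zlib";

const mySet = ["Apple", "Banana", "Cherry", "Date", "Elderberry"];

const jsonDict = createMinMPHashDict(mySet);                        // default: plain object
const binDict = createMinMPHashDict(mySet, { outputBinary: true }); // CBOR Uint8Array

const jsonBytes = Buffer.from(JSON.stringify(jsonDict));

console.log("json bytes       :", jsonBytes.byteLength);
console.log("json + gzip bytes:", zlib.gzipSync(jsonBytes).byteLength); // ~what a CDN transfers
console.log("cbor bytes       :", binDict.byteLength);
```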

### MinMPLookup Usage

If you need a `Value -> Key` lookup (for example, looking up a `font family name` by `font file name`, or a `country` by `city`) over a large amount of data, you can use `MinMPLookup`. It uses MPHF plus differential encoding to compress the mapping dramatically.

#### Scenario

Suppose you have the following mapping:

```js
const lookupMap = {
  China: ["Beijing", "Shanghai", "Guangzhou"],
  USA: ["New York", "Los Angeles"],
  Japan: ["Tokyo"],
};
// Goal: input "Shanghai" -> get "China"
```
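
Without `MinMPLookup`, the usual approach is to invert the map into a plain value-to-key table, which stores every value string verbatim; that baseline is what the compressed dictionary below replaces. An illustrative sketch using the `lookupMap` above:

```js
// Naive reverse index: every city string is stored again as a Map key.
const reverseMap = new Map();
for (const [country, cities] of Object.entries(lookupMap)) {
  for (const city of cities) reverseMap.set(city, country);
}

console.log(reverseMap.get("Shanghai")); // "China"
// With ~100k values this keeps every value string in memory and on disk;
// that storage overhead is what MinMPLookup's dictionary is designed to avoid.
```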

#### Create Lookup Dictionary

```typescript
import { createMinMPLookupDict } from "min-mphash";

const lookupMap = {
  China: ["Beijing", "Shanghai", "Guangzhou"],
  USA: ["New York", "Los Angeles"],
  Japan: ["Tokyo"],
};

// Generate a compressed binary dictionary
const lookupDictBin = createMinMPLookupDict(lookupMap, {
  outputBinary: true,
  enableCompression: true, // Built-in Gzip compression (Node/Bun, or browsers with CompressionStream)
});

// Save to file
// fs.writeFileSync("lookup.bin", lookupDictBin);
```

#### Query

```typescript
import { MinMPLookup } from "min-mphash";

// Load the dictionary
const lookup = await MinMPLookup.fromCompressed(lookupDictBin);
// If enableCompression was not used, call MinMPLookup.fromBinary(bin) instead

console.log(lookup.query("Shanghai")); // "China"
console.log(lookup.queryAll("New York")); // ["USA"]
console.log(lookup.query("Unknown City")); // null
console.log(lookup.keys()); // ["China", "USA", "Japan"]
```
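
In a browser, the compressed dictionary would typically be fetched over the network instead of being held in memory. A sketch of that wiring, assuming `lookup.bin` was written at build time as shown above:

```typescript
import { MinMPLookup } from "min-mphash";

// Fetch the compressed binary dictionary produced at build time.
const buf = await fetch("lookup.bin").then((res) => res.arrayBuffer());
const lookup = await MinMPLookup.fromCompressed(new Uint8Array(buf));

console.log(lookup.query("Tokyo")); // "Japan"
```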

#### Validation Mode `onlySet`

A standard minimal perfect hash function also returns an in-range index for inputs **not in the set**. To address this, the same `onlySet` validation mode can be enabled for the lookup dictionary, ensuring that lookups outside the set return `null`.

```ts
const lookupDictBin = createMinMPLookupDict(lookupMap, {
  onlySet: "8", // Enable 8-bit validation mode
});
```
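
A minimal end-to-end sketch with validation enabled, assuming the uncompressed binary output so it can be loaded with `MinMPLookup.fromBinary`; out-of-set queries return `null`, apart from the small false-positive rates listed in the table above:

```ts
import { createMinMPLookupDict, MinMPLookup } from "min-mphash";

const lookupMap = {
  China: ["Beijing", "Shanghai", "Guangzhou"],
  USA: ["New York", "Los Angeles"],
  Japan: ["Tokyo"],
};

// 8-bit fingerprints: ~0.39% chance that an unknown value slips through.
const lookupDictBin = createMinMPLookupDict(lookupMap, {
  outputBinary: true,
  onlySet: "8",
});

const lookup = await MinMPLookup.fromBinary(lookupDictBin);
console.log(lookup.query("Shanghai")); // "China"
console.log(lookup.query("Atlantis")); // null (almost always)
```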

## Benchmark

### MinMPHash Dictionary Size Benchmark

```
=== MinMPHash Big Dataset Size Benchmark ===
Generating dataset of size 1000000...
Dataset size: 1000000 items

Dataset json size: 41836.25 KB
Dataset json gzip size: 6473.48 KB

➤ Optimization Level 5
┌──────────┬──────────────┬──────────────────────┬──────────────────────┬──────────────┬──────────────┐
│ OnlySet  │ JSON Size    │ Binary Size (Ratio)  │ Gzip Size (Ratio)    │ vs None      │ Build Time   │
├──────────┼──────────────┼──────────────────────┼──────────────────────┼──────────────┼──────────────┤
│ none     │ 979.18 KB    │ 341.37 KB ( 35%)     │ 268.36 KB ( 27%)     │ 100.0 %      │ 2502.35 ms   │
│ 2        │ 1849.51 KB   │ 585.38 KB ( 32%)     │ 512.86 KB ( 28%)     │ 171.5 %      │ 2981.46 ms   │
│ 4        │ 2721.75 KB   │ 829.94 KB ( 30%)     │ 757.49 KB ( 28%)     │ 243.1 %      │ 3109.38 ms   │
│ 8        │ 4465.27 KB   │ 1318.06 KB ( 30%)    │ 1245.64 KB ( 28%)    │ 386.1 %      │ 3132.11 ms   │
│ 16       │ 6672.22 KB   │ 2293.96 KB ( 34%)    │ 2222.43 KB ( 33%)    │ 672.0 %      │ 3559.02 ms   │
│ 32       │ 11468.63 KB  │ 4247.06 KB ( 37%)    │ 4176.07 KB ( 36%)    │ 1244.1 %     │ 2900.32 ms   │
└──────────┴──────────────┴──────────────────────┴──────────────────────┴──────────────┴──────────────┘
```

### MinMPLookup Dictionary Size Benchmark

```
=== MinMPLookup Big Dataset Size Benchmark ===
Generating dataset of size 100000 values...
Dataset stats: 5000 keys, 100000 values

Dataset json size: 7577.04 KB
Dataset json gzip size: 5141.74 KB

➤ Optimization Level 5
┌──────────┬──────────────┬──────────────────────┬──────────────────────┬──────────────┬──────────────┐
│ OnlySet  │ JSON Size    │ Binary Size (Ratio)  │ Gzip Size (Ratio)    │ vs None      │ Build Time   │
├──────────┼──────────────┼──────────────────────┼──────────────────────┼──────────────┼──────────────┤
│ none     │ 709.67 KB    │ 254.85 KB ( 36%)     │ 199.59 KB ( 28%)     │ 100.0 %      │ 412.65 ms    │
│ 2        │ 797.00 KB    │ 279.23 KB ( 35%)     │ 225.14 KB ( 28%)     │ 109.6 %      │ 393.94 ms    │
│ 4        │ 884.32 KB    │ 303.63 KB ( 34%)     │ 248.92 KB ( 28%)     │ 119.1 %      │ 408.93 ms    │
│ 8        │ 1058.92 KB   │ 352.48 KB ( 33%)     │ 297.58 KB ( 28%)     │ 138.3 %      │ 477.32 ms    │
│ 16       │ 1406.98 KB   │ 450.21 KB ( 32%)     │ 395.21 KB ( 28%)     │ 176.7 %      │ 421.70 ms    │
│ 32       │ 2104.73 KB   │ 645.45 KB ( 31%)     │ 591.02 KB ( 28%)     │ 253.3 %      │ 374.06 ms    │
└──────────┴──────────────┴──────────────────────┴──────────────────────┴──────────────┴──────────────┘
```

package/README.md
ADDED

<img src="./cover.jpeg" alt="MinMPHash Logo" width="100%" />

# MinMPHash & MinMPLookup

> Mini Minimal Perfect Hash & Mini Minimal Perfect Lookup

[中文 README](./README.zh.md)

A minimal perfect hash and lookup tool implementation for the TypeScript/JavaScript platform.

`MinMPHash` can map a set of n strings to the integer range `[0, n-1]` without any collisions.

`MinMPLookup` is a minimal perfect lookup table tool built on top of `MinMPHash`.

It can minimize the storage of maps of the form `{ key1: [value1, value2, ...], key2: [value3, value4, ...] }`, making it possible to look up the corresponding `key` from a `value`.

Compared to raw storage, the minimized lookup table can shrink the data to less than 10% of its original size (the exact compression ratio depends on the information entropy of the values in the dataset; the higher the entropy, the better the compression).