@sepiariver/unique-set 2.0.1 → 2.0.2
- package/PERF.md +153 -0
- package/README.md +12 -3
- package/package.json +3 -2
- package/temp.cjs +0 -10
- package/temp.mjs +0 -10
package/PERF.md
ADDED
@@ -0,0 +1,153 @@
+# Performance
+
+Comparison against the native Set is irrelevant. Not only does it lack value deduplication, it's so much faster it makes comparison meaningless.
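To illustrate why the native `Set` is not a useful baseline, here is a minimal sketch in plain JavaScript. It uses only the built-in `Set`; the `dedupeByValue` helper is hypothetical and uses `JSON.stringify` as a crude stand-in for deep comparison, which is an assumption and not how the package compares values.

```javascript
// Native Set compares objects by identity (SameValueZero), so two
// structurally equal objects are both kept.
const native = new Set();
native.add({ a: 1 });
native.add({ a: 1 });
const nativeSize = native.size;

// Hypothetical helper: dedupe by value, keyed on the serialized form.
// Unlike a real deep-equal check, key order affects the result here.
function dedupeByValue(items) {
  const seen = new Map();
  for (const item of items) {
    const key = JSON.stringify(item);
    if (!seen.has(key)) seen.set(key, item);
  }
  return [...seen.values()];
}

const deduped = dedupeByValue([{ a: 1 }, { a: 1 }, [1, 2], [1, 2]]);
console.log(nativeSize, deduped.length);
```

The native `Set` ends up holding both copies, while the value-based helper keeps one entry per distinct value. The two structures answer different questions, so their timings are not comparable.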
+
+BloomSet was initialized with the default bit array size of `6553577` and either `1` or `7` hash iterations.
+
+All timing is expressed in milliseconds.
+
+## Shallow Data
+
+Shallow plain objects and Arrays, with 5-10% duplicates.
+
+### BloomSet hashCount: 1
+
+Trial 1
+
+| Count | Unique   | Bloom   |
+| ----- | -------- | ------- |
+| 400   | 10.17    | 8.79    |
+| 1000  | 49.37    | 10.46   |
+| 20000 | 19812.43 | 2530.15 |
+
+Trial 2
+
+| Count | Unique   | Bloom   |
+| ----- | -------- | ------- |
+| 400   | 10.06    | 8.75    |
+| 1000  | 48.99    | 10.36   |
+| 20000 | 19663.32 | 2536.92 |
+
+It's clear BloomSet has something to offer starting at just a few hundred elements.
+
+### BloomSet hashCount: 7
+
+Trial 1
+
+| Count | Unique   | Bloom   |
+| ----- | -------- | ------- |
+| 400   | 10.53    | 9.65    |
+| 1000  | 48.60    | 10.39   |
+| 20000 | 19242.54 | 2490.88 |
+
+Trial 2
+
+| Count | Unique   | Bloom   |
+| ----- | -------- | ------- |
+| 400   | 9.79     | 9.02    |
+| 1000  | 48.85    | 10.49   |
+| 20000 | 19255.17 | 2489.50 |
+
+Performance is fairly stable and predictable with small datasets of shallow objects, regardless of hashCount.
+
+## Nested Data
+
+Plain objects and Arrays nested 1 or 2 levels deep, with 10-20% duplicates.
+
+### BloomSet hashCount: 1
+
+Trial 1
+
+| Count | Unique   | Bloom   |
+| ----- | -------- | ------- |
+| 400   | 26.32    | 12.78   |
+| 1000  | 91.30    | 16.86   |
+| 20000 | 37671.41 | 5116.11 |
+
+Trial 2
+
+| Count | Unique   | Bloom   |
+| ----- | -------- | ------- |
+| 400   | 21.15    | 12.65   |
+| 1000  | 115.2    | 16.33   |
+| 20000 | 37169.66 | 5031.50 |
+
+UniqueSet starts to suffer beyond 1000 elements. Its run time grows roughly quadratically (20x the elements takes about 400x as long), and nested objects make it worse, whereas BloomSet's optimizations keep it in the realm of usable at 20k elements. Subjectively, I'm willing to spend 5 seconds processing 20k elements if I need guaranteed uniqueness-by-value (but not 30 seconds).
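The Unique column grows by about 400x when the element count grows 20x, which is the pattern a linear scan per insertion would produce. The sketch below counts comparisons for such a scan at a smaller scale with the same 20x ratio. It assumes each insertion is checked against every stored value, which the source does not spell out, and `countComparisons` is a hypothetical helper.

```javascript
// Assumption: each insertion scans all stored values before adding.
// For n distinct values that costs 0 + 1 + ... + (n - 1) comparisons.
function countComparisons(n) {
  const stored = [];
  let comparisons = 0;
  for (let i = 0; i < n; i++) {
    let found = false;
    for (const value of stored) {
      comparisons++;
      if (value === i) {
        found = true;
        break;
      }
    }
    if (!found) stored.push(i);
  }
  return comparisons;
}

const small = countComparisons(100);
const large = countComparisons(2000);
// 20x the elements costs roughly 400x the comparisons.
console.log(small, large, (large / small).toFixed(1));
```

Each comparison in the real workload is a deep comparison, so nesting raises the cost per comparison on top of the quadratic count.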
+
+### BloomSet hashCount: 7
+
+Trial 1
+
+| Count | Unique   | Bloom   |
+| ----- | -------- | ------- |
+| 400   | 20.58    | 13.57   |
+| 1000  | 91.23    | 16.81   |
+| 20000 | 37639.03 | 5151.90 |
+
+Running 7 hashes doesn't add a lot of clock time for BloomSet, even with nested objects. Rather than recalculating the hash over the entire serialized value, BloomSet does some bit-mixing to distribute the value's representation across the bit array.
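One common way to get several bit positions from a single pass over the data is to hash the serialized value once and then run a cheap integer mixer per iteration. The sketch below shows that idea. It is an assumption about the general technique, not the package's actual code: the FNV-1a base hash, the murmur3-style `mix` finalizer, and the `bitIndexes` helper are all chosen here for illustration.

```javascript
// 32-bit FNV-1a over a string; computed once per value.
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// Cheap integer mixer (the murmur3 32-bit finalizer).
function mix(h) {
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0;
}

// Hypothetical helper: derive `hashCount` bit positions from one base hash.
function bitIndexes(value, hashCount, size) {
  const base = fnv1a(JSON.stringify(value)); // the only pass over the data
  const indexes = [];
  for (let i = 0; i < hashCount; i++) {
    // Perturb the base hash with the iteration index, then mix.
    const seeded = (base + Math.imul(i, 0x9e3779b9)) >>> 0;
    indexes.push(mix(seeded) % size);
  }
  return indexes;
}

const indexes = bitIndexes({ id: 1, tags: ["a", "b"] }, 7, 6553577);
console.log(indexes);
```

Under this scheme the per-iteration cost is a handful of integer operations, independent of how large or deeply nested the value is, which would be consistent with the small gap between the hashCount 1 and hashCount 7 timings.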
+
+Trial 2
+
+| Count | Unique   | Bloom   |
+| ----- | -------- | ------- |
+| 400   | 22.86    | 13.48   |
+| 1000  | 94.64    | 17.80   |
+| 20000 | 37673.08 | 5276.09 |
+
+## Large (relatively)
+
+Still using the nested dataset. Very roughly 15% duplicates, distributed in a contrived manner using modulo.
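The source does not show how the duplicates are placed. As a purely hypothetical sketch of what "using modulo" could look like, the generator below makes every seventh item a value-copy of the item before it, giving about one duplicate in seven. The `makeDataset` name, the item shape, and the interval of 7 are all assumptions.

```javascript
// Hypothetical generator: every `dupEvery`-th item repeats the previous
// item by value, so roughly 1 / dupEvery of the items are duplicates.
function makeDataset(count, dupEvery = 7) {
  const items = [];
  for (let i = 0; i < count; i++) {
    const source = i > 0 && i % dupEvery === 0 ? i - 1 : i;
    items.push({ id: source, nested: { tags: [source, source + 1] } });
  }
  return items;
}

const dataset = makeDataset(700);
// Count distinct values via their serialized form.
const distinct = new Set(dataset.map((item) => JSON.stringify(item))).size;
console.log(dataset.length, distinct);
```

Because the copies are fresh objects, a native `Set` would keep all 700 items while a value-based set would keep only the distinct ones.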
+
+### hashCount: 7, size: 6553577
+
+Trial 1
+
+| Count  | Unique     | Bloom      |
+| ------ | ---------- | ---------- |
+| 100000 | 982727.79  | 142716.46  |
+
+```txt
+UniqueSet size: 100000 Expected size: 100000
+BloomSet size: 100000 Expected size: 100000
+Native Set size: 114458 Expected size: 114458
+```
+
+With a (relatively) large dataset, UniqueSet is slow enough to make me not want to test it again. It might be possible to squeeze extra performance from BloomSet by tweaking the config options.
+
+Trial 2
+
+| Count  | Unique     | Bloom      |
+| ------ | ---------- | ---------- |
+| 100000 | n/a        | 149600.27  |
+
+### hashCount: 1, size: 6553577
+
+Trial 1
+
+| Count  | Unique     | Bloom      |
+| ------ | ---------- | ---------- |
+| 100000 | n/a        | 135919.53  |
+
+Trial 2
+
+| Count  | Unique     | Bloom      |
+| ------ | ---------- | ---------- |
+| 100000 | n/a        | 135913.43  |
+
+Reducing the hashCount predictably improves performance by 5-10%. Collisions fall back to `fast-deep-equal`, so we can tolerate false positives unless performance degrades.
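The fallback described here can be sketched as a two-step membership check: a cheap bit test that can only say "definitely absent" or "possibly present", followed by a deep comparison when the bit is set. The code below is a minimal illustration under stated assumptions, not the package's source. `TinyBloomSet` is hypothetical, the local `deepEqual` is a simplified stand-in for `fast-deep-equal`, and hashing the `JSON.stringify` output means key order affects which bit is tested.

```javascript
// Simplified stand-in for fast-deep-equal (plain objects, arrays, primitives).
function deepEqual(a, b) {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) return false;
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  if (keysA.length !== keysB.length) return false;
  return keysA.every(
    (k) => Object.prototype.hasOwnProperty.call(b, k) && deepEqual(a[k], b[k])
  );
}

class TinyBloomSet {
  constructor(size = 64) {
    this.size = size;
    this.bits = new Uint8Array(Math.ceil(size / 8));
    this.items = [];
    this.deepChecks = 0; // how often the slow path ran
  }
  // Single weak hash (a hashCount of 1); collisions are expected.
  indexFor(value) {
    const str = JSON.stringify(value);
    let h = 0;
    for (let i = 0; i < str.length; i++) {
      h = (Math.imul(h, 31) + str.charCodeAt(i)) >>> 0;
    }
    return h % this.size;
  }
  has(value) {
    const i = this.indexFor(value);
    const bitSet = (this.bits[i >> 3] & (1 << (i & 7))) !== 0;
    if (!bitSet) return false; // definitely absent: skip deep comparison
    this.deepChecks++; // possibly present: confirm by value
    return this.items.some((item) => deepEqual(item, value));
  }
  add(value) {
    if (this.has(value)) return this;
    const i = this.indexFor(value);
    this.bits[i >> 3] |= 1 << (i & 7);
    this.items.push(value);
    return this;
  }
}

const set = new TinyBloomSet();
set.add({ a: [1, 2] }).add({ a: [1, 2] }).add({ a: [1, 3] });
console.log(set.items.length, set.deepChecks);
```

A false positive in this design costs an extra deep comparison but never a wrong answer, which is why a lower hashCount can trade filter accuracy for less hashing work.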
+
+#### hashCount: 1, size: 28755000
+
+Trial 1
+
+| Count  | Unique     | Bloom      |
+| ------ | ---------- | ---------- |
+| 100000 | n/a        | 128660.39  |
+
+Trial 2
+
+| Count  | Unique     | Bloom      |
+| ------ | ---------- | ---------- |
+| 100000 | n/a        | 127663.77  |
+
+Using a larger bit array requires more memory: ~3.5MB in this case, which is still extremely memory-efficient for what it's doing. It seems to yield something like a 5% clock time improvement over a smaller array, possibly due to fewer false positives, leading to fewer invocations of `fast-deep-equal` for deep comparison.
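The memory figure and the false-positive explanation can be checked with a little arithmetic. The sketch below computes the bytes needed for each bit array size and the textbook Bloom filter estimate p = (1 - e^(-k*n/m))^k for n stored items, m bits, and k hashes. It assumes one bit per array slot and idealized uniform hashing; `bitArrayBytes` and `falsePositiveRate` are hypothetical helper names.

```javascript
// Bytes needed to hold `bits` bits, assuming a packed bit array.
function bitArrayBytes(bits) {
  return Math.ceil(bits / 8);
}

// Textbook estimate of the false-positive probability for a Bloom filter
// with n stored items, m bits, and k hash functions.
function falsePositiveRate(n, m, k) {
  return Math.pow(1 - Math.exp((-k * n) / m), k);
}

const n = 100000;
const defaultBits = 6553577;
const largerBits = 28755000;

const largerBytes = bitArrayBytes(largerBits);
const pDefault = falsePositiveRate(n, defaultBits, 1);
const pLarger = falsePositiveRate(n, largerBits, 1);

console.log(`larger array: ${(largerBytes / 1e6).toFixed(2)} MB`);
console.log(`p at default size: ${(pDefault * 100).toFixed(2)}%`);
console.log(`p at larger size: ${(pLarger * 100).toFixed(2)}%`);
```

Under these assumptions the larger array lowers the estimated false-positive rate at hashCount 1 by roughly a factor of four, which fits the suggestion that fewer deep comparisons account for the improvement.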
package/README.md
CHANGED
@@ -2,7 +2,7 @@
 
 Extends the native `Set` class to deeply compare using [fast-deep-equal](https://www.npmjs.com/package/fast-deep-equal), with optional Bloom filter optimization.
 
-Supports ESM and CommonJS.
+Supports ESM and CommonJS. Thanks @sakgoyal for contributing to and instigating ESM support.
 
 ```js
 import { BloomSet, UniqueSet } from '@sepiariver/unique-set';
@@ -137,10 +137,19 @@ console.log(bloom2.size); // 6
 2. `npm install`
 3. `npm run test`
 
+## Issues
+
+Issue reporting is encouraged: [https://github.com/sepiariver/unique-set/issues]
+
 ## Contributing
 
 Submit pull requests to [https://github.com/sepiariver/unique-set/pulls]
 
-##
+## Contributors
 
-
+- @sepiariver
+- @sakgoyal
+
+## License
+
+MIT
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
 "name": "@sepiariver/unique-set",
-"version": "2.0.
+"version": "2.0.2",
 "description": "Extends the native Set class to deeply compare using fast-deep-equal, with optional Bloom filter optimization. This version exports 2 classes instead of a default, breaking b/c with version 1.",
 "main": "dist/index.js",
 "module": "dist/index.mjs",
@@ -10,7 +10,8 @@
 "import": "./dist/index.mjs"
 },
 "scripts": {
-"test": "npm run build && vitest",
+"test": "npm run build && vitest basic.spec.ts",
+"bench": "npm run build && vitest bench.spec.ts bench-nested.spec.ts",
 "lint": "tsc",
 "build": "tsup index.ts --format esm --dts"
 },
package/temp.cjs
DELETED
@@ -1,10 +0,0 @@
-const { BloomSet, UniqueSet } = require("./dist/index.js");
-
-const bloom = new BloomSet();
-bloom.add("foo");
-console.log(bloom.has("foo")); // true
-
-const unique = new UniqueSet();
-unique.add("foo");
-unique.add("foo");
-console.log(unique.size); // 1
package/temp.mjs
DELETED
@@ -1,10 +0,0 @@
-import { BloomSet, UniqueSet } from "./dist/index.js";
-
-const bloom = new BloomSet();
-bloom.add("foo");
-console.log(bloom.has("foo")); // true
-
-const unique = new UniqueSet();
-unique.add("foo");
-unique.add("foo");
-console.log(unique.size); // 1