@sepiariver/unique-set 2.0.1 → 2.0.3

package/PERF.md ADDED
@@ -0,0 +1,153 @@
1
+ # Performance
2
+
3
+ Comparing against the native `Set` is not meaningful: it does not deduplicate by value, and it is so much faster that the numbers would say nothing useful.
4
+
5
+ BloomSet was initialized with the default bit-array size of `6553577` and either `1` or `7` hash iterations.
6
+
7
+ All timing is expressed in milliseconds.
8
+
9
+ ## Shallow Data
10
+
11
+ Shallow plain objects and Arrays, with 5-10% duplicates.
12
+
13
+ ### BloomSet hashCount: 1
14
+
15
+ Trial 1
16
+
17
+ | Count | Unique | Bloom |
18
+ | ----- | -------- | ------- |
19
+ | 400 | 10.17 | 8.79 |
20
+ | 1000 | 49.37 | 10.46 |
21
+ | 20000 | 19812.43 | 2530.15 |
22
+
23
+ Trial 2
24
+
25
+ | Count | Unique | Bloom |
26
+ | ----- | -------- | ------- |
27
+ | 400 | 10.06 | 8.75 |
28
+ | 1000 | 48.99 | 10.36 |
29
+ | 20000 | 19663.32 | 2536.92 |
30
+
31
+ It's clear BloomSet has something to offer starting at just a few hundred elements.
32
+
33
+ ### BloomSet hashCount: 7
34
+
35
+ Trial 1
36
+
37
+ | Count | Unique | Bloom |
38
+ | ----- | -------- | ------- |
39
+ | 400 | 10.53 | 9.65 |
40
+ | 1000 | 48.60 | 10.39 |
41
+ | 20000 | 19242.54 | 2490.88 |
42
+
43
+ Trial 2
44
+
45
+ | Count | Unique | Bloom |
46
+ | ----- | -------- | ------- |
47
+ | 400 | 9.79 | 9.02 |
48
+ | 1000 | 48.85 | 10.49 |
49
+ | 20000 | 19255.17 | 2489.50 |
50
+
51
+ Performance is fairly stable and predictable with small datasets of shallow objects, regardless of hashCount.
52
+
53
+ ## Nested Data
54
+
55
+ Plain objects and Arrays nested 1 or 2 levels deep, with 10-20% duplicates.
56
+
57
+ ### BloomSet hashCount: 1
58
+
59
+ Trial 1
60
+
61
+ | Count | Unique | Bloom |
62
+ | ----- | -------- | ------- |
63
+ | 400 | 26.32 | 12.78 |
64
+ | 1000 | 91.30 | 16.86 |
65
+ | 20000 | 37671.41 | 5116.11 |
66
+
67
+ Trial 2
68
+
69
+ | Count | Unique | Bloom |
70
+ | ----- | -------- | ------- |
71
+ | 400 | 21.15 | 12.65 |
72
+ | 1000 | 115.2 | 16.33 |
73
+ | 20000 | 37169.66 | 5031.50 |
74
+
75
+ UniqueSet starts to suffer above 1000 elements. Its running time grows roughly quadratically (20x the elements costs about 400x the time), and nesting roughly doubles it again, whereas BloomSet's optimizations keep it in the realm of usable at 20k elements. Subjectively, I'm willing to spend 5 seconds processing 20k elements if I need guaranteed uniqueness-by-value (but not 30 seconds).
76
+
77
+ ### BloomSet hashCount: 7
78
+
79
+ Trial 1
80
+
81
+ | Count | Unique | Bloom |
82
+ | ----- | -------- | ------- |
83
+ | 400 | 20.58 | 13.57 |
84
+ | 1000 | 91.23 | 16.81 |
85
+ | 20000 | 37639.03 | 5151.90 |
86
+
87
+ Running 7 hashes doesn't add a lot of clock time for BloomSet, even with nested objects. Rather than recalculating the hash over the entire serialized value, BloomSet does some bit-mixing to distribute the value's representation across the bit array.
88
+
89
+ Trial 2
90
+
91
+ | Count | Unique | Bloom |
92
+ | ----- | -------- | ------- |
93
+ | 400 | 22.86 | 13.48 |
94
+ | 1000 | 94.64 | 17.80 |
95
+ | 20000 | 37673.08 | 5276.09 |
96
+
97
+ ## Large (relatively)
98
+
99
+ Still using the nested dataset. Very roughly 15% duplicates, distributed in a contrived manner using modulo.
100
+
101
+ ### hashCount: 7, size: 6553577
102
+
103
+ Trial 1
104
+
105
+ | Count | Unique | Bloom |
106
+ | ------ | ---------- | ---------- |
107
+ | 100000 | 982727.79  | 142716.46  |
108
+
109
+ ```txt
110
+ UniqueSet size: 100000 Expected size: 100000
111
+ BloomSet size: 100000 Expected size: 100000
112
+ Native Set size: 114458 Expected size: 114458
113
+ ```
114
+
115
+ With a (relatively) large dataset, UniqueSet is slow enough to make me not want to test it again. It might be possible to squeeze extra performance from BloomSet by tweaking the config options.
116
+
117
+ Trial 2
118
+
119
+ | Count | Unique | Bloom |
120
+ | ------ | ---------- | ---------- |
121
+ | 100000 | n/a | 149600.27 |
122
+
123
+ ### hashCount: 1, size: 6553577
124
+
125
+ Trial 1
126
+
127
+ | Count | Unique | Bloom |
128
+ | ------ | ---------- | ---------- |
129
+ | 100000 | n/a | 135919.53 |
130
+
131
+ Trial 2
132
+
133
+ | Count | Unique | Bloom |
134
+ | ------ | ---------- | ---------- |
135
+ | 100000 | n/a | 135913.43 |
136
+
137
+ Reducing the hashCount predictably improves performance by 5-10%. Collisions fall back to `fast-deep-equal`, so we can tolerate false positives unless performance degrades.
138
+
139
+ #### hashCount: 1, size: 28755000
140
+
141
+ Trial 1
142
+
143
+ | Count | Unique | Bloom |
144
+ | ------ | ---------- | ---------- |
145
+ | 100000 | n/a | 128660.39 |
146
+
147
+ Trial 2
148
+
149
+ | Count | Unique | Bloom |
150
+ | ------ | ---------- | ---------- |
151
+ | 100000 | n/a | 127663.77 |
152
+
153
+ Using a larger bit array requires more memory: ~3.6 MB in this case (28755000 bits / 8), still extremely memory-efficient for what it's doing. It seems to yield something like a 5% clock-time improvement over the smaller array, possibly because fewer false positives lead to fewer invocations of `fast-deep-equal` for deep comparison.
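A rough way to sanity-check the size and hashCount trade-offs reported in PERF.md is the textbook Bloom filter estimate p ≈ (1 − e^(−kn/m))^k for m bits, n inserted items and k hashes. The sketch below applies it to the three benchmarked configurations. It assumes independent, uniformly distributed hash functions, which BloomSet's derived hashes only approximate, and the helper names `falsePositiveRate` and `bitArrayBytes` are hypothetical rather than part of the package.

```typescript
// Textbook Bloom filter estimate; assumes k independent, uniform hashes.
// Helper names are illustrative and not part of @sepiariver/unique-set.
const falsePositiveRate = (bits: number, items: number, hashCount: number): number =>
  Math.pow(1 - Math.exp((-hashCount * items) / bits), hashCount);

// Bytes needed to hold `bits` bits when packed eight per byte.
const bitArrayBytes = (bits: number): number => Math.ceil(bits / 8);

// The configurations benchmarked in PERF.md at 100000 elements.
const configs = [
  { size: 6553577, hashCount: 7 },
  { size: 6553577, hashCount: 1 },
  { size: 28755000, hashCount: 1 },
];

for (const { size, hashCount } of configs) {
  const p = falsePositiveRate(size, 100000, hashCount);
  console.log(
    `size=${size} hashCount=${hashCount} ` +
      `memory=${(bitArrayBytes(size) / 1e6).toFixed(2)} MB ` +
      `estimated false-positive rate=${p.toExponential(2)}`
  );
}
```

Under these assumptions a single hash at the default size gives a false-positive rate on the order of 1%, and the larger array lowers it several-fold, which is consistent with the suggestion that fewer deep comparisons explain the timing difference.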
package/README.md CHANGED
@@ -2,7 +2,7 @@
2
2
 
3
3
  Extends the native `Set` class to deeply compare using [fast-deep-equal](https://www.npmjs.com/package/fast-deep-equal), with optional Bloom filter optimization.
4
4
 
5
- Supports ESM and CommonJS.
5
+ Supports ESM and CommonJS. Thanks [@sakgoyal](https://github.com/sakgoyal) for contributing to and instigating ESM support.
6
6
 
7
7
  ```js
8
8
  import { BloomSet, UniqueSet } from '@sepiariver/unique-set';
@@ -137,10 +137,19 @@ console.log(bloom2.size); // 6
137
137
  2. `npm install`
138
138
  3. `npm run test`
139
139
 
140
+ ## Issues
141
+
142
+ Issue reporting is encouraged: [https://github.com/sepiariver/unique-set/issues]
143
+
140
144
  ## Contributing
141
145
 
142
146
  Submit pull requests to [https://github.com/sepiariver/unique-set/pulls]
143
147
 
144
- ## Issues
148
+ ## Contributors
145
149
 
146
- Issue reporting is encouraged: [https://github.com/sepiariver/unique-set/issues]
150
+ - @sepiariver
151
+ - @sakgoyal
152
+
153
+ ## License
154
+
155
+ MIT
package/dist/index.mjs CHANGED
@@ -1,5 +1,48 @@
1
1
  // index.ts
2
2
  import equal from "fast-deep-equal/es6/index.js";
3
+ var serialize = (item) => {
4
+ if (typeof item === "number" && isNaN(item)) {
5
+ return "NaN";
6
+ }
7
+ if (item && typeof item === "object") {
8
+ if (Array.isArray(item)) {
9
+ return `[${item.map(serialize).join("")}]`;
10
+ } else {
11
+ return `{${Object.entries(item).sort(([a], [b]) => a.localeCompare(b)).map(([k, v]) => `${k}:${serialize(v)}`).join("")}}`;
12
+ }
13
+ }
14
+ return String(item);
15
+ };
16
+ var fnv1a = (str) => {
17
+ if (typeof str !== "string") {
18
+ str = String(str);
19
+ }
20
+ let hash = 2166136261;
21
+ for (let i = 0; i < str.length; i++) {
22
+ hash ^= str.charCodeAt(i);
23
+ hash = hash * 16777619 >>> 0;
24
+ }
25
+ return hash >>> 0;
26
+ };
27
+ var findNextPrime = (num) => {
28
+ if (num < 2) return 2;
29
+ if ((num & 1) === 0) num++;
30
+ while (!isPrime(num)) {
31
+ num += 2;
32
+ }
33
+ return num;
34
+ };
35
+ var isPrime = (num) => {
36
+ if (num < 2) return false;
37
+ if (num === 2 || num === 3) return true;
38
+ if ((num & 1) === 0) return false;
39
+ if (num % 3 === 0) return false;
40
+ const sqrt = Math.sqrt(num);
41
+ for (let i = 5; i <= sqrt; i += 6) {
42
+ if (num % i === 0 || num % (i + 2) === 0) return false;
43
+ }
44
+ return true;
45
+ };
3
46
  var UniqueSet = class extends Set {
4
47
  /*** @throws TypeError If the input is not iterable. */
5
48
  constructor(iterable = []) {
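The hunk above moves `serialize` to module scope and, compared with the removed `#serialize` method, joins elements with an empty string instead of a comma. The sketch below reproduces both variants through a hypothetical `separator` parameter (which does not exist in the package) to show one consequence: some distinct values now serialize to the same string. This should only affect the Bloom pre-check, since PERF.md states that collisions are resolved by `fast-deep-equal`.

```typescript
// Same logic as `serialize` in the diff, with the join string made a parameter.
// The `separator` parameter and `serializeWith` name are illustrative assumptions.
const serializeWith = (separator: string) => {
  const serialize = (item: unknown): string => {
    if (typeof item === "number" && isNaN(item)) {
      return "NaN";
    }
    if (typeof item === "object" && item !== null) {
      if (Array.isArray(item)) {
        return `[${item.map((v) => serialize(v)).join(separator)}]`;
      }
      // Sort keys so that property order does not change the output.
      return `{${Object.entries(item)
        .sort(([a], [b]) => a.localeCompare(b))
        .map(([k, v]) => `${k}:${serialize(v)}`)
        .join(separator)}}`;
    }
    return String(item);
  };
  return serialize;
};

const serializeOld = serializeWith(","); // join used by the removed #serialize
const serializeNew = serializeWith(""); // join used by the new serialize

console.log(serializeOld([1, 23]), serializeOld([12, 3])); // [1,23] [12,3]
console.log(serializeNew([1, 23]), serializeNew([12, 3])); // [123] [123]
console.log(serializeNew({ b: 1, a: 2 })); // {a:2b:1}
```

Dropping the separator saves a little string work per item at the cost of a few extra hash collisions on inputs like these.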
@@ -62,7 +105,7 @@ var BloomSet = class extends Set {
62
105
  if (typeof size !== "number" || size <= 0) {
63
106
  size = 6553577;
64
107
  }
65
- this.#aSize = this.#findNextPrime(size);
108
+ this.#aSize = findNextPrime(size);
66
109
  if (typeof hashCount !== "number" || hashCount <= 0) {
67
110
  hashCount = 7;
68
111
  }
@@ -73,45 +116,10 @@ var BloomSet = class extends Set {
73
116
  }
74
117
  }
75
118
  /** @internal */
76
- #findNextPrime(num) {
77
- if (num < 2) return 2;
78
- if (num % 2 === 0) num++;
79
- while (!this.#isPrime(num)) {
80
- num += 2;
81
- }
82
- return num;
83
- }
84
- /** @internal */
85
- #isPrime(num) {
86
- if (num < 2) return false;
87
- if (num === 2 || num === 3) return true;
88
- if (num % 2 === 0 || num % 3 === 0) return false;
89
- const sqrt = Math.floor(Math.sqrt(num));
90
- for (let i = 5; i <= sqrt; i += 6) {
91
- if (num % i === 0 || num % (i + 2) === 0) return false;
92
- }
93
- return true;
94
- }
95
- /** @internal */
96
- #serialize(item) {
97
- if (typeof item === "number" && isNaN(item)) {
98
- return "NaN";
99
- }
100
- if (item && typeof item === "object") {
101
- const serialize = this.#serialize.bind(this);
102
- if (Array.isArray(item)) {
103
- return `[${item.map(serialize).join(",")}]`;
104
- } else {
105
- return `{${Object.entries(item).sort(([a], [b]) => a.localeCompare(b)).map(([k, v]) => `${k}:${serialize(v)}`).join(",")}}`;
106
- }
107
- }
108
- return String(item);
109
- }
110
- /** @internal */
111
119
  #hashes(item) {
112
120
  const hashes = [];
113
- const str = this.#serialize(item);
114
- let hash = this.#fnv1a(str);
121
+ const str = serialize(item);
122
+ let hash = fnv1a(str);
115
123
  for (let i = 0; i < this.#hashCount; i++) {
116
124
  hash %= this.#aSize;
117
125
  hashes.push(hash);
@@ -121,18 +129,6 @@ var BloomSet = class extends Set {
121
129
  return hashes;
122
130
  }
123
131
  /** @internal */
124
- #fnv1a(str) {
125
- if (typeof str !== "string") {
126
- str = String(str);
127
- }
128
- let hash = 2166136261;
129
- for (let i = 0; i < str.length; i++) {
130
- hash ^= str.charCodeAt(i);
131
- hash = hash * 16777619 >>> 0;
132
- }
133
- return hash >>> 0;
134
- }
135
- /** @internal */
136
132
  #setBits(hashes) {
137
133
  for (const hash of hashes) {
138
134
  const index = Math.floor(hash / 8);
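The hunk ends just as `#setBits` computes a byte index. For readers unfamiliar with the pattern, here is a minimal sketch of how a bit array packed into bytes is commonly set and tested. Only the `Math.floor(hash / 8)` byte index is taken from the diff; the `Uint8Array` backing store, the bit offset, and the function names are assumptions and may not match the package's implementation.

```typescript
// Generic packed bit array; not the package's actual implementation.
const createBitArray = (bits: number): Uint8Array => new Uint8Array(Math.ceil(bits / 8));

const setBits = (bitArray: Uint8Array, hashes: number[]): void => {
  for (const hash of hashes) {
    const index = Math.floor(hash / 8); // which byte holds this bit
    const bit = hash % 8; // position of the bit inside that byte
    bitArray[index] |= 1 << bit;
  }
};

// True only if every hashed position is set (a "maybe present" answer).
const checkBits = (bitArray: Uint8Array, hashes: number[]): boolean =>
  hashes.every((hash) => (bitArray[Math.floor(hash / 8)] & (1 << (hash % 8))) !== 0);

const bitArray = createBitArray(64);
setBits(bitArray, [3, 17, 42]);
console.log(checkBits(bitArray, [3, 17, 42])); // true
console.log(checkBits(bitArray, [3, 18])); // false: bit 18 was never set
```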
package/index.ts CHANGED
@@ -1,5 +1,63 @@
1
1
  import equal from "fast-deep-equal/es6/index.js";
2
2
 
3
+ /** Utility functions */
4
+
5
+ const serialize = (item: any | number | object): string => {
6
+ if (typeof item === "number" && isNaN(item)) {
7
+ return "NaN";
8
+ }
9
+
10
+ if (item && typeof item === "object") {
11
+ if (Array.isArray(item)) {
12
+ return `[${item.map(serialize).join("")}]`;
13
+ } else {
14
+ return `{${Object.entries(item)
15
+ .sort(([a], [b]) => a.localeCompare(b))
16
+ .map(([k, v]) => `${k}:${serialize(v)}`)
17
+ .join("")}}`;
18
+ }
19
+ }
20
+
21
+ return String(item);
22
+ };
23
+
24
+ const fnv1a = (str: string) => {
25
+ if (typeof str !== "string") {
26
+ str = String(str);
27
+ }
28
+ let hash = 2166136261; // FNV offset basis for 32-bit
29
+ for (let i = 0; i < str.length; i++) {
30
+ hash ^= str.charCodeAt(i);
31
+ hash = (hash * 16777619) >>> 0; // Multiply by the FNV prime and ensure 32-bit unsigned
32
+ }
33
+ return hash >>> 0;
34
+ };
35
+
36
+ const findNextPrime = (num: number) => {
37
+ if (num < 2) return 2;
38
+ if ((num & 1) === 0) num++; // Odd numbers only
39
+
40
+ while (!isPrime(num)) {
41
+ num += 2; // Odd numbers only
42
+ }
43
+
44
+ return num;
45
+ };
46
+
47
+ const isPrime = (num: number): boolean => {
48
+ if (num < 2) return false;
49
+ if (num === 2 || num === 3) return true;
50
+ if ((num & 1) === 0) return false;
51
+ if (num % 3 === 0) return false;
52
+
53
+ const sqrt = Math.sqrt(num);
54
+ for (let i = 5; i <= sqrt; i += 6) {
55
+ if (num % i === 0 || num % (i + 2) === 0) return false;
56
+ }
57
+
58
+ return true;
59
+ };
60
+
3
61
  /** A `Set` extension that ensures uniqueness of items using deep equality checks. */
4
62
  export class UniqueSet<T> extends Set<T> {
5
63
  /*** @throws TypeError If the input is not iterable. */
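One detail of `fnv1a` worth noting: the intermediate product `hash * 16777619` can exceed `Number.MAX_SAFE_INTEGER`, in which case its low bits are rounded before `>>> 0` truncates to 32 bits, so for some inputs the result may differ from reference FNV-1a values. That should be harmless for a Bloom filter as long as the function stays deterministic. If exact 32-bit FNV-1a were ever wanted, `Math.imul` is the usual tool. The sketch below shows that alternative; it is not the package's code, and `fnv1aExact` is a hypothetical name.

```typescript
// Exact 32-bit FNV-1a using Math.imul for wraparound multiplication.
// `fnv1aExact` is an illustrative name, not an export of the package.
// Like the original, it hashes UTF-16 code units, so it matches byte-wise
// FNV-1a reference values only for ASCII input.
const fnv1aExact = (str: string): number => {
  let hash = 0x811c9dc5; // FNV offset basis (2166136261)
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime (16777619), 32-bit multiply
  }
  return hash >>> 0; // reinterpret the signed result as unsigned
};

console.log(fnv1aExact("").toString(16)); // 811c9dc5
console.log(fnv1aExact("a").toString(16)); // e40c292c
```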
@@ -72,7 +130,7 @@ export class BloomSet<T> extends Set<T> {
72
130
  if (typeof size !== "number" || size <= 0) {
73
131
  size = 6553577; // Targeting < 1 collision per 100,000 elements, ~819 KB memory, needs 7 hashes
74
132
  }
75
- this.#aSize = this.#findNextPrime(size);
133
+ this.#aSize = findNextPrime(size);
76
134
 
77
135
  if (typeof hashCount !== "number" || hashCount <= 0) {
78
136
  hashCount = 7;
@@ -85,58 +143,11 @@ export class BloomSet<T> extends Set<T> {
85
143
  }
86
144
  }
87
145
 
88
- /** @internal */
89
- #findNextPrime(num: number) {
90
- if (num < 2) return 2;
91
- if (num % 2 === 0) num++; // Odd numbers only
92
-
93
- while (!this.#isPrime(num)) {
94
- num += 2; // Odd numbers only
95
- }
96
-
97
- return num;
98
- }
99
-
100
- /** @internal */
101
- #isPrime(num: number) {
102
- if (num < 2) return false;
103
- if (num === 2 || num === 3) return true;
104
- if (num % 2 === 0 || num % 3 === 0) return false;
105
-
106
- const sqrt = Math.floor(Math.sqrt(num));
107
- for (let i = 5; i <= sqrt; i += 6) {
108
- if (num % i === 0 || num % (i + 2) === 0) return false;
109
- }
110
-
111
- return true;
112
- }
113
-
114
- /** @internal */
115
- #serialize(item: T | number | object): string {
116
- if (typeof item === "number" && isNaN(item)) {
117
- return "NaN";
118
- }
119
-
120
- if (item && typeof item === "object") {
121
- const serialize = this.#serialize.bind(this);
122
- if (Array.isArray(item)) {
123
- return `[${item.map(serialize).join(",")}]`;
124
- } else {
125
- return `{${Object.entries(item)
126
- .sort(([a], [b]) => a.localeCompare(b))
127
- .map(([k, v]) => `${k}:${serialize(v)}`)
128
- .join(",")}}`;
129
- }
130
- }
131
-
132
- return String(item);
133
- }
134
-
135
146
  /** @internal */
136
147
  #hashes(item: T) {
137
148
  const hashes: number[] = [];
138
- const str = this.#serialize(item);
139
- let hash = this.#fnv1a(str); // Base hash
149
+ const str = serialize(item);
150
+ let hash = fnv1a(str); // Base hash
140
151
 
141
152
  // Bloom into hashCount hash values
142
153
  for (let i = 0; i < this.#hashCount; i++) {
@@ -151,19 +162,6 @@ export class BloomSet<T> extends Set<T> {
151
162
  return hashes;
152
163
  }
153
164
 
154
- /** @internal */
155
- #fnv1a(str: string) {
156
- if (typeof str !== "string") {
157
- str = String(str);
158
- }
159
- let hash = 2166136261; // FNV offset basis for 32-bit
160
- for (let i = 0; i < str.length; i++) {
161
- hash ^= str.charCodeAt(i);
162
- hash = (hash * 16777619) >>> 0; // Multiply by the FNV prime and ensure 32-bit unsigned
163
- }
164
- return hash >>> 0;
165
- }
166
-
167
165
  /** @internal */
168
166
  #setBits(hashes: number[]): void {
169
167
  for (const hash of hashes) {
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@sepiariver/unique-set",
3
- "version": "2.0.1",
3
+ "version": "2.0.3",
4
4
  "description": "Extends the native Set class to deeply compare using fast-deep-equal, with optional Bloom filter optimization. This version exports 2 classes instead of a default, breaking b/c with version 1.",
5
5
  "main": "dist/index.js",
6
6
  "module": "dist/index.mjs",
@@ -10,7 +10,8 @@
10
10
  "import": "./dist/index.mjs"
11
11
  },
12
12
  "scripts": {
13
- "test": "npm run build && vitest",
13
+ "test": "npm run build && vitest basic.spec.ts",
14
+ "bench": "npm run build && vitest bench.spec.ts bench-nested.spec.ts",
14
15
  "lint": "tsc",
15
16
  "build": "tsup index.ts --format esm --dts"
16
17
  },
package/temp.cjs DELETED
@@ -1,10 +0,0 @@
1
- const { BloomSet, UniqueSet } = require("./dist/index.js");
2
-
3
- const bloom = new BloomSet();
4
- bloom.add("foo");
5
- console.log(bloom.has("foo")); // true
6
-
7
- const unique = new UniqueSet();
8
- unique.add("foo");
9
- unique.add("foo");
10
- console.log(unique.size); // 1
package/temp.mjs DELETED
@@ -1,10 +0,0 @@
1
- import { BloomSet, UniqueSet } from "./dist/index.js";
2
-
3
- const bloom = new BloomSet();
4
- bloom.add("foo");
5
- console.log(bloom.has("foo")); // true
6
-
7
- const unique = new UniqueSet();
8
- unique.add("foo");
9
- unique.add("foo");
10
- console.log(unique.size); // 1