namespace-guard 0.17.1 → 0.18.0
- package/README.md +79 -71
- package/dist/cli.js +6129 -1
- package/dist/cli.mjs +6129 -1
- package/dist/composability-vectors.js +6125 -0
- package/dist/composability-vectors.mjs +6125 -0
- package/dist/confusable-weights.d.mts +1 -1
- package/dist/confusable-weights.d.ts +1 -1
- package/dist/confusable-weights.js +818 -42
- package/dist/confusable-weights.mjs +818 -42
- package/dist/index.d.mts +63 -8
- package/dist/index.d.ts +63 -8
- package/dist/index.js +130 -27
- package/dist/index.mjs +129 -27
- package/dist/profanity-en.js +6125 -0
- package/dist/profanity-en.mjs +6125 -0
- package/package.json +2 -2
package/README.md
CHANGED
|
@@ -5,11 +5,40 @@
|
|
|
5
5
|
[](https://www.typescriptlang.org/)
|
|
6
6
|
[](https://opensource.org/licenses/MIT)
|
|
7
7
|
|
|
8
|
-
**
|
|
8
|
+
**The world's first library that detects confusable characters across non-Latin scripts.** Slug claimability, Unicode anti-spoofing, and LLM Denial of Spend defence in one zero-dependency package.
|
|
9
9
|
|
|
10
10
|
- Live demo: https://paultendo.github.io/namespace-guard/
|
|
11
11
|
- Blog post: https://paultendo.github.io/posts/namespace-guard-launch/
|
|
12
12
|
|
|
13
|
+
## Cross-script confusable detection
|
|
14
|
+
|
|
15
|
+
Existing confusable standards (TR39, IDNA) map non-Latin characters to Latin equivalents. They have zero coverage for confusable pairs *between* two non-Latin scripts.
|
|
16
|
+
|
|
17
|
+
namespace-guard ships 494 SSIM-measured cross-script pairs from [confusable-vision](https://github.com/paultendo/confusable-vision) (rendered across 230 system fonts, scored by structural similarity). This catches attacks that no other library detects:
|
|
18
|
+
|
|
19
|
+
```typescript
|
|
20
|
+
import { areConfusable, detectCrossScriptRisk } from "namespace-guard";
|
|
21
|
+
import { CONFUSABLE_WEIGHTS } from "namespace-guard/confusable-weights";
|
|
22
|
+
|
|
23
|
+
// Hangul ᅵ and Han 丨 are visually identical (SSIM 0.999, Arial Unicode MS)
|
|
24
|
+
areConfusable("\u1175", "\u4E28", { weights: CONFUSABLE_WEIGHTS }); // true
|
|
25
|
+
|
|
26
|
+
// Greek Τ and Han 丅 are near-identical (SSIM 0.930, Hiragino Kaku Gothic ProN)
|
|
27
|
+
areConfusable("\u03A4", "\u4E05", { weights: CONFUSABLE_WEIGHTS }); // true
|
|
28
|
+
|
|
29
|
+
// Cyrillic І and Greek Ι are pixel-identical (SSIM 1.0, 61 fonts agree)
|
|
30
|
+
areConfusable("\u0406", "\u0399", { weights: CONFUSABLE_WEIGHTS }); // true
|
|
31
|
+
|
|
32
|
+
// Without weights, only skeleton-based detection (TR39 coverage)
|
|
33
|
+
areConfusable("\u1175", "\u4E28"); // false
|
|
34
|
+
|
|
35
|
+
// Analyse an identifier for cross-script risk
|
|
36
|
+
const risk = detectCrossScriptRisk("\u1175\u4E28", { weights: CONFUSABLE_WEIGHTS });
|
|
37
|
+
// { riskLevel: "high", scripts: ["han", "hangul"], crossScriptPairs: [...] }
|
|
38
|
+
```
|
|
39
|
+
|
|
40
|
+
1,397 total SSIM-scored confusable pairs (110 TR39-confirmed, 793 novel Latin-target, 494 cross-script). Cross-script data licensed CC-BY-4.0.
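
As a rough illustration of what "cross-script" means here, the sketch below buckets code points into a few Unicode blocks and reports which scripts an identifier mixes. It is not namespace-guard's implementation: `BLOCKS` and `scriptsIn` are hypothetical names, and the coarse block ranges stand in for real Unicode script properties.

```typescript
// Hypothetical, coarse block table. Real script detection uses Unicode
// script properties, which are finer-grained than these block ranges.
interface Block {
  script: string;
  start: number;
  end: number;
}

const BLOCKS: Block[] = [
  { script: "greek", start: 0x0370, end: 0x03ff },    // Greek and Coptic block
  { script: "cyrillic", start: 0x0400, end: 0x04ff }, // Cyrillic block
  { script: "hangul", start: 0x1100, end: 0x11ff },   // Hangul Jamo block
  { script: "han", start: 0x4e00, end: 0x9fff },      // CJK Unified Ideographs block
];

// Returns the sorted list of (coarse) scripts that appear in the text.
function scriptsIn(text: string): string[] {
  const found: string[] = [];
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i); // every block above lies in the BMP
    for (const block of BLOCKS) {
      const inBlock = code >= block.start && code <= block.end;
      if (inBlock && found.indexOf(block.script) === -1) {
        found.push(block.script);
      }
    }
  }
  return found.sort();
}

const mixed = scriptsIn("\u1175\u4E28");  // Hangul jamo next to a Han ideograph
const single = scriptsIn("\u0406\u0406"); // Cyrillic only

console.log(mixed, mixed.length > 1);   // two scripts: worth a closer look
console.log(single, single.length > 1); // one script: no mixing
```

Mixing scripts is only a signal, not a verdict: deciding whether two specific characters look alike still needs pair data such as the SSIM-scored weights described above.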
|
|
41
|
+
|
|
13
42
|
## Installation
|
|
14
43
|
|
|
15
44
|
```bash
|
|
@@ -57,58 +86,24 @@ if (!result.claimed) {
|
|
|
57
86
|
}
|
|
58
87
|
```
|
|
59
88
|
|
|
60
|
-
## Research-Backed Differentiation
|
|
61
|
-
|
|
62
|
-
We started by auditing how major Unicode-confusable implementations compose normalization and mapping in practice (including ICU, Chromium, Rust, and django-registration), then converted that gap into a reproducible library design.
|
|
63
|
-
|
|
64
|
-
- Documented a 31-entry NFKC vs TR39 divergence set and shipped it as a named regression suite: `nfkc-tr39-divergence-v1`.
|
|
65
|
-
- Ship two maps for two real pipelines:
|
|
66
|
-
`CONFUSABLE_MAP` (NFKC-first) and `CONFUSABLE_MAP_FULL` (TR39/NFD/raw-input pipelines).
|
|
67
|
-
- Export the vectors as JSON (`docs/data/composability-vectors.json`) and wire them into CLI drift baselines.
|
|
68
|
-
- Publish a labeled benchmark corpus (`docs/data/confusable-bench.v1.json`) for cross-tool evaluation and CI regressions.
|
|
69
|
-
- Submitted the findings for Unicode public review (PRI #540): https://www.unicode.org/review/pri540/
|
|
70
|
-
- Validated the 31 divergence vectors empirically by rendering each character across 12 fonts and measuring SSIM similarity: TR39 is visually correct for letter-shape confusables, NFKC for digit confusables, and 61% are ties where both targets are near-identical ([confusable-vision](https://github.com/paultendo/confusable-vision)).
|
|
71
|
-
|
|
72
|
-
Details:
|
|
73
|
-
- Technical reference: [docs/reference.md#how-the-anti-spoofing-pipeline-works](docs/reference.md#how-the-anti-spoofing-pipeline-works)
|
|
74
|
-
- Launch write-up: https://paultendo.github.io/posts/namespace-guard-launch/
|
|
75
|
-
|
|
76
89
|
## What You Get
|
|
77
90
|
|
|
91
|
+
- **Cross-script confusable detection** with 494 SSIM-measured pairs between non-Latin scripts
|
|
78
92
|
- Cross-table collision checks (users, orgs, teams, etc.)
|
|
79
93
|
- Reserved-name blocking with category-aware messages
|
|
80
94
|
- Unicode anti-spoofing (NFKC + confusable detection + mixed-script/risk controls)
|
|
81
|
-
- Invisible character detection (
|
|
95
|
+
- Invisible character detection (zero-width joiners, direction overrides, and other hidden characters)
|
|
82
96
|
- Optional profanity/evasion validation
|
|
83
97
|
- Suggestion strategies for taken names
|
|
84
98
|
- CLI for red-team generation, calibration, drift, and CI gates
|
|
85
99
|
|
|
86
100
|
## LLM Pipeline Preprocessing
|
|
87
101
|
|
|
88
|
-
|
|
89
|
-
|
|
90
|
-
|
|
91
|
-
|
|
92
|
-
|
|
93
|
-
Document ingestion
|
|
94
|
-
|
|
|
95
|
-
v
|
|
96
|
-
+----------------+
|
|
97
|
-
| namespace- | <-- Detect mixed-script confusable substitution
|
|
98
|
-
| guard | <-- Canonicalise to Latin equivalents
|
|
99
|
-
| (microseconds) | <-- Flag suspicious patterns for review
|
|
100
|
-
+----------------+
|
|
101
|
-
|
|
|
102
|
-
v
|
|
103
|
-
+----------------+
|
|
104
|
-
| LLM API | <-- Any model/provider
|
|
105
|
-
| (GPT/Claude/ | <-- Receives canonicalised text
|
|
106
|
-
| Llama/etc) |
|
|
107
|
-
+----------------+
|
|
108
|
-
|
|
|
109
|
-
v
|
|
110
|
-
Analysis output
|
|
111
|
-
```
|
|
102
|
+
Confusable characters can be pixel-identical to Latin letters, but they encode as multi-byte UTF-8 sequences that BPE tokenisers split into more tokens. A 95-line contract that costs 881 tokens in clean ASCII costs 4,567 tokens when flooded with confusables: **5.2x the API bill**. The model reads it correctly. The invoice does not care.
|
|
103
|
+
|
|
104
|
+
We tested this across 4 frontier models, 8 attack types, and 130+ API calls. Zero meaning flips. Every substituted clause was correctly interpreted. But the billing attack succeeds. We call it **Denial of Spend**: the confusable analogue of DDoS, where the attacker cannot degrade the service but can inflate the cost of running it.
|
|
105
|
+
|
|
106
|
+
`canonicalise()` recovered every substituted term across all 12 attack variants, collapsing the 5.2x inflation to 1.0x. Processing a 10,000-character document takes under 1ms.
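
The arithmetic behind the inflation figure can be checked with a small, dependency-free sketch. It uses UTF-8 byte length as a rough proxy for tokeniser cost (actual token counts depend on the model's tokeniser), and `utf8ByteLength` is a hypothetical helper, not part of namespace-guard.

```typescript
// Count UTF-8 bytes by hand so the sketch needs no runtime API.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    if (unit < 0x80) {
      bytes += 1; // ASCII
    } else if (unit < 0x800) {
      bytes += 2; // e.g. Greek, Cyrillic
    } else if (unit >= 0xd800 && unit <= 0xdbff) {
      bytes += 4; // surrogate pair: one code point outside the BMP
      i++;        // skip the low surrogate (assumes well-formed input)
    } else {
      bytes += 3; // rest of the BMP, e.g. Han, Hangul
    }
  }
  return bytes;
}

const clean = "non-refundable";
const spoofed = "\u043F\u043E\u043F-refundable"; // Cyrillic п, о, п replace Latin n, o, n

const cleanBytes = utf8ByteLength(clean);     // 14 characters, 14 bytes
const spoofedBytes = utf8ByteLength(spoofed); // 14 characters, 17 bytes

// Token counts quoted above for the 95-line contract.
const inflation = 4567 / 881;

console.log(cleanBytes, spoofedBytes, inflation.toFixed(1)); // 14 17 5.2
```

Both strings render as the same 14 glyphs, yet the spoofed one is longer on the wire; substituting every eligible letter in a long document compounds that gap.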
|
|
112
107
|
|
|
113
108
|
```typescript
|
|
114
109
|
import { canonicalise, scan, isClean } from "namespace-guard";
|
|
@@ -124,10 +119,47 @@ const ok = isClean(raw); // false (mixed-script confusable detected)
|
|
|
124
119
|
canonicalise("поп-refundable", { strategy: "all" }); // "non-refundable"
|
|
125
120
|
```
|
|
126
121
|
|
|
127
|
-
Research
|
|
122
|
+
Research:
|
|
123
|
+
- Denial of Spend: https://paultendo.github.io/posts/confusable-vision-llm-attack-tests/
|
|
128
124
|
- Launch: https://paultendo.github.io/posts/namespace-guard-launch/
|
|
129
125
|
- NFKC/TR39 composability: https://paultendo.github.io/posts/unicode-confusables-nfkc-conflict/
|
|
130
|
-
|
|
126
|
+
|
|
127
|
+
## Advanced Security Primitives
|
|
128
|
+
|
|
129
|
+
Low-level helpers for custom scoring, pairwise checks, and cross-script risk analysis:
|
|
130
|
+
|
|
131
|
+
```typescript
|
|
132
|
+
import { skeleton, areConfusable, confusableDistance } from "namespace-guard";
|
|
133
|
+
|
|
134
|
+
skeleton("pa\u0443pal"); // "paypal" skeleton form
|
|
135
|
+
areConfusable("paypal", "pa\u0443pal"); // true
|
|
136
|
+
confusableDistance("paypal", "pa\u0443pal"); // graded similarity + chainDepth + explainable steps
|
|
137
|
+
```
|
|
138
|
+
|
|
139
|
+
For measured visual scoring, pass the optional weights from confusable-vision (1,397 SSIM-scored pairs across 230 fonts, including 494 cross-script pairs). The `context` filter restricts to identifier-valid, domain-valid, or all pairs.
|
|
140
|
+
|
|
141
|
+
```typescript
|
|
142
|
+
import { confusableDistance } from "namespace-guard";
|
|
143
|
+
import { CONFUSABLE_WEIGHTS } from "namespace-guard/confusable-weights";
|
|
144
|
+
|
|
145
|
+
const result = confusableDistance("paypal", "pa\u0443pal", {
|
|
146
|
+
weights: CONFUSABLE_WEIGHTS,
|
|
147
|
+
context: "identifier",
|
|
148
|
+
});
|
|
149
|
+
// result.similarity, result.steps (including "visual-weight" reason for novel pairs)
|
|
150
|
+
```
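
For intuition, the skeleton idea from Unicode TR39 can be sketched with a toy prototype map: replace each character with its look-alike prototype, then compare the results. This is a simplified illustration, not the library's code. `PROTOTYPES`, `toySkeleton`, and `toyAreConfusable` are hypothetical names, the map has three entries instead of the full confusables table, and the normalisation steps that TR39 applies around the mapping are omitted.

```typescript
// Hypothetical three-entry prototype map; the real TR39 data is far larger.
const PROTOTYPES: { [ch: string]: string } = {
  "\u0430": "a", // Cyrillic а
  "\u0440": "p", // Cyrillic р
  "\u0443": "y", // Cyrillic у
};

// Mapping step only: TR39 also normalises the string before and after mapping.
function toySkeleton(text: string): string {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    const mapped = Object.prototype.hasOwnProperty.call(PROTOTYPES, ch);
    out += mapped ? PROTOTYPES[ch] : ch;
  }
  return out;
}

// Two strings count as confusable when their skeletons are equal.
function toyAreConfusable(a: string, b: string): boolean {
  return toySkeleton(a) === toySkeleton(b);
}

console.log(toySkeleton("pa\u0443pal"));                // "paypal"
console.log(toyAreConfusable("paypal", "pa\u0443pal")); // true
console.log(toyAreConfusable("paypal", "paypa1"));      // false: "1" is not in the toy map
```

A skeleton comparison gives a yes/no answer; the weighted `confusableDistance` call shown above is what adds a graded similarity on top of it.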
|
|
151
|
+
|
|
152
|
+
## Research
|
|
153
|
+
|
|
154
|
+
Two research tracks feed the library:
|
|
155
|
+
|
|
156
|
+
**Visual measurement.** 1,397 confusable pairs rendered across 230 system fonts, scored by structural similarity (SSIM). 494 of these are novel cross-script pairs between non-Latin scripts (Hangul/Han, Cyrillic/Greek, Cyrillic/Arabic, and more) with zero coverage in any existing standard. Full dataset published as [confusable-vision](https://github.com/paultendo/confusable-vision) (CC-BY-4.0).
|
|
157
|
+
|
|
158
|
+
**Normalisation composability.** 31 characters where Unicode's confusables.txt and NFKC normalisation disagree. Two production maps (`CONFUSABLE_MAP` for NFKC-first, `CONFUSABLE_MAP_FULL` for raw-input pipelines), a benchmark corpus, and composability vectors wired into CLI drift baselines. Findings accepted into [Unicode public review (PRI #540)](https://www.unicode.org/review/pri540/).
|
|
159
|
+
|
|
160
|
+
- Technical reference: [docs/reference.md#how-the-anti-spoofing-pipeline-works](docs/reference.md#how-the-anti-spoofing-pipeline-works)
|
|
161
|
+
- Launch write-up: https://paultendo.github.io/posts/namespace-guard-launch/
|
|
162
|
+
- Denial of Spend: https://paultendo.github.io/posts/confusable-vision-llm-attack-tests/
|
|
131
163
|
|
|
132
164
|
## Built-in Profiles
|
|
133
165
|
|
|
@@ -177,35 +209,10 @@ npx namespace-guard recommend ./risk-dataset.json
|
|
|
177
209
|
# 3) Preflight canonical collisions before adding DB unique constraints
|
|
178
210
|
npx namespace-guard audit-canonical ./users-export.json --json
|
|
179
211
|
|
|
180
|
-
# 4) Compare TR39-full vs NFKC-filtered
|
|
212
|
+
# 4) Compare TR39-full vs NFKC-filtered behaviour
|
|
181
213
|
npx namespace-guard drift --json
|
|
182
214
|
```
|
|
183
215
|
|
|
184
|
-
## Advanced Security Primitives (Optional)
|
|
185
|
-
|
|
186
|
-
Use these when you need custom scoring, explainability, or pairwise checks outside the default claim flow:
|
|
187
|
-
|
|
188
|
-
```typescript
|
|
189
|
-
import { skeleton, areConfusable, confusableDistance } from "namespace-guard";
|
|
190
|
-
|
|
191
|
-
skeleton("pa\u0443pal"); // "paypal" skeleton form
|
|
192
|
-
areConfusable("paypal", "pa\u0443pal"); // true
|
|
193
|
-
confusableDistance("paypal", "pa\u0443pal"); // graded similarity + chainDepth + explainable steps
|
|
194
|
-
```
|
|
195
|
-
|
|
196
|
-
For measured visual scoring, pass the optional weights from confusable-vision (903 SSIM-scored pairs across 230 fonts). The `context` filter restricts to identifier-valid, domain-valid, or all pairs.
|
|
197
|
-
|
|
198
|
-
```typescript
|
|
199
|
-
import { confusableDistance } from "namespace-guard";
|
|
200
|
-
import { CONFUSABLE_WEIGHTS } from "namespace-guard/confusable-weights";
|
|
201
|
-
|
|
202
|
-
const result = confusableDistance("paypal", "pa\u0443pal", {
|
|
203
|
-
weights: CONFUSABLE_WEIGHTS,
|
|
204
|
-
context: "identifier",
|
|
205
|
-
});
|
|
206
|
-
// result.similarity, result.steps (including "visual-weight" reason for novel pairs)
|
|
207
|
-
```
|
|
208
|
-
|
|
209
216
|
## Adapter Support
|
|
210
217
|
|
|
211
218
|
- Prisma
|
|
@@ -236,7 +243,8 @@ Migration guides per adapter: [docs/reference.md#canonical-uniqueness-migration-
|
|
|
236
243
|
- LLM preprocessing (`canonicalise`, `scan`, `isClean`): [docs/reference.md#llm-pipeline-preprocessing](docs/reference.md#llm-pipeline-preprocessing)
|
|
237
244
|
- Benchmark corpus (`confusable-bench.v1`): [docs/reference.md#confusable-benchmark-corpus-artifact](docs/reference.md#confusable-benchmark-corpus-artifact)
|
|
238
245
|
- Advanced primitives (`skeleton`, `areConfusable`, `confusableDistance`): [docs/reference.md#advanced-security-primitives](docs/reference.md#advanced-security-primitives)
|
|
239
|
-
- Confusable weights (SSIM-scored pairs): [docs/reference.md#confusable-weights-subpath](docs/reference.md#confusable-weights-subpath)
|
|
246
|
+
- Confusable weights (SSIM-scored pairs, including cross-script): [docs/reference.md#confusable-weights-subpath](docs/reference.md#confusable-weights-subpath)
|
|
247
|
+
- Cross-script detection: [docs/reference.md#cross-script-detection](docs/reference.md#cross-script-detection)
|
|
240
248
|
- CLI reference: [docs/reference.md#cli](docs/reference.md#cli)
|
|
241
249
|
- API reference: [docs/reference.md#api-reference](docs/reference.md#api-reference)
|
|
242
250
|
- Framework integration (Next.js/Express/tRPC): [docs/reference.md#framework-integration](docs/reference.md#framework-integration)
|