namespace-guard 0.17.1 → 0.18.0

package/README.md CHANGED
@@ -5,11 +5,40 @@
  [![TypeScript](https://img.shields.io/badge/TypeScript-5.0+-blue.svg)](https://www.typescriptlang.org/)
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 
- **Claim safe slugs in one line**: availability, reserved names, spoofing protection, and moderation hooks.
+ **The world's first library that detects confusable characters across non-Latin scripts.** Slug claimability, Unicode anti-spoofing, and LLM Denial of Spend defence in one zero-dependency package.
 
  - Live demo: https://paultendo.github.io/namespace-guard/
  - Blog post: https://paultendo.github.io/posts/namespace-guard-launch/
 
+ ## Cross-script confusable detection
+
+ Existing confusable standards (TR39, IDNA) map non-Latin characters to Latin equivalents. They have zero coverage for confusable pairs *between* two non-Latin scripts.
+
+ namespace-guard ships 494 SSIM-measured cross-script pairs from [confusable-vision](https://github.com/paultendo/confusable-vision) (rendered across 230 system fonts, scored by structural similarity). This catches attacks that no other library detects:
+
+ ```typescript
+ import { areConfusable, detectCrossScriptRisk } from "namespace-guard";
+ import { CONFUSABLE_WEIGHTS } from "namespace-guard/confusable-weights";
+
+ // Hangul ᅵ and Han 丨 are visually identical (SSIM 0.999, Arial Unicode MS)
+ areConfusable("\u1175", "\u4E28", { weights: CONFUSABLE_WEIGHTS }); // true
+
+ // Greek Τ and Han 丅 are near-identical (SSIM 0.930, Hiragino Kaku Gothic ProN)
+ areConfusable("\u03A4", "\u4E05", { weights: CONFUSABLE_WEIGHTS }); // true
+
+ // Cyrillic І and Greek Ι are pixel-identical (SSIM 1.0, 61 fonts agree)
+ areConfusable("\u0406", "\u0399", { weights: CONFUSABLE_WEIGHTS }); // true
+
+ // Without weights, only skeleton-based detection (TR39 coverage)
+ areConfusable("\u1175", "\u4E28"); // false
+
+ // Analyze an identifier for cross-script risk
+ const risk = detectCrossScriptRisk("\u1175\u4E28", { weights: CONFUSABLE_WEIGHTS });
+ // { riskLevel: "high", scripts: ["han", "hangul"], crossScriptPairs: [...] }
+ ```
+
+ 1,397 total SSIM-scored confusable pairs (110 TR39-confirmed, 793 novel Latin-target, 494 cross-script). Cross-script data licensed CC-BY-4.0.
+
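Reader's note (not part of the diff): the scores quoted in the added section are SSIM values. As a rough illustration of what that metric measures, the sketch below computes a single-window ("global") SSIM over two equal-sized grayscale bitmaps. `globalSsim` is a hypothetical helper, not an export of namespace-guard or confusable-vision; the README does not describe the actual rendering, windowing, or constants, so those are assumptions here.

```typescript
// Hypothetical helper: global SSIM between two grayscale images, given as
// flat arrays of pixel intensities in [0, 255]. Standard SSIM averages over
// local sliding windows; this sketch uses one window covering the whole image.
function globalSsim(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("images must be non-empty and the same size");
  }
  const n = a.length;
  const dynamicRange = 255;
  // Conventional stabilising constants (K1 = 0.01, K2 = 0.03).
  const c1 = (0.01 * dynamicRange) * (0.01 * dynamicRange);
  const c2 = (0.03 * dynamicRange) * (0.03 * dynamicRange);

  // Mean intensity of each image.
  let meanA = 0;
  let meanB = 0;
  for (let i = 0; i < n; i++) {
    meanA += a[i];
    meanB += b[i];
  }
  meanA /= n;
  meanB /= n;

  // Variances and covariance.
  let varA = 0;
  let varB = 0;
  let cov = 0;
  for (let i = 0; i < n; i++) {
    const da = a[i] - meanA;
    const db = b[i] - meanB;
    varA += da * da;
    varB += db * db;
    cov += da * db;
  }
  varA /= n;
  varB /= n;
  cov /= n;

  const numerator = (2 * meanA * meanB + c1) * (2 * cov + c2);
  const denominator = (meanA * meanA + meanB * meanB + c1) * (varA + varB + c2);
  return numerator / denominator;
}

// A 3x3 vertical bar, and the same bar shifted one column to the left.
const bar = [0, 255, 0, 0, 255, 0, 0, 255, 0];
const shifted = [255, 0, 0, 255, 0, 0, 255, 0, 0];

const sameScore = globalSsim(bar, bar);        // identical images score 1
const shiftedScore = globalSsim(bar, shifted); // misaligned strokes score far lower
```

A score near 1, like the 0.999 quoted for the Hangul/Han pair, means the two rendered glyphs share almost the same brightness, contrast, and structure.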
  ## Installation
 
  ```bash
@@ -57,58 +86,24 @@ if (!result.claimed) {
  }
  ```
 
- ## Research-Backed Differentiation
-
- We started by auditing how major Unicode-confusable implementations compose normalization and mapping in practice (including ICU, Chromium, Rust, and django-registration), then converted that gap into a reproducible library design.
-
- - Documented a 31-entry NFKC vs TR39 divergence set and shipped it as a named regression suite: `nfkc-tr39-divergence-v1`.
- - Ship two maps for two real pipelines:
-   `CONFUSABLE_MAP` (NFKC-first) and `CONFUSABLE_MAP_FULL` (TR39/NFD/raw-input pipelines).
- - Export the vectors as JSON (`docs/data/composability-vectors.json`) and wire them into CLI drift baselines.
- - Publish a labeled benchmark corpus (`docs/data/confusable-bench.v1.json`) for cross-tool evaluation and CI regressions.
- - Submitted the findings for Unicode public review (PRI #540): https://www.unicode.org/review/pri540/
- - Validated the 31 divergence vectors empirically by rendering each character across 12 fonts and measuring SSIM similarity: TR39 is visually correct for letter-shape confusables, NFKC for digit confusables, and 61% are ties where both targets are near-identical ([confusable-vision](https://github.com/paultendo/confusable-vision)).
-
- Details:
- - Technical reference: [docs/reference.md#how-the-anti-spoofing-pipeline-works](docs/reference.md#how-the-anti-spoofing-pipeline-works)
- - Launch write-up: https://paultendo.github.io/posts/namespace-guard-launch/
-
  ## What You Get
 
+ - **Cross-script confusable detection** with 494 SSIM-measured pairs between non-Latin scripts
  - Cross-table collision checks (users, orgs, teams, etc.)
  - Reserved-name blocking with category-aware messages
  - Unicode anti-spoofing (NFKC + confusable detection + mixed-script/risk controls)
- - Invisible character detection (default-ignorable + bidi controls, optional combining-mark blocking)
+ - Invisible character detection (zero-width joiners, direction overrides, and other hidden bytes)
  - Optional profanity/evasion validation
  - Suggestion strategies for taken names
  - CLI for red-team generation, calibration, drift, and CI gates
 
  ## LLM Pipeline Preprocessing
 
- LLM tokenizers process Unicode codepoints, not rendered glyphs. Confusable substitutions can inflate token counts and hide important terms in mixed-script text, especially on smaller models.
-
- Use namespace-guard as a deterministic preprocess layer before model calls:
-
- ```text
- Document ingestion
- |
- v
- +----------------+
- | namespace- | <-- Detect mixed-script confusable substitution
- | guard | <-- Canonicalise to Latin equivalents
- | (microseconds) | <-- Flag suspicious patterns for review
- +----------------+
- |
- v
- +----------------+
- | LLM API | <-- Any model/provider
- | (GPT/Claude/ | <-- Receives canonicalised text
- | Llama/etc) |
- +----------------+
- |
- v
- Analysis output
- ```
+ Confusable characters are pixel-identical to Latin letters but encode as multi-byte BPE tokens. A 95-line contract that costs 881 tokens in clean ASCII costs 4,567 tokens when flooded with confusables: **5.2x the API bill**. The model reads it correctly. The invoice does not care.
+
+ We tested this across 4 frontier models, 8 attack types, and 130+ API calls. Zero meaning flips. Every substituted clause was correctly interpreted. But the billing attack succeeds. We call it **Denial of Spend**: the confusable analogue of DDoS, where the attacker cannot degrade the service but can inflate the cost of running it.
+
+ `canonicalise()` recovered every substituted term across all 12 attack variants, collapsing the 5.2x inflation to 1.0x. Processing a 10,000-character document takes under 1ms.
 
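Reader's note (not part of the diff): the added paragraph attributes the cost inflation to confusables encoding as multi-byte sequences. The byte-level half of that claim can be checked with the standard `TextEncoder`. The sketch below compares UTF-8 byte lengths for an ASCII phrase and its Cyrillic look-alike. It measures bytes, not tokens; real token counts depend on the tokenizer and are not computed here.

```typescript
// Sketch: compare UTF-8 byte lengths, which is the input a byte-level BPE
// tokenizer starts from. This measures bytes, NOT tokens.
const encoder = new TextEncoder();

function utf8Length(text: string): number {
  return encoder.encode(text).length;
}

const clean = "non-refundable";                  // all ASCII
const spoofed = "\u043F\u043E\u043F-refundable"; // Cyrillic п, о, п in place of "non"

const cleanBytes = utf8Length(clean);     // 14: one byte per ASCII character
const spoofedBytes = utf8Length(spoofed); // 17: each Cyrillic letter takes two bytes

// The README's token figures as plain arithmetic: 4,567 / 881 is about 5.18,
// which the README rounds to 5.2x.
const inflation = 4567 / 881;
```

Both strings are 14 characters long and render almost identically, yet the spoofed one is longer on the wire. Tokenizers trained mostly on Latin text tend to split such sequences into more tokens, which is the effect the README reports.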
  ```typescript
  import { canonicalise, scan, isClean } from "namespace-guard";
@@ -124,10 +119,47 @@ const ok = isClean(raw); // false (mixed-script confusable detected)
  canonicalise("поп-refundable", { strategy: "all" }); // "non-refundable"
  ```
 
- Research context:
+ Research:
+ - Denial of Spend: https://paultendo.github.io/posts/confusable-vision-llm-attack-tests/
  - Launch: https://paultendo.github.io/posts/namespace-guard-launch/
  - NFKC/TR39 composability: https://paultendo.github.io/posts/unicode-confusables-nfkc-conflict/
- - Confusable detection without NFKC: https://paultendo.github.io/posts/confusable-detection-without-nfkc/
+
+ ## Advanced Security Primitives
+
+ Low-level helpers for custom scoring, pairwise checks, and cross-script risk analysis:
+
+ ```typescript
+ import { skeleton, areConfusable, confusableDistance } from "namespace-guard";
+
+ skeleton("pa\u0443pal"); // "paypal" skeleton form
+ areConfusable("paypal", "pa\u0443pal"); // true
+ confusableDistance("paypal", "pa\u0443pal"); // graded similarity + chainDepth + explainable steps
+ ```
+
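Reader's note (not part of the diff): the idea behind a skeleton check is to map every character to a prototype and compare the results. The toy sketch below shows that idea with a three-entry map. `TOY_MAP`, `toySkeleton`, and `toyAreConfusable` are hypothetical names for illustration only; they are not the library's implementation. The library's real tables are far larger, and the TR39 skeleton algorithm also re-normalises after mapping.

```typescript
// Hypothetical three-entry prototype map; real confusable tables are far larger.
const TOY_MAP: Record<string, string> = {
  "\u0443": "y", // Cyrillic у
  "\u0430": "a", // Cyrillic а
  "\u0440": "p", // Cyrillic р
};

// Toy skeleton: decompose with NFD, then replace each character with its
// prototype when the map has one. (TR39 applies NFD again afterwards; that
// step is omitted because this map only produces ASCII letters.)
function toySkeleton(input: string): string {
  return Array.from(input.normalize("NFD"))
    .map((ch) => TOY_MAP[ch] ?? ch)
    .join("");
}

// Two strings are treated as confusable when their skeletons match.
function toyAreConfusable(a: string, b: string): boolean {
  return toySkeleton(a) === toySkeleton(b);
}

const spoof = "pa\u0443pal"; // Cyrillic у in place of Latin y

const skeletonForm = toySkeleton(spoof);              // "paypal"
const looksAlike = toyAreConfusable("paypal", spoof); // true
const different = toyAreConfusable("paypal", "paypa1"); // false: "1" has no entry in the toy map
```

A pure skeleton comparison is binary. The weights described in the diff add a graded score on top of it, which is what `confusableDistance` is documented to return.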
+ For measured visual scoring, pass the optional weights from confusable-vision (1,397 SSIM-scored pairs across 230 fonts, including 494 cross-script pairs). The `context` filter restricts to identifier-valid, domain-valid, or all pairs.
+
+ ```typescript
+ import { confusableDistance } from "namespace-guard";
+ import { CONFUSABLE_WEIGHTS } from "namespace-guard/confusable-weights";
+
+ const result = confusableDistance("paypal", "pa\u0443pal", {
+   weights: CONFUSABLE_WEIGHTS,
+   context: "identifier",
+ });
+ // result.similarity, result.steps (including "visual-weight" reason for novel pairs)
+ ```
+
+ ## Research
+
+ Two research tracks feed the library:
+
+ **Visual measurement.** 1,397 confusable pairs rendered across 230 system fonts, scored by structural similarity (SSIM). 494 of these are novel cross-script pairs between non-Latin scripts (Hangul/Han, Cyrillic/Greek, Cyrillic/Arabic, and more) with zero coverage in any existing standard. Full dataset published as [confusable-vision](https://github.com/paultendo/confusable-vision) (CC-BY-4.0).
+
+ **Normalisation composability.** 31 characters where Unicode's confusables.txt and NFKC normalisation disagree. Two production maps (`CONFUSABLE_MAP` for NFKC-first, `CONFUSABLE_MAP_FULL` for raw-input pipelines), a benchmark corpus, and composability vectors wired into CLI drift baselines. Findings accepted into [Unicode public review (PRI #540)](https://www.unicode.org/review/pri540/).
+
+ - Technical reference: [docs/reference.md#how-the-anti-spoofing-pipeline-works](docs/reference.md#how-the-anti-spoofing-pipeline-works)
+ - Launch write-up: https://paultendo.github.io/posts/namespace-guard-launch/
+ - Denial of Spend: https://paultendo.github.io/posts/confusable-vision-llm-attack-tests/
 
  ## Built-in Profiles
 
@@ -177,35 +209,10 @@ npx namespace-guard recommend ./risk-dataset.json
  # 3) Preflight canonical collisions before adding DB unique constraints
  npx namespace-guard audit-canonical ./users-export.json --json
 
- # 4) Compare TR39-full vs NFKC-filtered behavior
+ # 4) Compare TR39-full vs NFKC-filtered behaviour
  npx namespace-guard drift --json
  ```
 
- ## Advanced Security Primitives (Optional)
-
- Use these when you need custom scoring, explainability, or pairwise checks outside the default claim flow:
-
- ```typescript
- import { skeleton, areConfusable, confusableDistance } from "namespace-guard";
-
- skeleton("pa\u0443pal"); // "paypal" skeleton form
- areConfusable("paypal", "pa\u0443pal"); // true
- confusableDistance("paypal", "pa\u0443pal"); // graded similarity + chainDepth + explainable steps
- ```
-
- For measured visual scoring, pass the optional weights from confusable-vision (903 SSIM-scored pairs across 230 fonts). The `context` filter restricts to identifier-valid, domain-valid, or all pairs.
-
- ```typescript
- import { confusableDistance } from "namespace-guard";
- import { CONFUSABLE_WEIGHTS } from "namespace-guard/confusable-weights";
-
- const result = confusableDistance("paypal", "pa\u0443pal", {
-   weights: CONFUSABLE_WEIGHTS,
-   context: "identifier",
- });
- // result.similarity, result.steps (including "visual-weight" reason for novel pairs)
- ```
-
  ## Adapter Support
 
  - Prisma
@@ -236,7 +243,8 @@ Migration guides per adapter: [docs/reference.md#canonical-uniqueness-migration-
  - LLM preprocessing (`canonicalise`, `scan`, `isClean`): [docs/reference.md#llm-pipeline-preprocessing](docs/reference.md#llm-pipeline-preprocessing)
  - Benchmark corpus (`confusable-bench.v1`): [docs/reference.md#confusable-benchmark-corpus-artifact](docs/reference.md#confusable-benchmark-corpus-artifact)
  - Advanced primitives (`skeleton`, `areConfusable`, `confusableDistance`): [docs/reference.md#advanced-security-primitives](docs/reference.md#advanced-security-primitives)
- - Confusable weights (SSIM-scored pairs): [docs/reference.md#confusable-weights-subpath](docs/reference.md#confusable-weights-subpath)
+ - Confusable weights (SSIM-scored pairs, including cross-script): [docs/reference.md#confusable-weights-subpath](docs/reference.md#confusable-weights-subpath)
+ - Cross-script detection: [docs/reference.md#cross-script-detection](docs/reference.md#cross-script-detection)
  - CLI reference: [docs/reference.md#cli](docs/reference.md#cli)
  - API reference: [docs/reference.md#api-reference](docs/reference.md#api-reference)
  - Framework integration (Next.js/Express/tRPC): [docs/reference.md#framework-integration](docs/reference.md#framework-integration)