npm - namespace-guard - Versions diffs - 0.17.1 → 0.18.0 - Mend

namespace-guard 0.17.1 → 0.18.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (16)

package/README.md +79 -71
package/dist/cli.js +6129 -1
package/dist/cli.mjs +6129 -1
package/dist/composability-vectors.js +6125 -0
package/dist/composability-vectors.mjs +6125 -0
package/dist/confusable-weights.d.mts +1 -1
package/dist/confusable-weights.d.ts +1 -1
package/dist/confusable-weights.js +818 -42
package/dist/confusable-weights.mjs +818 -42
package/dist/index.d.mts +63 -8
package/dist/index.d.ts +63 -8
package/dist/index.js +130 -27
package/dist/index.mjs +129 -27
package/dist/profanity-en.js +6125 -0
package/dist/profanity-en.mjs +6125 -0
package/package.json +2 -2

package/README.md CHANGED

@@ -5,11 +5,40 @@
 [![TypeScript](https://img.shields.io/badge/TypeScript-5.0+-blue.svg)](https://www.typescriptlang.org/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
-**Claim safe slugs in one line**: availability, reserved names, spoofing protection, and moderation hooks.
+**The world's first library that detects confusable characters across non-Latin scripts.** Slug claimability, Unicode anti-spoofing, and LLM Denial of Spend defence in one zero-dependency package.
 - Live demo: https://paultendo.github.io/namespace-guard/
 - Blog post: https://paultendo.github.io/posts/namespace-guard-launch/
+## Cross-script confusable detection
+Existing confusable standards (TR39, IDNA) map non-Latin characters to Latin equivalents. They have zero coverage for confusable pairs *between* two non-Latin scripts.
+namespace-guard ships 494 SSIM-measured cross-script pairs from [confusable-vision](https://github.com/paultendo/confusable-vision) (rendered across 230 system fonts, scored by structural similarity). This catches attacks that no other library detects:
+```typescript
+import { areConfusable, detectCrossScriptRisk } from "namespace-guard";
+import { CONFUSABLE_WEIGHTS } from "namespace-guard/confusable-weights";
+// Hangul ᅵ and Han 丨 are visually identical (SSIM 0.999, Arial Unicode MS)
+areConfusable("\u1175", "\u4E28", { weights: CONFUSABLE_WEIGHTS }); // true
+// Greek Τ and Han 丅 are near-identical (SSIM 0.930, Hiragino Kaku Gothic ProN)
+areConfusable("\u03A4", "\u4E05", { weights: CONFUSABLE_WEIGHTS }); // true
+// Cyrillic І and Greek Ι are pixel-identical (SSIM 1.0, 61 fonts agree)
+areConfusable("\u0406", "\u0399", { weights: CONFUSABLE_WEIGHTS }); // true
+// Without weights, only skeleton-based detection (TR39 coverage)
+areConfusable("\u1175", "\u4E28"); // false
+// Analyze an identifier for cross-script risk
+const risk = detectCrossScriptRisk("\u1175\u4E28", { weights: CONFUSABLE_WEIGHTS });
+// { riskLevel: "high", scripts: ["han", "hangul"], crossScriptPairs: [...] }
+```
+1,397 total SSIM-scored confusable pairs (110 TR39-confirmed, 793 novel Latin-target, 494 cross-script). Cross-script data licensed CC-BY-4.0.
 ## Installation
 ```bash
@@ -57,58 +86,24 @@ if (!result.claimed) {
 }
 ```
-## Research-Backed Differentiation
-We started by auditing how major Unicode-confusable implementations compose normalization and mapping in practice (including ICU, Chromium, Rust, and django-registration), then converted that gap into a reproducible library design.
-- Documented a 31-entry NFKC vs TR39 divergence set and shipped it as a named regression suite: `nfkc-tr39-divergence-v1`.
-- Ship two maps for two real pipelines:
-  `CONFUSABLE_MAP` (NFKC-first) and `CONFUSABLE_MAP_FULL` (TR39/NFD/raw-input pipelines).
-- Export the vectors as JSON (`docs/data/composability-vectors.json`) and wire them into CLI drift baselines.
-- Publish a labeled benchmark corpus (`docs/data/confusable-bench.v1.json`) for cross-tool evaluation and CI regressions.
-- Submitted the findings for Unicode public review (PRI #540): https://www.unicode.org/review/pri540/
-- Validated the 31 divergence vectors empirically by rendering each character across 12 fonts and measuring SSIM similarity: TR39 is visually correct for letter-shape confusables, NFKC for digit confusables, and 61% are ties where both targets are near-identical ([confusable-vision](https://github.com/paultendo/confusable-vision)).
-Details:
-- Technical reference: [docs/reference.md#how-the-anti-spoofing-pipeline-works](docs/reference.md#how-the-anti-spoofing-pipeline-works)
-- Launch write-up: https://paultendo.github.io/posts/namespace-guard-launch/
 ## What You Get
+- **Cross-script confusable detection** with 494 SSIM-measured pairs between non-Latin scripts
 - Cross-table collision checks (users, orgs, teams, etc.)
 - Reserved-name blocking with category-aware messages
 - Unicode anti-spoofing (NFKC + confusable detection + mixed-script/risk controls)
-- Invisible character detection (default-ignorable + bidi controls, optional combining-mark blocking)
+- Invisible character detection (zero-width joiners, direction overrides, and other hidden bytes)
 - Optional profanity/evasion validation
 - Suggestion strategies for taken names
 - CLI for red-team generation, calibration, drift, and CI gates
 ## LLM Pipeline Preprocessing
-LLM tokenizers process Unicode codepoints, not rendered glyphs. Confusable substitutions can inflate token counts and hide important terms in mixed-script text, especially on smaller models.
-Use namespace-guard as a deterministic preprocess layer before model calls:
-```text
-Document ingestion
-       |
-       v
-+----------------+
-| namespace-     |  <-- Detect mixed-script confusable substitution
-| guard          |  <-- Canonicalise to Latin equivalents
-| (microseconds) |  <-- Flag suspicious patterns for review
-+----------------+
-       |
-       v
-+----------------+
-| LLM API        |  <-- Any model/provider
-| (GPT/Claude/   |  <-- Receives canonicalised text
-| Llama/etc)     |
-+----------------+
-       |
-       v
-   Analysis output
-```
+Confusable characters are pixel-identical to Latin letters but encode as multi-byte BPE tokens. A 95-line contract that costs 881 tokens in clean ASCII costs 4,567 tokens when flooded with confusables: **5.2x the API bill**. The model reads it correctly. The invoice does not care.
+We tested this across 4 frontier models, 8 attack types, and 130+ API calls. Zero meaning flips. Every substituted clause was correctly interpreted. But the billing attack succeeds. We call it **Denial of Spend**: the confusable analogue of DDoS, where the attacker cannot degrade the service but can inflate the cost of running it.
+`canonicalise()` recovered every substituted term across all 12 attack variants, collapsing the 5.2x inflation to 1.0x. Processing a 10,000-character document takes under 1ms.
 ```typescript
 import { canonicalise, scan, isClean } from "namespace-guard";
@@ -124,10 +119,47 @@ const ok = isClean(raw);         // false (mixed-script confusable detected)
 canonicalise("поп-refundable", { strategy: "all" }); // "non-refundable"
 ```
-Research context:
+Research:
+- Denial of Spend: https://paultendo.github.io/posts/confusable-vision-llm-attack-tests/
 - Launch: https://paultendo.github.io/posts/namespace-guard-launch/
 - NFKC/TR39 composability: https://paultendo.github.io/posts/unicode-confusables-nfkc-conflict/
-- Confusable detection without NFKC: https://paultendo.github.io/posts/confusable-detection-without-nfkc/
+## Advanced Security Primitives
+Low-level helpers for custom scoring, pairwise checks, and cross-script risk analysis:
+```typescript
+import { skeleton, areConfusable, confusableDistance } from "namespace-guard";
+skeleton("pa\u0443pal"); // "paypal" skeleton form
+areConfusable("paypal", "pa\u0443pal"); // true
+confusableDistance("paypal", "pa\u0443pal"); // graded similarity + chainDepth + explainable steps
+```
+For measured visual scoring, pass the optional weights from confusable-vision (1,397 SSIM-scored pairs across 230 fonts, including 494 cross-script pairs). The `context` filter restricts to identifier-valid, domain-valid, or all pairs.
+```typescript
+import { confusableDistance } from "namespace-guard";
+import { CONFUSABLE_WEIGHTS } from "namespace-guard/confusable-weights";
+const result = confusableDistance("paypal", "pa\u0443pal", {
+  weights: CONFUSABLE_WEIGHTS,
+  context: "identifier",
+});
+// result.similarity, result.steps (including "visual-weight" reason for novel pairs)
+```
+## Research
+Two research tracks feed the library:
+**Visual measurement.** 1,397 confusable pairs rendered across 230 system fonts, scored by structural similarity (SSIM). 494 of these are novel cross-script pairs between non-Latin scripts (Hangul/Han, Cyrillic/Greek, Cyrillic/Arabic, and more) with zero coverage in any existing standard. Full dataset published as [confusable-vision](https://github.com/paultendo/confusable-vision) (CC-BY-4.0).
+**Normalisation composability.** 31 characters where Unicode's confusables.txt and NFKC normalisation disagree. Two production maps (`CONFUSABLE_MAP` for NFKC-first, `CONFUSABLE_MAP_FULL` for raw-input pipelines), a benchmark corpus, and composability vectors wired into CLI drift baselines. Findings accepted into [Unicode public review (PRI #540)](https://www.unicode.org/review/pri540/).
+- Technical reference: [docs/reference.md#how-the-anti-spoofing-pipeline-works](docs/reference.md#how-the-anti-spoofing-pipeline-works)
+- Launch write-up: https://paultendo.github.io/posts/namespace-guard-launch/
+- Denial of Spend: https://paultendo.github.io/posts/confusable-vision-llm-attack-tests/
 ## Built-in Profiles
@@ -177,35 +209,10 @@ npx namespace-guard recommend ./risk-dataset.json
 # 3) Preflight canonical collisions before adding DB unique constraints
 npx namespace-guard audit-canonical ./users-export.json --json
-# 4) Compare TR39-full vs NFKC-filtered behavior
+# 4) Compare TR39-full vs NFKC-filtered behaviour
 npx namespace-guard drift --json
 ```
-## Advanced Security Primitives (Optional)
-Use these when you need custom scoring, explainability, or pairwise checks outside the default claim flow:
-```typescript
-import { skeleton, areConfusable, confusableDistance } from "namespace-guard";
-skeleton("pa\u0443pal"); // "paypal" skeleton form
-areConfusable("paypal", "pa\u0443pal"); // true
-confusableDistance("paypal", "pa\u0443pal"); // graded similarity + chainDepth + explainable steps
-```
-For measured visual scoring, pass the optional weights from confusable-vision (903 SSIM-scored pairs across 230 fonts). The `context` filter restricts to identifier-valid, domain-valid, or all pairs.
-```typescript
-import { confusableDistance } from "namespace-guard";
-import { CONFUSABLE_WEIGHTS } from "namespace-guard/confusable-weights";
-const result = confusableDistance("paypal", "pa\u0443pal", {
-  weights: CONFUSABLE_WEIGHTS,
-  context: "identifier",
-});
-// result.similarity, result.steps (including "visual-weight" reason for novel pairs)
-```
 ## Adapter Support
 - Prisma
@@ -236,7 +243,8 @@ Migration guides per adapter: [docs/reference.md#canonical-uniqueness-migration-
 - LLM preprocessing (`canonicalise`, `scan`, `isClean`): [docs/reference.md#llm-pipeline-preprocessing](docs/reference.md#llm-pipeline-preprocessing)
 - Benchmark corpus (`confusable-bench.v1`): [docs/reference.md#confusable-benchmark-corpus-artifact](docs/reference.md#confusable-benchmark-corpus-artifact)
 - Advanced primitives (`skeleton`, `areConfusable`, `confusableDistance`): [docs/reference.md#advanced-security-primitives](docs/reference.md#advanced-security-primitives)
-- Confusable weights (SSIM-scored pairs): [docs/reference.md#confusable-weights-subpath](docs/reference.md#confusable-weights-subpath)
+- Confusable weights (SSIM-scored pairs, including cross-script): [docs/reference.md#confusable-weights-subpath](docs/reference.md#confusable-weights-subpath)
+- Cross-script detection: [docs/reference.md#cross-script-detection](docs/reference.md#cross-script-detection)
 - CLI reference: [docs/reference.md#cli](docs/reference.md#cli)
 - API reference: [docs/reference.md#api-reference](docs/reference.md#api-reference)
 - Framework integration (Next.js/Express/tRPC): [docs/reference.md#framework-integration](docs/reference.md#framework-integration)
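
As a rough illustration of the Denial of Spend claim in the README changes above, the TypeScript sketch below shows why a confusable-substituted string is larger than its ASCII look-alike, and reproduces the 5.2x figure from the quoted token counts (4,567 / 881). The helpers `utf8ByteLength` and `mixesLatinAndCyrillic` are hypothetical and are not part of namespace-guard. UTF-8 byte length is used only as a crude stand-in for tokenizer behaviour, which this sketch does not measure.

```typescript
// Illustrative sketch only: none of these helpers come from namespace-guard.

/** UTF-8 byte length of a string, computed from its UTF-16 code units. */
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    if (unit < 0x80) {
      bytes += 1; // ASCII
    } else if (unit < 0x800) {
      bytes += 2; // e.g. Greek, Cyrillic
    } else if (
      unit >= 0xd800 && unit <= 0xdbff &&
      i + 1 < text.length &&
      text.charCodeAt(i + 1) >= 0xdc00 && text.charCodeAt(i + 1) <= 0xdfff
    ) {
      bytes += 4; // a surrogate pair encodes one code point above U+FFFF
      i++;
    } else {
      bytes += 3; // rest of the Basic Multilingual Plane
    }
  }
  return bytes;
}

/** True when the text mixes basic Latin letters with Cyrillic-block characters. */
function mixesLatinAndCyrillic(text: string): boolean {
  let latin = false;
  let cyrillic = false;
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    if ((unit >= 0x41 && unit <= 0x5a) || (unit >= 0x61 && unit <= 0x7a)) latin = true;
    if (unit >= 0x0400 && unit <= 0x04ff) cyrillic = true;
  }
  return latin && cyrillic;
}

const clean = "non-refundable";
const spoofed = "\u043f\u043e\u043f-refundable"; // Cyrillic п, о, п in place of "non"

const cleanBytes = utf8ByteLength(clean);
const spoofedBytes = utf8ByteLength(spoofed);
const mixed = mixesLatinAndCyrillic(spoofed);

// Ratio of the token counts quoted in the README, rounded to one decimal place.
const inflation = Math.round((4567 / 881) * 10) / 10;

console.log({ cleanBytes, spoofedBytes, mixed, inflation });
```

The three substituted letters each take two bytes instead of one, so the two strings render alike but differ in size. The check covers only the basic Latin range and the Cyrillic block; the library's own detection is described as covering far more scripts and measured pairs.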