namespace-guard 0.7.0 → 0.8.1

package/README.md CHANGED
@@ -299,7 +299,7 @@ No words are bundled — use any word list you like (e.g., the `bad-words` npm p
299
299
 
300
300
  ### Built-in Homoglyph Validator
301
301
 
302
- Prevent spoofing attacks where Cyrillic or Greek characters are substituted for visually identical Latin letters (e.g., Cyrillic "а" for Latin "a" in "admin"):
302
+ Prevent spoofing attacks where visually similar characters from any Unicode script are substituted for Latin letters (e.g., Cyrillic "а" for Latin "a" in "admin"):
303
303
 
304
304
  ```typescript
305
305
  import { createNamespaceGuard, createHomoglyphValidator } from "namespace-guard";
@@ -318,11 +318,43 @@ Options:
318
318
  createHomoglyphValidator({
319
319
  message: "Custom rejection message.", // optional
320
320
  additionalMappings: { "\u0261": "g" }, // extend the built-in map
321
- rejectMixedScript: true, // also reject Latin + Cyrillic/Greek mixing
321
+ rejectMixedScript: true, // also reject Latin + non-Latin script mixing
322
322
  })
323
323
  ```
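As a rough illustration of what "extend the built-in map" could mean for `additionalMappings`, the sketch below merges caller-supplied pairs over a small stand-in map and scans an identifier for any mapped character. `BUILT_IN_SAMPLE`, `mergeMappings`, and `findConfusable` are hypothetical names invented for this example; the library's actual merge logic is not shown in this diff and may differ.

```typescript
// Stand-in for the exported CONFUSABLE_MAP (assumption: the real map is far larger).
const BUILT_IN_SAMPLE: Record<string, string> = {
  "\u0430": "a", // Cyrillic small letter a
  "\u03BF": "o", // Greek small letter omicron
};

// Hypothetical helper: caller-supplied pairs are layered over the built-in ones.
function mergeMappings(
  base: Record<string, string>,
  extra: Record<string, string> = {},
): Record<string, string> {
  return { ...base, ...extra };
}

// Hypothetical helper: return the first mapped character in `value`, or null.
function findConfusable(
  value: string,
  map: Record<string, string>,
): string | null {
  for (const ch of value) {
    if (Object.prototype.hasOwnProperty.call(map, ch)) return ch;
  }
  return null;
}

// U+0261 (Latin small letter script g) is the extra pair from the options above.
const merged = mergeMappings(BUILT_IN_SAMPLE, { "\u0261": "g" });

console.log(findConfusable("\u0261uard", merged)); // the U+0261 character
console.log(findConfusable("guard", merged)); // null
```

Spreading the extra pairs last means a caller-supplied entry would override a built-in one with the same key; whether the library resolves collisions this way is an assumption.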
324
324
 
325
- The built-in `CONFUSABLE_MAP` covers ~30 Cyrillic-to-Latin and Greek-to-Latin pairs, the most common spoofing vectors. It's exported for inspection or extension.
325
+ The built-in `CONFUSABLE_MAP` contains 613 character pairs generated from [Unicode TR39 confusables.txt](https://unicode.org/reports/tr39/) plus supplemental Latin small capitals. It covers Cyrillic, Greek, Armenian, Cherokee, IPA, Coptic, Lisu, Canadian Syllabics, Georgian, and 20+ other scripts. The map is exported for inspection or extension, and is regenerable for new Unicode versions with `npx tsx scripts/generate-confusables.ts`.
326
+
327
+ ### How the anti-spoofing pipeline works
328
+
329
+ Most confusable-detection libraries apply a character map in isolation. namespace-guard uses a three-stage pipeline where each stage is aware of the others:
330
+
331
+ ```
332
+ Input → NFKC normalize → Confusable map → Mixed-script reject
333
+         (stage 1)        (stage 2)        (stage 3)
334
+ ```
335
+
336
+ **Stage 1: NFKC normalization** collapses full-width characters (`Ｉ` → `I`), ligatures (`ﬁ` → `fi`), superscripts, and other Unicode compatibility forms to their canonical equivalents. This runs first, before any confusable check.
337
+
338
+ **Stage 2: Confusable map** catches characters that survive NFKC but visually mimic Latin letters — Cyrillic `а` for `a`, Greek `ο` for `o`, Cherokee `Ꭺ` for `A`, and 600+ others from the Unicode Consortium's [confusables.txt](https://unicode.org/Public/security/latest/confusables.txt).
339
+
340
+ **Stage 3: Mixed-script rejection** (`rejectMixedScript: true`) blocks identifiers that mix Latin with non-Latin scripts (Hebrew, Arabic, Devanagari, Thai, Georgian, Ethiopic, etc.) even if the specific characters aren't in the confusable map. This catches novel homoglyphs that the map doesn't cover.
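The three stages described above can be sketched in a few lines of self-contained TypeScript. This is a minimal illustration under stated assumptions, not the library's implementation: `SAMPLE_MAP` holds three entries instead of 613, `checkIdentifier` and `Verdict` are hypothetical names, and the mixed-script test here simply treats any non-Latin letter next to a Latin letter as a mix.

```typescript
// Tiny stand-in for the confusable map (assumption: three entries for illustration).
const SAMPLE_MAP = new Map<string, string>([
  ["\u0430", "a"], // Cyrillic small letter a
  ["\u03BF", "o"], // Greek small letter omicron
  ["\u13AA", "a"], // Cherokee letter GO, resembles Latin capital A
]);

type Verdict =
  | { ok: true; normalized: string }
  | { ok: false; reason: string };

// Hypothetical function combining the three stages in order.
function checkIdentifier(input: string, rejectMixedScript = false): Verdict {
  // Stage 1: fold compatibility forms (full-width letters, ligatures, ...).
  const normalized = input.normalize("NFKC");

  // Stage 2: reject any character that appears in the confusable map.
  for (const ch of normalized) {
    if (SAMPLE_MAP.has(ch)) {
      const hex = (ch.codePointAt(0) ?? 0).toString(16).toUpperCase();
      return { ok: false, reason: `confusable character U+${hex.padStart(4, "0")}` };
    }
  }

  // Stage 3 (optional): reject Latin letters mixed with letters of another script.
  if (rejectMixedScript) {
    let hasLatin = false;
    let hasOtherLetter = false;
    for (const ch of normalized) {
      if (/\p{Script=Latin}/u.test(ch)) hasLatin = true;
      else if (/\p{L}/u.test(ch)) hasOtherLetter = true;
    }
    if (hasLatin && hasOtherLetter) {
      return { ok: false, reason: "mixed Latin and non-Latin scripts" };
    }
  }

  return { ok: true, normalized };
}

console.log(checkIdentifier("\uFF46\uFF4F\uFF4F")); // full-width "foo" passes as "foo"
console.log(checkIdentifier("\u0430dmin")); // rejected in stage 2
console.log(checkIdentifier("admin\u05D0", true)); // Hebrew alef, rejected in stage 3
```

The ordering matters: because stage 2 only ever sees NFKC output, the map needs no entries for characters that stage 1 already folds to ASCII.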
341
+
342
+ #### Why NFKC-aware filtering matters
343
+
344
+ The key insight: TR39's confusables.txt and NFKC normalization sometimes disagree. For example, Unicode says capital `I` (U+0049) is confusable with lowercase `l` — visually true in many fonts. But NFKC maps Mathematical Bold `𝐈` (U+1D408) to `I`, not `l`. If you naively ship the TR39 mapping (`𝐈` → `l`), the confusable check will never see that character — NFKC already converted it to `I` in stage 1.
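The Mathematical Bold example can be checked directly with the standard `String.prototype.normalize`. The one-entry `naiveMap` below is a hypothetical stand-in for a map that ships the raw TR39 pairing unfiltered.

```typescript
// Hypothetical one-entry map using the unfiltered pairing described above.
const naiveMap = new Map<string, string>([
  ["\u{1D408}", "l"], // Mathematical Bold Capital I
]);

const rawChar = "\u{1D408}";
const afterStage1 = rawChar.normalize("NFKC"); // "I" (U+0049)

console.log(naiveMap.has(rawChar)); // true: matches only if checked before NFKC
console.log(naiveMap.has(afterStage1)); // false: the entry never fires after stage 1
console.log(afterStage1.toLowerCase()); // "i", not the "l" the naive entry suggests
```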
345
+
346
+ We found 31 entries where this happens:
347
+
348
+ | Character | TR39 says | NFKC says | Winner |
349
+ |-----------|-----------|-----------|--------|
350
+ | `ſ` Long S (U+017F) | `f` | `s` | NFKC (`s` is correct) |
351
+ | `Ⅰ` Roman Numeral I (U+2160) | `l` | `i` | NFKC (`i` is correct) |
352
+ | `Ｉ` Fullwidth I (U+FF29) | `l` | `i` | NFKC (`i` is correct) |
353
+ | `𝟎` Math Bold 0 (U+1D7CE) | `o` | `0` | NFKC (`0` is correct) |
354
+ | 11 Mathematical I variants | `l` | `i` | NFKC |
355
+ | 12 Mathematical 0/1 variants | `o`/`l` | `0`/`1` | NFKC |
356
+
357
+ These entries are dead code in any pipeline that runs NFKC first — and worse, they encode the *wrong* mapping. The generate script (`scripts/generate-confusables.ts`) automatically detects and excludes them.
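One way such an exclusion could work is sketched below: drop every candidate whose NFKC form, once lowercased, is already a single ASCII letter or digit. This is an assumption about the approach, not the contents of `scripts/generate-confusables.ts`, which this diff does not show; `candidates` and `handledByNfkc` are hypothetical names, and the sample pairs are taken from the table above plus two entries that NFKC leaves untouched.

```typescript
// Illustrative TR39-style candidates: [character, proposed Latin/digit target].
const candidates: Array<[string, string]> = [
  ["\u017F", "f"], // Long S: NFKC folds it to "s"
  ["\u2160", "l"], // Roman Numeral One: NFKC folds it to "I"
  ["\uFF29", "l"], // Fullwidth I: NFKC folds it to "I"
  ["\u{1D7CE}", "o"], // Mathematical Bold Digit Zero: NFKC folds it to "0"
  ["\u0430", "a"], // Cyrillic small a: unchanged by NFKC
  ["\u03BF", "o"], // Greek small omicron: unchanged by NFKC
];

// Hypothetical predicate: true when stage 1 already turns the character into
// a plain ASCII letter or digit, so a map entry for it could never match.
function handledByNfkc(ch: string): boolean {
  const folded = ch.normalize("NFKC").toLowerCase();
  return folded !== ch && /^[a-z0-9]$/.test(folded);
}

const kept = candidates.filter(([ch]) => !handledByNfkc(ch));

console.log(kept.length); // 2
console.log(kept.map(([, target]) => target).join(",")); // "a,o"
```

Only the Cyrillic and Greek entries survive, since NFKC leaves those characters untouched and the confusable map is the only stage that can catch them.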
326
358
 
327
359
  ## Unicode Normalization
328
360
 
package/dist/index.d.mts CHANGED
@@ -136,22 +136,27 @@ declare function createProfanityValidator(words: string[], options?: {
136
136
  message: string;
137
137
  } | null>;
138
138
  /**
139
- * Default mapping of visually confusable Unicode characters to their Latin equivalents.
140
- * Covers Cyrillic-to-Latin and Greek-to-Latin lookalikes, the most common spoofing vectors.
141
- * Exported for advanced users who need to inspect or extend the mapping.
139
+ * Mapping of visually confusable Unicode characters to their Latin/digit equivalents.
140
+ * Generated from Unicode TR39 confusables.txt + supplemental Latin small capitals.
141
+ * Covers every single-character mapping to a lowercase Latin letter or digit,
142
+ * excluding characters already handled by NFKC normalization (either collapsed
143
+ * to the same target, or mapped to a different valid Latin char/digit).
144
+ * Regenerate: `npx tsx scripts/generate-confusables.ts`
142
145
  */
143
146
  declare const CONFUSABLE_MAP: Record<string, string>;
144
147
  /**
145
148
  * Create a validator that rejects identifiers containing homoglyph/confusable characters.
146
149
  *
147
- * Catches spoofing attacks where Cyrillic or Greek characters are substituted for
150
+ * Catches spoofing attacks where characters from other scripts are substituted for
148
151
  * visually identical Latin characters (e.g., Cyrillic "а" for Latin "a" in "admin").
149
- * Uses a curated mapping of ~30 character pairs that covers 95%+ of real impersonation attempts.
152
+ * Uses a comprehensive mapping of 613 character pairs generated from Unicode TR39
153
+ * confusables.txt, covering Cyrillic, Greek, Armenian, Cherokee, IPA, Latin small
154
+ * capitals, Canadian Syllabics, Georgian, Lisu, Coptic, and many other scripts.
150
155
  *
151
156
  * @param options - Optional settings
152
157
  * @param options.message - Custom rejection message (default: "That name contains characters that could be confused with other letters.")
153
158
  * @param options.additionalMappings - Extra confusable pairs to merge with the built-in map
154
- * @param options.rejectMixedScript - Also reject identifiers that mix Latin with Cyrillic/Greek characters (default: false)
159
+ * @param options.rejectMixedScript - Also reject identifiers that mix Latin with non-Latin characters from any covered script (Cyrillic, Greek, Armenian, Hebrew, Arabic, Georgian, Cherokee, Canadian Syllabics, Ethiopic, Coptic, Lisu, and more) (default: false)
155
160
  * @returns An async validator function for use in `config.validators`
156
161
  *
157
162
  * @example
package/dist/index.d.ts CHANGED
@@ -136,22 +136,27 @@ declare function createProfanityValidator(words: string[], options?: {
136
136
  message: string;
137
137
  } | null>;
138
138
  /**
139
- * Default mapping of visually confusable Unicode characters to their Latin equivalents.
140
- * Covers Cyrillic-to-Latin and Greek-to-Latin lookalikes, the most common spoofing vectors.
141
- * Exported for advanced users who need to inspect or extend the mapping.
139
+ * Mapping of visually confusable Unicode characters to their Latin/digit equivalents.
140
+ * Generated from Unicode TR39 confusables.txt + supplemental Latin small capitals.
141
+ * Covers every single-character mapping to a lowercase Latin letter or digit,
142
+ * excluding characters already handled by NFKC normalization (either collapsed
143
+ * to the same target, or mapped to a different valid Latin char/digit).
144
+ * Regenerate: `npx tsx scripts/generate-confusables.ts`
142
145
  */
143
146
  declare const CONFUSABLE_MAP: Record<string, string>;
144
147
  /**
145
148
  * Create a validator that rejects identifiers containing homoglyph/confusable characters.
146
149
  *
147
- * Catches spoofing attacks where Cyrillic or Greek characters are substituted for
150
+ * Catches spoofing attacks where characters from other scripts are substituted for
148
151
  * visually identical Latin characters (e.g., Cyrillic "а" for Latin "a" in "admin").
149
- * Uses a curated mapping of ~30 character pairs that covers 95%+ of real impersonation attempts.
152
+ * Uses a comprehensive mapping of 613 character pairs generated from Unicode TR39
153
+ * confusables.txt, covering Cyrillic, Greek, Armenian, Cherokee, IPA, Latin small
154
+ * capitals, Canadian Syllabics, Georgian, Lisu, Coptic, and many other scripts.
150
155
  *
151
156
  * @param options - Optional settings
152
157
  * @param options.message - Custom rejection message (default: "That name contains characters that could be confused with other letters.")
153
158
  * @param options.additionalMappings - Extra confusable pairs to merge with the built-in map
154
- * @param options.rejectMixedScript - Also reject identifiers that mix Latin with Cyrillic/Greek characters (default: false)
159
+ * @param options.rejectMixedScript - Also reject identifiers that mix Latin with non-Latin characters from any covered script (Cyrillic, Greek, Armenian, Hebrew, Arabic, Georgian, Cherokee, Canadian Syllabics, Ethiopic, Coptic, Lisu, and more) (default: false)
155
160
  * @returns An async validator function for use in `config.validators`
156
161
  *
157
162
  * @example