deghost 0.0.1

package/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Kelly Mears
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,241 @@
1
+ # deghost
2
+
3
+ Strip invisible Unicode characters and normalize whitespace. Chainable, type-safe, zero dependencies.
4
+
5
+ ```
6
+ npm install deghost
7
+ ```
8
+
9
+ ## Why
10
+
11
+ Text from binary formats, APIs, and user input often contains invisible Unicode characters: non-breaking spaces, zero-width joiners, directional marks, BOMs, and control characters. They break string comparison, corrupt search indexes, and produce garbled output.
12
+
13
+ Existing tools either strip everything indiscriminately or miss entire character categories. deghost gives you category-level control with a chainable API that distinguishes between _stripping_ (remove entirely) and _normalizing_ (replace with a visible substitute).
14
+
15
+ ## Quick start
16
+
17
+ ```typescript
18
+ import { deghost } from 'deghost'
19
+
20
+ // Sensible defaults — handles the common cases
21
+ ;`${deghost('Plant\u00a064\u00a0-\u00a0Woodbridge')}`
22
+ // → 'Plant 64 - Woodbridge'
23
+
24
+ `${deghost('hello\u200Bworld')}`
25
+ // → 'helloworld'
26
+
27
+ // Also works as a tagged template literal
28
+ `${deghost`Plant\u00a064\u00a0-\u00a0Woodbridge`}`
29
+ // → 'Plant 64 - Woodbridge'
30
+ ```
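A function can serve as both a plain call and a template tag by rebuilding the string from the template parts. The sketch below shows only that reassembly step; `joinTemplate` is a hypothetical name used for illustration, not part of the deghost API.

```typescript
// Hypothetical helper: rebuild the full string from a tagged template call.
// `strings` holds the literal chunks (with escapes already processed),
// `values` holds the interpolated expressions.
function joinTemplate(strings: TemplateStringsArray, ...values: unknown[]): string {
  let out = strings[0] ?? ''
  for (let i = 0; i < values.length; i++) {
    out += String(values[i]) + (strings[i + 1] ?? '')
  }
  return out
}

const plantNumber = 64
const joined = joinTemplate`Plant\u00a0${plantNumber}`
// joined is 'Plant' + NBSP + '64'; cleaning would then run on this string
```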
31
+
32
+ ## Chainable API
33
+
34
+ Fine-grained control over what gets stripped vs. normalized:
35
+
36
+ ```typescript
37
+ import { deghost } from 'deghost'
38
+
39
+ deghost('text\u200B\u00a0here')
40
+ .strip('format') // zero-width joiners, directional marks, soft hyphens
41
+ .strip('control') // C0/C1 control characters
42
+ .normalize('spaces') // NBSP, en/em space → regular space
43
+ .trim()
44
+ .toString()
45
+ // → 'text here'
46
+ ```
47
+
48
+ The chain is immutable — each method returns a new instance, so you can branch without side effects.
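To illustrate what branching without side effects means, here is a minimal sketch of an immutable chain. `MiniChain` is a hypothetical stand-in written for this example and is not deghost's own class; it takes raw regular expressions instead of category names.

```typescript
// Hypothetical miniature chain, for illustration only.
class MiniChain {
  private readonly value: string

  constructor(value: string) {
    this.value = value
  }

  // Each step returns a fresh instance and leaves the receiver untouched.
  strip(pattern: RegExp): MiniChain {
    return new MiniChain(this.value.replace(pattern, ''))
  }

  normalize(pattern: RegExp, replacement = ' '): MiniChain {
    return new MiniChain(this.value.replace(pattern, replacement))
  }

  toString(): string {
    return this.value
  }
}

const base = new MiniChain('a\u200Bb\u00a0c')
const strippedBranch = base.strip(/\p{Cf}/gu)     // removes the zero-width space
const normalizedBranch = base.normalize(/\u00a0/g) // NBSP becomes a regular space
// base still holds the original string after both branches
```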
49
+
50
+ ### Chain methods
51
+
52
+ | Method | Returns | Description |
53
+ | ------------------------------------ | -------------- | -------------------------------------------------------------------- |
54
+ | `.strip(category)` | `DeghostChain` | Remove all characters in a category |
55
+ | `.normalize(category, replacement?)` | `DeghostChain` | Replace characters with a substitute (default: `' '`) |
56
+ | `.replace(category, mapper)` | `DeghostChain` | Replace characters using a function that receives detection metadata |
57
+ | `.highlight(category?, formatter?)` | `DeghostChain` | Replace ghosts with visible markers like `[U+200B]` |
58
+ | `.collapse()` | `DeghostChain` | Collapse runs of two or more ASCII spaces into a single space |
59
+ | `.trim()` | `DeghostChain` | Trim leading/trailing whitespace |
60
+ | `.clean()` | `DeghostChain` | Apply the default preset |
61
+ | `.detect(categories?)` | `Detection[]` | Return detections for the current value |
62
+ | `.hasGhosts(categories?)` | `boolean` | Check if invisible characters remain |
63
+ | `.isClean(categories?)` | `boolean` | Inverse of `.hasGhosts()` |
64
+ | `.count(categories?)` | `Record` | Count ghosts by category |
65
+ | `.summary(categories?)` | `string` | Human-readable report of ghosts found |
66
+ | `.toString()` | `string` | Extract the string |
67
+
68
+ **Categories:**
69
+
70
+ | Category | What it matches | Default behavior |
71
+ | --------- | ------------------------------------------------------------ | ------------------ |
72
+ | `format` | Zero-width joiners, directional marks, soft hyphens (\p{Cf}) | Strip |
73
+ | `control` | C0/C1 control characters (\p{Cc}) | Strip |
74
+ | `spaces` | NBSP, en/em space, thin space, ideographic space (\p{Zs}) | Normalize to `' '` |
75
+ | `bom` | Byte order mark (U+FEFF) | Strip |
76
+ | `tag` | Unicode tag characters (U+E0001–U+E007F) | — |
77
+ | `fillers` | Hangul, Khmer, Mongolian, Ogham fillers | — |
78
+ | `math` | Invisible math operators (U+2061–U+2064) | — |
79
+
80
+ ## Reusable cleaners
81
+
82
+ Build a cleaning pipeline once, apply it to many strings with no per-call chain allocation:
83
+
84
+ ```typescript
85
+ import { cleaner } from 'deghost'
86
+
87
+ const clean = cleaner().strip('format').strip('control').normalize('spaces').trim().build()
88
+
89
+ clean('dirty\u00a0string') // 'dirty string'
90
+ clean('another\u200Bone') // 'anotherone'
91
+ ```
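The idea of building once and applying many times can be sketched as folding a fixed list of string transforms into a single function. `compile` and `Step` are hypothetical names for this sketch; the patterns are assumptions chosen to mirror the `format` and `spaces` categories.

```typescript
type Step = (s: string) => string

// Hypothetical helper: fold a fixed list of steps into one reusable function.
function compile(steps: Step[]): Step {
  return (input) => steps.reduce((acc, step) => step(acc), input)
}

const cleanText = compile([
  (s) => s.replace(/\p{Cf}/gu, ''),            // strip format characters
  (s) => s.replace(/(?!\u0020)\p{Zs}/gu, ' '), // normalize non-ASCII spaces
  (s) => s.trim(),
])

cleanText('dirty\u00a0string') // 'dirty string'
cleanText('another\u200Bone')  // 'anotherone'
```

The step list is fixed when `compile` is called, so each later call only runs the transforms.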
92
+
93
+ Cleaners also support `.replace()` and `.highlight()` for dynamic transformations:
94
+
95
+ ```typescript
96
+ const annotate = cleaner().highlight('format').normalize('spaces').build()
97
+
98
+ annotate('a\u200Bb\u00a0c') // 'a[U+200B]b c'
99
+ ```
100
+
101
+ ## Detection
102
+
103
+ Find out what's hiding in your strings:
104
+
105
+ ```typescript
106
+ import { detect, hasGhosts, isClean, count, first, scan } from 'deghost'
107
+
108
+ detect('sneaky\u200Btext')
109
+ // [{
110
+ // char: '\u200B',
111
+ // codepoint: 'U+200B',
112
+ // name: 'ZERO WIDTH SPACE',
113
+ // category: 'format',
114
+ // offset: 6
115
+ // }]
116
+
117
+ hasGhosts('hello\u200Bworld') // true
118
+ isClean('hello world') // true
119
+
120
+ count('a\u00a0b\u200Bc\u200Bd')
121
+ // { spaces: 1, format: 2 }
122
+
123
+ // Get just the first detection (stops early)
124
+ first('a\u200Bb\u00a0c')
125
+ // { char: '\u200B', codepoint: 'U+200B', ... }
126
+
127
+ // Lazy iterator for large strings
128
+ for (const d of scan(largeString)) {
129
+ if (d.category === 'format') break
130
+ }
131
+ ```
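A lazy scanner that supports stopping early can be written as a generator over regex matches. The following is a simplified sketch under the assumption that only format characters and non-ASCII space separators matter; `scanGhosts`, `firstGhost`, and `Hit` are hypothetical names, not the package's exports.

```typescript
// Hypothetical shape of one detection, trimmed down for this sketch.
interface Hit {
  char: string
  codepoint: string
  offset: number
}

function* scanGhosts(input: string): Generator<Hit> {
  // Format characters plus space separators other than U+0020.
  const ghost = /\p{Cf}|(?!\u0020)\p{Zs}/gu
  let match: RegExpExecArray | null
  while ((match = ghost.exec(input)) !== null) {
    const cp = match[0].codePointAt(0) ?? 0
    yield {
      char: match[0],
      codepoint: `U+${cp.toString(16).toUpperCase().padStart(4, '0')}`,
      offset: match.index,
    }
  }
}

// Returning from inside the loop stops the generator after one match.
function firstGhost(input: string): Hit | undefined {
  for (const hit of scanGhosts(input)) return hit
  return undefined
}

const firstHit = firstGhost('a\u200Bb\u00a0c')
// firstHit: { char: '\u200B', codepoint: 'U+200B', offset: 1 }
```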
132
+
133
+ All detection functions accept an optional `categories` array to filter:
134
+
135
+ ```typescript
136
+ detect('a\u200Bb\u00a0c', ['spaces'])
137
+ // Only returns the NBSP detection
138
+ ```
139
+
140
+ ## Highlighting
141
+
142
+ Make invisible characters visible for debugging:
143
+
144
+ ```typescript
145
+ import { highlight } from 'deghost'
146
+
147
+ highlight('hello\u200Bworld')
148
+ // 'hello[U+200B]world'
149
+
150
+ // Custom formatter
151
+ highlight('a\u200Bb', (d) => `{${d.name}}`)
152
+ // 'a{ZERO WIDTH SPACE}b'
153
+
154
+ // Filter by category
155
+ highlight('a\u00a0b\u200Bc', { categories: ['format'] })
156
+ // 'a\u00a0b[U+200B]c'
157
+ ```
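The default marker format can be reproduced with a replace callback. This is a minimal sketch assuming only format characters and non-ASCII spaces are marked; `mark` is a hypothetical function, not the package's `highlight`.

```typescript
// Hypothetical helper: swap each matched character for a visible [U+XXXX] marker.
function mark(input: string): string {
  return input.replace(/\p{Cf}|(?!\u0020)\p{Zs}/gu, (ch) => {
    const cp = ch.codePointAt(0) ?? 0
    return `[U+${cp.toString(16).toUpperCase().padStart(4, '0')}]`
  })
}

mark('hello\u200Bworld') // 'hello[U+200B]world'
mark('a\u00a0b')         // 'a[U+00A0]b'
```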
158
+
159
+ ## Summary
160
+
161
+ Get a human-readable report of all invisible characters:
162
+
163
+ ```typescript
164
+ import { summary } from 'deghost'
165
+
166
+ summary('hello\u200Bworld\u00a0here')
167
+ // 2 invisible characters found.
168
+ //
169
+ // By category:
170
+ // format: 1
171
+ // spaces: 1
172
+ //
173
+ // Details:
174
+ // U+200B ZERO WIDTH SPACE (format, offset 5)
175
+ // U+00A0 NO-BREAK SPACE (spaces, offset 11)
176
+ ```
177
+
178
+ ## Character lookup
179
+
180
+ Identify a single character or codepoint:
181
+
182
+ ```typescript
183
+ import { identify } from 'deghost'
184
+
185
+ identify('\u200B')
186
+ // { codepoint: 'U+200B', name: 'ZERO WIDTH SPACE', category: 'format' }
187
+
188
+ identify(0x00a0)
189
+ // { codepoint: 'U+00A0', name: 'NO-BREAK SPACE', category: 'spaces' }
190
+
191
+ identify('a') // undefined — not a ghost
192
+ ```
193
+
194
+ ## Presets
195
+
196
+ ```typescript
197
+ import { presets } from 'deghost'
198
+
199
+ // Default: strip format + control + BOM, normalize spaces
200
+ presets.clean('text\u00a0with\u200Bghosts')
201
+ // → 'text withghosts'
202
+
203
+ // Aggressive: strip everything invisible
204
+ presets.aggressive('text\u2061with\u200Bghosts')
205
+ // → 'textwithghosts'
206
+
207
+ // Spaces only: just normalize whitespace
208
+ presets.spaces('text\u00a0here')
209
+ // → 'text here'
210
+ ```
211
+
212
+ ## How it works
213
+
214
+ deghost uses [ES2018 Unicode property escapes](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Unicode_character_class_escape) (`\p{Cf}`, `\p{Cc}`, `\p{Zs}`) for broad category matching, plus curated codepoint sets for categories not covered by a single Unicode general category (tag characters, script-specific fillers, invisible math operators).
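As a rough illustration of property-escape matching, the sketch below classifies a single code point by Unicode general category. `generalCategory` is a hypothetical helper written for this example. Note that the invisible math operators (U+2061 to U+2064) are themselves general category `Cf`, so a bare `\p{Cf}` pattern matches them as well.

```typescript
// Hypothetical helper: report which of the three general categories a code point falls in.
function generalCategory(cp: number): 'Cf' | 'Cc' | 'Zs' | undefined {
  const ch = String.fromCodePoint(cp)
  if (/^\p{Cf}$/u.test(ch)) return 'Cf'
  if (/^\p{Cc}$/u.test(ch)) return 'Cc'
  if (/^\p{Zs}$/u.test(ch)) return 'Zs'
  return undefined
}

generalCategory(0x200b) // 'Cf' (ZERO WIDTH SPACE)
generalCategory(0x00a0) // 'Zs' (NO-BREAK SPACE)
generalCategory(0x0009) // 'Cc' (TAB)
generalCategory(0x2061) // 'Cf' (FUNCTION APPLICATION)
generalCategory(0x0061) // undefined ('a' is an ordinary letter)
```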
215
+
216
+ The key design choice: **strip vs. normalize**. A non-breaking space (U+00A0) _should_ become a regular space, not disappear — otherwise `"Plant\u00a064"` becomes `"Plant64"`. deghost handles this by default; `out-of-character` does not.
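The difference is easy to see with plain `String.prototype.replace`, independent of any library. The variable names here are illustrative only.

```typescript
const label = 'Plant\u00a064'
const nbsp = /\u00a0/g

const strippedLabel = label.replace(nbsp, '')    // 'Plant64': the word boundary is lost
const normalizedLabel = label.replace(nbsp, ' ') // 'Plant 64': the word boundary survives
```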
217
+
218
+ ## Comparison
219
+
220
+ | Feature | deghost | out-of-character |
221
+ | ------------------------------- | ------- | ---------------- |
222
+ | Strip invisible chars | yes | yes |
223
+ | Normalize spaces (NBSP → space) | yes | no (strips) |
224
+ | Chainable API | yes | no |
225
+ | Reusable cleaners | yes | no |
226
+ | Detection with metadata | yes | yes |
227
+ | Category-level control | yes | no |
228
+ | Highlighting / debugging | yes | no |
229
+ | Tagged template literal | yes | no |
230
+ | TypeScript-native | yes | no |
231
+ | Presets | yes | no |
232
+ | CLI | not yet | yes |
233
+ | Zero dependencies | yes | yes |
234
+
235
+ ## Requirements
236
+
237
+ Node.js >= 18. Uses ES2018 Unicode property escapes (supported in all modern runtimes).
238
+
239
+ ## License
240
+
241
+ MIT
package/dist/index.cjs ADDED
@@ -0,0 +1,433 @@
1
+ 'use strict';
2
+
3
+ // src/categories.ts
4
+ var patterns = {
5
+ /** Format characters: zero-width joiners, directional marks, soft hyphens, etc. */
6
+ format: /\p{Cf}/gu,
7
+ /** Control characters: C0 (0x00–0x1F), DEL (0x7F), and C1 (0x80–0x9F) controls. */
8
+ control: /\p{Cc}/gu,
9
+ /**
10
+ * Space separators: NBSP, en/em/thin/hair/ideographic space, etc.
11
+ * Excludes U+0020 (regular ASCII space) — that's not a ghost.
12
+ */
13
+ spaces: /(?!\u0020)\p{Zs}/gu,
14
+ /** Tag characters: deprecated Unicode tag block (U+E0001–U+E007F). */
15
+ tag: /[\u{E0001}-\u{E007F}]/gu,
16
+ /** Byte order mark / zero-width no-break space. */
17
+ bom: /\uFEFF/gu,
18
+ /**
19
+ * Script-specific filler characters:
20
+ * - U+115F, U+1160: Hangul Choseong/Jungseong fillers
21
+ * - U+3164: Hangul filler
22
+ * - U+FFA0: Halfwidth Hangul filler
23
+ * - U+17B4, U+17B5: Khmer vowel inherent
24
+ * - U+180E: Mongolian vowel separator
25
+ * - U+1680: Ogham space mark
26
+ */
27
+ fillers: /[\u115F\u1160\u3164\uFFA0\u17B4\u17B5\u180E\u1680]/gu,
28
+ /**
29
+ * Invisible math operators:
30
+ * - U+2061: Function application
31
+ * - U+2062: Invisible times
32
+ * - U+2063: Invisible separator
33
+ * - U+2064: Invisible plus
34
+ */
35
+ math: /[\u2061-\u2064]/gu
36
+ };
37
+ var categories = Object.freeze(Object.keys(patterns));
38
+ var descriptions = {
39
+ format: "Format characters (zero-width joiners, directional marks, soft hyphens)",
40
+ control: "Control characters (C0/C1 controls)",
41
+ spaces: "Space separators (NBSP, en/em space, thin space, ideographic space)",
42
+ tag: "Unicode tag characters",
43
+ bom: "Byte order mark",
44
+ fillers: "Script-specific filler characters (Hangul, Khmer, Mongolian, Ogham)",
45
+ math: "Invisible math operators"
46
+ };
47
+ var charNames = {
48
+ 173: "SOFT HYPHEN",
49
+ 847: "COMBINING GRAPHEME JOINER",
50
+ 1564: "ARABIC LETTER MARK",
51
+ 160: "NO-BREAK SPACE",
52
+ 5760: "OGHAM SPACE MARK",
53
+ 8192: "EN QUAD",
54
+ 8193: "EM QUAD",
55
+ 8194: "EN SPACE",
56
+ 8195: "EM SPACE",
57
+ 8196: "THREE-PER-EM SPACE",
58
+ 8197: "FOUR-PER-EM SPACE",
59
+ 8198: "SIX-PER-EM SPACE",
60
+ 8199: "FIGURE SPACE",
61
+ 8200: "PUNCTUATION SPACE",
62
+ 8201: "THIN SPACE",
63
+ 8202: "HAIR SPACE",
64
+ 8203: "ZERO WIDTH SPACE",
65
+ 8204: "ZERO WIDTH NON-JOINER",
66
+ 8205: "ZERO WIDTH JOINER",
67
+ 8206: "LEFT-TO-RIGHT MARK",
68
+ 8207: "RIGHT-TO-LEFT MARK",
69
+ 8232: "LINE SEPARATOR",
70
+ 8233: "PARAGRAPH SEPARATOR",
71
+ 8234: "LEFT-TO-RIGHT EMBEDDING",
72
+ 8235: "RIGHT-TO-LEFT EMBEDDING",
73
+ 8236: "POP DIRECTIONAL FORMATTING",
74
+ 8237: "LEFT-TO-RIGHT OVERRIDE",
75
+ 8238: "RIGHT-TO-LEFT OVERRIDE",
76
+ 8239: "NARROW NO-BREAK SPACE",
77
+ 8287: "MEDIUM MATHEMATICAL SPACE",
78
+ 8288: "WORD JOINER",
79
+ 8289: "FUNCTION APPLICATION",
80
+ 8290: "INVISIBLE TIMES",
81
+ 8291: "INVISIBLE SEPARATOR",
82
+ 8292: "INVISIBLE PLUS",
83
+ 8294: "LEFT-TO-RIGHT ISOLATE",
84
+ 8295: "RIGHT-TO-LEFT ISOLATE",
85
+ 8296: "FIRST STRONG ISOLATE",
86
+ 8297: "POP DIRECTIONAL ISOLATE",
87
+ 12288: "IDEOGRAPHIC SPACE",
88
+ 12644: "HANGUL FILLER",
89
+ 65279: "ZERO WIDTH NO-BREAK SPACE",
90
+ 65440: "HALFWIDTH HANGUL FILLER",
91
+ 4447: "HANGUL CHOSEONG FILLER",
92
+ 4448: "HANGUL JUNGSEONG FILLER",
93
+ 6068: "KHMER VOWEL INHERENT AQ",
94
+ 6069: "KHMER VOWEL INHERENT AA",
95
+ 6158: "MONGOLIAN VOWEL SEPARATOR"
96
+ };
97
+ var fillerSet = /* @__PURE__ */ new Set([4447, 4448, 12644, 65440, 6068, 6069, 6158, 5760]);
98
+ var isZs = /^\p{Zs}$/u;
99
+ var isCf = /^\p{Cf}$/u;
100
+ var isCc = /^\p{Cc}$/u;
101
+ function categorize(codepoint) {
102
+ if (codepoint === 65279) return "bom";
103
+ if (codepoint >= 917505 && codepoint <= 917631) return "tag";
104
+ if (codepoint >= 8289 && codepoint <= 8292) return "math";
105
+ if (fillerSet.has(codepoint)) return "fillers";
106
+ if (codepoint === 32) return void 0;
107
+ const char = String.fromCodePoint(codepoint);
108
+ if (isZs.test(char)) return "spaces";
109
+ if (isCf.test(char)) return "format";
110
+ if (isCc.test(char)) return "control";
111
+ return void 0;
112
+ }
113
+
114
+ // src/detect.ts
115
+ var formatHex = (cp) => `U+${cp.toString(16).toUpperCase().padStart(4, "0")}`;
116
+ var regexCache = /* @__PURE__ */ new Map();
117
+ function getRegex(categories2) {
118
+ const cats = categories2 ?? Object.keys(patterns);
119
+ const key = [...cats].sort().join(",");
120
+ let cached = regexCache.get(key);
121
+ if (!cached) {
122
+ cached = new RegExp(cats.map((c) => patterns[c].source).join("|"), "gu");
123
+ regexCache.set(key, cached);
124
+ }
125
+ return cached;
126
+ }
127
+ function* scan(input, categories2) {
128
+ const regex = getRegex(categories2);
129
+ regex.lastIndex = 0;
130
+ for (const match of input.matchAll(regex)) {
131
+ const char = match[0];
132
+ const cp = char.codePointAt(0);
133
+ const category = categorize(cp);
134
+ if (category === void 0) continue;
135
+ const hex = formatHex(cp);
136
+ yield {
137
+ char,
138
+ codepoint: hex,
139
+ name: charNames[cp] ?? hex,
140
+ category,
141
+ offset: match.index
142
+ };
143
+ }
144
+ }
145
+ function detect(input, categories2) {
146
+ return [...scan(input, categories2)];
147
+ }
148
+ function first(input, categories2) {
149
+ for (const d of scan(input, categories2)) return d;
150
+ return void 0;
151
+ }
152
+ function hasGhosts(input, categories2) {
153
+ const regex = getRegex(categories2);
154
+ regex.lastIndex = 0;
155
+ return regex.test(input);
156
+ }
157
+ function isClean(input, categories2) {
158
+ return !hasGhosts(input, categories2);
159
+ }
160
+ function count(input, categories2) {
161
+ const result = {};
162
+ for (const d of detect(input, categories2)) {
163
+ result[d.category] = (result[d.category] ?? 0) + 1;
164
+ }
165
+ return result;
166
+ }
167
+ function identify(input) {
168
+ const cp = typeof input === "number" ? input : input.codePointAt(0);
169
+ const category = categorize(cp);
170
+ if (category === void 0) return void 0;
171
+ const hex = formatHex(cp);
172
+ return {
173
+ codepoint: hex,
174
+ name: charNames[cp] ?? hex,
175
+ category
176
+ };
177
+ }
178
+
179
+ // src/replace.ts
180
+ var defaultFormatter = (d) => `[${d.codepoint}]`;
181
+ function replaceDetections(input, detections, mapper) {
182
+ if (detections.length === 0) return input;
183
+ const parts = [];
184
+ let cursor = 0;
185
+ for (const d of detections) {
186
+ parts.push(input.slice(cursor, d.offset));
187
+ parts.push(mapper(d));
188
+ cursor = d.offset + d.char.length;
189
+ }
190
+ parts.push(input.slice(cursor));
191
+ return parts.join("");
192
+ }
193
+
194
+ // src/summary.ts
195
+ function summary(input, categories2) {
196
+ const detections = detect(input, categories2);
197
+ if (detections.length === 0) return "No invisible characters found.";
198
+ const counts = {};
199
+ for (const d of detections) {
200
+ counts[d.category] = (counts[d.category] ?? 0) + 1;
201
+ }
202
+ const total = detections.length;
203
+ const header = `${total} invisible character${total === 1 ? "" : "s"} found.`;
204
+ const byCategory = Object.entries(counts).sort(([a], [b]) => a.localeCompare(b)).map(([cat, n]) => ` ${cat}: ${n}`).join("\n");
205
+ const details = detections.map((d) => ` ${d.codepoint} ${d.name} (${d.category}, offset ${d.offset})`).join("\n");
206
+ return `${header}
207
+
208
+ By category:
209
+ ${byCategory}
210
+
211
+ Details:
212
+ ${details}`;
213
+ }
214
+
215
+ // src/chain.ts
216
+ var DeghostChain = class _DeghostChain {
217
+ #value;
218
+ constructor(value) {
219
+ this.#value = value;
220
+ }
221
+ /** Remove all characters in the given category. */
222
+ strip(category) {
223
+ return new _DeghostChain(this.#value.replace(patterns[category], ""));
224
+ }
225
+ /** Replace all characters in the given category with a substitute. */
226
+ normalize(category, replacement = " ") {
227
+ return new _DeghostChain(this.#value.replace(patterns[category], replacement));
228
+ }
229
+ /** Replace matched ghosts in a category using a mapper function. */
230
+ replace(category, mapper) {
231
+ const result = replaceDetections(this.#value, detect(this.#value, [category]), mapper);
232
+ return result === this.#value ? this : new _DeghostChain(result);
233
+ }
234
+ /** Replace ghosts with visible markers like `[U+200B]`. */
235
+ highlight(category, formatter = defaultFormatter) {
236
+ const categories2 = category ? [category] : void 0;
237
+ const result = replaceDetections(
238
+ this.#value,
239
+ detect(this.#value, categories2),
240
+ formatter
241
+ );
242
+ return result === this.#value ? this : new _DeghostChain(result);
243
+ }
244
+ /** Return detections for the current chain value. */
245
+ detect(categories2) {
246
+ return detect(this.#value, categories2);
247
+ }
248
+ /** Check if the current chain value contains invisible characters. */
249
+ hasGhosts(categories2) {
250
+ return hasGhosts(this.#value, categories2);
251
+ }
252
+ /** Count invisible characters by category in the current chain value. */
253
+ count(categories2) {
254
+ return count(this.#value, categories2);
255
+ }
256
+ /** Returns true if the current chain value has no invisible characters. */
257
+ isClean(categories2) {
258
+ return isClean(this.#value, categories2);
259
+ }
260
+ /** Return a human-readable report of ghosts in the current chain value. */
261
+ summary(categories2) {
262
+ return summary(this.#value, categories2);
263
+ }
264
+ /** Collapse runs of two or more ASCII spaces into a single space. */
265
+ collapse() {
266
+ return new _DeghostChain(this.#value.replace(/ {2,}/g, " "));
267
+ }
268
+ /** Trim leading and trailing whitespace. */
269
+ trim() {
270
+ return new _DeghostChain(this.#value.trim());
271
+ }
272
+ /** Apply the default cleaning preset: strip format + control + BOM, normalize spaces, collapse, trim. */
273
+ clean() {
274
+ return this.strip("format").strip("control").strip("bom").normalize("spaces").collapse().trim();
275
+ }
276
+ /** Extract the cleaned string. */
277
+ toString() {
278
+ return this.#value;
279
+ }
280
+ /** Extract the cleaned string (alias for toString). */
281
+ valueOf() {
282
+ return this.#value;
283
+ }
284
+ /** Support JSON.stringify. */
285
+ toJSON() {
286
+ return this.#value;
287
+ }
288
+ };
289
+
290
+ // src/cleaner.ts
291
+ var CleanerBuilder = class {
292
+ #rules = [];
293
+ #trim = false;
294
+ #collapse = false;
295
+ /** Add a strip rule — remove all characters in this category. */
296
+ strip(category) {
297
+ this.#rules.push({ category, action: "strip" });
298
+ return this;
299
+ }
300
+ /** Add a normalize rule — replace characters in this category. */
301
+ normalize(category, replacement = " ") {
302
+ this.#rules.push({ category, action: "normalize", replacement });
303
+ return this;
304
+ }
305
+ /** Add a replace rule — transform characters using detection metadata. */
306
+ replace(category, mapper) {
307
+ this.#rules.push({ category, action: "replace", mapper });
308
+ return this;
309
+ }
310
+ /** Add a highlight step — annotate characters with visible markers. */
311
+ highlight(category, formatter = defaultFormatter) {
312
+ this.#rules.push({ category, action: "replace", mapper: formatter });
313
+ return this;
314
+ }
315
+ /** Enable whitespace trimming as a final step. */
316
+ trim() {
317
+ this.#trim = true;
318
+ return this;
319
+ }
320
+ /** Enable collapsing runs of ASCII spaces as a final step. */
321
+ collapse() {
322
+ this.#collapse = true;
323
+ return this;
324
+ }
325
+ /** Compile the pipeline into a reusable function. */
326
+ build() {
327
+ const steps = this.#rules.map((rule) => {
328
+ if (rule.action === "replace") {
329
+ const { mapper, category } = rule;
330
+ return (s) => replaceDetections(s, detect(s, [category]), mapper);
331
+ }
332
+ const pattern = patterns[rule.category];
333
+ if (rule.action === "strip") {
334
+ return (s) => s.replace(pattern, "");
335
+ }
336
+ const replacement = rule.replacement ?? " ";
337
+ return (s) => s.replace(pattern, replacement);
338
+ });
339
+ const doCollapse = this.#collapse;
340
+ const doTrim = this.#trim;
341
+ return (input) => {
342
+ let result = input;
343
+ for (const step of steps) {
344
+ result = step(result);
345
+ }
346
+ if (doCollapse) result = result.replace(/ {2,}/g, " ");
347
+ if (doTrim) result = result.trim();
348
+ return result;
349
+ };
350
+ }
351
+ };
352
+ function cleaner() {
353
+ return new CleanerBuilder();
354
+ }
355
+
356
+ // src/presets.ts
357
+ var presets = {
358
+ /**
359
+ * Default clean: strip format + control + BOM, normalize spaces, collapse, trim.
360
+ *
361
+ * The right choice for most text processing — catches invisible chars from
362
+ * binary formats (Garmin FIT, PDFs), APIs, and copy-paste while preserving
363
+ * word boundaries.
364
+ */
365
+ clean: cleaner().strip("format").strip("control").strip("bom").normalize("spaces").collapse().trim().build(),
366
+ /**
367
+ * Aggressive: strip everything invisible, including fillers, math operators, and tags.
368
+ *
369
+ * Use when you want maximally clean output and don't need to preserve any
370
+ * invisible Unicode semantics (ligature joiners, bidi marks, etc.).
371
+ */
372
+ aggressive: cleaner().strip("format").strip("control").strip("bom").strip("tag").strip("fillers").strip("math").normalize("spaces").collapse().trim().build(),
373
+ /**
374
+ * Spaces only: normalize Unicode whitespace to ASCII space.
375
+ *
376
+ * Leaves format/control characters alone. Useful when you only care about
377
+ * NBSP and exotic spaces (common in data from Garmin, Strava, etc.).
378
+ */
379
+ spaces: cleaner().normalize("spaces").collapse().trim().build()
380
+ };
381
+
382
+ // src/highlight.ts
383
+ function highlight(input, options) {
384
+ const formatter = typeof options === "function" ? options : options?.formatter ?? defaultFormatter;
385
+ const categories2 = typeof options === "object" ? options.categories : void 0;
386
+ return replaceDetections(input, detect(input, categories2), formatter);
387
+ }
388
+
389
+ // src/index.ts
390
+ function deghost(input, ...rest) {
391
+ let raw;
392
+ if (typeof input === "string") {
393
+ raw = input;
394
+ } else {
395
+ raw = input[0] ?? "";
396
+ for (let i = 0; i < rest.length; i++) {
397
+ raw += String(rest[i]) + (input[i + 1] ?? "");
398
+ }
399
+ }
400
+ const options = typeof input === "string" && rest.length <= 1 ? rest[0] : void 0;
401
+ const chain = new DeghostChain(raw);
402
+ const cleaned = presets.clean(raw);
403
+ const trimmed = options?.trim ?? true ? cleaned.trim() : cleaned;
404
+ return new Proxy(chain, {
405
+ get(target, prop, receiver) {
406
+ if (prop === Symbol.toPrimitive) return () => trimmed;
407
+ if (prop === "length") return trimmed.length;
408
+ return Reflect.get(target, prop, receiver);
409
+ }
410
+ });
411
+ }
412
+
413
+ exports.CleanerBuilder = CleanerBuilder;
414
+ exports.DeghostChain = DeghostChain;
415
+ exports.categories = categories;
416
+ exports.categorize = categorize;
417
+ exports.charNames = charNames;
418
+ exports.cleaner = cleaner;
419
+ exports.count = count;
420
+ exports.deghost = deghost;
421
+ exports.descriptions = descriptions;
422
+ exports.detect = detect;
423
+ exports.first = first;
424
+ exports.hasGhosts = hasGhosts;
425
+ exports.highlight = highlight;
426
+ exports.identify = identify;
427
+ exports.isClean = isClean;
428
+ exports.patterns = patterns;
429
+ exports.presets = presets;
430
+ exports.scan = scan;
431
+ exports.summary = summary;
432
+ //# sourceMappingURL=index.cjs.map
433
+ //# sourceMappingURL=index.cjs.map