terlik.js 2.2.0

package/LICENSE ADDED
MIT License

Copyright (c) 2026

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
package/README.md ADDED
# terlik.js

![terlik.js](git-header.png)

[![CI](https://github.com/badursun/terlik.js/actions/workflows/ci.yml/badge.svg)](https://github.com/badursun/terlik.js/actions/workflows/ci.yml)
[![npm version](https://img.shields.io/npm/v/terlik.js.svg)](https://www.npmjs.com/package/terlik.js)
[![npm bundle size](https://img.shields.io/bundlephobia/minzip/terlik.js)](https://bundlephobia.com/package/terlik.js)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Production-grade multi-language profanity detection and filtering. Not a naive blacklist — a multi-layered normalization and pattern engine that catches what simple string matching misses.

Built-in support for **Turkish**, **English**, **Spanish**, and **German**. Adding a new language is just a folder with two files.

Zero runtime dependencies. Full TypeScript. ESM + CJS. **35 KB** gzipped. Works in Node.js, Bun, Deno, browsers, Cloudflare Workers, and Edge runtimes — no Node.js-specific APIs used.
## Why terlik.js?

Turkish profanity evasion is creative. Users write `s2k`, `$1kt1r`, `s.i.k.t.i.r`, `SİKTİR`, `siiiiiktir`, `i8ne`, `or*spu`, `pu$ttt`, `6öt` — and expect to get away with it. Turkish is agglutinative — a single root like `sik` spawns dozens of forms: `siktiler`, `sikerim`, `siktirler`, `sikimsonik`. Manually listing every variant doesn't scale.

terlik.js catches all of these with a **suffix engine** that automatically recognizes Turkish grammatical suffixes on profane roots. Here's what a single call handles:

```ts
import { Terlik } from "terlik.js";
const terlik = new Terlik();

terlik.clean("s2mle yüzle$ g0t_v3r3n o r o s p u pezev3nk i8ne pu$ttt or*spu");
// "***** yüzle$ ********* *********** ******** **** ****** ******"
// 7 matches, 0 false positives, <2ms
```
## Install

```bash
npm install terlik.js
# or
pnpm add terlik.js
# or
yarn add terlik.js
```
## Quick Start

```ts
import { Terlik } from "terlik.js";

// Turkish (default)
const tr = new Terlik();
tr.containsProfanity("siktir git"); // true
tr.clean("siktir git burdan");      // "****** git burdan"

// English
const en = new Terlik({ language: "en" });
en.containsProfanity("what the fuck"); // true
en.containsProfanity("siktir git");    // false (Turkish not loaded)

// Spanish & German
const es = new Terlik({ language: "es" });
const de = new Terlik({ language: "de" });
es.containsProfanity("hijo de puta"); // true
de.containsProfanity("scheiße");      // true
```
## What It Catches

| Evasion technique | Example | Detected as |
|---|---|---|
| Plain text | `siktir` | sik |
| Turkish İ/I | `SİKTİR` | sik |
| Leet speak | `$1kt1r`, `@pt@l` | sik, aptal |
| Visual leet (TR) | `8ok`, `6öt`, `i8ne`, `s2k` | bok, göt, ibne, sik |
| Turkish number words | `s2mle` (s+iki+mle) | sik (sikimle) |
| Separators | `s.i.k.t.i.r`, `s_i_k` | sik |
| Spaces | `o r o s p u` | orospu |
| Char repetition | `siiiiiktir`, `pu$ttt` | sik, puşt |
| Mixed punctuation | `or*spu`, `g0t_v3r3n` | orospu, göt |
| Combined | `$1kt1r g0t_v3r3n` | both caught |
| **Suffix forms** | `siktiler`, `orospuluk`, `gotune` | sik, orospu, göt |
| **Suffix + evasion** | `s.i.k.t.i.r.l.e.r`, `$1kt1rler` | sik |
| **Suffix chaining** | `siktirler` (sik+tir+ler) | sik |
| **Deep agglutination** | `siktiğimin`, `sikermisiniz`, `siktirmişcesine` | sik |
| **Zero-width chars** | `s\u200Bi\u200Bk\u200Bt\u200Bi\u200Br` (ZWSP/ZWNJ/ZWJ) | sik |

### What It Doesn't Catch (on purpose)

A whitelist prevents false positives on legitimate words:

```ts
terlik.containsProfanity("Amsterdam"); // false
terlik.containsProfanity("sikke");     // false (Ottoman coin)
terlik.containsProfanity("ambulans");  // false
terlik.containsProfanity("siklet");    // false (boxing weight class)
terlik.containsProfanity("memur");     // false
terlik.containsProfanity("malzeme");   // false
terlik.containsProfanity("ama");       // false (conjunction)
terlik.containsProfanity("amir");      // false
terlik.containsProfanity("dolmen");    // false
```
## How It Works

Six-stage normalization pipeline (language-aware), then pattern matching:

```
input
→ lowercase (locale-aware: "tr", "en", "es", "de")
→ char folding (language-specific: İ→i, ñ→n, ß→ss, ä→a, ...)
→ number expansion (optional, e.g. Turkish: s2k → sikik)
→ leet speak decode (0→o, 1→i, @→a, $→s, ...)
→ punctuation removal (between letters: s.i.k → sik)
→ repeat collapse (siiiiik → sik)
→ pattern matching (dynamic regex with language-specific char classes)
→ whitelist filtering
→ result
```

Each language has its own char map, leet map, char classes, and optional number expansions. The engine is language-agnostic — only the data is language-specific.
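The pipeline stages can be sketched in a few lines of TypeScript. This is an illustrative toy, not the library's internals; the `charMap`/`leetMap` entries below are assumed demo values covering a fraction of the real data:

```typescript
// Toy version of the normalization stages (demo maps, not the shipped data)
const charMap: Record<string, string> = { "ı": "i", "ş": "s", "ğ": "g", "ö": "o", "ü": "u", "ç": "c" };
const leetMap: Record<string, string> = { "0": "o", "1": "i", "@": "a", "$": "s", "3": "e" };

function normalizeSketch(input: string): string {
  let s = input.toLocaleLowerCase("tr");                // locale-aware lowercase: İ → i
  s = [...s].map((ch) => charMap[ch] ?? ch).join("");   // char folding
  s = [...s].map((ch) => leetMap[ch] ?? ch).join("");   // leet decode
  s = s.replace(/(?<=[a-z])[^a-z0-9]+(?=[a-z])/g, "");  // drop separators (incl. zero-width chars) between letters
  s = s.replace(/([a-z])\1{2,}/g, "$1");                // collapse 3+ repeated chars
  return s;
}

normalizeSketch("S.İ.K.T.İ.R"); // "siktir"
normalizeSketch("$1kt1r");      // "siktir"
normalizeSketch("siiiiiktir");  // "siktir"
```

The real engine additionally applies number expansion and language-specific char classes before pattern matching.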
For suffixable roots, the engine appends an optional suffix group (up to 2 chained suffixes). Turkish has 83 suffixes (including question particles and adverbial forms), English has 8, Spanish has 13, German has 8.
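Conceptually, attaching that suffix group looks like the sketch below. The suffix subset and regex shape are assumptions for illustration, not the generated pattern:

```typescript
// Demo: a suffixable root with an optional group of up to 2 chained suffixes
const suffixes = ["ler", "lar", "lik", "im", "sin"]; // tiny assumed subset
const suffixGroup = `(?:${suffixes.join("|")}){0,2}`;
const pattern = new RegExp(`\\bsalak${suffixGroup}\\b`, "iu");

pattern.test("salak");      // true  (bare root)
pattern.test("salaklar");   // true  (one suffix)
pattern.test("salaklarim"); // true  (two chained suffixes)
pattern.test("salata");     // false (unrelated word)
```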
### Language Packs

Each language lives in its own folder under `src/lang/`:

```
src/lang/
  tr/
    config.ts        ← charMap, leetMap, charClasses, locale
    dictionary.json  ← entries, suffixes, whitelist
  en/
    config.ts
    dictionary.json
  ...
```

Dictionary format (community-friendly JSON, no TypeScript needed):

```json
{
  "version": 1,
  "suffixes": ["ing", "ed", "er", "s"],
  "entries": [
    { "root": "fuck", "variants": ["fucking", "fucker"], "severity": "high", "category": "sexual", "suffixable": true }
  ],
  "whitelist": ["assassin", "class", "grass"]
}
```

Categories: `sexual`, `insult`, `slur`, `general`. Severity: `high`, `medium`, `low`.
### Adding a New Language

1. Create a `src/lang/xx/` folder
2. Add `dictionary.json` (entries, suffixes, whitelist)
3. Add `config.ts` (locale, charMap, leetMap, charClasses)
4. Register it in `src/lang/index.ts` (one import line)
5. Write tests, build, done
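As a sketch of step 3, a hypothetical `config.ts` for a made-up language code `xx` could look like this. The field names come from the list above; the values are placeholders and the real type shape may differ:

```typescript
// Hypothetical src/lang/xx/config.ts (placeholder data, assumed shape)
const xxConfig = {
  locale: "xx",                              // passed to locale-aware lowercasing
  charMap: { "á": "a", "é": "e" },           // fold language-specific letters
  leetMap: { "0": "o", "1": "i", "@": "a" }, // decode leet substitutions
  charClasses: { a: "aá@", e: "eé3" },       // per-letter classes for pattern matching
};

export default xxConfig;
```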
## Dictionary Strategy

terlik.js ships with a **deliberately narrow dictionary** — the goal is to **minimize false positives** while catching real-world evasion patterns. The dictionary is not a massive word list; it's a curated set of roots + variants that the pattern engine expands through normalization, leet decoding, separator tolerance, and suffix chaining.

### Coverage

| Language | Roots | Explicit Variants | Suffixes | Whitelist | Effective Forms |
|---|---|---|---|---|---|
| Turkish | 25 | 88 | 83 | 52 | ~3,000+ |
| English | 23 | 106 | 8 | 42 | ~700+ |
| Spanish | 19 | 73 | 13 | 15 | ~500+ |
| German | 18 | 48 | 8 | 3 | ~300+ |

"Effective forms" = roots × normalization variants × suffix combinations × evasion patterns. A suffixable root like `orospu` with 83 possible suffixes, leet decoding, separator tolerance, and repeat collapse produces thousands of detectable surface forms.
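A quick back-of-envelope check of that multiplication, using just the suffix dimension (my arithmetic, assuming at most 2 chained suffixes from the 83-suffix Turkish set):

```typescript
// Suffix combinations alone for one suffixable Turkish root
const n = 83;                // Turkish suffix count
const forms = 1 + n + n * n; // bare root + one suffix + two chained suffixes
console.log(forms);          // 6973, i.e. thousands of forms before leet/separator variants multiply further
```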
### What IS Covered

- **Core profanity roots** per language (high-severity sexual terms, insults, slurs)
- **Grammatical inflections** via the suffix engine (Turkish agglutination, English -ing/-ed, etc.)
- **Evasion patterns**: leet speak, separators, repetition, mixed case, number words (TR)
- **Compound forms**: `orospucocugu`, `motherfucker`, `hijoputa`, `hurensohn`

### What is NOT Covered (by design)

- **Slang / regional variants** that change rapidly — better handled with `customList`
- **Context-dependent words** that are profane only in certain contexts
- **Phonetic substitutions** beyond leet (e.g., "phuck") — add via `customList`
- **New coinages** — use `addWords()` at runtime

### Why Narrow?

A large dictionary maximizes recall but tanks precision. In production chat systems, **false positives are worse than false negatives** — blocking "class" or "grass" because the dictionary is too broad erodes user trust. terlik.js defaults to high precision and lets you widen coverage per your needs:

> **The `sık`/`sik` paradox:** Turkish `sık` (frequent/tight) normalizes to `sik` because `ı→i` char folding is required to catch evasions like `s1kt1r`. Making `sik` suffix-aware would flag `sıkıntı` (trouble), `sıkma` (squeeze), `sıkı` (tight) — extremely common words. Instead, deep agglutination forms like `siktiğimin` and `sikermisiniz` are added as explicit variants. This is a deliberate precision-over-recall tradeoff.
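The collision is easy to reproduce with a toy fold (assuming only the `ı→i` and `1→i` mappings discussed above):

```typescript
// Why sık/sik collide after char folding (demo only)
const fold = (s: string) =>
  s.toLocaleLowerCase("tr").replace(/ı/g, "i").replace(/1/g, "i");

fold("sıkıntı").includes("sik"); // true: the innocent word now contains the root
fold("s1kt1r");                  // "siktir": the evasion the fold exists to catch
```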
```ts
// Add domain-specific words at runtime
terlik.addWords(["customSlang", "anotherWord"]);

// Or at construction time
const custom = new Terlik({
  customList: ["customSlang", "anotherWord"],
  whitelist: ["legitimateWord"],
});

// Remove a built-in word if it causes false positives in your domain
terlik.removeWords(["damn"]);
```
## Performance

### Lazy Compilation

terlik.js uses **lazy compilation** — `new Terlik()` is near-instant (~1.5ms). Regex patterns are compiled on the first `detect()` call, not at construction time. This eliminates startup cost when creating multiple instances.

| Phase | Cost | When |
|---|---|---|
| `new Terlik()` | **~1.5ms** | Construction (lookup tables only) |
| First `detect()` | ~200-700ms | Lazy regex compilation + V8 JIT warmup |
| Subsequent calls | **<1ms** | Patterns cached, JIT optimized |

**Where do you want to pay the compilation cost?**

```ts
// Option A: Background warmup (recommended for servers)
// Construction is instant. Patterns compile in the next event loop tick.
// If a request arrives before warmup finishes, it compiles synchronously.
const terlik = new Terlik({ backgroundWarmup: true });

app.post("/chat", (req, res) => {
  const cleaned = terlik.clean(req.body.message); // <1ms (warmup already done)
});
```

```ts
// Option B: Explicit warmup at startup
const terlik = new Terlik();
terlik.containsProfanity("warmup"); // Forces compilation here

app.post("/chat", (req, res) => {
  const cleaned = terlik.clean(req.body.message); // <1ms
});
```

```ts
// Option C: Lazy (pay on first request)
const terlik = new Terlik(); // ~1.5ms

app.post("/chat", (req, res) => {
  const cleaned = terlik.clean(req.body.message); // First call: ~500ms, then <1ms
});
```

```ts
// Option D: Multi-language warmup
const cache = Terlik.warmup(["tr", "en", "es", "de"]);

app.post("/chat", (req, res) => {
  const lang = req.body.language;
  const cleaned = cache.get(lang)!.clean(req.body.message); // <1ms
});
```

> **Important:** Never create `new Terlik()` per request. A single cached instance handles requests in microseconds.

> **Serverless (Lambda, Vercel, Cloudflare Workers):** Do NOT use `backgroundWarmup`. The `setTimeout` callback may never fire because serverless runtimes freeze the process between invocations. Use explicit warmup instead: `const t = new Terlik(); t.containsProfanity("warmup");` at module scope.
### Throughput

Benchmark results (Apple Silicon, single core, msgs/sec):

| Scenario | msgs/sec |
|---|---|
| Clean messages (no matches) | ~193,000 |
| Mixed messages (balanced mode) | ~151,000 |
| Suffixed dirty messages | ~142,000 |
| Strict mode | ~390,000 |
| Loose mode (with fuzzy) | ~8,400 |

> **Note:** Loose/fuzzy mode is ~18x slower than balanced mode due to O(n*m) similarity computation. Use it only when typo tolerance is critical, not as a default.
### Accuracy

Measured on a labeled corpus of 388 samples across 4 languages (profane + clean + whitelist + edge cases):

| Language | Mode | Precision | Recall | F1 | FPR | FNR |
|---|---|---|---|---|---|---|
| TR | strict | 100.0% | 88.6% | 93.9% | 0.0% | 11.4% |
| TR | **balanced** | **100.0%** | **100.0%** | **100.0%** | **0.0%** | **0.0%** |
| TR | loose | 99.1% | 100.0% | 99.5% | 1.6% | 0.0% |
| EN | strict | 100.0% | 95.5% | 97.7% | 0.0% | 4.5% |
| EN | **balanced** | **100.0%** | **98.5%** | **99.2%** | **0.0%** | **1.5%** |
| EN | loose | 98.5% | 98.5% | 98.5% | 2.0% | 1.5% |
| ES | strict | 100.0% | 96.7% | 98.3% | 0.0% | 3.3% |
| ES | **balanced** | **100.0%** | **96.7%** | **98.3%** | **0.0%** | **3.3%** |
| ES | loose | 100.0% | 96.7% | 98.3% | 0.0% | 3.3% |
| DE | strict | 100.0% | 100.0% | 100.0% | 0.0% | 0.0% |
| DE | **balanced** | **100.0%** | **100.0%** | **100.0%** | **0.0%** | **0.0%** |
| DE | loose | 100.0% | 100.0% | 100.0% | 0.0% | 0.0% |

**Mode characteristics:**
- **Strict** — highest precision (0% FP), trades recall for safety. Misses some suffixed forms and evasion patterns.
- **Balanced** — best overall F1. Catches evasion patterns while keeping FPR near zero. **Recommended for production.**
- **Loose** — adds fuzzy matching. Slightly higher FPR due to similarity matches on borderline words.

Reproduce: `pnpm bench:accuracy` — outputs per-category breakdown, failure list, and JSON results.
## Options

```ts
const terlik = new Terlik({
  language: "tr",                // "tr" | "en" | "es" | "de" (default: "tr")
  mode: "balanced",              // "strict" | "balanced" | "loose"
  maskStyle: "stars",            // "stars" | "partial" | "replace"
  replaceMask: "[***]",          // mask text for the "replace" style
  customList: ["customword"],    // additional words to detect
  whitelist: ["safeword"],       // additional words to whitelist
  enableFuzzy: false,            // enable fuzzy matching
  fuzzyThreshold: 0.8,           // similarity threshold (0-1). 0.8 ≈ 1 typo per 5 chars
  fuzzyAlgorithm: "levenshtein", // "levenshtein" | "dice"
  maxLength: 10000,              // truncate input beyond this
  backgroundWarmup: false,       // compile patterns in background via setTimeout
});
```
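To make `fuzzyThreshold` concrete: with the standard Levenshtein-based similarity `1 - distance / max(length)` (a common formulation; whether terlik.js computes it exactly this way is an assumption), a threshold of 0.8 tolerates roughly one edit per five characters:

```typescript
// Standard Levenshtein distance (dynamic programming, O(n*m))
function levenshtein(a: string, b: string): number {
  const dp: number[][] = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                  // deletion
        dp[i][j - 1] + 1,                                  // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
  return dp[a.length][b.length];
}

const similarity = (a: string, b: string) =>
  1 - levenshtein(a, b) / Math.max(a.length, b.length);

similarity("siktir", "sikter"); // ≈ 0.833, so one typo in six chars clears a 0.8 threshold
```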
## Detection Modes

| Mode | What it does | Best for |
|---|---|---|
| `strict` | Normalize + exact match only | Minimum false positives |
| `balanced` | Normalize + pattern matching with separator/leet tolerance | **General use (default)** |
| `loose` | Pattern + fuzzy matching (Levenshtein or Dice) | Maximum coverage, typo tolerance |
## API

### `terlik.containsProfanity(text, options?): boolean`

Quick boolean check. Runs full detection internally and returns `true` if any match exists.

### `terlik.getMatches(text, options?): MatchResult[]`

Returns all matches with details:

```ts
interface MatchResult {
  word: string;   // matched text from original input
  root: string;   // dictionary root word
  index: number;  // position in original text
  severity: "high" | "medium" | "low";
  method: "exact" | "pattern" | "fuzzy";
}
```

### `terlik.clean(text, options?): string`

Returns text with profanity masked. Three styles:

```ts
terlik.clean("siktir git");                           // "****** git"
terlik.clean("siktir git", { maskStyle: "partial" }); // "s****r git"
terlik.clean("siktir git", { maskStyle: "replace" }); // "[***] git"
```

### `terlik.addWords(words) / removeWords(words)`

Runtime dictionary modification. Recompiles patterns automatically.

```ts
terlik.addWords(["customword"]);
terlik.containsProfanity("customword"); // true

terlik.removeWords(["salak"]);
terlik.containsProfanity("salak"); // false
```

### `Terlik.warmup(languages, options?): Map<string, Terlik>`

Static method. Creates and JIT-warms instances for multiple languages at once.

```ts
const cache = Terlik.warmup(["tr", "en", "es", "de"]);
cache.get("en")!.containsProfanity("fuck"); // true — no cold start
```

### `terlik.language: string`

Read-only property. Returns the language code of the instance.

### `getSupportedLanguages(): string[]`

Returns all available language codes.

```ts
import { getSupportedLanguages } from "terlik.js";
getSupportedLanguages(); // ["tr", "en", "es", "de"]
```

### `normalize(text): string`

Standalone export. Uses the Turkish locale by default.

```ts
import { normalize, createNormalizer } from "terlik.js";

normalize("S.İ.K.T.İ.R"); // "siktir" (Turkish default)

// Custom normalizer for any language
const deNormalize = createNormalizer({
  locale: "de",
  charMap: { ä: "a", ö: "o", ü: "u", ß: "ss" },
  leetMap: { "0": "o", "3": "e" },
});
deNormalize("Scheiße"); // "scheisse"
```
## Testing

631 tests cover all 4 languages, 25 Turkish root words, suffix detection, lazy compilation, multi-language isolation, normalization, fuzzy matching, cleaning, integration, ReDoS hardening, attack surface coverage, and edge cases:

```bash
pnpm test        # run once
pnpm test:watch  # watch mode
```

### Live Test Server

An interactive browser-based test environment is included. Chat interface on the left, real-time process log on the right — see exactly what terlik.js does at each step (normalization, pattern matching, match details, timing).

```bash
pnpm dev:live  # http://localhost:2026
```

See [`live_test_server/README.md`](./live_test_server/README.md) for details.

### Integration Guide

See the [**Integration Guide**](./docs/integration-guide.md) for Express, Fastify, Next.js, Nuxt, Socket.io, and multi-language server examples.
## Development

```bash
pnpm install        # install dependencies
pnpm test           # run tests
pnpm test:coverage  # run tests with coverage report
pnpm typecheck      # TypeScript type checking
pnpm build          # build ESM + CJS output
pnpm bench          # run performance benchmarks
pnpm dev:live       # start interactive test server
```

Pre-commit hooks (via Husky) automatically run type checking on staged `.ts` files.

See [CONTRIBUTING.md](./CONTRIBUTING.md) for contribution guidelines.
## Changelog

### 2026-02-28 (v2.2) — Lazy Compilation + Linguistic Patch

**Zero-cost construction. Background warmup. Turkish agglutination hardening.**

- **Lazy compilation** — Pattern compilation deferred from the constructor to the first `detect()` call. `new Terlik()` drops from ~225ms to **~1.5ms**. Strict-mode users never pay regex cost (hash lookup only).
- **`backgroundWarmup` option** — `new Terlik({ backgroundWarmup: true })` schedules compilation + JIT warmup via `setTimeout(fn, 0)`. Idempotent: if `detect()` is called before the timer fires, it compiles synchronously and the timer becomes a no-op.
- **`detector.compile()` public method** — Allows manual precompilation for advanced use cases.
- **Turkish suffix expansion** — Added question particles (`misin`, `misiniz`, `musun`, `musunuz`, `miyim`, `miyiz`) and adverbial forms (`cesine`, `casina`) to the suffix engine (now 83 total). All suffixable entries (orospu, piç, yarrak, ibne, etc.) now catch question and adverbial inflections.
- **Deep agglutination variants** — Added explicit variants for `siktiğimin`, `sikermisiniz`, `sikermisin`, `siktirmişcesine`. These forms require 3+ suffix chains or non-standard morpheme boundaries (ğ→g bridge) that the suffix engine can't generalize without false positives.
- **`MAX_PATTERN_LENGTH` 6000 → 10000** — Accommodates the larger suffix group without falling back to non-suffix mode.
- **Test count** — 619 → 631. New `tests/lazy-compilation.test.ts` covers construction timing, transparent lazy compile, strict-mode optimization, backgroundWarmup with fake timers, and idempotent early-detect.

| Change | File |
|---|---|
| `backgroundWarmup` option | `src/types.ts` |
| Lazy `_patterns`, `ensureCompiled()`, `compile()` | `src/detector.ts` |
| backgroundWarmup setTimeout scheduling | `src/terlik.ts` |
| Suffix + variant expansion, MAX_PATTERN_LENGTH | `src/patterns.ts`, `src/lang/tr/dictionary.json` |
| Lazy compilation tests (new) | `tests/lazy-compilation.test.ts` |

### 2026-02-28 (v2.1) — ReDoS Security Hardening

**Added Regex Denial-of-Service protection.**

Identified vulnerability: overlap between `charClasses` and `separator` (`@`, `$`, `!`, `|`, `+`, `#`, `€`, `¢`, `©` could be matched by both the char class and the separator) enabled polynomial O(n^2) backtracking on adversarial input.

- **Bounded separator** — `[^\p{L}\p{N}]*` (unbounded) replaced with `[^\p{L}\p{N}]{0,3}` (max 3 chars). Real-world evasions (`s.i.k.t.i.r`, `s_i_k`) use 1 separator char. This reduces backtracking from O(n^2) to O(1) per boundary.
- **Regex timeout safety net** — Added a 250ms timeout (`REGEX_TIMEOUT_MS`) to the `runPatterns()` and `detectFuzzy()` loops. It never triggers on normal input (<1ms), but provides a hard cap on adversarial input.
- **charClasses cleanup** — Removed separator-overlapping symbols from all 4 language configs (TR, EN, ES, DE). These symbols are already defined in `leetMap` and converted during the normalizer pass — removing them from pattern matching causes no false negatives.
- **ReDoS test suite** — `tests/redos.test.ts`: 71 tests covering adversarial timing and the attack surface (separator abuse, leet bypass, char repetition, Unicode tricks, whitelist integrity, boundary attacks, multi-match, input edge cases, suffix hardening).
- **MAX_PATTERN_LENGTH** — 5000 → 6000 (later raised to 10000 in v2.2). The `{0,3}` separator adds ~3 chars per boundary; the limit was raised so large suffix patterns (e.g. `orospu`) don't fall back to non-suffix mode.
- **Test count** — 548 → 619.
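The effect of the bound is easy to demonstrate with simplified stand-ins for the generated patterns (these two regexes are illustrations, not the actual compiled output):

```typescript
// Unbounded vs bounded separator between pattern letters (simplified)
const unbounded = /s[^\p{L}\p{N}]*i[^\p{L}\p{N}]*k/u;
const bounded = /s[^\p{L}\p{N}]{0,3}i[^\p{L}\p{N}]{0,3}k/u;

bounded.test("s.i.k");                        // true:  real evasions use a single separator char
bounded.test("s" + ".".repeat(50) + "i.k");   // false: adversarial padding exceeds the bound
unbounded.test("s" + ".".repeat(50) + "i.k"); // true:  the old pattern still matched it
```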
| Change | File |
|---|---|
| Separator `*` → `{0,3}`, timeout constant | `src/patterns.ts` |
| Timeout loop guard | `src/detector.ts` |
| charClasses cleanup | `src/lang/{tr,en,es,de}/config.ts` |
| ReDoS + attack surface test suite (new) | `tests/redos.test.ts` |

### 2026-02-28 (v2)

**Multi-Language Support**

- **4 built-in languages** — Turkish (tr), English (en), Spanish (es), German (de). Each language is a self-contained folder (`src/lang/xx/`) with `config.ts` and `dictionary.json`.
- **Folder-based language packs** — Adding a new language requires creating one folder with two files and one import line in the registry.
- **`Terlik.warmup()`** — Static method to create and JIT-warm multiple language instances at once for server deployments.
- **`language` option** — `new Terlik({ language: "en" })`. Default remains `"tr"` (backward compatible).
- **Language-agnostic engine** — Normalizer, pattern compiler, detector, and cleaner are now fully parametric. Language-specific data (charMap, leetMap, charClasses, numberExpansions) comes from config files.
- **New exports** — `createNormalizer`, `getLanguageConfig`, `getSupportedLanguages`, `LanguageConfig` type.
- **Test coverage** — 346 → 418 tests. Added language-specific tests, cross-language isolation tests, and registry tests.

### 2026-02-28

**Suffix Engine + JSON Dictionary Migration**

- **JSON dictionary** — Migrated the dictionary from `tr.ts` to a community-friendly `tr.json` format. Added runtime schema validation (`validateDictionary`). Each entry now includes `category` and `suffixable` fields.
- **Suffix engine** — Defined Turkish grammatical suffixes (later expanded to 83 in v2.2). Suffixable roots (`orospu`, `salak`, `aptal`, `kahpe`, etc.) automatically catch inflected forms like `orospuluk`, `salaksin`, `aptallarin`, `kahpeler`. Short roots (3-char: `sik`, `bok`, `göt`, `döl`) use explicit variants instead to prevent false positives.
- **Critical bug fix: `\W` separator** — JavaScript's `\W` treats Turkish characters (`ı`, `ş`, `ğ`, `ö`, `ü`, `ç`) as non-word characters. The pattern engine separator `[\W_]*` was changed to `[^\p{L}\p{N}]*` (Unicode-aware). This fixed false positives on innocent words like `sıkma`, `sıkıntı`, `sıkıştı`.
- **Live test server warmup fix** — Fixed a cache key mismatch and added JIT warmup. First request latency dropped from 3318ms to 37ms.
- **Test coverage** — 101 → 346 tests. All 25 root words are comprehensively tested.
- **Expanded whitelist** — Added `ama`, `ami`, `amen`, `amir`, `amil`, `dolmen`.
## License

MIT