terlik.js 2.4.0 → 2.4.1

Files changed (2)
  1. package/README.md +24 -99
  2. package/package.json +13 -5
package/README.md CHANGED
@@ -4,14 +4,29 @@
4
4
 
5
5
  [![CI](https://github.com/badursun/terlik.js/actions/workflows/ci.yml/badge.svg)](https://github.com/badursun/terlik.js/actions/workflows/ci.yml)
6
6
  [![npm version](https://img.shields.io/npm/v/terlik.js.svg)](https://www.npmjs.com/package/terlik.js)
7
+ [![npm downloads](https://img.shields.io/npm/dm/terlik.js.svg)](https://www.npmjs.com/package/terlik.js)
7
8
  [![npm bundle size](https://img.shields.io/bundlephobia/minzip/terlik.js)](https://bundlephobia.com/package/terlik.js)
9
+ [![TypeScript](https://img.shields.io/badge/TypeScript-Ready-blue.svg)](https://www.typescriptlang.org/)
10
+ [![zero dependencies](https://img.shields.io/badge/dependencies-0-brightgreen.svg)]()
8
11
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
9
12
 
10
- Turkish-first multi-language profanity detection and filtering. Not a naive blacklist — a multi-layered normalization and pattern engine that catches what simple string matching misses.
13
+ Multi-language profanity detection and filtering engine, designed Turkish-first and **extensible to any language**. Not a naive blacklist — a multi-layered normalization and pattern engine that catches what simple string matching misses.
11
14
 
12
- **Turkish** is the flagship language with full coverage. **English**, **Spanish**, and **German** are community-maintained and open for contributions. Adding a new language is just a folder with two files.
15
+ Ships with **Turkish** (flagship, full coverage), **English**, **Spanish**, and **German** built-in. Add any language with a folder and two files, or extend at runtime via `extendDictionary`.
13
16
 
14
- Zero runtime dependencies. Full TypeScript. ESM + CJS. **35 KB** gzipped. Works in Node.js, Bun, Deno, browsers, Cloudflare Workers, and Edge runtimes; no Node.js-specific APIs used.
17
+ > **Turkish summary:** A Turkish-first profanity detection and filtering engine that can be extended to any language. It catches creative profanity attempts through support for leet speak, character repetition, separator characters, and the Turkish suffix system. Zero dependencies, TypeScript, 35 KB.
18
+
19
+ ## Features
20
+
21
+ - **Extensible to any language** — ships with TR/EN/ES/DE, add more via language packs or `extendDictionary`
22
+ - Catches leet speak, separators, char repetition, mixed case, zero-width chars
23
+ - Turkish suffix engine (83 suffixes, ~3,000+ detectable forms from 25 roots)
24
+ - Three detection modes: strict, balanced, loose (with fuzzy matching)
25
+ - Zero dependencies, **35 KB** gzipped
26
+ - ESM + CJS — works in Node.js, Bun, Deno, browsers, Cloudflare Workers, Edge runtimes
27
+ - Lazy compilation: ~1.5ms construction, <1ms per check after warmup
28
+ - ReDoS-safe regex patterns with timeout safety net
29
+ - Full TypeScript support with exported types
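The evasion techniques named in the feature list (leet speak, separators, character repetition, mixed case, zero-width characters) can be illustrated with a small normalization pass. This is a minimal sketch under stated assumptions, not the terlik.js normalizer: `LEET_MAP`, `normalizeSketch`, the step order, and the thresholds are all hypothetical, and the sample inputs are deliberately harmless words. The explicit Latin range mirrors the one quoted in the v2.3.0 changelog entry.

```ts
// Hypothetical leet table for illustration only; real per-language maps differ.
const LEET_MAP: Record<string, string> = {
  "0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s",
};

// Hypothetical helper (not part of terlik.js): folds common evasions into a plain form.
function normalizeSketch(input: string): string {
  return input
    // Mixed case: fold to lower case (locale-specific rules such as Turkish İ are ignored here).
    .toLowerCase()
    // Zero-width characters: drop ZWSP, ZWNJ, ZWJ and the BOM.
    .replace(/[\u200B-\u200D\uFEFF]/g, "")
    // Leet speak: map digits and symbols back to letters.
    .replace(/[013457@$]/g, (ch) => LEET_MAP[ch] ?? ch)
    // Separators: remove 1 to 3 non-letter, non-space characters between two letters.
    .replace(/([a-z0-9À-ɏ])[^a-z0-9À-ɏ\s]{1,3}(?=[a-z0-9À-ɏ])/g, "$1")
    // Repetition: collapse runs of three or more identical characters to one.
    .replace(/(.)\1{2,}/g, "$1");
}

console.log(normalizeSketch("H.e.l.l.o"));
console.log(normalizeSketch("b4\u200Bn4n4"));
console.log(normalizeSketch("Y.e.e.e.e.s"));
```

A pass like this decodes every digit it knows, so ordinary numbers are altered as well; a real engine would need whitelisting or number expansion on top of it.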
15
30
 
16
31
  ## Why terlik.js?
17
32
 
@@ -113,7 +128,7 @@ input
113
128
  → result
114
129
  ```
115
130
 
116
- Each language has its own char map, leet map, char classes, and optional number expansions. The engine is language-agnostic — only the data is language-specific.
131
+ Each language has its own char map, leet map, char classes, and optional number expansions. The engine is language-agnostic — only the data is language-specific. This means **any language can be added** without modifying the core engine.
117
132
 
118
133
  For suffixable roots, the engine appends an optional suffix group (up to 2 chained suffixes). Turkish has 83 suffixes (including question particles and adverbial forms), English has 8, Spanish has 13, German has 8.
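The optional suffix group described above can be sketched as a regex alternation that may repeat at most twice. The helper names (`escapeRegExp`, `buildRootPattern`), the boundary handling, and the sample root and suffixes are assumptions chosen for illustration; the real pattern compiler in terlik.js may be structured differently.

```ts
// Escape regex metacharacters so roots and suffixes are matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: root followed by up to `maxChain` suffixes from the list.
function buildRootPattern(root: string, suffixes: string[], maxChain: number = 2): RegExp {
  // Longest suffix first, so the alternation tries the most specific form first.
  const alternation = suffixes
    .slice()
    .sort((a, b) => b.length - a.length)
    .map(escapeRegExp)
    .join("|");
  const suffixGroup = alternation ? `(?:${alternation}){0,${maxChain}}` : "";
  // Word boundaries use an explicit Latin range instead of \b or \W.
  const letter = "[a-z0-9À-ɏ]";
  return new RegExp(`(^|[^a-z0-9À-ɏ])${escapeRegExp(root)}${suffixGroup}(?!${letter})`, "i");
}

// Harmless placeholder root and suffixes.
const walk = buildRootPattern("walk", ["ed", "ing", "er", "s"]);
console.log(walk.test("walkers"));    // root + 2 suffixes
console.log(walk.test("walkerings")); // would need 3 suffixes
console.log(walk.test("sidewalk"));   // root is not at a word start
```

Capping the chain keeps the compiled pattern small, at the cost of missing forms that need three or more suffixes; those would have to be listed as explicit variants.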
119
134
 
@@ -172,6 +187,8 @@ terlik.js ships with a **deliberately narrow dictionary** — the goal is to **m
172
187
 
173
188
  "Effective forms" = roots × normalization variants × suffix combinations × evasion patterns. A root like `sik` with 83 possible suffixes, leet decoding, separator tolerance, and repeat collapse produces thousands of detectable surface forms.
174
189
 
190
+ > **Add your language!** The engine is language-agnostic. See [Adding a New Language](#adding-a-new-language) or use [`extendDictionary`](#extenddictionary-option) for runtime extension.
191
+
175
192
  ### What IS Covered
176
193
 
177
194
  - **Core profanity roots** per language (high-severity sexual, insults, slurs)
@@ -308,7 +325,7 @@ Reproduce: `pnpm bench:accuracy` — outputs per-category breakdown, failure lis
308
325
 
309
326
  ```ts
310
327
  const terlik = new Terlik({
311
- language: "tr", // "tr" | "en" | "es" | "de" (default: "tr")
328
+ language: "tr", // built-in: "tr" | "en" | "es" | "de" (default: "tr")
312
329
  mode: "balanced", // "strict" | "balanced" | "loose"
313
330
  maskStyle: "stars", // "stars" | "partial" | "replace"
314
331
  replaceMask: "[***]", // mask text for "replace" style
@@ -439,7 +456,7 @@ deNormalize("Scheiße"); // "scheisse"
439
456
 
440
457
  ## Testing
441
458
 
442
- 874 tests covering all 4 languages, 25 Turkish root words, suffix detection, lazy compilation, multi-language isolation, normalization, fuzzy matching, cleaning, integration, ReDoS hardening, attack surface coverage, external dictionary merging, and edge cases:
459
+ 874 tests covering all built-in languages, 25 Turkish root words, suffix detection, lazy compilation, multi-language isolation, normalization, fuzzy matching, cleaning, integration, ReDoS hardening, attack surface coverage, external dictionary merging, and edge cases:
443
460
 
444
461
  ```bash
445
462
  pnpm test # run once
@@ -478,99 +495,7 @@ See [CONTRIBUTING.md](./CONTRIBUTING.md) for contribution guidelines.
478
495
 
479
496
  ## Changelog
480
497
 
481
- ### 2026-02-28 (v2.3.0) - 40x Faster Cold Start: V8 JIT Regex Optimization
482
-
483
- **Replaces `\p{L}`/`\p{N}` Unicode property escapes with explicit Latin ranges, eliminating V8 JIT bottleneck.**
484
-
485
- - **40x faster cold start** — First `containsProfanity()` call: 16,494ms → 404ms.
486
- - **356x faster multi-language warmup** — 4-language warmup: 19,234ms → 54ms.
487
- - **13x less memory** — Heap usage: 492MB → 38MB.
488
- - **Static pattern cache** — Same-language instances share compiled patterns via `Detector.patternCache`.
489
- - **Background warmup** — Dev server starts instantly, warms up in background.
490
-
491
- | Change | File |
492
- |---|---|
493
- | Replace `\p{L}\p{N}` with `[a-zA-Z0-9À-ɏ]` | `src/patterns.ts` |
494
- | Static pattern cache + explicit range in getSurroundingWord | `src/detector.ts` |
495
- | Explicit range in number expander + punctuation removal | `src/normalizer.ts` |
496
- | Pass cacheKey to Detector | `src/terlik.ts` |
497
- | Background warmup, lazy instance cache | `tools/server.ts` |
498
- | NODE_OPTIONS heap safety net | `.github/workflows/ci.yml` |
499
-
500
- ### 2026-02-28 (v2.2.1) — CI Fix: Timeout Race Condition + İ Platform Compatibility
501
-
502
- **Fixes detection failures on slow runners and cross-platform İ (U+0130) handling.**
503
-
504
- - **Timeout race condition fix** — `REGEX_TIMEOUT_MS` check moved from _before_ match processing to _after_. Previously, V8 JIT compilation on first `exec()` call (triggered by lazy compilation) could exceed 250ms, causing the timeout to discard a valid match before it was recorded. Now the current match is always processed; the timeout only prevents scanning for additional matches.
505
- - **İ (U+0130) cross-platform fix** — First regex pass now runs on `text.toLocaleLowerCase(locale)` instead of raw text. Turkish İ→i mapping is performed explicitly before regex matching, avoiding inconsistent V8/ICU case-folding behavior across platforms (Ubuntu vs macOS). The `mapNormalizedToOriginal()` mapper recovers original-cased words for result output.
506
-
507
- | Change | File |
508
- |---|---|
509
- | Timeout check moved after match processing | `src/detector.ts` (`runPatterns`) |
510
- | Locale-lower first pass for İ safety | `src/detector.ts` (`detectPattern`) |
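The ordering change in the v2.2.1 entry (check the time budget after the current match is recorded, not before) can be sketched with a plain `exec()` loop. Only the constant name `REGEX_TIMEOUT_MS` and its 250 ms value come from the changelog; `collectMatches` and its injectable clock are hypothetical and exist only to make the behaviour observable.

```ts
const REGEX_TIMEOUT_MS = 250;

// Hypothetical helper: collects matches, stopping the scan once the budget is spent.
function collectMatches(
  source: string,
  text: string,
  now: () => number = Date.now,
): string[] {
  const pattern = new RegExp(source, "g");
  const started = now();
  const found: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    // Record the current match first, so a slow first exec() cannot discard it.
    found.push(match[0]);
    // Avoid an infinite loop on zero-length matches.
    if (match[0].length === 0) pattern.lastIndex++;
    // Only now check the budget; it stops the search for additional matches.
    if (now() - started > REGEX_TIMEOUT_MS) break;
  }
  return found;
}

// Simulated slow runner: every clock read advances 300 ms.
let fakeTime = 0;
const slowClock = () => (fakeTime += 300);
console.log(collectMatches("ab", "ab ab ab", slowClock)); // first match is still kept
console.log(collectMatches("ab", "ab ab ab"));            // normal clock: all matches
```

With the check placed before `found.push`, the simulated slow clock would return an empty array, which is the failure mode the changelog entry describes.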
511
-
512
- ### 2026-02-28 (v2.2) — Lazy Compilation + Linguistic Patch
513
-
514
- **Zero-cost construction. Background warmup. Turkish agglutination hardening.**
515
-
516
- - **Lazy compilation** — Pattern compilation deferred from constructor to first `detect()` call. `new Terlik()` drops from ~225ms to **~1.5ms**. Strict-mode users never pay regex cost (hash lookup only).
517
- - **`backgroundWarmup` option** — `new Terlik({ backgroundWarmup: true })` schedules compilation + JIT warmup via `setTimeout(fn, 0)`. Idempotent: if `detect()` is called before the timer fires, it compiles synchronously and the timer becomes a no-op.
518
- - **`detector.compile()` public method** — Allows manual precompilation for advanced use cases.
519
- - **Turkish suffix expansion** — Added question particles (`misin`, `misiniz`, `musun`, `musunuz`, `miyim`, `miyiz`) and adverbial forms (`cesine`, `casina`) to suffix engine (now 83 total). All suffixable entries (orospu, piç, yarrak, ibne, etc.) now catch question and adverbial inflections.
520
- - **Deep agglutination variants** — Added explicit variants for `siktiğimin`, `sikermisiniz`, `sikermisin`, `siktirmişcesine`. These forms require 3+ suffix chains or non-standard morpheme boundaries (ğ→g bridge) that the suffix engine can't generalize without false positives.
521
- - **`MAX_PATTERN_LENGTH` 6000 → 10000** — Accommodates the larger suffix group without fallback to non-suffix mode.
522
- - **Test count** — 619 → 631. New `tests/lazy-compilation.test.ts` covers construction timing, transparent lazy compile, strict-mode optimization, backgroundWarmup with fake timers, and idempotent early-detect.
523
-
524
- | Change | File |
525
- |---|---|
526
- | `backgroundWarmup` option | `src/types.ts` |
527
- | Lazy `_patterns`, `ensureCompiled()`, `compile()` | `src/detector.ts` |
528
- | backgroundWarmup setTimeout scheduling | `src/terlik.ts` |
529
- | Suffix + variant expansion, MAX_PATTERN_LENGTH | `src/patterns.ts`, `src/lang/tr/dictionary.json` |
530
- | Lazy compilation tests (new) | `tests/lazy-compilation.test.ts` |
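Lazy compilation as described in the v2.2 entry defers regex construction until the first detection call and makes a later warmup a no-op. The class below is a sketch of that idea only: `LazyPatterns`, `compileCount`, `matchesAny`, and `scheduleWarmup` are hypothetical names, not the `Detector` API.

```ts
// Hypothetical illustration of deferred, idempotent pattern compilation.
class LazyPatterns {
  private sources: string[];
  private compiled: RegExp[] | null = null;
  // Exposed only so the example can show that compilation happens once.
  public compileCount = 0;

  constructor(sources: string[]) {
    // Construction stores strings only; no RegExp objects are built here.
    this.sources = sources;
  }

  // Idempotent: the first call compiles, later calls reuse the cached patterns.
  compile(): RegExp[] {
    if (this.compiled === null) {
      this.compiled = this.sources.map((source) => new RegExp(source, "i"));
      this.compileCount++;
    }
    return this.compiled;
  }

  matchesAny(text: string): boolean {
    return this.compile().some((pattern) => pattern.test(text));
  }

  // Background warmup: if matchesAny() ran first, this timer does nothing new.
  scheduleWarmup(): void {
    setTimeout(() => this.compile(), 0);
  }
}

const patterns = new LazyPatterns(["foo", "bar"]);
console.log(patterns.compileCount);          // nothing compiled yet
console.log(patterns.matchesAny("a FOO b")); // compiles on first use
console.log(patterns.compileCount);
```

The trade-off is that the first check pays the compilation cost unless a warmup has already run.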
531
-
532
- ### 2026-02-28 (v2.1) — ReDoS Security Hardening
533
-
534
- **Added Regex Denial-of-Service protection.**
535
-
536
- Identified vulnerability: overlap between `charClasses` and `separator` (`@`, `$`, `!`, `|`, `+`, `#`, `€`, `¢`, `©` could be matched by both char class and separator) enabled polynomial O(n^2) backtracking via adversarial input.
537
-
538
- - **Bounded separator** — `[^\p{L}\p{N}]*` (unbounded) replaced with `[^\p{L}\p{N}]{0,3}` (max 3 chars). Real-world evasions (`s.i.k.t.i.r`, `s_i_k`) use 1 separator char. This reduces backtracking from O(n^2) to O(1) per boundary.
539
- - **Regex timeout safety net** — Added 250ms timeout (`REGEX_TIMEOUT_MS`) to `runPatterns()` and `detectFuzzy()` loops. Never triggers on normal input (<1ms), but provides a hard cap on adversarial input.
540
- - **charClasses cleanup** — Removed separator-overlapping symbols from all 4 language configs (TR, EN, ES, DE). These symbols are already defined in `leetMap` and converted during the normalizer pass — removing them from pattern matching causes no false negatives.
541
- - **ReDoS test suite** — `tests/redos.test.ts`: 71 tests covering adversarial timing, attack surface (separator abuse, leet bypass, char repetition, Unicode tricks, whitelist integrity, boundary attacks, multi-match, input edge cases, suffix hardening).
542
- - **MAX_PATTERN_LENGTH** — 5000 → 6000 (later raised to 10000 in v2.2). The `{0,3}` separator adds ~3 chars per boundary; raised the limit so large suffix patterns (e.g. `orospu`) don't fall back to non-suffix mode.
543
- - **Test count** — 548 → 619.
544
-
545
- | Change | File |
546
- |---|---|
547
- | Separator `*` → `{0,3}`, timeout constant | `src/patterns.ts` |
548
- | Timeout loop guard | `src/detector.ts` |
549
- | charClasses cleanup | `src/lang/{tr,en,es,de}/config.ts` |
550
- | ReDoS + attack surface test suite (new) | `tests/redos.test.ts` |
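The bounded separator from the v2.1 entry can be shown in isolation. The character class `[^\p{L}\p{N}]` and the `{0,3}` bound are taken from the changelog; the helper `separatorTolerantPattern` is a hypothetical construction that assumes the word contains only plain letters (no regex metacharacters), and the sample word is harmless.

```ts
// Separator class quoted in the changelog; it needs the "u" flag for \p{...}.
const SEPARATOR = "[^\\p{L}\\p{N}]";

// Hypothetical helper: allow at most `max` separator characters between letters.
function separatorTolerantPattern(word: string, max: number = 3): RegExp {
  const letters = word.split("");
  return new RegExp(letters.join(`${SEPARATOR}{0,${max}}`), "u");
}

const spam = separatorTolerantPattern("spam");
console.log(spam.test("s.p.a.m"));    // one separator per boundary
console.log(spam.test("s_-_p.a.m"));  // three separators at one boundary
console.log(spam.test("s....p.a.m")); // four separators exceed the bound
```

A fixed upper bound limits how many ways the engine can split a run of separator characters, which is what removes the unbounded backtracking the entry describes.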
551
-
552
- ### 2026-02-28 (v2)
553
-
554
- **Multi-Language Support**
555
-
556
- - **4 built-in languages** — Turkish (tr), English (en), Spanish (es), German (de). Each language is a self-contained folder (`src/lang/xx/`) with `config.ts` and `dictionary.json`.
557
- - **Folder-based language packs** — Adding a new language requires creating one folder with two files and one import line in the registry.
558
- - **`Terlik.warmup()`** — Static method to create and JIT-warm multiple language instances at once for server deployments.
559
- - **`language` option** — `new Terlik({ language: "en" })`. Default remains `"tr"` (backward compatible).
560
- - **Language-agnostic engine** — Normalizer, pattern compiler, detector, and cleaner are now fully parametric. Language-specific data (charMap, leetMap, charClasses, numberExpansions) comes from config files.
561
- - **New exports** — `createNormalizer`, `getLanguageConfig`, `getSupportedLanguages`, `LanguageConfig` type.
562
- - **Test coverage** — 346 → 418 tests. Added language-specific tests, cross-language isolation tests, and registry tests.
563
-
564
- ### 2026-02-28
565
-
566
- **Suffix Engine + JSON Dictionary Migration**
567
-
568
- - **JSON dictionary** — Migrated dictionary from `tr.ts` to community-friendly `tr.json` format. Added runtime schema validation (`validateDictionary`). Each entry now includes `category` and `suffixable` fields.
569
- - **Suffix engine** — Defined Turkish grammatical suffixes (later expanded to 83 in v2.2). Suffixable roots (`orospu`, `salak`, `aptal`, `kahpe`, etc.) automatically catch inflected forms like `orospuluk`, `salaksin`, `aptallarin`, `kahpeler`. Short roots (3-char: `sik`, `bok`, `göt`, `döl`) use explicit variants instead to prevent false positives.
570
- - **Critical bug fix: `\W` separator** — JavaScript's `\W` treats Turkish characters (`ı`, `ş`, `ğ`, `ö`, `ü`, `ç`) as non-word characters. The pattern engine separator `[\W_]*` was changed to `[^\p{L}\p{N}]*` (Unicode-aware). This fixed false positives on innocent words like `sıkma`, `sıkıntı`, `sıkıştı`.
571
- - **Live test server warmup fix** — Fixed cache key mismatch and added JIT warmup. First request latency reduced from 3318ms to 37ms.
572
- - **Test coverage** — 101 → 346 tests. All 25 root words are comprehensively tested.
573
- - **Expanded whitelist** — Added `ama`, `ami`, `amen`, `amir`, `amil`, `dolmen`.
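The `\W` problem described in the suffix-engine entry is standard JavaScript behaviour and can be reproduced without the library. The snippet compares the old separator class `[\W_]` with the Unicode-aware `[^\p{L}\p{N}]` on the Turkish letters listed in the entry; the variable names are illustrative.

```ts
// Old separator class: \W is the complement of [A-Za-z0-9_] only.
const legacySeparator = /[\W_]/;
// Unicode-aware replacement; built from a string and compiled with the "u" flag.
const unicodeSeparator = new RegExp("[^\\p{L}\\p{N}]", "u");

const turkishLetters = ["ı", "ş", "ğ", "ö", "ü", "ç"];
const report = turkishLetters.map((ch) => ({
  ch,
  // true: the legacy class mistakes the letter for a separator
  legacy: legacySeparator.test(ch),
  // false: the Unicode-aware class recognises it as a letter
  unicode: unicodeSeparator.test(ch),
}));
console.log(report);
```

Because the legacy class treats `ı` as a separator, a pattern for a three-letter root could match across the letters of an innocent word such as `sıkma`, which is the false positive the entry reports as fixed.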
498
+ See [CHANGELOG.md](./CHANGELOG.md) for the full version history.
574
499
 
575
500
  ## License
576
501
 
package/package.json CHANGED
@@ -1,7 +1,7 @@
1
1
  {
2
2
  "name": "terlik.js",
3
- "version": "2.4.0",
4
- "description": "Ultra-fast, zero-dependency multi-language profanity detection engine for Turkish, English, Spanish, and German with lazy compilation, deep agglutination support, and ReDoS-safe regex patterns",
3
+ "version": "2.4.1",
4
+ "description": "Ultra-fast, zero-dependency profanity detection engine. Ships with Turkish, English, Spanish & German; extensible to any language. Lazy compilation, deep agglutination support, ReDoS-safe regex patterns",
5
5
  "main": "./dist/index.js",
6
6
  "module": "./dist/index.mjs",
7
7
  "types": "./dist/index.d.ts",
@@ -34,21 +34,29 @@
34
34
  "keywords": [
35
35
  "profanity",
36
36
  "profanity-filter",
37
+ "profanity-detection",
37
38
  "filter",
38
39
  "content-filter",
40
+ "content-moderation",
39
41
  "moderation",
40
42
  "nsfw",
41
43
  "censor",
42
44
  "badwords",
45
+ "bad-words",
46
+ "swear-words",
47
+ "word-filter",
48
+ "chat-filter",
49
+ "text-moderation",
50
+ "offensive-language",
43
51
  "turkish",
44
- "english",
45
- "spanish",
46
- "german",
47
52
  "multi-language",
53
+ "extensible",
48
54
  "kufur",
55
+ "küfür",
49
56
  "turkce",
50
57
  "zero-dependency",
51
58
  "typescript",
59
+ "nodejs",
52
60
  "isomorphic",
53
61
  "regex-engine"
54
62
  ],