bekindprofanityfilter 0.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (82)
  1. package/CONTRIBUTORS.md +106 -0
  2. package/LICENSE +22 -0
  3. package/README.md +1015 -0
  4. package/allprofanity.config.example.json +35 -0
  5. package/bin/init.js +49 -0
  6. package/config.schema.json +163 -0
  7. package/dist/algos/aho-corasick.d.ts +75 -0
  8. package/dist/algos/aho-corasick.js +238 -0
  9. package/dist/algos/aho-corasick.js.map +1 -0
  10. package/dist/algos/bloom-filter.d.ts +103 -0
  11. package/dist/algos/bloom-filter.js +208 -0
  12. package/dist/algos/bloom-filter.js.map +1 -0
  13. package/dist/algos/context-patterns.d.ts +102 -0
  14. package/dist/algos/context-patterns.js +484 -0
  15. package/dist/algos/context-patterns.js.map +1 -0
  16. package/dist/index.d.ts +1332 -0
  17. package/dist/index.js +2631 -0
  18. package/dist/index.js.map +1 -0
  19. package/dist/innocence-scoring.d.ts +23 -0
  20. package/dist/innocence-scoring.js +118 -0
  21. package/dist/innocence-scoring.js.map +1 -0
  22. package/dist/language-detector.d.ts +162 -0
  23. package/dist/language-detector.js +952 -0
  24. package/dist/language-detector.js.map +1 -0
  25. package/dist/language-dicts.d.ts +60 -0
  26. package/dist/language-dicts.js +2718 -0
  27. package/dist/language-dicts.js.map +1 -0
  28. package/dist/languages/arabic-words.d.ts +10 -0
  29. package/dist/languages/arabic-words.js +1649 -0
  30. package/dist/languages/arabic-words.js.map +1 -0
  31. package/dist/languages/bengali-words.d.ts +10 -0
  32. package/dist/languages/bengali-words.js +1696 -0
  33. package/dist/languages/bengali-words.js.map +1 -0
  34. package/dist/languages/brazilian-words.d.ts +10 -0
  35. package/dist/languages/brazilian-words.js +2122 -0
  36. package/dist/languages/brazilian-words.js.map +1 -0
  37. package/dist/languages/chinese-words.d.ts +10 -0
  38. package/dist/languages/chinese-words.js +2728 -0
  39. package/dist/languages/chinese-words.js.map +1 -0
  40. package/dist/languages/english-primary-all-languages.d.ts +23 -0
  41. package/dist/languages/english-primary-all-languages.js +36894 -0
  42. package/dist/languages/english-primary-all-languages.js.map +1 -0
  43. package/dist/languages/english-words.d.ts +5 -0
  44. package/dist/languages/english-words.js +5156 -0
  45. package/dist/languages/english-words.js.map +1 -0
  46. package/dist/languages/french-words.d.ts +10 -0
  47. package/dist/languages/french-words.js +2326 -0
  48. package/dist/languages/french-words.js.map +1 -0
  49. package/dist/languages/german-words.d.ts +10 -0
  50. package/dist/languages/german-words.js +2633 -0
  51. package/dist/languages/german-words.js.map +1 -0
  52. package/dist/languages/hindi-words.d.ts +10 -0
  53. package/dist/languages/hindi-words.js +2341 -0
  54. package/dist/languages/hindi-words.js.map +1 -0
  55. package/dist/languages/innocent-words.d.ts +41 -0
  56. package/dist/languages/innocent-words.js +109 -0
  57. package/dist/languages/innocent-words.js.map +1 -0
  58. package/dist/languages/italian-words.d.ts +10 -0
  59. package/dist/languages/italian-words.js +2287 -0
  60. package/dist/languages/italian-words.js.map +1 -0
  61. package/dist/languages/japanese-words.d.ts +11 -0
  62. package/dist/languages/japanese-words.js +2557 -0
  63. package/dist/languages/japanese-words.js.map +1 -0
  64. package/dist/languages/korean-words.d.ts +10 -0
  65. package/dist/languages/korean-words.js +2509 -0
  66. package/dist/languages/korean-words.js.map +1 -0
  67. package/dist/languages/russian-words.d.ts +10 -0
  68. package/dist/languages/russian-words.js +2175 -0
  69. package/dist/languages/russian-words.js.map +1 -0
  70. package/dist/languages/spanish-words.d.ts +11 -0
  71. package/dist/languages/spanish-words.js +2536 -0
  72. package/dist/languages/spanish-words.js.map +1 -0
  73. package/dist/languages/tamil-words.d.ts +10 -0
  74. package/dist/languages/tamil-words.js +1722 -0
  75. package/dist/languages/tamil-words.js.map +1 -0
  76. package/dist/languages/telugu-words.d.ts +10 -0
  77. package/dist/languages/telugu-words.js +1739 -0
  78. package/dist/languages/telugu-words.js.map +1 -0
  79. package/dist/romanization-detector.d.ts +50 -0
  80. package/dist/romanization-detector.js +779 -0
  81. package/dist/romanization-detector.js.map +1 -0
  82. package/package.json +79 -0
package/README.md ADDED
@@ -0,0 +1,1015 @@
1
+ # BeKind Profanity Filter
2
+
3
+ > Forked from [AllProfanity](https://github.com/ayush-jadaun/allprofanity) by Ayush Jadaun. Extended with **romanization profanity detection** (catches Hinglish, transliterated text), **language-aware innocence scoring** (ELD + trie-based detection prevents false positives for cross-language collisions like "slut" in Swedish), and additional language dictionaries. Licensed under MIT.
4
+
5
+ > ⚠️ **Early-stage package in progress.** Features available in the original AllProfanity are being actively deprecated, adjusted, or replaced. API surface may change without notice. Contributions and suggestions greatly appreciated.
6
+
7
+ > **Please be advised:** Due to the nature of its purpose, the be-kind repository contains explicit profanity, slurs, hate speech, and other offensive language across its source files, dictionaries, and test suites (sorry!). The inclusion of these words does not reflect the views of the authors or contributors.
8
+
9
+ A multi-language profanity filter with romanization detection, language-aware innocence scoring, leet-speak detection, and cross-language collision handling.
10
+
11
+ [![npm version](https://img.shields.io/npm/v/bekindprofanityfilter.svg)](https://www.npmjs.com/package/bekindprofanityfilter)
12
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
13
+ ![Languages](https://img.shields.io/badge/profanity_dicts-15_languages-blue)
14
+ ![Detection Trie](https://img.shields.io/badge/detection_trie-18_languages-informational)
15
+
16
+ ---
17
+
18
+ ## What This Version Contains
19
+
20
+ - **Multi-Language Profanity Detection:** 34K+ word dictionary across 16 languages with 18-language detection trie
21
+ - **Romanization Detection:** Catches Hinglish, transliterated Bengali, Tamil, Telugu, and Japanese
22
+ - **Cross-Language Innocence Scoring:** Handles words like "slut" (Swedish: "end") and "fart" (Norwegian: "speed")
23
+ - **Context-Aware Analysis:** Booster/reducer patterns detect sexual context, negation, medical usage, and quoted speech
24
+ - **Leet-Speak Detection:** Catches obfuscated profanity (`f#ck`, `a55hole`, `sh1t`)
25
+ - **Word Boundary Detection:** Smart whole-word matching prevents flagging "assassin" or "assistance"
26
+ - **Multiple Algorithms:** Trie (default), Aho-Corasick, or Hybrid modes with optional Bloom filters and result caching
27
+
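The cross-language innocence scoring above can be pictured as a gate on dictionary hits: a hit is suppressed when the surrounding text is confidently detected as a language in which the word is innocent. A minimal TypeScript sketch of the idea (the word list, language labels, and 0.7 threshold are illustrative assumptions, not the library's actual data or API):

```typescript
// Illustrative only: not the library's implementation or data.
// "slut" means "end" in Swedish; "fart" means "speed" in Norwegian.
const innocentIn: Record<string, string[]> = {
  slut: ["swedish"],
  fart: ["norwegian", "swedish"],
};

// Suppress a dictionary hit when the detected language makes it innocent.
function isFlagged(word: string, detectedLang: string, confidence: number): boolean {
  const langs = innocentIn[word.toLowerCase()];
  if (langs !== undefined && langs.includes(detectedLang) && confidence > 0.7) {
    return false; // innocent cross-language collision
  }
  return true;
}

console.log(isFlagged("slut", "swedish", 0.9)); // false (innocent in Swedish)
console.log(isFlagged("slut", "english", 0.9)); // true
```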
28
+ ---
29
+
30
+ ## Features
31
+
32
+ ### Performance & Speed
33
+
34
+ - **Multiple Algorithm Options:** Choose between Trie (default), Aho-Corasick, or Hybrid modes
35
+ - **664% Faster on Large Texts:** Aho-Corasick delivers O(n) multi-pattern matching
36
+ - **123x Speedup with Caching:** Result cache perfect for repeated checks (chat, forms, APIs)
37
+ - **~27K ops/sec:** Default Trie mode handles short texts incredibly fast
38
+ - **Single-Pass Scanning:** O(n) complexity regardless of dictionary size
39
+ - **Batch Processing Ready:** Optimized for high-throughput API endpoints
40
+
41
+ ### Accuracy & Detection
42
+
43
+ - **Word Boundary Matching:** Smart whole-word detection prevents false positives like "assassin" or "assistance"
44
+ - **Advanced Leet-Speak:** Detects obfuscated profanities (`f#ck`, `a55hole`, `sh1t`, etc.)
45
+ - **Comprehensive Coverage:** Catches profanity while minimizing false flags
46
+ - **Configurable Strictness:** Tune detection sensitivity to your needs
47
+
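The word-boundary behavior described above can be sketched with a word-boundary regex; this is a simplified illustration with a one-word dictionary, not the library's matcher:

```typescript
// Simplified sketch: substring matching vs whole-word matching.
const dictionary = ["ass"];

// Naive substring check: flags "assassin" and "assistance".
function naiveCheck(text: string): boolean {
  return dictionary.some((w) => text.toLowerCase().includes(w));
}

// Whole-word check: only matches with word boundaries on both sides.
function boundaryCheck(text: string): boolean {
  return dictionary.some((w) => new RegExp(`\\b${w}\\b`, "i").test(text));
}

console.log(naiveCheck("an assassin appeared"));    // true (false positive)
console.log(boundaryCheck("an assassin appeared")); // false
console.log(boundaryCheck("what an ass"));          // true
```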
48
+ ### Multi-Language & Flexibility
49
+
50
+ - **Multi-Language Support:** Built-in profanity dictionaries for 16 languages: English, Hindi, French, German, Spanish, Italian, Brazilian Portuguese, Russian, Arabic, Chinese, Japanese, Korean, Bengali, Tamil, Telugu, Turkish
51
+ - **Multiple Scripts:** Latin/Roman (Hinglish) and native scripts (Devanagari, Tamil, Telugu, etc.)
52
+ - **Custom Dictionaries:** Add/remove words or entire language packs at runtime
53
+ - **Whitelisting:** Exclude safe words from detection
54
+ - **Severity Scoring:** Assess content offensiveness (`MILD`, `MODERATE`, `SEVERE`, `EXTREME`)
55
+
56
+ ### Developer Experience
57
+
58
+ - **TypeScript Support:** Fully typed API with comprehensive documentation
59
+ - **Zero 3rd-Party Dependencies:** Only internal code and data
60
+ - **Configurable:** Tune performance vs accuracy for your use case
61
+ - **No Dictionary Exposure:** Secure by design - word lists never exposed
62
+ - **Universal:** Works in Node.js and browsers
63
+
64
+ ---
65
+
68
+ ## Installation
69
+
70
+ ```bash
71
+ npm install bekindprofanityfilter
72
+ # or
73
+ yarn add bekindprofanityfilter
74
+ ```
75
+
76
+ **Generate configuration file (optional):**
77
+
78
+ ```bash
79
+ npx bekindprofanityfilter
80
+ # Creates bekindprofanityfilter.config.json and config.schema.json in your project
81
+ ```
82
+
83
+ ---
84
+
85
+ ## Quick Start
86
+
87
+ ```typescript
88
+ import profanity from 'bekindprofanityfilter';
89
+
90
+ // Simple check
91
+ profanity.check('This is a clean sentence.'); // false
92
+ profanity.check('What the f#ck is this?'); // true (leet-speak detected)
93
+ profanity.check('यह एक चूतिया परीक्षण है।'); // true (Hindi)
94
+ profanity.check('Ye ek chutiya test hai.'); // true (Hinglish Roman script)
95
+ ```
96
+
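The leet-speak hit in the example above can be approximated with a character-substitution normalization pass before dictionary lookup. A minimal sketch (this mapping is an assumption for illustration, not the library's actual table):

```typescript
// Hypothetical substitution table: common digit/symbol stand-ins for letters.
const leetMap: Record<string, string> = {
  "1": "i", "3": "e", "4": "a", "5": "s", "0": "o", "@": "a", "$": "s", "!": "i",
};

// Normalize, then run the normalized text through the usual dictionary check.
function normalizeLeet(text: string): string {
  return [...text.toLowerCase()].map((ch) => leetMap[ch] ?? ch).join("");
}

console.log(normalizeLeet("sh1t"));    // "shit"
console.log(normalizeLeet("a55hole")); // "asshole"
```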
97
+ ---
98
+
99
+ ## Algorithm Configuration
100
+
101
+ BeKind offers multiple algorithms optimized for different use cases. You can configure via **constructor options** or a **config file**.
102
+
103
+ ### Configuration Methods
104
+
105
+ #### Method 1: Constructor Options (Inline)
106
+
107
+ ```typescript
108
+ import { BeKind } from 'bekindprofanityfilter';
109
+
110
+ const filter = new BeKind({
111
+ algorithm: { matching: "hybrid" },
112
+ performance: { enableCaching: true }
113
+ });
114
+ ```
115
+
116
+ #### Method 2: Config File (Recommended)
117
+
118
+ ```bash
119
+ # Generate config files in your project
120
+ npx bekindprofanityfilter
121
+
122
+ # This creates:
123
+ # - bekindprofanityfilter.config.json (main config)
124
+ # - config.schema.json (for IDE autocomplete)
125
+ ```
126
+
127
+ ```typescript
128
+ import { BeKind } from 'bekindprofanityfilter';
129
+ import config from './bekindprofanityfilter.config.json';
130
+
131
+ // Load from generated config file
132
+ const filter = BeKind.fromConfig(config);
133
+
134
+ // Or directly from object (no file needed)
135
+ const filter2 = BeKind.fromConfig({
136
+ algorithm: { matching: "hybrid", useContextAnalysis: true },
137
+ performance: { enableCaching: true, cacheSize: 1000 }
138
+ });
139
+ ```
140
+
141
+ **Example Config File** (`bekindprofanityfilter.config.json`):
142
+
143
+ ```json
144
+ {
145
+ "algorithm": {
146
+ "matching": "hybrid",
147
+ "useAhoCorasick": true,
148
+ "useBloomFilter": true
149
+ },
150
+ "profanityDetection": {
151
+ "enableLeetSpeak": true,
152
+ "caseSensitive": false,
153
+ "strictMode": false
154
+ },
155
+ "performance": {
156
+ "enableCaching": true,
157
+ "cacheSize": 1000
158
+ }
159
+ }
160
+ ```
161
+
162
+ **Config File:** Run `npx bekindprofanityfilter` to generate config files in your project. The JSON schema provides IDE autocomplete and validation.
163
+
164
+ ---
165
+
166
+ ### Quick Configuration Examples
167
+
168
+ #### 1. Default (Best for General Use)
169
+
170
+ ```typescript
171
+ import { BeKind } from 'bekindprofanityfilter';
172
+ const filter = new BeKind();
173
+ // Uses optimized Trie - fast and reliable (~27K ops/sec)
174
+ ```
175
+
176
+ #### 2. Large Text Processing (Documents, Articles)
177
+
178
+ ```typescript
179
+ const filter = new BeKind({
180
+ algorithm: { matching: "aho-corasick" }
181
+ });
182
+ // 664% faster on 1KB+ texts
183
+ ```
184
+
185
+ #### 3. Repeated Checks (Chat, Forms, APIs)
186
+
187
+ ```typescript
188
+ const filter = new BeKind({
189
+ performance: {
190
+ enableCaching: true,
191
+ cacheSize: 1000
192
+ }
193
+ });
194
+ ```
195
+
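A result cache for repeated checks can be as simple as an LRU map keyed by the input string; a sketch of the idea (the library's actual cache internals may differ):

```typescript
// Minimal LRU result cache keyed by input text (illustrative sketch).
class ResultCache {
  private map = new Map<string, boolean>();
  constructor(private capacity: number) {}

  get(key: string): boolean | undefined {
    const value = this.map.get(key);
    if (value !== undefined) {
      // Re-insert to mark as most recently used.
      this.map.delete(key);
      this.map.set(key, value);
    }
    return value;
  }

  set(key: string, value: boolean): void {
    if (this.map.has(key)) {
      this.map.delete(key);
    } else if (this.map.size >= this.capacity) {
      // Map preserves insertion order, so the first key is least recently used.
      const oldest = this.map.keys().next().value;
      if (oldest !== undefined) this.map.delete(oldest);
    }
    this.map.set(key, value);
  }
}
```

Repeated identical inputs (common in chat spam) then skip the full scan entirely, which is where a cache speedup on repeated checks comes from.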
196
+ ### Competitor Comparison
197
+
198
+ How be-kind compares to popular JavaScript profanity filters on features and speed. Speed numbers were collected on a single CPU core (pinned via `taskset -c 0`) and are reported as **ops/second, higher is better**.
199
+
200
+ > **Honest context:** be-kind loads a ~34K-word dictionary across 16 languages by default (its detection trie covers 18). `leo + dict` injects be-kind's full 34K dictionary into [leo-profanity](https://github.com/jojoee/leo-profanity) (which ships with ~400 English words only) to test the matching engine with equivalent vocabulary. glin-profanity is benchmarked with all 24 supported languages loaded. `glin + dict` injects be-kind's full 34K dictionary into glin for the same reason.
201
+
202
+ | Library | Languages (out-of-the-box) | Leet-speak | Repeat compression | Context-aware |
203
+ |---------|--------------------------|-----------|-------------------|--------------|
204
+ | **be-kind** | 16 profanity dicts + 18-lang detection trie | ✅ | 🚧 planned | ✅ (certainty-delta) |
205
+ | **be-kind (ctx)** | same as be-kind | ✅ | 🚧 planned | ✅ (boosters + reducers) |
206
+ | [leo-profanity](https://github.com/jojoee/leo-profanity) + dict | 16 (via be-kind dict injection) | ❌ | ❌ | ❌ |
207
+ | [bad-words](https://github.com/web-mech/badwords) | 1 (English) | ❌ | ❌ | ❌ |
208
+ | [glin-profanity](https://www.glincker.com/tools/glin-profanity) | 24 | ✅ (3 levels) | ✅ | ✅ (heuristic) |
209
+
210
+ **Speed benchmark** — ops/second on a single CPU core (`taskset -c 0`), higher is better:
211
+
212
+ | Test | be-kind | be-kind (ctx) | leo | bad-words | glin (basic) | glin (enhanced) | glin + dict |
213
+ |------|--------:|--------------:|----:|----------:|-------------:|----------------:|------------:|
214
+ | check — clean (short) | 2,654 | 2,903 | 879,009 | 2,932 | 816 | 751 | 68 |
215
+ | check — profane (short) | 2,366 | 2,031 | 1,496,281 | 3,025 | 3,128 | 3,304 | 3,350 |
216
+ | check — leet-speak | 1,243 | 1,198 | 1,100,028 | 3,148 | 2,760 | 4,078 | 4,499 |
217
+ | clean — profane (short) | 2,398 | 2,011 | 298,713 | 243 | N/A | N/A | N/A |
218
+ | check — 500-char clean | 411 | 397 | 100,898 | 2,157 | 253 | 247 | 20 |
219
+ | check — 500-char profane | 348 | 277 | 216,204 | 2,155 | 789 | 720 | 762 |
220
+ | check — 2,500-char clean | 91 | 88 | 18,900 | 1,225 | 74 | 71 | 6 |
221
+ | check — 2,500-char profane | 82 | 62 | 50,454 | 1,084 | 196 | 185 | 186 |
222
+
223
+ **Library versions tested:** `leo-profanity@1.9.0`, `bad-words@4.0.0`, `glin-profanity@3.3.0`
224
+
225
+ **Notes:**
226
+ - **be-kind** and **be-kind (ctx)** both load a 34K-word dictionary across 16 languages. Despite this, be-kind is ~3x faster than glin on clean text because it uses a **trie** (O(input_length) matching), while glin uses **linear scanning** over its dictionary (`for (const word of this.words.keys())` — O(dict_size * input_length)). This architectural difference becomes dramatic at large dictionary sizes.
227
+ - `be-kind (ctx)` adds ~10-20% overhead over default be-kind — context analysis (certainty-delta pattern matching) is cheap.
228
+ - `leo-profanity` is the fastest but its ~400-word English-only dictionary explains most of the gap.
229
+ - `glin` with all 24 languages loaded is ~17x slower than English-only due to its linear-scan architecture scaling with dictionary size.
230
+ - `glin + dict` (glin enhanced + be-kind's 34K words injected) demonstrates the linear-scan bottleneck: 68 ops/s on clean short text vs 2,654 for be-kind with the same vocabulary. On profane text it short-circuits on first match, so performance is normal (3,350 ops/s).
231
+ - be-kind is the only library with cross-language innocence scoring, romanization support, and context-aware certainty adjustment.
232
+
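The trie-vs-linear-scan point in the notes can be made concrete: with a trie, each scan position walks at most the length of the longest dictionary word, independent of how many words are stored. A compact sketch (not the library's implementation):

```typescript
// Minimal trie: lookup cost grows with input length, not dictionary size.
class TrieNode {
  children = new Map<string, TrieNode>();
  terminal = false;
}

class Trie {
  root = new TrieNode();

  insert(word: string): void {
    let node = this.root;
    for (const ch of word) {
      if (!node.children.has(ch)) node.children.set(ch, new TrieNode());
      node = node.children.get(ch)!;
    }
    node.terminal = true;
  }

  // Try a match starting at every position; each walk stops as soon as no
  // child matches, so total work is O(input_length * max_word_length).
  containsAny(text: string): boolean {
    for (let i = 0; i < text.length; i++) {
      let node = this.root;
      for (let j = i; j < text.length; j++) {
        const next = node.children.get(text[j]);
        if (next === undefined) break;
        node = next;
        if (node.terminal) return true;
      }
    }
    return false;
  }
}

const trie = new Trie();
["merde", "scheisse"].forEach((w) => trie.insert(w));
console.log(trie.containsAny("this is merde")); // true
console.log(trie.containsAny("clean text"));    // false
```

A linear scan instead iterates over every dictionary word per check, which is why its cost multiplies with dictionary size.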
233
+ Run the benchmark yourself:
234
+ ```bash
235
+ taskset -c 0 bun run benchmark:competitors
236
+ ```
237
+
238
+ ### Accuracy Comparison
239
+
240
+ Measures TP rate (recall), FP rate, and F1 across eight test categories (225 labeled cases, dataset v6). All libraries are tested against all categories — no exemptions. **Higher F1 and lower FP rate are better.**
241
+
242
+ > **Bias disclaimer:** This dataset was created by the be-kind team. Non-English cases were likely drawn from or verified against be-kind's own dictionary, which advantages be-kind on those categories. To partially offset this, the dataset includes independent test cases from [glin-profanity's upstream test suite](https://github.com/GLINCKER/glin-profanity/tree/release/tests) and adversarial false-positive cases specifically chosen to expose known be-kind failures. We strongly recommend running this benchmark against your own dataset before drawing conclusions.
243
+
244
+ > **Note:** `be-kind (sensitive)` = `sensitiveMode: true` (flags AMBIVALENT words too). `be-kind (ctx)` = `contextAnalysis.enabled: true`. `glin (collapsed)` = glin (basic) with `collapseRepeatedCharacters()` pre-processing.
245
+
246
+ #### Single-language detection — 65 cases (English incl. leetspeak, French, German, Spanish, Hindi)
247
+
248
+ | Library | Recall | Precision | FP Rate | F1 |
249
+ |---|---|---|---|---|
250
+ | be-kind (sensitive) | 100% | 100% | 0% | **1.00** |
251
+ | leo + dict | 82% | 100% | 0% | 0.90 |
252
+ | be-kind | 80% | 100% | 0% | 0.89 |
253
+ | be-kind (ctx) | 80% | 100% | 0% | 0.89 |
254
+ | glin (enhanced) | 72% | 100% | 0% | 0.84 |
255
+ | glin (collapsed) | 72% | 100% | 0% | 0.84 |
256
+ | bad-words | 52% | 100% | 0% | 0.68 |
257
+
258
+ > All libraries tested against all 65 cases including French, German, Spanish, and Hindi. `leo + dict` benefits significantly from be-kind's multilingual dictionary, jumping from 34% to 82% recall. be-kind misses mild words (`damn`, `hell`) in default mode; `sensitiveMode: true` catches these. All libraries achieve 100% precision — when they flag something, it's always correct.
259
+
260
+ #### False positives / innocent words — 48 cases (clean only, lower FP rate is better)
261
+
262
+ Includes adversarial cases (`cum laude`, `Dick Van Dyke`, culinary `faggots`, Swedish `slut`). Recall and F1 are undefined (no profane cases).
263
+
264
+ | Library | FP Rate |
265
+ |---|---|
266
+ | glin (collapsed) | **19%** |
267
+ | glin (enhanced) | 21% |
268
+ | be-kind (ctx) | 21% |
269
+ | bad-words | 23% |
270
+ | leo + dict | 25% |
271
+ | be-kind | 27% |
272
+ | be-kind (sensitive) | 31% |
273
+
274
+ > be-kind's FP rate remains its most significant weakness — over-triggers on proper nouns, Latin phrases, and homographs. `sensitiveMode: true` worsens this. `be-kind (ctx)` with context analysis reduces FP rate from 27% to 21% by detecting innocent contexts (medical terms, proper nouns, quoted text). `leo + dict` at 25% shows that leo's simple substring matching creates more false positives when given a large dictionary.
275
+
276
+ #### Multi-language detection — 26 cases (Hinglish, French, German, Spanish, mixed)
277
+
278
+ | Library | Recall | Precision | FP Rate | F1 |
279
+ |---|---|---|---|---|
280
+ | be-kind | 100% | 100% | 0% | **1.00** |
281
+ | be-kind (sensitive) | 100% | 100% | 0% | **1.00** |
282
+ | leo + dict | 100% | 100% | 0% | **1.00** |
283
+ | be-kind (ctx) | 95% | 100% | 0% | 0.98 |
284
+ | glin (enhanced) | 95% | 100% | 0% | 0.98 |
285
+ | glin (collapsed) | 95% | 100% | 0% | 0.98 |
286
+ | bad-words | 62% | 100% | 0% | 0.76 |
287
+
288
+ > With be-kind's dictionary injected, leo + dict achieves 100% recall on multi-language cases — proving the dictionary is the key differentiator. be-kind (ctx) scores 95% — context analysis slightly reduces multi-language recall vs default be-kind.
289
+
290
+ #### Romanization — 30 cases (Hinglish, Bengali, Tamil, Telugu, Japanese)
291
+
292
+ | Library | Recall | Precision | FP Rate | F1 |
293
+ |---|---|---|---|---|
294
+ | leo + dict | 75% | 94% | 10% | **0.83** |
295
+ | be-kind | 80% | 84% | 30% | 0.82 |
296
+ | be-kind (sensitive) | 80% | 84% | 30% | 0.82 |
297
+ | be-kind (ctx) | 80% | 84% | 30% | 0.82 |
298
+ | glin (enhanced) | 15% | 100% | 0% | 0.26 |
299
+ | glin (collapsed) | 15% | 100% | 0% | 0.26 |
300
+ | bad-words | 0% | 0% | 10% | — |
301
+
302
+ > leo + dict edges out be-kind on F1 here (0.83 vs 0.82) thanks to a lower FP rate (10% vs 30%) despite slightly lower recall (75% vs 80%). be-kind's higher FP rate is a known limitation where clean romanized words collide with its dictionary. glin catches 15% with perfect precision but far less coverage.
303
+
304
+ #### Semantic context — 25 cases
305
+
306
+ | Library | Recall | Precision | FP Rate | F1 |
307
+ |---|---|---|---|---|
308
+ | be-kind (ctx) | 80% | 73% | 20% | **0.76** |
309
+ | leo + dict | 100% | 59% | 47% | 0.74 |
310
+ | glin (enhanced) | 90% | 53% | 53% | 0.67 |
311
+ | glin (collapsed) | 90% | 53% | 53% | 0.67 |
312
+ | be-kind (sensitive) | 100% | 48% | 73% | 0.65 |
313
+ | bad-words | 100% | 48% | 73% | 0.65 |
314
+ | be-kind | 80% | 47% | 60% | 0.59 |
315
+
316
+ > Semantic context is where all libraries struggle — precision drops below 50% for most. Cases include metalinguistic uses ("the word 'fuck' has uncertain origins"), negation ("she's not a bitch"), and medical context ("rectal cancer screening"). be-kind (ctx) achieves the best F1 (0.76) thanks to context-aware certainty adjustment — boosters confirm profane intent, reducers detect innocent contexts like quotation and negation. leo + dict achieves 100% recall but at the cost of a 47% FP rate.
317
+
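Context-aware certainty adjustment can be sketched as signed deltas applied to a base certainty when booster or reducer patterns match near a hit. The patterns and weights below are invented for illustration; the library's actual lists differ:

```typescript
// Hypothetical booster/reducer patterns with certainty deltas.
type ContextPattern = { regex: RegExp; delta: number };

const patterns: ContextPattern[] = [
  { regex: /\bnot (a|an)\b/i, delta: -0.3 },  // negation reducer
  { regex: /\bthe word ['"]/i, delta: -0.4 }, // metalinguistic reducer
  { regex: /\byou\b/i, delta: +0.2 },         // directed-at-person booster
];

// Apply every matching delta, clamping the result to [0, 1].
function adjustCertainty(text: string, base: number): number {
  let score = base;
  for (const p of patterns) {
    if (p.regex.test(text)) score += p.delta;
  }
  return Math.max(0, Math.min(1, score));
}

console.log(adjustCertainty("she's not a bitch", 0.8)); // ~0.5, may fall below a flagging threshold
console.log(adjustCertainty("you are a bitch", 0.8));   // ~1.0, boosted
```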
318
+ #### Repeated character evasion — 5 cases (`fuuuuuuuuck`, `cunnnnnnttttt`, etc.)
319
+
320
+ No clean cases in this category — FP rate is undefined.
321
+
322
+ | Library | Recall | Precision |
323
+ |---|---|---|
324
+ | glin (enhanced) | **100%** | 100% |
325
+ | glin (collapsed) | 40% | 100% |
326
+ | be-kind | 0% | — |
327
+ | be-kind (sensitive) | 0% | — |
328
+ | be-kind (ctx) | 0% | — |
329
+ | leo + dict | 0% | — |
330
+ | bad-words | 0% | — |
331
+
332
+ #### Concatenated / no-space evasion — 7 cases (`urASSHOLEbro`, `youFUCKINGidiot`, etc.)
333
+
334
+ | Library | Recall | Precision | FP Rate | F1 |
335
+ |---|---|---|---|---|
336
+ | be-kind | 20% | 100% | 0% | 0.33 |
337
+ | be-kind (sensitive) | 20% | 100% | 0% | 0.33 |
338
+ | be-kind (ctx) | 20% | 100% | 0% | 0.33 |
339
+ | leo + dict | 0% | — | 0% | — |
340
+ | bad-words | 0% | — | 0% | — |
341
+ | glin (enhanced) | 0% | — | 0% | — |
342
+ | glin (collapsed) | 0% | — | 0% | — |
343
+
344
+ #### Challenge cases — 19 cases (semantic disambiguation, embedded substrings, separator evasion)
345
+
346
+ Hard problems: `cock` as rooster, `ass` as donkey, Swedish `slut` = "end", `puta` in etymological discussion, profanity in concatenated strings, and separator-spaced evasion (`f u c k`, `f_u*c k`, `a.s.s.h.o.l.e`).
347
+
348
+ | Library | Recall | Precision | FP Rate | F1 |
349
+ |---|---|---|---|---|
350
+ | be-kind (ctx) | 60% | 75% | 22% | **0.67** |
351
+ | be-kind | 60% | 60% | 44% | 0.60 |
352
+ | be-kind (sensitive) | 60% | 60% | 44% | 0.60 |
353
+ | glin (enhanced) | 30% | 43% | 44% | 0.35 |
354
+ | leo + dict | 20% | 50% | 22% | 0.29 |
355
+ | bad-words | 20% | 33% | 44% | 0.25 |
356
+ | glin (collapsed) | 0% | 0% | 44% | — |
357
+
358
+ > be-kind (ctx) halves the FP rate on challenge cases (44% → 22%) by recognizing innocent contexts like "cock crowed at dawn" and "wild ass is an equine." Separator-spaced evasion cases (`f u c k`, `f_u*c k`, mixed separators) test the separator tolerance feature. These cases still require semantic understanding that no dictionary-based filter can fully solve — the strongest argument for LLM-assisted moderation as a second pass.
359
+
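A separator-tolerance pass like the one these cases target can be sketched as a pre-processing step that joins runs of single characters split by separators (a deliberately naive illustration; it will also join innocent sequences such as spelled-out acronyms):

```typescript
// Join runs of 3+ single letters separated by spaces, dots, dashes,
// underscores, or asterisks: "f u c k" -> "fuck", "a.s.s.h.o.l.e" -> "asshole".
function collapseSeparators(text: string): string {
  return text.replace(
    /\b(?:[a-z][\s._*-]+){2,}[a-z]\b/gi,
    (run) => run.replace(/[\s._*-]+/g, "")
  );
}

console.log(collapseSeparators("f u c k you"));    // "fuck you"
console.log(collapseSeparators("a.s.s.h.o.l.e"));  // "asshole"
console.log(collapseSeparators("plain sentence")); // unchanged
```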
360
+ #### Overall summary — micro-averaged across all 225 cases
361
+
362
+ | Library | Recall | Precision | FP Rate | F1 | TP | FN | FP | TN |
363
+ |---|---|---|---|---|---|---|---|---|
364
+ | be-kind (sensitive) | **86%** | 76% | 32% | 0.81 | 104 | 17 | 33 | 71 |
365
+ | be-kind (ctx) | 75% | **83%** | **17%** | **0.79** | 91 | 30 | 18 | 86 |
366
+ | be-kind | 76% | 76% | 28% | 0.76 | 92 | 29 | 29 | 75 |
367
+ | leo + dict | 74% | 80% | 21% | 0.76 | 89 | 32 | 22 | 82 |
368
+ | glin (enhanced) | 63% | 78% | 21% | 0.70 | 76 | 45 | 22 | 82 |
369
+ | glin (collapsed) | 58% | 77% | 20% | 0.66 | 70 | 51 | 21 | 83 |
370
+ | bad-words | 42% | 65% | 26% | 0.51 | 51 | 70 | 27 | 77 |
371
+
372
+ > Micro-averaged: all 225 cases (121 profane, 104 clean) aggregated into one confusion matrix per library, then recall/precision/F1 computed once. No category weighting artifacts. All glin variants use all 24 supported languages. `leo + dict` with be-kind's 34K dictionary achieves F1 parity with default be-kind (0.76) — proving the dictionary is the core differentiator. be-kind (ctx) achieves the best balance of precision (83%) and recall (75%) with the lowest FP rate (17%) among be-kind variants, thanks to context-aware certainty adjustment via booster and reducer patterns.
373
+
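The summary row values can be reproduced directly from the confusion-matrix columns; for example, recomputing the be-kind (ctx) row:

```typescript
// Micro-averaged metrics from a single confusion matrix.
function metrics(tp: number, fn: number, fp: number, tn: number) {
  const recall = tp / (tp + fn);    // TP / all profane cases
  const precision = tp / (tp + fp); // TP / all flagged cases
  const fpRate = fp / (fp + tn);    // FP / all clean cases
  const f1 = (2 * precision * recall) / (precision + recall);
  return { recall, precision, fpRate, f1 };
}

// be-kind (ctx) row from the table above: TP=91, FN=30, FP=18, TN=86.
const m = metrics(91, 30, 18, 86);
console.log(m.recall.toFixed(2));    // "0.75"
console.log(m.precision.toFixed(2)); // "0.83"
console.log(m.fpRate.toFixed(2));    // "0.17"
console.log(m.f1.toFixed(2));        // "0.79"
```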
374
+ Run the accuracy benchmark yourself:
375
+ ```bash
376
+ bun run benchmark:accuracy
377
+ ```
378
+
379
+ ---
380
+
381
+ ## API Reference & Examples
382
+
383
+ ### `check(text: string): boolean`
384
+
385
+ Returns `true` if the text contains any profanity.
386
+
387
+ ```typescript
388
+ profanity.check('This is a clean sentence.'); // false
389
+ profanity.check('This is a bullshit sentence.'); // true
390
+ profanity.check('What the f#ck is this?'); // true (leet-speak)
391
+ profanity.check('यह एक चूतिया परीक्षण है।'); // true (Hindi)
392
+ ```
393
+
394
+ ---
395
+
396
+ ### `detect(text: string): ProfanityDetectionResult`
397
+
398
+ Returns a detailed result:
399
+
400
+ - `hasProfanity: boolean`
401
+ - `detectedWords: string[]` (actual matched words)
402
+ - `cleanedText: string` (character-masked)
403
+ - `severity: ProfanitySeverity` (`MILD`, `MODERATE`, `SEVERE`, `EXTREME`)
404
+ - `positions: Array<{ word: string, start: number, end: number }>`
405
+
406
+ ```typescript
407
+ const result = profanity.detect('This is fucking bullshit and chutiya.');
408
+ console.log(result.hasProfanity); // true
409
+ console.log(result.detectedWords); // ['fucking', 'bullshit', 'chutiya']
410
+ console.log(result.severity); // 3 (SEVERE)
411
+ console.log(result.cleanedText); // "This is ******* ******** and ******."
412
+ console.log(result.positions); // e.g. [{word: 'fucking', start: 8, end: 15}, ...]
413
+ ```
414
+
415
+ ---
416
+
417
+ ### `clean(text: string, placeholder?: string): string`
418
+
419
+ Replace each character of profane words with a placeholder (default: `*`).
420
+
421
+ ```typescript
422
+ profanity.clean('This contains bullshit.'); // "This contains ********."
423
+ profanity.clean('This contains bullshit.', '#'); // "This contains ########."
424
+ profanity.clean('यह एक चूतिया परीक्षण है।'); // e.g. "यह एक ***** परीक्षण है।"
425
+ ```
426
+
427
+ ---
428
+
429
+ ### `cleanWithPlaceholder(text: string, placeholder?: string): string`
430
+
431
+ Replace each profane word with a single placeholder (default: `***`).
433
+
434
+ ```typescript
435
+ profanity.cleanWithPlaceholder('This contains bullshit.'); // "This contains ***."
436
+ profanity.cleanWithPlaceholder('This contains bullshit.', '[CENSORED]'); // "This contains [CENSORED]."
437
+ profanity.cleanWithPlaceholder('यह एक चूतिया परीक्षण है।', '####'); // e.g. "यह एक #### परीक्षण है।"
438
+ ```
439
+
440
+ ---
441
+
442
+ ### `add(word: string | string[]): void`
443
+
444
+ Add a word or an array of words to the profanity filter.
445
+
446
+ ```typescript
447
+ profanity.add('badword123');
448
+ profanity.check('This is badword123.'); // true
449
+
450
+ profanity.add(['mierda', 'puta']);
451
+ profanity.check('Esto es mierda.'); // true (Spanish)
452
+ profanity.check('Qué puta situación.'); // true
453
+ ```
454
+
455
+ ---
456
+
457
+ ### `remove(word: string | string[]): void`
458
+
459
+ Remove a word or an array of words from the profanity filter.
460
+
461
+ ```typescript
462
+ profanity.remove('bullshit');
463
+ profanity.check('This is bullshit.'); // false
464
+
465
+ profanity.remove(['mierda', 'puta']);
466
+ profanity.check('Esto es mierda.'); // false
467
+ ```
468
+
469
+ ---
470
+
471
+ ### `addToWhitelist(words: string[]): void`
472
+
473
+ Whitelist words so they are never flagged as profane.
474
+
475
+ ```typescript
476
+ profanity.addToWhitelist(['fuck', 'idiot', 'shit']);
477
+ profanity.check('He is a fucking idiot.'); // false
478
+ profanity.check('Fuck this shit.'); // false
479
+ // Remove from whitelist to restore detection
480
+ profanity.removeFromWhitelist(['fuck', 'idiot', 'shit']);
481
+ ```
482
+
483
+ ---
484
+
485
+ ### `removeFromWhitelist(words: string[]): void`
486
+
487
+ Remove words from the whitelist so they can be detected again.
488
+
489
+ ```typescript
490
+ profanity.removeFromWhitelist(['anal']);
491
+ ```
492
+
493
+ ---
494
+
495
+ ### `setPlaceholder(placeholder: string): void`
496
+
497
+ Set the default placeholder character for `clean()`.
498
+
499
+ ```typescript
500
+ profanity.setPlaceholder('#');
501
+ profanity.clean('This is bullshit.'); // "This is ########."
502
+ profanity.setPlaceholder('*'); // Reset to default
503
+ ```
504
+
505
+ ---
506
+
507
+ ### `updateConfig(options: Partial<BeKindOptions>): void`
508
+
509
+ Change configuration at runtime.
510
+ Options include: `enableLeetSpeak`, `caseSensitive`, `strictMode`, `detectPartialWords`, `defaultPlaceholder`, `languages`, `whitelistWords`.
511
+
512
+ ```typescript
513
+ profanity.updateConfig({ caseSensitive: true, enableLeetSpeak: false });
514
+ profanity.check('FUCK'); // false (if caseSensitive)
515
+ profanity.updateConfig({ caseSensitive: false, enableLeetSpeak: true });
516
+ profanity.check('f#ck'); // true
517
+ ```
518
+
519
+ ---
520
+
521
+ ### `loadLanguage(language: string): boolean`
522
+
523
+ Load a built-in language.
524
+
525
+ ```typescript
526
+ profanity.loadLanguage('french');
527
+ profanity.check('Ce mot est merde.'); // true
528
+ ```
529
+
530
+ ---
531
+
532
+ ### `loadLanguages(languages: string[]): number`
533
+
534
+ Load multiple built-in languages at once.
535
+
536
+ ```typescript
537
+ profanity.loadLanguages(['english', 'french', 'german']);
538
+ profanity.check('Das ist scheiße.'); // true (German)
539
+ ```
540
+
541
+ ---
542
+
543
+ ### `loadIndianLanguages(): number`
544
+
545
+ Convenience: Load all major Indian language packs.
546
+
547
+ ```typescript
548
+ profanity.loadIndianLanguages();
549
+ profanity.check('यह एक बेंगाली गाली है।'); // true (Bengali)
550
+ profanity.check('This is a Tamil profanity: புண்டை'); // true
551
+ ```
552
+
553
+ ---
554
+
555
+ ### `loadCustomDictionary(name: string, words: string[]): void`
556
+
557
+ Add your own dictionary as an additional language.
558
+
559
+ ```typescript
560
+ profanity.loadCustomDictionary('swedish', ['fan', 'jävla', 'skit']);
561
+ profanity.loadLanguage('swedish');
562
+ profanity.check('Det här är skit.'); // true
563
+ ```
564
+
565
+ ---
566
+
567
+ ### `getLoadedLanguages(): string[]`
568
+
569
+ Returns the names of all currently loaded language packs.
570
+
571
+ ```typescript
572
+ console.log(profanity.getLoadedLanguages()); // ['english', 'hindi', ...]
573
+ ```
574
+
575
+ ---
576
+
577
+ ### `getAvailableLanguages(): string[]`
578
+
579
+ Returns the names of all available built-in language packs.
580
+
581
+ ```typescript
582
+ console.log(profanity.getAvailableLanguages());
583
+ // ['english', 'hindi', 'french', 'german', 'spanish', 'bengali', 'tamil', 'telugu', 'brazilian']
584
+ ```
585
+
586
+ ---
587
+
588
+ ### `clearList(): void`
589
+
590
+ Remove all loaded languages and dynamic words (start with a clean filter).
591
+
592
+ ```typescript
593
+ profanity.clearList();
594
+ profanity.check('fuck'); // false
595
+ profanity.loadLanguage('english');
596
+ profanity.check('fuck'); // true
597
+ ```
598
+
599
+ ---
600
+
601
+ ### `getConfig(): Partial<BeKindOptions>`
602
+
603
+ Get the current configuration.
604
+
605
+ ```typescript
606
+ console.log(profanity.getConfig());
607
+ /*
608
+ {
609
+ defaultPlaceholder: '*',
610
+ enableLeetSpeak: true,
611
+ caseSensitive: false,
612
+ strictMode: false,
613
+ detectPartialWords: false,
614
+ languages: [...],
615
+ whitelistWords: [...]
616
+ }
617
+ */
618
+ ```
619
+
620
+ ---
621
+
622
+ ## Configuration File Structure
623
+
624
+ BeKind supports JSON-based configuration for easy setup and deployment. The config file covers all algorithm and detection options.
625
+
626
+ ### Full Configuration Schema
627
+
628
+ ```jsonc
629
+ {
630
+ "algorithm": {
631
+ "matching": "trie" | "aho-corasick" | "hybrid", // Algorithm selection
632
+ "useAhoCorasick": boolean, // Enable Aho-Corasick
633
+ "useBloomFilter": boolean // Enable Bloom Filter
634
+ },
635
+ "bloomFilter": {
636
+ "enabled": boolean, // Enable/disable
637
+ "expectedItems": number, // Expected dictionary size (default: 10000)
638
+ "falsePositiveRate": number // Acceptable false positive rate (default: 0.01)
639
+ },
640
+ "ahoCorasick": {
641
+ "enabled": boolean, // Enable/disable
642
+ "prebuild": boolean // Prebuild automaton (default: true)
643
+ },
644
+ "profanityDetection": {
645
+ "enableLeetSpeak": boolean, // Detect l33t speak (default: true)
646
+ "caseSensitive": boolean, // Case sensitive matching (default: false)
647
+ "strictMode": boolean, // Require word boundaries (default: false)
648
+ "detectPartialWords": boolean, // Detect within words (default: false)
649
+ "defaultPlaceholder": string // Default censoring character (default: "*")
650
+ },
651
+ "performance": {
652
+ "enableCaching": boolean, // Enable result cache (default: false)
653
+ "cacheSize": number // Cache size limit (default: 1000)
654
+ }
655
+ }
656
+ ```
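For intuition on the `bloomFilter` options: the standard Bloom filter sizing formulas relate `expectedItems` (n) and `falsePositiveRate` (p) to the bit-array size and hash count. This sketch shows the math only, not the library's internal implementation:

```typescript
// Standard Bloom filter sizing:
//   bits m   = -n * ln(p) / (ln 2)^2
//   hashes k = (m / n) * ln 2
function bloomSizing(expectedItems: number, falsePositiveRate: number) {
  const m = Math.ceil(
    (-expectedItems * Math.log(falsePositiveRate)) / Math.LN2 ** 2,
  );
  const k = Math.round((m / expectedItems) * Math.LN2);
  return { bits: m, hashes: k };
}

// With the defaults (10,000 items, 1% false positives) this works out to
// roughly 96k bits (~12 KB) and 7 hash functions.
const defaults = bloomSizing(10000, 0.01);
```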
657
+
658
+ ### Pre-configured Templates
659
+
660
+ #### High Performance (Large Texts)
661
+
662
+ ```json
663
+ {
664
+ "algorithm": { "matching": "aho-corasick" },
665
+ "ahoCorasick": { "enabled": true, "prebuild": true },
666
+ "profanityDetection": { "enableLeetSpeak": true }
667
+ }
668
+ ```
669
+
670
+ #### Balanced (Production)
671
+
672
+ ```json
673
+ {
674
+ "algorithm": {
675
+ "matching": "hybrid",
676
+ "useAhoCorasick": true,
677
+ "useBloomFilter": true
678
+ },
679
+ "profanityDetection": { "enableLeetSpeak": true },
680
+ "performance": { "enableCaching": true, "cacheSize": 1000 }
681
+ }
682
+ ```
683
+
684
+ ### Using Config Files
685
+
686
+ **Step 1: Generate Config Files**
687
+
688
+ ```bash
689
+ # Run this in your project directory
690
+ npx bekindprofanityfilter
691
+
692
+ # Output:
693
+ # ✅ BeKind configuration files created!
694
+ #
695
+ # Created files:
696
+ # 📄 bekindprofanityfilter.config.json - Main configuration
697
+ # 📄 config.schema.json - JSON schema for IDE autocomplete
698
+ ```
699
+
700
+ **Step 2: Load Config in Your Code**
701
+
702
+ ```typescript
703
+ // ES Modules / TypeScript
704
+ import { BeKind } from 'bekindprofanityfilter';
705
+ import config from './bekindprofanityfilter.config.json';
706
+
707
+ const filter = BeKind.fromConfig(config);
708
+ ```
709
+
710
+ ```javascript
711
+ // CommonJS (Node.js)
712
+ const { BeKind } = require('bekindprofanityfilter');
713
+ const config = require('./bekindprofanityfilter.config.json');
714
+
715
+ const filter = BeKind.fromConfig(config);
716
+ ```
717
+
718
+ **Step 3: Customize Config**
719
+
720
+ Edit `bekindprofanityfilter.config.json` to enable/disable features. Your IDE will provide autocomplete thanks to the JSON schema!
721
+
722
+ ---
723
+
724
+ ## Cross-Language Innocence Scoring
725
+
726
+ Many words are profane in one language but perfectly innocent in another. For example, "slut" means "end/finish" in Swedish, "fart" means "speed" in Scandinavian languages, and "bite" is a common English word that's vulgar in French. BeKind handles these cross-language collisions automatically using a multi-layer language detection and scoring system.
727
+
728
+ ### Language Detection Architecture
729
+
730
+ BeKind uses a hybrid language detection system with three layers:
731
+
732
+ **1. ELD N-gram Detection** (`eld/small`)
733
+ We integrate [Nito-ELD](https://github.com/nitotm/efficient-language-detector), a corpus-trained byte-level n-gram language detector supporting 60+ languages. ELD analyzes character sequences (trigrams) and compares them against frequency profiles trained on massive corpora. It provides both per-word scores and full-text Bayesian priors.
734
+
735
+ *Limitation:* ELD works on UTF-8 byte patterns, so it struggles with accent-stripped text and frequently confuses closely related languages (Swedish ↔ German, Norwegian ↔ Danish). This is why we don't rely on ELD alone.
736
+
737
+ **2. Trie Vocabulary Detection** (18 languages)
738
+ Per-language tries built from ~200-350 common words each. When a word is looked up, the trie returns a match score (0-1) indicating how strongly the word belongs to that language. Supports accent-tolerant matching (e.g., "gurultu" matches Turkish "gürültü" with a small penalty).
739
+
740
+ **3. Script Detection**
741
+ Unicode codepoint ranges map characters directly to language families (e.g., Cyrillic → Russian, Devanagari → Hindi). This is deterministic and instant, providing strong signal for non-Latin scripts.
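Script detection can be pictured as a lookup over Unicode block ranges. A minimal sketch, where the ranges and language mapping are simplified assumptions rather than BeKind's actual tables:

```typescript
// Simplified script-range table: [first codepoint, last codepoint, language].
const SCRIPT_RANGES: Array<[number, number, string]> = [
  [0x0400, 0x04ff, "ru"], // Cyrillic -> Russian
  [0x0600, 0x06ff, "ar"], // Arabic
  [0x0900, 0x097f, "hi"], // Devanagari -> Hindi
  [0x0b80, 0x0bff, "ta"], // Tamil
];

// Returns the language for the first non-Latin codepoint found, or null.
function detectScript(word: string): string | null {
  for (const ch of word) {
    const cp = ch.codePointAt(0)!;
    for (const [lo, hi, lang] of SCRIPT_RANGES) {
      if (cp >= lo && cp <= hi) return lang; // deterministic, instant
    }
  }
  return null; // Latin or unmapped script
}
```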
742
+
743
+ ### The `scoreWord()` Function
744
+
745
+ For each word, `scoreWord()` combines all three layers into a single `Record<string, number>` mapping language codes to confidence scores:
746
+
747
+ ```
748
+ scoreWord("slut") → { sv: 0.8, en: 0.6, de: 0.3, ... }
749
+ ↑ Swedish trie match (exact word in vocabulary)
750
+ ↑ English trie match (partial/common word)
751
+ ↑ German ELD n-gram signal
752
+ ```
753
+
754
+ Layer weights: Script (1.0) > Trie (0.8) > ELD (0.6) > Suffix (0.3+) > Prefix (0.3+)
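As an illustration of how the weighted layers might combine into a single per-language score, here is a sketch that takes the strongest weighted signal; the combination rule is an assumption, and the suffix/prefix layers are omitted for brevity:

```typescript
// Hypothetical per-layer signals for one word, each in [0, 1].
interface LayerSignals {
  script?: number; // weight 1.0
  trie?: number;   // weight 0.8
  eld?: number;    // weight 0.6
}

// Combine layers for one language by taking the strongest weighted signal,
// so an exact trie match (1.0 * 0.8) yields the 0.8 seen in the diagram above.
function combineLayers(s: LayerSignals): number {
  return Math.max(
    (s.script ?? 0) * 1.0,
    (s.trie ?? 0) * 0.8,
    (s.eld ?? 0) * 0.6,
  );
}
```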
755
+
756
+ ### The `detectLanguages()` Function
757
+
758
+ For full text, `detectLanguages()` runs `scoreWord()` on every word and aggregates results into document-level proportions:
759
+
760
+ ```typescript
761
+ detectLanguages("Programmet börjar klockan åtta och tar slut vid tio")
762
+ // → { languages: [{ language: "de", proportion: 0.6 }, { language: "sv", proportion: 0.3 }, ...] }
763
+ ```
764
+
765
+ *Note:* ELD often classifies Swedish as German due to n-gram similarity. The confusion map (see below) compensates for this.
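The aggregation step can be sketched as summing per-word scores and normalizing into document-level proportions (a simplified stand-in for the real implementation):

```typescript
type Scores = Record<string, number>;

// Aggregate per-word language scores (as produced by a scoreWord-style
// function) into sorted document-level proportions.
function aggregate(
  wordScores: Scores[],
): Array<{ language: string; proportion: number }> {
  const totals: Scores = {};
  for (const scores of wordScores) {
    for (const [lang, s] of Object.entries(scores)) {
      totals[lang] = (totals[lang] ?? 0) + s;
    }
  }
  const sum = Object.values(totals).reduce((a, b) => a + b, 0) || 1;
  return Object.entries(totals)
    .map(([language, t]) => ({ language, proportion: t / sum }))
    .sort((a, b) => b.proportion - a.proportion);
}

const result = aggregate([{ sv: 0.8 }, { sv: 0.8, en: 0.4 }]);
```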
766
+
767
+ ### Two-Layer Signal Combination
768
+
769
+ When a collision word is detected, we combine word-level and document-level signals using a **1.5:1 weighted average** favoring the document signal:
770
+
771
+ ```
772
+ amplified[lang] = (scoreWord[lang] × 1.0 + docSignal[lang] × 1.5) / 2.5
773
+ ```
774
+
775
+ The document signal is favored because it provides broader context — a single word's language score can be ambiguous, but the surrounding text usually makes the language clear.
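The weighted average is straightforward to express directly in code; the comments reuse the figures from the end-to-end flow example later in this section:

```typescript
// 1.5:1 weighted average of the word-level score and the document-level
// signal, favoring the document signal for broader context.
function amplify(wordScore: number, docSignal: number): number {
  return (wordScore * 1.0 + docSignal * 1.5) / 2.5;
}

amplify(0.8, 0.0); // ≈ 0.32 (strong word signal, no document signal)
amplify(0.0, 0.7); // ≈ 0.42 (document signal alone)
amplify(0.6, 0.2); // ≈ 0.36
```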
776
+
777
+ ### The Confusion Map
778
+
779
+ ELD's n-gram model frequently misclassifies Scandinavian languages as German (they share many character patterns). The confusion map treats German signal as partial evidence of Scandinavian:
780
+
781
+ ```
782
+ effectiveAmp["sv"] = max(directAmp["sv"], confusedAmp["de"] × 0.8)
783
+ ```
784
+
785
+ The 0.8 discount prevents over-attribution — German text shouldn't fully count as Swedish, but mostly-German signal in a Scandinavian context should still trigger dampening.
786
+
787
+ ### Certainty Adjustment Formula
788
+
789
+ Once we have the amplified language signals, `adjustCertaintyForLanguage()` adjusts the word's certainty score:
790
+
791
+ ```
792
+ If innocent language dominates (innocentAmp > profaneAmp):
793
+ adjusted = certainty × (1 - dampeningFactor × innocentAmp) ← reduces certainty
794
+
795
+ If profane language dominates (profaneAmp > innocentAmp):
796
+ adjusted = certainty × (1 + dampeningFactor × profaneAmp) ← increases certainty
797
+
798
+ Result clamped to [0, 5]
799
+ ```
800
+
801
+ The `dampeningFactor` (0-1) controls how aggressively the adjustment works per collision word. Words that are genuinely innocent in another language (e.g., "slut" in Swedish, df=0.9) get heavy dampening, while dangerous dual-meaning words (e.g., "cock" as rooster, df=0.1) barely adjust.
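Combining the confusion map with the adjustment formula, using the constants documented above (the function names here are illustrative, not the library's API):

```typescript
// Confusion map: German ELD signal counts as partial Scandinavian evidence,
// discounted by 0.8 to prevent over-attribution.
function effectiveAmp(direct: number, confusedGerman: number): number {
  return Math.max(direct, confusedGerman * 0.8);
}

// Certainty adjustment: dampen when the innocent language dominates,
// boost when the profane language dominates; clamp to [0, 5].
function adjustCertainty(
  certainty: number,
  innocentAmp: number,
  profaneAmp: number,
  dampeningFactor: number,
): number {
  let adjusted = certainty;
  if (innocentAmp > profaneAmp) {
    adjusted = certainty * (1 - dampeningFactor * innocentAmp);
  } else if (profaneAmp > innocentAmp) {
    adjusted = certainty * (1 + dampeningFactor * profaneAmp);
  }
  return Math.min(5, Math.max(0, adjusted));
}
```

With the flow example's numbers, `effectiveAmp(0.32, 0.42)` gives 0.336, and dampening a certainty of 4 with df 0.9 gives `4 * (1 - 0.9 * 0.336)` ≈ 2.79, below the flag threshold.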
802
+
803
+ ### End-to-End Flow
804
+
805
+ ```
806
+ Text: "Programmet börjar klockan åtta och tar slut vid tio"
807
+ ^^^^
808
+ "slut" detected (en: s:3 c:4)
809
+
810
+ 1. Collision word matched → check innocent-words map
811
+ "slut" → innocent in Swedish (meaning: "end/finish", dampeningFactor: 0.9)
812
+
813
+ 2. Language detection triggered (lazy — only runs on collision matches)
814
+ Document signal: detectLanguages() → { de: 0.7, en: 0.2, ... }
815
+ Word signal: scoreWord("slut") → { sv: 0.8, en: 0.6, ... }
816
+
817
+ 3. Weighted average (1.5:1 doc:word ratio)
818
+ amplified["sv"] = (0.8 × 1.0 + 0.0 × 1.5) / 2.5 = 0.32
819
+ amplified["de"] = (0.0 × 1.0 + 0.7 × 1.5) / 2.5 = 0.42
820
+ amplified["en"] = (0.6 × 1.0 + 0.2 × 1.5) / 2.5 = 0.36
821
+
822
+ 4. Confusion map: German signal → partial Swedish evidence
823
+ effectiveAmp["sv"] = max(0.32, 0.42 × 0.8) = 0.336
824
+
825
+ 5. Innocent language (sv: 0.336) > Profane language (en: 0.36)?
826
+ → Close, but Swedish trie words boost sv signal further
827
+ → Certainty dampened: 4 × (1 - 0.9 × 0.336) = 2.79
828
+ → Below flag threshold (s:3 needs c:3+) → NOT FLAGGED ✓
829
+ ```
830
+
831
+ ### Key Features
832
+
833
+ - **29 collision words** mapped across 8 languages (English, Swedish, Norwegian, Danish, German, Dutch, French, Spanish)
834
+ - **Per-word dampening factors** control adjustment strength:
835
+ - `0.9` = heavy dampening (genuinely innocent cross-language, e.g., "slut" in Swedish)
836
+ - `0.1` = barely dampens (almost always used as profanity, e.g., "cock" in English)
837
+ - **Lazy language detection** — `detectLanguages()` only runs when a collision word is matched (zero performance cost for non-collision text)
838
+ - **Confusion map** — handles ELD n-gram detector's known misclassifications (e.g., Swedish often classified as German)
839
+ - **Swedish trie vocabulary** — ~350 common words for reliable word-level Swedish detection
840
+
841
+ ### Collision Words
842
+
843
+ | Word | Profane In | Innocent In | Meaning |
844
+ |------|-----------|-------------|---------|
845
+ | slut | English | Swedish, Danish | end/finish |
846
+ | fart | English | Swedish, Norwegian, Danish | speed |
847
+ | hell | English | Swedish, Norwegian | luck |
848
+ | prick | English | Swedish | dot/point |
849
+ | kock | English | Swedish | chef/cook |
850
+ | bra | English | Swedish | good |
851
+ | bite | French | English | to use teeth |
852
+ | con | French | English, Spanish | prefix/with |
853
+ | pet | French | English | animal companion |
854
+ | mist | Dutch/German | English | fog/haze |
855
+ | hoe | English | Dutch | how |
856
+ | kant | Dutch | German | edge |
857
+ | ass | English | English | donkey (df: 0.15) |
858
+ | cock | English | English | rooster (df: 0.1) |
859
+
860
+ *Full list in `src/languages/innocent-words.ts`*
861
+
862
+ ### Same-Language Collisions
863
+
864
+ Words like "ass" (donkey) and "cock" (rooster) are both profane and innocent in English. Since the profane and innocent language signals are equal, the system cannot disambiguate — these always remain flagged. This is a known limitation that would require semantic context analysis to solve.
865
+
866
+ ### Tested Scenarios
867
+
868
+ The challenge test suite (`tests/challenge-tests.test.ts`) validates 32 real-world scenarios:
869
+
870
+ | Category | Tests | Passing | Description |
871
+ |----------|-------|---------|-------------|
872
+ | Swedish text | 7 | 7 | News, recipes, driving, email, school contexts |
873
+ | Norwegian/Danish | 4 | 4 | Via confusion map cross-detection |
874
+ | Mixed-language | 5 | 4 | Code-switching, bilingual documents |
875
+ | Same-language (en→en) | 5 | 3 | Donkey/rooster/garden contexts |
876
+ | Threshold boundaries | 3 | 3 | Minimal context, short text |
877
+ | Adversarial inputs | 4 | 4 | Swedish padding attacks, Unicode tricks |
878
+ | Missing language pairs | 4 | 3 | Dutch, German, Italian, Portuguese |
879
+
880
+ **4 skipped tests** document unsolved challenges requiring semantic analysis or additional language support.
881
+
882
+ ---
883
+
884
+ ## Severity Levels
885
+
886
+ Severity reflects the number and variety of detected profanities:
887
+
888
+ | Level | Enum Value | Description |
889
+ |-----------|------------|-----------------------------------------|
890
+ | MILD      | 1          | 1 unique or total profane word          |
891
+ | MODERATE  | 2          | 2 unique or total profane words         |
892
+ | SEVERE    | 3          | 3 unique or total profane words         |
893
+ | EXTREME   | 4          | 4+ unique or 5+ total profane words     |
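The thresholds in the table can be sketched as a simple mapping; the exact tie-breaking between unique and total counts is an assumption for illustration:

```typescript
enum ProfanitySeverity {
  MILD = 1,
  MODERATE = 2,
  SEVERE = 3,
  EXTREME = 4,
}

// Map counts of detected profanities to a severity level, checking the
// highest tier first.
function severityFor(uniqueWords: number, totalWords: number): ProfanitySeverity {
  if (uniqueWords >= 4 || totalWords >= 5) return ProfanitySeverity.EXTREME;
  if (uniqueWords >= 3 || totalWords >= 3) return ProfanitySeverity.SEVERE;
  if (uniqueWords >= 2 || totalWords >= 2) return ProfanitySeverity.MODERATE;
  return ProfanitySeverity.MILD;
}
```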
894
+
895
+ ---
896
+
897
+ ## Language Support
898
+
899
+ - **Profanity Dictionaries (15):** English, Hindi, French, German, Spanish, Italian, Brazilian Portuguese, Russian, Arabic, Chinese, Japanese, Korean, Bengali, Tamil, Telugu
900
+ - **Language Detection Trie (18):** All 15 above + Dutch, Turkish, Swedish (used for innocence scoring, not profanity detection)
901
+ - **Cross-Language Innocence Scoring:** English, Swedish, Norwegian, Danish, German, Dutch, French, Spanish
902
+ - **Scripts:** Latin/Roman, Devanagari, Tamil, Telugu, Bengali, Cyrillic, Arabic, CJK, etc.
903
+ - **Mixed Content:** Handles mixed-language and code-switched sentences with language-aware scoring.
904
+
905
+ ```typescript
906
+ profanity.check('This is bullshit and चूतिया.'); // true (mixed English/Hindi)
907
+ profanity.check('Ce mot est merde and पागल.'); // true (French/Hindi)
908
+ profanity.check('Isso é uma merda.'); // true (Brazilian Portuguese)
909
+ ```
910
+
911
+ ---
912
+
913
+ ## Use Exported Wordlists
914
+
915
+ To access sample words for a language (for UIs, admin panels, etc.):
916
+
917
+ ```typescript
918
+ import { englishBadWords, hindiBadWords } from 'bekindprofanityfilter';
919
+ console.log(englishBadWords.slice(0, 5)); // ["fuck", "shit", ...]
920
+ ```
921
+
922
+ ---
923
+
924
+ ## Security
925
+
926
+ - **No wordlist exposure:** For security and encapsulation there is no `.list()` function; use the exported word arrays for samples.
927
+ - **Trie-based:** Scales easily to 50,000+ words.
928
+ - **Handles leet-speak:** Catches obfuscated variants like `f#ck` and `a55hole`.
929
+
930
+ ---
931
+
932
+ ## Full Example
933
+
934
+ ```typescript
935
+ import profanity, { ProfanitySeverity } from 'bekindprofanityfilter';
936
+
937
+
938
+ // Multi-language detection
939
+ profanity.loadLanguages(['english', 'french', 'tamil']);
940
+ console.log(profanity.check('Ce mot est merde.')); // true
941
+
942
+ // Leet-speak detection
943
+ console.log(profanity.check("You're a f#cking a55hole!")); // true
944
+
945
+ // Whitelisting
946
+ profanity.addToWhitelist(['anal', 'ass']);
947
+ console.log(profanity.check('He is an associate professor.')); // false
948
+
949
+ // Severity
950
+ const result = profanity.detect('This is fucking bullshit and chutiya.');
951
+ console.log(ProfanitySeverity[result.severity]); // "SEVERE"
952
+
953
+ // Custom dictionary
954
+ profanity.loadCustomDictionary('pirate', ['barnacle-head', 'landlubber']);
955
+ profanity.loadLanguage('pirate');
956
+ console.log(profanity.check('You barnacle-head!')); // true
957
+
958
+ // Placeholder configuration
959
+ profanity.setPlaceholder('#');
960
+ console.log(profanity.clean('This is bullshit.')); // "This is ########."
961
+ profanity.setPlaceholder('*'); // Reset
962
+ ```
963
+
964
+ ---
965
+
966
+ ## FAQ
967
+
968
+ **Q: How do I see all loaded profanities?**
969
+ A: For security, the internal word list is not exposed. Use `englishBadWords` etc. for samples.
970
+
971
+ **Q: How do I reset the filter?**
972
+ A: Use `clearList()` and reload languages/dictionaries.
973
+
974
+ **Q: Is this safe for browser and Node.js?**
975
+ A: Yes! BeKind works in both browsers and Node.js.
976
+
977
+ ---
978
+
979
+ ## Middleware Examples
980
+
981
+ **Looking for Express.js/Node.js middleware to use BeKind in your API or chat app?**
982
+ **Check the [`examples/`](./examples/) folder for ready-to-copy middleware and integration samples.**
983
+
984
+ ---
985
+
986
+ ## Roadmap
987
+
988
+ - ✅ Cross-language innocence scoring (collision word disambiguation)
989
+ - ✅ Multi-language detection trie (18 languages)
990
+ - ✅ Language confusion map for Scandinavian/Germanic disambiguation
991
+ - ✅ Additional language packs (Arabic, Russian, Japanese, Korean, Chinese, Dutch)
992
+ - ✅ Romanization detection (Hinglish and other transliterated scripts)
993
+ - 🚧 Norwegian and Danish trie vocabularies (currently covered via confusion map)
994
+ - 🚧 Repeat character compression (normalize "fuuuuccckkkk" → "fuck" before matching, avoiding the need to enumerate elongations in the dictionary)
995
+ - 🚧 Phonetic matching (sounds-like detection)
996
+ - 🚧 Plugin system for custom detection algorithms
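The planned repeat-character compression could be as simple as collapsing runs of three or more identical characters before matching (a naive sketch, not the eventual implementation):

```typescript
// Collapse runs of 3+ identical characters down to one, so elongated
// obfuscations reduce to a dictionary-matchable form while legitimate
// double letters ("hello", "cool") are left untouched.
function compressRepeats(input: string): string {
  return input.replace(/(.)\1{2,}/g, "$1");
}

compressRepeats("fuuuuccckkkk"); // "fuck"
compressRepeats("hello");       // "hello" (double "l" preserved)
```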
997
+
998
+ ---
999
+
1000
+ ## License
1001
+
1002
+ MIT — See [LICENSE](https://github.com/grassroots-labs-org/be-kind-profanity-filter/blob/main/LICENSE)
1003
+
1004
+ This project is a fork of [BeKind](https://github.com/ayush-jadaun/allprofanity) by Ayush Jadaun, also licensed under MIT.
1005
+
1006
+ ---
1007
+
1008
+ ## Contributing
1009
+
1010
+ We welcome contributions! Please see our [CONTRIBUTORS.md](./CONTRIBUTORS.md) for:
1011
+
1012
+ - How to add your name to our contributors list
1013
+ - Guidelines for adding new languages
1014
+ - Test requirements (must include passing test screenshots in PRs)
1015
+ - Code of conduct and PR guidelines