bekindprofanityfilter 0.0.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CONTRIBUTORS.md +106 -0
- package/LICENSE +22 -0
- package/README.md +1015 -0
- package/allprofanity.config.example.json +35 -0
- package/bin/init.js +49 -0
- package/config.schema.json +163 -0
- package/dist/algos/aho-corasick.d.ts +75 -0
- package/dist/algos/aho-corasick.js +238 -0
- package/dist/algos/aho-corasick.js.map +1 -0
- package/dist/algos/bloom-filter.d.ts +103 -0
- package/dist/algos/bloom-filter.js +208 -0
- package/dist/algos/bloom-filter.js.map +1 -0
- package/dist/algos/context-patterns.d.ts +102 -0
- package/dist/algos/context-patterns.js +484 -0
- package/dist/algos/context-patterns.js.map +1 -0
- package/dist/index.d.ts +1332 -0
- package/dist/index.js +2631 -0
- package/dist/index.js.map +1 -0
- package/dist/innocence-scoring.d.ts +23 -0
- package/dist/innocence-scoring.js +118 -0
- package/dist/innocence-scoring.js.map +1 -0
- package/dist/language-detector.d.ts +162 -0
- package/dist/language-detector.js +952 -0
- package/dist/language-detector.js.map +1 -0
- package/dist/language-dicts.d.ts +60 -0
- package/dist/language-dicts.js +2718 -0
- package/dist/language-dicts.js.map +1 -0
- package/dist/languages/arabic-words.d.ts +10 -0
- package/dist/languages/arabic-words.js +1649 -0
- package/dist/languages/arabic-words.js.map +1 -0
- package/dist/languages/bengali-words.d.ts +10 -0
- package/dist/languages/bengali-words.js +1696 -0
- package/dist/languages/bengali-words.js.map +1 -0
- package/dist/languages/brazilian-words.d.ts +10 -0
- package/dist/languages/brazilian-words.js +2122 -0
- package/dist/languages/brazilian-words.js.map +1 -0
- package/dist/languages/chinese-words.d.ts +10 -0
- package/dist/languages/chinese-words.js +2728 -0
- package/dist/languages/chinese-words.js.map +1 -0
- package/dist/languages/english-primary-all-languages.d.ts +23 -0
- package/dist/languages/english-primary-all-languages.js +36894 -0
- package/dist/languages/english-primary-all-languages.js.map +1 -0
- package/dist/languages/english-words.d.ts +5 -0
- package/dist/languages/english-words.js +5156 -0
- package/dist/languages/english-words.js.map +1 -0
- package/dist/languages/french-words.d.ts +10 -0
- package/dist/languages/french-words.js +2326 -0
- package/dist/languages/french-words.js.map +1 -0
- package/dist/languages/german-words.d.ts +10 -0
- package/dist/languages/german-words.js +2633 -0
- package/dist/languages/german-words.js.map +1 -0
- package/dist/languages/hindi-words.d.ts +10 -0
- package/dist/languages/hindi-words.js +2341 -0
- package/dist/languages/hindi-words.js.map +1 -0
- package/dist/languages/innocent-words.d.ts +41 -0
- package/dist/languages/innocent-words.js +109 -0
- package/dist/languages/innocent-words.js.map +1 -0
- package/dist/languages/italian-words.d.ts +10 -0
- package/dist/languages/italian-words.js +2287 -0
- package/dist/languages/italian-words.js.map +1 -0
- package/dist/languages/japanese-words.d.ts +11 -0
- package/dist/languages/japanese-words.js +2557 -0
- package/dist/languages/japanese-words.js.map +1 -0
- package/dist/languages/korean-words.d.ts +10 -0
- package/dist/languages/korean-words.js +2509 -0
- package/dist/languages/korean-words.js.map +1 -0
- package/dist/languages/russian-words.d.ts +10 -0
- package/dist/languages/russian-words.js +2175 -0
- package/dist/languages/russian-words.js.map +1 -0
- package/dist/languages/spanish-words.d.ts +11 -0
- package/dist/languages/spanish-words.js +2536 -0
- package/dist/languages/spanish-words.js.map +1 -0
- package/dist/languages/tamil-words.d.ts +10 -0
- package/dist/languages/tamil-words.js +1722 -0
- package/dist/languages/tamil-words.js.map +1 -0
- package/dist/languages/telugu-words.d.ts +10 -0
- package/dist/languages/telugu-words.js +1739 -0
- package/dist/languages/telugu-words.js.map +1 -0
- package/dist/romanization-detector.d.ts +50 -0
- package/dist/romanization-detector.js +779 -0
- package/dist/romanization-detector.js.map +1 -0
- package/package.json +79 -0
package/README.md
ADDED
@@ -0,0 +1,1015 @@

# BeKind Profanity Filter

> Forked from [AllProfanity](https://github.com/ayush-jadaun/allprofanity) by Ayush Jadaun. Extended with **romanization profanity detection** (catches Hinglish and other transliterated text), **language-aware innocence scoring** (ELD + trie-based detection prevents false positives for cross-language collisions like "slut" in Swedish), and additional language dictionaries. Licensed under MIT.

> ⚠️ **Early-stage package in progress.** Features available in the original AllProfanity are being actively deprecated, adjusted, or replaced. The API surface may change without notice. Contributions and suggestions are greatly appreciated.

> **Please be advised:** Due to the nature of its purpose, the be-kind repository contains explicit profanity, slurs, hate speech, and other offensive language across its source files, dictionaries, and test suites (sorry!). The inclusion of these words does not reflect the views of the authors or contributors.

A multi-language profanity filter with romanization detection, language-aware innocence scoring, leet-speak detection, and cross-language collision handling.

[npm](https://www.npmjs.com/package/bekindprofanityfilter)
[MIT License](https://opensource.org/licenses/MIT)

---

## What This Version Contains

- **Multi-Language Profanity Detection:** 34K+ word dictionary across 16 languages, with an 18-language detection trie
- **Romanization Detection:** Catches Hinglish and transliterated Bengali, Tamil, Telugu, and Japanese
- **Cross-Language Innocence Scoring:** Handles words like "slut" (Swedish: "end") and "fart" (Norwegian: "speed")
- **Context-Aware Analysis:** Booster/reducer patterns detect sexual context, negation, medical usage, and quoted speech
- **Leet-Speak Detection:** Catches obfuscated profanity (`f#ck`, `a55hole`, `sh1t`)
- **Word Boundary Detection:** Smart whole-word matching prevents flagging "assassin" or "assistance"
- **Multiple Algorithms:** Trie (default), Aho-Corasick, or Hybrid modes, with optional Bloom filters and result caching
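
Leet-speak detection boils down to normalizing obfuscated characters before dictionary lookup. A minimal sketch of the idea; the substitution table below is invented for illustration and is not be-kind's actual mapping:

```typescript
// Illustrative leet-speak normalization (example mapping, not be-kind's table).
const LEET: Record<string, string> = {
  '1': 'i', '3': 'e', '4': 'a', '5': 's', '0': 'o', '7': 't',
  '@': 'a', '$': 's', '#': 'u', // '#' treated as a vowel stand-in, e.g. f#ck
};

function deleet(text: string): string {
  return text.split('').map((ch) => LEET[ch] ?? ch).join('');
}

deleet('sh1t');    // "shit"
deleet('a55hole'); // "asshole"
deleet('f#ck');    // "fuck"
```

The normalized string is then checked against the regular dictionary.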

---

## Features

### Performance & Speed

- **Multiple Algorithm Options:** Choose between Trie (default), Aho-Corasick, or Hybrid modes
- **664% Faster on Large Texts:** Aho-Corasick delivers O(n) multi-pattern matching
- **123x Speedup with Caching:** The result cache is ideal for repeated checks (chat, forms, APIs)
- **~27K ops/sec:** Default Trie mode handles short texts quickly
- **Single-Pass Scanning:** O(n) complexity regardless of dictionary size
- **Batch Processing Ready:** Optimized for high-throughput API endpoints

### Accuracy & Detection

- **Word Boundary Matching:** Smart whole-word detection prevents false positives like "assassin" or "assistance"
- **Advanced Leet-Speak:** Detects obfuscated profanities (`f#ck`, `a55hole`, `sh1t`, etc.)
- **Comprehensive Coverage:** Catches profanity while minimizing false flags
- **Configurable Strictness:** Tune detection sensitivity to your needs
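
The word-boundary rule can be illustrated with a small sketch; this regex check is a stand-in for be-kind's actual matcher and only handles Latin scripts:

```typescript
// Illustrative whole-word check (stand-in for be-kind's matcher).
function containsWholeWord(text: string, word: string): boolean {
  // \b is only reliable for Latin scripts; multi-script matching needs more care.
  const pattern = new RegExp(`\\b${word}\\b`, 'i');
  return pattern.test(text);
}

containsWholeWord('the assassin ran', 'ass'); // false: "ass" is embedded in a longer word
containsWholeWord('what an ass', 'ass');      // true: standalone whole word
```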

### Multi-Language & Flexibility

- **Multi-Language Support:** Built-in profanity dictionaries for 16 languages: English, Hindi, French, German, Spanish, Italian, Brazilian Portuguese, Russian, Arabic, Chinese, Japanese, Korean, Bengali, Tamil, Telugu, and Turkish
- **Multiple Scripts:** Latin/Roman (Hinglish) and native scripts (Devanagari, Tamil, Telugu, etc.)
- **Custom Dictionaries:** Add or remove words or entire language packs at runtime
- **Whitelisting:** Exclude safe words from detection
- **Severity Scoring:** Assess content offensiveness (`MILD`, `MODERATE`, `SEVERE`, `EXTREME`)

### Developer Experience

- **TypeScript Support:** Fully typed API with comprehensive documentation
- **Zero Third-Party Dependencies:** Only internal code and data
- **Configurable:** Tune performance vs. accuracy for your use case
- **No Dictionary Exposure:** Secure by design; word lists are never exposed
- **Universal:** Works in Node.js and browsers

---

## Installation

```bash
npm install bekindprofanityfilter
# or
yarn add bekindprofanityfilter
```

**Generate a configuration file (optional):**

```bash
npx bekindprofanityfilter
# Creates bekindprofanityfilter.config.json and config.schema.json in your project
```

---

## Quick Start

```typescript
import profanity from 'bekindprofanityfilter';

// Simple check
profanity.check('This is a clean sentence.'); // false
profanity.check('What the f#ck is this?'); // true (leet-speak detected)
profanity.check('यह एक चूतिया परीक्षण है।'); // true (Hindi)
profanity.check('Ye ek chutiya test hai.'); // true (Hinglish, Roman script)
```

---

## Algorithm Configuration

BeKind offers multiple algorithms optimized for different use cases. You can configure it via **constructor options** or a **config file**.

### Configuration Methods

#### Method 1: Constructor Options (Inline)

```typescript
import { BeKind } from 'bekindprofanityfilter';

const filter = new BeKind({
  algorithm: { matching: "hybrid" },
  performance: { enableCaching: true }
});
```

#### Method 2: Config File (Recommended)

```bash
# Generate config files in your project
npx bekindprofanityfilter

# This creates:
# - bekindprofanityfilter.config.json (main config)
# - config.schema.json (for IDE autocomplete)
```

```typescript
import { BeKind } from 'bekindprofanityfilter';
import config from './bekindprofanityfilter.config.json';

// Load from the generated config file
const filter = BeKind.fromConfig(config);

// Or directly from an object (no file needed)
const filter2 = BeKind.fromConfig({
  algorithm: { matching: "hybrid", useContextAnalysis: true },
  performance: { enableCaching: true, cacheSize: 1000 }
});
```

**Example config file** (`bekindprofanityfilter.config.json`):

```json
{
  "algorithm": {
    "matching": "hybrid",
    "useAhoCorasick": true,
    "useBloomFilter": true
  },
  "profanityDetection": {
    "enableLeetSpeak": true,
    "caseSensitive": false,
    "strictMode": false
  },
  "performance": {
    "enableCaching": true,
    "cacheSize": 1000
  }
}
```

The JSON schema provides IDE autocomplete and validation for the config file.

---

### Quick Configuration Examples

#### 1. Default (Best for General Use)

```typescript
import { BeKind } from 'bekindprofanityfilter';
const filter = new BeKind();
// Uses the optimized Trie - fast and reliable (~27K ops/sec)
```

#### 2. Large Text Processing (Documents, Articles)

```typescript
const filter = new BeKind({
  algorithm: { matching: "aho-corasick" }
});
// 664% faster on 1KB+ texts
```

#### 3. Repeated Checks (Chat, Forms, APIs)

```typescript
const filter = new BeKind({
  performance: {
    enableCaching: true,
    cacheSize: 1000
  }
});
```
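
The caching win comes from memoizing results keyed by the exact input string, so a repeated message costs one Map lookup instead of a full scan. A sketch of the idea (illustrative only; `enableCaching` handles this internally, and the eviction details here are assumptions):

```typescript
// Illustrative LRU-style result cache (not be-kind's internal code).
class CachedChecker {
  private cache = new Map<string, boolean>();
  constructor(
    private check: (text: string) => boolean,
    private maxSize = 1000,
  ) {}

  checkCached(text: string): boolean {
    const hit = this.cache.get(text);
    if (hit !== undefined) return hit; // repeated input: O(1) lookup
    const result = this.check(text);
    if (this.cache.size >= this.maxSize) {
      // Evict the oldest entry (Map preserves insertion order)
      const oldest = this.cache.keys().next().value;
      if (oldest !== undefined) this.cache.delete(oldest);
    }
    this.cache.set(text, result);
    return result;
  }
}

const checker = new CachedChecker((t) => /badword/i.test(t));
checker.checkCached('some badword here'); // computed: true
checker.checkCached('some badword here'); // served from cache: true
```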

### Competitor Comparison

Benchmarked on a single CPU core (pinned via `taskset -c 0`). All numbers are **ops/second — higher is better**.

> **Honest context:** be-kind loads a ~34K-word dictionary across 18 languages by default. `leo + dict` injects be-kind's full 34K dictionary into [leo-profanity](https://github.com/jojoee/leo-profanity) (which ships with only ~400 English words) to test the matching engine with equivalent vocabulary. glin-profanity is benchmarked with all 24 supported languages loaded. `glin + dict` injects be-kind's full 34K dictionary into glin for the same reason.

| Library | Languages (out-of-the-box) | Leet-speak | Repeat compression | Context-aware |
|---------|----------------------------|------------|--------------------|---------------|
| **be-kind** | 16 profanity dicts + 18-lang detection trie | ✅ | 🚧 planned | ✅ (certainty-delta) |
| **be-kind (ctx)** | same as be-kind | ✅ | 🚧 planned | ✅ (boosters + reducers) |
| [leo-profanity](https://github.com/jojoee/leo-profanity) + dict | 16 (via be-kind dict injection) | ❌ | ❌ | ❌ |
| [bad-words](https://github.com/web-mech/badwords) | 1 (English) | ❌ | ❌ | ❌ |
| [glin-profanity](https://www.glincker.com/tools/glin-profanity) | 24 | ✅ (3 levels) | ✅ | ✅ (heuristic) |

**Speed benchmark** (ops/second, higher is better):

| Test | be-kind | be-kind (ctx) | leo | bad-words | glin (basic) | glin (enhanced) | glin + dict |
|------|--------:|--------------:|----:|----------:|-------------:|----------------:|------------:|
| check — clean (short) | 2,654 | 2,903 | 879,009 | 2,932 | 816 | 751 | 68 |
| check — profane (short) | 2,366 | 2,031 | 1,496,281 | 3,025 | 3,128 | 3,304 | 3,350 |
| check — leet-speak | 1,243 | 1,198 | 1,100,028 | 3,148 | 2,760 | 4,078 | 4,499 |
| clean — profane (short) | 2,398 | 2,011 | 298,713 | 243 | N/A | N/A | N/A |
| check — 500-char clean | 411 | 397 | 100,898 | 2,157 | 253 | 247 | 20 |
| check — 500-char profane | 348 | 277 | 216,204 | 2,155 | 789 | 720 | 762 |
| check — 2,500-char clean | 91 | 88 | 18,900 | 1,225 | 74 | 71 | 6 |
| check — 2,500-char profane | 82 | 62 | 50,454 | 1,084 | 196 | 185 | 186 |

**Library versions tested:** `leo-profanity@1.9.0`, `bad-words@4.0.0`, `glin-profanity@3.3.0`

**Notes:**

- **be-kind** and **be-kind (ctx)** both load a 34K-word dictionary across 18 languages. Despite this, be-kind is ~3x faster than glin on clean text because it uses a **trie** (O(input_length) matching), while glin uses **linear scanning** over its dictionary (`for (const word of this.words.keys())` — O(dict_size * input_length)). This architectural difference becomes dramatic at large dictionary sizes.
- `be-kind (ctx)` adds ~10-20% overhead over default be-kind — context analysis (certainty-delta pattern matching) is cheap.
- `leo-profanity` is the fastest, but its ~400-word English-only dictionary explains most of the gap.
- `glin` with all 24 languages loaded is ~17x slower than English-only due to its linear-scan architecture scaling with dictionary size.
- `glin + dict` (glin enhanced + be-kind's 34K words injected) demonstrates the linear-scan bottleneck: 68 ops/s on clean short text vs 2,654 for be-kind with the same vocabulary. On profane text it short-circuits on the first match, so performance is normal (3,350 ops/s).
- be-kind is the only library with cross-language innocence scoring, romanization support, and context-aware certainty adjustment.
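
The trie-versus-linear-scan point in the first note can be sketched with a toy structure; this is illustrative, not be-kind's implementation:

```typescript
// Toy word trie: lookup cost depends on input length, not dictionary size.
class TrieNode {
  children = new Map<string, TrieNode>();
  terminal = false;
}

class WordTrie {
  private root = new TrieNode();

  add(word: string): void {
    let node = this.root;
    for (const ch of word) {
      if (!node.children.has(ch)) node.children.set(ch, new TrieNode());
      node = node.children.get(ch)!;
    }
    node.terminal = true;
  }

  // Scan every starting position; each inner walk is bounded by the longest
  // dictionary word, so cost is independent of how many words the trie holds.
  containsAny(text: string): boolean {
    for (let i = 0; i < text.length; i++) {
      let node = this.root;
      for (let j = i; j < text.length; j++) {
        const next = node.children.get(text[j]);
        if (!next) break;
        node = next;
        if (node.terminal) return true;
      }
    }
    return false;
  }
}

const trie = new WordTrie();
trie.add('badword');
trie.containsAny('a badword here'); // true
trie.containsAny('all clean text'); // false
```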

Run the benchmark yourself:

```bash
taskset -c 0 bun run benchmark:competitors
```

### Accuracy Comparison

Measures true-positive rate (recall), false-positive rate, and F1 across eight test categories (225 labeled cases, dataset v6). All libraries are tested against all categories — no exemptions. **Higher F1 and lower FP rate are better.**
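
For reference, each metric is derived from a per-library confusion matrix; a minimal sketch (the example numbers are the be-kind (ctx) row of the overall summary table below):

```typescript
// Recall, precision, FP rate, and F1 from a confusion matrix.
interface Confusion { tp: number; fn: number; fp: number; tn: number }

function metrics({ tp, fn, fp, tn }: Confusion) {
  const recall = tp / (tp + fn);     // share of profane cases caught
  const precision = tp / (tp + fp);  // share of flags that were correct
  const fpRate = fp / (fp + tn);     // share of clean cases wrongly flagged
  const f1 = (2 * precision * recall) / (precision + recall);
  return { recall, precision, fpRate, f1 };
}

// be-kind (ctx) row from the overall summary table:
metrics({ tp: 91, fn: 30, fp: 18, tn: 86 });
// → recall ≈ 0.75, precision ≈ 0.83, fpRate ≈ 0.17, f1 ≈ 0.79
```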

> **Bias disclaimer:** This dataset was created by the be-kind team. Non-English cases were likely drawn from or verified against be-kind's own dictionary, which advantages be-kind on those categories. To partially offset this, the dataset includes independent test cases from [glin-profanity's upstream test suite](https://github.com/GLINCKER/glin-profanity/tree/release/tests) and adversarial false-positive cases specifically chosen to expose known be-kind failures. We strongly recommend running this benchmark against your own dataset before drawing conclusions.

> **Note:** `be-kind (sensitive)` = `sensitiveMode: true` (flags AMBIVALENT words too). `be-kind (ctx)` = `contextAnalysis.enabled: true`. `glin (collapsed)` = glin (basic) with `collapseRepeatedCharacters()` pre-processing.

#### Single-language detection — 65 cases (English incl. leetspeak, French, German, Spanish, Hindi)

| Library | Recall | Precision | FP Rate | F1 |
|---|---|---|---|---|
| be-kind (sensitive) | 100% | 100% | 0% | **1.00** |
| leo + dict | 82% | 100% | 0% | 0.90 |
| be-kind | 80% | 100% | 0% | 0.89 |
| be-kind (ctx) | 80% | 100% | 0% | 0.89 |
| glin (enhanced) | 72% | 100% | 0% | 0.84 |
| glin (collapsed) | 72% | 100% | 0% | 0.84 |
| bad-words | 52% | 100% | 0% | 0.68 |

> All libraries are tested against all 65 cases, including French, German, Spanish, and Hindi. `leo + dict` benefits significantly from be-kind's multilingual dictionary, jumping from 34% to 82% recall. be-kind misses mild words (`damn`, `hell`) in default mode; `sensitiveMode: true` catches these. All libraries achieve 100% precision — when they flag something, it's always correct.

#### False positives / innocent words — 48 cases (clean only; lower FP rate is better)

Includes adversarial cases (`cum laude`, `Dick Van Dyke`, culinary `faggots`, Swedish `slut`). Recall and F1 are undefined (no profane cases).

| Library | FP Rate |
|---|---|
| glin (collapsed) | **19%** |
| glin (enhanced) | 21% |
| be-kind (ctx) | 21% |
| bad-words | 23% |
| leo + dict | 25% |
| be-kind | 27% |
| be-kind (sensitive) | 31% |

> be-kind's FP rate remains its most significant weakness — it over-triggers on proper nouns, Latin phrases, and homographs. `sensitiveMode: true` worsens this. `be-kind (ctx)` with context analysis reduces the FP rate from 27% to 21% by detecting innocent contexts (medical terms, proper nouns, quoted text). `leo + dict` at 25% shows that leo's simple substring matching creates more false positives when given a large dictionary.

#### Multi-language detection — 26 cases (Hinglish, French, German, Spanish, mixed)

| Library | Recall | Precision | FP Rate | F1 |
|---|---|---|---|---|
| be-kind | 100% | 100% | 0% | **1.00** |
| be-kind (sensitive) | 100% | 100% | 0% | **1.00** |
| leo + dict | 100% | 100% | 0% | **1.00** |
| be-kind (ctx) | 95% | 100% | 0% | 0.98 |
| glin (enhanced) | 95% | 100% | 0% | 0.98 |
| glin (collapsed) | 95% | 100% | 0% | 0.98 |
| bad-words | 62% | 100% | 0% | 0.76 |

> With be-kind's dictionary injected, leo + dict achieves 100% recall on multi-language cases — proving the dictionary is the key differentiator. be-kind (ctx) scores 95%; context analysis slightly reduces multi-language recall vs default be-kind.

#### Romanization — 30 cases (Hinglish, Bengali, Tamil, Telugu, Japanese)

| Library | Recall | Precision | FP Rate | F1 |
|---|---|---|---|---|
| leo + dict | 75% | 94% | 10% | **0.83** |
| be-kind | 80% | 84% | 30% | 0.82 |
| be-kind (sensitive) | 80% | 84% | 30% | 0.82 |
| be-kind (ctx) | 80% | 84% | 30% | 0.82 |
| glin (enhanced) | 15% | 100% | 0% | 0.26 |
| glin (collapsed) | 15% | 100% | 0% | 0.26 |
| bad-words | 0% | 0% | 10% | — |

> leo + dict edges out be-kind on F1 here (0.83 vs 0.82) thanks to a lower FP rate (10% vs 30%) despite slightly lower recall (75% vs 80%). be-kind's higher FP rate is a known limitation: clean romanized words collide with its dictionary. glin catches 15% with perfect precision but far less coverage.

#### Semantic context — 25 cases

| Library | Recall | Precision | FP Rate | F1 |
|---|---|---|---|---|
| be-kind (ctx) | 80% | 73% | 20% | **0.76** |
| leo + dict | 100% | 59% | 47% | 0.74 |
| glin (enhanced) | 90% | 53% | 53% | 0.67 |
| glin (collapsed) | 90% | 53% | 53% | 0.67 |
| be-kind (sensitive) | 100% | 48% | 73% | 0.65 |
| bad-words | 100% | 48% | 73% | 0.65 |
| be-kind | 80% | 47% | 60% | 0.59 |

> Semantic context is where all libraries struggle — precision drops below 50% for most. Cases include metalinguistic uses ("the word 'fuck' has uncertain origins"), negation ("she's not a bitch"), and medical context ("rectal cancer screening"). be-kind (ctx) achieves the best F1 (0.76) thanks to context-aware certainty adjustment — boosters confirm profane intent, reducers detect innocent contexts like quotation and negation. leo + dict achieves 100% recall but at the cost of a 47% FP rate.
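
The certainty-delta mechanism can be sketched roughly as follows; the patterns and weights are invented for illustration and do not reflect be-kind's real rule set:

```typescript
// Illustrative certainty adjustment (patterns and weights are made up).
const boosters = [/\byou(?:'re| are)\b/i];                      // directed at a person
const reducers = [/"[^"]*"|'[^']*'/, /\bnot\b/i, /\bword\b/i];  // quoting, negation, metalinguistic

function adjustCertainty(text: string, base: number): number {
  let certainty = base;
  for (const b of boosters) if (b.test(text)) certainty += 0.2;
  for (const r of reducers) if (r.test(text)) certainty -= 0.3;
  return Math.min(1, Math.max(0, certainty));
}

adjustCertainty('the word "bitch" has uncertain origins', 0.8); // reduced: quoted + metalinguistic
adjustCertainty('you are such a bitch', 0.8);                   // boosted: clearly directed
```

A match is only flagged when its adjusted certainty clears a threshold, which is how innocent contexts escape flagging.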

#### Repeated character evasion — 5 cases (`fuuuuuuuuck`, `cunnnnnnttttt`, etc.)

No clean cases in this category — FP rate is undefined.

| Library | Recall | Precision |
|---|---|---|
| glin (enhanced) | **100%** | 100% |
| glin (collapsed) | 40% | 100% |
| be-kind | 0% | — |
| be-kind (sensitive) | 0% | — |
| be-kind (ctx) | 0% | — |
| leo + dict | 0% | — |
| bad-words | 0% | — |

#### Concatenated / no-space evasion — 7 cases (`urASSHOLEbro`, `youFUCKINGidiot`, etc.)

| Library | Recall | Precision | FP Rate | F1 |
|---|---|---|---|---|
| be-kind | 20% | 100% | 0% | 0.33 |
| be-kind (sensitive) | 20% | 100% | 0% | 0.33 |
| be-kind (ctx) | 20% | 100% | 0% | 0.33 |
| leo + dict | 0% | — | 0% | — |
| bad-words | 0% | — | 0% | — |
| glin (enhanced) | 0% | — | 0% | — |
| glin (collapsed) | 0% | — | 0% | — |

#### Challenge cases — 19 cases (semantic disambiguation, embedded substrings, separator evasion)

Hard problems: `cock` as rooster, `ass` as donkey, Swedish `slut` = "end", `puta` in etymological discussion, profanity in concatenated strings, and separator-spaced evasion (`f u c k`, `f_u*c k`, `a.s.s.h.o.l.e`).

| Library | Recall | Precision | FP Rate | F1 |
|---|---|---|---|---|
| be-kind (ctx) | 60% | 75% | 22% | **0.67** |
| be-kind | 60% | 60% | 44% | 0.60 |
| be-kind (sensitive) | 60% | 60% | 44% | 0.60 |
| glin (enhanced) | 30% | 43% | 44% | 0.35 |
| leo + dict | 20% | 50% | 22% | 0.29 |
| bad-words | 20% | 33% | 44% | 0.25 |
| glin (collapsed) | 0% | 0% | 44% | — |

> be-kind (ctx) halves the FP rate on challenge cases (44% → 22%) by recognizing innocent contexts like "cock crowed at dawn" and "wild ass is an equine." Separator-spaced evasion cases (`f u c k`, `f_u*c k`, mixed separators) test the separator tolerance feature. These cases still require semantic understanding that no dictionary-based filter can fully solve — the strongest argument for LLM-assisted moderation as a second pass.

#### Overall summary — micro-averaged across all 225 cases

| Library | Recall | Precision | FP Rate | F1 | TP | FN | FP | TN |
|---|---|---|---|---|---|---|---|---|
| be-kind (sensitive) | **86%** | 76% | 32% | 0.81 | 104 | 17 | 33 | 71 |
| be-kind (ctx) | 75% | **83%** | **17%** | **0.79** | 91 | 30 | 18 | 86 |
| be-kind | 76% | 76% | 28% | 0.76 | 92 | 29 | 29 | 75 |
| leo + dict | 74% | 80% | 21% | 0.76 | 89 | 32 | 22 | 82 |
| glin (enhanced) | 63% | 78% | 21% | 0.70 | 76 | 45 | 22 | 82 |
| glin (collapsed) | 58% | 77% | 20% | 0.66 | 70 | 51 | 21 | 83 |
| bad-words | 42% | 65% | 26% | 0.51 | 51 | 70 | 27 | 77 |

> Micro-averaged: all 225 cases (121 profane, 104 clean) are aggregated into one confusion matrix per library, then recall/precision/F1 are computed once, so there are no category-weighting artifacts. All glin variants use all 24 supported languages. `leo + dict` with be-kind's 34K dictionary achieves F1 parity with default be-kind (0.76) — proving the dictionary is the core differentiator. be-kind (ctx) achieves the best balance of precision (83%) and recall (75%) with the lowest FP rate (17%) among be-kind variants, thanks to context-aware certainty adjustment via booster and reducer patterns.

Run the accuracy benchmark yourself:

```bash
bun run benchmark:accuracy
```

---

## API Reference & Examples

### `check(text: string): boolean`

Returns `true` if the text contains any profanity.

```typescript
profanity.check('This is a clean sentence.'); // false
profanity.check('This is a bullshit sentence.'); // true
profanity.check('What the f#ck is this?'); // true (leet-speak)
profanity.check('यह एक चूतिया परीक्षण है।'); // true (Hindi)
```

---

### `detect(text: string): ProfanityDetectionResult`

Returns a detailed result:

- `hasProfanity: boolean`
- `detectedWords: string[]` (the actual matched words)
- `cleanedText: string` (character-masked)
- `severity: ProfanitySeverity` (`MILD`, `MODERATE`, `SEVERE`, `EXTREME`)
- `positions: Array<{ word: string, start: number, end: number }>`

```typescript
const result = profanity.detect('This is fucking bullshit and chutiya.');
console.log(result.hasProfanity); // true
console.log(result.detectedWords); // ['fucking', 'bullshit', 'chutiya']
console.log(result.severity); // 3 (SEVERE)
console.log(result.cleanedText); // "This is ******* ******** and *******."
console.log(result.positions); // e.g. [{word: 'fucking', start: 8, end: 15}, ...]
```
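
The `detect()` result shape lends itself to a simple moderation flow: reject high-severity content, mask milder hits. A sketch under the assumption that `MILD`..`EXTREME` map to 1..4 (as the `severity` example above suggests):

```typescript
// Illustrative moderation flow over a detect()-shaped result.
interface DetectResult { hasProfanity: boolean; severity: number; cleanedText: string }

function moderate(result: DetectResult, original: string): { allowed: boolean; text: string } {
  if (!result.hasProfanity) return { allowed: true, text: original };
  if (result.severity >= 3) return { allowed: false, text: '' }; // SEVERE or EXTREME: reject
  return { allowed: true, text: result.cleanedText };            // milder: post the masked text
}

moderate({ hasProfanity: true, severity: 1, cleanedText: 'oh ****' }, 'oh darn');
// → { allowed: true, text: 'oh ****' }
```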

---

### `clean(text: string, placeholder?: string): string`

Replaces each character of profane words with a placeholder (default: `*`).

```typescript
profanity.clean('This contains bullshit.'); // "This contains ********."
profanity.clean('This contains bullshit.', '#'); // "This contains ########."
profanity.clean('यह एक चूतिया परीक्षण है।'); // e.g. "यह एक ***** परीक्षण है।"
```

---

### `cleanWithPlaceholder(text: string, placeholder?: string): string`

Replaces each profane word with a single placeholder (default: `***`).

```typescript
profanity.cleanWithPlaceholder('This contains bullshit.'); // "This contains ***."
profanity.cleanWithPlaceholder('This contains bullshit.', '[CENSORED]'); // "This contains [CENSORED]."
profanity.cleanWithPlaceholder('यह एक चूतिया परीक्षण है।', '####'); // e.g. "यह एक #### परीक्षण है।"
```

---

### `add(word: string | string[]): void`

Add a word or an array of words to the profanity filter.

```typescript
profanity.add('badword123');
profanity.check('This is badword123.'); // true

profanity.add(['mierda', 'puta']);
profanity.check('Esto es mierda.'); // true (Spanish)
profanity.check('Qué puta situación.'); // true
```

---

### `remove(word: string | string[]): void`

Remove a word or an array of words from the profanity filter.

```typescript
profanity.remove('bullshit');
profanity.check('This is bullshit.'); // false

profanity.remove(['mierda', 'puta']);
profanity.check('Esto es mierda.'); // false
```

---

### `addToWhitelist(words: string[]): void`

Whitelist words so they are never flagged as profane.

```typescript
profanity.addToWhitelist(['fuck', 'idiot', 'shit']);
profanity.check('He is a fucking idiot.'); // false
profanity.check('Fuck this shit.'); // false
// Remove from the whitelist to restore detection
profanity.removeFromWhitelist(['fuck', 'idiot', 'shit']);
```

---

### `removeFromWhitelist(words: string[]): void`

Remove words from the whitelist so they can be detected again.

```typescript
profanity.removeFromWhitelist(['anal']);
```

---
### `setPlaceholder(placeholder: string): void`
|
|
496
|
+
|
|
497
|
+
Set the default placeholder character for `clean()`.
|
|
498
|
+
|
|
499
|
+
```typescript
|
|
500
|
+
profanity.setPlaceholder('#');
|
|
501
|
+
profanity.clean('This is bullshit.'); // "This is ########."
|
|
502
|
+
profanity.setPlaceholder('*'); // Reset to default
|
|
503
|
+
```
|
|
504
|
+
|
|
505
|
+
---
|
|
506
|
+
|
|
507
|
+
### `updateConfig(options: Partial<BeKindOptions>): void`

Change configuration at runtime.
Options include: `enableLeetSpeak`, `caseSensitive`, `strictMode`, `detectPartialWords`, `defaultPlaceholder`, `languages`, `whitelistWords`.

```typescript
profanity.updateConfig({ caseSensitive: true, enableLeetSpeak: false });
profanity.check('FUCK'); // false (caseSensitive is now true)
profanity.updateConfig({ caseSensitive: false, enableLeetSpeak: true });
profanity.check('f#ck'); // true
```

---

### `loadLanguage(language: string): boolean`

Load a built-in language.

```typescript
profanity.loadLanguage('french');
profanity.check('Ce mot est merde.'); // true
```

---

### `loadLanguages(languages: string[]): number`

Load multiple built-in languages at once.

```typescript
profanity.loadLanguages(['english', 'french', 'german']);
profanity.check('Das ist scheiße.'); // true (German)
```

---

### `loadIndianLanguages(): number`

Convenience method: loads all major Indian language packs.

```typescript
profanity.loadIndianLanguages();
profanity.check('यह एक बेंगाली गाली है।'); // true (Bengali)
profanity.check('This is a Tamil profanity: புண்டை'); // true
```

---

### `loadCustomDictionary(name: string, words: string[]): void`

Add your own dictionary as an additional language.

```typescript
profanity.loadCustomDictionary('swedish', ['fan', 'jävla', 'skit']);
profanity.loadLanguage('swedish');
profanity.check('Det här är skit.'); // true
```

---

### `getLoadedLanguages(): string[]`

Returns the names of all currently loaded language packs.

```typescript
console.log(profanity.getLoadedLanguages()); // ['english', 'hindi', ...]
```

---

### `getAvailableLanguages(): string[]`

Returns the names of all available built-in language packs.

```typescript
console.log(profanity.getAvailableLanguages());
// ['english', 'hindi', 'french', 'german', 'spanish', 'bengali', 'tamil', 'telugu', 'brazilian']
```

---

### `clearList(): void`

Remove all loaded languages and dynamic words (start with a clean filter).

```typescript
profanity.clearList();
profanity.check('fuck'); // false
profanity.loadLanguage('english');
profanity.check('fuck'); // true
```

---

### `getConfig(): Partial<BeKindOptions>`

Get the current configuration.

```typescript
console.log(profanity.getConfig());
/*
{
  defaultPlaceholder: '*',
  enableLeetSpeak: true,
  caseSensitive: false,
  strictMode: false,
  detectPartialWords: false,
  languages: [...],
  whitelistWords: [...]
}
*/
```

---

## Configuration File Structure

BeKind supports JSON-based configuration for easy setup and deployment. The config file structure supports all algorithm and detection options.

### Full Configuration Schema

```typescript
{
  "algorithm": {
    "matching": "trie" | "aho-corasick" | "hybrid", // Algorithm selection
    "useAhoCorasick": boolean, // Enable Aho-Corasick
    "useBloomFilter": boolean  // Enable Bloom Filter
  },
  "bloomFilter": {
    "enabled": boolean,         // Enable/disable
    "expectedItems": number,    // Expected dictionary size (default: 10000)
    "falsePositiveRate": number // Acceptable false positive rate (default: 0.01)
  },
  "ahoCorasick": {
    "enabled": boolean, // Enable/disable
    "prebuild": boolean // Prebuild automaton (default: true)
  },
  "profanityDetection": {
    "enableLeetSpeak": boolean,    // Detect l33t speak (default: true)
    "caseSensitive": boolean,      // Case sensitive matching (default: false)
    "strictMode": boolean,         // Require word boundaries (default: false)
    "detectPartialWords": boolean, // Detect within words (default: false)
    "defaultPlaceholder": string   // Default censoring character (default: "*")
  },
  "performance": {
    "enableCaching": boolean, // Enable result cache (default: false)
    "cacheSize": number       // Cache size limit (default: 1000)
  }
}
```

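For context on the `bloomFilter` knobs: classic Bloom-filter sizing derives the bit-array size *m* and hash count *k* from `expectedItems` (*n*) and `falsePositiveRate` (*p*). A standalone sketch of that arithmetic (not the library's internals):

```typescript
// Standard Bloom filter sizing: m = -n·ln(p) / (ln 2)², k = (m/n)·ln 2.
function bloomSizing(expectedItems: number, falsePositiveRate: number) {
  const m = Math.ceil(-(expectedItems * Math.log(falsePositiveRate)) / (Math.LN2 ** 2));
  const k = Math.round((m / expectedItems) * Math.LN2);
  return { bits: m, hashes: k };
}

// Schema defaults: 10000 items at a 1% false-positive rate
bloomSizing(10000, 0.01); // ≈ 95851 bits (~12 KB) and 7 hash functions
```

Raising `expectedItems` or lowering `falsePositiveRate` grows the filter roughly linearly in *n* and logarithmically in *1/p*.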
### Pre-configured Templates

#### High Performance (Large Texts)

```json
{
  "algorithm": { "matching": "aho-corasick" },
  "ahoCorasick": { "enabled": true, "prebuild": true },
  "profanityDetection": { "enableLeetSpeak": true }
}
```

#### Balanced (Production)

```json
{
  "algorithm": {
    "matching": "hybrid",
    "useAhoCorasick": true,
    "useBloomFilter": true
  },
  "profanityDetection": { "enableLeetSpeak": true },
  "performance": { "enableCaching": true, "cacheSize": 1000 }
}
```

### Using Config Files

**Step 1: Generate Config Files**

```bash
# Run this in your project directory
npx bekindprofanityfilter

# Output:
# ✅ BeKind configuration files created!
#
# Created files:
# 📄 bekindprofanityfilter.config.json - Main configuration
# 📄 config.schema.json - JSON schema for IDE autocomplete
```

**Step 2: Load Config in Your Code**

```typescript
// ES Modules / TypeScript
import { BeKind } from 'bekindprofanityfilter';
import config from './bekindprofanityfilter.config.json';

const filter = BeKind.fromConfig(config);
```

```javascript
// CommonJS (Node.js)
const { BeKind } = require('bekindprofanityfilter');
const config = require('./bekindprofanityfilter.config.json');

const filter = BeKind.fromConfig(config);
```

**Step 3: Customize Config**

Edit `bekindprofanityfilter.config.json` to enable/disable features. Your IDE will provide autocomplete thanks to the JSON schema!

---

## Cross-Language Innocence Scoring

Many words are profane in one language but perfectly innocent in another. For example, "slut" means "end/finish" in Swedish, "fart" means "speed" in Scandinavian languages, and "bite" is a common English word that's vulgar in French. BeKind handles these cross-language collisions automatically using a multi-layer language detection and scoring system.

### Language Detection Architecture

BeKind uses a hybrid language detection system with three layers:

**1. ELD N-gram Detection** (`eld/small`)
We integrate [Nito-ELD](https://github.com/nitotm/efficient-language-detector), a corpus-trained byte-level n-gram language detector supporting 60+ languages. ELD analyzes character sequences (trigrams) and compares them against frequency profiles trained on massive corpora. It provides both per-word scores and full-text Bayesian priors.

*Limitation:* ELD works on UTF-8 byte patterns, so it struggles with accent-stripped text and frequently confuses closely related languages (Swedish ↔ German, Norwegian ↔ Danish). This is why we don't rely on ELD alone.

**2. Trie Vocabulary Detection** (18 languages)
Per-language tries built from ~200-350 common words each. When a word is looked up, the trie returns a match score (0-1) indicating how strongly the word belongs to that language. Supports accent-tolerant matching (e.g., "gurultu" matches Turkish "gürültü" with a small penalty).

**3. Script Detection**
Unicode codepoint ranges map characters directly to language families (e.g., Cyrillic → Russian, Devanagari → Hindi). This is deterministic and instant, providing strong signal for non-Latin scripts.

### The `scoreWord()` Function

For each word, `scoreWord()` combines all three layers into a single `Record<string, number>` mapping language codes to confidence scores:

```
scoreWord("slut") → { sv: 0.8, en: 0.6, de: 0.3, ... }
                      ↑ Swedish trie match (exact word in vocabulary)
                               ↑ English trie match (partial/common word)
                                        ↑ German ELD n-gram signal
```

Layer weights: Script (1.0) > Trie (0.8) > ELD (0.6) > Suffix (0.3+) > Prefix (0.3+)

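The layer weighting above can be sketched as a small merge step. The merge rule (weighted max) and all names here are assumptions for illustration, not the library's actual internals:

```typescript
// Sketch of combining per-layer language scores with the stated weights.
type Scores = Record<string, number>;

const LAYER_WEIGHTS = { script: 1.0, trie: 0.8, eld: 0.6, suffix: 0.3, prefix: 0.3 };

function combineLayers(layers: { weight: number; scores: Scores }[]): Scores {
  const combined: Scores = {};
  for (const { weight, scores } of layers) {
    for (const [lang, score] of Object.entries(scores)) {
      // Keep the strongest weighted signal seen for each language.
      combined[lang] = Math.max(combined[lang] ?? 0, score * weight);
    }
  }
  return combined;
}

// Trie says Swedish with certainty; ELD leans German.
const layered = combineLayers([
  { weight: LAYER_WEIGHTS.trie, scores: { sv: 1.0 } },
  { weight: LAYER_WEIGHTS.eld, scores: { de: 0.9, sv: 0.4 } },
]);
// layered.sv ≈ 0.8 (trie outweighs ELD), layered.de ≈ 0.54
```

The ordering of the weights means a deterministic script match can never be overruled by a statistical n-gram guess.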
### The `detectLanguages()` Function

For full text, `detectLanguages()` runs `scoreWord()` on every word and aggregates results into document-level proportions:

```typescript
detectLanguages("Programmet börjar klockan åtta och tar slut vid tio")
// → { languages: [{ language: "de", proportion: 0.6 }, { language: "sv", proportion: 0.3 }, ...] }
```

*Note:* ELD often classifies Swedish as German due to n-gram similarity. The confusion map (see below) compensates for this.

### Two-Layer Signal Combination

When a collision word is detected, we combine word-level and document-level signals using a **1.5:1 weighted average** favoring the document signal:

```
amplified[lang] = (scoreWord[lang] × 1.0 + docSignal[lang] × 1.5) / 2.5
```

The document signal is favored because it provides broader context — a single word's language score can be ambiguous, but the surrounding text usually makes the language clear.

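As a concrete sketch, the formula above is a per-language weighted mean (the function name `amplify` and the score shapes are illustrative, not the library's exported API):

```typescript
type Scores = Record<string, number>;

// 1.5:1 doc:word weighted average, normalized by the total weight 2.5.
function amplify(wordScores: Scores, docScores: Scores): Scores {
  const langs = new Set([...Object.keys(wordScores), ...Object.keys(docScores)]);
  const out: Scores = {};
  for (const lang of langs) {
    out[lang] = ((wordScores[lang] ?? 0) * 1.0 + (docScores[lang] ?? 0) * 1.5) / 2.5;
  }
  return out;
}

// Numbers from the end-to-end example later in this section:
const amped = amplify({ sv: 0.8, en: 0.6 }, { de: 0.7, en: 0.2 });
// amped.sv ≈ 0.32, amped.de ≈ 0.42, amped.en ≈ 0.36
```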
### The Confusion Map

ELD's n-gram model frequently misclassifies Scandinavian languages as German (they share many character patterns). The confusion map treats German signal as partial evidence of Scandinavian:

```
effectiveAmp["sv"] = max(directAmp["sv"], confusedAmp["de"] × 0.8)
```

The 0.8 discount prevents over-attribution — German text shouldn't fully count as Swedish, but mostly-German signal in a Scandinavian context should still trigger dampening.

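A minimal sketch of that fallback rule. Only the sv ← de pair is documented above; the map shape and function name are assumptions:

```typescript
// Languages commonly mistaken for each other by the n-gram detector.
const CONFUSION: Record<string, { from: string; discount: number }[]> = {
  sv: [{ from: "de", discount: 0.8 }],
};

function effectiveAmp(lang: string, amp: Record<string, number>): number {
  let best = amp[lang] ?? 0;
  for (const { from, discount } of CONFUSION[lang] ?? []) {
    // A confused language's signal counts as discounted evidence.
    best = Math.max(best, (amp[from] ?? 0) * discount);
  }
  return best;
}

const svSignal = effectiveAmp("sv", { sv: 0.32, de: 0.42 });
// svSignal ≈ max(0.32, 0.42 × 0.8) = 0.336
```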
### Certainty Adjustment Formula

Once we have the amplified language signals, `adjustCertaintyForLanguage()` adjusts the word's certainty score:

```
If innocent language dominates (innocentAmp > profaneAmp):
  adjusted = certainty × (1 - dampeningFactor × innocentAmp)  ← reduces certainty

If profane language dominates (profaneAmp > innocentAmp):
  adjusted = certainty × (1 + dampeningFactor × profaneAmp)  ← increases certainty

Result clamped to [0, 5]
```

The `dampeningFactor` (0-1) controls how aggressively the adjustment works per collision word. Words that are genuinely innocent in another language (e.g., "slut" in Swedish, df=0.9) get heavy dampening, while dangerous dual-meaning words (e.g., "cock" as rooster, df=0.1) barely adjust.

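Transcribed as code, the rule looks like this (a sketch; the real `adjustCertaintyForLanguage()` internals may differ):

```typescript
function adjustCertainty(
  certainty: number,
  dampeningFactor: number,
  innocentAmp: number,
  profaneAmp: number
): number {
  let adjusted = certainty;
  if (innocentAmp > profaneAmp) {
    adjusted = certainty * (1 - dampeningFactor * innocentAmp); // dampen
  } else if (profaneAmp > innocentAmp) {
    adjusted = certainty * (1 + dampeningFactor * profaneAmp); // boost
  }
  return Math.min(5, Math.max(0, adjusted)); // clamp to [0, 5]
}

// Swedish "slut" with a dominant innocent signal (illustrative numbers):
const adjusted = adjustCertainty(4, 0.9, 0.336, 0.3);
// adjusted ≈ 2.79, low enough to fall below the flag threshold
```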
### End-to-End Flow

```
Text: "Programmet börjar klockan åtta och tar slut vid tio"
                                              ^^^^
                                       "slut" detected (en: s:3 c:4)

1. Collision word matched → check innocent-words map
   "slut" → innocent in Swedish (meaning: "end/finish", dampeningFactor: 0.9)

2. Language detection triggered (lazy — only runs on collision matches)
   Document signal: detectLanguages() → { de: 0.7, en: 0.2, ... }
   Word signal: scoreWord("slut") → { sv: 0.8, en: 0.6, ... }

3. Weighted average (1.5:1 doc:word ratio)
   amplified["sv"] = (0.8 × 1.0 + 0.0 × 1.5) / 2.5 = 0.32
   amplified["de"] = (0.0 × 1.0 + 0.7 × 1.5) / 2.5 = 0.42
   amplified["en"] = (0.6 × 1.0 + 0.2 × 1.5) / 2.5 = 0.36

4. Confusion map: German signal → partial Swedish evidence
   effectiveAmp["sv"] = max(0.32, 0.42 × 0.8) = 0.336

5. Innocent language (sv: 0.336) > Profane language (en: 0.36)?
   → Close, but Swedish trie words boost sv signal further
   → Certainty dampened: 4 × (1 - 0.9 × 0.336) = 2.79
   → Below flag threshold (s:3 needs c:3+) → NOT FLAGGED ✓
```

### Key Features

- **29 collision words** mapped across 8 languages (English, Swedish, Norwegian, Danish, German, Dutch, French, Spanish)
- **Per-word dampening factors** control adjustment strength:
  - `0.9` = heavy dampening (genuinely innocent cross-language, e.g., "slut" in Swedish)
  - `0.1` = barely dampens (almost always used as profanity, e.g., "cock" in English)
- **Lazy language detection** — `detectLanguages()` only runs when a collision word is matched (zero performance cost for non-collision text)
- **Confusion map** — handles ELD n-gram detector's known misclassifications (e.g., Swedish often classified as German)
- **Swedish trie vocabulary** — ~350 common words for reliable word-level Swedish detection

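An illustrative shape for one collision-word entry. The real definitions live in `src/languages/innocent-words.ts`; the field names here are assumptions for demonstration:

```typescript
interface InnocentWordEntry {
  word: string;
  innocentIn: string[];    // languages where the word is harmless
  meaning: string;         // the innocent sense, for documentation
  dampeningFactor: number; // 0-1: how strongly innocent context lowers certainty
}

const entries: InnocentWordEntry[] = [
  { word: "slut", innocentIn: ["sv", "da"], meaning: "end/finish", dampeningFactor: 0.9 },
  { word: "cock", innocentIn: ["en"], meaning: "rooster", dampeningFactor: 0.1 },
];
```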
### Collision Words

| Word | Profane In | Innocent In | Meaning |
|------|-----------|-------------|---------|
| slut | English | Swedish, Danish | end/finish |
| fart | English | Swedish, Norwegian, Danish | speed |
| hell | English | Swedish, Norwegian | luck |
| prick | English | Swedish | dot/point |
| kock | English | Swedish | chef/cook |
| bra | English | Swedish | good |
| bite | French | English | to use teeth |
| con | French | English, Spanish | prefix/with |
| pet | French | English | animal companion |
| mist | Dutch/German | English | fog/haze |
| hoe | English | Dutch | how |
| kant | Dutch | German | edge |
| ass | English | English | donkey (df: 0.15) |
| cock | English | English | rooster (df: 0.1) |

*Full list in `src/languages/innocent-words.ts`*

### Same-Language Collisions

Words like "ass" (donkey) and "cock" (rooster) are both profane and innocent in English. Since the profane and innocent language signals are equal, the system cannot disambiguate — these always remain flagged. This is a known limitation that would require semantic context analysis to solve.

### Tested Scenarios

The challenge test suite (`tests/challenge-tests.test.ts`) validates 32 real-world scenarios:

| Category | Tests | Passing | Description |
|----------|-------|---------|-------------|
| Swedish text | 7 | 7 | News, recipes, driving, email, school contexts |
| Norwegian/Danish | 4 | 4 | Via confusion map cross-detection |
| Mixed-language | 5 | 4 | Code-switching, bilingual documents |
| Same-language (en→en) | 5 | 3 | Donkey/rooster/garden contexts |
| Threshold boundaries | 3 | 3 | Minimal context, short text |
| Adversarial inputs | 4 | 4 | Swedish padding attacks, Unicode tricks |
| Missing language pairs | 4 | 3 | Dutch, German, Italian, Portuguese |

**4 skipped tests** document unsolved challenges requiring semantic analysis or additional language support.

---

## Severity Levels

Severity reflects the number and variety of detected profanities:

| Level | Enum Value | Description |
|-----------|------------|-----------------------------------------|
| MILD | 1 | 1 unique or total profane word |
| MODERATE | 2 | 2 unique or total profane words |
| SEVERE | 3 | 3 unique or total profane words |
| EXTREME | 4 | 4+ unique or 5+ total profane words |

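The thresholds above can be read as a small mapping. This is a sketch of the table, with the library's exported `ProfanitySeverity` enum values; the exact tie-breaking between unique and total counts is an assumption:

```typescript
enum ProfanitySeverity { MILD = 1, MODERATE = 2, SEVERE = 3, EXTREME = 4 }

function severityFor(uniqueWords: number, totalWords: number): ProfanitySeverity {
  if (uniqueWords >= 4 || totalWords >= 5) return ProfanitySeverity.EXTREME;
  if (uniqueWords >= 3 || totalWords >= 3) return ProfanitySeverity.SEVERE;
  if (uniqueWords >= 2 || totalWords >= 2) return ProfanitySeverity.MODERATE;
  return ProfanitySeverity.MILD;
}

severityFor(1, 1); // ProfanitySeverity.MILD
severityFor(2, 5); // ProfanitySeverity.EXTREME (5+ total occurrences)
```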
---

## Language Support

- **Profanity Dictionaries (15):** English, Hindi, French, German, Spanish, Italian, Brazilian Portuguese, Russian, Arabic, Chinese, Japanese, Korean, Bengali, Tamil, Telugu
- **Language Detection Trie (18):** All 15 above + Dutch, Turkish, Swedish (used for innocence scoring, not profanity detection)
- **Cross-Language Innocence Scoring:** English, Swedish, Norwegian, Danish, German, Dutch, French, Spanish
- **Scripts:** Latin/Roman, Devanagari, Tamil, Telugu, Bengali, Cyrillic, Arabic, CJK, etc.
- **Mixed Content:** Handles mixed-language and code-switched sentences with language-aware scoring.

```typescript
profanity.check('This is bullshit and चूतिया.'); // true (mixed English/Hindi)
profanity.check('Ce mot est merde and पागल.'); // true (French/Hindi)
profanity.check('Isso é uma merda.'); // true (Brazilian Portuguese)
```

---

## Use Exported Wordlists

For sample words in a language (for UIs, admin tools, etc.):

```typescript
import { englishBadWords, hindiBadWords } from 'bekindprofanityfilter';
console.log(englishBadWords.slice(0, 5)); // ["fuck", "shit", ...]
```

---

## Security

- **No wordlist exposure:** There is no `.list()` function, for security and encapsulation. Use the exported word arrays for samples.
- **Trie-based:** Scales easily to 50,000+ words.
- **Handles leet-speak:** Catches obfuscated variants like `f#ck`, `a55hole`.

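Leet-speak matching typically works by normalizing substituted characters back to letters before dictionary lookup. A sketch with an assumed substitution table (the library's internal mapping may differ):

```typescript
// Illustrative leet-speak normalization; this mapping is an assumption.
const LEET_MAP: Record<string, string> = {
  "4": "a", "@": "a", "3": "e", "1": "i", "!": "i",
  "0": "o", "5": "s", "$": "s", "7": "t", "#": "u",
};

function normalizeLeet(word: string): string {
  return [...word.toLowerCase()].map((ch) => LEET_MAP[ch] ?? ch).join("");
}

normalizeLeet("a55hole"); // "asshole"
normalizeLeet("f#ck");    // "fuck"
```

After normalization, the candidate word is matched against the trie like any other token.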
---

## Full Example

```typescript
import profanity, { ProfanitySeverity } from 'bekindprofanityfilter';

// Multi-language detection
profanity.loadLanguages(['english', 'french', 'tamil']);
console.log(profanity.check('Ce mot est merde.')); // true

// Leet-speak detection
console.log(profanity.check('You a f#cking a55hole!')); // true

// Whitelisting
profanity.addToWhitelist(['anal', 'ass']);
console.log(profanity.check('He is an associate professor.')); // false

// Severity
const result = profanity.detect('This is fucking bullshit and chutiya.');
console.log(ProfanitySeverity[result.severity]); // "SEVERE"

// Custom dictionary
profanity.loadCustomDictionary('pirate', ['barnacle-head', 'landlubber']);
profanity.loadLanguage('pirate');
console.log(profanity.check('You barnacle-head!')); // true

// Placeholder configuration
profanity.setPlaceholder('#');
console.log(profanity.clean('This is bullshit.')); // "This is ########."
profanity.setPlaceholder('*'); // Reset
```

---

## FAQ

**Q: How do I see all loaded profanities?**
A: For security, the internal word list is not exposed. Use the exported arrays such as `englishBadWords` for samples.

**Q: How do I reset the filter?**
A: Use `clearList()` and reload languages/dictionaries.

**Q: Is this safe for browser and Node.js?**
A: Yes. BeKind runs in both browser and Node.js environments.

---

## Middleware Examples

**Looking for Express.js/Node.js middleware to use BeKind in your API or chat app?**
**Check the [`examples/`](./examples/) folder for ready-to-copy middleware and integration samples.**

---

## Roadmap

- ✅ Cross-language innocence scoring (collision word disambiguation)
- ✅ Multi-language detection trie (18 languages)
- ✅ Language confusion map for Scandinavian/Germanic disambiguation
- ✅ Additional language packs (Arabic, Russian, Japanese, Korean, Chinese, Dutch)
- ✅ Romanization detection (Hinglish and other transliterated scripts)
- 🚧 Norwegian and Danish trie vocabularies (currently covered via confusion map)
- 🚧 Repeat character compression (normalize "fuuuuccckkkk" → "fuck" before matching, avoiding the need to enumerate elongations in the dictionary)
- 🚧 Phonetic matching (sounds-like detection)
- 🚧 Plugin system for custom detection algorithms

---

## License

MIT — See [LICENSE](https://github.com/grassroots-labs-org/be-kind-profanity-filter/blob/main/LICENSE)

This project is a fork of [allprofanity](https://github.com/ayush-jadaun/allprofanity) by Ayush Jadaun, also licensed under MIT.

---

## Contributing

We welcome contributions! Please see our [CONTRIBUTORS.md](./CONTRIBUTORS.md) for:

- How to add your name to our contributors list
- Guidelines for adding new languages
- Test requirements (must include passing test screenshots in PRs)
- Code of conduct and PR guidelines
|