bekindprofanityfilter 0.0.6 → 0.0.8
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +19 -13
- package/dist/cjs/index.js +3 -3
- package/dist/esm/index.d.ts +15 -0
- package/dist/esm.min.js +3 -3
- package/package.json +1 -1
package/README.md
CHANGED
@@ -20,8 +20,9 @@ A multi-language profanity filter with romanization detection, language-aware in
 - **Multi-Language Profanity Detection:** 34K+ word dictionary across 16 languages with 18-language detection trie
 - **Romanization Detection:** Catches Hinglish, transliterated Bengali, Tamil, Telugu, and Japanese
 - **Cross-Language Innocence Scoring:** Handles words like "got" (Turkish: "buttocks") and "fart" (Norwegian: "speed")
+- **False Positive Reduction:** Common dual-meaning words (groomer, missionary, edibles, tinker, etc.) are excluded from default detection to prevent flagging legitimate content on community platforms
 - **Context-Aware Analysis:** Booster/reducer patterns detect sexual context, negation, medical usage, and quoted speech
-- **Leet-Speak Detection:** Catches obfuscated profanity (`f#ck`, `a55hole`, `sh1t`)
+- **Leet-Speak Detection:** Catches obfuscated profanity (`f#ck`, `a55hole`, `sh1t`) including digit-based leet (`6006s`, `d1ld0`)
 - **Word Boundary Detection:** Smart whole-word matching prevents flagging "assassin" or "assistance"
 - **Multiple Algorithms:** Trie (default), Aho-Corasick, or Hybrid modes with optional Bloom filters and result caching
 
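The digit-based leet handling and whole-word matching described in the hunk above can be illustrated with a minimal sketch. This is not the package's implementation, which the diff does not show: `LEET_MAP`, `normalizeLeet`, and `findWholeWordMatches` are hypothetical names, and the substitution table is an assumption chosen to cover the README's own examples.

```typescript
// Hypothetical substitution table (an assumption; the package's real mapping
// is not visible in this diff). Real filters usually allow several letters per
// digit, e.g. "6" could also stand for "g".
const LEET_MAP: Record<string, string> = {
  "0": "o", "1": "i", "4": "a", "5": "s", "6": "b", "#": "u",
};

// Replace each leet character with its assumed letter, leaving others untouched.
function normalizeLeet(token: string): string {
  return token
    .toLowerCase()
    .split("")
    .map((ch) => LEET_MAP[ch] ?? ch)
    .join("");
}

// Whole-word matching: split on anything that is not a letter, digit, or "#",
// then compare each normalized token against the blocklist. Substrings inside
// longer words ("assassin", "assistance") never match.
function findWholeWordMatches(text: string, blocklist: Set<string>): string[] {
  const tokens = text.split(/[^a-z0-9#]+/i).filter((t) => t.length > 0);
  return tokens.filter((t) => blocklist.has(normalizeLeet(t)));
}
```

Normalizing before the dictionary lookup means the dictionary does not need one entry per obfuscated spelling. The trade-off is that a single fixed table will miss variants that use a different letter for the same digit.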
@@ -32,7 +33,7 @@ A multi-language profanity filter with romanization detection, language-aware in
 ### Performance & Speed
 
 - **Multiple Algorithm Options:** Choose between Trie (default), Aho-Corasick, or Hybrid modes
-- **
+- **Fast on Large Texts:** Aho-Corasick delivers O(n) multi-pattern matching
 - **123x Speedup with Caching:** Result cache perfect for repeated checks (chat, forms, APIs)
 - **~27K ops/sec:** Default Trie mode handles short texts incredibly fast
 - **Single-Pass Scanning:** O(n) complexity regardless of dictionary size
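The result cache mentioned in the hunk above pays off when the same strings are checked repeatedly. The sketch below shows one way such a cache could wrap any checker function. It is an assumption-laden illustration, not the package's API: `withResultCache` and `maxEntries` are hypothetical names, and the eviction policy (least recently used, via `Map` insertion order) is a choice made here, not something the README states.

```typescript
// Wrap a checker so repeated inputs are answered from memory.
// `check` stands in for any text -> result function (hypothetical).
function withResultCache<R>(
  check: (text: string) => R,
  maxEntries: number = 1000,
): (text: string) => R {
  // Map preserves insertion order, so the first key is the least recently used.
  const cache = new Map<string, R>();

  return (text: string): R => {
    if (cache.has(text)) {
      const hit = cache.get(text) as R;
      // Re-insert to mark this entry as most recently used.
      cache.delete(text);
      cache.set(text, hit);
      return hit;
    }

    const result = check(text);
    cache.set(text, result);

    // Evict the oldest entry once the cache exceeds its bound.
    if (cache.size > maxEntries) {
      const oldest = cache.keys().next().value as string;
      cache.delete(oldest);
    }
    return result;
  };
}
```

Bounding the cache keeps memory flat on workloads such as chat, where most inputs are unique and only a small set of messages repeats.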
@@ -867,19 +868,21 @@ Words like "ass" (donkey) and "cock" (rooster) are both profane and innocent in
 
 ### Tested Scenarios
 
-The challenge test suite (`tests/challenge-tests.test.ts`)
+The challenge test suite (`tests/challenge-tests.test.ts`) documents known limitations and unsolved edge cases:
 
-| Category | Tests |
-
-|
-|
-|
-|
-|
-|
-|
+| Category | Tests | Status | Description |
+|----------|-------|--------|-------------|
+| Semantic analysis | 4 | skipped | "slut" in Swedish, "ass" as donkey, "cock" as rooster, "git" as VCS tool |
+| Dual-meaning words (innocent) | 10 | skipped | tinker, edibles, missionary, groomer, puttanesca, 8ball, catfish, knob, redskins, crime statistics |
+| Dual-meaning words (hateful) | 4 | skipped | Same words used as slurs/dog whistles |
+| Short common words (innocent) | 18 | skipped | ken, nom, gay, dom, eta, tat, goo, mut, bur, ano, wea, mae, pos, hag, div, bra, bal, gu |
+| Short common words (profane) | 18 | skipped | Same words in their profane language contexts |
+| Embedded profanity | 1 | skipped | "urASSHOLEbro" — concatenated evasion |
+| Digit leet-speak | 4 | passing | "6006s", "b006s", "4ss", "d1ld0" |
 
-**
+**Dual-meaning words** like "groomer" (dog grooming vs anti-LGBTQ+ slur), "missionary" (religious work vs sexual position), and "edibles" (food vs cannabis) are commented out of the dictionary to prevent false positives on community platforms. The challenge tests document both the innocent and hateful usage patterns — solving these requires context-aware detection that can distinguish legitimate uses from slurs/dog whistles.
+
+**Short common words** (≤3 chars) collide across languages (e.g., "ken" is French verlan for a slur but also a common English name). These need language-aware context detection before re-enabling.
 
 ---
 
@@ -992,6 +995,9 @@ A: Yes! BeKind is universal.
 - ✅ Language confusion map for Scandinavian/Germanic disambiguation
 - ✅ Additional language packs (Arabic, Russian, Japanese, Korean, Chinese, Dutch)
 - ✅ Romanization detection (Hinglish and other transliterated scripts)
+- ✅ False positive reduction for dual-meaning words (groomer, missionary, edibles, etc.)
+- ✅ Digit leet-speak detection (6006s, b006s, 4ss, d1ld0)
+- 🚧 Context-aware dual-meaning word disambiguation (distinguish "dog groomer" from "groomer" as slur)
 - 🚧 Norwegian and Danish trie vocabularies (currently covered via confusion map)
 - 🚧 Repeat character compression (normalize elongated words before matching, avoiding the need to enumerate elongations in the dictionary)
 - 🚧 Phonetic matching (sounds-like detection)