bekindprofanityfilter 0.0.7 → 0.0.8

package/README.md CHANGED
@@ -20,8 +20,9 @@ A multi-language profanity filter with romanization detection, language-aware in
  - **Multi-Language Profanity Detection:** 34K+ word dictionary across 16 languages with 18-language detection trie
  - **Romanization Detection:** Catches Hinglish, transliterated Bengali, Tamil, Telugu, and Japanese
  - **Cross-Language Innocence Scoring:** Handles words like "got" (Turkish: "buttocks") and "fart" (Norwegian: "speed")
+ - **False Positive Reduction:** Common dual-meaning words (groomer, missionary, edibles, tinker, etc.) are excluded from default detection to prevent flagging legitimate content on community platforms
  - **Context-Aware Analysis:** Booster/reducer patterns detect sexual context, negation, medical usage, and quoted speech
- - **Leet-Speak Detection:** Catches obfuscated profanity (`f#ck`, `a55hole`, `sh1t`)
+ - **Leet-Speak Detection:** Catches obfuscated profanity (`f#ck`, `a55hole`, `sh1t`) including digit-based leet (`6006s`, `d1ld0`)
  - **Word Boundary Detection:** Smart whole-word matching prevents flagging "assassin" or "assistance"
  - **Multiple Algorithms:** Trie (default), Aho-Corasick, or Hybrid modes with optional Bloom filters and result caching
 
@@ -32,7 +33,7 @@ A multi-language profanity filter with romanization detection, language-aware in
  ### Performance & Speed
 
  - **Multiple Algorithm Options:** Choose between Trie (default), Aho-Corasick, or Hybrid modes
- - **664% Faster on Large Texts:** Aho-Corasick delivers O(n) multi-pattern matching
+ - **Fast on Large Texts:** Aho-Corasick delivers O(n) multi-pattern matching
  - **123x Speedup with Caching:** Result cache perfect for repeated checks (chat, forms, APIs)
  - **~27K ops/sec:** Default Trie mode handles short texts incredibly fast
  - **Single-Pass Scanning:** O(n) complexity regardless of dictionary size
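
The result-caching bullet in this hunk can be illustrated with a small sketch. This is not the package's cache: `withResultCache`, `CheckFn`, `slowCheck`, and the `maxEntries` limit are hypothetical names invented here, and the eviction policy (least recently used) is an assumption.

```typescript
// Hypothetical result cache; none of these names come from the package's API.
type CheckFn = (text: string) => boolean;

function withResultCache(check: CheckFn, maxEntries = 1000): CheckFn {
  // Map preserves insertion order, so the first key is the least recently used.
  const cache = new Map<string, boolean>();
  return (text: string): boolean => {
    const hit = cache.get(text);
    if (hit !== undefined) {
      // Re-insert on a hit to mark the entry as most recently used.
      cache.delete(text);
      cache.set(text, hit);
      return hit;
    }
    const result = check(text);
    cache.set(text, result);
    if (cache.size > maxEntries) {
      // Evict the least recently used entry.
      const oldest = cache.keys().next().value;
      if (oldest !== undefined) cache.delete(oldest);
    }
    return result;
  };
}

// Stand-in checker that counts how often it actually runs.
let calls = 0;
const slowCheck: CheckFn = (text) => {
  calls += 1;
  return text.toLowerCase().includes("sh1t");
};

const cachedCheck = withResultCache(slowCheck, 2);
cachedCheck("hello world");
cachedCheck("hello world"); // served from the cache; slowCheck ran only once
```

A cache like this only pays off when identical strings recur, which is the chat, form, and API case the bullet describes.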
@@ -867,19 +868,21 @@ Words like "ass" (donkey) and "cock" (rooster) are both profane and innocent in
 
  ### Tested Scenarios
 
- The challenge test suite (`tests/challenge-tests.test.ts`) validates 32 real-world scenarios:
+ The challenge test suite (`tests/challenge-tests.test.ts`) documents known limitations and unsolved edge cases:
 
- | Category | Tests | Passing | Description |
- |----------|-------|---------|-------------|
- | Swedish text | 7 | 7 | News, recipes, driving, email, school contexts |
- | Norwegian/Danish | 4 | 4 | Via confusion map cross-detection |
- | Mixed-language | 5 | 4 | Code-switching, bilingual documents |
- | Same-language (en→en) | 5 | 3 | Donkey/rooster/garden contexts |
- | Threshold boundaries | 3 | 3 | Minimal context, short text |
- | Adversarial inputs | 4 | 4 | Swedish padding attacks, Unicode tricks |
- | Missing language pairs | 4 | 3 | Dutch, German, Italian, Portuguese |
+ | Category | Tests | Status | Description |
+ |----------|-------|--------|-------------|
+ | Semantic analysis | 4 | skipped | "slut" in Swedish, "ass" as donkey, "cock" as rooster, "git" as VCS tool |
+ | Dual-meaning words (innocent) | 10 | skipped | tinker, edibles, missionary, groomer, puttanesca, 8ball, catfish, knob, redskins, crime statistics |
+ | Dual-meaning words (hateful) | 4 | skipped | Same words used as slurs/dog whistles |
+ | Short common words (innocent) | 18 | skipped | ken, nom, gay, dom, eta, tat, goo, mut, bur, ano, wea, mae, pos, hag, div, bra, bal, gu |
+ | Short common words (profane) | 18 | skipped | Same words in their profane language contexts |
+ | Embedded profanity | 1 | skipped | "urASSHOLEbro" concatenated evasion |
+ | Digit leet-speak | 4 | passing | "6006s", "b006s", "4ss", "d1ld0" |
 
- **4 skipped tests** document unsolved challenges requiring semantic analysis or additional language support.
+ **Dual-meaning words** like "groomer" (dog grooming vs anti-LGBTQ+ slur), "missionary" (religious work vs sexual position), and "edibles" (food vs cannabis) are commented out of the dictionary to prevent false positives on community platforms. The challenge tests document both the innocent and hateful usage patterns; solving these requires context-aware detection that can distinguish legitimate uses from slurs/dog whistles.
+
+ **Short common words** (≤3 chars) collide across languages (e.g., "ken" is French verlan for a slur but also a common English name). These need language-aware context detection before re-enabling.
 
  ---
 
@@ -992,6 +995,9 @@ A: Yes! BeKind is universal.
  - ✅ Language confusion map for Scandinavian/Germanic disambiguation
  - ✅ Additional language packs (Arabic, Russian, Japanese, Korean, Chinese, Dutch)
  - ✅ Romanization detection (Hinglish and other transliterated scripts)
+ - ✅ False positive reduction for dual-meaning words (groomer, missionary, edibles, etc.)
+ - ✅ Digit leet-speak detection (6006s, b006s, 4ss, d1ld0)
+ - 🚧 Context-aware dual-meaning word disambiguation (distinguish "dog groomer" from "groomer" as slur)
  - 🚧 Norwegian and Danish trie vocabularies (currently covered via confusion map)
  - 🚧 Repeat character compression (normalize elongated words before matching, avoiding the need to enumerate elongations in the dictionary)
  - 🚧 Phonetic matching (sounds-like detection)
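
Two roadmap items in this hunk, digit leet-speak detection and repeat character compression, are both normalization steps that run before dictionary lookup. The sketch below shows one way such steps could work. It is not the package's implementation: `DIGIT_TO_LETTER`, `decodeDigitLeet`, and `compressRepeats` are hypothetical names, and the digit map is an assumption chosen only to cover the examples quoted in this diff.

```typescript
// Assumed digit-to-letter map; the package's real substitution table is not shown in the diff.
const DIGIT_TO_LETTER: Record<string, string> = {
  "0": "o",
  "1": "i",
  "4": "a",
  "5": "s",
  "6": "b",
};

// Produce a lookup candidate by swapping leet digits for letters.
function decodeDigitLeet(word: string): string {
  return word.toLowerCase().replace(/[01456]/g, (d) => DIGIT_TO_LETTER[d] ?? d);
}

// Collapse any run of one character longer than maxRun down to maxRun copies.
// maxRun must be at least 1.
function compressRepeats(word: string, maxRun = 2): string {
  const run = new RegExp(`(.)\\1{${maxRun},}`, "g");
  return word.replace(run, (_match, ch: string) => ch.repeat(maxRun));
}

// Candidates a matcher could look up alongside the original token.
const leetCandidate = decodeDigitLeet("4ss");           // "ass"
const elongatedCandidate = compressRepeats("coooool");  // "cool"
```

Decoding every digit would also rewrite ordinary numbers, so a matcher built on this would likely apply it only to tokens that mix letters and digits, and would keep the original token for reporting. For elongations, trying both `maxRun = 2` and `maxRun = 1` covers words with and without legitimate double letters.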