glin-profanity 2.3.8 → 3.0.1

package/README.md CHANGED
@@ -28,6 +28,9 @@
  <a href="https://www.npmjs.com/package/glin-profanity">
  <img src="https://img.shields.io/npm/dw/glin-profanity" alt="Weekly Downloads" />
  </a>
+ <a href="https://pepy.tech/projects/glin-profanity">
+ <img src="https://static.pepy.tech/personalized-badge/glin-profanity?period=total&units=international_system&left_color=black&right_color=green&left_text=Python%20Downloads" alt="PyPI Downloads" />
+ </a>
  <a href="https://github.com/GLINCKER/glin-profanity/issues">
  <img src="https://img.shields.io/github/issues/GLINCKER/glin-profanity" alt="Open Issues" />
  </a>
@@ -81,8 +84,29 @@ Whether you're moderating chat messages, community forums, or content input form
  <img src="https://img.shields.io/badge/Real--Time-⚡-yellow?style=for-the-badge" alt="Real-Time" />
  <img src="https://img.shields.io/badge/Obfuscation_Detection-🕵️-purple?style=for-the-badge" alt="Obfuscation" />
  <img src="https://img.shields.io/badge/Framework_Agnostic-🧩-green?style=for-the-badge" alt="Framework Agnostic" />
+ <img src="https://img.shields.io/badge/ML_Powered-🤖-orange?style=for-the-badge" alt="ML Powered" />
  </div>
 
+ ### 💡 Why glin-profanity?
+
+ | | |
+ |---|---|
+ | 🔒 **Privacy First** | Runs entirely on-device. No API calls, no data leaves your app. GDPR/CCPA friendly. |
+ | ⚡ **Blazing Fast** | 23K-115K ops/sec rule-based, 21M+ ops/sec with caching. Sub-millisecond latency. |
+ | 🌍 **Truly Multilingual** | 23 languages with unified dictionary. Consistent detection across locales. |
+ | 🛡️ **Evasion Resistant** | Catches leetspeak (`f4ck`), Unicode tricks (`fυck`), zero-width chars, and homoglyphs. |
+ | 🤖 **AI-Ready** | Optional ML integration for context-aware toxicity detection beyond keywords. |
+ | 🧩 **Zero Config** | Works out of the box. No API keys, no server, no setup required. |
+ | 📦 **Lightweight** | ~90KB core bundle. Tree-shakeable. No heavy dependencies for basic usage. |
+
+ ### ✨ What's New in v3.0
+
+ - **Leetspeak Detection**: Catch `f4ck`, `@ss`, `$h!t` with 3 intensity levels
+ - **Unicode Normalization**: Detect Cyrillic/Greek lookalikes, full-width chars, zero-width spaces
+ - **Result Caching**: 800x speedup for repeated checks
+ - **ML Integration**: Optional TensorFlow.js toxicity model for nuanced detection
+ - **Performance**: Optimized for high-throughput production workloads
+
  ## 📚 Table of Contents
 
  - [🚀 Key Features](#-key-features)
@@ -104,6 +128,13 @@ Whether you're moderating chat messages, community forums, or content input form
  - [Return Value](#return-value)
  - [⚠️ Note](#note)
  - [🛠 Use Cases](#-use-cases)
+ - [🔬 Advanced Features](#-advanced-features)
+ - [Leetspeak Detection](#leetspeak-detection)
+ - [Unicode Normalization](#unicode-normalization)
+ - [Result Caching](#result-caching)
+ - [Configuration Management](#configuration-management)
+ - [ML-Based Detection](#ml-based-detection)
+ - [📊 Benchmarks](#-benchmarks)
  - [📄 License](#license)
  - [MIT License](#mit-license)
 
@@ -357,6 +388,11 @@ new Filter(config?: FilterConfig);
  | `autoReplace` | `boolean` | Whether to auto-replace flagged words |
  | `minSeverity` | `SeverityLevel` | Minimum severity to include in final list |
  | `customActions` | `(result) => void` | Custom logging/callback support |
+ | `detectLeetspeak` | `boolean` | Enable leetspeak detection (e.g., `f4ck` → `fuck`) |
+ | `leetspeakLevel` | `'basic' \| 'moderate' \| 'aggressive'` | Leetspeak detection intensity |
+ | `normalizeUnicode` | `boolean` | Enable Unicode normalization for homoglyphs |
+ | `cacheResults` | `boolean` | Cache results for repeated checks |
+ | `maxCacheSize` | `number` | Maximum cache size (default: 1000) |
 
  ---
 
@@ -419,6 +455,167 @@ const { result, checkText, checkTextAsync, reset, isDirty, isWordProfane } = use
  - 🕹️ Game lobbies & multiplayer chats
  - 🤖 AI content filters before processing input
 
+ ## 🔬 Advanced Features
+
+ ### Leetspeak Detection
+
+ Detect and normalize leetspeak variations like `f4ck`, `@ss`, `$h!t`:
+
+ ```typescript
+ import { Filter } from 'glin-profanity';
+
+ const filter = new Filter({
+   languages: ['english'],
+   detectLeetspeak: true,
+   leetspeakLevel: 'moderate', // 'basic' | 'moderate' | 'aggressive'
+ });
+
+ filter.isProfane('f4ck'); // true
+ filter.isProfane('@ss'); // true
+ filter.isProfane('$h!t'); // true
+ filter.isProfane('f u c k'); // true (spaced characters)
+ ```
+
+ **Leetspeak Levels:**
+ - `basic`: Numbers only (0→o, 1→i, 3→e, 4→a, 5→s)
+ - `moderate`: Basic + common symbols (@→a, $→s, !→i)
+ - `aggressive`: All known substitutions including rare ones
+
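The `basic` and `moderate` levels listed above can be read as nested substitution tables. The following is a minimal sketch under that assumption, not the library's implementation: `BASIC_MAP`, `MODERATE_MAP`, and `normalizeLeet` are hypothetical names, and the real tables (especially for `aggressive`, which the README does not enumerate) may differ.

```typescript
// Hypothetical substitution tables built only from the level descriptions above.
type LeetLevel = 'basic' | 'moderate';

// basic: digits only
const BASIC_MAP: Record<string, string> = {
  '0': 'o', '1': 'i', '3': 'e', '4': 'a', '5': 's',
};

// moderate: basic plus common symbols
const MODERATE_MAP: Record<string, string> = {
  ...BASIC_MAP,
  '@': 'a', '$': 's', '!': 'i',
};

// Replace each character that has a mapping at the chosen level.
function normalizeLeet(text: string, level: LeetLevel): string {
  const map = level === 'basic' ? BASIC_MAP : MODERATE_MAP;
  return Array.from(text.toLowerCase(), (ch) => map[ch] ?? ch).join('');
}

console.log(normalizeLeet('h3ll0', 'basic'));    // "hello"
console.log(normalizeLeet('$h!t', 'basic'));     // "$h!t" (symbols are not mapped at this level)
console.log(normalizeLeet('$h!t', 'moderate'));  // "shit"
```

A dictionary lookup would then run on the normalized string, which is why a stricter level can miss inputs that a looser one catches.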
+ ### Unicode Normalization
+
+ Detect homoglyphs and Unicode obfuscation:
+
+ ```typescript
+ import { Filter } from 'glin-profanity';
+
+ const filter = new Filter({
+   languages: ['english'],
+   normalizeUnicode: true, // enabled by default
+ });
+
+ // Detects various Unicode tricks:
+ filter.isProfane('fυck'); // true (Greek upsilon υ → u)
+ filter.isProfane('fᴜck'); // true (Small caps ᴜ → u)
+ filter.isProfane('f\u200Bu\u200Bc\u200Bk'); // true (Zero-width spaces removed)
+ filter.isProfane('ｆｕｃｋ'); // true (Full-width characters)
+ ```
+
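One common way to get this behavior combines three steps: strip zero-width characters, apply NFKC normalization (which folds full-width letters to ASCII), and map a table of lookalike characters. The sketch below assumes that approach; `normalizeForMatching` and the small `HOMOGLYPHS` table are hypothetical and only illustrate the idea, since the README does not describe the library's internals.

```typescript
// Zero-width space, non-joiner, joiner, word joiner, and BOM.
const ZERO_WIDTH = /[\u200B-\u200D\u2060\uFEFF]/g;

// Tiny illustrative subset of lookalikes; a real table would be much larger.
const HOMOGLYPHS: Record<string, string> = {
  '\u03C5': 'u', // Greek small letter upsilon
  '\u1D1C': 'u', // Latin letter small capital U
  '\u0430': 'a', // Cyrillic small letter a
  '\u0435': 'e', // Cyrillic small letter ie
  '\u043E': 'o', // Cyrillic small letter o
};

function normalizeForMatching(text: string): string {
  const stripped = text.replace(ZERO_WIDTH, '');
  // NFKC folds compatibility forms such as full-width Latin letters.
  const folded = stripped.normalize('NFKC').toLowerCase();
  return Array.from(folded, (ch) => HOMOGLYPHS[ch] ?? ch).join('');
}

console.log(normalizeForMatching('h\u200Be\u200Bllo'));               // "hello"
console.log(normalizeForMatching('\uFF48\uFF45\uFF4C\uFF4C\uFF4F'));  // "hello"
console.log(normalizeForMatching('h\u0435ll\u043E'));                 // "hello"
```

NFKC alone does not map Greek or Cyrillic lookalikes to Latin letters, which is why the explicit table is needed.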
+ ### Result Caching
+
+ Enable caching for high-performance repeated checks:
+
+ ```typescript
+ import { Filter } from 'glin-profanity';
+
+ const filter = new Filter({
+   languages: ['english'],
+   cacheResults: true,
+   maxCacheSize: 1000, // LRU eviction when full
+ });
+
+ // First call computes result
+ filter.checkProfanity('hello world'); // ~0.04ms
+
+ // Subsequent calls return cached result
+ filter.checkProfanity('hello world'); // ~0.00005ms (800x faster!)
+
+ // Cache management
+ console.log(filter.getCacheSize()); // 1
+ filter.clearCache();
+ ```
+
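The `maxCacheSize` comment mentions LRU eviction. As a rough illustration of what that means, here is a minimal LRU cache built on the insertion order of a `Map`. `LruCache` is a hypothetical class written for this note; how glin-profanity implements its cache is not shown in the README.

```typescript
// Hypothetical LRU cache; not the library's internal implementation.
class LruCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly maxSize: number;

  constructor(maxSize: number) {
    this.maxSize = maxSize;
  }

  get(key: string): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    } else if (this.entries.size >= this.maxSize) {
      // A Map iterates in insertion order, so the first key is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, value);
  }

  has(key: string): boolean {
    return this.entries.has(key);
  }

  get size(): number {
    return this.entries.size;
  }
}

const cache = new LruCache<boolean>(2);
cache.set('hello world', false);
cache.set('second text', false);
cache.get('hello world');        // touch: 'second text' is now the oldest entry
cache.set('third text', false);  // evicts 'second text'
console.log(cache.has('second text')); // false
console.log(cache.size);               // 2
```

Keying on the input text only works if the filter configuration is fixed for the lifetime of the cache, which is one reason a `clearCache()` method is useful.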
+ ### Configuration Management
+
+ Export and import filter configurations for sharing between environments:
+
+ ```typescript
+ import { Filter } from 'glin-profanity';
+
+ const filter = new Filter({
+   languages: ['english', 'spanish'],
+   detectLeetspeak: true,
+   leetspeakLevel: 'aggressive',
+   cacheResults: true,
+ });
+
+ // Export configuration
+ const config = filter.getConfig();
+ // Save to file: fs.writeFileSync('filter.config.json', JSON.stringify(config));
+
+ // Later, restore configuration
+ // const savedConfig = JSON.parse(fs.readFileSync('filter.config.json', 'utf8'));
+ // const restoredFilter = new Filter(savedConfig);
+
+ // Get dictionary size for monitoring
+ console.log(filter.getWordCount()); // 406
+ ```
+
+ ### ML-Based Detection
+
+ Optional TensorFlow.js-powered toxicity detection for context-aware filtering:
+
+ ```bash
+ # Install optional dependencies
+ npm install @tensorflow/tfjs @tensorflow-models/toxicity
+ ```
+
+ ```typescript
+ import { HybridFilter } from 'glin-profanity/ml';
+
+ const filter = new HybridFilter({
+   languages: ['english'],
+   detectLeetspeak: true,
+   enableML: true,
+   mlThreshold: 0.85,
+   combinationMode: 'or', // 'or' | 'and' | 'ml-override' | 'rules-first'
+ });
+
+ // Initialize ML model (async)
+ await filter.initialize();
+
+ // Hybrid check (rules + ML)
+ const result = await filter.checkProfanityAsync('you are terrible');
+ console.log(result.isToxic); // true
+ console.log(result.mlResult?.matchedCategories); // ['insult', 'toxicity']
+ console.log(result.confidence); // 0.92
+
+ // Sync rule-based check (fast, no ML)
+ filter.isProfane('badword'); // true
+ ```
+
+ **ML Categories Detected:**
+ - `toxicity` - General toxic content
+ - `insult` - Insults and personal attacks
+ - `threat` - Threatening language
+ - `obscene` - Obscene/vulgar content
+ - `identity_attack` - Identity-based hate
+ - `sexual_explicit` - Sexually explicit content
+ - `severe_toxicity` - Highly toxic content
+
+ ## 📊 Benchmarks
+
+ Performance benchmarks on a MacBook Pro (M1):
+
+ | Operation | Throughput | Average Time |
+ |-----------|------------|--------------|
+ | `isProfane` (clean text) | 23,524 ops/sec | 0.04ms |
+ | `isProfane` (profane text) | 114,666 ops/sec | 0.009ms |
+ | With leetspeak detection | 22,904 ops/sec | 0.04ms |
+ | With Unicode normalization | 24,058 ops/sec | 0.04ms |
+ | With caching (cached hit) | **21,396,095 ops/sec** | 0.00005ms |
+ | `checkProfanity` (detailed) | 3,677 ops/sec | 0.27ms |
+ | Multi-language (4 langs) | 24,855 ops/sec | 0.04ms |
+ | All languages (23 langs) | 14,114 ops/sec | 0.07ms |
+
+ **Key Findings:**
+ - Leetspeak and Unicode normalization add minimal overhead
+ - Caching provides **800x speedup** for repeated checks
+ - Multi-language support scales well
+
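The 800x figure appears to come from the rounded average times in the table (0.04 ms against 0.00005 ms); the unrounded throughput numbers give a somewhat higher ratio. A small sketch of that arithmetic, using only values from the table (the helper name `msPerOp` is ours):

```typescript
// Convert a throughput in operations per second to milliseconds per operation.
const msPerOp = (opsPerSec: number): number => 1000 / opsPerSec;

const uncached = 23_524;    // isProfane (clean text), ops/sec
const cached = 21_396_095;  // with caching (cached hit), ops/sec

console.log(msPerOp(uncached).toFixed(4));   // "0.0425"
console.log(msPerOp(cached).toFixed(7));     // "0.0000467"
console.log(Math.round(0.04 / 0.00005));     // 800 (ratio of the rounded times)
console.log(Math.round(cached / uncached));  // 910 (ratio of the raw throughputs)
```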
+ Run benchmarks yourself:
+ ```bash
+ npm run benchmark
+ ```
 
  ## License