py-secret-scan 3.0.1__tar.gz

Files changed (43)
  1. py_secret_scan-3.0.1/.gitignore +12 -0
  2. py_secret_scan-3.0.1/LICENSE +21 -0
  3. py_secret_scan-3.0.1/PKG-INFO +662 -0
  4. py_secret_scan-3.0.1/README.md +643 -0
  5. py_secret_scan-3.0.1/data/Infrastructure/authentication/regex.list +21 -0
  6. py_secret_scan-3.0.1/data/Infrastructure/authentication/rules.json +355 -0
  7. py_secret_scan-3.0.1/data/Infrastructure/authentication/test_data.json +230 -0
  8. py_secret_scan-3.0.1/data/Infrastructure/database_credentials/regex.list +8 -0
  9. py_secret_scan-3.0.1/data/Infrastructure/database_credentials/rules.json +253 -0
  10. py_secret_scan-3.0.1/data/Infrastructure/database_credentials/test_data.json +150 -0
  11. py_secret_scan-3.0.1/data/Infrastructure/infrastructure/regex.list +2 -0
  12. py_secret_scan-3.0.1/data/Infrastructure/infrastructure/rules.json +131 -0
  13. py_secret_scan-3.0.1/data/Infrastructure/infrastructure/test_data.json +88 -0
  14. py_secret_scan-3.0.1/data/Structured/api_keys/regex.list +682 -0
  15. py_secret_scan-3.0.1/data/Structured/api_keys/rules.json +4567 -0
  16. py_secret_scan-3.0.1/data/Structured/api_keys/test_data.json +2908 -0
  17. py_secret_scan-3.0.1/data/Structured/certificates/regex.list +1 -0
  18. py_secret_scan-3.0.1/data/Structured/certificates/rules.json +14 -0
  19. py_secret_scan-3.0.1/data/Structured/certificates/test_data.json +10 -0
  20. py_secret_scan-3.0.1/data/Structured/cloud_credentials/regex.list +7 -0
  21. py_secret_scan-3.0.1/data/Structured/cloud_credentials/rules.json +243 -0
  22. py_secret_scan-3.0.1/data/Structured/cloud_credentials/test_data.json +158 -0
  23. py_secret_scan-3.0.1/data/Structured/private_keys/regex.list +4 -0
  24. py_secret_scan-3.0.1/data/Structured/private_keys/rules.json +7890 -0
  25. py_secret_scan-3.0.1/data/Structured/private_keys/test_data.json +5256 -0
  26. py_secret_scan-3.0.1/data/Structured/tokens/regex.list +36 -0
  27. py_secret_scan-3.0.1/data/Structured/tokens/rules.json +642 -0
  28. py_secret_scan-3.0.1/data/Structured/tokens/test_data.json +412 -0
  29. py_secret_scan-3.0.1/data/contextual/rules.json +67 -0
  30. py_secret_scan-3.0.1/data/contextual/test_data.json +42 -0
  31. py_secret_scan-3.0.1/data/pii/rules.json +60 -0
  32. py_secret_scan-3.0.1/data/pii/test_data.json +50 -0
  33. py_secret_scan-3.0.1/pyproject.toml +41 -0
  34. py_secret_scan-3.0.1/src/__init__.py +0 -0
  35. py_secret_scan-3.0.1/src/cache_engine.py +27 -0
  36. py_secret_scan-3.0.1/src/cli.py +249 -0
  37. py_secret_scan-3.0.1/src/detector.py +621 -0
  38. py_secret_scan-3.0.1/src/git_engine.py +117 -0
  39. py_secret_scan-3.0.1/src/ignore_engine.py +61 -0
  40. py_secret_scan-3.0.1/src/obfuscator.py +151 -0
  41. py_secret_scan-3.0.1/src/report.py +165 -0
  42. py_secret_scan-3.0.1/src/suggestions.py +23 -0
  43. py_secret_scan-3.0.1/src/validators.py +32 -0
@@ -0,0 +1,12 @@
1
+ # Default ignored files
2
+ /shelf/
3
+ /workspace.xml
4
+ # Ignored default folder with query files
5
+ /queries/
6
+ # Datasource local storage ignored files
7
+ /dataSources/
8
+ /dataSources.local.xml
9
+ # Editor-based HTTP Client requests
10
+ /httpRequests/
11
+ /.venv/
12
+ __pycache__/
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 JMartynov
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,662 @@
1
+ Metadata-Version: 2.4
2
+ Name: py-secret-scan
3
+ Version: 3.0.1
4
+ Summary: A local security tool to detect secrets and PII in LLM prompts, code, and logs.
5
+ Project-URL: Homepage, https://github.com/JMartynov/secret-scan
6
+ Project-URL: Repository, https://github.com/JMartynov/secret-scan
7
+ Author: JMartynov
8
+ License: MIT
9
+ License-File: LICENSE
10
+ Requires-Python: >=3.8
11
+ Requires-Dist: faker
12
+ Requires-Dist: google-re2
13
+ Requires-Dist: pyahocorasick
14
+ Requires-Dist: pydantic<2.0
15
+ Requires-Dist: pyyaml==6.0.3
16
+ Requires-Dist: regex
17
+ Requires-Dist: ruamel-yaml
18
+ Description-Content-Type: text/markdown
19
+
20
+ # LLM Secrets Leak Detector
21
+
22
+ [![Self Secret Scan](https://github.com/JMartynov/secret-scan/actions/workflows/self-scan.yml/badge.svg)](https://github.com/JMartynov/secret-scan/actions/workflows/self-scan.yml)
23
+
24
+ **LLM Secrets Leak Detector** is a security tool designed to prevent accidental exposure of sensitive data when interacting with Large Language Models (LLMs).
25
+
26
+ Modern AI development workflows frequently involve sending code snippets, configuration files, logs, and debugging output to language models. In many cases developers unintentionally include sensitive information such as API keys, database credentials, private tokens, or internal infrastructure details.
27
+
28
+ This project detects those secrets **before they leave the developer’s environment**.
29
+
30
+ The system scans prompts, responses, logs, and source code to identify potential secrets and warns the user when confidential data may be exposed.
31
+
32
+ ---
33
+
34
+ # Problem
35
+
36
+ LLM-assisted development has dramatically increased developer productivity. However, it has also introduced a new security risk.
37
+
38
+ Developers regularly paste entire code blocks, configuration files, or logs into AI assistants to ask questions like:
39
+
40
+ > “Here is my code, can you help debug it?”
41
+
42
+ These inputs often contain secrets such as:
43
+
44
+ * API keys
45
+ * database credentials
46
+ * private tokens
47
+ * authentication secrets
48
+ * internal infrastructure URLs
49
+ * encryption keys
50
+ * JWT tokens
51
+
52
+ Example input that might be leaked:
53
+
54
+ ```
55
+ OPENAI_API_KEY=sk-abc123
56
+ DATABASE_URL=postgres://user:pass@db
57
+ JWT_SECRET=super-secret-token
58
+ ```
59
+
60
+ Once sent to an external LLM service, this data may:
61
+
62
+ * appear in provider logs
63
+ * be stored for debugging
64
+ * violate compliance policies
65
+ * leak sensitive infrastructure details
66
+
67
+ The exposure of secrets in software artifacts has been increasing rapidly, with millions of credentials discovered in public repositories in recent years. ([arXiv][1])
68
+
69
+ Security teams now treat **secret detection as a critical part of modern development pipelines**.
70
+
71
+ ---
72
+
73
+ # Solution
74
+
75
+ LLM Secrets Leak Detector automatically scans AI interaction data and identifies potential secrets before they are transmitted.
76
+
77
+ The tool analyzes:
78
+
79
+ * prompts sent to LLMs
80
+ * LLM responses
81
+ * application logs
82
+ * code snippets
83
+ * configuration files
84
+
85
+ When a potential secret is detected, the tool generates a warning describing:
86
+
87
+ * the secret type
88
+ * location
89
+ * severity level
90
+
91
+ Example output:
92
+
93
+ ```
94
+ ⚠ Secrets detected
95
+
96
+ Type: OpenAI API Key
97
+ Location: line 3
98
+ Risk: HIGH
99
+
100
+ Type: Database credentials
101
+ Location: line 4
102
+ Risk: CRITICAL
103
+ ```
104
+
105
+ This allows developers to **remove or redact sensitive information before it reaches an external AI system.**
106
+
107
+ ---
108
+
109
+ # Core Detection Approach
110
+
111
+ The detection engine uses a layered strategy similar to modern secret detection systems.
112
+
113
+ Most secret scanners rely on three complementary techniques:
114
+
115
+ 1. **Pattern Matching (Regex)**
116
+ Identifies secrets with known formats such as AWS keys or GitHub tokens.
117
+
118
+ 2. **Entropy Analysis**
119
+ Detects strings that appear random, which is typical for cryptographic tokens.
120
+
121
+ 3. **Contextual Analysis**
122
+ Reduces false positives by analyzing surrounding code and variable names. ([gitguardian.com][2])
123
+
124
+ Combining these methods significantly improves accuracy.
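As a rough illustration of the entropy technique listed above, the sketch below computes Shannon entropy over the characters of a token. The function names and the 3.5 bits-per-character threshold are assumptions made for this example, not the package's actual code; the 20-character minimum mirrors the figure quoted in the feature matrix.

```python
import math
from collections import Counter


def shannon_entropy(token: str) -> float:
    """Return the Shannon entropy of `token` in bits per character."""
    if not token:
        return 0.0
    total = len(token)
    counts = Counter(token)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_random(token: str, min_length: int = 20, threshold: float = 3.5) -> bool:
    """Flag tokens that are long enough and have high per-character entropy.

    Both `min_length` and `threshold` are illustrative defaults (assumptions).
    """
    return len(token) >= min_length and shannon_entropy(token) >= threshold
```

A repeated character scores 0 bits, while a string of many distinct characters scores high, which is why entropy alone is usually paired with pattern and context checks to limit false positives.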
125
+
126
+ ---
127
+
128
+ # Secret Types Detected
129
+
130
+ The scanner detects over **180 classes** of sensitive data, including:
131
+
132
+ ### API Keys
133
+
134
+ Examples:
135
+
136
+ ```
137
+ sk-xxxxxxxxxxxxxxxx
138
+ AIzaSyxxxxxxxxxxxx
139
+ ```
140
+
141
+ ### Cloud Credentials
142
+
143
+ Examples:
144
+
145
+ ```
146
+ AWS_ACCESS_KEY_ID
147
+ AWS_SECRET_ACCESS_KEY
148
+ AZURE_TOKEN
149
+ ```
150
+
151
+ ### Version Control Tokens
152
+
153
+ Examples:
154
+
155
+ ```
156
+ ghp_xxxxxxxxxxxxxxxxx
157
+ glpat-xxxxxxxxxxxxx
158
+ ```
159
+
160
+ ### Authentication Secrets
161
+
162
+ Examples:
163
+
164
+ ```
165
+ JWT_SECRET
166
+ SESSION_SECRET
167
+ PRIVATE_KEY
168
+ ```
169
+
170
+ ### Database Credentials
171
+
172
+ Examples:
173
+
174
+ ```
175
+ postgres://user:password@host
176
+ mysql://root:pass@db
177
+ ```
178
+
179
+ ### Cryptographic Material
180
+
181
+ Examples:
182
+
183
+ ```
184
+ -----BEGIN PRIVATE KEY-----
185
+ ```
186
+
187
+ ---
188
+
189
+ # Feature Matrix
190
+
191
+ The **LLM Secrets Leak Detector** provides a comprehensive suite of features designed for security, performance, and developer experience.
192
+
193
+ | Category | Feature | Status | Implementation Details |
194
+ | :--- | :--- | :--- | :--- |
195
+ | **Detection Engines** | **Regex Matching (RE2)** | ✅ | Primary engine using `google-re2`.<br>Fast, linear-time matching. |
196
+ | | **Regex Matching (Legacy)** | ✅ | Fallback to `regex` (Python) for complex patterns.<br>ReDoS protection. |
197
+ | | **Entropy Analysis** | ✅ | Shannon entropy scoring for random-looking tokens (min 20 chars). |
198
+ | | **Contextual Heuristics** | ✅ | Identifies secrets based on surrounding keywords like `prod`, `password`, `key`. Supports multi-lingual conversational intent matching (English, Spanish, French, German). |
199
+ | | **Rule-based Logic** | ✅ | 1750+ rules loaded from `data/` (Expanded 2026). |
200
+ | **Input Sources** | **File Scanning** | ✅ | Scans local files with UTF-8 support.<br>Error handling. |
201
+ | | **Stdin / Piped Input** | ✅ | Real-time processing of piped data (e.g., `cat log \| ./run.sh`). |
202
+ | | **Direct Text** | ✅ | Via `--text` flag for quick prompt validation. |
203
+ | | **Streaming** | ✅ | Optimized line-by-line generator for low-latency processing. |
204
+ | **Obfuscation** | **Redact** | ✅ | Masks the middle of secrets (e.g., `AKIA...CDEF`). |
205
+ | | **Hash** | ✅ | Consistent SHA-256 hashing (first 12 chars) for safe debugging. |
206
+ | | **Synthetic** | ✅ | [NEW] Realistic fake data generation (AWS, GitHub, Emails) using `Faker`. |
207
+ | **Safety & Performance** | **Keyword Filtering** | ✅ | Uses `ahocorasick-rs` automaton (with SIMD) to skip rules missing their required keywords. |
208
+ | | **Parallel Scanning** | ✅ | [NEW] Utilizes `ProcessPoolExecutor` for high-speed historical audits and multi-file directory scans. |
209
+ | | **Commit Caching** | ✅ | [NEW] Incremental scanning using `.secretscan_cache` to skip verified SHAs. |
210
+ | | **Zero-Copy Scanning**| ✅ | Uses `mmap` mapping with chunk overlaps for gigabyte-scale logs. |
211
+ | | **ReDoS Protection** | ✅ | `SIGALRM` timeouts (1s) for non-RE2 regex execution. |
212
+ | | **Input Truncation** | ✅ | Blocks capped at 1MB characters to prevent memory exhaustion. |
213
+ | | **Deduplication** | ✅ | Merges overlapping findings.<br>Prioritizes longest matches. |
214
+ | | **Force All Scan** | ✅ | `--force-scan-all` bypasses keyword filters so every line is scored. |
215
+ | **Reporting & UI** | **Surgical Highlighting** | ✅ | [NEW] ANSI-colored context lines with the secret highlighted in red. |
216
+ | | **Remediation Hints** | ✅ | [NEW] Actionable advice with links to official provider documentation. |
217
+ | | **Colorized Output** | ✅ | ANSI colors for risk levels (Red=High, Yellow=Medium, Blue=Low). |
218
+ | | **Report Formats** | ✅ | `Summary` (counts only).<br>`Short` (redacted).<br>`Full` (raw secrets + context).<br>`SARIF` (GitHub Code Scanning). |
219
+ | | **CI/CD Friendly** | ✅ | `--nocolors` flag.<br>Standard exit codes for automation. |
220
+ | **Testing & Dev** | **BDD Acceptance** | ✅ | 25 scenarios in `acceptance.feature` (including Git workflows) using `pytest-bdd`. |
221
+ | | **Performance Bench** | ✅ | [NEW] Automated suite to verify caching and parallelization gains. |
222
+ | | **Unit Testing** | ✅ | Comprehensive suite for core logic (detector, obfuscator, cli). |
223
+ | | **Synthetic Corpus** | ✅ | `generate_test_data.py` creates a balanced test set from rules. |
224
+ | | **Rule Deduplication** | ✅ | `tools/deduplicate_rules.py` keeps the catalog clean before release. |
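The Deduplication row in the table above (merge overlapping findings, prefer the longest match) could be sketched roughly as follows. The `Finding` type and the `deduplicate` function are hypothetical names introduced for this example, and the package's own `report.py` may resolve overlaps differently.

```python
from typing import List, NamedTuple


class Finding(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    rule: str


def deduplicate(findings: List[Finding]) -> List[Finding]:
    """Drop findings that overlap a longer one, keeping the longest match."""
    # Consider the longest spans first; ties go to the earliest start.
    ordered = sorted(findings, key=lambda f: (-(f.end - f.start), f.start))
    kept: List[Finding] = []
    for f in ordered:
        # Keep `f` only if it does not overlap anything already kept.
        if all(f.end <= k.start or f.start >= k.end for k in kept):
            kept.append(f)
    return sorted(kept, key=lambda f: f.start)
```

With this greedy approach a generic hit that sits inside a longer, more specific match is discarded, which matches the "prioritizes longest matches" behaviour described in the table.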
225
+
226
+ ## Git & CI/CD Integration
227
+
228
+ The detector is now natively aware of Git lifecycles, allowing for surgical scans of changes rather than entire files.
229
+
230
+ ### 🛠 Git Scanning Modes
231
+
232
+ ```bash
233
+ # Scan staged changes (perfect for pre-commit hooks)
234
+ ./run.sh --git-staged --mode fast
235
+
236
+ # Scan unstaged changes in the working directory
237
+ ./run.sh --git-working
238
+
239
+ # Scan the diff between a feature branch and main (PR audits)
240
+ ./run.sh --git-branch origin/main --format sarif
241
+
242
+ # Deep audit of repository history (Parallelized & Cached)
243
+ ./run.sh --history --since "1 month ago" --max-commits 100
244
+ ```
245
+
246
+ ### 🏎 Performance & Scalability
247
+
248
+ - **Parallel Execution**: Large-scale historical audits and multi-file directory scans automatically utilize multiple CPU cores for regex and entropy analysis.
249
+ - **Commit Caching**: The engine maintains a `.secretscan_cache` to track verified "clean" commits, reducing redundant scan times by up to 90% in incremental audits.
250
+ - **Modes**: Choose between `fast` (optimized for <1s hooks), `balanced` (standard dev), and `deep` (thorough CI audits).
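The commit-caching idea above can be illustrated with a small sketch. The on-disk format of `.secretscan_cache` is not documented here, so the JSON list of SHAs used below is an assumption, as are the helper names `load_cache`, `save_cache`, and `commits_to_scan`.

```python
import json
from pathlib import Path
from typing import Iterable, List, Set


def load_cache(path: Path) -> Set[str]:
    """Load previously verified commit SHAs (assumed format: JSON list)."""
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def save_cache(path: Path, shas: Set[str]) -> None:
    """Persist the verified SHAs in a stable, sorted order."""
    path.write_text(json.dumps(sorted(shas)), encoding="utf-8")


def commits_to_scan(all_shas: Iterable[str], cache: Set[str]) -> List[str]:
    """Return only the commits that have not been verified yet, in order."""
    return [sha for sha in all_shas if sha not in cache]
```

An incremental audit would then scan only the list returned by `commits_to_scan` and add each clean commit to the cache afterwards.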
251
+
252
+ ---
253
+
254
+ ## Surgical Highlighting & Remediation
255
+
256
+ When a secret is detected, the terminal output provides immediate visual context and actionable fix instructions.
257
+
258
+ ```text
259
+ ⚠ Secrets detected: 1
260
+ - HIGH: 1
261
+
262
+ Type: stripe_api_key
263
+ Location: line 1
264
+ Risk: HIGH
265
+ Suggestion: Rotate this Stripe API key immediately in your dashboard. See: https://stripe.com/docs/keys#api-key-rotation
266
+ Context: config process result Stripe secret: [SECRET_HIGHLIGHTED_IN_RED]
267
+ ```
268
+
269
+ Remediation hints now include direct links to official security guides for AWS, GitHub, Stripe, and Google Cloud to guide developers through the revocation and rotation process.
270
+
271
+ ### Natural Language Contextual Matching
272
+
273
+ The detection engine uses a 100-character context window to detect natural language conversational intents, such as `Here is my prod api key:`, which are common when interacting with LLMs. This feature is fully multi-lingual, boosting confidence scores when intent is detected in English, Spanish, French, or German.
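A minimal sketch of the context-window check described above: it inspects the 100 characters before a candidate secret for a conversational intent phrase. The phrase list and the function name `has_conversational_intent` are assumptions for illustration; the package's real patterns and scoring are not reproduced here.

```python
import re

CONTEXT_WINDOW = 100  # characters inspected before the candidate secret

# Illustrative intent phrases only (assumed, not the package's actual list).
INTENT_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (
        r"\bhere is my\b.*\b(key|token|password|secret)\b",      # English
        r"\baqu[ií] est[aá] mi\b.*\b(clave|token|contraseña)\b",  # Spanish
        r"\bvoici (mon|ma)\b.*\b(clé|jeton|mot de passe)\b",      # French
        r"\bhier ist mein\b.*\b(schlüssel|token|passwort)\b",     # German
    )
]


def has_conversational_intent(text: str, match_start: int) -> bool:
    """Return True if an intent phrase appears in the window before the match."""
    window = text[max(0, match_start - CONTEXT_WINDOW):match_start]
    return any(p.search(window) for p in INTENT_PATTERNS)
```

A detector could use the boolean result to raise the confidence of a nearby regex or entropy hit rather than to report a finding on its own.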
274
+
275
+ ---
276
+
277
+ ## Extended Infrastructure Mode
278
+
279
+ The latest feature expansion brings the infrastructure-focused taxonomy front and center:
280
+
281
+ * `data/infrastructure` now houses rules for credit cards, IBANs/SEPA references, national ID numbers, and other high-risk identifiers.
282
+ * Entropy-aware scoring plus overlap resolution lets structured infrastructure matches win over generic keywords or high-entropy heuristics.
283
+ * The CLI `--force-scan-all` option ensures legacy logs that omit keywords still get evaluated (see the new acceptance scenario for this mode).
284
+ * Dedicated tests cover deduped rules, synthetic obfuscation, and the expanded dataset to ensure the library stays precise.
285
+
286
+ ## Development Utilities
287
+
288
+ Keep the catalog healthy with the accompanying tools:
289
+
290
+ * `tools/migrate_patterns.py` normalizes schema fields, adds entropy defaults, and maps external categories to the in-tree taxonomy.
291
+ * `tools/generate_test_data.py` rebuilds the base64-encoded `data/*/test_data.json` files from regexes so every rule ships with reproducible samples.
292
+ * `tools/deduplicate_rules.py` merges duplicate patterns across categories before rules ship.
293
+ * Use `tools/regex_lint.py`, `tools/run_safe_regex.py`, and `tools/run_redoctor.py` to guard against ReDoS, syntax drift, and schema regressions.
294
+
295
+ Run `pytest tests/test_acceptance.py::test_force_scan_keywordless` before releasing to exercise the keywordless mode.
296
+
297
+ ---
298
+
299
+ # Pattern Database
300
+
301
+ The detection engine can leverage large open-source pattern databases containing thousands of secret signatures.
302
+
303
+ For example, open datasets include **over 1600 regular expressions** that detect API keys, tokens, passwords, and other credentials across hundreds of services. ([GitHub][3])
304
+
305
+ This allows the scanner to stay updated with newly introduced API key formats.
306
+
307
+ ---
308
+
309
+ # Project Goals
310
+
311
+ The project focuses on **protecting AI workflows** rather than traditional repository scanning.
312
+
313
+ Key design goals:
314
+
315
+ ### AI-first security
316
+
317
+ Detect secrets inside:
318
+
319
+ * LLM prompts
320
+ * chat transcripts
321
+ * agent logs
322
+ * debugging sessions
323
+
324
+ ### Developer-first experience
325
+
326
+ The tool integrates directly into developer workflows without requiring complex configuration.
327
+
328
+ ### Local processing
329
+
330
+ All scanning occurs locally to ensure no data leaves the environment.
331
+
332
+ ### Fast feedback
333
+
334
+ Secrets should be detected immediately during development.
335
+
336
+ ---
337
+
338
+ # Core Components
339
+
340
+ The system is composed of several modules.
341
+
342
+ ### Detection Engine
343
+
344
+ Responsible for identifying potential secrets using:
345
+
346
+ * regex pattern matching
347
+ * entropy scoring
348
+ * context heuristics
349
+
350
+ ### Pattern Database
351
+
352
+ A continuously updated collection of secret signatures.
353
+
354
+ Includes patterns for:
355
+
356
+ * API providers
357
+ * cloud platforms
358
+ * CI/CD tokens
359
+ * authentication systems
360
+
361
+ ### Scanner Interface
362
+
363
+ The scanner processes different input sources:
364
+
365
+ * text prompts
366
+ * log streams
367
+ * source files
368
+ * application outputs
369
+
370
+ ### Reporting System
371
+
372
+ Findings are returned as structured results including:
373
+
374
+ * secret type
375
+ * location
376
+ * confidence score
377
+ * risk level
378
+ * risk score (0-100)
379
+
380
+ The CLI produces clear, color-coded output highlighting the location, risk level (HIGH, MEDIUM, LOW), and an Advanced Risk Score (0-100) of detected secrets. The risk score is determined by a weighted heuristic that incorporates regex confidence, contextual proximity bonuses, and entropy adjustments. The `report.py` module manages deduplication and formatting. Use `--format sarif` for CI/CD integration.
381
+
382
+ You can tune sensitivity and filter out low-confidence noise by using the `--min-score` flag (e.g., `--min-score 70`).
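To make the scoring description above concrete, here is a hedged sketch of a weighted heuristic clamped to the 0-100 range, followed by a `--min-score` style filter. The weights, parameter names, and the dictionary shape of a finding are all assumptions; the package's actual formula is not shown in this document.

```python
from typing import Dict, List


def risk_score(regex_confidence: float, context_bonus: float = 0.0,
               entropy_adjustment: float = 0.0) -> int:
    """Combine the three signals into a 0-100 score (weights are assumed)."""
    raw = 100.0 * regex_confidence + context_bonus + entropy_adjustment
    return max(0, min(100, round(raw)))


def filter_by_min_score(findings: List[Dict], min_score: int) -> List[Dict]:
    """Keep findings at or above the threshold, similar to `--min-score`."""
    return [f for f in findings if f["score"] >= min_score]
```

For example, a finding with regex confidence 0.5, a context bonus of 20, and an entropy adjustment of 5 would score 75 under these assumed weights and survive `--min-score 70`.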
383
+
384
+ ---
385
+
386
+ # Architecture
387
+
388
+ The architecture prioritizes simplicity and speed.
389
+
390
+ ```
391
+ Input Sources
392
+
393
+
394
+
395
+ Preprocessing Layer
396
+
397
+
398
+
399
+ Detection Engine
400
+ ├── Regex Matching
401
+ ├── Entropy Detection
402
+ └── Context Analysis
403
+
404
+
405
+ Secret Classification
406
+
407
+
408
+ Security Report
409
+ ```
410
+
411
+ ---
412
+
413
+ # Example Detection
414
+
415
+ Input text:
416
+
417
+ ```
418
+ Here is my configuration:
419
+
420
+ OPENAI_API_KEY=sk-abc123
+ DATABASE_URL=postgres://admin:password@localhost
421
+ ```
422
+
423
+ Output:
424
+
425
+ ```
426
+ Secrets detected:
427
+
428
+ [1] OpenAI API Key
429
+ location: line 3
430
+ risk: HIGH
431
+
432
+ [2] Database Credentials
433
+ location: line 4
434
+ risk: CRITICAL
435
+ ```
436
+
437
+ ---
438
+
439
+ ## Installation
440
+
441
+ ```bash
442
+ # Install from PyPI (Recommended)
443
+ pip install py-secret-scan
444
+
445
+ # Run
446
+ secret-scan example_file.txt
447
+ ```
448
+
449
+ ### Developer Installation
450
+
451
+ ```bash
452
+ # Create virtual environment
453
+ python3 -m venv .venv
454
+ source .venv/bin/activate
455
+
456
+ # Install in editable mode
457
+ pip install -e .
458
+ ```
459
+
460
+ Or scan text directly:
461
+ ```bash
462
+ # Standard scan
463
+ secret-scan --text "My API key is AIzaSy-12345"
464
+
465
+ # Force scan all lines (bypasses keyword filters)
466
+ secret-scan --force-scan-all .
467
+ ```
468
+
469
+ ### Data Obfuscation & Masking
470
+
471
+ You can redact sensitive data from logs or prompts while preserving the rest of the text. This is useful for sanitizing data before sharing it with an LLM or for safe debugging.
472
+
473
+ Enable obfuscation with the `--obfuscate` flag:
474
+
475
+ ```bash
476
+ # Default mode: redact (Redacts middle of the secret)
477
+ # Input: "My key is ghp_1234567890abcdefghijklmnopqrstuvwx"
478
+ # Output: "My key is ghp_...uvwx"
479
+ cat logs.txt | secret-scan --obfuscate
480
+ ```
481
+
482
+ Choose different obfuscation strategies with `--obfuscate-mode`:
483
+
484
+ #### 1. `redact` (Default)
485
+ Partial masking that keeps the prefix/suffix for context but hides the sensitive core.
486
+ * **Example:** `AKIA...CDEF`
487
+
488
+ #### 2. `hash`
489
+ Replaces secrets with a consistent, short SHA-256 hash. Identical secrets will result in identical hashes, which is crucial for debugging data flows without seeing the actual values.
490
+ * **Example:** `[HASHED_d8c7b92f4a19]`
491
+
492
+ #### 3. `synthetic` (Recommended for LLM Prompts)
493
+ Replaces secrets with realistic-looking fake data that matches the original format (using the `Faker` library). This allows LLMs to still "understand" the structure of your data (e.g., seeing a fake AWS key where a real one was) without exposing real credentials.
494
+ * **Example (AWS ID):** `AKIAJ7O2N6M4L9K0P8R1`
495
+ * **Example (GitHub Token):** `ghp_zXyWvUtSrQpOnMlKjIhGfEdCbA9876543210`
496
+ * **Example (Email):** `fake_user@example.org`
497
+
498
+ ```bash
499
+ # Use synthetic mode for realistic placeholders
500
+ secret-scan --obfuscate --obfuscate-mode synthetic logs.txt
501
+ ```
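The `redact` and `hash` strategies described above can be approximated in a few lines. This is a sketch under stated assumptions: the function names, the four-character prefix/suffix, and the fallback for very short secrets are illustrative choices, while the 12-character SHA-256 prefix and the `[HASHED_...]` wrapper follow the examples in this document.

```python
import hashlib


def redact(secret: str, keep: int = 4) -> str:
    """Mask the middle of a secret, keeping `keep` characters on each side."""
    if len(secret) <= 2 * keep:
        # Too short to show a prefix and suffix safely; mask everything.
        return "*" * len(secret)
    return f"{secret[:keep]}...{secret[-keep:]}"


def hash_secret(secret: str, length: int = 12) -> str:
    """Replace a secret with a short, deterministic SHA-256 prefix."""
    digest = hashlib.sha256(secret.encode("utf-8")).hexdigest()
    return f"[HASHED_{digest[:length]}]"
```

Because the hash is deterministic, the same secret always maps to the same placeholder, so a value can be traced through logs without being revealed.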
502
+
503
+ ### Custom CLI helpers
504
+
505
+ The repository ships with a few convenience commands:
506
+
507
+ * `./run.sh <file>` is deprecated; use `secret-scan <file>` directly.
508
+ * `secret-scan --text "<string>"` runs the scanner on an inline string (useful when building prompts before sending them to an LLM).
509
+ * `python tools/generate_test_data.py` rebuilds the `test_data.json` files under `data/` from the corresponding `rules.json` files and should be rerun whenever the rule set changes.
510
+
511
+ ## Usage Example
512
+
513
+ ```
514
+ $ secret-scan test_file.py
515
+
516
+ ⚠ Secrets detected: 1
517
+ - CRITICAL: 1
518
+
519
+ Type: Database Credentials
520
+ Location: line 10
521
+ Risk: CRITICAL
522
+ Content: post...ocal (redacted)
523
+ ```
524
+
525
+ # Test Data & Custom Cases
526
+
527
+ Every rule in a `rules.json` file under `data/` maps to a base64-encoded positive and negative sample in the sibling `test_data.json`. `tools/generate_test_data.py` drives the data:
528
+
529
+ * It loads each regex, runs it through `exrex` (with length caps) to emit matches, and encodes them so the detector tests operate on the same strings as production data.
530
+ * Negatives are hand-crafted near-misses that resemble real-world secrets but should not trigger a hit.
531
+ * Rules listed in `STRICT_RULES` bypass the default `encode_str` mutation because even inserting `DUMMY_IGNORE` would break the required format.
532
+ * Custom helpers generate valid payloads for the trickiest patterns (`auth0_domain_url`, `skybiometry_api_key`, `okta_api_domain_url`, `facebook_oauth_id`, `linemessaging_api_key`, `nethunt_api_key`) so the detector still sees legal samples even though those regexes restrict character sets or lengths tightly.
533
+
534
+ Run `python tools/generate_test_data.py` after any rule changes; it prints progress every 100 rules and overwrites the `test_data.json` files under `data/` with the refreshed corpus that powers the pytest suite.
535
+
536
+
537
+ # Use Cases
538
+
539
+ ### AI Application Development
540
+
541
+ Developers building:
542
+
543
+ * chatbots
544
+ * RAG pipelines
545
+ * AI agents
546
+ * coding assistants
547
+
548
+ can scan prompts before sending them to LLM APIs.
549
+
550
+ ---
551
+
552
+ ### Security Auditing
553
+
554
+ Security teams can analyze:
555
+
556
+ * prompt logs
557
+ * application logs
558
+ * LLM interaction history
559
+
560
+ to ensure no secrets were exposed.
561
+
562
+ ---
563
+
564
+ ### Compliance
565
+
566
+ Organizations can enforce policies preventing sensitive information from being sent to external AI providers.
567
+
568
+ ---
569
+
570
+ ### DevSecOps Integration
571
+
572
+ The scanner can be integrated into:
573
+
574
+ * CI/CD pipelines
575
+ * AI gateways
576
+ * API proxies
577
+ * developer tooling
578
+
579
+ ---
580
+
581
+ ## PII Detection
582
+
583
+ The scanner now supports detecting Personally Identifiable Information (PII), including emails, phone numbers, credit cards, and SSNs.
584
+
585
+ Enable PII detection with the `--pii` flag:
586
+
587
+ ```bash
588
+ # Scan a file for secrets and PII
589
+ secret-scan --pii example_file.txt
590
+
591
+ # Limit PII scanning to specific regions (e.g., US only for SSNs and US phone numbers)
592
+ secret-scan --pii --pii-region US example_file.txt
593
+ ```
594
+
595
+ PII findings are integrated into the multi-tier reporting system, where highly structured secrets (Tier 1) take precedence over contextual or generic entropy hits.
596
+
597
+ ---
598
+
599
+ # Target Users
600
+
601
+ ### AI Developers
602
+
603
+ Engineers building LLM-powered applications.
604
+
605
+ ### Security Engineers
606
+
607
+ Teams responsible for application security reviews.
608
+
609
+ ### AI Startups
610
+
611
+ Companies working with prompt engineering and LLM pipelines.
612
+
613
+ ---
614
+
615
+ # Roadmap
616
+
617
+ The project evolves in several stages.
618
+
619
+ ### CLI Scanner
620
+
621
+ A lightweight command-line tool for scanning prompts and logs.
622
+
623
+ ### API Service
624
+
625
+ A service that allows AI systems to validate prompts before sending them to LLM providers.
626
+
627
+ ### Developer Tooling
628
+
629
+ Integration with:
630
+
631
+ * IDE plugins
632
+ * Git hooks
633
+ * CI pipelines
634
+
635
+ ### Enterprise Security Platform
636
+
637
+ Future capabilities may include:
638
+
639
+ * real-time prompt filtering
640
+ * AI data loss prevention (DLP)
641
+ * secret monitoring across AI infrastructure
642
+
643
+ ---
644
+
645
+ # Why This Matters
646
+
647
+ AI-assisted development dramatically increases the speed of coding and debugging, but it also increases the risk of accidentally exposing sensitive data.
648
+
649
+ Developers frequently paste large blocks of code or logs into AI systems without reviewing them for secrets.
650
+
651
+ LLM Secrets Leak Detector provides a safety layer that prevents confidential data from leaving the organization.
652
+
653
+ ---
654
+
655
+ # License
656
+
657
+ MIT License
658
+
659
+
660
+ [1]: https://arxiv.org/abs/2307.00714 "A Comparative Study of Software Secrets Reporting by Secret Detection Tools"
661
+ [2]: https://www.gitguardian.com/solutions/secrets-detection "Secrets Detection: Scan Code for Exposed API Keys and Credentials | GitGuardian"
662
+ [3]: https://github.com/mazen160/secrets-patterns-db "GitHub - mazen160/secrets-patterns-db: Secrets Patterns DB: The largest open-source Database for detecting secrets, API keys, passwords, tokens, and more."