datacloak 0.1.0__tar.gz

Files changed (33)
  1. datacloak-0.1.0/.gitignore +42 -0
  2. datacloak-0.1.0/CHANGELOG.md +23 -0
  3. datacloak-0.1.0/LICENSE +21 -0
  4. datacloak-0.1.0/PKG-INFO +364 -0
  5. datacloak-0.1.0/README.md +327 -0
  6. datacloak-0.1.0/datacloak/__init__.py +177 -0
  7. datacloak-0.1.0/datacloak/cli.py +222 -0
  8. datacloak-0.1.0/datacloak/detectors/__init__.py +40 -0
  9. datacloak-0.1.0/datacloak/detectors/aadhaar.py +53 -0
  10. datacloak-0.1.0/datacloak/detectors/base.py +97 -0
  11. datacloak-0.1.0/datacloak/detectors/credit_card.py +60 -0
  12. datacloak-0.1.0/datacloak/detectors/email.py +50 -0
  13. datacloak-0.1.0/datacloak/detectors/ifsc.py +57 -0
  14. datacloak-0.1.0/datacloak/detectors/ip_address.py +86 -0
  15. datacloak-0.1.0/datacloak/detectors/mobile.py +60 -0
  16. datacloak-0.1.0/datacloak/detectors/pan.py +57 -0
  17. datacloak-0.1.0/datacloak/detectors/upi.py +64 -0
  18. datacloak-0.1.0/datacloak/file_scanner.py +272 -0
  19. datacloak-0.1.0/datacloak/masker.py +196 -0
  20. datacloak-0.1.0/datacloak/py.typed +0 -0
  21. datacloak-0.1.0/datacloak/reporter.py +126 -0
  22. datacloak-0.1.0/datacloak/scanner.py +76 -0
  23. datacloak-0.1.0/pyproject.toml +127 -0
  24. datacloak-0.1.0/tests/__init__.py +1 -0
  25. datacloak-0.1.0/tests/test_aadhaar.py +60 -0
  26. datacloak-0.1.0/tests/test_email.py +62 -0
  27. datacloak-0.1.0/tests/test_file_scanner.py +131 -0
  28. datacloak-0.1.0/tests/test_integration.py +127 -0
  29. datacloak-0.1.0/tests/test_masking.py +126 -0
  30. datacloak-0.1.0/tests/test_mobile.py +61 -0
  31. datacloak-0.1.0/tests/test_other_detectors.py +136 -0
  32. datacloak-0.1.0/tests/test_pan.py +54 -0
  33. datacloak-0.1.0/tests/test_upi.py +17 -0
@@ -0,0 +1,42 @@
1
+ # Python
2
+ __pycache__/
3
+ *.py[cod]
4
+ *.pyo
5
+ *.pyd
6
+ *.so
7
+ *.egg
8
+ *.egg-info/
9
+ dist/
10
+ build/
11
+ .eggs/
12
+ .env
13
+ .venv/
14
+ venv/
15
+ env/
16
+
17
+ # Testing / Coverage
18
+ .pytest_cache/
19
+ .coverage
20
+ coverage.xml
21
+ htmlcov/
22
+
23
+ # Type checking
24
+ .mypy_cache/
25
+
26
+ # Linting
27
+ .ruff_cache/
28
+
29
+ # IDE
30
+ .idea/
31
+ .vscode/
32
+ *.swp
33
+ *.swo
34
+
35
+ # OS
36
+ .DS_Store
37
+ Thumbs.db
38
+
39
+ # Packaging
40
+ *.whl
41
+ *.tar.gz
42
+ MANIFEST
@@ -0,0 +1,23 @@
1
+ # Changelog
2
+
3
+ All notable changes to DataCloak will be documented in this file.
4
+
5
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
6
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7
+
8
+ ---
9
+
10
+ ## [0.1.0] - 2025-06-01
11
+
12
+ ### Added
13
+ - Initial release of DataCloak
14
+ - Built-in detectors: Aadhaar, PAN, Indian Mobile, Email, UPI ID, Credit Card (Luhn-validated), IFSC, IPv4/IPv6
15
+ - Three masking modes: `partial`, `full`, `hash`
16
+ - `scan()` API for structured PII detection without modification
17
+ - `scan_file()` supporting `.txt`, `.log`, and `.csv` files
18
+ - Extensible `FileHandler` interface for adding new file format support
19
+ - JSON report generation with risk-level classification
20
+ - `datacloak` CLI with `scan`, `mask`, and `report` commands
21
+ - Custom detector framework via `BaseDetector`
22
+ - Full pytest test suite (>90% coverage)
23
+ - Typed codebase (PEP 561 compliant)
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 DataCloak Contributors
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,364 @@
1
+ Metadata-Version: 2.4
2
+ Name: datacloak
3
+ Version: 0.1.0
4
+ Summary: Privacy protection library for detecting and masking PII in text, logs, and files.
5
+ Project-URL: Homepage, https://github.com/datacloak/datacloak
6
+ Project-URL: Documentation, https://datacloak.readthedocs.io
7
+ Project-URL: Repository, https://github.com/datacloak/datacloak
8
+ Project-URL: Bug Tracker, https://github.com/datacloak/datacloak/issues
9
+ Project-URL: Changelog, https://github.com/datacloak/datacloak/blob/main/CHANGELOG.md
10
+ Author: DataCloak Contributors
11
+ License: MIT
12
+ License-File: LICENSE
13
+ Keywords: aadhaar,compliance,data-masking,gdpr,india,pan,pii,privacy,redaction,security
14
+ Classifier: Development Status :: 3 - Alpha
15
+ Classifier: Intended Audience :: Developers
16
+ Classifier: Intended Audience :: Information Technology
17
+ Classifier: License :: OSI Approved :: MIT License
18
+ Classifier: Operating System :: OS Independent
19
+ Classifier: Programming Language :: Python :: 3
20
+ Classifier: Programming Language :: Python :: 3.11
21
+ Classifier: Programming Language :: Python :: 3.12
22
+ Classifier: Topic :: Security
23
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
24
+ Classifier: Topic :: Text Processing :: General
25
+ Classifier: Typing :: Typed
26
+ Requires-Python: >=3.11
27
+ Requires-Dist: click>=8.1
28
+ Provides-Extra: dev
29
+ Requires-Dist: build; extra == 'dev'
30
+ Requires-Dist: hatchling; extra == 'dev'
31
+ Requires-Dist: mypy>=1.10; extra == 'dev'
32
+ Requires-Dist: pytest-cov>=5.0; extra == 'dev'
33
+ Requires-Dist: pytest>=8.0; extra == 'dev'
34
+ Requires-Dist: ruff>=0.4; extra == 'dev'
35
+ Requires-Dist: twine; extra == 'dev'
36
+ Description-Content-Type: text/markdown
37
+
38
+ # DataCloak 🔒
39
+
40
+ > **Privacy protection for Python applications**: detect and mask PII in text, logs, files, and data pipelines.
41
+
42
+ [![PyPI version](https://img.shields.io/pypi/v/datacloak.svg)](https://pypi.org/project/datacloak/)
43
+ [![Python](https://img.shields.io/pypi/pyversions/datacloak.svg)](https://pypi.org/project/datacloak/)
44
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
45
+ [![Tests](https://github.com/datacloak/datacloak/actions/workflows/ci.yml/badge.svg)](https://github.com/datacloak/datacloak/actions)
46
+ [![Coverage](https://img.shields.io/codecov/c/github/datacloak/datacloak)](https://codecov.io/gh/datacloak/datacloak)
47
+
48
+ ---
49
+
50
+ DataCloak is a production-ready Python library for **automatically detecting and masking Personally Identifiable Information (PII)**, built for India-first compliance use cases (Aadhaar, PAN, UPI) while covering universal types like email and credit cards.
51
+
52
+ ## ✨ Features
53
+
54
+ | Capability | Description |
55
+ |---|---|
56
+ | **8 built-in detectors** | Aadhaar, PAN, Mobile, Email, UPI ID, Credit Card (Luhn), IFSC, IPv4/IPv6 |
57
+ | **3 masking modes** | `partial` (keep trailing chars), `full` (redaction tags), `hash` (SHA-256) |
58
+ | **File scanning** | `.txt`, `.log`, `.csv`; extensible to PDF, DOCX, and more |
59
+ | **Structured scan** | Returns findings dict without modifying original text |
60
+ | **JSON reports** | Risk-level classified reports with per-type counts |
61
+ | **CLI** | `datacloak scan / mask / report` commands |
62
+ | **Pluggable detectors** | Subclass `BaseDetector` to add your own PII types |
63
+
64
+ ---
65
+
66
+ ## 🚀 Installation
67
+
68
+ ```bash
69
+ pip install datacloak
70
+ ```
71
+
72
+ Requires Python 3.11+.
73
+
74
+ ---
75
+
76
+ ## ⚡ Quick Start
77
+
78
+ ```python
79
+ from datacloak import mask, scan
80
+
81
+ text = """
82
+ Aadhaar: 2345 6789 0123
83
+ PAN: ABCPE1234F
84
+ Email: alice@example.com
85
+ Phone: 9876543210
86
+ """
87
+
88
+ # Partial masking (default): keeps trailing characters visible
89
+ print(mask(text))
90
+ ```
91
+
92
+ **Output:**
93
+ ```
94
+ Aadhaar: XXXX XXXX 0123
95
+ PAN: XXXXX1234F
96
+ Email: a***@example.com
97
+ Phone: ******3210
98
+ ```
99
+
100
+ ---
101
+
102
+ ## 📖 Usage Guide
103
+
104
+ ### 1. Masking Modes
105
+
106
+ ```python
107
+ from datacloak import mask
108
+
109
+ text = "Contact: alice@example.com | Phone: 9876543210"
110
+
111
+ # Partial: show trailing characters (default)
112
+ mask(text, mode="partial")
113
+ # → 'Contact: a***@example.com | Phone: ******3210'
114
+
115
+ # Full: replace with descriptive redaction tags
116
+ mask(text, mode="full")
117
+ # → 'Contact: [EMAIL_REDACTED] | Phone: [PHONE_REDACTED]'
118
+
119
+ # Hash: SHA-256 digest (deterministic and one-way; the same input always yields the same tag)
120
+ mask(text, mode="hash")
121
+ # → 'Contact: [HASH:142d78e466cacab3] | Phone: [HASH:7619ee8cea49187f]'
122
+ ```
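The `hash` mode output above looks like a SHA-256 digest truncated to 16 hex characters. The following is a minimal sketch of how such a tag could be built with the standard library. The helper name `hash_token`, the UTF-8 encoding, and the truncation length are assumptions made for illustration; DataCloak's actual hashing input and format may differ.

```python
import hashlib


def hash_token(value: str, length: int = 16) -> str:
    """Build a [HASH:...] tag from a truncated SHA-256 hex digest.

    Hypothetical helper: the tag format mirrors the README output,
    but the exact bytes DataCloak hashes are not confirmed here.
    """
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return f"[HASH:{digest[:length]}]"


first = hash_token("alice@example.com")
second = hash_token("alice@example.com")
other = hash_token("bob@example.com")

print(first == second)  # same input gives the same tag
print(first == other)   # different inputs give different tags
```

Because the digest is deterministic, equal values can still be joined or counted after masking. The digest cannot be inverted, although short, low-entropy values such as phone numbers can be guessed by hashing candidates, so a keyed hash may be preferable where that matters.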
123
+
124
+ ### 2. Scan Without Masking
125
+
126
+ ```python
127
+ from datacloak import scan
128
+
129
+ findings = scan("Send invoice to billing@acme.com, call 9876543210")
130
+ print(findings)
131
+ # {
132
+ # "email": ["billing@acme.com"],
133
+ # "phone": ["9876543210"]
134
+ # }
135
+ ```
136
+
137
+ ### 3. File Scanning
138
+
139
+ ```python
140
+ from datacloak import scan_file
141
+
142
+ # Scan a plain text or log file
143
+ result = scan_file("application.log")
144
+
145
+ # Scan a CSV (each cell is scanned individually)
146
+ result = scan_file("customers.csv")
147
+
148
+ print(result.summary)
149
+ # {"email": 142, "phone": 38, "aadhaar": 5}
150
+
151
+ print(result.findings[0])
152
+ # FileFinding(email='alice@example.com' @customers.csv:line=2)
153
+ ```
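The CSV comment above says each cell is scanned individually. A rough sketch of that idea using only the standard library is shown below. `scan_csv_cells` and `EMAIL_RE` are hypothetical names, and the simplified email pattern stands in for DataCloak's real detectors, which are not reproduced here.

```python
import csv
import io
import re

# Simplified email pattern, used only for this illustration.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scan_csv_cells(handle):
    """Return (row, column, match) tuples, scanning one cell at a time."""
    findings = []
    for row_num, row in enumerate(csv.reader(handle), start=1):
        for col_num, cell in enumerate(row, start=1):
            for match in EMAIL_RE.findall(cell):
                findings.append((row_num, col_num, match))
    return findings


sample = io.StringIO("name,email\nAlice,alice@example.com\nBob,none\n")
results = scan_csv_cells(sample)
print(results)
```

Scanning per cell keeps row and column positions available for reporting, and it avoids false matches that span a delimiter.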
154
+
155
+ ### 4. Report Generation
156
+
157
+ ```python
158
+ from datacloak import report
159
+
160
+ r = report(text, source_label="user_input")
161
+ print(r.to_json())
162
+ ```
163
+
164
+ ```json
165
+ {
166
+ "generated_at": "2025-06-01T10:23:00+00:00",
167
+ "source": "user_input",
168
+ "total_findings": 4,
169
+ "summary": {
170
+ "aadhaar": 1,
171
+ "email": 1,
172
+ "phone": 1,
173
+ "pan": 1
174
+ },
175
+ "details": { ... },
176
+ "risk_level": "MEDIUM"
177
+ }
178
+ ```
179
+
180
+ Save to disk:
181
+ ```python
182
+ r.save("pii_report.json")
183
+ ```
184
+
185
+ ---
186
+
187
+ ## 🖥️ Command-Line Interface
188
+
189
+ DataCloak ships a full CLI via `datacloak`:
190
+
191
+ ```bash
192
+ # Scan a file and display findings as a table
193
+ datacloak scan customers.txt
194
+
195
+ # Scan with JSON output
196
+ datacloak scan --format json customers.txt
197
+
198
+ # Mask a file (writes customers.masked.txt by default)
199
+ datacloak mask customers.txt
200
+
201
+ # Mask with full-redaction mode, specify output file
202
+ datacloak mask customers.txt --mode full --output clean.txt
203
+
204
+ # Print masked output to stdout (pipe-friendly)
205
+ datacloak mask customers.txt --stdout | grep "REDACTED"
206
+
207
+ # Generate a JSON report
208
+ datacloak report customers.txt
209
+
210
+ # Save report to file
211
+ datacloak report customers.txt --output report.json
212
+
213
+ # Verbose logging
214
+ datacloak -v scan customers.txt
215
+ ```
216
+
217
+ ---
218
+
219
+ ## 🔌 Writing a Custom Detector
220
+
221
+ DataCloak's detector framework is designed for extension. Subclass `BaseDetector`, set `_pattern`, and optionally override `_validate()`:
222
+
223
+ ```python
224
+ import re
225
+ from datacloak.detectors import BaseDetector, Detection
226
+ from datacloak import scan
227
+
228
+ class PassportDetector(BaseDetector):
229
+ name = "indian_passport"
230
+ description = "Indian Passport Number (A-Z followed by 7 digits)"
231
+ _pattern = re.compile(r"\b[A-Z]\d{7}\b")
232
+
233
+ # Use alongside built-in detectors
234
+ from datacloak.detectors import DEFAULT_DETECTORS
235
+
236
+ my_detectors = DEFAULT_DETECTORS + [PassportDetector()]
237
+ findings = scan("Passport: A1234567", detectors=my_detectors)
238
+ # {"indian_passport": ["A1234567"]}
239
+ ```
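Before wiring a pattern into a detector, it can help to check it with plain `re`. This sketch exercises the passport pattern from the example above on its own, without importing DataCloak; the sample strings are made up for illustration.

```python
import re

# Same pattern as the PassportDetector example: one capital letter, then 7 digits.
PASSPORT_RE = re.compile(r"\b[A-Z]\d{7}\b")

samples = "Passport: A1234567, ref AB1234567, long A12345678"
matches = PASSPORT_RE.findall(samples)
print(matches)
```

The `\b` word boundaries are what reject `AB1234567` and `A12345678`: in both cases the candidate is embedded in a longer run of word characters.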
240
+
241
+ ### Adding a new file format
242
+
243
+ ```python
244
+ from datacloak.file_scanner import FileHandler, register_handler
245
+ from pathlib import Path
246
+
247
+ class PDFHandler(FileHandler):
248
+ extensions = (".pdf",)
249
+
250
+ def extract_chunks(self, path: Path):
251
+ # Use any PDF library (pdfplumber, PyMuPDF, etc.)
252
+ import pdfplumber
253
+ with pdfplumber.open(path) as pdf:
254
+ for page_num, page in enumerate(pdf.pages, start=1):
255
+ text = page.extract_text() or ""
256
+ yield page_num, None, text
257
+
258
+ register_handler(PDFHandler())
259
+
260
+ # Now scan_file("document.pdf") works automatically
261
+ ```
262
+
263
+ ---
264
+
265
+ ## 🕵️ Supported PII Types
266
+
267
+ | Detector | Example | Validation |
268
+ |---|---|---|
269
+ | `aadhaar` | `2345 6789 0123` | 12 digits, starts 2-9, space/hyphen/plain |
270
+ | `pan` | `ABCPE1234F` | AAAAA9999A format, valid entity code |
271
+ | `phone` | `9876543210`, `+91 9876543210` | 10 digits, starts 6-9, optional country code |
272
+ | `email` | `alice@example.com` | RFC-5321 compliant |
273
+ | `upi_id` | `user@okaxis` | VPA format, non-email handles only |
274
+ | `credit_card` | `4111 1111 1111 1111` | 13-19 digits, Luhn algorithm validated |
275
+ | `ifsc` | `HDFC0001234` | 4-alpha + 0 + 6-alphanumeric |
276
+ | `ip_address` | `192.168.1.1`, `::1` | IPv4 (range-validated) and IPv6 |
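
The `credit_card` row mentions Luhn validation. As a minimal sketch of that checksum (not DataCloak's own implementation), the hypothetical helper `luhn_valid` below doubles every second digit from the right, subtracts 9 from any result above 9, and requires the total to be divisible by 10. The 13 to 19 digit length check follows the table.

```python
def luhn_valid(number: str) -> bool:
    """Check a card number with the Luhn algorithm, ignoring separators."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit; double every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # True: well-known test number
print(luhn_valid("4111 1111 1111 1112"))  # False: last digit changed
```

A checksum like this filters out most random digit runs, which is why it is commonly paired with a regex rather than used alone.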
277
+
278
+ ---
279
+
280
+ ## 🧪 Running Tests
281
+
282
+ ```bash
283
+ # Clone the repo
284
+ git clone https://github.com/datacloak/datacloak.git
285
+ cd datacloak
286
+
287
+ # Install dev dependencies
288
+ pip install -e ".[dev]"
289
+
290
+ # Run tests
291
+ pytest
292
+
293
+ # Run with coverage
294
+ pytest --cov=datacloak --cov-report=term-missing
295
+ ```
296
+
297
+ Target: **≥ 90% coverage**.
298
+
299
+ ---
300
+
301
+ ## 🏗️ Architecture
302
+
303
+ ```
304
+ datacloak/
305
+ ├── __init__.py # Public API: mask(), scan(), report(), scan_file()
306
+ ├── detectors/
307
+ │   ├── __init__.py # Exports all detectors + DEFAULT_DETECTORS registry
308
+ │   ├── base.py # BaseDetector, Detection dataclass
309
+ │   ├── aadhaar.py
310
+ │   ├── pan.py
311
+ │   ├── mobile.py
312
+ │   ├── email.py
313
+ │   ├── upi.py
314
+ │   ├── credit_card.py # Luhn validation
315
+ │   ├── ifsc.py
316
+ │   └── ip_address.py # IPv4 + IPv6
317
+ ├── masker.py # mask_text(), masking modes logic
318
+ ├── scanner.py # scan_text(), scan_summary()
319
+ ├── file_scanner.py # scan_file(), mask_file(), FileHandler interface
320
+ ├── reporter.py # Report dataclass, generate_report_*()
321
+ └── cli.py # Click CLI: scan, mask, report commands
322
+ ```
323
+
324
+ **Design principles applied:**
325
+ - Single Responsibility: each detector, masker, scanner, and reporter is self-contained
326
+ - Open/Closed: extend via `BaseDetector` or `FileHandler` without modifying core
327
+ - Liskov Substitution: any `BaseDetector` subclass drops in transparently
328
+ - Dependency Injection: all public functions accept `detectors=` for testability
329
+ - Logging: structured `logging` throughout, silent by default (NullHandler)
330
+
331
+ ---
332
+
333
+ ## 📦 Publishing to PyPI
334
+
335
+ See [PUBLISHING.md](PUBLISHING.md) for a complete step-by-step guide.
336
+
337
+ ```bash
338
+ # Quick summary
339
+ pip install build twine
340
+ python -m build
341
+ twine upload dist/*
342
+ ```
343
+
344
+ ---
345
+
346
+ ## 📄 License
347
+
348
+ [MIT License](LICENSE). Copyright © 2025 DataCloak Contributors.
349
+
350
+ ---
351
+
352
+ ## 🤝 Contributing
353
+
354
+ Contributions, issues, and feature requests are welcome! Please read the contributing guide and open a pull request.
355
+
356
+ 1. Fork the repository
357
+ 2. Create a feature branch: `git checkout -b feat/my-detector`
358
+ 3. Write your code and tests
359
+ 4. Run `pytest` and ensure coverage stays ≥ 90%
360
+ 5. Open a pull request
361
+
362
+ ---
363
+
364
+ *DataCloak: because privacy is not optional.*