inconnu 0.1.0__py3-none-any.whl

@@ -0,0 +1,524 @@
1
+ Metadata-Version: 2.4
2
+ Name: inconnu
3
+ Version: 0.1.0
4
+ Summary: GDPR-compliant data privacy tool for entity redaction and de-anonymization
5
+ Project-URL: Homepage, https://github.com/0xjgv/inconnu
6
+ Project-URL: Documentation, https://github.com/0xjgv/inconnu#readme
7
+ Project-URL: Repository, https://github.com/0xjgv/inconnu
8
+ Project-URL: Issues, https://github.com/0xjgv/inconnu/issues
9
+ Author-email: 0xjgv <juans.gaitan@gmail.com>
10
+ License: MIT
11
+ License-File: LICENSE
12
+ Keywords: anonymization,gdpr,nlp,pii,privacy,pseudonymization,redaction,spacy
13
+ Classifier: Development Status :: 4 - Beta
14
+ Classifier: Intended Audience :: Developers
15
+ Classifier: Intended Audience :: Healthcare Industry
16
+ Classifier: Intended Audience :: Legal Industry
17
+ Classifier: License :: OSI Approved :: MIT License
18
+ Classifier: Operating System :: OS Independent
19
+ Classifier: Programming Language :: Python :: 3
20
+ Classifier: Programming Language :: Python :: 3.10
21
+ Classifier: Programming Language :: Python :: 3.11
22
+ Classifier: Programming Language :: Python :: 3.12
23
+ Classifier: Topic :: Security
24
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
25
+ Classifier: Topic :: Text Processing :: Linguistic
26
+ Requires-Python: >=3.10
27
+ Requires-Dist: phonenumbers>=9.0.8
28
+ Requires-Dist: spacy>=3.8.7
29
+ Provides-Extra: all
30
+ Provides-Extra: de
31
+ Provides-Extra: en
32
+ Provides-Extra: es
33
+ Provides-Extra: fr
34
+ Provides-Extra: it
35
+ Description-Content-Type: text/markdown
36
+
37
+ # Inconnu
38
+
39
+ ## What is Inconnu?
40
+
41
+ Inconnu is a GDPR-compliant data privacy tool designed for entity redaction and de-anonymization. It uses spaCy-based NLP to anonymize and pseudonymize text data while maintaining data utility, helping your business meet stringent privacy regulations.
42
+
43
+ ## Why Inconnu?
44
+
45
+ 1. **Seamless Compliance**: Inconnu simplifies working with GDPR and other privacy laws, helping keep your data handling practices in line with legal standards.
46
+
47
+ 2. **State-of-the-Art NLP**: Using spaCy models together with custom entity recognition, Inconnu detects personal identifiers and handles them consistently.
48
+
49
+ 3. **Transparency and Trust**: Complete processing documentation with timestamping, hashing, and entity mapping for full audit trails.
50
+
51
+ 4. **Reversible Processing**: Support for both anonymization and pseudonymization with complete de-anonymization capabilities.
52
+
53
+ 5. **Performance Optimized**: Fast processing with singleton pattern optimization and configurable text length limits.
54
+
55
+ ## Installation
56
+
57
+ ### Prerequisites
58
+
59
+ - Python 3.10 or higher
60
+ - pip (Python package manager)
61
+
62
+ ### Install from PyPI
63
+
64
+ ```bash
65
+ # Basic installation (without language models)
66
+ pip install inconnu
67
+
68
+ # Install with English language support (in zsh, quote the extras: pip install "inconnu[en]")
69
+ pip install inconnu[en]
70
+
71
+ # Install with specific language support
72
+ pip install inconnu[de] # German
73
+ pip install inconnu[fr] # French
74
+ pip install inconnu[es] # Spanish
75
+ pip install inconnu[it] # Italian
76
+
77
+ # Install with multiple languages
78
+ pip install inconnu[en,de,fr]
79
+
80
+ # Install with all language support
81
+ pip install inconnu[all]
82
+ ```
83
+
84
+ ### Download Language Models
85
+
86
+ After installation, download the required spaCy models:
87
+
88
+ ```bash
89
+ # Using the built-in CLI tool
90
+ inconnu-download en # Download default English model
91
+ inconnu-download de fr # Download German and French models
92
+ inconnu-download en --size large # Download large English model
93
+ inconnu-download all # Download all default models
94
+ inconnu-download --list # List all available models
95
+
96
+ # Or using spaCy directly
97
+ python -m spacy download en_core_web_sm
98
+ python -m spacy download de_core_news_sm
99
+ ```
100
+
101
+ ### Install from Source
102
+
103
+ 1. **Clone the repository**:
104
+ ```bash
105
+ git clone https://github.com/0xjgv/inconnu.git
106
+ cd inconnu
107
+ ```
108
+
109
+ 2. **Install with UV (recommended for development)**:
110
+ ```bash
111
+ make install # Install dependencies
112
+ make model-de # Download German model
113
+ make test # Run tests
114
+ ```
115
+
116
+ 3. **Or install with pip**:
117
+ ```bash
118
+ pip install -e . # Install in editable mode
119
+ python -m spacy download en_core_web_sm
120
+ ```
121
+
122
+ ### Installing Additional Models
123
+
124
+ Inconnu supports multiple spaCy models for enhanced accuracy. The default `en_core_web_sm` model is lightweight and fast, but you can install more accurate models:
125
+
126
+ #### English Models
127
+ ```bash
128
+ # Small model (default) - 15MB, fast processing
129
+ uv run python -m spacy download en_core_web_sm
130
+
131
+ # Large model - 560MB, higher accuracy
132
+ uv run python -m spacy download en_core_web_lg
133
+
134
+ # Transformer model - 438MB, highest accuracy
135
+ uv run python -m spacy download en_core_web_trf
136
+ ```
137
+
138
+ #### Additional Language Models
139
+ ```bash
140
+ # German model
141
+ make model-de
142
+ uv run python -m spacy download de_core_news_sm
143
+
144
+ # Italian model
145
+ make model-it
146
+ uv run python -m spacy download it_core_news_sm
147
+
148
+ # Spanish model
149
+ make model-es
150
+ uv run python -m spacy download es_core_news_sm
151
+
152
+ # French model
153
+ make model-fr
154
+ uv run python -m spacy download fr_core_news_sm
155
+
156
+ # For enhanced accuracy (manual installation)
157
+ # Medium German model - better accuracy
158
+ uv run python -m spacy download de_core_news_md
159
+
160
+ # Large German model - highest accuracy
161
+ uv run python -m spacy download de_core_news_lg
162
+ ```
163
+
164
+ #### Using Different Models
165
+
166
+ To use a different model, specify it when initializing the `EntityRedactor`:
167
+
168
+ ```python
+ from inconnu.nlp.entity_redactor import EntityRedactor, SpacyModels
+
+ # Use transformer model for highest accuracy
+ entity_redactor = EntityRedactor(
+     custom_components=None,
+     language="en",
+     model_name=SpacyModels.EN_CORE_WEB_TRF,  # High accuracy transformer model
+ )
+ ```
178
+
179
+ **Model Selection Guide:**
180
+ - `en_core_web_sm`: Fast processing, good for high-volume processing
181
+ - `en_core_web_lg`: Better accuracy, moderate processing time
182
+ - `en_core_web_trf`: Highest accuracy, slower processing (recommended for sensitive data)
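The model names in this guide follow spaCy's `<language>_<family>_<size>` naming scheme. As a rough sketch of that scheme, the helper below builds a package name for the five languages listed in this README. `spacy_model_name` is a hypothetical function for illustration, not part of Inconnu, and it leaves out transformer packages because their names do not follow one pattern across languages.

```python
# Hypothetical helper (not part of Inconnu): build a spaCy package name
# from a language code and a size suffix.
_FAMILY = {
    "en": "core_web",   # English models are trained on web text
    "de": "core_news",  # the other listed languages use news text
    "fr": "core_news",
    "es": "core_news",
    "it": "core_news",
}
_SIZES = {"sm", "md", "lg"}


def spacy_model_name(language: str, size: str = "sm") -> str:
    """Return a name such as 'de_core_news_sm' for a language and size."""
    if language not in _FAMILY:
        raise ValueError(f"unsupported language: {language!r}")
    if size not in _SIZES:
        raise ValueError(f"unsupported size: {size!r}")
    return f"{language}_{_FAMILY[language]}_{size}"


print(spacy_model_name("en"))
print(spacy_model_name("de", "lg"))
```

The resulting name can be passed to `python -m spacy download`.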
183
+
184
+ For more models, visit the [spaCy Models Directory](https://spacy.io/models).
185
+
186
+ ## Development Setup
187
+
188
+ ### Available Commands
189
+
190
+ ```bash
191
+ # Development workflow
192
+ make install # Install all dependencies
193
+ make model-de # Download German spaCy model
194
+ make model-it # Download Italian spaCy model
195
+ make model-es # Download Spanish spaCy model
196
+ make model-fr # Download French spaCy model
197
+ make test # Run full test suite
198
+ make lint # Check code with ruff
199
+ make format # Format code with ruff
200
+ make fix # Auto-fix linting issues
201
+ make clean # Format, lint, fix, and clean cache
202
+ make update-deps # Update dependencies
203
+ ```
204
+
205
+ ### Running Tests
206
+
207
+ ```bash
208
+ # Run all tests
209
+ make test
210
+
211
+ # Run with verbose output
212
+ uv run pytest -vv
213
+
214
+ # Run specific test file
215
+ uv run pytest tests/test_inconnu.py -vv
216
+
217
+ # Run specific test class
218
+ uv run pytest tests/test_inconnu.py::TestInconnuPseudonymizer -vv
219
+ ```
220
+
221
+ ## Usage Examples
222
+
223
+ ### Basic Text Anonymization
224
+
225
+ ```python
226
+ from inconnu import Inconnu
227
+
228
+ # Simple initialization - no Config class required!
229
+ inconnu = Inconnu() # Uses sensible defaults
230
+
231
+ # Simple anonymization - just the redacted text
232
+ text = "John Doe from New York visited Paris last summer."
233
+ redacted = inconnu.redact(text)
234
+ print(redacted)
235
+ # Output: "[PERSON] from [GPE] visited [GPE] [DATE]."
236
+
237
+ # Pseudonymization - get both redacted text and entity mapping
238
+ redacted_text, entity_map = inconnu.pseudonymize(text)
239
+ print(redacted_text)
240
+ # Output: "[PERSON_0] from [GPE_0] visited [GPE_1] [DATE_0]."
241
+ print(entity_map)
242
+ # Output: {'[PERSON_0]': 'John Doe', '[GPE_0]': 'New York', '[GPE_1]': 'Paris', '[DATE_0]': 'last summer'}
243
+
244
+ # Advanced usage with full metadata (original API)
245
+ result = inconnu(text=text)
246
+ print(result.redacted_text)
247
+ print(f"Processing time: {result.processing_time_ms:.2f}ms")
248
+ ```
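The entity map returned by `pseudonymize` is what makes the process reversible. The sketch below shows one way to put the original values back using only the standard library. It assumes placeholders of the form `[LABEL_N]` and a map shaped like the output above; `restore` is a hypothetical helper, not Inconnu's own de-anonymization API.

```python
import re

# Matches placeholders such as [PERSON_0] or [PHONE_NUMBER_12].
PLACEHOLDER = re.compile(r"\[[A-Z_]+_\d+\]")


def restore(redacted_text: str, entity_map: dict[str, str]) -> str:
    """Replace each known placeholder with its original value in one pass."""
    # A single regex pass avoids re-scanning text that was already restored.
    # Unknown placeholders are left untouched.
    return PLACEHOLDER.sub(
        lambda match: entity_map.get(match.group(0), match.group(0)),
        redacted_text,
    )


entity_map = {
    "[PERSON_0]": "John Doe",
    "[GPE_0]": "New York",
    "[GPE_1]": "Paris",
    "[DATE_0]": "last summer",
}
redacted_text = "[PERSON_0] from [GPE_0] visited [GPE_1] [DATE_0]."
print(restore(redacted_text, entity_map))
```

Because the map contains the original personal data, it needs the same protection as the source text.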
249
+
250
+ ### Async and Batch Processing
251
+
252
+ ```python
+ import asyncio
+
+ from inconnu import Inconnu
+
+ # Async processing for non-blocking operations
+ async def process_texts():
+     inconnu = Inconnu()
+
+     # Single async processing
+     text = "John Doe called from +1-555-123-4567"
+     redacted = await inconnu.redact_async(text)
+     print(redacted)  # "[PERSON] called from [PHONE_NUMBER]"
+
+     # Batch async processing
+     texts = [
+         "Alice Smith visited Berlin",
+         "Bob Jones went to Tokyo",
+         "Carol Brown lives in Paris",
+     ]
+     results = await inconnu.redact_batch_async(texts)
+     for result in results:
+         print(result)
+
+ asyncio.run(process_texts())
+ ```
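This README does not describe how the async methods work internally. As an illustration of the general pattern only, the sketch below runs a blocking redaction function in worker threads and gathers the results in input order. `fake_redact` is a stand-in that masks capitalized word pairs, and `redact_batch` is a hypothetical name; neither reflects Inconnu's implementation.

```python
import asyncio
import re


def fake_redact(text: str) -> str:
    """Stand-in for a blocking redaction call: masks 'Firstname Lastname' pairs."""
    return re.sub(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", "[PERSON]", text)


async def redact_batch(texts: list[str]) -> list[str]:
    # asyncio.to_thread runs each blocking call in a worker thread, so the
    # event loop stays free; gather returns results in the order of its inputs.
    tasks = [asyncio.to_thread(fake_redact, text) for text in texts]
    return await asyncio.gather(*tasks)


results = asyncio.run(
    redact_batch(["Alice Smith visited Berlin", "Bob Jones went to Tokyo"])
)
for line in results:
    print(line)
```

`asyncio.to_thread` requires Python 3.9 or later, which is within the package's stated `>=3.10` requirement.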
276
+
277
+ ### Customer Service Email Processing
278
+
279
+ ```python
280
+ # Process customer service email with personal data
281
+ customer_email = """
282
+ Dear SolarTech Team,
283
+
284
+ I am Max Mustermann living at Hauptstraße 50, 80331 Munich, Germany.
285
+ My phone number is +49 89 1234567 and my email is max@example.com.
286
+ I need to return my solar modules (Order: ST-78901) due to relocation.
287
+
288
+ Best regards,
289
+ Max Mustermann
290
+ """
291
+
292
+ # Simple redaction
293
+ redacted = inconnu.redact(customer_email)
294
+ print(redacted)
295
+ # Personal identifiers are automatically detected and redacted
296
+ ```
297
+
298
+ ### Multi-language Support
299
+
300
+ ```python
301
+ # German language processing - simplified!
302
+ inconnu_de = Inconnu("de") # Just specify the language
303
+
304
+ german_text = "Herr Schmidt aus München besuchte Berlin im März."
305
+ redacted = inconnu_de.redact(german_text)
306
+ print(redacted)
307
+ # Output: "[PERSON] aus [GPE] besuchte [GPE] [DATE]."
308
+ ```
309
+
310
+ ### Custom Entity Recognition
311
+
312
+ ```python
+ import re
+
+ from inconnu import Inconnu, NERComponent
+
+ # Add custom entity recognition
+ custom_components = [
+     NERComponent(
+         label="CREDIT_CARD",
+         pattern=re.compile(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b'),
+         processing_function=None,
+     )
+ ]
+
+ # Simple initialization with custom components
+ inconnu_custom = Inconnu(
+     language="en",
+     custom_components=custom_components,
+ )
+
+ # Test custom entity detection
+ text = "My card number is 1234 5678 9012 3456"
+ redacted = inconnu_custom.redact(text)
+ print(redacted)  # "My card number is [CREDIT_CARD]"
+ ```
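To see what the pattern above matches before wiring it into a component, it can be tried with the standard library alone. The sketch below applies the same regular expression with `re.sub`; `mask_cards` is a hypothetical helper used only for this check and does not involve Inconnu.

```python
import re

# Same pattern as in the custom component: four groups of four digits,
# optionally separated by a hyphen or whitespace.
CREDIT_CARD = re.compile(r"\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b")


def mask_cards(text: str) -> str:
    """Replace every match of the card pattern with a fixed label."""
    return CREDIT_CARD.sub("[CREDIT_CARD]", text)


print(mask_cards("My card number is 1234 5678 9012 3456"))
print(mask_cards("Card: 1234-5678-9012-3456"))
print(mask_cards("Order 12345 shipped"))  # too short, left unchanged
```

The pattern checks shape only, so any 16-digit number in this layout is masked whether or not it is a real card number.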
336
+
337
+ ### Context Manager for Resource Management
338
+
339
+ ```python
+ from inconnu import Inconnu
+
+ # Automatic resource cleanup
+ with Inconnu() as inc:
+     redacted = inc.redact("Sensitive data about John Doe")
+     print(redacted)
+ # Resources automatically cleaned up
+ ```
346
+
347
+ ### Error Handling
348
+
349
+ ```python
+ from inconnu import Inconnu, TextTooLongError, ProcessingError
+
+ inconnu = Inconnu(max_text_length=100)  # Set small limit for demo
+
+ try:
+     long_text = "x" * 200  # Exceeds limit
+     result = inconnu.redact(long_text)
+ except TextTooLongError as e:
+     print(f"Text too long: {e}")
+     # Error includes helpful suggestions for resolution
+ except ProcessingError as e:
+     print(f"Processing failed: {e}")
+ ```
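One possible response to `TextTooLongError` is to split the input into pieces that fit under the limit and redact each piece separately. The sketch below is a standard-library illustration of that idea; `split_text` is a hypothetical helper, not something Inconnu provides, and it breaks at spaces where it can.

```python
def split_text(text: str, max_length: int) -> list[str]:
    """Split text into chunks of at most max_length characters."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    chunks = []
    rest = text
    while len(rest) > max_length:
        # Prefer the last space inside the window; otherwise cut hard.
        cut = rest.rfind(" ", 0, max_length + 1)
        if cut <= 0:
            cut = max_length
        chunks.append(rest[:cut])
        rest = rest[cut:].lstrip(" ")
    if rest:
        chunks.append(rest)
    return chunks


for chunk in split_text("aaa bbb ccc", 7):
    print(repr(chunk))
```

An entity that straddles a chunk boundary may be missed, so splitting at sentence or paragraph boundaries is the safer choice for real documents.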
363
+
364
+ ## Use Cases
365
+
366
+ ### 1. **Customer Support Systems**
367
+ Automatically redact personal information from customer service emails, chat logs, and support tickets while maintaining context for analysis.
368
+
369
+ ### 2. **Legal Document Processing**
370
+ Anonymize legal documents, contracts, and case files for training, analysis, or public release while ensuring GDPR compliance.
371
+
372
+ ### 3. **Medical Record Anonymization**
373
+ Process medical records and research data to remove patient identifiers while preserving clinical information for research purposes.
374
+
375
+ ### 4. **Financial Transaction Analysis**
376
+ Redact personal financial information from transaction logs and banking communications for fraud analysis and compliance reporting.
377
+
378
+ ### 5. **Survey and Feedback Analysis**
379
+ Anonymize customer feedback, survey responses, and user-generated content for analysis while protecting respondent privacy.
380
+
381
+ ### 6. **Training Data Preparation**
382
+ Prepare training datasets for machine learning models by removing personal identifiers from text data while maintaining semantic meaning.
383
+
384
+ ## Supported Entity Types
385
+
386
+ - **Standard Entities**: PERSON, GPE (geopolitical entities such as countries, cities, and states), DATE, ORG, MONEY
387
+ - **Custom Entities**: EMAIL, IBAN, PHONE_NUMBER
388
+ - **Enhanced Detection**: Person titles (Dr, Mr, Ms), international phone numbers
389
+ - **Multilingual**: English, German, Italian, Spanish, and French language support
390
+
391
+ ## Features
392
+
393
+ - **Robust Entity Detection**: Advanced NLP with spaCy models and custom regex patterns
394
+ - **Dual Processing Modes**: Anonymization (`[PERSON]`) and pseudonymization (`[PERSON_0]`)
395
+ - **Complete Audit Trail**: Timestamping, hashing, and processing metadata
396
+ - **Reversible Processing**: Full de-anonymization capabilities with entity mapping
397
+ - **Performance Optimized**: Singleton pattern for model loading, configurable limits
398
+ - **GDPR Compliant**: Built-in data retention policies and compliance features
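The audit trail feature is described as combining timestamping and hashing. This README does not state which hash or record fields Inconnu uses, so the following is only a sketch under stated assumptions: SHA-256 digests, UTC timestamps, and hypothetical field names.

```python
import hashlib
from datetime import datetime, timezone


def audit_record(original: str, redacted: str) -> dict[str, str]:
    """Build a record that identifies a processing run without storing the text."""
    return {
        # ISO 8601 timestamp in UTC
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Hex digests let an auditor confirm which input produced which output
        "original_sha256": hashlib.sha256(original.encode("utf-8")).hexdigest(),
        "redacted_sha256": hashlib.sha256(redacted.encode("utf-8")).hexdigest(),
    }


record = audit_record("John Doe from New York", "[PERSON] from [GPE]")
for key, value in record.items():
    print(f"{key}: {value}")
```

A hash of a short or guessable original can be reversed by brute force, so such a record may still count as personal data.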
399
+
400
+ ## Contributing
401
+
402
+ We welcome contributions to Inconnu! As an open source project, we believe in the power of community collaboration to build better privacy tools.
403
+
404
+ ### How to Contribute
405
+
406
+ #### 1. **Bug Reports & Feature Requests**
407
+ - Open an issue on GitHub with detailed descriptions
408
+ - Include code examples and expected vs actual behavior
409
+ - Tag issues appropriately (bug, enhancement, documentation)
410
+
411
+ #### 2. **Code Contributions**
412
+ ```bash
413
+ # Fork the repository and create a feature branch
414
+ git checkout -b feature/your-feature-name
415
+
416
+ # Make your changes and ensure tests pass
417
+ make test
418
+ make lint
419
+
420
+ # Submit a pull request with:
421
+ # - Clear description of changes
422
+ # - Test coverage for new features
423
+ # - Updated documentation if needed
424
+ ```
425
+
426
+ #### 3. **Development Guidelines**
427
+ - Follow existing code style and patterns
428
+ - Add tests for new functionality
429
+ - Update documentation for user-facing changes
430
+ - Ensure GDPR compliance considerations are addressed
431
+
432
+ #### 4. **Areas for Contribution**
433
+ - **Language Support**: Add new language models and region-specific entity detection
434
+ - **Custom Entities**: Implement detection for industry-specific identifiers
435
+ - **Performance**: Optimize processing speed and memory usage
436
+ - **Documentation**: Improve examples, tutorials, and API documentation
437
+ - **Testing**: Expand test coverage and edge case handling
438
+
439
+ #### 5. **Code Review Process**
440
+ - All contributions require code review
441
+ - Automated tests must pass
442
+ - Documentation updates are appreciated
443
+ - Maintain backward compatibility when possible
444
+
445
+ ### Community Guidelines
446
+
447
+ - **Be Respectful**: Foster an inclusive environment for all contributors
448
+ - **Privacy First**: Always consider privacy implications of changes
449
+ - **Security Minded**: Report security issues privately before public disclosure
450
+ - **Quality Focused**: Prioritize code quality and comprehensive testing
451
+
452
+ ### Getting Help
453
+
454
+ - **Discussions**: Use GitHub Discussions for questions and ideas
455
+ - **Issues**: Report bugs and request features through GitHub Issues
456
+ - **Documentation**: Check existing docs and contribute improvements
457
+
458
+ Thank you for helping make Inconnu a better tool for data privacy and GDPR compliance!
459
+
460
+ ## Publishing to PyPI
461
+
462
+ ### For Maintainers
463
+
464
+ To publish a new version to PyPI:
465
+
466
+ 1. **Configure Trusted Publisher** (first time only):
467
+    - Go to https://pypi.org/manage/project/inconnu/settings/publishing/
468
+    - Add a new trusted publisher:
469
+      - Publisher: GitHub
470
+      - Organization/username: `0xjgv`
471
+      - Repository name: `inconnu`
472
+      - Workflow name: `publish.yml`
473
+      - Environment name: `pypi` (optional but recommended)
474
+    - For Test PyPI, do the same at https://test.pypi.org with environment name: `testpypi`
475
+
476
+ 2. **Update Version**: Update the version in `pyproject.toml` and `inconnu/__init__.py`
477
+
478
+ 3. **Create a Git Tag**:
479
+ ```bash
480
+ git tag v0.1.0
481
+ git push origin v0.1.0
482
+ ```
483
+
484
+ 4. **GitHub Actions**: The workflow will automatically:
485
+ - Run tests on Python 3.10, 3.11, and 3.12
486
+ - Build the package
487
+ - Publish to PyPI using Trusted Publisher (no API tokens needed!)
488
+ - Generate PEP 740 attestations for security
489
+
490
+ 5. **Test PyPI Publishing**:
491
+ - Use `workflow_dispatch` to manually trigger Test PyPI publishing
492
+ - Go to Actions → Publish to PyPI → Run workflow
493
+
494
+ ### Manual Publishing (if needed)
495
+
496
+ ```bash
497
+ # Build the package
498
+ uv build
499
+
500
+ # Check the package
501
+ twine check dist/*
502
+
503
+ # Upload to Test PyPI (requires API token)
504
+ twine upload --repository testpypi dist/*
505
+
506
+ # Upload to PyPI (requires API token)
507
+ twine upload dist/*
508
+ ```
509
+
510
+ ### GitHub Environments (Recommended)
511
+
512
+ Configure GitHub environments for additional security:
513
+ 1. Go to Settings → Environments
514
+ 2. Create `pypi` and `testpypi` environments
515
+ 3. Add protection rules:
516
+    - Required reviewers
517
+    - Restrict to specific tags (e.g., `v*`)
518
+    - Add deployment branch restrictions
519
+
520
+ ## Additional Resources
521
+
522
+ - [spaCy Models Directory](https://spacy.io/models) - Complete list of available language models
523
+ - [spaCy Model Releases](https://github.com/explosion/spacy-models) - GitHub repository for model updates
524
+ - [pgeocode](https://pypi.org/project/pgeocode/) - Geographic location processing (potential future integration)
@@ -0,0 +1,13 @@
1
+ inconnu/__init__.py,sha256=FHDRvMfesj7UYM1JSLwzWcDQs7eqp-zFoljNCU--YZk,7567
2
+ inconnu/config.py,sha256=SFZjg0IpzOfac8RNmCnq9sjxqHmbhAkA1LfGHqfYiP8,129
3
+ inconnu/exceptions.py,sha256=9qEqqwiRLvy5gDEPTiiTyyr_U5SQdzivBFPFx7HErG4,1547
4
+ inconnu/model_installer.py,sha256=_PphTFdkJXsz0vwqrY0W9RTbxPaYYJylgBT1H9w7AHk,6433
5
+ inconnu/nlp/entity_redactor.py,sha256=TD1G8qDX4bI9bAi5zR5oR1IbJJSst80dF2wXBCloj1Y,8003
6
+ inconnu/nlp/interfaces.py,sha256=B9FhChpPBg7nmFOJltWga5nWzMsnP9yj7SxfnBjJydg,495
7
+ inconnu/nlp/patterns.py,sha256=VxwgetKRd22esnjeya86j4oNKGzcHXIiQ6VE1LAVNzE,5662
8
+ inconnu/nlp/utils.py,sha256=700Tz-wR4JFYvnvuAvyu2x2YNwkOPtvQx007H-wS-7Y,2775
9
+ inconnu-0.1.0.dist-info/METADATA,sha256=CHGP-uLQ2xf5HOOT_aGO1ePE_qXkEG3lV8LrQZ-ctWM,16533
10
+ inconnu-0.1.0.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
11
+ inconnu-0.1.0.dist-info/entry_points.txt,sha256=jBJr5LeX-XGEBh5iMQIJr5zdzqbyOUyw3rSgWZfQcDk,66
12
+ inconnu-0.1.0.dist-info/licenses/LICENSE,sha256=LMGDpdSqFgydJ63Q0EjrcYxFvATmqE_bdNHrdsAEqNE,1089
13
+ inconnu-0.1.0.dist-info/RECORD,,
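Each RECORD line above has the form `path,hash,size`, where the hash is `sha256=` followed by the URL-safe base64 digest with padding removed, and RECORD lists itself with empty hash and size fields. The sketch below shows how such an entry could be checked against file contents. `record_hash` and `verify_entry` are hypothetical names, and splitting on commas is a simplification that ignores CSV quoting of unusual paths.

```python
import base64
import hashlib


def record_hash(data: bytes) -> str:
    """Return the RECORD-style hash string for a file's bytes."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"sha256={encoded}"


def verify_entry(line: str, data: bytes) -> bool:
    """Check one 'path,hash,size' line against the given file contents."""
    _path, expected_hash, size = line.rsplit(",", 2)
    if not expected_hash and not size:
        return True  # RECORD's own entry carries no hash or size
    return expected_hash == record_hash(data) and int(size) == len(data)


data = b"print('hello')\n"
line = f"pkg/hello.py,{record_hash(data)},{len(data)}"
print(line)
print(verify_entry(line, data))
```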
@@ -0,0 +1,4 @@
1
+ Wheel-Version: 1.0
2
+ Generator: hatchling 1.27.0
3
+ Root-Is-Purelib: true
4
+ Tag: py3-none-any
@@ -0,0 +1,2 @@
1
+ [console_scripts]
2
+ inconnu-download = inconnu.model_installer:main
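The `[console_scripts]` entry above maps the command name `inconnu-download` to the object reference `inconnu.model_installer:main`, which installers use to generate the executable. As a sketch of how a `name = module:attr` line resolves to a callable, the helper below parses one and imports the target. `load_entry_point` is a hypothetical function, and the demo uses a standard-library target so that it runs without Inconnu installed.

```python
import importlib


def load_entry_point(spec: str):
    """Parse 'name = module:attr' and return (name, resolved object)."""
    name, _, target = spec.partition("=")
    module_name, _, attr_path = target.strip().partition(":")
    obj = importlib.import_module(module_name.strip())
    # The attribute part may be dotted, e.g. 'Class.method'.
    attrs = attr_path.strip().split(".") if attr_path.strip() else []
    for attr in attrs:
        obj = getattr(obj, attr)
    return name.strip(), obj


# Demo with a standard-library function in place of inconnu.model_installer:main
name, func = load_entry_point("dump-json = json:dumps")
print(name, func([1, 2]))
```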
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2025 Juan Gaitán-Villamizar
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.