PyPI - inconnu - Versions diffs - 0.1.0__py3-none-any.whl - Mend

inconnu 0.1.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (13)

inconnu/__init__.py +235 -0
inconnu/config.py +7 -0
inconnu/exceptions.py +48 -0
inconnu/model_installer.py +200 -0
inconnu/nlp/entity_redactor.py +229 -0
inconnu/nlp/interfaces.py +23 -0
inconnu/nlp/patterns.py +144 -0
inconnu/nlp/utils.py +97 -0
inconnu-0.1.0.dist-info/METADATA +524 -0
inconnu-0.1.0.dist-info/RECORD +13 -0
inconnu-0.1.0.dist-info/WHEEL +4 -0
inconnu-0.1.0.dist-info/entry_points.txt +2 -0
inconnu-0.1.0.dist-info/licenses/LICENSE +21 -0

inconnu-0.1.0.dist-info/METADATA ADDED

@@ -0,0 +1,524 @@
+Metadata-Version: 2.4
+Name: inconnu
+Version: 0.1.0
+Summary: GDPR-compliant data privacy tool for entity redaction and de-anonymization
+Project-URL: Homepage, https://github.com/0xjgv/inconnu
+Project-URL: Documentation, https://github.com/0xjgv/inconnu#readme
+Project-URL: Repository, https://github.com/0xjgv/inconnu
+Project-URL: Issues, https://github.com/0xjgv/inconnu/issues
+Author-email: 0xjgv <juans.gaitan@gmail.com>
+License: MIT
+License-File: LICENSE
+Keywords: anonymization,gdpr,nlp,pii,privacy,pseudonymization,redaction,spacy
+Classifier: Development Status :: 4 - Beta
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Healthcare Industry
+Classifier: Intended Audience :: Legal Industry
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Topic :: Security
+Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Classifier: Topic :: Text Processing :: Linguistic
+Requires-Python: >=3.10
+Requires-Dist: phonenumbers>=9.0.8
+Requires-Dist: spacy>=3.8.7
+Provides-Extra: all
+Provides-Extra: de
+Provides-Extra: en
+Provides-Extra: es
+Provides-Extra: fr
+Provides-Extra: it
+Description-Content-Type: text/markdown
+# Inconnu
+## What is Inconnu?
+Inconnu is a GDPR-oriented data privacy tool for entity redaction and de-anonymization. It provides NLP-based tools for anonymizing and pseudonymizing text data while preserving data utility, helping your business meet privacy regulations.
+## Why Inconnu?
+1. **Seamless Compliance**: Inconnu simplifies the complexity of GDPR and other privacy laws, helping keep your data handling practices in line with legal standards.
+2. **State-of-the-Art NLP**: Inconnu combines spaCy models with custom entity recognition to detect personal identifiers and handle them consistently.
+3. **Transparency and Trust**: Complete processing documentation with timestamping, hashing, and entity mapping for full audit trails.
+4. **Reversible Processing**: Support for both anonymization and pseudonymization; pseudonymized text can be de-anonymized using its entity mapping.
+5. **Performance Optimized**: Fast processing with singleton pattern optimization and configurable text length limits.
+## Installation
+### Prerequisites
+- Python 3.10 or higher
+- pip (Python package manager)
+### Install from PyPI
+```bash
+# Basic installation (without language models)
+pip install inconnu
+# Install with English language support (quotes stop shells such as zsh from globbing the brackets)
+pip install "inconnu[en]"
+# Install with specific language support
+pip install "inconnu[de]"   # German
+pip install "inconnu[fr]"   # French
+pip install "inconnu[es]"   # Spanish
+pip install "inconnu[it]"   # Italian
+# Install with multiple languages
+pip install "inconnu[en,de,fr]"
+# Install with all language support
+pip install "inconnu[all]"
+```
+### Download Language Models
+After installation, download the required spaCy models:
+```bash
+# Using the built-in CLI tool
+inconnu-download en            # Download default English model
+inconnu-download de fr         # Download German and French models
+inconnu-download en --size large  # Download large English model
+inconnu-download all           # Download all default models
+inconnu-download --list        # List all available models
+# Or using spaCy directly
+python -m spacy download en_core_web_sm
+python -m spacy download de_core_news_sm
+```
+### Install from Source
+1. **Clone the repository**:
+   ```bash
+   git clone https://github.com/0xjgv/inconnu.git
+   cd inconnu
+   ```
+2. **Install with UV (recommended for development)**:
+   ```bash
+   make install         # Install dependencies
+   make model-de        # Download German model
+   make test            # Run tests
+   ```
+3. **Or install with pip**:
+   ```bash
+   pip install -e .     # Install in editable mode
+   python -m spacy download en_core_web_sm
+   ```
+### Installing Additional Models
+Inconnu supports multiple spaCy models for enhanced accuracy. The default `en_core_web_sm` model is lightweight and fast, but you can install more accurate models:
+#### English Models
+```bash
+# Small model (default) - 15MB, fast processing
+uv run python -m spacy download en_core_web_sm
+# Large model - 560MB, higher accuracy
+uv run python -m spacy download en_core_web_lg
+# Transformer model - 438MB, highest accuracy
+uv run python -m spacy download en_core_web_trf
+```
+#### Additional Language Models
+```bash
+# German model
+make model-de
+uv run python -m spacy download de_core_news_sm
+# Italian model
+make model-it
+uv run python -m spacy download it_core_news_sm
+# Spanish model
+make model-es
+uv run python -m spacy download es_core_news_sm
+# French model
+make model-fr
+uv run python -m spacy download fr_core_news_sm
+# For enhanced accuracy (manual installation)
+# Medium German model - better accuracy
+uv run python -m spacy download de_core_news_md
+# Large German model - highest accuracy
+uv run python -m spacy download de_core_news_lg
+```
+#### Using Different Models
+To use a different model, specify it when initializing the EntityRedactor:
+```python
+from inconnu.nlp.entity_redactor import EntityRedactor, SpacyModels
+# Use transformer model for highest accuracy
+entity_redactor = EntityRedactor(
+    custom_components=None,
+    language="en",
+    model_name=SpacyModels.EN_CORE_WEB_TRF  # High accuracy transformer model
+)
+```
+**Model Selection Guide:**
+- `en_core_web_sm`: Fast, suited to high-volume processing
+- `en_core_web_lg`: Better accuracy, moderate processing time
+- `en_core_web_trf`: Highest accuracy, slower processing (recommended for sensitive data)
+For more models, visit the [spaCy Models Directory](https://spacy.io/models).
+## Development Setup
+### Available Commands
+```bash
+# Development workflow
+make install          # Install all dependencies
+make model-de         # Download German spaCy model
+make model-it         # Download Italian spaCy model
+make model-es         # Download Spanish spaCy model
+make model-fr         # Download French spaCy model
+make test             # Run full test suite
+make lint             # Check code with ruff
+make format           # Format code with ruff
+make fix              # Auto-fix linting issues
+make clean            # Format, lint, fix, and clean cache
+make update-deps      # Update dependencies
+```
+### Running Tests
+```bash
+# Run all tests
+make test
+# Run with verbose output
+uv run pytest -vv
+# Run specific test file
+uv run pytest tests/test_inconnu.py -vv
+# Run specific test class
+uv run pytest tests/test_inconnu.py::TestInconnuPseudonymizer -vv
+```
+## Usage Examples
+### Basic Text Anonymization
+```python
+from inconnu import Inconnu
+# Simple initialization - no Config class required!
+inconnu = Inconnu()  # Uses sensible defaults
+# Simple anonymization - just the redacted text
+text = "John Doe from New York visited Paris last summer."
+redacted = inconnu.redact(text)
+print(redacted)
+# Output: "[PERSON] from [GPE] visited [GPE] [DATE]."
+# Pseudonymization - get both redacted text and entity mapping
+redacted_text, entity_map = inconnu.pseudonymize(text)
+print(redacted_text)
+# Output: "[PERSON_0] from [GPE_0] visited [GPE_1] [DATE_0]."
+print(entity_map)
+# Output: {'[PERSON_0]': 'John Doe', '[GPE_0]': 'New York', '[GPE_1]': 'Paris', '[DATE_0]': 'last summer'}
+# Advanced usage with full metadata (original API)
+result = inconnu(text=text)
+print(result.redacted_text)
+print(f"Processing time: {result.processing_time_ms:.2f}ms")
+```
+### Async and Batch Processing
+```python
+import asyncio
+from inconnu import Inconnu
+# Async processing for non-blocking operations
+async def process_texts():
+    inconnu = Inconnu()
+    # Single async processing
+    text = "John Doe called from +1-555-123-4567"
+    redacted = await inconnu.redact_async(text)
+    print(redacted)  # "[PERSON] called from [PHONE_NUMBER]"
+    # Batch async processing
+    texts = [
+        "Alice Smith visited Berlin",
+        "Bob Jones went to Tokyo",
+        "Carol Brown lives in Paris"
+    ]
+    results = await inconnu.redact_batch_async(texts)
+    for result in results:
+        print(result)
+asyncio.run(process_texts())
+```
+### Customer Service Email Processing
+```python
+from inconnu import Inconnu
+# Process customer service email with personal data
+customer_email = """
+Dear SolarTech Team,
+I am Max Mustermann living at Hauptstraße 50, 80331 Munich, Germany.
+My phone number is +49 89 1234567 and my email is max@example.com.
+I need to return my solar modules (Order: ST-78901) due to relocation.
+Best regards,
+Max Mustermann
+"""
+# Simple redaction with a default instance
+inconnu = Inconnu()
+redacted = inconnu.redact(customer_email)
+print(redacted)
+# Personal identifiers are automatically detected and redacted
+```
+### Multi-language Support
+```python
+from inconnu import Inconnu
+# German language processing: just specify the language code
+inconnu_de = Inconnu("de")
+german_text = "Herr Schmidt aus München besuchte Berlin im März."
+redacted = inconnu_de.redact(german_text)
+print(redacted)
+# Output: "[PERSON] aus [GPE] besuchte [GPE] [DATE]."
+```
+### Custom Entity Recognition
+```python
+from inconnu import Inconnu, NERComponent
+import re
+# Add custom entity recognition
+custom_components = [
+    NERComponent(
+        label="CREDIT_CARD",
+        pattern=re.compile(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b'),
+        processing_function=None
+    )
+]
+# Simple initialization with custom components
+inconnu_custom = Inconnu(
+    language="en",
+    custom_components=custom_components
+)
+# Test custom entity detection
+text = "My card number is 1234 5678 9012 3456"
+redacted = inconnu_custom.redact(text)
+print(redacted)  # "My card number is [CREDIT_CARD]"
+```
+### Context Manager for Resource Management
+```python
+from inconnu import Inconnu
+# Automatic resource cleanup
+with Inconnu() as inc:
+    redacted = inc.redact("Sensitive data about John Doe")
+    print(redacted)
+# Resources automatically cleaned up
+```
+### Error Handling
+```python
+from inconnu import Inconnu, TextTooLongError, ProcessingError
+inconnu = Inconnu(max_text_length=100)  # Set small limit for demo
+try:
+    long_text = "x" * 200  # Exceeds limit
+    result = inconnu.redact(long_text)
+except TextTooLongError as e:
+    print(f"Text too long: {e}")
+    # Error includes helpful suggestions for resolution
+except ProcessingError as e:
+    print(f"Processing failed: {e}")
+```
+## Use Cases
+### 1. **Customer Support Systems**
+Automatically redact personal information from customer service emails, chat logs, and support tickets while maintaining context for analysis.
+### 2. **Legal Document Processing**
+Anonymize legal documents, contracts, and case files for training, analysis, or public release while ensuring GDPR compliance.
+### 3. **Medical Record Anonymization**
+Process medical records and research data to remove patient identifiers while preserving clinical information for research purposes.
+### 4. **Financial Transaction Analysis**
+Redact personal financial information from transaction logs and banking communications for fraud analysis and compliance reporting.
+### 5. **Survey and Feedback Analysis**
+Anonymize customer feedback, survey responses, and user-generated content for analysis while protecting respondent privacy.
+### 6. **Training Data Preparation**
+Prepare training datasets for machine learning models by removing personal identifiers from text data while maintaining semantic meaning.
+## Supported Entity Types
+- **Standard Entities**: PERSON, GPE (geopolitical entities such as countries, cities, and states), DATE, ORG, MONEY
+- **Custom Entities**: EMAIL, IBAN, PHONE_NUMBER
+- **Enhanced Detection**: Person titles (Dr, Mr, Ms), international phone numbers
+- **Multilingual**: English, German, Italian, Spanish, and French language support
+## Features
+- **Robust Entity Detection**: Advanced NLP with spaCy models and custom regex patterns
+- **Dual Processing Modes**: Anonymization (`[PERSON]`) and pseudonymization (`[PERSON_0]`)
+- **Complete Audit Trail**: Timestamping, hashing, and processing metadata
+- **Reversible Processing**: Pseudonymized text can be de-anonymized using its entity mapping
+- **Performance Optimized**: Singleton pattern for model loading, configurable limits
+- **GDPR Compliant**: Built-in data retention policies and compliance features
+## Contributing
+We welcome contributions to Inconnu! As an open source project, we believe in the power of community collaboration to build better privacy tools.
+### How to Contribute
+#### 1. **Bug Reports & Feature Requests**
+- Open an issue on GitHub with detailed descriptions
+- Include code examples and expected vs actual behavior
+- Tag issues appropriately (bug, enhancement, documentation)
+#### 2. **Code Contributions**
+```bash
+# Fork the repository and create a feature branch
+git checkout -b feature/your-feature-name
+# Make your changes and ensure tests pass
+make test
+make lint
+# Submit a pull request with:
+# - Clear description of changes
+# - Test coverage for new features
+# - Updated documentation if needed
+```
+#### 3. **Development Guidelines**
+- Follow existing code style and patterns
+- Add tests for new functionality
+- Update documentation for user-facing changes
+- Ensure GDPR compliance considerations are addressed
+#### 4. **Areas for Contribution**
+- **Language Support**: Add new language models and region-specific entity detection
+- **Custom Entities**: Implement detection for industry-specific identifiers
+- **Performance**: Optimize processing speed and memory usage
+- **Documentation**: Improve examples, tutorials, and API documentation
+- **Testing**: Expand test coverage and edge case handling
+#### 5. **Code Review Process**
+- All contributions require code review
+- Automated tests must pass
+- Documentation updates are appreciated
+- Maintain backward compatibility when possible
+### Community Guidelines
+- **Be Respectful**: Foster an inclusive environment for all contributors
+- **Privacy First**: Always consider privacy implications of changes
+- **Security Minded**: Report security issues privately before public disclosure
+- **Quality Focused**: Prioritize code quality and comprehensive testing
+### Getting Help
+- **Discussions**: Use GitHub Discussions for questions and ideas
+- **Issues**: Report bugs and request features through GitHub Issues
+- **Documentation**: Check existing docs and contribute improvements
+Thank you for helping make Inconnu a better tool for data privacy and GDPR compliance!
+## Publishing to PyPI
+### For Maintainers
+To publish a new version to PyPI:
+1. **Configure Trusted Publisher** (first time only):
+   - Go to https://pypi.org/manage/project/inconnu/settings/publishing/
+   - Add a new trusted publisher:
+     - Publisher: GitHub
+     - Organization/username: `0xjgv`
+     - Repository name: `inconnu`
+     - Workflow name: `publish.yml`
+     - Environment name: `pypi` (optional but recommended)
+   - For Test PyPI, do the same at https://test.pypi.org with environment name: `testpypi`
+2. **Update Version**: Update the version in `pyproject.toml` and `inconnu/__init__.py`
+3. **Create a Git Tag**:
+   ```bash
+   git tag v0.1.0
+   git push origin v0.1.0
+   ```
+4. **GitHub Actions**: The workflow will automatically:
+   - Run tests on Python 3.10, 3.11, and 3.12
+   - Build the package
+   - Publish to PyPI using Trusted Publisher (no API tokens needed!)
+   - Generate PEP 740 attestations for security
+5. **Test PyPI Publishing**:
+   - Use workflow_dispatch to manually trigger Test PyPI publishing
+   - Go to Actions → Publish to PyPI → Run workflow
+### Manual Publishing (if needed)
+```bash
+# Build the package
+uv build
+# Check the package
+twine check dist/*
+# Upload to Test PyPI (requires API token)
+twine upload --repository testpypi dist/*
+# Upload to PyPI (requires API token)
+twine upload dist/*
+```
+### GitHub Environments (Recommended)
+Configure GitHub environments for additional security:
+1. Go to Settings → Environments
+2. Create `pypi` and `testpypi` environments
+3. Add protection rules:
+   - Required reviewers
+   - Restrict to specific tags (e.g., `v*`)
+   - Add deployment branch restrictions
+## Additional Resources
+- [spaCy Models Directory](https://spacy.io/models) - Complete list of available language models
+- [spaCy Model Releases](https://github.com/explosion/spacy-models) - GitHub repository for model updates
+- [pgeocode](https://pypi.org/project/pgeocode/) - Geographic location processing (potential future integration)
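
Reviewer note on the README above: it describes pseudonymization as reversible through the returned entity map, but none of its examples show the restore step. A minimal sketch of that step follows. It uses plain regular-expression substitution rather than any Inconnu API, the helper name `restore_text` is hypothetical, and the sample values are copied from the README's own expected output.

```python
import re

# Placeholders in the README look like [PERSON_0] or [PHONE_NUMBER_0].
PLACEHOLDER = re.compile(r"\[[A-Z_]+_\d+\]")


def restore_text(redacted_text: str, entity_map: dict[str, str]) -> str:
    """Swap each known placeholder back to its original value in one pass.

    Hypothetical helper for illustration; placeholders missing from the
    map are left untouched.
    """
    return PLACEHOLDER.sub(
        lambda match: entity_map.get(match.group(0), match.group(0)),
        redacted_text,
    )


redacted_text = "[PERSON_0] from [GPE_0] visited [GPE_1] [DATE_0]."
entity_map = {
    "[PERSON_0]": "John Doe",
    "[GPE_0]": "New York",
    "[GPE_1]": "Paris",
    "[DATE_0]": "last summer",
}
restored = restore_text(redacted_text, entity_map)
print(restored)
```

Substituting in a single pass means restored values are never rescanned, so an original value that happens to look like a placeholder cannot trigger a second replacement.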

inconnu-0.1.0.dist-info/RECORD ADDED

@@ -0,0 +1,13 @@
+inconnu/__init__.py,sha256=FHDRvMfesj7UYM1JSLwzWcDQs7eqp-zFoljNCU--YZk,7567
+inconnu/config.py,sha256=SFZjg0IpzOfac8RNmCnq9sjxqHmbhAkA1LfGHqfYiP8,129
+inconnu/exceptions.py,sha256=9qEqqwiRLvy5gDEPTiiTyyr_U5SQdzivBFPFx7HErG4,1547
+inconnu/model_installer.py,sha256=_PphTFdkJXsz0vwqrY0W9RTbxPaYYJylgBT1H9w7AHk,6433
+inconnu/nlp/entity_redactor.py,sha256=TD1G8qDX4bI9bAi5zR5oR1IbJJSst80dF2wXBCloj1Y,8003
+inconnu/nlp/interfaces.py,sha256=B9FhChpPBg7nmFOJltWga5nWzMsnP9yj7SxfnBjJydg,495
+inconnu/nlp/patterns.py,sha256=VxwgetKRd22esnjeya86j4oNKGzcHXIiQ6VE1LAVNzE,5662
+inconnu/nlp/utils.py,sha256=700Tz-wR4JFYvnvuAvyu2x2YNwkOPtvQx007H-wS-7Y,2775
+inconnu-0.1.0.dist-info/METADATA,sha256=CHGP-uLQ2xf5HOOT_aGO1ePE_qXkEG3lV8LrQZ-ctWM,16533
+inconnu-0.1.0.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
+inconnu-0.1.0.dist-info/entry_points.txt,sha256=jBJr5LeX-XGEBh5iMQIJr5zdzqbyOUyw3rSgWZfQcDk,66
+inconnu-0.1.0.dist-info/licenses/LICENSE,sha256=LMGDpdSqFgydJ63Q0EjrcYxFvATmqE_bdNHrdsAEqNE,1089
+inconnu-0.1.0.dist-info/RECORD,,
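
Reviewer note on RECORD: each row has the form `path,sha256=<digest>,size`, where the digest is the URL-safe base64 encoding of the SHA-256 hash with the trailing `=` padding removed. A small sketch of recomputing such a row from a file's bytes is shown below. `record_entry` is a hypothetical helper name, and the sample input is an empty file rather than any file from this wheel.

```python
import base64
import hashlib


def record_entry(path: str, data: bytes) -> str:
    """Build a RECORD-style row: path, unpadded urlsafe-base64 SHA-256, size."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"


# Illustrative path and content only.
entry = record_entry("pkg/__init__.py", b"")
print(entry)
```

To check an installed file, read it in binary mode, pass its bytes to the helper, and compare the result with the corresponding row above.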

inconnu-0.1.0.dist-info/WHEEL ADDED

@@ -0,0 +1,4 @@
+Wheel-Version: 1.0
+Generator: hatchling 1.27.0
+Root-Is-Purelib: true
+Tag: py3-none-any
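
Reviewer note on WHEEL: the file uses RFC 822-style `Key: Value` headers, so the standard library email parser can read it. The sketch below parses the four lines shown above and splits the `Tag` value into its Python, ABI, and platform parts. The variable names are illustrative only.

```python
from email.parser import Parser

wheel_text = """\
Wheel-Version: 1.0
Generator: hatchling 1.27.0
Root-Is-Purelib: true
Tag: py3-none-any
"""

metadata = Parser().parsestr(wheel_text)

# A compatibility tag is "<python tag>-<abi tag>-<platform tag>".
python_tag, abi_tag, platform_tag = metadata["Tag"].split("-")
is_purelib = metadata["Root-Is-Purelib"].lower() == "true"

print(python_tag, abi_tag, platform_tag, is_purelib)
```

Here `py3-none-any` together with `Root-Is-Purelib: true` indicates a pure-Python wheel with no ABI or platform restriction.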

inconnu-0.1.0.dist-info/entry_points.txt ADDED

@@ -0,0 +1,2 @@
+[console_scripts]
+inconnu-download = inconnu.model_installer:main
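
Reviewer note on entry_points.txt: the `console_scripts` group maps a command name to a `module:function` target, so installing the wheel should expose an `inconnu-download` command that calls `main` in `inconnu.model_installer`. A minimal sketch of reading that mapping with the standard library, without importing the package itself (variable names are illustrative):

```python
import configparser

entry_points_text = """\
[console_scripts]
inconnu-download = inconnu.model_installer:main
"""

# Only "=" separates key and value here; ":" belongs to the target.
parser = configparser.ConfigParser(delimiters=("=",))
parser.optionxform = str  # keep script names case-sensitive
parser.read_string(entry_points_text)

scripts = {}
for script_name, target in parser["console_scripts"].items():
    module_name, _, attr = target.partition(":")
    scripts[script_name] = (module_name, attr)
    print(f"{script_name}: import {module_name}, call {attr}()")
```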

inconnu-0.1.0.dist-info/licenses/LICENSE ADDED

@@ -0,0 +1,21 @@
+The MIT License (MIT)
+Copyright (c) 2025 Juan Gaitán-Villamizar
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.