PyPI - resolvekit - Versions diffs - 0.0.1__py3-none-any.whl - Mend

resolvekit 0.0.1__py3-none-any.whl

Files changed (70)

resolvekit/README.md +134 -0
resolvekit/__init__.py +67 -0
resolvekit/api/README.md +165 -0
resolvekit/api/__init__.py +10 -0
resolvekit/api/convenience.py +53 -0
resolvekit/api/resolver.py +457 -0
resolvekit/builders/README.md +173 -0
resolvekit/builders/__init__.py +0 -0
resolvekit/calibration/README.md +351 -0
resolvekit/calibration/__init__.py +12 -0
resolvekit/calibration/calibrator.py +184 -0
resolvekit/calibration/features.py +139 -0
resolvekit/calibration/models.py +78 -0
resolvekit/cli/README.md +215 -0
resolvekit/cli/__init__.py +0 -0
resolvekit/cli/main.py +18 -0
resolvekit/config.py +128 -0
resolvekit/constants.py +252 -0
resolvekit/constraints/README.md +102 -0
resolvekit/constraints/__init__.py +17 -0
resolvekit/constraints/constraint_engine.py +111 -0
resolvekit/constraints/hierarchy_validator.py +148 -0
resolvekit/constraints/membership_validator.py +60 -0
resolvekit/constraints/protocols.py +33 -0
resolvekit/constraints/temporal_validator.py +43 -0
resolvekit/constraints/type_validator.py +42 -0
resolvekit/data/README.md +165 -0
resolvekit/data/__init__.py +14 -0
resolvekit/data/alias_repository.py +206 -0
resolvekit/data/code_repository.py +85 -0
resolvekit/data/context_filters.py +49 -0
resolvekit/data/db_manager.py +196 -0
resolvekit/data/entity_repository.py +466 -0
resolvekit/data/membership_repository.py +107 -0
resolvekit/data/query_builder.py +177 -0
resolvekit/data/schema.py +122 -0
resolvekit/disambiguation/README.md +72 -0
resolvekit/disambiguation/__init__.py +0 -0
resolvekit/extraction/README.md +204 -0
resolvekit/extraction/__init__.py +0 -0
resolvekit/matchers/README.md +77 -0
resolvekit/matchers/__init__.py +65 -0
resolvekit/matchers/alias_exact.py +65 -0
resolvekit/matchers/canonical_name.py +62 -0
resolvekit/matchers/cascade.py +127 -0
resolvekit/matchers/code_validators.py +250 -0
resolvekit/matchers/exact_code.py +177 -0
resolvekit/matchers/fts_matcher.py +106 -0
resolvekit/matchers/fuzzy_matcher.py +142 -0
resolvekit/matchers/priorities.py +174 -0
resolvekit/matchers/protocols.py +75 -0
resolvekit/normalization/README.md +192 -0
resolvekit/normalization/__init__.py +8 -0
resolvekit/normalization/normalizer.py +164 -0
resolvekit/overlays/README.md +226 -0
resolvekit/overlays/__init__.py +0 -0
resolvekit/types.py +534 -0
resolvekit/utils/README.md +188 -0
resolvekit/utils/__init__.py +48 -0
resolvekit/utils/cache.py +109 -0
resolvekit/utils/dates.py +339 -0
resolvekit/utils/errors.py +145 -0
resolvekit/utils/files.py +366 -0
resolvekit/utils/logging.py +219 -0
resolvekit/utils/text.py +475 -0
resolvekit/utils/validation.py +301 -0
resolvekit-0.0.1.dist-info/METADATA +36 -0
resolvekit-0.0.1.dist-info/RECORD +70 -0
resolvekit-0.0.1.dist-info/WHEEL +4 -0
resolvekit-0.0.1.dist-info/entry_points.txt +3 -0

resolvekit/normalization/README.md ADDED

@@ -0,0 +1,192 @@
+# Normalization Module
+## Purpose
+The normalization module handles text preprocessing to ensure consistent string matching across different input formats and languages. It provides three preset levels optimized for different use cases.
+## Implementation Status
+✅ **Implemented** - Phase A complete with 94% test coverage
+## Features
+- **Three Preset Levels**: strict, standard, aggressive
+- **Performance-First**: LRU caching for repeated queries (10,000 entry cache)
+- **Unicode Handling**: NFKD normalization for consistent representation
+- **Diacritic Removal**: Configurable diacritical mark handling
+- **Batch Operations**: Efficient processing of multiple texts
+- **Graceful Error Handling**: Handles edge cases (empty strings, None, etc.)
+## Quick Start
+```python
+from resolvekit.normalization import TextNormalizer
+# Create normalizer (stateless, reusable)
+normalizer = TextNormalizer()
+# Basic usage (default: standard level)
+result = normalizer.normalize("Côte d'Ivoire")
+print(result)  # Output: "cote d'ivoire"
+# Batch processing
+countries = ["France", "Côte d'Ivoire", "Germany"]
+normalized = normalizer.normalize_batch(countries)
+print(normalized)  # ['france', "cote d'ivoire", 'germany']
+```
+## Normalization Levels
+### Strict
+**Use for**: Exact matching where case and diacritics matter (e.g., ISO codes)
+```python
+normalizer.normalize("USA", level="strict")  # "USA"
+normalizer.normalize("Côte d'Ivoire", level="strict")  # "Côte d'Ivoire" (decomposed unicode)
+```
+- ✅ Unicode normalization (NFKD)
+- ✅ Whitespace normalization
+- ❌ Diacritic removal
+- ❌ Case folding
+- ❌ Punctuation removal
+### Standard (Default)
+**Use for**: General entity resolution and fuzzy matching
+```python
+normalizer.normalize("Côte d'Ivoire")  # "cote d'ivoire"
+normalizer.normalize("U.S.A.")  # "u.s.a."
+```
+- ✅ Unicode normalization (NFKD)
+- ✅ Diacritic removal
+- ✅ Case folding
+- ✅ Whitespace normalization
+- ❌ Punctuation removal
+### Aggressive
+**Use for**: Noisy data, user input with typos, low-quality sources
+```python
+normalizer.normalize("U.S.A.", level="aggressive")  # "usa"
+normalizer.normalize("Côte-d'Ivoire!!!", level="aggressive")  # "cotedivoire" (the hyphen is deleted, not replaced by a space)
+```
+- ✅ Unicode normalization (NFKD)
+- ✅ Diacritic removal
+- ✅ Case folding
+- ✅ Whitespace normalization
+- ✅ Punctuation removal
+## API Reference
+### TextNormalizer
+```python
+class TextNormalizer:
+    def normalize(self, text: str, level: str = "standard") -> str:
+        """Normalize a single text with LRU caching."""
+    def normalize_batch(self, texts: list[str], level: str = "standard") -> list[str]:
+        """Normalize multiple texts efficiently."""
+    def get_level_config(self, level: str) -> NormalizationLevel:
+        """Get configuration for a level (for introspection)."""
+```
+### Examples
+```python
+from resolvekit.normalization import TextNormalizer
+normalizer = TextNormalizer()
+# Different levels
+normalizer.normalize("Türkiye", level="strict")      # "Türkiye"
+normalizer.normalize("Türkiye", level="standard")    # "turkiye"
+normalizer.normalize("Türkiye", level="aggressive")  # "turkiye"
+# Batch processing
+texts = ["France", "Germany", "Italy"]
+normalized = normalizer.normalize_batch(texts)  # ["france", "germany", "italy"]
+# Edge cases (handled gracefully)
+normalizer.normalize("")         # ""
+normalizer.normalize("   ")      # ""
+normalizer.normalize(None)       # ""
+# Introspection
+config = normalizer.get_level_config("standard")
+print(config.remove_diacritics)  # True
+```
+## Performance
+The normalization module is optimized for high-throughput entity resolution:
+- **Caching**: 10,000-entry LRU cache for repeated queries
+- **Speed**: < 0.001ms per cached normalization
+- **Batch**: Efficient config reuse for multiple texts
+- **Coverage**: 94% test coverage with performance benchmarks
+### Performance Examples
+```python
+# Cached calls are 10-100x faster
+normalizer.normalize("France")  # First call: ~0.01ms
+normalizer.normalize("France")  # Cached: ~0.0001ms
+# Batch processing is efficient
+texts = ["France"] * 1000
+normalizer.normalize_batch(texts)  # Config reused for all texts
+```
+## Design Principles
+1. **Thin Orchestrator**: Wraps existing `utils.text` functions
+2. **Preset Levels**: Simple API with sensible defaults
+3. **Performance-First**: LRU caching from day one
+4. **Graceful Degradation**: Handles edge cases without crashing
+5. **Testable**: 19 comprehensive tests with 94% coverage
+## Integration
+The normalizer integrates with the matcher cascade:
+| Matcher | Level | Rationale |
+|---------|-------|-----------|
+| ExactCodeMatcher | strict | Codes are case-sensitive |
+| CanonicalNameMatcher | standard | Canonical names need normalization |
+| AliasExactMatcher | standard | Aliases need diacritic/case handling |
+| FuzzyMatchers | standard/aggressive | Depends on data quality |
+## Future Enhancements
+- **Transliteration**: Cyrillic→Latin, Arabic→Latin (deferred to Phase B+)
+- **Custom levels**: User-defined normalization configs
+- **Language hints**: Language-aware normalization rules
+- **Normalization metadata**: Track applied transformations for debugging
+## Testing
+Run normalization tests:
+```bash
+# All normalization tests
+uv run pytest tests/test_normalization.py -v
+# Performance tests only
+uv run pytest tests/test_normalization.py::TestPerformance -v
+# With coverage
+uv run pytest tests/test_normalization.py --cov=src/resolvekit/normalization
+```
+## Files
+```
+src/resolvekit/normalization/
+├── __init__.py          # Exports TextNormalizer, NormalizationLevel
+└── normalizer.py        # Implementation (40 SLOC, 94% coverage)
+```
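
Reviewer sketch (not part of the package): the level tables in the README above can be approximated with the standard library alone. The helper names follow the `utils.text` functions that the README says the normalizer wraps, but their bodies here are assumptions, because `resolvekit/utils/text.py` is not shown in this part of the diff. In particular, `str.casefold()` and the whitespace join are guesses at what the real `case_fold` and `normalize_whitespace` do.

```python
import re
import unicodedata

# Hypothetical stand-ins for the resolvekit.utils.text helpers.
def normalize_unicode(text: str) -> str:
    # NFKD splits accented letters into a base letter plus combining marks.
    return unicodedata.normalize("NFKD", text)

def remove_diacritics(text: str) -> str:
    # Drop the combining marks that NFKD separated out.
    return "".join(ch for ch in text if not unicodedata.combining(ch))

def normalize_whitespace(text: str) -> str:
    # Trim the ends and collapse internal runs of whitespace.
    return " ".join(text.split())

def remove_punctuation(text: str) -> str:
    # Same pattern as normalizer.py: keep word characters and whitespace only.
    return re.sub(r"[^\w\s]", "", text)

# level -> (remove diacritics, case fold, remove punctuation)
LEVELS = {
    "strict": (False, False, False),
    "standard": (True, True, False),
    "aggressive": (True, True, True),
}

def normalize(text, level="standard"):
    # Empty, whitespace-only and None inputs all map to "".
    if not text or not text.strip():
        return ""
    strip_marks, fold, strip_punct = LEVELS[level]
    text = normalize_unicode(text)
    if strip_marks:
        text = remove_diacritics(text)
    if fold:
        text = text.casefold()
    text = normalize_whitespace(text)
    if strip_punct:
        text = remove_punctuation(text)
    return text

if __name__ == "__main__":
    print(normalize("Côte d'Ivoire"))                   # cote d'ivoire
    print(normalize("U.S.A.", "aggressive"))            # usa
    print(normalize("Côte-d'Ivoire!!!", "aggressive"))  # cotedivoire
```

Under these assumptions, the strict level still changes the byte representation of accented input (it returns the decomposed form), and the aggressive level joins hyphenated words because punctuation is deleted rather than replaced with a space.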

resolvekit/normalization/__init__.py ADDED

@@ -0,0 +1,8 @@
+"""Normalization module for text preprocessing."""
+from resolvekit.normalization.normalizer import NormalizationLevel, TextNormalizer
+__all__ = [
+    "NormalizationLevel",
+    "TextNormalizer",
+]

resolvekit/normalization/normalizer.py ADDED

@@ -0,0 +1,164 @@
+"""Text normalization module for resolvekit."""
+import re
+from dataclasses import dataclass
+from typing import ClassVar
+from resolvekit.utils.cache import lru_cache
+from resolvekit.utils.text import (
+    case_fold,
+    normalize_unicode,
+    normalize_whitespace,
+    remove_diacritics,
+)
+@dataclass(frozen=True)
+class NormalizationLevel:
+    """Configuration for a normalization level."""
+    name: str
+    unicode_norm: bool
+    remove_diacritics: bool
+    case_fold: bool
+    normalize_whitespace: bool
+    remove_punctuation: bool
+class TextNormalizer:
+    """Text normalizer with preset levels and LRU caching."""
+    LEVELS: ClassVar[dict[str, NormalizationLevel]] = {
+        "strict": NormalizationLevel(
+            name="strict",
+            unicode_norm=True,
+            remove_diacritics=False,
+            case_fold=False,
+            normalize_whitespace=True,
+            remove_punctuation=False,
+        ),
+        "standard": NormalizationLevel(
+            name="standard",
+            unicode_norm=True,
+            remove_diacritics=True,
+            case_fold=True,
+            normalize_whitespace=True,
+            remove_punctuation=False,
+        ),
+        "aggressive": NormalizationLevel(
+            name="aggressive",
+            unicode_norm=True,
+            remove_diacritics=True,
+            case_fold=True,
+            normalize_whitespace=True,
+            remove_punctuation=True,
+        ),
+    }
+    @lru_cache(maxsize=10000)
+    def normalize(self, text: str, level: str = "standard") -> str:
+        """
+        Normalize text using preset level.
+        Args:
+            text: Text to normalize
+            level: Normalization level (strict, standard, aggressive)
+        Returns:
+            Normalized text
+        Examples:
+            >>> normalizer = TextNormalizer()
+            >>> normalizer.normalize("Côte d'Ivoire")
+            "cote d'ivoire"
+        """
+        # Handle empty/whitespace-only strings
+        if not text or not text.strip():
+            return ""
+        # Get level config
+        level_config = self.LEVELS[level]
+        # Apply normalizations
+        return self._apply_normalizations(text, level_config)
+    def _apply_normalizations(self, text: str, config: NormalizationLevel) -> str:
+        """
+        Apply normalizations based on configuration.
+        Args:
+            text: Text to normalize
+            config: Normalization level configuration
+        Returns:
+            Normalized text
+        """
+        # Unicode normalization
+        if config.unicode_norm:
+            text = normalize_unicode(text)
+        # Remove diacritics
+        if config.remove_diacritics:
+            text = remove_diacritics(text)
+        # Case folding
+        if config.case_fold:
+            text = case_fold(text)
+        # Normalize whitespace
+        if config.normalize_whitespace:
+            text = normalize_whitespace(text)
+        # Remove punctuation
+        if config.remove_punctuation:
+            text = self._remove_punctuation(text)
+        return text
+    @staticmethod
+    def _remove_punctuation(text: str) -> str:
+        """
+        Remove punctuation from text.
+        Args:
+            text: Input text
+        Returns:
+            Text without punctuation
+        """
+        # Remove every character that is not a word character (letter, digit, underscore) or whitespace
+        return re.sub(r"[^\w\s]", "", text)
+    def normalize_batch(self, texts: list[str], level: str = "standard") -> list[str]:
+        """
+        Normalize multiple texts efficiently.
+        Args:
+            texts: List of texts to normalize
+            level: Normalization level (strict, standard, aggressive)
+        Returns:
+            List of normalized texts
+        Examples:
+            >>> normalizer = TextNormalizer()
+            >>> normalizer.normalize_batch(["France", "Germany"])
+            ['france', 'germany']
+        """
+        # Delegate to normalize() to reuse guard logic and benefit from cache
+        return [self.normalize(text, level) for text in texts]
+    def get_level_config(self, level: str) -> NormalizationLevel:
+        """
+        Get configuration for a normalization level.
+        Args:
+            level: Normalization level name
+        Returns:
+            NormalizationLevel configuration
+        Raises:
+            KeyError: If level name is invalid
+        """
+        return self.LEVELS[level]
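
Reviewer sketch (not part of the package): the `@lru_cache(maxsize=10000)` decorator above is imported from `resolvekit.utils.cache`, which this part of the diff does not show. As a rough illustration of the caching behaviour the README describes, here is the same idea with the standard library's `functools.lru_cache`. The function `cached_normalize` and its placeholder body are hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_normalize(text: str, level: str = "standard") -> str:
    """Hypothetical stand-in for TextNormalizer.normalize()."""
    # Placeholder transformation, not the package's real pipeline.
    return " ".join(text.split()).casefold()

def demo() -> None:
    cached_normalize.cache_clear()
    cached_normalize("France")              # first call: computed (miss)
    cached_normalize("France")              # same arguments: served from cache (hit)
    print(cached_normalize.cache_info())
    # functools keys on the arguments as passed, so spelling out the
    # default level creates a second entry instead of reusing the first.
    cached_normalize("France", "standard")
    print(cached_normalize.cache_info())

if __name__ == "__main__":
    demo()
```

If the package's decorator keys entries the way `functools.lru_cache` does, a call that omits `level` and a call that passes `"standard"` positionally (as `normalize_batch` does) occupy separate cache entries. That is an assumption about `resolvekit.utils.cache`, not something this diff confirms.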

resolvekit/overlays/README.md ADDED

@@ -0,0 +1,226 @@
+# Overlays Module
+## Purpose
+The overlays module manages custom data packs that extend or override the base data pack. Overlays enable users to install domain-specific packs (e.g., humanitarian-orgs-pack, schools-pack, corporate-entities-pack) or create their own custom aliases and entities.
+## Components
+### Core Components
+1. **Overlay Manager** (`overlay_manager.py`)
+   - Load and attach overlay databases
+   - Manage precedence rules
+   - Handle overlay conflicts
+2. **Overlay Writer** (`overlay_writer.py`)
+   - Create new overlay packs
+   - Add/update aliases, entities, memberships
+   - Build FTS indexes for overlays
+3. **Overlay Validator** (`validator.py`)
+   - Validate overlay schemas
+   - Check for conflicts with base pack
+   - Verify referential integrity
+4. **Overlay Merger** (`merger.py`)
+   - Merge queries across base + overlays
+   - Apply precedence rules
+   - Deduplicate results
+### Data Models
+- `overlay_config.py`: Overlay configuration and metadata
+- `precedence.py`: Precedence rule engine
+## Overlay Structure
+Each overlay is a SQLite database with the same schema as the base pack:
+```
+my-overlay.sqlite
+├── entities (custom entities only)
+├── aliases (custom aliases + overrides)
+├── aliases_fts (FTS index)
+├── codes (custom codes)
+├── memberships (custom memberships)
+└── manifest (overlay metadata)
+```
+## Overlay Manifest
+```json
+{
+  "pack_name": "humanitarian-orgs.overlay",
+  "version": "0.3.0",
+  "base_compat": ">=1.2.0",
+  "components": {
+    "overlay_sqlite": "humanitarian-orgs.overlay.sqlite",
+    "fts_built": true,
+    "calibration_delta": null
+  },
+  "description": "Humanitarian organization names and aliases",
+  "author": "OCHA",
+  "created": "2025-10-20T10:00:00Z"
+}
+```
+**Note:** Precedence is not stored in the manifest. It's determined by the order in which overlays are loaded (earlier in list = higher precedence).
+## Precedence Rules
+Precedence is determined by the order in the overlays list:
+- **First overlay in list**: Highest precedence (100)
+- **Second overlay**: Precedence 99
+- **Third overlay**: Precedence 98
+- **...and so on** (max 5 overlays)
+- **Base pack**: Precedence 0 (lowest)
+When the same alias maps to different entities:
+1. Use the highest precedence mapping (earliest in the list)
+2. Log warning about conflict
+3. Optional: Surface conflict to user for resolution
+**Example:**
+```python
+resolver = Resolver(overlays=["personal.sqlite", "humanitarian.sqlite", "schools.sqlite"])
+# personal.sqlite: precedence 100 (highest)
+# humanitarian.sqlite: precedence 99
+# schools.sqlite: precedence 98
+# base pack: precedence 0 (lowest)
+```
+## Creating Overlays
+### From CSV
+```python
+from resolvekit.overlays import OverlayWriter
+# Create overlay from CSV
+writer = OverlayWriter("my-overlay")
+# Add aliases
+writer.add_aliases_from_csv("custom_aliases.csv")
+# CSV columns: dcid, alias_text, alias_type, language
+# Add custom entities
+writer.add_entities_from_csv("custom_entities.csv")
+# CSV columns: dcid, canonical_name, entity_type, parent_dcid
+# Add memberships
+writer.add_memberships_from_csv("memberships.csv")
+# CSV columns: entity_dcid, group_dcid, valid_from, valid_until
+# Build and save
+writer.build("my-overlay.sqlite")
+```
+### From YAML
+```yaml
+# custom_data.yaml
+entities:
+  - dcid: "custom/MyOrg"
+    name: "My Organization"
+    type: "organization"
+    aliases:
+      - "MyOrg"
+      - "The Organization"
+    codes:
+      internal_id: "ORG-001"
+  - dcid: "country/COD"  # Extend existing entity
+    custom_aliases:
+      - "Congo-Kinshasa"
+      - "DROC"  # Internal shorthand
+groups:
+  - dcid: "custom/OurPartners"
+    name: "Our Partner Countries"
+    members:
+      - country/KEN
+      - country/TZA
+      - country/UGA
+```
+### Via API
+```python
+from resolvekit.overlays import create_overlay
+overlay = create_overlay(
+    name="my-custom-overlay",
+    base_pack_version="1.2.0"
+)
+# Add custom alias
+overlay.add_alias(
+    entity_dcid="country/COD",
+    alias_text="Congo-Kinshasa",
+    alias_type="exonym",
+    language="en"
+)
+# Add custom entity
+overlay.add_entity(
+    dcid="custom/MyEntity",
+    canonical_name="My Custom Entity",
+    entity_type="organization"
+)
+# Save overlay
+overlay.save("my-custom.overlay.sqlite")
+```
+**Note:** Precedence is determined when loading overlays into the Resolver, not when creating them.
+## Loading Overlays
+```python
+from resolvekit.api import Resolver
+# Load resolver with overlays (order = precedence)
+resolver = Resolver(
+    overlays=[
+        "path/to/personal-customizations.sqlite",  # Highest precedence (100)
+        "path/to/humanitarian-orgs.sqlite",        # Precedence 99
+        "path/to/schools-pack.sqlite"              # Precedence 98
+    ]
+)
+# Overlays are automatically applied to all queries
+result = resolver.resolve("DROC")  # Matches custom alias from highest-precedence overlay
+# Maximum 5 overlays supported
+# Order in the list determines precedence (first = highest)
+```
+## Conflict Resolution
+When conflicts are detected:
+```python
+# Log warning
+logger.warning(
+    "Alias 'Georgia' maps to both country/GEO (precedence: 0) "
+    "and geoId/13 (precedence: 100). Using geoId/13."
+)
+# Optionally return conflict info
+result = resolver.resolve("Georgia", return_conflicts=True)
+if result.conflicts:
+    print(f"Conflicts: {result.conflicts}")
+```
+## Design Principles
+1. **Non-destructive**: Overlays don't modify base pack
+2. **Composable**: Multiple overlays can coexist
+3. **Validated**: Schema validation on overlay creation
+4. **Transparent**: Clear precedence rules, conflict reporting
+## Implementation Priority
+**Phase C** - Overlay system and builders
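
Reviewer sketch (not part of the package): in this version the overlays package ships only an empty `__init__.py`, so the precedence rules in the README above are a design description rather than shipped code. The sketch below shows one way those rules could work, with plain dictionaries standing in for the SQLite packs. The function `resolve_alias`, the constants, and the file name `us-states.sqlite` are all hypothetical.

```python
from __future__ import annotations

MAX_OVERLAYS = 5       # README: maximum 5 overlays
TOP_PRECEDENCE = 100   # README: first overlay in the list
BASE_PRECEDENCE = 0    # README: base pack is always lowest

def resolve_alias(
    alias: str,
    base: dict[str, str],
    overlays: list[tuple[str, dict[str, str]]],
) -> tuple[str | None, list[tuple[int, str, str]]]:
    """Return (winning dcid, conflicting candidates) for an alias.

    `overlays` is a list of (pack name, alias -> dcid mapping) in load order.
    """
    if len(overlays) > MAX_OVERLAYS:
        raise ValueError(f"at most {MAX_OVERLAYS} overlays are supported")

    # Collect every pack that knows the alias, tagged with its precedence.
    candidates: list[tuple[int, str, str]] = []
    for position, (name, mapping) in enumerate(overlays):
        if alias in mapping:
            candidates.append((TOP_PRECEDENCE - position, name, mapping[alias]))
    if alias in base:
        candidates.append((BASE_PRECEDENCE, "base", base[alias]))

    if not candidates:
        return None, []

    # Highest precedence wins; anything mapping elsewhere is a conflict.
    winner = max(candidates, key=lambda c: c[0])
    conflicts = [c for c in candidates if c[2] != winner[2]]
    return winner[2], conflicts

if __name__ == "__main__":
    base = {"Georgia": "country/GEO"}
    overlays = [
        ("personal.sqlite", {"DROC": "country/COD"}),
        ("us-states.sqlite", {"Georgia": "geoId/13"}),
    ]
    print(resolve_alias("Georgia", base, overlays))
    # ('geoId/13', [(0, 'base', 'country/GEO')])
```

The returned conflict list corresponds to what the README proposes to log as a warning or expose through `return_conflicts=True`.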

resolvekit/overlays/__init__.py ADDED

File without changes