PyPI - locationformatter - Versions diffs - 0.1.0__tar.gz - Mend

locationformatter 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (11)

locationformatter-0.1.0/.gitignore +12 -0
locationformatter-0.1.0/PKG-INFO +249 -0
locationformatter-0.1.0/README.md +229 -0
locationformatter-0.1.0/README_huggingface.md +347 -0
locationformatter-0.1.0/UC0.0_Location_formatting_model_training.ipynb +11754 -0
locationformatter-0.1.0/processed_locations.jsonl +7731 -0
locationformatter-0.1.0/processed_locations_final.jsonl +7729 -0
locationformatter-0.1.0/pyproject.toml +37 -0
locationformatter-0.1.0/src/locationformatter/__init__.py +23 -0
locationformatter-0.1.0/src/locationformatter/locationformatter.py +444 -0
locationformatter-0.1.0/tests/test_formatter.py +205 -0

locationformatter-0.1.0/.gitignore ADDED

@@ -0,0 +1,12 @@
+*.pyc
+__pycache__/
+*.egg-info/
+/dist
+/build
+.pytest_cache/
+*.pth
+development.ipynb
+integration.ipynb
+/library
+/development
+*.toml

locationformatter-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,249 @@
+Metadata-Version: 2.4
+Name: locationformatter
+Version: 0.1.0
+Summary: A dual-head NER-based parser for location strings
+Project-URL: Homepage, https://github.com/semantic-ai/decide-location-formatter
+Project-URL: Issues, https://github.com/semantic-ai/decide-location-formatter/issues
+Author-email: Author <stefaan.vercoutere@sirus.be>
+License: MIT
+Classifier: Intended Audience :: Developers
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Classifier: Topic :: Text Processing :: Linguistic
+Requires-Python: >=3.9
+Requires-Dist: pytorch-crf>=0.7.2
+Requires-Dist: torch>=2.0
+Requires-Dist: transformers>=4.35
+Provides-Extra: dev
+Requires-Dist: pytest>=7.0.0; extra == 'dev'
+Description-Content-Type: text/markdown
+# decide-location-formatter
+A Python package for parsing and structuring location strings into their individual address components. Built around a dual-head NER model ([`svercoutere/abb-dual-location-component-ner`](https://huggingface.co/svercoutere/abb-dual-location-component-ner)) fine-tuned on top of **XLM-RoBERTa base**.
+## How it works
+Raw location strings like `"Scaldisstraat 23-25, 2000 Antwerpen"` or `"Cafe den Draak, Lovegemlaan 7, 9000 Gent"` are common in municipal decision text, but they are inconsistently formatted and often contain multiple distinct locations in a single string.
+The pipeline has three steps:
+1. **Text cleaning** — normalises whitespace, unicode, and newlines.
+2. **Dual-head NER inference** — the model runs two independent CRF-decoded classification heads over every token simultaneously:
+   - **Component head** — tags each token as one of 12 address component types (street, city, postcode, …).
+   - **Location head** — groups tokens that belong to the same physical location into `B-LOCATION` / `I-LOCATION` spans, allowing multi-location strings to be split.
+3. **Post-processing** — component spans are nested inside their parent location spans, housenumber ranges/sequences (e.g. `23-25`, `7 en 9`) are expanded into individual entries, and bus numbers are split into a separate field.
+### Architecture
+| Component | Detail |
+|---|---|
+| Base encoder | `xlm-roberta-base` (12 layers, 768 hidden) |
+| Component head | Linear(768 → 256) · GELU · Dropout(0.1) · Linear(256 → 25) + CRF |
+| Location head | Linear(768 → 256) · GELU · Dropout(0.1) · Linear(256 → 3) + CRF |
+| Tokenisation | Word-level regex tokeniser; sub-word alignments via fast tokenizer `word_ids()` |
+| Max input length | 256 sub-word tokens |
+### Entity types (component head)
+| Label | Description |
+|---|---|
+| `STREET` | Street name (no house number) |
+| `ROAD` | Road or route name |
+| `HOUSENUMBER` | House/building number(s), ranges or sequences |
+| `POSTCODE` | Postal or ZIP code |
+| `CITY` | City or municipality name |
+| `PROVINCE` | Province or region name |
+| `BUILDING` | Named building, site or facility |
+| `INTERSECTION` | Crossing or intersection of roads |
+| `PARCEL` | Land parcel, section or lot number |
+| `DISTRICT` | District, neighbourhood or borough |
+| `GRAVE_LOCATION` | Plot/row/number within a cemetery |
+| `DOMAIN_ZONE_AREA` | Domain, zone or area name |
+## Evaluation
+Evaluated on a held-out 10 % split of ~10 000 Belgian municipal decision location strings.
+| Metric | Score |
+|---|---|
+| **Combined F1** | **0.9435** |
+| Component F1 | 0.9295 |
+| Location F1 | 0.9576 |
+## Installation
+### From source (recommended during development)
+```bash
+git clone https://github.com/semantic-ai/decide-location-formatter.git
+cd decide-location-formatter
+pip install -e .
+```
+### Dependencies only
+```bash
+pip install "torch>=2.0" "transformers>=4.35" "pytorch-crf>=0.7.2"
+```
+> The model weights (~1 GB) are downloaded automatically from the Hugging Face Hub on first use.
+## Usage
+### Quick start
+```python
+from locationformatter import LocationFormatter
+lf = LocationFormatter()   # loads model once; reuse for many calls
+result = lf.parse("Scaldisstraat 23-25, 2000 Antwerpen")
+print(result)
+```
+```json
+{
+  "original": "Scaldisstraat 23-25, 2000 Antwerpen",
+  "locations": [
+    {
+      "location": "Scaldisstraat 23-25, 2000 Antwerpen",
+      "street": "Scaldisstraat",
+      "housenumber": "23",
+      "housenumber_type": "single",
+      "postcode": "2000",
+      "city": "Antwerpen"
+    },
+    {
+      "location": "Scaldisstraat 23-25, 2000 Antwerpen",
+      "street": "Scaldisstraat",
+      "housenumber": "25",
+      "housenumber_type": "single",
+      "postcode": "2000",
+      "city": "Antwerpen"
+    }
+  ]
+}
+```
+### Multi-location strings
+Strings that contain several distinct locations are automatically split:
+```python
+result = lf.parse("Lovegemlaan 7, 9000 Gent en Dorpstraat 12, 9240 Zele")
+for loc in result["locations"]:
+    print(loc)
+```
+### Raw prediction (no housenumber expansion)
+`predict()` returns spans straight from the model without expanding ranges or splitting bus numbers:
+```python
+raw = lf.predict("Heikeesstraat 2-4, 9240 Zele")
+# raw["locations"][0]["housenumber"] == "2-4"
+# raw["locations"][0]["housenumber_type"] == "range"
+```
+### One-shot helper
+For a single call without keeping the model in memory:
+```python
+from locationformatter import parse_location
+result = parse_location("Grote Markt 1, 2000 Antwerpen")
+```
+> **Note:** `parse_location` reloads the model on every call. Use `LocationFormatter` for repeated parsing.
+### Custom model or device
+```python
+lf = LocationFormatter(
+    repo="your-org/your-model",   # any compatible HF Hub repo
+    device="cuda",                # "cpu" or "cuda"; auto-detected when omitted
+)
+```
+## API reference
+### `LocationFormatter`
+```python
+class LocationFormatter:
+    def __init__(self, repo: str = "svercoutere/abb-dual-location-component-ner",
+                 device: str | None = None): ...
+    def parse(self, text: str) -> dict: ...
+    # Full pipeline: clean → NER → expand housenumbers.
+    # Returns {"original": str, "locations": list[dict]}
+    def predict(self, text: str) -> dict: ...
+    # NER only, no housenumber expansion.
+    # Returns {"original": str, "locations": list[dict]}
+```
+### Helper functions
+```python
+from locationformatter import clean_string, clean_house_number, extract_house_and_bus_number
+clean_string("  Grote   Markt\n1  ")
+# → "Grote Markt 1"
+clean_house_number("3 t.e.m. 7")
+# → ["3", "4", "5", "6", "7"]
+clean_house_number("10-14")
+# → ["10", "11", "12", "13", "14"]
+extract_house_and_bus_number("5 bus 3")
+# → {"housenumber": "5", "bus": "3"}
+```
+### Output schema
+Each entry in the `locations` list is a flat dict. Only fields detected by the model are included.
+| Field | Type | Description |
+|---|---|---|
+| `location` | `str` | The substring corresponding to this location |
+| `street` | `str` | Street name |
+| `road` | `str` | Road/route name |
+| `housenumber` | `str` | Individual house number (after expansion) |
+| `housenumber_type` | `str` | `"single"`, `"range"`, or `"sequence"` |
+| `bus` | `str` | Bus/apartment number (when present) |
+| `postcode` | `str` | Postal code |
+| `city` | `str` | City or municipality |
+| `province` | `str` | Province |
+| `building` | `str` | Named building or facility |
+| `intersection` | `str` | Road intersection |
+| `parcel` | `str` | Land parcel identifier |
+| `district` | `str` | District or neighbourhood |
+| `grave_location` | `str` | Cemetery plot/row/number |
+| `domain_zone_area` | `str` | Zone or area name |
+## Development
+### Running tests
+```bash
+pytest tests/
+```
+The unit tests for the helper functions (`clean_string`, `clean_house_number`, `extract_house_and_bus_number`) do not require the model to be loaded, so they run offline.
+### Project structure
+```
+src/locationformatter/
+├── __init__.py            # Public exports
+└── locationformatter.py   # Model architecture, inference, helpers
+tests/
+└── test_formatter.py      # Unit tests
+development/
+├── development.ipynb      # Training & experimentation notebook
+└── integration.ipynb      # Integration testing notebook
+```
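
Reviewer note on the post-processing step described in the PKG-INFO above: the housenumber expansion can be illustrated with a minimal sketch. `expand_house_numbers` and its regular expressions are hypothetical, invented for illustration; they are not taken from `locationformatter.py`, whose contents are not shown in this view. The sketch expands every integer in a range, which matches the documented `clean_house_number("10-14")` result. The quick-start output lists only 23 and 25 for `23-25`, so the packaged pipeline may behave differently from this sketch.

```python
import re

# Hypothetical patterns, not the package's own: a numeric range written as
# "10-14", "3 t.e.m. 7" or "3 tot en met 7", and separators for sequences
# such as "7 en 9" or "7, 9".
_RANGE = re.compile(r"^\s*(\d+)\s*(?:-|t\.e\.m\.|tot en met)\s*(\d+)\s*$", re.IGNORECASE)
_SEQUENCE_SEPARATOR = re.compile(r"\s*(?:,|\ben\b|&)\s*", re.IGNORECASE)


def expand_house_numbers(raw: str) -> list[str]:
    """Expand a housenumber string into individual numbers (illustrative only)."""
    match = _RANGE.match(raw)
    if match:
        start, end = int(match.group(1)), int(match.group(2))
        if start <= end:
            # Assumption: every integer between the endpoints is included.
            return [str(number) for number in range(start, end + 1)]
    # Not an ascending range: split on sequence separators and drop empty pieces.
    return [part for part in _SEQUENCE_SEPARATOR.split(raw.strip()) if part]


print(expand_house_numbers("10-14"))
print(expand_house_numbers("7 en 9"))
```

Checking for a range before splitting on separators keeps `t.e.m.` and `tot en met` from being mistaken for the sequence word `en`.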

locationformatter-0.1.0/README.md ADDED

@@ -0,0 +1,229 @@
+# decide-location-formatter
+A Python package for parsing and structuring location strings into their individual address components. Built around a dual-head NER model ([`svercoutere/abb-dual-location-component-ner`](https://huggingface.co/svercoutere/abb-dual-location-component-ner)) fine-tuned on top of **XLM-RoBERTa base**.
+## How it works
+Raw location strings like `"Scaldisstraat 23-25, 2000 Antwerpen"` or `"Cafe den Draak, Lovegemlaan 7, 9000 Gent"` are common in municipal decision text, but they are inconsistently formatted and often contain multiple distinct locations in a single string.
+The pipeline has three steps:
+1. **Text cleaning** — normalises whitespace, unicode, and newlines.
+2. **Dual-head NER inference** — the model runs two independent CRF-decoded classification heads over every token simultaneously:
+   - **Component head** — tags each token as one of 12 address component types (street, city, postcode, …).
+   - **Location head** — groups tokens that belong to the same physical location into `B-LOCATION` / `I-LOCATION` spans, allowing multi-location strings to be split.
+3. **Post-processing** — component spans are nested inside their parent location spans, housenumber ranges/sequences (e.g. `23-25`, `7 en 9`) are expanded into individual entries, and bus numbers are split into a separate field.
+### Architecture
+| Component | Detail |
+|---|---|
+| Base encoder | `xlm-roberta-base` (12 layers, 768 hidden) |
+| Component head | Linear(768 → 256) · GELU · Dropout(0.1) · Linear(256 → 25) + CRF |
+| Location head | Linear(768 → 256) · GELU · Dropout(0.1) · Linear(256 → 3) + CRF |
+| Tokenisation | Word-level regex tokeniser; sub-word alignments via fast tokenizer `word_ids()` |
+| Max input length | 256 sub-word tokens |
+### Entity types (component head)
+| Label | Description |
+|---|---|
+| `STREET` | Street name (no house number) |
+| `ROAD` | Road or route name |
+| `HOUSENUMBER` | House/building number(s), ranges or sequences |
+| `POSTCODE` | Postal or ZIP code |
+| `CITY` | City or municipality name |
+| `PROVINCE` | Province or region name |
+| `BUILDING` | Named building, site or facility |
+| `INTERSECTION` | Crossing or intersection of roads |
+| `PARCEL` | Land parcel, section or lot number |
+| `DISTRICT` | District, neighbourhood or borough |
+| `GRAVE_LOCATION` | Plot/row/number within a cemetery |
+| `DOMAIN_ZONE_AREA` | Domain, zone or area name |
+## Evaluation
+Evaluated on a held-out 10 % split of ~10 000 Belgian municipal decision location strings.
+| Metric | Score |
+|---|---|
+| **Combined F1** | **0.9435** |
+| Component F1 | 0.9295 |
+| Location F1 | 0.9576 |
+## Installation
+### From source (recommended during development)
+```bash
+git clone https://github.com/semantic-ai/decide-location-formatter.git
+cd decide-location-formatter
+pip install -e .
+```
+### Dependencies only
+```bash
+pip install "torch>=2.0" "transformers>=4.35" "pytorch-crf>=0.7.2"
+```
+> The model weights (~1 GB) are downloaded automatically from the Hugging Face Hub on first use.
+## Usage
+### Quick start
+```python
+from locationformatter import LocationFormatter
+lf = LocationFormatter()   # loads model once; reuse for many calls
+result = lf.parse("Scaldisstraat 23-25, 2000 Antwerpen")
+print(result)
+```
+```json
+{
+  "original": "Scaldisstraat 23-25, 2000 Antwerpen",
+  "locations": [
+    {
+      "location": "Scaldisstraat 23-25, 2000 Antwerpen",
+      "street": "Scaldisstraat",
+      "housenumber": "23",
+      "housenumber_type": "single",
+      "postcode": "2000",
+      "city": "Antwerpen"
+    },
+    {
+      "location": "Scaldisstraat 23-25, 2000 Antwerpen",
+      "street": "Scaldisstraat",
+      "housenumber": "25",
+      "housenumber_type": "single",
+      "postcode": "2000",
+      "city": "Antwerpen"
+    }
+  ]
+}
+```
+### Multi-location strings
+Strings that contain several distinct locations are automatically split:
+```python
+result = lf.parse("Lovegemlaan 7, 9000 Gent en Dorpstraat 12, 9240 Zele")
+for loc in result["locations"]:
+    print(loc)
+```
+### Raw prediction (no housenumber expansion)
+`predict()` returns spans straight from the model without expanding ranges or splitting bus numbers:
+```python
+raw = lf.predict("Heikeesstraat 2-4, 9240 Zele")
+# raw["locations"][0]["housenumber"] == "2-4"
+# raw["locations"][0]["housenumber_type"] == "range"
+```
+### One-shot helper
+For a single call without keeping the model in memory:
+```python
+from locationformatter import parse_location
+result = parse_location("Grote Markt 1, 2000 Antwerpen")
+```
+> **Note:** `parse_location` reloads the model on every call. Use `LocationFormatter` for repeated parsing.
+### Custom model or device
+```python
+lf = LocationFormatter(
+    repo="your-org/your-model",   # any compatible HF Hub repo
+    device="cuda",                # "cpu" or "cuda"; auto-detected when omitted
+)
+```
+## API reference
+### `LocationFormatter`
+```python
+class LocationFormatter:
+    def __init__(self, repo: str = "svercoutere/abb-dual-location-component-ner",
+                 device: str | None = None): ...
+    def parse(self, text: str) -> dict: ...
+    # Full pipeline: clean → NER → expand housenumbers.
+    # Returns {"original": str, "locations": list[dict]}
+    def predict(self, text: str) -> dict: ...
+    # NER only, no housenumber expansion.
+    # Returns {"original": str, "locations": list[dict]}
+```
+### Helper functions
+```python
+from locationformatter import clean_string, clean_house_number, extract_house_and_bus_number
+clean_string("  Grote   Markt\n1  ")
+# → "Grote Markt 1"
+clean_house_number("3 t.e.m. 7")
+# → ["3", "4", "5", "6", "7"]
+clean_house_number("10-14")
+# → ["10", "11", "12", "13", "14"]
+extract_house_and_bus_number("5 bus 3")
+# → {"housenumber": "5", "bus": "3"}
+```
+### Output schema
+Each entry in the `locations` list is a flat dict. Only fields detected by the model are included.
+| Field | Type | Description |
+|---|---|---|
+| `location` | `str` | The substring corresponding to this location |
+| `street` | `str` | Street name |
+| `road` | `str` | Road/route name |
+| `housenumber` | `str` | Individual house number (after expansion) |
+| `housenumber_type` | `str` | `"single"`, `"range"`, or `"sequence"` |
+| `bus` | `str` | Bus/apartment number (when present) |
+| `postcode` | `str` | Postal code |
+| `city` | `str` | City or municipality |
+| `province` | `str` | Province |
+| `building` | `str` | Named building or facility |
+| `intersection` | `str` | Road intersection |
+| `parcel` | `str` | Land parcel identifier |
+| `district` | `str` | District or neighbourhood |
+| `grave_location` | `str` | Cemetery plot/row/number |
+| `domain_zone_area` | `str` | Zone or area name |
+## Development
+### Running tests
+```bash
+pytest tests/
+```
+The unit tests for the helper functions (`clean_string`, `clean_house_number`, `extract_house_and_bus_number`) do not require the model to be loaded, so they run offline.
+### Project structure
+```
+src/locationformatter/
+├── __init__.py            # Public exports
+└── locationformatter.py   # Model architecture, inference, helpers
+tests/
+└── test_formatter.py      # Unit tests
+development/
+├── development.ipynb      # Training & experimentation notebook
+└── integration.ipynb      # Integration testing notebook
+```
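
Reviewer note on the architecture table in the README above: the 25 outputs of the component head are consistent with a BIO tagging scheme over the 12 entity types (2 × 12 + 1 = 25), mirroring the `B-LOCATION` / `I-LOCATION` / `O` set implied by the 3 outputs of the location head. The README does not state the component label scheme explicitly, so BIO is an assumption here, and `bio_labels` and `group_spans` are hypothetical names rather than functions from the package. A minimal sketch of building the label set and collapsing word-level tags into spans, using hand-written tags rather than real model output:

```python
# Illustrative sketch only; assumes BIO tagging, which the README does not
# confirm for the component head.
COMPONENT_TYPES = [
    "STREET", "ROAD", "HOUSENUMBER", "POSTCODE", "CITY", "PROVINCE",
    "BUILDING", "INTERSECTION", "PARCEL", "DISTRICT",
    "GRAVE_LOCATION", "DOMAIN_ZONE_AREA",
]


def bio_labels(types):
    """Return 'O' plus a B- and an I- label for every entity type."""
    return ["O"] + [f"{prefix}-{name}" for name in types for prefix in ("B", "I")]


def group_spans(words, tags):
    """Collapse word-level BIO tags into (label, text) spans."""
    spans = []
    label, buffer = None, []
    for word, tag in zip(words, tags):
        prefix, _, name = tag.partition("-")
        if prefix == "I" and name == label:
            # Continuation of the span that is currently open.
            buffer.append(word)
            continue
        if buffer:
            spans.append((label, " ".join(buffer)))
        # A B- tag (or a stray I- tag) opens a new span; "O" leaves none open.
        label, buffer = (name, [word]) if prefix in ("B", "I") else (None, [])
    if buffer:
        spans.append((label, " ".join(buffer)))
    return spans


# Hand-written tags for illustration, not output of the published model.
words = ["Scaldisstraat", "23-25", ",", "2000", "Antwerpen"]
tags = ["B-STREET", "B-HOUSENUMBER", "O", "B-POSTCODE", "B-CITY"]
print(len(bio_labels(COMPONENT_TYPES)))
print(group_spans(words, tags))
```

The same grouping applied to the location head's tags would yield the parent spans under which the component spans are nested.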