locationformatter 0.1.0__tar.gz

@@ -0,0 +1,12 @@
1
+ *.pyc
2
+ __pycache__/
3
+ *.egg-info/
4
+ /dist
5
+ /build
6
+ .pytest_cache/
7
+ *.pth
8
+ development.ipynb
9
+ integration.ipynb
10
+ /library
11
+ /development
12
+ *.toml
@@ -0,0 +1,249 @@
1
+ Metadata-Version: 2.4
2
+ Name: locationformatter
3
+ Version: 0.1.0
4
+ Summary: A dual-head NER-based parser for location strings
5
+ Project-URL: Homepage, https://github.com/semantic-ai/decide-location-formatter
6
+ Project-URL: Issues, https://github.com/semantic-ai/decide-location-formatter/issues
7
+ Author-email: Author <stefaan.vercoutere@sirus.be>
8
+ License: MIT
9
+ Classifier: Intended Audience :: Developers
10
+ Classifier: Operating System :: OS Independent
11
+ Classifier: Programming Language :: Python :: 3
12
+ Classifier: Topic :: Text Processing :: Linguistic
13
+ Requires-Python: >=3.9
14
+ Requires-Dist: pytorch-crf>=0.7.2
15
+ Requires-Dist: torch>=2.0
16
+ Requires-Dist: transformers>=4.35
17
+ Provides-Extra: dev
18
+ Requires-Dist: pytest>=7.0.0; extra == 'dev'
19
+ Description-Content-Type: text/markdown
20
+
21
+ # decide-location-formatter
22
+
23
+ A Python package for parsing and structuring location strings into their individual address components. Built around a dual-head NER model ([`svercoutere/abb-dual-location-component-ner`](https://huggingface.co/svercoutere/abb-dual-location-component-ner)) fine-tuned on top of **XLM-RoBERTa base**.
24
+
25
+ ## How it works
26
+
27
+ Raw location strings like `"Scaldisstraat 23-25, 2000 Antwerpen"` or `"Cafe den Draak, Lovegemlaan 7, 9000 Gent"` are common in municipal decision text but are inconsistently formatted and often contain multiple distinct locations in a single string.
28
+
29
+ The pipeline has three steps:
30
+
31
+ 1. **Text cleaning** — normalises whitespace, unicode, and newlines.
32
+ 2. **Dual-head NER inference** — the model runs two independent CRF-decoded classification heads over every token simultaneously:
33
+ - **Component head** — tags each token as one of 12 address component types (street, city, postcode, …).
34
+ - **Location head** — groups tokens that belong to the same physical location into `B-LOCATION` / `I-LOCATION` spans, allowing multi-location strings to be split.
35
+ 3. **Post-processing** — component spans are nested inside their parent location spans, housenumber ranges/sequences (e.g. `23-25`, `7 en 9`) are expanded into individual entries, and bus numbers are split into a separate field.
36
+
37
+ ### Architecture
38
+
39
+ | Component | Detail |
40
+ |---|---|
41
+ | Base encoder | `xlm-roberta-base` (12 layers, 768 hidden) |
42
+ | Component head | Linear(768 → 256) · GELU · Dropout(0.1) · Linear(256 → 25) + CRF |
43
+ | Location head | Linear(768 → 256) · GELU · Dropout(0.1) · Linear(256 → 3) + CRF |
44
+ | Tokenisation | Word-level regex tokeniser; sub-word alignments via fast tokenizer `word_ids()` |
45
+ | Max input length | 256 sub-word tokens |
46
+
47
+ ### Entity types (component head)
48
+
49
+ | Label | Description |
50
+ |---|---|
51
+ | `STREET` | Street name (no house number) |
52
+ | `ROAD` | Road or route name |
53
+ | `HOUSENUMBER` | House/building number(s), ranges or sequences |
54
+ | `POSTCODE` | Postal or ZIP code |
55
+ | `CITY` | City or municipality name |
56
+ | `PROVINCE` | Province or region name |
57
+ | `BUILDING` | Named building, site or facility |
58
+ | `INTERSECTION` | Crossing or intersection of roads |
59
+ | `PARCEL` | Land parcel, section or lot number |
60
+ | `DISTRICT` | District, neighbourhood or borough |
61
+ | `GRAVE_LOCATION` | Plot/row/number within a cemetery |
62
+ | `DOMAIN_ZONE_AREA` | Domain, zone or area name |
63
+
64
+ ## Evaluation
65
+
66
+ Evaluated on a held-out 10 % split of ~10 000 Belgian municipal decision location strings.
67
+
68
+ | Metric | Score |
69
+ |---|---|
70
+ | **Combined F1** | **0.9435** |
71
+ | Component F1 | 0.9295 |
72
+ | Location F1 | 0.9576 |
73
+
74
+ ## Installation
75
+
76
+ ### From source (recommended during development)
77
+
78
+ ```bash
79
+ git clone https://github.com/semantic-ai/decide-location-formatter.git
80
+ cd decide-location-formatter
81
+ pip install -e .
82
+ ```
83
+
84
+ ### Dependencies only
85
+
86
+ ```bash
87
+ pip install "torch>=2.0" "transformers>=4.35" "pytorch-crf>=0.7.2"
88
+ ```
89
+
90
+ > The model weights (~1 GB) are downloaded automatically from the Hugging Face Hub on first use.
91
+
92
+ ## Usage
93
+
94
+ ### Quick start
95
+
96
+ ```python
97
+ from locationformatter import LocationFormatter
98
+
99
+ lf = LocationFormatter() # loads model once; reuse for many calls
100
+
101
+ result = lf.parse("Scaldisstraat 23-25, 2000 Antwerpen")
102
+ print(result)
103
+ ```
104
+
105
+ ```json
106
+ {
107
+ "original": "Scaldisstraat 23-25, 2000 Antwerpen",
108
+ "locations": [
109
+ {
110
+ "location": "Scaldisstraat 23-25, 2000 Antwerpen",
111
+ "street": "Scaldisstraat",
112
+ "housenumber": "23",
113
+ "housenumber_type": "single",
114
+ "postcode": "2000",
115
+ "city": "Antwerpen"
116
+ },
117
+ {
118
+ "location": "Scaldisstraat 23-25, 2000 Antwerpen",
119
+ "street": "Scaldisstraat",
120
+ "housenumber": "25",
121
+ "housenumber_type": "single",
122
+ "postcode": "2000",
123
+ "city": "Antwerpen"
124
+ }
125
+ ]
126
+ }
127
+ ```
128
+
129
+ ### Multi-location strings
130
+
131
+ Strings that contain several distinct locations are automatically split:
132
+
133
+ ```python
134
+ result = lf.parse("Lovegemlaan 7, 9000 Gent en Dorpstraat 12, 9240 Zele")
135
+ for loc in result["locations"]:
136
+     print(loc)
137
+ ```
138
+
139
+ ### Raw prediction (no housenumber expansion)
140
+
141
+ `predict()` returns spans straight from the model without expanding ranges or splitting bus numbers:
142
+
143
+ ```python
144
+ raw = lf.predict("Heikeesstraat 2-4, 9240 Zele")
145
+ # raw["locations"][0]["housenumber"] == "2-4"
146
+ # raw["locations"][0]["housenumber_type"] == "range"
147
+ ```
148
+
149
+ ### One-shot helper
150
+
151
+ For a single call without keeping the model in memory:
152
+
153
+ ```python
154
+ from locationformatter import parse_location
155
+
156
+ result = parse_location("Grote Markt 1, 2000 Antwerpen")
157
+ ```
158
+
159
+ > **Note:** `parse_location` reloads the model on every call. Use `LocationFormatter` for repeated parsing.
160
+
161
+ ### Custom model or device
162
+
163
+ ```python
164
+ lf = LocationFormatter(
165
+     repo="your-org/your-model",   # any compatible HF Hub repo
166
+     device="cuda",                # "cpu" or "cuda"; auto-detected when omitted
167
+ )
168
+ ```
169
+
170
+ ## API reference
171
+
172
+ ### `LocationFormatter`
173
+
174
+ ```python
175
+ class LocationFormatter:
176
+     def __init__(self, repo: str = "svercoutere/abb-dual-location-component-ner",
177
+                  device: str | None = None): ...
178
+
179
+     def parse(self, text: str) -> dict: ...
180
+     # Full pipeline: clean → NER → expand housenumbers.
181
+     # Returns {"original": str, "locations": list[dict]}
182
+
183
+     def predict(self, text: str) -> dict: ...
184
+     # NER only, no housenumber expansion.
185
+     # Returns {"original": str, "locations": list[dict]}
186
+ ```
187
+
188
+ ### Helper functions
189
+
190
+ ```python
191
+ from locationformatter import clean_string, clean_house_number, extract_house_and_bus_number
192
+
193
+ clean_string(" Grote Markt\n1 ")
194
+ # → "Grote Markt 1"
195
+
196
+ clean_house_number("3 t.e.m. 7")
197
+ # → ["3", "4", "5", "6", "7"]
198
+
199
+ clean_house_number("10-14")
200
+ # → ["10", "11", "12", "13", "14"]
201
+
202
+ extract_house_and_bus_number("5 bus 3")
203
+ # → {"housenumber": "5", "bus": "3"}
204
+ ```
205
+
206
+ ### Output schema
207
+
208
+ Each entry in the `locations` list is a flat dict. Only fields detected by the model are included.
209
+
210
+ | Field | Type | Description |
211
+ |---|---|---|
212
+ | `location` | `str` | The substring corresponding to this location |
213
+ | `street` | `str` | Street name |
214
+ | `road` | `str` | Road/route name |
215
+ | `housenumber` | `str` | Individual house number (after expansion) |
216
+ | `housenumber_type` | `str` | `"single"`, `"range"`, or `"sequence"` |
217
+ | `bus` | `str` | Bus/apartment number (when present) |
218
+ | `postcode` | `str` | Postal code |
219
+ | `city` | `str` | City or municipality |
220
+ | `province` | `str` | Province |
221
+ | `building` | `str` | Named building or facility |
222
+ | `intersection` | `str` | Road intersection |
223
+ | `parcel` | `str` | Land parcel identifier |
224
+ | `district` | `str` | District or neighbourhood |
225
+ | `grave_location` | `str` | Cemetery plot/row/number |
226
+ | `domain_zone_area` | `str` | Zone or area name |
227
+
228
+ ## Development
229
+
230
+ ### Running tests
231
+
232
+ ```bash
233
+ pytest tests/
234
+ ```
235
+
236
+ The unit tests for the helper functions (`clean_string`, `clean_house_number`, `extract_house_and_bus_number`) do not need the model to be loaded, so they can run offline.
237
+
238
+ ### Project structure
239
+
240
+ ```
241
+ src/locationformatter/
242
+ ├── __init__.py # Public exports
243
+ └── locationformatter.py # Model architecture, inference, helpers
244
+ tests/
245
+ └── test_formatter.py # Unit tests
246
+ development/
247
+ ├── development.ipynb # Training & experimentation notebook
248
+ └── integration.ipynb # Integration testing notebook
249
+ ```
@@ -0,0 +1,229 @@
1
+ # decide-location-formatter
2
+
3
+ A Python package for parsing and structuring location strings into their individual address components. Built around a dual-head NER model ([`svercoutere/abb-dual-location-component-ner`](https://huggingface.co/svercoutere/abb-dual-location-component-ner)) fine-tuned on top of **XLM-RoBERTa base**.
4
+
5
+ ## How it works
6
+
7
+ Raw location strings like `"Scaldisstraat 23-25, 2000 Antwerpen"` or `"Cafe den Draak, Lovegemlaan 7, 9000 Gent"` are common in municipal decision text but are inconsistently formatted and often contain multiple distinct locations in a single string.
8
+
9
+ The pipeline has three steps:
10
+
11
+ 1. **Text cleaning** — normalises whitespace, unicode, and newlines.
12
+ 2. **Dual-head NER inference** — the model runs two independent CRF-decoded classification heads over every token simultaneously:
13
+ - **Component head** — tags each token as one of 12 address component types (street, city, postcode, …).
14
+ - **Location head** — groups tokens that belong to the same physical location into `B-LOCATION` / `I-LOCATION` spans, allowing multi-location strings to be split.
15
+ 3. **Post-processing** — component spans are nested inside their parent location spans, housenumber ranges/sequences (e.g. `23-25`, `7 en 9`) are expanded into individual entries, and bus numbers are split into a separate field.
16
+
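The text-cleaning step can be pictured with a minimal sketch. This is not the package's own `clean_string`; `normalise_text` is a hypothetical stand-in, and the choice of NFKC normalisation is an assumption. It only reproduces the documented behaviour of collapsing whitespace and newlines.

```python
import re
import unicodedata


def normalise_text(raw: str) -> str:
    """Hypothetical stand-in for the cleaning step, not the package's code."""
    # Assumption: NFKC folds compatibility characters such as
    # non-breaking spaces into their plain equivalents.
    text = unicodedata.normalize("NFKC", raw)
    # Collapse every run of whitespace (spaces, tabs, newlines) to one space.
    return re.sub(r"\s+", " ", text).strip()


print(normalise_text("  Grote Markt\n1  "))  # Grote Markt 1
```

Cleaning before inference keeps the word-level tokeniser from seeing stray newlines or doubled spaces as token boundaries of their own.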
17
+ ### Architecture
18
+
19
+ | Component | Detail |
20
+ |---|---|
21
+ | Base encoder | `xlm-roberta-base` (12 layers, 768 hidden) |
22
+ | Component head | Linear(768 → 256) · GELU · Dropout(0.1) · Linear(256 → 25) + CRF |
23
+ | Location head | Linear(768 → 256) · GELU · Dropout(0.1) · Linear(256 → 3) + CRF |
24
+ | Tokenisation | Word-level regex tokeniser; sub-word alignments via fast tokenizer `word_ids()` |
25
+ | Max input length | 256 sub-word tokens |
26
+
27
+ ### Entity types (component head)
28
+
29
+ | Label | Description |
30
+ |---|---|
31
+ | `STREET` | Street name (no house number) |
32
+ | `ROAD` | Road or route name |
33
+ | `HOUSENUMBER` | House/building number(s), ranges or sequences |
34
+ | `POSTCODE` | Postal or ZIP code |
35
+ | `CITY` | City or municipality name |
36
+ | `PROVINCE` | Province or region name |
37
+ | `BUILDING` | Named building, site or facility |
38
+ | `INTERSECTION` | Crossing or intersection of roads |
39
+ | `PARCEL` | Land parcel, section or lot number |
40
+ | `DISTRICT` | District, neighbourhood or borough |
41
+ | `GRAVE_LOCATION` | Plot/row/number within a cemetery |
42
+ | `DOMAIN_ZONE_AREA` | Domain, zone or area name |
43
+
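The output sizes in the Architecture table (25 and 3) are consistent with a BIO tagging scheme over the labels above: one `O` tag plus a `B-` and `I-` tag per entity type. The sketch below only checks that arithmetic. The BIO scheme for the component head and the label ordering are assumptions, and `bio_labels` is a hypothetical helper.

```python
# The 12 component types listed in the entity table of this README.
COMPONENT_TYPES = [
    "STREET", "ROAD", "HOUSENUMBER", "POSTCODE", "CITY", "PROVINCE",
    "BUILDING", "INTERSECTION", "PARCEL", "DISTRICT",
    "GRAVE_LOCATION", "DOMAIN_ZONE_AREA",
]


def bio_labels(entity_types):
    """Build an assumed BIO label set: "O" plus B-/I- tags per entity type."""
    labels = ["O"]
    for name in entity_types:
        labels += [f"B-{name}", f"I-{name}"]
    return labels


component_labels = bio_labels(COMPONENT_TYPES)  # assumed order, for illustration
location_labels = bio_labels(["LOCATION"])
print(len(component_labels), len(location_labels))  # 25 3
```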
44
+ ## Evaluation
45
+
46
+ Evaluated on a held-out 10 % split of ~10 000 Belgian municipal decision location strings.
47
+
48
+ | Metric | Score |
49
+ |---|---|
50
+ | **Combined F1** | **0.9435** |
51
+ | Component F1 | 0.9295 |
52
+ | Location F1 | 0.9576 |
53
+
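The README does not define how Combined F1 is computed. The reported figures are consistent with an arithmetic mean of the two head scores, which is an inference from the numbers rather than a documented formula:

```python
# Scores copied from the table above.
component_f1 = 0.9295
location_f1 = 0.9576

# Assumption: "Combined F1" is the unweighted mean of the two heads.
mean_f1 = (component_f1 + location_f1) / 2
print(f"{mean_f1:.5f}")  # 0.94355
```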
54
+ ## Installation
55
+
56
+ ### From source (recommended during development)
57
+
58
+ ```bash
59
+ git clone https://github.com/semantic-ai/decide-location-formatter.git
60
+ cd decide-location-formatter
61
+ pip install -e .
62
+ ```
63
+
64
+ ### Dependencies only
65
+
66
+ ```bash
67
+ pip install "torch>=2.0" "transformers>=4.35" "pytorch-crf>=0.7.2"
68
+ ```
69
+
70
+ > The model weights (~1 GB) are downloaded automatically from the Hugging Face Hub on first use.
71
+
72
+ ## Usage
73
+
74
+ ### Quick start
75
+
76
+ ```python
77
+ from locationformatter import LocationFormatter
78
+
79
+ lf = LocationFormatter() # loads model once; reuse for many calls
80
+
81
+ result = lf.parse("Scaldisstraat 23-25, 2000 Antwerpen")
82
+ print(result)
83
+ ```
84
+
85
+ ```json
86
+ {
87
+ "original": "Scaldisstraat 23-25, 2000 Antwerpen",
88
+ "locations": [
89
+ {
90
+ "location": "Scaldisstraat 23-25, 2000 Antwerpen",
91
+ "street": "Scaldisstraat",
92
+ "housenumber": "23",
93
+ "housenumber_type": "single",
94
+ "postcode": "2000",
95
+ "city": "Antwerpen"
96
+ },
97
+ {
98
+ "location": "Scaldisstraat 23-25, 2000 Antwerpen",
99
+ "street": "Scaldisstraat",
100
+ "housenumber": "25",
101
+ "housenumber_type": "single",
102
+ "postcode": "2000",
103
+ "city": "Antwerpen"
104
+ }
105
+ ]
106
+ }
107
+ ```
108
+
109
+ ### Multi-location strings
110
+
111
+ Strings that contain several distinct locations are automatically split:
112
+
113
+ ```python
114
+ result = lf.parse("Lovegemlaan 7, 9000 Gent en Dorpstraat 12, 9240 Zele")
115
+ for loc in result["locations"]:
116
+     print(loc)
117
+ ```
118
+
119
+ ### Raw prediction (no housenumber expansion)
120
+
121
+ `predict()` returns spans straight from the model without expanding ranges or splitting bus numbers:
122
+
123
+ ```python
124
+ raw = lf.predict("Heikeesstraat 2-4, 9240 Zele")
125
+ # raw["locations"][0]["housenumber"] == "2-4"
126
+ # raw["locations"][0]["housenumber_type"] == "range"
127
+ ```
128
+
129
+ ### One-shot helper
130
+
131
+ For a single call without keeping the model in memory:
132
+
133
+ ```python
134
+ from locationformatter import parse_location
135
+
136
+ result = parse_location("Grote Markt 1, 2000 Antwerpen")
137
+ ```
138
+
139
+ > **Note:** `parse_location` reloads the model on every call. Use `LocationFormatter` for repeated parsing.
140
+
141
+ ### Custom model or device
142
+
143
+ ```python
144
+ lf = LocationFormatter(
145
+     repo="your-org/your-model",   # any compatible HF Hub repo
146
+     device="cuda",                # "cpu" or "cuda"; auto-detected when omitted
147
+ )
148
+ ```
149
+
150
+ ## API reference
151
+
152
+ ### `LocationFormatter`
153
+
154
+ ```python
155
+ class LocationFormatter:
156
+     def __init__(self, repo: str = "svercoutere/abb-dual-location-component-ner",
157
+                  device: str | None = None): ...
158
+
159
+     def parse(self, text: str) -> dict: ...
160
+     # Full pipeline: clean → NER → expand housenumbers.
161
+     # Returns {"original": str, "locations": list[dict]}
162
+
163
+     def predict(self, text: str) -> dict: ...
164
+     # NER only, no housenumber expansion.
165
+     # Returns {"original": str, "locations": list[dict]}
166
+ ```
167
+
168
+ ### Helper functions
169
+
170
+ ```python
171
+ from locationformatter import clean_string, clean_house_number, extract_house_and_bus_number
172
+
173
+ clean_string(" Grote Markt\n1 ")
174
+ # → "Grote Markt 1"
175
+
176
+ clean_house_number("3 t.e.m. 7")
177
+ # → ["3", "4", "5", "6", "7"]
178
+
179
+ clean_house_number("10-14")
180
+ # → ["10", "11", "12", "13", "14"]
181
+
182
+ extract_house_and_bus_number("5 bus 3")
183
+ # → {"housenumber": "5", "bus": "3"}
184
+ ```
185
+
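To make the helper examples above concrete, here is a rough, self-contained approximation of range expansion and bus splitting. `expand_range` and `split_bus` are hypothetical names, the regular expressions are assumptions, and none of this is the package's implementation. The sketch reproduces only the four documented examples. The quick-start output lists just `23` and `25` for `23-25`, so the full pipeline may treat some ranges differently, which this sketch does not model.

```python
import re

# Assumed range separators: a hyphen or the Dutch "t.e.m." (up to and including).
RANGE_RE = re.compile(r"^\s*(\d+)\s*(?:-|t\.e\.m\.)\s*(\d+)\s*$", re.IGNORECASE)
# Assumed bus pattern: "<housenumber> bus <number>".
BUS_RE = re.compile(r"^\s*(\S+)\s+bus\s+(\S+)\s*$", re.IGNORECASE)


def expand_range(text: str) -> list[str]:
    """Hypothetical helper: expand an inclusive numeric range into every number."""
    match = RANGE_RE.match(text)
    if match is None:
        return [text.strip()]  # not a range: return the input unchanged
    start, end = int(match.group(1)), int(match.group(2))
    if start > end:
        return [text.strip()]  # leave reversed ranges unexpanded
    return [str(n) for n in range(start, end + 1)]


def split_bus(text: str) -> dict[str, str]:
    """Hypothetical helper: separate a house number from its bus number."""
    match = BUS_RE.match(text)
    if match is None:
        return {"housenumber": text.strip()}
    return {"housenumber": match.group(1), "bus": match.group(2)}


print(expand_range("3 t.e.m. 7"))  # ['3', '4', '5', '6', '7']
print(split_bus("5 bus 3"))        # {'housenumber': '5', 'bus': '3'}
```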
186
+ ### Output schema
187
+
188
+ Each entry in the `locations` list is a flat dict. Only fields detected by the model are included.
189
+
190
+ | Field | Type | Description |
191
+ |---|---|---|
192
+ | `location` | `str` | The substring corresponding to this location |
193
+ | `street` | `str` | Street name |
194
+ | `road` | `str` | Road/route name |
195
+ | `housenumber` | `str` | Individual house number (after expansion) |
196
+ | `housenumber_type` | `str` | `"single"`, `"range"`, or `"sequence"` |
197
+ | `bus` | `str` | Bus/apartment number (when present) |
198
+ | `postcode` | `str` | Postal code |
199
+ | `city` | `str` | City or municipality |
200
+ | `province` | `str` | Province |
201
+ | `building` | `str` | Named building or facility |
202
+ | `intersection` | `str` | Road intersection |
203
+ | `parcel` | `str` | Land parcel identifier |
204
+ | `district` | `str` | District or neighbourhood |
205
+ | `grave_location` | `str` | Cemetery plot/row/number |
206
+ | `domain_zone_area` | `str` | Zone or area name |
207
+
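Because `parse()` emits one entry per expanded house number and omits undetected fields, downstream code will often want to regroup entries and read fields defensively. The sketch below works on the quick-start output copied as a literal, so it runs without the model. `housenumbers_by_street` is a hypothetical helper, not part of the package.

```python
from collections import defaultdict

# Output copied from the quick-start example; no model is needed here.
result = {
    "original": "Scaldisstraat 23-25, 2000 Antwerpen",
    "locations": [
        {"location": "Scaldisstraat 23-25, 2000 Antwerpen",
         "street": "Scaldisstraat", "housenumber": "23",
         "housenumber_type": "single", "postcode": "2000", "city": "Antwerpen"},
        {"location": "Scaldisstraat 23-25, 2000 Antwerpen",
         "street": "Scaldisstraat", "housenumber": "25",
         "housenumber_type": "single", "postcode": "2000", "city": "Antwerpen"},
    ],
}


def housenumbers_by_street(parsed: dict) -> dict:
    """Collect expanded house numbers per (street, postcode, city) key."""
    grouped = defaultdict(list)
    for loc in parsed["locations"]:
        # Fields are optional, so use .get() rather than indexing.
        key = (loc.get("street"), loc.get("postcode"), loc.get("city"))
        if "housenumber" in loc:
            grouped[key].append(loc["housenumber"])
    return dict(grouped)


print(housenumbers_by_street(result))
# {('Scaldisstraat', '2000', 'Antwerpen'): ['23', '25']}
```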
208
+ ## Development
209
+
210
+ ### Running tests
211
+
212
+ ```bash
213
+ pytest tests/
214
+ ```
215
+
216
+ The unit tests for the helper functions (`clean_string`, `clean_house_number`, `extract_house_and_bus_number`) do not need the model to be loaded, so they can run offline.
217
+
218
+ ### Project structure
219
+
220
+ ```
221
+ src/locationformatter/
222
+ ├── __init__.py # Public exports
223
+ └── locationformatter.py # Model architecture, inference, helpers
224
+ tests/
225
+ └── test_formatter.py # Unit tests
226
+ development/
227
+ ├── development.ipynb # Training & experimentation notebook
228
+ └── integration.ipynb # Integration testing notebook
229
+ ```