bharataddress 0.2.2__tar.gz

Files changed (36)
  1. bharataddress-0.2.2/LICENSE +21 -0
  2. bharataddress-0.2.2/PKG-INFO +400 -0
  3. bharataddress-0.2.2/README.md +375 -0
  4. bharataddress-0.2.2/bharataddress/__init__.py +47 -0
  5. bharataddress-0.2.2/bharataddress/batch.py +93 -0
  6. bharataddress-0.2.2/bharataddress/cli.py +54 -0
  7. bharataddress-0.2.2/bharataddress/data/abbreviations.json +63 -0
  8. bharataddress-0.2.2/bharataddress/data/localities.json +1 -0
  9. bharataddress-0.2.2/bharataddress/data/pincodes.json +1 -0
  10. bharataddress-0.2.2/bharataddress/data/vernacular_mappings.json +76 -0
  11. bharataddress-0.2.2/bharataddress/digipin.py +114 -0
  12. bharataddress-0.2.2/bharataddress/enrichment.py +73 -0
  13. bharataddress-0.2.2/bharataddress/formatter.py +71 -0
  14. bharataddress-0.2.2/bharataddress/geocoder.py +84 -0
  15. bharataddress-0.2.2/bharataddress/parser.py +475 -0
  16. bharataddress-0.2.2/bharataddress/pincode.py +53 -0
  17. bharataddress-0.2.2/bharataddress/preprocessor.py +107 -0
  18. bharataddress-0.2.2/bharataddress/similarity.py +149 -0
  19. bharataddress-0.2.2/bharataddress/validator.py +104 -0
  20. bharataddress-0.2.2/bharataddress.egg-info/PKG-INFO +400 -0
  21. bharataddress-0.2.2/bharataddress.egg-info/SOURCES.txt +34 -0
  22. bharataddress-0.2.2/bharataddress.egg-info/dependency_links.txt +1 -0
  23. bharataddress-0.2.2/bharataddress.egg-info/entry_points.txt +2 -0
  24. bharataddress-0.2.2/bharataddress.egg-info/requires.txt +3 -0
  25. bharataddress-0.2.2/bharataddress.egg-info/top_level.txt +1 -0
  26. bharataddress-0.2.2/pyproject.toml +41 -0
  27. bharataddress-0.2.2/setup.cfg +4 -0
  28. bharataddress-0.2.2/tests/test_batch.py +47 -0
  29. bharataddress-0.2.2/tests/test_digipin.py +159 -0
  30. bharataddress-0.2.2/tests/test_enrichment.py +24 -0
  31. bharataddress-0.2.2/tests/test_formatter.py +43 -0
  32. bharataddress-0.2.2/tests/test_geocoder.py +36 -0
  33. bharataddress-0.2.2/tests/test_no_comma.py +29 -0
  34. bharataddress-0.2.2/tests/test_parse.py +245 -0
  35. bharataddress-0.2.2/tests/test_similarity.py +29 -0
  36. bharataddress-0.2.2/tests/test_validator.py +37 -0
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Neelagiri (Srinathprasanna Shanmugam)
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,400 @@
1
+ Metadata-Version: 2.4
2
+ Name: bharataddress
3
+ Version: 0.2.2
4
+ Summary: Deterministic, offline parser for messy Indian addresses. Zero config, zero API keys.
5
+ Author: Neelagiri
6
+ License: MIT
7
+ Project-URL: Homepage, https://github.com/Neelagiri65/bharataddress
8
+ Project-URL: Issues, https://github.com/Neelagiri65/bharataddress/issues
9
+ Keywords: india,address,parser,pincode,geocoding,nlp
10
+ Classifier: Development Status :: 3 - Alpha
11
+ Classifier: Intended Audience :: Developers
12
+ Classifier: License :: OSI Approved :: MIT License
13
+ Classifier: Programming Language :: Python :: 3
14
+ Classifier: Programming Language :: Python :: 3 :: Only
15
+ Classifier: Programming Language :: Python :: 3.10
16
+ Classifier: Programming Language :: Python :: 3.11
17
+ Classifier: Programming Language :: Python :: 3.12
18
+ Classifier: Topic :: Text Processing :: Linguistic
19
+ Requires-Python: >=3.10
20
+ Description-Content-Type: text/markdown
21
+ License-File: LICENSE
22
+ Provides-Extra: dev
23
+ Requires-Dist: pytest>=7; extra == "dev"
24
+ Dynamic: license-file
25
+
26
+ # bharataddress
27
+
28
+ **The deterministic Indian address parser. Zero config. Zero API keys. Zero network calls.**
29
+
30
+ `pip install bharataddress` → parse messy Indian addresses into structured JSON in one line. No model downloads, no Claude API key, no Nominatim instance, nothing to set up. The pincode directory ships embedded in the package.
31
+
32
+ ```python
33
+ >>> from bharataddress import parse
34
+ >>> parse("Flat 302, Raheja Atlantis, Near Hanuman Mandir, Sector 31, Gurgaon 122001").to_dict()
35
+ {
36
+ "building_number": "302",
37
+ "building_name": "Raheja Atlantis",
38
+ "landmark": "Hanuman mandir",
39
+ "locality": "Sector 31",
40
+ "city": "Gurgaon",
41
+ "district": "Gurgaon",
42
+ "state": "Haryana",
43
+ "pincode": "122001",
44
+ "confidence": 1.0
45
+ }
46
+ ```
47
+
48
+ That's the whole pitch. Sixty seconds from `pip install` to a parsed address.
49
+
50
+ ---
51
+
52
+ ## Why this exists
53
+
54
+ Indian addresses break every off-the-shelf parser. **Libpostal** (the global gold standard) has had an open issue for India support since 2018. **Deepparse** has no India training data. **Lokly** is abandoned. **Delhivery's AddFix** is the best solution but proprietary, patented, and unavailable. **HyperVerge** is closed-source, paid-only.
55
+
56
+ Indian addresses are different:
57
+
58
+ - **Landmark-based** — "Near Hanuman Mandir", "Opp SBI Bank", "Behind Reliance Fresh" are valid components, not noise
59
+ - **No street numbering** in most localities
60
+ - **Transliteration chaos** — Gurgaon / Gurugram / Gudgaon all refer to the same place
61
+ - **Pincodes cover huge areas** — median 90 km², up to 1M households
62
+ - **Mixed Hindi + English** in the same string
63
+ - **Abbreviation soup** — H.No., S/O, D/O, W/O, B.O., S.O., Opp., Nr., Ngr., Clny, Mohalla, Marg
64
+
65
+ `bharataddress` handles all of these in v0.1 with pure rules + the embedded India Post directory. No ML model. No API. No network.
66
+
67
+ ---
68
+
69
+ ## Install
70
+
71
+ ```bash
72
+ pip install bharataddress
73
+ ```
74
+
75
+ Or from source:
76
+
77
+ ```bash
78
+ git clone https://github.com/Neelagiri65/bharataddress
79
+ cd bharataddress
80
+ pip install -e .
81
+ ```
82
+
83
+ Requires Python ≥3.10. Zero runtime dependencies.
84
+
85
+ ---
86
+
87
+ ## Usage
88
+
89
+ ### Python
90
+
91
+ ```python
92
+ from bharataddress import parse
93
+
94
+ result = parse("h.no.45/2 ; phase-2 ; sector 14 ; gurgaon - 122001")
95
+ print(result.pincode) # '122001'
96
+ print(result.state) # 'Haryana'
97
+ print(result.locality) # 'sector 14'
98
+ print(result.confidence) # 1.0
99
+
100
+ # JSON-friendly dict
101
+ result.to_dict()
102
+ ```
103
+
104
+ ### CLI
105
+
106
+ ```bash
107
+ bharataddress parse "12, Dalal Street, Fort, Mumbai 400001" --pretty
108
+ bharataddress lookup 560001
109
+ bharataddress --version
110
+ ```
111
+
112
+ ### Pincode lookup (no parsing)
113
+
114
+ ```python
115
+ from bharataddress import pincode
116
+ pincode.lookup("122001")
117
+ # {'pincode': '122001', 'district': 'Gurgaon', 'city': 'Gurgaon',
118
+ # 'state': 'Haryana', 'offices': [...]}
119
+ ```
120
+
121
+ ---
122
+
123
+ ## What you get back
124
+
125
+ | Field | Type | Source |
126
+ | ----------------- | --------------- | -------------------------------------------- |
127
+ | `building_number` | `str \| None` | First numeric segment / `Flat`/`H.No.` lead |
128
+ | `building_name` | `str \| None` | Segment after building number |
129
+ | `landmark` | `str \| None` | `Near/Opp/Behind/Beside/Next to/...` segment |
130
+ | `locality` | `str \| None` | `Sector/Phase/Block/<x> Nagar/<x> Colony/…` |
131
+ | `sub_locality` | `str \| None` | Second locality match if present |
132
+ | `city` | `str \| None` | Pincode lookup; falls back to text guess |
133
+ | `district` | `str \| None` | Pincode lookup |
134
+ | `state` | `str \| None` | Pincode lookup |
135
+ | `pincode` | `str \| None` | `[1-8]\d{5}` regex, validated against DB |
136
+ | `confidence` | `float (0..1)` | Weighted component-presence score |
137
+ | `cleaned` | `str` | Normalised input after preprocessing |
138
+ | `raw` | `str` | Original input |
139
+
140
+ **Confidence weights:** pincode 0.40, city-matches-pincode 0.20, locality 0.20, building 0.10, landmark 0.10.
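
The weights above sum to 1.0, so the score can be read as the share of weighted evidence that was found. A minimal sketch of that arithmetic follows. The function name `confidence_score` and the boolean flags are hypothetical illustrations, not the package's internal API, and the real scorer may apply additional rules.

```python
# Hypothetical sketch of weighted component-presence scoring.
# The weights are the ones listed above; all names are illustrative.
WEIGHTS = {
    "pincode": 0.40,
    "city_matches_pincode": 0.20,
    "locality": 0.20,
    "building": 0.10,
    "landmark": 0.10,
}


def confidence_score(present: dict) -> float:
    """Sum the weights of the components flagged as present."""
    total = sum(
        (weight for name, weight in WEIGHTS.items() if present.get(name, False)),
        0.0,
    )
    # Round to two decimals to hide float noise such as 0.6000000000000001.
    return round(total, 2)


full = confidence_score({name: True for name in WEIGHTS})
no_landmark = confidence_score(
    {"pincode": True, "city_matches_pincode": True, "locality": True, "building": True}
)
print(full, no_landmark)
```

Under these assumptions, an address with every component scores 1.0 and dropping the landmark costs exactly its 0.10 weight.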
141
+
142
+ ---
143
+
144
+ ## How it works
145
+
146
+ ```
147
+ input string
148
+      │
149
+      ▼
150
+ Layer 1 — Preprocess           NFKC unicode → expand abbreviations
151
+                                → normalise vernacular tokens
152
+                                → tidy whitespace
153
+      │
154
+      ▼
155
+ Layer 2 — Extract pincode      regex [1-8]\d{5} → embedded India Post lookup
156
+                                → district / state / city
157
+      │
158
+      ▼
159
+ Layer 3 — Segment & classify   walk comma-separated parts; rules for
160
+                                building / landmark / locality
161
+      │
162
+      ▼
163
+ Layer 4 — Confidence scoring   weighted component presence
164
+      │
165
+      ▼
166
+ ParsedAddress
167
+ ```
168
+
169
+ The embedded `pincodes.json` contains 23,915 Indian pincodes derived from the India Post directory mirror at [`kishorek/India-Codes`](https://github.com/kishorek/India-Codes). Refresh it any time with `python scripts/build_pincode_data.py`.
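
Layer 2 can be sketched as a regex scan followed by a dictionary lookup. This is a simplified illustration, not the package's code: the digit lookarounds, the "last valid candidate wins" rule, the `extract_pincode` name, and the two-entry `DIRECTORY` stand-in are all assumptions made for the example.

```python
import re
from typing import Optional

# Six digits with a leading 1-8, not embedded in a longer digit run.
# The lookarounds are an assumption; the README only states [1-8]\d{5}.
PINCODE_RE = re.compile(r"(?<!\d)[1-8]\d{5}(?!\d)")

# Tiny stand-in for the embedded directory (illustrative entries only).
DIRECTORY = {
    "122001": {"district": "Gurgaon", "state": "Haryana"},
    "400001": {"district": "Mumbai", "state": "Maharashtra"},
}


def extract_pincode(text: str) -> Optional[str]:
    """Return the last regex candidate that exists in the directory."""
    for candidate in reversed(PINCODE_RE.findall(text)):
        if candidate in DIRECTORY:
            return candidate
    return None


pin = extract_pincode("h.no.45/2 ; phase-2 ; sector 14 ; gurgaon - 122001")
print(pin, DIRECTORY[pin]["state"])
```

Validating candidates against the directory is what keeps a phone number or plot number from being mistaken for a pincode.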
170
+
171
+ ---
172
+
173
+ ## DIGIPIN
174
+
175
+ `bharataddress` ships a verbatim Python port of the official India Post DIGIPIN
176
+ algorithm (Apache-2.0, [github.com/INDIAPOST-gov/digipin](https://github.com/INDIAPOST-gov/digipin)).
177
+ DIGIPIN is the 10-character geocode published by the Department of Posts in 2025
178
+ that maps any point in India to a ~3.8 m grid cell. Pure math, no network, no
179
+ dependencies.
180
+
181
+ ```python
182
+ from bharataddress import digipin
183
+
184
+ # Encode lat/lng -> DIGIPIN (XXX-XXX-XXXX)
185
+ digipin.encode(28.6129, 77.2295)
186
+ # '39J-429-L4TK'
187
+
188
+ # Decode back to the centre lat/lng of the cell
189
+ digipin.decode('39J-429-L4TK')
190
+ # (28.612906..., 77.229494...)
191
+
192
+ # Validate
193
+ digipin.validate('39J-429-L4TK') # True
194
+ digipin.validate('AAA-BBB-CCCC') # False
195
+ ```
196
+
197
+ `parse()` accepts an optional `latlng=` hint and populates a `digipin` field
198
+ on the result when a coordinate is supplied:
199
+
200
+ ```python
201
+ from bharataddress import parse
202
+
203
+ result = parse(
204
+ "Plot 88, Basheer Bagh, Hyderabad 500001",
205
+ latlng=(17.3850, 78.4867),
206
+ )
207
+ result.digipin
208
+ # '422-594-J546'
209
+ ```
210
+
211
+ The parser does not geocode addresses on its own — `digipin` stays `None`
212
+ unless you pass a coordinate. The bounding box is hard-coded to India
213
+ (lat 2.5–38.5, lng 63.5–99.5); points outside raise `ValueError`.
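
The bounding-box check and the ~3.8 m cell size can both be derived from the numbers quoted above. The sketch below is illustrative only: `check_in_bbox` and `cell_size_degrees` are hypothetical names, inclusive box edges are an assumption, and the 4x4 split per character reflects the 16-symbol DIGIPIN grid rather than anything read from this package's source.

```python
# Bounding box quoted above; treating the edges as inclusive is an assumption.
LAT_MIN, LAT_MAX = 2.5, 38.5
LNG_MIN, LNG_MAX = 63.5, 99.5
LEVELS = 10            # one DIGIPIN character per level
SPLITS_PER_LEVEL = 4   # assumed 4x4 subdivision at every level


def check_in_bbox(lat: float, lng: float) -> None:
    """Raise ValueError for points outside the DIGIPIN bounding box."""
    if not (LAT_MIN <= lat <= LAT_MAX and LNG_MIN <= lng <= LNG_MAX):
        raise ValueError(f"({lat}, {lng}) is outside the DIGIPIN bounding box")


def cell_size_degrees() -> float:
    """Side length of the finest grid cell, in degrees."""
    return (LAT_MAX - LAT_MIN) / SPLITS_PER_LEVEL ** LEVELS


# One degree of latitude is roughly 111.32 km.
cell_metres = cell_size_degrees() * 111_320
check_in_bbox(28.6129, 77.2295)  # inside the box, so no exception
print(round(cell_metres, 2))
```

Dividing the 36-degree span by 4 ten times gives a cell of about 3.4e-5 degrees, which is where the ~3.8 m figure comes from.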
214
+
215
+ ---
216
+
217
+ ## v0.2 modules
218
+
219
+ v0.2 ships six new modules around the core parser. All offline, all
220
+ zero-dependency, all importable straight from the top-level package.
221
+
222
+ ### `formatter` — reconstruct a clean address
223
+
224
+ ```python
225
+ >>> from bharataddress import parse, format
226
+ >>> p = parse("Flat 302, Raheja Atlantis, Sector 31, Gurgaon 122001")
227
+ >>> print(format(p, style="india_post"))
228
+ 302 Raheja Atlantis
229
+ Sector 31
230
+ Gurgaon, Gurgaon
231
+ Haryana, 122001
232
+ >>> format(p, style="single_line")
233
+ '302 Raheja Atlantis, Sector 31, Gurgaon, Gurgaon, Haryana, 122001'
234
+ >>> print(format(p, style="label"))
235
+ Building: 302 Raheja Atlantis
236
+ Locality: Sector 31
237
+ City: Gurgaon
238
+ ...
239
+ ```
240
+
241
+ ### `validator` — confidence + consistency
242
+
243
+ ```python
244
+ >>> from bharataddress import parse, validate, is_deliverable
245
+ >>> p = parse("Flat 302, Sector 31, Gurgaon 122001")
246
+ >>> is_deliverable(p)
247
+ True
248
+ >>> validate(p)
249
+ {'fields': {'pincode': 1.0, 'state': 1.0, ...}, 'issues': [], 'is_deliverable': True, 'overall': 0.91}
250
+ ```
251
+
252
+ `validate` flags state / district / city mismatches against the embedded India
253
+ Post directory. `is_deliverable` is the minimum-fields check (pincode + city +
254
+ state).
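
The two checks described above can be pictured as follows. This is a hedged sketch: `Address`, `has_minimum_fields`, `find_state_mismatch`, and the one-entry directory are hypothetical stand-ins, not the package's `ParsedAddress` or its validator internals.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Address:
    """Hypothetical stand-in for the parsed result."""
    pincode: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None


# Illustrative directory entry only.
DIRECTORY = {"122001": {"state": "Haryana"}}


def has_minimum_fields(addr: Address) -> bool:
    """Minimum-fields check: pincode, city and state must all be set."""
    return all((addr.pincode, addr.city, addr.state))


def find_state_mismatch(addr: Address) -> List[str]:
    """Report a state that disagrees with the directory entry for the pincode."""
    entry = DIRECTORY.get(addr.pincode or "")
    if entry and addr.state and addr.state.lower() != entry["state"].lower():
        return [f"state {addr.state!r} does not match pincode {addr.pincode}"]
    return []


good = Address(pincode="122001", city="Gurgaon", state="Haryana")
bad = Address(pincode="122001", city="Gurgaon", state="Punjab")
print(has_minimum_fields(good), find_state_mismatch(bad))
```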
255
+
256
+ ### `geocoder` — pincode centroid + reverse geocoding
257
+
258
+ ```python
259
+ >>> from bharataddress import parse, geocode, reverse_geocode
260
+ >>> geocode(parse("Sector 31, Gurgaon 122001")) # None until pincodes.json gains centroids
261
+ None
262
+ >>> reverse_geocode(28.6129, 77.2295)
263
+ {'digipin': '39J-429-L4TK', 'pincode': None, 'distance_km': None}
264
+ ```
265
+
266
+ DIGIPIN is always returned (it's pure math). The nearest pincode is returned
267
+ once a future dataset refresh adds latitude / longitude per pincode — the
268
+ hook is wired and dormant today.
269
+
270
+ ### `similarity` — fuzzy address matching
271
+
272
+ ```python
273
+ >>> from bharataddress import address_similarity
274
+ >>> address_similarity("MG Road, Bengaluru 560001",
275
+ ... "Mahatma Gandhi Road, Bangalore 560001")
276
+ 0.9
277
+ ```
278
+
279
+ Pincode is the strongest signal, then city (with Bengaluru/Bangalore,
280
+ Mumbai/Bombay, etc. aliasing), then locality token overlap. Returns a float
281
+ in `[0, 1]`.
282
+
283
+ ### `batch` — list / CSV / DataFrame helpers
284
+
285
+ ```python
286
+ >>> from bharataddress import parse_batch, parse_csv, parse_dataframe
287
+ >>> parse_batch(["Sector 31, Gurgaon 122001", "Anna Salai, Chennai 600002"])
288
+ [ParsedAddress(...), ParsedAddress(...)]
289
+ >>> parse_csv("addresses.csv", column="address") # writes addresses_parsed.csv
290
+ PosixPath('addresses_parsed.csv')
291
+ >>> parse_dataframe(df, column="address") # pandas optional, lazy import
292
+ ```
293
+
294
+ ### `enrichment` — non-address sources
295
+
296
+ ```python
297
+ >>> from bharataddress import extract_state_from_gstin
298
+ >>> extract_state_from_gstin("29ABCDE1234F1Z5")
299
+ 'Karnataka'
300
+ ```
301
+
302
+ The first two digits of a GSTIN are the GST Council state code. Pure lookup,
303
+ no network.
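
The prefix lookup is small enough to sketch directly. The table below is deliberately partial, and `state_from_gstin` is a hypothetical name; returning `None` for malformed input is an assumption, since the package's own function may behave differently.

```python
from typing import Optional

# Partial table of GST state codes, for illustration only.
GST_STATE_CODES = {
    "06": "Haryana",
    "07": "Delhi",
    "27": "Maharashtra",
    "29": "Karnataka",
    "33": "Tamil Nadu",
}


def state_from_gstin(gstin: str) -> Optional[str]:
    """Map the two-digit GSTIN prefix to a state name, or None if unknown."""
    gstin = gstin.strip().upper()
    # A GSTIN is 15 characters long and starts with a numeric state code.
    if len(gstin) != 15 or not gstin[:2].isdigit():
        return None
    return GST_STATE_CODES.get(gstin[:2])


print(state_from_gstin("29ABCDE1234F1Z5"))
```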
304
+
305
+ ---
306
+
307
+ ## What's NOT in v0.2
308
+
309
+ By design, kept out so the package stays small, fast, and dependency-free:
310
+
311
+ - ❌ LLM parsing (Claude API)
312
+ - ❌ Phonetic fuzzy matching (Gurgaon ↔ Gudgaon)
313
+ - ❌ Pincode boundary GeoJSON — **v0.3**
314
+ - ❌ FastAPI server — **v0.3**
315
+ - ❌ Devanagari / Tamil / Bengali script parsing — English + Romanised Hindi only
316
+
317
+ The architecture already accommodates all of these. v0.2 ships the foundation everything else builds on.
318
+
319
+ ---
320
+
321
+ ## Tests
322
+
323
+ ```bash
324
+ pip install -e ".[dev]"
325
+ pytest
326
+ ```
327
+
328
+ 95 tests covering parser, DIGIPIN, formatter, validator, geocoder, similarity, batch, and enrichment modules. All passing on v0.2.0.
329
+
330
+ There is also an architectural-constraint test that monkeypatches `socket.socket` and asserts `parse()` opens **zero** network connections. The "offline by default" promise is enforced in CI.
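
A guard of this kind can be sketched with the standard library alone. The code below is an illustration of the idea, not the project's actual test: `no_network` and `offline_work` are hypothetical names, and `offline_work` merely stands in for `parse()`.

```python
import socket
from contextlib import contextmanager


@contextmanager
def no_network():
    """Temporarily make any attempt to create a socket raise."""
    original = socket.socket

    def blocked(*args, **kwargs):
        raise RuntimeError("network access attempted")

    socket.socket = blocked
    try:
        yield
    finally:
        # Always restore the real class, even if the body raised.
        socket.socket = original


def offline_work(text: str) -> str:
    """Stand-in for parse(): pure string processing, no I/O."""
    return " ".join(text.split())


with no_network():
    cleaned = offline_work("Sector  31,   Gurgaon 122001")
print(cleaned)
```

Inside a pytest test, the same swap is usually written with the built-in `monkeypatch` fixture (`monkeypatch.setattr(socket, "socket", blocked)`), which undoes the patch automatically when the test ends.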
331
+
332
+ ---
333
+
334
+ ## Benchmarks
335
+
336
+ `bharataddress` ships with a 200-row hand-labelled gold set (`tests/data/gold_200.jsonl`) covering metro / tier-2 / rural / landmark-heavy / vernacular / no-pincode / irregular-punctuation / S-O-format inputs. `scripts/evaluate.py` reports per-field precision / recall / F1 plus exact-match. The matcher is two-way substring (`a in b or b in a`), case-insensitive.
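
To make the metric concrete, here is a minimal sketch of a two-way substring matcher and a per-field precision / recall / F1 computation. It is not the code in `scripts/evaluate.py`: the function names are hypothetical, and the handling of empty values (an empty prediction or label never matches, and a wrong non-empty prediction counts as both a false positive and a false negative) is an assumed convention.

```python
from typing import Iterable, Optional, Tuple


def two_way_match(pred: Optional[str], gold: Optional[str]) -> bool:
    """Case-insensitive `a in b or b in a`; empty values never match."""
    a = (pred or "").strip().lower()
    b = (gold or "").strip().lower()
    return bool(a) and bool(b) and (a in b or b in a)


def field_prf(pairs: Iterable[Tuple[Optional[str], Optional[str]]]):
    """Return (precision, recall, f1) for one field over (pred, gold) pairs."""
    tp = fp = fn = 0
    for pred, gold in pairs:
        if two_way_match(pred, gold):
            tp += 1
            continue
        if pred:
            fp += 1  # predicted something that was wrong or unexpected
        if gold:
            fn += 1  # missed or mangled a labelled value
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


pairs = [
    ("Gurgaon", "gurgaon"),              # exact match, ignoring case
    ("Sector 31", "Sector 31, Gurgaon"),  # substring match
    ("Mumbai", "Pune"),                   # wrong prediction
    (None, "Delhi"),                      # missed value
]
precision, recall, f1 = field_prf(pairs)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Because substring matching is lenient, a prediction such as "Sector 31" is credited against a longer gold label, which is worth keeping in mind when reading the F1 numbers.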
337
+
338
+ ### bharataddress v0.1.2 vs Shiprocket TinyBERT NER
339
+
340
+ The only other open-source Indian address parser of comparable scope is [`shiprocket-ai/open-tinybert-indian-address-ner`](https://huggingface.co/shiprocket-ai/open-tinybert-indian-address-ner) — a fine-tuned TinyBERT (~760 MB, Apache-2.0). It claims Micro F1 0.94 on a private set; this is the first public head-to-head I'm aware of. Both models were run over the same `gold_200.jsonl`. Reproduce with `python scripts/eval_competitor.py`.
341
+
342
+ | Field | bharataddress v0.1.2 F1 | TinyBERT F1 | Winner |
343
+ | ----------------- | ----------------------: | ----------: | ------------------ |
344
+ | `pincode` | **0.995** | 0.984 | bharataddress |
345
+ | `city` | **0.959** | 0.718 | bharataddress (+0.24) |
346
+ | `building_number` | 0.958 | **0.973** | TinyBERT (+0.02) |
347
+ | `state` | **0.923** | 0.268 | bharataddress (+0.66) |
348
+ | `landmark` | **0.918** | 0.580 | bharataddress (+0.34) |
349
+ | `district` | **0.933** | N/A* | bharataddress |
350
+ | `locality` | **0.723** | 0.634 | bharataddress (+0.09) |
351
+ | `building_name` | 0.635 | **0.643** | TinyBERT (+0.01) |
352
+ | `sub_locality` | 0.472 | 0.470 | tied |
353
+
354
+ \* TinyBERT has no `district` label; closest equivalent in its label set is `state`.
355
+
356
+ **Exact-match (all 9 fields must match):** bharataddress **48.5%** (97/200) vs TinyBERT 1.0% (2/200). The exact-match gap overstates the difference: TinyBERT has no `district` label, so it can never produce that field, and it cannot reliably reach `state` without a pincode lookup. The per-field F1 is the apples-to-apples view.
357
+
358
+ **Where each model wins:**
359
+ - `bharataddress` wins decisively on pincode-derived fields (`city`, `district`, `state`, `pincode`) because the embedded India Post directory turns these into a lookup, not a prediction. It also handles `landmark` better thanks to the `Near/Opp/Behind/Beside` cue list.
360
+ - `TinyBERT` is essentially tied on `building_number`, `building_name`, and `sub_locality` — fields where context matters more than vocabulary.
361
+ - Neither model is good at `sub_locality` yet (~0.47) — both struggle to disambiguate "MG Road" (sub_locality) from "Indiranagar" (locality) when the input is sparse.
362
+
363
+ **Footprint comparison:**
364
+
365
+ | | bharataddress v0.1.2 | TinyBERT NER |
366
+ | ---------------------------- | -------------------- | ------------ |
367
+ | Install size | ~5 MB (incl. 23k pincodes) | ~760 MB |
368
+ | Runtime dependencies | none | torch, transformers (~2 GB) |
369
+ | First-call latency | ~5 ms | ~150 ms (CPU) |
370
+ | Network calls during parse | zero (enforced) | zero (after download) |
371
+ | GPU required | no | no, but recommended |
372
+ | Pincode → district/state | yes (lookup) | no |
373
+
374
+ For high-throughput pipelines, batch geocoding, or any environment where dropping a 760 MB model is a non-starter (serverless, mobile, edge), `bharataddress` is the better fit. For free-form addresses where you don't have a pincode at all, TinyBERT's text-only approach is competitive on the structural fields.
375
+
376
+ ---
377
+
378
+ ## Roadmap
379
+
380
+ - **v0.2** — Opt-in Claude API parser for the messy 20% • phonetic fuzzy matching • Nominatim geocoding • Devanagari preprocessing
381
+ - **v0.3** — Pincode boundary GeoJSON • spatial validation • FastAPI server • Docker
382
+ - **v0.4** — Distilled local model trained on Claude-generated parses (eliminates LLM cost at scale)
383
+
384
+ **The moat is the data, not the parser.** Every paid-tier user who corrects an address makes the dataset better. Free core (this package, MIT) + paid layer for continuously updated, validated locality and boundary data.
385
+
386
+ ---
387
+
388
+ ## Contributing
389
+
390
+ Issues and PRs welcome. The most useful contributions right now:
391
+
392
+ 1. **Failing addresses** — open an issue with a real-world string the parser mangles
393
+ 2. **Vernacular mappings** — add to `bharataddress/data/vernacular_mappings.json`
394
+ 3. **Test cases** — add to `tests/test_parse.py`
395
+
396
+ ---
397
+
398
+ ## License
399
+
400
+ MIT. Use it for anything. The data sources (India Post directory) are public domain via data.gov.in.