PyPI - address-toolkit - Versions diffs - 1.0.0__tar.gz - Mend

address-toolkit 1.0.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (41)

address_toolkit-1.0.0/.github/dependabot.yml ADDED

@@ -0,0 +1,15 @@
+# To get started with Dependabot version updates, you'll need to specify which
+# package ecosystems to update and where the package manifests are located.
+# Please see the documentation for all configuration options:
+# https://docs.github.com/code-security/dependabot/dependabot-version-updates/configuration-options-for-the-dependabot.yml-file
+version: 2
+updates:
+  - package-ecosystem: "pip" # See documentation for possible values
+    directory: "/" # Location of package manifests
+    schedule:
+      interval: "weekly"
+  - package-ecosystem: "github-actions" # See documentation for possible values
+    directory: "/" # Location of package manifests
+    schedule:
+      interval: "weekly"

address_toolkit-1.0.0/.github/workflows/CODEOWNERS ADDED

@@ -0,0 +1 @@
+* @DanHarrisONS @BenMoscropONS

address_toolkit-1.0.0/.github/workflows/python-package.yml ADDED

@@ -0,0 +1,38 @@
+# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
+# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python
+name: Python Package Version Testing
+on:
+  push:
+    branches: [ "main" ]
+  pull_request:
+    branches: [ "main" ]
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.9", "3.10"]
+    steps:
+    - uses: actions/checkout@v4
+    - name: Set up Python ${{ matrix.python-version }}
+      uses: actions/setup-python@v3
+      with:
+        python-version: ${{ matrix.python-version }}
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        python -m pip install flake8 pytest
+        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
+    - name: Set up Java 8
+      uses: actions/setup-java@v4
+      with:
+        java-version: '8'
+        distribution: temurin
+    - name: Test with pytest
+      run: |
+        pytest

address_toolkit-1.0.0/.github/workflows/python-publish.yml ADDED

@@ -0,0 +1,57 @@
+# This workflow will upload a Python Package to PyPI when a release is created
+# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python#publishing-to-package-registries
+# This workflow uses actions that are not certified by GitHub.
+# They are provided by a third-party and are governed by
+# separate terms of service, privacy policy, and support
+# documentation.
+name: Upload Python Package
+on:
+  release:
+    types: [published]
+permissions:
+  contents: read
+jobs:
+  release-build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.10"
+      - name: Build release distributions
+        run: |
+          # NOTE: put your own distribution build steps here.
+          python -m pip install build
+          python -m build
+      - name: Upload distributions
+        uses: actions/upload-artifact@v4
+        with:
+          name: release-dists
+          path: dist/
+  pypi-publish:
+    runs-on: ubuntu-latest
+    needs:
+      - release-build
+    permissions:
+      id-token: write
+    steps:
+      - name: Retrieve release distributions
+        uses: actions/download-artifact@v4
+        with:
+          name: release-dists
+          path: dist/
+      - name: Publish release distributions to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1
+        with:
+          packages-dir: dist/

address_toolkit-1.0.0/AI.gitignore ADDED

@@ -0,0 +1,23 @@
+# Ignore cache files
+.cache/
+.pip/
+.local/
+.config/
+.cache/
+# Ignore Python virtual environments
+venv/
+.venv/
+# Ignore Jupyter notebook checkpoints
+.ipynb_checkpoints/
+# Ignore other system files
+.DS_Store
+Thumbs.db
+# Ignore SSH keys and config
+.ssh/
+# Ignore Test Files
+test_files/

address_toolkit-1.0.0/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) Crown Copyright
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

address_toolkit-1.0.0/PKG-INFO ADDED

@@ -0,0 +1,90 @@
+Metadata-Version: 2.4
+Name: address-toolkit
+Version: 1.0.0
+Summary: Toolkit for cleaning, validating, extracting and contextualising GB address data with PySpark.
+Author-email: Dan Harris <Dan.Harris@ONS.gov.uk>, Ben Moscrop <Ben.Moscrop@ONS.gov.uk>, Stephen Rowlands <Stephen.Rowlands@ONS.gov.uk>
+License: MIT
+Project-URL: Homepage, https://github.com/ONSdigital/address-toolkit
+Project-URL: Documentation, https://github.com/ONSdigital/address-toolkit/blob/main/README.md
+Keywords: address,postcode,address-cleaning,address-validating,address-extracting,address-contextualising
+Requires-Python: <3.11,>=3.9
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: pyspark==3.3.2
+Requires-Dist: RapidFuzz==3.13.0
+Dynamic: license-file
+# Address Toolkit
+![Python](https://img.shields.io/badge/Python-3.10-blue)
+![License](https://img.shields.io/badge/License-MIT-orange)
+## Introduction
+The address toolkit package supports the cleaning and processing of address data registered as a `pyspark.sql.DataFrame`.
+The package includes functions for cleaning, validating, extracting and contextualising addresses and address components.
+Additionally, `workflows` have been created as an 'out-of-the-box' application combining the functions across `cleaning`, `validating` and `extracting`.
+To install the package, run:
+```
+pip install address-toolkit
+```
+To register a Spark DataFrame from a CSV file, run the following code.
+```
+from pyspark.sql import SparkSession
+spark = SparkSession.builder.master("local").appName("test").getOrCreate()
+df = spark.read.csv("test.csv", header=True)  # header=True keeps the CSV column names
+```
+A recommended first step in using this package is to apply the `clean_punctuation` function from `address_toolkit.cleaning`.
+```
+from address_toolkit.cleaning import clean_punctuation
+df = clean_punctuation(df, 'address_column', create_flag = True, overwrite = True)
+```
+## Main Package Contents:
+| Folder Name | Description | Includes |
+| ------------- | ------------- | ------------- |
+| `cleaning` | Contains functions to clean addresses | `clean_punctuation`, `denoise_addresses`, `deduplicate_addresses`, `deduplicate_postcodes`, `rectify_postcodes`, `standardise_street_types`, `prettify_addresses` |
+| `validating` | Contains functions to validate addresses | `validate_from_list`, `validate_from_regex`, `validate_postcodes` |
+| `extracting` | Contains functions to extract address components | `extract_from_list`, `extract_from_regex`, `extract_postcodes` |
+| `contextualising` | Contains functions to contextualise addresses | `contextualise_from_component` |
+| `workflows` | Contains functions to streamline processing | `clean_addresses`, `validate_addresses`, `extract_address_components` |
+## Supplementary Package Contents (Resources):
+| Resource | Includes |
+| ------------- | ------------- |
+| UK Postcode Regex | `postcode_regex` |
+| Unit Address Level Regex | `flat_regex`, `room_regex`, `unit_regex`, `block_regex`, `apartment_regex`, `floor_regex` |
+| Noise Regex | `consecutive_letters_regex` |
+| Miscellaneous Regex | `misc_numbers_regex`, `standalone_numbers_regex`, `txt_before_numbers_regex`, `end_address_numbers_regex`, `start_address_numbers_regex`, `end_address_identifier_regex` |
+| Address Component Lists | `town_list`, `city_list`, `village_list`, `hamlet_list`, `suburb_list`, `bay_list`, `place_list`, `district_list`, `county_list`, `allowed_country_list`, `disallowed_country_list` |
+| Keyword Component Lists | `contextual_keywords`, `misc_keywords_list`, `misc_special_keywords` |
+| Address Component Lookups | `town_lookup`, `village_lookup`, `bay_lookup`, `hamlet_lookup`, `suburb_lookup` |
+Note:
+For the unit-level address regex patterns, less strict alternatives (e.g. `flat_regex_alternative`) are also available.
+Address Component lists and lookups have been created from [Ordnance Survey Open Names](https://osdatahub.os.uk/data/downloads/open/OpenNames).
+Contains OS data © Crown copyright and database right 2026.
+Licensed under the Open Government Licence v3.0. This dataset is updated quarterly in January, April, July and October.
+Postcode processing is based upon the [Annex C - Valid post code format](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/611951/Appendix_C_ILR_2017_to_2018_v1_Published_28April17.pdf) specification.
+## Example Usage
+See `tutorial.ipynb` for full use of all the functions within `cleaning`, `validating`, `extracting`, `contextualising` and `workflows`.
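
The `Requires-Python: <3.11,>=3.9` field in the metadata above means pip should refuse to install the package on interpreters outside that range. As a rough illustration only, the same bounds can be checked by hand before installing. The function name `is_supported` and the two constants below are hypothetical and are not part of `address-toolkit`.

```python
import sys

# Bounds copied from the Requires-Python field: >=3.9 and <3.11.
MIN_VERSION = (3, 9)
MAX_VERSION_EXCLUSIVE = (3, 11)


def is_supported(version=None):
    """Return True if the (major, minor) version lies inside the declared range.

    `version` is any tuple whose first two items are major and minor;
    when omitted, the running interpreter's version is used.
    """
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE


if __name__ == "__main__":
    # Report on the interpreter running this script.
    status = "supported" if is_supported() else "outside the declared range"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is {status}")
```

This matches the CI matrix in `python-package.yml`, which tests only Python 3.9 and 3.10.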

address_toolkit-1.0.0/README.md ADDED

@@ -0,0 +1,74 @@
+# Address Toolkit
+![Python](https://img.shields.io/badge/Python-3.10-blue)
+![License](https://img.shields.io/badge/License-MIT-orange)
+## Introduction
+The address toolkit package supports the cleaning and processing of address data registered as a `pyspark.sql.DataFrame`.
+The package includes functions for cleaning, validating, extracting and contextualising addresses and address components.
+Additionally, `workflows` have been created as an 'out-of-the-box' application combining the functions across `cleaning`, `validating` and `extracting`.
+To install the package, run:
+```
+pip install address-toolkit
+```
+To register a Spark DataFrame from a CSV file, run the following code.
+```
+from pyspark.sql import SparkSession
+spark = SparkSession.builder.master("local").appName("test").getOrCreate()
+df = spark.read.csv("test.csv", header=True)  # header=True keeps the CSV column names
+```
+A recommended first step in using this package is to apply the `clean_punctuation` function from `address_toolkit.cleaning`.
+```
+from address_toolkit.cleaning import clean_punctuation
+df = clean_punctuation(df, 'address_column', create_flag = True, overwrite = True)
+```
+## Main Package Contents:
+| Folder Name | Description | Includes |
+| ------------- | ------------- | ------------- |
+| `cleaning` | Contains functions to clean addresses | `clean_punctuation`, `denoise_addresses`, `deduplicate_addresses`, `deduplicate_postcodes`, `rectify_postcodes`, `standardise_street_types`, `prettify_addresses` |
+| `validating` | Contains functions to validate addresses | `validate_from_list`, `validate_from_regex`, `validate_postcodes` |
+| `extracting` | Contains functions to extract address components | `extract_from_list`, `extract_from_regex`, `extract_postcodes` |
+| `contextualising` | Contains functions to contextualise addresses | `contextualise_from_component` |
+| `workflows` | Contains functions to streamline processing | `clean_addresses`, `validate_addresses`, `extract_address_components` |
+## Supplementary Package Contents (Resources):
+| Resource | Includes |
+| ------------- | ------------- |
+| UK Postcode Regex | `postcode_regex` |
+| Unit Address Level Regex | `flat_regex`, `room_regex`, `unit_regex`, `block_regex`, `apartment_regex`, `floor_regex` |
+| Noise Regex | `consecutive_letters_regex` |
+| Miscellaneous Regex | `misc_numbers_regex`, `standalone_numbers_regex`, `txt_before_numbers_regex`, `end_address_numbers_regex`, `start_address_numbers_regex`, `end_address_identifier_regex` |
+| Address Component Lists | `town_list`, `city_list`, `village_list`, `hamlet_list`, `suburb_list`, `bay_list`, `place_list`, `district_list`, `county_list`, `allowed_country_list`, `disallowed_country_list` |
+| Keyword Component Lists | `contextual_keywords`, `misc_keywords_list`, `misc_special_keywords` |
+| Address Component Lookups | `town_lookup`, `village_lookup`, `bay_lookup`, `hamlet_lookup`, `suburb_lookup` |
+Note:
+For the unit-level address regex patterns, less strict alternatives (e.g. `flat_regex_alternative`) are also available.
+Address Component lists and lookups have been created from [Ordnance Survey Open Names](https://osdatahub.os.uk/data/downloads/open/OpenNames).
+Contains OS data © Crown copyright and database right 2026.
+Licensed under the Open Government Licence v3.0. This dataset is updated quarterly in January, April, July and October.
+Postcode processing is based upon the [Annex C - Valid post code format](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/611951/Appendix_C_ILR_2017_to_2018_v1_Published_28April17.pdf) specification.
+## Example Usage
+See `tutorial.ipynb` for full use of all the functions within `cleaning`, `validating`, `extracting`, `contextualising` and `workflows`.
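
The README lists a `postcode_regex` resource and `validate_postcodes` / `extract_postcodes` functions, but this part of the diff does not show the pattern itself. To give a feel for what format-level postcode validation involves, here is a minimal pure-Python sketch. The pattern is a deliberately simplified assumption (outward code, optional space, inward code) and is not the package's `postcode_regex`; the name `looks_like_postcode` is likewise hypothetical.

```python
import re

# Simplified, illustrative pattern only; NOT the package's `postcode_regex`.
# Outward code: 1-2 letters, a digit, then an optional letter or digit.
# Inward code: a digit followed by two letters. The space is optional.
SIMPLE_POSTCODE = re.compile(r"^[A-Z]{1,2}[0-9][A-Z0-9]? ?[0-9][A-Z]{2}$")


def looks_like_postcode(text):
    """Return True if `text` has the general shape of a GB postcode."""
    # Normalise case and surrounding whitespace before matching.
    return bool(SIMPLE_POSTCODE.match(text.strip().upper()))
```

A check like this only tests the shape of the string. It accepts some strings that are not real postcodes and says nothing about whether a postcode is in use.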

address_toolkit-1.0.0/address_toolkit/__init__.py ADDED

File without changes

address_toolkit-1.0.0/address_toolkit/cleaning/__init__.py ADDED

@@ -0,0 +1,12 @@
+from address_toolkit.cleaning.cleaning import (
+    clean_punctuation,
+    deduplicate_addresses,
+    deduplicate_postcodes,
+    denoise_addresses,
+    prettify_addresses,
+    rectify_postcodes,
+    standardise_street_types,
+)
+__all__ = ['clean_punctuation', 'denoise_addresses', 'deduplicate_addresses', 'deduplicate_postcodes',
+           'rectify_postcodes', 'standardise_street_types', 'prettify_addresses']
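
The `__all__` list above is maintained by hand, as is the `cleaning` row of the README table, so the two can drift apart. The sketch below shows one way to compare them, using names copied from this diff. The helper `compare_names` is hypothetical and is not part of the package.

```python
# Names copied from `__all__` in address_toolkit/cleaning/__init__.py.
exported = [
    "clean_punctuation", "denoise_addresses", "deduplicate_addresses",
    "deduplicate_postcodes", "rectify_postcodes", "standardise_street_types",
    "prettify_addresses",
]

# Names copied from the `cleaning` row of the README table.
documented = [
    "clean_punctuation", "denoise_addresses", "deduplicate_addresses",
    "deduplicate_postcodes", "rectify_postcodes", "standardise_street_types",
    "prettify_addresses",
]


def compare_names(exported, documented):
    """Return (undocumented, unexported, duplicated) name lists, each sorted."""
    undocumented = sorted(set(exported) - set(documented))
    unexported = sorted(set(documented) - set(exported))
    duplicated = sorted({name for name in exported if exported.count(name) > 1})
    return undocumented, unexported, duplicated
```

For version 1.0.0 all three lists come back empty, so the README and the `cleaning` exports agree.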