PyPI - splink - Versions diffs - 4.0.0.dev2__tar.gz → 4.0.0.dev4__tar.gz - Mend


This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (158)

{splink-4.0.0.dev2 → splink-4.0.0.dev4}/PKG-INFO RENAMED

@@ -1,34 +1,36 @@
 Metadata-Version: 2.1
 Name: splink
-Version: 4.0.0.dev2
+Version: 4.0.0.dev4
 Summary: Fast probabilistic data linkage at scale
 Home-page: https://github.com/moj-analytical-services/splink
 License: MIT
 Author: Robin Linacre
 Author-email: robinlinacre@hotmail.com
-Requires-Python: >=3.7.1,<4.0.0
+Requires-Python: >=3.8.0,<4.0.0
 Classifier: License :: OSI Approved :: MIT License
 Classifier: Programming Language :: Python :: 3
 Classifier: Programming Language :: Python :: 3.8
 Classifier: Programming Language :: Python :: 3.9
 Classifier: Programming Language :: Python :: 3.10
 Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
 Provides-Extra: athena
 Provides-Extra: postgres
 Provides-Extra: pyspark
 Provides-Extra: spark
 Requires-Dist: Jinja2 (>=3.0.3)
 Requires-Dist: altair (>=5.0.1,<6.0.0)
-Requires-Dist: awswrangler (==2.18.0) ; (python_full_version >= "3.7.1" and python_version < "3.8") and (extra == "athena")
-Requires-Dist: awswrangler (>=3.0.0,<4.0.0) ; (python_version >= "3.8" and python_version < "4.0") and (extra == "athena")
-Requires-Dist: duckdb (>=0.8.0)
-Requires-Dist: jsonschema (>=3.2,<5.0)
-Requires-Dist: pandas (>1.3.0)
-Requires-Dist: phonetics (>=1.0.5,<2.0.0)
+Requires-Dist: awswrangler (>=3.0.0,<4.0.0) ; (python_version >= "3.8") and (extra == "athena")
+Requires-Dist: duckdb (>=0.9.2)
+Requires-Dist: igraph (>=0.11.2) ; python_version >= "3.8"
+Requires-Dist: jsonschema (>=3.2)
+Requires-Dist: numpy (>=1.17.3) ; python_version < "3.12"
+Requires-Dist: numpy (>=1.26.0) ; python_version >= "3.12"
+Requires-Dist: pandas (>1.3.5)
+Requires-Dist: phonetics (>=1.0.5)
 Requires-Dist: psycopg2-binary (>=2.8.0) ; extra == "postgres"
-Requires-Dist: pyspark (>=3.2.1,<4.0.0) ; extra == "pyspark" or extra == "spark"
-Requires-Dist: sqlalchemy (>=1.4.0,<2.0.0) ; extra == "postgres"
-Requires-Dist: sqlglot (>=13.0.0,<19.0.0)
+Requires-Dist: pyspark (>=3.2.1) ; extra == "pyspark" or extra == "spark"
+Requires-Dist: sqlglot (>=13.0.0)
 Project-URL: Repository, https://github.com/moj-analytical-services/splink
 Description-Content-Type: text/markdown
@@ -40,7 +42,8 @@ Description-Content-Type: text/markdown
 [![Downloads](https://static.pepy.tech/badge/splink/month)](https://pepy.tech/project/splink)
 [![Documentation](https://img.shields.io/badge/API-documentation-blue)](https://moj-analytical-services.github.io/splink/)
+> [!IMPORTANT]
+> Development has begun on Splink 4 on the `splink4_dev` branch.  Splink 3 is in maintenance mode and we are no longer accepting new features. We welcome contributions to Splink 4. Read more on our latest [blog](https://moj-analytical-services.github.io/splink/blog/2024/03/19/splink4.html).
 # Fast, accurate and scalable probabilistic data linkage
@@ -54,7 +57,7 @@ Splink is a Python package for probabilistic record linkage (entity resolution)
 🎓 **Unsupervised Learning:** No training data is required for model training.
 📊 **Interactive Outputs:** A suite of interactive visualisations help users understand their model and diagnose problems.
-Splink's linkage algorithm is based on Fellegi-Sunter's model of record linkage, with various customizations to improve accuracy.
+Splink's linkage algorithm is based on Fellegi-Sunter's model of record linkage, with various customisations to improve accuracy.
 ## What does Splink do?
@@ -72,7 +75,7 @@ and clusters these links to produce an estimated person ID:
 ## What data does Splink work best with?
-Before using Splink, input data should be standardized, with consistent column names and formatting (e.g., lowercased, punctuation cleaned up, etc.).
+Before using Splink, input data should be standardised, with consistent column names and formatting (e.g., lowercased, punctuation cleaned up, etc.).
 Splink performs best with input data containing **multiple** columns that are **not highly correlated**. For instance, if the entity type is persons, you may have columns for full name, date of birth, and city. If the entity type is companies, you could have columns for name, turnover, sector, and telephone number.
@@ -104,39 +107,33 @@ or, if you prefer, you can instead install splink using conda:
 conda install -c conda-forge splink
 ```
-<details>
-<summary><h3>Additional installation methods</h3></summary>
+### Installing Splink for Specific Backends
-<br>
-### Backend Specific Installs
-From Splink v3.9.7, packages required by specific splink backends can be optionally installed by adding the `[<backend>]` suffix to the end of your `pip install`.
+For projects requiring specific backends, Splink offers optional installations for **Spark**, **Athena**, and **PostgreSQL**. These can be installed by appending the backend name in brackets to the pip install command:
+```sh
+pip install 'splink[{backend}]'
+```
-**Note** that **SQLite** and **DuckDB** come packaged with Splink and do not need to be optionally installed.
+Should you require a version of Splink without **DuckDB**, see our section on [DuckDBLess Splink Installation](https://moj-analytical-services.github.io/splink/installations.html#duckdb-less-installation).
-Backends supported by optional installs:
+<details>
+<summary><i>Click here for backend-specific installation commands</i></summary>
-**Spark**
+#### Spark
 ```sh
 pip install 'splink[spark]'
 ```
-**Athena**
+#### Athena
 ```sh
 pip install 'splink[athena]'
 ```
-**PostgreSQL**
+#### PostgreSQL
 ```sh
 pip install 'splink[postgres]'
 ```
-<br>
-### DuckDBLess Splink
-Should you require a more bare-bones version of Splink **without DuckDB**, please see the following area of the docs:
-> [DuckDBless Splink Installation](https://moj-analytical-services.github.io/splink/installations.html#duckdb-less-installation)
 </details>
 ## Quickstart
@@ -192,23 +189,29 @@ clusters.as_pandas_dataframe(limit=5)
 ## Charts Gallery
-You can see all of the interactive charts provided in Splink by checking out the [Charts Gallery](./charts/index.md).
+You can see all of the interactive charts provided in Splink by checking out the [Charts Gallery](https://moj-analytical-services.github.io/splink/charts/index.html).
 ## Support
 To find the best place to ask a question, report a bug or get general advice, please refer to our [Contributing Guide](./CONTRIBUTING.md).
+## Use Cases
+To see how users are using Splink in the wild, check out the [Use Cases](https://moj-analytical-services.github.io/splink/#use-cases) section of the docs.
 ## Awards
-🥇 Analysis in Government Awards 2020: Innovative Methods - [Winner](https://www.gov.uk/government/news/launch-of-the-analysis-in-government-awards)
+❓ Future of Government Awards 2023: Open Source Creation - [Shortlisted, result to be announced shortly](https://futureofgovernment.com/en)
-🥇 MoJ DASD Awards 2020: Innovation and Impact - Winner
+🥈 Civil Service Awards 2023: Best Use of Data, Science, and Technology - [Runner up](https://www.civilserviceawards.com/best-use-of-data-science-and-technology-award-2/)
 🥇 Analysis in Government Awards 2022: People's Choice Award - [Winner](https://analysisfunction.civilservice.gov.uk/news/announcing-the-winner-of-the-first-analysis-in-government-peoples-choice-award/)
 🥈 Analysis in Government Awards 2022: Innovative Methods - [Runner up](https://twitter.com/gov_analysis/status/1616073633692274689?s=20&t=6TQyNLJRjnhsfJy28Zd6UQ)
-🥈 Civil Service Awards 2023: Best Use of Data, Science, and Technology - [Runner up](https://www.civilserviceawards.com/best-use-of-data-science-and-technology-award-2/)
+🥇 Analysis in Government Awards 2020: Innovative Methods - [Winner](https://www.gov.uk/government/news/launch-of-the-analysis-in-government-awards)
+🥇 MoJ Data and Analytical Services Directorate (DASD) Awards 2020: Innovation and Impact - Winner
 ## Citation

{splink-4.0.0.dev2 → splink-4.0.0.dev4}/README.md RENAMED

@@ -6,7 +6,8 @@
 [![Downloads](https://static.pepy.tech/badge/splink/month)](https://pepy.tech/project/splink)
 [![Documentation](https://img.shields.io/badge/API-documentation-blue)](https://moj-analytical-services.github.io/splink/)
+> [!IMPORTANT]
+> Development has begun on Splink 4 on the `splink4_dev` branch.  Splink 3 is in maintenance mode and we are no longer accepting new features. We welcome contributions to Splink 4. Read more on our latest [blog](https://moj-analytical-services.github.io/splink/blog/2024/03/19/splink4.html).
 # Fast, accurate and scalable probabilistic data linkage
@@ -20,7 +21,7 @@ Splink is a Python package for probabilistic record linkage (entity resolution)
 🎓 **Unsupervised Learning:** No training data is required for model training.
 📊 **Interactive Outputs:** A suite of interactive visualisations help users understand their model and diagnose problems.
-Splink's linkage algorithm is based on Fellegi-Sunter's model of record linkage, with various customizations to improve accuracy.
+Splink's linkage algorithm is based on Fellegi-Sunter's model of record linkage, with various customisations to improve accuracy.
 ## What does Splink do?
@@ -38,7 +39,7 @@ and clusters these links to produce an estimated person ID:
 ## What data does Splink work best with?
-Before using Splink, input data should be standardized, with consistent column names and formatting (e.g., lowercased, punctuation cleaned up, etc.).
+Before using Splink, input data should be standardised, with consistent column names and formatting (e.g., lowercased, punctuation cleaned up, etc.).
 Splink performs best with input data containing **multiple** columns that are **not highly correlated**. For instance, if the entity type is persons, you may have columns for full name, date of birth, and city. If the entity type is companies, you could have columns for name, turnover, sector, and telephone number.
@@ -70,39 +71,33 @@ or, if you prefer, you can instead install splink using conda:
 conda install -c conda-forge splink
 ```
-<details>
-<summary><h3>Additional installation methods</h3></summary>
+### Installing Splink for Specific Backends
-<br>
-### Backend Specific Installs
-From Splink v3.9.7, packages required by specific splink backends can be optionally installed by adding the `[<backend>]` suffix to the end of your `pip install`.
+For projects requiring specific backends, Splink offers optional installations for **Spark**, **Athena**, and **PostgreSQL**. These can be installed by appending the backend name in brackets to the pip install command:
+```sh
+pip install 'splink[{backend}]'
+```
-**Note** that **SQLite** and **DuckDB** come packaged with Splink and do not need to be optionally installed.
+Should you require a version of Splink without **DuckDB**, see our section on [DuckDBLess Splink Installation](https://moj-analytical-services.github.io/splink/installations.html#duckdb-less-installation).
-Backends supported by optional installs:
+<details>
+<summary><i>Click here for backend-specific installation commands</i></summary>
-**Spark**
+#### Spark
 ```sh
 pip install 'splink[spark]'
 ```
-**Athena**
+#### Athena
 ```sh
 pip install 'splink[athena]'
 ```
-**PostgreSQL**
+#### PostgreSQL
 ```sh
 pip install 'splink[postgres]'
 ```
-<br>
-### DuckDBLess Splink
-Should you require a more bare-bones version of Splink **without DuckDB**, please see the following area of the docs:
-> [DuckDBless Splink Installation](https://moj-analytical-services.github.io/splink/installations.html#duckdb-less-installation)
 </details>
 ## Quickstart
@@ -158,23 +153,29 @@ clusters.as_pandas_dataframe(limit=5)
 ## Charts Gallery
-You can see all of the interactive charts provided in Splink by checking out the [Charts Gallery](./charts/index.md).
+You can see all of the interactive charts provided in Splink by checking out the [Charts Gallery](https://moj-analytical-services.github.io/splink/charts/index.html).
 ## Support
 To find the best place to ask a question, report a bug or get general advice, please refer to our [Contributing Guide](./CONTRIBUTING.md).
+## Use Cases
+To see how users are using Splink in the wild, check out the [Use Cases](https://moj-analytical-services.github.io/splink/#use-cases) section of the docs.
 ## Awards
-🥇 Analysis in Government Awards 2020: Innovative Methods - [Winner](https://www.gov.uk/government/news/launch-of-the-analysis-in-government-awards)
+❓ Future of Government Awards 2023: Open Source Creation - [Shortlisted, result to be announced shortly](https://futureofgovernment.com/en)
-🥇 MoJ DASD Awards 2020: Innovation and Impact - Winner
+🥈 Civil Service Awards 2023: Best Use of Data, Science, and Technology - [Runner up](https://www.civilserviceawards.com/best-use-of-data-science-and-technology-award-2/)
 🥇 Analysis in Government Awards 2022: People's Choice Award - [Winner](https://analysisfunction.civilservice.gov.uk/news/announcing-the-winner-of-the-first-analysis-in-government-peoples-choice-award/)
 🥈 Analysis in Government Awards 2022: Innovative Methods - [Runner up](https://twitter.com/gov_analysis/status/1616073633692274689?s=20&t=6TQyNLJRjnhsfJy28Zd6UQ)
-🥈 Civil Service Awards 2023: Best Use of Data, Science, and Technology - [Runner up](https://www.civilserviceawards.com/best-use-of-data-science-and-technology-award-2/)
+🥇 Analysis in Government Awards 2020: Innovative Methods - [Winner](https://www.gov.uk/government/news/launch-of-the-analysis-in-government-awards)
+🥇 MoJ Data and Analytical Services Directorate (DASD) Awards 2020: Innovation and Impact - Winner
 ## Citation

{splink-4.0.0.dev2 → splink-4.0.0.dev4}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "splink"
-version = "4.0.0.dev2"
+version = "4.0.0.dev4"
 description = "Fast probabilistic data linkage at scale"
 authors = ["Robin Linacre <robinlinacre@hotmail.com>", "Sam Lindsay", "Theodore Manassis", "Tom Hepworth", "Andy Bond", "Ross Kennedy"]
 license = "MIT"
@@ -9,70 +9,63 @@ repository = "https://github.com/moj-analytical-services/splink"
 readme = "README.md"
 [tool.poetry.dependencies]
-python = ">=3.7.1,<4.0.0"
-jsonschema = ">=3.2,<5.0"
+python = ">=3.8.0,<4.0.0"
+jsonschema = ">=3.2"
 # 1.3.5 is the last version supporting py 3.7.1
-pandas = ">1.3.0"
-duckdb = ">=0.8.0"
-sqlglot = ">=13.0.0, <19.0.0"
+pandas = ">1.3.5"
+duckdb = ">=0.9.2"
+sqlglot = ">=13.0.0"
 altair = "^5.0.1"
 Jinja2 = ">=3.0.3"
-phonetics = "^1.0.5"
+phonetics = ">=1.0.5"
+# need to manually specify numpy versions suitable for CI
+# 1.24.4 works with python 3.8, but not 3.12
+numpy = [
+    # version is minimum valid with above listed pandas version
+    {version=">=1.17.3", python = "<3.12"},
+    {version=">=1.26.0", python = ">=3.12"},
+]
 # Optional installs
-pyspark = {version="^3.2.1", optional=true}
+# python >=3.12 requires pyspark >=4.0.0 - currently unreleased
+pyspark = {version=">=3.2.1", optional=true}
 awswrangler = [
-    {version = "2.18.0,<3.0.0", python = ">=3.7.1,<3.8", optional=true},
-    {version=">=3.0.0,<4.0.0", python = "^3.8", optional=true}
+    {version=">=3.0.0,<4.0.0", python = ">=3.8", optional=true}
 ]
-# sqlalchemy >= 2.0.0 not working well with older pandas
-sqlalchemy = {version=">=1.4.0,<2.0.0", optional=true}
 psycopg2-binary = {version=">=2.8.0", optional=true}
+# for graph metrics
+igraph = { version = ">=0.11.2", python = ">=3.8", optional=true }
 [tool.poetry.group.dev]
 [tool.poetry.group.dev.dependencies]
-tabulate = "0.8.9"
-pyspark = "^3.2.1"
-# sqlalchemy >= 2.0.0 not working well with older pandas
-sqlalchemy = ">=1.4.0,<2.0.0"
+tabulate = ">=0.8.9"
+pyspark = ">=3.2.1"
+sqlalchemy = ">=1.4.0"
 # temporarily use binary version, to avoid issues with pg_config path
 psycopg2-binary = ">=2.8.0"
+igraph = ">=0.11.2"
 [tool.poetry.group.linting]
 [tool.poetry.group.linting.dependencies]
-black = "22.6.0"
-ruff = "0.0.257"
+ruff = "^0.4.2"
 [tool.poetry.group.testing]
 [tool.poetry.group.testing.dependencies]
 # pin to reduce dependencies
-pytest = "7.3"
+pytest = ">=7.3"
 pyarrow = ">=7.0.0"
-networkx = "2.5.1"
-rapidfuzz = "^2.0.3"
-[tool.poetry.group.benchmarking]
-optional = true
-[tool.poetry.group.benchmarking.dependencies]
-pytest-benchmark = "^4"
-lzstring = "1.0.4"
+networkx = ">=2.5.1"
+rapidfuzz = ">=2.0.3"
 [tool.poetry.group.typechecking]
 optional = true
 [tool.poetry.group.typechecking.dependencies]
-mypy = "1.7.0"
-[tool.poetry.group.demos]
-[tool.poetry.group.demos.dependencies]
-importlib-resources = "5.4.0"
-jupyterlab = "3.6.1"
-pyarrow = ">=7.0.0"
-ipywidgets = "8.0.4"
-nbmake = "1.3.4"
-pytest = "^7.0"
-pyspark = "^3.2.1"
+mypy = "1.9.0"
 [tool.poetry.extras]
 pyspark = ["pyspark"]
@@ -89,7 +82,7 @@ profile = "black"
 [tool.ruff]
 line-length = 88
-select = [
+lint.select = [
     # Pyflakes
     "F",
     # Pycodestyle
@@ -102,7 +95,7 @@ select = [
     # flake8-print
     "T20"
 ]
-ignore = [
+lint.ignore = [
     "B905", # `zip()` without an explicit `strict=` parameter
     "B006", # Do not use mutable data structures for argument defaults"
 ]
@@ -122,22 +115,33 @@ markers = [
     "spark_only",
     "sqlite",
     "sqlite_only",
+    "postgres",
+    "postgres_only",
 ]
 [tool.mypy]
 packages = "splink"
-# temporary exclusions
-exclude = [
-    # modules getting substantial rewrites:
-    '.*comparison_imports\.py$',
-    '.*comparison.*library\.py',
-    'comparison_level_composition',
-    # modules with large number of errors
-    '.*linker\.py',
-]
 # for now at least allow implicit optionals
 # to cut down on noise. Easy to fix.
 implicit_optional = true
 # for now, ignore missing imports
 # can remove later and install stubs, where existent
 ignore_missing_imports = true
+# options for strict mode
+# too much to handle at once, so opt-in a little at a time
+# https://mypy.readthedocs.io/en/stable/existing_code.html#introduce-stricter-options
+warn_unused_configs = true
+warn_redundant_casts = true
+warn_unused_ignores = true
+strict_equality = true
+# don't worry about warning: https://github.com/python/mypy/issues/16189
+strict_concatenate = true
+check_untyped_defs = true
+disallow_subclassing_any = true
+disallow_untyped_decorators = true
+disallow_any_generics = true
+# further strict checks to add in:
+# disallow_untyped_calls = true
+disallow_incomplete_defs = true
+# disallow_untyped_defs = true
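
The `numpy` entry in the pyproject.toml diff above declares two version floors, selected by the Python version of the installing interpreter. The following is a minimal sketch of that selection logic only; `numpy_floor` is a hypothetical helper written for illustration and is not part of Splink. In practice the installer (pip or Poetry) evaluates the environment markers itself.

```python
import sys


def numpy_floor(version_info=sys.version_info) -> str:
    """Return the minimum numpy version the diff declares for a given Python.

    Hypothetical helper: mirrors the two marker-gated entries in the diff,
    python < 3.12  -> numpy >= 1.17.3
    python >= 3.12 -> numpy >= 1.26.0
    """
    major_minor = tuple(version_info[:2])
    return "1.26.0" if major_minor >= (3, 12) else "1.17.3"


if __name__ == "__main__":
    # Show which floor applies to the interpreter running this script.
    print(f"numpy floor for this interpreter: >={numpy_floor()}")
```

Only the major and minor components are compared, which matches how a `python_version` marker behaves; patch releases do not change the selected floor.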

splink-4.0.0.dev4/splink/__init__.py ADDED

@@ -0,0 +1,60 @@
+from typing import TYPE_CHECKING
+# Explicitly declare exported names to avoid 'imported but unused' linting issues
+__all__ = [
+    "block_on",
+    "splink_datasets",
+    "Linker",
+    "SettingsCreator",
+    "SQLiteAPI",
+    "SparkAPI",
+    "DuckDBAPI",
+    "PostgresAPI",
+]
+from splink.blocking_rule_library import block_on
+from splink.datasets import splink_datasets
+from splink.linker import Linker
+from splink.settings_creator import SettingsCreator
+from splink.sqlite.database_api import SQLiteAPI
+# The following is a workaround for the fact that dependencies of postgres, spark
+# and duckdb may not be installed, but we don't want this to prevent import
+# of the other backends.
+# This enables auto-complete to be used to import the various DBAPIs
+# and ensures that typing information is retained so e.g. the arguments autocomplete
+# without importing them at runtime
+if TYPE_CHECKING:
+    from splink.duckdb.database_api import DuckDBAPI
+    from splink.postgres.database_api import PostgresAPI
+    from splink.spark.database_api import SparkAPI
+# Use __getattr__ to make the error appear at the point of use
+def __getattr__(name):
+    try:
+        if name == "SparkAPI":
+            from splink.spark.database_api import SparkAPI
+            return SparkAPI
+        elif name == "DuckDBAPI":
+            from splink.duckdb.database_api import DuckDBAPI
+            return DuckDBAPI
+        elif name == "PostgresAPI":
+            from splink.postgres.database_api import PostgresAPI
+            return PostgresAPI
+    except ImportError as err:
+        if name in ["SparkAPI", "DuckDBAPI", "PostgresAPI"]:
+            raise ImportError(
+                f"{name} cannot be imported because its dependencies are not "
+                "installed. Please `pip install` the required package(s) as "
+                "specified in the optional dependencies in pyproject.toml"
+            ) from err
+    raise AttributeError(f"module 'splink' has no attribute '{name}'") from None
+__version__ = "4.0.0.dev4"
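
The new `__init__.py` defers the backend imports to a module-level `__getattr__`, so a missing optional dependency only raises when the corresponding name is first accessed. The sketch below reproduces that pattern in a self-contained form. The module name `demo_pkg`, the `_LAZY` table and the deliberately missing module `a_module_that_does_not_exist` are assumptions made for illustration; they do not appear in Splink.

```python
import importlib
import types

# Hypothetical table: exported name -> (module path, attribute name).
# The second entry points at a module that is assumed not to exist,
# standing in for a backend whose dependencies are not installed.
_LAZY = {
    "OrderedDict": ("collections", "OrderedDict"),
    "MissingBackend": ("a_module_that_does_not_exist", "MissingBackend"),
}


def make_lazy_module(name: str) -> types.ModuleType:
    """Build a module whose listed attributes are imported on first access."""
    mod = types.ModuleType(name)

    def __getattr__(attr: str):
        # Called only when normal attribute lookup on the module fails.
        if attr in _LAZY:
            module_path, obj_name = _LAZY[attr]
            try:
                return getattr(importlib.import_module(module_path), obj_name)
            except ImportError as err:
                raise ImportError(
                    f"{attr} cannot be imported because its dependencies "
                    "are not installed"
                ) from err
        raise AttributeError(f"module {name!r} has no attribute {attr!r}")

    # A function named __getattr__ in the module namespace acts as the
    # fallback hook for attribute access on that module.
    mod.__getattr__ = __getattr__
    return mod


demo = make_lazy_module("demo_pkg")
```

Accessing `demo.OrderedDict` triggers the import at that point, `demo.MissingBackend` raises `ImportError` with the friendlier message, and any other name raises `AttributeError` as usual. In the diff, the `TYPE_CHECKING` block serves the complementary purpose of keeping type information available to editors without importing the backends at runtime.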
