
data-validation-engine 0.7.1.tar.gz → 0.7.3.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (105)

{data_validation_engine-0.7.1 → data_validation_engine-0.7.3}/PKG-INFO RENAMED

@@ -1,15 +1,16 @@
 Metadata-Version: 2.4
 Name: data-validation-engine
-Version: 0.7.1
+Version: 0.7.3
 Summary: `nhs data validation engine` is a framework used to validate data
+License-Expression: MIT
 License-File: LICENSE
 Author: NHS England
 Author-email: england.contactus@nhs.net
 Requires-Python: >=3.10,<3.12
-Classifier: Operating System :: OS Independent
 Classifier: Programming Language :: Python :: 3
 Classifier: Programming Language :: Python :: 3.10
 Classifier: Programming Language :: Python :: 3.11
+Classifier: Operating System :: OS Independent
 Classifier: Topic :: Software Development :: Libraries
 Classifier: Typing :: Typed
 Requires-Dist: Jinja2 (==3.1.*)
@@ -18,13 +19,19 @@ Requires-Dist: botocore (>=1.34.162,<1.36)
 Requires-Dist: delta-spark (==2.4.*)
 Requires-Dist: duckdb (==1.1.*)
 Requires-Dist: lxml (>=4.9.1,<5.0.0)
+Requires-Dist: numpy (==1.26.4)
 Requires-Dist: openpyxl (>=3.1,<4.0)
 Requires-Dist: pandas (>=2.2.2,<3.0.0)
 Requires-Dist: polars (==0.20.*)
 Requires-Dist: pyarrow (>=17.0.0,<18.0.0)
-Requires-Dist: pydantic (==1.10.15)
+Requires-Dist: pydantic (==1.10.16)
 Requires-Dist: pyspark (==3.4.*)
 Requires-Dist: typing_extensions (>=4.6.2,<5.0.0)
+Project-URL: Changelog, https://github.com/NHSDigital/data-validation-engine/blob/main/CHANGELOG.md
+Project-URL: Documentation, https://nhsdigital.github.io/data-validation-engine/
+Project-URL: Homepage, https://github.com/NHSDigital/data-validation-engine
+Project-URL: Issues, https://github.com/NHSDigital/data-validation-engine/issues
+Project-URL: Repository, https://github.com/NHSDigital/data-validation-engine.git
 Description-Content-Type: text/markdown
 <h1 style="display: flex; align-items: center; gap: 10px;">
@@ -39,7 +46,7 @@ Description-Content-Type: text/markdown
 The Data Validation Engine (DVE) is a configuration driven data validation library built and utilised by NHS England. Currently the package has been reverted from v1.0.0 release to a 0.x as we feel the package is not yet mature enough to be considered a 1.0.0 release. So please bear this in mind if reading through the commits and references to a v1+ release when on v0.x.
-As mentioned above, the DVE is "configuration driven" which means the majority of development for you as a user will be building a JSON document to describe how the data will be validated. The JSON document is known as a `dischema` file and example files can be accessed [here](./tests/testdata/). If you'd like to learn more about JSON document and how to build one from scratch, then please read the documentation [here](./docs/).
+As mentioned above, the DVE is "configuration driven" which means the majority of development for you as a user will be building a JSON document to describe how the data will be validated. The JSON document is known as a `dischema` file and example files can be accessed [here](https://github.com/NHSDigital/data-validation-engine/tree/main/tests/testdata). If you'd like to learn more about JSON document and how to build one from scratch, then please read the documentation [here](https://nhsdigital.github.io/data-validation-engine/).
 Once a dischema file has been defined, you are ready to use the DVE. The DVE is typically orchestrated based on four key "services". These are...
@@ -50,7 +57,7 @@ Once a dischema file has been defined, you are ready to use the DVE. The DVE is
 | 3. | Business Rules | The business rules service will perform more complex validations such as comparisons between fields and tables, aggregations, filters etc to generate new entities. |
 | 4. | Error Reports | The error reports service will take all the errors raised in previous services and surface them into a readable format for a downstream users/service. Currently, this implemented to be an excel spreadsheet but could be reconfigured to meet other requirements/use cases. |
-If you'd like more detailed documentation around these services the please read the extended documentation [here](./docs/).
+If you'd like more detailed documentation around these services the please read the extended documentation [here](https://nhsdigital.github.io/data-validation-engine/).
 The DVE has been designed in a way that's modular and can support users who just want to utilise specific "services" from the DVE (i.e. just the file transformation + data contract). Additionally, the DVE is designed to support different backend implementations. As part of the base installation of DVE, you will find backend support for `Spark` and `DuckDB`. So, if you need a `MySQL` backend implementation, you can implement this yourself. Given our organisations requirements, it will be unlikely that we add anymore specific backend implementations into the base package beyond Spark and DuckDB. So, if you are unable to implement this yourself, I would recommend reading the guidance on [requesting new features and raising bug reports here](#requesting-new-features-and-raising-bug-reports).
@@ -72,7 +79,7 @@ pip install data-validation-engine
 *Note - Only versions >=0.6.2 are available on PyPi. For older versions please install directly from the git repo or build from source.*
-Once you have installed the DVE you are ready to use it. For guidance on how to create your dischema JSON document (configuration), please read the [documentation](./docs/).
+Once you have installed the DVE you are ready to use it. For guidance on how to create your dischema JSON document (configuration), please read the [documentation](https://nhsdigital.github.io/data-validation-engine/).
 Version 0.0.1 does support a working Python 3.7 installation. However, we will not be supporting any issues with that version of the DVE if you choose to use it. __Use at your own risk__.
@@ -85,17 +92,20 @@ If you have feature request then please follow the same process whilst using the
 ## Upcoming features
 Below is a list of features that we would like to implement or have been requested.
-| Feature | Release Version | Released? |
-| ------- | --------------- | --------- |
-| Open source release | 0.1.0 | Yes |
-| Uplift to Python 3.11 | 0.2.0 | Yes |
-| Upgrade to Pydantic 2.0 | Before 1.0 release | No |
-| Create a more user friendly interface for building and modifying dischema files | Not yet confirmed | No |
-Beyond the Python and Pydantic upgrade, we cannot confirm the other features will be made available anytime soon. Therefore, if you have the interest and desire to make these features available, then please read the [Contributing](#contributing) section and get involved.
+| Feature                                                                         | Release Version   | Released? |
+| ------------------------------------------------------------------------------- | ----------------- | --------- |
+| Open source release                                                             | 0.1.0             | Yes       |
+| Uplift to Python 3.11                                                           | 0.2.0             | Yes       |
+| Uplift Pyspark to 3.5                                                           | TBA               | No        |
+| Allow DVE to run on Python 3.12+                                                | TBA               | No        |
+| Upgrade to Pydantic 2.0                                                         | TBA               | No        |
+| Uplift Pyspark to 4.0+                                                          | TBA               | No        |
+| Create a more user friendly interface for building and modifying dischema files | Not yet confirmed | No        |
+Beyond the Python and Pydantic upgrade, we cannot confirm the other features will be made available anytime soon. Therefore, if you have the interest and desire to make these features available, then please read the [Contributing](#Contributing) section and get involved.
 ## Contributing
-Please see guidance [here](./CONTRIBUTE.md).
+Please see guidance [here](https://github.com/NHSDigital/data-validation-engine/blob/main/CONTRIBUTE.md).
 ## Legal
 This codebase is released under the MIT License. This covers both the codebase and any sample code in the documentation.

{data_validation_engine-0.7.1 → data_validation_engine-0.7.3}/README.md RENAMED

@@ -10,7 +10,7 @@
 The Data Validation Engine (DVE) is a configuration driven data validation library built and utilised by NHS England. Currently the package has been reverted from v1.0.0 release to a 0.x as we feel the package is not yet mature enough to be considered a 1.0.0 release. So please bear this in mind if reading through the commits and references to a v1+ release when on v0.x.
-As mentioned above, the DVE is "configuration driven" which means the majority of development for you as a user will be building a JSON document to describe how the data will be validated. The JSON document is known as a `dischema` file and example files can be accessed [here](./tests/testdata/). If you'd like to learn more about JSON document and how to build one from scratch, then please read the documentation [here](./docs/).
+As mentioned above, the DVE is "configuration driven" which means the majority of development for you as a user will be building a JSON document to describe how the data will be validated. The JSON document is known as a `dischema` file and example files can be accessed [here](https://github.com/NHSDigital/data-validation-engine/tree/main/tests/testdata). If you'd like to learn more about JSON document and how to build one from scratch, then please read the documentation [here](https://nhsdigital.github.io/data-validation-engine/).
 Once a dischema file has been defined, you are ready to use the DVE. The DVE is typically orchestrated based on four key "services". These are...
@@ -21,7 +21,7 @@ Once a dischema file has been defined, you are ready to use the DVE. The DVE is
 | 3. | Business Rules | The business rules service will perform more complex validations such as comparisons between fields and tables, aggregations, filters etc to generate new entities. |
 | 4. | Error Reports | The error reports service will take all the errors raised in previous services and surface them into a readable format for a downstream users/service. Currently, this implemented to be an excel spreadsheet but could be reconfigured to meet other requirements/use cases. |
-If you'd like more detailed documentation around these services the please read the extended documentation [here](./docs/).
+If you'd like more detailed documentation around these services the please read the extended documentation [here](https://nhsdigital.github.io/data-validation-engine/).
 The DVE has been designed in a way that's modular and can support users who just want to utilise specific "services" from the DVE (i.e. just the file transformation + data contract). Additionally, the DVE is designed to support different backend implementations. As part of the base installation of DVE, you will find backend support for `Spark` and `DuckDB`. So, if you need a `MySQL` backend implementation, you can implement this yourself. Given our organisations requirements, it will be unlikely that we add anymore specific backend implementations into the base package beyond Spark and DuckDB. So, if you are unable to implement this yourself, I would recommend reading the guidance on [requesting new features and raising bug reports here](#requesting-new-features-and-raising-bug-reports).
@@ -43,7 +43,7 @@ pip install data-validation-engine
 *Note - Only versions >=0.6.2 are available on PyPi. For older versions please install directly from the git repo or build from source.*
-Once you have installed the DVE you are ready to use it. For guidance on how to create your dischema JSON document (configuration), please read the [documentation](./docs/).
+Once you have installed the DVE you are ready to use it. For guidance on how to create your dischema JSON document (configuration), please read the [documentation](https://nhsdigital.github.io/data-validation-engine/).
 Version 0.0.1 does support a working Python 3.7 installation. However, we will not be supporting any issues with that version of the DVE if you choose to use it. __Use at your own risk__.
@@ -56,17 +56,20 @@ If you have feature request then please follow the same process whilst using the
 ## Upcoming features
 Below is a list of features that we would like to implement or have been requested.
-| Feature | Release Version | Released? |
-| ------- | --------------- | --------- |
-| Open source release | 0.1.0 | Yes |
-| Uplift to Python 3.11 | 0.2.0 | Yes |
-| Upgrade to Pydantic 2.0 | Before 1.0 release | No |
-| Create a more user friendly interface for building and modifying dischema files | Not yet confirmed | No |
-Beyond the Python and Pydantic upgrade, we cannot confirm the other features will be made available anytime soon. Therefore, if you have the interest and desire to make these features available, then please read the [Contributing](#contributing) section and get involved.
+| Feature                                                                         | Release Version   | Released? |
+| ------------------------------------------------------------------------------- | ----------------- | --------- |
+| Open source release                                                             | 0.1.0             | Yes       |
+| Uplift to Python 3.11                                                           | 0.2.0             | Yes       |
+| Uplift Pyspark to 3.5                                                           | TBA               | No        |
+| Allow DVE to run on Python 3.12+                                                | TBA               | No        |
+| Upgrade to Pydantic 2.0                                                         | TBA               | No        |
+| Uplift Pyspark to 4.0+                                                          | TBA               | No        |
+| Create a more user friendly interface for building and modifying dischema files | Not yet confirmed | No        |
+Beyond the Python and Pydantic upgrade, we cannot confirm the other features will be made available anytime soon. Therefore, if you have the interest and desire to make these features available, then please read the [Contributing](#Contributing) section and get involved.
 ## Contributing
-Please see guidance [here](./CONTRIBUTE.md).
+Please see guidance [here](https://github.com/NHSDigital/data-validation-engine/blob/main/CONTRIBUTE.md).
 ## Legal
 This codebase is released under the MIT License. This covers both the codebase and any sample code in the documentation.

{data_validation_engine-0.7.1 → data_validation_engine-0.7.3}/pyproject.toml RENAMED

@@ -1,12 +1,11 @@
-[tool.poetry]
+[project]
 name = "data-validation-engine"
-version = "0.7.1"
+dynamic = [ "version" ]
 description = "`nhs data validation engine` is a framework used to validate data"
-authors = ["NHS England <england.contactus@nhs.net>"]
-readme = "README.md"
-packages = [
-    { include = "dve", from = "src" },
+authors = [
+    { name = "NHS England", email = "england.contactus@nhs.net" }
 ]
+readme = "README.md"
 classifiers = [
     "Programming Language :: Python :: 3",
     "Programming Language :: Python :: 3.10",
@@ -15,6 +14,20 @@ classifiers = [
     "Topic :: Software Development :: Libraries",
     "Typing :: Typed",
 ]
+license = "MIT"
+[project.urls]
+Homepage = "https://github.com/NHSDigital/data-validation-engine"
+Documentation = "https://nhsdigital.github.io/data-validation-engine/"
+Repository = "https://github.com/NHSDigital/data-validation-engine.git"
+Issues = "https://github.com/NHSDigital/data-validation-engine/issues"
+Changelog = "https://github.com/NHSDigital/data-validation-engine/blob/main/CHANGELOG.md"
+[tool.poetry]
+version = "0.7.3"
+packages = [
+    { include = "dve", from = "src" },
+]
 [tool.poetry.dependencies]
 python = ">=3.10,<3.12"
@@ -24,11 +37,12 @@ delta-spark = "2.4.*"
 duckdb = "1.1.*"  # breaking changes beyond 1.1
 Jinja2 = "3.1.*"
 lxml = "^4.9.1"
+numpy = "1.26.4"
 openpyxl = "^3.1"
 pandas = "^2.2.2"
 polars = "0.20.*"
 pyarrow = "^17.0.0"
-pydantic = "1.10.15"
+pydantic = "1.10.16"
 pyspark = "3.4.*"
 typing_extensions = "^4.6.2"
@@ -42,6 +56,9 @@ include-groups = [
 [tool.poetry.group.dev.dependencies]
 commitizen = "4.9.1"
 pre-commit = "4.3.0"
+charset-normalizer = "3.4.6"
+python-discovery = "1.2.0"
+requests = "2.33.0"
 [tool.poetry.group.test]
 optional = true
@@ -78,6 +95,17 @@ types-setuptools = "68.2.0.0"
 types-urllib3 = "1.26.25.14"
 types-xmltodict = "0.13.0.3"
+[tool.poetry.group.docs]
+optional = true
+[tool.poetry.group.docs.dependencies]
+click = "8.2.1"
+mkdocs = "^1.6.1"
+mkdocstrings = { version = "1.0.3", extras = ["python"] }
+griffelib = "2.0.1"
+pymdown-extensions = "10.21.2"
+zensical = "0.0.31"
 [tool.ruff]
 line-length = 100

{data_validation_engine-0.7.1 → data_validation_engine-0.7.3}/src/dve/core_engine/backends/implementations/duckdb/contract.py RENAMED

@@ -31,6 +31,7 @@ from dve.core_engine.backends.implementations.duckdb.duckdb_helpers import (
     duckdb_read_parquet,
     duckdb_record_index,
     duckdb_write_parquet,
+    get_duckdb_cast_statement_from_annotation,
     get_duckdb_type_from_annotation,
     relation_is_empty,
 )
@@ -101,18 +102,7 @@ class DuckDBDataContract(BaseDataContract[DuckDBPyRelation]):
         _lazy_df = pl.LazyFrame(records, polars_schema)  # type: ignore # pylint: disable=unused-variable
         return self._connection.sql("select * from _lazy_df")
-    @staticmethod
-    def generate_ddb_cast_statement(
-        column_name: str, dtype: DuckDBPyType, null_flag: bool = False
-    ) -> str:
-        """Helper method to generate sql statements for casting datatypes (permissively).
-        Current duckdb python API doesn't play well with this currently.
-        """
-        if not null_flag:
-            return f'try_cast("{column_name}" AS {dtype}) AS "{column_name}"'
-        return f'cast(NULL AS {dtype}) AS "{column_name}"'
-    # pylint: disable=R0914
+    # pylint: disable=R0914,R0915
     def apply_data_contract(
         self,
         working_dir: URI,
@@ -180,12 +170,16 @@ class DuckDBDataContract(BaseDataContract[DuckDBPyRelation]):
                 casting_statements = [
                     (
-                        self.generate_ddb_cast_statement(column, dtype)
+                        get_duckdb_cast_statement_from_annotation(column, mdl_fld.annotation)
+                        + f""" AS "{column}" """
                         if column in relation.columns
-                        else self.generate_ddb_cast_statement(column, dtype, null_flag=True)
+                        else f"CAST(NULL AS {ddb_schema[column]}) AS {column}"
                     )
-                    for column, dtype in ddb_schema.items()
+                    for column, mdl_fld in entity_fields.items()
                 ]
+                casting_statements.append(
+                    f"CAST({RECORD_INDEX_COLUMN_NAME} AS {get_duckdb_type_from_annotation(int)}) AS {RECORD_INDEX_COLUMN_NAME}"  # pylint: disable=C0301
+                )
                 try:
                     relation = relation.project(", ".join(casting_statements))
                 except Exception as err:  # pylint: disable=broad-except

{data_validation_engine-0.7.1 → data_validation_engine-0.7.3}/src/dve/core_engine/backends/implementations/duckdb/duckdb_helpers.py RENAMED

@@ -313,3 +313,108 @@ def duckdb_record_index(cls):
     setattr(cls, "add_record_index", _add_duckdb_record_index)
     setattr(cls, "drop_record_index", _drop_duckdb_record_index)
     return cls
+def _cast_as_ddb_type(field_expr: str, type_annotation: Any) -> str:
+    """Cast to Duck DB type"""
+    return f"""try_cast({field_expr} as {get_duckdb_type_from_annotation(type_annotation)})"""
+def _ddb_safely_quote_name(field_name: str) -> str:
+    """Quote field names in case reserved"""
+    try:
+        sep_idx = field_name.index(".")
+        return f'"{field_name[: sep_idx]}"' + field_name[sep_idx:]
+    except ValueError:
+        return f'"{field_name}"'
+# pylint: disable=R0801,R0911,R0912
+def get_duckdb_cast_statement_from_annotation(
+    element_name: str,
+    type_annotation: Any,
+    parent_element: bool = True,
+    date_regex: str = r"^[0-9]{4}-[0-9]{2}-[0-9]{2}$",
+    timestamp_regex: str = r"^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}((\+|\-)[0-9]{2}:[0-9]{2})?$",  # pylint: disable=C0301
+    time_regex: str = r"^[0-9]{2}:[0-9]{2}:[0-9]{2}$",
+) -> str:
+    """Generate casting statements for duckdb relations from type annotations"""
+    type_origin = get_origin(type_annotation)
+    quoted_name = _ddb_safely_quote_name(element_name)
+    # An `Optional` or `Union` type, check to ensure non-heterogenity.
+    if type_origin is Union:
+        python_type = _get_non_heterogenous_type(get_args(type_annotation))
+        return get_duckdb_cast_statement_from_annotation(
+            element_name, python_type, parent_element, date_regex, timestamp_regex
+        )
+    # Type hint is e.g. `List[str]`, check to ensure non-heterogenity.
+    if type_origin is list or (isinstance(type_origin, type) and issubclass(type_origin, list)):
+        element_type = _get_non_heterogenous_type(get_args(type_annotation))
+        stmt = f"list_transform({quoted_name}, x -> {get_duckdb_cast_statement_from_annotation('x',element_type, False, date_regex, timestamp_regex)})"  # pylint: disable=C0301
+        return stmt if not parent_element else _cast_as_ddb_type(stmt, type_annotation)
+    if type_origin is Annotated:
+        python_type, *other_args = get_args(type_annotation)  # pylint: disable=unused-variable
+        return get_duckdb_cast_statement_from_annotation(
+            element_name, python_type, parent_element, date_regex, timestamp_regex
+        )  # add other expected params here
+    # Ensure that we have a concrete type at this point.
+    if not isinstance(type_annotation, type):
+        raise ValueError(f"Unsupported type annotation {type_annotation!r}")
+    if (
+        # Type hint is a dict subclass, but not dict. Possibly a `TypedDict`.
+        (issubclass(type_annotation, dict) and type_annotation is not dict)
+        # Type hint is a dataclass.
+        or is_dataclass(type_annotation)
+        # Type hint is a `pydantic` model.
+        or (type_origin is None and issubclass(type_annotation, BaseModel))
+    ):
+        fields: dict[str, str] = {}
+        for field_name, field_annotation in get_type_hints(type_annotation).items():
+            # Technically non-string keys are disallowed, but people are bad.
+            if not isinstance(field_name, str):
+                raise ValueError(
+                    f"Dictionary/Dataclass keys must be strings, got {type_annotation!r}"
+                )  # pragma: no cover
+            if get_origin(field_annotation) is ClassVar:
+                continue
+            fields[field_name] = get_duckdb_cast_statement_from_annotation(
+                f"{element_name}.{field_name}", field_annotation, False, date_regex, timestamp_regex
+            )
+        if not fields:
+            raise ValueError(
+                f"No type annotations in dict/dataclass type (got {type_annotation!r})"
+            )
+        cast_exprs = ",".join([f'"{nme}":= {stmt}' for nme, stmt in fields.items()])
+        stmt = f"struct_pack({cast_exprs})"
+        return stmt if not parent_element else _cast_as_ddb_type(stmt, type_annotation)
+    if type_annotation is list:
+        raise ValueError(
+            f"List must have type annotation (e.g. `List[str]`), got {type_annotation!r}"
+        )
+    if type_annotation is dict or type_origin is dict:
+        raise ValueError(f"dict must be `typing.TypedDict` subclass, got {type_annotation!r}")
+    for type_ in type_annotation.mro():
+        # datetime is subclass of date, so needs to be handled first
+        if issubclass(type_, datetime):
+            stmt = rf"CASE WHEN REGEXP_MATCHES(TRIM({quoted_name}), '{timestamp_regex}') THEN TRY_CAST(TRIM({quoted_name}) as TIMESTAMP) ELSE NULL END"  # pylint: disable=C0301
+            return stmt
+        if issubclass(type_, date):
+            stmt = rf"CASE WHEN REGEXP_MATCHES(TRIM({quoted_name}), '{date_regex}') THEN TRY_CAST(TRIM({quoted_name}) as DATE) ELSE NULL END"  # pylint: disable=C0301
+            return stmt
+        if issubclass(type_, time):
+            stmt = rf"CASE WHEN REGEXP_MATCHES(TRIM({quoted_name}), '{time_regex}') THEN TRY_CAST(TRIM({quoted_name}) as TIME) ELSE NULL END" # pylint: disable=C0301
+            return stmt
+        duck_type = get_duckdb_type_from_annotation(type_)
+        if duck_type:
+            stmt = f"trim({quoted_name})"
+            return _cast_as_ddb_type(stmt, type_) if parent_element else stmt
+    raise ValueError(f"No equivalent DuckDB type for {type_annotation!r}")
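
Reviewer note on the hunk above: a minimal sketch of what the new helper produces in its simplest cases. The quoting function is copied from the diff; `date_cast_sql` and `passes_date_gate` are hypothetical, cut-down stand-ins for the `date` branch only. Nothing here calls DuckDB, and the Python `re` check only approximates `REGEXP_MATCHES(TRIM(...))` (Python's `strip()` removes more kinds of whitespace than a SQL `TRIM`).

```python
import re

DATE_REGEX = r"^[0-9]{4}-[0-9]{2}-[0-9]{2}$"  # default value taken from the diff


def ddb_safely_quote_name(field_name: str) -> str:
    """Quote the leading name segment, as `_ddb_safely_quote_name` does above."""
    try:
        sep_idx = field_name.index(".")
        return f'"{field_name[:sep_idx]}"' + field_name[sep_idx:]
    except ValueError:
        return f'"{field_name}"'


def date_cast_sql(element_name: str, date_regex: str = DATE_REGEX) -> str:
    """Hypothetical stand-in: build the SQL text for the `date` branch only."""
    quoted = ddb_safely_quote_name(element_name)
    return (
        f"CASE WHEN REGEXP_MATCHES(TRIM({quoted}), '{date_regex}') "
        f"THEN TRY_CAST(TRIM({quoted}) as DATE) ELSE NULL END"
    )


def passes_date_gate(value: str, date_regex: str = DATE_REGEX) -> bool:
    """Approximate the regex gate in Python; not the DuckDB implementation."""
    return re.search(date_regex, value.strip()) is not None


if __name__ == "__main__":
    print(ddb_safely_quote_name("order"))
    print(ddb_safely_quote_name("parent.child"))
    print(date_cast_sql("dob"))
    print(passes_date_gate("2025-1-1"))
```

Two things the sketch makes visible: only the first dotted segment is quoted (so `parent.child` becomes `"parent".child`), and a value such as `2025-1-1` is nulled by the regex gate before `TRY_CAST` is reached, even if a permissive cast might otherwise have accepted it.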

{data_validation_engine-0.7.1 → data_validation_engine-0.7.3}/src/dve/core_engine/backends/implementations/duckdb/readers/csv.py RENAMED

@@ -82,11 +82,12 @@ class DuckDBCSVReader(BaseFileReader):
                 messages=[
                     FeedbackMessage(
                         entity=entity_name,
-                        record=None,
+                        record={"missing_fields": missing},
                         failure_type="submission",
                         error_location="Whole File",
+                        reporting_field="missing_fields",
                         error_code=self.field_check_error_code,
-                        error_message=f"{self.field_check_error_message} - missing fields: {missing}",  # pylint: disable=line-too-long
+                        error_message=f"{self.field_check_error_message}",  # pylint: disable=line-too-long
                     )
                 ],
             )
@@ -202,6 +203,7 @@ class DuckDBCSVRepeatingHeaderReader(PolarsToDuckDBCSVReader):
     `NonDistinctHeaderError`.
     So using the example above, the expected entity would look like this...
     | headerCol1 | headerCol2 | headerCol3 |
     | ---------- | ---------- | ---------- |
     | shop1      | clothes    | 2025-01-01 |

{data_validation_engine-0.7.1 → data_validation_engine-0.7.3}/src/dve/core_engine/backends/implementations/spark/spark_helpers.py RENAMED

@@ -439,3 +439,103 @@ def spark_record_index(cls):
     setattr(cls, "add_record_index", _add_spark_record_index)
     setattr(cls, "drop_record_index", _drop_spark_record_index)
     return cls
+def _cast_as_spark_type(field_expr: str, field_type: Any) -> Column:
+    """Cast to spark type"""
+    return sf.expr(field_expr).cast(get_type_from_annotation(field_type))
+def _spark_safely_quote_name(field_name: str) -> str:
+    """Quote field names in case reserved"""
+    try:
+        sep_idx = field_name.index(".")
+        return f"`{field_name[: sep_idx]}`" + field_name[sep_idx:]
+    except ValueError:
+        return f"`{field_name}`"
+# pylint: disable=R0801
+def get_spark_cast_statement_from_annotation(
+    element_name: str,
+    type_annotation: Any,
+    parent_element: bool = True,
+    date_regex: str = r"^[0-9]{4}-[0-9]{2}-[0-9]{2}$",
+    timestamp_regex: str = r"^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}((\\+|\\-)[0-9]{2}:[0-9]{2})?$",  # pylint: disable=C0301
+):
+    """Generate casting statements for spark dataframes based on type annotations"""
+    type_origin = get_origin(type_annotation)
+    quoted_name = _spark_safely_quote_name(element_name)
+    # An `Optional` or `Union` type, check to ensure non-heterogenity.
+    if type_origin is Union:
+        python_type = _get_non_heterogenous_type(get_args(type_annotation))
+        return get_spark_cast_statement_from_annotation(
+            element_name, python_type, parent_element, date_regex, timestamp_regex
+        )
+    # Type hint is e.g. `List[str]`, check to ensure non-heterogenity.
+    if type_origin is list or (isinstance(type_origin, type) and issubclass(type_origin, list)):
+        element_type = _get_non_heterogenous_type(get_args(type_annotation))
+        stmt = f"transform({quoted_name}, x -> {get_spark_cast_statement_from_annotation('x',element_type, False, date_regex, timestamp_regex)})"  # pylint: disable=C0301
+        return stmt if not parent_element else _cast_as_spark_type(stmt, type_annotation)
+    if type_origin is Annotated:
+        python_type, *_ = get_args(type_annotation)  # pylint: disable=unused-variable
+        return get_spark_cast_statement_from_annotation(
+            element_name, python_type, parent_element, date_regex, timestamp_regex
+        )  # add other expected params here
+    # Ensure that we have a concrete type at this point.
+    if not isinstance(type_annotation, type):
+        raise ValueError(f"Unsupported type annotation {type_annotation!r}")
+    if (
+        # Type hint is a dict subclass, but not dict. Possibly a `TypedDict`.
+        (issubclass(type_annotation, dict) and type_annotation is not dict)
+        # Type hint is a dataclass.
+        or is_dataclass(type_annotation)
+        # Type hint is a `pydantic` model.
+        or (type_origin is None and issubclass(type_annotation, BaseModel))
+    ):
+        fields: dict[str, str] = {}
+        for field_name, field_annotation in get_type_hints(type_annotation).items():
+            # Technically non-string keys are disallowed, but people are bad.
+            if not isinstance(field_name, str):
+                raise ValueError(
+                    f"Dictionary/Dataclass keys must be strings, got {type_annotation!r}"
+                )  # pragma: no cover
+            if get_origin(field_annotation) is ClassVar:
+                continue
+            fields[field_name] = get_spark_cast_statement_from_annotation(
+                f"{element_name}.{field_name}", field_annotation, False, date_regex, timestamp_regex
+            )
+        if not fields:
+            raise ValueError(
+                f"No type annotations in dict/dataclass type (got {type_annotation!r})"
+            )
+        cast_exprs = ",".join([f"{stmt} AS `{nme}`" for nme, stmt in fields.items()])
+        stmt = f"struct({cast_exprs})"
+        return stmt if not parent_element else _cast_as_spark_type(stmt, type_annotation)
+    if type_annotation is list:
+        raise ValueError(
+            f"List must have type annotation (e.g. `List[str]`), got {type_annotation!r}"
+        )
+    if type_annotation is dict or type_origin is dict:
+        raise ValueError(f"dict must be `typing.TypedDict` subclass, got {type_annotation!r}")
+    for type_ in type_annotation.mro():
+        # datetime is subclass of date, so needs to be handled first
+        if issubclass(type_, dt.datetime):
+            stmt = rf"CASE WHEN REGEXP(TRIM({quoted_name}), '{timestamp_regex}') THEN TRIM({quoted_name}) ELSE NULL END"  # pylint: disable=C0301
+            return _cast_as_spark_type(stmt, type_) if parent_element else stmt
+        if issubclass(type_, dt.date):
+            stmt = rf"CASE WHEN REGEXP(TRIM({quoted_name}), '{date_regex}') THEN TRIM({quoted_name}) ELSE NULL END"  # pylint: disable=C0301
+            return _cast_as_spark_type(stmt, type_) if parent_element else stmt
+        spark_type = get_type_from_annotation(type_)
+        if spark_type:
+            stmt = f"trim({quoted_name})"
+            return _cast_as_spark_type(stmt, type_) if parent_element else stmt
+    raise ValueError(f"No equivalent Spark type for {type_annotation!r}")
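
Reviewer note on the hunk above: the Spark default `timestamp_regex` doubles its backslashes (`\\+`, `\\-`) where the DuckDB default uses single ones. A plausible reading, stated here as an assumption rather than something the diff confirms, is that the pattern is embedded in a Spark SQL string literal, where the parser by default consumes one level of backslash escaping. The sketch below uses a hypothetical `unescape_sql_literal` helper as a crude stand-in for that step (it only collapses `\\` to `\`) and then checks the result with Python's `re`, not with Spark.

```python
import re

# Defaults copied from the two hunks in this diff.
DUCKDB_TS = r"^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}((\+|\-)[0-9]{2}:[0-9]{2})?$"
SPARK_TS = r"^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}((\\+|\\-)[0-9]{2}:[0-9]{2})?$"


def unescape_sql_literal(text: str) -> str:
    """Hypothetical, simplified model of one level of backslash unescaping."""
    return text.replace("\\\\", "\\")


def matches_timestamp(value: str, pattern: str = DUCKDB_TS) -> bool:
    """Check a trimmed value against the timestamp pattern using Python's re."""
    return re.search(pattern, value.strip()) is not None


if __name__ == "__main__":
    # After one level of unescaping the two defaults are the same pattern.
    print(unescape_sql_literal(SPARK_TS) == DUCKDB_TS)
    print(matches_timestamp("2025-01-01T12:30:00+01:00"))
    print(matches_timestamp("2025-01-01 12:30:00"))
```

Under that assumption both backends gate timestamps on the same shape: a `T` separator is required and the UTC offset is optional. Note also that the Spark function in this diff has no `time_regex` parameter or `time` branch, unlike its DuckDB counterpart.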

{data_validation_engine-0.7.1 → data_validation_engine-0.7.3}/src/dve/metadata_parser/domain_types.py RENAMED

@@ -519,7 +519,8 @@ class FormattedTime(dt.time):
             raise ValueError("Provided time has timezone, but this is forbidden for this field")
         if cls.TIMEZONE_TREATMENT == "require" and not new_time.tzinfo:
             raise ValueError("Provided time missing timezone, but this is required for this field")
+        if isinstance(value, str) and cls.TIME_FORMAT and value != str(new_time):
+            raise ValueError("Provided time is not matching expected time format supplied.")
         return new_time
     @classmethod
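
Reviewer note on the hunk above: the added check rejects a string input when it differs from `str(new_time)`. The sketch below illustrates that round-trip idea in isolation. `parse_strict_time` is a hypothetical function, and parsing with `datetime.strptime` is an assumption, since the diff does not show how `FormattedTime` derives `new_time`.

```python
from datetime import datetime, time


def parse_strict_time(value: str, time_format: str = "%H:%M:%S") -> time:
    """Parse `value`, then reject it if it does not round-trip through str()."""
    new_time = datetime.strptime(value, time_format).time()
    # str(time) renders ISO style, e.g. "09:05:00", so lenient parses such as
    # "9:05:00" are caught here even though strptime accepted them.
    if value != str(new_time):
        raise ValueError("Provided time is not matching expected time format supplied.")
    return new_time


if __name__ == "__main__":
    print(parse_strict_time("09:05:00"))
    try:
        parse_strict_time("9:05:00")
    except ValueError as err:
        print(f"rejected: {err}")
```

Because the comparison is against `str(new_time)`, in this sketch only inputs already in the ISO `HH:MM:SS` shape pass. Whether the real class behaves the same way for other `TIME_FORMAT` values depends on code outside the hunk.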

{data_validation_engine-0.7.1 → data_validation_engine-0.7.3}/src/dve/pipeline/foundry_ddb_pipeline.py RENAMED

@@ -42,11 +42,15 @@ class FoundryDDBPipeline(DDBDVEPipeline):
             write_to.parent.mkdir(parents=True, exist_ok=True)
             write_to = write_to.as_posix()
         self.write_parquet(  # type: ignore # pylint: disable=E1101
-            self._audit_tables._processing_status.get_relation(),  # pylint: disable=W0212
+            self._audit_tables._processing_status.get_relation().filter(  # pylint: disable=W0212
+                f"submission_id = '{submission_info.submission_id}'"
+            ),
             fh.joinuri(write_to, "processing_status.parquet"),
         )
         self.write_parquet(  # type: ignore # pylint: disable=E1101
-            self._audit_tables._submission_statistics.get_relation(),  # pylint: disable=W0212
+            self._audit_tables._submission_statistics.get_relation().filter(  # pylint: disable=W0212
+                f"submission_id = '{submission_info.submission_id}'"
+            ),
             fh.joinuri(write_to, "submission_statistics.parquet"),
         )
         return write_to

{data_validation_engine-0.7.1 → data_validation_engine-0.7.3}/src/dve/reporting/excel_report.py RENAMED

@@ -135,8 +135,11 @@ class SummaryItems:
         if "Status" not in self.summary_dict:
             summary.append(["", "Status", status])
+        _key_renames = {
+            "File Size": "File Size (Bytes)",
+        }
         for key, value in self.summary_dict.items():
-            summary.append(["", key, str(value)])
+            summary.append(["", _key_renames.get(key, key), str(value)])
         summary.append(["", ""])
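
Reviewer note on the hunk above: the rename is a display-only lookup with a fall-through to the original key. A minimal sketch, assuming a plain dict as input; `build_summary_rows` is a hypothetical name, not a method of `SummaryItems`.

```python
def build_summary_rows(summary_dict: dict) -> list:
    """Build summary rows, relabelling known keys and leaving the rest as-is."""
    key_renames = {"File Size": "File Size (Bytes)"}
    return [["", key_renames.get(key, key), str(value)] for key, value in summary_dict.items()]


if __name__ == "__main__":
    for row in build_summary_rows({"File Size": 2048, "Status": "ok"}):
        print(row)
```

Using `dict.get(key, key)` keeps unknown keys unchanged, so further labels can be added to the mapping without touching the loop.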