xml2db 0.9.1.tar.gz → 0.9.4.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (26)

{xml2db-0.9.1 → xml2db-0.9.4}/LICENSE RENAMED

@@ -1,4 +1,4 @@
-Copyright (c) 2023 Commission de régulation de l'énergie
+Copyright (c) 2024 Commission de régulation de l'énergie
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

{xml2db-0.9.1/src/xml2db.egg-info → xml2db-0.9.4}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: xml2db
-Version: 0.9.1
+Version: 0.9.4
 Summary: Import complex XML files to a relational database
 Author-email: Commission de régulation de l'énergie <opensource@cre.fr>
 Project-URL: Documentation, https://cre-dev.github.io/xml2db
@@ -13,18 +13,18 @@ Requires-Python: >=3.9
 Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: sqlalchemy>1.4
-Requires-Dist: xmlschema==2.5.0
-Requires-Dist: lxml==4.9.3
+Requires-Dist: xmlschema==3.1.0
+Requires-Dist: lxml==5.1.0
 Provides-Extra: docs
-Requires-Dist: mkdocs-material==9.5.0; extra == "docs"
-Requires-Dist: mkdocstrings[python]==0.24.0; extra == "docs"
+Requires-Dist: mkdocs-material==9.5.14; extra == "docs"
+Requires-Dist: mkdocstrings[python]==0.24.1; extra == "docs"
 Provides-Extra: tests
 Requires-Dist: pytest>=7.0; extra == "tests"
 # Xml2db
 `xml2db` is a Python package which allows loading XML data into a relational database. It is designed to handle complex
-schemas which cannot be denormalized to a flat table, without any custom code.
+schemas which cannot be easily denormalized to a flat table, without any custom code.
 It builds a data model (i.e. a set of database tables linked with foreign keys relationships) based on a XSD schema and
 allows parsing and loading XML files into the database, and get them back to XML, if needed.
@@ -37,7 +37,7 @@ from xml2db import DataModel
 # Create a data model of tables with relations based on the XSD file
 data_model = DataModel(
     xsd_file="path/to/file.xsd",
-    connection_string="mssql+pyodbc://server/database?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes",
+    connection_string="postgresql+psycopg2://testuser:testuser@localhost:5432/testdb",
 )
 # Parse an XML file based on this XSD
 document = data_model.parse_xml(
@@ -51,11 +51,10 @@ The data model will adhere closely to the XSD schema, but `xml2db` will perform
 complexity of the resulting data model and the storage footprint.
 The raw data loaded into the database can then be processed using [DBT](https://www.getdbt.com/), SQL views or
-stored procedures aimed at extracting, correcting and formatting the data into more user-friendly tables.
+other tools aimed at extracting, correcting and formatting the data into more user-friendly tables.
 `xml2db` is developed and used at the [French energy regulation authority (CRE)](https://www.cre.fr/) to process XML
-data, notably [REMIT data](https://www.acer.europa.eu/remit/data-collection). There, it handles batches of ~500 MB XML
-files translating into a 20+ tables data model in the database.
+data.
 This package uses `sqlalchemy` to interact with the database, so it should work with different database backends. It has
 been tested against PostgreSQL and MS SQL Server. It currently does not work with SQLite. You may have to install
@@ -86,8 +85,8 @@ Run all tests with the following command:
 python -m pytest
 ```
-Integration tests require write access to a MS SQL server database; the connection string is provided as an environment
-variable `DB_STRING`. If you want to run only conversion tests that do not require a database you can run:
+Integration tests require write access to a PostgreSQL or MS SQL Server database; the connection string is provided as an
+environment variable `DB_STRING`. If you want to run only conversion tests that do not require a database you can run:
 ```bash
 pytest -m "not dbtest"

{xml2db-0.9.1 → xml2db-0.9.4}/README.md RENAMED

@@ -1,7 +1,7 @@
 # Xml2db
 `xml2db` is a Python package which allows loading XML data into a relational database. It is designed to handle complex
-schemas which cannot be denormalized to a flat table, without any custom code.
+schemas which cannot be easily denormalized to a flat table, without any custom code.
 It builds a data model (i.e. a set of database tables linked with foreign keys relationships) based on a XSD schema and
 allows parsing and loading XML files into the database, and get them back to XML, if needed.
@@ -14,7 +14,7 @@ from xml2db import DataModel
 # Create a data model of tables with relations based on the XSD file
 data_model = DataModel(
     xsd_file="path/to/file.xsd",
-    connection_string="mssql+pyodbc://server/database?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes",
+    connection_string="postgresql+psycopg2://testuser:testuser@localhost:5432/testdb",
 )
 # Parse an XML file based on this XSD
 document = data_model.parse_xml(
@@ -28,11 +28,10 @@ The data model will adhere closely to the XSD schema, but `xml2db` will perform
 complexity of the resulting data model and the storage footprint.
 The raw data loaded into the database can then be processed using [DBT](https://www.getdbt.com/), SQL views or
-stored procedures aimed at extracting, correcting and formatting the data into more user-friendly tables.
+other tools aimed at extracting, correcting and formatting the data into more user-friendly tables.
 `xml2db` is developed and used at the [French energy regulation authority (CRE)](https://www.cre.fr/) to process XML
-data, notably [REMIT data](https://www.acer.europa.eu/remit/data-collection). There, it handles batches of ~500 MB XML
-files translating into a 20+ tables data model in the database.
+data.
 This package uses `sqlalchemy` to interact with the database, so it should work with different database backends. It has
 been tested against PostgreSQL and MS SQL Server. It currently does not work with SQLite. You may have to install
@@ -63,8 +62,8 @@ Run all tests with the following command:
 python -m pytest
 ```
-Integration tests require write access to a MS SQL server database; the connection string is provided as an environment
-variable `DB_STRING`. If you want to run only conversion tests that do not require a database you can run:
+Integration tests require write access to a PostgreSQL or MS SQL Server database; the connection string is provided as an
+environment variable `DB_STRING`. If you want to run only conversion tests that do not require a database you can run:
 ```bash
 pytest -m "not dbtest"

{xml2db-0.9.1 → xml2db-0.9.4}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "xml2db"
-version = "0.9.1"
+version = "0.9.4"
 authors = [
   { name="Commission de régulation de l'énergie", email="opensource@cre.fr" },
 ]
@@ -18,12 +18,12 @@ classifiers = [
 ]
 dependencies = [
     "sqlalchemy>1.4",
-    "xmlschema==2.5.0",
-    "lxml==4.9.3",
+    "xmlschema==3.1.0",
+    "lxml==5.1.0",
 ]
 [project.optional-dependencies]
-docs = ["mkdocs-material==9.5.0", "mkdocstrings[python]==0.24.0"]
+docs = ["mkdocs-material==9.5.14", "mkdocstrings[python]==0.24.1"]
 tests = ["pytest>=7.0"]
 [project.urls]

{xml2db-0.9.1 → xml2db-0.9.4}/src/xml2db/document.py RENAMED

@@ -122,8 +122,8 @@ class Document:
                 ]
                 for h_child in sorted(h_children):
                     h.update(h_child)
-        node["xml2db_record_hash"] = h.digest()
-        return node["xml2db_record_hash"]
+        node["record_hash"] = h.digest()
+        return node["record_hash"]
     def doc_tree_to_flat_data(self, document_tree: dict) -> dict:
         """Convert document tree (nested dict) to flat tables data model to prepare database import
@@ -164,7 +164,7 @@ class Document:
                     }
             data = data_model[node["type"]]
-            hex_hash = str(node["xml2db_record_hash"])
+            hex_hash = str(node["record_hash"])
             # if node is reused and a record with identical hash is already inserted, return its pk
             if model_table.is_reused:
@@ -204,7 +204,10 @@ class Document:
                         else:
                             esc_val = [str(v).replace('"', '\\"') for v in val]
                             esc_val = [
-                                f'"{v}"' if "," in v or '"' in v else v for v in esc_val
+                                f'"{v}"'
+                                if "," in v or "\n" in v or "\r" in v or '"' in v
+                                else v
+                                for v in esc_val
                             ]
                             record[key] = ",".join(esc_val)
                     else:
@@ -222,7 +225,7 @@ class Document:
                     else:
                         record[f"temp_{rel.field_name}"] = None
-            record["xml2db_record_hash"] = bytes(node["xml2db_record_hash"])
+            record["record_hash"] = bytes(node["record_hash"])
             # add integration meta data if root table
             if model_table.type_name == self.model.root_table:
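
The quoting change in the `document.py` hunk above can be illustrated in isolation. This is a minimal sketch, not the package's actual code path: `join_multi_values` is a hypothetical helper name introduced here. It mirrors the 0.9.4 rule shown in the diff: escape double quotes with a backslash, wrap a value in quotes if it contains a comma, a line feed, a carriage return or a double quote, then join with commas.

```python
from typing import Iterable


def join_multi_values(values: Iterable[object]) -> str:
    """Join several values into one comma-separated string (hypothetical helper)."""
    # Escape embedded double quotes with a backslash, as in the diff above
    escaped = [str(v).replace('"', '\\"') for v in values]
    # Quote any value containing a separator, a line break or a double quote
    quoted = [
        f'"{v}"' if "," in v or "\n" in v or "\r" in v or '"' in v else v
        for v in escaped
    ]
    return ",".join(quoted)


print(join_multi_values(["a", "b,c", "line1\nline2"]))
```

Under the 0.9.1 condition, which only checked for commas and double quotes, the value containing a line break would have been emitted unquoted.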

{xml2db-0.9.1 → xml2db-0.9.4}/src/xml2db/model.py RENAMED

@@ -311,7 +311,11 @@ class DataModel:
                 )
             elem_type = elem_type[0]
             if elem_type.is_union():
-                return recurse_parse_simple_type(elem_type.base_type.member_types)
+                return (
+                    recurse_parse_simple_type(elem_type.base_type.member_types)
+                    if elem_type.base_type
+                    else recurse_parse_simple_type(elem_type.member_types)
+                )
             if elem_type.is_restriction():
                 dt = elem_type.base_type.local_name
                 mil = elem_type.min_length
@@ -402,7 +406,12 @@ class DataModel:
                         )
                     )
                 ct = child.type
-                if ct.is_complex() and len(child) == 0 and ct.base_type is not None:
+                if (
+                    ct.is_complex()
+                    and len(child) == 0
+                    and len(child.attributes) == 0
+                    and ct.base_type is not None
+                ):
                     ct = ct.base_type
                 if ct.is_simple():
                     (
@@ -427,7 +436,9 @@ class DataModel:
                 elif ct.is_complex():
                     child_table = self._parse_tree(child)
                     child_table.model_group = (
-                        "choice" if ct.model_group.model == "choice" else "sequence"
+                        "choice"
+                        if ct.model_group and ct.model_group.model == "choice"
+                        else "sequence"
                     )
                     occurs = get_occurs(child)
                     if child.is_single():
@@ -449,16 +460,29 @@ class DataModel:
             else:
                 raise ValueError("unknown case; please check (child not an XsdElement)")
-        if hasattr(parent_node, "type") and parent_node.type.has_mixed_content():
+        if hasattr(parent_node, "type") and (
+            parent_node.type.has_mixed_content()
+            or parent_node.type.has_simple_content()
+        ):
+            if parent_node.type.base_type is not None:
+                (
+                    data_type,
+                    min_length,
+                    max_length,
+                    allow_empty,
+                ) = recurse_parse_simple_type([parent_node.type.base_type])
+            else:
+                data_type, min_length, max_length, allow_empty = "string", 0, None, True
             parent_table.add_column(
                 "value",
-                "string",
+                data_type,
                 [0, 1],
-                0,
-                None,
+                min_length,
+                max_length,
                 False,
                 True,
-                True,
+                allow_empty,
                 None,
             )

{xml2db-0.9.1 → xml2db-0.9.4}/src/xml2db/table/column.py RENAMED

@@ -10,6 +10,7 @@ from sqlalchemy import (
     Column,
     DateTime,
     String,
+    LargeBinary,
 )
 from sqlalchemy.dialects import mssql
@@ -51,6 +52,8 @@ def types_mapping_default(temp: bool, col: "DataModelColumn") -> Any:
         if min_length >= col.max_length - 1 and not col.allow_empty:
             return String(col.max_length)
         return String(col.max_length)
+    if col.data_type == "binary":
+        return LargeBinary(col.max_length)
     else:
         logger.warning(
             f"unknown type '{col.data_type}' for column '{col.name}', defaulting to VARCHAR(1000) "
@@ -93,6 +96,10 @@ def types_mapping_mssql(temp: bool, col: "DataModelColumn") -> Any:
         if min_length >= col.max_length - 1 and not col.allow_empty:
             return mssql.CHAR(col.max_length)
         return mssql.VARCHAR(col.max_length)
+    if col.data_type == "binary":
+        if col.max_length == col.min_length:
+            return mssql.BINARY(col.max_length)
+        return mssql.VARBINARY(col.max_length)
     else:
         logger.warning(
             f"unknown type '{col.data_type}' for column '{col.name}', defaulting to VARCHAR(1000) "
@@ -109,6 +116,8 @@ class DataModelColumn:
     :param occurs: min and max occurrences of the field
     :param min_length: min length
     :param max_length: max length
+    :param is_attr: does the column value come from an xml attribute?
+    :param is_content: is the column used to store the content value of a mixed complex type?
     :param allow_empty: is nullable ?
     :param ngroup: a key used to handle nested sequences
     :param model_config: data model config, may contain column type information
@@ -126,7 +135,7 @@ class DataModelColumn:
         data_type: str,
         occurs: List[int],
         min_length: int,
-        max_length: int,
+        max_length: Union[int, None],
         is_attr: bool,
         is_content: bool,
         allow_empty: bool,

{xml2db-0.9.1 → xml2db-0.9.4}/src/xml2db/table/reused_table.py RENAMED

@@ -13,6 +13,7 @@ from sqlalchemy import (
 )
 from .transformed_table import DataModelTableTransformed
+from .column import DataModelColumn
 class DataModelTableReused(DataModelTableTransformed):
@@ -49,13 +50,39 @@ class DataModelTableReused(DataModelTableTransformed):
             # Root table is given additional integration metadata columns
             if self.is_root_table:
                 yield Column("xml2db_input_file_path", String(256), nullable=False)
-                yield Column(
-                    "xml2db_processed_at", DateTime(timezone=True), nullable=False
+                # Use DataModelColumn to create record hash column in order to get the right data type
+                processed_at_col = DataModelColumn(
+                    "xml2db_processed_at",
+                    [],
+                    "dateTime",
+                    [1, 1],
+                    0,
+                    None,
+                    False,
+                    False,
+                    False,
+                    None,
+                    self.config,
+                    self.data_model,
                 )
-            yield Column("xml2db_record_hash", LargeBinary(20), nullable=False)
+                yield from processed_at_col.get_sqlalchemy_column(temp)
+            hash_col = DataModelColumn(
+                "record_hash",
+                [],
+                "binary",
+                [1, 1],
+                20,
+                20,
+                False,
+                False,
+                False,
+                None,
+                self.config,
+                self.data_model,
+            )
+            yield from hash_col.get_sqlalchemy_column(temp)
             yield UniqueConstraint(
-                "xml2db_record_hash",
+                "record_hash",
                 name=f"{prefix if temp else ''}{self.name}_xml2db_record_hash",
             )
@@ -115,10 +142,8 @@ class DataModelTableReused(DataModelTableTransformed):
         # find matching records hash in target table
         yield self.temp_table.update().values(temp_exists=True).where(
-            getattr(  # noqa: Linter puzzled by ==
-                self.temp_table.c, "xml2db_record_hash"
-            )
-            == getattr(self.table.c, "xml2db_record_hash")
+            getattr(self.temp_table.c, "record_hash")  # noqa: Linter puzzled by ==
+            == getattr(self.table.c, "record_hash")
         )
         # update foreign keys for n-1 relations tables
@@ -141,10 +166,8 @@ class DataModelTableReused(DataModelTableTransformed):
         yield self.temp_table.update().values(
             **{f"pk_{self.name}": getattr(self.table.c, f"pk_{self.name}")}
         ).where(
-            getattr(  # noqa: Linter puzzled by ==
-                self.temp_table.c, "xml2db_record_hash"
-            )
-            == getattr(self.table.c, "xml2db_record_hash")
+            getattr(self.temp_table.c, "record_hash")  # noqa: Linter puzzled by ==
+            == getattr(self.table.c, "record_hash")
         )
         # update primary keys for n-n relations tables

{xml2db-0.9.1 → xml2db-0.9.4}/src/xml2db/xml_converter.py RENAMED

@@ -158,6 +158,7 @@ class XMLConverter:
     def _make_xml_node(self, node_data, node_name, nsmap: dict = None):
         def check_transformed_node(node_type, element):
+            """Convert "choice" transformed nodes (type/value) to `<type>value</type>` XML nodes"""
             if (
                 node_type in self.model.types_transforms
                 and self.model.types_transforms[node_type] == "choice"
@@ -176,18 +177,27 @@ class XMLConverter:
             return element
         tb = self.model.tables[node_data["type"]]
+        # due to "elevated" nodes (i.e. flattened), we need to build a stack of nested nodes to reconstruct the
+        # original XML. It is a list of tuples of (node type, node Element).
         nodes_stack = [(node_data["type"], etree.Element(node_name, nsmap=nsmap))]
         prev_chain = []
         prev_ngroup = None
         ngroup_stack = []
         for field_type, rel_name, rel in tb.fields:
+            # This part manages the nodes stack, based on `name_chain` attribute which represents a "path".
+            # We compare the current path with the previous field path, and manage the nodes stack accordingly
+            # (i.e. create new nested nodes, pop the last node when moving to another path, etc.)
             name_chain = rel.name_chain[:-1]
             i = len(prev_chain)
             while i > 0 and (
                 i > len(name_chain) or name_chain[i - 1][0] != prev_chain[i - 1][0]
             ):
                 completed_node = check_transformed_node(*nodes_stack.pop())
-                if completed_node is not None and len(completed_node) > 0:
+                if completed_node is not None and (
+                    len(completed_node) > 0
+                    or completed_node.text
+                    or len(completed_node.attrib) > 0
+                ):
                     nodes_stack[-1][1].append(completed_node)
                 i -= 1
             while i < len(name_chain):
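
The widened condition in the `xml_converter.py` hunk above keeps a completed node when it has child elements, text content or attributes, whereas 0.9.1 kept it only when it had child elements. The sketch below shows that check on its own. It is an illustration under assumptions: it uses the standard library's `xml.etree.ElementTree` rather than the `lxml` `etree` the package uses, and `is_worth_keeping` is a hypothetical name that does not appear in the package.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def is_worth_keeping(node: Optional[ET.Element]) -> bool:
    """Return True if the node has children, text or attributes (hypothetical helper)."""
    return node is not None and (
        len(node) > 0 or bool(node.text) or len(node.attrib) > 0
    )


# Four illustrative nodes: empty, text only, attribute only, child only
empty = ET.Element("a")
with_text = ET.Element("b")
with_text.text = "42"
with_attr = ET.Element("c", {"unit": "MW"})
with_child = ET.Element("d")
ET.SubElement(with_child, "e")

for node in (empty, with_text, with_attr, with_child):
    print(node.tag, is_worth_keeping(node))
```

With the 0.9.1 check (`len(completed_node) > 0` alone), the text-only and attribute-only nodes would not have been appended to their parent.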

{xml2db-0.9.1 → xml2db-0.9.4/src/xml2db.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: xml2db
-Version: 0.9.1
+Version: 0.9.4
 Summary: Import complex XML files to a relational database
 Author-email: Commission de régulation de l'énergie <opensource@cre.fr>
 Project-URL: Documentation, https://cre-dev.github.io/xml2db
@@ -13,18 +13,18 @@ Requires-Python: >=3.9
 Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: sqlalchemy>1.4
-Requires-Dist: xmlschema==2.5.0
-Requires-Dist: lxml==4.9.3
+Requires-Dist: xmlschema==3.1.0
+Requires-Dist: lxml==5.1.0
 Provides-Extra: docs
-Requires-Dist: mkdocs-material==9.5.0; extra == "docs"
-Requires-Dist: mkdocstrings[python]==0.24.0; extra == "docs"
+Requires-Dist: mkdocs-material==9.5.14; extra == "docs"
+Requires-Dist: mkdocstrings[python]==0.24.1; extra == "docs"
 Provides-Extra: tests
 Requires-Dist: pytest>=7.0; extra == "tests"
 # Xml2db
 `xml2db` is a Python package which allows loading XML data into a relational database. It is designed to handle complex
-schemas which cannot be denormalized to a flat table, without any custom code.
+schemas which cannot be easily denormalized to a flat table, without any custom code.
 It builds a data model (i.e. a set of database tables linked with foreign keys relationships) based on a XSD schema and
 allows parsing and loading XML files into the database, and get them back to XML, if needed.
@@ -37,7 +37,7 @@ from xml2db import DataModel
 # Create a data model of tables with relations based on the XSD file
 data_model = DataModel(
     xsd_file="path/to/file.xsd",
-    connection_string="mssql+pyodbc://server/database?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes",
+    connection_string="postgresql+psycopg2://testuser:testuser@localhost:5432/testdb",
 )
 # Parse an XML file based on this XSD
 document = data_model.parse_xml(
@@ -51,11 +51,10 @@ The data model will adhere closely to the XSD schema, but `xml2db` will perform
 complexity of the resulting data model and the storage footprint.
 The raw data loaded into the database can then be processed using [DBT](https://www.getdbt.com/), SQL views or
-stored procedures aimed at extracting, correcting and formatting the data into more user-friendly tables.
+other tools aimed at extracting, correcting and formatting the data into more user-friendly tables.
 `xml2db` is developed and used at the [French energy regulation authority (CRE)](https://www.cre.fr/) to process XML
-data, notably [REMIT data](https://www.acer.europa.eu/remit/data-collection). There, it handles batches of ~500 MB XML
-files translating into a 20+ tables data model in the database.
+data.
 This package uses `sqlalchemy` to interact with the database, so it should work with different database backends. It has
 been tested against PostgreSQL and MS SQL Server. It currently does not work with SQLite. You may have to install
@@ -86,8 +85,8 @@ Run all tests with the following command:
 python -m pytest
 ```
-Integration tests require write access to a MS SQL server database; the connection string is provided as an environment
-variable `DB_STRING`. If you want to run only conversion tests that do not require a database you can run:
+Integration tests require write access to a PostgreSQL or MS SQL Server database; the connection string is provided as an
+environment variable `DB_STRING`. If you want to run only conversion tests that do not require a database you can run:
 ```bash
 pytest -m "not dbtest"