pyspark-connectby 1.3.1.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,131 @@
+ Metadata-Version: 2.4
+ Name: pyspark-connectby
+ Version: 1.3.1
+ Summary: connectby hierarchy query in spark
+ Author: Chen, Yu
+ Requires-Python: >=3.9,<3.14
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.9
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Programming Language :: Python :: 3.13
+ Description-Content-Type: text/markdown
+
+ # pyspark-connectby
+ As of version 3.5.0, Spark does not support hierarchical `connectBy` queries. There is an open [PR](https://github.com/apache/spark/pull/40744) to support recursive CTE queries, but it is not available yet.
+
+ This is an attempt to add a `connectBy` method to [DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html).
+
+ # Concept
+ Hierarchical queries are an important feature that many relational databases, such as [Oracle](https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Hierarchical-Queries.html#GUID-0118DF1D-B9A9-41EB-8556-C6E7D6A5A84E), DB2, MySQL,
+ Snowflake, and [Redshift](https://docs.aws.amazon.com/redshift/latest/dg/r_CONNECT_BY_clause.html),
+ support either directly or through recursive CTEs.
+
+ Example in Redshift:
+ ```sql
+ select emp_id, name, manager_id, level
+ from employee
+ start with emp_id = 1
+ connect by prior emp_id = manager_id;
+ ```
+
+ With this library, you can use `connectBy()` on a `DataFrame`:
+
+ ```python
+ from pyspark_connectby import connectBy
+ from pyspark.sql import SparkSession
+
+ schema = 'emp_id string, manager_id string, name string'
+ data = [['1', None, 'Carlos'],
+         ['11', '1', 'John'],
+         ['111', '11', 'Jorge'],
+         ['112', '11', 'Kwaku'],
+         ['113', '11', 'Liu'],
+         ['2', None, 'Mat']
+         ]
+ spark = SparkSession.builder.getOrCreate()
+ df = spark.createDataFrame(data, schema)
+ df2 = df.connectBy(prior='emp_id', to='manager_id', start_with='1')
+ df2.show()
+ ```
+ With the result:
+ ```
+ +------+----------+-----+-----------------+----------+------+
+ |emp_id|START_WITH|LEVEL|CONNECT_BY_ISLEAF|manager_id|  name|
+ +------+----------+-----+-----------------+----------+------+
+ |     1|         1|    1|            false|      null|Carlos|
+ |    11|         1|    2|            false|         1|  John|
+ |   111|         1|    3|             true|        11| Jorge|
+ |   112|         1|    3|             true|        11| Kwaku|
+ |   113|         1|    3|             true|        11|   Liu|
+ +------+----------+-----+-----------------+----------+------+
+ ```
+ Note the pseudo columns in the query result (used in the example below):
+ - START_WITH
+ - LEVEL
+ - CONNECT_BY_ISLEAF
+
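+ For example, the pseudo columns can be used to pick out leaf nodes or to limit the traversal depth. A minimal sketch, reusing `df2` from the example above:
+ ```python
+ from pyspark.sql import functions as F
+
+ leaves = df2.filter(F.col('CONNECT_BY_ISLEAF'))             # only the leaf rows (Jorge, Kwaku, Liu)
+ top_levels = df2.filter(F.col('LEVEL') <= 2)                # the root plus its direct reports
+ roots = df2.filter(F.col('START_WITH') == F.col('emp_id'))  # rows that are themselves a starting node
+ ```
+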
+ # Installation
+ ## Python
+ Version >= 3.9, < 3.14
+ ```
+ $ pip install --upgrade pyspark-connectby
+ ```
+
+ # Usage
+
+ ```python
+ from pyspark_connectby import connectBy
+
+ df = ...
+
+ df.connectBy(prior='emp_id', to='manager_id', start_with='1')  # start with `emp_id` '1' as the root
+
+ df.transform(connectBy, prior='emp_id', to='manager_id', start_with='1')  # or via the df.transform() method
+
+ df.connectBy(prior='emp_id', to='manager_id')  # without start_with, every node is used as a starting node
+
+ df.connectBy(prior='emp_id', to='manager_id', start_with=['1', '2'])  # start with a list of top node ids
+ ```
+
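+ The implementation also accepts optional `connect_by_path_cols` and `connect_by_path_separator` arguments (see `connectby_query.py`) and builds a `CONNECT_BY_PATH` pseudo column containing the chain of `prior` ids. A minimal sketch of these options:
+ ```python
+ # Builds CONNECT_BY_PATH (e.g. '1 > 11 > 111') plus CONNECT_BY_PATH_NAME (e.g. 'Carlos > John > Jorge')
+ df.connectBy(prior='emp_id', to='manager_id', start_with='1',
+              connect_by_path_cols=['name'],
+              connect_by_path_separator=' > ')
+ ```
+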
+ # Developer
+ ## Setup
+ ### Java
+ Java 17 or later:
+ ```commandline
+ brew install openjdk@17
+ sudo ln -sfn /opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-17.jdk
+ export JAVA_HOME=$(/usr/libexec/java_home -v 17)  # e.g. in ~/.zshrc
+ ```
+
+ ### poetry
+ ```commandline
+ pipx install poetry
+ poetry env list
+ poetry env use 3.13  # e.g. to create an env for Python 3.13
+ ```
+
+ ### tox
+ ```commandline
+ pipx install tox
+ pipx install uv
+ uv python install 3.9 3.10 3.11 3.12 3.13  # install multiple Python versions
+ ```
+
+ ## Test
+ ```commandline
+ pytest
+ poetry run pytest
+ tox
+ ```
+
+ ## Publish
+ ```commandline
+ poetry version patch
+ poetry version minor
+ poetry publish --build
+ tox -e release
+ ```
+
@@ -0,0 +1,116 @@
+ # pyspark-connectby
+ As of version 3.5.0, Spark does not support hierarchical `connectBy` queries. There is an open [PR](https://github.com/apache/spark/pull/40744) to support recursive CTE queries, but it is not available yet.
+
+ This is an attempt to add a `connectBy` method to [DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html).
+
+ # Concept
+ Hierarchical queries are an important feature that many relational databases, such as [Oracle](https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Hierarchical-Queries.html#GUID-0118DF1D-B9A9-41EB-8556-C6E7D6A5A84E), DB2, MySQL,
+ Snowflake, and [Redshift](https://docs.aws.amazon.com/redshift/latest/dg/r_CONNECT_BY_clause.html),
+ support either directly or through recursive CTEs.
+
+ Example in Redshift:
+ ```sql
+ select emp_id, name, manager_id, level
+ from employee
+ start with emp_id = 1
+ connect by prior emp_id = manager_id;
+ ```
+
+ With this library, you can use `connectBy()` on a `DataFrame`:
+
+ ```python
+ from pyspark_connectby import connectBy
+ from pyspark.sql import SparkSession
+
+ schema = 'emp_id string, manager_id string, name string'
+ data = [['1', None, 'Carlos'],
+         ['11', '1', 'John'],
+         ['111', '11', 'Jorge'],
+         ['112', '11', 'Kwaku'],
+         ['113', '11', 'Liu'],
+         ['2', None, 'Mat']
+         ]
+ spark = SparkSession.builder.getOrCreate()
+ df = spark.createDataFrame(data, schema)
+ df2 = df.connectBy(prior='emp_id', to='manager_id', start_with='1')
+ df2.show()
+ ```
+ With the result:
+ ```
+ +------+----------+-----+-----------------+----------+------+
+ |emp_id|START_WITH|LEVEL|CONNECT_BY_ISLEAF|manager_id|  name|
+ +------+----------+-----+-----------------+----------+------+
+ |     1|         1|    1|            false|      null|Carlos|
+ |    11|         1|    2|            false|         1|  John|
+ |   111|         1|    3|             true|        11| Jorge|
+ |   112|         1|    3|             true|        11| Kwaku|
+ |   113|         1|    3|             true|        11|   Liu|
+ +------+----------+-----+-----------------+----------+------+
+ ```
+ Note the pseudo columns in the query result (used in the example below):
+ - START_WITH
+ - LEVEL
+ - CONNECT_BY_ISLEAF
+
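+ For example, the pseudo columns can be used to pick out leaf nodes or to limit the traversal depth. A minimal sketch, reusing `df2` from the example above:
+ ```python
+ from pyspark.sql import functions as F
+
+ leaves = df2.filter(F.col('CONNECT_BY_ISLEAF'))             # only the leaf rows (Jorge, Kwaku, Liu)
+ top_levels = df2.filter(F.col('LEVEL') <= 2)                # the root plus its direct reports
+ roots = df2.filter(F.col('START_WITH') == F.col('emp_id'))  # rows that are themselves a starting node
+ ```
+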
+ # Installation
+ ## Python
+ Version >= 3.9, < 3.14
+ ```
+ $ pip install --upgrade pyspark-connectby
+ ```
+
+ # Usage
+
+ ```python
+ from pyspark_connectby import connectBy
+
+ df = ...
+
+ df.connectBy(prior='emp_id', to='manager_id', start_with='1')  # start with `emp_id` '1' as the root
+
+ df.transform(connectBy, prior='emp_id', to='manager_id', start_with='1')  # or via the df.transform() method
+
+ df.connectBy(prior='emp_id', to='manager_id')  # without start_with, every node is used as a starting node
+
+ df.connectBy(prior='emp_id', to='manager_id', start_with=['1', '2'])  # start with a list of top node ids
+ ```
+
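+ The implementation also accepts optional `connect_by_path_cols` and `connect_by_path_separator` arguments (see `connectby_query.py`) and builds a `CONNECT_BY_PATH` pseudo column containing the chain of `prior` ids. A minimal sketch of these options:
+ ```python
+ # Builds CONNECT_BY_PATH (e.g. '1 > 11 > 111') plus CONNECT_BY_PATH_NAME (e.g. 'Carlos > John > Jorge')
+ df.connectBy(prior='emp_id', to='manager_id', start_with='1',
+              connect_by_path_cols=['name'],
+              connect_by_path_separator=' > ')
+ ```
+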
+ # Developer
+ ## Setup
+ ### Java
+ Java 17 or later:
+ ```commandline
+ brew install openjdk@17
+ sudo ln -sfn /opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-17.jdk
+ export JAVA_HOME=$(/usr/libexec/java_home -v 17)  # e.g. in ~/.zshrc
+ ```
+
+ ### poetry
+ ```commandline
+ pipx install poetry
+ poetry env list
+ poetry env use 3.13  # e.g. to create an env for Python 3.13
+ ```
+
+ ### tox
+ ```commandline
+ pipx install tox
+ pipx install uv
+ uv python install 3.9 3.10 3.11 3.12 3.13  # install multiple Python versions
+ ```
+
+ ## Test
+ ```commandline
+ pytest
+ poetry run pytest
+ tox
+ ```
+
+ ## Publish
+ ```commandline
+ poetry version patch
+ poetry version minor
+ poetry publish --build
+ tox -e release
+ ```
@@ -0,0 +1,21 @@
+ [tool.poetry]
+ name = "pyspark-connectby"
+ version = "1.3.1"
+ description = "connectby hierarchy query in spark"
+ authors = ["Chen, Yu"]
+ readme = "README.md"
+ packages = [{include = "pyspark_connectby"}]
+
+ [tool.poetry.dependencies]
+ python = ">=3.9,<3.14"
+
+ [tool.poetry.group.dev.dependencies]
+ pyspark = ">=3.5.3"
+
+ [tool.poetry.group.test.dependencies]
+ pytest = "*"
+ pyspark = ">=3.5.3"
+
+ [build-system]
+ requires = ["poetry-core"]
+ build-backend = "poetry.core.masonry.api"
@@ -0,0 +1,4 @@
+ from pyspark_connectby.dataframe_connectby import connectBy
+ from pyspark_connectby.connectby_query import CONNECT_BY_PATH_SEPARATOR
+
+ __all__ = ['connectBy', 'CONNECT_BY_PATH_SEPARATOR']
@@ -0,0 +1,183 @@
+ from dataclasses import dataclass
+ from typing import Union, List
+
+ from pyspark.sql import DataFrame
+ from pyspark.sql import functions as F
+
+ # Names of the pseudo columns added to the query result, and the default path separator
+ COLUMN_START_WITH = 'START_WITH'
+ COLUMN_LEVEL = 'LEVEL'
+ COLUMN_CONNECT_BY_ISLEAF = 'CONNECT_BY_ISLEAF'
+ COLUMN_CONNECT_BY_PATH = 'CONNECT_BY_PATH'
+ CONNECT_BY_PATH_SEPARATOR = '/'
+
+
+ @dataclass
+ class Path:
+     """A traversal path from a start node down to one of its descendants."""
+     steps: List[str]
+     is_leaf: bool = False
+
+     @classmethod
+     def path_start_with(cls, start_id: str) -> 'Path':
+         return cls(steps=[start_id])
+
+     @property
+     def start_id(self) -> str:
+         return self.steps[0]
+
+     @property
+     def end_id(self) -> str:
+         return self.steps[-1]
+
+     @property
+     def level(self) -> int:
+         return len(self.steps)
+
+
+ @dataclass
+ class Node:
+     """A single child/parent edge collected from the source DataFrame."""
+     node_id: str
+     parent_id: str
+
+
+ class ConnectByQuery:
+     """Runs a connect-by (hierarchical) query over a DataFrame with child/parent id columns."""
+
+     def __init__(self, df: DataFrame, child_col: str, parent_col: str, start_with: Union[List[str], str] = None,
+                  connect_by_path_cols: List[str] = None, connect_by_path_separator: str = CONNECT_BY_PATH_SEPARATOR):
+         self.df: DataFrame = df
+         self.child_col = child_col
+         self.parent_col = parent_col
+         self.start_with = start_with
+         self.connect_by_path_cols = self.__connect_by_path_cols(connect_by_path_cols)
+         self.connect_by_path_separator = connect_by_path_separator
+
+         self._start_paths: List[Path] = None
+         self._all_nodes: List[Node] = None
+
+     @property
+     def start_paths(self) -> List[Path]:
+         if self._start_paths is None:
+             if self.start_with is None:
+                 paths = []
+             elif isinstance(self.start_with, list):
+                 paths = [Path.path_start_with(i) for i in self.start_with]
+             else:
+                 assert isinstance(self.start_with, str)
+                 paths = [Path.path_start_with(self.start_with)]
+             # Without start_with, every node in the DataFrame is used as a starting node
+             self._start_paths = paths or self.__default_start_paths()
+
+         return self._start_paths
+
+     @property
+     def all_nodes(self) -> List[Node]:
+         if self._all_nodes is None:
+             rows = self.df.select(self.child_col, self.parent_col).collect()
+             self._all_nodes = [Node(node_id=r[self.child_col], parent_id=r[self.parent_col]) for r in rows]
+         return self._all_nodes
+
+     def __connect_by_path_cols(self, cols: List[str]) -> List[str]:
+         cols_list = cols or []
+         if len(cols_list) == 0:
+             return cols_list
+
+         cols_upper = [c.upper() for c in cols_list]
+         df_cols = [c.upper() for c in self.df.columns]
+
+         assert set(cols_upper).issubset(set(df_cols))
+         assert self.child_col.upper() not in cols_upper, \
+             f'`connect_by_path` pseudo column for {self.child_col} is provided by default'
+
+         return cols_upper
+
+     def __children_with_parent(self, parent_id: str) -> List[Node]:
+         children = list(filter(lambda n: n.parent_id == parent_id, self.all_nodes))
+         return children
+
+     def __default_start_paths(self) -> List[Path]:
+         rows = self.df.collect()
+         return [Path.path_start_with(r[self.child_col]) for r in rows]
+
+     def __fetch_descendants(self, path: Path) -> list:
+         children_nodes: List[Node] = self.__children_with_parent(path.end_id)
+         is_leaf = len(children_nodes) == 0
+         if is_leaf:
+             path.is_leaf = True
+             return []
+
+         children = [Path(steps=path.steps + [c.node_id]) for c in children_nodes]
+         grandchildren = list(map(lambda c: self.__fetch_descendants(c), children))
+
+         # Nested lists are flattened later by __flatten_list
+         descendants = children + grandchildren
+         return descendants
+
+     @staticmethod
+     def __flatten_list(nested_list: list) -> list:
+         flat_list = []
+         for item in nested_list:
+             if isinstance(item, list):
+                 flat_list += ConnectByQuery.__flatten_list(item)
+             else:
+                 flat_list.append(item)
+         return flat_list
+
+     def __run(self) -> List[Path]:
+         descendants = list(map(lambda e: self.__fetch_descendants(e), self.start_paths))
+         descendants_paths: List[Path] = self.__flatten_list(descendants)
+
+         return self.start_paths + descendants_paths
+
+     def get_result_df(self) -> DataFrame:
+         spark = self.df.sparkSession
+         result_paths: List[Path] = self.__run()
+         schema = f'''
+             {COLUMN_START_WITH} string,
+             {self.child_col} string,
+             {COLUMN_LEVEL} int,
+             {COLUMN_CONNECT_BY_ISLEAF} boolean,
+             {COLUMN_CONNECT_BY_PATH} array<string>
+         '''
+
+         data = [(p.start_id, p.end_id, p.level, p.is_leaf, p.steps) for p in result_paths]
+         df_result = spark.createDataFrame(data, schema=schema)
+
+         # Join the traversal result back to the original DataFrame on the child id column
+         df_result = (
+             self.__augment_connect_by_path_if_needed(df_result)
+             .withColumn(COLUMN_CONNECT_BY_PATH, F.concat_ws(self.connect_by_path_separator, COLUMN_CONNECT_BY_PATH))
+             .join(self.df, on=self.child_col)
+         )
+         return df_result
+
+     def __augment_connect_by_path_if_needed(self, df: DataFrame) -> DataFrame:
+         if len(self.connect_by_path_cols) == 0:
+             return df
+
+         result_cols = df.columns
+         df_right = (
+             self.df.select(self.child_col, *self.connect_by_path_cols)
+             .withColumnRenamed(self.child_col, '_exploded_id')
+         )
+
+         # Explode the id path so each step can be joined back to its row in the source DataFrame
+         df_exploded = df.select(*result_cols, F.posexplode(COLUMN_CONNECT_BY_PATH))
+
+         df_joined = (
+             df_exploded.withColumnRenamed('col', '_exploded_id')
+             .join(df_right, on='_exploded_id', how='left')
+         )
+
+         # Collect the requested columns into per-path arrays, keeping the original step order
+         agg_cols = [F.sort_array(F.collect_list(F.struct('pos', col))).alias(f'_sorted_struct_{col}')
+                     for col in self.connect_by_path_cols]
+
+         df_result = (
+             df_joined.groupby(*result_cols)
+             .agg(*agg_cols)
+         )
+
+         for col in self.connect_by_path_cols:
+             struct_col = f'_sorted_struct_{col}'
+             array_col = f'_array_{col}'
+             df_result = (
+                 df_result.withColumn(array_col, F.col(f'{struct_col}.{col}'))
+                 .withColumn(f'{COLUMN_CONNECT_BY_PATH}_{col}',
+                             F.concat_ws(self.connect_by_path_separator, array_col))
+                 .drop(struct_col, array_col)
+             )
+
+         return df_result
@@ -0,0 +1,56 @@
+ from typing import Union, List
+
+ from pyspark.sql import DataFrame
+
+ from pyspark_connectby.connectby_query import ConnectByQuery, CONNECT_BY_PATH_SEPARATOR
+
+
+ def connectBy(df: DataFrame, prior: str, to: str, start_with: Union[List[str], str] = None,
+               connect_by_path_cols: List[str] = None, connect_by_path_separator: str = CONNECT_BY_PATH_SEPARATOR) -> DataFrame:
+     """Returns a new :class:`DataFrame` with the result of a connect-by (hierarchical) query.
+     It is very similar to the CONNECT BY clause in Oracle or Redshift.
+
+     Parameters
+     ----------
+     df: :class:`DataFrame`
+         contains hierarchical data, e.g. child_id and parent_id columns such as `emp_id` and `manager_id`
+     prior:
+         the child_id column, e.g. `emp_id`
+     to:
+         the parent_id column to connect to, e.g. `manager_id`
+     start_with: str, or list of str, optional
+         specifies the root id(s) of the hierarchy, e.g. `'1'` or `['1', '2']`.
+         If you omit this parameter, every child_id in the df will be used as a root id.
+     connect_by_path_cols: list of str, optional
+         additional columns for which a `CONNECT_BY_PATH_<COLUMN>` column is built from the traversal path
+     connect_by_path_separator: str, optional
+         separator used when joining path values, defaults to `'/'`
+
+     Examples
+     --------
+     The following performs a connect-by query on ``df``. `emp_id` '1' is used as the root id.
+
+     >>> from pyspark.sql import SparkSession
+     >>> schema = 'emp_id string, manager_id string, name string'
+     >>> data = [['1', None, 'Carlos'],
+     ...         ['11', '1', 'John'],
+     ...         ['111', '11', 'Jorge'],
+     ...         ['112', '11', 'Kwaku'],
+     ...         ['113', '11', 'Liu'],
+     ...         ['2', None, 'Mat']]
+     >>> spark = SparkSession.builder.getOrCreate()
+     >>> df = spark.createDataFrame(data, schema)
+     >>> df2 = df.connectBy(prior='emp_id', to='manager_id', start_with='1')
+     >>> df2.show()
+     +------+----------+-----+-----------------+----------+------+
+     |emp_id|START_WITH|LEVEL|CONNECT_BY_ISLEAF|manager_id|  name|
+     +------+----------+-----+-----------------+----------+------+
+     |     1|         1|    1|            false|      null|Carlos|
+     |    11|         1|    2|            false|         1|  John|
+     |   111|         1|    3|             true|        11| Jorge|
+     |   112|         1|    3|             true|        11| Kwaku|
+     |   113|         1|    3|             true|        11|   Liu|
+     +------+----------+-----+-----------------+----------+------+
+     """
+     query = ConnectByQuery(df, child_col=prior, parent_col=to, start_with=start_with,
+                            connect_by_path_cols=connect_by_path_cols, connect_by_path_separator=connect_by_path_separator)
+     return query.get_result_df()
+
+
+ # Attach connectBy to DataFrame so it can be called as df.connectBy(...)
+ DataFrame.connectBy = connectBy