pyspark-connectby 1.3.1.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,131 @@
+ Metadata-Version: 2.4
+ Name: pyspark-connectby
+ Version: 1.3.1
+ Summary: connectby hierarchy query in spark
+ Author: Chen, Yu
+ Requires-Python: >=3.9,<3.14
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.9
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Programming Language :: Python :: 3.13
+ Description-Content-Type: text/markdown
+
+ # pyspark-connectby
+ As of version 3.5.0, Spark does not support hierarchical `connectBy` queries. There is an open [PR](https://github.com/apache/spark/pull/40744) to support recursive CTE queries, but it is not available yet.
+
+ This is an attempt to add a `connectBy` method to [DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html).
+
+ # Concept
+ Hierarchical queries are an important feature that many relational databases, such as [Oracle](https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Hierarchical-Queries.html#GUID-0118DF1D-B9A9-41EB-8556-C6E7D6A5A84E), DB2, MySQL,
+ Snowflake, and [Redshift](https://docs.aws.amazon.com/redshift/latest/dg/r_CONNECT_BY_clause.html),
+ support either directly or through recursive CTEs.
+
+ Example in Redshift:
+ ```sql
+ select emp_id, name, manager_id, level
+ from employee
+ start with emp_id = 1
+ connect by prior emp_id = manager_id;
+ ```
+
+ With this library, you can use `connectBy()` on a `DataFrame`:
+
+ ```python
+ from pyspark_connectby import connectBy
+ from pyspark.sql import SparkSession
+
+ schema = 'emp_id string, manager_id string, name string'
+ data = [['1', None, 'Carlos'],
+         ['11', '1', 'John'],
+         ['111', '11', 'Jorge'],
+         ['112', '11', 'Kwaku'],
+         ['113', '11', 'Liu'],
+         ['2', None, 'Mat']
+         ]
+ spark = SparkSession.builder.getOrCreate()
+ df = spark.createDataFrame(data, schema)
+ df2 = df.connectBy(prior='emp_id', to='manager_id', start_with='1')
+ df2.show()
+ ```
+ With the result:
+ ```
+ +------+----------+-----+-----------------+----------+------+
+ |emp_id|START_WITH|LEVEL|CONNECT_BY_ISLEAF|manager_id|  name|
+ +------+----------+-----+-----------------+----------+------+
+ |     1|         1|    1|            false|      null|Carlos|
+ |    11|         1|    2|            false|         1|  John|
+ |   111|         1|    3|             true|        11| Jorge|
+ |   112|         1|    3|             true|        11| Kwaku|
+ |   113|         1|    3|             true|        11|   Liu|
+ +------+----------+-----+-----------------+----------+------+
+ ```
+ Note the pseudo columns in the query result (used in the example below):
+ - START_WITH
+ - LEVEL
+ - CONNECT_BY_ISLEAF
+
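+ For example, the pseudo columns can be used to pick out leaf nodes or to limit the traversal depth. A minimal sketch, reusing `df2` from the example above:
+ ```python
+ from pyspark.sql import functions as F
+
+ leaves = df2.filter(F.col('CONNECT_BY_ISLEAF'))             # only the leaf rows (Jorge, Kwaku, Liu)
+ top_levels = df2.filter(F.col('LEVEL') <= 2)                # the root plus its direct reports
+ roots = df2.filter(F.col('START_WITH') == F.col('emp_id'))  # rows that are themselves a starting node
+ ```
+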
+ # Installation
+ ## Python
+ Version >= 3.9, < 3.14
+ ```
+ $ pip install --upgrade pyspark-connectby
+ ```
+
+ # Usage
+
+ ```python
+ from pyspark_connectby import connectBy
+
+ df = ...
+
+ df.connectBy(prior='emp_id', to='manager_id', start_with='1')  # start with `emp_id` '1' as the root
+
+ df.transform(connectBy, prior='emp_id', to='manager_id', start_with='1')  # or via the df.transform() method
+
+ df.connectBy(prior='emp_id', to='manager_id')  # without start_with, every node is used as a starting node
+
+ df.connectBy(prior='emp_id', to='manager_id', start_with=['1', '2'])  # start with a list of top node ids
+ ```
+
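+ The implementation also accepts optional `connect_by_path_cols` and `connect_by_path_separator` arguments (see `connectby_query.py`) and builds a `CONNECT_BY_PATH` pseudo column containing the chain of `prior` ids. A minimal sketch of these options:
+ ```python
+ # Builds CONNECT_BY_PATH (e.g. '1 > 11 > 111') plus CONNECT_BY_PATH_NAME (e.g. 'Carlos > John > Jorge')
+ df.connectBy(prior='emp_id', to='manager_id', start_with='1',
+              connect_by_path_cols=['name'],
+              connect_by_path_separator=' > ')
+ ```
+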
+ # Developer
+ ## Setup
+ ### Java
+ Java 17 or later:
+ ```commandline
+ brew install openjdk@17
+ sudo ln -sfn /opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-17.jdk
+ export JAVA_HOME=$(/usr/libexec/java_home -v 17)  # e.g. in ~/.zshrc
+ ```
+
+ ### poetry
+ ```commandline
+ pipx install poetry
+ poetry env list
+ poetry env use 3.13  # e.g. to create an env for Python 3.13
+ ```
+
+ ### tox
+ ```commandline
+ pipx install tox
+ pipx install uv
+ uv python install 3.9 3.10 3.11 3.12 3.13  # install multiple Python versions
+ ```
+
+ ## Test
+ ```commandline
+ pytest
+ poetry run pytest
+ tox
+ ```
+
+ ## Publish
+ ```commandline
+ poetry version patch
+ poetry version minor
+ poetry publish --build
+ tox -e release
+ ```
+
@@ -0,0 +1,116 @@
+ # pyspark-connectby
+ As of version 3.5.0, Spark does not support hierarchical `connectBy` queries. There is an open [PR](https://github.com/apache/spark/pull/40744) to support recursive CTE queries, but it is not available yet.
+
+ This is an attempt to add a `connectBy` method to [DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html).
+
+ # Concept
+ Hierarchical queries are an important feature that many relational databases, such as [Oracle](https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Hierarchical-Queries.html#GUID-0118DF1D-B9A9-41EB-8556-C6E7D6A5A84E), DB2, MySQL,
+ Snowflake, and [Redshift](https://docs.aws.amazon.com/redshift/latest/dg/r_CONNECT_BY_clause.html),
+ support either directly or through recursive CTEs.
+
+ Example in Redshift:
+ ```sql
+ select emp_id, name, manager_id, level
+ from employee
+ start with emp_id = 1
+ connect by prior emp_id = manager_id;
+ ```
+
+ With this library, you can use `connectBy()` on a `DataFrame`:
+
+ ```python
+ from pyspark_connectby import connectBy
+ from pyspark.sql import SparkSession
+
+ schema = 'emp_id string, manager_id string, name string'
+ data = [['1', None, 'Carlos'],
+         ['11', '1', 'John'],
+         ['111', '11', 'Jorge'],
+         ['112', '11', 'Kwaku'],
+         ['113', '11', 'Liu'],
+         ['2', None, 'Mat']
+         ]
+ spark = SparkSession.builder.getOrCreate()
+ df = spark.createDataFrame(data, schema)
+ df2 = df.connectBy(prior='emp_id', to='manager_id', start_with='1')
+ df2.show()
+ ```
+ With the result:
+ ```
+ +------+----------+-----+-----------------+----------+------+
+ |emp_id|START_WITH|LEVEL|CONNECT_BY_ISLEAF|manager_id|  name|
+ +------+----------+-----+-----------------+----------+------+
+ |     1|         1|    1|            false|      null|Carlos|
+ |    11|         1|    2|            false|         1|  John|
+ |   111|         1|    3|             true|        11| Jorge|
+ |   112|         1|    3|             true|        11| Kwaku|
+ |   113|         1|    3|             true|        11|   Liu|
+ +------+----------+-----+-----------------+----------+------+
+ ```
+ Note the pseudo columns in the query result (used in the example below):
+ - START_WITH
+ - LEVEL
+ - CONNECT_BY_ISLEAF
+
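+ For example, the pseudo columns can be used to pick out leaf nodes or to limit the traversal depth. A minimal sketch, reusing `df2` from the example above:
+ ```python
+ from pyspark.sql import functions as F
+
+ leaves = df2.filter(F.col('CONNECT_BY_ISLEAF'))             # only the leaf rows (Jorge, Kwaku, Liu)
+ top_levels = df2.filter(F.col('LEVEL') <= 2)                # the root plus its direct reports
+ roots = df2.filter(F.col('START_WITH') == F.col('emp_id'))  # rows that are themselves a starting node
+ ```
+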
+ # Installation
+ ## Python
+ Version >= 3.9, < 3.14
+ ```
+ $ pip install --upgrade pyspark-connectby
+ ```
+
+ # Usage
+
+ ```python
+ from pyspark_connectby import connectBy
+
+ df = ...
+
+ df.connectBy(prior='emp_id', to='manager_id', start_with='1')  # start with `emp_id` '1' as the root
+
+ df.transform(connectBy, prior='emp_id', to='manager_id', start_with='1')  # or via the df.transform() method
+
+ df.connectBy(prior='emp_id', to='manager_id')  # without start_with, every node is used as a starting node
+
+ df.connectBy(prior='emp_id', to='manager_id', start_with=['1', '2'])  # start with a list of top node ids
+ ```
+
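+ The implementation also accepts optional `connect_by_path_cols` and `connect_by_path_separator` arguments (see `connectby_query.py`) and builds a `CONNECT_BY_PATH` pseudo column containing the chain of `prior` ids. A minimal sketch of these options:
+ ```python
+ # Builds CONNECT_BY_PATH (e.g. '1 > 11 > 111') plus CONNECT_BY_PATH_NAME (e.g. 'Carlos > John > Jorge')
+ df.connectBy(prior='emp_id', to='manager_id', start_with='1',
+              connect_by_path_cols=['name'],
+              connect_by_path_separator=' > ')
+ ```
+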
+ # Developer
+ ## Setup
+ ### Java
+ Java 17 or later:
+ ```commandline
+ brew install openjdk@17
+ sudo ln -sfn /opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-17.jdk
+ export JAVA_HOME=$(/usr/libexec/java_home -v 17)  # e.g. in ~/.zshrc
+ ```
+
+ ### poetry
+ ```commandline
+ pipx install poetry
+ poetry env list
+ poetry env use 3.13  # e.g. to create an env for Python 3.13
+ ```
+
+ ### tox
+ ```commandline
+ pipx install tox
+ pipx install uv
+ uv python install 3.9 3.10 3.11 3.12 3.13  # install multiple Python versions
+ ```
+
+ ## Test
+ ```commandline
+ pytest
+ poetry run pytest
+ tox
+ ```
+
+ ## Publish
+ ```commandline
+ poetry version patch
+ poetry version minor
+ poetry publish --build
+ tox -e release
+ ```
@@ -0,0 +1,21 @@
+ [tool.poetry]
+ name = "pyspark-connectby"
+ version = "1.3.1"
+ description = "connectby hierarchy query in spark"
+ authors = ["Chen, Yu"]
+ readme = "README.md"
+ packages = [{include = "pyspark_connectby"}]
+
+ [tool.poetry.dependencies]
+ python = ">=3.9,<3.14"
+
+ [tool.poetry.group.dev.dependencies]
+ pyspark = ">=3.5.3"
+
+ [tool.poetry.group.test.dependencies]
+ pytest = "*"
+ pyspark = ">=3.5.3"
+
+ [build-system]
+ requires = ["poetry-core"]
+ build-backend = "poetry.core.masonry.api"
@@ -0,0 +1,4 @@
+ from pyspark_connectby.dataframe_connectby import connectBy
+ from pyspark_connectby.connectby_query import CONNECT_BY_PATH_SEPARATOR
+
+ __all__ = ['connectBy', 'CONNECT_BY_PATH_SEPARATOR']
@@ -0,0 +1,183 @@
+ from dataclasses import dataclass
+ from typing import Union, List
+
+ from pyspark.sql import DataFrame
+ from pyspark.sql import functions as F
+
+ # Names of the pseudo columns added to the query result, and the default path separator
+ COLUMN_START_WITH = 'START_WITH'
+ COLUMN_LEVEL = 'LEVEL'
+ COLUMN_CONNECT_BY_ISLEAF = 'CONNECT_BY_ISLEAF'
+ COLUMN_CONNECT_BY_PATH = 'CONNECT_BY_PATH'
+ CONNECT_BY_PATH_SEPARATOR = '/'
+
+
+ @dataclass
+ class Path:
+     """A traversal path from a start node down to one of its descendants."""
+     steps: List[str]
+     is_leaf: bool = False
+
+     @classmethod
+     def path_start_with(cls, start_id: str) -> 'Path':
+         return cls(steps=[start_id])
+
+     @property
+     def start_id(self) -> str:
+         return self.steps[0]
+
+     @property
+     def end_id(self) -> str:
+         return self.steps[-1]
+
+     @property
+     def level(self) -> int:
+         return len(self.steps)
+
+
+ @dataclass
+ class Node:
+     """A single child/parent edge collected from the source DataFrame."""
+     node_id: str
+     parent_id: str
+
+
+ class ConnectByQuery:
+     """Runs a connect-by (hierarchical) query over a DataFrame with child/parent id columns."""
+
+     def __init__(self, df: DataFrame, child_col: str, parent_col: str, start_with: Union[List[str], str] = None,
+                  connect_by_path_cols: List[str] = None, connect_by_path_separator: str = CONNECT_BY_PATH_SEPARATOR):
+         self.df: DataFrame = df
+         self.child_col = child_col
+         self.parent_col = parent_col
+         self.start_with = start_with
+         self.connect_by_path_cols = self.__connect_by_path_cols(connect_by_path_cols)
+         self.connect_by_path_separator = connect_by_path_separator
+
+         self._start_paths: List[Path] = None
+         self._all_nodes: List[Node] = None
+
+     @property
+     def start_paths(self) -> List[Path]:
+         if self._start_paths is None:
+             if self.start_with is None:
+                 paths = []
+             elif isinstance(self.start_with, list):
+                 paths = [Path.path_start_with(i) for i in self.start_with]
+             else:
+                 assert isinstance(self.start_with, str)
+                 paths = [Path.path_start_with(self.start_with)]
+             # Without start_with, every node in the DataFrame is used as a starting node
+             self._start_paths = paths or self.__default_start_paths()
+
+         return self._start_paths
+
+     @property
+     def all_nodes(self) -> List[Node]:
+         if self._all_nodes is None:
+             rows = self.df.select(self.child_col, self.parent_col).collect()
+             self._all_nodes = [Node(node_id=r[self.child_col], parent_id=r[self.parent_col]) for r in rows]
+         return self._all_nodes
+
+     def __connect_by_path_cols(self, cols: List[str]) -> List[str]:
+         cols_list = cols or []
+         if len(cols_list) == 0:
+             return cols_list
+
+         cols_upper = [c.upper() for c in cols_list]
+         df_cols = [c.upper() for c in self.df.columns]
+
+         assert set(cols_upper).issubset(set(df_cols))
+         assert self.child_col.upper() not in cols_upper, \
+             f'`connect_by_path` pseudo column for {self.child_col} is provided by default'
+
+         return cols_upper
+
+     def __children_with_parent(self, parent_id: str) -> List[Node]:
+         children = list(filter(lambda n: n.parent_id == parent_id, self.all_nodes))
+         return children
+
+     def __default_start_paths(self) -> List[Path]:
+         rows = self.df.collect()
+         return [Path.path_start_with(r[self.child_col]) for r in rows]
+
+     def __fetch_descendants(self, path: Path) -> list:
+         children_nodes: List[Node] = self.__children_with_parent(path.end_id)
+         is_leaf = len(children_nodes) == 0
+         if is_leaf:
+             path.is_leaf = True
+             return []
+
+         children = [Path(steps=path.steps + [c.node_id]) for c in children_nodes]
+         grandchildren = list(map(lambda c: self.__fetch_descendants(c), children))
+
+         # Nested lists are flattened later by __flatten_list
+         descendants = children + grandchildren
+         return descendants
+
+     @staticmethod
+     def __flatten_list(nested_list: list) -> list:
+         flat_list = []
+         for item in nested_list:
+             if isinstance(item, list):
+                 flat_list += ConnectByQuery.__flatten_list(item)
+             else:
+                 flat_list.append(item)
+         return flat_list
+
+     def __run(self) -> List[Path]:
+         descendants = list(map(lambda e: self.__fetch_descendants(e), self.start_paths))
+         descendants_paths: List[Path] = self.__flatten_list(descendants)
+
+         return self.start_paths + descendants_paths
+
+     def get_result_df(self) -> DataFrame:
+         spark = self.df.sparkSession
+         result_paths: List[Path] = self.__run()
+         schema = f'''
+             {COLUMN_START_WITH} string,
+             {self.child_col} string,
+             {COLUMN_LEVEL} int,
+             {COLUMN_CONNECT_BY_ISLEAF} boolean,
+             {COLUMN_CONNECT_BY_PATH} array<string>
+         '''
+
+         data = [(p.start_id, p.end_id, p.level, p.is_leaf, p.steps) for p in result_paths]
+         df_result = spark.createDataFrame(data, schema=schema)
+
+         # Join the traversal result back to the original DataFrame on the child id column
+         df_result = (
+             self.__augment_connect_by_path_if_needed(df_result)
+             .withColumn(COLUMN_CONNECT_BY_PATH, F.concat_ws(self.connect_by_path_separator, COLUMN_CONNECT_BY_PATH))
+             .join(self.df, on=self.child_col)
+         )
+         return df_result
+
+     def __augment_connect_by_path_if_needed(self, df: DataFrame) -> DataFrame:
+         if len(self.connect_by_path_cols) == 0:
+             return df
+
+         result_cols = df.columns
+         df_right = (
+             self.df.select(self.child_col, *self.connect_by_path_cols)
+             .withColumnRenamed(self.child_col, '_exploded_id')
+         )
+
+         # Explode the id path so each step can be joined back to its row in the source DataFrame
+         df_exploded = df.select(*result_cols, F.posexplode(COLUMN_CONNECT_BY_PATH))
+
+         df_joined = (
+             df_exploded.withColumnRenamed('col', '_exploded_id')
+             .join(df_right, on='_exploded_id', how='left')
+         )
+
+         # Collect the requested columns into per-path arrays, keeping the original step order
+         agg_cols = [F.sort_array(F.collect_list(F.struct('pos', col))).alias(f'_sorted_struct_{col}')
+                     for col in self.connect_by_path_cols]
+
+         df_result = (
+             df_joined.groupby(*result_cols)
+             .agg(*agg_cols)
+         )
+
+         for col in self.connect_by_path_cols:
+             struct_col = f'_sorted_struct_{col}'
+             array_col = f'_array_{col}'
+             df_result = (
+                 df_result.withColumn(array_col, F.col(f'{struct_col}.{col}'))
+                 .withColumn(f'{COLUMN_CONNECT_BY_PATH}_{col}',
+                             F.concat_ws(self.connect_by_path_separator, array_col))
+                 .drop(struct_col, array_col)
+             )
+
+         return df_result
@@ -0,0 +1,56 @@
+ from typing import Union, List
+
+ from pyspark.sql import DataFrame
+
+ from pyspark_connectby.connectby_query import ConnectByQuery, CONNECT_BY_PATH_SEPARATOR
+
+
+ def connectBy(df: DataFrame, prior: str, to: str, start_with: Union[List[str], str] = None,
+               connect_by_path_cols: List[str] = None, connect_by_path_separator: str = CONNECT_BY_PATH_SEPARATOR) -> DataFrame:
+     """Returns a new :class:`DataFrame` with the result of a connect-by (hierarchical) query.
+     It is very similar to the CONNECT BY clause in Oracle or Redshift.
+
+     Parameters
+     ----------
+     df: :class:`DataFrame`
+         contains hierarchical data, e.g. child_id and parent_id columns such as `emp_id` and `manager_id`
+     prior:
+         the child_id column, e.g. `emp_id`
+     to:
+         the parent_id column to connect to, e.g. `manager_id`
+     start_with: str, or list of str, optional
+         specifies the root id(s) of the hierarchy, e.g. `'1'` or `['1', '2']`.
+         If you omit this parameter, every child_id in the df will be used as a root id.
+     connect_by_path_cols: list of str, optional
+         additional columns for which a `CONNECT_BY_PATH_<COLUMN>` column is built from the traversal path
+     connect_by_path_separator: str, optional
+         separator used when joining path values, defaults to `'/'`
+
+     Examples
+     --------
+     The following performs a connect-by query on ``df``. `emp_id` '1' is used as the root id.
+
+     >>> from pyspark.sql import SparkSession
+     >>> schema = 'emp_id string, manager_id string, name string'
+     >>> data = [['1', None, 'Carlos'],
+     ...         ['11', '1', 'John'],
+     ...         ['111', '11', 'Jorge'],
+     ...         ['112', '11', 'Kwaku'],
+     ...         ['113', '11', 'Liu'],
+     ...         ['2', None, 'Mat']]
+     >>> spark = SparkSession.builder.getOrCreate()
+     >>> df = spark.createDataFrame(data, schema)
+     >>> df2 = df.connectBy(prior='emp_id', to='manager_id', start_with='1')
+     >>> df2.show()
+     +------+----------+-----+-----------------+----------+------+
+     |emp_id|START_WITH|LEVEL|CONNECT_BY_ISLEAF|manager_id|  name|
+     +------+----------+-----+-----------------+----------+------+
+     |     1|         1|    1|            false|      null|Carlos|
+     |    11|         1|    2|            false|         1|  John|
+     |   111|         1|    3|             true|        11| Jorge|
+     |   112|         1|    3|             true|        11| Kwaku|
+     |   113|         1|    3|             true|        11|   Liu|
+     +------+----------+-----+-----------------+----------+------+
+     """
+     query = ConnectByQuery(df, child_col=prior, parent_col=to, start_with=start_with,
+                            connect_by_path_cols=connect_by_path_cols, connect_by_path_separator=connect_by_path_separator)
+     return query.get_result_df()
+
+
+ # Attach connectBy to DataFrame so it can be called as df.connectBy(...)
+ DataFrame.connectBy = connectBy