mkpipe-loader-postgres 0.3.0__tar.gz → 0.6.0__tar.gz

This diff shows the changes between two publicly released versions of this package, as published to a supported registry. It is provided for informational purposes only and reflects the package contents as they appear in the registry.
Files changed (19)
  1. mkpipe_loader_postgres-0.6.0/PKG-INFO +138 -0
  2. mkpipe_loader_postgres-0.6.0/README.md +114 -0
  3. {mkpipe_loader_postgres-0.3.0 → mkpipe_loader_postgres-0.6.0}/mkpipe_loader_postgres/__init__.py +4 -1
  4. mkpipe_loader_postgres-0.6.0/mkpipe_loader_postgres/jars/.gitkeep +0 -0
  5. mkpipe_loader_postgres-0.6.0/mkpipe_loader_postgres.egg-info/PKG-INFO +138 -0
  6. {mkpipe_loader_postgres-0.3.0 → mkpipe_loader_postgres-0.6.0}/mkpipe_loader_postgres.egg-info/SOURCES.txt +1 -1
  7. {mkpipe_loader_postgres-0.3.0 → mkpipe_loader_postgres-0.6.0}/setup.py +1 -1
  8. mkpipe_loader_postgres-0.3.0/PKG-INFO +0 -50
  9. mkpipe_loader_postgres-0.3.0/README.md +0 -26
  10. mkpipe_loader_postgres-0.3.0/mkpipe_loader_postgres/jars/org.postgresql_postgresql-42.7.4.jar +0 -0
  11. mkpipe_loader_postgres-0.3.0/mkpipe_loader_postgres.egg-info/PKG-INFO +0 -50
  12. {mkpipe_loader_postgres-0.3.0 → mkpipe_loader_postgres-0.6.0}/LICENSE +0 -0
  13. {mkpipe_loader_postgres-0.3.0 → mkpipe_loader_postgres-0.6.0}/MANIFEST.in +0 -0
  14. {mkpipe_loader_postgres-0.3.0 → mkpipe_loader_postgres-0.6.0}/mkpipe_loader_postgres/jar_paths.py +0 -0
  15. {mkpipe_loader_postgres-0.3.0 → mkpipe_loader_postgres-0.6.0}/mkpipe_loader_postgres.egg-info/dependency_links.txt +0 -0
  16. {mkpipe_loader_postgres-0.3.0 → mkpipe_loader_postgres-0.6.0}/mkpipe_loader_postgres.egg-info/entry_points.txt +0 -0
  17. {mkpipe_loader_postgres-0.3.0 → mkpipe_loader_postgres-0.6.0}/mkpipe_loader_postgres.egg-info/requires.txt +0 -0
  18. {mkpipe_loader_postgres-0.3.0 → mkpipe_loader_postgres-0.6.0}/mkpipe_loader_postgres.egg-info/top_level.txt +0 -0
  19. {mkpipe_loader_postgres-0.3.0 → mkpipe_loader_postgres-0.6.0}/setup.cfg +0 -0
@@ -0,0 +1,138 @@
+ Metadata-Version: 2.4
+ Name: mkpipe-loader-postgres
+ Version: 0.6.0
+ Summary: PostgreSQL loader for mkpipe.
+ Author: Metin Karakus
+ Author-email: metin_karakus@yahoo.com
+ License: Apache License 2.0
+ Classifier: Programming Language :: Python :: 3
+ Classifier: License :: OSI Approved :: Apache Software License
+ Requires-Python: >=3.8
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: mkpipe
+ Dynamic: author
+ Dynamic: author-email
+ Dynamic: classifier
+ Dynamic: description
+ Dynamic: description-content-type
+ Dynamic: license
+ Dynamic: license-file
+ Dynamic: requires-dist
+ Dynamic: requires-python
+ Dynamic: summary
+
+ # mkpipe-loader-postgres
+
+ PostgreSQL loader plugin for [MkPipe](https://github.com/mkpipe-etl/mkpipe). Writes Spark DataFrames into PostgreSQL tables via JDBC.
+
+ ## Documentation
+
+ For more detailed documentation, please visit the [GitHub repository](https://github.com/mkpipe-etl/mkpipe).
+
+ ## License
+
+ This project is licensed under the Apache 2.0 License - see the [LICENSE](LICENSE) file for details.
+
+ ---
+
+ ## Connection Configuration
+
+ ```yaml
+ connections:
+   pg_target:
+     variant: postgres
+     host: localhost
+     port: 5432
+     database: mydb
+     schema: public
+     user: myuser
+     password: mypassword
+ ```
+
+ ---
+
+ ## Table Configuration
+
+ ```yaml
+ pipelines:
+   - name: source_to_pg
+     source: my_source
+     destination: pg_target
+     tables:
+       - name: source_table
+         target_name: public.stg_table
+         replication_method: full
+         batchsize: 10000
+
+       - name: source_table
+         target_name: public.stg_table
+         replication_method: incremental
+         iterate_column: updated_at
+         write_strategy: upsert
+         write_key: [id]
+ ```
+
+ ---
+
+ ## Write Strategy
+
+ Control how data is written to PostgreSQL:
+
+ ```yaml
+ - name: source_table
+   target_name: public.stg_table
+   write_strategy: upsert   # append | replace | upsert | merge
+   write_key: [id]          # required for upsert/merge
+ ```
+
+ | Strategy | PostgreSQL Behavior |
+ |---|---|
+ | `append` | Plain `INSERT` via JDBC (default for incremental) |
+ | `replace` | Drop and recreate table, then insert (default for full) |
+ | `upsert` | `INSERT ... ON CONFLICT (write_key) DO UPDATE` via temp table |
+ | `merge` | Same as upsert for PostgreSQL |
+
+ > **Note:** `upsert`/`merge` requires `write_key`. The loader writes to a temp table first, then executes a single `INSERT ... ON CONFLICT` statement to merge into the target.
+
+ ---
+
+ ## Write Parallelism & Throughput
+
+ Two parameters control write performance:
+
+ ```yaml
+ - name: source_table
+   target_name: public.stg_table
+   replication_method: full
+   batchsize: 10000        # rows per JDBC batch insert (default: 10000)
+   write_partitions: 4     # coalesce DataFrame to N partitions before writing
+ ```
+
+ ### How they work
+
+ - **`batchsize`**: rows buffered before sending one `INSERT` statement. PostgreSQL handles 5,000–10,000 well; very large batches (>100K) can increase memory pressure.
+ - **`write_partitions`**: calls `coalesce(N)` on the DataFrame, reducing concurrent JDBC connections to PostgreSQL.
+
+ ### Performance Notes
+
+ - PostgreSQL's `COPY` protocol is faster than JDBC for bulk loads, but mkpipe uses JDBC for portability.
+ - For large loads, `write_partitions: 4–8` with `batchsize: 10000` is a reliable baseline.
+ - If the target table has many indexes or constraints, writes will be slower; consider disabling indexes during bulk loads.
+
+ ---
+
+ ## All Table Parameters
+
+ | Parameter | Type | Default | Description |
+ |---|---|---|---|
+ | `name` | string | required | Source table name |
+ | `target_name` | string | required | PostgreSQL destination table name |
+ | `replication_method` | `full` / `incremental` | `full` | Replication strategy |
+ | `batchsize` | int | `10000` | Rows per JDBC batch insert |
+ | `write_partitions` | int | — | Coalesce DataFrame to N partitions before writing |
+ | `write_strategy` | string | — | `append`, `replace`, `upsert`, `merge` |
+ | `write_key` | list | — | Key columns for upsert/merge (required for those strategies) |
+ | `dedup_columns` | list | — | Columns used for `mkpipe_id` hash deduplication |
+ | `tags` | list | `[]` | Tags for selective pipeline execution |
+ | `pass_on_error` | bool | `false` | Skip table on error instead of failing |
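The temp-table upsert described in the README note above comes down to one `INSERT ... ON CONFLICT` statement derived from `target_name` and `write_key`. A minimal sketch of how such a statement can be assembled (the function name, temp-table name, and exact SQL shape are illustrative assumptions, not mkpipe's actual internals):

```python
def build_upsert_sql(target, temp, columns, write_key):
    """Build an INSERT ... ON CONFLICT DO UPDATE merging a temp table into the target.

    Illustrative sketch only; mkpipe's real SQL generation is not shown in this diff.
    """
    cols = ', '.join(columns)
    conflict = ', '.join(write_key)
    # Non-key columns are overwritten with the incoming row via EXCLUDED.
    updates = ', '.join(
        '{c} = EXCLUDED.{c}'.format(c=c) for c in columns if c not in write_key
    )
    return (
        'INSERT INTO {target} ({cols}) '
        'SELECT {cols} FROM {temp} '
        'ON CONFLICT ({conflict}) DO UPDATE SET {updates}'
    ).format(target=target, cols=cols, temp=temp, conflict=conflict, updates=updates)


sql = build_upsert_sql(
    target='public.stg_table',
    temp='tmp_stg_table',
    columns=['id', 'name', 'updated_at'],
    write_key=['id'],
)
```

Because the merge is a single statement, PostgreSQL applies it atomically, which is why the loader can stage rows in a temp table first without readers ever seeing a half-written target.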
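The interaction of `batchsize` and `write_partitions` described above can be made concrete with a little arithmetic: Spark's JDBC writer opens one connection per partition and flushes a batched `INSERT` every `batchsize` rows. A small sketch of the resulting write plan (helper name and return shape are illustrative):

```python
import math


def write_plan(total_rows, write_partitions, batchsize):
    """Estimate connections and batched INSERT round-trips for a JDBC write.

    Illustrative arithmetic only: one JDBC connection per partition, one
    batched INSERT per `batchsize` rows within each partition.
    """
    rows_per_partition = math.ceil(total_rows / write_partitions)
    batches_per_partition = math.ceil(rows_per_partition / batchsize)
    return {
        'connections': write_partitions,
        'rows_per_partition': rows_per_partition,
        'batches_per_partition': batches_per_partition,
    }


plan = write_plan(total_rows=1_000_000, write_partitions=4, batchsize=10_000)
```

For 1M rows with the baseline settings, this works out to 4 concurrent connections each issuing 25 batched inserts, which is why raising `batchsize` trades round-trips for memory and raising `write_partitions` trades connection count for per-connection load.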
@@ -1,9 +1,12 @@
  from mkpipe.spark import JdbcLoader
 
+ JAR_PACKAGES = ['org.postgresql:postgresql:42.7.4']
 
- class PostgresLoader(JdbcLoader, variant='postgresql'):
+
+ class PostgresLoader(JdbcLoader, variant='postgres'):
      driver_name = 'postgresql'
      driver_jdbc = 'org.postgresql.Driver'
+     _dialect = 'postgres'
 
      def build_jdbc_url(self):
          url = (
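The `variant='postgres'` keyword in the class definition above (renamed from `'postgresql'` in this release, matching the `variant: postgres` connection key) suggests the base class registers subclasses by variant. mkpipe's actual mechanism is not shown in this diff, but the common Python idiom for this is `__init_subclass__`, sketched here with a stand-in base class:

```python
class JdbcLoader:
    """Stand-in base class; mkpipe's real JdbcLoader is not shown in this diff."""

    _registry = {}

    def __init_subclass__(cls, variant=None, **kwargs):
        super().__init_subclass__(**kwargs)
        if variant is not None:
            # Connection configs with `variant: postgres` resolve to this class.
            JdbcLoader._registry[variant] = cls


class PostgresLoader(JdbcLoader, variant='postgres'):
    driver_name = 'postgresql'
    driver_jdbc = 'org.postgresql.Driver'


# Look up the loader class the way a pipeline config would select it.
loader_cls = JdbcLoader._registry['postgres']
```

Under this assumption, renaming the variant is a breaking config change: configs still using `variant: postgresql` would no longer resolve to this loader after upgrading.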
@@ -10,4 +10,4 @@ mkpipe_loader_postgres.egg-info/dependency_links.txt
  mkpipe_loader_postgres.egg-info/entry_points.txt
  mkpipe_loader_postgres.egg-info/requires.txt
  mkpipe_loader_postgres.egg-info/top_level.txt
- mkpipe_loader_postgres/jars/org.postgresql_postgresql-42.7.4.jar
+ mkpipe_loader_postgres/jars/.gitkeep
@@ -2,7 +2,7 @@ from setuptools import setup, find_packages
 
  setup(
      name='mkpipe-loader-postgres',
-     version='0.3.0',
+     version='0.6.0',
      license='Apache License 2.0',
      packages=find_packages(exclude=['tests', 'scripts', 'deploy', 'install_jars.py']),
      install_requires=['mkpipe'],
@@ -1,50 +0,0 @@
1
- Metadata-Version: 2.4
2
- Name: mkpipe-loader-postgres
3
- Version: 0.3.0
4
- Summary: PostgreSQL loader for mkpipe.
5
- Author: Metin Karakus
6
- Author-email: metin_karakus@yahoo.com
7
- License: Apache License 2.0
8
- Classifier: Programming Language :: Python :: 3
9
- Classifier: License :: OSI Approved :: Apache Software License
10
- Requires-Python: >=3.8
11
- Description-Content-Type: text/markdown
12
- License-File: LICENSE
13
- Requires-Dist: mkpipe
14
- Dynamic: author
15
- Dynamic: author-email
16
- Dynamic: classifier
17
- Dynamic: description
18
- Dynamic: description-content-type
19
- Dynamic: license
20
- Dynamic: license-file
21
- Dynamic: requires-dist
22
- Dynamic: requires-python
23
- Dynamic: summary
24
-
25
- # MkPipe
26
-
27
- **MkPipe** is a modular, open-source ETL (Extract, Transform, Load) tool that allows you to integrate various data sources and sinks easily. It is designed to be extensible with a plugin-based architecture that supports extractors, transformers, and loaders.
28
-
29
- ## Documentation
30
-
31
- For more detailed documentation, please visit the [GitHub repository](https://github.com/mkpipe-etl/mkpipe).
32
-
33
- ## License
34
-
35
- This project is licensed under the Apache 2.0 License - see the [LICENSE](LICENSE) file for details.
36
-
37
-
38
- ## mkpipe_project.yaml Variables
39
- ```yaml
40
- ...
41
- connections:
42
- source:
43
- host: 'XXX'
44
- port: 'XXX'
45
- database: 'XXX'
46
- schema: 'XXX'
47
- user: 'XXX'
48
- password: 'XXX'
49
- ...
50
- ```