ducklake-delta-exporter 0.1.1__tar.gz → 0.1.3__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,123 @@
+ Metadata-Version: 2.4
+ Name: ducklake-delta-exporter
+ Version: 0.1.3
+ Summary: A utility to export DuckLake database metadata to Delta Lake transaction logs.
+ Home-page: https://github.com/djouallah/ducklake_delta_exporter
+ Author: mim
+ Author-email: your.email@example.com
+ Classifier: Programming Language :: Python :: 3
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: OS Independent
+ Classifier: Intended Audience :: Developers
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
+ Classifier: Development Status :: 3 - Alpha
+ Requires-Python: >=3.8
+ Description-Content-Type: text/markdown
+ Requires-Dist: duckdb
+ Dynamic: author
+ Dynamic: author-email
+ Dynamic: classifier
+ Dynamic: description
+ Dynamic: description-content-type
+ Dynamic: home-page
+ Dynamic: requires-dist
+ Dynamic: requires-python
+ Dynamic: summary
+
+ # DuckLake Delta Exporter
+
+ A Python package for exporting DuckLake snapshots as Delta Lake checkpoint files, enabling compatibility with Delta Lake readers. Local paths, S3, and GCS are supported; for OneLake, use mounted storage, as Azure storage is not supported.
+
+ This is just a fun project.
+
+ ## Repository
+
+ https://github.com/djouallah/ducklake_delta_exporter
+
+ ## Installation
+
+ ```bash
+ pip install ducklake-delta-exporter
+ ```
+
+ ## Usage
+
+ ```python
+ from ducklake_delta_exporter import generate_latest_delta_log
+
+ # Export all tables from a DuckLake database
+ generate_latest_delta_log("/path/to/ducklake.db")
+
+ # Specify a custom data root directory
+ generate_latest_delta_log("/path/to/ducklake.db", data_root="/custom/data/path")
+ ```
+
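+ To sanity-check an export, point any Delta Lake reader at the exported table directory. A minimal PySpark sketch (assuming `pyspark` and `delta-spark` are installed; the placeholder path continues the example above and is not part of the package):
+
+ ```python
+ # Hypothetical verification script, not part of this package.
+ from delta import configure_spark_with_delta_pip
+ from pyspark.sql import SparkSession
+
+ builder = (
+     SparkSession.builder.appName("read-ducklake-export")
+     .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
+     .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
+ )
+ spark = configure_spark_with_delta_pip(builder).getOrCreate()
+
+ # Exported tables live under the data root, e.g. <data_root>/<schema>/<table>
+ df = spark.read.format("delta").load("/custom/data/path/main/your_table")
+ df.show()
+ ```
+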
+ ## What it does
+
+ This package converts DuckLake table snapshots into Delta Lake format by:
+
+ 1. **Reading DuckLake metadata** - Extracts table schemas, file paths, and snapshot information
+ 2. **Creating Delta checkpoint files** - Generates `.checkpoint.parquet` files with Delta Lake metadata
+ 3. **Writing JSON transaction logs** - Creates minimal `.json` log files for Spark compatibility
+ 4. **Mapping data types** - Converts DuckDB types to Spark SQL equivalents
+
+ ## Features
+
+ - ✅ **Spark Compatible** - Generated Delta files can be read by Spark and other Delta Lake tools
+ - ✅ **Type Mapping** - Automatic conversion between DuckDB and Spark data types
+ - ✅ **Batch Processing** - Exports all tables in a DuckLake database
+ - ✅ **Error Handling** - Graceful handling of missing snapshots and other issues
+ - ✅ **Progress Reporting** - Clear feedback on export progress and results
+
+ ## Requirements
+
+ - Python 3.8+
+ - DuckDB
+
+ ## File Structure
+
+ After running the exporter, your Delta tables will have the following structure:
+
+ ```
+ your_table/
+ ├── data_file_1.parquet
+ ├── data_file_2.parquet
+ └── _delta_log/
+     ├── 00000000000000000000.json
+     ├── 00000000000000000000.checkpoint.parquet
+     └── _last_checkpoint
+ ```
+
+ ## Type Mapping
+
+ The exporter automatically maps DuckDB types to Spark SQL types:
+
+ | DuckDB Type | Spark Type |
+ |-------------|------------|
+ | INTEGER | integer |
+ | BIGINT | long |
+ | FLOAT | double |
+ | DOUBLE | double |
+ | DECIMAL | decimal(10,0) |
+ | BOOLEAN | boolean |
+ | TIMESTAMP | timestamp |
+ | DATE | date |
+ | VARCHAR | string |
+ | Others | string |
+
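+ The mapping is name-based rather than an exact lookup, so the type names DuckLake stores in its metadata (e.g. `int64`, `float64`) resolve the same way. A minimal sketch (assuming the helper is importable from the module; see `map_type_ducklake_to_spark` in the source later in this diff):
+
+ ```python
+ # Illustrative only; mirrors the name-based rules in map_type_ducklake_to_spark.
+ from ducklake_delta_exporter import map_type_ducklake_to_spark
+
+ assert map_type_ducklake_to_spark("int32") == "integer"
+ assert map_type_ducklake_to_spark("int64") == "long"       # contains "int" and "64" -> long
+ assert map_type_ducklake_to_spark("float64") == "double"
+ assert map_type_ducklake_to_spark("varchar") == "string"
+ assert map_type_ducklake_to_spark("uuid") == "string"      # unrecognized types fall back to string
+ ```
+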
+ ## Error Handling
+
+ The exporter handles various error conditions:
+
+ - **Missing snapshots** - Skips tables with no data
+ - **Existing checkpoints** - Avoids overwriting existing files
+ - **Schema changes** - Uses the latest schema for each table
+ - **File system errors** - Reports and continues with other tables
+
+ ## License
+
+ MIT License - see LICENSE file for details.
+
+ ## Contributing
+
+ Contributions are welcome! Please feel free to submit a Pull Request.
@@ -0,0 +1,326 @@
+ # File: ducklake_delta_exporter.py
+ import json
+ import time
+ import duckdb
+
+ def map_type_ducklake_to_spark(t):
+     """Maps DuckDB data types to their Spark SQL equivalents for the Delta schema."""
+     t = t.lower()
+     if 'int' in t:
+         return 'long' if '64' in t else 'integer'
+     elif 'float' in t:
+         return 'double'
+     elif 'double' in t:
+         return 'double'
+     elif 'decimal' in t:
+         return 'decimal(10,0)'
+     elif 'bool' in t:
+         return 'boolean'
+     elif 'timestamp' in t:
+         return 'timestamp'
+     elif 'date' in t:
+         return 'date'
+     return 'string'
+
+ def create_spark_schema_string(fields):
+     """Creates a JSON string for the Spark schema from a list of fields."""
+     return json.dumps({"type": "struct", "fields": fields})
+
+ def get_latest_ducklake_snapshot(con, table_id):
+     """
+     Get the latest DuckLake snapshot ID for a table.
+     """
+     latest_snapshot = con.execute(f"""
+         SELECT MAX(begin_snapshot) AS latest_snapshot
+         FROM ducklake_data_file
+         WHERE table_id = {table_id}
+     """).fetchone()[0]
+     return latest_snapshot
+
+ def get_latest_delta_checkpoint(con, table_id):
+     """
+     Check how many times a table has been modified; this count is used as the Delta log version.
+     """
+     delta_checkpoint = con.execute(f"""
+         SELECT count(snapshot_id) FROM ducklake_snapshot_changes
+         WHERE changes_made LIKE '%:{table_id}' OR changes_made LIKE '%:{table_id},%'
+     """).fetchone()[0]
+     # Debug output
+     print(table_id)
+     print(delta_checkpoint)
+     return delta_checkpoint
+
+ def get_file_modification_time(dummy_time):
+     """
+     Return a dummy modification time for parquet files.
+     This avoids the latency of actually reading file metadata.
+
+     Args:
+         dummy_time: Timestamp in milliseconds to use as modification time
+
+     Returns:
+         Modification time in milliseconds
+     """
+     return dummy_time
+
+ def create_dummy_json_log(table_root, delta_version, table_info, schema_fields, now):
+     """
+     Create a minimal JSON log file for Spark compatibility using DuckDB.
+     """
+     json_log_file = table_root + f"_delta_log/{delta_version:020d}.json"
+
+     # Create JSON log entries using DuckDB
+     duckdb.execute("DROP TABLE IF EXISTS json_log_table")
+
+     # Protocol entry
+     protocol_json = json.dumps({
+         "protocol": {
+             "minReaderVersion": 1,
+             "minWriterVersion": 2
+         }
+     })
+
+     # Metadata entry
+     metadata_json = json.dumps({
+         "metaData": {
+             "id": str(table_info['table_id']),
+             "name": table_info['table_name'],
+             "description": None,
+             "format": {
+                 "provider": "parquet",
+                 "options": {}
+             },
+             "schemaString": create_spark_schema_string(schema_fields),
+             "partitionColumns": [],
+             "createdTime": now,
+             "configuration": {
+                 "delta.logRetentionDuration": "interval 1 hour"
+             }
+         }
+     })
+
+     # Commit info entry
+     commitinfo_json = json.dumps({
+         "commitInfo": {
+             "timestamp": now,
+             "operation": "CONVERT",
+             "operationParameters": {
+                 "convertedFrom": "DuckLake"
+             },
+             "isBlindAppend": True,
+             "engineInfo": "DuckLake-Delta-Exporter",
+             "clientVersion": "1.0.0"
+         }
+     })
+
+     # Create table with JSON entries
+     duckdb.execute("""
+         CREATE TABLE json_log_table AS
+         SELECT ? AS json_line
+         UNION ALL
+         SELECT ? AS json_line
+         UNION ALL
+         SELECT ? AS json_line
+     """, [protocol_json, metadata_json, commitinfo_json])
+
+     # Write JSON log file using DuckDB
+     duckdb.execute(f"COPY (SELECT json_line FROM json_log_table) TO '{json_log_file}' (FORMAT CSV, HEADER false, QUOTE '')")
+
+     # Clean up
+     duckdb.execute("DROP TABLE IF EXISTS json_log_table")
+
+     return json_log_file
+
+ def build_file_path(table_root, relative_path):
+     """
+     Build full file path from table root and relative path.
+     Works with both local paths and S3 URLs.
+     """
+     table_root = table_root.rstrip('/')
+     relative_path = relative_path.lstrip('/')
+     return f"{table_root}/{relative_path}"
+
+ def create_checkpoint_for_latest_snapshot(con, table_info, data_root):
+     """
+     Create a Delta checkpoint file for the latest DuckLake snapshot.
+     """
+     table_root = data_root.rstrip('/') + '/' + table_info['schema_path'] + table_info['table_path']
+
+     # Get the latest snapshot
+     latest_snapshot = get_latest_ducklake_snapshot(con, table_info['table_id'])
+     if latest_snapshot is None:
+         print(f"⚠️ {table_info['schema_name']}.{table_info['table_name']}: No snapshots found")
+         return False
+     delta_version = get_latest_delta_checkpoint(con, table_info['table_id'])
+     checkpoint_file = table_root + f"_delta_log/{delta_version:020d}.checkpoint.parquet"
+     json_log_file = table_root + f"_delta_log/{delta_version:020d}.json"
+
+     try:
+         # If the checkpoint parquet is already readable, this snapshot was exported before
+         con.execute(f"SELECT protocol FROM '{checkpoint_file}' limit 0 ")
+         print(f"⚠️ {table_info['schema_name']}.{table_info['table_name']}: Checkpoint file already exists: {checkpoint_file}")
+     except Exception:
+         # No readable checkpoint yet, so build one for this snapshot
+         now = int(time.time() * 1000)
+
+         # Get all files for the latest snapshot
+         file_rows = con.execute(f"""
+             SELECT path, file_size_bytes FROM ducklake_data_file
+             WHERE table_id = {table_info['table_id']}
+               AND begin_snapshot <= {latest_snapshot}
+               AND (end_snapshot IS NULL OR end_snapshot > {latest_snapshot})
+         """).fetchall()
+
+         # Get schema for the latest snapshot
+         columns = con.execute(f"""
+             SELECT column_name, column_type FROM ducklake_column
+             WHERE table_id = {table_info['table_id']}
+               AND begin_snapshot <= {latest_snapshot}
+               AND (end_snapshot IS NULL OR end_snapshot > {latest_snapshot})
+             ORDER BY column_order
+         """).fetchall()
+
+         # Get or generate table metadata ID
+         table_meta_id = str(table_info['table_id'])
+
+         # Prepare schema
+         schema_fields = [
+             {"name": name, "type": map_type_ducklake_to_spark(typ), "nullable": True, "metadata": {}}
+             for name, typ in columns
+         ]
+
+         # Create checkpoint data directly in DuckDB using proper data types
+         duckdb.execute("DROP TABLE IF EXISTS checkpoint_table")
+
+         # Create the checkpoint table with proper nested structure
+         duckdb.execute("""
+             CREATE TABLE checkpoint_table AS
+             WITH checkpoint_data AS (
+                 -- Protocol record
+                 SELECT
+                     {'minReaderVersion': 1, 'minWriterVersion': 2}::STRUCT(minReaderVersion INTEGER, minWriterVersion INTEGER) AS protocol,
+                     NULL::STRUCT(id VARCHAR, name VARCHAR, description VARCHAR, format STRUCT(provider VARCHAR, options MAP(VARCHAR, VARCHAR)), schemaString VARCHAR, partitionColumns VARCHAR[], createdTime BIGINT, configuration MAP(VARCHAR, VARCHAR)) AS metaData,
+                     NULL::STRUCT(path VARCHAR, partitionValues MAP(VARCHAR, VARCHAR), size BIGINT, modificationTime BIGINT, dataChange BOOLEAN, stats VARCHAR, tags MAP(VARCHAR, VARCHAR)) AS add,
+                     NULL::STRUCT(path VARCHAR, deletionTimestamp BIGINT, dataChange BOOLEAN) AS remove,
+                     NULL::STRUCT(timestamp TIMESTAMP, operation VARCHAR, operationParameters MAP(VARCHAR, VARCHAR), isBlindAppend BOOLEAN, engineInfo VARCHAR, clientVersion VARCHAR) AS commitInfo
+
+                 UNION ALL
+
+                 -- Metadata record
+                 SELECT
+                     NULL::STRUCT(minReaderVersion INTEGER, minWriterVersion INTEGER) AS protocol,
+                     {
+                         'id': ?,
+                         'name': ?,
+                         'description': NULL,
+                         'format': {'provider': 'parquet', 'options': MAP{}}::STRUCT(provider VARCHAR, options MAP(VARCHAR, VARCHAR)),
+                         'schemaString': ?,
+                         'partitionColumns': []::VARCHAR[],
+                         'createdTime': ?,
+                         'configuration': MAP{'delta.logRetentionDuration': 'interval 1 hour'}
+                     }::STRUCT(id VARCHAR, name VARCHAR, description VARCHAR, format STRUCT(provider VARCHAR, options MAP(VARCHAR, VARCHAR)), schemaString VARCHAR, partitionColumns VARCHAR[], createdTime BIGINT, configuration MAP(VARCHAR, VARCHAR)) AS metaData,
+                     NULL::STRUCT(path VARCHAR, partitionValues MAP(VARCHAR, VARCHAR), size BIGINT, modificationTime BIGINT, dataChange BOOLEAN, stats VARCHAR, tags MAP(VARCHAR, VARCHAR)) AS add,
+                     NULL::STRUCT(path VARCHAR, deletionTimestamp BIGINT, dataChange BOOLEAN) AS remove,
+                     NULL::STRUCT(timestamp TIMESTAMP, operation VARCHAR, operationParameters MAP(VARCHAR, VARCHAR), isBlindAppend BOOLEAN, engineInfo VARCHAR, clientVersion VARCHAR) AS commitInfo
+             )
+             SELECT * FROM checkpoint_data
+         """, [table_meta_id, table_info['table_name'], create_spark_schema_string(schema_fields), now])
+
+         # Add file records
+         for path, size in file_rows:
+             rel_path = path.lstrip('/')
+             full_path = build_file_path(table_root, rel_path)
+             mod_time = get_file_modification_time(now)
+
+             duckdb.execute("""
+                 INSERT INTO checkpoint_table
+                 SELECT
+                     NULL::STRUCT(minReaderVersion INTEGER, minWriterVersion INTEGER) AS protocol,
+                     NULL::STRUCT(id VARCHAR, name VARCHAR, description VARCHAR, format STRUCT(provider VARCHAR, options MAP(VARCHAR, VARCHAR)), schemaString VARCHAR, partitionColumns VARCHAR[], createdTime BIGINT, configuration MAP(VARCHAR, VARCHAR)) AS metaData,
+                     {
+                         'path': ?,
+                         'partitionValues': MAP{}::MAP(VARCHAR, VARCHAR),
+                         'size': ?,
+                         'modificationTime': ?,
+                         'dataChange': true,
+                         'stats': ?,
+                         'tags': NULL::MAP(VARCHAR, VARCHAR)
+                     }::STRUCT(path VARCHAR, partitionValues MAP(VARCHAR, VARCHAR), size BIGINT, modificationTime BIGINT, dataChange BOOLEAN, stats VARCHAR, tags MAP(VARCHAR, VARCHAR)) AS add,
+                     NULL::STRUCT(path VARCHAR, deletionTimestamp BIGINT, dataChange BOOLEAN) AS remove,
+                     NULL::STRUCT(timestamp TIMESTAMP, operation VARCHAR, operationParameters MAP(VARCHAR, VARCHAR), isBlindAppend BOOLEAN, engineInfo VARCHAR, clientVersion VARCHAR) AS commitInfo
+             """, [rel_path, size, mod_time, json.dumps({"numRecords": None})])
+
+         # Create the _delta_log directory if it doesn't exist (the throwaway parquet write forces directory creation)
+         duckdb.execute(f"COPY (SELECT 43) TO '{table_root}_delta_log' (FORMAT PARQUET, PER_THREAD_OUTPUT, OVERWRITE_OR_IGNORE)")
+
+         # Write the checkpoint file
+         duckdb.execute(f"COPY (SELECT * FROM checkpoint_table) TO '{checkpoint_file}' (FORMAT PARQUET)")
+
+         # Create dummy JSON log file for Spark compatibility
+         create_dummy_json_log(table_root, delta_version, table_info, schema_fields, now)
+
+         # Write the _last_checkpoint file
+         total_records = 2 + len(file_rows)  # protocol + metadata + file records
+         duckdb.execute(f"""
+             COPY (SELECT {delta_version} AS version, {total_records} AS size)
+             TO '{table_root}_delta_log/_last_checkpoint' (FORMAT JSON, ARRAY false)
+         """)
+
+         print(f"✅ Exported DuckLake snapshot {latest_snapshot} as Delta checkpoint v{delta_version}")
+         print(f"✅ Created JSON log file: {json_log_file}")
+
+         # Clean up temporary tables
+         duckdb.execute("DROP TABLE IF EXISTS checkpoint_table")
+
+         return True, delta_version, latest_snapshot
+
+ def generate_latest_delta_log(db_path: str, data_root: str = None):
+     """
+     Export the latest DuckLake snapshot for each table as a Delta checkpoint file.
+     Creates both checkpoint files and minimal JSON log files for Spark compatibility.
+
+     Args:
+         db_path (str): The path to the DuckLake database file.
+         data_root (str): The root directory for the lakehouse data.
+             Defaults to the data_path stored in the DuckLake metadata.
+     """
+     con = duckdb.connect(db_path, read_only=True)
+
+     if data_root is None:
+         data_root = con.sql("SELECT value FROM ducklake_metadata WHERE key = 'data_path'").fetchone()[0]
+
+     # Get all active tables
+     tables = con.execute("""
+         SELECT
+             t.table_id,
+             t.table_name,
+             s.schema_name,
+             t.path AS table_path,
+             s.path AS schema_path
+         FROM ducklake_table t
+         JOIN ducklake_schema s USING(schema_id)
+         WHERE t.end_snapshot IS NULL
+     """).fetchall()
+
+     total_tables = len(tables)
+     successful_exports = 0
+
+     for table_row in tables:
+         table_info = {
+             'table_id': table_row[0],
+             'table_name': table_row[1],
+             'schema_name': table_row[2],
+             'table_path': table_row[3],
+             'schema_path': table_row[4]
+         }
+
+         table_key = f"{table_info['schema_name']}.{table_info['table_name']}"
+         print(f"Processing {table_key}...")
+
+         try:
+             result = create_checkpoint_for_latest_snapshot(con, table_info, data_root)
+
+             if result:
+                 successful_exports += 1
+             else:
+                 print(f"⚠️ {table_key}: No data to export")
+
+         except Exception as e:
+             print(f"❌ {table_key}: Failed to export checkpoint - {e}")
+
+     con.close()
+     print(f"\n🎉 Export completed! {successful_exports}/{total_tables} tables exported successfully.")
@@ -3,12 +3,9 @@ from setuptools import setup, find_packages
 
  setup(
      name='ducklake-delta-exporter',
-     version='0.1.1',
+     version='0.1.3',
      packages=find_packages(),
-     install_requires=[
-         'duckdb',
-         'pyarrow'
-     ],
+     install_requires=['duckdb'],
      author='mim',
      author_email='your.email@example.com',
      description='A utility to export DuckLake database metadata to Delta Lake transaction logs.',