duckrun 0.1.9__tar.gz → 0.2.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {duckrun-0.1.9 → duckrun-0.2.0}/PKG-INFO +139 -5
- {duckrun-0.1.9 → duckrun-0.2.0}/README.md +138 -4
- {duckrun-0.1.9 → duckrun-0.2.0}/duckrun/core.py +126 -10
- {duckrun-0.1.9 → duckrun-0.2.0}/duckrun.egg-info/PKG-INFO +139 -5
- {duckrun-0.1.9 → duckrun-0.2.0}/pyproject.toml +1 -1
- {duckrun-0.1.9 → duckrun-0.2.0}/LICENSE +0 -0
- {duckrun-0.1.9 → duckrun-0.2.0}/duckrun/__init__.py +0 -0
- {duckrun-0.1.9 → duckrun-0.2.0}/duckrun.egg-info/SOURCES.txt +0 -0
- {duckrun-0.1.9 → duckrun-0.2.0}/duckrun.egg-info/dependency_links.txt +0 -0
- {duckrun-0.1.9 → duckrun-0.2.0}/duckrun.egg-info/requires.txt +0 -0
- {duckrun-0.1.9 → duckrun-0.2.0}/duckrun.egg-info/top_level.txt +0 -0
- {duckrun-0.1.9 → duckrun-0.2.0}/setup.cfg +0 -0
{duckrun-0.1.9 → duckrun-0.2.0}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: duckrun
-Version: 0.1.9
+Version: 0.2.0
 Summary: Lakehouse task runner powered by DuckDB for Microsoft Fabric
 Author: mim
 License: MIT
@@ -28,6 +28,8 @@ A helper package for stuff that made my life easier when working with Fabric Pyt
 - Lakehouse must have a schema (e.g., `dbo`, `sales`, `analytics`)
 - Workspace and lakehouse names cannot contain spaces
 
+**Delta Lake Version:** This package uses an older version of deltalake to maintain row size control capabilities, which is crucial for Power BI performance optimization. The newer Rust-based deltalake versions don't yet support the row group size parameters that are essential for optimal DirectLake performance.
+
 **Why no spaces?** Duckrun uses simple name-based paths instead of GUIDs. This keeps the code clean and readable, which is perfect for data engineering workspaces where naming conventions are already well-established. Just use underscores or hyphens instead: `my_workspace` or `my-lakehouse`.
 
 ## What It Does
@@ -131,6 +133,22 @@ con.sql("""
 
 # Append mode
 con.sql("SELECT * FROM new_orders").write.mode("append").saveAsTable("orders")
+
+# Schema evolution and partitioning (exact Spark API compatibility)
+con.sql("""
+    SELECT
+        customer_id,
+        order_date,
+        region,
+        product_category,
+        sales_amount,
+        new_column_added_later  -- This column might not exist in target table
+    FROM source_table
+""").write \
+    .mode("append") \
+    .option("mergeSchema", "true") \
+    .partitionBy("region", "product_category") \
+    .saveAsTable("sales_partitioned")
 ```
 
 **Note:** `.format("delta")` is optional - Delta is the default format!
@@ -204,7 +222,10 @@ def download_data(url, path):
 
 ### SQL Tasks
 
-**
+**Formats:**
+- `('table_name', 'mode')` - Simple SQL with no parameters
+- `('table_name', 'mode', {params})` - SQL with template parameters
+- `('table_name', 'mode', {params}, {delta_options})` - SQL with Delta Lake options
 
 Create `sql_folder/table_name.sql`:
 
@@ -244,8 +265,66 @@ SELECT * FROM transactions
 WHERE date BETWEEN '$start_date' AND '$end_date'
 ```
 
+### Delta Lake Options (Schema Evolution & Partitioning)
+
+Use the 4-tuple format for advanced Delta Lake features:
+
+```python
+pipeline = [
+    # SQL with empty params but Delta options
+    ('evolving_table', 'append', {}, {'mergeSchema': 'true'}),
+
+    # SQL with both params AND Delta options
+    ('sales_data', 'append',
+     {'region': 'North America'},
+     {'mergeSchema': 'true', 'partitionBy': ['region', 'year']}),
+
+    # Partitioning without schema merging
+    ('time_series', 'overwrite',
+     {'start_date': '2024-01-01'},
+     {'partitionBy': ['year', 'month']})
+]
+```
+
+**Available Delta Options:**
+- `mergeSchema: 'true'` - Automatically handle schema evolution (new columns)
+- `partitionBy: ['col1', 'col2']` - Partition data by specified columns
+
 ## Advanced Features
 
+### Schema Evolution & Partitioning
+
+Handle evolving schemas and optimize query performance with partitioning:
+
+```python
+# Using Spark-style API
+con.sql("""
+    SELECT
+        customer_id,
+        region,
+        product_category,
+        sales_amount,
+        -- New column that might not exist in target table
+        discount_percentage
+    FROM raw_sales
+""").write \
+    .mode("append") \
+    .option("mergeSchema", "true") \
+    .partitionBy("region", "product_category") \
+    .saveAsTable("sales_partitioned")
+
+# Using pipeline format
+pipeline = [
+    ('sales_summary', 'append',
+     {'batch_date': '2024-10-07'},
+     {'mergeSchema': 'true', 'partitionBy': ['region', 'year']})
+]
+```
+
+**Benefits:**
+- 🔄 **Schema Evolution**: Automatically handles new columns without breaking existing queries
+- ⚡ **Query Performance**: Partitioning improves performance for filtered queries
+
 ### Table Name Variants
 
 Use `__` to create multiple versions of the same table:
@@ -404,8 +483,8 @@ pipeline = [
     # Aggregate by region (SQL with params)
     ('regional_summary', 'overwrite', {'min_amount': 1000}),
 
-    # Append to history (SQL)
-    ('sales_history', 'append')
+    # Append to history with schema evolution (SQL with Delta options)
+    ('sales_history', 'append', {}, {'mergeSchema': 'true', 'partitionBy': ['year', 'region']})
 ]
 
 # Run pipeline
@@ -430,7 +509,62 @@ con.download("processed_reports", "./exports", ['.csv'])
 - 🔄 **Pipeline orchestration** with SQL and Python tasks
 - ⚡ **Fast data exploration** with DuckDB
 - 💾 **Delta table creation** with Spark-style API
-- 📤 **File downloads** from OneLake Files
+- 🔄 **Schema evolution** and partitioning
+- 📤 **File downloads** from OneLake Files
+
+## Schema Evolution & Partitioning Guide
+
+### When to Use Schema Evolution
+
+Use `mergeSchema: 'true'` when:
+- Adding new columns to existing tables
+- Source data schema changes over time
+- Working with evolving data pipelines
+- Need backward compatibility
+
+### When to Use Partitioning
+
+Use `partitionBy` when:
+- Queries frequently filter by specific columns (dates, regions, categories)
+- Tables are large and need performance optimization
+- Want to organize data logically for maintenance
+
+### Best Practices
+
+```python
+# ✅ Good: Partition by commonly filtered columns
+.partitionBy("year", "region")  # Often filtered: WHERE year = 2024 AND region = 'US'
+
+# ❌ Avoid: High cardinality partitions
+.partitionBy("customer_id")  # Creates too many small partitions
+
+# ✅ Good: Schema evolution for append operations
+.mode("append").option("mergeSchema", "true")
+
+# ✅ Good: Combined approach for data lakes
+pipeline = [
+    ('daily_sales', 'append',
+     {'batch_date': '2024-10-07'},
+     {'mergeSchema': 'true', 'partitionBy': ['year', 'month', 'region']})
+]
+```
+
+### Task Format Reference
+
+```python
+# 2-tuple: Simple SQL/Python
+('task_name', 'mode')        # SQL: no params, no Delta options
+('function_name', (args))    # Python: function with arguments
+
+# 3-tuple: SQL with parameters
+('task_name', 'mode', {'param': 'value'})
+
+# 4-tuple: SQL with parameters AND Delta options
+('task_name', 'mode', {'param': 'value'}, {'mergeSchema': 'true', 'partitionBy': ['col']})
+
+# 4-tuple: Empty parameters but Delta options
+('task_name', 'mode', {}, {'mergeSchema': 'true'})
+```
 
 ## How It Works
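For context on the row-group point above: in the older deltalake releases this package pins, row-group sizing is controlled through the `max_rows_per_file`, `max_rows_per_group`, and `min_rows_per_group` arguments of `write_deltalake`, while schema merging requires the Rust engine, which ignores them. A minimal sketch of the two call shapes follows; the local path and the tiny pyarrow table are placeholders for illustration only (duckrun itself targets an abfss:// OneLake URI), and the 8M row-group target mirrors the `RG` constant added in `duckrun/core.py` later in this diff.

```python
import pyarrow as pa
from deltalake import write_deltalake

RG = 8_000_000  # row-group target duckrun uses (the RG constant in duckrun/core.py)

data = pa.table({"region": ["US", "EU"], "sales_amount": [100.0, 80.0]})

# Default path: pyarrow engine with explicit row-group sizing (DirectLake-friendly files)
write_deltalake(
    "/tmp/sales_delta",   # placeholder local path; duckrun writes to an abfss:// OneLake URI
    data,
    mode="append",
    max_rows_per_file=RG,
    max_rows_per_group=RG,
    min_rows_per_group=RG,
)

# Schema-evolution path: the Rust engine accepts schema_mode="merge",
# but not the row-group size parameters above
data_evolved = data.append_column("discount", pa.array([0.1, 0.2]))
write_deltalake("/tmp/sales_delta", data_evolved, mode="append",
                schema_mode="merge", engine="rust")
```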
{duckrun-0.1.9 → duckrun-0.2.0}/README.md: the hunks are the same README content shown in PKG-INFO above (README.md is embedded in the package metadata), with line numbers offset by the 20 metadata header lines (@@ -8,6 +8,8 @@, @@ -111,6 +113,22 @@, @@ -184,7 +202,10 @@, @@ -224,8 +245,66 @@, @@ -384,8 +463,8 @@, @@ -410,7 +489,62 @@).
{duckrun-0.1.9 → duckrun-0.2.0}/duckrun/core.py

@@ -12,6 +12,36 @@ from obstore.store import AzureStore
 RG = 8_000_000
 
 
+def _build_write_deltalake_args(path, df, mode, schema_mode=None, partition_by=None):
+    """
+    Build arguments for write_deltalake based on requirements:
+    - If schema_mode='merge': use rust engine (no row group params)
+    - Otherwise: use pyarrow engine with row group optimization
+    """
+    args = {
+        'table_or_uri': path,
+        'data': df,
+        'mode': mode
+    }
+
+    # Add partition_by if specified
+    if partition_by:
+        args['partition_by'] = partition_by
+
+    # Engine selection based on schema_mode
+    if schema_mode == 'merge':
+        # Use rust engine for schema merging (no row group params supported)
+        args['schema_mode'] = 'merge'
+        args['engine'] = 'rust'
+    else:
+        # Use pyarrow engine with row group optimization (default)
+        args['max_rows_per_file'] = RG
+        args['max_rows_per_group'] = RG
+        args['min_rows_per_group'] = RG
+
+    return args
+
+
 class DeltaWriter:
     """Spark-style write API for Delta Lake"""
 
@@ -20,6 +50,8 @@ class DeltaWriter:
         self.duckrun = duckrun_instance
         self._format = "delta"
         self._mode = "overwrite"
+        self._schema_mode = None
+        self._partition_by = None
 
     def format(self, format_type: str):
         """Set output format (only 'delta' supported)"""
@@ -35,6 +67,27 @@ class DeltaWriter:
         self._mode = write_mode
         return self
 
+    def option(self, key: str, value):
+        """Set write option (Spark-compatible)"""
+        if key == "mergeSchema":
+            if str(value).lower() in ("true", "1"):
+                self._schema_mode = "merge"
+            else:
+                self._schema_mode = None
+        else:
+            raise ValueError(f"Unsupported option: {key}")
+        return self
+
+    def partitionBy(self, *columns):
+        """Set partition columns (Spark-compatible)"""
+        if len(columns) == 1 and isinstance(columns[0], (list, tuple)):
+            # Handle partitionBy(["col1", "col2"]) case
+            self._partition_by = list(columns[0])
+        else:
+            # Handle partitionBy("col1", "col2") case
+            self._partition_by = list(columns)
+        return self
+
     def saveAsTable(self, table_name: str):
         """Save query result as Delta table"""
         if self._format != "delta":
@@ -50,8 +103,18 @@ class DeltaWriter:
         path = f"{self.duckrun.table_base_url}{schema}/{table}"
         df = self.relation.record_batch()
 
-
-
+        # Build write arguments based on schema_mode and partition_by
+        write_args = _build_write_deltalake_args(
+            path, df, self._mode,
+            schema_mode=self._schema_mode,
+            partition_by=self._partition_by
+        )
+
+        engine_info = f" (engine=rust, schema_mode=merge)" if self._schema_mode == 'merge' else " (engine=pyarrow)"
+        partition_info = f" partitioned by {self._partition_by}" if self._partition_by else ""
+        print(f"Writing to Delta table: {schema}.{table} (mode={self._mode}){engine_info}{partition_info}")
+
+        write_deltalake(**write_args)
 
         self.duckrun.con.sql(f"DROP VIEW IF EXISTS {table}")
         self.duckrun.con.sql(f"""
@@ -113,6 +176,21 @@ class Duckrun:
     dr = Duckrun.connect("workspace/lakehouse.lakehouse")
     dr.sql("SELECT * FROM table").show()
     dr.sql("SELECT 43").write.mode("append").saveAsTable("test")
+
+    # Schema evolution and partitioning (exact Spark API):
+    dr.sql("SELECT * FROM source").write.mode("append").option("mergeSchema", "true").partitionBy("region").saveAsTable("sales")
+
+    # Pipeline formats:
+    pipeline = [
+        # SQL with parameters only
+        ('table_name', 'mode', {'param1': 'value1'}),
+
+        # SQL with Delta options (4-tuple format)
+        ('table_name', 'mode', {'param1': 'value1'}, {'mergeSchema': 'true', 'partitionBy': ['region']}),
+
+        # Python task
+        ('process_data', ('table_name',))
+    ]
     """
 
     def __init__(self, workspace: str, lakehouse_name: str, schema: str = "dbo",
@@ -392,7 +470,7 @@ class Duckrun:
             print(f"✅ Python '{name}' completed")
             return result
 
-    def _run_sql(self, table: str, mode: str, params: Dict) -> str:
+    def _run_sql(self, table: str, mode: str, params: Dict, delta_options: Dict = None) -> str:
         """Execute SQL task, write to Delta, return normalized table name"""
         self._create_onelake_secret()
 
@@ -406,10 +484,23 @@ class Duckrun:
         normalized_table = self._normalize_table_name(table)
         path = f"{self.table_base_url}{self.schema}/{normalized_table}"
 
+        # Extract Delta Lake specific options from delta_options
+        delta_options = delta_options or {}
+        merge_schema = delta_options.get('mergeSchema')
+        schema_mode = 'merge' if str(merge_schema).lower() in ('true', '1') else None
+        partition_by = delta_options.get('partitionBy') or delta_options.get('partition_by')
+
         if mode == 'overwrite':
             self.con.sql(f"DROP VIEW IF EXISTS {normalized_table}")
             df = self.con.sql(sql).record_batch()
-
+
+            write_args = _build_write_deltalake_args(
+                path, df, 'overwrite',
+                schema_mode=schema_mode,
+                partition_by=partition_by
+            )
+            write_deltalake(**write_args)
+
             self.con.sql(f"CREATE OR REPLACE VIEW {normalized_table} AS SELECT * FROM delta_scan('{path}')")
             dt = DeltaTable(path)
             dt.vacuum(retention_hours=0, dry_run=False, enforce_retention_duration=False)
@@ -417,7 +508,14 @@ class Duckrun:
 
         elif mode == 'append':
             df = self.con.sql(sql).record_batch()
-
+
+            write_args = _build_write_deltalake_args(
+                path, df, 'append',
+                schema_mode=schema_mode,
+                partition_by=partition_by
+            )
+            write_deltalake(**write_args)
+
             self.con.sql(f"CREATE OR REPLACE VIEW {normalized_table} AS SELECT * FROM delta_scan('{path}')")
             dt = DeltaTable(path)
             if len(dt.file_uris()) > self.compaction_threshold:
@@ -434,13 +532,22 @@ class Duckrun:
             print(f"Table {normalized_table} doesn't exist. Creating...")
             self.con.sql(f"DROP VIEW IF EXISTS {normalized_table}")
             df = self.con.sql(sql).record_batch()
-
+
+            write_args = _build_write_deltalake_args(
+                path, df, 'overwrite',
+                schema_mode=schema_mode,
+                partition_by=partition_by
+            )
+            write_deltalake(**write_args)
+
             self.con.sql(f"CREATE OR REPLACE VIEW {normalized_table} AS SELECT * FROM delta_scan('{path}')")
             dt = DeltaTable(path)
             dt.vacuum(dry_run=False)
             dt.cleanup_metadata()
 
-
+        engine_info = f" (engine=rust, schema_mode=merge)" if schema_mode == 'merge' else " (engine=pyarrow)"
+        partition_info = f" partitioned by {partition_by}" if partition_by else ""
+        print(f"✅ SQL '{table}' → '{normalized_table}' ({mode}){engine_info}{partition_info}")
         return normalized_table
 
     def run(self, pipeline: List[Tuple]) -> bool:
@@ -449,7 +556,8 @@ class Duckrun:
 
         Task formats:
             - Python: ('function_name', (arg1, arg2, ...))
-            - SQL: ('table_name', 'mode') or ('table_name', 'mode', {
+            - SQL: ('table_name', 'mode') or ('table_name', 'mode', {sql_params})
+            - SQL with Delta options: ('table_name', 'mode', {sql_params}, {delta_options})
 
         Returns:
             True if all tasks succeeded
@@ -469,7 +577,7 @@ class Duckrun:
             if len(task) == 2:
                 name, second = task
                 if isinstance(second, str) and second in {'overwrite', 'append', 'ignore'}:
-                    result = self._run_sql(name, second, {})
+                    result = self._run_sql(name, second, {}, {})
                 else:
                     args = second if isinstance(second, (tuple, list)) else (second,)
                     result = self._run_python(name, tuple(args))
@@ -478,7 +586,15 @@ class Duckrun:
                 table, mode, params = task
                 if not isinstance(params, dict):
                     raise ValueError(f"Expected dict for params, got {type(params)}")
-                result = self._run_sql(table, mode, params)
+                result = self._run_sql(table, mode, params, {})
+
+            elif len(task) == 4:
+                table, mode, params, delta_options = task
+                if not isinstance(params, dict):
+                    raise ValueError(f"Expected dict for SQL params, got {type(params)}")
+                if not isinstance(delta_options, dict):
+                    raise ValueError(f"Expected dict for Delta options, got {type(delta_options)}")
+                result = self._run_sql(table, mode, params, delta_options)
 
             else:
                 raise ValueError(f"Invalid task format: {task}")
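Taken together, both the Spark-style writer and the new 4-tuple task format route through the same `_build_write_deltalake_args` helper. A short usage sketch against the new API follows; the workspace/lakehouse names are placeholders, the connection string follows the docstring shown above, and the pipeline task assumes a matching `sales.sql` file in the configured SQL folder.

```python
from duckrun.core import Duckrun

# Placeholder connection string, in the form shown in the Duckrun docstring
dr = Duckrun.connect("my_workspace/my_lakehouse.lakehouse")

# Spark-style writer: mergeSchema switches to the rust engine with schema_mode='merge',
# and partitionBy is forwarded as partition_by to write_deltalake
dr.sql("SELECT * FROM staging_sales").write \
    .mode("append") \
    .option("mergeSchema", "true") \
    .partitionBy("region") \
    .saveAsTable("sales")

# Pipeline 4-tuple: the same options handled by _run_sql (requires sales.sql in the SQL folder)
dr.run([
    ('sales', 'append', {}, {'mergeSchema': 'true', 'partitionBy': ['region']})
])
```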
{duckrun-0.1.9 → duckrun-0.2.0}/duckrun.egg-info/PKG-INFO: identical to the PKG-INFO changes shown at the top of this diff (same hunks, +139 -5).