duckrun 0.1.9.tar.gz → 0.2.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: duckrun
- Version: 0.1.9
+ Version: 0.2.0
  Summary: Lakehouse task runner powered by DuckDB for Microsoft Fabric
  Author: mim
  License: MIT
@@ -28,6 +28,8 @@ A helper package for stuff that made my life easier when working with Fabric Pyt
  - Lakehouse must have a schema (e.g., `dbo`, `sales`, `analytics`)
  - Workspace and lakehouse names cannot contain spaces

+ **Delta Lake Version:** This package uses an older version of deltalake to maintain row size control capabilities, which is crucial for Power BI performance optimization. The newer Rust-based deltalake versions don't yet support the row group size parameters that are essential for optimal DirectLake performance.
+
  **Why no spaces?** Duckrun uses simple name-based paths instead of GUIDs. This keeps the code clean and readable, which is perfect for data engineering workspaces where naming conventions are already well-established. Just use underscores or hyphens instead: `my_workspace` or `my-lakehouse`.

  ## What It Does
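The row-size control described in the **Delta Lake Version** note above is the reason the writer code later in this diff keeps passing explicit row-group limits. As a rough illustration (not part of the package diff), assuming a pre-Rust-default deltalake release whose pyarrow-based writer accepts row-group sizing keyword arguments, the idea looks like this:

```python
# Illustrative sketch only, not package code. Assumes an older deltalake
# release whose pyarrow writer accepts row-group sizing kwargs
# (the same kwargs duckrun passes via RG = 8_000_000 further down in this diff).
import pyarrow as pa
from deltalake import write_deltalake

RG = 8_000_000  # rows per parquet file / row group, sized for DirectLake

batch = pa.RecordBatch.from_pydict({"id": [1, 2, 3], "region": ["US", "US", "EU"]})

write_deltalake(
    "/tmp/demo_table",        # placeholder path; duckrun targets OneLake abfss:// URLs
    batch,
    mode="overwrite",
    max_rows_per_file=RG,     # fewer, larger files with uniform row groups
    max_rows_per_group=RG,
    min_rows_per_group=RG,
)
```

With the Rust engine these sizing parameters are not available, which is the trade-off the note describes.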
@@ -131,6 +133,22 @@ con.sql("""

  # Append mode
  con.sql("SELECT * FROM new_orders").write.mode("append").saveAsTable("orders")
+
+ # Schema evolution and partitioning (exact Spark API compatibility)
+ con.sql("""
+     SELECT
+         customer_id,
+         order_date,
+         region,
+         product_category,
+         sales_amount,
+         new_column_added_later  -- This column might not exist in target table
+     FROM source_table
+ """).write \
+     .mode("append") \
+     .option("mergeSchema", "true") \
+     .partitionBy("region", "product_category") \
+     .saveAsTable("sales_partitioned")
  ```

  **Note:** `.format("delta")` is optional - Delta is the default format!
@@ -204,7 +222,10 @@ def download_data(url, path):

  ### SQL Tasks

- **Format:** `('table_name', 'mode')` or `('table_name', 'mode', {params})`
+ **Formats:**
+ - `('table_name', 'mode')` - Simple SQL with no parameters
+ - `('table_name', 'mode', {params})` - SQL with template parameters
+ - `('table_name', 'mode', {params}, {delta_options})` - SQL with Delta Lake options

  Create `sql_folder/table_name.sql`:

@@ -244,8 +265,66 @@ SELECT * FROM transactions
  WHERE date BETWEEN '$start_date' AND '$end_date'
  ```

+ ### Delta Lake Options (Schema Evolution & Partitioning)
+
+ Use the 4-tuple format for advanced Delta Lake features:
+
+ ```python
+ pipeline = [
+     # SQL with empty params but Delta options
+     ('evolving_table', 'append', {}, {'mergeSchema': 'true'}),
+
+     # SQL with both params AND Delta options
+     ('sales_data', 'append',
+      {'region': 'North America'},
+      {'mergeSchema': 'true', 'partitionBy': ['region', 'year']}),
+
+     # Partitioning without schema merging
+     ('time_series', 'overwrite',
+      {'start_date': '2024-01-01'},
+      {'partitionBy': ['year', 'month']})
+ ]
+ ```
+
+ **Available Delta Options:**
+ - `mergeSchema: 'true'` - Automatically handle schema evolution (new columns)
+ - `partitionBy: ['col1', 'col2']` - Partition data by specified columns
+
  ## Advanced Features

+ ### Schema Evolution & Partitioning
+
+ Handle evolving schemas and optimize query performance with partitioning:
+
+ ```python
+ # Using Spark-style API
+ con.sql("""
+     SELECT
+         customer_id,
+         region,
+         product_category,
+         sales_amount,
+         -- New column that might not exist in target table
+         discount_percentage
+     FROM raw_sales
+ """).write \
+     .mode("append") \
+     .option("mergeSchema", "true") \
+     .partitionBy("region", "product_category") \
+     .saveAsTable("sales_partitioned")
+
+ # Using pipeline format
+ pipeline = [
+     ('sales_summary', 'append',
+      {'batch_date': '2024-10-07'},
+      {'mergeSchema': 'true', 'partitionBy': ['region', 'year']})
+ ]
+ ```
+
+ **Benefits:**
+ - 🔄 **Schema Evolution**: Automatically handles new columns without breaking existing queries
+ - ⚡ **Query Performance**: Partitioning improves performance for filtered queries
+
  ### Table Name Variants

  Use `__` to create multiple versions of the same table:
@@ -404,8 +483,8 @@ pipeline = [
      # Aggregate by region (SQL with params)
      ('regional_summary', 'overwrite', {'min_amount': 1000}),

-     # Append to history (SQL)
-     ('sales_history', 'append')
+     # Append to history with schema evolution (SQL with Delta options)
+     ('sales_history', 'append', {}, {'mergeSchema': 'true', 'partitionBy': ['year', 'region']})
  ]

  # Run pipeline
@@ -430,7 +509,62 @@ con.download("processed_reports", "./exports", ['.csv'])
  - 🔄 **Pipeline orchestration** with SQL and Python tasks
  - ⚡ **Fast data exploration** with DuckDB
  - 💾 **Delta table creation** with Spark-style API
- - 📤 **File downloads** from OneLake Files
+ - **Schema evolution** and partitioning
+ - 📤 **File downloads** from OneLake Files
+
+ ## Schema Evolution & Partitioning Guide
+
+ ### When to Use Schema Evolution
+
+ Use `mergeSchema: 'true'` when:
+ - Adding new columns to existing tables
+ - Source data schema changes over time
+ - Working with evolving data pipelines
+ - Need backward compatibility
+
+ ### When to Use Partitioning
+
+ Use `partitionBy` when:
+ - Queries frequently filter by specific columns (dates, regions, categories)
+ - Tables are large and need performance optimization
+ - Want to organize data logically for maintenance
+
+ ### Best Practices
+
+ ```python
+ # ✅ Good: Partition by commonly filtered columns
+ .partitionBy("year", "region")  # Often filtered: WHERE year = 2024 AND region = 'US'
+
+ # ❌ Avoid: High cardinality partitions
+ .partitionBy("customer_id")  # Creates too many small partitions
+
+ # ✅ Good: Schema evolution for append operations
+ .mode("append").option("mergeSchema", "true")
+
+ # ✅ Good: Combined approach for data lakes
+ pipeline = [
+     ('daily_sales', 'append',
+      {'batch_date': '2024-10-07'},
+      {'mergeSchema': 'true', 'partitionBy': ['year', 'month', 'region']})
+ ]
+ ```
+
+ ### Task Format Reference
+
+ ```python
+ # 2-tuple: Simple SQL/Python
+ ('task_name', 'mode')  # SQL: no params, no Delta options
+ ('function_name', (args))  # Python: function with arguments
+
+ # 3-tuple: SQL with parameters
+ ('task_name', 'mode', {'param': 'value'})
+
+ # 4-tuple: SQL with parameters AND Delta options
+ ('task_name', 'mode', {'param': 'value'}, {'mergeSchema': 'true', 'partitionBy': ['col']})
+
+ # 4-tuple: Empty parameters but Delta options
+ ('task_name', 'mode', {}, {'mergeSchema': 'true'})
+ ```

  ## How It Works
@@ -12,6 +12,36 @@ from obstore.store import AzureStore
  RG = 8_000_000


+ def _build_write_deltalake_args(path, df, mode, schema_mode=None, partition_by=None):
+     """
+     Build arguments for write_deltalake based on requirements:
+     - If schema_mode='merge': use rust engine (no row group params)
+     - Otherwise: use pyarrow engine with row group optimization
+     """
+     args = {
+         'table_or_uri': path,
+         'data': df,
+         'mode': mode
+     }
+
+     # Add partition_by if specified
+     if partition_by:
+         args['partition_by'] = partition_by
+
+     # Engine selection based on schema_mode
+     if schema_mode == 'merge':
+         # Use rust engine for schema merging (no row group params supported)
+         args['schema_mode'] = 'merge'
+         args['engine'] = 'rust'
+     else:
+         # Use pyarrow engine with row group optimization (default)
+         args['max_rows_per_file'] = RG
+         args['max_rows_per_group'] = RG
+         args['min_rows_per_group'] = RG
+
+     return args
+
+
  class DeltaWriter:
      """Spark-style write API for Delta Lake"""
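For orientation only (this is not part of the diff): a minimal sketch of what `_build_write_deltalake_args` above returns on its two branches, assuming the function and the module-level `RG` are in scope and using a placeholder table path.

```python
# Illustration of the helper defined above; the path and data are placeholders.
import pyarrow as pa

batch = pa.RecordBatch.from_pydict({"region": ["US"], "sales_amount": [9.99]})
path = "abfss://ws@onelake.dfs.fabric.microsoft.com/lh.Lakehouse/Tables/dbo/sales"  # hypothetical

# Default branch: pyarrow engine with 8M-row groups (row-size control kept for DirectLake)
_build_write_deltalake_args(path, batch, "append")
# -> {'table_or_uri': path, 'data': batch, 'mode': 'append',
#     'max_rows_per_file': 8000000, 'max_rows_per_group': 8000000, 'min_rows_per_group': 8000000}

# mergeSchema branch: switches to the rust engine and drops the row-group kwargs
_build_write_deltalake_args(path, batch, "append", schema_mode="merge", partition_by=["region"])
# -> {'table_or_uri': path, 'data': batch, 'mode': 'append',
#     'partition_by': ['region'], 'schema_mode': 'merge', 'engine': 'rust'}
```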
@@ -20,6 +50,8 @@ class DeltaWriter:
          self.duckrun = duckrun_instance
          self._format = "delta"
          self._mode = "overwrite"
+         self._schema_mode = None
+         self._partition_by = None

      def format(self, format_type: str):
          """Set output format (only 'delta' supported)"""
@@ -35,6 +67,27 @@ class DeltaWriter:
          self._mode = write_mode
          return self

+     def option(self, key: str, value):
+         """Set write option (Spark-compatible)"""
+         if key == "mergeSchema":
+             if str(value).lower() in ("true", "1"):
+                 self._schema_mode = "merge"
+             else:
+                 self._schema_mode = None
+         else:
+             raise ValueError(f"Unsupported option: {key}")
+         return self
+
+     def partitionBy(self, *columns):
+         """Set partition columns (Spark-compatible)"""
+         if len(columns) == 1 and isinstance(columns[0], (list, tuple)):
+             # Handle partitionBy(["col1", "col2"]) case
+             self._partition_by = list(columns[0])
+         else:
+             # Handle partitionBy("col1", "col2") case
+             self._partition_by = list(columns)
+         return self
+
      def saveAsTable(self, table_name: str):
          """Save query result as Delta table"""
          if self._format != "delta":
@@ -50,8 +103,18 @@ class DeltaWriter:
          path = f"{self.duckrun.table_base_url}{schema}/{table}"
          df = self.relation.record_batch()

-         print(f"Writing to Delta table: {schema}.{table} (mode={self._mode})")
-         write_deltalake(path, df, mode=self._mode, max_rows_per_file=RG, max_rows_per_group=RG, min_rows_per_group=RG)
+         # Build write arguments based on schema_mode and partition_by
+         write_args = _build_write_deltalake_args(
+             path, df, self._mode,
+             schema_mode=self._schema_mode,
+             partition_by=self._partition_by
+         )
+
+         engine_info = f" (engine=rust, schema_mode=merge)" if self._schema_mode == 'merge' else " (engine=pyarrow)"
+         partition_info = f" partitioned by {self._partition_by}" if self._partition_by else ""
+         print(f"Writing to Delta table: {schema}.{table} (mode={self._mode}){engine_info}{partition_info}")
+
+         write_deltalake(**write_args)

          self.duckrun.con.sql(f"DROP VIEW IF EXISTS {table}")
          self.duckrun.con.sql(f"""
@@ -113,6 +176,21 @@ class Duckrun:
          dr = Duckrun.connect("workspace/lakehouse.lakehouse")
          dr.sql("SELECT * FROM table").show()
          dr.sql("SELECT 43").write.mode("append").saveAsTable("test")
+
+         # Schema evolution and partitioning (exact Spark API):
+         dr.sql("SELECT * FROM source").write.mode("append").option("mergeSchema", "true").partitionBy("region").saveAsTable("sales")
+
+         # Pipeline formats:
+         pipeline = [
+             # SQL with parameters only
+             ('table_name', 'mode', {'param1': 'value1'}),
+
+             # SQL with Delta options (4-tuple format)
+             ('table_name', 'mode', {'param1': 'value1'}, {'mergeSchema': 'true', 'partitionBy': ['region']}),
+
+             # Python task
+             ('process_data', ('table_name',))
+         ]
      """

      def __init__(self, workspace: str, lakehouse_name: str, schema: str = "dbo",
@@ -392,7 +470,7 @@ class Duckrun:
          print(f"✅ Python '{name}' completed")
          return result

-     def _run_sql(self, table: str, mode: str, params: Dict) -> str:
+     def _run_sql(self, table: str, mode: str, params: Dict, delta_options: Dict = None) -> str:
          """Execute SQL task, write to Delta, return normalized table name"""
          self._create_onelake_secret()

@@ -406,10 +484,23 @@ class Duckrun:
          normalized_table = self._normalize_table_name(table)
          path = f"{self.table_base_url}{self.schema}/{normalized_table}"

+         # Extract Delta Lake specific options from delta_options
+         delta_options = delta_options or {}
+         merge_schema = delta_options.get('mergeSchema')
+         schema_mode = 'merge' if str(merge_schema).lower() in ('true', '1') else None
+         partition_by = delta_options.get('partitionBy') or delta_options.get('partition_by')
+
          if mode == 'overwrite':
              self.con.sql(f"DROP VIEW IF EXISTS {normalized_table}")
              df = self.con.sql(sql).record_batch()
-             write_deltalake(path, df, mode='overwrite', max_rows_per_file=RG, max_rows_per_group=RG, min_rows_per_group=RG)
+
+             write_args = _build_write_deltalake_args(
+                 path, df, 'overwrite',
+                 schema_mode=schema_mode,
+                 partition_by=partition_by
+             )
+             write_deltalake(**write_args)
+
              self.con.sql(f"CREATE OR REPLACE VIEW {normalized_table} AS SELECT * FROM delta_scan('{path}')")
              dt = DeltaTable(path)
              dt.vacuum(retention_hours=0, dry_run=False, enforce_retention_duration=False)
@@ -417,7 +508,14 @@ class Duckrun:

          elif mode == 'append':
              df = self.con.sql(sql).record_batch()
-             write_deltalake(path, df, mode='append', max_rows_per_file=RG, max_rows_per_group=RG, min_rows_per_group=RG)
+
+             write_args = _build_write_deltalake_args(
+                 path, df, 'append',
+                 schema_mode=schema_mode,
+                 partition_by=partition_by
+             )
+             write_deltalake(**write_args)
+
              self.con.sql(f"CREATE OR REPLACE VIEW {normalized_table} AS SELECT * FROM delta_scan('{path}')")
              dt = DeltaTable(path)
              if len(dt.file_uris()) > self.compaction_threshold:
@@ -434,13 +532,22 @@ class Duckrun:
              print(f"Table {normalized_table} doesn't exist. Creating...")
              self.con.sql(f"DROP VIEW IF EXISTS {normalized_table}")
              df = self.con.sql(sql).record_batch()
-             write_deltalake(path, df, mode='overwrite', max_rows_per_file=RG, max_rows_per_group=RG, min_rows_per_group=RG)
+
+             write_args = _build_write_deltalake_args(
+                 path, df, 'overwrite',
+                 schema_mode=schema_mode,
+                 partition_by=partition_by
+             )
+             write_deltalake(**write_args)
+
              self.con.sql(f"CREATE OR REPLACE VIEW {normalized_table} AS SELECT * FROM delta_scan('{path}')")
              dt = DeltaTable(path)
              dt.vacuum(dry_run=False)
              dt.cleanup_metadata()

-         print(f"✅ SQL '{table}' → '{normalized_table}' ({mode})")
+         engine_info = f" (engine=rust, schema_mode=merge)" if schema_mode == 'merge' else " (engine=pyarrow)"
+         partition_info = f" partitioned by {partition_by}" if partition_by else ""
+         print(f"✅ SQL '{table}' → '{normalized_table}' ({mode}){engine_info}{partition_info}")
          return normalized_table

      def run(self, pipeline: List[Tuple]) -> bool:
@@ -449,7 +556,8 @@ class Duckrun:

          Task formats:
          - Python: ('function_name', (arg1, arg2, ...))
-         - SQL: ('table_name', 'mode') or ('table_name', 'mode', {params})
+         - SQL: ('table_name', 'mode') or ('table_name', 'mode', {sql_params})
+         - SQL with Delta options: ('table_name', 'mode', {sql_params}, {delta_options})

          Returns:
              True if all tasks succeeded
@@ -469,7 +577,7 @@ class Duckrun:
              if len(task) == 2:
                  name, second = task
                  if isinstance(second, str) and second in {'overwrite', 'append', 'ignore'}:
-                     result = self._run_sql(name, second, {})
+                     result = self._run_sql(name, second, {}, {})
                  else:
                      args = second if isinstance(second, (tuple, list)) else (second,)
                      result = self._run_python(name, tuple(args))
@@ -478,7 +586,15 @@ class Duckrun:
                  table, mode, params = task
                  if not isinstance(params, dict):
                      raise ValueError(f"Expected dict for params, got {type(params)}")
-                 result = self._run_sql(table, mode, params)
+                 result = self._run_sql(table, mode, params, {})
+
+             elif len(task) == 4:
+                 table, mode, params, delta_options = task
+                 if not isinstance(params, dict):
+                     raise ValueError(f"Expected dict for SQL params, got {type(params)}")
+                 if not isinstance(delta_options, dict):
+                     raise ValueError(f"Expected dict for Delta options, got {type(delta_options)}")
+                 result = self._run_sql(table, mode, params, delta_options)

              else:
                  raise ValueError(f"Invalid task format: {task}")
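Taken together, the changes above can be exercised roughly like this (an illustrative sketch, not part of the diff; the import path, workspace and table names are placeholders):

```python
# Illustrative usage sketch - assumes `Duckrun` is importable from the duckrun
# package as shown in the class docstring; all names are placeholders.
from duckrun import Duckrun

dr = Duckrun.connect("my_workspace/my_lakehouse.lakehouse")

# New Spark-style writer options: mergeSchema -> rust engine, otherwise pyarrow + row groups
dr.sql("SELECT * FROM raw_sales") \
  .write \
  .mode("append") \
  .option("mergeSchema", "true") \
  .partitionBy("region") \
  .saveAsTable("sales")

# New 4-tuple pipeline task: (table, mode, sql_params, delta_options)
pipeline = [
    ('staging_sales', 'overwrite', {'start_date': '2024-01-01'}),
    ('sales', 'append', {}, {'mergeSchema': 'true', 'partitionBy': ['region']}),
]
dr.run(pipeline)
```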
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

  [project]
  name = "duckrun"
- version = "0.1.9"
+ version = "0.2.0"
  description = "Lakehouse task runner powered by DuckDB for Microsoft Fabric"
  readme = "README.md"
  license = {text = "MIT"}