duckrun 0.2.10.dev1__py3-none-any.whl → 0.2.11__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.


@@ -0,0 +1,1367 @@
Metadata-Version: 2.4
Name: duckrun
Version: 0.2.11
Summary: Lakehouse task runner powered by DuckDB for Microsoft Fabric
Author: mim
License: MIT
Project-URL: Homepage, https://github.com/djouallah/duckrun
Project-URL: Repository, https://github.com/djouallah/duckrun
Project-URL: Issues, https://github.com/djouallah/duckrun/issues
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: duckdb>=1.2.2
Requires-Dist: deltalake<=0.18.2
Requires-Dist: requests>=2.28.0
Requires-Dist: obstore>=0.2.0
Provides-Extra: local
Requires-Dist: azure-identity>=1.12.0; extra == "local"
Dynamic: license-file

<img src="https://raw.githubusercontent.com/djouallah/duckrun/main/duckrun.png" width="400" alt="Duckrun">

A helper package for working with Microsoft Fabric lakehouses - orchestration, SQL queries, and file management powered by DuckDB.

## Installation

```bash
pip install duckrun
```

For local usage (requires Azure CLI or interactive browser auth):

```bash
pip install duckrun[local]
```

Note: when running locally, your internet speed will be the main bottleneck.

## Quick Start

### Basic Usage

```python
import duckrun

# Connect to a lakehouse and query data
con = duckrun.connect("My Workspace/data.lakehouse/dbo")
con.sql("SELECT * FROM my_table LIMIT 10").show()

# Write query results to a new table
con.sql("SELECT * FROM source WHERE year = 2024") \
    .write.mode("overwrite").saveAsTable("filtered_data")

# Upload/download files
con.copy("./local_data", "remote_folder")   # Upload
con.download("remote_folder", "./local")    # Download
```

### Complete Example

```python
import duckrun

# 1. Connect to a lakehouse
con = duckrun.connect("Analytics/Sales.lakehouse/dbo")

# 2. Query and explore data
con.sql("""
    SELECT region, SUM(amount) as total
    FROM sales
    WHERE year = 2024
    GROUP BY region
""").show()

# 3. Create new tables from queries
con.sql("SELECT * FROM sales WHERE region = 'US'") \
    .write.mode("overwrite").saveAsTable("us_sales")

# 4. Upload files to OneLake
con.copy("./reports", "monthly_reports", ['.csv'])

# 5. Run a data pipeline (requires sql_folder; see Pipeline Orchestration)
pipeline = [
    ('clean_data', 'overwrite'),
    ('aggregate', 'append', {'min_amount': 1000})
]
con.run(pipeline)
```

---

## Core Functions

### Connection

#### `connect(connection_string, sql_folder=None, compaction_threshold=100)`

Connect to a workspace or lakehouse.

**Parameters:**
- `connection_string` (str): Connection path
  - Workspace only: `"My Workspace"`
  - Lakehouse with schema: `"My Workspace/lakehouse.lakehouse/dbo"`
  - Lakehouse without schema: `"My Workspace/lakehouse.lakehouse"` (scans all schemas)
- `sql_folder` (str, optional): Path or URL to SQL/Python files for pipelines
- `compaction_threshold` (int): File count before auto-compaction (default: 100)

**Returns:** `Duckrun` instance or `WorkspaceConnection` instance

**Examples:**
```python
# Workspace management
ws = duckrun.connect("My Workspace")
ws.list_lakehouses()
ws.create_lakehouse_if_not_exists("new_lakehouse")

# Lakehouse connection (recommended - specify a schema)
con = duckrun.connect("My Workspace/data.lakehouse/dbo")

# With a SQL folder for pipelines
con = duckrun.connect("My Workspace/data.lakehouse/dbo", sql_folder="./sql")
```

**Notes:**
- Workspace names with spaces are fully supported ✅
- Specifying a schema improves connection speed
- Without a schema, writes default to `dbo` and every schema is scanned to discover and attach Delta tables, which can be slow for large lakehouses; discovered tables are prefixed with their schema (e.g., `dbo_customers`, `bronze_raw_data`) to avoid name conflicts, as in the example below

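For example, when connecting without a schema, the discovered tables are queried under their prefixed names (table names here are illustrative):

```python
# Connecting without a schema scans every schema; tables are exposed
# with schema-prefixed names to avoid conflicts
con = duckrun.connect("My Workspace/My_Lakehouse.lakehouse")
con.sql("SELECT * FROM dbo_customers").show()
con.sql("SELECT * FROM bronze_raw_data").show()
```
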
---

### Query & Write

#### `sql(query)`

Execute a SQL query with a Spark-style write API.

**Parameters:**
- `query` (str): SQL query to execute

**Returns:** `QueryResult` object with methods:
- `.show(max_width=None)` - Display results in the console
- `.df()` - Get a pandas DataFrame
- `.write` - Access the write API (see below)

**Examples:**
```python
# Show results
con.sql("SELECT * FROM sales LIMIT 10").show()

# Get a DataFrame
df = con.sql("SELECT COUNT(*) FROM orders").df()

# Write to a table
con.sql("SELECT * FROM source").write.mode("overwrite").saveAsTable("target")
```

#### Write API

**Methods:**
- `.mode(mode)` - Set the write mode: `"overwrite"`, `"append"`, or `"ignore"`
- `.option(key, value)` - Set a Delta Lake option
- `.partitionBy(*cols)` - Partition by columns
- `.saveAsTable(table_name)` - Write to a Delta table

**Note:** `.format("delta")` is optional - Delta is the default format.

**Examples:**
```python
# Simple write
con.sql("SELECT * FROM data").write.mode("overwrite").saveAsTable("target")

# With schema evolution
con.sql("SELECT * FROM source") \
    .write.mode("append") \
    .option("mergeSchema", "true") \
    .saveAsTable("evolving_table")

# With partitioning
con.sql("SELECT * FROM sales") \
    .write.mode("overwrite") \
    .partitionBy("region", "year") \
    .saveAsTable("partitioned_sales")

# Combined
con.sql("SELECT * FROM data") \
    .write.mode("append") \
    .option("mergeSchema", "true") \
    .partitionBy("date", "category") \
    .saveAsTable("final_table")
```

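The `"ignore"` mode is the least common of the three; a minimal sketch of using it to seed a table only when it doesn't already exist (the table names are illustrative, not from duckrun):

```python
# Seed a dimension table once; with "ignore", the write is skipped
# when "dim_date" already exists
con.sql("SELECT * FROM raw_dim_date") \
    .write.mode("ignore") \
    .saveAsTable("dim_date")
```
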
---

### File Operations

#### `copy(local_folder, remote_folder, file_extensions=None, overwrite=False)`

Upload files from a local folder to the OneLake Files section.

**Parameters:**
- `local_folder` (str): Local source folder path
- `remote_folder` (str): Remote target folder in OneLake Files (required)
- `file_extensions` (list, optional): Filter by extensions (e.g., `['.csv', '.parquet']`)
- `overwrite` (bool): Overwrite existing files (default: `False`)

**Returns:** `True` if successful, `False` otherwise

**Examples:**
```python
# Upload all files
con.copy("./data", "processed_data")

# Upload specific file types
con.copy("./reports", "monthly", ['.csv', '.xlsx'])

# With overwrite
con.copy("./backup", "daily_backup", overwrite=True)
```

#### `download(remote_folder="", local_folder="./downloaded_files", file_extensions=None, overwrite=False)`

Download files from the OneLake Files section to a local folder.

**Parameters:**
- `remote_folder` (str): Source folder in OneLake Files (default: root)
- `local_folder` (str): Local destination folder (default: `"./downloaded_files"`)
- `file_extensions` (list, optional): Filter by extensions
- `overwrite` (bool): Overwrite existing files (default: `False`)

**Returns:** `True` if successful, `False` otherwise

**Examples:**
```python
# Download from the root
con.download()

# Download from a specific folder
con.download("processed_data", "./local_data")

# Download specific file types
con.download("reports", "./exports", ['.csv'])
```

**Notes:**
- Files go to the OneLake **Files** section, not Delta Tables
- Folder structure is preserved during upload and download
- Progress is reported with file names, sizes, and status

---

### Pipeline Orchestration

#### `run(pipeline)`

Execute a pipeline of SQL and Python tasks defined in `sql_folder`.

**Parameters:**
- `pipeline` (list): List of task tuples

**Returns:** `True` if all tasks succeeded, `False` if any failed

**Task Formats:**
```python
# Python task: ('function_name', (arg1, arg2, ...))
('download_data', ('https://api.example.com/data', './raw'))

# SQL task: ('table_name', 'mode')
('clean_data', 'overwrite')

# SQL with params: ('table_name', 'mode', {params})
('filter_data', 'append', {'min_value': 100})

# SQL with Delta options: ('table_name', 'mode', {params}, {options})
('evolving_table', 'append', {}, {'mergeSchema': 'true', 'partitionBy': ['region']})
```

A SQL task `('clean_data', ...)` runs `sql_folder/clean_data.sql` and writes the result to the `clean_data` table; a Python task `('download_data', ...)` calls the function `download_data` defined in `sql_folder/download_data.py`. For example:

```sql
-- sql_folder/clean_data.sql
SELECT
    id,
    TRIM(name) as name,
    date
FROM raw_data
WHERE date >= '2024-01-01'
```

**Write Modes:**
- `"overwrite"` - Replace the table completely
- `"append"` - Add to the existing table
- `"ignore"` - Create the table only if it doesn't exist

**Examples:**
```python
con = duckrun.connect("workspace/lakehouse.lakehouse/dbo", sql_folder="./sql")

# Simple pipeline
pipeline = [
    ('extract_data', 'overwrite'),
    ('transform', 'append'),
    ('load_final', 'overwrite')
]
con.run(pipeline)

# With parameters and Delta options
pipeline = [
    ('fetch_api', ('https://api.com/data', './raw')),
    ('clean', 'overwrite', {'date': '2024-01-01'}),
    ('aggregate', 'append', {}, {'partitionBy': ['region']})
]
con.run(pipeline)
```

**Pipeline Behavior:**
- SQL tasks fail automatically on errors (syntax or runtime)
- Python tasks control success/failure by returning `1` (success) or `0` (failure)
- The pipeline stops immediately when any task fails and the remaining tasks are skipped, which prevents downstream tasks from processing incomplete or corrupted data (see the sketch below)

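Because `run()` returns `True` only when every task succeeds, callers can gate downstream work on the result - a minimal sketch (the error handling is illustrative, not part of duckrun):

```python
if not con.run(pipeline):
    # A Python task returned 0 or a SQL step errored; remaining tasks were skipped
    raise RuntimeError("duckrun pipeline failed - check the task output above")
```
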
---

### Workspace Management

#### `list_lakehouses()`

List all lakehouses in the workspace.

**Returns:** List of lakehouse names (strings)

**Example:**
```python
ws = duckrun.connect("My Workspace")
lakehouses = ws.list_lakehouses()
print(lakehouses)  # ['lakehouse1', 'lakehouse2', ...]
```

#### `create_lakehouse_if_not_exists(lakehouse_name)`

Create a lakehouse if it doesn't already exist.

**Parameters:**
- `lakehouse_name` (str): Name of the lakehouse to create

**Returns:** `True` if the lakehouse exists or was created, `False` on error

**Example:**
```python
ws = duckrun.connect("My Workspace")
success = ws.create_lakehouse_if_not_exists("new_lakehouse")
```

---

### SQL Lookup Functions

Built-in SQL functions for resolving workspace and lakehouse names from GUIDs - especially useful when working with storage or audit logs that only contain IDs.

**Functions:**
- `get_workspace_name(workspace_id)` - GUID → workspace name
- `get_lakehouse_name(workspace_id, lakehouse_id)` - GUIDs → lakehouse name
- `get_workspace_id_from_name(workspace_name)` - workspace name → GUID
- `get_lakehouse_id_from_name(workspace_id, lakehouse_name)` - lakehouse name → GUID

**Features:**
- Results are cached to avoid repeated Fabric API calls
- Return `NULL` for missing or inaccessible items
- Automatically registered on connection - always available

**Example:**
```python
con = duckrun.connect("workspace/lakehouse.lakehouse/dbo")

# Enrich storage logs with friendly names
con.sql("""
    SELECT
        workspace_id,
        get_workspace_name(workspace_id) as workspace_name,
        lakehouse_id,
        get_lakehouse_name(workspace_id, lakehouse_id) as lakehouse_name,
        operation_count
    FROM storage_logs
    ORDER BY workspace_name, lakehouse_name
""").show()
```

---

### Semantic Model Deployment

#### `deploy(bim_url, dataset_name=None, wait_seconds=5)`

Deploy a Power BI semantic model from a BIM file using DirectLake mode.

**Parameters:**
- `bim_url` (str): URL to a BIM file, local path, or `"workspace/model"` format
- `dataset_name` (str, optional): Name for the semantic model (auto-generated if not provided)
- `wait_seconds` (int): Wait time for permission propagation (default: 5)

**Returns:** `1` for success, `0` for failure

**Examples:**
```python
con = duckrun.connect("Analytics/Sales.lakehouse/dbo")

# From a URL
con.deploy("https://raw.githubusercontent.com/user/repo/main/model.bim")

# With a custom name
con.deploy(
    "https://github.com/user/repo/raw/main/sales.bim",
    dataset_name="Sales Analytics"
)

# From workspace/model (copies a model from another workspace)
con.deploy("Source Workspace/Source Model", dataset_name="Sales Copy")
```

This fits CI/CD nicely: keep BIM files under version control and deploy DirectLake models across environments automatically.

---

### Utility Methods

#### `get_workspace_id()`

Get the workspace ID (GUID, or the name without spaces).

**Returns:** Workspace ID string

#### `get_lakehouse_id()`

Get the lakehouse ID (GUID or name).

**Returns:** Lakehouse ID string

#### `get_connection()`

Get the underlying DuckDB connection object.

**Returns:** DuckDB connection

#### `close()`

Close the DuckDB connection.

**Example:**
```python
con.close()
```

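A short sketch combining these helpers - it assumes the object returned by `get_connection()` behaves like a regular DuckDB connection, as described above:

```python
print(con.get_workspace_id())   # workspace GUID (or name without spaces)
print(con.get_lakehouse_id())   # lakehouse GUID (or name)

# Drop down to raw DuckDB for anything the wrapper doesn't expose
duck = con.get_connection()
print(duck.execute("SELECT version()").fetchone())

con.close()
```
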
---

## Advanced Features

### Schema Evolution

Automatically handle schema changes (new columns) using `mergeSchema`. Use it when the source schema evolves over time and you need backward compatibility:

```python
# Using the write API
con.sql("SELECT * FROM source").write \
    .mode("append") \
    .option("mergeSchema", "true") \
    .saveAsTable("evolving_table")

# Using a pipeline
pipeline = [
    ('table', 'append', {}, {'mergeSchema': 'true'})
]
```

### Partitioning

Optimize query performance by partitioning data:

```python
# Partition by a single column
con.sql("SELECT * FROM sales").write \
    .mode("overwrite") \
    .partitionBy("region") \
    .saveAsTable("partitioned_sales")

# Partition by multiple columns
con.sql("SELECT * FROM orders").write \
    .mode("overwrite") \
    .partitionBy("year", "month", "region") \
    .saveAsTable("time_partitioned")
```

**Best Practices:**
- ✅ Partition by columns frequently used in WHERE clauses
- ✅ Use low- to medium-cardinality columns (dates, regions, categories)
- ❌ Avoid high-cardinality columns (customer_id, transaction_id)

### SQL Template Parameters

Use template parameters in SQL files.

**Built-in parameters:**
- `$ws` - workspace name
- `$lh` - lakehouse name
- `$schema` - schema name
- `$storage_account` - storage account name
- `$tables_url` - base URL for the Tables folder
- `$files_url` - base URL for the Files folder

**Custom parameters:**
```sql
-- sql/sales.sql
SELECT * FROM transactions
WHERE date >= '$start_date' AND region = '$region'
```

```python
pipeline = [
    ('sales', 'append', {'start_date': '2024-01-01', 'region': 'US'})
]
```

### Table Name Variants

Use a `__` suffix to create multiple SQL files that write to the same table:

```python
pipeline = [
    ('sales__initial', 'overwrite'),   # writes to 'sales'
    ('sales__incremental', 'append'),  # appends to 'sales'
]
```

Both tasks write to the `sales` table but use different SQL files (`sales__initial.sql` and `sales__incremental.sql`).

### Remote SQL Files

Load SQL/Python files from GitHub or any URL:

```python
con = duckrun.connect(
    "workspace/lakehouse.lakehouse/dbo",
    sql_folder="https://raw.githubusercontent.com/user/repo/main/sql"
)
```

### Auto-Compaction

Delta tables are maintained automatically: small files are compacted when the file count exceeds the threshold, old versions are vacuumed on overwrite, and metadata is cleaned up.

```python
# Customize the compaction threshold
con = duckrun.connect(
    "workspace/lakehouse.lakehouse/dbo",
    compaction_threshold=50  # compact after 50 files
)
```

---

## Complete Example

```python
import duckrun

# 1. Connect with a SQL folder for pipelines
con = duckrun.connect("Analytics/Sales.lakehouse/dbo", sql_folder="./sql")

# 2. Upload raw data files
con.copy("./raw_data", "staging", ['.csv', '.json'])

# 3. Run the data pipeline
pipeline = [
    # Python: download from an API
    ('fetch_api_data', ('https://api.example.com/sales', 'raw')),

    # SQL: clean and transform
    ('clean_sales', 'overwrite'),

    # SQL: aggregate with parameters
    ('regional_summary', 'overwrite', {'min_amount': 1000}),

    # SQL: append to history with schema evolution and partitioning
    ('sales_history', 'append', {}, {
        'mergeSchema': 'true',
        'partitionBy': ['year', 'region']
    })
]
success = con.run(pipeline)

# 4. Query and explore the results
con.sql("""
    SELECT region, SUM(total) as grand_total
    FROM regional_summary
    GROUP BY region
""").show()

# 5. Create a derived table
con.sql("SELECT * FROM sales WHERE year = 2024").write \
    .mode("overwrite") \
    .partitionBy("month") \
    .saveAsTable("sales_2024")

# 6. Download processed reports
con.download("processed_reports", "./exports", ['.csv'])

# 7. Deploy a semantic model
con.deploy(
    "https://raw.githubusercontent.com/user/repo/main/sales_model.bim",
    dataset_name="Sales Analytics"
)

# 8. Enrich logs with the lookup functions
logs = con.sql("""
    SELECT
        workspace_id,
        get_workspace_name(workspace_id) as workspace,
        lakehouse_id,
        get_lakehouse_name(workspace_id, lakehouse_id) as lakehouse,
        COUNT(*) as operations
    FROM audit_logs
    GROUP BY ALL
""").df()
print(logs)
```

---

## Python Task Reference

Create Python tasks in your `sql_folder`:

```python
# sql_folder/fetch_api_data.py
def fetch_api_data(url, output_path):
    """
    Download data from an API.

    Returns:
        1 for success (pipeline continues)
        0 for failure (pipeline stops)
    """
    try:
        import requests
        response = requests.get(url)
        response.raise_for_status()

        # Save the data
        with open(output_path, 'w') as f:
            f.write(response.text)

        return 1  # Success
    except Exception as e:
        print(f"Error: {e}")
        return 0  # Failure - the pipeline will stop
```

**Important:**
- The function name must match the filename
- Return `1` for success, `0` for failure
- Python tasks can take workspace/lakehouse IDs as parameters (see the sketch below)

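A minimal sketch of passing those IDs through a task's argument tuple - the `log_run` task and its signature are hypothetical, not part of duckrun:

```python
# Hypothetical task: sql_folder/log_run.py defines log_run(workspace_id, lakehouse_id)
pipeline = [
    ('log_run', (con.get_workspace_id(), con.get_lakehouse_id())),
]
con.run(pipeline)
```
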
---

## Requirements & Notes

**Requirements:**
- The lakehouse must have a schema (e.g., `dbo`, `sales`, `analytics`)
- Azure authentication (Azure CLI, browser, or the Fabric notebook environment)

**Important Notes:**
- ✅ Workspace names with spaces are fully supported
- ✅ Files are uploaded/downloaded to the OneLake **Files** section (not Delta Tables)
- ✅ Pipelines stop on the first failure (SQL errors or a Python task returning 0)
- ⚠️ Pins an older deltalake release (`deltalake<=0.18.2`) to keep row group size control, which the newer Rust-based versions don't yet support and which is crucial for Power BI DirectLake performance
- ⚠️ Scanning all schemas can be slow for large lakehouses

**Authentication:**
- Fabric notebooks: automatic, using the notebook credentials
- Local/VS Code: Azure CLI or interactive browser authentication (see the sketch below)
- Custom: use the Azure Identity credential chain

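A minimal local-authentication sketch, assuming `duckrun[local]` is installed and that an `az login` session (or the interactive browser prompt) satisfies the credential chain:

```python
# Run `az login` first (or let the browser prompt appear), then connect
# as usual; duckrun picks up the ambient Azure credentials
import duckrun

con = duckrun.connect("My Workspace/data.lakehouse/dbo")
con.sql("SELECT 42 AS answer").show()
```
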
---

## How It Works

1. **Connection**: Duckrun connects to your Fabric lakehouse using OneLake and Azure authentication
2. **Table Discovery**: It automatically scans for Delta tables in your schema (or all schemas) and creates DuckDB views
3. **Query Execution**: SQL queries run directly against the Delta tables using DuckDB's speed
4. **Write Operations**: Results are written back as Delta tables with automatic optimization
5. **Pipelines**: Complex workflows are orchestrated with reusable SQL and Python tasks

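Because discovered Delta tables are registered as DuckDB views (step 2), you can inspect what was attached with plain SQL - a minimal sketch, assuming `sql()` passes DuckDB statements straight through:

```python
con = duckrun.connect("My Workspace/data.lakehouse/dbo")

# List the views DuckDB created for the discovered Delta tables
con.sql("SHOW TABLES").show()
```
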
## Real-World Example

For a complete production example, see [fabric_demo](https://github.com/djouallah/fabric_demo).

## License

MIT