duckrun 0.2.11__py3-none-any.whl → 0.2.12__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release - this version of duckrun might be problematic.

@@ -0,0 +1,662 @@
1
+ Metadata-Version: 2.4
2
+ Name: duckrun
3
+ Version: 0.2.12
4
+ Summary: Lakehouse task runner powered by DuckDB for Microsoft Fabric
5
+ Author: mim
6
+ License: MIT
7
+ Project-URL: Homepage, https://github.com/djouallah/duckrun
8
+ Project-URL: Repository, https://github.com/djouallah/duckrun
9
+ Project-URL: Issues, https://github.com/djouallah/duckrun/issues
10
+ Requires-Python: >=3.9
11
+ Description-Content-Type: text/markdown
12
+ License-File: LICENSE
13
+ Requires-Dist: duckdb>=1.2.2
14
+ Requires-Dist: deltalake<=0.18.2
15
+ Requires-Dist: requests>=2.28.0
16
+ Requires-Dist: obstore>=0.2.0
17
+ Provides-Extra: local
18
+ Requires-Dist: azure-identity>=1.12.0; extra == "local"
19
+ Dynamic: license-file
20
+
21
+ <img src="https://raw.githubusercontent.com/djouallah/duckrun/main/duckrun.png" width="400" alt="Duckrun">
22
+
23
+ A helper package for working with Microsoft Fabric lakehouses - orchestration, SQL queries, and file management powered by DuckDB.
24
+
25
+ ## Important Notes
26
+
27
+ **Requirements:**
28
+ - Lakehouse must have a schema (e.g., `dbo`, `sales`, `analytics`)
29
+ - **Workspace names with spaces are fully supported!** ✅
30
+
31
+ **Delta Lake Version:** This package pins an older deltalake release (`<=0.18.2`) to retain control over row group sizes, which is crucial for Power BI performance optimization. The newer Rust-based deltalake writers don't yet support the row group size parameters that are essential for optimal DirectLake performance.
32
+
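+ For context, here is a minimal sketch (not duckrun's internal code) of the row group controls that the pinned deltalake release exposes; the path and sizes below are placeholders:
+
+ ```python
+ # Minimal sketch, assuming deltalake<=0.18.2 with the PyArrow writer.
+ # The path and row group sizes are illustrative only - tune them for your model.
+ import pyarrow as pa
+ from deltalake import write_deltalake
+
+ data = pa.table({"id": list(range(1_000_000))})
+ write_deltalake(
+     "/tmp/demo_delta_table",        # placeholder path; duckrun targets OneLake instead
+     data,
+     mode="overwrite",
+     min_rows_per_group=8_000_000,   # row group size knobs the version pin preserves
+     max_rows_per_group=16_000_000,
+     max_rows_per_file=48_000_000,
+ )
+ ```
+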
33
+ ## What It Does
34
+
35
+ It handles orchestration, runs arbitrary SQL statements, and manages files. That's it - just the stuff I encounter in my daily workflow when working with Fabric notebooks.
36
+
37
+ ## Installation
38
+
39
+ ```bash
40
+ pip install duckrun
41
+ ```
42
+
43
+ For local usage (requires Azure CLI or interactive browser auth):
44
+
45
+ ```bash
46
+ pip install duckrun[local]
47
+ ```
48
+
49
+ Note: When running locally, your internet speed will be the main bottleneck.
50
+
51
+ ## Quick Start
52
+
53
+ ### Simple Example for New Users
54
+
55
+ ```python
56
+ import duckrun
57
+
58
+ # Connect to a workspace and manage lakehouses
59
+ con = duckrun.connect('My Workspace')
60
+ con.list_lakehouses() # See what lakehouses exist
61
+ con.create_lakehouse_if_not_exists('data') # Create if needed
62
+
63
+ # Connect to a specific lakehouse and query data
64
+ con = duckrun.connect("My Workspace/data.lakehouse/dbo")
65
+ con.sql("SELECT * FROM my_table LIMIT 10").show()
66
+ ```
67
+
68
+ ### Full Feature Overview
69
+
70
+ ```python
71
+ import duckrun
72
+
73
+ # 1. Workspace Management (list and create lakehouses)
74
+ ws = duckrun.connect("My Workspace")
75
+ lakehouses = ws.list_lakehouses() # Returns list of lakehouse names
76
+ ws.create_lakehouse_if_not_exists("New_Lakehouse")
77
+
78
+ # 2. Connect to lakehouse with a specific schema
79
+ con = duckrun.connect("My Workspace/MyLakehouse.lakehouse/dbo")
80
+
81
+ # Workspace names with spaces are supported!
82
+ con = duckrun.connect("Data Analytics/SalesData.lakehouse/analytics")
83
+
84
+ # Schema defaults to 'dbo' if not specified (scans all schemas)
85
+ # ⚠️ WARNING: Scanning all schemas can be slow for large lakehouses!
86
+ con = duckrun.connect("My Workspace/My_Lakehouse.lakehouse")
87
+
88
+ # 3. Explore data
89
+ con.sql("SELECT * FROM my_table LIMIT 10").show()
90
+
91
+ # 4. Write to Delta tables (Spark-style API)
92
+ con.sql("SELECT * FROM source").write.mode("overwrite").saveAsTable("target")
93
+
94
+ # 5. Upload/download files to/from OneLake Files
95
+ con.copy("./local_folder", "target_folder") # Upload files
96
+ con.download("target_folder", "./downloaded") # Download files
97
+ ```
98
+
99
+ That's it! No `sql_folder` needed for data exploration.
100
+
101
+ ## Connection Format
102
+
103
+ ```python
104
+ # Workspace management (list and create lakehouses)
105
+ ws = duckrun.connect("My Workspace")
106
+ ws.list_lakehouses() # Returns: ['lakehouse1', 'lakehouse2', ...]
107
+ ws.create_lakehouse_if_not_exists("New Lakehouse")
108
+
109
+ # Lakehouse connection with schema (recommended for best performance)
110
+ con = duckrun.connect("My Workspace/My Lakehouse.lakehouse/dbo")
111
+
112
+ # Supports workspace names with spaces!
113
+ con = duckrun.connect("Data Analytics/Sales Data.lakehouse/analytics")
114
+
115
+ # Without schema (defaults to 'dbo', scans all schemas)
116
+ # ⚠️ This can be slow for large lakehouses!
117
+ con = duckrun.connect("My Workspace/My Lakehouse.lakehouse")
118
+
119
+ # With SQL folder for pipeline orchestration
120
+ con = duckrun.connect("My Workspace/My Lakehouse.lakehouse/dbo", sql_folder="./sql")
121
+ ```
122
+
123
+ ### Multi-Schema Support
124
+
125
+ When you don't specify a schema, Duckrun will:
126
+ - **Default to `dbo`** for write operations
127
+ - **Scan all schemas** to discover and attach all Delta tables
128
+ - **Prefix table names** with schema to avoid conflicts (e.g., `dbo_customers`, `bronze_raw_data`)
129
+
130
+ **Performance Note:** Scanning all schemas requires listing all files in the lakehouse, which can be slow for large lakehouses with many tables. For better performance, always specify a schema when possible.
131
+
132
+ ```python
133
+ # Fast: scans only 'dbo' schema
134
+ con = duckrun.connect("workspace/lakehouse.lakehouse/dbo")
135
+
136
+ # Slower: scans all schemas
137
+ con = duckrun.connect("workspace/lakehouse.lakehouse")
138
+
139
+ # Query tables from different schemas (when scanning all)
140
+ con.sql("SELECT * FROM dbo_customers").show()
141
+ con.sql("SELECT * FROM bronze_raw_data").show()
142
+ ```
143
+
144
+ ## Three Ways to Use Duckrun
145
+
146
+ ### 1. Data Exploration (Spark-Style API)
147
+
148
+ Perfect for ad-hoc analysis and interactive notebooks:
149
+
150
+ ```python
151
+ con = duckrun.connect("workspace/lakehouse.lakehouse/dbo")
152
+
153
+ # Query existing tables
154
+ con.sql("SELECT * FROM sales WHERE year = 2024").show()
155
+
156
+ # Get DataFrame
157
+ df = con.sql("SELECT COUNT(*) FROM orders").df()
158
+
159
+ # Write results to Delta tables
160
+ con.sql("""
161
+ SELECT
162
+ customer_id,
163
+ SUM(amount) as total
164
+ FROM orders
165
+ GROUP BY customer_id
166
+ """).write.mode("overwrite").saveAsTable("customer_totals")
167
+
168
+ # Schema evolution and partitioning (exact Spark API compatibility)
169
+ con.sql("""
170
+ SELECT
171
+ customer_id,
172
+ order_date,
173
+ region,
174
+ product_category,
175
+ sales_amount,
176
+ new_column_added_later -- This column might not exist in target table
177
+ FROM source_table
178
+ """).write \
179
+ .mode("append") \
180
+ .option("mergeSchema", "true") \
181
+ .partitionBy("region", "product_category") \
182
+ .saveAsTable("sales_partitioned")
183
+ ```
184
+
185
+ **Note:** `.format("delta")` is optional - Delta is the default format!
186
+
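+ For example, the two writes below should be equivalent (the table name `tiny_example` is just a placeholder):
+
+ ```python
+ # Both forms write a Delta table; the explicit .format("delta") call can be omitted.
+ con.sql("SELECT 1 AS x").write.format("delta").mode("overwrite").saveAsTable("tiny_example")
+ con.sql("SELECT 1 AS x").write.mode("overwrite").saveAsTable("tiny_example")
+ ```
+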
187
+ ### 2. File Management (OneLake Files)
188
+
189
+ Upload and download files to/from the OneLake Files section (not Delta tables):
190
+
191
+ ```python
192
+ con = duckrun.connect("workspace/lakehouse.lakehouse/dbo")
193
+
194
+ # Upload files to OneLake Files (remote_folder is required)
195
+ con.copy("./local_data", "uploaded_data")
196
+
197
+ # Upload only specific file types
198
+ con.copy("./reports", "daily_reports", ['.csv', '.parquet'])
199
+
200
+ # Upload with overwrite enabled (default is False for safety)
201
+ con.copy("./backup", "backups", overwrite=True)
202
+
203
+ # Download files from OneLake Files
204
+ con.download("uploaded_data", "./downloaded")
205
+
206
+ # Download only CSV files from a specific folder
207
+ con.download("daily_reports", "./reports", ['.csv'])
208
+ ```
209
+
210
+ **Key Features:**
211
+ - ✅ **Files go to OneLake Files section** (not Delta Tables)
212
+ - ✅ **`remote_folder` parameter is required** for uploads (prevents accidental uploads)
213
+ - ✅ **`overwrite=False` by default** (safer - prevents accidental overwrites)
214
+ - ✅ **File extension filtering** (e.g., only `.csv` or `.parquet` files)
215
+ - ✅ **Preserves folder structure** during upload/download
216
+ - ✅ **Progress reporting** with file sizes and upload status
217
+
218
+ ### 3. Pipeline Orchestration
219
+
220
+ For production workflows with reusable SQL and Python tasks:
221
+
222
+ ```python
223
+ con = duckrun.connect(
224
+ "my_workspace/my_lakehouse.lakehouse/dbo",
225
+ sql_folder="./sql" # folder with .sql and .py files
226
+ )
227
+
228
+ # Define pipeline
229
+ pipeline = [
230
+ ('download_data', (url, path)), # Python task
231
+ ('clean_data', 'overwrite'), # SQL task
232
+ ('aggregate', 'append') # SQL task
233
+ ]
234
+
235
+ # Run it
236
+ con.run(pipeline)
237
+ ```
238
+
239
+ ## Pipeline Tasks
240
+
241
+ ### Python Tasks
242
+
243
+ **Format:** `('function_name', (arg1, arg2, ...))`
244
+
245
+ Create `sql_folder/function_name.py`:
246
+
247
+ ```python
248
+ # sql_folder/download_data.py
249
+ def download_data(url, path):
250
+ # your code here
251
+ return 1 # 1 = success, 0 = failure
252
+ ```
253
+
254
+ ### SQL Tasks
255
+
256
+ **Formats:**
257
+ - `('table_name', 'mode')` - Simple SQL with no parameters
258
+ - `('table_name', 'mode', {params})` - SQL with template parameters
259
+ - `('table_name', 'mode', {params}, {delta_options})` - SQL with Delta Lake options
260
+
261
+ Create `sql_folder/table_name.sql`:
262
+
263
+ ```sql
264
+ -- sql_folder/clean_data.sql
265
+ SELECT
266
+ id,
267
+ TRIM(name) as name,
268
+ date
269
+ FROM raw_data
270
+ WHERE date >= '2024-01-01'
271
+ ```
272
+
273
+ **Write Modes:**
274
+ - `overwrite` - Replace table completely
275
+ - `append` - Add to existing table
276
+ - `ignore` - Create the table only if it doesn't exist (see the sketch below)
277
+
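+ For instance, `ignore` works well for one-time seed tables. A small sketch, assuming hypothetical task names like `dim_date`:
+
+ ```python
+ # sql_folder/dim_date.sql builds a date dimension; with 'ignore' the task
+ # creates the table on the first run and is a no-op on later runs.
+ pipeline = [
+     ('dim_date', 'ignore'),       # created only if it doesn't exist yet
+     ('clean_data', 'overwrite'),  # rebuilt every run
+     ('aggregate', 'append'),      # accumulates every run
+ ]
+ con.run(pipeline)
+ ```
+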
278
+ ### Parameterized SQL
279
+
280
+ Built-in parameters (always available):
281
+ - `$ws` - workspace name
282
+ - `$lh` - lakehouse name
283
+ - `$schema` - schema name
284
+
285
+ Custom parameters:
286
+
287
+ ```python
288
+ pipeline = [
289
+ ('sales', 'append', {'start_date': '2024-01-01', 'end_date': '2024-12-31'})
290
+ ]
291
+ ```
292
+
293
+ ```sql
294
+ -- sql_folder/sales.sql
295
+ SELECT * FROM transactions
296
+ WHERE date BETWEEN '$start_date' AND '$end_date'
297
+ ```
298
+
299
+ ### Delta Lake Options (Schema Evolution & Partitioning)
300
+
301
+ Use the 4-tuple format for advanced Delta Lake features:
302
+
303
+ ```python
304
+ pipeline = [
305
+ # SQL with empty params but Delta options
306
+ ('evolving_table', 'append', {}, {'mergeSchema': 'true'}),
307
+
308
+ # SQL with both params AND Delta options
309
+ ('sales_data', 'append',
310
+ {'region': 'North America'},
311
+ {'mergeSchema': 'true', 'partitionBy': ['region', 'year']}),
312
+
313
+ # Partitioning without schema merging
314
+ ('time_series', 'overwrite',
315
+ {'start_date': '2024-01-01'},
316
+ {'partitionBy': ['year', 'month']})
317
+ ]
318
+ ```
319
+
320
+ **Available Delta Options:**
321
+ - `mergeSchema: 'true'` - Automatically handle schema evolution (new columns)
322
+ - `partitionBy: ['col1', 'col2']` - Partition data by specified columns
323
+
324
+ ## Advanced Features
325
+
326
+ ### SQL Lookup Functions
327
+
328
+ Duckrun automatically registers helper functions that allow you to resolve workspace and lakehouse names from GUIDs directly in SQL queries. These are especially useful when working with storage logs or audit data that contains workspace/lakehouse IDs.
329
+
330
+ **Available Functions:**
331
+
332
+ ```python
333
+ con = duckrun.connect("workspace/lakehouse.lakehouse/dbo")
334
+
335
+ # ID → Name lookups (most common use case)
336
+ con.sql("""
337
+ SELECT
338
+ workspace_id,
339
+ get_workspace_name(workspace_id) as workspace_name,
340
+ lakehouse_id,
341
+ get_lakehouse_name(workspace_id, lakehouse_id) as lakehouse_name
342
+ FROM storage_logs
343
+ """).show()
344
+
345
+ # Name → ID lookups (reverse)
346
+ con.sql("""
347
+ SELECT
348
+ workspace_name,
349
+ get_workspace_id_from_name(workspace_name) as workspace_id,
350
+ lakehouse_name,
351
+ get_lakehouse_id_from_name(workspace_id, lakehouse_name) as lakehouse_id
352
+ FROM configuration_table
353
+ """).show()
354
+ ```
355
+
356
+ **Function Reference:**
357
+
358
+ - `get_workspace_name(workspace_id)` - Convert workspace GUID to display name
359
+ - `get_lakehouse_name(workspace_id, lakehouse_id)` - Convert lakehouse GUID to display name
360
+ - `get_workspace_id_from_name(workspace_name)` - Convert workspace name to GUID
361
+ - `get_lakehouse_id_from_name(workspace_id, lakehouse_name)` - Convert lakehouse name to GUID
362
+
363
+ **Features:**
364
+ - ✅ **Automatic Caching**: Results are cached to avoid repeated API calls
365
+ - ✅ **NULL on Error**: Returns `NULL` instead of errors for missing or inaccessible items
366
+ - ✅ **Fabric API Integration**: Resolves names using Microsoft Fabric REST API
367
+ - ✅ **Always Available**: Functions are automatically registered on connection
368
+
369
+ **Example Use Case:**
370
+
371
+ ```python
372
+ # Enrich OneLake storage logs with friendly names
373
+ con = duckrun.connect("Analytics/Monitoring.lakehouse/dbo")
374
+
375
+ result = con.sql("""
376
+ SELECT
377
+ workspace_id,
378
+ get_workspace_name(workspace_id) as workspace_name,
379
+ lakehouse_id,
380
+ get_lakehouse_name(workspace_id, lakehouse_id) as lakehouse_name,
381
+ operation_name,
382
+ COUNT(*) as operation_count,
383
+ SUM(bytes_transferred) as total_bytes
384
+ FROM onelake_storage_logs
385
+ WHERE log_date = CURRENT_DATE
386
+ GROUP BY ALL
387
+ ORDER BY workspace_name, lakehouse_name
388
+ """).show()
389
+ ```
390
+
391
+ This makes it easy to create human-readable reports from GUID-based log data!
392
+
393
+ ### Schema Evolution & Partitioning
394
+
395
+ Handle evolving schemas and optimize query performance with partitioning:
396
+
397
+ ```python
398
+ # Using Spark-style API
399
+ con.sql("""
400
+ SELECT
401
+ customer_id,
402
+ region,
403
+ product_category,
404
+ sales_amount,
405
+ -- New column that might not exist in target table
406
+ discount_percentage
407
+ FROM raw_sales
408
+ """).write \
409
+ .mode("append") \
410
+ .option("mergeSchema", "true") \
411
+ .partitionBy("region", "product_category") \
412
+ .saveAsTable("sales_partitioned")
413
+
414
+ # Using pipeline format
415
+ pipeline = [
416
+ ('sales_summary', 'append',
417
+ {'batch_date': '2024-10-07'},
418
+ {'mergeSchema': 'true', 'partitionBy': ['region', 'year']})
419
+ ]
420
+ ```
421
+
422
+ **Benefits:**
423
+ - 🔄 **Schema Evolution**: Automatically handles new columns without breaking existing queries
424
+ - ⚡ **Query Performance**: Partitioning improves performance for filtered queries
425
+
426
+ ### Table Name Variants
427
+
428
+ Use `__` to define multiple tasks that write to the same table:
429
+
430
+ ```python
431
+ pipeline = [
432
+ ('sales__initial', 'overwrite'), # writes to 'sales'
433
+ ('sales__incremental', 'append'), # appends to 'sales'
434
+ ]
435
+ ```
436
+
437
+ Both tasks write to the `sales` table but use different SQL files (`sales__initial.sql` and `sales__incremental.sql`).
438
+
439
+ ### Remote SQL Files
440
+
441
+ Load tasks from GitHub or any URL:
442
+
443
+ ```python
444
+ con = duckrun.connect(
445
+ "Analytics/Sales.lakehouse/dbo",
446
+ sql_folder="https://raw.githubusercontent.com/user/repo/main/sql"
447
+ )
448
+ ```
449
+
450
+ ### Early Exit on Failure
451
+
452
+ **Pipelines automatically stop when any task fails** - subsequent tasks won't run.
453
+
454
+ For **SQL tasks**, failure is automatic:
455
+ - If the query has a syntax error or runtime error, the task fails
456
+ - The pipeline stops immediately
457
+
458
+ For **Python tasks**, you control success/failure by returning:
459
+ - `1` = Success → pipeline continues to next task
460
+ - `0` = Failure → pipeline stops, remaining tasks are skipped
461
+
462
+ Example:
463
+
464
+ ```python
465
+ # sql_folder/download_data.py
466
+ def download_data(url, path):
467
+ try:
468
+ response = requests.get(url)
469
+ response.raise_for_status()
470
+ # save data...
471
+ return 1 # Success - pipeline continues
472
+ except Exception as e:
473
+ print(f"Download failed: {e}")
474
+ return 0 # Failure - pipeline stops here
475
+ ```
476
+
477
+ ```python
478
+ pipeline = [
479
+ ('download_data', (url, path)), # If returns 0, stops here
480
+ ('clean_data', 'overwrite'), # Won't run if download failed
481
+ ('aggregate', 'append') # Won't run if download failed
482
+ ]
483
+
484
+ success = con.run(pipeline) # Returns True only if ALL tasks succeed
485
+ ```
486
+
487
+ This prevents downstream tasks from processing incomplete or corrupted data.
488
+
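+ In a scheduled notebook or job you would typically check that return value and fail fast; a minimal sketch:
+
+ ```python
+ # Stop the surrounding job if any task failed, so downstream steps don't run
+ # against partial data.
+ if not con.run(pipeline):
+     raise RuntimeError("duckrun pipeline failed - see task output above")
+ ```
+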
489
+ ### Semantic Model Deployment
490
+
491
+ Deploy Power BI semantic models directly from BIM files using DirectLake mode:
492
+
493
+ ```python
494
+ # Connect to lakehouse
495
+ con = duckrun.connect("Analytics/Sales.lakehouse/dbo")
496
+
497
+ # Deploy with auto-generated name (lakehouse_schema)
498
+ con.deploy("https://raw.githubusercontent.com/user/repo/main/model.bim")
499
+
500
+ # Deploy with custom name
501
+ con.deploy(
502
+ "https://raw.githubusercontent.com/user/repo/main/sales_model.bim",
503
+ dataset_name="Sales Analytics Model",
504
+ wait_seconds=10 # Wait for permission propagation
505
+ )
506
+ ```
507
+
508
+ **Features:**
509
+ - 🚀 **DirectLake Mode**: Deploys semantic models with DirectLake connection
510
+ - 🔄 **Automatic Configuration**: Auto-configures workspace, lakehouse, and schema connections
511
+ - 📦 **BIM from URL**: Load model definitions from GitHub or any accessible URL
512
+ - ⏱️ **Permission Handling**: Configurable wait time for permission propagation
513
+
514
+ **Use Cases:**
515
+ - Deploy semantic models as part of CI/CD pipelines
516
+ - Version control your semantic models in Git
517
+ - Automated model deployment across environments
518
+ - Streamline DirectLake model creation
519
+
520
+ ### Delta Lake Optimization
521
+
522
+ Duckrun automatically:
523
+ - Compacts small files when the file count exceeds a threshold (default: 100)
524
+ - Vacuums old versions on overwrite
525
+ - Cleans up metadata
526
+
527
+ Customize compaction threshold:
528
+
529
+ ```python
530
+ con = duckrun.connect(
531
+ "workspace/lakehouse.lakehouse/dbo",
532
+ compaction_threshold=50 # compact after 50 files
533
+ )
534
+ ```
535
+
536
+ ## Complete Example
537
+
538
+ ```python
539
+ import duckrun
540
+
541
+ # Connect (specify schema for best performance)
542
+ con = duckrun.connect("Analytics/Sales.lakehouse/dbo", sql_folder="./sql")
543
+
544
+ # 1. Upload raw data files to OneLake Files
545
+ con.copy("./raw_data", "raw_uploads", ['.csv', '.json'])
546
+
547
+ # 2. Pipeline with mixed tasks
548
+ pipeline = [
549
+ # Download raw data (Python)
550
+ ('fetch_api_data', ('https://api.example.com/sales', 'raw')),
551
+
552
+ # Clean and transform (SQL)
553
+ ('clean_sales', 'overwrite'),
554
+
555
+ # Aggregate by region (SQL with params)
556
+ ('regional_summary', 'overwrite', {'min_amount': 1000}),
557
+
558
+ # Append to history with schema evolution (SQL with Delta options)
559
+ ('sales_history', 'append', {}, {'mergeSchema': 'true', 'partitionBy': ['year', 'region']})
560
+ ]
561
+
562
+ # Run pipeline
563
+ success = con.run(pipeline)
564
+
565
+ # 3. Explore results using DuckDB
566
+ con.sql("SELECT * FROM regional_summary").show()
567
+
568
+ # 4. Export to new Delta table
569
+ con.sql("""
570
+ SELECT region, SUM(total) as grand_total
571
+ FROM regional_summary
572
+ GROUP BY region
573
+ """).write.mode("overwrite").saveAsTable("region_totals")
574
+
575
+ # 5. Download processed files for external systems
576
+ con.download("processed_reports", "./exports", ['.csv'])
577
+
578
+ # 6. Deploy semantic model for Power BI
579
+ con.deploy(
580
+ "https://raw.githubusercontent.com/user/repo/main/sales_model.bim",
581
+ dataset_name="Sales Analytics"
582
+ )
583
+ ```
584
+
585
+ **This example demonstrates:**
586
+ - 📁 **File uploads** to OneLake Files section
587
+ - 🔄 **Pipeline orchestration** with SQL and Python tasks
588
+ - ⚡ **Fast data exploration** with DuckDB
589
+ - 💾 **Delta table creation** with Spark-style API
590
+ - 🔀 **Schema evolution** and partitioning
591
+ - 📤 **File downloads** from OneLake Files
592
+ - 📊 **Semantic model deployment** with DirectLake
593
+
594
+ ## Schema Evolution & Partitioning Guide
595
+
596
+ ### When to Use Schema Evolution
597
+
598
+ Use `mergeSchema: 'true'` when:
599
+ - Adding new columns to existing tables
600
+ - Source data schema changes over time
601
+ - Working with evolving data pipelines
602
+ - You need backward compatibility
603
+
604
+ ### When to Use Partitioning
605
+
606
+ Use `partitionBy` when:
607
+ - Queries frequently filter by specific columns (dates, regions, categories)
608
+ - Tables are large and need performance optimization
609
+ - You want to organize data logically for maintenance
610
+
611
+ ### Best Practices
612
+
613
+ ```python
614
+ # ✅ Good: Partition by commonly filtered columns
615
+ .partitionBy("year", "region") # Often filtered: WHERE year = 2024 AND region = 'US'
616
+
617
+ # ❌ Avoid: High cardinality partitions
618
+ .partitionBy("customer_id") # Creates too many small partitions
619
+
620
+ # ✅ Good: Schema evolution for append operations
621
+ .mode("append").option("mergeSchema", "true")
622
+
623
+ # ✅ Good: Combined approach for data lakes
624
+ pipeline = [
625
+ ('daily_sales', 'append',
626
+ {'batch_date': '2024-10-07'},
627
+ {'mergeSchema': 'true', 'partitionBy': ['year', 'month', 'region']})
628
+ ]
629
+ ```
630
+
631
+ ### Task Format Reference
632
+
633
+ ```python
634
+ # 2-tuple: Simple SQL/Python
635
+ ('task_name', 'mode') # SQL: no params, no Delta options
636
+ ('function_name', (arg1, arg2))   # Python: function called with these arguments
637
+
638
+ # 3-tuple: SQL with parameters
639
+ ('task_name', 'mode', {'param': 'value'})
640
+
641
+ # 4-tuple: SQL with parameters AND Delta options
642
+ ('task_name', 'mode', {'param': 'value'}, {'mergeSchema': 'true', 'partitionBy': ['col']})
643
+
644
+ # 4-tuple: Empty parameters but Delta options
645
+ ('task_name', 'mode', {}, {'mergeSchema': 'true'})
646
+ ```
647
+
648
+ ## How It Works
649
+
650
+ 1. **Connection**: Duckrun connects to your Fabric lakehouse using OneLake and Azure authentication
651
+ 2. **Table Discovery**: Automatically scans for Delta tables in your schema (or all schemas) and creates DuckDB views
652
+ 3. **Query Execution**: Run SQL queries directly against Delta tables using DuckDB's speed
653
+ 4. **Write Operations**: Results are written back as Delta tables with automatic optimization
654
+ 5. **Pipelines**: Orchestrate complex workflows with reusable SQL and Python tasks
655
+
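+ As a quick recap, the five steps map onto the API roughly like this (table and task names are placeholders):
+
+ ```python
+ import duckrun
+
+ # 1. Connection + 2. Table discovery happen at connect time
+ con = duckrun.connect("My Workspace/My_Lakehouse.lakehouse/dbo", sql_folder="./sql")
+
+ # 3. Query the attached Delta tables with DuckDB
+ con.sql("SELECT COUNT(*) AS n FROM some_table").show()
+
+ # 4. Write results back as an optimized Delta table
+ con.sql("SELECT * FROM some_table LIMIT 1000").write.mode("overwrite").saveAsTable("some_sample")
+
+ # 5. Or run the same work as an orchestrated pipeline
+ con.run([('clean_data', 'overwrite'), ('aggregate', 'append')])
+ ```
+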
656
+ ## Real-World Example
657
+
658
+ For a complete production example, see [fabric_demo](https://github.com/djouallah/fabric_demo).
659
+
660
+ ## License
661
+
662
+ MIT
@@ -7,8 +7,8 @@ duckrun/runner.py,sha256=yrDxfy1RVkb8iK9GKGmIFZHzCvcO_0GVQlbng7Vw_iM,14171
7
7
  duckrun/semantic_model.py,sha256=obzlN2-dbEW3JmDop-vrZGGGLi9u3ThhTbgtDjou7uY,29509
8
8
  duckrun/stats.py,sha256=oKIjZ7u5cFVT63FuOl5UqoDsOG3098woSCn-uI6i_sQ,11084
9
9
  duckrun/writer.py,sha256=svUuPCYOhrz299NgnpTKhARKjfej0PxnoND2iPDSypk,8098
10
- duckrun-0.2.11.dist-info/licenses/LICENSE,sha256=-DeQQwdbCbkB4507ZF3QbocysB-EIjDtaLexvqRkGZc,1083
11
- duckrun-0.2.11.dist-info/METADATA,sha256=gmMgCIUivM7CCtENbLv9RBPpkU-I6bpoAaZ7EkX07PM,39613
12
- duckrun-0.2.11.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
13
- duckrun-0.2.11.dist-info/top_level.txt,sha256=BknMEwebbUHrVAp3SC92ps8MPhK7XSYsaogTvi_DmEU,8
14
- duckrun-0.2.11.dist-info/RECORD,,
10
+ duckrun-0.2.12.dist-info/licenses/LICENSE,sha256=-DeQQwdbCbkB4507ZF3QbocysB-EIjDtaLexvqRkGZc,1083
11
+ duckrun-0.2.12.dist-info/METADATA,sha256=MPsLnsgyPshKTIXKiO_MiAXbqPVxQ7dBVwzDggm56bQ,20766
12
+ duckrun-0.2.12.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
13
+ duckrun-0.2.12.dist-info/top_level.txt,sha256=BknMEwebbUHrVAp3SC92ps8MPhK7XSYsaogTvi_DmEU,8
14
+ duckrun-0.2.12.dist-info/RECORD,,