duckrun 0.2.11-py3-none-any.whl → 0.2.12-py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release: this version of duckrun might be problematic.
- duckrun-0.2.12.dist-info/METADATA +662 -0
- {duckrun-0.2.11.dist-info → duckrun-0.2.12.dist-info}/RECORD +5 -5
- duckrun-0.2.11.dist-info/METADATA +0 -1367
- {duckrun-0.2.11.dist-info → duckrun-0.2.12.dist-info}/WHEEL +0 -0
- {duckrun-0.2.11.dist-info → duckrun-0.2.12.dist-info}/licenses/LICENSE +0 -0
- {duckrun-0.2.11.dist-info → duckrun-0.2.12.dist-info}/top_level.txt +0 -0
@@ -1,1367 +0,0 @@
Metadata-Version: 2.4
Name: duckrun
Version: 0.2.11
Summary: Lakehouse task runner powered by DuckDB for Microsoft Fabric
Author: mim
License: MIT
Project-URL: Homepage, https://github.com/djouallah/duckrun
Project-URL: Repository, https://github.com/djouallah/duckrun
Project-URL: Issues, https://github.com/djouallah/duckrun/issues
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: duckdb>=1.2.2
Requires-Dist: deltalake<=0.18.2
Requires-Dist: requests>=2.28.0
Requires-Dist: obstore>=0.2.0
Provides-Extra: local
Requires-Dist: azure-identity>=1.12.0; extra == "local"
Dynamic: license-file
<img src="https://raw.githubusercontent.com/djouallah/duckrun/main/duckrun.png" width="400" alt="Duckrun">

A helper package for working with Microsoft Fabric lakehouses - orchestration, SQL queries, and file management powered by DuckDB. Just the things that made daily work in Fabric Python notebooks easier - nothing fancy.

## Important Notes

**Requirements:**
- The lakehouse must have a schema (e.g., `dbo`, `sales`, `analytics`)
- **Workspace names with spaces are fully supported!** ✅

**Delta Lake Version:** This package uses an older version of deltalake to keep row group size control, which is crucial for Power BI performance optimization. The newer Rust-based deltalake versions don't yet support the row group size parameters that are essential for optimal DirectLake performance.
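The row group size knobs mentioned above are exposed by the pinned PyArrow-based writer in `deltalake<=0.18.2`. A minimal sketch of writing a Delta table with explicit row group sizing - the local path and the size values are illustrative assumptions, not duckrun defaults:

```python
import pandas as pd
from deltalake import write_deltalake

df = pd.DataFrame({"region": ["US", "EU"], "amount": [100, 200]})

# deltalake <= 0.18.2 (PyArrow writer) accepts explicit row group sizing,
# which keeps row groups large enough for good DirectLake scan performance.
write_deltalake(
    "/tmp/sales_delta",            # illustrative local path
    df,
    mode="overwrite",
    min_rows_per_group=1_000_000,  # assumed value for illustration
    max_rows_per_group=8_000_000,  # assumed value for illustration
)
```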
## What It Does

It does orchestration, arbitrary SQL statements, and file manipulation - just the things that come up in a daily workflow with Fabric notebooks.

## Installation

```bash
pip install duckrun
```

For local usage (requires Azure CLI or interactive browser authentication). Note: when running locally, your internet speed will be the main bottleneck.

```bash
pip install duckrun[local]
```

## Quick Start

### Basic Usage

```python
import duckrun

# Connect to a lakehouse and query data
con = duckrun.connect("My Workspace/data.lakehouse/dbo")
con.sql("SELECT * FROM my_table LIMIT 10").show()

# Write query results to a new table
con.sql("SELECT * FROM source WHERE year = 2024") \
    .write.mode("overwrite").saveAsTable("filtered_data")

# Upload/download files
con.copy("./local_data", "remote_folder")   # Upload
con.download("remote_folder", "./local")    # Download
```

That's it! No `sql_folder` needed for data exploration.

### Full Feature Overview

```python
import duckrun

# 1. Workspace management (list and create lakehouses)
ws = duckrun.connect("My Workspace")
lakehouses = ws.list_lakehouses()            # Returns list of lakehouse names
ws.create_lakehouse_if_not_exists("data")    # Create if needed

# 2. Connect to a lakehouse with a specific schema
con = duckrun.connect("My Workspace/data.lakehouse/dbo")

# Workspace names with spaces are supported!
con = duckrun.connect("Data Analytics/SalesData.lakehouse/analytics")

# Schema defaults to 'dbo' if not specified (scans all schemas)
# ⚠️ WARNING: Scanning all schemas can be slow for large lakehouses!
con = duckrun.connect("My Workspace/My_Lakehouse.lakehouse")

# 3. Explore data
con.sql("SELECT * FROM my_table LIMIT 10").show()

# 4. Write to Delta tables (Spark-style API)
con.sql("SELECT * FROM source").write.mode("overwrite").saveAsTable("target")

# 5. Upload/download files to/from OneLake Files
con.copy("./local_folder", "target_folder")      # Upload files
con.download("target_folder", "./downloaded")    # Download files
```

## Connection Format

```python
# Workspace management (list and create lakehouses)
ws = duckrun.connect("My Workspace")

# Lakehouse connection with schema (recommended for best performance)
con = duckrun.connect("My Workspace/My Lakehouse.lakehouse/dbo")

# Supports workspace names with spaces!
con = duckrun.connect("Data Analytics/Sales Data.lakehouse/analytics")

# Without schema (defaults to 'dbo', scans all schemas)
# ⚠️ This can be slow for large lakehouses!
con = duckrun.connect("My Workspace/My Lakehouse.lakehouse")

# With SQL folder for pipeline orchestration
con = duckrun.connect("My Workspace/My Lakehouse.lakehouse/dbo", sql_folder="./sql")
```

### Multi-Schema Support

When you don't specify a schema, duckrun will:
- **Default to `dbo`** for write operations
- **Scan all schemas** to discover and attach all Delta tables
- **Prefix table names** with the schema to avoid conflicts (e.g., `dbo_customers`, `bronze_raw_data`)

**Performance Note:** Scanning all schemas requires listing all files in the lakehouse, which can be slow for large lakehouses with many tables. For better performance, always specify a schema when possible.

```python
# Fast: scans only the 'dbo' schema
con = duckrun.connect("workspace/lakehouse.lakehouse/dbo")

# Slower: scans all schemas
con = duckrun.connect("workspace/lakehouse.lakehouse")

# Query tables from different schemas (when scanning all)
con.sql("SELECT * FROM dbo_customers").show()
con.sql("SELECT * FROM bronze_raw_data").show()
```

## Core Functions

### Connection

#### `connect(connection_string, sql_folder=None, compaction_threshold=100)`

Connect to a workspace or lakehouse.

**Parameters:**
- `connection_string` (str): Connection path
  - Workspace only: `"My Workspace"`
  - Lakehouse with schema: `"My Workspace/lakehouse.lakehouse/dbo"`
  - Lakehouse without schema: `"My Workspace/lakehouse.lakehouse"` (scans all schemas)
- `sql_folder` (str, optional): Path to SQL/Python files for pipelines
- `compaction_threshold` (int): File count before auto-compaction (default: 100)

**Returns:** `Duckrun` instance or `WorkspaceConnection` instance

**Notes:**
- Workspace names with spaces are fully supported ✅
- Specifying a schema improves connection speed
- Without a schema, all schemas are scanned (slower for large lakehouses)

### Query & Write

#### `sql(query)`

Execute a SQL query with a Spark-style write API.

**Parameters:**
- `query` (str): SQL query to execute

**Returns:** `QueryResult` object with methods:
- `.show(max_width=None)` - Display results in the console
- `.df()` - Get a pandas DataFrame
- `.write` - Access the write API (see below)

```python
# Show results
con.sql("SELECT * FROM sales LIMIT 10").show()

# Get DataFrame
df = con.sql("SELECT COUNT(*) FROM orders").df()

# Write to table
con.sql("SELECT * FROM source").write.mode("overwrite").saveAsTable("target")
```

#### Write API

**Methods:**
- `.mode(mode)` - Set write mode: `"overwrite"`, `"append"`, or `"ignore"`
- `.option(key, value)` - Set a Delta Lake option
- `.partitionBy(*cols)` - Partition by columns
- `.saveAsTable(table_name)` - Write to a Delta table

```python
# Simple write
con.sql("SELECT * FROM data").write.mode("overwrite").saveAsTable("target")

# Append mode
con.sql("SELECT * FROM new_orders").write.mode("append").saveAsTable("orders")

# With schema evolution
con.sql("SELECT * FROM source") \
    .write.mode("append") \
    .option("mergeSchema", "true") \
    .saveAsTable("evolving_table")

# With partitioning
con.sql("SELECT * FROM sales") \
    .write.mode("overwrite") \
    .partitionBy("region", "year") \
    .saveAsTable("partitioned_sales")

# Combined
con.sql("SELECT * FROM data") \
    .write.mode("append") \
    .option("mergeSchema", "true") \
    .partitionBy("date", "category") \
    .saveAsTable("final_table")
```

**Note:** `.format("delta")` is optional - Delta is the default format!

### File Operations (OneLake Files)

Upload and download files to/from the OneLake Files section (not Delta tables).

#### `copy(local_folder, remote_folder, file_extensions=None, overwrite=False)`

Upload files from a local folder to the OneLake Files section.

**Parameters:**
- `local_folder` (str): Path to the local folder containing files to upload
- `remote_folder` (str): **Required** target folder path in OneLake Files
- `file_extensions` (list, optional): Filter by file extensions (e.g., `['.csv', '.parquet']`)
- `overwrite` (bool, optional): Whether to overwrite existing files (default: `False`)

**Returns:** `True` if all files uploaded successfully, `False` otherwise

```python
# Upload all files to a target folder
con.copy("./data", "processed_data")

# Upload only specific file types
con.copy("./reports", "monthly_reports", ['.csv', '.parquet'])

# Upload with overwrite enabled
con.copy("./backup", "daily_backup", overwrite=True)
```

#### `download(remote_folder="", local_folder="./downloaded_files", file_extensions=None, overwrite=False)`

Download files from the OneLake Files section to a local folder.

**Parameters:**
- `remote_folder` (str, optional): Source folder path in OneLake Files (default: root)
- `local_folder` (str, optional): Local destination folder (default: `"./downloaded_files"`)
- `file_extensions` (list, optional): Filter by file extensions (e.g., `['.csv', '.json']`)
- `overwrite` (bool, optional): Whether to overwrite existing local files (default: `False`)

**Returns:** `True` if all files downloaded successfully, `False` otherwise

```python
# Download all files from the OneLake Files root
con.download()

# Download from a specific folder
con.download("processed_data", "./local_data")

# Download only CSV files from a specific folder
con.download("daily_reports", "./reports", ['.csv'])
```

**Key Features:**
- ✅ **Files go to the OneLake Files section** (not Delta Tables)
- ✅ **`remote_folder` is required** for uploads (prevents accidental uploads)
- ✅ **`overwrite=False` by default** (prevents accidental overwrites)
- ✅ **File extension filtering** (e.g., only `.csv` or `.parquet` files)
- ✅ **Preserves folder structure** during upload/download
- ✅ **Progress reporting** with file names, sizes, and upload/download status

### Pipeline Orchestration

For production workflows with reusable SQL and Python tasks.

#### `run(pipeline)`

Execute a pipeline of SQL and Python tasks.

**Parameters:**
- `pipeline` (list): List of task tuples

**Returns:** `True` if all tasks succeeded, `False` if any failed

```python
con = duckrun.connect(
    "my_workspace/my_lakehouse.lakehouse/dbo",
    sql_folder="./sql"  # folder with .sql and .py files
)

# Define pipeline
pipeline = [
    ('download_data', (url, path)),   # Python task
    ('clean_data', 'overwrite'),      # SQL task
    ('aggregate', 'append')           # SQL task
]

# Run it
con.run(pipeline)
```

#### Python Tasks

**Format:** `('function_name', (arg1, arg2, ...))`

Create `sql_folder/function_name.py`:

```python
# sql_folder/download_data.py
def download_data(url, path):
    # your code here
    return 1  # 1 = success, 0 = failure
```

- The function name must match the filename
- Return `1` for success, `0` for failure
- Python tasks can use workspace/lakehouse IDs as parameters

#### SQL Tasks

**Formats:**
- `('table_name', 'mode')` - Simple SQL with no parameters
- `('table_name', 'mode', {params})` - SQL with template parameters
- `('table_name', 'mode', {params}, {delta_options})` - SQL with Delta Lake options

Create `sql_folder/table_name.sql`:

```sql
-- sql_folder/clean_data.sql
SELECT
    id,
    TRIM(name) as name,
    date
FROM raw_data
WHERE date >= '2024-01-01'
```

**Write Modes:**
- `overwrite` - Replace the table completely
- `append` - Add to the existing table
- `ignore` - Create only if the table doesn't exist

#### Parameterized SQL

Built-in parameters (always available):
- `$ws` - workspace name
- `$lh` - lakehouse name
- `$schema` - schema name
- `$storage_account` - storage account name
- `$tables_url` - base URL for the Tables folder
- `$files_url` - base URL for the Files folder

Custom parameters:

```python
pipeline = [
    ('sales', 'append', {'start_date': '2024-01-01', 'end_date': '2024-12-31'})
]
```

```sql
-- sql_folder/sales.sql
SELECT * FROM transactions
WHERE date BETWEEN '$start_date' AND '$end_date'
```
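The `$name` placeholders follow Python's `string.Template` syntax, so parameter substitution can be pictured as a simple template expansion before the query runs. A minimal sketch of that idea - not duckrun's actual implementation; the file path and parameter values are illustrative:

```python
from pathlib import Path
from string import Template

def render_sql(sql_path: str, params: dict) -> str:
    """Expand $placeholders in a SQL file with built-in and custom parameters."""
    built_ins = {"ws": "My Workspace", "lh": "Sales", "schema": "dbo"}  # assumed values
    text = Path(sql_path).read_text()
    # safe_substitute leaves unknown $tokens untouched instead of raising
    return Template(text).safe_substitute({**built_ins, **params})

print(render_sql("sql/sales.sql", {"start_date": "2024-01-01", "end_date": "2024-12-31"}))
```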
#### Delta Lake Options (Schema Evolution & Partitioning)

Use the 4-tuple format for advanced Delta Lake features:

```python
pipeline = [
    # SQL with empty params but Delta options
    ('evolving_table', 'append', {}, {'mergeSchema': 'true'}),

    # SQL with both params AND Delta options
    ('sales_data', 'append',
     {'region': 'North America'},
     {'mergeSchema': 'true', 'partitionBy': ['region', 'year']}),

    # Partitioning without schema merging
    ('time_series', 'overwrite',
     {'start_date': '2024-01-01'},
     {'partitionBy': ['year', 'month']})
]
```

**Available Delta Options:**
- `mergeSchema: 'true'` - Automatically handle schema evolution (new columns)
- `partitionBy: ['col1', 'col2']` - Partition data by the specified columns

#### Task Format Reference

```python
# 2-tuple: Simple SQL/Python
('task_name', 'mode')                    # SQL: no params, no Delta options
('function_name', (args))                # Python: function with arguments

# 3-tuple: SQL with parameters
('task_name', 'mode', {'param': 'value'})

# 4-tuple: SQL with parameters AND Delta options
('task_name', 'mode', {'param': 'value'}, {'mergeSchema': 'true', 'partitionBy': ['col']})

# 4-tuple: Empty parameters but Delta options
('task_name', 'mode', {}, {'mergeSchema': 'true'})
```

#### Early Exit on Failure

**Pipelines automatically stop when any task fails** - subsequent tasks won't run.

For **SQL tasks**, failure is automatic:
- If the query has a syntax or runtime error, the task fails
- The pipeline stops immediately and the remaining tasks are skipped

For **Python tasks**, you control success/failure by returning:
- `1` = Success → the pipeline continues to the next task
- `0` = Failure → the pipeline stops, remaining tasks are skipped

```python
# sql_folder/download_data.py
def download_data(url, path):
    try:
        response = requests.get(url)
        response.raise_for_status()
        # save data...
        return 1  # Success - pipeline continues
    except Exception as e:
        print(f"Download failed: {e}")
        return 0  # Failure - pipeline stops here
```

```python
pipeline = [
    ('download_data', (url, path)),  # If returns 0, stops here
    ('clean_data', 'overwrite'),     # Won't run if download failed
    ('aggregate', 'append')          # Won't run if download failed
]

success = con.run(pipeline)  # Returns True only if ALL tasks succeed
```

This prevents downstream tasks from processing incomplete or corrupted data.
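Conceptually, the early-exit behaviour is just a loop that stops at the first falsy result. A simplified, illustrative sketch of such a runner (not the actual `run()` implementation):

```python
def run_pipeline(tasks, execute_task) -> bool:
    """Run tasks in order; stop at the first failure and skip the rest."""
    for name, *spec in tasks:
        ok = execute_task(name, *spec)   # SQL errors raise; Python tasks return 1/0
        if not ok:
            print(f"Task '{name}' failed - skipping remaining tasks")
            return False
    return True

# Demo with a fake executor in which 'clean_data' fails
demo = [
    ('download_data', ('https://example.com', './raw')),
    ('clean_data', 'overwrite'),
    ('aggregate', 'append'),
]
print(run_pipeline(demo, lambda name, *_: 0 if name == 'clean_data' else 1))  # False
```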
### Workspace Management

#### `list_lakehouses()`

List all lakehouses in the workspace.

**Returns:** List of lakehouse names (strings)

```python
ws = duckrun.connect("My Workspace")
lakehouses = ws.list_lakehouses()
print(lakehouses)  # ['lakehouse1', 'lakehouse2', ...]
```

#### `create_lakehouse_if_not_exists(lakehouse_name)`

Create a lakehouse if it doesn't already exist.

**Parameters:**
- `lakehouse_name` (str): Name of the lakehouse to create

**Returns:** `True` if the lakehouse exists or was created, `False` on error

```python
ws = duckrun.connect("My Workspace")
success = ws.create_lakehouse_if_not_exists("new_lakehouse")
```

### SQL Lookup Functions

Duckrun automatically registers helper functions that resolve workspace and lakehouse names from GUIDs directly in SQL queries. These are especially useful when working with storage logs or audit data that contains workspace/lakehouse IDs.

**Function Reference:**
- `get_workspace_name(workspace_id)` - Convert a workspace GUID to its display name
- `get_lakehouse_name(workspace_id, lakehouse_id)` - Convert a lakehouse GUID to its display name
- `get_workspace_id_from_name(workspace_name)` - Convert a workspace name to its GUID
- `get_lakehouse_id_from_name(workspace_id, lakehouse_name)` - Convert a lakehouse name to its GUID

**Features:**
- ✅ **Automatic Caching**: Results are cached to avoid repeated API calls
- ✅ **NULL on Error**: Returns `NULL` instead of raising errors for missing or inaccessible items
- ✅ **Fabric API Integration**: Resolves names using the Microsoft Fabric REST API
- ✅ **Always Available**: Functions are automatically registered on connection

```python
con = duckrun.connect("Analytics/Monitoring.lakehouse/dbo")

# ID → Name lookups (most common use case):
# enrich OneLake storage logs with friendly names
result = con.sql("""
    SELECT
        workspace_id,
        get_workspace_name(workspace_id) as workspace_name,
        lakehouse_id,
        get_lakehouse_name(workspace_id, lakehouse_id) as lakehouse_name,
        operation_name,
        COUNT(*) as operation_count,
        SUM(bytes_transferred) as total_bytes
    FROM onelake_storage_logs
    WHERE log_date = CURRENT_DATE
    GROUP BY ALL
    ORDER BY workspace_name, lakehouse_name
""").show()

# Name → ID lookups (reverse)
con.sql("""
    SELECT
        workspace_name,
        get_workspace_id_from_name(workspace_name) as workspace_id,
        lakehouse_name,
        get_lakehouse_id_from_name(workspace_id, lakehouse_name) as lakehouse_id
    FROM configuration_table
""").show()
```

This makes it easy to create human-readable reports from GUID-based log data!
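DuckDB lets Python callables be registered as scalar SQL functions, which is one way lookup functions like these can be exposed and cached. A standalone sketch of that mechanism using `create_function` and `functools.lru_cache` - the lookup itself is stubbed out here; this is not duckrun's code and it does not call the Fabric API:

```python
from functools import lru_cache
from typing import Optional

import duckdb

@lru_cache(maxsize=None)            # cache results so repeated GUIDs cost one lookup
def get_workspace_name(workspace_id: str) -> Optional[str]:
    fake_catalog = {"11111111-1111-1111-1111-111111111111": "Analytics"}  # stand-in for a REST call
    return fake_catalog.get(workspace_id)   # None surfaces as SQL NULL

con = duckdb.connect()
con.create_function(
    "get_workspace_name", get_workspace_name,
    [duckdb.typing.VARCHAR], duckdb.typing.VARCHAR,
)
print(con.sql("SELECT get_workspace_name('11111111-1111-1111-1111-111111111111') AS ws").fetchall())
```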
### Semantic Model Deployment

#### `deploy(bim_url, dataset_name=None, wait_seconds=5)`

Deploy a Power BI semantic model from a BIM file using DirectLake mode.

**Parameters:**
- `bim_url` (str): URL to a BIM file, a local path, or a `"workspace/model"` reference
- `dataset_name` (str, optional): Name for the semantic model (auto-generated as `lakehouse_schema` if not provided)
- `wait_seconds` (int): Wait time for permission propagation (default: 5)

**Returns:** `1` for success, `0` for failure

```python
con = duckrun.connect("Analytics/Sales.lakehouse/dbo")

# Deploy from a URL with an auto-generated name
con.deploy("https://raw.githubusercontent.com/user/repo/main/model.bim")

# Deploy with a custom name
con.deploy(
    "https://raw.githubusercontent.com/user/repo/main/sales_model.bim",
    dataset_name="Sales Analytics Model",
    wait_seconds=10  # Wait for permission propagation
)

# From workspace/model (copies a model from another workspace)
con.deploy("Source Workspace/Source Model", dataset_name="Sales Copy")
```

**Features:**
- 🚀 **DirectLake Mode**: Deploys semantic models with a DirectLake connection
- 🔄 **Automatic Configuration**: Auto-configures workspace, lakehouse, and schema connections
- 📦 **BIM from URL**: Load model definitions from GitHub or any accessible URL
- ⏱️ **Permission Handling**: Configurable wait time for permission propagation

**Use Cases:**
- Deploy semantic models as part of CI/CD pipelines
- Version control your semantic models in Git
- Automated model deployment across environments
- Streamlined DirectLake model creation
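A BIM file is just the semantic model's JSON definition, so it can be fetched and inspected before deploying. A small hedged sketch (the URL is a placeholder from the examples above, and the key layout assumes the standard TMSL `model.tables` structure):

```python
import json

import requests

bim_url = "https://raw.githubusercontent.com/user/repo/main/model.bim"  # placeholder URL
model = json.loads(requests.get(bim_url, timeout=30).text)

# List the tables the semantic model expects to find in the lakehouse
print([t["name"] for t in model.get("model", {}).get("tables", [])])
```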
### Utility Methods

#### `get_workspace_id()`

Get the workspace ID (GUID, or the name without spaces).

**Returns:** Workspace ID string

#### `get_lakehouse_id()`

Get the lakehouse ID (GUID or name).

**Returns:** Lakehouse ID string

#### `get_connection()`

Get the underlying DuckDB connection object.

**Returns:** DuckDB connection

#### `close()`

Close the DuckDB connection.

```python
con.close()
```

## Advanced Features

### Schema Evolution

Automatically handle schema changes (new columns) using `mergeSchema`:

```python
# Using the write API
con.sql("""
    SELECT
        customer_id,
        region,
        product_category,
        sales_amount,
        -- New column that might not exist in the target table
        discount_percentage
    FROM raw_sales
""").write \
    .mode("append") \
    .option("mergeSchema", "true") \
    .saveAsTable("evolving_table")

# Using the pipeline format
pipeline = [
    ('table', 'append', {}, {'mergeSchema': 'true'})
]
```

Use `mergeSchema: 'true'` when:
- Adding new columns to existing tables
- The source data schema changes over time
- Working with evolving data pipelines
- You need backward compatibility

### Partitioning

Optimize query performance by partitioning data:

```python
# Partition by a single column
con.sql("SELECT * FROM sales").write \
    .mode("overwrite") \
    .partitionBy("region") \
    .saveAsTable("partitioned_sales")

# Partition by multiple columns
con.sql("SELECT * FROM orders").write \
    .mode("overwrite") \
    .partitionBy("year", "month", "region") \
    .saveAsTable("time_partitioned")
```

Use `partitionBy` when:
- Queries frequently filter by specific columns (dates, regions, categories)
- Tables are large and need performance optimization
- You want to organize data logically for maintenance

**Best Practices:**
- ✅ Partition by columns frequently used in WHERE clauses
- ✅ Use low to medium cardinality columns (dates, regions, categories)
- ❌ Avoid high cardinality columns (customer_id, transaction_id)

```python
# ✅ Good: Partition by commonly filtered columns
.partitionBy("year", "region")  # Often filtered: WHERE year = 2024 AND region = 'US'

# ❌ Avoid: High cardinality partitions
.partitionBy("customer_id")  # Creates too many small partitions

# ✅ Good: Schema evolution for append operations
.mode("append").option("mergeSchema", "true")

# ✅ Good: Combined approach for data lakes
pipeline = [
    ('daily_sales', 'append',
     {'batch_date': '2024-10-07'},
     {'mergeSchema': 'true', 'partitionBy': ['year', 'month', 'region']})
]
```

**Benefits:**
- 🔄 **Schema Evolution**: Automatically handles new columns without breaking existing queries
- ⚡ **Query Performance**: Partitioning improves performance for filtered queries

### Table Name Variants

Use `__` to create multiple SQL files that write to the same table:

```python
pipeline = [
    ('sales__initial', 'overwrite'),     # writes to 'sales'
    ('sales__incremental', 'append'),    # appends to 'sales'
]
```

Both tasks write to the `sales` table but use different SQL files (`sales__initial.sql` and `sales__incremental.sql`).

### Remote SQL Files

Load SQL/Python tasks from GitHub or any URL:

```python
con = duckrun.connect(
    "Analytics/Sales.lakehouse/dbo",
    sql_folder="https://raw.githubusercontent.com/user/repo/main/sql"
)
```

### Auto-Compaction & Delta Lake Optimization

Duckrun automatically:
- Compacts small files when the file count exceeds the threshold (default: 100)
- Vacuums old versions on overwrite
- Cleans up metadata

Customize the compaction threshold:

```python
con = duckrun.connect(
    "workspace/lakehouse.lakehouse/dbo",
    compaction_threshold=50  # compact after 50 files
)
```
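Small-file compaction and vacuuming are standard Delta Lake maintenance operations; with the pinned `deltalake` package they look roughly like the sketch below. This is a hedged illustration against a local table path, not the exact calls duckrun makes:

```python
from deltalake import DeltaTable

dt = DeltaTable("/tmp/sales_delta")            # illustrative local table path

# Compact when the table has accumulated too many small files
if len(dt.files()) > 100:                      # mirrors the default compaction_threshold
    dt.optimize.compact()                      # rewrites small files into larger ones

# Remove files left behind by old versions
# (0 hours = aggressive cleanup, shown only for illustration)
dt.vacuum(retention_hours=0, enforce_retention_duration=False, dry_run=False)
```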
## Complete Example

```python
import duckrun

# 1. Connect with a SQL folder for pipelines (specify schema for best performance)
con = duckrun.connect("Analytics/Sales.lakehouse/dbo", sql_folder="./sql")

# 2. Upload raw data files to OneLake Files
con.copy("./raw_data", "staging", ['.csv', '.json'])

# 3. Run a data pipeline with mixed tasks
pipeline = [
    # Python: Download from API
    ('fetch_api_data', ('https://api.example.com/sales', 'raw')),

    # SQL: Clean and transform
    ('clean_sales', 'overwrite'),

    # SQL: Aggregate with parameters
    ('regional_summary', 'overwrite', {'min_amount': 1000}),

    # SQL: Append to history with schema evolution and partitioning
    ('sales_history', 'append', {}, {
        'mergeSchema': 'true',
        'partitionBy': ['year', 'region']
    })
]
success = con.run(pipeline)

# 4. Query and explore results
con.sql("""
    SELECT region, SUM(total) as grand_total
    FROM regional_summary
    GROUP BY region
""").show()

# 5. Create a derived table
con.sql("SELECT * FROM sales WHERE year = 2024").write \
    .mode("overwrite") \
    .partitionBy("month") \
    .saveAsTable("sales_2024")

# 6. Download processed reports for external systems
con.download("processed_reports", "./exports", ['.csv'])

# 7. Deploy a semantic model for Power BI
con.deploy(
    "https://raw.githubusercontent.com/user/repo/main/sales_model.bim",
    dataset_name="Sales Analytics"
)

# 8. Enrich logs with the lookup functions
logs = con.sql("""
    SELECT
        workspace_id,
        get_workspace_name(workspace_id) as workspace,
        lakehouse_id,
        get_lakehouse_name(workspace_id, lakehouse_id) as lakehouse,
        COUNT(*) as operations
    FROM audit_logs
    GROUP BY ALL
""").df()
print(logs)
```

**This example demonstrates:**
- 📁 **File uploads** to the OneLake Files section
- 🔄 **Pipeline orchestration** with SQL and Python tasks
- ⚡ **Fast data exploration** with DuckDB
- 💾 **Delta table creation** with the Spark-style API
- 🔀 **Schema evolution** and partitioning
- 📤 **File downloads** from OneLake Files
- 📊 **Semantic model deployment** with DirectLake

## How It Works

1. **Connection**: Duckrun connects to your Fabric lakehouse using OneLake and Azure authentication
2. **Table Discovery**: Automatically scans for Delta tables in your schema (or all schemas) and creates DuckDB views
3. **Query Execution**: Run SQL queries directly against Delta tables using DuckDB's speed
4. **Write Operations**: Results are written back as Delta tables with automatic optimization
5. **Pipelines**: Orchestrate complex workflows with reusable SQL and Python tasks
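Step 2 can be pictured with DuckDB's `delta` extension: each discovered Delta table folder becomes a view over `delta_scan`, so plain SQL works against it. A minimal, hedged sketch of that idea against local paths - duckrun's real discovery runs over OneLake with Azure credentials, and the paths and table names below are assumptions:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL delta")   # DuckDB's Delta Lake reader extension
con.execute("LOAD delta")

# Pretend these were discovered by listing <lakehouse>/Tables/<schema>/
discovered = {
    "sales": "/tmp/delta/sales",          # illustrative paths
    "customers": "/tmp/delta/customers",
}

for table_name, path in discovered.items():
    # Expose each Delta table as a plain DuckDB view so SQL can query it directly
    con.execute(f"CREATE OR REPLACE VIEW {table_name} AS SELECT * FROM delta_scan('{path}')")

con.sql("SELECT COUNT(*) FROM sales").show()
```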
## Requirements & Notes

**Requirements:**
- The lakehouse must have a schema (e.g., `dbo`, `sales`, `analytics`)
- Azure authentication (Azure CLI, browser, or the Fabric notebook environment)

**Important Notes:**
- ✅ Workspace names with spaces are fully supported
- ✅ Files are uploaded/downloaded to the OneLake **Files** section (not Delta Tables)
- ✅ Pipelines stop on the first failure (SQL errors or a Python task returning 0)
- ⚠️ Uses an older deltalake version for row group size control (Power BI optimization)
- ⚠️ Scanning all schemas can be slow for large lakehouses

**Authentication:**
- Fabric notebooks: automatic, using the notebook credentials
- Local/VS Code: Azure CLI or interactive browser authentication
- Custom: use the Azure Identity credential chain
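With the `[local]` extra installed, the credential chain described above can be assembled with `azure-identity`. A hedged sketch of acquiring a OneLake/storage token - the chain order and the `https://storage.azure.com/.default` scope are common choices, not necessarily duckrun's exact ones:

```python
from azure.identity import AzureCliCredential, ChainedTokenCredential, InteractiveBrowserCredential

# Try the Azure CLI first (fast, non-interactive), then fall back to a browser login
credential = ChainedTokenCredential(AzureCliCredential(), InteractiveBrowserCredential())

# OneLake is exposed through the Azure Storage audience
token = credential.get_token("https://storage.azure.com/.default")
print(token.expires_on)
```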
## Real-World Example

For a complete production example, see [fabric_demo](https://github.com/djouallah/fabric_demo).

## License

MIT