duckrun 0.1.9__tar.gz → 0.2.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {duckrun-0.1.9 → duckrun-0.2.0}/PKG-INFO +139 -5
- {duckrun-0.1.9 → duckrun-0.2.0}/README.md +138 -4
- {duckrun-0.1.9 → duckrun-0.2.0}/duckrun/core.py +126 -10
- {duckrun-0.1.9 → duckrun-0.2.0}/duckrun.egg-info/PKG-INFO +139 -5
- {duckrun-0.1.9 → duckrun-0.2.0}/pyproject.toml +1 -1
- {duckrun-0.1.9 → duckrun-0.2.0}/LICENSE +0 -0
- {duckrun-0.1.9 → duckrun-0.2.0}/duckrun/__init__.py +0 -0
- {duckrun-0.1.9 → duckrun-0.2.0}/duckrun.egg-info/SOURCES.txt +0 -0
- {duckrun-0.1.9 → duckrun-0.2.0}/duckrun.egg-info/dependency_links.txt +0 -0
- {duckrun-0.1.9 → duckrun-0.2.0}/duckrun.egg-info/requires.txt +0 -0
- {duckrun-0.1.9 → duckrun-0.2.0}/duckrun.egg-info/top_level.txt +0 -0
- {duckrun-0.1.9 → duckrun-0.2.0}/setup.cfg +0 -0
{duckrun-0.1.9 → duckrun-0.2.0}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: duckrun
-Version: 0.1.9
+Version: 0.2.0
 Summary: Lakehouse task runner powered by DuckDB for Microsoft Fabric
 Author: mim
 License: MIT
@@ -28,6 +28,8 @@ A helper package for stuff that made my life easier when working with Fabric Pyt
 - Lakehouse must have a schema (e.g., `dbo`, `sales`, `analytics`)
 - Workspace and lakehouse names cannot contain spaces
 
+**Delta Lake Version:** This package uses an older version of deltalake to maintain row size control capabilities, which is crucial for Power BI performance optimization. The newer Rust-based deltalake versions don't yet support the row group size parameters that are essential for optimal DirectLake performance.
+
 **Why no spaces?** Duckrun uses simple name-based paths instead of GUIDs. This keeps the code clean and readable, which is perfect for data engineering workspaces where naming conventions are already well-established. Just use underscores or hyphens instead: `my_workspace` or `my-lakehouse`.
 
 ## What It Does
@@ -131,6 +133,22 @@ con.sql("""
 
 # Append mode
 con.sql("SELECT * FROM new_orders").write.mode("append").saveAsTable("orders")
+
+# Schema evolution and partitioning (exact Spark API compatibility)
+con.sql("""
+    SELECT
+        customer_id,
+        order_date,
+        region,
+        product_category,
+        sales_amount,
+        new_column_added_later  -- This column might not exist in target table
+    FROM source_table
+""").write \
+    .mode("append") \
+    .option("mergeSchema", "true") \
+    .partitionBy("region", "product_category") \
+    .saveAsTable("sales_partitioned")
 ```
 
 **Note:** `.format("delta")` is optional - Delta is the default format!
@@ -204,7 +222,10 @@ def download_data(url, path):
 
 ### SQL Tasks
 
-**
+**Formats:**
+- `('table_name', 'mode')` - Simple SQL with no parameters
+- `('table_name', 'mode', {params})` - SQL with template parameters
+- `('table_name', 'mode', {params}, {delta_options})` - SQL with Delta Lake options
 
 Create `sql_folder/table_name.sql`:
 
@@ -244,8 +265,66 @@ SELECT * FROM transactions
 WHERE date BETWEEN '$start_date' AND '$end_date'
 ```
 
+### Delta Lake Options (Schema Evolution & Partitioning)
+
+Use the 4-tuple format for advanced Delta Lake features:
+
+```python
+pipeline = [
+    # SQL with empty params but Delta options
+    ('evolving_table', 'append', {}, {'mergeSchema': 'true'}),
+
+    # SQL with both params AND Delta options
+    ('sales_data', 'append',
+     {'region': 'North America'},
+     {'mergeSchema': 'true', 'partitionBy': ['region', 'year']}),
+
+    # Partitioning without schema merging
+    ('time_series', 'overwrite',
+     {'start_date': '2024-01-01'},
+     {'partitionBy': ['year', 'month']})
+]
+```
+
+**Available Delta Options:**
+- `mergeSchema: 'true'` - Automatically handle schema evolution (new columns)
+- `partitionBy: ['col1', 'col2']` - Partition data by specified columns
+
 ## Advanced Features
 
+### Schema Evolution & Partitioning
+
+Handle evolving schemas and optimize query performance with partitioning:
+
+```python
+# Using Spark-style API
+con.sql("""
+    SELECT
+        customer_id,
+        region,
+        product_category,
+        sales_amount,
+        -- New column that might not exist in target table
+        discount_percentage
+    FROM raw_sales
+""").write \
+    .mode("append") \
+    .option("mergeSchema", "true") \
+    .partitionBy("region", "product_category") \
+    .saveAsTable("sales_partitioned")
+
+# Using pipeline format
+pipeline = [
+    ('sales_summary', 'append',
+     {'batch_date': '2024-10-07'},
+     {'mergeSchema': 'true', 'partitionBy': ['region', 'year']})
+]
+```
+
+**Benefits:**
+- 🔄 **Schema Evolution**: Automatically handles new columns without breaking existing queries
+- ⚡ **Query Performance**: Partitioning improves performance for filtered queries
+
 ### Table Name Variants
 
 Use `__` to create multiple versions of the same table:
@@ -404,8 +483,8 @@ pipeline = [
     # Aggregate by region (SQL with params)
     ('regional_summary', 'overwrite', {'min_amount': 1000}),
 
-    # Append to history (SQL)
-    ('sales_history', 'append')
+    # Append to history with schema evolution (SQL with Delta options)
+    ('sales_history', 'append', {}, {'mergeSchema': 'true', 'partitionBy': ['year', 'region']})
 ]
 
 # Run pipeline
@@ -430,7 +509,62 @@ con.download("processed_reports", "./exports", ['.csv'])
 - 🔄 **Pipeline orchestration** with SQL and Python tasks
 - ⚡ **Fast data exploration** with DuckDB
 - 💾 **Delta table creation** with Spark-style API
-- 📤 **File downloads** from OneLake Files
+- 🔄 **Schema evolution** and partitioning
+- 📤 **File downloads** from OneLake Files
+
+## Schema Evolution & Partitioning Guide
+
+### When to Use Schema Evolution
+
+Use `mergeSchema: 'true'` when:
+- Adding new columns to existing tables
+- Source data schema changes over time
+- Working with evolving data pipelines
+- Need backward compatibility
+
+### When to Use Partitioning
+
+Use `partitionBy` when:
+- Queries frequently filter by specific columns (dates, regions, categories)
+- Tables are large and need performance optimization
+- Want to organize data logically for maintenance
+
+### Best Practices
+
+```python
+# ✅ Good: Partition by commonly filtered columns
+.partitionBy("year", "region")  # Often filtered: WHERE year = 2024 AND region = 'US'
+
+# ❌ Avoid: High cardinality partitions
+.partitionBy("customer_id")  # Creates too many small partitions
+
+# ✅ Good: Schema evolution for append operations
+.mode("append").option("mergeSchema", "true")
+
+# ✅ Good: Combined approach for data lakes
+pipeline = [
+    ('daily_sales', 'append',
+     {'batch_date': '2024-10-07'},
+     {'mergeSchema': 'true', 'partitionBy': ['year', 'month', 'region']})
+]
+```
+
+### Task Format Reference
+
+```python
+# 2-tuple: Simple SQL/Python
+('task_name', 'mode')        # SQL: no params, no Delta options
+('function_name', (args))    # Python: function with arguments
+
+# 3-tuple: SQL with parameters
+('task_name', 'mode', {'param': 'value'})
+
+# 4-tuple: SQL with parameters AND Delta options
+('task_name', 'mode', {'param': 'value'}, {'mergeSchema': 'true', 'partitionBy': ['col']})
+
+# 4-tuple: Empty parameters but Delta options
+('task_name', 'mode', {}, {'mergeSchema': 'true'})
+```
 
 ## How It Works
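For context on the row-group point above: in the older deltalake releases this package pins, row-group sizing is controlled through the `max_rows_per_file`, `max_rows_per_group`, and `min_rows_per_group` arguments of `write_deltalake`, while schema merging requires the Rust engine, which ignores them. A minimal sketch of the two call shapes follows; the local path and the tiny pyarrow table are placeholders for illustration only (duckrun itself targets an abfss:// OneLake URI), and the 8M row-group target mirrors the `RG` constant added in `duckrun/core.py` later in this diff.

```python
import pyarrow as pa
from deltalake import write_deltalake

RG = 8_000_000  # row-group target duckrun uses (the RG constant in duckrun/core.py)

data = pa.table({"region": ["US", "EU"], "sales_amount": [100.0, 80.0]})

# Default path: pyarrow engine with explicit row-group sizing (DirectLake-friendly files)
write_deltalake(
    "/tmp/sales_delta",   # placeholder local path; duckrun writes to an abfss:// OneLake URI
    data,
    mode="append",
    max_rows_per_file=RG,
    max_rows_per_group=RG,
    min_rows_per_group=RG,
)

# Schema-evolution path: the Rust engine accepts schema_mode="merge",
# but not the row-group size parameters above
data_evolved = data.append_column("discount", pa.array([0.1, 0.2]))
write_deltalake("/tmp/sales_delta", data_evolved, mode="append",
                schema_mode="merge", engine="rust")
```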
{duckrun-0.1.9 → duckrun-0.2.0}/README.md: the hunks are the same README content shown in PKG-INFO above (README.md is embedded in the package metadata), with line numbers offset by the 20 metadata header lines (@@ -8,6 +8,8 @@, @@ -111,6 +113,22 @@, @@ -184,7 +202,10 @@, @@ -224,8 +245,66 @@, @@ -384,8 +463,8 @@, @@ -410,7 +489,62 @@).
{duckrun-0.1.9 → duckrun-0.2.0}/duckrun/core.py

@@ -12,6 +12,36 @@ from obstore.store import AzureStore
 RG = 8_000_000
 
 
+def _build_write_deltalake_args(path, df, mode, schema_mode=None, partition_by=None):
+    """
+    Build arguments for write_deltalake based on requirements:
+    - If schema_mode='merge': use rust engine (no row group params)
+    - Otherwise: use pyarrow engine with row group optimization
+    """
+    args = {
+        'table_or_uri': path,
+        'data': df,
+        'mode': mode
+    }
+
+    # Add partition_by if specified
+    if partition_by:
+        args['partition_by'] = partition_by
+
+    # Engine selection based on schema_mode
+    if schema_mode == 'merge':
+        # Use rust engine for schema merging (no row group params supported)
+        args['schema_mode'] = 'merge'
+        args['engine'] = 'rust'
+    else:
+        # Use pyarrow engine with row group optimization (default)
+        args['max_rows_per_file'] = RG
+        args['max_rows_per_group'] = RG
+        args['min_rows_per_group'] = RG
+
+    return args
+
+
 class DeltaWriter:
     """Spark-style write API for Delta Lake"""
 
@@ -20,6 +50,8 @@ class DeltaWriter:
         self.duckrun = duckrun_instance
         self._format = "delta"
         self._mode = "overwrite"
+        self._schema_mode = None
+        self._partition_by = None
 
     def format(self, format_type: str):
         """Set output format (only 'delta' supported)"""
@@ -35,6 +67,27 @@ class DeltaWriter:
         self._mode = write_mode
         return self
 
+    def option(self, key: str, value):
+        """Set write option (Spark-compatible)"""
+        if key == "mergeSchema":
+            if str(value).lower() in ("true", "1"):
+                self._schema_mode = "merge"
+            else:
+                self._schema_mode = None
+        else:
+            raise ValueError(f"Unsupported option: {key}")
+        return self
+
+    def partitionBy(self, *columns):
+        """Set partition columns (Spark-compatible)"""
+        if len(columns) == 1 and isinstance(columns[0], (list, tuple)):
+            # Handle partitionBy(["col1", "col2"]) case
+            self._partition_by = list(columns[0])
+        else:
+            # Handle partitionBy("col1", "col2") case
+            self._partition_by = list(columns)
+        return self
+
     def saveAsTable(self, table_name: str):
         """Save query result as Delta table"""
         if self._format != "delta":
@@ -50,8 +103,18 @@ class DeltaWriter:
         path = f"{self.duckrun.table_base_url}{schema}/{table}"
         df = self.relation.record_batch()
 
-
-
+        # Build write arguments based on schema_mode and partition_by
+        write_args = _build_write_deltalake_args(
+            path, df, self._mode,
+            schema_mode=self._schema_mode,
+            partition_by=self._partition_by
+        )
+
+        engine_info = f" (engine=rust, schema_mode=merge)" if self._schema_mode == 'merge' else " (engine=pyarrow)"
+        partition_info = f" partitioned by {self._partition_by}" if self._partition_by else ""
+        print(f"Writing to Delta table: {schema}.{table} (mode={self._mode}){engine_info}{partition_info}")
+
+        write_deltalake(**write_args)
 
         self.duckrun.con.sql(f"DROP VIEW IF EXISTS {table}")
         self.duckrun.con.sql(f"""
@@ -113,6 +176,21 @@ class Duckrun:
     dr = Duckrun.connect("workspace/lakehouse.lakehouse")
     dr.sql("SELECT * FROM table").show()
     dr.sql("SELECT 43").write.mode("append").saveAsTable("test")
+
+    # Schema evolution and partitioning (exact Spark API):
+    dr.sql("SELECT * FROM source").write.mode("append").option("mergeSchema", "true").partitionBy("region").saveAsTable("sales")
+
+    # Pipeline formats:
+    pipeline = [
+        # SQL with parameters only
+        ('table_name', 'mode', {'param1': 'value1'}),
+
+        # SQL with Delta options (4-tuple format)
+        ('table_name', 'mode', {'param1': 'value1'}, {'mergeSchema': 'true', 'partitionBy': ['region']}),
+
+        # Python task
+        ('process_data', ('table_name',))
+    ]
     """
 
     def __init__(self, workspace: str, lakehouse_name: str, schema: str = "dbo",
@@ -392,7 +470,7 @@ class Duckrun:
             print(f"✅ Python '{name}' completed")
             return result
 
-    def _run_sql(self, table: str, mode: str, params: Dict) -> str:
+    def _run_sql(self, table: str, mode: str, params: Dict, delta_options: Dict = None) -> str:
         """Execute SQL task, write to Delta, return normalized table name"""
         self._create_onelake_secret()
 
@@ -406,10 +484,23 @@ class Duckrun:
         normalized_table = self._normalize_table_name(table)
         path = f"{self.table_base_url}{self.schema}/{normalized_table}"
 
+        # Extract Delta Lake specific options from delta_options
+        delta_options = delta_options or {}
+        merge_schema = delta_options.get('mergeSchema')
+        schema_mode = 'merge' if str(merge_schema).lower() in ('true', '1') else None
+        partition_by = delta_options.get('partitionBy') or delta_options.get('partition_by')
+
         if mode == 'overwrite':
             self.con.sql(f"DROP VIEW IF EXISTS {normalized_table}")
             df = self.con.sql(sql).record_batch()
-
+
+            write_args = _build_write_deltalake_args(
+                path, df, 'overwrite',
+                schema_mode=schema_mode,
+                partition_by=partition_by
+            )
+            write_deltalake(**write_args)
+
             self.con.sql(f"CREATE OR REPLACE VIEW {normalized_table} AS SELECT * FROM delta_scan('{path}')")
             dt = DeltaTable(path)
             dt.vacuum(retention_hours=0, dry_run=False, enforce_retention_duration=False)
@@ -417,7 +508,14 @@ class Duckrun:
 
         elif mode == 'append':
             df = self.con.sql(sql).record_batch()
-
+
+            write_args = _build_write_deltalake_args(
+                path, df, 'append',
+                schema_mode=schema_mode,
+                partition_by=partition_by
+            )
+            write_deltalake(**write_args)
+
             self.con.sql(f"CREATE OR REPLACE VIEW {normalized_table} AS SELECT * FROM delta_scan('{path}')")
             dt = DeltaTable(path)
             if len(dt.file_uris()) > self.compaction_threshold:
@@ -434,13 +532,22 @@ class Duckrun:
             print(f"Table {normalized_table} doesn't exist. Creating...")
             self.con.sql(f"DROP VIEW IF EXISTS {normalized_table}")
             df = self.con.sql(sql).record_batch()
-
+
+            write_args = _build_write_deltalake_args(
+                path, df, 'overwrite',
+                schema_mode=schema_mode,
+                partition_by=partition_by
+            )
+            write_deltalake(**write_args)
+
             self.con.sql(f"CREATE OR REPLACE VIEW {normalized_table} AS SELECT * FROM delta_scan('{path}')")
             dt = DeltaTable(path)
             dt.vacuum(dry_run=False)
             dt.cleanup_metadata()
 
-
+        engine_info = f" (engine=rust, schema_mode=merge)" if schema_mode == 'merge' else " (engine=pyarrow)"
+        partition_info = f" partitioned by {partition_by}" if partition_by else ""
+        print(f"✅ SQL '{table}' → '{normalized_table}' ({mode}){engine_info}{partition_info}")
         return normalized_table
 
     def run(self, pipeline: List[Tuple]) -> bool:
@@ -449,7 +556,8 @@ class Duckrun:
 
         Task formats:
             - Python: ('function_name', (arg1, arg2, ...))
-            - SQL: ('table_name', 'mode') or ('table_name', 'mode', {
+            - SQL: ('table_name', 'mode') or ('table_name', 'mode', {sql_params})
+            - SQL with Delta options: ('table_name', 'mode', {sql_params}, {delta_options})
 
         Returns:
             True if all tasks succeeded
@@ -469,7 +577,7 @@ class Duckrun:
             if len(task) == 2:
                 name, second = task
                 if isinstance(second, str) and second in {'overwrite', 'append', 'ignore'}:
-                    result = self._run_sql(name, second, {})
+                    result = self._run_sql(name, second, {}, {})
                 else:
                     args = second if isinstance(second, (tuple, list)) else (second,)
                     result = self._run_python(name, tuple(args))
@@ -478,7 +586,15 @@ class Duckrun:
                 table, mode, params = task
                 if not isinstance(params, dict):
                     raise ValueError(f"Expected dict for params, got {type(params)}")
-                result = self._run_sql(table, mode, params)
+                result = self._run_sql(table, mode, params, {})
+
+            elif len(task) == 4:
+                table, mode, params, delta_options = task
+                if not isinstance(params, dict):
+                    raise ValueError(f"Expected dict for SQL params, got {type(params)}")
+                if not isinstance(delta_options, dict):
+                    raise ValueError(f"Expected dict for Delta options, got {type(delta_options)}")
+                result = self._run_sql(table, mode, params, delta_options)
 
             else:
                 raise ValueError(f"Invalid task format: {task}")
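Taken together, both the Spark-style writer and the new 4-tuple task format route through the same `_build_write_deltalake_args` helper. A short usage sketch against the new API follows; the workspace/lakehouse names are placeholders, the connection string follows the docstring shown above, and the pipeline task assumes a matching `sales.sql` file in the configured SQL folder.

```python
from duckrun.core import Duckrun

# Placeholder connection string, in the form shown in the Duckrun docstring
dr = Duckrun.connect("my_workspace/my_lakehouse.lakehouse")

# Spark-style writer: mergeSchema switches to the rust engine with schema_mode='merge',
# and partitionBy is forwarded as partition_by to write_deltalake
dr.sql("SELECT * FROM staging_sales").write \
    .mode("append") \
    .option("mergeSchema", "true") \
    .partitionBy("region") \
    .saveAsTable("sales")

# Pipeline 4-tuple: the same options handled by _run_sql (requires sales.sql in the SQL folder)
dr.run([
    ('sales', 'append', {}, {'mergeSchema': 'true', 'partitionBy': ['region']})
])
```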
{duckrun-0.1.9 → duckrun-0.2.0}/duckrun.egg-info/PKG-INFO: identical to the PKG-INFO changes shown at the top of this diff (same hunks, +139 -5).