duckrun 0.1.3.tar.gz → 0.1.5.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: duckrun
-Version: 0.1.3
+Version: 0.1.5
 Summary: Lakehouse task runner powered by DuckDB for Microsoft Fabric
 License-Expression: MIT
 Project-URL: Homepage, https://github.com/djouallah/duckrun
@@ -35,14 +35,14 @@ pip install duckrun
 ## Quick Start
 
 ```python
-import duckrun as dr
+import duckrun
 
-# Connect to your Fabric lakehouse
-lakehouse = dr.connect(
+# Connect to your Fabric lakehouse (using `con` pattern)
+con = duckrun.connect(
     workspace="my_workspace",
     lakehouse_name="my_lakehouse",
     schema="dbo",
-    sql_folder="./sql" # folder containing your .sql and .py files
+    sql_folder="./sql" # optional: folder containing your .sql and .py files (only needed for pipeline tasks)
 )
 
 # Define your pipeline
@@ -53,9 +53,11 @@ pipeline = [
 ]
 
 # Run it
-lakehouse.run(pipeline)
+con.run(pipeline)
 ```
 
+Note: the `sql/` folder is optional — if all you want to do is explore data with SQL (for example by calling `con.sql(...)`), you don't need to provide a `sql_folder`.
+
 ## Early Exit
 
 In a pipeline run, if a task fails, the pipeline will stop without running the subsequent tasks.
@@ -138,12 +140,14 @@ Both write to the same `sales` table, but use different SQL files.
 
 ```python
 # Run queries
-lakehouse.sql("SELECT * FROM my_table LIMIT 10").show()
+con.sql("SELECT * FROM my_table LIMIT 10").show()
 
 # Get as DataFrame
-df = lakehouse.sql("SELECT COUNT(*) FROM sales").df()
+df = con.sql("SELECT COUNT(*) FROM sales").df()
 ```
 
+Explanation: DuckDB is connected to the lakehouse through `con`, so it is aware of the tables in that lakehouse (including tables created by your pipelines). That means you can query those tables directly with `con.sql(...)` just like any other DuckDB query. If you don't provide a `sql_folder`, you can still use `con.sql(...)` to explore existing tables.
+
 
 
 ## Remote SQL Files
@@ -151,7 +155,7 @@ df = lakehouse.sql("SELECT COUNT(*) FROM sales").df()
 You can load SQL/Python files from a URL:
 
 ```python
-lakehouse = dr.connect(
+con = duckrun.connect(
     workspace="Analytics",
     lakehouse_name="Sales",
     schema="dbo",
@@ -19,14 +19,14 @@ pip install duckrun
 ## Quick Start
 
 ```python
-import duckrun as dr
+import duckrun
 
-# Connect to your Fabric lakehouse
-lakehouse = dr.connect(
+# Connect to your Fabric lakehouse (using `con` pattern)
+con = duckrun.connect(
     workspace="my_workspace",
     lakehouse_name="my_lakehouse",
     schema="dbo",
-    sql_folder="./sql" # folder containing your .sql and .py files
+    sql_folder="./sql" # optional: folder containing your .sql and .py files (only needed for pipeline tasks)
 )
 
 # Define your pipeline
@@ -37,9 +37,11 @@ pipeline = [
 ]
 
 # Run it
-lakehouse.run(pipeline)
+con.run(pipeline)
 ```
 
+Note: the `sql/` folder is optional — if all you want to do is explore data with SQL (for example by calling `con.sql(...)`), you don't need to provide a `sql_folder`.
+
 ## Early Exit
 
 In a pipeline run, if a task fails, the pipeline will stop without running the subsequent tasks.
@@ -122,12 +124,14 @@ Both write to the same `sales` table, but use different SQL files.
 
 ```python
 # Run queries
-lakehouse.sql("SELECT * FROM my_table LIMIT 10").show()
+con.sql("SELECT * FROM my_table LIMIT 10").show()
 
 # Get as DataFrame
-df = lakehouse.sql("SELECT COUNT(*) FROM sales").df()
+df = con.sql("SELECT COUNT(*) FROM sales").df()
 ```
 
+Explanation: DuckDB is connected to the lakehouse through `con`, so it is aware of the tables in that lakehouse (including tables created by your pipelines). That means you can query those tables directly with `con.sql(...)` just like any other DuckDB query. If you don't provide a `sql_folder`, you can still use `con.sql(...)` to explore existing tables.
+
 
 
 ## Remote SQL Files
@@ -135,7 +139,7 @@ df = lakehouse.sql("SELECT COUNT(*) FROM sales").df()
 You can load SQL/Python files from a URL:
 
 ```python
-lakehouse = dr.connect(
+con = duckrun.connect(
     workspace="Analytics",
     lakehouse_name="Sales",
     schema="dbo",
@@ -6,6 +6,99 @@ from deltalake import DeltaTable, write_deltalake
 from typing import List, Tuple, Union, Optional, Callable, Dict, Any
 from string import Template
 
+
+class DeltaWriter:
+    """Spark-style write API for Delta Lake"""
+
+    def __init__(self, relation, duckrun_instance):
+        self.relation = relation
+        self.duckrun = duckrun_instance
+        self._format = None
+        self._mode = "overwrite"
+
+    def format(self, format_type: str):
+        """Set output format (only 'delta' supported)"""
+        if format_type.lower() != "delta":
+            raise ValueError(f"Only 'delta' format is supported, got '{format_type}'")
+        self._format = "delta"
+        return self
+
+    def mode(self, write_mode: str):
+        """Set write mode: 'overwrite' or 'append'"""
+        if write_mode not in {"overwrite", "append"}:
+            raise ValueError(f"Mode must be 'overwrite' or 'append', got '{write_mode}'")
+        self._mode = write_mode
+        return self
+
+    def saveAsTable(self, table_name: str):
+        """Save query result as Delta table"""
+        if self._format != "delta":
+            raise RuntimeError("Must call .format('delta') before saveAsTable()")
+
+        # Parse schema.table or use default schema
+        if "." in table_name:
+            schema, table = table_name.split(".", 1)
+        else:
+            schema = self.duckrun.schema
+            table = table_name
+
+        # Ensure OneLake secret is created
+        self.duckrun._create_onelake_secret()
+
+        # Build path
+        path = f"{self.duckrun.table_base_url}{schema}/{table}"
+
+        # Execute query and get result
+        df = self.relation.record_batch()
+
+        print(f"Writing to Delta table: {schema}.{table} (mode={self._mode})")
+
+        # Write to Delta
+        write_deltalake(path, df, mode=self._mode)
+
+        # Create or replace view in DuckDB
+        self.duckrun.con.sql(f"DROP VIEW IF EXISTS {table}")
+        self.duckrun.con.sql(f"""
+            CREATE OR REPLACE VIEW {table}
+            AS SELECT * FROM delta_scan('{path}')
+        """)
+
+        # Optimize if needed
+        dt = DeltaTable(path)
+
+        if self._mode == "overwrite":
+            dt.vacuum(retention_hours=0, dry_run=False, enforce_retention_duration=False)
+            dt.cleanup_metadata()
+            print(f"✅ Table {schema}.{table} created/overwritten")
+        else:  # append
+            file_count = len(dt.file_uris())
+            if file_count > self.duckrun.compaction_threshold:
+                print(f"Compacting {schema}.{table} ({file_count} files)")
+                dt.optimize.compact()
+                dt.vacuum(dry_run=False)
+                dt.cleanup_metadata()
+            print(f"✅ Data appended to {schema}.{table}")
+
+        return table
+
+
+class QueryResult:
+    """Wrapper for DuckDB relation with write API"""
+
+    def __init__(self, relation, duckrun_instance):
+        self.relation = relation
+        self.duckrun = duckrun_instance
+
+    @property
+    def write(self):
+        """Access write API"""
+        return DeltaWriter(self.relation, self.duckrun)
+
+    def __getattr__(self, name):
+        """Delegate all other methods to underlying DuckDB relation"""
+        return getattr(self.relation, name)
+
+
 class Duckrun:
     """
     Lakehouse task runner with clean tuple-based API.
@@ -16,23 +109,22 @@ class Duckrun:
     SQL: ('table_name', 'mode', {params})
 
     Usage:
+        # For pipelines:
         dr = Duckrun.connect(workspace, lakehouse, schema, sql_folder)
-
-        pipeline = [
-            ('download', (urls, paths, depth)),
-            ('staging', 'overwrite', {'run_date': '2024-06-01'}),
-            ('transform', 'append')
-        ]
-
         dr.run(pipeline)
+
+        # For data exploration with Spark-style API:
+        dr = Duckrun.connect(workspace, lakehouse, schema)
+        dr.sql("SELECT * FROM table").show()
+        dr.sql("SELECT 43").write.format("delta").mode("append").saveAsTable("aemo.test")
     """
 
     def __init__(self, workspace: str, lakehouse_name: str, schema: str,
-                 sql_folder: str, compaction_threshold: int = 10):
+                 sql_folder: Optional[str] = None, compaction_threshold: int = 10):
         self.workspace = workspace
         self.lakehouse_name = lakehouse_name
         self.schema = schema
-        self.sql_folder = sql_folder.strip()
+        self.sql_folder = sql_folder.strip() if sql_folder else None
        self.compaction_threshold = compaction_threshold
        self.table_base_url = f'abfss://{workspace}@onelake.dfs.fabric.microsoft.com/{lakehouse_name}.Lakehouse/Tables/'
        self.con = duckdb.connect()
@@ -41,7 +133,7 @@ class Duckrun:
 
     @classmethod
     def connect(cls, workspace: str, lakehouse_name: str, schema: str,
-                sql_folder: str, compaction_threshold: int = 10):
+                sql_folder: Optional[str] = None, compaction_threshold: int = 100):
         """Create and connect to lakehouse"""
         print("Connecting to Lakehouse...")
         return cls(workspace, lakehouse_name, schema, sql_folder, compaction_threshold)
@@ -114,6 +206,9 @@ class Duckrun:
         return name.split('__', 1)[0] if '__' in name else name
 
     def _read_sql_file(self, table_name: str, params: Optional[Dict] = None) -> Optional[str]:
+        if self.sql_folder is None:
+            raise RuntimeError("sql_folder is not configured. Cannot read SQL files.")
+
         is_url = self.sql_folder.startswith("http")
         if is_url:
             url = f"{self.sql_folder.rstrip('/')}/{table_name}.sql".strip()
@@ -159,6 +254,9 @@ class Duckrun:
         return content
 
     def _load_py_function(self, name: str) -> Optional[Callable]:
+        if self.sql_folder is None:
+            raise RuntimeError("sql_folder is not configured. Cannot load Python functions.")
+
         is_url = self.sql_folder.startswith("http")
         try:
             if is_url:
@@ -267,6 +365,9 @@ class Duckrun:
             ]
             dr.run(pipeline)
         """
+        if self.sql_folder is None:
+            raise RuntimeError("sql_folder is not configured. Cannot run pipelines. Set sql_folder when creating connection.")
+
         for i, task in enumerate(pipeline, 1):
             print(f"\n{'='*60}")
             print(f"Task {i}/{len(pipeline)}: {task[0]}")
@@ -305,13 +406,19 @@
 
     def sql(self, query: str):
         """
-        Execute raw SQL query.
+        Execute raw SQL query with Spark-style write API.
 
         Example:
+            # Traditional DuckDB style
             dr.sql("SELECT * FROM table").show()
             df = dr.sql("SELECT * FROM table").df()
+
+            # New Spark-style write API
+            dr.sql("SELECT 43 as value").write.format("delta").mode("append").saveAsTable("aemo.test")
+            dr.sql("SELECT * FROM source").write.format("delta").mode("overwrite").saveAsTable("target")
         """
-        return self.con.sql(query)
+        relation = self.con.sql(query)
+        return QueryResult(relation, self)
 
     def get_connection(self):
         """Get underlying DuckDB connection"""
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: duckrun
-Version: 0.1.3
+Version: 0.1.5
 Summary: Lakehouse task runner powered by DuckDB for Microsoft Fabric
 License-Expression: MIT
 Project-URL: Homepage, https://github.com/djouallah/duckrun
@@ -35,14 +35,14 @@ pip install duckrun
 ## Quick Start
 
 ```python
-import duckrun as dr
+import duckrun
 
-# Connect to your Fabric lakehouse
-lakehouse = dr.connect(
+# Connect to your Fabric lakehouse (using `con` pattern)
+con = duckrun.connect(
     workspace="my_workspace",
     lakehouse_name="my_lakehouse",
     schema="dbo",
-    sql_folder="./sql" # folder containing your .sql and .py files
+    sql_folder="./sql" # optional: folder containing your .sql and .py files (only needed for pipeline tasks)
 )
 
 # Define your pipeline
@@ -53,9 +53,11 @@ pipeline = [
 ]
 
 # Run it
-lakehouse.run(pipeline)
+con.run(pipeline)
 ```
 
+Note: the `sql/` folder is optional — if all you want to do is explore data with SQL (for example by calling `con.sql(...)`), you don't need to provide a `sql_folder`.
+
 ## Early Exit
 
 In a pipeline run, if a task fails, the pipeline will stop without running the subsequent tasks.
@@ -138,12 +140,14 @@ Both write to the same `sales` table, but use different SQL files.
 
 ```python
 # Run queries
-lakehouse.sql("SELECT * FROM my_table LIMIT 10").show()
+con.sql("SELECT * FROM my_table LIMIT 10").show()
 
 # Get as DataFrame
-df = lakehouse.sql("SELECT COUNT(*) FROM sales").df()
+df = con.sql("SELECT COUNT(*) FROM sales").df()
 ```
 
+Explanation: DuckDB is connected to the lakehouse through `con`, so it is aware of the tables in that lakehouse (including tables created by your pipelines). That means you can query those tables directly with `con.sql(...)` just like any other DuckDB query. If you don't provide a `sql_folder`, you can still use `con.sql(...)` to explore existing tables.
+
 
 
 ## Remote SQL Files
@@ -151,7 +155,7 @@ df = lakehouse.sql("SELECT COUNT(*) FROM sales").df()
 You can load SQL/Python files from a URL:
 
 ```python
-lakehouse = dr.connect(
+con = duckrun.connect(
     workspace="Analytics",
     lakehouse_name="Sales",
     schema="dbo",
@@ -5,7 +5,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "duckrun"
-version = "0.1.3"
+version = "0.1.5"
 description = "Lakehouse task runner powered by DuckDB for Microsoft Fabric"
 readme = "README.md"
 license = "MIT"
File without changes
File without changes
File without changes