duckrun 0.1.2__tar.gz → 0.1.4__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {duckrun-0.1.2 → duckrun-0.1.4}/PKG-INFO +13 -9
- {duckrun-0.1.2 → duckrun-0.1.4}/README.md +12 -8
- {duckrun-0.1.2 → duckrun-0.1.4}/duckrun/core.py +48 -41
- {duckrun-0.1.2 → duckrun-0.1.4}/duckrun.egg-info/PKG-INFO +13 -9
- {duckrun-0.1.2 → duckrun-0.1.4}/pyproject.toml +1 -1
- {duckrun-0.1.2 → duckrun-0.1.4}/LICENSE +0 -0
- {duckrun-0.1.2 → duckrun-0.1.4}/duckrun/__init__.py +0 -0
- {duckrun-0.1.2 → duckrun-0.1.4}/duckrun.egg-info/SOURCES.txt +0 -0
- {duckrun-0.1.2 → duckrun-0.1.4}/duckrun.egg-info/dependency_links.txt +0 -0
- {duckrun-0.1.2 → duckrun-0.1.4}/duckrun.egg-info/requires.txt +0 -0
- {duckrun-0.1.2 → duckrun-0.1.4}/duckrun.egg-info/top_level.txt +0 -0
- {duckrun-0.1.2 → duckrun-0.1.4}/setup.cfg +0 -0
{duckrun-0.1.2 → duckrun-0.1.4}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: duckrun
-Version: 0.1.2
+Version: 0.1.4
 Summary: Lakehouse task runner powered by DuckDB for Microsoft Fabric
 License-Expression: MIT
 Project-URL: Homepage, https://github.com/djouallah/duckrun
@@ -35,14 +35,14 @@ pip install duckrun
 ## Quick Start
 
 ```python
-import duckrun
+import duckrun
 
-# Connect to your Fabric lakehouse
-lakehouse = duckrun.connect(
+# Connect to your Fabric lakehouse (using `con` pattern)
+con = duckrun.connect(
     workspace="my_workspace",
     lakehouse_name="my_lakehouse",
     schema="dbo",
-    sql_folder="./sql"  # folder containing your .sql and .py files
+    sql_folder="./sql"  # optional: folder containing your .sql and .py files (only needed for pipeline tasks)
 )
 
 # Define your pipeline
@@ -53,9 +53,11 @@ pipeline = [
 ]
 
 # Run it
-lakehouse.run(pipeline)
+con.run(pipeline)
 ```
 
+Note: the `sql/` folder is optional — if all you want to do is explore data with SQL (for example by calling `con.sql(...)`), you don't need to provide a `sql_folder`.
+
 ## Early Exit
 
 In a pipeline run, if a task fails, the pipeline will stop without running the subsequent tasks.
@@ -138,12 +140,14 @@ Both write to the same `sales` table, but use different SQL files.
 
 ```python
 # Run queries
-lakehouse.sql("SELECT * FROM my_table LIMIT 10").show()
+con.sql("SELECT * FROM my_table LIMIT 10").show()
 
 # Get as DataFrame
-df = lakehouse.sql("SELECT COUNT(*) FROM sales").df()
+df = con.sql("SELECT COUNT(*) FROM sales").df()
 ```
 
+Explanation: DuckDB is connected to the lakehouse through `con`, so it is aware of the tables in that lakehouse (including tables created by your pipelines). That means you can query those tables directly with `con.sql(...)` just like any other DuckDB query. If you don't provide a `sql_folder`, you can still use `con.sql(...)` to explore existing tables.
+
 
 
 ## Remote SQL Files
@@ -151,7 +155,7 @@ df = lakehouse.sql("SELECT COUNT(*) FROM sales").df()
 You can load SQL/Python files from a URL:
 
 ```python
-lakehouse = duckrun.connect(
+con = duckrun.connect(
     workspace="Analytics",
     lakehouse_name="Sales",
     schema="dbo",
{duckrun-0.1.2 → duckrun-0.1.4}/README.md

@@ -19,14 +19,14 @@ pip install duckrun
 ## Quick Start
 
 ```python
-import duckrun
+import duckrun
 
-# Connect to your Fabric lakehouse
-lakehouse = duckrun.connect(
+# Connect to your Fabric lakehouse (using `con` pattern)
+con = duckrun.connect(
     workspace="my_workspace",
     lakehouse_name="my_lakehouse",
     schema="dbo",
-    sql_folder="./sql"  # folder containing your .sql and .py files
+    sql_folder="./sql"  # optional: folder containing your .sql and .py files (only needed for pipeline tasks)
 )
 
 # Define your pipeline
@@ -37,9 +37,11 @@ pipeline = [
 ]
 
 # Run it
-lakehouse.run(pipeline)
+con.run(pipeline)
 ```
 
+Note: the `sql/` folder is optional — if all you want to do is explore data with SQL (for example by calling `con.sql(...)`), you don't need to provide a `sql_folder`.
+
 ## Early Exit
 
 In a pipeline run, if a task fails, the pipeline will stop without running the subsequent tasks.
@@ -122,12 +124,14 @@ Both write to the same `sales` table, but use different SQL files.
 
 ```python
 # Run queries
-lakehouse.sql("SELECT * FROM my_table LIMIT 10").show()
+con.sql("SELECT * FROM my_table LIMIT 10").show()
 
 # Get as DataFrame
-df = lakehouse.sql("SELECT COUNT(*) FROM sales").df()
+df = con.sql("SELECT COUNT(*) FROM sales").df()
 ```
 
+Explanation: DuckDB is connected to the lakehouse through `con`, so it is aware of the tables in that lakehouse (including tables created by your pipelines). That means you can query those tables directly with `con.sql(...)` just like any other DuckDB query. If you don't provide a `sql_folder`, you can still use `con.sql(...)` to explore existing tables.
+
 
 
 ## Remote SQL Files
@@ -135,7 +139,7 @@ df = lakehouse.sql("SELECT COUNT(*) FROM sales").df()
 You can load SQL/Python files from a URL:
 
 ```python
-lakehouse = duckrun.connect(
+con = duckrun.connect(
    workspace="Analytics",
    lakehouse_name="Sales",
    schema="dbo",
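A minimal sketch of the exploration-only workflow the updated README describes, assuming the 0.1.4 API shown above; the workspace, lakehouse, and table names below are placeholders, not values from the package:

```python
import duckrun

# Connect without sql_folder: enough for ad-hoc SQL exploration.
# Pipeline runs (con.run) would additionally require sql_folder.
con = duckrun.connect(
    workspace="my_workspace",        # placeholder
    lakehouse_name="my_lakehouse",   # placeholder
    schema="dbo",
)

# Attached Delta tables are exposed as DuckDB views, so plain SQL works:
con.sql("SELECT COUNT(*) AS n FROM sales").show()

# Or pull results into a pandas DataFrame:
df = con.sql("SELECT * FROM sales LIMIT 10").df()
print(df.head())
```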
{duckrun-0.1.2 → duckrun-0.1.4}/duckrun/core.py

@@ -16,23 +16,21 @@ class Duckrun:
         SQL: ('table_name', 'mode', {params})
 
     Usage:
+        # For pipelines:
         dr = Duckrun.connect(workspace, lakehouse, schema, sql_folder)
-
-        pipeline = [
-            ('download', (urls, paths, depth)),
-            ('staging', 'overwrite', {'run_date': '2024-06-01'}),
-            ('transform', 'append')
-        ]
-
         dr.run(pipeline)
+
+        # For data exploration only:
+        dr = Duckrun.connect(workspace, lakehouse, schema)
+        dr.sql("SELECT * FROM table").show()
     """
 
     def __init__(self, workspace: str, lakehouse_name: str, schema: str,
-                 sql_folder: str, compaction_threshold: int = 10):
+                 sql_folder: Optional[str] = None, compaction_threshold: int = 10):
         self.workspace = workspace
         self.lakehouse_name = lakehouse_name
         self.schema = schema
-        self.sql_folder = sql_folder.strip()
+        self.sql_folder = sql_folder.strip() if sql_folder else None
         self.compaction_threshold = compaction_threshold
         self.table_base_url = f'abfss://{workspace}@onelake.dfs.fabric.microsoft.com/{lakehouse_name}.Lakehouse/Tables/'
         self.con = duckdb.connect()
@@ -41,7 +39,7 @@ class Duckrun:
 
     @classmethod
     def connect(cls, workspace: str, lakehouse_name: str, schema: str,
-                sql_folder: str, compaction_threshold: int =
+                sql_folder: Optional[str] = None, compaction_threshold: int = 100):
         """Create and connect to lakehouse"""
         print("Connecting to Lakehouse...")
         return cls(workspace, lakehouse_name, schema, sql_folder, compaction_threshold)
@@ -64,44 +62,44 @@ class Duckrun:
     def _attach_lakehouse(self):
         self._create_onelake_secret()
         try:
-            #
+            # Use expensive list operation but filter for _delta_log folders only
+            # This avoids parsing JSON content that causes Iceberg metadata issues
+            print(f"Scanning for Delta tables in {self.schema}... (this may take a moment)")
+
             list_tables_query = f"""
-                SELECT DISTINCT
-
-
+                SELECT DISTINCT
+                    regexp_extract(file, 'Tables/{self.schema}/([^/]+)/_delta_log', 1) as table_name
+                FROM glob("abfss://{self.workspace}@onelake.dfs.fabric.microsoft.com/{self.lakehouse_name}.Lakehouse/Tables/{self.schema}/**")
+                WHERE file LIKE '%/_delta_log/%'
+                  AND file NOT LIKE '%/metadata/%'
                   AND file NOT LIKE '%/iceberg/%'
-                  AND
-                  AND split_part(file, '_delta_log', 1) NOT LIKE '%/iceberg'
+                  AND regexp_extract(file, 'Tables/{self.schema}/([^/]+)/_delta_log', 1) IS NOT NULL
             """
+
             list_tables_df = self.con.sql(list_tables_query).df()
-
-
-
-                print(f"No Delta tables found in {self.lakehouse_name}.Lakehouse/Tables.")
+
+            if list_tables_df.empty:
+                print(f"No Delta tables found in {self.lakehouse_name}.Lakehouse/Tables/{self.schema}.")
                 return
+
+            table_names = list_tables_df['table_name'].tolist()
 
-            print(f"Found {len(
+            print(f"Found {len(table_names)} Delta tables. Attaching as views...")
 
-            for
-            if
-            ...
-                        AS SELECT * FROM delta_scan('{self.table_base_url}{self.schema}/{table}');
-                    """)
-                    print(f"  ✓ Attached: {table}")
-                except Exception as e:
-                    print(f"  ⚠ Skipped {table}: {str(e)[:100]}")
-                    continue
+            for table in table_names:
+                # Skip Iceberg-related folders and empty names
+                if not table or table in ('metadata', 'iceberg'):
+                    continue
+
+                try:
+                    self.con.sql(f"""
+                        CREATE OR REPLACE VIEW {table}
+                        AS SELECT * FROM delta_scan('{self.table_base_url}{self.schema}/{table}');
+                    """)
+                    print(f"  ✓ Attached: {table}")
+                except Exception as e:
+                    print(f"  ⚠ Skipped {table}: {str(e)[:100]}")
+                    continue
 
             print("\nAttached tables (views) in DuckDB:")
             self.con.sql("SELECT name FROM (SHOW ALL TABLES) WHERE database='memory'").show()
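The rewritten `_attach_lakehouse` discovers tables by globbing the storage path and extracting the parent folder of each `_delta_log` directory. A rough, self-contained sketch of the same idea against a hypothetical local folder (the path, schema, and table layout here are made up for illustration; the real code targets OneLake over `abfss://` and uses DuckDB's delta extension):

```python
import duckdb

con = duckdb.connect()

# Hypothetical local layout mirroring Tables/<schema>/<table>/_delta_log/...
base = "/data/lakehouse/Tables/dbo"  # made-up path, for illustration only

# Same idea as the new _attach_lakehouse: glob every file under the schema,
# keep only paths that contain a _delta_log folder, and pull the table name
# out of the path with regexp_extract.
tables = con.sql(f"""
    SELECT DISTINCT
        regexp_extract(file, 'Tables/dbo/([^/]+)/_delta_log', 1) AS table_name
    FROM glob('{base}/**')
    WHERE file LIKE '%/_delta_log/%'
      AND file NOT LIKE '%/metadata/%'
      AND file NOT LIKE '%/iceberg/%'
""").df()

for table in tables["table_name"]:
    if not table or table in ("metadata", "iceberg"):
        continue
    # Requires the delta extension: con.sql("INSTALL delta; LOAD delta;")
    con.sql(f"CREATE OR REPLACE VIEW {table} AS SELECT * FROM delta_scan('{base}/{table}')")
```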
@@ -114,6 +112,9 @@ class Duckrun:
         return name.split('__', 1)[0] if '__' in name else name
 
     def _read_sql_file(self, table_name: str, params: Optional[Dict] = None) -> Optional[str]:
+        if self.sql_folder is None:
+            raise RuntimeError("sql_folder is not configured. Cannot read SQL files.")
+
         is_url = self.sql_folder.startswith("http")
         if is_url:
             url = f"{self.sql_folder.rstrip('/')}/{table_name}.sql".strip()
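The hunk above shows only the new guard and the start of the URL branch. As a hedged illustration of what `_read_sql_file` presumably does with the two cases (the full body is not visible in this diff), a standalone sketch:

```python
import os
from urllib.request import urlopen

def read_sql_file(sql_folder: str, table_name: str) -> str:
    """Sketch only: mirrors the local-vs-URL branch visible in the hunk.
    The actual duckrun implementation is not fully shown in this diff."""
    if sql_folder.startswith("http"):
        url = f"{sql_folder.rstrip('/')}/{table_name}.sql"
        with urlopen(url) as resp:  # assumption: a plain HTTP(S) GET of the .sql file
            return resp.read().decode("utf-8")
    path = os.path.join(sql_folder, f"{table_name}.sql")
    with open(path, "r", encoding="utf-8") as f:
        return f.read()
```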
@@ -159,6 +160,9 @@ class Duckrun:
             return content
 
     def _load_py_function(self, name: str) -> Optional[Callable]:
+        if self.sql_folder is None:
+            raise RuntimeError("sql_folder is not configured. Cannot load Python functions.")
+
         is_url = self.sql_folder.startswith("http")
         try:
             if is_url:
@@ -267,6 +271,9 @@ class Duckrun:
         ]
         dr.run(pipeline)
         """
+        if self.sql_folder is None:
+            raise RuntimeError("sql_folder is not configured. Cannot run pipelines. Set sql_folder when creating connection.")
+
         for i, task in enumerate(pipeline, 1):
             print(f"\n{'='*60}")
             print(f"Task {i}/{len(pipeline)}: {task[0]}")
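Taken together, the three guards mean a connection created without `sql_folder` still supports ad-hoc `sql(...)` queries but fails fast on pipeline features. A small usage sketch (names are placeholders):

```python
import duckrun

# No sql_folder: exploration-only connection (placeholder names)
con = duckrun.connect("my_workspace", "my_lakehouse", "dbo")

con.sql("SELECT 1 AS ok").show()  # plain queries still work

try:
    con.run([("staging", "overwrite")])  # pipelines need sql_folder
except RuntimeError as err:
    print(err)  # "sql_folder is not configured. Cannot run pipelines. ..."
```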
{duckrun-0.1.2 → duckrun-0.1.4}/duckrun.egg-info/PKG-INFO

Same changes as PKG-INFO above (the egg-info metadata embeds the same README content).
Files without changes:

- {duckrun-0.1.2 → duckrun-0.1.4}/LICENSE
- {duckrun-0.1.2 → duckrun-0.1.4}/duckrun/__init__.py
- {duckrun-0.1.2 → duckrun-0.1.4}/duckrun.egg-info/SOURCES.txt
- {duckrun-0.1.2 → duckrun-0.1.4}/duckrun.egg-info/dependency_links.txt
- {duckrun-0.1.2 → duckrun-0.1.4}/duckrun.egg-info/requires.txt
- {duckrun-0.1.2 → duckrun-0.1.4}/duckrun.egg-info/top_level.txt
- {duckrun-0.1.2 → duckrun-0.1.4}/setup.cfg