PyPI - deltacat - Versions diffs - 2.0.0.post1__tar.gz → 2.0.0.post3__tar.gz - Mend

deltacat 2.0.0.post1.tar.gz → 2.0.0.post3.tar.gz

Files changed (454)

{deltacat-2.0.0.post1/deltacat.egg-info → deltacat-2.0.0.post3}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: deltacat
-Version: 2.0.0.post1
+Version: 2.0.0.post3
 Summary: DeltaCAT is a portable Pythonic Data Lakehouse powered by Ray.
 Home-page: https://github.com/ray-project/deltacat
 Author: Ray Team
@@ -53,22 +53,20 @@ Dynamic: summary
   <img src="https://github.com/ray-project/deltacat/raw/2.0/media/deltacat-logo-alpha-750.png" alt="deltacat logo" style="width:55%; height:auto; text-align: center;">
 </p>
-DeltaCAT is a portable Pythonic Data Lakehouse powered by [Ray](https://github.com/ray-project/ray). It lets you define and manage
-fast, scalable, ACID-compliant multimodal data lakes, and has been used to [successfully manage exabyte-scale enterprise
-data lakes](https://aws.amazon.com/blogs/opensource/amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-amazon-ec2/).
+DeltaCAT is a portable Multimodal Lakehouse powered by [Ray](https://github.com/ray-project/ray), [Apache Arrow](https://github.com/apache/arrow), and [Daft](https://github.com/Eventual-Inc/Daft). It lets you create ACID-compliant multimodal data lakes [that efficiently scale to manage exabytes of production data](https://aws.amazon.com/blogs/opensource/amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-amazon-ec2/).
-It uses the Ray distributed compute framework together with [Apache Arrow](https://github.com/apache/arrow) and
-[Daft](https://github.com/Eventual-Inc/Daft) to efficiently scale common table management tasks, like petabyte-scale
-merge-on-read and copy-on-write operations.
+It provides data lake level transactions & time travel, zero-copy schema evolution, zero-copy multimodal file processing (image, audio, video, text, etc.), and transparent dataset optimization. It runs locally for rapid development or in the cloud for production workloads. It runs on any filesystem for easy setup and sharing - no external catalog services, lock managers, or key value stores required.
+## Overview
 DeltaCAT provides the following high-level components:
-1. [**Catalog**](https://github.com/ray-project/deltacat/tree/2.0/deltacat/catalog/interface.py): High-level APIs to create, discover, organize, share, and manage datasets.
-2. [**Compute**](https://github.com/ray-project/deltacat/tree/2.0/deltacat/compute/): Distributed data management procedures to read, write, and optimize datasets.
-3. [**Storage**](https://github.com/ray-project/deltacat/tree/2.0/deltacat/storage/): In-memory and on-disk multimodal dataset formats.
+1. [**Catalog**](https://github.com/ray-project/deltacat/tree/2.0/deltacat/catalog/interface.py): Pythonic APIs to discover, read, write, and manage datasets.
+2. [**Compute**](https://github.com/ray-project/deltacat/tree/2.0/deltacat/compute/): Distributed data management procedures that automatically optimize your datasets.
+3. [**Storage**](https://github.com/ray-project/deltacat/tree/2.0/deltacat/storage/): A portable multimodal data lake format useable with any filesystem.
 4. **Sync** (in development): Synchronize DeltaCAT datasets to data warehouses and other table formats.
-## Overview
-DeltaCAT's **Catalog**, **Compute**, and **Storage** layers work together to bring ACID-compliant data management to any Ray application. These components automate data indexing, change management, dataset read/write optimization, schema evolution, and other common data management tasks across any set of data files readable by Ray Data, Daft, Pandas, Polars, PyArrow, or NumPy.
+DeltaCAT's **Catalog**, **Compute**, and **Storage** layers work together to bring ACID-compliant data management to any Ray application. These components automate data indexing, change management, dataset read/write optimization, schema evolution, and other common data management tasks across any set of data files readable by [Pandas](https://github.com/pandas-dev/pandas), [NumPy](https://github.com/numpy/numpy), [Polars](https://github.com/pola-rs/polars), [PyArrow](https://arrow.apache.org/docs/python/index.html), [Ray Data](https://docs.ray.io/en/latest/data/data.html), and [Daft](https://docs.daft.ai/en/stable/api/dataframe/).
 <p align="center">
   <img src="https://github.com/ray-project/deltacat/raw/2.0/media/deltacat-tech-overview.png" alt="deltacat tech overview" style="width:100%; height:auto; text-align: center;">
@@ -79,10 +77,15 @@ Data consumers that prefer to stay within the ecosystem of Pythonic data managem
 ## Getting Started
 DeltaCAT applications run anywhere that Ray runs, including your local laptop, cloud computing cluster, or on-premise cluster.
-DeltaCAT lets you manage **Tables** across one or more **Catalogs**. A **Table** can be thought of as a named collection of one or more data files. A **Catalog** provides a root location (e.g., a local file path or S3 Bucket) to store table information, and can be rooted in any [PyArrow-compatible Filesystem](https://arrow.apache.org/docs/python/filesystems.html). **Tables** can be created, read, and written using the `dc.write` and `dc.read` APIs.
+DeltaCAT lets you manage **Tables** across one or more **Catalogs**. A **Table** can be thought of as a named collection of data files. A **Catalog** can be thought of as a named data lake that contains a set of **Tables**. A **Catalog** provides a root location (e.g., a local file path or S3 Bucket) to store information about all your **Tables**, and can be rooted in any [PyArrow-compatible Filesystem](https://arrow.apache.org/docs/python/filesystems.html). **Tables** can be created, read, and written using the `dc.write` and `dc.read` APIs.
 ### Quick Start
+Install DeltaCAT with: `pip install deltacat`
+Then run this script to create and read your first table:
 ```python
 import deltacat as dc
 import pandas as pd
@@ -108,8 +111,8 @@ dc.write(data, "users")
 daft_df = dc.read("users")  # Returns Daft DataFrame (default)
 daft_df.show()  # Materialize and print the DataFrame
-# Append more data and add a new column.
-# Compaction and schema evolution are handled automatically.
+# Add more data and add a new column.
+# Compaction and zero-copy schema evolution are handled automatically.
 data = pd.DataFrame({
     "id": [4, 5, 6],
     "name": ["Tom", "Simpkin", "Delta"],
@@ -125,13 +128,13 @@ daft_df.select("name", "age", "city").show()
 ```
 ### Core Concepts
-DeltaCAT can do much more than just append data to tables and read it back again. Expand the sections below to see examples of other core DeltaCAT concepts and APIs.
+DeltaCAT can do much more than just add data to tables and read it back again. Expand the sections below to see examples of other core DeltaCAT concepts and APIs.
 <details>
-<summary><span style="font-size: 1.25em; font-weight: bold;">Replacing and Dropping Tables</span></summary>
+<summary><span style="font-size: 1.25em; font-weight: bold;">Idempotent Writes</span></summary>
-If you run the quick start example repeatedly from the same working directory, you'll notice that the table it writes to just keeps growing larger. This is because DeltaCAT always **appends** table data by default. One way to prevent this perpetual table growth and make the example idempotent is to use the **REPLACE** write mode if the table already exists:
+If you run the quick start example repeatedly from the same working directory, you'll notice that the table it writes to just keeps growing larger. This is because DeltaCAT always **adds** table data by default. One way to prevent this perpetual table growth and make the example idempotent is to use the **REPLACE** write mode if the table already exists:
 ```python
 import deltacat as dc
@@ -165,7 +168,7 @@ dc.write(data, "users", mode=write_mode)
 daft_df = dc.read("users")  # Returns Daft DataFrame (default)
 daft_df.show()  # Materialize and print the DataFrame
-# Explicitly append more data and add a new column.
+# Explicitly add more data and add a new column.
 # Compaction and schema evolution are handled automatically.
 data = pd.DataFrame({
     "id": [4, 5, 6],
@@ -173,7 +176,7 @@ data = pd.DataFrame({
     "age": [2, 12, 4],
     "city": ["Hollywood", "Gloucester", "San Francisco"]
 })
-dc.write(data, "users", mode=dc.TableWriteMode.APPEND)
+dc.write(data, "users", mode=dc.TableWriteMode.ADD)
 # Read the full table back into a Daft DataFrame.
 daft_df = dc.read("users")
@@ -217,7 +220,7 @@ dc.write(data, "users", mode=dc.TableWriteMode.CREATE)
 daft_df = dc.read("users")  # Returns Daft DataFrame (default)
 daft_df.show()  # Materialize and print the DataFrame
-# Explicitly append more data and add a new column.
+# Explicitly add more data and add a new column.
 # Compaction and schema evolution are handled automatically.
 data = pd.DataFrame({
     "id": [4, 5, 6],
@@ -225,7 +228,7 @@ data = pd.DataFrame({
     "age": [2, 12, 4],
     "city": ["Hollywood", "Gloucester", "San Francisco"]
 })
-dc.write(data, "users", mode=dc.TableWriteMode.APPEND)
+dc.write(data, "users", mode=dc.TableWriteMode.ADD)
 # Read the full table back into a Daft DataFrame.
 daft_df = dc.read("users")
@@ -237,9 +240,117 @@ assert dc.dataset_length(daft_df) == 6
 </details>
 <details>
-<summary><span style="font-size: 1.25em; font-weight: bold;">Supported Dataset and File Formats</span></summary>
+<summary><span style="font-size: 1.25em; font-weight: bold;">Ordered Writes</span></summary>
+DeltaCAT writes are unordered by default, which means that the order of data written to the table isn't guaranteed to match the order that it is read back. While this is useful for preventing conflicts between concurrent writers, you can also use the **APPEND** write mode to preserve write order and raise explicit concurrency conflicts between parallel writers:
+```python
+import deltacat as dc
+import pandas as pd
+# Initialize DeltaCAT with a default local catalog.
+# Ray will be initialized automatically.
+# Catalog files will be stored in .deltacat/ in the current working directory.
+dc.init_local()
+# Create data to write.
+data = pd.DataFrame({
+    "id": [1, 2],
+    "name": ["Cheshire", "Dinah"],
+    "age": [3, 7]
+})
+# Derive a DeltaCAT schema for the data.
+schema = dc.Schema.of(dc.dataset_schema(data))
+# Create an empty table to hold ordered user data.
+if not dc.table_exists("users_ordered"):
+    dc.create_table("users_ordered", schema=schema)
+# Write the first ordered delta to the table.
+dc.write(data, "users_ordered", mode=dc.TableWriteMode.APPEND)
+# Write the second ordered delta to the table.
+data = pd.DataFrame({
+    "id": [3, 4],
+    "name": ["Felix", "Tom"],
+    "age": [2, 12],
+    "city": ["Hollywood", "Gloucester"]
+})
+dc.write(data, "users_ordered", mode=dc.TableWriteMode.APPEND)
+# Write the third ordered delta to the table.
+data = pd.DataFrame({
+    "id": [5, 6],
+    "name": ["Simpkin", "Delta"],
+    "age": [12, 4],
+    "city": ["San Francisco", "San Francisco"]
+})
+dc.write(data, "users_ordered", mode=dc.TableWriteMode.APPEND)
+# Read the data back as a Pandas DataFrame, and ensure that the
+# order of the records returned matches the order they were written.
+pandas_df = dc.read("users_ordered", read_as=dc.DatasetType.PANDAS)
+print(pandas_df)
+```
+</details>
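The trade-off described in the Ordered Writes section (unordered adds avoid conflicts, while ordered appends can raise them) can be illustrated with a toy in-memory model. This is only a sketch of the general optimistic-concurrency idea under stated assumptions, not DeltaCAT's implementation: `OrderedLog`, `ConflictError`, and `expected_position` are hypothetical names invented for illustration.

```python
class ConflictError(Exception):
    """Raised when a writer appends against a stale view of the log."""


class OrderedLog:
    """Toy append-only log that preserves write order (hypothetical, not a DeltaCAT class)."""

    def __init__(self):
        self._entries = []

    @property
    def position(self):
        # The position a writer must have observed before it appends.
        return len(self._entries)

    def append(self, batch, expected_position):
        # Reject the write if another writer committed first.
        if expected_position != self.position:
            raise ConflictError(
                f"expected position {expected_position}, log is at {self.position}"
            )
        self._entries.append(list(batch))

    def read(self):
        # Flatten batches in the order they were committed.
        return [row for batch in self._entries for row in batch]


log = OrderedLog()

# Two writers observe the same starting position.
seen_by_a = log.position
seen_by_b = log.position

# Writer A commits first.
log.append([{"id": 1}, {"id": 2}], expected_position=seen_by_a)

# Writer B's view is now stale, so its append is rejected.
try:
    log.append([{"id": 3}], expected_position=seen_by_b)
    conflict_raised = False
except ConflictError:
    conflict_raised = True

# Writer B retries against the latest position and succeeds.
log.append([{"id": 3}], expected_position=log.position)
ordered_ids = [row["id"] for row in log.read()]
print(conflict_raised, ordered_ids)
```

An unordered write mode would correspond to dropping the position check, which removes the conflict but also the ordering guarantee.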
+<details>
+<summary><span style="font-size: 1.25em; font-weight: bold;">Schemaless Tables</span></summary>
+Tables created automatically via `dc.write` have a schema inferred from the data written by default. However, if you create an empty table without providing a schema, it defaults to schemaless. Writes to schemaless tables are more efficient and flexible, since they simply track the location and basic metadata associated with the data files written to the table. However, if you know that a unified schema can be derived for your schemaless data, then you can still read it back as a structured dataset:
+```python
+import deltacat as dc
+import pandas as pd
+# Initialize DeltaCAT with a default local catalog.
+# Ray will be initialized automatically.
+# Catalog files will be stored in .deltacat/ in the current working directory.
+dc.init_local()
+# Create data to write.
+data = pd.DataFrame({
+    "id": [1, 2],
+    "name": ["Cheshire", "Dinah"],
+    "age": [3, 7]
+})
+# Create an empty schemaless table to hold ordered user data.
+if not dc.table_exists("users_schemaless"):
+    dc.create_table("users_schemaless")
+# Write the first ordered delta to the table.
+dc.write(data, "users_schemaless", mode=dc.TableWriteMode.APPEND)
+# Write the second ordered delta to the table.
+data = pd.DataFrame({
+    "id": [3, 4],
+    "name": ["Felix", "Tom"],
+    "age": [2, 12],
+    "city": ["Hollywood", "Gloucester"]
+})
+dc.write(data, "users_schemaless", mode=dc.TableWriteMode.APPEND)
+# Read back the file manifest of the schemaless table.
+# Notice that file paths, sizes, etc. are returned instead of the dataframes written.
+manifest_df = dc.read("users_schemaless", read_as=dc.DatasetType.PANDAS)
+print(manifest_df)
+# Use from_manifest_table to convert the manifest table to a structured dataset.
+structured_daft_df = dc.from_manifest_table(manifest_df)
+structured_daft_df.show()
+```
+</details>
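To make the idea of a "unified schema" concrete: the two deltas written in the schemaless example have different columns (`city` appears only in the second), so reading them as one structured dataset means taking the union of the columns and filling the gaps with nulls. The following plain-Python sketch shows that idea only. It does not call `dc.from_manifest_table` and makes no claim about how DeltaCAT derives schemas internally; `unify` is a hypothetical helper.

```python
def unify(batches):
    """Union the columns of several record batches, filling gaps with None."""
    columns = []
    for batch in batches:
        for row in batch:
            for name in row:
                if name not in columns:
                    columns.append(name)  # keep first-seen column order
    rows = [
        {name: row.get(name) for name in columns}
        for batch in batches
        for row in batch
    ]
    return columns, rows


# The same records written by the two deltas in the example above.
first = [
    {"id": 1, "name": "Cheshire", "age": 3},
    {"id": 2, "name": "Dinah", "age": 7},
]
second = [
    {"id": 3, "name": "Felix", "age": 2, "city": "Hollywood"},
    {"id": 4, "name": "Tom", "age": 12, "city": "Gloucester"},
]

columns, rows = unify([first, second])
print(columns)
for row in rows:
    print(row)
```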
+<details>
+<summary><span style="font-size: 1.25em; font-weight: bold;">Working Across Dataset and File Types</span></summary>
 DeltaCAT natively supports a variety of open dataset and file formats already integrated with Ray and Arrow. You can use `dc.read` to read tables back as a Daft DataFrame, Ray Dataset, Pandas DataFrame, PyArrow Table, Polars DataFrame, NumPy Array, or list of PyArrow ParquetFile objects:
@@ -329,7 +440,7 @@ print("\n=== NumPy Table ===")
 dc.read("my_numpy_table").show()
 ```
-Or write to different table file formats:
+DeltaCAT tables also support persisting data in heterogeneous table file formats like Avro, ORC, or Feather:
 ```python
 data = pd.DataFrame({"id": [1], "name": ["Cheshire"], "age": [3]})
@@ -372,9 +483,9 @@ print(pandas_df)
 <details>
-<summary><span style="font-size: 1.25em; font-weight: bold;">Merging and Deleting Data</span></summary>
+<summary><span style="font-size: 1.25em; font-weight: bold;">Live Feature Enrichment</span></summary>
-DeltaCAT can automatically merge and delete data by defining a table schema with one or more merge keys:
+DeltaCAT can update your datasets on-the-fly to keep up with a continuous stream of new insights, and support common ML use-cases like feature enrichment. Just define a table schema with one or more merge keys to start updating and deleting existing records:
 ```python
 import deltacat as dc
@@ -385,53 +496,50 @@ import tempfile
 # Initialize DeltaCAT with a fresh temporary catalog
 dc.init_local(tempfile.mkdtemp())
-# Define a schema with user_id as a merge key.
-schema = dc.Schema.of([
+# Start with minimal schema - just user_id as merge key and name
+initial_schema = dc.Schema.of([
     dc.Field.of(pa.field("user_id", pa.int64()), is_merge_key=True),
     dc.Field.of(pa.field("name", pa.string())),
-    dc.Field.of(pa.field("age", pa.int32())),
-    dc.Field.of(pa.field("status", pa.string())),
 ])
-# Initial user data
+# Initial user data - just basic info
 initial_users = pd.DataFrame({
     "user_id": [1, 2, 3],
-    "name": ["Cheshire", "Dinah", "Felix"],
-    "age": [3, 7, 2],
-    "status": ["active", "active", "inactive"]
+    "name": ["Jim", "Dinah", "Bob"],
 })
-# Write initial data with the merge key schema
-dc.write(initial_users, "users", schema=schema)
+# Write initial data with minimal schema
+dc.write(initial_users, "users", schema=initial_schema)
-# Read the data back as a Pandas DataFrame.
+# Read the data back as a Pandas DataFrame
 df = dc.read("users", read_as=dc.DatasetType.PANDAS)
-print("=== Initial Users ===")
+print("=== Initial Users (Basic Info) ===")
 print(df.sort_values("user_id"))
-# Update data for existing users + add new users
-updated_users = pd.DataFrame({
-    "user_id": [2, 3, 4, 5, 6],
-    "name": ["Dinah", "Felix", "Tom", "Simpkin", "Delta"],
-    "age": [7, 2, 5, 12, 4],
-    "status": ["premium", "active", "active", "active", "active"]
+# Later, enrich with new insights: add age/job features + new users
+enriched_data = pd.DataFrame({
+    "user_id": [1, 3, 4, 5, 6],
+    "name": ["Cheshire", "Felix", "Tom", "Simpkin", "Delta"],
+    "age": [3, 2, 5, 12, 4],
+    "job": ["Tour Guide", "Drifter", "Housekeeper", "Mouser", "Engineer"]
 })
-# Write automatically detects that the schema has a merge key and:
-# 1. Updates existing records with matching user IDs.
-# 2. Inserts new records with new user IDs.
-dc.write(updated_users, "users", schema=schema)
+# DeltaCAT automatically evolves the schema and merges by user_id:
+# 1. Enriches existing users (Jim -> Cheshire age=3, job="Tour Guide"; Bob -> Felix)
+# 2. Adds new age/job columns with automatic schema evolution
+# 3. Inserts new users (Tom, Simpkin, Delta) with full feature set
+dc.write(enriched_data, "users")
-# Read back to see merged results
+# Read back to see live feature enrichment results
 df = dc.read("users", read_as=dc.DatasetType.PANDAS)
-print("\n=== After Merge ===")
+print("\n=== Enriched Users (Age & Job) ===")
 print(df.sort_values("user_id"))
-# - Cheshire (user_id=1) remains unchanged
-# - Dinah (user_id=2) status updated to "premium"
-# - Felix (user_id=3) updated to "active"
-# - New users (4,5,6), (Tom, Simpkin, Delta) added
-# - No duplicate user_id values exist
+# - Cheshire (user_id=1) name updated from Jim, gets age=3, job="Tour Guide"
+# - Dinah (user_id=2) keeps original name, gets null age/job (missing features)
+# - Felix (user_id=3) name updated from Bob, gets age=2, job="Drifter"
+# - New users (4,5,6) added with complete feature set
+# - Schema automatically evolved to include age/job columns
 # Specify the users to delete.
 # We only need to specify matching merge key values.
@@ -440,7 +548,7 @@ users_to_delete = pd.DataFrame({
 })
 # Delete the records that match our merge keys.
-dc.write(users_to_delete, "users", schema=schema, mode=dc.TableWriteMode.DELETE)
+dc.write(users_to_delete, "users", mode=dc.TableWriteMode.DELETE)
 # Read the table back to confirm target users have been deleted.
 df = dc.read("users", read_as=dc.DatasetType.PANDAS)
@@ -456,6 +564,117 @@ print(df.sort_values("user_id"))
 <details>
+<summary><span style="font-size: 1.25em; font-weight: bold;">Zero-Copy Multimodal URL Processing</span></summary>
+DeltaCAT can register and process existing multimodal datasets from local or remote URLs. This enables zero-copy distributed processing of images, audio, text, and other file formats:
+```python
+import deltacat as dc
+import pandas as pd
+import pyarrow as pa
+import tempfile
+import ray
+# Initialize DeltaCAT with a fresh temporary catalog
+dc.init_local(tempfile.mkdtemp())
+# Create dataset with DeltaCAT URLs pointing to existing files
+urls_df = pd.DataFrame({
+    "file_id": [1, 2, 3, 4, 5, 6],
+    "url": [
+        # URLs with common file extensions will have their content type inferred.
+        "https://picsum.photos/id/237/400/300.jpg",
+        "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv",
+        "https://raw.githubusercontent.com/SergLam/Audio-Sample-files/master/sample.mp3",
+        "https://raw.githubusercontent.com/burningtree/awesome-json/master/README.md",
+        "https://raw.githubusercontent.com/microsoft/vscode/main/package.json",
+        # URLs without common file extensions will be read as binary by default.
+        "https://picsum.photos/200"
+    ]
+})
+# Create empty table with merge key to efficiently add insights about each file
+dc.create_table(
+    "multimodal_files",
+    schema=dc.Schema.of([
+        dc.Field.of(pa.field("file_id", pa.int64()), is_merge_key=True),
+        dc.Field.of(pa.field("url", pa.string()))
+    ])
+)
+# Write URLs to DeltaCAT table
+dc.write(urls_df, "multimodal_files")
+# UDF to process each file in parallel using Ray Dataset map method
+def analyze_file(row):
+    file_id = row["file_id"]
+    url = row["url"]
+    # DeltaCAT automatically infers the right Ray Data reader for the URL
+    dataset = dc.get(url)
+    records = dataset.take_all()
+    url_type = dc.DatastoreType.from_url(url)
+    # Extract standard Ray Dataset fields for each file type
+    if url_type == dc.DatastoreType.IMAGES:
+        image = records[0]["image"]
+        analysis = f"Image {image.shape[1]}x{image.shape[0]} pixels"
+    elif url_type == dc.DatastoreType.CSV:
+        analysis = f"CSV with {len(records)} rows, {len(records[0].keys())} columns"
+    elif url_type == dc.DatastoreType.AUDIO:
+        sample_rate = records[0]["sample_rate"]
+        duration = len(records[0]["amplitude"][0]) / sample_rate
+        analysis = f"Audio {duration:.1f}s, {sample_rate}Hz"
+    elif url_type == dc.DatastoreType.JSON:
+        analysis = f"JSON with {len(records[0].keys())} fields"
+    elif url_type == dc.DatastoreType.TEXT:
+        analysis = f"Text with {len(records)} records"
+    else:
+        analysis = f"Binary with {len(records[0]['bytes'])} bytes"
+    return {"file_id": file_id, "analysis": analysis}
+# Read the multimodal_files table as a Ray Dataset
+ray_dataset = dc.read("multimodal_files", read_as=dc.DatasetType.RAY_DATASET)
+# Download and analyze each URL in parallel using map
+results_dataset = ray_dataset.map(analyze_file)
+# Write results back to the multimodal_files table
+dc.write(results_dataset, "multimodal_files", mode=dc.TableWriteMode.MERGE)
+# Read final results and compare to initial dataset
+print("\n=== Initial Dataset ===")
+print(dc.to_pandas(ray_dataset))
+print("\n=== Final Results with Analysis ===")
+print(dc.read("multimodal_files", read_as=dc.DatasetType.PANDAS))
+```
+The default dataset type used by `dc.get` is a Ray Dataset but, similar to `dc.read`, `dc.get` can also read URLs into other dataset types like Daft:
+```python
+import deltacat as dc
+# Create dataset with DeltaCAT URLs pointing to existing files
+urls = [
+    # URLs with common file extensions will have their content type inferred.
+    "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv",
+    "https://raw.githubusercontent.com/burningtree/awesome-json/master/README.md",
+    # URLs without common file extensions will be read as binary by default.
+    "https://picsum.photos/200"
+]
+# Download each URL into a Daft DataFrame serially
+for url in urls:
+    dataset = dc.get(url, read_as=dc.DatasetType.DAFT)
+    print(f"\n=== {url} ===")
+    print(dataset.show())
+```
+</details>
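The first example in this section writes its results back with `dc.TableWriteMode.MERGE`, so each `analysis` value lands on the row that shares its `file_id`. A minimal sketch of that merge-key behaviour with plain dictionaries follows. The semantics shown (incoming fields overwrite or extend a matching row, unmatched keys are inserted) are an assumption modeled on the comments in the README's merge examples, not DeltaCAT's implementation; `merge_by_key` is a hypothetical helper and the `example.com` URLs are placeholders.

```python
def merge_by_key(existing, incoming, key):
    """Upsert incoming rows into existing rows, matching on a single key field."""
    merged = {row[key]: dict(row) for row in existing}
    for row in incoming:
        # Update the matching row in place, or start a new row for a new key.
        merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]


# Rows already in the table (placeholder URLs).
files = [
    {"file_id": 1, "url": "https://example.com/a.jpg"},
    {"file_id": 2, "url": "https://example.com/b.csv"},
]

# New fields computed later for some of the rows, plus one new key.
results = [
    {"file_id": 1, "analysis": "Image 400x300 pixels"},
    {"file_id": 3, "analysis": "Binary with 10 bytes"},
]

table = merge_by_key(files, results, key="file_id")
for row in table:
    print(row)
```

Rows without a matching update (here `file_id` 2) keep their original fields and simply lack the new one, which a columnar reader would surface as a null.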
+<details>
 <summary><span style="font-size: 1.25em; font-weight: bold;">Organizing Tables with Namespaces</span></summary>
 In DeltaCAT, table **Namespaces** are optional but useful for organizing related tables within a catalog:
@@ -486,6 +705,10 @@ order_data = pd.DataFrame({
     "product_id": [101, 102, 103],
     "quantity": [2, 1, 2]
 })
+# Create identity, inventory, and sales namespaces
+dc.create_namespace("identity")
+dc.create_namespace("inventory")
+dc.create_namespace("sales")
 # Write tables to different namespaces to organize them by domain
 dc.write(user_data, "users", namespace="identity")
@@ -511,7 +734,10 @@ finance_users = pd.DataFrame({
     "preferred_payment_method": ["credit", "cash", "paypal"]
 })
+dc.create_namespace("marketing")
 dc.write(marketing_users, "users", namespace="marketing")
+dc.create_namespace("finance")
 dc.write(finance_users, "users", namespace="finance")
 # Each namespace maintains its own "users" table with different schemas
@@ -534,9 +760,9 @@ print(finance_df)
 <details>
-<summary><span style="font-size: 1.25em; font-weight: bold;">Multi-Table Transactions</span></summary>
+<summary><span style="font-size: 1.25em; font-weight: bold;">Data Lake Level Transactions</span></summary>
-DeltaCAT transactions can span multiple tables and namespaces. Since all operations within a transaction either succeed or fail together, this simplifies keeping related datasets in sync across your entire catalog.
+DeltaCAT transactions can span multiple tables and namespaces. Since transaction history is maintained at the catalog level, every transaction operates against a consistent snapshot of every object in your data lake. Since all operations within a transaction either succeed or fail together, this simplifies keeping related datasets in sync across your entire catalog.
 Consider the previous example that organized tables with namespaces. One table tracked customer orders, and another table tracked the lifetime payments of each customer. If one table was updated but not the other, then it would result in an accounting discrepancy. This edge case can be eliminated by using multi-table transactions:
@@ -557,6 +783,7 @@ product_data = pd.DataFrame({
 })
 # The product catalog can be created independently.
+dc.create_namespace("inventory")
 dc.write(product_data, "catalog", namespace="inventory")
 print(f"\n=== Initial Product Data ===")
@@ -583,7 +810,9 @@ finance_schema = dc.Schema.of([
 # Create user identities and user finance data within a single transaction.
 # Since transactions are atomic, this prevents accounting discrepancies.
 with dc.transaction():
+    dc.create_namespace("identity")
     dc.write(user_data, "users", namespace="identity")
+    dc.create_namespace("finance")
     dc.write(initial_finance, "users", namespace="finance", schema=finance_schema)
 print(f"\n=== Initial User Data ===")
@@ -602,6 +831,7 @@ new_orders = pd.DataFrame({
 # Process new orders and update lifetime payment totals within a single transaction.
 with dc.transaction():
     # Step 1: Write the new orders
+    dc.create_namespace("sales")
     dc.write(new_orders, "transactions", namespace="sales")
     # Step 2: Read back transactions and products to compute actual totals
@@ -617,6 +847,7 @@ with dc.transaction():
     finance_updates.columns = ["user_id", "lifetime_payments"]
     # Step 4: Write the computed totals
+    dc.create_namespace("finance")
     dc.write(finance_updates, "users", namespace="finance", mode=dc.TableWriteMode.MERGE)
 # Verify that orders and lifetime payments are kept in sync.
@@ -630,7 +861,7 @@ print(dc.read("users", namespace="finance", read_as=dc.DatasetType.PANDAS))
 <details>
-<summary><span style="font-size: 1.25em; font-weight: bold;">Working with Multiple Catalogs</span></summary>
+<summary><span style="font-size: 1.25em; font-weight: bold;">Managing Multiple Data Lakes</span></summary>
 DeltaCAT lets you work with multiple catalogs in a single application. All catalogs registered with DeltaCAT are tracked by a Ray Actor to make them available to all workers in your Ray application.
@@ -646,16 +877,14 @@ import tempfile
 from decimal import Decimal
 # Initialize catalogs with separate names and catalog roots.
-dc.init(catalogs={
-    "staging": dc.Catalog(config=dc.CatalogProperties(
-        root=tempfile.mkdtemp(),  # Use temporary directory for staging
-        filesystem=pa.fs.LocalFileSystem()
-    )),
-    "prod": dc.Catalog(config=dc.CatalogProperties(
-        root=tempfile.mkdtemp(),  # Use temporary directory for prod
-        filesystem=pa.fs.LocalFileSystem()
-    ))
-})
+dc.init(
+    catalogs={
+        # Use temporary directory for staging
+        "staging": dc.Catalog(dc.CatalogProperties(tempfile.mkdtemp())),
+        # Use S3 for prod
+        "prod": dc.Catalog(dc.CatalogProperties("s3://example/deltacat"))
+    }
+)
 # Create a PyArrow table with decimal256 data
 decimal_table = pa.table({
@@ -705,9 +934,95 @@ print(dc.read("financial_data", catalog="prod", read_as=dc.DatasetType.PANDAS))
 <details>
-<summary><span style="font-size: 1.25em; font-weight: bold;">Transaction History & Time Travel</span></summary>
+<summary><span style="font-size: 1.25em; font-weight: bold;">Data Lake Sharing & Portability</span></summary>
+DeltaCAT catalogs are self-contained directories on a filesystem, so you can easily share your data lake with others. A local catalog on your laptop can be compressed and sent anywhere. A cloud catalog in S3, GCS, or Azure Blob Storage can be shared via URL. The read/write permissions of your catalog are the read/write permissions of your filesystem.
+For example, you can zip up your local catalog and upload it to S3 via:
+```bash
+# zip a local catalog
+zip -r catalog.zip .deltacat/
+# copy the catalog to a cloud bucket
+aws s3 cp catalog.zip s3://my-bucket/catalog.zip
+```
+The person you shared it with can retrieve and decompress it via:
+```bash
+# copy the cloud catalog to local disk
+aws s3 cp s3://my-bucket/catalog.zip .
+# unzip the catalog to a local directory
+unzip catalog.zip -d .deltacat_copy/
+```
+And then initialize it together with any other catalogs they're working with:
+```python
+import deltacat as dc
+# Initialize catalogs with separate names and catalog roots.
+dc.init(
+    catalogs={
+        "original": dc.Catalog(dc.CatalogProperties(".deltacat")),
+        "copy": dc.Catalog(dc.CatalogProperties(".deltacat_copy")),
+        "prod_aws": dc.Catalog(dc.CatalogProperties("s3://prod/deltacat")),
+        "prod_gcp": dc.Catalog(dc.CatalogProperties("gs://prod/deltacat")),
+        "prod_azure": dc.Catalog(dc.CatalogProperties("az://prod/deltacat")),
+    }
+)
+# List all namespaces in the original catalog
+namespaces = dc.list("dc://original")
+print([namespace.name for namespace in namespaces])
+# List all namespaces in the copy catalog
+namespaces = dc.list("dc://copy")
+print([namespace.name for namespace in namespaces])
+# List all tables in the default namespace of the original catalog
+tables = dc.list("dc://original/default")
+print([table.name for table in tables])
+# List all tables in the default namespace of the copy catalog
+tables = dc.list("dc://copy/default")
+print([table.name for table in tables])
+```
+`dc.copy` can also be used to copy namespaces and tables between catalogs:
+```python
+# Copy the "default" namespace from the original local catalog over to the "myspace" namespace in the copy catalog
+dc.copy("dc://original/default", "dc://copy/default/myspace")
+# By default, no tables are copied from the source namespace to the destination
+tables = dc.list("dc://copy/myspace")
+print(f"{len(tables)} tables in myspace.")
+# Copy the "users" table from the original local catalog over to "local_users" in the prod_aws catalog
+dc.copy("dc://original/default/users", "dc://prod_aws/default/local_users")
+# Read the copied table back
+df = dc.read("local_users", catalog="prod_aws")
+df.show()
+# We can also copy all tables in the default namespace using **
+dc.copy("dc://original/default/**", "dc://copy/default/myspace")
+tables = dc.list("dc://copy/myspace")
+print(f"{len(tables)} tables in myspace.")
+# Or we can copy all namespaces from the original catalog using *
+dc.copy("dc://original/*", "dc://copy")
+namespaces = dc.list("dc://copy")
+print([namespace.name for namespace in namespaces])
+```
+</details>
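Because a local catalog is just a directory, the zip and unzip steps from the sharing section can also be done from Python with the standard library. The sketch below uses a throwaway directory with a placeholder file as a stand-in for a real catalog, so the file names are assumptions and nothing here touches DeltaCAT itself. Note that it archives the contents of the catalog directory, so the extracted directory is itself the catalog root.

```python
import shutil
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())

# Stand-in for a local catalog directory (placeholder content only).
catalog_root = workdir / ".deltacat"
(catalog_root / "default").mkdir(parents=True)
(catalog_root / "default" / "placeholder.txt").write_text("not a real catalog file")

# Archive the contents of the catalog directory into catalog.zip.
archive_path = shutil.make_archive(
    str(workdir / "catalog"), "zip", root_dir=str(catalog_root)
)

# Extract into a new directory, which then mirrors the original catalog root.
copy_root = workdir / ".deltacat_copy"
shutil.unpack_archive(archive_path, extract_dir=str(copy_root))

copied_files = sorted(
    p.relative_to(copy_root).as_posix() for p in copy_root.rglob("*") if p.is_file()
)
print(Path(archive_path).name, copied_files)
```

The resulting `catalog.zip` can then be uploaded with whatever tool you already use for your object store.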
+<details>
+<summary><span style="font-size: 1.25em; font-weight: bold;">Data Lake Level Time Travel</span></summary>
-DeltaCAT supports time travel queries that let you read all tables in a catalog as they existed at any point in the past. Combined with multi-table transactions, this enables consistent point-in-time views across your entire data catalog.
+DeltaCAT supports time travel queries that let you read all tables in a catalog as they existed at any point in the past. Combined with catalog-level transactions, this enables consistent point-in-time views across your entire data lake.
 ```python
 import deltacat as dc
@@ -744,10 +1059,10 @@ initial_finance = pd.DataFrame({
 # Write initial state atomically with a commit message
 with dc.transaction(commit_message="Initial data load: users, products, orders, and finance"):
-    dc.write(initial_users, "users", namespace="identity")
-    dc.write(initial_products, "catalog", namespace="inventory")
-    dc.write(initial_orders, "transactions", namespace="sales")
-    dc.write(initial_finance, "users", namespace="finance")
+    dc.write(initial_users, "users", namespace="identity", auto_create_namespace=True)
+    dc.write(initial_products, "catalog", namespace="inventory", auto_create_namespace=True)
+    dc.write(initial_orders, "transactions", namespace="sales", auto_create_namespace=True)
+    dc.write(initial_finance, "users", namespace="finance", auto_create_namespace=True)
 # Sleep briefly to ensure transaction timestamp separation
 time.sleep(0.1)
@@ -847,7 +1162,7 @@ print("\nTime travel validation successful!")
 <summary><span style="font-size: 1.25em; font-weight: bold;">Multimodal Batch Inference</span></summary>
-DeltaCAT's support for merging new fields into existing records and multimodal datasets can be used to build a multimodal batch inference pipeline. For example, the following code indexes images of cats, then merges in new fields with breed precitions predictions for each image:
+DeltaCAT's support for merging new fields into existing records and multimodal datasets can be used to build a multimodal batch inference pipeline. For example, the following code indexes images of cats, then merges in new fields with breed predictions for each image:
 > **Requirements**: This example requires PyTorch ≥ 2.8.0 and torchvision ≥ 0.23.0. Install via: `pip install torch>=2.8.0 torchvision>=0.23.0`
@@ -938,7 +1253,7 @@ final_df.show()
 <summary><span style="font-size: 1.25em; font-weight: bold;">LLM Batch Inference</span></summary>
-DeltaCAT multi-table transactions, time travel queries, and automatic schema evolution can be used to create auditable LLM batch inference pipelines. For example, the following code tries different approaches to analyze the overall tone of customer feedback, then generates customer service responses based on the analysis:
+DeltaCAT multi-table transactions, data lake time travel, and automatic schema evolution can be used to create auditable LLM batch inference pipelines. For example, the following code tries different approaches to analyze the overall tone of customer feedback, then generates customer service responses based on the analysis:
 ```python
 import deltacat as dc
@@ -963,7 +1278,7 @@ daft_docs = daft_docs.with_column("content", daft_docs["path"].url.download().de
 # Capture basic feedback sentiment analysis in a parallel multi-table transaction
 with dc.transaction():
     # Write the full customer feedback to a new "documents" table.
-    dc.write(daft_docs, "documents", namespace="analysis")
+    dc.write(daft_docs, "documents")
     # Define a UDF to analyze customer feedback sentiment.
     @daft.udf(return_dtype=daft.DataType.struct({
@@ -1000,14 +1315,14 @@ with dc.transaction():
         dc.Field.of(pa.field("confidence", pa.float64())),
         dc.Field.of(pa.field("model_version", pa.large_string())),
     ])
-    dc.write(daft_results, "insights", namespace="analysis", schema=initial_schema)
+    dc.write(daft_results, "insights", schema=initial_schema)
     # Write to a new audit trail table.
     audit_df = pd.DataFrame([{
         "version": "v1.0",
         "docs_processed": dc.dataset_length(daft_docs),
     }])
-    dc.write(audit_df, "audit", namespace="analysis")
+    dc.write(audit_df, "audit")
 print("=== V1.0: Customer feedback sentiment analysis processing complete! ===")
@@ -1048,9 +1363,9 @@ with dc.transaction():
     )
     # Merge new V2.0 insights into the existing V1.0 insights table.
-    dc.write(daft_emotions, "insights", namespace="analysis")
+    dc.write(daft_emotions, "insights")
     audit_df = pd.DataFrame([{"version": "v2.0", "docs_processed": dc.dataset_length(daft_docs)}])
-    dc.write(audit_df, "audit", namespace="analysis")
+    dc.write(audit_df, "audit")
 print("=== V2.0: Customer feedback emotion analysis processing complete! ===")
@@ -1062,7 +1377,7 @@ time.sleep(0.1)
 # Generate customer service responses based on emotion analysis results.
 with dc.transaction():
     # First, read the current insights table with emotion analysis
-    current_insights = dc.read("insights", namespace="analysis")
+    current_insights = dc.read("insights")
     # Define a UDF to generate customer service responses based on analysis results.
     @daft.udf(return_dtype=daft.DataType.struct({
@@ -1109,39 +1424,39 @@ with dc.transaction():
     )
     # Merge new V3.0 responses into the existing V2.0 insights table.
     # The new response columns are automatically joined by document ID.
-    dc.write(daft_responses, "insights", namespace="analysis")
+    dc.write(daft_responses, "insights")
     audit_df = pd.DataFrame([{"version": "v3.0", "docs_processed": dc.dataset_length(current_insights)}])
-    dc.write(audit_df, "audit", namespace="analysis")
+    dc.write(audit_df, "audit")
 print("=== V3.0: Customer service response generation processing complete! ===")
 print("\n=== Time Travel Comparison of all Versions ===")
 with dc.transaction(as_of=checkpoint_v1):
     print(f"== V1.0 Insights (sentiment) ==")
-    print(dc.read("insights", namespace="analysis").show())
+    print(dc.read("insights").show())
     print(f"== V1.0 Audit ==")
-    print(dc.read("audit", namespace="analysis").show())
+    print(dc.read("audit").show())
 with dc.transaction(as_of=checkpoint_v2):
     print(f"== V2.0 Insights (emotion) ==")
-    print(dc.read("insights", namespace="analysis").show())
+    print(dc.read("insights").show())
     print(f"== V2.0 Audit ==")
-    print(dc.read("audit", namespace="analysis").show())
+    print(dc.read("audit").show())
-v3_results = dc.read("insights", namespace="analysis")
+v3_results = dc.read("insights")
 print(f"== V3.0 Insights (customer service response) ==")
-print(dc.read("insights", namespace="analysis").show())
+print(dc.read("insights").show())
 print(f"== V3.0 Audit ==")
-print(dc.read("audit", namespace="analysis").show())
+print(dc.read("audit").show())
 ```
 </details>
 ## Runtime Environment Requirements
-DeltaCAT's transaction system assumes that the host machine provides strong system clock accuracy guarantees, and that the filesystem hosting the catalog root directory offers strong consistency.
+DeltaCAT's transaction system assumes that the host machine provides strong system clock accuracy guarantees, and that the filesystem hosting the catalog root directory offers strong read-after-write consistency.
-Taken together, these requirements make DeltaCAT suitable for production use on most major cloud computing hosts (e.g., EC2, GCE, Azure VMs) and storage systems (e.g., S3, GCS, Azure Blob Storage), but local laptops should typically be limited to testing/experimental purposes.
+Taken together, these requirements make DeltaCAT suitable for production use on most major cloud computing hosts (e.g., EC2, GCE, Azure VMs) and storage systems (e.g., S3, GCS, Azure Blob Storage), but local laptops should typically be limited to testing/experimental purposes (e.g., due to potential system clock drift).
 ## Additional Resources
 ### Table Documentation
