PyPI - datachain - Versions diffs - 0.6.8__tar.gz → 0.6.9__tar.gz - Mend

datachain 0.6.8.tar.gz → 0.6.9.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of datachain might be problematic.

Files changed (262)

{datachain-0.6.8 → datachain-0.6.9}/.pre-commit-config.yaml RENAMED

@@ -24,7 +24,7 @@ repos:
       - id: trailing-whitespace
         exclude: '^LICENSES/'
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: 'v0.7.2'
+    rev: 'v0.7.3'
     hooks:
       - id: ruff
         args: [--fix, --exit-non-zero-on-fix]

{datachain-0.6.8/src/datachain.egg-info → datachain-0.6.9}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: datachain
-Version: 0.6.8
+Version: 0.6.9
 Summary: Wrangle unstructured AI data at scale
 Author-email: Dmitry Petrov <support@dvc.org>
 License: Apache-2.0
@@ -120,33 +120,41 @@ Requires-Dist: onnx==1.16.1; extra == "examples"
    :target: https://github.com/iterative/datachain/actions/workflows/tests.yml
    :alt: Tests
-DataChain is a modern Pythonic data-frame library designed for artificial intelligence.
-It is made to organize your unstructured data into datasets and wrangle it at scale on
-your local machine. Datachain does not abstract or hide the AI models and API calls, but helps to integrate them into the postmodern data stack.
+DataChain is a Python-based AI-data warehouse for transforming and analyzing unstructured
+data like images, audio, videos, text and PDFs. It integrates with external storage
+(e.g., S3) to process data efficiently without data duplication and manages metadata
+in an internal database for easy and efficient querying.
+Use Cases
+=========
+1. **Multimodal Dataset Preparation and Curation**: ideal for organizing and
+   refining data in pre-training, finetuning or LLM evaluating stages.
+2. **GenAI Data Analytics**: Enables advanced analytics for multimodal data and
+   ad-hoc analytics using LLMs.
 Key Features
 ============
-📂 **Storage as a Source of Truth.**
-   - Process unstructured data without redundant copies from S3, GCP, Azure, and local
-     file systems.
-   - Multimodal data support: images, video, text, PDFs, JSONs, CSVs, parquet.
+📂 **Multimodal Dataset Versioning.**
+   - Version unstructured data without redundant data copies, by supporitng
+     references to S3, GCP, Azure, and local file systems.
+   - Multimodal data support: images, video, text, PDFs, JSONs, CSVs, parquet, etc.
    - Unite files and metadata together into persistent, versioned, columnar datasets.
-🐍 **Python-friendly data pipelines.**
-   - Operate on Python objects and object fields.
-   - Built-in parallelization and out-of-memory compute without SQL or Spark.
+🐍 **Python-friendly.**
+   - Operate on Python objects and object fields: float scores, strings, matrixes,
+     LLM response objects.
+   - Run Python code in a high-scale, terabytes size datasets, with built-in
+     parallelization and memory-efficient computing — no SQL or Spark required.
 🧠 **Data Enrichment and Processing.**
    - Generate metadata using local AI models and LLM APIs.
-   - Filter, join, and group by metadata. Search by vector embeddings.
+   - Filter, join, and group datasets by metadata. Search by vector embeddings.
+   - High-performance vectorized operations on Python objects: sum, count, avg, etc.
    - Pass datasets to Pytorch and Tensorflow, or export them back into storage.
-🚀 **Efficiency.**
-   - Parallelization, out-of-memory workloads and data caching.
-   - Vectorized operations on Python object fields: sum, count, avg, etc.
-   - Optimized vector search.
 Quick Start
 -----------
@@ -196,7 +204,7 @@ Batch inference with a simple sentiment model using the `transformers` library:
     pip install transformers
-The code below downloads files the cloud, and applies a user-defined function
+The code below downloads files from the cloud, and applies a user-defined function
 to each one of them. All files with a positive sentiment
 detected are then copied to the local directory.
@@ -429,6 +437,19 @@ name suffix, the following code will do it:
     loader = DataLoader(chain, batch_size=1)
+DataChain Studio Platform
+-------------------------
+`DataChain Studio`_ is a proprietary solution for teams that offers:
+- **Centralized dataset registry** to manage data, code and dependency
+  dependencies in one place.
+- **Data Lineage** for data sources as well as direvative dataset.
+- **UI for Multimodal Data** like images, videos, and PDFs.
+- **Scalable Compute** to handle large datasets (100M+ files) and in-house
+  AI model inference.
+- **Access control** including SSO and team based collaboration.
 Tutorials
 ---------
@@ -462,6 +483,5 @@ Community and Support
 .. _Pydantic: https://github.com/pydantic/pydantic
 .. _publicly available: https://radar.kit.edu/radar/en/dataset/FdJmclKpjHzLfExE.ExpBot%2B-%2BA%2Bdataset%2Bof%2B79%2Bdialogs%2Bwith%2Ban%2Bexperimental%2Bcustomer%2Bservice%2Bchatbot
 .. _SQLite: https://www.sqlite.org/
-.. _Getting Started: https://datachain.dvc.ai/
-.. |Flowchart| image:: https://github.com/iterative/datachain/blob/main/docs/assets/flowchart.png?raw=true
-   :alt: DataChain FlowChart
+.. _Getting Started: https://docs.datachain.ai/
+.. _DataChain Studio: https://studio.datachain.ai/

{datachain-0.6.8 → datachain-0.6.9}/README.rst RENAMED

@@ -19,33 +19,41 @@
    :target: https://github.com/iterative/datachain/actions/workflows/tests.yml
    :alt: Tests
-DataChain is a modern Pythonic data-frame library designed for artificial intelligence.
-It is made to organize your unstructured data into datasets and wrangle it at scale on
-your local machine. Datachain does not abstract or hide the AI models and API calls, but helps to integrate them into the postmodern data stack.
+DataChain is a Python-based AI-data warehouse for transforming and analyzing unstructured
+data like images, audio, videos, text and PDFs. It integrates with external storage
+(e.g., S3) to process data efficiently without data duplication and manages metadata
+in an internal database for easy and efficient querying.
+Use Cases
+=========
+1. **Multimodal Dataset Preparation and Curation**: ideal for organizing and
+   refining data in pre-training, finetuning or LLM evaluating stages.
+2. **GenAI Data Analytics**: Enables advanced analytics for multimodal data and
+   ad-hoc analytics using LLMs.
 Key Features
 ============
-📂 **Storage as a Source of Truth.**
-   - Process unstructured data without redundant copies from S3, GCP, Azure, and local
-     file systems.
-   - Multimodal data support: images, video, text, PDFs, JSONs, CSVs, parquet.
+📂 **Multimodal Dataset Versioning.**
+   - Version unstructured data without redundant data copies, by supporitng
+     references to S3, GCP, Azure, and local file systems.
+   - Multimodal data support: images, video, text, PDFs, JSONs, CSVs, parquet, etc.
    - Unite files and metadata together into persistent, versioned, columnar datasets.
-🐍 **Python-friendly data pipelines.**
-   - Operate on Python objects and object fields.
-   - Built-in parallelization and out-of-memory compute without SQL or Spark.
+🐍 **Python-friendly.**
+   - Operate on Python objects and object fields: float scores, strings, matrixes,
+     LLM response objects.
+   - Run Python code in a high-scale, terabytes size datasets, with built-in
+     parallelization and memory-efficient computing — no SQL or Spark required.
 🧠 **Data Enrichment and Processing.**
    - Generate metadata using local AI models and LLM APIs.
-   - Filter, join, and group by metadata. Search by vector embeddings.
+   - Filter, join, and group datasets by metadata. Search by vector embeddings.
+   - High-performance vectorized operations on Python objects: sum, count, avg, etc.
    - Pass datasets to Pytorch and Tensorflow, or export them back into storage.
-🚀 **Efficiency.**
-   - Parallelization, out-of-memory workloads and data caching.
-   - Vectorized operations on Python object fields: sum, count, avg, etc.
-   - Optimized vector search.
 Quick Start
 -----------
@@ -95,7 +103,7 @@ Batch inference with a simple sentiment model using the `transformers` library:
     pip install transformers
-The code below downloads files the cloud, and applies a user-defined function
+The code below downloads files from the cloud, and applies a user-defined function
 to each one of them. All files with a positive sentiment
 detected are then copied to the local directory.
@@ -328,6 +336,19 @@ name suffix, the following code will do it:
     loader = DataLoader(chain, batch_size=1)
+DataChain Studio Platform
+-------------------------
+`DataChain Studio`_ is a proprietary solution for teams that offers:
+- **Centralized dataset registry** to manage data, code and dependency
+  dependencies in one place.
+- **Data Lineage** for data sources as well as direvative dataset.
+- **UI for Multimodal Data** like images, videos, and PDFs.
+- **Scalable Compute** to handle large datasets (100M+ files) and in-house
+  AI model inference.
+- **Access control** including SSO and team based collaboration.
 Tutorials
 ---------
@@ -361,6 +382,5 @@ Community and Support
 .. _Pydantic: https://github.com/pydantic/pydantic
 .. _publicly available: https://radar.kit.edu/radar/en/dataset/FdJmclKpjHzLfExE.ExpBot%2B-%2BA%2Bdataset%2Bof%2B79%2Bdialogs%2Bwith%2Ban%2Bexperimental%2Bcustomer%2Bservice%2Bchatbot
 .. _SQLite: https://www.sqlite.org/
-.. _Getting Started: https://datachain.dvc.ai/
-.. |Flowchart| image:: https://github.com/iterative/datachain/blob/main/docs/assets/flowchart.png?raw=true
-   :alt: DataChain FlowChart
+.. _Getting Started: https://docs.datachain.ai/
+.. _DataChain Studio: https://studio.datachain.ai/

{datachain-0.6.8 → datachain-0.6.9}/src/datachain/catalog/catalog.py RENAMED

@@ -769,6 +769,7 @@ class Catalog:
         create_rows: Optional[bool] = True,
         validate_version: Optional[bool] = True,
         listing: Optional[bool] = False,
+        uuid: Optional[str] = None,
     ) -> "DatasetRecord":
         """
         Creates new dataset of a specific version.
@@ -816,6 +817,7 @@ class Catalog:
             query_script=query_script,
             create_rows_table=create_rows,
             columns=columns,
+            uuid=uuid,
         )
     def create_new_dataset_version(
@@ -832,6 +834,7 @@ class Catalog:
         script_output="",
         create_rows_table=True,
         job_id: Optional[str] = None,
+        uuid: Optional[str] = None,
     ) -> DatasetRecord:
         """
         Creates dataset version if it doesn't exist.
@@ -855,6 +858,7 @@ class Catalog:
             schema=schema,
             job_id=job_id,
             ignore_if_exists=True,
+            uuid=uuid,
         )
         if create_rows_table:
@@ -1400,6 +1404,7 @@ class Catalog:
             columns=columns,
             feature_schema=remote_dataset_version.feature_schema,
             validate_version=False,
+            uuid=remote_dataset_version.uuid,
         )
         # asking remote to export dataset rows table to s3 and to return signed

{datachain-0.6.8 → datachain-0.6.9}/src/datachain/client/fsspec.py RENAMED

@@ -358,7 +358,7 @@ class Client(ABC):
     ) -> BinaryIO:
         """Open a file, including files in tar archives."""
         if use_cache and (cache_path := self.cache.get_path(file)):
-            return open(cache_path, mode="rb")  # noqa: SIM115
+            return open(cache_path, mode="rb")
         assert not file.location
         return FileWrapper(self.fs.open(self.get_full_path(file.path)), cb)  # type: ignore[return-value]

{datachain-0.6.8 → datachain-0.6.9}/src/datachain/data_storage/metastore.py RENAMED

@@ -138,6 +138,7 @@ class AbstractMetastore(ABC, Serializable):
         size: Optional[int] = None,
         preview: Optional[list[dict]] = None,
         job_id: Optional[str] = None,
+        uuid: Optional[str] = None,
     ) -> DatasetRecord:
         """Creates new dataset version."""
@@ -352,6 +353,7 @@ class AbstractDBMetastore(AbstractMetastore):
         """Datasets versions table columns."""
         return [
             Column("id", Integer, primary_key=True),
+            Column("uuid", Text, nullable=False, default=uuid4()),
             Column(
                 "dataset_id",
                 Integer,
@@ -545,6 +547,7 @@ class AbstractDBMetastore(AbstractMetastore):
         size: Optional[int] = None,
         preview: Optional[list[dict]] = None,
         job_id: Optional[str] = None,
+        uuid: Optional[str] = None,
         conn=None,
     ) -> DatasetRecord:
         """Creates new dataset version."""
@@ -555,6 +558,7 @@ class AbstractDBMetastore(AbstractMetastore):
         query = self._datasets_versions_insert().values(
             dataset_id=dataset.id,
+            uuid=uuid or str(uuid4()),
             version=version,
             status=status,
             feature_schema=json.dumps(feature_schema or {}),
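
Review note: the `uuid=uuid or str(uuid4())` line in the last hunk keeps a caller-supplied identifier and generates a fresh one otherwise. A minimal standalone sketch of that fallback, using only the standard library, follows; `resolve_version_uuid` is a hypothetical helper name and is not part of datachain.

```python
from typing import Optional
from uuid import UUID, uuid4


def resolve_version_uuid(uuid: Optional[str] = None) -> str:
    """Return the given UUID string, or a newly generated one when it is missing.

    Hypothetical helper for illustration only; it mirrors the
    `uuid or str(uuid4())` expression from the diff. Because `or` tests
    truthiness, an empty string is replaced as well as None.
    """
    return uuid or str(uuid4())


given = "20f5a2f1-fc9a-4e36-8b91-5a530f289451"
kept = resolve_version_uuid(given)   # caller-supplied value is preserved
generated = resolve_version_uuid()   # no value supplied, so one is generated
parsed = UUID(generated)             # raises ValueError if not a valid UUID
```

This is the behaviour a pull from a remote would rely on: the remote version's identifier is passed through unchanged, while locally created versions get a new one.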

{datachain-0.6.8 → datachain-0.6.9}/src/datachain/dataset.py RENAMED

@@ -163,6 +163,7 @@ class DatasetStatus:
 @dataclass
 class DatasetVersion:
     id: int
+    uuid: str
     dataset_id: int
     version: int
     status: int
@@ -184,6 +185,7 @@ class DatasetVersion:
     def parse(  # noqa: PLR0913
         cls: type[V],
         id: int,
+        uuid: str,
         dataset_id: int,
         version: int,
         status: int,
@@ -203,6 +205,7 @@ class DatasetVersion:
     ):
         return cls(
             id,
+            uuid,
             dataset_id,
             version,
             status,
@@ -306,6 +309,7 @@ class DatasetRecord:
         query_script: str,
         schema: str,
         version_id: int,
+        version_uuid: str,
         version_dataset_id: int,
         version: int,
         version_status: int,
@@ -331,6 +335,7 @@ class DatasetRecord:
         dataset_version = DatasetVersion.parse(
             version_id,
+            version_uuid,
             version_dataset_id,
             version,
             version_status,

{datachain-0.6.8 → datachain-0.6.9}/src/datachain/lib/dataset_info.py RENAMED

@@ -1,6 +1,7 @@
 import json
 from datetime import datetime
 from typing import TYPE_CHECKING, Any, Optional, Union
+from uuid import uuid4
 from pydantic import Field, field_validator
@@ -15,6 +16,7 @@ if TYPE_CHECKING:
 class DatasetInfo(DataModel):
     name: str
+    uuid: str = Field(default=str(uuid4()))
     version: int = Field(default=1)
     status: int = Field(default=DatasetStatus.CREATED)
     created_at: datetime = Field(default=TIME_ZERO)
@@ -60,6 +62,7 @@ class DatasetInfo(DataModel):
         job: Optional[Job],
     ) -> "Self":
         return cls(
+            uuid=version.uuid,
             name=dataset.name,
             version=version.version,
             status=version.status,
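
Review note: `Field(default=str(uuid4()))` evaluates `uuid4()` once, when the class body runs, so any `DatasetInfo` built without an explicit `uuid` in the same process would share that one value. The `from_models` hunk above always passes `uuid=version.uuid`, so this only matters for direct construction. The sketch below shows the general Python behaviour with a standard-library `dataclasses` analogue rather than the pydantic model from the diff; the class names are hypothetical. pydantic's `Field` likewise accepts a `default_factory` argument.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class EagerDefault:
    # uuid4() runs once, while the class body is executed, so every instance
    # created without an explicit uuid receives the same string.
    uuid: str = field(default=str(uuid4()))


@dataclass
class FactoryDefault:
    # The factory is called on each instantiation, giving a distinct value.
    uuid: str = field(default_factory=lambda: str(uuid4()))


shared = EagerDefault().uuid == EagerDefault().uuid
distinct = FactoryDefault().uuid != FactoryDefault().uuid
```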

{datachain-0.6.8 → datachain-0.6.9}/src/datachain/lib/dc.py RENAMED

@@ -30,7 +30,7 @@ from datachain.client.local import FileClient
 from datachain.dataset import DatasetRecord
 from datachain.lib.convert.python_to_sql import python_to_sql
 from datachain.lib.convert.values_to_tuples import values_to_tuples
-from datachain.lib.data_model import DataModel, DataType, dict_to_data_model
+from datachain.lib.data_model import DataModel, DataType, DataValue, dict_to_data_model
 from datachain.lib.dataset_info import DatasetInfo
 from datachain.lib.file import ArrowRow, File, get_file_type
 from datachain.lib.file import ExportPlacement as FileExportPlacement
@@ -895,7 +895,7 @@ class DataChain:
         2. Group-based UDF function input: Instead of individual rows, the function
            receives a list all rows within each group defined by `partition_by`.
-        Example:
+        Examples:
             ```py
             chain = chain.agg(
                 total=lambda category, amount: [sum(amount)],
@@ -904,6 +904,26 @@ class DataChain:
             )
             chain.save("new_dataset")
             ```
+            An alternative syntax, when you need to specify a more complex function:
+            ```py
+            # It automatically resolves which columns to pass to the function
+            # by looking at the function signature.
+            def agg_sum(
+                file: list[File], amount: list[float]
+            ) -> Iterator[tuple[File, float]]:
+                yield file[0], sum(amount)
+            chain = chain.agg(
+                agg_sum,
+                output={"file": File, "total": float},
+                # Alternative syntax is to use `C` (short for Column) to specify
+                # a column name or a nested column, e.g. C("file.path").
+                partition_by=C("category"),
+            )
+            chain.save("new_dataset")
+            ```
         """
         udf_obj = self._udf_to_obj(Aggregator, func, params, output, signal_map)
         return self._evolve(
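
Review note: the new docstring example describes group-based input, where the function receives one list per column for each partition and yields output rows. The sketch below imitates that calling convention in plain Python, so it runs without datachain. The `aggregate` helper, the dict-based rows, and the use of `str` in place of `File` are assumptions made for illustration and do not reflect how `DataChain.agg` is implemented.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def agg_sum(file: list[str], amount: list[float]) -> Iterator[tuple[str, float]]:
    # Receives whole columns for one group and yields a single output row.
    yield file[0], sum(amount)


def aggregate(
    rows: Iterable[dict], func: Callable, partition_by: str
) -> Iterator[tuple[str, float]]:
    # Hypothetical stand-in for partitioned aggregation: bucket rows by key,
    # then call `func` once per bucket with one list per column.
    groups: dict = defaultdict(list)
    for row in rows:
        groups[row[partition_by]].append(row)
    for group_rows in groups.values():
        files = [r["file"] for r in group_rows]
        amounts = [r["amount"] for r in group_rows]
        yield from func(files, amounts)


rows = [
    {"file": "a.txt", "category": "x", "amount": 1.0},
    {"file": "b.txt", "category": "x", "amount": 2.0},
    {"file": "c.txt", "category": "y", "amount": 5.0},
]
result = list(aggregate(rows, agg_sum, partition_by="category"))
```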
@@ -1242,15 +1262,15 @@ class DataChain:
         return self.results(row_factory=to_dict)
     @overload
-    def collect(self) -> Iterator[tuple[DataType, ...]]: ...
+    def collect(self) -> Iterator[tuple[DataValue, ...]]: ...
     @overload
-    def collect(self, col: str) -> Iterator[DataType]: ...  # type: ignore[overload-overlap]
+    def collect(self, col: str) -> Iterator[DataValue]: ...
     @overload
-    def collect(self, *cols: str) -> Iterator[tuple[DataType, ...]]: ...
+    def collect(self, *cols: str) -> Iterator[tuple[DataValue, ...]]: ...
-    def collect(self, *cols: str) -> Iterator[Union[DataType, tuple[DataType, ...]]]:  # type: ignore[overload-overlap,misc]
+    def collect(self, *cols: str) -> Iterator[Union[DataValue, tuple[DataValue, ...]]]:  # type: ignore[overload-overlap,misc]
         """Yields rows of values, optionally limited to the specified columns.
         Args:
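
Review note: the `collect` overloads in this hunk distinguish a single column (bare values) from zero or several columns (tuples), and the annotation changes from `DataType` to `DataValue`. The sketch below reproduces the overload shape on a toy class. `Rows` and `Value` are hypothetical stand-ins, and the runtime behaviour is inferred from the signatures in the diff rather than taken from datachain's implementation.

```python
from typing import Iterator, Union, overload

# Hypothetical stand-in for DataValue: the kinds of values a row may hold.
Value = Union[int, float, str]


class Rows:
    def __init__(self, columns: dict[str, list[Value]]) -> None:
        self._columns = columns

    @overload
    def collect(self) -> Iterator[tuple[Value, ...]]: ...
    @overload
    def collect(self, col: str) -> Iterator[Value]: ...
    @overload
    def collect(self, *cols: str) -> Iterator[tuple[Value, ...]]: ...

    def collect(self, *cols: str) -> Iterator[Union[Value, tuple[Value, ...]]]:
        # No arguments means "all columns, in definition order".
        names = cols or tuple(self._columns)
        selected = [self._columns[name] for name in names]
        for row in zip(*selected):
            # Exactly one requested column yields bare values, otherwise tuples.
            yield row[0] if len(cols) == 1 else row


rows = Rows({"name": ["a", "b"], "size": [1, 2]})
names = list(rows.collect("name"))
pairs = list(rows.collect("size", "name"))
everything = list(rows.collect())
```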

{datachain-0.6.8 → datachain-0.6.9}/src/datachain/lib/meta_formats.py RENAMED

@@ -114,6 +114,7 @@ def read_meta(  # noqa: C901
             )
         )
         (model_output,) = chain.collect("meta_schema")
+        assert isinstance(model_output, str)
         if print_schema:
             print(f"{model_output}")
         # Below 'spec' should be a dynamically converted DataModel from Pydantic
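
Review note: the added `assert isinstance(model_output, str)` appears to narrow the broader value type that `collect` now returns, so that later code can treat the result as a string. A small standalone sketch of the same unpack-then-narrow pattern follows; `single_schema` is a hypothetical function name.

```python
from typing import Union


def single_schema(values: list[Union[int, float, str]]) -> str:
    # Tuple unpacking requires exactly one element; zero or several elements
    # raise ValueError.
    (model_output,) = values
    # Narrow the union to str for type checkers; this also fails fast at
    # runtime if the value is not a string.
    assert isinstance(model_output, str)
    return model_output


schema_text = single_schema(["class Spec(BaseModel): ..."])
```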

{datachain-0.6.8 → datachain-0.6.9}/src/datachain/lib/signal_schema.py RENAMED

@@ -378,7 +378,7 @@ class SignalSchema:
     def row_to_features(
         self, row: Sequence, catalog: "Catalog", cache: bool = False
-    ) -> list[DataType]:
+    ) -> list[DataValue]:
         res = []
         pos = 0
         for fr_cls in self.values.values():

{datachain-0.6.8 → datachain-0.6.9/src/datachain.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: datachain
-Version: 0.6.8
+Version: 0.6.9
 Summary: Wrangle unstructured AI data at scale
 Author-email: Dmitry Petrov <support@dvc.org>
 License: Apache-2.0
@@ -120,33 +120,41 @@ Requires-Dist: onnx==1.16.1; extra == "examples"
    :target: https://github.com/iterative/datachain/actions/workflows/tests.yml
    :alt: Tests
-DataChain is a modern Pythonic data-frame library designed for artificial intelligence.
-It is made to organize your unstructured data into datasets and wrangle it at scale on
-your local machine. Datachain does not abstract or hide the AI models and API calls, but helps to integrate them into the postmodern data stack.
+DataChain is a Python-based AI-data warehouse for transforming and analyzing unstructured
+data like images, audio, videos, text and PDFs. It integrates with external storage
+(e.g., S3) to process data efficiently without data duplication and manages metadata
+in an internal database for easy and efficient querying.
+Use Cases
+=========
+1. **Multimodal Dataset Preparation and Curation**: ideal for organizing and
+   refining data in pre-training, finetuning or LLM evaluating stages.
+2. **GenAI Data Analytics**: Enables advanced analytics for multimodal data and
+   ad-hoc analytics using LLMs.
 Key Features
 ============
-📂 **Storage as a Source of Truth.**
-   - Process unstructured data without redundant copies from S3, GCP, Azure, and local
-     file systems.
-   - Multimodal data support: images, video, text, PDFs, JSONs, CSVs, parquet.
+📂 **Multimodal Dataset Versioning.**
+   - Version unstructured data without redundant data copies, by supporitng
+     references to S3, GCP, Azure, and local file systems.
+   - Multimodal data support: images, video, text, PDFs, JSONs, CSVs, parquet, etc.
    - Unite files and metadata together into persistent, versioned, columnar datasets.
-🐍 **Python-friendly data pipelines.**
-   - Operate on Python objects and object fields.
-   - Built-in parallelization and out-of-memory compute without SQL or Spark.
+🐍 **Python-friendly.**
+   - Operate on Python objects and object fields: float scores, strings, matrixes,
+     LLM response objects.
+   - Run Python code in a high-scale, terabytes size datasets, with built-in
+     parallelization and memory-efficient computing — no SQL or Spark required.
 🧠 **Data Enrichment and Processing.**
    - Generate metadata using local AI models and LLM APIs.
-   - Filter, join, and group by metadata. Search by vector embeddings.
+   - Filter, join, and group datasets by metadata. Search by vector embeddings.
+   - High-performance vectorized operations on Python objects: sum, count, avg, etc.
    - Pass datasets to Pytorch and Tensorflow, or export them back into storage.
-🚀 **Efficiency.**
-   - Parallelization, out-of-memory workloads and data caching.
-   - Vectorized operations on Python object fields: sum, count, avg, etc.
-   - Optimized vector search.
 Quick Start
 -----------
@@ -196,7 +204,7 @@ Batch inference with a simple sentiment model using the `transformers` library:
     pip install transformers
-The code below downloads files the cloud, and applies a user-defined function
+The code below downloads files from the cloud, and applies a user-defined function
 to each one of them. All files with a positive sentiment
 detected are then copied to the local directory.
@@ -429,6 +437,19 @@ name suffix, the following code will do it:
     loader = DataLoader(chain, batch_size=1)
+DataChain Studio Platform
+-------------------------
+`DataChain Studio`_ is a proprietary solution for teams that offers:
+- **Centralized dataset registry** to manage data, code and dependency
+  dependencies in one place.
+- **Data Lineage** for data sources as well as direvative dataset.
+- **UI for Multimodal Data** like images, videos, and PDFs.
+- **Scalable Compute** to handle large datasets (100M+ files) and in-house
+  AI model inference.
+- **Access control** including SSO and team based collaboration.
 Tutorials
 ---------
@@ -462,6 +483,5 @@ Community and Support
 .. _Pydantic: https://github.com/pydantic/pydantic
 .. _publicly available: https://radar.kit.edu/radar/en/dataset/FdJmclKpjHzLfExE.ExpBot%2B-%2BA%2Bdataset%2Bof%2B79%2Bdialogs%2Bwith%2Ban%2Bexperimental%2Bcustomer%2Bservice%2Bchatbot
 .. _SQLite: https://www.sqlite.org/
-.. _Getting Started: https://datachain.dvc.ai/
-.. |Flowchart| image:: https://github.com/iterative/datachain/blob/main/docs/assets/flowchart.png?raw=true
-   :alt: DataChain FlowChart
+.. _Getting Started: https://docs.datachain.ai/
+.. _DataChain Studio: https://studio.datachain.ai/

{datachain-0.6.8 → datachain-0.6.9}/src/datachain.egg-info/SOURCES.txt RENAMED

@@ -23,7 +23,6 @@ docs/index.md
 docs/assets/captioned_cartoons.png
 docs/assets/datachain-white.svg
 docs/assets/datachain.svg
-docs/assets/flowchart.png
 docs/references/datachain.md
 docs/references/datatype.md
 docs/references/file.md

{datachain-0.6.8 → datachain-0.6.9}/tests/func/test_datasets.py RENAMED

@@ -56,6 +56,7 @@ def test_create_dataset_no_version_specified(cloud_test_catalog, create_rows):
     assert dataset.schema["similarity"] == Float32
     assert dataset_version.schema["similarity"] == Float32
     assert dataset_version.status == DatasetStatus.PENDING
+    assert dataset_version.uuid
     assert dataset.status == DatasetStatus.CREATED  # dataset status is deprecated
     if create_rows:
         assert dataset_version.num_objects == 0
@@ -85,6 +86,7 @@ def test_create_dataset_with_explicit_version(cloud_test_catalog, create_rows):
     assert dataset.schema["similarity"] == Float32
     assert dataset_version.schema["similarity"] == Float32
     assert dataset_version.status == DatasetStatus.PENDING
+    assert dataset_version.uuid
     assert dataset.status == DatasetStatus.CREATED
     if create_rows:
         assert dataset_version.num_objects == 0
@@ -178,6 +180,7 @@ def test_create_dataset_from_sources(listed_bucket, cloud_test_catalog):
     assert dataset_version.error_stack == ""
     assert dataset_version.script_output == ""
     assert dataset_version.sources == f"{src_uri}/dogs/*"
+    assert dataset_version.uuid
     dr = catalog.warehouse.schema.dataset_row_cls
     sys_schema = {c.name: type(c.type) for c in dr.sys_columns()}
@@ -214,6 +217,7 @@ def test_create_dataset_from_sources_dataset(cloud_test_catalog, dogs_dataset):
     assert dataset_version.error_stack == ""
     assert dataset_version.script_output == ""
     assert dataset_version.sources == f"ds://{dogs_dataset.name}"
+    assert dataset_version.uuid
     dr = catalog.warehouse.schema.dataset_row_cls
     sys_schema = {c.name: type(c.type) for c in dr.sys_columns()}

{datachain-0.6.8 → datachain-0.6.9}/tests/func/test_pull.py RENAMED

@@ -13,6 +13,8 @@ from datachain.utils import STUDIO_URL, JSONSerialize
 from tests.data import ENTRIES
 from tests.utils import assert_row_names, skip_if_not_sqlite
+DATASET_UUID = "20f5a2f1-fc9a-4e36-8b91-5a530f289451"
 @pytest.fixture(autouse=True)
 def studio_config():
@@ -90,6 +92,7 @@ def schema():
 def remote_dataset_version(schema, dataset_rows):
     return {
         "id": 1,
+        "uuid": DATASET_UUID,
         "dataset_id": 1,
         "version": 1,
         "status": 4,
@@ -179,6 +182,7 @@ def test_pull_dataset_success(
     assert dataset_version.schema
     assert dataset_version.num_objects == 4
     assert dataset_version.size == 15
+    assert dataset_version.uuid == DATASET_UUID
     assert_row_names(
         catalog,