datachain 0.2.16__tar.gz → 0.2.17__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release. This version of datachain might be problematic.
- {datachain-0.2.16/src/datachain.egg-info → datachain-0.2.17}/PKG-INFO +71 -12
- {datachain-0.2.16 → datachain-0.2.17}/README.rst +70 -11
- {datachain-0.2.16 → datachain-0.2.17}/examples/get_started/json-csv-reader.py +9 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/catalog/catalog.py +47 -44
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/data_storage/db_engine.py +6 -2
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/data_storage/id_generator.py +14 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/data_storage/metastore.py +13 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/data_storage/sqlite.py +45 -6
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/data_storage/warehouse.py +13 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/lib/arrow.py +22 -7
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/lib/dc.py +29 -6
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/lib/file.py +3 -3
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/lib/signal_schema.py +33 -5
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/listing.py +22 -10
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/query/dataset.py +17 -20
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/query/session.py +19 -4
- {datachain-0.2.16 → datachain-0.2.17/src/datachain.egg-info}/PKG-INFO +71 -12
- {datachain-0.2.16 → datachain-0.2.17}/tests/conftest.py +50 -23
- {datachain-0.2.16 → datachain-0.2.17}/tests/func/test_datachain.py +25 -15
- {datachain-0.2.16 → datachain-0.2.17}/tests/func/test_dataset_query.py +43 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/lib/test_arrow.py +0 -17
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/lib/test_datachain.py +308 -157
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/lib/test_datachain_merge.py +24 -20
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/lib/test_feature_utils.py +4 -4
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/lib/test_signal_schema.py +29 -2
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/test_catalog_loader.py +24 -30
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/test_data_storage.py +17 -17
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/test_database_engine.py +9 -11
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/test_id_generator.py +6 -8
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/test_metastore.py +7 -9
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/test_warehouse.py +7 -9
- {datachain-0.2.16 → datachain-0.2.17}/.cruft.json +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/.gitattributes +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/.github/ISSUE_TEMPLATE/bug_report.yml +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/.github/ISSUE_TEMPLATE/empty_issue.md +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/.github/ISSUE_TEMPLATE/feature_request.yml +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/.github/codecov.yaml +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/.github/dependabot.yml +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/.github/workflows/benchmarks.yml +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/.github/workflows/release.yml +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/.github/workflows/tests.yml +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/.github/workflows/update-template.yaml +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/.gitignore +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/.pre-commit-config.yaml +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/CODE_OF_CONDUCT.rst +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/CONTRIBUTING.rst +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/LICENSE +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/docs/assets/captioned_cartoons.png +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/docs/assets/datachain.png +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/docs/assets/flowchart.png +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/docs/index.md +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/docs/references/datachain.md +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/docs/references/datatype.md +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/docs/references/file.md +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/docs/references/index.md +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/docs/references/sql.md +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/docs/references/torch.md +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/docs/references/udf.md +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/computer_vision/blip2_image_desc_lib.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/computer_vision/fashion_product_images/.gitignore +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/computer_vision/fashion_product_images/1-quick-start.ipynb +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/computer_vision/fashion_product_images/2-working-with-image-datachains.ipynb +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/computer_vision/fashion_product_images/3-train-model.ipynb +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/computer_vision/fashion_product_images/4-inference.ipynb +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/computer_vision/fashion_product_images/README.md +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/computer_vision/fashion_product_images/requirements.txt +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/computer_vision/fashion_product_images/scripts/1-quick-start.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/computer_vision/fashion_product_images/scripts/2-basic-operations.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/computer_vision/fashion_product_images/scripts/2-embeddings.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/computer_vision/fashion_product_images/scripts/3-split-train-test.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/computer_vision/fashion_product_images/scripts/3-train-model.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/computer_vision/fashion_product_images/src/clustering.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/computer_vision/fashion_product_images/src/train.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/computer_vision/fashion_product_images/static/images/basic-operations.png +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/computer_vision/fashion_product_images/static/images/core-concepts.png +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/computer_vision/fashion_product_images/static/images/datachain-logo.png +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/computer_vision/fashion_product_images/static/images/datachain-overview.png +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/computer_vision/fashion_product_images/static/images/dataset-1.png +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/computer_vision/fashion_product_images/static/images/dataset-2.png +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/computer_vision/fashion_product_images/static/images/dataset-3.png +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/computer_vision/fashion_product_images/static/images/studio.png +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/computer_vision/iptc_exif_xmp_lib.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/computer_vision/llava2_image_desc_lib.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/computer_vision/openimage-detect.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/get_started/common_sql_functions.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/get_started/json-metadata-tutorial.ipynb +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/get_started/torch-loader.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/get_started/udfs/parallel.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/get_started/udfs/simple.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/get_started/udfs/stateful.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/llm/llm_chatbot_evaluation.ipynb +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/llm_and_nlp/llm-claude-aggregate-query.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/llm_and_nlp/llm-claude-simple-query.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/llm_and_nlp/llm-claude.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/llm_and_nlp/unstructured-text.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/multimodal/clip_fine_tuning.ipynb +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/multimodal/clip_inference.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/multimodal/hf_pipeline.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/multimodal/openai_image_desc_lib.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/multimodal/wds.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/examples/multimodal/wds_filtered.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/mkdocs.yml +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/noxfile.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/pyproject.toml +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/setup.cfg +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/__init__.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/__main__.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/asyn.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/cache.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/catalog/__init__.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/catalog/datasource.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/catalog/loader.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/catalog/subclass.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/cli.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/cli_utils.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/client/__init__.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/client/azure.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/client/fileslice.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/client/fsspec.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/client/gcs.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/client/local.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/client/s3.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/config.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/data_storage/__init__.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/data_storage/job.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/data_storage/schema.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/data_storage/serializer.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/dataset.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/error.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/job.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/lib/__init__.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/lib/clip.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/lib/convert/__init__.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/lib/convert/flatten.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/lib/convert/python_to_sql.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/lib/convert/sql_to_python.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/lib/convert/unflatten.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/lib/convert/values_to_tuples.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/lib/data_model.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/lib/dataset_info.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/lib/image.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/lib/meta_formats.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/lib/model_store.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/lib/pytorch.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/lib/settings.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/lib/text.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/lib/udf.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/lib/udf_signature.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/lib/utils.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/lib/vfile.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/lib/webdataset.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/lib/webdataset_laion.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/node.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/nodes_fetcher.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/nodes_thread_pool.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/progress.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/py.typed +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/query/__init__.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/query/batch.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/query/builtins.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/query/dispatch.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/query/metrics.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/query/params.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/query/schema.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/query/udf.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/remote/__init__.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/remote/studio.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/sql/__init__.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/sql/default/__init__.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/sql/default/base.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/sql/functions/__init__.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/sql/functions/array.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/sql/functions/conditional.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/sql/functions/path.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/sql/functions/random.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/sql/functions/string.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/sql/selectable.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/sql/sqlite/__init__.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/sql/sqlite/base.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/sql/sqlite/types.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/sql/sqlite/vector.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/sql/types.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/sql/utils.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/storage.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/torch/__init__.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain/utils.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain.egg-info/SOURCES.txt +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain.egg-info/dependency_links.txt +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain.egg-info/entry_points.txt +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain.egg-info/requires.txt +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/src/datachain.egg-info/top_level.txt +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/__init__.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/benchmarks/__init__.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/benchmarks/conftest.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/benchmarks/test_ls.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/benchmarks/test_version.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/data.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/examples/__init__.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/examples/test_wds_e2e.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/examples/wds_data.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/func/__init__.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/func/test_catalog.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/func/test_client.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/func/test_datasets.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/func/test_feature_pickling.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/func/test_ls.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/func/test_pull.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/func/test_pytorch.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/func/test_query.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/scripts/feature_class.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/scripts/feature_class_parallel.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/scripts/feature_class_parallel_data_model.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/scripts/name_len_normal.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/scripts/name_len_slow.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/test_cli_e2e.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/test_query_e2e.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/__init__.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/lib/__init__.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/lib/conftest.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/lib/test_clip.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/lib/test_datachain_bootstrap.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/lib/test_feature.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/lib/test_file.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/lib/test_image.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/lib/test_text.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/lib/test_udf_signature.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/lib/test_utils.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/lib/test_webdataset.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/sql/__init__.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/sql/sqlite/__init__.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/sql/sqlite/test_utils.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/sql/test_array.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/sql/test_conditional.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/sql/test_path.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/sql/test_random.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/sql/test_selectable.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/sql/test_string.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/test_asyn.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/test_cache.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/test_catalog.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/test_cli_parsing.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/test_client.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/test_client_s3.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/test_dataset.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/test_dispatch.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/test_fileslice.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/test_listing.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/test_module_exports.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/test_query_metrics.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/test_query_params.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/test_serializer.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/test_session.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/test_storage.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/test_udf.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/unit/test_utils.py +0 -0
- {datachain-0.2.16 → datachain-0.2.17}/tests/utils.py +0 -0
{datachain-0.2.16/src/datachain.egg-info → datachain-0.2.17}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: datachain
-Version: 0.2.16
+Version: 0.2.17
 Summary: Wrangle unstructured AI data at scale
 Author-email: Dmitry Petrov <support@dvc.org>
 License: Apache-2.0
@@ -100,28 +100,87 @@ Requires-Dist: types-requests; extra == "dev"
 AI 🔗 DataChain
 ----------------

-DataChain is
-data
+DataChain is a data-frame library designed for AI-specific scenarios. It helps ML and
+AI engineers build a metadata layer on top of unstructured files and analyze data using
+this layer.

-
+📂 **Raw Files Processing**
+Process raw files (images, video, text, PDFs) directly from storage (S3, GCP, Azure,
+Local), version and update datasets.

-
+🌟 **Metadata layer.**
+Build a metadata layer on top of files using structured sources like CSV, Parquet,
+and JSON files.

-
+⭐ **Metadata enrichment.**
+Enhance the metadata layer with outputs from local ML model inferences and LLM calls.

+🛠️ **Data Transformation.**
+Transform metadata using traditional methods like filtering, grouping, joining, and
+others.

-
-
-
-The typical use cases include Computer Vision data curation, LLM analytics,
-and validation of multimodal AI applications.
+🐍 **User-friendly interface.**
+Operate efficiently with familiar Python objects and object fields, eliminating the
+need for SQL.


 .. code:: console

     $ pip install datachain

-
+
+Data Structures
+===============
+
+DataChain introduces expressive data structures tailored for AI-specific workload:
+
+- **Dataset:** Preserves the file-references and meta-information. Takes care of Python
+  object serialization, dataset versioning and difference. Operations on dataset:
+
+  - **Transformations:** traditional data-frame or SQL operations such as filtering,
+    grouping, joining.
+  - **Enrichments:** mapping, aggregating and generating using customer’s Python
+    code. This is needed to work with ML inference and LLM calls.
+
+- **Chain** is a sequence of operations on datasets. Chain executes operations in lazy
+  mode - only when needed.
+
+DataChain name comes from these major data structures: dataset and chaining.
+
+
+What’s new in DataChain?
+========================
+
+The project combines multiple ideas from different areas in order to simplify AI
+use-cases and at the same time to fit it into traditional data infrastructure.
+
+- **Python-Native for AI.** Utilizes Python instead of SQL for data manipulation as the
+  native language for AI. It’s powered by `Pydantic`_ data models.
+- **Separation of CPU-GPU workloads.** Distinguishes CPU-heavy transformations (filter,
+  group_by, join) from GPU heavy enrichments (ML-inference or LLM calls). That’s mostly
+  needed for distributed computations.
+- **Resuming data processing** (in development). Introduces idempotent operations,
+  allowing data processing to resume from the last successful process file/record/batch
+  if it fails due to issues like failed LLM calls, ML inference or file download.
+
+Additional relatively new ideas:
+
+- **Functional style data processing.** Using a functional/chaining approach to data
+  processing rather than declarative SQL, inspired by R-dplyr and some Python libraries.
+- **Data Versioning.** Treats raw files in cloud storage as the source of truth for data
+  and implements data versioning, extending ideas from DVC (developed by the same team).
+
+
+What DataChain is NOT?
+======================
+
+- **Not a database** (Postgres, MySQL). Instead, it uses databases under the hood:
+  `SQLite`_ in open-source and ClickHouse and other data warehouses for the commercial
+  version.
+- **Not a data processing tool / data warehouse** (Spark, Snowflake, Big Query) since
+  it delegates heavy data transformations to underlying data warehouses and focuses on
+  AI specific data enrichments and orchestrating all the pieces together.
+

 Quick Start
 -----------
{datachain-0.2.16 → datachain-0.2.17}/README.rst

@@ -16,28 +16,87 @@
 AI 🔗 DataChain
 ----------------

-DataChain is
-data
+DataChain is a data-frame library designed for AI-specific scenarios. It helps ML and
+AI engineers build a metadata layer on top of unstructured files and analyze data using
+this layer.

-
+📂 **Raw Files Processing**
+Process raw files (images, video, text, PDFs) directly from storage (S3, GCP, Azure,
+Local), version and update datasets.

-
+🌟 **Metadata layer.**
+Build a metadata layer on top of files using structured sources like CSV, Parquet,
+and JSON files.

-
+⭐ **Metadata enrichment.**
+Enhance the metadata layer with outputs from local ML model inferences and LLM calls.

+🛠️ **Data Transformation.**
+Transform metadata using traditional methods like filtering, grouping, joining, and
+others.

-
-
-
-The typical use cases include Computer Vision data curation, LLM analytics,
-and validation of multimodal AI applications.
+🐍 **User-friendly interface.**
+Operate efficiently with familiar Python objects and object fields, eliminating the
+need for SQL.


 .. code:: console

     $ pip install datachain

-
+
+Data Structures
+===============
+
+DataChain introduces expressive data structures tailored for AI-specific workload:
+
+- **Dataset:** Preserves the file-references and meta-information. Takes care of Python
+  object serialization, dataset versioning and difference. Operations on dataset:
+
+  - **Transformations:** traditional data-frame or SQL operations such as filtering,
+    grouping, joining.
+  - **Enrichments:** mapping, aggregating and generating using customer’s Python
+    code. This is needed to work with ML inference and LLM calls.
+
+- **Chain** is a sequence of operations on datasets. Chain executes operations in lazy
+  mode - only when needed.
+
+DataChain name comes from these major data structures: dataset and chaining.
+
+
+What’s new in DataChain?
+========================
+
+The project combines multiple ideas from different areas in order to simplify AI
+use-cases and at the same time to fit it into traditional data infrastructure.
+
+- **Python-Native for AI.** Utilizes Python instead of SQL for data manipulation as the
+  native language for AI. It’s powered by `Pydantic`_ data models.
+- **Separation of CPU-GPU workloads.** Distinguishes CPU-heavy transformations (filter,
+  group_by, join) from GPU heavy enrichments (ML-inference or LLM calls). That’s mostly
+  needed for distributed computations.
+- **Resuming data processing** (in development). Introduces idempotent operations,
+  allowing data processing to resume from the last successful process file/record/batch
+  if it fails due to issues like failed LLM calls, ML inference or file download.
+
+Additional relatively new ideas:
+
+- **Functional style data processing.** Using a functional/chaining approach to data
+  processing rather than declarative SQL, inspired by R-dplyr and some Python libraries.
+- **Data Versioning.** Treats raw files in cloud storage as the source of truth for data
+  and implements data versioning, extending ideas from DVC (developed by the same team).
+
+
+What DataChain is NOT?
+======================
+
+- **Not a database** (Postgres, MySQL). Instead, it uses databases under the hood:
+  `SQLite`_ in open-source and ClickHouse and other data warehouses for the commercial
+  version.
+- **Not a data processing tool / data warehouse** (Spark, Snowflake, Big Query) since
+  it delegates heavy data transformations to underlying data warehouses and focuses on
+  AI specific data enrichments and orchestrating all the pieces together.
+

 Quick Start
 -----------
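
The rewritten README above describes datasets, lazy chains, and transformations in prose only. As a concrete illustration, a minimal sketch of that chaining style, using only calls that appear elsewhere in this diff (DataChain.from_csv with object_name/nrows, print_schema, show; the import path follows this package's example scripts):

from datachain.lib.dc import DataChain

# Chains are lazy: nothing is parsed until a terminal call such as show().
laion = DataChain.from_csv(
    "gs://datachain-demo/laion-aesthetics-csv/laion_aesthetics_1024_33M_1.csv",
    object_name="laion",  # group the CSV columns under this object name
    nrows=3,              # cap parsing at the first 3 rows (reworked in this release)
)
laion.print_schema()  # inspect the inferred metadata layer
laion.show()          # executing the chain happens here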
{datachain-0.2.16 → datachain-0.2.17}/examples/get_started/json-csv-reader.py

@@ -89,6 +89,15 @@ def main():
     static_csv_ds.print_schema()
     static_csv_ds.show()

+    uri = "gs://datachain-demo/laion-aesthetics-csv/laion_aesthetics_1024_33M_1.csv"
+    print()
+    print("========================================================================")
+    print("dynamic CSV with header schema test parsing 3/3M objects")
+    print("========================================================================")
+    dynamic_csv_ds = DataChain.from_csv(uri, object_name="laion", nrows=3)
+    dynamic_csv_ds.print_schema()
+    dynamic_csv_ds.show()
+

 if __name__ == "__main__":
     main()
{datachain-0.2.16 → datachain-0.2.17}/src/datachain/catalog/catalog.py

@@ -236,36 +236,36 @@ class DatasetRowsFetcher(NodesThreadPool):
         import lz4.frame
         import pandas as pd

-        metastore
-
-
-
-        urls = list(urls)
-        while urls:
-            for url in urls:
-                if self.should_check_for_status():
-                    self.check_for_status()
-
-                r = requests.get(url, timeout=PULL_DATASET_CHUNK_TIMEOUT)
-                if r.status_code == 404:
-                    time.sleep(PULL_DATASET_SLEEP_INTERVAL)
-                    # moving to the next url
-                    continue
+        # metastore and warehouse are not thread safe
+        with self.metastore.clone() as metastore, self.warehouse.clone() as warehouse:
+            dataset = metastore.get_dataset(self.dataset_name)

-
+            urls = list(urls)
+            while urls:
+                for url in urls:
+                    if self.should_check_for_status():
+                        self.check_for_status()

-
+                    r = requests.get(url, timeout=PULL_DATASET_CHUNK_TIMEOUT)
+                    if r.status_code == 404:
+                        time.sleep(PULL_DATASET_SLEEP_INTERVAL)
+                        # moving to the next url
+                        continue

-
+                    r.raise_for_status()

-
-                df = df.drop("sys__id", axis=1)
+                    df = pd.read_parquet(io.BytesIO(lz4.frame.decompress(r.content)))

-
-
-
-
-
+                    self.fix_columns(df)
+
+                    # id will be autogenerated in DB
+                    df = df.drop("sys__id", axis=1)
+
+                    inserted = warehouse.insert_dataset_rows(
+                        df, dataset, self.dataset_version
+                    )
+                    self.increase_counter(inserted)  # type: ignore [arg-type]
+                    urls.remove(url)


 @dataclass
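
The hunk above is the core resource-management change of this release: each worker thread now clones the metastore and warehouse inside a with block, so every clone's connections are closed deterministically when the task finishes. A self-contained sketch of the same pattern with plain sqlite3 (illustrative names, not DataChain's API):

import sqlite3
import threading

class Store:
    """Per-thread clones, because SQLite connections are not thread safe."""

    def __init__(self, db_file: str):
        self.db_file = db_file
        self.conn = sqlite3.connect(db_file)

    def clone(self) -> "Store":
        return Store(self.db_file)  # fresh connection for the calling thread

    def __enter__(self) -> "Store":
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        self.conn.close()  # closes only this clone's connection

def worker(store: Store) -> None:
    # Mirrors DatasetRowsFetcher.do_task: clone inside the thread, close on exit.
    with store.clone() as local:
        local.conn.execute("SELECT 1").fetchone()

threads = [threading.Thread(target=worker, args=(Store(":memory:"),)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()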
@@ -720,7 +720,6 @@ class Catalog:
             client.uri, posixpath.join(prefix, "")
         )
         source_metastore = self.metastore.clone(client.uri)
-        source_warehouse = self.warehouse.clone()

         columns = [
             Column("vtype", String),
@@ -1835,25 +1834,29 @@ class Catalog:
         if signed_urls:
             shuffle(signed_urls)

-
-                self.metastore.clone(),
-                self.warehouse.clone(),
-
-
-
-
-
-
-
-
-                    signed_urls,
-                    math.ceil(len(signed_urls) / PULL_DATASET_MAX_THREADS),
-                ),
-                dataset_save_progress_bar,
+            with (
+                self.metastore.clone() as metastore,
+                self.warehouse.clone() as warehouse,
+            ):
+                rows_fetcher = DatasetRowsFetcher(
+                    metastore,
+                    warehouse,
+                    remote_config,
+                    dataset.name,
+                    version,
+                    schema,
                 )
-
-
-
+                try:
+                    rows_fetcher.run(
+                        batched(
+                            signed_urls,
+                            math.ceil(len(signed_urls) / PULL_DATASET_MAX_THREADS),
+                        ),
+                        dataset_save_progress_bar,
+                    )
+                except:
+                    self.remove_dataset(dataset.name, version)
+                    raise

         dataset = self.metastore.update_dataset_status(
             dataset,
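
The @@ -1835 hunk pairs the new with block with a bare except that removes the partially pulled dataset version before re-raising, so a failed pull cannot leave a half-populated version behind. The shape of that pattern, sketched with placeholder callables (not the catalog's actual signatures):

def pull_version(create, fetch, remove):
    """create/fetch/remove stand in for the catalog calls in the hunk above."""
    version = create()
    try:
        fetch(version)  # may raise on network errors, bad parquet, etc.
    except BaseException:
        remove(version)  # undo the partial state first...
        raise            # ...then let the original error propagate
    return version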
{datachain-0.2.16 → datachain-0.2.17}/src/datachain/data_storage/db_engine.py

@@ -4,7 +4,6 @@ from collections.abc import Iterator
 from typing import TYPE_CHECKING, Any, ClassVar, Optional, Union

 import sqlalchemy as sa
-from attrs import frozen
 from sqlalchemy.sql import FROM_LINTING
 from sqlalchemy.sql.roles import DDLRole

@@ -23,13 +22,18 @@ logger = logging.getLogger("datachain")
 SELECT_BATCH_SIZE = 100_000  # number of rows to fetch at a time


-@frozen
 class DatabaseEngine(ABC, Serializable):
     dialect: ClassVar["Dialect"]

     engine: "Engine"
     metadata: "MetaData"

+    def __enter__(self) -> "DatabaseEngine":
+        return self
+
+    def __exit__(self, exc_type, exc_value, traceback) -> None:
+        self.close()
+
     @abstractmethod
     def clone(self) -> "DatabaseEngine":
         """Clones DatabaseEngine implementation."""
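
Dropping @frozen is what enables the rest of this file's changes: the engine now carries mutable connection state (see the is_closed flag added in sqlite.py below), which an attrs-frozen class would reject on assignment, and __enter__/__exit__ give every engine clone deterministic cleanup. A generic, runnable sketch of the protocol, independent of DataChain:

class ManagedResource:
    """Any close()-able object gains `with` support from this pair of methods."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True

    def __enter__(self) -> "ManagedResource":
        return self  # the name bound by `with ... as r` is the object itself

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        self.close()  # runs on normal exit and when the body raises

with ManagedResource() as r:
    assert not r.closed
assert r.closed  # cleanup ran when the block ended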
{datachain-0.2.16 → datachain-0.2.17}/src/datachain/data_storage/id_generator.py

@@ -33,6 +33,16 @@ class AbstractIDGenerator(ABC, Serializable):
     def cleanup_for_tests(self):
         """Cleanup for tests."""

+    def close(self) -> None:
+        """Closes any active database connections."""
+
+    def close_on_exit(self) -> None:
+        """Closes any active database or HTTP connections, called on Session exit or
+        for test cleanup only, as some ID Generator implementations may handle this
+        differently.
+        """
+        self.close()
+
     @abstractmethod
     def init_id(self, uri: str) -> None:
         """Initializes the ID generator for the given URI with zero last_id."""
@@ -83,6 +93,10 @@ class AbstractDBIDGenerator(AbstractIDGenerator):
     def clone(self) -> "AbstractDBIDGenerator":
         """Clones AbstractIDGenerator implementation."""

+    def close(self) -> None:
+        """Closes any active database connections."""
+        self.db.close()
+
     @property
     def db(self) -> "DatabaseEngine":
         return self._db
{datachain-0.2.16 → datachain-0.2.17}/src/datachain/data_storage/metastore.py

@@ -78,6 +78,13 @@ class AbstractMetastore(ABC, Serializable):
         self.uri = uri
         self.partial_id: Optional[int] = partial_id

+    def __enter__(self) -> "AbstractMetastore":
+        """Returns self upon entering context manager."""
+        return self
+
+    def __exit__(self, exc_type, exc_value, traceback) -> None:
+        """Default behavior is to do nothing, as connections may be shared."""
+
     @abstractmethod
     def clone(
         self,
@@ -97,6 +104,12 @@ class AbstractMetastore(ABC, Serializable):
     def close(self) -> None:
         """Closes any active database or HTTP connections."""

+    def close_on_exit(self) -> None:
+        """Closes any active database or HTTP connections, called on Session exit or
+        for test cleanup only, as some Metastore implementations may handle this
+        differently."""
+        self.close()
+
     def cleanup_tables(self, temp_table_names: list[str]) -> None:
         """Cleanup temp tables."""

{datachain-0.2.16 → datachain-0.2.17}/src/datachain/data_storage/sqlite.py

@@ -15,7 +15,6 @@ from typing import (
 )

 import sqlalchemy
-from attrs import frozen
 from sqlalchemy import MetaData, Table, UniqueConstraint, exists, select
 from sqlalchemy.dialects import sqlite
 from sqlalchemy.schema import CreateIndex, CreateTable, DropTable
@@ -40,6 +39,7 @@ from datachain.utils import DataChainDir

 if TYPE_CHECKING:
     from sqlalchemy.dialects.sqlite import Insert
+    from sqlalchemy.engine.base import Engine
     from sqlalchemy.schema import SchemaItem
     from sqlalchemy.sql.elements import ColumnClause, ColumnElement, TextClause
     from sqlalchemy.sql.selectable import Select
@@ -52,6 +52,8 @@ RETRY_START_SEC = 0.01
 RETRY_MAX_TIMES = 10
 RETRY_FACTOR = 2

+DETECT_TYPES = sqlite3.PARSE_DECLTYPES | sqlite3.PARSE_COLNAMES
+
 Column = Union[str, "ColumnClause[Any]", "TextClause"]

 datachain.sql.sqlite.setup()
@@ -80,26 +82,41 @@ def retry_sqlite_locks(func):
     return wrapper


-@frozen
 class SQLiteDatabaseEngine(DatabaseEngine):
     dialect = sqlite_dialect

     db: sqlite3.Connection
     db_file: Optional[str]
+    is_closed: bool
+
+    def __init__(
+        self,
+        engine: "Engine",
+        metadata: "MetaData",
+        db: sqlite3.Connection,
+        db_file: Optional[str] = None,
+    ):
+        self.engine = engine
+        self.metadata = metadata
+        self.db = db
+        self.db_file = db_file
+        self.is_closed = False

     @classmethod
     def from_db_file(cls, db_file: Optional[str] = None) -> "SQLiteDatabaseEngine":
-
+        return cls(*cls._connect(db_file=db_file))

+    @staticmethod
+    def _connect(db_file: Optional[str] = None):
         try:
             if db_file == ":memory:":
                 # Enable multithreaded usage of the same in-memory db
                 db = sqlite3.connect(
-                    "file::memory:?cache=shared", uri=True, detect_types=
+                    "file::memory:?cache=shared", uri=True, detect_types=DETECT_TYPES
                 )
             else:
                 db = sqlite3.connect(
-                    db_file or DataChainDir.find().db, detect_types=
+                    db_file or DataChainDir.find().db, detect_types=DETECT_TYPES
                 )
             create_user_defined_sql_functions(db)
             engine = sqlalchemy.create_engine(
@@ -118,7 +135,7 @@ class SQLiteDatabaseEngine(DatabaseEngine):

             load_usearch_extension(db)

-            return
+            return engine, MetaData(), db, db_file
         except RuntimeError:
             raise DataChainError("Can't connect to SQLite DB") from None

@@ -138,6 +155,16 @@ class SQLiteDatabaseEngine(DatabaseEngine):
             {},
         )

+    def _reconnect(self) -> None:
+        if not self.is_closed:
+            raise RuntimeError("Cannot reconnect on still-open DB!")
+        engine, metadata, db, db_file = self._connect(db_file=self.db_file)
+        self.engine = engine
+        self.metadata = metadata
+        self.db = db
+        self.db_file = db_file
+        self.is_closed = False
+
     @retry_sqlite_locks
     def execute(
         self,
@@ -145,6 +172,9 @@ class SQLiteDatabaseEngine(DatabaseEngine):
         cursor: Optional[sqlite3.Cursor] = None,
         conn=None,
     ) -> sqlite3.Cursor:
+        if self.is_closed:
+            # Reconnect in case of being closed previously.
+            self._reconnect()
         if cursor is not None:
             result = cursor.execute(*self.compile_to_args(query))
         elif conn is not None:
@@ -179,6 +209,7 @@ class SQLiteDatabaseEngine(DatabaseEngine):

     def close(self) -> None:
         self.db.close()
+        self.is_closed = True

     @contextmanager
     def transaction(self):
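
Taken together, close() flipping is_closed and execute() calling _reconnect() give the SQLite engine a reopen-on-use lifecycle: a Session may close the connection on exit, and any later query transparently reconnects instead of failing on a closed handle. A stripped-down, runnable sketch of that lifecycle (plain sqlite3, illustrative names):

import sqlite3

class ReconnectingEngine:
    def __init__(self, db_file: str = ":memory:"):
        self.db_file = db_file
        self.db = sqlite3.connect(db_file)
        self.is_closed = False

    def close(self) -> None:
        self.db.close()
        self.is_closed = True

    def _reconnect(self) -> None:
        if not self.is_closed:
            raise RuntimeError("Cannot reconnect on still-open DB!")
        self.db = sqlite3.connect(self.db_file)
        self.is_closed = False

    def execute(self, sql: str):
        if self.is_closed:
            self._reconnect()  # reopen lazily instead of raising
        return self.db.execute(sql)

eng = ReconnectingEngine()
eng.close()
eng.execute("SELECT 1")  # reconnects transparently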
@@ -359,6 +390,10 @@ class SQLiteMetastore(AbstractDBMetastore):

         self._init_tables()

+    def __exit__(self, exc_type, exc_value, traceback) -> None:
+        """Close connection upon exit from context manager."""
+        self.close()
+
     def clone(
         self,
         uri: StorageURI = StorageURI(""),
@@ -521,6 +556,10 @@ class SQLiteWarehouse(AbstractWarehouse):

         self.db = db or SQLiteDatabaseEngine.from_db_file(db_file)

+    def __exit__(self, exc_type, exc_value, traceback) -> None:
+        """Close connection upon exit from context manager."""
+        self.close()
+
     def clone(self, use_new_connection: bool = False) -> "SQLiteWarehouse":
         return SQLiteWarehouse(self.id_generator.clone(), db=self.db.clone())

{datachain-0.2.16 → datachain-0.2.17}/src/datachain/data_storage/warehouse.py

@@ -70,6 +70,13 @@ class AbstractWarehouse(ABC, Serializable):
     def __init__(self, id_generator: "AbstractIDGenerator"):
         self.id_generator = id_generator

+    def __enter__(self) -> "AbstractWarehouse":
+        return self
+
+    def __exit__(self, exc_type, exc_value, traceback) -> None:
+        # Default behavior is to do nothing, as connections may be shared.
+        pass
+
     def cleanup_for_tests(self):
         """Cleanup for tests."""

@@ -158,6 +165,12 @@ class AbstractWarehouse(ABC, Serializable):
         """Closes any active database connections."""
         self.db.close()

+    def close_on_exit(self) -> None:
+        """Closes any active database or HTTP connections, called on Session exit or
+        for test cleanup only, as some Warehouse implementations may handle this
+        differently."""
+        self.close()
+
     #
     # Query Tables
     #
{datachain-0.2.16 → datachain-0.2.17}/src/datachain/lib/arrow.py

@@ -1,5 +1,6 @@
 import re
 from collections.abc import Sequence
+from tempfile import NamedTemporaryFile
 from typing import TYPE_CHECKING, Optional

 import pyarrow as pa
@@ -43,13 +44,17 @@ class ArrowGenerator(Generator):
         self.kwargs = kwargs

     def process(self, file: File):
-
-
-
-
+        if self.nrows:
+            path = _nrows_file(file, self.nrows)
+            ds = dataset(path, schema=self.input_schema, **self.kwargs)
+        else:
+            path = file.get_path()
+            ds = dataset(
+                path, filesystem=file.get_fs(), schema=self.input_schema, **self.kwargs
+            )
         index = 0
         with tqdm(desc="Parsed by pyarrow", unit=" rows") as pbar:
-            for record_batch in ds.to_batches(
+            for record_batch in ds.to_batches():
                 for record in record_batch.to_pylist():
                     vals = list(record.values())
                     if self.output_schema:
@@ -60,8 +65,6 @@ class ArrowGenerator(Generator):
                 else:
                     yield vals
                 index += 1
-                if self.nrows and index >= self.nrows:
-                    return
             pbar.update(len(record_batch))


@@ -125,3 +128,15 @@ def _arrow_type_mapper(col_type: pa.DataType) -> type:  # noqa: PLR0911
     if isinstance(col_type, pa.lib.DictionaryType):
         return _arrow_type_mapper(col_type.value_type)  # type: ignore[return-value]
     raise TypeError(f"{col_type!r} datatypes not supported")
+
+
+def _nrows_file(file: File, nrows: int) -> str:
+    tf = NamedTemporaryFile(delete=False)
+    with file.open(mode="r") as reader:
+        with open(tf.name, "a") as writer:
+            for row, line in enumerate(reader):
+                if row >= nrows:
+                    break
+                writer.write(line)
+                writer.write("\n")
+    return tf.name