datachain 0.3.8__tar.gz → 0.3.9__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release.
This version of datachain might be problematic.
- {datachain-0.3.8 → datachain-0.3.9}/.github/workflows/tests.yml +1 -1
- {datachain-0.3.8 → datachain-0.3.9}/.pre-commit-config.yaml +1 -1
- {datachain-0.3.8/src/datachain.egg-info → datachain-0.3.9}/PKG-INFO +12 -13
- {datachain-0.3.8 → datachain-0.3.9}/README.rst +11 -12
- {datachain-0.3.8 → datachain-0.3.9}/examples/llm_and_nlp/unstructured-text.py +1 -1
- {datachain-0.3.8 → datachain-0.3.9}/examples/multimodal/wds_filtered.py +1 -3
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/catalog/catalog.py +2 -11
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/arrow.py +1 -1
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/dc.py +41 -10
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/webdataset.py +1 -1
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/query/dataset.py +14 -6
- {datachain-0.3.8 → datachain-0.3.9/src/datachain.egg-info}/PKG-INFO +12 -13
- {datachain-0.3.8 → datachain-0.3.9}/tests/func/test_catalog.py +30 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/lib/test_datachain.py +35 -0
- {datachain-0.3.8 → datachain-0.3.9}/.cruft.json +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/.gitattributes +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/.github/ISSUE_TEMPLATE/bug_report.yml +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/.github/ISSUE_TEMPLATE/empty_issue.md +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/.github/ISSUE_TEMPLATE/feature_request.yml +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/.github/codecov.yaml +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/.github/dependabot.yml +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/.github/workflows/benchmarks.yml +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/.github/workflows/release.yml +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/.github/workflows/tests-studio.yml +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/.github/workflows/update-template.yaml +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/.gitignore +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/CODE_OF_CONDUCT.rst +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/CONTRIBUTING.rst +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/LICENSE +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/docs/assets/captioned_cartoons.png +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/docs/assets/datachain.png +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/docs/assets/flowchart.png +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/docs/index.md +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/docs/references/datachain.md +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/docs/references/datatype.md +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/docs/references/file.md +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/docs/references/index.md +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/docs/references/sql.md +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/docs/references/torch.md +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/docs/references/udf.md +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/examples/computer_vision/iptc_exif_xmp_lib.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/examples/computer_vision/llava2_image_desc_lib.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/examples/computer_vision/openimage-detect.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/examples/get_started/common_sql_functions.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/examples/get_started/json-csv-reader.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/examples/get_started/torch-loader.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/examples/get_started/udfs/parallel.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/examples/get_started/udfs/simple.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/examples/get_started/udfs/stateful.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/examples/llm_and_nlp/claude-query.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/examples/multimodal/clip_inference.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/examples/multimodal/hf_pipeline.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/examples/multimodal/openai_image_desc_lib.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/examples/multimodal/wds.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/mkdocs.yml +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/noxfile.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/pyproject.toml +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/setup.cfg +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/__init__.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/__main__.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/asyn.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/cache.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/catalog/__init__.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/catalog/datasource.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/catalog/loader.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/catalog/subclass.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/cli.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/cli_utils.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/client/__init__.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/client/azure.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/client/fileslice.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/client/fsspec.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/client/gcs.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/client/local.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/client/s3.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/config.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/data_storage/__init__.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/data_storage/db_engine.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/data_storage/id_generator.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/data_storage/job.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/data_storage/metastore.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/data_storage/schema.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/data_storage/serializer.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/data_storage/sqlite.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/data_storage/warehouse.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/dataset.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/error.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/job.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/__init__.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/clip.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/convert/__init__.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/convert/flatten.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/convert/python_to_sql.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/convert/sql_to_python.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/convert/unflatten.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/convert/values_to_tuples.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/data_model.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/dataset_info.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/file.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/hf.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/image.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/listing.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/meta_formats.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/model_store.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/pytorch.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/settings.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/signal_schema.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/text.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/udf.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/udf_signature.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/utils.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/vfile.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/webdataset_laion.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/listing.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/node.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/nodes_fetcher.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/nodes_thread_pool.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/progress.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/py.typed +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/query/__init__.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/query/batch.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/query/builtins.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/query/dispatch.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/query/metrics.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/query/params.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/query/queue.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/query/schema.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/query/session.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/query/udf.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/remote/__init__.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/remote/studio.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/sql/__init__.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/sql/default/__init__.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/sql/default/base.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/sql/functions/__init__.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/sql/functions/array.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/sql/functions/conditional.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/sql/functions/path.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/sql/functions/random.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/sql/functions/string.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/sql/selectable.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/sql/sqlite/__init__.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/sql/sqlite/base.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/sql/sqlite/types.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/sql/sqlite/vector.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/sql/types.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/sql/utils.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/storage.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/torch/__init__.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain/utils.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain.egg-info/SOURCES.txt +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain.egg-info/dependency_links.txt +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain.egg-info/entry_points.txt +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain.egg-info/requires.txt +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/src/datachain.egg-info/top_level.txt +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/__init__.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/benchmarks/__init__.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/benchmarks/conftest.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/benchmarks/datasets/.dvc/.gitignore +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/benchmarks/datasets/.dvc/config +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/benchmarks/datasets/.gitignore +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/benchmarks/datasets/laion-tiny.npz.dvc +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/benchmarks/test_datachain.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/benchmarks/test_ls.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/benchmarks/test_version.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/conftest.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/data.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/examples/__init__.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/examples/test_examples.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/examples/test_wds_e2e.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/examples/wds_data.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/func/__init__.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/func/test_client.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/func/test_datachain.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/func/test_dataset_query.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/func/test_datasets.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/func/test_feature_pickling.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/func/test_listing.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/func/test_ls.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/func/test_pull.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/func/test_pytorch.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/func/test_query.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/scripts/feature_class.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/scripts/feature_class_parallel.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/scripts/feature_class_parallel_data_model.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/scripts/name_len_slow.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/test_cli_e2e.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/test_query_e2e.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/__init__.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/lib/__init__.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/lib/conftest.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/lib/test_arrow.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/lib/test_clip.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/lib/test_datachain_bootstrap.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/lib/test_datachain_merge.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/lib/test_feature.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/lib/test_feature_utils.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/lib/test_file.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/lib/test_hf.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/lib/test_image.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/lib/test_schema.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/lib/test_signal_schema.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/lib/test_sql_to_python.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/lib/test_text.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/lib/test_udf_signature.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/lib/test_utils.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/lib/test_webdataset.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/sql/__init__.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/sql/sqlite/__init__.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/sql/sqlite/test_utils.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/sql/test_array.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/sql/test_conditional.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/sql/test_path.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/sql/test_random.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/sql/test_selectable.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/sql/test_string.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/test_asyn.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/test_cache.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/test_catalog.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/test_catalog_loader.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/test_cli_parsing.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/test_client.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/test_client_s3.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/test_data_storage.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/test_database_engine.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/test_dataset.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/test_dispatch.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/test_fileslice.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/test_id_generator.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/test_listing.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/test_metastore.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/test_module_exports.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/test_query_metrics.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/test_query_params.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/test_serializer.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/test_session.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/test_storage.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/test_udf.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/test_utils.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/unit/test_warehouse.py +0 -0
- {datachain-0.3.8 → datachain-0.3.9}/tests/utils.py +0 -0
{datachain-0.3.8/src/datachain.egg-info → datachain-0.3.9}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: datachain
-Version: 0.3.8
+Version: 0.3.9
 Summary: Wrangle unstructured AI data at scale
 Author-email: Dmitry Petrov <support@dvc.org>
 License: Apache-2.0
@@ -115,31 +115,30 @@ AI 🔗 DataChain
 
 DataChain is a modern Pythonic data-frame library designed for artificial intelligence.
 It is made to organize your unstructured data into datasets and wrangle it at scale on
-your local machine.
+your local machine. Datachain does not abstract or hide the AI models and API calls, but helps to integrate them into the postmodern data stack.
 
 Key Features
 ============
 
 📂 **Storage as a Source of Truth.**
-  - Process unstructured data without redundant copies
+  - Process unstructured data without redundant copies from S3, GCP, Azure, and local
     file systems.
-  - Multimodal data: images, video, text, PDFs, JSONs, CSVs, parquet.
-
+  - Multimodal data support: images, video, text, PDFs, JSONs, CSVs, parquet.
+  - Unite files and metadata together into persistent, versioned, columnar datasets.
 
 🐍 **Python-friendly data pipelines.**
   - Operate on Python objects and object fields.
-  - Built-in parallelization and out-of-memory compute without
-    Spark jobs.
+  - Built-in parallelization and out-of-memory compute without SQL or Spark.
 
 🧠 **Data Enrichment and Processing.**
-  - Generate metadata
-  - Filter, join, and group by
-  - Pass datasets to Pytorch and Tensorflow, or export back into storage.
+  - Generate metadata using local AI models and LLM APIs.
+  - Filter, join, and group by metadata. Search by vector embeddings.
+  - Pass datasets to Pytorch and Tensorflow, or export them back into storage.
 
 🚀 **Efficiency.**
   - Parallelization, out-of-memory workloads and data caching.
   - Vectorized operations on Python object fields: sum, count, avg, etc.
-
+  - Optimized vector search.
 
 
 Quick Start
@@ -164,7 +163,7 @@ where each image has a matching JSON file like `cat.1009.json`:
     "inference": {"class": "dog", "confidence": 0.68}
 }
 
-Example of downloading only high-confidence cat images using JSON metadata:
+Example of downloading only "high-confidence cat" inferred images using JSON metadata:
 
 
 .. code:: py
@@ -234,7 +233,7 @@ detected are then copied to the local directory.
 LLM judging chatbots
 =============================
 
-LLMs can work as
+LLMs can work as universal classifiers. In the example below,
 we employ a free API from Mistral to judge the `publicly available`_ chatbot dialogs. Please get a free
 Mistral API key at https://console.mistral.ai
 
{datachain-0.3.8 → datachain-0.3.9}/README.rst

@@ -18,31 +18,30 @@ AI 🔗 DataChain
 
 DataChain is a modern Pythonic data-frame library designed for artificial intelligence.
 It is made to organize your unstructured data into datasets and wrangle it at scale on
-your local machine.
+your local machine. Datachain does not abstract or hide the AI models and API calls, but helps to integrate them into the postmodern data stack.
 
 Key Features
 ============
 
 📂 **Storage as a Source of Truth.**
-  - Process unstructured data without redundant copies
+  - Process unstructured data without redundant copies from S3, GCP, Azure, and local
    file systems.
-  - Multimodal data: images, video, text, PDFs, JSONs, CSVs, parquet.
-
+  - Multimodal data support: images, video, text, PDFs, JSONs, CSVs, parquet.
+  - Unite files and metadata together into persistent, versioned, columnar datasets.
 
 🐍 **Python-friendly data pipelines.**
   - Operate on Python objects and object fields.
-  - Built-in parallelization and out-of-memory compute without
-    Spark jobs.
+  - Built-in parallelization and out-of-memory compute without SQL or Spark.
 
 🧠 **Data Enrichment and Processing.**
-  - Generate metadata
-  - Filter, join, and group by
-  - Pass datasets to Pytorch and Tensorflow, or export back into storage.
+  - Generate metadata using local AI models and LLM APIs.
+  - Filter, join, and group by metadata. Search by vector embeddings.
+  - Pass datasets to Pytorch and Tensorflow, or export them back into storage.
 
 🚀 **Efficiency.**
   - Parallelization, out-of-memory workloads and data caching.
   - Vectorized operations on Python object fields: sum, count, avg, etc.
-
+  - Optimized vector search.
 
 
 Quick Start
@@ -67,7 +66,7 @@ where each image has a matching JSON file like `cat.1009.json`:
     "inference": {"class": "dog", "confidence": 0.68}
 }
 
-Example of downloading only high-confidence cat images using JSON metadata:
+Example of downloading only "high-confidence cat" inferred images using JSON metadata:
 
 
 .. code:: py
@@ -137,7 +136,7 @@ detected are then copied to the local directory.
 LLM judging chatbots
 =============================
 
-LLMs can work as
+LLMs can work as universal classifiers. In the example below,
 we employ a free API from Mistral to judge the `publicly available`_ chatbot dialogs. Please get a free
 Mistral API key at https://console.mistral.ai
 
{datachain-0.3.8 → datachain-0.3.9}/examples/multimodal/wds_filtered.py

@@ -1,13 +1,11 @@
 import datachain.error
 from datachain import C, DataChain
-from datachain.lib.model_store import ModelStore
 from datachain.lib.webdataset import process_webdataset
-from datachain.lib.webdataset_laion import
+from datachain.lib.webdataset_laion import WDSLaion
 from datachain.sql import literal
 from datachain.sql.functions import array, greatest, least, string
 
 name = "wds"
-ModelStore.register(LaionMeta)
 try:
     wds = DataChain.from_dataset(name=name)
 except datachain.error.DatasetNotFoundError:
{datachain-0.3.8 → datachain-0.3.9}/src/datachain/catalog/catalog.py

@@ -1560,17 +1560,8 @@ class Catalog:
         version = self.get_dataset(dataset_name).get_version(dataset_version)
 
         file_signals_values = {}
-        file_schemas = {}
-        # TODO: To remove after we properly fix deserialization
-        for signal, type_name in version.feature_schema.items():
-            from datachain.lib.model_store import ModelStore
 
-
-            fr = ModelStore.get(type_name_parsed, v)
-            if fr and issubclass(fr, File):
-                file_schemas[signal] = type_name
-
-        schema = SignalSchema.deserialize(file_schemas)
+        schema = SignalSchema.deserialize(version.feature_schema)
         for file_signals in schema.get_signals(File):
             prefix = file_signals.replace(".", DEFAULT_DELIMITER) + DEFAULT_DELIMITER
             file_signals_values[file_signals] = {
@@ -1916,7 +1907,7 @@ class Catalog:
         """
         from datachain.query.dataset import ExecutionResult
 
-        feature_file = tempfile.NamedTemporaryFile(
+        feature_file = tempfile.NamedTemporaryFile(  # noqa: SIM115
            dir=os.getcwd(), suffix=".py", delete=False
        )
        _, feature_module = os.path.split(feature_file.name)
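With this change `get_file_signals` deserializes the stored feature schema as a whole (including the `_custom_types` entry exercised by the new test in tests/func/test_catalog.py below) and then walks the `File` signals it contains, stripping each signal's column prefix from the flattened row. A standalone sketch of that prefix handling, assuming the `__` delimiter visible in the new test's row keys (the helper is hypothetical, not part of the Catalog API):

    # Hypothetical helper: flattened rows store nested File fields as
    # "<signal>__<field>" columns, e.g. "f1__source".
    def file_signal_values(row: dict, signal: str, delimiter: str = "__") -> dict:
        prefix = signal.replace(".", delimiter) + delimiter
        return {k[len(prefix):]: v for k, v in row.items() if k.startswith(prefix)}

    row = {
        "name": "Jon",
        "f1__source": "s3://first_bucket",
        "f1__name": "image1.jpg",
    }
    print(file_signal_values(row, "f1"))  # {'source': 's3://first_bucket', 'name': 'image1.jpg'}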
{datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/arrow.py

@@ -131,7 +131,7 @@ def arrow_type_mapper(col_type: pa.DataType) -> type:  # noqa: PLR0911
 
 
 def _nrows_file(file: File, nrows: int) -> str:
-    tf = NamedTemporaryFile(delete=False)
+    tf = NamedTemporaryFile(delete=False)  # noqa: SIM115
     with file.open(mode="r") as reader:
         with open(tf.name, "a") as writer:
             for row, line in enumerate(reader):
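The `# noqa: SIM115` markers added here and in catalog.py and webdataset.py silence ruff's SIM115 rule ("use a context manager for opening files"). The rule is suppressed because these file objects are deliberately kept open or handed off beyond the enclosing block, as in this illustrative sketch (not datachain code):

    from tempfile import NamedTemporaryFile

    def make_scratch_file() -> str:
        # The file must outlive this function, so it cannot be wrapped in `with`.
        tf = NamedTemporaryFile(delete=False)  # noqa: SIM115
        tf.write(b"scratch data")
        tf.close()
        return tf.name  # the caller removes the file when done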
{datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/dc.py

@@ -1153,17 +1153,35 @@ class DataChain(DatasetQuery):
         self,
         other: "DataChain",
         on: Optional[Union[str, Sequence[str]]] = None,
+        right_on: Optional[Union[str, Sequence[str]]] = None,
     ) -> "Self":
         """Remove rows that appear in another chain.
 
         Parameters:
             other: chain whose rows will be removed from `self`
-            on: columns to consider for determining row equality
-                defaults to all common columns
+            on: columns to consider for determining row equality in `self`.
+                If unspecified, defaults to all common columns
+                between `self` and `other`.
+            right_on: columns to consider for determining row equality in `other`.
+                If unspecified, defaults to the same values as `on`.
         """
         if isinstance(on, str):
+            if not on:
+                raise DataChainParamsError("'on' cannot be an empty string")
             on = [on]
-
+        elif isinstance(on, Sequence):
+            if not on or any(not col for col in on):
+                raise DataChainParamsError("'on' cannot contain empty strings")
+
+        if isinstance(right_on, str):
+            if not right_on:
+                raise DataChainParamsError("'right_on' cannot be an empty string")
+            right_on = [right_on]
+        elif isinstance(right_on, Sequence):
+            if not right_on or any(not col for col in right_on):
+                raise DataChainParamsError("'right_on' cannot contain empty strings")
+
+        if on is None and right_on is None:
             other_columns = set(other._effective_signals_schema.db_signals())
             signals = [
                 c
@@ -1172,16 +1190,29 @@ class DataChain(DatasetQuery):
             ]
             if not signals:
                 raise DataChainParamsError("subtract(): no common columns")
-        elif not
-
-
-
-        elif not on:
+        elif on is not None and right_on is None:
+            right_on = on
+            signals = list(self.signals_schema.resolve(*on).db_signals())
+        elif on is None and right_on is not None:
             raise DataChainParamsError(
-                "'on'
+                "'on' must be specified when 'right_on' is provided"
             )
         else:
-
+            if not isinstance(on, Sequence) or not isinstance(right_on, Sequence):
+                raise TypeError(
+                    "'on' and 'right_on' must be 'str' or 'Sequence' object"
+                )
+            if len(on) != len(right_on):
+                raise DataChainParamsError(
+                    "'on' and 'right_on' must have the same length"
+                )
+            signals = list(
+                zip(
+                    self.signals_schema.resolve(*on).db_signals(),
+                    other.signals_schema.resolve(*right_on).db_signals(),
+                )  # type: ignore[arg-type]
+            )
+
         return super()._subtract(other, signals)  # type: ignore[arg-type]
 
     @classmethod
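Together these two hunks let `subtract` match rows on columns that are named differently in the two chains: `on` names the columns in the left chain, `right_on` the corresponding columns in the right chain. A minimal usage sketch mirroring the new unit test further down (the chains and values are illustrative; `from_values` is assumed to pick up a default session when none is passed):

    from datachain import DataChain

    left = DataChain.from_values(d=[1, 2, 3], e=["x", "y", "z"])
    right = DataChain.from_values(a=[1, 2], b=["x", "y"])

    # Drop rows of `left` whose `d` value also appears as `a` in `right`.
    remaining = left.subtract(right, on="d", right_on="a")
    print(list(remaining.collect()))  # expected: [(3, 'z')], as asserted in the new test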
{datachain-0.3.8 → datachain-0.3.9}/src/datachain/lib/webdataset.py

@@ -222,7 +222,7 @@ class TarStream(File):
         self._tar = None
 
     def open(self):
-        self._tar = tarfile.open(fileobj=super().open())
+        self._tar = tarfile.open(fileobj=super().open())  # noqa: SIM115
         return self
 
     def getmembers(self) -> list[tarfile.TarInfo]:
{datachain-0.3.8 → datachain-0.3.9}/src/datachain/query/dataset.py

@@ -296,15 +296,23 @@ class DatasetDiffOperation(Step):
 
 @frozen
 class Subtract(DatasetDiffOperation):
-    on: Sequence[str]
+    on: Sequence[tuple[str, str]]
 
     def query(self, source_query: Select, target_query: Select) -> sa.Selectable:
         sq = source_query.alias("source_query")
         tq = target_query.alias("target_query")
         where_clause = sa.and_(
-
-
-
+            *[
+                getattr(
+                    sq.c, col_name[0] if isinstance(col_name, tuple) else col_name
+                ).is_not_distinct_from(
+                    getattr(
+                        tq.c, col_name[1] if isinstance(col_name, tuple) else col_name
+                    )
+                )
+                for col_name in self.on
+            ]
+        )
         return sq.select().except_(sq.select().where(where_clause))
 
 
@@ -1571,10 +1579,10 @@ class DatasetQuery:
 
     @detach
     def subtract(self, dq: "DatasetQuery") -> "Self":
-        return self._subtract(dq, on=["source", "path"])
+        return self._subtract(dq, on=[("source", "source"), ("path", "path")])
 
     @detach
-    def _subtract(self, dq: "DatasetQuery", on: Sequence[str]) -> "Self":
+    def _subtract(self, dq: "DatasetQuery", on: Sequence[tuple[str, str]]) -> "Self":
         query = self.clone()
         query.steps.append(Subtract(dq, self.catalog, on=on))
         return query
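The `Subtract` step now stores `(source column, target column)` pairs and builds a NULL-safe anti-join: source rows whose paired columns match some target row are removed via `EXCEPT`. A standalone SQLAlchemy sketch of that pattern (the tables, columns, and `on` pairs below are made up for illustration; only `is_not_distinct_from` and `except_` mirror the code above):

    import sqlalchemy as sa

    metadata = sa.MetaData()
    source = sa.Table("source", metadata, sa.Column("d", sa.Integer), sa.Column("e", sa.String))
    target = sa.Table("target", metadata, sa.Column("a", sa.Integer), sa.Column("b", sa.String))

    on = [("d", "a")]  # (source column, target column) pairs, as Subtract.on now stores them

    sq = source.select().subquery("source_query")
    tq = target.select().subquery("target_query")

    # NULL-safe equality per pair: is_not_distinct_from() treats NULL as equal to NULL,
    # unlike a plain == comparison.
    where_clause = sa.and_(
        *[getattr(sq.c, s).is_not_distinct_from(getattr(tq.c, t)) for s, t in on]
    )

    # Anti-join: every source row EXCEPT those that have a matching target row.
    query = sq.select().except_(sq.select().where(where_clause))
    print(query)  # prints the generated SELECT ... EXCEPT SELECT ... statement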
{datachain-0.3.8 → datachain-0.3.9/src/datachain.egg-info}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: datachain
-Version: 0.3.8
+Version: 0.3.9
 Summary: Wrangle unstructured AI data at scale
 Author-email: Dmitry Petrov <support@dvc.org>
 License: Apache-2.0
@@ -115,31 +115,30 @@ AI 🔗 DataChain
 
 DataChain is a modern Pythonic data-frame library designed for artificial intelligence.
 It is made to organize your unstructured data into datasets and wrangle it at scale on
-your local machine.
+your local machine. Datachain does not abstract or hide the AI models and API calls, but helps to integrate them into the postmodern data stack.
 
 Key Features
 ============
 
 📂 **Storage as a Source of Truth.**
-  - Process unstructured data without redundant copies
+  - Process unstructured data without redundant copies from S3, GCP, Azure, and local
    file systems.
-  - Multimodal data: images, video, text, PDFs, JSONs, CSVs, parquet.
-
+  - Multimodal data support: images, video, text, PDFs, JSONs, CSVs, parquet.
+  - Unite files and metadata together into persistent, versioned, columnar datasets.
 
 🐍 **Python-friendly data pipelines.**
   - Operate on Python objects and object fields.
-  - Built-in parallelization and out-of-memory compute without
-    Spark jobs.
+  - Built-in parallelization and out-of-memory compute without SQL or Spark.
 
 🧠 **Data Enrichment and Processing.**
-  - Generate metadata
-  - Filter, join, and group by
-  - Pass datasets to Pytorch and Tensorflow, or export back into storage.
+  - Generate metadata using local AI models and LLM APIs.
+  - Filter, join, and group by metadata. Search by vector embeddings.
+  - Pass datasets to Pytorch and Tensorflow, or export them back into storage.
 
 🚀 **Efficiency.**
   - Parallelization, out-of-memory workloads and data caching.
   - Vectorized operations on Python object fields: sum, count, avg, etc.
-
+  - Optimized vector search.
 
 
 Quick Start
@@ -164,7 +163,7 @@ where each image has a matching JSON file like `cat.1009.json`:
     "inference": {"class": "dog", "confidence": 0.68}
 }
 
-Example of downloading only high-confidence cat images using JSON metadata:
+Example of downloading only "high-confidence cat" inferred images using JSON metadata:
 
 
 .. code:: py
@@ -234,7 +233,7 @@ detected are then copied to the local directory.
 LLM judging chatbots
 =============================
 
-LLMs can work as
+LLMs can work as universal classifiers. In the example below,
 we employ a free API from Mistral to judge the `publicly available`_ chatbot dialogs. Please get a free
 Mistral API key at https://console.mistral.ai
 
{datachain-0.3.8 → datachain-0.3.9}/tests/func/test_catalog.py

@@ -1151,6 +1151,36 @@ def test_get_file_signals(cloud_test_catalog, dogs_dataset):
     }
 
 
+def test_get_file_signals_with_custom_types(cloud_test_catalog, dogs_dataset):
+    catalog = cloud_test_catalog.catalog
+    catalog.metastore.update_dataset_version(
+        dogs_dataset,
+        1,
+        feature_schema={
+            "name": "str",
+            "age": "str",
+            "f1": "File@v1",
+            "f2": "File@v1",
+            "_custom_types": {
+                "File@v1": {"source": "str", "name": "str"},
+            },
+        },
+    )
+    row = {
+        "name": "Jon",
+        "age": 25,
+        "f1__source": "s3://first_bucket",
+        "f1__name": "image1.jpg",
+        "f2__source": "s3://second_bucket",
+        "f2__name": "image2.jpg",
+    }
+
+    assert catalog.get_file_signals(dogs_dataset.name, 1, row) == {
+        "source": "s3://first_bucket",
+        "name": "image1.jpg",
+    }
+
+
 def test_get_file_signals_no_signals(cloud_test_catalog, dogs_dataset):
     catalog = cloud_test_catalog.catalog
     catalog.metastore.update_dataset_version(
{datachain-0.3.8 → datachain-0.3.9}/tests/unit/lib/test_datachain.py

@@ -1504,6 +1504,11 @@ def test_subtract(test_session):
     assert set(chain1.subtract(chain3, on="a").collect()) == {(2, "z")}
     assert set(chain1.subtract(chain3).collect()) == {(2, "z")}
 
+    chain4 = DataChain.from_values(d=[1, 2, 3], e=["x", "y", "z"], session=test_session)
+    chain5 = DataChain.from_values(a=[1, 2], b=["x", "y"], session=test_session)
+
+    assert set(chain4.subtract(chain5, on="d", right_on="a").collect()) == {(3, "z")}
+
 
 def test_subtract_error(test_session):
     chain1 = DataChain.from_values(a=[1, 1, 2], b=["x", "y", "z"], session=test_session)
@@ -1513,6 +1518,36 @@ def test_subtract_error(test_session):
     with pytest.raises(TypeError):
         chain1.subtract(chain2, on=42)
 
+    with pytest.raises(DataChainParamsError):
+        chain1.subtract(chain2, on="")
+
+    with pytest.raises(DataChainParamsError):
+        chain1.subtract(chain2, on="a", right_on="")
+
+    with pytest.raises(DataChainParamsError):
+        chain1.subtract(chain2, on=["a", "b"], right_on=["c", ""])
+
+    with pytest.raises(DataChainParamsError):
+        chain1.subtract(chain2, on=["a", "b"], right_on=[])
+
+    with pytest.raises(DataChainParamsError):
+        chain1.subtract(chain2, on=["a", "b"], right_on=["d"])
+
+    with pytest.raises(DataChainParamsError):
+        chain1.subtract(chain2, right_on=[])
+
+    with pytest.raises(DataChainParamsError):
+        chain1.subtract(chain2, right_on="")
+
+    with pytest.raises(DataChainParamsError):
+        chain1.subtract(chain2, right_on=42)
+
+    with pytest.raises(DataChainParamsError):
+        chain1.subtract(chain2, right_on=["a"])
+
+    with pytest.raises(TypeError):
+        chain1.subtract(chain2, on=42, right_on=42)
+
     chain3 = DataChain.from_values(c=["foo", "bar"], session=test_session)
     with pytest.raises(DataChainParamsError):
         chain1.subtract(chain3)