PyPI - datachain - Versions diffs - 0.8.0__tar.gz → 0.8.2__tar.gz - Mend

datachain 0.8.0.tar.gz → 0.8.2.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of datachain might be problematic.

Files changed (290)

{datachain-0.8.0 → datachain-0.8.2}/.github/workflows/benchmarks.yml RENAMED

@@ -25,7 +25,7 @@ jobs:
           python-version: '3.12'
       - name: Setup uv
-        uses: astral-sh/setup-uv@v4
+        uses: astral-sh/setup-uv@v5
         with:
           enable-cache: true
           cache-suffix: benchmarks

{datachain-0.8.0 → datachain-0.8.2}/.github/workflows/release.yml RENAMED

@@ -27,7 +27,7 @@ jobs:
           python-version: '3.12'
       - name: Setup uv
-        uses: astral-sh/setup-uv@v4
+        uses: astral-sh/setup-uv@v5
       - name: Install nox
         run: uv pip install nox --system

{datachain-0.8.0 → datachain-0.8.2}/.github/workflows/tests-studio.yml RENAMED

@@ -81,7 +81,7 @@ jobs:
           python-version: ${{ matrix.pyv }}
       - name: Setup uv
-        uses: astral-sh/setup-uv@v4
+        uses: astral-sh/setup-uv@v5
         with:
           enable-cache: true
           cache-suffix: studio

{datachain-0.8.0 → datachain-0.8.2}/.github/workflows/tests.yml RENAMED

@@ -37,7 +37,7 @@ jobs:
           python-version: '3.9'
       - name: Setup uv
-        uses: astral-sh/setup-uv@v4
+        uses: astral-sh/setup-uv@v5
         with:
           enable-cache: true
           cache-suffix: lint
@@ -94,7 +94,7 @@ jobs:
           python-version: ${{ matrix.pyv }}
       - name: Setup uv
-        uses: astral-sh/setup-uv@v4
+        uses: astral-sh/setup-uv@v5
         with:
           enable-cache: true
           cache-suffix: tests-${{ matrix.pyv }}
@@ -157,7 +157,7 @@ jobs:
           python-version: ${{ matrix.pyv }}
       - name: Setup uv
-        uses: astral-sh/setup-uv@v4
+        uses: astral-sh/setup-uv@v5
         with:
           enable-cache: true
           cache-suffix: examples-${{ matrix.pyv }}

{datachain-0.8.0 → datachain-0.8.2}/.pre-commit-config.yaml RENAMED

@@ -24,7 +24,7 @@ repos:
       - id: trailing-whitespace
         exclude: '^LICENSES/'
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: 'v0.8.3'
+    rev: 'v0.8.4'
     hooks:
       - id: ruff
         args: [--fix, --exit-non-zero-on-fix]

{datachain-0.8.0/src/datachain.egg-info → datachain-0.8.2}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: datachain
-Version: 0.8.0
+Version: 0.8.2
 Summary: Wrangle unstructured AI data at scale
 Author-email: Dmitry Petrov <support@dvc.org>
 License: Apache-2.0
@@ -84,7 +84,7 @@ Requires-Dist: requests-mock; extra == "tests"
 Requires-Dist: scipy; extra == "tests"
 Provides-Extra: dev
 Requires-Dist: datachain[docs,tests]; extra == "dev"
-Requires-Dist: mypy==1.13.0; extra == "dev"
+Requires-Dist: mypy==1.14.0; extra == "dev"
 Requires-Dist: types-python-dateutil; extra == "dev"
 Requires-Dist: types-pytz; extra == "dev"
 Requires-Dist: types-PyYAML; extra == "dev"
@@ -99,7 +99,7 @@ Requires-Dist: unstructured[pdf]; extra == "examples"
 Requires-Dist: pdfplumber==0.11.4; extra == "examples"
 Requires-Dist: huggingface_hub[hf_transfer]; extra == "examples"
 Requires-Dist: onnx==1.16.1; extra == "examples"
-Requires-Dist: ultralytics==8.3.50; extra == "examples"
+Requires-Dist: ultralytics==8.3.53; extra == "examples"
 ================
 |logo| DataChain
@@ -145,6 +145,88 @@ Getting Started
 Visit `Quick Start <https://docs.datachain.ai/quick-start>`_ and `Docs <https://docs.datachain.ai/>`_
 to get started with `DataChain` and learn more.
+.. code:: bash
+        pip install datachain
+Example: download subset of files based on metadata
+---------------------------------------------------
+Sometimes users only need to download a specific subset of files from cloud storage,
+rather than the entire dataset.
+For example, you could use a JSON file's metadata to download just cat images with
+high confidence scores.
+.. code:: py
+    from datachain import Column, DataChain
+    meta = DataChain.from_json("gs://datachain-demo/dogs-and-cats/*json", object_name="meta", anon=True)
+    images = DataChain.from_storage("gs://datachain-demo/dogs-and-cats/*jpg", anon=True)
+    images_id = images.map(id=lambda file: file.path.split('.')[-2])
+    annotated = images_id.merge(meta, on="id", right_on="meta.id")
+    likely_cats = annotated.filter((Column("meta.inference.confidence") > 0.93) \
+                                   & (Column("meta.inference.class_") == "cat"))
+    likely_cats.export_files("high-confidence-cats/", signal="file")
+Example: LLM based text-file evaluation
+---------------------------------------
+In this example, we evaluate chatbot conversations stored in text files
+using LLM based evaluation.
+.. code:: shell
+    $ pip install mistralai # Requires version >=1.0.0
+    $ export MISTRAL_API_KEY=_your_key_
+Python code:
+.. code:: py
+    from mistralai import Mistral
+    from datachain import File, DataChain, Column
+    PROMPT = "Was this dialog successful? Answer in a single word: Success or Failure."
+    def eval_dialogue(file: File) -> bool:
+         client = Mistral()
+         response = client.chat.complete(
+             model="open-mixtral-8x22b",
+             messages=[{"role": "system", "content": PROMPT},
+                       {"role": "user", "content": file.read()}])
+         result = response.choices[0].message.content
+         return result.lower().startswith("success")
+    chain = (
+       DataChain.from_storage("gs://datachain-demo/chatbot-KiT/", object_name="file", anon=True)
+       .settings(parallel=4, cache=True)
+       .map(is_success=eval_dialogue)
+       .save("mistral_files")
+    )
+    successful_chain = chain.filter(Column("is_success") == True)
+    successful_chain.export_files("./output_mistral")
+    print(f"{successful_chain.count()} files were exported")
+With the instruction above, the Mistral model considers 31/50 files to hold the successful dialogues:
+.. code:: shell
+    $ ls output_mistral/datachain-demo/chatbot-KiT/
+    1.txt  15.txt 18.txt 2.txt  22.txt 25.txt 28.txt 33.txt 37.txt 4.txt  41.txt ...
+    $ ls output_mistral/datachain-demo/chatbot-KiT/ | wc -l
+    31
 Key Features
 ============

{datachain-0.8.0 → datachain-0.8.2}/README.rst RENAMED

@@ -42,6 +42,88 @@ Getting Started
 Visit `Quick Start <https://docs.datachain.ai/quick-start>`_ and `Docs <https://docs.datachain.ai/>`_
 to get started with `DataChain` and learn more.
+.. code:: bash
+        pip install datachain
+Example: download subset of files based on metadata
+---------------------------------------------------
+Sometimes users only need to download a specific subset of files from cloud storage,
+rather than the entire dataset.
+For example, you could use a JSON file's metadata to download just cat images with
+high confidence scores.
+.. code:: py
+    from datachain import Column, DataChain
+    meta = DataChain.from_json("gs://datachain-demo/dogs-and-cats/*json", object_name="meta", anon=True)
+    images = DataChain.from_storage("gs://datachain-demo/dogs-and-cats/*jpg", anon=True)
+    images_id = images.map(id=lambda file: file.path.split('.')[-2])
+    annotated = images_id.merge(meta, on="id", right_on="meta.id")
+    likely_cats = annotated.filter((Column("meta.inference.confidence") > 0.93) \
+                                   & (Column("meta.inference.class_") == "cat"))
+    likely_cats.export_files("high-confidence-cats/", signal="file")
+Example: LLM based text-file evaluation
+---------------------------------------
+In this example, we evaluate chatbot conversations stored in text files
+using LLM based evaluation.
+.. code:: shell
+    $ pip install mistralai # Requires version >=1.0.0
+    $ export MISTRAL_API_KEY=_your_key_
+Python code:
+.. code:: py
+    from mistralai import Mistral
+    from datachain import File, DataChain, Column
+    PROMPT = "Was this dialog successful? Answer in a single word: Success or Failure."
+    def eval_dialogue(file: File) -> bool:
+         client = Mistral()
+         response = client.chat.complete(
+             model="open-mixtral-8x22b",
+             messages=[{"role": "system", "content": PROMPT},
+                       {"role": "user", "content": file.read()}])
+         result = response.choices[0].message.content
+         return result.lower().startswith("success")
+    chain = (
+       DataChain.from_storage("gs://datachain-demo/chatbot-KiT/", object_name="file", anon=True)
+       .settings(parallel=4, cache=True)
+       .map(is_success=eval_dialogue)
+       .save("mistral_files")
+    )
+    successful_chain = chain.filter(Column("is_success") == True)
+    successful_chain.export_files("./output_mistral")
+    print(f"{successful_chain.count()} files were exported")
+With the instruction above, the Mistral model considers 31/50 files to hold the successful dialogues:
+.. code:: shell
+    $ ls output_mistral/datachain-demo/chatbot-KiT/
+    1.txt  15.txt 18.txt 2.txt  22.txt 25.txt 28.txt 33.txt 37.txt 4.txt  41.txt ...
+    $ ls output_mistral/datachain-demo/chatbot-KiT/ | wc -l
+    31
 Key Features
 ============
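
Note on the added README example: the `id` signal is derived with `file.path.split('.')[-2]`, and the merged rows are then filtered by class and confidence. A minimal plain-Python sketch of what those two steps compute, assuming paths shaped like `dogs-and-cats/cat.1.jpg`. The sample paths, the metadata records, and the helper names `extract_id` and `likely_cats` are hypothetical and are not taken from the bucket or from datachain:

```python
# Hypothetical sample data; the real bucket contents may differ.
paths = ["dogs-and-cats/cat.1.jpg", "dogs-and-cats/dog.2.jpg"]
meta = {
    "1": {"class_": "cat", "confidence": 0.97},
    "2": {"class_": "dog", "confidence": 0.99},
}


def extract_id(path: str) -> str:
    # Same expression as the README lambda: the token just before the extension.
    return path.split(".")[-2]


def likely_cats(paths, meta, threshold=0.93):
    # Join each path to its metadata by id, then keep confident "cat" rows.
    selected = []
    for path in paths:
        record = meta.get(extract_id(path))
        if record and record["class_"] == "cat" and record["confidence"] > threshold:
            selected.append(path)
    return selected


print(likely_cats(paths, meta))
```

The expression only yields a usable id when the path contains at least two dots; a path such as `cat.jpg` would return `cat` instead of a number.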

{datachain-0.8.0 → datachain-0.8.2}/docs/quick-start.md RENAMED

@@ -39,8 +39,8 @@ using JSON metadata:
 ``` py
 from datachain import Column, DataChain
-meta = DataChain.from_json("gs://datachain-demo/dogs-and-cats/*json", object_name="meta")
-images = DataChain.from_storage("gs://datachain-demo/dogs-and-cats/*jpg")
+meta = DataChain.from_json("gs://datachain-demo/dogs-and-cats/*json", object_name="meta", anon=True)
+images = DataChain.from_storage("gs://datachain-demo/dogs-and-cats/*jpg", anon=True)
 images_id = images.map(id=lambda file: file.path.split('.')[-2])
 annotated = images_id.merge(meta, on="id", right_on="meta.id")
@@ -59,6 +59,8 @@ Batch inference with a simple sentiment model using the
 pip install transformers
 ```
+Note, `transformers` works only if `torch`, `tensorflow` >= 2.0, or `flax` are installed.
 The code below downloads files from the cloud, and applies a
 user-defined function to each one of them. All files with a positive
 sentiment detected are then copied to the local directory.
@@ -76,7 +78,7 @@ def is_positive_dialogue_ending(file) -> bool:
 chain = (
    DataChain.from_storage("gs://datachain-demo/chatbot-KiT/",
-                          object_name="file", type="text")
+                          object_name="file", type="text", anon=True)
    .settings(parallel=8, cache=True)
    .map(is_positive=is_positive_dialogue_ending)
    .save("file_response")
@@ -114,13 +116,14 @@ DataChain can parallelize API calls; the free Mistral tier supports up
 to 4 requests at the same time.
 ``` py
+import os
 from mistralai import Mistral
 from datachain import File, DataChain, Column
 PROMPT = "Was this dialog successful? Answer in a single word: Success or Failure."
 def eval_dialogue(file: File) -> bool:
-     client = Mistral()
+     client = Mistral(api_key = os.environ["MISTRAL_API_KEY"])
      response = client.chat.complete(
          model="open-mixtral-8x22b",
          messages=[{"role": "system", "content": PROMPT},
@@ -129,8 +132,7 @@ def eval_dialogue(file: File) -> bool:
      return result.lower().startswith("success")
 chain = (
-   DataChain.from_storage("gs://datachain-demo/chatbot-KiT/", object_name="file")
-   .settings(parallel=4, cache=True)
+   DataChain.from_storage("gs://datachain-demo/chatbot-KiT/", object_name="file", anon=True)
    .map(is_success=eval_dialogue)
    .save("mistral_files")
 )
@@ -175,7 +177,7 @@ def eval_dialog(file: File) -> ChatCompletionResponse:
                    {"role": "user", "content": file.read()}])
 chain = (
-   DataChain.from_storage("gs://datachain-demo/chatbot-KiT/", object_name="file")
+   DataChain.from_storage("gs://datachain-demo/chatbot-KiT/", object_name="file", anon=True)
    .settings(parallel=4, cache=True)
    .map(response=eval_dialog)
    .map(status=lambda response: response.choices[0].message.content.lower()[:7])
@@ -271,7 +273,7 @@ from datachain import C, DataChain
 processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
 chain = (
-    DataChain.from_storage("gs://datachain-demo/dogs-and-cats/", type="image")
+    DataChain.from_storage("gs://datachain-demo/dogs-and-cats/", type="image", anon=True)
     .map(label=lambda name: name.split(".")[0], params=["file.name"])
     .select("file", "label").to_pytorch(
         transform=processor.image_processor,
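
Note on the `Mistral(api_key = os.environ["MISTRAL_API_KEY"])` change above: indexing `os.environ` raises a bare `KeyError` when the variable is unset. A minimal sketch of a helper that fails with a more readable message; `require_env` is a hypothetical name, not part of datachain or mistralai:

```python
import os


def require_env(name: str) -> str:
    # Hypothetical helper: read a required setting, failing with a readable message.
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

Under that assumption, the client in the quick-start snippet could be built as `Mistral(api_key=require_env("MISTRAL_API_KEY"))`.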

{datachain-0.8.0 → datachain-0.8.2}/pyproject.toml RENAMED

@@ -96,7 +96,7 @@ tests = [
 ]
 dev = [
   "datachain[docs,tests]",
-  "mypy==1.13.0",
+  "mypy==1.14.0",
   "types-python-dateutil",
   "types-pytz",
   "types-PyYAML",
@@ -112,7 +112,7 @@ examples = [
   "pdfplumber==0.11.4",
   "huggingface_hub[hf_transfer]",
   "onnx==1.16.1",
-  "ultralytics==8.3.50"
+  "ultralytics==8.3.53"
 ]
 [project.urls]

{datachain-0.8.0 → datachain-0.8.2}/src/datachain/catalog/catalog.py RENAMED

@@ -52,6 +52,7 @@ from datachain.error import (
     QueryScriptCancelError,
     QueryScriptRunError,
 )
+from datachain.lib.listing import get_listing
 from datachain.node import DirType, Node, NodeWithPath
 from datachain.nodes_thread_pool import NodesThreadPool
 from datachain.remote.studio import StudioClient
@@ -599,7 +600,7 @@ class Catalog:
             source, session=self.session, update=update, object_name=object_name
         )
-        list_ds_name, list_uri, list_path, _ = DataChain.parse_uri(
+        list_ds_name, list_uri, list_path, _ = get_listing(
             source, self.session, update=update
         )
@@ -697,11 +698,9 @@ class Catalog:
                 )
                 indexed_sources = []
                 for source in dataset_sources:
-                    from datachain.lib.dc import DataChain
                     client = self.get_client(source, **client_config)
                     uri = client.uri
-                    dataset_name, _, _, _ = DataChain.parse_uri(uri, self.session)
+                    dataset_name, _, _, _ = get_listing(uri, self.session)
                     listing = Listing(
                         self.metastore.clone(),
                         self.warehouse.clone(),

{datachain-0.8.0 → datachain-0.8.2}/src/datachain/client/gcs.py RENAMED

@@ -32,6 +32,16 @@ class GCSClient(Client):
         return cast(GCSFileSystem, super().create_fs(**kwargs))
+    def url(self, path: str, expires: int = 3600, **kwargs) -> str:
+        """
+        Generate a signed URL for the given path.
+        If the client is anonymous, a public URL is returned instead
+        (see https://cloud.google.com/storage/docs/access-public-data#api-link).
+        """
+        if self.fs.storage_options.get("token") == "anon":
+            return f"https://storage.googleapis.com/{self.name}/{path}"
+        return self.fs.sign(self.get_full_path(path), expiration=expires, **kwargs)
     @staticmethod
     def parse_timestamp(timestamp: str) -> datetime:
         """

{datachain-0.8.0 → datachain-0.8.2}/src/datachain/data_storage/warehouse.py RENAMED

@@ -216,7 +216,6 @@ class AbstractWarehouse(ABC, Serializable):
         limit = query._limit
         paginated_query = query.limit(page_size)
-        results = None
         offset = 0
         num_yielded = 0

{datachain-0.8.0 → datachain-0.8.2}/src/datachain/lib/arrow.py RENAMED

@@ -1,9 +1,11 @@
 from collections.abc import Sequence
-from tempfile import NamedTemporaryFile
+from itertools import islice
 from typing import TYPE_CHECKING, Any, Optional
+import fsspec.implementations.reference
 import orjson
 import pyarrow as pa
+from fsspec.core import split_protocol
 from pyarrow.dataset import CsvFileFormat, dataset
 from tqdm import tqdm
@@ -25,7 +27,18 @@ if TYPE_CHECKING:
 DATACHAIN_SIGNAL_SCHEMA_PARQUET_KEY = b"DataChain SignalSchema"
+class ReferenceFileSystem(fsspec.implementations.reference.ReferenceFileSystem):
+    def _open(self, path, mode="rb", *args, **kwargs):
+        # overriding because `fsspec`'s `ReferenceFileSystem._open`
+        # reads the whole file in-memory.
+        (uri,) = self.references[path]
+        protocol, _ = split_protocol(uri)
+        return self.fss[protocol]._open(uri, mode, *args, **kwargs)
 class ArrowGenerator(Generator):
+    DEFAULT_BATCH_SIZE = 2**17  # same as `pyarrow._dataset._DEFAULT_BATCH_SIZE`
     def __init__(
         self,
         input_schema: Optional["pa.Schema"] = None,
@@ -55,57 +68,80 @@ class ArrowGenerator(Generator):
     def process(self, file: File):
         if file._caching_enabled:
             file.ensure_cached()
-            path = file.get_local_path()
-            ds = dataset(path, schema=self.input_schema, **self.kwargs)
-        elif self.nrows:
-            path = _nrows_file(file, self.nrows)
-            ds = dataset(path, schema=self.input_schema, **self.kwargs)
+            cache_path = file.get_local_path()
+            fs_path = file.path
+            fs = ReferenceFileSystem({fs_path: [cache_path]})
         else:
-            path = file.get_path()
-            ds = dataset(
-                path, filesystem=file.get_fs(), schema=self.input_schema, **self.kwargs
-            )
+            fs, fs_path = file.get_fs(), file.get_path()
+        ds = dataset(fs_path, schema=self.input_schema, filesystem=fs, **self.kwargs)
         hf_schema = _get_hf_schema(ds.schema)
         use_datachain_schema = (
             bool(ds.schema.metadata)
             and DATACHAIN_SIGNAL_SCHEMA_PARQUET_KEY in ds.schema.metadata
         )
-        index = 0
-        with tqdm(desc="Parsed by pyarrow", unit=" rows") as pbar:
-            for record_batch in ds.to_batches():
-                for record in record_batch.to_pylist():
-                    if use_datachain_schema and self.output_schema:
-                        vals = [_nested_model_instantiate(record, self.output_schema)]
-                    else:
-                        vals = list(record.values())
-                        if self.output_schema:
-                            fields = self.output_schema.model_fields
-                            vals_dict = {}
-                            for i, ((field, field_info), val) in enumerate(
-                                zip(fields.items(), vals)
-                            ):
-                                anno = field_info.annotation
-                                if hf_schema:
-                                    from datachain.lib.hf import convert_feature
-                                    feat = list(hf_schema[0].values())[i]
-                                    vals_dict[field] = convert_feature(val, feat, anno)
-                                elif ModelStore.is_pydantic(anno):
-                                    vals_dict[field] = anno(**val)  # type: ignore[misc]
-                                else:
-                                    vals_dict[field] = val
-                            vals = [self.output_schema(**vals_dict)]
-                    if self.source:
-                        kwargs: dict = self.kwargs
-                        # Can't serialize CsvFileFormat; may lose formatting options.
-                        if isinstance(kwargs.get("format"), CsvFileFormat):
-                            kwargs["format"] = "csv"
-                        arrow_file = ArrowRow(file=file, index=index, kwargs=kwargs)
-                        yield [arrow_file, *vals]
-                    else:
-                        yield vals
-                    index += 1
-                pbar.update(len(record_batch))
+        kw = {}
+        if self.nrows:
+            kw = {"batch_size": min(self.DEFAULT_BATCH_SIZE, self.nrows)}
+        def iter_records():
+            for record_batch in ds.to_batches(**kw):
+                yield from record_batch.to_pylist()
+        it = islice(iter_records(), self.nrows)
+        with tqdm(it, desc="Parsed by pyarrow", unit="rows", total=self.nrows) as pbar:
+            for index, record in enumerate(pbar):
+                yield self._process_record(
+                    record, file, index, hf_schema, use_datachain_schema
+                )
+    def _process_record(
+        self,
+        record: dict[str, Any],
+        file: File,
+        index: int,
+        hf_schema: Optional[tuple["Features", dict[str, "DataType"]]],
+        use_datachain_schema: bool,
+    ):
+        if use_datachain_schema and self.output_schema:
+            vals = [_nested_model_instantiate(record, self.output_schema)]
+        else:
+            vals = self._process_non_datachain_record(record, hf_schema)
+        if self.source:
+            kwargs: dict = self.kwargs
+            # Can't serialize CsvFileFormat; may lose formatting options.
+            if isinstance(kwargs.get("format"), CsvFileFormat):
+                kwargs["format"] = "csv"
+            arrow_file = ArrowRow(file=file, index=index, kwargs=kwargs)
+            return [arrow_file, *vals]
+        return vals
+    def _process_non_datachain_record(
+        self,
+        record: dict[str, Any],
+        hf_schema: Optional[tuple["Features", dict[str, "DataType"]]],
+    ):
+        vals = list(record.values())
+        if not self.output_schema:
+            return vals
+        fields = self.output_schema.model_fields
+        vals_dict = {}
+        for i, ((field, field_info), val) in enumerate(zip(fields.items(), vals)):
+            anno = field_info.annotation
+            if hf_schema:
+                from datachain.lib.hf import convert_feature
+                feat = list(hf_schema[0].values())[i]
+                vals_dict[field] = convert_feature(val, feat, anno)
+            elif ModelStore.is_pydantic(anno):
+                vals_dict[field] = anno(**val)  # type: ignore[misc]
+            else:
+                vals_dict[field] = val
+        return [self.output_schema(**vals_dict)]
 def infer_schema(chain: "DataChain", **kwargs) -> pa.Schema:
@@ -190,18 +226,6 @@ def arrow_type_mapper(col_type: pa.DataType, column: str = "") -> type:  # noqa:
     raise TypeError(f"{col_type!r} datatypes not supported, column: {column}")
-def _nrows_file(file: File, nrows: int) -> str:
-    tf = NamedTemporaryFile(delete=False)  # noqa: SIM115
-    with file.open(mode="r") as reader:
-        with open(tf.name, "a") as writer:
-            for row, line in enumerate(reader):
-                if row >= nrows:
-                    break
-                writer.write(line)
-                writer.write("\n")
-    return tf.name
 def _get_hf_schema(
     schema: "pa.Schema",
 ) -> Optional[tuple["Features", dict[str, "DataType"]]]:
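
Note on the `nrows` handling in `ArrowGenerator.process`: the removed `_nrows_file` helper copied the first lines into a temporary file, while the new code caps the batch size and truncates the record stream with `itertools.islice`. A minimal sketch of that pattern on plain lists, with no pyarrow dependency. The names `iter_records`, `first_rows`, and `batch_kwargs` are illustrative and do not appear in the package; plain lists stand in for record batches:

```python
from itertools import islice

DEFAULT_BATCH_SIZE = 2**17  # value quoted in the diff


def iter_records(batches):
    # Flatten an iterable of batches (here: lists of rows) into a stream of rows.
    for batch in batches:
        yield from batch


def first_rows(batches, nrows=None):
    # islice(it, None) yields everything, so nrows=None means "no limit".
    return islice(iter_records(batches), nrows)


def batch_kwargs(nrows=None):
    # Mirrors the diff: only override the batch size when a row limit is set.
    if nrows:
        return {"batch_size": min(DEFAULT_BATCH_SIZE, nrows)}
    return {}


print(list(first_rows([[1, 2], [3, 4], [5]], nrows=3)))
print(batch_kwargs(10), batch_kwargs())
```

Because `islice` is lazy, batches beyond the one containing the last requested row are never pulled from the source iterator.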
