datachain 0.18.11__tar.gz → 0.19__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release: this version of datachain might be problematic.
- {datachain-0.18.11 → datachain-0.19}/PKG-INFO +56 -2
- {datachain-0.18.11 → datachain-0.19}/README.rst +55 -1
- datachain-0.19/docs/guide/delta.md +82 -0
- datachain-0.19/docs/guide/index.md +12 -0
- datachain-0.19/docs/guide/processing.md +10 -0
- datachain-0.19/docs/guide/retry.md +105 -0
- {datachain-0.18.11 → datachain-0.19}/docs/index.md +1 -0
- datachain-0.19/examples/incremental_processing/retry.py +130 -0
- {datachain-0.18.11 → datachain-0.19}/mkdocs.yml +7 -1
- {datachain-0.18.11 → datachain-0.19}/src/datachain/client/fsspec.py +1 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/client/hf.py +19 -2
- datachain-0.19/src/datachain/delta.py +244 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/dc/datachain.py +19 -4
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/dc/datasets.py +12 -1
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/dc/storage.py +13 -1
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/file.py +40 -11
- {datachain-0.18.11 → datachain-0.19}/src/datachain/semver.py +15 -5
- {datachain-0.18.11 → datachain-0.19}/src/datachain.egg-info/PKG-INFO +56 -2
- {datachain-0.18.11 → datachain-0.19}/src/datachain.egg-info/SOURCES.txt +7 -1
- {datachain-0.18.11 → datachain-0.19}/tests/examples/test_examples.py +1 -1
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_datachain.py +16 -9
- datachain-0.19/tests/func/test_retry.py +315 -0
- datachain-0.19/tests/unit/test_semver.py +135 -0
- datachain-0.18.11/src/datachain/delta.py +0 -119
- datachain-0.18.11/tests/unit/test_semver.py +0 -71
- {datachain-0.18.11 → datachain-0.19}/.cruft.json +0 -0
- {datachain-0.18.11 → datachain-0.19}/.gitattributes +0 -0
- {datachain-0.18.11 → datachain-0.19}/.github/ISSUE_TEMPLATE/bug_report.yml +0 -0
- {datachain-0.18.11 → datachain-0.19}/.github/ISSUE_TEMPLATE/empty_issue.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/.github/ISSUE_TEMPLATE/feature_request.yml +0 -0
- {datachain-0.18.11 → datachain-0.19}/.github/codecov.yaml +0 -0
- {datachain-0.18.11 → datachain-0.19}/.github/dependabot.yml +0 -0
- {datachain-0.18.11 → datachain-0.19}/.github/workflows/benchmarks.yml +0 -0
- {datachain-0.18.11 → datachain-0.19}/.github/workflows/release.yml +0 -0
- {datachain-0.18.11 → datachain-0.19}/.github/workflows/tests-studio.yml +0 -0
- {datachain-0.18.11 → datachain-0.19}/.github/workflows/tests.yml +0 -0
- {datachain-0.18.11 → datachain-0.19}/.github/workflows/update-template.yaml +0 -0
- {datachain-0.18.11 → datachain-0.19}/.gitignore +0 -0
- {datachain-0.18.11 → datachain-0.19}/.pre-commit-config.yaml +0 -0
- {datachain-0.18.11 → datachain-0.19}/CODE_OF_CONDUCT.rst +0 -0
- {datachain-0.18.11 → datachain-0.19}/LICENSE +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/assets/captioned_cartoons.png +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/assets/datachain-white.svg +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/assets/datachain.svg +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/commands/auth/login.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/commands/auth/logout.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/commands/auth/team.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/commands/auth/token.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/commands/index.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/commands/job/cancel.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/commands/job/logs.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/commands/job/ls.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/commands/job/run.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/contributing.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/css/github-permalink-style.css +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/examples.md +0 -0
- {datachain-0.18.11/docs/references → datachain-0.19/docs/guide}/remotes.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/overrides/main.html +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/quick-start.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/references/data-types/arrowrow.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/references/data-types/bbox.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/references/data-types/file.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/references/data-types/imagefile.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/references/data-types/index.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/references/data-types/pose.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/references/data-types/segment.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/references/data-types/tarvfile.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/references/data-types/textfile.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/references/data-types/videofile.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/references/datachain.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/references/func.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/references/index.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/references/toolkit.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/references/torch.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/references/udf.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/docs/tutorials.md +0 -0
- {datachain-0.18.11 → datachain-0.19}/examples/computer_vision/iptc_exif_xmp_lib.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/examples/computer_vision/llava2_image_desc_lib.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/examples/computer_vision/openimage-detect.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/examples/computer_vision/ultralytics-bbox.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/examples/computer_vision/ultralytics-pose.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/examples/computer_vision/ultralytics-segment.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/examples/get_started/common_sql_functions.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/examples/get_started/json-csv-reader.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/examples/get_started/torch-loader.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/examples/get_started/udfs/parallel.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/examples/get_started/udfs/simple.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/examples/get_started/udfs/stateful.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/examples/incremental_processing/delta.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/examples/incremental_processing/utils.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/examples/llm_and_nlp/claude-query.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/examples/llm_and_nlp/hf-dataset-llm-eval.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/examples/multimodal/clip_inference.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/examples/multimodal/hf_pipeline.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/examples/multimodal/openai_image_desc_lib.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/examples/multimodal/wds.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/examples/multimodal/wds_filtered.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/noxfile.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/pyproject.toml +0 -0
- {datachain-0.18.11 → datachain-0.19}/setup.cfg +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/__main__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/asyn.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/cache.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/catalog/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/catalog/catalog.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/catalog/datasource.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/catalog/loader.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/cli/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/cli/commands/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/cli/commands/datasets.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/cli/commands/du.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/cli/commands/index.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/cli/commands/ls.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/cli/commands/misc.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/cli/commands/query.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/cli/commands/show.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/cli/parser/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/cli/parser/job.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/cli/parser/studio.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/cli/parser/utils.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/cli/utils.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/client/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/client/azure.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/client/fileslice.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/client/gcs.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/client/local.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/client/s3.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/config.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/data_storage/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/data_storage/db_engine.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/data_storage/job.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/data_storage/metastore.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/data_storage/schema.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/data_storage/serializer.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/data_storage/sqlite.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/data_storage/warehouse.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/dataset.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/diff/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/error.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/fs/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/fs/reference.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/fs/utils.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/func/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/func/aggregate.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/func/array.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/func/base.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/func/conditional.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/func/func.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/func/numeric.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/func/path.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/func/random.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/func/string.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/func/window.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/job.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/arrow.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/clip.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/convert/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/convert/flatten.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/convert/python_to_sql.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/convert/sql_to_python.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/convert/unflatten.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/convert/values_to_tuples.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/data_model.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/dataset_info.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/dc/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/dc/csv.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/dc/database.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/dc/hf.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/dc/json.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/dc/listings.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/dc/pandas.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/dc/parquet.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/dc/records.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/dc/utils.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/dc/values.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/hf.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/image.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/listing.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/listing_info.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/meta_formats.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/model_store.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/pytorch.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/settings.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/signal_schema.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/tar.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/text.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/udf.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/udf_signature.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/utils.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/video.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/webdataset.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/lib/webdataset_laion.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/listing.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/model/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/model/bbox.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/model/pose.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/model/segment.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/model/ultralytics/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/model/ultralytics/bbox.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/model/ultralytics/pose.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/model/ultralytics/segment.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/model/utils.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/node.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/nodes_fetcher.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/nodes_thread_pool.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/progress.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/py.typed +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/query/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/query/batch.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/query/dataset.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/query/dispatch.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/query/metrics.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/query/params.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/query/queue.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/query/schema.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/query/session.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/query/udf.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/query/utils.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/remote/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/remote/studio.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/script_meta.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/sql/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/sql/default/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/sql/default/base.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/sql/functions/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/sql/functions/aggregate.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/sql/functions/array.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/sql/functions/conditional.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/sql/functions/numeric.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/sql/functions/path.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/sql/functions/random.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/sql/functions/string.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/sql/selectable.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/sql/sqlite/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/sql/sqlite/base.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/sql/sqlite/types.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/sql/sqlite/vector.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/sql/types.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/sql/utils.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/studio.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/telemetry.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/toolkit/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/toolkit/split.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/torch/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain/utils.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain.egg-info/dependency_links.txt +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain.egg-info/entry_points.txt +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain.egg-info/requires.txt +0 -0
- {datachain-0.18.11 → datachain-0.19}/src/datachain.egg-info/top_level.txt +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/benchmarks/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/benchmarks/conftest.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/benchmarks/datasets/.dvc/.gitignore +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/benchmarks/datasets/.dvc/config +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/benchmarks/datasets/.gitignore +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/benchmarks/datasets/laion-tiny.npz.dvc +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/benchmarks/test_datachain.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/benchmarks/test_ls.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/benchmarks/test_version.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/conftest.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/data.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/examples/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/examples/test_wds_e2e.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/examples/wds_data.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/data/Big_Buck_Bunny_360_10s_1MB.mp4 +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/data/lena.jpg +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/fake-service-account-credentials.json +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/functions/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/functions/test_aggregate.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/functions/test_array.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/functions/test_conditional.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/functions/test_numeric.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/functions/test_path.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/functions/test_random.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/functions/test_string.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/model/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/model/data/running-mask0.png +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/model/data/running-mask1.png +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/model/data/running.jpg +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/model/data/ships.jpg +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/model/test_yolo.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_batching.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_catalog.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_client.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_cloud_transfer.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_data_storage.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_datachain_merge.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_dataset_query.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_datasets.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_delta.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_feature_pickling.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_file.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_hf.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_hidden_field.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_image.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_listing.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_ls.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_meta_formats.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_metastore.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_metrics.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_pull.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_pytorch.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_query.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_read_database.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_session.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_toolkit.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_video.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/func/test_warehouse.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/scripts/feature_class.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/scripts/feature_class_exception.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/scripts/feature_class_parallel.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/scripts/feature_class_parallel_data_model.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/scripts/name_len_slow.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/test_atomicity.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/test_cli_e2e.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/test_cli_studio.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/test_import_time.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/test_query_e2e.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/test_telemetry.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/lib/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/lib/conftest.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/lib/test_arrow.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/lib/test_clip.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/lib/test_datachain.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/lib/test_datachain_bootstrap.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/lib/test_datachain_merge.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/lib/test_diff.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/lib/test_feature.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/lib/test_feature_utils.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/lib/test_file.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/lib/test_hf.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/lib/test_image.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/lib/test_listing_info.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/lib/test_python_to_sql.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/lib/test_schema.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/lib/test_signal_schema.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/lib/test_sql_to_python.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/lib/test_text.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/lib/test_udf.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/lib/test_udf_signature.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/lib/test_utils.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/lib/test_webdataset.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/model/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/model/test_bbox.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/model/test_pose.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/model/test_segment.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/model/test_utils.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/sql/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/sql/sqlite/__init__.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/sql/sqlite/test_types.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/sql/sqlite/test_utils.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/sql/test_array.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/sql/test_conditional.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/sql/test_path.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/sql/test_random.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/sql/test_selectable.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/sql/test_string.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/test_asyn.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/test_cache.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/test_catalog.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/test_catalog_loader.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/test_cli_parsing.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/test_client.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/test_client_gcs.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/test_client_s3.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/test_config.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/test_data_storage.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/test_database_engine.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/test_dataset.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/test_dispatch.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/test_fileslice.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/test_func.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/test_listing.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/test_metastore.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/test_module_exports.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/test_pytorch.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/test_query.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/test_query_metrics.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/test_query_params.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/test_script_meta.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/test_serializer.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/test_session.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/test_utils.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/unit/test_warehouse.py +0 -0
- {datachain-0.18.11 → datachain-0.19}/tests/utils.py +0 -0
{datachain-0.18.11 → datachain-0.19}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: datachain
-Version: 0.18.11
+Version: 0.19
 Summary: Wrangle unstructured AI data at scale
 Author-email: Dmitry Petrov <support@dvc.org>
 License-Expression: Apache-2.0
@@ -146,6 +146,12 @@ Use Cases
    on these tables at scale.
 3. **Versioning.** DataChain doesn't store, require moving or copying data (unlike DVC).
    Perfect use case is a bucket with thousands or millions of images, videos, audio, PDFs.
+4. **Incremental Processing.** DataChain's delta and retry features allow for efficient
+   processing workflows:
+
+   - **Delta Processing**: Process only new or changed files/records
+   - **Retry Processing**: Automatically reprocess records with errors or missing results
+   - **Combined Approach**: Process new data and fix errors in a single pipeline
 
 Getting Started
 ===============
@@ -158,7 +164,7 @@ to get started with `DataChain` and learn more.
    pip install datachain
 
 
-Example:
+Example: Download Subset of Files Based on Metadata
 ---------------------------------------------------
 
 Sometimes users only need to download a specific subset of files from cloud storage,
@@ -182,6 +188,54 @@ high confidence scores.
    likely_cats.to_storage("high-confidence-cats/", signal="file")
 
 
+Example: Incremental Processing with Error Handling
+---------------------------------------------------
+
+This example shows how to use both delta and retry processing for efficient handling of large
+datasets that evolve over time and may occasionally have processing errors.
+
+.. code:: py
+
+    import datachain as dc
+    from datachain import C, File
+
+    def process_file(file: File):
+        """Process a file, which may occasionally fail."""
+        try:
+            # Your processing logic here
+            content = file.read_text()
+            result = analyze_content(content)
+            return {
+                "content": content,
+                "result": result,
+                "error": None  # No error
+            }
+        except Exception as e:
+            # Return an error that will trigger reprocessing next time
+            return {
+                "content": None,
+                "result": None,
+                "error": str(e)  # Error field will trigger retry
+            }
+
+    # Process files efficiently with delta and retry
+    chain = (
+        dc.read_storage(
+            "data/",
+            update=True,
+            delta=True,            # Process only new/changed files
+            delta_on="file.path",  # Identify files by path
+            retry_on="error"       # Field that indicates errors
+        )
+        .map(processed_result=process_file)
+        .mutate(
+            content=C("processed_result.content"),
+            result=C("processed_result.result"),
+            error=C("processed_result.error")
+        )
+        .save(name="processed_data")
+    )
+
 Example: LLM based text-file evaluation
 ---------------------------------------
{datachain-0.18.11 → datachain-0.19}/README.rst

@@ -35,6 +35,12 @@ Use Cases
    on these tables at scale.
 3. **Versioning.** DataChain doesn't store, require moving or copying data (unlike DVC).
    Perfect use case is a bucket with thousands or millions of images, videos, audio, PDFs.
+4. **Incremental Processing.** DataChain's delta and retry features allow for efficient
+   processing workflows:
+
+   - **Delta Processing**: Process only new or changed files/records
+   - **Retry Processing**: Automatically reprocess records with errors or missing results
+   - **Combined Approach**: Process new data and fix errors in a single pipeline
 
 Getting Started
 ===============
@@ -47,7 +53,7 @@ to get started with `DataChain` and learn more.
    pip install datachain
 
 
-Example:
+Example: Download Subset of Files Based on Metadata
 ---------------------------------------------------
 
 Sometimes users only need to download a specific subset of files from cloud storage,
@@ -71,6 +77,54 @@ high confidence scores.
    likely_cats.to_storage("high-confidence-cats/", signal="file")
 
 
+Example: Incremental Processing with Error Handling
+---------------------------------------------------
+
+This example shows how to use both delta and retry processing for efficient handling of large
+datasets that evolve over time and may occasionally have processing errors.
+
+.. code:: py
+
+    import datachain as dc
+    from datachain import C, File
+
+    def process_file(file: File):
+        """Process a file, which may occasionally fail."""
+        try:
+            # Your processing logic here
+            content = file.read_text()
+            result = analyze_content(content)
+            return {
+                "content": content,
+                "result": result,
+                "error": None  # No error
+            }
+        except Exception as e:
+            # Return an error that will trigger reprocessing next time
+            return {
+                "content": None,
+                "result": None,
+                "error": str(e)  # Error field will trigger retry
+            }
+
+    # Process files efficiently with delta and retry
+    chain = (
+        dc.read_storage(
+            "data/",
+            update=True,
+            delta=True,            # Process only new/changed files
+            delta_on="file.path",  # Identify files by path
+            retry_on="error"       # Field that indicates errors
+        )
+        .map(processed_result=process_file)
+        .mutate(
+            content=C("processed_result.content"),
+            result=C("processed_result.result"),
+            error=C("processed_result.error")
+        )
+        .save(name="processed_data")
+    )
+
 Example: LLM based text-file evaluation
 ---------------------------------------
datachain-0.19/docs/guide/delta.md

@@ -0,0 +1,82 @@
+# Delta Processing
+
+Delta processing allows you to optimize the creation of new dataset versions by calculating and processing only the differences (deltas) between dataset versions. This is especially useful when working with large datasets that change incrementally over time.
+
+## Overview
+
+When working with large datasets that receive regular updates, reprocessing the entire dataset each time is inefficient. The `delta=True` flag in DataChain enables incremental processing, where only new or changed records are processed.
+
+## How Delta Processing Works
+
+When delta processing is enabled:
+
+1. DataChain calculates the difference between the latest version of your source dataset and the version used to create the most recent version of your result dataset
+2. Only the records that are new or modified in the source dataset are processed
+3. These processed records are then merged with the latest version of your result dataset
+
+This optimization significantly reduces processing time and resource usage for incremental updates.
+
+## Usage
+
+Delta processing can be enabled through the `delta` parameter when using `read_storage()` or `read_dataset()`:
+
+```python
+import datachain as dc
+
+# Process only new or changed files
+chain = (
+    dc.read_storage(
+        "data/",
+        delta=True,                 # Enable delta processing
+        delta_on="file.path",       # Field that uniquely identifies records
+        delta_compare="file.mtime"  # Field to check for changes
+    )
+    .map(result=process_function)
+    .save(name="processed_data")
+)
+```
+
+## Parameters
+
+- **delta**: Boolean flag to enable delta processing
+- **delta_on**: Field(s) that uniquely identify rows in the source dataset
+- **delta_result_on**: Field(s) in the resulting dataset that correspond to `delta_on` fields (if they have different names)
+- **delta_compare**: Field(s) used to check if a record has been modified
+
+## Example: Processing New Files Only
+
+```python
+import datachain as dc
+import time
+
+def process_file(file):
+    """Process a file and return results."""
+    content = file.read_text()
+    # Simulate processing time
+    time.sleep(0.1)
+    return {
+        "content": content,
+        "processed_at": time.strftime("%Y-%m-%d %H:%M:%S")
+    }
+
+# Process only new or modified files
+chain = (
+    dc.read_storage(
+        "data/",
+        update=True,          # Update the storage index
+        delta=True,           # Process only new files
+        delta_on="file.path"  # Files are identified by their paths
+    )
+    .map(result=process_file)
+    .save(name="processed_files")
+)
+
+print(f"Processed {chain.count()} files")
+```
+
+## Combining with Retry Processing
+
+Delta processing can be combined with [retry processing](./retry.md) to create a powerful workflow that both:
+
+1. Processes only new or changed records (delta)
+2. Reprocesses records with errors or that are missing (retry)
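The new delta.md above lists `delta_result_on` in its Parameters section without exercising it. A minimal sketch of the intended use, assuming a chain that renames the identifying key via `mutate` (the `source_path` name and `process_function` are illustrative, not part of this release):

```python
import datachain as dc
from datachain import C

# The result dataset stores the source key under a different name
# ("source_path", illustrative), so delta_result_on maps it back to
# the source's "file.path" when computing the delta.
chain = (
    dc.read_storage(
        "data/",
        delta=True,
        delta_on="file.path",           # key in the source dataset
        delta_result_on="source_path",  # matching key in the result dataset
        delta_compare="file.mtime",     # field used to detect changes
    )
    .mutate(source_path=C("file.path"))  # rename carried into the result
    .map(result=process_function)
    .save(name="processed_data")
)
```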
datachain-0.19/docs/guide/index.md

@@ -0,0 +1,12 @@
+---
+title: User Guide
+---
+
+# User Guide
+
+Welcome to the DataChain User Guide! This section provides comprehensive documentation on how to effectively use DataChain's advanced features and capabilities.
+
+- [Interacting with Remote Storage](./remotes.md) - Learn how to connect to and work with various remote storage systems.
+- [Data Processing Overview](./processing.md) - Discover DataChain's specialized data processing features.
+- [Delta Processing](./delta.md) - Incremental data processing to efficiently handle large datasets that change over time.
+- [Error Handling and Retries](./retry.md) - Learn how to handle processing errors and selectively reprocess problematic records.
datachain-0.19/docs/guide/processing.md

@@ -0,0 +1,10 @@
+# Data Processing Features
+
+DataChain offers specialized data processing features that help optimize the handling of large and evolving datasets:
+
+## Key Processing Features
+
+- [Delta Processing](./delta.md): Efficiently process only new or changed records between dataset versions
+- [Retry Processing](./retry.md): Automatically identify and reprocess records with errors or missing records
+
+These features can be used independently or combined to create powerful data processing workflows.
datachain-0.19/docs/guide/retry.md

@@ -0,0 +1,105 @@
+# Retry Processing
+
+The retry functionality in DataChain allows for selective reprocessing of records that either have errors or are missing from the final result dataset. This powerful feature works alongside the existing `delta` functionality, providing both incremental processing of new data and reprocessing of problematic records.
+
+## Overview
+
+When working with large datasets, two common scenarios arise:
+
+1. **Processing new data incrementally**: You only want to process new data that has been added since the last run (handled by the `delta=True` functionality)
+2. **Reprocessing problematic records**: Some records may have failed processing or produced errors that need to be fixed and reprocessed
+
+The retry functionality addresses the second scenario while still supporting the first.
+
+## How It Works
+
+The retry functionality allows you to:
+
+1. **Identify error records**: Records in the result dataset with a non-NULL value in a specified error field
+2. **Identify missing records**: Records that exist in the source dataset but are missing from the result dataset
+3. **Reprocess only these records**: Applying your full processing chain only to these records that need attention
+
+## Usage
+
+Retry is enabled by specifying `delta_retry`; it takes effect only when `delta` is enabled.
+
+```python
+import datachain as dc
+from datachain import C
+
+chain = (
+    dc.read_storage(
+        "path/to/data/",
+        # Enable delta processing to handle only new files (and retries in this case)
+        delta=True,
+        # Field(s) that uniquely identify records in the source dataset
+        delta_on="id",
+        # Controls which records to reprocess:
+        # - String: field name indicating errors when not empty
+        # - True: retry missing records from result dataset
+        # - False/None: no retry processing
+        delta_retry="error"
+    )
+    .map(result=process_function)  # Your processing function
+    .save(name="processed_data")   # Save results
+)
+```
+
+## Example: Processing Files with Error Handling
+
+```python
+import datachain as dc
+from datachain import C
+
+def process_file(file):
+    """Process a file - may occasionally fail with an error."""
+    try:
+        # Your processing logic here
+        content = file.read_text()
+        result = analyze_content(content)
+        return {
+            "content": content,
+            "result": result,
+            "error": ""  # No error
+        }
+    except Exception as e:
+        # Log the error and return it in the result
+        return {
+            "content": "",
+            "result": "",
+            "error": str(e)  # Store the error message
+        }
+
+# Process files with retry functionality
+chain = (
+    dc.read_storage(
+        "data/",
+        delta=True,            # Process only new files
+        delta_on="file.path",  # Files are identified by their paths
+        delta_retry="error"    # Retry records with errors in "error" field
+    )
+    .map(result=process_file)
+    .save(name="processed_files")
+)
+
+# Show records with errors that will be retried next time
+error_records = chain.filter(C("error") != None)
+if not error_records.empty:
+    print("Records with errors that will be retried:")
+    error_records.show()
+```
+
+## Combining Delta and Retry
+
+The real power comes when combining both delta and retry processing:
+
+1. **Delta processing**: Only process new or modified records
+2. **Retry processing**: Reprocess any records that had errors previously
+
+When both are enabled, DataChain will:
+
+1. First identify records that need to be retried (based on errors or missing records)
+2. Then apply delta processing to that subset
+3. Process only the resulting records, avoiding unnecessary reprocessing of unchanged data
+
+This provides the most efficient way to maintain a dataset that is always up-to-date and free of processing errors.
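The retry.md guide above documents a boolean form of `delta_retry` (retry records missing from the result dataset) that none of the bundled examples exercise. A minimal sketch based on that parameter description, reusing the guide's `process_file`:

```python
import datachain as dc

# With delta_retry=True, records present in the source but absent from
# "processed_files" (e.g. dropped by an interrupted earlier run) are
# reprocessed, in addition to the delta of new or changed files.
chain = (
    dc.read_storage(
        "data/",
        delta=True,
        delta_on="file.path",
        delta_retry=True,  # retry records missing from the result dataset
    )
    .map(result=process_file)
    .save(name="processed_files")
)
```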
{datachain-0.18.11 → datachain-0.19}/docs/index.md

@@ -86,6 +86,7 @@ The following pages provide detailed documentation on DataChain's features, arch
 - [🏃🏼♂️ Quick Start](quick-start.md): Get up and running with DataChain in no time.
 - [🎯 Examples](examples.md): Explore practical examples and use cases.
 - [📚 Tutorials](tutorials.md): Learn how to use DataChain for specific tasks.
+- [📚 User Guide](guide/index.md): Deeper dive into DataChain technical aspects and supported workflows.
 - [🐍 API Reference](references/index.md): Dive into the technical details and API reference.
 - [🤝 Contributing](contributing.md): Learn how to contribute to DataChain.
 
datachain-0.19/examples/incremental_processing/retry.py

@@ -0,0 +1,130 @@
+#!/usr/bin/env python
+"""
+Retry Processing Example with DataChain
+
+This script demonstrates DataChain's retry functionality by:
+1. Creating an initial in-memory dataset with sample data
+2. Processing records with simulated errors
+3. Using retry flags to reprocess failed records in a second pass
+"""
+
+from datetime import datetime, timezone
+
+import datachain as dc
+from datachain import C, DataModel
+
+
+class ProcessingResult(DataModel):
+    """Data model for processing results."""
+
+    processed_content: str
+    processed_at: str
+    error: str
+    attempt: int
+
+
+def process_data(item_id: int, content: str, attempt: int) -> ProcessingResult:
+    """
+    Process a data record - initially fails for odd IDs, but succeeds on retry.
+    In a real-world application, this could be data validation,
+    transformation, or an ML model inference step.
+    """
+    # Simulate an error for odd IDs on first attempt only
+    # This will succeed on retry to demonstrate the retry functionality
+    error = item_id % 2 == 1 and attempt == 1
+    return ProcessingResult(
+        processed_content="" if error else content.upper(),
+        processed_at=datetime.now(tz=timezone.utc).isoformat(),
+        error=f"Processing error for item {item_id} (attempt {attempt})"
+        if error
+        else "",
+        attempt=attempt,
+    )
+
+
+def retry_processing_example():
+    """
+    Demonstrates retry processing using DataChain.
+    """
+
+    # Step 1: Create initial dataset with sample data
+    print("Step 1: Creating initial dataset with sample data...")
+    sample_ids = [1, 2, 3, 4, 5]
+    sample_contents = [
+        "first item",
+        "second item",
+        "third item",
+        "fourth item",
+        "fifth item",
+    ]
+
+    initial_chain = dc.read_values(
+        item_id=sample_ids, content=sample_contents, in_memory=True
+    ).save(name="sample_data")
+    print(f"Created dataset with {initial_chain.count()} records\n")
+
+    # Step 2: First processing pass - some records will fail
+    # We enable delta, initially it won't do anything, but it means
+    # that code stays the same across all attempts
+    print("Step 2: First processing pass (some records will fail)...")
+    first_pass = (
+        dc.read_dataset(
+            "sample_data",
+            delta=True,
+            delta_on="item_id",
+            delta_retry="result.error",
+        )
+        # Set attempt number for processing, this is specific to this example only
+        .setup(attempt=lambda: 1)
+        .map(result=process_data)
+        .save(name="processed_data")
+    )
+
+    print(f"First pass completed. Total records: {first_pass.count()}")
+
+    # Show results of first pass
+    error_count = first_pass.filter(C("result.error") != "").count()
+    success_count = first_pass.filter(C("result.error") == "").count()
+    print(f"Successful: {success_count}, Failed: {error_count}")
+
+    print("\nFirst pass results:")
+    first_pass.show()
+
+    # Step 3: Retry processing - only failed records will be reprocessed
+    # Note how we use exactly the same DataChain code as above
+    print("\nStep 3: Retry processing (failed records will be reprocessed)...")
+    retry_chain = (
+        dc.read_dataset(
+            "sample_data",
+            delta=True,
+            delta_on="item_id",
+            # Retry records where result.error field is not empty
+            delta_retry="result.error",
+        )
+        .setup(attempt=lambda: 2)  # Set attempt number for processing
+        .map(result=process_data)
+        .save(name="processed_data")
+    )
+
+    print(f"Retry pass completed. Total records: {retry_chain.count()}")
+
+    # Show final results
+    final_error_count = retry_chain.filter(C("result.error") != "").count()
+    final_success_count = retry_chain.filter(C("result.error") == "").count()
+    print(f"Final - Successful: {final_success_count}, Failed: {final_error_count}")
+
+    print("\nFinal results after retry:")
+    retry_chain.show()
+
+    if final_error_count > 0:
+        print(f"\nNote: {final_error_count} records still have errors after retry.")
+        print(
+            "In a real scenario, you could run this again or implement "
+            "different retry logic."
+        )
+    else:
+        print("\nAll records processed successfully after retry!")
+
+
+if __name__ == "__main__":
+    retry_processing_example()
{datachain-0.18.11 → datachain-0.19}/mkdocs.yml

@@ -97,7 +97,13 @@ nav:
       - logs: commands/job/logs.md
       - cancel: commands/job/cancel.md
       - ls: commands/job/ls.md
-  - 📡 Interacting with remote storage: references/remotes.md
+  - 📚 User Guide:
+      - Overview: guide/index.md
+      - 📡 Interacting with remote storage: guide/remotes.md
+      - Data Processing:
+          - Overview: guide/processing.md
+          - Delta Processing: guide/delta.md
+          - Error Handling and Retries: guide/retry.md
   - 🤝 Contributing: contributing.md
 
   - DataChain Website ↗: https://datachain.ai" target="_blank"
{datachain-0.18.11 → datachain-0.19}/src/datachain/client/hf.py

@@ -15,6 +15,24 @@ class classproperty:  # noqa: N801
         return self.fget(owner)
 
 
+def _wrap_class(sync_fs_class):
+    """
+    Analog of `AsyncFileSystemWrapper.wrap_class` from fsspec, but sets
+    asynchronous to False by default. This is similar to other Async FS
+    we initialize. E.g. it means we don't break things in Jupyter where code
+    run in async.
+    """
+    from fsspec.implementations.asyn_wrapper import AsyncFileSystemWrapper
+
+    class GeneratedAsyncFileSystemWrapper(AsyncFileSystemWrapper):
+        def __init__(self, *args, **kwargs):
+            sync_fs = sync_fs_class(*args, **kwargs)
+            super().__init__(sync_fs, asynchronous=False)
+
+    GeneratedAsyncFileSystemWrapper.__name__ = f"Async{sync_fs_class.__name__}Wrapper"
+    return GeneratedAsyncFileSystemWrapper
+
+
 @functools.cache
 def get_hf_filesystem_cls():
     import fsspec
@@ -29,10 +47,9 @@ def get_hf_filesystem_cls():
             f"{fsspec_version} is installed."
         )
 
-    from fsspec.implementations.asyn_wrapper import AsyncFileSystemWrapper
    from huggingface_hub import HfFileSystem
 
-    fs_cls = AsyncFileSystemWrapper.wrap_class(HfFileSystem)
+    fs_cls = _wrap_class(HfFileSystem)
     # AsyncFileSystemWrapper does not set class properties, so we need to set them back.
     fs_cls.protocol = HfFileSystem.protocol
     return fs_cls