datachain 0.7.8__tar.gz → 0.7.10__tar.gz
Potentially problematic release.
This version of datachain might be problematic.
- {datachain-0.7.8 → datachain-0.7.10}/.github/workflows/tests.yml +17 -2
- datachain-0.7.10/PKG-INFO +207 -0
- datachain-0.7.10/README.rst +105 -0
- datachain-0.7.10/docs/contributing.md +111 -0
- datachain-0.7.8/docs/index.md → datachain-0.7.10/docs/examples.md +48 -61
- datachain-0.7.10/docs/index.md +103 -0
- datachain-0.7.10/docs/quick-start.md +286 -0
- datachain-0.7.10/docs/references/index.md +10 -0
- datachain-0.7.10/docs/tutorials.md +5 -0
- {datachain-0.7.8 → datachain-0.7.10}/examples/llm_and_nlp/hf-dataset-llm-eval.py +6 -3
- {datachain-0.7.8 → datachain-0.7.10}/mkdocs.yml +11 -4
- {datachain-0.7.8 → datachain-0.7.10}/pyproject.toml +1 -1
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/cli.py +9 -3
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/client/fsspec.py +4 -2
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/client/local.py +9 -4
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/data_storage/metastore.py +3 -2
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/func/__init__.py +4 -1
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/func/numeric.py +46 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/func/string.py +46 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/convert/flatten.py +7 -5
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/convert/unflatten.py +2 -2
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/convert/values_to_tuples.py +1 -1
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/dc.py +1 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/pytorch.py +54 -37
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/utils.py +1 -1
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/query/dataset.py +1 -1
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/remote/studio.py +44 -25
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/sql/functions/numeric.py +12 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/sql/functions/string.py +12 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/sql/sqlite/base.py +40 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/studio.py +2 -2
- datachain-0.7.10/src/datachain.egg-info/PKG-INFO +207 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain.egg-info/SOURCES.txt +4 -1
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain.egg-info/requires.txt +1 -1
- {datachain-0.7.8 → datachain-0.7.10}/tests/conftest.py +1 -1
- {datachain-0.7.8 → datachain-0.7.10}/tests/func/test_catalog.py +32 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/func/test_ls.py +2 -2
- {datachain-0.7.8 → datachain-0.7.10}/tests/func/test_pull.py +13 -13
- {datachain-0.7.8 → datachain-0.7.10}/tests/test_cli_studio.py +4 -2
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/test_func.py +60 -2
- datachain-0.7.8/CONTRIBUTING.rst +0 -129
- datachain-0.7.8/PKG-INFO +0 -488
- datachain-0.7.8/README.rst +0 -386
- datachain-0.7.8/docs/references/index.md +0 -8
- datachain-0.7.8/src/datachain.egg-info/PKG-INFO +0 -488
- {datachain-0.7.8 → datachain-0.7.10}/.cruft.json +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/.gitattributes +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/.github/ISSUE_TEMPLATE/bug_report.yml +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/.github/ISSUE_TEMPLATE/empty_issue.md +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/.github/ISSUE_TEMPLATE/feature_request.yml +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/.github/codecov.yaml +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/.github/dependabot.yml +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/.github/workflows/benchmarks.yml +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/.github/workflows/release.yml +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/.github/workflows/tests-studio.yml +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/.github/workflows/update-template.yaml +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/.gitignore +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/.pre-commit-config.yaml +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/CODE_OF_CONDUCT.rst +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/LICENSE +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/docs/assets/captioned_cartoons.png +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/docs/assets/datachain-white.svg +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/docs/assets/datachain.svg +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/docs/overrides/main.html +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/docs/references/datachain.md +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/docs/references/datatype.md +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/docs/references/file.md +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/docs/references/sql.md +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/docs/references/torch.md +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/docs/references/udf.md +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/examples/computer_vision/iptc_exif_xmp_lib.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/examples/computer_vision/llava2_image_desc_lib.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/examples/computer_vision/openimage-detect.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/examples/computer_vision/ultralytics-bbox.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/examples/computer_vision/ultralytics-pose.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/examples/computer_vision/ultralytics-segment.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/examples/get_started/common_sql_functions.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/examples/get_started/json-csv-reader.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/examples/get_started/torch-loader.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/examples/get_started/udfs/parallel.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/examples/get_started/udfs/simple.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/examples/get_started/udfs/stateful.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/examples/llm_and_nlp/claude-query.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/examples/llm_and_nlp/unstructured-embeddings-gen.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/examples/llm_and_nlp/unstructured-summary-map.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/examples/multimodal/clip_inference.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/examples/multimodal/hf_pipeline.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/examples/multimodal/openai_image_desc_lib.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/examples/multimodal/wds.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/examples/multimodal/wds_filtered.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/noxfile.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/setup.cfg +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/__init__.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/__main__.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/asyn.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/cache.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/catalog/__init__.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/catalog/catalog.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/catalog/datasource.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/catalog/loader.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/cli_utils.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/client/__init__.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/client/azure.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/client/fileslice.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/client/gcs.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/client/hf.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/client/s3.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/config.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/data_storage/__init__.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/data_storage/db_engine.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/data_storage/job.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/data_storage/schema.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/data_storage/serializer.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/data_storage/sqlite.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/data_storage/warehouse.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/dataset.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/error.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/func/aggregate.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/func/array.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/func/base.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/func/conditional.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/func/func.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/func/path.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/func/random.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/func/window.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/job.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/__init__.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/arrow.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/clip.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/convert/__init__.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/convert/python_to_sql.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/convert/sql_to_python.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/data_model.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/dataset_info.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/file.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/hf.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/image.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/listing.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/listing_info.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/meta_formats.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/model_store.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/settings.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/signal_schema.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/tar.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/text.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/udf.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/udf_signature.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/vfile.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/webdataset.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/lib/webdataset_laion.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/listing.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/model/__init__.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/model/bbox.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/model/pose.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/model/segment.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/model/ultralytics/__init__.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/model/ultralytics/bbox.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/model/ultralytics/pose.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/model/ultralytics/segment.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/node.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/nodes_fetcher.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/nodes_thread_pool.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/progress.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/py.typed +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/query/__init__.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/query/batch.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/query/dispatch.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/query/metrics.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/query/params.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/query/queue.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/query/schema.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/query/session.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/remote/__init__.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/sql/__init__.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/sql/default/__init__.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/sql/default/base.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/sql/functions/__init__.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/sql/functions/aggregate.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/sql/functions/array.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/sql/functions/conditional.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/sql/functions/path.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/sql/functions/random.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/sql/selectable.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/sql/sqlite/__init__.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/sql/sqlite/types.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/sql/sqlite/vector.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/sql/types.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/sql/utils.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/telemetry.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/toolkit/__init__.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/toolkit/split.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/torch/__init__.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain/utils.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain.egg-info/dependency_links.txt +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain.egg-info/entry_points.txt +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/src/datachain.egg-info/top_level.txt +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/__init__.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/benchmarks/__init__.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/benchmarks/conftest.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/benchmarks/datasets/.dvc/.gitignore +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/benchmarks/datasets/.dvc/config +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/benchmarks/datasets/.gitignore +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/benchmarks/datasets/laion-tiny.npz.dvc +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/benchmarks/test_datachain.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/benchmarks/test_ls.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/benchmarks/test_version.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/data.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/examples/__init__.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/examples/test_examples.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/examples/test_wds_e2e.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/examples/wds_data.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/func/__init__.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/func/test_client.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/func/test_datachain.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/func/test_dataset_query.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/func/test_datasets.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/func/test_feature_pickling.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/func/test_listing.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/func/test_meta_formats.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/func/test_metrics.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/func/test_pytorch.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/func/test_query.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/func/test_toolkit.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/scripts/feature_class.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/scripts/feature_class_exception.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/scripts/feature_class_parallel.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/scripts/feature_class_parallel_data_model.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/scripts/name_len_slow.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/test_atomicity.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/test_cli_e2e.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/test_query_e2e.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/test_telemetry.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/__init__.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/lib/__init__.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/lib/conftest.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/lib/test_arrow.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/lib/test_clip.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/lib/test_datachain.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/lib/test_datachain_bootstrap.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/lib/test_datachain_merge.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/lib/test_feature.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/lib/test_feature_utils.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/lib/test_file.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/lib/test_hf.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/lib/test_image.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/lib/test_listing_info.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/lib/test_models.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/lib/test_schema.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/lib/test_signal_schema.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/lib/test_sql_to_python.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/lib/test_text.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/lib/test_udf_signature.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/lib/test_utils.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/lib/test_webdataset.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/sql/__init__.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/sql/sqlite/__init__.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/sql/sqlite/test_types.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/sql/sqlite/test_utils.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/sql/test_array.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/sql/test_conditional.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/sql/test_path.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/sql/test_random.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/sql/test_selectable.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/sql/test_string.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/test_asyn.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/test_cache.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/test_catalog.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/test_catalog_loader.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/test_cli_parsing.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/test_client.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/test_client_s3.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/test_config.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/test_data_storage.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/test_database_engine.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/test_dataset.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/test_dispatch.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/test_fileslice.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/test_listing.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/test_metastore.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/test_module_exports.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/test_query.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/test_query_metrics.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/test_query_params.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/test_serializer.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/test_session.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/test_utils.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/unit/test_warehouse.py +0 -0
- {datachain-0.7.8 → datachain-0.7.10}/tests/utils.py +0 -0
@@ -3,7 +3,7 @@ name: Tests
 on:
   push:
     branches: [main]
-
+  pull_request_target:
   workflow_dispatch:
 
 env:
@@ -14,13 +14,22 @@ concurrency:
   cancel-in-progress: true
 
 jobs:
+  authorize:
+    environment: ${{ github.event_name == 'pull_request_target' && github.event.pull_request.head.repo.full_name != github.repository && 'external' || 'internal' }}
+    runs-on: ubuntu-latest
+    steps:
+      - run: true
+
   lint:
+    needs: authorize
+
     runs-on: ubuntu-latest
     steps:
       - name: Check out the repository
         uses: actions/checkout@v4
         with:
           fetch-depth: 0
+          ref: ${{ github.event.pull_request.head.sha || github.ref }}
 
       - name: Set up Python 3.9
         uses: actions/setup-python@v5
@@ -53,6 +62,8 @@ jobs:
         run: nox -s lint
 
   datachain:
+    needs: authorize
+
     timeout-minutes: 40
     runs-on: ${{ matrix.os }}
     strategy:
@@ -75,6 +86,7 @@ jobs:
         uses: actions/checkout@v4
         with:
           fetch-depth: 0
+          ref: ${{ github.event.pull_request.head.sha || github.ref }}
 
       - name: Set up Python ${{ matrix.pyv }}
         uses: actions/setup-python@v5
@@ -117,6 +129,8 @@ jobs:
         run: nox -s docs
 
   examples:
+    needs: authorize
+
     runs-on: ${{ matrix.os }}
     timeout-minutes: 60
     strategy:
@@ -132,9 +146,10 @@ jobs:
           - {os: ubuntu-latest-4-cores, pyv: "3.9", group: multimodal}
           - {os: ubuntu-latest-4-cores, pyv: "3.12", group: multimodal}
 
-
     steps:
       - uses: actions/checkout@v4
+        with:
+          ref: ${{ github.event.pull_request.head.sha || github.ref }}
 
       - name: Set up Python ${{ matrix.pyv }}
         uses: actions/setup-python@v5
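Note on the tests.yml hunks above: the new `authorize` job and the added `ref:` inputs both rely on the `&&` / `||` short-circuit idiom in GitHub Actions expressions. The sketch below restates that logic in plain Python to make the two outcomes easier to see. It is only an illustration: the function names and the fork repository name are hypothetical, and it models the expressions as written in the diff rather than GitHub's actual evaluator.

```python
from typing import Optional


def pick_environment(event_name: str, head_repo: str, base_repo: str) -> str:
    """Model of `A && B && 'external' || 'internal'` from the authorize job.

    The result is 'external' only for pull_request_target events whose head
    repository differs from the repository running the workflow (a fork PR).
    """
    is_fork_pr = event_name == "pull_request_target" and head_repo != base_repo
    return "external" if is_fork_pr else "internal"


def pick_checkout_ref(pr_head_sha: Optional[str], ref: str) -> str:
    """Model of `github.event.pull_request.head.sha || github.ref`.

    Falls back to the triggering ref when the event carries no pull request.
    """
    return pr_head_sha or ref


if __name__ == "__main__":
    base = "iterative/datachain"
    # Hypothetical fork name, used only for illustration.
    print(pick_environment("pull_request_target", "someone/datachain", base))
    print(pick_environment("pull_request_target", base, base))
    print(pick_environment("push", base, base))
    print(pick_checkout_ref(None, "refs/heads/main"))
```

Read this way, every other job that declares `needs: authorize` waits for a job whose environment is `external` for fork pull requests and `internal` otherwise.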
@@ -0,0 +1,207 @@
+Metadata-Version: 2.1
+Name: datachain
+Version: 0.7.10
+Summary: Wrangle unstructured AI data at scale
+Author-email: Dmitry Petrov <support@dvc.org>
+License: Apache-2.0
+Project-URL: Documentation, https://datachain.dvc.ai
+Project-URL: Issues, https://github.com/iterative/datachain/issues
+Project-URL: Source, https://github.com/iterative/datachain
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Development Status :: 2 - Pre-Alpha
+Requires-Python: >=3.9
+Description-Content-Type: text/x-rst
+License-File: LICENSE
+Requires-Dist: pyyaml
+Requires-Dist: tomlkit
+Requires-Dist: tqdm
+Requires-Dist: numpy<3,>=1
+Requires-Dist: pandas>=2.0.0
+Requires-Dist: pyarrow
+Requires-Dist: typing-extensions
+Requires-Dist: python-dateutil>=2
+Requires-Dist: attrs>=21.3.0
+Requires-Dist: s3fs>=2024.2.0
+Requires-Dist: gcsfs>=2024.2.0
+Requires-Dist: adlfs>=2024.2.0
+Requires-Dist: dvc-data<4,>=3.10
+Requires-Dist: dvc-objects<6,>=4
+Requires-Dist: shtab<2,>=1.3.4
+Requires-Dist: sqlalchemy>=2
+Requires-Dist: multiprocess==0.70.16
+Requires-Dist: cloudpickle
+Requires-Dist: orjson>=3.10.5
+Requires-Dist: pydantic<3,>=2
+Requires-Dist: jmespath>=1.0
+Requires-Dist: datamodel-code-generator>=0.25
+Requires-Dist: Pillow<12,>=10.0.0
+Requires-Dist: msgpack<2,>=1.0.4
+Requires-Dist: psutil
+Requires-Dist: huggingface_hub
+Requires-Dist: iterative-telemetry>=0.0.9
+Requires-Dist: platformdirs
+Requires-Dist: dvc-studio-client<1,>=0.21
+Requires-Dist: tabulate
+Provides-Extra: docs
+Requires-Dist: mkdocs>=1.5.2; extra == "docs"
+Requires-Dist: mkdocs-gen-files>=0.5.0; extra == "docs"
+Requires-Dist: mkdocs-material>=9.3.1; extra == "docs"
+Requires-Dist: mkdocs-section-index>=0.3.6; extra == "docs"
+Requires-Dist: mkdocstrings-python>=1.6.3; extra == "docs"
+Requires-Dist: mkdocs-literate-nav>=0.6.1; extra == "docs"
+Provides-Extra: torch
+Requires-Dist: torch>=2.1.0; extra == "torch"
+Requires-Dist: torchvision; extra == "torch"
+Requires-Dist: transformers>=4.36.0; extra == "torch"
+Provides-Extra: remote
+Requires-Dist: lz4; extra == "remote"
+Requires-Dist: requests>=2.22.0; extra == "remote"
+Provides-Extra: vector
+Requires-Dist: usearch; extra == "vector"
+Provides-Extra: hf
+Requires-Dist: numba>=0.60.0; extra == "hf"
+Requires-Dist: datasets[audio,vision]>=2.21.0; extra == "hf"
+Provides-Extra: tests
+Requires-Dist: datachain[hf,remote,torch,vector]; extra == "tests"
+Requires-Dist: pytest<9,>=8; extra == "tests"
+Requires-Dist: pytest-sugar>=0.9.6; extra == "tests"
+Requires-Dist: pytest-cov>=4.1.0; extra == "tests"
+Requires-Dist: pytest-mock>=3.12.0; extra == "tests"
+Requires-Dist: pytest-servers[all]>=0.5.8; extra == "tests"
+Requires-Dist: pytest-benchmark[histogram]; extra == "tests"
+Requires-Dist: pytest-xdist>=3.3.1; extra == "tests"
+Requires-Dist: virtualenv; extra == "tests"
+Requires-Dist: dulwich; extra == "tests"
+Requires-Dist: hypothesis; extra == "tests"
+Requires-Dist: open_clip_torch; extra == "tests"
+Requires-Dist: aiotools>=1.7.0; extra == "tests"
+Requires-Dist: requests-mock; extra == "tests"
+Requires-Dist: scipy; extra == "tests"
+Provides-Extra: dev
+Requires-Dist: datachain[docs,tests]; extra == "dev"
+Requires-Dist: mypy==1.13.0; extra == "dev"
+Requires-Dist: types-python-dateutil; extra == "dev"
+Requires-Dist: types-pytz; extra == "dev"
+Requires-Dist: types-PyYAML; extra == "dev"
+Requires-Dist: types-requests; extra == "dev"
+Requires-Dist: types-tabulate; extra == "dev"
+Provides-Extra: examples
+Requires-Dist: datachain[tests]; extra == "examples"
+Requires-Dist: numpy<2,>=1; extra == "examples"
+Requires-Dist: defusedxml; extra == "examples"
+Requires-Dist: accelerate; extra == "examples"
+Requires-Dist: unstructured[embed-huggingface,pdf]<0.16.0; extra == "examples"
+Requires-Dist: pdfplumber==0.11.4; extra == "examples"
+Requires-Dist: huggingface_hub[hf_transfer]; extra == "examples"
+Requires-Dist: onnx==1.16.1; extra == "examples"
+Requires-Dist: ultralytics==8.3.37; extra == "examples"
|
|
102
|
+
|
|
103
|
+
================
|
|
104
|
+
|logo| DataChain
|
|
105
|
+
================
|
|
106
|
+
|
|
107
|
+
|PyPI| |Python Version| |Codecov| |Tests|
|
|
108
|
+
|
|
109
|
+
.. |logo| image:: docs/assets/datachain.svg
|
|
110
|
+
:height: 24
|
|
111
|
+
.. |PyPI| image:: https://img.shields.io/pypi/v/datachain.svg
|
|
112
|
+
:target: https://pypi.org/project/datachain/
|
|
113
|
+
:alt: PyPI
|
|
114
|
+
.. |Python Version| image:: https://img.shields.io/pypi/pyversions/datachain
|
|
115
|
+
:target: https://pypi.org/project/datachain
|
|
116
|
+
:alt: Python Version
|
|
117
|
+
.. |Codecov| image:: https://codecov.io/gh/iterative/datachain/graph/badge.svg?token=byliXGGyGB
|
|
118
|
+
:target: https://codecov.io/gh/iterative/datachain
|
|
119
|
+
:alt: Codecov
|
|
120
|
+
.. |Tests| image:: https://github.com/iterative/datachain/actions/workflows/tests.yml/badge.svg
|
|
121
|
+
:target: https://github.com/iterative/datachain/actions/workflows/tests.yml
|
|
122
|
+
:alt: Tests
|
|
123
|
+
|
|
124
|
+
DataChain is a Python-based AI-data warehouse for transforming and analyzing unstructured
|
|
125
|
+
data like images, audio, videos, text and PDFs. It integrates with external storage
|
|
126
|
+
(e.g. S3) to process data efficiently without data duplication and manages metadata
|
|
127
|
+
in an internal database for easy and efficient querying.
|
|
128
|
+
|
|
129
|
+
|
|
130
|
+
Use Cases
|
|
131
|
+
=========
|
|
132
|
+
|
|
133
|
+
1. **ETL.** Pythonic framework for describing and running unstructured data transformations
|
|
134
|
+
and enrichments, applying models to data, including LLMs.
|
|
135
|
+
2. **Analytics.** DataChain dataset is a table that combines all the information about data
|
|
136
|
+
objects in one place + it provides dataframe-like API and vecrorized engine to do analytics
|
|
137
|
+
on these tables at scale.
|
|
138
|
+
3. **Versioning.** DataChain doesn't store, require moving or copying data (unlike DVC).
|
|
139
|
+
Perfect use case is a bucket with thousands or millions of images, videos, audio, PDFs.
|
|
140
|
+
|
|
141
|
+
|
|
142
|
+
Key Features
|
|
143
|
+
============
|
|
144
|
+
|
|
145
|
+
📂 **Multimodal Dataset Versioning.**
|
|
146
|
+
- Version unstructured data without moving or creating data copies, by supporting
|
|
147
|
+
references to S3, GCP, Azure, and local file systems.
|
|
148
|
+
- Multimodal data support: images, video, text, PDFs, JSONs, CSVs, parquet, etc.
|
|
149
|
+
- Unite files and metadata together into persistent, versioned, columnar datasets.
|
|
150
|
+
|
|
151
|
+
🐍 **Python-friendly.**
|
|
152
|
+
- Operate on Python objects and object fields: float scores, strings, matrices,
|
|
153
|
+
LLM response objects.
|
|
154
|
+
- Run Python code on high-scale, terabyte-size datasets, with built-in
|
|
155
|
+
parallelization and memory-efficient computing — no SQL or Spark required.
|
|
156
|
+
|
|
157
|
+
🧠 **Data Enrichment and Processing.**
|
|
158
|
+
- Generate metadata using local AI models and LLM APIs.
|
|
159
|
+
- Filter, join, and group datasets by metadata. Search by vector embeddings.
|
|
160
|
+
- High-performance vectorized operations on Python objects: sum, count, avg, etc.
|
|
161
|
+
- Pass datasets to PyTorch and TensorFlow, or export them back into storage.
|
|
162
|
+
|
|
163
|
+
|
|
164
|
+
Getting Started
|
|
165
|
+
===============
|
|
166
|
+
|
|
167
|
+
Visit `Quick Start <https://docs.datachain.ai/quick-start>`_ to get started with `DataChain` and learn more.
|
|
168
|
+
|
|
169
|
+
|
|
170
|
+
Contributing
|
|
171
|
+
============
|
|
172
|
+
|
|
173
|
+
Contributions are very welcome. To learn more, see the `Contributor Guide`_.
|
|
174
|
+
|
|
175
|
+
|
|
176
|
+
Community and Support
|
|
177
|
+
=====================
|
|
178
|
+
|
|
179
|
+
* `Docs <https://docs.datachain.ai/>`_
|
|
180
|
+
* `File an issue`_ if you encounter any problems
|
|
181
|
+
* `Discord Chat <https://dvc.org/chat>`_
|
|
182
|
+
* `Email <mailto:support@dvc.org>`_
|
|
183
|
+
* `Twitter <https://twitter.com/DVCorg>`_
|
|
184
|
+
|
|
185
|
+
|
|
186
|
+
DataChain Studio Platform
|
|
187
|
+
=========================
|
|
188
|
+
|
|
189
|
+
`DataChain Studio`_ is a proprietary solution for teams that offers:
|
|
190
|
+
|
|
191
|
+
- **Centralized dataset registry** to manage data, code and
|
|
192
|
+
dependencies in one place.
|
|
193
|
+
- **Data Lineage** for data sources as well as derivative datasets.
|
|
194
|
+
- **UI for Multimodal Data** like images, videos, and PDFs.
|
|
195
|
+
- **Scalable Compute** to handle large datasets (100M+ files) and in-house
|
|
196
|
+
AI model inference.
|
|
197
|
+
- **Access control** including SSO and team-based collaboration.
|
|
198
|
+
|
|
199
|
+
.. _PyPI: https://pypi.org/
|
|
200
|
+
.. _file an issue: https://github.com/iterative/datachain/issues
|
|
201
|
+
.. github-only
|
|
202
|
+
.. _Contributor Guide: https://docs.datachain.ai/contributing
|
|
203
|
+
.. _Pydantic: https://github.com/pydantic/pydantic
|
|
204
|
+
.. _publicly available: https://radar.kit.edu/radar/en/dataset/FdJmclKpjHzLfExE.ExpBot%2B-%2BA%2Bdataset%2Bof%2B79%2Bdialogs%2Bwith%2Ban%2Bexperimental%2Bcustomer%2Bservice%2Bchatbot
|
|
205
|
+
.. _SQLite: https://www.sqlite.org/
|
|
206
|
+
.. _Getting Started: https://docs.datachain.ai/
|
|
207
|
+
.. _DataChain Studio: https://studio.datachain.ai/
|
|
@@ -0,0 +1,105 @@
|
|
|
1
|
+
================
|
|
2
|
+
|logo| DataChain
|
|
3
|
+
================
|
|
4
|
+
|
|
5
|
+
|PyPI| |Python Version| |Codecov| |Tests|
|
|
6
|
+
|
|
7
|
+
.. |logo| image:: docs/assets/datachain.svg
|
|
8
|
+
:height: 24
|
|
9
|
+
.. |PyPI| image:: https://img.shields.io/pypi/v/datachain.svg
|
|
10
|
+
:target: https://pypi.org/project/datachain/
|
|
11
|
+
:alt: PyPI
|
|
12
|
+
.. |Python Version| image:: https://img.shields.io/pypi/pyversions/datachain
|
|
13
|
+
:target: https://pypi.org/project/datachain
|
|
14
|
+
:alt: Python Version
|
|
15
|
+
.. |Codecov| image:: https://codecov.io/gh/iterative/datachain/graph/badge.svg?token=byliXGGyGB
|
|
16
|
+
:target: https://codecov.io/gh/iterative/datachain
|
|
17
|
+
:alt: Codecov
|
|
18
|
+
.. |Tests| image:: https://github.com/iterative/datachain/actions/workflows/tests.yml/badge.svg
|
|
19
|
+
:target: https://github.com/iterative/datachain/actions/workflows/tests.yml
|
|
20
|
+
:alt: Tests
|
|
21
|
+
|
|
22
|
+
DataChain is a Python-based AI-data warehouse for transforming and analyzing unstructured
|
|
23
|
+
data like images, audio, videos, text and PDFs. It integrates with external storage
|
|
24
|
+
(e.g. S3) to process data efficiently without data duplication and manages metadata
|
|
25
|
+
in an internal database for easy and efficient querying.
|
|
26
|
+
|
|
27
|
+
|
|
28
|
+
Use Cases
|
|
29
|
+
=========
|
|
30
|
+
|
|
31
|
+
1. **ETL.** Pythonic framework for describing and running unstructured data transformations
|
|
32
|
+
and enrichments, applying models to data, including LLMs.
|
|
33
|
+
2. **Analytics.** A DataChain dataset is a table that combines all the information about data
|
|
34
|
+
objects in one place, and it provides a dataframe-like API and a vectorized engine to do analytics
|
|
35
|
+
on these tables at scale.
|
|
36
|
+
3. **Versioning.** DataChain doesn't store data, or require moving or copying it (unlike DVC).
|
|
37
|
+
A perfect use case is a bucket with thousands or millions of images, videos, audio files, or PDFs.
|
|
38
|
+
|
|
39
|
+
|
|
40
|
+
Key Features
|
|
41
|
+
============
|
|
42
|
+
|
|
43
|
+
📂 **Multimodal Dataset Versioning.**
|
|
44
|
+
- Version unstructured data without moving or creating data copies, by supporting
|
|
45
|
+
references to S3, GCP, Azure, and local file systems.
|
|
46
|
+
- Multimodal data support: images, video, text, PDFs, JSONs, CSVs, parquet, etc.
|
|
47
|
+
- Unite files and metadata together into persistent, versioned, columnar datasets.
|
|
48
|
+
|
|
49
|
+
🐍 **Python-friendly.**
|
|
50
|
+
- Operate on Python objects and object fields: float scores, strings, matrices,
|
|
51
|
+
LLM response objects.
|
|
52
|
+
- Run Python code on high-scale, terabyte-size datasets, with built-in
|
|
53
|
+
parallelization and memory-efficient computing — no SQL or Spark required.
|
|
54
|
+
|
|
55
|
+
🧠 **Data Enrichment and Processing.**
|
|
56
|
+
- Generate metadata using local AI models and LLM APIs.
|
|
57
|
+
- Filter, join, and group datasets by metadata. Search by vector embeddings.
|
|
58
|
+
- High-performance vectorized operations on Python objects: sum, count, avg, etc.
|
|
59
|
+
- Pass datasets to PyTorch and TensorFlow, or export them back into storage.
|
|
60
|
+
|
|
61
|
+
|
|
62
|
+
Getting Started
|
|
63
|
+
===============
|
|
64
|
+
|
|
65
|
+
Visit `Quick Start <https://docs.datachain.ai/quick-start>`_ to get started with `DataChain` and learn more.
|
|
66
|
+
|
|
67
|
+
|
|
68
|
+
Contributing
|
|
69
|
+
============
|
|
70
|
+
|
|
71
|
+
Contributions are very welcome. To learn more, see the `Contributor Guide`_.
|
|
72
|
+
|
|
73
|
+
|
|
74
|
+
Community and Support
|
|
75
|
+
=====================
|
|
76
|
+
|
|
77
|
+
* `Docs <https://docs.datachain.ai/>`_
|
|
78
|
+
* `File an issue`_ if you encounter any problems
|
|
79
|
+
* `Discord Chat <https://dvc.org/chat>`_
|
|
80
|
+
* `Email <mailto:support@dvc.org>`_
|
|
81
|
+
* `Twitter <https://twitter.com/DVCorg>`_
|
|
82
|
+
|
|
83
|
+
|
|
84
|
+
DataChain Studio Platform
|
|
85
|
+
=========================
|
|
86
|
+
|
|
87
|
+
`DataChain Studio`_ is a proprietary solution for teams that offers:
|
|
88
|
+
|
|
89
|
+
- **Centralized dataset registry** to manage data, code and
|
|
90
|
+
dependencies in one place.
|
|
91
|
+
- **Data Lineage** for data sources as well as derivative datasets.
|
|
92
|
+
- **UI for Multimodal Data** like images, videos, and PDFs.
|
|
93
|
+
- **Scalable Compute** to handle large datasets (100M+ files) and in-house
|
|
94
|
+
AI model inference.
|
|
95
|
+
- **Access control** including SSO and team-based collaboration.
|
|
96
|
+
|
|
97
|
+
.. _PyPI: https://pypi.org/
|
|
98
|
+
.. _file an issue: https://github.com/iterative/datachain/issues
|
|
99
|
+
.. github-only
|
|
100
|
+
.. _Contributor Guide: https://docs.datachain.ai/contributing
|
|
101
|
+
.. _Pydantic: https://github.com/pydantic/pydantic
|
|
102
|
+
.. _publicly available: https://radar.kit.edu/radar/en/dataset/FdJmclKpjHzLfExE.ExpBot%2B-%2BA%2Bdataset%2Bof%2B79%2Bdialogs%2Bwith%2Ban%2Bexperimental%2Bcustomer%2Bservice%2Bchatbot
|
|
103
|
+
.. _SQLite: https://www.sqlite.org/
|
|
104
|
+
.. _Getting Started: https://docs.datachain.ai/
|
|
105
|
+
.. _DataChain Studio: https://studio.datachain.ai/
|
|
@@ -0,0 +1,111 @@
|
|
|
1
|
+
# Contributor Guide
|
|
2
|
+
|
|
3
|
+
Thank you for your interest in improving this project. This project is
|
|
4
|
+
open-source under the [Apache 2.0
|
|
5
|
+
license](https://opensource.org/licenses/Apache-2.0) and welcomes
|
|
6
|
+
contributions in the form of bug reports, feature requests, and pull
|
|
7
|
+
requests.
|
|
8
|
+
|
|
9
|
+
Here is a list of important resources for contributors:
|
|
10
|
+
|
|
11
|
+
- [Source Code](https://github.com/iterative/datachain)
|
|
12
|
+
- [Documentation](https://docs.dvc.ai/datachain)
|
|
13
|
+
- [Issue Tracker](https://github.com/iterative/datachain/issues)
|
|
14
|
+
- [Code of Conduct](https://github.com/iterative/datachain?tab=coc-ov-file)
|
|
15
|
+
|
|
16
|
+
## How to report a bug
|
|
17
|
+
|
|
18
|
+
Report bugs on the [Issue
|
|
19
|
+
Tracker](https://github.com/iterative/datachain/issues).
|
|
20
|
+
|
|
21
|
+
When filing an issue, make sure to answer these questions:
|
|
22
|
+
|
|
23
|
+
- Which operating system and Python version are you using?
|
|
24
|
+
- Which version of this project are you using?
|
|
25
|
+
- What did you do?
|
|
26
|
+
- What did you expect to see?
|
|
27
|
+
- What did you see instead?
|
|
28
|
+
|
|
29
|
+
The best way to get your bug fixed is to provide a test case, and/or
|
|
30
|
+
steps to reproduce the issue.
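
The first two questions above can be answered with a few lines of Python. The following is a minimal sketch, not part of the project: the helper name `environment_report` is hypothetical, and it assumes the package is installed under the distribution name `datachain`.

```python
import platform
import sys
from importlib import metadata


def environment_report(package: str = "datachain") -> str:
    """Return the OS, Python, and package versions as a short text block."""
    try:
        package_version = metadata.version(package)
    except metadata.PackageNotFoundError:
        # The package is not installed in the current environment.
        package_version = "not installed"
    return "\n".join(
        [
            f"OS: {platform.platform()}",
            f"Python: {sys.version.split()[0]}",
            f"{package}: {package_version}",
        ]
    )


if __name__ == "__main__":
    print(environment_report())
```

Pasting the printed lines into the issue saves a round trip with the maintainers.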
|
|
31
|
+
|
|
32
|
+
## How to request a feature
|
|
33
|
+
|
|
34
|
+
Request features on the [Issue
|
|
35
|
+
Tracker](https://github.com/iterative/datachain/issues).
|
|
36
|
+
|
|
37
|
+
## How to set up your development environment
|
|
38
|
+
|
|
39
|
+
You need Python 3.8+ and the following tools:
|
|
40
|
+
|
|
41
|
+
- [Nox](https://nox.thea.codes/)
|
|
42
|
+
|
|
43
|
+
Install the package with development requirements:
|
|
44
|
+
|
|
45
|
+
``` console
|
|
46
|
+
$ pip install nox
|
|
47
|
+
```
|
|
48
|
+
|
|
49
|
+
## How to test the project
|
|
50
|
+
|
|
51
|
+
Run the full test suite:
|
|
52
|
+
|
|
53
|
+
``` console
|
|
54
|
+
$ nox
|
|
55
|
+
```
|
|
56
|
+
|
|
57
|
+
List the available Nox sessions:
|
|
58
|
+
|
|
59
|
+
``` console
|
|
60
|
+
$ nox --list-sessions
|
|
61
|
+
```
|
|
62
|
+
|
|
63
|
+
You can also run a specific Nox session. For example, invoke the unit
|
|
64
|
+
test suite like this:
|
|
65
|
+
|
|
66
|
+
``` console
|
|
67
|
+
$ nox --session=tests
|
|
68
|
+
```
|
|
69
|
+
|
|
70
|
+
Unit tests are located in the `tests` directory, and are written using
|
|
71
|
+
the [pytest](https://pytest.readthedocs.io/) testing framework.
|
|
72
|
+
|
|
73
|
+
## Build documentation
|
|
74
|
+
|
|
75
|
+
If you've made any changes to the documentation (including changes to
|
|
76
|
+
function signatures, class definitions, or docstrings that will appear
|
|
77
|
+
in the API documentation), make sure it builds successfully.
|
|
78
|
+
|
|
79
|
+
``` console
|
|
80
|
+
$ nox -s docs
|
|
81
|
+
```
|
|
82
|
+
|
|
83
|
+
In order to run this locally with hot reload on changes:
|
|
84
|
+
|
|
85
|
+
``` console
|
|
86
|
+
$ mkdocs serve
|
|
87
|
+
```
|
|
88
|
+
|
|
89
|
+
## How to submit changes
|
|
90
|
+
|
|
91
|
+
Open a [pull request](https://github.com/iterative/datachain/pulls) to
|
|
92
|
+
submit changes to this project.
|
|
93
|
+
|
|
94
|
+
Your pull request needs to meet the following guidelines for acceptance:
|
|
95
|
+
|
|
96
|
+
- The Nox test suite must pass without errors and warnings.
|
|
97
|
+
- Include unit tests. This project maintains 100% code coverage.
|
|
98
|
+
- If your changes add functionality, update the documentation
|
|
99
|
+
accordingly.
|
|
100
|
+
|
|
101
|
+
Feel free to submit early, though---we can always iterate on this.
|
|
102
|
+
|
|
103
|
+
To run linting and code formatting checks, you can invoke a `lint` session in nox:
|
|
104
|
+
|
|
105
|
+
``` console
|
|
106
|
+
$ nox -s lint
|
|
107
|
+
```
|
|
108
|
+
|
|
109
|
+
It is recommended to open an issue before starting work on anything.
|
|
110
|
+
This will allow a chance to talk it over with the owners and validate
|
|
111
|
+
your approach.
|
|
@@ -1,80 +1,67 @@
|
|
|
1
|
-
# Get Started with DataChain
|
|
2
1
|
|
|
3
|
-
|
|
2
|
+
# Examples
|
|
4
3
|
|
|
4
|
+
## DataChain Basics
|
|
5
5
|
|
|
6
|
-
|
|
6
|
+
!!! example "DataChain Basics"
|
|
7
7
|
|
|
8
|
-
|
|
8
|
+
Datachain is built by composing wrangling operations.
|
|
9
9
|
|
|
10
|
-
|
|
10
|
+
For example, consider the New Yorker Cartoon caption contest dataset, where cartoons are matched against potential titles. Suppose we want to augment this dataset with synthetic scene descriptions generated by an AI model. The code below takes images from cloud storage, applies the PaliGemma model to caption the first five of them, and puts the results in the column “scene”:
|
|
11
11
|
|
|
12
|
-
|
|
12
|
+
```python
|
|
13
|
+
from datachain.lib.dc import Column, DataChain, File # (1)!
|
|
14
|
+
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration # (2)!
|
|
13
15
|
|
|
14
|
-
|
|
16
|
+
images = DataChain.from_storage("gs://datachain-demo/newyorker_caption_contest/images", type="image")
|
|
15
17
|
|
|
16
|
-
|
|
18
|
+
model = PaliGemmaForConditionalGeneration.from_pretrained("google/paligemma-3b-mix-224")
|
|
19
|
+
processor = AutoProcessor.from_pretrained("google/paligemma-3b-mix-224")
|
|
17
20
|
|
|
18
|
-
|
|
21
|
+
def process(file: File) -> str:
|
|
22
|
+
    image = file.read().convert("RGB")
|
|
23
|
+
    inputs = processor(text="caption", images=image, return_tensors="pt")
|
|
24
|
+
    generate_ids = model.generate(**inputs, max_new_tokens=100)
|
|
25
|
+
    return processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
|
|
19
26
|
|
|
20
|
-
|
|
21
|
-
|
|
22
|
-
|
|
23
|
-
|
|
24
|
-
|
|
25
|
-
|
|
26
|
-
|
|
27
|
-
from datachain.lib.dc import Column, DataChain, File
|
|
28
|
-
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
|
|
29
|
-
|
|
30
|
-
images = DataChain.from_storage("gs://datachain-demo/newyorker_caption_contest/images", type="image")
|
|
31
|
-
|
|
32
|
-
model = PaliGemmaForConditionalGeneration.from_pretrained("google/paligemma-3b-mix-224")
|
|
33
|
-
processor = AutoProcessor.from_pretrained("google/paligemma-3b-mix-224")
|
|
34
|
-
|
|
35
|
-
def process(file: File) -> str:
|
|
36
|
-
image=file.read().convert("RGB")
|
|
37
|
-
inputs = processor(text="caption", images=image, return_tensors="pt")
|
|
38
|
-
generate_ids = model.generate(**inputs, max_new_tokens=100)
|
|
39
|
-
return processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
|
|
40
|
-
|
|
41
|
-
chain = (
|
|
42
|
-
images.limit(5)
|
|
43
|
-
.settings(cache=True)
|
|
44
|
-
.map(scene=lambda file: process(file), output = str)
|
|
45
|
-
.save()
|
|
46
|
-
)
|
|
47
|
-
```
|
|
27
|
+
chain = (
|
|
28
|
+
    images.limit(5)
|
|
29
|
+
    .settings(cache=True)
|
|
30
|
+
    .map(scene=lambda file: process(file), output=str)
|
|
31
|
+
    .save()
|
|
32
|
+
)
|
|
33
|
+
```
|
|
48
34
|
|
|
49
|
-
|
|
35
|
+
1. `pip install datachain`
|
|
36
|
+
2. `pip install transformers`
|
|
50
37
|
|
|
51
|
-
|
|
52
|
-
import matplotlib.pyplot as plt
|
|
53
|
-
import re
|
|
54
|
-
from textwrap import wrap
|
|
38
|
+
Here is how we can view the results in a plot:
|
|
55
39
|
|
|
56
|
-
|
|
57
|
-
|
|
58
|
-
|
|
40
|
+
```python
|
|
41
|
+
import matplotlib.pyplot as plt
|
|
42
|
+
import re
|
|
43
|
+
from textwrap import wrap
|
|
59
44
|
|
|
60
|
-
|
|
61
|
-
|
|
62
|
-
|
|
45
|
+
def trim_text(text):
|
|
46
|
+
    match = re.search(r'[A-Z][^.]*\.', text)
|
|
47
|
+
    return match.group(0) if match else ''
|
|
63
48
|
|
|
64
|
-
|
|
65
|
-
|
|
66
|
-
|
|
67
|
-
wrapped_caption = "\n".join(wrap(trim_text(caption), 30))
|
|
68
|
-
ax.set_title(wrapped_caption, fontsize=6)
|
|
49
|
+
images = chain.collect("file")
|
|
50
|
+
captions = chain.collect("scene")
|
|
51
|
+
_, axes = plt.subplots(1, len(captions), figsize=(15, 5))
|
|
69
52
|
|
|
70
|
-
|
|
71
|
-
|
|
53
|
+
for ax, img, caption in zip(axes, images, captions):
|
|
54
|
+
    ax.imshow(img.read(), cmap='gray')
|
|
55
|
+
    ax.axis('off')
|
|
56
|
+
    wrapped_caption = "\n".join(wrap(trim_text(caption), 30))
|
|
57
|
+
    ax.set_title(wrapped_caption, fontsize=6)
|
|
72
58
|
|
|
73
|
-
|
|
59
|
+
plt.show()
|
|
60
|
+
```
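
To make the caption clean-up in the plotting code easier to follow, here is a small standalone sketch of what `trim_text` and `wrap` do to a string. The sample captions are invented for illustration and are not model output.

```python
import re
from textwrap import wrap


def trim_text(text):
    # Keep the first span that starts with a capital letter and ends at a period.
    match = re.search(r'[A-Z][^.]*\.', text)
    return match.group(0) if match else ''


# Hypothetical raw output: a lowercase prompt echo followed by two sentences.
raw = "caption\nA man is sitting at a desk. He looks bored."
first_sentence = trim_text(raw)
print(first_sentence)  # A man is sitting at a desk.

# Text with no capital letter, or no closing period, yields an empty title.
print(repr(trim_text("no capital letters here")))
print(repr(trim_text("Unfinished sentence without a period")))

# wrap() splits a long caption into lines of at most 30 characters.
long_caption = "A man in a suit is standing on a tiny desert island."
for line in wrap(trim_text(long_caption), 30):
    print(line)
```

Only the first sentence survives, so a long or multi-sentence caption is shortened to a title that fits above each image.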
|
|
74
61
|
|
|
75
|
-
|
|
62
|
+

|
|
76
63
|
|
|
77
|
-
|
|
64
|
+
To see more examples, check out the [tutorials](tutorials.md).
|
|
78
65
|
|
|
79
66
|
### Handling Python objects
|
|
80
67
|
|
|
@@ -188,7 +175,7 @@ Datachain avoids redundant operations. Execution is triggered only when a downst
|
|
|
188
175
|
|
|
189
176
|
The “save” operation persists execution results and automatically reuses them every time downstream functions ask for data. Saving without an explicit name generates an auto-named dataset which serves the same purpose.
|
|
190
177
|
|
|
191
|
-
Datachain natively supports parallelism in execution. If an API or a local model supports parallel requests, the `settings` operator can split the load across multiple workers (see the [code example above](
|
|
178
|
+
DataChain natively supports parallelism in execution. If an API or a local model supports parallel requests, the `settings` operator can split the load across multiple workers (see the [code example above](#handling-python-objects)).
|
|
192
179
|
|
|
193
180
|
### Reading external metadata
|
|
194
181
|
|
|
@@ -279,7 +266,7 @@ images_with_dogs.select("annotations", "file.name").show()
|
|
|
279
266
|
```
|
|
280
267
|
For an in-depth review of working with JSON metadata, please follow this tutorial:
|
|
281
268
|
|
|
282
|
-
[
|
|
269
|
+
[GitHub](https://github.com/iterative/datachain-examples/blob/main/formats/json-metadata-tutorial.ipynb) or [Google Colab](https://colab.research.google.com/github/iterative/datachain-examples/blob/main/formats/json-metadata-tutorial.ipynb)
|
|
283
270
|
|
|
284
271
|
### Passing data to training
|
|
285
272
|
|
|
@@ -299,4 +286,4 @@ train(loader, model, optimizer)
|
|
|
299
286
|
|
|
300
287
|
See a larger example for CLIP fine-tuning here:
|
|
301
288
|
|
|
302
|
-
[
|
|
289
|
+
[GitHub](https://github.com/iterative/datachain-examples/blob/main/multimodal/clip_fine_tuning.ipynb) or [Google Colab](https://colab.research.google.com/github/iterative/datachain-examples/blob/main/multimodal/clip_fine_tuning.ipynb)
|