datachain 0.7.9__tar.gz → 0.7.10__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.


This version of datachain might be problematic.

Files changed (288)
  1. {datachain-0.7.9 → datachain-0.7.10}/.github/workflows/tests.yml +17 -2
  2. datachain-0.7.10/PKG-INFO +207 -0
  3. datachain-0.7.10/README.rst +105 -0
  4. datachain-0.7.10/docs/contributing.md +111 -0
  5. datachain-0.7.9/docs/index.md → datachain-0.7.10/docs/examples.md +48 -61
  6. datachain-0.7.10/docs/index.md +103 -0
  7. datachain-0.7.10/docs/quick-start.md +286 -0
  8. datachain-0.7.10/docs/references/index.md +10 -0
  9. datachain-0.7.10/docs/tutorials.md +5 -0
  10. {datachain-0.7.9 → datachain-0.7.10}/mkdocs.yml +11 -4
  11. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/client/fsspec.py +4 -2
  12. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/client/local.py +9 -4
  13. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/func/__init__.py +4 -1
  14. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/func/numeric.py +46 -0
  15. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/func/string.py +46 -0
  16. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/convert/flatten.py +7 -5
  17. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/convert/unflatten.py +2 -2
  18. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/convert/values_to_tuples.py +1 -1
  19. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/utils.py +1 -1
  20. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/query/dataset.py +1 -1
  21. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/sql/functions/numeric.py +12 -0
  22. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/sql/functions/string.py +12 -0
  23. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/sql/sqlite/base.py +40 -0
  24. datachain-0.7.10/src/datachain.egg-info/PKG-INFO +207 -0
  25. {datachain-0.7.9 → datachain-0.7.10}/src/datachain.egg-info/SOURCES.txt +4 -1
  26. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/test_func.py +60 -2
  27. datachain-0.7.9/CONTRIBUTING.rst +0 -129
  28. datachain-0.7.9/PKG-INFO +0 -488
  29. datachain-0.7.9/README.rst +0 -386
  30. datachain-0.7.9/docs/references/index.md +0 -8
  31. datachain-0.7.9/src/datachain.egg-info/PKG-INFO +0 -488
  32. {datachain-0.7.9 → datachain-0.7.10}/.cruft.json +0 -0
  33. {datachain-0.7.9 → datachain-0.7.10}/.gitattributes +0 -0
  34. {datachain-0.7.9 → datachain-0.7.10}/.github/ISSUE_TEMPLATE/bug_report.yml +0 -0
  35. {datachain-0.7.9 → datachain-0.7.10}/.github/ISSUE_TEMPLATE/empty_issue.md +0 -0
  36. {datachain-0.7.9 → datachain-0.7.10}/.github/ISSUE_TEMPLATE/feature_request.yml +0 -0
  37. {datachain-0.7.9 → datachain-0.7.10}/.github/codecov.yaml +0 -0
  38. {datachain-0.7.9 → datachain-0.7.10}/.github/dependabot.yml +0 -0
  39. {datachain-0.7.9 → datachain-0.7.10}/.github/workflows/benchmarks.yml +0 -0
  40. {datachain-0.7.9 → datachain-0.7.10}/.github/workflows/release.yml +0 -0
  41. {datachain-0.7.9 → datachain-0.7.10}/.github/workflows/tests-studio.yml +0 -0
  42. {datachain-0.7.9 → datachain-0.7.10}/.github/workflows/update-template.yaml +0 -0
  43. {datachain-0.7.9 → datachain-0.7.10}/.gitignore +0 -0
  44. {datachain-0.7.9 → datachain-0.7.10}/.pre-commit-config.yaml +0 -0
  45. {datachain-0.7.9 → datachain-0.7.10}/CODE_OF_CONDUCT.rst +0 -0
  46. {datachain-0.7.9 → datachain-0.7.10}/LICENSE +0 -0
  47. {datachain-0.7.9 → datachain-0.7.10}/docs/assets/captioned_cartoons.png +0 -0
  48. {datachain-0.7.9 → datachain-0.7.10}/docs/assets/datachain-white.svg +0 -0
  49. {datachain-0.7.9 → datachain-0.7.10}/docs/assets/datachain.svg +0 -0
  50. {datachain-0.7.9 → datachain-0.7.10}/docs/overrides/main.html +0 -0
  51. {datachain-0.7.9 → datachain-0.7.10}/docs/references/datachain.md +0 -0
  52. {datachain-0.7.9 → datachain-0.7.10}/docs/references/datatype.md +0 -0
  53. {datachain-0.7.9 → datachain-0.7.10}/docs/references/file.md +0 -0
  54. {datachain-0.7.9 → datachain-0.7.10}/docs/references/sql.md +0 -0
  55. {datachain-0.7.9 → datachain-0.7.10}/docs/references/torch.md +0 -0
  56. {datachain-0.7.9 → datachain-0.7.10}/docs/references/udf.md +0 -0
  57. {datachain-0.7.9 → datachain-0.7.10}/examples/computer_vision/iptc_exif_xmp_lib.py +0 -0
  58. {datachain-0.7.9 → datachain-0.7.10}/examples/computer_vision/llava2_image_desc_lib.py +0 -0
  59. {datachain-0.7.9 → datachain-0.7.10}/examples/computer_vision/openimage-detect.py +0 -0
  60. {datachain-0.7.9 → datachain-0.7.10}/examples/computer_vision/ultralytics-bbox.py +0 -0
  61. {datachain-0.7.9 → datachain-0.7.10}/examples/computer_vision/ultralytics-pose.py +0 -0
  62. {datachain-0.7.9 → datachain-0.7.10}/examples/computer_vision/ultralytics-segment.py +0 -0
  63. {datachain-0.7.9 → datachain-0.7.10}/examples/get_started/common_sql_functions.py +0 -0
  64. {datachain-0.7.9 → datachain-0.7.10}/examples/get_started/json-csv-reader.py +0 -0
  65. {datachain-0.7.9 → datachain-0.7.10}/examples/get_started/torch-loader.py +0 -0
  66. {datachain-0.7.9 → datachain-0.7.10}/examples/get_started/udfs/parallel.py +0 -0
  67. {datachain-0.7.9 → datachain-0.7.10}/examples/get_started/udfs/simple.py +0 -0
  68. {datachain-0.7.9 → datachain-0.7.10}/examples/get_started/udfs/stateful.py +0 -0
  69. {datachain-0.7.9 → datachain-0.7.10}/examples/llm_and_nlp/claude-query.py +0 -0
  70. {datachain-0.7.9 → datachain-0.7.10}/examples/llm_and_nlp/hf-dataset-llm-eval.py +0 -0
  71. {datachain-0.7.9 → datachain-0.7.10}/examples/llm_and_nlp/unstructured-embeddings-gen.py +0 -0
  72. {datachain-0.7.9 → datachain-0.7.10}/examples/llm_and_nlp/unstructured-summary-map.py +0 -0
  73. {datachain-0.7.9 → datachain-0.7.10}/examples/multimodal/clip_inference.py +0 -0
  74. {datachain-0.7.9 → datachain-0.7.10}/examples/multimodal/hf_pipeline.py +0 -0
  75. {datachain-0.7.9 → datachain-0.7.10}/examples/multimodal/openai_image_desc_lib.py +0 -0
  76. {datachain-0.7.9 → datachain-0.7.10}/examples/multimodal/wds.py +0 -0
  77. {datachain-0.7.9 → datachain-0.7.10}/examples/multimodal/wds_filtered.py +0 -0
  78. {datachain-0.7.9 → datachain-0.7.10}/noxfile.py +0 -0
  79. {datachain-0.7.9 → datachain-0.7.10}/pyproject.toml +0 -0
  80. {datachain-0.7.9 → datachain-0.7.10}/setup.cfg +0 -0
  81. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/__init__.py +0 -0
  82. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/__main__.py +0 -0
  83. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/asyn.py +0 -0
  84. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/cache.py +0 -0
  85. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/catalog/__init__.py +0 -0
  86. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/catalog/catalog.py +0 -0
  87. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/catalog/datasource.py +0 -0
  88. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/catalog/loader.py +0 -0
  89. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/cli.py +0 -0
  90. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/cli_utils.py +0 -0
  91. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/client/__init__.py +0 -0
  92. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/client/azure.py +0 -0
  93. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/client/fileslice.py +0 -0
  94. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/client/gcs.py +0 -0
  95. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/client/hf.py +0 -0
  96. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/client/s3.py +0 -0
  97. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/config.py +0 -0
  98. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/data_storage/__init__.py +0 -0
  99. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/data_storage/db_engine.py +0 -0
  100. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/data_storage/job.py +0 -0
  101. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/data_storage/metastore.py +0 -0
  102. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/data_storage/schema.py +0 -0
  103. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/data_storage/serializer.py +0 -0
  104. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/data_storage/sqlite.py +0 -0
  105. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/data_storage/warehouse.py +0 -0
  106. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/dataset.py +0 -0
  107. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/error.py +0 -0
  108. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/func/aggregate.py +0 -0
  109. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/func/array.py +0 -0
  110. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/func/base.py +0 -0
  111. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/func/conditional.py +0 -0
  112. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/func/func.py +0 -0
  113. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/func/path.py +0 -0
  114. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/func/random.py +0 -0
  115. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/func/window.py +0 -0
  116. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/job.py +0 -0
  117. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/__init__.py +0 -0
  118. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/arrow.py +0 -0
  119. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/clip.py +0 -0
  120. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/convert/__init__.py +0 -0
  121. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/convert/python_to_sql.py +0 -0
  122. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/convert/sql_to_python.py +0 -0
  123. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/data_model.py +0 -0
  124. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/dataset_info.py +0 -0
  125. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/dc.py +0 -0
  126. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/file.py +0 -0
  127. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/hf.py +0 -0
  128. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/image.py +0 -0
  129. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/listing.py +0 -0
  130. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/listing_info.py +0 -0
  131. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/meta_formats.py +0 -0
  132. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/model_store.py +0 -0
  133. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/pytorch.py +0 -0
  134. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/settings.py +0 -0
  135. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/signal_schema.py +0 -0
  136. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/tar.py +0 -0
  137. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/text.py +0 -0
  138. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/udf.py +0 -0
  139. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/udf_signature.py +0 -0
  140. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/vfile.py +0 -0
  141. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/webdataset.py +0 -0
  142. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/lib/webdataset_laion.py +0 -0
  143. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/listing.py +0 -0
  144. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/model/__init__.py +0 -0
  145. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/model/bbox.py +0 -0
  146. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/model/pose.py +0 -0
  147. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/model/segment.py +0 -0
  148. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/model/ultralytics/__init__.py +0 -0
  149. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/model/ultralytics/bbox.py +0 -0
  150. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/model/ultralytics/pose.py +0 -0
  151. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/model/ultralytics/segment.py +0 -0
  152. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/node.py +0 -0
  153. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/nodes_fetcher.py +0 -0
  154. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/nodes_thread_pool.py +0 -0
  155. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/progress.py +0 -0
  156. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/py.typed +0 -0
  157. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/query/__init__.py +0 -0
  158. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/query/batch.py +0 -0
  159. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/query/dispatch.py +0 -0
  160. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/query/metrics.py +0 -0
  161. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/query/params.py +0 -0
  162. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/query/queue.py +0 -0
  163. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/query/schema.py +0 -0
  164. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/query/session.py +0 -0
  165. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/remote/__init__.py +0 -0
  166. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/remote/studio.py +0 -0
  167. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/sql/__init__.py +0 -0
  168. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/sql/default/__init__.py +0 -0
  169. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/sql/default/base.py +0 -0
  170. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/sql/functions/__init__.py +0 -0
  171. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/sql/functions/aggregate.py +0 -0
  172. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/sql/functions/array.py +0 -0
  173. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/sql/functions/conditional.py +0 -0
  174. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/sql/functions/path.py +0 -0
  175. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/sql/functions/random.py +0 -0
  176. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/sql/selectable.py +0 -0
  177. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/sql/sqlite/__init__.py +0 -0
  178. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/sql/sqlite/types.py +0 -0
  179. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/sql/sqlite/vector.py +0 -0
  180. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/sql/types.py +0 -0
  181. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/sql/utils.py +0 -0
  182. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/studio.py +0 -0
  183. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/telemetry.py +0 -0
  184. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/toolkit/__init__.py +0 -0
  185. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/toolkit/split.py +0 -0
  186. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/torch/__init__.py +0 -0
  187. {datachain-0.7.9 → datachain-0.7.10}/src/datachain/utils.py +0 -0
  188. {datachain-0.7.9 → datachain-0.7.10}/src/datachain.egg-info/dependency_links.txt +0 -0
  189. {datachain-0.7.9 → datachain-0.7.10}/src/datachain.egg-info/entry_points.txt +0 -0
  190. {datachain-0.7.9 → datachain-0.7.10}/src/datachain.egg-info/requires.txt +0 -0
  191. {datachain-0.7.9 → datachain-0.7.10}/src/datachain.egg-info/top_level.txt +0 -0
  192. {datachain-0.7.9 → datachain-0.7.10}/tests/__init__.py +0 -0
  193. {datachain-0.7.9 → datachain-0.7.10}/tests/benchmarks/__init__.py +0 -0
  194. {datachain-0.7.9 → datachain-0.7.10}/tests/benchmarks/conftest.py +0 -0
  195. {datachain-0.7.9 → datachain-0.7.10}/tests/benchmarks/datasets/.dvc/.gitignore +0 -0
  196. {datachain-0.7.9 → datachain-0.7.10}/tests/benchmarks/datasets/.dvc/config +0 -0
  197. {datachain-0.7.9 → datachain-0.7.10}/tests/benchmarks/datasets/.gitignore +0 -0
  198. {datachain-0.7.9 → datachain-0.7.10}/tests/benchmarks/datasets/laion-tiny.npz.dvc +0 -0
  199. {datachain-0.7.9 → datachain-0.7.10}/tests/benchmarks/test_datachain.py +0 -0
  200. {datachain-0.7.9 → datachain-0.7.10}/tests/benchmarks/test_ls.py +0 -0
  201. {datachain-0.7.9 → datachain-0.7.10}/tests/benchmarks/test_version.py +0 -0
  202. {datachain-0.7.9 → datachain-0.7.10}/tests/conftest.py +0 -0
  203. {datachain-0.7.9 → datachain-0.7.10}/tests/data.py +0 -0
  204. {datachain-0.7.9 → datachain-0.7.10}/tests/examples/__init__.py +0 -0
  205. {datachain-0.7.9 → datachain-0.7.10}/tests/examples/test_examples.py +0 -0
  206. {datachain-0.7.9 → datachain-0.7.10}/tests/examples/test_wds_e2e.py +0 -0
  207. {datachain-0.7.9 → datachain-0.7.10}/tests/examples/wds_data.py +0 -0
  208. {datachain-0.7.9 → datachain-0.7.10}/tests/func/__init__.py +0 -0
  209. {datachain-0.7.9 → datachain-0.7.10}/tests/func/test_catalog.py +0 -0
  210. {datachain-0.7.9 → datachain-0.7.10}/tests/func/test_client.py +0 -0
  211. {datachain-0.7.9 → datachain-0.7.10}/tests/func/test_datachain.py +0 -0
  212. {datachain-0.7.9 → datachain-0.7.10}/tests/func/test_dataset_query.py +0 -0
  213. {datachain-0.7.9 → datachain-0.7.10}/tests/func/test_datasets.py +0 -0
  214. {datachain-0.7.9 → datachain-0.7.10}/tests/func/test_feature_pickling.py +0 -0
  215. {datachain-0.7.9 → datachain-0.7.10}/tests/func/test_listing.py +0 -0
  216. {datachain-0.7.9 → datachain-0.7.10}/tests/func/test_ls.py +0 -0
  217. {datachain-0.7.9 → datachain-0.7.10}/tests/func/test_meta_formats.py +0 -0
  218. {datachain-0.7.9 → datachain-0.7.10}/tests/func/test_metrics.py +0 -0
  219. {datachain-0.7.9 → datachain-0.7.10}/tests/func/test_pull.py +0 -0
  220. {datachain-0.7.9 → datachain-0.7.10}/tests/func/test_pytorch.py +0 -0
  221. {datachain-0.7.9 → datachain-0.7.10}/tests/func/test_query.py +0 -0
  222. {datachain-0.7.9 → datachain-0.7.10}/tests/func/test_toolkit.py +0 -0
  223. {datachain-0.7.9 → datachain-0.7.10}/tests/scripts/feature_class.py +0 -0
  224. {datachain-0.7.9 → datachain-0.7.10}/tests/scripts/feature_class_exception.py +0 -0
  225. {datachain-0.7.9 → datachain-0.7.10}/tests/scripts/feature_class_parallel.py +0 -0
  226. {datachain-0.7.9 → datachain-0.7.10}/tests/scripts/feature_class_parallel_data_model.py +0 -0
  227. {datachain-0.7.9 → datachain-0.7.10}/tests/scripts/name_len_slow.py +0 -0
  228. {datachain-0.7.9 → datachain-0.7.10}/tests/test_atomicity.py +0 -0
  229. {datachain-0.7.9 → datachain-0.7.10}/tests/test_cli_e2e.py +0 -0
  230. {datachain-0.7.9 → datachain-0.7.10}/tests/test_cli_studio.py +0 -0
  231. {datachain-0.7.9 → datachain-0.7.10}/tests/test_query_e2e.py +0 -0
  232. {datachain-0.7.9 → datachain-0.7.10}/tests/test_telemetry.py +0 -0
  233. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/__init__.py +0 -0
  234. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/lib/__init__.py +0 -0
  235. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/lib/conftest.py +0 -0
  236. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/lib/test_arrow.py +0 -0
  237. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/lib/test_clip.py +0 -0
  238. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/lib/test_datachain.py +0 -0
  239. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/lib/test_datachain_bootstrap.py +0 -0
  240. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/lib/test_datachain_merge.py +0 -0
  241. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/lib/test_feature.py +0 -0
  242. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/lib/test_feature_utils.py +0 -0
  243. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/lib/test_file.py +0 -0
  244. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/lib/test_hf.py +0 -0
  245. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/lib/test_image.py +0 -0
  246. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/lib/test_listing_info.py +0 -0
  247. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/lib/test_models.py +0 -0
  248. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/lib/test_schema.py +0 -0
  249. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/lib/test_signal_schema.py +0 -0
  250. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/lib/test_sql_to_python.py +0 -0
  251. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/lib/test_text.py +0 -0
  252. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/lib/test_udf_signature.py +0 -0
  253. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/lib/test_utils.py +0 -0
  254. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/lib/test_webdataset.py +0 -0
  255. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/sql/__init__.py +0 -0
  256. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/sql/sqlite/__init__.py +0 -0
  257. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/sql/sqlite/test_types.py +0 -0
  258. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/sql/sqlite/test_utils.py +0 -0
  259. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/sql/test_array.py +0 -0
  260. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/sql/test_conditional.py +0 -0
  261. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/sql/test_path.py +0 -0
  262. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/sql/test_random.py +0 -0
  263. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/sql/test_selectable.py +0 -0
  264. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/sql/test_string.py +0 -0
  265. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/test_asyn.py +0 -0
  266. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/test_cache.py +0 -0
  267. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/test_catalog.py +0 -0
  268. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/test_catalog_loader.py +0 -0
  269. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/test_cli_parsing.py +0 -0
  270. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/test_client.py +0 -0
  271. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/test_client_s3.py +0 -0
  272. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/test_config.py +0 -0
  273. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/test_data_storage.py +0 -0
  274. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/test_database_engine.py +0 -0
  275. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/test_dataset.py +0 -0
  276. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/test_dispatch.py +0 -0
  277. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/test_fileslice.py +0 -0
  278. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/test_listing.py +0 -0
  279. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/test_metastore.py +0 -0
  280. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/test_module_exports.py +0 -0
  281. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/test_query.py +0 -0
  282. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/test_query_metrics.py +0 -0
  283. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/test_query_params.py +0 -0
  284. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/test_serializer.py +0 -0
  285. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/test_session.py +0 -0
  286. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/test_utils.py +0 -0
  287. {datachain-0.7.9 → datachain-0.7.10}/tests/unit/test_warehouse.py +0 -0
  288. {datachain-0.7.9 → datachain-0.7.10}/tests/utils.py +0 -0
{datachain-0.7.9 → datachain-0.7.10}/.github/workflows/tests.yml +17 -2
@@ -3,7 +3,7 @@ name: Tests
 on:
   push:
     branches: [main]
-  pull_request:
+  pull_request_target:
   workflow_dispatch:
 
 env:
@@ -14,13 +14,22 @@ concurrency:
   cancel-in-progress: true
 
 jobs:
+  authorize:
+    environment: ${{ github.event_name == 'pull_request_target' && github.event.pull_request.head.repo.full_name != github.repository && 'external' || 'internal' }}
+    runs-on: ubuntu-latest
+    steps:
+      - run: true
+
   lint:
+    needs: authorize
+
     runs-on: ubuntu-latest
     steps:
       - name: Check out the repository
         uses: actions/checkout@v4
         with:
           fetch-depth: 0
+          ref: ${{ github.event.pull_request.head.sha || github.ref }}
 
       - name: Set up Python 3.9
         uses: actions/setup-python@v5
@@ -53,6 +62,8 @@ jobs:
         run: nox -s lint
 
   datachain:
+    needs: authorize
+
     timeout-minutes: 40
     runs-on: ${{ matrix.os }}
     strategy:
@@ -75,6 +86,7 @@ jobs:
         uses: actions/checkout@v4
         with:
           fetch-depth: 0
+          ref: ${{ github.event.pull_request.head.sha || github.ref }}
 
       - name: Set up Python ${{ matrix.pyv }}
         uses: actions/setup-python@v5
@@ -117,6 +129,8 @@ jobs:
         run: nox -s docs
 
   examples:
+    needs: authorize
+
     runs-on: ${{ matrix.os }}
     timeout-minutes: 60
     strategy:
@@ -132,9 +146,10 @@ jobs:
           - {os: ubuntu-latest-4-cores, pyv: "3.9", group: multimodal}
           - {os: ubuntu-latest-4-cores, pyv: "3.12", group: multimodal}
 
-
     steps:
       - uses: actions/checkout@v4
+        with:
+          ref: ${{ github.event.pull_request.head.sha || github.ref }}
 
       - name: Set up Python ${{ matrix.pyv }}
         uses: actions/setup-python@v5
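
Note on the tests.yml change: the new `authorize` job selects its environment with a short-circuit expression, and each checkout step now passes an explicit `ref`. Under `pull_request_target` the workflow runs in the context of the base repository, so without that `ref` the checkout would not fetch the pull request's head commit. The Python sketch below only mirrors the two expressions to make their outcomes easy to read. The function names and the sample repository and SHA values are hypothetical, and what the `external` and `internal` environments actually enforce (for example, required approval) is not shown in this diff.

```python
from typing import Optional


def select_environment(event_name: str,
                       head_repo: Optional[str],
                       base_repo: str) -> str:
    """Mirror of the `environment:` expression in the `authorize` job.

    `cond && 'external' || 'internal'` behaves like a ternary here,
    because the string 'external' is always truthy.
    """
    is_fork_pr = event_name == "pull_request_target" and head_repo != base_repo
    return "external" if is_fork_pr else "internal"


def select_checkout_ref(pr_head_sha: Optional[str], git_ref: str) -> str:
    """Mirror of `github.event.pull_request.head.sha || github.ref`.

    Falls back to the triggering ref when there is no pull request payload
    (for example on `push` or `workflow_dispatch`).
    """
    return pr_head_sha or git_ref


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    cases = [
        ("pull_request_target", "someone/datachain", "abc123"),
        ("pull_request_target", "iterative/datachain", "def456"),
        ("push", None, None),
    ]
    for event, head_repo, sha in cases:
        env = select_environment(event, head_repo, "iterative/datachain")
        ref = select_checkout_ref(sha, "refs/heads/main")
        print(f"{event:20} head={head_repo!s:22} -> environment={env}, ref={ref}")
```

Read this way, only pull requests whose head repository differs from the base repository land in `external`; pushes, manual dispatches, and same-repository pull requests all resolve to `internal`.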
datachain-0.7.10/PKG-INFO +207 -0
@@ -0,0 +1,207 @@
+Metadata-Version: 2.1
+Name: datachain
+Version: 0.7.10
+Summary: Wrangle unstructured AI data at scale
+Author-email: Dmitry Petrov <support@dvc.org>
+License: Apache-2.0
+Project-URL: Documentation, https://datachain.dvc.ai
+Project-URL: Issues, https://github.com/iterative/datachain/issues
+Project-URL: Source, https://github.com/iterative/datachain
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Development Status :: 2 - Pre-Alpha
+Requires-Python: >=3.9
+Description-Content-Type: text/x-rst
+License-File: LICENSE
+Requires-Dist: pyyaml
+Requires-Dist: tomlkit
+Requires-Dist: tqdm
+Requires-Dist: numpy<3,>=1
+Requires-Dist: pandas>=2.0.0
+Requires-Dist: pyarrow
+Requires-Dist: typing-extensions
+Requires-Dist: python-dateutil>=2
+Requires-Dist: attrs>=21.3.0
+Requires-Dist: s3fs>=2024.2.0
+Requires-Dist: gcsfs>=2024.2.0
+Requires-Dist: adlfs>=2024.2.0
+Requires-Dist: dvc-data<4,>=3.10
+Requires-Dist: dvc-objects<6,>=4
+Requires-Dist: shtab<2,>=1.3.4
+Requires-Dist: sqlalchemy>=2
+Requires-Dist: multiprocess==0.70.16
+Requires-Dist: cloudpickle
+Requires-Dist: orjson>=3.10.5
+Requires-Dist: pydantic<3,>=2
+Requires-Dist: jmespath>=1.0
+Requires-Dist: datamodel-code-generator>=0.25
+Requires-Dist: Pillow<12,>=10.0.0
+Requires-Dist: msgpack<2,>=1.0.4
+Requires-Dist: psutil
+Requires-Dist: huggingface_hub
+Requires-Dist: iterative-telemetry>=0.0.9
+Requires-Dist: platformdirs
+Requires-Dist: dvc-studio-client<1,>=0.21
+Requires-Dist: tabulate
+Provides-Extra: docs
+Requires-Dist: mkdocs>=1.5.2; extra == "docs"
+Requires-Dist: mkdocs-gen-files>=0.5.0; extra == "docs"
+Requires-Dist: mkdocs-material>=9.3.1; extra == "docs"
+Requires-Dist: mkdocs-section-index>=0.3.6; extra == "docs"
+Requires-Dist: mkdocstrings-python>=1.6.3; extra == "docs"
+Requires-Dist: mkdocs-literate-nav>=0.6.1; extra == "docs"
+Provides-Extra: torch
+Requires-Dist: torch>=2.1.0; extra == "torch"
+Requires-Dist: torchvision; extra == "torch"
+Requires-Dist: transformers>=4.36.0; extra == "torch"
+Provides-Extra: remote
+Requires-Dist: lz4; extra == "remote"
+Requires-Dist: requests>=2.22.0; extra == "remote"
+Provides-Extra: vector
+Requires-Dist: usearch; extra == "vector"
+Provides-Extra: hf
+Requires-Dist: numba>=0.60.0; extra == "hf"
+Requires-Dist: datasets[audio,vision]>=2.21.0; extra == "hf"
+Provides-Extra: tests
+Requires-Dist: datachain[hf,remote,torch,vector]; extra == "tests"
+Requires-Dist: pytest<9,>=8; extra == "tests"
+Requires-Dist: pytest-sugar>=0.9.6; extra == "tests"
+Requires-Dist: pytest-cov>=4.1.0; extra == "tests"
+Requires-Dist: pytest-mock>=3.12.0; extra == "tests"
+Requires-Dist: pytest-servers[all]>=0.5.8; extra == "tests"
+Requires-Dist: pytest-benchmark[histogram]; extra == "tests"
+Requires-Dist: pytest-xdist>=3.3.1; extra == "tests"
+Requires-Dist: virtualenv; extra == "tests"
+Requires-Dist: dulwich; extra == "tests"
+Requires-Dist: hypothesis; extra == "tests"
+Requires-Dist: open_clip_torch; extra == "tests"
+Requires-Dist: aiotools>=1.7.0; extra == "tests"
+Requires-Dist: requests-mock; extra == "tests"
+Requires-Dist: scipy; extra == "tests"
+Provides-Extra: dev
+Requires-Dist: datachain[docs,tests]; extra == "dev"
+Requires-Dist: mypy==1.13.0; extra == "dev"
+Requires-Dist: types-python-dateutil; extra == "dev"
+Requires-Dist: types-pytz; extra == "dev"
+Requires-Dist: types-PyYAML; extra == "dev"
+Requires-Dist: types-requests; extra == "dev"
+Requires-Dist: types-tabulate; extra == "dev"
+Provides-Extra: examples
+Requires-Dist: datachain[tests]; extra == "examples"
+Requires-Dist: numpy<2,>=1; extra == "examples"
+Requires-Dist: defusedxml; extra == "examples"
+Requires-Dist: accelerate; extra == "examples"
+Requires-Dist: unstructured[embed-huggingface,pdf]<0.16.0; extra == "examples"
+Requires-Dist: pdfplumber==0.11.4; extra == "examples"
+Requires-Dist: huggingface_hub[hf_transfer]; extra == "examples"
+Requires-Dist: onnx==1.16.1; extra == "examples"
+Requires-Dist: ultralytics==8.3.37; extra == "examples"
+
103
+ ================
104
+ |logo| DataChain
105
+ ================
106
+
107
+ |PyPI| |Python Version| |Codecov| |Tests|
108
+
109
+ .. |logo| image:: docs/assets/datachain.svg
110
+ :height: 24
111
+ .. |PyPI| image:: https://img.shields.io/pypi/v/datachain.svg
112
+ :target: https://pypi.org/project/datachain/
113
+ :alt: PyPI
114
+ .. |Python Version| image:: https://img.shields.io/pypi/pyversions/datachain
115
+ :target: https://pypi.org/project/datachain
116
+ :alt: Python Version
117
+ .. |Codecov| image:: https://codecov.io/gh/iterative/datachain/graph/badge.svg?token=byliXGGyGB
118
+ :target: https://codecov.io/gh/iterative/datachain
119
+ :alt: Codecov
120
+ .. |Tests| image:: https://github.com/iterative/datachain/actions/workflows/tests.yml/badge.svg
121
+ :target: https://github.com/iterative/datachain/actions/workflows/tests.yml
122
+ :alt: Tests
123
+
124
+ DataChain is a Python-based AI-data warehouse for transforming and analyzing unstructured
125
+ data like images, audio, videos, text and PDFs. It integrates with external storage
126
+ (e.g. S3) to process data efficiently without data duplication and manages metadata
127
+ in an internal database for easy and efficient querying.
128
+
129
+
130
+ Use Cases
131
+ =========
132
+
133
+ 1. **ETL.** Pythonic framework for describing and running unstructured data transformations
134
+ and enrichments, applying models to data, including LLMs.
135
+ 2. **Analytics.** DataChain dataset is a table that combines all the information about data
136
+ objects in one place + it provides dataframe-like API and vecrorized engine to do analytics
137
+ on these tables at scale.
138
+ 3. **Versioning.** DataChain doesn't store, require moving or copying data (unlike DVC).
139
+ Perfect use case is a bucket with thousands or millions of images, videos, audio, PDFs.
140
+
141
+
142
+ Key Features
143
+ ============
144
+
145
+ 📂 **Multimodal Dataset Versioning.**
146
+ - Version unstructured data without moving or creating data copies, by supporting
147
+ references to S3, GCP, Azure, and local file systems.
148
+ - Multimodal data support: images, video, text, PDFs, JSONs, CSVs, parquet, etc.
149
+ - Unite files and metadata together into persistent, versioned, columnar datasets.
150
+
151
+ 🐍 **Python-friendly.**
152
+ - Operate on Python objects and object fields: float scores, strings, matrices,
153
+ LLM response objects.
154
+ - Run Python code on large-scale, terabyte-size datasets, with built-in
155
+ parallelization and memory-efficient computing — no SQL or Spark required.
156
+
157
+ 🧠 **Data Enrichment and Processing.**
158
+ - Generate metadata using local AI models and LLM APIs.
159
+ - Filter, join, and group datasets by metadata. Search by vector embeddings.
160
+ - High-performance vectorized operations on Python objects: sum, count, avg, etc.
161
+ - Pass datasets to PyTorch and TensorFlow, or export them back into storage.
162
+
163
+
164
+ Getting Started
165
+ ===============
166
+
167
+ Visit `Quick Start <https://docs.datachain.ai/quick-start>`_ to get started with `DataChain` and learn more.
168
+
169
+
170
+ Contributing
171
+ ============
172
+
173
+ Contributions are very welcome. To learn more, see the `Contributor Guide`_.
174
+
175
+
176
+ Community and Support
177
+ =====================
178
+
179
+ * `Docs <https://docs.datachain.ai/>`_
180
+ * `File an issue`_ if you encounter any problems
181
+ * `Discord Chat <https://dvc.org/chat>`_
182
+ * `Email <mailto:support@dvc.org>`_
183
+ * `Twitter <https://twitter.com/DVCorg>`_
184
+
185
+
186
+ DataChain Studio Platform
187
+ =========================
188
+
189
+ `DataChain Studio`_ is a proprietary solution for teams that offers:
190
+
191
+ - **Centralized dataset registry** to manage data, code, and
192
+ dependencies in one place.
193
+ - **Data Lineage** for data sources as well as derivative datasets.
194
+ - **UI for Multimodal Data** like images, videos, and PDFs.
195
+ - **Scalable Compute** to handle large datasets (100M+ files) and in-house
196
+ AI model inference.
197
+ - **Access control** including SSO and team-based collaboration.
198
+
199
+ .. _PyPI: https://pypi.org/
200
+ .. _file an issue: https://github.com/iterative/datachain/issues
201
+ .. github-only
202
+ .. _Contributor Guide: https://docs.datachain.ai/contributing
203
+ .. _Pydantic: https://github.com/pydantic/pydantic
204
+ .. _publicly available: https://radar.kit.edu/radar/en/dataset/FdJmclKpjHzLfExE.ExpBot%2B-%2BA%2Bdataset%2Bof%2B79%2Bdialogs%2Bwith%2Ban%2Bexperimental%2Bcustomer%2Bservice%2Bchatbot
205
+ .. _SQLite: https://www.sqlite.org/
206
+ .. _Getting Started: https://docs.datachain.ai/
207
+ .. _DataChain Studio: https://studio.datachain.ai/
@@ -0,0 +1,105 @@
1
+ ================
2
+ |logo| DataChain
3
+ ================
4
+
5
+ |PyPI| |Python Version| |Codecov| |Tests|
6
+
7
+ .. |logo| image:: docs/assets/datachain.svg
8
+ :height: 24
9
+ .. |PyPI| image:: https://img.shields.io/pypi/v/datachain.svg
10
+ :target: https://pypi.org/project/datachain/
11
+ :alt: PyPI
12
+ .. |Python Version| image:: https://img.shields.io/pypi/pyversions/datachain
13
+ :target: https://pypi.org/project/datachain
14
+ :alt: Python Version
15
+ .. |Codecov| image:: https://codecov.io/gh/iterative/datachain/graph/badge.svg?token=byliXGGyGB
16
+ :target: https://codecov.io/gh/iterative/datachain
17
+ :alt: Codecov
18
+ .. |Tests| image:: https://github.com/iterative/datachain/actions/workflows/tests.yml/badge.svg
19
+ :target: https://github.com/iterative/datachain/actions/workflows/tests.yml
20
+ :alt: Tests
21
+
22
+ DataChain is a Python-based AI-data warehouse for transforming and analyzing unstructured
23
+ data like images, audio, videos, text and PDFs. It integrates with external storage
24
+ (e.g. S3) to process data efficiently without data duplication and manages metadata
25
+ in an internal database for easy and efficient querying.
26
+
27
+
28
+ Use Cases
29
+ =========
30
+
31
+ 1. **ETL.** Pythonic framework for describing and running unstructured data transformations
32
+ and enrichments, applying models to data, including LLMs.
33
+ 2. **Analytics.** A DataChain dataset is a table that combines all the information about data
34
+ objects in one place, and it provides a dataframe-like API and a vectorized engine to do analytics
35
+ on these tables at scale.
36
+ 3. **Versioning.** DataChain doesn't store data or require moving or copying it (unlike DVC).
37
+ The ideal use case is a bucket with thousands or millions of images, videos, audio files, or PDFs.
38
+
39
+
40
+ Key Features
41
+ ============
42
+
43
+ 📂 **Multimodal Dataset Versioning.**
44
+ - Version unstructured data without moving or creating data copies, by supporting
45
+ references to S3, GCP, Azure, and local file systems.
46
+ - Multimodal data support: images, video, text, PDFs, JSONs, CSVs, parquet, etc.
47
+ - Unite files and metadata together into persistent, versioned, columnar datasets.
48
+
49
+ 🐍 **Python-friendly.**
50
+ - Operate on Python objects and object fields: float scores, strings, matrices,
51
+ LLM response objects.
52
+ - Run Python code on large-scale, terabyte-size datasets, with built-in
53
+ parallelization and memory-efficient computing — no SQL or Spark required.
54
+
55
+ 🧠 **Data Enrichment and Processing.**
56
+ - Generate metadata using local AI models and LLM APIs.
57
+ - Filter, join, and group datasets by metadata. Search by vector embeddings.
58
+ - High-performance vectorized operations on Python objects: sum, count, avg, etc.
59
+ - Pass datasets to PyTorch and TensorFlow, or export them back into storage.
60
+
61
+
62
+ Getting Started
63
+ ===============
64
+
65
+ Visit `Quick Start <https://docs.datachain.ai/quick-start>`_ to get started with `DataChain` and learn more.
66
+
67
+
68
+ Contributing
69
+ ============
70
+
71
+ Contributions are very welcome. To learn more, see the `Contributor Guide`_.
72
+
73
+
74
+ Community and Support
75
+ =====================
76
+
77
+ * `Docs <https://docs.datachain.ai/>`_
78
+ * `File an issue`_ if you encounter any problems
79
+ * `Discord Chat <https://dvc.org/chat>`_
80
+ * `Email <mailto:support@dvc.org>`_
81
+ * `Twitter <https://twitter.com/DVCorg>`_
82
+
83
+
84
+ DataChain Studio Platform
85
+ =========================
86
+
87
+ `DataChain Studio`_ is a proprietary solution for teams that offers:
88
+
89
+ - **Centralized dataset registry** to manage data, code, and
90
+ dependencies in one place.
91
+ - **Data Lineage** for data sources as well as derivative datasets.
92
+ - **UI for Multimodal Data** like images, videos, and PDFs.
93
+ - **Scalable Compute** to handle large datasets (100M+ files) and in-house
94
+ AI model inference.
95
+ - **Access control** including SSO and team-based collaboration.
96
+
97
+ .. _PyPI: https://pypi.org/
98
+ .. _file an issue: https://github.com/iterative/datachain/issues
99
+ .. github-only
100
+ .. _Contributor Guide: https://docs.datachain.ai/contributing
101
+ .. _Pydantic: https://github.com/pydantic/pydantic
102
+ .. _publicly available: https://radar.kit.edu/radar/en/dataset/FdJmclKpjHzLfExE.ExpBot%2B-%2BA%2Bdataset%2Bof%2B79%2Bdialogs%2Bwith%2Ban%2Bexperimental%2Bcustomer%2Bservice%2Bchatbot
103
+ .. _SQLite: https://www.sqlite.org/
104
+ .. _Getting Started: https://docs.datachain.ai/
105
+ .. _DataChain Studio: https://studio.datachain.ai/
@@ -0,0 +1,111 @@
1
+ # Contributor Guide
2
+
3
+ Thank you for your interest in improving this project. This project is
4
+ open-source under the [Apache 2.0
5
+ license](https://opensource.org/licenses/Apache-2.0) and welcomes
6
+ contributions in the form of bug reports, feature requests, and pull
7
+ requests.
8
+
9
+ Here is a list of important resources for contributors:
10
+
11
+ - [Source Code](https://github.com/iterative/datachain)
12
+ - [Documentation](https://docs.dvc.ai/datachain)
13
+ - [Issue Tracker](https://github.com/iterative/datachain/issues)
14
+ - [Code of Conduct](https://github.com/iterative/datachain?tab=coc-ov-file)
15
+
16
+ ## How to report a bug
17
+
18
+ Report bugs on the [Issue
19
+ Tracker](https://github.com/iterative/datachain/issues).
20
+
21
+ When filing an issue, make sure to answer these questions:
22
+
23
+ - Which operating system and Python version are you using?
24
+ - Which version of this project are you using?
25
+ - What did you do?
26
+ - What did you expect to see?
27
+ - What did you see instead?
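
The first two questions can be answered by a short script. The following is a minimal sketch using only the Python standard library; the helper name `environment_report` is hypothetical and not part of this project.

```python
import platform
import sys
from importlib import metadata


def environment_report(package: str = "datachain") -> str:
    """Collect OS, Python, and package version details for a bug report."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        # The package is not installed in the current environment.
        version = "not installed"
    return "\n".join(
        [
            f"OS: {platform.platform()}",
            f"Python: {sys.version.split()[0]}",
            f"{package}: {version}",
        ]
    )


print(environment_report())
```

Pasting the printed lines into the issue covers the environment details in one step.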
28
+
29
+ The best way to get your bug fixed is to provide a test case, and/or
30
+ steps to reproduce the issue.
31
+
32
+ ## How to request a feature
33
+
34
+ Request features on the [Issue
35
+ Tracker](https://github.com/iterative/datachain/issues).
36
+
37
+ ## How to set up your development environment
38
+
39
+ You need Python 3.8+ and the following tools:
40
+
41
+ - [Nox](https://nox.thea.codes/)
42
+
43
+ Install Nox:
44
+
45
+ ``` console
46
+ $ pip install nox
47
+ ```
48
+
49
+ ## How to test the project
50
+
51
+ Run the full test suite:
52
+
53
+ ``` console
54
+ $ nox
55
+ ```
56
+
57
+ List the available Nox sessions:
58
+
59
+ ``` console
60
+ $ nox --list-sessions
61
+ ```
62
+
63
+ You can also run a specific Nox session. For example, invoke the unit
64
+ test suite like this:
65
+
66
+ ``` console
67
+ $ nox --session=tests
68
+ ```
69
+
70
+ Unit tests are located in the `tests` directory, and are written using
71
+ the [pytest](https://pytest.readthedocs.io/) testing framework.
72
+
73
+ ## Build documentation
74
+
75
+ If you've made any changes to the documentation (including changes to
76
+ function signatures, class definitions, or docstrings that will appear
77
+ in the API documentation), make sure it builds successfully.
78
+
79
+ ``` console
80
+ $ nox -s docs
81
+ ```
82
+
83
+ In order to run this locally with hot reload on changes:
84
+
85
+ ``` console
86
+ $ mkdocs serve
87
+ ```
88
+
89
+ ## How to submit changes
90
+
91
+ Open a [pull request](https://github.com/iterative/datachain/pulls) to
92
+ submit changes to this project.
93
+
94
+ Your pull request needs to meet the following guidelines for acceptance:
95
+
96
+ - The Nox test suite must pass without errors and warnings.
97
+ - Include unit tests. This project maintains 100% code coverage.
98
+ - If your changes add functionality, update the documentation
99
+ accordingly.
100
+
101
+ Feel free to submit early, though---we can always iterate on this.
102
+
103
+ To run linting and code formatting checks, you can invoke a `lint` session in nox:
104
+
105
+ ``` console
106
+ $ nox -s lint
107
+ ```
108
+
109
+ It is recommended to open an issue before starting work on anything.
110
+ This gives you a chance to talk it over with the owners and validate
111
+ your approach.
@@ -1,80 +1,67 @@
1
- # Get Started with DataChain
2
1
 
3
- 🔨Wrangle unstructured AI data at scale
2
+ # Examples
4
3
 
4
+ ## DataChain Basics
5
5
 
6
- Datachain enables multimodal API calls and local AI inferences to run in parallel over many samples as chained operations. The resulting datasets can be saved, versioned, and sent directly to PyTorch and TensorFlow for training. Datachain can persist features of Python objects returned by AI models, and enables vectorized analytical operations over them.
6
+ !!! example "DataChain Basics"
7
7
 
8
- The typical use cases are data curation, LLM analytics and validation, image segmentation, pose detection, and GenAI alignment. Datachain is especially helpful if batch operations can be optimized – for instance, when synchronous API calls can be parallelized or where an LLM API offers batch processing.
8
+ Datachain is built by composing wrangling operations.
9
9
 
10
- ---
10
+ For example, consider the New Yorker Cartoon caption contest dataset, where cartoons are matched against potential titles. Suppose we want to augment this dataset with synthetic scene descriptions coming from an AI model. The code below reads images from cloud storage, applies the PaliGemma model to caption the first five of them, and puts the results in the column “scene”:
11
11
 
12
- `pip install datachain`
12
+ ```python
13
+ from datachain.lib.dc import Column, DataChain, File # (1)!
14
+ from transformers import AutoProcessor, PaliGemmaForConditionalGeneration # (2)!
13
15
 
14
- ---
16
+ images = DataChain.from_storage("gs://datachain-demo/newyorker_caption_contest/images", type="image")
15
17
 
16
- ### Operation basics
18
+ model = PaliGemmaForConditionalGeneration.from_pretrained("google/paligemma-3b-mix-224")
19
+ processor = AutoProcessor.from_pretrained("google/paligemma-3b-mix-224")
17
20
 
18
- Datachain is built by composing wrangling operations.
21
+ def process(file: File) -> str:
22
+     image = file.read().convert("RGB")
23
+     inputs = processor(text="caption", images=image, return_tensors="pt")
24
+     generate_ids = model.generate(**inputs, max_new_tokens=100)
25
+     return processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
19
26
 
20
- For example, let us consider the New Yorker Cartoon caption contest dataset, where cartoons are matched against the potential titles. Let us imagine we want to augment this dataset with synthetic scene descriptions coming from an AI model. The below code takes images from the cloud, and applies PaliGemma model to caption the first five of them and put the results in the column “scene”:
21
-
22
- ```python
23
- #
24
- # pip install transformers
25
- #
26
-
27
- from datachain.lib.dc import Column, DataChain, File
28
- from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
29
-
30
- images = DataChain.from_storage("gs://datachain-demo/newyorker_caption_contest/images", type="image")
31
-
32
- model = PaliGemmaForConditionalGeneration.from_pretrained("google/paligemma-3b-mix-224")
33
- processor = AutoProcessor.from_pretrained("google/paligemma-3b-mix-224")
34
-
35
- def process(file: File) -> str:
36
-     image=file.read().convert("RGB")
37
-     inputs = processor(text="caption", images=image, return_tensors="pt")
38
-     generate_ids = model.generate(**inputs, max_new_tokens=100)
39
-     return processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
40
-
41
- chain = (
42
-     images.limit(5)
43
-     .settings(cache=True)
44
-     .map(scene=lambda file: process(file), output = str)
45
-     .save()
46
- )
47
- ```
27
+ chain = (
28
+     images.limit(5)
29
+     .settings(cache=True)
30
+     .map(scene=lambda file: process(file), output=str)
31
+     .save()
32
+ )
33
+ ```
48
34
 
49
- Here is how we can view the results in a plot:
35
+ 1. `pip install datachain`
36
+ 2. `pip install transformers`
50
37
 
51
- ```python
52
- import matplotlib.pyplot as plt
53
- import re
54
- from textwrap import wrap
38
+ Here is how we can view the results in a plot:
55
39
 
56
- def trim_text(text):
57
-     match = re.search(r'[A-Z][^.]*\.', text)
58
-     return match.group(0) if match else ''
40
+ ```python
41
+ import matplotlib.pyplot as plt
42
+ import re
43
+ from textwrap import wrap
59
44
 
60
- images = chain.collect("file")
61
- captions = chain.collect("scene")
62
- _ , axes = plt.subplots(1, len(captions), figsize=(15, 5))
45
+ def trim_text(text):
46
+     match = re.search(r'[A-Z][^.]*\.', text)
47
+     return match.group(0) if match else ''
63
48
 
64
- for ax, img, caption in zip(axes, images, captions):
65
-     ax.imshow(img.read(),cmap='gray')
66
-     ax.axis('off')
67
-     wrapped_caption = "\n".join(wrap(trim_text(caption), 30))
68
-     ax.set_title(wrapped_caption, fontsize=6)
49
+ images = list(chain.collect("file"))  # materialize in case collect() yields lazily
50
+ captions = list(chain.collect("scene"))  # len() below needs a sized sequence
51
+ _, axes = plt.subplots(1, len(captions), figsize=(15, 5))
69
52
 
70
- plt.show()
71
- ```
53
+ for ax, img, caption in zip(axes, images, captions):
54
+     ax.imshow(img.read(), cmap='gray')
55
+     ax.axis('off')
56
+     wrapped_caption = "\n".join(wrap(trim_text(caption), 30))
57
+     ax.set_title(wrapped_caption, fontsize=6)
72
58
 
73
- ![Untitled](assets/captioned_cartoons.png)
59
+ plt.show()
60
+ ```
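
To make the behavior of `trim_text` concrete, here is a small standalone sketch of the same regular expression applied to made-up strings (the sample captions are illustrative, not model output). It keeps the first span that starts at an uppercase letter and ends at the next period, and returns an empty string when no such span exists.

```python
import re


def trim_text(text: str) -> str:
    # Same pattern as in the plotting code: an uppercase letter,
    # then any run of non-period characters, then a period.
    match = re.search(r"[A-Z][^.]*\.", text)
    return match.group(0) if match else ""


# Hypothetical sample strings, chosen only to exercise both branches.
samples = [
    "caption: A man sits at a desk. He is typing.",
    "no capital letters here.",
]
trimmed = [trim_text(s) for s in samples]
print(trimmed)
```

Because the pattern stops at the first period, captions containing abbreviations or decimal numbers would be cut short; that is acceptable for a plot title but worth keeping in mind.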
74
61
 
75
- If interested to see more multimodal examples for DataChain, please follow this tutorial:
62
+ ![Captioned cartoons](assets/captioned_cartoons.png)
76
63
 
77
- [https://github.com/iterative/datachain-examples/blob/main/multimodal/clip_fine_tuning.ipynb](https://github.com/iterative/datachain-examples/blob/main/multimodal/clip_fine_tuning.ipynb) [Google Colab](https://colab.research.google.com/github/iterative/datachain-examples/blob/main/multimodal/clip_fine_tuning.ipynb)
64
+ To see more examples, check out the [tutorials](tutorials.md).
78
65
 
79
66
  ### Handling Python objects
80
67
 
@@ -188,7 +175,7 @@ Datachain avoids redundant operations. Execution is triggered only when a downst
188
175
 
189
176
  The “save” operation persists execution results and automatically reuses them every time the downstream functions ask for data. Saving without an explicit name generates an auto-named dataset which serves the same purpose.
190
177
 
191
- Datachain natively supports parallelism in execution. If an API or a local model supports parallel requests, the `settings` operator can split the load across multiple workers (see the [code example above](https://www.notion.so/DataChain-Getting-Started-3ed9414febac48f888f90cdaa2ca7667?pvs=21))
178
+ Datachain natively supports parallelism in execution. If an API or a local model supports parallel requests, the `settings` operator can split the load across multiple workers (see the [code example above](#handling-python-objects)).
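
The idea of splitting a load of independent calls across workers can be sketched with the standard library alone. This is a generic illustration, not DataChain's implementation or API; `slow_call` and `map_in_parallel` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def slow_call(item: int) -> int:
    # Stand-in for an API request or a model inference on one sample.
    return item * item


def map_in_parallel(
    items: Iterable[int], func: Callable[[int], int], workers: int = 4
) -> List[int]:
    # Distribute the calls over a pool of worker threads.
    # Executor.map returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))


results = map_in_parallel(range(8), slow_call, workers=4)
print(results)
```

Threads suit I/O-bound work such as remote API calls; CPU-bound functions would typically need separate processes instead.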
192
179
 
193
180
  ### Reading external metadata
194
181
 
@@ -279,7 +266,7 @@ images_with_dogs.select("annotations", "file.name").show()
279
266
  ```
280
267
  For in-depth review of working with JSON metadata, please follow this tutorial:
281
268
 
282
- [https://github.com/iterative/datachain-examples/blob/main/formats/json-metadata-tutorial.ipynb](https://github.com/iterative/datachain-examples/blob/main/formats/json-metadata-tutorial.ipynb) [Google Colab](https://colab.research.google.com/github/iterative/datachain-examples/blob/main/formats/json-metadata-tutorial.ipynb)
269
+ [GitHub](https://github.com/iterative/datachain-examples/blob/main/formats/json-metadata-tutorial.ipynb) or [Google Colab](https://colab.research.google.com/github/iterative/datachain-examples/blob/main/formats/json-metadata-tutorial.ipynb)
283
270
 
284
271
  ### Passing data to training
285
272
 
@@ -299,4 +286,4 @@ train(loader, model, optimizer)
299
286
 
300
287
  See a larger example for CLIP fine-tuning here:
301
288
 
302
- [https://github.com/iterative/datachain-examples/blob/main/multimodal/clip_fine_tuning.ipynb](https://github.com/iterative/datachain-examples/blob/main/multimodal/clip_fine_tuning.ipynb) [Google Colab](https://colab.research.google.com/github/iterative/datachain-examples/blob/main/multimodal/clip_fine_tuning.ipynb)
289
+ [GitHub](https://github.com/iterative/datachain-examples/blob/main/multimodal/clip_fine_tuning.ipynb) or [Google Colab](https://colab.research.google.com/github/iterative/datachain-examples/blob/main/multimodal/clip_fine_tuning.ipynb)