datachain 0.6.8__py3-none-any.whl → 0.6.10__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: datachain
- Version: 0.6.8
+ Version: 0.6.10
  Summary: Wrangle unstructured AI data at scale
  Author-email: Dmitry Petrov <support@dvc.org>
  License: Apache-2.0
@@ -82,7 +82,7 @@ Requires-Dist: pytest <9,>=8 ; extra == 'tests'
  Requires-Dist: pytest-sugar >=0.9.6 ; extra == 'tests'
  Requires-Dist: pytest-cov >=4.1.0 ; extra == 'tests'
  Requires-Dist: pytest-mock >=3.12.0 ; extra == 'tests'
- Requires-Dist: pytest-servers[all] >=0.5.7 ; extra == 'tests'
+ Requires-Dist: pytest-servers[all] >=0.5.8 ; extra == 'tests'
  Requires-Dist: pytest-benchmark[histogram] ; extra == 'tests'
  Requires-Dist: pytest-xdist >=3.3.1 ; extra == 'tests'
  Requires-Dist: virtualenv ; extra == 'tests'
@@ -120,33 +120,41 @@ Requires-Dist: usearch ; extra == 'vector'
  :target: https://github.com/iterative/datachain/actions/workflows/tests.yml
  :alt: Tests

- DataChain is a modern Pythonic data-frame library designed for artificial intelligence.
- It is made to organize your unstructured data into datasets and wrangle it at scale on
- your local machine. Datachain does not abstract or hide the AI models and API calls, but helps to integrate them into the postmodern data stack.
+ DataChain is a Python-based AI-data warehouse for transforming and analyzing unstructured
+ data like images, audio, videos, text and PDFs. It integrates with external storage
+ (e.g., S3) to process data efficiently without data duplication and manages metadata
+ in an internal database for easy and efficient querying.
+
+
+ Use Cases
+ =========
+
+ 1. **Multimodal Dataset Preparation and Curation**: ideal for organizing and
+ refining data in pre-training, finetuning or LLM evaluating stages.
+ 2. **GenAI Data Analytics**: Enables advanced analytics for multimodal data and
+ ad-hoc analytics using LLMs.

  Key Features
  ============

- 📂 **Storage as a Source of Truth.**
- - Process unstructured data without redundant copies from S3, GCP, Azure, and local
- file systems.
- - Multimodal data support: images, video, text, PDFs, JSONs, CSVs, parquet.
+ 📂 **Multimodal Dataset Versioning.**
+ - Version unstructured data without redundant data copies, by supporitng
+ references to S3, GCP, Azure, and local file systems.
+ - Multimodal data support: images, video, text, PDFs, JSONs, CSVs, parquet, etc.
  - Unite files and metadata together into persistent, versioned, columnar datasets.

- 🐍 **Python-friendly data pipelines.**
- - Operate on Python objects and object fields.
- - Built-in parallelization and out-of-memory compute without SQL or Spark.
+ 🐍 **Python-friendly.**
+ - Operate on Python objects and object fields: float scores, strings, matrixes,
+ LLM response objects.
+ - Run Python code in a high-scale, terabytes size datasets, with built-in
+ parallelization and memory-efficient computing — no SQL or Spark required.

  🧠 **Data Enrichment and Processing.**
  - Generate metadata using local AI models and LLM APIs.
- - Filter, join, and group by metadata. Search by vector embeddings.
+ - Filter, join, and group datasets by metadata. Search by vector embeddings.
+ - High-performance vectorized operations on Python objects: sum, count, avg, etc.
  - Pass datasets to Pytorch and Tensorflow, or export them back into storage.

- 🚀 **Efficiency.**
- - Parallelization, out-of-memory workloads and data caching.
- - Vectorized operations on Python object fields: sum, count, avg, etc.
- - Optimized vector search.
-

  Quick Start
  -----------
@@ -196,7 +204,7 @@ Batch inference with a simple sentiment model using the `transformers` library:

  pip install transformers

- The code below downloads files the cloud, and applies a user-defined function
+ The code below downloads files from the cloud, and applies a user-defined function
  to each one of them. All files with a positive sentiment
  detected are then copied to the local directory.

@@ -429,6 +437,19 @@ name suffix, the following code will do it:
  loader = DataLoader(chain, batch_size=1)


+ DataChain Studio Platform
+ -------------------------
+
+ `DataChain Studio`_ is a proprietary solution for teams that offers:
+
+ - **Centralized dataset registry** to manage data, code and dependency
+ dependencies in one place.
+ - **Data Lineage** for data sources as well as direvative dataset.
+ - **UI for Multimodal Data** like images, videos, and PDFs.
+ - **Scalable Compute** to handle large datasets (100M+ files) and in-house
+ AI model inference.
+ - **Access control** including SSO and team based collaboration.
+
  Tutorials
  ---------

@@ -462,6 +483,5 @@ Community and Support
  .. _Pydantic: https://github.com/pydantic/pydantic
  .. _publicly available: https://radar.kit.edu/radar/en/dataset/FdJmclKpjHzLfExE.ExpBot%2B-%2BA%2Bdataset%2Bof%2B79%2Bdialogs%2Bwith%2Ban%2Bexperimental%2Bcustomer%2Bservice%2Bchatbot
  .. _SQLite: https://www.sqlite.org/
- .. _Getting Started: https://datachain.dvc.ai/
- .. |Flowchart| image:: https://github.com/iterative/datachain/blob/main/docs/assets/flowchart.png?raw=true
- :alt: DataChain FlowChart
+ .. _Getting Started: https://docs.datachain.ai/
+ .. _DataChain Studio: https://studio.datachain.ai/
@@ -5,10 +5,10 @@ datachain/cache.py,sha256=s0YHN7qurmQv-eC265TjeureK84TebWWAnL07cxchZQ,2997
  datachain/cli.py,sha256=hdVt_HJumQVgtaBAtBVJm-uPyYVogMXNVLmRcZyWHgk,36677
  datachain/cli_utils.py,sha256=jrn9ejGXjybeO1ur3fjdSiAyCHZrX0qsLLbJzN9ErPM,2418
  datachain/config.py,sha256=g8qbNV0vW2VEKpX-dGZ9pAn0DAz6G2ZFcr7SAV3PoSM,4272
- datachain/dataset.py,sha256=lLUbUbJP1TYL9Obkc0f2IDziGcDylZge9ORQjK-WtXs,14717
+ datachain/dataset.py,sha256=0IN-5y723y-bnFlieKtOFZLCjwX_yplFo3q0DV7LRPw,14821
  datachain/error.py,sha256=bxAAL32lSeMgzsQDEHbGTGORj-mPzzpCRvWDPueJNN4,1092
  datachain/job.py,sha256=Jt4sNutMHJReaGsj3r3scueN5aESLGfhimAa8pUP7Is,1271
- datachain/listing.py,sha256=AV23WZq-k6e2zeeNBhVQP1-2PrwNCYidO0HBDKzpVaA,7152
+ datachain/listing.py,sha256=TgKg25ZWAP5enzKgw2_2GUPJVdnQUh6uySHB5SJrUY4,7773
  datachain/node.py,sha256=i7_jC8VcW6W5VYkDszAOu0H-rNBuqXB4UnLEh4wFzjc,5195
  datachain/nodes_fetcher.py,sha256=F-73-h19HHNGtHFBGKk7p3mc0ALm4a9zGnzhtuUjnp4,1107
  datachain/nodes_thread_pool.py,sha256=uPo-xl8zG5m9YgODjPFBpbcqqHjI-dcxH87yAbj_qco,3192
@@ -18,13 +18,13 @@ datachain/studio.py,sha256=6kxF7VxPAbh9D7_Bk8_SghS5OXrwUwSpDaw19eNCTP4,4083
  datachain/telemetry.py,sha256=0A4IOPPp9VlP5pyW9eBfaTK3YhHGzHl7dQudQjUAx9A,994
  datachain/utils.py,sha256=-mSFowjIidJ4_sMXInvNHLn4rK_QnHuIlLuH1_lMGmI,13897
  datachain/catalog/__init__.py,sha256=g2iAAFx_gEIrqshXlhSEbrc8qDaEH11cjU40n3CHDz4,409
- datachain/catalog/catalog.py,sha256=VwItaZG8MUqNKYz0xopDCdkVkbbxgTZYky3ElgsK5-M,57183
+ datachain/catalog/catalog.py,sha256=J1nUWLI4RYCvvR6fB4neQBtB7V-CTh4PM71irhNmJc4,57817
  datachain/catalog/datasource.py,sha256=D-VWIVDCM10A8sQavLhRXdYSCG7F4o4ifswEF80_NAQ,1412
  datachain/catalog/loader.py,sha256=-6VelNfXUdgUnwInVyA8g86Boxv2xqhTh9xNS-Zlwig,8242
  datachain/client/__init__.py,sha256=T4wiYL9KIM0ZZ_UqIyzV8_ufzYlewmizlV4iymHNluE,86
  datachain/client/azure.py,sha256=ffxs26zm6KLAL1aUWJm-vtzuZP3LSNha7UDGXynMBKo,2234
  datachain/client/fileslice.py,sha256=bT7TYco1Qe3bqoc8aUkUZcPdPofJDHlryL5BsTn9xsY,3021
- datachain/client/fsspec.py,sha256=C6C5AO6ndkgcoUxCRN9_8fUzqX2cRWJWG6FL6oD9X_Q,12708
+ datachain/client/fsspec.py,sha256=Ai5m7alkAnv-RWXuLbZ95SKEPaQ3Pyk5ujDy50JDX5w,12692
  datachain/client/gcs.py,sha256=cnTIr5GS6dbYOEYfqehhyQu3dr6XNjPHSg5U3FkivUk,4124
  datachain/client/hf.py,sha256=XeVJVbiNViZCpn3sfb90Fr8SYO3BdLmfE3hOWMoqInE,951
  datachain/client/local.py,sha256=vwbgCwZ7IqY2voj2l7tLJjgov7Dp--fEUvUwUBsMbls,4457
@@ -33,27 +33,27 @@ datachain/data_storage/__init__.py,sha256=cEOJpyu1JDZtfUupYucCDNFI6e5Wmp_Oyzq6rZ
  datachain/data_storage/db_engine.py,sha256=81Ol1of9TTTzD97ORajCnP366Xz2mEJt6C-kTUCaru4,3406
  datachain/data_storage/id_generator.py,sha256=lCEoU0BM37Ai2aRpSbwo5oQT0GqZnSpYwwvizathRMQ,4292
  datachain/data_storage/job.py,sha256=w-7spowjkOa1P5fUVtJou3OltT0L48P0RYWZ9rSJ9-s,383
- datachain/data_storage/metastore.py,sha256=-TJCqG70VofSVOh2yEez4dwjHS3eQL8p7d9uO3WTVwM,35878
+ datachain/data_storage/metastore.py,sha256=5b7o_CSHC2djottebYn-Hq5q0yaSLOKPIRCnaVRvjsU,36056
  datachain/data_storage/schema.py,sha256=scANMQqozita3HjEtq7eupMgh6yYkrZHoXtfuL2RoQg,9879
  datachain/data_storage/serializer.py,sha256=6G2YtOFqqDzJf1KbvZraKGXl2XHZyVml2krunWUum5o,927
- datachain/data_storage/sqlite.py,sha256=wb8xlMJYYyt59wft0psJj587d-AwpNThzIqspVcKnRI,27388
+ datachain/data_storage/sqlite.py,sha256=CspRUlYsIcubgzvcQxTACnmcuKESSLZcqCl0dcrtRiA,27471
  datachain/data_storage/warehouse.py,sha256=xwMaR4jBpR13vjG3zrhphH4z2_CFLNj0KPF0LJCXCJ8,30727
  datachain/lib/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
  datachain/lib/arrow.py,sha256=-hu9tic79a01SY2UBqkA3U6wUr6tnE3T3q5q_BnO93A,9156
  datachain/lib/clip.py,sha256=lm5CzVi4Cj1jVLEKvERKArb-egb9j1Ls-fwTItT6vlI,6150
  datachain/lib/data_model.py,sha256=dau4AlZBhOFvF7pEKMeqCeRkcFFg5KFvTBWW_2CdH5g,2371
- datachain/lib/dataset_info.py,sha256=srPPhI2UHf6hFPBecyFEVw2SS5aPisIIMsvGgKqi7ss,2366
- datachain/lib/dc.py,sha256=U1evAvSs563OMuUVildoaIOuOFiNB6fZcsN4BI8L9f0,85076
+ datachain/lib/dataset_info.py,sha256=q0EW9tj5jXGSD9Lzct9zbH4P1lfIGd_cIWqhnMxv7Q0,2464
+ datachain/lib/dc.py,sha256=BmRgCt5fXvBqlFV07KN-nWszueRyCkC7td1x7T4BZ7k,87688
  datachain/lib/file.py,sha256=lHxE1wOGR4QJBQ3AYjhPLwpX72dOi06vkcwA-WSAGlg,14817
  datachain/lib/hf.py,sha256=BW2NPpqxkpPwkSaGlppT8Rbs8zPpyYC-tR6htY08c-0,5817
  datachain/lib/image.py,sha256=AMXYwQsmarZjRbPCZY3M1jDsM2WAB_b3cTY4uOIuXNU,2675
  datachain/lib/listing.py,sha256=cVkCp7TRVpcZKSx-Bbk9t51bQI9Mw0o86W6ZPhAsuzM,3667
  datachain/lib/listing_info.py,sha256=9ua40Hw0aiQByUw3oAEeNzMavJYfW0Uhe8YdCTK-m_g,1110
- datachain/lib/meta_formats.py,sha256=3f-0vpMTesagS9iMd3y9-u9r-7g0eqYsxmK4fVfNWlw,6635
+ datachain/lib/meta_formats.py,sha256=anK2bDVbaeCCh0yvKUBaW2MVos3zRgdaSV8uSduzPcU,6680
  datachain/lib/model_store.py,sha256=DNIv8Y6Jtk1_idNLzIpsThOsdW2BMAudyUCbPUcgcxk,2515
  datachain/lib/pytorch.py,sha256=W-ARi2xH1f1DUkVfRuerW-YWYgSaJASmNCxtz2lrJGI,6072
  datachain/lib/settings.py,sha256=39thOpYJw-zPirzeNO6pmRC2vPrQvt4eBsw1xLWDFsw,2344
- datachain/lib/signal_schema.py,sha256=mQuviKAdZzFtZcbZHhqzUP-zivQ9MDZiLQhE54OPbOA,24555
+ datachain/lib/signal_schema.py,sha256=xwkE5bxJxUhZTjrA6jqN87XbSXPikCbL6eOPL9WyrKM,24556
  datachain/lib/tar.py,sha256=3WIzao6yD5fbLqXLTt9GhPGNonbFIs_fDRu-9vgLgsA,1038
  datachain/lib/text.py,sha256=UNHm8fhidk7wdrWqacEWaA6I9ykfYqarQ2URby7jc7M,1261
  datachain/lib/udf.py,sha256=4CqK51n3bntXCmkwoOQIrX34wMKOknkC23HtR4D_2vM,12705
@@ -71,10 +71,14 @@ datachain/lib/convert/values_to_tuples.py,sha256=varRCnSMT_pZmHznrd2Yi05qXLLz_v9
  datachain/lib/func/__init__.py,sha256=wlAKhGV0QDg9y7reSwoUF8Vicfqh_YOUNIXLzxICGz4,403
  datachain/lib/func/aggregate.py,sha256=H1ziFQdaK9zvnxvttfnEzkkyGvEEmMAvmgCsBV6nfm8,10917
  datachain/lib/func/func.py,sha256=HAJZ_tpiRG2R-et7pr0WnoyNZYtpbPn3_HBuL3RQpbU,4800
- datachain/lib/models/__init__.py,sha256=AGvjPbUokJiir3uelTa4XGtNSECkMFc5Xmi_N3AtxPQ,119
- datachain/lib/models/bbox.py,sha256=aiYNhvEcRK3dEN4MBcptmkPKc9kMP16ZQdu7xPk6hek,1555
- datachain/lib/models/pose.py,sha256=peuJPNSiGuTXfCfGIABwv8PGYistvTTBmtf-8X8E_eA,1077
- datachain/lib/models/yolo.py,sha256=eftoJDUa8iOpFTF1EkKVAd5Q-3HRd6X4eCIZ9h5p4nI,972
+ datachain/lib/models/__init__.py,sha256=6iwqXWcybyELKdLEe59yUPl8R8ZHDY4lA-xCHVYPdOA,191
+ datachain/lib/models/bbox.py,sha256=UJ_64D8TQglX2B_ueseILPoT3cGIWr9McVg0mv2YdmE,3717
+ datachain/lib/models/pose.py,sha256=KC-OpLC7-3v6qg4YN6pXlfAgtg88VLQoRc75JCEmbfY,3931
+ datachain/lib/models/segment.py,sha256=ergCFnEzLDzaU75p1_KvWgal1LSv4VuFmkWLkRJeaVk,1862
+ datachain/lib/models/ultralytics/__init__.py,sha256=g8mgII0k_RJiOG9kd4k_ECfCgDhT_iPh3vCC_5OiDD4,305
+ datachain/lib/models/ultralytics/bbox.py,sha256=LAaezAnnugfBiczWZ63NTo65kX2BegR5WGXjQTOTE28,5784
+ datachain/lib/models/ultralytics/pose.py,sha256=nMoEeeY_Zi7Iiu7vIo9ZTq8ARUdg_BcZMQIA_WgRNk4,3488
+ datachain/lib/models/ultralytics/segment.py,sha256=IHnthsq6uQ6DSdHLK2akbdd0Eq8wW7oaAK6pUG8nxJc,3818
  datachain/query/__init__.py,sha256=7DhEIjAA8uZJfejruAVMZVcGFmvUpffuZJwgRqNwe-c,263
  datachain/query/batch.py,sha256=5fEhORFe7li12SdYddaSK3LyqksMfCHhwN1_A6TfsA4,3485
  datachain/query/dataset.py,sha256=MGArYxioeGvm8w7hQtQAjEI6wsZN_XAoh4-jO4d0U5Q,53926
@@ -103,10 +107,12 @@ datachain/sql/sqlite/__init__.py,sha256=TAdJX0Bg28XdqPO-QwUVKy8rg78cgMileHvMNot7
  datachain/sql/sqlite/base.py,sha256=aHSZVvh4XSVkvZ07h3jMoRlHI4sWD8y3SnmGs9xMG9Y,14375
  datachain/sql/sqlite/types.py,sha256=yzvp0sXSEoEYXs6zaYC_2YubarQoZH-MiUNXcpuEP4s,1573
  datachain/sql/sqlite/vector.py,sha256=ncW4eu2FlJhrP_CIpsvtkUabZlQdl2D5Lgwy_cbfqR0,469
+ datachain/toolkit/__init__.py,sha256=eQ58Q5Yf_Fgv1ZG0IO5dpB4jmP90rk8YxUWmPc1M2Bo,68
+ datachain/toolkit/split.py,sha256=6FcEJgUsJsUcCqKW5aXuJy4DvbcQ7_dFbsfNPhn8EVg,2377
  datachain/torch/__init__.py,sha256=gIS74PoEPy4TB3X6vx9nLO0Y3sLJzsA8ckn8pRWihJM,579
- datachain-0.6.8.dist-info/LICENSE,sha256=8DnqK5yoPI_E50bEg_zsHKZHY2HqPy4rYN338BHQaRA,11344
- datachain-0.6.8.dist-info/METADATA,sha256=NDeFhQSQOSP3URzciSjDJnWyC9T3O8ptZmwOU8lDBSI,17259
- datachain-0.6.8.dist-info/WHEEL,sha256=P9jw-gEje8ByB7_hXoICnHtVCrEwMQh-630tKvQWehc,91
- datachain-0.6.8.dist-info/entry_points.txt,sha256=0GMJS6B_KWq0m3VT98vQI2YZodAMkn4uReZ_okga9R4,49
- datachain-0.6.8.dist-info/top_level.txt,sha256=lZPpdU_2jJABLNIg2kvEOBi8PtsYikbN1OdMLHk8bTg,10
- datachain-0.6.8.dist-info/RECORD,,
+ datachain-0.6.10.dist-info/LICENSE,sha256=8DnqK5yoPI_E50bEg_zsHKZHY2HqPy4rYN338BHQaRA,11344
+ datachain-0.6.10.dist-info/METADATA,sha256=AgQuuefAhZRIL1jDJWz-q4daqA5ZmnQN8dafqnt01XA,18038
+ datachain-0.6.10.dist-info/WHEEL,sha256=R06PA3UVYHThwHvxuRWMqaGcr-PuniXahwjmQRFMEkY,91
+ datachain-0.6.10.dist-info/entry_points.txt,sha256=0GMJS6B_KWq0m3VT98vQI2YZodAMkn4uReZ_okga9R4,49
+ datachain-0.6.10.dist-info/top_level.txt,sha256=lZPpdU_2jJABLNIg2kvEOBi8PtsYikbN1OdMLHk8bTg,10
+ datachain-0.6.10.dist-info/RECORD,,
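Note on reading the RECORD entries: each row is a CSV triple of path, hash, and size in bytes, where the hash is the SHA-256 digest in URL-safe base64 with the trailing `=` padding removed. The sketch below shows one way to recompute and compare such a field. The helper names `record_hash` and `parse_record_line` are illustrative and not part of datachain or any packaging tool. The sample row is the size-0 `datachain/lib/__init__.py` entry from this listing, so the expected content is the empty byte string.

```python
import base64
import csv
import hashlib
from typing import Optional, Tuple


def record_hash(data: bytes) -> str:
    """Encode a SHA-256 digest the way a wheel RECORD file stores it."""
    digest = hashlib.sha256(data).digest()
    # URL-safe base64 with the trailing "=" padding stripped.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return "sha256=" + encoded


def parse_record_line(line: str) -> Tuple[str, str, Optional[int]]:
    """Split one RECORD row into (path, hash, size). RECORD is a CSV file."""
    path, hash_field, size = next(csv.reader([line]))
    # The RECORD file's own row has empty hash and size fields.
    return path, hash_field, int(size) if size else None


# Row copied from the RECORD listing: an empty file (size 0).
row = "datachain/lib/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0"
path, expected, size = parse_record_line(row)
matches = record_hash(b"") == expected and size == 0
print(path, "OK" if matches else "MISMATCH")
```

To check a real file, read it in binary mode and pass its bytes to `record_hash`, then compare the result and the byte length against the row.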
@@ -1,5 +1,5 @@
  Wheel-Version: 1.0
- Generator: setuptools (75.3.0)
+ Generator: setuptools (75.5.0)
  Root-Is-Purelib: true
  Tag: py3-none-any

@@ -1,39 +0,0 @@
- """
- This module contains the YOLO models.
-
- YOLO stands for "You Only Look Once", a family of object detection models that
- are designed to be fast and accurate. The models are trained to detect objects
- in images by dividing the image into a grid and predicting the bounding boxes
- and class probabilities for each grid cell.
-
- More information about YOLO can be found here:
- - https://pjreddie.com/darknet/yolo/
- - https://docs.ultralytics.com/
- """
-
-
- class PoseBodyPart:
-     """
-     An enumeration of body parts for YOLO pose keypoints.
-
-     More information about the body parts can be found here:
-     https://docs.ultralytics.com/tasks/pose/
-     """
-
-     nose = 0
-     left_eye = 1
-     right_eye = 2
-     left_ear = 3
-     right_ear = 4
-     left_shoulder = 5
-     right_shoulder = 6
-     left_elbow = 7
-     right_elbow = 8
-     left_wrist = 9
-     right_wrist = 10
-     left_hip = 11
-     right_hip = 12
-     left_knee = 13
-     right_knee = 14
-     left_ankle = 15
-     right_ankle = 16
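Note on the removed `PoseBodyPart`: its docstring calls it an enumeration, but it is a plain class with integer attributes. Code that relied on it could keep the same indices locally. The sketch below restates them as a standard-library `IntEnum`, which also allows reverse lookup from index to name. This is an illustration only: the replacement modules under `datachain/lib/models/ultralytics/` are not shown in this diff, so nothing here is claimed to match what 0.6.10 ships, and the `keypoints` list is made-up sample data.

```python
from enum import IntEnum


class PoseBodyPart(IntEnum):
    """Keypoint indices copied from the removed datachain/lib/models/yolo.py."""

    nose = 0
    left_eye = 1
    right_eye = 2
    left_ear = 3
    right_ear = 4
    left_shoulder = 5
    right_shoulder = 6
    left_elbow = 7
    right_elbow = 8
    left_wrist = 9
    right_wrist = 10
    left_hip = 11
    right_hip = 12
    left_knee = 13
    right_knee = 14
    left_ankle = 15
    right_ankle = 16


# Hypothetical keypoints: one (x, y) pair per body part, in index order.
keypoints = [(float(i), float(i) * 2) for i in range(len(PoseBodyPart))]

# IntEnum members are ints, so they index a list directly.
wrist_x, wrist_y = keypoints[PoseBodyPart.left_wrist]

# Reverse lookup from an index back to the body-part name.
part_name = PoseBodyPart(9).name
print(part_name, wrist_x, wrist_y)
```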