datachain 0.7.9__py3-none-any.whl → 0.7.11__py3-none-any.whl


Potentially problematic release.



@@ -1,488 +0,0 @@
1
- Metadata-Version: 2.1
2
- Name: datachain
3
- Version: 0.7.9
4
- Summary: Wrangle unstructured AI data at scale
5
- Author-email: Dmitry Petrov <support@dvc.org>
6
- License: Apache-2.0
7
- Project-URL: Documentation, https://datachain.dvc.ai
8
- Project-URL: Issues, https://github.com/iterative/datachain/issues
9
- Project-URL: Source, https://github.com/iterative/datachain
10
- Classifier: Programming Language :: Python :: 3
11
- Classifier: Programming Language :: Python :: 3.9
12
- Classifier: Programming Language :: Python :: 3.10
13
- Classifier: Programming Language :: Python :: 3.11
14
- Classifier: Programming Language :: Python :: 3.12
15
- Classifier: Development Status :: 2 - Pre-Alpha
16
- Requires-Python: >=3.9
17
- Description-Content-Type: text/x-rst
18
- License-File: LICENSE
19
- Requires-Dist: pyyaml
20
- Requires-Dist: tomlkit
21
- Requires-Dist: tqdm
22
- Requires-Dist: numpy<3,>=1
23
- Requires-Dist: pandas>=2.0.0
24
- Requires-Dist: pyarrow
25
- Requires-Dist: typing-extensions
26
- Requires-Dist: python-dateutil>=2
27
- Requires-Dist: attrs>=21.3.0
28
- Requires-Dist: s3fs>=2024.2.0
29
- Requires-Dist: gcsfs>=2024.2.0
30
- Requires-Dist: adlfs>=2024.2.0
31
- Requires-Dist: dvc-data<4,>=3.10
32
- Requires-Dist: dvc-objects<6,>=4
33
- Requires-Dist: shtab<2,>=1.3.4
34
- Requires-Dist: sqlalchemy>=2
35
- Requires-Dist: multiprocess==0.70.16
36
- Requires-Dist: cloudpickle
37
- Requires-Dist: orjson>=3.10.5
38
- Requires-Dist: pydantic<3,>=2
39
- Requires-Dist: jmespath>=1.0
40
- Requires-Dist: datamodel-code-generator>=0.25
41
- Requires-Dist: Pillow<12,>=10.0.0
42
- Requires-Dist: msgpack<2,>=1.0.4
43
- Requires-Dist: psutil
44
- Requires-Dist: huggingface_hub
45
- Requires-Dist: iterative-telemetry>=0.0.9
46
- Requires-Dist: platformdirs
47
- Requires-Dist: dvc-studio-client<1,>=0.21
48
- Requires-Dist: tabulate
49
- Provides-Extra: docs
50
- Requires-Dist: mkdocs>=1.5.2; extra == "docs"
51
- Requires-Dist: mkdocs-gen-files>=0.5.0; extra == "docs"
52
- Requires-Dist: mkdocs-material>=9.3.1; extra == "docs"
53
- Requires-Dist: mkdocs-section-index>=0.3.6; extra == "docs"
54
- Requires-Dist: mkdocstrings-python>=1.6.3; extra == "docs"
55
- Requires-Dist: mkdocs-literate-nav>=0.6.1; extra == "docs"
56
- Provides-Extra: torch
57
- Requires-Dist: torch>=2.1.0; extra == "torch"
58
- Requires-Dist: torchvision; extra == "torch"
59
- Requires-Dist: transformers>=4.36.0; extra == "torch"
60
- Provides-Extra: remote
61
- Requires-Dist: lz4; extra == "remote"
62
- Requires-Dist: requests>=2.22.0; extra == "remote"
63
- Provides-Extra: vector
64
- Requires-Dist: usearch; extra == "vector"
65
- Provides-Extra: hf
66
- Requires-Dist: numba>=0.60.0; extra == "hf"
67
- Requires-Dist: datasets[audio,vision]>=2.21.0; extra == "hf"
68
- Provides-Extra: tests
69
- Requires-Dist: datachain[hf,remote,torch,vector]; extra == "tests"
70
- Requires-Dist: pytest<9,>=8; extra == "tests"
71
- Requires-Dist: pytest-sugar>=0.9.6; extra == "tests"
72
- Requires-Dist: pytest-cov>=4.1.0; extra == "tests"
73
- Requires-Dist: pytest-mock>=3.12.0; extra == "tests"
74
- Requires-Dist: pytest-servers[all]>=0.5.8; extra == "tests"
75
- Requires-Dist: pytest-benchmark[histogram]; extra == "tests"
76
- Requires-Dist: pytest-xdist>=3.3.1; extra == "tests"
77
- Requires-Dist: virtualenv; extra == "tests"
78
- Requires-Dist: dulwich; extra == "tests"
79
- Requires-Dist: hypothesis; extra == "tests"
80
- Requires-Dist: open_clip_torch; extra == "tests"
81
- Requires-Dist: aiotools>=1.7.0; extra == "tests"
82
- Requires-Dist: requests-mock; extra == "tests"
83
- Requires-Dist: scipy; extra == "tests"
84
- Provides-Extra: dev
85
- Requires-Dist: datachain[docs,tests]; extra == "dev"
86
- Requires-Dist: mypy==1.13.0; extra == "dev"
87
- Requires-Dist: types-python-dateutil; extra == "dev"
88
- Requires-Dist: types-pytz; extra == "dev"
89
- Requires-Dist: types-PyYAML; extra == "dev"
90
- Requires-Dist: types-requests; extra == "dev"
91
- Requires-Dist: types-tabulate; extra == "dev"
92
- Provides-Extra: examples
93
- Requires-Dist: datachain[tests]; extra == "examples"
94
- Requires-Dist: numpy<2,>=1; extra == "examples"
95
- Requires-Dist: defusedxml; extra == "examples"
96
- Requires-Dist: accelerate; extra == "examples"
97
- Requires-Dist: unstructured[embed-huggingface,pdf]<0.16.0; extra == "examples"
98
- Requires-Dist: pdfplumber==0.11.4; extra == "examples"
99
- Requires-Dist: huggingface_hub[hf_transfer]; extra == "examples"
100
- Requires-Dist: onnx==1.16.1; extra == "examples"
101
- Requires-Dist: ultralytics==8.3.37; extra == "examples"
102
-
103
- ================
104
- |logo| DataChain
105
- ================
106
-
107
- |PyPI| |Python Version| |Codecov| |Tests|
108
-
109
- .. |logo| image:: docs/assets/datachain.svg
110
- :height: 24
111
- .. |PyPI| image:: https://img.shields.io/pypi/v/datachain.svg
112
- :target: https://pypi.org/project/datachain/
113
- :alt: PyPI
114
- .. |Python Version| image:: https://img.shields.io/pypi/pyversions/datachain
115
- :target: https://pypi.org/project/datachain
116
- :alt: Python Version
117
- .. |Codecov| image:: https://codecov.io/gh/iterative/datachain/graph/badge.svg?token=byliXGGyGB
118
- :target: https://codecov.io/gh/iterative/datachain
119
- :alt: Codecov
120
- .. |Tests| image:: https://github.com/iterative/datachain/actions/workflows/tests.yml/badge.svg
121
- :target: https://github.com/iterative/datachain/actions/workflows/tests.yml
122
- :alt: Tests
123
-
124
- DataChain is a Python-based AI-data warehouse for transforming and analyzing unstructured
125
- data like images, audio, videos, text and PDFs. It integrates with external storage
126
- (e.g., S3) to process data efficiently without data duplication and manages metadata
127
- in an internal database for easy and efficient querying.
128
-
129
-
130
- Use Cases
131
- =========
132
-
133
- 1. **Multimodal Dataset Preparation and Curation**: ideal for organizing and
134
- refining data in the pre-training, fine-tuning, or LLM evaluation stages.
135
- 2. **GenAI Data Analytics**: Enables advanced analytics for multimodal data and
136
- ad-hoc analytics using LLMs.
137
-
138
- Key Features
139
- ============
140
-
141
- 📂 **Multimodal Dataset Versioning.**
142
- - Version unstructured data without redundant data copies, by supporting
143
- references to S3, GCP, Azure, and local file systems.
144
- - Multimodal data support: images, video, text, PDFs, JSONs, CSVs, parquet, etc.
145
- - Unite files and metadata together into persistent, versioned, columnar datasets.
146
-
147
- 🐍 **Python-friendly.**
148
- - Operate on Python objects and object fields: float scores, strings, matrices,
149
- LLM response objects.
150
- - Run Python code on large, terabyte-scale datasets, with built-in
151
- parallelization and memory-efficient computing — no SQL or Spark required.
152
-
153
- 🧠 **Data Enrichment and Processing.**
154
- - Generate metadata using local AI models and LLM APIs.
155
- - Filter, join, and group datasets by metadata. Search by vector embeddings.
156
- - High-performance vectorized operations on Python objects: sum, count, avg, etc.
157
- - Pass datasets to Pytorch and Tensorflow, or export them back into storage.
158
-
159
-
160
- Quick Start
161
- -----------
162
-
163
- .. code:: console
164
-
165
- $ pip install datachain
166
-
167
-
168
- Selecting files using JSON metadata
169
- ======================================
170
-
171
- A storage consists of images of cats and dogs (`dog.1048.jpg`, `cat.1009.jpg`),
172
- annotated with ground truth and model inferences in the 'json-pairs' format,
173
- where each image has a matching JSON file like `cat.1009.json`:
174
-
175
- .. code:: json
176
-
177
- {
178
- "class": "cat", "id": "1009", "num_annotators": 8,
179
- "inference": {"class": "dog", "confidence": 0.68}
180
- }
181
-
182
- Example of downloading only "high-confidence cat" inferred images using JSON metadata:
183
-
184
-
185
- .. code:: py
186
-
187
- from datachain import Column, DataChain
188
-
189
- meta = DataChain.from_json("gs://datachain-demo/dogs-and-cats/*json", object_name="meta")
190
- images = DataChain.from_storage("gs://datachain-demo/dogs-and-cats/*jpg")
191
-
192
- images_id = images.map(id=lambda file: file.path.split('.')[-2])
193
- annotated = images_id.merge(meta, on="id", right_on="meta.id")
194
-
195
- likely_cats = annotated.filter((Column("meta.inference.confidence") > 0.93) \
196
- & (Column("meta.inference.class_") == "cat"))
197
- likely_cats.export_files("high-confidence-cats/", signal="file")
198
-
199
-
200
- Data curation with a local AI model
201
- ===================================
202
- Batch inference with a simple sentiment model using the `transformers` library:
203
-
204
- .. code:: shell
205
-
206
- pip install transformers
207
-
208
- The code below downloads files from the cloud, and applies a user-defined function
209
- to each one of them. All files with a positive sentiment
210
- detected are then copied to the local directory.
211
-
212
- .. code:: py
213
-
214
- from transformers import pipeline
215
- from datachain import DataChain, Column
216
-
217
- classifier = pipeline("sentiment-analysis", device="cpu",
218
- model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")
219
-
220
- def is_positive_dialogue_ending(file) -> bool:
221
- dialogue_ending = file.read()[-512:]
222
- return classifier(dialogue_ending)[0]["label"] == "POSITIVE"
223
-
224
- chain = (
225
- DataChain.from_storage("gs://datachain-demo/chatbot-KiT/",
226
- object_name="file", type="text")
227
- .settings(parallel=8, cache=True)
228
- .map(is_positive=is_positive_dialogue_ending)
229
- .save("file_response")
230
- )
231
-
232
- positive_chain = chain.filter(Column("is_positive") == True)
233
- positive_chain.export_files("./output")
234
-
235
- print(f"{positive_chain.count()} files were exported")
236
-
237
-
238
-
239
- 13 files were exported
240
-
241
- .. code:: shell
242
-
243
- $ ls output/datachain-demo/chatbot-KiT/
244
- 15.txt 20.txt 24.txt 27.txt 28.txt 29.txt 33.txt 37.txt 38.txt 43.txt ...
245
- $ ls output/datachain-demo/chatbot-KiT/ | wc -l
246
- 13
247
-
248
-
249
- LLM judging chatbots
250
- =============================
251
-
252
- LLMs can work as universal classifiers. In the example below,
253
- we employ a free API from Mistral to judge the `publicly available`_ chatbot dialogs. Please get a free
254
- Mistral API key at https://console.mistral.ai
255
-
256
-
257
- .. code:: shell
258
-
259
- $ pip install "mistralai>=1.0.0"
260
- $ export MISTRAL_API_KEY=_your_key_
261
-
262
- DataChain can parallelize API calls; the free Mistral tier supports up to 4 requests at the same time.
263
-
264
- .. code:: py
265
-
266
- from mistralai import Mistral
267
- from datachain import File, DataChain, Column
268
-
269
- PROMPT = "Was this dialog successful? Answer in a single word: Success or Failure."
270
-
271
- def eval_dialogue(file: File) -> bool:
272
- client = Mistral()
273
- response = client.chat.complete(
274
- model="open-mixtral-8x22b",
275
- messages=[{"role": "system", "content": PROMPT},
276
- {"role": "user", "content": file.read()}])
277
- result = response.choices[0].message.content
278
- return result.lower().startswith("success")
279
-
280
- chain = (
281
- DataChain.from_storage("gs://datachain-demo/chatbot-KiT/", object_name="file")
282
- .settings(parallel=4, cache=True)
283
- .map(is_success=eval_dialogue)
284
- .save("mistral_files")
285
- )
286
-
287
- successful_chain = chain.filter(Column("is_success") == True)
288
- successful_chain.export_files("./output_mistral")
289
-
290
- print(f"{successful_chain.count()} files were exported")
291
-
292
-
293
- With the instruction above, the Mistral model considers 31/50 files to hold the successful dialogues:
294
-
295
- .. code:: shell
296
-
297
- $ ls output_mistral/datachain-demo/chatbot-KiT/
298
- 1.txt 15.txt 18.txt 2.txt 22.txt 25.txt 28.txt 33.txt 37.txt 4.txt 41.txt ...
299
- $ ls output_mistral/datachain-demo/chatbot-KiT/ | wc -l
300
- 31
301
-
302
-
303
-
304
- Serializing Python-objects
305
- ==========================
306
-
307
- LLM responses may contain valuable information for analytics – such as the number of tokens used, or the
308
- model performance parameters.
309
-
310
- Instead of extracting this information from the Mistral response data structure (class
311
- `ChatCompletionResponse`), DataChain can serialize the entire LLM response to the internal DB:
312
-
313
-
314
- .. code:: py
315
-
316
- from mistralai import Mistral
317
- from mistralai.models import ChatCompletionResponse
318
- from datachain import File, DataChain, Column
319
-
320
- PROMPT = "Was this dialog successful? Answer in a single word: Success or Failure."
321
-
322
- def eval_dialog(file: File) -> ChatCompletionResponse:
323
- client = Mistral()
324
- return client.chat.complete(
325
- model="open-mixtral-8x22b",
326
- messages=[{"role": "system", "content": PROMPT},
327
- {"role": "user", "content": file.read()}])
328
-
329
- chain = (
330
- DataChain.from_storage("gs://datachain-demo/chatbot-KiT/", object_name="file")
331
- .settings(parallel=4, cache=True)
332
- .map(response=eval_dialog)
333
- .map(status=lambda response: response.choices[0].message.content.lower()[:7])
334
- .save("response")
335
- )
336
-
337
- chain.select("file.name", "status", "response.usage").show(5)
338
-
339
- success_rate = chain.filter(Column("status") == "success").count() / chain.count()
340
- print(f"{100*success_rate:.1f}% dialogs were successful")
341
-
342
- Output:
343
-
344
- .. code:: shell
345
-
346
- file status response response response
347
- name usage usage usage
348
- prompt_tokens total_tokens completion_tokens
349
- 0 1.txt success 547 548 1
350
- 1 10.txt failure 3576 3578 2
351
- 2 11.txt failure 626 628 2
352
- 3 12.txt failure 1144 1182 38
353
- 4 13.txt success 1100 1101 1
354
-
355
- [Limited by 5 rows]
356
- 64.0% dialogs were successful
357
-
358
-
359
- Iterating over Python data structures
360
- =============================================
361
-
362
- In the previous examples, datasets were saved in the embedded database
363
- (`SQLite`_ in folder `.datachain` of the working directory).
364
- These datasets were automatically versioned, and can be accessed using
365
- `DataChain.from_dataset("dataset_name")`.
366
-
367
- Here is how to retrieve a saved dataset and iterate over the objects:
368
-
369
- .. code:: py
370
-
371
- chain = DataChain.from_dataset("response")
372
-
373
- # Iterating one-by-one: supports out-of-memory workflows
374
- for file, response in chain.limit(5).collect("file", "response"):
375
- # verify the collected Python objects
376
- assert isinstance(response, ChatCompletionResponse)
377
-
378
- status = response.choices[0].message.content[:7]
379
- tokens = response.usage.total_tokens
380
- print(f"{file.get_uri()}: {status}, file size: {file.size}, tokens: {tokens}")
381
-
382
- Output:
383
-
384
- .. code:: shell
385
-
386
- gs://datachain-demo/chatbot-KiT/1.txt: Success, file size: 1776, tokens: 548
387
- gs://datachain-demo/chatbot-KiT/10.txt: Failure, file size: 11576, tokens: 3578
388
- gs://datachain-demo/chatbot-KiT/11.txt: Failure, file size: 2045, tokens: 628
389
- gs://datachain-demo/chatbot-KiT/12.txt: Failure, file size: 3833, tokens: 1207
390
- gs://datachain-demo/chatbot-KiT/13.txt: Success, file size: 3657, tokens: 1101
391
-
392
-
393
- Vectorized analytics over Python objects
394
- ========================================
395
-
396
- Some operations can run inside the DB without deserialization.
397
- For instance, let's calculate the total cost of using the LLM APIs, assuming the Mixtral call costs $2 per 1M input tokens and $6 per 1M output tokens:
398
-
399
- .. code:: py
400
-
401
- chain = DataChain.from_dataset("response")
402
-
403
- cost = chain.sum("response.usage.prompt_tokens")*0.000002 \
404
- + chain.sum("response.usage.completion_tokens")*0.000006
405
- print(f"Spent ${cost:.2f} on {chain.count()} calls")
406
-
407
- Output:
408
-
409
- .. code:: shell
410
-
411
- Spent $0.08 on 50 calls
412
-
413
-
414
- PyTorch data loader
415
- ===================
416
-
417
- Chain results can be exported or passed directly to a PyTorch data loader.
418
- For example, to pass an image together with a label derived from the file
419
- name prefix, the following code will do it:
420
-
421
- .. code:: py
422
-
423
- from torch.utils.data import DataLoader
424
- from transformers import CLIPProcessor
425
-
426
- from datachain import C, DataChain
427
-
428
- processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
429
-
430
- chain = (
431
- DataChain.from_storage("gs://datachain-demo/dogs-and-cats/", type="image")
432
- .map(label=lambda name: name.split(".")[0], params=["file.name"])
433
- .select("file", "label").to_pytorch(
434
- transform=processor.image_processor,
435
- tokenizer=processor.tokenizer,
436
- )
437
- )
438
- loader = DataLoader(chain, batch_size=1)
439
-
440
-
441
- DataChain Studio Platform
442
- -------------------------
443
-
444
- `DataChain Studio`_ is a proprietary solution for teams that offers:
445
-
446
- - **Centralized dataset registry** to manage data, code, and
447
- dependencies in one place.
448
- - **Data Lineage** for data sources as well as derivative datasets.
449
- - **UI for Multimodal Data** like images, videos, and PDFs.
450
- - **Scalable Compute** to handle large datasets (100M+ files) and in-house
451
- AI model inference.
452
- - **Access control** including SSO and team-based collaboration.
453
-
454
- Tutorials
455
- ---------
456
-
457
- * `Getting Started`_
458
- * `Multimodal <https://github.com/iterative/datachain-examples/blob/main/multimodal/clip_fine_tuning.ipynb>`_ (try in `Colab <https://colab.research.google.com/github/iterative/datachain-examples/blob/main/multimodal/clip_fine_tuning.ipynb>`__)
459
- * `LLM evaluations <https://github.com/iterative/datachain-examples/blob/main/llm/llm_chatbot_evaluation.ipynb>`_ (try in `Colab <https://colab.research.google.com/github/iterative/datachain-examples/blob/main/llm/llm_chatbot_evaluation.ipynb>`__)
460
- * `Reading JSON metadata <https://github.com/iterative/datachain-examples/blob/main/formats/json-metadata-tutorial.ipynb>`_ (try in `Colab <https://colab.research.google.com/github/iterative/datachain-examples/blob/main/formats/json-metadata-tutorial.ipynb>`__)
461
-
462
-
463
- Contributions
464
- -------------
465
-
466
- Contributions are very welcome.
467
- To learn more, see the `Contributor Guide`_.
468
-
469
-
470
- Community and Support
471
- ---------------------
472
-
473
- * `Docs <https://datachain.dvc.ai/>`_
474
- * `File an issue`_ if you encounter any problems
475
- * `Discord Chat <https://dvc.org/chat>`_
476
- * `Email <mailto:support@dvc.org>`_
477
- * `Twitter <https://twitter.com/DVCorg>`_
478
-
479
-
480
- .. _PyPI: https://pypi.org/
481
- .. _file an issue: https://github.com/iterative/datachain/issues
482
- .. github-only
483
- .. _Contributor Guide: CONTRIBUTING.rst
484
- .. _Pydantic: https://github.com/pydantic/pydantic
485
- .. _publicly available: https://radar.kit.edu/radar/en/dataset/FdJmclKpjHzLfExE.ExpBot%2B-%2BA%2Bdataset%2Bof%2B79%2Bdialogs%2Bwith%2Ban%2Bexperimental%2Bcustomer%2Bservice%2Bchatbot
486
- .. _SQLite: https://www.sqlite.org/
487
- .. _Getting Started: https://docs.datachain.ai/
488
- .. _DataChain Studio: https://studio.datachain.ai/
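
Note on the cost arithmetic in the removed README's "Vectorized analytics over Python objects" section: the formula multiplies the summed prompt tokens by the input price and the summed completion tokens by the output price. The sketch below reproduces that arithmetic in plain Python, without DataChain, using only the five sample rows printed in the README's `show(5)` output. The names `INPUT_PRICE_PER_TOKEN`, `OUTPUT_PRICE_PER_TOKEN`, `SAMPLE_USAGE`, and `total_cost` are illustrative and do not come from the package. The prices are the ones the README assumes ($2 per 1M input tokens, $6 per 1M output tokens).

```python
# Prices assumed by the README text: $2 per 1M input tokens, $6 per 1M output tokens.
INPUT_PRICE_PER_TOKEN = 2 / 1_000_000
OUTPUT_PRICE_PER_TOKEN = 6 / 1_000_000

# (prompt_tokens, completion_tokens) for the five rows shown in the README's
# sample table. This is only a subset of the 50 dialogs, chosen for illustration.
SAMPLE_USAGE = [
    (547, 1),
    (3576, 2),
    (626, 2),
    (1144, 38),
    (1100, 1),
]


def total_cost(usage):
    """Sum token counts per kind, then apply the per-token price to each sum."""
    prompt_total = sum(prompt for prompt, _ in usage)
    completion_total = sum(completion for _, completion in usage)
    return (
        prompt_total * INPUT_PRICE_PER_TOKEN
        + completion_total * OUTPUT_PRICE_PER_TOKEN
    )


cost = total_cost(SAMPLE_USAGE)
print(f"Spent ${cost:.4f} on {len(SAMPLE_USAGE)} calls")
```

The README's `chain.sum(...)` calls perform the same two sums inside the database, so the per-row values never need to be loaded into Python.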