datatoolpack 0.2.1__tar.gz → 0.3.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

The 0.3.0 package metadata (PKG-INFO):

```
Metadata-Version: 2.4
Name: datatoolpack
Version: 0.3.0
Summary: Official Python SDK for the AutoData ML data preparation pipeline API
Home-page: https://autodata.datatoolpack.com
Author: AutoData Team
Author-email: support@datatoolpack.com
Project-URL: Documentation, https://autodata.datatoolpack.com/docs
Project-URL: Bug Tracker, https://github.com/datatoolpack/autodata-client/issues
Keywords: autodata machine-learning data-preparation synthetic-data ml-pipeline
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: requests>=2.25.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: project-url
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary
```

# AutoData Python Client

Official Python SDK for the [AutoData](https://autodata.datatoolpack.com) ML data preparation pipeline API.

## Installation

```bash
pip install datatoolpack
```

Or install from source:

```bash
git clone https://github.com/datatoolpack/datatoolpack
cd datatoolpack
pip install .
```

## Quick Start

```python
from autodata import AutoDataClient

# Use as a context manager (recommended)
with AutoDataClient(
    api_key="dtpk_YOUR_API_KEY",
    base_url="https://autodata.datatoolpack.com",
) as client:
    result = client.process(
        file_path="data.csv",
        target_columns=["price"],
        output_rows=20000,
    )
    print(result["files"])
```

Get your API key from the [AutoData dashboard](https://autodata.datatoolpack.com/dashboard) → API Keys tab.
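
If you'd rather not hard-code the key, read it from an environment variable. A minimal sketch; the variable name `AUTODATA_API_KEY` is our own convention here, not something the SDK reads automatically:

```python
import os

from autodata import AutoDataClient

# Assumes you exported AUTODATA_API_KEY yourself (e.g. in your shell profile);
# the client only sees the value passed in explicitly.
with AutoDataClient(api_key=os.environ["AUTODATA_API_KEY"]) as client:
    result = client.process(file_path="data.csv", target_columns=["price"])
```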

### Authenticate with an access code (passcode)

If you have an access code instead of an API key, you can use it directly:

```python
with AutoDataClient(
    passcode="123456789012",
    base_url="https://autodata.datatoolpack.com",
) as client:
    result = client.process(
        file_path="data.csv",
        target_columns=["price"],
    )
```

---
92
+
93
+ ## Supported File Formats
94
+
95
+ | Format | Extensions |
96
+ |----------|--------------------|
97
+ | CSV | `.csv` |
98
+ | Excel | `.xlsx`, `.xls` |
99
+ | Parquet | `.parquet` |
100
+ | JSON | `.json` |
101
+ | Feather | `.feather` |
102
+ | ORC | `.orc` |
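
Every format goes through the same `process()` call, so switching input types should just be a matter of changing the path (assuming the format is detected from the extension, which the list above suggests). For example, with a hypothetical `sales.parquet`:

```python
with AutoDataClient(api_key="dtpk_...") as client:
    # Identical to the CSV quick-start; only the file extension differs.
    result = client.process(file_path="sales.parquet", target_columns=["price"])
```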

---

## Reference

### `AutoDataClient(api_key, base_url, timeout, max_retries)`

| Parameter     | Type  | Default                               | Description                                                      |
|---------------|-------|---------------------------------------|------------------------------------------------------------------|
| `api_key`     | `str` | required                              | API key starting with `dtpk_` (or pass `passcode` instead; see above) |
| `base_url`    | `str` | `"https://autodata.datatoolpack.com"` | Server URL (no trailing slash)                                   |
| `timeout`     | `int` | `120`                                 | Request timeout in seconds                                       |
| `max_retries` | `int` | `3`                                   | Auto-retries for 429 / 502 / 503 / 504                           |

Implements the context manager protocol; use `with` to close the session automatically:

```python
with AutoDataClient(api_key="dtpk_...") as client:
    ...
# session is closed automatically
```

Or close manually:

```python
client = AutoDataClient(api_key="dtpk_...")
# ... use client ...
client.close()
```

---

### `client.process(...)` — Upload & run pipeline

```python
result = client.process(
    file_path="data.csv",            # Path to input file (CSV, XLSX, Parquet, …)
    target_columns=["price"],        # y-column(s) for ML
    output_rows=20000,               # Target row count in output
    tools={                          # Toggle pipeline steps (all optional)
        "anomaly": False,            # Anomaly detection (off by default)
        "dtc": True,                 # Data Type Conversion
        "mdh": True,                 # Missing Data Handler
        "cds": True,                 # Column Scaling
        "dsm": True,                 # Data Split Manager
        "dsg": True,                 # Synthetic Data Generator
    },
    advanced_params={                # Fine-grained parameters (all optional)
        "excluded_columns": ["id"],  # Columns to drop before processing
        "text_mode": 0,              # 0=none, 1=neural, 2=tfidf
        "text_cleaning": True,       # Clean text before encoding
        "zscore_limit": 3.0,         # Z-score outlier threshold
        "dsg_mode": "copula",        # "copula" or "gan"
        "similarity_p": 95,          # Similarity percentile for DSG
    },
    wait=True,                       # Block until complete (default True)
    poll_interval=2,                 # Status poll interval in seconds
    download_path="./outputs/",      # Where to save files (default auto)
    auto_download=True,              # Set False to skip download when wait=True
    output_preferences=["dsg.csv"],  # Which files to download (default all)
    compressed=True,                 # Download as ZIP (default True)
)
```

**Returns** a dict:

```python
{
    "session_id": "abc123...",
    "status": "completed",
    "files": [
        {"name": "dsg.csv", "url": "/download/.../dsg.csv", "size": 2097152, "description": "..."},
        {"name": "dsm_train.csv", ...},
        ...
    ],
    "row_count": 20000,
    "duration_seconds": 42.1,
}
```

Set `wait=False` to return immediately with just `session_id` and `status`:

```python
result = client.process(file_path="data.csv", target_columns="price", wait=False)
session_id = result["session_id"]
```

---

### `client.get_status(session_id)` — Poll progress

```python
status = client.get_status(session_id)
# {
#     "status": "running",  # queued | running | completed | error | cancelled
#     "message": "Running MDH...",
#     "current_step": 3,
#     "total_steps": 6,
#     "progress_percent": 50,
#     "duration_seconds": 15.3,
# }
```

---

### `client.get_result(session_id)` — Fetch completed results

```python
result = client.get_result(session_id)
# {"status": "completed", "files": [...], "row_count": ..., "duration_seconds": ...}
```

---

### `client.wait_for_completion(session_id, poll_interval)` — Block until done

```python
result = client.wait_for_completion(session_id, poll_interval=3)
```

Prints live progress to stdout. Raises `AutoDataError` if processing fails.
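
This pairs naturally with `wait=False`: start the job, do other work, then block only when you need the output. A sketch using only the calls documented here:

```python
job = client.process("data.csv", target_columns=["price"], wait=False)

# ... do other work while the pipeline runs ...

# Block until done, then download everything.
result = client.wait_for_completion(job["session_id"], poll_interval=3)
path = client.download_results(job["session_id"], download_path="./outputs/")
```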

---

### `client.cancel(session_id)` — Cancel a running job

```python
cancelled = client.cancel(session_id)  # True if acknowledged
```

---

### `client.download_results(session_id, ...)` — Download output files

```python
path = client.download_results(
    session_id,
    download_path="./my_outputs/",   # Directory to save into
    output_preferences=["dsg.csv"],  # Specific files only (None = all)
    compressed=True,                 # ZIP download (default) or individual files
)
print(f"Saved to {path}")
```

---

### `client.download_file(url, output_path)` — Download a single file

```python
client.download_file("/api/v1/download/abc123.../dsg.csv", "dsg.csv")
```

---

### `client.list_keys()` — List API keys

```python
keys = client.list_keys()
# [{"id": "...", "name": "My Key", "prefix": "dtpk_abc123", "created_at": "..."}]
```

---

### `client.get_usage()` — Usage statistics

```python
usage = client.get_usage()
# {
#     "daily_credits_used": 500,
#     "daily_credit_limit": 10000,
#     "daily_remaining": 9500,
#     "lifetime_credits_used": 12340,
#     "lifetime_credit_limit": 1000000,
#     "lifetime_remaining": 987660,
#     "daily_request_count": 3,
#     "last_used_at": "2026-04-12T10:30:00Z",
# }
```
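
One way to put this to use (a sketch, not an SDK feature): check the remaining daily credits before submitting a large job. The 1000-credit threshold below is purely illustrative:

```python
usage = client.get_usage()
if usage["daily_remaining"] < 1000:
    raise RuntimeError("Not enough daily credits left; try again tomorrow")
result = client.process("large_dataset.csv", target_columns=["churn"])
```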

---

## Connectors — Process data from external sources

Instead of uploading a local file, you can read data directly from databases and cloud storage via connectors.

**Supported connector types:** `sql`, `snowflake`, `bigquery`, `mongodb`, `s3`, `gcs`, `databricks`, `delta`, `fabric`, `kafka`, `kinesis`

### Quick connector example

```python
with AutoDataClient(api_key="dtpk_...") as client:
    # Test the connection
    client.test_connector("sql", secrets={
        "connection_string": "postgresql://user:pass@host/db"
    })

    # List tables
    tables = client.discover("sql", secrets={
        "connection_string": "postgresql://user:pass@host/db"
    })
    print(tables)  # ["users", "orders", ...]

    # Preview columns
    info = client.preview("sql", table="orders", secrets={
        "connection_string": "postgresql://user:pass@host/db"
    })
    print(info["columns"])  # ["id", "price", "date", ...]

    # Run the full pipeline from the connector
    result = client.process_from_connector(
        connector_type="sql",
        table="orders",
        target_columns=["price"],
        secrets={"connection_string": "postgresql://user:pass@host/db"},
        output_rows=20000,
    )
    print(result["files"])
```

### Using saved credentials

Save credentials once, then reference them by ID:

```python
with AutoDataClient(api_key="dtpk_...") as client:
    # Save a credential
    cred = client.save_credential(
        name="Production DB",
        connector_type="sql",
        secrets={"connection_string": "postgresql://..."},
    )
    cred_id = cred["id"]

    # Use credential_id instead of secrets
    tables = client.discover("sql", credential_id=cred_id)
    result = client.process_from_connector(
        connector_type="sql",
        table="orders",
        target_columns=["price"],
        credential_id=cred_id,
    )
```

---

### `client.test_connector(connector_type, secrets, credential_id)` — Test connection

```python
result = client.test_connector("s3", secrets={
    "bucket": "my-bucket",
    "access_key_id": "AKIA...",
    "secret_access_key": "...",
    "region": "us-east-1",
})
# {"success": True, "message": "Connection successful"}
```

---

### `client.discover(connector_type, ...)` — List tables / files

```python
tables = client.discover("bigquery", secrets={
    "credentials_json": '{"type":"service_account",...}',
    "project_id": "my-project",
    "dataset": "analytics",
})
# ["events", "users", "transactions"]
```

---

### `client.preview(connector_type, table, ...)` — Preview columns

```python
info = client.preview("snowflake", table="ORDERS", secrets={
    "account": "abc123.us-east-1",
    "user": "analyst",
    "password": "...",
    "database": "PROD",
    "schema": "PUBLIC",
    "warehouse": "COMPUTE_WH",
})
# {"success": True, "columns": ["ID", "AMOUNT", ...], "row_count": 50000}
```

---

### `client.process_from_connector(...)` — Run pipeline from connector

```python
result = client.process_from_connector(
    connector_type="s3",
    table="data/sales.csv",  # object key in bucket
    target_columns=["revenue"],
    secrets={
        "bucket": "my-data-lake",
        "access_key_id": "AKIA...",
        "secret_access_key": "...",
    },
    output_rows=50000,
    wait=True,
    download_path="./outputs/",
)
```

Supports all the same options as `client.process()`: `tools`, `advanced_params`, `wait`, `poll_interval`, `auto_download`, `output_preferences`, `compressed`.
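
For example, a sketch combining options already documented under `client.process()` with a saved credential (`cred_id` from `save_credential()` above):

```python
result = client.process_from_connector(
    connector_type="sql",
    table="orders",
    target_columns=["price"],
    credential_id=cred_id,                  # saved earlier via save_credential()
    tools={"anomaly": True, "dsg": False},  # same toggles as client.process()
    advanced_params={"zscore_limit": 2.5},
    wait=False,                             # poll with get_status() as usual
)
```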
410
+
411
+ ---
412
+
413
+ ### `client.write_output(session_id, ...)` — Write results to a target
414
+
415
+ ```python
416
+ client.write_output(
417
+ session_id="abc123...",
418
+ connector_type="sql",
419
+ table_name="ml_prepared_data",
420
+ secrets={"connection_string": "postgresql://..."},
421
+ output_stage="dsg", # which pipeline output to write
422
+ if_exists="replace", # "replace", "append", or "fail"
423
+ )
424
+ # {"success": True, "rows_written": 20000, "table": "ml_prepared_data"}
425
+ ```
426
+
427

---

### `client.list_credentials()` / `save_credential()` / `delete_credential()`

```python
# List saved credentials (secrets are never exposed)
creds = client.list_credentials()

# Save a new credential
cred = client.save_credential("My S3", "s3", secrets={...})

# Delete a credential
client.delete_credential(cred["id"])
```

---

## Error Handling

All API errors raise `AutoDataError`:

```python
from autodata import AutoDataClient, AutoDataError

with AutoDataClient(api_key="dtpk_...") as client:
    try:
        result = client.process("data.csv", target_columns="price")
    except AutoDataError as e:
        print(f"API error {e.status_code}: {e}")
    except FileNotFoundError as e:
        print(f"File not found: {e}")
    except ValueError as e:
        print(f"Invalid input: {e}")  # e.g. unsupported file format
```

`AutoDataError` attributes:

- `str(e)` — human-readable error message from the server
- `e.status_code` — HTTP status code (e.g. `401`, `429`, `500`), or `None` for non-HTTP errors

Transient errors (429, 502, 503, 504) are automatically retried up to `max_retries` times with exponential back-off.
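
If the defaults don't fit your environment, both knobs are constructor parameters (see the reference table above). A sketch with illustrative values, plus explicit handling for the case where the rate limit persists past the automatic retries:

```python
from autodata import AutoDataClient, AutoDataError

# Longer timeout and more retries for a slow or flaky network.
client = AutoDataClient(api_key="dtpk_...", timeout=300, max_retries=5)
try:
    result = client.process("data.csv", target_columns=["price"])
except AutoDataError as e:
    if e.status_code == 429:
        # Still rate-limited after max_retries automatic attempts.
        print("Rate limited; check get_usage() and retry later")
    else:
        raise
finally:
    client.close()
```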

---

## Advanced Example: Non-blocking with manual polling

```python
import time

from autodata import AutoDataClient, AutoDataError

with AutoDataClient(api_key="dtpk_...") as client:
    # Start job without blocking
    job = client.process(
        "large_dataset.csv",
        target_columns=["churn"],
        wait=False,
    )
    session_id = job["session_id"]
    print(f"Job started: {session_id}")

    # Poll manually
    while True:
        status = client.get_status(session_id)
        print(f"  {status['progress_percent']}% — {status['message']}")
        if status["status"] == "completed":
            break
        elif status["status"] in ("error", "cancelled"):
            raise AutoDataError(f"Job {status['status']}: {status['message']}")
        time.sleep(5)

    # Download results
    path = client.download_results(session_id, download_path="./outputs/")
    print(f"Results saved to {path}")
```

---

## Requirements

- Python >= 3.8
- `requests` >= 2.25.0

## License

MIT