datatoolpack 0.2.1__tar.gz → 0.4.0__tar.gz

@@ -0,0 +1,619 @@
1
+ Metadata-Version: 2.4
2
+ Name: datatoolpack
3
+ Version: 0.4.0
4
+ Summary: Official Python SDK for the AutoData ML data preparation pipeline API
5
+ Home-page: https://autodata.datatoolpack.com
6
+ Author: AutoData Team
7
+ Author-email: support@datatoolpack.com
8
+ Project-URL: Documentation, https://autodata.datatoolpack.com/docs
9
+ Project-URL: Bug Tracker, https://github.com/datatoolpack/autodata-client/issues
10
+ Keywords: autodata machine-learning data-preparation synthetic-data ml-pipeline
11
+ Classifier: Development Status :: 4 - Beta
12
+ Classifier: Intended Audience :: Developers
13
+ Classifier: Intended Audience :: Science/Research
14
+ Classifier: License :: OSI Approved :: MIT License
15
+ Classifier: Programming Language :: Python :: 3
16
+ Classifier: Programming Language :: Python :: 3.8
17
+ Classifier: Programming Language :: Python :: 3.9
18
+ Classifier: Programming Language :: Python :: 3.10
19
+ Classifier: Programming Language :: Python :: 3.11
20
+ Classifier: Programming Language :: Python :: 3.12
21
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
22
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
23
+ Requires-Python: >=3.8
24
+ Description-Content-Type: text/markdown
25
+ Requires-Dist: requests>=2.25.0
26
+ Dynamic: author
27
+ Dynamic: author-email
28
+ Dynamic: classifier
29
+ Dynamic: description
30
+ Dynamic: description-content-type
31
+ Dynamic: home-page
32
+ Dynamic: keywords
33
+ Dynamic: project-url
34
+ Dynamic: requires-dist
35
+ Dynamic: requires-python
36
+ Dynamic: summary
37
+
38
+ # AutoData Python Client
39
+
40
+ Official Python SDK for the [AutoData](https://autodata.datatoolpack.com) ML data preparation pipeline API.
41
+
42
+ ## Installation
43
+
44
+ ```bash
45
+ pip install datatoolpack
46
+ ```
47
+
48
+ Or install from source:
49
+
50
+ ```bash
51
+ git clone https://github.com/datatoolpack/datatoolpack
52
+ cd datatoolpack
53
+ pip install .
54
+ ```
55
+
56
+ ## Quick Start
57
+
58
+ ```python
59
+ from autodata import AutoDataClient
60
+
61
+ # Use as a context manager (recommended)
62
+ with AutoDataClient(
63
+     api_key="dtpk_YOUR_API_KEY",
64
+     base_url="https://autodata.datatoolpack.com",
65
+ ) as client:
66
+     result = client.process(
67
+         file_path="data.csv",
68
+         target_columns=["price"],
69
+         output_rows=20000,
70
+     )
71
+     print(result["files"])
72
+ ```
73
+
74
+ Get your API key from the **API Keys** tab of the [AutoData dashboard](https://autodata.datatoolpack.com/dashboard).
75
+
76
+ ### Authenticate with access code (passcode)
77
+
78
+ If you have an access code instead of an API key, pass it as `passcode` in place of `api_key`:
79
+
80
+ ```python
81
+ with AutoDataClient(
82
+     passcode="123456789012",
83
+     base_url="https://autodata.datatoolpack.com",
84
+ ) as client:
85
+     result = client.process(
86
+         file_path="data.csv",
87
+         target_columns=["price"],
88
+     )
89
+ ```
90
+
91
+ ---
92
+
93
+ ## Supported File Formats
94
+
95
+ | Format | Extensions |
96
+ |----------|--------------------|
97
+ | CSV | `.csv` |
98
+ | Excel | `.xlsx`, `.xls` |
99
+ | Parquet | `.parquet` |
100
+ | JSON | `.json` |
101
+ | Feather | `.feather` |
102
+ | ORC | `.orc` |
103
+
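The table above can be turned into a quick client-side check before an upload is attempted. This is a minimal sketch, not part of the SDK: `is_supported_format` is a hypothetical helper, and matching extensions case-insensitively is an assumption about how the server treats file names.

```python
from pathlib import Path

# Extensions copied from the supported-formats table.
SUPPORTED_EXTENSIONS = {".csv", ".xlsx", ".xls", ".parquet", ".json", ".feather", ".orc"}


def is_supported_format(file_path: str) -> bool:
    """Return True if the file's extension appears in the supported-formats table."""
    # Assumption: extensions are compared case-insensitively.
    return Path(file_path).suffix.lower() in SUPPORTED_EXTENSIONS


for name in ("data.csv", "report.XLSX", "notes.txt"):
    print(name, is_supported_format(name))
```

A check like this only fails earlier than the SDK would; the `ValueError` described under Error Handling remains the authoritative signal for an unsupported format.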
104
+ ---
105
+
106
+ ## Reference
107
+
108
+ ### `AutoDataClient(api_key, base_url, timeout, max_retries)`
109
+
110
+ | Parameter | Type | Default | Description |
111
+ |---------------|-------|--------------------------------------|----------------------------------------------|
112
+ | `api_key` | `str` | required | API key starting with `dtpk_`; not needed when authenticating with `passcode` |
113
+ | `base_url` | `str` | `"https://autodata.datatoolpack.com"` | Server URL (no trailing slash) |
114
+ | `timeout` | `int` | `120` | Request timeout in seconds |
115
+ | `max_retries` | `int` | `3` | Auto-retries for 429 / 502 / 503 / 504 |
116
+
117
+ The client implements the context manager protocol; use it in a `with` block to close connections automatically:
118
+
119
+ ```python
120
+ with AutoDataClient(api_key="dtpk_...") as client:
121
+     ...
122
+ # session is closed automatically
123
+ ```
124
+
125
+ Or close manually:
126
+
127
+ ```python
128
+ client = AutoDataClient(api_key="dtpk_...")
129
+ # ... use client ...
130
+ client.close()
131
+ ```
132
+
133
+ ---
134
+
135
+ ### `client.process(...)` — Upload & run pipeline
136
+
137
+ ```python
138
+ result = client.process(
139
+ file_path="data.csv", # Path to input file (CSV, XLSX, Parquet, …)
140
+ target_columns=["price"], # y-column(s) for ML
141
+ output_rows=20000, # Target row count in output
142
+ tools={ # Toggle pipeline steps (all optional)
143
+ "anomaly": False, # Anomaly detection (off by default)
144
+ "dtc": True, # Data Type Conversion
145
+ "mdh": True, # Missing Data Handler
146
+ "cds": True, # Column Scaling
147
+ "dsm": True, # Data Split Manager
148
+ "dsg": True, # Synthetic Data Generator
149
+ },
150
+ advanced_params={ # Fine-grained parameters (all optional)
151
+ "excluded_columns": ["id"], # Columns to drop before processing
152
+ "text_mode": 0, # 0=none, 1=neural, 2=tfidf
153
+ "text_cleaning": True, # Clean text before encoding
154
+ "zscore_limit": 3.0, # Z-score outlier threshold
155
+ "dsg_mode": "copula", # "copula" or "gan"
156
+ "similarity_p": 95, # Similarity percentile for DSG
157
+ },
158
+ wait=True, # Block until complete (default True)
159
+ poll_interval=2, # Status poll interval in seconds
160
+ download_path="./outputs/", # Where to save files (default auto)
161
+ auto_download=True, # Set False to skip download when wait=True
162
+ output_preferences=["dsg.csv"], # Which files to download (default all)
163
+ compressed=True, # Download as ZIP (default True)
164
+ )
165
+ ```
166
+
167
+ **Returns** a dict:
168
+
169
+ ```python
170
+ {
171
+ "session_id": "abc123...",
172
+ "status": "completed",
173
+ "files": [
174
+ {"name": "dsg.csv", "url": "/download/.../dsg.csv", "size": 2097152, "description": "..."},
175
+ {"name": "dsm_train.csv", ...},
176
+ ...
177
+ ],
178
+ "row_count": 20000,
179
+ "duration_seconds": 42.1,
180
+ }
181
+ ```
182
+
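As an illustration of working with the returned dict, the sketch below maps each output file to its size in MiB. `summarize_files` is a hypothetical helper, and it assumes only the `name` and `size` keys shown in the example above; entries without a `size` are reported as `None` rather than guessed.

```python
def summarize_files(result: dict) -> dict:
    """Map each file name to its size in MiB (None when the size is missing)."""
    summary = {}
    for entry in result.get("files", []):
        size = entry.get("size")
        summary[entry["name"]] = None if size is None else round(size / (1024 * 1024), 2)
    return summary


# Stand-in for a real result; the second entry omits "size" on purpose.
example = {
    "session_id": "abc123",
    "status": "completed",
    "files": [
        {"name": "dsg.csv", "size": 2097152},
        {"name": "dsm_train.csv"},
    ],
}
print(summarize_files(example))  # {'dsg.csv': 2.0, 'dsm_train.csv': None}
```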
183
+ Set `wait=False` to return immediately with just `session_id` and `status`:
184
+
185
+ ```python
186
+ result = client.process(file_path="data.csv", target_columns="price", wait=False)
187
+ session_id = result["session_id"]
188
+ ```
189
+
190
+ ---
191
+
192
+ ### `client.get_status(session_id)` — Poll progress
193
+
194
+ ```python
195
+ status = client.get_status(session_id)
196
+ # {
197
+ # "status": "running", # queued | running | completed | error | cancelled
198
+ # "message": "Running MDH...",
199
+ # "current_step": 3,
200
+ # "total_steps": 6,
201
+ # "progress_percent": 50,
202
+ # "duration_seconds": 15.3,
203
+ # }
204
+ ```
205
+
206
+ ---
207
+
208
+ ### `client.get_result(session_id)` — Fetch completed results
209
+
210
+ ```python
211
+ result = client.get_result(session_id)
212
+ # {"status": "completed", "files": [...], "row_count": ..., "duration_seconds": ...}
213
+ ```
214
+
215
+ ---
216
+
217
+ ### `client.wait_for_completion(session_id, poll_interval)` — Block until done
218
+
219
+ ```python
220
+ result = client.wait_for_completion(session_id, poll_interval=3)
221
+ ```
222
+
223
+ Prints live progress to stdout. Raises `AutoDataError` if processing fails.
224
+
225
+ ---
226
+
227
+ ### `client.cancel(session_id)` — Cancel a running job
228
+
229
+ ```python
230
+ cancelled = client.cancel(session_id) # True if acknowledged
231
+ ```
232
+
233
+ ---
234
+
235
+ ### `client.download_results(session_id, ...)` — Download output files
236
+
237
+ ```python
238
+ path = client.download_results(
239
+ session_id,
240
+ download_path="./my_outputs/", # Directory to save into
241
+ output_preferences=["dsg.csv"], # Specific files only (None = all)
242
+ compressed=True, # ZIP download (default) or individual files
243
+ )
244
+ print(f"Saved to {path}")
245
+ ```
246
+
247
+ ---
248
+
249
+ ### `client.download_file(url, output_path)` — Download a single file
250
+
251
+ ```python
252
+ client.download_file("/api/v1/download/abc123.../dsg.csv", "dsg.csv")
253
+ ```
254
+
255
+ ---
256
+
257
+ ### `client.list_keys()` — List API keys
258
+
259
+ ```python
260
+ keys = client.list_keys()
261
+ # [{"id": "...", "name": "My Key", "prefix": "dtpk_abc123", "created_at": "..."}]
262
+ ```
263
+
264
+ ---
265
+
266
+ ### `client.get_usage()` — Usage statistics
267
+
268
+ ```python
269
+ usage = client.get_usage()
270
+ # {
271
+ # "daily_credits_used": 500,
272
+ # "daily_credit_limit": 10000,
273
+ # "daily_remaining": 9500,
274
+ # "lifetime_credits_used": 12340,
275
+ # "lifetime_credit_limit": 1000000,
276
+ # "lifetime_remaining": 987660,
277
+ # "daily_request_count": 3,
278
+ # "last_used_at": "2026-04-12T10:30:00Z",
279
+ # }
280
+ ```
281
+
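The usage fields can be combined into a simple pre-flight check. The following is a sketch under assumptions: `credits_available` and `can_afford` are hypothetical helpers, and the credit cost of a job is not documented here, so `estimated_cost` has to be supplied by the caller.

```python
def credits_available(usage: dict) -> int:
    """Return the smaller of the daily and lifetime remaining credits."""
    return min(usage["daily_remaining"], usage["lifetime_remaining"])


def can_afford(usage: dict, estimated_cost: int) -> bool:
    """True if both the daily and lifetime budgets cover the estimated cost."""
    return credits_available(usage) >= estimated_cost


# Values copied from the example response above.
usage = {
    "daily_credits_used": 500,
    "daily_credit_limit": 10000,
    "daily_remaining": 9500,
    "lifetime_credits_used": 12340,
    "lifetime_credit_limit": 1000000,
    "lifetime_remaining": 987660,
}
print(credits_available(usage))  # 9500
print(can_afford(usage, 20000))  # False
```

In real use the dict would come from `client.get_usage()`.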
282
+ ---
283
+
284
+ ## Connectors — Process data from external sources
285
+
286
+ Instead of uploading a local file, you can read data directly from databases and cloud storage via connectors.
287
+
288
+ **Supported connector types:** `sql`, `snowflake`, `bigquery`, `mongodb`, `s3`, `gcs`, `databricks`, `delta`, `fabric`, `kafka`, `kinesis`
289
+
290
+ ### Quick connector example
291
+
292
+ ```python
293
+ with AutoDataClient(api_key="dtpk_...") as client:
294
+     # Test the connection
295
+     client.test_connector("sql", secrets={
296
+         "connection_string": "postgresql://user:pass@host/db"
297
+     })
298
+
299
+     # List tables
300
+     tables = client.discover("sql", secrets={
301
+         "connection_string": "postgresql://user:pass@host/db"
302
+     })
303
+     print(tables) # ["users", "orders", ...]
304
+
305
+     # Preview columns
306
+     info = client.preview("sql", table="orders", secrets={
307
+         "connection_string": "postgresql://user:pass@host/db"
308
+     })
309
+     print(info["columns"]) # ["id", "price", "date", ...]
310
+
311
+     # Run the full pipeline from the connector
312
+     result = client.process_from_connector(
313
+         connector_type="sql",
314
+         table="orders",
315
+         target_columns=["price"],
316
+         secrets={"connection_string": "postgresql://user:pass@host/db"},
317
+         output_rows=20000,
318
+     )
319
+     print(result["files"])
320
+ ```
321
+
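The example above repeats the connection string inline. One way to keep it out of source code is to read it from the environment and build the `secrets` dict once. This is a sketch, not an SDK feature: `sql_secrets_from_env` and the variable name `AUTODATA_SQL_DSN` are hypothetical.

```python
import os


def sql_secrets_from_env(var_name: str = "AUTODATA_SQL_DSN") -> dict:
    """Build the `secrets` dict for the "sql" connector from an environment variable."""
    dsn = os.environ.get(var_name)
    if not dsn:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return {"connection_string": dsn}


# Demo value only; in practice the variable is set outside the program.
os.environ.setdefault("AUTODATA_SQL_DSN", "postgresql://user:pass@host/db")
secrets = sql_secrets_from_env()
print(sorted(secrets))  # ['connection_string']
```

The resulting dict can then be passed as `secrets=secrets` to `test_connector`, `discover`, `preview` and `process_from_connector`.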
322
+ ### Using saved credentials
323
+
324
+ Save credentials once, then reference them by ID:
325
+
326
+ ```python
327
+ with AutoDataClient(api_key="dtpk_...") as client:
328
+     # Save a credential
329
+     cred = client.save_credential(
330
+         name="Production DB",
331
+         connector_type="sql",
332
+         secrets={"connection_string": "postgresql://..."},
333
+     )
334
+     cred_id = cred["id"]
335
+
336
+     # Use credential_id instead of secrets
337
+     tables = client.discover("sql", credential_id=cred_id)
338
+     result = client.process_from_connector(
339
+         connector_type="sql",
340
+         table="orders",
341
+         target_columns=["price"],
342
+         credential_id=cred_id,
343
+     )
344
+ ```
345
+
346
+ ---
347
+
348
+ ### `client.test_connector(connector_type, secrets, credential_id)` — Test connection
349
+
350
+ ```python
351
+ result = client.test_connector("s3", secrets={
352
+ "bucket": "my-bucket",
353
+ "access_key_id": "AKIA...",
354
+ "secret_access_key": "...",
355
+ "region": "us-east-1",
356
+ })
357
+ # {"success": True, "message": "Connection successful"}
358
+ ```
359
+
360
+ ---
361
+
362
+ ### `client.discover(connector_type, ...)` — List tables / files
363
+
364
+ ```python
365
+ tables = client.discover("bigquery", secrets={
366
+ "credentials_json": '{"type":"service_account",...}',
367
+ "project_id": "my-project",
368
+ "dataset": "analytics",
369
+ })
370
+ # ["events", "users", "transactions"]
371
+ ```
372
+
373
+ ---
374
+
375
+ ### `client.preview(connector_type, table, ...)` — Preview columns
376
+
377
+ ```python
378
+ info = client.preview("snowflake", table="ORDERS", secrets={
379
+ "account": "abc123.us-east-1",
380
+ "user": "analyst",
381
+ "password": "...",
382
+ "database": "PROD",
383
+ "schema": "PUBLIC",
384
+ "warehouse": "COMPUTE_WH",
385
+ })
386
+ # {"success": True, "columns": ["ID", "AMOUNT", ...], "row_count": 50000}
387
+ ```
388
+
389
+ ---
390
+
391
+ ### `client.process_from_connector(...)` — Run pipeline from connector
392
+
393
+ ```python
394
+ result = client.process_from_connector(
395
+ connector_type="s3",
396
+ table="data/sales.csv", # object key in bucket
397
+ target_columns=["revenue"],
398
+ secrets={
399
+ "bucket": "my-data-lake",
400
+ "access_key_id": "AKIA...",
401
+ "secret_access_key": "...",
402
+ },
403
+ output_rows=50000,
404
+ wait=True,
405
+ download_path="./outputs/",
406
+ )
407
+ ```
408
+
409
+ Supports all the same options as `client.process()`: `tools`, `advanced_params`, `wait`, `poll_interval`, `auto_download`, `output_preferences`, `compressed`.
410
+
411
+ ---
412
+
413
+ ### `client.write_output(session_id, ...)` — Write results to a target
414
+
415
+ ```python
416
+ client.write_output(
417
+ session_id="abc123...",
418
+ connector_type="sql",
419
+ table_name="ml_prepared_data",
420
+ secrets={"connection_string": "postgresql://..."},
421
+ output_stage="dsg", # which pipeline output to write
422
+ if_exists="replace", # "replace", "append", or "fail"
423
+ )
424
+ # {"success": True, "rows_written": 20000, "table": "ml_prepared_data"}
425
+ ```
426
+
427
+ ---
428
+
429
+ ### `client.list_credentials()` / `save_credential()` / `delete_credential()`
430
+
431
+ ```python
432
+ # List saved credentials (secrets are never exposed)
433
+ creds = client.list_credentials()
434
+
435
+ # Save a new credential
436
+ cred = client.save_credential("My S3", "s3", secrets={...})
437
+
438
+ # Delete a credential
439
+ client.delete_credential(cred["id"])
440
+ ```
441
+
442
+ ---
443
+
444
+ ## Error Handling
445
+
446
+ All API errors raise `AutoDataError`:
447
+
448
+ ```python
449
+ from autodata import AutoDataClient, AutoDataError
450
+
451
+ with AutoDataClient(api_key="dtpk_...") as client:
452
+     try:
453
+         result = client.process("data.csv", target_columns="price")
454
+     except AutoDataError as e:
455
+         print(f"API error {e.status_code}: {e}")
456
+     except FileNotFoundError as e:
457
+         print(f"File not found: {e}")
458
+     except ValueError as e:
459
+         print(f"Invalid input: {e}") # e.g. unsupported file format
460
+ ```
461
+
462
+ `AutoDataError` attributes:
463
+ - `str(e)` — human-readable error message from the server
464
+ - `e.status_code` — HTTP status code (e.g. `401`, `429`, `500`), or `None` for non-HTTP errors
465
+
466
+ Transient errors (429, 502, 503, 504) are automatically retried up to `max_retries` times with exponential back-off.
467
+
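To make the retry behaviour concrete, the sketch below computes an exponential back-off schedule for the status codes listed above. It illustrates the general technique only: the base delay of 1 second and the factor of 2 are assumptions, not values taken from the SDK, and `backoff_delays` and `should_retry` are hypothetical names.

```python
RETRYABLE_STATUS = {429, 502, 503, 504}  # status codes listed above


def backoff_delays(max_retries: int = 3, base: float = 1.0, factor: float = 2.0) -> list:
    """Delay in seconds before each retry; base and factor are assumed values."""
    return [base * factor ** attempt for attempt in range(max_retries)]


def should_retry(status_code, attempt: int, max_retries: int = 3) -> bool:
    """True when the status is transient and retries remain (attempt counts from 0)."""
    return status_code in RETRYABLE_STATUS and attempt < max_retries


print(backoff_delays())      # [1.0, 2.0, 4.0]
print(should_retry(503, 0))  # True
print(should_retry(401, 0))  # False
```

With the default `max_retries=3`, a schedule like this would add at most 7 seconds of waiting before the error is raised.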
468
+ ---
469
+
470
+ ## Advanced Example: Non-blocking with manual polling
471
+
472
+ ```python
473
+ import time
474
+ from autodata import AutoDataClient, AutoDataError
475
+
476
+ with AutoDataClient(api_key="dtpk_...") as client:
477
+     # Start job without blocking
478
+     job = client.process(
479
+         "large_dataset.csv",
480
+         target_columns=["churn"],
481
+         wait=False,
482
+     )
483
+     session_id = job["session_id"]
484
+     print(f"Job started: {session_id}")
485
+
486
+     # Poll manually
487
+     while True:
488
+         status = client.get_status(session_id)
489
+         print(f" {status['progress_percent']}% — {status['message']}")
490
+         if status["status"] == "completed":
491
+             break
492
+         elif status["status"] in ("error", "cancelled"):
493
+             raise AutoDataError(f"Job {status['status']}: {status['message']}")
494
+         time.sleep(5)
495
+
496
+     # Download results
497
+     path = client.download_results(session_id, download_path="./outputs/")
498
+     print(f"Results saved to {path}")
499
+ ```
500
+
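The manual loop above runs until the job reaches a terminal state, so a stuck job would be polled forever. A variant with an overall timeout could look like the sketch below. `poll_until_done` is a hypothetical helper; it takes the status function as an argument, so it can be exercised here with a fake instead of a live client.

```python
import time
from typing import Callable


def poll_until_done(get_status: Callable[[str], dict], session_id: str,
                    interval: float = 5.0, timeout: float = 600.0) -> dict:
    """Poll get_status until the job reaches a terminal state or the timeout elapses."""
    terminal = {"completed", "error", "cancelled"}  # terminal states from get_status
    deadline = time.monotonic() + timeout
    while True:
        status = get_status(session_id)
        if status["status"] in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {session_id} still {status['status']} after {timeout}s")
        time.sleep(interval)


# Fake status source standing in for client.get_status.
fake = iter([{"status": "queued"}, {"status": "running"}, {"status": "completed"}])
final = poll_until_done(lambda _sid: next(fake), "abc123", interval=0.0, timeout=1.0)
print(final["status"])  # completed
```

With a real client the call would be `poll_until_done(client.get_status, session_id)`; the caller still decides what to do when the final status is `error` or `cancelled`.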
501
+ ---
502
+
503
+ ## Quality Alerts
504
+
505
+ Monitor pipeline metrics and get notified when thresholds are breached:
506
+
507
+ ```python
508
+ # Create a quality alert rule
509
+ rule = client.create_quality_alert(
510
+ name="High row loss",
511
+ metric="row_loss_pct", # row_loss_pct, null_pct, column_drop_count, duration_seconds
512
+ operator=">", # >, <, >=, <=, ==
513
+ threshold=10.0,
514
+ severity="critical", # warning or critical
515
+ stage="mdh", # optional: anomaly, dtc, mdh, cds, dsm, dsg
516
+ )
517
+
518
+ # List rules
519
+ rules = client.list_quality_alerts()
520
+
521
+ # Get fired alert events
522
+ events = client.get_alert_events(session_id="optional-filter")
523
+
524
+ # Delete a rule
525
+ client.delete_quality_alert(rule_id=rule["rule"]["id"])
526
+ ```
527
+
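Rules are evaluated by the service, but the comparison a rule expresses is easy to illustrate locally. The sketch below assumes a rule fires when `observed <operator> threshold` holds; `rule_fires` is a hypothetical helper, and the rule dict mirrors only the arguments shown above.

```python
import operator

# The five operators listed in the example above.
COMPARATORS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
}


def rule_fires(rule: dict, observed: float) -> bool:
    """Return True if the observed metric value breaches the rule's threshold."""
    return COMPARATORS[rule["operator"]](observed, rule["threshold"])


rule = {"metric": "row_loss_pct", "operator": ">", "threshold": 10.0}
print(rule_fires(rule, 12.5))  # True
print(rule_fires(rule, 10.0))  # False
```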
528
+ ## Sync Watermarks
529
+
530
+ Track incremental sync progress for connector-based pipelines:
531
+
532
+ ```python
533
+ # List all watermarks
534
+ watermarks = client.list_watermarks()
535
+
536
+ # Reset a watermark (re-sync from beginning)
537
+ client.reset_watermark(watermark_id="wm-123")
538
+ ```
539
+
540
+ ## Scheduled Runs
541
+
542
+ Automate recurring pipeline executions:
543
+
544
+ ```python
545
+ # Create an interval-based schedule (every 24 hours)
546
+ schedule = client.create_scheduled_run(
547
+ name="Daily ETL",
548
+ schedule_type="interval",
549
+ interval_minutes=1440,
550
+ connector_type="snowflake",
551
+ table="orders",
552
+ credential_id="cred-id",
553
+ y_columns=["target"],
554
+ )
555
+
556
+ # Create with cron expression
557
+ schedule = client.create_scheduled_run(
558
+ name="Weekday ETL",
559
+ schedule_type="cron",
560
+ cron_expression="0 8 * * 1-5",
561
+ connector_type="snowflake",
562
+ table="orders",
563
+ credential_id="cred-id",
564
+ y_columns=["target"],
565
+ )
566
+
567
+ # List and delete
568
+ schedules = client.list_scheduled_runs()
569
+ client.delete_scheduled_run(run_id="run-id")
570
+ ```
571
+
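For readers less familiar with the two schedule formats, the sketch below derives `interval_minutes` from hours and labels the fields of the cron expression used above. It assumes the service follows the standard five-field cron layout (minute, hour, day of month, month, day of week); the time zone it applies is not stated here. `split_cron` is a hypothetical helper.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron(expression: str) -> dict:
    """Label the fields of a standard five-field cron expression."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"Expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


interval_minutes = 24 * 60  # every 24 hours
print(interval_minutes)  # 1440

weekday_morning = split_cron("0 8 * * 1-5")
print(weekday_morning)
```

Read field by field, `0 8 * * 1-5` means minute 0 of hour 8, on any day of any month, Monday through Friday.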
572
+ ## Folder Listeners
573
+
574
+ Automatically trigger pipelines when new files appear:
575
+
576
+ ```python
577
+ # Create an S3 folder listener
578
+ listener = client.create_listener(
579
+ name="S3 Ingest",
580
+ source_type="s3", # s3, gcs, azure_blob, local, sftp
581
+ folder_path="s3://bucket/incoming/",
582
+ credential_id="cred-id",
583
+ y_columns=["target"],
584
+ pipeline_config={"enable_dsg": False},
585
+ )
586
+
587
+ # List and delete
588
+ listeners = client.list_listeners()
589
+ client.delete_listener(listener_id="listener-id")
590
+ ```
591
+
592
+ ## Pipeline Retry
593
+
594
+ Retry a failed pipeline from its last checkpoint:
595
+
596
+ ```python
597
+ result = client.retry_session(session_id="failed-session-id")
598
+ print(result) # {'success': True, 'session_id': '...', 'message': 'Retrying from last checkpoint'}
599
+ ```
600
+
601
+ ## Worker Status
602
+
603
+ Check the status of the processing worker fleet:
604
+
605
+ ```python
606
+ status = client.worker_status()
607
+ print(status) # {'backend': 'local', 'active_jobs': 2, 'queue_size': 0, ...}
608
+ ```
609
+
610
+ ---
611
+
612
+ ## Requirements
613
+
614
+ - Python >= 3.8
615
+ - `requests` >= 2.25.0
616
+
617
+ ## License
618
+
619
+ MIT