PyPI - datatoolpack - Versions diffs - 0.2.1__tar.gz → 0.4.0__tar.gz - Mend

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (17)

datatoolpack-0.4.0/PKG-INFO +619 -0
datatoolpack-0.4.0/README.md +582 -0
datatoolpack-0.4.0/autodata/__init__.py +4 -0
datatoolpack-0.4.0/autodata/client.py +1073 -0
datatoolpack-0.4.0/datatoolpack.egg-info/PKG-INFO +619 -0
{datatoolpack-0.2.1 → datatoolpack-0.4.0}/setup.py +1 -1
datatoolpack-0.2.1/PKG-INFO +0 -299
datatoolpack-0.2.1/README.md +0 -262
datatoolpack-0.2.1/autodata/__init__.py +0 -3
datatoolpack-0.2.1/autodata/client.py +0 -345
datatoolpack-0.2.1/datatoolpack.egg-info/PKG-INFO +0 -299
{datatoolpack-0.2.1 → datatoolpack-0.4.0}/datatoolpack.egg-info/SOURCES.txt +0 -0
{datatoolpack-0.2.1 → datatoolpack-0.4.0}/datatoolpack.egg-info/dependency_links.txt +0 -0
{datatoolpack-0.2.1 → datatoolpack-0.4.0}/datatoolpack.egg-info/requires.txt +0 -0
{datatoolpack-0.2.1 → datatoolpack-0.4.0}/datatoolpack.egg-info/top_level.txt +0 -0
{datatoolpack-0.2.1 → datatoolpack-0.4.0}/pyproject.toml +0 -0
{datatoolpack-0.2.1 → datatoolpack-0.4.0}/setup.cfg +0 -0

datatoolpack-0.4.0/PKG-INFO ADDED

@@ -0,0 +1,619 @@
+Metadata-Version: 2.4
+Name: datatoolpack
+Version: 0.4.0
+Summary: Official Python SDK for the AutoData ML data preparation pipeline API
+Home-page: https://autodata.datatoolpack.com
+Author: AutoData Team
+Author-email: support@datatoolpack.com
+Project-URL: Documentation, https://autodata.datatoolpack.com/docs
+Project-URL: Bug Tracker, https://github.com/datatoolpack/autodata-client/issues
+Keywords: autodata machine-learning data-preparation synthetic-data ml-pipeline
+Classifier: Development Status :: 4 - Beta
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.8
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Requires-Python: >=3.8
+Description-Content-Type: text/markdown
+Requires-Dist: requests>=2.25.0
+Dynamic: author
+Dynamic: author-email
+Dynamic: classifier
+Dynamic: description
+Dynamic: description-content-type
+Dynamic: home-page
+Dynamic: keywords
+Dynamic: project-url
+Dynamic: requires-dist
+Dynamic: requires-python
+Dynamic: summary
+
+# AutoData Python Client
+Official Python SDK for the [AutoData](https://autodata.datatoolpack.com) ML data preparation pipeline API.
+## Installation
+```bash
+pip install datatoolpack
+```
+Or install from source:
+```bash
+git clone https://github.com/datatoolpack/datatoolpack
+cd datatoolpack
+pip install .
+```
+## Quick Start
+```python
+from autodata import AutoDataClient
+# Use as a context manager (recommended)
+with AutoDataClient(
+    api_key="dtpk_YOUR_API_KEY",
+    base_url="https://autodata.datatoolpack.com",
+) as client:
+    result = client.process(
+        file_path="data.csv",
+        target_columns=["price"],
+        output_rows=20000,
+    )
+    print(result["files"])
+```
+Get your API key from the API Keys tab of the [AutoData dashboard](https://autodata.datatoolpack.com/dashboard).
+### Authenticate with access code (passcode)
+If you have an access code instead of an API key, you can use it directly:
+```python
+with AutoDataClient(
+    passcode="123456789012",
+    base_url="https://autodata.datatoolpack.com",
+) as client:
+    result = client.process(
+        file_path="data.csv",
+        target_columns=["price"],
+    )
+```
+---
+## Supported File Formats
+| Format   | Extensions         |
+|----------|--------------------|
+| CSV      | `.csv`             |
+| Excel    | `.xlsx`, `.xls`    |
+| Parquet  | `.parquet`         |
+| JSON     | `.json`            |
+| Feather  | `.feather`         |
+| ORC      | `.orc`             |
+---
+## Reference
+### `AutoDataClient(api_key, base_url, timeout, max_retries)`
+| Parameter     | Type  | Default                              | Description                                  |
+|---------------|-------|--------------------------------------|----------------------------------------------|
+| `api_key`     | `str` | required unless `passcode` is given  | API key starting with `dtpk_`                |
+| `base_url`    | `str` | `"https://autodata.datatoolpack.com"` | Server URL (no trailing slash)              |
+| `timeout`     | `int` | `120`                                | Request timeout in seconds                   |
+| `max_retries` | `int` | `3`                                  | Auto-retries for 429 / 502 / 503 / 504      |
+
+`AutoDataClient` implements the context manager protocol; use it in a `with` statement to close connections automatically:
+```python
+with AutoDataClient(api_key="dtpk_...") as client:
+    ...
+# session is closed automatically
+```
+Or close manually:
+```python
+client = AutoDataClient(api_key="dtpk_...")
+# ... use client ...
+client.close()
+```
+---
+### `client.process(...)` — Upload & run pipeline
+```python
+result = client.process(
+    file_path="data.csv",           # Path to input file (CSV, XLSX, Parquet, …)
+    target_columns=["price"],       # y-column(s) for ML
+    output_rows=20000,              # Target row count in output
+    tools={                         # Toggle pipeline steps (all optional)
+        "anomaly": False,           # Anomaly detection (off by default)
+        "dtc": True,                # Data Type Conversion
+        "mdh": True,                # Missing Data Handler
+        "cds": True,                # Column Scaling
+        "dsm": True,                # Data Split Manager
+        "dsg": True,                # Synthetic Data Generator
+    },
+    advanced_params={               # Fine-grained parameters (all optional)
+        "excluded_columns": ["id"], # Columns to drop before processing
+        "text_mode": 0,             # 0=none, 1=neural, 2=tfidf
+        "text_cleaning": True,      # Clean text before encoding
+        "zscore_limit": 3.0,        # Z-score outlier threshold
+        "dsg_mode": "copula",       # "copula" or "gan"
+        "similarity_p": 95,         # Similarity percentile for DSG
+    },
+    wait=True,                      # Block until complete (default True)
+    poll_interval=2,                # Status poll interval in seconds
+    download_path="./outputs/",     # Where to save files (default auto)
+    auto_download=True,             # Set False to skip download when wait=True
+    output_preferences=["dsg.csv"], # Which files to download (default all)
+    compressed=True,                # Download as ZIP (default True)
+)
+```
+**Returns** a dict:
+```python
+{
+    "session_id": "abc123...",
+    "status": "completed",
+    "files": [
+        {"name": "dsg.csv", "url": "/download/.../dsg.csv", "size": 2097152, "description": "..."},
+        {"name": "dsm_train.csv", ...},
+        ...
+    ],
+    "row_count": 20000,
+    "duration_seconds": 42.1,
+}
+```
+Set `wait=False` to return immediately with only `session_id` and `status`:
+```python
+result = client.process(file_path="data.csv", target_columns="price", wait=False)
+session_id = result["session_id"]
+```
+---
+### `client.get_status(session_id)` — Poll progress
+```python
+status = client.get_status(session_id)
+# {
+#   "status": "running",           # queued | running | completed | error | cancelled
+#   "message": "Running MDH...",
+#   "current_step": 3,
+#   "total_steps": 6,
+#   "progress_percent": 50,
+#   "duration_seconds": 15.3,
+# }
+```
+---
+### `client.get_result(session_id)` — Fetch completed results
+```python
+result = client.get_result(session_id)
+# {"status": "completed", "files": [...], "row_count": ..., "duration_seconds": ...}
+```
+---
+### `client.wait_for_completion(session_id, poll_interval)` — Block until done
+```python
+result = client.wait_for_completion(session_id, poll_interval=3)
+```
+Prints live progress to stdout. Raises `AutoDataError` if processing fails.
+
+---
+### `client.cancel(session_id)` — Cancel a running job
+```python
+cancelled = client.cancel(session_id)  # True if acknowledged
+```
+---
+### `client.download_results(session_id, ...)` — Download output files
+```python
+path = client.download_results(
+    session_id,
+    download_path="./my_outputs/",      # Directory to save into
+    output_preferences=["dsg.csv"],     # Specific files only (None = all)
+    compressed=True,                    # ZIP download (default) or individual files
+)
+print(f"Saved to {path}")
+```
+---
+### `client.download_file(url, output_path)` — Download a single file
+```python
+client.download_file("/api/v1/download/abc123.../dsg.csv", "dsg.csv")
+```
+---
+### `client.list_keys()` — List API keys
+```python
+keys = client.list_keys()
+# [{"id": "...", "name": "My Key", "prefix": "dtpk_abc123", "created_at": "..."}]
+```
+---
+### `client.get_usage()` — Usage statistics
+```python
+usage = client.get_usage()
+# {
+#   "daily_credits_used": 500,
+#   "daily_credit_limit": 10000,
+#   "daily_remaining": 9500,
+#   "lifetime_credits_used": 12340,
+#   "lifetime_credit_limit": 1000000,
+#   "lifetime_remaining": 987660,
+#   "daily_request_count": 3,
+#   "last_used_at": "2026-04-12T10:30:00Z",
+# }
+```
+---
+## Connectors — Process data from external sources
+Instead of uploading a local file, you can read data directly from databases and cloud storage via connectors.
+**Supported connector types:** `sql`, `snowflake`, `bigquery`, `mongodb`, `s3`, `gcs`, `databricks`, `delta`, `fabric`, `kafka`, `kinesis`
+### Quick connector example
+```python
+with AutoDataClient(api_key="dtpk_...") as client:
+    # Test the connection
+    client.test_connector("sql", secrets={
+        "connection_string": "postgresql://user:pass@host/db"
+    })
+    # List tables
+    tables = client.discover("sql", secrets={
+        "connection_string": "postgresql://user:pass@host/db"
+    })
+    print(tables)  # ["users", "orders", ...]
+    # Preview columns
+    info = client.preview("sql", table="orders", secrets={
+        "connection_string": "postgresql://user:pass@host/db"
+    })
+    print(info["columns"])  # ["id", "price", "date", ...]
+    # Run the full pipeline from the connector
+    result = client.process_from_connector(
+        connector_type="sql",
+        table="orders",
+        target_columns=["price"],
+        secrets={"connection_string": "postgresql://user:pass@host/db"},
+        output_rows=20000,
+    )
+    print(result["files"])
+```
+### Using saved credentials
+Save credentials once, then reference them by ID:
+```python
+with AutoDataClient(api_key="dtpk_...") as client:
+    # Save a credential
+    cred = client.save_credential(
+        name="Production DB",
+        connector_type="sql",
+        secrets={"connection_string": "postgresql://..."},
+    )
+    cred_id = cred["id"]
+    # Use credential_id instead of secrets
+    tables = client.discover("sql", credential_id=cred_id)
+    result = client.process_from_connector(
+        connector_type="sql",
+        table="orders",
+        target_columns=["price"],
+        credential_id=cred_id,
+    )
+```
+---
+### `client.test_connector(connector_type, secrets, credential_id)` — Test connection
+```python
+result = client.test_connector("s3", secrets={
+    "bucket": "my-bucket",
+    "access_key_id": "AKIA...",
+    "secret_access_key": "...",
+    "region": "us-east-1",
+})
+# {"success": True, "message": "Connection successful"}
+```
+---
+### `client.discover(connector_type, ...)` — List tables / files
+```python
+tables = client.discover("bigquery", secrets={
+    "credentials_json": '{"type":"service_account",...}',
+    "project_id": "my-project",
+    "dataset": "analytics",
+})
+# ["events", "users", "transactions"]
+```
+---
+### `client.preview(connector_type, table, ...)` — Preview columns
+```python
+info = client.preview("snowflake", table="ORDERS", secrets={
+    "account": "abc123.us-east-1",
+    "user": "analyst",
+    "password": "...",
+    "database": "PROD",
+    "schema": "PUBLIC",
+    "warehouse": "COMPUTE_WH",
+})
+# {"success": True, "columns": ["ID", "AMOUNT", ...], "row_count": 50000}
+```
+---
+### `client.process_from_connector(...)` — Run pipeline from connector
+```python
+result = client.process_from_connector(
+    connector_type="s3",
+    table="data/sales.csv",              # object key in bucket
+    target_columns=["revenue"],
+    secrets={
+        "bucket": "my-data-lake",
+        "access_key_id": "AKIA...",
+        "secret_access_key": "...",
+    },
+    output_rows=50000,
+    wait=True,
+    download_path="./outputs/",
+)
+```
+Supports all the same options as `client.process()`: `tools`, `advanced_params`, `wait`, `poll_interval`, `auto_download`, `output_preferences`, `compressed`.
+
+---
+### `client.write_output(session_id, ...)` — Write results to a target
+```python
+client.write_output(
+    session_id="abc123...",
+    connector_type="sql",
+    table_name="ml_prepared_data",
+    secrets={"connection_string": "postgresql://..."},
+    output_stage="dsg",        # which pipeline output to write
+    if_exists="replace",       # "replace", "append", or "fail"
+)
+# {"success": True, "rows_written": 20000, "table": "ml_prepared_data"}
+```
+---
+### `client.list_credentials()` / `save_credential()` / `delete_credential()`
+```python
+# List saved credentials (secrets are never exposed)
+creds = client.list_credentials()
+# Save a new credential
+cred = client.save_credential("My S3", "s3", secrets={...})
+# Delete a credential
+client.delete_credential(cred["id"])
+```
+---
+## Error Handling
+All API errors raise `AutoDataError`:
+```python
+from autodata import AutoDataClient, AutoDataError
+with AutoDataClient(api_key="dtpk_...") as client:
+    try:
+        result = client.process("data.csv", target_columns="price")
+    except AutoDataError as e:
+        print(f"API error {e.status_code}: {e}")
+    except FileNotFoundError as e:
+        print(f"File not found: {e}")
+    except ValueError as e:
+        print(f"Invalid input: {e}")  # e.g. unsupported file format
+```
+`AutoDataError` attributes:
+- `str(e)` — human-readable error message from the server
+- `e.status_code` — HTTP status code (e.g. `401`, `429`, `500`), or `None` for non-HTTP errors
+
+Transient errors (429, 502, 503, 504) are automatically retried up to `max_retries` times with exponential back-off.
+
+---
+## Advanced Example: Non-blocking with manual polling
+```python
+import time
+from autodata import AutoDataClient, AutoDataError
+with AutoDataClient(api_key="dtpk_...") as client:
+    # Start job without blocking
+    job = client.process(
+        "large_dataset.csv",
+        target_columns=["churn"],
+        wait=False,
+    )
+    session_id = job["session_id"]
+    print(f"Job started: {session_id}")
+    # Poll manually
+    while True:
+        status = client.get_status(session_id)
+        print(f"  {status['progress_percent']}% — {status['message']}")
+        if status["status"] == "completed":
+            break
+        elif status["status"] in ("error", "cancelled"):
+            raise AutoDataError(f"Job {status['status']}: {status['message']}")
+        time.sleep(5)
+    # Download results
+    path = client.download_results(session_id, download_path="./outputs/")
+    print(f"Results saved to {path}")
+```
+---
+## Quality Alerts
+Monitor pipeline metrics and get notified when thresholds are breached:
+```python
+# Create a quality alert rule
+rule = client.create_quality_alert(
+    name="High row loss",
+    metric="row_loss_pct",    # row_loss_pct, null_pct, column_drop_count, duration_seconds
+    operator=">",             # >, <, >=, <=, ==
+    threshold=10.0,
+    severity="critical",      # warning or critical
+    stage="mdh",              # optional: anomaly, dtc, mdh, cds, dsm, dsg
+)
+# List rules
+rules = client.list_quality_alerts()
+# Get fired alert events
+events = client.get_alert_events(session_id="optional-filter")
+# Delete a rule
+client.delete_quality_alert(rule_id=rule["rule"]["id"])
+```
+## Sync Watermarks
+Track incremental sync progress for connector-based pipelines:
+```python
+# List all watermarks
+watermarks = client.list_watermarks()
+# Reset a watermark (re-sync from beginning)
+client.reset_watermark(watermark_id="wm-123")
+```
+## Scheduled Runs
+Automate recurring pipeline executions:
+```python
+# Create an interval-based schedule (every 24 hours)
+schedule = client.create_scheduled_run(
+    name="Daily ETL",
+    schedule_type="interval",
+    interval_minutes=1440,
+    connector_type="snowflake",
+    table="orders",
+    credential_id="cred-id",
+    y_columns=["target"],
+)
+# Create with cron expression
+schedule = client.create_scheduled_run(
+    name="Weekday ETL",
+    schedule_type="cron",
+    cron_expression="0 8 * * 1-5",
+    connector_type="snowflake",
+    table="orders",
+    credential_id="cred-id",
+    y_columns=["target"],
+)
+# List and delete
+schedules = client.list_scheduled_runs()
+client.delete_scheduled_run(run_id="run-id")
+```
+## Folder Listeners
+Automatically trigger pipelines when new files appear:
+```python
+# Create an S3 folder listener
+listener = client.create_listener(
+    name="S3 Ingest",
+    source_type="s3",            # s3, gcs, azure_blob, local, sftp
+    folder_path="s3://bucket/incoming/",
+    credential_id="cred-id",
+    y_columns=["target"],
+    pipeline_config={"enable_dsg": False},
+)
+# List and delete
+listeners = client.list_listeners()
+client.delete_listener(listener_id="listener-id")
+```
+## Pipeline Retry
+Retry a failed pipeline from its last checkpoint:
+```python
+result = client.retry_session(session_id="failed-session-id")
+print(result)  # {'success': True, 'session_id': '...', 'message': 'Retrying from last checkpoint'}
+```
+## Worker Status
+Check the processing worker fleet status:
+```python
+status = client.worker_status()
+print(status)  # {'backend': 'local', 'active_jobs': 2, 'queue_size': 0, ...}
+```
+---
+## Requirements
+- Python >= 3.8
+- `requests` >= 2.25.0
+## License
+MIT
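
Reviewer note on the retry behaviour described in the README's Error Handling section (429, 502, 503, 504 retried up to `max_retries` times with exponential back-off): the README does not state the delay schedule, and `client.py` is not shown in this section, so the following is only a minimal sketch of what such a policy could look like. The names `HTTPStatusError` and `call_with_backoff`, the 1-second base delay, and the doubling factor are all assumptions made for illustration, not the SDK's actual API or values.

```python
import time

# Status codes the README lists as transient.
RETRYABLE_STATUSES = {429, 502, 503, 504}


class HTTPStatusError(Exception):
    """Hypothetical error carrying an HTTP status code (not the SDK's class)."""

    def __init__(self, status_code, message=""):
        super().__init__(message or f"HTTP {status_code}")
        self.status_code = status_code


def call_with_backoff(func, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on a retryable status, wait base_delay * 2**attempt and retry.

    Makes at most max_retries + 1 calls in total. Non-retryable errors and
    the final failure are re-raised unchanged.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except HTTPStatusError as exc:
            out_of_retries = attempt == max_retries
            if exc.status_code not in RETRYABLE_STATUSES or out_of_retries:
                raise
            # Assumed schedule: the delay doubles after every failed attempt.
            sleep(base_delay * (2 ** attempt))
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real waiting. When reviewing the `client.py` changes in this diff, the actual retry loop can be compared against this shape: which status codes trigger a retry, how many calls are made in total, and how the delay grows.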
