PyPI - sagemaker-ops-cli - Versions diffs - 0.1.1__tar.gz → 0.2.0__tar.gz - Mend

sagemaker-ops-cli 0.1.1__tar.gz → 0.2.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (19)

{sagemaker_ops_cli-0.1.1 → sagemaker_ops_cli-0.2.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: sagemaker-ops-cli
-Version: 0.1.1
+Version: 0.2.0
 Summary: CLI and TUI for submitting and monitoring Amazon SageMaker Processing Jobs and Pipelines.
 Requires-Python: >=3.10
 Description-Content-Type: text/markdown
@@ -54,7 +54,7 @@ pip install git+https://github.com/southpolemonkey/smops.git
 Install from a local wheel:
 ```bash
-pip install dist/sagemaker_ops_cli-0.1.1-py3-none-any.whl
+pip install dist/sagemaker_ops_cli-0.2.0-py3-none-any.whl
 ```
 Install with Homebrew:
@@ -85,6 +85,29 @@ To enable YAML config files:
 pip install -e '.[yaml]'
 ```
+## Defaults
+Set a default AWS region once so you do not need to pass `--region` on every command:
+```bash
+smops config set-region ap-southeast-2
+smops config get-region
+```
+The config file is stored at `~/.config/smops/config.json` by default. You can inspect it with:
+```bash
+smops config show
+smops config path
+```
+Region resolution order is:
+1. `--region`
+2. `SMOPS_DEFAULT_REGION`
+3. `smops config set-region ...`
+4. Region configured on the selected AWS profile
 ## Build The Python Package
 ```bash
@@ -94,8 +117,8 @@ python -m build
 Build artifacts are written to `dist/`:
-- `sagemaker_ops_cli-0.1.1-py3-none-any.whl`
-- `sagemaker_ops_cli-0.1.1.tar.gz`
+- `sagemaker_ops_cli-0.2.0-py3-none-any.whl`
+- `sagemaker_ops_cli-0.2.0.tar.gz`
 ## Submit A Processing Job
@@ -126,6 +149,23 @@ smops pipeline start \
   --parameter Mode=prod
 ```
+## Interactive TUI
+Open the TUI selector and choose between Pipelines and Processing Jobs:
+```bash
+smops tui --profile dev
+```
+Inside the TUI:
+- `p` / `P`: switch to the next AWS profile from your local AWS config
+- `s`: start a pipeline or submit a processing job from the current TUI
+- `r`: refresh
+- `q`: quit
+For pipeline starts, enter the pipeline name, an optional display name, and optional comma-separated parameters such as `InputDate=2026-07-01,Mode=test`. For processing job submissions, enter the path to a JSON/YAML config file that uses the same request structure as boto3's `create_processing_job`.
 ## Processing Jobs TUI
 ```bash
@@ -147,6 +187,8 @@ smops tui processing --all-profiles
 Keyboard shortcuts:
 - `Up` / `Down` or `Left` / `Right`: switch jobs
+- `p` / `P`: switch to the next AWS profile
+- `s`: submit a Processing Job from a JSON/YAML config file
 - `r`: refresh
 - `q`: quit
@@ -172,6 +214,8 @@ Keyboard shortcuts:
 - `Left` / `Right`: switch focus between the executions and steps panels
 - `Up` / `Down`: move within the focused panel
+- `p` / `P`: switch to the next AWS profile
+- `s`: start a Pipeline execution
 - `l`: load the CloudWatch log tail for the selected failed step
 - `r`: refresh
 - `q`: quit
@@ -186,11 +230,32 @@ Log discovery is currently supported for these step job types:
 ```bash
 smops processing list --profile dev --region us-east-1
+smops processing wait --profile dev --region us-east-1 --name my-processing-job
 smops pipeline list --profile dev --region us-east-1
 smops pipeline list --profile dev --region us-east-1 --name my-pipeline --hours 6
 smops pipeline steps --profile dev --region us-east-1 --execution-arn arn:aws:sagemaker:...
+smops pipeline wait --profile dev --region us-east-1 --execution-arn arn:aws:sagemaker:...
+smops pipeline inspect --profile dev --region us-east-1 --execution-arn arn:aws:sagemaker:...
+smops pipeline diagnose --profile dev --region us-east-1 --execution-arn arn:aws:sagemaker:...
 ```
+Most non-interactive commands support `--json` for agents and automation:
+```bash
+smops processing list --profile dev --region us-east-1 --json
+smops processing wait --profile dev --region us-east-1 --name my-processing-job --json
+smops pipeline start --profile dev --region us-east-1 --name my-pipeline --json
+smops pipeline list --profile dev --region us-east-1 --json
+smops pipeline steps --profile dev --region us-east-1 --execution-arn arn:aws:sagemaker:... --json
+smops pipeline wait --profile dev --region us-east-1 --execution-arn arn:aws:sagemaker:... --json
+smops pipeline inspect --profile dev --region us-east-1 --execution-arn arn:aws:sagemaker:... --json
+smops pipeline diagnose --profile dev --region us-east-1 --execution-arn arn:aws:sagemaker:... --json
+```
+JSON responses use a stable envelope. Successful commands return `status: "ok"`; errors return `status: "error"` and a user-facing `error` message. List commands return `items`, `count`, and `next_token`.
+`pipeline inspect` returns execution details, all steps, and failed steps. `pipeline diagnose` extends that with the first failed step, inferred SageMaker job type/name, CloudWatch log group and stream prefix, log tail, and suggested next actions.
 `processing list` reads 20 running jobs per page by default. If the output includes `Next token`, pass it to fetch the next page:
 ```bash

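The `## Defaults` section added in this release (in both PKG-INFO and README.md) lists a four-step region resolution order. The sketch below illustrates that precedence under stated assumptions; it is not smops's actual implementation. `resolve_region` and `load_config_region` are hypothetical names, and the `"region"` key inside `config.json` is an assumption, since the diff does not show the file's schema.

```python
import json
from pathlib import Path
from typing import Mapping, Optional

# Default location quoted in the README; the file's schema is assumed.
DEFAULT_CONFIG_PATH = Path.home() / ".config" / "smops" / "config.json"


def load_config_region(path: Path = DEFAULT_CONFIG_PATH) -> Optional[str]:
    """Read a region from the config file (the "region" key is an assumption)."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing or unreadable file, or invalid JSON: treat as "not set".
        return None
    value = data.get("region") if isinstance(data, dict) else None
    return value if isinstance(value, str) and value else None


def resolve_region(
    cli_region: Optional[str],
    env: Mapping[str, str],
    config_region: Optional[str],
    profile_region: Optional[str],
) -> Optional[str]:
    """Return the first non-empty region, following the documented order."""
    for candidate in (
        cli_region,                       # 1. --region
        env.get("SMOPS_DEFAULT_REGION"),  # 2. environment variable
        config_region,                    # 3. smops config set-region ...
        profile_region,                   # 4. region on the selected AWS profile
    ):
        if candidate:
            return candidate
    return None
```

In a real caller, the fourth argument could come from `boto3.Session(profile_name="dev").region_name`, which returns `None` when the profile sets no region.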
{sagemaker_ops_cli-0.1.1 → sagemaker_ops_cli-0.2.0}/README.md RENAMED

@@ -34,7 +34,7 @@ pip install git+https://github.com/southpolemonkey/smops.git
 Install from a local wheel:
 ```bash
-pip install dist/sagemaker_ops_cli-0.1.1-py3-none-any.whl
+pip install dist/sagemaker_ops_cli-0.2.0-py3-none-any.whl
 ```
 Install with Homebrew:
@@ -65,6 +65,29 @@ To enable YAML config files:
 pip install -e '.[yaml]'
 ```
+## Defaults
+Set a default AWS region once so you do not need to pass `--region` on every command:
+```bash
+smops config set-region ap-southeast-2
+smops config get-region
+```
+The config file is stored at `~/.config/smops/config.json` by default. You can inspect it with:
+```bash
+smops config show
+smops config path
+```
+Region resolution order is:
+1. `--region`
+2. `SMOPS_DEFAULT_REGION`
+3. `smops config set-region ...`
+4. Region configured on the selected AWS profile
 ## Build The Python Package
 ```bash
@@ -74,8 +97,8 @@ python -m build
 Build artifacts are written to `dist/`:
-- `sagemaker_ops_cli-0.1.1-py3-none-any.whl`
-- `sagemaker_ops_cli-0.1.1.tar.gz`
+- `sagemaker_ops_cli-0.2.0-py3-none-any.whl`
+- `sagemaker_ops_cli-0.2.0.tar.gz`
 ## Submit A Processing Job
@@ -106,6 +129,23 @@ smops pipeline start \
   --parameter Mode=prod
 ```
+## Interactive TUI
+Open the TUI selector and choose between Pipelines and Processing Jobs:
+```bash
+smops tui --profile dev
+```
+Inside the TUI:
+- `p` / `P`: switch to the next AWS profile from your local AWS config
+- `s`: start a pipeline or submit a processing job from the current TUI
+- `r`: refresh
+- `q`: quit
+For pipeline starts, enter the pipeline name, an optional display name, and optional comma-separated parameters such as `InputDate=2026-07-01,Mode=test`. For processing job submissions, enter the path to a JSON/YAML config file that uses the same request structure as boto3's `create_processing_job`.
 ## Processing Jobs TUI
 ```bash
@@ -127,6 +167,8 @@ smops tui processing --all-profiles
 Keyboard shortcuts:
 - `Up` / `Down` or `Left` / `Right`: switch jobs
+- `p` / `P`: switch to the next AWS profile
+- `s`: submit a Processing Job from a JSON/YAML config file
 - `r`: refresh
 - `q`: quit
@@ -152,6 +194,8 @@ Keyboard shortcuts:
 - `Left` / `Right`: switch focus between the executions and steps panels
 - `Up` / `Down`: move within the focused panel
+- `p` / `P`: switch to the next AWS profile
+- `s`: start a Pipeline execution
 - `l`: load the CloudWatch log tail for the selected failed step
 - `r`: refresh
 - `q`: quit
@@ -166,11 +210,32 @@ Log discovery is currently supported for these step job types:
 ```bash
 smops processing list --profile dev --region us-east-1
+smops processing wait --profile dev --region us-east-1 --name my-processing-job
 smops pipeline list --profile dev --region us-east-1
 smops pipeline list --profile dev --region us-east-1 --name my-pipeline --hours 6
 smops pipeline steps --profile dev --region us-east-1 --execution-arn arn:aws:sagemaker:...
+smops pipeline wait --profile dev --region us-east-1 --execution-arn arn:aws:sagemaker:...
+smops pipeline inspect --profile dev --region us-east-1 --execution-arn arn:aws:sagemaker:...
+smops pipeline diagnose --profile dev --region us-east-1 --execution-arn arn:aws:sagemaker:...
 ```
+Most non-interactive commands support `--json` for agents and automation:
+```bash
+smops processing list --profile dev --region us-east-1 --json
+smops processing wait --profile dev --region us-east-1 --name my-processing-job --json
+smops pipeline start --profile dev --region us-east-1 --name my-pipeline --json
+smops pipeline list --profile dev --region us-east-1 --json
+smops pipeline steps --profile dev --region us-east-1 --execution-arn arn:aws:sagemaker:... --json
+smops pipeline wait --profile dev --region us-east-1 --execution-arn arn:aws:sagemaker:... --json
+smops pipeline inspect --profile dev --region us-east-1 --execution-arn arn:aws:sagemaker:... --json
+smops pipeline diagnose --profile dev --region us-east-1 --execution-arn arn:aws:sagemaker:... --json
+```
+JSON responses use a stable envelope. Successful commands return `status: "ok"`; errors return `status: "error"` and a user-facing `error` message. List commands return `items`, `count`, and `next_token`.
+`pipeline inspect` returns execution details, all steps, and failed steps. `pipeline diagnose` extends that with the first failed step, inferred SageMaker job type/name, CloudWatch log group and stream prefix, log tail, and suggested next actions.
 `processing list` reads 20 running jobs per page by default. If the output includes `Next token`, pass it to fetch the next page:
 ```bash

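The README now documents a stable `--json` envelope: `status` is `"ok"` or `"error"`, errors carry a user-facing `error` message, and list commands add `items`, `count`, and `next_token`. A minimal sketch of how an automation script might unwrap a list response follows; `unwrap_list_envelope` and `SmopsCommandError` are hypothetical names, and the sample payloads in the checks are made up rather than captured from a real run.

```python
import json
from typing import Any, Optional


class SmopsCommandError(RuntimeError):
    """Raised when an envelope reports status "error" (hypothetical name)."""


def unwrap_list_envelope(raw: str) -> tuple[list[Any], Optional[str]]:
    """Return (items, next_token) from a list command's --json output."""
    payload = json.loads(raw)
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object envelope")
    status = payload.get("status")
    if status == "error":
        # Per the README, failures carry a user-facing "error" message.
        raise SmopsCommandError(payload.get("error") or "unknown smops error")
    if status != "ok":
        raise ValueError(f"unexpected envelope status: {status!r}")
    # List commands document "items", "count", and "next_token".
    return list(payload.get("items") or []), payload.get("next_token")
```

A caller would pass it `subprocess.run([...], capture_output=True, text=True).stdout` and, while `next_token` is not null, request the next page with the option shown in the README's pagination example (that example is cut off in this diff excerpt).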
{sagemaker_ops_cli-0.1.1 → sagemaker_ops_cli-0.2.0}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "sagemaker-ops-cli"
-version = "0.1.1"
+version = "0.2.0"
 description = "CLI and TUI for submitting and monitoring Amazon SageMaker Processing Jobs and Pipelines."
 readme = "README.md"
 requires-python = ">=3.10"

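The README's new TUI section has users type pipeline parameters as one comma-separated string such as `InputDate=2026-07-01,Mode=test`. Below is a minimal sketch of turning that string into the `[{"Name": ..., "Value": ...}]` list that boto3's `start_pipeline_execution` accepts as `PipelineParameters`. `split_parameter_string` is a hypothetical name; the package's own `parse_parameters(items: Iterable[str])` in `aws.py` has its body elided in this diff and may behave differently, for example with values that contain commas.

```python
def split_parameter_string(text: str) -> list[dict[str, str]]:
    """Parse "A=1,B=2" into SageMaker-style Name/Value pairs (illustrative only)."""
    parameters: list[dict[str, str]] = []
    for chunk in text.split(","):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate empty input and trailing commas
        # Split on the first "=" only, so values may themselves contain "=".
        name, sep, value = chunk.partition("=")
        if not sep or not name.strip():
            raise ValueError(f"expected NAME=VALUE, got {chunk!r}")
        parameters.append({"Name": name.strip(), "Value": value.strip()})
    return parameters
```

Splitting on commas first means this format cannot carry a literal comma inside a value; that limitation is a property of the sketch, not a documented smops rule.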
{sagemaker_ops_cli-0.1.1 → sagemaker_ops_cli-0.2.0}/sagemaker_ops/__init__.py RENAMED

@@ -1,4 +1,4 @@
 """SageMaker operations CLI."""
-__version__ = "0.1.1"
+__version__ = "0.2.0"

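In the `aws.py` changes, `list_pipeline_executions_page` now delegates to `_list_recent_pipeline_executions`, which lists executions for several pipelines concurrently through a `ThreadPoolExecutor` capped at eight workers, skips the pool entirely for zero or one pipeline, and lets the caller sort the merged list. The generic sketch below mirrors that shape; `fan_out` and the fake fetch functions in the checks are hypothetical. It shows why the sort is needed: `as_completed` yields futures in completion order, not submission order.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, TypeVar

T = TypeVar("T")


def fan_out(
    names: list[str],
    fetch: Callable[[str], list[T]],
    max_workers: int = 8,
) -> list[T]:
    """Call fetch(name) for every name, concurrently when there are several."""
    if not names:
        return []
    if len(names) == 1:
        # No thread pool overhead for the common single-pipeline case.
        return fetch(names[0])
    results: list[T] = []
    with ThreadPoolExecutor(max_workers=min(max_workers, len(names))) as pool:
        futures = [pool.submit(fetch, name) for name in names]
        for future in as_completed(futures):
            # result() re-raises any exception raised in the worker thread.
            results.extend(future.result())
    # Completion order is nondeterministic; callers should sort this list.
    return results
```

Sharing one client across threads is reasonable here because boto3 documents low-level clients as thread-safe (sessions are not).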
{sagemaker_ops_cli-0.1.1 → sagemaker_ops_cli-0.2.0}/sagemaker_ops/aws.py RENAMED

@@ -3,6 +3,8 @@ from __future__ import annotations
 import base64
 import binascii
 import json
+import time
+from concurrent.futures import ThreadPoolExecutor, as_completed
 from dataclasses import dataclass
 from datetime import datetime, timedelta, timezone
 from pathlib import Path
@@ -14,6 +16,8 @@ from botocore.exceptions import BotoCoreError, ClientError
 ACTIVE_PROCESSING_STATUSES = ("InProgress", "Stopping")
 ACTIVE_PIPELINE_STATUSES = ("Executing", "Stopping")
+TERMINAL_PROCESSING_STATUSES = ("Completed", "Failed", "Stopped")
+TERMINAL_PIPELINE_STATUSES = ("Succeeded", "Failed", "Stopped")
 class AwsCliError(RuntimeError):
@@ -102,6 +106,13 @@ def parse_parameters(items: Iterable[str]) -> list[dict[str, str]]:
     return parameters
+def available_profiles() -> list[str]:
+    try:
+        return list(boto3.Session().available_profiles)
+    except BotoCoreError as exc:
+        raise AwsCliError(f"Failed to read AWS profiles: {exc}") from exc
 def build_contexts(
     profiles: tuple[str, ...],
     region: str | None,
@@ -143,6 +154,32 @@ def submit_processing_job(ctx: AwsContext, spec: dict[str, Any]) -> dict[str, An
         raise AwsCliError(f"Failed to submit processing job: {exc}") from exc
+def describe_processing_job(ctx: AwsContext, job_name: str) -> ProcessingJobView:
+    try:
+        detail = ctx.sagemaker.describe_processing_job(ProcessingJobName=job_name)
+    except (BotoCoreError, ClientError) as exc:
+        raise AwsCliError(f"Failed to read processing job job={job_name}: {exc}") from exc
+    return _processing_job_view_from_detail(ctx, detail)
+def wait_processing_job(
+    ctx: AwsContext,
+    job_name: str,
+    timeout_seconds: int = 3600,
+    poll_seconds: int = 30,
+) -> ProcessingJobView:
+    deadline = time.monotonic() + max(0, timeout_seconds)
+    poll_seconds = max(1, poll_seconds)
+    while True:
+        job = describe_processing_job(ctx, job_name)
+        if job.status in TERMINAL_PROCESSING_STATUSES:
+            return job
+        if time.monotonic() >= deadline:
+            raise AwsCliError(f"Timed out waiting for processing job job={job_name} status={job.status}")
+        time.sleep(min(poll_seconds, max(0.0, deadline - time.monotonic())))
 def start_pipeline_execution(
     ctx: AwsContext,
     pipeline_name: str,
@@ -214,13 +251,22 @@ def _processing_job_view_from_summary(ctx: AwsContext, summary: dict[str, Any])
         detail = ctx.sagemaker.describe_processing_job(ProcessingJobName=name)
     except (BotoCoreError, ClientError):
         detail = summary
+    return _processing_job_view_from_detail(ctx, detail, fallback=summary)
+def _processing_job_view_from_detail(
+    ctx: AwsContext,
+    detail: dict[str, Any],
+    fallback: dict[str, Any] | None = None,
+) -> ProcessingJobView:
+    fallback = fallback or {}
     cluster = detail.get("ProcessingResources", {}).get("ClusterConfig", {})
     return ProcessingJobView(
         profile=ctx.profile,
         region=ctx.region,
-        name=name,
-        status=detail.get("ProcessingJobStatus", summary.get("ProcessingJobStatus", "")),
-        creation_time=detail.get("CreationTime", summary.get("CreationTime")),
+        name=detail.get("ProcessingJobName", fallback.get("ProcessingJobName", "")),
+        status=detail.get("ProcessingJobStatus", fallback.get("ProcessingJobStatus", "")),
+        creation_time=detail.get("CreationTime", fallback.get("CreationTime")),
         last_modified_time=detail.get("LastModifiedTime"),
         started_time=detail.get("ProcessingStartTime"),
         ended_time=detail.get("ProcessingEndTime"),
@@ -228,7 +274,7 @@ def _processing_job_view_from_summary(ctx: AwsContext, summary: dict[str, Any])
         instance_count=cluster.get("InstanceCount"),
         role_arn=detail.get("RoleArn", ""),
         failure_reason=detail.get("FailureReason", ""),
-        arn=detail.get("ProcessingJobArn", summary.get("ProcessingJobArn", "")),
+        arn=detail.get("ProcessingJobArn", fallback.get("ProcessingJobArn", "")),
     )
@@ -281,10 +327,7 @@ def list_pipeline_executions_page(
         next_token=next_token,
     )
     cutoff = datetime.now(timezone.utc) - timedelta(hours=recent_hours)
-    executions: list[PipelineExecutionView] = []
-    for name in names:
-        executions.extend(_list_recent_pipeline_executions_for_name(ctx, name, per_pipeline, cutoff))
+    executions = _list_recent_pipeline_executions(ctx, names, per_pipeline, cutoff)
     return PipelineExecutionsPage(
         executions=sorted(
@@ -296,6 +339,29 @@ def list_pipeline_executions_page(
     )
+def _list_recent_pipeline_executions(
+    ctx: AwsContext,
+    pipeline_names: list[str],
+    per_pipeline: int,
+    cutoff: datetime,
+) -> list[PipelineExecutionView]:
+    if not pipeline_names:
+        return []
+    if len(pipeline_names) == 1:
+        return _list_recent_pipeline_executions_for_name(ctx, pipeline_names[0], per_pipeline, cutoff)
+    executions: list[PipelineExecutionView] = []
+    workers = min(8, len(pipeline_names))
+    with ThreadPoolExecutor(max_workers=workers) as executor:
+        futures = [
+            executor.submit(_list_recent_pipeline_executions_for_name, ctx, name, per_pipeline, cutoff)
+            for name in pipeline_names
+        ]
+        for future in as_completed(futures):
+            executions.extend(future.result())
+    return executions
 def _list_recent_pipeline_executions_for_name(
     ctx: AwsContext,
     pipeline_name: str,
@@ -317,18 +383,17 @@ def _list_recent_pipeline_executions_for_name(
     for summary in response.get("PipelineExecutionSummaries", []):
         status = summary.get("PipelineExecutionStatus", "")
         execution_arn = summary.get("PipelineExecutionArn", "")
-        detail = _describe_pipeline_execution_safely(ctx, execution_arn) if execution_arn else {}
-        start_time = detail.get("StartTime", summary.get("StartTime"))
-        last_modified_time = detail.get("LastModifiedTime", summary.get("LastModifiedTime"))
+        start_time = summary.get("StartTime")
+        last_modified_time = summary.get("LastModifiedTime")
         if not _should_show_pipeline_execution(status, start_time, last_modified_time, cutoff):
             continue
         executions.append(
             PipelineExecutionView(
                 profile=ctx.profile,
                 region=ctx.region,
-                pipeline_name=detail.get("PipelineName", pipeline_name),
+                pipeline_name=pipeline_name,
                 execution_arn=execution_arn,
-                display_name=summary.get("PipelineExecutionDisplayName", detail.get("PipelineExecutionDisplayName", "")),
+                display_name=summary.get("PipelineExecutionDisplayName", ""),
                 status=status,
                 start_time=start_time,
                 last_modified_time=last_modified_time,
@@ -405,6 +470,71 @@ def describe_pipeline_execution(ctx: AwsContext, execution_arn: str) -> dict[str
         raise AwsCliError(f"Failed to read pipeline execution: {exc}") from exc
+def wait_pipeline_execution(
+    ctx: AwsContext,
+    execution_arn: str,
+    timeout_seconds: int = 3600,
+    poll_seconds: int = 30,
+) -> dict[str, Any]:
+    deadline = time.monotonic() + max(0, timeout_seconds)
+    poll_seconds = max(1, poll_seconds)
+    while True:
+        detail = describe_pipeline_execution(ctx, execution_arn)
+        status = detail.get("PipelineExecutionStatus", "")
+        if status in TERMINAL_PIPELINE_STATUSES:
+            return detail
+        if time.monotonic() >= deadline:
+            raise AwsCliError(f"Timed out waiting for pipeline execution execution={execution_arn} status={status}")
+        time.sleep(min(poll_seconds, max(0.0, deadline - time.monotonic())))
+def inspect_pipeline_execution(ctx: AwsContext, execution_arn: str) -> dict[str, Any]:
+    detail = describe_pipeline_execution(ctx, execution_arn)
+    detail = {
+        **detail,
+        "PipelineExecutionArn": detail.get("PipelineExecutionArn") or execution_arn,
+        "PipelineName": detail.get("PipelineName") or _pipeline_name_from_execution_arn(execution_arn),
+    }
+    steps = list_pipeline_steps(ctx, execution_arn)
+    failed_steps = [step for step in steps if step.get("StepStatus") == "Failed"]
+    return {
+        "profile": ctx.profile,
+        "region": ctx.region,
+        "execution": detail,
+        "steps": steps,
+        "failed_steps": failed_steps,
+    }
+def diagnose_pipeline_execution(
+    ctx: AwsContext,
+    execution_arn: str,
+    log_limit: int = 80,
+) -> dict[str, Any]:
+    inspection = inspect_pipeline_execution(ctx, execution_arn)
+    failed_steps = inspection["failed_steps"]
+    failed_step = failed_steps[0] if failed_steps else None
+    log_source = infer_log_source(failed_step) if failed_step else None
+    log_tail = tail_step_logs(ctx, failed_step, limit=log_limit) if failed_step else []
+    job_type = None
+    job_name = None
+    if log_source and failed_step:
+        job_type = _step_job_type(failed_step)
+        job_name = log_source[1]
+    return {
+        **inspection,
+        "failed_step": failed_step,
+        "job_type": job_type,
+        "job_name": job_name,
+        "log_group": log_source[0] if log_source else None,
+        "log_stream_prefix": log_source[1] if log_source else None,
+        "log_tail": log_tail,
+        "next_actions": _diagnostic_next_actions(execution_arn, failed_step),
+    }
 def tail_step_logs(ctx: AwsContext, step: dict[str, Any], limit: int = 80) -> list[str]:
     source = infer_log_source(step)
     if source is None:
@@ -413,6 +543,10 @@ def tail_step_logs(ctx: AwsContext, step: dict[str, Any], limit: int = 80) -> li
     return tail_cloudwatch_logs(ctx, log_group, stream_prefix, limit=limit)
+def tail_processing_job_logs(ctx: AwsContext, job_name: str, limit: int = 80) -> list[str]:
+    return tail_cloudwatch_logs(ctx, "/aws/sagemaker/ProcessingJobs", job_name, limit=limit)
 def tail_cloudwatch_logs(
     ctx: AwsContext,
     log_group: str,
@@ -467,6 +601,39 @@ def infer_log_source(step: dict[str, Any]) -> tuple[str, str] | None:
     return None
+def _step_job_type(step: dict[str, Any]) -> str | None:
+    metadata = step.get("Metadata") or {}
+    for key in ("ProcessingJob", "TrainingJob", "TransformJob"):
+        if isinstance(metadata.get(key), dict):
+            return key
+    return None
+def _pipeline_name_from_execution_arn(execution_arn: str) -> str:
+    marker = ":pipeline/"
+    if marker not in execution_arn:
+        return ""
+    tail = execution_arn.split(marker, 1)[1]
+    return tail.split("/", 1)[0]
+def _diagnostic_next_actions(execution_arn: str, failed_step: dict[str, Any] | None) -> list[dict[str, str]]:
+    actions = [
+        {
+            "type": "inspect",
+            "command": f"smops pipeline inspect --execution-arn {execution_arn} --json",
+        }
+    ]
+    if failed_step:
+        actions.append(
+            {
+                "type": "diagnose",
+                "command": f"smops pipeline diagnose --execution-arn {execution_arn} --json",
+            }
+        )
+    return actions
 def format_dt(value: datetime | None) -> str:
     if value is None:
         return ""

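The new `wait_processing_job` and `wait_pipeline_execution` share one polling loop: describe the resource, return on a terminal status, fail once a monotonic deadline has passed, and otherwise sleep for the poll interval clamped to the time remaining. The sketch below isolates that loop with an injected clock and sleep so it can be exercised without AWS. `wait_until_terminal` and `FakeClock` are hypothetical names, and the sketch raises the built-in `TimeoutError` where smops raises `AwsCliError`.

```python
import time
from typing import Callable, Iterable


def wait_until_terminal(
    fetch_status: Callable[[], str],
    terminal: Iterable[str],
    timeout_seconds: float = 3600,
    poll_seconds: float = 30,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status() until it returns a terminal status or time runs out."""
    terminal_set = set(terminal)
    deadline = clock() + max(0, timeout_seconds)
    poll_seconds = max(1, poll_seconds)  # avoid a busy loop
    while True:
        status = fetch_status()
        if status in terminal_set:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_seconds}s")
        # Never sleep past the deadline; the next pass checks status once more.
        sleep(min(poll_seconds, max(0.0, deadline - clock())))


class FakeClock:
    """Deterministic stand-in for time.monotonic/time.sleep in tests."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds
```

Because the last sleep is clamped, a job that finishes just before the deadline is still observed by one final describe call instead of being reported as a timeout.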