PyPI - alpha-engine-lib - Versions diffs - 0.34.0__tar.gz → 0.35.0__tar.gz - Mend

alpha-engine-lib 0.34.0__tar.gz → 0.35.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (75)

{alpha_engine_lib-0.34.0 → alpha_engine_lib-0.35.0}/PKG-INFO RENAMED

@@ -1,7 +1,7 @@
 Metadata-Version: 2.4
 Name: alpha-engine-lib
-Version: 0.34.0
-Summary: Shared utilities for the Alpha Engine modules: preflight, structured logging with secret-redaction, ArcticDB universe access, NYSE-calendar dates + freshness predicates, decision capture, cost telemetry, RAG, agent output schemas, SSM-backed secrets, Telegram alerts + SNS fan-out, EC2 spot-launch resilience, SSM log-capture chokepoint, and Step-Functions execution-state projection. Full surface documented in README.
+Version: 0.35.0
+Summary: Shared utilities for the Alpha Engine modules: preflight, structured logging with secret-redaction, ArcticDB universe access, NYSE-calendar dates + freshness predicates, decision capture, cost telemetry, RAG, agent output schemas, SSM-backed secrets, Telegram alerts + SNS fan-out, EC2 spot-launch resilience, SSM log-capture chokepoint, SSM send-command + poll chokepoint, and Step-Functions execution-state projection. Full surface documented in README.
 Author: Brian McMahon
 License: Proprietary
 Requires-Python: >=3.9
@@ -181,18 +181,47 @@ results = retrieve(
 Requires the `[rag]` extra. Embeddings are Voyage `voyage-3-lite` (512d); the database backend is Neon Postgres with pgvector + HNSW indexes.
+### `ssm_dispatcher` — SSM send-command + poll chokepoint
+Canonical Python primitive for the `run_ssm` bash helper that previously appeared as a ~54-line mirror across each dispatcher script that drives a spot instance over the SSM transport. The pre-lift shape — base64-wrap the script body, `aws ssm send-command --document-name AWS-RunShellScript`, loop on `get-command-invocation`, stream the `StandardOutputContent` delta, propagate the inner exit — now lives in one place where the polling cadence, error-class handling, and S3 output-key layout match across every consumer.
+```bash
+python -m alpha_engine_lib.ssm_dispatcher run \
+  --instance-id "$INSTANCE_ID" \
+  --description "bootstrap" \
+  --timeout 3600 \
+  --output-bucket "$S3_BUCKET" \
+  --output-key-prefix "${S3_STAGING_PREFIX}/ssm-output" \
+  --region "$AWS_REGION" \
+  --script-stdin <<'BOOTSTRAP'
+set -eo pipefail
+export HOME=/home/ec2-user AWS_REGION=us-east-1
+# ...the script body the SSM target will execute...
+BOOTSTRAP
+```
+Exit 0 on `Success`; exit 1 on any terminal non-Success status, send-command failure, or unrecoverable poll failure; exit 2 on bad CLI input. `InvocationDoesNotExist` during the first 60s after SendCommand counts as a registration race and keeps polling — closes the 2026-05-23 Saturday SF substrate weakness at the chokepoint rather than per-SF-JSON Retry block.
+### `ssm_log_capture` — SSM-step log capture + S3 ship-on-exit chokepoint
+Pairs with `ssm_dispatcher` on the SSM target side. The dispatcher script tells the target instance to invoke `python -m alpha_engine_lib.ssm_log_capture run --slug X --log /var/log/X.log -- bash <launcher>`; the target wrapper tees the launcher's stdout/stderr to a local log file and to its own stdout (so the SSM `StandardOutputContent` channel still surfaces output to the dispatcher), then on exit ships the full local log to `s3://alpha-engine-research/_ssm_logs/{slug}/{date}/{host}-{time}.log` regardless of the inner exit code. Replaces the inline `trap 'aws s3 cp ...' EXIT` pattern that broke under ASL `States.Array` escape semantics (2026-05-22 Friday-PM dry-pass catch).
+### `ec2_spot` — capacity-resilient spot-launch chokepoint
+Rotates across `(instance_type × subnet)` combinations on `InsufficientInstanceCapacity` / `InsufficientHostCapacity` / `Unsupported` / `InvalidAvailabilityZone` / `SpotMaxPriceTooLow`; non-capacity errors raise immediately. CLI exit 64 distinguishes capacity exhaustion from generic failure. Replaces the hardcoded single-subnet + single-instance-type launch pattern that mirrored across each dispatcher; landed 2026-05-22 after the third-recurrence-in-a-month spot-launch fragility.
 ## How it's used
 All six Nous Ergon module repos depend on this lib:
 | Module | Repo | What it imports from here |
 |---|---|---|
-| Data | [`alpha-engine-data`](https://github.com/cipher813/alpha-engine-data) | `logging`, `preflight`, `arcticdb`, `dates`, `trading_calendar`, `rag` (ingestion) |
+| Data | [`alpha-engine-data`](https://github.com/cipher813/alpha-engine-data) | `logging`, `preflight`, `arcticdb`, `dates`, `trading_calendar`, `rag` (ingestion), `ec2_spot` + `ssm_log_capture` + `ssm_dispatcher` (spot launchers) |
 | Research | [`alpha-engine-research`](https://github.com/cipher813/alpha-engine-research) | `logging`, `decision_capture`, `cost`, `dates`, `rag` (retrieval), `agent_schemas` (canonical LLM-output contracts) |
-| Predictor | [`alpha-engine-predictor`](https://github.com/cipher813/alpha-engine-predictor) | `logging`, `preflight`, `arcticdb`, `dates` |
+| Predictor | [`alpha-engine-predictor`](https://github.com/cipher813/alpha-engine-predictor) | `logging`, `preflight`, `arcticdb`, `dates`, `ec2_spot` + `ssm_log_capture` + `ssm_dispatcher` (spot launcher) |
 | Executor | [`alpha-engine`](https://github.com/cipher813/alpha-engine) | `logging`, `preflight`, `arcticdb`, `dates`, `trading_calendar` |
-| Backtester | [`alpha-engine-backtester`](https://github.com/cipher813/alpha-engine-backtester) | `logging`, `preflight`, `arcticdb`, `dates`, `agent_schemas` (replay-harness Pydantic validation) |
-| Dashboard | [`alpha-engine-dashboard`](https://github.com/cipher813/alpha-engine-dashboard) | `logging`, `arcticdb`, `dates` |
+| Backtester | [`alpha-engine-backtester`](https://github.com/cipher813/alpha-engine-backtester) | `logging`, `preflight`, `arcticdb`, `dates`, `agent_schemas` (replay-harness Pydantic validation), `ec2_spot` + `ssm_log_capture` + `ssm_dispatcher` (spot launcher) |
+| Dashboard | [`alpha-engine-dashboard`](https://github.com/cipher813/alpha-engine-dashboard) | `logging`, `arcticdb`, `dates`, hosts the SSM-target `.venv` that `ssm_dispatcher` invokes via `python -m` |
 ## Development
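
The base64 wrapping that the `ssm_dispatcher` description mentions can be illustrated with a short sketch. It mirrors the `_encode_command_payload` helper that appears in `ssm_dispatcher.py` later in this diff; the `decode_command_payload` inverse is a hypothetical helper added here only to show that the script body survives the round trip, and it is not part of the library.

```python
import base64


def encode_command_payload(script: str) -> str:
    """Wrap a script the same way the diff's _encode_command_payload does."""
    b64 = base64.b64encode(script.encode("utf-8")).decode("ascii")
    return f"echo {b64} | base64 -d | bash"


def decode_command_payload(command: str) -> str:
    """Hypothetical inverse, for illustration only: recover the script body."""
    # The command has the shape "echo <b64> | base64 -d | bash",
    # so the base64 text is the second whitespace-separated token.
    b64 = command.split()[1]
    return base64.b64decode(b64).decode("utf-8")


if __name__ == "__main__":
    script = "set -eo pipefail\ncat <<'EOF'\nit's \"quoted\"\nEOF\n"
    payload = encode_command_payload(script)
    # The wrapped command is a single line with no quote characters in it.
    print(payload)
    print(decode_command_payload(payload) == script)
```

Because the base64 alphabet contains no quotes or newlines, the wrapped command stays a single shell-safe line regardless of what the inner script contains.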

{alpha_engine_lib-0.34.0 → alpha_engine_lib-0.35.0}/README.md RENAMED

@@ -152,18 +152,47 @@ results = retrieve(
 Requires the `[rag]` extra. Embeddings are Voyage `voyage-3-lite` (512d); the database backend is Neon Postgres with pgvector + HNSW indexes.
+### `ssm_dispatcher` — SSM send-command + poll chokepoint
+Canonical Python primitive for the `run_ssm` bash helper that previously appeared as a ~54-line mirror across each dispatcher script that drives a spot instance over the SSM transport. The pre-lift shape — base64-wrap the script body, `aws ssm send-command --document-name AWS-RunShellScript`, loop on `get-command-invocation`, stream the `StandardOutputContent` delta, propagate the inner exit — now lives in one place where the polling cadence, error-class handling, and S3 output-key layout match across every consumer.
+```bash
+python -m alpha_engine_lib.ssm_dispatcher run \
+  --instance-id "$INSTANCE_ID" \
+  --description "bootstrap" \
+  --timeout 3600 \
+  --output-bucket "$S3_BUCKET" \
+  --output-key-prefix "${S3_STAGING_PREFIX}/ssm-output" \
+  --region "$AWS_REGION" \
+  --script-stdin <<'BOOTSTRAP'
+set -eo pipefail
+export HOME=/home/ec2-user AWS_REGION=us-east-1
+# ...the script body the SSM target will execute...
+BOOTSTRAP
+```
+Exit 0 on `Success`; exit 1 on any terminal non-Success status, send-command failure, or unrecoverable poll failure; exit 2 on bad CLI input. `InvocationDoesNotExist` during the first 60s after SendCommand counts as a registration race and keeps polling — closes the 2026-05-23 Saturday SF substrate weakness at the chokepoint rather than per-SF-JSON Retry block.
+### `ssm_log_capture` — SSM-step log capture + S3 ship-on-exit chokepoint
+Pairs with `ssm_dispatcher` on the SSM target side. The dispatcher script tells the target instance to invoke `python -m alpha_engine_lib.ssm_log_capture run --slug X --log /var/log/X.log -- bash <launcher>`; the target wrapper tees the launcher's stdout/stderr to a local log file and to its own stdout (so the SSM `StandardOutputContent` channel still surfaces output to the dispatcher), then on exit ships the full local log to `s3://alpha-engine-research/_ssm_logs/{slug}/{date}/{host}-{time}.log` regardless of the inner exit code. Replaces the inline `trap 'aws s3 cp ...' EXIT` pattern that broke under ASL `States.Array` escape semantics (2026-05-22 Friday-PM dry-pass catch).
+### `ec2_spot` — capacity-resilient spot-launch chokepoint
+Rotates across `(instance_type × subnet)` combinations on `InsufficientInstanceCapacity` / `InsufficientHostCapacity` / `Unsupported` / `InvalidAvailabilityZone` / `SpotMaxPriceTooLow`; non-capacity errors raise immediately. CLI exit 64 distinguishes capacity exhaustion from generic failure. Replaces the hardcoded single-subnet + single-instance-type launch pattern that mirrored across each dispatcher; landed 2026-05-22 after the third-recurrence-in-a-month spot-launch fragility.
 ## How it's used
 All six Nous Ergon module repos depend on this lib:
 | Module | Repo | What it imports from here |
 |---|---|---|
-| Data | [`alpha-engine-data`](https://github.com/cipher813/alpha-engine-data) | `logging`, `preflight`, `arcticdb`, `dates`, `trading_calendar`, `rag` (ingestion) |
+| Data | [`alpha-engine-data`](https://github.com/cipher813/alpha-engine-data) | `logging`, `preflight`, `arcticdb`, `dates`, `trading_calendar`, `rag` (ingestion), `ec2_spot` + `ssm_log_capture` + `ssm_dispatcher` (spot launchers) |
 | Research | [`alpha-engine-research`](https://github.com/cipher813/alpha-engine-research) | `logging`, `decision_capture`, `cost`, `dates`, `rag` (retrieval), `agent_schemas` (canonical LLM-output contracts) |
-| Predictor | [`alpha-engine-predictor`](https://github.com/cipher813/alpha-engine-predictor) | `logging`, `preflight`, `arcticdb`, `dates` |
+| Predictor | [`alpha-engine-predictor`](https://github.com/cipher813/alpha-engine-predictor) | `logging`, `preflight`, `arcticdb`, `dates`, `ec2_spot` + `ssm_log_capture` + `ssm_dispatcher` (spot launcher) |
 | Executor | [`alpha-engine`](https://github.com/cipher813/alpha-engine) | `logging`, `preflight`, `arcticdb`, `dates`, `trading_calendar` |
-| Backtester | [`alpha-engine-backtester`](https://github.com/cipher813/alpha-engine-backtester) | `logging`, `preflight`, `arcticdb`, `dates`, `agent_schemas` (replay-harness Pydantic validation) |
-| Dashboard | [`alpha-engine-dashboard`](https://github.com/cipher813/alpha-engine-dashboard) | `logging`, `arcticdb`, `dates` |
+| Backtester | [`alpha-engine-backtester`](https://github.com/cipher813/alpha-engine-backtester) | `logging`, `preflight`, `arcticdb`, `dates`, `agent_schemas` (replay-harness Pydantic validation), `ec2_spot` + `ssm_log_capture` + `ssm_dispatcher` (spot launcher) |
+| Dashboard | [`alpha-engine-dashboard`](https://github.com/cipher813/alpha-engine-dashboard) | `logging`, `arcticdb`, `dates`, hosts the SSM-target `.venv` that `ssm_dispatcher` invokes via `python -m` |
 ## Development
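
The `ec2_spot` entry above describes rotating across `(instance_type × subnet)` combinations on capacity-class errors and raising immediately on anything else. The `ec2_spot` source is not part of this diff, so the following is only a minimal sketch of that idea under stated assumptions: `LaunchError`, `launch_with_rotation`, and the `try_launch` callable are hypothetical names, the iteration order is a guess, and returning `None` on exhaustion stands in for whatever the real module does (the README only says the CLI exits 64 in that case).

```python
from itertools import product
from typing import Callable, Optional, Sequence

# Error codes the README lists as capacity-class: move on to the next combination.
CAPACITY_ERROR_CODES = frozenset(
    {
        "InsufficientInstanceCapacity",
        "InsufficientHostCapacity",
        "Unsupported",
        "InvalidAvailabilityZone",
        "SpotMaxPriceTooLow",
    }
)


class LaunchError(Exception):
    """Hypothetical stand-in for a launch failure carrying an AWS error code."""

    def __init__(self, code: str) -> None:
        super().__init__(code)
        self.code = code


def launch_with_rotation(
    instance_types: Sequence[str],
    subnets: Sequence[str],
    try_launch: Callable[[str, str], str],
) -> Optional[str]:
    """Try each (instance_type, subnet) pair until one launch succeeds.

    Returns the instance ID from the first successful launch, or None when
    every combination failed with a capacity-class error. Any other error
    propagates to the caller immediately.
    """
    for instance_type, subnet in product(instance_types, subnets):
        try:
            return try_launch(instance_type, subnet)
        except LaunchError as exc:
            if exc.code not in CAPACITY_ERROR_CODES:
                raise  # non-capacity error: do not keep rotating
    return None
```

A caller would pass a `try_launch` function that wraps the actual launch request; in a test it can be a fake that raises `LaunchError("InsufficientInstanceCapacity")` for the first few combinations.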

{alpha_engine_lib-0.34.0 → alpha_engine_lib-0.35.0}/pyproject.toml RENAMED

@@ -4,8 +4,8 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "alpha-engine-lib"
-version = "0.34.0"
-description = "Shared utilities for the Alpha Engine modules: preflight, structured logging with secret-redaction, ArcticDB universe access, NYSE-calendar dates + freshness predicates, decision capture, cost telemetry, RAG, agent output schemas, SSM-backed secrets, Telegram alerts + SNS fan-out, EC2 spot-launch resilience, SSM log-capture chokepoint, and Step-Functions execution-state projection. Full surface documented in README."
+version = "0.35.0"
+description = "Shared utilities for the Alpha Engine modules: preflight, structured logging with secret-redaction, ArcticDB universe access, NYSE-calendar dates + freshness predicates, decision capture, cost telemetry, RAG, agent output schemas, SSM-backed secrets, Telegram alerts + SNS fan-out, EC2 spot-launch resilience, SSM log-capture chokepoint, SSM send-command + poll chokepoint, and Step-Functions execution-state projection. Full surface documented in README."
 readme = "README.md"
 # EC2 still runs Python 3.9 on the always-on micro instance (boto3 drops
 # 3.9 support 2026-04-29, so upgrade is on the near-term roadmap). All

{alpha_engine_lib-0.34.0 → alpha_engine_lib-0.35.0}/src/alpha_engine_lib/__init__.py RENAMED

@@ -1,3 +1,3 @@
 """alpha-engine-lib — shared utilities for Alpha Engine modules."""
-__version__ = "0.34.0"
+__version__ = "0.35.0"

{alpha_engine_lib-0.34.0 → alpha_engine_lib-0.35.0}/src/alpha_engine_lib/pipeline_status/read.py RENAMED

@@ -191,6 +191,30 @@ def _label_for_arn(state_machine_arn: str) -> str:
     return PIPELINE_LABELS.get(sm_name, sm_name or "Unknown SF")
+def _region_from_arn(state_machine_arn: str) -> Optional[str]:
+    """Extract the AWS region from a Step Functions ARN.
+    ARN shape: ``arn:aws:states:<region>:<account>:stateMachine:<name>``.
+    Returns the region segment, or None if the ARN doesn't parse — in
+    which case the boto3 client falls back to its normal region resolution
+    (env vars / config / instance metadata). The lib is permissive on
+    malformed input here because the downstream boto3 call will fail
+    loud with a typed error that surfaces via ``_raise_for_boto_error``.
+    Why this exists: Step Functions is a regional service and boto3
+    raises ``NoRegionError`` if no region is discoverable. Streamlit
+    systemd environments on EC2 may not have ``AWS_REGION`` set, but the
+    ARN ALWAYS carries the region — extracting it eliminates a class of
+    "missing region" failures at the lib chokepoint.
+    """
+    if not state_machine_arn or not state_machine_arn.startswith("arn:"):
+        return None
+    parts = state_machine_arn.split(":")
+    if len(parts) < 4 or not parts[3]:
+        return None
+    return parts[3]
 def _failure_cause_from(describe_resp: dict) -> str:
     """Extract + truncate the failure cause from DescribeExecution response.
@@ -430,7 +454,7 @@ def read_pipeline_state(
     if client is None:  # pragma: no cover — production path
         import boto3
-        client = boto3.client("stepfunctions")
+        client = boto3.client("stepfunctions", region_name=_region_from_arn(state_machine_arn))
     label = _label_for_arn(state_machine_arn)
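
The behavior of the new `_region_from_arn` helper is easiest to see on a few inputs. The sketch below copies the parsing logic from the hunk above into a standalone function (named `region_from_arn` here so it can run without the library); the account ID and state machine name in the sample ARN are made up.

```python
from typing import Optional


def region_from_arn(state_machine_arn: str) -> Optional[str]:
    """Standalone copy of the parsing logic added in this diff."""
    if not state_machine_arn or not state_machine_arn.startswith("arn:"):
        return None
    # ARN shape: arn:aws:states:<region>:<account>:stateMachine:<name>
    parts = state_machine_arn.split(":")
    if len(parts) < 4 or not parts[3]:
        return None
    return parts[3]


if __name__ == "__main__":
    samples = [
        "arn:aws:states:us-east-1:123456789012:stateMachine:demo",  # well formed
        "arn:aws:states",  # too few segments
        "arn:aws:states::123456789012:stateMachine:demo",  # empty region segment
        "not-an-arn",  # wrong prefix
    ]
    for arn in samples:
        print(repr(arn), "->", region_from_arn(arn))
```

When the helper returns `None`, the changed call passes `region_name=None`, which, per the docstring in the diff, leaves boto3 to resolve the region through its usual chain.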

alpha_engine_lib-0.35.0/src/alpha_engine_lib/ssm_dispatcher.py ADDED

@@ -0,0 +1,463 @@
+"""
+SSM send-command + poll-for-completion chokepoint.
+Consolidation substrate for the ``run_ssm`` bash helper that previously
+appeared as a ~54-line mirror in every dispatcher script that drives a
+spot instance over the SSM transport. The first occurrence shipped in
+alpha-engine-predictor #168 (2026-05-15) as part of the SSH/SCP→SSM
+migration; the second and third occurrences land when alpha-engine-data's
+``spot_data_weekly.sh`` and alpha-engine-backtester's ``spot_backtest.sh``
+migrate off SSH+SCP onto the same SSM transport. Per
+``~/Development/CLAUDE.md`` SOTA sub-sub-rule + the
+``[[feedback_lift_invariants_to_chokepoint_after_second_recurrence]]``
+discipline, the pattern lifts to lib at the second recurrence.
+The pre-lift bash shape was::
+    run_ssm "<description>" "<bash script>" [timeout_seconds]
+    # 1. base64-encode the script body (transport-safe wrapping of inner
+    #    heredocs / quoting)
+    # 2. aws ssm send-command --document-name AWS-RunShellScript \
+    #      --instance-ids "$INSTANCE_ID" \
+    #      --output-s3-bucket-name "$S3_BUCKET" \
+    #      --output-s3-key-prefix "${S3_STAGING_PREFIX}/ssm-output" \
+    #      --timeout-seconds "$timeout_s" \
+    #      --parameters file://$pfile
+    # 3. while :; do
+    #      aws ssm get-command-invocation --command-id $cmd_id
+    #      stream stdout delta; check Status; break on terminal
+    #    done
+    # 4. on Success → return 0
+    # 5. on Failed/TimedOut/Cancelled → fetch stderr, print, return 1
+The Python primitive in this module exposes the same contract — base64
+wrap, send, poll, stream, propagate exit — but lives in one place so
+the polling cadence, error-class handling, and S3 output-key layout
+match across every consumer.
+**Why a CLI, not a bash function:**
+Per the SOTA / institutional-approach sub-sub-rule ("when mirroring a
+pattern across repos, consider lifting it into ``alpha-engine-lib``...
+Pure-Bash primitives can stay mirrored unless re-expressible as a
+Python CLI entry callable from Bash, in which case the CLI re-expression
+is the institutional path"). The dispatcher script invokes::
+    python -m alpha_engine_lib.ssm_dispatcher run \\
+      --instance-id "$INSTANCE_ID" \\
+      --description "bootstrap" \\
+      --timeout 3600 \\
+      --output-bucket "$S3_BUCKET" \\
+      --output-key-prefix "${S3_STAGING_PREFIX}/ssm-output" \\
+      --region "$AWS_REGION" \\
+      --script-stdin <<'BOOTSTRAP'
+    set -eo pipefail
+    ...
+    BOOTSTRAP
+Exit code 0 on Success; 1 on terminal non-Success; 2 on bad input. The
+inner script's stdout streams to the dispatcher's stdout as it arrives
+(SSM ``StandardOutputContent`` delta); on terminal non-Success the
+``StandardErrorContent`` is fetched + printed before the dispatcher
+exits.
+**InvocationDoesNotExist race:**
+After ``send-command`` returns a ``CommandId``, the first poll of
+``get-command-invocation`` can race the SSM control plane's registration
+and return ``InvocationDoesNotExist``. The 2026-05-23 Saturday SF showed
+this exact failure mode at event 16 (MorningEnrich first poll), absorbed
+by the SF Catch but representing a substrate weakness. This module
+treats ``InvocationDoesNotExist`` as a transient "Pending" status for
+the first ~60s after SendCommand (the registration window) and as a
+terminal failure thereafter. Mirrors the bash predecessor's
+``2>/dev/null || echo Pending`` swallow without the all-errors-look-like-Pending
+ambiguity.
+**Failure behavior — never raises:**
+- Inner command's terminal status maps to exit code 0 (Success) or 1
+  (Failed / TimedOut / Cancelled / TerminalError). The dispatcher
+  script's ``set -e`` then propagates that exit upward to the SF Catch.
+- Subprocess setup failure (boto3 missing, IAM denied at SendCommand
+  time, instance not registered) is logged + returns 1. The caller
+  reads the failure from CloudWatch / SSM history; this module's job
+  is to be a thin transport, not a recovery layer.
+"""
+from __future__ import annotations
+import argparse
+import base64
+import logging
+import os
+import sys
+import time
+from typing import Final, Optional
+logger = logging.getLogger(__name__)
+# Status taxonomy from SSM's get-command-invocation. Terminal non-Success
+# statuses all map to exit 1.
+TERMINAL_NON_SUCCESS: Final[frozenset[str]] = frozenset(
+    {"Cancelled", "Failed", "TimedOut", "Cancelling", "TerminalError"}
+)
+PENDING_STATUSES: Final[frozenset[str]] = frozenset(
+    {"Pending", "InProgress", "Delayed"}
+)
+SUCCESS_STATUS: Final[str] = "Success"
+# Window during which InvocationDoesNotExist counts as a registration race
+# rather than a true failure. Mirrors the empirical observation that the
+# SSM control plane has settled by ~30s post-SendCommand under normal
+# conditions; 60s is a defensive ceiling.
+REGISTRATION_GRACE_SECONDS: Final[int] = 60
+# Poll cadence — matches the bash predecessor's `sleep 5`.
+DEFAULT_POLL_INTERVAL_SECONDS: Final[float] = 5.0
+# StandardOutputContent / StandardErrorContent fields are capped at 24KB
+# in get-command-invocation responses. Beyond the cap the buffer rotates
+# (we detect by a length decrease) and the full log lives in the
+# configured S3 output prefix.
+SSM_INLINE_OUTPUT_CAP_BYTES: Final[int] = 24 * 1024
+class SsmDispatchError(Exception):
+    """Non-recoverable SSM send-command / poll failure."""
+def _encode_command_payload(script: str) -> str:
+    """Wrap ``script`` for AWS-RunShellScript transport.
+    The pre-lift bash helper base64-encoded the script body and emitted
+    a single command ``echo <b64> | base64 -d | bash``. This is the
+    transport-safe wrapping that lets the script contain heredocs,
+    embedded Python, single quotes, etc. without ASL/SSM escaping
+    surface.
+    """
+    b64 = base64.b64encode(script.encode("utf-8")).decode("ascii")
+    return f"echo {b64} | base64 -d | bash"
+def run(
+    instance_id: str,
+    description: str,
+    script: str,
+    *,
+    timeout_seconds: int = 3600,
+    output_bucket: Optional[str] = None,
+    output_key_prefix: Optional[str] = None,
+    region: str = "us-east-1",
+    poll_interval_seconds: float = DEFAULT_POLL_INTERVAL_SECONDS,
+    stdout_stream=None,
+    stderr_stream=None,
+    sleep=time.sleep,
+    monotonic=time.monotonic,
+    boto3_client=None,
+) -> int:
+    """Send ``script`` to ``instance_id`` via SSM, poll until terminal, stream stdout.
+    Args:
+        instance_id: target EC2 instance ID (must be SSM-registered).
+        description: short label for SSM history + dispatcher logs.
+        script: bash script body. Will be base64-wrapped + executed as
+            a single AWS-RunShellScript command.
+        timeout_seconds: SSM command timeout (handed to SendCommand).
+        output_bucket: S3 bucket for SSM to write the full stdout/stderr
+            (past the 24KB inline cap). Optional; if unset, only inline
+            output is available.
+        output_key_prefix: S3 key prefix for the SSM output bucket.
+        region: AWS region.
+        poll_interval_seconds: gap between get-command-invocation polls.
+        stdout_stream: destination for streamed inner stdout (default:
+            ``sys.stdout``).
+        stderr_stream: destination for the terminal-failure stderr dump
+            (default: ``sys.stderr``).
+        sleep / monotonic: time hooks (overridable for tests).
+        boto3_client: optional boto3 ``ssm`` client (for tests). When
+            ``None``, constructed via ``boto3.client('ssm', region_name=region)``.
+    Returns:
+        ``0`` on terminal Success.
+        ``1`` on any terminal non-Success status, send-command failure,
+        or unrecoverable poll failure.
+    Never raises.
+    """
+    out = stdout_stream if stdout_stream is not None else sys.stdout
+    err = stderr_stream if stderr_stream is not None else sys.stderr
+    try:
+        if boto3_client is None:
+            import boto3
+            ssm = boto3.client("ssm", region_name=region)
+        else:
+            ssm = boto3_client
+    except Exception as exc:
+        print(
+            f"ssm_dispatcher: boto3 client construction failed: "
+            f"{type(exc).__name__}: {exc}",
+            file=err,
+        )
+        return 1
+    payload = _encode_command_payload(script)
+    send_kwargs: dict = {
+        "InstanceIds": [instance_id],
+        "DocumentName": "AWS-RunShellScript",
+        "Comment": description[:100],  # SSM Comment cap is 100 chars
+        "TimeoutSeconds": int(timeout_seconds),
+        "Parameters": {"commands": [payload]},
+    }
+    if output_bucket:
+        send_kwargs["OutputS3BucketName"] = output_bucket
+    if output_key_prefix:
+        send_kwargs["OutputS3KeyPrefix"] = output_key_prefix
+    try:
+        resp = ssm.send_command(**send_kwargs)
+    except Exception as exc:
+        print(
+            f"ssm_dispatcher: send_command failed for {description!r}: "
+            f"{type(exc).__name__}: {exc}",
+            file=err,
+        )
+        return 1
+    command_id = resp.get("Command", {}).get("CommandId")
+    if not command_id:
+        print(
+            f"ssm_dispatcher: send_command returned no CommandId for {description!r}",
+            file=err,
+        )
+        return 1
+    print(f"    [ssm {description}] command-id={command_id}", file=err)
+    start_monotonic = monotonic()
+    last_out_len = 0
+    while True:
+        sleep(poll_interval_seconds)
+        try:
+            inv = ssm.get_command_invocation(
+                CommandId=command_id,
+                InstanceId=instance_id,
+            )
+        except Exception as exc:
+            code = _classify_boto_exception(exc)
+            if code == "InvocationDoesNotExist":
+                elapsed = monotonic() - start_monotonic
+                if elapsed <= REGISTRATION_GRACE_SECONDS:
+                    # Registration race per the 2026-05-23 Saturday SF
+                    # event-16 substrate weakness; keep polling.
+                    continue
+                print(
+                    f"ssm_dispatcher: {description!r} command {command_id} "
+                    f"never registered (InvocationDoesNotExist after "
+                    f"{elapsed:.0f}s)",
+                    file=err,
+                )
+                return 1
+            # Other transient classes that the bash predecessor swallowed
+            # via `2>/dev/null || echo Pending`. Be explicit: only the
+            # listed set is treated as transient; anything else is a hard
+            # failure.
+            if code in {"ThrottlingException", "RequestLimitExceeded"}:
+                continue
+            print(
+                f"ssm_dispatcher: get_command_invocation for {description!r} "
+                f"raised {code}: {exc}",
+                file=err,
+            )
+            return 1
+        status = inv.get("Status", "Pending")
+        std_out = inv.get("StandardOutputContent", "") or ""
+        if len(std_out) > last_out_len:
+            out.write(std_out[last_out_len:])
+            out.flush()
+            last_out_len = len(std_out)
+        elif len(std_out) < last_out_len:
+            # 24KB cap rotated the buffer; the full log is in S3 (if
+            # output_bucket was configured).
+            cap_note = (
+                f"    [ssm {description}] (stdout exceeded "
+                f"{SSM_INLINE_OUTPUT_CAP_BYTES // 1024}KB cap — full log: "
+                f"s3://{output_bucket}/{output_key_prefix}/)\n"
+                if output_bucket
+                else (
+                    f"    [ssm {description}] (stdout exceeded "
+                    f"{SSM_INLINE_OUTPUT_CAP_BYTES // 1024}KB cap — "
+                    "configure --output-bucket for full log)\n"
+                )
+            )
+            err.write(cap_note)
+            err.flush()
+            last_out_len = len(std_out)
+        if status == SUCCESS_STATUS:
+            return 0
+        if status in TERMINAL_NON_SUCCESS:
+            std_err = inv.get("StandardErrorContent", "") or ""
+            err.write(
+                f"ERROR: SSM step {description!r} terminal status={status}\n"
+            )
+            if std_err:
+                err.write(
+                    f"--- stderr ({SSM_INLINE_OUTPUT_CAP_BYTES // 1024}KB cap; "
+                )
+                if output_bucket:
+                    err.write(
+                        f"full: s3://{output_bucket}/{output_key_prefix}/) ---\n"
+                    )
+                else:
+                    err.write("configure --output-bucket for full log) ---\n")
+                err.write(std_err)
+                if not std_err.endswith("\n"):
+                    err.write("\n")
+            err.flush()
+            return 1
+        if status not in PENDING_STATUSES:
+            # Unknown status — treat as a hard failure, log it.
+            err.write(
+                f"ssm_dispatcher: {description!r} returned unknown status "
+                f"{status!r}; treating as failure\n"
+            )
+            err.flush()
+            return 1
+        # Pending / InProgress / Delayed — keep polling.
+def _classify_boto_exception(exc: BaseException) -> str:
+    """Extract the ``Error.Code`` from a botocore ClientError.
+    Returns the exception class name when no ``response.Error.Code`` is
+    available (e.g., on non-botocore exceptions). Tests patch this for
+    deterministic InvocationDoesNotExist surfacing.
+    """
+    response = getattr(exc, "response", None)
+    if isinstance(response, dict):
+        code = response.get("Error", {}).get("Code")
+        if code:
+            return str(code)
+    return type(exc).__name__
+def _read_script(args: argparse.Namespace) -> str:
+    if args.script_file:
+        with open(args.script_file, "r", encoding="utf-8") as fh:
+            return fh.read()
+    if args.script_stdin:
+        return sys.stdin.read()
+    raise SystemExit(
+        "ssm_dispatcher: must pass either --script-file PATH or --script-stdin "
+        "(with the script body on stdin)"
+    )
+def main(argv: list[str] | None = None) -> int:
+    parser = argparse.ArgumentParser(
+        prog="python -m alpha_engine_lib.ssm_dispatcher",
+        description=(
+            "Send a bash script to an SSM-registered EC2 instance via "
+            "AWS-RunShellScript, poll until terminal, stream stdout to "
+            "this process, and propagate the inner exit status. The "
+            "institutional replacement for the ~54-line run_ssm bash "
+            "helper mirrored across alpha-engine-* dispatcher scripts."
+        ),
+    )
+    subparsers = parser.add_subparsers(dest="cmd", required=True)
+    run_p = subparsers.add_parser(
+        "run",
+        help="Dispatch a script to an instance and stream its output.",
+    )
+    run_p.add_argument(
+        "--instance-id",
+        required=True,
+        help="Target EC2 instance ID (must be SSM-registered).",
+    )
+    run_p.add_argument(
+        "--description",
+        required=True,
+        help=(
+            "Short label for the SSM command Comment + dispatcher log "
+            "lines (e.g., 'bootstrap', 'full-training')."
+        ),
+    )
+    run_p.add_argument(
+        "--timeout",
+        type=int,
+        default=3600,
+        help="SSM command timeout in seconds (default: 3600).",
+    )
+    run_p.add_argument(
+        "--output-bucket",
+        default=None,
+        help=(
+            "S3 bucket where SSM writes the full stdout/stderr beyond "
+            "the inline 24KB cap. Optional; without it, only the inline "
+            "delta is available."
+        ),
+    )
+    run_p.add_argument(
+        "--output-key-prefix",
+        default=None,
+        help="S3 key prefix under --output-bucket for the SSM output.",
+    )
+    run_p.add_argument(
+        "--region",
+        default=os.environ.get("AWS_REGION", "us-east-1"),
+        help="AWS region (default: $AWS_REGION or us-east-1).",
+    )
+    run_p.add_argument(
+        "--poll-interval",
+        type=float,
+        default=DEFAULT_POLL_INTERVAL_SECONDS,
+        help=(
+            "Seconds between get-command-invocation polls (default: "
+            f"{DEFAULT_POLL_INTERVAL_SECONDS:g})."
+        ),
+    )
+    script_grp = run_p.add_mutually_exclusive_group(required=True)
+    script_grp.add_argument(
+        "--script-file",
+        default=None,
+        help="Path to a local file containing the bash script body.",
+    )
+    script_grp.add_argument(
+        "--script-stdin",
+        action="store_true",
+        help="Read the bash script body from stdin (heredoc-friendly).",
+    )
+    args = parser.parse_args(argv)
+    logging.basicConfig(level=logging.WARNING)
+    script = _read_script(args)
+    if not script.strip():
+        print(
+            "ssm_dispatcher: empty script body (refusing to dispatch a no-op)",
+            file=sys.stderr,
+        )
+        return 2
+    return run(
+        instance_id=args.instance_id,
+        description=args.description,
+        script=script,
+        timeout_seconds=args.timeout,
+        output_bucket=args.output_bucket,
+        output_key_prefix=args.output_key_prefix,
+        region=args.region,
+        poll_interval_seconds=args.poll_interval,
+    )
+if __name__ == "__main__":
+    sys.exit(main())
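
Two decisions inside the polling loop of `run` carry most of the module's reasoning: how an error raised by a poll is classified, and how the stdout delta is computed between polls. The sketch below isolates both as pure functions so they can be read and tested without AWS. The function names `poll_error_action` and `stdout_delta` are hypothetical and do not exist in the library; the constants and branch conditions are taken from the diff above.

```python
from typing import Tuple

# Values taken from the module in this diff.
REGISTRATION_GRACE_SECONDS = 60
TRANSIENT_CODES = frozenset({"ThrottlingException", "RequestLimitExceeded"})


def poll_error_action(code: str, elapsed_seconds: float) -> str:
    """Return "retry" or "fail" for an error code raised by a poll.

    InvocationDoesNotExist is a registration race only inside the grace
    window; throttling codes are always retried; anything else fails.
    """
    if code == "InvocationDoesNotExist":
        if elapsed_seconds <= REGISTRATION_GRACE_SECONDS:
            return "retry"
        return "fail"
    if code in TRANSIENT_CODES:
        return "retry"
    return "fail"


def stdout_delta(content: str, last_len: int) -> Tuple[str, int, bool]:
    """Return (new_text, new_last_len, rotated) for one poll's stdout.

    A longer buffer yields the unseen suffix. A shorter buffer means the
    inline cap rotated the content, so nothing is emitted and the caller
    should point the reader at the S3 output instead.
    """
    if len(content) > last_len:
        return content[last_len:], len(content), False
    if len(content) < last_len:
        return "", len(content), True
    return "", last_len, False


if __name__ == "__main__":
    print(poll_error_action("InvocationDoesNotExist", 5.0))
    print(poll_error_action("InvocationDoesNotExist", 61.0))
    print(stdout_delta("hello world", 5))
```

Keeping the transient set explicit is the design point the module's comments make: only the listed codes are retried, so an IAM or validation error surfaces on the first poll instead of looking like a pending command.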
