PyPI - slurm-sdk - Versions diffs - 0.4.5.dev0__tar.gz → 0.4.6.dev0__tar.gz - Mend

slurm-sdk 0.4.5.dev0.tar.gz → 0.4.6.dev0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (233)

{slurm_sdk-0.4.5.dev0 → slurm_sdk-0.4.6.dev0}/AGENTS.md RENAMED

@@ -90,6 +90,70 @@ gh pr create --fill
 - The PR description should summarize changes and reference any related issues
 - Wait for CI to pass before requesting human review
+## Publishing to PyPI
+The package is published to PyPI via GitHub Actions using trusted publishing (no API tokens needed).
+### Dev Releases
+Dev releases publish the current version in `pyproject.toml` (e.g., `0.4.5-dev`) for testing:
+```bash
+gh workflow run publish.yml -f version_type=dev
+```
+To test the build without uploading:
+```bash
+gh workflow run publish.yml -f version_type=dev -f dry_run=true
+```
+### Production Releases
+Production releases require a clean version number and updated changelog:
+1. **Update version** in `pyproject.toml` (remove `-dev` suffix):
+   ```toml
+   version = "0.4.5"  # was "0.4.5-dev"
+   ```
+1. **Update changelog** in `docs/CHANGELOG.md`:
+   - Move entries from `## [Unreleased]` to new section `## [0.4.5] - YYYY-MM-DD`
+   - Keep an empty `## [Unreleased]` section at the top
+1. **Commit, tag, and create GitHub release**:
+   ```bash
+   git add pyproject.toml docs/CHANGELOG.md
+   git commit -m "chore: release v0.4.5"
+   git tag v0.4.5
+   git push origin main --tags
+   gh release create v0.4.5 --generate-notes
+   ```
+   The GitHub release event automatically triggers PyPI publishing.
+1. **Prepare for next development cycle**:
+   ```bash
+   # Update version to next dev version
+   # version = "0.4.6-dev"
+   git commit -am "chore: bump version to 0.4.6-dev"
+   git push
+   ```
+### Manual Production Release
+If you need to publish a release without creating a GitHub release:
+```bash
+gh workflow run publish.yml -f version_type=release
+```
+This validates that the version doesn't contain `-dev`, `-alpha`, or `-beta` suffixes.
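The suffix check described above can be pictured roughly as follows. This is a minimal sketch, not the publish workflow's actual implementation: the helper name `is_release_version` is hypothetical, and only the three suffixes named in the text are checked.

```python
# Hypothetical helper; the real check lives in the publish workflow and may differ.
PRE_RELEASE_MARKERS = ("-dev", "-alpha", "-beta")


def is_release_version(version: str) -> bool:
    """Return True when the version carries none of the pre-release markers."""
    return not any(marker in version for marker in PRE_RELEASE_MARKERS)


print(is_release_version("0.4.5"))      # True
print(is_release_version("0.4.6-dev"))  # False
```

A check like this would reject a manual production release if the `-dev` suffix was left in `pyproject.toml`.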
 ## Coding Style & Naming Conventions
 - Use 4-space indentation and type hints throughout; the package ships `py.typed`.

{slurm_sdk-0.4.5.dev0 → slurm_sdk-0.4.6.dev0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: slurm-sdk
-Version: 0.4.5.dev0
+Version: 0.4.6.dev0
 Summary: Pythonic SDK for Slurm.
 Author-email: Ville Kallioniemi <ville.kallioniemi@gmail.com>
 License-Expression: MIT
@@ -12,6 +12,7 @@ Requires-Dist: paramiko>=3.5.1
 Requires-Dist: requests>=2.32.3
 Requires-Dist: rich>=13.9.4
 Requires-Dist: tomli>=2.0.0; python_version < '3.11'
+Requires-Dist: tomlkit>=0.12
 Provides-Extra: tui
 Requires-Dist: pyyaml>=6.0; extra == 'tui'
 Requires-Dist: textual>=0.89.0; extra == 'tui'

{slurm_sdk-0.4.5.dev0 → slurm_sdk-0.4.6.dev0}/docs/CHANGELOG.md RENAMED

@@ -7,6 +7,117 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
+### Fixed
+- Fixed missing imports in parallel train-eval workflow tutorial
+- Added `JobContext` to API reference documentation
+- Updated import paths in tutorials and how-to guides to use public API
+  (`from slurm.callbacks import ...`) instead of internal modules
+- Removed unused imports from workflow graph visualization tutorial
+### Added
+- `parse_packaging_config()` as a public API for parsing packaging specification
+  strings into configuration dicts; previously available only as the private `_parse_packaging_config()`
+- `PackagingConfig` TypedDict in `slurm.packaging` documenting all valid packaging
+  configuration keys
+- `Job.snapshot()` method returning a frozen `JobSnapshot` dataclass with current
+  state, output tails, elapsed time, and terminal/success flags
+- `Job.tail()` method for live log streaming with configurable `output` parameter
+  accepting any writable IO object (`sys.stdout`, `io.StringIO`, file objects)
+- `BackendBase.tail_file()` method with implementations for SSH and local backends
+- `slurm jobs tail <job-id>` CLI command with `--stderr`, `--no-follow`, and
+  `--lines` options
+- Container image digest pinning via registry HTTP API; resolves digests with
+  a single HEAD request instead of pulling the full image
+- Usage examples in docstrings for `SlurmTask.__call__()`, `ArrayJob.get_results()`,
+  `WorkflowContext`, and `JobContext`
+- `llms.txt` file with complete API recipes, decision tree, and method signatures
+  for AI coding agent consumption
+- How-to guide for creating custom task and workflow decorators using existing
+  `@task`, `@workflow`, and `with_options()` APIs
+- `write_file()` and `close()` methods on `BackendBase` interface for unified
+  file operations and explicit resource cleanup
+- Pickle version headers for cross-version mismatch detection; result files now
+  include Python version and SDK version metadata, with clear warnings on mismatch
+- SSH lazy reconnection on transport errors (automatic retry once) and explicit
+  `cluster.reconnect()` for long-lived sessions (e.g. Jupyter notebooks)
+- `reconnect()` method on `BackendBase` interface (no-op for local backend)
+### Changed
+- Container packaging `use_digest` default changed from `False` to `True` for
+  reproducible deployments; pass `use_digest=False` to restore previous behavior
+- Pre-existing container images no longer require `docker pull` for digest
+  resolution when the registry API is accessible
+- Expanded container packaging explanation with details on multi-word Python
+  executables, container mounts, working directory, and array job naming
+- Restructured GPU, container dependency, and parallelization how-to guides
+  with proper problem statements, prerequisites, steps, and verification
+  sections following Diataxis how-to guide format
+- Input validation for `account` and `partition` sbatch options is now enforced
+  at submission time
+- Removed redundant SBATCH option normalization in `render_job_script()`
+- Merged `SlurmTaskWithDependencies` into `SlurmTask`; `.after()` now returns a
+  `SlurmTask` with bound dependencies. The `SlurmTaskWithDependencies` name is
+  kept as an alias for backward compatibility
+- Extracted `_resolve_cluster()` helper in context module, eliminating duplicated
+  context resolution logic across task submission methods
+- Consolidated packaging config resolution into `resolve_packaging_config()` with
+  documented precedence; eliminates duplicated logic between submission and
+  workflow dependency building
+- Replaced 12 positional parameters on `render_job_script()` with structured
+  `RenderContext` dataclass
+- Removed submission pipeline wrapper methods from `Cluster`; internal modules
+  now call extracted functions directly
+- `Job` now depends on `BackendBase` interface instead of `Cluster`; accepts
+  `backend` and `on_completed` keyword arguments. The `cluster` parameter is
+  kept for backward compatibility
+- `BackendBase` now provides `download_file()` (default: local copy) and
+  `hostname` class attribute (default: `"localhost"`)
+- Decomposed `cluster.py` into private modules (`_polling`, `_submission`,
+  `_workflow`) for maintainability; public API unchanged
+- Extracted private methods from `ContainerPackagingStrategy.prepare()` for
+  improved testability
+- Callback exceptions are now logged at WARNING level with full tracebacks
+  (previously logged at DEBUG)
+- SSH backend `host_key_policy` default changed from `"warn"` to `"reject"` for
+  improved security; pass `host_key_policy="warn"` to restore previous behavior
+- Extracted `_dispatch_callbacks()` helper on `Cluster` to deduplicate callback
+  dispatch logic
+- Consolidated `_runner_impl.py` into the `runner/` package; a thin
+  backwards-compatible shim remains for external references
+- Slurmfile TOML modification now uses `tomlkit` for proper round-trip parsing
+  instead of fragile line-by-line string manipulation
+### Removed
+- Removed unused `_runner_impl.py` backward-compatibility shim
+### Fixed
+- Corrected callback method names in callbacks and events explanation; expanded
+  from stub to comprehensive coverage of all 11 hooks, execution loci, and
+  serialization behavior
+- Moved `base64` import to module level in rendering to prevent potential
+  `NameError`
+- Temp file leak in `Job.get_result()` when downloading results via SSH; files
+  are now cleaned up in a `finally` block
+- Thread-safety issue with `Job` status cache; reads and writes to
+  `_status_cache`, `_status_cache_time`, and `_completed` are now protected
+  by an `RLock`
+- Race condition in `_job_pollers` dict access between main and poller threads
+- Job name validation and quoting in rendered sbatch scripts
+- Replaced deprecated `datetime.utcnow()` with `datetime.now(timezone.utc)`
+- Environment metadata files are now written with `0o600` permissions
+- `LocalBackend.execute_command()` no longer uses `shell=True`
+### Dependencies
+- Added `tomlkit>=0.12` as a required dependency
+## [0.4.5] - 2026-02-05
 ### Added
 - Interactive TUI commands (requires `pip install slurm-sdk[tui]`):

slurm_sdk-0.4.6.dev0/docs/explanation/callbacks_and_events.md ADDED

@@ -0,0 +1,215 @@
+# Callbacks and Events
+Callbacks let you observe packaging, submission, execution, and workflow events without changing task code. The SDK fires hooks at well-defined points in the job lifecycle, passing a typed context object that carries relevant metadata.
+## Lifecycle stages
+A single job passes through up to five stages. Packaging, submission, and execution each have a begin/end hook pair; status polling and completion each have a single hook:
+- **Packaging** (`on_begin_package_ctx` / `on_end_package_ctx`): Fires on the client while the SDK builds or resolves the deployment artifact (wheel or container image).
+- **Submission** (`on_begin_submit_job_ctx` / `on_end_submit_job_ctx`): Fires on the client immediately before and after the `sbatch` call.
+- **Execution** (`on_begin_run_job_ctx` / `on_end_run_job_ctx`): Fires on the runner (compute node) around the user function invocation.
+- **Status polling** (`on_job_status_update_ctx`): Fires on the client each time the SDK polls SLURM and observes a state change or the polling interval elapses.
+- **Completion** (`on_completed_ctx`): Fires when a job reaches a terminal state. By default this runs on both client and runner.
+Workflow orchestration adds three more hooks:
+- **Workflow begin/end** (`on_workflow_begin_ctx` / `on_workflow_end_ctx`): Fires on the runner around the workflow orchestrator logic, after the workflow job itself has started.
+- **Workflow task submitted** (`on_workflow_task_submitted_ctx`): Fires on the client each time the workflow submits a child task, enabling dependency-graph tracking.
+## All hooks
+The `BaseCallback` class defines 11 hooks. Each receives a single typed context argument:
+| #   | Hook method                      | Context type                | Description                                          |
+| --- | -------------------------------- | --------------------------- | ---------------------------------------------------- |
+| 1   | `on_begin_package_ctx`           | `PackagingBeginContext`     | Packaging is about to start                          |
+| 2   | `on_end_package_ctx`             | `PackagingEndContext`       | Packaging has completed                              |
+| 3   | `on_begin_submit_job_ctx`        | `SubmitBeginContext`        | Job is about to be submitted via sbatch              |
+| 4   | `on_end_submit_job_ctx`          | `SubmitEndContext`          | Job has been submitted; job ID is available          |
+| 5   | `on_job_status_update_ctx`       | `JobStatusUpdatedContext`   | Polling detected a status change or interval elapsed |
+| 6   | `on_begin_run_job_ctx`           | `RunBeginContext`           | Runner is about to invoke the user function          |
+| 7   | `on_end_run_job_ctx`             | `RunEndContext`             | User function has returned or raised                 |
+| 8   | `on_completed_ctx`               | `CompletedContext`          | Job reached a terminal SLURM state                   |
+| 9   | `on_workflow_begin_ctx`          | `WorkflowCallbackContext`   | Workflow orchestrator is starting                    |
+| 10  | `on_workflow_end_ctx`            | `WorkflowCallbackContext`   | Workflow orchestrator has finished                   |
+| 11  | `on_workflow_task_submitted_ctx` | `WorkflowTaskSubmitContext` | Workflow submitted a child task                      |
+## Callback timeline
+The diagram below shows the order in which hooks fire for a single job submission, with an optional workflow layer:
+```mermaid
+sequenceDiagram
+    participant Client
+    participant SLURM
+    participant Runner
+    rect rgb(230, 245, 255)
+        Note over Client: Client-side callbacks
+        Client->>Client: on_begin_package_ctx
+        Client->>Client: on_end_package_ctx
+        Client->>Client: on_begin_submit_job_ctx
+        Client->>SLURM: sbatch
+        SLURM-->>Client: job_id
+        Client->>Client: on_end_submit_job_ctx
+    end
+    rect rgb(240, 240, 255)
+        Note over Client: Client-side polling
+        loop poll_interval_secs
+            Client->>SLURM: squeue / sacct
+            SLURM-->>Client: status
+            Client->>Client: on_job_status_update_ctx
+        end
+    end
+    rect rgb(255, 245, 230)
+        Note over Runner: Runner-side callbacks
+        SLURM->>Runner: Start job
+        Runner->>Runner: on_begin_run_job_ctx
+        Runner->>Runner: Execute task function
+        Runner->>Runner: on_end_run_job_ctx
+        Runner->>Runner: on_completed_ctx (runner side)
+    end
+    rect rgb(230, 255, 230)
+        Note over Client: Completion
+        Client->>Client: on_completed_ctx (client side)
+    end
+    rect rgb(255, 240, 245)
+        Note over Runner: Workflow callbacks (runner-side)
+        Runner->>Runner: on_workflow_begin_ctx
+        loop For each child task
+            Runner->>Runner: on_workflow_task_submitted_ctx
+        end
+        Runner->>Runner: on_workflow_end_ctx
+    end
+```
+## Execution loci
+Every hook has a **default execution locus** that determines whether it fires on the client process, on the runner (compute node), or both. The SDK stores these defaults in `_DEFAULT_HOOK_LOCI`:
+| Hook                             | Default locus | Context type                |
+| -------------------------------- | ------------- | --------------------------- |
+| `on_begin_package_ctx`           | `CLIENT`      | `PackagingBeginContext`     |
+| `on_end_package_ctx`             | `CLIENT`      | `PackagingEndContext`       |
+| `on_begin_submit_job_ctx`        | `CLIENT`      | `SubmitBeginContext`        |
+| `on_end_submit_job_ctx`          | `CLIENT`      | `SubmitEndContext`          |
+| `on_job_status_update_ctx`       | `CLIENT`      | `JobStatusUpdatedContext`   |
+| `on_begin_run_job_ctx`           | `RUNNER`      | `RunBeginContext`           |
+| `on_end_run_job_ctx`             | `RUNNER`      | `RunEndContext`             |
+| `on_completed_ctx`               | `BOTH`        | `CompletedContext`          |
+| `on_workflow_begin_ctx`          | `RUNNER`      | `WorkflowCallbackContext`   |
+| `on_workflow_end_ctx`            | `RUNNER`      | `WorkflowCallbackContext`   |
+| `on_workflow_task_submitted_ctx` | `CLIENT`      | `WorkflowTaskSubmitContext` |
+The SDK calls `should_run_on_client(hook_name)` and `should_run_on_runner(hook_name)` to decide where each hook fires. For hooks with locus `BOTH`, the hook fires in both locations, and the `CompletedContext.emitted_by` field tells you which side emitted the current invocation.
+### Overriding the default locus with `execution_loci`
+You can override the default locus for any hook by setting the `execution_loci` dict on your callback subclass:
+```python
+class MyCallback(BaseCallback):
+    execution_loci = {
+        "on_completed_ctx": ExecutionLocus.CLIENT,  # only fire on client
+    }
+```
+This is a per-hook override. Any hook not listed in the dict falls back to its default from `_DEFAULT_HOOK_LOCI`. If a hook is not in either dict, it defaults to `CLIENT`.
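The lookup order described above (per-hook override, then the default table, then `CLIENT`) can be sketched in isolation. This is a stand-alone illustration under stated assumptions, not the SDK's code: the `ExecutionLocus` enum here is a local stand-in, `DEFAULT_HOOK_LOCI` copies only two rows of the table above, and `resolve_locus`, `runs_on_client`, and `runs_on_runner` are hypothetical names.

```python
from enum import Enum


class ExecutionLocus(Enum):
    """Local stand-in for the SDK enum; member names follow the table above."""

    CLIENT = "client"
    RUNNER = "runner"
    BOTH = "both"


# Illustrative copy of two rows from the defaults table.
DEFAULT_HOOK_LOCI = {
    "on_begin_run_job_ctx": ExecutionLocus.RUNNER,
    "on_completed_ctx": ExecutionLocus.BOTH,
}


def resolve_locus(hook_name: str, overrides: dict) -> ExecutionLocus:
    """Per-hook override first, then the default table, then CLIENT."""
    if hook_name in overrides:
        return overrides[hook_name]
    return DEFAULT_HOOK_LOCI.get(hook_name, ExecutionLocus.CLIENT)


def runs_on_client(hook_name: str, overrides: dict) -> bool:
    locus = resolve_locus(hook_name, overrides)
    return locus in (ExecutionLocus.CLIENT, ExecutionLocus.BOTH)


def runs_on_runner(hook_name: str, overrides: dict) -> bool:
    locus = resolve_locus(hook_name, overrides)
    return locus in (ExecutionLocus.RUNNER, ExecutionLocus.BOTH)


overrides = {"on_completed_ctx": ExecutionLocus.CLIENT}
print(runs_on_runner("on_completed_ctx", overrides))  # False
print(runs_on_runner("on_completed_ctx", {}))         # True
```

With the override in place, `on_completed_ctx` no longer fires on the runner, while hooks absent from the override dict keep their defaults.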
+## `requires_pickling`
+The `requires_pickling` class attribute controls whether the SDK serializes the callback and ships it to the runner alongside the job script. It defaults to `True`.
+Set `requires_pickling = False` when your callback only needs client-side hooks (packaging, submission, polling). This avoids serialization failures for callbacks that hold unpicklable references such as open file handles, database connections, or Rich consoles.
+When `requires_pickling` is `False`, runner-side hooks (`on_begin_run_job_ctx`, `on_end_run_job_ctx`, `on_workflow_begin_ctx`, `on_workflow_end_ctx`) will never fire for that callback because the callback object is not present on the compute node.
+## `poll_interval_secs`
+The `poll_interval_secs` class attribute controls SDK-managed status polling. When set to a positive number, the SDK spawns a background thread that periodically queries SLURM for the job's current state and fires `on_job_status_update_ctx` on each poll cycle.
+```python
+from slurm.callbacks import BaseCallback
+class ProgressCallback(BaseCallback):
+    requires_pickling = False  # client-only callback; never shipped to the runner
+    poll_interval_secs = 30.0  # check every 30 seconds
+```
+If `poll_interval_secs` is `None` (the default), no automatic polling occurs and `on_job_status_update_ctx` is never called.
+The `JobStatusUpdatedContext` passed to the hook includes the current SLURM status dict, the previous state string, and a boolean `is_terminal` flag that is `True` when the job has reached a final state (COMPLETED, FAILED, CANCELLED, etc.).
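To make the `is_terminal` flag concrete, the sketch below derives a comparable boolean from a status dict. It is an approximation under assumptions: the `job_state` key is borrowed from the custom callback example in this document, the state set is an illustrative subset of Slurm's terminal states, and the function is not the SDK's implementation.

```python
# Illustrative subset of Slurm terminal states; the SDK's own list may differ.
TERMINAL_STATES = frozenset(
    {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT", "NODE_FAIL", "OUT_OF_MEMORY"}
)


def is_terminal(status: dict) -> bool:
    """Return True when the status dict reports a final job state."""
    raw = str(status.get("job_state", "UNKNOWN"))
    # Accounting output can look like "CANCELLED by 1234"; keep the state token only.
    tokens = raw.split()
    state = tokens[0].upper() if tokens else "UNKNOWN"
    return state in TERMINAL_STATES


print(is_terminal({"job_state": "RUNNING"}))    # False
print(is_terminal({"job_state": "COMPLETED"}))  # True
```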
+## Serialization rules
+Callbacks that need to run on the runner must survive pickling. The SDK serializes them into the job directory so the runner process can reconstruct them. The rules are:
+1. **`requires_pickling = True` (default)**: The callback is pickled and sent to the runner. All runner-side hooks fire normally. If pickling fails, the SDK raises an error at submission time.
+1. **`requires_pickling = False`**: The callback stays on the client only. Runner-side hooks are silently skipped. Client-side hooks (packaging, submission, polling, and the client side of `on_completed_ctx`) still fire.
+1. **Hooks with locus `BOTH`**: Currently only `on_completed_ctx` defaults to `BOTH`. When `requires_pickling = False`, only the client-side invocation fires. When `requires_pickling = True`, the hook fires on both the runner (immediately after `on_end_run_job_ctx`) and on the client (when polling detects the terminal state).
+The runner reconstructs callbacks from the pickled file, calls the runner-side hooks in order, and discards the callback objects when the job finishes. The client-side callback instances are the original objects held in memory by the submitting process.
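When deciding whether a callback can keep `requires_pickling = True`, it can help to find out which of its attributes would fail to serialize. The sketch below uses the standard `pickle` module for that; it assumes plain `pickle` semantics (the SDK's serializer may behave differently), and both `unpicklable_attributes` and `LockHoldingCallback` are hypothetical names.

```python
import pickle
import threading


def unpicklable_attributes(obj) -> list:
    """Return the names of instance attributes that pickle cannot serialize."""
    bad = []
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception:  # pickle raises TypeError, PicklingError, and others
            bad.append(name)
    return bad


class LockHoldingCallback:
    """Hypothetical callback state: plain data plus a lock."""

    def __init__(self):
        self.events = []              # plain data, picklable
        self.lock = threading.Lock()  # thread lock, not picklable


print(unpicklable_attributes(LockHoldingCallback()))  # ['lock']
```

A non-empty result suggests either dropping the offending attribute or marking the callback as client-only.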
+## Custom callback example
+Below is a complete `BaseCallback` subclass that logs timing information for packaging and submission on the client, without needing to be serialized to the runner:
+```python
+import logging
+from slurm.callbacks import (
+    BaseCallback,
+    PackagingBeginContext,
+    PackagingEndContext,
+    SubmitEndContext,
+    JobStatusUpdatedContext,
+)
+logger = logging.getLogger(__name__)
+class TimingCallback(BaseCallback):
+    """Logs wall-clock durations for packaging and submission."""
+    requires_pickling = False
+    poll_interval_secs = 60.0
+    def on_begin_package_ctx(self, ctx: PackagingBeginContext) -> None:
+        self._pack_start = ctx.timestamp
+        logger.info("Packaging started for %s", ctx.task)
+    def on_end_package_ctx(self, ctx: PackagingEndContext) -> None:
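+        # Prefer the SDK-reported duration; fall back to our own timestamps.
+        # The explicit None check keeps a legitimate 0.0 duration from being discarded.
+        duration = ctx.duration
+        if duration is None:
+            duration = ctx.timestamp - self._pack_start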
+        logger.info("Packaging finished in %.1fs", duration)
+    def on_end_submit_job_ctx(self, ctx: SubmitEndContext) -> None:
+        logger.info("Job %s submitted to %s", ctx.job_id, ctx.target_job_dir)
+    def on_job_status_update_ctx(self, ctx: JobStatusUpdatedContext) -> None:
+        state = ctx.status.get("job_state", "UNKNOWN")
+        logger.info(
+            "Job %s state: %s (terminal=%s)", ctx.job_id, state, ctx.is_terminal
+        )
+```
+Register the callback when creating the cluster or submitting a job:
+```python
+from slurm import Cluster
+# `my_task` is any function decorated with `@task`.
+cluster = Cluster.from_file("Slurmfile", callbacks=[TimingCallback()])
+job = cluster.submit(my_task)
+```
+## Typical uses
+- **Structured logging and progress output**: Use client-side hooks to print Rich progress bars or write structured log lines.
+- **Dependency graph visualization**: Use `on_workflow_task_submitted_ctx` to capture parent-child edges and render a DAG.
+- **Custom metrics and telemetry**: Fire metrics to Prometheus, Datadog, or MLflow from `on_end_run_job_ctx`.
+- **Alerting on failure**: Check `RunEndContext.status` or `CompletedContext.job_state` and send notifications.
+- **Benchmarking**: Measure end-to-end wall time from `on_begin_package_ctx` through `on_completed_ctx`.
+## Further reading
+- [Callbacks reference](../reference/api/callbacks.md) for the full API surface of `BaseCallback` and all context dataclasses.
+- [How to create custom task and workflow decorators](../how-to/custom-task-decorators.md) for extending the SDK's decorator system.

slurm_sdk-0.4.6.dev0/docs/explanation/container_packaging.md ADDED

@@ -0,0 +1,87 @@
+# Container Packaging
+Container packaging is the default execution model. Tasks are built into a container image, pushed to a registry if needed, and executed on Slurm via Pyxis/enroot.
+## Build and resolve flow
+1. **Resolve image reference**: `ContainerPackagingStrategy._resolve_image_reference` picks a registry/name:tag.
+1. **Build image**: If a Dockerfile or build context is provided, the SDK runs `docker build` or `podman build`.
+1. **Push image**: Controlled by `packaging_push` and `packaging_registry`.
+1. **Convert for Pyxis**: Registry references are converted to enroot format when needed.
+## Runtime behavior
+- The job script exports `CONTAINER_IMAGE` for Pyxis.
+- `PY_EXEC` is set to the configured Python executable inside the container.
+- The runner executes with `srun --container-image` under the hood.
+## Multi-word Python executables
+When `python_executable` is a single word like `python`, the SDK sets `PY_EXEC` as a simple shell variable. However, when it contains multiple words (e.g., `uv run python`), the SDK stores it as a **bash array**:
+```bash
+# Single-word executable
+PY_EXEC='python'
+# Multi-word executable
+PY_EXEC=('uv' 'run' 'python')
+```
+The array is resolved with `PY_EXEC_RESOLVED="${PY_EXEC[*]}"` and expanded using `${PY_EXEC[@]}` in the execution command. Storing the command as an array keeps each word a separate argument. If a multi-word command were stored in a plain string variable and expanded inside quotes, the shell would look for an executable literally named `uv run python` rather than running `uv` with the arguments `run python`.
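The single-word versus multi-word rendering can be sketched in Python. This is an illustration of the rule described above, not the SDK's rendering code: `render_py_exec` and `single_quote` are hypothetical helpers, and the always-single-quoted style is chosen only to match the bash snippet shown earlier.

```python
import shlex


def single_quote(word: str) -> str:
    """Wrap a word in single quotes, escaping embedded single quotes for bash."""
    return "'" + word.replace("'", "'\\''") + "'"


def render_py_exec(python_executable: str) -> str:
    """Render a bash assignment: scalar for one word, array for several."""
    words = shlex.split(python_executable)
    if not words:
        raise ValueError("python_executable must not be empty")
    if len(words) == 1:
        return "PY_EXEC=" + single_quote(words[0])
    return "PY_EXEC=(" + " ".join(single_quote(w) for w in words) + ")"


print(render_py_exec("python"))         # PY_EXEC='python'
print(render_py_exec("uv run python"))  # PY_EXEC=('uv' 'run' 'python')
```

Splitting with `shlex.split` rather than `str.split` keeps quoted arguments containing spaces together as one array element.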
+## Container mounts
+The SDK automatically mounts the **job base directory** (the parent of the task-level directory tree) into the container with read-write access. This allows the runner to locate result files from dependent jobs when resolving `JobResultPlaceholder` objects.
+Additional mounts can be configured via the `packaging_mounts` task option. Mounts follow the standard `source:target:options` format:
+```python
+packaging_mounts=["/data:/data:ro", "/scratch:/scratch:rw"]
+```
+The SDK resolves shell expressions in mount paths so that job directory references remain valid inside the container.
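A mount string in `source:target:options` form can be taken apart as shown below. This is a minimal sketch: `Mount` and `parse_mount` are hypothetical names, and accepting a two-part `source:target` spec with empty options is an assumption rather than documented SDK behavior.

```python
from typing import NamedTuple


class Mount(NamedTuple):
    source: str
    target: str
    options: str


def parse_mount(spec: str) -> Mount:
    """Split a `source:target[:options]` mount spec into its parts."""
    parts = spec.split(":")
    if len(parts) == 2:
        # Assumption: options may be omitted.
        source, target = parts
        options = ""
    elif len(parts) == 3:
        source, target, options = parts
    else:
        raise ValueError(f"expected source:target[:options], got {spec!r}")
    if not source or not target:
        raise ValueError(f"source and target must be non-empty in {spec!r}")
    return Mount(source, target, options)


print(parse_mount("/data:/data:ro"))
```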
+## Container working directory
+The container's working directory is set to the job directory via the `--container-workdir` flag on `srun`. This means task code that uses relative paths will resolve them against the job directory inside the container. If `container_workdir` is explicitly configured, the SDK uses that value instead, and `{job_dir}` can be used as a placeholder token.
+## Array job container naming
+Each container gets a unique name based on the job's pre-submission identifier: `slurm-sdk-{pre_submission_id}`. For array jobs, the SLURM array task ID is appended as a suffix: `slurm-sdk-{pre_submission_id}_{task_id}`. This naming scheme prevents container name collisions across array elements and enables `slurm jobs connect` to find and attach to the correct container.
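The naming scheme above is simple enough to express as a small function. The sketch assumes the identifier and array task ID are already known; `container_name` is a hypothetical helper, not an SDK function.

```python
from typing import Optional


def container_name(pre_submission_id: str, array_task_id: Optional[int] = None) -> str:
    """Build the container name; array elements get a `_<task_id>` suffix."""
    name = f"slurm-sdk-{pre_submission_id}"
    if array_task_id is not None:
        name = f"{name}_{array_task_id}"
    return name


print(container_name("abc123"))     # slurm-sdk-abc123
print(container_name("abc123", 7))  # slurm-sdk-abc123_7
```

Testing `array_task_id is not None` rather than truthiness matters because array task ID `0` is valid and must still produce a suffix.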
+## Configuration knobs
+- `packaging_dockerfile`: Dockerfile path for builds.
+- `packaging_context`: Build context directory.
+- `packaging_registry`: Registry host/path for pushes and pulls.
+- `packaging_platform`: Target platform (e.g., `linux/amd64`).
+- `packaging_tls_verify`: TLS verification for registry access.
+- `packaging_runtime`: Explicit runtime (`docker` or `podman`).
+- `packaging_python_executable`: Python command inside the container (supports multi-word).
+- `packaging_mounts`: Additional bind mounts for the container.
+### Configuration example
+A complete task definition with container packaging options:
+```python
+from slurm.decorators import task
+@task(
+    time="01:00:00",
+    gpus_per_node=4,
+    packaging="container:my-registry.com/training:latest",
+    packaging_python_executable="uv run python",
+    packaging_mounts=["/data:/data:ro"],
+)
+def train(config: dict) -> dict:
+    # `run_training` stands in for your own training entry point.
+    return run_training(config)
+```
+## How workflows reuse images
+Workflow jobs export packaging config into `SLURM_SDK_PACKAGING_CONFIG`. Child tasks inherit the resolved image reference so they do not rebuild containers mid-workflow.
+## Design goals
+- Reproducible environments with minimal host coupling.
+- Explicit control over build/push/pull behavior.
+- Compatibility with Slurm + Pyxis/enroot deployments.

slurm_sdk-0.4.6.dev0/docs/how-to/container_dependencies.md ADDED

@@ -0,0 +1,160 @@
+# How to chain containerized tasks with dependencies
+## Problem
+You need to run a multi-phase pipeline (e.g., prepare, map, reduce) where all
+tasks run in the same container and each phase depends on the previous one
+completing successfully.
+## Prerequisites
+- A Slurm cluster with Pyxis/enroot installed
+- A container registry accessible from compute nodes
+- `slurm-sdk` installed locally
+## Steps
+### 1. Define a shared container image
+Create a single Dockerfile for all tasks in the pipeline:
+```dockerfile
+FROM python:3.11-slim
+WORKDIR /workspace
+COPY pyproject.toml README.md mkdocs.yml ./
+COPY src/ src/
+COPY docs/ docs/
+RUN pip install --no-cache-dir .
+```
+Set this as the cluster default so all tasks share it:
+```python
+from slurm import Cluster
+cluster = Cluster.from_args(
+    args,
+    default_packaging="container",
+    default_packaging_dockerfile="path/to/pipeline.Dockerfile",
+)
+```
+### 2. Define the pipeline tasks
+Define each phase as a separate task with the `@task` decorator:
+```python
+from slurm.decorators import task
+from typing import List
+@task(time="00:02:00", mem="256M", cpus_per_task=1)
+def prepare_data(num_chunks: int) -> List[dict]:
+    """Create data chunks for parallel processing."""
+    return [
+        {"chunk_id": i, "data": list(range(i * 100, (i + 1) * 100))}
+        for i in range(num_chunks)
+    ]
+@task(time="00:03:00", mem="256M", cpus_per_task=1)
+def process_chunk(chunk_id: int, data: List[int]) -> dict:
+    """Process a single data chunk (map phase)."""
+    return {
+        "chunk_id": chunk_id,
+        "count": len(data),
+        "sum": sum(data),
+    }
+@task(time="00:05:00", mem="512M", cpus_per_task=1)
+def aggregate_results(results: List[dict]) -> dict:
+    """Combine results from all chunks (reduce phase)."""
+    return {
+        "total_chunks": len(results),
+        "total_count": sum(r["count"] for r in results),
+        "total_sum": sum(r["sum"] for r in results),
+    }
+```
+### 3. Chain the tasks with dependencies
+Use `.after()` for sequential dependencies and `.map()` for the parallel phase:
+```python
+from slurm import Job
+from typing import List
+with cluster:
+    # Phase 1: Prepare data
+    prep_job: Job[List[dict]] = prepare_data(num_chunks=5)
+    prep_job.wait()
+    chunks = prep_job.get_result()
+    # Phase 2: Process chunks in parallel (array job)
+    # .after(prep_job) ensures map tasks wait for preparation
+    # .map(chunks) submits one task per chunk
+    map_jobs = process_chunk.after(prep_job).map(chunks)
+    map_jobs.wait()
+    map_results = map_jobs.get_results()
+    # Phase 3: Aggregate all results
+    # .after(map_jobs) waits for ALL map tasks to complete
+    reduce_job: Job[dict] = aggregate_results.after(map_jobs)(map_results)
+    reduce_job.wait()
+    final = reduce_job.get_result()
+```
+### 4. Run the built-in example
+The SDK includes a complete map-reduce example:
+```bash
+uv run python -m slurm.examples.map_reduce \
+  --hostname your-slurm-host \
+  --username $USER \
+  --partition debug \
+  --num-chunks 5 \
+  --packaging container \
+  --packaging-registry registry:5000/map-reduce \
+  --packaging-platform linux/amd64 \
+  --packaging-tls-verify false
+```
+Use `--num-chunks` to control the parallelism level.
+## Verification
+- All three phases should complete successfully in sequence.
+- The map phase should show tasks distributed across available nodes.
+- The final result should contain aggregated statistics:
+```
+Final Results:
+  Total Chunks: 5
+  Total Items:  500
+  Sum:          124750
+  Hosts Used:   3 (node001, node002, node003)
+```
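The expected totals can be checked locally before submitting anything. The sketch below reuses the three task bodies from step 2 as plain functions, without the `@task` decorator or a cluster, so it confirms only the arithmetic and not the scheduling or host distribution.

```python
from typing import List


def prepare_data(num_chunks: int) -> List[dict]:
    """Same body as the prepare task: 100 consecutive integers per chunk."""
    return [
        {"chunk_id": i, "data": list(range(i * 100, (i + 1) * 100))}
        for i in range(num_chunks)
    ]


def process_chunk(chunk_id: int, data: List[int]) -> dict:
    """Same body as the map task."""
    return {"chunk_id": chunk_id, "count": len(data), "sum": sum(data)}


def aggregate_results(results: List[dict]) -> dict:
    """Same body as the reduce task."""
    return {
        "total_chunks": len(results),
        "total_count": sum(r["count"] for r in results),
        "total_sum": sum(r["sum"] for r in results),
    }


chunks = prepare_data(5)
final = aggregate_results([process_chunk(**chunk) for chunk in chunks])
print(final)  # {'total_chunks': 5, 'total_count': 500, 'total_sum': 124750}
```

The totals match the verification output: five chunks of 100 items cover the integers 0 through 499, whose sum is 124750.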
+## Troubleshooting
+- **Map tasks fail to start**: Verify the prepare task completed
+  successfully before map tasks are submitted. Check that `.after(prep_job)`
+  is called before `.map(chunks)`.
+- **Reduce runs before map completes**: Ensure you pass the `map_jobs` array
+  to `.after()`, not a single job.
+- **Registry pull errors**: If compute nodes cannot pull images, point
+  `--packaging-registry` at a registry that is reachable from the compute nodes.
+## See also
+- [Map-reduce tutorial](../tutorials/map_reduce.md) for a guided walkthrough
+  of the full example
+- [Choosing a parallelization pattern](parallelization_patterns.md) for other
+  orchestration patterns
+- [Tasks and Workflows reference](../reference/api/tasks_workflows.md) for
+  `.map()` and `.after()` API details
