PyPI: py-cluster-api 0.2.4.tar.gz → 0.4.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (30)

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/.github/workflows/ci.yml RENAMED

@@ -18,12 +18,12 @@ jobs:
     runs-on: ubuntu-latest
     steps:
-    - uses: actions/checkout@v4
+    - uses: actions/checkout@v5
     - name: Set up Pixi
       uses: prefix-dev/setup-pixi@v0.9.0
       with:
-        pixi-version: v0.55.0
+        pixi-version: v0.65.0
         cache: true
     - name: Lint

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/CLAUDE.md RENAMED

@@ -2,6 +2,8 @@
 Generic Python library for submitting and monitoring jobs on HPC clusters. Wraps scheduler CLIs (bsub/bjobs/bkill) behind an async executor abstraction with an active polling monitor that fires callbacks on job completion. Inspired by dask-jobqueue's script templating and Nextflow's portable config profiles, but unlike dask-jobqueue, this library actively polls the scheduler rather than relying on workers phoning home.
+Key capabilities beyond submit/poll/cancel: `reconnect()` rediscovers running jobs after a process restart (requires `job_name_prefix`), and `cancel_by_name()` kills jobs by name pattern (LSF only).
 Founding principles: async-only API, executors are thin wrappers around scheduler CLIs, all state lives in `JobRecord` dataclasses tracked in-process, monitoring is poll-based via `bjobs -json`, and configuration uses Nextflow-style YAML profiles.
 Always use `pixi run` to run commands — never invoke python, pytest, ruff, or other tools directly.
@@ -22,7 +24,7 @@ pixi run check        # lint + test
 - `cluster_api/` — library source
   - `core.py` — abstract `Executor` base class
-  - `_types.py` — `JobStatus`, `JobRecord`, `ResourceSpec`, `JobExitCondition`, `ArrayElement`
+  - `_types.py` — `JobStatus`, `JobRecord`, `ResourceSpec` (`cpus`, `gpus`, …), `JobExitCondition`, `ArrayElement`
   - `config.py` — YAML config loader with profiles
   - `script.py` — script rendering (`render_script`) and writing (`write_script`)
   - `monitor.py` — async polling loop + callback dispatch
@@ -41,7 +43,7 @@ Explicit `stdout_path` / `stderr_path` in `ResourceSpec` override these defaults
 ## Testing
-All tests mock `Executor._call()` to avoid needing a real scheduler (except `test_local.py` which runs real subprocesses). Use `unittest.mock.patch` with `AsyncMock` for async method mocking.
+All tests mock `Executor._call()` to avoid needing a real scheduler (except `test_local.py` which runs real subprocesses, and `test_integration.py` which requires a live LSF cluster and is skipped by default). Use `unittest.mock.patch` with `AsyncMock` for async method mocking.
 ## Style

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: py-cluster-api
-Version: 0.2.4
+Version: 0.4.0
 Summary: Generic Python library for running jobs on HPC clusters
 Project-URL: Homepage, https://github.com/JaneliaSciComp/py-cluster-api
 Project-URL: Repository, https://github.com/JaneliaSciComp/py-cluster-api
@@ -54,12 +54,17 @@ Description-Content-Type: text/markdown
 [![CI](https://github.com/JaneliaSciComp/py-cluster-api/actions/workflows/ci.yml/badge.svg)](https://github.com/JaneliaSciComp/py-cluster-api/actions/workflows/ci.yml)
-A Python library for submitting and monitoring jobs on HPC clusters. Supports running arbitrary executables (Nextflow pipelines, Python scripts, Java tools, etc.) on LSF clusters and taking action when jobs complete via async callbacks.
+A Python library for submitting and monitoring jobs on HPC clusters. Supports running arbitrary executables (Nextflow pipelines, Python scripts, Java tools, etc.) on clusters and taking action when jobs complete via async callbacks.
+## Executors
+* Local Subprocess
+* IBM Platform LSF
+* We will accept PRs that implement and test additional executors (SLURM, etc.)
 ## Features
 - **Async-first** — built on `asyncio` for non-blocking job submission and monitoring
-- **LSF executor** — submit via `bsub`, monitor via `bjobs -json`, cancel via `bkill`
 - **Local executor** — run jobs as local subprocesses for development and testing, including array jobs
 - **Job monitoring** — polls the scheduler and fires callbacks on job completion, failure, or cancellation
 - **Job arrays** — submit array jobs with per-element log files
@@ -97,7 +102,7 @@ async def main():
     job = await executor.submit(
         command="nextflow run nf-core/rnaseq --input samples.csv",
         name="rnaseq-run",
-        resources=ResourceSpec(cpus=4, memory="32 GB", walltime="24:00", queue="long"),
+        resources=ResourceSpec(cpus=4, gpus=1, memory="32 GB", walltime="24:00", queue="long"),
         env={"NXF_WORK": "/scratch/work"},
     )
     job.on_success(lambda j: print(f"Done! Job {j.job_id}, peak mem: {j.max_mem}"))
@@ -131,6 +136,26 @@ async def run_array():
 The array index environment variable depends on the executor: LSF uses `$LSB_JOBINDEX`, while the local executor uses `$ARRAY_INDEX`.
+### Reconnecting After Restart
+If your process crashes or restarts, `reconnect()` rediscovers running jobs from the scheduler and resumes tracking them. Requires `job_name_prefix` to be set in config.
+```python
+async def resume():
+    executor = create_executor(profile="janelia_lsf")
+    monitor = JobMonitor(executor)
+    await monitor.start()
+    recovered = await executor.reconnect()
+    for job in recovered:
+        print(f"Reconnected to {job.job_id} ({job.name}), status={job.status}")
+        job.on_exit(lambda j: print(f"Job {j.job_id} finished: {j.status}"))
+    if recovered:
+        await monitor.wait_for(*recovered)
+    await monitor.stop()
+```
 ### Local Testing
 ```python
@@ -166,6 +191,7 @@ profiles:
   janelia_lsf:
     executor: lsf
     queue: normal
+    gpus: 1
     memory: "8 GB"
     walltime: "04:00"
     script_prologue:
@@ -182,15 +208,16 @@ profiles:
 |---|---|---|
 | `executor` | `"local"` | Backend: `lsf` or `local` |
 | `cpus` | `None` | Default CPU count |
+| `gpus` | `None` | Default GPU count |
 | `memory` | `None` | Default memory (e.g. `"8 GB"`) |
 | `walltime` | `None` | Default wall time (e.g. `"04:00"`) |
 | `queue` | `None` | Default queue/partition |
 | `poll_interval` | `10.0` | Seconds between status polls |
-| `job_name_prefix` | `"capi"` | Prefix for all job names |
+| `job_name_prefix` | `None` | Optional prefix prepended to job names. When set, polling filters by `{prefix}-*` and `reconnect()` is available; when unset, the user controls the full job name and polling queries all jobs |
 | `shebang` | `"#!/bin/bash"` | Script shebang line |
 | `script_prologue` | `[]` | Lines inserted before the command |
 | `script_epilogue` | `[]` | Lines inserted after the command |
-| `extra_directives` | `[]` | Additional scheduler flags (directive prefix added automatically) |
+| `extra_directives` | `[]` | Additional scheduler directive lines appended verbatim to the script header (e.g. `"#BSUB -P myproject"`) |
 | `directives_skip` | `[]` | Substrings to filter out of directives |
 | `extra_args` | `[]` | Extra CLI args appended to the submit command (e.g. `bsub`) |
 | `lsf_units` | `"MB"` | LSF memory units (`KB`, `MB`, `GB`) |
@@ -201,41 +228,7 @@ profiles:
 ## API Reference
-### `create_executor(profile=None, config_path=None, **overrides)`
-Factory function that loads config and returns an `Executor` instance.
-### `Executor`
-Abstract base class. Key methods:
-- `submit(command, name, resources=None, prologue=None, epilogue=None, env=None, metadata=None)` — submit a job, returns `JobRecord`
-- `submit_array(command, name, array_range, ...)` — submit a job array
-- `cancel(job_id)` — cancel a job by ID
-- `cancel_by_name(name_pattern)` — cancel by name pattern (LSF only)
-- `cancel_all()` — cancel all tracked jobs
-- `poll()` — query scheduler and update job statuses
-- `jobs` / `active_jobs` — properties returning tracked job dicts
-### `JobRecord`
-Tracks a submitted job. Fields include `job_id`, `name`, `status`, `exit_code`, `exec_host`, `max_mem`, `submit_time`, `start_time`, `finish_time`, and `metadata`.
-- `on_success(callback)` — register callback for exit code 0
-- `on_failure(callback)` — register callback for non-zero exit
-- `on_exit(callback, condition=ANY)` — register callback for any exit condition
-- `is_terminal` — whether the job has finished
-### `JobMonitor`
-Async polling loop that drives status updates and callback dispatch.
-- `start()` / `stop()` — control the polling loop
-- `wait_for(*records, timeout=None)` — block until jobs reach a terminal state
-### `ResourceSpec`
-Resource requirements: `cpus`, `memory`, `walltime`, `queue`, `work_dir`, `stdout_path`, `stderr_path`, `extra_directives`, `extra_args`.
+See [docs/API.md](docs/API.md) for the full API reference and error handling guide.
 ## Development

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/README.md RENAMED

@@ -2,12 +2,17 @@
 [![CI](https://github.com/JaneliaSciComp/py-cluster-api/actions/workflows/ci.yml/badge.svg)](https://github.com/JaneliaSciComp/py-cluster-api/actions/workflows/ci.yml)
-A Python library for submitting and monitoring jobs on HPC clusters. Supports running arbitrary executables (Nextflow pipelines, Python scripts, Java tools, etc.) on LSF clusters and taking action when jobs complete via async callbacks.
+A Python library for submitting and monitoring jobs on HPC clusters. Supports running arbitrary executables (Nextflow pipelines, Python scripts, Java tools, etc.) on clusters and taking action when jobs complete via async callbacks.
+## Executors
+* Local Subprocess
+* IBM Platform LSF
+* We will accept PRs that implement and test additional executors (SLURM, etc.)
 ## Features
 - **Async-first** — built on `asyncio` for non-blocking job submission and monitoring
-- **LSF executor** — submit via `bsub`, monitor via `bjobs -json`, cancel via `bkill`
 - **Local executor** — run jobs as local subprocesses for development and testing, including array jobs
 - **Job monitoring** — polls the scheduler and fires callbacks on job completion, failure, or cancellation
 - **Job arrays** — submit array jobs with per-element log files
@@ -45,7 +50,7 @@ async def main():
     job = await executor.submit(
         command="nextflow run nf-core/rnaseq --input samples.csv",
         name="rnaseq-run",
-        resources=ResourceSpec(cpus=4, memory="32 GB", walltime="24:00", queue="long"),
+        resources=ResourceSpec(cpus=4, gpus=1, memory="32 GB", walltime="24:00", queue="long"),
         env={"NXF_WORK": "/scratch/work"},
     )
     job.on_success(lambda j: print(f"Done! Job {j.job_id}, peak mem: {j.max_mem}"))
@@ -79,6 +84,26 @@ async def run_array():
 The array index environment variable depends on the executor: LSF uses `$LSB_JOBINDEX`, while the local executor uses `$ARRAY_INDEX`.
+### Reconnecting After Restart
+If your process crashes or restarts, `reconnect()` rediscovers running jobs from the scheduler and resumes tracking them. Requires `job_name_prefix` to be set in config.
+```python
+async def resume():
+    executor = create_executor(profile="janelia_lsf")
+    monitor = JobMonitor(executor)
+    await monitor.start()
+    recovered = await executor.reconnect()
+    for job in recovered:
+        print(f"Reconnected to {job.job_id} ({job.name}), status={job.status}")
+        job.on_exit(lambda j: print(f"Job {j.job_id} finished: {j.status}"))
+    if recovered:
+        await monitor.wait_for(*recovered)
+    await monitor.stop()
+```
 ### Local Testing
 ```python
@@ -114,6 +139,7 @@ profiles:
   janelia_lsf:
     executor: lsf
     queue: normal
+    gpus: 1
     memory: "8 GB"
     walltime: "04:00"
     script_prologue:
@@ -130,15 +156,16 @@ profiles:
 |---|---|---|
 | `executor` | `"local"` | Backend: `lsf` or `local` |
 | `cpus` | `None` | Default CPU count |
+| `gpus` | `None` | Default GPU count |
 | `memory` | `None` | Default memory (e.g. `"8 GB"`) |
 | `walltime` | `None` | Default wall time (e.g. `"04:00"`) |
 | `queue` | `None` | Default queue/partition |
 | `poll_interval` | `10.0` | Seconds between status polls |
-| `job_name_prefix` | `"capi"` | Prefix for all job names |
+| `job_name_prefix` | `None` | Optional prefix prepended to job names. When set, polling filters by `{prefix}-*` and `reconnect()` is available; when unset, the user controls the full job name and polling queries all jobs |
 | `shebang` | `"#!/bin/bash"` | Script shebang line |
 | `script_prologue` | `[]` | Lines inserted before the command |
 | `script_epilogue` | `[]` | Lines inserted after the command |
-| `extra_directives` | `[]` | Additional scheduler flags (directive prefix added automatically) |
+| `extra_directives` | `[]` | Additional scheduler directive lines appended verbatim to the script header (e.g. `"#BSUB -P myproject"`) |
 | `directives_skip` | `[]` | Substrings to filter out of directives |
 | `extra_args` | `[]` | Extra CLI args appended to the submit command (e.g. `bsub`) |
 | `lsf_units` | `"MB"` | LSF memory units (`KB`, `MB`, `GB`) |
@@ -149,41 +176,7 @@ profiles:
 ## API Reference
-### `create_executor(profile=None, config_path=None, **overrides)`
-Factory function that loads config and returns an `Executor` instance.
-### `Executor`
-Abstract base class. Key methods:
-- `submit(command, name, resources=None, prologue=None, epilogue=None, env=None, metadata=None)` — submit a job, returns `JobRecord`
-- `submit_array(command, name, array_range, ...)` — submit a job array
-- `cancel(job_id)` — cancel a job by ID
-- `cancel_by_name(name_pattern)` — cancel by name pattern (LSF only)
-- `cancel_all()` — cancel all tracked jobs
-- `poll()` — query scheduler and update job statuses
-- `jobs` / `active_jobs` — properties returning tracked job dicts
-### `JobRecord`
-Tracks a submitted job. Fields include `job_id`, `name`, `status`, `exit_code`, `exec_host`, `max_mem`, `submit_time`, `start_time`, `finish_time`, and `metadata`.
-- `on_success(callback)` — register callback for exit code 0
-- `on_failure(callback)` — register callback for non-zero exit
-- `on_exit(callback, condition=ANY)` — register callback for any exit condition
-- `is_terminal` — whether the job has finished
-### `JobMonitor`
-Async polling loop that drives status updates and callback dispatch.
-- `start()` / `stop()` — control the polling loop
-- `wait_for(*records, timeout=None)` — block until jobs reach a terminal state
-### `ResourceSpec`
-Resource requirements: `cpus`, `memory`, `walltime`, `queue`, `work_dir`, `stdout_path`, `stderr_path`, `extra_directives`, `extra_args`.
+See [docs/API.md](docs/API.md) for the full API reference and error handling guide.
 ## Development

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/cluster_api/_types.py RENAMED

@@ -34,7 +34,25 @@ _TERMINAL_STATUSES = frozenset({JobStatus.DONE, JobStatus.FAILED, JobStatus.KILL
 @dataclass
 class ResourceSpec:
-    """Resource requirements for a job."""
+    """Resource requirements for a job.
+    Fields:
+        cpus: Number of CPU cores to request.
+        gpus: Number of GPUs to request.
+        memory: Memory limit as a string with unit, e.g. ``"16GB"`` or ``"500MB"``.
+            Passed directly to the scheduler directive.
+        walltime: Wall-clock time limit, e.g. ``"1:00"`` (h:mm) or ``"24:00:00"``.
+            Format depends on the target scheduler.
+        queue: Scheduler queue / partition name.
+        work_dir: Working directory for the job (defaults to ``os.getcwd()``).
+        stdout_path: Explicit path for stdout log. Overrides the executor's
+            default log naming (see CLAUDE.md § Log File Naming).
+        stderr_path: Explicit path for stderr log. Same override behaviour.
+        extra_directives: Raw scheduler directives injected into the job script
+            header (e.g. ``["#BSUB -R 'rusage[mem=16GB]'"]``).
+        extra_args: Extra command-line arguments appended to the submit command
+            (e.g. ``["-q", "gpu"]`` for ``bsub``).
+    """
     cpus: int | None = None
     gpus: int | None = None
@@ -149,8 +167,8 @@ class JobRecord:
             return JobStatus.RUNNING
         # All expected elements accounted for and terminal
-        if JobStatus.KILLED in statuses:
-            return JobStatus.KILLED
         if JobStatus.FAILED in statuses:
             return JobStatus.FAILED
+        if JobStatus.KILLED in statuses:
+            return JobStatus.KILLED
         return JobStatus.DONE
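
Review note: the hunk above swaps the order of the two terminal checks, so an array job with both failed and killed elements now reports `FAILED` rather than `KILLED`. A minimal standalone sketch of that precedence follows. The `JobStatus` enum here is a stand-in with only the three terminal members, and `aggregate_terminal` is a hypothetical helper name, not a function from the package.

```python
from enum import Enum


class JobStatus(Enum):
    # Stand-in for the package's enum; only terminal members are modelled.
    DONE = "done"
    FAILED = "failed"
    KILLED = "killed"


def aggregate_terminal(statuses):
    """Collapse terminal element statuses into one job status.

    Mirrors the 0.4.0 ordering shown in the hunk: FAILED is checked
    before KILLED, and DONE is the fallback.
    """
    if JobStatus.FAILED in statuses:
        return JobStatus.FAILED
    if JobStatus.KILLED in statuses:
        return JobStatus.KILLED
    return JobStatus.DONE


# A mixed array: one element failed, one was killed, one finished.
mixed = {JobStatus.DONE, JobStatus.FAILED, JobStatus.KILLED}
result = aggregate_terminal(mixed)  # FAILED under the new ordering
```

Under the 0.2.4 ordering the same input would have produced `KILLED`, which is the behavioural difference callers relying on `on_failure` callbacks may want to check.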

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/cluster_api/config.py RENAMED

@@ -58,6 +58,7 @@ class ClusterConfig:
     completed_retention_minutes: float = 10.0
     command_timeout: float = 100.0
     suppress_job_email: bool = True
+    poll_all_users: bool = False
 _CONFIG_SEARCH_PATHS = [
@@ -109,7 +110,12 @@ def load_config(
     profiles = raw.pop("profiles", {})
-    if profile and profile in profiles:
+    if profile:
+        if profile not in profiles:
+            available = ", ".join(sorted(profiles)) if profiles else "(none)"
+            raise ValueError(
+                f"Unknown profile {profile!r}; available profiles: {available}"
+            )
         raw = {**raw, **profiles[profile]}
     if overrides:
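
Review note: in 0.2.4 an unknown profile name was silently ignored; the hunk above makes it an error. The sketch below isolates just that merge step so the new behaviour can be tried without a config file. `resolve_profile` is a hypothetical name used for illustration; the real logic lives inside `load_config`, whose full body is not shown in this diff.

```python
from __future__ import annotations


def resolve_profile(raw: dict, profile: str | None) -> dict:
    """Merge a named profile over the top-level config values.

    Follows the 0.4.0 hunk: an unknown profile raises ValueError
    listing the available profile names.
    """
    raw = dict(raw)  # copy so the caller's dict is not mutated
    profiles = raw.pop("profiles", {})
    if profile:
        if profile not in profiles:
            available = ", ".join(sorted(profiles)) if profiles else "(none)"
            raise ValueError(
                f"Unknown profile {profile!r}; available profiles: {available}"
            )
        raw = {**raw, **profiles[profile]}
    return raw


config = {
    "executor": "local",
    "profiles": {"janelia_lsf": {"executor": "lsf", "queue": "normal"}},
}

merged = resolve_profile(config, "janelia_lsf")

try:
    resolve_profile(config, "typo_profile")
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

A misspelled profile name now fails at load time instead of quietly falling back to the top-level defaults.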

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/cluster_api/core.py RENAMED

@@ -7,8 +7,6 @@ import asyncio
 import logging
 import os
 import re
-import secrets
-import string
 from datetime import datetime, timezone
 from typing import Any
@@ -18,8 +16,10 @@ from ._types import ArrayElement, JobRecord, JobStatus, ResourceSpec
 logger = logging.getLogger(__name__)
+# Check for array element IDs like "12345[1]"
 _ARRAY_ELEMENT_RE = re.compile(r"^(.+)\[(\d+)\]$")
+# Check for job names that are unsafe in scheduler job names
 _UNSAFE_NAME_RE = re.compile(r"[^\w\-.]")
@@ -29,7 +29,37 @@ def _sanitize_job_name(name: str) -> str:
 class Executor(abc.ABC):
-    """Abstract base for cluster job executors."""
+    """Abstract base for cluster job executors.
+    Lifecycle:
+        1. **Construct** — instantiate with a ``ClusterConfig``.
+        2. **Submit** — call :meth:`submit` or :meth:`submit_array` to enqueue
+           jobs.  Each returns a :class:`JobRecord` tracked in-process.
+        3. **Poll** — call :meth:`poll` (usually via :class:`~cluster_api.monitor.Monitor`)
+           to query the scheduler and update every tracked ``JobRecord``.
+        4. **Cancel** — call :meth:`cancel`, :meth:`cancel_all`, or
+           :meth:`cancel_by_name` to kill running jobs.
+    Subclass requirements:
+        Must implement:
+            - :meth:`_submit_job` — run the scheduler submit command.
+            - :meth:`_build_status_args` — build the CLI args for a status query.
+            - :meth:`_parse_job_statuses` — parse status output into per-job dicts.
+        May override:
+            - :meth:`_submit_array_job` — array submission (default delegates
+              to ``_submit_job``).
+            - :meth:`_cancel_job` — cancel a single job.
+            - :meth:`cancel_by_name` — cancel by name pattern.
+            - :meth:`reconnect` — rediscover running jobs after restart.
+    Class attributes:
+        submit_command: CLI executable used for submission (e.g. ``"bsub"``).
+        cancel_command: CLI executable used for cancellation (e.g. ``"bkill"``).
+        status_command: CLI executable used for status queries (e.g. ``"bjobs"``).
+        job_id_regexp: Regex with a ``job_id`` named group, applied to submit
+            output to extract the job ID.
+    """
     submit_command: str
     cancel_command: str
@@ -39,13 +69,7 @@ class Executor(abc.ABC):
     def __init__(self, config: ClusterConfig) -> None:
         self.config = config
         self._jobs: dict[str, JobRecord] = {}
-        if config.job_name_prefix:
-            self._prefix = config.job_name_prefix
-        else:
-            # Generate a random prefix so concurrent users/sessions don't
-            # see each other's jobs when polling by name.
-            alphabet = string.ascii_lowercase + string.digits
-            self._prefix = "".join(secrets.choice(alphabet) for _ in range(5))
+        self._prefix = config.job_name_prefix  # None if not configured
     # --- Submission ---
@@ -61,7 +85,7 @@ class Executor(abc.ABC):
     ) -> JobRecord:
         """Submit a job to the scheduler."""
         resources = resources or ResourceSpec()
-        full_name = _sanitize_job_name(f"{self._prefix}-{name}")
+        full_name = _sanitize_job_name(f"{self._prefix}-{name}" if self._prefix else name)
         job_id, script_path = await self._submit_job(
             command, full_name, resources, prologue, epilogue, env,
@@ -96,7 +120,7 @@ class Executor(abc.ABC):
     ) -> JobRecord:
         """Submit a job array to the scheduler."""
         resources = resources or ResourceSpec()
-        full_name = _sanitize_job_name(f"{self._prefix}-{name}")
+        full_name = _sanitize_job_name(f"{self._prefix}-{name}" if self._prefix else name)
         job_id, script_path = await self._submit_array_job(
             command, full_name, array_range, resources, prologue, epilogue,
@@ -172,18 +196,26 @@ class Executor(abc.ABC):
     # --- Cancellation ---
-    async def cancel(self, job_id: str) -> None:
-        """Cancel a job by ID."""
-        cmd = [self.cancel_command, job_id]
-        logger.debug("Running: %s", " ".join(cmd))
-        await self._call(cmd, timeout=self.config.command_timeout)
+    async def cancel(self, job_id: str, *, done: bool = False) -> None:
+        """Cancel a job by ID.
+        Args:
+            job_id: The job ID to cancel.
+            done: If True, mark the job as DONE instead of KILLED.
+                  Subclasses may translate this into scheduler-specific flags.
+        """
+        await self._cancel_job(job_id, done=done)
         if job_id in self._jobs:
-            self._jobs[job_id].status = JobStatus.KILLED
-        logger.info("Cancelled job %s", job_id)
+            self._jobs[job_id].status = JobStatus.DONE if done else JobStatus.KILLED
+        logger.info("Cancelled job %s (done=%s)", job_id, done)
+    async def _cancel_job(self, job_id: str, *, done: bool = False) -> None:
+        """Run the scheduler cancel command. Must be implemented by subclasses."""
+        raise NotImplementedError("cancel is not supported by this executor")
     async def cancel_by_name(self, name_pattern: str) -> None:
         """Cancel jobs by name pattern. Override in subclasses for native support."""
-        raise NotImplementedError("cancel_by_name not supported by this executor")
+        raise NotImplementedError("cancel_by_name is not supported by this executor")
     async def reconnect(self) -> list[JobRecord]:
         """Reconnect to running jobs and resume tracking them.
@@ -195,12 +227,12 @@ class Executor(abc.ABC):
         Returns:
             List of newly created ``JobRecord`` instances.
         """
-        raise NotImplementedError("reconnect not supported by this executor")
+        raise NotImplementedError("reconnect is not supported by this executor")
-    async def cancel_all(self) -> None:
+    async def cancel_all(self, *, done: bool = False) -> None:
         """Cancel all tracked jobs."""
         to_cancel = [jid for jid, r in self._jobs.items() if not r.is_terminal]
-        await asyncio.gather(*(self.cancel(jid) for jid in to_cancel))
+        await asyncio.gather(*(self.cancel(jid, done=done) for jid in to_cancel))
     # --- Status polling ---
@@ -272,9 +304,9 @@ class Executor(abc.ABC):
     @staticmethod
     async def _call(
         cmd: list[str],
-        shell: bool = False,
         timeout: float = 100.0,
         env: dict[str, str] | None = None,
+        stdin_file: str | None = None,
     ) -> str:
         """Run a subprocess and return stdout.
@@ -284,31 +316,32 @@ class Executor(abc.ABC):
         if env:
             full_env = {**os.environ, **env}
-        if shell:
-            proc = await asyncio.create_subprocess_shell(
-                cmd if isinstance(cmd, str) else " ".join(cmd),
-                stdout=asyncio.subprocess.PIPE,
-                stderr=asyncio.subprocess.PIPE,
-                env=full_env,
-            )
-        else:
+        stdin_fh = None
+        try:
+            if stdin_file:
+                stdin_fh = open(stdin_file)  # noqa: SIM115
             proc = await asyncio.create_subprocess_exec(
                 *cmd,
+                stdin=stdin_fh,
                 stdout=asyncio.subprocess.PIPE,
                 stderr=asyncio.subprocess.PIPE,
                 env=full_env,
             )
-        try:
-            stdout, stderr = await asyncio.wait_for(
-                proc.communicate(),
-                timeout=timeout,
-            )
-        except asyncio.TimeoutError:
-            proc.kill()
-            raise CommandTimeoutError(
-                f"Command timed out after {timeout}s: {cmd}"
-            )
+            try:
+                stdout, stderr = await asyncio.wait_for(
+                    proc.communicate(),
+                    timeout=timeout,
+                )
+            except asyncio.TimeoutError:
+                proc.kill()
+                await proc.wait()
+                raise CommandTimeoutError(
+                    f"Command timed out after {timeout}s: {cmd}"
+                )
+        finally:
+            if stdin_fh:
+                stdin_fh.close()
         out = stdout.decode().strip()
         err = stderr.decode().strip()
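
Review note: the `_call` change above drops the `shell=True` path and instead feeds a file to the child's stdin, which is how `bsub < script` can be expressed without a shell. The following is a rough, self-contained sketch of the same pattern using only `asyncio`. `run_with_stdin_file` and `demo` are hypothetical names, and the child is a Python one-liner standing in for a scheduler CLI.

```python
import asyncio
import os
import sys
import tempfile


async def run_with_stdin_file(cmd, stdin_file, timeout=10.0):
    """Run cmd with stdin redirected from a file and return stripped stdout."""
    with open(stdin_file) as fh:
        proc = await asyncio.create_subprocess_exec(
            *cmd,
            stdin=fh,  # the child reads the file directly, no shell involved
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        try:
            stdout, _stderr = await asyncio.wait_for(
                proc.communicate(), timeout=timeout
            )
        except asyncio.TimeoutError:
            proc.kill()
            await proc.wait()  # reap the child, as the 0.4.0 hunk now does
            raise
    return stdout.decode().strip()


def demo():
    # Write a throwaway "script" and pipe it into a child that upper-cases it.
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as tmp:
        tmp.write("hello\n")
        path = tmp.name
    child = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper())"]
    try:
        return asyncio.run(run_with_stdin_file(child, path))
    finally:
        os.unlink(path)
```

Because the command is passed as an argument list, paths and extra arguments containing spaces or shell metacharacters are not reinterpreted, which is presumably the motivation for removing the shell code path.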

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/cluster_api/executors/local.py RENAMED

@@ -88,6 +88,12 @@ class LocalExecutor(Executor):
         cwd: str | None = None,
     ) -> tuple[str, str | None]:
         """Spawn one subprocess per array element with ARRAY_INDEX env var."""
+        if max_concurrent is not None:
+            logger.warning(
+                "LocalExecutor does not support max_concurrent; "
+                "all %d elements will run simultaneously",
+                array_range[1] - array_range[0] + 1,
+            )
         header = self.build_header(name, resources)
         script = render_script(self.config, command, header, prologue, epilogue)
         script_path = write_script(resources.work_dir, script, name, next(self._script_counter))
@@ -191,37 +197,43 @@ class LocalExecutor(Executor):
         return {jid: r.status for jid, r in self._jobs.items()}
-    async def cancel(self, job_id: str) -> None:
+    async def cancel(self, job_id: str, *, done: bool = False) -> None:
         """Terminate a local subprocess (or all element processes for an array job)."""
-        # Kill single-job process if present
+        # Collect all live processes for this job (single + array elements)
+        live: list[tuple[str, asyncio.subprocess.Process]] = []
         proc = self._processes.get(job_id)
         if proc and proc.returncode is None:
-            proc.terminate()
-            try:
-                await asyncio.wait_for(proc.wait(), timeout=5.0)
-            except asyncio.TimeoutError:
-                proc.kill()
-        self._close_output_files(job_id)
-        # Kill array element processes matching "{job_id}[*]"
+            live.append((job_id, proc))
         prefix = f"{job_id}["
         for key, proc in self._processes.items():
             if key.startswith(prefix) and proc.returncode is None:
-                proc.terminate()
-                try:
-                    await asyncio.wait_for(proc.wait(), timeout=5.0)
-                except asyncio.TimeoutError:
-                    proc.kill()
-                self._close_output_files(key)
+                live.append((key, proc))
+        # Send SIGTERM to all, then wait concurrently
+        for _key, p in live:
+            p.terminate()
+        if live:
+            tasks = [asyncio.ensure_future(p.wait()) for _key, p in live]
+            _, pending = await asyncio.wait(tasks, timeout=5.0)
+            # SIGKILL any that didn't exit in time
+            for _key, p in live:
+                if p.returncode is None:
+                    p.kill()
+            # Reap the killed processes
+            if pending:
+                await asyncio.wait(pending, timeout=5.0)
+        for key, _p in live:
+            self._close_output_files(key)
+        target_status = JobStatus.DONE if done else JobStatus.KILLED
         if job_id in self._jobs:
             record = self._jobs[job_id]
-            record.status = JobStatus.KILLED
+            record.status = target_status
             for elem in record.array_elements.values():
                 if elem.status not in {JobStatus.DONE, JobStatus.FAILED, JobStatus.KILLED}:
-                    elem.status = JobStatus.KILLED
-        logger.info("Cancelled local job %s", job_id)
+                    elem.status = target_status
+        logger.info("Cancelled local job %s (done=%s)", job_id, done)
     def _open_output_files(
         self,
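
Review note: the rewritten `LocalExecutor.cancel` sends SIGTERM to every live process first and then waits on all of them together, so the worst case is one grace period rather than one per array element. A minimal sketch of that terminate, wait, kill sequence is below. `terminate_all` and `demo` are hypothetical names, and the sleeping Python children stand in for real job processes.

```python
import asyncio
import sys


async def terminate_all(procs, grace=5.0):
    """Terminate all processes, then kill any that outlive the grace period."""
    for p in procs:
        p.terminate()
    if not procs:
        return
    # Wait on every process concurrently instead of one at a time.
    tasks = [asyncio.ensure_future(p.wait()) for p in procs]
    _done, pending = await asyncio.wait(tasks, timeout=grace)
    for p in procs:
        if p.returncode is None:
            p.kill()
    if pending:
        # Reap whatever had to be killed.
        await asyncio.wait(pending, timeout=grace)


async def demo():
    sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
    procs = [await asyncio.create_subprocess_exec(*sleeper) for _ in range(3)]
    await terminate_all(procs)
    return [p.returncode for p in procs]
```

With the 0.2.4 loop, three unresponsive elements could take up to three sequential 5-second waits; in this arrangement they share a single wait.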

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/cluster_api/executors/lsf.py RENAMED

@@ -11,7 +11,7 @@ import re
 from datetime import datetime, timezone
 from typing import Any
-from .._types import ArrayElement, JobRecord, JobStatus, ResourceSpec
+from .._types import ArrayElement, JobRecord, JobStatus, ResourceSpec, _TERMINAL_STATUSES
 from ..config import ClusterConfig, parse_memory_bytes
 from ..core import Executor, _ARRAY_ELEMENT_RE
 from ..exceptions import ClusterAPIError, CommandFailedError
@@ -82,8 +82,8 @@ class LSFExecutor(Executor):
         out = resources.stdout_path or f"{resources.work_dir}/stdout.%J.log"
         err = resources.stderr_path or f"{resources.work_dir}/stderr.%J.log"
-        lines.append(f"{p} -o {out}")
-        lines.append(f"{p} -e {err}")
+        lines.append(f'{p} -o "{out}"')
+        lines.append(f'{p} -e "{err}"')
         # Queue
         queue = resources.queue or self.config.queue
@@ -116,7 +116,7 @@ class LSFExecutor(Executor):
             lines.append(f"{p} -W {walltime}")
         # Working directory
-        lines.append(f"{p} -cwd {resources.work_dir}")
+        lines.append(f'{p} -cwd "{resources.work_dir}"')
         # Custom cluster options
         if resources.extra_directives:
@@ -145,12 +145,11 @@ class LSFExecutor(Executor):
     ) -> str:
         """Run bsub with a script file and return raw output."""
         submit_env = self._build_submit_env(env)
-        extra = " ".join(extra_args) + " " if extra_args else ""
-        cmd = f"{self.submit_command} {extra}< {script_path}"
-        logger.debug("Running: %s", cmd)
+        cmd = [self.submit_command, *(extra_args or [])]
+        logger.debug("Running: %s < %s", cmd, script_path)
         return await self._call(
             cmd,
-            shell=True,
+            stdin_file=script_path,
             env=submit_env,
             timeout=self.config.command_timeout,
         )
@@ -220,14 +219,12 @@ class LSFExecutor(Executor):
     def _build_status_args(self) -> list[str]:
         """Build bjobs command with JSON output."""
-        prefix = self._prefix
-        args = [
-            self.status_command,
-            "-J", f"{prefix}-*",
-            "-a",
-            "-o", _BJOBS_FIELDS,
-            "-json",
-        ]
+        args = [self.status_command]
+        if self.config.poll_all_users:
+            args.extend(["-u", "all"])
+        if self._prefix:
+            args.extend(["-J", f"{self._prefix}-*"])
+        args.extend(["-a", "-o", _BJOBS_FIELDS, "-json"])
         return args
     def _parse_job_statuses(
@@ -282,11 +279,26 @@ class LSFExecutor(Executor):
         return result
+    async def _cancel_job(self, job_id: str, *, done: bool = False) -> None:
+        """Run bkill, with ``-d`` when *done* is True."""
+        cmd = [self.cancel_command]
+        if done:
+            cmd.append("-d")
+        cmd.append(job_id)
+        logger.debug("Running: %s", " ".join(cmd))
+        await self._call(cmd, timeout=self.config.command_timeout)
     async def cancel_by_name(self, name_pattern: str) -> None:
         """Cancel jobs matching name pattern via bkill -J."""
         cmd = [self.cancel_command, "-J", name_pattern]
         logger.debug("Running: %s", " ".join(cmd))
-        await self._call(cmd, timeout=self.config.command_timeout)
+        try:
+            await self._call(cmd, timeout=self.config.command_timeout)
+        except CommandFailedError as e:
+            if "No matching job" in str(e) or "No unfinished job" in str(e):
+                logger.debug("No jobs matched pattern %s", name_pattern)
+                return
+            raise
         # Update in-memory state for matching jobs
         for record in self._jobs.values():
             if not record.is_terminal and fnmatch.fnmatch(record.name, name_pattern):
@@ -302,13 +314,16 @@ class LSFExecutor(Executor):
                 "Cannot reconnect: no job_name_prefix was configured. "
                 "Set job_name_prefix in config to enable reconnection."
             )
-        return [
-            self.status_command,
+        args = [self.status_command]
+        if self.config.poll_all_users:
+            args.extend(["-u", "all"])
+        args.extend([
             "-J", f"{self._prefix}-*",
             "-a",
             "-o", _BJOBS_RECONNECT_FIELDS,
             "-json",
-        ]
+        ])
+        return args
     async def reconnect(self) -> list[JobRecord]:
         """Reconnect to running jobs and resume tracking them.
@@ -347,11 +362,14 @@ class LSFExecutor(Executor):
         new_records: list[JobRecord] = []
         now = datetime.now(timezone.utc)
-        # Process single (non-array) jobs
+        # Process single (non-array) jobs, skipping terminal ones
+        # (-a returns DONE/EXIT jobs too; no point reconnecting to those)
         for job_id, entries in singles.items():
             if job_id in self._jobs:
                 continue
             _, status, meta = entries[0]
+            if status in _TERMINAL_STATUSES:
+                continue
             record = JobRecord(
                 job_id=job_id,
                 name=meta.get("job_name") or "",
@@ -371,9 +389,12 @@ class LSFExecutor(Executor):
                 new_records.append(record)
         # Process array elements, grouping under parent
+        # Skip arrays where every visible element is already terminal
         for parent_id, elements in arrays.items():
             if parent_id in self._jobs:
                 continue
+            if all(s in _TERMINAL_STATUSES for _, s, _ in elements):
+                continue
             indices = sorted(idx for idx, _, _ in elements)
             array_range = (min(indices), max(indices))
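
Note on the two reconnect hunks above: both apply the same rule. Because `bjobs -a` also returns finished jobs, anything already terminal is dropped before a `JobRecord` is built, and an array is dropped only when every visible element is terminal. The following is a minimal standalone sketch of that filter, not the library's code; `TERMINAL`, `select_reconnectable`, and the plain string statuses are illustrative assumptions standing in for `_TERMINAL_STATUSES` and the library's own types.

```python
# Illustrative stand-in for the library's terminal status set (assumption).
TERMINAL = {"DONE", "FAILED", "KILLED"}


def select_reconnectable(singles, arrays, tracked=frozenset()):
    """Return (single_ids, array_parent_ids) worth reconnecting to.

    singles: {job_id: [(index, status, meta), ...]}
    arrays:  {parent_id: [(index, status, meta), ...]}
    tracked: job IDs already known in-process, which are skipped.
    """
    # A single job is kept only if it is untracked and still live.
    keep_singles = [
        job_id
        for job_id, entries in singles.items()
        if job_id not in tracked and entries[0][1] not in TERMINAL
    ]
    # An array is skipped only when every visible element is terminal.
    keep_arrays = [
        parent_id
        for parent_id, elements in arrays.items()
        if parent_id not in tracked
        and not all(status in TERMINAL for _, status, _ in elements)
    ]
    return keep_singles, keep_arrays
```

An array with one running element and several finished ones is still reconnected, which matches the "every visible element" wording in the added comment.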

py_cluster_api-0.4.0/docs/API.md ADDED

@@ -0,0 +1,77 @@
+# API Reference
+## `create_executor(profile=None, config_path=None, **overrides)`
+Factory function that loads config and returns an `Executor` instance.
+## `Executor`
+Abstract base class. Key methods:
+- `submit(command, name, resources=None, prologue=None, epilogue=None, env=None, metadata=None)` — submit a job, returns `JobRecord`
+- `submit_array(command, name, array_range, ...)` — submit a job array
+- `cancel(job_id, *, done=False)` — cancel a job by ID. By default marks the job as `KILLED`; pass `done=True` to mark it as `DONE` instead (useful for graceful pipeline termination where you don't want downstream logic to treat the cancellation as a failure)
+- `cancel_by_name(name_pattern)` — cancel jobs matching a name pattern (LSF only)
+- `cancel_all(*, done=False)` — cancel all tracked non-terminal jobs
+- `reconnect()` — rediscover running jobs after a process restart (requires `job_name_prefix`)
+- `poll()` — query scheduler and update job statuses
+- `jobs` / `active_jobs` — properties returning tracked job dicts
+## `JobRecord`
+Tracks a submitted job. Fields include `job_id`, `name`, `status`, `exit_code`, `exec_host`, `max_mem`, `submit_time`, `start_time`, `finish_time`, and `metadata`.
+- `on_success(callback)` — register callback for exit code 0
+- `on_failure(callback)` — register callback for non-zero exit
+- `on_exit(callback, condition=ANY)` — register callback for any exit condition
+- `is_terminal` — whether the job has finished
+## `JobMonitor`
+Async polling loop that drives status updates and callback dispatch.
+- `start()` / `stop()` — control the polling loop
+- `wait_for(*records, timeout=None)` — block until jobs reach a terminal state
+The monitor does not support `async with`, so use `try/finally` to ensure cleanup:
+```python
+monitor = JobMonitor(executor)
+await monitor.start()
+try:
+    job = await executor.submit(command="echo hi", name="test")
+    await monitor.wait_for(job)
+finally:
+    await monitor.stop()
+```
+## `ResourceSpec`
+Resource requirements: `cpus`, `gpus`, `memory`, `walltime`, `queue`, `work_dir`, `stdout_path`, `stderr_path`, `extra_directives`, `extra_args`.
+## Error Handling
+All exceptions inherit from `ClusterAPIError`, so you can catch broadly or narrowly:
+```python
+from cluster_api import ClusterAPIError, SubmitError, CommandTimeoutError, CommandFailedError
+try:
+    job = await executor.submit(command="echo hi", name="test")
+except SubmitError as e:
+    # Could not parse job ID from scheduler output
+    print(f"Submission failed: {e}")
+except CommandTimeoutError as e:
+    # Scheduler command (bsub, bjobs, bkill) exceeded command_timeout
+    print(f"Scheduler timed out: {e}")
+except CommandFailedError as e:
+    # Scheduler command returned a non-zero exit code
+    print(f"Scheduler error: {e}")
+```
+| Exception | Raised when |
+|---|---|
+| `ClusterAPIError` | Base class for all library errors |
+| `SubmitError` | Job ID could not be parsed from submit output |
+| `CommandTimeoutError` | A scheduler CLI command exceeded `command_timeout` |
+| `CommandFailedError` | A scheduler CLI command exited with non-zero status |
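
Note on the `JobMonitor` section of docs/API.md: since the monitor is documented as not supporting `async with`, the `try/finally` pattern could be wrapped once and reused. The sketch below is one possible way to do that, assuming only what the doc states (awaitable `start()` and `stop()` methods). `running_monitor` is a hypothetical helper and is not part of the library.

```python
from contextlib import asynccontextmanager


@asynccontextmanager
async def running_monitor(monitor):
    """Start *monitor*, yield it, and always stop it on exit.

    Hypothetical helper (not a library API): works with any object
    that exposes awaitable start() and stop() methods.
    """
    await monitor.start()
    try:
        yield monitor
    finally:
        # Runs on normal exit and when the body raises.
        await monitor.stop()
```

With such a wrapper, the documented example would become `async with running_monitor(JobMonitor(executor)) as monitor:` followed by the submit and `wait_for` calls.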

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/docs/Development.md RENAMED

@@ -37,6 +37,7 @@ pixi run check         # lint + test together
 | `test_lsf.py` | `LSFExecutor` header building, bsub submission, bjobs parsing, array rewriting | No — mocks `_call()` |
 | `test_local.py` | `LocalExecutor` end-to-end (submit, poll, output files, callbacks, array jobs) | **Yes** — runs real bash subprocesses |
 | `test_monitor.py` | `JobMonitor` polling loop, callback dispatch, zombie detection, purging | No — mocks `poll()` |
+| `test_reconnect.py` | `LSFExecutor.reconnect()` — rediscovering running jobs after restart | No — mocks `_call()` |
 | `test_integration.py` | Full LSF round-trips (submit, monitor, cancel, arrays, metadata) | **Yes** — requires a live LSF cluster |
 ### Writing tests
@@ -102,8 +103,10 @@ JobMonitor (monitor.py)    # async polling loop → callbacks + zombie detection
 ```
 - `build_header()` (per executor) produces directive lines from `ResourceSpec` + config defaults.
-- `extra_directives` (config-level and per-job) append custom flags — the directive prefix (e.g. `#BSUB`) is added automatically, so users write `"-P myproject"` not `"#BSUB -P myproject"`.
-- `extra_args` (config-level and per-job) append raw arguments to the submit command line (e.g. `bsub -P myproject script.sh`), bypassing the script entirely.
+- `extra_directives` has two levels with different behaviour:
+  - **Config-level** (`ClusterConfig.extra_directives`): appended verbatim to the script header — users must include the full prefix, e.g. `"#BSUB -P myproject"`.
+  - **ResourceSpec-level** (`ResourceSpec.extra_directives`): the directive prefix is added automatically, so users write `"-P myproject"` and the executor produces `"#BSUB -P myproject"`.
+- `extra_args` (config-level and per-job) append raw arguments to the submit command line, bypassing the script entirely. Both levels are merged at submit time: config-level args come first, then per-job (`ResourceSpec.extra_args`) args are appended.
 - `directives_skip` filters out unwanted directive lines by substring match.
 - Scripts are written to `{work_dir}/{safe_name}.{counter}.sh` and made executable.
@@ -133,8 +136,8 @@ Terminal jobs are purged from memory after `completed_retention_minutes` (once a
 ### Key design decisions
 - **Poll-based monitoring** — unlike dask-jobqueue (which relies on workers phoning home), this library actively polls the scheduler. This means it works with any executable, not just Python workers.
-- **File-based submission** — jobs are submitted via `bsub script.sh`, passing the script file path directly. The script is always written to disk before submission.
-- **Job name prefixing** — all jobs get a `{prefix}-{name}` name. The prefix is either configured (`job_name_prefix`) or randomly generated, so concurrent sessions don't collide when polling by name.
+- **Stdin-based submission** — job scripts are written to disk, then submitted via stdin redirection (`bsub < script.sh`). The script file is kept on disk for debugging.
+- **Job name prefixing** — when `job_name_prefix` is configured, all jobs get a `{prefix}-{name}` name and polling filters by that prefix. When unset, the user controls the full job name and polling queries all jobs. `reconnect()` requires a prefix to be set.
 - **Array status aggregation** — parent array job status is computed from element statuses. Only transitions to terminal when ALL expected elements are terminal.
 ## Module reference
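
Note on the "Stdin-based submission" bullet: the lsf.py hunk replaces a `shell=True` command string with an argv list plus a `stdin_file` keyword. How `_call` consumes `stdin_file` is not shown in this section, so the following is only a sketch of one way to get the `bsub < script.sh` effect without a shell. `run_with_stdin_file` and its error handling are assumptions for illustration.

```python
import asyncio


async def run_with_stdin_file(cmd, stdin_file, timeout=None):
    """Run *cmd* (an argv list) with *stdin_file* attached to stdin.

    No shell is involved, so arguments and the script path are never
    re-parsed for spaces or shell metacharacters.
    """
    with open(stdin_file, "rb") as fh:
        proc = await asyncio.create_subprocess_exec(
            *cmd,
            stdin=fh,  # equivalent of `< script.sh`
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        try:
            out, err = await asyncio.wait_for(proc.communicate(), timeout)
        except asyncio.TimeoutError:
            # Do not leave the child running after a timeout.
            proc.kill()
            await proc.wait()
            raise
    if proc.returncode != 0:
        raise RuntimeError(f"{cmd[0]} exited {proc.returncode}: {err.decode()}")
    return out.decode()
```

Passing the arguments as a list also means `extra_args` values containing spaces reach the scheduler as single arguments rather than being split by a shell.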

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/pixi.lock RENAMED

@@ -5,6 +5,8 @@ environments:
     - url: https://conda.anaconda.org/conda-forge/
     indexes:
     - https://pypi.org/simple
+    options:
+      pypi-prerelease-mode: if-necessary-or-explicit
     packages:
       linux-64:
       - conda: https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2
@@ -52,6 +54,8 @@ environments:
     - url: https://conda.anaconda.org/conda-forge/
     indexes:
     - https://pypi.org/simple
+    options:
+      pypi-prerelease-mode: if-necessary-or-explicit
     packages:
       linux-64:
       - conda: https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2
@@ -841,8 +845,8 @@ packages:
   timestamp: 1764896838868
 - pypi: ./
   name: py-cluster-api
-  version: 0.2.4
-  sha256: fa7e3d392473de2f63cc6a0aff42c8491418e1cfb4d15f4d75d37dcdb48426f2
+  version: 0.4.0
+  sha256: 1dd95e2002e0e1b4908c3ea27e6c9b575ae0e2e514cf00a01289b554502ce15d
   requires_dist:
   - pyyaml
   - pytest ; extra == 'test'
@@ -851,7 +855,6 @@ packages:
   - build ; extra == 'release'
   - twine ; extra == 'release'
   requires_python: '>=3.10'
-  editable: true
 - conda: https://conda.anaconda.org/conda-forge/noarch/pycparser-2.22-pyh29332c3_1.conda
   sha256: 79db7928d13fab2d892592223d7570f5061c192f27b9febd1a418427b719acc6
   md5: 12c566707c80111f9799308d9e265aef

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "py-cluster-api"
-version = "0.2.4"
+version = "0.4.0"
 description = "Generic Python library for running jobs on HPC clusters"
 readme = "README.md"
 license = { file = "LICENSE" }
@@ -44,7 +44,7 @@ asyncio_mode = "auto"
 markers = ["integration: tests that submit real jobs to the cluster (deselected by default)"]
 addopts = "-m 'not integration'"
-[tool.pixi.project]
+[tool.pixi.workspace]
 channels = ["conda-forge"]
 platforms = ["linux-64"]

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/tests/cluster_config.example.yaml RENAMED

@@ -7,6 +7,9 @@ memory: "1 GB"
 lsf_units: MB
 suppress_job_email: true
+# Optional: request GPU resources
+# gpus: 1
 # Optional: prologue commands to run before each job
 # script_prologue:
 #   - "module load java/11"

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/tests/test_core.py RENAMED

@@ -115,21 +115,39 @@ class TestPrefix:
         executor = LocalExecutor(default_config)
         assert executor._prefix == "test"
-    def test_random_prefix_when_none(self):
+    def test_no_prefix_when_none(self):
         from cluster_api.config import ClusterConfig
         config = ClusterConfig()
         executor = LocalExecutor(config)
-        assert len(executor._prefix) == 5
-        assert executor._prefix.isalnum()
+        assert executor._prefix is None
-    def test_random_prefix_is_unique(self):
+    async def test_submit_no_prefix(self, work_dir):
         from cluster_api.config import ClusterConfig
         config = ClusterConfig()
-        a = LocalExecutor(config)
-        b = LocalExecutor(config)
-        assert a._prefix != b._prefix
+        executor = LocalExecutor(config)
+        job = await executor.submit(
+            command="echo hello",
+            name="my-job",
+            resources=ResourceSpec(work_dir=work_dir),
+        )
+        assert job.name == "my-job"
+        await executor.cancel(job.job_id)
+    async def test_submit_array_no_prefix(self, work_dir):
+        from cluster_api.config import ClusterConfig
+        config = ClusterConfig()
+        executor = LocalExecutor(config)
+        job = await executor.submit_array(
+            command="echo hello",
+            name="my-array",
+            array_range=(1, 2),
+            resources=ResourceSpec(work_dir=work_dir),
+        )
+        assert job.name == "my-array"
+        await executor.cancel(job.job_id)
 class TestSanitizeJobName:
@@ -186,3 +204,14 @@ class TestCancelAll:
         await executor.cancel_all()
         assert job.status == JobStatus.KILLED
+    async def test_cancel_all_done(self, default_config, work_dir):
+        executor = LocalExecutor(default_config)
+        job = await executor.submit(
+            command="sleep 60", name="sleeper",
+            resources=ResourceSpec(work_dir=work_dir),
+        )
+        assert not job.is_terminal
+        await executor.cancel_all(done=True)
+        assert job.status == JobStatus.DONE
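
Note on the renamed prefix tests in test_core.py: they pin down the 0.4.0 naming rule. With no `job_name_prefix`, `_prefix` is `None` and the job keeps the caller's name; with a prefix, names take the `{prefix}-{name}` form described in docs/Development.md. A minimal sketch of that rule follows. `full_job_name` is a hypothetical helper, and any name sanitizing the library performs is deliberately ignored here.

```python
def full_job_name(name, prefix=None):
    """Return the scheduler-visible job name for *name*.

    Hypothetical helper: prepends "<prefix>-" only when a prefix is set.
    """
    return f"{prefix}-{name}" if prefix else name
```

This is why `test_submit_no_prefix` expects `job.name == "my-job"`, while the LSF submission test with a `test` prefix expects `"test-my-job"`.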

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/tests/test_local.py RENAMED

@@ -70,6 +70,16 @@ class TestLocalSubmitAndPoll:
         await executor.cancel(job.job_id)
         assert job.status == JobStatus.KILLED
+    async def test_cancel_done(self, default_config, work_dir):
+        executor = LocalExecutor(default_config)
+        job = await executor.submit(
+            command="sleep 60", name="cancel-done-test",
+            resources=ResourceSpec(work_dir=work_dir),
+        )
+        await asyncio.sleep(0.1)
+        await executor.cancel(job.job_id, done=True)
+        assert job.status == JobStatus.DONE
     async def test_multiple_jobs(self, default_config, work_dir):
         executor = LocalExecutor(default_config)

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/tests/test_lsf.py RENAMED

@@ -8,6 +8,7 @@ from unittest.mock import AsyncMock, patch
 import pytest
 from cluster_api._types import ArrayElement, JobRecord, JobStatus, ResourceSpec
+from cluster_api.exceptions import CommandFailedError
 from cluster_api.executors.lsf import (
     LSFExecutor,
     _LSF_STATUS_MAP,
@@ -61,7 +62,7 @@ class TestBuildHeader:
         assert any("-n 4" in line for line in lines)
         assert any("span[hosts=1]" in line for line in lines)
         assert any("-W 08:00" in line for line in lines)
-        assert any("-cwd /scratch" in line for line in lines)
+        assert any('-cwd "/scratch"' in line for line in lines)
     def test_single_cpu_no_span(self, lsf_config):
         executor = LSFExecutor(lsf_config)
@@ -267,6 +268,17 @@ class TestBuildStatusArgs:
         assert "-json" in args
         assert "test-*" in args
+    def test_status_args_no_prefix(self):
+        from cluster_api.config import ClusterConfig
+        config = ClusterConfig(executor="lsf", lsf_units="MB")
+        executor = LSFExecutor(config)
+        args = executor._build_status_args()
+        assert "bjobs" in args
+        assert "-a" in args
+        assert "-json" in args
+        assert "-J" not in args
 class TestSubmission:
@@ -285,11 +297,11 @@ class TestSubmission:
             assert job.job_id == "12345"
             assert job.name == "test-my-job"
             assert job.status == JobStatus.PENDING
-            # Verify shell redirect submission
+            # Verify bsub invocation with stdin_file
             cmd = mock_call.call_args[0][0]
-            assert "bsub" in cmd
-            assert "< " in cmd
-            assert cmd.endswith(".sh")
+            assert cmd[0] == "bsub"
+            kwargs = mock_call.call_args[1]
+            assert kwargs["stdin_file"].endswith(".sh")
     async def test_submit_email_suppression(self, lsf_config, work_dir):
@@ -350,6 +362,53 @@ class TestArrayScriptRewriting:
             assert "stderr.%J.%I.log" in script
+class TestCancel:
+    async def test_cancel_passes_d_flag_when_done(self, lsf_config):
+        executor = LSFExecutor(lsf_config)
+        with patch.object(
+            executor, "_call",
+            new_callable=AsyncMock,
+            return_value="Job <123> is being submitted",
+        ):
+            job = await executor.submit(
+                command="echo hi", name="cancel-done",
+                resources=ResourceSpec(work_dir="/tmp"),
+            )
+        with patch.object(
+            executor, "_call",
+            new_callable=AsyncMock,
+            return_value="",
+        ) as mock_call:
+            await executor.cancel(job.job_id, done=True)
+            args = mock_call.call_args[0][0]
+            assert args == ["bkill", "-d", job.job_id]
+            assert job.status == JobStatus.DONE
+    async def test_cancel_without_done_flag(self, lsf_config):
+        executor = LSFExecutor(lsf_config)
+        with patch.object(
+            executor, "_call",
+            new_callable=AsyncMock,
+            return_value="Job <456> is being submitted",
+        ):
+            job = await executor.submit(
+                command="echo hi", name="cancel-kill",
+                resources=ResourceSpec(work_dir="/tmp"),
+            )
+        with patch.object(
+            executor, "_call",
+            new_callable=AsyncMock,
+            return_value="",
+        ) as mock_call:
+            await executor.cancel(job.job_id)
+            args = mock_call.call_args[0][0]
+            assert args == ["bkill", job.job_id]
+            assert job.status == JobStatus.KILLED
 class TestCancelByName:
     async def test_cancel_by_name(self, lsf_config):
@@ -366,6 +425,16 @@ class TestCancelByName:
             assert "-J" in args
             assert "test-*" in args
+    async def test_cancel_by_name_no_match(self, lsf_config):
+        """bkill -J returns non-zero when no jobs match; should not raise."""
+        executor = LSFExecutor(lsf_config)
+        with patch.object(
+            executor, "_call",
+            new_callable=AsyncMock,
+            side_effect=CommandFailedError("No matching job found"),
+        ):
+            await executor.cancel_by_name("nonexistent-*")
 class TestParseLsfTime:
     def test_standard_format(self):
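
Note on `test_status_args_no_prefix`: it checks that `-J` is omitted when no prefix is configured. The standalone function below restates the new `_build_status_args` logic from the lsf.py hunk so that the resulting argv can be inspected directly. The function name and parameters are illustrative, and the `fields` string used in the usage example is a placeholder because the real value of `_BJOBS_FIELDS` is not shown in this section.

```python
def build_status_args(status_command, fields, prefix=None, poll_all_users=False):
    """Assemble a bjobs-style argv list, adding filters only when set."""
    args = [status_command]
    if poll_all_users:
        # Query jobs from every user, not just the caller.
        args.extend(["-u", "all"])
    if prefix:
        # Restrict polling to jobs named "<prefix>-...".
        args.extend(["-J", f"{prefix}-*"])
    args.extend(["-a", "-o", fields, "-json"])
    return args


# Placeholder field list; the library's actual value is not shown here.
print(build_status_args("bjobs", "jobid stat"))
print(build_status_args("bjobs", "jobid stat", prefix="test", poll_all_users=True))
```

Without a prefix the query is unfiltered by name, which is the "polling queries all jobs" behaviour that docs/Development.md describes for the unset case.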

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/tests/test_reconnect.py RENAMED

@@ -61,7 +61,8 @@ class TestReconnectByPrefix:
         assert job.resources is None
         assert job.exec_host == "node01"
-    async def test_completed_job(self, lsf_config):
+    async def test_completed_job_skipped(self, lsf_config):
+        """Terminal jobs from -a flag should not be reconnected."""
         executor = LSFExecutor(lsf_config)
         output = _make_bjobs_json([
             _make_record(
@@ -73,11 +74,10 @@ class TestReconnectByPrefix:
         with patch.object(executor, "_call", new_callable=AsyncMock, return_value=output):
             jobs = await executor.reconnect()
-        assert len(jobs) == 1
-        assert jobs[0].status == JobStatus.DONE
-        assert jobs[0].exit_code == 0
+        assert len(jobs) == 0
-    async def test_multiple_jobs(self, lsf_config):
+    async def test_multiple_jobs_filters_terminal(self, lsf_config):
+        """Only non-terminal jobs should be reconnected; DONE/EXIT are skipped."""
         executor = LSFExecutor(lsf_config)
         output = _make_bjobs_json([
             _make_record(job_id="100", job_name="test-a", stat="RUN"),
@@ -91,13 +91,12 @@ class TestReconnectByPrefix:
         with patch.object(executor, "_call", new_callable=AsyncMock, return_value=output):
             jobs = await executor.reconnect()
-        assert len(jobs) == 3
+        assert len(jobs) == 2
         ids = {j.job_id for j in jobs}
-        assert ids == {"100", "101", "102"}
+        assert ids == {"100", "101"}
         by_id = {j.job_id: j for j in jobs}
         assert by_id["100"].status == JobStatus.RUNNING
         assert by_id["101"].status == JobStatus.PENDING
-        assert by_id["102"].status == JobStatus.DONE
     async def test_skips_already_tracked(self, lsf_config, work_dir):
         executor = LSFExecutor(lsf_config)
@@ -232,9 +231,9 @@ class TestReconnectArrayJobs:
         assert jobs[0].metadata["array_range"] == (5, 10)
-    async def test_status_computed(self, lsf_config):
+    async def test_all_terminal_array_skipped(self, lsf_config):
+        """Array where all visible elements are terminal should not be reconnected."""
         executor = LSFExecutor(lsf_config)
-        # All elements done → parent status should be DONE
         output = _make_bjobs_json([
             _make_record(job_id="600[1]", job_name="test-alldone", stat="DONE", exit_code="0"),
             _make_record(job_id="600[2]", job_name="test-alldone", stat="DONE", exit_code="0"),
@@ -243,9 +242,10 @@ class TestReconnectArrayJobs:
         with patch.object(executor, "_call", new_callable=AsyncMock, return_value=output):
             jobs = await executor.reconnect()
-        assert jobs[0].status == JobStatus.DONE
+        assert len(jobs) == 0
-    async def test_status_computed_with_failure(self, lsf_config):
+    async def test_all_terminal_array_with_failure_skipped(self, lsf_config):
+        """Array where all elements are terminal (even with failures) should not be reconnected."""
         executor = LSFExecutor(lsf_config)
         output = _make_bjobs_json([
             _make_record(job_id="700[1]", job_name="test-mixed", stat="DONE", exit_code="0"),
@@ -255,8 +255,7 @@ class TestReconnectArrayJobs:
         with patch.object(executor, "_call", new_callable=AsyncMock, return_value=output):
             jobs = await executor.reconnect()
-        assert jobs[0].status == JobStatus.FAILED
-        assert jobs[0].failed_element_indices == [2]
+        assert len(jobs) == 0
     async def test_mixed_single_and_array(self, lsf_config):
         executor = LSFExecutor(lsf_config)
@@ -280,8 +279,8 @@ class TestReconnectArrayJobs:
         executor = LSFExecutor(lsf_config)
         output = _make_bjobs_json([
             _make_record(
-                job_id="1000[1]", job_name="test-meta", stat="DONE",
-                exit_code="0", exec_host="node01", max_mem="256 MB",
+                job_id="1000[1]", job_name="test-meta", stat="RUN",
+                exec_host="node01", max_mem="256 MB",
             ),
         ])
@@ -291,7 +290,6 @@ class TestReconnectArrayJobs:
         elem = jobs[0].array_elements[1]
         assert elem.exec_host == "node01"
         assert elem.max_mem == "256 MB"
-        assert elem.exit_code == 0
 class TestReconnectThenPoll:

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/.gitignore RENAMED

File without changes

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/LICENSE RENAMED

File without changes

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/cluster_api/__init__.py RENAMED

File without changes

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/cluster_api/exceptions.py RENAMED

File without changes

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/cluster_api/executors/__init__.py RENAMED

File without changes

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/cluster_api/monitor.py RENAMED

File without changes

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/cluster_api/script.py RENAMED

File without changes

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/tests/__init__.py RENAMED

File without changes

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/tests/conftest.py RENAMED

File without changes

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/tests/test_config.py RENAMED

File without changes

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/tests/test_integration.py RENAMED

File without changes

{py_cluster_api-0.2.4 → py_cluster_api-0.4.0}/tests/test_monitor.py RENAMED

File without changes
