dpdispatcher 0.5.6__tar.gz → 0.5.8__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/.github/workflows/pyright.yml +2 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/.pre-commit-config.yaml +2 -2
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/PKG-INFO +14 -5
- dpdispatcher-0.5.8/README.md +33 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/doc/batch.md +10 -2
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/__init__.py +2 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/_version.py +2 -2
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/base_context.py +0 -3
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/distributed_shell.py +6 -7
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/dp_cloud_server.py +3 -1
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/dp_cloud_server_context.py +0 -3
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/dpcloudserver/client.py +1 -1
- dpdispatcher-0.5.8/dpdispatcher/fugaku.py +94 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/hdfs_context.py +0 -3
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/lazy_local_context.py +0 -4
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/local_context.py +0 -4
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/lsf.py +12 -2
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/machine.py +18 -2
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/pbs.py +14 -2
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/shell.py +14 -3
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/slurm.py +69 -16
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/ssh_context.py +21 -17
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/submission.py +158 -41
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher.egg-info/PKG-INFO +14 -5
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher.egg-info/SOURCES.txt +4 -0
- dpdispatcher-0.5.8/tests/jsons/machine_fugaku.json +24 -0
- dpdispatcher-0.5.8/tests/jsons/machine_local_fugaku.json +18 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_class_resources.py +3 -0
- dpdispatcher-0.5.8/tests/test_run_submission.py +213 -0
- dpdispatcher-0.5.6/tests/test_run_submission.py → dpdispatcher-0.5.8/tests/test_run_submission_ratio_unfinished.py +16 -10
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_ssh_context.py +48 -0
- dpdispatcher-0.5.6/README.md +0 -24
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/.github/workflows/ci-docker.yml +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/.github/workflows/machines.yml +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/.github/workflows/mirror_gitee.yml +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/.github/workflows/publish_conda.yml +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/.github/workflows/release.yml +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/.github/workflows/test.yml +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/.gitignore +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/CONTRIBUTING.md +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/Dockerfile +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/LICENSE +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/ci/LICENSE +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/ci/README.md +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/ci/pbs/docker-compose.yml +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/ci/pbs/start-pbs.sh +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/ci/pbs.sh +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/ci/slurm/docker-compose.yml +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/ci/slurm/register_cluster.sh +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/ci/slurm/start-slurm.sh +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/ci/slurm.sh +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/ci/ssh/docker-compose.yml +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/ci/ssh/start-ssh.sh +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/ci/ssh.sh +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/ci/ssh_rsync.sh +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/codecov.yml +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/conda/conda_build_config.yaml +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/conda/meta.yaml +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/doc/.gitignore +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/doc/Makefile +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/doc/conf.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/doc/context.md +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/doc/credits.rst +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/doc/dpdispatcher_on_yarn.md +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/doc/examples/expanse.md +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/doc/examples/g16.md +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/doc/examples/shell.md +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/doc/getting-started.md +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/doc/index.rst +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/doc/install.md +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/doc/machine.rst +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/doc/make.bat +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/doc/requirements.txt +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/doc/resources.rst +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/doc/task.rst +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/JobStatus.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/arginfo.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/dpcloudserver/__init__.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/dpcloudserver/config.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/dpcloudserver/retcode.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/dpcloudserver/temp_test.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/dpcloudserver/zip_file.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/dpdisp.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/hdfs_cli.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/utils.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher.egg-info/dependency_links.txt +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher.egg-info/entry_points.txt +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher.egg-info/requires.txt +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher.egg-info/top_level.txt +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/examples/machine/expanse.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/examples/machine/lazy_local.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/examples/machine/mandu.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/examples/resources/expanse_cpu.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/examples/resources/mandu.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/examples/task/deepmd-kit.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/examples/task/g16.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/pyproject.toml +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/scripts/script_gen_dargs_docs.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/scripts/script_gen_dargs_json.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/setup.cfg +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/.gitignore +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/__init__.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/batch.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/context.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/debug_test_class_submission_init.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/devel_test_ali_ehpc.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/devel_test_dp_cloud_server.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/devel_test_lazy_ali_ehpc.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/devel_test_lsf.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/devel_test_shell.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/devel_test_slurm.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/devel_test_ssh_ali_ehpc.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/graph.pb +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/jsons/job.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/jsons/machine.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/jsons/machine_ali_ehpc.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/jsons/machine_center.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/jsons/machine_diffenert.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/jsons/machine_dp_cloud_server.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/jsons/machine_if_cuda_multi_devices.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/jsons/machine_lazy_local_lsf.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/jsons/machine_lazy_local_slurm.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/jsons/machine_lazylocal_shell.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/jsons/machine_local_shell.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/jsons/machine_lsf.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/jsons/machine_slurm.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/jsons/machine_yarn.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/jsons/resources.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/jsons/submission.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/jsons/task.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/lsf/context.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/lsf/test_dispatcher.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/lsf/test_lsf_local.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/old/test_dispatcher_utils.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/old/test_lazy_local_context.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/old/test_local_context.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/old/test_local_session.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/old/test_ssh_context.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/pbs/context.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/pbs/test_dispatcher.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/pbs/test_pbs_local.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/sample_class.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/script_gen_json.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/shell/context.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/shell/test_dispatcher.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/shell/test_shell_local.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/shell/test_shell_ssh.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/slurm/context.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/slurm/test_dispatcher.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/slurm/test_dispatcher_lazy_local.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/slurm/test_slurm_lazy_local.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/slurm/test_slurm_local.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/slurm/test_slurm_ssh.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/slurm_test.env +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_argcheck.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_class_job.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_class_machine.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_class_machine_dispatch.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_class_submission.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_class_submission_init.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_class_task.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_context_dir/0_md/bct-1/conf.lmp +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_context_dir/0_md/bct-1/input.lammps +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_context_dir/0_md/bct-1/some_dir/some_file +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_context_dir/0_md/bct-2/conf.lmp +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_context_dir/0_md/bct-2/input.lammps +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_context_dir/0_md/bct-3/conf.lmp +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_context_dir/0_md/bct-3/input.lammps +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_context_dir/0_md/bct-4/conf.lmp +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_context_dir/0_md/bct-4/input.lammps +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_context_dir/0_md/dir with space/file with space +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_context_dir/0_md/graph.pb +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_context_dir/0_md/some_dir/some_file +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_group_size.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_hdfs_context.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_hdfs_dir/0_md/bct-1/conf.lmp +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_hdfs_dir/0_md/bct-1/input.lammps +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_hdfs_dir/0_md/bct-2/conf.lmp +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_hdfs_dir/0_md/bct-2/input.lammps +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_hdfs_dir/0_md/bct-3/conf.lmp +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_hdfs_dir/0_md/bct-3/input.lammps +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_hdfs_dir/0_md/bct-4/conf.lmp +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_hdfs_dir/0_md/bct-4/input.lammps +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_hdfs_dir/0_md/graph.pb +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_if_cuda_multi_devices/test_dir/test.txt +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_import_classes.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_lazy_local_context.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_local_context.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_lsf_dir/0_md/bct-1/conf.lmp +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_lsf_dir/0_md/bct-1/input.lammps +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_lsf_dir/0_md/bct-2/conf.lmp +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_lsf_dir/0_md/bct-2/input.lammps +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_lsf_dir/0_md/bct-3/conf.lmp +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_lsf_dir/0_md/bct-3/input.lammps +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_lsf_dir/0_md/bct-4/conf.lmp +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_lsf_dir/0_md/bct-4/input.lammps +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_lsf_dir/0_md/graph.pb +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_lsf_dir/0_md/submission.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_lsf_script_generation.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_pbs_dir/0_md/bct-1/conf.lmp +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_pbs_dir/0_md/bct-1/input.lammps +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_pbs_dir/0_md/bct-2/conf.lmp +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_pbs_dir/0_md/bct-2/input.lammps +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_pbs_dir/0_md/bct-3/conf.lmp +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_pbs_dir/0_md/bct-3/input.lammps +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_pbs_dir/0_md/bct-4/conf.lmp +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_pbs_dir/0_md/bct-4/input.lammps +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_pbs_dir/0_md/graph.pb +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_retry.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_shell_cuda_multi_devices.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_shell_trival.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_shell_trival_dir/fail_dir/mock_fail_task.txt +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_shell_trival_dir/parent_dir/dir with space/example.txt +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_shell_trival_dir/parent_dir/dir1/example.txt +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_shell_trival_dir/parent_dir/dir2/example.txt +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_shell_trival_dir/parent_dir/dir3/example.txt +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_shell_trival_dir/parent_dir/dir4/example.txt +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_shell_trival_dir/parent_dir/graph.pb +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_shell_trival_dir/recover_dir/mock_recover_task.txt +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_slurm_dir/0_md/bct-1/conf.lmp +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_slurm_dir/0_md/bct-1/input.lammps +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_slurm_dir/0_md/bct-2/conf.lmp +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_slurm_dir/0_md/bct-2/input.lammps +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_slurm_dir/0_md/bct-3/conf.lmp +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_slurm_dir/0_md/bct-3/input.lammps +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_slurm_dir/0_md/bct-4/conf.lmp +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_slurm_dir/0_md/bct-4/input.lammps +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_slurm_dir/0_md/d3c842c5b9476e48f7145b370cd330372b9293e1.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_slurm_dir/0_md/graph.pb +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_slurm_dir/0_md/submission.json +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_slurm_script_generation.py +0 -0
- {dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/tests/test_work_path/.gitkeep +0 -0

{dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/.pre-commit-config.yaml
@@ -22,7 +22,7 @@ repos:
       - id: black-jupyter
   - repo: https://github.com/charliermarsh/ruff-pre-commit
     # Ruff version.
-    rev: v0.0.
+    rev: v0.0.275
     hooks:
       - id: ruff
         args: ["--fix"]
@@ -34,6 +34,6 @@ repos:
         args: ["--write"]
   # Python inside docs
   - repo: https://github.com/asottile/blacken-docs
-    rev: 1.
+    rev: 1.14.0
    hooks:
      - id: blacken-docs

{dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: dpdispatcher
-Version: 0.5.
+Version: 0.5.8
 Summary: Generate HPC scheduler systems jobs input scripts, submit these scripts to HPC systems, and poke until they finish
 Author: DeepModeling
 License: GNU LESSER GENERAL PUBLIC LICENSE
@@ -191,15 +191,20 @@ License-File: LICENSE
 
 # DPDispatcher
 
-
+[](https://anaconda.org/conda-forge/dpdispatcher)
+[](https://pypi.org/project/dpdispatcher)
+[](https://hub.docker.com/r/dptechnology/dpdispatcher)
+[](https://dpdispatcher.readthedocs.io/)
+
+DPDispatcher is a Python package used to generate HPC (High-Performance Computing) scheduler systems (Slurm/PBS/LSF/Bohrium) jobs input scripts, submit them to HPC systems, and poke until they finish.
 
-DPDispatcher will monitor (poke) until these jobs finish and download the results files (if these jobs
+DPDispatcher will monitor (poke) until these jobs finish and download the results files (if these jobs are running on remote systems connected by SSH).
 
 For more information, check the [documentation](https://dpdispatcher.readthedocs.io/).
 
 ## Installation
 
-DPDispatcher can installed by `pip`:
+DPDispatcher can be installed by `pip`:
 
 ```bash
 pip install dpdispatcher
@@ -211,5 +216,9 @@ See [Getting Started](https://dpdispatcher.readthedocs.io/en/latest/getting-started.html) for usage.
 
 ## Contributing
 
-DPDispatcher is maintained by Deep Modeling's developers and
+DPDispatcher is maintained by Deep Modeling's developers and welcomes other people.
 See [Contributing Guide](CONTRIBUTING.md) to become a contributor! 🤓
+
+## References
+
+DPDispatcher is derivated from the [DP-GEN](https://github.com/deepmodeling/dpgen) package. To mention DPDispatcher in a scholarly publication, please read Section 3.3 in the [DP-GEN paper](https://doi.org/10.1016/j.cpc.2020.107206).

dpdispatcher-0.5.8/README.md
@@ -0,0 +1,33 @@
+# DPDispatcher
+
+[](https://anaconda.org/conda-forge/dpdispatcher)
+[](https://pypi.org/project/dpdispatcher)
+[](https://hub.docker.com/r/dptechnology/dpdispatcher)
+[](https://dpdispatcher.readthedocs.io/)
+
+DPDispatcher is a Python package used to generate HPC (High-Performance Computing) scheduler systems (Slurm/PBS/LSF/Bohrium) jobs input scripts, submit them to HPC systems, and poke until they finish.
+
+DPDispatcher will monitor (poke) until these jobs finish and download the results files (if these jobs are running on remote systems connected by SSH).
+
+For more information, check the [documentation](https://dpdispatcher.readthedocs.io/).
+
+## Installation
+
+DPDispatcher can be installed by `pip`:
+
+```bash
+pip install dpdispatcher
+```
+
+## Usage
+
+See [Getting Started](https://dpdispatcher.readthedocs.io/en/latest/getting-started.html) for usage.
+
+## Contributing
+
+DPDispatcher is maintained by Deep Modeling's developers and welcomes other people.
+See [Contributing Guide](CONTRIBUTING.md) to become a contributor! 🤓
+
+## References
+
+DPDispatcher is derivated from the [DP-GEN](https://github.com/deepmodeling/dpgen) package. To mention DPDispatcher in a scholarly publication, please read Section 3.3 in the [DP-GEN paper](https://doi.org/10.1016/j.cpc.2020.107206).

{dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/doc/batch.md
@@ -21,9 +21,9 @@ To avoid running multiple jobs at the same time, one could set {dargs:argument}`
 One needs to make sure slurm has been setup in the remote server and the related environment is activated.
 
 When `SlurmJobArray` is used, dpdispatcher submits Slurm jobs with [job arrays](https://slurm.schedmd.com/job_array.html).
-In this way,
+In this way, several dpdispatcher {class}`task <dpdispatcher.submission.Task>`s map to a Slurm job and a dpdispatcher {class}`job <dpdispatcher.submission.Job>` maps to a Slurm job array.
 Millions of Slurm jobs can be submitted quickly and Slurm can execute all Slurm jobs at the same time.
-One can use {dargs:argument}`group_size <resources/group_size>` to control how many Slurm jobs are contained in a Slurm job array.
+One can use {dargs:argument}`group_size <resources/group_size>` and {dargs:argument}`slurm_job_size <resources[SlurmJobArray]/kwargs/slurm_job_size>` to control how many Slurm jobs are contained in a Slurm job array.
 
 ## OpenPBS or PBSPro
 
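As an aside, here is a minimal sketch (not part of the package; the task count and sizes are invented) of how the new `slurm_job_size` kwarg maps tasks onto Slurm array indices, mirroring the `ii // slurm_job_size` grouping introduced in `dpdispatcher/slurm.py` further down in this diff:

```python
import math

# Illustrative numbers only: group_size tasks go into one dpdispatcher job,
# and slurm_job_size tasks share one Slurm job-array element.
n_tasks_in_job = 10  # e.g. group_size
slurm_job_size = 3   # new resources kwarg for SlurmJobArray

# Mirrors SlurmJobArray.gen_script_header: array indices run 0..ceil(n/size)-1
print(f"#SBATCH --array=0-{math.ceil(n_tasks_in_job / slurm_job_size) - 1}")

# Mirrors SlurmJobArray.gen_script_command: task ii runs in element ii // slurm_job_size
for ii in range(n_tasks_in_job):
    print(f"task {ii} -> SLURM_ARRAY_TASK_ID {ii // slurm_job_size}")
```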
@@ -62,3 +62,11 @@ Read Bohrium documentation for details.
 
 `DistributedShell` is used to submit yarn jobs.
 Read [Support DPDispatcher on Yarn](dpdispatcher_on_yarn.md) for details.
+
+## Fugaku
+
+{dargs:argument}`batch_type <resources/batch_type>`: `Fugaku`
+
+[Fujitsu cloud service](https://doc.cloud.global.fujitsu.com/lib/common/jp/hpc-user-manual/) is a job scheduling system used by Fujitsu's HPCs such as Fugaku, ITO and K computer. It should be noted that although the same job scheduling system is used, there are some differences in the details, Fagaku class cannot be directly used for other HPCs.
+
+Read Fujitsu cloud service documentation for details.

{dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/__init__.py
@@ -43,6 +43,7 @@ except ImportError:
 from .distributed_shell import DistributedShell
 from .dp_cloud_server import DpCloudServer, Lebesgue
 from .dp_cloud_server_context import DpCloudServerContext, LebesgueContext
+from .fugaku import Fugaku
 from .hdfs_context import HDFSContext
 from .lazy_local_context import LazyLocalContext
 from .local_context import LocalContext
@@ -85,6 +86,7 @@ __all__ = [
     "PBS",
     "Shell",
     "Slurm",
+    "Fugaku",
     "SSHContext",
     "Submission",
     "Task",

{dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/base_context.py
@@ -70,9 +70,6 @@ class BaseContext(metaclass=ABCMeta):
     def read_file(self, fname):
         raise NotImplementedError("abstract method")
 
-    def kill(self, proc):
-        raise NotImplementedError("abstract method")
-
     def check_finish(self, proc):
         raise NotImplementedError("abstract method")
 

{dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/distributed_shell.py
@@ -136,17 +136,16 @@ class DistributedShell(Machine):
 
         resources = job.resources
         submit_command = (
-            "hadoop jar
+            "hadoop jar {}/hadoop-yarn-applications-distributedshell-*.jar "
             "org.apache.hadoop.yarn.applications.distributedshell.Client "
-            "-jar
-            '-queue
+            "-jar {}/hadoop-yarn-applications-distributedshell-*.jar "
+            '-queue {} -appname "distributedshell_dpgen_{}" '
             "-shell_env YARN_CONTAINER_RUNTIME_TYPE=docker "
-            "-shell_env YARN_CONTAINER_RUNTIME_DOCKER_IMAGE
+            "-shell_env YARN_CONTAINER_RUNTIME_DOCKER_IMAGE={} "
             "-shell_env ENV_DOCKER_CONTAINER_SHM_SIZE='600m' "
             "-master_memory 1024 -master_vcores 2 -num_containers 1 "
-            "-container_resources memory-mb
-            "-shell_script /tmp
-            % (
+            "-container_resources memory-mb={},vcores={} "
+            "-shell_script /tmp/{}".format(
                 resources.kwargs.get("yarn_path", ""),
                 resources.kwargs.get("yarn_path", ""),
                 resources.queue_name,

{dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/dp_cloud_server.py
@@ -106,7 +106,9 @@ class Bohrium(Machine):
 
         input_data = self.input_data.copy()
 
-        input_data
+        if not input_data.get("job_resources"):
+            input_data["job_resources"] = []
+        input_data["job_resources"].append(job_resources)
         input_data["command"] = f"bash {job.script_file_name}"
         if not input_data.get("backward_files"):
             input_data["backward_files"] = self._gen_backward_files_list(job)

{dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/dp_cloud_server_context.py
@@ -270,9 +270,6 @@ class BohriumContext(BaseContext):
         # retcode = cmd_pipes['stdout'].channel.recv_exit_status()
         # return retcode, cmd_pipes['stdout'], cmd_pipes['stderr']
 
-    def kill(self, cmd_pipes):
-        pass
-
     @classmethod
     def machine_subfields(cls) -> List[Argument]:
         """Generate the machine subfields.

dpdispatcher-0.5.8/dpdispatcher/fugaku.py
@@ -0,0 +1,94 @@
+import shlex
+
+from dpdispatcher import dlog
+from dpdispatcher.JobStatus import JobStatus
+from dpdispatcher.machine import Machine
+
+fugaku_script_header_template = """\
+{queue_name_line}
+{fugaku_node_number_line}
+{fugaku_ntasks_per_node_line}
+"""
+
+
+class Fugaku(Machine):
+    def gen_script(self, job):
+        fugaku_script = super().gen_script(job)
+        return fugaku_script
+
+    def gen_script_header(self, job):
+        resources = job.resources
+        fugaku_script_header_dict = {}
+        fugaku_script_header_dict[
+            "fugaku_node_number_line"
+        ] = f'#PJM -L "node={resources.number_node}" '
+        fugaku_script_header_dict[
+            "fugaku_ntasks_per_node_line"
+        ] = '#PJM --mpi "max-proc-per-node={cpu_per_node}"'.format(
+            cpu_per_node=resources.cpu_per_node
+        )
+        fugaku_script_header_dict[
+            "queue_name_line"
+        ] = f'#PJM -L "rscgrp={resources.queue_name}"'
+        fugaku_script_header = fugaku_script_header_template.format(
+            **fugaku_script_header_dict
+        )
+        return fugaku_script_header
+
+    def do_submit(self, job):
+        script_file_name = job.script_file_name
+        script_str = self.gen_script(job)
+        job_id_name = job.job_hash + "_job_id"
+        # script_str = self.sub_script(job_dirs, cmd, args=args, resources=resources, outlog=outlog, errlog=errlog)
+        self.context.write_file(fname=script_file_name, write_str=script_str)
+        # self.context.write_file(fname=os.path.join(self.context.submission.work_base, script_file_name), write_str=script_str)
+        # script_file_dir = os.path.join(self.context.submission.work_base)
+        script_file_dir = self.context.remote_root
+        # stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'pjsub', script_file_name))
+
+        stdin, stdout, stderr = self.context.block_checkcall(
+            "cd {} && {} {}".format(
+                shlex.quote(script_file_dir), "pjsub", shlex.quote(script_file_name)
+            )
+        )
+        subret = stdout.readlines()
+        job_id = subret[0].split()[5]
+        self.context.write_file(job_id_name, job_id)
+        return job_id
+
+    def default_resources(self, resources):
+        pass
+
+    def check_status(self, job):
+        job_id = job.job_id
+        if job_id == "":
+            return JobStatus.unsubmitted
+        ret, stdin, stdout, stderr = self.context.block_call("pjstat " + job_id)
+        err_str = stderr.read().decode("utf-8")
+        try:
+            status_line = stdout.read().decode("utf-8").split("\n")[-2]
+            # pjstat only retrun 0 if the job is not waiting or running
+        except Exception:
+            ret, stdin, stdout, stderr = self.context.block_call("pjstat -H " + job_id)
+            status_line = stdout.read().decode("utf-8").split("\n")[-2]
+            status_word = status_line.split()[3]
+            if status_word in ["EXT", "CCL", "ERR"]:
+                if self.check_finish_tag(job):
+                    dlog.info(f"job: {job.job_hash} {job.job_id} finished")
+                    return JobStatus.finished
+                else:
+                    return JobStatus.terminated
+            else:
+                return JobStatus.unknown
+        status_word = status_line.split()[3]
+        # dlog.info (status_word)
+        if status_word in ["QUE", "HLD", "RNA", "SPD"]:
+            return JobStatus.waiting
+        elif status_word in ["RUN", "RNE"]:
+            return JobStatus.running
+        else:
+            return JobStatus.unknown
+
+    def check_finish_tag(self, job):
+        job_tag_finished = job.job_hash + "_job_tag_finished"
+        return self.context.check_file_exists(job_tag_finished)
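For orientation, a small sketch (not from the package; the resource values are invented) of the `#PJM` header that `Fugaku.gen_script_header` renders from the template above:

```python
# Invented resource values, used only to show the rendered header.
resources = {"queue_name": "small", "number_node": 1, "cpu_per_node": 48}

header = "\n".join(
    [
        '#PJM -L "rscgrp={queue_name}"'.format(**resources),
        '#PJM -L "node={number_node}" '.format(**resources),
        '#PJM --mpi "max-proc-per-node={cpu_per_node}"'.format(**resources),
    ]
)
print(header)
# #PJM -L "rscgrp=small"
# #PJM -L "node=1"
# #PJM --mpi "max-proc-per-node=48"
```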

{dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/lazy_local_context.py
@@ -1,5 +1,4 @@
 import os
-import signal
 import subprocess as sp
 
 from dpdispatcher.base_context import BaseContext
@@ -167,9 +166,6 @@ class LazyLocalContext(BaseContext):
         )
         return proc
 
-    def kill(self, job_id):
-        os.kill(job_id, signal.SIGTERM)
-
     def check_finish(self, proc):
         return proc.poll() is not None
 

{dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/local_context.py
@@ -1,7 +1,6 @@
 import hashlib
 import os
 import shutil
-import signal
 import subprocess as sp
 from glob import glob
 from subprocess import TimeoutExpired
@@ -291,9 +290,6 @@ class LocalContext(BaseContext):
         )
         return proc
 
-    def kill(self, job_id):
-        os.kill(job_id, signal.SIGTERM)
-
     def check_finish(self, proc):
         return proc.poll() is not None
 

{dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/lsf.py
@@ -83,8 +83,7 @@ class LSF(Machine):
 
         try:
             stdin, stdout, stderr = self.context.block_checkcall(
-                "cd
-                % (
+                "cd {} && {} {}".format(
                     shlex.quote(self.context.remote_root),
                     "bsub < ",
                     shlex.quote(script_file_name),
@@ -211,3 +210,14 @@
                 doc="Extra arguments.",
             )
         ]
+
+    def kill(self, job):
+        """Kill the job.
+
+        Parameters
+        ----------
+        job : Job
+            job
+        """
+        job_id = job.job_id
+        ret, stdin, stdout, stderr = self.context.block_call("bkill " + str(job_id))

{dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/machine.py
@@ -377,8 +377,12 @@ class Machine(metaclass=ABCMeta):
         machine_args = [
             Argument("batch_type", str, optional=False, doc=doc_batch_type),
             # TODO: add default to local_root and remote_root after refactor the code
-            Argument(
-
+            Argument(
+                "local_root", [str, type(None)], optional=False, doc=doc_local_root
+            ),
+            Argument(
+                "remote_root", [str, type(None)], optional=True, doc=doc_remote_root
+            ),
             Argument(
                 "clean_asynchronously",
                 bool,
@@ -439,3 +443,15 @@
                 "kwargs", dict, optional=True, doc="This field is empty for this batch."
             )
         ]
+
+    def kill(self, job):
+        """Kill the job.
+
+        If not implemented, pass and let the user manually kill it.
+
+        Parameters
+        ----------
+        job : Job
+            job
+        """
+        dlog.warning("Job %s should be manually killed" % job.job_id)

{dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/pbs.py
@@ -46,8 +46,9 @@ class PBS(Machine):
         script_file_dir = self.context.remote_root
         # stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'qsub', script_file_name))
         stdin, stdout, stderr = self.context.block_checkcall(
-            "cd
-
+            "cd {} && {} {}".format(
+                shlex.quote(script_file_dir), "qsub", shlex.quote(script_file_name)
+            )
         )
         subret = stdout.readlines()
         job_id = subret[0].split()[0]
@@ -94,6 +95,17 @@
         job_tag_finished = job.job_hash + "_job_tag_finished"
         return self.context.check_file_exists(job_tag_finished)
 
+    def kill(self, job):
+        """Kill the job.
+
+        Parameters
+        ----------
+        job : Job
+            job
+        """
+        job_id = job.job_id
+        ret, stdin, stdout, stderr = self.context.block_call("qdel " + str(job_id))
+
 
 class Torque(PBS):
     def check_status(self, job):

{dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/shell.py
@@ -25,8 +25,7 @@ class Shell(Machine):
         output_name = job.job_hash + ".out"
         self.context.write_file(fname=script_file_name, write_str=script_str)
         ret, stdin, stdout, stderr = self.context.block_call(
-            "cd
-            % (
+            "cd {} && {{ nohup bash {} 1>>{} 2>>{} & }} && echo $!".format(
                 shlex.quote(self.context.remote_root),
                 script_file_name,
                 output_name,
@@ -66,7 +65,7 @@
 
         # mark defunct process as terminated
         ret, stdin, stdout, stderr = self.context.block_call(
-            f"if ps -p {job_id} > /dev/null && ! (ps -p {job_id} | grep defunct >/dev/null) ; then echo 1; fi"
+            f"if ps -p {job_id} > /dev/null && ! (ps -o command -p {job_id} | grep defunct >/dev/null) ; then echo 1; fi"
         )
         if ret != 0:
             err_str = stderr.read().decode("utf-8")
@@ -101,3 +100,15 @@
         job_tag_finished = job.job_hash + "_job_tag_finished"
         # print('job finished: ',job.job_id, job_tag_finished)
         return self.context.check_file_exists(job_tag_finished)
+
+    def kill(self, job):
+        """Kill the job.
+
+        Parameters
+        ----------
+        job : Job
+            job
+        """
+        job_id = job.job_id
+        # 9 means exit, cannot be blocked
+        ret, stdin, stdout, stderr = self.context.block_call("kill -9 " + str(job_id))

{dpdispatcher-0.5.6 → dpdispatcher-0.5.8}/dpdispatcher/slurm.py
@@ -1,3 +1,4 @@
+import math
 import pathlib
 import shlex
 from typing import List
@@ -45,9 +46,12 @@ class Slurm(Machine):
             )
         else:
             script_header_dict["slurm_number_gpu_line"] = custom_gpu_line
-
-
-
+        if resources.queue_name != "":
+            script_header_dict[
+                "slurm_partition_line"
+            ] = f"#SBATCH --partition {resources.queue_name}"
+        else:
+            script_header_dict["slurm_partition_line"] = ""
         slurm_script_header = slurm_script_header_template.format(**script_header_dict)
         return slurm_script_header
 
@@ -60,8 +64,7 @@
         self.context.write_file(fname=script_file_name, write_str=script_str)
         # self.context.write_file(fname=os.path.join(self.context.submission.work_base, script_file_name), write_str=script_str)
         ret, stdin, stdout, stderr = self.context.block_call(
-            "cd
-            % (
+            "cd {} && {} {}".format(
                 shlex.quote(self.context.remote_root),
                 "sbatch",
                 shlex.quote(script_file_name),
@@ -78,7 +81,12 @@
                     "Get error code %d in submitting through ssh with job: %s . message: %s"
                     % (ret, job.job_hash, err_str)
                 )
-            elif
+            elif (
+                "Job violates accounting/QOS policy" in err_str
+                # the number of jobs exceeds DEFAULT_MAX_JOB_COUNT (by default 10000)
+                or "Slurm temporarily unable to accept job, sleeping and retrying"
+                in err_str
+            ):
                 # job number exceeds, skip the submitting
                 return ""
             raise RuntimeError(
@@ -115,6 +123,7 @@
             elif (
                 "Socket timed out on send/recv operation" in err_str
                 or "Unable to contact slurm controller" in err_str
+                or "Invalid user for SlurmUser" in err_str
             ):
                 # retry 3 times
                 raise RetrySignal(
@@ -194,30 +203,47 @@
             )
         ]
 
+    def kill(self, job):
+        """Kill the job.
+
+        Parameters
+        ----------
+        job : Job
+            job
+        """
+        job_id = job.job_id
+        # -Q Do not report an error if the specified job is already completed.
+        ret, stdin, stdout, stderr = self.context.block_call(
+            "scancel -Q " + str(job_id)
+        )
+        # we do not need to stop here if scancel failed; just continue
+
 
 class SlurmJobArray(Slurm):
     """Slurm with job array enabled for multiple tasks in a job."""
 
     def gen_script_header(self, job):
+        slurm_job_size = job.resources.kwargs.get("slurm_job_size", 1)
         if job.fail_count > 0:
             # resubmit jobs, check if some of tasks have been finished
-            job_array =
+            job_array = set()
             for ii, task in enumerate(job.job_task_list):
                 task_tag_finished = (
                     pathlib.PurePath(task.task_work_path)
                     / (task.task_hash + "_task_tag_finished")
                 ).as_posix()
                 if not self.context.check_file_exists(task_tag_finished):
-                    job_array.
+                    job_array.add(ii // slurm_job_size)
             return super().gen_script_header(job) + "\n#SBATCH --array=%s" % (
                 ",".join(map(str, job_array))
            )
         return super().gen_script_header(job) + "\n#SBATCH --array=0-%d" % (
-            len(job.job_task_list) - 1
+            math.ceil(len(job.job_task_list) / slurm_job_size) - 1
         )
 
     def gen_script_command(self, job):
         resources = job.resources
+        slurm_job_size = resources.kwargs.get("slurm_job_size", 1)
         # SLURM_ARRAY_TASK_ID: 0 ~ n_jobs-1
         script_command = "case $SLURM_ARRAY_TASK_ID in\n"
         for ii, task in enumerate(job.job_task_list):
@@ -243,10 +269,16 @@ class SlurmJobArray(Slurm):
                 task_tag_finished=task_tag_finished,
                 log_err_part=log_err_part,
             )
-
+            if ii % slurm_job_size == 0:
+                script_command += f"{ii // slurm_job_size})\n"
             script_command += single_script_command
             script_command += self.gen_script_wait(resources=resources)
-            script_command += "\n
+            script_command += "\n"
+            if (
+                ii % slurm_job_size == slurm_job_size - 1
+                or ii == len(job.job_task_list) - 1
+            ):
+                script_command += ";;\n"
         script_command += "*)\nexit 1\n;;\nesac\n"
         return script_command
 
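To visualize the effect of the grouping above, here is a toy sketch (not from the package; the task commands are placeholders) of the case block that `gen_script_command` now emits when `slurm_job_size=2`:

```python
slurm_job_size = 2
task_commands = ["bash task_0.sub", "bash task_1.sub", "bash task_2.sub"]  # placeholders

script = "case $SLURM_ARRAY_TASK_ID in\n"
for ii, cmd in enumerate(task_commands):
    if ii % slurm_job_size == 0:
        script += f"{ii // slurm_job_size})\n"  # open a new array-element branch
    script += cmd + "\n"
    if ii % slurm_job_size == slurm_job_size - 1 or ii == len(task_commands) - 1:
        script += ";;\n"  # close the branch
script += "*)\nexit 1\n;;\nesac\n"
print(script)
# tasks 0 and 1 run under SLURM_ARRAY_TASK_ID 0, task 2 under SLURM_ARRAY_TASK_ID 1
```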

@@ -337,9 +369,30 @@
     def check_finish_tag(self, job):
         results = []
         for task in job.job_task_list:
-            task_tag_finished = (
-                pathlib.PurePath(task.task_work_path)
-                / (task.task_hash + "_task_tag_finished")
-            ).as_posix()
-            results.append(self.context.check_file_exists(task_tag_finished))
+            task.get_task_state(self.context)
+            results.append(task.task_state == JobStatus.finished)
         return all(results)
+
+    @classmethod
+    def resources_subfields(cls) -> List[Argument]:
+        """Generate the resources subfields.
+
+        Returns
+        -------
+        list[Argument]
+            resources subfields
+        """
+        doc_slurm_job_size = "Number of tasks in a Slurm job"
+        arg = super().resources_subfields()[0]
+        arg.extend_subfields(
+            [
+                Argument(
+                    "slurm_job_size",
+                    int,
+                    optional=True,
+                    default=1,
+                    doc=doc_slurm_job_size,
+                ),
+            ]
+        )
+        return [arg]