checkpoint-engine 0.2.0__tar.gz → 0.2.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/PKG-INFO +54 -4
- {checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/README.md +53 -3
- {checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/checkpoint_engine/_version.py +3 -3
- checkpoint_engine-0.2.1/checkpoint_engine/device_utils.py +86 -0
- {checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/checkpoint_engine/ps.py +138 -103
- checkpoint_engine-0.2.1/checkpoint_engine/worker.py +165 -0
- {checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/checkpoint_engine.egg-info/PKG-INFO +54 -4
- {checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/checkpoint_engine.egg-info/SOURCES.txt +2 -0
- checkpoint_engine-0.2.1/docs/npu_start.md +91 -0
- checkpoint_engine-0.2.1/tests/test_update.py +234 -0
- checkpoint_engine-0.2.0/checkpoint_engine/worker.py +0 -109
- checkpoint_engine-0.2.0/tests/test_update.py +0 -90
- {checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/.github/workflows/cpu-tests.yml +0 -0
- {checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/.github/workflows/pre-commit.yaml +0 -0
- {checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/.github/workflows/python-publish.yml +0 -0
- {checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/.gitignore +0 -0
- {checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/.pre-commit-config.yaml +0 -0
- {checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/LICENCE +0 -0
- {checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/checkpoint_engine/__init__.py +0 -0
- {checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/checkpoint_engine.egg-info/dependency_links.txt +0 -0
- {checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/checkpoint_engine.egg-info/requires.txt +0 -0
- {checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/checkpoint_engine.egg-info/top_level.txt +0 -0
- {checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/examples/update.py +0 -0
- {checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/figures/checkpoint-engine.png +0 -0
- {checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/figures/overlap-update-and-copy.png +0 -0
- {checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/figures/pipeline.png +0 -0
- {checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/patches/vllm_fp8.patch +0 -0
- {checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/pyproject.toml +0 -0
- {checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/setup.cfg +0 -0
- {checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/tests/test_assign_receiver_ranks.py +0 -0
- {checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/tests/test_rdma_parser.py +0 -0
{checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: checkpoint-engine
-Version: 0.2.0
+Version: 0.2.1
 Summary: checkpoint-engine is a lightweight, decoupling and efficient weight update middleware
 Project-URL: Homepage, https://github.com/MoonshotAI/checkpoint-engine
 Project-URL: Repository, https://github.com/MoonshotAI/checkpoint-engine
@@ -169,13 +169,63 @@ A [PR](https://github.com/vllm-project/vllm/pull/24488) is opened to the vLLM pr
 Run a simple correctness test for checkpoint_engine
 
 ```bash
-
+pytest tests/test_update.py
 ```
 
-
+`test_update.py` is only designed to run with `pytest`. Please don't run it directly with `torchrun`.
+
+The other unit tests can also be run with pytest. Only `test_update.py` requires GPUs; the rest can run on CPUs. To run only the CPU tests, use:
+
+```bash
+pytest tests/ -m "not gpu"
+```
+
+## SGLang Integration
+
+Checkpoint Engine provides efficient distributed checkpoint loading for SGLang inference servers, significantly reducing model loading time for large models and multi-node setups.
+
+### Quick Start
+
+**1. Install checkpoint-engine:**
+```bash
+pip install 'checkpoint-engine[p2p]'
+```
+
+**2. Launch SGLang server:**
+```bash
+python -m sglang.launch_server \
+    --model-path $MODEL_PATH \
+    --tp 8 \
+    --load-format dummy \
+    --wait-for-initial-weights
+```
+
+**3. Run checkpoint engine:**
+```bash
+python -m sglang.srt.checkpoint_engine.update \
+    --update-method broadcast \
+    --checkpoint-path $MODEL_PATH \
+    --inference-parallel-size 8
+```
+
+### Multi-Node Setup
+
+For a 2-node setup, run the same commands on both nodes with the appropriate `--host` and distributed training parameters.
+
+### Key Options
+
+**SGLang Server:**
+- `--wait-for-initial-weights`: Wait for the checkpoint engine before becoming ready
+- `--load-format dummy`: Enable overlapping initialization tasks
+
+**Checkpoint Engine:**
+- `--update-method`: Choose `broadcast`, `p2p`, or `all`
+- `--inference-parallel-size`: Number of parallel processes
+- `--checkpoint-path`: Model checkpoint directory
+
 ## Limitations and Future Work
 
-- This project is currently
+- This project is currently tested with vLLM and SGLang. Integration with other frameworks is planned for future releases.
 - The perfect three-stage pipeline mentioned in our paper is currently not implemented. This could be useful for architectures where H2D and broadcast do not conflict in PCIE.
 
 ## Acknowledgments
{checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/README.md

@@ -145,13 +145,63 @@ A [PR](https://github.com/vllm-project/vllm/pull/24488) is opened to the vLLM pr
 Run a simple correctness test for checkpoint_engine
 
 ```bash
-
+pytest tests/test_update.py
 ```
 
-
+`test_update.py` is only designed to run with `pytest`. Please don't run it directly with `torchrun`.
+
+The other unit tests can also be run with pytest. Only `test_update.py` requires GPUs; the rest can run on CPUs. To run only the CPU tests, use:
+
+```bash
+pytest tests/ -m "not gpu"
+```
+
+## SGLang Integration
+
+Checkpoint Engine provides efficient distributed checkpoint loading for SGLang inference servers, significantly reducing model loading time for large models and multi-node setups.
+
+### Quick Start
+
+**1. Install checkpoint-engine:**
+```bash
+pip install 'checkpoint-engine[p2p]'
+```
+
+**2. Launch SGLang server:**
+```bash
+python -m sglang.launch_server \
+    --model-path $MODEL_PATH \
+    --tp 8 \
+    --load-format dummy \
+    --wait-for-initial-weights
+```
+
+**3. Run checkpoint engine:**
+```bash
+python -m sglang.srt.checkpoint_engine.update \
+    --update-method broadcast \
+    --checkpoint-path $MODEL_PATH \
+    --inference-parallel-size 8
+```
+
+### Multi-Node Setup
+
+For a 2-node setup, run the same commands on both nodes with the appropriate `--host` and distributed training parameters.
+
+### Key Options
+
+**SGLang Server:**
+- `--wait-for-initial-weights`: Wait for the checkpoint engine before becoming ready
+- `--load-format dummy`: Enable overlapping initialization tasks
+
+**Checkpoint Engine:**
+- `--update-method`: Choose `broadcast`, `p2p`, or `all`
+- `--inference-parallel-size`: Number of parallel processes
+- `--checkpoint-path`: Model checkpoint directory
+
 ## Limitations and Future Work
 
-- This project is currently
+- This project is currently tested with vLLM and SGLang. Integration with other frameworks is planned for future releases.
 - The perfect three-stage pipeline mentioned in our paper is currently not implemented. This could be useful for architectures where H2D and broadcast do not conflict in PCIE.
 
 ## Acknowledgments
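The `pytest tests/ -m "not gpu"` command in the README above implies that GPU-dependent tests carry a `gpu` marker. A minimal sketch of how such a test could be marked so that the CPU-only run skips it (the test body and the marker registration are assumptions, not taken from the package):

```python
# Illustrative only: a GPU-only test marked so that `pytest -m "not gpu"` skips it.
# Assumes a `gpu` marker is registered in the project's pytest configuration.
import pytest
import torch


@pytest.mark.gpu
def test_requires_cuda_device():
    assert torch.cuda.is_available()
```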
{checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/checkpoint_engine/_version.py

@@ -28,7 +28,7 @@ version_tuple: VERSION_TUPLE
 commit_id: COMMIT_ID
 __commit_id__: COMMIT_ID
 
-__version__ = version = '0.2.0'
-__version_tuple__ = version_tuple = (0, 2, 0)
+__version__ = version = '0.2.1'
+__version_tuple__ = version_tuple = (0, 2, 1)
 
-__commit_id__ = commit_id = '
+__commit_id__ = commit_id = 'g279a908a9'
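The regenerated `_version.py` above exposes the usual setuptools-scm attributes; a quick way to confirm which build is installed (assumes the 0.2.1 release is the installed version):

```python
# Prints the installed checkpoint-engine version and version tuple.
from checkpoint_engine._version import __version__, version_tuple

print(__version__, version_tuple)  # expected with this release: 0.2.1 (0, 2, 1)
```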
checkpoint_engine-0.2.1/checkpoint_engine/device_utils.py (new file)

@@ -0,0 +1,86 @@
+import os
+import re
+import socket
+import subprocess
+from functools import lru_cache
+
+import torch
+from loguru import logger
+
+
+@lru_cache(maxsize=1)
+def get_ip() -> str:
+    try:
+        # try to get ip from network interface
+        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
+            s.connect(("8.8.8.8", 80))
+            return s.getsockname()[0]
+    except Exception as e:  # noqa: BLE001
+        # fallback to get ip from hostname
+        logger.warning(
+            f"fail to get ip from network interface, fallback to get ip from hostname: {e}"
+        )
+        return socket.gethostbyname(socket.gethostname())
+
+
+def npu_generate_uuid() -> str:
+    str_pid = str(os.getpid())
+    npu_num = 8
+    try:
+        for npu_id in range(npu_num):
+            cmd = ["npu-smi", "info", "-t", "proc-mem", "-i", str(npu_id)]
+            result = subprocess.run(cmd, check=True, capture_output=True, text=True)  # noqa: S603
+            str_result = str(result.stdout)
+            if str_pid in str_result:
+                # In A3 server, one NPU has two chips.
+                match_chip_count = re.search(r"Chip Count[^\d]*(\d+)", str_result)
+                chip_count = int(match_chip_count.group(1))
+                search_after_pid = str_result[str_result.find(str_pid) + len(str_pid) :]
+                match_chip_id = re.search(r"Chip ID[^\d]*(\d+)", search_after_pid)
+                chip_id = int(match_chip_id.group(1))
+                return f"{get_ip()}-{npu_id * chip_count + chip_id}"
+        raise ValueError("The current process is not running on the npu device")
+    except subprocess.CalledProcessError as e:
+        raise ValueError("The current process is not running on the npu device") from e
+
+
+class DeviceManager:
+    def __init__(self):
+        self.device_type = self._detect_device_type()
+        self._setup_device_module()
+
+    def _is_torch_npu_available(self) -> bool:
+        try:
+            if hasattr(torch, "npu") and callable(getattr(torch.npu, "is_available", None)):
+                return torch.npu.is_available()
+            else:
+                return False
+        except ImportError:
+            return False
+
+    def _detect_device_type(self) -> str:
+        if self._is_torch_npu_available():
+            return "npu"
+        elif torch.cuda.is_available():
+            return "cuda"
+        else:
+            raise TypeError("The current device type is not supported")
+
+    def _setup_device_module(self):
+        if self.device_type == "npu":
+            import torch_npu
+
+            self.device_module = torch_npu.npu
+        elif self.device_type == "cuda":
+            self.device_module = torch.cuda
+        else:
+            raise TypeError("The current device type is not supported")
+
+    @property
+    def backend(self) -> str:
+        if self.device_type == "npu":
+            return "hccl"
+        elif self.device_type == "cuda":
+            return "nccl"
+        else:
+            raise TypeError("The current device type is not supported")
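The new `DeviceManager` is the abstraction the rest of this release uses to stay device-agnostic between CUDA and Ascend NPU. A minimal usage sketch (assumes checkpoint-engine 0.2.1 is installed and a CUDA or NPU device is present):

```python
# Minimal sketch of the DeviceManager added above.
from checkpoint_engine.device_utils import DeviceManager, get_ip

dm = DeviceManager()               # detects "npu" or "cuda", otherwise raises TypeError
print(dm.device_type, dm.backend)  # e.g. "cuda nccl" on GPU hosts, "npu hccl" on Ascend hosts
dm.device_module.set_device(0)     # same call path ps.py now uses before initializing P2PStore
print(get_ip())                    # cached host IP, used as the P2P handshake address
```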
{checkpoint_engine-0.2.0 → checkpoint_engine-0.2.1}/checkpoint_engine/ps.py

@@ -4,13 +4,11 @@ import ctypes
 import os
 import pickle
 import random
-import socket
 import threading
 import time
 from collections import defaultdict
 from collections.abc import Callable
 from datetime import timedelta
-from functools import lru_cache
 from typing import TYPE_CHECKING, Annotated, Any, BinaryIO, NamedTuple
 
 import httpx
@@ -23,6 +21,8 @@ from pydantic import BaseModel, PlainSerializer, PlainValidator, WithJsonSchema
 from safetensors.torch import safe_open
 from torch.multiprocessing.reductions import reduce_tensor
 
+from checkpoint_engine.device_utils import DeviceManager, get_ip, npu_generate_uuid
+
 
 if TYPE_CHECKING:
     from typing import TypeVar
@@ -254,28 +254,16 @@ def _concat_tp_weights(
     return torch.cat([w for w in tp_weights], dim=tp_concat_dim)
 
 
-def _get_physical_gpu_id(device_index: int | None = None) -> str:
+def _get_physical_gpu_id(device_manager: DeviceManager, device_index: int | None = None) -> str:
     try:
-
+        if device_manager.device_type == "npu":
+            return f"NPU-{npu_generate_uuid()}"
+        else:
+            return f"GPU-{device_manager.device_module.get_device_properties(device_index).uuid!s}"
     except AssertionError as e:
         raise ValueError(f"fail to get physical gpu id {device_index}") from e
 
 
-@lru_cache(maxsize=1)
-def _get_ip() -> str:
-    try:
-        # try to get ip from network interface
-        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
-            s.connect(("8.8.8.8", 80))
-            return s.getsockname()[0]
-    except Exception as e:  # noqa: BLE001
-        # fallback to get ip from hostname
-        logger.warning(
-            f"fail to get ip from network interface, fallback to get ip from hostname: {e}"
-        )
-        return socket.gethostbyname(socket.gethostname())
-
-
 def _ibv_get_device_list() -> list[str]:
     lib = ctypes.CDLL("libibverbs.so.1")
     lib.ibv_get_device_list.argtypes = [ctypes.POINTER(ctypes.c_int)]  # int *num_devices
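On the CUDA path, `_get_physical_gpu_id` simply formats the device UUID reported by the torch device module. A small standalone check of that branch (requires a CUDA build of PyTorch and at least one visible GPU):

```python
# Reproduces the CUDA branch of _get_physical_gpu_id outside the class.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU-{props.uuid!s}")  # same "GPU-<uuid>" formatting used above
```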
@@ -317,13 +305,21 @@ def _get_my_rdma_device(local_rank: int, gpu_count: int, devices: list[str]) ->
     """
     if not devices:
         raise RuntimeError("no rdma devices found")
-
-
-
-
-
-
-
+    try:
+        assert len(devices) <= gpu_count, (
+            f"rdma devices count {len(devices)} should be less than or equal to gpu count {gpu_count}"
+        )
+        assert gpu_count % len(devices) == 0, (
+            f"gpu count {gpu_count} should be divisible by rdma devices count {len(devices)}"
+        )
+        return devices[local_rank // (gpu_count // len(devices))]
+    except AssertionError:
+        logger.error(
+            "Please set 'NCCL_IB_HCA' or 'PS_P2P_STORE_RDMA_DEVICES' environment variable to choose proper number of RDMA devices."
+            "The number of RDMA devices should be less than or equal to GPU count, and GPU count should be divisible by the number of RDMA devices."
+            "The acceptable value by NCCL_IB_HCA is documented in 'https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#id8'."
+        )
+        raise
 
 
 def _parse_NCCL_IB_HCA(value: str, available_devices: list[str]) -> list[str]:
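The mapping added above assigns consecutive local ranks to the same RDMA NIC by integer division. A standalone sketch of the same arithmetic with illustrative device names (real names come from libibverbs or `NCCL_IB_HCA`):

```python
# Standalone sketch of the rank -> RDMA device mapping used in _get_my_rdma_device.
def my_rdma_device(local_rank: int, gpu_count: int, devices: list[str]) -> str:
    assert devices and len(devices) <= gpu_count and gpu_count % len(devices) == 0
    return devices[local_rank // (gpu_count // len(devices))]


devices = ["mlx5_0", "mlx5_1", "mlx5_2", "mlx5_3"]  # illustrative NIC names
# 8 GPUs sharing 4 NICs: ranks 0-1 -> mlx5_0, 2-3 -> mlx5_1, 4-5 -> mlx5_2, 6-7 -> mlx5_3
print([my_rdma_device(r, 8, devices) for r in range(8)])
```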
@@ -677,20 +673,29 @@ def _get_bcast_rank_map(world_size: int, ranks: list[int] | None) -> dict[int, i
 
 
 class P2PStore:
-    def __init__(self):
+    def __init__(self, device_manager: DeviceManager):
         from mooncake.engine import TransferEngine
 
         self.rank = int(os.getenv("RANK"))
-        gpu_count =
+        gpu_count = device_manager.device_module.device_count()
         local_rank = self.rank % gpu_count
-
-
+        device_type = device_manager.device_type
+        if device_type == "npu" and os.getenv("PS_P2P_STORE_RDMA_DEVICES") is None:
+            self.device = ""
+        else:
+            self.device = _get_my_rdma_device(local_rank, gpu_count, _get_rdma_devices())
+        self.ip = get_ip()
 
         # we will start at most 8 ps processes, so we use 8 retries to avoid port conflicts in extreme cases
         retry_count = 8
         for i in range(retry_count):
             self.engine = TransferEngine()
-            ret = self.engine.initialize(
+            ret = self.engine.initialize(
+                self.ip,
+                "P2PHANDSHAKE",
+                "ascend_direct" if device_type == "npu" else "rdma",
+                self.device,
+            )
             if ret == 0:
                 break
             # sleep 0.5 ~ 2.0s, to avoid port conflicts when two processes retry at the same time
@@ -757,11 +762,12 @@ class ParameterServer:
         Args:
             auto_pg: Whether to automatically initialize the process group.
                 Notice that if auto_pg is True, will destroy the process group after update.
-            mem_fraction: The proportion (as a fraction) of the current free
+            mem_fraction: The proportion (as a fraction) of the current free device memory for allocation.
         """
         self._rank = rank or int(os.environ.get("RANK", None))
         self._world_size = world_size or int(os.environ.get("WORLD_SIZE", None))
-        self.
+        self.device_manager = DeviceManager()
+        self._gpu_count = gpu_count or self.device_manager.device_module.device_count()
         self._local_rank = self._rank % self._gpu_count
         self._auto_pg = auto_pg
         self._all_hosts = []
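For context, the constructor documented above is normally driven by environment variables from the launcher. A rough usage sketch (the argument names come from this diff; the launch assumptions are mine):

```python
# Rough sketch: constructing the parameter server under torchrun-style env vars.
# RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are assumed to be set by the launcher.
from checkpoint_engine.ps import ParameterServer

ps = ParameterServer(auto_pg=True, mem_fraction=0.9)
```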
@@ -775,7 +781,7 @@ class ParameterServer:
         assert (
             self._gpu_count is not None
             and self._gpu_count > 0
-            and self._gpu_count <=
+            and self._gpu_count <= self.device_manager.device_module.device_count()
         ), self._gpu_count
         assert (
             self._mem_fraction is not None and self._mem_fraction > 0 and self._mem_fraction <= 1
@@ -787,15 +793,16 @@ class ParameterServer:
         self._memory_pool: dict[str, list[MemoryBuffer]] = {}
         # dict key is owner_rank, value is a bucket metas list in owner_rank
         self._current_global_parameter_metas: dict[int, MemoryBufferMetaList] = {}
+        # NPU transfer engine initialization requires prior set_device.
+        device_index = self._local_rank
+        self.device_manager.device_module.set_device(device_index)
         try:
-            self._p2p_store = P2PStore()
+            self._p2p_store = P2PStore(self.device_manager)
         except ImportError as e:
             logger.warning(f"[rank{self._rank}] fail to initialize p2p store due to {e}")
             self._p2p_store = None
 
-
-        torch.cuda.set_device(device_index)
-        self._device_uuid = _get_physical_gpu_id(device_index)
+        self._device_uuid = _get_physical_gpu_id(self.device_manager, device_index)
         self._rdma_device = None if self._p2p_store is None else self._p2p_store.device
 
     def _logger_rank0(self, msg: str):
@@ -885,13 +892,15 @@ class ParameterServer:
                 for x in self._memory_pool.get(checkpoint_name, [])
             ],
             p2p_store_addr=None if self._p2p_store is None else self._p2p_store.addr,
-            host_ip=
+            host_ip=get_ip(),
             device_uuid=self._device_uuid,
             rdma_device=self._rdma_device or "",
         )
 
         dist.all_gather_object(metas_lst, metas)
 
+        self._current_global_parameter_metas = {}
+
         num_parameters = 0
         all_hosts: list[str] = []
         global_device_uuids: list[str] = []
@@ -948,7 +957,7 @@ class ParameterServer:
             is_master=self._rank == 0,
         )
         dist.init_process_group(
-            backend=
+            backend=self.device_manager.backend,
             world_size=self._world_size,
             rank=self._rank,
             timeout=timeout,
@@ -991,21 +1000,22 @@ class ParameterServer:
             if self._rank not in ranks:
                 return
             self._update_per_bucket(checkpoint_name, req_func, ranks)
-            if self._auto_pg:
-                dist.destroy_process_group()
-
-            torch.cuda.empty_cache()
 
-            logger.info(
-                f"[rank{self._rank}] update checkpoint {checkpoint_name} with ranks {ranks} done. "
-                f"Current CUDA allocated {torch.cuda.memory_allocated() / 1024 / 1024} MB, "
-                f"reserved {torch.cuda.memory_reserved() / 1024 / 1024} MB."
-            )
         except Exception as e:
             logger.exception(
                 f"[rank{self._rank}] update checkpoint {checkpoint_name} with ranks {ranks} error {e}"
             )
             raise
+        finally:
+            if self._auto_pg and (not ranks or self._rank in ranks):
+                dist.destroy_process_group()
+
+            self.device_manager.device_module.empty_cache()
+            logger.info(
+                f"[rank{self._rank}] update checkpoint {checkpoint_name} with ranks {ranks} done. "
+                f"Current device allocated {self.device_manager.device_module.memory_allocated() / 1024 / 1024} MB, "
+                f"reserved {self.device_manager.device_module.memory_reserved() / 1024 / 1024} MB."
+            )
 
     def _bind_zmq_socket(self) -> tuple[zmq.Socket, list[tuple[str, str]]]:
         def zmq_handle(device_uuid: str) -> str:
@@ -1022,14 +1032,16 @@ class ParameterServer:
         # auto detect bucket size
         tensor = torch.tensor(
             [
-                # proportion of current
-                int(
+                # proportion of current device free memory bytes
+                int(
+                    float(self.device_manager.device_module.mem_get_info()[0]) * self._mem_fraction
+                ),
                 # we use negative value to reuse allreduce min operation
                 # for getting the max value of zmq_addr_counter in all ranks
                 -self._zmq_addr_counter,
             ],
             dtype=torch.int64,
-            device=
+            device=self.device_manager.device_type,
         )
         dist.all_reduce(tensor, op=dist.ReduceOp.MIN)
         tensor = tensor.cpu()
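The comment above relies on a small collective trick: negating a value lets one `ReduceOp.MIN` all-reduce return both a global minimum (free memory) and, via the negated entry, a global maximum (the zmq address counter). A self-contained sketch with a single-process `gloo` group:

```python
# Sketch of packing a global MIN and a global MAX into one all-reduce.
import os

import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29555")
dist.init_process_group("gloo", rank=0, world_size=1)

free_bytes = 1_000_000  # per-rank free memory (illustrative)
counter = 3             # per-rank counter whose global max we want
t = torch.tensor([free_bytes, -counter], dtype=torch.int64)
dist.all_reduce(t, op=dist.ReduceOp.MIN)  # min(free_bytes) and min(-counter) == -max(counter)
print(int(t[0]), int(-t[1]))              # -> global min free bytes, global max counter
dist.destroy_process_group()
```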
@@ -1092,7 +1104,7 @@ class ParameterServer:
         assert offset == bucket.size, f"offset {offset} != bucket_size {bucket.size}"
         if owner_rank is not None:
             self._p2p_store.batch_transfer_sync_read(target_addr, buf_ptrs, remote_ptrs, lens)
-
+        self.device_manager.device_module.synchronize()
 
     def init_process_group_for_ranks(
         self,
@@ -1132,7 +1144,11 @@ class ParameterServer:
             master_addr, master_port, len(ranks), is_master=rank == 0, timeout=timeout
         )
         dist.init_process_group(
-            backend=
+            backend=self.device_manager.backend,
+            world_size=len(ranks),
+            rank=rank,
+            timeout=timeout,
+            store=store,
         )
 
     def _get_addr_ptrs(self, owner_rank: int) -> tuple[str, list[tuple[int, int]]]:
@@ -1184,7 +1200,7 @@ class ParameterServer:
 
         if not need_update:
             return
-        # first execute a barrier to avoid subsequent
+        # first execute a barrier to avoid subsequent device oom
         dist.barrier()
 
         bucket_size, disable_h2d_buffer = self._detect_bucket_size()
@@ -1199,7 +1215,7 @@ class ParameterServer:
         h2d_buffer: torch.Tensor | None = (
             None
             if disable_h2d_buffer
-            else torch.empty(bucket_size, dtype=torch.uint8, device=
+            else torch.empty(bucket_size, dtype=torch.uint8, device=self.device_manager.device_type)
         )
         # p2p store need to register h2d_buffer to let other ranks read
         if ranks:
@@ -1212,7 +1228,9 @@ class ParameterServer:
                 continue
             receiver_rank_buckets.append((owner_rank, bucket))
 
-        buffer = torch.empty(
+        buffer = torch.empty(
+            bucket_size * 2, dtype=torch.uint8, device=self.device_manager.device_type
+        )
         handle = reduce_tensor(buffer)
 
         buckets_by_receiver_rank: dict[int, list[H2DBucket]] = defaultdict(list)
@@ -1231,52 +1249,66 @@ class ParameterServer:
             socket.send_pyobj(handle)
 
         gidx = 0
+        ret_code = torch.zeros((), device=self.device_manager.device_type, dtype=torch.int64)
         bcast_rank_map = _get_bcast_rank_map(self._world_size, ranks)
- [45 removed lines (old 1235-1279) are not rendered in this diff view]
+        try:
+            for i in range(max_len):
+                if i < len(receiver_rank_buckets) and not disable_h2d_buffer:
+                    self._copy_to_buffer(
+                        checkpoint_name,
+                        receiver_rank_buckets[i][1],
+                        h2d_buffer,
+                        receiver_rank_buckets[i][0] if ranks else None,
+                    )
+                for receiver_rank, _buckets in buckets_by_receiver_rank.items():
+                    if i >= len(_buckets):
+                        continue
+                    bucket = _buckets[i]
+                    alloc, reserved = (
+                        self.device_manager.device_module.memory_allocated() / 1024 / 1024,
+                        self.device_manager.device_module.memory_reserved() / 1024 / 1024,
+                    )
+                    self._logger_rank0(
+                        f"[rank{self._rank}] begin to update bucket {gidx + 1}/{len(buckets)} receiver_rank {receiver_rank} in checkpoint {checkpoint_name}, bucket_size: {bucket.size / 1024 / 1024:.2f}MiB, length: {len(bucket.items)}. "
+                        f"Current device allocated {alloc:.2f} MB, "
+                        f"reserved {reserved:.2f} MB."
+                    )
+                    start = gidx % 2 * bucket_size
+                    buffer_b: torch.Tensor = buffer[start : start + bucket.size]
+                    if receiver_rank == self._rank:
+                        if disable_h2d_buffer:
+                            self._copy_to_buffer(checkpoint_name, bucket, buffer_b)
+                        else:
+                            buffer_b.data.copy_(h2d_buffer[: bucket.size])
+                    brank = bcast_rank_map[receiver_rank]
+                    dist.broadcast(buffer_b, src=brank)
+                    resp = socket.recv()
+                    if resp != b"":
+                        exception_obj = pickle.loads(resp)
+                        logger.error(
+                            f"[rank{self._rank}] receive error response '{type(exception_obj).__name__}: {exception_obj}' from rank {receiver_rank} for bucket {gidx} in checkpoint {checkpoint_name}"
+                        )
+                        ret_code.fill_(1)
+                    dist.all_reduce(ret_code, op=dist.ReduceOp.SUM)
+                    self.device_manager.device_module.synchronize()
+                    if ret_code.item() != 0:
+                        # quit early if any rank failed
+                        socket.send_pyobj(RuntimeError("Some workers failed to update weights"))
+                        raise RuntimeError("Failed to update weights due to remote errors")
+                    socket.send_pyobj(_to_named_tensor(bucket.items, gidx % 2 * bucket_size))
+                    gidx += 1
+
+            socket.recv()
+            socket.send_pyobj(None)
+            socket.recv()
+        finally:
+            req_thread.join()
+            dist.barrier()
+            socket.close()
+            if ranks and h2d_buffer is not None:
+                self._p2p_store.unregister_named_tensors([h2d_buffer_name])
+
+        self.device_manager.device_module.empty_cache()
 
 
 def _init_api(ps: ParameterServer) -> Any:
@@ -1294,6 +1326,7 @@ def _init_api(ps: ParameterServer) -> Any:
         update_url: str | None = None
         inference_group_ranks: list[int] = []
         timeout: float = 300.0
+        uds: str | None = None
 
     def wrap_exception(func: Callable[[], None]) -> Response:
         try:
@@ -1326,7 +1359,9 @@ def _init_api(ps: ParameterServer) -> Any:
             return
         if req.inference_group_ranks:
             socket_paths = [socket_paths[i] for i in req.inference_group_ranks]
-        request_inference_to_update(
+        request_inference_to_update(
+            req.update_url, dict(socket_paths), timeout=req.timeout, uds=req.uds
+        )
 
         return wrap_exception(lambda: ps.update(checkpoint_name, update_func, ranks=req.ranks))
 
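The new `uds` field threaded into `request_inference_to_update` above suggests the inference endpoint can also be reached over a Unix domain socket. A sketch of what such a client call can look like with httpx (the endpoint path, socket path, and payload below are placeholders, not the package's actual API):

```python
# Sketch only: an httpx request over a Unix domain socket, the transport the
# new `uds` option implies. All names below are hypothetical placeholders.
import httpx

transport = httpx.HTTPTransport(uds="/tmp/inference_server.sock")  # hypothetical socket path
with httpx.Client(transport=transport, base_url="http://localhost") as client:
    resp = client.post("/update_weights", json={"socket_paths": {}}, timeout=300.0)
    resp.raise_for_status()
```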