checkpoint-engine 0.1.2__tar.gz → 0.2.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (27)
  1. checkpoint_engine-0.2.0/.github/workflows/cpu-tests.yml +30 -0
  2. {checkpoint_engine-0.1.2 → checkpoint_engine-0.2.0}/PKG-INFO +18 -11
  3. {checkpoint_engine-0.1.2 → checkpoint_engine-0.2.0}/README.md +17 -10
  4. {checkpoint_engine-0.1.2 → checkpoint_engine-0.2.0}/checkpoint_engine/_version.py +3 -3
  5. {checkpoint_engine-0.1.2 → checkpoint_engine-0.2.0}/checkpoint_engine/ps.py +272 -116
  6. {checkpoint_engine-0.1.2 → checkpoint_engine-0.2.0}/checkpoint_engine.egg-info/PKG-INFO +18 -11
  7. {checkpoint_engine-0.1.2 → checkpoint_engine-0.2.0}/checkpoint_engine.egg-info/SOURCES.txt +3 -0
  8. {checkpoint_engine-0.1.2 → checkpoint_engine-0.2.0}/pyproject.toml +5 -0
  9. checkpoint_engine-0.2.0/tests/test_assign_receiver_ranks.py +68 -0
  10. checkpoint_engine-0.2.0/tests/test_rdma_parser.py +197 -0
  11. {checkpoint_engine-0.1.2 → checkpoint_engine-0.2.0}/.github/workflows/pre-commit.yaml +0 -0
  12. {checkpoint_engine-0.1.2 → checkpoint_engine-0.2.0}/.github/workflows/python-publish.yml +0 -0
  13. {checkpoint_engine-0.1.2 → checkpoint_engine-0.2.0}/.gitignore +0 -0
  14. {checkpoint_engine-0.1.2 → checkpoint_engine-0.2.0}/.pre-commit-config.yaml +0 -0
  15. {checkpoint_engine-0.1.2 → checkpoint_engine-0.2.0}/LICENCE +0 -0
  16. {checkpoint_engine-0.1.2 → checkpoint_engine-0.2.0}/checkpoint_engine/__init__.py +0 -0
  17. {checkpoint_engine-0.1.2 → checkpoint_engine-0.2.0}/checkpoint_engine/worker.py +0 -0
  18. {checkpoint_engine-0.1.2 → checkpoint_engine-0.2.0}/checkpoint_engine.egg-info/dependency_links.txt +0 -0
  19. {checkpoint_engine-0.1.2 → checkpoint_engine-0.2.0}/checkpoint_engine.egg-info/requires.txt +0 -0
  20. {checkpoint_engine-0.1.2 → checkpoint_engine-0.2.0}/checkpoint_engine.egg-info/top_level.txt +0 -0
  21. {checkpoint_engine-0.1.2 → checkpoint_engine-0.2.0}/examples/update.py +0 -0
  22. {checkpoint_engine-0.1.2 → checkpoint_engine-0.2.0}/figures/checkpoint-engine.png +0 -0
  23. {checkpoint_engine-0.1.2 → checkpoint_engine-0.2.0}/figures/overlap-update-and-copy.png +0 -0
  24. {checkpoint_engine-0.1.2 → checkpoint_engine-0.2.0}/figures/pipeline.png +0 -0
  25. {checkpoint_engine-0.1.2 → checkpoint_engine-0.2.0}/patches/vllm_fp8.patch +0 -0
  26. {checkpoint_engine-0.1.2 → checkpoint_engine-0.2.0}/setup.cfg +0 -0
  27. {checkpoint_engine-0.1.2 → checkpoint_engine-0.2.0}/tests/test_update.py +0 -0
@@ -0,0 +1,30 @@
1
+ name: CPU Tests
2
+
3
+ on:
4
+ push:
5
+ branches: [main]
6
+ pull_request:
7
+ types: [opened, synchronize, reopened]
8
+
9
+
10
+ permissions:
11
+ contents: read
12
+
13
+ jobs:
14
+ build:
15
+ runs-on: ubuntu-latest
16
+ steps:
17
+ - name: Checkout code
18
+ uses: actions/checkout@v4
19
+ - name: Set up Python
20
+ uses: actions/setup-python@v3
21
+ with:
22
+ python-version: "3.10"
23
+ - name: Install dependencies
24
+ run: |
25
+ python -m pip install --upgrade pip
26
+ pip install pytest
27
+ pip install .[p2p]
28
+ - name: Do CPU tests with pytest
29
+ run: |
30
+ pytest -v -m "not gpu" tests/
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: checkpoint-engine
3
- Version: 0.1.2
3
+ Version: 0.2.0
4
4
  Summary: checkpoint-engine is a lightweight, decoupling and efficient weight update middleware
5
5
  Project-URL: Homepage, https://github.com/MoonshotAI/checkpoint-engine
6
6
  Project-URL: Repository, https://github.com/MoonshotAI/checkpoint-engine
@@ -38,8 +38,8 @@ updating our [Kimi-K2](https://github.com/MoonshotAI/Kimi-K2) model (1 Trillion
38
38
 
39
39
  The core weight update logic is in the `ParameterServer` class, a service colocated with inference engines. It provides two implementations of weight update: Broadcast and P2P.
40
40
 
41
- - **Broadcast**: Used when a large number of inference instances need to update weights in synchronous. This is the fastest implementation and should be used as the default update method. See `_update_per_bucket`.
42
- - **P2P**: Used when new inference instances are dynamically added (due to restarts or dynamic availability) while the existing instances are already serving requests. Under this scenario, to avoid affecting the workloads on existing instances, we use the [`mooncake-transfer-engine`](https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#use-python-package) to P2P send weights from CPUs in existing instances to GPUs in new instances. See `_update_per_bucket_p2p`.
41
+ - **Broadcast**: Used when a large number of inference instances need to update weights synchronously. This is the fastest implementation and should be used as the default update method. See `_update_per_bucket` with `ranks` left as `None` or `[]`.
42
+ - **P2P**: Used when new inference instances are dynamically added (due to restarts or dynamic availability) while the existing instances are already serving requests. Under this scenario, to avoid affecting the workloads on existing instances, we use the [`mooncake-transfer-engine`](https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#use-python-package) to P2P send weights from CPUs in existing instances to GPUs in new instances. See `_update_per_bucket` with `ranks` specified.
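Both paths go through the same public entry point; passing `ranks` switches from broadcast to P2P. A minimal usage sketch (argument order mirrors the internal `_update_per_bucket` calls visible later in this diff; check `examples/update.py` for the exact public call, and note that the `req_func` callback driving the inference engine is application-specific and only stubbed here):

```python
from checkpoint_engine.ps import ParameterServer

# RANK, WORLD_SIZE and MASTER_ADDR are expected in the environment.
ps = ParameterServer(auto_pg=True)

def req_func(socket_paths: list[tuple[str, str]]) -> None:
    # Application-specific: ask the colocated inference engine to pull the
    # new weights through the ZMQ sockets listed in socket_paths.
    ...

# After the checkpoint has been registered and parameter metas gathered on all ranks:
ps.update("my-checkpoint", req_func)                      # Broadcast: update every rank
ps.update("my-checkpoint", req_func, ranks=range(0, 16))  # P2P: update only the listed ranks
```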
43
43
 
44
44
  ### Optimized Weight Broadcast
45
45
  In the *Broadcast* implementation, the checkpoint-engine holds references to sharded weights in CPU memory, and needs to efficiently broadcast them to a cluster of inference instances, often under a different sharding pattern.
@@ -60,16 +60,22 @@ It then executes the transfer, where it controls the inference engine through a
60
60
 
61
61
  Pipelining naturally requires more GPU memory. When memory is insufficient, checkpoint-engine will fall back to serial execution.
62
62
 
63
+ ### Optimized P2P Bucket Assignment
64
+ In the *P2P* implementation, checkpoint-engine needs to send weights from existing instances to new instances.
65
+ To minimize the overall transfer time, checkpoint-engine optimizes the bucket assignment for each sender-receiver pair.
66
+ The optimization goal is to make full use of the available network bandwidth for each sender and receiver.
67
+ See [issue #25](https://github.com/MoonshotAI/checkpoint-engine/issues/25) for details.
68
+
63
69
  ## Benchmark
64
70
 
65
71
  | Model | Device Info | GatherMetas | Update (Broadcast) | Update (P2P) |
66
72
  | :----------------------------------- | :----------- | :---------- |:-------------------| :---------------------- |
67
- | GLM-4.5-Air (BF16) | 8xH800 TP8 | 0.17s | 3.94s (1.42GiB) | 8.83s (4.77GiB) |
68
- | Qwen3-235B-A22B-Instruct-2507 (BF16) | 8xH800 TP8 | 0.46s | 6.75s (2.69GiB) | 16.47s (4.05GiB) |
69
- | DeepSeek-V3.1 (FP8) | 16xH20 TP16 | 1.44s | 12.22s (2.38GiB) | 25.77s (3.61GiB) |
70
- | Kimi-K2-Instruct (FP8) | 16xH20 TP16 | 1.81s | 15.45s (2.93GiB) | 36.24s (4.46GiB) |
71
- | DeepSeek-V3.1 (FP8) | 256xH20 TP16 | 1.40s | 13.88s (2.54GiB) | 33.30s (3.86 GiB) |
72
- | Kimi-K2-Instruct (FP8) | 256xH20 TP16 | 1.88s | 21.50s (2.99GiB) | 34.49s (4.57 GiB) |
73
+ | GLM-4.5-Air (BF16) | 8xH800 TP8 | 0.12s | 3.47s (3.02GiB) | 4.12s (3.02GiB) |
74
+ | Qwen3-235B-A22B-Instruct-2507 (BF16) | 8xH800 TP8 | 0.33s | 6.22s (2.67GiB) | 7.10s (2.68GiB) |
75
+ | DeepSeek-V3.1 (FP8) | 16xH20 TP16 | 1.17s | 10.19s (5.39GiB) | 11.80s (5.41GiB) |
76
+ | Kimi-K2-Instruct (FP8) | 16xH20 TP16 | 1.33s | 14.36s (5.89GiB) | 17.49s (5.91GiB) |
77
+ | DeepSeek-V3.1 (FP8) | 256xH20 TP16 | 0.80s | 11.33s (8.00GiB) | 11.81s (8.00GiB) |
78
+ | Kimi-K2-Instruct (FP8) | 256xH20 TP16 | 1.22s | 16.04s (8.00GiB) | 16.75s (8.00GiB) |
73
79
 
74
80
  All results above are produced by [`examples/update.py`](./examples/update.py) and use [vLLM v0.10.2rc1](https://github.com/vllm-project/vllm/tree/v0.10.2rc1) as the inference engine. Some notes:
75
81
 
@@ -77,6 +83,7 @@ All results above are tested by [`examples/update.py`](./examples/update.py) and
77
83
  * Device Info: we tested various combinations of devices and parallelism setups. For example, a 256-GPU TP16 setup means that we deploy 16 vLLM instances, each with 16-way tensor parallelism.
78
84
  * Since update duration is related to IPC bucket size, we provide the bucket size in the table.
79
85
  * The P2P times were measured while updating no more than two nodes (16 GPUs) (`ParameterServer.update(ranks=range(0, 16))`) out of the entire cluster.
86
+ * We bind each GPU to its corresponding NUMA node to ensure stable H2D transfer speeds.
80
87
 
81
88
  ## Installation
82
89
 
@@ -92,7 +99,7 @@ Use the flexible P2P implementation, notice this will install `mooncake-transfer
92
99
  pip install 'checkpoint-engine[p2p]'
93
100
  ```
94
101
 
95
- If set `NCCL_IB_HCA` env, checkpoint-engine will use it to auto select net devices for different ranks. If not set, it will read all RDMA devices and try to divide them into each rank.
102
+ If the `NCCL_IB_HCA` env variable is set, checkpoint-engine will use it to auto-select network devices for different ranks. The accepted patterns are described in the [NCCL documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#id8). If it is not set, checkpoint-engine will read all RDMA devices and try to divide them among the ranks.
96
103
 
97
104
  ## Getting Started
98
105
 
@@ -165,11 +172,11 @@ Run a simple correctness test for checkpoint_engine
165
172
  torchrun --nproc-per-node 8 tests/test_update.py
166
173
  ```
167
174
 
175
+ The remaining unit tests can be run with pytest, e.g. `pytest -v -m "not gpu" tests/` for the CPU-only tests.
168
176
  ## Limitations and Future Work
169
177
 
170
178
  - This project is currently only tested with vLLM. But it is easy to integrate with other frameworks like SGLang.
171
179
  - The perfect three-stage pipeline mentioned in our paper is currently not implemented. This could be useful for architectures where H2D and broadcast do not conflict in PCIE.
172
- - The P2P update method is currently not the optimal implementation since it will receive data only in rank 0 and broadcast to others synchronizely. This is a potential optimization in the future.
173
180
 
174
181
  ## Acknowledgments
175
182
 
@@ -14,8 +14,8 @@ updating our [Kimi-K2](https://github.com/MoonshotAI/Kimi-K2) model (1 Trillion
14
14
 
15
15
  The core weight update logic is in the `ParameterServer` class, a service colocated with inference engines. It provides two implementations of weight update: Broadcast and P2P.
16
16
 
17
- - **Broadcast**: Used when a large number of inference instances need to update weights in synchronous. This is the fastest implementation and should be used as the default update method. See `_update_per_bucket`.
18
- - **P2P**: Used when new inference instances are dynamically added (due to restarts or dynamic availability) while the existing instances are already serving requests. Under this scenario, to avoid affecting the workloads on existing instances, we use the [`mooncake-transfer-engine`](https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#use-python-package) to P2P send weights from CPUs in existing instances to GPUs in new instances. See `_update_per_bucket_p2p`.
17
+ - **Broadcast**: Used when a large number of inference instances need to update weights synchronously. This is the fastest implementation and should be used as the default update method. See `_update_per_bucket` with `ranks` left as `None` or `[]`.
18
+ - **P2P**: Used when new inference instances are dynamically added (due to restarts or dynamic availability) while the existing instances are already serving requests. Under this scenario, to avoid affecting the workloads on existing instances, we use the [`mooncake-transfer-engine`](https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#use-python-package) to P2P send weights from CPUs in existing instances to GPUs in new instances. See `_update_per_bucket` with `ranks` specified.
19
19
 
20
20
  ### Optimized Weight Broadcast
21
21
  In the *Broadcast* implementation, the checkpoint-engine holds references to sharded weights in CPU memory, and needs to efficiently broadcast them to a cluster of inference instances, often under a different sharding pattern.
@@ -36,16 +36,22 @@ It then executes the transfer, where it controls the inference engine through a
36
36
 
37
37
  Pipelining naturally requires more GPU memory. When memory is insufficient, checkpoint-engine will fall back to serial execution.
38
38
 
39
+ ### Optimized P2P Bucket Assignment
40
+ In the *P2P* implementation, checkpoint-engine needs to send weights from existing instances to new instances.
41
+ To minimize the overall transfer time, checkpoint-engine optimizes the bucket assignment for each sender-receiver pair.
42
+ The optimization goal is to make full use of the available network bandwidth for each sender and receiver.
43
+ See [issue #25](https://github.com/MoonshotAI/checkpoint-engine/issues/25) for the discussion; a small illustration follows.
44
+
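A small, hypothetical illustration of that assignment, using the `_assign_receiver_ranks` helper added to `checkpoint_engine/ps.py` in this release (NIC names and bucket labels are made up; real buckets are `H2DBucket` objects):

```python
from checkpoint_engine.ps import _assign_receiver_ranks

# Four buckets owned by sender ranks 0-3. The senders sit behind two NICs,
# while each receiver rank has a NIC of its own.
buckets = [(0, "b0"), (1, "b1"), (2, "b2"), (3, "b3")]
remote_topo = {"nicA": {0, 1}, "nicB": {2, 3}}                     # sender side
local_topo = {"nic0": {0}, "nic1": {1}, "nic2": {2}, "nic3": {3}}  # receiver side

# Each tuple is (receiver_rank, owner_rank, bucket): receiver 0 reads from the
# nicA senders and receiver 1 from the nicB senders, so both sender NICs stay
# busy in every round instead of funneling all traffic through one link.
print(_assign_receiver_ranks(buckets, local_topo, remote_topo))
# [(0, 0, 'b0'), (1, 2, 'b2'), (0, 1, 'b1'), (1, 3, 'b3')]
```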
39
45
  ## Benchmark
40
46
 
41
47
  | Model | Device Info | GatherMetas | Update (Broadcast) | Update (P2P) |
42
48
  | :----------------------------------- | :----------- | :---------- |:-------------------| :---------------------- |
43
- | GLM-4.5-Air (BF16) | 8xH800 TP8 | 0.17s | 3.94s (1.42GiB) | 8.83s (4.77GiB) |
44
- | Qwen3-235B-A22B-Instruct-2507 (BF16) | 8xH800 TP8 | 0.46s | 6.75s (2.69GiB) | 16.47s (4.05GiB) |
45
- | DeepSeek-V3.1 (FP8) | 16xH20 TP16 | 1.44s | 12.22s (2.38GiB) | 25.77s (3.61GiB) |
46
- | Kimi-K2-Instruct (FP8) | 16xH20 TP16 | 1.81s | 15.45s (2.93GiB) | 36.24s (4.46GiB) |
47
- | DeepSeek-V3.1 (FP8) | 256xH20 TP16 | 1.40s | 13.88s (2.54GiB) | 33.30s (3.86 GiB) |
48
- | Kimi-K2-Instruct (FP8) | 256xH20 TP16 | 1.88s | 21.50s (2.99GiB) | 34.49s (4.57 GiB) |
49
+ | GLM-4.5-Air (BF16) | 8xH800 TP8 | 0.12s | 3.47s (3.02GiB) | 4.12s (3.02GiB) |
50
+ | Qwen3-235B-A22B-Instruct-2507 (BF16) | 8xH800 TP8 | 0.33s | 6.22s (2.67GiB) | 7.10s (2.68GiB) |
51
+ | DeepSeek-V3.1 (FP8) | 16xH20 TP16 | 1.17s | 10.19s (5.39GiB) | 11.80s (5.41GiB) |
52
+ | Kimi-K2-Instruct (FP8) | 16xH20 TP16 | 1.33s | 14.36s (5.89GiB) | 17.49s (5.91GiB) |
53
+ | DeepSeek-V3.1 (FP8) | 256xH20 TP16 | 0.80s | 11.33s (8.00GiB) | 11.81s (8.00GiB) |
54
+ | Kimi-K2-Instruct (FP8) | 256xH20 TP16 | 1.22s | 16.04s (8.00GiB) | 16.75s (8.00GiB) |
49
55
 
50
56
  All results above are produced by [`examples/update.py`](./examples/update.py) and use [vLLM v0.10.2rc1](https://github.com/vllm-project/vllm/tree/v0.10.2rc1) as the inference engine. Some notes:
51
57
 
@@ -53,6 +59,7 @@ All results above are tested by [`examples/update.py`](./examples/update.py) and
53
59
  * Device Info: we tested various combinations of devices and parallelism setups. For example, a 256-GPU TP16 setup means that we deploy 16 vLLM instances, each with 16-way tensor parallelism.
54
60
  * Since update duration is related to IPC bucket size, we provide the bucket size in the table.
55
61
  * The P2P times were measured while updating no more than two nodes (16 GPUs) (`ParameterServer.update(ranks=range(0, 16))`) out of the entire cluster.
62
+ * We bind each GPU to its corresponding NUMA node to ensure stable H2D transfer speeds.
56
63
 
57
64
  ## Installation
58
65
 
@@ -68,7 +75,7 @@ Use the flexible P2P implementation, notice this will install `mooncake-transfer
68
75
  pip install 'checkpoint-engine[p2p]'
69
76
  ```
70
77
 
71
- If set `NCCL_IB_HCA` env, checkpoint-engine will use it to auto select net devices for different ranks. If not set, it will read all RDMA devices and try to divide them into each rank.
78
+ If the `NCCL_IB_HCA` env variable is set, checkpoint-engine will use it to auto-select network devices for different ranks. The accepted patterns are described in the [NCCL documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#id8). If it is not set, checkpoint-engine will read all RDMA devices and try to divide them among the ranks.
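The parsing rules mirror NCCL's own semantics. A short sketch of the `_parse_NCCL_IB_HCA` helper that implements them in this release (device names are illustrative, matching the unit tests added in `tests/test_rdma_parser.py`):

```python
from checkpoint_engine.ps import _parse_NCCL_IB_HCA

devices = ["mlx5_0", "mlx5_1", "mlx4_0", "mlx4_1"]  # as reported by the local ibverbs device list

_parse_NCCL_IB_HCA("mlx5", devices)            # prefix match   -> ['mlx5_0', 'mlx5_1']
_parse_NCCL_IB_HCA("=mlx5_0,mlx5_1", devices)  # exact names    -> ['mlx5_0', 'mlx5_1']
_parse_NCCL_IB_HCA("^mlx5", devices)           # exclusion list -> ['mlx4_0', 'mlx4_1']
_parse_NCCL_IB_HCA("", devices)                # unset / empty  -> all devices (capped at 32)
```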
72
79
 
73
80
  ## Getting Started
74
81
 
@@ -141,11 +148,11 @@ Run a simple correctness test for checkpoint_engine
141
148
  torchrun --nproc-per-node 8 tests/test_update.py
142
149
  ```
143
150
 
151
+ The remaining unit tests can be run with pytest, e.g. `pytest -v -m "not gpu" tests/` for the CPU-only tests.
144
152
  ## Limitations and Future Work
145
153
 
146
154
  - This project is currently only tested with vLLM. But it is easy to integrate with other frameworks like SGLang.
147
155
  - The perfect three-stage pipeline mentioned in our paper is currently not implemented. This could be useful for architectures where H2D and broadcast do not conflict in PCIE.
148
- - The P2P update method is currently not the optimal implementation since it will receive data only in rank 0 and broadcast to others synchronizely. This is a potential optimization in the future.
149
156
 
150
157
  ## Acknowledgments
151
158
 
@@ -28,7 +28,7 @@ version_tuple: VERSION_TUPLE
28
28
  commit_id: COMMIT_ID
29
29
  __commit_id__: COMMIT_ID
30
30
 
31
- __version__ = version = '0.1.2'
32
- __version_tuple__ = version_tuple = (0, 1, 2)
31
+ __version__ = version = '0.2.0'
32
+ __version_tuple__ = version_tuple = (0, 2, 0)
33
33
 
34
- __commit_id__ = commit_id = 'g716c0dad9'
34
+ __commit_id__ = commit_id = 'ga29178282'
@@ -1,5 +1,3 @@
1
- from __future__ import annotations
2
-
3
1
  import argparse
4
2
  import concurrent.futures
5
3
  import ctypes
@@ -10,6 +8,7 @@ import socket
10
8
  import threading
11
9
  import time
12
10
  from collections import defaultdict
11
+ from collections.abc import Callable
13
12
  from datetime import timedelta
14
13
  from functools import lru_cache
15
14
  from typing import TYPE_CHECKING, Annotated, Any, BinaryIO, NamedTuple
@@ -26,7 +25,7 @@ from torch.multiprocessing.reductions import reduce_tensor
26
25
 
27
26
 
28
27
  if TYPE_CHECKING:
29
- from collections.abc import Callable
28
+ from typing import TypeVar
30
29
 
31
30
  from typing_extensions import TypedDict
32
31
 
@@ -37,6 +36,8 @@ if TYPE_CHECKING:
37
36
  type: type
38
37
  tp_concat_dim: int
39
38
 
39
+ T = TypeVar("T")
40
+
40
41
 
41
42
  def _dt_validate(value: Any) -> torch.dtype:
42
43
  if isinstance(value, str):
@@ -120,6 +121,7 @@ class MemoryBuffer(BaseModel):
120
121
  class MemoryBufferMetaList(BaseModel):
121
122
  p2p_store_addr: str | None
122
123
  memory_buffer_metas_list: list[MemoryBufferMetas]
124
+ rdma_device: str
123
125
 
124
126
 
125
127
  class DataToGather(MemoryBufferMetaList):
@@ -151,8 +153,8 @@ def _to_named_tensor(metas: list[ParameterMeta], offset: int = 0) -> list[dict]:
151
153
  return ret
152
154
 
153
155
 
154
- def _load_checkpoint_file(file_path: str) -> tuple[int, dict[str, tuple[FileMeta, torch.Tensor]]]:
155
- def _safetensors_load(fn: str) -> dict[str, tuple[FileMeta, torch.Tensor]]:
156
+ def _load_checkpoint_file(file_path: str) -> tuple[int, dict[str, tuple["FileMeta", torch.Tensor]]]:
157
+ def _safetensors_load(fn: str) -> dict[str, tuple["FileMeta", torch.Tensor]]:
156
158
  ret = {}
157
159
  with safe_open(fn, framework="pt") as f:
158
160
  for name in f.keys(): # noqa: SIM118
@@ -168,7 +170,7 @@ def _load_checkpoint_file(file_path: str) -> tuple[int, dict[str, tuple[FileMeta
168
170
  return ret
169
171
 
170
172
  # deprecated, will be removed in the future
171
- def _fast_np_load(fn: str) -> dict[str, tuple[FileMeta, torch.Tensor]]:
173
+ def _fast_np_load(fn: str) -> dict[str, tuple["FileMeta", torch.Tensor]]:
172
174
  """load *.np file and return memmap and related tensor meta"""
173
175
 
174
176
  def parse_npy_header(fin: BinaryIO) -> dict[str, Any]:
@@ -306,14 +308,7 @@ def _get_rdma_devices() -> list[str]:
306
308
  return devices_str.split(",")
307
309
  # if PS_P2P_STORE_RDMA_DEVICES is not set, try to use NCCL_IB_HCA to get RDMA devices
308
310
  hca = os.getenv("NCCL_IB_HCA", None)
309
- if hca:
310
- hca_list = hca.split(",")
311
- if len(hca_list) > 1:
312
- # if NCCL_IB_HCA has multiple values, just return
313
- return hca_list
314
- else:
315
- hca = hca_list[0]
316
- return [device for device in sorted(_ibv_get_device_list()) if hca is None or hca in device]
311
+ return _parse_NCCL_IB_HCA(hca or "", _ibv_get_device_list()) or _ibv_get_device_list()
317
312
 
318
313
 
319
314
  def _get_my_rdma_device(local_rank: int, gpu_count: int, devices: list[str]) -> str:
@@ -331,6 +326,75 @@ def _get_my_rdma_device(local_rank: int, gpu_count: int, devices: list[str]) ->
331
326
  return devices[local_rank // (gpu_count // len(devices))]
332
327
 
333
328
 
329
+ def _parse_NCCL_IB_HCA(value: str, available_devices: list[str]) -> list[str]:
330
+ """
331
+ The accepted values of NCCL_IB_HCA are documented at https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#id8.
332
+ This Python parser follows the C++ parser in NCCL: https://github.com/NVIDIA/nccl/blob/v2.28.3-1/src/transport/net_ib.cc#L658-L662.
333
+
334
+ The list is comma-separated; port numbers are NOT supported yet.
335
+ An optional prefix '^' indicates the list is an exclude list.
336
+ A second optional prefix '=' indicates that the tokens are exact names, otherwise by default NCCL would treat each token as a prefix.
337
+ Please note that when '^' and '=' appear together, only '^=' is allowed, '=^' is not supported.
338
+
339
+ Examples:
340
+ - `NCCL_IB_HCA="mlx5"`: Use all cards starting with `mlx5`.
341
+ - `NCCL_IB_HCA="=mlx5_0,mlx5_1"`: Use specific cards `mlx5_0` and `mlx5_1`.
342
+ - `NCCL_IB_HCA="^mlx5"`: Use all cards except those starting with `mlx5`.
343
+ - `NCCL_IB_HCA="^=mlx5_0,mlx5_1"`: Use all cards except `mlx5_0` and `mlx5_1`.
344
+ """
345
+ max_hcas = 32
346
+ if not value or value.strip() == "":
347
+ return available_devices[:max_hcas]
348
+
349
+ value = value.strip()
350
+ result = []
351
+ is_exclude = value.startswith("^")
352
+ if is_exclude:
353
+ value = value.removeprefix("^")
354
+ is_exact_match = value.startswith("=")
355
+ if is_exact_match:
356
+ value = value.removeprefix("=")
357
+
358
+ device_specs = [spec.strip() for spec in value.split(",") if spec.strip()]
359
+
360
+ result = _resolve_device_specs(device_specs, is_exact_match, available_devices)
361
+ if is_exclude:
362
+ result = [dev for dev in available_devices if dev not in result]
363
+ if len(result) > max_hcas:
364
+ result = result[:max_hcas]
365
+
366
+ logger.info(f"RDMA Devices from 'NCCL_IB_HCA': {result}")
367
+
368
+ return result
369
+
370
+
371
+ def _resolve_device_specs(
372
+ device_specs: list[str], is_exact_match: bool, available_devices: list[str]
373
+ ) -> list[str]:
374
+ devices = set()
375
+ for spec in device_specs:
376
+ parts = spec.split(":", 1)
377
+ device_name = parts[0].strip()
378
+ # HACK: mooncake transfer engine does not support port specification yet, so we ignore it
379
+ # port = parts[1].strip() if len(parts) > 1 else None
380
+ base_devices = (
381
+ [device_name]
382
+ if device_name in available_devices
383
+ else []
384
+ if is_exact_match
385
+ else [dev for dev in available_devices if dev.startswith(device_name)]
386
+ )
387
+
388
+ if not base_devices:
389
+ logger.warning(f"No RDMA device match {device_name=} where {is_exact_match=}.")
390
+ continue
391
+
392
+ for base_dev in base_devices:
393
+ devices.add(base_dev)
394
+
395
+ return sorted(devices)
396
+
397
+
334
398
  def _load_checkpoint(files: list[str]) -> dict[str, torch.Tensor]:
335
399
  class TPMeta(BaseModel):
336
400
  concat_dim: int
@@ -493,8 +557,12 @@ def request_inference_to_update(
493
557
 
494
558
 
495
559
  def _gen_h2d_buckets(
496
- global_metas: dict[int, MemoryBufferMetaList], bucket_size: int
497
- ) -> list[tuple[int, H2DBucket]]:
560
+ global_metas: dict[int, MemoryBufferMetaList],
561
+ bucket_size: int,
562
+ local_topo: dict[str, set[int]],
563
+ remote_topo: dict[str, set[int]],
564
+ ranks: list[int] | None = None,
565
+ ) -> list[tuple[int, int, H2DBucket]]:
498
566
  buckets: list[tuple[int, H2DBucket]] = []
499
567
 
500
568
  for owner_rank, items in global_metas.items():
@@ -517,7 +585,73 @@ def _gen_h2d_buckets(
517
585
  assert buckets[-1][1].size > 0, (
518
586
  f"buckets[-1][1].size {buckets[-1][1].size} should be greater than 0"
519
587
  )
520
- return buckets
588
+ ranks_set = set(ranks) if ranks else set()
589
+ actual_local_topo = (
590
+ {k: v & ranks_set for k, v in local_topo.items() if v & ranks_set} if ranks else local_topo
591
+ )
592
+ # if ranks is empty, assign the owner_rank as receiver_rank; this is used for the colocated architecture
593
+ if not ranks:
594
+ return [(owner_rank, owner_rank, bucket) for owner_rank, bucket in buckets]
595
+ else:
596
+ return _assign_receiver_ranks(buckets, actual_local_topo, remote_topo)
597
+
598
+
599
+ def _assign_receiver_ranks(
600
+ buckets: list[tuple[int, "T"]],
601
+ local_topo: dict[str, set[int]],
602
+ remote_topo: dict[str, set[int]],
603
+ ) -> list[tuple[int, int, "T"]]:
604
+ """
605
+ (owner_rank, bucket) -> (receiver_rank, owner_rank, bucket)
606
+
607
+ Assign receiver ranks to buckets. If ranks is empty, assign the owner_rank as receiver_rank.
608
+ The GPU-rdma_device topology is taken into account to make full use of the available bandwidth.
609
+ """
610
+ if not buckets:
611
+ logger.warning("bucket list is empty, no need to assign receiver ranks")
612
+ return []
613
+ rank_to_rdma_device = {
614
+ rank: rdma_device for rdma_device, ranks in remote_topo.items() for rank in ranks
615
+ }
616
+
617
+ # group buckets by owner RDMA devices
618
+ buckets_by_rdma_device = defaultdict(list)
619
+ for owner_rank, bucket in buckets:
620
+ owner_rdma_device = rank_to_rdma_device[owner_rank]
621
+ buckets_by_rdma_device[owner_rdma_device].append((owner_rank, bucket))
622
+
623
+ buckets_matrix = list(buckets_by_rdma_device.values())
624
+ assert buckets_matrix, "buckets_matrix should not be empty"
625
+
626
+ # Select receiver ranks. We use the minimum rank in each local RDMA device group as receiver rank
627
+ num_receivers = min(len(local_topo), len(buckets_by_rdma_device))
628
+ receiver_list = [min(ranks) for ranks in list(local_topo.values())[:num_receivers]]
629
+
630
+ flattened_buckets = [
631
+ buckets_matrix[row][col]
632
+ for col in range(
633
+ max(len(matrix_row) for matrix_row in buckets_matrix) if buckets_matrix else 0
634
+ )
635
+ for row in range(len(buckets_matrix))
636
+ if col < len(buckets_matrix[row])
637
+ ]
638
+
639
+ buckets_with_receiver = []
640
+ assigned_cnt = 0
641
+ while assigned_cnt < len(flattened_buckets):
642
+ occupied_devices = set()
643
+ for receiver_rank in receiver_list:
644
+ if assigned_cnt >= len(flattened_buckets):
645
+ break
646
+ owner_rank, bucket = flattened_buckets[assigned_cnt]
647
+ rdma_device = rank_to_rdma_device[owner_rank]
648
+ if rdma_device in occupied_devices:
649
+ break
650
+ buckets_with_receiver.append((receiver_rank, owner_rank, bucket))
651
+ occupied_devices.add(rdma_device)
652
+ assigned_cnt += 1
653
+
654
+ return buckets_with_receiver
521
655
 
522
656
 
523
657
  def _get_master_port(master_port: int | None = None) -> int:
@@ -528,6 +662,20 @@ def _get_master_port(master_port: int | None = None) -> int:
528
662
  return master_port
529
663
 
530
664
 
665
+ def _get_bcast_rank_map(world_size: int, ranks: list[int] | None) -> dict[int, int]:
666
+ """
667
+ map the real ranks (receiver_rank) to the bcast ranks (0 ~ len(ranks) - 1),
668
+ which are generated in self.init_process_group_for_ranks
669
+ """
670
+ bcast_rank_map: dict[int, int] = {}
671
+ if not ranks:
672
+ bcast_rank_map = {r: r for r in range(world_size)}
673
+ else:
674
+ for i, r in enumerate(ranks):
675
+ bcast_rank_map[r] = i
676
+ return bcast_rank_map
677
+
678
+
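For reference, the mapping the new helper produces (values traced from the code above; the P2P case relies on `init_process_group_for_ranks` numbering the participating ranks from zero, as the docstring states):

```python
_get_bcast_rank_map(8, None)          # {0: 0, 1: 1, ..., 7: 7}   broadcast path: identity map
_get_bcast_rank_map(8, [4, 5, 6, 7])  # {4: 0, 5: 1, 6: 2, 7: 3}  P2P path: positions in the sub-group
```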
531
679
  class P2PStore:
532
680
  def __init__(self):
533
681
  from mooncake.engine import TransferEngine
@@ -535,14 +683,14 @@ class P2PStore:
535
683
  self.rank = int(os.getenv("RANK"))
536
684
  gpu_count = torch.cuda.device_count()
537
685
  local_rank = self.rank % gpu_count
538
- device = _get_my_rdma_device(local_rank, gpu_count, _get_rdma_devices())
686
+ self.device = _get_my_rdma_device(local_rank, gpu_count, _get_rdma_devices())
539
687
  self.ip = _get_ip()
540
688
 
541
689
  # we will start at most 8 ps processes, so we use 8 retries to avoid port conflicts in extreme cases
542
690
  retry_count = 8
543
691
  for i in range(retry_count):
544
692
  self.engine = TransferEngine()
545
- ret = self.engine.initialize(self.ip, "P2PHANDSHAKE", "rdma", device)
693
+ ret = self.engine.initialize(self.ip, "P2PHANDSHAKE", "rdma", self.device)
546
694
  if ret == 0:
547
695
  break
548
696
  # sleep 0.5 ~ 2.0s, to avoid port conflicts when two processes retry at the same time
@@ -556,7 +704,7 @@ class P2PStore:
556
704
  self.port = self.engine.get_rpc_port()
557
705
  self.named_tensors: dict[str, torch.Tensor] = {}
558
706
  logger.info(
559
- f"[rank{self.rank}] p2p store initialized, addr is {self.addr}, rdma device is {device}"
707
+ f"[rank{self.rank}] p2p store initialized, addr is {self.addr}, rdma device is {self.device}"
560
708
  )
561
709
 
562
710
  @property
@@ -595,7 +743,13 @@ class P2PStore:
595
743
 
596
744
  class ParameterServer:
597
745
  def __init__(
598
- self, *, rank: int | None = None, world_size: int | None = None, auto_pg: bool = False
746
+ self,
747
+ *,
748
+ rank: int | None = None,
749
+ world_size: int | None = None,
750
+ auto_pg: bool = False,
751
+ gpu_count: int | None = None,
752
+ mem_fraction: float | None = None,
599
753
  ):
600
754
  """
601
755
  Initialize the parameter server. env RANK, WORLD_SIZE and MASTER_ADDR must be set.
@@ -603,17 +757,29 @@ class ParameterServer:
603
757
  Args:
604
758
  auto_pg: Whether to automatically initialize the process group.
605
759
  Notice that if auto_pg is True, will destroy the process group after update.
760
+ mem_fraction: The fraction of the currently free CUDA memory to use for allocation.
606
761
  """
607
762
  self._rank = rank or int(os.environ.get("RANK", None))
608
763
  self._world_size = world_size or int(os.environ.get("WORLD_SIZE", None))
609
- self._gpu_count = torch.cuda.device_count()
764
+ self._gpu_count = gpu_count or torch.cuda.device_count()
610
765
  self._local_rank = self._rank % self._gpu_count
611
766
  self._auto_pg = auto_pg
612
767
  self._all_hosts = []
613
768
  self._global_device_uuids: list[str] = []
769
+ self._local_rdma_devices: dict[str, set[int]] = defaultdict(set)
770
+ self._remote_rdma_devices: dict[str, set[int]] = defaultdict(set)
771
+ self._mem_fraction = mem_fraction or 0.9
614
772
 
615
773
  assert self._rank is not None and self._rank >= 0, self._rank
616
774
  assert self._world_size and self._world_size > 0, self._world_size
775
+ assert (
776
+ self._gpu_count is not None
777
+ and self._gpu_count > 0
778
+ and self._gpu_count <= torch.cuda.device_count()
779
+ ), self._gpu_count
780
+ assert (
781
+ self._mem_fraction is not None and self._mem_fraction > 0 and self._mem_fraction <= 1
782
+ ), self._mem_fraction
617
783
 
618
784
  self._zmq_ctx = zmq.Context()
619
785
  self._zmq_addr_counter = 0
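A constructor sketch with the two new keyword arguments (hypothetical values; both fall back to the previous behaviour, `torch.cuda.device_count()` and `0.9`, when omitted):

```python
from checkpoint_engine.ps import ParameterServer

# RANK, WORLD_SIZE and MASTER_ADDR are expected in the environment.
ps = ParameterServer(
    auto_pg=True,
    gpu_count=4,       # parameter-server processes per node; must not exceed torch.cuda.device_count()
    mem_fraction=0.8,  # use 80% of the currently free CUDA memory when auto-detecting the bucket size
)
```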
@@ -630,6 +796,7 @@ class ParameterServer:
630
796
  device_index = self._local_rank
631
797
  torch.cuda.set_device(device_index)
632
798
  self._device_uuid = _get_physical_gpu_id(device_index)
799
+ self._rdma_device = None if self._p2p_store is None else self._p2p_store.device
633
800
 
634
801
  def _logger_rank0(self, msg: str):
635
802
  if self._local_rank == 0:
@@ -640,6 +807,13 @@ class ParameterServer:
640
807
 
641
808
  def load_metas(self, metas: dict[int, MemoryBufferMetaList]):
642
809
  self._current_global_parameter_metas = metas
810
+ self._remote_rdma_devices = defaultdict(set)
811
+ for i, meta in self._current_global_parameter_metas.items():
812
+ assert meta.rdma_device is not None, "meta.rdma_device should not be None"
813
+ assert meta.p2p_store_addr is not None, "meta.p2p_store_addr should not be None"
814
+ self._remote_rdma_devices[
815
+ meta.rdma_device + "@" + meta.p2p_store_addr.split(":")[0]
816
+ ].add(i)
643
817
 
644
818
  def register_checkpoint(
645
819
  self,
@@ -713,11 +887,11 @@ class ParameterServer:
713
887
  p2p_store_addr=None if self._p2p_store is None else self._p2p_store.addr,
714
888
  host_ip=_get_ip(),
715
889
  device_uuid=self._device_uuid,
890
+ rdma_device=self._rdma_device or "",
716
891
  )
717
892
 
718
893
  dist.all_gather_object(metas_lst, metas)
719
894
 
720
- self._current_global_parameter_metas = {}
721
895
  num_parameters = 0
722
896
  all_hosts: list[str] = []
723
897
  global_device_uuids: list[str] = []
@@ -728,12 +902,24 @@ class ParameterServer:
728
902
  if not self._global_device_uuids:
729
903
  global_device_uuids.append(metas_buckets.device_uuid)
730
904
  if metas_buckets.memory_buffer_metas_list:
731
- self._current_global_parameter_metas[i] = metas_buckets
905
+ self._current_global_parameter_metas[i] = MemoryBufferMetaList(
906
+ memory_buffer_metas_list=metas_buckets.memory_buffer_metas_list,
907
+ p2p_store_addr=metas_buckets.p2p_store_addr,
908
+ rdma_device=metas_buckets.rdma_device,
909
+ )
732
910
  num_parameters += sum(len(x.metas) for x in metas_buckets.memory_buffer_metas_list)
911
+ self._local_rdma_devices[
912
+ metas_buckets.rdma_device + "@" + metas_buckets.p2p_store_addr.split(":")[0]
913
+ if metas_buckets.p2p_store_addr
914
+ else metas_buckets.host_ip
915
+ ].add(i)
733
916
  if not self._all_hosts:
734
917
  self._all_hosts = all_hosts
735
918
  if not self._global_device_uuids:
736
919
  self._global_device_uuids = global_device_uuids
920
+ # By default, assume the sender and receiver nodes share the same GPU-rdma_device topology.
921
+ # Rewrite the sender's topology (_remote_rdma_devices) by calling load_metas.
922
+ self._remote_rdma_devices = self._local_rdma_devices.copy()
737
923
  logger.info(
738
924
  f"[rank{self._rank}] gather parameter metas finished, num_parameters: {num_parameters}"
739
925
  )
@@ -788,6 +974,7 @@ class ParameterServer:
788
974
  If set, will use p2p to update to the ranks, this is flexible to update to a group of ranks,
789
975
  which is useful in disaggregated architecture.
790
976
  """
977
+ assert req_func is not None, "req_func is required"
791
978
  try:
792
979
  # if both ranks is None or [], it will use fully broadcast to update to all ranks
793
980
  if not ranks:
@@ -795,15 +982,15 @@ class ParameterServer:
795
982
  self.init_process_group()
796
983
  self._update_per_bucket(checkpoint_name, req_func)
797
984
  else:
798
- if self._rank not in ranks:
799
- return
800
985
  if self._auto_pg:
801
986
  if dist.is_initialized():
802
987
  dist.destroy_process_group()
803
988
  # HACK: wait 2s to ensure destroy is finished
804
989
  time.sleep(2)
805
990
  self.init_process_group_for_ranks(ranks)
806
- self._update_per_bucket_p2p(checkpoint_name, req_func, ranks)
991
+ if self._rank not in ranks:
992
+ return
993
+ self._update_per_bucket(checkpoint_name, req_func, ranks)
807
994
  if self._auto_pg:
808
995
  dist.destroy_process_group()
809
996
 
@@ -835,8 +1022,8 @@ class ParameterServer:
835
1022
  # auto detect bucket size
836
1023
  tensor = torch.tensor(
837
1024
  [
838
- # 90% of current cuda free memory bytes
839
- int(float(torch.cuda.mem_get_info()[0]) * 0.9),
1025
+ # proportion of current cuda free memory bytes
1026
+ int(float(torch.cuda.mem_get_info()[0]) * self._mem_fraction),
840
1027
  # we use negative value to reuse allreduce min operation
841
1028
  # for getting the max value of zmq_addr_counter in all ranks
842
1029
  -self._zmq_addr_counter,
@@ -948,71 +1135,6 @@ class ParameterServer:
948
1135
  backend="nccl", world_size=len(ranks), rank=rank, timeout=timeout, store=store
949
1136
  )
950
1137
 
951
- def _update_per_bucket_p2p(
952
- self,
953
- checkpoint_name: str,
954
- req_func: Callable[[list[tuple[str, str]]], None],
955
- ranks: list[int],
956
- ):
957
- assert self._p2p_store is not None, "p2p store is not initialized"
958
- assert ranks, "ranks should be set"
959
- if len(self._current_global_parameter_metas) == 0:
960
- raise ValueError("parameter metas is empty")
961
- assert dist.is_initialized(), (
962
- "process group is not initialized when update model per bucket p2p"
963
- )
964
-
965
- need_update = self._rank in ranks
966
- logger.info(
967
- f"[rank{self._rank}] update checkpoint {checkpoint_name} p2p, {need_update=} with {ranks=}, "
968
- f"gpu_count {self._gpu_count}, world_size {self._world_size}"
969
- )
970
-
971
- if not need_update:
972
- return
973
-
974
- # first execute a barrier to avoid subsequent cuda oom
975
- dist.barrier()
976
-
977
- bucket_size, _ = self._detect_bucket_size(disable_h2d_buffer=True)
978
- buffer = torch.empty(bucket_size * 2, dtype=torch.uint8, device="cuda")
979
- ipc_buffer_name = "__ipc_buffer___"
980
- self._p2p_store.register_named_tensors({ipc_buffer_name: buffer})
981
- logger.info(
982
- f"[rank{self._rank}] register buffer, shape={buffer.shape}, dtype={buffer.dtype}, data_ptr={buffer.data_ptr()}, nbytes={buffer.nbytes}"
983
- )
984
- handle = reduce_tensor(buffer)
985
-
986
- buckets = _gen_h2d_buckets(self._current_global_parameter_metas, bucket_size)
987
- socket, socket_paths = self._bind_zmq_socket()
988
- req_thread = threading.Thread(
989
- target=req_func,
990
- args=(socket_paths,),
991
- )
992
- req_thread.start()
993
- socket.send_pyobj(handle)
994
- for gidx, (owner_rank, bucket) in enumerate(buckets):
995
- self._logger_rank0(
996
- f"[rank{self._rank}] begin to update bucket {gidx + 1}/{len(buckets)} owner_rank {owner_rank} in checkpoint {checkpoint_name}, bucket_size: {bucket.size / 1024 / 1024:.2f}MiB, length: {len(bucket.items)}. "
997
- )
998
- _buffer = buffer[gidx % 2 * bucket_size : gidx % 2 * bucket_size + bucket.size]
999
- if dist.get_rank() == 0:
1000
- self._copy_to_buffer(checkpoint_name, bucket, _buffer, owner_rank)
1001
- # broadcast the collected data to all ranks
1002
- dist.broadcast(_buffer, src=0)
1003
- socket.recv()
1004
- dist.barrier()
1005
- socket.send_pyobj(_to_named_tensor(bucket.items, gidx % 2 * bucket_size))
1006
-
1007
- socket.recv()
1008
- socket.send_pyobj(None)
1009
- socket.recv()
1010
- req_thread.join()
1011
- dist.barrier()
1012
- socket.close()
1013
- self._p2p_store.unregister_named_tensors([ipc_buffer_name])
1014
- torch.cuda.empty_cache()
1015
-
1016
1138
  def _get_addr_ptrs(self, owner_rank: int) -> tuple[str, list[tuple[int, int]]]:
1017
1139
  addr = self._current_global_parameter_metas[owner_rank].p2p_store_addr
1018
1140
  metas_list = self._current_global_parameter_metas[owner_rank].memory_buffer_metas_list
@@ -1042,38 +1164,63 @@ class ParameterServer:
1042
1164
  self,
1043
1165
  checkpoint_name: str,
1044
1166
  req_func: Callable[[list[tuple[str, str]]], None],
1167
+ ranks: list[int] | None = None,
1045
1168
  ):
1046
- if len(self._current_global_parameter_metas) == 0:
1047
- raise ValueError("parameter metas is empty")
1048
-
1169
+ assert len(self._current_global_parameter_metas) != 0, "parameter metas is empty"
1049
1170
  assert dist.is_initialized(), "process group is not initialized"
1171
+ # if both ranks is None or [], it will use fully broadcast to update to all ranks
1172
+ if not ranks:
1173
+ logger.info(f"[rank{self._rank}] update checkpoint {checkpoint_name}")
1174
+ # if ranks is set, it will use p2p to update to the ranks
1175
+ else:
1176
+ assert self._p2p_store is not None, "p2p store is not initialized"
1177
+ assert ranks, "ranks should be set"
1178
+
1179
+ need_update = self._rank in ranks
1180
+ logger.info(
1181
+ f"[rank{self._rank}] update checkpoint {checkpoint_name} p2p, {need_update=} with {ranks=}, "
1182
+ f"gpu_count {self._gpu_count}, world_size {self._world_size}"
1183
+ )
1050
1184
 
1051
- logger.info(f"[rank{self._rank}] update checkpoint {checkpoint_name}")
1185
+ if not need_update:
1186
+ return
1187
+ # first execute a barrier to avoid subsequent cuda oom
1188
+ dist.barrier()
1052
1189
 
1053
1190
  bucket_size, disable_h2d_buffer = self._detect_bucket_size()
1054
- buckets = _gen_h2d_buckets(self._current_global_parameter_metas, bucket_size)
1191
+ buckets = _gen_h2d_buckets(
1192
+ self._current_global_parameter_metas,
1193
+ bucket_size,
1194
+ self._local_rdma_devices,
1195
+ self._remote_rdma_devices,
1196
+ ranks,
1197
+ )
1055
1198
 
1056
1199
  h2d_buffer: torch.Tensor | None = (
1057
1200
  None
1058
1201
  if disable_h2d_buffer
1059
1202
  else torch.empty(bucket_size, dtype=torch.uint8, device="cuda")
1060
1203
  )
1061
-
1062
- owner_rank_buckets: list[H2DBucket] = []
1063
- for owner_rank, bucket in buckets:
1064
- if owner_rank != self._rank:
1204
+ # p2p store need to register h2d_buffer to let other ranks read
1205
+ if ranks:
1206
+ h2d_buffer_name = "__h2d_buffer__"
1207
+ if h2d_buffer is not None and self._p2p_store is not None:
1208
+ self._p2p_store.register_named_tensors({h2d_buffer_name: h2d_buffer})
1209
+ receiver_rank_buckets: list[tuple[int, H2DBucket]] = []
1210
+ for receiver_rank, owner_rank, bucket in buckets:
1211
+ if receiver_rank != self._rank:
1065
1212
  continue
1066
- owner_rank_buckets.append(bucket)
1213
+ receiver_rank_buckets.append((owner_rank, bucket))
1067
1214
 
1068
1215
  buffer = torch.empty(bucket_size * 2, dtype=torch.uint8, device="cuda")
1069
1216
  handle = reduce_tensor(buffer)
1070
1217
 
1071
- buckets_by_owner_rank: dict[int, list[H2DBucket]] = defaultdict(list)
1218
+ buckets_by_receiver_rank: dict[int, list[H2DBucket]] = defaultdict(list)
1072
1219
  max_len = 0
1073
- for owner_rank, bucket in buckets:
1074
- buckets_by_owner_rank[owner_rank].append(bucket)
1075
- if len(buckets_by_owner_rank[owner_rank]) > max_len:
1076
- max_len = len(buckets_by_owner_rank[owner_rank])
1220
+ for receiver_rank, _, bucket in buckets:
1221
+ buckets_by_receiver_rank[receiver_rank].append(bucket)
1222
+ if len(buckets_by_receiver_rank[receiver_rank]) > max_len:
1223
+ max_len = len(buckets_by_receiver_rank[receiver_rank])
1077
1224
 
1078
1225
  socket, socket_paths = self._bind_zmq_socket()
1079
1226
  req_thread = threading.Thread(
@@ -1084,11 +1231,16 @@ class ParameterServer:
1084
1231
  socket.send_pyobj(handle)
1085
1232
 
1086
1233
  gidx = 0
1234
+ bcast_rank_map = _get_bcast_rank_map(self._world_size, ranks)
1087
1235
  for i in range(max_len):
1088
- if i < len(owner_rank_buckets) and not disable_h2d_buffer:
1089
- self._copy_to_buffer(checkpoint_name, owner_rank_buckets[i], h2d_buffer)
1090
-
1091
- for owner_rank, _buckets in buckets_by_owner_rank.items():
1236
+ if i < len(receiver_rank_buckets) and not disable_h2d_buffer:
1237
+ self._copy_to_buffer(
1238
+ checkpoint_name,
1239
+ receiver_rank_buckets[i][1],
1240
+ h2d_buffer,
1241
+ receiver_rank_buckets[i][0] if ranks else None,
1242
+ )
1243
+ for receiver_rank, _buckets in buckets_by_receiver_rank.items():
1092
1244
  if i >= len(_buckets):
1093
1245
  continue
1094
1246
  bucket = _buckets[i]
@@ -1097,18 +1249,19 @@ class ParameterServer:
1097
1249
  torch.cuda.memory_reserved() / 1024 / 1024,
1098
1250
  )
1099
1251
  self._logger_rank0(
1100
- f"[rank{self._rank}] begin to update bucket {gidx + 1}/{len(buckets)} owner_rank {owner_rank} in checkpoint {checkpoint_name}, bucket_size: {bucket.size / 1024 / 1024:.2f}MiB, length: {len(bucket.items)}. "
1252
+ f"[rank{self._rank}] begin to update bucket {gidx + 1}/{len(buckets)} receiver_rank {receiver_rank} in checkpoint {checkpoint_name}, bucket_size: {bucket.size / 1024 / 1024:.2f}MiB, length: {len(bucket.items)}. "
1101
1253
  f"Current CUDA allocated {alloc:.2f} MB, "
1102
1254
  f"reserved {reserved:.2f} MB."
1103
1255
  )
1104
1256
  start = gidx % 2 * bucket_size
1105
1257
  buffer_b: torch.Tensor = buffer[start : start + bucket.size]
1106
- if owner_rank == self._rank:
1258
+ if receiver_rank == self._rank:
1107
1259
  if disable_h2d_buffer:
1108
1260
  self._copy_to_buffer(checkpoint_name, bucket, buffer_b)
1109
1261
  else:
1110
1262
  buffer_b.data.copy_(h2d_buffer[: bucket.size])
1111
- dist.broadcast(buffer_b, src=owner_rank)
1263
+ brank = bcast_rank_map[receiver_rank]
1264
+ dist.broadcast(buffer_b, src=brank)
1112
1265
  socket.recv()
1113
1266
  dist.barrier()
1114
1267
  socket.send_pyobj(_to_named_tensor(bucket.items, gidx % 2 * bucket_size))
@@ -1120,6 +1273,9 @@ class ParameterServer:
1120
1273
  req_thread.join()
1121
1274
  dist.barrier()
1122
1275
  socket.close()
1276
+ if ranks and h2d_buffer is not None:
1277
+ self._p2p_store.unregister_named_tensors([h2d_buffer_name])
1278
+
1123
1279
  torch.cuda.empty_cache()
1124
1280
 
1125
1281
 
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: checkpoint-engine
3
- Version: 0.1.2
3
+ Version: 0.2.0
4
4
  Summary: checkpoint-engine is a lightweight, decoupling and efficient weight update middleware
5
5
  Project-URL: Homepage, https://github.com/MoonshotAI/checkpoint-engine
6
6
  Project-URL: Repository, https://github.com/MoonshotAI/checkpoint-engine
@@ -38,8 +38,8 @@ updating our [Kimi-K2](https://github.com/MoonshotAI/Kimi-K2) model (1 Trillion
38
38
 
39
39
  The core weight update logic is in the `ParameterServer` class, a service colocated with inference engines. It provides two implementations of weight update: Broadcast and P2P.
40
40
 
41
- - **Broadcast**: Used when a large number of inference instances need to update weights in synchronous. This is the fastest implementation and should be used as the default update method. See `_update_per_bucket`.
42
- - **P2P**: Used when new inference instances are dynamically added (due to restarts or dynamic availability) while the existing instances are already serving requests. Under this scenario, to avoid affecting the workloads on existing instances, we use the [`mooncake-transfer-engine`](https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#use-python-package) to P2P send weights from CPUs in existing instances to GPUs in new instances. See `_update_per_bucket_p2p`.
41
+ - **Broadcast**: Used when a large number of inference instances need to update weights synchronously. This is the fastest implementation and should be used as the default update method. See `_update_per_bucket` with `ranks` left as `None` or `[]`.
42
+ - **P2P**: Used when new inference instances are dynamically added (due to restarts or dynamic availability) while the existing instances are already serving requests. Under this scenario, to avoid affecting the workloads on existing instances, we use the [`mooncake-transfer-engine`](https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#use-python-package) to P2P send weights from CPUs in existing instances to GPUs in new instances. See `_update_per_bucket` with `ranks` specified.
43
43
 
44
44
  ### Optimized Weight Broadcast
45
45
  In the *Broadcast* implementation, the checkpoint-engine holds references to sharded weights in CPU memory, and needs to efficiently broadcast them to a cluster of inference instances, often under a different sharding pattern.
@@ -60,16 +60,22 @@ It then executes the transfer, where it controls the inference engine through a
60
60
 
61
61
  Pipelining naturally requires more GPU memory. When memory is insufficient, checkpoint-engine will fall back to serial execution.
62
62
 
63
+ ### Optimized P2P Bucket Assignment
64
+ In the *P2P* implementation, checkpoint-engine needs to send weights from existing instances to new instances.
65
+ To minimize the overall transfer time, checkpoint-engine optimizes the bucket assignment for each sender-receiver pair.
66
+ The optimization goal is to make full use of the available network bandwidth for each sender and receiver.
67
+ See [issue #25](https://github.com/MoonshotAI/checkpoint-engine/issues/25) for details.
68
+
63
69
  ## Benchmark
64
70
 
65
71
  | Model | Device Info | GatherMetas | Update (Broadcast) | Update (P2P) |
66
72
  | :----------------------------------- | :----------- | :---------- |:-------------------| :---------------------- |
67
- | GLM-4.5-Air (BF16) | 8xH800 TP8 | 0.17s | 3.94s (1.42GiB) | 8.83s (4.77GiB) |
68
- | Qwen3-235B-A22B-Instruct-2507 (BF16) | 8xH800 TP8 | 0.46s | 6.75s (2.69GiB) | 16.47s (4.05GiB) |
69
- | DeepSeek-V3.1 (FP8) | 16xH20 TP16 | 1.44s | 12.22s (2.38GiB) | 25.77s (3.61GiB) |
70
- | Kimi-K2-Instruct (FP8) | 16xH20 TP16 | 1.81s | 15.45s (2.93GiB) | 36.24s (4.46GiB) |
71
- | DeepSeek-V3.1 (FP8) | 256xH20 TP16 | 1.40s | 13.88s (2.54GiB) | 33.30s (3.86 GiB) |
72
- | Kimi-K2-Instruct (FP8) | 256xH20 TP16 | 1.88s | 21.50s (2.99GiB) | 34.49s (4.57 GiB) |
73
+ | GLM-4.5-Air (BF16) | 8xH800 TP8 | 0.12s | 3.47s (3.02GiB) | 4.12s (3.02GiB) |
74
+ | Qwen3-235B-A22B-Instruct-2507 (BF16) | 8xH800 TP8 | 0.33s | 6.22s (2.67GiB) | 7.10s (2.68GiB) |
75
+ | DeepSeek-V3.1 (FP8) | 16xH20 TP16 | 1.17s | 10.19s (5.39GiB) | 11.80s (5.41GiB) |
76
+ | Kimi-K2-Instruct (FP8) | 16xH20 TP16 | 1.33s | 14.36s (5.89GiB) | 17.49s (5.91GiB) |
77
+ | DeepSeek-V3.1 (FP8) | 256xH20 TP16 | 0.80s | 11.33s (8.00GiB) | 11.81s (8.00GiB) |
78
+ | Kimi-K2-Instruct (FP8) | 256xH20 TP16 | 1.22s | 16.04s (8.00GiB) | 16.75s (8.00GiB) |
73
79
 
74
80
  All results above are tested by [`examples/update.py`](./examples/update.py) and use [vLLM v0.10.2rc1](https://github.com/vllm-project/vllm/tree/v0.10.2rc1) as inference engine. Some notes:
75
81
 
@@ -77,6 +83,7 @@ All results above are tested by [`examples/update.py`](./examples/update.py) and
77
83
  * Device Info: we tested various combinations of devices and parallelism setups. For example, a 256-GPU TP16 setup means that we deploy 16 vLLM instances, each with 16-way tensor parallelism.
78
84
  * Since update duration is related to IPC bucket size, we provide the bucket size in the table.
79
85
  * The P2P times were measured while updating no more than two nodes (16 GPUs) (`ParameterServer.update(ranks=range(0, 16))`) out of the entire cluster.
86
+ * We bind each GPU to its corresponding NUMA node to ensure stable H2D transfer speeds.
80
87
 
81
88
  ## Installation
82
89
 
@@ -92,7 +99,7 @@ Use the flexible P2P implementation, notice this will install `mooncake-transfer
92
99
  pip install 'checkpoint-engine[p2p]'
93
100
  ```
94
101
 
95
- If set `NCCL_IB_HCA` env, checkpoint-engine will use it to auto select net devices for different ranks. If not set, it will read all RDMA devices and try to divide them into each rank.
102
+ If the `NCCL_IB_HCA` env variable is set, checkpoint-engine will use it to auto-select network devices for different ranks. The accepted patterns are described in the [NCCL documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#id8). If it is not set, checkpoint-engine will read all RDMA devices and try to divide them among the ranks.
96
103
 
97
104
  ## Getting Started
98
105
 
@@ -165,11 +172,11 @@ Run a simple correctness test for checkpoint_engine
165
172
  torchrun --nproc-per-node 8 tests/test_update.py
166
173
  ```
167
174
 
175
+ The remaining unit tests can be run with pytest, e.g. `pytest -v -m "not gpu" tests/` for the CPU-only tests.
168
176
  ## Limitations and Future Work
169
177
 
170
178
  - This project is currently only tested with vLLM. But it is easy to integrate with other frameworks like SGLang.
171
179
  - The perfect three-stage pipeline mentioned in our paper is currently not implemented. This could be useful for architectures where H2D and broadcast do not conflict in PCIE.
172
- - The P2P update method is currently not the optimal implementation since it will receive data only in rank 0 and broadcast to others synchronizely. This is a potential optimization in the future.
173
180
 
174
181
  ## Acknowledgments
175
182
 
@@ -3,6 +3,7 @@
3
3
  LICENCE
4
4
  README.md
5
5
  pyproject.toml
6
+ .github/workflows/cpu-tests.yml
6
7
  .github/workflows/pre-commit.yaml
7
8
  .github/workflows/python-publish.yml
8
9
  checkpoint_engine/__init__.py
@@ -19,4 +20,6 @@ figures/checkpoint-engine.png
19
20
  figures/overlap-update-and-copy.png
20
21
  figures/pipeline.png
21
22
  patches/vllm_fp8.patch
23
+ tests/test_assign_receiver_ranks.py
24
+ tests/test_rdma_parser.py
22
25
  tests/test_update.py
@@ -158,3 +158,8 @@ inline-quotes = "double"
158
158
 
159
159
  [tool.ruff.lint.flake8-tidy-imports]
160
160
  ban-relative-imports = "all"
161
+
162
+ [tool.pytest.ini_options]
163
+ markers = [
164
+ "gpu: marks tests as GPU test (deselect with '-m \"not gpu\"')",
165
+ ]
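The marker pairs with the CPU-only CI job added above; a hypothetical GPU-only test would opt out of that run like this:

```python
import pytest
import torch

@pytest.mark.gpu  # deselected in CI via `pytest -m "not gpu"`
def test_needs_cuda():
    assert torch.cuda.is_available()
```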
@@ -0,0 +1,68 @@
1
+ import pytest
2
+
3
+ from checkpoint_engine.ps import _assign_receiver_ranks
4
+
5
+
6
+ @pytest.mark.parametrize(
7
+ "buckets,local_topo,remote_topo,expected_results",
8
+ [
9
+ (
10
+ [(i % 8, f"bucket{i}") for i in range(80)],
11
+ {f"rdma{i}": {i} for i in range(8)},
12
+ {f"rdma{i}": {i} for i in range(8)},
13
+ [(i % 8, i % 8, f"bucket{i}") for i in range(80)],
14
+ ),
15
+ (
16
+ [(i % 8, f"bucket{i}") for i in range(80)],
17
+ {f"rdma{i}": {i} for i in range(8)},
18
+ {f"rdma{i}": {i, i + 1} for i in range(0, 8, 2)},
19
+ [((i // 2 % 4), i % 8, f"bucket{i}") for i in range(80)],
20
+ ),
21
+ (
22
+ [(i % 8, f"bucket{i}") for i in range(80)],
23
+ {f"rdma{i}": {i, i + 1, i + 2, i + 3} for i in range(0, 8, 4)},
24
+ {f"rdma{i}": {i} for i in range(8)},
25
+ [((i % 2) * 4, i % 8, f"bucket{i}") for i in range(80)],
26
+ ),
27
+ (
28
+ [(i % 8, f"bucket{i}") for i in range(13)],
29
+ {f"rdma{i}": {i} for i in range(8)},
30
+ {f"rdma{i}": {i, i + 1} for i in range(0, 8, 2)},
31
+ [((i // 2 % 4), i % 8, f"bucket{i}") for i in range(13)],
32
+ ),
33
+ (
34
+ [(i % 8, f"bucket{i}") for i in range(13)],
35
+ {f"rdma{i}": {i, i + 1} for i in range(0, 8, 2)},
36
+ {f"rdma{i}": {i} for i in range(8)},
37
+ [((i % 4) * 2, i % 8, f"bucket{i}") for i in range(13)],
38
+ ),
39
+ (
40
+ [(i % 8, f"bucket{i}") for i in range(13)],
41
+ {f"rdma{i}": {i} for i in range(3)},
42
+ {f"rdma{i}": {i, i + 1} for i in range(0, 8, 2)},
43
+ [
44
+ (0, 0, "bucket0"),
45
+ (1, 1, "bucket1"),
46
+ (1, 2, "bucket2"),
47
+ (2, 3, "bucket3"),
48
+ (2, 4, "bucket4"),
49
+ (0, 5, "bucket5"),
50
+ (0, 6, "bucket6"),
51
+ (1, 7, "bucket7"),
52
+ (2, 0, "bucket8"),
53
+ (2, 1, "bucket9"),
54
+ (0, 2, "bucket10"),
55
+ (0, 3, "bucket11"),
56
+ (1, 4, "bucket12"),
57
+ ],
58
+ ),
59
+ ],
60
+ )
61
+ def test_basic_functionality(
62
+ buckets: list[tuple[int, str]],
63
+ local_topo: dict[str, int],
64
+ remote_topo: dict[str, int],
65
+ expected_results: list[tuple[int, int, str]],
66
+ ):
67
+ assert len(expected_results) == len(buckets)
68
+ assert set(expected_results) == set(_assign_receiver_ranks(buckets, local_topo, remote_topo))
@@ -0,0 +1,197 @@
1
+ import os
2
+ from unittest.mock import patch
3
+
4
+ import pytest
5
+
6
+ from checkpoint_engine.ps import (
7
+ _get_my_rdma_device,
8
+ _get_rdma_devices,
9
+ _ibv_get_device_list,
10
+ _parse_NCCL_IB_HCA,
11
+ )
12
+
13
+
14
+ @pytest.fixture
15
+ def mock_available_devices() -> list[str]:
16
+ """Provide mock available device list"""
17
+ return ["mlx5_0", "mlx5_1", "mlx4_0", "mlx4_1"]
18
+
19
+
20
+ def test_detect_ibv_list():
21
+ """Test detection of _ibv_get_device_list function"""
22
+ # Skip this test if no real infiniband devices exist
23
+ if not os.path.exists("/sys/class/infiniband"):
24
+ pytest.skip("No infiniband devices found on system")
25
+
26
+ real_ibv_list = sorted(os.listdir("/sys/class/infiniband"))
27
+ if real_ibv_list:
28
+ devices = _ibv_get_device_list()
29
+ assert isinstance(devices, list)
30
+
31
+
32
+ def test_parse_max_hcas_limit():
33
+ """Test maximum HCA quantity limit"""
34
+ # Create mock data with more than 32 devices
35
+ many_devices = [f"device_{i}" for i in range(50)]
36
+ result = _parse_NCCL_IB_HCA("", many_devices)
37
+ assert len(result) == 32
38
+ assert result == many_devices[:32]
39
+
40
+
41
+ def test_get_rdma_devices_no_env_vars(mock_available_devices: list[str]):
42
+ """Test _get_rdma_devices with no environment variables"""
43
+ with (
44
+ patch.dict(os.environ, clear=True),
45
+ patch("checkpoint_engine.ps._ibv_get_device_list", return_value=mock_available_devices),
46
+ ):
47
+ devices = _get_rdma_devices()
48
+ assert sorted(devices) == sorted(mock_available_devices)
49
+
50
+
51
+ @pytest.mark.parametrize(
52
+ "input_value,expected",
53
+ [
54
+ pytest.param("", ["mlx5_0", "mlx5_1", "mlx4_0", "mlx4_1"], id="empty string"),
55
+ pytest.param(" \t\n ", ["mlx5_0", "mlx5_1", "mlx4_0", "mlx4_1"], id="whitespace"),
56
+ pytest.param("None", [], id="None string"),
57
+ pytest.param("^", ["mlx5_0", "mlx5_1", "mlx4_0", "mlx4_1"], id="caret"),
58
+ pytest.param("^=", ["mlx5_0", "mlx5_1", "mlx4_0", "mlx4_1"], id="caret-equals"),
59
+ pytest.param("=^", [], id="equals-caret"),
60
+ pytest.param("^^", ["mlx5_0", "mlx5_1", "mlx4_0", "mlx4_1"], id="double-caret"),
61
+ pytest.param("=", [], id="equals"),
62
+ pytest.param("==", [], id="double-equals"),
63
+ ],
64
+ )
65
+ def test_parse_basic_cases(
66
+ input_value: str, expected: list[str], mock_available_devices: list[str]
67
+ ):
68
+ """Test basic parsing cases: empty string, whitespace, None"""
69
+ result = _parse_NCCL_IB_HCA(input_value, mock_available_devices)
70
+ assert result == expected
71
+
72
+
73
+ @pytest.mark.parametrize(
74
+ "input_value,expected",
75
+ [
76
+ # prefix
77
+ ("mlx5_0", ["mlx5_0"]),
78
+ ("mlx5", ["mlx5_0", "mlx5_1"]),
79
+ # exact match
80
+ ("=mlx5_0", ["mlx5_0"]),
81
+ ("=mlx5_0,mlx5_1", ["mlx5_0", "mlx5_1"]),
82
+ # ignore ports, whitespace and duplicated commas
83
+ ("mlx5_0:1,mlx5_1:2", ["mlx5_0", "mlx5_1"]),
84
+ ("mlx5_0:1,mlx5_1", ["mlx5_0", "mlx5_1"]),
85
+ (" mlx5_0 , mlx5_1 ", ["mlx5_0", "mlx5_1"]),
86
+ ("mlx5_0,,mlx5_1", ["mlx5_0", "mlx5_1"]),
87
+ # exclusion
88
+ ("^mlx5_0", ["mlx5_1", "mlx4_0", "mlx4_1"]),
89
+ ("^mlx5_0,mlx5_1", ["mlx4_0", "mlx4_1"]),
90
+ ("^mlx5", ["mlx4_0", "mlx4_1"]),
91
+ ("^=mlx5_0,mlx5_1", ["mlx4_0", "mlx4_1"]),
92
+ ("^=mlx4", ["mlx5_0", "mlx5_1", "mlx4_0", "mlx4_1"]),
93
+ ],
94
+ )
95
+ def test_parse_various_patterns(
96
+ input_value: str, expected: list[str], mock_available_devices: list[str]
97
+ ):
98
+ """Test various parsing patterns"""
99
+ result = _parse_NCCL_IB_HCA(input_value, mock_available_devices)
100
+ assert result == expected
101
+
102
+
103
+ @pytest.mark.parametrize(
104
+ "input_value,expected_result,expected_warning",
105
+ [
106
+ ("=mlx5_100", [], "No RDMA device match device_name='mlx5_100' where is_exact_match=True."),
107
+ ("mlx5_100", [], "No RDMA device match device_name='mlx5_100' where is_exact_match=False."),
108
+ (
109
+ "^mlx5_100",
110
+ ["mlx5_0", "mlx5_1", "mlx4_0", "mlx4_1"],
111
+ "No RDMA device match device_name='mlx5_100' where is_exact_match=False.",
112
+ ),
113
+ ("mlx6", [], "No RDMA device match device_name='mlx6' where is_exact_match=False."),
114
+ ("=mlx6", [], "No RDMA device match device_name='mlx6' where is_exact_match=True."),
115
+ ],
116
+ )
117
+ def test_parse_exact_match_with_nonexistent_device(
118
+ input_value: str,
119
+ expected_result: list[str],
120
+ expected_warning: str,
121
+ mock_available_devices: list[str],
122
+ ):
123
+ """Test exact matching with non-existent device"""
124
+ with patch("checkpoint_engine.ps.logger") as mock_logger:
125
+ result = _parse_NCCL_IB_HCA(input_value, mock_available_devices)
126
+ assert result == expected_result
127
+ mock_logger.warning.assert_called_once_with(expected_warning)
128
+
129
+
130
+ @pytest.mark.parametrize(
131
+ "env_var_name,env_var_value,expected_devices",
132
+ [
133
+ ("PS_P2P_STORE_RDMA_DEVICES", "mlx5_0,mlx5_1", ["mlx5_0", "mlx5_1"]),
134
+ ("NCCL_IB_HCA", "mlx5", ["mlx5_0", "mlx5_1"]),
135
+ ("NCCL_IB_HCA", "mlx5_0,mlx5_1", ["mlx5_0", "mlx5_1"]),
136
+ ("NCCL_IB_HCA", "^mlx5_0", ["mlx5_1", "mlx4_0", "mlx4_1"]),
137
+ ("NCCL_IB_HCA", "mlx6", ["mlx5_0", "mlx5_1", "mlx4_0", "mlx4_1"]),
138
+ ("NCCL_IB_HCA", "", ["mlx5_0", "mlx5_1", "mlx4_0", "mlx4_1"]),
139
+ ],
140
+ )
141
+ def test_get_rdma_devices_with_env_vars(
142
+ env_var_name: str,
143
+ env_var_value: str,
144
+ expected_devices: list[str],
145
+ mock_available_devices: list[str],
146
+ ):
147
+ """Test _get_rdma_devices with various environment variables"""
148
+ env_dict = {env_var_name: env_var_value}
149
+ with (
150
+ patch.dict(os.environ, env_dict),
151
+ patch("checkpoint_engine.ps._ibv_get_device_list", return_value=mock_available_devices),
152
+ ):
153
+ devices = _get_rdma_devices()
154
+ assert sorted(devices) == sorted(expected_devices)
155
+
156
+
157
+ @pytest.mark.parametrize(
158
+ "local_rank,gpu_count,expected_device",
159
+ [
160
+ (0, 4, "mlx5_0"),
161
+ (3, 4, "mlx5_3"),
162
+ (4, 8, "mlx5_2"),
163
+ (7, 8, "mlx5_3"),
164
+ ],
165
+ )
166
+ def test_get_my_rdma_device_basic(local_rank: int, gpu_count: int, expected_device: str):
167
+ """Test _get_my_rdma_device with basic allocation"""
168
+ # Use fewer devices to match the GPU count constraint
169
+ devices = ["mlx5_0", "mlx5_1", "mlx5_2", "mlx5_3"]
170
+ device = _get_my_rdma_device(local_rank, gpu_count, devices)
171
+ assert device == expected_device
172
+
173
+
174
+ @pytest.mark.parametrize(
175
+ "local_rank,gpu_count,devices,error",
176
+ [
177
+ (
178
+ 0,
179
+ 4,
180
+ ["mlx5_0", "mlx5_1", "mlx5_2", "mlx5_3", "mlx5_4"],
181
+ AssertionError,
182
+ ), # Too many devices
183
+ (
184
+ 0,
185
+ 8,
186
+ ["mlx5_0", "mlx5_1", "mlx5_2"],
187
+ AssertionError,
188
+ ), # GPU count not divisible by device count
189
+ (0, 8, [], RuntimeError), # No devices
190
+ ],
191
+ )
192
+ def test_get_my_rdma_device_invalid_config(
193
+ local_rank: int, gpu_count: int, devices: list[str], error: type
194
+ ):
195
+ """Test _get_my_rdma_device with invalid configuration"""
196
+ with pytest.raises(error):
197
+ _get_my_rdma_device(local_rank, gpu_count, devices)