megatron-core 0.14.0rc7__tar.gz → 0.15.0rc4__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.


This version of megatron-core might be problematic.

Files changed (354)
  1. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/LICENSE +1 -1
  2. {megatron_core-0.14.0rc7/megatron_core.egg-info → megatron_core-0.15.0rc4}/PKG-INFO +24 -7
  3. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/README.md +17 -2
  4. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/__init__.py +11 -0
  5. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/blended_megatron_dataset_builder.py +2 -8
  6. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/blended_megatron_dataset_config.py +3 -3
  7. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/gpt_dataset.py +4 -4
  8. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/helpers.cpp +3 -1
  9. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/megatron_tokenizer.py +1 -1
  10. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/retro/config/tokenizers.py +3 -3
  11. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/dist_checkpointing/dict_utils.py +13 -5
  12. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/dist_checkpointing/mapping.py +31 -5
  13. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/dist_checkpointing/optimizer.py +6 -0
  14. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/dist_checkpointing/strategies/async_utils.py +52 -14
  15. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/dist_checkpointing/strategies/base.py +1 -5
  16. megatron_core-0.15.0rc4/megatron/core/dist_checkpointing/strategies/checkpointable.py +196 -0
  17. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/dist_checkpointing/strategies/torch.py +42 -14
  18. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/dist_checkpointing/strategies/zarr.py +6 -1
  19. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/dist_checkpointing/validation.py +13 -3
  20. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/distributed/distributed_data_parallel.py +49 -90
  21. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/distributed/distributed_data_parallel_config.py +9 -0
  22. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/distributed/finalize_model_grads.py +36 -20
  23. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/distributed/fsdp/mcore_fsdp_adapter.py +12 -16
  24. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/distributed/fsdp/src/megatron_fsdp/__init__.py +31 -0
  25. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/distributed/fsdp/src/megatron_fsdp/distributed_data_parallel_config.py +6 -0
  26. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/distributed/fsdp/src/megatron_fsdp/megatron_fsdp.py +6 -1
  27. megatron_core-0.15.0rc4/megatron/core/distributed/fsdp/src/megatron_fsdp/package_info.py +27 -0
  28. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/distributed/fsdp/src/megatron_fsdp/param_and_grad_buffer.py +97 -61
  29. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/distributed/fsdp/src/megatron_fsdp/utils.py +6 -0
  30. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/distributed/param_and_grad_buffer.py +26 -6
  31. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/enums.py +6 -0
  32. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/export/trtllm/trtllm_weights_converter/single_device_trtllm_model_weights_converter.py +47 -24
  33. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/extensions/kitchen.py +4 -0
  34. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/extensions/transformer_engine.py +259 -207
  35. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/extensions/transformer_engine_spec_provider.py +5 -0
  36. megatron_core-0.15.0rc4/megatron/core/fp4_utils.py +136 -0
  37. megatron_core-0.15.0rc4/megatron/core/fusions/fused_bias_geglu.py +442 -0
  38. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/fusions/fused_softmax.py +51 -7
  39. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/contexts/dynamic_context.py +167 -80
  40. megatron_core-0.15.0rc4/megatron/core/inference/data_parallel_inference_coordinator.py +322 -0
  41. megatron_core-0.15.0rc4/megatron/core/inference/engines/dynamic_engine.py +828 -0
  42. megatron_core-0.15.0rc4/megatron/core/inference/headers.py +17 -0
  43. megatron_core-0.15.0rc4/megatron/core/inference/inference_client.py +190 -0
  44. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/inference_request.py +11 -1
  45. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/model_inference_wrappers/abstract_model_inference_wrapper.py +11 -10
  46. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/sampling_params.py +11 -0
  47. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/text_generation_controllers/text_generation_controller.py +43 -9
  48. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/model_parallel_config.py +6 -3
  49. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/T5/t5_model.py +8 -8
  50. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/backends.py +9 -0
  51. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/common/embeddings/yarn_rotary_pos_embedding.py +23 -21
  52. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/common/language_module/language_module.py +13 -12
  53. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/common/model_chunk_schedule_plan.py +115 -109
  54. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/gpt/fine_grained_callables.py +117 -7
  55. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/gpt/gpt_layer_specs.py +23 -9
  56. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/gpt/gpt_model.py +55 -17
  57. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/gpt/heterogeneous/heterogeneous_layer_specs.py +11 -3
  58. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/gpt/moe_module_specs.py +7 -1
  59. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/mamba/mamba_model.py +8 -8
  60. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/multimodal/context_parallel.py +25 -13
  61. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/multimodal/llava_model.py +17 -12
  62. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/retro/base_attention.py +4 -4
  63. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/retro/decoder_attention.py +5 -5
  64. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/retro/decoder_spec.py +8 -2
  65. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/vision/clip_vit_model.py +5 -5
  66. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/vision/multimodal_projector.py +35 -30
  67. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/vision/radio.py +30 -4
  68. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/nccl_allocator.py +39 -8
  69. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/optimizer/__init__.py +16 -122
  70. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/optimizer/distrib_optimizer.py +432 -130
  71. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/optimizer/optimizer.py +61 -9
  72. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/optimizer/optimizer_config.py +0 -6
  73. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/package_info.py +4 -6
  74. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/parallel_state.py +9 -7
  75. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/pipeline_parallel/combined_1f1b.py +179 -66
  76. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/pipeline_parallel/schedules.py +334 -232
  77. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/pipeline_parallel/utils.py +0 -16
  78. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/post_training/modelopt/gpt/state_dict_hooks.py +1 -0
  79. megatron_core-0.15.0rc4/megatron/core/post_training/modelopt/mamba/__init__.py +1 -0
  80. megatron_core-0.15.0rc4/megatron/core/process_groups_config.py +489 -0
  81. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/ssm/mamba_block.py +8 -8
  82. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/ssm/mamba_layer.py +4 -4
  83. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/ssm/mamba_mixer.py +9 -9
  84. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/ssm/mlp_layer.py +3 -3
  85. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/tensor_parallel/layers.py +3 -3
  86. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/tensor_parallel/random.py +5 -2
  87. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/timers.py +14 -1
  88. megatron_core-0.15.0rc4/megatron/core/tokenizers/__init__.py +4 -0
  89. megatron_core-0.15.0rc4/megatron/core/tokenizers/base_tokenizer.py +48 -0
  90. megatron_core-0.15.0rc4/megatron/core/tokenizers/megatron_tokenizer.py +171 -0
  91. megatron_core-0.15.0rc4/megatron/core/tokenizers/text/__init__.py +3 -0
  92. megatron_core-0.15.0rc4/megatron/core/tokenizers/text/libraries/__init__.py +8 -0
  93. megatron_core-0.15.0rc4/megatron/core/tokenizers/text/libraries/abstract_tokenizer.py +147 -0
  94. megatron_core-0.15.0rc4/megatron/core/tokenizers/text/libraries/bytelevel_tokenizer.py +164 -0
  95. megatron_core-0.15.0rc4/megatron/core/tokenizers/text/libraries/chat_template.py +71 -0
  96. megatron_core-0.15.0rc4/megatron/core/tokenizers/text/libraries/huggingface_tokenizer.py +335 -0
  97. megatron_core-0.15.0rc4/megatron/core/tokenizers/text/libraries/megatron_hf_tokenizer.py +179 -0
  98. megatron_core-0.15.0rc4/megatron/core/tokenizers/text/libraries/null_tokenizer.py +79 -0
  99. megatron_core-0.15.0rc4/megatron/core/tokenizers/text/libraries/sentencepiece_tokenizer.py +411 -0
  100. megatron_core-0.15.0rc4/megatron/core/tokenizers/text/libraries/tiktoken_tokenizer.py +303 -0
  101. megatron_core-0.15.0rc4/megatron/core/tokenizers/text/models/__init__.py +8 -0
  102. megatron_core-0.15.0rc4/megatron/core/tokenizers/text/models/bert_tokenizer.py +12 -0
  103. megatron_core-0.15.0rc4/megatron/core/tokenizers/text/models/default_tokenizer.py +12 -0
  104. megatron_core-0.15.0rc4/megatron/core/tokenizers/text/models/gpt_tokenizer.py +12 -0
  105. megatron_core-0.15.0rc4/megatron/core/tokenizers/text/models/mamba_tokenizer.py +12 -0
  106. megatron_core-0.15.0rc4/megatron/core/tokenizers/text/models/retro_tokenizer.py +12 -0
  107. megatron_core-0.15.0rc4/megatron/core/tokenizers/text/models/t5_tokenizer.py +12 -0
  108. megatron_core-0.15.0rc4/megatron/core/tokenizers/text/text_tokenizer.py +254 -0
  109. megatron_core-0.15.0rc4/megatron/core/tokenizers/text/utils/build_tokenizer.py +58 -0
  110. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/attention.py +51 -25
  111. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/cuda_graphs.py +183 -61
  112. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/dot_product_attention.py +44 -13
  113. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/mlp.py +44 -6
  114. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/module.py +32 -3
  115. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/moe/experts.py +60 -27
  116. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/moe/moe_layer.py +47 -20
  117. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/moe/moe_utils.py +20 -16
  118. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/moe/router.py +89 -12
  119. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/moe/shared_experts.py +36 -5
  120. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/moe/token_dispatcher.py +20 -19
  121. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/multi_latent_attention.py +42 -17
  122. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/multi_token_prediction.py +241 -211
  123. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/pipeline_parallel_layer_layout.py +46 -11
  124. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/transformer_block.py +126 -63
  125. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/transformer_config.py +129 -19
  126. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/transformer_layer.py +77 -46
  127. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/utils.py +117 -0
  128. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/utils.py +28 -5
  129. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4/megatron_core.egg-info}/PKG-INFO +24 -7
  130. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron_core.egg-info/SOURCES.txt +28 -0
  131. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron_core.egg-info/requires.txt +6 -4
  132. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/pyproject.toml +11 -7
  133. megatron_core-0.14.0rc7/megatron/core/fusions/fused_bias_geglu.py +0 -85
  134. megatron_core-0.14.0rc7/megatron/core/inference/engines/dynamic_engine.py +0 -423
  135. megatron_core-0.14.0rc7/megatron/core/process_groups_config.py +0 -233
  136. megatron_core-0.14.0rc7/megatron/core/transformer/moe/__init__.py +0 -0
  137. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/MANIFEST.in +0 -0
  138. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/README.md +0 -0
  139. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/activations.py +0 -0
  140. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/config.py +0 -0
  141. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/config_logger.py +0 -0
  142. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/__init__.py +0 -0
  143. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/bert_dataset.py +0 -0
  144. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/blended_dataset.py +0 -0
  145. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/helpers.py +0 -0
  146. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/indexed_dataset.py +0 -0
  147. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/masked_dataset.py +0 -0
  148. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/megatron_dataset.py +0 -0
  149. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/multimodal_dataset.py +0 -0
  150. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/object_storage_utils.py +0 -0
  151. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/retro/__init__.py +0 -0
  152. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/retro/config/__init__.py +0 -0
  153. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/retro/config/bert_embedders.py +0 -0
  154. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/retro/config/config.py +0 -0
  155. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/retro/config/gpt_chunk_datasets.py +0 -0
  156. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/retro/db/__init__.py +0 -0
  157. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/retro/db/build.py +0 -0
  158. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/retro/db/dataset.py +0 -0
  159. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/retro/db/utils.py +0 -0
  160. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/retro/external_libs.py +0 -0
  161. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/retro/index/__init__.py +0 -0
  162. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/retro/index/build.py +0 -0
  163. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/retro/index/factory.py +0 -0
  164. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/retro/index/index.py +0 -0
  165. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/retro/index/indexes/__init__.py +0 -0
  166. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/retro/index/indexes/faiss_base.py +0 -0
  167. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/retro/index/indexes/faiss_par_add.py +0 -0
  168. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/retro/index/utils.py +0 -0
  169. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/retro/index/validate.py +0 -0
  170. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/retro/query/__init__.py +0 -0
  171. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/retro/query/gpt_chunk_dataset.py +0 -0
  172. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/retro/query/multi_split_gpt_dataset.py +0 -0
  173. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/retro/query/query.py +0 -0
  174. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/retro/query/retro_dataset.py +0 -0
  175. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/retro/query/utils.py +0 -0
  176. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/retro/utils.py +0 -0
  177. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/t5_dataset.py +0 -0
  178. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/utils.py +0 -0
  179. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/utils_object_storage.py +0 -0
  180. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/datasets/utils_s3.py +0 -0
  181. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/dist_checkpointing/__init__.py +0 -0
  182. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/dist_checkpointing/core.py +0 -0
  183. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/dist_checkpointing/exchange_utils.py +0 -0
  184. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/dist_checkpointing/serialization.py +0 -0
  185. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/dist_checkpointing/state_dict_utils.py +0 -0
  186. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/dist_checkpointing/strategies/__init__.py +0 -0
  187. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/dist_checkpointing/strategies/cached_metadata_filesystem_reader.py +0 -0
  188. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/dist_checkpointing/strategies/common.py +0 -0
  189. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/dist_checkpointing/strategies/filesystem_async.py +0 -0
  190. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/dist_checkpointing/strategies/fully_parallel.py +0 -0
  191. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/dist_checkpointing/strategies/resharding.py +0 -0
  192. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/dist_checkpointing/strategies/state_dict_saver.py +0 -0
  193. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/dist_checkpointing/strategies/tensorstore.py +0 -0
  194. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/dist_checkpointing/strategies/two_stage.py +0 -0
  195. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/dist_checkpointing/tensor_aware_state_dict.py +0 -0
  196. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/dist_checkpointing/utils.py +0 -0
  197. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/distributed/__init__.py +0 -0
  198. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/distributed/data_parallel_base.py +0 -0
  199. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/distributed/fsdp/__init__.py +0 -0
  200. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/distributed/fsdp/src/__init__.py +0 -0
  201. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/distributed/fsdp/src/megatron_fsdp/fully_shard.py +0 -0
  202. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/distributed/fsdp/src/megatron_fsdp/uneven_dtensor.py +0 -0
  203. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/distributed/torch_fully_sharded_data_parallel.py +0 -0
  204. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/distributed/torch_fully_sharded_data_parallel_config.py +0 -0
  205. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/energy_monitor.py +0 -0
  206. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/export/__init__.py +0 -0
  207. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/export/data_type.py +0 -0
  208. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/export/export_config.py +0 -0
  209. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/export/model_type.py +0 -0
  210. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/export/trtllm/__init__.py +0 -0
  211. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/export/trtllm/engine_builder/__init__.py +0 -0
  212. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/export/trtllm/engine_builder/trtllm_engine_builder.py +0 -0
  213. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/export/trtllm/model_to_trllm_mapping/__init__.py +0 -0
  214. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/export/trtllm/model_to_trllm_mapping/default_conversion_dict.py +0 -0
  215. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/export/trtllm/trt_model_config.py +0 -0
  216. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/export/trtllm/trt_model_type.py +0 -0
  217. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/export/trtllm/trtllm_helper.py +0 -0
  218. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/export/trtllm/trtllm_layers.py +0 -0
  219. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/export/trtllm/trtllm_weights_converter/__init__.py +0 -0
  220. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/export/trtllm/trtllm_weights_converter/distributed_trtllm_model_weights_converter.py +0 -0
  221. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/export/trtllm/trtllm_weights_converter/utils.py +0 -0
  222. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/extensions/__init__.py +0 -0
  223. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/fp8_utils.py +0 -0
  224. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/full_cuda_graph.py +0 -0
  225. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/fusions/__init__.py +0 -0
  226. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/fusions/fused_bias_dropout.py +0 -0
  227. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/fusions/fused_bias_gelu.py +0 -0
  228. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/fusions/fused_bias_swiglu.py +0 -0
  229. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/fusions/fused_cross_entropy.py +0 -0
  230. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/fusions/fused_indices_converter.py +0 -0
  231. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/fusions/fused_layer_norm.py +0 -0
  232. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/fusions/fused_mla_yarn_rope_apply.py +0 -0
  233. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/fusions/fused_pad_routing_map.py +0 -0
  234. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/fusions/fused_weighted_squared_relu.py +0 -0
  235. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/hyper_comm_grid.py +0 -0
  236. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/__init__.py +0 -0
  237. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/async_stream.py +0 -0
  238. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/common_inference_params.py +0 -0
  239. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/communication_utils.py +0 -0
  240. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/contexts/__init__.py +0 -0
  241. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/contexts/base_context.py +0 -0
  242. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/contexts/dynamic_chunk_allocator.py +0 -0
  243. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/contexts/static_context.py +0 -0
  244. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/engines/__init__.py +0 -0
  245. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/engines/abstract_engine.py +0 -0
  246. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/engines/mcore_engine.py +0 -0
  247. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/engines/static_engine.py +0 -0
  248. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/model_inference_wrappers/__init__.py +0 -0
  249. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/model_inference_wrappers/gpt/__init__.py +0 -0
  250. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/model_inference_wrappers/gpt/gpt_inference_wrapper.py +0 -0
  251. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/model_inference_wrappers/inference_wrapper_config.py +0 -0
  252. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/model_inference_wrappers/multimodal/vlm_inference_wrapper.py +0 -0
  253. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/model_inference_wrappers/t5/__init__.py +0 -0
  254. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/model_inference_wrappers/t5/t5_inference_wrapper.py +0 -0
  255. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/scheduler.py +0 -0
  256. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/text_generation_controllers/__init__.py +0 -0
  257. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/text_generation_controllers/encoder_decoder_text_generation_controller.py +0 -0
  258. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/text_generation_controllers/simple_text_generation_controller.py +0 -0
  259. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/text_generation_controllers/vlm_text_generation_controller.py +0 -0
  260. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference/utils.py +0 -0
  261. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/inference_params.py +0 -0
  262. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/jit.py +0 -0
  263. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/T5/__init__.py +0 -0
  264. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/T5/t5_spec.py +0 -0
  265. {megatron_core-0.14.0rc7/megatron/core/post_training → megatron_core-0.15.0rc4/megatron/core/models}/__init__.py +0 -0
  266. {megatron_core-0.14.0rc7/megatron/core/models → megatron_core-0.15.0rc4/megatron/core/models/bert}/__init__.py +0 -0
  267. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/bert/bert_layer_specs.py +0 -0
  268. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/bert/bert_lm_head.py +0 -0
  269. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/bert/bert_model.py +0 -0
  270. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/bert/pooler.py +0 -0
  271. {megatron_core-0.14.0rc7/megatron/core/models/bert → megatron_core-0.15.0rc4/megatron/core/models/common}/__init__.py +0 -0
  272. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/common/embeddings/__init__.py +0 -0
  273. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/common/embeddings/language_model_embedding.py +0 -0
  274. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/common/embeddings/relative_pos_embedding.py +0 -0
  275. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/common/embeddings/rope_utils.py +0 -0
  276. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/common/embeddings/rotary_pos_embedding.py +0 -0
  277. {megatron_core-0.14.0rc7/megatron/core/models/common → megatron_core-0.15.0rc4/megatron/core/models/common/language_module}/__init__.py +0 -0
  278. {megatron_core-0.14.0rc7/megatron/core/models/common/language_module → megatron_core-0.15.0rc4/megatron/core/models/common/vision_module}/__init__.py +0 -0
  279. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/common/vision_module/vision_module.py +0 -0
  280. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/gpt/__init__.py +0 -0
  281. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/huggingface/__init__.py +0 -0
  282. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/huggingface/clip_model.py +0 -0
  283. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/huggingface/module.py +0 -0
  284. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/huggingface/qwen_model.py +0 -0
  285. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/mamba/__init__.py +0 -0
  286. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/mamba/mamba_layer_specs.py +0 -0
  287. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/mimo/__init__.py +0 -0
  288. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/mimo/config/__init__.py +0 -0
  289. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/mimo/config/base_configs.py +0 -0
  290. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/mimo/model/__init__.py +0 -0
  291. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/mimo/model/base.py +0 -0
  292. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/mimo/submodules/audio.py +0 -0
  293. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/mimo/submodules/base.py +0 -0
  294. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/mimo/submodules/vision.py +0 -0
  295. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/multimodal/__init__.py +0 -0
  296. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/multimodal/llava_spec.py +0 -0
  297. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/retro/__init__.py +0 -0
  298. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/retro/config.py +0 -0
  299. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/retro/encoder_attention.py +0 -0
  300. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/retro/encoder_spec.py +0 -0
  301. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/retro/model.py +0 -0
  302. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/retro/utils.py +0 -0
  303. {megatron_core-0.14.0rc7/megatron/core/models/common/vision_module → megatron_core-0.15.0rc4/megatron/core/models/vision}/__init__.py +0 -0
  304. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/models/vision/vit_layer_specs.py +0 -0
  305. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/msc_utils.py +0 -0
  306. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/num_microbatches_calculator.py +0 -0
  307. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/optimizer/clip_grads.py +0 -0
  308. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/optimizer/cpu_offloading/__init__.py +0 -0
  309. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/optimizer/cpu_offloading/hybrid_optimizer.py +0 -0
  310. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/optimizer/grad_scaler.py +0 -0
  311. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/optimizer_param_scheduler.py +0 -0
  312. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/packed_seq_params.py +0 -0
  313. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/pipeline_parallel/__init__.py +0 -0
  314. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/pipeline_parallel/p2p_communication.py +0 -0
  315. {megatron_core-0.14.0rc7/megatron/core/post_training/modelopt/mamba → megatron_core-0.15.0rc4/megatron/core/post_training}/__init__.py +0 -0
  316. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/post_training/modelopt/__init__.py +0 -0
  317. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/post_training/modelopt/gpt/__init__.py +0 -0
  318. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/post_training/modelopt/gpt/model_specs.py +0 -0
  319. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/post_training/modelopt/layers.py +0 -0
  320. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/post_training/modelopt/mamba/model_specs.py +0 -0
  321. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/quantization/__init__.py +0 -0
  322. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/quantization/quant_config.py +0 -0
  323. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/quantization/utils.py +0 -0
  324. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/requirements.txt +0 -0
  325. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/rerun_state_machine.py +0 -0
  326. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/safe_globals.py +0 -0
  327. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/ssm/__init__.py +0 -0
  328. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/ssm/mamba_context_parallel.py +0 -0
  329. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/ssm/mamba_hybrid_layer_allocation.py +0 -0
  330. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/ssm/triton_cache_manager.py +0 -0
  331. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/tensor_parallel/__init__.py +0 -0
  332. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/tensor_parallel/cross_entropy.py +0 -0
  333. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/tensor_parallel/data.py +0 -0
  334. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/tensor_parallel/mappings.py +0 -0
  335. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/tensor_parallel/utils.py +0 -0
  336. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/__init__.py +0 -0
  337. {megatron_core-0.14.0rc7/megatron/core/models/vision → megatron_core-0.15.0rc4/megatron/core/transformer/custom_layers}/__init__.py +0 -0
  338. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/custom_layers/transformer_engine.py +0 -0
  339. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/enums.py +0 -0
  340. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/fsdp_dtensor_checkpoint.py +0 -0
  341. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/heterogeneous/heterogeneous_config.py +0 -0
  342. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/heterogeneous/linear_replacements.py +0 -0
  343. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/identity_op.py +0 -0
  344. {megatron_core-0.14.0rc7/megatron/core/transformer/custom_layers → megatron_core-0.15.0rc4/megatron/core/transformer/moe}/__init__.py +0 -0
  345. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/moe/fused_a2a.py +0 -0
  346. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/moe/grouped_gemm_util.py +0 -0
  347. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/moe/upcycling_utils.py +0 -0
  348. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/spec_utils.py +0 -0
  349. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/torch_layer_norm.py +0 -0
  350. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron/core/transformer/torch_norm.py +0 -0
  351. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron_core.egg-info/dependency_links.txt +0 -0
  352. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/megatron_core.egg-info/top_level.txt +0 -0
  353. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/setup.cfg +0 -0
  354. {megatron_core-0.14.0rc7 → megatron_core-0.15.0rc4}/setup.py +0 -0
@@ -37,7 +37,7 @@ Below are licenses used in those files, as indicated.
 
 
 --------------------------------------------------------------------------------------
- -- LICENSE FOR Facebook, huggingface, Google Research, LLaVA, Mamba, and vLLM code --
+ -- LICENSE FOR Facebook, huggingface, Google Research, LLaVA, Mamba, TinyZero and vLLM code --
 
 
 Apache License
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: megatron-core
- Version: 0.14.0rc7
+ Version: 0.15.0rc4
 Summary: Megatron Core - a library for efficient and scalable training of transformer based models
 Author-email: NVIDIA <nemo-toolkit@nvidia.com>
 Maintainer-email: NVIDIA <nemo-toolkit@nvidia.com>
@@ -31,29 +31,30 @@ Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: torch
 Requires-Dist: numpy<2.0.0
- Requires-Dist: packaging
+ Requires-Dist: packaging>=24.2
 Provides-Extra: mlm
 Requires-Dist: flask-restful; extra == "mlm"
 Requires-Dist: sentencepiece; extra == "mlm"
 Requires-Dist: tiktoken; extra == "mlm"
 Requires-Dist: wandb; extra == "mlm"
+ Requires-Dist: transformers; extra == "mlm"
 Provides-Extra: dev
 Requires-Dist: tqdm; extra == "dev"
 Requires-Dist: einops~=0.8; extra == "dev"
 Requires-Dist: tensorstore!=0.1.46,!=0.1.72,~=0.1; extra == "dev"
 Requires-Dist: nvtx~=0.2; extra == "dev"
- Requires-Dist: transformers~=4.53; extra == "dev"
- Requires-Dist: multi-storage-client~=0.20; extra == "dev"
+ Requires-Dist: multi-storage-client~=0.27; extra == "dev"
 Requires-Dist: opentelemetry-api~=1.33.1; extra == "dev"
 Requires-Dist: setuptools<80.0.0; extra == "dev"
 Requires-Dist: mamba-ssm~=2.2; extra == "dev"
 Requires-Dist: causal-conv1d~=1.5; extra == "dev"
 Requires-Dist: nv-grouped-gemm~=1.1; extra == "dev"
- Requires-Dist: transformer-engine[pytorch]<2.7.0,>=2.6.0a0; extra == "dev"
+ Requires-Dist: transformer-engine[pytorch]<2.8.0,>=2.6.0a0; extra == "dev"
 Requires-Dist: nvidia-resiliency-ext<0.5.0,>=0.4.0a0; extra == "dev"
 Requires-Dist: nvidia-modelopt[torch]<0.34.0,>=0.33.0a0; sys_platform != "darwin" and extra == "dev"
 Requires-Dist: megatron-energon[av_decode]~=6.0; extra == "dev"
 Requires-Dist: flashinfer-python; extra == "dev"
+ Requires-Dist: wget; extra == "dev"
 Requires-Dist: onnxscript; extra == "dev"
 Provides-Extra: lts
 Requires-Dist: tqdm; extra == "lts"
@@ -63,6 +64,7 @@ Requires-Dist: nvtx; extra == "lts"
 Requires-Dist: transformers; extra == "lts"
 Requires-Dist: zarr; extra == "lts"
 Requires-Dist: setuptools<80.0.0; extra == "lts"
+ Requires-Dist: wget; extra == "lts"
 Dynamic: license-file
 
 <div align="center">
@@ -93,7 +95,10 @@ cd Megatron-LM
 
 # Latest News
 
- - 📣 NEW! **[DeepSeek & MoE Training with FP8](https://github.com/yanring/Megatron-MoE-ModelZoo)** examples are now available, including optimized configurations for `DeepSeek-V3`, `Qwen2` and `Mixtral` models with FP8 precision support.
+ - 🔄 NEW! **[Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge)** - Bidirectional converter for interoperability between Hugging Face and Megatron checkpoints, featuring production-ready recipes for popular models.
+ - 🗺️ **[MoE Q3-Q4 2025 Roadmap](https://github.com/NVIDIA/Megatron-LM/issues/1729)** - Comprehensive roadmap for MoE features including DeepSeek-V3, Qwen3, advanced parallelism strategies, FP8 optimizations, and Blackwell performance enhancements.
+ - 🚀 **[GPT-OSS Implementation](https://github.com/NVIDIA/Megatron-LM/issues/1739)** - Advanced features including YaRN RoPE scaling, attention sinks, and custom activation functions are being integrated into Megatron Core.
+ - **[2025/06]** **[Megatron MoE Model Zoo](https://github.com/yanring/Megatron-MoE-ModelZoo)** - Best practices and optimized configurations for training DeepSeek-V3, Mixtral, and Qwen3 MoE models with performance benchmarking and checkpoint conversion tools.
 - **[2025/05]** Megatron Core v0.11.0 brings new capabilities for multi-data center LLM training ([blog](https://developer.nvidia.com/blog/turbocharge-llm-training-across-long-haul-data-center-networks-with-nvidia-nemo-framework/)).
 
 <details>
@@ -143,6 +148,7 @@ cd Megatron-LM
 **Resources**
 - [Examples](./examples/) - Training scripts and tutorials
 - [Documentation](https://docs.nvidia.com/Megatron-Core/) - Official docs
+ - [Roadmaps](#roadmaps) - Development roadmaps and feature tracking
 - [Community & Support](#-community--support) - Get help and contribute
 - [Getting Help](#getting-help)
 - [Contributing](#contributing)
@@ -217,10 +223,12 @@ Megatron-LM/
 
 **Libraries using Megatron Core:**
 
+ - **[Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge)** - Training library with bidirectional Hugging Face ↔ Megatron checkpoint conversion, flexible training loops, and production-ready recipes
+ - **[NeMo RL](https://github.com/NVIDIA-NeMo/RL)** - Scalable toolkit for efficient reinforcement learning with RLHF, DPO, and other post-training methods
 - **[NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html)** - Enterprise framework with cloud-native support and end-to-end examples
 - **[TensorRT Model Optimizer (ModelOpt)](https://github.com/NVIDIA/TensorRT-Model-Optimizer)** - Model optimization toolkit for quantization, pruning, and distillation
 
- **Compatible with:** [HuggingFace Accelerate](https://github.com/huggingface/accelerate), [Colossal-AI](https://github.com/hpcaitech/ColossalAI), [DeepSpeed](https://github.com/microsoft/DeepSpeed)
+ **Compatible with:** [Hugging Face Accelerate](https://github.com/huggingface/accelerate), [Colossal-AI](https://github.com/hpcaitech/ColossalAI), [DeepSpeed](https://github.com/microsoft/DeepSpeed)
 
 # Installation
 
@@ -510,6 +518,15 @@ Based on [NVIDIA NeMo production configurations](https://github.com/NVIDIA/NeMo/
 --use-distributed-optimizer
 ```
 
+ # Roadmaps
+
+ Stay up-to-date with our development roadmaps and planned features:
+
+ - **[MoE Q3-Q4 2025 Roadmap](https://github.com/NVIDIA/Megatron-LM/issues/1729)** - Comprehensive MoE feature development including DeepSeek-V3, Qwen3, advanced parallelism, FP8 optimizations, and Blackwell enhancements
+ - **[GPT-OSS Implementation Tracker](https://github.com/NVIDIA/Megatron-LM/issues/1739)** - Advanced features including YaRN RoPE scaling, attention sinks, and custom activation functions
+
+ *More roadmap trackers will be added soon.*
+
 # Community & Support
 
 ## Getting Help
@@ -26,7 +26,10 @@ cd Megatron-LM
 
 # Latest News
 
- - 📣 NEW! **[DeepSeek & MoE Training with FP8](https://github.com/yanring/Megatron-MoE-ModelZoo)** examples are now available, including optimized configurations for `DeepSeek-V3`, `Qwen2` and `Mixtral` models with FP8 precision support.
+ - 🔄 NEW! **[Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge)** - Bidirectional converter for interoperability between Hugging Face and Megatron checkpoints, featuring production-ready recipes for popular models.
+ - 🗺️ **[MoE Q3-Q4 2025 Roadmap](https://github.com/NVIDIA/Megatron-LM/issues/1729)** - Comprehensive roadmap for MoE features including DeepSeek-V3, Qwen3, advanced parallelism strategies, FP8 optimizations, and Blackwell performance enhancements.
+ - 🚀 **[GPT-OSS Implementation](https://github.com/NVIDIA/Megatron-LM/issues/1739)** - Advanced features including YaRN RoPE scaling, attention sinks, and custom activation functions are being integrated into Megatron Core.
+ - **[2025/06]** **[Megatron MoE Model Zoo](https://github.com/yanring/Megatron-MoE-ModelZoo)** - Best practices and optimized configurations for training DeepSeek-V3, Mixtral, and Qwen3 MoE models with performance benchmarking and checkpoint conversion tools.
 - **[2025/05]** Megatron Core v0.11.0 brings new capabilities for multi-data center LLM training ([blog](https://developer.nvidia.com/blog/turbocharge-llm-training-across-long-haul-data-center-networks-with-nvidia-nemo-framework/)).
 
 <details>
@@ -76,6 +79,7 @@ cd Megatron-LM
 **Resources**
 - [Examples](./examples/) - Training scripts and tutorials
 - [Documentation](https://docs.nvidia.com/Megatron-Core/) - Official docs
+ - [Roadmaps](#roadmaps) - Development roadmaps and feature tracking
 - [Community & Support](#-community--support) - Get help and contribute
 - [Getting Help](#getting-help)
 - [Contributing](#contributing)
@@ -150,10 +154,12 @@ Megatron-LM/
 
 **Libraries using Megatron Core:**
 
+ - **[Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge)** - Training library with bidirectional Hugging Face ↔ Megatron checkpoint conversion, flexible training loops, and production-ready recipes
+ - **[NeMo RL](https://github.com/NVIDIA-NeMo/RL)** - Scalable toolkit for efficient reinforcement learning with RLHF, DPO, and other post-training methods
 - **[NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html)** - Enterprise framework with cloud-native support and end-to-end examples
 - **[TensorRT Model Optimizer (ModelOpt)](https://github.com/NVIDIA/TensorRT-Model-Optimizer)** - Model optimization toolkit for quantization, pruning, and distillation
 
- **Compatible with:** [HuggingFace Accelerate](https://github.com/huggingface/accelerate), [Colossal-AI](https://github.com/hpcaitech/ColossalAI), [DeepSpeed](https://github.com/microsoft/DeepSpeed)
+ **Compatible with:** [Hugging Face Accelerate](https://github.com/huggingface/accelerate), [Colossal-AI](https://github.com/hpcaitech/ColossalAI), [DeepSpeed](https://github.com/microsoft/DeepSpeed)
 
 # Installation
 
@@ -443,6 +449,15 @@ Based on [NVIDIA NeMo production configurations](https://github.com/NVIDIA/NeMo/
 --use-distributed-optimizer
 ```
 
+ # Roadmaps
+
+ Stay up-to-date with our development roadmaps and planned features:
+
+ - **[MoE Q3-Q4 2025 Roadmap](https://github.com/NVIDIA/Megatron-LM/issues/1729)** - Comprehensive MoE feature development including DeepSeek-V3, Qwen3, advanced parallelism, FP8 optimizations, and Blackwell enhancements
+ - **[GPT-OSS Implementation Tracker](https://github.com/NVIDIA/Megatron-LM/issues/1739)** - Advanced features including YaRN RoPE scaling, attention sinks, and custom activation functions
+
+ *More roadmap trackers will be added soon.*
+
 # Community & Support
 
 ## Getting Help
@@ -33,6 +33,17 @@ __all__ = [
 "InferenceParams",
 "ModelParallelConfig",
 "Timers",
+ "__contact_emails__",
+ "__contact_names__",
+ "__description__",
+ "__download_url__",
+ "__homepage__",
+ "__keywords__",
+ "__license__",
+ "__package_name__",
+ "__repository_url__",
+ "__shortversion__",
+ "__version__",
 ]
 
 from .safe_globals import register_safe_globals
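
The `__all__` additions above re-export the package metadata dunders (presumably defined in `megatron/core/package_info.py`, which is also touched in this release) from the top level of `megatron.core`. A minimal sketch of reading them, assuming they are plain string attributes as the names suggest:

```python
# Hedged sketch: inspect the newly re-exported package metadata.
# Assumes megatron-core >= 0.15.0rc4 is installed and that these dunders
# are simple string attributes; nothing here is confirmed beyond the
# __all__ entries shown in the hunk above.
import megatron.core as mcore

print(mcore.__package_name__)   # package name string
print(mcore.__version__)        # full version string
print(mcore.__shortversion__)   # abbreviated version string
```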
@@ -35,7 +35,8 @@ class BlendedMegatronDatasetBuilder(object):
 
 is_built_on_rank (Callable): A callable which returns True if the dataset should be built on
 the current rank and False otherwise. It should be Megatron Core parallelism aware i.e.
- global rank, local group rank, and virtual rank may inform its return value.
+ global rank, local group rank, and virtual rank may inform its return value. Should
+ return true for exactly one process on global rank 0.
 
 config (BlendedMegatronDatasetConfig): The config object which informs dataset creation
 """
@@ -72,13 +73,6 @@ class BlendedMegatronDatasetBuilder(object):
 for {split.name} split
 This can occur with multiple validation sets if datasets have weights"""
 
- if torch.distributed.is_initialized():
- gb_rank = torch.distributed.get_rank()
- if gb_rank == 0:
- assert (
- self.is_built_on_rank()
- ), "is_built_on_rank must return True when global rank = 0"
-
 def build(self) -> List[Optional[TopLevelDataset]]:
 """Build all dataset splits according to the provided blend(s)
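
The new docstring sentence tightens the contract on `is_built_on_rank` now that the runtime assertion removed above is gone: whatever rank policy the callable encodes, it must still return True on global rank 0. A hedged sketch of such a callable (not a helper shipped by the library; the policy shown is only an example):

```python
import torch


def is_built_on_rank() -> bool:
    """Example policy for BlendedMegatronDatasetBuilder's is_built_on_rank argument.

    Hypothetical illustration only: build the dataset on global rank 0
    (and everywhere when torch.distributed is not initialized), which
    trivially satisfies the documented requirement.
    """
    if not torch.distributed.is_initialized():
        return True
    return torch.distributed.get_rank() == 0
```

In practice the callable is usually parallelism-aware (e.g. keyed on tensor- or pipeline-parallel rank, as the docstring notes), but the rank-0 requirement stays the same.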
 
@@ -6,8 +6,8 @@ import re
 from dataclasses import dataclass, field
 from typing import List, Optional, Tuple
 
- from megatron.core.datasets.megatron_tokenizer import MegatronTokenizer
 from megatron.core.datasets.utils import Split, log_single_rank, normalize
+ from megatron.core.tokenizers import MegatronTokenizerBase
 
 logger = logging.getLogger(__name__)
 
@@ -66,8 +66,8 @@ class BlendedMegatronDatasetConfig:
 constructor.
 """
 
- tokenizer: Optional[MegatronTokenizer] = None
- """The MegatronTokenizer instance. Required for datasets that do online tokenization."""
+ tokenizer: Optional[MegatronTokenizerBase] = None
+ """The MegatronTokenizerBase instance. Required for datasets that do online tokenization."""
 
 mid_level_dataset_surplus: float = 0.005
 """The sample surplus to build for the mid-level datasets(s). Defaults arbitrarily to 0.005.
@@ -12,9 +12,9 @@ import torch
 from megatron.core.datasets.blended_megatron_dataset_config import BlendedMegatronDatasetConfig
 from megatron.core.datasets.indexed_dataset import IndexedDataset
 from megatron.core.datasets.megatron_dataset import MegatronDataset
- from megatron.core.datasets.megatron_tokenizer import MegatronTokenizer
 from megatron.core.datasets.object_storage_utils import ObjectStorageConfig, is_object_storage_path
 from megatron.core.datasets.utils import Split
+ from megatron.core.tokenizers import MegatronTokenizerBase
 from megatron.core.utils import log_single_rank
 
 logger = logging.getLogger(__name__)
@@ -701,8 +701,8 @@ class MockGPTLowLevelDataset:
 we add the end of document token to each element indexed in __getitem__
 
 Args:
- tokenizer (MegatronTokenizer): The tokenizer the special token information of which we use
- to augment the mock data.
+ tokenizer (MegatronTokenizerBase): The tokenizer the special token information of which
+ we use to augment the mock data.
 """
 
 seed: int = 0
@@ -714,7 +714,7 @@ class MockGPTLowLevelDataset:
 max_sequence_length: int = 4096
 """The hard-coded max sequence length to generate"""
 
- def __init__(self, tokenizer: MegatronTokenizer) -> None:
+ def __init__(self, tokenizer: MegatronTokenizerBase) -> None:
 self.tokenizer = tokenizer
 rng = numpy.random.default_rng(seed=self.seed)
 self.sequence_lengths = rng.integers(
@@ -3,6 +3,7 @@
 /* Helper methods for fast index mapping builds */
 
 #include <algorithm>
+ #include <cassert>
 #include <iostream>
 #include <limits>
 #include <math.h>
@@ -46,7 +47,7 @@ void build_exhaustive_blending_indices(py::array_t<int16_t> &dataset_index, py::
 while (dataset_unspent_indices.size() > 0) {
 double index_sample_double = std::max(static_cast<double>(index_sample), 1.0);
 
- int64_t error_argmax;
+ int64_t error_argmax = -1;
 double error_max = std::numeric_limits<double>::lowest();
 
 for (int32_t index_dataset : dataset_unspent_indices) {
@@ -56,6 +57,7 @@ void build_exhaustive_blending_indices(py::array_t<int16_t> &dataset_index, py::
 error_max = error;
 }
 }
+ assert(error_argmax >= 0);
 
 // Populate the indices.
 dataset_index_ptr[index_sample] = static_cast<int16_t>(error_argmax);
@@ -7,7 +7,7 @@ from typing import Any
 import numpy
 
 
- class MegatronTokenizer(ABC):
+ class MegatronLegacyTokenizer(ABC):
 """Abstract class for tokenizer
 
 Absent a config or class-specific tracking of which objects are uniquely identifying, we must
@@ -4,12 +4,12 @@
 
 from dataclasses import dataclass
 
- from megatron.core.datasets.megatron_tokenizer import MegatronTokenizer
+ from megatron.core.tokenizers import MegatronTokenizerBase
 
 
 @dataclass
 class RetroTokenizers:
 """Container class for GPT and Bert tokenizers."""
 
- gpt: MegatronTokenizer = None
- bert: MegatronTokenizer = None
+ gpt: MegatronTokenizerBase = None
+ bert: MegatronTokenizerBase = None
@@ -103,11 +103,19 @@ def diff(x1: Any, x2: Any, prefix: Tuple = ()) -> Tuple[list, list, list]:
 else:
 only_left = []
 only_right = []
+ mismatch_debug_data = [prefix, type(x1), type(x2)]
 if isinstance(x1, torch.Tensor) and isinstance(x2, torch.Tensor):
- if x1.device != x2.device:
- _is_mismatch = not torch.all(x1.cpu() == x2.cpu())
- else:
- _is_mismatch = not torch.all(x1 == x2)
+ try:
+ if x1.device != x2.device:
+ _is_mismatch = not torch.all(x1.cpu() == x2.cpu())
+ else:
+ _is_mismatch = not torch.all(x1 == x2)
+ mismatch_debug_data.extend(
+ [(x1 != x2).sum(), (x1 != x2).shape, (x1 != x2).nonzero().tolist()]
+ )
+ except (RuntimeError, TypeError, ValueError):
+ _is_mismatch = True
+ mismatch_debug_data.extend([x1.shape, x2.shape])
 # TODO: change with concrete type that has both replica_id and data attrs
 elif hasattr(x1, "replica_id") and hasattr(x2, "replica_id"):
 assert type(x1) == type(x2)
@@ -122,7 +130,7 @@ def diff(x1: Any, x2: Any, prefix: Tuple = ()) -> Tuple[list, list, list]:
 _is_mismatch = True
 
 if _is_mismatch:
- mismatch.append((prefix, type(x1), type(x2)))
+ mismatch.append(tuple(mismatch_debug_data))
 
 return only_left, only_right, mismatch
 
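Mismatch entries returned by `diff` now carry extra debugging payload beyond `(prefix, type, type)`: the count and indices of differing elements, or the two shapes when the element-wise comparison itself raises. A small sketch of inspecting that output (the example dicts are illustrative):

```python
import torch

from megatron.core.dist_checkpointing.dict_utils import diff

left = {"weight": torch.tensor([1.0, 2.0, 3.0])}
right = {"weight": torch.tensor([1.0, 9.0, 3.0])}

only_left, only_right, mismatch = diff(left, right)
for entry in mismatch:
    # Each entry starts with (prefix, type(x1), type(x2)); the appended fields
    # describe how many elements differ and where they are located.
    print(entry)
```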
@@ -29,6 +29,9 @@ ShardedStateDict = Dict[str, Any]
  ReplicaId = Union[int, Tuple[int, ...]]


+ _logged_deprecations = {}
+
+
  class ShardedBase(ABC):
      """Base class for ShardedTensor and ShardedStateDict."""

@@ -135,17 +138,40 @@ class ShardedTensor(ShardedBase):
                  f"equal to global shape dimensions for {self}"
              )

-         for off, sh in zip(self.global_offset[self.prepend_axis_num :], self.local_shape):
-             if sh != 0 and off % sh != 0:
-                 raise CheckpointingException(
-                     f"Global offset ({off}) must be divisible by local shape ({sh}) for {self}."
-                 )
+         if self.axis_fragmentations is not None:
+             for off, sh in zip(self.global_offset[self.prepend_axis_num :], self.local_shape):
+                 if sh != 0 and off % sh != 0:
+                     raise CheckpointingException(
+                         f"Global offset ({off}) must be divisible by local shape ({sh}) for {self}."
+                     )

          if has_flattened_range and self.flattened_range.step is not None:
              raise CheckpointingException(
                  f"`step` argument in the flattened range of a ShardedTensor is not supported."
              )

+         if self.prepend_axis_num:
+             if not _logged_deprecations.get("prepend_axis_num", False):
+                 logger.warning(
+                     "ShardedTensor.prepend_axis_num greater than 0 is deprecated."
+                     " In Megatron-Core this can be prevented by setting sharded_state_dict"
+                     " metadata['singleton_local_shards'] to True."
+                 )
+                 _logged_deprecations["prepend_axis_num"] = True
+
+         if self.flattened_range is not None:
+             if not _logged_deprecations.get("flattened_range", False):
+                 logger.warning(
+                     "ShardedTensor.flattened_range is deprecated."
+                     " Use latest DistributedOptimizer formats."
+                 )
+                 _logged_deprecations["flattened_range"] = True
+
+     @property
+     def has_regular_grid(self):
+         """Alias for having a regular sharding grid."""
+         return self.axis_fragmentations is not None
+
      def global_slice(self) -> Tuple[Union[int, slice], ...]:
          """
          Returns a tuple of int and slice objects representing a slice of the
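The module-level `_logged_deprecations` dict is a simple warn-once guard, so constructing many `ShardedTensor`s does not repeat the same deprecation message. The pattern in isolation looks like this (a sketch; the names below are illustrative, not library API):

```python
import logging

logger = logging.getLogger(__name__)
_logged_deprecations = {}


def warn_once(key: str, message: str) -> None:
    # Log each deprecation only the first time its key is seen, mirroring the
    # guards around prepend_axis_num and flattened_range above.
    if not _logged_deprecations.get(key, False):
        logger.warning(message)
        _logged_deprecations[key] = True
```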
@@ -25,6 +25,12 @@ from .mapping import (
  )
  from .utils import extract_sharded_tensors_and_factories

+ KEEP_VARS_HINT = (
+     " Make sure state dict contains original torch.nn.Parameters (not pure torch.Tensors)"
+     " by passing `keep_vars=True` to `.state_dict()`. If any transformation of the original"
+     " parameter is needed, use a ShardedTensorFactory."
+ )
+

  def get_optim_param_to_id_map(optim_params_iter: Iterable[torch.nn.Parameter]) -> Dict[int, int]:
      """Generate mapping from optimizer param to optimizer state id."""
@@ -79,9 +79,24 @@ class AsyncRequest(NamedTuple):

          This logic is equivalent to what should happen in case of the async call.
          """
+         # preload tensors.
+         async_fn_args = list(self.async_fn_args)
+         if self.preload_fn:
+             assert len(async_fn_args) == 3, "Expected 3 args to be passed to async function"
+             # The async_fn is passed as a partial functool with pre-determined args
+             # In the async_fn_args we pass the remaining positional args required by the async_fn
+             # async_fn_args[1] refers to the write_buckets
+             # To ensure we stage the write_buckets to CPU memory for sync CP,
+             # we replace it with preload_fn callable that returns the CPU staged tensors
+             async_fn_args[1] = self.preload_fn()
+         # persist the state
          if self.async_fn is not None:
-             self.async_fn(*self.async_fn_args)
+             self.async_fn(*async_fn_args, **self.async_fn_kwargs)
+
+         # This utility implements a sync cp save. Hence the barrier.
          torch.distributed.barrier()
+
+         # Finalize the CP state
          for finalize_fn in self.finalize_fns:
              finalize_fn()

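The `preload_fn` handling in `execute_sync` swaps the second positional argument (the write buckets) for CPU-staged copies before running the writer synchronously. A standalone sketch of that substitution; `write_fn`, `gpu_buckets`, and `preload_fn` here are illustrative stand-ins, not library objects:

```python
import torch


def write_fn(rank, write_buckets, results_queue=None):
    # Stand-in for the checkpoint writer invoked with three positional args,
    # the second of which holds the buckets of tensors to persist.
    return sum(t.numel() for t in write_buckets)


gpu_buckets = [torch.randn(4), torch.randn(8)]  # would normally live on the GPU
preload_fn = lambda: [t.detach().cpu() for t in gpu_buckets]  # no-arg staging callable

args = [0, gpu_buckets, None]
args[1] = preload_fn()  # replace the buckets with CPU-staged tensors, as execute_sync does
written_elements = write_fn(*args)
```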
@@ -150,7 +165,7 @@ class AsyncCaller(ABC):
          return ten[0] == 0

      @abstractmethod
-     def close(self):
+     def close(self, abort=False):
          """Terminate the async caller at exit of an application or some termination conditions"""
          logger.info(f"AsyncCaller: {torch.distributed.get_rank()}, Destroying Async Caller")

@@ -237,15 +252,23 @@ class TemporalAsyncCaller(AsyncCaller):
              is_done = True
          return is_done

-     def close(self):
+     def close(self, abort=False):
          """For TemporalAsyncCaller, this method is called explictly in `is_current_async_calls_done`

          This method make sure the TemporalAsyncCaller terminated
          with all its assigned async request completed
+
+         Args:
+             abort (bool, optional): Default to False. Needs to be manually set to true when
+                 the checkpoint async process needs to be aborted.
          """
          if self.process:
              logger.debug(f"rank: {torch.distributed.get_rank()}, joining self.process")
-             self.process.join()
+             if abort:
+                 logger.warning(f"Temporal worker aborted in rank {torch.distributed.get_rank()}")
+                 self.process.kill()
+             else:
+                 self.process.join()
          self.process = None
          logger.debug(
              "TemporalAsyncCaller: Async process join finished "
@@ -388,18 +411,25 @@ class PersistentAsyncCaller(AsyncCaller):

          return is_done

-     def close(self):
+     def close(self, abort=False):
          """Wait on the left async requests and terminate the PersistentAsyncCaller

          Signals the PersistentAsyncCaller by sending a 'DONE' message to make it terminated
+         Args:
+             abort (bool, optional): Default to False. Needs to be manually set to true when
+                 the checkpoint async process needs to be aborted.
          """
          logger.info(
              f"PersistentAsyncCaller: {torch.distributed.get_rank()}, Destroying Async Caller"
          )
          if self.process:
-             self.queue.put('DONE')
-             self.queue.join()
-             self.process.join()
+             if abort:
+                 logger.warning(f"Persistent worker aborted in rank {torch.distributed.get_rank()}")
+                 self.process.kill()
+             else:
+                 self.queue.put('DONE')
+                 self.queue.join()
+                 self.process.join()
          self.process = None

      def __del__(self):
@@ -528,6 +558,9 @@ class AsyncCallsQueue:
              blocking (bool, optional): if True, will wait until all active requests
                  are done. Otherwise, finalizes only the async request that already
                  finished. Defaults to False.
+
+             no_dist (bool, Optional): if True, training ranks simply check its
+                 asynchronous checkpoint writer without synchronization.
          Returns:
              List[int]: list of indices (as returned by `schedule_async_request`)
                  of async calls that have been successfully finalized.
@@ -545,8 +578,8 @@ class AsyncCallsQueue:
                  finalize_fn()
              ten = torch.tensor([call_idx], dtype=torch.int, device=torch.cuda.current_device())
              torch.distributed.all_reduce(ten, op=torch.distributed.ReduceOp.MAX)
-             assert ten.item() == call_idx, 'Unmatched async calls. '
-             'That probably means not all ranks are participating in async finalization'
+             assert ten.item() == call_idx, "Unmatched async calls. "
+             "That probably means not all ranks are participating in async finalization"
              call_idx_finalized.append(call_idx)
          return call_idx_finalized

@@ -554,8 +587,13 @@ class AsyncCallsQueue:
          """Get the number of active async calls."""
          return len(self.async_calls)

-     def close(self):
-         """Finalize all calls upon closing."""
-         self.maybe_finalize_async_calls(blocking=True)
+     def close(self, abort=False):
+         """Finalize all calls upon closing.
+         Args:
+             abort (bool, optional): Default to False. Needs to be manually set to true when
+                 the checkpoint async process needs to be aborted.
+         """
+         if not abort:
+             self.maybe_finalize_async_calls(blocking=True)
          if self.persistent and self.persistent_caller:
-             self.persistent_caller.close()
+             self.persistent_caller.close(abort=abort)
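A sketch of how a training loop might use the new `abort` flag when tearing down the queue after a fatal error; the surrounding control flow (and the no-argument `AsyncCallsQueue()` construction) is illustrative:

```python
from megatron.core.dist_checkpointing.strategies.async_utils import AsyncCallsQueue

queue = AsyncCallsQueue()
try:
    # ... schedule async checkpoint requests and continue training ...
    queue.maybe_finalize_async_calls(blocking=False)
except Exception:
    # Fatal error: skip finalization and kill the checkpoint worker outright.
    queue.close(abort=True)
    raise
else:
    # Normal shutdown: wait for outstanding saves, then join the worker.
    queue.close()
```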
@@ -221,8 +221,4 @@ class AsyncSaveShardedStrategy(SaveShardedStrategy):
      def save(self, sharded_state_dict: ShardedStateDict, checkpoint_dir: Union[str, Path]):
          """Each async strategy can be trivially used as a sync strategy."""
          async_request = self.async_save(sharded_state_dict, checkpoint_dir)
-         # multiprocessing routines may cause issue when called on parent process
-         # We keep this verbose call for now
-         global async_calls
-         async_calls.schedule_async_request(async_request)
-         async_calls.maybe_finalize_async_calls(blocking=True)
+         async_request.execute_sync()
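Because `execute_sync()` runs the writer on the calling rank and then issues a `torch.distributed.barrier()`, even the synchronous `save()` path assumes an initialized process group. A minimal single-rank sketch (the gloo group initialization is illustrative):

```python
import torch.distributed as dist


def save_blocking(strategy, sharded_state_dict, checkpoint_dir):
    # A process group must exist because execute_sync() calls barrier().
    if not dist.is_initialized():
        dist.init_process_group(
            backend="gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1
        )
    # Equivalent to: strategy.async_save(...).execute_sync()
    strategy.save(sharded_state_dict, checkpoint_dir)
```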