megatron-core 0.15.0rc0__tar.gz → 0.15.0rc5__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of megatron-core might be problematic.

Files changed (355)
  1. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/LICENSE +1 -1
  2. {megatron_core-0.15.0rc0/megatron_core.egg-info → megatron_core-0.15.0rc5}/PKG-INFO +23 -6
  3. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/README.md +17 -2
  4. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/__init__.py +17 -0
  5. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/blended_megatron_dataset_builder.py +2 -8
  6. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/blended_megatron_dataset_config.py +3 -3
  7. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/gpt_dataset.py +4 -4
  8. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/helpers.cpp +3 -1
  9. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/indexed_dataset.py +10 -7
  10. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/megatron_tokenizer.py +1 -1
  11. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/retro/config/tokenizers.py +3 -3
  12. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/dist_checkpointing/mapping.py +20 -0
  13. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/dist_checkpointing/strategies/common.py +6 -6
  14. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/dist_checkpointing/strategies/torch.py +10 -5
  15. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/distributed/distributed_data_parallel.py +49 -90
  16. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/distributed/distributed_data_parallel_config.py +9 -0
  17. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/distributed/finalize_model_grads.py +36 -20
  18. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/distributed/fsdp/mcore_fsdp_adapter.py +30 -35
  19. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/distributed/fsdp/src/megatron_fsdp/__init__.py +33 -4
  20. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/distributed/fsdp/src/megatron_fsdp/distributed_data_parallel_config.py +8 -3
  21. megatron_core-0.15.0rc5/megatron/core/distributed/fsdp/src/megatron_fsdp/fully_shard.py +521 -0
  22. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/distributed/fsdp/src/megatron_fsdp/megatron_fsdp.py +123 -27
  23. megatron_core-0.15.0rc5/megatron/core/distributed/fsdp/src/megatron_fsdp/package_info.py +27 -0
  24. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/distributed/fsdp/src/megatron_fsdp/param_and_grad_buffer.py +188 -107
  25. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/distributed/fsdp/src/megatron_fsdp/utils.py +45 -27
  26. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/distributed/param_and_grad_buffer.py +27 -6
  27. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/enums.py +6 -0
  28. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/export/trtllm/trtllm_weights_converter/single_device_trtllm_model_weights_converter.py +47 -24
  29. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/extensions/transformer_engine.py +214 -209
  30. megatron_core-0.15.0rc5/megatron/core/fp4_utils.py +136 -0
  31. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/full_cuda_graph.py +6 -3
  32. megatron_core-0.15.0rc5/megatron/core/fusions/fused_bias_geglu.py +442 -0
  33. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/fusions/fused_softmax.py +149 -10
  34. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/contexts/dynamic_context.py +194 -87
  35. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/engines/dynamic_engine.py +213 -81
  36. megatron_core-0.15.0rc5/megatron/core/inference/inference_request.py +193 -0
  37. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/model_inference_wrappers/abstract_model_inference_wrapper.py +11 -10
  38. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/text_generation_controllers/text_generation_controller.py +30 -12
  39. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/model_parallel_config.py +4 -1
  40. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/T5/t5_model.py +8 -8
  41. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/common/language_module/language_module.py +13 -12
  42. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/common/model_chunk_schedule_plan.py +115 -109
  43. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/gpt/fine_grained_callables.py +117 -7
  44. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/gpt/gpt_layer_specs.py +11 -9
  45. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/gpt/gpt_model.py +55 -17
  46. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/gpt/heterogeneous/heterogeneous_layer_specs.py +11 -3
  47. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/mamba/mamba_model.py +8 -8
  48. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/multimodal/llava_model.py +12 -12
  49. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/retro/base_attention.py +4 -4
  50. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/retro/decoder_attention.py +5 -5
  51. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/retro/decoder_spec.py +8 -2
  52. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/vision/clip_vit_model.py +5 -5
  53. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/vision/radio.py +4 -4
  54. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/nccl_allocator.py +39 -8
  55. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/optimizer/__init__.py +16 -122
  56. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/optimizer/clip_grads.py +4 -4
  57. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/optimizer/distrib_optimizer.py +31 -11
  58. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/optimizer/optimizer.py +62 -12
  59. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/optimizer/optimizer_config.py +0 -6
  60. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/package_info.py +3 -5
  61. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/parallel_state.py +15 -10
  62. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/pipeline_parallel/combined_1f1b.py +179 -66
  63. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/pipeline_parallel/schedules.py +334 -232
  64. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/pipeline_parallel/utils.py +0 -16
  65. megatron_core-0.15.0rc5/megatron/core/post_training/modelopt/mamba/__init__.py +1 -0
  66. megatron_core-0.15.0rc5/megatron/core/process_groups_config.py +489 -0
  67. megatron_core-0.15.0rc5/megatron/core/safe_globals.py +35 -0
  68. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/ssm/mamba_block.py +8 -8
  69. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/ssm/mamba_layer.py +4 -4
  70. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/ssm/mamba_mixer.py +9 -9
  71. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/ssm/mlp_layer.py +3 -3
  72. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/tensor_parallel/layers.py +7 -3
  73. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/timers.py +14 -1
  74. megatron_core-0.15.0rc5/megatron/core/tokenizers/__init__.py +4 -0
  75. megatron_core-0.15.0rc5/megatron/core/tokenizers/base_tokenizer.py +48 -0
  76. megatron_core-0.15.0rc5/megatron/core/tokenizers/megatron_tokenizer.py +171 -0
  77. megatron_core-0.15.0rc5/megatron/core/tokenizers/text/__init__.py +3 -0
  78. megatron_core-0.15.0rc5/megatron/core/tokenizers/text/libraries/__init__.py +8 -0
  79. megatron_core-0.15.0rc5/megatron/core/tokenizers/text/libraries/abstract_tokenizer.py +147 -0
  80. megatron_core-0.15.0rc5/megatron/core/tokenizers/text/libraries/bytelevel_tokenizer.py +164 -0
  81. megatron_core-0.15.0rc5/megatron/core/tokenizers/text/libraries/chat_template.py +71 -0
  82. megatron_core-0.15.0rc5/megatron/core/tokenizers/text/libraries/huggingface_tokenizer.py +335 -0
  83. megatron_core-0.15.0rc5/megatron/core/tokenizers/text/libraries/megatron_hf_tokenizer.py +179 -0
  84. megatron_core-0.15.0rc5/megatron/core/tokenizers/text/libraries/null_tokenizer.py +79 -0
  85. megatron_core-0.15.0rc5/megatron/core/tokenizers/text/libraries/sentencepiece_tokenizer.py +411 -0
  86. megatron_core-0.15.0rc5/megatron/core/tokenizers/text/libraries/tiktoken_tokenizer.py +303 -0
  87. megatron_core-0.15.0rc5/megatron/core/tokenizers/text/models/__init__.py +8 -0
  88. megatron_core-0.15.0rc5/megatron/core/tokenizers/text/models/bert_tokenizer.py +12 -0
  89. megatron_core-0.15.0rc5/megatron/core/tokenizers/text/models/default_tokenizer.py +12 -0
  90. megatron_core-0.15.0rc5/megatron/core/tokenizers/text/models/gpt_tokenizer.py +12 -0
  91. megatron_core-0.15.0rc5/megatron/core/tokenizers/text/models/mamba_tokenizer.py +12 -0
  92. megatron_core-0.15.0rc5/megatron/core/tokenizers/text/models/retro_tokenizer.py +12 -0
  93. megatron_core-0.15.0rc5/megatron/core/tokenizers/text/models/t5_tokenizer.py +12 -0
  94. megatron_core-0.15.0rc5/megatron/core/tokenizers/text/text_tokenizer.py +254 -0
  95. megatron_core-0.15.0rc5/megatron/core/tokenizers/text/utils/build_tokenizer.py +58 -0
  96. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/attention.py +21 -23
  97. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/cuda_graphs.py +485 -53
  98. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/dot_product_attention.py +56 -17
  99. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/enums.py +1 -0
  100. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/mlp.py +24 -4
  101. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/module.py +32 -3
  102. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/moe/experts.py +36 -21
  103. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/moe/moe_layer.py +19 -19
  104. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/moe/moe_utils.py +20 -16
  105. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/moe/router.py +89 -12
  106. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/moe/shared_experts.py +3 -3
  107. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/moe/token_dispatcher.py +20 -19
  108. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/multi_latent_attention.py +14 -14
  109. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/multi_token_prediction.py +241 -211
  110. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/pipeline_parallel_layer_layout.py +56 -17
  111. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/transformer_block.py +126 -63
  112. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/transformer_config.py +84 -15
  113. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/transformer_layer.py +66 -45
  114. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/utils.py +151 -1
  115. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/utils.py +31 -5
  116. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5/megatron_core.egg-info}/PKG-INFO +23 -6
  117. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron_core.egg-info/SOURCES.txt +25 -0
  118. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron_core.egg-info/requires.txt +5 -3
  119. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/pyproject.toml +9 -5
  120. megatron_core-0.15.0rc0/megatron/core/distributed/fsdp/src/megatron_fsdp/fully_shard.py +0 -387
  121. megatron_core-0.15.0rc0/megatron/core/fusions/fused_bias_geglu.py +0 -85
  122. megatron_core-0.15.0rc0/megatron/core/inference/inference_request.py +0 -91
  123. megatron_core-0.15.0rc0/megatron/core/process_groups_config.py +0 -233
  124. megatron_core-0.15.0rc0/megatron/core/transformer/moe/__init__.py +0 -0
  125. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/MANIFEST.in +0 -0
  126. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/README.md +0 -0
  127. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/activations.py +0 -0
  128. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/config.py +0 -0
  129. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/config_logger.py +0 -0
  130. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/__init__.py +0 -0
  131. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/bert_dataset.py +0 -0
  132. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/blended_dataset.py +0 -0
  133. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/helpers.py +0 -0
  134. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/masked_dataset.py +0 -0
  135. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/megatron_dataset.py +0 -0
  136. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/multimodal_dataset.py +0 -0
  137. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/object_storage_utils.py +0 -0
  138. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/retro/__init__.py +0 -0
  139. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/retro/config/__init__.py +0 -0
  140. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/retro/config/bert_embedders.py +0 -0
  141. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/retro/config/config.py +0 -0
  142. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/retro/config/gpt_chunk_datasets.py +0 -0
  143. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/retro/db/__init__.py +0 -0
  144. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/retro/db/build.py +0 -0
  145. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/retro/db/dataset.py +0 -0
  146. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/retro/db/utils.py +0 -0
  147. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/retro/external_libs.py +0 -0
  148. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/retro/index/__init__.py +0 -0
  149. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/retro/index/build.py +0 -0
  150. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/retro/index/factory.py +0 -0
  151. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/retro/index/index.py +0 -0
  152. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/retro/index/indexes/__init__.py +0 -0
  153. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/retro/index/indexes/faiss_base.py +0 -0
  154. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/retro/index/indexes/faiss_par_add.py +0 -0
  155. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/retro/index/utils.py +0 -0
  156. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/retro/index/validate.py +0 -0
  157. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/retro/query/__init__.py +0 -0
  158. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/retro/query/gpt_chunk_dataset.py +0 -0
  159. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/retro/query/multi_split_gpt_dataset.py +0 -0
  160. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/retro/query/query.py +0 -0
  161. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/retro/query/retro_dataset.py +0 -0
  162. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/retro/query/utils.py +0 -0
  163. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/retro/utils.py +0 -0
  164. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/t5_dataset.py +0 -0
  165. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/utils.py +0 -0
  166. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/utils_object_storage.py +0 -0
  167. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/datasets/utils_s3.py +0 -0
  168. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/dist_checkpointing/__init__.py +0 -0
  169. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/dist_checkpointing/core.py +0 -0
  170. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/dist_checkpointing/dict_utils.py +0 -0
  171. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/dist_checkpointing/exchange_utils.py +0 -0
  172. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/dist_checkpointing/optimizer.py +0 -0
  173. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/dist_checkpointing/serialization.py +0 -0
  174. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/dist_checkpointing/state_dict_utils.py +0 -0
  175. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/dist_checkpointing/strategies/__init__.py +0 -0
  176. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/dist_checkpointing/strategies/async_utils.py +0 -0
  177. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/dist_checkpointing/strategies/base.py +0 -0
  178. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/dist_checkpointing/strategies/cached_metadata_filesystem_reader.py +0 -0
  179. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/dist_checkpointing/strategies/checkpointable.py +0 -0
  180. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/dist_checkpointing/strategies/filesystem_async.py +0 -0
  181. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/dist_checkpointing/strategies/fully_parallel.py +0 -0
  182. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/dist_checkpointing/strategies/resharding.py +0 -0
  183. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/dist_checkpointing/strategies/state_dict_saver.py +0 -0
  184. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/dist_checkpointing/strategies/tensorstore.py +0 -0
  185. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/dist_checkpointing/strategies/two_stage.py +0 -0
  186. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/dist_checkpointing/strategies/zarr.py +0 -0
  187. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/dist_checkpointing/tensor_aware_state_dict.py +0 -0
  188. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/dist_checkpointing/utils.py +0 -0
  189. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/dist_checkpointing/validation.py +0 -0
  190. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/distributed/__init__.py +0 -0
  191. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/distributed/data_parallel_base.py +0 -0
  192. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/distributed/fsdp/__init__.py +0 -0
  193. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/distributed/fsdp/src/__init__.py +0 -0
  194. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/distributed/fsdp/src/megatron_fsdp/uneven_dtensor.py +0 -0
  195. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/distributed/torch_fully_sharded_data_parallel.py +0 -0
  196. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/distributed/torch_fully_sharded_data_parallel_config.py +0 -0
  197. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/energy_monitor.py +0 -0
  198. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/export/__init__.py +0 -0
  199. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/export/data_type.py +0 -0
  200. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/export/export_config.py +0 -0
  201. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/export/model_type.py +0 -0
  202. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/export/trtllm/__init__.py +0 -0
  203. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/export/trtllm/engine_builder/__init__.py +0 -0
  204. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/export/trtllm/engine_builder/trtllm_engine_builder.py +0 -0
  205. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/export/trtllm/model_to_trllm_mapping/__init__.py +0 -0
  206. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/export/trtllm/model_to_trllm_mapping/default_conversion_dict.py +0 -0
  207. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/export/trtllm/trt_model_config.py +0 -0
  208. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/export/trtllm/trt_model_type.py +0 -0
  209. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/export/trtllm/trtllm_helper.py +0 -0
  210. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/export/trtllm/trtllm_layers.py +0 -0
  211. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/export/trtllm/trtllm_weights_converter/__init__.py +0 -0
  212. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/export/trtllm/trtllm_weights_converter/distributed_trtllm_model_weights_converter.py +0 -0
  213. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/export/trtllm/trtllm_weights_converter/utils.py +0 -0
  214. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/extensions/__init__.py +0 -0
  215. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/extensions/kitchen.py +0 -0
  216. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/extensions/transformer_engine_spec_provider.py +0 -0
  217. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/fp8_utils.py +0 -0
  218. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/fusions/__init__.py +0 -0
  219. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/fusions/fused_bias_dropout.py +0 -0
  220. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/fusions/fused_bias_gelu.py +0 -0
  221. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/fusions/fused_bias_swiglu.py +0 -0
  222. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/fusions/fused_cross_entropy.py +0 -0
  223. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/fusions/fused_indices_converter.py +0 -0
  224. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/fusions/fused_layer_norm.py +0 -0
  225. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/fusions/fused_mla_yarn_rope_apply.py +0 -0
  226. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/fusions/fused_pad_routing_map.py +0 -0
  227. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/fusions/fused_weighted_squared_relu.py +0 -0
  228. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/hyper_comm_grid.py +0 -0
  229. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/__init__.py +0 -0
  230. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/async_stream.py +0 -0
  231. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/common_inference_params.py +0 -0
  232. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/communication_utils.py +0 -0
  233. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/contexts/__init__.py +0 -0
  234. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/contexts/base_context.py +0 -0
  235. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/contexts/dynamic_chunk_allocator.py +0 -0
  236. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/contexts/static_context.py +0 -0
  237. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/data_parallel_inference_coordinator.py +0 -0
  238. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/engines/__init__.py +0 -0
  239. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/engines/abstract_engine.py +0 -0
  240. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/engines/mcore_engine.py +0 -0
  241. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/engines/static_engine.py +0 -0
  242. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/headers.py +0 -0
  243. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/inference_client.py +0 -0
  244. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/model_inference_wrappers/__init__.py +0 -0
  245. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/model_inference_wrappers/gpt/__init__.py +0 -0
  246. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/model_inference_wrappers/gpt/gpt_inference_wrapper.py +0 -0
  247. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/model_inference_wrappers/inference_wrapper_config.py +0 -0
  248. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/model_inference_wrappers/multimodal/vlm_inference_wrapper.py +0 -0
  249. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/model_inference_wrappers/t5/__init__.py +0 -0
  250. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/model_inference_wrappers/t5/t5_inference_wrapper.py +0 -0
  251. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/sampling_params.py +0 -0
  252. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/scheduler.py +0 -0
  253. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/text_generation_controllers/__init__.py +0 -0
  254. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/text_generation_controllers/encoder_decoder_text_generation_controller.py +0 -0
  255. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/text_generation_controllers/simple_text_generation_controller.py +0 -0
  256. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/text_generation_controllers/vlm_text_generation_controller.py +0 -0
  257. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference/utils.py +0 -0
  258. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/inference_params.py +0 -0
  259. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/jit.py +0 -0
  260. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/T5/__init__.py +0 -0
  261. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/T5/t5_spec.py +0 -0
  262. {megatron_core-0.15.0rc0/megatron/core/post_training → megatron_core-0.15.0rc5/megatron/core/models}/__init__.py +0 -0
  263. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/backends.py +0 -0
  264. {megatron_core-0.15.0rc0/megatron/core/models → megatron_core-0.15.0rc5/megatron/core/models/bert}/__init__.py +0 -0
  265. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/bert/bert_layer_specs.py +0 -0
  266. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/bert/bert_lm_head.py +0 -0
  267. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/bert/bert_model.py +0 -0
  268. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/bert/pooler.py +0 -0
  269. {megatron_core-0.15.0rc0/megatron/core/models/bert → megatron_core-0.15.0rc5/megatron/core/models/common}/__init__.py +0 -0
  270. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/common/embeddings/__init__.py +0 -0
  271. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/common/embeddings/language_model_embedding.py +0 -0
  272. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/common/embeddings/relative_pos_embedding.py +0 -0
  273. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/common/embeddings/rope_utils.py +0 -0
  274. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/common/embeddings/rotary_pos_embedding.py +0 -0
  275. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/common/embeddings/yarn_rotary_pos_embedding.py +0 -0
  276. {megatron_core-0.15.0rc0/megatron/core/models/common → megatron_core-0.15.0rc5/megatron/core/models/common/language_module}/__init__.py +0 -0
  277. {megatron_core-0.15.0rc0/megatron/core/models/common/language_module → megatron_core-0.15.0rc5/megatron/core/models/common/vision_module}/__init__.py +0 -0
  278. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/common/vision_module/vision_module.py +0 -0
  279. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/gpt/__init__.py +0 -0
  280. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/gpt/moe_module_specs.py +0 -0
  281. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/huggingface/__init__.py +0 -0
  282. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/huggingface/clip_model.py +0 -0
  283. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/huggingface/module.py +0 -0
  284. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/huggingface/qwen_model.py +0 -0
  285. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/mamba/__init__.py +0 -0
  286. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/mamba/mamba_layer_specs.py +0 -0
  287. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/mimo/__init__.py +0 -0
  288. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/mimo/config/__init__.py +0 -0
  289. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/mimo/config/base_configs.py +0 -0
  290. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/mimo/model/__init__.py +0 -0
  291. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/mimo/model/base.py +0 -0
  292. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/mimo/submodules/audio.py +0 -0
  293. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/mimo/submodules/base.py +0 -0
  294. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/mimo/submodules/vision.py +0 -0
  295. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/multimodal/__init__.py +0 -0
  296. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/multimodal/context_parallel.py +0 -0
  297. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/multimodal/llava_spec.py +0 -0
  298. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/retro/__init__.py +0 -0
  299. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/retro/config.py +0 -0
  300. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/retro/encoder_attention.py +0 -0
  301. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/retro/encoder_spec.py +0 -0
  302. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/retro/model.py +0 -0
  303. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/retro/utils.py +0 -0
  304. {megatron_core-0.15.0rc0/megatron/core/models/common/vision_module → megatron_core-0.15.0rc5/megatron/core/models/vision}/__init__.py +0 -0
  305. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/vision/multimodal_projector.py +0 -0
  306. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/models/vision/vit_layer_specs.py +0 -0
  307. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/msc_utils.py +0 -0
  308. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/num_microbatches_calculator.py +0 -0
  309. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/optimizer/cpu_offloading/__init__.py +0 -0
  310. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/optimizer/cpu_offloading/hybrid_optimizer.py +0 -0
  311. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/optimizer/grad_scaler.py +0 -0
  312. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/optimizer_param_scheduler.py +0 -0
  313. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/packed_seq_params.py +0 -0
  314. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/pipeline_parallel/__init__.py +0 -0
  315. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/pipeline_parallel/p2p_communication.py +0 -0
  316. {megatron_core-0.15.0rc0/megatron/core/post_training/modelopt/mamba → megatron_core-0.15.0rc5/megatron/core/post_training}/__init__.py +0 -0
  317. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/post_training/modelopt/__init__.py +0 -0
  318. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/post_training/modelopt/gpt/__init__.py +0 -0
  319. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/post_training/modelopt/gpt/model_specs.py +0 -0
  320. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/post_training/modelopt/gpt/state_dict_hooks.py +0 -0
  321. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/post_training/modelopt/layers.py +0 -0
  322. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/post_training/modelopt/mamba/model_specs.py +0 -0
  323. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/quantization/__init__.py +0 -0
  324. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/quantization/quant_config.py +0 -0
  325. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/quantization/utils.py +0 -0
  326. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/requirements.txt +0 -0
  327. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/rerun_state_machine.py +0 -0
  328. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/ssm/__init__.py +0 -0
  329. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/ssm/mamba_context_parallel.py +0 -0
  330. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/ssm/mamba_hybrid_layer_allocation.py +0 -0
  331. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/ssm/triton_cache_manager.py +0 -0
  332. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/tensor_parallel/__init__.py +0 -0
  333. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/tensor_parallel/cross_entropy.py +0 -0
  334. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/tensor_parallel/data.py +0 -0
  335. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/tensor_parallel/mappings.py +0 -0
  336. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/tensor_parallel/random.py +0 -0
  337. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/tensor_parallel/utils.py +0 -0
  338. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/__init__.py +0 -0
  339. {megatron_core-0.15.0rc0/megatron/core/models/vision → megatron_core-0.15.0rc5/megatron/core/transformer/custom_layers}/__init__.py +0 -0
  340. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/custom_layers/transformer_engine.py +0 -0
  341. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/fsdp_dtensor_checkpoint.py +0 -0
  342. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/heterogeneous/heterogeneous_config.py +0 -0
  343. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/heterogeneous/linear_replacements.py +0 -0
  344. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/identity_op.py +0 -0
  345. {megatron_core-0.15.0rc0/megatron/core/transformer/custom_layers → megatron_core-0.15.0rc5/megatron/core/transformer/moe}/__init__.py +0 -0
  346. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/moe/fused_a2a.py +0 -0
  347. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/moe/grouped_gemm_util.py +0 -0
  348. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/moe/upcycling_utils.py +0 -0
  349. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/spec_utils.py +0 -0
  350. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/torch_layer_norm.py +0 -0
  351. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron/core/transformer/torch_norm.py +0 -0
  352. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron_core.egg-info/dependency_links.txt +0 -0
  353. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/megatron_core.egg-info/top_level.txt +0 -0
  354. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/setup.cfg +0 -0
  355. {megatron_core-0.15.0rc0 → megatron_core-0.15.0rc5}/setup.py +0 -0
--- megatron_core-0.15.0rc0/LICENSE
+++ megatron_core-0.15.0rc5/LICENSE
@@ -37,7 +37,7 @@ Below are licenses used in those files, as indicated.
 
 
  --------------------------------------------------------------------------------------
- -- LICENSE FOR Facebook, huggingface, Google Research, LLaVA, Mamba, and vLLM code --
+ -- LICENSE FOR Facebook, huggingface, Google Research, LLaVA, Mamba, TinyZero and vLLM code --
 
 
  Apache License
--- megatron_core-0.15.0rc0/megatron_core.egg-info/PKG-INFO
+++ megatron_core-0.15.0rc5/PKG-INFO
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: megatron-core
- Version: 0.15.0rc0
+ Version: 0.15.0rc5
  Summary: Megatron Core - a library for efficient and scalable training of transformer based models
  Author-email: NVIDIA <nemo-toolkit@nvidia.com>
  Maintainer-email: NVIDIA <nemo-toolkit@nvidia.com>
@@ -37,23 +37,24 @@ Requires-Dist: flask-restful; extra == "mlm"
  Requires-Dist: sentencepiece; extra == "mlm"
  Requires-Dist: tiktoken; extra == "mlm"
  Requires-Dist: wandb; extra == "mlm"
+ Requires-Dist: transformers; extra == "mlm"
  Provides-Extra: dev
  Requires-Dist: tqdm; extra == "dev"
  Requires-Dist: einops~=0.8; extra == "dev"
  Requires-Dist: tensorstore!=0.1.46,!=0.1.72,~=0.1; extra == "dev"
  Requires-Dist: nvtx~=0.2; extra == "dev"
- Requires-Dist: transformers~=4.53; extra == "dev"
- Requires-Dist: multi-storage-client<0.26,~=0.25; extra == "dev"
+ Requires-Dist: multi-storage-client~=0.27; extra == "dev"
  Requires-Dist: opentelemetry-api~=1.33.1; extra == "dev"
  Requires-Dist: setuptools<80.0.0; extra == "dev"
  Requires-Dist: mamba-ssm~=2.2; extra == "dev"
  Requires-Dist: causal-conv1d~=1.5; extra == "dev"
  Requires-Dist: nv-grouped-gemm~=1.1; extra == "dev"
- Requires-Dist: transformer-engine[pytorch]<2.7.0,>=2.6.0a0; extra == "dev"
+ Requires-Dist: transformer-engine[pytorch]<2.8.0,>=2.6.0a0; extra == "dev"
  Requires-Dist: nvidia-resiliency-ext<0.5.0,>=0.4.0a0; extra == "dev"
  Requires-Dist: nvidia-modelopt[torch]<0.34.0,>=0.33.0a0; sys_platform != "darwin" and extra == "dev"
  Requires-Dist: megatron-energon[av_decode]~=6.0; extra == "dev"
  Requires-Dist: flashinfer-python; extra == "dev"
+ Requires-Dist: wget; extra == "dev"
  Requires-Dist: onnxscript; extra == "dev"
  Provides-Extra: lts
  Requires-Dist: tqdm; extra == "lts"
@@ -63,6 +64,7 @@ Requires-Dist: nvtx; extra == "lts"
  Requires-Dist: transformers; extra == "lts"
  Requires-Dist: zarr; extra == "lts"
  Requires-Dist: setuptools<80.0.0; extra == "lts"
+ Requires-Dist: wget; extra == "lts"
  Dynamic: license-file
 
  <div align="center">
@@ -93,7 +95,10 @@ cd Megatron-LM
 
  # Latest News
 
- - 📣 NEW! **[DeepSeek & MoE Training with FP8](https://github.com/yanring/Megatron-MoE-ModelZoo)** examples are now available, including optimized configurations for `DeepSeek-V3`, `Qwen2` and `Mixtral` models with FP8 precision support.
+ - 🔄 NEW! **[Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge)** - Bidirectional converter for interoperability between Hugging Face and Megatron checkpoints, featuring production-ready recipes for popular models.
+ - 🗺️ **[MoE Q3-Q4 2025 Roadmap](https://github.com/NVIDIA/Megatron-LM/issues/1729)** - Comprehensive roadmap for MoE features including DeepSeek-V3, Qwen3, advanced parallelism strategies, FP8 optimizations, and Blackwell performance enhancements.
+ - 🚀 **[GPT-OSS Implementation](https://github.com/NVIDIA/Megatron-LM/issues/1739)** - Advanced features including YaRN RoPE scaling, attention sinks, and custom activation functions are being integrated into Megatron Core.
+ - **[2025/06]** **[Megatron MoE Model Zoo](https://github.com/yanring/Megatron-MoE-ModelZoo)** - Best practices and optimized configurations for training DeepSeek-V3, Mixtral, and Qwen3 MoE models with performance benchmarking and checkpoint conversion tools.
  - **[2025/05]** Megatron Core v0.11.0 brings new capabilities for multi-data center LLM training ([blog](https://developer.nvidia.com/blog/turbocharge-llm-training-across-long-haul-data-center-networks-with-nvidia-nemo-framework/)).
 
  <details>
@@ -143,6 +148,7 @@ cd Megatron-LM
  **Resources**
  - [Examples](./examples/) - Training scripts and tutorials
  - [Documentation](https://docs.nvidia.com/Megatron-Core/) - Official docs
+ - [Roadmaps](#roadmaps) - Development roadmaps and feature tracking
  - [Community & Support](#-community--support) - Get help and contribute
  - [Getting Help](#getting-help)
  - [Contributing](#contributing)
@@ -217,10 +223,12 @@ Megatron-LM/
 
  **Libraries using Megatron Core:**
 
+ - **[Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge)** - Training library with bidirectional Hugging Face ↔ Megatron checkpoint conversion, flexible training loops, and production-ready recipes
+ - **[NeMo RL](https://github.com/NVIDIA-NeMo/RL)** - Scalable toolkit for efficient reinforcement learning with RLHF, DPO, and other post-training methods
  - **[NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html)** - Enterprise framework with cloud-native support and end-to-end examples
  - **[TensorRT Model Optimizer (ModelOpt)](https://github.com/NVIDIA/TensorRT-Model-Optimizer)** - Model optimization toolkit for quantization, pruning, and distillation
 
- **Compatible with:** [HuggingFace Accelerate](https://github.com/huggingface/accelerate), [Colossal-AI](https://github.com/hpcaitech/ColossalAI), [DeepSpeed](https://github.com/microsoft/DeepSpeed)
+ **Compatible with:** [Hugging Face Accelerate](https://github.com/huggingface/accelerate), [Colossal-AI](https://github.com/hpcaitech/ColossalAI), [DeepSpeed](https://github.com/microsoft/DeepSpeed)
 
  # Installation
 
@@ -510,6 +518,15 @@ Based on [NVIDIA NeMo production configurations](https://github.com/NVIDIA/NeMo/
  --use-distributed-optimizer
  ```
 
+ # Roadmaps
+
+ Stay up-to-date with our development roadmaps and planned features:
+
+ - **[MoE Q3-Q4 2025 Roadmap](https://github.com/NVIDIA/Megatron-LM/issues/1729)** - Comprehensive MoE feature development including DeepSeek-V3, Qwen3, advanced parallelism, FP8 optimizations, and Blackwell enhancements
+ - **[GPT-OSS Implementation Tracker](https://github.com/NVIDIA/Megatron-LM/issues/1739)** - Advanced features including YaRN RoPE scaling, attention sinks, and custom activation functions
+
+ *More roadmap trackers will be added soon.*
+
  # Community & Support
 
  ## Getting Help
--- megatron_core-0.15.0rc0/README.md
+++ megatron_core-0.15.0rc5/README.md
@@ -26,7 +26,10 @@ cd Megatron-LM
 
  # Latest News
 
- - 📣 NEW! **[DeepSeek & MoE Training with FP8](https://github.com/yanring/Megatron-MoE-ModelZoo)** examples are now available, including optimized configurations for `DeepSeek-V3`, `Qwen2` and `Mixtral` models with FP8 precision support.
+ - 🔄 NEW! **[Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge)** - Bidirectional converter for interoperability between Hugging Face and Megatron checkpoints, featuring production-ready recipes for popular models.
+ - 🗺️ **[MoE Q3-Q4 2025 Roadmap](https://github.com/NVIDIA/Megatron-LM/issues/1729)** - Comprehensive roadmap for MoE features including DeepSeek-V3, Qwen3, advanced parallelism strategies, FP8 optimizations, and Blackwell performance enhancements.
+ - 🚀 **[GPT-OSS Implementation](https://github.com/NVIDIA/Megatron-LM/issues/1739)** - Advanced features including YaRN RoPE scaling, attention sinks, and custom activation functions are being integrated into Megatron Core.
+ - **[2025/06]** **[Megatron MoE Model Zoo](https://github.com/yanring/Megatron-MoE-ModelZoo)** - Best practices and optimized configurations for training DeepSeek-V3, Mixtral, and Qwen3 MoE models with performance benchmarking and checkpoint conversion tools.
  - **[2025/05]** Megatron Core v0.11.0 brings new capabilities for multi-data center LLM training ([blog](https://developer.nvidia.com/blog/turbocharge-llm-training-across-long-haul-data-center-networks-with-nvidia-nemo-framework/)).
 
  <details>
@@ -76,6 +79,7 @@ cd Megatron-LM
  **Resources**
  - [Examples](./examples/) - Training scripts and tutorials
  - [Documentation](https://docs.nvidia.com/Megatron-Core/) - Official docs
+ - [Roadmaps](#roadmaps) - Development roadmaps and feature tracking
  - [Community & Support](#-community--support) - Get help and contribute
  - [Getting Help](#getting-help)
  - [Contributing](#contributing)
@@ -150,10 +154,12 @@ Megatron-LM/
 
  **Libraries using Megatron Core:**
 
+ - **[Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge)** - Training library with bidirectional Hugging Face ↔ Megatron checkpoint conversion, flexible training loops, and production-ready recipes
+ - **[NeMo RL](https://github.com/NVIDIA-NeMo/RL)** - Scalable toolkit for efficient reinforcement learning with RLHF, DPO, and other post-training methods
  - **[NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html)** - Enterprise framework with cloud-native support and end-to-end examples
  - **[TensorRT Model Optimizer (ModelOpt)](https://github.com/NVIDIA/TensorRT-Model-Optimizer)** - Model optimization toolkit for quantization, pruning, and distillation
 
- **Compatible with:** [HuggingFace Accelerate](https://github.com/huggingface/accelerate), [Colossal-AI](https://github.com/hpcaitech/ColossalAI), [DeepSpeed](https://github.com/microsoft/DeepSpeed)
+ **Compatible with:** [Hugging Face Accelerate](https://github.com/huggingface/accelerate), [Colossal-AI](https://github.com/hpcaitech/ColossalAI), [DeepSpeed](https://github.com/microsoft/DeepSpeed)
 
  # Installation
 
@@ -443,6 +449,15 @@ Based on [NVIDIA NeMo production configurations](https://github.com/NVIDIA/NeMo/
  --use-distributed-optimizer
  ```
 
+ # Roadmaps
+
+ Stay up-to-date with our development roadmaps and planned features:
+
+ - **[MoE Q3-Q4 2025 Roadmap](https://github.com/NVIDIA/Megatron-LM/issues/1729)** - Comprehensive MoE feature development including DeepSeek-V3, Qwen3, advanced parallelism, FP8 optimizations, and Blackwell enhancements
+ - **[GPT-OSS Implementation Tracker](https://github.com/NVIDIA/Megatron-LM/issues/1739)** - Advanced features including YaRN RoPE scaling, attention sinks, and custom activation functions
+
+ *More roadmap trackers will be added soon.*
+
  # Community & Support
 
  ## Getting Help
--- megatron_core-0.15.0rc0/megatron/core/__init__.py
+++ megatron_core-0.15.0rc5/megatron/core/__init__.py
@@ -20,6 +20,7 @@ from megatron.core.package_info import (
      __version__,
  )
  from megatron.core.timers import Timers
+ from megatron.core.utils import is_torch_min_version
 
  # Alias parallel_state as mpu, its legacy name
  mpu = parallel_state
@@ -32,4 +33,20 @@ __all__ = [
      "InferenceParams",
      "ModelParallelConfig",
      "Timers",
+     "__contact_emails__",
+     "__contact_names__",
+     "__description__",
+     "__download_url__",
+     "__homepage__",
+     "__keywords__",
+     "__license__",
+     "__package_name__",
+     "__repository_url__",
+     "__shortversion__",
+     "__version__",
  ]
+
+ from .safe_globals import register_safe_globals
+
+ if is_torch_min_version("2.6a0"):
+     register_safe_globals()
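The new import-time hook registers checkpoint-safe globals only on PyTorch 2.6 or newer, the release in which torch.load began defaulting to weights_only=True. The contents of the new megatron/core/safe_globals.py are not shown in this diff, so the following is only a hedged sketch of the general pattern, using PyTorch's torch.serialization.add_safe_globals; MyCheckpointedConfig is a hypothetical stand-in for whatever classes the real module allowlists.

    # Hypothetical sketch, not the actual megatron/core/safe_globals.py.
    from dataclasses import dataclass

    import torch


    @dataclass
    class MyCheckpointedConfig:  # hypothetical class pickled into checkpoints
        hidden_size: int = 0


    def register_safe_globals() -> None:
        # Allowlist non-tensor classes so torch.load(..., weights_only=True),
        # the default since PyTorch 2.6, can still unpickle them.
        torch.serialization.add_safe_globals([MyCheckpointedConfig])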
--- megatron_core-0.15.0rc0/megatron/core/datasets/blended_megatron_dataset_builder.py
+++ megatron_core-0.15.0rc5/megatron/core/datasets/blended_megatron_dataset_builder.py
@@ -35,7 +35,8 @@ class BlendedMegatronDatasetBuilder(object):
 
          is_built_on_rank (Callable): A callable which returns True if the dataset should be built on
              the current rank and False otherwise. It should be Megatron Core parallelism aware i.e.
-             global rank, local group rank, and virtual rank may inform its return value.
+             global rank, local group rank, and virtual rank may inform its return value. Should
+             return true for exactly one process on global rank 0.
 
          config (BlendedMegatronDatasetConfig): The config object which informs dataset creation
      """
@@ -72,13 +73,6 @@ class BlendedMegatronDatasetBuilder(object):
                  for {split.name} split
                  This can occur with multiple validation sets if datasets have weights"""
 
-         if torch.distributed.is_initialized():
-             gb_rank = torch.distributed.get_rank()
-             if gb_rank == 0:
-                 assert (
-                     self.is_built_on_rank()
-                 ), "is_built_on_rank must return True when global rank = 0"
-
      def build(self) -> List[Optional[TopLevelDataset]]:
          """Build all dataset splits according to the provided blend(s)
 
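The constructor no longer asserts at runtime that is_built_on_rank() returns True on global rank 0; the requirement is now only documented in the docstring above. A minimal sketch of a callable that satisfies the documented contract (the rank-0-only policy is illustrative; real setups typically also build on the first rank of each model-parallel group):

    import torch


    def is_built_on_rank() -> bool:
        # Illustrative policy: always build when torch.distributed is not
        # initialized, otherwise build only on global rank 0.
        if not torch.distributed.is_initialized():
            return True
        return torch.distributed.get_rank() == 0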
--- megatron_core-0.15.0rc0/megatron/core/datasets/blended_megatron_dataset_config.py
+++ megatron_core-0.15.0rc5/megatron/core/datasets/blended_megatron_dataset_config.py
@@ -6,8 +6,8 @@ import re
  from dataclasses import dataclass, field
  from typing import List, Optional, Tuple
 
- from megatron.core.datasets.megatron_tokenizer import MegatronTokenizer
  from megatron.core.datasets.utils import Split, log_single_rank, normalize
+ from megatron.core.tokenizers import MegatronTokenizerBase
 
  logger = logging.getLogger(__name__)
 
@@ -66,8 +66,8 @@ class BlendedMegatronDatasetConfig:
      constructor.
      """
 
-     tokenizer: Optional[MegatronTokenizer] = None
-     """The MegatronTokenizer instance. Required for datasets that do online tokenization."""
+     tokenizer: Optional[MegatronTokenizerBase] = None
+     """The MegatronTokenizerBase instance. Required for datasets that do online tokenization."""
 
      mid_level_dataset_surplus: float = 0.005
      """The sample surplus to build for the mid-level datasets(s). Defaults arbitrarily to 0.005.
--- megatron_core-0.15.0rc0/megatron/core/datasets/gpt_dataset.py
+++ megatron_core-0.15.0rc5/megatron/core/datasets/gpt_dataset.py
@@ -12,9 +12,9 @@ import torch
  from megatron.core.datasets.blended_megatron_dataset_config import BlendedMegatronDatasetConfig
  from megatron.core.datasets.indexed_dataset import IndexedDataset
  from megatron.core.datasets.megatron_dataset import MegatronDataset
- from megatron.core.datasets.megatron_tokenizer import MegatronTokenizer
  from megatron.core.datasets.object_storage_utils import ObjectStorageConfig, is_object_storage_path
  from megatron.core.datasets.utils import Split
+ from megatron.core.tokenizers import MegatronTokenizerBase
  from megatron.core.utils import log_single_rank
 
  logger = logging.getLogger(__name__)
@@ -701,8 +701,8 @@ class MockGPTLowLevelDataset:
      we add the end of document token to each element indexed in __getitem__
 
      Args:
-         tokenizer (MegatronTokenizer): The tokenizer the special token information of which we use
-         to augment the mock data.
+         tokenizer (MegatronTokenizerBase): The tokenizer the special token information of which
+         we use to augment the mock data.
      """
 
      seed: int = 0
@@ -714,7 +714,7 @@
      max_sequence_length: int = 4096
      """The hard-coded max sequence length to generate"""
 
-     def __init__(self, tokenizer: MegatronTokenizer) -> None:
+     def __init__(self, tokenizer: MegatronTokenizerBase) -> None:
          self.tokenizer = tokenizer
          rng = numpy.random.default_rng(seed=self.seed)
          self.sequence_lengths = rng.integers(
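Together with the config change above, the GPT dataset module now imports its tokenizer type from the new megatron.core.tokenizers package (files 74-95 in the list above) instead of megatron.core.datasets.megatron_tokenizer, whose old base class is renamed to MegatronLegacyTokenizer later in this diff. A minimal migration sketch for downstream code that annotates a tokenizer field; MyDatasetConfig is hypothetical:

    from dataclasses import dataclass
    from typing import Optional

    # New import location in 0.15.0rc5; the old import was
    # from megatron.core.datasets.megatron_tokenizer import MegatronTokenizer.
    from megatron.core.tokenizers import MegatronTokenizerBase


    @dataclass
    class MyDatasetConfig:  # hypothetical downstream config
        tokenizer: Optional[MegatronTokenizerBase] = None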
--- megatron_core-0.15.0rc0/megatron/core/datasets/helpers.cpp
+++ megatron_core-0.15.0rc5/megatron/core/datasets/helpers.cpp
@@ -3,6 +3,7 @@
  /* Helper methods for fast index mapping builds */
 
  #include <algorithm>
+ #include <cassert>
  #include <iostream>
  #include <limits>
  #include <math.h>
@@ -46,7 +47,7 @@ void build_exhaustive_blending_indices(py::array_t<int16_t> &dataset_index, py::
      while (dataset_unspent_indices.size() > 0) {
          double index_sample_double = std::max(static_cast<double>(index_sample), 1.0);
 
-         int64_t error_argmax;
+         int64_t error_argmax = -1;
          double error_max = std::numeric_limits<double>::lowest();
 
          for (int32_t index_dataset : dataset_unspent_indices) {
@@ -56,6 +57,7 @@
                  error_max = error;
              }
          }
+         assert(error_argmax >= 0);
 
          // Populate the indices.
          dataset_index_ptr[index_sample] = static_cast<int16_t>(error_argmax);
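error_argmax could in principle be read without ever being assigned if the inner loop never took the max branch; initializing it to -1 and asserting after the loop turns that into a hard failure instead of an out-of-range index. The same guard pattern, sketched in Python purely for illustration (not code from the package):

    def pick_argmax(errors):
        # Guarded argmax: sentinel start value plus a post-loop assertion,
        # mirroring the error_argmax fix in helpers.cpp.
        argmax, best = -1, float("-inf")
        for index, error in enumerate(errors):
            if error > best:
                argmax, best = index, error
        assert argmax >= 0, "selection loop never chose a candidate"
        return argmax


    print(pick_argmax([0.25, 0.5, 0.125]))  # prints 1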
--- megatron_core-0.15.0rc0/megatron/core/datasets/indexed_dataset.py
+++ megatron_core-0.15.0rc5/megatron/core/datasets/indexed_dataset.py
@@ -12,6 +12,7 @@ import shutil
  import struct
  import time
  from abc import ABC, abstractmethod
+ from collections.abc import Iterable
  from enum import Enum
  from functools import lru_cache
  from itertools import accumulate
@@ -172,9 +173,9 @@
 
      def write(
          self,
-         sequence_lengths: List[int],
-         sequence_modes: Optional[List[int]],
-         document_indices: List[int],
+         sequence_lengths: Iterable[Union[int, numpy.integer]],
+         sequence_modes: Optional[Iterable[Union[int, numpy.integer]]],
+         document_indices: Iterable[Union[int, numpy.integer]],
      ) -> None:
          """Write the index (.idx) file
 
@@ -208,7 +209,9 @@
          if sequence_modes is not None:
              self.idx_writer.write(numpy.array(sequence_modes, dtype=numpy.int8).tobytes(order="C"))
 
-     def _sequence_pointers(self, sequence_lengths: List[int]) -> List[int]:
+     def _sequence_pointers(
+         self, sequence_lengths: Iterable[Union[int, numpy.integer]]
+     ) -> List[int]:
          """Build the sequence pointers per the sequence lengths and dtype size
 
          Args:
@@ -217,11 +220,11 @@
 
          Returns:
              List[int]: The pointer to the beginning of each sequence
          """
-         itemsize = DType.size(self.dtype)
-         curr_ptr = 0
+         itemsize = numpy.int64(DType.size(self.dtype))
+         curr_ptr = numpy.int64(0)
          list_ptr = []
          for length in sequence_lengths:
-             list_ptr.append(curr_ptr)
+             list_ptr.append(curr_ptr.item())
              curr_ptr += length * itemsize
          return list_ptr
 
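The writer now accepts any iterable of Python or NumPy integers, and the running byte offset is accumulated as numpy.int64, with .item() converting each stored pointer back to a plain Python int. A small standalone sketch of the updated pointer arithmetic; the free-standing sequence_pointers helper is made up for illustration, in the package this logic lives in _IndexWriter._sequence_pointers:

    import numpy


    def sequence_pointers(sequence_lengths, itemsize):
        # Mirrors the updated _sequence_pointers: int64 accumulation, with
        # .item() keeping the stored offsets as plain Python ints.
        itemsize = numpy.int64(itemsize)
        curr_ptr = numpy.int64(0)
        list_ptr = []
        for length in sequence_lengths:
            list_ptr.append(curr_ptr.item())
            curr_ptr += length * itemsize
        return list_ptr


    print(sequence_pointers(numpy.array([3, 5, 2], dtype=numpy.int32), itemsize=4))
    # prints [0, 12, 32]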
--- megatron_core-0.15.0rc0/megatron/core/datasets/megatron_tokenizer.py
+++ megatron_core-0.15.0rc5/megatron/core/datasets/megatron_tokenizer.py
@@ -7,7 +7,7 @@ from typing import Any
 
  import numpy
 
- class MegatronTokenizer(ABC):
+ class MegatronLegacyTokenizer(ABC):
      """Abstract class for tokenizer
 
      Absent a config or class-specific tracking of which objects are uniquely identifying, we must
--- megatron_core-0.15.0rc0/megatron/core/datasets/retro/config/tokenizers.py
+++ megatron_core-0.15.0rc5/megatron/core/datasets/retro/config/tokenizers.py
@@ -4,12 +4,12 @@
 
  from dataclasses import dataclass
 
- from megatron.core.datasets.megatron_tokenizer import MegatronTokenizer
+ from megatron.core.tokenizers import MegatronTokenizerBase
 
 
  @dataclass
  class RetroTokenizers:
      """Container class for GPT and Bert tokenizers."""
 
-     gpt: MegatronTokenizer = None
-     bert: MegatronTokenizer = None
+     gpt: MegatronTokenizerBase = None
+     bert: MegatronTokenizerBase = None
--- megatron_core-0.15.0rc0/megatron/core/dist_checkpointing/mapping.py
+++ megatron_core-0.15.0rc5/megatron/core/dist_checkpointing/mapping.py
@@ -29,6 +29,9 @@ ShardedStateDict = Dict[str, Any]
  ReplicaId = Union[int, Tuple[int, ...]]
 
 
+ _logged_deprecations = {}
+
+
  class ShardedBase(ABC):
      """Base class for ShardedTensor and ShardedStateDict."""
 
@@ -147,6 +150,23 @@ class ShardedTensor(ShardedBase):
                  f"`step` argument in the flattened range of a ShardedTensor is not supported."
              )
 
+         if self.prepend_axis_num:
+             if not _logged_deprecations.get("prepend_axis_num", False):
+                 logger.warning(
+                     "ShardedTensor.prepend_axis_num greater than 0 is deprecated."
+                     " In Megatron-Core this can be prevented by setting sharded_state_dict"
+                     " metadata['singleton_local_shards'] to True."
+                 )
+                 _logged_deprecations["prepend_axis_num"] = True
+
+         if self.flattened_range is not None:
+             if not _logged_deprecations.get("flattened_range", False):
+                 logger.warning(
+                     "ShardedTensor.flattened_range is deprecated."
+                     " Use latest DistributedOptimizer formats."
+                 )
+                 _logged_deprecations["flattened_range"] = True
+
      @property
      def has_regular_grid(self):
          """Alias for having a regular sharding grid."""
--- megatron_core-0.15.0rc0/megatron/core/dist_checkpointing/strategies/common.py
+++ megatron_core-0.15.0rc5/megatron/core/dist_checkpointing/strategies/common.py
@@ -84,9 +84,9 @@ class TorchCommonLoadStrategy(LoadCommonStrategy):
          try:
              if MultiStorageClientFeature.is_enabled():
                  msc = MultiStorageClientFeature.import_package()
-                 return msc.torch.load(load_path, map_location='cpu', weights_only=False)
+                 return msc.torch.load(load_path, map_location='cpu')
              else:
-                 return torch.load(load_path, map_location='cpu', weights_only=False)
+                 return torch.load(load_path, map_location='cpu')
          except FileNotFoundError as e:
              err_msg = f'Common file {load_path} does not exist'
              if MultiStorageClientFeature.is_enabled():
@@ -118,9 +118,9 @@ class TorchCommonLoadStrategy(LoadCommonStrategy):
          try:
              if MultiStorageClientFeature.is_enabled():
                  msc = MultiStorageClientFeature.import_package()
-                 loaded_obj = msc.torch.load(load_path, weights_only=False)
+                 loaded_obj = msc.torch.load(load_path)
              else:
-                 loaded_obj = torch.load(load_path, weights_only=False)
+                 loaded_obj = torch.load(load_path)
          except FileNotFoundError as e:
              # Backward compatible logic: previously the save format was incorrect
              base, _ = os.path.splitext(sh_obj.unique_key)
@@ -128,9 +128,9 @@ class TorchCommonLoadStrategy(LoadCommonStrategy):
              try:
                  if MultiStorageClientFeature.is_enabled():
                      msc = MultiStorageClientFeature.import_package()
-                     loaded_obj = msc.torch.load(old_load_path, weights_only=False)
+                     loaded_obj = msc.torch.load(old_load_path)
                  else:
-                     loaded_obj = torch.load(old_load_path, weights_only=False)
+                     loaded_obj = torch.load(old_load_path)
              except FileNotFoundError:
                  err_msg = f'Object shard {load_path} not found'
                  obj_subdir = os.path.join(checkpoint_dir, sh_obj.key)
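Dropping the explicit weights_only=False means these loads now follow PyTorch's default, which is weights_only=True as of PyTorch 2.6; that is consistent with the safe-globals registration added to megatron/core/__init__.py earlier in this diff. A hedged sketch of what a caller loading such a common-state file by hand might need on newer PyTorch; the file name and the allowlisted Namespace class are illustrative, not taken from this diff:

    import torch
    from argparse import Namespace  # e.g. training args pickled alongside tensors

    # Allowlist the class for this one load instead of registering it globally.
    with torch.serialization.safe_globals([Namespace]):
        common_state = torch.load("common.pt", map_location="cpu")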
--- megatron_core-0.15.0rc0/megatron/core/dist_checkpointing/strategies/torch.py
+++ megatron_core-0.15.0rc5/megatron/core/dist_checkpointing/strategies/torch.py
@@ -340,11 +340,12 @@ def mcore_to_pyt_state_dict(
          if sh_ten.allow_shape_mismatch and is_loading:
              sh_ten.data.zero_()
 
-         if not sh_tens[0].has_regular_grid:
-             if not is_torch_min_version("2.6a0"):
-                 raise CheckpointingException(
-                     f"Uneven sharding not supported for PyTorch version {get_torch_version()}"
-                 )
+         is_pre_mcore_014_sh_ten = (
+             sh_tens[0].prepend_axis_num or sh_tens[0].flattened_range is not None
+         )
+         if (
+             not is_pre_mcore_014_sh_ten or not sh_tens[0].has_regular_grid
+         ) and is_torch_min_version("2.6a0"):
              assert sh_tens[0].flattened_range is None
              if len(sh_tens) > 1:
                  return LocalShardsContainer(
@@ -353,6 +354,10 @@
                  )
              else:
                  return CheckpointableShardedTensor.from_sh_ten(sh_tens[0])
+         if not sh_tens[0].has_regular_grid and not is_torch_min_version("2.6a0"):
+             raise CheckpointingException(
+                 f"Uneven sharding not supported for PyTorch version {get_torch_version()}"
+             )
          torch_sh_ten = sharded_tensor_to_torch_sharded_tensor(
              sh_tens, rank, load_legacy_1d_flatten_tensors
          )