megatron-core 0.15.0rc7__tar.gz → 0.16.0rc0.dev104455__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release: this version of megatron-core has been flagged as possibly problematic.
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/MANIFEST.in +1 -0
- {megatron_core-0.15.0rc7/megatron_core.egg-info → megatron_core-0.16.0rc0.dev104455}/PKG-INFO +67 -49
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/README.md +64 -45
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/config_logger.py +13 -1
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/bert_dataset.py +8 -8
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/blended_megatron_dataset_config.py +11 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/gpt_dataset.py +1 -9
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/megatron_dataset.py +47 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/t5_dataset.py +11 -4
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/dist_checkpointing/mapping.py +0 -9
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/dist_checkpointing/strategies/torch.py +11 -1
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/dist_checkpointing/strategies/zarr.py +11 -4
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/distributed/distributed_data_parallel.py +1 -1
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/distributed/distributed_data_parallel_config.py +5 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/distributed/fsdp/mcore_fsdp_adapter.py +126 -7
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/distributed/fsdp/src/megatron_fsdp/fully_shard.py +24 -5
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/distributed/fsdp/src/megatron_fsdp/megatron_fsdp.py +12 -4
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/distributed/fsdp/src/megatron_fsdp/package_info.py +2 -2
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/distributed/fsdp/src/megatron_fsdp/param_and_grad_buffer.py +96 -46
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/distributed/fsdp/src/megatron_fsdp/uneven_dtensor.py +3 -1
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/distributed/fsdp/src/megatron_fsdp/utils.py +98 -32
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/distributed/param_and_grad_buffer.py +24 -3
- megatron_core-0.16.0rc0.dev104455/megatron/core/distributed/reduce_scatter_with_fp32_accumulation.py +92 -0 (new module; an illustrative sketch of the pattern follows this file list)
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/extensions/transformer_engine.py +16 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/communication_utils.py +75 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/contexts/__init__.py +2 -2
- megatron_core-0.16.0rc0.dev104455/megatron/core/inference/contexts/dynamic_block_allocator.py +92 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/contexts/dynamic_context.py +277 -185
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/contexts/fused_kv_append_kernel.py +14 -14
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/data_parallel_inference_coordinator.py +31 -162
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/engines/dynamic_engine.py +171 -168
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/engines/static_engine.py +154 -24
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/inference_client.py +4 -3
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/inference_request.py +119 -4
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/model_inference_wrappers/inference_wrapper_config.py +6 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/sampling_params.py +3 -1
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/text_generation_controllers/text_generation_controller.py +281 -70
- megatron_core-0.16.0rc0.dev104455/megatron/core/inference/text_generation_server/__init__.py +3 -0
- megatron_core-0.16.0rc0.dev104455/megatron/core/inference/text_generation_server/endpoints/common.py +14 -0
- megatron_core-0.16.0rc0.dev104455/megatron/core/inference/text_generation_server/endpoints/completions.py +212 -0
- megatron_core-0.16.0rc0.dev104455/megatron/core/inference/text_generation_server/run_mcore_engine.py +111 -0
- megatron_core-0.16.0rc0.dev104455/megatron/core/inference/text_generation_server/text_generation_server.py +211 -0
- megatron_core-0.16.0rc0.dev104455/megatron/core/inference/text_generation_server/tokenization.py +110 -0
- megatron_core-0.16.0rc0.dev104455/megatron/core/inference/utils.py +135 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/model_parallel_config.py +2 -1
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/common/embeddings/rope_utils.py +82 -18
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/common/embeddings/yarn_rotary_pos_embedding.py +12 -12
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/gpt/fine_grained_callables.py +2 -1
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/gpt/gpt_model.py +4 -1
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/nccl_allocator.py +70 -34
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/optimizer/__init__.py +29 -5
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/optimizer/distrib_optimizer.py +2 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/package_info.py +2 -2
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/parallel_state.py +62 -11
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/pipeline_parallel/schedules.py +6 -6
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/process_groups_config.py +84 -19
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/ssm/mamba_block.py +4 -1
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/ssm/mamba_layer.py +1 -1
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/ssm/mamba_mixer.py +185 -140
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/attention.py +2 -2
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/cuda_graphs.py +38 -18
- megatron_core-0.16.0rc0.dev104455/megatron/core/transformer/fsdp_dtensor_checkpoint.py +455 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/module.py +18 -24
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/moe/moe_layer.py +21 -15
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/moe/moe_utils.py +29 -5
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/transformer_block.py +50 -2
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/transformer_config.py +64 -10
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/transformer_layer.py +9 -22
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/utils.py +10 -9
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/utils.py +7 -6
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455/megatron_core.egg-info}/PKG-INFO +67 -49
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron_core.egg-info/SOURCES.txt +8 -3
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron_core.egg-info/requires.txt +2 -1
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/pyproject.toml +10 -6
- megatron_core-0.15.0rc7/LICENSE +0 -273
- megatron_core-0.15.0rc7/megatron/core/datasets/utils_object_storage.py +0 -277
- megatron_core-0.15.0rc7/megatron/core/inference/contexts/dynamic_chunk_allocator.py +0 -92
- megatron_core-0.15.0rc7/megatron/core/inference/utils.py +0 -41
- megatron_core-0.15.0rc7/megatron/core/transformer/fsdp_dtensor_checkpoint.py +0 -195
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/README.md +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/activations.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/config.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/blended_dataset.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/blended_megatron_dataset_builder.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/helpers.cpp +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/helpers.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/indexed_dataset.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/masked_dataset.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/megatron_tokenizer.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/multimodal_dataset.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/object_storage_utils.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/retro/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/retro/config/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/retro/config/bert_embedders.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/retro/config/config.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/retro/config/gpt_chunk_datasets.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/retro/config/tokenizers.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/retro/db/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/retro/db/build.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/retro/db/dataset.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/retro/db/utils.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/retro/external_libs.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/retro/index/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/retro/index/build.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/retro/index/factory.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/retro/index/index.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/retro/index/indexes/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/retro/index/indexes/faiss_base.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/retro/index/indexes/faiss_par_add.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/retro/index/utils.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/retro/index/validate.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/retro/query/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/retro/query/gpt_chunk_dataset.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/retro/query/multi_split_gpt_dataset.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/retro/query/query.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/retro/query/retro_dataset.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/retro/query/utils.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/retro/utils.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/utils.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/datasets/utils_s3.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/dist_checkpointing/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/dist_checkpointing/core.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/dist_checkpointing/dict_utils.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/dist_checkpointing/exchange_utils.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/dist_checkpointing/optimizer.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/dist_checkpointing/serialization.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/dist_checkpointing/state_dict_utils.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/dist_checkpointing/strategies/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/dist_checkpointing/strategies/async_utils.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/dist_checkpointing/strategies/base.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/dist_checkpointing/strategies/cached_metadata_filesystem_reader.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/dist_checkpointing/strategies/checkpointable.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/dist_checkpointing/strategies/common.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/dist_checkpointing/strategies/filesystem_async.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/dist_checkpointing/strategies/fully_parallel.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/dist_checkpointing/strategies/resharding.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/dist_checkpointing/strategies/state_dict_saver.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/dist_checkpointing/strategies/tensorstore.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/dist_checkpointing/strategies/two_stage.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/dist_checkpointing/tensor_aware_state_dict.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/dist_checkpointing/utils.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/dist_checkpointing/validation.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/distributed/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/distributed/data_parallel_base.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/distributed/finalize_model_grads.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/distributed/fsdp/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/distributed/fsdp/src/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/distributed/fsdp/src/megatron_fsdp/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/distributed/fsdp/src/megatron_fsdp/distributed_data_parallel_config.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/distributed/torch_fully_sharded_data_parallel.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/distributed/torch_fully_sharded_data_parallel_config.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/energy_monitor.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/enums.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/export/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/export/data_type.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/export/export_config.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/export/model_type.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/export/trtllm/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/export/trtllm/engine_builder/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/export/trtllm/engine_builder/trtllm_engine_builder.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/export/trtllm/model_to_trllm_mapping/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/export/trtllm/model_to_trllm_mapping/default_conversion_dict.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/export/trtllm/trt_model_config.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/export/trtllm/trt_model_type.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/export/trtllm/trtllm_helper.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/export/trtllm/trtllm_layers.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/export/trtllm/trtllm_weights_converter/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/export/trtllm/trtllm_weights_converter/distributed_trtllm_model_weights_converter.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/export/trtllm/trtllm_weights_converter/single_device_trtllm_model_weights_converter.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/export/trtllm/trtllm_weights_converter/utils.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/extensions/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/extensions/kitchen.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/extensions/transformer_engine_spec_provider.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/fp4_utils.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/fp8_utils.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/full_cuda_graph.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/fusions/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/fusions/fused_bias_dropout.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/fusions/fused_bias_geglu.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/fusions/fused_bias_gelu.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/fusions/fused_bias_swiglu.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/fusions/fused_cross_entropy.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/fusions/fused_indices_converter.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/fusions/fused_layer_norm.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/fusions/fused_mla_yarn_rope_apply.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/fusions/fused_pad_routing_map.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/fusions/fused_softmax.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/fusions/fused_weighted_squared_relu.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/hyper_comm_grid.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/async_stream.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/common_inference_params.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/contexts/base_context.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/contexts/static_context.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/engines/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/engines/abstract_engine.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/engines/mcore_engine.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/headers.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/model_inference_wrappers/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/model_inference_wrappers/abstract_model_inference_wrapper.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/model_inference_wrappers/gpt/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/model_inference_wrappers/gpt/gpt_inference_wrapper.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/model_inference_wrappers/multimodal/vlm_inference_wrapper.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/model_inference_wrappers/t5/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/model_inference_wrappers/t5/t5_inference_wrapper.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/scheduler.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/text_generation_controllers/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/text_generation_controllers/encoder_decoder_text_generation_controller.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/text_generation_controllers/simple_text_generation_controller.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/text_generation_controllers/vlm_text_generation_controller.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference/unified_memory.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/inference_params.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/jit.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/T5/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/T5/t5_model.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/T5/t5_spec.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/backends.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/bert/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/bert/bert_layer_specs.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/bert/bert_lm_head.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/bert/bert_model.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/bert/pooler.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/common/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/common/embeddings/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/common/embeddings/language_model_embedding.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/common/embeddings/relative_pos_embedding.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/common/embeddings/rotary_pos_embedding.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/common/language_module/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/common/language_module/language_module.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/common/model_chunk_schedule_plan.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/common/vision_module/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/common/vision_module/vision_module.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/gpt/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/gpt/gpt_layer_specs.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/gpt/heterogeneous/heterogeneous_layer_specs.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/gpt/moe_module_specs.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/huggingface/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/huggingface/clip_model.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/huggingface/module.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/huggingface/qwen_model.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/mamba/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/mamba/mamba_layer_specs.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/mamba/mamba_model.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/mimo/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/mimo/config/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/mimo/config/base_configs.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/mimo/model/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/mimo/model/base.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/mimo/submodules/audio.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/mimo/submodules/base.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/mimo/submodules/vision.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/multimodal/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/multimodal/context_parallel.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/multimodal/llava_model.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/multimodal/llava_spec.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/retro/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/retro/base_attention.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/retro/config.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/retro/decoder_attention.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/retro/decoder_spec.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/retro/encoder_attention.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/retro/encoder_spec.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/retro/model.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/retro/utils.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/vision/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/vision/clip_vit_model.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/vision/multimodal_projector.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/vision/radio.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/models/vision/vit_layer_specs.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/msc_utils.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/num_microbatches_calculator.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/optimizer/clip_grads.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/optimizer/cpu_offloading/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/optimizer/cpu_offloading/hybrid_optimizer.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/optimizer/grad_scaler.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/optimizer/optimizer.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/optimizer/optimizer_config.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/optimizer_param_scheduler.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/packed_seq_params.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/pipeline_parallel/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/pipeline_parallel/bridge_communicator.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/pipeline_parallel/combined_1f1b.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/pipeline_parallel/p2p_communication.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/pipeline_parallel/utils.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/post_training/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/post_training/modelopt/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/post_training/modelopt/gpt/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/post_training/modelopt/gpt/model_specs.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/post_training/modelopt/gpt/state_dict_hooks.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/post_training/modelopt/layers.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/post_training/modelopt/mamba/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/post_training/modelopt/mamba/model_specs.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/quantization/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/quantization/quant_config.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/quantization/utils.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/requirements.txt +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/rerun_state_machine.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/safe_globals.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/ssm/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/ssm/mamba_context_parallel.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/ssm/mamba_hybrid_layer_allocation.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/ssm/mlp_layer.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/ssm/triton_cache_manager.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tensor_parallel/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tensor_parallel/cross_entropy.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tensor_parallel/data.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tensor_parallel/layers.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tensor_parallel/mappings.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tensor_parallel/random.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tensor_parallel/utils.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/timers.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tokenizers/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tokenizers/base_tokenizer.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tokenizers/megatron_tokenizer.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tokenizers/text/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tokenizers/text/libraries/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tokenizers/text/libraries/abstract_tokenizer.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tokenizers/text/libraries/bytelevel_tokenizer.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tokenizers/text/libraries/chat_template.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tokenizers/text/libraries/huggingface_tokenizer.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tokenizers/text/libraries/megatron_hf_tokenizer.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tokenizers/text/libraries/null_tokenizer.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tokenizers/text/libraries/sentencepiece_tokenizer.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tokenizers/text/libraries/tiktoken_tokenizer.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tokenizers/text/models/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tokenizers/text/models/bert_tokenizer.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tokenizers/text/models/default_tokenizer.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tokenizers/text/models/gpt_tokenizer.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tokenizers/text/models/mamba_tokenizer.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tokenizers/text/models/retro_tokenizer.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tokenizers/text/models/t5_tokenizer.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tokenizers/text/text_tokenizer.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/tokenizers/text/utils/build_tokenizer.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/custom_layers/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/custom_layers/transformer_engine.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/dot_product_attention.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/enums.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/heterogeneous/heterogeneous_config.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/heterogeneous/linear_replacements.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/identity_op.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/mlp.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/moe/__init__.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/moe/experts.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/moe/fused_a2a.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/moe/grouped_gemm_util.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/moe/router.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/moe/shared_experts.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/moe/token_dispatcher.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/moe/upcycling_utils.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/multi_latent_attention.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/multi_token_prediction.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/pipeline_parallel_layer_layout.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/spec_utils.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/torch_layer_norm.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron/core/transformer/torch_norm.py +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron_core.egg-info/dependency_links.txt +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/megatron_core.egg-info/top_level.txt +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/setup.cfg +0 -0
- {megatron_core-0.15.0rc7 → megatron_core-0.16.0rc0.dev104455}/setup.py +0 -0
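The most notable addition in the listing above is the new `megatron/core/distributed/reduce_scatter_with_fp32_accumulation.py` module. Its name points at a standard mixed-precision trick: upcast low-precision (e.g. bf16) gradient partials to fp32 before the reduce-scatter so the cross-rank sum accumulates at full precision. The sketch below illustrates that general pattern in plain PyTorch; the function name, shapes, and integration point are assumptions for illustration, not Megatron Core's actual API.

```python
# Illustrative sketch only (hypothetical names): reduce-scatter a bf16
# gradient buffer while accumulating the cross-rank sum in fp32.
import torch
import torch.distributed as dist

def reduce_scatter_with_fp32_accum(grad_bf16: torch.Tensor, group=None) -> torch.Tensor:
    """Return this rank's fp32 shard of the summed gradient buffer.

    Upcasting before the reduction avoids the precision loss of adding
    many bf16 partial gradients directly; only the local shard stays fp32.
    """
    world_size = dist.get_world_size(group)
    flat = grad_bf16.flatten()
    assert flat.numel() % world_size == 0, "buffer must divide evenly across ranks"
    flat_fp32 = flat.float()  # upcast so the sum accumulates in fp32
    shard = torch.empty(flat.numel() // world_size,
                        dtype=torch.float32, device=flat.device)
    dist.reduce_scatter_tensor(shard, flat_fp32, op=dist.ReduceOp.SUM, group=group)
    return shard
```

The real implementation presumably hooks into the distributed-optimizer / `param_and_grad_buffer.py` path that this release also modifies, but the diff listing alone does not show the integration.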
{megatron_core-0.15.0rc7/megatron_core.egg-info → megatron_core-0.16.0rc0.dev104455}/PKG-INFO
RENAMED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: megatron-core
|
|
3
|
-
Version: 0.
|
|
3
|
+
Version: 0.16.0rc0.dev104455
|
|
4
4
|
Summary: Megatron Core - a library for efficient and scalable training of transformer based models
|
|
5
5
|
Author-email: NVIDIA <nemo-toolkit@nvidia.com>
|
|
6
6
|
Maintainer-email: NVIDIA <nemo-toolkit@nvidia.com>
|
|
@@ -29,7 +29,6 @@ Classifier: Topic :: Software Development :: Libraries
|
|
|
29
29
|
Classifier: Topic :: Utilities
|
|
30
30
|
Requires-Python: >=3.10
|
|
31
31
|
Description-Content-Type: text/markdown
|
|
32
|
-
License-File: LICENSE
|
|
33
32
|
Requires-Dist: torch
|
|
34
33
|
Requires-Dist: numpy<2.0.0
|
|
35
34
|
Requires-Dist: packaging>=24.2
|
|
@@ -50,10 +49,11 @@ Requires-Dist: setuptools<80.0.0; extra == "dev"
|
|
|
50
49
|
Requires-Dist: mamba-ssm~=2.2; extra == "dev"
|
|
51
50
|
Requires-Dist: causal-conv1d~=1.5; extra == "dev"
|
|
52
51
|
Requires-Dist: nv-grouped-gemm~=1.1; extra == "dev"
|
|
53
|
-
Requires-Dist: transformer-engine[pytorch]<2.
|
|
52
|
+
Requires-Dist: transformer-engine[pytorch]<2.10.0,>=2.7.0a0; extra == "dev"
|
|
54
53
|
Requires-Dist: nvidia-resiliency-ext<0.5.0,>=0.4.0a0; extra == "dev"
|
|
55
54
|
Requires-Dist: nvidia-modelopt[torch]<0.34.0,>=0.33.0a0; sys_platform != "darwin" and extra == "dev"
|
|
56
55
|
Requires-Dist: megatron-energon[av_decode]~=6.0; extra == "dev"
|
|
56
|
+
Requires-Dist: av<16.0.0; extra == "dev"
|
|
57
57
|
Requires-Dist: flashinfer-python; extra == "dev"
|
|
58
58
|
Requires-Dist: wget; extra == "dev"
|
|
59
59
|
Requires-Dist: onnxscript; extra == "dev"
|
|
@@ -66,12 +66,12 @@ Requires-Dist: transformers; extra == "lts"
|
|
|
66
66
|
Requires-Dist: zarr; extra == "lts"
|
|
67
67
|
Requires-Dist: setuptools<80.0.0; extra == "lts"
|
|
68
68
|
Requires-Dist: wget; extra == "lts"
|
|
69
|
-
Dynamic: license-file
|
|
70
69
|
|
|
71
70
|
<div align="center">
|
|
72
71
|
|
|
73
72
|
Megatron-LM & Megatron Core
|
|
74
73
|
===========================
|
|
74
|
+
|
|
75
75
|
<h4>GPU-optimized library for training transformer models at scale</h4>
|
|
76
76
|
|
|
77
77
|
[](https://docs.nvidia.com/Megatron-Core/developer-guide/latest/index.html)
|
|
@@ -84,28 +84,29 @@ Megatron-LM & Megatron Core
|
|
|
84
84
|
|
|
85
85
|
```bash
|
|
86
86
|
# 1. Install Megatron Core with required dependencies
|
|
87
|
-
pip install megatron-core
|
|
88
|
-
pip install --no-build-isolation transformer-engine[pytorch]
|
|
87
|
+
pip install --no-build-isolation megatron-core[mlm,dev]
|
|
89
88
|
|
|
90
89
|
# 2. Clone repository for examples
|
|
91
90
|
git clone https://github.com/NVIDIA/Megatron-LM.git
|
|
92
91
|
cd Megatron-LM
|
|
92
|
+
pip install --no-build-isolation .[mlm,dev]
|
|
93
93
|
```
|
|
94
94
|
|
|
95
95
|
**→ [Complete Installation Guide](#installation)** - Docker, pip variants (dev,lts,etc.), source installation, and system requirements
|
|
96
96
|
|
|
97
97
|
# Latest News
|
|
98
98
|
|
|
99
|
-
-
|
|
100
|
-
-
|
|
101
|
-
-
|
|
99
|
+
- 📣 NEW! **[Megatron Dev Branch](https://github.com/NVIDIA/Megatron-LM/tree/dev)** - early access branch with experimental features.
|
|
100
|
+
- 🔄 **[Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge)** - Bidirectional converter for interoperability between Hugging Face and Megatron checkpoints, featuring production-ready recipes for popular models.
|
|
101
|
+
- **[2025/08]** **[MoE Q3-Q4 2025 Roadmap](https://github.com/NVIDIA/Megatron-LM/issues/1729)** - Comprehensive roadmap for MoE features including DeepSeek-V3, Qwen3, advanced parallelism strategies, FP8 optimizations, and Blackwell performance enhancements.
|
|
102
|
+
- **[2025/08]** **[GPT-OSS Model](https://github.com/NVIDIA/Megatron-LM/issues/1739)** - Advanced features including YaRN RoPE scaling, attention sinks, and custom activation functions are being integrated into Megatron Core.
|
|
102
103
|
- **[2025/06]** **[Megatron MoE Model Zoo](https://github.com/yanring/Megatron-MoE-ModelZoo)** - Best practices and optimized configurations for training DeepSeek-V3, Mixtral, and Qwen3 MoE models with performance benchmarking and checkpoint conversion tools.
|
|
103
|
-
- **[2025/05]** Megatron Core v0.11.0 brings new capabilities for multi-data center LLM training ([blog](https://developer.nvidia.com/blog/turbocharge-llm-training-across-long-haul-data-center-networks-with-nvidia-nemo-framework/)).
|
|
104
|
+
- **[2025/05]** Megatron Core v0.11.0 brings new capabilities for multi-data center LLM training ([blog](https://developer.nvidia.com/blog/turbocharge-llm-training-across-long-haul-data-center-networks-with-nvidia-nemo-framework/)).
|
|
104
105
|
|
|
105
106
|
<details>
|
|
106
107
|
<summary>Previous News</summary>
|
|
107
108
|
|
|
108
|
-
- **[2024/07]** Megatron Core v0.7 improves scalability and training resiliency and adds support for multimodal training ([blog](https://developer.nvidia.com/blog/train-generative-ai-models-more-efficiently-with-new-nvidia-Megatron-Core-functionalities/)).
|
|
109
|
+
- **[2024/07]** Megatron Core v0.7 improves scalability and training resiliency and adds support for multimodal training ([blog](https://developer.nvidia.com/blog/train-generative-ai-models-more-efficiently-with-new-nvidia-Megatron-Core-functionalities/)).
|
|
109
110
|
- **[2024/06]** Megatron Core added supports for Mamba-based models. Check out our paper [An Empirical Study of Mamba-based Language Models](https://arxiv.org/pdf/2406.07887) and [code example](https://github.com/NVIDIA/Megatron-LM/tree/ssm/examples/mamba).
|
|
110
111
|
- **[2024/01 Announcement]** NVIDIA has released the core capabilities in **Megatron-LM** into [**Megatron Core**](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core) in this repository. Megatron Core expands upon Megatron-LM's GPU-optimized techniques with more cutting-edge innovations on system-level optimizations, featuring composable and modular APIs. Explore the [Megatron Core intro](#Megatron Core) for more details.
|
|
111
112
|
|
|
@@ -115,25 +116,28 @@ cd Megatron-LM
|
|
|
115
116
|
<summary>Table of Contents</summary>
|
|
116
117
|
|
|
117
118
|
**Getting Started**
|
|
119
|
+
|
|
118
120
|
- [Quick Start](#-quick-start)
|
|
119
121
|
- [Latest News](#latest-news)
|
|
120
122
|
- [Megatron Overview](#megatron-overview)
|
|
121
123
|
- [Project Structure](#project-structure)
|
|
122
124
|
- [Megatron-LM: Reference Implementation](#megatron-lm-reference-implementation)
|
|
123
125
|
- [Megatron Core: Production Library](#megatron-core-production-library)
|
|
124
|
-
- [Installation](#installation)
|
|
126
|
+
- [Installation](#installation)
|
|
125
127
|
- [Docker (Recommended)](#-docker-recommended)
|
|
126
128
|
- [Pip Installation](#-pip-installation)
|
|
127
129
|
- [Source Installation](#-source-installation)
|
|
128
130
|
- [System Requirements](#system-requirements)
|
|
129
131
|
|
|
130
132
|
**Core Features**
|
|
133
|
+
|
|
131
134
|
- [Performance Benchmarking](#performance-benchmarking)
|
|
132
135
|
- [Weak Scaling Results](#weak-scaling-results)
|
|
133
136
|
- [Strong Scaling Results](#strong-scaling-results)
|
|
134
137
|
- [Ecosystem Libraries](#ecosystem-libraries)
|
|
135
138
|
|
|
136
139
|
**Training**
|
|
140
|
+
|
|
137
141
|
- [Training](#training)
|
|
138
142
|
- [Getting Started](#getting-started)
|
|
139
143
|
- [Data Preparation](#data-preparation)
|
|
@@ -147,6 +151,7 @@ cd Megatron-LM
|
|
|
147
151
|
- [Performance Optimizations](#performance-optimizations)
|
|
148
152
|
|
|
149
153
|
**Resources**
|
|
154
|
+
|
|
150
155
|
- [Examples](./examples/) - Training scripts and tutorials
|
|
151
156
|
- [Documentation](https://docs.nvidia.com/Megatron-Core/) - Official docs
|
|
152
157
|
- [Roadmaps](#roadmaps) - Development roadmaps and feature tracking
|
|
@@ -160,6 +165,7 @@ cd Megatron-LM
|
|
|
160
165
|
# Megatron Overview
|
|
161
166
|
|
|
162
167
|
## Project Structure
|
|
168
|
+
|
|
163
169
|
```
|
|
164
170
|
Megatron-LM/
|
|
165
171
|
├── megatron/
|
|
@@ -184,28 +190,34 @@ Megatron-LM/
|
|
|
184
190
|
```
|
|
185
191
|
|
|
186
192
|
### Megatron-LM: Reference Implementation
|
|
193
|
+
|
|
187
194
|
**Reference implementation** that includes Megatron Core plus everything needed to train models.
|
|
188
195
|
|
|
189
196
|
**Best for:**
|
|
197
|
+
|
|
190
198
|
- **Training state-of-the-art foundation models** at scale with cutting-edge performance on latest NVIDIA hardware
|
|
191
199
|
- **Research teams** exploring new architectures and training techniques
|
|
192
200
|
- **Learning distributed training** concepts and best practices
|
|
193
201
|
- **Quick experimentation** with proven model configurations
|
|
194
202
|
|
|
195
203
|
**What you get:**
|
|
204
|
+
|
|
196
205
|
- Pre-configured training scripts for GPT, LLama, DeepSeek, Qwen, and more.
|
|
197
206
|
- End-to-end examples from data prep to evaluation
|
|
198
207
|
- Research-focused tools and utilities
|
|
199
208
|
|
|
200
209
|
### Megatron Core: Composable Library
|
|
210
|
+
|
|
201
211
|
**Composable library** with GPU-optimized building blocks for custom training frameworks.
|
|
202
212
|
|
|
203
213
|
**Best for:**
|
|
214
|
+
|
|
204
215
|
- **Framework developers** building on top of modular and optimized components
|
|
205
216
|
- **Research teams** needing custom training loops, optimizers, or data pipelines
|
|
206
217
|
- **ML engineers** requiring fault-tolerant training pipelines
|
|
207
218
|
|
|
208
219
|
**What you get:**
|
|
220
|
+
|
|
209
221
|
- Composable transformer building blocks (attention, MLP, etc.)
|
|
210
222
|
- Advanced parallelism strategies (TP, PP, DP, EP, CP)
|
|
211
223
|
- Pipeline schedules and distributed optimizers
|
|
@@ -237,6 +249,8 @@ Megatron-LM/
|
|
|
237
249
|
|
|
238
250
|
We strongly recommend using the previous releases of [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) rather than the latest one for optimal compatibility with Megatron Core release and testing. Our releases are always based on the previous month's NGC container, so this ensures compatibility and stability.
|
|
239
251
|
|
|
252
|
+
**Note:** The NGC PyTorch container constraints the python environment globally via `PIP_CONSTRAINT`. In the following examples we will unset the variable.
|
|
253
|
+
|
|
240
254
|
This container comes with all dependencies pre-installed with compatible versions and optimized configurations for NVIDIA GPUs:
|
|
241
255
|
|
|
242
256
|
- PyTorch (latest stable version)
|
|
@@ -250,6 +264,7 @@ docker run --runtime --nvidia --gpus all -it --rm \
|
|
|
250
264
|
-v /path/to/megatron:/workspace/megatron \
|
|
251
265
|
-v /path/to/dataset:/workspace/dataset \
|
|
252
266
|
-v /path/to/checkpoints:/workspace/checkpoints \
|
|
267
|
+
-e PIP_CONSTRAINT= \
|
|
253
268
|
nvcr.io/nvidia/pytorch:25.04-py3
|
|
254
269
|
```
|
|
255
270
|
|
|
@@ -263,13 +278,21 @@ Megatron Core offers support for two NGC PyTorch containers:
|
|
|
263
278
|
Both containers can be combined with `mlm` which adds package dependencies for Megatron-LM on top of Megatron Core.
|
|
264
279
|
|
|
265
280
|
```bash
|
|
266
|
-
# Install the latest release
|
|
267
|
-
pip install
|
|
281
|
+
# Install the latest release dependencies
|
|
282
|
+
pip install "setuptools<80.0.0,>=77.0.0" "packaging>=24.2"
|
|
283
|
+
pip install --no-build-isolation megatron-core[dev]
|
|
284
|
+
# For running an M-LM application:
|
|
285
|
+
pip install "setuptools<80.0.0,>=77.0.0" "packaging>=24.2"
|
|
286
|
+
pip install --no-build-isolation megatron-core[mlm,dev]
|
|
268
287
|
```
|
|
269
288
|
|
|
270
289
|
```bash
|
|
271
290
|
# Install packages for LTS support NGC PyTorch 24.01
|
|
272
|
-
pip install
|
|
291
|
+
pip install "setuptools<80.0.0,>=77.0.0" "packaging>=24.2"
|
|
292
|
+
pip install --no-build-isolation megatron-core[lts]
|
|
293
|
+
# For running an M-LM application:
|
|
294
|
+
pip install "setuptools<80.0.0,>=77.0.0" "packaging>=24.2"
|
|
295
|
+
pip install --no-build-isolation megatron-core[mlm,lts]
|
|
273
296
|
```
|
|
274
297
|
|
|
275
298
|
For a version of Megatron Core with only torch, run:
|
|
@@ -278,47 +301,15 @@ For a version of Megatron Core with only torch, run:
 pip install megatron-core
 ```
 
-For dependencies required by Megatron-LM, please run:
-
-```bash
-pip install megatron-core[mlm]
-```
-
-## Source Installation
-
-For development or latest features:
-
-For Hybrid models, Megatron Core requires [mamba](https://github.com/state-spaces/mamba). If the pre-built wheel on PyPI does not fit your environment, you can fall back to the install script Megatron Core uses in its CI system. For this, please install `uv` first:
-
-```bash
-export UV_VERSION=0.7.2
-export PATH="$HOME/.local/bin:$PATH"
-curl -LsSf https://astral.sh/uv/${UV_VERSION}/install.sh | sh
-export UV_PROJECT_ENVIRONMENT=./venv
-export PATH="$UV_PROJECT_ENVIRONMENT/bin:$PATH"
-export UV_LINK_MODE=copy
-```
-
-Run the following command to build upstream dependencies from source:
-
-```bash
-# Clone and install
-git clone https://github.com/NVIDIA/Megatron-LM.git
-cd Megatron-LM
-
-# Optional: checkout specific release
-git checkout core_r0.13.0
-
-bash docker/common/install.sh --environment {dev,lts}
-```
-
 ## System Requirements
 
 ### Hardware Requirements
+
 - **FP8 Support**: NVIDIA Hopper, Ada, Blackwell GPUs
 - **Recommended**: NVIDIA Turing architecture or later
 
 ### Software Requirements
+
 - **CUDA/cuDNN/NCCL**: Latest stable versions
 - **PyTorch**: Latest stable version
 - **Transformer Engine**: Latest stable version
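Whichever variant you install, a quick import confirms the wheel is usable. A minimal sketch (the `__version__` attribute is an assumption based on the package's `package_info.py`, not something this README documents):

```bash
# Sanity check after any of the pip installs above
python -c "import megatron.core; print(megatron.core.__version__)"
```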
@@ -333,12 +324,14 @@ Our codebase efficiently trains models from 2B to 462B parameters across thousands of GPUs
 
 
 **Benchmark Configuration:**
+
 - **Vocabulary size**: 131,072 tokens
 - **Sequence length**: 4096 tokens
 - **Model scaling**: Varied hidden size, attention heads, and layers to achieve target parameter counts
 - **Communication optimizations**: Fine-grained overlapping with DP (`--overlap-grad-reduce`, `--overlap-param-gather`), TP (`--tp-comm-overlap`), and PP (enabled by default)
 
 **Key Results:**
+
 - **6144 H100 GPUs**: Successfully benchmarked 462B parameter model training
 - **Superlinear scaling**: MFU increases from 41% to 47-48% with model size
 - **End-to-end measurement**: Throughputs include all operations (data loading, optimizer steps, communication, logging)
@@ -346,11 +339,13 @@ Our codebase efficiently trains models from 2B to 462B parameters across thousands of GPUs
 - *Note: Performance results measured without training to convergence*
 
 ## Weak Scaling Results
+
 Our weak-scaled results show superlinear scaling (MFU increases from 41% for the smallest model considered to 47-48% for the largest models); this is because larger GEMMs have higher arithmetic intensity and are consequently more efficient to execute.
 
 
 
 ## Strong Scaling Results
+
 We also strong-scaled the standard GPT-3 model (our version has slightly more than 175 billion parameters due to a larger vocabulary size) from 96 H100 GPUs to 4608 GPUs, using the same batch size of 1152 sequences throughout. Communication becomes more exposed at larger scale, leading to a reduction in MFU from 47% to 42%.
 
 
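MFU here is model FLOPs utilization: achieved model FLOPs per second per GPU divided by the hardware peak. A back-of-envelope sketch with illustrative numbers (989 TFLOP/s is the approximate dense BF16 peak of an H100; the achieved figure is hypothetical):

```bash
# Hypothetical per-GPU throughput vs. approximate H100 BF16 dense peak
achieved_tflops=415
peak_tflops=989
echo "MFU = $(echo "scale=3; $achieved_tflops / $peak_tflops * 100" | bc)%"
# -> MFU = 41.900%, roughly the 42% reported at 4608 GPUs
```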
@@ -360,12 +355,14 @@ We also strong-scaled the standard GPT-3 model (our version has slightly more than 175 billion parameters due to a larger vocabulary size)
 ## Getting Started
 
 ### Simple Training Example
+
 ```bash
 # Distributed training example (2 GPUs, mock data)
 torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py
 ```
 
 ### Llama-3 Training Example
+
 ```bash
 # 8 GPUs, FP8 precision, mock data
 ./examples/llama/train_llama3_8b_fp8.sh
@@ -374,12 +371,14 @@ torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py
 ## Data Preparation
 
 ### JSONL Data Format
+
 ```json
 {"text": "Your training text here..."}
 {"text": "Another training sample..."}
 ```
 
 ### Basic Preprocessing
+
 ```bash
 python tools/preprocess_data.py \
   --input data.jsonl \
@@ -391,6 +390,7 @@ python tools/preprocess_data.py \
 ```
 
 ### Key Arguments
+
 - `--input`: Path to input JSON/JSONL file
 - `--output-prefix`: Prefix for output binary files (.bin and .idx)
 - `--tokenizer-type`: Tokenizer type (`HuggingFaceTokenizer`, `GPT2BPETokenizer`, etc.)
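Putting the key arguments together, a fuller invocation might look like the sketch below. The output prefix and tokenizer model are illustrative, and some tokenizer types additionally require vocab/merge files; `--tokenizer-model` and `--workers` are assumptions based on the tool's argument set rather than flags this README documents:

```bash
python tools/preprocess_data.py \
  --input data.jsonl \
  --output-prefix my_dataset \
  --tokenizer-type HuggingFaceTokenizer \
  --tokenizer-model gpt2 \
  --workers 4   # parallel preprocessing workers
```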
@@ -405,6 +405,7 @@ python tools/preprocess_data.py \
 ## Data Parallelism (DP)
 
 ### Standard Data Parallel
+
 ```bash
 # Standard DDP - replicate model on each GPU
 torchrun --nproc_per_node=8 pretrain_gpt.py \
@@ -412,6 +413,7 @@ torchrun --nproc_per_node=8 pretrain_gpt.py \
 ```
 
 ### Fully Sharded Data Parallel (FSDP)
+
 ```bash
 # Megatron's optimized FSDP (~15% faster than PyTorch FSDP2)
 --use-custom-fsdp
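The FSDP snippet above shows only the distinguishing flag. A minimal launch sketch might pair it with the distributed optimizer, since FSDP shards optimizer state across data-parallel ranks (the pairing is an assumption, not a requirement stated in this README):

```bash
# Same launch as the DDP example, with sharding enabled
torchrun --nproc_per_node=8 pretrain_gpt.py \
  --use-custom-fsdp \
  --use-distributed-optimizer
```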
@@ -426,21 +428,27 @@ torchrun --nproc_per_node=8 pretrain_gpt.py \
 ```
 
 ## Tensor Parallelism (TP)
+
 Split individual model layers across GPUs:
+
 ```bash
 --tensor-model-parallel-size 4   # 4-way tensor parallelism
 --sequence-parallel              # Enable sequence parallelism (recommended with TP)
 ```
 
 ## Pipeline Parallelism (PP)
+
 Split model depth across GPUs:
+
 ```bash
 --pipeline-model-parallel-size 8           # 8 pipeline stages
 --virtual-pipeline-model-parallel-size 4   # Virtual pipeline for better load balancing
 ```
 
 ## Context Parallelism (CP)
+
 Split long sequences across GPUs for handling long contexts:
+
 ```bash
 --context-parallel-size 2   # 2-way context parallelism
 --cp-comm-type p2p          # Communication: p2p, a2a, allgather, a2a+p2p
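These degrees compose multiplicatively: total GPUs = TP × PP × CP × DP, with the data-parallel size inferred from whatever remains. A hypothetical 16-GPU layout as a sketch:

```bash
# 16 GPUs total: TP=4 x PP=2 (CP defaults to 1) -> DP = 16 / (4 * 2) = 2
torchrun --nnodes=2 --nproc_per_node=8 pretrain_gpt.py \
  --tensor-model-parallel-size 4 \
  --sequence-parallel \
  --pipeline-model-parallel-size 2
```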
@@ -448,7 +456,9 @@ Split long sequences across GPUs for handling long contexts:
 ```
 
 ## Expert Parallelism (EP)
+
 For Mixture of Experts (MoE) models:
+
 ```bash
 --expert-model-parallel-size 4   # 4-way expert parallelism
 --num-experts 8                  # 8 experts per MoE layer
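With the settings above, the 8 experts are distributed over the 4 expert-parallel ranks, i.e. 2 experts per rank; the natural constraint that `--num-experts` be divisible by `--expert-model-parallel-size` is stated here as an assumption. A launch sketch:

```bash
# Hypothetical 8-GPU MoE run: 8 experts over EP=4 -> 2 experts per rank
torchrun --nproc_per_node=8 pretrain_gpt.py \
  --num-experts 8 \
  --expert-model-parallel-size 4
```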
@@ -488,9 +498,11 @@ Based on [NVIDIA NeMo production configurations](https://github.com/NVIDIA/NeMo/
 **→ [NVIDIA NeMo Framework Performance Tuning Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/performance/performance-guide.html#performance-tuning-guide)** - Comprehensive performance optimization guide covering advanced tuning techniques, communication overlaps, memory optimizations, and profiling options.
 
 ### FlashAttention
+
 [FlashAttention](https://github.com/Dao-AILab/flash-attention) is a fast and memory-efficient attention algorithm. We recommend the default usage, which uses cuDNN for attention via Transformer Engine and provides up to 50% speedups on forward and 84% on backward propagation with FP8 kernels. The `flash-attn` package is also supported via `--use-flash-attn`.
 
 ### Mixed Precision Training
+
 ```bash
 --fp16   # Standard FP16
 --bf16   # BFloat16 (recommended for large models)
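As a sketch, a common pairing for large models combines BF16 with the flash-attention path mentioned above; both flags come from this README, though whether the combination suits a given model is workload-dependent:

```bash
torchrun --nproc_per_node=8 pretrain_gpt.py \
  --bf16 \
  --use-flash-attn
```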
@@ -498,6 +510,7 @@ Based on [NVIDIA NeMo production configurations](https://github.com/NVIDIA/NeMo/
 ```
 
 ### Activation Checkpointing and Recomputation
+
 ```bash
 # For limited memory
 --recompute-activations
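If selective activation recomputation is not enough, Megatron-LM also exposes full-granularity recomputation; the flag names in this sketch are assumptions based on its argument set rather than flags documented in this README:

```bash
# More aggressive memory savings at the cost of extra recompute (flag names assumed)
--recompute-granularity full \
--recompute-method uniform \
--recompute-num-layers 1
```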
@@ -515,6 +528,7 @@ Based on [NVIDIA NeMo production configurations](https://github.com/NVIDIA/NeMo/
 ```
 
 ### Distributed Optimizer
+
 ```bash
 --use-distributed-optimizer
 ```
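The distributed optimizer shards optimizer state across data-parallel ranks; a sketch of a common combination pairs it with the communication-overlap flags already cited in the benchmark configuration above:

```bash
--use-distributed-optimizer \
--overlap-grad-reduce \
--overlap-param-gather
```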
@@ -531,11 +545,14 @@ Stay up-to-date with our development roadmaps and planned features:
 # Community & Support
 
 ## Getting Help
+
 - 📖 **[Documentation](https://docs.nvidia.com/Megatron-Core/)** - Official documentation
 - 🐛 **[Issues](https://github.com/NVIDIA/Megatron-LM/issues)** - Bug reports and feature requests
 
 ## Contributing
+
 We ❤️ contributions! Ways to contribute:
+
 - 🐛 **Report bugs** - Help us improve reliability
 - 💡 **Suggest features** - Shape the future of Megatron Core
 - 📝 **Improve docs** - Make Megatron Core more accessible
@@ -544,6 +561,7 @@ We ❤️ contributions! Ways to contribute:
 **→ [Contributing Guide](./CONTRIBUTING.md)**
 
 ## Citation
+
 ```bibtex
 @article{megatron-lm,
   title={Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism},