PyPI - megatron-fsdp - Versions diffs - 0.1.0rc0__tar.gz - Mend

megatron-fsdp 0.1.0rc0__tar.gz

Files changed (17)

megatron_fsdp-0.1.0rc0/PKG-INFO +128 -0
megatron_fsdp-0.1.0rc0/README.md +94 -0
megatron_fsdp-0.1.0rc0/megatron_fsdp/__init__.py +53 -0
megatron_fsdp-0.1.0rc0/megatron_fsdp/distributed_data_parallel_config.py +141 -0
megatron_fsdp-0.1.0rc0/megatron_fsdp/fully_shard.py +387 -0
megatron_fsdp-0.1.0rc0/megatron_fsdp/megatron_fsdp.py +1107 -0
megatron_fsdp-0.1.0rc0/megatron_fsdp/package_info.py +27 -0
megatron_fsdp-0.1.0rc0/megatron_fsdp/param_and_grad_buffer.py +3678 -0
megatron_fsdp-0.1.0rc0/megatron_fsdp/uneven_dtensor.py +458 -0
megatron_fsdp-0.1.0rc0/megatron_fsdp/utils.py +908 -0
megatron_fsdp-0.1.0rc0/megatron_fsdp.egg-info/PKG-INFO +128 -0
megatron_fsdp-0.1.0rc0/megatron_fsdp.egg-info/SOURCES.txt +15 -0
megatron_fsdp-0.1.0rc0/megatron_fsdp.egg-info/dependency_links.txt +1 -0
megatron_fsdp-0.1.0rc0/megatron_fsdp.egg-info/requires.txt +3 -0
megatron_fsdp-0.1.0rc0/megatron_fsdp.egg-info/top_level.txt +1 -0
megatron_fsdp-0.1.0rc0/pyproject.toml +68 -0
megatron_fsdp-0.1.0rc0/setup.cfg +4 -0

megatron_fsdp-0.1.0rc0/PKG-INFO ADDED

@@ -0,0 +1,128 @@
+Metadata-Version: 2.4
+Name: megatron-fsdp
+Version: 0.1.0rc0
+Summary: **Megatron-FSDP** is an NVIDIA-developed PyTorch extension that provides a high-performance implementation of Fully Sharded Data Parallelism (FSDP)
+Author-email: NVIDIA <nemo-toolkit@nvidia.com>
+Maintainer-email: NVIDIA <nemo-toolkit@nvidia.com>
+License: Apache 2.0
+Project-URL: Download, https://github.com/NVIDIA/Megatron-LM/releases
+Project-URL: Homepage, https://github.com/NVIDIA/Megatron-LM/megatron/core
+Keywords: NLP,NLU,deep,gpu,language,learning,learning,machine,nvidia,pytorch,torch,transformer
+Classifier: Development Status :: 5 - Production/Stable
+Classifier: Environment :: Console
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Information Technology
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: BSD License
+Classifier: Natural Language :: English
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.8
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Topic :: Scientific/Engineering :: Image Recognition
+Classifier: Topic :: Scientific/Engineering :: Mathematics
+Classifier: Topic :: Scientific/Engineering
+Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Classifier: Topic :: Software Development :: Libraries
+Classifier: Topic :: Utilities
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+Requires-Dist: torch
+Requires-Dist: numpy<2.0.0
+Requires-Dist: packaging
+<div align="center">
+# 🚀 Megatron-FSDP
+</div>
+<div align="center">
+[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/release/python-3100/)
+</div>
+## ✨ What is Megatron-FSDP?
+**Megatron-FSDP** is an NVIDIA-developed PyTorch extension that provides a high-performance implementation of Fully Sharded Data Parallelism (FSDP). It offers seamless cross-compatibility with major deep learning frameworks and parallelism libraries, making it easy to scale your PyTorch models across multiple GPUs and nodes.
+Megatron-FSDP can provide up to 25% speed up and 23% memory savings compared to FSDP2.
+### Compatibility
+- **[PyTorch DTensor](https://docs.pytorch.org/docs/stable/distributed.tensor.html)**
+- **[Megatron Core](https://github.com/NVIDIA/Megatron-LM)**
+- **[TransformerEngine](https://github.com/NVIDIA/TransformerEngine)**
+## ✨ Features
+- **Easy Integration**: Simple `fully_shard` function for quick model parallelization
+- **High Performance**: Optimized for NVIDIA GPUs with efficient memory management
+- **Cross-Framework**: Works seamlessly with PyTorch, Huggingface Transformers, Megatron-LM, Megatron Bridge and TransformerEngine
+- **Scalable**: Supports both single-node multi-GPU and multi-node distributed training
+- **Flexible Configuration**: Configurable sharding strategies and process groups
+## ⚡ Optimizations
+- **Advanced Bucketing**: Data-type aware bucketing system to minimize the overhead of collective operations
+- **Buffer Management**: Zero copy communication is achieved by reorganizing the storage of parameters and main grad with `ParamAndGradBuffer` class
+- **Communication Overlapping**: Improved communication overlap of parameter all-gather and gradient reduce-scatter
+- **User-Buffer-Registration NCCL communication**: Offload NCCL collective communication to NVL/IB Sharp to reduce GPU SM usage for communication
+- **FP8 Mixed Precision with Transformer Engine**: Compatibility with Transformer Engine enables efficient FP8 mixed precision training
+- **Gradient accumulate fusion support with Transformer Engine**: Remove the explicit gradient copy to the communication buffer in backwards pass
+<!-- ## 📊 Performance  -->
+<!-- ## 📦 Installation -->
+## 🚀 Quick Start
+### Basic Usage
+Transform your PyTorch model to use Fully Sharded Data Parallelism with just a few lines:
+```python
+import torch
+from megatron_fsdp import fully_shard
+# Your existing model and optimizer
+model = YourModel()
+optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
+# Enable FSDP with Megatron-FSDP
+model, optimizer = fully_shard(
+    model,
+    optimizer,
+    fsdp_unit_modules=[YourTransformerBlock], # Modules to shard
+)
+# Your model is now ready for distributed training!
+```
+### Comparison with FSDP-2
+We provide a similar approach for sharding the model with `fully_shard` function:
+- No need to call `fully_shard` on all the submodules.
+- One liner for the sharding change
+Here is an FSDP2 usage example for better comparison
+```python
+import torch
+from torch.distributed.fsdp import fully_shard
+# Your existing model and optimizer
+model = YourModel()
+optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
+# Enable FSDP with FSDP2
+for module in model.modules():
+    if isinstance(module, YourTransformerBlock): # Sub-Modules to shard
+        fully_shard(module)
+fully_shard(model)
+# Your model is now ready for distributed training!
+```
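
Note on the metadata above: the classifiers list Python 3.8 and 3.9 and a BSD license, while `Requires-Python` is `>=3.10` and `License` is Apache 2.0. The sketch below shows one way the Python-version mismatch could be flagged with only the standard library. The `stale_python_classifiers` helper and the abbreviated `PKG_INFO` string are illustrative assumptions, not part of the package, and only the simple `>=X.Y` specifier form is handled.

```python
from email.parser import HeaderParser

# Abbreviated copy of the PKG-INFO headers shown in the diff above.
PKG_INFO = """\
Metadata-Version: 2.4
Name: megatron-fsdp
Version: 0.1.0rc0
License: Apache 2.0
Classifier: License :: OSI Approved :: BSD License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Requires-Python: >=3.10
"""


def parse_version(text):
    """Turn a dotted version such as '3.10' into a comparable tuple (3, 10)."""
    return tuple(int(part) for part in text.split("."))


def stale_python_classifiers(metadata_text):
    """Return Python versions named in classifiers that fall below Requires-Python."""
    headers = HeaderParser().parsestr(metadata_text)
    requires = headers["Requires-Python"].strip()
    # Assumption: only the plain ">=X.Y" form is supported in this sketch.
    if not requires.startswith(">="):
        raise ValueError(f"unsupported specifier: {requires!r}")
    floor = parse_version(requires[2:].strip())

    prefix = "Programming Language :: Python :: "
    stale = []
    for classifier in headers.get_all("Classifier", []):
        if classifier.startswith(prefix):
            version = classifier[len(prefix):]
            # Skip the bare major-version classifier ("3"), which carries no minor version.
            if "." in version and parse_version(version) < floor:
                stale.append(version)
    return stale


print(stale_python_classifiers(PKG_INFO))  # ['3.8', '3.9']
```

Comparing versions as integer tuples avoids the string-comparison trap where `"3.9"` sorts after `"3.10"`.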

megatron_fsdp-0.1.0rc0/README.md ADDED

@@ -0,0 +1,94 @@
+<div align="center">
+# 🚀 Megatron-FSDP
+</div>
+<div align="center">
+[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/release/python-3100/)
+</div>
+## ✨ What is Megatron-FSDP?
+**Megatron-FSDP** is an NVIDIA-developed PyTorch extension that provides a high-performance implementation of Fully Sharded Data Parallelism (FSDP). It offers seamless cross-compatibility with major deep learning frameworks and parallelism libraries, making it easy to scale your PyTorch models across multiple GPUs and nodes.
+Megatron-FSDP can provide up to 25% speed up and 23% memory savings compared to FSDP2.
+### Compatibility
+- **[PyTorch DTensor](https://docs.pytorch.org/docs/stable/distributed.tensor.html)**
+- **[Megatron Core](https://github.com/NVIDIA/Megatron-LM)**
+- **[TransformerEngine](https://github.com/NVIDIA/TransformerEngine)**
+## ✨ Features
+- **Easy Integration**: Simple `fully_shard` function for quick model parallelization
+- **High Performance**: Optimized for NVIDIA GPUs with efficient memory management
+- **Cross-Framework**: Works seamlessly with PyTorch, Huggingface Transformers, Megatron-LM, Megatron Bridge and TransformerEngine
+- **Scalable**: Supports both single-node multi-GPU and multi-node distributed training
+- **Flexible Configuration**: Configurable sharding strategies and process groups
+## ⚡ Optimizations
+- **Advanced Bucketing**: Data-type aware bucketing system to minimize the overhead of collective operations
+- **Buffer Management**: Zero copy communication is achieved by reorganizing the storage of parameters and main grad with `ParamAndGradBuffer` class
+- **Communication Overlapping**: Improved communication overlap of parameter all-gather and gradient reduce-scatter
+- **User-Buffer-Registration NCCL communication**: Offload NCCL collective communication to NVL/IB Sharp to reduce GPU SM usage for communication
+- **FP8 Mixed Precision with Transformer Engine**: Compatibility with Transformer Engine enables efficient FP8 mixed precision training
+- **Gradient accumulate fusion support with Transformer Engine**: Remove the explicit gradient copy to the communication buffer in backwards pass
+<!-- ## 📊 Performance  -->
+<!-- ## 📦 Installation -->
+## 🚀 Quick Start
+### Basic Usage
+Transform your PyTorch model to use Fully Sharded Data Parallelism with just a few lines:
+```python
+import torch
+from megatron_fsdp import fully_shard
+# Your existing model and optimizer
+model = YourModel()
+optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
+# Enable FSDP with Megatron-FSDP
+model, optimizer = fully_shard(
+    model,
+    optimizer,
+    fsdp_unit_modules=[YourTransformerBlock], # Modules to shard
+)
+# Your model is now ready for distributed training!
+```
+### Comparison with FSDP-2
+We provide a similar approach for sharding the model with `fully_shard` function:
+- No need to call `fully_shard` on all the submodules.
+- One liner for the sharding change
+Here is an FSDP2 usage example for better comparison
+```python
+import torch
+from torch.distributed.fsdp import fully_shard
+# Your existing model and optimizer
+model = YourModel()
+optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
+# Enable FSDP with FSDP2
+for module in model.modules():
+    if isinstance(module, YourTransformerBlock): # Sub-Modules to shard
+        fully_shard(module)
+fully_shard(model)
+# Your model is now ready for distributed training!
+```

megatron_fsdp-0.1.0rc0/megatron_fsdp/__init__.py ADDED

@@ -0,0 +1,53 @@
+# Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from .distributed_data_parallel_config import DistributedDataParallelConfig
+from .megatron_fsdp import MegatronFSDP
+from .package_info import (
+    __contact_emails__,
+    __contact_names__,
+    __description__,
+    __download_url__,
+    __homepage__,
+    __keywords__,
+    __license__,
+    __package_name__,
+    __repository_url__,
+    __shortversion__,
+    __version__,
+)
+from .utils import FSDPDistributedIndex
+try:
+    from .fully_shard import fully_shard
+except ImportError as e:
+    print(f"Failed to import fully_shard: {e}")
+__all__ = [
+    "DistributedDataParallelConfig",
+    "MegatronFSDP",
+    "FSDPDistributedIndex",
+    "fully_shard",
+    "__contact_emails__",
+    "__contact_names__",
+    "__description__",
+    "__download_url__",
+    "__homepage__",
+    "__keywords__",
+    "__license__",
+    "__package_name__",
+    "__repository_url__",
+    "__shortversion__",
+    "__version__",
+]
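
Note on the `__init__.py` above: the import of `fully_shard` is wrapped in `try`/`except ImportError` and only prints on failure, so the name can be missing from the module even though `__all__` lists it. A caller that wants to degrade gracefully could probe for it as sketched below. `load_fully_shard` is a hypothetical helper name, not something the package provides.

```python
import importlib


def load_fully_shard():
    """Return megatron_fsdp.fully_shard, or None if it is unavailable."""
    try:
        package = importlib.import_module("megatron_fsdp")
    except ImportError:
        # The package itself (or one of its hard dependencies) is not installed.
        return None
    # __init__.py swallows ImportError for fully_shard, so the attribute may be absent.
    return getattr(package, "fully_shard", None)


fully_shard = load_fully_shard()
if fully_shard is None:
    print("megatron_fsdp.fully_shard is not available in this environment")
```

Using `getattr` with a default covers the case where the package imports successfully but the `fully_shard` submodule did not.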

megatron_fsdp-0.1.0rc0/megatron_fsdp/distributed_data_parallel_config.py ADDED

@@ -0,0 +1,141 @@
+# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+from dataclasses import dataclass
+from typing import Optional
+@dataclass
+class DistributedDataParallelConfig:
+    """Configuration for DistributedDataParallel."""
+    grad_reduce_in_fp32: bool = False
+    """If true, reduce grads in fp32."""
+    overlap_grad_reduce: bool = False
+    """If true, overlap grad all-reduce / reduce-scatter with backward compute."""
+    overlap_param_gather: bool = False
+    """If true, overlap param all-gather with forward compute."""
+    align_param_gather: bool = False
+    """If true, all PP stages will launch param all-gathers simultaneously. Otherwise, each
+    PP stage will independently launch as needed.
+    """
+    use_distributed_optimizer: bool = False
+    """If true, issue reduce-scatter collectives to aggregate gradients and clean up
+       originally allocated model parameters, otherwise issue all-reduce collectives.
+    """
+    num_distributed_optimizer_instances: int = 1
+    """Sets the factor by which the DP domain is sharded to have the partial DistOpt
+       enabled. Defaults to 1, which means DistOpt is across entire DP domain.
+    """
+    check_for_nan_in_grad: bool = False
+    """If true, check for NaNs and Infs in gradients _before_ communication collective."""
+    check_for_large_grads: bool = False
+    """If true, check for unexpectedly large gradients _before_ communication collective."""
+    bucket_size: Optional[int] = None
+    """Maximum number of parameters in each bucket. If unspecified, MCore uses a default
+       value of max(40000000, 1000000 * dp_size) parameters (larger DP sizes need larger
+       buckets to ensure collectives do not become latency-bound)."""
+    pad_buckets_for_high_nccl_busbw: bool = False
+    """If true, make sure the bucket size is divisible by a large power of 2 (2^16) to
+       ensure NCCL collectives have high bus bandwidth at large DP counts, since NCCL
+       message size (which for ring algorithms is bucket_size / dp_size) apparently needs
+       to be divisible by a power of 2 for high busbw."""
+    average_in_collective: bool = False
+    """If true, compute average in collective directly, as opposed to dividing by the
+       dp_size first and then computing sum in the collective."""
+    fp8_param_gather: bool = False
+    """If true, keep the compute param in fp8 (do not use any other intermediate dtype) and
+       perform the param all-gather in fp8."""
+    reuse_grad_buf_for_mxfp8_param_ag: bool = False
+    """If true, reuse the grad buffer for param AG when using mxfp8 recipe. Should be
+       set to True only when fp8_recipe is mxfp8 and fp8_param_gather is True."""
+    use_megatron_fsdp: bool = False
+    """If true, use the FSDP code path for DDP."""
+    use_custom_fsdp: bool = False
+    """
+    NOTE: The flag `use_custom_fsdp` is deprecated and will be removed in future versions.
+    Please use `use_megatron_fsdp` instead, as all functionality will be migrated there.
+    Future updates will drop support for `use_custom_fsdp` to avoid confusion.
+    """
+    data_parallel_sharding_strategy: str = 'no_shard'
+    """Sharding strategy for FSDP. Valid values are 'no_shard', 'optim',
+        'optim_grads', 'optim_grads_params'."""
+    gradient_reduce_div_fusion: bool = True
+    """If true, perform gradient reduce and division fusion."""
+    suggested_communication_unit_size: Optional[int] = None
+    """Specifies the number of elements to communicate at once during
+      FSDP (Fully Sharded Data Parallel) operations.
+      This flag also affects FSDP all-gather prefetch behavior. Setting a larger
+      value increases the communication buffer size, while a smaller value
+      disables prefetching and may degrade performance. Adjust this value
+      based on your system's memory and performance requirements."""
+    preserve_fp32_weights: bool = True
+    """If true, preserve fp32 weights in the Megatron FSDP ParamAndGradBuffer."""
+    keep_fp8_transpose_cache: bool = False
+    """If true, keep the fp8 transpose cache when using Megatron FSDP."""
+    nccl_ub: bool = False
+    """If true, allocate and register NCCL userbuffer for param and grad buffer.
+      This flag enables SM efficient nccl algorithm that could improve the performance
+      of FSDP and DP with comm_overlap. This flag will be much more effective when used
+      together with sharp.
+      The following table gives the expected SM usage for various cases.
+      (Note that this is just a reference number and the number of SM usage could vary
+      on message size, communication domain size and nccl version.)
+      ----------------------------------------------------------
+      | Communication domain | use_sharp | SM usage of "AG/RS" |
+      |----------------------|-----------|---------------------|
+      | NVL                  | N/A       | 4 / 5               |
+      | NVL+IB               | False     | 16 / 16             |
+      | NVL+IB               | True      | 6 / 6               |
+      | IB                   | False     | 1 / 4               |
+      | IB                   | True      | 1 / 1               |
+      ----------------------------------------------------------
+    """
+    fsdp_double_buffer: bool = False
+    """If true, use persistently allocated double buffers for the
+      temporary memory needed in the Megatron FSDP communications.
+      This option causes additional memory overhead; however, it is necessary for
+      registering user buffers (nccl_ub=True) with Megatron FSDP.
+      This option will be automatically set to True when nccl_ub=True.
+   """
+    outer_dp_sharding_strategy: str = 'no_shard'
+    """
+    Sharding strategy for outer data parallel group in Hybrid Sharded Data Parallel (HSDP) mode.
+    Valid values are 'no_shard', 'optim', 'optim_grads', 'optim_grads_params'.
+    This option is only effective when Hybrid FSDP is enabled.
+    """
+    def __post_init__(self):
+        """Check the validity of the config."""
+        import os
+        if self.reuse_grad_buf_for_mxfp8_param_ag:
+            assert self.fp8_param_gather, "Reuse grad buffer only when keeping params in MXFP8."
+        if self.nccl_ub:
+            if 'expandable_segments:True' in os.getenv('PYTORCH_CUDA_ALLOC_CONF', '').split(','):
+                raise ValueError(
+                    "PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True is currently not supported "
+                    "with nccl_ub due to compatibility issue with torch.cuda.MemPool API."
+                )
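
Note on the `nccl_ub` check above: the test is an exact membership check on the comma-separated entries of `PYTORCH_CUDA_ALLOC_CONF`, so an entry with surrounding whitespace does not match. The sketch below reproduces that test in isolation. `has_expandable_segments` is a hypothetical helper that takes the value as an argument instead of reading the environment, and nothing here says how PyTorch itself parses such values.

```python
def has_expandable_segments(alloc_conf):
    """Apply the same membership test used in __post_init__ to a given string."""
    return "expandable_segments:True" in alloc_conf.split(",")


# Exact entry, alone or alongside another option: detected.
print(has_expandable_segments("expandable_segments:True"))                        # True
print(has_expandable_segments("max_split_size_mb:128,expandable_segments:True"))  # True

# A space after the comma changes the entry text, so the check does not fire.
print(has_expandable_segments("max_split_size_mb:128, expandable_segments:True"))  # False

# Unset variable (empty default string): not detected.
print(has_expandable_segments(""))                                                # False
```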