megatron-fsdp 0.1.0rc0__tar.gz

@@ -0,0 +1,128 @@
1
+ Metadata-Version: 2.4
2
+ Name: megatron-fsdp
3
+ Version: 0.1.0rc0
4
+ Summary: **Megatron-FSDP** is an NVIDIA-developed PyTorch extension that provides a high-performance implementation of Fully Sharded Data Parallelism (FSDP)
5
+ Author-email: NVIDIA <nemo-toolkit@nvidia.com>
6
+ Maintainer-email: NVIDIA <nemo-toolkit@nvidia.com>
7
+ License: Apache 2.0
8
+ Project-URL: Download, https://github.com/NVIDIA/Megatron-LM/releases
9
+ Project-URL: Homepage, https://github.com/NVIDIA/Megatron-LM/megatron/core
10
+ Keywords: NLP,NLU,deep,gpu,language,learning,learning,machine,nvidia,pytorch,torch,transformer
11
+ Classifier: Development Status :: 5 - Production/Stable
12
+ Classifier: Environment :: Console
13
+ Classifier: Intended Audience :: Developers
14
+ Classifier: Intended Audience :: Information Technology
15
+ Classifier: Intended Audience :: Science/Research
16
+ Classifier: License :: OSI Approved :: Apache Software License
17
+ Classifier: Natural Language :: English
18
+ Classifier: Operating System :: OS Independent
19
+ Classifier: Programming Language :: Python :: 3
20
+ Classifier: Programming Language :: Python :: 3.8
21
+ Classifier: Programming Language :: Python :: 3.9
22
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
23
+ Classifier: Topic :: Scientific/Engineering :: Image Recognition
24
+ Classifier: Topic :: Scientific/Engineering :: Mathematics
25
+ Classifier: Topic :: Scientific/Engineering
26
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
27
+ Classifier: Topic :: Software Development :: Libraries
28
+ Classifier: Topic :: Utilities
29
+ Requires-Python: >=3.10
30
+ Description-Content-Type: text/markdown
31
+ Requires-Dist: torch
32
+ Requires-Dist: numpy<2.0.0
33
+ Requires-Dist: packaging
34
+
35
+ <div align="center">
36
+
37
+ # 🚀 Megatron-FSDP
38
+
39
+ </div>
40
+
41
+ <div align="center">
42
+
43
+ [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/release/python-3100/)
44
+
45
+ </div>
46
+
47
+ ## ✨ What is Megatron-FSDP?
48
+
49
+ **Megatron-FSDP** is an NVIDIA-developed PyTorch extension that provides a high-performance implementation of Fully Sharded Data Parallelism (FSDP). It offers seamless cross-compatibility with major deep learning frameworks and parallelism libraries, making it easy to scale your PyTorch models across multiple GPUs and nodes.
50
+
51
+ Megatron-FSDP can provide up to a 25% speedup and 23% memory savings compared to FSDP2.
52
+
53
+ ### Compatibility
54
+
55
+ - **[PyTorch DTensor](https://docs.pytorch.org/docs/stable/distributed.tensor.html)**
56
+ - **[Megatron Core](https://github.com/NVIDIA/Megatron-LM)**
57
+ - **[TransformerEngine](https://github.com/NVIDIA/TransformerEngine)**
58
+
59
+ ## ✨ Features
60
+
61
+ - **Easy Integration**: Simple `fully_shard` function for quick model parallelization
62
+ - **High Performance**: Optimized for NVIDIA GPUs with efficient memory management
63
+ - **Cross-Framework**: Works seamlessly with PyTorch, Hugging Face Transformers, Megatron-LM, Megatron Bridge, and TransformerEngine
64
+ - **Scalable**: Supports both single-node multi-GPU and multi-node distributed training
65
+ - **Flexible Configuration**: Configurable sharding strategies and process groups
66
+
67
+ ## ⚡ Optimizations
68
+
69
+ - **Advanced Bucketing**: Data-type aware bucketing system to minimize the overhead of collective operations
70
+ - **Buffer Management**: Zero-copy communication is achieved by reorganizing the storage of parameters and main gradients with the `ParamAndGradBuffer` class
71
+ - **Communication Overlapping**: Improved communication overlap of parameter all-gather and gradient reduce-scatter
72
+ - **User-Buffer-Registration NCCL communication**: Offload NCCL collective communication to NVL/IB SHARP to reduce GPU SM usage for communication
73
+ - **FP8 Mixed Precision with Transformer Engine**: Compatibility with Transformer Engine enables efficient FP8 mixed precision training
74
+ - **Gradient accumulation fusion support with Transformer Engine**: Remove the explicit gradient copy to the communication buffer in the backward pass
75
+
76
+ <!-- ## 📊 Performance -->
77
+
78
+ <!-- ## 📦 Installation -->
79
+
80
+ ## 🚀 Quick Start
81
+
82
+ ### Basic Usage
83
+
84
+ Transform your PyTorch model to use Fully Sharded Data Parallelism with just a few lines:
85
+
86
+ ```python
87
+ import torch
88
+ from megatron_fsdp import fully_shard
89
+
90
+ # Your existing model and optimizer
91
+ model = YourModel()
92
+ optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
93
+
94
+ # Enable FSDP with Megatron-FSDP
95
+ model, optimizer = fully_shard(
96
+     model,
97
+     optimizer,
98
+     fsdp_unit_modules=[YourTransformerBlock],  # Modules to shard
99
+ )
100
+
101
+ # Your model is now ready for distributed training!
102
+ ```
103
+
104
+ ### Comparison with FSDP2
105
+
106
+ Megatron-FSDP shards a model through a `fully_shard` function similar to FSDP2's, with two differences:
107
+
108
+ - There is no need to call `fully_shard` on each submodule.
109
+ - The sharding change is a one-liner.
110
+
111
+ Here is an FSDP2 usage example for comparison:
112
+
113
+ ```python
114
+ import torch
115
+ from torch.distributed.fsdp import fully_shard
116
+
117
+ # Your existing model and optimizer
118
+ model = YourModel()
119
+ optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
120
+
121
+ # Enable FSDP with FSDP2
122
+ for module in model.modules():
123
+     if isinstance(module, YourTransformerBlock):  # Sub-modules to shard
124
+         fully_shard(module)
125
+ fully_shard(model)
126
+
127
+ # Your model is now ready for distributed training!
128
+ ```
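
Note on the README content above: it names parameter all-gather and gradient reduce-scatter as the collectives Megatron-FSDP overlaps with compute. As a rough illustration of what those two collectives do in any fully sharded scheme, here is a single-process sketch in plain Python. The helper names (`shard`, `all_gather`, `reduce_scatter`) are hypothetical and simulate ranks with lists; this is not Megatron-FSDP's implementation and makes no `torch.distributed` calls.

```python
from typing import List


def shard(flat: List[float], world_size: int) -> List[List[float]]:
    """Split a flat list into equal shards, zero-padding the tail (assumed layout)."""
    shard_len = -(-len(flat) // world_size)  # ceiling division
    padded = flat + [0.0] * (shard_len * world_size - len(flat))
    return [padded[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]


def all_gather(shards: List[List[float]], numel: int) -> List[float]:
    """Concatenate every rank's shard and drop the padding."""
    full = [x for s in shards for x in s]
    return full[:numel]


def reduce_scatter(per_rank_grads: List[List[float]], world_size: int) -> List[List[float]]:
    """Sum full gradients element-wise across ranks, then give each rank its shard."""
    summed = [sum(vals) for vals in zip(*per_rank_grads)]
    return shard(summed, world_size)


world_size = 2
params = [1.0, 2.0, 3.0, 4.0, 5.0]

# Each simulated rank stores only its shard of the parameters.
param_shards = shard(params, world_size)

# Before compute, the full parameters are rebuilt from the shards.
gathered = all_gather(param_shards, len(params))

# After backward, every rank holds a full local gradient; reduce-scatter
# leaves each rank with only its shard of the summed gradient.
grads = [[0.1 * (r + 1)] * len(params) for r in range(world_size)]
grad_shards = reduce_scatter(grads, world_size)
```

In a real run each shard lives on a different GPU, and the library decides when to issue these collectives so they overlap with forward and backward compute.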
@@ -0,0 +1,94 @@
1
+ <div align="center">
2
+
3
+ # 🚀 Megatron-FSDP
4
+
5
+ </div>
6
+
7
+ <div align="center">
8
+
9
+ [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/release/python-3100/)
10
+
11
+ </div>
12
+
13
+ ## ✨ What is Megatron-FSDP?
14
+
15
+ **Megatron-FSDP** is an NVIDIA-developed PyTorch extension that provides a high-performance implementation of Fully Sharded Data Parallelism (FSDP). It offers seamless cross-compatibility with major deep learning frameworks and parallelism libraries, making it easy to scale your PyTorch models across multiple GPUs and nodes.
16
+
17
+ Megatron-FSDP can provide up to a 25% speedup and 23% memory savings compared to FSDP2.
18
+
19
+ ### Compatibility
20
+
21
+ - **[PyTorch DTensor](https://docs.pytorch.org/docs/stable/distributed.tensor.html)**
22
+ - **[Megatron Core](https://github.com/NVIDIA/Megatron-LM)**
23
+ - **[TransformerEngine](https://github.com/NVIDIA/TransformerEngine)**
24
+
25
+ ## ✨ Features
26
+
27
+ - **Easy Integration**: Simple `fully_shard` function for quick model parallelization
28
+ - **High Performance**: Optimized for NVIDIA GPUs with efficient memory management
29
+ - **Cross-Framework**: Works seamlessly with PyTorch, Hugging Face Transformers, Megatron-LM, Megatron Bridge, and TransformerEngine
30
+ - **Scalable**: Supports both single-node multi-GPU and multi-node distributed training
31
+ - **Flexible Configuration**: Configurable sharding strategies and process groups
32
+
33
+ ## ⚡ Optimizations
34
+
35
+ - **Advanced Bucketing**: Data-type aware bucketing system to minimize the overhead of collective operations
36
+ - **Buffer Management**: Zero-copy communication is achieved by reorganizing the storage of parameters and main gradients with the `ParamAndGradBuffer` class
37
+ - **Communication Overlapping**: Improved communication overlap of parameter all-gather and gradient reduce-scatter
38
+ - **User-Buffer-Registration NCCL communication**: Offload NCCL collective communication to NVL/IB SHARP to reduce GPU SM usage for communication
39
+ - **FP8 Mixed Precision with Transformer Engine**: Compatibility with Transformer Engine enables efficient FP8 mixed precision training
40
+ - **Gradient accumulation fusion support with Transformer Engine**: Remove the explicit gradient copy to the communication buffer in the backward pass
41
+
42
+ <!-- ## 📊 Performance -->
43
+
44
+ <!-- ## 📦 Installation -->
45
+
46
+ ## 🚀 Quick Start
47
+
48
+ ### Basic Usage
49
+
50
+ Transform your PyTorch model to use Fully Sharded Data Parallelism with just a few lines:
51
+
52
+ ```python
53
+ import torch
54
+ from megatron_fsdp import fully_shard
55
+
56
+ # Your existing model and optimizer
57
+ model = YourModel()
58
+ optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
59
+
60
+ # Enable FSDP with Megatron-FSDP
61
+ model, optimizer = fully_shard(
62
+     model,
63
+     optimizer,
64
+     fsdp_unit_modules=[YourTransformerBlock],  # Modules to shard
65
+ )
66
+
67
+ # Your model is now ready for distributed training!
68
+ ```
69
+
70
+ ### Comparison with FSDP2
70
+
71
+ Megatron-FSDP shards a model through a `fully_shard` function similar to FSDP2's, with two differences:
72
+
73
+ - There is no need to call `fully_shard` on each submodule.
74
+ - The sharding change is a one-liner.
75
+
76
+ Here is an FSDP2 usage example for comparison:
78
+
79
+ ```python
80
+ import torch
81
+ from torch.distributed.fsdp import fully_shard
82
+
83
+ # Your existing model and optimizer
84
+ model = YourModel()
85
+ optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
86
+
87
+ # Enable FSDP with FSDP2
88
+ for module in model.modules():
89
+     if isinstance(module, YourTransformerBlock):  # Sub-modules to shard
90
+         fully_shard(module)
91
+ fully_shard(model)
92
+
93
+ # Your model is now ready for distributed training!
94
+ ```
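
Note on the "Advanced Bucketing" bullet in the README above: it describes a data-type aware bucketing system but does not show how buckets are formed. The sketch below illustrates one plausible greedy scheme, assuming parameters are visited in order and a bucket never mixes dtypes. The function name `bucket_by_dtype`, the `(name, dtype, numel)` tuples, and the `max_elems` limit are all hypothetical; the package's actual bucketing logic may differ.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (parameter name, dtype label, number of elements); illustrative only.
Param = Tuple[str, str, int]


def bucket_by_dtype(params: List[Param], max_elems: int) -> Dict[str, List[List[str]]]:
    """Greedily pack parameters into per-dtype buckets of at most max_elems elements."""
    buckets: Dict[str, List[List[str]]] = defaultdict(list)
    fill: Dict[str, int] = {}  # elements already in the open bucket of each dtype
    for name, dtype, numel in params:
        # Open a new bucket when this dtype has none yet or the open one would overflow.
        if dtype not in fill or fill[dtype] + numel > max_elems:
            buckets[dtype].append([])
            fill[dtype] = 0
        buckets[dtype][-1].append(name)
        fill[dtype] += numel
    return dict(buckets)


params = [
    ("embed.weight", "bf16", 600),
    ("layer0.weight", "bf16", 500),
    ("layer0.scale", "fp32", 10),
    ("layer1.weight", "bf16", 400),
    ("layer1.scale", "fp32", 10),
]
buckets = bucket_by_dtype(params, max_elems=1000)
```

Keeping one dtype per bucket means each collective can run on a single contiguous buffer without casting, which is the overhead the bullet refers to.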
@@ -0,0 +1,53 @@
1
+ # Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ from .distributed_data_parallel_config import DistributedDataParallelConfig
16
+ from .megatron_fsdp import MegatronFSDP
17
+ from .package_info import (
18
+     __contact_emails__,
19
+     __contact_names__,
20
+     __description__,
21
+     __download_url__,
22
+     __homepage__,
23
+     __keywords__,
24
+     __license__,
25
+     __package_name__,
26
+     __repository_url__,
27
+     __shortversion__,
28
+     __version__,
29
+ )
30
+ from .utils import FSDPDistributedIndex
31
+
32
+ try:
33
+     from .fully_shard import fully_shard
34
+ except ImportError as e:
35
+     print(f"Failed to import fully_shard: {e}")
36
+
37
+ __all__ = [
38
+     "DistributedDataParallelConfig",
39
+     "MegatronFSDP",
40
+     "FSDPDistributedIndex",
41
+     "fully_shard",
42
+     "__contact_emails__",
43
+     "__contact_names__",
44
+     "__description__",
45
+     "__download_url__",
46
+     "__homepage__",
47
+     "__keywords__",
48
+     "__license__",
49
+     "__package_name__",
50
+     "__repository_url__",
51
+     "__shortversion__",
52
+     "__version__",
53
+ ]
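
Note on the hunk above: `fully_shard` is listed in `__all__` unconditionally, so if the guarded import fails, `from megatron_fsdp import *` would raise `AttributeError` for the missing name. One possible alternative, sketched here with standard-library calls only, is to export an optional name only when its import succeeds and to log the failure instead of printing it. The helper `optional_import` and the module name `hypothetical_missing_module_xyz` are made up for illustration, and `math.sqrt` stands in for a symbol that is always importable.

```python
import importlib
import logging
from typing import Any, List, Optional

logger = logging.getLogger(__name__)


def optional_import(module_name: str, attr: str) -> Optional[Any]:
    """Return module_name.attr, or None (with a logged warning) if the import fails."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        logger.warning("Failed to import %s from %s: %s", attr, module_name, exc)
        return None
    return getattr(module, attr, None)


exported: List[str] = []  # plays the role of __all__ in this sketch

# An import that succeeds: the name is exported.
sqrt = optional_import("math", "sqrt")
if sqrt is not None:
    exported.append("sqrt")

# An import that fails: the name is left out of the export list.
missing = optional_import("hypothetical_missing_module_xyz", "fully_shard")
if missing is not None:
    exported.append("fully_shard")
```

Whether the package should fail softly here at all is a design choice; raising the original `ImportError` at first use is another reasonable option.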
@@ -0,0 +1,141 @@
1
+ # Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
2
+
3
+ from dataclasses import dataclass
4
+ from typing import Optional
5
+
6
+
7
+ @dataclass
8
+ class DistributedDataParallelConfig:
9
+     """Configuration for DistributedDataParallel."""
10
+
11
+     grad_reduce_in_fp32: bool = False
12
+     """If true, reduce grads in fp32."""
13
+
14
+     overlap_grad_reduce: bool = False
15
+     """If true, overlap grad all-reduce / reduce-scatter with backward compute."""
16
+
17
+     overlap_param_gather: bool = False
18
+     """If true, overlap param all-gather with forward compute."""
19
+
20
+     align_param_gather: bool = False
21
+     """If true, all PP stages will launch param all-gathers simultaneously. Otherwise, each
22
+     PP stage will independently launch as needed.
23
+     """
24
+
25
+     use_distributed_optimizer: bool = False
26
+     """If true, issue reduce-scatter collectives to aggregate gradients and clean up
27
+     originally allocated model parameters, otherwise issue all-reduce collectives.
28
+     """
29
+
30
+     num_distributed_optimizer_instances: int = 1
31
+     """Sets the factor by which the DP domain is sharded to have the partial DistOpt
32
+     enabled. Defaults to 1, which means DistOpt is across entire DP domain.
33
+     """
34
+
35
+     check_for_nan_in_grad: bool = False
36
+     """If true, check for NaNs and Infs in gradients _before_ communication collective."""
37
+
38
+     check_for_large_grads: bool = False
39
+     """If true, check for unexpectedly large gradients _before_ communication collective."""
40
+
41
+     bucket_size: Optional[int] = None
42
+     """Maximum number of parameters in each bucket. If unspecified, MCore uses a default
43
+     value of max(40000000, 1000000 * dp_size) parameters (larger DP sizes need larger
44
+     buckets to ensure collectives do not become latency-bound)."""
45
+
46
+     pad_buckets_for_high_nccl_busbw: bool = False
47
+     """If true, make sure the bucket size is divisible by a large power of 2 (2^16) to
48
+     ensure NCCL collectives have high bus bandwidth at large DP counts, since NCCL
49
+     message size (which for ring algorithms is bucket_size / dp_size) apparently needs
50
+     to be divisible by a power of 2 for high busbw."""
51
+
52
+     average_in_collective: bool = False
53
+     """If true, compute average in collective directly, as opposed to dividing by the
54
+     dp_size first and then computing sum in the collective."""
55
+
56
+     fp8_param_gather: bool = False
57
+     """If true, keep the compute param in fp8 (do not use any other intermediate dtype) and
58
+     perform the param all-gather in fp8."""
59
+
60
+     reuse_grad_buf_for_mxfp8_param_ag: bool = False
61
+     """If true, reuse the grad buffer for param AG when using mxfp8 recipe. Should be
62
+     set to True only when fp8_recipe is mxfp8 and fp8_param_gather is True."""
63
+
64
+     use_megatron_fsdp: bool = False
65
+     """If true, use the FSDP code path for DDP."""
66
+
67
+     use_custom_fsdp: bool = False
68
+     """
69
+     NOTE: The flag `use_custom_fsdp` is deprecated and will be removed in future versions.
70
+     Please use `use_megatron_fsdp` instead, as all functionality will be migrated there.
71
+     Future updates will drop support for `use_custom_fsdp` to avoid confusion.
72
+     """
73
+
74
+     data_parallel_sharding_strategy: str = 'no_shard'
75
+     """Sharding strategy for FSDP. Valid values are 'no_shard', 'optim',
76
+     'optim_grads', 'optim_grads_params'."""
77
+
78
+     gradient_reduce_div_fusion: bool = True
79
+     """If true, perform gradient reduce and division fusion."""
80
+
81
+     suggested_communication_unit_size: Optional[int] = None
82
+     """Specifies the number of elements to communicate at once during
83
+     FSDP (Fully Sharded Data Parallel) operations.
84
+     This flag also affects FSDP all-gather prefetch behavior. Setting a larger
85
+     value increases the communication buffer size, while a smaller value
86
+     disables prefetching and may degrade performance. Adjust this value
87
+     based on your system's memory and performance requirements."""
88
+
89
+     preserve_fp32_weights: bool = True
90
+     """If true, preserve fp32 weights in the Megatron FSDP ParamAndGradBuffer."""
91
+
92
+     keep_fp8_transpose_cache: bool = False
93
+     """If true, keep the fp8 transpose cache when using Megatron FSDP."""
94
+
95
+     nccl_ub: bool = False
96
+     """If true, allocate and register NCCL user buffers for the param and grad buffers.
97
+     This flag enables an SM-efficient NCCL algorithm that could improve the performance
98
+     of FSDP and DP with comm_overlap. It is much more effective when used
99
+     together with SHARP.
100
+     The following table shows the expected SM usage for various cases.
101
+     (Note that these are only reference numbers; actual SM usage can vary
102
+     with message size, communication domain size, and NCCL version.)
103
+     ----------------------------------------------------------
104
+     | Communication domain | use_sharp | SM usage of "AG/RS" |
105
+     |----------------------|-----------|---------------------|
106
+     | NVL                  | N/A       | 4 / 5               |
107
+     | NVL+IB               | False     | 16 / 16             |
108
+     | NVL+IB               | True      | 6 / 6               |
109
+     | IB                   | False     | 1 / 4               |
110
+     | IB                   | True      | 1 / 1               |
111
+     ----------------------------------------------------------
112
+     """
113
+
114
+     fsdp_double_buffer: bool = False
115
+     """If true, use persistently allocated double buffers for the
116
+     temporary memory needed in the Megatron FSDP communications.
117
+     This option causes additional memory overhead; however, it is necessary
118
+     for registering user buffers (nccl_ub=True) with Megatron FSDP.
119
+     This option will be automatically set to True when nccl_ub=True.
120
+     """
121
+
122
+     outer_dp_sharding_strategy: str = 'no_shard'
123
+     """
124
+     Sharding strategy for outer data parallel group in Hybrid Sharded Data Parallel (HSDP) mode.
125
+     Valid values are 'no_shard', 'optim', 'optim_grads', 'optim_grads_params'.
126
+     This option is only effective when Hybrid FSDP is enabled.
127
+     """
128
+
129
+     def __post_init__(self):
130
+         """Check the validity of the config."""
131
+         import os
132
+
133
+         if self.reuse_grad_buf_for_mxfp8_param_ag:
134
+             assert self.fp8_param_gather, "Reuse grad buffer only when keeping params in MXFP8."
135
+
136
+         if self.nccl_ub:
137
+             if 'expandable_segments:True' in os.getenv('PYTORCH_CUDA_ALLOC_CONF', '').split(','):
138
+                 raise ValueError(
139
+                     "PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True is currently not supported "
140
+                     "with nccl_ub due to a compatibility issue with the torch.cuda.MemPool API."
141
+                 )
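
Note on the `bucket_size` and `pad_buckets_for_high_nccl_busbw` docstrings in the hunk above: they describe two pieces of arithmetic, the default `max(40000000, 1000000 * dp_size)` and divisibility by 2^16. The sketch below works through both. The function names are hypothetical, and rounding *up* to the next multiple of 2^16 is an assumption, since the docstring only says the bucket size must be divisible by that value.

```python
from typing import Optional

PAD_MULTIPLE = 2 ** 16  # the "large power of 2" named in the docstring


def default_bucket_size(dp_size: int) -> int:
    """Default from the bucket_size docstring: max(40000000, 1000000 * dp_size)."""
    return max(40_000_000, 1_000_000 * dp_size)


def resolve_bucket_size(bucket_size: Optional[int], dp_size: int, pad: bool) -> int:
    """Pick the configured or default size, optionally rounded up to a multiple of 2**16."""
    size = default_bucket_size(dp_size) if bucket_size is None else bucket_size
    if pad:
        # Ceiling division, then scale back up (assumed rounding direction).
        size = -(-size // PAD_MULTIPLE) * PAD_MULTIPLE
    return size


small_dp = default_bucket_size(8)    # the 40M floor applies
large_dp = default_bucket_size(64)   # the dp_size term dominates
padded = resolve_bucket_size(None, dp_size=8, pad=True)
```

With `dp_size=8` the floor of 40,000,000 applies, while at `dp_size=64` the default grows to 64,000,000, matching the docstring's point that larger DP sizes need larger buckets.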