megatron-core 0.10.0__tar.gz → 0.12.0__tar.gz
This diff shows the changes between these two package versions as published to their public registries, and is provided for informational purposes only.
- {megatron_core-0.10.0 → megatron_core-0.12.0}/LICENSE +5 -4
- {megatron_core-0.10.0/megatron_core.egg-info → megatron_core-0.12.0}/PKG-INFO +75 -13
- {megatron_core-0.10.0 → megatron_core-0.12.0}/README.md +62 -5
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/__init__.py +1 -0
- megatron_core-0.12.0/megatron/core/config.py +3 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/blended_dataset.py +3 -3
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/blended_megatron_dataset_builder.py +99 -48
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/blended_megatron_dataset_config.py +3 -8
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/dist_checkpointing/__init__.py +1 -1
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/dist_checkpointing/exchange_utils.py +110 -79
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/dist_checkpointing/mapping.py +7 -5
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/dist_checkpointing/serialization.py +5 -4
- megatron_core-0.12.0/megatron/core/dist_checkpointing/state_dict_utils.py +112 -0
- megatron_core-0.12.0/megatron/core/dist_checkpointing/strategies/async_utils.py +561 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/dist_checkpointing/strategies/base.py +2 -1
- megatron_core-0.12.0/megatron/core/dist_checkpointing/strategies/cached_metadata_filesystem_reader.py +38 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/dist_checkpointing/strategies/common.py +3 -3
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/dist_checkpointing/strategies/filesystem_async.py +113 -49
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/dist_checkpointing/strategies/fully_parallel.py +172 -96
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/dist_checkpointing/strategies/resharding.py +15 -12
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/dist_checkpointing/strategies/state_dict_saver.py +90 -5
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/dist_checkpointing/strategies/torch.py +170 -53
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/dist_checkpointing/strategies/two_stage.py +20 -6
- megatron_core-0.12.0/megatron/core/dist_checkpointing/tensor_aware_state_dict.py +347 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/dist_checkpointing/utils.py +101 -1
- megatron_core-0.12.0/megatron/core/distributed/custom_fsdp/__init__.py +3 -0
- megatron_core-0.12.0/megatron/core/distributed/custom_fsdp/fully_sharded_data_parallel.py +749 -0
- megatron_core-0.12.0/megatron/core/distributed/custom_fsdp/param_and_grad_buffer.py +2055 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/distributed/data_parallel_base.py +2 -2
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/distributed/distributed_data_parallel.py +64 -22
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/distributed/distributed_data_parallel_config.py +34 -1
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/distributed/finalize_model_grads.py +55 -8
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/distributed/param_and_grad_buffer.py +85 -41
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/distributed/torch_fully_sharded_data_parallel.py +37 -11
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/enums.py +10 -0
- megatron_core-0.12.0/megatron/core/export/model_type.py +8 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/export/trtllm/engine_builder/trtllm_engine_builder.py +6 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/export/trtllm/model_to_trllm_mapping/default_conversion_dict.py +14 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/export/trtllm/trt_model_config.py +1 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/export/trtllm/trt_model_type.py +1 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/export/trtllm/trtllm_helper.py +20 -2
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/export/trtllm/trtllm_layers.py +13 -1
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/export/trtllm/trtllm_weights_converter/distributed_trtllm_model_weights_converter.py +10 -4
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/export/trtllm/trtllm_weights_converter/single_device_trtllm_model_weights_converter.py +17 -5
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/extensions/transformer_engine.py +240 -93
- megatron_core-0.12.0/megatron/core/fp8_utils.py +456 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/fusions/fused_cross_entropy.py +28 -23
- megatron_core-0.12.0/megatron/core/inference/async_stream.py +67 -0
- megatron_core-0.12.0/megatron/core/inference/common_inference_params.py +4 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/inference/communication_utils.py +4 -0
- megatron_core-0.12.0/megatron/core/inference/contexts/__init__.py +11 -0
- megatron_core-0.12.0/megatron/core/inference/contexts/base_context.py +20 -0
- megatron_core-0.12.0/megatron/core/inference/contexts/dynamic_context.py +1022 -0
- megatron_core-0.12.0/megatron/core/inference/contexts/static_context.py +133 -0
- megatron_core-0.12.0/megatron/core/inference/engines/__init__.py +5 -0
- megatron_core-0.12.0/megatron/core/inference/engines/dynamic_engine.py +182 -0
- megatron_core-0.12.0/megatron/core/inference/engines/mcore_engine.py +5 -0
- megatron_core-0.12.0/megatron/core/inference/engines/static_engine.py +245 -0
- megatron_core-0.12.0/megatron/core/inference/inference_request.py +64 -0
- megatron_core-0.12.0/megatron/core/inference/model_inference_wrappers/abstract_model_inference_wrapper.py +364 -0
- megatron_core-0.12.0/megatron/core/inference/model_inference_wrappers/gpt/gpt_inference_wrapper.py +135 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/inference/model_inference_wrappers/inference_wrapper_config.py +6 -0
- megatron_core-0.12.0/megatron/core/inference/model_inference_wrappers/multimodal/vlm_inference_wrapper.py +218 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/inference/model_inference_wrappers/t5/t5_inference_wrapper.py +72 -58
- megatron_core-0.12.0/megatron/core/inference/modelopt_support/__init__.py +10 -0
- megatron_core-0.12.0/megatron/core/inference/modelopt_support/gpt/__init__.py +8 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/inference/modelopt_support/gpt/model_specs.py +9 -4
- megatron_core-0.12.0/megatron/core/inference/modelopt_support/mamba/__init__.py +1 -0
- megatron_core-0.12.0/megatron/core/inference/modelopt_support/mamba/model_specs.py +89 -0
- megatron_core-0.12.0/megatron/core/inference/sampling_params.py +38 -0
- megatron_core-0.12.0/megatron/core/inference/scheduler.py +193 -0
- megatron_core-0.12.0/megatron/core/inference/text_generation_controllers/encoder_decoder_text_generation_controller.py +38 -0
- megatron_core-0.12.0/megatron/core/inference/text_generation_controllers/simple_text_generation_controller.py +5 -0
- megatron_core-0.12.0/megatron/core/inference/text_generation_controllers/text_generation_controller.py +819 -0
- megatron_core-0.12.0/megatron/core/inference/text_generation_controllers/vlm_text_generation_controller.py +40 -0
- megatron_core-0.12.0/megatron/core/inference_params.py +5 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/model_parallel_config.py +10 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/T5/t5_model.py +84 -11
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/bert/bert_model.py +29 -13
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/common/embeddings/__init__.py +1 -1
- megatron_core-0.12.0/megatron/core/models/common/embeddings/relative_pos_embedding.py +179 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/common/embeddings/rope_utils.py +12 -2
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/common/embeddings/rotary_pos_embedding.py +107 -10
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/common/language_module/language_module.py +81 -8
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/gpt/gpt_layer_specs.py +177 -73
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/gpt/gpt_model.py +191 -17
- megatron_core-0.12.0/megatron/core/models/gpt/heterogeneous/heterogeneous_layer_specs.py +209 -0
- megatron_core-0.12.0/megatron/core/models/gpt/moe_module_specs.py +81 -0
- megatron_core-0.12.0/megatron/core/models/huggingface/__init__.py +2 -0
- megatron_core-0.12.0/megatron/core/models/huggingface/clip_model.py +26 -0
- megatron_core-0.12.0/megatron/core/models/huggingface/module.py +63 -0
- megatron_core-0.12.0/megatron/core/models/huggingface/qwen_model.py +42 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/mamba/mamba_layer_specs.py +2 -1
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/mamba/mamba_model.py +32 -14
- megatron_core-0.12.0/megatron/core/models/multimodal/context_parallel.py +99 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/multimodal/llava_model.py +297 -216
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/multimodal/llava_spec.py +8 -6
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/retro/config.py +3 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/retro/decoder_attention.py +18 -9
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/retro/encoder_attention.py +8 -3
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/retro/model.py +13 -5
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/vision/clip_vit_model.py +30 -7
- megatron_core-0.12.0/megatron/core/models/vision/radio.py +325 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/vision/vit_layer_specs.py +1 -1
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/optimizer/__init__.py +169 -36
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/optimizer/clip_grads.py +21 -9
- megatron_core-0.12.0/megatron/core/optimizer/cpu_offloading/__init__.py +2 -0
- megatron_core-0.12.0/megatron/core/optimizer/cpu_offloading/hybrid_optimizer.py +465 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/optimizer/distrib_optimizer.py +515 -224
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/optimizer/optimizer.py +253 -121
- megatron_core-0.12.0/megatron/core/optimizer/optimizer_config.py +212 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/package_info.py +1 -1
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/parallel_state.py +282 -58
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/pipeline_parallel/p2p_communication.py +20 -3
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/pipeline_parallel/schedules.py +186 -70
- megatron_core-0.12.0/megatron/core/post_training/__init__.py +1 -0
- megatron_core-0.12.0/megatron/core/post_training/modelopt/__init__.py +10 -0
- megatron_core-0.12.0/megatron/core/post_training/modelopt/gpt/model_specs.py +253 -0
- megatron_core-0.12.0/megatron/core/post_training/modelopt/gpt/state_dict_hooks.py +133 -0
- megatron_core-0.12.0/megatron/core/post_training/modelopt/layers.py +246 -0
- megatron_core-0.12.0/megatron/core/post_training/modelopt/mamba/__init__.py +1 -0
- megatron_core-0.12.0/megatron/core/post_training/modelopt/mamba/model_specs.py +90 -0
- megatron_core-0.12.0/megatron/core/process_groups_config.py +113 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/rerun_state_machine.py +241 -65
- megatron_core-0.12.0/megatron/core/ssm/__init__.py +1 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/ssm/mamba_block.py +120 -50
- megatron_core-0.12.0/megatron/core/ssm/mamba_config.py +22 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/ssm/mamba_layer.py +64 -11
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/ssm/mamba_mixer.py +41 -14
- megatron_core-0.12.0/megatron/core/ssm/mlp_layer.py +25 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/tensor_parallel/__init__.py +2 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/tensor_parallel/layers.py +16 -0
- megatron_core-0.12.0/megatron/core/tensor_parallel/random.py +575 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/timers.py +35 -7
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/transformer/attention.py +294 -93
- megatron_core-0.12.0/megatron/core/transformer/cuda_graphs.py +916 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/transformer/enums.py +20 -0
- megatron_core-0.12.0/megatron/core/transformer/heterogeneous/heterogeneous_config.py +267 -0
- megatron_core-0.12.0/megatron/core/transformer/heterogeneous/linear_replacements.py +111 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/transformer/mlp.py +10 -2
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/transformer/moe/experts.py +157 -55
- megatron_core-0.12.0/megatron/core/transformer/moe/fused_a2a.py +202 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/transformer/moe/legacy_a2a_token_dispatcher.py +9 -6
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/transformer/moe/moe_layer.py +23 -26
- megatron_core-0.12.0/megatron/core/transformer/moe/moe_utils.py +722 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/transformer/moe/router.py +167 -33
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/transformer/moe/shared_experts.py +4 -5
- megatron_core-0.12.0/megatron/core/transformer/moe/token_dispatcher.py +997 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/transformer/multi_latent_attention.py +213 -88
- megatron_core-0.12.0/megatron/core/transformer/multi_token_prediction.py +737 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/transformer/torch_norm.py +49 -1
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/transformer/transformer_block.py +148 -142
- megatron_core-0.12.0/megatron/core/transformer/transformer_config.py +1114 -0
- megatron_core-0.12.0/megatron/core/transformer/transformer_layer.py +786 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/transformer/utils.py +9 -2
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/utils.py +349 -31
- {megatron_core-0.10.0 → megatron_core-0.12.0/megatron_core.egg-info}/PKG-INFO +75 -13
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron_core.egg-info/SOURCES.txt +51 -6
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron_core.egg-info/requires.txt +5 -2
- {megatron_core-0.10.0/requirements/pytorch:24.01 → megatron_core-0.12.0/requirements/pytorch_24.01}/requirements.txt +3 -2
- {megatron_core-0.10.0/requirements/pytorch:24.07 → megatron_core-0.12.0/requirements/pytorch_24.07}/requirements.txt +3 -1
- megatron_core-0.12.0/requirements/pytorch_24.10/requirements.txt +6 -0
- megatron_core-0.12.0/requirements/pytorch_25.03/requirements.txt +15 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/setup.py +7 -2
- megatron_core-0.10.0/megatron/core/dist_checkpointing/state_dict_transformation.py +0 -270
- megatron_core-0.10.0/megatron/core/dist_checkpointing/strategies/async_utils.py +0 -224
- megatron_core-0.10.0/megatron/core/export/model_type.py +0 -7
- megatron_core-0.10.0/megatron/core/inference/ammo_support/__init__.py +0 -8
- megatron_core-0.10.0/megatron/core/inference/ammo_support/gpt/model_specs.py +0 -2
- megatron_core-0.10.0/megatron/core/inference/ammo_support/gpt/state_dict_hooks.py +0 -5
- megatron_core-0.10.0/megatron/core/inference/common_inference_params.py +0 -29
- megatron_core-0.10.0/megatron/core/inference/engines/mcore_engine.py +0 -113
- megatron_core-0.10.0/megatron/core/inference/inference_request.py +0 -39
- megatron_core-0.10.0/megatron/core/inference/model_inference_wrappers/abstract_model_inference_wrapper.py +0 -238
- megatron_core-0.10.0/megatron/core/inference/model_inference_wrappers/gpt/gpt_inference_wrapper.py +0 -90
- megatron_core-0.10.0/megatron/core/inference/modelopt_support/__init__.py +0 -8
- megatron_core-0.10.0/megatron/core/inference/scheduler.py +0 -127
- megatron_core-0.10.0/megatron/core/inference/text_generation_controllers/encoder_decoder_text_generation_controller.py +0 -35
- megatron_core-0.10.0/megatron/core/inference/text_generation_controllers/simple_text_generation_controller.py +0 -402
- megatron_core-0.10.0/megatron/core/inference_params.py +0 -31
- megatron_core-0.10.0/megatron/core/models/multimodal/__init__.py +0 -1
- megatron_core-0.10.0/megatron/core/optimizer/optimizer_config.py +0 -116
- megatron_core-0.10.0/megatron/core/tensor_parallel/random.py +0 -314
- megatron_core-0.10.0/megatron/core/transformer/cuda_graphs.py +0 -313
- megatron_core-0.10.0/megatron/core/transformer/moe/__init__.py +0 -0
- megatron_core-0.10.0/megatron/core/transformer/moe/moe_utils.py +0 -407
- megatron_core-0.10.0/megatron/core/transformer/moe/token_dispatcher.py +0 -594
- megatron_core-0.10.0/megatron/core/transformer/transformer_config.py +0 -637
- megatron_core-0.10.0/megatron/core/transformer/transformer_layer.py +0 -397
- {megatron_core-0.10.0 → megatron_core-0.12.0}/MANIFEST.in +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/README.md +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/config_logger.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/bert_dataset.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/gpt_dataset.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/helpers.cpp +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/helpers.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/indexed_dataset.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/masked_dataset.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/megatron_dataset.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/megatron_tokenizer.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/multimodal_dataset.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/retro/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/retro/config/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/retro/config/bert_embedders.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/retro/config/config.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/retro/config/gpt_chunk_datasets.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/retro/config/tokenizers.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/retro/db/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/retro/db/build.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/retro/db/dataset.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/retro/db/utils.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/retro/external_libs.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/retro/index/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/retro/index/build.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/retro/index/factory.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/retro/index/index.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/retro/index/indexes/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/retro/index/indexes/faiss_base.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/retro/index/indexes/faiss_par_add.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/retro/index/utils.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/retro/index/validate.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/retro/query/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/retro/query/gpt_chunk_dataset.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/retro/query/multi_split_gpt_dataset.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/retro/query/query.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/retro/query/retro_dataset.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/retro/query/utils.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/retro/utils.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/t5_dataset.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/utils.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/datasets/utils_s3.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/dist_checkpointing/core.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/dist_checkpointing/dict_utils.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/dist_checkpointing/optimizer.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/dist_checkpointing/strategies/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/dist_checkpointing/strategies/tensorstore.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/dist_checkpointing/strategies/zarr.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/dist_checkpointing/validation.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/distributed/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/export/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/export/data_type.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/export/export_config.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/export/trtllm/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/export/trtllm/engine_builder/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/export/trtllm/model_to_trllm_mapping/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/export/trtllm/trtllm_weights_converter/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/extensions/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/fusions/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/fusions/fused_bias_dropout.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/fusions/fused_bias_geglu.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/fusions/fused_bias_gelu.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/fusions/fused_bias_swiglu.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/fusions/fused_layer_norm.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/fusions/fused_softmax.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/inference/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/inference/engines/abstract_engine.py +0 -0
- {megatron_core-0.10.0/megatron/core/inference/engines → megatron_core-0.12.0/megatron/core/inference/model_inference_wrappers}/__init__.py +0 -0
- {megatron_core-0.10.0/megatron/core/inference/model_inference_wrappers → megatron_core-0.12.0/megatron/core/inference/model_inference_wrappers/gpt}/__init__.py +0 -0
- {megatron_core-0.10.0/megatron/core/inference/model_inference_wrappers/gpt → megatron_core-0.12.0/megatron/core/inference/model_inference_wrappers/t5}/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/inference/modelopt_support/gpt/state_dict_hooks.py +0 -0
- {megatron_core-0.10.0/megatron/core/inference/model_inference_wrappers/t5 → megatron_core-0.12.0/megatron/core/inference/text_generation_controllers}/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/inference/utils.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/jit.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/T5/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/T5/t5_spec.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/bert/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/bert/bert_layer_specs.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/bert/bert_lm_head.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/bert/pooler.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/common/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/common/embeddings/language_model_embedding.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/common/embeddings/yarn_rotary_pos_embedding.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/common/language_module/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/common/vision_module/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/common/vision_module/vision_module.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/gpt/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/mamba/__init__.py +0 -0
- {megatron_core-0.10.0/megatron/core/inference/modelopt_support/gpt → megatron_core-0.12.0/megatron/core/models/multimodal}/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/retro/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/retro/base_attention.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/retro/decoder_spec.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/retro/encoder_spec.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/retro/utils.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/vision/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/models/vision/multimodal_projector.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/num_microbatches_calculator.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/optimizer/grad_scaler.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/optimizer_param_scheduler.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/packed_seq_params.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/pipeline_parallel/__init__.py +0 -0
- {megatron_core-0.10.0/megatron/core/inference/text_generation_controllers → megatron_core-0.12.0/megatron/core/post_training/modelopt/gpt}/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/requirements.txt +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/ssm/mamba_hybrid_layer_allocation.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/ssm/triton_cache_manager.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/tensor_parallel/cross_entropy.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/tensor_parallel/data.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/tensor_parallel/mappings.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/tensor_parallel/utils.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/transformer/__init__.py +0 -0
- {megatron_core-0.10.0/megatron/core/ssm → megatron_core-0.12.0/megatron/core/transformer/custom_layers}/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/transformer/custom_layers/transformer_engine.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/transformer/dot_product_attention.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/transformer/identity_op.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/transformer/module.py +0 -0
- {megatron_core-0.10.0/megatron/core/transformer/custom_layers → megatron_core-0.12.0/megatron/core/transformer/moe}/__init__.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/transformer/moe/grouped_gemm_util.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/transformer/moe/upcycling_utils.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/transformer/spec_utils.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron/core/transformer/torch_layer_norm.py +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron_core.egg-info/dependency_links.txt +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/megatron_core.egg-info/top_level.txt +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/pyproject.toml +0 -0
- {megatron_core-0.10.0 → megatron_core-0.12.0}/setup.cfg +0 -0
--- megatron_core-0.10.0/LICENSE
+++ megatron_core-0.12.0/LICENSE
@@ -36,8 +36,8 @@ OpenAI). Files from these organizations have notices at the top of each file.
 Below are licenses used in those files, as indicated.


-
--- LICENSE FOR Facebook, huggingface, Google Research, LLaVA, and
+--------------------------------------------------------------------------------------
+-- LICENSE FOR Facebook, huggingface, Google Research, LLaVA, Mamba, and vLLM code --


     Apache License
@@ -247,8 +247,9 @@ LICENSE FOR
 Facebook, Inc. and its affiliates,
 Meta Platforms, Inc. and its affiliates,
 Microsoft Corporation,
-OpenGVLab/InternVL,
-Triton language and compiler
+OpenGVLab/InternVL,
+Triton language and compiler,
+and DeepSeek.

 MIT License

--- megatron_core-0.10.0/megatron_core.egg-info/PKG-INFO
+++ megatron_core-0.12.0/PKG-INFO
@@ -1,6 +1,6 @@
-Metadata-Version: 2.
+Metadata-Version: 2.4
 Name: megatron-core
-Version: 0.10.0
+Version: 0.12.0
 Summary: Megatron Core - a library for efficient and scalable training of transformer based models
 Home-page: https://github.com/NVIDIA/Megatron-LM/megatron/core
 Download-URL: https://github.com/NVIDIA/Megatron-LM/releases
@@ -46,8 +46,8 @@ License: The following applies to all files unless otherwise noted:
 Below are licenses used in those files, as indicated.


-
--- LICENSE FOR Facebook, huggingface, Google Research, LLaVA, and
+--------------------------------------------------------------------------------------
+-- LICENSE FOR Facebook, huggingface, Google Research, LLaVA, Mamba, and vLLM code --


     Apache License
@@ -257,8 +257,9 @@ License: The following applies to all files unless otherwise noted:
 Facebook, Inc. and its affiliates,
 Meta Platforms, Inc. and its affiliates,
 Microsoft Corporation,
-OpenGVLab/InternVL,
-Triton language and compiler
+OpenGVLab/InternVL,
+Triton language and compiler,
+and DeepSeek.

 MIT License

@@ -316,11 +317,15 @@ Requires-Dist: tiktoken
 Requires-Dist: wrapt
 Requires-Dist: zarr
 Requires-Dist: wandb
-Requires-Dist: tensorstore
-Requires-Dist:
+Requires-Dist: tensorstore!=0.1.46,!=0.1.72
+Requires-Dist: torch
+Requires-Dist: nvidia-modelopt[torch]>=0.23.2; sys_platform != "darwin"
+Requires-Dist: torch
+Requires-Dist: packaging
 Dynamic: author
 Dynamic: download-url
 Dynamic: home-page
+Dynamic: license-file
 Dynamic: maintainer
 Dynamic: requires-dist

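The tightened `tensorstore` pin above excludes two specific releases while allowing everything else. As a quick standalone illustration (not part of the package), the exclusion specifier behaves as follows under the `packaging` library that 0.12.0 now also depends on:

```
from packaging.specifiers import SpecifierSet

# The new tensorstore requirement excludes exactly two releases.
spec = SpecifierSet("!=0.1.46,!=0.1.72")

print("0.1.46" in spec)  # False: excluded
print("0.1.72" in spec)  # False: excluded
print("0.1.71" in spec)  # True: any other version satisfies the pin
```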
@@ -342,9 +347,8 @@ Megatron-LM & Megatron-Core
 - **[2024/6]** Megatron-Core added supports for Mamba-based models. Check out our paper [An Empirical Study of Mamba-based Language Models](https://arxiv.org/pdf/2406.07887) and [code example](https://github.com/NVIDIA/Megatron-LM/tree/ssm/examples/mamba).
 - **[2024/1 Announcement]** NVIDIA has released the core capabilities in **Megatron-LM** into [**Megatron-Core**](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core) in this repository. Megatron-Core expands upon Megatron-LM's GPU-optimized techniques with more cutting-edge innovations on system-level optimizations, featuring composable and modular APIs. Explore the [Megatron-Core intro](#megatron-core) for more details.

-
-
 # Table of Contents
+
 - [Megatron-LM \& Megatron-Core](#megatron-lm--megatron-core)
 - [Latest News](#latest-news)
 - [Table of Contents](#table-of-contents)
@@ -368,7 +372,6 @@ Megatron-LM & Megatron-Core
 - [Retro and InstructRetro](#retro-and-instructretro)
 - [Mamba-based Language Models](#mamba-based-language-models)
 - [Mixture of Experts](#mixture-of-experts)
-  - [Key Features of MoE](#key-features-of-moe)
 - [Evaluation and Tasks](#evaluation-and-tasks)
 - [GPT Text Generation](#gpt-text-generation)
 - [Detoxify GPT via Self-generation](#detoxify-gpt-via-self-generation)
@@ -385,7 +388,10 @@ Megatron-LM & Megatron-Core
 - [Collecting Wikipedia Training Data](#collecting-wikipedia-training-data)
 - [Collecting GPT Webtext Data](#collecting-gpt-webtext-data)
 - [Reproducibility](#reproducibility)
-
+- [Checkpoint conversion](#checkpoint-conversion)
+  - [Model class conversion](#model-class-conversion)
+  - [Checkpoint format conversion](#checkpoint-format-conversion)
+- [Projects Using Megatron](#projects-using-megatron)

 # Megatron Overview
 This repository comprises two essential components: **Megatron-LM** and **Megatron-Core**. Megatron-LM serves as a research-oriented framework leveraging Megatron-Core for large language model (LLM) training. Megatron-Core, on the other hand, is a library of GPU optimized training techniques that comes with formal product support including versioned APIs and regular releases. You can use Megatron-Core alongside Megatron-LM or [Nvidia NeMo Framework](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/nemo_megatron/mcore_customization.html) for an end-to-end and cloud-native solution. Alternatively, you can integrate Megatron-Core's building blocks into your preferred training framework.
@@ -915,7 +921,63 @@ There are currently three known Megatron optimizations that break reproducibilit

 In addition, determinisim has only been verified in NGC PyTorch containers up to and newer than 23.12. If you observe nondeterminism in Megatron training under other circumstances please open an issue.

-
+# Checkpoint conversion
+
+We support two forms of model conversion:
+
+1. Model class conversion (i.e., the `GPTModel` in `model.legacy` vs. `model.core`)
+2. Checkpoint format conversion (i.e., distributed vs. non-distributed checkpoint)
+
+## Model class conversion
+
+Megatron supports converting between different model classes, including internal model classes (we currently have the older `legacy` models, and the newer `core` models) and external model classes (such as Meta, Huggingface, Mistral, and Mixtral models). Additionally, during this conversion, one can update the parallel state of the model (i.e., changing tensor and pipeline model parallelism).
+
+We provide the tool `tools/checkpoint/convert.py` to convert between model classes. Some important arguments include:
+
+- `--model-type`: `GPT` or `BERT`
+- `--loader`: format of the existing checkpoint. Supported formats include:
+  - `legacy`: our older model classes (under `megatron.legacy.model`)
+  - `core`: our newer model classes (under `megatron.core.models`)
+  - `llama_mistral`: for loading Llama and Mistral models (supports Meta and Huggingface formats)
+  - `mixtral_hf`: for loading Mixtral models (Huggingface only)
+- `--load-dir`: directory for loading the existing checkpoint
+- `--saver`: `legacy` or `core` (see descriptions under `--loader`)
+- `--save-dir`: directory for saving the new checkpoint
+- `--target-tensor-parallel-size`: new tensor model parallel size
+- `--target-pipeline-parallel-size`: new pipeline model parallel size
+
+For more argument details, please see the main script (`convert.py`), loader scripts (`loader_core.py`, `loader_legacy.py`, `loader_llama_mistral.py`, `loader_mixtral_hf.py`), or saver scripts (`saver_core.py`, `saver_legacy.py`).
+
+An example command for converting a GPT model from the old format (`legacy`) to the new format (`core`) would look as follows:
+
+```
+python tools/checkpoint/convert.py \
+>   --model-type GPT \
+>   --loader legacy \
+>   --load-dir ${LEGACY_FORMAT_DIR} \
+>   --saver core \
+>   --save-dir ${CORE_FORMAT_DIR} \
+>   --target-tensor-parallel-size ${TP} \
+>   --target-pipeline-parallel-size ${PP} \
+```
+
+For examples of converting Llama/Mistral models into Megatron, please see [here](docs/llama_mistral.md).
+
+## Checkpoint format conversion
+
+Megatron offers multiple checkpoint formats, including:
+
+- `torch`: Basic checkpoint format with sequential read & writes, and is tied to a specific tensor/pipeline model parallel state (TP/PP states, respectively). (While a specific checkpoint is tied to a specific TP/PP state, a checkpoint can still be manually converted via the model class converter described above).
+- `torch_dist`: Distributed checkpoint format, for fast parallel reads & writes, and also is parallel state agnostic (i.e., one can load the same checkpoint to different TP/PP setups).
+
+Generally speaking, `torch_dist` is the more modern and recommended checkpoint format due to its speed. However, depending on the use case, it may be desirable to convert between these two formats. To do so, launch your *training* script (e.g., via `pretrain_gpt.py`) as you normally would, but with two additional arguments:
+
+- `--ckpt-convert-format ${FORMAT}`: `${FORMAT}` can be one of `torch` or `torch_dist`, as described above.
+- `--ckpt-convert-save ${PATH_TO_SAVE_NEW_FORMAT}`: this path should be different than your existing `--load`/`--save` paths, to avoid overwriting the existing checkpoint. After converting, use this new path for your `--load`/`--save` paths.
+
+The general idea of this checkpoint format converter is that it launches the model just as one normally would for training, but before running any training iterations, it saves to the new checkpoint format, and then exits. It is important to note that all other launch args should remain the same, in order for the system to understand the previous checkpoint format.
+
+# Projects Using Megatron
 Below are some of the projects where we have directly used Megatron:
 * [BERT and GPT Studies Using Megatron](https://arxiv.org/pdf/1909.08053.pdf)
 * [BioMegatron: Larger Biomedical Domain Language Model](https://www.aclweb.org/anthology/2020.emnlp-main.379.pdf)
--- megatron_core-0.10.0/README.md
+++ megatron_core-0.12.0/README.md
@@ -16,9 +16,8 @@ Megatron-LM & Megatron-Core
 - **[2024/6]** Megatron-Core added supports for Mamba-based models. Check out our paper [An Empirical Study of Mamba-based Language Models](https://arxiv.org/pdf/2406.07887) and [code example](https://github.com/NVIDIA/Megatron-LM/tree/ssm/examples/mamba).
 - **[2024/1 Announcement]** NVIDIA has released the core capabilities in **Megatron-LM** into [**Megatron-Core**](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core) in this repository. Megatron-Core expands upon Megatron-LM's GPU-optimized techniques with more cutting-edge innovations on system-level optimizations, featuring composable and modular APIs. Explore the [Megatron-Core intro](#megatron-core) for more details.

-
-
 # Table of Contents
+
 - [Megatron-LM \& Megatron-Core](#megatron-lm--megatron-core)
 - [Latest News](#latest-news)
 - [Table of Contents](#table-of-contents)
@@ -42,7 +41,6 @@ Megatron-LM & Megatron-Core
 - [Retro and InstructRetro](#retro-and-instructretro)
 - [Mamba-based Language Models](#mamba-based-language-models)
 - [Mixture of Experts](#mixture-of-experts)
-  - [Key Features of MoE](#key-features-of-moe)
 - [Evaluation and Tasks](#evaluation-and-tasks)
 - [GPT Text Generation](#gpt-text-generation)
 - [Detoxify GPT via Self-generation](#detoxify-gpt-via-self-generation)
@@ -59,7 +57,10 @@ Megatron-LM & Megatron-Core
 - [Collecting Wikipedia Training Data](#collecting-wikipedia-training-data)
 - [Collecting GPT Webtext Data](#collecting-gpt-webtext-data)
 - [Reproducibility](#reproducibility)
-
+- [Checkpoint conversion](#checkpoint-conversion)
+  - [Model class conversion](#model-class-conversion)
+  - [Checkpoint format conversion](#checkpoint-format-conversion)
+- [Projects Using Megatron](#projects-using-megatron)

 # Megatron Overview
 This repository comprises two essential components: **Megatron-LM** and **Megatron-Core**. Megatron-LM serves as a research-oriented framework leveraging Megatron-Core for large language model (LLM) training. Megatron-Core, on the other hand, is a library of GPU optimized training techniques that comes with formal product support including versioned APIs and regular releases. You can use Megatron-Core alongside Megatron-LM or [Nvidia NeMo Framework](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/nemo_megatron/mcore_customization.html) for an end-to-end and cloud-native solution. Alternatively, you can integrate Megatron-Core's building blocks into your preferred training framework.
@@ -589,7 +590,63 @@ There are currently three known Megatron optimizations that break reproducibilit

 In addition, determinisim has only been verified in NGC PyTorch containers up to and newer than 23.12. If you observe nondeterminism in Megatron training under other circumstances please open an issue.

-
+# Checkpoint conversion
+
+We support two forms of model conversion:
+
+1. Model class conversion (i.e., the `GPTModel` in `model.legacy` vs. `model.core`)
+2. Checkpoint format conversion (i.e., distributed vs. non-distributed checkpoint)
+
+## Model class conversion
+
+Megatron supports converting between different model classes, including internal model classes (we currently have the older `legacy` models, and the newer `core` models) and external model classes (such as Meta, Huggingface, Mistral, and Mixtral models). Additionally, during this conversion, one can update the parallel state of the model (i.e., changing tensor and pipeline model parallelism).
+
+We provide the tool `tools/checkpoint/convert.py` to convert between model classes. Some important arguments include:
+
+- `--model-type`: `GPT` or `BERT`
+- `--loader`: format of the existing checkpoint. Supported formats include:
+  - `legacy`: our older model classes (under `megatron.legacy.model`)
+  - `core`: our newer model classes (under `megatron.core.models`)
+  - `llama_mistral`: for loading Llama and Mistral models (supports Meta and Huggingface formats)
+  - `mixtral_hf`: for loading Mixtral models (Huggingface only)
+- `--load-dir`: directory for loading the existing checkpoint
+- `--saver`: `legacy` or `core` (see descriptions under `--loader`)
+- `--save-dir`: directory for saving the new checkpoint
+- `--target-tensor-parallel-size`: new tensor model parallel size
+- `--target-pipeline-parallel-size`: new pipeline model parallel size
+
+For more argument details, please see the main script (`convert.py`), loader scripts (`loader_core.py`, `loader_legacy.py`, `loader_llama_mistral.py`, `loader_mixtral_hf.py`), or saver scripts (`saver_core.py`, `saver_legacy.py`).
+
+An example command for converting a GPT model from the old format (`legacy`) to the new format (`core`) would look as follows:
+
+```
+python tools/checkpoint/convert.py \
+>   --model-type GPT \
+>   --loader legacy \
+>   --load-dir ${LEGACY_FORMAT_DIR} \
+>   --saver core \
+>   --save-dir ${CORE_FORMAT_DIR} \
+>   --target-tensor-parallel-size ${TP} \
+>   --target-pipeline-parallel-size ${PP} \
+```
+
+For examples of converting Llama/Mistral models into Megatron, please see [here](docs/llama_mistral.md).
+
+## Checkpoint format conversion
+
+Megatron offers multiple checkpoint formats, including:
+
+- `torch`: Basic checkpoint format with sequential read & writes, and is tied to a specific tensor/pipeline model parallel state (TP/PP states, respectively). (While a specific checkpoint is tied to a specific TP/PP state, a checkpoint can still be manually converted via the model class converter described above).
+- `torch_dist`: Distributed checkpoint format, for fast parallel reads & writes, and also is parallel state agnostic (i.e., one can load the same checkpoint to different TP/PP setups).
+
+Generally speaking, `torch_dist` is the more modern and recommended checkpoint format due to its speed. However, depending on the use case, it may be desirable to convert between these two formats. To do so, launch your *training* script (e.g., via `pretrain_gpt.py`) as you normally would, but with two additional arguments:
+
+- `--ckpt-convert-format ${FORMAT}`: `${FORMAT}` can be one of `torch` or `torch_dist`, as described above.
+- `--ckpt-convert-save ${PATH_TO_SAVE_NEW_FORMAT}`: this path should be different than your existing `--load`/`--save` paths, to avoid overwriting the existing checkpoint. After converting, use this new path for your `--load`/`--save` paths.
+
+The general idea of this checkpoint format converter is that it launches the model just as one normally would for training, but before running any training iterations, it saves to the new checkpoint format, and then exits. It is important to note that all other launch args should remain the same, in order for the system to understand the previous checkpoint format.
+
+# Projects Using Megatron
 Below are some of the projects where we have directly used Megatron:
 * [BERT and GPT Studies Using Megatron](https://arxiv.org/pdf/1909.08053.pdf)
 * [BioMegatron: Larger Biomedical Domain Language Model](https://www.aclweb.org/anthology/2020.emnlp-main.379.pdf)
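The new "Checkpoint format conversion" section above describes relaunching the training entry point with two extra flags. A minimal sketch of such a launch, with a hypothetical 8-GPU run and placeholder paths (every other argument must match the run that produced the original checkpoint):

```
import subprocess

# Relaunch the usual training entry point with the two conversion flags
# described in the README section above. The torchrun world size and all
# paths here are hypothetical placeholders; append the exact model and
# training args used to produce the original checkpoint.
args = [
    "torchrun", "--nproc_per_node=8", "pretrain_gpt.py",
    "--load", "/checkpoints/gpt_torch",              # existing `torch` checkpoint
    "--ckpt-convert-format", "torch_dist",           # target format
    "--ckpt-convert-save", "/checkpoints/gpt_dist",  # must differ from --load/--save
]
subprocess.run(args, check=True)  # saves the new format before training, then exits
```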
--- megatron_core-0.10.0/megatron/core/datasets/blended_dataset.py
+++ megatron_core-0.12.0/megatron/core/datasets/blended_dataset.py
@@ -29,7 +29,8 @@ class BlendedDataset(torch.utils.data.Dataset):
 
         weights (List[Union[int, float]]): The weights that determine the dataset blend ratios
 
-        size (Optional[int]): The number of samples to draw from the blend. If None, for each
+        size (Optional[int]): The number of samples to draw from the blend. If None, for each
+            dataset index idx draw exactly weights[idx] samples from datasets[idx].
 
         config (BlendedMegatronDatasetConfig): The config
 
@@ -74,7 +75,6 @@ class BlendedDataset(torch.utils.data.Dataset):
         unique_identifiers["split"] = self.split.name
         unique_identifiers["weights"] = self.weights
         unique_identifiers["size"] = self.size
-        unique_identifiers["renormalize_blend_weights"] = self.config.renormalize_blend_weights
 
         self.unique_description = json.dumps(
             unique_identifiers, indent=4, default=lambda obj: obj.unique_identifiers
@@ -168,7 +168,7 @@ class BlendedDataset(torch.utils.data.Dataset):
             log_single_rank(
                 logger,
                 logging.WARNING,
-                f"
+                f"Cannot save the {type(self).__name__} indexes because path_to_cache is None",
             )
 
         t_end = time.time()
--- megatron_core-0.10.0/megatron/core/datasets/blended_megatron_dataset_builder.py
+++ megatron_core-0.12.0/megatron/core/datasets/blended_megatron_dataset_builder.py
@@ -34,7 +34,9 @@ class BlendedMegatronDatasetBuilder(object):
 
         sizes (List[Optional[int]]): The minimum total number of samples to draw, or None, per split
 
-        is_built_on_rank (Callable): A callable which returns True if the dataset should be built on
+        is_built_on_rank (Callable): A callable which returns True if the dataset should be built on
+            the current rank and False otherwise. It should be Megatron Core parallelism aware i.e.
+            global rank, local group rank, and virtual rank may inform its return value.
 
         config (BlendedMegatronDatasetConfig): The config object which informs dataset creation
     """
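The expanded `is_built_on_rank` docstring above asks for a parallelism-aware predicate. A hypothetical example in that spirit, building datasets only on the first rank of each tensor- and pipeline-model-parallel group (the `parallel_state` helpers exist in `megatron.core`, but this particular predicate is an illustration, not a prescribed default):

```
from megatron.core import parallel_state

def is_built_on_rank() -> bool:
    # Build on one rank per model-parallel group; the other ranks in the
    # group (and data-parallel replicas) reuse the cached indices instead
    # of rebuilding them.
    return (
        parallel_state.get_tensor_model_parallel_rank() == 0
        and parallel_state.get_pipeline_model_parallel_rank() == 0
    )
```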
@@ -54,7 +56,7 @@ class BlendedMegatronDatasetBuilder(object):
         log_single_rank(
             logger,
             logging.INFO,
-            f"Building
+            f"Building {cls.__name__} splits with sizes={self.sizes} and config={self.config}",
         )
 
         if not self.config.mock:
@@ -96,7 +98,8 @@ class BlendedMegatronDatasetBuilder(object):
         (2) The split has one contributing dataset, and...
 
             (a) 'size' is not None
-                - Build a mid-level dataset with low-level dataset sampling in proportion to the
+                - Build a mid-level dataset with low-level dataset sampling in proportion to the
+                  size
 
             (b) 'size' is None
                 - Build mid-level datasets with no excess low-level dataset sampling
@@ -104,24 +107,27 @@ class BlendedMegatronDatasetBuilder(object):
         (3) The split has multiple contributing datasets, and...
 
             (a) 'weights' is not None and 'size' is not None
-                - Build mid-level datasets with low-level dataset sampling in proportion to their
-
+                - Build mid-level datasets with low-level dataset sampling in proportion to their
+                  weights and the size
+                - Build a top-level dataset of length marginally greater than 'size' with mid-level
+                  dataset sampling in proportion to their weights and the size
 
             (b) 'weights' is not None and 'size' is None
                 - Error
 
             (c) 'weights' is None and 'size' is not None
                 - Build mid-level datasets with no excess low-level dataset sampling
-                - Build a top-level dataset of length 'size'
-
-
+                - Build a top-level dataset of length 'size' (capped at the sum of the mid-level
+                  dataset lengths) with mid-level dataset sampling in proportion to their lengths
+                  and the size
 
             (d) 'weights' is None and 'size' is None
                 - Build mid-level datasets with no excess low-level dataset sampling
                 - Build a top-level dataset with no excess mid-level dataset sampling
 
         Returns:
-            List[Optional[TopLevelDataset]]: A list containing a dataset instance (or None) per
+            List[Optional[TopLevelDataset]]: A list containing a dataset instance (or None) per
+                split
         """
         datasets = self._build_blended_dataset_splits()
 
@@ -134,24 +140,35 @@ class BlendedMegatronDatasetBuilder(object):
                     log_single_rank(
                         logger,
                         logging.INFO,
-
+                        (
+                            f"Verifying NumPy indices for {type(dataset).__name__} "
+                            f"{dataset.split.name} split"
+                        ),
                     )
                 else:
                     log_single_rank(
                         logger,
                         logging.INFO,
-
+                        (
+                            f"NumPy indices for {type(dataset).__name__} {dataset.split.name} "
+                            f"split are fully cached, skipping verification"
+                        ),
                     )
                     continue
                 # Check blend size
                 assert dataset.size is None or dataset.size == dataset.dataset_index.shape[0]
                 # Check blend access of mid-level datasets
-
-
-
+                dataset_indices, dataset_sizes = numpy.unique(
+                    dataset.dataset_index, return_counts=True
+                )
+                for i, (index, size) in enumerate(zip(dataset_indices, dataset_sizes)):
+                    if len(dataset.datasets[index]) < size:
                         raise IndexError(
-                            f"The {dataset.split.name} blend oversamples
-                            f"
+                            f"The {dataset.split.name} blend oversamples the contributing "
+                            f"datasets and, e.g., requests {size} samples from "
+                            f"{type(dataset.datasets[index]).__name__} {i} with size "
+                            f"{len(dataset.datasets[index])}. This is unexpected. "
+                            f"Please file an issue."
                         )
 
         return datasets
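The rewritten verification above leans on `numpy.unique(..., return_counts=True)`: for each contributing dataset index it yields how many samples the blend requests, which is then compared against that dataset's length. A standalone illustration with hypothetical indices:

```
import numpy

dataset_index = numpy.array([0, 0, 1, 0, 2, 1, 0])  # hypothetical blend indices
indices, counts = numpy.unique(dataset_index, return_counts=True)
print(indices)  # [0 1 2]
print(counts)   # [4 2 1] -> the blend draws 4, 2, and 1 samples respectively

dataset_lengths = [4, 2, 1]  # hypothetical mid-level dataset lengths
for index, count in zip(indices, counts):
    assert dataset_lengths[index] >= count, "blend oversamples a dataset"
```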
@@ -162,7 +179,8 @@ class BlendedMegatronDatasetBuilder(object):
         See the BlendedMegatronDatasetBuilder.build alias for more information.
 
         Returns:
-            List[Optional[TopLevelDataset]]: A list containing a dataset instance (or None) per
+            List[Optional[TopLevelDataset]]: A list containing a dataset instance (or None) per
+                split
         """
         ##
         # Return fake "mock" datasets
@@ -192,13 +210,19 @@ class BlendedMegatronDatasetBuilder(object):
 
         # Build the mid-level datasets
         if weights is None:
-
+            # Build only one "epoch"
+            sizes_per_dataset_buffer = [[None for split in Split] for prefix in prefixes]
         else:
-
+            # The number of samples we plan to use per dataset
+            sizes_per_dataset_target = _get_size_per_split_per_dataset(weights, self.sizes)
+            # The number of samples we plan to build per dataset
+            sizes_per_dataset_buffer = _get_size_per_split_per_dataset(
+                weights, self.sizes, margin=0.5
+            )
 
-        #
+        # Build each dataset in parallel
         megatron_datasets = self._build_megatron_datasets_parallel(
-            prefixes, split,
+            prefixes, split, sizes_per_dataset_buffer
         )
 
         # Build the top-level datasets
@@ -207,11 +231,11 @@ class BlendedMegatronDatasetBuilder(object):
             if split[i] is not None:
                 weights_i = weights
                 if weights_i is not None and self.sizes[i] is not None:
-
+                    # Blend according to client-specified weights and client-specified size
+                    size_per_dataset = list(zip(*sizes_per_dataset_target))[i]
                     size_i = sum(size_per_dataset)
-                    if self.config.renormalize_blend_weights:
-                        weights_i = list(map(lambda _size: _size / size_i, size_per_dataset))
                 elif weights_i is None:
+                    # Blend according to dataset sizes as-is and (maybe) client-specified size
                     try:
                         weights_i = [
                             len(megatron_dataset) for megatron_dataset in megatron_datasets[i]
@@ -221,9 +245,12 @@ class BlendedMegatronDatasetBuilder(object):
                     if self.sizes[i] is not None:
                         size_i = min(self.sizes[i], sum(weights_i))
                     else:
-
+                        # Build exhaustive indices
+                        size_i = None
                 else:
-                    raise
+                    raise ValueError(
+                        "Using client-specified weights requires client-specified size"
+                    )
                 blended_datasets[i] = self.build_generic_dataset(
                     BlendedDataset,
                     self.is_built_on_rank,
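When no client weights are given, the builder falls back to weighting by mid-level dataset lengths and caps any requested size at their sum (case (c)); with no size either, `size_i = None` asks for exhaustive indices (case (d)). A compact sketch of that fallback, with toy lengths in place of real datasets:

```python
from typing import List, Optional, Tuple

def resolve_blend(
    dataset_lengths: List[int], requested_size: Optional[int]
) -> Tuple[List[int], Optional[int]]:
    # Weights default to the raw dataset lengths; normalization happens later.
    weights = list(dataset_lengths)
    if requested_size is not None:
        # Case (c): cap at the total number of available samples.
        return weights, min(requested_size, sum(weights))
    # Case (d): None means "build exhaustive indices".
    return weights, None

print(resolve_blend([300, 700], 2000))  # -> ([300, 700], 1000)
print(resolve_blend([300, 700], None))  # -> ([300, 700], None)
```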
@@ -263,22 +290,31 @@ class BlendedMegatronDatasetBuilder(object):
 
             # Build mid-level datasets
             if weights is None:
-
+                sizes_per_dataset_buffer = [
+                    [None for split in Split] for prefix in prefixes
+                ]
             else:
-
+                # The number of samples we plan to use per dataset
+                sizes_per_dataset_target = _get_size_per_split_per_dataset(
+                    weights, sizes_spoof
+                )
+                # The number of samples we plan to build per dataset
+                sizes_per_dataset_buffer = _get_size_per_split_per_dataset(
+                    weights, sizes_spoof, margin=0.5
+                )
 
-            #
+            # Build each dataset in parallel
             megatron_datasets = self._build_megatron_datasets_parallel(
-                prefixes, split_spoof,
+                prefixes, split_spoof, sizes_per_dataset_buffer
             )[i]
 
             # Build top-level dataset
             if weights is not None and self.sizes[i] is not None:
-
+                # Blend according to client-specified weights and client-specified size
+                size_per_dataset = list(zip(*sizes_per_dataset_target))[i]
                 size = sum(size_per_dataset)
-                if self.config.renormalize_blend_weights:
-                    weights = list(map(lambda _size: _size / size, size_per_dataset))
             elif weights is None:
+                # Blend according to dataset sizes as-is and (maybe) client-specified size
                 try:
                     weights = [
                         len(megatron_dataset) for megatron_dataset in megatron_datasets
@@ -288,7 +324,8 @@ class BlendedMegatronDatasetBuilder(object):
                 if self.sizes[i] is not None:
                     size = min(self.sizes[i], sum(weights))
                 else:
-
+                    # Build exhaustive indices
+                    size = None
             else:
                 raise RuntimeError
             blended_datasets[i] = self.build_generic_dataset(
@@ -395,13 +432,15 @@ class BlendedMegatronDatasetBuilder(object):
         """Build each MidLevelDataset split from a single LowLevelDataset
 
         Args:
-            dataset_path (Optional[str]): The path on disk which defines the underlying
+            dataset_path (Optional[str]): The path on disk which defines the underlying
+                LowLevelDataset, or None for mock dataset classes
 
             split (List[Tuple[float, float]]): The dataset split matrix
 
             sizes (List[int]): The number of total samples to draw from each split
 
-            synchronize_ranks (bool): Whether to call barrier for rank-0 / barrier / other-ranks
+            synchronize_ranks (bool): Whether to call barrier for rank-0 / barrier / other-ranks
+                behavior. Set to False when we enforce this behavior at higher level.
 
         Returns:
             List[Optional[MidLevelDataset]]: The MidLevelDataset (or None) per split
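The `synchronize_ranks` flag gates the rank-0 / barrier / other-ranks protocol named in the docstring: rank 0 builds first (writing any cache files), every rank waits at a barrier, then the remaining ranks build from the warm cache. A minimal sketch of that protocol; `build_fn` is a placeholder, not the library's API:

```python
import torch

def build_rank0_first(build_fn, synchronize_ranks: bool = True):
    # Rank 0 materializes the dataset, the barrier fences the cache write,
    # and the other ranks then build by reading the cached artifacts.
    if synchronize_ranks and torch.distributed.is_initialized():
        rank = torch.distributed.get_rank()
        dataset = build_fn() if rank == 0 else None
        torch.distributed.barrier()
        if rank != 0:
            dataset = build_fn()
        return dataset
    return build_fn()
```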
@@ -462,17 +501,22 @@ class BlendedMegatronDatasetBuilder(object):
         and torch.distributed is initialized.
 
         Args:
-            cls (Union[Type[DistributedDataset], Callable]): The DistributedDataset class to be
+            cls (Union[Type[DistributedDataset], Callable]): The DistributedDataset class to be
+                built. In special cases, e.g. when we are building the low level dataset for a
+                RawMegatronDataset instance, we can accept a Callable which returns an Iterable.
 
-            synchronize_ranks (bool): Whether to call barrier for rank-0 / barrier / other-ranks
+            synchronize_ranks (bool): Whether to call barrier for rank-0 / barrier / other-ranks
+                behavior. Set to False when we enforce this behavior at higher level.
 
-            args (Tuple[Any]): The positional arguments used to build the provided
+            args (Tuple[Any]): The positional arguments used to build the provided
+                DistributedDataset class
 
         Raises:
             Exception: When the dataset constructor raises an OSError
 
         Returns:
-            Optional[Union[DistributedDataset, Iterable]]: The DistributedDataset instantion, the
+            Optional[Union[DistributedDataset, Iterable]]: The DistributedDataset instantion, the
+                Iterable instantiation, or None
         """
         if torch.distributed.is_initialized():
             rank = torch.distributed.get_rank()
@@ -485,10 +529,10 @@ class BlendedMegatronDatasetBuilder(object):
             dataset = cls(*args)
         except OSError as err:
             log = (
-                f"Failed to write dataset materials to the data cache directory. "
-
-
-
+                f"Failed to write dataset materials to the data cache directory. Please "
+                f"supply a directory to which you have write access via the path_to_cache "
+                f"attribute in BlendedMegatronDatasetConfig and retry. Refer to the "
+                f"preserved traceback above for more information."
             )
             raise Exception(log) from err
 
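The expanded message above pairs a remediation hint with `raise ... from err`, so the original `OSError` traceback stays attached beneath the friendlier explanation. The same pattern in isolation:

```python
def build_with_cache_hint(factory, *args):
    # Wrap a failing constructor in a message that names the likely fix while
    # chaining the original OSError so the full traceback is preserved.
    try:
        return factory(*args)
    except OSError as err:
        raise Exception(
            "Failed to write dataset materials to the data cache directory. "
            "Supply a writable directory via the path_to_cache attribute in "
            "BlendedMegatronDatasetConfig and retry."
        ) from err
```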
@@ -505,23 +549,30 @@ class BlendedMegatronDatasetBuilder(object):
 
 
 def _get_size_per_split_per_dataset(
-    normalized_weights: List[float], target_size_per_split: List[int]
+    normalized_weights: List[float], target_size_per_split: List[int], margin: float = 0.0
 ) -> List[List[int]]:
     """Determine the contribution of the MegatronDataset splits to the BlendedDataset splits
 
     Args:
         normalized_weights (List[float]): e.g. [0.3, 0.7]
 
-        target_size_per_split (List[int]): The number of samples to target for each BlendedDataset
+        target_size_per_split (List[int]): The number of samples to target for each BlendedDataset
+            split
+
+        margin (float): The relative quantity of extra samples to build per per split per dataset,
+            as a percentage
 
     Returns:
         List[List[int]]: The number of samples to request per MegatronDataset per split
     """
     assert numpy.isclose(sum(normalized_weights), 1.0)
 
-    # Use
+    # Use margin as buffer to ensure we satiate the request
     sizes_per_dataset = [
-        [
+        [
+            int(math.ceil(math.ceil(target_size * weight) * (1 + margin / 100)))
+            for target_size in target_size_per_split
+        ]
         for weight in normalized_weights
     ]
 
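With `margin=0.5` (i.e. 0.5 percent), each per-dataset request is rounded up twice: once when apportioning the split size by weight, and once when padding by the margin. A worked example of that double ceiling for weights [0.3, 0.7] and a 1000-sample split, re-implementing the expression above outside the library:

```python
import math

def size_with_margin(target_size: int, weight: float, margin: float = 0.0) -> int:
    # Same double-ceil as _get_size_per_split_per_dataset: apportion by weight,
    # then pad by margin percent.
    return int(math.ceil(math.ceil(target_size * weight) * (1 + margin / 100)))

# Target sizes: exact apportionment (margin defaults to 0.0).
print([size_with_margin(1000, w) for w in (0.3, 0.7)])       # [300, 700]
# Buffer sizes: 0.5% headroom so rounding during blending cannot under-fill.
print([size_with_margin(1000, w, 0.5) for w in (0.3, 0.7)])  # [302, 704]
```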
megatron/core/datasets/blended_megatron_dataset_config.py

@@ -34,12 +34,6 @@ class BlendedMegatronDatasetConfig:
     'blend'. Defauls to None.
     """
 
-    renormalize_blend_weights: bool = False
-    """Renormalize the blend weights to account for mid-level dataset oversampling done to ensure
-    fulfillmenet of the of the requested number of samples. Defaults to False for backward
-    comparability in the data sample order.
-    """
-
     split: Optional[str] = None
     """The split string, a comma separated weighting for the dataset splits when drawing samples
     from a single distribution. Not to be used with 'blend_per_split'. Defaults to None.
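For reference, the `split` string is a comma separated weighting such as "949,50,1" across train/validation/test. A sketch of the normalization it implies (an assumed helper for illustration, not the library's parser):

```python
def parse_split(split: str) -> list:
    # "949,50,1" -> per-split weights summing to 1.0.
    parts = [float(part) for part in split.split(",")]
    total = sum(parts)
    return [part / total for part in parts]

print(parse_split("949,50,1"))  # approximately [0.949, 0.05, 0.001]
```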
@@ -67,7 +61,7 @@ class BlendedMegatronDatasetConfig:
     """
 
     tokenizer: Optional[MegatronTokenizer] = None
-    """The MegatronTokenizer instance
+    """The MegatronTokenizer instance. Required for datasets that do online tokenization."""
 
     def __post_init__(self) -> None:
         """Do asserts and set fields post init"""
@@ -149,7 +143,8 @@ def convert_split_vector_to_split_matrix(
     Args:
         vector_a (List[float]): The primary split vector
 
-        vector_b (Optional[List[float]]): An optional secondary split vector which constrains the
+        vector_b (Optional[List[float]]): An optional secondary split vector which constrains the
+            primary split vector. Defaults to None.
 
     Returns:
         List[Tuple[float, float]]: The split matrix consisting of book-ends of each split in order
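The "book-ends" in the return description are cumulative interval edges over [0, 1]: each split owns the range between its left and right edge. A minimal sketch of the primary-vector case (assumed to mirror the documented behavior; the `vector_b` constraint is ignored here):

```python
import itertools
from typing import List, Tuple

def vector_to_bookends(vector: List[float]) -> List[Tuple[float, float]]:
    # Cumulative sums give each split's right edge; pair them with left edges.
    edges = [0.0] + list(itertools.accumulate(vector))
    return list(zip(edges[:-1], edges[1:]))

print(vector_to_bookends([0.5, 0.25, 0.25]))
# -> [(0.0, 0.5), (0.5, 0.75), (0.75, 1.0)]
```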