megatron-core 0.14.0rc5__tar.gz → 0.14.0rc7__tar.gz

This diff compares the contents of two publicly released versions of the package as they appear in their public registry. It is provided for informational purposes only.


Files changed (329)
  1. megatron_core-0.14.0rc7/PKG-INFO +536 -0
  2. megatron_core-0.14.0rc7/README.md +469 -0
  3. megatron_core-0.14.0rc7/megatron/core/README.md +51 -0
  4. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/__init__.py +6 -0
  5. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/blended_megatron_dataset_builder.py +17 -3
  6. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/blended_megatron_dataset_config.py +6 -0
  7. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/gpt_dataset.py +0 -4
  8. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/dist_checkpointing/mapping.py +0 -6
  9. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/dist_checkpointing/strategies/common.py +6 -6
  10. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/distributed/__init__.py +1 -0
  11. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/distributed/distributed_data_parallel.py +16 -6
  12. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/distributed/distributed_data_parallel_config.py +20 -6
  13. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/distributed/finalize_model_grads.py +209 -96
  14. megatron_core-0.14.0rc7/megatron/core/distributed/fsdp/__init__.py +3 -0
  15. megatron_core-0.14.0rc7/megatron/core/distributed/fsdp/mcore_fsdp_adapter.py +317 -0
  16. megatron_core-0.14.0rc7/megatron/core/distributed/fsdp/src/__init__.py +13 -0
  17. megatron_core-0.14.0rc7/megatron/core/distributed/fsdp/src/megatron_fsdp/__init__.py +22 -0
  18. megatron_core-0.14.0rc7/megatron/core/distributed/fsdp/src/megatron_fsdp/distributed_data_parallel_config.py +141 -0
  19. megatron_core-0.14.0rc7/megatron/core/distributed/fsdp/src/megatron_fsdp/fully_shard.py +387 -0
  20. megatron_core-0.14.0rc7/megatron/core/distributed/fsdp/src/megatron_fsdp/megatron_fsdp.py +1107 -0
  21. {megatron_core-0.14.0rc5/megatron/core/distributed/custom_fsdp → megatron_core-0.14.0rc7/megatron/core/distributed/fsdp/src/megatron_fsdp}/param_and_grad_buffer.py +1658 -522
  22. megatron_core-0.14.0rc7/megatron/core/distributed/fsdp/src/megatron_fsdp/uneven_dtensor.py +458 -0
  23. megatron_core-0.14.0rc7/megatron/core/distributed/fsdp/src/megatron_fsdp/utils.py +908 -0
  24. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/distributed/param_and_grad_buffer.py +22 -7
  25. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/distributed/torch_fully_sharded_data_parallel.py +8 -0
  26. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/extensions/transformer_engine.py +233 -18
  27. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/fp8_utils.py +62 -48
  28. megatron_core-0.14.0rc7/megatron/core/full_cuda_graph.py +195 -0
  29. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/contexts/dynamic_context.py +127 -49
  30. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/engines/dynamic_engine.py +6 -3
  31. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/model_inference_wrappers/abstract_model_inference_wrapper.py +9 -0
  32. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/text_generation_controllers/text_generation_controller.py +37 -20
  33. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/common/language_module/language_module.py +19 -2
  34. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/gpt/gpt_model.py +24 -0
  35. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/huggingface/clip_model.py +1 -1
  36. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/huggingface/qwen_model.py +1 -1
  37. megatron_core-0.14.0rc7/megatron/core/nccl_allocator.py +249 -0
  38. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/optimizer/__init__.py +3 -22
  39. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/optimizer/clip_grads.py +15 -0
  40. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/optimizer/distrib_optimizer.py +155 -129
  41. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/optimizer/optimizer.py +10 -5
  42. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/optimizer/optimizer_config.py +24 -0
  43. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/package_info.py +1 -1
  44. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/parallel_state.py +57 -4
  45. megatron_core-0.14.0rc7/megatron/core/pipeline_parallel/p2p_communication.py +645 -0
  46. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/pipeline_parallel/schedules.py +379 -243
  47. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/pipeline_parallel/utils.py +12 -2
  48. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/process_groups_config.py +55 -0
  49. megatron_core-0.14.0rc7/megatron/core/safe_globals.py +33 -0
  50. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/tensor_parallel/layers.py +8 -8
  51. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/attention.py +65 -15
  52. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/cuda_graphs.py +328 -13
  53. megatron_core-0.14.0rc7/megatron/core/transformer/fsdp_dtensor_checkpoint.py +195 -0
  54. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/moe/experts.py +1 -25
  55. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/moe/moe_utils.py +183 -135
  56. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/moe/router.py +148 -138
  57. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/moe/token_dispatcher.py +5 -1
  58. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/multi_latent_attention.py +239 -14
  59. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/multi_token_prediction.py +258 -61
  60. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/transformer_config.py +52 -11
  61. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/transformer_layer.py +20 -6
  62. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/utils.py +0 -3
  63. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/utils.py +11 -54
  64. megatron_core-0.14.0rc7/megatron_core.egg-info/PKG-INFO +536 -0
  65. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron_core.egg-info/SOURCES.txt +14 -3
  66. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron_core.egg-info/requires.txt +1 -1
  67. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/pyproject.toml +2 -2
  68. megatron_core-0.14.0rc5/PKG-INFO +0 -836
  69. megatron_core-0.14.0rc5/README.md +0 -769
  70. megatron_core-0.14.0rc5/megatron/core/README.md +0 -14
  71. megatron_core-0.14.0rc5/megatron/core/distributed/custom_fsdp/__init__.py +0 -3
  72. megatron_core-0.14.0rc5/megatron/core/distributed/custom_fsdp/fully_sharded_data_parallel.py +0 -835
  73. megatron_core-0.14.0rc5/megatron/core/pipeline_parallel/p2p_communication.py +0 -628
  74. megatron_core-0.14.0rc5/megatron_core.egg-info/PKG-INFO +0 -836
  75. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/LICENSE +0 -0
  76. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/MANIFEST.in +0 -0
  77. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/activations.py +0 -0
  78. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/config.py +0 -0
  79. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/config_logger.py +0 -0
  80. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/__init__.py +0 -0
  81. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/bert_dataset.py +0 -0
  82. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/blended_dataset.py +0 -0
  83. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/helpers.cpp +0 -0
  84. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/helpers.py +0 -0
  85. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/indexed_dataset.py +0 -0
  86. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/masked_dataset.py +0 -0
  87. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/megatron_dataset.py +0 -0
  88. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/megatron_tokenizer.py +0 -0
  89. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/multimodal_dataset.py +0 -0
  90. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/object_storage_utils.py +0 -0
  91. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/retro/__init__.py +0 -0
  92. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/retro/config/__init__.py +0 -0
  93. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/retro/config/bert_embedders.py +0 -0
  94. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/retro/config/config.py +0 -0
  95. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/retro/config/gpt_chunk_datasets.py +0 -0
  96. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/retro/config/tokenizers.py +0 -0
  97. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/retro/db/__init__.py +0 -0
  98. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/retro/db/build.py +0 -0
  99. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/retro/db/dataset.py +0 -0
  100. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/retro/db/utils.py +0 -0
  101. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/retro/external_libs.py +0 -0
  102. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/retro/index/__init__.py +0 -0
  103. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/retro/index/build.py +0 -0
  104. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/retro/index/factory.py +0 -0
  105. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/retro/index/index.py +0 -0
  106. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/retro/index/indexes/__init__.py +0 -0
  107. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/retro/index/indexes/faiss_base.py +0 -0
  108. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/retro/index/indexes/faiss_par_add.py +0 -0
  109. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/retro/index/utils.py +0 -0
  110. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/retro/index/validate.py +0 -0
  111. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/retro/query/__init__.py +0 -0
  112. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/retro/query/gpt_chunk_dataset.py +0 -0
  113. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/retro/query/multi_split_gpt_dataset.py +0 -0
  114. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/retro/query/query.py +0 -0
  115. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/retro/query/retro_dataset.py +0 -0
  116. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/retro/query/utils.py +0 -0
  117. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/retro/utils.py +0 -0
  118. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/t5_dataset.py +0 -0
  119. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/utils.py +0 -0
  120. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/utils_object_storage.py +0 -0
  121. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/datasets/utils_s3.py +0 -0
  122. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/dist_checkpointing/__init__.py +0 -0
  123. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/dist_checkpointing/core.py +0 -0
  124. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/dist_checkpointing/dict_utils.py +0 -0
  125. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/dist_checkpointing/exchange_utils.py +0 -0
  126. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/dist_checkpointing/optimizer.py +0 -0
  127. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/dist_checkpointing/serialization.py +0 -0
  128. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/dist_checkpointing/state_dict_utils.py +0 -0
  129. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/dist_checkpointing/strategies/__init__.py +0 -0
  130. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/dist_checkpointing/strategies/async_utils.py +0 -0
  131. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/dist_checkpointing/strategies/base.py +0 -0
  132. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/dist_checkpointing/strategies/cached_metadata_filesystem_reader.py +0 -0
  133. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/dist_checkpointing/strategies/filesystem_async.py +0 -0
  134. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/dist_checkpointing/strategies/fully_parallel.py +0 -0
  135. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/dist_checkpointing/strategies/resharding.py +0 -0
  136. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/dist_checkpointing/strategies/state_dict_saver.py +0 -0
  137. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/dist_checkpointing/strategies/tensorstore.py +0 -0
  138. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/dist_checkpointing/strategies/torch.py +0 -0
  139. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/dist_checkpointing/strategies/two_stage.py +0 -0
  140. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/dist_checkpointing/strategies/zarr.py +0 -0
  141. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/dist_checkpointing/tensor_aware_state_dict.py +0 -0
  142. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/dist_checkpointing/utils.py +0 -0
  143. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/dist_checkpointing/validation.py +0 -0
  144. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/distributed/data_parallel_base.py +0 -0
  145. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/distributed/torch_fully_sharded_data_parallel_config.py +0 -0
  146. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/energy_monitor.py +0 -0
  147. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/enums.py +0 -0
  148. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/export/__init__.py +0 -0
  149. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/export/data_type.py +0 -0
  150. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/export/export_config.py +0 -0
  151. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/export/model_type.py +0 -0
  152. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/export/trtllm/__init__.py +0 -0
  153. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/export/trtllm/engine_builder/__init__.py +0 -0
  154. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/export/trtllm/engine_builder/trtllm_engine_builder.py +0 -0
  155. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/export/trtllm/model_to_trllm_mapping/__init__.py +0 -0
  156. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/export/trtllm/model_to_trllm_mapping/default_conversion_dict.py +0 -0
  157. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/export/trtllm/trt_model_config.py +0 -0
  158. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/export/trtllm/trt_model_type.py +0 -0
  159. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/export/trtllm/trtllm_helper.py +0 -0
  160. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/export/trtllm/trtllm_layers.py +0 -0
  161. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/export/trtllm/trtllm_weights_converter/__init__.py +0 -0
  162. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/export/trtllm/trtllm_weights_converter/distributed_trtllm_model_weights_converter.py +0 -0
  163. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/export/trtllm/trtllm_weights_converter/single_device_trtllm_model_weights_converter.py +0 -0
  164. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/export/trtllm/trtllm_weights_converter/utils.py +0 -0
  165. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/extensions/__init__.py +0 -0
  166. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/extensions/kitchen.py +0 -0
  167. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/extensions/transformer_engine_spec_provider.py +0 -0
  168. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/fusions/__init__.py +0 -0
  169. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/fusions/fused_bias_dropout.py +0 -0
  170. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/fusions/fused_bias_geglu.py +0 -0
  171. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/fusions/fused_bias_gelu.py +0 -0
  172. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/fusions/fused_bias_swiglu.py +0 -0
  173. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/fusions/fused_cross_entropy.py +0 -0
  174. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/fusions/fused_indices_converter.py +0 -0
  175. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/fusions/fused_layer_norm.py +0 -0
  176. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/fusions/fused_mla_yarn_rope_apply.py +0 -0
  177. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/fusions/fused_pad_routing_map.py +0 -0
  178. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/fusions/fused_softmax.py +0 -0
  179. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/fusions/fused_weighted_squared_relu.py +0 -0
  180. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/hyper_comm_grid.py +0 -0
  181. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/__init__.py +0 -0
  182. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/async_stream.py +0 -0
  183. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/common_inference_params.py +0 -0
  184. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/communication_utils.py +0 -0
  185. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/contexts/__init__.py +0 -0
  186. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/contexts/base_context.py +0 -0
  187. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/contexts/dynamic_chunk_allocator.py +0 -0
  188. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/contexts/static_context.py +0 -0
  189. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/engines/__init__.py +0 -0
  190. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/engines/abstract_engine.py +0 -0
  191. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/engines/mcore_engine.py +0 -0
  192. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/engines/static_engine.py +0 -0
  193. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/inference_request.py +0 -0
  194. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/model_inference_wrappers/__init__.py +0 -0
  195. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/model_inference_wrappers/gpt/__init__.py +0 -0
  196. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/model_inference_wrappers/gpt/gpt_inference_wrapper.py +0 -0
  197. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/model_inference_wrappers/inference_wrapper_config.py +0 -0
  198. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/model_inference_wrappers/multimodal/vlm_inference_wrapper.py +0 -0
  199. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/model_inference_wrappers/t5/__init__.py +0 -0
  200. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/model_inference_wrappers/t5/t5_inference_wrapper.py +0 -0
  201. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/sampling_params.py +0 -0
  202. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/scheduler.py +0 -0
  203. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/text_generation_controllers/__init__.py +0 -0
  204. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/text_generation_controllers/encoder_decoder_text_generation_controller.py +0 -0
  205. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/text_generation_controllers/simple_text_generation_controller.py +0 -0
  206. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/text_generation_controllers/vlm_text_generation_controller.py +0 -0
  207. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference/utils.py +0 -0
  208. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/inference_params.py +0 -0
  209. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/jit.py +0 -0
  210. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/model_parallel_config.py +0 -0
  211. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/T5/__init__.py +0 -0
  212. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/T5/t5_model.py +0 -0
  213. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/T5/t5_spec.py +0 -0
  214. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/__init__.py +0 -0
  215. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/backends.py +0 -0
  216. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/bert/__init__.py +0 -0
  217. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/bert/bert_layer_specs.py +0 -0
  218. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/bert/bert_lm_head.py +0 -0
  219. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/bert/bert_model.py +0 -0
  220. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/bert/pooler.py +0 -0
  221. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/common/__init__.py +0 -0
  222. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/common/embeddings/__init__.py +0 -0
  223. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/common/embeddings/language_model_embedding.py +0 -0
  224. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/common/embeddings/relative_pos_embedding.py +0 -0
  225. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/common/embeddings/rope_utils.py +0 -0
  226. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/common/embeddings/rotary_pos_embedding.py +0 -0
  227. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/common/embeddings/yarn_rotary_pos_embedding.py +0 -0
  228. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/common/language_module/__init__.py +0 -0
  229. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/common/model_chunk_schedule_plan.py +0 -0
  230. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/common/vision_module/__init__.py +0 -0
  231. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/common/vision_module/vision_module.py +0 -0
  232. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/gpt/__init__.py +0 -0
  233. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/gpt/fine_grained_callables.py +0 -0
  234. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/gpt/gpt_layer_specs.py +0 -0
  235. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/gpt/heterogeneous/heterogeneous_layer_specs.py +0 -0
  236. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/gpt/moe_module_specs.py +0 -0
  237. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/huggingface/__init__.py +0 -0
  238. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/huggingface/module.py +0 -0
  239. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/mamba/__init__.py +0 -0
  240. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/mamba/mamba_layer_specs.py +0 -0
  241. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/mamba/mamba_model.py +0 -0
  242. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/mimo/__init__.py +0 -0
  243. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/mimo/config/__init__.py +0 -0
  244. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/mimo/config/base_configs.py +0 -0
  245. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/mimo/model/__init__.py +0 -0
  246. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/mimo/model/base.py +0 -0
  247. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/mimo/submodules/audio.py +0 -0
  248. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/mimo/submodules/base.py +0 -0
  249. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/mimo/submodules/vision.py +0 -0
  250. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/multimodal/__init__.py +0 -0
  251. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/multimodal/context_parallel.py +0 -0
  252. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/multimodal/llava_model.py +0 -0
  253. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/multimodal/llava_spec.py +0 -0
  254. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/retro/__init__.py +0 -0
  255. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/retro/base_attention.py +0 -0
  256. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/retro/config.py +0 -0
  257. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/retro/decoder_attention.py +0 -0
  258. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/retro/decoder_spec.py +0 -0
  259. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/retro/encoder_attention.py +0 -0
  260. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/retro/encoder_spec.py +0 -0
  261. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/retro/model.py +0 -0
  262. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/retro/utils.py +0 -0
  263. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/vision/__init__.py +0 -0
  264. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/vision/clip_vit_model.py +0 -0
  265. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/vision/multimodal_projector.py +0 -0
  266. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/vision/radio.py +0 -0
  267. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/models/vision/vit_layer_specs.py +0 -0
  268. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/msc_utils.py +0 -0
  269. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/num_microbatches_calculator.py +0 -0
  270. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/optimizer/cpu_offloading/__init__.py +0 -0
  271. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/optimizer/cpu_offloading/hybrid_optimizer.py +0 -0
  272. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/optimizer/grad_scaler.py +0 -0
  273. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/optimizer_param_scheduler.py +0 -0
  274. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/packed_seq_params.py +0 -0
  275. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/pipeline_parallel/__init__.py +0 -0
  276. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/pipeline_parallel/combined_1f1b.py +0 -0
  277. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/post_training/__init__.py +0 -0
  278. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/post_training/modelopt/__init__.py +0 -0
  279. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/post_training/modelopt/gpt/__init__.py +0 -0
  280. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/post_training/modelopt/gpt/model_specs.py +0 -0
  281. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/post_training/modelopt/gpt/state_dict_hooks.py +0 -0
  282. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/post_training/modelopt/layers.py +0 -0
  283. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/post_training/modelopt/mamba/__init__.py +0 -0
  284. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/post_training/modelopt/mamba/model_specs.py +0 -0
  285. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/quantization/__init__.py +0 -0
  286. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/quantization/quant_config.py +0 -0
  287. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/quantization/utils.py +0 -0
  288. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/requirements.txt +0 -0
  289. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/rerun_state_machine.py +0 -0
  290. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/ssm/__init__.py +0 -0
  291. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/ssm/mamba_block.py +0 -0
  292. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/ssm/mamba_context_parallel.py +0 -0
  293. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/ssm/mamba_hybrid_layer_allocation.py +0 -0
  294. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/ssm/mamba_layer.py +0 -0
  295. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/ssm/mamba_mixer.py +0 -0
  296. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/ssm/mlp_layer.py +0 -0
  297. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/ssm/triton_cache_manager.py +0 -0
  298. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/tensor_parallel/__init__.py +0 -0
  299. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/tensor_parallel/cross_entropy.py +0 -0
  300. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/tensor_parallel/data.py +0 -0
  301. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/tensor_parallel/mappings.py +0 -0
  302. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/tensor_parallel/random.py +0 -0
  303. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/tensor_parallel/utils.py +0 -0
  304. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/timers.py +0 -0
  305. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/__init__.py +0 -0
  306. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/custom_layers/__init__.py +0 -0
  307. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/custom_layers/transformer_engine.py +0 -0
  308. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/dot_product_attention.py +0 -0
  309. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/enums.py +0 -0
  310. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/heterogeneous/heterogeneous_config.py +0 -0
  311. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/heterogeneous/linear_replacements.py +0 -0
  312. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/identity_op.py +0 -0
  313. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/mlp.py +0 -0
  314. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/module.py +0 -0
  315. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/moe/__init__.py +0 -0
  316. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/moe/fused_a2a.py +0 -0
  317. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/moe/grouped_gemm_util.py +0 -0
  318. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/moe/moe_layer.py +0 -0
  319. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/moe/shared_experts.py +0 -0
  320. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/moe/upcycling_utils.py +0 -0
  321. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/pipeline_parallel_layer_layout.py +0 -0
  322. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/spec_utils.py +0 -0
  323. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/torch_layer_norm.py +0 -0
  324. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/torch_norm.py +0 -0
  325. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron/core/transformer/transformer_block.py +0 -0
  326. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron_core.egg-info/dependency_links.txt +0 -0
  327. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/megatron_core.egg-info/top_level.txt +0 -0
  328. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/setup.cfg +0 -0
  329. {megatron_core-0.14.0rc5 → megatron_core-0.14.0rc7}/setup.py +0 -0
@@ -0,0 +1,536 @@
1
+ Metadata-Version: 2.4
2
+ Name: megatron-core
3
+ Version: 0.14.0rc7
4
+ Summary: Megatron Core - a library for efficient and scalable training of transformer based models
5
+ Author-email: NVIDIA <nemo-toolkit@nvidia.com>
6
+ Maintainer-email: NVIDIA <nemo-toolkit@nvidia.com>
7
+ License: Apache 2.0
8
+ Project-URL: Download, https://github.com/NVIDIA/Megatron-LM/releases
9
+ Project-URL: Homepage, https://github.com/NVIDIA/Megatron-LM/megatron/core
10
+ Keywords: NLP,NLU,deep,gpu,language,learning,learning,machine,nvidia,pytorch,torch,transformer
11
+ Classifier: Development Status :: 5 - Production/Stable
12
+ Classifier: Environment :: Console
13
+ Classifier: Intended Audience :: Developers
14
+ Classifier: Intended Audience :: Information Technology
15
+ Classifier: Intended Audience :: Science/Research
16
+ Classifier: License :: OSI Approved :: BSD License
17
+ Classifier: Natural Language :: English
18
+ Classifier: Operating System :: OS Independent
19
+ Classifier: Programming Language :: Python :: 3
20
+ Classifier: Programming Language :: Python :: 3.8
21
+ Classifier: Programming Language :: Python :: 3.9
22
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
23
+ Classifier: Topic :: Scientific/Engineering :: Image Recognition
24
+ Classifier: Topic :: Scientific/Engineering :: Mathematics
25
+ Classifier: Topic :: Scientific/Engineering
26
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
27
+ Classifier: Topic :: Software Development :: Libraries
28
+ Classifier: Topic :: Utilities
29
+ Requires-Python: >=3.10
30
+ Description-Content-Type: text/markdown
31
+ License-File: LICENSE
32
+ Requires-Dist: torch
33
+ Requires-Dist: numpy<2.0.0
34
+ Requires-Dist: packaging
35
+ Provides-Extra: mlm
36
+ Requires-Dist: flask-restful; extra == "mlm"
37
+ Requires-Dist: sentencepiece; extra == "mlm"
38
+ Requires-Dist: tiktoken; extra == "mlm"
39
+ Requires-Dist: wandb; extra == "mlm"
40
+ Provides-Extra: dev
41
+ Requires-Dist: tqdm; extra == "dev"
42
+ Requires-Dist: einops~=0.8; extra == "dev"
43
+ Requires-Dist: tensorstore!=0.1.46,!=0.1.72,~=0.1; extra == "dev"
44
+ Requires-Dist: nvtx~=0.2; extra == "dev"
45
+ Requires-Dist: transformers~=4.53; extra == "dev"
46
+ Requires-Dist: multi-storage-client~=0.20; extra == "dev"
47
+ Requires-Dist: opentelemetry-api~=1.33.1; extra == "dev"
48
+ Requires-Dist: setuptools<80.0.0; extra == "dev"
49
+ Requires-Dist: mamba-ssm~=2.2; extra == "dev"
50
+ Requires-Dist: causal-conv1d~=1.5; extra == "dev"
51
+ Requires-Dist: nv-grouped-gemm~=1.1; extra == "dev"
52
+ Requires-Dist: transformer-engine[pytorch]<2.7.0,>=2.6.0a0; extra == "dev"
53
+ Requires-Dist: nvidia-resiliency-ext<0.5.0,>=0.4.0a0; extra == "dev"
54
+ Requires-Dist: nvidia-modelopt[torch]<0.34.0,>=0.33.0a0; sys_platform != "darwin" and extra == "dev"
55
+ Requires-Dist: megatron-energon[av_decode]~=6.0; extra == "dev"
56
+ Requires-Dist: flashinfer-python; extra == "dev"
57
+ Requires-Dist: onnxscript; extra == "dev"
58
+ Provides-Extra: lts
59
+ Requires-Dist: tqdm; extra == "lts"
60
+ Requires-Dist: einops; extra == "lts"
61
+ Requires-Dist: tensorstore!=0.1.46,!=0.1.72; extra == "lts"
62
+ Requires-Dist: nvtx; extra == "lts"
63
+ Requires-Dist: transformers; extra == "lts"
64
+ Requires-Dist: zarr; extra == "lts"
65
+ Requires-Dist: setuptools<80.0.0; extra == "lts"
66
+ Dynamic: license-file
67
+
68
+ <div align="center">
69
+
70
+ Megatron-LM & Megatron Core
71
+ ===========================
72
+ <h4>GPU-optimized library for training transformer models at scale</h4>
73
+
74
+ [![Documentation](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](https://docs.nvidia.com/Megatron-Core/developer-guide/latest/index.html)
75
+ [![version](https://img.shields.io/badge/release-0.12.0-green)](./CHANGELOG.md)
76
+ [![license](https://img.shields.io/badge/license-Apache-blue)](./LICENSE)
77
+
78
+ <div align="left">
79
+
80
+ ## ⚡ Quick Start
81
+
82
+ ```bash
83
+ # 1. Install Megatron Core with required dependencies
84
+ pip install megatron-core
85
+ pip install --no-build-isolation transformer-engine[pytorch]
86
+
87
+ # 2. Clone repository for examples
88
+ git clone https://github.com/NVIDIA/Megatron-LM.git
89
+ cd Megatron-LM
90
+ ```
91
+
92
+ **→ [Complete Installation Guide](#installation)** - Docker, pip variants (dev, lts, etc.), source installation, and system requirements
93
+
94
+ # Latest News
95
+
96
+ - 📣 NEW! **[DeepSeek & MoE Training with FP8](https://github.com/yanring/Megatron-MoE-ModelZoo)** examples are now available, including optimized configurations for `DeepSeek-V3`, `Qwen2` and `Mixtral` models with FP8 precision support.
97
+ - **[2025/05]** Megatron Core v0.11.0 brings new capabilities for multi-data center LLM training ([blog](https://developer.nvidia.com/blog/turbocharge-llm-training-across-long-haul-data-center-networks-with-nvidia-nemo-framework/)).
98
+
99
+ <details>
100
+ <summary>Previous News</summary>
101
+
102
+ - **[2024/07]** Megatron Core v0.7 improves scalability and training resiliency and adds support for multimodal training ([blog](https://developer.nvidia.com/blog/train-generative-ai-models-more-efficiently-with-new-nvidia-Megatron-Core-functionalities/)).
103
+ - **[2024/06]** Megatron Core added support for Mamba-based models. Check out our paper [An Empirical Study of Mamba-based Language Models](https://arxiv.org/pdf/2406.07887) and [code example](https://github.com/NVIDIA/Megatron-LM/tree/ssm/examples/mamba).
104
+ - **[2024/01 Announcement]** NVIDIA has released the core capabilities in **Megatron-LM** into [**Megatron Core**](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core) in this repository. Megatron Core expands upon Megatron-LM's GPU-optimized techniques with more cutting-edge innovations on system-level optimizations, featuring composable and modular APIs. Explore the [Megatron Core intro](#megatron-overview) for more details.
105
+
106
+ </details>
107
+
108
+ <details>
109
+ <summary>Table of Contents</summary>
110
+
111
+ **Getting Started**
112
+ - [Quick Start](#-quick-start)
113
+ - [Latest News](#latest-news)
114
+ - [Megatron Overview](#megatron-overview)
115
+ - [Project Structure](#project-structure)
116
+ - [Megatron-LM: Reference Implementation](#megatron-lm-reference-implementation)
117
+ - [Megatron Core: Composable Library](#megatron-core-composable-library)
118
+ - [Installation](#installation)
119
+ - [Docker (Recommended)](#-docker-recommended)
120
+ - [Pip Installation](#pip-installation)
121
+ - [Source Installation](#source-installation)
122
+ - [System Requirements](#system-requirements)
123
+
124
+ **Core Features**
125
+ - [Performance Benchmarking](#performance-benchmarking)
126
+ - [Weak Scaling Results](#weak-scaling-results)
127
+ - [Strong Scaling Results](#strong-scaling-results)
128
+ - [Ecosystem Libraries](#ecosystem-libraries)
129
+
130
+ **Training**
131
+ - [Training](#training)
132
+ - [Getting Started](#getting-started)
133
+ - [Data Preparation](#data-preparation)
134
+ - [Parallelism Strategies](#parallelism-strategies)
135
+ - [Data Parallelism (DP)](#data-parallelism-dp)
136
+ - [Tensor Parallelism (TP)](#tensor-parallelism-tp)
137
+ - [Pipeline Parallelism (PP)](#pipeline-parallelism-pp)
138
+ - [Context Parallelism (CP)](#context-parallelism-cp)
139
+ - [Expert Parallelism (EP)](#expert-parallelism-ep)
140
+ - [Parallelism Selection Guide](#parallelism-selection-guide)
141
+ - [Performance Optimizations](#performance-optimizations)
142
+
143
+ **Resources**
144
+ - [Examples](./examples/) - Training scripts and tutorials
145
+ - [Documentation](https://docs.nvidia.com/Megatron-Core/) - Official docs
146
+ - [Community & Support](#community--support) - Get help and contribute
147
+ - [Getting Help](#getting-help)
148
+ - [Contributing](#contributing)
149
+ - [Citation](#citation)
150
+
151
+ </details>
152
+
153
+ # Megatron Overview
154
+
155
+ ## Project Structure
156
+ ```
157
+ Megatron-LM/
158
+ ├── megatron/
159
+ │ ├── core/ # Megatron Core (kernels, parallelism, building blocks)
160
+ │ │ ├── models/ # Transformer models
161
+ │ │ ├── transformer/ # Transformer building blocks
162
+ │ │ ├── tensor_parallel/ # Tensor parallelism
163
+ │ │ ├── pipeline_parallel/ # Pipeline parallelism
164
+ │ │ ├── distributed/ # Distributed training (FSDP, DDP)
165
+ │ │ ├── optimizer/ # Optimizers
166
+ │ │ ├── datasets/ # Dataset loaders
167
+ │ │ ├── inference/ # Inference engines
168
+ │ │ └── export/ # Model export (e.g. TensorRT-LLM)
169
+ │ ├── training/ # Training scripts
170
+ │ ├── inference/ # Inference server
171
+ │ ├── legacy/ # Legacy components
172
+ │ └── post_training/ # Post-training (RLHF, etc.)
173
+ ├── examples/ # Ready-to-use training examples
174
+ ├── tools/ # Utility tools
175
+ ├── tests/ # Comprehensive test suite
176
+ └── docs/ # Documentation
177
+ ```
178
+
179
+ ### Megatron-LM: Reference Implementation
180
+ **Reference implementation** that includes Megatron Core plus everything needed to train models.
181
+
182
+ **Best for:**
183
+ - **Training state-of-the-art foundation models** at scale with cutting-edge performance on latest NVIDIA hardware
184
+ - **Research teams** exploring new architectures and training techniques
185
+ - **Learning distributed training** concepts and best practices
186
+ - **Quick experimentation** with proven model configurations
187
+
188
+ **What you get:**
189
+ - Pre-configured training scripts for GPT, Llama, DeepSeek, Qwen, and more.
190
+ - End-to-end examples from data prep to evaluation
191
+ - Research-focused tools and utilities
192
+
193
+ ### Megatron Core: Composable Library
194
+ **Composable library** with GPU-optimized building blocks for custom training frameworks.
195
+
196
+ **Best for:**
197
+ - **Framework developers** building on top of modular and optimized components
198
+ - **Research teams** needing custom training loops, optimizers, or data pipelines
199
+ - **ML engineers** requiring fault-tolerant training pipelines
200
+
201
+ **What you get:**
202
+ - Composable transformer building blocks (attention, MLP, etc.)
203
+ - Advanced parallelism strategies (TP, PP, DP, EP, CP)
204
+ - Pipeline schedules and distributed optimizers
205
+ - Mixed precision support (FP16, BF16, FP8)
206
+ - GPU-optimized kernels and memory management
207
+ - High-performance dataloaders and dataset utilities
208
+ - Model architectures (LLaMA, Qwen, GPT, Mixtral, Mamba, etc.)
209
+
210
+ ## Ecosystem Libraries
211
+
212
+ **Libraries used by Megatron Core:**
213
+
214
+ - **[Megatron Energon](https://github.com/NVIDIA/Megatron-Energon)** 📣 **NEW!** - Multi-modal data loader (text, images, video, audio) with distributed loading and dataset blending
215
+ - **[Transformer Engine](https://github.com/NVIDIA/TransformerEngine)** - Optimized kernels and FP8 mixed precision support
216
+ - **[Resiliency Extension (NVRx)](https://github.com/NVIDIA/nvidia-resiliency-ext)** - Fault tolerant training with failure detection and recovery
217
+
218
+ **Libraries using Megatron Core:**
219
+
220
+ - **[NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html)** - Enterprise framework with cloud-native support and end-to-end examples
221
+ - **[TensorRT Model Optimizer (ModelOpt)](https://github.com/NVIDIA/TensorRT-Model-Optimizer)** - Model optimization toolkit for quantization, pruning, and distillation
222
+
223
+ **Compatible with:** [HuggingFace Accelerate](https://github.com/huggingface/accelerate), [Colossal-AI](https://github.com/hpcaitech/ColossalAI), [DeepSpeed](https://github.com/microsoft/DeepSpeed)
224
+
225
+ # Installation
226
+
227
+ ## 🐳 Docker (Recommended)
228
+
229
+ We strongly recommend using the previous release of the [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) rather than the latest one, for optimal compatibility with Megatron Core releases and testing. Our releases are always based on the previous month's NGC container, which ensures compatibility and stability.
230
+
231
+ This container comes with all dependencies pre-installed with compatible versions and optimized configurations for NVIDIA GPUs:
232
+
233
+ - PyTorch (latest stable version)
234
+ - CUDA, cuDNN, NCCL (latest stable versions)
235
+ - Support for FP8 on NVIDIA Hopper, Ada, and Blackwell GPUs
236
+ - For best performance, use GPUs of the NVIDIA Turing architecture or later
237
+
238
+ ```bash
239
+ # Run container with mounted directories
240
+ docker run --runtime=nvidia --gpus all -it --rm \
241
+ -v /path/to/megatron:/workspace/megatron \
242
+ -v /path/to/dataset:/workspace/dataset \
243
+ -v /path/to/checkpoints:/workspace/checkpoints \
244
+ nvcr.io/nvidia/pytorch:25.04-py3
245
+ ```
246
+
247
+ ## Pip Installation
248
+
249
+ Megatron Core offers two optional dependency sets, each targeting an NGC PyTorch container:
250
+
251
+ - `dev`: Moving head that supports the most recent upstream dependencies
252
+ - `lts`: Long-term support of NGC PyTorch 24.01
253
+
254
+ Both containers can be combined with `mlm` which adds package dependencies for Megatron-LM on top of Megatron Core.
255
+
256
+ ```bash
257
+ # Install the latest release for the dev environment (latest upstream dependencies)
258
+ pip install megatron-core[dev]
259
+ ```
260
+
261
+ ```bash
262
+ # Install packages for LTS support NGC PyTorch 24.01
263
+ pip install megatron-core[lts]
264
+ ```
265
+
266
+ For a minimal version of Megatron Core that depends only on torch, run:
267
+
268
+ ```bash
269
+ pip install megatron-core
270
+ ```
271
+
272
+ For dependencies required by Megatron-LM, please run:
273
+
274
+ ```bash
275
+ pip install megatron-core[mlm]
276
+ ```
277
+
278
+ ## Source Installation
279
+
280
+ For development or latest features:
281
+
282
+ For hybrid (Mamba-based) models, Megatron Core requires [mamba](https://github.com/state-spaces/mamba). If the pre-built wheel on PyPI does not fit your environment, you can fall back to the install script Megatron Core uses in its CI system. For this, please install `uv` first:
283
+
284
+ ```bash
285
+ export UV_VERSION=0.7.2
286
+ export PATH="$HOME/.local/bin:$PATH"
287
+ curl -LsSf https://astral.sh/uv/${UV_VERSION}/install.sh | sh
288
+ export UV_PROJECT_ENVIRONMENT=./venv
289
+ export PATH="$UV_PROJECT_ENVIRONMENT/bin:$PATH"
290
+ export UV_LINK_MODE=copy
291
+ ```
292
+
293
+ Run the following command to build upstream dependencies from source:
294
+
295
+ ```bash
296
+ # Clone and install
297
+ git clone https://github.com/NVIDIA/Megatron-LM.git
298
+ cd Megatron-LM
299
+
300
+ # Optional: checkout specific release
301
+ git checkout core_r0.13.0
302
+
303
+ bash docker/common/install.sh --environment {dev,lts}
304
+ ```
305
+
306
+ ## System Requirements
307
+
308
+ ### Hardware Requirements
309
+ - **FP8 Support**: NVIDIA Hopper, Ada, Blackwell GPUs
310
+ - **Recommended**: NVIDIA Turing architecture or later
311
+
312
+ ### Software Requirements
313
+ - **CUDA/cuDNN/NCCL**: Latest stable versions
314
+ - **PyTorch**: Latest stable version
315
+ - **Transformer Engine**: Latest stable version
316
+ - **Python**: 3.12 recommended
317
+
318
+ # Performance Benchmarking
319
+
320
+ For our latest performance benchmarking results, please refer to [NVIDIA NeMo Framework Performance Summary](https://docs.nvidia.com/nemo-framework/user-guide/latest/performance/performance_summary.html).
321
+
322
+ Our codebase efficiently trains models from 2B to 462B parameters across thousands of GPUs, achieving up to **47% Model FLOP Utilization (MFU)** on H100 clusters.
323
+
324
+ ![Model table](images/model_table.png)
325
+
326
+ **Benchmark Configuration:**
327
+ - **Vocabulary size**: 131,072 tokens
328
+ - **Sequence length**: 4096 tokens
329
+ - **Model scaling**: Varied hidden size, attention heads, and layers to achieve target parameter counts
330
+ - **Communication optimizations**: Fine-grained overlapping with DP (`--overlap-grad-reduce`, `--overlap-param-gather`), TP (`--tp-comm-overlap`), and PP (enabled by default)
331
+
332
+ **Key Results:**
333
+ - **6144 H100 GPUs**: Successfully benchmarked 462B parameter model training
334
+ - **Superlinear scaling**: MFU increases from 41% to 47-48% with model size
335
+ - **End-to-end measurement**: Throughputs include all operations (data loading, optimizer steps, communication, logging)
336
+ - **Production ready**: Full training pipeline with checkpointing and fault tolerance
337
+ - *Note: Performance results measured without training to convergence*
338
+
339
+ ## Weak Scaling Results
340
+ Our weak-scaling results show superlinear scaling (MFU increases from 41% for the smallest model considered to 47-48% for the largest models); this is because larger GEMMs have higher arithmetic intensity and are consequently more efficient to execute.
341
+
342
+ ![Weak scaling](images/weak_scaling.png)
343
+
344
+ ## Strong Scaling Results
345
+ We also strong-scaled the standard GPT-3 model (our version has slightly more than 175 billion parameters due to the larger vocabulary size) from 96 H100 GPUs to 4608 GPUs, using the same batch size of 1152 sequences throughout. Communication becomes more exposed at larger scale, leading to a reduction in MFU from 47% to 42%.
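These MFU figures can be sanity-checked from measured throughput with the common 6·N FLOPs-per-token approximation for dense transformer training (forward plus backward). A minimal sketch; the throughput and peak-FLOPs numbers below are illustrative assumptions, not measured values, and attention FLOPs and activation recomputation are ignored:

```python
def estimate_mfu(n_params: float, tokens_per_sec: float, n_gpus: int,
                 peak_flops_per_gpu: float) -> float:
    """Model FLOP Utilization via the ~6*N FLOPs/token approximation
    (forward + backward) for dense transformer training."""
    achieved_flops = 6.0 * n_params * tokens_per_sec
    return achieved_flops / (n_gpus * peak_flops_per_gpu)

# Illustrative: a 175B-parameter model on 96 GPUs with an assumed
# aggregate throughput of 43k tokens/s and ~989 TFLOP/s BF16 peak per GPU
mfu = estimate_mfu(n_params=175e9, tokens_per_sec=43_000, n_gpus=96,
                   peak_flops_per_gpu=989e12)
print(f"MFU ~ {mfu:.1%}")
```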
346
+
347
+ ![Strong scaling](images/strong_scaling.png)
348
+
349
+ # Training
350
+
351
+ ## Getting Started
352
+
353
+ ### Simple Training Example
354
+ ```bash
355
+ # Distributed training example (2 GPUs, mock data)
356
+ torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py
357
+ ```
358
+
359
+ ### Llama-3 Training Example
360
+ ```bash
361
+ # 8 GPUs, FP8 precision, mock data
362
+ ./examples/llama/train_llama3_8b_fp8.sh
363
+ ```
364
+
365
+ ## Data Preparation
366
+
367
+ ### JSONL Data Format
368
+ ```json
369
+ {"text": "Your training text here..."}
370
+ {"text": "Another training sample..."}
371
+ ```
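A small Python sketch for producing and validating this format, one JSON object per line (the file name `data.jsonl` is just an example):

```python
import json

samples = ["Your training text here...", "Another training sample..."]

# Write one JSON object per line (the JSONL format shown above)
with open("data.jsonl", "w", encoding="utf-8") as f:
    for text in samples:
        f.write(json.dumps({"text": text}) + "\n")

# Validate: every line must parse on its own and carry a "text" field
with open("data.jsonl", encoding="utf-8") as f:
    for line in f:
        assert "text" in json.loads(line)
```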
372
+
373
+ ### Basic Preprocessing
374
+ ```bash
375
+ python tools/preprocess_data.py \
376
+ --input data.jsonl \
377
+ --output-prefix processed_data \
378
+ --tokenizer-type HuggingFaceTokenizer \
379
+ --tokenizer-model /path/to/tokenizer.model \
380
+ --workers 8 \
381
+ --append-eod
382
+ ```
383
+
384
+ ### Key Arguments
385
+ - `--input`: Path to input JSON/JSONL file
386
+ - `--output-prefix`: Prefix for output binary files (.bin and .idx)
387
+ - `--tokenizer-type`: Tokenizer type (`HuggingFaceTokenizer`, `GPT2BPETokenizer`, etc.)
388
+ - `--tokenizer-model`: Path to tokenizer model file
389
+ - `--workers`: Number of parallel workers for processing
390
+ - `--append-eod`: Add end-of-document token
391
+
392
+ <!-- **→ [Complete Data Preparation Guide](./docs/data-preparation.md)** - Comprehensive guide covering advanced preprocessing, dataset collection, deduplication, and optimization strategies -->
393
+
394
+ # Parallelism Strategies
395
+
396
+ ## Data Parallelism (DP)
397
+
398
+ ### Standard Data Parallel
399
+ ```bash
400
+ # Standard DDP - replicate model on each GPU
401
+ torchrun --nproc_per_node=8 pretrain_gpt.py \
402
+ --data-parallel-sharding-strategy no_shard
403
+ ```
404
+
405
+ ### Fully Sharded Data Parallel (FSDP)
406
+ ```bash
407
+ # Megatron's optimized FSDP (~15% faster than PyTorch FSDP2)
408
+ --use-custom-fsdp
409
+
410
+ # PyTorch FSDP2
411
+ --use-torch-fsdp2
412
+
413
+ # Sharding strategies
414
+ --data-parallel-sharding-strategy optim # Shard optimizer states (ZeRO-1)
415
+ --data-parallel-sharding-strategy optim_grads # Shard gradients + optimizer (ZeRO-2)
416
+ --data-parallel-sharding-strategy optim_grads_params # Shard parameters + gradients + optimizer (ZeRO-3)
417
+ ```
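The sharding strategies differ in which parts of the model state each data-parallel rank keeps. A back-of-the-envelope sketch of per-GPU model-state memory, assuming BF16 parameters and gradients and an FP32 Adam optimizer (FP32 master params plus two moments, i.e. 12 bytes/param of optimizer state); activations and buffers are excluded:

```python
def model_state_bytes_per_gpu(n_params: float, dp: int, strategy: str) -> float:
    """Approximate per-GPU bytes of model state under each sharding strategy."""
    PARAM, GRAD, OPTIM = 2.0, 2.0, 12.0  # bytes/param: BF16, BF16, FP32 Adam
    if strategy == "no_shard":            # plain DDP: everything replicated
        return n_params * (PARAM + GRAD + OPTIM)
    if strategy == "optim":               # ZeRO-1: shard optimizer states
        return n_params * (PARAM + GRAD + OPTIM / dp)
    if strategy == "optim_grads":         # ZeRO-2: shard grads + optimizer
        return n_params * (PARAM + (GRAD + OPTIM) / dp)
    if strategy == "optim_grads_params":  # ZeRO-3: shard everything
        return n_params * (PARAM + GRAD + OPTIM) / dp
    raise ValueError(strategy)

# 8B parameters, 8-way data parallel, fully sharded: ~15 GiB of model state
gb = model_state_bytes_per_gpu(8e9, 8, "optim_grads_params") / 2**30
```

The ordering `no_shard > optim > optim_grads > optim_grads_params` in per-GPU memory is the trade-off you pay for with extra communication.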
418
+
419
+ ## Tensor Parallelism (TP)
420
+ Split individual model layers across GPUs:
421
+ ```bash
422
+ --tensor-model-parallel-size 4 # 4-way tensor parallelism
423
+ --sequence-parallel # Enable sequence parallelism (recommended with TP)
424
+ ```
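The core idea can be sketched in a few lines: split a linear layer's weight matrix column-wise across ranks, let each rank compute its output slice locally, and concatenate the slices. A toy pure-Python sketch of the math, not the Megatron implementation:

```python
def matmul(x, w):  # x: [m][k], w: [k][n] -> [m][n]
    return [[sum(x[i][t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]

x = [[1.0, 2.0]]                  # one token, hidden size 2
w = [[1.0, 2.0, 3.0, 4.0],       # full weight matrix: [2][4]
     [5.0, 6.0, 7.0, 8.0]]

tp = 2                            # simulate 2-way tensor parallelism
# Rank r owns columns [r*2, (r+1)*2) of the weight
shards = [[row[r * 2:(r + 1) * 2] for row in w] for r in range(tp)]

# Each "rank" computes its output slice; concatenation == full result
partials = [matmul(x, shard) for shard in shards]
combined = [sum((p[0] for p in partials), [])]
assert combined == matmul(x, w)
```

Row-wise splits work dually (partial sums combined with an all-reduce); real layers interleave both to minimize communication.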
425
+
426
+ ## Pipeline Parallelism (PP)
427
+ Split model depth across GPUs:
428
+ ```bash
429
+ --pipeline-model-parallel-size 8 # 8 pipeline stages
430
+ --virtual-pipeline-model-parallel-size 4 # Virtual pipeline for better load balancing
431
+ ```
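A simplified sketch of how layers map to pipeline stages, with and without the virtual (interleaved) pipeline; the actual placement in Megatron Core is configurable, so treat this as an illustration of the interleaving idea only:

```python
def stage_of_layer(layer: int, n_layers: int, pp: int, vp: int = 1) -> int:
    """Which pipeline stage (GPU group) owns a given layer.
    With virtual pipelining, each stage holds vp non-contiguous
    chunks of the model, which shrinks the pipeline bubble."""
    layers_per_chunk = n_layers // (pp * vp)
    chunk = layer // layers_per_chunk
    return chunk % pp

# 32 layers, 8 stages: contiguous blocks of 4 layers without interleaving
assert stage_of_layer(5, 32, pp=8) == 1
# With vp=4, each stage owns 4 chunks of 1 layer, assigned round-robin
assert [stage_of_layer(l, 32, 8, vp=4) for l in range(8)] == list(range(8))
```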
432
+
433
+ ## Context Parallelism (CP)
434
+ Split long sequences across GPUs for handling long contexts:
435
+ ```bash
436
+ --context-parallel-size 2 # 2-way context parallelism
437
+ --cp-comm-type p2p # Communication: p2p, a2a, allgather, a2a+p2p
438
+ --hierarchical-context-parallel-sizes 2 4 # Hierarchical context parallelism
439
+ ```
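For causal attention, naively splitting the sequence into `cp` contiguous blocks imbalances the ranks, since later tokens attend to more context. A common remedy, sketched below under that assumption, is to cut the sequence into `2*cp` chunks and give rank `r` both an early and a late chunk:

```python
def cp_chunks(seq_len: int, cp: int) -> dict:
    """Assign sequence spans to context-parallel ranks so each rank gets
    one 'early' and one 'late' chunk (balanced causal-attention work)."""
    n_chunks = 2 * cp
    chunk = seq_len // n_chunks
    spans = [(i * chunk, (i + 1) * chunk) for i in range(n_chunks)]
    # Rank r holds chunk r and its mirror chunk 2*cp - 1 - r
    return {r: [spans[r], spans[n_chunks - 1 - r]] for r in range(cp)}

# seq_len=8192, cp=2 -> rank 0 holds tokens [0, 2048) and [6144, 8192)
assignment = cp_chunks(8192, 2)
```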
440
+
441
+ ## Expert Parallelism (EP)
442
+ For Mixture of Experts (MoE) models:
443
+ ```bash
444
+ --expert-model-parallel-size 4 # 4-way expert parallelism
445
+ --num-experts 8 # 8 experts per MoE layer
446
+ --moe-grouped-gemm # Optimize expert computation
447
+ ```
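With the flags above (8 experts, 4-way EP), each expert-parallel rank hosts `num_experts / ep` local experts, and routed tokens are sent to the rank owning their expert. A minimal sketch, assuming a contiguous expert-to-rank layout:

```python
def expert_owner(expert_id: int, num_experts: int, ep: int) -> int:
    """EP rank hosting a given expert, with experts sharded contiguously."""
    assert num_experts % ep == 0, "--num-experts must divide evenly across EP ranks"
    experts_per_rank = num_experts // ep
    return expert_id // experts_per_rank

# --expert-model-parallel-size 4 with --num-experts 8: 2 experts per rank
assert [expert_owner(e, 8, 4) for e in range(8)] == [0, 0, 1, 1, 2, 2, 3, 3]
```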
448
+
449
+ ## Combining Parallelism Strategies
450
+
451
+ ### Parallelism Selection Guide
452
+
453
+ Based on [NVIDIA NeMo production configurations](https://github.com/NVIDIA/NeMo/tree/main/scripts/performance/recommended_model_configs):
454
+
455
+ | Model | Size | GPUs | TP | PP | CP | EP | Notes |
456
+ |-------|------|------|----|----|----|----|-------|
457
+ | **Llama-3** | 8B | 8 | 1 | 1 | 2 | 1 | CP for long seqlen (8K) |
458
+ | **Llama-3** | 70B | 64 | 4 | 4 | 2 | 1 | TP+PP |
459
+ | **Llama-3.1** | 405B | 1024 | 8 | 8 | 2 | 1 | 3D parallelism for scale |
460
+ | **GPT-3** | 175B | 128-512 | 4 | 8 | 1 | 1 | Large model config |
461
+ | **Mixtral** | 8x7B | 64 | 1 | 4 | 1 | 8 | EP for MoE |
462
+ | **Mixtral** | 8x22B | 256 | 4 | 4 | 8 | 8 | Combined TP+EP for large MoE |
463
+ | **DeepSeek-V3** | 671B | 1024 | 2 | 16 | 1 | 64 | Large MoE config |
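The GPU counts in the table factor as world_size = TP × PP × CP × DP, with EP further partitioning the experts within the data-parallel domain. A quick helper for checking a candidate configuration, illustrative only:

```python
def data_parallel_size(world: int, tp: int, pp: int, cp: int) -> int:
    """DP degree implied by a GPU count and model-parallel sizes
    (world = TP * PP * CP * DP; EP shards experts within the DP domain)."""
    denom = tp * pp * cp
    assert world % denom == 0, "world size must be divisible by TP*PP*CP"
    return world // denom

# Llama-3 70B row above: 64 GPUs, TP=4, PP=4, CP=2 -> DP=2
assert data_parallel_size(64, 4, 4, 2) == 2
# Mixtral 8x7B row: 64 GPUs, TP=1, PP=4, CP=1 -> DP=16 (EP=8 inside DP)
assert data_parallel_size(64, 1, 4, 1) == 16
```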
464
+
465
+ ### MoE-Specific Requirements
466
+
467
+ **Important**: When combining Expert Parallelism (EP) with Tensor Parallelism (TP), **Sequence Parallelism (SP) must be enabled**.
468
+
469
+ ## Performance Optimizations
470
+
471
+ | Feature | Flag | Benefit |
472
+ |---------|------|---------|
473
+ | **FlashAttention** | `--attention-backend` | Faster attention and lower memory usage |
474
+ | **FP8 Training** | `--fp8-hybrid` | Faster training |
475
+ | **Activation Checkpointing** | `--recompute-activations` | Reduced memory usage |
476
+ | **Data Parallelism Communication Overlap** | `--overlap-grad-reduce` | Faster distributed training |
477
+ | **Distributed Optimizer** | `--use-distributed-optimizer` | Reduced checkpointing time |
478
+
479
+ **→ [NVIDIA NeMo Framework Performance Tuning Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/performance/performance-guide.html#performance-tuning-guide)** - Comprehensive performance optimization guide covering advanced tuning techniques, communication overlaps, memory optimizations, and profiling options.
480
+
481
+ ### FlashAttention
482
+ [FlashAttention](https://github.com/Dao-AILab/flash-attention) is a fast and memory-efficient attention algorithm. We recommend the default usage, which uses cuDNN for attention via Transformer Engine and provides up to 50% speedups on forward and 84% on backward propagation with FP8 kernels. The `flash-attn` package is also supported via `--use-flash-attn`.
483
+
484
+ ### Mixed Precision Training
485
+ ```bash
486
+ --fp16 # Standard FP16
487
+ --bf16 # BFloat16 (recommended for large models)
488
+ --fp8-hybrid # FP8 training (Hopper, Ada, and Blackwell GPUs)
489
+ ```
490
+
491
+ ### Activation Checkpointing and Recomputation
492
+ ```bash
493
+ # For limited memory
494
+ --recompute-activations
495
+
496
+ # For extreme memory constraints
497
+ --recompute-granularity full \
498
+ --recompute-method uniform
499
+ ```
500
+
501
+ ### Data Parallelism Communication Overlap
502
+
503
+ ```bash
504
+ --overlap-grad-reduce
505
+ --overlap-param-gather
506
+ ```
507
+
508
+ ### Distributed Optimizer
509
+ ```bash
510
+ --use-distributed-optimizer
511
+ ```
512
+
513
+ # Community & Support
514
+
515
+ ## Getting Help
516
+ - 📖 **[Documentation](https://docs.nvidia.com/Megatron-Core/)** - Official documentation
517
+ - 🐛 **[Issues](https://github.com/NVIDIA/Megatron-LM/issues)** - Bug reports and feature requests
518
+
519
+ ## Contributing
520
+ We ❤️ contributions! Ways to contribute:
521
+ - 🐛 **Report bugs** - Help us improve reliability
522
+ - 💡 **Suggest features** - Shape the future of Megatron Core
523
+ - 📝 **Improve docs** - Make Megatron Core more accessible
524
+ - 🔧 **Submit PRs** - Contribute code improvements
525
+
526
+ **→ [Contributing Guide](./CONTRIBUTING.md)**
527
+
528
+ ## Citation
529
+ ```bibtex
530
+ @article{megatron-lm,
531
+ title={Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism},
532
+ author={Shoeybi, Mohammad and Patwary, Mostofa and Puri, Raul and LeGresley, Patrick and Casper, Jared and Catanzaro, Bryan},
533
+ journal={arXiv preprint arXiv:1909.08053},
534
+ year={2019}
535
+ }
536
+ ```