@synsci/cli-darwin-x64 1.1.49
- package/bin/skills/accelerate/SKILL.md +332 -0
- package/bin/skills/accelerate/references/custom-plugins.md +453 -0
- package/bin/skills/accelerate/references/megatron-integration.md +489 -0
- package/bin/skills/accelerate/references/performance.md +525 -0
- package/bin/skills/audiocraft/SKILL.md +564 -0
- package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
- package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
- package/bin/skills/autogpt/SKILL.md +403 -0
- package/bin/skills/autogpt/references/advanced-usage.md +535 -0
- package/bin/skills/autogpt/references/troubleshooting.md +420 -0
- package/bin/skills/awq/SKILL.md +310 -0
- package/bin/skills/awq/references/advanced-usage.md +324 -0
- package/bin/skills/awq/references/troubleshooting.md +344 -0
- package/bin/skills/axolotl/SKILL.md +158 -0
- package/bin/skills/axolotl/references/api.md +5548 -0
- package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
- package/bin/skills/axolotl/references/index.md +15 -0
- package/bin/skills/axolotl/references/other.md +3563 -0
- package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
- package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
- package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
- package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
- package/bin/skills/bitsandbytes/SKILL.md +411 -0
- package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
- package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
- package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
- package/bin/skills/blip-2/SKILL.md +564 -0
- package/bin/skills/blip-2/references/advanced-usage.md +680 -0
- package/bin/skills/blip-2/references/troubleshooting.md +526 -0
- package/bin/skills/chroma/SKILL.md +406 -0
- package/bin/skills/chroma/references/integration.md +38 -0
- package/bin/skills/clip/SKILL.md +253 -0
- package/bin/skills/clip/references/applications.md +207 -0
- package/bin/skills/constitutional-ai/SKILL.md +290 -0
- package/bin/skills/crewai/SKILL.md +498 -0
- package/bin/skills/crewai/references/flows.md +438 -0
- package/bin/skills/crewai/references/tools.md +429 -0
- package/bin/skills/crewai/references/troubleshooting.md +480 -0
- package/bin/skills/deepspeed/SKILL.md +141 -0
- package/bin/skills/deepspeed/references/08.md +17 -0
- package/bin/skills/deepspeed/references/09.md +173 -0
- package/bin/skills/deepspeed/references/2020.md +378 -0
- package/bin/skills/deepspeed/references/2023.md +279 -0
- package/bin/skills/deepspeed/references/assets.md +179 -0
- package/bin/skills/deepspeed/references/index.md +35 -0
- package/bin/skills/deepspeed/references/mii.md +118 -0
- package/bin/skills/deepspeed/references/other.md +1191 -0
- package/bin/skills/deepspeed/references/tutorials.md +6554 -0
- package/bin/skills/dspy/SKILL.md +590 -0
- package/bin/skills/dspy/references/examples.md +663 -0
- package/bin/skills/dspy/references/modules.md +475 -0
- package/bin/skills/dspy/references/optimizers.md +566 -0
- package/bin/skills/faiss/SKILL.md +221 -0
- package/bin/skills/faiss/references/index_types.md +280 -0
- package/bin/skills/flash-attention/SKILL.md +367 -0
- package/bin/skills/flash-attention/references/benchmarks.md +215 -0
- package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
- package/bin/skills/gguf/SKILL.md +427 -0
- package/bin/skills/gguf/references/advanced-usage.md +504 -0
- package/bin/skills/gguf/references/troubleshooting.md +442 -0
- package/bin/skills/gptq/SKILL.md +450 -0
- package/bin/skills/gptq/references/calibration.md +337 -0
- package/bin/skills/gptq/references/integration.md +129 -0
- package/bin/skills/gptq/references/troubleshooting.md +95 -0
- package/bin/skills/grpo-rl-training/README.md +97 -0
- package/bin/skills/grpo-rl-training/SKILL.md +572 -0
- package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
- package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
- package/bin/skills/guidance/SKILL.md +572 -0
- package/bin/skills/guidance/references/backends.md +554 -0
- package/bin/skills/guidance/references/constraints.md +674 -0
- package/bin/skills/guidance/references/examples.md +767 -0
- package/bin/skills/hqq/SKILL.md +445 -0
- package/bin/skills/hqq/references/advanced-usage.md +528 -0
- package/bin/skills/hqq/references/troubleshooting.md +503 -0
- package/bin/skills/hugging-face-cli/SKILL.md +191 -0
- package/bin/skills/hugging-face-cli/references/commands.md +954 -0
- package/bin/skills/hugging-face-cli/references/examples.md +374 -0
- package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
- package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
- package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
- package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
- package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
- package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
- package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
- package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
- package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
- package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
- package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
- package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
- package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
- package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
- package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
- package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
- package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
- package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
- package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
- package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
- package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
- package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
- package/bin/skills/hugging-face-jobs/index.html +216 -0
- package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
- package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
- package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
- package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
- package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
- package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
- package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
- package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
- package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
- package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
- package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
- package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
- package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
- package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
- package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
- package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
- package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
- package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
- package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
- package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
- package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
- package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
- package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
- package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
- package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
- package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
- package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
- package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
- package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
- package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
- package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
- package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
- package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
- package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
- package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
- package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
- package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
- package/bin/skills/instructor/SKILL.md +740 -0
- package/bin/skills/instructor/references/examples.md +107 -0
- package/bin/skills/instructor/references/providers.md +70 -0
- package/bin/skills/instructor/references/validation.md +606 -0
- package/bin/skills/knowledge-distillation/SKILL.md +458 -0
- package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
- package/bin/skills/lambda-labs/SKILL.md +545 -0
- package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
- package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
- package/bin/skills/langchain/SKILL.md +480 -0
- package/bin/skills/langchain/references/agents.md +499 -0
- package/bin/skills/langchain/references/integration.md +562 -0
- package/bin/skills/langchain/references/rag.md +600 -0
- package/bin/skills/langsmith/SKILL.md +422 -0
- package/bin/skills/langsmith/references/advanced-usage.md +548 -0
- package/bin/skills/langsmith/references/troubleshooting.md +537 -0
- package/bin/skills/litgpt/SKILL.md +469 -0
- package/bin/skills/litgpt/references/custom-models.md +568 -0
- package/bin/skills/litgpt/references/distributed-training.md +451 -0
- package/bin/skills/litgpt/references/supported-models.md +336 -0
- package/bin/skills/litgpt/references/training-recipes.md +619 -0
- package/bin/skills/llama-cpp/SKILL.md +258 -0
- package/bin/skills/llama-cpp/references/optimization.md +89 -0
- package/bin/skills/llama-cpp/references/quantization.md +213 -0
- package/bin/skills/llama-cpp/references/server.md +125 -0
- package/bin/skills/llama-factory/SKILL.md +80 -0
- package/bin/skills/llama-factory/references/_images.md +23 -0
- package/bin/skills/llama-factory/references/advanced.md +1055 -0
- package/bin/skills/llama-factory/references/getting_started.md +349 -0
- package/bin/skills/llama-factory/references/index.md +19 -0
- package/bin/skills/llama-factory/references/other.md +31 -0
- package/bin/skills/llamaguard/SKILL.md +337 -0
- package/bin/skills/llamaindex/SKILL.md +569 -0
- package/bin/skills/llamaindex/references/agents.md +83 -0
- package/bin/skills/llamaindex/references/data_connectors.md +108 -0
- package/bin/skills/llamaindex/references/query_engines.md +406 -0
- package/bin/skills/llava/SKILL.md +304 -0
- package/bin/skills/llava/references/training.md +197 -0
- package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
- package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
- package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
- package/bin/skills/long-context/SKILL.md +536 -0
- package/bin/skills/long-context/references/extension_methods.md +468 -0
- package/bin/skills/long-context/references/fine_tuning.md +611 -0
- package/bin/skills/long-context/references/rope.md +402 -0
- package/bin/skills/mamba/SKILL.md +260 -0
- package/bin/skills/mamba/references/architecture-details.md +206 -0
- package/bin/skills/mamba/references/benchmarks.md +255 -0
- package/bin/skills/mamba/references/training-guide.md +388 -0
- package/bin/skills/megatron-core/SKILL.md +366 -0
- package/bin/skills/megatron-core/references/benchmarks.md +249 -0
- package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
- package/bin/skills/megatron-core/references/production-examples.md +473 -0
- package/bin/skills/megatron-core/references/training-recipes.md +547 -0
- package/bin/skills/miles/SKILL.md +315 -0
- package/bin/skills/miles/references/api-reference.md +141 -0
- package/bin/skills/miles/references/troubleshooting.md +352 -0
- package/bin/skills/mlflow/SKILL.md +704 -0
- package/bin/skills/mlflow/references/deployment.md +744 -0
- package/bin/skills/mlflow/references/model-registry.md +770 -0
- package/bin/skills/mlflow/references/tracking.md +680 -0
- package/bin/skills/modal/SKILL.md +341 -0
- package/bin/skills/modal/references/advanced-usage.md +503 -0
- package/bin/skills/modal/references/troubleshooting.md +494 -0
- package/bin/skills/model-merging/SKILL.md +539 -0
- package/bin/skills/model-merging/references/evaluation.md +462 -0
- package/bin/skills/model-merging/references/examples.md +428 -0
- package/bin/skills/model-merging/references/methods.md +352 -0
- package/bin/skills/model-pruning/SKILL.md +495 -0
- package/bin/skills/model-pruning/references/wanda.md +347 -0
- package/bin/skills/moe-training/SKILL.md +526 -0
- package/bin/skills/moe-training/references/architectures.md +432 -0
- package/bin/skills/moe-training/references/inference.md +348 -0
- package/bin/skills/moe-training/references/training.md +425 -0
- package/bin/skills/nanogpt/SKILL.md +290 -0
- package/bin/skills/nanogpt/references/architecture.md +382 -0
- package/bin/skills/nanogpt/references/data.md +476 -0
- package/bin/skills/nanogpt/references/training.md +564 -0
- package/bin/skills/nemo-curator/SKILL.md +383 -0
- package/bin/skills/nemo-curator/references/deduplication.md +87 -0
- package/bin/skills/nemo-curator/references/filtering.md +102 -0
- package/bin/skills/nemo-evaluator/SKILL.md +494 -0
- package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
- package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
- package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
- package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
- package/bin/skills/nemo-guardrails/SKILL.md +297 -0
- package/bin/skills/nnsight/SKILL.md +436 -0
- package/bin/skills/nnsight/references/README.md +78 -0
- package/bin/skills/nnsight/references/api.md +344 -0
- package/bin/skills/nnsight/references/tutorials.md +300 -0
- package/bin/skills/openrlhf/SKILL.md +249 -0
- package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
- package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
- package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
- package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
- package/bin/skills/outlines/SKILL.md +652 -0
- package/bin/skills/outlines/references/backends.md +615 -0
- package/bin/skills/outlines/references/examples.md +773 -0
- package/bin/skills/outlines/references/json_generation.md +652 -0
- package/bin/skills/peft/SKILL.md +431 -0
- package/bin/skills/peft/references/advanced-usage.md +514 -0
- package/bin/skills/peft/references/troubleshooting.md +480 -0
- package/bin/skills/phoenix/SKILL.md +475 -0
- package/bin/skills/phoenix/references/advanced-usage.md +619 -0
- package/bin/skills/phoenix/references/troubleshooting.md +538 -0
- package/bin/skills/pinecone/SKILL.md +358 -0
- package/bin/skills/pinecone/references/deployment.md +181 -0
- package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
- package/bin/skills/pytorch-fsdp/references/index.md +7 -0
- package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
- package/bin/skills/pytorch-lightning/SKILL.md +346 -0
- package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
- package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
- package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
- package/bin/skills/pyvene/SKILL.md +473 -0
- package/bin/skills/pyvene/references/README.md +73 -0
- package/bin/skills/pyvene/references/api.md +383 -0
- package/bin/skills/pyvene/references/tutorials.md +376 -0
- package/bin/skills/qdrant/SKILL.md +493 -0
- package/bin/skills/qdrant/references/advanced-usage.md +648 -0
- package/bin/skills/qdrant/references/troubleshooting.md +631 -0
- package/bin/skills/ray-data/SKILL.md +326 -0
- package/bin/skills/ray-data/references/integration.md +82 -0
- package/bin/skills/ray-data/references/transformations.md +83 -0
- package/bin/skills/ray-train/SKILL.md +406 -0
- package/bin/skills/ray-train/references/multi-node.md +628 -0
- package/bin/skills/rwkv/SKILL.md +260 -0
- package/bin/skills/rwkv/references/architecture-details.md +344 -0
- package/bin/skills/rwkv/references/rwkv7.md +386 -0
- package/bin/skills/rwkv/references/state-management.md +369 -0
- package/bin/skills/saelens/SKILL.md +386 -0
- package/bin/skills/saelens/references/README.md +70 -0
- package/bin/skills/saelens/references/api.md +333 -0
- package/bin/skills/saelens/references/tutorials.md +318 -0
- package/bin/skills/segment-anything/SKILL.md +500 -0
- package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
- package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
- package/bin/skills/sentence-transformers/SKILL.md +255 -0
- package/bin/skills/sentence-transformers/references/models.md +123 -0
- package/bin/skills/sentencepiece/SKILL.md +235 -0
- package/bin/skills/sentencepiece/references/algorithms.md +200 -0
- package/bin/skills/sentencepiece/references/training.md +304 -0
- package/bin/skills/sglang/SKILL.md +442 -0
- package/bin/skills/sglang/references/deployment.md +490 -0
- package/bin/skills/sglang/references/radix-attention.md +413 -0
- package/bin/skills/sglang/references/structured-generation.md +541 -0
- package/bin/skills/simpo/SKILL.md +219 -0
- package/bin/skills/simpo/references/datasets.md +478 -0
- package/bin/skills/simpo/references/hyperparameters.md +452 -0
- package/bin/skills/simpo/references/loss-functions.md +350 -0
- package/bin/skills/skypilot/SKILL.md +509 -0
- package/bin/skills/skypilot/references/advanced-usage.md +491 -0
- package/bin/skills/skypilot/references/troubleshooting.md +570 -0
- package/bin/skills/slime/SKILL.md +464 -0
- package/bin/skills/slime/references/api-reference.md +392 -0
- package/bin/skills/slime/references/troubleshooting.md +386 -0
- package/bin/skills/speculative-decoding/SKILL.md +467 -0
- package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
- package/bin/skills/speculative-decoding/references/medusa.md +350 -0
- package/bin/skills/stable-diffusion/SKILL.md +519 -0
- package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
- package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
- package/bin/skills/tensorboard/SKILL.md +629 -0
- package/bin/skills/tensorboard/references/integrations.md +638 -0
- package/bin/skills/tensorboard/references/profiling.md +545 -0
- package/bin/skills/tensorboard/references/visualization.md +620 -0
- package/bin/skills/tensorrt-llm/SKILL.md +187 -0
- package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
- package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
- package/bin/skills/tensorrt-llm/references/serving.md +470 -0
- package/bin/skills/tinker/SKILL.md +362 -0
- package/bin/skills/tinker/references/api-reference.md +168 -0
- package/bin/skills/tinker/references/getting-started.md +157 -0
- package/bin/skills/tinker/references/loss-functions.md +163 -0
- package/bin/skills/tinker/references/models-and-lora.md +139 -0
- package/bin/skills/tinker/references/recipes.md +280 -0
- package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
- package/bin/skills/tinker/references/rendering.md +243 -0
- package/bin/skills/tinker/references/supervised-learning.md +232 -0
- package/bin/skills/tinker-training-cost/SKILL.md +187 -0
- package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
- package/bin/skills/torchforge/SKILL.md +433 -0
- package/bin/skills/torchforge/references/api-reference.md +327 -0
- package/bin/skills/torchforge/references/troubleshooting.md +409 -0
- package/bin/skills/torchtitan/SKILL.md +358 -0
- package/bin/skills/torchtitan/references/checkpoint.md +181 -0
- package/bin/skills/torchtitan/references/custom-models.md +258 -0
- package/bin/skills/torchtitan/references/float8.md +133 -0
- package/bin/skills/torchtitan/references/fsdp.md +126 -0
- package/bin/skills/transformer-lens/SKILL.md +346 -0
- package/bin/skills/transformer-lens/references/README.md +54 -0
- package/bin/skills/transformer-lens/references/api.md +362 -0
- package/bin/skills/transformer-lens/references/tutorials.md +339 -0
- package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
- package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
- package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
- package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
- package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
- package/bin/skills/unsloth/SKILL.md +80 -0
- package/bin/skills/unsloth/references/index.md +7 -0
- package/bin/skills/unsloth/references/llms-full.md +16799 -0
- package/bin/skills/unsloth/references/llms-txt.md +12044 -0
- package/bin/skills/unsloth/references/llms.md +82 -0
- package/bin/skills/verl/SKILL.md +391 -0
- package/bin/skills/verl/references/api-reference.md +301 -0
- package/bin/skills/verl/references/troubleshooting.md +391 -0
- package/bin/skills/vllm/SKILL.md +364 -0
- package/bin/skills/vllm/references/optimization.md +226 -0
- package/bin/skills/vllm/references/quantization.md +284 -0
- package/bin/skills/vllm/references/server-deployment.md +255 -0
- package/bin/skills/vllm/references/troubleshooting.md +447 -0
- package/bin/skills/weights-and-biases/SKILL.md +590 -0
- package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
- package/bin/skills/weights-and-biases/references/integrations.md +700 -0
- package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
- package/bin/skills/whisper/SKILL.md +317 -0
- package/bin/skills/whisper/references/languages.md +189 -0
- package/bin/synsc +0 -0
- package/package.json +10 -0
package/bin/skills/megatron-core/SKILL.md
@@ -0,0 +1,366 @@
---
name: training-llms-megatron
description: Trains large language models (2B-462B parameters) using NVIDIA Megatron-Core with advanced parallelism strategies. Use when training models >1B parameters, need maximum GPU efficiency (47% MFU on H100), or require tensor/pipeline/sequence/context/expert parallelism. Production-ready framework used for Nemotron, LLaMA, DeepSeek.
version: 1.0.0
author: Synthetic Sciences
license: MIT
tags: [Megatron-Core, Large-Scale Training, NVIDIA, Tensor Parallelism, Pipeline Parallelism, Model Parallelism, H100, Distributed Training, Production]
dependencies: [megatron-core, torch, apex, transformer-engine]
---

# Megatron-Core - Large-Scale LLM Training

## Quick start

Megatron-Core trains LLMs from 2B to 462B parameters with up to 47% Model FLOP Utilization on H100 GPUs through advanced parallelism strategies.

**Installation**:
```bash
# Docker (recommended)
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:25.04-py3

# Or pip
pip install megatron-core
```

**Simple distributed training**:
```bash
# Train with 2 GPUs using data parallelism
torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py

# Or LLaMA-3 8B training
./examples/llama/train_llama3_8b_fp8.sh
```

## Common workflows

### Workflow 1: Train LLaMA-style model with 3D parallelism

Copy this checklist:

```
LLaMA Training Setup:
- [ ] Step 1: Choose parallelism configuration
- [ ] Step 2: Configure training hyperparameters
- [ ] Step 3: Launch distributed training
- [ ] Step 4: Monitor performance metrics
```

**Step 1: Choose parallelism configuration**

Model size determines parallelism strategy:

| Model Size | GPUs | Tensor Parallel | Pipeline Parallel | Data Parallel | Context Parallel |
|------------|------|-----------------|-------------------|---------------|------------------|
| 7B | 8 | 1 | 1 | 8 | 1 |
| 13B | 8 | 2 | 1 | 4 | 1 |
| 70B | 64 | 4 | 4 | 4 | 1 |
| 405B | 128 | 8 | 8 | 2 | 2 |

**Step 2: Configure training hyperparameters**

```bash
#!/bin/bash
# train_llama_70b.sh

GPUS_PER_NODE=8
NNODES=8  # 64 GPUs total
TP=4      # Tensor parallel
PP=4      # Pipeline parallel
CP=1      # Context parallel

# LLaMA 70B configuration
MODEL_SIZE=70  # Billion parameters
HIDDEN_SIZE=8192
NUM_LAYERS=80
NUM_HEADS=64
SEQ_LENGTH=4096

# Training hyperparameters
MICRO_BATCH=1
GLOBAL_BATCH=1024
LR=3e-4

torchrun \
    --nproc_per_node=$GPUS_PER_NODE \
    --nnodes=$NNODES \
    pretrain_gpt.py \
    --tensor-model-parallel-size $TP \
    --pipeline-model-parallel-size $PP \
    --context-parallel-size $CP \
    --sequence-parallel \
    --num-layers $NUM_LAYERS \
    --hidden-size $HIDDEN_SIZE \
    --num-attention-heads $NUM_HEADS \
    --seq-length $SEQ_LENGTH \
    --max-position-embeddings $SEQ_LENGTH \
    --micro-batch-size $MICRO_BATCH \
    --global-batch-size $GLOBAL_BATCH \
    --lr $LR \
    --train-iters 100000 \
    --lr-decay-style cosine \
    --lr-warmup-iters 2000 \
    --weight-decay 0.1 \
    --clip-grad 1.0 \
    --bf16 \
    --use-mcore-models \
    --transformer-impl transformer_engine \
    --data-path /path/to/data \
    --vocab-file /path/to/vocab.json \
    --merge-file /path/to/merges.txt
```

**Step 3: Launch distributed training**

```bash
# Single node (8 GPUs)
bash train_llama_70b.sh

# Multi-node with SLURM
sbatch --nodes=8 --gpus-per-node=8 train_llama_70b.sh
```

**Step 4: Monitor performance metrics**

Key metrics to track:
```
Model FLOP Utilization (MFU): Target >40% on H100
Throughput: Tokens/sec/GPU
Memory usage: <80GB per GPU for 70B model
Loss: Should decrease steadily
```

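MFU can be estimated directly from the throughput metric. A rough sketch using the common approximation of ~6 FLOPs per parameter per token for forward plus backward (attention FLOPs ignored), with the H100's dense BF16 peak of roughly 989 TFLOPs; the throughput number below is hypothetical:

```python
# Rough MFU estimate from training throughput (illustrative arithmetic).
# Assumes ~6 FLOPs per parameter per token (forward + backward),
# ignoring attention FLOPs; peak_flops is per GPU.
def mfu(tokens_per_sec_per_gpu, n_params, peak_flops):
    achieved = 6 * n_params * tokens_per_sec_per_gpu  # FLOPs/sec spent on model math
    return achieved / peak_flops

# Hypothetical: 70B model at 1000 tokens/sec/GPU on H100 (~989 TFLOPs dense BF16)
print(f"MFU: {mfu(1000, 70e9, 989e12):.1%}")  # prints MFU: 42.5%
```

If this estimate falls well below 40%, revisit the parallelism and micro-batch tuning in Workflow 3.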
### Workflow 2: Configure Mixture of Experts (MoE) training

For sparse MoE models like Mixtral.

```
MoE Training:
- [ ] Step 1: Configure expert parallelism
- [ ] Step 2: Set MoE hyperparameters
- [ ] Step 3: Launch training with EP
```

**Step 1: Configure expert parallelism**

```bash
# Mixtral 8x7B example
TENSOR_PARALLEL=2
PIPELINE_PARALLEL=1
EXPERT_PARALLEL=4  # Split 8 experts across 4 GPUs
DATA_PARALLEL=4

TOTAL_GPUS=$((TENSOR_PARALLEL * PIPELINE_PARALLEL * EXPERT_PARALLEL * DATA_PARALLEL))
# = 2 * 1 * 4 * 4 = 32 GPUs
```

**Step 2: Set MoE hyperparameters**

```bash
torchrun \
    --nproc_per_node=8 \
    pretrain_gpt.py \
    --tensor-model-parallel-size 2 \
    --pipeline-model-parallel-size 1 \
    --expert-model-parallel-size 4 \
    --num-experts 8 \
    --moe-router-topk 2 \
    --moe-router-load-balancing-type aux_loss \
    --moe-aux-loss-coeff 0.01 \
    --hidden-size 4096 \
    --num-layers 32 \
    --num-attention-heads 32 \
    --seq-length 4096 \
    --max-position-embeddings 4096 \
    --bf16 \
    --use-mcore-models \
    --transformer-impl transformer_engine \
    --data-path /path/to/data \
    --vocab-file /path/to/vocab.json \
    --merge-file /path/to/merges.txt
```

**Step 3: Launch training with EP**

Expert parallelism distributes different experts across GPUs, reducing memory while maintaining capacity.

```
Memory without EP: 8 experts × 7B params ≈ 56 GB per GPU (at ~1 byte/param)
Memory with EP=4:  2 experts × 7B params ≈ 14 GB per GPU
Savings: 75% memory reduction
```

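The memory figures above generalize: with expert parallelism degree EP, each GPU holds num_experts / EP experts. A small sketch of that arithmetic (assuming, like the figures above, roughly 1 GB per billion expert parameters, i.e. ~1 byte per parameter):

```python
# Expert-weight memory per GPU under expert parallelism (illustrative arithmetic).
# Assumes ~1 byte per parameter, i.e. ~1 GB per billion parameters.
def expert_memory_gb(num_experts, params_per_expert_billions, ep):
    experts_per_gpu = num_experts / ep
    return experts_per_gpu * params_per_expert_billions

assert expert_memory_gb(8, 7, ep=1) == 56.0   # no expert parallelism
assert expert_memory_gb(8, 7, ep=4) == 14.0   # EP=4, matching the figures above
```

Note this covers expert weights only; attention layers, activations, and optimizer state add to the per-GPU total.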
### Workflow 3: Optimize for maximum throughput

Achieve 47% MFU on H100.

```
Performance Optimization:
- [ ] Step 1: Enable Flash Attention
- [ ] Step 2: Use FP8 precision (H100)
- [ ] Step 3: Optimize micro-batch size
- [ ] Step 4: Tune parallelism degrees
```

**Step 1: Enable optimizations**

```bash
--use-mcore-models                      # Use Megatron Core models
--transformer-impl transformer_engine   # Use Transformer Engine
--sequence-parallel                     # Reduce activation memory (use with TP)
```

**Step 2: Use FP8 precision (H100 only)**

```bash
--fp8-hybrid  # FP8 mixed precision training
# Transformer Engine handles FP8 automatically
```

Result: 1.5-2x speedup on H100 vs BF16.

**Step 3: Optimize micro-batch size**

Find the largest micro-batch size that fits in memory:

```bash
# Start with 1, increase until OOM
for MBS in 1 2 4 8; do
  echo "Testing micro-batch-size=$MBS"
  torchrun ... --micro-batch-size $MBS
done
```

Typical values:
- 7B model: 4-8
- 70B model: 1-2
- 405B model: 1

|
|
239
|
+
**Step 4: Tune parallelism degrees**
|
|
240
|
+
|
|
241
|
+
Rules of thumb:
|
|
242
|
+
```
|
|
243
|
+
Tensor Parallel: Use ≤8 (limited by NVLink within node)
|
|
244
|
+
Pipeline Parallel: Use for >70B models
|
|
245
|
+
Context Parallel: Use for sequences >8K tokens
|
|
246
|
+
Data Parallel: Fill remaining GPUs
|
|
247
|
+
```
|
|
248
|
+
|
|
249
|
+
Example 405B on 128 H100s:
|
|
250
|
+
```
|
|
251
|
+
TP=8 (1 node)
|
|
252
|
+
PP=8 (across nodes)
|
|
253
|
+
CP=2 (long sequences)
|
|
254
|
+
DP=1
|
|
255
|
+
Total = 8 × 8 × 2 × 1 = 128 GPUs
|
|
256
|
+
```
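
A quick sanity check for any proposed layout (a hypothetical helper, not part of Megatron):

```python
def data_parallel_degree(world_size: int, tp: int, pp: int, cp: int = 1, ep: int = 1) -> int:
    """DP degree implied by the other parallelism dimensions; raises if they don't tile."""
    model_parallel = tp * pp * cp * ep
    if world_size % model_parallel != 0:
        raise ValueError(f"{world_size} GPUs not divisible by TP*PP*CP*EP={model_parallel}")
    return world_size // model_parallel

# The 405B example above: TP=8, PP=8, CP=2 on 128 H100s leaves DP=1
print(data_parallel_degree(128, tp=8, pp=8, cp=2))  # 1
```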

## When to use vs alternatives

**Use Megatron-Core when:**
- Training models >10B parameters
- Need maximum efficiency (target >40% MFU)
- Using NVIDIA GPUs (A100, H100)
- Production training at scale
- Want fine-grained parallelism control

**Use alternatives instead:**
- **PyTorch FSDP**: Models <70B, simpler API, PyTorch native
- **DeepSpeed**: Easier setup, good for <100B models
- **HuggingFace Accelerate**: Prototyping, simpler workflows
- **LitGPT**: Educational, single-file implementations

## Common issues

**Issue: Low GPU utilization (<30% MFU)**

Causes:
1. Micro-batch size too small
2. Too much parallelism overhead
3. Not using Flash Attention

Fixes:
```bash
# Increase micro-batch
--micro-batch-size 4             # Was 1

# Enable optimizations
--use-flash-attn
--sequence-parallel

# Reduce TP if >8
--tensor-model-parallel-size 4   # Was 16
```

**Issue: Out of memory**

Reduce memory with:
```bash
--tensor-model-parallel-size 2   # Split model across GPUs
--recompute-granularity full     # Gradient checkpointing
--recompute-method block         # Checkpoint transformer blocks
--recompute-num-layers 1         # Checkpoint every layer
```

Or use CPU/NVMe offloading:
```bash
--cpu-optimizer            # Offload optimizer to CPU
--cpu-optimizer-type ADAM  # CPU Adam variant
```

**Issue: Training slower than expected**

Check:
1. **Network bottleneck**: Ensure InfiniBand/NVLink is enabled
2. **Pipeline bubbles**: Use an interleaved pipeline schedule
   ```bash
   --num-layers-per-virtual-pipeline-stage 2
   ```
3. **Data loading**: Use the fast data loader
   ```bash
   --dataloader-type cyclic
   ```

**Issue: Diverging loss**

Stabilize training:
```bash
--lr-warmup-iters 2000    # Longer warmup
--clip-grad 1.0           # Gradient clipping
--init-method-std 0.006   # Smaller init
--attention-dropout 0.0   # No dropout in attention
--hidden-dropout 0.0      # No dropout in FFN
```

## Advanced topics

**Parallelism strategies**: See [references/parallelism-guide.md](references/parallelism-guide.md) for a detailed comparison of TP/PP/DP/CP/EP with performance analysis and when to use each.

**Performance benchmarks**: See [references/benchmarks.md](references/benchmarks.md) for MFU numbers across different model sizes and GPU configurations.

**Production configurations**: See [references/production-examples.md](references/production-examples.md) for real-world setups from LLaMA 3 405B, Nemotron-4 340B, and DeepSeek-V3 671B.

**Training recipes**: See [references/training-recipes.md](references/training-recipes.md) for complete hyperparameter configurations for GPT/LLaMA/Mixtral architectures.

## Hardware requirements

- **GPU**: NVIDIA Ampere+ (A100, H100, B200)
  - Turing works but is slower
  - FP8 requires Hopper/Ada/Blackwell
- **Network**: InfiniBand or 400Gb+ Ethernet for multi-node
- **Memory per GPU**:
  - 7B model: 40GB+
  - 70B model: 80GB (with TP=4)
  - 405B model: 80GB (with TP=8, PP=8)
- **Storage**: Fast NVMe for checkpoints (1TB+ for 70B+ models)
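
As a rough back-of-envelope for the per-GPU figures above, weight memory alone divides across TP and PP ranks (a sketch only; it ignores activations, gradients, and optimizer state):

```python
def weight_gb_per_gpu(params_billion: float, tp: int = 1, pp: int = 1,
                      bytes_per_param: float = 2.0) -> float:
    """Model-weight memory per GPU in GB (BF16 default); billions of params × bytes = GB."""
    return params_billion * bytes_per_param / (tp * pp)

print(weight_gb_per_gpu(70, tp=4))         # 35.0 GB of a 70B model's weights per GPU
print(weight_gb_per_gpu(405, tp=8, pp=8))  # ~12.7 GB for 405B with TP=8, PP=8
```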

## Resources

- Docs: https://docs.nvidia.com/megatron-core/
- GitHub: https://github.com/NVIDIA/Megatron-LM
- Papers:
  - "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism" (2019)
  - "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM" (2021)
- NeMo Framework: https://docs.nvidia.com/nemo-framework/ (built on Megatron-Core)

@@ -0,0 +1,249 @@

# Performance Benchmarks

Performance metrics and benchmarks for Megatron-Core across different model sizes and hardware configurations.

## Model FLOP Utilization (MFU)

**H100 clusters**: up to 47% MFU achieved.

MFU increases with model size because larger matrix multiplications (GEMMs) have higher arithmetic intensity.

## Throughput Metrics by Model Size

### GPT-3 175B
- **Hardware**: H100
- **Configuration**: TP=4, PP=8
- **GPUs**: 128-512
- **MFU**: 47% on H100
- **Throughput**: 390 TFlops/GPU on H100

### LLaMA Configurations

| Model | Size | GPUs | TP | PP | CP | Seq Length | Hardware | Notes |
|-------|------|------|----|----|----|------------|----------|-------|
| LLaMA-3 | 8B | 8 | 1 | 1 | 2 | 8K | H100 | CP for long sequences |
| LLaMA-3 | 70B | 64 | 4 | 4 | 2 | 4K | H100 | TP+PP parallelism |
| LLaMA-3.1 | 405B | 1024 | 8 | 8 | 2 | 4K | H100 | 3D parallelism |

**LLaMA-3.1 405B details**:
- 16K H100 GPUs (two 24K-GPU clusters)
- TP=8, PP=8, CP=2
- 400 TFlops/GPU average
- 95%+ uptime
- 3× efficiency improvement vs LLaMA 2

### Mixtral (Mixture of Experts)

| Model | Active Params | Total Params | GPUs | TP | PP | EP | Experts | Hardware |
|-------|---------------|--------------|------|----|----|----|---------|----------|
| Mixtral | 7B (active) | 8×7B (56B) | 64 | 1 | 4 | 8 | 8 | H100 |
| Mixtral | 22B (active) | 8×22B (176B) | 256 | 4 | 4 | 8 | 8 | H100 |

### DeepSeek-V3
- **Active Parameters**: 37B per token
- **Total Parameters**: 671B
- **GPUs**: 1024 H100
- **Configuration**: TP=2, PP=16, EP=64
- **Parallelism**: 4D with Expert Parallel

### GPT-462B (Largest Benchmark)
- **Parameters**: 462B
- **GPUs**: 6144 H100
- **MFU**: 47-48%
- **Throughput**: ~390 TFlops/GPU

## Hardware Performance Characteristics

### NVIDIA H100 (Hopper)
- **Peak Performance** (Tensor Core, with sparsity):
  - FP16: 1979 TFlops
  - BF16: 1979 TFlops
  - FP8: 3958 TFlops
- **Memory**: 80GB HBM3
- **Memory Bandwidth**: 3.35 TB/s
- **NVLink**: 900 GB/s per GPU

**Achieved MFU**: 40-47% (typical range)

### NVIDIA A100 (Ampere)
- **Peak Performance** (Tensor Core, dense):
  - FP16: 312 TFlops
  - BF16: 312 TFlops
- **Memory**: 40GB or 80GB HBM2e
- **Memory Bandwidth**: 2 TB/s
- **NVLink**: 600 GB/s per GPU

**Typical MFU**: 35-42%

## Weak Scaling (Fixed Per-GPU Workload)

As you add more GPUs while keeping the per-GPU workload constant:

| GPUs | Model Size | MFU | Efficiency |
|------|------------|-----|------------|
| 8 | 7B | 42% | 100% (baseline) |
| 64 | 70B | 44% | 95% |
| 512 | 175B | 45% | 93% |
| 1024 | 405B | 46% | 90% |
| 6144 | 462B | 47% | 88% |

## Strong Scaling (Fixed Total Workload)

Distributing a fixed model across more GPUs:

| Model | GPUs | Time per Iteration | Speedup | Efficiency |
|-------|------|--------------------|---------|------------|
| 70B | 64 | 1.0× (baseline) | 1.0× | 100% |
| 70B | 128 | 0.52× | 1.92× | 96% |
| 70B | 256 | 0.27× | 3.70× | 93% |
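
The speedup and efficiency columns follow directly from the relative iteration times; a small helper to reproduce them (not a Megatron utility):

```python
def strong_scaling(base_gpus: int, gpus: int, relative_time: float) -> tuple[float, float]:
    """Speedup vs the baseline run, and parallel efficiency (speedup / GPU ratio)."""
    speedup = 1.0 / relative_time
    efficiency = speedup / (gpus / base_gpus)
    return speedup, efficiency

s, e = strong_scaling(base_gpus=64, gpus=128, relative_time=0.52)
print(f"{s:.2f}x speedup, {e:.0%} efficiency")  # 1.92x speedup, 96% efficiency
```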

## Throughput Calculations

**Formula**:
```
Throughput (TFlops/GPU) = Total FLOPs / (Time × Number of GPUs × 10^12)
```
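
The same formula in code (the example numbers below are made up for illustration):

```python
def throughput_tflops_per_gpu(total_flops: float, time_s: float, num_gpus: int) -> float:
    """Throughput (TFlops/GPU) = Total FLOPs / (Time × Number of GPUs × 10^12)."""
    return total_flops / (time_s * num_gpus * 1e12)

# e.g. 1.0e18 FLOPs of model compute in 10 s on 256 GPUs
print(throughput_tflops_per_gpu(1.0e18, time_s=10.0, num_gpus=256))  # 390.625
```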

**Example (GPT-3 175B)**:
- Training FLOPs: 3 × forward FLOPs (forward + backward)
- Per-token forward FLOPs: ~350 billion for a 175B model (≈2 × parameters)
- Batch size: 1536 (global)
- Sequence length: 2048
- Time per iteration: ~16.5 seconds on 512 H100s
- Throughput: ~390 TFlops/GPU

## Memory Usage vs Model Size

| Model Size | Parameters | Memory (FP16) | Memory (BF16) | Memory (FP8) |
|------------|------------|---------------|---------------|--------------|
| 7B | 7 billion | 14 GB | 14 GB | 7 GB |
| 13B | 13 billion | 26 GB | 26 GB | 13 GB |
| 70B | 70 billion | 140 GB | 140 GB | 70 GB |
| 175B | 175 billion | 350 GB | 350 GB | 175 GB |
| 405B | 405 billion | 810 GB | 810 GB | 405 GB |

**Note**: These are model weights only. Training also needs gradients and optimizer states: up to ~16 bytes/parameter unsharded with mixed-precision Adam, or roughly an extra ~2× the weight memory per GPU when the distributed optimizer shards the states across data-parallel ranks.
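
The per-parameter accounting behind that note, assuming unsharded mixed-precision Adam (a sketch; the distributed optimizer shards the three FP32 terms across DP ranks):

```python
def training_bytes_per_param(weight_bytes: int = 2) -> int:
    """BF16 weights + BF16 grads + FP32 master weights + FP32 Adam moments (m, v)."""
    grad_bytes = weight_bytes
    master_fp32, adam_m, adam_v = 4, 4, 4
    return weight_bytes + grad_bytes + master_fp32 + adam_m + adam_v

print(training_bytes_per_param())               # 16 bytes/param
print(70e9 * training_bytes_per_param() / 1e9)  # 70B model: 1120.0 GB before activations
```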

## Communication Overhead

### Tensor Parallelism (TP)
- **Bandwidth Required**: ~20 GB/GPU for LLaMA 70B with TP=4
- **Frequency**: Every layer (80+ layers)
- **Best Practice**: Use NVLink; keep TP ≤8 within a single node

### Pipeline Parallelism (PP)
- **Bandwidth Required**: Activation size only (~100s of MB)
- **Frequency**: Between pipeline stages
- **Best Practice**: Use for cross-node scaling

### Data Parallelism (DP)
- **Bandwidth Required**: Full gradient size
- **Frequency**: Once per iteration
- **Best Practice**: Use for remaining parallelism after TP/PP
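
For DP, the "full gradient size" cost can be estimated with the standard ring all-reduce model, in which each GPU sends and receives about 2·(N−1)/N times the payload (a sketch, assuming BF16 gradients):

```python
def dp_allreduce_gb_per_gpu(params_billion: float, dp: int, grad_bytes: int = 2) -> float:
    """Approximate per-GPU traffic for one ring all-reduce of the gradients."""
    payload_gb = params_billion * grad_bytes  # billions of params × bytes = GB
    return 2 * (dp - 1) / dp * payload_gb

print(dp_allreduce_gb_per_gpu(7, dp=8))  # 7B model, DP=8: 24.5 GB per iteration
```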

## Optimization Impact

### Flash Attention
- **Speedup**: 2-4× on attention layers
- **Memory**: 10-20× reduction in attention memory
- **Overall Impact**: ~30% faster training

### Sequence Parallelism
- **Memory Savings**: Activation memory / TP degree
- **Example**: With TP=4, saves 75% of activation memory
- **No Performance Cost**: The communication is already happening for TP

### Context Parallelism
- **Use Case**: Sequences >8K tokens
- **Memory Savings**: KV cache / CP degree
- **Communication**: Ring all-to-all pattern

### FP8 Training (H100 Only)
- **Speedup**: 1.5-2× vs BF16
- **Memory**: 50% reduction vs BF16
- **Quality**: Minimal degradation with proper scaling

## Production Deployments

### Meta LLaMA 3 Training
- **Models**: 8B, 70B, 405B
- **Cluster**: Two 24K H100 clusters
- **Efficiency**: 400 TFlops/GPU sustained
- **Uptime**: 95%+
- **Total Tokens**: 15 trillion for the 405B model

### Microsoft Megatron-Turing NLG 530B
- **GPUs**: 560 DGX A100 nodes (4480 A100 80GB GPUs)
- **Parallelism**: DeepSpeed ZeRO-3 + Megatron TP/PP
- **Duration**: Several months
- **Year**: 2021

### NVIDIA Nemotron-4 340B
- **Architecture**: Dense decoder-only Transformer
- **Framework**: NeMo (built on Megatron-Core)
- **Production**: Commercial deployment

## Benchmarking Best Practices

1. **Measure Sustained Performance**: Average over 100+ iterations, not peak
2. **Include All Operations**: Forward, backward, optimizer step, communication
3. **Report MFU**: Against the theoretical peak FLOPs of the hardware
4. **Specify Configuration**: TP, PP, CP, EP degrees, batch sizes, sequence length
5. **Note Optimizations**: Flash Attention, FP8, sequence parallel, etc.

## How to Measure Your Own Performance

**Enable profiling**:
```bash
torchrun pretrain_gpt.py \
  --profile \
  --profile-step-start 10 \
  --profile-step-end 20
```

**Calculate MFU**:
```python
# Megatron logs this automatically; check the training logs for:
# - elapsed time per iteration (seconds)
# - samples per second
# - TFLOPs/s per GPU
# - MFU percentage
```
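
To recompute these by hand from the logged quantities (a sketch: the 6 × params × tokens FLOP count is the standard approximation, not Megatron's exact accounting, and the iteration time below is illustrative):

```python
def achieved_tflops_per_gpu(tokens_per_iter: int, params_billion: float,
                            iter_time_s: float, num_gpus: int) -> float:
    """Model FLOPs per iteration ≈ 6 × params × tokens (forward + backward)."""
    total_flops = 6 * params_billion * 1e9 * tokens_per_iter
    return total_flops / (iter_time_s * num_gpus * 1e12)

def mfu(achieved_tflops: float, peak_tflops: float) -> float:
    """Model FLOP Utilization: achieved model throughput over the hardware peak."""
    return achieved_tflops / peak_tflops

t = achieved_tflops_per_gpu(tokens_per_iter=1536 * 2048, params_billion=175,
                            iter_time_s=16.5, num_gpus=512)
print(round(t))        # ≈ 391 TFlops/GPU for a GPT-3-sized run
print(mfu(400, 1000))  # 0.4 -> 40% MFU
```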

**Key Metrics to Track**:
- Elapsed time per iteration
- Throughput (TFlops/GPU)
- MFU (%)
- Memory usage (GB)
- Communication time (% of total)

## Troubleshooting Low Performance

**If MFU < 30%**:
1. Check micro-batch size (increase if possible)
2. Enable all optimizations (Flash Attention, sequence parallel, etc.)
3. Verify the communication backend (NCCL properly configured)
4. Check for data-loading bottlenecks
5. Ensure a proper CPU-GPU pipeline

**If Communication Heavy** (>30% of time):
1. Reduce the TP degree (especially across nodes)
2. Use the interleaved pipeline schedule
3. Enable communication-overlap flags
4. Check the network topology (InfiniBand vs Ethernet)

**If Memory Bound**:
1. Enable gradient checkpointing
2. Use lower precision (BF16 or FP8)
3. Increase parallelism degrees
4. Reduce micro-batch size

## References

- NVIDIA Megatron-LM GitHub: https://github.com/NVIDIA/Megatron-LM
- Performance Docs: https://docs.nvidia.com/megatron-core/
- LLaMA 3 Paper: Meta AI
- DeepSeek-V3 Technical Report