@synsci/cli-darwin-x64 1.1.49
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/skills/accelerate/SKILL.md +332 -0
- package/bin/skills/accelerate/references/custom-plugins.md +453 -0
- package/bin/skills/accelerate/references/megatron-integration.md +489 -0
- package/bin/skills/accelerate/references/performance.md +525 -0
- package/bin/skills/audiocraft/SKILL.md +564 -0
- package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
- package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
- package/bin/skills/autogpt/SKILL.md +403 -0
- package/bin/skills/autogpt/references/advanced-usage.md +535 -0
- package/bin/skills/autogpt/references/troubleshooting.md +420 -0
- package/bin/skills/awq/SKILL.md +310 -0
- package/bin/skills/awq/references/advanced-usage.md +324 -0
- package/bin/skills/awq/references/troubleshooting.md +344 -0
- package/bin/skills/axolotl/SKILL.md +158 -0
- package/bin/skills/axolotl/references/api.md +5548 -0
- package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
- package/bin/skills/axolotl/references/index.md +15 -0
- package/bin/skills/axolotl/references/other.md +3563 -0
- package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
- package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
- package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
- package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
- package/bin/skills/bitsandbytes/SKILL.md +411 -0
- package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
- package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
- package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
- package/bin/skills/blip-2/SKILL.md +564 -0
- package/bin/skills/blip-2/references/advanced-usage.md +680 -0
- package/bin/skills/blip-2/references/troubleshooting.md +526 -0
- package/bin/skills/chroma/SKILL.md +406 -0
- package/bin/skills/chroma/references/integration.md +38 -0
- package/bin/skills/clip/SKILL.md +253 -0
- package/bin/skills/clip/references/applications.md +207 -0
- package/bin/skills/constitutional-ai/SKILL.md +290 -0
- package/bin/skills/crewai/SKILL.md +498 -0
- package/bin/skills/crewai/references/flows.md +438 -0
- package/bin/skills/crewai/references/tools.md +429 -0
- package/bin/skills/crewai/references/troubleshooting.md +480 -0
- package/bin/skills/deepspeed/SKILL.md +141 -0
- package/bin/skills/deepspeed/references/08.md +17 -0
- package/bin/skills/deepspeed/references/09.md +173 -0
- package/bin/skills/deepspeed/references/2020.md +378 -0
- package/bin/skills/deepspeed/references/2023.md +279 -0
- package/bin/skills/deepspeed/references/assets.md +179 -0
- package/bin/skills/deepspeed/references/index.md +35 -0
- package/bin/skills/deepspeed/references/mii.md +118 -0
- package/bin/skills/deepspeed/references/other.md +1191 -0
- package/bin/skills/deepspeed/references/tutorials.md +6554 -0
- package/bin/skills/dspy/SKILL.md +590 -0
- package/bin/skills/dspy/references/examples.md +663 -0
- package/bin/skills/dspy/references/modules.md +475 -0
- package/bin/skills/dspy/references/optimizers.md +566 -0
- package/bin/skills/faiss/SKILL.md +221 -0
- package/bin/skills/faiss/references/index_types.md +280 -0
- package/bin/skills/flash-attention/SKILL.md +367 -0
- package/bin/skills/flash-attention/references/benchmarks.md +215 -0
- package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
- package/bin/skills/gguf/SKILL.md +427 -0
- package/bin/skills/gguf/references/advanced-usage.md +504 -0
- package/bin/skills/gguf/references/troubleshooting.md +442 -0
- package/bin/skills/gptq/SKILL.md +450 -0
- package/bin/skills/gptq/references/calibration.md +337 -0
- package/bin/skills/gptq/references/integration.md +129 -0
- package/bin/skills/gptq/references/troubleshooting.md +95 -0
- package/bin/skills/grpo-rl-training/README.md +97 -0
- package/bin/skills/grpo-rl-training/SKILL.md +572 -0
- package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
- package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
- package/bin/skills/guidance/SKILL.md +572 -0
- package/bin/skills/guidance/references/backends.md +554 -0
- package/bin/skills/guidance/references/constraints.md +674 -0
- package/bin/skills/guidance/references/examples.md +767 -0
- package/bin/skills/hqq/SKILL.md +445 -0
- package/bin/skills/hqq/references/advanced-usage.md +528 -0
- package/bin/skills/hqq/references/troubleshooting.md +503 -0
- package/bin/skills/hugging-face-cli/SKILL.md +191 -0
- package/bin/skills/hugging-face-cli/references/commands.md +954 -0
- package/bin/skills/hugging-face-cli/references/examples.md +374 -0
- package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
- package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
- package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
- package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
- package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
- package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
- package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
- package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
- package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
- package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
- package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
- package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
- package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
- package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
- package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
- package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
- package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
- package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
- package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
- package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
- package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
- package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
- package/bin/skills/hugging-face-jobs/index.html +216 -0
- package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
- package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
- package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
- package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
- package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
- package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
- package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
- package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
- package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
- package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
- package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
- package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
- package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
- package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
- package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
- package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
- package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
- package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
- package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
- package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
- package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
- package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
- package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
- package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
- package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
- package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
- package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
- package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
- package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
- package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
- package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
- package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
- package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
- package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
- package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
- package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
- package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
- package/bin/skills/instructor/SKILL.md +740 -0
- package/bin/skills/instructor/references/examples.md +107 -0
- package/bin/skills/instructor/references/providers.md +70 -0
- package/bin/skills/instructor/references/validation.md +606 -0
- package/bin/skills/knowledge-distillation/SKILL.md +458 -0
- package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
- package/bin/skills/lambda-labs/SKILL.md +545 -0
- package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
- package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
- package/bin/skills/langchain/SKILL.md +480 -0
- package/bin/skills/langchain/references/agents.md +499 -0
- package/bin/skills/langchain/references/integration.md +562 -0
- package/bin/skills/langchain/references/rag.md +600 -0
- package/bin/skills/langsmith/SKILL.md +422 -0
- package/bin/skills/langsmith/references/advanced-usage.md +548 -0
- package/bin/skills/langsmith/references/troubleshooting.md +537 -0
- package/bin/skills/litgpt/SKILL.md +469 -0
- package/bin/skills/litgpt/references/custom-models.md +568 -0
- package/bin/skills/litgpt/references/distributed-training.md +451 -0
- package/bin/skills/litgpt/references/supported-models.md +336 -0
- package/bin/skills/litgpt/references/training-recipes.md +619 -0
- package/bin/skills/llama-cpp/SKILL.md +258 -0
- package/bin/skills/llama-cpp/references/optimization.md +89 -0
- package/bin/skills/llama-cpp/references/quantization.md +213 -0
- package/bin/skills/llama-cpp/references/server.md +125 -0
- package/bin/skills/llama-factory/SKILL.md +80 -0
- package/bin/skills/llama-factory/references/_images.md +23 -0
- package/bin/skills/llama-factory/references/advanced.md +1055 -0
- package/bin/skills/llama-factory/references/getting_started.md +349 -0
- package/bin/skills/llama-factory/references/index.md +19 -0
- package/bin/skills/llama-factory/references/other.md +31 -0
- package/bin/skills/llamaguard/SKILL.md +337 -0
- package/bin/skills/llamaindex/SKILL.md +569 -0
- package/bin/skills/llamaindex/references/agents.md +83 -0
- package/bin/skills/llamaindex/references/data_connectors.md +108 -0
- package/bin/skills/llamaindex/references/query_engines.md +406 -0
- package/bin/skills/llava/SKILL.md +304 -0
- package/bin/skills/llava/references/training.md +197 -0
- package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
- package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
- package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
- package/bin/skills/long-context/SKILL.md +536 -0
- package/bin/skills/long-context/references/extension_methods.md +468 -0
- package/bin/skills/long-context/references/fine_tuning.md +611 -0
- package/bin/skills/long-context/references/rope.md +402 -0
- package/bin/skills/mamba/SKILL.md +260 -0
- package/bin/skills/mamba/references/architecture-details.md +206 -0
- package/bin/skills/mamba/references/benchmarks.md +255 -0
- package/bin/skills/mamba/references/training-guide.md +388 -0
- package/bin/skills/megatron-core/SKILL.md +366 -0
- package/bin/skills/megatron-core/references/benchmarks.md +249 -0
- package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
- package/bin/skills/megatron-core/references/production-examples.md +473 -0
- package/bin/skills/megatron-core/references/training-recipes.md +547 -0
- package/bin/skills/miles/SKILL.md +315 -0
- package/bin/skills/miles/references/api-reference.md +141 -0
- package/bin/skills/miles/references/troubleshooting.md +352 -0
- package/bin/skills/mlflow/SKILL.md +704 -0
- package/bin/skills/mlflow/references/deployment.md +744 -0
- package/bin/skills/mlflow/references/model-registry.md +770 -0
- package/bin/skills/mlflow/references/tracking.md +680 -0
- package/bin/skills/modal/SKILL.md +341 -0
- package/bin/skills/modal/references/advanced-usage.md +503 -0
- package/bin/skills/modal/references/troubleshooting.md +494 -0
- package/bin/skills/model-merging/SKILL.md +539 -0
- package/bin/skills/model-merging/references/evaluation.md +462 -0
- package/bin/skills/model-merging/references/examples.md +428 -0
- package/bin/skills/model-merging/references/methods.md +352 -0
- package/bin/skills/model-pruning/SKILL.md +495 -0
- package/bin/skills/model-pruning/references/wanda.md +347 -0
- package/bin/skills/moe-training/SKILL.md +526 -0
- package/bin/skills/moe-training/references/architectures.md +432 -0
- package/bin/skills/moe-training/references/inference.md +348 -0
- package/bin/skills/moe-training/references/training.md +425 -0
- package/bin/skills/nanogpt/SKILL.md +290 -0
- package/bin/skills/nanogpt/references/architecture.md +382 -0
- package/bin/skills/nanogpt/references/data.md +476 -0
- package/bin/skills/nanogpt/references/training.md +564 -0
- package/bin/skills/nemo-curator/SKILL.md +383 -0
- package/bin/skills/nemo-curator/references/deduplication.md +87 -0
- package/bin/skills/nemo-curator/references/filtering.md +102 -0
- package/bin/skills/nemo-evaluator/SKILL.md +494 -0
- package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
- package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
- package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
- package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
- package/bin/skills/nemo-guardrails/SKILL.md +297 -0
- package/bin/skills/nnsight/SKILL.md +436 -0
- package/bin/skills/nnsight/references/README.md +78 -0
- package/bin/skills/nnsight/references/api.md +344 -0
- package/bin/skills/nnsight/references/tutorials.md +300 -0
- package/bin/skills/openrlhf/SKILL.md +249 -0
- package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
- package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
- package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
- package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
- package/bin/skills/outlines/SKILL.md +652 -0
- package/bin/skills/outlines/references/backends.md +615 -0
- package/bin/skills/outlines/references/examples.md +773 -0
- package/bin/skills/outlines/references/json_generation.md +652 -0
- package/bin/skills/peft/SKILL.md +431 -0
- package/bin/skills/peft/references/advanced-usage.md +514 -0
- package/bin/skills/peft/references/troubleshooting.md +480 -0
- package/bin/skills/phoenix/SKILL.md +475 -0
- package/bin/skills/phoenix/references/advanced-usage.md +619 -0
- package/bin/skills/phoenix/references/troubleshooting.md +538 -0
- package/bin/skills/pinecone/SKILL.md +358 -0
- package/bin/skills/pinecone/references/deployment.md +181 -0
- package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
- package/bin/skills/pytorch-fsdp/references/index.md +7 -0
- package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
- package/bin/skills/pytorch-lightning/SKILL.md +346 -0
- package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
- package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
- package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
- package/bin/skills/pyvene/SKILL.md +473 -0
- package/bin/skills/pyvene/references/README.md +73 -0
- package/bin/skills/pyvene/references/api.md +383 -0
- package/bin/skills/pyvene/references/tutorials.md +376 -0
- package/bin/skills/qdrant/SKILL.md +493 -0
- package/bin/skills/qdrant/references/advanced-usage.md +648 -0
- package/bin/skills/qdrant/references/troubleshooting.md +631 -0
- package/bin/skills/ray-data/SKILL.md +326 -0
- package/bin/skills/ray-data/references/integration.md +82 -0
- package/bin/skills/ray-data/references/transformations.md +83 -0
- package/bin/skills/ray-train/SKILL.md +406 -0
- package/bin/skills/ray-train/references/multi-node.md +628 -0
- package/bin/skills/rwkv/SKILL.md +260 -0
- package/bin/skills/rwkv/references/architecture-details.md +344 -0
- package/bin/skills/rwkv/references/rwkv7.md +386 -0
- package/bin/skills/rwkv/references/state-management.md +369 -0
- package/bin/skills/saelens/SKILL.md +386 -0
- package/bin/skills/saelens/references/README.md +70 -0
- package/bin/skills/saelens/references/api.md +333 -0
- package/bin/skills/saelens/references/tutorials.md +318 -0
- package/bin/skills/segment-anything/SKILL.md +500 -0
- package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
- package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
- package/bin/skills/sentence-transformers/SKILL.md +255 -0
- package/bin/skills/sentence-transformers/references/models.md +123 -0
- package/bin/skills/sentencepiece/SKILL.md +235 -0
- package/bin/skills/sentencepiece/references/algorithms.md +200 -0
- package/bin/skills/sentencepiece/references/training.md +304 -0
- package/bin/skills/sglang/SKILL.md +442 -0
- package/bin/skills/sglang/references/deployment.md +490 -0
- package/bin/skills/sglang/references/radix-attention.md +413 -0
- package/bin/skills/sglang/references/structured-generation.md +541 -0
- package/bin/skills/simpo/SKILL.md +219 -0
- package/bin/skills/simpo/references/datasets.md +478 -0
- package/bin/skills/simpo/references/hyperparameters.md +452 -0
- package/bin/skills/simpo/references/loss-functions.md +350 -0
- package/bin/skills/skypilot/SKILL.md +509 -0
- package/bin/skills/skypilot/references/advanced-usage.md +491 -0
- package/bin/skills/skypilot/references/troubleshooting.md +570 -0
- package/bin/skills/slime/SKILL.md +464 -0
- package/bin/skills/slime/references/api-reference.md +392 -0
- package/bin/skills/slime/references/troubleshooting.md +386 -0
- package/bin/skills/speculative-decoding/SKILL.md +467 -0
- package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
- package/bin/skills/speculative-decoding/references/medusa.md +350 -0
- package/bin/skills/stable-diffusion/SKILL.md +519 -0
- package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
- package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
- package/bin/skills/tensorboard/SKILL.md +629 -0
- package/bin/skills/tensorboard/references/integrations.md +638 -0
- package/bin/skills/tensorboard/references/profiling.md +545 -0
- package/bin/skills/tensorboard/references/visualization.md +620 -0
- package/bin/skills/tensorrt-llm/SKILL.md +187 -0
- package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
- package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
- package/bin/skills/tensorrt-llm/references/serving.md +470 -0
- package/bin/skills/tinker/SKILL.md +362 -0
- package/bin/skills/tinker/references/api-reference.md +168 -0
- package/bin/skills/tinker/references/getting-started.md +157 -0
- package/bin/skills/tinker/references/loss-functions.md +163 -0
- package/bin/skills/tinker/references/models-and-lora.md +139 -0
- package/bin/skills/tinker/references/recipes.md +280 -0
- package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
- package/bin/skills/tinker/references/rendering.md +243 -0
- package/bin/skills/tinker/references/supervised-learning.md +232 -0
- package/bin/skills/tinker-training-cost/SKILL.md +187 -0
- package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
- package/bin/skills/torchforge/SKILL.md +433 -0
- package/bin/skills/torchforge/references/api-reference.md +327 -0
- package/bin/skills/torchforge/references/troubleshooting.md +409 -0
- package/bin/skills/torchtitan/SKILL.md +358 -0
- package/bin/skills/torchtitan/references/checkpoint.md +181 -0
- package/bin/skills/torchtitan/references/custom-models.md +258 -0
- package/bin/skills/torchtitan/references/float8.md +133 -0
- package/bin/skills/torchtitan/references/fsdp.md +126 -0
- package/bin/skills/transformer-lens/SKILL.md +346 -0
- package/bin/skills/transformer-lens/references/README.md +54 -0
- package/bin/skills/transformer-lens/references/api.md +362 -0
- package/bin/skills/transformer-lens/references/tutorials.md +339 -0
- package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
- package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
- package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
- package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
- package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
- package/bin/skills/unsloth/SKILL.md +80 -0
- package/bin/skills/unsloth/references/index.md +7 -0
- package/bin/skills/unsloth/references/llms-full.md +16799 -0
- package/bin/skills/unsloth/references/llms-txt.md +12044 -0
- package/bin/skills/unsloth/references/llms.md +82 -0
- package/bin/skills/verl/SKILL.md +391 -0
- package/bin/skills/verl/references/api-reference.md +301 -0
- package/bin/skills/verl/references/troubleshooting.md +391 -0
- package/bin/skills/vllm/SKILL.md +364 -0
- package/bin/skills/vllm/references/optimization.md +226 -0
- package/bin/skills/vllm/references/quantization.md +284 -0
- package/bin/skills/vllm/references/server-deployment.md +255 -0
- package/bin/skills/vllm/references/troubleshooting.md +447 -0
- package/bin/skills/weights-and-biases/SKILL.md +590 -0
- package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
- package/bin/skills/weights-and-biases/references/integrations.md +700 -0
- package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
- package/bin/skills/whisper/SKILL.md +317 -0
- package/bin/skills/whisper/references/languages.md +189 -0
- package/bin/synsc +0 -0
- package/package.json +10 -0
@@ -0,0 +1,473 @@
# Production Examples

Real-world production deployments using Megatron-Core and Megatron-LM.

## Meta LLaMA 3

### Overview
- **Models**: 8B, 70B, 405B parameters
- **Training Infrastructure**: Two 24,000-GPU H100 clusters
- **Total Investment**: Massive scale, months of training
- **Training Data**: 15 trillion tokens for 405B model
- **Status**: Production deployed (llama.meta.com)

### LLaMA 3.1 405B - Largest Open Model

**Architecture**:
```yaml
Parameters: 405 billion
Layers: 126
Hidden size: 16384
Attention heads: 128
Query groups: 8 (GQA)
FFN size: 53248
Vocabulary: 128,256 tokens
Max context: up to 128K tokens
Position encoding: RoPE
Activation: SwiGLU
Normalization: RMSNorm
```

**Training Configuration**:
```bash
# 1024 H100 GPUs (128 nodes × 8 GPUs)
Tensor Parallel (TP): 8    # Within node
Pipeline Parallel (PP): 8  # Across nodes
Context Parallel (CP): 2   # For long sequences
Data Parallel (DP): 8      # Remaining dimension

Total GPUs: 8 × 8 × 2 × 8 = 1024
Effective batch size: 2048
Micro-batch per GPU: 1
Sequence length: 4096 tokens
```
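As a sanity check on layouts like this, the four parallelism degrees must multiply out to the world size, and the batch geometry fixes the tokens consumed per optimizer step. A minimal sketch of that arithmetic (the helper name is illustrative, not a Megatron API):

```python
def world_size(tp: int, pp: int, cp: int, dp: int) -> int:
    """A 4D model-parallel layout occupies tp * pp * cp * dp GPUs."""
    return tp * pp * cp * dp

# The 405B layout above: TP=8, PP=8, CP=2, DP=8
gpus = world_size(tp=8, pp=8, cp=2, dp=8)
print(gpus)  # 1024

# Tokens consumed per optimizer step = global batch (sequences) * sequence length
tokens_per_step = 2048 * 4096
print(tokens_per_step)  # 8388608 (~8.4M tokens/step)
```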

**Performance Metrics**:
- **Sustained throughput**: 400 TFLOPs/GPU
- **MFU**: ~46% on H100
- **Uptime**: 95%+ over months
- **Efficiency improvement**: 3× vs LLaMA 2 training

**Training Duration**:
- 15 trillion tokens total
- ~54 days on 16,384 H100 GPUs
- Proportionally ~16× longer (over two years) on 1,024 H100 GPUs
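These duration figures can be cross-checked against the common 6·N·D FLOPs rule of thumb for dense transformer pretraining. A rough estimate, assuming the ~400 TFLOPs/GPU sustained throughput quoted above:

```python
def training_days(params: float, tokens: float,
                  gpus: int, flops_per_gpu: float) -> float:
    """Estimate wall-clock days using the ~6 * N * D total-FLOPs
    rule of thumb for dense transformer pretraining."""
    total_flops = 6 * params * tokens
    seconds = total_flops / (gpus * flops_per_gpu)
    return seconds / 86400

# 405B params, 15T tokens, 16,384 H100s at ~400 TFLOPs/GPU sustained
days = training_days(405e9, 15e12, 16_384, 400e12)
print(round(days))  # 64 -- same order as the reported ~54 days
```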

**Key Optimizations Used**:
```bash
--use-mcore-models \
--transformer-impl transformer_engine \
--sequence-parallel \
--context-parallel-size 2 \
--use-distributed-optimizer \
--overlap-grad-reduce \
--overlap-param-gather \
--use-flash-attn-v2 \
--bf16
```

**Production Serving**:
- Deployed on llama.meta.com
- Available via API and download
- Used in Meta products (Instagram, Facebook, WhatsApp)

### LLaMA 3 70B

**Training Configuration**:
```bash
# 64 H100 GPUs (8 nodes × 8 GPUs)
TP=4, PP=4, CP=2, DP=2

torchrun --nproc_per_node=8 --nnodes=8 pretrain_gpt.py \
    --num-layers 80 \
    --hidden-size 8192 \
    --num-attention-heads 64 \
    --num-query-groups 8 \
    --seq-length 4096 \
    --micro-batch-size 1 \
    --global-batch-size 1024 \
    --tensor-model-parallel-size 4 \
    --pipeline-model-parallel-size 4 \
    --context-parallel-size 2 \
    --bf16 \
    --use-mcore-models
```

**Memory per GPU**:
- Model parameters: 140GB / 4 (TP) / 4 (PP) = 8.75GB
- Optimizer states: ~17.5GB
- Activations: ~3GB
- **Total**: ~30GB per H100 (fits in 80GB)
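The parameter term in this breakdown is the bf16 weight footprint divided across the TP and PP dimensions; the optimizer and activation terms depend on settings such as the distributed optimizer, so they are quoted rather than derived here. A sketch of the parameter arithmetic (helper name is illustrative):

```python
def per_gpu_weight_gb(params_billions: float, tp: int, pp: int,
                      bytes_per_param: int = 2) -> float:
    """Weight memory per GPU when parameters are split across TP and PP.
    bytes_per_param=2 assumes bf16 weights."""
    return params_billions * bytes_per_param / (tp * pp)

# 70B params in bf16 = 140GB total, split over TP=4 and PP=4
weights = per_gpu_weight_gb(70, tp=4, pp=4)
print(weights)  # 8.75 (GB), matching the breakdown above
```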

## NVIDIA Nemotron-4 340B

### Overview
- **Organization**: NVIDIA
- **Parameters**: 340 billion
- **Framework**: NeMo (built on Megatron-Core)
- **Purpose**: Enterprise AI foundation model
- **Status**: Commercial deployment

**Key Features**:
- Mixture of Experts architecture
- Optimized for enterprise use cases
- NeMo framework integration
- Production-ready deployment

**Architecture**:
```yaml
Type: Mixture of Experts (MoE)
Total parameters: 340B
Active parameters per token: ~40B
Experts: 8
Router: Top-2
Context length: 4096
```

**Training Infrastructure**:
- NVIDIA DGX H100 systems
- Megatron-Core + NeMo
- Multi-node training
- Enterprise-grade fault tolerance

**Production Features**:
- NeMo Guardrails integration
- Enterprise support
- Customization options
- On-premise deployment available

## Microsoft & NVIDIA Megatron-Turing NLG 530B

### Overview
- **Organization**: Microsoft + NVIDIA collaboration
- **Parameters**: 530 billion (largest dense model when released)
- **Year**: 2021
- **Framework**: DeepSpeed ZeRO-3 + Megatron tensor/pipeline parallelism
- **Hardware**: 560 NVIDIA A100 80GB GPUs

**Architecture**:
```yaml
Parameters: 530 billion
Layers: 105
Hidden size: 20480
Attention heads: 128
Vocabulary: 51,200 tokens
Sequence length: 2048
```

**Training Configuration**:
```bash
# 560 A100 80GB GPUs
Tensor Parallel: 8
Pipeline Parallel: 35
Data Parallel: 2
Total: 8 × 35 × 2 = 560

DeepSpeed ZeRO Stage 3:
- Full parameter sharding
- Gradient sharding
- Optimizer state sharding
```

**Innovations**:
- First to combine DeepSpeed ZeRO-3 with Megatron parallelism
- Demonstrated training at 500B+ scale
- Proved viability of extreme parallelism

**Performance**:
- Trained on 339 billion tokens
- Multiple months of training
- Achieved state-of-the-art results in 2021

## BigScience BLOOM 176B

### Overview
- **Organization**: BigScience (1000+ researchers)
- **Parameters**: 176 billion
- **Year**: 2022
- **Framework**: Megatron-DeepSpeed
- **Hardware**: 384 NVIDIA A100 80GB GPUs
- **Training Duration**: 46 days

**Architecture**:
```yaml
Parameters: 176 billion
Layers: 70
Hidden size: 14336
Attention heads: 112
Vocabulary: 250,680 tokens (multilingual)
Sequence length: 2048
Languages: 46 natural languages + 13 programming languages
```

**Training Configuration**:
```bash
# 384 A100 80GB GPUs on Jean Zay supercomputer
Tensor Parallel: 4
Pipeline Parallel: 12
Data Parallel: 8
Total: 4 × 12 × 8 = 384

Global batch size: 2048
Micro-batch size: 4
Learning rate: 6e-5
Optimizer: Adam (β1=0.9, β2=0.95)
```
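The batch numbers above also fix the gradient-accumulation depth, since global batch = micro-batch × data-parallel size × accumulation steps. A small sketch of that relation:

```python
def grad_accum_steps(global_batch: int, micro_batch: int, dp: int) -> int:
    """Gradient-accumulation steps implied by a batch layout:
    global_batch = micro_batch * dp * accumulation_steps."""
    assert global_batch % (micro_batch * dp) == 0
    return global_batch // (micro_batch * dp)

# BLOOM layout above: global 2048, micro 4, DP 8
print(grad_accum_steps(2048, 4, 8))  # 64
```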
|
|
216
|
+

**Training Data**:
- 366 billion tokens (1.6TB)
- ROOTS corpus (custom multilingual dataset)
- 46 natural languages
- 13 programming languages

**Key Achievements**:
- Largest multilingual open-source model at release
- Trained on a public supercomputer (Jean Zay)
- Fully documented training process
- Open-source model and training code

**Public Impact**:
- Downloaded 100,000+ times
- Used in hundreds of research papers
- Enabled multilingual AI research
- Demonstrated open science at scale

## DeepSeek-V3

### Overview
- **Organization**: DeepSeek
- **Parameters**: 671 billion total, 37B active per token
- **Type**: Mixture of Experts (MoE)
- **Year**: 2024-2025
- **Framework**: Megatron-Core

**Architecture**:
```yaml
Type: Mixture of Experts
Total parameters: 671B
Active parameters per token: 37B
Layers: 61
Hidden size: 7168
Attention heads: 128
Query groups: 16
Attention: Multi-head Latent Attention (MLA)
Routed experts: 256
Router top-k: 8
Shared expert size: 18432
```

**Training Configuration**:
```bash
# 2048 H800 GPUs
Tensor Parallel (TP): 2
Pipeline Parallel (PP): 16
Expert Parallel (EP): 64
Context Parallel (CP): 1

Total: 2 × 16 × 64 = 2048
# Expert parallelism shares GPUs with the data-parallel dimension

Global batch size: 4096
Sequence length: 4096
Training tokens: 14.8 trillion
```

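The routing behind "256 experts, top-k 8" can be sketched in a few lines. This is a toy illustration of top-k gating in pure Python, not DeepSeek's implementation (which adds bias-based, auxiliary-loss-free load balancing on top):

```python
import math

def topk_gate(logits, k=8):
    """Select the k highest-scoring experts for one token and
    renormalize their softmax weights so the gates sum to 1."""
    probs = [math.exp(x) for x in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    # indices of the k largest gate values
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

# One token's router scores over 256 experts (toy values)
logits = [0.01 * i for i in range(256)]
gates = topk_gate(logits, k=8)
assert len(gates) == 8                        # only 8 experts are active
assert abs(sum(gates.values()) - 1.0) < 1e-9  # renormalized gates
assert 255 in gates                           # highest-logit expert chosen
```

Only the 8 selected experts run their FFNs for that token, which is why just 37B of the 671B parameters are active per forward pass.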
**Innovations**:
- Multi-head Latent Attention (MLA)
- Shared experts + routed experts
- Ultra-large expert count (256)
- Auxiliary-loss-free load balancing

**Performance**:
- Competitive with GPT-4
- 37B active parameters rival 70B+ dense models
- Efficient inference (only 37B parameters active per token)

## OpenAI GPT-3 175B (2020)

### Overview
- **Organization**: OpenAI
- **Parameters**: 175 billion
- **Year**: 2020
- **Framework**: Megatron-inspired custom implementation
- **Hardware**: Thousands of NVIDIA V100 GPUs

**Architecture**:
```yaml
Parameters: 175 billion
Layers: 96
Hidden size: 12288
Attention heads: 96
FFN size: 49152
Vocabulary: 50,257 tokens (GPT-2 BPE)
Sequence length: 2048 (context window)
```

**Training Configuration**:
```bash
# Estimated configuration (exact parallelism degrees unpublished)
Tensor Parallel: 4-8
Pipeline Parallel: 8-16
Data Parallel: remaining GPUs

Global batch size: 1536 sequences (~3.2M tokens)
Learning rate: 6e-5
Training tokens: 300 billion
```

**Training Compute**:
- 3.14 × 10^23 FLOPs
- Equivalent to ~355 V100 GPU-years
- Estimated cost: $4-12 million

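The compute figure is consistent with the standard C ≈ 6·N·D approximation (about 6 FLOPs per parameter per training token). A quick check, where the sustained per-GPU throughput is an assumption chosen to match the reported GPU-years, not a measured figure:

```python
N = 175e9    # parameters
D = 300e9    # training tokens
flops = 6 * N * D
assert abs(flops - 3.15e23) / 3.15e23 < 0.01   # ~3.15e23, matching ~3.14e23 above

# Convert to V100 GPU-years at an assumed ~28 TFLOPs sustained per GPU
sustained = 28e12                               # FLOPs/s (assumption)
gpu_years = flops / sustained / (365 * 24 * 3600)
assert 300 < gpu_years < 400                    # on the order of ~355 GPU-years
```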

**Impact**:
- Launched the modern era of large language models
- Demonstrated few-shot learning
- Foundation for ChatGPT

## Stability AI StableLM

### Overview
- **Organization**: Stability AI
- **Framework**: GPT-NeoX (Megatron + DeepSpeed)
- **Hardware**: A100 GPU clusters
- **Status**: Open-source

**Models**:
- StableLM-Base-Alpha: 3B, 7B
- StableLM-Tuned-Alpha: fine-tuned versions
- StableCode: code-specialized

**Training Configuration**:
```yaml
Framework: GPT-NeoX
Parallelism: Megatron TP/PP + DeepSpeed ZeRO
GPUs: A100 clusters
Training data: 1.5 trillion tokens (The Pile)
```

**Key Features**:
- Fully open-source (Apache 2.0)
- GPT-NeoX framework
- Trained on The Pile dataset
- Multiple model sizes

## Common Production Patterns

### Fault Tolerance

**Checkpoint Strategy**:
```bash
--save-interval 500              # Save every 500 iterations
--save /checkpoints/model_name   # Checkpoint directory
--load /checkpoints/model_name   # Auto-resume from latest
```

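The save/auto-resume pattern those flags implement can be sketched as a training-loop skeleton. This is illustrative only, not Megatron-LM code; `run_step` and the checkpoint file format are hypothetical:

```python
import json
import os

def latest_checkpoint(ckpt_dir):
    """Return the highest saved iteration in ckpt_dir, or 0 if none."""
    if not os.path.isdir(ckpt_dir):
        return 0
    iters = [int(name.split("_")[1]) for name in os.listdir(ckpt_dir)
             if name.startswith("iter_")]
    return max(iters, default=0)

def train(ckpt_dir, total_iters, save_interval=500):
    start = latest_checkpoint(ckpt_dir)          # --load: auto-resume
    os.makedirs(ckpt_dir, exist_ok=True)
    for it in range(start + 1, total_iters + 1):
        # run_step(it)  # hypothetical forward/backward/optimizer step
        if it % save_interval == 0:              # --save-interval 500
            with open(os.path.join(ckpt_dir, f"iter_{it}"), "w") as f:
                json.dump({"iteration": it}, f)  # stand-in for model state
    return start                                 # iteration we resumed from
```

After a crash, rerunning the same command resumes from the most recent `iter_*` checkpoint instead of iteration 0.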

**Monitoring**:
```text
# Metrics reported in progress.txt
Job throughput: 45.2 TFLOPs/GPU
Cumulative throughput: 44.8 TFLOPs/GPU
Memory usage: 68.2 GB / 80 GB
Loss: 2.143
```

### Data Pipeline

**Preprocessing**:
```bash
python tools/preprocess_data.py \
    --input data.jsonl \
    --output-prefix /data/processed \
    --vocab-file vocab.json \
    --merge-file merges.txt \
    --tokenizer-type GPT2BPETokenizer \
    --append-eod \
    --workers 64
```

**Training with Preprocessed Data**:
```bash
--data-path /data/processed_text_document \
--split 969,30,1   # Train/valid/test split
```

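The `--split` weights are relative proportions, not percentages; they are normalized over their sum, so `969,30,1` works out to the fractions below (quick check):

```python
weights = [969, 30, 1]
total = sum(weights)                       # 1000
fracs = [w / total for w in weights]
assert fracs == [0.969, 0.03, 0.001]       # 96.9% train, 3.0% valid, 0.1% test
```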

### Monitoring & Logging

**Key Metrics to Track**:

Training metrics:
- Loss (should steadily decrease)
- Learning rate (follows the schedule)
- Gradient norm (watch for spikes)
- Throughput (TFLOPs/GPU)
- MFU percentage

System metrics:
- GPU utilization (>90%)
- Memory usage (<95% of capacity)
- Network bandwidth (saturated for TP)
- Data loading time (should be minimal)

**Production Monitoring Tools**:
- TensorBoard for loss curves
- Weights & Biases for experiment tracking
- Prometheus + Grafana for system metrics
- Custom scripts for MFU calculation

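MFU (model FLOPs utilization) is the achieved model throughput divided by the hardware's peak. A minimal sketch, assuming an A100's ~312 TFLOPs dense BF16 peak and reusing the 45.2 TFLOPs/GPU figure from the progress.txt example above:

```python
def mfu(achieved_tflops_per_gpu, peak_tflops_per_gpu):
    """Model FLOPs utilization as a fraction of hardware peak."""
    return achieved_tflops_per_gpu / peak_tflops_per_gpu

A100_BF16_PEAK = 312     # TFLOPs, dense BF16 (no structured sparsity)
u = mfu(45.2, A100_BF16_PEAK)
assert 0.14 < u < 0.15   # ~14.5% MFU: low enough to warrant investigation
```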

### Multi-Datacenter Training

**Challenges**:
- Higher latency between datacenters
- Network bandwidth limitations
- Fault isolation

**Solutions**:
```bash
# Keep TP within a datacenter (single node, NVLink domain)
--tensor-model-parallel-size 8

# Use PP across datacenters (tolerates latency better than TP)
--pipeline-model-parallel-size 16

# Data parallelism spans the remaining GPUs automatically
```

## Lessons from Production

1. **Fault Tolerance is Critical**
   - Save checkpoints frequently (every 500-1000 steps)
   - Test checkpoint recovery regularly
   - Monitor for GPU failures

2. **Data Quality Matters More Than Quantity**
   - LLaMA 3: carefully curated 15T tokens
   - Better than naive web scraping
   - Investment in data preprocessing pays off

3. **Parallelism Strategy Evolves with Scale**
   - <70B parameters: TP + DP sufficient
   - 70-175B: add PP
   - 175B+: 3D or 4D parallelism required
   - MoE: add the EP dimension

4. **Hardware Matters**
   - H100 vs A100: roughly 2× speedup
   - NVLink topology affects TP efficiency
   - InfiniBand is essential for multi-node training

5. **Monitoring is Essential**
   - Track MFU to catch performance regressions
   - Monitor loss for training health
   - Watch memory usage to avoid OOM
   - Log everything for debugging

## References

- Meta LLaMA 3 technical report
- NVIDIA Nemotron blog posts
- Microsoft Megatron-Turing NLG paper
- BigScience BLOOM documentation
- DeepSeek-V3 technical report