@synsci/cli-darwin-x64 1.1.49

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (373)
  1. package/bin/skills/accelerate/SKILL.md +332 -0
  2. package/bin/skills/accelerate/references/custom-plugins.md +453 -0
  3. package/bin/skills/accelerate/references/megatron-integration.md +489 -0
  4. package/bin/skills/accelerate/references/performance.md +525 -0
  5. package/bin/skills/audiocraft/SKILL.md +564 -0
  6. package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
  7. package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
  8. package/bin/skills/autogpt/SKILL.md +403 -0
  9. package/bin/skills/autogpt/references/advanced-usage.md +535 -0
  10. package/bin/skills/autogpt/references/troubleshooting.md +420 -0
  11. package/bin/skills/awq/SKILL.md +310 -0
  12. package/bin/skills/awq/references/advanced-usage.md +324 -0
  13. package/bin/skills/awq/references/troubleshooting.md +344 -0
  14. package/bin/skills/axolotl/SKILL.md +158 -0
  15. package/bin/skills/axolotl/references/api.md +5548 -0
  16. package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
  17. package/bin/skills/axolotl/references/index.md +15 -0
  18. package/bin/skills/axolotl/references/other.md +3563 -0
  19. package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
  20. package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
  21. package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
  22. package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
  23. package/bin/skills/bitsandbytes/SKILL.md +411 -0
  24. package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
  25. package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
  26. package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
  27. package/bin/skills/blip-2/SKILL.md +564 -0
  28. package/bin/skills/blip-2/references/advanced-usage.md +680 -0
  29. package/bin/skills/blip-2/references/troubleshooting.md +526 -0
  30. package/bin/skills/chroma/SKILL.md +406 -0
  31. package/bin/skills/chroma/references/integration.md +38 -0
  32. package/bin/skills/clip/SKILL.md +253 -0
  33. package/bin/skills/clip/references/applications.md +207 -0
  34. package/bin/skills/constitutional-ai/SKILL.md +290 -0
  35. package/bin/skills/crewai/SKILL.md +498 -0
  36. package/bin/skills/crewai/references/flows.md +438 -0
  37. package/bin/skills/crewai/references/tools.md +429 -0
  38. package/bin/skills/crewai/references/troubleshooting.md +480 -0
  39. package/bin/skills/deepspeed/SKILL.md +141 -0
  40. package/bin/skills/deepspeed/references/08.md +17 -0
  41. package/bin/skills/deepspeed/references/09.md +173 -0
  42. package/bin/skills/deepspeed/references/2020.md +378 -0
  43. package/bin/skills/deepspeed/references/2023.md +279 -0
  44. package/bin/skills/deepspeed/references/assets.md +179 -0
  45. package/bin/skills/deepspeed/references/index.md +35 -0
  46. package/bin/skills/deepspeed/references/mii.md +118 -0
  47. package/bin/skills/deepspeed/references/other.md +1191 -0
  48. package/bin/skills/deepspeed/references/tutorials.md +6554 -0
  49. package/bin/skills/dspy/SKILL.md +590 -0
  50. package/bin/skills/dspy/references/examples.md +663 -0
  51. package/bin/skills/dspy/references/modules.md +475 -0
  52. package/bin/skills/dspy/references/optimizers.md +566 -0
  53. package/bin/skills/faiss/SKILL.md +221 -0
  54. package/bin/skills/faiss/references/index_types.md +280 -0
  55. package/bin/skills/flash-attention/SKILL.md +367 -0
  56. package/bin/skills/flash-attention/references/benchmarks.md +215 -0
  57. package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
  58. package/bin/skills/gguf/SKILL.md +427 -0
  59. package/bin/skills/gguf/references/advanced-usage.md +504 -0
  60. package/bin/skills/gguf/references/troubleshooting.md +442 -0
  61. package/bin/skills/gptq/SKILL.md +450 -0
  62. package/bin/skills/gptq/references/calibration.md +337 -0
  63. package/bin/skills/gptq/references/integration.md +129 -0
  64. package/bin/skills/gptq/references/troubleshooting.md +95 -0
  65. package/bin/skills/grpo-rl-training/README.md +97 -0
  66. package/bin/skills/grpo-rl-training/SKILL.md +572 -0
  67. package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
  68. package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
  69. package/bin/skills/guidance/SKILL.md +572 -0
  70. package/bin/skills/guidance/references/backends.md +554 -0
  71. package/bin/skills/guidance/references/constraints.md +674 -0
  72. package/bin/skills/guidance/references/examples.md +767 -0
  73. package/bin/skills/hqq/SKILL.md +445 -0
  74. package/bin/skills/hqq/references/advanced-usage.md +528 -0
  75. package/bin/skills/hqq/references/troubleshooting.md +503 -0
  76. package/bin/skills/hugging-face-cli/SKILL.md +191 -0
  77. package/bin/skills/hugging-face-cli/references/commands.md +954 -0
  78. package/bin/skills/hugging-face-cli/references/examples.md +374 -0
  79. package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
  80. package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
  81. package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
  82. package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
  83. package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
  84. package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
  85. package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
  86. package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
  87. package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
  88. package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
  89. package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
  90. package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
  91. package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
  92. package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
  93. package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
  94. package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
  95. package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
  96. package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
  97. package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
  98. package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
  99. package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
  100. package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
  101. package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
  102. package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
  103. package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
  104. package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
  105. package/bin/skills/hugging-face-jobs/index.html +216 -0
  106. package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
  107. package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
  108. package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
  109. package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
  110. package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
  111. package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
  112. package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
  113. package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
  114. package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
  115. package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
  116. package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
  117. package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
  118. package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
  119. package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
  120. package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
  121. package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
  122. package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
  123. package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
  124. package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
  125. package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
  126. package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
  127. package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
  128. package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
  129. package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
  130. package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
  131. package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
  132. package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
  133. package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
  134. package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
  135. package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
  136. package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
  137. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
  138. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
  139. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
  140. package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
  141. package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
  142. package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
  143. package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
  144. package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
  145. package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
  146. package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
  147. package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
  148. package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
  149. package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
  150. package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
  151. package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
  152. package/bin/skills/instructor/SKILL.md +740 -0
  153. package/bin/skills/instructor/references/examples.md +107 -0
  154. package/bin/skills/instructor/references/providers.md +70 -0
  155. package/bin/skills/instructor/references/validation.md +606 -0
  156. package/bin/skills/knowledge-distillation/SKILL.md +458 -0
  157. package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
  158. package/bin/skills/lambda-labs/SKILL.md +545 -0
  159. package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
  160. package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
  161. package/bin/skills/langchain/SKILL.md +480 -0
  162. package/bin/skills/langchain/references/agents.md +499 -0
  163. package/bin/skills/langchain/references/integration.md +562 -0
  164. package/bin/skills/langchain/references/rag.md +600 -0
  165. package/bin/skills/langsmith/SKILL.md +422 -0
  166. package/bin/skills/langsmith/references/advanced-usage.md +548 -0
  167. package/bin/skills/langsmith/references/troubleshooting.md +537 -0
  168. package/bin/skills/litgpt/SKILL.md +469 -0
  169. package/bin/skills/litgpt/references/custom-models.md +568 -0
  170. package/bin/skills/litgpt/references/distributed-training.md +451 -0
  171. package/bin/skills/litgpt/references/supported-models.md +336 -0
  172. package/bin/skills/litgpt/references/training-recipes.md +619 -0
  173. package/bin/skills/llama-cpp/SKILL.md +258 -0
  174. package/bin/skills/llama-cpp/references/optimization.md +89 -0
  175. package/bin/skills/llama-cpp/references/quantization.md +213 -0
  176. package/bin/skills/llama-cpp/references/server.md +125 -0
  177. package/bin/skills/llama-factory/SKILL.md +80 -0
  178. package/bin/skills/llama-factory/references/_images.md +23 -0
  179. package/bin/skills/llama-factory/references/advanced.md +1055 -0
  180. package/bin/skills/llama-factory/references/getting_started.md +349 -0
  181. package/bin/skills/llama-factory/references/index.md +19 -0
  182. package/bin/skills/llama-factory/references/other.md +31 -0
  183. package/bin/skills/llamaguard/SKILL.md +337 -0
  184. package/bin/skills/llamaindex/SKILL.md +569 -0
  185. package/bin/skills/llamaindex/references/agents.md +83 -0
  186. package/bin/skills/llamaindex/references/data_connectors.md +108 -0
  187. package/bin/skills/llamaindex/references/query_engines.md +406 -0
  188. package/bin/skills/llava/SKILL.md +304 -0
  189. package/bin/skills/llava/references/training.md +197 -0
  190. package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
  191. package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
  192. package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
  193. package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
  194. package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
  195. package/bin/skills/long-context/SKILL.md +536 -0
  196. package/bin/skills/long-context/references/extension_methods.md +468 -0
  197. package/bin/skills/long-context/references/fine_tuning.md +611 -0
  198. package/bin/skills/long-context/references/rope.md +402 -0
  199. package/bin/skills/mamba/SKILL.md +260 -0
  200. package/bin/skills/mamba/references/architecture-details.md +206 -0
  201. package/bin/skills/mamba/references/benchmarks.md +255 -0
  202. package/bin/skills/mamba/references/training-guide.md +388 -0
  203. package/bin/skills/megatron-core/SKILL.md +366 -0
  204. package/bin/skills/megatron-core/references/benchmarks.md +249 -0
  205. package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
  206. package/bin/skills/megatron-core/references/production-examples.md +473 -0
  207. package/bin/skills/megatron-core/references/training-recipes.md +547 -0
  208. package/bin/skills/miles/SKILL.md +315 -0
  209. package/bin/skills/miles/references/api-reference.md +141 -0
  210. package/bin/skills/miles/references/troubleshooting.md +352 -0
  211. package/bin/skills/mlflow/SKILL.md +704 -0
  212. package/bin/skills/mlflow/references/deployment.md +744 -0
  213. package/bin/skills/mlflow/references/model-registry.md +770 -0
  214. package/bin/skills/mlflow/references/tracking.md +680 -0
  215. package/bin/skills/modal/SKILL.md +341 -0
  216. package/bin/skills/modal/references/advanced-usage.md +503 -0
  217. package/bin/skills/modal/references/troubleshooting.md +494 -0
  218. package/bin/skills/model-merging/SKILL.md +539 -0
  219. package/bin/skills/model-merging/references/evaluation.md +462 -0
  220. package/bin/skills/model-merging/references/examples.md +428 -0
  221. package/bin/skills/model-merging/references/methods.md +352 -0
  222. package/bin/skills/model-pruning/SKILL.md +495 -0
  223. package/bin/skills/model-pruning/references/wanda.md +347 -0
  224. package/bin/skills/moe-training/SKILL.md +526 -0
  225. package/bin/skills/moe-training/references/architectures.md +432 -0
  226. package/bin/skills/moe-training/references/inference.md +348 -0
  227. package/bin/skills/moe-training/references/training.md +425 -0
  228. package/bin/skills/nanogpt/SKILL.md +290 -0
  229. package/bin/skills/nanogpt/references/architecture.md +382 -0
  230. package/bin/skills/nanogpt/references/data.md +476 -0
  231. package/bin/skills/nanogpt/references/training.md +564 -0
  232. package/bin/skills/nemo-curator/SKILL.md +383 -0
  233. package/bin/skills/nemo-curator/references/deduplication.md +87 -0
  234. package/bin/skills/nemo-curator/references/filtering.md +102 -0
  235. package/bin/skills/nemo-evaluator/SKILL.md +494 -0
  236. package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
  237. package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
  238. package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
  239. package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
  240. package/bin/skills/nemo-guardrails/SKILL.md +297 -0
  241. package/bin/skills/nnsight/SKILL.md +436 -0
  242. package/bin/skills/nnsight/references/README.md +78 -0
  243. package/bin/skills/nnsight/references/api.md +344 -0
  244. package/bin/skills/nnsight/references/tutorials.md +300 -0
  245. package/bin/skills/openrlhf/SKILL.md +249 -0
  246. package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
  247. package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
  248. package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
  249. package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
  250. package/bin/skills/outlines/SKILL.md +652 -0
  251. package/bin/skills/outlines/references/backends.md +615 -0
  252. package/bin/skills/outlines/references/examples.md +773 -0
  253. package/bin/skills/outlines/references/json_generation.md +652 -0
  254. package/bin/skills/peft/SKILL.md +431 -0
  255. package/bin/skills/peft/references/advanced-usage.md +514 -0
  256. package/bin/skills/peft/references/troubleshooting.md +480 -0
  257. package/bin/skills/phoenix/SKILL.md +475 -0
  258. package/bin/skills/phoenix/references/advanced-usage.md +619 -0
  259. package/bin/skills/phoenix/references/troubleshooting.md +538 -0
  260. package/bin/skills/pinecone/SKILL.md +358 -0
  261. package/bin/skills/pinecone/references/deployment.md +181 -0
  262. package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
  263. package/bin/skills/pytorch-fsdp/references/index.md +7 -0
  264. package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
  265. package/bin/skills/pytorch-lightning/SKILL.md +346 -0
  266. package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
  267. package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
  268. package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
  269. package/bin/skills/pyvene/SKILL.md +473 -0
  270. package/bin/skills/pyvene/references/README.md +73 -0
  271. package/bin/skills/pyvene/references/api.md +383 -0
  272. package/bin/skills/pyvene/references/tutorials.md +376 -0
  273. package/bin/skills/qdrant/SKILL.md +493 -0
  274. package/bin/skills/qdrant/references/advanced-usage.md +648 -0
  275. package/bin/skills/qdrant/references/troubleshooting.md +631 -0
  276. package/bin/skills/ray-data/SKILL.md +326 -0
  277. package/bin/skills/ray-data/references/integration.md +82 -0
  278. package/bin/skills/ray-data/references/transformations.md +83 -0
  279. package/bin/skills/ray-train/SKILL.md +406 -0
  280. package/bin/skills/ray-train/references/multi-node.md +628 -0
  281. package/bin/skills/rwkv/SKILL.md +260 -0
  282. package/bin/skills/rwkv/references/architecture-details.md +344 -0
  283. package/bin/skills/rwkv/references/rwkv7.md +386 -0
  284. package/bin/skills/rwkv/references/state-management.md +369 -0
  285. package/bin/skills/saelens/SKILL.md +386 -0
  286. package/bin/skills/saelens/references/README.md +70 -0
  287. package/bin/skills/saelens/references/api.md +333 -0
  288. package/bin/skills/saelens/references/tutorials.md +318 -0
  289. package/bin/skills/segment-anything/SKILL.md +500 -0
  290. package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
  291. package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
  292. package/bin/skills/sentence-transformers/SKILL.md +255 -0
  293. package/bin/skills/sentence-transformers/references/models.md +123 -0
  294. package/bin/skills/sentencepiece/SKILL.md +235 -0
  295. package/bin/skills/sentencepiece/references/algorithms.md +200 -0
  296. package/bin/skills/sentencepiece/references/training.md +304 -0
  297. package/bin/skills/sglang/SKILL.md +442 -0
  298. package/bin/skills/sglang/references/deployment.md +490 -0
  299. package/bin/skills/sglang/references/radix-attention.md +413 -0
  300. package/bin/skills/sglang/references/structured-generation.md +541 -0
  301. package/bin/skills/simpo/SKILL.md +219 -0
  302. package/bin/skills/simpo/references/datasets.md +478 -0
  303. package/bin/skills/simpo/references/hyperparameters.md +452 -0
  304. package/bin/skills/simpo/references/loss-functions.md +350 -0
  305. package/bin/skills/skypilot/SKILL.md +509 -0
  306. package/bin/skills/skypilot/references/advanced-usage.md +491 -0
  307. package/bin/skills/skypilot/references/troubleshooting.md +570 -0
  308. package/bin/skills/slime/SKILL.md +464 -0
  309. package/bin/skills/slime/references/api-reference.md +392 -0
  310. package/bin/skills/slime/references/troubleshooting.md +386 -0
  311. package/bin/skills/speculative-decoding/SKILL.md +467 -0
  312. package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
  313. package/bin/skills/speculative-decoding/references/medusa.md +350 -0
  314. package/bin/skills/stable-diffusion/SKILL.md +519 -0
  315. package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
  316. package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
  317. package/bin/skills/tensorboard/SKILL.md +629 -0
  318. package/bin/skills/tensorboard/references/integrations.md +638 -0
  319. package/bin/skills/tensorboard/references/profiling.md +545 -0
  320. package/bin/skills/tensorboard/references/visualization.md +620 -0
  321. package/bin/skills/tensorrt-llm/SKILL.md +187 -0
  322. package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
  323. package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
  324. package/bin/skills/tensorrt-llm/references/serving.md +470 -0
  325. package/bin/skills/tinker/SKILL.md +362 -0
  326. package/bin/skills/tinker/references/api-reference.md +168 -0
  327. package/bin/skills/tinker/references/getting-started.md +157 -0
  328. package/bin/skills/tinker/references/loss-functions.md +163 -0
  329. package/bin/skills/tinker/references/models-and-lora.md +139 -0
  330. package/bin/skills/tinker/references/recipes.md +280 -0
  331. package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
  332. package/bin/skills/tinker/references/rendering.md +243 -0
  333. package/bin/skills/tinker/references/supervised-learning.md +232 -0
  334. package/bin/skills/tinker-training-cost/SKILL.md +187 -0
  335. package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
  336. package/bin/skills/torchforge/SKILL.md +433 -0
  337. package/bin/skills/torchforge/references/api-reference.md +327 -0
  338. package/bin/skills/torchforge/references/troubleshooting.md +409 -0
  339. package/bin/skills/torchtitan/SKILL.md +358 -0
  340. package/bin/skills/torchtitan/references/checkpoint.md +181 -0
  341. package/bin/skills/torchtitan/references/custom-models.md +258 -0
  342. package/bin/skills/torchtitan/references/float8.md +133 -0
  343. package/bin/skills/torchtitan/references/fsdp.md +126 -0
  344. package/bin/skills/transformer-lens/SKILL.md +346 -0
  345. package/bin/skills/transformer-lens/references/README.md +54 -0
  346. package/bin/skills/transformer-lens/references/api.md +362 -0
  347. package/bin/skills/transformer-lens/references/tutorials.md +339 -0
  348. package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
  349. package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
  350. package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
  351. package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
  352. package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
  353. package/bin/skills/unsloth/SKILL.md +80 -0
  354. package/bin/skills/unsloth/references/index.md +7 -0
  355. package/bin/skills/unsloth/references/llms-full.md +16799 -0
  356. package/bin/skills/unsloth/references/llms-txt.md +12044 -0
  357. package/bin/skills/unsloth/references/llms.md +82 -0
  358. package/bin/skills/verl/SKILL.md +391 -0
  359. package/bin/skills/verl/references/api-reference.md +301 -0
  360. package/bin/skills/verl/references/troubleshooting.md +391 -0
  361. package/bin/skills/vllm/SKILL.md +364 -0
  362. package/bin/skills/vllm/references/optimization.md +226 -0
  363. package/bin/skills/vllm/references/quantization.md +284 -0
  364. package/bin/skills/vllm/references/server-deployment.md +255 -0
  365. package/bin/skills/vllm/references/troubleshooting.md +447 -0
  366. package/bin/skills/weights-and-biases/SKILL.md +590 -0
  367. package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
  368. package/bin/skills/weights-and-biases/references/integrations.md +700 -0
  369. package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
  370. package/bin/skills/whisper/SKILL.md +317 -0
  371. package/bin/skills/whisper/references/languages.md +189 -0
  372. package/bin/synsc +0 -0
  373. package/package.json +10 -0
@@ -0,0 +1,366 @@
+ ---
+ name: training-llms-megatron
+ description: Trains large language models (2B-462B parameters) using NVIDIA Megatron-Core with advanced parallelism strategies. Use when training models >1B parameters, need maximum GPU efficiency (47% MFU on H100), or require tensor/pipeline/sequence/context/expert parallelism. Production-ready framework used for Nemotron, LLaMA, DeepSeek.
+ version: 1.0.0
+ author: Synthetic Sciences
+ license: MIT
+ tags: [Megatron-Core, Large-Scale Training, NVIDIA, Tensor Parallelism, Pipeline Parallelism, Model Parallelism, H100, Distributed Training, Production]
+ dependencies: [megatron-core, torch, apex, transformer-engine]
+ ---
+
+ # Megatron-Core - Large-Scale LLM Training
+
+ ## Quick start
+
+ Megatron-Core trains LLMs from 2B to 462B parameters with up to 47% Model FLOP Utilization on H100 GPUs through advanced parallelism strategies.
+
+ **Installation**:
+ ```bash
+ # Docker (recommended)
+ docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:25.04-py3
+
+ # Or pip
+ pip install megatron-core
+ ```
+
+ **Simple distributed training**:
+ ```bash
+ # Train with 2 GPUs using data parallelism
+ torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py
+
+ # Or LLaMA-3 8B training
+ ./examples/llama/train_llama3_8b_fp8.sh
+ ```
+
+ ## Common workflows
+
+ ### Workflow 1: Train LLaMA-style model with 3D parallelism
+
+ Copy this checklist:
+
+ ```
+ LLaMA Training Setup:
+ - [ ] Step 1: Choose parallelism configuration
+ - [ ] Step 2: Configure training hyperparameters
+ - [ ] Step 3: Launch distributed training
+ - [ ] Step 4: Monitor performance metrics
+ ```
+
+ **Step 1: Choose parallelism configuration**
+
+ Model size determines parallelism strategy:
+
+ | Model Size | GPUs | Tensor Parallel | Pipeline Parallel | Data Parallel | Context Parallel |
+ |------------|------|-----------------|-------------------|---------------|------------------|
+ | 7B | 8 | 1 | 1 | 8 | 1 |
+ | 13B | 8 | 2 | 1 | 4 | 1 |
+ | 70B | 64 | 4 | 4 | 4 | 1 |
+ | 405B | 128 | 8 | 8 | 1 | 2 |
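A quick sanity check on any row: the product of all parallelism degrees must equal the GPU count. A minimal shell sketch with the 70B values hard-coded for illustration:

```shell
#!/bin/sh
# Verify that TP x PP x DP x CP equals the available GPU count.
# Values from the 70B row; substitute your own configuration.
GPUS=64
TP=4   # tensor parallel
PP=4   # pipeline parallel
DP=4   # data parallel
CP=1   # context parallel

PRODUCT=$((TP * PP * DP * CP))
if [ "$PRODUCT" -eq "$GPUS" ]; then
    echo "OK: ${TP}x${PP}x${DP}x${CP} = ${GPUS} GPUs"
else
    echo "Mismatch: product is ${PRODUCT}, expected ${GPUS}" >&2
    exit 1
fi
```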
+
+ **Step 2: Configure training hyperparameters**
+
+ ```bash
+ #!/bin/bash
+ # train_llama_70b.sh
+
+ GPUS_PER_NODE=8
+ NNODES=8 # 64 GPUs total
+ TP=4 # Tensor parallel
+ PP=4 # Pipeline parallel
+ CP=1 # Context parallel
+
+ # LLaMA 70B configuration
+ MODEL_SIZE=70 # Billion parameters
+ HIDDEN_SIZE=8192
+ NUM_LAYERS=80
+ NUM_HEADS=64
+ SEQ_LENGTH=4096
+
+ # Training hyperparameters
+ MICRO_BATCH=1
+ GLOBAL_BATCH=1024
+ LR=3e-4
+
+ torchrun \
+ --nproc_per_node=$GPUS_PER_NODE \
+ --nnodes=$NNODES \
+ pretrain_gpt.py \
+ --tensor-model-parallel-size $TP \
+ --pipeline-model-parallel-size $PP \
+ --context-parallel-size $CP \
+ --sequence-parallel \
+ --num-layers $NUM_LAYERS \
+ --hidden-size $HIDDEN_SIZE \
+ --num-attention-heads $NUM_HEADS \
+ --seq-length $SEQ_LENGTH \
+ --max-position-embeddings $SEQ_LENGTH \
+ --micro-batch-size $MICRO_BATCH \
+ --global-batch-size $GLOBAL_BATCH \
+ --lr $LR \
+ --train-iters 100000 \
+ --lr-decay-style cosine \
+ --lr-warmup-iters 2000 \
+ --weight-decay 0.1 \
+ --clip-grad 1.0 \
+ --bf16 \
+ --use-mcore-models \
+ --transformer-impl transformer_engine \
+ --data-path /path/to/data \
+ --vocab-file /path/to/vocab.json \
+ --merge-file /path/to/merges.txt
+ ```
+
+ **Step 3: Launch distributed training**
+
+ ```bash
+ # Single node (8 GPUs; set NNODES=1)
+ bash train_llama_70b.sh
+
+ # Multi-node with SLURM
+ sbatch --nodes=8 --gpus-per-node=8 train_llama_70b.sh
+ ```
+
+ **Step 4: Monitor performance metrics**
+
+ Key metrics to track:
+ ```
+ Model FLOP Utilization (MFU): Target >40% on H100
+ Throughput: Tokens/sec/GPU
+ Memory usage: <80GB per GPU for 70B model
+ Loss: Should decrease steadily
+ ```
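MFU can be estimated from measured throughput using the common approximation of ~6 FLOPs per parameter per token for forward plus backward. A hedged sketch (awk handles the floating point; 989 TFLOPS is the H100 BF16 dense peak per GPU, and the throughput value is purely illustrative):

```shell
#!/bin/sh
# Rough MFU estimate: (6 * params * tokens/sec) / (gpus * peak_flops).
PARAMS_B=70          # model parameters, in billions
TOKENS_PER_SEC=40000 # aggregate training throughput (assumed value)
NUM_GPUS=64
PEAK_TFLOPS=989      # H100 BF16 dense peak, per GPU

MFU=$(awk -v p="$PARAMS_B" -v t="$TOKENS_PER_SEC" -v g="$NUM_GPUS" -v f="$PEAK_TFLOPS" \
    'BEGIN { printf "%.1f", 100 * 6 * p * 1e9 * t / (g * f * 1e12) }')
echo "Estimated MFU: ${MFU}%"
# prints: Estimated MFU: 26.5%
```

At 26.5%, this hypothetical run would be well below the >40% target, signaling room for the optimizations in Workflow 3.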
+
+ ### Workflow 2: Configure Mixture of Experts (MoE) training
+
+ For sparse MoE models like Mixtral.
+
+ ```
+ MoE Training:
+ - [ ] Step 1: Configure expert parallelism
+ - [ ] Step 2: Set MoE hyperparameters
+ - [ ] Step 3: Launch training with EP
+ ```
+
+ **Step 1: Configure expert parallelism**
+
+ ```bash
+ # Mixtral 8x7B example
+ TENSOR_PARALLEL=2
+ PIPELINE_PARALLEL=1
+ EXPERT_PARALLEL=4 # Split 8 experts across 4 GPUs
+ DATA_PARALLEL=4
+
+ TOTAL_GPUS=$((TENSOR_PARALLEL * PIPELINE_PARALLEL * EXPERT_PARALLEL * DATA_PARALLEL))
+ # = 2 * 1 * 4 * 4 = 32 GPUs
+ ```
+
+ **Step 2: Set MoE hyperparameters**
+
+ ```bash
+ torchrun \
+ --nproc_per_node=8 \
+ pretrain_gpt.py \
+ --tensor-model-parallel-size 2 \
+ --pipeline-model-parallel-size 1 \
+ --expert-model-parallel-size 4 \
+ --num-experts 8 \
+ --moe-router-topk 2 \
+ --moe-router-load-balancing-type aux_loss \
+ --moe-aux-loss-coeff 0.01 \
+ --hidden-size 4096 \
+ --num-layers 32 \
+ --num-attention-heads 32 \
+ --seq-length 4096 \
+ --max-position-embeddings 4096 \
+ --bf16 \
+ --use-mcore-models \
+ --transformer-impl transformer_engine \
+ --data-path /path/to/data \
+ --vocab-file /path/to/vocab.json \
+ --merge-file /path/to/merges.txt
+ ```
+
+ **Step 3: Launch training with EP**
+
+ Expert parallelism distributes different experts across GPUs, reducing memory while maintaining capacity.
+
+ ```
+ Memory without EP: 8 experts × 7B = 56GB per GPU
+ Memory with EP=4: 2 experts × 7B = 14GB per GPU
+ Savings: 75% memory reduction
+ ```
192
+
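The same accounting works for any precision and EP degree. A small calculator sketch (covers expert weights only, not activations or optimizer state; in BF16 a 7B-parameter expert is ~14GB at 2 bytes/param, and the 75% saving holds regardless of precision):

```python
# Per-GPU expert weight memory, with and without expert parallelism.
# Illustrative sketch; counts expert weights only.

BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp8": 1}

def expert_memory_gb(num_experts: int, params_per_expert: float,
                     dtype: str, ep: int = 1) -> float:
    """GB of expert weights held by one GPU when experts are split EP ways."""
    local_experts = num_experts / ep
    return local_experts * params_per_expert * BYTES_PER_PARAM[dtype] / 1e9

no_ep = expert_memory_gb(8, 7e9, "bf16", ep=1)    # 112.0 GB
with_ep = expert_memory_gb(8, 7e9, "bf16", ep=4)  # 28.0 GB
print(no_ep, with_ep, 1 - with_ep / no_ep)        # savings fraction = 0.75
```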
### Workflow 3: Optimize for maximum throughput

Target up to 47% MFU on H100.

```
Performance Optimization:
- [ ] Step 1: Enable Flash Attention
- [ ] Step 2: Use FP8 precision (H100)
- [ ] Step 3: Optimize micro-batch size
- [ ] Step 4: Tune parallelism degrees
```

**Step 1: Enable optimizations**

```bash
--use-flash-attn                       # Flash Attention kernels
--use-mcore-models                     # Use Megatron Core models
--transformer-impl transformer_engine  # Use Transformer Engine
--sequence-parallel                    # Reduce activation memory (use with TP)
```

**Step 2: Use FP8 precision (H100 only)**

```bash
--fp8-hybrid  # FP8 mixed precision training
# Transformer Engine handles FP8 automatically
```

Result: 1.5-2x speedup on H100 vs BF16.

**Step 3: Optimize micro-batch size**

Find largest micro-batch that fits in memory:

```bash
# Start with 1, increase until OOM
for MBS in 1 2 4 8; do
    echo "Testing micro-batch-size=$MBS"
    torchrun ... --micro-batch-size $MBS
done
```

Typical values:
- 7B model: 4-8
- 70B model: 1-2
- 405B model: 1

**Step 4: Tune parallelism degrees**

Rules of thumb:
```
Tensor Parallel: Use ≤8 (limited by NVLink within node)
Pipeline Parallel: Use for >70B models
Context Parallel: Use for sequences >8K tokens
Data Parallel: Fill remaining GPUs
```

Example 405B on 128 H100s:
```
TP=8 (1 node)
PP=8 (across nodes)
CP=2 (long sequences)
DP=1
Total = 8 × 8 × 2 × 1 = 128 GPUs
```
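The rules of thumb above can be encoded as a rough first-guess picker. A heuristic sketch only (the thresholds mirror the rules; real configs need profiling):

```python
# Heuristic first guess at parallelism degrees from the rules of thumb above.
# A starting point for tuning, not a substitute for profiling.

def suggest_parallelism(num_gpus: int, model_params_b: float,
                        seq_len: int, gpus_per_node: int = 8):
    tp = min(gpus_per_node, 8)            # keep TP inside the NVLink domain
    pp = 8 if model_params_b > 70 else 1  # PP for very large models
    cp = 2 if seq_len > 8192 else 1       # CP for long sequences
    dp = num_gpus // (tp * pp * cp)       # fill remaining GPUs with DP
    assert tp * pp * cp * dp == num_gpus, "degrees must multiply to num_gpus"
    return {"TP": tp, "PP": pp, "CP": cp, "DP": dp}

# A 405B run on 128 H100s with long sequences reproduces the example above:
print(suggest_parallelism(128, 405, seq_len=16384))
# {'TP': 8, 'PP': 8, 'CP': 2, 'DP': 1}
```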

## When to use vs alternatives

**Use Megatron-Core when:**
- Training models >10B parameters
- Need maximum efficiency (target >40% MFU)
- Using NVIDIA GPUs (A100, H100)
- Production training at scale
- Want fine-grained parallelism control

**Use alternatives instead:**
- **PyTorch FSDP**: Models <70B, simpler API, PyTorch native
- **DeepSpeed**: Easier setup, good for <100B models
- **HuggingFace Accelerate**: Prototyping, simpler workflows
- **LitGPT**: Educational, single-file implementations

## Common issues

**Issue: Low GPU utilization (<30% MFU)**

Causes:
1. Micro-batch too small
2. Too much parallelism overhead
3. Not using Flash Attention

Fixes:
```bash
# Increase micro-batch
--micro-batch-size 4  # Was 1

# Enable optimizations
--use-flash-attn
--sequence-parallel

# Reduce TP if >8
--tensor-model-parallel-size 4  # Was 16
```

**Issue: Out of memory**

Reduce memory with:
```bash
--tensor-model-parallel-size 2  # Split model across GPUs
--recompute-granularity full    # Gradient checkpointing
--recompute-method block        # Checkpoint transformer blocks
--recompute-num-layers 1        # Checkpoint every layer
```

Or use CPU/NVMe offloading:
```bash
--cpu-optimizer            # Offload optimizer to CPU
--cpu-optimizer-type ADAM  # CPU Adam variant
```

**Issue: Training slower than expected**

Check:
1. **Network bottleneck**: Ensure InfiniBand/NVLink enabled
2. **Pipeline bubbles**: Use interleaved pipeline schedule
   ```bash
   --num-layers-per-virtual-pipeline-stage 2
   ```
3. **Data loading**: Use fast data loader
   ```bash
   --dataloader-type cyclic
   ```

**Issue: Diverging loss**

Stabilize training:
```bash
--lr-warmup-iters 2000   # Longer warmup
--clip-grad 1.0          # Gradient clipping
--init-method-std 0.006  # Smaller init
--attention-dropout 0.0  # No dropout in attention
--hidden-dropout 0.0     # No dropout in FFN
```


## Advanced topics

**Parallelism strategies**: See [references/parallelism-guide.md](references/parallelism-guide.md) for detailed comparison of TP/PP/DP/CP/EP with performance analysis and when to use each.

**Performance benchmarks**: See [references/benchmarks.md](references/benchmarks.md) for MFU numbers across different model sizes and GPU configurations.

**Production configurations**: See [references/production-examples.md](references/production-examples.md) for real-world setups from LLaMA 3 405B, Nemotron-4 340B, and DeepSeek-V3 671B.

**Training recipes**: See [references/training-recipes.md](references/training-recipes.md) for complete hyperparameter configurations for GPT/LLaMA/Mixtral architectures.

## Hardware requirements

- **GPU**: NVIDIA Ampere+ (A100, H100, B200)
  - Turing works but slower
  - FP8 requires Hopper/Ada/Blackwell
- **Network**: InfiniBand or 400Gb+ Ethernet for multi-node
- **Memory per GPU**:
  - 7B model: 40GB+
  - 70B model: 80GB (with TP=4)
  - 405B model: 80GB (with TP=8, PP=8)
- **Storage**: Fast NVMe for checkpoints (1TB+ for 70B+ models)

## Resources

- Docs: https://docs.nvidia.com/megatron-core/
- GitHub: https://github.com/NVIDIA/Megatron-LM
- Papers:
  - "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism" (2019)
  - "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM" (2021)
- NeMo Framework: https://docs.nvidia.com/nemo-framework/ (built on Megatron-Core)

@@ -0,0 +1,249 @@
# Performance Benchmarks

Performance metrics and benchmarks for Megatron-Core across different model sizes and hardware configurations.

## Model FLOP Utilization (MFU)

**H100 Clusters**: Up to 47% MFU achieved

MFU increases with larger model sizes due to higher arithmetic intensity in larger matrix multiplications (GEMMs).

## Throughput Metrics by Model Size

### GPT-3 175B
- **Hardware**: H100
- **Configuration**: TP=4, PP=8
- **GPUs**: 128-512
- **MFU**: 47% on H100
- **Throughput**: 390 TFlops/GPU on H100

### LLaMA Configurations

| Model | Size | GPUs | TP | PP | CP | Seq Length | Hardware | Notes |
|-------|------|------|----|----|----|------------|----------|-------|
| LLaMA-3 | 8B | 8 | 1 | 1 | 2 | 8K | H100 | CP for long sequences |
| LLaMA-3 | 70B | 64 | 4 | 4 | 2 | 4K | H100 | TP+PP parallelism |
| LLaMA-3.1 | 405B | 1024 | 8 | 8 | 2 | 4K | H100 | 3D parallelism |

**LLaMA-3 405B Details**:
- Up to 16K H100 GPUs (drawn from two 24K-GPU clusters)
- TP=8, PP=8, CP=2
- 400 TFlops/GPU average
- 95%+ uptime
- 3× efficiency improvement vs LLaMA 2

### Mixtral (Mixture of Experts)

| Model | Active Params | Total Params | GPUs | TP | PP | EP | Experts | Hardware |
|-------|---------------|--------------|------|----|----|----|---------|----------|
| Mixtral | 7B (active) | 8×7B (56B) | 64 | 1 | 4 | 8 | 8 | H100 |
| Mixtral | 22B (active) | 8×22B (176B) | 256 | 4 | 4 | 8 | 8 | H100 |

### DeepSeek-V3

- **Active Parameters**: 37B per token
- **Total Parameters**: 671B
- **GPUs**: 1024 H100
- **Configuration**: TP=2, PP=16, EP=64
- **Parallelism**: 4D with Expert Parallel

### GPT-462B (Largest Benchmark)

- **Parameters**: 462B
- **GPUs**: 6144 H100
- **MFU**: 47-48%
- **Throughput**: ~390 TFlops/GPU

## Hardware Performance Characteristics

### NVIDIA H100 (Hopper)
- **Peak Performance**:
  - FP16: 1979 TFlops (with sparsity; ~989 dense)
  - BF16: 1979 TFlops (with sparsity; ~989 dense)
  - FP8: 3958 TFlops (with sparsity; ~1979 dense)
- **Memory**: 80GB HBM3
- **Memory Bandwidth**: 3.35 TB/s
- **NVLink**: 900 GB/s per GPU

**Achieved MFU**: 40-47% (typical range, relative to dense peak)

### NVIDIA A100 (Ampere)
- **Peak Performance**:
  - FP16: 312 TFlops (dense; 624 with sparsity)
  - BF16: 312 TFlops (dense)
- **Memory**: 40GB or 80GB HBM2e
- **Memory Bandwidth**: 2 TB/s
- **NVLink**: 600 GB/s per GPU

**Typical MFU**: 35-42%

## Weak Scaling (Fixed Per-GPU Workload)

As you add more GPUs while keeping per-GPU workload constant:

| GPUs | Model Size | MFU | Efficiency |
|------|------------|-----|------------|
| 8 | 7B | 42% | 100% (baseline) |
| 64 | 70B | 44% | 95% |
| 512 | 175B | 45% | 93% |
| 1024 | 405B | 46% | 90% |
| 6144 | 462B | 47% | 88% |

## Strong Scaling (Fixed Total Workload)

Distributing a fixed model across more GPUs:

| Model | GPUs | Time per Iteration | Speedup | Efficiency |
|-------|------|--------------------|---------|------------|
| 70B | 64 | 1.0× (baseline) | 1.0× | 100% |
| 70B | 128 | 0.52× | 1.92× | 96% |
| 70B | 256 | 0.27× | 3.70× | 93% |
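The speedup and efficiency columns follow directly from the relative iteration times. A quick check, using the values from the strong-scaling table:

```python
# Derive speedup and scaling efficiency from relative iteration times,
# reproducing the strong-scaling table above.

def strong_scaling(base_gpus: int, gpus: int, rel_time: float):
    speedup = 1.0 / rel_time                   # vs the baseline run
    efficiency = speedup / (gpus / base_gpus)  # fraction of ideal linear speedup
    return round(speedup, 2), round(efficiency, 2)

print(strong_scaling(64, 128, 0.52))  # (1.92, 0.96)
print(strong_scaling(64, 256, 0.27))  # (3.7, 0.93)
```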

## Throughput Calculations

**Formula**:
```
Throughput (TFlops/GPU) = Total FLOPs / (Time × Number of GPUs × 10^12)
```

**Example (GPT-3 175B)**:
- Forward + backward pass: 3 × (forward FLOPs)
- Forward FLOPs per token: ~350 billion for a 175B model (≈2 × parameters)
- Batch size: 1536 (global)
- Sequence length: 2048
- Time per iteration: ~16.5 seconds on 512 H100s
- Throughput: 6 × 175e9 × (1536 × 2048) / (16.5 s × 512 GPUs) ≈ 390 TFlops/GPU

## Memory Usage vs Model Size

| Model Size | Parameters | Memory (FP16) | Memory (BF16) | Memory (FP8) |
|------------|------------|---------------|---------------|--------------|
| 7B | 7 billion | 14 GB | 14 GB | 7 GB |
| 13B | 13 billion | 26 GB | 26 GB | 13 GB |
| 70B | 70 billion | 140 GB | 140 GB | 70 GB |
| 175B | 175 billion | 350 GB | 350 GB | 175 GB |
| 405B | 405 billion | 810 GB | 810 GB | 405 GB |

**Note**: These are model weights only. Training with Adam in mixed precision needs roughly 16 bytes per parameter in total (weights, gradients, and FP32 optimizer states), i.e. about 8× the BF16 weight memory, plus activation memory on top.
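The table reduces to a simple estimator. A sketch, assuming the common mixed-precision-Adam figure of ~16 bytes per parameter for training state (weights + gradients + FP32 master weights and two moments); activations and buffers are excluded:

```python
# Rough memory estimates: weights at a given precision, plus a
# mixed-precision Adam training total of ~16 bytes/parameter (assumption).
# Activations and temporary buffers are not included.

BYTES = {"fp16": 2, "bf16": 2, "fp8": 1}

def weight_memory_gb(params: float, dtype: str) -> float:
    """Model weights only, matching the table above."""
    return params * BYTES[dtype] / 1e9

def training_memory_gb(params: float) -> float:
    """Weights + gradients + Adam optimizer states in mixed precision."""
    return params * 16 / 1e9

print(weight_memory_gb(70e9, "bf16"))  # 140.0 GB, matches the table
print(training_memory_gb(70e9))        # 1120.0 GB -> must be sharded across GPUs
```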

## Communication Overhead

### Tensor Parallelism (TP)
- **Bandwidth Required**: ~20 GB/GPU for LLaMA 70B with TP=4
- **Frequency**: Every layer (80+ layers)
- **Best Practice**: Use NVLink, keep TP ≤8 within single node

### Pipeline Parallelism (PP)
- **Bandwidth Required**: Activation size only (~100s of MB)
- **Frequency**: Between pipeline stages
- **Best Practice**: Use for cross-node scaling

### Data Parallelism (DP)
- **Bandwidth Required**: Full gradient size
- **Frequency**: Once per iteration
- **Best Practice**: Use for remaining parallelism after TP/PP

## Optimization Impact

### Flash Attention
- **Speedup**: 2-4× on attention layers
- **Memory**: 10-20× reduction
- **Overall Impact**: ~30% faster training
+
153
+ ### Sequence Parallelism
154
+ - **Memory Savings**: Activation memory / TP degree
155
+ - **Example**: With TP=4, saves 75% of activation memory
156
+ - **No Performance Cost**: Communication already happening
157
+
158
+ ### Context Parallelism
159
+ - **Use Case**: Sequences >8K tokens
160
+ - **Memory Savings**: KV cache / CP degree
161
+ - **Communication**: Ring all-to-all pattern
162

### FP8 Training (H100 Only)
- **Speedup**: 1.5-2× vs BF16
- **Memory**: 50% reduction vs BF16
- **Quality**: Minimal degradation with proper scaling

## Production Deployments

### Meta LLaMA 3 Training
- **Models**: 8B, 70B, 405B
- **Cluster**: Two 24K H100 clusters
- **Efficiency**: 400 TFlops/GPU sustained
- **Uptime**: 95%+
- **Total Tokens**: 15 trillion for 405B model

### Microsoft Megatron-Turing NLG 530B
- **GPUs**: 560 DGX A100 servers (A100 80GB)
- **Parallelism**: DeepSpeed ZeRO-3 + Megatron TP/PP
- **Duration**: Several months
- **Year**: 2021

### NVIDIA Nemotron-4 340B
- **Architecture**: Mixture of Experts
- **Framework**: NeMo (built on Megatron-Core)
- **Production**: Commercial deployment

## Benchmarking Best Practices

1. **Measure Sustained Performance**: Not peak; average over 100+ iterations
2. **Include All Operations**: Forward, backward, optimizer step, communication
3. **Report MFU**: Use theoretical peak FLOPs of hardware
4. **Specify Configuration**: TP, PP, CP, EP degrees, batch sizes, sequence length
5. **Note Optimizations**: Flash Attention, FP8, sequence parallel, etc.

## How to Measure Your Own Performance

**Enable profiling**:
```bash
torchrun pretrain_gpt.py \
    --profile \
    --profile-step-start 10 \
    --profile-step-end 20
```

**Calculate MFU**:
```python
# Megatron logs this automatically
# Check logs for:
# - elapsed time per iteration (seconds)
# - samples per second
# - TFLOPs/s per GPU
# - MFU percentage
```
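If you want to recompute MFU from those logged values yourself, a minimal sketch (the peak numbers are my assumptions for dense BF16: ~989 TFlops on H100, ~312 on A100):

```python
# Recompute MFU from the achieved TFLOPs/GPU that Megatron logs.
# Peak figures are dense BF16 values (assumption), matching how MFU
# is conventionally reported.

DENSE_BF16_PEAK_TFLOPS = {"H100": 989, "A100": 312}

def mfu_percent(achieved_tflops_per_gpu: float, gpu: str = "H100") -> float:
    """Model FLOP Utilization as a percentage of dense peak."""
    return 100 * achieved_tflops_per_gpu / DENSE_BF16_PEAK_TFLOPS[gpu]

print(round(mfu_percent(390, "H100"), 1))  # 39.4
print(round(mfu_percent(130, "A100"), 1))  # 41.7
```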

**Key Metrics to Track**:
- Elapsed time per iteration
- Throughput (TFlops/GPU)
- MFU (%)
- Memory usage (GB)
- Communication time (% of total)

## Troubleshooting Low Performance

**If MFU < 30%**:
1. Check micro-batch size (increase if possible)
2. Enable all optimizations (Flash Attention, sequence parallel, etc.)
3. Verify communication backend (NCCL properly configured)
4. Check for data loading bottlenecks
5. Ensure proper CPU-GPU pipeline

**If Communication Heavy** (>30% of time):
1. Reduce TP degree (especially across nodes)
2. Use interleaved pipeline schedule
3. Enable communication overlap flags
4. Check network topology (InfiniBand vs Ethernet)

**If Memory Bound**:
1. Enable gradient checkpointing
2. Use lower precision (BF16 or FP8)
3. Increase parallelism degrees
4. Reduce micro-batch size

## References

- NVIDIA Megatron-LM GitHub: https://github.com/NVIDIA/Megatron-LM
- Performance Docs: https://docs.nvidia.com/megatron-core/
- LLaMA 3 Paper: Meta AI
- DeepSeek-V3 Technical Report