@synsci/cli-darwin-x64 1.1.49

Files changed (373)
  1. package/bin/skills/accelerate/SKILL.md +332 -0
  2. package/bin/skills/accelerate/references/custom-plugins.md +453 -0
  3. package/bin/skills/accelerate/references/megatron-integration.md +489 -0
  4. package/bin/skills/accelerate/references/performance.md +525 -0
  5. package/bin/skills/audiocraft/SKILL.md +564 -0
  6. package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
  7. package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
  8. package/bin/skills/autogpt/SKILL.md +403 -0
  9. package/bin/skills/autogpt/references/advanced-usage.md +535 -0
  10. package/bin/skills/autogpt/references/troubleshooting.md +420 -0
  11. package/bin/skills/awq/SKILL.md +310 -0
  12. package/bin/skills/awq/references/advanced-usage.md +324 -0
  13. package/bin/skills/awq/references/troubleshooting.md +344 -0
  14. package/bin/skills/axolotl/SKILL.md +158 -0
  15. package/bin/skills/axolotl/references/api.md +5548 -0
  16. package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
  17. package/bin/skills/axolotl/references/index.md +15 -0
  18. package/bin/skills/axolotl/references/other.md +3563 -0
  19. package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
  20. package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
  21. package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
  22. package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
  23. package/bin/skills/bitsandbytes/SKILL.md +411 -0
  24. package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
  25. package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
  26. package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
  27. package/bin/skills/blip-2/SKILL.md +564 -0
  28. package/bin/skills/blip-2/references/advanced-usage.md +680 -0
  29. package/bin/skills/blip-2/references/troubleshooting.md +526 -0
  30. package/bin/skills/chroma/SKILL.md +406 -0
  31. package/bin/skills/chroma/references/integration.md +38 -0
  32. package/bin/skills/clip/SKILL.md +253 -0
  33. package/bin/skills/clip/references/applications.md +207 -0
  34. package/bin/skills/constitutional-ai/SKILL.md +290 -0
  35. package/bin/skills/crewai/SKILL.md +498 -0
  36. package/bin/skills/crewai/references/flows.md +438 -0
  37. package/bin/skills/crewai/references/tools.md +429 -0
  38. package/bin/skills/crewai/references/troubleshooting.md +480 -0
  39. package/bin/skills/deepspeed/SKILL.md +141 -0
  40. package/bin/skills/deepspeed/references/08.md +17 -0
  41. package/bin/skills/deepspeed/references/09.md +173 -0
  42. package/bin/skills/deepspeed/references/2020.md +378 -0
  43. package/bin/skills/deepspeed/references/2023.md +279 -0
  44. package/bin/skills/deepspeed/references/assets.md +179 -0
  45. package/bin/skills/deepspeed/references/index.md +35 -0
  46. package/bin/skills/deepspeed/references/mii.md +118 -0
  47. package/bin/skills/deepspeed/references/other.md +1191 -0
  48. package/bin/skills/deepspeed/references/tutorials.md +6554 -0
  49. package/bin/skills/dspy/SKILL.md +590 -0
  50. package/bin/skills/dspy/references/examples.md +663 -0
  51. package/bin/skills/dspy/references/modules.md +475 -0
  52. package/bin/skills/dspy/references/optimizers.md +566 -0
  53. package/bin/skills/faiss/SKILL.md +221 -0
  54. package/bin/skills/faiss/references/index_types.md +280 -0
  55. package/bin/skills/flash-attention/SKILL.md +367 -0
  56. package/bin/skills/flash-attention/references/benchmarks.md +215 -0
  57. package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
  58. package/bin/skills/gguf/SKILL.md +427 -0
  59. package/bin/skills/gguf/references/advanced-usage.md +504 -0
  60. package/bin/skills/gguf/references/troubleshooting.md +442 -0
  61. package/bin/skills/gptq/SKILL.md +450 -0
  62. package/bin/skills/gptq/references/calibration.md +337 -0
  63. package/bin/skills/gptq/references/integration.md +129 -0
  64. package/bin/skills/gptq/references/troubleshooting.md +95 -0
  65. package/bin/skills/grpo-rl-training/README.md +97 -0
  66. package/bin/skills/grpo-rl-training/SKILL.md +572 -0
  67. package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
  68. package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
  69. package/bin/skills/guidance/SKILL.md +572 -0
  70. package/bin/skills/guidance/references/backends.md +554 -0
  71. package/bin/skills/guidance/references/constraints.md +674 -0
  72. package/bin/skills/guidance/references/examples.md +767 -0
  73. package/bin/skills/hqq/SKILL.md +445 -0
  74. package/bin/skills/hqq/references/advanced-usage.md +528 -0
  75. package/bin/skills/hqq/references/troubleshooting.md +503 -0
  76. package/bin/skills/hugging-face-cli/SKILL.md +191 -0
  77. package/bin/skills/hugging-face-cli/references/commands.md +954 -0
  78. package/bin/skills/hugging-face-cli/references/examples.md +374 -0
  79. package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
  80. package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
  81. package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
  82. package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
  83. package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
  84. package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
  85. package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
  86. package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
  87. package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
  88. package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
  89. package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
  90. package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
  91. package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
  92. package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
  93. package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
  94. package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
  95. package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
  96. package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
  97. package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
  98. package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
  99. package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
  100. package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
  101. package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
  102. package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
  103. package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
  104. package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
  105. package/bin/skills/hugging-face-jobs/index.html +216 -0
  106. package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
  107. package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
  108. package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
  109. package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
  110. package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
  111. package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
  112. package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
  113. package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
  114. package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
  115. package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
  116. package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
  117. package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
  118. package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
  119. package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
  120. package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
  121. package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
  122. package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
  123. package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
  124. package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
  125. package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
  126. package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
  127. package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
  128. package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
  129. package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
  130. package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
  131. package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
  132. package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
  133. package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
  134. package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
  135. package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
  136. package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
  137. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
  138. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
  139. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
  140. package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
  141. package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
  142. package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
  143. package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
  144. package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
  145. package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
  146. package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
  147. package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
  148. package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
  149. package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
  150. package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
  151. package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
  152. package/bin/skills/instructor/SKILL.md +740 -0
  153. package/bin/skills/instructor/references/examples.md +107 -0
  154. package/bin/skills/instructor/references/providers.md +70 -0
  155. package/bin/skills/instructor/references/validation.md +606 -0
  156. package/bin/skills/knowledge-distillation/SKILL.md +458 -0
  157. package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
  158. package/bin/skills/lambda-labs/SKILL.md +545 -0
  159. package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
  160. package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
  161. package/bin/skills/langchain/SKILL.md +480 -0
  162. package/bin/skills/langchain/references/agents.md +499 -0
  163. package/bin/skills/langchain/references/integration.md +562 -0
  164. package/bin/skills/langchain/references/rag.md +600 -0
  165. package/bin/skills/langsmith/SKILL.md +422 -0
  166. package/bin/skills/langsmith/references/advanced-usage.md +548 -0
  167. package/bin/skills/langsmith/references/troubleshooting.md +537 -0
  168. package/bin/skills/litgpt/SKILL.md +469 -0
  169. package/bin/skills/litgpt/references/custom-models.md +568 -0
  170. package/bin/skills/litgpt/references/distributed-training.md +451 -0
  171. package/bin/skills/litgpt/references/supported-models.md +336 -0
  172. package/bin/skills/litgpt/references/training-recipes.md +619 -0
  173. package/bin/skills/llama-cpp/SKILL.md +258 -0
  174. package/bin/skills/llama-cpp/references/optimization.md +89 -0
  175. package/bin/skills/llama-cpp/references/quantization.md +213 -0
  176. package/bin/skills/llama-cpp/references/server.md +125 -0
  177. package/bin/skills/llama-factory/SKILL.md +80 -0
  178. package/bin/skills/llama-factory/references/_images.md +23 -0
  179. package/bin/skills/llama-factory/references/advanced.md +1055 -0
  180. package/bin/skills/llama-factory/references/getting_started.md +349 -0
  181. package/bin/skills/llama-factory/references/index.md +19 -0
  182. package/bin/skills/llama-factory/references/other.md +31 -0
  183. package/bin/skills/llamaguard/SKILL.md +337 -0
  184. package/bin/skills/llamaindex/SKILL.md +569 -0
  185. package/bin/skills/llamaindex/references/agents.md +83 -0
  186. package/bin/skills/llamaindex/references/data_connectors.md +108 -0
  187. package/bin/skills/llamaindex/references/query_engines.md +406 -0
  188. package/bin/skills/llava/SKILL.md +304 -0
  189. package/bin/skills/llava/references/training.md +197 -0
  190. package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
  191. package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
  192. package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
  193. package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
  194. package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
  195. package/bin/skills/long-context/SKILL.md +536 -0
  196. package/bin/skills/long-context/references/extension_methods.md +468 -0
  197. package/bin/skills/long-context/references/fine_tuning.md +611 -0
  198. package/bin/skills/long-context/references/rope.md +402 -0
  199. package/bin/skills/mamba/SKILL.md +260 -0
  200. package/bin/skills/mamba/references/architecture-details.md +206 -0
  201. package/bin/skills/mamba/references/benchmarks.md +255 -0
  202. package/bin/skills/mamba/references/training-guide.md +388 -0
  203. package/bin/skills/megatron-core/SKILL.md +366 -0
  204. package/bin/skills/megatron-core/references/benchmarks.md +249 -0
  205. package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
  206. package/bin/skills/megatron-core/references/production-examples.md +473 -0
  207. package/bin/skills/megatron-core/references/training-recipes.md +547 -0
  208. package/bin/skills/miles/SKILL.md +315 -0
  209. package/bin/skills/miles/references/api-reference.md +141 -0
  210. package/bin/skills/miles/references/troubleshooting.md +352 -0
  211. package/bin/skills/mlflow/SKILL.md +704 -0
  212. package/bin/skills/mlflow/references/deployment.md +744 -0
  213. package/bin/skills/mlflow/references/model-registry.md +770 -0
  214. package/bin/skills/mlflow/references/tracking.md +680 -0
  215. package/bin/skills/modal/SKILL.md +341 -0
  216. package/bin/skills/modal/references/advanced-usage.md +503 -0
  217. package/bin/skills/modal/references/troubleshooting.md +494 -0
  218. package/bin/skills/model-merging/SKILL.md +539 -0
  219. package/bin/skills/model-merging/references/evaluation.md +462 -0
  220. package/bin/skills/model-merging/references/examples.md +428 -0
  221. package/bin/skills/model-merging/references/methods.md +352 -0
  222. package/bin/skills/model-pruning/SKILL.md +495 -0
  223. package/bin/skills/model-pruning/references/wanda.md +347 -0
  224. package/bin/skills/moe-training/SKILL.md +526 -0
  225. package/bin/skills/moe-training/references/architectures.md +432 -0
  226. package/bin/skills/moe-training/references/inference.md +348 -0
  227. package/bin/skills/moe-training/references/training.md +425 -0
  228. package/bin/skills/nanogpt/SKILL.md +290 -0
  229. package/bin/skills/nanogpt/references/architecture.md +382 -0
  230. package/bin/skills/nanogpt/references/data.md +476 -0
  231. package/bin/skills/nanogpt/references/training.md +564 -0
  232. package/bin/skills/nemo-curator/SKILL.md +383 -0
  233. package/bin/skills/nemo-curator/references/deduplication.md +87 -0
  234. package/bin/skills/nemo-curator/references/filtering.md +102 -0
  235. package/bin/skills/nemo-evaluator/SKILL.md +494 -0
  236. package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
  237. package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
  238. package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
  239. package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
  240. package/bin/skills/nemo-guardrails/SKILL.md +297 -0
  241. package/bin/skills/nnsight/SKILL.md +436 -0
  242. package/bin/skills/nnsight/references/README.md +78 -0
  243. package/bin/skills/nnsight/references/api.md +344 -0
  244. package/bin/skills/nnsight/references/tutorials.md +300 -0
  245. package/bin/skills/openrlhf/SKILL.md +249 -0
  246. package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
  247. package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
  248. package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
  249. package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
  250. package/bin/skills/outlines/SKILL.md +652 -0
  251. package/bin/skills/outlines/references/backends.md +615 -0
  252. package/bin/skills/outlines/references/examples.md +773 -0
  253. package/bin/skills/outlines/references/json_generation.md +652 -0
  254. package/bin/skills/peft/SKILL.md +431 -0
  255. package/bin/skills/peft/references/advanced-usage.md +514 -0
  256. package/bin/skills/peft/references/troubleshooting.md +480 -0
  257. package/bin/skills/phoenix/SKILL.md +475 -0
  258. package/bin/skills/phoenix/references/advanced-usage.md +619 -0
  259. package/bin/skills/phoenix/references/troubleshooting.md +538 -0
  260. package/bin/skills/pinecone/SKILL.md +358 -0
  261. package/bin/skills/pinecone/references/deployment.md +181 -0
  262. package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
  263. package/bin/skills/pytorch-fsdp/references/index.md +7 -0
  264. package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
  265. package/bin/skills/pytorch-lightning/SKILL.md +346 -0
  266. package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
  267. package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
  268. package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
  269. package/bin/skills/pyvene/SKILL.md +473 -0
  270. package/bin/skills/pyvene/references/README.md +73 -0
  271. package/bin/skills/pyvene/references/api.md +383 -0
  272. package/bin/skills/pyvene/references/tutorials.md +376 -0
  273. package/bin/skills/qdrant/SKILL.md +493 -0
  274. package/bin/skills/qdrant/references/advanced-usage.md +648 -0
  275. package/bin/skills/qdrant/references/troubleshooting.md +631 -0
  276. package/bin/skills/ray-data/SKILL.md +326 -0
  277. package/bin/skills/ray-data/references/integration.md +82 -0
  278. package/bin/skills/ray-data/references/transformations.md +83 -0
  279. package/bin/skills/ray-train/SKILL.md +406 -0
  280. package/bin/skills/ray-train/references/multi-node.md +628 -0
  281. package/bin/skills/rwkv/SKILL.md +260 -0
  282. package/bin/skills/rwkv/references/architecture-details.md +344 -0
  283. package/bin/skills/rwkv/references/rwkv7.md +386 -0
  284. package/bin/skills/rwkv/references/state-management.md +369 -0
  285. package/bin/skills/saelens/SKILL.md +386 -0
  286. package/bin/skills/saelens/references/README.md +70 -0
  287. package/bin/skills/saelens/references/api.md +333 -0
  288. package/bin/skills/saelens/references/tutorials.md +318 -0
  289. package/bin/skills/segment-anything/SKILL.md +500 -0
  290. package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
  291. package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
  292. package/bin/skills/sentence-transformers/SKILL.md +255 -0
  293. package/bin/skills/sentence-transformers/references/models.md +123 -0
  294. package/bin/skills/sentencepiece/SKILL.md +235 -0
  295. package/bin/skills/sentencepiece/references/algorithms.md +200 -0
  296. package/bin/skills/sentencepiece/references/training.md +304 -0
  297. package/bin/skills/sglang/SKILL.md +442 -0
  298. package/bin/skills/sglang/references/deployment.md +490 -0
  299. package/bin/skills/sglang/references/radix-attention.md +413 -0
  300. package/bin/skills/sglang/references/structured-generation.md +541 -0
  301. package/bin/skills/simpo/SKILL.md +219 -0
  302. package/bin/skills/simpo/references/datasets.md +478 -0
  303. package/bin/skills/simpo/references/hyperparameters.md +452 -0
  304. package/bin/skills/simpo/references/loss-functions.md +350 -0
  305. package/bin/skills/skypilot/SKILL.md +509 -0
  306. package/bin/skills/skypilot/references/advanced-usage.md +491 -0
  307. package/bin/skills/skypilot/references/troubleshooting.md +570 -0
  308. package/bin/skills/slime/SKILL.md +464 -0
  309. package/bin/skills/slime/references/api-reference.md +392 -0
  310. package/bin/skills/slime/references/troubleshooting.md +386 -0
  311. package/bin/skills/speculative-decoding/SKILL.md +467 -0
  312. package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
  313. package/bin/skills/speculative-decoding/references/medusa.md +350 -0
  314. package/bin/skills/stable-diffusion/SKILL.md +519 -0
  315. package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
  316. package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
  317. package/bin/skills/tensorboard/SKILL.md +629 -0
  318. package/bin/skills/tensorboard/references/integrations.md +638 -0
  319. package/bin/skills/tensorboard/references/profiling.md +545 -0
  320. package/bin/skills/tensorboard/references/visualization.md +620 -0
  321. package/bin/skills/tensorrt-llm/SKILL.md +187 -0
  322. package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
  323. package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
  324. package/bin/skills/tensorrt-llm/references/serving.md +470 -0
  325. package/bin/skills/tinker/SKILL.md +362 -0
  326. package/bin/skills/tinker/references/api-reference.md +168 -0
  327. package/bin/skills/tinker/references/getting-started.md +157 -0
  328. package/bin/skills/tinker/references/loss-functions.md +163 -0
  329. package/bin/skills/tinker/references/models-and-lora.md +139 -0
  330. package/bin/skills/tinker/references/recipes.md +280 -0
  331. package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
  332. package/bin/skills/tinker/references/rendering.md +243 -0
  333. package/bin/skills/tinker/references/supervised-learning.md +232 -0
  334. package/bin/skills/tinker-training-cost/SKILL.md +187 -0
  335. package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
  336. package/bin/skills/torchforge/SKILL.md +433 -0
  337. package/bin/skills/torchforge/references/api-reference.md +327 -0
  338. package/bin/skills/torchforge/references/troubleshooting.md +409 -0
  339. package/bin/skills/torchtitan/SKILL.md +358 -0
  340. package/bin/skills/torchtitan/references/checkpoint.md +181 -0
  341. package/bin/skills/torchtitan/references/custom-models.md +258 -0
  342. package/bin/skills/torchtitan/references/float8.md +133 -0
  343. package/bin/skills/torchtitan/references/fsdp.md +126 -0
  344. package/bin/skills/transformer-lens/SKILL.md +346 -0
  345. package/bin/skills/transformer-lens/references/README.md +54 -0
  346. package/bin/skills/transformer-lens/references/api.md +362 -0
  347. package/bin/skills/transformer-lens/references/tutorials.md +339 -0
  348. package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
  349. package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
  350. package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
  351. package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
  352. package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
  353. package/bin/skills/unsloth/SKILL.md +80 -0
  354. package/bin/skills/unsloth/references/index.md +7 -0
  355. package/bin/skills/unsloth/references/llms-full.md +16799 -0
  356. package/bin/skills/unsloth/references/llms-txt.md +12044 -0
  357. package/bin/skills/unsloth/references/llms.md +82 -0
  358. package/bin/skills/verl/SKILL.md +391 -0
  359. package/bin/skills/verl/references/api-reference.md +301 -0
  360. package/bin/skills/verl/references/troubleshooting.md +391 -0
  361. package/bin/skills/vllm/SKILL.md +364 -0
  362. package/bin/skills/vllm/references/optimization.md +226 -0
  363. package/bin/skills/vllm/references/quantization.md +284 -0
  364. package/bin/skills/vllm/references/server-deployment.md +255 -0
  365. package/bin/skills/vllm/references/troubleshooting.md +447 -0
  366. package/bin/skills/weights-and-biases/SKILL.md +590 -0
  367. package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
  368. package/bin/skills/weights-and-biases/references/integrations.md +700 -0
  369. package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
  370. package/bin/skills/whisper/SKILL.md +317 -0
  371. package/bin/skills/whisper/references/languages.md +189 -0
  372. package/bin/synsc +0 -0
  373. package/package.json +10 -0
@@ -0,0 +1,473 @@
1
+ # Production Examples
2
+
3
+ Real-world production deployments using Megatron-Core and Megatron-LM.
4
+
5
+ ## Meta LLaMA 3
6
+
7
+ ### Overview
8
+ - **Models**: 8B, 70B, 405B parameters
9
+ - **Training Infrastructure**: Two 24,000 H100 GPU clusters
10
+ - **Total Investment**: Massive scale, months of training
11
+ - **Training Data**: 15 trillion tokens for 405B model
12
+ - **Status**: Production deployed (llama.meta.com)
13
+
14
+ ### LLaMA 3.1 405B - Largest Open Model
15
+
16
+ **Architecture**:
17
+ ```yaml
18
+ Parameters: 405 billion
19
+ Layers: 126
20
+ Hidden size: 16384
21
+ Attention heads: 128
22
+ Query groups: 8 (GQA)
23
+ FFN size: 53248
24
+ Vocabulary: 128,256 tokens
25
+ Max context: up to 128K tokens
26
+ Position encoding: RoPE
27
+ Activation: SwiGLU
28
+ Normalization: RMSNorm
29
+ ```
30
+
31
+ **Training Configuration**:
32
+ ```bash
33
+ # 1024 H100 GPUs (128 nodes × 8 GPUs)
34
+ Tensor Parallel (TP): 8 # Within node
35
+ Pipeline Parallel (PP): 8 # Across nodes
36
+ Context Parallel (CP): 2 # For long sequences
37
+ Data Parallel (DP): 8 # Remaining dimension
38
+
39
+ Total GPUs: 8 × 8 × 2 × 8 = 1024
40
+ Effective batch size: 2048
41
+ Micro-batch per GPU: 1
42
+ Sequence length: 4096 tokens
43
+ ```
44
+
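The parallelism arithmetic in the configuration above can be sanity-checked with a few lines of Python. This is a minimal sketch, not Megatron-Core code: the helper names `data_parallel_size` and `grad_accum_steps` are hypothetical, and it assumes the usual conventions that world size equals TP × PP × CP × DP and that the global batch equals micro-batch × DP × gradient-accumulation steps.

```python
def data_parallel_size(world_size: int, tp: int, pp: int, cp: int = 1) -> int:
    """Return the data-parallel size left over once TP, PP, and CP are fixed."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by TP*PP*CP = {model_parallel}"
        )
    return world_size // model_parallel


def grad_accum_steps(global_batch: int, micro_batch: int, dp: int) -> int:
    """Return how many micro-batches each data-parallel rank runs per optimizer step."""
    samples_per_pass = micro_batch * dp
    if global_batch % samples_per_pass != 0:
        raise ValueError(
            f"global batch {global_batch} is not divisible by micro_batch*DP = {samples_per_pass}"
        )
    return global_batch // samples_per_pass


# Values taken from the configuration listed above.
dp = data_parallel_size(world_size=1024, tp=8, pp=8, cp=2)
accum = grad_accum_steps(global_batch=2048, micro_batch=1, dp=dp)
print(f"DP={dp}, gradient accumulation steps={accum}")  # DP=8, gradient accumulation steps=256
```

The same two checks catch most launch-time misconfigurations: a GPU count that does not factor into the chosen parallel sizes, or a global batch that does not divide evenly across data-parallel ranks.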
45
+ **Performance Metrics**:
46
+ - **Sustained throughput**: 400 TFLOPS/GPU
47
+ - **MFU**: ~40% on H100 (400 of ~989 peak dense BF16 TFLOPS)
48
+ - **Uptime**: 95%+ over months
49
+ - **Efficiency improvement**: 3× vs LLaMA 2 training
50
+
51
+ **Training Duration**:
52
+ - 15 trillion tokens total
53
+ - ~54 days on 16,384 H100 GPUs
54
+ - Or roughly 2.4 years on 1,024 H100 GPUs at the same per-GPU throughput (16× fewer GPUs, assuming linear scaling)
55
+
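A rough cross-check of training time is possible from the model size, token count, and sustained throughput. The sketch below assumes the common approximation of 6 × parameters × tokens total training FLOPs, which ignores attention FLOPs, restarts, and evaluation, so it will not reproduce reported wall-clock numbers exactly. The function name `training_days` is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def training_days(params: float, tokens: float, num_gpus: int, tflops_per_gpu: float) -> float:
    """Estimate wall-clock training days with the 6 * N * D FLOP approximation."""
    total_flops = 6.0 * params * tokens
    cluster_flops_per_second = num_gpus * tflops_per_gpu * 1e12
    return total_flops / cluster_flops_per_second / SECONDS_PER_DAY


# 405B parameters, 15T tokens, 400 sustained TFLOPS per GPU (figures from this section).
days_16k = training_days(405e9, 15e12, num_gpus=16_384, tflops_per_gpu=400)
days_1k = training_days(405e9, 15e12, num_gpus=1_024, tflops_per_gpu=400)
print(f"{days_16k:.0f} days on 16,384 GPUs")
print(f"{days_1k:.0f} days on 1,024 GPUs")
```

Under these assumptions the estimate comes out near 64 days on 16,384 GPUs, the same order of magnitude as the ~54 days listed, and it grows linearly as GPUs are removed.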
56
+ **Key Optimizations Used**:
57
+ ```bash
58
+ --use-mcore-models \
59
+ --transformer-impl transformer_engine \
60
+ --sequence-parallel \
61
+ --context-parallel-size 2 \
62
+ --use-distributed-optimizer \
63
+ --overlap-grad-reduce \
64
+ --overlap-param-gather \
65
+ --use-flash-attn \
66
+ --bf16
67
+ ```
68
+
69
+ **Production Serving**:
70
+ - Deployed on llama.meta.com
71
+ - Available via API and download
72
+ - Used in Meta products (Instagram, Facebook, WhatsApp)
73
+
74
+ ### LLaMA 3 70B
75
+
76
+ **Training Configuration**:
77
+ ```bash
78
+ # 64 H100 GPUs (8 nodes × 8 GPUs)
79
+ # TP=4, PP=4, CP=2, DP=2 (4 × 4 × 2 × 2 = 64)
80
+
81
+ torchrun --nproc_per_node=8 --nnodes=8 pretrain_gpt.py \
82
+ --num-layers 80 \
83
+ --hidden-size 8192 \
84
+ --num-attention-heads 64 \
85
+ --group-query-attention --num-query-groups 8 \
86
+ --seq-length 4096 \
87
+ --micro-batch-size 1 \
88
+ --global-batch-size 1024 \
89
+ --tensor-model-parallel-size 4 \
90
+ --pipeline-model-parallel-size 4 \
91
+ --context-parallel-size 2 \
92
+ --bf16 \
93
+ --use-mcore-models
94
+ ```
95
+
96
+ **Memory per GPU**:
97
+ - Model parameters: 140GB / 4 (TP) / 4 (PP) = 8.75GB
98
+ - Optimizer states: ~17.5GB
99
+ - Activations: ~3GB
100
+ - **Total**: ~30GB per H100 (fits in 80GB)
101
+
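The per-GPU parameter figure follows directly from sharding BF16 weights across tensor and pipeline ranks. The sketch below reproduces that arithmetic; the helper `param_shard_gb` is hypothetical, and the optimizer and activation numbers are copied from the list above rather than derived, since they depend on optimizer sharding and recomputation settings not stated here.

```python
def param_shard_gb(num_params: float, bytes_per_param: int, tp: int, pp: int) -> float:
    """Return the size in GB of the parameter shard held by one GPU under TP x PP sharding."""
    return num_params * bytes_per_param / (tp * pp) / 1e9


# 70B parameters in BF16 (2 bytes each), split over TP=4 and PP=4.
params_gb = param_shard_gb(70e9, bytes_per_param=2, tp=4, pp=4)

# Taken as given from the list above (not derived in this sketch).
optimizer_gb = 17.5
activations_gb = 3.0

total_gb = params_gb + optimizer_gb + activations_gb
print(f"parameters: {params_gb:.2f} GB, total: {total_gb:.2f} GB of 80 GB")
```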
102
+ ## NVIDIA Nemotron-4 340B
103
+
104
+ ### Overview
105
+ - **Organization**: NVIDIA
106
+ - **Parameters**: 340 billion
107
+ - **Framework**: NeMo (built on Megatron-Core)
108
+ - **Purpose**: Enterprise AI foundation model
109
+ - **Status**: Commercial deployment
110
+
111
+ **Key Features**:
112
+ - Dense decoder-only Transformer architecture
113
+ - Optimized for enterprise use cases
114
+ - NeMo framework integration
115
+ - Production-ready deployment
116
+
117
+ **Architecture**:
118
+ ```yaml
119
+ Type: Dense decoder-only Transformer
120
+ Total parameters: 340B
121
+ Layers: 96
122
+ Hidden size: 18432
123
+ Attention heads: 96 (8 query groups, GQA)
124
+ Context length: 4096
125
+ ```
126
+
127
+ **Training Infrastructure**:
128
+ - NVIDIA DGX H100 systems
129
+ - Megatron-Core + NeMo
130
+ - Multi-node training
131
+ - Enterprise-grade fault tolerance
132
+
133
+ **Production Features**:
134
+ - NeMo Guardrails integration
135
+ - Enterprise support
136
+ - Customization options
137
+ - On-premise deployment available
138
+
139
+ ## Microsoft & NVIDIA Megatron-Turing NLG 530B
140
+
141
+ ### Overview
142
+ - **Organization**: Microsoft + NVIDIA collaboration
143
+ - **Parameters**: 530 billion (largest dense model when released)
144
+ - **Year**: 2021
145
+ - **Framework**: Megatron-DeepSpeed (Megatron tensor parallelism + DeepSpeed pipeline and data parallelism)
146
+ - **Hardware**: NVIDIA Selene supercomputer, 560 DGX A100 nodes (8 × A100 80GB each)
147
+
148
+ **Architecture**:
149
+ ```yaml
150
+ Parameters: 530 billion
151
+ Layers: 105
152
+ Hidden size: 20480
153
+ Attention heads: 128
154
+ Vocabulary: 51,200 tokens
155
+ Sequence length: 2048
156
+ ```
157
+
158
+ **Training Configuration**:
159
+ ```bash
160
+ # 560 DGX A100 nodes (8 × A100 80GB each)
161
+ Tensor Parallel: 8
162
+ Pipeline Parallel: 35
163
+ Data Parallel: replicas across the cluster
164
+ Model replica: 8 × 35 = 280 GPUs
165
+
166
+ 3D parallelism (Megatron-DeepSpeed):
167
+ - Tensor slicing within each node (Megatron)
168
+ - Pipeline stages across nodes (DeepSpeed)
169
+ - Data-parallel replicas across the cluster (DeepSpeed)
170
+ ```
171
+
172
+ **Innovations**:
173
+ - Combined DeepSpeed pipeline and data parallelism with Megatron tensor parallelism (3D parallelism)
174
+ - Demonstrated training at 500B+ scale
175
+ - Proved viability of extreme parallelism
176
+
177
+ **Performance**:
178
+ - Trained on 270 billion tokens drawn from a 339-billion-token dataset
179
+ - Multiple months of training
180
+ - Achieved state-of-the-art results in 2021
181
+
182
+ ## BigScience BLOOM 176B
183
+
184
+ ### Overview
185
+ - **Organization**: BigScience (1000+ researchers)
186
+ - **Parameters**: 176 billion
187
+ - **Year**: 2022
188
+ - **Framework**: Megatron-DeepSpeed
189
+ - **Hardware**: 384 NVIDIA A100 80GB GPUs
190
+ - **Training Duration**: ~3.5 months (March to July 2022)
191
+
192
+ **Architecture**:
193
+ ```yaml
194
+ Parameters: 176 billion
195
+ Layers: 70
196
+ Hidden size: 14336
197
+ Attention heads: 112
198
+ Vocabulary: 250,680 tokens (multilingual)
199
+ Sequence length: 2048
200
+ Languages: 46 natural languages + 13 programming languages
201
+ ```
202
+
203
+ **Training Configuration**:
204
+ ```bash
205
+ # 384 A100 80GB GPUs on Jean Zay supercomputer
206
+ Tensor Parallel: 4
207
+ Pipeline Parallel: 12
208
+ Data Parallel: 8
209
+ Total: 4 × 12 × 8 = 384
210
+
211
+ Global batch size: 2048
212
+ Micro-batch size: 4
213
+ Learning rate: 6e-5
214
+ Optimizer: Adam (β1=0.9, β2=0.95)
215
+ ```
216
+
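The arithmetic behind a layout like this can be checked with a small helper. This is a minimal sketch, not part of any framework: `parallel_layout` is a hypothetical name, and it assumes the usual relationship in which the global batch equals micro-batch size × data-parallel size × gradient-accumulation steps. The numbers are the ones listed in the configuration above.

```python
def parallel_layout(tp, pp, dp, global_batch, micro_batch):
    """Return (world_size, gradient_accumulation_steps) for a 3D-parallel job.

    Hypothetical helper for sanity-checking a configuration by hand.
    """
    world_size = tp * pp * dp
    # Samples consumed by all data-parallel replicas in one micro-step.
    samples_per_micro_step = micro_batch * dp
    if global_batch % samples_per_micro_step != 0:
        raise ValueError("global batch must be divisible by micro_batch * dp")
    return world_size, global_batch // samples_per_micro_step


world_size, accum_steps = parallel_layout(
    tp=4, pp=12, dp=8, global_batch=2048, micro_batch=4
)
print(world_size, accum_steps)
```

With the values above, the helper reports 384 GPUs and 64 accumulation steps per optimizer update.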
217
+ **Training Data**:
218
+ - 366 billion tokens (1.6TB)
219
+ - ROOTS corpus (custom multilingual dataset)
220
+ - 46 natural languages
221
+ - 13 programming languages
222
+
223
+ **Key Achievements**:
224
+ - Largest multilingual open-source model at release
225
+ - Trained on public supercomputer (Jean Zay)
226
+ - Fully documented training process
227
+ - Open-source model and training code
228
+
229
+ **Public Impact**:
230
+ - Downloaded 100,000+ times
231
+ - Used in hundreds of research papers
232
+ - Enabled multilingual AI research
233
+ - Demonstrated open science at scale
234
+
235
+ ## DeepSeek-V3
236
+
237
+ ### Overview
238
+ - **Organization**: DeepSeek
239
+ - **Parameters**: 671 billion total, 37B active per token
240
+ - **Type**: Mixture of Experts (MoE)
241
+ - **Year**: 2024 (released December 2024)
242
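+ - **Framework**: Megatron-Core reference recipe (DeepSeek's own training run used its in-house HAI-LLM framework on 2,048 H800 GPUs)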
243
+
244
+ **Architecture**:
245
+ ```yaml
246
+ Type: Mixture of Experts
247
+ Total parameters: 671B
248
+ Active parameters per token: 37B
249
+ Layers: 61
250
+ Hidden size: 7168
251
+ Attention heads: 128 (Multi-head Latent Attention)
251
+ Query groups: 16
252
+ Experts: 256 routed + 1 shared per MoE layer
253
+ Router top-k: 8
254
+ Dense FFN size: 18432 (first 3 layers; each expert FFN is 2048)
256
+ ```
257
+
258
+ **Training Configuration**:
259
+ ```text
260
+ # 1024 H100 GPUs
261
+ Tensor Parallel (TP): 2
262
+ Pipeline Parallel (PP): 16
263
+ Expert Parallel (EP): 64
264
+ Context Parallel (CP): 1
265
+
266
+ GPUs per pipeline stage: 1024 / 16 = 64
267
+ # Attention layers: TP 2 × DP 32 = 64; expert layers: EP 64 over the same 64 GPUs
268
+
269
+ Global batch size: 4096
270
+ Sequence length: 4096
271
+ Training tokens: 14.8 trillion
272
+ ```
273
+
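Expert parallelism does not multiply the GPU count the way TP, PP, and DP do. The expert-parallel group reuses the GPUs inside each pipeline stage. The sketch below illustrates that accounting under stated assumptions: `moe_layout` is a hypothetical helper, expert tensor parallelism is taken to be 1, and the grouping is a simplification of how Megatron-Core folds its process groups.

```python
def moe_layout(world_size, tp, pp, ep, cp=1, expert_tp=1):
    """Sanity-check an MoE parallel layout (hypothetical helper).

    Assumes attention layers use tp * cp * dp ranks per pipeline stage and
    expert layers re-partition those same ranks as expert_tp * ep * expert_dp.
    """
    if world_size % (tp * pp * cp) != 0:
        raise ValueError("world_size must be divisible by tp * pp * cp")
    dp = world_size // (tp * pp * cp)
    gpus_per_stage = world_size // pp
    if gpus_per_stage % (expert_tp * ep) != 0:
        raise ValueError("expert groups do not fit in one pipeline stage")
    expert_dp = gpus_per_stage // (expert_tp * ep)
    return {"dp": dp, "gpus_per_stage": gpus_per_stage, "expert_dp": expert_dp}


layout = moe_layout(world_size=1024, tp=2, pp=16, ep=64)
print(layout)
```

For the 1024-GPU configuration above this gives 64 GPUs per pipeline stage, a data-parallel size of 32 for the attention layers, and a single expert-parallel group of 64 per stage.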
274
+ **Innovations**:
275
+ - Multi-head Latent Attention (MLA) to shrink the KV cache
276
+ - Shared experts + routed experts
277
+ - Ultra-large expert count (256)
278
+ - Auxiliary-loss-free load balancing
279
+
280
+ **Performance**:
281
+ - Competitive with GPT-4
282
+ - 37B active params rivals 70B+ dense models
283
+ - Efficient inference (only 37B active)
284
+
285
+ ## OpenAI GPT-3 175B (2020)
286
+
287
+ ### Overview
288
+ - **Organization**: OpenAI
289
+ - **Parameters**: 175 billion
290
+ - **Year**: 2020
291
+ - **Framework**: Megatron-inspired custom implementation
292
+ - **Hardware**: Thousands of NVIDIA V100 GPUs
293
+
294
+ **Architecture**:
295
+ ```yaml
296
+ Parameters: 175 billion
297
+ Layers: 96
298
+ Hidden size: 12288
299
+ Attention heads: 96
300
+ FFN size: 49152
301
+ Vocabulary: 50,257 tokens (GPT-2 BPE)
302
+ Sequence length: 2048
303
+ Context window: 2048 tokens
304
+ ```
305
+
306
+ **Training Configuration**:
307
+ ```text
308
+ # Estimated configuration
309
+ Tensor Parallel: 4-8
310
+ Pipeline Parallel: 8-16
311
+ Data Parallel: Remaining GPUs
312
+
313
+ Global batch size: 1536
314
+ Learning rate: 6e-5
315
+ Training tokens: 300 billion
316
+ ```
317
+
318
+ **Training Compute**:
319
+ - 3.14 × 10^23 FLOPs
320
+ - Equivalent to ~355 GPU-years on V100
321
+ - Estimated cost: $4-12 million
322
+
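The compute figure above is consistent with the common 6 × parameters × tokens approximation for dense transformers. The sketch below reproduces it and converts it to GPU-years. The sustained rate of 28 TFLOP/s per V100 is an assumption chosen for illustration, not a measured value, and the function names are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def training_flops(params, tokens):
    """Approximate training FLOPs as 6 * N * D (forward + backward)."""
    return 6 * params * tokens


def gpu_years(total_flops, sustained_flops_per_gpu):
    """Convert a FLOP budget into GPU-years at a given sustained rate."""
    return total_flops / sustained_flops_per_gpu / SECONDS_PER_YEAR


flops = training_flops(params=175e9, tokens=300e9)
# Assumed sustained throughput: 28 TFLOP/s per V100 (illustrative only).
years = gpu_years(3.14e23, sustained_flops_per_gpu=28e12)
print(f"{flops:.2e} FLOPs, {years:.0f} GPU-years")
```

The approximation lands at about 3.15 × 10^23 FLOPs, close to the 3.14 × 10^23 quoted above, and the assumed sustained rate yields a figure in the neighborhood of 355 GPU-years.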
323
+ **Impact**:
324
+ - Launched modern era of large language models
325
+ - Demonstrated few-shot learning
326
+ - Foundation for ChatGPT
327
+
328
+ ## Stability AI StableLM
329
+
330
+ ### Overview
331
+ - **Organization**: Stability AI
332
+ - **Framework**: GPT-NeoX (Megatron + DeepSpeed)
333
+ - **Hardware**: Training on supercomputers
334
+ - **Status**: Open-source
335
+
336
+ **Models**:
337
+ - StableLM-Base-Alpha: 3B, 7B
338
+ - StableLM-Tuned-Alpha: Fine-tuned versions
339
+ - StableCode: Code-specialized
340
+
341
+ **Training Configuration**:
342
+ ```yaml
343
+ Framework: GPT-NeoX
344
+ Parallelism: Megatron TP/PP + DeepSpeed ZeRO
345
+ GPUs: A100 clusters
346
+ Training data: 1.5 trillion tokens (new dataset built on The Pile)
347
+ ```
348
+
349
+ **Key Features**:
350
+ - Openly released weights (base models under CC BY-SA 4.0)
351
+ - GPT-NeoX framework
352
+ - Trained on an experimental dataset built on The Pile
353
+ - Multiple model sizes
354
+
355
+ ## Common Production Patterns
356
+
357
+ ### Fault Tolerance
358
+
359
+ **Checkpoint Strategy**:
360
+ ```bash
361
+ --save-interval 500 # Save every 500 iterations
362
+ --save /checkpoints/model_name # Checkpoint directory
363
+ --load /checkpoints/model_name # Auto-resume from latest
364
+ ```
365
+
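Auto-resume relies on knowing which iteration was saved last. The sketch below shows one way to look that up from a script. It assumes the checkpoint directory contains a tracker file named `latest_checkpointed_iteration.txt` and per-iteration folders named `iter_0000500`, which is the layout Megatron-LM is understood to write; treat both names as assumptions and verify them against your own checkpoint directory. The helper names are hypothetical.

```python
from pathlib import Path

# Assumed tracker file name; check your checkpoint directory.
TRACKER_NAME = "latest_checkpointed_iteration.txt"


def latest_iteration(checkpoint_dir):
    """Return the last saved iteration, or None if no tracker file exists."""
    tracker = Path(checkpoint_dir) / TRACKER_NAME
    if not tracker.is_file():
        return None
    text = tracker.read_text().strip()
    # Non-numeric content is treated as "unknown" in this sketch.
    return int(text) if text.isdigit() else None


def iteration_dir(checkpoint_dir, iteration):
    """Build the assumed per-iteration folder path, e.g. iter_0000500."""
    return Path(checkpoint_dir) / f"iter_{iteration:07d}"


print(latest_iteration("/checkpoints/model_name"))
```

A monitoring job could call `latest_iteration` periodically and alert when the value stops advancing for longer than the expected save interval.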
366
+ **Monitoring**:
367
+ ```text
368
+ # Check in progress.txt
369
+ Job throughput: 45.2 TFLOPs/GPU
370
+ Cumulative throughput: 44.8 TFLOPs/GPU
371
+ Memory usage: 68.2 GB / 80 GB
372
+ Loss: 2.143
373
+ ```
374
+
375
+ ### Data Pipeline
376
+
377
+ **Preprocessing**:
378
+ ```bash
379
+ python tools/preprocess_data.py \
380
+ --input data.jsonl \
381
+ --output-prefix /data/processed \
382
+ --vocab-file vocab.json \
383
+ --merge-file merges.txt \
384
+ --tokenizer-type GPT2BPETokenizer \
385
+ --append-eod \
386
+ --workers 64
387
+ ```
388
+
389
+ **Training with Preprocessed Data**:
390
+ ```bash
391
+ --data-path /data/processed_text_document \
392
+ --split 969,30,1 # Train/valid/test split
393
+ ```
394
+
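The `--split` value is a list of relative weights rather than percentages. A minimal sketch of the normalization, using a hypothetical helper name and assuming the weights are simply divided by their sum:

```python
def split_fractions(split):
    """Turn a comma-separated weight string into train/valid/test fractions."""
    weights = [float(w) for w in split.split(",")]
    total = sum(weights)
    if total <= 0:
        raise ValueError("split weights must sum to a positive number")
    return [w / total for w in weights]


train, valid, test = split_fractions("969,30,1")
print(f"train={train:.3f} valid={valid:.3f} test={test:.3f}")
```

So `969,30,1` corresponds to roughly 96.9% training, 3% validation, and 0.1% test data.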
395
+ ### Monitoring & Logging
396
+
397
+ **Key Metrics to Track**:
398
+ ```text
399
+ # Training metrics
400
+ - Loss (should steadily decrease)
401
+ - Learning rate (follows schedule)
402
+ - Gradient norm (watch for spikes)
403
+ - Throughput (TFlops/GPU)
404
+ - MFU percentage
405
+
406
+ # System metrics
407
+ - GPU utilization (>90%)
408
+ - Memory usage (<95% of capacity)
409
+ - Network bandwidth (saturated for TP)
410
+ - Data loading time (should be minimal)
411
+ ```
412
+
413
+ **Production Monitoring Tools**:
414
+ - TensorBoard for loss curves
415
+ - Weights & Biases for experiment tracking
416
+ - Prometheus + Grafana for system metrics
417
+ - Custom scripts for MFU calculation
418
+
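A custom MFU script can be very small. The sketch below is one possible version, not a standard tool: the function names are hypothetical, the 6 × parameters × tokens rule is an approximation for dense models, and the 312 TFLOP/s peak is the dense BF16 figure from the NVIDIA A100 datasheet (substitute the peak for your own hardware and precision).

```python
A100_BF16_PEAK_TFLOPS = 312.0  # dense BF16 peak; adjust for other GPUs


def mfu(achieved_tflops_per_gpu, peak_tflops_per_gpu):
    """Model FLOPs utilization as a fraction of hardware peak."""
    return achieved_tflops_per_gpu / peak_tflops_per_gpu


def achieved_tflops_per_gpu(params, tokens_per_second, num_gpus):
    """Estimate per-GPU throughput from token rate using the 6*N rule."""
    return 6 * params * tokens_per_second / num_gpus / 1e12


# 45.2 TFLOPs/GPU is the sample value from the monitoring log above.
print(f"MFU: {mfu(45.2, A100_BF16_PEAK_TFLOPS):.1%}")
```

Tracking this ratio over time makes regressions visible even when raw throughput numbers shift because of a changed batch size or sequence length.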
419
+ ### Multi-Datacenter Training
420
+
421
+ **Challenges**:
422
+ - Higher latency between datacenters
423
+ - Network bandwidth limitations
424
+ - Fault isolation
425
+
426
+ **Solutions**:
427
+ ```bash
428
+ # Keep TP within datacenter
429
+ --tensor-model-parallel-size 8 # Single node only
430
+
431
+ # Use PP across datacenters
432
+ --pipeline-model-parallel-size 16 # Across sites
433
+
434
+ # Data parallel across everything
435
+ # Automatic from remaining GPUs
436
+ ```
437
+
438
+ ## Lessons from Production
439
+
440
+ 1. **Fault Tolerance is Critical**
441
+ - Save checkpoints frequently (every 500-1000 steps)
442
+ - Test checkpoint recovery regularly
443
+ - Monitor for GPU failures
444
+
445
+ 2. **Data Quality Matters More Than Quantity**
446
+ - LLaMA 3: Carefully curated 15T tokens
447
+ - Better than naive web scraping
448
+ - Investment in data preprocessing pays off
449
+
450
+ 3. **Parallelism Strategy Evolves with Scale**
451
+ - <70B: TP + DP sufficient
452
+ - 70-175B: Add PP
453
+ - 175B+: 3D or 4D parallelism required
454
+ - MoE: Add EP dimension
455
+
456
+ 4. **Hardware Matters**
457
+ - H100 vs A100: roughly 2× training speedup from the newer hardware
458
+ - NVLink topology affects TP efficiency
459
+ - InfiniBand essential for multi-node
460
+
461
+ 5. **Monitoring is Essential**
462
+ - Track MFU to catch performance issues
463
+ - Monitor loss for training health
464
+ - Watch memory usage to avoid OOM
465
+ - Log everything for debugging
466
+
467
+ ## References
468
+
469
+ - Meta LLaMA 3 technical report
470
+ - NVIDIA Nemotron blog posts
471
+ - Microsoft Megatron-Turing NLG paper
472
+ - BigScience BLOOM documentation
473
+ - DeepSeek-V3 technical report