@synsci/cli-darwin-arm64 1.1.49

Files changed (373)
  1. package/bin/skills/accelerate/SKILL.md +332 -0
  2. package/bin/skills/accelerate/references/custom-plugins.md +453 -0
  3. package/bin/skills/accelerate/references/megatron-integration.md +489 -0
  4. package/bin/skills/accelerate/references/performance.md +525 -0
  5. package/bin/skills/audiocraft/SKILL.md +564 -0
  6. package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
  7. package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
  8. package/bin/skills/autogpt/SKILL.md +403 -0
  9. package/bin/skills/autogpt/references/advanced-usage.md +535 -0
  10. package/bin/skills/autogpt/references/troubleshooting.md +420 -0
  11. package/bin/skills/awq/SKILL.md +310 -0
  12. package/bin/skills/awq/references/advanced-usage.md +324 -0
  13. package/bin/skills/awq/references/troubleshooting.md +344 -0
  14. package/bin/skills/axolotl/SKILL.md +158 -0
  15. package/bin/skills/axolotl/references/api.md +5548 -0
  16. package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
  17. package/bin/skills/axolotl/references/index.md +15 -0
  18. package/bin/skills/axolotl/references/other.md +3563 -0
  19. package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
  20. package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
  21. package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
  22. package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
  23. package/bin/skills/bitsandbytes/SKILL.md +411 -0
  24. package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
  25. package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
  26. package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
  27. package/bin/skills/blip-2/SKILL.md +564 -0
  28. package/bin/skills/blip-2/references/advanced-usage.md +680 -0
  29. package/bin/skills/blip-2/references/troubleshooting.md +526 -0
  30. package/bin/skills/chroma/SKILL.md +406 -0
  31. package/bin/skills/chroma/references/integration.md +38 -0
  32. package/bin/skills/clip/SKILL.md +253 -0
  33. package/bin/skills/clip/references/applications.md +207 -0
  34. package/bin/skills/constitutional-ai/SKILL.md +290 -0
  35. package/bin/skills/crewai/SKILL.md +498 -0
  36. package/bin/skills/crewai/references/flows.md +438 -0
  37. package/bin/skills/crewai/references/tools.md +429 -0
  38. package/bin/skills/crewai/references/troubleshooting.md +480 -0
  39. package/bin/skills/deepspeed/SKILL.md +141 -0
  40. package/bin/skills/deepspeed/references/08.md +17 -0
  41. package/bin/skills/deepspeed/references/09.md +173 -0
  42. package/bin/skills/deepspeed/references/2020.md +378 -0
  43. package/bin/skills/deepspeed/references/2023.md +279 -0
  44. package/bin/skills/deepspeed/references/assets.md +179 -0
  45. package/bin/skills/deepspeed/references/index.md +35 -0
  46. package/bin/skills/deepspeed/references/mii.md +118 -0
  47. package/bin/skills/deepspeed/references/other.md +1191 -0
  48. package/bin/skills/deepspeed/references/tutorials.md +6554 -0
  49. package/bin/skills/dspy/SKILL.md +590 -0
  50. package/bin/skills/dspy/references/examples.md +663 -0
  51. package/bin/skills/dspy/references/modules.md +475 -0
  52. package/bin/skills/dspy/references/optimizers.md +566 -0
  53. package/bin/skills/faiss/SKILL.md +221 -0
  54. package/bin/skills/faiss/references/index_types.md +280 -0
  55. package/bin/skills/flash-attention/SKILL.md +367 -0
  56. package/bin/skills/flash-attention/references/benchmarks.md +215 -0
  57. package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
  58. package/bin/skills/gguf/SKILL.md +427 -0
  59. package/bin/skills/gguf/references/advanced-usage.md +504 -0
  60. package/bin/skills/gguf/references/troubleshooting.md +442 -0
  61. package/bin/skills/gptq/SKILL.md +450 -0
  62. package/bin/skills/gptq/references/calibration.md +337 -0
  63. package/bin/skills/gptq/references/integration.md +129 -0
  64. package/bin/skills/gptq/references/troubleshooting.md +95 -0
  65. package/bin/skills/grpo-rl-training/README.md +97 -0
  66. package/bin/skills/grpo-rl-training/SKILL.md +572 -0
  67. package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
  68. package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
  69. package/bin/skills/guidance/SKILL.md +572 -0
  70. package/bin/skills/guidance/references/backends.md +554 -0
  71. package/bin/skills/guidance/references/constraints.md +674 -0
  72. package/bin/skills/guidance/references/examples.md +767 -0
  73. package/bin/skills/hqq/SKILL.md +445 -0
  74. package/bin/skills/hqq/references/advanced-usage.md +528 -0
  75. package/bin/skills/hqq/references/troubleshooting.md +503 -0
  76. package/bin/skills/hugging-face-cli/SKILL.md +191 -0
  77. package/bin/skills/hugging-face-cli/references/commands.md +954 -0
  78. package/bin/skills/hugging-face-cli/references/examples.md +374 -0
  79. package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
  80. package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
  81. package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
  82. package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
  83. package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
  84. package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
  85. package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
  86. package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
  87. package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
  88. package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
  89. package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
  90. package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
  91. package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
  92. package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
  93. package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
  94. package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
  95. package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
  96. package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
  97. package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
  98. package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
  99. package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
  100. package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
  101. package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
  102. package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
  103. package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
  104. package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
  105. package/bin/skills/hugging-face-jobs/index.html +216 -0
  106. package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
  107. package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
  108. package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
  109. package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
  110. package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
  111. package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
  112. package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
  113. package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
  114. package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
  115. package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
  116. package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
  117. package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
  118. package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
  119. package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
  120. package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
  121. package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
  122. package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
  123. package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
  124. package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
  125. package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
  126. package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
  127. package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
  128. package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
  129. package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
  130. package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
  131. package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
  132. package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
  133. package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
  134. package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
  135. package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
  136. package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
  137. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
  138. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
  139. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
  140. package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
  141. package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
  142. package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
  143. package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
  144. package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
  145. package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
  146. package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
  147. package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
  148. package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
  149. package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
  150. package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
  151. package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
  152. package/bin/skills/instructor/SKILL.md +740 -0
  153. package/bin/skills/instructor/references/examples.md +107 -0
  154. package/bin/skills/instructor/references/providers.md +70 -0
  155. package/bin/skills/instructor/references/validation.md +606 -0
  156. package/bin/skills/knowledge-distillation/SKILL.md +458 -0
  157. package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
  158. package/bin/skills/lambda-labs/SKILL.md +545 -0
  159. package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
  160. package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
  161. package/bin/skills/langchain/SKILL.md +480 -0
  162. package/bin/skills/langchain/references/agents.md +499 -0
  163. package/bin/skills/langchain/references/integration.md +562 -0
  164. package/bin/skills/langchain/references/rag.md +600 -0
  165. package/bin/skills/langsmith/SKILL.md +422 -0
  166. package/bin/skills/langsmith/references/advanced-usage.md +548 -0
  167. package/bin/skills/langsmith/references/troubleshooting.md +537 -0
  168. package/bin/skills/litgpt/SKILL.md +469 -0
  169. package/bin/skills/litgpt/references/custom-models.md +568 -0
  170. package/bin/skills/litgpt/references/distributed-training.md +451 -0
  171. package/bin/skills/litgpt/references/supported-models.md +336 -0
  172. package/bin/skills/litgpt/references/training-recipes.md +619 -0
  173. package/bin/skills/llama-cpp/SKILL.md +258 -0
  174. package/bin/skills/llama-cpp/references/optimization.md +89 -0
  175. package/bin/skills/llama-cpp/references/quantization.md +213 -0
  176. package/bin/skills/llama-cpp/references/server.md +125 -0
  177. package/bin/skills/llama-factory/SKILL.md +80 -0
  178. package/bin/skills/llama-factory/references/_images.md +23 -0
  179. package/bin/skills/llama-factory/references/advanced.md +1055 -0
  180. package/bin/skills/llama-factory/references/getting_started.md +349 -0
  181. package/bin/skills/llama-factory/references/index.md +19 -0
  182. package/bin/skills/llama-factory/references/other.md +31 -0
  183. package/bin/skills/llamaguard/SKILL.md +337 -0
  184. package/bin/skills/llamaindex/SKILL.md +569 -0
  185. package/bin/skills/llamaindex/references/agents.md +83 -0
  186. package/bin/skills/llamaindex/references/data_connectors.md +108 -0
  187. package/bin/skills/llamaindex/references/query_engines.md +406 -0
  188. package/bin/skills/llava/SKILL.md +304 -0
  189. package/bin/skills/llava/references/training.md +197 -0
  190. package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
  191. package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
  192. package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
  193. package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
  194. package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
  195. package/bin/skills/long-context/SKILL.md +536 -0
  196. package/bin/skills/long-context/references/extension_methods.md +468 -0
  197. package/bin/skills/long-context/references/fine_tuning.md +611 -0
  198. package/bin/skills/long-context/references/rope.md +402 -0
  199. package/bin/skills/mamba/SKILL.md +260 -0
  200. package/bin/skills/mamba/references/architecture-details.md +206 -0
  201. package/bin/skills/mamba/references/benchmarks.md +255 -0
  202. package/bin/skills/mamba/references/training-guide.md +388 -0
  203. package/bin/skills/megatron-core/SKILL.md +366 -0
  204. package/bin/skills/megatron-core/references/benchmarks.md +249 -0
  205. package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
  206. package/bin/skills/megatron-core/references/production-examples.md +473 -0
  207. package/bin/skills/megatron-core/references/training-recipes.md +547 -0
  208. package/bin/skills/miles/SKILL.md +315 -0
  209. package/bin/skills/miles/references/api-reference.md +141 -0
  210. package/bin/skills/miles/references/troubleshooting.md +352 -0
  211. package/bin/skills/mlflow/SKILL.md +704 -0
  212. package/bin/skills/mlflow/references/deployment.md +744 -0
  213. package/bin/skills/mlflow/references/model-registry.md +770 -0
  214. package/bin/skills/mlflow/references/tracking.md +680 -0
  215. package/bin/skills/modal/SKILL.md +341 -0
  216. package/bin/skills/modal/references/advanced-usage.md +503 -0
  217. package/bin/skills/modal/references/troubleshooting.md +494 -0
  218. package/bin/skills/model-merging/SKILL.md +539 -0
  219. package/bin/skills/model-merging/references/evaluation.md +462 -0
  220. package/bin/skills/model-merging/references/examples.md +428 -0
  221. package/bin/skills/model-merging/references/methods.md +352 -0
  222. package/bin/skills/model-pruning/SKILL.md +495 -0
  223. package/bin/skills/model-pruning/references/wanda.md +347 -0
  224. package/bin/skills/moe-training/SKILL.md +526 -0
  225. package/bin/skills/moe-training/references/architectures.md +432 -0
  226. package/bin/skills/moe-training/references/inference.md +348 -0
  227. package/bin/skills/moe-training/references/training.md +425 -0
  228. package/bin/skills/nanogpt/SKILL.md +290 -0
  229. package/bin/skills/nanogpt/references/architecture.md +382 -0
  230. package/bin/skills/nanogpt/references/data.md +476 -0
  231. package/bin/skills/nanogpt/references/training.md +564 -0
  232. package/bin/skills/nemo-curator/SKILL.md +383 -0
  233. package/bin/skills/nemo-curator/references/deduplication.md +87 -0
  234. package/bin/skills/nemo-curator/references/filtering.md +102 -0
  235. package/bin/skills/nemo-evaluator/SKILL.md +494 -0
  236. package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
  237. package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
  238. package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
  239. package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
  240. package/bin/skills/nemo-guardrails/SKILL.md +297 -0
  241. package/bin/skills/nnsight/SKILL.md +436 -0
  242. package/bin/skills/nnsight/references/README.md +78 -0
  243. package/bin/skills/nnsight/references/api.md +344 -0
  244. package/bin/skills/nnsight/references/tutorials.md +300 -0
  245. package/bin/skills/openrlhf/SKILL.md +249 -0
  246. package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
  247. package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
  248. package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
  249. package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
  250. package/bin/skills/outlines/SKILL.md +652 -0
  251. package/bin/skills/outlines/references/backends.md +615 -0
  252. package/bin/skills/outlines/references/examples.md +773 -0
  253. package/bin/skills/outlines/references/json_generation.md +652 -0
  254. package/bin/skills/peft/SKILL.md +431 -0
  255. package/bin/skills/peft/references/advanced-usage.md +514 -0
  256. package/bin/skills/peft/references/troubleshooting.md +480 -0
  257. package/bin/skills/phoenix/SKILL.md +475 -0
  258. package/bin/skills/phoenix/references/advanced-usage.md +619 -0
  259. package/bin/skills/phoenix/references/troubleshooting.md +538 -0
  260. package/bin/skills/pinecone/SKILL.md +358 -0
  261. package/bin/skills/pinecone/references/deployment.md +181 -0
  262. package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
  263. package/bin/skills/pytorch-fsdp/references/index.md +7 -0
  264. package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
  265. package/bin/skills/pytorch-lightning/SKILL.md +346 -0
  266. package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
  267. package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
  268. package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
  269. package/bin/skills/pyvene/SKILL.md +473 -0
  270. package/bin/skills/pyvene/references/README.md +73 -0
  271. package/bin/skills/pyvene/references/api.md +383 -0
  272. package/bin/skills/pyvene/references/tutorials.md +376 -0
  273. package/bin/skills/qdrant/SKILL.md +493 -0
  274. package/bin/skills/qdrant/references/advanced-usage.md +648 -0
  275. package/bin/skills/qdrant/references/troubleshooting.md +631 -0
  276. package/bin/skills/ray-data/SKILL.md +326 -0
  277. package/bin/skills/ray-data/references/integration.md +82 -0
  278. package/bin/skills/ray-data/references/transformations.md +83 -0
  279. package/bin/skills/ray-train/SKILL.md +406 -0
  280. package/bin/skills/ray-train/references/multi-node.md +628 -0
  281. package/bin/skills/rwkv/SKILL.md +260 -0
  282. package/bin/skills/rwkv/references/architecture-details.md +344 -0
  283. package/bin/skills/rwkv/references/rwkv7.md +386 -0
  284. package/bin/skills/rwkv/references/state-management.md +369 -0
  285. package/bin/skills/saelens/SKILL.md +386 -0
  286. package/bin/skills/saelens/references/README.md +70 -0
  287. package/bin/skills/saelens/references/api.md +333 -0
  288. package/bin/skills/saelens/references/tutorials.md +318 -0
  289. package/bin/skills/segment-anything/SKILL.md +500 -0
  290. package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
  291. package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
  292. package/bin/skills/sentence-transformers/SKILL.md +255 -0
  293. package/bin/skills/sentence-transformers/references/models.md +123 -0
  294. package/bin/skills/sentencepiece/SKILL.md +235 -0
  295. package/bin/skills/sentencepiece/references/algorithms.md +200 -0
  296. package/bin/skills/sentencepiece/references/training.md +304 -0
  297. package/bin/skills/sglang/SKILL.md +442 -0
  298. package/bin/skills/sglang/references/deployment.md +490 -0
  299. package/bin/skills/sglang/references/radix-attention.md +413 -0
  300. package/bin/skills/sglang/references/structured-generation.md +541 -0
  301. package/bin/skills/simpo/SKILL.md +219 -0
  302. package/bin/skills/simpo/references/datasets.md +478 -0
  303. package/bin/skills/simpo/references/hyperparameters.md +452 -0
  304. package/bin/skills/simpo/references/loss-functions.md +350 -0
  305. package/bin/skills/skypilot/SKILL.md +509 -0
  306. package/bin/skills/skypilot/references/advanced-usage.md +491 -0
  307. package/bin/skills/skypilot/references/troubleshooting.md +570 -0
  308. package/bin/skills/slime/SKILL.md +464 -0
  309. package/bin/skills/slime/references/api-reference.md +392 -0
  310. package/bin/skills/slime/references/troubleshooting.md +386 -0
  311. package/bin/skills/speculative-decoding/SKILL.md +467 -0
  312. package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
  313. package/bin/skills/speculative-decoding/references/medusa.md +350 -0
  314. package/bin/skills/stable-diffusion/SKILL.md +519 -0
  315. package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
  316. package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
  317. package/bin/skills/tensorboard/SKILL.md +629 -0
  318. package/bin/skills/tensorboard/references/integrations.md +638 -0
  319. package/bin/skills/tensorboard/references/profiling.md +545 -0
  320. package/bin/skills/tensorboard/references/visualization.md +620 -0
  321. package/bin/skills/tensorrt-llm/SKILL.md +187 -0
  322. package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
  323. package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
  324. package/bin/skills/tensorrt-llm/references/serving.md +470 -0
  325. package/bin/skills/tinker/SKILL.md +362 -0
  326. package/bin/skills/tinker/references/api-reference.md +168 -0
  327. package/bin/skills/tinker/references/getting-started.md +157 -0
  328. package/bin/skills/tinker/references/loss-functions.md +163 -0
  329. package/bin/skills/tinker/references/models-and-lora.md +139 -0
  330. package/bin/skills/tinker/references/recipes.md +280 -0
  331. package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
  332. package/bin/skills/tinker/references/rendering.md +243 -0
  333. package/bin/skills/tinker/references/supervised-learning.md +232 -0
  334. package/bin/skills/tinker-training-cost/SKILL.md +187 -0
  335. package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
  336. package/bin/skills/torchforge/SKILL.md +433 -0
  337. package/bin/skills/torchforge/references/api-reference.md +327 -0
  338. package/bin/skills/torchforge/references/troubleshooting.md +409 -0
  339. package/bin/skills/torchtitan/SKILL.md +358 -0
  340. package/bin/skills/torchtitan/references/checkpoint.md +181 -0
  341. package/bin/skills/torchtitan/references/custom-models.md +258 -0
  342. package/bin/skills/torchtitan/references/float8.md +133 -0
  343. package/bin/skills/torchtitan/references/fsdp.md +126 -0
  344. package/bin/skills/transformer-lens/SKILL.md +346 -0
  345. package/bin/skills/transformer-lens/references/README.md +54 -0
  346. package/bin/skills/transformer-lens/references/api.md +362 -0
  347. package/bin/skills/transformer-lens/references/tutorials.md +339 -0
  348. package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
  349. package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
  350. package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
  351. package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
  352. package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
  353. package/bin/skills/unsloth/SKILL.md +80 -0
  354. package/bin/skills/unsloth/references/index.md +7 -0
  355. package/bin/skills/unsloth/references/llms-full.md +16799 -0
  356. package/bin/skills/unsloth/references/llms-txt.md +12044 -0
  357. package/bin/skills/unsloth/references/llms.md +82 -0
  358. package/bin/skills/verl/SKILL.md +391 -0
  359. package/bin/skills/verl/references/api-reference.md +301 -0
  360. package/bin/skills/verl/references/troubleshooting.md +391 -0
  361. package/bin/skills/vllm/SKILL.md +364 -0
  362. package/bin/skills/vllm/references/optimization.md +226 -0
  363. package/bin/skills/vllm/references/quantization.md +284 -0
  364. package/bin/skills/vllm/references/server-deployment.md +255 -0
  365. package/bin/skills/vllm/references/troubleshooting.md +447 -0
  366. package/bin/skills/weights-and-biases/SKILL.md +590 -0
  367. package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
  368. package/bin/skills/weights-and-biases/references/integrations.md +700 -0
  369. package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
  370. package/bin/skills/whisper/SKILL.md +317 -0
  371. package/bin/skills/whisper/references/languages.md +189 -0
  372. package/bin/synsc +0 -0
  373. package/package.json +10 -0
@@ -0,0 +1,490 @@
1
+ # PyTorch Lightning Distributed Training
2
+
3
+ ## Distributed Strategies
4
+
5
+ Lightning supports multiple distributed strategies with a single parameter change.
6
+
7
+ ### 1. DDP (DistributedDataParallel)
8
+
9
+ **Default strategy for multi-GPU**:
10
+
11
+ ```python
12
+ # DDP on 4 GPUs (assumes `import lightning as L`)
13
+ trainer = L.Trainer(accelerator='gpu', devices=4, strategy='ddp')
14
+
15
+ # Or auto-detect
16
+ trainer = L.Trainer(accelerator='gpu', devices='auto')
17
+ ```
18
+
19
+ **How DDP works**:
20
+ - Replicates the model on each GPU
21
+ - Each GPU processes a different mini-batch
22
+ - Gradients are all-reduced (averaged) across GPUs during the backward pass
23
+ - Every replica applies the same update, so model weights stay synchronized
24
+
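The steps above can be sketched without any GPU. This is a toy simulation in plain Python, not Lightning's or PyTorch's implementation; the function name `ddp_step` and the learning rate are illustrative assumptions.

```python
def ddp_step(weights, per_replica_grads, lr=0.5):
    """Simulate one DDP step: average gradients, then update every replica identically."""
    num_replicas = len(per_replica_grads)
    # All-reduce (mean): every replica ends up holding the same averaged gradient.
    avg_grad = [
        sum(grads[i] for grads in per_replica_grads) / num_replicas
        for i in range(len(weights))
    ]
    # Each replica applies the same SGD update to its own copy of the weights.
    return [
        [w - lr * g for w, g in zip(weights, avg_grad)]
        for _ in range(num_replicas)
    ]


# Two replicas start from the same weights but see different mini-batches,
# so their local gradients differ before the all-reduce.
replicas = ddp_step(
    weights=[1.0, 2.0],
    per_replica_grads=[[0.5, 1.0], [1.5, 0.0]],
)
print(replicas)
```

Because every replica starts from identical weights and applies the identical averaged gradient, the copies never drift apart, which is why no separate weight broadcast is needed after each step.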
25
+ **Launch**:
26
+ ```bash
27
+ # Lightning handles spawning processes automatically
28
+ python train.py
29
+ ```
30
+
31
+ **DDP Configuration**:
32
+ ```python
33
+ from lightning.pytorch.strategies import DDPStrategy
34
+
35
+ strategy = DDPStrategy(
36
+ find_unused_parameters=False, # Set True if model has unused params
37
+ gradient_as_bucket_view=True, # Memory optimization
38
+ static_graph=False, # Set True if graph doesn't change
39
+ )
40
+
41
+ trainer = L.Trainer(strategy=strategy)
42
+ ```
43
+
44
+ ### 2. FSDP (Fully Sharded Data Parallel)
45
+
46
+ **For large models (7B+ parameters)**:
47
+
48
+ ```python
49
+ from lightning.pytorch.strategies import FSDPStrategy
50
+
51
+ strategy = FSDPStrategy(
52
+ sharding_strategy="FULL_SHARD", # ZeRO-3 equivalent
53
+ activation_checkpointing_policy=None, # Or a set of layer classes to checkpoint
54
+ cpu_offload=False, # CPU offload for memory
55
+ )
56
+
57
+ trainer = L.Trainer(
58
+ accelerator='gpu',
59
+ devices=8,
60
+ strategy=strategy,
61
+ precision='bf16' # Recommended with FSDP
62
+ )
63
+
64
+ trainer.fit(model, train_loader)
65
+ ```
66
+
67
+ **FSDP Sharding Strategies**:
68
+ ```python
69
+ # FULL_SHARD (most memory efficient, equivalent to ZeRO-3)
70
+ strategy = FSDPStrategy(sharding_strategy="FULL_SHARD")
71
+
72
+ # SHARD_GRAD_OP (less memory efficient, equivalent to ZeRO-2)
73
+ strategy = FSDPStrategy(sharding_strategy="SHARD_GRAD_OP")
74
+
75
+ # NO_SHARD (no sharding, like DDP)
76
+ strategy = FSDPStrategy(sharding_strategy="NO_SHARD")
77
+ ```
78
+
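To see why the three sharding strategies differ in memory use, here is a rough back-of-the-envelope estimate. It assumes mixed-precision Adam with 2 bytes per parameter for weights, 2 for gradients, and 12 for optimizer state (the accounting commonly used for ZeRO), and it ignores activations and buffers. The function name `per_gpu_state_gb` is hypothetical.

```python
def per_gpu_state_gb(num_params, num_gpus, sharding_strategy):
    """Rough per-GPU memory (GB) for params, grads, and Adam state; excludes activations."""
    # Assumed bytes per parameter under mixed-precision Adam.
    param_bytes, grad_bytes, optim_bytes = 2, 2, 12
    if sharding_strategy == "NO_SHARD":
        # Everything is replicated on every GPU, as in DDP.
        per_param = param_bytes + grad_bytes + optim_bytes
    elif sharding_strategy == "SHARD_GRAD_OP":
        # Gradients and optimizer state are sharded; parameters stay replicated.
        per_param = param_bytes + (grad_bytes + optim_bytes) / num_gpus
    elif sharding_strategy == "FULL_SHARD":
        # Parameters, gradients, and optimizer state are all sharded.
        per_param = (param_bytes + grad_bytes + optim_bytes) / num_gpus
    else:
        raise ValueError(f"unknown sharding strategy: {sharding_strategy}")
    return num_params * per_param / 1e9


for name in ("NO_SHARD", "SHARD_GRAD_OP", "FULL_SHARD"):
    print(name, per_gpu_state_gb(7e9, 8, name))
```

Under these assumptions a 7B-parameter model on 8 GPUs needs about 112 GB of state per GPU unsharded, which is why sharding becomes necessary at this scale.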
79
+ **Auto-wrap policy** (wrap transformer blocks):
80
+ ```python
81
+ from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
82
+ from transformers.models.gpt2.modeling_gpt2 import GPT2Block
83
+ import functools
84
+
85
+ auto_wrap_policy = functools.partial(
86
+ transformer_auto_wrap_policy,
87
+ transformer_layer_cls={GPT2Block}
88
+ )
89
+
90
+ strategy = FSDPStrategy(
91
+ auto_wrap_policy=auto_wrap_policy,
92
+ activation_checkpointing_policy={GPT2Block} # Checkpoint these blocks
93
+ )
94
+ ```
95
+
96
+ ### 3. DeepSpeed
97
+
98
+ **For massive models (70B+ parameters)**:
99
+
100
+ ```python
101
+ from lightning.pytorch.strategies import DeepSpeedStrategy
102
+
103
+ # DeepSpeed ZeRO-3 with CPU offload
104
+ strategy = DeepSpeedStrategy(
105
+ stage=3, # ZeRO-3
106
+ offload_optimizer=True, # CPU offload optimizer
107
+ offload_parameters=True, # CPU offload parameters
108
+ cpu_checkpointing=True, # Checkpoint to CPU
109
+ )
110
+
111
+ trainer = L.Trainer(
112
+ accelerator='gpu',
113
+ devices=8,
114
+ strategy=strategy,
115
+ precision='bf16'
116
+ )
117
+
118
+ trainer.fit(model, train_loader)
119
+ ```
120
+
121
+ **DeepSpeed configuration file**:
122
+ ```json
123
+ {
124
+ "train_batch_size": "auto",
125
+ "train_micro_batch_size_per_gpu": "auto",
126
+ "gradient_accumulation_steps": "auto",
127
+ "zero_optimization": {
128
+ "stage": 3,
129
+ "offload_optimizer": {
130
+ "device": "cpu",
131
+ "pin_memory": true
132
+ },
133
+ "offload_param": {
134
+ "device": "cpu",
135
+ "pin_memory": true
136
+ },
137
+ "overlap_comm": true,
138
+ "contiguous_gradients": true,
139
+ "reduce_bucket_size": 5e8,
140
+ "stage3_prefetch_bucket_size": 5e8,
141
+ "stage3_param_persistence_threshold": 1e6
142
+ },
143
+ "bf16": {
144
+ "enabled": true
145
+ }
146
+ }
147
+ ```
148
+
149
+ **Use config file**:
150
+ ```python
151
+ strategy = DeepSpeedStrategy(config='deepspeed_config.json')
152
+ trainer = L.Trainer(strategy=strategy)
153
+ ```
154
+
155
+ ### 4. DDP Spawn
156
+
157
+ **Windows-compatible DDP**:
158
+
159
+ ```python
160
+ # Use when regular DDP launching doesn't work (e.g., Windows); in Jupyter, use strategy='ddp_notebook' instead
161
+ trainer = L.Trainer(
162
+ accelerator='gpu',
163
+ devices=2,
164
+ strategy='ddp_spawn' # Spawns new processes
165
+ )
166
+ ```
167
+
168
+ **Note**: Slower than DDP due to process spawning overhead
169
+
170
+ ## Multi-Node Training
171
+
172
+ ### Setup Multi-Node Cluster
173
+
174
+ **Node 0 (master)**:
175
+ ```bash
176
+ export MASTER_ADDR=192.168.1.100
177
+ export MASTER_PORT=12355
178
+ export WORLD_SIZE=16 # 2 nodes × 8 GPUs
179
+ export NODE_RANK=0
180
+
181
+ python train.py
182
+ ```
183
+
184
+ **Node 1 (worker)**:
185
+ ```bash
186
+ export MASTER_ADDR=192.168.1.100
187
+ export MASTER_PORT=12355
188
+ export WORLD_SIZE=16
189
+ export NODE_RANK=1
190
+
191
+ python train.py
192
+ ```
193
+
194
+ **Training script**:
195
+ ```python
196
+ trainer = L.Trainer(
197
+ accelerator='gpu',
198
+ devices=8, # GPUs per node
199
+ num_nodes=2, # Total nodes
200
+ strategy='ddp'
201
+ )
202
+
203
+ trainer.fit(model, train_loader)
204
+ ```
205
+
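The relationship between `WORLD_SIZE`, `NODE_RANK`, and the per-node GPU count can be sketched as plain arithmetic. The helper names are illustrative, and the rank layout (ranks assigned node by node) is the usual convention rather than something stated in the text above.

```python
def world_size(num_nodes, devices_per_node):
    """Total number of training processes across the cluster."""
    return num_nodes * devices_per_node


def global_rank(node_rank, local_rank, devices_per_node):
    """Global rank of a process, assuming ranks are assigned node by node."""
    return node_rank * devices_per_node + local_rank


# 2 nodes x 8 GPUs, matching the WORLD_SIZE=16 exported above.
total = world_size(num_nodes=2, devices_per_node=8)
ranks = [
    global_rank(node, gpu, devices_per_node=8)
    for node in range(2)
    for gpu in range(8)
]
print(total, ranks)
```

If `WORLD_SIZE` does not equal `num_nodes * devices`, the processes will typically hang at startup waiting for peers that never join, so it is worth checking this product before launching.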
206
+ ### SLURM Integration
207
+
208
+ **SLURM job script**:
209
+ ```bash
210
+ #!/bin/bash
211
+ #SBATCH --nodes=4
212
+ #SBATCH --ntasks-per-node=8
213
+ #SBATCH --gres=gpu:8
214
+ #SBATCH --time=24:00:00
215
+
216
+ # Lightning auto-detects SLURM environment
217
+ srun python train.py
218
+ ```
219
+
220
+ **Training script** (no changes needed):
221
+ ```python
222
+ # Lightning automatically reads SLURM environment variables
223
+ trainer = L.Trainer(
224
+ accelerator='gpu',
225
+ devices=8, # Must match --ntasks-per-node
226
+ num_nodes=4, # Must match --nodes
227
+ strategy='ddp'
228
+ )
229
+ ```
230
+
231
+ ### Kubernetes (KubeFlow)
232
+
233
+ **Training script**:
234
+ ```python
235
+ import os
236
+
237
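+ # Lightning auto-detects the Kubeflow PyTorchJob environment (assumes one process per pod)
238
+ trainer = L.Trainer(
239
+ accelerator='gpu',
240
+ devices=1, # One GPU per pod
+ num_nodes=int(os.getenv('WORLD_SIZE', 1)), # WORLD_SIZE is the total process count, not GPUs per node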
241
+ strategy='ddp'
242
+ )
243
+ ```
244
+
245
+ ## Mixed Precision Training
246
+
247
+ ### BF16 (A100/H100)
248
+
249
+ ```python
250
+ trainer = L.Trainer(
251
+ precision='bf16-mixed', # 'bf16' is accepted as a legacy alias
252
+ accelerator='gpu'
253
+ )
254
+ ```
255
+
256
+ **Advantages**:
257
+ - No gradient scaler needed
258
+ - Same dynamic range as FP32
259
+ - Up to ~2× speedup and roughly half the activation memory, depending on model and hardware
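
The dynamic-range claim in the list above follows from the bit layouts: BF16 keeps FP32's 8 exponent bits and shortens the mantissa, while FP16 has only 5 exponent bits. A rough sketch of the arithmetic, using the standard formula for the largest finite value of a binary floating-point format (the function name is illustrative):

```python
def max_finite(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-style binary float format."""
    # The all-ones exponent is reserved for inf/NaN, so the largest
    # usable unbiased exponent equals the bias.
    bias = 2 ** (exponent_bits - 1) - 1
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias


# (exponent bits, mantissa bits) for each format
FORMATS = {"fp32": (8, 23), "bf16": (8, 7), "fp16": (5, 10)}

for name, (exp_bits, man_bits) in FORMATS.items():
    print(f"{name}: max finite value ~ {max_finite(exp_bits, man_bits):.3e}")
```

FP16 tops out at 65504, which is why it needs a gradient scaler to keep small gradients from underflowing and large values from overflowing, whereas BF16 covers nearly the same range as FP32 at lower precision.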
260
+
261
+ ### FP16 (V100, older GPUs)
262
+
263
+ ```python
264
+ trainer = L.Trainer(
265
+ precision='16-mixed', # Or just '16'
266
+ accelerator='gpu'
267
+ )
268
+ ```
269
+
270
+ **Automatic gradient scaling** handled by Lightning
271
+
272
+ ### FP8 (H100)
273
+
274
+ ```python
275
+ # Requires transformer_engine
276
+ # pip install transformer-engine[pytorch]
277
+
278
+ trainer = L.Trainer(
279
+ precision='transformer-engine',
280
+ accelerator='gpu'
281
+ )
282
+ ```
283
+
284
+ **Benefits**: Up to ~2× faster than BF16 on H100 for matmul-heavy models; actual gains depend on the workload
285
+
286
+ ## Gradient Accumulation
287
+
288
+ **Simulate larger batch size**:
289
+
290
+ ```python
291
+ trainer = L.Trainer(
292
+ accumulate_grad_batches=4, # Accumulate 4 batches
293
+ precision='bf16'
294
+ )
295
+
296
+ # Effective batch = batch_size × accumulate_grad_batches × num_gpus
297
+ # Example: 32 × 4 × 8 = 1024
298
+ ```
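
The effective-batch formula in the comment above can also be inverted to pick an accumulation factor for a target batch size. A small sketch of both directions follows; the helper names are hypothetical, not Lightning functions.

```python
def effective_batch_size(per_device_batch: int, accumulate_grad_batches: int, num_devices: int) -> int:
    """Samples contributing to each optimizer step."""
    return per_device_batch * accumulate_grad_batches * num_devices


def accumulation_for_target(target_batch: int, per_device_batch: int, num_devices: int) -> int:
    """Smallest accumulation factor whose effective batch reaches the target."""
    per_step = per_device_batch * num_devices
    return -(-target_batch // per_step)  # ceiling division


print(effective_batch_size(32, 4, 8))        # 1024
print(accumulation_for_target(1024, 32, 8))  # 4
```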
299
+
300
+ **Dynamic accumulation**:
301
+ ```python
302
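+ # Accumulate more early in training (Lightning 2.x: Trainer takes only an int, so use the callback)
303
+ from lightning.pytorch.callbacks import GradientAccumulationScheduler
304
+ schedule = GradientAccumulationScheduler(scheduling={
305
+     0: 8, # Epochs 0-4: accumulate 8
306
+     5: 4, # Epochs 5-9: accumulate 4
307
+     10: 2 # Epochs 10+: accumulate 2
308
+ })
309
+ trainer = L.Trainer(callbacks=[schedule])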
310
+ ```
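
To make the epoch ranges in the schedule comments concrete: each key is the epoch at which a new factor takes effect, and it stays in force until the next key. The sketch below resolves the factor for a given epoch under that reading; it illustrates the semantics and is not Lightning's internal implementation.

```python
def accumulation_factor(schedule: dict, epoch: int, default: int = 1) -> int:
    """Return the accumulation factor in force at `epoch`.

    `schedule` maps a starting epoch to the factor used from that epoch on.
    """
    factor = default
    for start_epoch in sorted(schedule):
        if epoch >= start_epoch:
            factor = schedule[start_epoch]
    return factor


schedule = {0: 8, 5: 4, 10: 2}
for epoch in (0, 4, 5, 9, 10, 50):
    print(epoch, accumulation_factor(schedule, epoch))
```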
311
+
312
+ ## Checkpointing in Distributed
313
+
314
+ ### Save Checkpoint
315
+
316
+ ```python
317
+ from lightning.pytorch.callbacks import ModelCheckpoint
318
+
319
+ # Only rank 0 saves by default
320
+ checkpoint = ModelCheckpoint(
321
+ dirpath='checkpoints/',
322
+ filename='model-{epoch:02d}',
323
+ save_top_k=3
324
+ )
325
+
326
+ trainer = L.Trainer(callbacks=[checkpoint], strategy='ddp')
327
+ trainer.fit(model, train_loader)
328
+ ```
329
+
330
+ **Manual save**:
331
+ ```python
332
+ class MyModel(L.LightningModule):
333
+     def training_step(self, batch, batch_idx):
334
+         # Training...
335
+         loss = ...
336
+
337
+         # Save every 1000 steps. Call on all ranks: Lightning writes from rank 0 and synchronizes internally
338
+         if batch_idx % 1000 == 0:
339
+             self.trainer.save_checkpoint(f'checkpoint_step_{batch_idx}.ckpt')
340
+
341
+         return loss
342
+ ```
343
+
344
+ ### Load Checkpoint
345
+
346
+ ```python
347
+ # Resume training ('last.ckpt' exists only if ModelCheckpoint(save_last=True) was used)
348
+ trainer = L.Trainer(strategy='ddp')
349
+ trainer.fit(model, train_loader, ckpt_path='checkpoints/last.ckpt')
350
+
351
+ # Load for inference
352
+ model = MyModel.load_from_checkpoint('checkpoints/best.ckpt')
353
+ model.eval()
354
+ ```
355
+
356
+ ## Strategy Comparison
357
+
358
+ | Strategy | Memory Efficiency | Speed | Use Case |
359
+ |----------|------------------|-------|----------|
360
+ | DDP | Low | Fast | Small models (<7B) that fit on a single GPU |
361
+ | FSDP | High | Medium | Large models (7-70B) |
362
+ | DeepSpeed ZeRO-2 | Medium | Fast | Medium models (1-13B) |
363
+ | DeepSpeed ZeRO-3 | Very High | Slower | Massive models (70B+) |
364
+ | DDP Spawn | Low | Slow | Windows, debugging |
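
The memory-efficiency column can be approximated with the usual accounting for mixed-precision Adam: about 2 bytes per parameter for half-precision weights, 2 for gradients, and 12 for FP32 optimizer state (master weights, momentum, variance). DDP replicates all of it on every GPU, while ZeRO stages 1 to 3 progressively shard optimizer state, gradients, and weights. The sketch below applies that model; it ignores activations and buffers, so treat the numbers as lower bounds, and the function name is illustrative.

```python
def model_state_bytes_per_gpu(num_params: int, num_gpus: int, zero_stage: int = 0) -> float:
    """Approximate per-GPU bytes for weights, gradients, and Adam state.

    zero_stage=0 models plain DDP (everything replicated).
    """
    weights = 2 * num_params   # half-precision weights
    grads = 2 * num_params     # half-precision gradients
    optim = 12 * num_params    # FP32 master weights + Adam momentum + variance
    if zero_stage >= 1:
        optim /= num_gpus      # stage 1 shards optimizer state
    if zero_stage >= 2:
        grads /= num_gpus      # stage 2 also shards gradients
    if zero_stage >= 3:
        weights /= num_gpus    # stage 3 also shards weights
    return weights + grads + optim


params = 7_000_000_000  # a 7B-parameter model
for stage in (0, 2, 3):
    gb = model_state_bytes_per_gpu(params, num_gpus=8, zero_stage=stage) / 1e9
    print(f"ZeRO stage {stage}: {gb:.2f} GB per GPU")
```

FSDP with `FULL_SHARD` shards the same three components, so its footprint is roughly comparable to stage 3 under this model.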
365
+
366
+ ## Best Practices
367
+
368
+ ### 1. Choose Right Strategy
369
+
370
+ ```python
371
+ # Model size guide
372
+ if model_params < 1e9: # <1B
373
+     strategy = 'ddp'
374
+ elif model_params < 7e9: # 1-7B
375
+     strategy = DeepSpeedStrategy(stage=2) # Or 'ddp' if the model fits on one GPU
376
+ elif model_params < 70e9: # 7-70B
377
+     strategy = FSDPStrategy(sharding_strategy="FULL_SHARD")
378
+ else: # 70B+
379
+     strategy = DeepSpeedStrategy(stage=3, offload_optimizer=True)
380
+
381
+ trainer = L.Trainer(strategy=strategy)
382
+ ```
383
+
384
+ ### 2. Avoid Sync Issues
385
+
386
+ ```python
387
+ class MyModel(L.LightningModule):
388
+     def training_step(self, batch, batch_idx):
389
+         # WRONG: This runs on all GPUs independently
390
+         if batch_idx % 100 == 0:
391
+             self.log_something() # Logged 8 times on 8 GPUs!
392
+
393
+         # CORRECT: Use is_global_zero
394
+         if batch_idx % 100 == 0 and self.trainer.is_global_zero:
395
+             self.log_something() # Logged once
396
+
397
+         loss = ...
398
+         return loss
399
+ ```
400
+
401
+ ### 3. Efficient Data Loading
402
+
403
+ ```python
404
+ from torch.utils.data import DataLoader, DistributedSampler
405
+
406
+ # Lightning handles DistributedSampler automatically
407
+ train_loader = DataLoader(
408
+ dataset,
409
+ batch_size=32,
410
+ num_workers=4, # 4 workers per GPU
411
+ pin_memory=True,
412
+ persistent_workers=True
413
+ )
414
+
415
+ # Lightning automatically wraps with DistributedSampler in DDP
416
+ trainer.fit(model, train_loader)
417
+ ```
418
+
419
+ ### 4. Reduce Communication Overhead
420
+
421
+ ```python
422
+ from lightning.pytorch.strategies import DDPStrategy
423
+
424
+ strategy = DDPStrategy(
425
+ gradient_as_bucket_view=True, # Reduce memory copies
426
+ static_graph=True, # If model graph doesn't change (faster)
427
+ )
428
+
429
+ trainer = L.Trainer(strategy=strategy)
430
+ ```
431
+
432
+ ## Common Issues
433
+
434
+ ### Issue: NCCL Timeout
435
+
436
+ **Symptom**: Training hangs with `NCCL timeout` error
437
+
438
+ **Solution 1**: Increase timeout
439
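+ ```python
440
+ from datetime import timedelta
441
+ from lightning.pytorch.strategies import DDPStrategy
+ trainer = L.Trainer(strategy=DDPStrategy(timeout=timedelta(hours=1))) # 1 hour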
442
+ ```
443
+
444
+ **Solution 2**: Check network
445
+ ```bash
446
+ # Check NVLink status between GPUs within each node
447
+ nvidia-smi nvlink -s
448
+
449
+ # Verify all nodes can ping each other
450
+ ping <node-2-ip>
451
+ ```
452
+
453
+ ### Issue: OOM with FSDP
454
+
455
+ **Solution**: Enable CPU offload
456
+ ```python
457
+ strategy = FSDPStrategy(
458
+ sharding_strategy="FULL_SHARD",
459
+ cpu_offload=True # Offload to CPU
460
+ )
461
+ ```
462
+
463
+ ### Issue: Different Results with DDP
464
+
465
+ **Cause**: Different random seeds per GPU
466
+
467
+ **Solution**: Set seed in LightningModule
468
+ ```python
469
+ class MyModel(L.LightningModule):
470
+     def __init__(self):
471
+         super().__init__()
472
+         L.seed_everything(42, workers=True) # Same seed everywhere
473
+ ```
474
+
475
+ ### Issue: DeepSpeed Config Errors
476
+
477
+ **Solution**: Use Lightning's auto config
478
+ ```python
479
+ strategy = DeepSpeedStrategy(
480
+ stage=3,
481
+ # Omit the config file; Lightning builds the DeepSpeed config from these arguments
482
+ )
483
+ ```
484
+
485
+ ## Resources
486
+
487
+ - Distributed strategies: https://lightning.ai/docs/pytorch/stable/accelerators/gpu_intermediate.html
488
+ - FSDP guide: https://lightning.ai/docs/pytorch/stable/advanced/model_parallel/fsdp.html
489
+ - DeepSpeed: https://lightning.ai/docs/pytorch/stable/advanced/model_parallel/deepspeed.html
490
+ - Multi-node: https://lightning.ai/docs/pytorch/stable/clouds/cluster.html