@synsci/cli-darwin-x64 1.1.49

Files changed (373)
  1. package/bin/skills/accelerate/SKILL.md +332 -0
  2. package/bin/skills/accelerate/references/custom-plugins.md +453 -0
  3. package/bin/skills/accelerate/references/megatron-integration.md +489 -0
  4. package/bin/skills/accelerate/references/performance.md +525 -0
  5. package/bin/skills/audiocraft/SKILL.md +564 -0
  6. package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
  7. package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
  8. package/bin/skills/autogpt/SKILL.md +403 -0
  9. package/bin/skills/autogpt/references/advanced-usage.md +535 -0
  10. package/bin/skills/autogpt/references/troubleshooting.md +420 -0
  11. package/bin/skills/awq/SKILL.md +310 -0
  12. package/bin/skills/awq/references/advanced-usage.md +324 -0
  13. package/bin/skills/awq/references/troubleshooting.md +344 -0
  14. package/bin/skills/axolotl/SKILL.md +158 -0
  15. package/bin/skills/axolotl/references/api.md +5548 -0
  16. package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
  17. package/bin/skills/axolotl/references/index.md +15 -0
  18. package/bin/skills/axolotl/references/other.md +3563 -0
  19. package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
  20. package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
  21. package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
  22. package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
  23. package/bin/skills/bitsandbytes/SKILL.md +411 -0
  24. package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
  25. package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
  26. package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
  27. package/bin/skills/blip-2/SKILL.md +564 -0
  28. package/bin/skills/blip-2/references/advanced-usage.md +680 -0
  29. package/bin/skills/blip-2/references/troubleshooting.md +526 -0
  30. package/bin/skills/chroma/SKILL.md +406 -0
  31. package/bin/skills/chroma/references/integration.md +38 -0
  32. package/bin/skills/clip/SKILL.md +253 -0
  33. package/bin/skills/clip/references/applications.md +207 -0
  34. package/bin/skills/constitutional-ai/SKILL.md +290 -0
  35. package/bin/skills/crewai/SKILL.md +498 -0
  36. package/bin/skills/crewai/references/flows.md +438 -0
  37. package/bin/skills/crewai/references/tools.md +429 -0
  38. package/bin/skills/crewai/references/troubleshooting.md +480 -0
  39. package/bin/skills/deepspeed/SKILL.md +141 -0
  40. package/bin/skills/deepspeed/references/08.md +17 -0
  41. package/bin/skills/deepspeed/references/09.md +173 -0
  42. package/bin/skills/deepspeed/references/2020.md +378 -0
  43. package/bin/skills/deepspeed/references/2023.md +279 -0
  44. package/bin/skills/deepspeed/references/assets.md +179 -0
  45. package/bin/skills/deepspeed/references/index.md +35 -0
  46. package/bin/skills/deepspeed/references/mii.md +118 -0
  47. package/bin/skills/deepspeed/references/other.md +1191 -0
  48. package/bin/skills/deepspeed/references/tutorials.md +6554 -0
  49. package/bin/skills/dspy/SKILL.md +590 -0
  50. package/bin/skills/dspy/references/examples.md +663 -0
  51. package/bin/skills/dspy/references/modules.md +475 -0
  52. package/bin/skills/dspy/references/optimizers.md +566 -0
  53. package/bin/skills/faiss/SKILL.md +221 -0
  54. package/bin/skills/faiss/references/index_types.md +280 -0
  55. package/bin/skills/flash-attention/SKILL.md +367 -0
  56. package/bin/skills/flash-attention/references/benchmarks.md +215 -0
  57. package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
  58. package/bin/skills/gguf/SKILL.md +427 -0
  59. package/bin/skills/gguf/references/advanced-usage.md +504 -0
  60. package/bin/skills/gguf/references/troubleshooting.md +442 -0
  61. package/bin/skills/gptq/SKILL.md +450 -0
  62. package/bin/skills/gptq/references/calibration.md +337 -0
  63. package/bin/skills/gptq/references/integration.md +129 -0
  64. package/bin/skills/gptq/references/troubleshooting.md +95 -0
  65. package/bin/skills/grpo-rl-training/README.md +97 -0
  66. package/bin/skills/grpo-rl-training/SKILL.md +572 -0
  67. package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
  68. package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
  69. package/bin/skills/guidance/SKILL.md +572 -0
  70. package/bin/skills/guidance/references/backends.md +554 -0
  71. package/bin/skills/guidance/references/constraints.md +674 -0
  72. package/bin/skills/guidance/references/examples.md +767 -0
  73. package/bin/skills/hqq/SKILL.md +445 -0
  74. package/bin/skills/hqq/references/advanced-usage.md +528 -0
  75. package/bin/skills/hqq/references/troubleshooting.md +503 -0
  76. package/bin/skills/hugging-face-cli/SKILL.md +191 -0
  77. package/bin/skills/hugging-face-cli/references/commands.md +954 -0
  78. package/bin/skills/hugging-face-cli/references/examples.md +374 -0
  79. package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
  80. package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
  81. package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
  82. package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
  83. package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
  84. package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
  85. package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
  86. package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
  87. package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
  88. package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
  89. package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
  90. package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
  91. package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
  92. package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
  93. package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
  94. package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
  95. package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
  96. package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
  97. package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
  98. package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
  99. package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
  100. package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
  101. package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
  102. package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
  103. package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
  104. package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
  105. package/bin/skills/hugging-face-jobs/index.html +216 -0
  106. package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
  107. package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
  108. package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
  109. package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
  110. package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
  111. package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
  112. package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
  113. package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
  114. package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
  115. package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
  116. package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
  117. package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
  118. package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
  119. package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
  120. package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
  121. package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
  122. package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
  123. package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
  124. package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
  125. package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
  126. package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
  127. package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
  128. package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
  129. package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
  130. package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
  131. package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
  132. package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
  133. package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
  134. package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
  135. package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
  136. package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
  137. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
  138. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
  139. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
  140. package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
  141. package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
  142. package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
  143. package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
  144. package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
  145. package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
  146. package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
  147. package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
  148. package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
  149. package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
  150. package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
  151. package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
  152. package/bin/skills/instructor/SKILL.md +740 -0
  153. package/bin/skills/instructor/references/examples.md +107 -0
  154. package/bin/skills/instructor/references/providers.md +70 -0
  155. package/bin/skills/instructor/references/validation.md +606 -0
  156. package/bin/skills/knowledge-distillation/SKILL.md +458 -0
  157. package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
  158. package/bin/skills/lambda-labs/SKILL.md +545 -0
  159. package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
  160. package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
  161. package/bin/skills/langchain/SKILL.md +480 -0
  162. package/bin/skills/langchain/references/agents.md +499 -0
  163. package/bin/skills/langchain/references/integration.md +562 -0
  164. package/bin/skills/langchain/references/rag.md +600 -0
  165. package/bin/skills/langsmith/SKILL.md +422 -0
  166. package/bin/skills/langsmith/references/advanced-usage.md +548 -0
  167. package/bin/skills/langsmith/references/troubleshooting.md +537 -0
  168. package/bin/skills/litgpt/SKILL.md +469 -0
  169. package/bin/skills/litgpt/references/custom-models.md +568 -0
  170. package/bin/skills/litgpt/references/distributed-training.md +451 -0
  171. package/bin/skills/litgpt/references/supported-models.md +336 -0
  172. package/bin/skills/litgpt/references/training-recipes.md +619 -0
  173. package/bin/skills/llama-cpp/SKILL.md +258 -0
  174. package/bin/skills/llama-cpp/references/optimization.md +89 -0
  175. package/bin/skills/llama-cpp/references/quantization.md +213 -0
  176. package/bin/skills/llama-cpp/references/server.md +125 -0
  177. package/bin/skills/llama-factory/SKILL.md +80 -0
  178. package/bin/skills/llama-factory/references/_images.md +23 -0
  179. package/bin/skills/llama-factory/references/advanced.md +1055 -0
  180. package/bin/skills/llama-factory/references/getting_started.md +349 -0
  181. package/bin/skills/llama-factory/references/index.md +19 -0
  182. package/bin/skills/llama-factory/references/other.md +31 -0
  183. package/bin/skills/llamaguard/SKILL.md +337 -0
  184. package/bin/skills/llamaindex/SKILL.md +569 -0
  185. package/bin/skills/llamaindex/references/agents.md +83 -0
  186. package/bin/skills/llamaindex/references/data_connectors.md +108 -0
  187. package/bin/skills/llamaindex/references/query_engines.md +406 -0
  188. package/bin/skills/llava/SKILL.md +304 -0
  189. package/bin/skills/llava/references/training.md +197 -0
  190. package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
  191. package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
  192. package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
  193. package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
  194. package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
  195. package/bin/skills/long-context/SKILL.md +536 -0
  196. package/bin/skills/long-context/references/extension_methods.md +468 -0
  197. package/bin/skills/long-context/references/fine_tuning.md +611 -0
  198. package/bin/skills/long-context/references/rope.md +402 -0
  199. package/bin/skills/mamba/SKILL.md +260 -0
  200. package/bin/skills/mamba/references/architecture-details.md +206 -0
  201. package/bin/skills/mamba/references/benchmarks.md +255 -0
  202. package/bin/skills/mamba/references/training-guide.md +388 -0
  203. package/bin/skills/megatron-core/SKILL.md +366 -0
  204. package/bin/skills/megatron-core/references/benchmarks.md +249 -0
  205. package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
  206. package/bin/skills/megatron-core/references/production-examples.md +473 -0
  207. package/bin/skills/megatron-core/references/training-recipes.md +547 -0
  208. package/bin/skills/miles/SKILL.md +315 -0
  209. package/bin/skills/miles/references/api-reference.md +141 -0
  210. package/bin/skills/miles/references/troubleshooting.md +352 -0
  211. package/bin/skills/mlflow/SKILL.md +704 -0
  212. package/bin/skills/mlflow/references/deployment.md +744 -0
  213. package/bin/skills/mlflow/references/model-registry.md +770 -0
  214. package/bin/skills/mlflow/references/tracking.md +680 -0
  215. package/bin/skills/modal/SKILL.md +341 -0
  216. package/bin/skills/modal/references/advanced-usage.md +503 -0
  217. package/bin/skills/modal/references/troubleshooting.md +494 -0
  218. package/bin/skills/model-merging/SKILL.md +539 -0
  219. package/bin/skills/model-merging/references/evaluation.md +462 -0
  220. package/bin/skills/model-merging/references/examples.md +428 -0
  221. package/bin/skills/model-merging/references/methods.md +352 -0
  222. package/bin/skills/model-pruning/SKILL.md +495 -0
  223. package/bin/skills/model-pruning/references/wanda.md +347 -0
  224. package/bin/skills/moe-training/SKILL.md +526 -0
  225. package/bin/skills/moe-training/references/architectures.md +432 -0
  226. package/bin/skills/moe-training/references/inference.md +348 -0
  227. package/bin/skills/moe-training/references/training.md +425 -0
  228. package/bin/skills/nanogpt/SKILL.md +290 -0
  229. package/bin/skills/nanogpt/references/architecture.md +382 -0
  230. package/bin/skills/nanogpt/references/data.md +476 -0
  231. package/bin/skills/nanogpt/references/training.md +564 -0
  232. package/bin/skills/nemo-curator/SKILL.md +383 -0
  233. package/bin/skills/nemo-curator/references/deduplication.md +87 -0
  234. package/bin/skills/nemo-curator/references/filtering.md +102 -0
  235. package/bin/skills/nemo-evaluator/SKILL.md +494 -0
  236. package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
  237. package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
  238. package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
  239. package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
  240. package/bin/skills/nemo-guardrails/SKILL.md +297 -0
  241. package/bin/skills/nnsight/SKILL.md +436 -0
  242. package/bin/skills/nnsight/references/README.md +78 -0
  243. package/bin/skills/nnsight/references/api.md +344 -0
  244. package/bin/skills/nnsight/references/tutorials.md +300 -0
  245. package/bin/skills/openrlhf/SKILL.md +249 -0
  246. package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
  247. package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
  248. package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
  249. package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
  250. package/bin/skills/outlines/SKILL.md +652 -0
  251. package/bin/skills/outlines/references/backends.md +615 -0
  252. package/bin/skills/outlines/references/examples.md +773 -0
  253. package/bin/skills/outlines/references/json_generation.md +652 -0
  254. package/bin/skills/peft/SKILL.md +431 -0
  255. package/bin/skills/peft/references/advanced-usage.md +514 -0
  256. package/bin/skills/peft/references/troubleshooting.md +480 -0
  257. package/bin/skills/phoenix/SKILL.md +475 -0
  258. package/bin/skills/phoenix/references/advanced-usage.md +619 -0
  259. package/bin/skills/phoenix/references/troubleshooting.md +538 -0
  260. package/bin/skills/pinecone/SKILL.md +358 -0
  261. package/bin/skills/pinecone/references/deployment.md +181 -0
  262. package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
  263. package/bin/skills/pytorch-fsdp/references/index.md +7 -0
  264. package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
  265. package/bin/skills/pytorch-lightning/SKILL.md +346 -0
  266. package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
  267. package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
  268. package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
  269. package/bin/skills/pyvene/SKILL.md +473 -0
  270. package/bin/skills/pyvene/references/README.md +73 -0
  271. package/bin/skills/pyvene/references/api.md +383 -0
  272. package/bin/skills/pyvene/references/tutorials.md +376 -0
  273. package/bin/skills/qdrant/SKILL.md +493 -0
  274. package/bin/skills/qdrant/references/advanced-usage.md +648 -0
  275. package/bin/skills/qdrant/references/troubleshooting.md +631 -0
  276. package/bin/skills/ray-data/SKILL.md +326 -0
  277. package/bin/skills/ray-data/references/integration.md +82 -0
  278. package/bin/skills/ray-data/references/transformations.md +83 -0
  279. package/bin/skills/ray-train/SKILL.md +406 -0
  280. package/bin/skills/ray-train/references/multi-node.md +628 -0
  281. package/bin/skills/rwkv/SKILL.md +260 -0
  282. package/bin/skills/rwkv/references/architecture-details.md +344 -0
  283. package/bin/skills/rwkv/references/rwkv7.md +386 -0
  284. package/bin/skills/rwkv/references/state-management.md +369 -0
  285. package/bin/skills/saelens/SKILL.md +386 -0
  286. package/bin/skills/saelens/references/README.md +70 -0
  287. package/bin/skills/saelens/references/api.md +333 -0
  288. package/bin/skills/saelens/references/tutorials.md +318 -0
  289. package/bin/skills/segment-anything/SKILL.md +500 -0
  290. package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
  291. package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
  292. package/bin/skills/sentence-transformers/SKILL.md +255 -0
  293. package/bin/skills/sentence-transformers/references/models.md +123 -0
  294. package/bin/skills/sentencepiece/SKILL.md +235 -0
  295. package/bin/skills/sentencepiece/references/algorithms.md +200 -0
  296. package/bin/skills/sentencepiece/references/training.md +304 -0
  297. package/bin/skills/sglang/SKILL.md +442 -0
  298. package/bin/skills/sglang/references/deployment.md +490 -0
  299. package/bin/skills/sglang/references/radix-attention.md +413 -0
  300. package/bin/skills/sglang/references/structured-generation.md +541 -0
  301. package/bin/skills/simpo/SKILL.md +219 -0
  302. package/bin/skills/simpo/references/datasets.md +478 -0
  303. package/bin/skills/simpo/references/hyperparameters.md +452 -0
  304. package/bin/skills/simpo/references/loss-functions.md +350 -0
  305. package/bin/skills/skypilot/SKILL.md +509 -0
  306. package/bin/skills/skypilot/references/advanced-usage.md +491 -0
  307. package/bin/skills/skypilot/references/troubleshooting.md +570 -0
  308. package/bin/skills/slime/SKILL.md +464 -0
  309. package/bin/skills/slime/references/api-reference.md +392 -0
  310. package/bin/skills/slime/references/troubleshooting.md +386 -0
  311. package/bin/skills/speculative-decoding/SKILL.md +467 -0
  312. package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
  313. package/bin/skills/speculative-decoding/references/medusa.md +350 -0
  314. package/bin/skills/stable-diffusion/SKILL.md +519 -0
  315. package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
  316. package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
  317. package/bin/skills/tensorboard/SKILL.md +629 -0
  318. package/bin/skills/tensorboard/references/integrations.md +638 -0
  319. package/bin/skills/tensorboard/references/profiling.md +545 -0
  320. package/bin/skills/tensorboard/references/visualization.md +620 -0
  321. package/bin/skills/tensorrt-llm/SKILL.md +187 -0
  322. package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
  323. package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
  324. package/bin/skills/tensorrt-llm/references/serving.md +470 -0
  325. package/bin/skills/tinker/SKILL.md +362 -0
  326. package/bin/skills/tinker/references/api-reference.md +168 -0
  327. package/bin/skills/tinker/references/getting-started.md +157 -0
  328. package/bin/skills/tinker/references/loss-functions.md +163 -0
  329. package/bin/skills/tinker/references/models-and-lora.md +139 -0
  330. package/bin/skills/tinker/references/recipes.md +280 -0
  331. package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
  332. package/bin/skills/tinker/references/rendering.md +243 -0
  333. package/bin/skills/tinker/references/supervised-learning.md +232 -0
  334. package/bin/skills/tinker-training-cost/SKILL.md +187 -0
  335. package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
  336. package/bin/skills/torchforge/SKILL.md +433 -0
  337. package/bin/skills/torchforge/references/api-reference.md +327 -0
  338. package/bin/skills/torchforge/references/troubleshooting.md +409 -0
  339. package/bin/skills/torchtitan/SKILL.md +358 -0
  340. package/bin/skills/torchtitan/references/checkpoint.md +181 -0
  341. package/bin/skills/torchtitan/references/custom-models.md +258 -0
  342. package/bin/skills/torchtitan/references/float8.md +133 -0
  343. package/bin/skills/torchtitan/references/fsdp.md +126 -0
  344. package/bin/skills/transformer-lens/SKILL.md +346 -0
  345. package/bin/skills/transformer-lens/references/README.md +54 -0
  346. package/bin/skills/transformer-lens/references/api.md +362 -0
  347. package/bin/skills/transformer-lens/references/tutorials.md +339 -0
  348. package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
  349. package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
  350. package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
  351. package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
  352. package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
  353. package/bin/skills/unsloth/SKILL.md +80 -0
  354. package/bin/skills/unsloth/references/index.md +7 -0
  355. package/bin/skills/unsloth/references/llms-full.md +16799 -0
  356. package/bin/skills/unsloth/references/llms-txt.md +12044 -0
  357. package/bin/skills/unsloth/references/llms.md +82 -0
  358. package/bin/skills/verl/SKILL.md +391 -0
  359. package/bin/skills/verl/references/api-reference.md +301 -0
  360. package/bin/skills/verl/references/troubleshooting.md +391 -0
  361. package/bin/skills/vllm/SKILL.md +364 -0
  362. package/bin/skills/vllm/references/optimization.md +226 -0
  363. package/bin/skills/vllm/references/quantization.md +284 -0
  364. package/bin/skills/vllm/references/server-deployment.md +255 -0
  365. package/bin/skills/vllm/references/troubleshooting.md +447 -0
  366. package/bin/skills/weights-and-biases/SKILL.md +590 -0
  367. package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
  368. package/bin/skills/weights-and-biases/references/integrations.md +700 -0
  369. package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
  370. package/bin/skills/whisper/SKILL.md +317 -0
  371. package/bin/skills/whisper/references/languages.md +189 -0
  372. package/bin/synsc +0 -0
  373. package/package.json +10 -0
@@ -0,0 +1,203 @@
1
+ # Common Training Patterns
2
+
3
+ This guide provides common training patterns and use cases for TRL on Hugging Face Jobs.
4
+
5
+ ## Multi-GPU Training
6
+
7
+ TRL and Accelerate distribute training across multiple GPUs automatically, so the training script itself needs no changes:
8
+
9
+ ```python
10
+ hf_jobs("uv", {
11
+ "script": """
12
+ # Your training script here (same as single GPU)
13
+ # No changes needed - Accelerate detects multiple GPUs
14
+ """,
15
+ "flavor": "a10g-largex2", # 2x A10G GPUs
16
+ "timeout": "4h",
17
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"}
18
+ })
19
+ ```
20
+
21
+ **Tips for multi-GPU:**
22
+ - No code changes needed
23
+ - Use `per_device_train_batch_size` (per GPU, not total)
24
+ - Effective batch size = `per_device_train_batch_size` × `num_gpus` × `gradient_accumulation_steps`
25
+ - Monitor GPU utilization to ensure both GPUs are being used
26
+
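The effective batch size formula in the tips above can be checked with a few lines of arithmetic. This is a minimal sketch: the helper name `effective_batch_size` and the example values are illustrative assumptions, not part of TRL or Accelerate.

```python
def effective_batch_size(per_device_train_batch_size: int,
                         num_gpus: int,
                         gradient_accumulation_steps: int = 1) -> int:
    """Number of samples contributing to one optimizer step across all GPUs."""
    return per_device_train_batch_size * num_gpus * gradient_accumulation_steps


# Hypothetical values: batch of 4 per GPU on a 2-GPU flavor, accumulating 8 steps.
print(effective_batch_size(4, 2, 8))  # 4 * 2 * 8 = 64
```

Moving the same script from one GPU to two doubles the effective batch size unless `per_device_train_batch_size` or `gradient_accumulation_steps` is halved to compensate.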
27
+ ## DPO Training (Preference Learning)
28
+
29
+ Train with preference data for alignment:
30
+
31
+ ```python
32
+ hf_jobs("uv", {
33
+ "script": """
34
+ # /// script
35
+ # dependencies = ["trl>=0.12.0", "trackio"]
36
+ # ///
37
+
38
+ from datasets import load_dataset
39
+ from trl import DPOTrainer, DPOConfig
40
+ import trackio
41
+
42
+ dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
43
+
44
+ # Create train/eval split
45
+ dataset_split = dataset.train_test_split(test_size=0.1, seed=42)
46
+
47
+ config = DPOConfig(
48
+ output_dir="dpo-model",
49
+ push_to_hub=True,
50
+ hub_model_id="username/dpo-model",
51
+ num_train_epochs=1,
52
+ beta=0.1, # KL penalty coefficient
53
+ eval_strategy="steps",
54
+ eval_steps=50,
55
+ report_to="trackio",
56
+ run_name="baseline_run", # use a meaningful run name
57
+ # max_length=1024, # Default - only set if you need different sequence length
58
+ )
59
+
60
+ trainer = DPOTrainer(
61
+ model="Qwen/Qwen2.5-0.5B-Instruct", # Use instruct model as base
62
+ train_dataset=dataset_split["train"],
63
+ eval_dataset=dataset_split["test"], # IMPORTANT: Provide eval_dataset when eval_strategy is enabled
64
+ args=config,
65
+ )
66
+
67
+ trainer.train()
68
+ trainer.push_to_hub()
69
+ trackio.finish()
70
+ """,
71
+ "flavor": "a10g-large",
72
+ "timeout": "3h",
73
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"}
74
+ })
75
+ ```
76
+
77
+ **For DPO documentation:** Use `hf_doc_fetch("https://huggingface.co/docs/trl/dpo_trainer")`
78
+
79
+ ## GRPO Training (Online RL)
80
+
81
+ Group Relative Policy Optimization for online reinforcement learning:
82
+
83
+ ```python
84
+ hf_jobs("uv", {
85
+ "script": "https://raw.githubusercontent.com/huggingface/trl/main/examples/scripts/grpo.py",
86
+ "script_args": [
87
+ "--model_name_or_path", "Qwen/Qwen2.5-0.5B-Instruct",
88
+ "--dataset_name", "trl-lib/math_shepherd",
89
+ "--output_dir", "grpo-model",
90
+ "--push_to_hub",
91
+ "--hub_model_id", "username/grpo-model"
92
+ ],
93
+ "flavor": "a10g-large",
94
+ "timeout": "4h",
95
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"}
96
+ })
97
+ ```
98
+
99
+ **For GRPO documentation:** Use `hf_doc_fetch("https://huggingface.co/docs/trl/grpo_trainer")`
100
+
101
+ ## Trackio Configuration
102
+
103
+ **Use sensible defaults for trackio setup.** See `references/trackio_guide.md` for complete documentation including grouping runs for experiments.
104
+
105
+ ### Basic Pattern
106
+
107
+ ```python
108
+ import trackio
109
+
110
+ trackio.init(
111
+ project="my-training",
112
+ run_name="baseline-run", # Descriptive name user will recognize
113
+ space_id="username/trackio", # Default space: {username}/trackio
114
+ config={
115
+ # Keep config minimal - hyperparameters and model/dataset info only
116
+ "model": "Qwen/Qwen2.5-0.5B",
117
+ "dataset": "trl-lib/Capybara",
118
+ "learning_rate": 2e-5,
119
+ }
120
+ )
121
+
122
+ # Your training code...
123
+
124
+ trackio.finish()
125
+ ```
126
+
127
+ ### Grouping for Experiments (Optional)
128
+
129
+ When the user wants to compare related runs, use the `group` parameter:
130
+
131
+ ```python
132
+ # Hyperparameter sweep
133
+ trackio.init(project="hyperparam-sweep", run_name="lr-0.001", group="lr_0.001")
134
+ trackio.init(project="hyperparam-sweep", run_name="lr-0.01", group="lr_0.01")
135
+ ```
136
+
137
+ ## Pattern Selection Guide
138
+
139
+ | Use Case | Pattern | Hardware | Time |
140
+ |----------|---------|----------|------|
141
+ | SFT training | `scripts/train_sft_example.py` | a10g-large | 2-6 hours |
142
+ | Large dataset (>10K) | Multi-GPU | a10g-largex2 | 4-12 hours |
143
+ | Preference learning | DPO Training | a10g-large | 2-4 hours |
144
+ | Online RL | GRPO Training | a10g-large | 3-6 hours |
145
+
146
+ ## Critical: Evaluation Dataset Requirements
147
+
148
+ **⚠️ IMPORTANT**: If you set `eval_strategy="steps"` or `eval_strategy="epoch"`, you **MUST** provide an `eval_dataset` to the trainer, or the training will hang.
149
+
150
+ ### ✅ CORRECT - With eval dataset:
151
+ ```python
152
+ dataset_split = dataset.train_test_split(test_size=0.1, seed=42)
153
+
154
+ trainer = SFTTrainer(
155
+ model="Qwen/Qwen2.5-0.5B",
156
+ train_dataset=dataset_split["train"],
157
+ eval_dataset=dataset_split["test"], # ← MUST provide when eval_strategy is enabled
158
+ args=SFTConfig(eval_strategy="steps", ...),
159
+ )
160
+ ```
161
+
162
+ ### ❌ WRONG - Will hang:
163
+ ```python
164
+ trainer = SFTTrainer(
165
+ model="Qwen/Qwen2.5-0.5B",
166
+ train_dataset=dataset,
167
+ # NO eval_dataset but eval_strategy="steps" ← WILL HANG
168
+ args=SFTConfig(eval_strategy="steps", ...),
169
+ )
170
+ ```
171
+
172
+ ### Option: Disable evaluation if no eval dataset
173
+ ```python
174
+ config = SFTConfig(
175
+ eval_strategy="no", # ← Explicitly disable evaluation
176
+ # ... other config
177
+ )
178
+
179
+ trainer = SFTTrainer(
180
+ model="Qwen/Qwen2.5-0.5B",
181
+ train_dataset=dataset,
182
+ # No eval_dataset needed
183
+ args=config,
184
+ )
185
+ ```
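
The rule above can be checked before the trainer is built, so that a misconfiguration fails fast instead of hanging the job. The following is a minimal sketch: `check_eval_setup` is a hypothetical helper (not part of TRL), and it only assumes the three `eval_strategy` values mentioned in this guide.

```python
def check_eval_setup(eval_strategy: str, has_eval_dataset: bool) -> None:
    """Raise early if evaluation is enabled without an eval dataset.

    Hypothetical pre-flight helper, not a TRL API.
    """
    evaluation_enabled = eval_strategy in ("steps", "epoch")
    if evaluation_enabled and not has_eval_dataset:
        raise ValueError(
            f'eval_strategy="{eval_strategy}" requires an eval_dataset; '
            'pass one to the trainer or set eval_strategy="no"'
        )


# Passes silently: evaluation disabled, no eval dataset needed.
check_eval_setup("no", has_eval_dataset=False)
# Passes silently: evaluation enabled and an eval dataset is available.
check_eval_setup("steps", has_eval_dataset=True)
```

Calling it with `eval_strategy="steps"` and `has_eval_dataset=False` raises `ValueError` locally, before any job is submitted.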
186
+
187
+ ## Best Practices
188
+
189
+ 1. **Use train/eval splits** - Create an evaluation split to monitor progress
190
+ 2. **Enable Trackio** - Monitor progress in real-time
191
+ 3. **Add 20-30% buffer to timeout** - Account for loading/saving overhead
192
+ 4. **Test with TRL official scripts first** - Use maintained examples before custom code
193
+ 5. **Always provide eval_dataset** - Required whenever `eval_strategy` is enabled; otherwise set `eval_strategy="no"`
194
+ 6. **Use multi-GPU for large models** - 7B+ models benefit significantly
195
+
196
+ ## See Also
197
+
198
+ - `scripts/train_sft_example.py` - Complete SFT template with Trackio and eval split
199
+ - `scripts/train_dpo_example.py` - Complete DPO template
200
+ - `scripts/train_grpo_example.py` - Complete GRPO template
201
+ - `references/hardware_guide.md` - Detailed hardware specifications
202
+ - `references/training_methods.md` - Overview of all TRL training methods
203
+ - `references/troubleshooting.md` - Common issues and solutions
@@ -0,0 +1,282 @@
1
+ # Troubleshooting TRL Training Jobs
2
+
3
+ Common issues and solutions when training with TRL on Hugging Face Jobs.
4
+
5
+ ## Training Hangs at "Starting training..." Step
6
+
7
+ **Problem:** Job starts but hangs at the training step - never progresses, never times out, just sits there.
8
+
9
+ **Root Cause:** Using `eval_strategy="steps"` or `eval_strategy="epoch"` without providing an `eval_dataset` to the trainer.
10
+
11
+ **Solution:**
12
+
13
+ **Option A: Provide eval_dataset (recommended)**
14
+ ```python
15
+ # Create train/eval split
16
+ dataset_split = dataset.train_test_split(test_size=0.1, seed=42)
17
+
18
+ trainer = SFTTrainer(
19
+ model="Qwen/Qwen2.5-0.5B",
20
+ train_dataset=dataset_split["train"],
21
+ eval_dataset=dataset_split["test"], # ← MUST provide when eval_strategy is enabled
22
+ args=SFTConfig(
23
+ eval_strategy="steps",
24
+ eval_steps=50,
25
+ ...
26
+ ),
27
+ )
28
+ ```
29
+
30
+ **Option B: Disable evaluation**
31
+ ```python
32
+ trainer = SFTTrainer(
33
+ model="Qwen/Qwen2.5-0.5B",
34
+ train_dataset=dataset,
35
+ # No eval_dataset
36
+ args=SFTConfig(
37
+ eval_strategy="no", # ← Explicitly disable
38
+ ...
39
+ ),
40
+ )
41
+ ```
42
+
43
+ **Prevention:**
44
+ - Always create train/eval split for better monitoring
45
+ - Use `dataset.train_test_split(test_size=0.1, seed=42)`
46
+ - Check example scripts: `scripts/train_sft_example.py` includes proper eval setup
47
+
48
+ ## Job Times Out
49
+
50
+ **Problem:** Job terminates before training completes, and all progress is lost.
51
+
52
+ **Solutions:**
53
+ - Increase timeout parameter (e.g., `"timeout": "4h"`)
54
+ - Reduce `num_train_epochs` or use a smaller dataset slice
55
+ - Use smaller model or enable LoRA/PEFT to speed up training
56
+ - Add 20-30% buffer to estimated time for loading/saving overhead
57
+
58
+ **Prevention:**
59
+ - Always start with a quick demo run to estimate timing
60
+ - Use `scripts/estimate_cost.py` to get time estimates
61
+ - Monitor first runs closely via Trackio or logs
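
The 20-30% buffer recommended above is simple arithmetic. As a sketch, the hypothetical helper below (`timeout_with_buffer` is not part of any Hugging Face tool) turns a time estimate into a whole-hour timeout string in the `"4h"` style used in this guide. Rounding up to whole hours is an assumption made here to stay on the safe side.

```python
import math


def timeout_with_buffer(estimated_hours: float, buffer: float = 0.3) -> str:
    """Return a timeout such as "4h": the estimate plus a buffer, rounded up.

    Hypothetical helper; `buffer` is a fraction (0.3 means 30% extra).
    """
    if estimated_hours <= 0:
        raise ValueError("estimated_hours must be positive")
    padded_hours = estimated_hours * (1 + buffer)
    return f"{math.ceil(padded_hours)}h"


print(timeout_with_buffer(3.0))        # 3h estimate + 30% buffer
print(timeout_with_buffer(2.0, 0.25))  # 2h estimate + 25% buffer
```

The result can be passed as the `"timeout"` value when submitting the job.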
62
+
63
+ ## Model Not Saved to Hub
64
+
65
+ **Problem:** Training completes but the model doesn't appear on the Hub, so all work is lost.
66
+
67
+ **Check:**
68
+ - [ ] `push_to_hub=True` in training config
69
+ - [ ] `hub_model_id` specified with username (e.g., `"username/model-name"`)
70
+ - [ ] `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job submission
71
+ - [ ] User has write access to target repo
72
+ - [ ] Token has write permissions (check at https://huggingface.co/settings/tokens)
73
+ - [ ] Training script calls `trainer.push_to_hub()` at the end
74
+
75
+ **See:** `references/hub_saving.md` for detailed Hub authentication troubleshooting
76
+
77
+ ## Out of Memory (OOM)
78
+
79
+ **Problem:** Job fails with CUDA out of memory error.
80
+
81
+ **Solutions (in order of preference):**
82
+ 1. **Reduce batch size:** Lower `per_device_train_batch_size` (try 4 → 2 → 1)
83
+ 2. **Increase gradient accumulation:** Raise `gradient_accumulation_steps` to maintain effective batch size
84
+ 3. **Disable evaluation:** Remove `eval_dataset` and set `eval_strategy="no"` (saves ~40% memory, good for demos)
85
+ 4. **Enable LoRA/PEFT:** Use `peft_config=LoraConfig(r=8, lora_alpha=16)` to train adapters only (smaller rank = less memory)
86
+ 5. **Use larger GPU:** Switch from `t4-small` → `l4x1` → `a10g-large` → `a100-large`
87
+ 6. **Enable gradient checkpointing:** Set `gradient_checkpointing=True` in config (slower but saves memory)
88
+ 7. **Use smaller model:** Try a smaller variant (e.g., 0.5B instead of 3B)
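
Steps 1 and 2 work together: the effective batch size is `per_device_train_batch_size × gradient_accumulation_steps × number of devices`, so halving the per-device batch size means doubling the accumulation steps. A minimal sketch of that arithmetic follows; `accumulation_steps_for` is a hypothetical helper name, not a TRL function.

```python
def accumulation_steps_for(
    effective_batch_size: int,
    per_device_batch_size: int,
    num_devices: int = 1,
) -> int:
    """Return the gradient_accumulation_steps that preserves the effective batch size."""
    samples_per_step = per_device_batch_size * num_devices
    if effective_batch_size % samples_per_step != 0:
        raise ValueError(
            "effective batch size must be a multiple of "
            "per_device_batch_size * num_devices"
        )
    return effective_batch_size // samples_per_step


# Keeping an effective batch size of 16 while shrinking the per-device batch:
for batch_size in (4, 2, 1):
    print(batch_size, accumulation_steps_for(16, batch_size))
```

The returned value is what would be set as `gradient_accumulation_steps` alongside the reduced `per_device_train_batch_size`.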
89
+
90
+ **Memory guidelines:**
91
+ - T4 (16GB): <1B models with LoRA
92
+ - A10G (24GB): 1-3B models with LoRA, <1B full fine-tune
93
+ - A100 (40GB/80GB): 7B+ models with LoRA, 3B full fine-tune
94
+
95
+ ## Parameter Naming Issues
96
+
97
+ **Problem:** `TypeError: SFTConfig.__init__() got an unexpected keyword argument 'max_seq_length'`
98
+
99
+ **Cause:** TRL config classes use `max_length`, not `max_seq_length`.
100
+
101
+ **Solution:**
102
+ ```python
103
+ # ✅ CORRECT - TRL uses max_length
104
+ SFTConfig(max_length=512)
105
+ DPOConfig(max_length=512)
106
+
107
+ # ❌ WRONG - This will fail
108
+ SFTConfig(max_seq_length=512)
109
+ ```
110
+
111
+ **Note:** Most TRL configs don't require explicit max_length - the default (1024) works well. Only set if you need a specific value.
112
+
113
+ ## Dataset Format Error
114
+
115
+ **Problem:** Training fails with dataset format errors or missing fields.
116
+
117
+ **Solutions:**
118
+ 1. **Check format documentation:**
119
+ ```python
120
+ hf_doc_fetch("https://huggingface.co/docs/trl/dataset_formats")
121
+ ```
122
+
123
+ 2. **Validate dataset before training:**
124
+ ```bash
125
+ uv run https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py \
126
+ --dataset <dataset-name> --split train
127
+ ```
128
+ Or via hf_jobs:
129
+ ```python
130
+ hf_jobs("uv", {
131
+ "script": "https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py",
132
+ "script_args": ["--dataset", "dataset-name", "--split", "train"]
133
+ })
134
+ ```
135
+
136
+ 3. **Verify field names:**
137
+ - **SFT:** Needs "messages" field (conversational), OR "text" field, OR "prompt"/"completion"
138
+ - **DPO:** Needs "chosen" and "rejected" fields
139
+ - **GRPO:** Needs prompt-only format
140
+
141
+ 4. **Check dataset split:**
142
+ - Ensure split exists (e.g., `split="train"`)
143
+ - Preview dataset: `load_dataset("name", split="train[:5]")`
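
The field-name rules in step 3 can be sketched as a quick check on a dataset's column names. `matching_trl_formats` is a hypothetical helper, and treating "prompt-only" as "has `prompt` and none of `completion`, `chosen`, `rejected`" is an assumption made for this sketch rather than TRL's own definition.

```python
def matching_trl_formats(columns: list[str]) -> list[str]:
    """Return which training methods the column names appear to fit.

    Hypothetical helper based only on the field names listed above.
    """
    cols = set(columns)
    matches = []
    # SFT: "messages", OR "text", OR the "prompt"/"completion" pair.
    if "messages" in cols or "text" in cols or {"prompt", "completion"} <= cols:
        matches.append("sft")
    # DPO: both "chosen" and "rejected" must be present.
    if {"chosen", "rejected"} <= cols:
        matches.append("dpo")
    # GRPO: prompt-only (assumed here to mean no answer-like columns).
    if "prompt" in cols and not ({"completion", "chosen", "rejected"} & cols):
        matches.append("grpo")
    return matches


print(matching_trl_formats(["prompt", "chosen", "rejected"]))
print(matching_trl_formats(["instruction", "output"]))
```

With the `datasets` library, the `column_names` attribute of a loaded split can be passed in directly. An empty result suggests the dataset needs to be mapped to one of the expected formats first.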
144
+
145
+ ## Import/Module Errors
146
+
147
+ **Problem:** Job fails with "ModuleNotFoundError" or import errors.
148
+
149
+ **Solutions:**
150
+ 1. **Add PEP 723 header with dependencies:**
151
+ ```python
152
+ # /// script
153
+ # dependencies = [
154
+ # "trl>=0.12.0",
155
+ # "peft>=0.7.0",
156
+ # "transformers>=4.36.0",
157
+ # ]
158
+ # ///
159
+ ```
160
+
161
+ 2. **Verify exact format:**
162
+ - Block must open with `# /// script` and close with `# ///` (with a space after `#`)
163
+ - Dependencies must be valid PyPI package names
164
+ - Check spelling and version constraints
165
+
166
+ 3. **Test locally first:**
167
+ ```bash
168
+ uv run train.py # Tests if dependencies are correct
169
+ ```
170
+
171
+ ## Authentication Errors
172
+
173
+ **Problem:** Job fails with authentication or permission errors when pushing to Hub.
174
+
175
+ **Solutions:**
176
+ 1. **Verify authentication:**
177
+ ```python
178
+ mcp__huggingface__hf_whoami() # Check who's authenticated
179
+ ```
180
+
181
+ 2. **Check token permissions:**
182
+ - Go to https://huggingface.co/settings/tokens
183
+ - Ensure token has "write" permission
184
+ - Token must not be "read-only"
185
+
186
+ 3. **Verify token in job:**
187
+ ```python
188
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # Must be in job config
189
+ ```
190
+
191
+ 4. **Check repo permissions:**
192
+ - User must have write access to target repo
193
+ - For an org repo, the user must be a member with write access
194
+ - Repo must exist or user must have permission to create
195
+
196
+ ## Job Stuck or Not Starting
197
+
198
+ **Problem:** Job shows "pending" or "starting" for extended period.
199
+
200
+ **Solutions:**
201
+ - Check Jobs dashboard for status: https://huggingface.co/jobs
202
+ - Verify hardware availability (some GPU types may have queues)
203
+ - Try different hardware flavor if one is heavily utilized
204
+ - Check for account billing issues (Jobs requires paid plan)
205
+
206
+ **Typical startup times:**
207
+ - CPU jobs: 10-30 seconds
208
+ - GPU jobs: 30-90 seconds
209
+ - If >3 minutes: likely queued or stuck
210
+
211
+ ## Training Loss Not Decreasing
212
+
213
+ **Problem:** Training runs but loss stays flat or doesn't improve.
214
+
215
+ **Solutions:**
216
+ 1. **Check learning rate:** If it is too low, try raising it to 2e-5 to 5e-5; if it is too high, try lowering it toward 1e-6
217
+ 2. **Verify dataset quality:** Inspect examples to ensure they're reasonable
218
+ 3. **Check model size:** Very small models may not have capacity for task
219
+ 4. **Increase training steps:** May need more epochs or larger dataset
220
+ 5. **Verify dataset format:** Wrong format may cause degraded training
221
+
222
+ ## Logs Not Appearing
223
+
224
+ **Problem:** Cannot see training logs or progress.
225
+
226
+ **Solutions:**
227
+ 1. **Wait 30-60 seconds:** Initial logs can be delayed
228
+ 2. **Check logs via MCP tool:**
229
+ ```python
230
+ hf_jobs("logs", {"job_id": "your-job-id"})
231
+ ```
232
+ 3. **Use Trackio for real-time monitoring:** See `references/trackio_guide.md`
233
+ 4. **Verify job is actually running:**
234
+ ```python
235
+ hf_jobs("inspect", {"job_id": "your-job-id"})
236
+ ```
237
+
238
+ ## Checkpoint/Resume Issues
239
+
240
+ **Problem:** Cannot resume from checkpoint or checkpoint not saved.
241
+
242
+ **Solutions:**
243
+ 1. **Enable checkpoint saving:**
244
+ ```python
245
+ SFTConfig(
246
+ save_strategy="steps",
247
+ save_steps=100,
248
+ hub_strategy="every_save", # Push each checkpoint
249
+ )
250
+ ```
251
+
252
+ 2. **Verify checkpoints pushed to Hub:** Check model repo for checkpoint folders
253
+
254
+ 3. **Resume from checkpoint:**
255
+ ```python
256
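+ trainer = SFTTrainer(model="username/model-name", train_dataset=dataset, args=config)
257
+ # resume_from_checkpoint is an argument of train(), not of the SFTTrainer constructor.
258
+ # It expects a local checkpoint directory, so download the checkpoint from the Hub first.
259
+ trainer.train(resume_from_checkpoint="path/to/checkpoint-1000")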
260
+ ```
261
+
262
+ ## Getting Help
263
+
264
+ If issues persist:
265
+
266
+ 1. **Check TRL documentation:**
267
+ ```python
268
+ hf_doc_search("your issue", product="trl")
269
+ ```
270
+
271
+ 2. **Check Jobs documentation:**
272
+ ```python
273
+ hf_doc_fetch("https://huggingface.co/docs/huggingface_hub/guides/jobs")
274
+ ```
275
+
276
+ 3. **Review related guides:**
277
+ - `references/hub_saving.md` - Hub authentication issues
278
+ - `references/hardware_guide.md` - Hardware selection and specs
279
+ - `references/training_patterns.md` - Eval dataset requirements
280
+ - SKILL.md "Working with Scripts" section - Script format and URL issues
281
+
282
+ 4. **Ask in HF forums:** https://discuss.huggingface.co/