@synsci/cli-darwin-x64 1.1.49

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (373)
  1. package/bin/skills/accelerate/SKILL.md +332 -0
  2. package/bin/skills/accelerate/references/custom-plugins.md +453 -0
  3. package/bin/skills/accelerate/references/megatron-integration.md +489 -0
  4. package/bin/skills/accelerate/references/performance.md +525 -0
  5. package/bin/skills/audiocraft/SKILL.md +564 -0
  6. package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
  7. package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
  8. package/bin/skills/autogpt/SKILL.md +403 -0
  9. package/bin/skills/autogpt/references/advanced-usage.md +535 -0
  10. package/bin/skills/autogpt/references/troubleshooting.md +420 -0
  11. package/bin/skills/awq/SKILL.md +310 -0
  12. package/bin/skills/awq/references/advanced-usage.md +324 -0
  13. package/bin/skills/awq/references/troubleshooting.md +344 -0
  14. package/bin/skills/axolotl/SKILL.md +158 -0
  15. package/bin/skills/axolotl/references/api.md +5548 -0
  16. package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
  17. package/bin/skills/axolotl/references/index.md +15 -0
  18. package/bin/skills/axolotl/references/other.md +3563 -0
  19. package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
  20. package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
  21. package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
  22. package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
  23. package/bin/skills/bitsandbytes/SKILL.md +411 -0
  24. package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
  25. package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
  26. package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
  27. package/bin/skills/blip-2/SKILL.md +564 -0
  28. package/bin/skills/blip-2/references/advanced-usage.md +680 -0
  29. package/bin/skills/blip-2/references/troubleshooting.md +526 -0
  30. package/bin/skills/chroma/SKILL.md +406 -0
  31. package/bin/skills/chroma/references/integration.md +38 -0
  32. package/bin/skills/clip/SKILL.md +253 -0
  33. package/bin/skills/clip/references/applications.md +207 -0
  34. package/bin/skills/constitutional-ai/SKILL.md +290 -0
  35. package/bin/skills/crewai/SKILL.md +498 -0
  36. package/bin/skills/crewai/references/flows.md +438 -0
  37. package/bin/skills/crewai/references/tools.md +429 -0
  38. package/bin/skills/crewai/references/troubleshooting.md +480 -0
  39. package/bin/skills/deepspeed/SKILL.md +141 -0
  40. package/bin/skills/deepspeed/references/08.md +17 -0
  41. package/bin/skills/deepspeed/references/09.md +173 -0
  42. package/bin/skills/deepspeed/references/2020.md +378 -0
  43. package/bin/skills/deepspeed/references/2023.md +279 -0
  44. package/bin/skills/deepspeed/references/assets.md +179 -0
  45. package/bin/skills/deepspeed/references/index.md +35 -0
  46. package/bin/skills/deepspeed/references/mii.md +118 -0
  47. package/bin/skills/deepspeed/references/other.md +1191 -0
  48. package/bin/skills/deepspeed/references/tutorials.md +6554 -0
  49. package/bin/skills/dspy/SKILL.md +590 -0
  50. package/bin/skills/dspy/references/examples.md +663 -0
  51. package/bin/skills/dspy/references/modules.md +475 -0
  52. package/bin/skills/dspy/references/optimizers.md +566 -0
  53. package/bin/skills/faiss/SKILL.md +221 -0
  54. package/bin/skills/faiss/references/index_types.md +280 -0
  55. package/bin/skills/flash-attention/SKILL.md +367 -0
  56. package/bin/skills/flash-attention/references/benchmarks.md +215 -0
  57. package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
  58. package/bin/skills/gguf/SKILL.md +427 -0
  59. package/bin/skills/gguf/references/advanced-usage.md +504 -0
  60. package/bin/skills/gguf/references/troubleshooting.md +442 -0
  61. package/bin/skills/gptq/SKILL.md +450 -0
  62. package/bin/skills/gptq/references/calibration.md +337 -0
  63. package/bin/skills/gptq/references/integration.md +129 -0
  64. package/bin/skills/gptq/references/troubleshooting.md +95 -0
  65. package/bin/skills/grpo-rl-training/README.md +97 -0
  66. package/bin/skills/grpo-rl-training/SKILL.md +572 -0
  67. package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
  68. package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
  69. package/bin/skills/guidance/SKILL.md +572 -0
  70. package/bin/skills/guidance/references/backends.md +554 -0
  71. package/bin/skills/guidance/references/constraints.md +674 -0
  72. package/bin/skills/guidance/references/examples.md +767 -0
  73. package/bin/skills/hqq/SKILL.md +445 -0
  74. package/bin/skills/hqq/references/advanced-usage.md +528 -0
  75. package/bin/skills/hqq/references/troubleshooting.md +503 -0
  76. package/bin/skills/hugging-face-cli/SKILL.md +191 -0
  77. package/bin/skills/hugging-face-cli/references/commands.md +954 -0
  78. package/bin/skills/hugging-face-cli/references/examples.md +374 -0
  79. package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
  80. package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
  81. package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
  82. package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
  83. package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
  84. package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
  85. package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
  86. package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
  87. package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
  88. package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
  89. package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
  90. package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
  91. package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
  92. package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
  93. package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
  94. package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
  95. package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
  96. package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
  97. package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
  98. package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
  99. package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
  100. package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
  101. package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
  102. package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
  103. package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
  104. package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
  105. package/bin/skills/hugging-face-jobs/index.html +216 -0
  106. package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
  107. package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
  108. package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
  109. package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
  110. package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
  111. package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
  112. package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
  113. package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
  114. package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
  115. package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
  116. package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
  117. package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
  118. package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
  119. package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
  120. package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
  121. package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
  122. package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
  123. package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
  124. package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
  125. package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
  126. package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
  127. package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
  128. package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
  129. package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
  130. package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
  131. package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
  132. package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
  133. package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
  134. package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
  135. package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
  136. package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
  137. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
  138. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
  139. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
  140. package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
  141. package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
  142. package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
  143. package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
  144. package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
  145. package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
  146. package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
  147. package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
  148. package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
  149. package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
  150. package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
  151. package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
  152. package/bin/skills/instructor/SKILL.md +740 -0
  153. package/bin/skills/instructor/references/examples.md +107 -0
  154. package/bin/skills/instructor/references/providers.md +70 -0
  155. package/bin/skills/instructor/references/validation.md +606 -0
  156. package/bin/skills/knowledge-distillation/SKILL.md +458 -0
  157. package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
  158. package/bin/skills/lambda-labs/SKILL.md +545 -0
  159. package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
  160. package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
  161. package/bin/skills/langchain/SKILL.md +480 -0
  162. package/bin/skills/langchain/references/agents.md +499 -0
  163. package/bin/skills/langchain/references/integration.md +562 -0
  164. package/bin/skills/langchain/references/rag.md +600 -0
  165. package/bin/skills/langsmith/SKILL.md +422 -0
  166. package/bin/skills/langsmith/references/advanced-usage.md +548 -0
  167. package/bin/skills/langsmith/references/troubleshooting.md +537 -0
  168. package/bin/skills/litgpt/SKILL.md +469 -0
  169. package/bin/skills/litgpt/references/custom-models.md +568 -0
  170. package/bin/skills/litgpt/references/distributed-training.md +451 -0
  171. package/bin/skills/litgpt/references/supported-models.md +336 -0
  172. package/bin/skills/litgpt/references/training-recipes.md +619 -0
  173. package/bin/skills/llama-cpp/SKILL.md +258 -0
  174. package/bin/skills/llama-cpp/references/optimization.md +89 -0
  175. package/bin/skills/llama-cpp/references/quantization.md +213 -0
  176. package/bin/skills/llama-cpp/references/server.md +125 -0
  177. package/bin/skills/llama-factory/SKILL.md +80 -0
  178. package/bin/skills/llama-factory/references/_images.md +23 -0
  179. package/bin/skills/llama-factory/references/advanced.md +1055 -0
  180. package/bin/skills/llama-factory/references/getting_started.md +349 -0
  181. package/bin/skills/llama-factory/references/index.md +19 -0
  182. package/bin/skills/llama-factory/references/other.md +31 -0
  183. package/bin/skills/llamaguard/SKILL.md +337 -0
  184. package/bin/skills/llamaindex/SKILL.md +569 -0
  185. package/bin/skills/llamaindex/references/agents.md +83 -0
  186. package/bin/skills/llamaindex/references/data_connectors.md +108 -0
  187. package/bin/skills/llamaindex/references/query_engines.md +406 -0
  188. package/bin/skills/llava/SKILL.md +304 -0
  189. package/bin/skills/llava/references/training.md +197 -0
  190. package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
  191. package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
  192. package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
  193. package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
  194. package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
  195. package/bin/skills/long-context/SKILL.md +536 -0
  196. package/bin/skills/long-context/references/extension_methods.md +468 -0
  197. package/bin/skills/long-context/references/fine_tuning.md +611 -0
  198. package/bin/skills/long-context/references/rope.md +402 -0
  199. package/bin/skills/mamba/SKILL.md +260 -0
  200. package/bin/skills/mamba/references/architecture-details.md +206 -0
  201. package/bin/skills/mamba/references/benchmarks.md +255 -0
  202. package/bin/skills/mamba/references/training-guide.md +388 -0
  203. package/bin/skills/megatron-core/SKILL.md +366 -0
  204. package/bin/skills/megatron-core/references/benchmarks.md +249 -0
  205. package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
  206. package/bin/skills/megatron-core/references/production-examples.md +473 -0
  207. package/bin/skills/megatron-core/references/training-recipes.md +547 -0
  208. package/bin/skills/miles/SKILL.md +315 -0
  209. package/bin/skills/miles/references/api-reference.md +141 -0
  210. package/bin/skills/miles/references/troubleshooting.md +352 -0
  211. package/bin/skills/mlflow/SKILL.md +704 -0
  212. package/bin/skills/mlflow/references/deployment.md +744 -0
  213. package/bin/skills/mlflow/references/model-registry.md +770 -0
  214. package/bin/skills/mlflow/references/tracking.md +680 -0
  215. package/bin/skills/modal/SKILL.md +341 -0
  216. package/bin/skills/modal/references/advanced-usage.md +503 -0
  217. package/bin/skills/modal/references/troubleshooting.md +494 -0
  218. package/bin/skills/model-merging/SKILL.md +539 -0
  219. package/bin/skills/model-merging/references/evaluation.md +462 -0
  220. package/bin/skills/model-merging/references/examples.md +428 -0
  221. package/bin/skills/model-merging/references/methods.md +352 -0
  222. package/bin/skills/model-pruning/SKILL.md +495 -0
  223. package/bin/skills/model-pruning/references/wanda.md +347 -0
  224. package/bin/skills/moe-training/SKILL.md +526 -0
  225. package/bin/skills/moe-training/references/architectures.md +432 -0
  226. package/bin/skills/moe-training/references/inference.md +348 -0
  227. package/bin/skills/moe-training/references/training.md +425 -0
  228. package/bin/skills/nanogpt/SKILL.md +290 -0
  229. package/bin/skills/nanogpt/references/architecture.md +382 -0
  230. package/bin/skills/nanogpt/references/data.md +476 -0
  231. package/bin/skills/nanogpt/references/training.md +564 -0
  232. package/bin/skills/nemo-curator/SKILL.md +383 -0
  233. package/bin/skills/nemo-curator/references/deduplication.md +87 -0
  234. package/bin/skills/nemo-curator/references/filtering.md +102 -0
  235. package/bin/skills/nemo-evaluator/SKILL.md +494 -0
  236. package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
  237. package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
  238. package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
  239. package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
  240. package/bin/skills/nemo-guardrails/SKILL.md +297 -0
  241. package/bin/skills/nnsight/SKILL.md +436 -0
  242. package/bin/skills/nnsight/references/README.md +78 -0
  243. package/bin/skills/nnsight/references/api.md +344 -0
  244. package/bin/skills/nnsight/references/tutorials.md +300 -0
  245. package/bin/skills/openrlhf/SKILL.md +249 -0
  246. package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
  247. package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
  248. package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
  249. package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
  250. package/bin/skills/outlines/SKILL.md +652 -0
  251. package/bin/skills/outlines/references/backends.md +615 -0
  252. package/bin/skills/outlines/references/examples.md +773 -0
  253. package/bin/skills/outlines/references/json_generation.md +652 -0
  254. package/bin/skills/peft/SKILL.md +431 -0
  255. package/bin/skills/peft/references/advanced-usage.md +514 -0
  256. package/bin/skills/peft/references/troubleshooting.md +480 -0
  257. package/bin/skills/phoenix/SKILL.md +475 -0
  258. package/bin/skills/phoenix/references/advanced-usage.md +619 -0
  259. package/bin/skills/phoenix/references/troubleshooting.md +538 -0
  260. package/bin/skills/pinecone/SKILL.md +358 -0
  261. package/bin/skills/pinecone/references/deployment.md +181 -0
  262. package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
  263. package/bin/skills/pytorch-fsdp/references/index.md +7 -0
  264. package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
  265. package/bin/skills/pytorch-lightning/SKILL.md +346 -0
  266. package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
  267. package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
  268. package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
  269. package/bin/skills/pyvene/SKILL.md +473 -0
  270. package/bin/skills/pyvene/references/README.md +73 -0
  271. package/bin/skills/pyvene/references/api.md +383 -0
  272. package/bin/skills/pyvene/references/tutorials.md +376 -0
  273. package/bin/skills/qdrant/SKILL.md +493 -0
  274. package/bin/skills/qdrant/references/advanced-usage.md +648 -0
  275. package/bin/skills/qdrant/references/troubleshooting.md +631 -0
  276. package/bin/skills/ray-data/SKILL.md +326 -0
  277. package/bin/skills/ray-data/references/integration.md +82 -0
  278. package/bin/skills/ray-data/references/transformations.md +83 -0
  279. package/bin/skills/ray-train/SKILL.md +406 -0
  280. package/bin/skills/ray-train/references/multi-node.md +628 -0
  281. package/bin/skills/rwkv/SKILL.md +260 -0
  282. package/bin/skills/rwkv/references/architecture-details.md +344 -0
  283. package/bin/skills/rwkv/references/rwkv7.md +386 -0
  284. package/bin/skills/rwkv/references/state-management.md +369 -0
  285. package/bin/skills/saelens/SKILL.md +386 -0
  286. package/bin/skills/saelens/references/README.md +70 -0
  287. package/bin/skills/saelens/references/api.md +333 -0
  288. package/bin/skills/saelens/references/tutorials.md +318 -0
  289. package/bin/skills/segment-anything/SKILL.md +500 -0
  290. package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
  291. package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
  292. package/bin/skills/sentence-transformers/SKILL.md +255 -0
  293. package/bin/skills/sentence-transformers/references/models.md +123 -0
  294. package/bin/skills/sentencepiece/SKILL.md +235 -0
  295. package/bin/skills/sentencepiece/references/algorithms.md +200 -0
  296. package/bin/skills/sentencepiece/references/training.md +304 -0
  297. package/bin/skills/sglang/SKILL.md +442 -0
  298. package/bin/skills/sglang/references/deployment.md +490 -0
  299. package/bin/skills/sglang/references/radix-attention.md +413 -0
  300. package/bin/skills/sglang/references/structured-generation.md +541 -0
  301. package/bin/skills/simpo/SKILL.md +219 -0
  302. package/bin/skills/simpo/references/datasets.md +478 -0
  303. package/bin/skills/simpo/references/hyperparameters.md +452 -0
  304. package/bin/skills/simpo/references/loss-functions.md +350 -0
  305. package/bin/skills/skypilot/SKILL.md +509 -0
  306. package/bin/skills/skypilot/references/advanced-usage.md +491 -0
  307. package/bin/skills/skypilot/references/troubleshooting.md +570 -0
  308. package/bin/skills/slime/SKILL.md +464 -0
  309. package/bin/skills/slime/references/api-reference.md +392 -0
  310. package/bin/skills/slime/references/troubleshooting.md +386 -0
  311. package/bin/skills/speculative-decoding/SKILL.md +467 -0
  312. package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
  313. package/bin/skills/speculative-decoding/references/medusa.md +350 -0
  314. package/bin/skills/stable-diffusion/SKILL.md +519 -0
  315. package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
  316. package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
  317. package/bin/skills/tensorboard/SKILL.md +629 -0
  318. package/bin/skills/tensorboard/references/integrations.md +638 -0
  319. package/bin/skills/tensorboard/references/profiling.md +545 -0
  320. package/bin/skills/tensorboard/references/visualization.md +620 -0
  321. package/bin/skills/tensorrt-llm/SKILL.md +187 -0
  322. package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
  323. package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
  324. package/bin/skills/tensorrt-llm/references/serving.md +470 -0
  325. package/bin/skills/tinker/SKILL.md +362 -0
  326. package/bin/skills/tinker/references/api-reference.md +168 -0
  327. package/bin/skills/tinker/references/getting-started.md +157 -0
  328. package/bin/skills/tinker/references/loss-functions.md +163 -0
  329. package/bin/skills/tinker/references/models-and-lora.md +139 -0
  330. package/bin/skills/tinker/references/recipes.md +280 -0
  331. package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
  332. package/bin/skills/tinker/references/rendering.md +243 -0
  333. package/bin/skills/tinker/references/supervised-learning.md +232 -0
  334. package/bin/skills/tinker-training-cost/SKILL.md +187 -0
  335. package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
  336. package/bin/skills/torchforge/SKILL.md +433 -0
  337. package/bin/skills/torchforge/references/api-reference.md +327 -0
  338. package/bin/skills/torchforge/references/troubleshooting.md +409 -0
  339. package/bin/skills/torchtitan/SKILL.md +358 -0
  340. package/bin/skills/torchtitan/references/checkpoint.md +181 -0
  341. package/bin/skills/torchtitan/references/custom-models.md +258 -0
  342. package/bin/skills/torchtitan/references/float8.md +133 -0
  343. package/bin/skills/torchtitan/references/fsdp.md +126 -0
  344. package/bin/skills/transformer-lens/SKILL.md +346 -0
  345. package/bin/skills/transformer-lens/references/README.md +54 -0
  346. package/bin/skills/transformer-lens/references/api.md +362 -0
  347. package/bin/skills/transformer-lens/references/tutorials.md +339 -0
  348. package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
  349. package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
  350. package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
  351. package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
  352. package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
  353. package/bin/skills/unsloth/SKILL.md +80 -0
  354. package/bin/skills/unsloth/references/index.md +7 -0
  355. package/bin/skills/unsloth/references/llms-full.md +16799 -0
  356. package/bin/skills/unsloth/references/llms-txt.md +12044 -0
  357. package/bin/skills/unsloth/references/llms.md +82 -0
  358. package/bin/skills/verl/SKILL.md +391 -0
  359. package/bin/skills/verl/references/api-reference.md +301 -0
  360. package/bin/skills/verl/references/troubleshooting.md +391 -0
  361. package/bin/skills/vllm/SKILL.md +364 -0
  362. package/bin/skills/vllm/references/optimization.md +226 -0
  363. package/bin/skills/vllm/references/quantization.md +284 -0
  364. package/bin/skills/vllm/references/server-deployment.md +255 -0
  365. package/bin/skills/vllm/references/troubleshooting.md +447 -0
  366. package/bin/skills/weights-and-biases/SKILL.md +590 -0
  367. package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
  368. package/bin/skills/weights-and-biases/references/integrations.md +700 -0
  369. package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
  370. package/bin/skills/whisper/SKILL.md +317 -0
  371. package/bin/skills/whisper/references/languages.md +189 -0
  372. package/bin/synsc +0 -0
  373. package/package.json +10 -0
@@ -0,0 +1,570 @@
1
+ # SkyPilot Troubleshooting Guide
2
+
3
+ ## Installation Issues
4
+
5
+ ### Cloud credentials not found
6
+
7
+ **Error**: `sky check` shows clouds as disabled
8
+
9
+ **Solutions**:
10
+ ```bash
11
+ # AWS
12
+ aws configure
13
+ # Verify: aws sts get-caller-identity
14
+
15
+ # GCP
16
+ gcloud auth application-default login
17
+ # Verify: gcloud auth list
18
+
19
+ # Azure
20
+ az login
21
+ az account set -s <subscription-id>
22
+
23
+ # Kubernetes
24
+ export KUBECONFIG=~/.kube/config
25
+ kubectl get nodes
26
+
27
+ # Re-check after configuration
28
+ sky check
29
+ ```
30
+
31
+ ### Permission errors
32
+
33
+ **Error**: `PermissionError` or `AccessDenied`
34
+
35
+ **Solutions**:
36
+ ```bash
37
+ # AWS: Ensure IAM permissions include EC2, S3, IAM
38
+ # Required policies: AmazonEC2FullAccess, AmazonS3FullAccess, IAMFullAccess
39
+
40
+ # GCP: Ensure roles include Compute Admin, Storage Admin
41
+ gcloud projects add-iam-policy-binding PROJECT_ID \
42
+ --member="user:email@example.com" \
43
+ --role="roles/compute.admin"
44
+
45
+ # Azure: Ensure Contributor role on subscription
46
+ az role assignment create \
47
+ --assignee email@example.com \
48
+ --role Contributor \
49
+ --scope /subscriptions/SUBSCRIPTION_ID
50
+ ```
51
+
52
+ ## Cluster Launch Issues
53
+
54
+ ### Quota exceeded
55
+
56
+ **Error**: `Quota exceeded for resource`
57
+
58
+ **Solutions**:
59
+ ```yaml
60
+ # Try different region
61
+ resources:
62
+ accelerators: A100:8
63
+ any_of:
64
+ - cloud: gcp
65
+ region: us-west1
66
+ - cloud: gcp
67
+ region: europe-west4
68
+ - cloud: aws
69
+ region: us-east-1
70
+
71
+ # Or request quota increase from cloud provider
72
+ ```
73
+
74
+ ```bash
75
+ # List GPU offerings on GCP before launching (quota limits themselves are shown in the cloud console)
76
+ sky show-gpus --cloud gcp
77
+ ```
78
+
79
+ ### GPU not available
80
+
81
+ **Error**: `No resources available in region`
82
+
83
+ **Solutions**:
84
+ ```yaml
85
+ # Use fallback accelerators
86
+ resources:
87
+ accelerators:
88
+ H100: 8
89
+ A100-80GB: 8
90
+ A100: 8
91
+ any_of:
92
+ - cloud: gcp
93
+ - cloud: aws
94
+ - cloud: azure
95
+ ```
96
+
97
+ ```bash
98
+ # Check GPU availability
99
+ sky show-gpus A100
100
+ sky show-gpus --cloud aws
101
+ ```
102
+
103
+ ### Instance type not found
104
+
105
+ **Error**: `Instance type 'xyz' not found`
106
+
107
+ **Solutions**:
108
+ ```yaml
109
+ # Let SkyPilot choose instance automatically
110
+ resources:
111
+ accelerators: A100:8
112
+ cpus: 96+
113
+ memory: 512+
114
+ # Don't specify instance_type unless necessary
115
+ ```
116
+
117
+ ### Cluster stuck in INIT
118
+
119
+ **Error**: Cluster stays in INIT state
120
+
121
+ **Solutions**:
122
+ ```bash
123
+ # Check cluster logs
124
+ sky logs mycluster --status
125
+
126
+ # SSH and check manually
127
+ ssh mycluster
128
+ journalctl -u sky-supervisor
129
+
130
+ # Terminate and retry
131
+ sky down mycluster
132
+ sky launch -c mycluster task.yaml
133
+ ```
134
+
135
+ ## Setup Command Issues
136
+
137
+ ### Setup script fails
138
+
139
+ **Error**: Setup commands fail during provisioning
140
+
141
+ **Solutions**:
142
+ ```yaml
143
+ # Add error handling and retries
144
+ setup: |
145
+ set -e # Exit on error
146
+
147
+ # Retry pip installs
148
+ for i in {1..3}; do
149
+ pip install torch transformers && break
150
+ echo "Retry $i..."
151
+ sleep 10
152
+ done
153
+
154
+ # Verify installation
155
+ python -c "import torch; print(torch.__version__)"
156
+ ```
157
+
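The inline retry loop above continues past the loop even when all three attempts fail, so the failure only surfaces at the verification step. A minimal sketch of a reusable alternative follows. The `retry` function is a hypothetical helper, not part of SkyPilot, and the attempt count and delay are illustrative values.

```bash
# Hypothetical helper (not provided by SkyPilot): retry a command with a fixed delay.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  # Run the command until it succeeds or the attempt budget is exhausted.
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Command failed after $attempt attempts: $*" >&2
      return 1
    fi
    echo "Attempt $attempt failed; retrying in ${delay}s..." >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example (inside a setup block): retry 3 10 pip install torch transformers
```

Because `retry` returns non-zero once the attempts are used up, a `setup` script running under `set -e` stops at the failing install instead of carrying on with a broken environment.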
158
+ ### Conda environment issues
159
+
160
+ **Error**: Conda not found or environment issues
161
+
162
+ **Solutions**:
163
+ ```yaml
164
+ setup: |
165
+ # Initialize conda for bash
166
+ source ~/.bashrc
167
+
168
+ # Or use full path
169
+ ~/miniconda3/bin/conda create -n myenv python=3.10 -y
170
+ source ~/miniconda3/etc/profile.d/conda.sh && conda activate myenv
171
+ ```
172
+
173
+ ### CUDA version mismatch
174
+
175
+ **Error**: `CUDA driver version is insufficient`
176
+
177
+ **Solutions**:
178
+ ```yaml
179
+ setup: |
180
+ # Install a PyTorch build whose CUDA version the node's driver supports (check with nvidia-smi)
181
+ pip install torch==2.1.0+cu121 -f https://download.pytorch.org/whl/torch_stable.html
182
+
183
+ # Verify CUDA
184
+ python -c "import torch; print(torch.cuda.is_available())"
185
+ ```
186
+
187
+ ## Distributed Training Issues
188
+
189
+ ### Nodes can't communicate
190
+
191
+ **Error**: Connection refused between nodes
192
+
193
+ **Solutions**:
194
+ ```yaml
195
+ run: |
196
+ # Debug: Print all node IPs
197
+ echo "All nodes: $SKYPILOT_NODE_IPS"
198
+ echo "My rank: $SKYPILOT_NODE_RANK"
199
+
200
+ # Wait for all nodes to be ready
201
+ sleep 30
202
+
203
+ # Use correct master address
204
+ MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
205
+ echo "Master: $MASTER_ADDR"
206
+ ```
207
+
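The snippet above relies on `SKYPILOT_NODE_IPS` holding one IP per line, with the head node first. A small sketch of helpers that make that assumption explicit and let a script check the node count before starting training. The function names `head_node_ip` and `count_nodes` are illustrative, not SkyPilot commands.

```bash
# Hypothetical helpers; they only parse a newline-separated IP list.

# Print the first IP in the list (the head node).
head_node_ip() {
  printf '%s\n' "$1" | head -n1
}

# Print the number of non-empty lines, i.e. the number of nodes listed.
count_nodes() {
  printf '%s\n' "$1" | grep -c .
}

# Example (inside a run block):
#   MASTER_ADDR=$(head_node_ip "$SKYPILOT_NODE_IPS")
#   [ "$(count_nodes "$SKYPILOT_NODE_IPS")" -eq "$SKYPILOT_NUM_NODES" ] || echo "Node list incomplete" >&2
```

Comparing the counted nodes against `SKYPILOT_NUM_NODES` is a cheap way to catch an incomplete node list before a rendezvous timeout does.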
208
+ ### torchrun fails
209
+
210
+ **Error**: `torch.distributed` errors
211
+
212
+ **Solutions**:
213
+ ```yaml
214
+ run: |
215
+ # Ensure correct environment variables
216
+ export NCCL_DEBUG=INFO
217
+ export NCCL_IB_DISABLE=1 # Try if InfiniBand issues
218
+
219
+ torchrun \
220
+ --nnodes=$SKYPILOT_NUM_NODES \
221
+ --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
222
+ --node_rank=$SKYPILOT_NODE_RANK \
223
+ --master_addr=$(echo "$SKYPILOT_NODE_IPS" | head -n1) \
224
+ --master_port=12355 \
225
+ --rdzv_backend=static \
226
+ train.py
227
+ ```
228
+
229
+ ### DeepSpeed hostfile errors
230
+
231
+ **Error**: `Invalid hostfile` or connection errors
232
+
233
+ **Solutions**:
234
+ ```yaml
235
+ run: |
236
+ # Create proper hostfile
237
+ echo "$SKYPILOT_NODE_IPS" | while read ip; do
238
+ echo "$ip slots=$SKYPILOT_NUM_GPUS_PER_NODE"
239
+ done > /tmp/hostfile
240
+
241
+ cat /tmp/hostfile # Debug
242
+
243
+ deepspeed --hostfile=/tmp/hostfile train.py
244
+ ```
245
+
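The hostfile loop above can be wrapped in a function so the output is easy to inspect before handing it to `deepspeed`. This is a sketch under the same assumption as the original snippet (one IP per line in the input); `make_hostfile` is a hypothetical name, and the `slots=` format follows the `<host> slots=<n>` layout used in the example above.

```bash
# Hypothetical helper: print a DeepSpeed-style hostfile to stdout.
# Usage: make_hostfile "<newline-separated IPs>" <gpus_per_node>
make_hostfile() {
  local ips=$1 slots=$2 ip
  printf '%s\n' "$ips" | while IFS= read -r ip; do
    # Skip blank lines so a trailing newline does not produce an empty host entry.
    if [ -n "$ip" ]; then
      echo "$ip slots=$slots"
    fi
  done
}

# Example (inside a run block):
#   make_hostfile "$SKYPILOT_NODE_IPS" "$SKYPILOT_NUM_GPUS_PER_NODE" > /tmp/hostfile
```

Skipping empty lines avoids the malformed ` slots=N` entry that a stray blank line in the IP list would otherwise produce.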
246
+ ## File Mount Issues
247
+
248
+ ### Mount fails
249
+
250
+ **Error**: `Failed to mount storage`
251
+
252
+ **Solutions**:
253
+ ```yaml
254
+ # Verify bucket exists and credentials are valid
255
+ file_mounts:
256
+ /data:
257
+ source: s3://my-bucket/data
258
+ mode: MOUNT
259
+
260
+ # Check bucket access
261
+ # aws s3 ls s3://my-bucket/
262
+ ```
+
+ ### Slow file access
+
+ **Problem**: Reading from mount is very slow
+
+ **Solutions**:
+ ```yaml
+ file_mounts:
+   # Use COPY mode for small datasets
+   /data:
+     source: s3://bucket/data
+     mode: COPY  # Pre-fetch to local disk
+
+   # Use MOUNT_CACHED for outputs
+   /outputs:
+     name: outputs
+     store: s3
+     mode: MOUNT_CACHED  # Cached writes
+ ```
+
+ ### Storage not persisting
+
+ **Error**: Data lost after cluster restart
+
+ **Solutions**:
+ ```yaml
+ # Use named storage (persists across clusters)
+ file_mounts:
+   /persistent:
+     name: my-persistent-storage
+     store: s3
+     mode: MOUNT
+
+ # Data in ~/sky_workdir is NOT persisted
+ # Always use file_mounts for persistent data
+ ```
+
+ ## Managed Job Issues
+
+ ### Job keeps failing
+
+ **Error**: Job fails and doesn't recover
+
+ **Solutions**:
+ ```yaml
+ # Enable spot recovery and retry on user-code errors
+ # (releases before the rename use `spot_recovery: FAILOVER` instead)
+ resources:
+   use_spot: true
+   job_recovery:
+     strategy: FAILOVER
+     max_restarts_on_errors: 5
+
+ # Implement checkpointing
+ run: |
+   python train.py \
+     --checkpoint-dir /checkpoints \
+     --resume-from-latest
+ ```
+
+ ### Job stuck in pending
+
+ **Error**: Job stays in PENDING state
+
+ **Solutions**:
+ ```bash
+ # Check the job queue and each job's state
+ sky jobs queue
+
+ # View controller logs for a job (job ID taken from the queue)
+ sky jobs logs --controller 1
+
+ # Check that the jobs controller cluster is up
+ sky status
+ ```
+
+ ### Checkpoint not resuming
+
+ **Error**: Training restarts from beginning
+
+ **Solutions**:
+ ```yaml
+ file_mounts:
+   /checkpoints:
+     name: training-checkpoints
+     store: s3
+     mode: MOUNT_CACHED
+
+ run: |
+   # Check for existing checkpoint
+   if [ -d "/checkpoints/latest" ]; then
+     RESUME_FLAG="--resume /checkpoints/latest"
+   else
+     RESUME_FLAG=""
+   fi
+
+   python train.py $RESUME_FLAG --checkpoint-dir /checkpoints
+ ```
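The shell check above only works if training maintains a `/checkpoints/latest` directory. If the training script instead writes step-numbered directories, it can pick the newest one itself. The sketch below assumes a hypothetical `step_<N>` naming scheme; adjust the pattern to whatever your training code writes.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical naming scheme: step_100, step_200, ...
_STEP_RE = re.compile(r"^step_(\d+)$")


def latest_checkpoint(ckpt_dir: str) -> Optional[Path]:
    """Return the step_<N> subdirectory with the highest N, or None."""
    root = Path(ckpt_dir)
    if not root.is_dir():
        return None
    best, best_step = None, -1
    for child in root.iterdir():
        match = _STEP_RE.match(child.name)
        if match and child.is_dir() and int(match.group(1)) > best_step:
            best, best_step = child, int(match.group(1))
    return best
```

Comparing the parsed integer rather than the directory name avoids `step_20` sorting after `step_100`.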
+
+ ## Sky Serve Issues
+
+ ### Service not accessible
+
+ **Error**: Cannot reach service endpoint
+
+ **Solutions**:
+ ```bash
+ # Check service status
+ sky serve status my-service
+
+ # View logs for replica 1
+ sky serve logs my-service 1
+
+ # Print the service endpoint
+ sky serve status my-service --endpoint
+ ```
+
+ ### Replicas keep crashing
+
+ **Error**: Replicas fail health checks
+
+ **Solutions**:
+ ```yaml
+ service:
+   readiness_probe:
+     path: /health
+     initial_delay_seconds: 120  # Increase for slow model loading
+     timeout_seconds: 10
+
+ run: |
+   # Ensure health endpoint exists and is served
+   # (port 8080 is an example; it must match the port the service exposes)
+   python -c "
+   import uvicorn
+   from fastapi import FastAPI
+   app = FastAPI()
+
+   @app.get('/health')
+   def health():
+       return {'status': 'ok'}
+
+   uvicorn.run(app, host='0.0.0.0', port=8080)
+   "
+ ```
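Before raising `initial_delay_seconds`, it can help to confirm that the health endpoint answers at all when the server runs locally or on a replica. The following is a small standard-library sketch of such a check; `is_ready` is a hypothetical helper and only approximates what a readiness probe does (a GET that must return a 2xx status within the timeout).

```python
import urllib.error
import urllib.request


def is_ready(url: str, timeout_s: float = 10.0) -> bool:
    """Return True if a GET to `url` answers with a 2xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections, and timeouts
        return False
```

For example, `is_ready("http://localhost:8080/health")` could be run on the replica over SSH, with the port replaced by the one your server listens on.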
+
+ ### Autoscaling not working
+
+ **Problem**: Service doesn't scale up/down
+
+ **Solutions**:
+ ```yaml
+ service:
+   replica_policy:
+     min_replicas: 1
+     max_replicas: 10
+     target_qps_per_replica: 2.0
+     upscale_delay_seconds: 30  # Faster scale up
+     downscale_delay_seconds: 60  # Faster scale down
+
+ # Monitor metrics
+ # sky serve status my-service
+ ```
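As a rough mental model (a simplification, not SkyPilot's actual autoscaler code), the replica count tracks the observed request rate divided by `target_qps_per_replica`, clamped to the configured bounds. The function name below is hypothetical, and the delay settings, which control how long the load must persist before scaling, are not modeled.

```python
import math


def target_replicas(qps: float, target_qps_per_replica: float,
                    min_replicas: int, max_replicas: int) -> int:
    """Estimate the replica count the policy is aiming for at a given load."""
    if target_qps_per_replica <= 0:
        raise ValueError("target_qps_per_replica must be positive")
    wanted = math.ceil(qps / target_qps_per_replica)
    # Clamp to the configured [min_replicas, max_replicas] range
    return max(min_replicas, min(max_replicas, wanted))
```

With the settings above, a sustained 7 QPS gives `target_replicas(7, 2.0, 1, 10) == 4`. If the service never scales, check whether the observed load actually crosses a multiple of the target.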
+
+ ## SSH and Access Issues
+
+ ### Cannot SSH to cluster
+
+ **Error**: `Connection refused` or timeout
+
+ **Solutions**:
+ ```bash
+ # Verify cluster is running
+ sky status
+
+ # Try with verbose output
+ ssh -v mycluster
+
+ # Check SSH key
+ ls -la ~/.ssh/sky-key*
+
+ # Regenerate SSH key if needed
+ sky launch -c test --dryrun  # Regenerates key
+ ```
+
+ ### Port forwarding fails
+
+ **Error**: Cannot forward ports
+
+ **Solutions**:
+ ```bash
+ # Correct syntax
+ ssh -L 8080:localhost:8080 mycluster
+
+ # For Jupyter
+ ssh -L 8888:localhost:8888 mycluster
+
+ # Multiple ports
+ ssh -L 8080:localhost:8080 -L 6006:localhost:6006 mycluster
+ ```
+
+ ## Cost and Billing Issues
+
+ ### Unexpected charges
+
+ **Problem**: Higher than expected costs
+
+ **Solutions**:
+ ```bash
+ # Terminate all clusters (use `sky down mycluster` to remove just one)
+ sky down --all
+
+ # Set autostop: tear down after 30 idle minutes
+ sky autostop mycluster -i 30 --down
+
+ # Use spot instances by setting this in the task YAML:
+ #   resources:
+ #     use_spot: true
+ ```
+
+ ### Spot instance preempted
+
+ **Error**: Instance terminated unexpectedly
+
+ **Solutions**:
+ ```yaml
+ # Use managed jobs for automatic recovery
+ # sky jobs launch instead of sky launch
+
+ # (releases before the rename use `spot_recovery: FAILOVER` instead)
+ resources:
+   use_spot: true
+   job_recovery:
+     strategy: FAILOVER  # Auto-failover to another region/cloud
+
+ # Always checkpoint frequently when using spot
+ ```
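Frequent checkpointing only helps if a preemption in the middle of a write cannot leave a truncated file behind. One common approach, sketched here with JSON state for brevity, is to write to a temporary file and rename it into place. The function name is hypothetical, and the sketch assumes the checkpoint directory supports atomic rename, which may not hold for every bucket mount mode.

```python
import json
import os
import tempfile


def save_checkpoint_atomic(state: dict, path: str) -> None:
    """Write `state` to `path` so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # Temp file lives in the same directory so the rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # replaces any previous checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The same pattern applies to framework checkpoints: save to a temporary name first, then rename once the write has completed.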
+
+ ## Debugging Commands
+
+ ### View cluster state
+
+ ```bash
+ # Cluster status
+ sky status
+ sky status -a  # Show all details
+
+ # GPU types available across clouds
+ sky show-gpus
+
+ # Cloud credentials
+ sky check
+ ```
+
+ ### View logs
+
+ ```bash
+ # Task logs
+ sky logs mycluster
+ sky logs mycluster 1  # Specific job
+
+ # Managed job logs (select by name with -n, or pass a job ID)
+ sky jobs logs -n my-job
+ sky jobs logs -n my-job --follow
+
+ # Service logs for replica 1
+ sky serve logs my-service 1
+ ```
+
+ ### Inspect cluster
+
+ ```bash
+ # SSH to cluster
+ ssh mycluster
+
+ # Check GPU status
+ nvidia-smi
+
+ # Check processes
+ ps aux | grep python
+
+ # Check disk space
+ df -h
+ ```
+
+ ## Common Error Messages
+
+ | Error | Cause | Solution |
+ |-------|-------|----------|
+ | `No launchable resources` | No available instances | Try different region/cloud |
+ | `Quota exceeded` | Cloud quota limit | Request increase or use different cloud |
+ | `Setup failed` | Script error | Check logs, add error handling |
+ | `Connection refused` | Network/firewall | Check security groups, wait for init |
+ | `CUDA OOM` | Out of GPU memory | Use larger GPU or reduce batch size |
+ | `Spot preempted` | Spot instance reclaimed | Use managed jobs for auto-recovery |
+ | `Mount failed` | Storage access issue | Check credentials and bucket exists |
+
+ ## Getting Help
+
+ 1. **Documentation**: https://docs.skypilot.co
+ 2. **GitHub Issues**: https://github.com/skypilot-org/skypilot/issues
+ 3. **Slack**: https://slack.skypilot.co
+ 4. **Examples**: https://github.com/skypilot-org/skypilot/tree/master/examples
+
+ ### Reporting Issues
+
+ Include:
+ - SkyPilot version: `sky --version`
+ - Python version: `python --version`
+ - Cloud provider and region
+ - Full error traceback
+ - Task YAML (sanitized)
+ - Output of `sky check`