@synsci/cli-darwin-x64 1.1.49

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (373)
  1. package/bin/skills/accelerate/SKILL.md +332 -0
  2. package/bin/skills/accelerate/references/custom-plugins.md +453 -0
  3. package/bin/skills/accelerate/references/megatron-integration.md +489 -0
  4. package/bin/skills/accelerate/references/performance.md +525 -0
  5. package/bin/skills/audiocraft/SKILL.md +564 -0
  6. package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
  7. package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
  8. package/bin/skills/autogpt/SKILL.md +403 -0
  9. package/bin/skills/autogpt/references/advanced-usage.md +535 -0
  10. package/bin/skills/autogpt/references/troubleshooting.md +420 -0
  11. package/bin/skills/awq/SKILL.md +310 -0
  12. package/bin/skills/awq/references/advanced-usage.md +324 -0
  13. package/bin/skills/awq/references/troubleshooting.md +344 -0
  14. package/bin/skills/axolotl/SKILL.md +158 -0
  15. package/bin/skills/axolotl/references/api.md +5548 -0
  16. package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
  17. package/bin/skills/axolotl/references/index.md +15 -0
  18. package/bin/skills/axolotl/references/other.md +3563 -0
  19. package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
  20. package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
  21. package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
  22. package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
  23. package/bin/skills/bitsandbytes/SKILL.md +411 -0
  24. package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
  25. package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
  26. package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
  27. package/bin/skills/blip-2/SKILL.md +564 -0
  28. package/bin/skills/blip-2/references/advanced-usage.md +680 -0
  29. package/bin/skills/blip-2/references/troubleshooting.md +526 -0
  30. package/bin/skills/chroma/SKILL.md +406 -0
  31. package/bin/skills/chroma/references/integration.md +38 -0
  32. package/bin/skills/clip/SKILL.md +253 -0
  33. package/bin/skills/clip/references/applications.md +207 -0
  34. package/bin/skills/constitutional-ai/SKILL.md +290 -0
  35. package/bin/skills/crewai/SKILL.md +498 -0
  36. package/bin/skills/crewai/references/flows.md +438 -0
  37. package/bin/skills/crewai/references/tools.md +429 -0
  38. package/bin/skills/crewai/references/troubleshooting.md +480 -0
  39. package/bin/skills/deepspeed/SKILL.md +141 -0
  40. package/bin/skills/deepspeed/references/08.md +17 -0
  41. package/bin/skills/deepspeed/references/09.md +173 -0
  42. package/bin/skills/deepspeed/references/2020.md +378 -0
  43. package/bin/skills/deepspeed/references/2023.md +279 -0
  44. package/bin/skills/deepspeed/references/assets.md +179 -0
  45. package/bin/skills/deepspeed/references/index.md +35 -0
  46. package/bin/skills/deepspeed/references/mii.md +118 -0
  47. package/bin/skills/deepspeed/references/other.md +1191 -0
  48. package/bin/skills/deepspeed/references/tutorials.md +6554 -0
  49. package/bin/skills/dspy/SKILL.md +590 -0
  50. package/bin/skills/dspy/references/examples.md +663 -0
  51. package/bin/skills/dspy/references/modules.md +475 -0
  52. package/bin/skills/dspy/references/optimizers.md +566 -0
  53. package/bin/skills/faiss/SKILL.md +221 -0
  54. package/bin/skills/faiss/references/index_types.md +280 -0
  55. package/bin/skills/flash-attention/SKILL.md +367 -0
  56. package/bin/skills/flash-attention/references/benchmarks.md +215 -0
  57. package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
  58. package/bin/skills/gguf/SKILL.md +427 -0
  59. package/bin/skills/gguf/references/advanced-usage.md +504 -0
  60. package/bin/skills/gguf/references/troubleshooting.md +442 -0
  61. package/bin/skills/gptq/SKILL.md +450 -0
  62. package/bin/skills/gptq/references/calibration.md +337 -0
  63. package/bin/skills/gptq/references/integration.md +129 -0
  64. package/bin/skills/gptq/references/troubleshooting.md +95 -0
  65. package/bin/skills/grpo-rl-training/README.md +97 -0
  66. package/bin/skills/grpo-rl-training/SKILL.md +572 -0
  67. package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
  68. package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
  69. package/bin/skills/guidance/SKILL.md +572 -0
  70. package/bin/skills/guidance/references/backends.md +554 -0
  71. package/bin/skills/guidance/references/constraints.md +674 -0
  72. package/bin/skills/guidance/references/examples.md +767 -0
  73. package/bin/skills/hqq/SKILL.md +445 -0
  74. package/bin/skills/hqq/references/advanced-usage.md +528 -0
  75. package/bin/skills/hqq/references/troubleshooting.md +503 -0
  76. package/bin/skills/hugging-face-cli/SKILL.md +191 -0
  77. package/bin/skills/hugging-face-cli/references/commands.md +954 -0
  78. package/bin/skills/hugging-face-cli/references/examples.md +374 -0
  79. package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
  80. package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
  81. package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
  82. package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
  83. package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
  84. package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
  85. package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
  86. package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
  87. package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
  88. package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
  89. package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
  90. package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
  91. package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
  92. package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
  93. package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
  94. package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
  95. package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
  96. package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
  97. package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
  98. package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
  99. package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
  100. package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
  101. package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
  102. package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
  103. package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
  104. package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
  105. package/bin/skills/hugging-face-jobs/index.html +216 -0
  106. package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
  107. package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
  108. package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
  109. package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
  110. package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
  111. package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
  112. package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
  113. package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
  114. package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
  115. package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
  116. package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
  117. package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
  118. package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
  119. package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
  120. package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
  121. package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
  122. package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
  123. package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
  124. package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
  125. package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
  126. package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
  127. package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
  128. package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
  129. package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
  130. package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
  131. package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
  132. package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
  133. package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
  134. package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
  135. package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
  136. package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
  137. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
  138. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
  139. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
  140. package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
  141. package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
  142. package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
  143. package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
  144. package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
  145. package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
  146. package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
  147. package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
  148. package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
  149. package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
  150. package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
  151. package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
  152. package/bin/skills/instructor/SKILL.md +740 -0
  153. package/bin/skills/instructor/references/examples.md +107 -0
  154. package/bin/skills/instructor/references/providers.md +70 -0
  155. package/bin/skills/instructor/references/validation.md +606 -0
  156. package/bin/skills/knowledge-distillation/SKILL.md +458 -0
  157. package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
  158. package/bin/skills/lambda-labs/SKILL.md +545 -0
  159. package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
  160. package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
  161. package/bin/skills/langchain/SKILL.md +480 -0
  162. package/bin/skills/langchain/references/agents.md +499 -0
  163. package/bin/skills/langchain/references/integration.md +562 -0
  164. package/bin/skills/langchain/references/rag.md +600 -0
  165. package/bin/skills/langsmith/SKILL.md +422 -0
  166. package/bin/skills/langsmith/references/advanced-usage.md +548 -0
  167. package/bin/skills/langsmith/references/troubleshooting.md +537 -0
  168. package/bin/skills/litgpt/SKILL.md +469 -0
  169. package/bin/skills/litgpt/references/custom-models.md +568 -0
  170. package/bin/skills/litgpt/references/distributed-training.md +451 -0
  171. package/bin/skills/litgpt/references/supported-models.md +336 -0
  172. package/bin/skills/litgpt/references/training-recipes.md +619 -0
  173. package/bin/skills/llama-cpp/SKILL.md +258 -0
  174. package/bin/skills/llama-cpp/references/optimization.md +89 -0
  175. package/bin/skills/llama-cpp/references/quantization.md +213 -0
  176. package/bin/skills/llama-cpp/references/server.md +125 -0
  177. package/bin/skills/llama-factory/SKILL.md +80 -0
  178. package/bin/skills/llama-factory/references/_images.md +23 -0
  179. package/bin/skills/llama-factory/references/advanced.md +1055 -0
  180. package/bin/skills/llama-factory/references/getting_started.md +349 -0
  181. package/bin/skills/llama-factory/references/index.md +19 -0
  182. package/bin/skills/llama-factory/references/other.md +31 -0
  183. package/bin/skills/llamaguard/SKILL.md +337 -0
  184. package/bin/skills/llamaindex/SKILL.md +569 -0
  185. package/bin/skills/llamaindex/references/agents.md +83 -0
  186. package/bin/skills/llamaindex/references/data_connectors.md +108 -0
  187. package/bin/skills/llamaindex/references/query_engines.md +406 -0
  188. package/bin/skills/llava/SKILL.md +304 -0
  189. package/bin/skills/llava/references/training.md +197 -0
  190. package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
  191. package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
  192. package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
  193. package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
  194. package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
  195. package/bin/skills/long-context/SKILL.md +536 -0
  196. package/bin/skills/long-context/references/extension_methods.md +468 -0
  197. package/bin/skills/long-context/references/fine_tuning.md +611 -0
  198. package/bin/skills/long-context/references/rope.md +402 -0
  199. package/bin/skills/mamba/SKILL.md +260 -0
  200. package/bin/skills/mamba/references/architecture-details.md +206 -0
  201. package/bin/skills/mamba/references/benchmarks.md +255 -0
  202. package/bin/skills/mamba/references/training-guide.md +388 -0
  203. package/bin/skills/megatron-core/SKILL.md +366 -0
  204. package/bin/skills/megatron-core/references/benchmarks.md +249 -0
  205. package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
  206. package/bin/skills/megatron-core/references/production-examples.md +473 -0
  207. package/bin/skills/megatron-core/references/training-recipes.md +547 -0
  208. package/bin/skills/miles/SKILL.md +315 -0
  209. package/bin/skills/miles/references/api-reference.md +141 -0
  210. package/bin/skills/miles/references/troubleshooting.md +352 -0
  211. package/bin/skills/mlflow/SKILL.md +704 -0
  212. package/bin/skills/mlflow/references/deployment.md +744 -0
  213. package/bin/skills/mlflow/references/model-registry.md +770 -0
  214. package/bin/skills/mlflow/references/tracking.md +680 -0
  215. package/bin/skills/modal/SKILL.md +341 -0
  216. package/bin/skills/modal/references/advanced-usage.md +503 -0
  217. package/bin/skills/modal/references/troubleshooting.md +494 -0
  218. package/bin/skills/model-merging/SKILL.md +539 -0
  219. package/bin/skills/model-merging/references/evaluation.md +462 -0
  220. package/bin/skills/model-merging/references/examples.md +428 -0
  221. package/bin/skills/model-merging/references/methods.md +352 -0
  222. package/bin/skills/model-pruning/SKILL.md +495 -0
  223. package/bin/skills/model-pruning/references/wanda.md +347 -0
  224. package/bin/skills/moe-training/SKILL.md +526 -0
  225. package/bin/skills/moe-training/references/architectures.md +432 -0
  226. package/bin/skills/moe-training/references/inference.md +348 -0
  227. package/bin/skills/moe-training/references/training.md +425 -0
  228. package/bin/skills/nanogpt/SKILL.md +290 -0
  229. package/bin/skills/nanogpt/references/architecture.md +382 -0
  230. package/bin/skills/nanogpt/references/data.md +476 -0
  231. package/bin/skills/nanogpt/references/training.md +564 -0
  232. package/bin/skills/nemo-curator/SKILL.md +383 -0
  233. package/bin/skills/nemo-curator/references/deduplication.md +87 -0
  234. package/bin/skills/nemo-curator/references/filtering.md +102 -0
  235. package/bin/skills/nemo-evaluator/SKILL.md +494 -0
  236. package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
  237. package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
  238. package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
  239. package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
  240. package/bin/skills/nemo-guardrails/SKILL.md +297 -0
  241. package/bin/skills/nnsight/SKILL.md +436 -0
  242. package/bin/skills/nnsight/references/README.md +78 -0
  243. package/bin/skills/nnsight/references/api.md +344 -0
  244. package/bin/skills/nnsight/references/tutorials.md +300 -0
  245. package/bin/skills/openrlhf/SKILL.md +249 -0
  246. package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
  247. package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
  248. package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
  249. package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
  250. package/bin/skills/outlines/SKILL.md +652 -0
  251. package/bin/skills/outlines/references/backends.md +615 -0
  252. package/bin/skills/outlines/references/examples.md +773 -0
  253. package/bin/skills/outlines/references/json_generation.md +652 -0
  254. package/bin/skills/peft/SKILL.md +431 -0
  255. package/bin/skills/peft/references/advanced-usage.md +514 -0
  256. package/bin/skills/peft/references/troubleshooting.md +480 -0
  257. package/bin/skills/phoenix/SKILL.md +475 -0
  258. package/bin/skills/phoenix/references/advanced-usage.md +619 -0
  259. package/bin/skills/phoenix/references/troubleshooting.md +538 -0
  260. package/bin/skills/pinecone/SKILL.md +358 -0
  261. package/bin/skills/pinecone/references/deployment.md +181 -0
  262. package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
  263. package/bin/skills/pytorch-fsdp/references/index.md +7 -0
  264. package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
  265. package/bin/skills/pytorch-lightning/SKILL.md +346 -0
  266. package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
  267. package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
  268. package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
  269. package/bin/skills/pyvene/SKILL.md +473 -0
  270. package/bin/skills/pyvene/references/README.md +73 -0
  271. package/bin/skills/pyvene/references/api.md +383 -0
  272. package/bin/skills/pyvene/references/tutorials.md +376 -0
  273. package/bin/skills/qdrant/SKILL.md +493 -0
  274. package/bin/skills/qdrant/references/advanced-usage.md +648 -0
  275. package/bin/skills/qdrant/references/troubleshooting.md +631 -0
  276. package/bin/skills/ray-data/SKILL.md +326 -0
  277. package/bin/skills/ray-data/references/integration.md +82 -0
  278. package/bin/skills/ray-data/references/transformations.md +83 -0
  279. package/bin/skills/ray-train/SKILL.md +406 -0
  280. package/bin/skills/ray-train/references/multi-node.md +628 -0
  281. package/bin/skills/rwkv/SKILL.md +260 -0
  282. package/bin/skills/rwkv/references/architecture-details.md +344 -0
  283. package/bin/skills/rwkv/references/rwkv7.md +386 -0
  284. package/bin/skills/rwkv/references/state-management.md +369 -0
  285. package/bin/skills/saelens/SKILL.md +386 -0
  286. package/bin/skills/saelens/references/README.md +70 -0
  287. package/bin/skills/saelens/references/api.md +333 -0
  288. package/bin/skills/saelens/references/tutorials.md +318 -0
  289. package/bin/skills/segment-anything/SKILL.md +500 -0
  290. package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
  291. package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
  292. package/bin/skills/sentence-transformers/SKILL.md +255 -0
  293. package/bin/skills/sentence-transformers/references/models.md +123 -0
  294. package/bin/skills/sentencepiece/SKILL.md +235 -0
  295. package/bin/skills/sentencepiece/references/algorithms.md +200 -0
  296. package/bin/skills/sentencepiece/references/training.md +304 -0
  297. package/bin/skills/sglang/SKILL.md +442 -0
  298. package/bin/skills/sglang/references/deployment.md +490 -0
  299. package/bin/skills/sglang/references/radix-attention.md +413 -0
  300. package/bin/skills/sglang/references/structured-generation.md +541 -0
  301. package/bin/skills/simpo/SKILL.md +219 -0
  302. package/bin/skills/simpo/references/datasets.md +478 -0
  303. package/bin/skills/simpo/references/hyperparameters.md +452 -0
  304. package/bin/skills/simpo/references/loss-functions.md +350 -0
  305. package/bin/skills/skypilot/SKILL.md +509 -0
  306. package/bin/skills/skypilot/references/advanced-usage.md +491 -0
  307. package/bin/skills/skypilot/references/troubleshooting.md +570 -0
  308. package/bin/skills/slime/SKILL.md +464 -0
  309. package/bin/skills/slime/references/api-reference.md +392 -0
  310. package/bin/skills/slime/references/troubleshooting.md +386 -0
  311. package/bin/skills/speculative-decoding/SKILL.md +467 -0
  312. package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
  313. package/bin/skills/speculative-decoding/references/medusa.md +350 -0
  314. package/bin/skills/stable-diffusion/SKILL.md +519 -0
  315. package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
  316. package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
  317. package/bin/skills/tensorboard/SKILL.md +629 -0
  318. package/bin/skills/tensorboard/references/integrations.md +638 -0
  319. package/bin/skills/tensorboard/references/profiling.md +545 -0
  320. package/bin/skills/tensorboard/references/visualization.md +620 -0
  321. package/bin/skills/tensorrt-llm/SKILL.md +187 -0
  322. package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
  323. package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
  324. package/bin/skills/tensorrt-llm/references/serving.md +470 -0
  325. package/bin/skills/tinker/SKILL.md +362 -0
  326. package/bin/skills/tinker/references/api-reference.md +168 -0
  327. package/bin/skills/tinker/references/getting-started.md +157 -0
  328. package/bin/skills/tinker/references/loss-functions.md +163 -0
  329. package/bin/skills/tinker/references/models-and-lora.md +139 -0
  330. package/bin/skills/tinker/references/recipes.md +280 -0
  331. package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
  332. package/bin/skills/tinker/references/rendering.md +243 -0
  333. package/bin/skills/tinker/references/supervised-learning.md +232 -0
  334. package/bin/skills/tinker-training-cost/SKILL.md +187 -0
  335. package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
  336. package/bin/skills/torchforge/SKILL.md +433 -0
  337. package/bin/skills/torchforge/references/api-reference.md +327 -0
  338. package/bin/skills/torchforge/references/troubleshooting.md +409 -0
  339. package/bin/skills/torchtitan/SKILL.md +358 -0
  340. package/bin/skills/torchtitan/references/checkpoint.md +181 -0
  341. package/bin/skills/torchtitan/references/custom-models.md +258 -0
  342. package/bin/skills/torchtitan/references/float8.md +133 -0
  343. package/bin/skills/torchtitan/references/fsdp.md +126 -0
  344. package/bin/skills/transformer-lens/SKILL.md +346 -0
  345. package/bin/skills/transformer-lens/references/README.md +54 -0
  346. package/bin/skills/transformer-lens/references/api.md +362 -0
  347. package/bin/skills/transformer-lens/references/tutorials.md +339 -0
  348. package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
  349. package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
  350. package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
  351. package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
  352. package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
  353. package/bin/skills/unsloth/SKILL.md +80 -0
  354. package/bin/skills/unsloth/references/index.md +7 -0
  355. package/bin/skills/unsloth/references/llms-full.md +16799 -0
  356. package/bin/skills/unsloth/references/llms-txt.md +12044 -0
  357. package/bin/skills/unsloth/references/llms.md +82 -0
  358. package/bin/skills/verl/SKILL.md +391 -0
  359. package/bin/skills/verl/references/api-reference.md +301 -0
  360. package/bin/skills/verl/references/troubleshooting.md +391 -0
  361. package/bin/skills/vllm/SKILL.md +364 -0
  362. package/bin/skills/vllm/references/optimization.md +226 -0
  363. package/bin/skills/vllm/references/quantization.md +284 -0
  364. package/bin/skills/vllm/references/server-deployment.md +255 -0
  365. package/bin/skills/vllm/references/troubleshooting.md +447 -0
  366. package/bin/skills/weights-and-biases/SKILL.md +590 -0
  367. package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
  368. package/bin/skills/weights-and-biases/references/integrations.md +700 -0
  369. package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
  370. package/bin/skills/whisper/SKILL.md +317 -0
  371. package/bin/skills/whisper/references/languages.md +189 -0
  372. package/bin/synsc +0 -0
  373. package/package.json +10 -0
@@ -0,0 +1,530 @@
1
+ # Lambda Labs Troubleshooting Guide
2
+
3
+ ## Instance Launch Issues
4
+
5
+ ### No instances available
6
+
7
+ **Error**: "No capacity available" or instance type not listed
8
+
9
+ **Solutions**:
10
+ ```bash
11
+ # Check availability via API
12
+ curl -u $LAMBDA_API_KEY: \
13
+ https://cloud.lambdalabs.com/api/v1/instance-types | jq '.data | to_entries[] | select(.value.regions_with_capacity_available | length > 0) | .key'
14
+
15
+ # Try different regions
16
+ # US regions: us-west-1, us-east-1, us-south-1
17
+ # International: eu-west-1, asia-northeast-1, etc.
18
+
19
+ # Try alternative GPU types
20
+ # H100 not available? Try A100
21
+ # A100 not available? Try A10 or A6000
22
+ ```
23
+
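The jq filter above can also be expressed in Python when the availability check is part of a larger script. This is a minimal sketch that assumes the response has already been parsed into a dict with the shape the jq expression implies (`data` maps instance-type names to objects carrying `regions_with_capacity_available`). The instance-type names and the region entry in `sample` are hypothetical, and the result is sorted only to make the output stable.

```python
def types_with_capacity(payload):
    """Return instance-type names that report at least one region with capacity."""
    instance_types = payload.get("data", {})
    return sorted(
        name
        for name, info in instance_types.items()
        if info.get("regions_with_capacity_available")
    )


# Hypothetical sample shaped like the response the jq filter expects.
sample = {
    "data": {
        "gpu_1x_a10": {"regions_with_capacity_available": [{"name": "us-west-1"}]},
        "gpu_8x_h100": {"regions_with_capacity_available": []},
    }
}
print(types_with_capacity(sample))
```

An empty result means no listed type currently reports capacity, which matches the "No capacity available" case.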
24
+ ### Instance stuck launching
25
+
26
+ **Problem**: Instance shows "booting" for over 20 minutes
27
+
28
+ **Solutions**:
29
+ ```bash
30
+ # Single-GPU: Should be ready in 3-5 minutes
31
+ # Multi-GPU (8x): May take 10-15 minutes
32
+
33
+ # If stuck longer:
34
+ # 1. Terminate the instance
35
+ # 2. Try a different region
36
+ # 3. Try a different instance type
37
+ # 4. Contact Lambda support if persistent
38
+ ```
39
+
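The "wait, then give up" advice above can be sketched as a polling loop with a deadline. The helper below is illustrative only: `wait_until_active` and its parameters are hypothetical names, and `get_status` is a caller-supplied function (for example, one that queries the instances endpoint) assumed to return status strings such as "booting" or "active".

```python
import time


def wait_until_active(get_status, timeout_s=20 * 60, interval_s=30,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns "active" or timeout_s elapses.

    Returns True if the instance became active, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if get_status() == "active":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A `False` result after the default 20 minutes corresponds to the "stuck" case, at which point terminating and relaunching is the next step. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.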
40
+ ### API authentication fails
41
+
42
+ **Error**: `401 Unauthorized` or `403 Forbidden`
43
+
44
+ **Solutions**:
45
+ ```bash
46
+ # Verify API key format (should start with specific prefix)
47
+ echo $LAMBDA_API_KEY
48
+
49
+ # Test API key
50
+ curl -u $LAMBDA_API_KEY: \
51
+ https://cloud.lambdalabs.com/api/v1/instance-types
52
+
53
+ # Generate new API key from Lambda console if needed
54
+ # Settings > API keys > Generate
55
+ ```
56
+
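For scripts that call the API without curl, it may help to see what `curl -u $LAMBDA_API_KEY:` sends: HTTP Basic authentication with the key as the username and an empty password. A minimal sketch follows; `basic_auth_header` is a hypothetical helper name and `"abc"` is a placeholder, not a real key.

```python
import base64


def basic_auth_header(api_key):
    """Build the Authorization header value equivalent to `curl -u KEY:`."""
    # curl -u "KEY:" means username=KEY with an empty password.
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


print(basic_auth_header("abc"))  # placeholder key, for illustration only
```

Stray whitespace or a trailing newline in the key changes the encoded token, which is one plausible cause of a `401 Unauthorized`.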
57
+ ### Quota limits reached
58
+
59
+ **Error**: "Instance limit reached" or "Quota exceeded"
60
+
61
+ **Solutions**:
62
+ - Check current running instances in console
63
+ - Terminate unused instances
64
+ - Contact Lambda support to request quota increase
65
+ - Use 1-Click Clusters for large-scale needs
66
+
67
+ ## SSH Connection Issues
68
+
69
+ ### Connection refused
70
+
71
+ **Error**: `ssh: connect to host <IP> port 22: Connection refused`
72
+
73
+ **Solutions**:
74
+ ```bash
75
+ # Wait for instance to fully initialize
76
+ # Single-GPU: 3-5 minutes
77
+ # Multi-GPU: 10-15 minutes
78
+
79
+ # Check instance status in console (should be "active")
80
+
81
+ # Verify correct IP address
82
+ curl -u $LAMBDA_API_KEY: \
83
+ https://cloud.lambdalabs.com/api/v1/instances | jq '.data[].ip'
84
+ ```
85
+
86
+ ### Permission denied
87
+
88
+ **Error**: `Permission denied (publickey)`
89
+
90
+ **Solutions**:
91
+ ```bash
92
+ # Verify SSH key matches
93
+ ssh -v -i ~/.ssh/lambda_key ubuntu@<IP>
94
+
95
+ # Check key permissions
96
+ chmod 600 ~/.ssh/lambda_key
97
+ chmod 644 ~/.ssh/lambda_key.pub
98
+
99
+ # Verify key was added to Lambda console before launch
100
+ # Keys must be added BEFORE launching instance
101
+
102
+ # Check authorized_keys on instance (if you have another way in)
103
+ cat ~/.ssh/authorized_keys
104
+ ```
105
+
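The `chmod 600` step matters because OpenSSH refuses to use a private key that is readable by group or others. A small sketch of the same check in Python, assuming a POSIX system; `private_key_mode_ok` is a hypothetical helper name.

```python
import os
import stat


def private_key_mode_ok(path):
    """Return True if the file grants no permissions to group or others."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & 0o077 == 0
```

For example, `private_key_mode_ok(os.path.expanduser("~/.ssh/lambda_key"))` would be expected to return `True` after the `chmod 600` above.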
106
+ ### Host key verification failed
107
+
108
+ **Error**: `WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!`
109
+
110
+ **Solutions**:
111
+ ```bash
112
+ # This happens when IP is reused by different instance
113
+ # Remove old key
114
+ ssh-keygen -R <IP>
115
+
116
+ # Then connect again
117
+ ssh ubuntu@<IP>
118
+ ```
119
+
120
+ ### Timeout during SSH
121
+
122
+ **Error**: `ssh: connect to host <IP> port 22: Operation timed out`
123
+
124
+ **Solutions**:
125
+ ```bash
126
+ # Check if instance is in "active" state
127
+
128
+ # Verify firewall allows SSH (port 22)
129
+ # Lambda console > Firewall
130
+
131
+ # Check your local network allows outbound SSH
132
+
133
+ # Try from different network/VPN
134
+ ```
135
+
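"Connection refused" and "Operation timed out" point at different causes, so it can be useful to tell them apart programmatically before retrying. The sketch below uses only the standard library; `probe_ssh` is a hypothetical helper. The interpretation is a rule of thumb rather than a guarantee: "refused" usually means the host is reachable but nothing is listening yet, while "timeout" suggests a firewall or network path problem.

```python
import socket


def probe_ssh(host, port=22, timeout_s=5.0):
    """Classify a TCP connection attempt as "open", "refused", "timeout", or "error"."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "error"
```

A result of "refused" shortly after launch is consistent with an instance that is still initializing; a persistent "timeout" is the case where the firewall and local network checks above apply.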
136
+ ## GPU Issues
137
+
138
+ ### GPU not detected
139
+
140
+ **Error**: `nvidia-smi: command not found` or no GPUs shown
141
+
142
+ **Solutions**:
143
+ ```bash
144
+ # Reboot instance
145
+ sudo reboot
146
+
147
+ # Reinstall NVIDIA drivers (if needed)
148
+ wget -nv -O- https://lambdalabs.com/install-lambda-stack.sh | sh -
149
+ sudo reboot
150
+
151
+ # Check driver status
152
+ nvidia-smi
153
+ lsmod | grep nvidia
154
+ ```
155
+
156
+ ### CUDA out of memory
157
+
158
+ **Error**: `torch.cuda.OutOfMemoryError: CUDA out of memory`
159
+
160
+ **Solutions**:
161
+ ```python
162
+ # Check GPU memory
163
+ import torch
164
+ print(torch.cuda.get_device_properties(0).total_memory / 1e9, "GB")
165
+
166
+ # Clear cache
167
+ torch.cuda.empty_cache()
168
+
169
+ # Reduce batch size
170
+ batch_size = batch_size // 2
171
+
172
+ # Enable gradient checkpointing
173
+ model.gradient_checkpointing_enable()
174
+
175
+ # Use mixed precision
176
+ from torch import autocast  # torch.cuda.amp.autocast is deprecated in recent PyTorch releases
177
+ with autocast(device_type="cuda"):
178
+     outputs = model(**inputs)
179
+
180
+ # Use larger GPU instance
181
+ # A100-40GB → A100-80GB → H100
182
+ ```
183
+
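The "reduce batch size" step can be automated by halving the batch size each time a step runs out of memory. This is a framework-agnostic sketch: `run_with_backoff` and `step` are hypothetical names, and the default `oom_errors` uses the built-in `MemoryError` so the example runs without a GPU. With PyTorch one would presumably pass `oom_errors=(torch.cuda.OutOfMemoryError,)` and clear the cache between attempts.

```python
def run_with_backoff(step, batch_size, min_batch_size=1, oom_errors=(MemoryError,)):
    """Call step(batch_size), halving batch_size after each out-of-memory error.

    Returns the batch size that succeeded; re-raises if even min_batch_size fails.
    """
    while True:
        try:
            step(batch_size)
            return batch_size
        except oom_errors:
            if batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)
```

If even the minimum batch size fails, the error propagates, which is the signal to move to gradient checkpointing, mixed precision, or a larger GPU instance.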
184
+ ### CUDA version mismatch
185
+
186
+ **Error**: `CUDA driver version is insufficient for CUDA runtime version`
187
+
188
+ **Solutions**:
189
+ ```bash
190
+ # Check versions
191
+ nvidia-smi # Shows the driver version and the highest CUDA version it supports
192
+ nvcc --version # Shows toolkit version
193
+
194
+ # Lambda Stack should have compatible versions
195
+ # If mismatch, reinstall Lambda Stack
196
+ wget -nv -O- https://lambdalabs.com/install-lambda-stack.sh | sh -
197
+ sudo reboot
198
+
199
+ # Or install specific PyTorch version
200
+ pip install torch==2.1.0+cu121 -f https://download.pytorch.org/whl/torch_stable.html
201
+ ```
202
+
203
+ ### Multi-GPU not working
204
+
205
+ **Problem**: Only one GPU is being used
206
+
207
+ **Solutions**:
208
+ ```python
209
+ # Check all GPUs visible
210
+ import torch
211
+ print(f"GPUs available: {torch.cuda.device_count()}")
212
+
213
+ # Verify CUDA_VISIBLE_DEVICES not set restrictively
214
+ import os
215
+ print(os.environ.get("CUDA_VISIBLE_DEVICES", "not set"))
216
+
217
+ # Use DataParallel or DistributedDataParallel
218
+ model = torch.nn.DataParallel(model)
219
+ # or
220
+ model = torch.nn.parallel.DistributedDataParallel(model)
221
+ ```
222
+
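The `CUDA_VISIBLE_DEVICES` check above prints the raw value, which leaves the interpretation to the reader. A small sketch of that interpretation follows; `visible_device_ids` and `explain` are hypothetical helper names, not part of any library. It distinguishes an unset variable (all GPUs visible) from an empty one (no GPUs visible).

```python
import os


def visible_device_ids(env=None):
    """Return None if CUDA_VISIBLE_DEVICES is unset (all GPUs visible),
    otherwise the list of entries it names (an empty list means no GPUs)."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    # Entries may be indices or GPU UUIDs, so they are kept as strings.
    return [part.strip() for part in raw.split(",") if part.strip()]


def explain(env=None):
    """Describe in words what the current setting means for GPU visibility."""
    ids = visible_device_ids(env)
    if ids is None:
        return "CUDA_VISIBLE_DEVICES not set: all GPUs are visible"
    if not ids:
        return "CUDA_VISIBLE_DEVICES is empty: no GPUs are visible"
    return f"CUDA_VISIBLE_DEVICES limits the process to {len(ids)} GPU(s): {', '.join(ids)}"


print(explain())
```

If the reported count is lower than the number of GPUs on the instance, unsetting the variable (or listing every index) before launching the job is the usual first step.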
+ ## Filesystem Issues
+
+ ### Filesystem not mounted
+
+ **Error**: `/lambda/nfs/<name>` doesn't exist
+
+ **Solutions**:
+ ```bash
+ # A filesystem must be attached at launch time;
+ # it cannot be attached to a running instance
+
+ # Verify filesystem was selected during launch
+
+ # Check mount points
+ df -h | grep lambda
+
+ # If missing, terminate and relaunch with filesystem
+ ```
+
+ ### Slow filesystem performance
+
+ **Problem**: Reading/writing to filesystem is slow
+
+ **Solutions**:
+ ```bash
+ # Use local SSD for temporary/intermediate files
+ # /home/ubuntu has fast NVMe storage
+
+ # Copy frequently accessed data to local storage
+ cp -r /lambda/nfs/storage/dataset /home/ubuntu/dataset
+
+ # Use filesystem for checkpoints and final outputs only
+
+ # Check network bandwidth (needs an iperf3 server on the remote end)
+ iperf3 -c <filesystem_server>
+ ```
+
+ ### Data lost after termination
+
+ **Problem**: Files disappeared after instance terminated
+
+ **Solutions**:
+ ```bash
+ # Root volume (/home/ubuntu) is EPHEMERAL
+ # Data there is lost on termination
+
+ # ALWAYS use filesystem for persistent data:
+ #   /lambda/nfs/<filesystem_name>/
+
+ # Sync important local files before terminating
+ rsync -av /home/ubuntu/outputs/ /lambda/nfs/storage/outputs/
+ ```
+
+ ### Filesystem full
+
+ **Error**: `No space left on device`
+
+ **Solutions**:
+ ```bash
+ # Check filesystem usage
+ df -h /lambda/nfs/storage
+
+ # Find large files
+ du -sh /lambda/nfs/storage/* | sort -h
+
+ # Clean up checkpoint files older than 7 days (preview first, then delete)
+ find /lambda/nfs/storage/checkpoints -type f -mtime +7 -print
+ find /lambda/nfs/storage/checkpoints -type f -mtime +7 -delete
+
+ # Increase filesystem size in Lambda console
+ # (may require support request)
+ ```
+
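Deleting by age can remove the only recent checkpoint if training paused for a week. An alternative is to keep the newest few files regardless of age. The sketch below illustrates that idea; `prune_checkpoints`, its `keep` default, and the `*.pt` pattern are assumptions for illustration, not part of the original guide. It defaults to a dry run so nothing is deleted until the list has been reviewed.

```python
from pathlib import Path


def prune_checkpoints(directory, keep=3, pattern="*.pt", dry_run=True):
    """Return the checkpoint files beyond the `keep` newest ones.

    Files are ranked by modification time. They are only deleted when
    dry_run is False; otherwise the list is returned for review.
    """
    files = sorted(
        (p for p in Path(directory).glob(pattern) if p.is_file()),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first
    )
    stale = files[keep:]
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale
```

A possible usage is `prune_checkpoints("/lambda/nfs/storage/checkpoints")` to list candidates, followed by the same call with `dry_run=False` once the list looks right.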
+ ## Network Issues
+
+ ### Port not accessible
+
+ **Error**: Cannot connect to service (TensorBoard, Jupyter, etc.)
+
+ **Solutions**:
+ ```bash
+ # Lambda default: Only port 22 is open
+ # Configure firewall in Lambda console
+
+ # Or use SSH tunneling (recommended)
+ ssh -L 6006:localhost:6006 ubuntu@<IP>
+ # Access at http://localhost:6006
+
+ # For Jupyter
+ ssh -L 8888:localhost:8888 ubuntu@<IP>
+ ```
+
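To tell a closed port from a service that has not started, it can help to test the TCP connection directly. This is a minimal sketch using only the standard library; `port_is_open` is a hypothetical helper name.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

With the TensorBoard tunnel from above running, `port_is_open("localhost", 6006)` checks the local end of the tunnel, while `port_is_open("<IP>", 6006)` from your own machine checks whether the firewall lets the port through directly.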
+ ### Slow data download
+
+ **Problem**: Downloading datasets is slow
+
+ **Solutions**:
+ ```bash
+ # Check available bandwidth
+ speedtest-cli
+
+ # Use multi-connection download
+ aria2c -x 16 <URL>
+
+ # For HuggingFace models: install hf_transfer, then enable it
+ pip install hf_transfer
+ export HF_HUB_ENABLE_HF_TRANSFER=1
+
+ # For S3, raise the number of concurrent requests before syncing
+ aws configure set default.s3.max_concurrent_requests 32
+ aws s3 sync s3://bucket/data /local/data --quiet
+ ```
+
+ ### Inter-node communication fails
+
+ **Error**: Distributed training can't connect between nodes
+
+ **Solutions**:
+ ```bash
+ # Verify nodes in same region (required)
+
+ # Check private IPs can communicate
+ ping <other_node_private_ip>
+
+ # Verify NCCL settings
+ export NCCL_DEBUG=INFO
+ export NCCL_IB_DISABLE=0  # Enable InfiniBand if available
+
+ # Check firewall allows distributed ports
+ # Need: 29500 (PyTorch), or configured MASTER_PORT
+ ```
+
+ ## Software Issues
+
+ ### Package installation fails
+
+ **Error**: `pip install` errors
+
+ **Solutions**:
+ ```bash
+ # Use virtual environment (don't modify system Python)
+ python3 -m venv ~/myenv
+ source ~/myenv/bin/activate
+ pip install <package>
+
+ # For CUDA packages, match CUDA version
+ pip install torch --index-url https://download.pytorch.org/whl/cu121
+
+ # Clear pip cache if corrupted
+ pip cache purge
+ ```
+
+ ### Python version issues
+
+ **Error**: Package requires different Python version
+
+ **Solutions**:
+ ```bash
+ # Install alternate Python (don't replace system Python)
+ # On older Ubuntu releases this may require an extra repository such as the deadsnakes PPA
+ sudo apt install python3.11 python3.11-venv python3.11-dev
+
+ # Create venv with specific Python
+ python3.11 -m venv ~/py311env
+ source ~/py311env/bin/activate
+ ```
+
+ ### ImportError or ModuleNotFoundError
+
+ **Error**: Module not found despite installation
+
+ **Solutions**:
+ ```bash
+ # Verify correct Python environment
+ which python
+ pip list | grep <module>
+
+ # Ensure virtual environment is activated
+ source ~/myenv/bin/activate
+
+ # Reinstall in correct environment
+ pip uninstall <package>
+ pip install <package>
+ ```
+
+ ## Training Issues
+
+ ### Training hangs
+
+ **Problem**: Training stops progressing, no output
+
+ **Solutions**:
+ ```bash
+ # Check GPU utilization
+ watch -n 1 nvidia-smi
+
+ # If GPUs sit at 0%, data loading is a likely bottleneck:
+ # increase num_workers in the DataLoader
+
+ # Check for deadlocks in distributed training
+ export NCCL_DEBUG=INFO
+
+ # Add a timeout in the Python training script so hangs fail instead of stalling:
+ #   dist.init_process_group(..., timeout=timedelta(minutes=30))
+ ```
+
+ ### Checkpoint corruption
+
+ **Error**: `RuntimeError: storage has wrong size` or similar
+
+ **Solutions**:
+ ```python
+ import os
+ import torch
+
+ # Use safe saving pattern
+ checkpoint_path = "/lambda/nfs/storage/checkpoint.pt"
+ temp_path = checkpoint_path + ".tmp"
+
+ # Save to temp first
+ torch.save(state_dict, temp_path)
+ # Keep the previous checkpoint as a backup
+ if os.path.exists(checkpoint_path):
+     os.replace(checkpoint_path, checkpoint_path + ".backup")
+ # Then atomic rename (temp file is on the same filesystem)
+ os.replace(temp_path, checkpoint_path)
+
+ # For loading corrupted checkpoint
+ try:
+     state = torch.load(checkpoint_path)
+ except Exception:
+     # Fall back to previous checkpoint
+     state = torch.load(checkpoint_path + ".backup")
+ ```
+
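The save-then-rename pattern above can be wrapped in a reusable helper. The following is a sketch under stated assumptions: `atomic_save` is a hypothetical name, the writer is passed in as a callable so the example does not depend on PyTorch, and the explicit `fsync` is an addition that asks the OS to flush the data before the rename.

```python
import os


def atomic_save(write_fn, path):
    """Write a file so that `path` always holds either the old or the new content.

    write_fn receives an open binary file object and writes the payload to it.
    If write_fn raises, the existing file at `path` is left untouched.
    """
    tmp_path = f"{path}.tmp"
    with open(tmp_path, "wb") as f:
        write_fn(f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to storage before renaming
    # Atomic when tmp_path and path are on the same filesystem.
    os.replace(tmp_path, path)
```

With PyTorch this could be called as `atomic_save(lambda f: torch.save(state_dict, f), checkpoint_path)`, since `torch.save` accepts an open file object.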
+ ### Memory leak
+
+ **Problem**: Memory usage grows over time
+
+ **Solutions**:
+ ```python
+ # Clear CUDA cache periodically
+ torch.cuda.empty_cache()
+
+ # Log a Python float, not the tensor (a stored tensor keeps its graph alive)
+ loss_value = loss.item()
+
+ # Don't accumulate gradients unintentionally
+ optimizer.zero_grad(set_to_none=True)
+
+ # Use gradient accumulation properly
+ (loss / accumulation_steps).backward()
+ if (step + 1) % accumulation_steps == 0:
+     optimizer.step()
+     optimizer.zero_grad(set_to_none=True)
+ ```
+
+ ## Billing Issues
+
+ ### Unexpected charges
+
+ **Problem**: Bill higher than expected
+
+ **Solutions**:
+ ```bash
+ # Check for forgotten running instances
+ curl -s -u "$LAMBDA_API_KEY:" \
+   https://cloud.lambdalabs.com/api/v1/instances | jq '.data[].id'
+
+ # Terminate all instances
+ # Lambda console > Instances > Terminate all
+
+ # Lambda charges by the minute
+ # There is no "stop" feature: charges end only when the instance is terminated
+ ```
+
+ ### Instance terminated unexpectedly
+
+ **Problem**: Instance disappeared without manual termination
+
+ **Possible causes**:
+ - Payment issue (card declined)
+ - Account suspension
+ - Instance health check failure
+
+ **Solutions**:
+ - Check email for Lambda notifications
+ - Verify payment method in console
+ - Contact Lambda support
+ - Always checkpoint to the persistent filesystem so work survives termination
+
+ ## Common Error Messages
+
+ | Error | Cause | Solution |
+ |-------|-------|----------|
+ | `No capacity available` | Region/GPU sold out | Try different region or GPU type |
+ | `Permission denied (publickey)` | SSH key mismatch | Re-add key, check permissions |
+ | `CUDA out of memory` | Model or batch too large for GPU memory | Reduce batch size, use larger GPU |
+ | `No space left on device` | Disk full | Clean up or move data to the filesystem |
+ | `Connection refused` | Instance not ready | Wait 3-15 minutes for boot |
+ | `ModuleNotFoundError` | Wrong Python env | Activate correct virtualenv |
+
+ ## Getting Help
+
+ 1. **Documentation**: https://docs.lambda.ai
+ 2. **Support**: https://support.lambdalabs.com
+ 3. **Email**: support@lambdalabs.com
+ 4. **Status**: Check Lambda status page for outages
+
+ ### Information to Include
+
+ When contacting support, include:
+ - Instance ID
+ - Region
+ - Instance type
+ - Error message (full traceback)
+ - Steps to reproduce
+ - Time of occurrence