@synsci/cli-darwin-x64 1.1.49
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/skills/accelerate/SKILL.md +332 -0
- package/bin/skills/accelerate/references/custom-plugins.md +453 -0
- package/bin/skills/accelerate/references/megatron-integration.md +489 -0
- package/bin/skills/accelerate/references/performance.md +525 -0
- package/bin/skills/audiocraft/SKILL.md +564 -0
- package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
- package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
- package/bin/skills/autogpt/SKILL.md +403 -0
- package/bin/skills/autogpt/references/advanced-usage.md +535 -0
- package/bin/skills/autogpt/references/troubleshooting.md +420 -0
- package/bin/skills/awq/SKILL.md +310 -0
- package/bin/skills/awq/references/advanced-usage.md +324 -0
- package/bin/skills/awq/references/troubleshooting.md +344 -0
- package/bin/skills/axolotl/SKILL.md +158 -0
- package/bin/skills/axolotl/references/api.md +5548 -0
- package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
- package/bin/skills/axolotl/references/index.md +15 -0
- package/bin/skills/axolotl/references/other.md +3563 -0
- package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
- package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
- package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
- package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
- package/bin/skills/bitsandbytes/SKILL.md +411 -0
- package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
- package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
- package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
- package/bin/skills/blip-2/SKILL.md +564 -0
- package/bin/skills/blip-2/references/advanced-usage.md +680 -0
- package/bin/skills/blip-2/references/troubleshooting.md +526 -0
- package/bin/skills/chroma/SKILL.md +406 -0
- package/bin/skills/chroma/references/integration.md +38 -0
- package/bin/skills/clip/SKILL.md +253 -0
- package/bin/skills/clip/references/applications.md +207 -0
- package/bin/skills/constitutional-ai/SKILL.md +290 -0
- package/bin/skills/crewai/SKILL.md +498 -0
- package/bin/skills/crewai/references/flows.md +438 -0
- package/bin/skills/crewai/references/tools.md +429 -0
- package/bin/skills/crewai/references/troubleshooting.md +480 -0
- package/bin/skills/deepspeed/SKILL.md +141 -0
- package/bin/skills/deepspeed/references/08.md +17 -0
- package/bin/skills/deepspeed/references/09.md +173 -0
- package/bin/skills/deepspeed/references/2020.md +378 -0
- package/bin/skills/deepspeed/references/2023.md +279 -0
- package/bin/skills/deepspeed/references/assets.md +179 -0
- package/bin/skills/deepspeed/references/index.md +35 -0
- package/bin/skills/deepspeed/references/mii.md +118 -0
- package/bin/skills/deepspeed/references/other.md +1191 -0
- package/bin/skills/deepspeed/references/tutorials.md +6554 -0
- package/bin/skills/dspy/SKILL.md +590 -0
- package/bin/skills/dspy/references/examples.md +663 -0
- package/bin/skills/dspy/references/modules.md +475 -0
- package/bin/skills/dspy/references/optimizers.md +566 -0
- package/bin/skills/faiss/SKILL.md +221 -0
- package/bin/skills/faiss/references/index_types.md +280 -0
- package/bin/skills/flash-attention/SKILL.md +367 -0
- package/bin/skills/flash-attention/references/benchmarks.md +215 -0
- package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
- package/bin/skills/gguf/SKILL.md +427 -0
- package/bin/skills/gguf/references/advanced-usage.md +504 -0
- package/bin/skills/gguf/references/troubleshooting.md +442 -0
- package/bin/skills/gptq/SKILL.md +450 -0
- package/bin/skills/gptq/references/calibration.md +337 -0
- package/bin/skills/gptq/references/integration.md +129 -0
- package/bin/skills/gptq/references/troubleshooting.md +95 -0
- package/bin/skills/grpo-rl-training/README.md +97 -0
- package/bin/skills/grpo-rl-training/SKILL.md +572 -0
- package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
- package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
- package/bin/skills/guidance/SKILL.md +572 -0
- package/bin/skills/guidance/references/backends.md +554 -0
- package/bin/skills/guidance/references/constraints.md +674 -0
- package/bin/skills/guidance/references/examples.md +767 -0
- package/bin/skills/hqq/SKILL.md +445 -0
- package/bin/skills/hqq/references/advanced-usage.md +528 -0
- package/bin/skills/hqq/references/troubleshooting.md +503 -0
- package/bin/skills/hugging-face-cli/SKILL.md +191 -0
- package/bin/skills/hugging-face-cli/references/commands.md +954 -0
- package/bin/skills/hugging-face-cli/references/examples.md +374 -0
- package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
- package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
- package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
- package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
- package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
- package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
- package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
- package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
- package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
- package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
- package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
- package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
- package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
- package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
- package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
- package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
- package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
- package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
- package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
- package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
- package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
- package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
- package/bin/skills/hugging-face-jobs/index.html +216 -0
- package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
- package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
- package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
- package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
- package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
- package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
- package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
- package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
- package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
- package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
- package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
- package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
- package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
- package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
- package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
- package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
- package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
- package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
- package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
- package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
- package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
- package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
- package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
- package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
- package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
- package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
- package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
- package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
- package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
- package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
- package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
- package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
- package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
- package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
- package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
- package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
- package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
- package/bin/skills/instructor/SKILL.md +740 -0
- package/bin/skills/instructor/references/examples.md +107 -0
- package/bin/skills/instructor/references/providers.md +70 -0
- package/bin/skills/instructor/references/validation.md +606 -0
- package/bin/skills/knowledge-distillation/SKILL.md +458 -0
- package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
- package/bin/skills/lambda-labs/SKILL.md +545 -0
- package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
- package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
- package/bin/skills/langchain/SKILL.md +480 -0
- package/bin/skills/langchain/references/agents.md +499 -0
- package/bin/skills/langchain/references/integration.md +562 -0
- package/bin/skills/langchain/references/rag.md +600 -0
- package/bin/skills/langsmith/SKILL.md +422 -0
- package/bin/skills/langsmith/references/advanced-usage.md +548 -0
- package/bin/skills/langsmith/references/troubleshooting.md +537 -0
- package/bin/skills/litgpt/SKILL.md +469 -0
- package/bin/skills/litgpt/references/custom-models.md +568 -0
- package/bin/skills/litgpt/references/distributed-training.md +451 -0
- package/bin/skills/litgpt/references/supported-models.md +336 -0
- package/bin/skills/litgpt/references/training-recipes.md +619 -0
- package/bin/skills/llama-cpp/SKILL.md +258 -0
- package/bin/skills/llama-cpp/references/optimization.md +89 -0
- package/bin/skills/llama-cpp/references/quantization.md +213 -0
- package/bin/skills/llama-cpp/references/server.md +125 -0
- package/bin/skills/llama-factory/SKILL.md +80 -0
- package/bin/skills/llama-factory/references/_images.md +23 -0
- package/bin/skills/llama-factory/references/advanced.md +1055 -0
- package/bin/skills/llama-factory/references/getting_started.md +349 -0
- package/bin/skills/llama-factory/references/index.md +19 -0
- package/bin/skills/llama-factory/references/other.md +31 -0
- package/bin/skills/llamaguard/SKILL.md +337 -0
- package/bin/skills/llamaindex/SKILL.md +569 -0
- package/bin/skills/llamaindex/references/agents.md +83 -0
- package/bin/skills/llamaindex/references/data_connectors.md +108 -0
- package/bin/skills/llamaindex/references/query_engines.md +406 -0
- package/bin/skills/llava/SKILL.md +304 -0
- package/bin/skills/llava/references/training.md +197 -0
- package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
- package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
- package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
- package/bin/skills/long-context/SKILL.md +536 -0
- package/bin/skills/long-context/references/extension_methods.md +468 -0
- package/bin/skills/long-context/references/fine_tuning.md +611 -0
- package/bin/skills/long-context/references/rope.md +402 -0
- package/bin/skills/mamba/SKILL.md +260 -0
- package/bin/skills/mamba/references/architecture-details.md +206 -0
- package/bin/skills/mamba/references/benchmarks.md +255 -0
- package/bin/skills/mamba/references/training-guide.md +388 -0
- package/bin/skills/megatron-core/SKILL.md +366 -0
- package/bin/skills/megatron-core/references/benchmarks.md +249 -0
- package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
- package/bin/skills/megatron-core/references/production-examples.md +473 -0
- package/bin/skills/megatron-core/references/training-recipes.md +547 -0
- package/bin/skills/miles/SKILL.md +315 -0
- package/bin/skills/miles/references/api-reference.md +141 -0
- package/bin/skills/miles/references/troubleshooting.md +352 -0
- package/bin/skills/mlflow/SKILL.md +704 -0
- package/bin/skills/mlflow/references/deployment.md +744 -0
- package/bin/skills/mlflow/references/model-registry.md +770 -0
- package/bin/skills/mlflow/references/tracking.md +680 -0
- package/bin/skills/modal/SKILL.md +341 -0
- package/bin/skills/modal/references/advanced-usage.md +503 -0
- package/bin/skills/modal/references/troubleshooting.md +494 -0
- package/bin/skills/model-merging/SKILL.md +539 -0
- package/bin/skills/model-merging/references/evaluation.md +462 -0
- package/bin/skills/model-merging/references/examples.md +428 -0
- package/bin/skills/model-merging/references/methods.md +352 -0
- package/bin/skills/model-pruning/SKILL.md +495 -0
- package/bin/skills/model-pruning/references/wanda.md +347 -0
- package/bin/skills/moe-training/SKILL.md +526 -0
- package/bin/skills/moe-training/references/architectures.md +432 -0
- package/bin/skills/moe-training/references/inference.md +348 -0
- package/bin/skills/moe-training/references/training.md +425 -0
- package/bin/skills/nanogpt/SKILL.md +290 -0
- package/bin/skills/nanogpt/references/architecture.md +382 -0
- package/bin/skills/nanogpt/references/data.md +476 -0
- package/bin/skills/nanogpt/references/training.md +564 -0
- package/bin/skills/nemo-curator/SKILL.md +383 -0
- package/bin/skills/nemo-curator/references/deduplication.md +87 -0
- package/bin/skills/nemo-curator/references/filtering.md +102 -0
- package/bin/skills/nemo-evaluator/SKILL.md +494 -0
- package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
- package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
- package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
- package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
- package/bin/skills/nemo-guardrails/SKILL.md +297 -0
- package/bin/skills/nnsight/SKILL.md +436 -0
- package/bin/skills/nnsight/references/README.md +78 -0
- package/bin/skills/nnsight/references/api.md +344 -0
- package/bin/skills/nnsight/references/tutorials.md +300 -0
- package/bin/skills/openrlhf/SKILL.md +249 -0
- package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
- package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
- package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
- package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
- package/bin/skills/outlines/SKILL.md +652 -0
- package/bin/skills/outlines/references/backends.md +615 -0
- package/bin/skills/outlines/references/examples.md +773 -0
- package/bin/skills/outlines/references/json_generation.md +652 -0
- package/bin/skills/peft/SKILL.md +431 -0
- package/bin/skills/peft/references/advanced-usage.md +514 -0
- package/bin/skills/peft/references/troubleshooting.md +480 -0
- package/bin/skills/phoenix/SKILL.md +475 -0
- package/bin/skills/phoenix/references/advanced-usage.md +619 -0
- package/bin/skills/phoenix/references/troubleshooting.md +538 -0
- package/bin/skills/pinecone/SKILL.md +358 -0
- package/bin/skills/pinecone/references/deployment.md +181 -0
- package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
- package/bin/skills/pytorch-fsdp/references/index.md +7 -0
- package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
- package/bin/skills/pytorch-lightning/SKILL.md +346 -0
- package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
- package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
- package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
- package/bin/skills/pyvene/SKILL.md +473 -0
- package/bin/skills/pyvene/references/README.md +73 -0
- package/bin/skills/pyvene/references/api.md +383 -0
- package/bin/skills/pyvene/references/tutorials.md +376 -0
- package/bin/skills/qdrant/SKILL.md +493 -0
- package/bin/skills/qdrant/references/advanced-usage.md +648 -0
- package/bin/skills/qdrant/references/troubleshooting.md +631 -0
- package/bin/skills/ray-data/SKILL.md +326 -0
- package/bin/skills/ray-data/references/integration.md +82 -0
- package/bin/skills/ray-data/references/transformations.md +83 -0
- package/bin/skills/ray-train/SKILL.md +406 -0
- package/bin/skills/ray-train/references/multi-node.md +628 -0
- package/bin/skills/rwkv/SKILL.md +260 -0
- package/bin/skills/rwkv/references/architecture-details.md +344 -0
- package/bin/skills/rwkv/references/rwkv7.md +386 -0
- package/bin/skills/rwkv/references/state-management.md +369 -0
- package/bin/skills/saelens/SKILL.md +386 -0
- package/bin/skills/saelens/references/README.md +70 -0
- package/bin/skills/saelens/references/api.md +333 -0
- package/bin/skills/saelens/references/tutorials.md +318 -0
- package/bin/skills/segment-anything/SKILL.md +500 -0
- package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
- package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
- package/bin/skills/sentence-transformers/SKILL.md +255 -0
- package/bin/skills/sentence-transformers/references/models.md +123 -0
- package/bin/skills/sentencepiece/SKILL.md +235 -0
- package/bin/skills/sentencepiece/references/algorithms.md +200 -0
- package/bin/skills/sentencepiece/references/training.md +304 -0
- package/bin/skills/sglang/SKILL.md +442 -0
- package/bin/skills/sglang/references/deployment.md +490 -0
- package/bin/skills/sglang/references/radix-attention.md +413 -0
- package/bin/skills/sglang/references/structured-generation.md +541 -0
- package/bin/skills/simpo/SKILL.md +219 -0
- package/bin/skills/simpo/references/datasets.md +478 -0
- package/bin/skills/simpo/references/hyperparameters.md +452 -0
- package/bin/skills/simpo/references/loss-functions.md +350 -0
- package/bin/skills/skypilot/SKILL.md +509 -0
- package/bin/skills/skypilot/references/advanced-usage.md +491 -0
- package/bin/skills/skypilot/references/troubleshooting.md +570 -0
- package/bin/skills/slime/SKILL.md +464 -0
- package/bin/skills/slime/references/api-reference.md +392 -0
- package/bin/skills/slime/references/troubleshooting.md +386 -0
- package/bin/skills/speculative-decoding/SKILL.md +467 -0
- package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
- package/bin/skills/speculative-decoding/references/medusa.md +350 -0
- package/bin/skills/stable-diffusion/SKILL.md +519 -0
- package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
- package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
- package/bin/skills/tensorboard/SKILL.md +629 -0
- package/bin/skills/tensorboard/references/integrations.md +638 -0
- package/bin/skills/tensorboard/references/profiling.md +545 -0
- package/bin/skills/tensorboard/references/visualization.md +620 -0
- package/bin/skills/tensorrt-llm/SKILL.md +187 -0
- package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
- package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
- package/bin/skills/tensorrt-llm/references/serving.md +470 -0
- package/bin/skills/tinker/SKILL.md +362 -0
- package/bin/skills/tinker/references/api-reference.md +168 -0
- package/bin/skills/tinker/references/getting-started.md +157 -0
- package/bin/skills/tinker/references/loss-functions.md +163 -0
- package/bin/skills/tinker/references/models-and-lora.md +139 -0
- package/bin/skills/tinker/references/recipes.md +280 -0
- package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
- package/bin/skills/tinker/references/rendering.md +243 -0
- package/bin/skills/tinker/references/supervised-learning.md +232 -0
- package/bin/skills/tinker-training-cost/SKILL.md +187 -0
- package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
- package/bin/skills/torchforge/SKILL.md +433 -0
- package/bin/skills/torchforge/references/api-reference.md +327 -0
- package/bin/skills/torchforge/references/troubleshooting.md +409 -0
- package/bin/skills/torchtitan/SKILL.md +358 -0
- package/bin/skills/torchtitan/references/checkpoint.md +181 -0
- package/bin/skills/torchtitan/references/custom-models.md +258 -0
- package/bin/skills/torchtitan/references/float8.md +133 -0
- package/bin/skills/torchtitan/references/fsdp.md +126 -0
- package/bin/skills/transformer-lens/SKILL.md +346 -0
- package/bin/skills/transformer-lens/references/README.md +54 -0
- package/bin/skills/transformer-lens/references/api.md +362 -0
- package/bin/skills/transformer-lens/references/tutorials.md +339 -0
- package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
- package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
- package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
- package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
- package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
- package/bin/skills/unsloth/SKILL.md +80 -0
- package/bin/skills/unsloth/references/index.md +7 -0
- package/bin/skills/unsloth/references/llms-full.md +16799 -0
- package/bin/skills/unsloth/references/llms-txt.md +12044 -0
- package/bin/skills/unsloth/references/llms.md +82 -0
- package/bin/skills/verl/SKILL.md +391 -0
- package/bin/skills/verl/references/api-reference.md +301 -0
- package/bin/skills/verl/references/troubleshooting.md +391 -0
- package/bin/skills/vllm/SKILL.md +364 -0
- package/bin/skills/vllm/references/optimization.md +226 -0
- package/bin/skills/vllm/references/quantization.md +284 -0
- package/bin/skills/vllm/references/server-deployment.md +255 -0
- package/bin/skills/vllm/references/troubleshooting.md +447 -0
- package/bin/skills/weights-and-biases/SKILL.md +590 -0
- package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
- package/bin/skills/weights-and-biases/references/integrations.md +700 -0
- package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
- package/bin/skills/whisper/SKILL.md +317 -0
- package/bin/skills/whisper/references/languages.md +189 -0
- package/bin/synsc +0 -0
- package/package.json +10 -0
@@ -0,0 +1,570 @@
# SkyPilot Troubleshooting Guide

## Installation Issues

### Cloud credentials not found

**Error**: `sky check` shows clouds as disabled

**Solutions**:
```bash
# AWS
aws configure
# Verify: aws sts get-caller-identity

# GCP
gcloud auth application-default login
# Verify: gcloud auth list

# Azure
az login
az account set -s <subscription-id>

# Kubernetes
export KUBECONFIG=~/.kube/config
kubectl get nodes

# Re-check after configuration
sky check
```
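Before running the verification commands above, it can help to confirm that the relevant cloud CLIs are installed at all. The sketch below is a hypothetical helper (`check_cloud_clis` is not part of SkyPilot) that only reports which tools are on `PATH`; it does not validate credentials.

```bash
# check_cloud_clis <tool>...
# Prints "found: <tool>" or "missing: <tool>" for each argument.
# Returns 1 if at least one tool is missing from PATH, 0 otherwise.
check_cloud_clis() {
  local missing=0 tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}
```

A typical call would be `check_cloud_clis aws gcloud az kubectl`, followed by `sky check` once the missing tools are installed and configured.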

### Permission errors

**Error**: `PermissionError` or `AccessDenied`

**Solutions**:
```bash
# AWS: Ensure IAM permissions include EC2, S3, IAM
# Required policies: AmazonEC2FullAccess, AmazonS3FullAccess, IAMFullAccess

# GCP: Ensure roles include Compute Admin, Storage Admin
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:email@example.com" \
  --role="roles/compute.admin"

# Azure: Ensure Contributor role on subscription
az role assignment create \
  --assignee email@example.com \
  --role Contributor \
  --scope /subscriptions/SUBSCRIPTION_ID
```

## Cluster Launch Issues

### Quota exceeded

**Error**: `Quota exceeded for resource`

**Solutions**:
```yaml
# Try different region
resources:
  accelerators: A100:8
  any_of:
    - cloud: gcp
      region: us-west1
    - cloud: gcp
      region: europe-west4
    - cloud: aws
      region: us-east-1

# Or request quota increase from cloud provider
```

```bash
# List the GPU types offered on GCP before launching
# (this shows the catalog, not your account's quota)
sky show-gpus --cloud gcp
```

### GPU not available

**Error**: `No resources available in region`

**Solutions**:
```yaml
# Use fallback accelerators
resources:
  accelerators:
    H100: 8
    A100-80GB: 8
    A100: 8
  any_of:
    - cloud: gcp
    - cloud: aws
    - cloud: azure
```

```bash
# Check GPU availability
sky show-gpus A100
sky show-gpus --cloud aws
```

### Instance type not found

**Error**: `Instance type 'xyz' not found`

**Solutions**:
```yaml
# Let SkyPilot choose instance automatically
resources:
  accelerators: A100:8
  cpus: 96+
  memory: 512+
  # Don't specify instance_type unless necessary
```

### Cluster stuck in INIT

**Error**: Cluster stays in INIT state

**Solutions**:
```bash
# Refresh the cluster status from the cloud provider
sky status --refresh mycluster

# Check the exit status of the cluster's latest job
sky logs mycluster --status

# SSH and check manually
ssh mycluster
journalctl -u sky-supervisor

# Terminate and retry
sky down mycluster
sky launch -c mycluster task.yaml
```

## Setup Command Issues

### Setup script fails

**Error**: Setup commands fail during provisioning

**Solutions**:
```yaml
# Add error handling and retries
setup: |
  set -e  # Exit on error

  # Retry pip installs
  for i in {1..3}; do
    pip install torch transformers && break
    echo "Retry $i..."
    sleep 10
  done

  # Verify installation
  python -c "import torch; print(torch.__version__)"
```
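The retry loop above can be factored into a reusable function that also fails loudly once the attempts are exhausted. This is a minimal sketch; `retry` is a hypothetical helper name, not something SkyPilot provides.

```bash
# retry <max_attempts> <delay_seconds> <command> [args...]
# Runs the command until it succeeds or max_attempts is reached.
# Returns 0 on success and 1 if every attempt failed.
retry() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Failed after $attempt attempts: $*" >&2
      return 1
    fi
    echo "Attempt $attempt failed; retrying in ${delay}s..." >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

Inside a `setup:` block this could be used as `retry 3 10 pip install torch transformers`. Because the function returns a non-zero status after the last failure, `set -e` stops the setup at that point.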

### Conda environment issues

**Error**: Conda not found or environment issues

**Solutions**:
```yaml
setup: |
  # Initialize conda for bash
  source ~/.bashrc

  # Or load conda's shell hook directly. `conda activate` is a shell function,
  # so it does not work when invoked as ~/miniconda3/bin/conda activate
  source ~/miniconda3/etc/profile.d/conda.sh
  conda create -n myenv python=3.10 -y
  conda activate myenv
```
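When the conda install location is not known in advance, the setup script can look for conda's shell hook in a few common places. The sketch below is an illustration only: `find_conda_init` is a hypothetical helper, and the three candidate directories are assumptions about where conda is usually installed.

```bash
# find_conda_init
# Prints the path of the first conda.sh found under a few common install
# roots (assumed locations). Returns 1 if none of them contains conda.sh.
find_conda_init() {
  local root
  for root in "$HOME/miniconda3" "$HOME/anaconda3" /opt/conda; do
    if [ -f "$root/etc/profile.d/conda.sh" ]; then
      printf '%s\n' "$root/etc/profile.d/conda.sh"
      return 0
    fi
  done
  echo "conda.sh not found in the checked locations" >&2
  return 1
}
```

A setup script could then run `source "$(find_conda_init)" && conda activate myenv`.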

### CUDA version mismatch

**Error**: `CUDA driver version is insufficient`

**Solutions**:
```yaml
setup: |
  # Check the highest CUDA version the installed driver supports
  nvidia-smi

  # Install a PyTorch build for a specific CUDA version
  pip install torch==2.1.0+cu121 -f https://download.pytorch.org/whl/torch_stable.html

  # Verify CUDA
  python -c "import torch; print(torch.cuda.is_available())"
```
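This error generally means the installed driver supports an older CUDA version than the one the PyTorch build was compiled for. A small version comparison makes that check explicit. This is a sketch under stated assumptions: `version_le` is a hypothetical helper that compares only `major.minor` versions.

```bash
# version_le <a> <b>
# Succeeds (returns 0) if major.minor version a is less than or equal to b.
version_le() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, x, ".")
    split(b, y, ".")
    # Compare major versions first, then minor versions.
    if (x[1] + 0 != y[1] + 0) exit !(x[1] + 0 < y[1] + 0)
    exit !(x[2] + 0 <= y[2] + 0)
  }'
}
```

For example, `version_le 12.1 "$driver_cuda"` succeeds when a `cu121` build should be usable, where `driver_cuda` is the "CUDA Version" value shown in the `nvidia-smi` header.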

## Distributed Training Issues

### Nodes can't communicate

**Error**: Connection refused between nodes

**Solutions**:
```yaml
run: |
  # Debug: Print all node IPs
  echo "All nodes: $SKYPILOT_NODE_IPS"
  echo "My rank: $SKYPILOT_NODE_RANK"

  # Wait for all nodes to be ready
  sleep 30

  # Use correct master address
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  echo "Master: $MASTER_ADDR"
```
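The head-node lookup above relies on `SKYPILOT_NODE_IPS` holding one IP per line. The following sketch wraps that parsing in two hypothetical helpers (`head_node_ip` and `node_count` are illustrative names) so the list can be sanity-checked before training starts.

```bash
# head_node_ip <newline-separated IP list>
# Prints the first non-empty entry, which is the head node.
head_node_ip() {
  printf '%s\n' "$1" | awk 'NF { print $1; exit }'
}

# node_count <newline-separated IP list>
# Prints the number of non-empty entries.
node_count() {
  printf '%s\n' "$1" | awk 'NF { n++ } END { print n + 0 }'
}
```

In a `run:` block, comparing `node_count "$SKYPILOT_NODE_IPS"` with `$SKYPILOT_NUM_NODES` is a quick way to spot a truncated or empty IP list.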

### torchrun fails

**Error**: `torch.distributed` errors

**Solutions**:
```yaml
run: |
  # Ensure correct environment variables
  export NCCL_DEBUG=INFO
  export NCCL_IB_DISABLE=1  # Try if InfiniBand issues

  # --master_addr/--master_port apply to torchrun's default static rendezvous,
  # so no --rdzv_backend is set here
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --node_rank=$SKYPILOT_NODE_RANK \
    --master_addr=$(echo "$SKYPILOT_NODE_IPS" | head -n1) \
    --master_port=12355 \
    train.py
```
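Many `torch.distributed` failures trace back to an unset or empty environment variable. The sketch below builds the torchrun command line as a string so it can be printed and inspected before it is run. `torchrun_command` is a hypothetical helper; the default script name and port are taken from the example above, and the command uses torchrun's static rendezvous (`--master_addr`/`--master_port`).

```bash
# torchrun_command [script] [master_port]
# Prints the torchrun command for this node, or fails with a message
# naming the first SkyPilot variable that is unset or empty.
torchrun_command() {
  local script="${1:-train.py}" port="${2:-12355}" name value head_ip
  for name in SKYPILOT_NODE_IPS SKYPILOT_NUM_NODES \
              SKYPILOT_NUM_GPUS_PER_NODE SKYPILOT_NODE_RANK; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name is not set" >&2
      return 1
    fi
  done
  # The first line of SKYPILOT_NODE_IPS is the head node.
  head_ip=$(printf '%s\n' "$SKYPILOT_NODE_IPS" | head -n1)
  printf '%s\n' "torchrun --nnodes=$SKYPILOT_NUM_NODES --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE --node_rank=$SKYPILOT_NODE_RANK --master_addr=$head_ip --master_port=$port $script"
}
```

Printing the result on every node (`torchrun_command train.py`) makes it easy to confirm that all nodes agree on the master address and that each has a distinct rank.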

### DeepSpeed hostfile errors

**Error**: `Invalid hostfile` or connection errors

**Solutions**:
```yaml
run: |
  # Create proper hostfile
  echo "$SKYPILOT_NODE_IPS" | while read -r ip; do
    echo "$ip slots=$SKYPILOT_NUM_GPUS_PER_NODE"
  done > /tmp/hostfile

  cat /tmp/hostfile  # Debug

  deepspeed --hostfile=/tmp/hostfile train.py
```
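A generated hostfile can be checked before it is handed to DeepSpeed. The sketch below assumes each non-empty line should have the form `<host> slots=<n>`, as produced by the loop above; `validate_hostfile` is a hypothetical helper.

```bash
# validate_hostfile <path>
# Returns 0 if the file is non-empty and every non-blank line looks like
# "<host> slots=<n>". Malformed lines are reported on stderr.
validate_hostfile() {
  local file="$1"
  if [ ! -s "$file" ]; then
    echo "Hostfile is missing or empty: $file" >&2
    return 1
  fi
  awk '
    NF == 0 { next }
    NF != 2 || $2 !~ /^slots=[0-9]+$/ {
      printf "line %d is malformed: %s\n", NR, $0
      bad = 1
    }
    END { exit bad + 0 }
  ' "$file" >&2
}
```

Running `validate_hostfile /tmp/hostfile` before the `deepspeed` command catches, for example, a line ending in `slots=` because the GPU count variable was empty.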

## File Mount Issues

### Mount fails

**Error**: `Failed to mount storage`

**Solutions**:
```yaml
# Verify bucket exists and credentials are valid
file_mounts:
  /data:
    source: s3://my-bucket/data
    mode: MOUNT

# Check bucket access
# aws s3 ls s3://my-bucket/
```
|
|
263
|
+
|
|
264
|
+
### Slow file access

**Problem**: Reading from a mounted bucket is very slow

**Solutions**:
```yaml
file_mounts:
  # Use COPY mode for small datasets
  /data:
    source: s3://bucket/data
    mode: COPY  # Pre-fetch to local disk

  # Use MOUNT_CACHED for outputs
  /outputs:
    name: outputs
    store: s3
    mode: MOUNT_CACHED  # Cached writes
```

### Storage not persisting

**Error**: Data lost after cluster restart

**Solutions**:
```yaml
# Use named storage (persists across clusters)
file_mounts:
  /persistent:
    name: my-persistent-storage
    store: s3
    mode: MOUNT

# Data in ~/sky_workdir is NOT persisted
# Always use file_mounts for persistent data
```

## Managed Job Issues

### Job keeps failing

**Error**: Job fails and doesn't recover

**Solutions**:
```yaml
# Enable automatic recovery (older releases spell this `spot_recovery: FAILOVER`)
resources:
  use_spot: true
  job_recovery:
    strategy: FAILOVER
    # Add retry logic for failures in user code
    max_restarts_on_errors: 5

# Implement checkpointing
run: |
  python train.py \
    --checkpoint-dir /checkpoints \
    --resume-from-latest
```

### Job stuck in pending

**Error**: Job stays in PENDING state

**Solutions**:
```bash
# Check the managed job queue and each job's state
sky jobs queue

# View controller logs for a job (replace 1 with the job ID from the queue)
sky jobs logs --controller 1

# The jobs controller is itself a cluster; check that it is UP
sky status
```

### Checkpoint not resuming

**Error**: Training restarts from beginning

**Solutions**:
```yaml
file_mounts:
  /checkpoints:
    name: training-checkpoints
    store: s3
    mode: MOUNT_CACHED

run: |
  # Check for existing checkpoint
  if [ -d "/checkpoints/latest" ]; then
    RESUME_FLAG="--resume /checkpoints/latest"
  else
    RESUME_FLAG=""
  fi

  python train.py $RESUME_FLAG --checkpoint-dir /checkpoints
```

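If the training script does not maintain a `latest` directory, the resume path has to be found another way. The sketch below assumes checkpoints are saved as `step-<N>` directories, which is a hypothetical naming scheme, and picks the highest step numerically so that `step-1000` wins over `step-900`.

```bash
# Print the checkpoint directory with the highest step number.
# Assumes directories named step-<N>; returns non-zero if none exist.
latest_checkpoint() {
  dir="$1"
  best=""
  best_step=-1
  for path in "$dir"/step-*; do
    [ -d "$path" ] || continue
    step="${path##*step-}"
    case "$step" in
      ''|*[!0-9]*) continue ;;  # skip names that are not purely numeric
    esac
    if [ "$step" -gt "$best_step" ]; then
      best_step="$step"
      best="$path"
    fi
  done
  [ -n "$best" ] && echo "$best"
}

# Example use (commented out here):
# if CKPT=$(latest_checkpoint /checkpoints); then RESUME_FLAG="--resume $CKPT"; fi
```
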
## Sky Serve Issues

### Service not accessible

**Error**: Cannot reach service endpoint

**Solutions**:
```bash
# Check service status
sky serve status my-service

# View replica logs (1 is the replica ID shown by `sky serve status`)
sky serve logs my-service 1

# Print the service endpoint
sky serve status my-service --endpoint
```

### Replicas keep crashing

**Error**: Replicas fail health checks

**Solutions**:
```yaml
service:
  readiness_probe:
    path: /health
    initial_delay_seconds: 120  # Increase for slow model loading
    period_seconds: 30
    timeout_seconds: 10

run: |
  # Ensure health endpoint exists and is actually served
  python -c "
  import uvicorn
  from fastapi import FastAPI
  app = FastAPI()

  @app.get('/health')
  def health():
      return {'status': 'ok'}

  # Port 8080 is an assumption; use the port your service exposes
  uvicorn.run(app, host='0.0.0.0', port=8080)
  "
```

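Before blaming the probe settings, it helps to confirm from inside the replica that the health endpoint answers at all. A minimal sketch follows, using a hypothetical generic helper `retry_until_ok` that reruns any command until it succeeds. The URL and port in the usage comment are assumptions.

```bash
# Run a command up to <attempts> times, sleeping <delay> seconds between
# tries. Returns 0 on the first success, 1 if every attempt fails.
retry_until_ok() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example use on the replica (commented out here):
# retry_until_ok 20 5 curl -fsS http://localhost:8080/health
```

If this loop only succeeds after a long wait, `initial_delay_seconds` is probably too short for the model load time.
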
### Autoscaling not working

**Problem**: Service doesn't scale up/down

**Solutions**:
```yaml
service:
  replica_policy:
    min_replicas: 1
    max_replicas: 10
    target_qps_per_replica: 2.0
    upscale_delay_seconds: 30  # Faster scale up
    downscale_delay_seconds: 60  # Faster scale down

# Monitor metrics
# sky serve status my-service
```

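To judge whether the service should have scaled, it can help to estimate the replica count implied by the policy. The sketch below assumes the target is roughly the observed QPS divided by `target_qps_per_replica`, rounded up and clamped to the min/max bounds. This is an approximation for sanity checks, not the autoscaler's exact logic, and `target_replicas` is a hypothetical helper.

```bash
# Estimate replicas: ceil(qps / target_qps_per_replica), clamped to [min, max].
# Arguments: <observed_qps> <target_qps_per_replica> <min_replicas> <max_replicas>
target_replicas() {
  awk -v qps="$1" -v per="$2" -v lo="$3" -v hi="$4" 'BEGIN {
    n = qps / per
    r = int(n)
    if (r < n) r++        # round up
    if (r < lo) r = lo    # respect min_replicas
    if (r > hi) r = hi    # respect max_replicas
    print r
  }'
}

# With the policy above, 7 QPS gives ceil(7 / 2.0) = 4 replicas.
target_replicas 7 2.0 1 10
```

If the estimate differs from what `sky serve status` reports for longer than the configured delays, the traffic may not be reaching the load balancer.
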
## SSH and Access Issues

### Cannot SSH to cluster

**Error**: `Connection refused` or timeout

**Solutions**:
```bash
# Verify cluster is running
sky status

# Try with verbose output
ssh -v mycluster

# Check SSH key
ls -la ~/.ssh/sky-key*

# Regenerate SSH key if needed
sky launch -c test --dryrun  # Regenerates key
```

### Port forwarding fails

**Error**: Cannot forward ports

**Solutions**:
```bash
# Correct syntax
ssh -L 8080:localhost:8080 mycluster

# For Jupyter
ssh -L 8888:localhost:8888 mycluster

# Multiple ports
ssh -L 8080:localhost:8080 -L 6006:localhost:6006 mycluster
```

## Cost and Billing Issues

### Unexpected charges

**Problem**: Higher than expected costs

**Solutions**:
```bash
# Terminate clusters you no longer need (--all tears down every cluster)
sky down --all

# Set autostop: tear the cluster down after 30 idle minutes
sky autostop mycluster -i 30 --down

# Use spot instances (or set `use_spot: true` under `resources:` in the task YAML)
sky launch --use-spot task.yaml
```

### Spot instance preempted

**Error**: Instance terminated unexpectedly

**Solutions**:
```yaml
# Use managed jobs for automatic recovery
# sky jobs launch instead of sky launch

resources:
  use_spot: true
  job_recovery:
    strategy: FAILOVER  # Auto-failover to another region/cloud (older releases: spot_recovery)

# Always checkpoint frequently when using spot
```

## Debugging Commands

### View cluster state

```bash
# Cluster status
sky status
sky status -a  # Show all details

# Cluster resources
sky show-gpus

# Cloud credentials
sky check
```

### View logs

```bash
# Task logs
sky logs mycluster
sky logs mycluster 1  # Specific job

# Managed job logs (-n selects by name; a numeric job ID also works)
sky jobs logs -n my-job
sky jobs logs -n my-job --follow

# Service logs (1 is the replica ID)
sky serve logs my-service 1
```

### Inspect cluster

```bash
# SSH to cluster
ssh mycluster

# Check GPU status
nvidia-smi

# Check processes
ps aux | grep python

# Check disk space
df -h
```

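A full disk often shows up as an unrelated-looking failure during setup or checkpoint writes. As a small sketch, the hypothetical helper `disk_usage_percent` below extracts the usage percentage from `df -P` so it can be checked in a script. The 90% threshold in the usage comment is an arbitrary assumption.

```bash
# Print the used-space percentage (digits only) for the filesystem
# holding the given path. Defaults to / when no path is given.
disk_usage_percent() {
  df -P "${1:-/}" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Example use on the cluster (commented out here):
# if [ "$(disk_usage_percent /)" -ge 90 ]; then echo "disk almost full" >&2; fi
```
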
## Common Error Messages

| Error | Cause | Solution |
|-------|-------|----------|
| `No launchable resources` | No available instances | Try different region/cloud |
| `Quota exceeded` | Cloud quota limit | Request increase or use different cloud |
| `Setup failed` | Script error | Check logs, add error handling |
| `Connection refused` | Network/firewall | Check security groups, wait for init |
| `CUDA OOM` | Out of GPU memory | Use larger GPU or reduce batch size |
| `Spot preempted` | Spot instance reclaimed | Use managed jobs for auto-recovery |
| `Mount failed` | Storage access issue | Check credentials and bucket exists |

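The table can also be applied mechanically to a log line. The sketch below is a hypothetical helper, `suggest_fix`, that matches the error strings from the table and prints the corresponding solution. Real log messages may be worded differently, so treat the patterns as a starting point.

```bash
# Print the suggested fix for a log line containing a known error string.
# Returns non-zero when nothing in the table matches.
suggest_fix() {
  case "$1" in
    *"No launchable resources"*) echo "Try different region/cloud" ;;
    *"Quota exceeded"*) echo "Request increase or use different cloud" ;;
    *"Setup failed"*) echo "Check logs, add error handling" ;;
    *"Connection refused"*) echo "Check security groups, wait for init" ;;
    *"CUDA OOM"*) echo "Use larger GPU or reduce batch size" ;;
    *"Spot preempted"*) echo "Use managed jobs for auto-recovery" ;;
    *"Mount failed"*) echo "Check credentials and bucket exists" ;;
    *) echo "No known match; check the full logs"; return 1 ;;
  esac
}

# Example use (commented out here):
# sky logs mycluster 2>&1 | while IFS= read -r line; do suggest_fix "$line"; done
```
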
## Getting Help

1. **Documentation**: https://docs.skypilot.co
2. **GitHub Issues**: https://github.com/skypilot-org/skypilot/issues
3. **Slack**: https://slack.skypilot.co
4. **Examples**: https://github.com/skypilot-org/skypilot/tree/master/examples

### Reporting Issues

Include:
- SkyPilot version: `sky --version`
- Python version: `python --version`
- Cloud provider and region
- Full error traceback
- Task YAML (sanitized)
- Output of `sky check`
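
The command outputs in the list above can be gathered into one file to attach to an issue. A minimal sketch follows. `collect_report` is a hypothetical helper, and it only runs the commands named in the list, skipping any that are not installed. The cloud provider, traceback, and task YAML still need to be added by hand.

```bash
# Write version and credential-check output to a report file.
collect_report() {
  out="$1"
  {
    echo "== SkyPilot version =="
    if command -v sky >/dev/null 2>&1; then
      sky --version 2>&1 || true
    else
      echo "sky not found on PATH"
    fi
    echo "== Python version =="
    if command -v python >/dev/null 2>&1; then
      python --version 2>&1 || true
    else
      echo "python not found on PATH"
    fi
    echo "== sky check =="
    if command -v sky >/dev/null 2>&1; then
      sky check 2>&1 || true
    else
      echo "sky not found on PATH"
    fi
  } > "$out"
}

# Example use (commented out here):
# collect_report /tmp/sky-report.txt
```

Review the file before posting, since `sky check` output can include account or project identifiers.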
|