@synsci/cli-darwin-x64 1.1.49
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/skills/accelerate/SKILL.md +332 -0
- package/bin/skills/accelerate/references/custom-plugins.md +453 -0
- package/bin/skills/accelerate/references/megatron-integration.md +489 -0
- package/bin/skills/accelerate/references/performance.md +525 -0
- package/bin/skills/audiocraft/SKILL.md +564 -0
- package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
- package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
- package/bin/skills/autogpt/SKILL.md +403 -0
- package/bin/skills/autogpt/references/advanced-usage.md +535 -0
- package/bin/skills/autogpt/references/troubleshooting.md +420 -0
- package/bin/skills/awq/SKILL.md +310 -0
- package/bin/skills/awq/references/advanced-usage.md +324 -0
- package/bin/skills/awq/references/troubleshooting.md +344 -0
- package/bin/skills/axolotl/SKILL.md +158 -0
- package/bin/skills/axolotl/references/api.md +5548 -0
- package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
- package/bin/skills/axolotl/references/index.md +15 -0
- package/bin/skills/axolotl/references/other.md +3563 -0
- package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
- package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
- package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
- package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
- package/bin/skills/bitsandbytes/SKILL.md +411 -0
- package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
- package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
- package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
- package/bin/skills/blip-2/SKILL.md +564 -0
- package/bin/skills/blip-2/references/advanced-usage.md +680 -0
- package/bin/skills/blip-2/references/troubleshooting.md +526 -0
- package/bin/skills/chroma/SKILL.md +406 -0
- package/bin/skills/chroma/references/integration.md +38 -0
- package/bin/skills/clip/SKILL.md +253 -0
- package/bin/skills/clip/references/applications.md +207 -0
- package/bin/skills/constitutional-ai/SKILL.md +290 -0
- package/bin/skills/crewai/SKILL.md +498 -0
- package/bin/skills/crewai/references/flows.md +438 -0
- package/bin/skills/crewai/references/tools.md +429 -0
- package/bin/skills/crewai/references/troubleshooting.md +480 -0
- package/bin/skills/deepspeed/SKILL.md +141 -0
- package/bin/skills/deepspeed/references/08.md +17 -0
- package/bin/skills/deepspeed/references/09.md +173 -0
- package/bin/skills/deepspeed/references/2020.md +378 -0
- package/bin/skills/deepspeed/references/2023.md +279 -0
- package/bin/skills/deepspeed/references/assets.md +179 -0
- package/bin/skills/deepspeed/references/index.md +35 -0
- package/bin/skills/deepspeed/references/mii.md +118 -0
- package/bin/skills/deepspeed/references/other.md +1191 -0
- package/bin/skills/deepspeed/references/tutorials.md +6554 -0
- package/bin/skills/dspy/SKILL.md +590 -0
- package/bin/skills/dspy/references/examples.md +663 -0
- package/bin/skills/dspy/references/modules.md +475 -0
- package/bin/skills/dspy/references/optimizers.md +566 -0
- package/bin/skills/faiss/SKILL.md +221 -0
- package/bin/skills/faiss/references/index_types.md +280 -0
- package/bin/skills/flash-attention/SKILL.md +367 -0
- package/bin/skills/flash-attention/references/benchmarks.md +215 -0
- package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
- package/bin/skills/gguf/SKILL.md +427 -0
- package/bin/skills/gguf/references/advanced-usage.md +504 -0
- package/bin/skills/gguf/references/troubleshooting.md +442 -0
- package/bin/skills/gptq/SKILL.md +450 -0
- package/bin/skills/gptq/references/calibration.md +337 -0
- package/bin/skills/gptq/references/integration.md +129 -0
- package/bin/skills/gptq/references/troubleshooting.md +95 -0
- package/bin/skills/grpo-rl-training/README.md +97 -0
- package/bin/skills/grpo-rl-training/SKILL.md +572 -0
- package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
- package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
- package/bin/skills/guidance/SKILL.md +572 -0
- package/bin/skills/guidance/references/backends.md +554 -0
- package/bin/skills/guidance/references/constraints.md +674 -0
- package/bin/skills/guidance/references/examples.md +767 -0
- package/bin/skills/hqq/SKILL.md +445 -0
- package/bin/skills/hqq/references/advanced-usage.md +528 -0
- package/bin/skills/hqq/references/troubleshooting.md +503 -0
- package/bin/skills/hugging-face-cli/SKILL.md +191 -0
- package/bin/skills/hugging-face-cli/references/commands.md +954 -0
- package/bin/skills/hugging-face-cli/references/examples.md +374 -0
- package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
- package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
- package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
- package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
- package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
- package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
- package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
- package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
- package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
- package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
- package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
- package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
- package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
- package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
- package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
- package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
- package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
- package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
- package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
- package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
- package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
- package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
- package/bin/skills/hugging-face-jobs/index.html +216 -0
- package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
- package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
- package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
- package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
- package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
- package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
- package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
- package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
- package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
- package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
- package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
- package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
- package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
- package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
- package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
- package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
- package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
- package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
- package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
- package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
- package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
- package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
- package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
- package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
- package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
- package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
- package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
- package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
- package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
- package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
- package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
- package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
- package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
- package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
- package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
- package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
- package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
- package/bin/skills/instructor/SKILL.md +740 -0
- package/bin/skills/instructor/references/examples.md +107 -0
- package/bin/skills/instructor/references/providers.md +70 -0
- package/bin/skills/instructor/references/validation.md +606 -0
- package/bin/skills/knowledge-distillation/SKILL.md +458 -0
- package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
- package/bin/skills/lambda-labs/SKILL.md +545 -0
- package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
- package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
- package/bin/skills/langchain/SKILL.md +480 -0
- package/bin/skills/langchain/references/agents.md +499 -0
- package/bin/skills/langchain/references/integration.md +562 -0
- package/bin/skills/langchain/references/rag.md +600 -0
- package/bin/skills/langsmith/SKILL.md +422 -0
- package/bin/skills/langsmith/references/advanced-usage.md +548 -0
- package/bin/skills/langsmith/references/troubleshooting.md +537 -0
- package/bin/skills/litgpt/SKILL.md +469 -0
- package/bin/skills/litgpt/references/custom-models.md +568 -0
- package/bin/skills/litgpt/references/distributed-training.md +451 -0
- package/bin/skills/litgpt/references/supported-models.md +336 -0
- package/bin/skills/litgpt/references/training-recipes.md +619 -0
- package/bin/skills/llama-cpp/SKILL.md +258 -0
- package/bin/skills/llama-cpp/references/optimization.md +89 -0
- package/bin/skills/llama-cpp/references/quantization.md +213 -0
- package/bin/skills/llama-cpp/references/server.md +125 -0
- package/bin/skills/llama-factory/SKILL.md +80 -0
- package/bin/skills/llama-factory/references/_images.md +23 -0
- package/bin/skills/llama-factory/references/advanced.md +1055 -0
- package/bin/skills/llama-factory/references/getting_started.md +349 -0
- package/bin/skills/llama-factory/references/index.md +19 -0
- package/bin/skills/llama-factory/references/other.md +31 -0
- package/bin/skills/llamaguard/SKILL.md +337 -0
- package/bin/skills/llamaindex/SKILL.md +569 -0
- package/bin/skills/llamaindex/references/agents.md +83 -0
- package/bin/skills/llamaindex/references/data_connectors.md +108 -0
- package/bin/skills/llamaindex/references/query_engines.md +406 -0
- package/bin/skills/llava/SKILL.md +304 -0
- package/bin/skills/llava/references/training.md +197 -0
- package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
- package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
- package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
- package/bin/skills/long-context/SKILL.md +536 -0
- package/bin/skills/long-context/references/extension_methods.md +468 -0
- package/bin/skills/long-context/references/fine_tuning.md +611 -0
- package/bin/skills/long-context/references/rope.md +402 -0
- package/bin/skills/mamba/SKILL.md +260 -0
- package/bin/skills/mamba/references/architecture-details.md +206 -0
- package/bin/skills/mamba/references/benchmarks.md +255 -0
- package/bin/skills/mamba/references/training-guide.md +388 -0
- package/bin/skills/megatron-core/SKILL.md +366 -0
- package/bin/skills/megatron-core/references/benchmarks.md +249 -0
- package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
- package/bin/skills/megatron-core/references/production-examples.md +473 -0
- package/bin/skills/megatron-core/references/training-recipes.md +547 -0
- package/bin/skills/miles/SKILL.md +315 -0
- package/bin/skills/miles/references/api-reference.md +141 -0
- package/bin/skills/miles/references/troubleshooting.md +352 -0
- package/bin/skills/mlflow/SKILL.md +704 -0
- package/bin/skills/mlflow/references/deployment.md +744 -0
- package/bin/skills/mlflow/references/model-registry.md +770 -0
- package/bin/skills/mlflow/references/tracking.md +680 -0
- package/bin/skills/modal/SKILL.md +341 -0
- package/bin/skills/modal/references/advanced-usage.md +503 -0
- package/bin/skills/modal/references/troubleshooting.md +494 -0
- package/bin/skills/model-merging/SKILL.md +539 -0
- package/bin/skills/model-merging/references/evaluation.md +462 -0
- package/bin/skills/model-merging/references/examples.md +428 -0
- package/bin/skills/model-merging/references/methods.md +352 -0
- package/bin/skills/model-pruning/SKILL.md +495 -0
- package/bin/skills/model-pruning/references/wanda.md +347 -0
- package/bin/skills/moe-training/SKILL.md +526 -0
- package/bin/skills/moe-training/references/architectures.md +432 -0
- package/bin/skills/moe-training/references/inference.md +348 -0
- package/bin/skills/moe-training/references/training.md +425 -0
- package/bin/skills/nanogpt/SKILL.md +290 -0
- package/bin/skills/nanogpt/references/architecture.md +382 -0
- package/bin/skills/nanogpt/references/data.md +476 -0
- package/bin/skills/nanogpt/references/training.md +564 -0
- package/bin/skills/nemo-curator/SKILL.md +383 -0
- package/bin/skills/nemo-curator/references/deduplication.md +87 -0
- package/bin/skills/nemo-curator/references/filtering.md +102 -0
- package/bin/skills/nemo-evaluator/SKILL.md +494 -0
- package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
- package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
- package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
- package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
- package/bin/skills/nemo-guardrails/SKILL.md +297 -0
- package/bin/skills/nnsight/SKILL.md +436 -0
- package/bin/skills/nnsight/references/README.md +78 -0
- package/bin/skills/nnsight/references/api.md +344 -0
- package/bin/skills/nnsight/references/tutorials.md +300 -0
- package/bin/skills/openrlhf/SKILL.md +249 -0
- package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
- package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
- package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
- package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
- package/bin/skills/outlines/SKILL.md +652 -0
- package/bin/skills/outlines/references/backends.md +615 -0
- package/bin/skills/outlines/references/examples.md +773 -0
- package/bin/skills/outlines/references/json_generation.md +652 -0
- package/bin/skills/peft/SKILL.md +431 -0
- package/bin/skills/peft/references/advanced-usage.md +514 -0
- package/bin/skills/peft/references/troubleshooting.md +480 -0
- package/bin/skills/phoenix/SKILL.md +475 -0
- package/bin/skills/phoenix/references/advanced-usage.md +619 -0
- package/bin/skills/phoenix/references/troubleshooting.md +538 -0
- package/bin/skills/pinecone/SKILL.md +358 -0
- package/bin/skills/pinecone/references/deployment.md +181 -0
- package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
- package/bin/skills/pytorch-fsdp/references/index.md +7 -0
- package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
- package/bin/skills/pytorch-lightning/SKILL.md +346 -0
- package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
- package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
- package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
- package/bin/skills/pyvene/SKILL.md +473 -0
- package/bin/skills/pyvene/references/README.md +73 -0
- package/bin/skills/pyvene/references/api.md +383 -0
- package/bin/skills/pyvene/references/tutorials.md +376 -0
- package/bin/skills/qdrant/SKILL.md +493 -0
- package/bin/skills/qdrant/references/advanced-usage.md +648 -0
- package/bin/skills/qdrant/references/troubleshooting.md +631 -0
- package/bin/skills/ray-data/SKILL.md +326 -0
- package/bin/skills/ray-data/references/integration.md +82 -0
- package/bin/skills/ray-data/references/transformations.md +83 -0
- package/bin/skills/ray-train/SKILL.md +406 -0
- package/bin/skills/ray-train/references/multi-node.md +628 -0
- package/bin/skills/rwkv/SKILL.md +260 -0
- package/bin/skills/rwkv/references/architecture-details.md +344 -0
- package/bin/skills/rwkv/references/rwkv7.md +386 -0
- package/bin/skills/rwkv/references/state-management.md +369 -0
- package/bin/skills/saelens/SKILL.md +386 -0
- package/bin/skills/saelens/references/README.md +70 -0
- package/bin/skills/saelens/references/api.md +333 -0
- package/bin/skills/saelens/references/tutorials.md +318 -0
- package/bin/skills/segment-anything/SKILL.md +500 -0
- package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
- package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
- package/bin/skills/sentence-transformers/SKILL.md +255 -0
- package/bin/skills/sentence-transformers/references/models.md +123 -0
- package/bin/skills/sentencepiece/SKILL.md +235 -0
- package/bin/skills/sentencepiece/references/algorithms.md +200 -0
- package/bin/skills/sentencepiece/references/training.md +304 -0
- package/bin/skills/sglang/SKILL.md +442 -0
- package/bin/skills/sglang/references/deployment.md +490 -0
- package/bin/skills/sglang/references/radix-attention.md +413 -0
- package/bin/skills/sglang/references/structured-generation.md +541 -0
- package/bin/skills/simpo/SKILL.md +219 -0
- package/bin/skills/simpo/references/datasets.md +478 -0
- package/bin/skills/simpo/references/hyperparameters.md +452 -0
- package/bin/skills/simpo/references/loss-functions.md +350 -0
- package/bin/skills/skypilot/SKILL.md +509 -0
- package/bin/skills/skypilot/references/advanced-usage.md +491 -0
- package/bin/skills/skypilot/references/troubleshooting.md +570 -0
- package/bin/skills/slime/SKILL.md +464 -0
- package/bin/skills/slime/references/api-reference.md +392 -0
- package/bin/skills/slime/references/troubleshooting.md +386 -0
- package/bin/skills/speculative-decoding/SKILL.md +467 -0
- package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
- package/bin/skills/speculative-decoding/references/medusa.md +350 -0
- package/bin/skills/stable-diffusion/SKILL.md +519 -0
- package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
- package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
- package/bin/skills/tensorboard/SKILL.md +629 -0
- package/bin/skills/tensorboard/references/integrations.md +638 -0
- package/bin/skills/tensorboard/references/profiling.md +545 -0
- package/bin/skills/tensorboard/references/visualization.md +620 -0
- package/bin/skills/tensorrt-llm/SKILL.md +187 -0
- package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
- package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
- package/bin/skills/tensorrt-llm/references/serving.md +470 -0
- package/bin/skills/tinker/SKILL.md +362 -0
- package/bin/skills/tinker/references/api-reference.md +168 -0
- package/bin/skills/tinker/references/getting-started.md +157 -0
- package/bin/skills/tinker/references/loss-functions.md +163 -0
- package/bin/skills/tinker/references/models-and-lora.md +139 -0
- package/bin/skills/tinker/references/recipes.md +280 -0
- package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
- package/bin/skills/tinker/references/rendering.md +243 -0
- package/bin/skills/tinker/references/supervised-learning.md +232 -0
- package/bin/skills/tinker-training-cost/SKILL.md +187 -0
- package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
- package/bin/skills/torchforge/SKILL.md +433 -0
- package/bin/skills/torchforge/references/api-reference.md +327 -0
- package/bin/skills/torchforge/references/troubleshooting.md +409 -0
- package/bin/skills/torchtitan/SKILL.md +358 -0
- package/bin/skills/torchtitan/references/checkpoint.md +181 -0
- package/bin/skills/torchtitan/references/custom-models.md +258 -0
- package/bin/skills/torchtitan/references/float8.md +133 -0
- package/bin/skills/torchtitan/references/fsdp.md +126 -0
- package/bin/skills/transformer-lens/SKILL.md +346 -0
- package/bin/skills/transformer-lens/references/README.md +54 -0
- package/bin/skills/transformer-lens/references/api.md +362 -0
- package/bin/skills/transformer-lens/references/tutorials.md +339 -0
- package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
- package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
- package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
- package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
- package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
- package/bin/skills/unsloth/SKILL.md +80 -0
- package/bin/skills/unsloth/references/index.md +7 -0
- package/bin/skills/unsloth/references/llms-full.md +16799 -0
- package/bin/skills/unsloth/references/llms-txt.md +12044 -0
- package/bin/skills/unsloth/references/llms.md +82 -0
- package/bin/skills/verl/SKILL.md +391 -0
- package/bin/skills/verl/references/api-reference.md +301 -0
- package/bin/skills/verl/references/troubleshooting.md +391 -0
- package/bin/skills/vllm/SKILL.md +364 -0
- package/bin/skills/vllm/references/optimization.md +226 -0
- package/bin/skills/vllm/references/quantization.md +284 -0
- package/bin/skills/vllm/references/server-deployment.md +255 -0
- package/bin/skills/vllm/references/troubleshooting.md +447 -0
- package/bin/skills/weights-and-biases/SKILL.md +590 -0
- package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
- package/bin/skills/weights-and-biases/references/integrations.md +700 -0
- package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
- package/bin/skills/whisper/SKILL.md +317 -0
- package/bin/skills/whisper/references/languages.md +189 -0
- package/bin/synsc +0 -0
- package/package.json +10 -0
@@ -0,0 +1,530 @@
# Lambda Labs Troubleshooting Guide

## Instance Launch Issues

### No instances available

**Error**: "No capacity available" or instance type not listed

**Solutions**:
```bash
# Check availability via API: list instance types with capacity in at least one region
curl -u "$LAMBDA_API_KEY:" \
  https://cloud.lambdalabs.com/api/v1/instance-types | jq '.data | to_entries[] | select(.value.regions_with_capacity_available | length > 0) | .key'

# Try different regions (use the exact region names the API returns)
# US regions: us-west-1, us-east-1, us-south-1
# International: europe-central-1, asia-northeast-1, etc.

# Try alternative GPU types
# H100 not available? Try A100
# A100 not available? Try A10 or A6000
```
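
The fallback order in the comments above (H100, then A100, then A10 or A6000) can be applied to the same `/instance-types` response programmatically. The following is a minimal Python sketch, not Lambda's client. `pick_instance_type` and the instance-type names in `PREFERENCE` are illustrative assumptions, and region entries are assumed to be objects with a `name` field. Only the `data` / `regions_with_capacity_available` layout is taken from the `jq` filter above.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical preference order; substitute the instance-type names your account lists.
PREFERENCE = ("gpu_1x_h100_pcie", "gpu_1x_a100", "gpu_1x_a10", "gpu_1x_a6000")


def pick_instance_type(payload: dict,
                       preference: Sequence[str] = PREFERENCE) -> Optional[Tuple[str, str]]:
    """Return (instance_type, region) for the first preferred type with capacity, else None."""
    data = payload.get("data", {})
    for name in preference:
        regions = data.get(name, {}).get("regions_with_capacity_available", [])
        if regions:
            first = regions[0]
            # Assumed shape: {"name": "us-east-1", ...}; fall back to a plain string.
            region = first["name"] if isinstance(first, dict) else str(first)
            return name, region
    return None


# Hand-written sample in the assumed response shape (not real API output).
sample = {"data": {
    "gpu_1x_h100_pcie": {"regions_with_capacity_available": []},
    "gpu_1x_a100": {"regions_with_capacity_available": [{"name": "us-east-1"}]},
}}
print(pick_instance_type(sample))
```

In practice the `payload` would be the parsed JSON body of the `curl` call above; a `None` result means none of the preferred types currently has capacity.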

### Instance stuck launching

**Problem**: Instance shows "booting" for over 20 minutes

**Solutions**:
```bash
# Single-GPU: Should be ready in 3-5 minutes
# Multi-GPU (8x): May take 10-15 minutes

# If stuck longer:
# 1. Terminate the instance
# 2. Try a different region
# 3. Try a different instance type
# 4. Contact Lambda support if persistent
```
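
The "wait, then give up after 20 minutes" rule above can be expressed as a polling loop with a deadline. This is a sketch under stated assumptions: `get_status` is a hypothetical callable you would implement yourself (for example by reading the instance status from the API), and the status strings `"booting"` and `"active"` are the ones this guide mentions.

```python
import time
from typing import Callable


def wait_until_active(get_status: Callable[[], str],
                      timeout_s: float = 20 * 60,
                      poll_s: float = 15.0,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll get_status() until it returns "active"; return False once timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status() == "active":
            return True
        if time.monotonic() >= deadline:
            return False  # stuck: terminate and retry in another region or with another type
        sleep(poll_s)
```

A `False` result corresponds to the "stuck longer" case, at which point the numbered steps above apply.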

### API authentication fails

**Error**: `401 Unauthorized` or `403 Forbidden`

**Solutions**:
```bash
# Verify the API key is set and starts with the expected prefix;
# print only the first characters so the secret stays out of logs
echo "${LAMBDA_API_KEY:0:8}..."

# Test API key: prints the HTTP status code (200 means the key was accepted)
curl -s -o /dev/null -w "%{http_code}\n" -u "$LAMBDA_API_KEY:" \
  https://cloud.lambdalabs.com/api/v1/instance-types

# Generate new API key from Lambda console if needed
# Settings > API keys > Generate
```
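
`curl -u "$LAMBDA_API_KEY:"` sends HTTP Basic authentication with the key as the user name and an empty password. A minimal standard-library sketch of the same request follows; `build_request` and `check_key` are illustrative names, not part of any Lambda SDK.

```python
import base64
import urllib.error
import urllib.request

API_URL = "https://cloud.lambdalabs.com/api/v1/instance-types"


def build_request(api_key: str, url: str = API_URL) -> urllib.request.Request:
    """Build a GET request that authenticates like `curl -u "$KEY:"` (empty password)."""
    token = base64.b64encode(f"{api_key}:".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def check_key(api_key: str) -> int:
    """Return the HTTP status code for an authenticated request (performs a network call)."""
    try:
        with urllib.request.urlopen(build_request(api_key), timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 401 or 403 when the key is rejected
```

Calling `check_key(os.environ["LAMBDA_API_KEY"])` from a script gives the same signal as the `curl` test above without echoing the key.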

### Quota limits reached

**Error**: "Instance limit reached" or "Quota exceeded"

**Solutions**:
- Check current running instances in console
- Terminate unused instances
- Contact Lambda support to request quota increase
- Use 1-Click Clusters for large-scale needs

## SSH Connection Issues

### Connection refused

**Error**: `ssh: connect to host <IP> port 22: Connection refused`

**Solutions**:
```bash
# Wait for instance to fully initialize
# Single-GPU: 3-5 minutes
# Multi-GPU: 10-15 minutes

# Check instance status in console (should be "active")

# Verify correct IP address
curl -u "$LAMBDA_API_KEY:" \
  https://cloud.lambdalabs.com/api/v1/instances | jq '.data[].ip'
```

### Permission denied

**Error**: `Permission denied (publickey)`

**Solutions**:
```bash
# Verify SSH key matches (verbose output shows which keys are offered)
ssh -v -i ~/.ssh/lambda_key ubuntu@<IP>

# Check key permissions
chmod 600 ~/.ssh/lambda_key
chmod 644 ~/.ssh/lambda_key.pub

# Verify key was added to Lambda console before launch
# Keys must be added BEFORE launching instance

# Check authorized_keys on instance (if you have another way in)
cat ~/.ssh/authorized_keys
```
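
OpenSSH ignores a private key file that other users can access, which is why the `chmod 600` step matters. The check can be scripted as below; this is a small sketch, and `key_too_permissive` is an illustrative helper name rather than an existing tool.

```python
import os
import stat


def key_too_permissive(path: str) -> bool:
    """Return True if group or other users have any access to the private key file."""
    mode = stat.S_IMODE(os.stat(os.path.expanduser(path)).st_mode)
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))
```

For example, `key_too_permissive("~/.ssh/lambda_key")` returning `True` means the `chmod 600` command above still needs to be run.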

### Host key verification failed

**Error**: `WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!`

**Solutions**:
```bash
# This happens when IP is reused by different instance
# Remove old key
ssh-keygen -R <IP>

# Then connect again
ssh ubuntu@<IP>
```

### Timeout during SSH

**Error**: `ssh: connect to host <IP> port 22: Operation timed out`

**Solutions**:
```bash
# Check if instance is in "active" state

# Verify firewall allows SSH (port 22)
# Lambda console > Firewall

# Check your local network allows outbound SSH

# Try from different network/VPN
```
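
"Connection refused" and "Operation timed out" point at different causes: a refusal means the host answered but nothing is listening on port 22 yet, while a timeout means no answer arrived at all (firewall rule, wrong IP, or an instance that is not running). A minimal sketch that tells the two apart, assuming plain TCP reachability is a good enough proxy for SSH; `probe_port` is an illustrative name.

```python
import socket


def probe_port(host: str, port: int = 22, timeout_s: float = 5.0) -> str:
    """Return "open", "refused", "timeout", or "error: ..." for a TCP connection attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "open"
    except ConnectionRefusedError:
        return "refused"   # host reachable, SSH daemon not listening (often still booting)
    except (socket.timeout, TimeoutError):
        return "timeout"   # packets dropped: check firewall rules, IP, and instance state
    except OSError as err:
        return f"error: {err}"
```

Running `probe_port("<IP>")` from two different networks helps separate a local outbound block from a problem on the instance side.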

## GPU Issues

### GPU not detected

**Error**: `nvidia-smi: command not found` or no GPUs shown

**Solutions**:
```bash
# Reboot instance
sudo reboot

# Reinstall NVIDIA drivers (if needed)
wget -nv -O- https://lambdalabs.com/install-lambda-stack.sh | sh -
sudo reboot

# Check driver status
nvidia-smi
lsmod | grep nvidia
```
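
Both failure modes above (missing `nvidia-smi` binary, or a binary that reports no GPUs) can be detected from a startup script. The sketch below uses `nvidia-smi`'s CSV query mode; `parse_gpu_csv` and `list_gpus` are illustrative helper names.

```python
import shutil
import subprocess
from typing import List, Tuple


def parse_gpu_csv(text: str) -> List[Tuple[str, str]]:
    """Parse `name, memory.total` CSV rows into (name, memory) pairs."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            name, _, memory = line.partition(",")
            rows.append((name.strip(), memory.strip()))
    return rows


def list_gpus() -> List[Tuple[str, str]]:
    """Return detected GPUs, or an empty list if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    return parse_gpu_csv(result.stdout) if result.returncode == 0 else []
```

An empty list from `list_gpus()` is the cue to try the reboot and driver reinstall steps above.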

### CUDA out of memory

**Error**: `torch.cuda.OutOfMemoryError: CUDA out of memory`

**Solutions**:
```python
import torch

# Check total GPU memory
print(torch.cuda.get_device_properties(0).total_memory / 1e9, "GB")

# Clear PyTorch's cached blocks (memory held by live tensors is not freed)
torch.cuda.empty_cache()

# Reduce batch size
batch_size = batch_size // 2

# Enable gradient checkpointing (Hugging Face Transformers models)
model.gradient_checkpointing_enable()

# Use mixed precision
with torch.autocast(device_type="cuda"):
    outputs = model(**inputs)

# Use larger GPU instance
# A100-40GB → A100-80GB → H100
```
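
The "reduce batch size" step can be automated by retrying with half the batch size whenever an out-of-memory error is raised. The sketch below is framework-agnostic: `step` is a hypothetical callable that runs one training step at the given batch size, and the check relies on `torch.cuda.OutOfMemoryError` being a `RuntimeError` whose message contains "out of memory".

```python
from typing import Callable


def run_with_backoff(step: Callable[[int], None], batch_size: int,
                     min_batch_size: int = 1) -> int:
    """Call step(batch_size), halving batch_size after each out-of-memory error.

    Returns the batch size that succeeded; re-raises if min_batch_size also fails
    or if the error is not an out-of-memory error.
    """
    while True:
        try:
            step(batch_size)
            return batch_size
        except RuntimeError as err:
            if "out of memory" not in str(err).lower() or batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)
```

With PyTorch it is probably worth calling `torch.cuda.empty_cache()` inside `step` before each retry so that cached blocks from the failed attempt are released.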

### CUDA version mismatch

**Error**: `CUDA driver version is insufficient for CUDA runtime version`

**Solutions**:
```bash
# Check versions
nvidia-smi      # Shows driver version and the highest CUDA version it supports
nvcc --version  # Shows installed CUDA toolkit version

# Lambda Stack should have compatible versions
# If mismatch, reinstall Lambda Stack
wget -nv -O- https://lambdalabs.com/install-lambda-stack.sh | sh -
sudo reboot

# Or install a PyTorch build that matches the driver (example: 2.1.0 built for CUDA 12.1)
pip install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu121
```
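
The comparison behind this error is between the CUDA version the driver supports (from `nvidia-smi`) and the version the toolkit or framework was built for (from `nvcc --version` or the PyTorch wheel tag). A minimal sketch that extracts both and compares them; the function names are illustrative, and the `>=` rule is a conservative assumption because CUDA's minor-version compatibility can allow some newer minor releases to run.

```python
import re
from typing import Optional, Tuple

Version = Tuple[int, int]


def driver_cuda_version(nvidia_smi_output: str) -> Optional[Version]:
    """Extract the 'CUDA Version: X.Y' field from nvidia-smi's header."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def toolkit_cuda_version(nvcc_output: str) -> Optional[Version]:
    """Extract 'release X.Y' from `nvcc --version` output."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def driver_supports(driver: Version, runtime: Version) -> bool:
    """Conservative check: the driver's CUDA version is at least the runtime's."""
    return driver >= runtime
```

If `driver_supports` returns `False`, either the driver needs updating (reinstalling Lambda Stack) or the framework needs a build for an older CUDA version.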

### Multi-GPU not working

**Problem**: Only one GPU is being used

**Solutions**:
```python
# Check all GPUs visible
import torch
print(f"GPUs available: {torch.cuda.device_count()}")

# Verify CUDA_VISIBLE_DEVICES not set restrictively
import os
print(os.environ.get("CUDA_VISIBLE_DEVICES", "not set"))

# Use DataParallel (single process, splits each batch across GPUs)
model = torch.nn.DataParallel(model)
# or DistributedDataParallel (one process per GPU; requires
# torch.distributed.init_process_group() first, e.g. when launched with torchrun)
model = torch.nn.parallel.DistributedDataParallel(model)
```
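
How `CUDA_VISIBLE_DEVICES` is set matters: unset means every GPU is visible, an empty string hides all of them, and a comma-separated list restricts the process to those entries. A small sketch of that interpretation follows; `visible_devices` is an illustrative helper, and it only counts entries without validating them against the hardware.

```python
import os
from typing import List, Mapping, Optional


def visible_devices(env: Optional[Mapping[str, str]] = None) -> Optional[List[str]]:
    """Return the entries in CUDA_VISIBLE_DEVICES, or None if the variable is unset."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None  # unset: all GPUs are visible
    return [entry.strip() for entry in value.split(",") if entry.strip()]
```

A result such as `["0"]` on an 8-GPU instance would explain why `torch.cuda.device_count()` reports a single device.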

## Filesystem Issues

### Filesystem not mounted

**Error**: `/lambda/nfs/<name>` doesn't exist

**Solutions**:
```bash
# Filesystem must be attached at launch time
# Cannot attach to running instance

# Verify filesystem was selected during launch

# Check mount points
df -h | grep lambda

# If missing, terminate and relaunch with filesystem
```
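
A training script can also refuse to start when the filesystem is missing, rather than silently writing to the ephemeral root volume. This is a minimal sketch that assumes the filesystem appears as its own mount point at the path passed in; `require_mounted` is a hypothetical helper name.

```python
import os


def require_mounted(path: str) -> str:
    """Return path if it is a mount point; otherwise raise before any data is written."""
    if not os.path.ismount(path):
        raise RuntimeError(
            f"{path} is not a mount point; was the filesystem attached at launch?"
        )
    return path
```

Calling it once at startup with the expected `/lambda/nfs/<name>` path turns a late "file not found" into an immediate, descriptive failure.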

### Slow filesystem performance

**Problem**: Reading/writing to filesystem is slow

**Solutions**:
```bash
# Use local SSD for temporary/intermediate files
# /home/ubuntu has fast NVMe storage

# Copy frequently accessed data to local storage
cp -r /lambda/nfs/storage/dataset /home/ubuntu/dataset

# Use filesystem for checkpoints and final outputs only

# Check network bandwidth
iperf3 -c <filesystem_server>
```
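
To decide whether copying a dataset to local storage is worthwhile, it can help to measure the read speed of one representative file from each location. The sketch below is a rough measurement only: a second read of the same file is likely served from the page cache and will look much faster. `read_throughput_mb_s` is a hypothetical helper name.

```python
import time


def read_throughput_mb_s(path: str, chunk_size: int = 8 * 1024 * 1024) -> float:
    """Read a file once, sequentially, and return the observed throughput in MB/s."""
    total_bytes = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very small files.
    return total_bytes / (1024 * 1024) / max(elapsed, 1e-9)
```

Comparing the result for a file under `/lambda/nfs/...` with the same file copied to `/home/ubuntu` gives a first indication of how much the filesystem is limiting data loading.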

### Data lost after termination

**Problem**: Files disappeared after instance terminated

**Solutions**:
```bash
# Root volume (/home/ubuntu) is EPHEMERAL
# Data there is lost on termination

# ALWAYS use filesystem for persistent data:
#   /lambda/nfs/<filesystem_name>/

# Sync important local files before terminating
rsync -av /home/ubuntu/outputs/ /lambda/nfs/storage/outputs/
```

### Filesystem full

**Error**: `No space left on device`

**Solutions**:
```bash
# Check filesystem usage
df -h /lambda/nfs/storage

# Find large files
du -sh /lambda/nfs/storage/* | sort -h

# Clean up old checkpoints (files only; run with -print instead of -delete first to preview)
find /lambda/nfs/storage/checkpoints -type f -mtime +7 -delete

# Increase filesystem size in Lambda console
# (may require support request)
```
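
Because deletion is irreversible, it may be worth listing the candidates before removing anything. The following sketch roughly mirrors the age filter of the `find` command (its rounding of file ages differs slightly) and only returns paths; `old_files` is a hypothetical helper name.

```python
import time
from pathlib import Path
from typing import List


def old_files(root: str, days: float = 7.0) -> List[Path]:
    """Return regular files under root last modified more than `days` days ago, oldest first."""
    cutoff = time.time() - days * 86400
    matches = [
        p for p in Path(root).rglob("*")
        if p.is_file() and p.stat().st_mtime < cutoff
    ]
    return sorted(matches, key=lambda p: p.stat().st_mtime)
```

Printing the result for the checkpoints directory shows exactly what a cleanup would remove; deleting is then a separate, explicit step such as calling `unlink()` on each reviewed path.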

## Network Issues

### Port not accessible

**Error**: Cannot connect to service (TensorBoard, Jupyter, etc.)

**Solutions**:
```bash
# Lambda default: Only port 22 is open
# Configure firewall in Lambda console

# Or use SSH tunneling (recommended)
ssh -L 6006:localhost:6006 ubuntu@<IP>
# Access at http://localhost:6006

# For Jupyter
ssh -L 8888:localhost:8888 ubuntu@<IP>
```

### Slow data download

**Problem**: Downloading datasets is slow

**Solutions**:
```bash
# Check available bandwidth
speedtest-cli

# Use multi-threaded download
aria2c -x 16 <URL>

# For HuggingFace models (install hf_transfer before enabling it)
pip install hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1

# For S3, sync already transfers in parallel; raise the concurrency limit (default 10)
aws configure set default.s3.max_concurrent_requests 32
aws s3 sync s3://bucket/data /local/data --quiet
```

### Inter-node communication fails

**Error**: Distributed training can't connect between nodes

**Solutions**:
```bash
# Verify nodes in same region (required)

# Check private IPs can communicate
ping <other_node_private_ip>

# Verify NCCL settings
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=0  # Enable InfiniBand if available

# Check firewall allows distributed ports
# Need: 29500 (PyTorch), or configured MASTER_PORT
```
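
`ping` only shows that the host is reachable, not that the rendezvous port is open. A minimal sketch of a TCP check follows; `can_connect` is a hypothetical helper name. Note that it only succeeds while a process on the master node is actually listening on the port, so run it after the rank 0 process has started.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

From a worker node, `can_connect("<other_node_private_ip>", 29500)` returning `False` while training is starting on the master points at a firewall rule or a wrong `MASTER_PORT` rather than at NCCL itself.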

## Software Issues

### Package installation fails

**Error**: `pip install` errors

**Solutions**:
```bash
# Use virtual environment (don't modify system Python)
python -m venv ~/myenv
source ~/myenv/bin/activate
pip install <package>

# For CUDA packages, match CUDA version
pip install torch --index-url https://download.pytorch.org/whl/cu121

# Clear pip cache if corrupted
pip cache purge
```

### Python version issues

**Error**: Package requires different Python version

**Solutions**:
```bash
# Install alternate Python (don't replace system Python)
sudo apt install python3.11 python3.11-venv python3.11-dev

# Create venv with specific Python
python3.11 -m venv ~/py311env
source ~/py311env/bin/activate
```

### ImportError or ModuleNotFoundError

**Error**: Module not found despite installation

**Solutions**:
```bash
# Verify correct Python environment
which python
pip list | grep <module>

# Ensure virtual environment is activated
source ~/myenv/bin/activate

# Reinstall in correct environment
pip uninstall <package>
pip install <package>
```
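
The shell's `python` and `pip` can point at different environments than the interpreter that actually runs the failing script. A small sketch that asks the running interpreter directly is shown below; `describe_module` is a hypothetical helper name, and it expects a top-level module name.

```python
import importlib.util
import sys


def describe_module(name: str) -> str:
    """Report which interpreter is running and where (or whether) it finds the module."""
    spec = importlib.util.find_spec(name)
    location = spec.origin if spec is not None else "NOT FOUND"
    return f"{sys.executable}: {name} -> {location}"
```

Printing `describe_module("<module>")` from inside the failing script shows both the interpreter path and the module location, which makes an environment mismatch easy to spot.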

## Training Issues

### Training hangs

**Problem**: Training stops progressing, no output

**Solutions**:
```bash
# Check GPU utilization
watch -n 1 nvidia-smi

# If GPUs at 0%, likely data loading bottleneck
# Increase num_workers in DataLoader

# Check for deadlocks in distributed training
export NCCL_DEBUG=INFO

# Add timeouts (in the Python training script, not the shell):
#   dist.init_process_group(..., timeout=timedelta(minutes=30))
```

### Checkpoint corruption

**Error**: `RuntimeError: storage has wrong size` or similar

**Solutions**:
```python
import os
import torch

# Use safe saving pattern
checkpoint_path = "/lambda/nfs/storage/checkpoint.pt"
temp_path = checkpoint_path + ".tmp"
backup_path = checkpoint_path + ".backup"

# Save to temp first
torch.save(state_dict, temp_path)
# Keep the previous checkpoint as a fallback
if os.path.exists(checkpoint_path):
    os.replace(checkpoint_path, backup_path)
# Then atomic rename
os.replace(temp_path, checkpoint_path)

# For loading corrupted checkpoint
try:
    state = torch.load(checkpoint_path)
except Exception:
    # Fall back to previous checkpoint
    state = torch.load(backup_path)
```
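
The write-then-rename idea is independent of PyTorch. The sketch below shows it for raw bytes using only the standard library, and adds an `fsync` so the data reaches storage before the rename. It assumes a POSIX filesystem where a rename within one directory is atomic; `atomic_write_bytes` is a hypothetical helper name.

```python
import os
import tempfile


def atomic_write_bytes(path: str, data: bytes) -> None:
    """Write data so that readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target directory so the rename stays on one filesystem.
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to storage before renaming
        os.replace(temp_path, path)
    except BaseException:
        # Remove the partial temp file, then re-raise the original error.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
```

For a checkpoint, one option is to serialize into an `io.BytesIO` buffer with `torch.save` and pass `buffer.getvalue()` to this function, at the cost of holding the serialized checkpoint in memory.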

### Memory leak

**Problem**: Memory usage grows over time

**Solutions**:
```python
# Clear CUDA cache periodically
# (releases unused cached blocks; tensors that are still referenced stay allocated)
torch.cuda.empty_cache()

# Detach tensors when logging
loss_value = loss.detach().cpu().item()

# Don't accumulate gradients unintentionally
optimizer.zero_grad(set_to_none=True)

# Use gradient accumulation properly
if (step + 1) % accumulation_steps == 0:
    optimizer.step()
    optimizer.zero_grad()
```

## Billing Issues

### Unexpected charges

**Problem**: Bill higher than expected

**Solutions**:
```bash
# Check for forgotten running instances
curl -u "$LAMBDA_API_KEY:" \
  https://cloud.lambdalabs.com/api/v1/instances | jq '.data[].id'

# Terminate all instances
# Lambda console > Instances > Terminate all

# Lambda charges by the minute
# There is no "stop" feature: an instance is billed until it is terminated
```
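
The same check can be run from a scheduled Python script. This is a minimal sketch that reuses the endpoint and the `.data[].id` response shape from the `curl` command above, and sends the API key as the HTTP Basic auth username as `curl -u` does. The function names are hypothetical, and any response fields beyond `data[].id` are not assumed.

```python
import base64
import json
import urllib.request
from typing import List

# Endpoint taken from the curl command above.
API_URL = "https://cloud.lambdalabs.com/api/v1/instances"


def build_request(api_key: str) -> urllib.request.Request:
    """Build a GET request with the API key as the Basic auth username and an empty password."""
    token = base64.b64encode(f"{api_key}:".encode()).decode()
    return urllib.request.Request(API_URL, headers={"Authorization": f"Basic {token}"})


def instance_ids(payload: dict) -> List[str]:
    """Mirror jq '.data[].id': collect the id of every instance in the response."""
    return [instance["id"] for instance in payload["data"]]


def list_running_instances(api_key: str) -> List[str]:
    """Fetch the instance list and return the instance ids."""
    with urllib.request.urlopen(build_request(api_key), timeout=30) as response:
        return instance_ids(json.load(response))
```

Calling `list_running_instances(os.environ["LAMBDA_API_KEY"])` from a daily job and alerting on a non-empty result is one way to catch forgotten instances early.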

### Instance terminated unexpectedly

**Problem**: Instance disappeared without manual termination

**Possible causes**:
- Payment issue (card declined)
- Account suspension
- Instance health check failure

**Solutions**:
- Check email for Lambda notifications
- Verify payment method in console
- Contact Lambda support
- Always checkpoint to filesystem

## Common Error Messages

| Error | Cause | Solution |
|-------|-------|----------|
| `No capacity available` | Region/GPU sold out | Try different region or GPU type |
| `Permission denied (publickey)` | SSH key mismatch | Re-add key, check permissions |
| `CUDA out of memory` | Model too large | Reduce batch size, use larger GPU |
| `No space left on device` | Disk full | Clean up or use filesystem |
| `Connection refused` | Instance not ready | Wait 3-15 minutes for boot |
| `Module not found` | Wrong Python env | Activate correct virtualenv |

## Getting Help

1. **Documentation**: https://docs.lambda.ai
2. **Support**: https://support.lambdalabs.com
3. **Email**: support@lambdalabs.com
4. **Status**: Check Lambda status page for outages

### Information to Include

When contacting support, include:
- Instance ID
- Region
- Instance type
- Error message (full traceback)
- Steps to reproduce
- Time of occurrence