@synsci/cli-darwin-arm64 1.1.49
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/skills/accelerate/SKILL.md +332 -0
- package/bin/skills/accelerate/references/custom-plugins.md +453 -0
- package/bin/skills/accelerate/references/megatron-integration.md +489 -0
- package/bin/skills/accelerate/references/performance.md +525 -0
- package/bin/skills/audiocraft/SKILL.md +564 -0
- package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
- package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
- package/bin/skills/autogpt/SKILL.md +403 -0
- package/bin/skills/autogpt/references/advanced-usage.md +535 -0
- package/bin/skills/autogpt/references/troubleshooting.md +420 -0
- package/bin/skills/awq/SKILL.md +310 -0
- package/bin/skills/awq/references/advanced-usage.md +324 -0
- package/bin/skills/awq/references/troubleshooting.md +344 -0
- package/bin/skills/axolotl/SKILL.md +158 -0
- package/bin/skills/axolotl/references/api.md +5548 -0
- package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
- package/bin/skills/axolotl/references/index.md +15 -0
- package/bin/skills/axolotl/references/other.md +3563 -0
- package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
- package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
- package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
- package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
- package/bin/skills/bitsandbytes/SKILL.md +411 -0
- package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
- package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
- package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
- package/bin/skills/blip-2/SKILL.md +564 -0
- package/bin/skills/blip-2/references/advanced-usage.md +680 -0
- package/bin/skills/blip-2/references/troubleshooting.md +526 -0
- package/bin/skills/chroma/SKILL.md +406 -0
- package/bin/skills/chroma/references/integration.md +38 -0
- package/bin/skills/clip/SKILL.md +253 -0
- package/bin/skills/clip/references/applications.md +207 -0
- package/bin/skills/constitutional-ai/SKILL.md +290 -0
- package/bin/skills/crewai/SKILL.md +498 -0
- package/bin/skills/crewai/references/flows.md +438 -0
- package/bin/skills/crewai/references/tools.md +429 -0
- package/bin/skills/crewai/references/troubleshooting.md +480 -0
- package/bin/skills/deepspeed/SKILL.md +141 -0
- package/bin/skills/deepspeed/references/08.md +17 -0
- package/bin/skills/deepspeed/references/09.md +173 -0
- package/bin/skills/deepspeed/references/2020.md +378 -0
- package/bin/skills/deepspeed/references/2023.md +279 -0
- package/bin/skills/deepspeed/references/assets.md +179 -0
- package/bin/skills/deepspeed/references/index.md +35 -0
- package/bin/skills/deepspeed/references/mii.md +118 -0
- package/bin/skills/deepspeed/references/other.md +1191 -0
- package/bin/skills/deepspeed/references/tutorials.md +6554 -0
- package/bin/skills/dspy/SKILL.md +590 -0
- package/bin/skills/dspy/references/examples.md +663 -0
- package/bin/skills/dspy/references/modules.md +475 -0
- package/bin/skills/dspy/references/optimizers.md +566 -0
- package/bin/skills/faiss/SKILL.md +221 -0
- package/bin/skills/faiss/references/index_types.md +280 -0
- package/bin/skills/flash-attention/SKILL.md +367 -0
- package/bin/skills/flash-attention/references/benchmarks.md +215 -0
- package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
- package/bin/skills/gguf/SKILL.md +427 -0
- package/bin/skills/gguf/references/advanced-usage.md +504 -0
- package/bin/skills/gguf/references/troubleshooting.md +442 -0
- package/bin/skills/gptq/SKILL.md +450 -0
- package/bin/skills/gptq/references/calibration.md +337 -0
- package/bin/skills/gptq/references/integration.md +129 -0
- package/bin/skills/gptq/references/troubleshooting.md +95 -0
- package/bin/skills/grpo-rl-training/README.md +97 -0
- package/bin/skills/grpo-rl-training/SKILL.md +572 -0
- package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
- package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
- package/bin/skills/guidance/SKILL.md +572 -0
- package/bin/skills/guidance/references/backends.md +554 -0
- package/bin/skills/guidance/references/constraints.md +674 -0
- package/bin/skills/guidance/references/examples.md +767 -0
- package/bin/skills/hqq/SKILL.md +445 -0
- package/bin/skills/hqq/references/advanced-usage.md +528 -0
- package/bin/skills/hqq/references/troubleshooting.md +503 -0
- package/bin/skills/hugging-face-cli/SKILL.md +191 -0
- package/bin/skills/hugging-face-cli/references/commands.md +954 -0
- package/bin/skills/hugging-face-cli/references/examples.md +374 -0
- package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
- package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
- package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
- package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
- package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
- package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
- package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
- package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
- package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
- package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
- package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
- package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
- package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
- package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
- package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
- package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
- package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
- package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
- package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
- package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
- package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
- package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
- package/bin/skills/hugging-face-jobs/index.html +216 -0
- package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
- package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
- package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
- package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
- package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
- package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
- package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
- package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
- package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
- package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
- package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
- package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
- package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
- package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
- package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
- package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
- package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
- package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
- package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
- package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
- package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
- package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
- package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
- package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
- package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
- package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
- package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
- package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
- package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
- package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
- package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
- package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
- package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
- package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
- package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
- package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
- package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
- package/bin/skills/instructor/SKILL.md +740 -0
- package/bin/skills/instructor/references/examples.md +107 -0
- package/bin/skills/instructor/references/providers.md +70 -0
- package/bin/skills/instructor/references/validation.md +606 -0
- package/bin/skills/knowledge-distillation/SKILL.md +458 -0
- package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
- package/bin/skills/lambda-labs/SKILL.md +545 -0
- package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
- package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
- package/bin/skills/langchain/SKILL.md +480 -0
- package/bin/skills/langchain/references/agents.md +499 -0
- package/bin/skills/langchain/references/integration.md +562 -0
- package/bin/skills/langchain/references/rag.md +600 -0
- package/bin/skills/langsmith/SKILL.md +422 -0
- package/bin/skills/langsmith/references/advanced-usage.md +548 -0
- package/bin/skills/langsmith/references/troubleshooting.md +537 -0
- package/bin/skills/litgpt/SKILL.md +469 -0
- package/bin/skills/litgpt/references/custom-models.md +568 -0
- package/bin/skills/litgpt/references/distributed-training.md +451 -0
- package/bin/skills/litgpt/references/supported-models.md +336 -0
- package/bin/skills/litgpt/references/training-recipes.md +619 -0
- package/bin/skills/llama-cpp/SKILL.md +258 -0
- package/bin/skills/llama-cpp/references/optimization.md +89 -0
- package/bin/skills/llama-cpp/references/quantization.md +213 -0
- package/bin/skills/llama-cpp/references/server.md +125 -0
- package/bin/skills/llama-factory/SKILL.md +80 -0
- package/bin/skills/llama-factory/references/_images.md +23 -0
- package/bin/skills/llama-factory/references/advanced.md +1055 -0
- package/bin/skills/llama-factory/references/getting_started.md +349 -0
- package/bin/skills/llama-factory/references/index.md +19 -0
- package/bin/skills/llama-factory/references/other.md +31 -0
- package/bin/skills/llamaguard/SKILL.md +337 -0
- package/bin/skills/llamaindex/SKILL.md +569 -0
- package/bin/skills/llamaindex/references/agents.md +83 -0
- package/bin/skills/llamaindex/references/data_connectors.md +108 -0
- package/bin/skills/llamaindex/references/query_engines.md +406 -0
- package/bin/skills/llava/SKILL.md +304 -0
- package/bin/skills/llava/references/training.md +197 -0
- package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
- package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
- package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
- package/bin/skills/long-context/SKILL.md +536 -0
- package/bin/skills/long-context/references/extension_methods.md +468 -0
- package/bin/skills/long-context/references/fine_tuning.md +611 -0
- package/bin/skills/long-context/references/rope.md +402 -0
- package/bin/skills/mamba/SKILL.md +260 -0
- package/bin/skills/mamba/references/architecture-details.md +206 -0
- package/bin/skills/mamba/references/benchmarks.md +255 -0
- package/bin/skills/mamba/references/training-guide.md +388 -0
- package/bin/skills/megatron-core/SKILL.md +366 -0
- package/bin/skills/megatron-core/references/benchmarks.md +249 -0
- package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
- package/bin/skills/megatron-core/references/production-examples.md +473 -0
- package/bin/skills/megatron-core/references/training-recipes.md +547 -0
- package/bin/skills/miles/SKILL.md +315 -0
- package/bin/skills/miles/references/api-reference.md +141 -0
- package/bin/skills/miles/references/troubleshooting.md +352 -0
- package/bin/skills/mlflow/SKILL.md +704 -0
- package/bin/skills/mlflow/references/deployment.md +744 -0
- package/bin/skills/mlflow/references/model-registry.md +770 -0
- package/bin/skills/mlflow/references/tracking.md +680 -0
- package/bin/skills/modal/SKILL.md +341 -0
- package/bin/skills/modal/references/advanced-usage.md +503 -0
- package/bin/skills/modal/references/troubleshooting.md +494 -0
- package/bin/skills/model-merging/SKILL.md +539 -0
- package/bin/skills/model-merging/references/evaluation.md +462 -0
- package/bin/skills/model-merging/references/examples.md +428 -0
- package/bin/skills/model-merging/references/methods.md +352 -0
- package/bin/skills/model-pruning/SKILL.md +495 -0
- package/bin/skills/model-pruning/references/wanda.md +347 -0
- package/bin/skills/moe-training/SKILL.md +526 -0
- package/bin/skills/moe-training/references/architectures.md +432 -0
- package/bin/skills/moe-training/references/inference.md +348 -0
- package/bin/skills/moe-training/references/training.md +425 -0
- package/bin/skills/nanogpt/SKILL.md +290 -0
- package/bin/skills/nanogpt/references/architecture.md +382 -0
- package/bin/skills/nanogpt/references/data.md +476 -0
- package/bin/skills/nanogpt/references/training.md +564 -0
- package/bin/skills/nemo-curator/SKILL.md +383 -0
- package/bin/skills/nemo-curator/references/deduplication.md +87 -0
- package/bin/skills/nemo-curator/references/filtering.md +102 -0
- package/bin/skills/nemo-evaluator/SKILL.md +494 -0
- package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
- package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
- package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
- package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
- package/bin/skills/nemo-guardrails/SKILL.md +297 -0
- package/bin/skills/nnsight/SKILL.md +436 -0
- package/bin/skills/nnsight/references/README.md +78 -0
- package/bin/skills/nnsight/references/api.md +344 -0
- package/bin/skills/nnsight/references/tutorials.md +300 -0
- package/bin/skills/openrlhf/SKILL.md +249 -0
- package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
- package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
- package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
- package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
- package/bin/skills/outlines/SKILL.md +652 -0
- package/bin/skills/outlines/references/backends.md +615 -0
- package/bin/skills/outlines/references/examples.md +773 -0
- package/bin/skills/outlines/references/json_generation.md +652 -0
- package/bin/skills/peft/SKILL.md +431 -0
- package/bin/skills/peft/references/advanced-usage.md +514 -0
- package/bin/skills/peft/references/troubleshooting.md +480 -0
- package/bin/skills/phoenix/SKILL.md +475 -0
- package/bin/skills/phoenix/references/advanced-usage.md +619 -0
- package/bin/skills/phoenix/references/troubleshooting.md +538 -0
- package/bin/skills/pinecone/SKILL.md +358 -0
- package/bin/skills/pinecone/references/deployment.md +181 -0
- package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
- package/bin/skills/pytorch-fsdp/references/index.md +7 -0
- package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
- package/bin/skills/pytorch-lightning/SKILL.md +346 -0
- package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
- package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
- package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
- package/bin/skills/pyvene/SKILL.md +473 -0
- package/bin/skills/pyvene/references/README.md +73 -0
- package/bin/skills/pyvene/references/api.md +383 -0
- package/bin/skills/pyvene/references/tutorials.md +376 -0
- package/bin/skills/qdrant/SKILL.md +493 -0
- package/bin/skills/qdrant/references/advanced-usage.md +648 -0
- package/bin/skills/qdrant/references/troubleshooting.md +631 -0
- package/bin/skills/ray-data/SKILL.md +326 -0
- package/bin/skills/ray-data/references/integration.md +82 -0
- package/bin/skills/ray-data/references/transformations.md +83 -0
- package/bin/skills/ray-train/SKILL.md +406 -0
- package/bin/skills/ray-train/references/multi-node.md +628 -0
- package/bin/skills/rwkv/SKILL.md +260 -0
- package/bin/skills/rwkv/references/architecture-details.md +344 -0
- package/bin/skills/rwkv/references/rwkv7.md +386 -0
- package/bin/skills/rwkv/references/state-management.md +369 -0
- package/bin/skills/saelens/SKILL.md +386 -0
- package/bin/skills/saelens/references/README.md +70 -0
- package/bin/skills/saelens/references/api.md +333 -0
- package/bin/skills/saelens/references/tutorials.md +318 -0
- package/bin/skills/segment-anything/SKILL.md +500 -0
- package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
- package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
- package/bin/skills/sentence-transformers/SKILL.md +255 -0
- package/bin/skills/sentence-transformers/references/models.md +123 -0
- package/bin/skills/sentencepiece/SKILL.md +235 -0
- package/bin/skills/sentencepiece/references/algorithms.md +200 -0
- package/bin/skills/sentencepiece/references/training.md +304 -0
- package/bin/skills/sglang/SKILL.md +442 -0
- package/bin/skills/sglang/references/deployment.md +490 -0
- package/bin/skills/sglang/references/radix-attention.md +413 -0
- package/bin/skills/sglang/references/structured-generation.md +541 -0
- package/bin/skills/simpo/SKILL.md +219 -0
- package/bin/skills/simpo/references/datasets.md +478 -0
- package/bin/skills/simpo/references/hyperparameters.md +452 -0
- package/bin/skills/simpo/references/loss-functions.md +350 -0
- package/bin/skills/skypilot/SKILL.md +509 -0
- package/bin/skills/skypilot/references/advanced-usage.md +491 -0
- package/bin/skills/skypilot/references/troubleshooting.md +570 -0
- package/bin/skills/slime/SKILL.md +464 -0
- package/bin/skills/slime/references/api-reference.md +392 -0
- package/bin/skills/slime/references/troubleshooting.md +386 -0
- package/bin/skills/speculative-decoding/SKILL.md +467 -0
- package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
- package/bin/skills/speculative-decoding/references/medusa.md +350 -0
- package/bin/skills/stable-diffusion/SKILL.md +519 -0
- package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
- package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
- package/bin/skills/tensorboard/SKILL.md +629 -0
- package/bin/skills/tensorboard/references/integrations.md +638 -0
- package/bin/skills/tensorboard/references/profiling.md +545 -0
- package/bin/skills/tensorboard/references/visualization.md +620 -0
- package/bin/skills/tensorrt-llm/SKILL.md +187 -0
- package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
- package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
- package/bin/skills/tensorrt-llm/references/serving.md +470 -0
- package/bin/skills/tinker/SKILL.md +362 -0
- package/bin/skills/tinker/references/api-reference.md +168 -0
- package/bin/skills/tinker/references/getting-started.md +157 -0
- package/bin/skills/tinker/references/loss-functions.md +163 -0
- package/bin/skills/tinker/references/models-and-lora.md +139 -0
- package/bin/skills/tinker/references/recipes.md +280 -0
- package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
- package/bin/skills/tinker/references/rendering.md +243 -0
- package/bin/skills/tinker/references/supervised-learning.md +232 -0
- package/bin/skills/tinker-training-cost/SKILL.md +187 -0
- package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
- package/bin/skills/torchforge/SKILL.md +433 -0
- package/bin/skills/torchforge/references/api-reference.md +327 -0
- package/bin/skills/torchforge/references/troubleshooting.md +409 -0
- package/bin/skills/torchtitan/SKILL.md +358 -0
- package/bin/skills/torchtitan/references/checkpoint.md +181 -0
- package/bin/skills/torchtitan/references/custom-models.md +258 -0
- package/bin/skills/torchtitan/references/float8.md +133 -0
- package/bin/skills/torchtitan/references/fsdp.md +126 -0
- package/bin/skills/transformer-lens/SKILL.md +346 -0
- package/bin/skills/transformer-lens/references/README.md +54 -0
- package/bin/skills/transformer-lens/references/api.md +362 -0
- package/bin/skills/transformer-lens/references/tutorials.md +339 -0
- package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
- package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
- package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
- package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
- package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
- package/bin/skills/unsloth/SKILL.md +80 -0
- package/bin/skills/unsloth/references/index.md +7 -0
- package/bin/skills/unsloth/references/llms-full.md +16799 -0
- package/bin/skills/unsloth/references/llms-txt.md +12044 -0
- package/bin/skills/unsloth/references/llms.md +82 -0
- package/bin/skills/verl/SKILL.md +391 -0
- package/bin/skills/verl/references/api-reference.md +301 -0
- package/bin/skills/verl/references/troubleshooting.md +391 -0
- package/bin/skills/vllm/SKILL.md +364 -0
- package/bin/skills/vllm/references/optimization.md +226 -0
- package/bin/skills/vllm/references/quantization.md +284 -0
- package/bin/skills/vllm/references/server-deployment.md +255 -0
- package/bin/skills/vllm/references/troubleshooting.md +447 -0
- package/bin/skills/weights-and-biases/SKILL.md +590 -0
- package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
- package/bin/skills/weights-and-biases/references/integrations.md +700 -0
- package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
- package/bin/skills/whisper/SKILL.md +317 -0
- package/bin/skills/whisper/references/languages.md +189 -0
- package/bin/synsc +0 -0
- package/package.json +10 -0
|
@@ -0,0 +1,386 @@
|
|
|
1
|
+
# slime Troubleshooting Guide
|
|
2
|
+
|
|
3
|
+
## Common Issues and Solutions
|
|
4
|
+
|
|
5
|
+
### SGLang Issues
|
|
6
|
+
|
|
7
|
+
#### Issue: SGLang Engine Crash
|
|
8
|
+
|
|
9
|
+
**Symptoms**: Inference engine dies mid-training, connection errors
|
|
10
|
+
|
|
11
|
+
**Solutions**:
|
|
12
|
+
|
|
13
|
+
1. **Enable fault tolerance**:
|
|
14
|
+
```bash
|
|
15
|
+
--use-fault-tolerance
|
|
16
|
+
```
|
|
17
|
+
|
|
18
|
+
2. **Increase memory allocation**:
|
|
19
|
+
```bash
|
|
20
|
+
--sglang-mem-fraction-static 0.85 # Increase from 0.8
|
|
21
|
+
```
|
|
22
|
+
|
|
23
|
+
3. **Reduce batch size**:
|
|
24
|
+
```bash
|
|
25
|
+
--rollout-batch-size 16 # Reduce from 32
|
|
26
|
+
```
|
|
27
|
+
|
|
28
|
+
4. **Disable CUDA graphs** (for debugging):
|
|
29
|
+
```bash
|
|
30
|
+
--sglang-disable-cuda-graph
|
|
31
|
+
```
|
|
32
|
+
|
|
33
|
+
#### Issue: SGLang Router Load Imbalance
|
|
34
|
+
|
|
35
|
+
**Symptoms**: Some SGLang engines overloaded while others idle
|
|
36
|
+
|
|
37
|
+
**Solutions**:
|
|
38
|
+
|
|
39
|
+
1. **Adjust routing strategy**:
|
|
40
|
+
```bash
|
|
41
|
+
--sglang-router-strategy round_robin
|
|
42
|
+
```
|
|
43
|
+
|
|
44
|
+
2. **Increase number of engines**:
|
|
45
|
+
```bash
|
|
46
|
+
--rollout-num-gpus-per-engine 1 # More engines, less GPUs each
|
|
47
|
+
```
|
|
48
|
+
|
|
49
|
+
### Weight Synchronization Issues
|
|
50
|
+
|
|
51
|
+
#### Issue: Weight Sync Timeout
|
|
52
|
+
|
|
53
|
+
**Symptoms**: Training hangs after rollout, timeout errors
|
|
54
|
+
|
|
55
|
+
**Solutions**:
|
|
56
|
+
|
|
57
|
+
1. **Increase sync interval** (async mode):
|
|
58
|
+
```bash
|
|
59
|
+
--update-weights-interval 5 # Increase from 2
|
|
60
|
+
```
|
|
61
|
+
|
|
62
|
+
2. **Use colocated mode** (eliminates network transfer):
|
|
63
|
+
```bash
|
|
64
|
+
--colocate
|
|
65
|
+
```
|
|
66
|
+
|
|
67
|
+
3. **Check network bandwidth**:
|
|
68
|
+
```bash
|
|
69
|
+
# Verify InfiniBand is enabled
|
|
70
|
+
ibstat
|
|
71
|
+
```
|
|
72
|
+
|
|
73
|
+
#### Issue: Weight Sync Failures in Multi-Node
|
|
74
|
+
|
|
75
|
+
**Symptoms**: Nodes fail to receive updated weights
|
|
76
|
+
|
|
77
|
+
**Solutions**:
|
|
78
|
+
|
|
79
|
+
1. **Set NCCL environment**:
|
|
80
|
+
```bash
|
|
81
|
+
export NCCL_DEBUG=INFO
|
|
82
|
+
export NCCL_SOCKET_IFNAME=eth0
|
|
83
|
+
export NCCL_IB_DISABLE=0
|
|
84
|
+
```
|
|
85
|
+
|
|
86
|
+
2. **Increase timeout**:
|
|
87
|
+
```bash
|
|
88
|
+
export NCCL_TIMEOUT=1800
|
|
89
|
+
```
|
|
90
|
+
|
|
91
|
+
### Memory Issues
|
|
92
|
+
|
|
93
|
+
#### Issue: OOM During Training
|
|
94
|
+
|
|
95
|
+
**Symptoms**: CUDA OOM in backward pass
|
|
96
|
+
|
|
97
|
+
**Solutions**:
|
|
98
|
+
|
|
99
|
+
1. **Enable gradient checkpointing**:
|
|
100
|
+
```bash
|
|
101
|
+
--recompute-activations
|
|
102
|
+
```
|
|
103
|
+
|
|
104
|
+
2. **Reduce micro-batch size**:
|
|
105
|
+
```bash
|
|
106
|
+
--micro-batch-size 1
|
|
107
|
+
```
|
|
108
|
+
|
|
109
|
+
3. **Enable sequence parallelism**:
|
|
110
|
+
```bash
|
|
111
|
+
--sequence-parallel
|
|
112
|
+
```
|
|
113
|
+
|
|
114
|
+
4. **Reduce global batch size**:
|
|
115
|
+
```bash
|
|
116
|
+
--global-batch-size 128 # Reduce from 256
|
|
117
|
+
```
|
|
118
|
+
|
|
119
|
+
#### Issue: OOM in Colocated Mode
|
|
120
|
+
|
|
121
|
+
**Symptoms**: OOM when both training and inference run on same GPUs
|
|
122
|
+
|
|
123
|
+
**Solutions**:
|
|
124
|
+
|
|
125
|
+
1. **Reduce SGLang memory**:
|
|
126
|
+
```bash
|
|
127
|
+
--sglang-mem-fraction-static 0.4 # Reduce from 0.8
|
|
128
|
+
```
|
|
129
|
+
|
|
130
|
+
2. **Enable offloading**:
|
|
131
|
+
```bash
|
|
132
|
+
--offload-optimizer-states
|
|
133
|
+
```
|
|
134
|
+
|
|
135
|
+
3. **Use smaller sequence length**:
|
|
136
|
+
```bash
|
|
137
|
+
--seq-length 2048 # Reduce from 4096
|
|
138
|
+
```
|
|
139
|
+
|
|
140
|
+
### Data Loading Issues
|
|
141
|
+
|
|
142
|
+
#### Issue: Slow Data Loading
|
|
143
|
+
|
|
144
|
+
**Symptoms**: GPU idle during data fetch, low GPU utilization
|
|
145
|
+
|
|
146
|
+
**Solutions**:
|
|
147
|
+
|
|
148
|
+
1. **Increase data workers**:
|
|
149
|
+
```bash
|
|
150
|
+
--num-data-workers 4
|
|
151
|
+
```
|
|
152
|
+
|
|
153
|
+
2. **Use streaming dataset**:
|
|
154
|
+
```bash
|
|
155
|
+
--streaming-data
|
|
156
|
+
```
|
|
157
|
+
|
|
158
|
+
3. **Pre-tokenize data**:
|
|
159
|
+
```python
|
|
160
|
+
# Pre-process data offline
|
|
161
|
+
from transformers import AutoTokenizer
|
|
162
|
+
tokenizer = AutoTokenizer.from_pretrained("model_path")
|
|
163
|
+
# Save tokenized data
|
|
164
|
+
```
|
|
165
|
+
|
|
166
|
+
#### Issue: Data Format Errors
|
|
167
|
+
|
|
168
|
+
**Symptoms**: KeyError, missing fields, parsing failures
|
|
169
|
+
|
|
170
|
+
**Solutions**:
|
|
171
|
+
|
|
172
|
+
1. **Verify data format**:
|
|
173
|
+
```python
|
|
174
|
+
import json
|
|
175
|
+
with open("data.jsonl") as f:
|
|
176
|
+
for line in f:
|
|
177
|
+
data = json.loads(line)
|
|
178
|
+
assert "prompt" in data, "Missing prompt field"
|
|
179
|
+
assert "label" in data, "Missing label field"
|
|
180
|
+
```
|
|
181
|
+
|
|
182
|
+
2. **Check key names**:
|
|
183
|
+
```bash
|
|
184
|
+
--input-key prompt # Must match your data
|
|
185
|
+
--label-key label # Must match your data
|
|
186
|
+
```
|
|
187
|
+
|
|
188
|
+
### Training Stability Issues
|
|
189
|
+
|
|
190
|
+
#### Issue: Loss Explosion / NaN
|
|
191
|
+
|
|
192
|
+
**Symptoms**: Loss becomes NaN or explodes
|
|
193
|
+
|
|
194
|
+
**Solutions**:
|
|
195
|
+
|
|
196
|
+
1. **Reduce learning rate**:
|
|
197
|
+
```bash
|
|
198
|
+
--lr 1e-6 # Reduce from 5e-6
|
|
199
|
+
```
|
|
200
|
+
|
|
201
|
+
2. **Enable gradient clipping**:
|
|
202
|
+
```bash
|
|
203
|
+
--clip-grad 1.0
|
|
204
|
+
```
|
|
205
|
+
|
|
206
|
+
3. **Check for data issues**:
|
|
207
|
+
```python
|
|
208
|
+
# Verify no empty prompts or responses
|
|
209
|
+
for sample in dataset:
|
|
210
|
+
assert len(sample["prompt"]) > 0
|
|
211
|
+
```
|
|
212
|
+
|
|
213
|
+
4. **Use BF16 instead of FP16**:
|
|
214
|
+
```bash
|
|
215
|
+
--bf16 # More numerically stable
|
|
216
|
+
```
|
|
217
|
+
|
|
218
|
+
#### Issue: Reward Collapse
|
|
219
|
+
|
|
220
|
+
**Symptoms**: Reward drops to zero, model outputs garbage
|
|
221
|
+
|
|
222
|
+
**Solutions**:
|
|
223
|
+
|
|
224
|
+
1. **Increase KL penalty**:
|
|
225
|
+
```bash
|
|
226
|
+
--kl-loss-coef 0.01 # Increase from 0.001
|
|
227
|
+
```
|
|
228
|
+
|
|
229
|
+
2. **Reduce number of samples**:
|
|
230
|
+
```bash
|
|
231
|
+
--n-samples-per-prompt 4 # Reduce from 8
|
|
232
|
+
```
|
|
233
|
+
|
|
234
|
+
3. **Verify reward function**:
|
|
235
|
+
```python
|
|
236
|
+
# Test reward function independently
|
|
237
|
+
from custom_rm import reward_func
|
|
238
|
+
sample = Sample(prompt="test", response="test response")
|
|
239
|
+
reward = reward_func(args, sample)
|
|
240
|
+
print(f"Reward: {reward}") # Should be reasonable
|
|
241
|
+
```
|
|
242
|
+
|
|
243
|
+
### Async Training Issues
|
|
244
|
+
|
|
245
|
+
#### Issue: Async Training Not Supported with Colocate
|
|
246
|
+
|
|
247
|
+
**Symptoms**: Error when using `--colocate` with `train_async.py`
|
|
248
|
+
|
|
249
|
+
**Solution**: Colocated mode is NOT supported for async training. Use separate GPUs:
|
|
250
|
+
```bash
|
|
251
|
+
# Remove --colocate flag
|
|
252
|
+
python train_async.py \
|
|
253
|
+
--actor-num-gpus-per-node 4 \
|
|
254
|
+
--rollout-num-gpus 4 \
|
|
255
|
+
# No --colocate
|
|
256
|
+
```
|
|
257
|
+
|
|
258
|
+
#### Issue: Stale Weights in Async Mode
|
|
259
|
+
|
|
260
|
+
**Symptoms**: Policy divergence, inconsistent behavior
|
|
261
|
+
|
|
262
|
+
**Solutions**:
|
|
263
|
+
|
|
264
|
+
1. **Reduce async buffer size**:
|
|
265
|
+
```bash
|
|
266
|
+
--async-buffer-size 2 # Reduce from 4
|
|
267
|
+
```
|
|
268
|
+
|
|
269
|
+
2. **Increase weight update frequency**:
|
|
270
|
+
```bash
|
|
271
|
+
--update-weights-interval 1 # Sync every rollout
|
|
272
|
+
```
|
|
273
|
+
|
|
274
|
+
### Multi-Turn Training Issues
|
|
275
|
+
|
|
276
|
+
#### Issue: Tool Responses Included in Loss
|
|
277
|
+
|
|
278
|
+
**Symptoms**: Model learns to output tool responses verbatim
|
|
279
|
+
|
|
280
|
+
**Solution**: Properly set loss mask in custom generate function:
|
|
281
|
+
```python
|
|
282
|
+
def build_loss_mask(sample):
|
|
283
|
+
"""Create loss mask that excludes tool responses."""
|
|
284
|
+
mask = []
|
|
285
|
+
for i, token in enumerate(sample.tokens):
|
|
286
|
+
if is_tool_response(token, sample.metadata):
|
|
287
|
+
mask.append(0) # Don't compute loss
|
|
288
|
+
else:
|
|
289
|
+
mask.append(1) # Compute loss
|
|
290
|
+
return mask
|
|
291
|
+
```
|
|
292
|
+
|
|
293
|
+
#### Issue: Multi-Turn Context Too Long
|
|
294
|
+
|
|
295
|
+
**Symptoms**: OOM or truncation in multi-turn conversations
|
|
296
|
+
|
|
297
|
+
**Solutions**:
|
|
298
|
+
|
|
299
|
+
1. **Limit conversation history**:
|
|
300
|
+
```python
|
|
301
|
+
# In custom generate function
|
|
302
|
+
conversation = sample.prompt[-10:] # Keep last 10 turns
|
|
303
|
+
```
|
|
304
|
+
|
|
305
|
+
2. **Increase context length**:
|
|
306
|
+
```bash
|
|
307
|
+
--sglang-context-length 16384
|
|
308
|
+
```
|
|
309
|
+
|
|
310
|
+
### Checkpoint Issues
|
|
311
|
+
|
|
312
|
+
#### Issue: Checkpoint Loading Fails
|
|
313
|
+
|
|
314
|
+
**Symptoms**: Cannot load saved checkpoint
|
|
315
|
+
|
|
316
|
+
**Solutions**:
|
|
317
|
+
|
|
318
|
+
1. **Verify checkpoint path**:
|
|
319
|
+
```bash
|
|
320
|
+
ls -la /path/to/checkpoint/
|
|
321
|
+
```
|
|
322
|
+
|
|
323
|
+
2. **Check parallelism matches**:
|
|
324
|
+
```bash
|
|
325
|
+
# Checkpoint was saved with TP=2, must load with TP=2
|
|
326
|
+
--tensor-model-parallel-size 2
|
|
327
|
+
```
|
|
328
|
+
|
|
329
|
+
3. **Convert HuggingFace to Megatron** (if needed):
|
|
330
|
+
```bash
|
|
331
|
+
python tools/convert_hf_to_megatron.py \
|
|
332
|
+
--hf_model_path /path/to/hf/model \
|
|
333
|
+
--save_path /path/to/megatron/checkpoint
|
|
334
|
+
```
|
|
335
|
+
|
|
336
|
+
### Debugging Tips
|
|
337
|
+
|
|
338
|
+
#### Enable Verbose Logging
|
|
339
|
+
|
|
340
|
+
```bash
|
|
341
|
+
--log-level DEBUG
|
|
342
|
+
export SLIME_DEBUG=1
|
|
343
|
+
```
|
|
344
|
+
|
|
345
|
+
#### Check GPU Utilization
|
|
346
|
+
|
|
347
|
+
```bash
|
|
348
|
+
watch -n 1 nvidia-smi
|
|
349
|
+
```
|
|
350
|
+
|
|
351
|
+
#### Monitor Training
|
|
352
|
+
|
|
353
|
+
```bash
|
|
354
|
+
tensorboard --logdir outputs/
|
|
355
|
+
```
|
|
356
|
+
|
|
357
|
+
#### Test Custom Functions Independently
|
|
358
|
+
|
|
359
|
+
```python
|
|
360
|
+
# Test reward function
|
|
361
|
+
import asyncio
|
|
362
|
+
from custom_rm import reward_func
|
|
363
|
+
|
|
364
|
+
async def test():
|
|
365
|
+
sample = Sample(prompt="test", response="test", label="expected")
|
|
366
|
+
reward = await reward_func(args, sample)
|
|
367
|
+
print(f"Reward: {reward}")
|
|
368
|
+
|
|
369
|
+
asyncio.run(test())
|
|
370
|
+
```
|
|
371
|
+
|
|
372
|
+
## Constraint Reference
|
|
373
|
+
|
|
374
|
+
Key constraint to remember:
|
|
375
|
+
|
|
376
|
+
```
|
|
377
|
+
rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout
|
|
378
|
+
```
|
|
379
|
+
|
|
380
|
+
Example: `32 × 8 = 256 × 1`
|
|
381
|
+
|
|
382
|
+
## Resources
|
|
383
|
+
|
|
384
|
+
- GitHub Issues: https://github.com/THUDM/slime/issues
|
|
385
|
+
- Documentation: https://thudm.github.io/slime/
|
|
386
|
+
- Examples: `examples/` directory
|