@synsci/cli-darwin-x64 1.1.49
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/skills/accelerate/SKILL.md +332 -0
- package/bin/skills/accelerate/references/custom-plugins.md +453 -0
- package/bin/skills/accelerate/references/megatron-integration.md +489 -0
- package/bin/skills/accelerate/references/performance.md +525 -0
- package/bin/skills/audiocraft/SKILL.md +564 -0
- package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
- package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
- package/bin/skills/autogpt/SKILL.md +403 -0
- package/bin/skills/autogpt/references/advanced-usage.md +535 -0
- package/bin/skills/autogpt/references/troubleshooting.md +420 -0
- package/bin/skills/awq/SKILL.md +310 -0
- package/bin/skills/awq/references/advanced-usage.md +324 -0
- package/bin/skills/awq/references/troubleshooting.md +344 -0
- package/bin/skills/axolotl/SKILL.md +158 -0
- package/bin/skills/axolotl/references/api.md +5548 -0
- package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
- package/bin/skills/axolotl/references/index.md +15 -0
- package/bin/skills/axolotl/references/other.md +3563 -0
- package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
- package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
- package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
- package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
- package/bin/skills/bitsandbytes/SKILL.md +411 -0
- package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
- package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
- package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
- package/bin/skills/blip-2/SKILL.md +564 -0
- package/bin/skills/blip-2/references/advanced-usage.md +680 -0
- package/bin/skills/blip-2/references/troubleshooting.md +526 -0
- package/bin/skills/chroma/SKILL.md +406 -0
- package/bin/skills/chroma/references/integration.md +38 -0
- package/bin/skills/clip/SKILL.md +253 -0
- package/bin/skills/clip/references/applications.md +207 -0
- package/bin/skills/constitutional-ai/SKILL.md +290 -0
- package/bin/skills/crewai/SKILL.md +498 -0
- package/bin/skills/crewai/references/flows.md +438 -0
- package/bin/skills/crewai/references/tools.md +429 -0
- package/bin/skills/crewai/references/troubleshooting.md +480 -0
- package/bin/skills/deepspeed/SKILL.md +141 -0
- package/bin/skills/deepspeed/references/08.md +17 -0
- package/bin/skills/deepspeed/references/09.md +173 -0
- package/bin/skills/deepspeed/references/2020.md +378 -0
- package/bin/skills/deepspeed/references/2023.md +279 -0
- package/bin/skills/deepspeed/references/assets.md +179 -0
- package/bin/skills/deepspeed/references/index.md +35 -0
- package/bin/skills/deepspeed/references/mii.md +118 -0
- package/bin/skills/deepspeed/references/other.md +1191 -0
- package/bin/skills/deepspeed/references/tutorials.md +6554 -0
- package/bin/skills/dspy/SKILL.md +590 -0
- package/bin/skills/dspy/references/examples.md +663 -0
- package/bin/skills/dspy/references/modules.md +475 -0
- package/bin/skills/dspy/references/optimizers.md +566 -0
- package/bin/skills/faiss/SKILL.md +221 -0
- package/bin/skills/faiss/references/index_types.md +280 -0
- package/bin/skills/flash-attention/SKILL.md +367 -0
- package/bin/skills/flash-attention/references/benchmarks.md +215 -0
- package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
- package/bin/skills/gguf/SKILL.md +427 -0
- package/bin/skills/gguf/references/advanced-usage.md +504 -0
- package/bin/skills/gguf/references/troubleshooting.md +442 -0
- package/bin/skills/gptq/SKILL.md +450 -0
- package/bin/skills/gptq/references/calibration.md +337 -0
- package/bin/skills/gptq/references/integration.md +129 -0
- package/bin/skills/gptq/references/troubleshooting.md +95 -0
- package/bin/skills/grpo-rl-training/README.md +97 -0
- package/bin/skills/grpo-rl-training/SKILL.md +572 -0
- package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
- package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
- package/bin/skills/guidance/SKILL.md +572 -0
- package/bin/skills/guidance/references/backends.md +554 -0
- package/bin/skills/guidance/references/constraints.md +674 -0
- package/bin/skills/guidance/references/examples.md +767 -0
- package/bin/skills/hqq/SKILL.md +445 -0
- package/bin/skills/hqq/references/advanced-usage.md +528 -0
- package/bin/skills/hqq/references/troubleshooting.md +503 -0
- package/bin/skills/hugging-face-cli/SKILL.md +191 -0
- package/bin/skills/hugging-face-cli/references/commands.md +954 -0
- package/bin/skills/hugging-face-cli/references/examples.md +374 -0
- package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
- package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
- package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
- package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
- package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
- package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
- package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
- package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
- package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
- package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
- package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
- package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
- package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
- package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
- package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
- package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
- package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
- package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
- package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
- package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
- package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
- package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
- package/bin/skills/hugging-face-jobs/index.html +216 -0
- package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
- package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
- package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
- package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
- package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
- package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
- package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
- package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
- package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
- package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
- package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
- package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
- package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
- package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
- package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
- package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
- package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
- package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
- package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
- package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
- package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
- package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
- package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
- package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
- package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
- package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
- package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
- package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
- package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
- package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
- package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
- package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
- package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
- package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
- package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
- package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
- package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
- package/bin/skills/instructor/SKILL.md +740 -0
- package/bin/skills/instructor/references/examples.md +107 -0
- package/bin/skills/instructor/references/providers.md +70 -0
- package/bin/skills/instructor/references/validation.md +606 -0
- package/bin/skills/knowledge-distillation/SKILL.md +458 -0
- package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
- package/bin/skills/lambda-labs/SKILL.md +545 -0
- package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
- package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
- package/bin/skills/langchain/SKILL.md +480 -0
- package/bin/skills/langchain/references/agents.md +499 -0
- package/bin/skills/langchain/references/integration.md +562 -0
- package/bin/skills/langchain/references/rag.md +600 -0
- package/bin/skills/langsmith/SKILL.md +422 -0
- package/bin/skills/langsmith/references/advanced-usage.md +548 -0
- package/bin/skills/langsmith/references/troubleshooting.md +537 -0
- package/bin/skills/litgpt/SKILL.md +469 -0
- package/bin/skills/litgpt/references/custom-models.md +568 -0
- package/bin/skills/litgpt/references/distributed-training.md +451 -0
- package/bin/skills/litgpt/references/supported-models.md +336 -0
- package/bin/skills/litgpt/references/training-recipes.md +619 -0
- package/bin/skills/llama-cpp/SKILL.md +258 -0
- package/bin/skills/llama-cpp/references/optimization.md +89 -0
- package/bin/skills/llama-cpp/references/quantization.md +213 -0
- package/bin/skills/llama-cpp/references/server.md +125 -0
- package/bin/skills/llama-factory/SKILL.md +80 -0
- package/bin/skills/llama-factory/references/_images.md +23 -0
- package/bin/skills/llama-factory/references/advanced.md +1055 -0
- package/bin/skills/llama-factory/references/getting_started.md +349 -0
- package/bin/skills/llama-factory/references/index.md +19 -0
- package/bin/skills/llama-factory/references/other.md +31 -0
- package/bin/skills/llamaguard/SKILL.md +337 -0
- package/bin/skills/llamaindex/SKILL.md +569 -0
- package/bin/skills/llamaindex/references/agents.md +83 -0
- package/bin/skills/llamaindex/references/data_connectors.md +108 -0
- package/bin/skills/llamaindex/references/query_engines.md +406 -0
- package/bin/skills/llava/SKILL.md +304 -0
- package/bin/skills/llava/references/training.md +197 -0
- package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
- package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
- package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
- package/bin/skills/long-context/SKILL.md +536 -0
- package/bin/skills/long-context/references/extension_methods.md +468 -0
- package/bin/skills/long-context/references/fine_tuning.md +611 -0
- package/bin/skills/long-context/references/rope.md +402 -0
- package/bin/skills/mamba/SKILL.md +260 -0
- package/bin/skills/mamba/references/architecture-details.md +206 -0
- package/bin/skills/mamba/references/benchmarks.md +255 -0
- package/bin/skills/mamba/references/training-guide.md +388 -0
- package/bin/skills/megatron-core/SKILL.md +366 -0
- package/bin/skills/megatron-core/references/benchmarks.md +249 -0
- package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
- package/bin/skills/megatron-core/references/production-examples.md +473 -0
- package/bin/skills/megatron-core/references/training-recipes.md +547 -0
- package/bin/skills/miles/SKILL.md +315 -0
- package/bin/skills/miles/references/api-reference.md +141 -0
- package/bin/skills/miles/references/troubleshooting.md +352 -0
- package/bin/skills/mlflow/SKILL.md +704 -0
- package/bin/skills/mlflow/references/deployment.md +744 -0
- package/bin/skills/mlflow/references/model-registry.md +770 -0
- package/bin/skills/mlflow/references/tracking.md +680 -0
- package/bin/skills/modal/SKILL.md +341 -0
- package/bin/skills/modal/references/advanced-usage.md +503 -0
- package/bin/skills/modal/references/troubleshooting.md +494 -0
- package/bin/skills/model-merging/SKILL.md +539 -0
- package/bin/skills/model-merging/references/evaluation.md +462 -0
- package/bin/skills/model-merging/references/examples.md +428 -0
- package/bin/skills/model-merging/references/methods.md +352 -0
- package/bin/skills/model-pruning/SKILL.md +495 -0
- package/bin/skills/model-pruning/references/wanda.md +347 -0
- package/bin/skills/moe-training/SKILL.md +526 -0
- package/bin/skills/moe-training/references/architectures.md +432 -0
- package/bin/skills/moe-training/references/inference.md +348 -0
- package/bin/skills/moe-training/references/training.md +425 -0
- package/bin/skills/nanogpt/SKILL.md +290 -0
- package/bin/skills/nanogpt/references/architecture.md +382 -0
- package/bin/skills/nanogpt/references/data.md +476 -0
- package/bin/skills/nanogpt/references/training.md +564 -0
- package/bin/skills/nemo-curator/SKILL.md +383 -0
- package/bin/skills/nemo-curator/references/deduplication.md +87 -0
- package/bin/skills/nemo-curator/references/filtering.md +102 -0
- package/bin/skills/nemo-evaluator/SKILL.md +494 -0
- package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
- package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
- package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
- package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
- package/bin/skills/nemo-guardrails/SKILL.md +297 -0
- package/bin/skills/nnsight/SKILL.md +436 -0
- package/bin/skills/nnsight/references/README.md +78 -0
- package/bin/skills/nnsight/references/api.md +344 -0
- package/bin/skills/nnsight/references/tutorials.md +300 -0
- package/bin/skills/openrlhf/SKILL.md +249 -0
- package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
- package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
- package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
- package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
- package/bin/skills/outlines/SKILL.md +652 -0
- package/bin/skills/outlines/references/backends.md +615 -0
- package/bin/skills/outlines/references/examples.md +773 -0
- package/bin/skills/outlines/references/json_generation.md +652 -0
- package/bin/skills/peft/SKILL.md +431 -0
- package/bin/skills/peft/references/advanced-usage.md +514 -0
- package/bin/skills/peft/references/troubleshooting.md +480 -0
- package/bin/skills/phoenix/SKILL.md +475 -0
- package/bin/skills/phoenix/references/advanced-usage.md +619 -0
- package/bin/skills/phoenix/references/troubleshooting.md +538 -0
- package/bin/skills/pinecone/SKILL.md +358 -0
- package/bin/skills/pinecone/references/deployment.md +181 -0
- package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
- package/bin/skills/pytorch-fsdp/references/index.md +7 -0
- package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
- package/bin/skills/pytorch-lightning/SKILL.md +346 -0
- package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
- package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
- package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
- package/bin/skills/pyvene/SKILL.md +473 -0
- package/bin/skills/pyvene/references/README.md +73 -0
- package/bin/skills/pyvene/references/api.md +383 -0
- package/bin/skills/pyvene/references/tutorials.md +376 -0
- package/bin/skills/qdrant/SKILL.md +493 -0
- package/bin/skills/qdrant/references/advanced-usage.md +648 -0
- package/bin/skills/qdrant/references/troubleshooting.md +631 -0
- package/bin/skills/ray-data/SKILL.md +326 -0
- package/bin/skills/ray-data/references/integration.md +82 -0
- package/bin/skills/ray-data/references/transformations.md +83 -0
- package/bin/skills/ray-train/SKILL.md +406 -0
- package/bin/skills/ray-train/references/multi-node.md +628 -0
- package/bin/skills/rwkv/SKILL.md +260 -0
- package/bin/skills/rwkv/references/architecture-details.md +344 -0
- package/bin/skills/rwkv/references/rwkv7.md +386 -0
- package/bin/skills/rwkv/references/state-management.md +369 -0
- package/bin/skills/saelens/SKILL.md +386 -0
- package/bin/skills/saelens/references/README.md +70 -0
- package/bin/skills/saelens/references/api.md +333 -0
- package/bin/skills/saelens/references/tutorials.md +318 -0
- package/bin/skills/segment-anything/SKILL.md +500 -0
- package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
- package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
- package/bin/skills/sentence-transformers/SKILL.md +255 -0
- package/bin/skills/sentence-transformers/references/models.md +123 -0
- package/bin/skills/sentencepiece/SKILL.md +235 -0
- package/bin/skills/sentencepiece/references/algorithms.md +200 -0
- package/bin/skills/sentencepiece/references/training.md +304 -0
- package/bin/skills/sglang/SKILL.md +442 -0
- package/bin/skills/sglang/references/deployment.md +490 -0
- package/bin/skills/sglang/references/radix-attention.md +413 -0
- package/bin/skills/sglang/references/structured-generation.md +541 -0
- package/bin/skills/simpo/SKILL.md +219 -0
- package/bin/skills/simpo/references/datasets.md +478 -0
- package/bin/skills/simpo/references/hyperparameters.md +452 -0
- package/bin/skills/simpo/references/loss-functions.md +350 -0
- package/bin/skills/skypilot/SKILL.md +509 -0
- package/bin/skills/skypilot/references/advanced-usage.md +491 -0
- package/bin/skills/skypilot/references/troubleshooting.md +570 -0
- package/bin/skills/slime/SKILL.md +464 -0
- package/bin/skills/slime/references/api-reference.md +392 -0
- package/bin/skills/slime/references/troubleshooting.md +386 -0
- package/bin/skills/speculative-decoding/SKILL.md +467 -0
- package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
- package/bin/skills/speculative-decoding/references/medusa.md +350 -0
- package/bin/skills/stable-diffusion/SKILL.md +519 -0
- package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
- package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
- package/bin/skills/tensorboard/SKILL.md +629 -0
- package/bin/skills/tensorboard/references/integrations.md +638 -0
- package/bin/skills/tensorboard/references/profiling.md +545 -0
- package/bin/skills/tensorboard/references/visualization.md +620 -0
- package/bin/skills/tensorrt-llm/SKILL.md +187 -0
- package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
- package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
- package/bin/skills/tensorrt-llm/references/serving.md +470 -0
- package/bin/skills/tinker/SKILL.md +362 -0
- package/bin/skills/tinker/references/api-reference.md +168 -0
- package/bin/skills/tinker/references/getting-started.md +157 -0
- package/bin/skills/tinker/references/loss-functions.md +163 -0
- package/bin/skills/tinker/references/models-and-lora.md +139 -0
- package/bin/skills/tinker/references/recipes.md +280 -0
- package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
- package/bin/skills/tinker/references/rendering.md +243 -0
- package/bin/skills/tinker/references/supervised-learning.md +232 -0
- package/bin/skills/tinker-training-cost/SKILL.md +187 -0
- package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
- package/bin/skills/torchforge/SKILL.md +433 -0
- package/bin/skills/torchforge/references/api-reference.md +327 -0
- package/bin/skills/torchforge/references/troubleshooting.md +409 -0
- package/bin/skills/torchtitan/SKILL.md +358 -0
- package/bin/skills/torchtitan/references/checkpoint.md +181 -0
- package/bin/skills/torchtitan/references/custom-models.md +258 -0
- package/bin/skills/torchtitan/references/float8.md +133 -0
- package/bin/skills/torchtitan/references/fsdp.md +126 -0
- package/bin/skills/transformer-lens/SKILL.md +346 -0
- package/bin/skills/transformer-lens/references/README.md +54 -0
- package/bin/skills/transformer-lens/references/api.md +362 -0
- package/bin/skills/transformer-lens/references/tutorials.md +339 -0
- package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
- package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
- package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
- package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
- package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
- package/bin/skills/unsloth/SKILL.md +80 -0
- package/bin/skills/unsloth/references/index.md +7 -0
- package/bin/skills/unsloth/references/llms-full.md +16799 -0
- package/bin/skills/unsloth/references/llms-txt.md +12044 -0
- package/bin/skills/unsloth/references/llms.md +82 -0
- package/bin/skills/verl/SKILL.md +391 -0
- package/bin/skills/verl/references/api-reference.md +301 -0
- package/bin/skills/verl/references/troubleshooting.md +391 -0
- package/bin/skills/vllm/SKILL.md +364 -0
- package/bin/skills/vllm/references/optimization.md +226 -0
- package/bin/skills/vllm/references/quantization.md +284 -0
- package/bin/skills/vllm/references/server-deployment.md +255 -0
- package/bin/skills/vllm/references/troubleshooting.md +447 -0
- package/bin/skills/weights-and-biases/SKILL.md +590 -0
- package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
- package/bin/skills/weights-and-biases/references/integrations.md +700 -0
- package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
- package/bin/skills/whisper/SKILL.md +317 -0
- package/bin/skills/whisper/references/languages.md +189 -0
- package/bin/synsc +0 -0
- package/package.json +10 -0
@@ -0,0 +1,203 @@

# Common Training Patterns

This guide provides common training patterns and use cases for TRL on Hugging Face Jobs.

## Multi-GPU Training

TRL/Accelerate handles distribution across multiple GPUs automatically:

```python
hf_jobs("uv", {
    "script": """
# Your training script here (same as single GPU)
# No changes needed - Accelerate detects multiple GPUs
""",
    "flavor": "a10g-largex2",  # 2x A10G GPUs
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

**Tips for multi-GPU:**

- No code changes needed
- Use `per_device_train_batch_size` (it is per GPU, not total)
- Effective batch size = `per_device_train_batch_size` × `num_gpus` × `gradient_accumulation_steps`
- Monitor GPU utilization to ensure both GPUs are being used
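The effective-batch-size rule above is plain arithmetic; this short sketch (with illustrative values, not recommendations) makes the relationship concrete:

```python
# Effective batch size = per-device batch size × number of GPUs
#                        × gradient accumulation steps.
# The values below are illustrative only - tune them for your own run.
per_device_train_batch_size = 4
num_gpus = 2  # e.g. the a10g-largex2 flavor
gradient_accumulation_steps = 8

effective_batch_size = (
    per_device_train_batch_size * num_gpus * gradient_accumulation_steps
)
print(effective_batch_size)  # 64
```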

## DPO Training (Preference Learning)

Train with preference data for alignment:

```python
hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["trl>=0.12.0", "trackio"]
# ///

from datasets import load_dataset
from trl import DPOTrainer, DPOConfig
import trackio

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# Create train/eval split
dataset_split = dataset.train_test_split(test_size=0.1, seed=42)

config = DPOConfig(
    output_dir="dpo-model",
    push_to_hub=True,
    hub_model_id="username/dpo-model",
    num_train_epochs=1,
    beta=0.1,  # KL penalty coefficient
    eval_strategy="steps",
    eval_steps=50,
    report_to="trackio",
    run_name="baseline_run",  # use a meaningful run name
    # max_length=1024,  # Default - only set if you need a different sequence length
)

trainer = DPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # Use an instruct model as the base
    train_dataset=dataset_split["train"],
    eval_dataset=dataset_split["test"],  # IMPORTANT: provide eval_dataset when eval_strategy is enabled
    args=config,
)

trainer.train()
trainer.push_to_hub()
trackio.finish()
""",
    "flavor": "a10g-large",
    "timeout": "3h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

**For DPO documentation:** Use `hf_doc_fetch("https://huggingface.co/docs/trl/dpo_trainer")`

## GRPO Training (Online RL)

Group Relative Policy Optimization for online reinforcement learning:

```python
hf_jobs("uv", {
    "script": "https://raw.githubusercontent.com/huggingface/trl/main/examples/scripts/grpo.py",
    "script_args": [
        "--model_name_or_path", "Qwen/Qwen2.5-0.5B-Instruct",
        "--dataset_name", "trl-lib/math_shepherd",
        "--output_dir", "grpo-model",
        "--push_to_hub",
        "--hub_model_id", "username/grpo-model"
    ],
    "flavor": "a10g-large",
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

**For GRPO documentation:** Use `hf_doc_fetch("https://huggingface.co/docs/trl/grpo_trainer")`

## Trackio Configuration

**Use sensible defaults for trackio setup.** See `references/trackio_guide.md` for complete documentation, including grouping runs for experiments.

### Basic Pattern

```python
import trackio

trackio.init(
    project="my-training",
    run_name="baseline-run",  # Descriptive name the user will recognize
    space_id="username/trackio",  # Default space: {username}/trackio
    config={
        # Keep config minimal - hyperparameters and model/dataset info only
        "model": "Qwen/Qwen2.5-0.5B",
        "dataset": "trl-lib/Capybara",
        "learning_rate": 2e-5,
    }
)

# Your training code...

trackio.finish()
```

### Grouping for Experiments (Optional)

When the user wants to compare related runs, use the `group` parameter:

```python
# Hyperparameter sweep
trackio.init(project="hyperparam-sweep", run_name="lr-0.001", group="lr_0.001")
trackio.init(project="hyperparam-sweep", run_name="lr-0.01", group="lr_0.01")
```

## Pattern Selection Guide

| Use Case | Pattern | Hardware | Time |
|----------|---------|----------|------|
| SFT training | `scripts/train_sft_example.py` | a10g-large | 2-6 hours |
| Large dataset (>10K) | Multi-GPU | a10g-largex2 | 4-12 hours |
| Preference learning | DPO Training | a10g-large | 2-4 hours |
| Online RL | GRPO Training | a10g-large | 3-6 hours |

## Critical: Evaluation Dataset Requirements

**⚠️ IMPORTANT**: If you set `eval_strategy="steps"` or `eval_strategy="epoch"`, you **MUST** provide an `eval_dataset` to the trainer, or training will hang.

### ✅ CORRECT - With eval dataset:

```python
dataset_split = dataset.train_test_split(test_size=0.1, seed=42)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset_split["train"],
    eval_dataset=dataset_split["test"],  # ← MUST provide when eval_strategy is enabled
    args=SFTConfig(eval_strategy="steps", ...),
)
```

### ❌ WRONG - Will hang:

```python
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    # No eval_dataset but eval_strategy="steps" ← will hang
    args=SFTConfig(eval_strategy="steps", ...),
)
```

### Option: Disable evaluation if no eval dataset

```python
config = SFTConfig(
    eval_strategy="no",  # ← Explicitly disable evaluation
    # ... other config
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    # No eval_dataset needed
    args=config,
)
```

## Best Practices

1. **Use train/eval splits** - create an evaluation split so you can monitor progress
2. **Enable Trackio** - monitor progress in real time
3. **Add a 20-30% buffer to the timeout** - account for loading/saving overhead
4. **Test with TRL official scripts first** - use maintained examples before custom code
5. **Always provide `eval_dataset` when `eval_strategy` is enabled** - or set `eval_strategy="no"`
6. **Use multi-GPU for large models** - 7B+ models benefit significantly
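The timeout-buffer practice above is easy to mechanize; the helper below is a hypothetical sketch (not part of the TRL or Jobs APIs) showing how a padded timeout might be computed:

```python
def padded_timeout_hours(estimated_hours: float, buffer: float = 0.25) -> float:
    """Pad an estimated training time by a safety buffer.

    The default of 25% sits in the middle of the 20-30% range
    suggested in the best practices above.
    """
    return estimated_hours * (1.0 + buffer)

# A job estimated at 4 hours gets a 5-hour timeout.
print(padded_timeout_hours(4.0))  # 5.0
```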
|
|
195
|
+
|
|
196
|
+
## See Also

- `scripts/train_sft_example.py` - Complete SFT template with Trackio and eval split
- `scripts/train_dpo_example.py` - Complete DPO template
- `scripts/train_grpo_example.py` - Complete GRPO template
- `references/hardware_guide.md` - Detailed hardware specifications
- `references/training_methods.md` - Overview of all TRL training methods
- `references/troubleshooting.md` - Common issues and solutions

@@ -0,0 +1,282 @@
# Troubleshooting TRL Training Jobs

Common issues and solutions when training with TRL on Hugging Face Jobs.

## Training Hangs at "Starting training..." Step

**Problem:** The job starts but hangs at the training step: it never progresses and never times out, it just sits there.

**Root Cause:** Using `eval_strategy="steps"` or `eval_strategy="epoch"` without providing an `eval_dataset` to the trainer.

**Solution:**

**Option A: Provide eval_dataset (recommended)**
```python
# Create train/eval split
dataset_split = dataset.train_test_split(test_size=0.1, seed=42)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset_split["train"],
    eval_dataset=dataset_split["test"],  # ← MUST provide when eval_strategy is enabled
    args=SFTConfig(
        eval_strategy="steps",
        eval_steps=50,
        ...
    ),
)
```

**Option B: Disable evaluation**
```python
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    # No eval_dataset
    args=SFTConfig(
        eval_strategy="no",  # ← Explicitly disable
        ...
    ),
)
```

**Prevention:**
- Always create a train/eval split for better monitoring
- Use `dataset.train_test_split(test_size=0.1, seed=42)`
- Check the example scripts: `scripts/train_sft_example.py` includes a proper eval setup

## Job Times Out

**Problem:** The job terminates before training completes, and all progress is lost.

**Solutions:**
- Increase the timeout parameter (e.g., `"timeout": "4h"`)
- Reduce `num_train_epochs` or use a smaller dataset slice
- Use a smaller model, or enable LoRA/PEFT to speed up training
- Add a 20-30% buffer to the estimated time for loading/saving overhead

**Prevention:**
- Always start with a quick demo run to estimate timing
- Use `scripts/estimate_cost.py` to get time estimates
- Monitor first runs closely via Trackio or the logs

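The 20-30% buffer rule is simple arithmetic, sketched below. The helper name is illustrative, and the minutes-string return format is an assumption; confirm the timeout formats Jobs accepts against its documentation.

```python
def timeout_with_buffer(estimated_minutes: float, buffer: float = 0.25) -> str:
    """Pad an estimated training time by a safety buffer (default 25%).

    Returns a minutes string like "150m"; check the Jobs docs for the
    exact timeout formats accepted.
    """
    padded = estimated_minutes * (1 + buffer)
    return f"{int(padded)}m"

print(timeout_with_buffer(120))  # 2h estimate → "150m"
```

A demo run that took 2 hours would thus be submitted with a 150-minute timeout rather than an exact 120.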
## Model Not Saved to Hub

**Problem:** Training completes but the model doesn't appear on the Hub, so all work is lost.

**Check:**
- [ ] `push_to_hub=True` in the training config
- [ ] `hub_model_id` specified with username (e.g., `"username/model-name"`)
- [ ] `secrets={"HF_TOKEN": "$HF_TOKEN"}` in the job submission
- [ ] User has write access to the target repo
- [ ] Token has write permissions (check at https://huggingface.co/settings/tokens)
- [ ] Training script calls `trainer.push_to_hub()` at the end

**See:** `references/hub_saving.md` for detailed Hub authentication troubleshooting

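Assuming the `SFTConfig` fields named in the checklist, the config side of a working Hub-saving setup might look like this sketch (`output_dir` and the repo name are placeholders, not values from this guide):

```python
from trl import SFTConfig

config = SFTConfig(
    output_dir="my-model",
    push_to_hub=True,                  # checklist: enable Hub upload
    hub_model_id="username/my-model",  # checklist: include your username
)

# ...after training...
# trainer.push_to_hub()  # checklist: final push at the end of the script
```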
## Out of Memory (OOM)

**Problem:** The job fails with a CUDA out-of-memory error.

**Solutions (in order of preference):**
1. **Reduce batch size:** Lower `per_device_train_batch_size` (try 4 → 2 → 1)
2. **Increase gradient accumulation:** Raise `gradient_accumulation_steps` to maintain the effective batch size
3. **Disable evaluation:** Remove `eval_dataset` and `eval_strategy` (saves ~40% memory, good for demos)
4. **Enable LoRA/PEFT:** Use `peft_config=LoraConfig(r=8, lora_alpha=16)` to train adapters only (a smaller rank uses less memory)
5. **Use a larger GPU:** Switch from `t4-small` → `l4x1` → `a10g-large` → `a100-large`
6. **Enable gradient checkpointing:** Set `gradient_checkpointing=True` in the config (slower but saves memory)
7. **Use a smaller model:** Try a smaller variant (e.g., 0.5B instead of 3B)

**Memory guidelines:**
- T4 (16GB): <1B models with LoRA
- A10G (24GB): 1-3B models with LoRA, <1B full fine-tune
- A100 (40GB/80GB): 7B+ models with LoRA, 3B full fine-tune

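Solutions 1 and 2 trade off against each other: the optimizer's effective batch size is the product below, so halving the per-device batch while doubling accumulation leaves it unchanged. A minimal sketch (the function name is illustrative):

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    # What the optimizer effectively steps on per update
    return per_device * grad_accum * num_gpus

print(effective_batch_size(per_device=4, grad_accum=4))  # 16
print(effective_batch_size(per_device=2, grad_accum=8))  # 16 — same effective batch, less memory
```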
## Parameter Naming Issues

**Problem:** `TypeError: SFTConfig.__init__() got an unexpected keyword argument 'max_seq_length'`

**Cause:** TRL config classes use `max_length`, not `max_seq_length`.

**Solution:**
```python
# ✅ CORRECT - TRL uses max_length
SFTConfig(max_length=512)
DPOConfig(max_length=512)

# ❌ WRONG - This will fail
SFTConfig(max_seq_length=512)
```

**Note:** Most TRL configs don't require an explicit `max_length`; the default (1024) works well. Only set it if you need a specific value.

## Dataset Format Error

**Problem:** Training fails with dataset format errors or missing fields.

**Solutions:**
1. **Check the format documentation:**
   ```python
   hf_doc_fetch("https://huggingface.co/docs/trl/dataset_formats")
   ```

2. **Validate the dataset before training:**
   ```bash
   uv run https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py \
     --dataset <dataset-name> --split train
   ```
   Or via hf_jobs:
   ```python
   hf_jobs("uv", {
       "script": "https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py",
       "script_args": ["--dataset", "dataset-name", "--split", "train"]
   })
   ```

3. **Verify field names:**
   - **SFT:** Needs a "messages" field (conversational), OR a "text" field, OR "prompt"/"completion" fields
   - **DPO:** Needs "chosen" and "rejected" fields
   - **GRPO:** Needs prompt-only format

4. **Check the dataset split:**
   - Ensure the split exists (e.g., `split="train"`)
   - Preview the dataset: `load_dataset("name", split="train[:5]")`

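The field requirements in step 3 can be checked programmatically before submitting a job. This is an illustrative helper; the `REQUIRED_FIELDS` table mirrors the list above and is not a TRL API:

```python
# Any one of the listed field sets satisfies the format for that method.
REQUIRED_FIELDS = {
    "sft": [{"messages"}, {"text"}, {"prompt", "completion"}],
    "dpo": [{"chosen", "rejected"}],
    "grpo": [{"prompt"}],
}

def has_required_fields(example: dict, method: str) -> bool:
    """True if the example carries one complete field set for the method."""
    keys = set(example)
    return any(required <= keys for required in REQUIRED_FIELDS[method])

print(has_required_fields({"text": "hello world"}, "sft"))  # True
print(has_required_fields({"prompt": "q?"}, "dpo"))         # False - missing chosen/rejected
```

Running it over the first few examples of a dataset catches missing fields before any GPU time is spent.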
## Import/Module Errors

**Problem:** The job fails with "ModuleNotFoundError" or other import errors.

**Solutions:**
1. **Add a PEP 723 header with dependencies:**
   ```python
   # /// script
   # dependencies = [
   #     "trl>=0.12.0",
   #     "peft>=0.7.0",
   #     "transformers>=4.36.0",
   # ]
   # ///
   ```

2. **Verify the exact format:**
   - Must have `# ///` delimiters (with a space after `#`)
   - Dependencies must be valid PyPI package names
   - Check spelling and version constraints

3. **Test locally first:**
   ```bash
   uv run train.py  # Tests whether the dependencies are correct
   ```

## Authentication Errors

**Problem:** The job fails with authentication or permission errors when pushing to the Hub.

**Solutions:**
1. **Verify authentication:**
   ```python
   mcp__huggingface__hf_whoami()  # Check who's authenticated
   ```

2. **Check token permissions:**
   - Go to https://huggingface.co/settings/tokens
   - Ensure the token has "write" permission
   - The token must not be read-only

3. **Verify the token is passed to the job:**
   ```python
   "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # Must be in the job config
   ```

4. **Check repo permissions:**
   - The user must have write access to the target repo
   - For an org repo, the user must be a member with write access
   - The repo must exist, or the user must have permission to create it

## Job Stuck or Not Starting

**Problem:** The job shows "pending" or "starting" for an extended period.

**Solutions:**
- Check the Jobs dashboard for status: https://huggingface.co/jobs
- Verify hardware availability (some GPU types may have queues)
- Try a different hardware flavor if one is heavily utilized
- Check for account billing issues (Jobs requires a paid plan)

**Typical startup times:**
- CPU jobs: 10-30 seconds
- GPU jobs: 30-90 seconds
- If >3 minutes: likely queued or stuck

## Training Loss Not Decreasing

**Problem:** Training runs, but the loss stays flat or doesn't improve.

**Solutions:**
1. **Check the learning rate:** It may be too low (try 2e-5 to 5e-5) or too high (try 1e-6)
2. **Verify dataset quality:** Inspect examples to ensure they're reasonable
3. **Check model size:** Very small models may lack the capacity for the task
4. **Increase training steps:** You may need more epochs or a larger dataset
5. **Verify the dataset format:** A wrong format can silently degrade training

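One way to make "loss stays flat" concrete is to compare early and late averages of the logged loss values. An illustrative helper, not part of TRL:

```python
def loss_is_flat(losses, window=5, tolerance=0.01):
    """True if the mean of the last `window` losses improved by less than
    `tolerance` relative to the mean of the first `window` losses."""
    early = sum(losses[:window]) / window
    late = sum(losses[-window:]) / window
    return (early - late) < tolerance

print(loss_is_flat([2.0, 2.0, 2.0, 2.0, 2.0, 1.99, 2.0, 2.0, 2.01, 2.0]))  # True
print(loss_is_flat([3.0, 2.5, 2.0, 1.5, 1.0, 0.9, 0.8, 0.7, 0.6, 0.5]))    # False
```

If this returns True on your logged losses, start with the learning-rate and format checks above.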
## Logs Not Appearing

**Problem:** Cannot see training logs or progress.

**Solutions:**
1. **Wait 30-60 seconds:** Initial logs can be delayed
2. **Check logs via the MCP tool:**
   ```python
   hf_jobs("logs", {"job_id": "your-job-id"})
   ```
3. **Use Trackio for real-time monitoring:** See `references/trackio_guide.md`
4. **Verify the job is actually running:**
   ```python
   hf_jobs("inspect", {"job_id": "your-job-id"})
   ```

## Checkpoint/Resume Issues

**Problem:** Cannot resume from a checkpoint, or checkpoints are not saved.

**Solutions:**
1. **Enable checkpoint saving:**
   ```python
   SFTConfig(
       save_strategy="steps",
       save_steps=100,
       hub_strategy="every_save",  # Push each checkpoint
   )
   ```

2. **Verify checkpoints are pushed to the Hub:** Check the model repo for checkpoint folders

3. **Resume from a checkpoint:** Note that `resume_from_checkpoint` is an argument to `trainer.train()`, not the trainer constructor:
   ```python
   trainer = SFTTrainer(
       model="username/model-name",  # Can be a checkpoint path
   )
   trainer.train(resume_from_checkpoint="username/model-name/checkpoint-1000")
   ```

## Getting Help

If issues persist:

1. **Check the TRL documentation:**
   ```python
   hf_doc_search("your issue", product="trl")
   ```

2. **Check the Jobs documentation:**
   ```python
   hf_doc_fetch("https://huggingface.co/docs/huggingface_hub/guides/jobs")
   ```

3. **Review the related guides:**
   - `references/hub_saving.md` - Hub authentication issues
   - `references/hardware_guide.md` - Hardware selection and specs
   - `references/training_patterns.md` - Eval dataset requirements
   - SKILL.md "Working with Scripts" section - Script format and URL issues

4. **Ask in the HF forums:** https://discuss.huggingface.co/