@synsci/cli-darwin-x64 1.1.49
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/skills/accelerate/SKILL.md +332 -0
- package/bin/skills/accelerate/references/custom-plugins.md +453 -0
- package/bin/skills/accelerate/references/megatron-integration.md +489 -0
- package/bin/skills/accelerate/references/performance.md +525 -0
- package/bin/skills/audiocraft/SKILL.md +564 -0
- package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
- package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
- package/bin/skills/autogpt/SKILL.md +403 -0
- package/bin/skills/autogpt/references/advanced-usage.md +535 -0
- package/bin/skills/autogpt/references/troubleshooting.md +420 -0
- package/bin/skills/awq/SKILL.md +310 -0
- package/bin/skills/awq/references/advanced-usage.md +324 -0
- package/bin/skills/awq/references/troubleshooting.md +344 -0
- package/bin/skills/axolotl/SKILL.md +158 -0
- package/bin/skills/axolotl/references/api.md +5548 -0
- package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
- package/bin/skills/axolotl/references/index.md +15 -0
- package/bin/skills/axolotl/references/other.md +3563 -0
- package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
- package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
- package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
- package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
- package/bin/skills/bitsandbytes/SKILL.md +411 -0
- package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
- package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
- package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
- package/bin/skills/blip-2/SKILL.md +564 -0
- package/bin/skills/blip-2/references/advanced-usage.md +680 -0
- package/bin/skills/blip-2/references/troubleshooting.md +526 -0
- package/bin/skills/chroma/SKILL.md +406 -0
- package/bin/skills/chroma/references/integration.md +38 -0
- package/bin/skills/clip/SKILL.md +253 -0
- package/bin/skills/clip/references/applications.md +207 -0
- package/bin/skills/constitutional-ai/SKILL.md +290 -0
- package/bin/skills/crewai/SKILL.md +498 -0
- package/bin/skills/crewai/references/flows.md +438 -0
- package/bin/skills/crewai/references/tools.md +429 -0
- package/bin/skills/crewai/references/troubleshooting.md +480 -0
- package/bin/skills/deepspeed/SKILL.md +141 -0
- package/bin/skills/deepspeed/references/08.md +17 -0
- package/bin/skills/deepspeed/references/09.md +173 -0
- package/bin/skills/deepspeed/references/2020.md +378 -0
- package/bin/skills/deepspeed/references/2023.md +279 -0
- package/bin/skills/deepspeed/references/assets.md +179 -0
- package/bin/skills/deepspeed/references/index.md +35 -0
- package/bin/skills/deepspeed/references/mii.md +118 -0
- package/bin/skills/deepspeed/references/other.md +1191 -0
- package/bin/skills/deepspeed/references/tutorials.md +6554 -0
- package/bin/skills/dspy/SKILL.md +590 -0
- package/bin/skills/dspy/references/examples.md +663 -0
- package/bin/skills/dspy/references/modules.md +475 -0
- package/bin/skills/dspy/references/optimizers.md +566 -0
- package/bin/skills/faiss/SKILL.md +221 -0
- package/bin/skills/faiss/references/index_types.md +280 -0
- package/bin/skills/flash-attention/SKILL.md +367 -0
- package/bin/skills/flash-attention/references/benchmarks.md +215 -0
- package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
- package/bin/skills/gguf/SKILL.md +427 -0
- package/bin/skills/gguf/references/advanced-usage.md +504 -0
- package/bin/skills/gguf/references/troubleshooting.md +442 -0
- package/bin/skills/gptq/SKILL.md +450 -0
- package/bin/skills/gptq/references/calibration.md +337 -0
- package/bin/skills/gptq/references/integration.md +129 -0
- package/bin/skills/gptq/references/troubleshooting.md +95 -0
- package/bin/skills/grpo-rl-training/README.md +97 -0
- package/bin/skills/grpo-rl-training/SKILL.md +572 -0
- package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
- package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
- package/bin/skills/guidance/SKILL.md +572 -0
- package/bin/skills/guidance/references/backends.md +554 -0
- package/bin/skills/guidance/references/constraints.md +674 -0
- package/bin/skills/guidance/references/examples.md +767 -0
- package/bin/skills/hqq/SKILL.md +445 -0
- package/bin/skills/hqq/references/advanced-usage.md +528 -0
- package/bin/skills/hqq/references/troubleshooting.md +503 -0
- package/bin/skills/hugging-face-cli/SKILL.md +191 -0
- package/bin/skills/hugging-face-cli/references/commands.md +954 -0
- package/bin/skills/hugging-face-cli/references/examples.md +374 -0
- package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
- package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
- package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
- package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
- package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
- package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
- package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
- package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
- package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
- package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
- package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
- package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
- package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
- package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
- package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
- package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
- package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
- package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
- package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
- package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
- package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
- package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
- package/bin/skills/hugging-face-jobs/index.html +216 -0
- package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
- package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
- package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
- package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
- package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
- package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
- package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
- package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
- package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
- package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
- package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
- package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
- package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
- package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
- package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
- package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
- package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
- package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
- package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
- package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
- package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
- package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
- package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
- package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
- package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
- package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
- package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
- package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
- package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
- package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
- package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
- package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
- package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
- package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
- package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
- package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
- package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
- package/bin/skills/instructor/SKILL.md +740 -0
- package/bin/skills/instructor/references/examples.md +107 -0
- package/bin/skills/instructor/references/providers.md +70 -0
- package/bin/skills/instructor/references/validation.md +606 -0
- package/bin/skills/knowledge-distillation/SKILL.md +458 -0
- package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
- package/bin/skills/lambda-labs/SKILL.md +545 -0
- package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
- package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
- package/bin/skills/langchain/SKILL.md +480 -0
- package/bin/skills/langchain/references/agents.md +499 -0
- package/bin/skills/langchain/references/integration.md +562 -0
- package/bin/skills/langchain/references/rag.md +600 -0
- package/bin/skills/langsmith/SKILL.md +422 -0
- package/bin/skills/langsmith/references/advanced-usage.md +548 -0
- package/bin/skills/langsmith/references/troubleshooting.md +537 -0
- package/bin/skills/litgpt/SKILL.md +469 -0
- package/bin/skills/litgpt/references/custom-models.md +568 -0
- package/bin/skills/litgpt/references/distributed-training.md +451 -0
- package/bin/skills/litgpt/references/supported-models.md +336 -0
- package/bin/skills/litgpt/references/training-recipes.md +619 -0
- package/bin/skills/llama-cpp/SKILL.md +258 -0
- package/bin/skills/llama-cpp/references/optimization.md +89 -0
- package/bin/skills/llama-cpp/references/quantization.md +213 -0
- package/bin/skills/llama-cpp/references/server.md +125 -0
- package/bin/skills/llama-factory/SKILL.md +80 -0
- package/bin/skills/llama-factory/references/_images.md +23 -0
- package/bin/skills/llama-factory/references/advanced.md +1055 -0
- package/bin/skills/llama-factory/references/getting_started.md +349 -0
- package/bin/skills/llama-factory/references/index.md +19 -0
- package/bin/skills/llama-factory/references/other.md +31 -0
- package/bin/skills/llamaguard/SKILL.md +337 -0
- package/bin/skills/llamaindex/SKILL.md +569 -0
- package/bin/skills/llamaindex/references/agents.md +83 -0
- package/bin/skills/llamaindex/references/data_connectors.md +108 -0
- package/bin/skills/llamaindex/references/query_engines.md +406 -0
- package/bin/skills/llava/SKILL.md +304 -0
- package/bin/skills/llava/references/training.md +197 -0
- package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
- package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
- package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
- package/bin/skills/long-context/SKILL.md +536 -0
- package/bin/skills/long-context/references/extension_methods.md +468 -0
- package/bin/skills/long-context/references/fine_tuning.md +611 -0
- package/bin/skills/long-context/references/rope.md +402 -0
- package/bin/skills/mamba/SKILL.md +260 -0
- package/bin/skills/mamba/references/architecture-details.md +206 -0
- package/bin/skills/mamba/references/benchmarks.md +255 -0
- package/bin/skills/mamba/references/training-guide.md +388 -0
- package/bin/skills/megatron-core/SKILL.md +366 -0
- package/bin/skills/megatron-core/references/benchmarks.md +249 -0
- package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
- package/bin/skills/megatron-core/references/production-examples.md +473 -0
- package/bin/skills/megatron-core/references/training-recipes.md +547 -0
- package/bin/skills/miles/SKILL.md +315 -0
- package/bin/skills/miles/references/api-reference.md +141 -0
- package/bin/skills/miles/references/troubleshooting.md +352 -0
- package/bin/skills/mlflow/SKILL.md +704 -0
- package/bin/skills/mlflow/references/deployment.md +744 -0
- package/bin/skills/mlflow/references/model-registry.md +770 -0
- package/bin/skills/mlflow/references/tracking.md +680 -0
- package/bin/skills/modal/SKILL.md +341 -0
- package/bin/skills/modal/references/advanced-usage.md +503 -0
- package/bin/skills/modal/references/troubleshooting.md +494 -0
- package/bin/skills/model-merging/SKILL.md +539 -0
- package/bin/skills/model-merging/references/evaluation.md +462 -0
- package/bin/skills/model-merging/references/examples.md +428 -0
- package/bin/skills/model-merging/references/methods.md +352 -0
- package/bin/skills/model-pruning/SKILL.md +495 -0
- package/bin/skills/model-pruning/references/wanda.md +347 -0
- package/bin/skills/moe-training/SKILL.md +526 -0
- package/bin/skills/moe-training/references/architectures.md +432 -0
- package/bin/skills/moe-training/references/inference.md +348 -0
- package/bin/skills/moe-training/references/training.md +425 -0
- package/bin/skills/nanogpt/SKILL.md +290 -0
- package/bin/skills/nanogpt/references/architecture.md +382 -0
- package/bin/skills/nanogpt/references/data.md +476 -0
- package/bin/skills/nanogpt/references/training.md +564 -0
- package/bin/skills/nemo-curator/SKILL.md +383 -0
- package/bin/skills/nemo-curator/references/deduplication.md +87 -0
- package/bin/skills/nemo-curator/references/filtering.md +102 -0
- package/bin/skills/nemo-evaluator/SKILL.md +494 -0
- package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
- package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
- package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
- package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
- package/bin/skills/nemo-guardrails/SKILL.md +297 -0
- package/bin/skills/nnsight/SKILL.md +436 -0
- package/bin/skills/nnsight/references/README.md +78 -0
- package/bin/skills/nnsight/references/api.md +344 -0
- package/bin/skills/nnsight/references/tutorials.md +300 -0
- package/bin/skills/openrlhf/SKILL.md +249 -0
- package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
- package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
- package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
- package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
- package/bin/skills/outlines/SKILL.md +652 -0
- package/bin/skills/outlines/references/backends.md +615 -0
- package/bin/skills/outlines/references/examples.md +773 -0
- package/bin/skills/outlines/references/json_generation.md +652 -0
- package/bin/skills/peft/SKILL.md +431 -0
- package/bin/skills/peft/references/advanced-usage.md +514 -0
- package/bin/skills/peft/references/troubleshooting.md +480 -0
- package/bin/skills/phoenix/SKILL.md +475 -0
- package/bin/skills/phoenix/references/advanced-usage.md +619 -0
- package/bin/skills/phoenix/references/troubleshooting.md +538 -0
- package/bin/skills/pinecone/SKILL.md +358 -0
- package/bin/skills/pinecone/references/deployment.md +181 -0
- package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
- package/bin/skills/pytorch-fsdp/references/index.md +7 -0
- package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
- package/bin/skills/pytorch-lightning/SKILL.md +346 -0
- package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
- package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
- package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
- package/bin/skills/pyvene/SKILL.md +473 -0
- package/bin/skills/pyvene/references/README.md +73 -0
- package/bin/skills/pyvene/references/api.md +383 -0
- package/bin/skills/pyvene/references/tutorials.md +376 -0
- package/bin/skills/qdrant/SKILL.md +493 -0
- package/bin/skills/qdrant/references/advanced-usage.md +648 -0
- package/bin/skills/qdrant/references/troubleshooting.md +631 -0
- package/bin/skills/ray-data/SKILL.md +326 -0
- package/bin/skills/ray-data/references/integration.md +82 -0
- package/bin/skills/ray-data/references/transformations.md +83 -0
- package/bin/skills/ray-train/SKILL.md +406 -0
- package/bin/skills/ray-train/references/multi-node.md +628 -0
- package/bin/skills/rwkv/SKILL.md +260 -0
- package/bin/skills/rwkv/references/architecture-details.md +344 -0
- package/bin/skills/rwkv/references/rwkv7.md +386 -0
- package/bin/skills/rwkv/references/state-management.md +369 -0
- package/bin/skills/saelens/SKILL.md +386 -0
- package/bin/skills/saelens/references/README.md +70 -0
- package/bin/skills/saelens/references/api.md +333 -0
- package/bin/skills/saelens/references/tutorials.md +318 -0
- package/bin/skills/segment-anything/SKILL.md +500 -0
- package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
- package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
- package/bin/skills/sentence-transformers/SKILL.md +255 -0
- package/bin/skills/sentence-transformers/references/models.md +123 -0
- package/bin/skills/sentencepiece/SKILL.md +235 -0
- package/bin/skills/sentencepiece/references/algorithms.md +200 -0
- package/bin/skills/sentencepiece/references/training.md +304 -0
- package/bin/skills/sglang/SKILL.md +442 -0
- package/bin/skills/sglang/references/deployment.md +490 -0
- package/bin/skills/sglang/references/radix-attention.md +413 -0
- package/bin/skills/sglang/references/structured-generation.md +541 -0
- package/bin/skills/simpo/SKILL.md +219 -0
- package/bin/skills/simpo/references/datasets.md +478 -0
- package/bin/skills/simpo/references/hyperparameters.md +452 -0
- package/bin/skills/simpo/references/loss-functions.md +350 -0
- package/bin/skills/skypilot/SKILL.md +509 -0
- package/bin/skills/skypilot/references/advanced-usage.md +491 -0
- package/bin/skills/skypilot/references/troubleshooting.md +570 -0
- package/bin/skills/slime/SKILL.md +464 -0
- package/bin/skills/slime/references/api-reference.md +392 -0
- package/bin/skills/slime/references/troubleshooting.md +386 -0
- package/bin/skills/speculative-decoding/SKILL.md +467 -0
- package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
- package/bin/skills/speculative-decoding/references/medusa.md +350 -0
- package/bin/skills/stable-diffusion/SKILL.md +519 -0
- package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
- package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
- package/bin/skills/tensorboard/SKILL.md +629 -0
- package/bin/skills/tensorboard/references/integrations.md +638 -0
- package/bin/skills/tensorboard/references/profiling.md +545 -0
- package/bin/skills/tensorboard/references/visualization.md +620 -0
- package/bin/skills/tensorrt-llm/SKILL.md +187 -0
- package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
- package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
- package/bin/skills/tensorrt-llm/references/serving.md +470 -0
- package/bin/skills/tinker/SKILL.md +362 -0
- package/bin/skills/tinker/references/api-reference.md +168 -0
- package/bin/skills/tinker/references/getting-started.md +157 -0
- package/bin/skills/tinker/references/loss-functions.md +163 -0
- package/bin/skills/tinker/references/models-and-lora.md +139 -0
- package/bin/skills/tinker/references/recipes.md +280 -0
- package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
- package/bin/skills/tinker/references/rendering.md +243 -0
- package/bin/skills/tinker/references/supervised-learning.md +232 -0
- package/bin/skills/tinker-training-cost/SKILL.md +187 -0
- package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
- package/bin/skills/torchforge/SKILL.md +433 -0
- package/bin/skills/torchforge/references/api-reference.md +327 -0
- package/bin/skills/torchforge/references/troubleshooting.md +409 -0
- package/bin/skills/torchtitan/SKILL.md +358 -0
- package/bin/skills/torchtitan/references/checkpoint.md +181 -0
- package/bin/skills/torchtitan/references/custom-models.md +258 -0
- package/bin/skills/torchtitan/references/float8.md +133 -0
- package/bin/skills/torchtitan/references/fsdp.md +126 -0
- package/bin/skills/transformer-lens/SKILL.md +346 -0
- package/bin/skills/transformer-lens/references/README.md +54 -0
- package/bin/skills/transformer-lens/references/api.md +362 -0
- package/bin/skills/transformer-lens/references/tutorials.md +339 -0
- package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
- package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
- package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
- package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
- package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
- package/bin/skills/unsloth/SKILL.md +80 -0
- package/bin/skills/unsloth/references/index.md +7 -0
- package/bin/skills/unsloth/references/llms-full.md +16799 -0
- package/bin/skills/unsloth/references/llms-txt.md +12044 -0
- package/bin/skills/unsloth/references/llms.md +82 -0
- package/bin/skills/verl/SKILL.md +391 -0
- package/bin/skills/verl/references/api-reference.md +301 -0
- package/bin/skills/verl/references/troubleshooting.md +391 -0
- package/bin/skills/vllm/SKILL.md +364 -0
- package/bin/skills/vllm/references/optimization.md +226 -0
- package/bin/skills/vllm/references/quantization.md +284 -0
- package/bin/skills/vllm/references/server-deployment.md +255 -0
- package/bin/skills/vllm/references/troubleshooting.md +447 -0
- package/bin/skills/weights-and-biases/SKILL.md +590 -0
- package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
- package/bin/skills/weights-and-biases/references/integrations.md +700 -0
- package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
- package/bin/skills/whisper/SKILL.md +317 -0
- package/bin/skills/whisper/references/languages.md +189 -0
- package/bin/synsc +0 -0
- package/package.json +10 -0
@@ -0,0 +1,3563 @@

# Axolotl - Other

**Pages:** 26

---

## Mixed Precision Training

**URL:** https://docs.axolotl.ai/docs/mixed_precision.html

**Contents:**

- Mixed Precision Training
- 1 FP16 Mixed Precision
  - 1.1 Overview
  - 1.2 Configuration
  - 1.3 FP16 Considerations
- 2 BF16 Mixed Precision
  - 2.1 Overview
  - 2.2 Configuration
- 3 FP8 Mixed Precision
  - 3.1 What is FP8?

Mixed precision training uses lower-precision data types to reduce memory usage and increase training speed while maintaining model quality. Axolotl supports several mixed precision formats:

FP16 is the traditional half-precision format. It is supported on older GPUs but can be less numerically stable than BF16.

BF16 (Brain Float 16) offers better numerical stability than FP16 and is the recommended mixed precision format for modern GPUs. It provides the same dynamic range as FP32 while using half the memory.
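
The dynamic-range claim above can be checked with a little arithmetic: the largest finite value of an IEEE-style binary float with `e` exponent bits and `m` mantissa bits is `(2 - 2^-m) * 2^(2^(e-1) - 1)`. A minimal pure-Python sketch (no torch required; the function name is illustrative):

```python
# Max finite value for an IEEE-style float with `exp_bits` exponent bits
# and `frac_bits` fraction (mantissa) bits.
def max_finite(exp_bits: int, frac_bits: int) -> float:
    bias = 2 ** (exp_bits - 1) - 1
    return (2 - 2 ** -frac_bits) * 2 ** bias

fp16 = max_finite(5, 10)   # IEEE half precision
bf16 = max_finite(8, 7)    # bfloat16: FP32's 8 exponent bits, truncated mantissa
fp32 = max_finite(8, 23)   # IEEE single precision

print(f"FP16 max = {fp16:.5g}")   # 65504 -- overflows easily during training
print(f"BF16 max = {bf16:.5g}")   # same order of magnitude as FP32
print(f"FP32 max = {fp32:.5g}")
```

Because BF16 keeps FP32's 8 exponent bits, its representable range is essentially that of FP32, which is why loss scaling is typically unnecessary with BF16.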

FP8 support is experimental and requires compatible hardware (H100, H200) and recent PyTorch versions with TorchAO.

FP8 (8-bit floating point) can provide significant time savings compared to FP16/BF16 while maintaining training stability. Axolotl's implementation uses PyTorch's TorchAO library with the "tensorwise" scaling strategy.

Add to your YAML config:

torch.compile is critical for FP8 performance

FP8 training requires `torch_compile: true` to see meaningful speedups. Without compilation, FP8 may actually be slower and use more memory than FP16/BF16.

For FSDP (Fully Sharded Data Parallel) training:

Always validate your mixed precision setup:

See examples/llama-3/3b-fp8-fsdp2.yaml for an optimized example config. Enabling FP8 mixed precision plus FP8 all-gather training results in ~10% faster iterations per second vs. BF16 for a relatively small (3B-parameter) model.

For more information on multi-GPU training, see our Multi-GPU guide.

**Examples:**

Example 1 (yaml):

```yaml
# Automatic BF16 detection (recommended)
bf16: auto

# Or explicitly enable
bf16: true

# For evaluation with BF16
bf16: full # Equivalent to bf16_full_eval in the HF trainer
```

Example 2 (yaml):

```yaml
# Enable FP8 mixed precision
fp8: true

# Optional: Enable FP8 for FSDP all-gather operations
fp8_enable_fsdp_float8_all_gather: true

# Enable torch.compile (almost always necessary for FP8 speedups)
torch_compile: true
```

Example 3 (yaml):

```yaml
fp8: true
fp8_enable_fsdp_float8_all_gather: true

torch_compile: true

# FSDP configuration
fsdp_version: 2
fsdp_config:
  offload_params: false
  cpu_ram_efficient_loading: true
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: LlamaDecoderLayer
  state_dict_type: FULL_STATE_DICT
  reshard_after_forward: true
```
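
The FP8 constraints above are easy to lint for before launching a run. A hypothetical pre-flight check, assuming the config has already been loaded into a plain dict by a YAML parser; the function name is illustrative and not part of Axolotl, and the FSDP2 check is an assumption mirroring Example 3:

```python
# Hypothetical pre-flight lint for the FP8 constraints described above;
# `cfg` is a plain dict, as produced by any YAML loader. Not part of Axolotl.
def lint_mixed_precision(cfg: dict) -> list[str]:
    warnings = []
    if cfg.get("fp8"):
        if not cfg.get("torch_compile"):
            warnings.append(
                "fp8: true without torch_compile: true -- FP8 may be slower "
                "and use more memory than FP16/BF16"
            )
        if cfg.get("fp8_enable_fsdp_float8_all_gather") and cfg.get("fsdp_version") != 2:
            warnings.append("FP8 all-gather pairs with FSDP2; set fsdp_version: 2")
    return warnings

print(lint_mixed_precision({"fp8": True}))
```

Running such a check in CI catches the most common FP8 misconfiguration (forgetting `torch_compile`) before GPU time is spent.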

---

## FAQ

**URL:** https://docs.axolotl.ai/docs/faq.html

**Contents:**

- FAQ
- General
- Chat templates

Q: The trainer stopped and hasn't progressed in several minutes.

A: Usually an issue with the GPUs communicating with each other. See the NCCL doc.

A: This usually happens when you run out of system RAM.

Q: exitcode: -7 while using deepspeed

A: Try upgrading deepspeed with: pip install -U deepspeed

Q: AttributeError: 'DummyOptim' object has no attribute 'step'

Q: ModuleNotFoundError: No module named 'mpi4py' when using a single GPU with deepspeed

A: You may be using deepspeed with a single GPU. Please remove the deepspeed: section in the YAML file or the --deepspeed CLI flag.

Q: The code is stuck on saving preprocessed datasets.

A: This is usually an issue with the GPU. It can be resolved by setting the environment variable CUDA_VISIBLE_DEVICES=0. If you are on RunPod, this is usually a pod issue; starting a new pod should take care of it.

Q: Received a mismatch error on merging / loading adapters between the torch.Size of the checkpoint and the model.

A: This is likely due to a vocab size mismatch. By default, Axolotl expands the model's embeddings if the tokenizer has more tokens than the model. Please use the axolotl merge-lora command to merge the adapters instead of using your own scripts.

On the other hand, if the model has more tokens than the tokenizer, Axolotl does not shrink the model's embeddings unless shrink_embeddings: true is set in the config.
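
The resize rule described above can be sketched as a pure function (illustrative, not Axolotl's actual code):

```python
# Illustrative sketch of the embedding-resize rule described above:
# expand when the tokenizer outgrows the model; shrink only on request.
def resized_embedding_rows(model_rows: int, tokenizer_vocab: int,
                           shrink_embeddings: bool = False) -> int:
    if tokenizer_vocab > model_rows:
        return tokenizer_vocab          # expand to fit new tokens
    if shrink_embeddings and tokenizer_vocab < model_rows:
        return tokenizer_vocab          # shrink only when explicitly enabled
    return model_rows                   # otherwise leave the model as-is

print(resized_embedding_rows(32000, 32004))         # expanded to 32004
print(resized_embedding_rows(32064, 32000))         # kept at 32064
print(resized_embedding_rows(32064, 32000, True))   # shrunk to 32000
```

A checkpoint saved after expansion will not load into a model that was never resized, which is the size mismatch the question describes.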

Q: How to call Axolotl via custom Python scripts?

A: Since Axolotl is just Python, please see src/axolotl/cli/main.py for how each command is called.

Q: How to know the value to use for fsdp_transformer_layer_cls_to_wrap?

A: This is the class name of the transformer layer to wrap with FSDP. For example, for LlamaForCausalLM, the value is LlamaDecoderLayer. To find this for a specific model, check the model's PreTrainedModel definition and look for the _no_split_modules variable in the modeling_<model_name>.py file within the transformers library.
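
The lookup above amounts to reading a class attribute. A self-contained sketch using a stand-in class (with transformers installed, the same `getattr` applies to the real `*PreTrainedModel` class):

```python
# Stand-in class for this example; real transformers *PreTrainedModel
# subclasses define _no_split_modules in the same way.
class FakeLlamaPreTrainedModel:
    _no_split_modules = ["LlamaDecoderLayer"]

def fsdp_wrap_candidates(model_cls) -> list[str]:
    # Modules listed here must not be split across devices, which makes
    # them the natural FSDP wrapping boundary.
    return list(getattr(model_cls, "_no_split_modules", []))

print(fsdp_wrap_candidates(FakeLlamaPreTrainedModel))  # ['LlamaDecoderLayer']
```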

Q: ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token

A: This is because the tokenizer does not have a padding token. Please add a padding token to the tokenizer via:
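
The answer's snippet is elided above; one common way to express this is in the Axolotl config itself. A hedged example (the token string is illustrative; match your model's conventions):

```yaml
special_tokens:
  pad_token: "<pad>"
```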
|
|
139
|
+
|
|
140
|
+
Q: IterableDataset error or KeyError: 'input_ids' when using preprocess CLI
|
|
141
|
+
|
|
142
|
+
A: This is because you may be using preprocess CLI with pretraining_dataset: or skip_prepare_dataset: true respectively. Please use axolotl train CLI directly instead as these datasets are prepared on demand.
|
|
143
|
+
|
|
144
|
+
Q: vLLM is not working with Axolotl
|
|
145
|
+
|
|
146
|
+
A: We currently recommend torch 2.6.0 for use with vllm. Please ensure you use the right version. For Docker, please use the main-py3.11-cu124-2.6.0 tag.
|
|
147
|
+
|
|
148
|
+
Q: FA2 2.8.0 undefined symbol runtime error on CUDA 12.4
|
|
149
|
+
|
|
150
|
+
A: There seems to be a wheel issue with FA2 2.8.0 on CUDA 12.4. Try CUDA 12.6 instead or downgrade to FA2 2.7.4. Please refer to the upstream issue: https://github.com/Dao-AILab/flash-attention/issues/1717.
|
|
151
|
+
|
|
152
|
+
Q: Can we mix text and text+image datasets for VLM training?
|
|
153
|
+
|
|
154
|
+
A: Yes, you can for newer VLM arch. The ones that would not work are LLaVA / Pixtral arch. If you notice one not working, please let us know!
|
|
155
|
+
|
|
156
|
+
Q: Why is memory/max_* different from nvidia-smi?
|
|
157
|
+
|
|
158
|
+
A: We use torch APIs to retrieve this information. You can see https://docs.pytorch.org/docs/stable/notes/cuda.html#cuda-memory-management for more information.
|
|
159
|
+
|
|
160
|
+
Q: jinja2.exceptions.UndefinedError: 'dict object' has no attribute 'content' / 'role' / ____
|
|
161
|
+
|
|
162
|
+
A: This means that the property mapping for the stated attribute does not exist when building chat_template prompt. For example, if no attribute 'content', please check you have added the correct mapping for content under message_property_mappings.
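For instance, if your dataset stores the speaker under "from" and the text under "value", a mapping like the following sketch would apply (the dataset path is a placeholder):

```yaml
datasets:
  - path: my-org/my-dataset  # placeholder
    type: chat_template
    message_property_mappings:
      role: from
      content: value
```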

Q: Empty template generated for turn ___

A: The content is empty for that turn.

Q: Could not find content start/end boundary for turn __

A: The specific turn's start/end could not be detected. Please ensure you have set the eos_token to match your chat_template. Otherwise, this could be a chat_template which doesn't use proper boundaries for each turn (like system). On the rare occurrence, make sure your content is not [[dummy_message]]. Please let us know if this happens.

Q: Content end boundary is before start boundary for turn ___

A: This is an edge case which should not occur. Please create an Issue if this happens.

Q: Content end boundary is the same as start boundary for turn ___. This is likely an empty turn.

A: This is likely an empty turn.

Q: The EOS token is incorrectly being masked or not being masked / EOS token __ not found in chat template.

A: There can be two reasons:

Q: "chat_template choice is tokenizer_default but tokenizer's chat_template is null. Please add a chat_template in tokenizer config"

A: This is because the tokenizer does not have a chat template. Please add a chat template in the tokenizer config. See chat_template for more details.

Q: The EOT token(s) are incorrectly being masked or not being masked / EOT token __ not found in chat template.

A: There can be two reasons:

Q: EOT token encoding failed. Please check if the token is valid and can be encoded.

A: There could be an issue with the tokenizer or unicode encoding. Please raise an issue with examples of the EOT token and tokenizer causing the problem.

Q: EOT token __ is encoded as multiple tokens.

A: This is because the EOT token is encoded as multiple tokens, which can cause unexpected behavior. Please add it under tokens: or (recommended) override unused added_tokens via added_tokens_overrides:.

Q: Conflict between train_on_eos and train_on_eot. eos_token is in eot_tokens and train_on_eos != train_on_eot

A: This happens when the EOS token is in eot_tokens: while train_on_eos: and train_on_eot: do not match, which causes one to override the other. Please ensure that train_on_eos: and train_on_eot: are the same, or remove the EOS token from eot_tokens:.

Q: If eot_tokens: is not provided, what happens?

A: If eot_tokens: is not provided, the default behavior is the same as before: EOS tokens used to delimit turns are masked/unmasked depending on whether the turn is trainable.

Internally, eot_tokens: defaults to the tokenizer.eos_token and train_on_eot: defaults to train_on_eos (which itself defaults to turn). This transition helps clarify the naming and behavior of EOT/EOS tokens.
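As an illustrative fragment (the token shown is Llama-3's end-of-turn marker; substitute whatever your chat_template actually uses):

```yaml
eot_tokens:
  - "<|eot_id|>"
train_on_eot: turn  # keep consistent with train_on_eos
```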

Q: Data processing error: CAS service error

A: Try disabling XET with export HF_HUB_DISABLE_XET=1.

Q: torch._inductor.exc.LoweringException: NoValidChoicesError: No choices to select, please consider adding ATEN into max_autotune_gemm_backends config (defined in torch/_inductor/config.py) to allow at least one choice.

A: Depending on the version of torch, you may need to include this in your YAML (see Example 2 below):

Q: ValueError("Backward pass should have cleared tracker of all tensors")

A: This may happen due to edge cases in the modern OffloadActivations context manager for CUDA streams. If you encounter this error, you may have success using the naive implementation with offload_activations: legacy in your YAML.

Q: Error parsing tool_calls arguments as JSON.

A: There was an error parsing string arguments into a dict. Please check your dataset and the error message for more details.

**Examples:**

Example 1 (yaml):

```yaml
special_tokens:
  # str. If you're not sure, set to same as `eos_token`.
  pad_token: "..."
```

Example 2 (yaml):

```yaml
flex_attn_compile_kwargs:
  dynamic: false
  mode: max-autotune-no-cudagraphs
```
---

## Installation

**URL:** https://docs.axolotl.ai/docs/installation.html

**Contents:**
- Installation
- 1 Requirements
- 2 Installation Methods
- 2.1 PyPI Installation (Recommended)
- 2.2 uv Installation
- 2.3 Edge/Development Build
- 2.4 Docker
- 3 Cloud Environments
- 3.1 Cloud GPU Providers
- 3.2 Google Colab

This guide covers all the ways you can install and set up Axolotl for your environment.

Please make sure you have PyTorch installed before installing Axolotl in your local environment. Follow the instructions at: https://pytorch.org/get-started/locally/

For Blackwell GPUs, please use PyTorch 2.7.0 and CUDA 12.8.

We use --no-build-isolation to detect the installed PyTorch version (if present) so as not to clobber it, and so that the correct versions of dependencies specific to that PyTorch version (or to other installed co-dependencies) are selected.

uv is a fast, reliable Python package installer and resolver built in Rust. It offers significant performance improvements over pip and provides better dependency resolution, making it an excellent choice for complex environments.

Install uv if not already installed.

Choose your CUDA version to use with PyTorch (e.g. cu124, cu126, cu128), then create the venv and activate it.

Install PyTorch - PyTorch 2.6.0 is recommended.

Install Axolotl from PyPI.

For the latest features between releases:

For development with Docker:

For Blackwell GPUs, please use axolotlai/axolotl:main-py3.11-cu128-2.7.0 or the cloud variant axolotlai/axolotl-cloud:main-py3.11-cu128-2.7.0.

Please refer to the Docker documentation for more information on the different Docker images that are available.

For providers supporting Docker:

See Section 6 for Mac-specific issues.

We recommend using WSL2 (Windows Subsystem for Linux) or Docker.

Install PyTorch: https://pytorch.org/get-started/locally/

(Optional) Login to Hugging Face:

If you encounter installation issues, see our FAQ and Debugging Guide.

**Examples:**

Example 1 (bash):

```bash
pip3 install -U packaging setuptools wheel ninja
pip3 install --no-build-isolation axolotl[flash-attn,deepspeed]
```

Example 2 (bash):

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
```

Example 3 (bash):

```bash
export UV_TORCH_BACKEND=cu126
uv venv --no-project --relocatable
source .venv/bin/activate
```

Example 4 (bash):

```bash
uv pip install packaging setuptools wheel
uv pip install torch==2.6.0
uv pip install awscli pydantic
```
---

## Dataset Preprocessing

**URL:** https://docs.axolotl.ai/docs/dataset_preprocessing.html

**Contents:**
- Dataset Preprocessing
- Overview
- What are the benefits of pre-processing?
- What are the edge cases?

Dataset pre-processing is the step where Axolotl takes each dataset you've configured, alongside the dataset format and prompt strategies, to:

The processing of the datasets can happen in one of two ways:

When training interactively or running sweeps (i.e. you are restarting the trainer often), processing the datasets can oftentimes be frustratingly slow. Pre-processing caches the tokenized/formatted datasets according to a hash of the training parameters they depend on, so that the trainer can intelligently pull from its cache when possible.
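Conceptually, the cache lookup works like hashing a canonical form of the relevant parameters; a minimal Python sketch (the field names here are hypothetical, not Axolotl's actual hash inputs):

```python
import hashlib
import json

def dataset_cache_key(params: dict) -> str:
    """Derive a stable cache key from the parameters the tokenized
    dataset depends on. Serializing with sorted keys makes the key
    independent of dict insertion order."""
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Two configs with the same settings map to the same cache entry.
key = dataset_cache_key({"sequence_len": 2048, "chat_template": "llama3"})
```

Any change to a hashed parameter produces a different key, which is why the cache is reused only when the dependent settings are identical.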

The path of the cache is controlled by dataset_prepared_path: and is often left blank in example YAMLs, as this leads to a more robust setup that prevents unexpectedly reusing cached data.

If dataset_prepared_path: is left empty, the processed dataset will be cached in the default path ./last_run_prepared/ during training, but anything already cached there will be ignored. By explicitly setting dataset_prepared_path: ./last_run_prepared, the trainer will use whatever pre-processed data is in the cache.

Let's say you are writing a custom prompt strategy or using a user-defined prompt template. Because the trainer cannot readily detect these changes, the calculated hash value for the pre-processed dataset does not change.

If you have dataset_prepared_path: ... set and change your prompt templating logic, the trainer may not pick up the changes you made and you will be training on the old prompt.

---

## Inference and Merging

**URL:** https://docs.axolotl.ai/docs/inference.html

**Contents:**
- Inference and Merging
- 1 Quick Start
- 1.1 Basic Inference
- 2 Advanced Usage
- 2.1 Gradio Interface
- 2.2 File-based Prompts
- 2.3 Memory Optimization
- 3 Merging LoRA Weights
- 3.1 Memory Management for Merging
- 4 Tokenization

This guide covers how to use your trained models for inference, including model loading, interactive testing, merging adapters, and common troubleshooting steps.

Use the same config you used for training when running inference/merging.

Launch an interactive web interface:

Process prompts from a text file:

For large models or limited memory:

Merge LoRA adapters with the base model:

Tokenization mismatches between training and inference are a common source of problems.

Verify inference tokenization by decoding tokens before model input.

Compare token IDs between training and inference.
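A small helper for this comparison might look like the following sketch (not part of Axolotl; the token ID lists come from however you tokenize the same text in each setting):

```python
def first_divergence(train_ids, infer_ids):
    """Return the index of the first differing token ID between the
    training-time and inference-time tokenizations, or None if they
    are identical. A length mismatch counts as divergence at the end
    of the shorter sequence."""
    for i, (a, b) in enumerate(zip(train_ids, infer_ids)):
        if a != b:
            return i
    if len(train_ids) != len(infer_ids):
        return min(len(train_ids), len(infer_ids))
    return None
```

Decoding the tokens around the returned index usually reveals the culprit, e.g. a missing BOS token or a differing chat template.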

Configure special tokens in your YAML:
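The exact tokens depend on your base model; a sketch of the shape this takes (the values here are placeholders in Llama-1/2 style):

```yaml
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  pad_token: "</s>"
```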

For more details, see our debugging guide.

**Examples:**

Example 1 (bash):

```bash
axolotl inference your_config.yml --lora-model-dir="./lora-output-dir"
```

Example 2 (bash):

```bash
axolotl inference your_config.yml --base-model="./completed-model"
```

Example 3 (bash):

```bash
axolotl inference your_config.yml --gradio
```

Example 4 (bash):

```bash
cat /tmp/prompt.txt | axolotl inference your_config.yml \
  --base-model="./completed-model" --prompter=None
```

---

## MultiModal / Vision Language Models (BETA)

**URL:** https://docs.axolotl.ai/docs/multimodal.html

**Contents:**
- MultiModal / Vision Language Models (BETA)
- Supported Models
- Usage
- Mllama
- Llama4
- Pixtral
- Llava-1.5
- Mistral-Small-3.1
- Magistral-Small-2509
- Voxtral

Multimodal support is limited and doesn't have full feature parity.

Here are the hyperparams you'll need to use to finetune a multimodal model.

Please see the examples folder for full configs.

Some of our chat_templates have been extended to support broader dataset types. This should not break any existing configs.

As of now, we do not truncate or drop samples based on sequence_len, as each architecture has a different way of processing non-text tokens. We are looking for help on this.

Please make sure to install the vision lib via pip install 'mistral-common[opencv]==1.8.5'

Please make sure to install the audio lib via pip3 install librosa==0.11.0 'mistral_common[audio]==1.8.3'

The Gemma3-1B model is a text-only model, so please train it as a regular text model.

For the multi-modal 4B/12B/27B models, use the following config:

The model's initial loss and grad norm will be very high. We suspect this is due to the Conv in the vision layers.

Please make sure to install timm via pip3 install timm==1.0.17

Please make sure to install num2words via pip3 install num2words==0.5.14

Please uninstall causal-conv1d via pip3 uninstall -y causal-conv1d

For multi-modal datasets, we adopt an extended chat_template format similar to OpenAI's Message format.

For backwards compatibility:

For image loading, you can use the following keys within content alongside "type": "image":

For audio loading, you can use the following keys within content alongside "type": "audio":

You may need to install librosa via pip3 install librosa==0.11.0.

This is not well tested at the moment. We welcome contributors!

For video loading, you can use the following keys within content alongside "type": "video":

Here is an example of a multi-modal dataset:
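The sample below is an illustrative sketch of the extended message format described above (the field values are placeholders; consult the docs page for the exact keys each modality supports):

```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image", "path": "/path/to/image.jpg"},
        {"type": "text", "text": "What is shown in this image?"}
      ]
    },
    {
      "role": "assistant",
      "content": [
        {"type": "text", "text": "A photo of ..."}
      ]
    }
  ]
}
```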

PIL could not retrieve the file at the given URL using requests. Please check for a typo. Alternatively, the request may have been blocked by the server.

**Examples:**

Example 1 (yaml):

```yaml
processor_type: AutoProcessor

skip_prepare_dataset: true
remove_unused_columns: false  # leave columns in place as they are needed to handle image embeddings during training
sample_packing: false  # not yet supported with multimodal

chat_template: # see in next section if specified

# example dataset
datasets:
  - path: HuggingFaceH4/llava-instruct-mix-vsft
    type: chat_template
    split: train[:1%]

# (optional) if doing lora, only finetune the Language model,
# leave the vision model and vision tower frozen
# load_in_8bit: true
adapter: lora
lora_target_modules: 'model.language_model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'

# (optional) if you want to resize images to a set size
image_size: 512
image_resize_algorithm: bilinear
```

Example 2 (yaml):

```yaml
base_model: meta-llama/Llama-3.2-11B-Vision-Instruct

chat_template: llama3_2_vision
```

Example 3 (yaml):

```yaml
base_model: meta-llama/Llama-4-Scout-17B-16E-Instruct

chat_template: llama4
```

Example 4 (yaml):

```yaml
base_model: mistralai/Pixtral-12B-2409

chat_template: pixtral
```

---

## Reward Modelling

**URL:** https://docs.axolotl.ai/docs/reward_modelling.html

**Contents:**
- Reward Modelling
- Overview
- (Outcome) Reward Models
- Process Reward Models (PRM)

Reward modelling is a technique used to train models to predict the reward or value of a given input. This is particularly useful in reinforcement learning scenarios where the model needs to evaluate the quality of its actions or predictions. We support the reward modelling techniques supported by trl.

Outcome reward models are trained on data which contains preference annotations for an entire interaction between the user and model (i.e. rather than per-turn or per-step). For improved training stability, you can use the center_rewards_coefficient parameter to encourage mean-zero reward outputs (see the TRL docs).

Bradley-Terry chat templates expect single-turn conversations in the following format:

Check out our PRM blog.

Process reward models are trained on data which contains preference annotations for each step in a series of interactions. Typically, PRMs are trained to provide reward signals over each step of a reasoning trace and are used for downstream reinforcement learning.

Please see stepwise_supervised for more details on the dataset format.

**Examples:**

Example 1 (yaml):

```yaml
base_model: google/gemma-2-2b
model_type: AutoModelForSequenceClassification
num_labels: 1
tokenizer_type: AutoTokenizer

reward_model: true
chat_template: gemma
datasets:
  - path: argilla/distilabel-intel-orca-dpo-pairs
    type: bradley_terry.chat_template

val_set_size: 0.1
eval_steps: 100
```

Example 2 (json):

```json
{
  "system": "...", // optional
  "input": "...",
  "chosen": "...",
  "rejected": "..."
}
```

Example 3 (yaml):

```yaml
base_model: Qwen/Qwen2.5-3B
model_type: AutoModelForTokenClassification
num_labels: 2

process_reward_model: true
datasets:
  - path: trl-lib/math_shepherd
    type: stepwise_supervised
    split: train

val_set_size: 0.1
eval_steps: 100
```

---

## RLHF (Beta)

**URL:** https://docs.axolotl.ai/docs/rlhf.html

**Contents:**
- RLHF (Beta)
- Overview
- RLHF using Axolotl
- DPO
- chatml.argilla
- chatml.argilla_chat
- chatml.icr
- chatml.intel
- chatml.prompt_pairs
- chatml.ultra

Reinforcement Learning from Human Feedback is a method whereby a language model is optimized from data using human feedback. Various methods include, but are not limited to:

This is a BETA feature and many features are not fully implemented. You are encouraged to open new PRs to improve the integration and functionality.

We rely on the TRL library for implementations of the various RL training methods, which we wrap to expose in Axolotl. Each method has its own supported ways of loading datasets and prompt formats.

You can find what each method supports by going into src/axolotl/prompt_strategies/{method}, where {method} is one of our supported methods. The type: can be retrieved from {method}.{function_name}.

DPO supports the following types with the following dataset format:

For custom behaviors,

The input format is a simple JSON input with customizable fields based on the above config.

As IPO is just DPO with a different loss function, all supported dataset formats for DPO are also supported for IPO.

Paper: https://arxiv.org/abs/2403.07691

ORPO supports the following types with the following dataset format:

KTO supports the following types with the following dataset format:

For custom behaviors,

The input format is a simple JSON input with customizable fields based on the above config.

Check out our GRPO cookbook.

In the latest GRPO implementation, vLLM is used to significantly speed up trajectory generation during training. In this example, we're using 4 GPUs - 2 for training, and 2 for vLLM:

Make sure you've installed the correct version of vLLM by including it as an extra when installing axolotl, e.g. pip install axolotl[vllm].

Your vLLM instance will now attempt to spin up, and it's time to kick off training utilizing our remaining two GPUs. In another terminal, execute:

Due to TRL's implementation with vLLM, the vLLM instance must use the last N GPUs instead of the first N GPUs. This is why, in the example above, we use CUDA_VISIBLE_DEVICES=2,3 for the vLLM instance.

GRPO uses custom reward functions and transformations. Please have them ready locally.

For example, to load OpenAI's GSM8K and use a random reward for completions:
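The referenced snippet is not reproduced here; as a sketch, a random-reward function in the shape TRL's GRPO trainer expects (one float per completion; the function name is hypothetical, and how it is referenced from your config follows the GRPO docs):

```python
import random

def random_reward(completions, **kwargs):
    """Toy reward function: returns one float per completion.

    A placeholder like this is only useful for verifying that the
    GRPO training loop runs end to end; replace it with a real
    scoring function (e.g. answer checking for GSM8K)."""
    return [random.uniform(0.0, 1.0) for _ in completions]
```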

To see other examples of custom reward functions, please see the TRL GRPO Docs.

To see all configs, please see TRLConfig.

The DAPO paper and, subsequently, the Dr. GRPO paper proposed alternative loss functions for GRPO to remediate the penalty on longer responses.

For more information, see the GRPO docs.

SimPO uses CPOTrainer, but with an alternative loss function.

This method uses the same dataset format as DPO.

TRL supports auto-unwrapping PEFT models for RL training paradigms which rely on a reference model. This significantly reduces memory pressure, as an additional reference model does not need to be loaded; reference model log-probabilities can instead be obtained by disabling the PEFT adapters. This is enabled by default. To turn it off, pass the following config:

**Examples:**

Example 1 (yaml):

```yaml
rl: dpo
datasets:
  - path: Intel/orca_dpo_pairs
    split: train
    type: chatml.intel
  - path: argilla/ultrafeedback-binarized-preferences
    split: train
    type: chatml
```

Example 2 (json):

```json
{
  "system": "...", // optional
  "instruction": "...",
  "chosen_response": "...",
  "rejected_response": "..."
}
```

Example 3 (json):

```json
{
  "chosen": [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ],
  "rejected": [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ]
}
```

Example 4 (json):

```json
{
  "system": "...", // optional
  "input": "...",
  "chosen": "...",
  "rejected": "..."
}
```

---

## LoRA Optimizations

**URL:** https://docs.axolotl.ai/docs/lora_optims.html

**Contents:**
- LoRA Optimizations
- Usage
- Requirements
- Implementation details
- Custom autograd functions
- Triton kernels
- Integration
- Future Work

Inspired by Unsloth, we've implemented two optimizations for LoRA and QLoRA fine-tuning, supporting both single GPU and multi-GPU (including the DDP, DeepSpeed, and FSDP2 settings) training. These include (1) SwiGLU and GEGLU activation function Triton kernels, and (2) LoRA MLP and attention custom autograd functions. Our goal was to leverage operator fusion and tensor re-use in order to improve speed and reduce memory usage during the forward and backward passes of these calculations.

We currently support several common model architectures, including (but not limited to):

The set of models we support is currently limited by our attention patching strategy, which assumes (and replaces) specific code blocks for the query / key / value and output projections:

Where apply_qkv and apply_o are defined in the axolotl.kernels.lora module.

We welcome testing of other model architectures and / or PRs to expand our patching logic to be compatible with more of them.

Check out our LoRA optimizations blog.

These optimizations can be enabled in your Axolotl config YAML file. The lora_mlp_kernel option enables the optimized MLP path, while lora_qkv_kernel and lora_o_kernel enable the fused query-key-value projection and optimized output projection, respectively.

Currently, the LoRA kernels are not supported for RLHF training, only SFT.

Models with pre-existing LoRA adapters that use Dropout or have bias terms may need to be re-finetuned without these features in order to be usable with the kernels.

The LoRA MLP autograd function optimizes the entire MLP computation path. It fuses the LoRA and base weight computations together and provides a single, efficient backward pass for the entire MLP block.

For the attention components, similar optimizations are provided through a function that handles the query, key, and value projections, and a function that handles the output projection. They are designed to work with the existing transformers attention implementation via some monkey-patching logic.

Two activation functions (SwiGLU and GeGLU) are implemented as Triton kernels for improved speed and memory performance. These kernels handle both the forward and backward passes.
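For intuition, the SwiGLU computation these kernels fuse can be written in scalar Python (illustrative only; the real kernels operate on tensors and also produce the backward-pass gradients):

```python
import math

def silu(x: float) -> float:
    # SiLU(x) = x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def swiglu(gate: float, up: float) -> float:
    # SwiGLU gates the up projection with a SiLU-activated gate:
    # out = SiLU(gate) * up, applied elementwise over the MLP hidden dim.
    return silu(gate) * up
```

The fused kernel avoids materializing the intermediate SiLU output as a separate tensor, which is where the memory savings come from.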

The custom autograd functions and Triton kernels are designed to work together. The autograd function manages the high-level computation flow and gradient tracking, calling into the Triton kernels for the activation function computation. During the backward pass, the kernel computes both the activation output and the required gradients, which the autograd function then uses to compute the final gradients for the entire computation path.

**Examples:**

Example 1 (python):

```python
ORIGINAL_QKV_CODE = """
query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
""".lstrip(
    "\n"
)

ORIGINAL_O_CODE = """
attn_output = self.o_proj(attn_output)
""".lstrip(
    "\n"
)
```

Example 2 (python):

```python
PATCHED_QKV_CODE = """
query_states, key_states, value_states = self.apply_qkv(hidden_states)
query_states = query_states.view(hidden_shape).transpose(1, 2)
key_states = key_states.view(hidden_shape).transpose(1, 2)
value_states = value_states.view(hidden_shape).transpose(1, 2)
""".lstrip(
    "\n"
)

PATCHED_O_CODE = """
attn_output = self.apply_o(attn_output)
""".lstrip(
    "\n"
)
```

Example 3 (yaml):

```yaml
lora_mlp_kernel: true
lora_qkv_kernel: true
lora_o_kernel: true
```

---
|
|
807
|
+
|
|
808
## Quantization with torchao

**URL:** https://docs.axolotl.ai/docs/quantize.html

**Contents:**
- Quantization with torchao
- Configuring Quantization in Axolotl

Quantization is a technique to lower the memory footprint of your model, potentially at the cost of accuracy or model performance. We support quantizing your model using the torchao library. Quantization is supported for both post-training quantization (PTQ) and quantization-aware training (QAT).

We do not currently support quantization techniques such as GGUF, GPTQ, or EXL2.

Quantization is configured using the quantization key in your configuration file.

Once quantization is complete, your quantized model will be saved in the {output_dir}/quantized directory.

You may also use the quantize command to quantize a model which has been trained with QAT. You can do this by using the existing QAT configuration file which you used to train the model:

This ensures that an identical quantization configuration is used to quantize the model as was used to train it.

If you have configured pushing to the Hub with hub_model_id, your model's Hub name will have the quantization schema appended to it, e.g. axolotl-ai-cloud/qat-nvfp4-llama3B will become axolotl-ai-cloud/qat-nvfp4-llama3B-nvfp4w.

**Examples:**

Example 1 (yaml):
```yaml
base_model: # The path to the model to quantize.
quantization:
  activation_dtype: # Optional[str] = "int8". Fake quantization layout to use for activation quantization. Valid options are "int4", "int8", "float8"
  weight_dtype: # Optional[str] = "int8". Fake quantization layout to use for weight quantization. Valid options are "int4", "fp8", and "nvfp4".
  group_size: # Optional[int] = 32. The number of elements in each group for per-group fake quantization
  quantize_embedding: # Optional[bool] = False. Whether to quantize the embedding layer.

output_dir: # The path to the output directory.
```

Example 2 (yaml):
```yaml
# qat.yml
qat:
  activation_dtype: int8
  weight_dtype: int4
  group_size: 256

output_dir: # The path to the output directory used during training where the final checkpoint has been saved.
```

Example 3 (bash):
```bash
axolotl quantize qat.yml
```
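
Example 4 (yaml): a minimal fragment illustrating the Hub-naming behavior described above. The model id is the one from the text; the root-level placement of hub_model_id follows the prose.
```yaml
hub_model_id: axolotl-ai-cloud/qat-nvfp4-llama3B
# After quantization, the pushed repo name gains the schema suffix:
#   axolotl-ai-cloud/qat-nvfp4-llama3B-nvfp4w
```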

---

## NCCL

**URL:** https://docs.axolotl.ai/docs/nccl.html

**Contents:**
- NCCL

NVIDIA NCCL is a library to facilitate and optimize multi-GPU communication operations, such as broadcast, all-gather, reduce, all-reduce, etc. Broadly, NCCL configuration is highly environment-specific and is configured via several environment variables. A common NCCL-related problem occurs when a long-running operation times out, causing the training process to abort:

Often, this timeout will happen after 30 minutes (the default setting) and is accompanied by below-average power consumption with near 100% GPU utilization before the error is raised. NVIDIA recommends disabling PCI access control services (ACS) as a possible solution, if this is available to you.

Forcing cross-GPU communication via NVLink may help without increasing timeouts. To verify that your configuration is leveraging NVLink, run the following command:

To force NCCL to use NVLink, simply set this in the environment:

If NVLink is not available in your environment, there are other options for NCCL_P2P_LEVEL; see NVIDIA's NCCL environment variable documentation for the full list of values.

To validate that acceptable data transfer speeds exist for your training job, running NCCL Tests can help pinpoint bottlenecks, for example:

It can be useful when debugging NCCL communication timeouts to activate additional logging in both PyTorch and NCCL:

Finally, if you believe your training job needs more time, you can increase the timeout past 30 minutes by setting the ddp_timeout value in the Axolotl configuration. See PyTorch init_process_group for documentation on this value.

**Examples:**

Example 1 (unknown):
```unknown
Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1806948 milliseconds before timing out.
```

Example 2 (bash):
```bash
nvidia-smi nvlink --status
```

Example 3 (bash):
```bash
export NCCL_P2P_LEVEL=NVL
```

Example 4 (bash):
```bash
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
```
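
Example 5 (bash): the additional logging mentioned above can typically be enabled with the standard NCCL and PyTorch debug environment variables (shown with common values; these are general-purpose settings, not Axolotl-specific):
```shell
# Verbose NCCL logging; SUBSYS can be narrowed, e.g. INIT,NET
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
# Extra logging from torch.distributed (OFF, INFO, or DETAIL)
export TORCH_DISTRIBUTED_DEBUG=INFO
```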

---

## Multi Node

**URL:** https://docs.axolotl.ai/docs/multi-node.html

**Contents:**
- Multi Node
- Accelerate
- Raytrain
- Torchrun
- Option 1: New Axolotl CLI with launcher args (Recommended)
- Option 2: Direct torchrun (Legacy)

Below are three ways to train multi-node in Axolotl.

Each machine needs a copy of Axolotl; we suggest using the same commit to ensure compatibility.

You will also need to have the same configuration file for your model on each machine.

Make sure the main machine is reachable by the other machines.

You will need to create a configuration for accelerate, either by running accelerate config and following the instructions, or by using one of the presets below:

~/.cache/huggingface/accelerate/default_config.yaml

Configure your model to use FSDP in the Axolotl yaml. For example:

Now launch with accelerate on each machine as you usually would; the processes will start once accelerate has been launched on every machine.

Please see the Ray Train doc here.

If you are using Infiniband, we recommend torchrun to utilize the full bandwidth.

Set the following env vars (change the buffer size / socket name depending on your system):

Run the following on each node:

Please make sure to substitute the placeholder variables:

The new CLI approach (Option 1) is recommended as it provides consistent argument handling and works seamlessly with other Axolotl CLI features.

More info on the available configs can be found in the PyTorch docs here.

**Examples:**

Example 1 (yaml):
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
machine_rank: 0 # Set to 0 for the main machine, increment by one for other machines
main_process_ip: 10.0.0.4 # Set to the main machine's IP
main_process_port: 5000
main_training_function: main
mixed_precision: bf16
num_machines: 2 # Change to the number of machines
num_processes: 4 # Total number of GPUs (for example: if you have 2 machines with 4 GPUs each, put 8)
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

Example 2 (yaml):
```yaml
fsdp_version: 2
fsdp_config:
  offload_params: true
  state_dict_type: FULL_STATE_DICT
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: LlamaDecoderLayer
  reshard_after_forward: true
```
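
Launching with accelerate on each machine, as described above, is the standard accelerate launch invocation. The module path below follows Axolotl's legacy CLI and may differ across versions, so treat this as a sketch rather than the canonical command:
```bash
# Run identically on every machine once the accelerate config is in place
accelerate launch -m axolotl.cli.train config.yml
```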

Example 3 (bash):
```bash
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME="eth0,en,eth,em,bond"
export NCCL_BUFFSIZE=2097152
```

Example 4 (bash):
```bash
axolotl train config.yaml --launcher torchrun -- --nnodes $num_nodes --nproc_per_node $gpu_per_node --rdzv_id $rdzv_id --rdzv_backend c10d --rdzv_endpoint "$head_node_ip:$head_node_port"
```

---

## Dataset Loading

**URL:** https://docs.axolotl.ai/docs/dataset_loading.html

**Contents:**
- Dataset Loading
- Overview
- Loading Datasets
- Local dataset
- Files
- Directory
- Loading entire directory
- Loading specific files in directory
- HuggingFace Hub
- Folder uploaded

Datasets can be loaded in a number of different ways depending on how they are saved (the extension of the file) and where they are stored.

We use the datasets library to load datasets, with a mix of load_dataset and load_from_disk.

You may recognize the similarly named configs between load_dataset and the datasets section of the config file.

Do not feel overwhelmed by the number of options here. A lot of them are optional. In fact, the most common config to use is path, and sometimes data_files.

This matches the API of datasets.load_dataset, so if you're familiar with that, you will feel right at home.

For HuggingFace's guide to loading different dataset types, see here.

For full details on the config, see config-reference.qmd.

You can set multiple datasets in the config file by adding more than one entry under datasets.

To load a JSON file, you would do something like this:

Which translates to the following config:

In the example above, it can be seen that we can just point the path to the file or directory, along with the ds_type, to load the dataset.

This works for CSV, JSON, Parquet, and Arrow files.

If path points to a file and ds_type is not specified, we will automatically infer the dataset type from the file extension, so you could omit ds_type if you'd like.

If you're loading a directory, you can point the path to the directory.

Then, you have two options:

You do not need any additional configs.

We will attempt to load in the following order:
- datasets saved with datasets.save_to_disk
- loading an entire directory of files (such as parquet/arrow files)

Provide data_files with a list of files to load.

The method you use to load the dataset depends on how the dataset was created: whether a folder was uploaded directly, or a HuggingFace Dataset was pushed.

If you're using a private dataset, you will need to enable the hf_use_auth_token flag in the root level of the config file.

This would mean that the dataset is a single file or file(s) uploaded to the Hub.

This means that the dataset is created as a HuggingFace Dataset and pushed to the Hub via datasets.push_to_hub.

There are some other configs which may be required, like name, split, revision, trust_remote_code, etc., depending on the dataset.

Via the storage_options config under load_dataset, you can load datasets from remote filesystems like S3, GCS, Azure, and OCI.

This is currently experimental. Please let us know if you run into any issues!

The only difference between the providers is that you need to prepend the path with the respective protocols.

For a directory, we load via load_from_disk.

Prepend the path with s3://.

The credentials are pulled in the following order:

We assume you have credentials set up and are not using anonymous access. If you want to use anonymous access, let us know! We may have to open a config option for this.

Other environment variables that can be set can be found in the boto3 docs.

Prepend the path with gs:// or gcs://.

The credentials are loaded in the following order:

Prepend the path with adl://.

Ensure you have the following environment variables set:

Prepend the path with abfs:// or az://.

Ensure you have the following environment variables set:

Other environment variables that can be set can be found in the adlfs docs.

Prepend the path with oci://.

It would attempt to read in the following order:

Other environment variables:

Please see the ocifs docs.

The path should start with https://.

This must be publicly accessible.

Now that you know how to load datasets, you can learn more about loading your specific dataset format into your target output format in the dataset formats docs.

**Examples:**

Example 1 (yaml):
```yaml
datasets:
  - path:
    name:
    data_files:
    split:
    revision:
    trust_remote_code:
```

Example 2 (yaml):
```yaml
datasets:
  - path: /path/to/your/dataset
  - path: /path/to/your/other/dataset
```

Example 3 (python):
```python
from datasets import load_dataset

dataset = load_dataset("json", data_files="data.json")
```

Example 4 (yaml):
```yaml
datasets:
  - path: data.json
    ds_type: json
```
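
Example 5 (yaml): a minimal remote-filesystem fragment for the S3 case described above. The bucket and file names are hypothetical; the other providers work the same way with their respective protocols.
```yaml
datasets:
  # hypothetical bucket/object; swap s3:// for gs://, abfs://, or oci:// as needed
  - path: s3://my-bucket/train.parquet
```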

---

## Multi-GPU

**URL:** https://docs.axolotl.ai/docs/multi-gpu.html

**Contents:**
- Multi-GPU
- 1 Overview
- 2 DeepSpeed
- 2.1 Configuration
- 2.2 Usage
- 2.3 ZeRO Stages
- 3 Fully Sharded Data Parallel (FSDP)
- 3.1 Migrating from FSDP1 to FSDP2
- 3.1.1 Config mapping
- 3.2 FSDP1 (deprecated)

This guide covers advanced training configurations for multi-GPU setups using Axolotl.

Axolotl supports several methods for multi-GPU training:

Add to your YAML config:

We provide default configurations for:

For best performance, choose the configuration that offloads the least while still fitting in VRAM.

Start from Stage 1 -> Stage 2 -> Stage 3.

FSDP2 is recommended for new users. FSDP1 is deprecated and will be removed in an upcoming release of Axolotl.

To migrate your config from FSDP1 to FSDP2, you must use the fsdp_version top-level config field to specify the FSDP version, and also follow the config field mapping below to update field names.

For more details, please see the migration guide in the torchtitan repo. In Axolotl, if you were using the following FSDP1 config:

You can migrate to the following FSDP2 config:

Using fsdp to configure FSDP is deprecated and will be removed in an upcoming release of Axolotl. Please use fsdp_config as above instead.

We support sequence parallelism (SP) via the ring-flash-attention project. This allows one to split up sequences across GPUs, which is useful in the event that a single sequence causes OOM errors during model training.

See our dedicated guide for more information.

For combining FSDP with QLoRA, see our dedicated guide.

Please see docs for more info.

For NCCL-related problems, see our NCCL troubleshooting guide.

For more detailed troubleshooting, see our debugging guide.

**Examples:**

Example 1 (yaml):
```yaml
deepspeed: deepspeed_configs/zero1.json
```

Example 2 (bash):
```bash
# Fetch deepspeed configs (if not already present)
axolotl fetch deepspeed_configs

# Passing arg via config
axolotl train config.yml

# Passing arg via cli
axolotl train config.yml --deepspeed deepspeed_configs/zero1.json
```
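
The Stage 1 -> Stage 2 -> Stage 3 progression above can be tried by simply swapping the DeepSpeed config file. The zero2/zero3 filenames below assume the same naming convention as the zero1.json shipped in deepspeed_configs; verify them against the fetched directory:
```bash
# Try the least-offloading stage first; move to the next only on OOM
axolotl train config.yml --deepspeed deepspeed_configs/zero1.json
axolotl train config.yml --deepspeed deepspeed_configs/zero2.json
axolotl train config.yml --deepspeed deepspeed_configs/zero3.json
```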

Example 3 (yaml):
```yaml
fsdp_version: 1
fsdp_config:
  fsdp_offload_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Qwen3DecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
```

Example 4 (yaml):
```yaml
fsdp_version: 2
fsdp_config:
  offload_params: false
  cpu_ram_efficient_loading: true
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: Qwen3DecoderLayer
  state_dict_type: FULL_STATE_DICT
  reshard_after_forward: true
```

---

## Ray Train

**URL:** https://docs.axolotl.ai/docs/ray-integration.html

**Contents:**
- Ray Train
- Ray cluster setup
- Sanity check
- Configuring training with Ray Train
- Launching training

Axolotl supports using Ray as an alternative to accelerate for orchestrating training. This is especially useful for multi-node training, since you only have to set up code and dependencies on a single node and launch training as if you were using a single node.

With the --use-ray CLI flag, Axolotl will use Ray Train's TorchTrainer to run training.

A prerequisite for using the Ray Train integration is to set up a Ray cluster on your desired node(s). For a detailed guide on how you can get started with Ray clusters, check the official Ray docs here.

Every Ray cluster has one head node and a set of worker nodes. The head node is just like any other worker node, but it also runs certain special processes related to scheduling and orchestration. Ray-enabled scripts are run on the head node and, depending on the resources (number of CPUs, GPUs, etc.) they request, will be scheduled to run certain tasks on the worker nodes. For more on the key concepts behind a Ray cluster, you can refer to this doc.

To run a sanity check on whether your Ray cluster is set up properly, execute the following on the head node:
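
The sanity check referred to above is typically Ray's cluster status command, part of the standard ray CLI (assumed here; its output should match the node/resource summary shown below):
```bash
ray status
```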

The output should have a summary of your Ray cluster: a list of all the nodes in your cluster, the number of CPUs and GPUs in your cluster, etc. For example, if you have a cluster with 1 CPU-only head node and 2 4xL40S worker nodes, the output can look like this:

You should also be able to see the same on the Ray dashboard.

You can find an example configuration at configs/llama-3/lora-1b-ray.yaml.

The key parameters to note here are:

You can simply run the following command on the head node:

This will launch training on the head node, and workers will be scheduled automatically by Ray Train to run on the appropriate head or worker nodes.

You can also monitor training progress on the Ray dashboard.

Coming back to the example of a Ray cluster with 1 head node and 2 4xL40S worker nodes, let's say you want to make use of all 8 GPUs. You would be able to just set ray_num_workers: 8 and run the previous command. The Cluster tab will show the following:

**Examples:**

Example 1 (unknown):
```unknown
Node status
---------------------------------------------------------------
Active:
 1 head
Idle:
 2 4xL40S:48CPU-384GB
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/96.0 CPU
 0.0/8.0 GPU
 0B/800.00GiB memory
 0B/229.57GiB object_store_memory

Demands:
 (no resource demands)
```

Example 2 (yaml):
```yaml
use_ray: true
ray_num_workers: 4
# optional
resources_per_worker:
  GPU: 1
```

Example 3 (yaml):
```yaml
resources_per_worker:
  accelerator_type:L40S: 0.001
```

Example 4 (bash):
```bash
axolotl train examples/llama-3/lora-1b-ray.yml --use-ray
```

---

## Sequence Parallelism

**URL:** https://docs.axolotl.ai/docs/sequence_parallelism.html

**Contents:**
- Sequence Parallelism
- When to Use Sequence Parallelism
- Configuration
- Implementation Details
- Requirements
- Limitations
- Example
- Sample Packing with Sequence Parallelism
- Effect on Batch Size

Sequence parallelism is a technique that splits sequences across multiple GPUs, allowing you to train with very long sequences that wouldn't fit on a single GPU. Each GPU processes a different portion of the sequence, and the results are aggregated through a ring communication pattern.

Use sequence parallelism when:

To enable sequence parallelism, add the following to your configuration file:

The context_parallel_size should be a divisor of the total number of GPUs. For example:

When sequence parallelism is enabled:

To use sequence parallelism, you need:

This will train the Llama 3 8B model with 8K context length, with each sequence split into 2 subsequences of length 4096 across 2 GPUs.

Sequence parallelism is compatible with Axolotl's sample packing functionality. When using both features together:

When using sequence parallelism, your effective global batch size is divided by the context_parallel_size. This happens because:

For example:
- With 8 GPUs and no sequence parallelism: 8 different batches processed per step
- With 8 GPUs and context_parallel_size=4: only 2 different batches processed per step (each split across 4 GPUs)
- If your per-GPU micro_batch_size is 2, the global batch size decreases from 16 to 4
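
The arithmetic above can be sketched as a small helper. This is a hypothetical function for illustration, not part of Axolotl, and gradient accumulation is assumed to be 1:

```python
def effective_global_batch(num_gpus: int, micro_batch_size: int,
                           context_parallel_size: int = 1) -> int:
    """Global batch size per optimizer step under sequence parallelism."""
    # GPUs are grouped into context-parallel groups; each group shares one batch,
    # so only num_gpus // context_parallel_size distinct batches run per step.
    data_parallel_groups = num_gpus // context_parallel_size
    return data_parallel_groups * micro_batch_size

print(effective_global_batch(8, 2))     # 8 GPUs, no SP -> 16
print(effective_global_batch(8, 2, 4))  # 8 GPUs, context_parallel_size=4 -> 4
```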

**Examples:**

Example 1 (yaml):
```yaml
# Set to a divisor (> 1) of the number of GPUs available
context_parallel_size: 4  # Split sequences across 4 GPUs
# Optional; strides across the key dimension. Larger values use more memory but should make training faster.
heads_k_stride: 1
# Optional; one of "varlen_llama3" or "batch_ring". Defaults to
# "varlen_llama3" when `sample_packing: true`, and "batch_ring" otherwise.
ring_attn_func:
```

Example 2 (yaml):
```yaml
base_model: meta-llama/Llama-3-8B-Instruct
sequence_len: 8192

...

context_parallel_size: 4  # Split each sequence into 4 parts, one per GPU
# Optional; strides across the key dimension. Larger values use more memory but should make training faster.
heads_k_stride: 1
# Optional; one of "varlen_llama3" or "batch_ring". Defaults to
# "varlen_llama3" when `sample_packing: true`, and "batch_ring" otherwise.
ring_attn_func:

...
```

---

## Quantization Aware Training (QAT)

**URL:** https://docs.axolotl.ai/docs/qat.html

**Contents:**
- Quantization Aware Training (QAT)
- Overview
- Configuring QAT in Axolotl

Quantization Aware Training (QAT) is a technique for improving the accuracy of models which are quantized, by applying "fake" quantization to the model's weights (and, optionally, activations) during training. This fake quantization allows the model to adjust for the noise introduced by quantization, so that when the model is eventually quantized, the accuracy loss is minimized. We use the quantization techniques implemented in torchao to provide support for QAT and post-training quantization (PTQ) in Axolotl.

We recommend reviewing the excellent QAT tutorial in the torchtune library, and the QAT documentation in the torchao library, for more details.

To enable QAT in Axolotl, add the following to your configuration file:

We support the following quantization schemas:

Once you have finished training, you must quantize your model using the same quantization configuration which you used to train it. You can use the quantize command to do this.

**Examples:**

Example 1 (yaml):
```yaml
qat:
  activation_dtype: # Optional[str] = "int8". Fake quantization layout to use for activation quantization. Valid options are "int4", "int8", "float8"
  weight_dtype: # Optional[str] = "int8". Fake quantization layout to use for weight quantization. Valid options are "int4", "fp8", and "nvfp4".
  group_size: # Optional[int] = 32. The number of elements in each group for per-group fake quantization
  fake_quant_after_n_steps: # Optional[int] = None. The number of steps after which fake quantization is applied
```

---

## FSDP + QLoRA

**URL:** https://docs.axolotl.ai/docs/fsdp_qlora.html

**Contents:**
- FSDP + QLoRA
- Background
- Usage
- Enabling Swap for FSDP2
- Example Config
- References
- Footnotes

Using FSDP with QLoRA is essential for fine-tuning larger (70b+ parameter) LLMs on consumer GPUs. For example, you can use FSDP + QLoRA to train a 70b model on two 24GB GPUs¹.

Below, we describe how to use this feature in Axolotl.

To enable QLoRA with FSDP, you need to perform the following steps:

> **Tip:** See the example config file in addition to reading these instructions.

If available memory is insufficient even after FSDP's CPU offloading, you can enable swap memory usage by setting cpu_offload_pin_memory: false alongside offload_params: true in the FSDP config.

This disables memory pinning, allowing FSDP to use disk swap space as a fallback. Disabling memory pinning itself incurs performance overhead, and actually having to use swap adds more, but it may enable training larger models that would otherwise cause OOM errors on resource-constrained systems.
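
The swap-enabling settings described above combine into a small fsdp_config fragment (field names follow the FSDP2 examples elsewhere in these docs):
```yaml
fsdp_version: 2
fsdp_config:
  offload_params: true
  cpu_offload_pin_memory: false  # disables pinning so disk swap can serve as fallback
```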

examples/llama-2/qlora-fsdp.yml contains an example of how to enable QLoRA + FSDP in Axolotl.

¹ This was enabled by this work from the Answer.AI team.

---

## Custom Integrations

**URL:** https://docs.axolotl.ai/docs/custom_integrations.html

**Contents:**
- Custom Integrations
- Cut Cross Entropy
- Requirements
- Installation
- Usage
- Supported Models
- Citation
- DenseMixer
- Diffusion LM Training Plugin for Axolotl
- Overview

Axolotl adds custom features through integrations. They are located within the src/axolotl/integrations directory.

To enable them, please check the respective documentation.

Cut Cross Entropy (CCE) reduces VRAM usage through an optimization of the cross-entropy operation during loss calculation.

See https://github.com/apple/ml-cross-entropy

Run the following command to install cut_cross_entropy[transformers] if you don't have it already.
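
The install referenced above is presumably the standard pip invocation for the package's transformers extra:
```bash
pip install "cut_cross_entropy[transformers]"
```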

Please see the reference here.

Simply add the following to your Axolotl YAML config:
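
The plugin is enabled through Axolotl's plugins list. The import path below follows the naming convention of Axolotl's bundled integrations and is an assumption; verify it against your installed version:
```yaml
plugins:
  # assumed plugin path, matching the src/axolotl/integrations layout
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
```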
|
+
|
|
1482
|
+
Please see reference here
|
|
1483
|
+
|
|
1484
|
+
This plugin enables diffusion language model training using an approach inspired by LLaDA (Large Language Diffusion Models) within Axolotl.
|
|
1485
|
+
|
|
1486
|
+
LLaDA is a diffusion-based approach to language model training that uses: - Random token masking during training instead of next-token prediction - Bidirectional attention to allow the model to attend to the full context - Importance weighting based on masking probabilities for stable training
|
|
1487
|
+
|
|
1488
|
+
This approach can lead to more robust language models with better understanding of bidirectional context.
|
|
1489
|
+
|
|
1490
|
+
The plugin is included with Axolotl. See our installation docs.
|
|
1491
|
+
|
|
1492
|
+
Train with an example config (Llama‑3.2 1B): - Pretrain: axolotl train examples/llama-3/diffusion-3.2-1b-pretrain.yaml - SFT: axolotl train examples/llama-3/diffusion-3.2-1b-sft.yaml
|
|
1493
|
+
|
|
1494
|
+
You can also modify your existing configs to enable / customize diffusion training.
|
|
1495
|
+
|
|
1496
|
+
Add the following to your Axolotl config:
|
|
1497
|
+
|
|
1498
|
+
And, configure the nested diffusion block (defaults shown):
|
|
1499
|
+
|
|
1500
|
+
Any models that support 4D attention masks should work out of the box. If not, please create an issue or open a PR!
|
|
1501
|
+
|
|
1502
|
+
During training, tokens are randomly masked: - Sample timestep t uniformly from [0, 1] - Calculate masking probability: p = (1 - eps) * t + eps - Randomly mask tokens with probability p
|
|
1503
|
+
|
|
1504
|
+
Loss is computed only on masked tokens with (optional) importance weighting:
When `diffusion.generate_samples: true`, the plugin generates samples during training:

Samples are logged to the console and to wandb (if enabled).

Diffusion inference is integrated into the standard Axolotl CLI. Use the same config you trained with and run:

Optionally, pass `--gradio` to use a simple web interface.

Interactive controls (prefix the prompt with these commands):

- `:complete N` → completion mode with N new masked tokens appended (default 64)
- `:mask R` → random masking mode with target mask ratio R in [0.0, 1.0]

The plugin adds (or modifies) several metrics to track diffusion training:

Please see reference here
See https://github.com/ironjr/grokfast

Please see reference here

An example dataset can be found at axolotl-ai-co/evolkit-logprobs-pipeline-75k-v2-sample.

Please see reference here

Fine-tune sparsified models in Axolotl using Neural Magic's LLMCompressor.

This integration enables fine-tuning of models sparsified using LLMCompressor within the Axolotl training framework. By combining LLMCompressor's model compression capabilities with Axolotl's distributed training pipelines, users can efficiently fine-tune sparse models at scale.

It uses Axolotl's plugin system to hook into the fine-tuning flows while maintaining sparsity throughout training.

Install Axolotl with the llmcompressor extras:
Requires llmcompressor >= 0.5.1.

This will install all necessary dependencies to fine-tune sparsified models using the integration.

To enable sparse fine-tuning with this integration, include the plugin in your Axolotl config:
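A minimal sketch of that config addition (the exact plugin class path is an assumption based on Axolotl's integration naming convention; verify it against the integration's docs):

```yaml
# Assumed plugin path; confirm against src/axolotl/integrations in your install.
plugins:
  - axolotl.integrations.llm_compressor.LLMCompressorPlugin

# Optional: save checkpoints in compressed format (discussed below)
save_compressed: true
```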
This plugin does not apply pruning or sparsification itself; it is intended for fine-tuning models that have already been sparsified.

Pre-sparsified checkpoints can be:

- Generated using LLMCompressor
- Downloaded from Neural Magic's Hugging Face page
- Custom models with compatible sparsity patterns that you've created yourself

To learn more about writing and customizing LLMCompressor recipes, refer to the official documentation: https://github.com/vllm-project/llm-compressor/blob/main/README.md

Setting `save_compressed: true` in your configuration enables saving models in a compressed format, which:

- Reduces disk space usage by approximately 40%
- Maintains compatibility with vLLM for accelerated inference
- Maintains compatibility with LLMCompressor for further optimization (for example, quantization)

This option is highly recommended when working with sparse models to maximize the benefits of model compression.

See examples/llama-3/sparse-finetuning.yaml for a complete example.

After fine-tuning your sparse model, you can leverage vLLM for efficient inference. You can also use LLMCompressor to apply additional quantization to your fine-tuned sparse model before inference for even greater performance benefits:

For more details on vLLM's capabilities and advanced configuration options, see the official vLLM documentation.

For details on available sparsity and quantization schemes, fine-tuning recipes, and usage examples, visit the official LLMCompressor repository: https://github.com/vllm-project/llm-compressor
Please see reference here

Run evaluation on a model using the popular lm-evaluation-harness library.

See https://github.com/EleutherAI/lm-evaluation-harness

Please see reference here

Liger Kernel provides efficient Triton kernels for LLM training, offering:

See https://github.com/linkedin/Liger-Kernel

Please see reference here

by Eric Hartford, Lucas Atkins, Fernando Fernandes, David Golchinfar

This plugin contains code to freeze the bottom fraction of modules in a model, based on the Signal-to-Noise Ratio (SNR).

See https://github.com/cognitivecomputations/spectrum

Spectrum is a tool for scanning and evaluating the Signal-to-Noise Ratio (SNR) of layers in large language models. By identifying the top n% of layers with the highest SNR, you can optimize training efficiency.

Please see reference here

Plugins can be used to customize the behavior of the training pipeline through hooks. See axolotl.integrations.BasePlugin for the possible hooks.
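The hook pattern can be illustrated with a self-contained toy. Note that `BasePlugin` and the hook names below are stand-ins for illustration only, not Axolotl's real `axolotl.integrations.BasePlugin` API:

```python
# Illustration of the plugin-hook pattern; class and hook names are hypothetical.
class BasePlugin:
    def pre_train(self, cfg): ...
    def post_train(self, cfg): ...

class LoggingPlugin(BasePlugin):
    """Toy plugin that records which hooks the pipeline invoked."""
    def __init__(self):
        self.calls = []

    def pre_train(self, cfg):
        self.calls.append(("pre_train", cfg["base_model"]))

    def post_train(self, cfg):
        self.calls.append(("post_train", cfg["base_model"]))

def run_pipeline(cfg, plugins):
    # A trainer would call each registered plugin's hooks around the training step.
    for p in plugins:
        p.pre_train(cfg)
    # ... training happens here ...
    for p in plugins:
        p.post_train(cfg)

plugin = LoggingPlugin()
run_pipeline({"base_model": "meta-llama/Llama-3.2-1B"}, [plugin])
```

The real plugin system works the same way at a high level: the trainer looks up each plugin listed in the config and invokes its hooks at defined points in the pipeline.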
To add a new integration, please follow these steps:

See src/axolotl/integrations/cut_cross_entropy for a minimal integration example.

If you could not load your integration, please ensure you are pip installing in editable mode and have correctly spelled the integration name in the config file.

It is not necessary to place your integration in the integrations folder. It can be in any location, as long as it's installed as a package in your Python env.

See this repo for an example: https://github.com/axolotl-ai-cloud/diff-transformer

**Examples:**
Example 1 (bash):

```bash
python scripts/cutcrossentropy_install.py | sh
```

Example 2 (bash):

```bash
pip3 uninstall -y cut-cross-entropy && pip3 install "cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@8a1a0ec"
```

Example 3 (yaml):

```yaml
plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
```

Example 4 (bibtex):

```bibtex
@article{wijmans2024cut,
  author  = {Erik Wijmans and
             Brody Huval and
             Alexander Hertzberg and
             Vladlen Koltun and
             Philipp Kr\"ahenb\"uhl},
  title   = {Cut Your Losses in Large-Vocabulary Language Models},
  journal = {arXiv},
  year    = {2024},
  url     = {https://arxiv.org/abs/2411.09009},
}
```

---
## Config Reference

**URL:** https://docs.axolotl.ai/docs/config-reference.html

**Contents:**
- Config Reference

**Examples:**

Example 1 (yaml):

```yaml
# Allow overwriting yml config values from the CLI
strict: bool | None = False
# Resume from a specific checkpoint dir
resume_from_checkpoint: str | None
# If resume_from_checkpoint isn't set and you simply want it to start where it left off.
# Be careful with this being turned on between different models.
auto_resume_from_checkpoints: bool | None
# Resize the model embeddings when new tokens are added to multiples of 32. This is
# reported to improve training speed on some models
resize_token_embeddings_to_32x: bool | None
mean_resizing_embeddings: bool | None = False

# Whether to shrink the embeddings to len(tokenizer). By default, we won't shrink.
shrink_embeddings: bool | None
# Don't upcast the embeddings to float32 when using PEFT. Useful for low-VRAM GPUs
embeddings_skip_upcast: bool | None
# Reinitialize model weights randomly instead of loading pretrained weights
reinit_weights: bool | None

# module to custom trainer class to use for training
trainer_cls: str | None

# Use RL training: 'dpo', 'ipo', 'kto', 'simpo', 'orpo', 'grpo'
rl: RLType | None
trl: TRLConfig | None
# For TRLConfig:
# Beta parameter for the RL training. Same as `rl_beta`.
beta: float | None
# Maximum length of the completion for RL training.
max_completion_length: int | None

# Whether to use vLLM for RL training.
use_vllm: bool = False
# vLLM mode to use, one of 'server' or 'colocate'
vllm_mode: Literal['server', 'colocate'] | None
# Host of the vLLM server to connect to.
vllm_server_host: str | None = 0.0.0.0
# Port of the vLLM server to connect to.
vllm_server_port: int | None = 8000
# Total timeout (in seconds) to wait for the vLLM server to respond.
vllm_server_timeout: int | None
# Regex for vLLM guided decoding.
vllm_guided_decoding_regex: str | None

# List of reward functions to load. Paths must be importable from the current dir.
reward_funcs: list[str] | None
# List of reward weights for the reward functions.
reward_weights: list[float] | None
# Number of generations to sample.
num_generations: int | None
# Whether to log completions.
log_completions: bool | None = False
# Number of completions to print when log_completions is True.
num_completions_to_print: int | None
# Controls whether importance sampling ratios are computed at the 'token' or
# 'sequence' level. For GSPO, use 'sequence'; the default is None, which corresponds
# to the original GRPO paper.
importance_sampling_level: Literal['sequence', 'token'] | None

# Whether to sync the reference model.
sync_ref_model: bool | None = False
# Mixup alpha for the reference model.
ref_model_mixup_alpha: float | None = 0.9
# Sync steps for the reference model.
ref_model_sync_steps: int | None = 64
# Whether to scale rewards by their standard deviation.
scale_rewards: bool = True

# Sampling temperature for the GRPO policy.
temperature: float | None
# Top-p sampling probability for the generation policy.
top_p: float | None
# Top-k sampling for the generation policy.
top_k: int | None
# Minimum probability for the generation policy.
min_p: float | None
# Penalty for tokens that appear in prompt and generated text.
repetition_penalty: float | None
# Number of iterations per batch (μ) for GRPO.
num_iterations: int | None
# Epsilon value for clipping in the GRPO algorithm.
epsilon: float | None
# Upper-bound epsilon value for clipping in the GRPO algorithm.
epsilon_high: float | None
# Whether to use Liger loss for GRPO.
use_liger_loss: bool | None
# Loss formulation to use. Supported values: grpo, bnpo, dr_grpo.
loss_type: str | None
# Whether to exclude truncated completions from loss calculation.
mask_truncated_completions: bool = False
# Enable sleep mode for vLLM to offload VRAM when idle
vllm_enable_sleep_mode: bool | None
vllm: VllmConfig | None
# For VllmConfig:
# Device to use for vLLM
device: str | None = auto
# Tensor parallel size for vLLM
tensor_parallel_size: int | None
# Data parallel size for vLLM
data_parallel_size: int | None
# GPU memory utilization for vLLM
gpu_memory_utilization: float | None = 0.9
# Data type for vLLM
dtype: str | None = auto
# Maximum length of the model context for vLLM
max_model_len: int | None
# Enable prefix caching for vLLM
enable_prefix_caching: bool | None
# Host for the vLLM server to start on
host: str | None = 0.0.0.0
# Port for the vLLM server to start on
port: int | None = 8000

# Enable reasoning for vLLM
enable_reasoning: bool | None
# Reasoning parser for vLLM
reasoning_parser: str | None

qat: QATConfig | None
# For QATConfig:
# Fake quantization layout to use for activation quantization.
activation_dtype: TorchAOQuantDType | None
# Fake quantization layout to use for weight quantization.
weight_dtype: TorchAOQuantDType = TorchAOQuantDType.int8
# Quantize embedding
quantize_embedding: bool | None = False
# The number of elements in each group for per-group fake quantization
group_size: int | None = 32
# The number of steps after which to apply fake quantization
fake_quant_after_n_steps: int | None

quantization: PTQConfig | None
# For PTQConfig:
# Fake quantization layout to use for weight quantization.
weight_dtype: TorchAOQuantDType = TorchAOQuantDType.int8
# Fake quantization layout to use for activation quantization.
activation_dtype: TorchAOQuantDType | None
# Whether to quantize the embedding layer.
quantize_embedding: bool | None
# The number of elements in each group for per-group fake quantization
group_size: int | None = 32

# Reward modelling: `True` or `False`
reward_model: bool | None
# Process reward modelling: `True` or `False`
process_reward_model: bool | None
# Coefficient to incentivize the reward model to output mean-zero rewards (proposed by
# https://huggingface.co/papers/2312.09244, Eq. 2). Recommended value: `0.01`.
center_rewards_coefficient: float | None
num_labels: int | None
# Whether to perform weighting in DPO trainer
dpo_use_weighting: bool | None
dpo_use_logits_to_keep: bool | None
dpo_label_smoothing: float | None
dpo_norm_loss: bool | None
dpo_padding_free: bool | None
dpo_generate_during_eval: bool | None

# A list of one or more datasets to finetune the model with
datasets: Annotated[list[SFTDataset | DPODataset | KTODataset | StepwiseSupervisedDataset], MinLen(1)] | None
# For SFTDataset:
# HuggingFace dataset repo | s3:// | gs:// | path to local file or directory
path: str | None
# name of dataset split to load from
split: str | None
# The type of prompt to use for training. [alpaca, gpteacher, oasst, reflection]
type: str | UserDefinedPrompterType | None
# For UserDefinedPrompterType:
# Custom user instruction prompt
system_prompt: str | None
# Use {system} as key to be replaced
system_format: str | None
field_system: str | None
field_instruction: str | None
field_input: str | None
field_output: str | None

# Customizable to be single line or multi-line. Use {instruction}/{input} as key to
# be replaced. 'format' can include {input}
format: str | None
# 'no_input_format' cannot include {input}
no_input_format: str | None
input_transform: str | None
# split dataset into N pieces (use with shards_idx)
shards: int | None
# the index of sharded dataset to use
shards_idx: int | None
# process dataset in N sequential chunks for memory efficiency (exclusive with
# `shards`)
preprocess_shards: int | None
conversation: str | None

# The name of the chat template to use for training; the following values are supported:
# tokenizer_default: Uses the chat template that is available in the
# tokenizer_config.json. If the chat template is not available in the tokenizer, it
# will raise an error. This is the default.
# alpaca/inst/chatml/gemma/cohere/llama3/phi_3/deepseek_v2/jamba: These chat templates
# are available in the axolotl codebase at src/axolotl/utils/chat_templates.py.
# tokenizer_default_fallback_*: where * is the name of the chat template to fall back
# to if the tokenizer does not have a chat template, else default to tokenizer. E.g.
# tokenizer_default_fallback_chatml.
# jinja: Uses a custom jinja template for the chat template. The custom jinja template
# should be provided in the chat_template_jinja field.
chat_template: ChatTemplate | str | None
# Custom jinja chat template or path to jinja file. Used only if `chat_template:
# jinja` or empty.
chat_template_jinja: str | None
# path to source data files
data_files: str | list[str] | None
input_format: str | None
# name of dataset configuration to load
name: str | None
# defines the datatype when path is a file
ds_type: str | None
# For `completion` datasets only, uses the provided field instead of the `text` column
field: str | None
field_human: str | None
field_model: str | None
# Key containing the messages (default: "messages")
field_messages: str | None
# Key containing the tools (default: "tools"). Must be a list[dict] and follow [JSON
# schema](https://json-schema.org/learn/getting-started-step-by-step).
field_tools: str | None
# Key containing the reasoning trace (default: "reasoning_content").
field_thinking: str | None
# The key the chat template expects that indicates the reasoning trace.
template_thinking_key: str | None

message_field_role: str | None

message_field_content: str | None
# Mapping of properties from the input dataset to the chat template. (default:
# message_property_mappings={'role': 'role', 'content': 'content'}) If a property
# exists in the template but not in this mapping, the system will attempt to load it
# directly from the message using the property name as the key. Example: in the
# mapping below, 'from' is loaded from the input dataset and used as 'role', while
# 'value' is loaded and used as 'content' in the chat template.
message_property_mappings: dict[str, str] | None
# The key in the message turn that indicates via boolean whether tokens of a turn
# should be considered for training. Useful to selectively train on certain turns
# besides the `roles_to_train`.
message_field_training: str | None
# The key in the message turn that contains the training details. Useful to
# selectively train on certain tokens in a turn. The value of the key is a List[Dict]
# containing `begin_offset` (start character index in content), `end_offset` (end
# character index in content), and `train` (boolean whether to train).
message_field_training_detail: str | None
# (for Qwen3 template only) Whether to split the assistant content based on a
# reasoning trace inside delimited tags
split_thinking: bool | None
logprobs_field: str | None
temperature: float | None
# Roles to train on. The tokens from these roles will be considered for the loss.
roles_to_train: list[str] | None
# Which EOS tokens to train on in the conversation. Possible values are: all: train on
# all EOS tokens, turn (default): train on the EOS token at the end of each trainable
# turn, last: train on the last EOS token in the conversation
train_on_eos: Literal['all', 'turn', 'last'] | None
# Roles mapping in the messages. The format is {target_role: [source_roles]}. All
# source roles will be mapped to the target role. The default is: user: ["human",
# "user"], assistant: ["gpt", "assistant"], system: ["system"], tool: ["tool"]
roles: dict[str, list[str]] | None
# Whether to drop the system turn from the dataset. Only works with chat_template.
# This does not drop the default system message from the chat_template if it exists.
# If you wish to, we recommend using a custom jinja template with the default system
# message removed or adding a system turn with empty content.
drop_system_message: bool | None
# Trust remote code for untrusted source
trust_remote_code: bool | None = False
# The specific revision of the dataset to use when loading from the Hugging Face Hub.
# This can be a commit hash, tag, or branch name. If not specified, the latest version
# will be used. This parameter is ignored for local datasets.
revision: str | None
# For DPODataset:
path: str | None
split: str | None
type: UserDefinedDPOType | str | None
# For UserDefinedDPOType:
field_system: str | None
field_prompt: str | None
field_chosen: str | None
field_rejected: str | None
prompt_format: str | None
chosen_format: str | None
rejected_format: str | None
data_files: list[str] | None
revision: str | None
field_messages: str | None

# For KTODataset:
path: str | None
split: str | None
type: UserDefinedKTOType | str | None
# For UserDefinedKTOType:
field_system: str | None
field_prompt: str | None
field_completion: str | None
field_label: bool | None
prompt_format: str | None
completion_format: str | None
data_files: list[str] | None
trust_remote_code: bool | None = False
revision: str | None

# For StepwiseSupervisedDataset:
path: str | None
split: str | None
data_files: list[str] | None
revision: str | None
step_separator: str | None
max_completion_length: int | None
train_on_last_step_only: bool | None

# A list of one or more datasets to eval the model with. You can use either
# test_datasets or val_set_size, but not both.
test_datasets: Annotated[list[SFTDataset | DPODataset | KTODataset | StepwiseSupervisedDataset], MinLen(1)] | None
# For SFTDataset / DPODataset / KTODataset / StepwiseSupervisedDataset:
# same fields as documented under `datasets` above.
|
|
2119
|
+
|
|
2120
|
+
# If false, the datasets will not be shuffled and will keep their original order in
|
|
2121
|
+
# `datasets`. The same applies to the `test_datasets` option and the
|
|
2122
|
+
# `pretraining_dataset` option. Default is true.
|
|
2123
|
+
shuffle_merged_datasets: bool | None = True
|
|
2124
|
+
# If true, each dataset in `datasets` will be shuffled before merging. This allows
|
|
2125
|
+
# curriculum learning strategies to be applied at the dataset level. Default is false.
|
|
2126
|
+
shuffle_before_merging_datasets: bool | None = False
|
|
2127
|
+
# Axolotl attempts to save the dataset as an arrow after packing the data together so
|
|
2128
|
+
# subsequent training attempts load faster, relative path
|
|
2129
|
+
dataset_prepared_path: str | None
|
|
2130
|
+
# Num shards for whole dataset
|
|
2131
|
+
dataset_shard_num: int | None
|
|
2132
|
+
# Index of shard to use for whole dataset
|
|
2133
|
+
dataset_shard_idx: int | None
|
|
2134
|
+
skip_prepare_dataset: bool | None = False
|
|
2135
|
+
# Number of shards to save the prepared dataset
|
|
2136
|
+
num_dataset_shards_to_save: int | None
|
|
2137
|
+
|
|
2138
|
+
# Set to HF dataset for type: 'completion' for streaming instead of pre-tokenize
|
|
2139
|
+
pretraining_dataset: Annotated[list[PretrainingDataset | SFTDataset], MinLen(1)] | None
|
|
2140
|
+
# For PretrainingDataset:
|
|
2141
|
+
name: str | None
|
|
2142
|
+
path: str | None
|
|
2143
|
+
split: str | None = train
|
|
2144
|
+
text_column: str | None = text
|
|
2145
|
+
type: str | None = pretrain
|
|
2146
|
+
trust_remote_code: bool | None = False
|
|
2147
|
+
data_files: str | None
|
|
2148
|
+
skip: int | None
|
|
2149
|
+
|
|
2150
|
+
# For SFTDataset:
|
|
2151
|
+
# HuggingFace dataset repo | s3:// | gs:// | path to local file or directory
|
|
2152
|
+
path: str | None
|
|
2153
|
+
# name of dataset split to load from
|
|
2154
|
+
split: str | None
|
|
2155
|
+
# The type of prompt to use for training. [alpaca, gpteacher, oasst, reflection]
|
|
2156
|
+
type: str | UserDefinedPrompterType | None
|
|
2157
|
+
# For UserDefinedPrompterType:
|
|
2158
|
+
# Custom user instruction prompt
|
|
2159
|
+
system_prompt: str | None
|
|
2160
|
+
# Use {system} as key to be replaced
|
|
2161
|
+
system_format: str | None
|
|
2162
|
+
field_system: str | None
|
|
2163
|
+
field_instruction: str | None
|
|
2164
|
+
field_input: str | None
|
|
2165
|
+
field_output: str | None
|
|
2166
|
+
|
|
2167
|
+
# Customizable to be single line or multi-line. Use {instruction}/{input} as key to
|
|
2168
|
+
# be replaced. 'format' can include {input}
|
|
2169
|
+
format: str | None
|
|
2170
|
+
# 'no_input_format' cannot include {input}
|
|
2171
|
+
no_input_format: str | None
|
|
2172
|
+
input_transform: str | None
|
|
2173
|
+
# split dataset into N pieces (use with shards_idx)
|
|
2174
|
+
shards: int | None
|
|
2175
|
+
# the index of sharded dataset to use
|
|
2176
|
+
shards_idx: int | None
|
|
2177
|
+
# process dataset in N sequential chunks for memory efficiency (exclusive with
|
|
2178
|
+
# `shards`)
|
|
2179
|
+
preprocess_shards: int | None
|
|
2180
|
+
conversation: str | None
|
|
2181
|
+
|
|
2182
|
+
# The name of the chat template to use for training, following values are supported:
|
|
2183
|
+
# tokenizer_default: Uses the chat template that is available in the
|
|
2184
|
+
# tokenizer_config.json. If the chat template is not available in the tokenizer, it
|
|
2185
|
+
# will raise an error. This is the default.
|
|
2186
|
+
# alpaca/inst/chatml/gemma/cohere/llama3/phi_3/deepseek_v2/jamba: These chat templates
|
|
2187
|
+
# are available in the axolotl codebase at src/axolotl/utils/chat_templates.py.
|
|
2188
|
+
# tokenizer_default_fallback_*: where * is the name of the chat template to fallback
|
|
2189
|
+
# to if the tokenizer does not have a chat template else default to tokenizer. E.g.
|
|
2190
|
+
# tokenizer_default_fallback_chatml. jinja: Uses a custom jinja template for the chat
|
|
2191
|
+
# template. The custom jinja template should be provided in the chat_template_jinja
|
|
2192
|
+
# field.
|
|
2193
|
+
chat_template: ChatTemplate | str | None
|
|
2194
|
+
# Custom jinja chat template or path to jinja file. Used only if `chat_template:
|
|
2195
|
+
# jinja` or empty.
|
|
2196
|
+
chat_template_jinja: str | None
|
|
2197
|
+
# path to source data files
|
|
2198
|
+
data_files: str | list[str] | None
|
|
2199
|
+
input_format: str | None
|
|
2200
|
+
# name of dataset configuration to load
|
|
2201
|
+
name: str | None
|
|
2202
|
+
# defines the datatype when path is a file
|
|
2203
|
+
ds_type: str | None
|
|
2204
|
+
# For `completion` datasets only, uses the provided field instead of `text` column
|
|
2205
|
+
field: str | None
|
|
2206
|
+
field_human: str | None
|
|
2207
|
+
field_model: str | None
|
|
2208
|
+
# Key containing the messages (default: "messages")
|
|
2209
|
+
field_messages: str | None
|
|
2210
|
+
# Key containing the tools (default: "tools"). Must be a list[dict] and follow [JSON
|
|
2211
|
+
# schema](https://json-schema.org/learn/getting-started-step-by-step).
|
|
2212
|
+
field_tools: str | None
|
|
2213
|
+
# Key containing the reasoning trace (default: "reasoning_content").
|
|
2214
|
+
field_thinking: str | None
|
|
2215
|
+
# The key the chat template expects that indicates the reasoning trace.
|
|
2216
|
+
template_thinking_key: str | None
|
|
2217
|
+
|
|
2218
|
+
message_field_role: str | None
|
|
2219
|
+
|
|
2220
|
+
message_field_content: str | None
|
|
2221
|
+
# Mapping of properties from the input dataset to the chat template. (default:
|
|
2222
|
+
# message_property_mappings={'role':'role', 'content':'content'}) If a property exists
|
|
2223
|
+
# in the template but not in this mapping, the system will attempt to load it directly
|
|
2224
|
+
# from the message using the property name as the key. Example: In the mapping below,
|
|
2225
|
+
# 'from' is loaded from input dataset and used as 'role', while 'value' is loaded and
|
|
2226
|
+
# used as 'content' in the chat template.
|
|
2227
|
+
message_property_mappings: dict[str, str] | None
|
|
2228
|
+
# The key in the message turn that indicates via boolean whether tokens of a turn
|
|
2229
|
+
# should be considered for training. Useful to selectively train on certain turns
|
|
2230
|
+
# besides the `roles_to_train`.
|
|
2231
|
+
message_field_training: str | None
|
|
2232
|
+
# The key in the message turn that contains the training details. Useful to
|
|
2233
|
+
# selectively train on certain tokens in a turn. The value of the key is a List[Dict]
|
|
2234
|
+
# containing `begin_offset` (start character index in content), `end_offset` (end
|
|
2235
|
+
# character index in content), and `train` (boolean whether to train).
|
|
2236
|
+
message_field_training_detail: str | None
|
|
2237
|
+
# (for Qwen3 template only) Whether to split the assistant content based on a
|
|
2238
|
+
# reasoning trace inside delimited tags
|
|
2239
|
+
split_thinking: bool | None
|
|
2240
|
+
logprobs_field: str | None
|
|
2241
|
+
temperature: float | None
|
|
2242
|
+
# Roles to train on. The tokens from these roles will be considered for the loss.
|
|
2243
|
+
roles_to_train: list[str] | None
|
|
2244
|
+
# Which EOS tokens to train on in the conversation. Possible values are: all: train on
|
|
2245
|
+
# all EOS tokens, turn (default): train on the EOS token at the end of each trainable
|
|
2246
|
+
# turn, last: train on the last EOS token in the conversation
|
|
2247
|
+
train_on_eos: Literal['all', 'turn', 'last'] | None
|
|
2248
|
+
# Roles mapping in the messages. The format is {target_role: [source_roles]}. All
|
|
2249
|
+
# source roles will be mapped to the target role. The default is: user: ["human",
|
|
2250
|
+
# "user"], assistant: ["gpt", "assistant"], system: ["system"], tool: ["tool"]
|
|
2251
|
+
roles: dict[str, list[str]] | None
|
|
2252
|
+
# Whether to drop the system turn from the dataset. Only works with chat_template.
|
|
2253
|
+
# This does not drop the default system message from chat_template if it exists. If
|
|
2254
|
+
# you wish to, we recommend using a custom jinja template with the default system
|
|
2255
|
+
# message removed or adding a system turn with empty content.
|
|
2256
|
+
drop_system_message: bool | None
|
|
2257
|
+
# Trust remote code for untrusted source
|
|
2258
|
+
trust_remote_code: bool | None = False
|
|
2259
|
+
# The specific revision of the dataset to use when loading from the Hugging Face Hub.
|
|
2260
|
+
# This can be a commit hash, tag, or branch name. If not specified, the latest version
|
|
2261
|
+
# will be used. This parameter is ignored for local datasets.
|
|
2262
|
+
revision: str | None
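As one possible use of the SFTDataset chat-template fields above, a conversation dataset whose turns live under non-default keys might be configured like this sketch (the dataset path and column names are illustrative, not from this reference):

```yaml
chat_template: chatml
datasets:
  - path: my-org/conversations       # hypothetical dataset
    type: chat_template
    field_messages: dialogue         # messages live under "dialogue", not "messages"
    message_property_mappings:
      role: from                     # "from" in the data is used as "role"
      content: value                 # "value" in the data is used as "content"
    roles_to_train: ["assistant"]
    train_on_eos: turn
```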

# The maximum number of processes to use while preprocessing your input dataset. This
# defaults to `os.cpu_count()` if not set. For Runpod VMs, it will default to number of
# vCPUs via RUNPOD_CPU_COUNT.
dataset_processes: int | None
# The maximum number of processes to use while preprocessing your input dataset. This
# defaults to `os.cpu_count()` if not set. For Runpod VMs, it will default to number of
# vCPUs via RUNPOD_CPU_COUNT.
dataset_num_proc: int | None

# Deduplicates datasets and test_datasets with identical entries
dataset_exact_deduplication: bool | None
# Keep dataset in memory while preprocessing. Only needed if cached dataset is taking
# too much storage
dataset_keep_in_memory: bool | None
dataloader_pin_memory: bool | None
dataloader_num_workers: int | None
dataloader_prefetch_factor: int | None
dataloader_drop_last: bool | None

accelerator_config: dict[str, Any] | None

remove_unused_columns: bool | None

# Push prepared dataset to hub - repo_org/repo_name
push_dataset_to_hub: str | None
# Whether to use hf `use_auth_token` for loading datasets. Useful for fetching private
# datasets. Required to be true when used in combination with `push_dataset_to_hub`
hf_use_auth_token: bool | None

device: Any | None
# Passed through to transformers when loading the model when launched without
# accelerate. Use `sequential` when training w/ model parallelism to limit memory
device_map: Any | None
world_size: int | None
# Don't mess with this, it's here for accelerate and torchrun
local_rank: int | None
ddp: bool | None

# Seed for reproducibility
seed: int | None
# Advanced DDP Arguments - timeout
ddp_timeout: int | None
# Advanced DDP Arguments - bucket cap in MB
ddp_bucket_cap_mb: int | None
# Advanced DDP Arguments - broadcast buffers
ddp_broadcast_buffers: bool | None
ddp_find_unused_parameters: bool | None

# Approximate number of predictions sent to wandb depending on batch size. Enabled above
# 0. Default is 0
eval_table_size: int | None
# Total number of tokens generated for predictions sent to wandb. Default is 128
eval_max_new_tokens: int | None
# Whether to run causal language model evaluation for metrics in
# `eval_causal_lm_metrics`
do_causal_lm_eval: bool | None
# HF evaluate metrics used during evaluation. Default is ['sacrebleu', 'comet', 'ter',
# 'chrf', 'perplexity']
eval_causal_lm_metrics: list[str] | None
do_bench_eval: bool | None
bench_dataset: str | None
bench_split: str | None
metric_for_best_model: str | None
greater_is_better: bool | None

# High loss value, indicating the learning has broken down (a good estimate is ~2 times
# the loss at the start of training)
loss_watchdog_threshold: float | None
# Number of high-loss steps in a row before the trainer aborts (default: 3)
loss_watchdog_patience: int | None

# Run garbage collection every `gc_steps` steps. -1 will run on epoch end and before
# evaluations. Default is 0 (disabled).
gc_steps: int | None

# Use CUDA bf16. bool or 'full' for `bf16_full_eval`, or 'auto' for automatic detection.
# require >=ampere
bf16: Literal['auto'] | bool | None = auto
# Use CUDA fp16
fp16: bool | None
# Enable FP8 mixed precision training using TorchAO. Best used in combination with
# torch.compile.
fp8: bool | None
# Enable FSDP float8 all-gather optimization for FP8 training. Can improve training
# speed by 10-15% when FSDP is enabled.
fp8_enable_fsdp_float8_all_gather: bool | None
# No AMP (automatic mixed precision) - require >=ampere
bfloat16: bool | None
# No AMP (automatic mixed precision)
float16: bool | None
# Use CUDA tf32 - require >=ampere
tf32: bool | None
float32: bool | None
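On Ampere-or-newer GPUs, the precision flags above are commonly combined like this (a sketch, not a recommendation from this reference):

```yaml
bf16: auto     # bfloat16 mixed precision when the hardware supports it
tf32: true     # allow TF32 matmuls on >=Ampere
fp16: false    # leave fp16 off when bf16 is available
```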

# Whether to use gradient checkpointing. Available options are: true, false, 'offload',
# 'offload_disk'.
# https://huggingface.co/docs/transformers/v4.18.0/en/performance#gradient-checkpointing
gradient_checkpointing: Literal['offload', 'offload_disk'] | bool | None = False
# Additional kwargs to pass to the trainer for gradient checkpointing
gradient_checkpointing_kwargs: dict[str, Any] | None
# Whether to offload activations. Available options are: true, false, 'legacy', 'disk'.
activation_offloading: Literal['legacy', 'disk'] | bool | None = False

unfrozen_parameters: list[str] | None

# The maximum length of an input to train with, this should typically be less than 2048
# as most models have a token/context limit of 2048
sequence_len: int = 512
# What to do when a tokenized row exceeds sequence_len. 'drop' removes the row;
# 'truncate' slices tensors to sequence_len. Defaults to 'drop' for backward
# compatibility.
excess_length_strategy: Literal['drop', 'truncate'] | None
# The maximum length of an input for evaluation. If not specified, defaults to
# sequence_len
eval_sequence_len: int | None
min_sample_len: int | None
# maximum prompt length for RL training
max_prompt_len: int | None
# Use efficient multi-packing with block diagonal attention and per sequence
# position_ids. Recommend set to 'true'
sample_packing: bool | None
# The number of samples packed at a time. Increasing the following values helps with
# packing, but usually only slightly (<1%).
sample_packing_group_size: int | None = 100000
# The number of samples which can be packed into one sequence. Increase if using a large
# sequence_len with many short samples.
sample_packing_bin_size: int | None = 200
# Whether to pack samples sequentially
sample_packing_sequentially: bool | None
# The multiprocessing start method to use for packing. Should be 'fork', 'spawn' or
# 'forkserver'
sample_packing_mp_start_method: str | None
# Set to 'false' if getting errors during eval with sample_packing on
eval_sample_packing: bool | None
# Pad inputs so each step uses constant sized buffers. This will reduce memory
# fragmentation and may prevent OOMs, by re-using memory more efficiently. Defaults to
# True if `sample_packing` enabled
pad_to_sequence_len: bool | None
# Whether to use sequential sampling for curriculum learning
curriculum_sampling: bool | None
multipack_real_batches: bool | None

# Use batch flattening for speedups when not using sample_packing
batch_flattening: Literal['auto'] | bool | None
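A typical way to enable the packing options above for datasets with many short samples (the values here are illustrative, not defaults from this reference):

```yaml
sequence_len: 4096
sample_packing: true          # multipack with block-diagonal attention
eval_sample_packing: false    # disable if eval errors occur with packing on
pad_to_sequence_len: true     # constant-size buffers reduce fragmentation
sample_packing_bin_size: 200  # raise if many short samples fit per sequence
```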

use_pose: bool | None
pose_split_on_token_ids: list[int] | None
pose_max_context_len: int | None
pose_num_chunks: int | None

pretrain_multipack_buffer_size: int | None
# whether to prevent cross attention for packed sequences during pretraining
pretrain_multipack_attn: bool | None = True
# whether to concatenate samples during pretraining
pretraining_sample_concatenation: bool | None

# Use streaming mode for loading datasets
streaming: bool | None
# Buffer size for multipack streaming datasets
streaming_multipack_buffer_size: int | None = 10000

# Whether to use xformers attention patch https://github.com/facebookresearch/xformers
xformers_attention: bool | None
# Whether to use scaled-dot-product attention https://pytorch.org/docs/stable/generated/
# torch.nn.functional.scaled_dot_product_attention.html
sdp_attention: bool | None
# Shifted-sparse attention (only llama) - https://arxiv.org/pdf/2309.12307.pdf
s2_attention: bool | None
flex_attention: bool | None
flex_attn_compile_kwargs: dict[str, Any] | None
# Whether to use flash attention patch https://github.com/Dao-AILab/flash-attention
flash_attention: bool | None
# Whether to use flash-attention cross entropy implementation - advanced use only
flash_attn_cross_entropy: bool | None
# Whether to use flash-attention rms norm implementation - advanced use only
flash_attn_rms_norm: bool | None
# Whether to fuse part of the MLP into a single operation
flash_attn_fuse_mlp: bool | None
# Whether to use bettertransformers
flash_optimum: bool | None

eager_attention: bool | None

# Specify a custom attention implementation, used mostly for kernels.
attn_implementation: str | None

unsloth_cross_entropy_loss: bool | None
unsloth_lora_mlp: bool | None
unsloth_lora_qkv: bool | None
unsloth_lora_o: bool | None
unsloth_rms_norm: bool | None
unsloth_rope: bool | None

# Apply custom LoRA autograd functions and activation function Triton kernels for speed
# and memory savings. See: https://docs.axolotl.ai/docs/lora_optims.html
lora_mlp_kernel: bool | None
# Apply custom LoRA autograd functions and activation function Triton kernels for speed
# and memory savings. See: https://docs.axolotl.ai/docs/lora_optims.html
lora_qkv_kernel: bool | None
# Apply custom LoRA autograd functions and activation function Triton kernels for speed
# and memory savings. See: https://docs.axolotl.ai/docs/lora_optims.html
lora_o_kernel: bool | None

# Whether to use chunked cross entropy loss for memory efficiency
chunked_cross_entropy: bool | None
# Number of chunks to use for chunked cross entropy loss
chunked_cross_entropy_num_chunks: int | None

# Whether to use ALST tiled mlp for memory efficient long context
tiled_mlp: bool | None

# Number of shards to use for ALST tiled mlp. If unset, it will be set based on
# seqlen/hidden_size
tiled_mlp_num_shards: int | None

# Whether to use original mlp for ALST tiled mlp. Otherwise uses a generic MLP based on
# llama.
tiled_mlp_use_original_mlp: bool | None = True

llama4_linearized_experts: bool | None

# Deepspeed config path. e.g., deepspeed_configs/zero3.json
deepspeed: str | dict[str, Any] | None
# Whether to use deepcompile for faster training with deepspeed
deepcompile: bool | None
# FSDP configuration
fsdp: list[str] | None

# FSDP configuration options
fsdp_config: FSDPConfig | None
# For FSDPConfig:
# Enable activation checkpointing to reduce memory usage during forward passes
activation_checkpointing: bool | None
# Offload parameters to CPU to reduce GPU memory usage
offload_params: bool | None
# Synchronize module states across all processes
sync_module_states: bool | None
# Enable CPU RAM efficient loading to reduce memory usage during model loading
cpu_ram_efficient_loading: bool | None
# Disabling this enables swap memory usage for resource-constrained setups when
# offload_params is enabled.
cpu_offload_pin_memory: bool | None
# Use original parameters instead of flattened parameters
use_orig_params: bool | None

# Type of state dict to use for saving/loading checkpoints
state_dict_type: Literal['FULL_STATE_DICT', 'LOCAL_STATE_DICT', 'SHARDED_STATE_DICT'] | None
# Final state dict type to use after training completion
final_state_dict_type: Literal['FULL_STATE_DICT', 'LOCAL_STATE_DICT', 'SHARDED_STATE_DICT'] | None

# Policy for automatically wrapping modules with FSDP
auto_wrap_policy: Literal['TRANSFORMER_BASED_WRAP', 'SIZE_BASED_WRAP'] | None
# Class name of transformer layers to wrap (e.g., 'LlamaDecoderLayer')
transformer_layer_cls_to_wrap: str | None

# Reshard parameters after forward pass to save memory
reshard_after_forward: bool | None
# Mixed precision policy for FSDP (e.g., 'fp16', 'bf16')
mixed_precision_policy: str | None

# FSDP version
fsdp_version: int | None
fsdp_final_state_dict_type: Literal['FULL_STATE_DICT', 'LOCAL_STATE_DICT', 'SHARDED_STATE_DICT'] | None
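The FSDPConfig fields above nest under `fsdp_config` in a training config. A hedged sketch for a Llama-style model (the layer class name is model-specific, and exact key placement may vary by axolotl version):

```yaml
fsdp_config:
  fsdp_version: 2
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: LlamaDecoderLayer  # model-specific class name
  state_dict_type: SHARDED_STATE_DICT               # shard checkpoints across ranks
  final_state_dict_type: FULL_STATE_DICT            # merge into one file at the end
  cpu_ram_efficient_loading: true
  reshard_after_forward: true
```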

# How much of the dataset to set aside as evaluation. 1 = 100%, 0.50 = 50%, etc. 0 for
# no eval.
val_set_size: float | None = 0.0

# Number of devices to shard across. If not set, will use all available devices.
dp_shard_size: int | None
# Number of devices to replicate across.
dp_replicate_size: int | None
# Deprecated: use `context_parallel_size` instead
sequence_parallel_degree: int | None
# Set to a divisor of the number of GPUs available to split sequences into chunks of
# equal size. Use in long context training to prevent OOM when sequences cannot fit into
# a single GPU's VRAM. E.g., if 4 GPUs are available, set this value to 2 to split each
# sequence into two equal-sized subsequences, or set to 4 to split into four equal-sized
# subsequences. See https://docs.axolotl.ai/docs/sequence_parallelism.html for more
# details.
context_parallel_size: int | None
# Optional; strides across the key dimension. Larger values use more memory but should
# make training faster. Must evenly divide the number of KV heads in your model.
heads_k_stride: int | None
# One of 'varlen_llama3', 'batch_ring', 'batch_zigzag', 'batch_stripe'. Defaults to
# 'varlen_llama3' in the sample packing case, and 'batch_ring' in the non-sample packing
# case.
ring_attn_func: RingAttnFunc | None
# Number of tensor parallel processes in TP group. Only supported with DeepSpeed AutoTP.
tensor_parallel_size: int | None
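For a long-context run on, say, 4 GPUs, the parallelism fields above compose like this sketch (the parallel sizes should multiply to the GPU count; values are illustrative):

```yaml
# 4 GPUs total: 2-way context parallelism x 2-way data-parallel sharding
context_parallel_size: 2   # each sequence is split into 2 equal chunks
dp_shard_size: 2
sequence_len: 32768
```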

# Add or change special tokens. If you add tokens here, you don't need to add them to
# the `tokens` list.
special_tokens: SpecialTokensConfig | None
# For SpecialTokensConfig:
bos_token: str | None
eos_token: str | None
pad_token: str | None
unk_token: str | None
additional_special_tokens: list[str] | None

# Add extra tokens to the tokenizer
tokens: list[str] | None
# Mapping token_id to new_token_string to override reserved added_tokens in the
# tokenizer. Only works for tokens that are not part of the base vocab (aka are
# added_tokens). Can be checked if they exist in tokenizer.json added_tokens.
added_tokens_overrides: dict[int, str] | None
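A sketch of how the token fields above are used together (the token strings are examples, not values from this reference):

```yaml
special_tokens:
  pad_token: "<|pad|>"        # e.g. for models that ship without a pad token
tokens:
  - "<|tool_call|>"           # extra tokens added to the tokenizer vocab
  - "<|tool_response|>"
```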

# Whether to use torch.compile and which backend to use. setting to `auto` will enable
# torch compile when torch>=2.6.0
torch_compile: Literal['auto'] | bool | None
# Backend to use for torch.compile
torch_compile_backend: str | None
torch_compile_mode: Literal['default', 'reduce-overhead', 'max-autotune'] | None

# Maximum number of iterations to train for. It precedes num_epochs which means that if
# both are set, num_epochs will not be guaranteed. e.g., when 1 epoch is 1000 steps =>
# `num_epochs: 2` and `max_steps: 100` will train for 100 steps
max_steps: int | None
# Number of warmup steps. Cannot use with warmup_ratio
warmup_steps: int | None
# Warmup ratio. Cannot use with warmup_steps
warmup_ratio: float | None
# Leave empty to eval at each epoch, integer for every N steps. float for fraction of
# total steps
eval_steps: int | float | None
# Number of times per epoch to run evals, mutually exclusive with eval_steps
evals_per_epoch: int | None
# Set to `no` to skip evaluation, `epoch` at end of each epoch, leave empty to infer
# from `eval_steps`
eval_strategy: str | None

# Leave empty to save at each epoch, integer for every N steps. float for fraction of
# total steps
save_steps: int | float | None
# Number of times per epoch to save a checkpoint, mutually exclusive with save_steps
saves_per_epoch: int | None
# Set to `no` to skip checkpoint saves, `epoch` at end of each epoch, `best` when better
# result is achieved, leave empty to infer from `save_steps`
save_strategy: str | None
# Maximum number of checkpoints kept at a time
save_total_limit: int | None
# Whether to checkpoint a model after the first step of training. Defaults to False.
save_first_step: bool | None
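The evaluation and checkpointing cadence fields above are mutually exclusive in pairs (`eval_steps` vs `evals_per_epoch`, `save_steps` vs `saves_per_epoch`). One internally consistent combination, as a sketch:

```yaml
val_set_size: 0.05
evals_per_epoch: 4        # do not also set eval_steps
saves_per_epoch: 1        # do not also set save_steps
save_total_limit: 3       # keep only the 3 most recent checkpoints
```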

# Logging frequency
logging_steps: int | None
# Stop training after this many evaluation losses have increased in a row.
# https://huggingface.co/transformers/v4.2.2/_modules/transformers/trainer_callback.html#EarlyStoppingCallback
early_stopping_patience: int | None
load_best_model_at_end: bool | None = False
# Save only the model weights, skipping the optimizer. Using this means you can't resume
# from checkpoints.
save_only_model: bool | None = False
# Use tensorboard for logging
use_tensorboard: bool | None
# Enable the pytorch profiler to capture the first N steps of training to the
# output_dir. see https://pytorch.org/blog/understanding-gpu-memory-1/ for more
# information. Snapshots can be visualized @ https://pytorch.org/memory_viz
profiler_steps: int | None
# Which step to start the profiler at. Useful for only capturing a few steps mid-run.
profiler_steps_start: int | None = 0
# bool of whether to report tokens per second at the end of training. This is not
# supported with pre-training datasets.
include_tokens_per_second: bool | None
# bool of whether to report tokens per second per-gpu during training by measuring
# throughput of non-padding tokens.
include_tkps: bool | None = True
# NEFT https://arxiv.org/abs/2310.05914, set this to a number (paper default is 5) to
# add noise to embeddings. Currently only supported on Llama and Mistral
neftune_noise_alpha: float | None

# Parameter controlling the relative ratio loss weight in the ORPO loss. Passed to
# `beta` in `ORPOConfig` due to trl mapping.
orpo_alpha: float | None
# Weighting of NLL term in loss from RPO paper
rpo_alpha: float | None
# Target reward margin for the SimPO loss
simpo_gamma: float | None
# Weight of the BC regularizer
cpo_alpha: float | None

# Factor for desirable loss term in KTO loss
kto_desirable_weight: float | None
# Factor for undesirable loss term in KTO loss
kto_undesirable_weight: float | None
# The beta parameter for the RL training
rl_beta: float | None

# Defines the max memory usage per gpu on the system. Passed through to transformers
# when loading the model.
max_memory: dict[int | Literal['cpu', 'disk'], int | str] | None
# Limit the memory for all available GPUs to this amount (if an integer, expressed in
# gigabytes); default: unset
gpu_memory_limit: int | str | None
# Whether to use low_cpu_mem_usage
low_cpu_mem_usage: bool | None

# The name of the chat template to use for training, following values are supported:
# tokenizer_default: Uses the chat template that is available in the
# tokenizer_config.json. If the chat template is not available in the tokenizer, it will
# raise an error. This is the default value.
# alpaca/inst/chatml/gemma/cohere/llama3/phi_3/deepseek_v2/jamba: These chat templates
# are available in the axolotl codebase at src/axolotl/utils/chat_templates.py.
# tokenizer_default_fallback_*: where * is the name of the chat template to fallback to.
# E.g. tokenizer_default_fallback_chatml. This is useful when the chat template is not
# available in the tokenizer. jinja: Uses a custom jinja template for the chat template.
# The custom jinja template should be provided in the chat_template_jinja field. The
# selected chat template will be saved to the tokenizer_config.json for easier
# inferencing
chat_template: ChatTemplate | Annotated[str, StringConstraints(pattern='^tokenizer_default_fallback_')] | None
# Custom jinja template or path to jinja file for chat template. This will be only used
# if chat_template is set to `jinja` or `null` (in which case chat_template is
# automatically set to `jinja`). Default is null.
chat_template_jinja: str | None
# Additional kwargs to pass to the chat template. This is useful for customizing the
# chat template. For example, you can pass `thinking=False` to add a generation prompt
# to the chat template.
chat_template_kwargs: dict[str, Any] | None
# Custom EOT (End-of-Turn) tokens to mask/unmask during training. These tokens mark the
# boundaries between conversation turns. For example: ['/INST', '</s>',
# '[/SYSTEM_PROMPT]']. If not specified, defaults to just the model's eos_token. This is
# useful for templates that use multiple delimiter tokens.
eot_tokens: list[str] | None
# Changes the default system message. Currently only supports chatml.
default_system_message: str | None

# Token index or indices to adjust embedding weights to the mean of the other tokens.
# This is useful when the model has untrained embeddings.
fix_untrained_tokens: int | list[int] | None

is_preprocess: bool | None
preprocess_iterable: bool | None

# Total number of tokens - internal use
total_num_tokens: int | None
|
|
2701
|
+
total_supervised_tokens: int | None
|
|
2702
|
+
# You can set these packing optimizations AFTER starting a training at least once. The
|
|
2703
|
+
# trainer will provide recommended values for these values.
|
|
2704
|
+
sample_packing_eff_est: float | None
|
|
2705
|
+
axolotl_config_path: str | None
|
|
2706
|
+
|
|
2707
|
+
# Internal use only - Used to identify which the model is based on
|
|
2708
|
+
is_falcon_derived_model: bool | None
|
|
2709
|
+
# Internal use only - Used to identify which the model is based on
|
|
2710
|
+
is_llama_derived_model: bool | None
|
|
2711
|
+
# Internal use only - Used to identify which the model is based on. Please note that if
|
|
2712
|
+
# you set this to true, `padding_side` will be set to 'left' by default
|
|
2713
|
+
is_mistral_derived_model: bool | None
|
|
2714
|
+
# Internal use only - Used to identify which the model is based on
|
|
2715
|
+
is_qwen_derived_model: bool | None
|
|
2716
|
+
|
|
2717
|
+
# Add plugins to extend the pipeline. See `src/axolotl/integrations` for the available
|
|
2718
|
+
# plugins or doc below for more details.
|
|
2719
|
+
# https://docs.axolotl.ai/docs/custom_integrations.html
|
|
2720
|
+
plugins: list[str] | None
|
|
2721
|
+
|
|
2722
|
+
# This is the huggingface model that contains *.pt, *.safetensors, or *.bin files. This
|
|
2723
|
+
# can also be a relative path to a model on disk
|
|
2724
|
+
base_model: str (required)
|
|
2725
|
+
# If the base_model repo on hf hub doesn't include configuration .json files, You can
|
|
2726
|
+
# set that here, or leave this empty to default to base_model
|
|
2727
|
+
base_model_config: str | None
|
|
2728
|
+
cls_model_config: str | None
|
|
2729
|
+
# Optional tokenizer configuration path in case you want to use a different tokenizer
|
|
2730
|
+
# than the one defined in the base model
|
|
2731
|
+
tokenizer_config: str | None
|
|
2732
|
+
# use_fast option for tokenizer loading from_pretrained, default to True
|
|
2733
|
+
tokenizer_use_fast: bool | None
|
|
2734
|
+
# Whether to use the legacy tokenizer setting, defaults to True
|
|
2735
|
+
tokenizer_legacy: bool | None
|
|
2736
|
+
# Whether to use mistral-common tokenizer. If set to True, it will use the mistral-
|
|
2737
|
+
# common tokenizer.
|
|
2738
|
+
tokenizer_use_mistral_common: bool | None
|
|
2739
|
+
# Corresponding tokenizer for the model AutoTokenizer is a good choice
|
|
2740
|
+
tokenizer_type: str | None
|
|
2741
|
+
# transformers processor class
|
|
2742
|
+
processor_type: str | None
|
|
2743
|
+
# Whether to save jinja files for tokenizer, transformers default is True
|
|
2744
|
+
tokenizer_save_jinja_files: bool | None = True
|
|
2745
|
+
# Trust remote code for untrusted source
|
|
2746
|
+
trust_remote_code: bool | None
|
|
2747
|
+
|
|
2748
|
+
# Don't move the model to the device before sharding. Set to `false` to revert to legacy
|
|
2749
|
+
# behavior.
|
|
2750
|
+
experimental_skip_move_to_device: bool | None = True
|
|
2751
|
+
|
|
2752
|
+
# Use custom kernels, e.g. MegaBlocks.
|
|
2753
|
+
use_kernels: bool | None
|
|
2754
|
+
|
|
2755
|
+
# Model loading quantization config
|
|
2756
|
+
model_quantization_config: Literal['Mxfp4Config'] | None
|
|
2757
|
+
# kwargs for model quantization config
|
|
2758
|
+
model_quantization_config_kwargs: dict[str, Any] | None
|
|
2759
|
+
|
|
2760
|
+
# Where to save the full-finetuned model to
|
|
2761
|
+
output_dir: str = ./model-out
|
|
2762
|
+
# push checkpoints to hub
|
|
2763
|
+
hub_model_id: str | None
|
|
2764
|
+
# how to push checkpoints to hub
|
|
2765
|
+
hub_strategy: str | None
|
|
2766
|
+
# Save model as safetensors (require safetensors package). Default True
|
|
2767
|
+
save_safetensors: bool | None = True
|
|
2768
|
+
|
|
2769
|
+
# This will attempt to quantize the model down to 8 bits and use adam 8 bit optimizer
|
|
2770
|
+
load_in_8bit: bool | None = False
|
|
2771
|
+
# Use bitsandbytes 4 bit
|
|
2772
|
+
load_in_4bit: bool | None = False
|
|
2773
|
+
|
|
2774
|
+
# If you want to use 'lora' or 'qlora' or leave blank to train all parameters in
|
|
2775
|
+
# original model
|
|
2776
|
+
adapter: str | None
|
|
2777
|
+
# If you already have a lora model trained that you want to load, put that here. This
|
|
2778
|
+
# means after training, if you want to test the model, you should set this to the value
|
|
2779
|
+
# of `output_dir`. Note that if you merge an adapter to the base model, a new
|
|
2780
|
+
# subdirectory `merged` will be created under the `output_dir`.
|
|
2781
|
+
lora_model_dir: str | None
|
|
2782
|
+
lora_r: int | None
|
|
2783
|
+
lora_alpha: int | None
|
|
2784
|
+
lora_fan_in_fan_out: bool | None
|
|
2785
|
+
lora_target_modules: str | list[str] | None
|
|
2786
|
+
lora_target_parameters: str | list[str] | None
|
|
2787
|
+
# If true, will target all linear modules
|
|
2788
|
+
lora_target_linear: bool | None
|
|
2789
|
+
# If you added new tokens to the tokenizer, you may need to save some LoRA modules
|
|
2790
|
+
# because they need to know the new tokens. For LLaMA and Mistral, you need to save
|
|
2791
|
+
# `embed_tokens` and `lm_head`. It may vary for other models. `embed_tokens` converts
|
|
2792
|
+
# tokens to embeddings, and `lm_head` converts embeddings to token probabilities.
|
|
2793
|
+
lora_modules_to_save: list[str] | None
|
|
2794
|
+
lora_dropout: float | None = 0.0
|
|
2795
|
+
# The layer indices to transform, otherwise, apply to all layers
|
|
2796
|
+
peft_layers_to_transform: list[int] | None
|
|
2797
|
+
peft_layers_pattern: list[str] | None
|
|
2798
|
+
|
|
2799
|
+
peft: PeftConfig | None
|
|
2800
|
+
# For PeftConfig:
|
|
2801
|
+
# Configuration options for loftq initialization for LoRA
|
|
2802
|
+
loftq_config: LoftQConfig | None
|
|
2803
|
+
# For LoftQConfig:
|
|
2804
|
+
# typically 4 bits
|
|
2805
|
+
loftq_bits: int = 4
|
|
2806
|
+
|
|
2807
|
+
# Whether to use DoRA.
|
|
2808
|
+
peft_use_dora: bool | None
|
|
2809
|
+
# Whether to use RSLoRA.
|
|
2810
|
+
peft_use_rslora: bool | None
|
|
2811
|
+
# List of layer indices to replicate.
|
|
2812
|
+
peft_layer_replication: list[tuple[int, int]] | None
|
|
2813
|
+
# How to initialize LoRA weights. Default to True which is MS original implementation.
|
|
2814
|
+
peft_init_lora_weights: bool | str | None
|
|
2815
|
+
# A list of token indices to fine-tune on the `embed_tokens` layer. Otherwise, a dict
|
|
2816
|
+
# mapping an embedding layer name to its trainable token indices. See
|
|
2817
|
+
# https://huggingface.co/docs/peft/v0.17.0/en/developer_guides/lora#efficiently-train-
|
|
2818
|
+
# tokens-alongside-lora
|
|
2819
|
+
peft_trainable_token_indices: list[int] | dict[str, list[int]] | None
|
|
2820
|
+
|
|
2821
|
+
# load qlora model in sharded format for FSDP using answer.ai technique.
|
|
2822
|
+
qlora_sharded_model_loading: bool | None = False
|
|
2823
|
+
# Do the LoRA/PEFT loading on CPU -- this is required if the base model is so large it
|
|
2824
|
+
# takes up most or all of the available GPU VRAM, e.g. during a model and LoRA merge
|
|
2825
|
+
lora_on_cpu: bool | None
|
|
2826
|
+
# Whether you are training a 4-bit GPTQ quantized model
|
|
2827
|
+
gptq: bool | None
|
|
2828
|
+
# optional overrides to the bnb 4bit quantization configuration
|
|
2829
|
+
bnb_config_kwargs: dict[str, Any] | None
|
|
2830
|
+
|
|
2831
|
+
# loraplus learning rate ratio lr_B / lr_A. Recommended value is 2^4.
|
|
2832
|
+
loraplus_lr_ratio: float | None
|
|
2833
|
+
# loraplus learning rate for lora embedding layers. Default value is 1e-6.
|
|
2834
|
+
loraplus_lr_embedding: float | None = 1e-06
|
|
2835
|
+
|
|
2836
|
+
merge_lora: bool | None
|
|
2837
|
+
|
|
2838
|
+
# Whether to use ReLoRA. Use with jagged_restart_*steps options.
|
|
2839
|
+
relora: bool | None
|
|
2840
|
+
# threshold for optimizer magnitude when pruning
|
|
2841
|
+
relora_prune_ratio: float | None
|
|
2842
|
+
# True to perform lora weight merges on cpu during restarts, for modest gpu memory
|
|
2843
|
+
# savings
|
|
2844
|
+
relora_cpu_offload: bool | None
|
|
2845
|
+
|
|
2846
|
+
# how often to reset for jagged restarts
|
|
2847
|
+
jagged_restart_steps: int | None
|
|
2848
|
+
# how many warmup steps to take after reset for jagged restarts
|
|
2849
|
+
jagged_restart_warmup_steps: int | None
|
|
2850
|
+
# how many anneal steps to take before reset for jagged restarts
|
|
2851
|
+
jagged_restart_anneal_steps: int | None
|
|
2852
|
+
|
|
2853
|
+
# If greater than 1, backpropagation will be skipped and the gradients will be
|
|
2854
|
+
# accumulated for the given number of steps.
|
|
2855
|
+
gradient_accumulation_steps: int | None = 1
|
|
2856
|
+
# The number of samples to include in each batch. This is the number of samples sent to
|
|
2857
|
+
# each GPU. Batch size per gpu = micro_batch_size * gradient_accumulation_steps
|
|
2858
|
+
micro_batch_size: int | None = 1
|
|
2859
|
+
# Total batch size, we do not recommended setting this manually
|
|
2860
|
+
batch_size: int | None
|
|
2861
|
+
# per gpu micro batch size for evals, defaults to value of micro_batch_size
|
|
2862
|
+
eval_batch_size: int | None
|
|
2863
|
+
|
|
2864
|
+
# whether to find batch size that fits in memory. Passed to underlying transformers
|
|
2865
|
+
# Trainer
|
|
2866
|
+
auto_find_batch_size: bool | None
|
|
2867
|
+
|
|
2868
|
+
# Whether to mask out or include the human's prompt from the training labels
|
|
2869
|
+
train_on_inputs: bool | None = False
|
|
2870
|
+
# Group similarly sized data to minimize padding. May be slower to start, as it must
|
|
2871
|
+
# download and sort the entire dataset. Note that training loss may have an oscillating
|
|
2872
|
+
# pattern with this enabled.
|
|
2873
|
+
group_by_length: bool | None
|
|
2874
|
+
|
|
2875
|
+
learning_rate: str | float (required)
|
|
2876
|
+
embedding_lr: float | None
|
|
2877
|
+
embedding_lr_scale: float | None
|
|
2878
|
+
# Specify weight decay
|
|
2879
|
+
weight_decay: float | None = 0.0
|
|
2880
|
+
# Specify optimizer
|
|
2881
|
+
optimizer: OptimizerNames | CustomSupportedOptimizers | None = OptimizerNames.ADAMW_TORCH_FUSED
|
|
2882
|
+
# Dictionary of arguments to pass to the optimizer
|
|
2883
|
+
optim_args: str | dict[str, Any] | None
|
|
2884
|
+
# The target modules to optimize, i.e. the module names that you would like to train,
|
|
2885
|
+
# right now this is used only for GaLore algorithm
|
|
2886
|
+
optim_target_modules: list[str] | Literal['all_linear'] | None
|
|
2887
|
+
# Path to torch distx for optim 'adamw_anyprecision'
|
|
2888
|
+
torchdistx_path: str | None
|
|
2889
|
+
lr_scheduler: SchedulerType | Literal['one_cycle'] | Literal['rex'] | None = SchedulerType.COSINE
|
|
2890
|
+
# Specify a scheduler and kwargs to use with the optimizer
|
|
2891
|
+
lr_scheduler_kwargs: dict[str, Any] | None
|
|
2892
|
+
lr_quadratic_warmup: bool | None
|
|
2893
|
+
# decay lr to some percentage of the peak lr, e.g. cosine_min_lr_ratio=0.1 for 10% of
|
|
2894
|
+
# peak lr
|
|
2895
|
+
cosine_min_lr_ratio: float | None
|
|
2896
|
+
# freeze lr at some percentage of the step, e.g. cosine_constant_lr_ratio=0.8 means
|
|
2897
|
+
# start cosine_min_lr at 80% of training step
|
|
2898
|
+
cosine_constant_lr_ratio: float | None
|
|
2899
|
+
# Learning rate div factor
|
|
2900
|
+
lr_div_factor: float | None
|
|
2901
|
+
|
|
2902
|
+
lr_groups: list[LrGroup] | None
|
|
2903
|
+
# For LrGroup:
|
|
2904
|
+
name: str (required)
|
|
2905
|
+
modules: list[str] (required)
|
|
2906
|
+
lr: float (required)
|
|
2907
|
+
|
|
2908
|
+
# adamw hyperparams
|
|
2909
|
+
adam_epsilon: float | None
|
|
2910
|
+
# only used for CAME Optimizer
|
|
2911
|
+
adam_epsilon2: float | None
|
|
2912
|
+
# adamw hyperparams
|
|
2913
|
+
adam_beta1: float | None
|
|
2914
|
+
# adamw hyperparams
|
|
2915
|
+
adam_beta2: float | None
|
|
2916
|
+
# only used for CAME Optimizer
|
|
2917
|
+
adam_beta3: float | None
|
|
2918
|
+
|
|
2919
|
+
# Dion Optimizer learning rate
|
|
2920
|
+
dion_lr: float | None
|
|
2921
|
+
# Dion Optimizer momentum
|
|
2922
|
+
dion_momentum: float | None
|
|
2923
|
+
# Dion Optimizer: r/d fraction for low-rank approximation. Used to compute the low-rank
|
|
2924
|
+
# dimension.
|
|
2925
|
+
dion_rank_fraction: float | None = 1.0
|
|
2926
|
+
# Dion Optimizer: Round up the low-rank dimension to a multiple of this number. This may
|
|
2927
|
+
# be useful to ensure even sharding.
|
|
2928
|
+
dion_rank_multiple_of: int | None = 1
|
|
2929
|
+
|
|
2930
|
+
# Gradient clipping max norm
|
|
2931
|
+
max_grad_norm: float | None
|
|
2932
|
+
num_epochs: float = 1.0
|
|
2933
|
+
|
|
2934
|
+
use_wandb: bool | None
|
|
2935
|
+
# Set the name of your wandb run
|
|
2936
|
+
wandb_name: str | None
|
|
2937
|
+
# Set the ID of your wandb run
|
|
2938
|
+
wandb_run_id: str | None
|
|
2939
|
+
# "offline" to save run metadata locally and not sync to the server, "disabled" to turn
|
|
2940
|
+
# off wandb
|
|
2941
|
+
wandb_mode: str | None
|
|
2942
|
+
# Your wandb project name
|
|
2943
|
+
wandb_project: str | None
|
|
2944
|
+
# A wandb Team name if using a Team
|
|
2945
|
+
wandb_entity: str | None
|
|
2946
|
+
wandb_watch: str | None
|
|
2947
|
+
# "checkpoint" to log model to wandb Artifacts every `save_steps` or "end" to log only
|
|
2948
|
+
# at the end of training
|
|
2949
|
+
wandb_log_model: str | None
|
|
2950
|
+
|
|
2951
|
+
use_mlflow: bool | None
|
|
2952
|
+
# URI to mlflow
|
|
2953
|
+
mlflow_tracking_uri: str | None
|
|
2954
|
+
# Your experiment name
|
|
2955
|
+
mlflow_experiment_name: str | None
|
|
2956
|
+
# Your run name
|
|
2957
|
+
mlflow_run_name: str | None
|
|
2958
|
+
# set to true to copy each saved checkpoint on each save to mlflow artifact registry
|
|
2959
|
+
hf_mlflow_log_artifacts: bool | None
|
|
2960
|
+
|
|
2961
|
+
# Enable or disable Comet integration.
|
|
2962
|
+
use_comet: bool | None
|
|
2963
|
+
# API key for Comet. Recommended to set via `comet login`.
|
|
2964
|
+
comet_api_key: str | None
|
|
2965
|
+
# Workspace name in Comet. Defaults to the user's default workspace.
|
|
2966
|
+
comet_workspace: str | None
|
|
2967
|
+
# Project name in Comet. Defaults to Uncategorized.
|
|
2968
|
+
comet_project_name: str | None
|
|
2969
|
+
# Identifier for the experiment. Used to append data to an existing experiment or
|
|
2970
|
+
# control the key of new experiments. Default to a random key.
|
|
2971
|
+
comet_experiment_key: str | None
|
|
2972
|
+
# Create a new experiment ("create") or log to an existing one ("get"). Default
|
|
2973
|
+
# ("get_or_create") auto-selects based on configuration.
|
|
2974
|
+
comet_mode: str | None
|
|
2975
|
+
# Set to True to log data to Comet server, or False for offline storage. Default is
|
|
2976
|
+
# True.
|
|
2977
|
+
comet_online: bool | None
|
|
2978
|
+
# Dictionary for additional configuration settings, see the doc for more details.
|
|
2979
|
+
comet_experiment_config: dict[str, Any] | None
|
|
2980
|
+
|
|
2981
|
+
# Enable OpenTelemetry metrics collection and Prometheus export
|
|
2982
|
+
use_otel_metrics: bool | None = False
|
|
2983
|
+
# Host to bind the OpenTelemetry metrics server to
|
|
2984
|
+
otel_metrics_host: str | None = localhost
|
|
2985
|
+
# Port for the Prometheus metrics HTTP server
|
|
2986
|
+
otel_metrics_port: int | None = 8000
|
|
2987
|
+
|
|
2988
|
+
# the number of activate layers in LISA
|
|
2989
|
+
lisa_n_layers: int | None
|
|
2990
|
+
# how often to switch layers in LISA
|
|
2991
|
+
lisa_step_interval: int | None
|
|
2992
|
+
# path under the model to access the layers
|
|
2993
|
+
lisa_layers_attribute: str | None = model.layers
|
|
2994
|
+
|
|
2995
|
+
gradio_title: str | None
|
|
2996
|
+
gradio_share: bool | None
|
|
2997
|
+
gradio_server_name: str | None
|
|
2998
|
+
gradio_server_port: int | None
|
|
2999
|
+
gradio_max_new_tokens: int | None
|
|
3000
|
+
gradio_temperature: float | None
|
|
3001
|
+
|
|
3002
|
+
use_ray: bool = False
|
|
3003
|
+
ray_run_name: str | None
|
|
3004
|
+
ray_num_workers: int = 1
|
|
3005
|
+
resources_per_worker: dict
|
|
3006
|
+
|
|
3007
|
+
# The size of the image to resize to. It can be an integer (resized into padded-square
|
|
3008
|
+
# image) or a tuple (width, height).If not provided, we will attempt to load from
|
|
3009
|
+
# preprocessor.size, otherwise, images won't be resized.
|
|
3010
|
+
image_size: int | tuple[int, int] | None
|
|
3011
|
+
# The resampling algorithm to use for image resizing. Default is bilinear. Please refer
|
|
3012
|
+
# to PIL.Image.Resampling for more details.
|
|
3013
|
+
image_resize_algorithm: Literal['bilinear', 'bicubic', 'lanczos'] | Resampling | None
|
|
3014
|
+
|
|
3015
|
+
# optional overrides to the base model configuration
|
|
3016
|
+
overrides_of_model_config: dict[str, Any] | None
|
|
3017
|
+
# optional overrides the base model loading from_pretrained
|
|
3018
|
+
overrides_of_model_kwargs: dict[str, Any] | None
|
|
3019
|
+
# If you want to specify the type of model to load, AutoModelForCausalLM is a good
|
|
3020
|
+
# choice too
|
|
3021
|
+
type_of_model: str | None
|
|
3022
|
+
# You can specify to choose a specific model revision from huggingface hub
|
|
3023
|
+
revision_of_model: str | None
|
|
3024
|
+
|
|
3025
|
+
max_packed_sequence_len: int | None
|
|
3026
|
+
rope_scaling: Any | None
|
|
3027
|
+
noisy_embedding_alpha: float | None
|
|
3028
|
+
dpo_beta: float | None
|
|
3029
|
+
evaluation_strategy: str | None
|
|
3030
|
+
```

---

## Axolotl

**URL:** https://docs.axolotl.ai

**Contents:**
- 🎉 Latest Updates
- ✨ Overview
- 🚀 Quick Start - LLM Fine-tuning in Minutes
- Google Colab
- Installation
- Using pip
- Using Docker
- Cloud Providers
- Your First Fine-tune
- 📚 Documentation

A Free and Open Source LLM Fine-tuning Framework

Axolotl is a free and open-source tool designed to streamline post-training and fine-tuning for the latest large language models (LLMs).

Installing with Docker can be less error-prone than installing in your own environment.

Other installation approaches are described here.

That’s it! Check out our Getting Started Guide for a more detailed walkthrough.

Contributions are welcome! Please see our Contributing Guide for details.

Interested in sponsoring? Contact us at [email protected]

If you use Axolotl in your research or projects, please cite it as follows:

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

**Examples:**

Example 1 (bash):
```bash
pip3 install -U packaging==23.2 setuptools==75.8.0 wheel ninja
pip3 install --no-build-isolation axolotl[flash-attn,deepspeed]

# Download example axolotl configs, deepspeed configs
axolotl fetch examples
axolotl fetch deepspeed_configs # OPTIONAL
```

Example 2 (bash):
```bash
docker run --gpus '"all"' --rm -it axolotlai/axolotl:main-latest
```

Example 3 (bash):
```bash
# Fetch axolotl examples
axolotl fetch examples

# Or, specify a custom path
axolotl fetch examples --dest path/to/folder

# Train a model using LoRA
axolotl train examples/llama-3/lora-1b.yml
```

Example 4 (bibtex):
```bibtex
@software{axolotl,
  title = {Axolotl: Open Source LLM Post-Training},
  author = {{Axolotl maintainers and contributors}},
  url = {https://github.com/axolotl-ai-cloud/axolotl},
  license = {Apache-2.0},
  year = {2023}
}
```

---

## Quickstart

**URL:** https://docs.axolotl.ai/docs/getting-started.html

**Contents:**
- Quickstart
- 1 Quick Example
- 2 Understanding the Process
- 2.1 The Configuration File
- 2.2 Training
- 3 Your First Custom Training
- 4 Common Tasks
- 4.1 Testing Your Model
- 4.2 Using a UI
- 4.3 Preprocessing Data

This guide will walk you through your first model fine-tuning project with Axolotl.

Let’s start by fine-tuning a small language model using LoRA. This example uses a 1B parameter model to ensure it runs on most GPUs. This assumes axolotl is installed (if not, see our Installation Guide).

That’s it! Let’s understand what just happened.

The YAML configuration file controls everything about your training. Here’s what (part of) our example config looks like:

load_in_8bit: true and adapter: lora enable LoRA adapter finetuning.

See our config options for more details.

When you run axolotl train, Axolotl:

Let’s modify the example for your own data:

This specific config is for LoRA fine-tuning a model with instruction-tuning data using the alpaca dataset format.
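The alpaca format mentioned above is commonly a JSONL file in which each line is a JSON object with `instruction`, optional `input`, and `output` fields; a minimal sketch of one record (the field values here are illustrative):

```python
import json

# One illustrative record in the alpaca instruction-tuning format
record = {
    "instruction": "Summarize the following text.",
    "input": "Axolotl streamlines post-training for large language models.",
    "output": "Axolotl makes LLM post-training easier.",
}

# Each line of a local dataset file (e.g. my_data.jsonl) is one such JSON object
line = json.dumps(record)
print(line)
```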

Please see our Dataset Formats page for more dataset formats and how to format them.

The same yaml file is used for training, inference, and merging.

After training, test your model:

More details can be found in Inference.

Launch a Gradio interface:

For large datasets, preprocess first:

Please make sure to set dataset_prepared_path: in your config to the path where the prepared dataset should be saved.

More details can be found in Dataset Preprocessing.

To merge the LoRA weights back into the base model, run:

The merged model will be saved in the {output_dir}/merged directory.

More details can be found in Merging LoRA weights.

Now that you have the basics, you might want to:

Check our other guides for details on these topics:

**Examples:**

Example 1 (bash):
```bash
axolotl fetch examples
```

Example 2 (bash):
```bash
axolotl train examples/llama-3/lora-1b.yml
```

Example 3 (yaml):
```yaml
base_model: NousResearch/Llama-3.2-1B

load_in_8bit: true
adapter: lora

datasets:
  - path: teknium/GPT4-LLM-Cleaned
    type: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.1
output_dir: ./outputs/lora-out
```

Example 4 (yaml):
```yaml
base_model: NousResearch/Nous-Hermes-llama-1b-v1

load_in_8bit: true
adapter: lora

# Training settings
micro_batch_size: 2
num_epochs: 3
learning_rate: 0.0003

# Your dataset
datasets:
  - path: my_data.jsonl # Your local data file
    type: alpaca # Or other format
```

---

## Multipack (Sample Packing)

**URL:** https://docs.axolotl.ai/docs/multipack.html

**Contents:**
- Multipack (Sample Packing)
- Visualization of Multipack with Flash Attention
- Multipack without Flash Attention

Because Flash Attention simply drops the attention mask, we do not need to construct a 4d attention mask. We only need to concatenate the sequences into a single batch and let flash attention know where each new sequence begins.

4k context, bsz=4; each character represents 256 tokens, and X represents a padding token

after padding to longest input in each step

with packing (note it’s the same effective number of tokens per step, but a true bsz of 1)

cu_seqlens: [[ 0, 11, 17, 24, 28, 36, 41, 44, 48, 51, 55, 60, 64]]
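The `cu_seqlens` above are simply cumulative sums of the packed sequence lengths (in 256-token units), with a leading 0, so Flash Attention knows where each sequence begins. A quick sketch:

```python
from itertools import accumulate

# Packed sequence lengths, in 256-token units, matching the cu_seqlens above
lengths = [11, 6, 7, 4, 8, 5, 3, 4, 3, 4, 5, 4]

# cu_seqlens = cumulative sum with a leading 0; entry i is where sequence i starts
cu_seqlens = [0] + list(accumulate(lengths))
print(cu_seqlens)  # -> [0, 11, 17, 24, 28, 36, 41, 44, 48, 51, 55, 60, 64]
```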

Multipack can still be achieved without Flash Attention, but with lower packing efficiency, as we are not able to join multiple batches into a single batch due to context length limits. Without Flash Attention, we can use either PyTorch’s Scaled Dot Product Attention implementation or the native PyTorch attention implementation along with 4d attention masks to pack sequences together and avoid cross attention.

**Examples:**

Example 1 (text):
```text
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
[[ A A A A A A A A A A A ]
   B B B B B B ]
   C C C C C C C ]
   D D D D ]]

[[ E E E E E E E E ]
 [ F F F F ]
 [ G G G ]
 [ H H H H ]]

[[ I I I ]
 [ J J J ]
 [ K K K K K ]
 [ L L L ]]
```

Example 2 (text):
```text
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
[[ A A A A A A A A A A A ]
   B B B B B B X X X X X ]
   C C C C C C C X X X X ]
   D D D D X X X X X X X ]]

[[ E E E E E E E E ]
 [ F F F F X X X X ]
 [ G G G X X X X X ]
 [ H H H H X X X X ]]

[[ I I I X X ]
 [ J J J X X ]
 [ K K K K K ]
 [ L L L X X ]]
```

Example 3 (text):
```text
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
[[ A A A A A A A A A A A B B B B B
   B C C C C C C C D D D D E E E E
   E E E E F F F F F G G G H H H H
   I I I J J J J K K K K K L L L X ]]
```

---

## Batch size vs Gradient accumulation

**URL:** https://docs.axolotl.ai/docs/batch_vs_grad.html

**Contents:**
- Batch size vs Gradient accumulation

Gradient accumulation means accumulating gradients over several mini-batches and updating the model weights afterward. When the samples in each batch are diverse, this technique doesn’t significantly impact learning.

This method allows for effective training with larger effective batch sizes without needing proportionally larger memory. Here’s why:

Memory Consumption with Batch Size: The primary reason increasing the batch size impacts memory is the storage requirements for intermediate activations. When you forward propagate a batch through a network, you have to store the activations at each layer for each sample in the batch, because these activations are used during backpropagation to compute gradients. Therefore, larger batches mean more activations, leading to greater GPU memory consumption.

Gradient Accumulation: With gradient accumulation, you’re effectively simulating a larger batch size by accumulating gradients over several smaller batches (or micro-batches). However, at any given time, you’re only forward and backward propagating a micro-batch. This means you only store activations for the micro-batch, not the full accumulated batch. As a result, you can simulate the effect of a larger batch size without the memory cost of storing activations for a large batch.
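To see why accumulation is numerically equivalent for a mean-reduced loss, here is a toy sketch in which plain numbers stand in for per-sample gradients:

```python
# Toy per-sample "gradients" e1..e6 for one weight
grads = [0.5, -1.0, 0.25, 2.0, -0.75, 1.5]

# Full batch of 6: average all per-sample gradients at once
full_batch_grad = sum(grads) / len(grads)

# Gradient accumulation: two micro-batches of 3; sum each micro-batch, then
# divide by the total sample count when the update is finally applied
micro1, micro2 = grads[:3], grads[3:]
accumulated_grad = (sum(micro1) + sum(micro2)) / len(grads)

# The two update directions are identical
assert abs(full_batch_grad - accumulated_grad) < 1e-12
```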

Example 1: Micro batch size: 3; Gradient accumulation steps: 2; Number of GPUs: 3. Total batch size = 3 * 2 * 3 = 18.

Example 2: Micro batch size: 2; Gradient accumulation steps: 1; Number of GPUs: 3. Total batch size = 2 * 1 * 3 = 6.
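The arithmetic in both examples follows the same formula; a one-line sketch:

```python
def total_batch_size(micro_batch_size: int, grad_accum_steps: int, num_gpus: int) -> int:
    # total batch size = micro_batch_size * gradient_accumulation_steps * number of GPUs
    return micro_batch_size * grad_accum_steps * num_gpus

print(total_batch_size(3, 2, 3))  # Example 1 -> 18
print(total_batch_size(2, 1, 3))  # Example 2 -> 6
```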
|
|
3307
|
+
|
|
3308
|
+
**Examples:**

Example 1 (text):

```text
| GPU 1          | GPU 2          | GPU 3          |
|----------------|----------------|----------------|
| S1, S2, S3     | S4, S5, S6     | S7, S8, S9     |
| e1, e2, e3     | e4, e5, e6     | e7, e8, e9     |
|----------------|----------------|----------------|
| → (accumulate) | → (accumulate) | → (accumulate) |
|----------------|----------------|----------------|
| S10, S11, S12  | S13, S14, S15  | S16, S17, S18  |
| e10, e11, e12  | e13, e14, e15  | e16, e17, e18  |
|----------------|----------------|----------------|
| → (apply)      | → (apply)      | → (apply)      |

Accumulated gradient for the weight w1 after the second iteration (considering all GPUs):
Total gradient for w1 = e1 + e2 + e3 + e4 + e5 + e6 + e7 + e8 + e9 + e10 + e11 + e12 + e13 + e14 + e15 + e16 + e17 + e18

Weight update for w1:
w1_new = w1_old - learning rate × (Total gradient for w1 / 18)
```

Example 2 (text):

```text
| GPU 1     | GPU 2     | GPU 3     |
|-----------|-----------|-----------|
| S1, S2    | S3, S4    | S5, S6    |
| e1, e2    | e3, e4    | e5, e6    |
|-----------|-----------|-----------|
| → (apply) | → (apply) | → (apply) |

Accumulated gradient for the weight w1 (considering all GPUs):
Total gradient for w1 = e1 + e2 + e3 + e4 + e5 + e6

Weight update for w1:
w1_new = w1_old - learning rate × (Total gradient for w1 / 6)
```
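
The equivalence sketched above — summing per-sample gradients across micro-batches and GPUs, then dividing by the total batch size — can be checked numerically (a minimal sketch: `lr`, `w1`, and the e-values are made-up illustrative numbers, not from any real training run):

```python
lr = 0.1   # learning rate (illustrative)
w1 = 1.0   # current weight value (illustrative)

# Made-up per-sample gradients e1..e18, laid out as in Example 1:
# 3 GPUs x micro batch size 3 x 2 accumulation steps = 18 samples.
e = [float(i) for i in range(1, 19)]

# Accumulate micro-batch gradients without applying them...
total_grad = 0.0
for step in range(2):              # gradient accumulation steps
    for gpu in range(3):           # data-parallel workers
        start = (step * 3 + gpu) * 3
        total_grad += sum(e[start:start + 3])  # one micro-batch of 3 samples

# ...then apply a single update averaged over the total batch size of 18,
# exactly as in the weight-update formula above.
w1_new = w1 - lr * (total_grad / 18)

# Identical to processing all 18 samples as one large batch:
assert w1_new == w1 - lr * (sum(e) / 18)
print(w1_new)
```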

---

## Debugging

**URL:** https://docs.axolotl.ai/docs/debugging.html

**Contents:**
- Debugging
- Table of Contents
- General Tips
- Debugging with VSCode
- Background
- Setup
- Remote Hosts
- Configuration
- Customizing your debugger
- Video Tutorial

This document provides tips and tricks for debugging Axolotl, along with an example configuration for debugging with VSCode. A good debugging setup is essential to understanding how the Axolotl code works behind the scenes.

While debugging, it’s helpful to simplify your test scenario as much as possible. Here are some tips for doing so:

[!Important] All of these tips are incorporated into the example configuration for debugging with VSCode below.

Make sure you are using the latest version of axolotl: This project changes often and bugs get fixed quickly. Check your git branch and make sure you have pulled the latest changes from main.

Eliminate concurrency: Restrict the number of processes to 1 for both training and data preprocessing:
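
The config fragment for this tip is elided; as a hedged sketch, single-process preprocessing can be requested in your axolotl config like so (this mirrors the `--dataset_num_proc=1` override used in the example launch.json later in this document):

```yaml
# Hedged sketch: restrict data preprocessing to a single process.
dataset_num_proc: 1
```

For training itself, the example launch.json restricts execution to a single GPU via the `CUDA_VISIBLE_DEVICES=0` environment variable.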

Use a small dataset: Construct or use a small dataset from the HF Hub. When using a small dataset, you will often need to set sample_packing: False and eval_sample_packing: False to avoid errors. If you are in a pinch and don’t have time to construct a small dataset but still want to use one from the HF Hub, you can shard the data (this will still tokenize the entire dataset, but will only use a fraction of it for training; for example, to shard the dataset into 20 pieces, add the following to your axolotl config):

Use a small model: A good example of a small model is TinyLlama/TinyLlama-1.1B-Chat-v1.0.

Minimize iteration time: Make sure the training loop finishes as fast as possible, with these settings.
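
The settings themselves are elided here; as a hedged sketch, these config values mirror the overrides in the example launch.json later in this document (adapt them to your own config):

```yaml
# Hedged sketch: make the training loop finish as fast as possible.
max_steps: 1        # train for a single step
micro_batch_size: 1 # smallest possible batches
val_set_size: 0     # skip validation entirely
```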

Clear Caches: Axolotl caches certain steps, and so does the underlying HuggingFace trainer. You may want to clear some of these caches when debugging.
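
As a hedged sketch, the paths below are the temporary locations used by the example launch.json in this document, not fixed Axolotl defaults; adjust them to your own dataset_prepared_path and HF_HOME:

```shell
# Hedged sketch: clear Axolotl's prepared-dataset output and the HF datasets
# cache so preprocessing runs from scratch on the next debug session.
rm -rf temp_debug/axolotl_outputs/data   # dataset_prepared_path from the example config
rm -rf temp_debug/.hf-cache/datasets     # HF datasets cache under the temp HF_HOME
```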

The example below shows how to configure VSCode to debug data preprocessing of the chat_template format. This is the format used when you have the following in your axolotl config:

[!Important] If you are already familiar with advanced VSCode debugging, you can skip the explanation below and look at the files .vscode/launch.json and .vscode/tasks.json for an example configuration.

[!Tip] If you prefer to watch a video rather than read, you can skip to the video tutorial below (but doing both is recommended).

Make sure you have an editable install of Axolotl, which ensures that changes you make to the code are reflected at runtime. Run the following commands from the root of this project:

If you are developing on a remote host, you can easily use VSCode to debug remotely. To do so, follow this Remote - SSH guide. You can also see the video below on Docker and Remote SSH debugging.

The easiest way to get started is to modify the .vscode/launch.json file in this project. This is just an example configuration, so you may need to modify or copy it to suit your needs.

For example, to mimic the command cd devtools && CUDA_VISIBLE_DEVICES=0 accelerate launch -m axolotl.cli.train dev_chat_template.yml, you would use the configuration below¹. Note that we add additional flags that override the axolotl config and incorporate the tips above (see the comments). We also set the working directory to devtools and set the env variable HF_HOME to a temporary folder that is later partially deleted, because we want to delete the HF dataset cache before each run to ensure that the data preprocessing code runs from scratch.

Additional notes about this configuration:

[!Tip] You may not want to delete these folders. For example, if you are debugging model training instead of data pre-processing, you may NOT want to delete the cache or output folders. You may also need to add additional tasks to the tasks.json file depending on your use case.

Below is the .vscode/tasks.json file that defines the cleanup-for-dataprep task. This task is run before each debugging session when you use the above configuration. Note that there are two tasks that delete the two folders mentioned above; the third task, cleanup-for-dataprep, is a composite task that combines them. A composite task is necessary because VSCode does not allow you to specify multiple tasks in the preLaunchTask argument of the launch.json file.

Your debugging use case may differ from the example above. The easiest approach is to put your own axolotl config in the devtools folder and modify the launch.json file to use it. You may also want to modify the preLaunchTask to delete different folders, or to not delete anything at all.

The following video tutorial walks through the above configuration and demonstrates how to debug with VSCode (click the image below to watch):

Using the official Axolotl Docker images is a great way to debug your code, and is a very popular way to use Axolotl. Attaching VSCode to Docker takes a few more steps.

On the host that is running axolotl (for example, a remote host), clone the axolotl repo and change your current directory to the root:

[!Tip] If you already have axolotl cloned on your host, make sure you have the latest changes and change into the root of the project.

Next, run the desired docker image and mount the current directory. Below is a docker command you can run to do this:²

[!Tip] To understand which containers are available, see the Docker section of the README and the DockerHub repo. For details of how the Docker containers are built, see axolotl’s Docker CI builds.

You will now be in the container. Next, perform an editable install of Axolotl:

Next, if you are using a remote host, remote into it with VSCode. If you are using a local host, you can skip this step.

Next, select Dev Containers: Attach to Running Container... from the command palette (CMD + SHIFT + P) in VSCode. You will be prompted to select a container to attach to; select the container you just created. You will now be in the container, with a working directory at the root of the project. Any changes you make to the code will be reflected both in the container and on the host.

Now you are ready to debug as described above (see Debugging with VSCode).

Here is a short video that demonstrates how to attach to a Docker container on a remote host:

¹ The config actually mimics the command CUDA_VISIBLE_DEVICES=0 python -m accelerate.commands.launch -m axolotl.cli.train devtools/chat_template.yml, but this is the same thing.↩︎

² Many of these flags are recommended best practices by Nvidia when using nvidia-container-toolkit. You can read more about these flags here.↩︎

**Examples:**

Example 1 (yaml):

```yaml
datasets:
  ...
  shards: 20
```

Example 2 (yaml):

```yaml
datasets:
  - path: <path to your chat_template formatted dataset> # example on HF Hub: fozziethebeat/alpaca_messages_2k_test
    type: chat_template
```

Example 3 (bash):

```bash
pip3 install packaging
pip3 install --no-build-isolation -e '.[flash-attn,deepspeed]'
```

Example 4 (json):

```json
// .vscode/launch.json
{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Debug axolotl prompt - chat_template",
      "type": "python",
      "module": "accelerate.commands.launch",
      "request": "launch",
      "args": [
        "-m", "axolotl.cli.train", "dev_chat_template.yml",
        // The flags below simplify debugging by overriding the axolotl config
        // with the debugging tips above. Modify as needed.
        "--dataset_num_proc=1",        // limits data preprocessing to one process
        "--max_steps=1",               // limits training to just one step
        "--batch_size=1",              // minimizes batch size
        "--micro_batch_size=1",        // minimizes batch size
        "--val_set_size=0",            // disables validation
        "--sample_packing=False",      // disables sample packing, which is necessary for small datasets
        "--eval_sample_packing=False", // disables sample packing on the eval set
        "--dataset_prepared_path=temp_debug/axolotl_outputs/data", // send data outputs to a temp folder
        "--output_dir=temp_debug/axolotl_outputs/model"            // send model outputs to a temp folder
      ],
      "console": "integratedTerminal", // show output in the integrated terminal
      "cwd": "${workspaceFolder}/devtools", // set working directory to devtools from the root of the project
      "justMyCode": true, // step through only axolotl code
      "env": {
        "CUDA_VISIBLE_DEVICES": "0", // since we aren't doing distributed training, limit to one GPU
        "HF_HOME": "${workspaceFolder}/devtools/temp_debug/.hf-cache" // send HF cache to a temp folder
      },
      "preLaunchTask": "cleanup-for-dataprep" // delete temp folders (see below)
    }
  ]
}
```

---

## Docker

**URL:** https://docs.axolotl.ai/docs/docker.html

**Contents:**
- Docker
  - Base
    - Image
    - Tags format
  - Main
    - Image
    - Tags format
  - Cloud
    - Image
    - Tags format

This section describes the different Docker images released by AxolotlAI on Docker Hub.

For Blackwell GPUs, please use the tags with PyTorch 2.7.1 and CUDA 12.8.

The base image is the most minimal image that can install Axolotl. It is based on the nvidia/cuda image and includes python, torch, git, git-lfs, awscli, pydantic, and more.

The main image is the image used to run Axolotl. It is based on the axolotlai/axolotl-base image and includes the Axolotl codebase, dependencies, and more.

There may be extra tags appended to the image, like -vllm, which install those additional packages.

The cloud image is the image used to run Axolotl in the cloud. It is based on the axolotlai/axolotl image and sets ENV variables (such as HuggingFace cache directories pointed at volume mounts), configures tmux, and more for different cloud providers.

Jupyter Lab runs by default. Set JUPYTER_DISABLE=1 in the environment variables to disable it.

This uses the same tags as the main image.

We recommend mounting volumes to /workspace/data for data persistence. /workspace/axolotl contains the source code and is ephemeral.

This is the same as the cloud image but without tmux.

The naming may be a bit confusing, as it has -term appended to the end.

This uses the same tags as the cloud image.

**Examples:**

Example 1 (text):

```text
axolotlai/axolotl-base
```

Example 2 (bash):

```bash
main-base-py{python_version}-cu{cuda_version}-{pytorch_version}
```

Example 3 (text):

```text
axolotlai/axolotl
```

Example 4 (bash):

```bash
# on push to main
main-py{python_version}-cu{cuda_version}-{pytorch_version}

# latest main (currently torch 2.6.0, python 3.11, cuda 12.4)
main-latest

# nightly build
{branch}-{date_in_YYYYMMDD}-py{python_version}-cu{cuda_version}-{pytorch_version}

# tagged release
{version}
```
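
As an illustration of the main-image tag pattern above, a tag can be composed from its version components (a hedged sketch: the exact rendering of each component, e.g. whether CUDA 12.4 appears as "124" or "12.4", is an assumption not confirmed by this document):

```python
def main_tag(python_version: str, cuda_version: str, pytorch_version: str) -> str:
    """Compose a main-image tag following the documented pattern:
    main-py{python_version}-cu{cuda_version}-{pytorch_version}."""
    return f"main-py{python_version}-cu{cuda_version}-{pytorch_version}"

# With the versions the doc lists for main-latest (torch 2.6.0, python 3.11,
# cuda 12.4), assuming the CUDA component renders as "124":
print(main_tag("3.11", "124", "2.6.0"))  # -> main-py3.11-cu124-2.6.0
```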

---