@synsci/cli-darwin-x64 1.1.49

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (373)
  1. package/bin/skills/accelerate/SKILL.md +332 -0
  2. package/bin/skills/accelerate/references/custom-plugins.md +453 -0
  3. package/bin/skills/accelerate/references/megatron-integration.md +489 -0
  4. package/bin/skills/accelerate/references/performance.md +525 -0
  5. package/bin/skills/audiocraft/SKILL.md +564 -0
  6. package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
  7. package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
  8. package/bin/skills/autogpt/SKILL.md +403 -0
  9. package/bin/skills/autogpt/references/advanced-usage.md +535 -0
  10. package/bin/skills/autogpt/references/troubleshooting.md +420 -0
  11. package/bin/skills/awq/SKILL.md +310 -0
  12. package/bin/skills/awq/references/advanced-usage.md +324 -0
  13. package/bin/skills/awq/references/troubleshooting.md +344 -0
  14. package/bin/skills/axolotl/SKILL.md +158 -0
  15. package/bin/skills/axolotl/references/api.md +5548 -0
  16. package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
  17. package/bin/skills/axolotl/references/index.md +15 -0
  18. package/bin/skills/axolotl/references/other.md +3563 -0
  19. package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
  20. package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
  21. package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
  22. package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
  23. package/bin/skills/bitsandbytes/SKILL.md +411 -0
  24. package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
  25. package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
  26. package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
  27. package/bin/skills/blip-2/SKILL.md +564 -0
  28. package/bin/skills/blip-2/references/advanced-usage.md +680 -0
  29. package/bin/skills/blip-2/references/troubleshooting.md +526 -0
  30. package/bin/skills/chroma/SKILL.md +406 -0
  31. package/bin/skills/chroma/references/integration.md +38 -0
  32. package/bin/skills/clip/SKILL.md +253 -0
  33. package/bin/skills/clip/references/applications.md +207 -0
  34. package/bin/skills/constitutional-ai/SKILL.md +290 -0
  35. package/bin/skills/crewai/SKILL.md +498 -0
  36. package/bin/skills/crewai/references/flows.md +438 -0
  37. package/bin/skills/crewai/references/tools.md +429 -0
  38. package/bin/skills/crewai/references/troubleshooting.md +480 -0
  39. package/bin/skills/deepspeed/SKILL.md +141 -0
  40. package/bin/skills/deepspeed/references/08.md +17 -0
  41. package/bin/skills/deepspeed/references/09.md +173 -0
  42. package/bin/skills/deepspeed/references/2020.md +378 -0
  43. package/bin/skills/deepspeed/references/2023.md +279 -0
  44. package/bin/skills/deepspeed/references/assets.md +179 -0
  45. package/bin/skills/deepspeed/references/index.md +35 -0
  46. package/bin/skills/deepspeed/references/mii.md +118 -0
  47. package/bin/skills/deepspeed/references/other.md +1191 -0
  48. package/bin/skills/deepspeed/references/tutorials.md +6554 -0
  49. package/bin/skills/dspy/SKILL.md +590 -0
  50. package/bin/skills/dspy/references/examples.md +663 -0
  51. package/bin/skills/dspy/references/modules.md +475 -0
  52. package/bin/skills/dspy/references/optimizers.md +566 -0
  53. package/bin/skills/faiss/SKILL.md +221 -0
  54. package/bin/skills/faiss/references/index_types.md +280 -0
  55. package/bin/skills/flash-attention/SKILL.md +367 -0
  56. package/bin/skills/flash-attention/references/benchmarks.md +215 -0
  57. package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
  58. package/bin/skills/gguf/SKILL.md +427 -0
  59. package/bin/skills/gguf/references/advanced-usage.md +504 -0
  60. package/bin/skills/gguf/references/troubleshooting.md +442 -0
  61. package/bin/skills/gptq/SKILL.md +450 -0
  62. package/bin/skills/gptq/references/calibration.md +337 -0
  63. package/bin/skills/gptq/references/integration.md +129 -0
  64. package/bin/skills/gptq/references/troubleshooting.md +95 -0
  65. package/bin/skills/grpo-rl-training/README.md +97 -0
  66. package/bin/skills/grpo-rl-training/SKILL.md +572 -0
  67. package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
  68. package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
  69. package/bin/skills/guidance/SKILL.md +572 -0
  70. package/bin/skills/guidance/references/backends.md +554 -0
  71. package/bin/skills/guidance/references/constraints.md +674 -0
  72. package/bin/skills/guidance/references/examples.md +767 -0
  73. package/bin/skills/hqq/SKILL.md +445 -0
  74. package/bin/skills/hqq/references/advanced-usage.md +528 -0
  75. package/bin/skills/hqq/references/troubleshooting.md +503 -0
  76. package/bin/skills/hugging-face-cli/SKILL.md +191 -0
  77. package/bin/skills/hugging-face-cli/references/commands.md +954 -0
  78. package/bin/skills/hugging-face-cli/references/examples.md +374 -0
  79. package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
  80. package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
  81. package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
  82. package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
  83. package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
  84. package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
  85. package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
  86. package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
  87. package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
  88. package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
  89. package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
  90. package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
  91. package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
  92. package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
  93. package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
  94. package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
  95. package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
  96. package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
  97. package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
  98. package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
  99. package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
  100. package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
  101. package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
  102. package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
  103. package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
  104. package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
  105. package/bin/skills/hugging-face-jobs/index.html +216 -0
  106. package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
  107. package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
  108. package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
  109. package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
  110. package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
  111. package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
  112. package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
  113. package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
  114. package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
  115. package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
  116. package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
  117. package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
  118. package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
  119. package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
  120. package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
  121. package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
  122. package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
  123. package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
  124. package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
  125. package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
  126. package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
  127. package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
  128. package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
  129. package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
  130. package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
  131. package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
  132. package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
  133. package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
  134. package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
  135. package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
  136. package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
  137. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
  138. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
  139. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
  140. package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
  141. package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
  142. package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
  143. package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
  144. package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
  145. package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
  146. package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
  147. package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
  148. package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
  149. package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
  150. package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
  151. package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
  152. package/bin/skills/instructor/SKILL.md +740 -0
  153. package/bin/skills/instructor/references/examples.md +107 -0
  154. package/bin/skills/instructor/references/providers.md +70 -0
  155. package/bin/skills/instructor/references/validation.md +606 -0
  156. package/bin/skills/knowledge-distillation/SKILL.md +458 -0
  157. package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
  158. package/bin/skills/lambda-labs/SKILL.md +545 -0
  159. package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
  160. package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
  161. package/bin/skills/langchain/SKILL.md +480 -0
  162. package/bin/skills/langchain/references/agents.md +499 -0
  163. package/bin/skills/langchain/references/integration.md +562 -0
  164. package/bin/skills/langchain/references/rag.md +600 -0
  165. package/bin/skills/langsmith/SKILL.md +422 -0
  166. package/bin/skills/langsmith/references/advanced-usage.md +548 -0
  167. package/bin/skills/langsmith/references/troubleshooting.md +537 -0
  168. package/bin/skills/litgpt/SKILL.md +469 -0
  169. package/bin/skills/litgpt/references/custom-models.md +568 -0
  170. package/bin/skills/litgpt/references/distributed-training.md +451 -0
  171. package/bin/skills/litgpt/references/supported-models.md +336 -0
  172. package/bin/skills/litgpt/references/training-recipes.md +619 -0
  173. package/bin/skills/llama-cpp/SKILL.md +258 -0
  174. package/bin/skills/llama-cpp/references/optimization.md +89 -0
  175. package/bin/skills/llama-cpp/references/quantization.md +213 -0
  176. package/bin/skills/llama-cpp/references/server.md +125 -0
  177. package/bin/skills/llama-factory/SKILL.md +80 -0
  178. package/bin/skills/llama-factory/references/_images.md +23 -0
  179. package/bin/skills/llama-factory/references/advanced.md +1055 -0
  180. package/bin/skills/llama-factory/references/getting_started.md +349 -0
  181. package/bin/skills/llama-factory/references/index.md +19 -0
  182. package/bin/skills/llama-factory/references/other.md +31 -0
  183. package/bin/skills/llamaguard/SKILL.md +337 -0
  184. package/bin/skills/llamaindex/SKILL.md +569 -0
  185. package/bin/skills/llamaindex/references/agents.md +83 -0
  186. package/bin/skills/llamaindex/references/data_connectors.md +108 -0
  187. package/bin/skills/llamaindex/references/query_engines.md +406 -0
  188. package/bin/skills/llava/SKILL.md +304 -0
  189. package/bin/skills/llava/references/training.md +197 -0
  190. package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
  191. package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
  192. package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
  193. package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
  194. package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
  195. package/bin/skills/long-context/SKILL.md +536 -0
  196. package/bin/skills/long-context/references/extension_methods.md +468 -0
  197. package/bin/skills/long-context/references/fine_tuning.md +611 -0
  198. package/bin/skills/long-context/references/rope.md +402 -0
  199. package/bin/skills/mamba/SKILL.md +260 -0
  200. package/bin/skills/mamba/references/architecture-details.md +206 -0
  201. package/bin/skills/mamba/references/benchmarks.md +255 -0
  202. package/bin/skills/mamba/references/training-guide.md +388 -0
  203. package/bin/skills/megatron-core/SKILL.md +366 -0
  204. package/bin/skills/megatron-core/references/benchmarks.md +249 -0
  205. package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
  206. package/bin/skills/megatron-core/references/production-examples.md +473 -0
  207. package/bin/skills/megatron-core/references/training-recipes.md +547 -0
  208. package/bin/skills/miles/SKILL.md +315 -0
  209. package/bin/skills/miles/references/api-reference.md +141 -0
  210. package/bin/skills/miles/references/troubleshooting.md +352 -0
  211. package/bin/skills/mlflow/SKILL.md +704 -0
  212. package/bin/skills/mlflow/references/deployment.md +744 -0
  213. package/bin/skills/mlflow/references/model-registry.md +770 -0
  214. package/bin/skills/mlflow/references/tracking.md +680 -0
  215. package/bin/skills/modal/SKILL.md +341 -0
  216. package/bin/skills/modal/references/advanced-usage.md +503 -0
  217. package/bin/skills/modal/references/troubleshooting.md +494 -0
  218. package/bin/skills/model-merging/SKILL.md +539 -0
  219. package/bin/skills/model-merging/references/evaluation.md +462 -0
  220. package/bin/skills/model-merging/references/examples.md +428 -0
  221. package/bin/skills/model-merging/references/methods.md +352 -0
  222. package/bin/skills/model-pruning/SKILL.md +495 -0
  223. package/bin/skills/model-pruning/references/wanda.md +347 -0
  224. package/bin/skills/moe-training/SKILL.md +526 -0
  225. package/bin/skills/moe-training/references/architectures.md +432 -0
  226. package/bin/skills/moe-training/references/inference.md +348 -0
  227. package/bin/skills/moe-training/references/training.md +425 -0
  228. package/bin/skills/nanogpt/SKILL.md +290 -0
  229. package/bin/skills/nanogpt/references/architecture.md +382 -0
  230. package/bin/skills/nanogpt/references/data.md +476 -0
  231. package/bin/skills/nanogpt/references/training.md +564 -0
  232. package/bin/skills/nemo-curator/SKILL.md +383 -0
  233. package/bin/skills/nemo-curator/references/deduplication.md +87 -0
  234. package/bin/skills/nemo-curator/references/filtering.md +102 -0
  235. package/bin/skills/nemo-evaluator/SKILL.md +494 -0
  236. package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
  237. package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
  238. package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
  239. package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
  240. package/bin/skills/nemo-guardrails/SKILL.md +297 -0
  241. package/bin/skills/nnsight/SKILL.md +436 -0
  242. package/bin/skills/nnsight/references/README.md +78 -0
  243. package/bin/skills/nnsight/references/api.md +344 -0
  244. package/bin/skills/nnsight/references/tutorials.md +300 -0
  245. package/bin/skills/openrlhf/SKILL.md +249 -0
  246. package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
  247. package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
  248. package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
  249. package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
  250. package/bin/skills/outlines/SKILL.md +652 -0
  251. package/bin/skills/outlines/references/backends.md +615 -0
  252. package/bin/skills/outlines/references/examples.md +773 -0
  253. package/bin/skills/outlines/references/json_generation.md +652 -0
  254. package/bin/skills/peft/SKILL.md +431 -0
  255. package/bin/skills/peft/references/advanced-usage.md +514 -0
  256. package/bin/skills/peft/references/troubleshooting.md +480 -0
  257. package/bin/skills/phoenix/SKILL.md +475 -0
  258. package/bin/skills/phoenix/references/advanced-usage.md +619 -0
  259. package/bin/skills/phoenix/references/troubleshooting.md +538 -0
  260. package/bin/skills/pinecone/SKILL.md +358 -0
  261. package/bin/skills/pinecone/references/deployment.md +181 -0
  262. package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
  263. package/bin/skills/pytorch-fsdp/references/index.md +7 -0
  264. package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
  265. package/bin/skills/pytorch-lightning/SKILL.md +346 -0
  266. package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
  267. package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
  268. package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
  269. package/bin/skills/pyvene/SKILL.md +473 -0
  270. package/bin/skills/pyvene/references/README.md +73 -0
  271. package/bin/skills/pyvene/references/api.md +383 -0
  272. package/bin/skills/pyvene/references/tutorials.md +376 -0
  273. package/bin/skills/qdrant/SKILL.md +493 -0
  274. package/bin/skills/qdrant/references/advanced-usage.md +648 -0
  275. package/bin/skills/qdrant/references/troubleshooting.md +631 -0
  276. package/bin/skills/ray-data/SKILL.md +326 -0
  277. package/bin/skills/ray-data/references/integration.md +82 -0
  278. package/bin/skills/ray-data/references/transformations.md +83 -0
  279. package/bin/skills/ray-train/SKILL.md +406 -0
  280. package/bin/skills/ray-train/references/multi-node.md +628 -0
  281. package/bin/skills/rwkv/SKILL.md +260 -0
  282. package/bin/skills/rwkv/references/architecture-details.md +344 -0
  283. package/bin/skills/rwkv/references/rwkv7.md +386 -0
  284. package/bin/skills/rwkv/references/state-management.md +369 -0
  285. package/bin/skills/saelens/SKILL.md +386 -0
  286. package/bin/skills/saelens/references/README.md +70 -0
  287. package/bin/skills/saelens/references/api.md +333 -0
  288. package/bin/skills/saelens/references/tutorials.md +318 -0
  289. package/bin/skills/segment-anything/SKILL.md +500 -0
  290. package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
  291. package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
  292. package/bin/skills/sentence-transformers/SKILL.md +255 -0
  293. package/bin/skills/sentence-transformers/references/models.md +123 -0
  294. package/bin/skills/sentencepiece/SKILL.md +235 -0
  295. package/bin/skills/sentencepiece/references/algorithms.md +200 -0
  296. package/bin/skills/sentencepiece/references/training.md +304 -0
  297. package/bin/skills/sglang/SKILL.md +442 -0
  298. package/bin/skills/sglang/references/deployment.md +490 -0
  299. package/bin/skills/sglang/references/radix-attention.md +413 -0
  300. package/bin/skills/sglang/references/structured-generation.md +541 -0
  301. package/bin/skills/simpo/SKILL.md +219 -0
  302. package/bin/skills/simpo/references/datasets.md +478 -0
  303. package/bin/skills/simpo/references/hyperparameters.md +452 -0
  304. package/bin/skills/simpo/references/loss-functions.md +350 -0
  305. package/bin/skills/skypilot/SKILL.md +509 -0
  306. package/bin/skills/skypilot/references/advanced-usage.md +491 -0
  307. package/bin/skills/skypilot/references/troubleshooting.md +570 -0
  308. package/bin/skills/slime/SKILL.md +464 -0
  309. package/bin/skills/slime/references/api-reference.md +392 -0
  310. package/bin/skills/slime/references/troubleshooting.md +386 -0
  311. package/bin/skills/speculative-decoding/SKILL.md +467 -0
  312. package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
  313. package/bin/skills/speculative-decoding/references/medusa.md +350 -0
  314. package/bin/skills/stable-diffusion/SKILL.md +519 -0
  315. package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
  316. package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
  317. package/bin/skills/tensorboard/SKILL.md +629 -0
  318. package/bin/skills/tensorboard/references/integrations.md +638 -0
  319. package/bin/skills/tensorboard/references/profiling.md +545 -0
  320. package/bin/skills/tensorboard/references/visualization.md +620 -0
  321. package/bin/skills/tensorrt-llm/SKILL.md +187 -0
  322. package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
  323. package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
  324. package/bin/skills/tensorrt-llm/references/serving.md +470 -0
  325. package/bin/skills/tinker/SKILL.md +362 -0
  326. package/bin/skills/tinker/references/api-reference.md +168 -0
  327. package/bin/skills/tinker/references/getting-started.md +157 -0
  328. package/bin/skills/tinker/references/loss-functions.md +163 -0
  329. package/bin/skills/tinker/references/models-and-lora.md +139 -0
  330. package/bin/skills/tinker/references/recipes.md +280 -0
  331. package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
  332. package/bin/skills/tinker/references/rendering.md +243 -0
  333. package/bin/skills/tinker/references/supervised-learning.md +232 -0
  334. package/bin/skills/tinker-training-cost/SKILL.md +187 -0
  335. package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
  336. package/bin/skills/torchforge/SKILL.md +433 -0
  337. package/bin/skills/torchforge/references/api-reference.md +327 -0
  338. package/bin/skills/torchforge/references/troubleshooting.md +409 -0
  339. package/bin/skills/torchtitan/SKILL.md +358 -0
  340. package/bin/skills/torchtitan/references/checkpoint.md +181 -0
  341. package/bin/skills/torchtitan/references/custom-models.md +258 -0
  342. package/bin/skills/torchtitan/references/float8.md +133 -0
  343. package/bin/skills/torchtitan/references/fsdp.md +126 -0
  344. package/bin/skills/transformer-lens/SKILL.md +346 -0
  345. package/bin/skills/transformer-lens/references/README.md +54 -0
  346. package/bin/skills/transformer-lens/references/api.md +362 -0
  347. package/bin/skills/transformer-lens/references/tutorials.md +339 -0
  348. package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
  349. package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
  350. package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
  351. package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
  352. package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
  353. package/bin/skills/unsloth/SKILL.md +80 -0
  354. package/bin/skills/unsloth/references/index.md +7 -0
  355. package/bin/skills/unsloth/references/llms-full.md +16799 -0
  356. package/bin/skills/unsloth/references/llms-txt.md +12044 -0
  357. package/bin/skills/unsloth/references/llms.md +82 -0
  358. package/bin/skills/verl/SKILL.md +391 -0
  359. package/bin/skills/verl/references/api-reference.md +301 -0
  360. package/bin/skills/verl/references/troubleshooting.md +391 -0
  361. package/bin/skills/vllm/SKILL.md +364 -0
  362. package/bin/skills/vllm/references/optimization.md +226 -0
  363. package/bin/skills/vllm/references/quantization.md +284 -0
  364. package/bin/skills/vllm/references/server-deployment.md +255 -0
  365. package/bin/skills/vllm/references/troubleshooting.md +447 -0
  366. package/bin/skills/weights-and-biases/SKILL.md +590 -0
  367. package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
  368. package/bin/skills/weights-and-biases/references/integrations.md +700 -0
  369. package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
  370. package/bin/skills/whisper/SKILL.md +317 -0
  371. package/bin/skills/whisper/references/languages.md +189 -0
  372. package/bin/synsc +0 -0
  373. package/package.json +10 -0
@@ -0,0 +1,3563 @@
# Axolotl - Other

**Pages:** 26

---

## Mixed Precision Training

**URL:** https://docs.axolotl.ai/docs/mixed_precision.html

**Contents:**
- Mixed Precision Training
- 1 FP16 Mixed Precision
- 1.1 Overview
- 1.2 Configuration
- 1.3 FP16 Considerations
- 2 BF16 Mixed Precision
- 2.1 Overview
- 2.2 Configuration
- 3 FP8 Mixed Precision
- 3.1 What is FP8?

Mixed precision training uses lower precision data types to reduce memory usage and increase training speed while maintaining model quality. Axolotl supports several mixed precision formats:

FP16 is the traditional half-precision format; it is supported on older GPUs but can be less numerically stable than BF16.
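FP16 is enabled the same way as BF16 in Example 1 below; a minimal sketch (the `fp16` key mirrors the `bf16` key):

```yaml
# Enable FP16 mixed precision (for GPUs without BF16 support)
fp16: true
```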

BF16 (Brain Float 16) offers better numerical stability than FP16 and is the recommended mixed precision format for modern GPUs. It provides the same dynamic range as FP32 while using half the memory.

FP8 support is experimental and requires compatible hardware (H100, H200) and recent PyTorch versions with TorchAO.

FP8 (8-bit floating point) can provide significant time savings compared to FP16/BF16 while maintaining training stability. Axolotl’s implementation uses PyTorch’s TorchAO library with a “tensorwise” scaling strategy.

To enable it, add to your YAML config (see Example 2 below):

torch.compile is critical for FP8 performance

FP8 training requires torch_compile: true to see meaningful speedups. Without compilation, FP8 may actually be slower and use more memory than FP16/BF16.

For FSDP (Fully Sharded Data Parallel) training, see Example 3 below.

Always validate your mixed precision setup before launching a full run.

See examples/llama-3/3b-fp8-fsdp2.yaml for an optimized example config. Enabling FP8 mixed precision + FP8 all-gather training results in ~10% faster iterations per second vs. BF16 for a relatively small (3B param) model.

For more information on multi-GPU training, see our Multi-GPU guide.

**Examples:**

Example 1 (yaml):
```yaml
# Automatic BF16 detection (recommended)
bf16: auto

# Or explicitly enable
bf16: true

# For evaluation with BF16
bf16: full # Equivalent to bf16_full_eval in the HF trainer
```

Example 2 (yaml):
```yaml
# Enable FP8 mixed precision
fp8: true

# Optional: Enable FP8 for FSDP all-gather operations
fp8_enable_fsdp_float8_all_gather: true

# Enable torch.compile (almost always necessary for FP8 speedups)
torch_compile: true
```

Example 3 (yaml):
```yaml
fp8: true
fp8_enable_fsdp_float8_all_gather: true

torch_compile: true

# FSDP configuration
fsdp_version: 2
fsdp_config:
  offload_params: false
  cpu_ram_efficient_loading: true
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: LlamaDecoderLayer
  state_dict_type: FULL_STATE_DICT
  reshard_after_forward: true
```

---

## FAQ

**URL:** https://docs.axolotl.ai/docs/faq.html

**Contents:**
- FAQ
- General
- Chat templates

Q: The trainer stopped and hasn’t progressed in several minutes.

A: Usually an issue with the GPUs communicating with each other. See the NCCL doc.

A: This usually happens when you run out of system RAM.

Q: exitcode: -7 while using deepspeed

A: Try upgrading deepspeed with `pip install -U deepspeed`.

Q: AttributeError: 'DummyOptim' object has no attribute 'step'

Q: ModuleNotFoundError: No module named 'mpi4py' when using a single GPU with deepspeed

A: You may be using deepspeed with a single GPU. Please remove the deepspeed: section in the YAML file or the --deepspeed CLI flag.

Q: The code is stuck on saving preprocessed datasets.

A: This is usually an issue with the GPU. It can be resolved by setting the environment variable CUDA_VISIBLE_DEVICES=0. If you are on RunPod, this is usually a pod issue; starting a new pod should take care of it.

Q: Received a mismatch error between the torch.Size of the checkpoint and the model when merging or loading adapters.

A: This is likely due to a vocab size mismatch. By default, Axolotl expands the model’s embeddings if the tokenizer has more tokens than the model. Please use the axolotl merge-lora command to merge the adapters instead of using your own scripts.

On the other hand, if the model has more tokens than the tokenizer, Axolotl does not shrink the model’s embeddings unless shrink_embeddings: true is set in the config.
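For the second case, the relevant config flag (as named in the answer above) is:

```yaml
# Shrink the model's embeddings to the tokenizer's vocab size
shrink_embeddings: true
```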

Q: How do I call Axolotl from custom Python scripts?

A: Since Axolotl is just Python, please see src/axolotl/cli/main.py for how each command is called.

Q: How do I know the value to use for fsdp_transformer_layer_cls_to_wrap?

A: This is the class name of the transformer layer to wrap with FSDP. For example, for LlamaForCausalLM, the value is LlamaDecoderLayer. To find it for a specific model, check the model’s PreTrainedModel definition and look for the _no_split_modules variable in the modeling_<model_name>.py file within the transformers library.
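Concretely, for a Llama model the resulting fragment of the FSDP config (mirroring Example 3 in the Mixed Precision section above) would look like:

```yaml
fsdp_config:
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: LlamaDecoderLayer
```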

Q: ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token

A: This is because the tokenizer does not have a padding token. Please add a padding token to the tokenizer via:

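A minimal sketch of the usual fix, via Axolotl's special_tokens config (the token value here is illustrative; use one appropriate for your model's tokenizer):

```yaml
special_tokens:
  pad_token: "<|end_of_text|>"
```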
140
+ Q: IterableDataset error or KeyError: 'input_ids' when using preprocess CLI
141
+
142
+ A: This is because you may be using preprocess CLI with pretraining_dataset: or skip_prepare_dataset: true respectively. Please use axolotl train CLI directly instead as these datasets are prepared on demand.
143
+
144
+ Q: vLLM is not working with Axolotl
145
+
146
+ A: We currently recommend torch 2.6.0 for use with vLLM. Please ensure you use the correct version. For Docker, please use the main-py3.11-cu124-2.6.0 tag.
147
+
148
+ Q: FA2 2.8.0 undefined symbol runtime error on CUDA 12.4
149
+
150
+ A: There seems to be a wheel issue with FA2 2.8.0 on CUDA 12.4. Try CUDA 12.6 instead or downgrade to FA2 2.7.4. Please refer to the upstream issue: https://github.com/Dao-AILab/flash-attention/issues/1717.
151
+
152
+ Q: Can we mix text and text+image datasets for VLM training?
153
+
154
+ A: Yes, you can for newer VLM architectures. The ones that would not work are the LLaVA / Pixtral architectures. If you notice one not working, please let us know!
155
+
156
+ Q: Why is memory/max_* different from nvidia-smi?
157
+
158
+ A: We use torch APIs to retrieve this information. You can see https://docs.pytorch.org/docs/stable/notes/cuda.html#cuda-memory-management for more information.
159
+
160
+ Q: jinja2.exceptions.UndefinedError: 'dict object' has no attribute 'content' / 'role' / ____
161
+
162
+ A: This means that the property mapping for the stated attribute does not exist when building the chat_template prompt. For example, if there is no attribute 'content', please check that you have added the correct mapping for content under message_property_mappings.
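For example, if your dataset stores the role under a `from` key and the text under a `value` key (the field names here are illustrative), the mapping might look like:

```yaml
datasets:
  - path: your_dataset  # hypothetical dataset path
    type: chat_template
    message_property_mappings:
      role: from
      content: value
```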
163
+
164
+ Q: Empty template generated for turn ___
165
+
166
+ A: The content is empty for that turn.
167
+
168
+ Q: Could not find content start/end boundary for turn __
169
+
170
+ A: The specific turn’s start/end could not be detected. Please ensure you have set the eos_token to match your chat_template. Otherwise, this could be a chat_template that doesn’t use proper boundaries for each turn (such as system). In rare cases, make sure your content is not [[dummy_message]]. Please let us know if you hit this.
171
+
172
+ Q: Content end boundary is before start boundary for turn ___
173
+
174
+ A: This is an edge case which should not occur. Please create an Issue if this happens.
175
+
176
+ Q: Content end boundary is the same as start boundary for turn ___. This is likely an empty turn.
177
+
178
+ A: This is likely an empty turn.
179
+
180
+ Q: The EOS token is incorrectly being masked or not being masked / EOS token __ not found in chat template.
181
+
182
+ A: There can be two reasons:
183
+
184
+ Q: “chat_template choice is tokenizer_default but tokenizer’s chat_template is null. Please add a chat_template in tokenizer config”
185
+
186
+ A: This is because the tokenizer does not have a chat template. Please add a chat template in the tokenizer config. See chat_template for more details.
187
+
188
+ Q: The EOT token(s) are incorrectly being masked or not being masked / EOT token __ not found in chat template.
189
+
190
+ A: There can be two reasons:
191
+
192
+ Q: EOT token encoding failed. Please check if the token is valid and can be encoded.
193
+
194
+ A: There could be some issue with the tokenizer or unicode encoding. Please raise an issue with examples with the EOT token & tokenizer causing the issue.
195
+
196
+ Q: EOT token __ is encoded as multiple tokens.
197
+
198
+ A: This is because the EOT token is encoded as multiple tokens which can cause unexpected behavior. Please add it under tokens: or (recommended) override unused added_tokens via added_tokens_overrides:.
199
+
200
+ Q: Conflict between train_on_eos and train_on_eot. eos_token is in eot_tokens and train_on_eos != train_on_eot
201
+
202
+ A: This is because the EOS token is in the eot_tokens: while mismatch between train_on_eos: and train_on_eot:. This will cause one to override the other. Please ensure that train_on_eos: and train_on_eot: are the same or remove the EOS token from eot_tokens:.
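A consistent configuration might look like the following sketch (the token value is illustrative):

```yaml
eot_tokens:
  - "<|im_end|>"
train_on_eot: turn  # matches the train_on_eos default
```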
203
+
204
+ Q: If eot_tokens: is not provided, what happens?
205
+
206
+ A: If eot_tokens: is not provided, the default behavior is the same as before. EOS tokens used to delimit turns are masked/unmasked depending on whether the turn is trainable.
207
+
208
+ Internally, eot_tokens: defaults to tokenizer.eos_token and train_on_eot: defaults to train_on_eos (which itself defaults to turn). This transition helps clarify the naming and behavior of EOT/EOS tokens.
209
+
210
+ Q: Data processing error: CAS service error
211
+
212
+ A: Try disabling XET with export HF_HUB_DISABLE_XET=1
213
+
214
+ Q: torch._inductor.exc.LoweringException: NoValidChoicesError: No choices to select, please consider adding ATEN into max_autotune_gemm_backends config (defined in torch/_inductor/config.py) to allow at least one choice.
215
+
216
+ A: Depending on the version of torch, you may need to include this in your YAML:
217
+
218
+ Q: ValueError("Backward pass should have cleared tracker of all tensors")
219
+
220
+ A: This may happen due to edge cases in using the modern OffloadActivations context manager for CUDA streams. If you encounter this error, you may have success using the naive implementation with offload_activations: legacy in your YAML.
221
+
222
+ Q: Error parsing tool_calls arguments as JSON.
223
+
224
+ A: There is an error parsing string arguments to a dict. Please check your dataset and the error message for more details.
225
+
226
+ **Examples:**
227
+
228
+ Example 1 (yaml):
229
+ ```yaml
230
+ special_tokens:
231
+ # str. If you're not sure, set to same as `eos_token`.
232
+ pad_token: "..."
233
+ ```
234
+
235
+ Example 2 (yaml):
236
+ ```yaml
237
+ flex_attn_compile_kwargs:
238
+ dynamic: false
239
+ mode: max-autotune-no-cudagraphs
240
+ ```
241
+
242
+ ---
243
+
244
+ ## Installation
245
+
246
+ **URL:** https://docs.axolotl.ai/docs/installation.html
247
+
248
+ **Contents:**
249
+ - Installation
250
+ - 1 Requirements
251
+ - 2 Installation Methods
252
+ - 2.1 PyPI Installation (Recommended)
253
+ - 2.2 uv Installation
254
+ - 2.3 Edge/Development Build
255
+ - 2.4 Docker
256
+ - 3 Cloud Environments
257
+ - 3.1 Cloud GPU Providers
258
+ - 3.2 Google Colab
259
+
260
+ This guide covers all the ways you can install and set up Axolotl for your environment.
261
+
262
+ Please make sure you have PyTorch installed before installing Axolotl in your local environment.
263
+
264
+ Follow the instructions at: https://pytorch.org/get-started/locally/
265
+
266
+ For Blackwell GPUs, please use PyTorch 2.7.0 and CUDA 12.8.
267
+
268
+ We use --no-build-isolation so that the installed PyTorch version (if present) is detected rather than clobbered, and so that the correct versions of dependencies specific to that PyTorch version and other installed co-dependencies are selected.
269
+
270
+ uv is a fast, reliable Python package installer and resolver built in Rust. It offers significant performance improvements over pip and provides better dependency resolution, making it an excellent choice for complex environments.
271
+
272
+ Install uv if not already installed
273
+
274
+ Choose the CUDA version to use with PyTorch (e.g. cu124, cu126, cu128), then create and activate the venv
275
+
276
+ Install PyTorch - PyTorch 2.6.0 recommended
277
+
278
+ Install Axolotl from PyPI
279
+
280
+ For the latest features between releases:
281
+
282
+ For development with Docker:
283
+
284
+ For Blackwell GPUs, please use axolotlai/axolotl:main-py3.11-cu128-2.7.0 or the cloud variant axolotlai/axolotl-cloud:main-py3.11-cu128-2.7.0.
285
+
286
+ Please refer to the Docker documentation for more information on the different Docker images that are available.
287
+
288
+ For providers supporting Docker:
289
+
290
+ See Section 6 for Mac-specific issues.
291
+
292
+ We recommend using WSL2 (Windows Subsystem for Linux) or Docker.
293
+
294
+ Install PyTorch: https://pytorch.org/get-started/locally/
295
+
296
+ (Optional) Login to Hugging Face:
297
+
298
+ If you encounter installation issues, see our FAQ and Debugging Guide.
299
+
300
+ **Examples:**
301
+
302
+ Example 1 (bash):
303
+ ```bash
304
+ pip3 install -U packaging setuptools wheel ninja
305
+ pip3 install --no-build-isolation axolotl[flash-attn,deepspeed]
306
+ ```
307
+
308
+ Example 2 (bash):
309
+ ```bash
310
+ curl -LsSf https://astral.sh/uv/install.sh | sh
311
+ source $HOME/.local/bin/env
312
+ ```
313
+
314
+ Example 3 (bash):
315
+ ```bash
316
+ export UV_TORCH_BACKEND=cu126
317
+ uv venv --no-project --relocatable
318
+ source .venv/bin/activate
319
+ ```
320
+
321
+ Example 4 (bash):
322
+ ```bash
323
+ uv pip install packaging setuptools wheel
324
+ uv pip install torch==2.6.0
325
+ uv pip install awscli pydantic
326
+ ```
327
+
328
+ ---
329
+
330
+ ## Dataset Preprocessing
331
+
332
+ **URL:** https://docs.axolotl.ai/docs/dataset_preprocessing.html
333
+
334
+ **Contents:**
335
+ - Dataset Preprocessing
336
+ - Overview
337
+ - What are the benefits of pre-processing?
338
+ - What are the edge cases?
339
+
340
+ Dataset pre-processing is the step where Axolotl takes each dataset you’ve configured alongside the dataset format and prompt strategies to:
341
+
342
+ The processing of the datasets can happen one of two ways:
343
+
344
+ When training interactively or for sweeps (e.g. you are restarting the trainer often), processing the datasets can oftentimes be frustratingly slow. Pre-processing will cache the tokenized/formatted datasets according to a hash of dependent training parameters so that it will intelligently pull from its cache when possible.
345
+
346
+ The path of the cache is controlled by dataset_prepared_path: and is often left blank in example YAMLs as this leads to a more robust solution that prevents unexpectedly reusing cached data.
347
+
348
+ If dataset_prepared_path: is left empty, the processed dataset will be cached in the default path ./last_run_prepared/ during training, but anything already cached there will be ignored. By explicitly setting dataset_prepared_path: ./last_run_prepared, the trainer will use whatever pre-processed data is in the cache.
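For example, to pre-process ahead of time and reuse the cache on subsequent runs (the path is illustrative):

```yaml
dataset_prepared_path: ./last_run_prepared
```

You can then run `axolotl preprocess your_config.yml` once, and later training runs with the same config will pull from the cache.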
349
+
350
+ Let’s say you are writing a custom prompt strategy or using a user-defined prompt template. Because the trainer cannot readily detect these changes, the calculated hash value for the pre-processed dataset cannot account for them.
351
+
352
+ If you have dataset_prepared_path: ... set and change your prompt templating logic, it may not pick up the changes you made and you will be training over the old prompt.
353
+
354
+ ---
355
+
356
+ ## Inference and Merging
357
+
358
+ **URL:** https://docs.axolotl.ai/docs/inference.html
359
+
360
+ **Contents:**
361
+ - Inference and Merging
362
+ - 1 Quick Start
363
+ - 1.1 Basic Inference
364
+ - 2 Advanced Usage
365
+ - 2.1 Gradio Interface
366
+ - 2.2 File-based Prompts
367
+ - 2.3 Memory Optimization
368
+ - 3 Merging LoRA Weights
369
+ - 3.1 Memory Management for Merging
370
+ - 4 Tokenization
371
+
372
+ This guide covers how to use your trained models for inference, including model loading, interactive testing, merging adapters, and common troubleshooting steps.
373
+
374
+ Use the same config for inference/merging as was used for training.
375
+
376
+ Launch an interactive web interface:
377
+
378
+ Process prompts from a text file:
379
+
380
+ For large models or limited memory:
381
+
382
+ Merge LoRA adapters with the base model:
383
+
384
+ Tokenization mismatches between training and inference are a common source of problems.
385
+
386
+ Verify inference tokenization by decoding tokens before model input
387
+
388
+ Compare token IDs between training and inference
389
+
390
+ Configure special tokens in your YAML:
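A minimal sketch (the correct token values depend on your base model's tokenizer):

```yaml
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  pad_token: "</s>"
```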
391
+
392
+ For more details, see our debugging guide.
393
+
394
+ **Examples:**
395
+
396
+ Example 1 (bash):
397
+ ```bash
398
+ axolotl inference your_config.yml --lora-model-dir="./lora-output-dir"
399
+ ```
400
+
401
+ Example 2 (bash):
402
+ ```bash
403
+ axolotl inference your_config.yml --base-model="./completed-model"
404
+ ```
405
+
406
+ Example 3 (bash):
407
+ ```bash
408
+ axolotl inference your_config.yml --gradio
409
+ ```
410
+
411
+ Example 4 (bash):
412
+ ```bash
413
+ cat /tmp/prompt.txt | axolotl inference your_config.yml \
414
+ --base-model="./completed-model" --prompter=None
415
+ ```
416
+
417
+ ---
418
+
419
+ ## MultiModal / Vision Language Models (BETA)
420
+
421
+ **URL:** https://docs.axolotl.ai/docs/multimodal.html
422
+
423
+ **Contents:**
424
+ - MultiModal / Vision Language Models (BETA)
425
+ - Supported Models
426
+ - Usage
427
+ - Mllama
428
+ - Llama4
429
+ - Pixtral
430
+ - Llava-1.5
431
+ - Mistral-Small-3.1
432
+ - Magistral-Small-2509
433
+ - Voxtral
434
+
435
+ Multimodal support is limited and doesn’t have full feature parity.
436
+
437
+ Here are the hyperparams you’ll need to use to finetune a multimodal model.
438
+
439
+ Please see examples folder for full configs.
440
+
441
+ Some of our chat_templates have been extended to support broader dataset types. This should not break any existing configs.
442
+
443
+ As of now, we do not truncate or drop samples based on sequence_len, as each arch processes non-text tokens differently. We are looking for help on this.
444
+
445
+ Please make sure to install vision lib via pip install 'mistral-common[opencv]==1.8.5'
446
+
447
+ Please make sure to install vision lib via pip install 'mistral-common[opencv]==1.8.5'
448
+
449
+ Please make sure to install audio lib via pip3 install librosa==0.11.0 'mistral_common[audio]==1.8.3'
450
+
451
+ The Gemma3-1B model is a text-only model, so please train as regular text model.
452
+
453
+ For multi-modal 4B/12B/27B models, use the following config:
454
+
455
+ The model’s initial loss and grad norm will be very high. We suspect this to be due to the Conv in the vision layers.
456
+
457
+ Please make sure to install timm via pip3 install timm==1.0.17
458
+
459
+ Please make sure to install num2words via pip3 install num2words==0.5.14
460
+
461
+ Please uninstall causal-conv1d via pip3 uninstall -y causal-conv1d
462
+
463
+ For multi-modal datasets, we adopt an extended chat_template format similar to OpenAI’s Message format.
464
+
465
+ For backwards compatibility:
466
+
467
+ For image loading, you can use the following keys within content alongside "type": "image":
468
+
469
+ For audio loading, you can use the following keys within content alongside "type": "audio":
470
+
471
+ You may need to install librosa via pip3 install librosa==0.11.0.
472
+
473
+ This is not well tested at the moment. We welcome contributors!
474
+
475
+ For video loading, you can use the following keys within content alongside "type": "video":
476
+
477
+ Here is an example of a multi-modal dataset:
478
+
479
+ PIL could not retrieve the file at the given URL using requests. Please check for typos. Alternatively, the request may be blocked by the server.
480
+
481
+ **Examples:**
482
+
483
+ Example 1 (yaml):
484
+ ```yaml
485
+ processor_type: AutoProcessor
486
+
487
+ skip_prepare_dataset: true
488
+ remove_unused_columns: false # leave columns in place as they are needed to handle image embeddings during training
489
+ sample_packing: false # not yet supported with multimodal
490
+
491
+ chat_template: # see in next section if specified
492
+
493
+ # example dataset
494
+ datasets:
495
+ - path: HuggingFaceH4/llava-instruct-mix-vsft
496
+ type: chat_template
497
+ split: train[:1%]
498
+
499
+ # (optional) if doing lora, only finetune the Language model,
500
+ # leave the vision model and vision tower frozen
501
+ # load_in_8bit: true
502
+ adapter: lora
503
+ lora_target_modules: 'model.language_model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'
504
+
505
+ # (optional) if you want to resize images to a set size
506
+ image_size: 512
507
+ image_resize_algorithm: bilinear
508
+ ```
509
+
510
+ Example 2 (yaml):
511
+ ```yaml
512
+ base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
513
+
514
+ chat_template: llama3_2_vision
515
+ ```
516
+
517
+ Example 3 (yaml):
518
+ ```yaml
519
+ base_model: meta-llama/Llama-4-Scout-17B-16E-Instruct
520
+
521
+ chat_template: llama4
522
+ ```
523
+
524
+ Example 4 (yaml):
525
+ ```yaml
526
+ base_model: mistralai/Pixtral-12B-2409
527
+
528
+ chat_template: pixtral
529
+ ```
530
+
531
+ ---
532
+
533
+ ## Reward Modelling
534
+
535
+ **URL:** https://docs.axolotl.ai/docs/reward_modelling.html
536
+
537
+ **Contents:**
538
+ - Reward Modelling
539
+ - Overview
540
+ - (Outcome) Reward Models
541
+ - Process Reward Models (PRM)
542
+
543
+ Reward modelling is a technique used to train models to predict the reward or value of a given input. This is particularly useful in reinforcement learning scenarios where the model needs to evaluate the quality of its actions or predictions. We support the reward modelling techniques supported by trl.
544
+
545
+ Outcome reward models are trained using data which contains preference annotations for an entire interaction between the user and model (e.g. rather than per-turn or per-step). For improved training stability, you can use the center_rewards_coefficient parameter to encourage mean-zero reward outputs (see TRL docs).
546
+
547
+ Bradley-Terry chat templates expect single-turn conversations in the following format:
548
+
549
+ Check out our PRM blog.
550
+
551
+ Process reward models are trained using data which contains preference annotations for each step in a series of interactions. Typically, PRMs are trained to provide reward signals over each step of a reasoning trace and are used for downstream reinforcement learning.
552
+
553
+ Please see stepwise_supervised for more details on the dataset format.
554
+
555
+ **Examples:**
556
+
557
+ Example 1 (yaml):
558
+ ```yaml
559
+ base_model: google/gemma-2-2b
560
+ model_type: AutoModelForSequenceClassification
561
+ num_labels: 1
562
+ tokenizer_type: AutoTokenizer
563
+
564
+ reward_model: true
565
+ chat_template: gemma
566
+ datasets:
567
+ - path: argilla/distilabel-intel-orca-dpo-pairs
568
+ type: bradley_terry.chat_template
569
+
570
+ val_set_size: 0.1
571
+ eval_steps: 100
572
+ ```
573
+
574
+ Example 2 (json):
575
+ ```json
576
+ {
577
+ "system": "...", // optional
578
+ "input": "...",
579
+ "chosen": "...",
580
+ "rejected": "..."
581
+ }
582
+ ```
583
+
584
+ Example 3 (yaml):
585
+ ```yaml
586
+ base_model: Qwen/Qwen2.5-3B
587
+ model_type: AutoModelForTokenClassification
588
+ num_labels: 2
589
+
590
+ process_reward_model: true
591
+ datasets:
592
+ - path: trl-lib/math_shepherd
593
+ type: stepwise_supervised
594
+ split: train
595
+
596
+ val_set_size: 0.1
597
+ eval_steps: 100
598
+ ```
599
+
600
+ ---
601
+
602
+ ## RLHF (Beta)
603
+
604
+ **URL:** https://docs.axolotl.ai/docs/rlhf.html
605
+
606
+ **Contents:**
607
+ - RLHF (Beta)
608
+ - Overview
609
+ - RLHF using Axolotl
610
+ - DPO
611
+ - chatml.argilla
612
+ - chatml.argilla_chat
613
+ - chatml.icr
614
+ - chatml.intel
615
+ - chatml.prompt_pairs
616
+ - chatml.ultra
617
+
618
+ Reinforcement Learning from Human Feedback is a method whereby a language model is optimized from data using human feedback. Various methods include, but are not limited to:
619
+
620
+ This is a BETA feature and many features are not fully implemented. You are encouraged to open new PRs to improve the integration and functionality.
621
+
622
+ We rely on the TRL library for implementations of various RL training methods, which we wrap to expose in Axolotl. Each method has its own supported ways of loading datasets and prompt formats.
623
+
624
+ You can find what each method supports by going into src/axolotl/prompt_strategies/{method} where {method} is one of our supported methods. The type: can be retrieved from {method}.{function_name}.
625
+
626
+ DPO supports the following types with the following dataset format:
627
+
628
+ For custom behaviors,
629
+
630
+ The input format is a simple JSON input with customizable fields based on the above config.
631
+
632
+ As IPO is just DPO with a different loss function, all supported dataset formats for DPO are also supported for IPO.
633
+
634
+ Paper: https://arxiv.org/abs/2403.07691
635
+
636
+ ORPO supports the following types with the following dataset format:
637
+
638
+ KTO supports the following types with the following dataset format:
639
+
640
+ For custom behaviors,
641
+
642
+ The input format is a simple JSON input with customizable fields based on the above config.
643
+
644
+ Check out our GRPO cookbook.
645
+
646
+ In the latest GRPO implementation, vLLM is used to significantly speed up trajectory generation during training. In this example, we’re using 4 GPUs, 2 for training and 2 for vLLM:
647
+
648
+ Make sure you’ve installed the correct version of vLLM by including it as an extra when installing axolotl, e.g. pip install axolotl[vllm].
649
+
650
+ Your vLLM instance will now attempt to spin up, and it’s time to kick off training utilizing our remaining two GPUs. In another terminal, execute:
651
+
652
+ Due to TRL’s implementation with vLLM, the vLLM instance must use the last N GPUs instead of the first N GPUs. This is why in the example above, we use CUDA_VISIBLE_DEVICES=2,3 for the vLLM instance.
653
+
654
+ GRPO uses custom reward functions and transformations. Please have them ready locally.
655
+
656
+ For example, to load OpenAI’s GSM8K and use a random reward for completions:
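As a hedged sketch (function and module names are illustrative; it follows TRL's reward-function convention of returning one float per completion):

```python
import random


def rand_reward_func(completions, **kwargs) -> list[float]:
    """Return a random reward for each completion (illustration only)."""
    return [random.uniform(0.0, 1.0) for _ in completions]
```

The function would then be referenced from your YAML under the GRPO settings via its import path (e.g. a `reward_funcs` entry pointing at a hypothetical `my_rewards.rand_reward_func`).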
657
+
658
+ To see other examples of custom reward functions, please see TRL GRPO Docs.
659
+
660
+ To see all configs, please see TRLConfig.
661
+
662
+ The DAPO paper and subsequently the Dr. GRPO paper proposed alternative loss functions for GRPO to remediate the penalty on longer responses.
663
+
664
+ For more information, see GRPO docs.
665
+
666
+ SimPO uses CPOTrainer but with an alternative loss function.
667
+
668
+ This method uses the same dataset format as DPO.
669
+
670
+ TRL supports auto-unwrapping PEFT models for RL training paradigms which rely on a reference model. This significantly reduces memory pressure, as an additional reference model does not need to be loaded; reference-model log-probabilities can be obtained by disabling the PEFT adapters. This is enabled by default. To turn it off, pass the following config:
671
+
672
+ **Examples:**
673
+
674
+ Example 1 (yaml):
675
+ ```yaml
676
+ rl: dpo
677
+ datasets:
678
+ - path: Intel/orca_dpo_pairs
679
+ split: train
680
+ type: chatml.intel
681
+ - path: argilla/ultrafeedback-binarized-preferences
682
+ split: train
683
+ type: chatml
684
+ ```
685
+
686
+ Example 2 (json):
687
+ ```json
688
+ {
689
+ "system": "...", // optional
690
+ "instruction": "...",
691
+ "chosen_response": "...",
692
+ "rejected_response": "..."
693
+ }
694
+ ```
695
+
696
+ Example 3 (json):
697
+ ```json
698
+ {
699
+ "chosen": [
700
+ {"role": "user", "content": "..."},
701
+ {"role": "assistant", "content": "..."}
702
+ ],
703
+ "rejected": [
704
+ {"role": "user", "content": "..."},
705
+ {"role": "assistant", "content": "..."}
706
+ ]
707
+ }
708
+ ```
709
+
710
+ Example 4 (json):
711
+ ```json
712
+ {
713
+ "system": "...", // optional
714
+ "input": "...",
715
+ "chosen": "...",
716
+ "rejected": "..."
717
+ }
718
+ ```
719
+
720
+ ---
721
+
722
+ ## LoRA Optimizations
723
+
724
+ **URL:** https://docs.axolotl.ai/docs/lora_optims.html
725
+
726
+ **Contents:**
727
+ - LoRA Optimizations
728
+ - Usage
729
+ - Requirements
730
+ - Implementation details
731
+ - Custom autograd functions
732
+ - Triton kernels
733
+ - Integration
734
+ - Future Work
735
+
736
+ Inspired by Unsloth, we’ve implemented two optimizations for LoRA and QLoRA fine-tuning, supporting both single GPU and multi-GPU (including the DDP, DeepSpeed, and FSDP2 settings) training. These include (1) SwiGLU and GEGLU activation function Triton kernels, and (2) LoRA MLP and attention custom autograd functions. Our goal was to leverage operator fusion and tensor re-use in order to improve speed and reduce memory usage during the forward and backward passes of these calculations.
737
+
738
+ We currently support several common model architectures, including (but not limited to):
739
+
740
+ The set of models we support is currently limited by our attention patching strategy, which assumes (and replaces) specific code blocks for query / key / value and output projections:
741
+
742
+ Where apply_qkv and apply_o are defined in the axolotl.kernels.lora module.
743
+
744
+ We welcome testing of other model architectures and / or PRs to expand our patching logic to be compatible with more of them.
745
+
746
+ Check out our LoRA optimizations blog.
747
+
748
+ These optimizations can be enabled in your Axolotl config YAML file. The lora_mlp_kernel option enables the optimized MLP path, while lora_qkv_kernel and lora_o_kernel enable the fused query-key-value projection and optimized output projection, respectively.
749
+
750
+ Currently, LoRA kernels are not supported for RLHF training, only SFT.
751
+
752
+ Models with pre-existing LoRA adapters that use dropout or have bias terms may need to be re-finetuned without these features in order to use these optimizations.
753
+
754
+ The LoRA MLP autograd function optimizes the entire MLP computation path. It fuses the LoRA and base weight computations together and provides a single, efficient backward pass for the entire MLP block.
755
+
756
+ For attention components, similar optimizations are provided through a function that handles the query, key, and value projections, and a function that handles the output projection. They are designed to work with the existing transformers attention implementation via some monkey-patching logic.
757
+
758
+ Two activation functions (SwiGLU and GeGLU) are implemented with Triton kernels for improved speed and memory performance. These kernels handle both the forward and backward passes.
759
+
760
+ The custom autograd functions and Triton kernels are designed to work together. The autograd function manages the high-level computation flow and gradient tracking, while calling the Triton kernels for the activation function computation. During the backward pass, the kernel computes both the activation output and the required gradients, which the autograd function then uses to compute the final gradients for the entire computation path.
761
+
762
+ **Examples:**
763
+
764
+ Example 1 (python):
765
+ ```python
766
+ ORIGINAL_QKV_CODE = """
767
+ query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
768
+ key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
769
+ value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
770
+ """.lstrip(
771
+ "\n"
772
+ )
773
+
774
+ ORIGINAL_O_CODE = """
775
+ attn_output = self.o_proj(attn_output)
776
+ """.lstrip(
777
+ "\n"
778
+ )
779
+ ```
780
+
781
+ Example 2 (python):
782
+ ```python
783
+ PATCHED_QKV_CODE = """
784
+ query_states, key_states, value_states = self.apply_qkv(hidden_states)
785
+ query_states = query_states.view(hidden_shape).transpose(1, 2)
786
+ key_states = key_states.view(hidden_shape).transpose(1, 2)
787
+ value_states = value_states.view(hidden_shape).transpose(1, 2)
788
+ """.lstrip(
789
+ "\n"
790
+ )
791
+
792
+ PATCHED_O_CODE = """
793
+ attn_output = self.apply_o(attn_output)
794
+ """.lstrip(
795
+ "\n"
796
+ )
797
+ ```
798
+
799
+ Example 3 (yaml):
800
+ ```yaml
801
+ lora_mlp_kernel: true
802
+ lora_qkv_kernel: true
803
+ lora_o_kernel: true
804
+ ```
805
+
806
+ ---
807
+
808
+ ## Quantization with torchao
809
+
810
+ **URL:** https://docs.axolotl.ai/docs/quantize.html
811
+
812
+ **Contents:**
813
+ - Quantization with torchao
814
+ - Configuring Quantization in Axolotl
815
+
816
+ Quantization is a technique to lower the memory footprint of your model, potentially at the cost of accuracy or model performance. We support quantizing your model using the torchao library. Quantization is supported for both post-training quantization (PTQ) and quantization-aware training (QAT).
817
+
818
+ We do not currently support quantization techniques such as GGUF, GPTQ, or EXL2.
819
+
820
+ Quantization is configured using the quantization key in your configuration file.
821
+
822
+ Once quantization is complete, your quantized model will be saved in the {output_dir}/quantized directory.
823
+
824
+ You may also use the quantize command to quantize a model trained with QAT. To do so, use the same QAT configuration file you used to train the model:
825
+
826
+ This ensures that an identical quantization configuration is used to quantize the model as was used to train it.
827
+
828
+ If you have configured pushing to hub with hub_model_id, your model hub name will have the quantization schema appended to it, e.g. axolotl-ai-cloud/qat-nvfp4-llama3B will become axolotl-ai-cloud/qat-nvfp4-llama3B-nvfp4w
829
+
830
+ **Examples:**
831
+
832
+ Example 1 (yaml):
833
+ ```yaml
834
+ base_model: # The path to the model to quantize.
835
+ quantization:
836
+ activation_dtype: # Optional[str] = "int8". Fake quantization layout to use for activation quantization. Valid options are "int4", "int8", "float8"
837
+ weight_dtype: # Optional[str] = "int8". Fake quantization layout to use for weight quantization. Valid options are "int4", "fp8", and "nvfp4".
838
+ group_size: # Optional[int] = 32. The number of elements in each group for per-group fake quantization
839
+ quantize_embedding: # Optional[bool] = False. Whether to quantize the embedding layer.
840
+
841
+ output_dir: # The path to the output directory.
842
+ ```
843
+
844
+ Example 2 (yaml):
845
+ ```yaml
846
+ # qat.yml
847
+ qat:
848
+ activation_dtype: int8
849
+ weight_dtype: int4
850
+ group_size: 256
851
+
852
+ output_dir: # The path to the output directory used during training where the final checkpoint has been saved.
853
+ ```
854
+
855
+ Example 3 (bash):
856
+ ```bash
857
+ axolotl quantize qat.yml
858
+ ```
859
+
860
+ ---
861
+
862
+ ## NCCL
863
+
864
+ **URL:** https://docs.axolotl.ai/docs/nccl.html
865
+
866
+ **Contents:**
867
+ - NCCL
868
+
869
+ NVIDIA NCCL is a library that facilitates and optimizes multi-GPU communication operations, such as broadcast, all-gather, reduce, all-reduce, etc. Broadly, NCCL configuration is highly environment-specific and is controlled via several environment variables. A common NCCL-related problem occurs when a long-running operation times out, causing the training process to abort:
870
+
871
+ Often, this timeout will happen after 30 minutes (the default setting) and is accompanied by below-average power consumption with near 100% GPU utilization before the error is raised. NVIDIA recommends disabling PCI access control services (ACS) as a possible solution, if this option is available to you.
872
+
873
+ Forcing cross-GPU communication via NVLink may help without increasing timeouts. To verify that your configuration is leveraging NVLink, run the following command:
874
+
875
+ To force NCCL to use NVLink, simply set this in the environment:
876
+
877
+ If NVLink is not available in your environment there are other options for NCCL_P2P_LEVEL in the table below:
878
+
879
+ To validate that acceptable data transfer speeds exist for your training job, running NCCL Tests can help pinpoint bottlenecks, for example:
880
+
881
+ It can be useful when debugging NCCL communication timeouts to activate additional logging in both PyTorch and NCCL:
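The following environment variables enable verbose logging (the NCCL_DEBUG_SUBSYS selection here is one reasonable choice; see the NCCL documentation for the full list of subsystems):

```bash
# Verbose NCCL logging for the INIT and COLL subsystems
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL
# Extra logging from torch.distributed
export TORCH_DISTRIBUTED_DEBUG=DETAIL
```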
882
+
883
+ Finally, if you believe your training job needs more time you can increase the timeout past 30 minutes by setting the ddp_timeout value in the Axolotl configuration. See PyTorch init_process_group for documentation on this value.
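For example (the value is illustrative; ddp_timeout is specified in seconds):

```yaml
ddp_timeout: 7200  # default is 1800 (30 minutes)
```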

**Examples:**

Example 1 (unknown):
```unknown
Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1806948 milliseconds before timing out.
```

Example 2 (bash):
```bash
nvidia-smi nvlink --status
```

Example 3 (bash):
```bash
export NCCL_P2P_LEVEL=NVL
```

Example 4 (bash):
```bash
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
```

---

## Multi Node

**URL:** https://docs.axolotl.ai/docs/multi-node.html

**Contents:**
- Multi Node
- Accelerate
- Raytrain
- Torchrun
- Option 1: New Axolotl CLI with launcher args (Recommended)
- Option 2: Direct torchrun (Legacy)

Below are three ways to train multi-node in Axolotl.

Each machine needs a copy of Axolotl; we suggest using the same commit to ensure compatibility.

You will also need to have the same configuration file for your model on each machine.

Make sure the main machine is reachable by other machines.

You will need to create a configuration for accelerate, either by running accelerate config and following the instructions, or by using one of the presets below:

~/.cache/huggingface/accelerate/default_config.yaml

Configure your model to use FSDP in the Axolotl yaml. For example:

All you have to do now is launch using accelerate as you usually would on each machine; the processes will start once you have launched accelerate on every machine.

Please see the ray train doc here.

If you are using Infiniband, we recommend torchrun to utilize the full bandwidth.

Set the following env (change buffersize/socketname depending on your system):

Run the following on each node:

Please make sure to substitute the placeholder variables:
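
For illustration, the placeholders might be filled in like this (all values are hypothetical):

```bash
num_nodes=2           # total number of machines
gpu_per_node=4        # GPUs on each machine
rdzv_id=12345         # any id shared by all nodes
head_node_ip=10.0.0.4 # IP of the main machine
head_node_port=29500  # free port on the main machine
```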

The new CLI approach (Option 1) is recommended as it provides consistent argument handling and works seamlessly with other Axolotl CLI features.

More info on the available configs can be found in the PyTorch docs here.

**Examples:**
+
953
+ Example 1 (yaml):
954
+ ```yaml
955
+ compute_environment: LOCAL_MACHINE
956
+ debug: false
957
+ distributed_type: FSDP
958
+ downcast_bf16: 'no'
959
+ machine_rank: 0 # Set to 0 for the main machine, increment by one for other machines
960
+ main_process_ip: 10.0.0.4 # Set to main machine's IP
961
+ main_process_port: 5000
962
+ main_training_function: main
963
+ mixed_precision: bf16
964
+ num_machines: 2 # Change to the number of machines
965
+ num_processes: 4 # That's the total number of GPUs, (for example: if you have 2 machines with 4 GPU, put 8)
966
+ rdzv_backend: static
967
+ same_network: true
968
+ tpu_env: []
969
+ tpu_use_cluster: false
970
+ tpu_use_sudo: false
971
+ use_cpu: false
972
+ ```
973
+
974
+ Example 2 (yaml):
975
+ ```yaml
976
+ fsdp_version: 2
977
+ fsdp_config:
978
+ offload_params: true
979
+ state_dict_type: FULL_STATE_DICT
980
+ auto_wrap_policy: TRANSFORMER_BASED_WRAP
981
+ transformer_layer_cls_to_wrap: LlamaDecoderLayer
982
+ reshard_after_forward: true
983
+ ```
984
+
985
+ Example 3 (bash):
986
+ ```bash
987
+ export NCCL_IB_DISABLE=0
988
+ export NCCL_SOCKET_IFNAME="eth0,en,eth,em,bond"
989
+ export NCCL_BUFFSIZE=2097152
990
+ ```
991
+
992
+ Example 4 (bash):
993
+ ```bash
994
+ axolotl train config.yaml --launcher torchrun -- --nnodes $num_nodes --nproc_per_node $gpu_per_node --rdzv_id $rdzv_id --rdzv_backend c10d --rdzv_endpoint "$head_node_ip:$head_node_port"
995
+ ```
996
+
997
+ ---
998
+
999
## Dataset Loading

**URL:** https://docs.axolotl.ai/docs/dataset_loading.html

**Contents:**
- Dataset Loading
- Overview
- Loading Datasets
- Local dataset
- Files
- Directory
- Loading entire directory
- Loading specific files in directory
- HuggingFace Hub
- Folder uploaded

Datasets can be loaded in a number of different ways depending on how they are saved (the file extension) and where they are stored.

We use the datasets library, with a mix of load_dataset and load_from_disk, to load datasets.

You may recognize the similarly named configs between load_dataset and the datasets section of the config file.

Do not feel overwhelmed by the number of options here. Many of them are optional. In fact, the most common config to use would be path and sometimes data_files.

This matches the API of datasets.load_dataset, so if you're familiar with that, you will feel right at home.

For HuggingFace's guide to loading different dataset types, see here.

For full details on the config, see config-reference.qmd.

You can set multiple datasets in the config file by adding more than one entry under datasets.

To load a JSON file, you would do something like this:

Which translates to the following config:

In the example above, we simply point path to the file or directory, along with the ds_type, to load the dataset.

This works for CSV, JSON, Parquet, and Arrow files.

If path points to a file and ds_type is not specified, we will automatically infer the dataset type from the file extension, so you could omit ds_type if you'd like.

If you're loading a directory, you can point the path to the directory.

Then, you have two options:

You do not need any additional configs.

We will attempt to load in the following order:
- datasets saved with datasets.save_to_disk
- an entire directory of files (such as parquet/arrow files)

Provide data_files with a list of files to load.
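
For example (the paths and file names are hypothetical):

```yaml
datasets:
  - path: /path/to/your/dataset
    data_files:
      - train-part-1.parquet
      - train-part-2.parquet
```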

The method you use to load the dataset depends on how the dataset was created, whether a folder was uploaded directly or a HuggingFace Dataset was pushed.

If you're using a private dataset, you will need to enable the hf_use_auth_token flag at the root level of the config file.
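
That flag is a single root-level entry:

```yaml
hf_use_auth_token: true
```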

This would mean that the dataset is a single file or file(s) uploaded to the Hub.

This means that the dataset was created as a HuggingFace Dataset and pushed to the Hub via datasets.push_to_hub.

Some other configs may be required, like name, split, revision, and trust_remote_code, depending on the dataset.

Via the storage_options config under load_dataset, you can load datasets from remote filesystems like S3, GCS, Azure, and OCI.

This is currently experimental. Please let us know if you run into any issues!

The only difference between the providers is that you need to prepend the path with the respective protocol.

For directories, we load via load_from_disk.

Prepend the path with s3://.
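
A sketch (the bucket and key are hypothetical):

```yaml
datasets:
  - path: s3://my-bucket/datasets/train.parquet
```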

The credentials are pulled in the following order:

We assume you have credentials set up and are not using anonymous access. If you want to use anonymous access, let us know! We may have to open a config option for this.

Other environment variables that can be set can be found in the boto3 docs.

Prepend the path with gs:// or gcs://.

The credentials are loaded in the following order:

Prepend the path with adl://.

Ensure you have the following environment variables set:

Prepend the path with abfs:// or az://.

Ensure you have the following environment variables set:

Other environment variables that can be set can be found in the adlfs docs.

Prepend the path with oci://.

It would attempt to read in the following order:

Other environment variables:

Please see the ocifs docs.

The path should start with https://.

This must be publicly accessible.

Now that you know how to load datasets, see the dataset formats docs to learn how to map your specific dataset format to your target output format.

**Examples:**

Example 1 (yaml):
```yaml
datasets:
  - path:
    name:
    data_files:
    split:
    revision:
    trust_remote_code:
```

Example 2 (yaml):
```yaml
datasets:
  - path: /path/to/your/dataset
  - path: /path/to/your/other/dataset
```

Example 3 (python):
```python
from datasets import load_dataset

dataset = load_dataset("json", data_files="data.json")
```

Example 4 (yaml):
```yaml
datasets:
  - path: data.json
    ds_type: json
```

---

## Multi-GPU

**URL:** https://docs.axolotl.ai/docs/multi-gpu.html

**Contents:**
- Multi-GPU
- 1 Overview
- 2 DeepSpeed
- 2.1 Configuration
- 2.2 Usage
- 2.3 ZeRO Stages
- 3 Fully Sharded Data Parallel (FSDP)
- 3.1 Migrating from FSDP1 to FSDP2
- 3.1.1 Config mapping
- 3.2 FSDP1 (deprecated)

This guide covers advanced training configurations for multi-GPU setups using Axolotl.

Axolotl supports several methods for multi-GPU training:

Add to your YAML config:

We provide default configurations for:

For best performance, choose the configuration that offloads the least to CPU memory while still fitting in VRAM.

Start from Stage 1 -> Stage 2 -> Stage 3.

FSDP2 is recommended for new users. FSDP1 is deprecated and will be removed in an upcoming release of Axolotl.

To migrate your config from FSDP1 to FSDP2, you must use the fsdp_version top-level config field to specify the FSDP version, and also follow the config field mapping below to update field names.

For more details, please see the migration guide in the torchtitan repo. In Axolotl, if you were using the following FSDP1 config:

You can migrate to the following FSDP2 config:

Using fsdp to configure FSDP is deprecated and will be removed in an upcoming release of Axolotl. Please use fsdp_config as above instead.

We support sequence parallelism (SP) via the ring-flash-attention project. This allows splitting sequences across GPUs, which is useful when a single sequence causes OOM errors during model training.

See our dedicated guide for more information.

For combining FSDP with QLoRA, see our dedicated guide.

Please see the docs for more info.

For NCCL-related problems, see our NCCL troubleshooting guide.

For more detailed troubleshooting, see our debugging guide.

**Examples:**

Example 1 (yaml):
```yaml
deepspeed: deepspeed_configs/zero1.json
```

Example 2 (bash):
```bash
# Fetch deepspeed configs (if not already present)
axolotl fetch deepspeed_configs

# Passing arg via config
axolotl train config.yml

# Passing arg via cli
axolotl train config.yml --deepspeed deepspeed_configs/zero1.json
```

Example 3 (yaml):
```yaml
fsdp_version: 1
fsdp_config:
  fsdp_offload_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Qwen3DecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
```

Example 4 (yaml):
```yaml
fsdp_version: 2
fsdp_config:
  offload_params: false
  cpu_ram_efficient_loading: true
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: Qwen3DecoderLayer
  state_dict_type: FULL_STATE_DICT
  reshard_after_forward: true
```

---

## Ray Train

**URL:** https://docs.axolotl.ai/docs/ray-integration.html

**Contents:**
- Ray Train
- Ray cluster setup
- Sanity check
- Configuring training with Ray Train
- Launching training

Axolotl supports using Ray as an alternative to accelerate for orchestrating training. This is especially useful for multi-node training, since you only have to set up code and dependencies on a single node and launch training as if you were using a single node.

With the --use-ray CLI flag, Axolotl will use Ray Train's TorchTrainer to run training.

A prerequisite to using the Ray Train integration is setting up a Ray cluster on your desired node(s). For a detailed guide on getting started with Ray clusters, check the official Ray docs here.

Every Ray cluster has one head node and a set of worker nodes. The head node is just like any other worker node, but it also runs certain special processes related to scheduling and orchestration. Ray-enabled scripts are run on the head node and, depending on the resources (number of CPUs, GPUs, etc.) they request, will be scheduled to run certain tasks on the worker nodes. For more on the key concepts behind a Ray cluster, you can refer to this doc.

To run a sanity check on whether your Ray cluster is set up properly, execute the following on the head node:
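
Assuming a standard Ray installation, the usual sanity-check command is:

```bash
ray status
```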

The output should have a summary of your Ray cluster: a list of all the nodes in your cluster, the number of CPUs and GPUs in your cluster, etc. For example, if you have a cluster with 1 CPU-only head node and 2 4xL40S worker nodes, the output can look like this:

You should also be able to see the same on the Ray dashboard.

You can find an example configuration at configs/llama-3/lora-1b-ray.yaml.

The key parameters to note here are:

You can simply run the following command on the head node:

This will launch training on the head node, and workers will be scheduled automatically by Ray Train to run on the appropriate head or worker nodes.

You can also monitor training progress on the Ray dashboard.

Coming back to the example on a Ray cluster with 1 head node and 2 4xL40S worker nodes, let's say you want to make use of all 8 GPUs. You would be able to just set ray_num_workers: 8 and run the previous command. The Cluster tab will show the following:

**Examples:**

Example 1 (unknown):
```unknown
Node status
---------------------------------------------------------------
Active:
 1 head
Idle:
 2 4xL40S:48CPU-384GB
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/96.0 CPU
 0.0/8.0 GPU
 0B/800.00GiB memory
 0B/229.57GiB object_store_memory

Demands:
 (no resource demands)
```

Example 2 (yaml):
```yaml
use_ray: true
ray_num_workers: 4
# optional
resources_per_worker:
  GPU: 1
```

Example 3 (yaml):
```yaml
resources_per_worker:
  accelerator_type:L40S: 0.001
```

Example 4 (bash):
```bash
axolotl train examples/llama-3/lora-1b-ray.yml --use-ray
```

---

## Sequence Parallelism

**URL:** https://docs.axolotl.ai/docs/sequence_parallelism.html

**Contents:**
- Sequence Parallelism
- When to Use Sequence Parallelism
- Configuration
- Implementation Details
- Requirements
- Limitations
- Example
- Sample Packing with Sequence Parallelism
- Effect on Batch Size

Sequence parallelism is a technique that splits sequences across multiple GPUs, allowing you to train with very long sequences that wouldn't fit on a single GPU. Each GPU processes a different portion of the sequence, and the results are aggregated through a ring communication pattern.

Use sequence parallelism when:

To enable sequence parallelism, add the following to your configuration file:

The context_parallel_size should be a divisor of the total number of GPUs. For example:
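
On an 8-GPU node, any of 2, 4, or 8 divides evenly; a minimal sketch:

```yaml
# 8 GPUs total: each sequence is split across 4 GPUs,
# leaving 2 data-parallel groups
context_parallel_size: 4
```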

When sequence parallelism is enabled:

To use sequence parallelism, you need:

This will train the Llama 3 8B model with 8K context length, with each sequence split into 4 subsequences of length 2048 across 4 GPUs.

Sequence parallelism is compatible with Axolotl's sample packing functionality. When using both features together:

When using sequence parallelism, your effective global batch size is divided by the context_parallel_size. This happens because:

For example:
- With 8 GPUs and no sequence parallelism: 8 different batches processed per step
- With 8 GPUs and context_parallel_size=4: only 2 different batches processed per step (each split across 4 GPUs)
- If your per-GPU micro_batch_size is 2, the global batch size decreases from 16 to 4

**Examples:**

Example 1 (yaml):
```yaml
# Set to a divisor (> 1) of the number of GPUs available
context_parallel_size: 4 # Split sequences across 4 GPUs
# Optional; strides across the key dimension. Larger values use more memory but should make training faster.
heads_k_stride: 1
# Optional; one of "varlen_llama3" or "batch_ring". Defaults to
# "varlen_llama3" when `sample_packing: true`, and "batch_ring" otherwise.
ring_attn_func:
```

Example 2 (yaml):
```yaml
base_model: meta-llama/Llama-3-8B-Instruct
sequence_len: 8192

...

context_parallel_size: 4 # Split each sequence into 4 parts, one per GPU
# Optional; strides across the key dimension. Larger values use more memory but should make training faster.
heads_k_stride: 1
# Optional; one of "varlen_llama3" or "batch_ring". Defaults to
# "varlen_llama3" when `sample_packing: true`, and "batch_ring" otherwise.
ring_attn_func:

...
```

---

## Quantization Aware Training (QAT)

**URL:** https://docs.axolotl.ai/docs/qat.html

**Contents:**
- Quantization Aware Training (QAT)
- Overview
- Configuring QAT in Axolotl

Quantization Aware Training (QAT) is a technique for improving the accuracy of quantized models by applying "fake" quantization to the model's weights (and optionally, activations) during training. This fake quantization allows the model to adjust for the noise introduced by quantization, so when the model is eventually quantized, the accuracy loss is minimized. We use the quantization techniques implemented in torchao to provide support for QAT and post-training quantization (PTQ) in axolotl.

We recommend reviewing the excellent QAT tutorial in the torchtune library, and the QAT documentation in the torchao library, for more details.

To enable QAT in axolotl, add the following to your configuration file:

We support the following quantization schemes:

Once you have finished training, you must quantize your model using the same quantization configuration you trained with. You can use the quantize command to do this.
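
Assuming the standard axolotl CLI pattern used elsewhere in these docs (check `axolotl quantize --help` for the exact arguments), that looks like:

```bash
axolotl quantize config.yml
```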

**Examples:**

Example 1 (yaml):
```yaml
qat:
  activation_dtype: # Optional[str] = "int8". Fake quantization layout to use for activation quantization. Valid options are "int4", "int8", "float8"
  weight_dtype: # Optional[str] = "int8". Fake quantization layout to use for weight quantization. Valid options are "int4", "fp8", and "nvfp4".
  group_size: # Optional[int] = 32. The number of elements in each group for per-group fake quantization
  fake_quant_after_n_steps: # Optional[int] = None. The number of steps after which to begin applying fake quantization
```

---

## FSDP + QLoRA

**URL:** https://docs.axolotl.ai/docs/fsdp_qlora.html

**Contents:**
- FSDP + QLoRA
- Background
- Usage
- Enabling Swap for FSDP2
- Example Config
- References
- Footnotes

Using FSDP with QLoRA is essential for fine-tuning larger (70b+ parameter) LLMs on consumer GPUs. For example, you can use FSDP + QLoRA to train a 70b model on two 24GB GPUs.¹

Below, we describe how to use this feature in Axolotl.

To enable QLoRA with FSDP, you need to perform the following steps:

> [!TIP]
> See the example config file in addition to reading these instructions.

If available memory is insufficient even after FSDP's CPU offloading, you can enable swap memory usage by setting cpu_offload_pin_memory: false alongside offload_params: true in the FSDP config.

This disables memory pinning, allowing FSDP to use disk swap space as a fallback. Disabling memory pinning itself incurs performance overhead, and actually having to use swap adds more, but it may enable training larger models that would otherwise cause OOM errors on resource-constrained systems.
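
Combined with the FSDP2 fields shown in the Multi-GPU examples, a sketch:

```yaml
fsdp_version: 2
fsdp_config:
  offload_params: true
  cpu_offload_pin_memory: false  # allow swap as a fallback
```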

examples/llama-2/qlora-fsdp.yml contains an example of how to enable QLoRA + FSDP in axolotl.

This was enabled by this work from the Answer.AI team.

---

## Custom Integrations

**URL:** https://docs.axolotl.ai/docs/custom_integrations.html

**Contents:**
- Custom Integrations
- Cut Cross Entropy
- Requirements
- Installation
- Usage
- Supported Models
- Citation
- DenseMixer
- Diffusion LM Training Plugin for Axolotl
- Overview

Axolotl adds custom features through integrations. They are located within the src/axolotl/integrations directory.

To enable them, please check the respective documentation.

Cut Cross Entropy (CCE) reduces VRAM usage by optimizing the cross-entropy operation during loss calculation.

See https://github.com/apple/ml-cross-entropy

Run the following command to install cut_cross_entropy[transformers] if you don't have it already.

Please see reference here

Simply add the following to your axolotl YAML config:

Please see reference here

This plugin enables diffusion language model training using an approach inspired by LLaDA (Large Language Diffusion Models) within Axolotl.

LLaDA is a diffusion-based approach to language model training that uses:
- Random token masking during training instead of next-token prediction
- Bidirectional attention to allow the model to attend to the full context
- Importance weighting based on masking probabilities for stable training

This approach can lead to more robust language models with a better understanding of bidirectional context.

The plugin is included with Axolotl. See our installation docs.

Train with an example config (Llama-3.2 1B):
- Pretrain: axolotl train examples/llama-3/diffusion-3.2-1b-pretrain.yaml
- SFT: axolotl train examples/llama-3/diffusion-3.2-1b-sft.yaml

You can also modify your existing configs to enable/customize diffusion training.

Add the following to your Axolotl config:

And configure the nested diffusion block (defaults shown):

Any models that support 4D attention masks should work out of the box. If not, please create an issue or open a PR!

During training, tokens are randomly masked:
- Sample timestep t uniformly from [0, 1]
- Calculate masking probability: p = (1 - eps) * t + eps
- Randomly mask tokens with probability p
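
The masking schedule above can be sketched in a few lines of Python (a toy illustration; the `eps` value, token list, and helper name are hypothetical, not the plugin's actual API):

```python
import random

EPS = 1e-3  # hypothetical epsilon; keeps the masking probability nonzero at t = 0

def mask_tokens(tokens, t, eps=EPS, rng=None):
    """Mask each token with probability p = (1 - eps) * t + eps."""
    rng = rng or random.Random(0)
    p = (1 - eps) * t + eps
    masked = [tok if rng.random() >= p else "<mask>" for tok in tokens]
    return masked, p

# Sample a timestep uniformly from [0, 1], then mask with probability p
t = random.Random(0).random()
masked, p = mask_tokens(["the", "cat", "sat", "on", "the", "mat"], t)
```

At t near 0 almost nothing is masked; at t near 1 almost everything is, which is what gives the model a curriculum over masking ratios.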

Loss is computed only on masked tokens with (optional) importance weighting:

When diffusion.generate_samples: true, the plugin generates samples during training:

Samples are logged to console and wandb (if enabled).

Diffusion inference is integrated into the standard Axolotl CLI. Use the same config you trained with and run:
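
Assuming the standard CLI pattern (the config path is a placeholder):

```bash
axolotl inference config.yml
```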

Optionally, pass --gradio to use a simple web interface.

Interactive controls (prefix the prompt with commands):
- :complete N → completion mode with N new masked tokens appended (default 64)
- :mask R → random masking mode with target mask ratio R in [0.0, 1.0]

The plugin adds (or modifies) several metrics to track diffusion training:

Please see reference here

See https://github.com/ironjr/grokfast

Please see reference here

An example dataset can be found at axolotl-ai-co/evolkit-logprobs-pipeline-75k-v2-sample

Please see reference here

Fine-tune sparsified models in Axolotl using Neural Magic's LLMCompressor.

This integration enables fine-tuning of models sparsified using LLMCompressor within the Axolotl training framework. By combining LLMCompressor's model compression capabilities with Axolotl's distributed training pipelines, users can efficiently fine-tune sparse models at scale.

It uses Axolotl's plugin system to hook into the fine-tuning flows while maintaining sparsity throughout training.

Axolotl with llmcompressor extras:

Requires llmcompressor >= 0.5.1.

This will install all necessary dependencies to fine-tune sparsified models using the integration.

To enable sparse fine-tuning with this integration, include the plugin in your Axolotl config:

This plugin does not apply pruning or sparsification itself; it is intended for fine-tuning models that have already been sparsified.

Pre-sparsified checkpoints can be:
- Generated using LLMCompressor
- Downloaded from Neural Magic's Hugging Face page
- Any custom LLM with compatible sparsity patterns that you've created yourself

To learn more about writing and customizing LLMCompressor recipes, refer to the official documentation: https://github.com/vllm-project/llm-compressor/blob/main/README.md

Setting save_compressed: true in your configuration enables saving models in a compressed format, which:
- Reduces disk space usage by approximately 40%
- Maintains compatibility with vLLM for accelerated inference
- Maintains compatibility with llmcompressor for further optimization (example: quantization)

This option is highly recommended when working with sparse models to maximize the benefits of model compression.
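
In the Axolotl config this is a single flag:

```yaml
save_compressed: true
```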

See examples/llama-3/sparse-finetuning.yaml for a complete example.

After fine-tuning your sparse model, you can leverage vLLM for efficient inference. You can also use LLMCompressor to apply additional quantization to your fine-tuned sparse model before inference for even greater performance benefits.

For more details on vLLM's capabilities and advanced configuration options, see the official vLLM documentation.

For details on available sparsity and quantization schemes, fine-tuning recipes, and usage examples, visit the official LLMCompressor repository:

https://github.com/vllm-project/llm-compressor

Please see reference here

Run evaluation on a model using the popular lm-evaluation-harness library.

See https://github.com/EleutherAI/lm-evaluation-harness

Please see reference here

Liger Kernel provides efficient Triton kernels for LLM training, offering:

See https://github.com/linkedin/Liger-Kernel

Please see reference here

by Eric Hartford, Lucas Atkins, Fernando Fernandes, David Golchinfar

This plugin contains code to freeze the bottom fraction of modules in a model, based on the Signal-to-Noise Ratio (SNR).

See https://github.com/cognitivecomputations/spectrum

Spectrum is a tool for scanning and evaluating the Signal-to-Noise Ratio (SNR) of layers in large language models. By identifying the top n% of layers with the highest SNR, you can optimize training efficiency.

Please see reference here

Plugins can be used to customize the behavior of the training pipeline through hooks. See axolotl.integrations.BasePlugin for the possible hooks.

To add a new integration, please follow these steps:

See src/axolotl/integrations/cut_cross_entropy for a minimal integration example.

If your integration could not be loaded, please ensure you are pip installing in editable mode and have correctly spelled the integration name in the config file.

It is not necessary to place your integration in the integrations folder. It can be in any location, so long as it's installed as a package in your Python env.

See this repo for an example: https://github.com/axolotl-ai-cloud/diff-transformer

**Examples:**

Example 1 (bash):
```bash
python scripts/cutcrossentropy_install.py | sh
```

Example 2 (bash):
```bash
pip3 uninstall -y cut-cross-entropy && pip3 install "cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@8a1a0ec"
```

Example 3 (yaml):
```yaml
plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
```

Example 4 (bibtex):
```bibtex
@article{wijmans2024cut,
  author  = {Erik Wijmans and
             Brody Huval and
             Alexander Hertzberg and
             Vladlen Koltun and
             Philipp Kr\"ahenb\"uhl},
  title   = {Cut Your Losses in Large-Vocabulary Language Models},
  journal = {arXiv},
  year    = {2024},
  url     = {https://arxiv.org/abs/2411.09009},
}
```

---

## Config Reference

**URL:** https://docs.axolotl.ai/docs/config-reference.html

**Contents:**
- Config Reference

**Examples:**

Example 1 (yaml):
```yaml
# Allow overwriting the yml config from the cli
strict: bool | None = False
# Resume from a specific checkpoint dir
resume_from_checkpoint: str | None
# If resume_from_checkpoint isn't set and you simply want it to start where it left off.
# Be careful with this being turned on between different models.
auto_resume_from_checkpoints: bool | None
# Resize the model embeddings to multiples of 32 when new tokens are added. This is
# reported to improve training speed on some models
resize_token_embeddings_to_32x: bool | None
mean_resizing_embeddings: bool | None = False

# Whether to shrink the embeddings to len(tokenizer). By default, we won't shrink.
shrink_embeddings: bool | None
# Don't upcast the embeddings to float32 when using PEFT. Useful for low-VRAM GPUs
embeddings_skip_upcast: bool | None
# Reinitialize model weights randomly instead of loading pretrained weights
reinit_weights: bool | None

# Module path to a custom trainer class to use for training
trainer_cls: str | None

# Use RL training: 'dpo', 'ipo', 'kto', 'simpo', 'orpo', 'grpo'
rl: RLType | None

trl: TRLConfig | None
# For TRLConfig:
# Beta parameter for the RL training. Same as `rl_beta`.
beta: float | None
# Maximum length of the completion for RL training.
max_completion_length: int | None

+ # Whether to use VLLM for RL training.
1679
+ use_vllm: bool = False
1680
+ # VLLM mode to use, one of 'server' or 'colocate'
1681
+ vllm_mode: Literal['server', 'colocate'] | None
1682
+ # Host of the vLLM server to connect to.
1683
+ vllm_server_host: str | None = 0.0.0.0
1684
+ # Port of the vLLM server to connect to.
1685
+ vllm_server_port: int | None = 8000
1686
+ # Total timeout (in seconds) to wait for the vLLM server to respond.
1687
+ vllm_server_timeout: int | None
1688
+ # Regex for vLLM guided decoding.
1689
+ vllm_guided_decoding_regex: str | None
1690
+
1691
+ # List of reward functions to load. Paths must be importable from current dir.
1692
+ reward_funcs: list[str] | None
1693
+ # List of reward weights for the reward functions.
1694
+ reward_weights: list[float] | None
1695
+ # Number of generations to sample.
1696
+ num_generations: int | None
1697
+ # Whether to log completions.
1698
+ log_completions: bool | None = False
1699
+ # Number of completions to print when log_completions is True.
1700
+ num_completions_to_print: int | None
1701
+ # Controls whether importance sampling ratios are computed at the `'token'` or
1702
+ # `'sequence'` level. For GSPO, use `sequence`, default is None which corresponds to
1703
+ # the original GRPO paper.
1704
+ importance_sampling_level: Literal['sequence', 'token'] | None
1705
+
1706
+ # Whether to sync the reference model.
1707
+ sync_ref_model: bool | None = False
1708
+ # Mixup alpha for the reference model.
1709
+ ref_model_mixup_alpha: float | None = 0.9
1710
+ # Sync steps for the reference model.
1711
+ ref_model_sync_steps: int | None = 64
1712
+ # Whether to scale rewards by their standard deviation.
1713
+ scale_rewards: bool = True
1714
+
1715
+ # Sampling temperature for the GRPO policy.
1716
+ temperature: float | None
1717
+ # Top-p sampling probability for the generation policy.
1718
+ top_p: float | None
1719
+ # Top-k sampling for the generation policy.
1720
+ top_k: int | None
1721
+ # Minimum probability for the generation policy.
1722
+ min_p: float | None
1723
+ # Penalty for tokens that appear in prompt and generated text.
1724
+ repetition_penalty: float | None
1725
+ # Number of iterations per batch (μ) for GRPO.
1726
+ num_iterations: int | None
1727
+ # Epsilon value for clipping in the GRPO algorithm.
1728
+ epsilon: float | None
1729
+ # Upper-bound epsilon value for clipping in the GRPO algorithm.
1730
+ epsilon_high: float | None
1731
+ # Whether to use Liger loss for GRPO.
1732
+ use_liger_loss: bool | None
1733
+ # Loss formulation to use. Supported values: grpo, bnpo, dr_grpo.
1734
+ loss_type: str | None
1735
+ # Whether to exclude truncated completions from loss calculation.
1736
+ mask_truncated_completions: bool = False
1737
+ # Enable sleep mode for vLLM to offload VRAM when idle
1738
+ vllm_enable_sleep_mode: bool | None
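+
+ # Illustrative sketch (not part of the generated schema above): a minimal GRPO
+ # setup using the TRL options listed here. The reward-function path and the
+ # specific values are placeholders, not recommendations.
+ #
+ #   rl: grpo
+ #   trl:
+ #     beta: 0.04
+ #     num_generations: 4
+ #     reward_funcs: ["rewards.format_reward"]  # must be importable from current dir
+ #     use_vllm: true
+ #     vllm_mode: colocate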
+
+ vllm: VllmConfig | None
+ # For VllmConfig:
+ # Device to use for VLLM
+ device: str | None = auto
+ # Tensor parallel size for VLLM
+ tensor_parallel_size: int | None
+ # Data parallel size for VLLM
+ data_parallel_size: int | None
+ # GPU memory utilization for VLLM
+ gpu_memory_utilization: float | None = 0.9
+ # Data type for VLLM
+ dtype: str | None = auto
+ # Maximum length of the model context for VLLM
+ max_model_len: int | None
+ # Enable prefix caching for VLLM
+ enable_prefix_caching: bool | None
+ # Host for the vLLM server to start on
+ host: str | None = 0.0.0.0
+ # Port of the vLLM server to start on
+ port: int | None = 8000
+
+ # Enable reasoning for VLLM
+ enable_reasoning: bool | None
+ # Reasoning parser for VLLM
+ reasoning_parser: str | None
+
+ qat: QATConfig | None
+ # For QATConfig:
+ # Fake quantization layout to use for activation quantization.
+ activation_dtype: TorchAOQuantDType | None
+ # Fake quantization layout to use for weight quantization.
+ weight_dtype: TorchAOQuantDType = TorchAOQuantDType.int8
+ # Quantize embedding
+ quantize_embedding: bool | None = False
+ # The number of elements in each group for per-group fake quantization
+ group_size: int | None = 32
+ # The number of steps to apply fake quantization after
+ fake_quant_after_n_steps: int | None
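+
+ # Illustrative sketch (not part of the schema): int8 quantization-aware training
+ # that delays fake quantization for the first steps of training. The dtype
+ # spelling and step count are placeholder assumptions.
+ #
+ #   qat:
+ #     weight_dtype: int8
+ #     group_size: 32
+ #     fake_quant_after_n_steps: 1000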
+
+ quantization: PTQConfig | None
+ # For PTQConfig:
+ # Fake quantization layout to use for weight quantization.
+ weight_dtype: TorchAOQuantDType = TorchAOQuantDType.int8
+ # Fake quantization layout to use for activation quantization.
+ activation_dtype: TorchAOQuantDType | None
+ # Whether to quantize the embedding layer.
+ quantize_embedding: bool | None
+ # The number of elements in each group for per-group fake quantization
+ group_size: int | None = 32
+
+ # Reward modelling: `True` or `False`
+ reward_model: bool | None
+ # Process reward modelling: `True` or `False`
+ process_reward_model: bool | None
+ # Coefficient to incentivize the reward model to output mean-zero rewards (proposed by
+ # https://huggingface.co/papers/2312.09244, Eq. 2). Recommended value: `0.01`.
+ center_rewards_coefficient: float | None
+ num_labels: int | None
+
+ # Whether to perform weighting in DPO trainer
+ dpo_use_weighting: bool | None
+ dpo_use_logits_to_keep: bool | None
+ dpo_label_smoothing: float | None
+ dpo_norm_loss: bool | None
+ dpo_padding_free: bool | None
+ dpo_generate_during_eval: bool | None
+
+ # A list of one or more datasets to finetune the model with
+ datasets: Annotated[list[SFTDataset | DPODataset | KTODataset | StepwiseSupervisedDataset], MinLen(1)] | None
+ # For SFTDataset:
+ # HuggingFace dataset repo | s3:// | gs:// | path to local file or directory
+ path: str | None
+ # name of dataset split to load from
+ split: str | None
+ # The type of prompt to use for training. [alpaca, gpteacher, oasst, reflection]
+ type: str | UserDefinedPrompterType | None
+ # For UserDefinedPrompterType:
+ # Custom user instruction prompt
+ system_prompt: str | None
+ # Use {system} as key to be replaced
+ system_format: str | None
+ field_system: str | None
+ field_instruction: str | None
+ field_input: str | None
+ field_output: str | None
+
+ # Customizable to be single line or multi-line. Use {instruction}/{input} as key to
+ # be replaced. 'format' can include {input}
+ format: str | None
+ # 'no_input_format' cannot include {input}
+ no_input_format: str | None
+ input_transform: str | None
+ # split dataset into N pieces (use with shards_idx)
+ shards: int | None
+ # the index of sharded dataset to use
+ shards_idx: int | None
+ # process dataset in N sequential chunks for memory efficiency (exclusive with
+ # `shards`)
+ preprocess_shards: int | None
+ conversation: str | None
+
+ # The name of the chat template to use for training, following values are supported:
+ # tokenizer_default: Uses the chat template that is available in the
+ # tokenizer_config.json. If the chat template is not available in the tokenizer, it
+ # will raise an error. This is the default.
+ # alpaca/inst/chatml/gemma/cohere/llama3/phi_3/deepseek_v2/jamba: These chat templates
+ # are available in the axolotl codebase at src/axolotl/utils/chat_templates.py.
+ # tokenizer_default_fallback_*: where * is the name of the chat template to fallback
+ # to if the tokenizer does not have a chat template else default to tokenizer. E.g.
+ # tokenizer_default_fallback_chatml. jinja: Uses a custom jinja template for the chat
+ # template. The custom jinja template should be provided in the chat_template_jinja
+ # field.
+ chat_template: ChatTemplate | str | None
+ # Custom jinja chat template or path to jinja file. Used only if `chat_template:
+ # jinja` or empty.
+ chat_template_jinja: str | None
+ # path to source data files
+ data_files: str | list[str] | None
+ input_format: str | None
+ # name of dataset configuration to load
+ name: str | None
+ # defines the datatype when path is a file
+ ds_type: str | None
+ # For `completion` datasets only, uses the provided field instead of `text` column
+ field: str | None
+ field_human: str | None
+ field_model: str | None
+ # Key containing the messages (default: "messages")
+ field_messages: str | None
+ # Key containing the tools (default: "tools"). Must be a list[dict] and follow [JSON
+ # schema](https://json-schema.org/learn/getting-started-step-by-step).
+ field_tools: str | None
+ # Key containing the reasoning trace (default: "reasoning_content").
+ field_thinking: str | None
+ # The key the chat template expects that indicates the reasoning trace.
+ template_thinking_key: str | None
+
+ message_field_role: str | None
+
+ message_field_content: str | None
+ # Mapping of properties from the input dataset to the chat template. (default:
+ # message_property_mappings={'role':'role', 'content':'content'}) If a property exists
+ # in the template but not in this mapping, the system will attempt to load it directly
+ # from the message using the property name as the key. Example: In the mapping below,
+ # 'from' is loaded from input dataset and used as 'role', while 'value' is loaded and
+ # used as 'content' in the chat template.
+ message_property_mappings: dict[str, str] | None
+ # The key in the message turn that indicates via boolean whether tokens of a turn
+ # should be considered for training. Useful to selectively train on certain turns
+ # besides the `roles_to_train`.
+ message_field_training: str | None
+ # The key in the message turn that contains the training details. Useful to
+ # selectively train on certain tokens in a turn. The value of the key is a List[Dict]
+ # containing `begin_offset` (start character index in content), `end_offset` (end
+ # character index in content), and `train` (boolean whether to train).
+ message_field_training_detail: str | None
+ # (for Qwen3 template only) Whether to split the assistant content based on a
+ # reasoning trace inside delimited tags
+ split_thinking: bool | None
+ logprobs_field: str | None
+ temperature: float | None
+ # Roles to train on. The tokens from these roles will be considered for the loss.
+ roles_to_train: list[str] | None
+ # Which EOS tokens to train on in the conversation. Possible values are: all: train on
+ # all EOS tokens, turn (default): train on the EOS token at the end of each trainable
+ # turn, last: train on the last EOS token in the conversation
+ train_on_eos: Literal['all', 'turn', 'last'] | None
+ # Roles mapping in the messages. The format is {target_role: [source_roles]}. All
+ # source roles will be mapped to the target role. The default is: user: ["human",
+ # "user"], assistant: ["gpt", "assistant"], system: ["system"], tool: ["tool"]
+ roles: dict[str, list[str]] | None
+ # Whether to drop the system turn from the dataset. Only works with chat_template.
+ # This does not drop the default system message from chat_template if it exists. If
+ # you wish to, we recommend using a custom jinja template with the default system
+ # message removed or adding a system turn with empty content.
+ drop_system_message: bool | None
+ # Trust remote code for untrusted source
+ trust_remote_code: bool | None = False
+ # The specific revision of the dataset to use when loading from the Hugging Face Hub.
+ # This can be a commit hash, tag, or branch name. If not specified, the latest version
+ # will be used. This parameter is ignored for local datasets.
+ revision: str | None
+
+ # For DPODataset:
+ path: str | None
+ split: str | None
+ type: UserDefinedDPOType | str | None
+ # For UserDefinedDPOType:
+ field_system: str | None
+ field_prompt: str | None
+ field_chosen: str | None
+ field_rejected: str | None
+ prompt_format: str | None
+ chosen_format: str | None
+ rejected_format: str | None
+ data_files: list[str] | None
+ revision: str | None
+ field_messages: str | None
+
+ # For KTODataset:
+ path: str | None
+ split: str | None
+ type: UserDefinedKTOType | str | None
+ # For UserDefinedKTOType:
+ field_system: str | None
+ field_prompt: str | None
+ field_completion: str | None
+ field_label: bool | None
+ prompt_format: str | None
+ completion_format: str | None
+ data_files: list[str] | None
+ trust_remote_code: bool | None = False
+ revision: str | None
+
+ # For StepwiseSupervisedDataset:
+ path: str | None
+ split: str | None
+ data_files: list[str] | None
+ revision: str | None
+ step_separator: str | None
+ max_completion_length: int | None
+ train_on_last_step_only: bool | None
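+
+ # Illustrative sketch (not part of the schema): two common `datasets` entries, a
+ # Hub dataset with a built-in prompt type and a local chat-template JSONL file.
+ # The paths are placeholders.
+ #
+ #   datasets:
+ #     - path: tatsu-lab/alpaca
+ #       type: alpaca
+ #     - path: ./data/chat.jsonl
+ #       type: chat_template
+ #       field_messages: messages
+ #       roles_to_train: ["assistant"]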
+
+ # A list of one or more datasets to eval the model with. You can use either
+ # test_datasets, or val_set_size, but not both.
+ test_datasets: Annotated[list[SFTDataset | DPODataset | KTODataset | StepwiseSupervisedDataset], MinLen(1)] | None
+ # For SFTDataset:
+ # HuggingFace dataset repo | s3:// | gs:// | path to local file or directory
+ path: str | None
+ # name of dataset split to load from
+ split: str | None
+ # The type of prompt to use for training. [alpaca, gpteacher, oasst, reflection]
+ type: str | UserDefinedPrompterType | None
+ # For UserDefinedPrompterType:
+ # Custom user instruction prompt
+ system_prompt: str | None
+ # Use {system} as key to be replaced
+ system_format: str | None
+ field_system: str | None
+ field_instruction: str | None
+ field_input: str | None
+ field_output: str | None
+
+ # Customizable to be single line or multi-line. Use {instruction}/{input} as key to
+ # be replaced. 'format' can include {input}
+ format: str | None
+ # 'no_input_format' cannot include {input}
+ no_input_format: str | None
+ input_transform: str | None
+ # split dataset into N pieces (use with shards_idx)
+ shards: int | None
+ # the index of sharded dataset to use
+ shards_idx: int | None
+ # process dataset in N sequential chunks for memory efficiency (exclusive with
+ # `shards`)
+ preprocess_shards: int | None
+ conversation: str | None
+
+ # The name of the chat template to use for training, following values are supported:
+ # tokenizer_default: Uses the chat template that is available in the
+ # tokenizer_config.json. If the chat template is not available in the tokenizer, it
+ # will raise an error. This is the default.
+ # alpaca/inst/chatml/gemma/cohere/llama3/phi_3/deepseek_v2/jamba: These chat templates
+ # are available in the axolotl codebase at src/axolotl/utils/chat_templates.py.
+ # tokenizer_default_fallback_*: where * is the name of the chat template to fallback
+ # to if the tokenizer does not have a chat template else default to tokenizer. E.g.
+ # tokenizer_default_fallback_chatml. jinja: Uses a custom jinja template for the chat
+ # template. The custom jinja template should be provided in the chat_template_jinja
+ # field.
+ chat_template: ChatTemplate | str | None
+ # Custom jinja chat template or path to jinja file. Used only if `chat_template:
+ # jinja` or empty.
+ chat_template_jinja: str | None
+ # path to source data files
+ data_files: str | list[str] | None
+ input_format: str | None
+ # name of dataset configuration to load
+ name: str | None
+ # defines the datatype when path is a file
+ ds_type: str | None
+ # For `completion` datasets only, uses the provided field instead of `text` column
+ field: str | None
+ field_human: str | None
+ field_model: str | None
+ # Key containing the messages (default: "messages")
+ field_messages: str | None
+ # Key containing the tools (default: "tools"). Must be a list[dict] and follow [JSON
+ # schema](https://json-schema.org/learn/getting-started-step-by-step).
+ field_tools: str | None
+ # Key containing the reasoning trace (default: "reasoning_content").
+ field_thinking: str | None
+ # The key the chat template expects that indicates the reasoning trace.
+ template_thinking_key: str | None
+
+ message_field_role: str | None
+
+ message_field_content: str | None
+ # Mapping of properties from the input dataset to the chat template. (default:
+ # message_property_mappings={'role':'role', 'content':'content'}) If a property exists
+ # in the template but not in this mapping, the system will attempt to load it directly
+ # from the message using the property name as the key. Example: In the mapping below,
+ # 'from' is loaded from input dataset and used as 'role', while 'value' is loaded and
+ # used as 'content' in the chat template.
+ message_property_mappings: dict[str, str] | None
+ # The key in the message turn that indicates via boolean whether tokens of a turn
+ # should be considered for training. Useful to selectively train on certain turns
+ # besides the `roles_to_train`.
+ message_field_training: str | None
+ # The key in the message turn that contains the training details. Useful to
+ # selectively train on certain tokens in a turn. The value of the key is a List[Dict]
+ # containing `begin_offset` (start character index in content), `end_offset` (end
+ # character index in content), and `train` (boolean whether to train).
+ message_field_training_detail: str | None
+ # (for Qwen3 template only) Whether to split the assistant content based on a
+ # reasoning trace inside delimited tags
+ split_thinking: bool | None
+ logprobs_field: str | None
+ temperature: float | None
+ # Roles to train on. The tokens from these roles will be considered for the loss.
+ roles_to_train: list[str] | None
+ # Which EOS tokens to train on in the conversation. Possible values are: all: train on
+ # all EOS tokens, turn (default): train on the EOS token at the end of each trainable
+ # turn, last: train on the last EOS token in the conversation
+ train_on_eos: Literal['all', 'turn', 'last'] | None
+ # Roles mapping in the messages. The format is {target_role: [source_roles]}. All
+ # source roles will be mapped to the target role. The default is: user: ["human",
+ # "user"], assistant: ["gpt", "assistant"], system: ["system"], tool: ["tool"]
+ roles: dict[str, list[str]] | None
+ # Whether to drop the system turn from the dataset. Only works with chat_template.
+ # This does not drop the default system message from chat_template if it exists. If
+ # you wish to, we recommend using a custom jinja template with the default system
+ # message removed or adding a system turn with empty content.
+ drop_system_message: bool | None
+ # Trust remote code for untrusted source
+ trust_remote_code: bool | None = False
+ # The specific revision of the dataset to use when loading from the Hugging Face Hub.
+ # This can be a commit hash, tag, or branch name. If not specified, the latest version
+ # will be used. This parameter is ignored for local datasets.
+ revision: str | None
+
+ # For DPODataset:
+ path: str | None
+ split: str | None
+ type: UserDefinedDPOType | str | None
+ # For UserDefinedDPOType:
+ field_system: str | None
+ field_prompt: str | None
+ field_chosen: str | None
+ field_rejected: str | None
+ prompt_format: str | None
+ chosen_format: str | None
+ rejected_format: str | None
+ data_files: list[str] | None
+ revision: str | None
+ field_messages: str | None
+
+ # For KTODataset:
+ path: str | None
+ split: str | None
+ type: UserDefinedKTOType | str | None
+ # For UserDefinedKTOType:
+ field_system: str | None
+ field_prompt: str | None
+ field_completion: str | None
+ field_label: bool | None
+ prompt_format: str | None
+ completion_format: str | None
+ data_files: list[str] | None
+ trust_remote_code: bool | None = False
+ revision: str | None
+
+ # For StepwiseSupervisedDataset:
+ path: str | None
+ split: str | None
+ data_files: list[str] | None
+ revision: str | None
+ step_separator: str | None
+ max_completion_length: int | None
+ train_on_last_step_only: bool | None
+
+ # If false, the datasets will not be shuffled and will keep their original order in
+ # `datasets`. The same applies to the `test_datasets` option and the
+ # `pretraining_dataset` option. Default is true.
+ shuffle_merged_datasets: bool | None = True
+ # If true, each dataset in `datasets` will be shuffled before merging. This allows
+ # curriculum learning strategies to be applied at the dataset level. Default is false.
+ shuffle_before_merging_datasets: bool | None = False
+ # Axolotl attempts to save the dataset as an arrow after packing the data together so
+ # subsequent training attempts load faster, relative path
+ dataset_prepared_path: str | None
+ # Num shards for whole dataset
+ dataset_shard_num: int | None
+ # Index of shard to use for whole dataset
+ dataset_shard_idx: int | None
+ skip_prepare_dataset: bool | None = False
+ # Number of shards to save the prepared dataset
+ num_dataset_shards_to_save: int | None
+
+ # Set to HF dataset for type: 'completion' for streaming instead of pre-tokenize
+ pretraining_dataset: Annotated[list[PretrainingDataset | SFTDataset], MinLen(1)] | None
+ # For PretrainingDataset:
+ name: str | None
+ path: str | None
+ split: str | None = train
+ text_column: str | None = text
+ type: str | None = pretrain
+ trust_remote_code: bool | None = False
+ data_files: str | None
+ skip: int | None
+
+ # For SFTDataset:
+ # HuggingFace dataset repo | s3:// | gs:// | path to local file or directory
+ path: str | None
+ # name of dataset split to load from
+ split: str | None
+ # The type of prompt to use for training. [alpaca, gpteacher, oasst, reflection]
+ type: str | UserDefinedPrompterType | None
+ # For UserDefinedPrompterType:
+ # Custom user instruction prompt
+ system_prompt: str | None
+ # Use {system} as key to be replaced
+ system_format: str | None
+ field_system: str | None
+ field_instruction: str | None
+ field_input: str | None
+ field_output: str | None
+
+ # Customizable to be single line or multi-line. Use {instruction}/{input} as key to
+ # be replaced. 'format' can include {input}
+ format: str | None
+ # 'no_input_format' cannot include {input}
+ no_input_format: str | None
+ input_transform: str | None
+ # split dataset into N pieces (use with shards_idx)
+ shards: int | None
+ # the index of sharded dataset to use
+ shards_idx: int | None
+ # process dataset in N sequential chunks for memory efficiency (exclusive with
+ # `shards`)
+ preprocess_shards: int | None
+ conversation: str | None
+
+ # The name of the chat template to use for training, following values are supported:
+ # tokenizer_default: Uses the chat template that is available in the
+ # tokenizer_config.json. If the chat template is not available in the tokenizer, it
+ # will raise an error. This is the default.
+ # alpaca/inst/chatml/gemma/cohere/llama3/phi_3/deepseek_v2/jamba: These chat templates
+ # are available in the axolotl codebase at src/axolotl/utils/chat_templates.py.
+ # tokenizer_default_fallback_*: where * is the name of the chat template to fallback
+ # to if the tokenizer does not have a chat template else default to tokenizer. E.g.
+ # tokenizer_default_fallback_chatml. jinja: Uses a custom jinja template for the chat
+ # template. The custom jinja template should be provided in the chat_template_jinja
+ # field.
+ chat_template: ChatTemplate | str | None
+ # Custom jinja chat template or path to jinja file. Used only if `chat_template:
+ # jinja` or empty.
+ chat_template_jinja: str | None
+ # path to source data files
+ data_files: str | list[str] | None
+ input_format: str | None
+ # name of dataset configuration to load
+ name: str | None
+ # defines the datatype when path is a file
+ ds_type: str | None
+ # For `completion` datasets only, uses the provided field instead of `text` column
+ field: str | None
+ field_human: str | None
+ field_model: str | None
+ # Key containing the messages (default: "messages")
+ field_messages: str | None
+ # Key containing the tools (default: "tools"). Must be a list[dict] and follow [JSON
+ # schema](https://json-schema.org/learn/getting-started-step-by-step).
+ field_tools: str | None
+ # Key containing the reasoning trace (default: "reasoning_content").
+ field_thinking: str | None
+ # The key the chat template expects that indicates the reasoning trace.
+ template_thinking_key: str | None
+
+ message_field_role: str | None
+
+ message_field_content: str | None
+ # Mapping of properties from the input dataset to the chat template. (default:
+ # message_property_mappings={'role':'role', 'content':'content'}) If a property exists
+ # in the template but not in this mapping, the system will attempt to load it directly
+ # from the message using the property name as the key. Example: In the mapping below,
+ # 'from' is loaded from input dataset and used as 'role', while 'value' is loaded and
+ # used as 'content' in the chat template.
+ message_property_mappings: dict[str, str] | None
+ # The key in the message turn that indicates via boolean whether tokens of a turn
+ # should be considered for training. Useful to selectively train on certain turns
+ # besides the `roles_to_train`.
+ message_field_training: str | None
+ # The key in the message turn that contains the training details. Useful to
+ # selectively train on certain tokens in a turn. The value of the key is a List[Dict]
+ # containing `begin_offset` (start character index in content), `end_offset` (end
+ # character index in content), and `train` (boolean whether to train).
+ message_field_training_detail: str | None
+ # (for Qwen3 template only) Whether to split the assistant content based on a
+ # reasoning trace inside delimited tags
+ split_thinking: bool | None
+ logprobs_field: str | None
+ temperature: float | None
+ # Roles to train on. The tokens from these roles will be considered for the loss.
+ roles_to_train: list[str] | None
+ # Which EOS tokens to train on in the conversation. Possible values are: all: train on
+ # all EOS tokens, turn (default): train on the EOS token at the end of each trainable
+ # turn, last: train on the last EOS token in the conversation
+ train_on_eos: Literal['all', 'turn', 'last'] | None
+ # Roles mapping in the messages. The format is {target_role: [source_roles]}. All
+ # source roles will be mapped to the target role. The default is: user: ["human",
+ # "user"], assistant: ["gpt", "assistant"], system: ["system"], tool: ["tool"]
+ roles: dict[str, list[str]] | None
+ # Whether to drop the system turn from the dataset. Only works with chat_template.
+ # This does not drop the default system message from chat_template if it exists. If
+ # you wish to, we recommend using a custom jinja template with the default system
+ # message removed or adding a system turn with empty content.
+ drop_system_message: bool | None
+ # Trust remote code for untrusted source
+ trust_remote_code: bool | None = False
+ # The specific revision of the dataset to use when loading from the Hugging Face Hub.
+ # This can be a commit hash, tag, or branch name. If not specified, the latest version
+ # will be used. This parameter is ignored for local datasets.
+ revision: str | None
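+
+ # Illustrative sketch (not part of the schema): streaming pretraining from a Hub
+ # dataset without pre-tokenizing. The dataset repo and config name are placeholders.
+ #
+ #   pretraining_dataset:
+ #     - path: allenai/c4
+ #       name: en
+ #       type: pretrain
+ #       text_column: text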
2263
+
2264
+ # The maximum number of processes to use while preprocessing your input dataset. This
2265
+ # defaults to `os.cpu_count()` if not set. For Runpod VMs, it will default to number of
2266
+ # vCPUs via RUNPOD_CPU_COUNT.
2267
+ dataset_processes: int | None
2268
+ # The maximum number of processes to use while preprocessing your input dataset. This
2269
+ # defaults to `os.cpu_count()` if not set. For Runpod VMs, it will default to number of
2270
+ # vCPUs via RUNPOD_CPU_COUNT.
2271
+ dataset_num_proc: int | None
2272
+
2273
+ # Deduplicates datasets and test_datasets with identical entries
2274
+ dataset_exact_deduplication: bool | None
2275
+ # Keep dataset in memory while preprocessing. Only needed if cached dataset is taking
2276
+ # too much storage
2277
+ dataset_keep_in_memory: bool | None
2278
+ dataloader_pin_memory: bool | None
2279
+ dataloader_num_workers: int | None
2280
+ dataloader_prefetch_factor: int | None
2281
+ dataloader_drop_last: bool | None
2282
+
2283
+ accelerator_config: dict[str, Any] | None
2284
+
2285
+ remove_unused_columns: bool | None
2286
+
2287
+ # Push prepared dataset to hub - repo_org/repo_name
2288
+ push_dataset_to_hub: str | None
2289
+ # Whether to use hf `use_auth_token` for loading datasets. Useful for fetching private
2290
+ # datasets. Required to be true when used in combination with `push_dataset_to_hub`
2291
+ hf_use_auth_token: bool | None
2292
+
2293
+ device: Any | None
2294
+ # Passed through to transformers when loading the model when launched without
2295
+ # accelerate. Use `sequential` when training w/ model parallelism to limit memory
2296
+ device_map: Any | None
2297
+ world_size: int | None
2298
+ # Don't mess with this, it's here for accelerate and torchrun
2299
+ local_rank: int | None
2300
+ ddp: bool | None
2301
+
2302
+ # Seed for reproducibility
2303
+ seed: int | None
2304
+ # Advanced DDP Arguments - timeout
2305
+ ddp_timeout: int | None
2306
+ # Advanced DDP Arguments - bucket cap in MB
2307
+ ddp_bucket_cap_mb: int | None
2308
+ # Advanced DDP Arguments - broadcast buffers
2309
+ ddp_broadcast_buffers: bool | None
2310
+ ddp_find_unused_parameters: bool | None
2311
+
2312
+ # Approximate number of predictions sent to wandb depending on batch size. Enabled above
+ # 0. Default is 0
+ eval_table_size: int | None
+ # Total number of tokens generated for predictions sent to wandb. Default is 128
+ eval_max_new_tokens: int | None
+ # Whether to run causal language model evaluation for metrics in
+ # `eval_causal_lm_metrics`
+ do_causal_lm_eval: bool | None
+ # HF evaluate metrics used during evaluation. Default is ['sacrebleu', 'comet', 'ter',
+ # 'chrf', 'perplexity']
+ eval_causal_lm_metrics: list[str] | None
+ do_bench_eval: bool | None
+ bench_dataset: str | None
+ bench_split: str | None
+ metric_for_best_model: str | None
+ greater_is_better: bool | None
+
+ # High loss value, indicating the learning has broken down (a good estimate is ~2 times
+ # the loss at the start of training)
+ loss_watchdog_threshold: float | None
+ # Number of high-loss steps in a row before the trainer aborts (default: 3)
+ loss_watchdog_patience: int | None
+
+ # Run garbage collection every `gc_steps` steps. -1 will run on epoch end and before
+ # evaluations. Default is 0 (disabled).
+ gc_steps: int | None
+
+ # Use CUDA bf16. bool or 'full' for `bf16_full_eval`, or 'auto' for automatic detection.
+ # Requires >=Ampere
+ bf16: Literal['auto'] | bool | None = auto
+ # Use CUDA fp16
+ fp16: bool | None
+ # Enable FP8 mixed precision training using TorchAO. Best used in combination with
+ # torch.compile.
+ fp8: bool | None
+ # Enable FSDP float8 all-gather optimization for FP8 training. Can improve training
+ # speed by 10-15% when FSDP is enabled.
+ fp8_enable_fsdp_float8_all_gather: bool | None
+ # No AMP (automatic mixed precision) - requires >=Ampere
+ bfloat16: bool | None
+ # No AMP (automatic mixed precision)
+ float16: bool | None
+ # Use CUDA tf32 - requires >=Ampere
+ tf32: bool | None
+ float32: bool | None
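An illustrative YAML fragment combining these precision options (the values shown are examples for an Ampere-or-newer GPU, not universal recommendations):

```yaml
# automatically enable bfloat16 mixed precision on Ampere or newer GPUs
bf16: auto
# on pre-Ampere GPUs, leave bf16 unset and enable fp16 instead
# fp16: true
# TF32 matmuls are a common, safe speedup on Ampere+
tf32: true
```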
+
+ # Whether to use gradient checkpointing. Available options are: true, false, 'offload',
+ # 'offload_disk'.
+ # https://huggingface.co/docs/transformers/v4.18.0/en/performance#gradient-checkpointing
+ gradient_checkpointing: Literal['offload', 'offload_disk'] | bool | None = False
+ # Additional kwargs to pass to the trainer for gradient checkpointing
+ gradient_checkpointing_kwargs: dict[str, Any] | None
+ # Whether to offload activations. Available options are: true, false, 'legacy', 'disk'.
+ activation_offloading: Literal['legacy', 'disk'] | bool | None = False
+
+ unfrozen_parameters: list[str] | None
+
+ # The maximum length of an input to train with; this should typically be less than 2048,
+ # as most models have a token/context limit of 2048
+ sequence_len: int = 512
+ # What to do when a tokenized row exceeds sequence_len. 'drop' removes the row;
+ # 'truncate' slices tensors to sequence_len. Defaults to 'drop' for backward
+ # compatibility.
+ excess_length_strategy: Literal['drop', 'truncate'] | None
+ # The maximum length of an input for evaluation. If not specified, defaults to
+ # sequence_len
+ eval_sequence_len: int | None
+ min_sample_len: int | None
+ # Maximum prompt length for RL training
+ max_prompt_len: int | None
+ # Use efficient multi-packing with block diagonal attention and per-sequence
+ # position_ids. Recommended to set to 'true'
+ sample_packing: bool | None
+ # The number of samples packed at a time. Increasing the following values helps with
+ # packing, but usually only slightly (<1%).
+ sample_packing_group_size: int | None = 100000
+ # The number of samples which can be packed into one sequence. Increase if using a large
+ # sequence_len with many short samples.
+ sample_packing_bin_size: int | None = 200
+ # Whether to pack samples sequentially
+ sample_packing_sequentially: bool | None
+ # The multiprocessing start method to use for packing. Should be 'fork', 'spawn' or
+ # 'forkserver'
+ sample_packing_mp_start_method: str | None
+ # Set to 'false' if getting errors during eval with sample_packing on
+ eval_sample_packing: bool | None
+ # Pad inputs so each step uses constant-sized buffers. This will reduce memory
+ # fragmentation and may prevent OOMs by re-using memory more efficiently. Defaults to
+ # True if `sample_packing` is enabled
+ pad_to_sequence_len: bool | None
+ # Whether to use sequential sampling for curriculum learning
+ curriculum_sampling: bool | None
+ multipack_real_batches: bool | None
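A minimal sample-packing sketch in YAML (values are illustrative; pick `sequence_len` for your model's context window):

```yaml
sequence_len: 2048
# pack multiple short samples into each sequence_len-sized block
sample_packing: true
# disable if evaluation errors occur with packing enabled
eval_sample_packing: false
# constant-size buffers reduce memory fragmentation
pad_to_sequence_len: true
```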
+
+ # Use batch flattening for speedups when not using sample_packing
+ batch_flattening: Literal['auto'] | bool | None
+
+ use_pose: bool | None
+ pose_split_on_token_ids: list[int] | None
+ pose_max_context_len: int | None
+ pose_num_chunks: int | None
+
+ pretrain_multipack_buffer_size: int | None
+ # whether to prevent cross attention for packed sequences during pretraining
+ pretrain_multipack_attn: bool | None = True
+ # whether to concatenate samples during pretraining
+ pretraining_sample_concatenation: bool | None
+
+ # Use streaming mode for loading datasets
+ streaming: bool | None
+ # Buffer size for multipack streaming datasets
+ streaming_multipack_buffer_size: int | None = 10000
+
+ # Whether to use xformers attention patch https://github.com/facebookresearch/xformers
+ xformers_attention: bool | None
+ # Whether to use scaled-dot-product attention https://pytorch.org/docs/stable/generated/
+ # torch.nn.functional.scaled_dot_product_attention.html
+ sdp_attention: bool | None
+ # Shifted-sparse attention (only llama) - https://arxiv.org/pdf/2309.12307.pdf
+ s2_attention: bool | None
+ flex_attention: bool | None
+ flex_attn_compile_kwargs: dict[str, Any] | None
+ # Whether to use flash attention patch https://github.com/Dao-AILab/flash-attention
+ flash_attention: bool | None
+ # Whether to use flash-attention cross entropy implementation - advanced use only
+ flash_attn_cross_entropy: bool | None
+ # Whether to use flash-attention rms norm implementation - advanced use only
+ flash_attn_rms_norm: bool | None
+ # Whether to fuse part of the MLP into a single operation
+ flash_attn_fuse_mlp: bool | None
+ # Whether to use bettertransformers
+ flash_optimum: bool | None
+
+ eager_attention: bool | None
+
+ # Specify a custom attention implementation, used mostly for kernels.
+ attn_implementation: str | None
+
+ unsloth_cross_entropy_loss: bool | None
+ unsloth_lora_mlp: bool | None
+ unsloth_lora_qkv: bool | None
+ unsloth_lora_o: bool | None
+ unsloth_rms_norm: bool | None
+ unsloth_rope: bool | None
+
+
+ # Apply custom LoRA autograd functions and activation function Triton kernels for speed
+ # and memory savings. See: https://docs.axolotl.ai/docs/lora_optims.html
+ lora_mlp_kernel: bool | None
+ # Apply custom LoRA autograd functions and activation function Triton kernels for speed
+ # and memory savings. See: https://docs.axolotl.ai/docs/lora_optims.html
+ lora_qkv_kernel: bool | None
+ # Apply custom LoRA autograd functions and activation function Triton kernels for speed
+ # and memory savings. See: https://docs.axolotl.ai/docs/lora_optims.html
+ lora_o_kernel: bool | None
+
+ # Whether to use chunked cross entropy loss for memory efficiency
+ chunked_cross_entropy: bool | None
+ # Number of chunks to use for chunked cross entropy loss
+ chunked_cross_entropy_num_chunks: int | None
+
+ # Whether to use ALST tiled mlp for memory efficient long context
+ tiled_mlp: bool | None
+
+ # Number of shards to use for ALST tiled mlp. If unset, it will be set based on
+ # seqlen/hidden_size
+ tiled_mlp_num_shards: int | None
+
+ # Whether to use original mlp for ALST tiled mlp. Otherwise uses a generic MLP based on
+ # llama.
+ tiled_mlp_use_original_mlp: bool | None = True
+
+ llama4_linearized_experts: bool | None
+
+ # Deepspeed config path. e.g., deepspeed_configs/zero3.json
+ deepspeed: str | dict[str, Any] | None
+ # Whether to use deepcompile for faster training with deepspeed
+ deepcompile: bool | None
+ # FSDP configuration
+ fsdp: list[str] | None
+
+ # FSDP configuration options
+ fsdp_config: FSDPConfig | None
+ # For FSDPConfig:
+ # Enable activation checkpointing to reduce memory usage during forward passes
+ activation_checkpointing: bool | None
+ # Offload parameters to CPU to reduce GPU memory usage
+ offload_params: bool | None
+ # Synchronize module states across all processes
+ sync_module_states: bool | None
+ # Enable CPU RAM efficient loading to reduce memory usage during model loading
+ cpu_ram_efficient_loading: bool | None
+ # Disabling this enables swap memory usage for resource-constrained setups when
+ # offload_params is enabled.
+ cpu_offload_pin_memory: bool | None
+ # Use original parameters instead of flattened parameters
+ use_orig_params: bool | None
+
+ # Type of state dict to use for saving/loading checkpoints
+ state_dict_type: Literal['FULL_STATE_DICT', 'LOCAL_STATE_DICT', 'SHARDED_STATE_DICT'] | None
+ # Final state dict type to use after training completion
+ final_state_dict_type: Literal['FULL_STATE_DICT', 'LOCAL_STATE_DICT', 'SHARDED_STATE_DICT'] | None
+
+ # Policy for automatically wrapping modules with FSDP
+ auto_wrap_policy: Literal['TRANSFORMER_BASED_WRAP', 'SIZE_BASED_WRAP'] | None
+ # Class name of transformer layers to wrap (e.g., 'LlamaDecoderLayer')
+ transformer_layer_cls_to_wrap: str | None
+
+ # Reshard parameters after forward pass to save memory
+ reshard_after_forward: bool | None
+ # Mixed precision policy for FSDP (e.g., 'fp16', 'bf16')
+ mixed_precision_policy: str | None
+
+ # FSDP version
+ fsdp_version: int | None
+ fsdp_final_state_dict_type: Literal['FULL_STATE_DICT', 'LOCAL_STATE_DICT', 'SHARDED_STATE_DICT'] | None
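Putting the FSDP fields together, a sketch of an `fsdp_config` block might look like this (the layer class name is model-specific; `LlamaDecoderLayer` is an example for Llama-family models):

```yaml
fsdp_version: 2
fsdp_config:
  offload_params: false
  cpu_ram_efficient_loading: true
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: LlamaDecoderLayer
  state_dict_type: FULL_STATE_DICT
  reshard_after_forward: true
```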
+
+ # How much of the dataset to set aside as evaluation. 1 = 100%, 0.50 = 50%, etc. 0 for
+ # no eval.
+ val_set_size: float | None = 0.0
+
+ # Number of devices to shard across. If not set, will use all available devices.
+ dp_shard_size: int | None
+ # Number of devices to replicate across.
+ dp_replicate_size: int | None
+ # Deprecated: use `context_parallel_size` instead
+ sequence_parallel_degree: int | None
+ # Set to a divisor of the number of GPUs available to split sequences into chunks of
+ # equal size. Use in long context training to prevent OOM when sequences cannot fit into
+ # a single GPU's VRAM. E.g., if 4 GPUs are available, set this value to 2 to split each
+ # sequence into two equal-sized subsequences, or set to 4 to split into four equal-sized
+ # subsequences. See https://docs.axolotl.ai/docs/sequence_parallelism.html for more
+ # details.
+ context_parallel_size: int | None
+ # Optional; strides across the key dimension. Larger values use more memory but should
+ # make training faster. Must evenly divide the number of KV heads in your model.
+ heads_k_stride: int | None
+ # One of 'varlen_llama3', 'batch_ring', 'batch_zigzag', 'batch_stripe'. Defaults to
+ # 'varlen_llama3' in the sample packing case, and 'batch_ring' in the non-sample packing
+ # case.
+ ring_attn_func: RingAttnFunc | None
+ # Number of tensor parallel processes in TP group. Only supported with DeepSpeed AutoTP.
+ tensor_parallel_size: int | None
+
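A possible long-context parallelism layout, assuming 8 GPUs (illustrative values; `context_parallel_size` must divide the GPU count):

```yaml
# 8 GPUs total: each sequence is split 4 ways, replicated across 2 groups
context_parallel_size: 4
dp_replicate_size: 2
# optional tuning knob; must evenly divide the model's KV head count
heads_k_stride: 1
```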
+ # Add or change special tokens. If you add tokens here, you don't need to add them to
+ # the `tokens` list.
+ special_tokens: SpecialTokensConfig | None
+ # For SpecialTokensConfig:
+ bos_token: str | None
+ eos_token: str | None
+ pad_token: str | None
+ unk_token: str | None
+ additional_special_tokens: list[str] | None
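For example, a `special_tokens` block plus extra vocabulary tokens might look like this (the token strings are illustrative, ChatML-style):

```yaml
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  pad_token: "<pad>"
tokens:
  - "<|im_start|>"
  - "<|im_end|>"
```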
2564
+
2565
+ # Add extra tokens to the tokenizer
2566
+ tokens: list[str] | None
2567
+ # Mapping token_id to new_token_string to override reserved added_tokens in the
2568
+ # tokenizer. Only works for tokens that are not part of the base vocab (aka are
2569
+ # added_tokens). Can be checked if they exist in tokenizer.json added_tokens.
2570
+ added_tokens_overrides: dict[int, str] | None
2571
+
+ # Whether to use torch.compile and which backend to use. Setting to `auto` will enable
+ # torch.compile when torch>=2.6.0
+ torch_compile: Literal['auto'] | bool | None
+ # Backend to use for torch.compile
+ torch_compile_backend: str | None
+ torch_compile_mode: Literal['default', 'reduce-overhead', 'max-autotune'] | None
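A minimal torch.compile block (values illustrative; `inductor` is PyTorch's default compile backend):

```yaml
torch_compile: auto
torch_compile_backend: inductor
torch_compile_mode: default
```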
+
+ # Maximum number of iterations to train for. It takes precedence over num_epochs, which
+ # means that if both are set, num_epochs is not guaranteed. E.g., when 1 epoch is 1000
+ # steps, `num_epochs: 2` and `max_steps: 100` will train for only 100 steps
+ max_steps: int | None
+ # Number of warmup steps. Cannot be used with warmup_ratio
+ warmup_steps: int | None
+ # Warmup ratio. Cannot be used with warmup_steps
+ warmup_ratio: float | None
+ # Leave empty to eval at each epoch, integer for every N steps, float for a fraction of
+ # total steps
+ eval_steps: int | float | None
+ # Number of times per epoch to run evals, mutually exclusive with eval_steps
+ evals_per_epoch: int | None
+ # Set to `no` to skip evaluation, `epoch` to eval at the end of each epoch, or leave
+ # empty to infer from `eval_steps`
+ eval_strategy: str | None
+
+ # Leave empty to save at each epoch, integer for every N steps, float for a fraction of
+ # total steps
+ save_steps: int | float | None
+ # Number of times per epoch to save a checkpoint, mutually exclusive with save_steps
+ saves_per_epoch: int | None
+ # Set to `no` to skip checkpoint saves, `epoch` to save at the end of each epoch, `best`
+ # when a better result is achieved, or leave empty to infer from `save_steps`
+ save_strategy: str | None
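An illustrative evaluation/checkpoint schedule using the fractional forms of `eval_steps` and `save_steps` (values are examples only):

```yaml
# hold out 5% of the dataset for eval
val_set_size: 0.05
# evaluate and save every 10% of total training steps
eval_steps: 0.1
save_steps: 0.1
# keep only the 3 most recent checkpoints
save_total_limit: 3
```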
+ # Maximum number of checkpoints kept at a time
+ save_total_limit: int | None
+ # Whether to checkpoint a model after the first step of training. Defaults to False.
+ save_first_step: bool | None
+
+ # Logging frequency
+ logging_steps: int | None
+ # Stop training after this many evaluation losses have increased in a row.
+ # https://huggingface.co/transformers/v4.2.2/_modules/transformers/trainer_callback.html#EarlyStoppingCallback
+ early_stopping_patience: int | None
+ load_best_model_at_end: bool | None = False
+ # Save only the model weights, skipping the optimizer. Using this means you can't resume
+ # from checkpoints.
+ save_only_model: bool | None = False
+ # Use tensorboard for logging
+ use_tensorboard: bool | None
+ # Enable the pytorch profiler to capture the first N steps of training to the
+ # output_dir. See https://pytorch.org/blog/understanding-gpu-memory-1/ for more
+ # information. Snapshots can be visualized @ https://pytorch.org/memory_viz
+ profiler_steps: int | None
+ # Which step to start the profiler at. Useful for only capturing a few steps mid-run.
+ profiler_steps_start: int | None = 0
+ # Whether to report tokens per second at the end of training. This is not
+ # supported with pre-training datasets.
+ include_tokens_per_second: bool | None
+ # Whether to report tokens per second per GPU during training by measuring
+ # throughput of non-padding tokens.
+ include_tkps: bool | None = True
+ # NEFT https://arxiv.org/abs/2310.05914; set this to a number (paper default is 5) to
+ # add noise to embeddings. Currently only supported on Llama and Mistral
+ neftune_noise_alpha: float | None
+
+ # Parameter controlling the relative ratio loss weight in the ORPO loss. Passed to
+ # `beta` in `ORPOConfig` due to trl mapping.
+ orpo_alpha: float | None
+ # Weighting of the NLL term in the loss from the RPO paper
+ rpo_alpha: float | None
+ # Target reward margin for the SimPO loss
+ simpo_gamma: float | None
+ # Weight of the BC regularizer
+ cpo_alpha: float | None
+
+ # Factor for the desirable loss term in the KTO loss
+ kto_desirable_weight: float | None
+ # Factor for the undesirable loss term in the KTO loss
+ kto_undesirable_weight: float | None
+ # The beta parameter for RL training
+ rl_beta: float | None
+
+ # Defines the max memory usage per gpu on the system. Passed through to transformers
+ # when loading the model.
+ max_memory: dict[int | Literal['cpu', 'disk'], int | str] | None
+ # Limit the memory for all available GPUs to this amount (if an integer, expressed in
+ # gigabytes); default: unset
+ gpu_memory_limit: int | str | None
+ # Whether to use low_cpu_mem_usage
+ low_cpu_mem_usage: bool | None
+
+ # The name of the chat template to use for training. The following values are supported:
+ # tokenizer_default: Uses the chat template that is available in the
+ # tokenizer_config.json. If the chat template is not available in the tokenizer, it will
+ # raise an error. This is the default value.
+ # alpaca/inst/chatml/gemma/cohere/llama3/phi_3/deepseek_v2/jamba: These chat templates
+ # are available in the axolotl codebase at src/axolotl/utils/chat_templates.py.
+ # tokenizer_default_fallback_*: where * is the name of the chat template to fall back
+ # to, e.g. tokenizer_default_fallback_chatml. This is useful when the chat template is
+ # not available in the tokenizer.
+ # jinja: Uses a custom jinja template for the chat template. The custom jinja template
+ # should be provided in the chat_template_jinja field.
+ # The selected chat template will be saved to the tokenizer_config.json for easier
+ # inferencing
+ chat_template: ChatTemplate | Annotated[str, StringConstraints(pattern='^tokenizer_default_fallback_')] | None
+ # Custom jinja template or path to jinja file for chat template. This will only be used
+ # if chat_template is set to `jinja` or `null` (in which case chat_template is
+ # automatically set to `jinja`). Default is null.
+ chat_template_jinja: str | None
+ # Additional kwargs to pass to the chat template. This is useful for customizing the
+ # chat template. For example, you can pass `thinking=False` to add a generation prompt
+ # to the chat template.
+ chat_template_kwargs: dict[str, Any] | None
+ # Custom EOT (End-of-Turn) tokens to mask/unmask during training. These tokens mark the
+ # boundaries between conversation turns. For example: ['/INST', '</s>',
+ # '[/SYSTEM_PROMPT]']. If not specified, defaults to just the model's eos_token. This is
+ # useful for templates that use multiple delimiter tokens.
+ eot_tokens: list[str] | None
+ # Changes the default system message. Currently only supports chatml.
+ default_system_message: str | None
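A sketch of a chat-template block using the fallback form described above (the EOT token shown assumes a ChatML-style template):

```yaml
# use the tokenizer's template, falling back to chatml when it has none
chat_template: tokenizer_default_fallback_chatml
eot_tokens:
  - "<|im_end|>"
```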
+
+ # Token index or indices to adjust embedding weights to the mean of the other tokens.
+ # This is useful when the model has untrained embeddings.
+ fix_untrained_tokens: int | list[int] | None
+
+ is_preprocess: bool | None
+ preprocess_iterable: bool | None
+
+ # Total number of tokens - internal use
+ total_num_tokens: int | None
+ total_supervised_tokens: int | None
+ # You can set these packing optimizations AFTER starting a training run at least once.
+ # The trainer will provide recommended values for these settings.
+ sample_packing_eff_est: float | None
+ axolotl_config_path: str | None
+
+
+ # Internal use only - Used to identify what the model is based on
+ is_falcon_derived_model: bool | None
+ # Internal use only - Used to identify what the model is based on
+ is_llama_derived_model: bool | None
+ # Internal use only - Used to identify what the model is based on. Please note that if
+ # you set this to true, `padding_side` will be set to 'left' by default
+ is_mistral_derived_model: bool | None
+ # Internal use only - Used to identify what the model is based on
+ is_qwen_derived_model: bool | None
+
+ # Add plugins to extend the pipeline. See `src/axolotl/integrations` for the available
+ # plugins or the doc below for more details.
+ # https://docs.axolotl.ai/docs/custom_integrations.html
+ plugins: list[str] | None
+
+ # This is the huggingface model that contains *.pt, *.safetensors, or *.bin files. This
+ # can also be a relative path to a model on disk
+ base_model: str (required)
+ # If the base_model repo on hf hub doesn't include configuration .json files, you can
+ # set that here, or leave this empty to default to base_model
+ base_model_config: str | None
+ cls_model_config: str | None
+ # Optional tokenizer configuration path in case you want to use a different tokenizer
+ # than the one defined in the base model
+ tokenizer_config: str | None
+ # use_fast option for tokenizer loading from_pretrained, defaults to True
+ tokenizer_use_fast: bool | None
+ # Whether to use the legacy tokenizer setting, defaults to True
+ tokenizer_legacy: bool | None
+ # Whether to use the mistral-common tokenizer. If set to True, it will use the
+ # mistral-common tokenizer.
+ tokenizer_use_mistral_common: bool | None
+ # Corresponding tokenizer for the model; AutoTokenizer is a good choice
+ tokenizer_type: str | None
+ # transformers processor class
+ processor_type: str | None
+ # Whether to save jinja files for the tokenizer; transformers default is True
+ tokenizer_save_jinja_files: bool | None = True
+ # Trust remote code for untrusted sources
+ trust_remote_code: bool | None
+
+ # Don't move the model to the device before sharding. Set to `false` to revert to legacy
+ # behavior.
+ experimental_skip_move_to_device: bool | None = True
+
+ # Use custom kernels, e.g. MegaBlocks.
+ use_kernels: bool | None
+
+ # Model loading quantization config
+ model_quantization_config: Literal['Mxfp4Config'] | None
+ # kwargs for model quantization config
+ model_quantization_config_kwargs: dict[str, Any] | None
+
+ # Where to save the full-finetuned model to
+ output_dir: str = ./model-out
+ # Push checkpoints to hub
+ hub_model_id: str | None
+ # How to push checkpoints to hub
+ hub_strategy: str | None
+ # Save model as safetensors (requires safetensors package). Default True
+ save_safetensors: bool | None = True
+
+ # This will attempt to quantize the model down to 8 bits and use the adam 8-bit
+ # optimizer
+ load_in_8bit: bool | None = False
+ # Use bitsandbytes 4 bit
+ load_in_4bit: bool | None = False
+
+ # Set to 'lora' or 'qlora', or leave blank to train all parameters in the
+ # original model
+ adapter: str | None
+ # If you already have a lora model trained that you want to load, put that here. This
+ # means after training, if you want to test the model, you should set this to the value
+ # of `output_dir`. Note that if you merge an adapter to the base model, a new
+ # subdirectory `merged` will be created under the `output_dir`.
+ lora_model_dir: str | None
+ lora_r: int | None
+ lora_alpha: int | None
+ lora_fan_in_fan_out: bool | None
+ lora_target_modules: str | list[str] | None
+ lora_target_parameters: str | list[str] | None
+ # If true, will target all linear modules
+ lora_target_linear: bool | None
+ # If you added new tokens to the tokenizer, you may need to save some LoRA modules
+ # because they need to know the new tokens. For LLaMA and Mistral, you need to save
+ # `embed_tokens` and `lm_head`. It may vary for other models. `embed_tokens` converts
+ # tokens to embeddings, and `lm_head` converts embeddings to token probabilities.
+ lora_modules_to_save: list[str] | None
+ lora_dropout: float | None = 0.0
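A compact LoRA adapter block (hyperparameter values are illustrative starting points, not tuned recommendations):

```yaml
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
# needed when new tokens were added to the tokenizer
lora_modules_to_save:
  - embed_tokens
  - lm_head
```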
+ # The layer indices to transform, otherwise, apply to all layers
+ peft_layers_to_transform: list[int] | None
+ peft_layers_pattern: list[str] | None
+
+ peft: PeftConfig | None
+ # For PeftConfig:
+ # Configuration options for loftq initialization for LoRA
+ loftq_config: LoftQConfig | None
+ # For LoftQConfig:
+ # typically 4 bits
+ loftq_bits: int = 4
+
+ # Whether to use DoRA.
+ peft_use_dora: bool | None
+ # Whether to use RSLoRA.
+ peft_use_rslora: bool | None
+ # List of layer indices to replicate.
+ peft_layer_replication: list[tuple[int, int]] | None
+ # How to initialize LoRA weights. Defaults to True, which is the MS original
+ # implementation.
+ peft_init_lora_weights: bool | str | None
+ # A list of token indices to fine-tune on the `embed_tokens` layer. Otherwise, a dict
+ # mapping an embedding layer name to its trainable token indices. See
+ # https://huggingface.co/docs/peft/v0.17.0/en/developer_guides/lora#efficiently-train-tokens-alongside-lora
+ peft_trainable_token_indices: list[int] | dict[str, list[int]] | None
+
+
+ # load qlora model in sharded format for FSDP using answer.ai technique.
+ qlora_sharded_model_loading: bool | None = False
+ # Do the LoRA/PEFT loading on CPU -- this is required if the base model is so large it
+ # takes up most or all of the available GPU VRAM, e.g. during a model and LoRA merge
+ lora_on_cpu: bool | None
+ # Whether you are training a 4-bit GPTQ quantized model
+ gptq: bool | None
+ # optional overrides to the bnb 4bit quantization configuration
+ bnb_config_kwargs: dict[str, Any] | None
+
+ # loraplus learning rate ratio lr_B / lr_A. Recommended value is 2^4.
+ loraplus_lr_ratio: float | None
+ # loraplus learning rate for lora embedding layers. Default value is 1e-6.
+ loraplus_lr_embedding: float | None = 1e-06
+
+ merge_lora: bool | None
+
+ # Whether to use ReLoRA. Use with jagged_restart_*steps options.
+ relora: bool | None
+ # threshold for optimizer magnitude when pruning
+ relora_prune_ratio: float | None
+ # True to perform lora weight merges on cpu during restarts, for modest gpu memory
+ # savings
+ relora_cpu_offload: bool | None
+
+ # how often to reset for jagged restarts
+ jagged_restart_steps: int | None
+ # how many warmup steps to take after reset for jagged restarts
+ jagged_restart_warmup_steps: int | None
+ # how many anneal steps to take before reset for jagged restarts
+ jagged_restart_anneal_steps: int | None
+
+ # If greater than 1, backpropagation will be skipped and the gradients will be
+ # accumulated for the given number of steps.
+ gradient_accumulation_steps: int | None = 1
+ # The number of samples to include in each batch. This is the number of samples sent to
+ # each GPU. Batch size per gpu = micro_batch_size * gradient_accumulation_steps
+ micro_batch_size: int | None = 1
+ # Total batch size; we do not recommend setting this manually
+ batch_size: int | None
+ # Per-gpu micro batch size for evals, defaults to the value of micro_batch_size
+ eval_batch_size: int | None
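Worked example of the batch-size arithmetic above (values illustrative): per-GPU batch size is `micro_batch_size * gradient_accumulation_steps`, and the effective global batch multiplies that by the number of GPUs.

```yaml
micro_batch_size: 2
gradient_accumulation_steps: 8
# per-GPU batch size: 2 * 8 = 16
# on 4 GPUs, the effective global batch size is 16 * 4 = 64
```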
+
+ # whether to find batch size that fits in memory. Passed to underlying transformers
+ # Trainer
+ auto_find_batch_size: bool | None
+
+ # Whether to mask out or include the human's prompt from the training labels
+ train_on_inputs: bool | None = False
+ # Group similarly sized data to minimize padding. May be slower to start, as it must
+ # download and sort the entire dataset. Note that training loss may have an oscillating
+ # pattern with this enabled.
+ group_by_length: bool | None
+
+ learning_rate: str | float (required)
+ embedding_lr: float | None
+ embedding_lr_scale: float | None
+ # Specify weight decay
+ weight_decay: float | None = 0.0
+ # Specify optimizer
+ optimizer: OptimizerNames | CustomSupportedOptimizers | None = OptimizerNames.ADAMW_TORCH_FUSED
+ # Dictionary of arguments to pass to the optimizer
+ optim_args: str | dict[str, Any] | None
+ # The target modules to optimize, i.e. the module names that you would like to train,
+ # right now this is used only for GaLore algorithm
+ optim_target_modules: list[str] | Literal['all_linear'] | None
+ # Path to torch distx for optim 'adamw_anyprecision'
+ torchdistx_path: str | None
+ lr_scheduler: SchedulerType | Literal['one_cycle'] | Literal['rex'] | None = SchedulerType.COSINE
+ # Specify a scheduler and kwargs to use with the optimizer
+ lr_scheduler_kwargs: dict[str, Any] | None
+ lr_quadratic_warmup: bool | None
+ # decay lr to some percentage of the peak lr, e.g. cosine_min_lr_ratio=0.1 for 10% of
+ # peak lr
+ cosine_min_lr_ratio: float | None
+ # freeze lr at some percentage of the step, e.g. cosine_constant_lr_ratio=0.8 means
+ # start cosine_min_lr at 80% of training step
+ cosine_constant_lr_ratio: float | None
+ # Learning rate div factor
+ lr_div_factor: float | None
+
+ lr_groups: list[LrGroup] | None
+ # For LrGroup:
+ name: str (required)
+ modules: list[str] (required)
+ lr: float (required)
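For example, `lr_groups` can give the embedding layers a lower learning rate than the rest of the model (group name and values are illustrative):

```yaml
learning_rate: 2e-5
lr_groups:
  - name: embeddings
    modules:
      - embed_tokens
      - lm_head
    lr: 1e-5
```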
+
+ # adamw hyperparams
+ adam_epsilon: float | None
+ # only used for CAME Optimizer
+ adam_epsilon2: float | None
+ # adamw hyperparams
+ adam_beta1: float | None
+ # adamw hyperparams
+ adam_beta2: float | None
+ # only used for CAME Optimizer
+ adam_beta3: float | None
+
+ # Dion Optimizer learning rate
+ dion_lr: float | None
+ # Dion Optimizer momentum
+ dion_momentum: float | None
+ # Dion Optimizer: r/d fraction for low-rank approximation. Used to compute the low-rank
+ # dimension.
+ dion_rank_fraction: float | None = 1.0
+ # Dion Optimizer: Round up the low-rank dimension to a multiple of this number. This may
+ # be useful to ensure even sharding.
+ dion_rank_multiple_of: int | None = 1
+
+ # Gradient clipping max norm
+ max_grad_norm: float | None
+ num_epochs: float = 1.0
+
+ use_wandb: bool | None
+ # Set the name of your wandb run
+ wandb_name: str | None
+ # Set the ID of your wandb run
+ wandb_run_id: str | None
+ # "offline" to save run metadata locally and not sync to the server, "disabled" to turn
+ # off wandb
+ wandb_mode: str | None
+ # Your wandb project name
+ wandb_project: str | None
+ # A wandb Team name if using a Team
+ wandb_entity: str | None
+ wandb_watch: str | None
+ # "checkpoint" to log model to wandb Artifacts every `save_steps` or "end" to log only
+ # at the end of training
+ wandb_log_model: str | None
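A typical wandb block might look like this (all names are placeholder values):

```yaml
use_wandb: true
wandb_project: my-finetune
wandb_entity: my-team
wandb_name: llama3-qlora-run1
# log the model to wandb Artifacts only once, at the end of training
wandb_log_model: end
```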
2950
+
2951
+ use_mlflow: bool | None
2952
+ # URI to mlflow
2953
+ mlflow_tracking_uri: str | None
2954
+ # Your experiment name
2955
+ mlflow_experiment_name: str | None
2956
+ # Your run name
2957
+ mlflow_run_name: str | None
2958
+ # set to true to copy each saved checkpoint on each save to mlflow artifact registry
2959
+ hf_mlflow_log_artifacts: bool | None
2960
+
2961
+ # Enable or disable Comet integration.
2962
+ use_comet: bool | None
2963
+ # API key for Comet. Recommended to set via `comet login`.
2964
+ comet_api_key: str | None
2965
+ # Workspace name in Comet. Defaults to the user's default workspace.
2966
+ comet_workspace: str | None
2967
+ # Project name in Comet. Defaults to Uncategorized.
2968
+ comet_project_name: str | None
2969
+ # Identifier for the experiment. Used to append data to an existing experiment or
2970
+ # control the key of new experiments. Defaults to a random key.
2971
+ comet_experiment_key: str | None
2972
+ # Create a new experiment ("create") or log to an existing one ("get"). Default
2973
+ # ("get_or_create") auto-selects based on configuration.
2974
+ comet_mode: str | None
2975
+ # Set to True to log data to Comet server, or False for offline storage. Default is
2976
+ # True.
2977
+ comet_online: bool | None
2978
+ # Dictionary for additional configuration settings, see the doc for more details.
2979
+ comet_experiment_config: dict[str, Any] | None
2980
+
2981
+ # Enable OpenTelemetry metrics collection and Prometheus export
2982
+ use_otel_metrics: bool | None = False
2983
+ # Host to bind the OpenTelemetry metrics server to
2984
+ otel_metrics_host: str | None = "localhost"
2985
+ # Port for the Prometheus metrics HTTP server
2986
+ otel_metrics_port: int | None = 8000
2987
+
2988
+ # the number of active layers in LISA
2989
+ lisa_n_layers: int | None
2990
+ # how often to switch layers in LISA
2991
+ lisa_step_interval: int | None
2992
+ # path under the model to access the layers
2993
+ lisa_layers_attribute: str | None = "model.layers"
2994
+
2995
+ gradio_title: str | None
2996
+ gradio_share: bool | None
2997
+ gradio_server_name: str | None
2998
+ gradio_server_port: int | None
2999
+ gradio_max_new_tokens: int | None
3000
+ gradio_temperature: float | None
3001
+
3002
+ use_ray: bool = False
3003
+ ray_run_name: str | None
3004
+ ray_num_workers: int = 1
3005
+ resources_per_worker: dict
3006
+
3007
+ # The size of the image to resize to. It can be an integer (resized into padded-square
3008
+ # image) or a tuple (width, height). If not provided, we will attempt to load from
3009
+ # preprocessor.size, otherwise, images won't be resized.
3010
+ image_size: int | tuple[int, int] | None
3011
+ # The resampling algorithm to use for image resizing. Default is bilinear. Please refer
3012
+ # to PIL.Image.Resampling for more details.
3013
+ image_resize_algorithm: Literal['bilinear', 'bicubic', 'lanczos'] | Resampling | None
3014
+
3015
+ # optional overrides to the base model configuration
3016
+ overrides_of_model_config: dict[str, Any] | None
3017
+ # optional overrides to the base model loading via from_pretrained
3018
+ overrides_of_model_kwargs: dict[str, Any] | None
3019
+ # If you want to specify the type of model to load, AutoModelForCausalLM is a good
3020
+ # choice too
3021
+ type_of_model: str | None
3022
+ # You can choose a specific model revision from the Hugging Face Hub
3023
+ revision_of_model: str | None
3024
+
3025
+ max_packed_sequence_len: int | None
3026
+ rope_scaling: Any | None
3027
+ noisy_embedding_alpha: float | None
3028
+ dpo_beta: float | None
3029
+ evaluation_strategy: str | None
3030
+ ```
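
To make the listing above concrete, here is a sketch of a config fragment setting a few of these fields (the values are illustrative placeholders, not tuned recommendations):

```yaml
lr: 2.0e-4                 # required
adam_beta1: 0.9            # adamw hyperparams
adam_beta2: 0.95
adam_epsilon: 1.0e-8
max_grad_norm: 1.0         # gradient clipping max norm
num_epochs: 1.0

use_wandb: true
wandb_project: my-project  # placeholder project name
wandb_name: lora-run-1     # placeholder run name
wandb_mode: offline        # keep run metadata local
```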
3031
+
3032
+ ---
3033
+
3034
+ ## Axolotl
3035
+
3036
+ **URL:** https://docs.axolotl.ai
3037
+
3038
+ **Contents:**
3039
+ - 🎉 Latest Updates
3040
+ - ✨ Overview
3041
+ - 🚀 Quick Start - LLM Fine-tuning in Minutes
3042
+ - Google Colab
3043
+ - Installation
3044
+ - Using pip
3045
+ - Using Docker
3046
+ - Cloud Providers
3047
+ - Your First Fine-tune
3048
+ - 📚 Documentation
3049
+
3050
+ A Free and Open Source LLM Fine-tuning Framework
3051
+
3052
+ Axolotl is a free and open-source tool designed to streamline post-training and fine-tuning for the latest large language models (LLMs).
3053
+
3054
+ Installing with Docker can be less error prone than installing in your own environment.
3055
+
3056
+ Other installation approaches are described here.
3057
+
3058
+ That’s it! Check out our Getting Started Guide for a more detailed walkthrough.
3059
+
3060
+ Contributions are welcome! Please see our Contributing Guide for details.
3061
+
3062
+ Interested in sponsoring? Contact us at [email protected]
3063
+
3064
+ If you use Axolotl in your research or projects, please cite it as follows:
3065
+
3066
+ This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
3067
+
3068
+ **Examples:**
3069
+
3070
+ Example 1 (bash):
3071
+ ```bash
3072
+ pip3 install -U packaging==23.2 setuptools==75.8.0 wheel ninja
3073
+ pip3 install --no-build-isolation axolotl[flash-attn,deepspeed]
3074
+
3075
+ # Download example axolotl configs, deepspeed configs
3076
+ axolotl fetch examples
3077
+ axolotl fetch deepspeed_configs # OPTIONAL
3078
+ ```
3079
+
3080
+ Example 2 (bash):
3081
+ ```bash
3082
+ docker run --gpus '"all"' --rm -it axolotlai/axolotl:main-latest
3083
+ ```
3084
+
3085
+ Example 3 (bash):
3086
+ ```bash
3087
+ # Fetch axolotl examples
3088
+ axolotl fetch examples
3089
+
3090
+ # Or, specify a custom path
3091
+ axolotl fetch examples --dest path/to/folder
3092
+
3093
+ # Train a model using LoRA
3094
+ axolotl train examples/llama-3/lora-1b.yml
3095
+ ```
3096
+
3097
+ Example 4 (bibtex):
3098
+ ```bibtex
3099
+ @software{axolotl,
3100
+ title = {Axolotl: Open Source LLM Post-Training},
3101
+ author = {{Axolotl maintainers and contributors}},
3102
+ url = {https://github.com/axolotl-ai-cloud/axolotl},
3103
+ license = {Apache-2.0},
3104
+ year = {2023}
3105
+ }
3106
+ ```
3107
+
3108
+ ---
3109
+
3110
+ ## Quickstart
3111
+
3112
+ **URL:** https://docs.axolotl.ai/docs/getting-started.html
3113
+
3114
+ **Contents:**
3115
+ - Quickstart
3116
+ - 1 Quick Example
3117
+ - 2 Understanding the Process
3118
+ - 2.1 The Configuration File
3119
+ - 2.2 Training
3120
+ - 3 Your First Custom Training
3121
+ - 4 Common Tasks
3122
+ - 4.1 Testing Your Model
3123
+ - 4.2 Using a UI
3124
+ - 4.3 Preprocessing Data
3125
+
3126
+ This guide will walk you through your first model fine-tuning project with Axolotl.
3127
+
3128
+ Let’s start by fine-tuning a small language model using LoRA. This example uses a 1B parameter model to ensure it runs on most GPUs. Assuming axolotl is installed (if not, see our Installation Guide).
3129
+
3130
+ That’s it! Let’s understand what just happened.
3131
+
3132
+ The YAML configuration file controls everything about your training. Here’s what (part of) our example config looks like:
3133
+
3134
+ load_in_8bit: true and adapter: lora enable LoRA adapter fine-tuning.
3135
+
3136
+ See our config options for more details.
3137
+
3138
+ When you run axolotl train, Axolotl:
3139
+
3140
+ Let’s modify the example for your own data:
3141
+
3142
+ This specific config is for LoRA fine-tuning a model on instruction-tuning data in the alpaca dataset format.
3143
+
3144
+ Please see our Dataset Formats for more dataset formats and how to format them.
3145
+
3146
+ The same yaml file is used for training, inference, and merging.
3147
+
3148
+ After training, test your model:
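
For example, with the quickstart config (the `--lora-model-dir` path is the `output_dir` from your config):

```bash
axolotl inference examples/llama-3/lora-1b.yml \
    --lora-model-dir="./outputs/lora-out"
```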
3149
+
3150
+ More details can be found in Inference.
3151
+
3152
+ Launch a Gradio interface:
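
For example, reusing the quickstart config (paths are illustrative):

```bash
axolotl inference examples/llama-3/lora-1b.yml \
    --lora-model-dir="./outputs/lora-out" --gradio
```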
3153
+
3154
+ For large datasets, preprocess first:
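
For example, assuming the quickstart config:

```bash
axolotl preprocess examples/llama-3/lora-1b.yml
```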
3155
+
3156
+ Please make sure to set dataset_prepared_path: in your config to set the path to save the prepared dataset.
3157
+
3158
+ More details can be found in Dataset Preprocessing.
3159
+
3160
+ To merge the LoRA weights back into the base model, run:
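
For example, assuming the quickstart config and its default `output_dir`:

```bash
axolotl merge-lora examples/llama-3/lora-1b.yml \
    --lora-model-dir="./outputs/lora-out"
```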
3161
+
3162
+ The merged model will be saved in the {output_dir}/merged directory.
3163
+
3164
+ More details can be found in Merging LoRA weights.
3165
+
3166
+ Now that you have the basics, you might want to:
3167
+
3168
+ Check our other guides for details on these topics:
3169
+
3170
+ **Examples:**
3171
+
3172
+ Example 1 (bash):
3173
+ ```bash
3174
+ axolotl fetch examples
3175
+ ```
3176
+
3177
+ Example 2 (bash):
3178
+ ```bash
3179
+ axolotl train examples/llama-3/lora-1b.yml
3180
+ ```
3181
+
3182
+ Example 3 (yaml):
3183
+ ```yaml
3184
+ base_model: NousResearch/Llama-3.2-1B
3185
+
3186
+ load_in_8bit: true
3187
+ adapter: lora
3188
+
3189
+ datasets:
3190
+ - path: teknium/GPT4-LLM-Cleaned
3191
+ type: alpaca
3192
+ dataset_prepared_path: last_run_prepared
3193
+ val_set_size: 0.1
3194
+ output_dir: ./outputs/lora-out
3195
+ ```
3196
+
3197
+ Example 4 (yaml):
3198
+ ```yaml
3199
+ base_model: NousResearch/Nous-Hermes-llama-1b-v1
3200
+
3201
+ load_in_8bit: true
3202
+ adapter: lora
3203
+
3204
+ # Training settings
3205
+ micro_batch_size: 2
3206
+ num_epochs: 3
3207
+ learning_rate: 0.0003
3208
+
3209
+ # Your dataset
3210
+ datasets:
3211
+ - path: my_data.jsonl # Your local data file
3212
+ type: alpaca # Or other format
3213
+ ```
3214
+
3215
+ ---
3216
+
3217
+ ## Multipack (Sample Packing)
3218
+
3219
+ **URL:** https://docs.axolotl.ai/docs/multipack.html
3220
+
3221
+ **Contents:**
3222
+ - Multipack (Sample Packing)
3223
+ - Visualization of Multipack with Flash Attention
3224
+ - Multipack without Flash Attention
3225
+
3226
+ Because Flash Attention simply drops the attention mask, we do not need to construct a 4d attention mask. We only need to concatenate the sequences into a single batch and let flash attention know where each new sequence begins.
3227
+
3228
+ 4k context, bsz = 4; each character represents 256 tokens; X represents a padding token.
3229
+
3230
+ After padding to the longest input in each step:
3231
+
3232
+ With packing (note it’s the same effective number of tokens per step, but a true bsz of 1):
3233
+
3234
+ cu_seqlens: [[ 0, 11, 17, 24, 28, 36, 41, 44, 48, 51, 55, 60, 64]]
3235
+
3236
+ Multipack can still be achieved without Flash Attention, but with lower packing efficiency, since without Flash Attention we cannot join multiple batches into a single batch due to context-length limits. We can use either PyTorch’s Scaled Dot Product Attention implementation or the native PyTorch attention implementation along with 4d attention masks to pack sequences together and avoid cross-attention.
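
The cu_seqlens values above are just the cumulative sum of per-sample sequence lengths, prefixed with 0, so the attention kernel knows where each packed sequence begins. A minimal sketch (`cu_seqlens` here is a hypothetical helper for illustration, not Axolotl's API):

```python
from itertools import accumulate

# Cumulative sequence offsets for packed samples: [0] followed by
# the running total of lengths. Each adjacent pair (start, end)
# delimits one sequence inside the packed batch.
def cu_seqlens(lengths):
    return [0] + list(accumulate(lengths))

# Sample lengths of A, B, C, D from the first batch above
print(cu_seqlens([11, 6, 7, 4]))  # [0, 11, 17, 24, 28]
```

These offsets match the first entries of the cu_seqlens shown above for the packed 4k batch.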
3237
+
3238
+ **Examples:**
3239
+
3240
+ Example 1 (unknown):
3241
+ ```unknown
3242
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
3243
+ [[ A A A A A A A A A A A ]
3244
+ B B B B B B ]
3245
+ C C C C C C C ]
3246
+ D D D D ]]
3247
+
3248
+ [[ E E E E E E E E ]
3249
+ [ F F F F ]
3250
+ [ G G G ]
3251
+ [ H H H H ]]
3252
+
3253
+ [[ I I I ]
3254
+ [ J J J ]
3255
+ [ K K K K K]
3256
+ [ L L L ]]
3257
+ ```
3258
+
3259
+ Example 2 (unknown):
3260
+ ```unknown
3261
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
3262
+ [[ A A A A A A A A A A A ]
3263
+ B B B B B B X X X X X X ]
3264
+ C C C C C C C X X X X ]
3265
+ D D D D X X X X X X X ]]
3266
+
3267
+ [[ E E E E E E E E ]
3268
+ [ F F F F X X X X ]
3269
+ [ G G G X X X X X ]
3270
+ [ H H H H X X X X ]]
3271
+
3272
+ [[ I I I X X ]
3273
+ [ J J J X X ]
3274
+ [ K K K K K ]
3275
+ [ L L L X X ]]
3276
+ ```
3277
+
3278
+ Example 3 (unknown):
3279
+ ```unknown
3280
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
3281
+ [[ A A A A A A A A A A A B B B B B
3282
+ B C C C C C C C D D D D E E E E
3283
+ E E E E F F F F F G G G H H H H
3284
+ I I I J J J J K K K K K L L L X ]]
3285
+ ```
3286
+
3287
+ ---
3288
+
3289
+ ## Batch size vs Gradient accumulation
3290
+
3291
+ **URL:** https://docs.axolotl.ai/docs/batch_vs_grad.html
3292
+
3293
+ **Contents:**
3294
+ - Batch size vs Gradient accumulation
3295
+
3296
+ Gradient accumulation means accumulating gradients over several mini-batches and updating the model weights afterward. When the samples in each batch are diverse, this technique doesn’t significantly impact learning.
3297
+
3298
+ This method allows for effective training with larger effective batch sizes without needing proportionally larger memory. Here’s why:
3299
+
3300
+ Memory Consumption with Batch Size: The primary reason increasing the batch size impacts memory is due to the storage requirements for intermediate activations. When you forward propagate a batch through a network, you have to store the activations at each layer for each sample in the batch, because these activations are used during backpropagation to compute gradients. Therefore, larger batches mean more activations, leading to greater GPU memory consumption.
3301
+
3302
+ Gradient Accumulation: With gradient accumulation, you’re effectively simulating a larger batch size by accumulating gradients over several smaller batches (or micro-batches). However, at any given time, you’re only forward and backward propagating a micro-batch. This means you only store activations for the micro-batch, not the full accumulated batch. As a result, you can simulate the effect of a larger batch size without the memory cost of storing activations for a large batch.
3303
+
3304
+ Example 1: Micro batch size: 3 Gradient accumulation steps: 2 Number of GPUs: 3 Total batch size = 3 * 2 * 3 = 18
3305
+
3306
+ Example 2: Micro batch size: 2 Gradient accumulation steps: 1 Number of GPUs: 3 Total batch size = 2 * 1 * 3 = 6
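
The arithmetic in both examples is a product of three factors. As a sketch (`total_batch_size` is a hypothetical helper, not an Axolotl function):

```python
# Effective batch size when combining micro-batching,
# gradient accumulation, and data parallelism.
def total_batch_size(micro_batch_size, gradient_accumulation_steps, num_gpus):
    return micro_batch_size * gradient_accumulation_steps * num_gpus

print(total_batch_size(3, 2, 3))  # Example 1 -> 18
print(total_batch_size(2, 1, 3))  # Example 2 -> 6
```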
3307
+
3308
+ **Examples:**
3309
+
3310
+ Example 1 (unknown):
3311
+ ```unknown
3312
+ | GPU 1 | GPU 2 | GPU 3 |
3313
+ |----------------|----------------|----------------|
3314
+ | S1, S2, S3 | S4, S5, S6 | S7, S8, S9 |
3315
+ | e1, e2, e3 | e4, e5, e6 | e7, e8, e9 |
3316
+ |----------------|----------------|----------------|
3317
+ | → (accumulate) | → (accumulate) | → (accumulate) |
3318
+ |----------------|----------------|----------------|
3319
+ | S10, S11, S12 | S13, S14, S15 | S16, S17, S18 |
3320
+ | e10, e11, e12 | e13, e14, e15 | e16, e17, e18 |
3321
+ |----------------|----------------|----------------|
3322
+ | → (apply) | → (apply) | → (apply) |
3323
+
3324
+ Accumulated gradient for the weight w1 after the second iteration (considering all GPUs):
3325
+ Total gradient for w1 = e1 + e2 + e3 + e4 + e5 + e6 + e7 + e8 + e9 + e10 + e11 + e12 + e13 + e14 + e15 + e16 + e17 + e18
3326
+
3327
+ Weight update for w1:
3328
+ w1_new = w1_old - learning rate x (Total gradient for w1 / 18)
3329
+ ```
3330
+
3331
+ Example 2 (unknown):
3332
+ ```unknown
3333
+ | GPU 1 | GPU 2 | GPU 3 |
3334
+ |-----------|-----------|-----------|
3335
+ | S1, S2 | S3, S4 | S5, S6 |
3336
+ | e1, e2 | e3, e4 | e5, e6 |
3337
+ |-----------|-----------|-----------|
3338
+ | → (apply) | → (apply) | → (apply) |
3339
+
3340
+ Accumulated gradient for the weight w1 (considering all GPUs):
3341
+ Total gradient for w1 = e1 + e2 + e3 + e4 + e5 + e6
3342
+
3343
+ Weight update for w1:
3344
+ w1_new = w1_old - learning rate × (Total gradient for w1 / 6)
3345
+ ```
3346
+
3347
+ ---
3348
+
3349
+ ## Debugging
3350
+
3351
+ **URL:** https://docs.axolotl.ai/docs/debugging.html
3352
+
3353
+ **Contents:**
3354
+ - Debugging
3355
+ - Table of Contents
3356
+ - General Tips
3357
+ - Debugging with VSCode
3358
+ - Background
3359
+ - Setup
3360
+ - Remote Hosts
3361
+ - Configuration
3362
+ - Customizing your debugger
3363
+ - Video Tutorial
3364
+
3365
+ This document provides some tips and tricks for debugging Axolotl. It also provides an example configuration for debugging with VSCode. A good debugging setup is essential to understanding how Axolotl code works behind the scenes.
3366
+
3367
+ While debugging it’s helpful to simplify your test scenario as much as possible. Here are some tips for doing so:
3368
+
3369
+ [!Important] All of these tips are incorporated into the example configuration for debugging with VSCode below.
3370
+
3371
+ Make sure you are using the latest version of axolotl: This project changes often and bugs get fixed fast. Check your git branch and make sure you have pulled the latest changes from main.
3372
+
3373
+ Eliminate concurrency: Restrict the number of processes to 1 for both training and data preprocessing:
3374
+
3375
+ Use a small dataset: Construct or use a small dataset from HF Hub. When using a small dataset, you will often have to make sure sample_packing: False and eval_sample_packing: False to avoid errors. If you are in a pinch and don’t have time to construct a small dataset but want to use one from the HF Hub, you can shard the data (this will still tokenize the entire dataset, but will only use a fraction of the data for training). For example, to shard the dataset into 20 pieces, add the following to your axolotl config:
3376
+
3377
+ Use a small model: A good example of a small model is TinyLlama/TinyLlama-1.1B-Chat-v1.0.
3378
+
3379
+ Minimize iteration time: Make sure the training loop finishes as fast as possible, with these settings.
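
One sketch of such settings as config overrides (these are standard config keys; tune to your case):

```yaml
max_steps: 1          # finish after a single optimizer step
micro_batch_size: 1   # smallest possible batches
val_set_size: 0       # skip validation entirely
```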
3380
+
3381
+ Clear Caches: Axolotl caches certain steps and so does the underlying HuggingFace trainer. You may want to clear some of these caches when debugging.
3382
+
3383
+ The below example shows how to configure VSCode to debug data preprocessing of the chat_template format. This is the format used when you have the following in your axolotl config:
3384
+
3385
+ [!Important] If you are already familiar with advanced VSCode debugging, you can skip the below explanation and look at the files .vscode/launch.json and .vscode/tasks.json for an example configuration.
3386
+
3387
+ [!Tip] If you prefer to watch a video, rather than read, you can skip to the video tutorial below (but doing both is recommended).
3388
+
3389
+ Make sure you have an editable install of Axolotl, which ensures that changes you make to the code are reflected at runtime. Run the following commands from the root of this project:
3390
+
3391
+ If you are developing on a remote host, you can easily use VSCode to debug remotely. To do so, follow the Remote - SSH guide. You can also see the video below on Docker and Remote SSH debugging.
3392
+
3393
+ The easiest way to get started is to modify the .vscode/launch.json file in this project. This is just an example configuration, so you may need to modify or copy it to suit your needs.
3394
+
3395
+ For example, to mimic the command cd devtools && CUDA_VISIBLE_DEVICES=0 accelerate launch -m axolotl.cli.train dev_chat_template.yml, you would use the below configuration.¹ Note that we add additional flags that override the axolotl config and incorporate the tips above (see the comments). We also set the working directory to devtools and set the env variable HF_HOME to a temporary folder that is later partially deleted. This is because we want to delete the HF dataset cache before each run in order to ensure that the data preprocessing code is run from scratch.
3396
+
3397
+ Additional notes about this configuration:
3398
+
3399
+ [!Tip] You may not want to delete these folders. For example, if you are debugging model training instead of data pre-processing, you may NOT want to delete the cache or output folders. You may also need to add additional tasks to the tasks.json file depending on your use case.
3400
+
3401
+ Below is the .vscode/tasks.json file that defines the cleanup-for-dataprep task. This task is run before each debugging session when you use the above configuration. Note how there are two tasks that delete the two folders mentioned above. The third task, cleanup-for-dataprep, is a composite task that combines the two. A composite task is necessary because VSCode does not allow you to specify multiple tasks in the preLaunchTask argument of the launch.json file.
3402
+
3403
+ Your debugging use case may differ from the example above. The easiest thing to do is to put your own axolotl config in the devtools folder and modify the launch.json file to use your config. You may also want to modify the preLaunchTask to delete different folders or not delete anything at all.
3404
+
3405
+ The following video tutorial walks through the above configuration and demonstrates how to debug with VSCode, (click the image below to watch):
3406
+
3407
+ Using official Axolotl Docker images is a great way to debug your code, and is a very popular way to use Axolotl. Attaching VSCode to Docker takes a few more steps.
3408
+
3409
+ On the host that is running axolotl (ex: if you are using a remote host), clone the axolotl repo and change your current directory to the root:
3410
+
3411
+ [!Tip] If you already have axolotl cloned on your host, make sure you have the latest changes and change into the root of the project.
3412
+
3413
+ Next, run the desired docker image and mount the current directory. Below is a docker command you can run to do this:²
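
A representative invocation, assuming the main-latest image (mount paths, image tag, and flags may need adjusting for your host):

```bash
docker run --privileged --gpus '"all"' --shm-size 10g --rm -it \
  --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  --mount type=bind,src="${PWD}",target=/workspace/axolotl \
  -v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
  axolotlai/axolotl:main-latest
```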
3414
+
3415
+ [!Tip] To understand which containers are available, see the Docker section of the README and the DockerHub repo. For details of how the Docker containers are built, see axolotl’s Docker CI builds.
3416
+
3417
+ You will now be in the container. Next, perform an editable install of Axolotl:
3418
+
3419
+ Next, if you are using a remote host, Remote into this host with VSCode. If you are using a local host, you can skip this step.
3420
+
3421
+ Next, select Dev Containers: Attach to Running Container... using the command palette (CMD + SHIFT + P) in VSCode. You will be prompted to select a container to attach to. Select the container you just created. You will now be in the container with a working directory that is at the root of the project. Any changes you make to the code will be reflected both in the container and on the host.
3422
+
3423
+ Now you are ready to debug as described above (see Debugging with VSCode).
3424
+
3425
+ Here is a short video that demonstrates how to attach to a Docker container on a remote host:
3426
+
3427
+ The config actually mimics the command CUDA_VISIBLE_DEVICES=0 python -m accelerate.commands.launch -m axolotl.cli.train devtools/chat_template.yml, but this is the same thing.↩︎
3428
+
3429
+ Many of these flags are best practices recommended by NVIDIA when using the nvidia-container-toolkit. You can read more about them here.↩︎
3430
+
3431
+ **Examples:**
3432
+
3433
+ Example 1 (yaml):
3434
+ ```yaml
3435
+ datasets:
3436
+ ...
3437
+ shards: 20
3438
+ ```
3439
+
3440
+ Example 2 (yaml):
3441
+ ```yaml
3442
+ datasets:
3443
+ - path: <path to your chat_template formatted dataset> # example on HF Hub: fozziethebeat/alpaca_messages_2k_test
3444
+ type: chat_template
3445
+ ```
3446
+
3447
+ Example 3 (bash):
3448
+ ```bash
3449
+ pip3 install packaging
3450
+ pip3 install --no-build-isolation -e '.[flash-attn,deepspeed]'
3451
+ ```
3452
+
3453
+ Example 4 (json):
3454
+ ```json
3455
+ // .vscode/launch.json
3456
+ {
3457
+ "version": "0.2.0",
3458
+ "configurations": [
3459
+ {
3460
+ "name": "Debug axolotl prompt - chat_template",
3461
+ "type": "python",
3462
+ "module": "accelerate.commands.launch",
3463
+ "request": "launch",
3464
+ "args": [
3465
+ "-m", "axolotl.cli.train", "dev_chat_template.yml",
3466
+ // The flags below simplify debugging by overriding the axolotl config
3467
+ // with the debugging tips above. Modify as needed.
3468
+ "--dataset_num_proc=1", // limits data preprocessing to one process
3469
+ "--max_steps=1", // limits training to just one step
3470
+ "--batch_size=1", // minimizes batch size
3471
+ "--micro_batch_size=1", // minimizes batch size
3472
+ "--val_set_size=0", // disables validation
3473
+ "--sample_packing=False", // disables sample packing which is necessary for small datasets
3474
+ "--eval_sample_packing=False",// disables sample packing on eval set
3475
+ "--dataset_prepared_path=temp_debug/axolotl_outputs/data", // send data outputs to a temp folder
3476
+ "--output_dir=temp_debug/axolotl_outputs/model" // send model outputs to a temp folder
3477
+ ],
3478
+ "console": "integratedTerminal", // show output in the integrated terminal
3479
+ "cwd": "${workspaceFolder}/devtools", // set working directory to devtools from the root of the project
3480
+ "justMyCode": true, // step through only axolotl code
3481
+ "env": {"CUDA_VISIBLE_DEVICES": "0", // Since we aren't doing distributed training, we need to limit to one GPU
3482
+ "HF_HOME": "${workspaceFolder}/devtools/temp_debug/.hf-cache"}, // send HF cache to a temp folder
3483
+ "preLaunchTask": "cleanup-for-dataprep", // delete temp folders (see below)
3484
+ }
3485
+ ]
3486
+ }
3487
+ ```
3488
+
3489
+ ---
3490
+
3491
+ ## Docker
3492
+
3493
+ **URL:** https://docs.axolotl.ai/docs/docker.html
3494
+
3495
+ **Contents:**
3496
+ - Docker
3497
+ - Base
3498
+ - Image
3499
+ - Tags format
3500
+ - Main
3501
+ - Image
3502
+ - Tags format
3503
+ - Cloud
3504
+ - Image
3505
+ - Tags format
3506
+
3507
+ This section describes the different Docker images that are released by AxolotlAI at Docker Hub.
3508
+
3509
+ For Blackwell GPUs, please use the tags with PyTorch 2.7.1 and CUDA 12.8.
3510
+
3511
+ The base image is the most minimal image that can install Axolotl. It is based on the nvidia/cuda image. It includes python, torch, git, git-lfs, awscli, pydantic, and more.
3512
+
3513
+ The main image is the image that is used to run Axolotl. It is based on the axolotlai/axolotl-base image and includes the Axolotl codebase, dependencies, and more.
3514
+
3515
+ There may be some extra tags appended to the image, like -vllm which installs those packages.
3516
+
3517
+ The cloud image is the image that is used to run Axolotl in the cloud. It is based on the axolotlai/axolotl image and sets ENV variables like HuggingFace cache directories for volume mounts, tmux, and more for different cloud providers.
3518
+
3519
+ JupyterLab is run by default. Set JUPYTER_DISABLE=1 in the environment variables to disable it.
3520
+
3521
+ This uses the same tags as the main image.
3522
+
3523
+ We recommend mounting volumes to /workspace/data for data persistence. /workspace/axolotl contains the source code and is ephemeral.
3524
+
3525
+ This is the same as the cloud image but without tmux.
3526
+
3527
+ The naming may be a bit confusing as it has -term appended to the end.
3528
+
3529
+ This uses the same tags as the cloud image.
3530
+
3531
+ **Examples:**
3532
+
3533
+ Example 1 (unknown):
3534
+ ```unknown
3535
+ axolotlai/axolotl-base
3536
+ ```
3537
+
3538
+ Example 2 (bash):
3539
+ ```bash
3540
+ main-base-py{python_version}-cu{cuda_version}-{pytorch_version}
3541
+ ```
3542
+
3543
+ Example 3 (unknown):
3544
+ ```unknown
3545
+ axolotlai/axolotl
3546
+ ```
3547
+
3548
+ Example 4 (bash):
3549
+ ```bash
3550
+ # on push to main
3551
+ main-py{python_version}-cu{cuda_version}-{pytorch_version}
3552
+
3553
+ # latest main (currently torch 2.6.0, python 3.11, cuda 12.4)
3554
+ main-latest
3555
+
3556
+ # nightly build
3557
+ {branch}-{date_in_YYYYMMDD}-py{python_version}-cu{cuda_version}-{pytorch_version}
3558
+
3559
+ # tagged release
3560
+ {version}
3561
+ ```
3562
+
3563
+ ---