@synsci/cli-darwin-x64 1.1.49

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (373)
  1. package/bin/skills/accelerate/SKILL.md +332 -0
  2. package/bin/skills/accelerate/references/custom-plugins.md +453 -0
  3. package/bin/skills/accelerate/references/megatron-integration.md +489 -0
  4. package/bin/skills/accelerate/references/performance.md +525 -0
  5. package/bin/skills/audiocraft/SKILL.md +564 -0
  6. package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
  7. package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
  8. package/bin/skills/autogpt/SKILL.md +403 -0
  9. package/bin/skills/autogpt/references/advanced-usage.md +535 -0
  10. package/bin/skills/autogpt/references/troubleshooting.md +420 -0
  11. package/bin/skills/awq/SKILL.md +310 -0
  12. package/bin/skills/awq/references/advanced-usage.md +324 -0
  13. package/bin/skills/awq/references/troubleshooting.md +344 -0
  14. package/bin/skills/axolotl/SKILL.md +158 -0
  15. package/bin/skills/axolotl/references/api.md +5548 -0
  16. package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
  17. package/bin/skills/axolotl/references/index.md +15 -0
  18. package/bin/skills/axolotl/references/other.md +3563 -0
  19. package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
  20. package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
  21. package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
  22. package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
  23. package/bin/skills/bitsandbytes/SKILL.md +411 -0
  24. package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
  25. package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
  26. package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
  27. package/bin/skills/blip-2/SKILL.md +564 -0
  28. package/bin/skills/blip-2/references/advanced-usage.md +680 -0
  29. package/bin/skills/blip-2/references/troubleshooting.md +526 -0
  30. package/bin/skills/chroma/SKILL.md +406 -0
  31. package/bin/skills/chroma/references/integration.md +38 -0
  32. package/bin/skills/clip/SKILL.md +253 -0
  33. package/bin/skills/clip/references/applications.md +207 -0
  34. package/bin/skills/constitutional-ai/SKILL.md +290 -0
  35. package/bin/skills/crewai/SKILL.md +498 -0
  36. package/bin/skills/crewai/references/flows.md +438 -0
  37. package/bin/skills/crewai/references/tools.md +429 -0
  38. package/bin/skills/crewai/references/troubleshooting.md +480 -0
  39. package/bin/skills/deepspeed/SKILL.md +141 -0
  40. package/bin/skills/deepspeed/references/08.md +17 -0
  41. package/bin/skills/deepspeed/references/09.md +173 -0
  42. package/bin/skills/deepspeed/references/2020.md +378 -0
  43. package/bin/skills/deepspeed/references/2023.md +279 -0
  44. package/bin/skills/deepspeed/references/assets.md +179 -0
  45. package/bin/skills/deepspeed/references/index.md +35 -0
  46. package/bin/skills/deepspeed/references/mii.md +118 -0
  47. package/bin/skills/deepspeed/references/other.md +1191 -0
  48. package/bin/skills/deepspeed/references/tutorials.md +6554 -0
  49. package/bin/skills/dspy/SKILL.md +590 -0
  50. package/bin/skills/dspy/references/examples.md +663 -0
  51. package/bin/skills/dspy/references/modules.md +475 -0
  52. package/bin/skills/dspy/references/optimizers.md +566 -0
  53. package/bin/skills/faiss/SKILL.md +221 -0
  54. package/bin/skills/faiss/references/index_types.md +280 -0
  55. package/bin/skills/flash-attention/SKILL.md +367 -0
  56. package/bin/skills/flash-attention/references/benchmarks.md +215 -0
  57. package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
  58. package/bin/skills/gguf/SKILL.md +427 -0
  59. package/bin/skills/gguf/references/advanced-usage.md +504 -0
  60. package/bin/skills/gguf/references/troubleshooting.md +442 -0
  61. package/bin/skills/gptq/SKILL.md +450 -0
  62. package/bin/skills/gptq/references/calibration.md +337 -0
  63. package/bin/skills/gptq/references/integration.md +129 -0
  64. package/bin/skills/gptq/references/troubleshooting.md +95 -0
  65. package/bin/skills/grpo-rl-training/README.md +97 -0
  66. package/bin/skills/grpo-rl-training/SKILL.md +572 -0
  67. package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
  68. package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
  69. package/bin/skills/guidance/SKILL.md +572 -0
  70. package/bin/skills/guidance/references/backends.md +554 -0
  71. package/bin/skills/guidance/references/constraints.md +674 -0
  72. package/bin/skills/guidance/references/examples.md +767 -0
  73. package/bin/skills/hqq/SKILL.md +445 -0
  74. package/bin/skills/hqq/references/advanced-usage.md +528 -0
  75. package/bin/skills/hqq/references/troubleshooting.md +503 -0
  76. package/bin/skills/hugging-face-cli/SKILL.md +191 -0
  77. package/bin/skills/hugging-face-cli/references/commands.md +954 -0
  78. package/bin/skills/hugging-face-cli/references/examples.md +374 -0
  79. package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
  80. package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
  81. package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
  82. package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
  83. package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
  84. package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
  85. package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
  86. package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
  87. package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
  88. package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
  89. package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
  90. package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
  91. package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
  92. package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
  93. package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
  94. package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
  95. package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
  96. package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
  97. package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
  98. package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
  99. package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
  100. package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
  101. package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
  102. package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
  103. package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
  104. package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
  105. package/bin/skills/hugging-face-jobs/index.html +216 -0
  106. package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
  107. package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
  108. package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
  109. package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
  110. package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
  111. package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
  112. package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
  113. package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
  114. package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
  115. package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
  116. package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
  117. package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
  118. package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
  119. package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
  120. package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
  121. package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
  122. package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
  123. package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
  124. package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
  125. package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
  126. package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
  127. package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
  128. package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
  129. package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
  130. package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
  131. package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
  132. package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
  133. package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
  134. package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
  135. package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
  136. package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
  137. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
  138. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
  139. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
  140. package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
  141. package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
  142. package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
  143. package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
  144. package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
  145. package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
  146. package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
  147. package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
  148. package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
  149. package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
  150. package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
  151. package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
  152. package/bin/skills/instructor/SKILL.md +740 -0
  153. package/bin/skills/instructor/references/examples.md +107 -0
  154. package/bin/skills/instructor/references/providers.md +70 -0
  155. package/bin/skills/instructor/references/validation.md +606 -0
  156. package/bin/skills/knowledge-distillation/SKILL.md +458 -0
  157. package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
  158. package/bin/skills/lambda-labs/SKILL.md +545 -0
  159. package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
  160. package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
  161. package/bin/skills/langchain/SKILL.md +480 -0
  162. package/bin/skills/langchain/references/agents.md +499 -0
  163. package/bin/skills/langchain/references/integration.md +562 -0
  164. package/bin/skills/langchain/references/rag.md +600 -0
  165. package/bin/skills/langsmith/SKILL.md +422 -0
  166. package/bin/skills/langsmith/references/advanced-usage.md +548 -0
  167. package/bin/skills/langsmith/references/troubleshooting.md +537 -0
  168. package/bin/skills/litgpt/SKILL.md +469 -0
  169. package/bin/skills/litgpt/references/custom-models.md +568 -0
  170. package/bin/skills/litgpt/references/distributed-training.md +451 -0
  171. package/bin/skills/litgpt/references/supported-models.md +336 -0
  172. package/bin/skills/litgpt/references/training-recipes.md +619 -0
  173. package/bin/skills/llama-cpp/SKILL.md +258 -0
  174. package/bin/skills/llama-cpp/references/optimization.md +89 -0
  175. package/bin/skills/llama-cpp/references/quantization.md +213 -0
  176. package/bin/skills/llama-cpp/references/server.md +125 -0
  177. package/bin/skills/llama-factory/SKILL.md +80 -0
  178. package/bin/skills/llama-factory/references/_images.md +23 -0
  179. package/bin/skills/llama-factory/references/advanced.md +1055 -0
  180. package/bin/skills/llama-factory/references/getting_started.md +349 -0
  181. package/bin/skills/llama-factory/references/index.md +19 -0
  182. package/bin/skills/llama-factory/references/other.md +31 -0
  183. package/bin/skills/llamaguard/SKILL.md +337 -0
  184. package/bin/skills/llamaindex/SKILL.md +569 -0
  185. package/bin/skills/llamaindex/references/agents.md +83 -0
  186. package/bin/skills/llamaindex/references/data_connectors.md +108 -0
  187. package/bin/skills/llamaindex/references/query_engines.md +406 -0
  188. package/bin/skills/llava/SKILL.md +304 -0
  189. package/bin/skills/llava/references/training.md +197 -0
  190. package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
  191. package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
  192. package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
  193. package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
  194. package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
  195. package/bin/skills/long-context/SKILL.md +536 -0
  196. package/bin/skills/long-context/references/extension_methods.md +468 -0
  197. package/bin/skills/long-context/references/fine_tuning.md +611 -0
  198. package/bin/skills/long-context/references/rope.md +402 -0
  199. package/bin/skills/mamba/SKILL.md +260 -0
  200. package/bin/skills/mamba/references/architecture-details.md +206 -0
  201. package/bin/skills/mamba/references/benchmarks.md +255 -0
  202. package/bin/skills/mamba/references/training-guide.md +388 -0
  203. package/bin/skills/megatron-core/SKILL.md +366 -0
  204. package/bin/skills/megatron-core/references/benchmarks.md +249 -0
  205. package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
  206. package/bin/skills/megatron-core/references/production-examples.md +473 -0
  207. package/bin/skills/megatron-core/references/training-recipes.md +547 -0
  208. package/bin/skills/miles/SKILL.md +315 -0
  209. package/bin/skills/miles/references/api-reference.md +141 -0
  210. package/bin/skills/miles/references/troubleshooting.md +352 -0
  211. package/bin/skills/mlflow/SKILL.md +704 -0
  212. package/bin/skills/mlflow/references/deployment.md +744 -0
  213. package/bin/skills/mlflow/references/model-registry.md +770 -0
  214. package/bin/skills/mlflow/references/tracking.md +680 -0
  215. package/bin/skills/modal/SKILL.md +341 -0
  216. package/bin/skills/modal/references/advanced-usage.md +503 -0
  217. package/bin/skills/modal/references/troubleshooting.md +494 -0
  218. package/bin/skills/model-merging/SKILL.md +539 -0
  219. package/bin/skills/model-merging/references/evaluation.md +462 -0
  220. package/bin/skills/model-merging/references/examples.md +428 -0
  221. package/bin/skills/model-merging/references/methods.md +352 -0
  222. package/bin/skills/model-pruning/SKILL.md +495 -0
  223. package/bin/skills/model-pruning/references/wanda.md +347 -0
  224. package/bin/skills/moe-training/SKILL.md +526 -0
  225. package/bin/skills/moe-training/references/architectures.md +432 -0
  226. package/bin/skills/moe-training/references/inference.md +348 -0
  227. package/bin/skills/moe-training/references/training.md +425 -0
  228. package/bin/skills/nanogpt/SKILL.md +290 -0
  229. package/bin/skills/nanogpt/references/architecture.md +382 -0
  230. package/bin/skills/nanogpt/references/data.md +476 -0
  231. package/bin/skills/nanogpt/references/training.md +564 -0
  232. package/bin/skills/nemo-curator/SKILL.md +383 -0
  233. package/bin/skills/nemo-curator/references/deduplication.md +87 -0
  234. package/bin/skills/nemo-curator/references/filtering.md +102 -0
  235. package/bin/skills/nemo-evaluator/SKILL.md +494 -0
  236. package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
  237. package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
  238. package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
  239. package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
  240. package/bin/skills/nemo-guardrails/SKILL.md +297 -0
  241. package/bin/skills/nnsight/SKILL.md +436 -0
  242. package/bin/skills/nnsight/references/README.md +78 -0
  243. package/bin/skills/nnsight/references/api.md +344 -0
  244. package/bin/skills/nnsight/references/tutorials.md +300 -0
  245. package/bin/skills/openrlhf/SKILL.md +249 -0
  246. package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
  247. package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
  248. package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
  249. package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
  250. package/bin/skills/outlines/SKILL.md +652 -0
  251. package/bin/skills/outlines/references/backends.md +615 -0
  252. package/bin/skills/outlines/references/examples.md +773 -0
  253. package/bin/skills/outlines/references/json_generation.md +652 -0
  254. package/bin/skills/peft/SKILL.md +431 -0
  255. package/bin/skills/peft/references/advanced-usage.md +514 -0
  256. package/bin/skills/peft/references/troubleshooting.md +480 -0
  257. package/bin/skills/phoenix/SKILL.md +475 -0
  258. package/bin/skills/phoenix/references/advanced-usage.md +619 -0
  259. package/bin/skills/phoenix/references/troubleshooting.md +538 -0
  260. package/bin/skills/pinecone/SKILL.md +358 -0
  261. package/bin/skills/pinecone/references/deployment.md +181 -0
  262. package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
  263. package/bin/skills/pytorch-fsdp/references/index.md +7 -0
  264. package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
  265. package/bin/skills/pytorch-lightning/SKILL.md +346 -0
  266. package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
  267. package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
  268. package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
  269. package/bin/skills/pyvene/SKILL.md +473 -0
  270. package/bin/skills/pyvene/references/README.md +73 -0
  271. package/bin/skills/pyvene/references/api.md +383 -0
  272. package/bin/skills/pyvene/references/tutorials.md +376 -0
  273. package/bin/skills/qdrant/SKILL.md +493 -0
  274. package/bin/skills/qdrant/references/advanced-usage.md +648 -0
  275. package/bin/skills/qdrant/references/troubleshooting.md +631 -0
  276. package/bin/skills/ray-data/SKILL.md +326 -0
  277. package/bin/skills/ray-data/references/integration.md +82 -0
  278. package/bin/skills/ray-data/references/transformations.md +83 -0
  279. package/bin/skills/ray-train/SKILL.md +406 -0
  280. package/bin/skills/ray-train/references/multi-node.md +628 -0
  281. package/bin/skills/rwkv/SKILL.md +260 -0
  282. package/bin/skills/rwkv/references/architecture-details.md +344 -0
  283. package/bin/skills/rwkv/references/rwkv7.md +386 -0
  284. package/bin/skills/rwkv/references/state-management.md +369 -0
  285. package/bin/skills/saelens/SKILL.md +386 -0
  286. package/bin/skills/saelens/references/README.md +70 -0
  287. package/bin/skills/saelens/references/api.md +333 -0
  288. package/bin/skills/saelens/references/tutorials.md +318 -0
  289. package/bin/skills/segment-anything/SKILL.md +500 -0
  290. package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
  291. package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
  292. package/bin/skills/sentence-transformers/SKILL.md +255 -0
  293. package/bin/skills/sentence-transformers/references/models.md +123 -0
  294. package/bin/skills/sentencepiece/SKILL.md +235 -0
  295. package/bin/skills/sentencepiece/references/algorithms.md +200 -0
  296. package/bin/skills/sentencepiece/references/training.md +304 -0
  297. package/bin/skills/sglang/SKILL.md +442 -0
  298. package/bin/skills/sglang/references/deployment.md +490 -0
  299. package/bin/skills/sglang/references/radix-attention.md +413 -0
  300. package/bin/skills/sglang/references/structured-generation.md +541 -0
  301. package/bin/skills/simpo/SKILL.md +219 -0
  302. package/bin/skills/simpo/references/datasets.md +478 -0
  303. package/bin/skills/simpo/references/hyperparameters.md +452 -0
  304. package/bin/skills/simpo/references/loss-functions.md +350 -0
  305. package/bin/skills/skypilot/SKILL.md +509 -0
  306. package/bin/skills/skypilot/references/advanced-usage.md +491 -0
  307. package/bin/skills/skypilot/references/troubleshooting.md +570 -0
  308. package/bin/skills/slime/SKILL.md +464 -0
  309. package/bin/skills/slime/references/api-reference.md +392 -0
  310. package/bin/skills/slime/references/troubleshooting.md +386 -0
  311. package/bin/skills/speculative-decoding/SKILL.md +467 -0
  312. package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
  313. package/bin/skills/speculative-decoding/references/medusa.md +350 -0
  314. package/bin/skills/stable-diffusion/SKILL.md +519 -0
  315. package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
  316. package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
  317. package/bin/skills/tensorboard/SKILL.md +629 -0
  318. package/bin/skills/tensorboard/references/integrations.md +638 -0
  319. package/bin/skills/tensorboard/references/profiling.md +545 -0
  320. package/bin/skills/tensorboard/references/visualization.md +620 -0
  321. package/bin/skills/tensorrt-llm/SKILL.md +187 -0
  322. package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
  323. package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
  324. package/bin/skills/tensorrt-llm/references/serving.md +470 -0
  325. package/bin/skills/tinker/SKILL.md +362 -0
  326. package/bin/skills/tinker/references/api-reference.md +168 -0
  327. package/bin/skills/tinker/references/getting-started.md +157 -0
  328. package/bin/skills/tinker/references/loss-functions.md +163 -0
  329. package/bin/skills/tinker/references/models-and-lora.md +139 -0
  330. package/bin/skills/tinker/references/recipes.md +280 -0
  331. package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
  332. package/bin/skills/tinker/references/rendering.md +243 -0
  333. package/bin/skills/tinker/references/supervised-learning.md +232 -0
  334. package/bin/skills/tinker-training-cost/SKILL.md +187 -0
  335. package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
  336. package/bin/skills/torchforge/SKILL.md +433 -0
  337. package/bin/skills/torchforge/references/api-reference.md +327 -0
  338. package/bin/skills/torchforge/references/troubleshooting.md +409 -0
  339. package/bin/skills/torchtitan/SKILL.md +358 -0
  340. package/bin/skills/torchtitan/references/checkpoint.md +181 -0
  341. package/bin/skills/torchtitan/references/custom-models.md +258 -0
  342. package/bin/skills/torchtitan/references/float8.md +133 -0
  343. package/bin/skills/torchtitan/references/fsdp.md +126 -0
  344. package/bin/skills/transformer-lens/SKILL.md +346 -0
  345. package/bin/skills/transformer-lens/references/README.md +54 -0
  346. package/bin/skills/transformer-lens/references/api.md +362 -0
  347. package/bin/skills/transformer-lens/references/tutorials.md +339 -0
  348. package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
  349. package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
  350. package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
  351. package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
  352. package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
  353. package/bin/skills/unsloth/SKILL.md +80 -0
  354. package/bin/skills/unsloth/references/index.md +7 -0
  355. package/bin/skills/unsloth/references/llms-full.md +16799 -0
  356. package/bin/skills/unsloth/references/llms-txt.md +12044 -0
  357. package/bin/skills/unsloth/references/llms.md +82 -0
  358. package/bin/skills/verl/SKILL.md +391 -0
  359. package/bin/skills/verl/references/api-reference.md +301 -0
  360. package/bin/skills/verl/references/troubleshooting.md +391 -0
  361. package/bin/skills/vllm/SKILL.md +364 -0
  362. package/bin/skills/vllm/references/optimization.md +226 -0
  363. package/bin/skills/vllm/references/quantization.md +284 -0
  364. package/bin/skills/vllm/references/server-deployment.md +255 -0
  365. package/bin/skills/vllm/references/troubleshooting.md +447 -0
  366. package/bin/skills/weights-and-biases/SKILL.md +590 -0
  367. package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
  368. package/bin/skills/weights-and-biases/references/integrations.md +700 -0
  369. package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
  370. package/bin/skills/whisper/SKILL.md +317 -0
  371. package/bin/skills/whisper/references/languages.md +189 -0
  372. package/bin/synsc +0 -0
  373. package/package.json +10 -0
@@ -0,0 +1,455 @@
+ ---
+ name: fine-tuning-with-trl
+ description: Fine-tune LLMs using reinforcement learning with TRL - SFT for instruction tuning, DPO for preference alignment, PPO/GRPO for reward optimization, and reward model training. Use when you need RLHF, want to align a model with preferences, or train from human feedback. Works with HuggingFace Transformers.
+ version: 1.0.0
+ author: Synthetic Sciences
+ license: MIT
+ tags: [Post-Training, TRL, Reinforcement Learning, Fine-Tuning, SFT, DPO, PPO, GRPO, RLHF, Preference Alignment, HuggingFace]
+ dependencies: [trl, transformers, datasets, peft, accelerate, torch]
+ ---
+ 
+ # TRL - Transformer Reinforcement Learning
+ 
+ ## Quick start
+ 
+ TRL provides post-training methods for aligning language models with human preferences.
+ 
+ **Installation**:
+ ```bash
+ pip install trl transformers datasets peft accelerate
+ ```
+ 
+ **Supervised Fine-Tuning** (instruction tuning):
+ ```python
+ from trl import SFTTrainer
+ 
+ trainer = SFTTrainer(
+     model="Qwen/Qwen2.5-0.5B",
+     train_dataset=dataset,  # Prompt-completion pairs
+ )
+ trainer.train()
+ ```
+ 
+ **DPO** (align with preferences):
+ ```python
+ from trl import DPOTrainer, DPOConfig
+ 
+ config = DPOConfig(output_dir="model-dpo", beta=0.1)
+ trainer = DPOTrainer(
+     model=model,
+     args=config,
+     train_dataset=preference_dataset,  # chosen/rejected pairs
+     processing_class=tokenizer
+ )
+ trainer.train()
+ ```
+ 
+ ## Common workflows
+ 
+ ### Workflow 1: Full RLHF pipeline (SFT → Reward Model → PPO)
+ 
+ A complete pipeline from base model to human-aligned model.
+ 
+ Copy this checklist:
+ 
+ ```
+ RLHF Training:
+ - [ ] Step 1: Supervised fine-tuning (SFT)
+ - [ ] Step 2: Train reward model
+ - [ ] Step 3: PPO reinforcement learning
+ - [ ] Step 4: Evaluate aligned model
+ ```
+ 
+ **Step 1: Supervised fine-tuning**
+ 
+ Train the base model on instruction-following data:
+ 
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from trl import SFTTrainer, SFTConfig
+ from datasets import load_dataset
+ 
+ # Load model
+ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
+ 
+ # Load instruction dataset
+ dataset = load_dataset("trl-lib/Capybara", split="train")
+ 
+ # Configure training
+ training_args = SFTConfig(
+     output_dir="Qwen2.5-0.5B-SFT",
+     per_device_train_batch_size=4,
+     num_train_epochs=1,
+     learning_rate=2e-5,
+     logging_steps=10,
+     save_strategy="epoch"
+ )
+ 
+ # Train
+ trainer = SFTTrainer(
+     model=model,
+     args=training_args,
+     train_dataset=dataset,
+     processing_class=tokenizer
+ )
+ trainer.train()
+ trainer.save_model()
+ ```
+ 
+ **Step 2: Train reward model**
+ 
+ Train a model to predict human preferences:
+ 
+ ```python
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+ from trl import RewardTrainer, RewardConfig
+ from datasets import load_dataset
+ 
+ # Load SFT model as base
+ model = AutoModelForSequenceClassification.from_pretrained(
+     "Qwen2.5-0.5B-SFT",
+     num_labels=1  # Single reward score
+ )
+ tokenizer = AutoTokenizer.from_pretrained("Qwen2.5-0.5B-SFT")
+ 
+ # Load preference data (chosen/rejected pairs)
+ dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
+ 
+ # Configure training
+ training_args = RewardConfig(
+     output_dir="Qwen2.5-0.5B-Reward",
+     per_device_train_batch_size=2,
+     num_train_epochs=1,
+     learning_rate=1e-5
+ )
+ 
+ # Train reward model
+ trainer = RewardTrainer(
+     model=model,
+     args=training_args,
+     processing_class=tokenizer,
+     train_dataset=dataset
+ )
+ trainer.train()
+ trainer.save_model()
+ ```
+ 
+ **Step 3: PPO reinforcement learning**
+ 
+ Optimize the policy against the reward model:
+ 
+ ```bash
+ python -m trl.scripts.ppo \
+     --model_name_or_path Qwen2.5-0.5B-SFT \
+     --reward_model_path Qwen2.5-0.5B-Reward \
+     --dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
+     --output_dir Qwen2.5-0.5B-PPO \
+     --learning_rate 3e-6 \
+     --per_device_train_batch_size 64 \
+     --total_episodes 10000
+ ```
+ 
+ **Step 4: Evaluate**
+ 
+ ```python
+ from transformers import pipeline
+ 
+ # Load aligned model
+ generator = pipeline("text-generation", model="Qwen2.5-0.5B-PPO")
+ 
+ # Test
+ prompt = "Explain quantum computing to a 10-year-old"
+ output = generator(prompt, max_length=200)[0]["generated_text"]
+ print(output)
+ ```
+ 
+ ### Workflow 2: Simple preference alignment with DPO
+ 
+ Align a model with preferences without training a reward model.
+ 
+ Copy this checklist:
+ 
+ ```
+ DPO Training:
+ - [ ] Step 1: Prepare preference dataset
+ - [ ] Step 2: Configure DPO
+ - [ ] Step 3: Train with DPOTrainer
+ - [ ] Step 4: Evaluate alignment
+ ```
+ 
+ **Step 1: Prepare preference dataset**
+ 
+ Dataset format:
+ ```json
+ {
+     "prompt": "What is the capital of France?",
+     "chosen": "The capital of France is Paris.",
+     "rejected": "I don't know."
+ }
+ ```
+ 
+ Load the dataset:
+ ```python
+ from datasets import load_dataset
+ 
+ dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
+ # Or load your own
+ # dataset = load_dataset("json", data_files="preferences.json")
+ ```
+ 
+ **Step 2: Configure DPO**
+ 
+ ```python
+ from trl import DPOConfig
+ 
+ config = DPOConfig(
+     output_dir="Qwen2.5-0.5B-DPO",
+     per_device_train_batch_size=4,
+     num_train_epochs=1,
+     learning_rate=5e-7,
+     beta=0.1,  # KL penalty strength
+     max_prompt_length=512,
+     max_length=1024,
+     logging_steps=10
+ )
+ ```
+ 
+ **Step 3: Train with DPOTrainer**
+ 
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from trl import DPOTrainer
+ 
+ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
+ 
+ trainer = DPOTrainer(
+     model=model,
+     args=config,
+     train_dataset=dataset,
+     processing_class=tokenizer
+ )
+ 
+ trainer.train()
+ trainer.save_model()
+ ```
+ 
+ **CLI alternative**:
+ ```bash
+ trl dpo \
+     --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
+     --dataset_name argilla/Capybara-Preferences \
+     --output_dir Qwen2.5-0.5B-DPO \
+     --per_device_train_batch_size 4 \
+     --learning_rate 5e-7 \
+     --beta 0.1
+ ```
+ 
+ ### Workflow 3: Memory-efficient online RL with GRPO
+ 
+ Train with reinforcement learning using minimal memory.
+ 
+ Copy this checklist:
+ 
+ ```
+ GRPO Training:
+ - [ ] Step 1: Define reward function
+ - [ ] Step 2: Configure GRPO
+ - [ ] Step 3: Train with GRPOTrainer
+ ```
+ 
+ **Step 1: Define reward function**
+ 
+ ```python
+ def reward_function(completions, **kwargs):
+     """
+     Compute rewards for completions.
+ 
+     Args:
+         completions: List of generated texts
+ 
+     Returns:
+         List of reward scores (floats)
+     """
+     rewards = []
+     for completion in completions:
+         # Example: reward based on length and unique words
+         score = len(completion.split())  # Favor longer responses
+         score += len(set(completion.lower().split()))  # Reward unique words
+         rewards.append(float(score))
+     return rewards
+ ```
+ 
+ Or use a reward model:
+ ```python
+ from transformers import pipeline
+ 
+ reward_model = pipeline("text-classification", model="reward-model-path")
+ 
+ def reward_from_model(completions, prompts, **kwargs):
+     # Combine prompt + completion
+     full_texts = [p + c for p, c in zip(prompts, completions)]
+     # Get reward scores
+     results = reward_model(full_texts)
+     return [r["score"] for r in results]
+ ```
+ 
+ **Step 2: Configure GRPO**
+ 
+ ```python
+ from trl import GRPOConfig
+ 
+ config = GRPOConfig(
+     output_dir="Qwen2-GRPO",
+     per_device_train_batch_size=4,
+     num_train_epochs=1,
+     learning_rate=1e-5,
+     num_generations=4,  # Generate 4 completions per prompt
+     max_completion_length=128
+ )
+ ```
+ 
+ **Step 3: Train with GRPOTrainer**
+ 
+ ```python
+ from datasets import load_dataset
+ from trl import GRPOTrainer
+ 
+ # Load prompt-only dataset
+ dataset = load_dataset("trl-lib/tldr", split="train")
+ 
+ trainer = GRPOTrainer(
+     model="Qwen/Qwen2-0.5B-Instruct",
+     reward_funcs=reward_function,  # Your reward function
+     args=config,
+     train_dataset=dataset
+ )
+ 
+ trainer.train()
+ ```
+ 
+ **CLI**:
+ ```bash
+ trl grpo \
+     --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
+     --dataset_name trl-lib/tldr \
+     --output_dir Qwen2-GRPO \
+     --num_generations 4
+ ```
+ 
+ ## When to use vs alternatives
+ 
+ **Use TRL when:**
+ - Need to align a model with human preferences
+ - Have preference data (chosen/rejected pairs)
+ - Want to use reinforcement learning (PPO, GRPO)
+ - Need reward model training
+ - Doing RLHF (full pipeline)
+ 
+ **Method selection** (the expected dataset schemas are sketched below):
+ - **SFT**: Have prompt-completion pairs, want basic instruction following
+ - **DPO**: Have preferences, want simple alignment (no reward model needed)
+ - **PPO**: Have reward model, need maximum control over RL
+ - **GRPO**: Memory-constrained, want online RL
+ - **Reward Model**: Building RLHF pipeline, need to score generations
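+ 
+ In practice, the main difference is the dataset schema each trainer expects. A minimal sketch (field names follow TRL's standard dataset formats; the values are illustrative):
+ ```python
+ sft_example = {         # SFTTrainer: prompt-completion (or a single "text" field)
+     "prompt": "What is 2 + 2?",
+     "completion": "2 + 2 = 4.",
+ }
+ 
+ preference_example = {  # DPOTrainer / RewardTrainer: chosen vs. rejected
+     "prompt": "What is 2 + 2?",
+     "chosen": "2 + 2 = 4.",
+     "rejected": "5.",
+ }
+ 
+ grpo_example = {        # GRPOTrainer: prompt only; rewards come from reward_funcs
+     "prompt": "What is 2 + 2?",
+ }
+ ```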
+ 
+ **Use alternatives instead:**
+ - **HuggingFace Trainer**: Basic fine-tuning without RL
+ - **Axolotl**: YAML-based training configuration
+ - **LitGPT**: Educational, minimal fine-tuning
+ - **Unsloth**: Fast LoRA training
+ 
+ ## Common issues
+ 
+ **Issue: OOM during DPO training**
+ 
+ Reduce the batch size and sequence length:
+ ```python
+ config = DPOConfig(
+     per_device_train_batch_size=1,  # Reduce from 4
+     max_length=512,  # Reduce from 1024
+     gradient_accumulation_steps=8  # Maintain effective batch size
+ )
+ ```
+ 
+ Or enable gradient checkpointing:
+ ```python
+ model.gradient_checkpointing_enable()
+ ```
+ 
+ **Issue: Poor alignment quality**
+ 
+ Tune the beta parameter:
+ ```python
+ # Higher beta = more conservative (stays closer to reference)
+ config = DPOConfig(beta=0.5)  # Default 0.1
+ 
+ # Lower beta = more aggressive alignment
+ config = DPOConfig(beta=0.01)
+ ```
+ 
+ **Issue: Reward model not learning**
+ 
+ Adjust the learning rate and train longer:
+ ```python
+ config = RewardConfig(
+     learning_rate=1e-5,  # Try a different LR
+     num_train_epochs=3  # Train longer
+ )
+ ```
+ 
+ Ensure the preference dataset has clear winners:
+ ```python
+ # Verify the dataset
+ print(dataset[0])
+ # "chosen" should be clearly better than "rejected"
+ ```
+ 
+ **Issue: PPO training unstable**
+ 
+ Adjust the KL coefficient and clip range:
+ ```python
+ from trl import PPOConfig
+ 
+ config = PPOConfig(
+     kl_coef=0.1,  # Increase from 0.05
+     cliprange=0.1  # Reduce from 0.2
+ )
+ ```
+ 
+ ## Advanced topics
+ 
+ **SFT training guide**: See [references/sft-training.md](references/sft-training.md) for dataset formats, chat templates, packing strategies, and multi-GPU training.
+ 
+ **DPO variants**: See [references/dpo-variants.md](references/dpo-variants.md) for IPO, cDPO, RPO, and other DPO loss functions with recommended hyperparameters.
+ 
+ **Reward modeling**: See [references/reward-modeling.md](references/reward-modeling.md) for outcome vs process rewards, Bradley-Terry loss, and reward model evaluation.
+ 
+ **Online RL methods**: See [references/online-rl.md](references/online-rl.md) for PPO, GRPO, RLOO, and OnlineDPO with detailed configurations.
+ 
+ ## Hardware requirements
+ 
+ - **GPU**: NVIDIA (CUDA required)
+ - **VRAM**: Depends on model and method
+   - SFT 7B: 16GB (with LoRA)
+   - DPO 7B: 24GB (stores reference model)
+   - PPO 7B: 40GB (policy + reward model)
+   - GRPO 7B: 24GB (more memory efficient)
+ - **Multi-GPU**: Supported via `accelerate`
+ - **Mixed precision**: BF16 recommended (A100/H100)
+ 
+ **Memory optimization** (see the sketch below):
+ - Use LoRA/QLoRA for all methods
+ - Enable gradient checkpointing
+ - Use smaller batch sizes with gradient accumulation
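+ 
+ A minimal LoRA sketch for SFT (a sketch, assuming `peft` is installed and `dataset` is loaded as in the workflows above; the target modules are illustrative and vary by architecture):
+ ```python
+ from peft import LoraConfig
+ from trl import SFTTrainer, SFTConfig
+ 
+ peft_config = LoraConfig(
+     r=16,
+     lora_alpha=32,
+     target_modules=["q_proj", "v_proj"],  # architecture-dependent
+     task_type="CAUSAL_LM",
+ )
+ 
+ trainer = SFTTrainer(
+     model="Qwen/Qwen2.5-0.5B",
+     args=SFTConfig(output_dir="sft-lora", gradient_checkpointing=True),
+     train_dataset=dataset,
+     peft_config=peft_config,  # trains adapters only; base weights stay frozen
+ )
+ ```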
+ 
+ ## Resources
+ 
+ - Docs: https://huggingface.co/docs/trl/
+ - GitHub: https://github.com/huggingface/trl
+ - Papers:
+   - "Training language models to follow instructions with human feedback" (InstructGPT, 2022)
+   - "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (DPO, 2023)
+   - "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (introduces GRPO, 2024)
+ - Examples: https://github.com/huggingface/trl/tree/main/examples/scripts
+ 
@@ -0,0 +1,227 @@
+ # DPO Variants
+ 
+ Complete guide to Direct Preference Optimization loss variants in TRL.
+ 
+ ## Overview
+ 
+ DPO optimizes models using preference data (chosen/rejected pairs). TRL supports 10+ loss variants for different scenarios.
+ 
+ ## Loss Types
+ 
+ ### 1. Sigmoid (Standard DPO)
+ 
+ **Formula**: `-log(sigmoid(β * logits))`, where `logits` is the policy-vs-reference log-ratio of the chosen completion minus that of the rejected completion
+ 
+ **When to use**: Default choice, general preference alignment
+ 
+ **Config**:
+ ```python
+ DPOConfig(
+     loss_type="sigmoid",
+     beta=0.1,  # KL penalty
+     per_device_train_batch_size=64,
+     learning_rate=1e-6
+ )
+ ```
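+ 
+ For intuition, a standalone sketch of how this loss is computed from summed sequence log-probabilities (plain PyTorch, not TRL internals; tensor-valued inputs assumed):
+ ```python
+ import torch.nn.functional as F
+ 
+ def dpo_sigmoid_loss(pi_chosen_logps, pi_rejected_logps,
+                      ref_chosen_logps, ref_rejected_logps, beta=0.1):
+     # "logits" above: chosen log-ratio minus rejected log-ratio
+     logits = (pi_chosen_logps - ref_chosen_logps) - (pi_rejected_logps - ref_rejected_logps)
+     # -log(sigmoid(beta * logits)), averaged over the batch
+     return -F.logsigmoid(beta * logits).mean()
+ ```
+ 
+ Most of the variants below reuse this same log-ratio quantity and change only the outer loss.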
+ 
+ ### 2. IPO (Identity Preference Optimization)
+ 
+ **Formula**: `(logits - 1/(2β))²`
+ 
+ **When to use**: Better theoretical grounding, reduces overfitting
+ 
+ **Config**:
+ ```python
+ DPOConfig(
+     loss_type="ipo",
+     beta=0.1,
+     per_device_train_batch_size=90,
+     learning_rate=1e-2
+ )
+ ```
+ 
+ ### 3. Hinge (SLiC)
+ 
+ **Formula**: `ReLU(1 - β * logits)`
+ 
+ **When to use**: Margin-based objective
+ 
+ **Config**:
+ ```python
+ DPOConfig(
+     loss_type="hinge",
+     beta=0.1,
+     per_device_train_batch_size=512,
+     learning_rate=1e-4
+ )
+ ```
+ 
+ ### 4. Robust DPO
+ 
+ **Formula**: Sigmoid with label smoothing for noise robustness
+ 
+ **When to use**: Noisy preference labels
+ 
+ **Config**:
+ ```python
+ DPOConfig(
+     loss_type="robust",
+     beta=0.01,
+     label_smoothing=0.1,  # Noise probability
+     per_device_train_batch_size=16,
+     learning_rate=1e-3,
+     max_prompt_length=128,
+     max_length=512
+ )
+ ```
+ 
+ ### 5. BCO Pair (Binary Classification)
+ 
+ **Formula**: Train a binary classifier (chosen=1, rejected=0)
+ 
+ **When to use**: Pairwise preference data
+ 
+ **Config**:
+ ```python
+ DPOConfig(
+     loss_type="bco_pair",
+     beta=0.01,
+     per_device_train_batch_size=128,
+     learning_rate=5e-7,
+     max_prompt_length=1536,
+     max_completion_length=512
+ )
+ ```
+ 
+ ### 6. SPPO Hard
+ 
+ **Formula**: Push the chosen reward toward 0.5 and the rejected reward toward -0.5
+ 
+ **When to use**: Nash equilibrium, sparse data
+ 
+ **Config**:
+ ```python
+ DPOConfig(
+     loss_type="sppo_hard",
+     beta=0.1
+ )
+ ```
+ 
+ ### 7. DiscoPOP
+ 
+ **Formula**: Log-Ratio Modulated Loss
+ 
+ **When to use**: Automated loss discovery
+ 
+ **Config**:
+ ```python
+ DPOConfig(
+     loss_type="discopop",
+     beta=0.05,
+     discopop_tau=0.05,
+     per_device_train_batch_size=64,
+     learning_rate=5e-7
+ )
+ ```
+ 
+ ### 8. APO Zero
+ 
+ **Formula**: Increase chosen likelihood, decrease rejected likelihood
+ 
+ **When to use**: Model is worse than the winning outputs
+ 
+ **Config**:
+ ```python
+ DPOConfig(
+     loss_type="apo_zero",
+     beta=0.1,
+     per_device_train_batch_size=64,
+     learning_rate=2e-7,
+     max_prompt_length=512,
+     max_completion_length=512
+ )
+ ```
+ 
+ ### 9. APO Down
+ 
+ **Formula**: Decrease both likelihoods, with emphasis on reducing the rejected one
+ 
+ **When to use**: Model is better than the winning outputs
+ 
+ **Config**:
+ ```python
+ DPOConfig(
+     loss_type="apo_down",
+     beta=0.1,
+     # Same hyperparameters as apo_zero
+ )
+ ```
+ 
+ ### 10. AOT & AOT Pair
+ 
+ **Formula**: Distributional alignment via stochastic dominance
+ 
+ **When to use**:
+ - `aot_pair`: Paired preference data
+ - `aot`: Unpaired data
+ 
+ **Config**:
+ ```python
+ DPOConfig(
+     loss_type="aot_pair",  # or "aot"
+     beta=0.1,
+     label_smoothing=0.0
+ )
+ ```
+ 
+ ## Multi-Loss Training
+ 
+ Combine multiple losses:
+ 
+ ```python
+ DPOConfig(
+     loss_type=["sigmoid", "ipo"],
+     loss_weights=[0.7, 0.3],  # Weighted combination
+     beta=0.1
+ )
+ ```
+ 
+ ## Key Parameters
+ 
+ ### Beta (β)
+ 
+ Controls deviation from the reference model:
+ - **Higher** (0.5): More conservative, stays close to reference
+ - **Lower** (0.01): More aggressive alignment
+ - **Default**: 0.1
+ 
+ ### Label Smoothing
+ 
+ For robust DPO:
+ - **0.0**: No smoothing (default)
+ - **0.1-0.3**: Moderate noise robustness
+ - **0.5**: Maximum noise tolerance
+ 
+ ### Max Lengths
+ 
+ These budgets interact: prompt plus completion must fit within the total sequence length (see the example below).
+ - `max_prompt_length`: 128-1536
+ - `max_completion_length`: 128-512
+ - `max_length`: Total sequence (1024-2048)
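+ 
+ A hedged example combining the three budgets (values illustrative; `output_dir` is included only to make the config constructible):
+ ```python
+ from trl import DPOConfig
+ 
+ config = DPOConfig(
+     output_dir="model-dpo",
+     max_prompt_length=512,
+     max_completion_length=512,
+     max_length=1024,  # >= prompt budget + completion budget
+ )
+ ```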
+ 
+ ## Comparison Table
+ 
+ | Loss | Speed | Stability | Best For |
+ |------|-------|-----------|----------|
+ | Sigmoid | Fast | Good | **General use** |
+ | IPO | Fast | Better | Overfitting issues |
+ | Hinge | Fast | Good | Margin objectives |
+ | Robust | Fast | Best | Noisy data |
+ | BCO | Medium | Good | Binary classification |
+ | DiscoPOP | Fast | Good | New architectures |
+ | APO | Fast | Good | Model quality matching |
+ 
+ ## References
+ 
+ - DPO paper: https://arxiv.org/abs/2305.18290
+ - IPO paper: https://arxiv.org/abs/2310.12036
+ - TRL docs: https://huggingface.co/docs/trl/dpo_trainer