@synsci/cli-darwin-x64 1.1.49

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (373)
  1. package/bin/skills/accelerate/SKILL.md +332 -0
  2. package/bin/skills/accelerate/references/custom-plugins.md +453 -0
  3. package/bin/skills/accelerate/references/megatron-integration.md +489 -0
  4. package/bin/skills/accelerate/references/performance.md +525 -0
  5. package/bin/skills/audiocraft/SKILL.md +564 -0
  6. package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
  7. package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
  8. package/bin/skills/autogpt/SKILL.md +403 -0
  9. package/bin/skills/autogpt/references/advanced-usage.md +535 -0
  10. package/bin/skills/autogpt/references/troubleshooting.md +420 -0
  11. package/bin/skills/awq/SKILL.md +310 -0
  12. package/bin/skills/awq/references/advanced-usage.md +324 -0
  13. package/bin/skills/awq/references/troubleshooting.md +344 -0
  14. package/bin/skills/axolotl/SKILL.md +158 -0
  15. package/bin/skills/axolotl/references/api.md +5548 -0
  16. package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
  17. package/bin/skills/axolotl/references/index.md +15 -0
  18. package/bin/skills/axolotl/references/other.md +3563 -0
  19. package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
  20. package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
  21. package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
  22. package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
  23. package/bin/skills/bitsandbytes/SKILL.md +411 -0
  24. package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
  25. package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
  26. package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
  27. package/bin/skills/blip-2/SKILL.md +564 -0
  28. package/bin/skills/blip-2/references/advanced-usage.md +680 -0
  29. package/bin/skills/blip-2/references/troubleshooting.md +526 -0
  30. package/bin/skills/chroma/SKILL.md +406 -0
  31. package/bin/skills/chroma/references/integration.md +38 -0
  32. package/bin/skills/clip/SKILL.md +253 -0
  33. package/bin/skills/clip/references/applications.md +207 -0
  34. package/bin/skills/constitutional-ai/SKILL.md +290 -0
  35. package/bin/skills/crewai/SKILL.md +498 -0
  36. package/bin/skills/crewai/references/flows.md +438 -0
  37. package/bin/skills/crewai/references/tools.md +429 -0
  38. package/bin/skills/crewai/references/troubleshooting.md +480 -0
  39. package/bin/skills/deepspeed/SKILL.md +141 -0
  40. package/bin/skills/deepspeed/references/08.md +17 -0
  41. package/bin/skills/deepspeed/references/09.md +173 -0
  42. package/bin/skills/deepspeed/references/2020.md +378 -0
  43. package/bin/skills/deepspeed/references/2023.md +279 -0
  44. package/bin/skills/deepspeed/references/assets.md +179 -0
  45. package/bin/skills/deepspeed/references/index.md +35 -0
  46. package/bin/skills/deepspeed/references/mii.md +118 -0
  47. package/bin/skills/deepspeed/references/other.md +1191 -0
  48. package/bin/skills/deepspeed/references/tutorials.md +6554 -0
  49. package/bin/skills/dspy/SKILL.md +590 -0
  50. package/bin/skills/dspy/references/examples.md +663 -0
  51. package/bin/skills/dspy/references/modules.md +475 -0
  52. package/bin/skills/dspy/references/optimizers.md +566 -0
  53. package/bin/skills/faiss/SKILL.md +221 -0
  54. package/bin/skills/faiss/references/index_types.md +280 -0
  55. package/bin/skills/flash-attention/SKILL.md +367 -0
  56. package/bin/skills/flash-attention/references/benchmarks.md +215 -0
  57. package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
  58. package/bin/skills/gguf/SKILL.md +427 -0
  59. package/bin/skills/gguf/references/advanced-usage.md +504 -0
  60. package/bin/skills/gguf/references/troubleshooting.md +442 -0
  61. package/bin/skills/gptq/SKILL.md +450 -0
  62. package/bin/skills/gptq/references/calibration.md +337 -0
  63. package/bin/skills/gptq/references/integration.md +129 -0
  64. package/bin/skills/gptq/references/troubleshooting.md +95 -0
  65. package/bin/skills/grpo-rl-training/README.md +97 -0
  66. package/bin/skills/grpo-rl-training/SKILL.md +572 -0
  67. package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
  68. package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
  69. package/bin/skills/guidance/SKILL.md +572 -0
  70. package/bin/skills/guidance/references/backends.md +554 -0
  71. package/bin/skills/guidance/references/constraints.md +674 -0
  72. package/bin/skills/guidance/references/examples.md +767 -0
  73. package/bin/skills/hqq/SKILL.md +445 -0
  74. package/bin/skills/hqq/references/advanced-usage.md +528 -0
  75. package/bin/skills/hqq/references/troubleshooting.md +503 -0
  76. package/bin/skills/hugging-face-cli/SKILL.md +191 -0
  77. package/bin/skills/hugging-face-cli/references/commands.md +954 -0
  78. package/bin/skills/hugging-face-cli/references/examples.md +374 -0
  79. package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
  80. package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
  81. package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
  82. package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
  83. package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
  84. package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
  85. package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
  86. package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
  87. package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
  88. package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
  89. package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
  90. package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
  91. package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
  92. package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
  93. package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
  94. package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
  95. package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
  96. package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
  97. package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
  98. package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
  99. package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
  100. package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
  101. package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
  102. package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
  103. package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
  104. package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
  105. package/bin/skills/hugging-face-jobs/index.html +216 -0
  106. package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
  107. package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
  108. package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
  109. package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
  110. package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
  111. package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
  112. package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
  113. package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
  114. package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
  115. package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
  116. package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
  117. package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
  118. package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
  119. package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
  120. package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
  121. package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
  122. package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
  123. package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
  124. package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
  125. package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
  126. package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
  127. package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
  128. package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
  129. package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
  130. package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
  131. package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
  132. package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
  133. package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
  134. package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
  135. package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
  136. package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
  137. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
  138. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
  139. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
  140. package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
  141. package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
  142. package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
  143. package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
  144. package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
  145. package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
  146. package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
  147. package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
  148. package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
  149. package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
  150. package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
  151. package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
  152. package/bin/skills/instructor/SKILL.md +740 -0
  153. package/bin/skills/instructor/references/examples.md +107 -0
  154. package/bin/skills/instructor/references/providers.md +70 -0
  155. package/bin/skills/instructor/references/validation.md +606 -0
  156. package/bin/skills/knowledge-distillation/SKILL.md +458 -0
  157. package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
  158. package/bin/skills/lambda-labs/SKILL.md +545 -0
  159. package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
  160. package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
  161. package/bin/skills/langchain/SKILL.md +480 -0
  162. package/bin/skills/langchain/references/agents.md +499 -0
  163. package/bin/skills/langchain/references/integration.md +562 -0
  164. package/bin/skills/langchain/references/rag.md +600 -0
  165. package/bin/skills/langsmith/SKILL.md +422 -0
  166. package/bin/skills/langsmith/references/advanced-usage.md +548 -0
  167. package/bin/skills/langsmith/references/troubleshooting.md +537 -0
  168. package/bin/skills/litgpt/SKILL.md +469 -0
  169. package/bin/skills/litgpt/references/custom-models.md +568 -0
  170. package/bin/skills/litgpt/references/distributed-training.md +451 -0
  171. package/bin/skills/litgpt/references/supported-models.md +336 -0
  172. package/bin/skills/litgpt/references/training-recipes.md +619 -0
  173. package/bin/skills/llama-cpp/SKILL.md +258 -0
  174. package/bin/skills/llama-cpp/references/optimization.md +89 -0
  175. package/bin/skills/llama-cpp/references/quantization.md +213 -0
  176. package/bin/skills/llama-cpp/references/server.md +125 -0
  177. package/bin/skills/llama-factory/SKILL.md +80 -0
  178. package/bin/skills/llama-factory/references/_images.md +23 -0
  179. package/bin/skills/llama-factory/references/advanced.md +1055 -0
  180. package/bin/skills/llama-factory/references/getting_started.md +349 -0
  181. package/bin/skills/llama-factory/references/index.md +19 -0
  182. package/bin/skills/llama-factory/references/other.md +31 -0
  183. package/bin/skills/llamaguard/SKILL.md +337 -0
  184. package/bin/skills/llamaindex/SKILL.md +569 -0
  185. package/bin/skills/llamaindex/references/agents.md +83 -0
  186. package/bin/skills/llamaindex/references/data_connectors.md +108 -0
  187. package/bin/skills/llamaindex/references/query_engines.md +406 -0
  188. package/bin/skills/llava/SKILL.md +304 -0
  189. package/bin/skills/llava/references/training.md +197 -0
  190. package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
  191. package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
  192. package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
  193. package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
  194. package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
  195. package/bin/skills/long-context/SKILL.md +536 -0
  196. package/bin/skills/long-context/references/extension_methods.md +468 -0
  197. package/bin/skills/long-context/references/fine_tuning.md +611 -0
  198. package/bin/skills/long-context/references/rope.md +402 -0
  199. package/bin/skills/mamba/SKILL.md +260 -0
  200. package/bin/skills/mamba/references/architecture-details.md +206 -0
  201. package/bin/skills/mamba/references/benchmarks.md +255 -0
  202. package/bin/skills/mamba/references/training-guide.md +388 -0
  203. package/bin/skills/megatron-core/SKILL.md +366 -0
  204. package/bin/skills/megatron-core/references/benchmarks.md +249 -0
  205. package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
  206. package/bin/skills/megatron-core/references/production-examples.md +473 -0
  207. package/bin/skills/megatron-core/references/training-recipes.md +547 -0
  208. package/bin/skills/miles/SKILL.md +315 -0
  209. package/bin/skills/miles/references/api-reference.md +141 -0
  210. package/bin/skills/miles/references/troubleshooting.md +352 -0
  211. package/bin/skills/mlflow/SKILL.md +704 -0
  212. package/bin/skills/mlflow/references/deployment.md +744 -0
  213. package/bin/skills/mlflow/references/model-registry.md +770 -0
  214. package/bin/skills/mlflow/references/tracking.md +680 -0
  215. package/bin/skills/modal/SKILL.md +341 -0
  216. package/bin/skills/modal/references/advanced-usage.md +503 -0
  217. package/bin/skills/modal/references/troubleshooting.md +494 -0
  218. package/bin/skills/model-merging/SKILL.md +539 -0
  219. package/bin/skills/model-merging/references/evaluation.md +462 -0
  220. package/bin/skills/model-merging/references/examples.md +428 -0
  221. package/bin/skills/model-merging/references/methods.md +352 -0
  222. package/bin/skills/model-pruning/SKILL.md +495 -0
  223. package/bin/skills/model-pruning/references/wanda.md +347 -0
  224. package/bin/skills/moe-training/SKILL.md +526 -0
  225. package/bin/skills/moe-training/references/architectures.md +432 -0
  226. package/bin/skills/moe-training/references/inference.md +348 -0
  227. package/bin/skills/moe-training/references/training.md +425 -0
  228. package/bin/skills/nanogpt/SKILL.md +290 -0
  229. package/bin/skills/nanogpt/references/architecture.md +382 -0
  230. package/bin/skills/nanogpt/references/data.md +476 -0
  231. package/bin/skills/nanogpt/references/training.md +564 -0
  232. package/bin/skills/nemo-curator/SKILL.md +383 -0
  233. package/bin/skills/nemo-curator/references/deduplication.md +87 -0
  234. package/bin/skills/nemo-curator/references/filtering.md +102 -0
  235. package/bin/skills/nemo-evaluator/SKILL.md +494 -0
  236. package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
  237. package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
  238. package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
  239. package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
  240. package/bin/skills/nemo-guardrails/SKILL.md +297 -0
  241. package/bin/skills/nnsight/SKILL.md +436 -0
  242. package/bin/skills/nnsight/references/README.md +78 -0
  243. package/bin/skills/nnsight/references/api.md +344 -0
  244. package/bin/skills/nnsight/references/tutorials.md +300 -0
  245. package/bin/skills/openrlhf/SKILL.md +249 -0
  246. package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
  247. package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
  248. package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
  249. package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
  250. package/bin/skills/outlines/SKILL.md +652 -0
  251. package/bin/skills/outlines/references/backends.md +615 -0
  252. package/bin/skills/outlines/references/examples.md +773 -0
  253. package/bin/skills/outlines/references/json_generation.md +652 -0
  254. package/bin/skills/peft/SKILL.md +431 -0
  255. package/bin/skills/peft/references/advanced-usage.md +514 -0
  256. package/bin/skills/peft/references/troubleshooting.md +480 -0
  257. package/bin/skills/phoenix/SKILL.md +475 -0
  258. package/bin/skills/phoenix/references/advanced-usage.md +619 -0
  259. package/bin/skills/phoenix/references/troubleshooting.md +538 -0
  260. package/bin/skills/pinecone/SKILL.md +358 -0
  261. package/bin/skills/pinecone/references/deployment.md +181 -0
  262. package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
  263. package/bin/skills/pytorch-fsdp/references/index.md +7 -0
  264. package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
  265. package/bin/skills/pytorch-lightning/SKILL.md +346 -0
  266. package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
  267. package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
  268. package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
  269. package/bin/skills/pyvene/SKILL.md +473 -0
  270. package/bin/skills/pyvene/references/README.md +73 -0
  271. package/bin/skills/pyvene/references/api.md +383 -0
  272. package/bin/skills/pyvene/references/tutorials.md +376 -0
  273. package/bin/skills/qdrant/SKILL.md +493 -0
  274. package/bin/skills/qdrant/references/advanced-usage.md +648 -0
  275. package/bin/skills/qdrant/references/troubleshooting.md +631 -0
  276. package/bin/skills/ray-data/SKILL.md +326 -0
  277. package/bin/skills/ray-data/references/integration.md +82 -0
  278. package/bin/skills/ray-data/references/transformations.md +83 -0
  279. package/bin/skills/ray-train/SKILL.md +406 -0
  280. package/bin/skills/ray-train/references/multi-node.md +628 -0
  281. package/bin/skills/rwkv/SKILL.md +260 -0
  282. package/bin/skills/rwkv/references/architecture-details.md +344 -0
  283. package/bin/skills/rwkv/references/rwkv7.md +386 -0
  284. package/bin/skills/rwkv/references/state-management.md +369 -0
  285. package/bin/skills/saelens/SKILL.md +386 -0
  286. package/bin/skills/saelens/references/README.md +70 -0
  287. package/bin/skills/saelens/references/api.md +333 -0
  288. package/bin/skills/saelens/references/tutorials.md +318 -0
  289. package/bin/skills/segment-anything/SKILL.md +500 -0
  290. package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
  291. package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
  292. package/bin/skills/sentence-transformers/SKILL.md +255 -0
  293. package/bin/skills/sentence-transformers/references/models.md +123 -0
  294. package/bin/skills/sentencepiece/SKILL.md +235 -0
  295. package/bin/skills/sentencepiece/references/algorithms.md +200 -0
  296. package/bin/skills/sentencepiece/references/training.md +304 -0
  297. package/bin/skills/sglang/SKILL.md +442 -0
  298. package/bin/skills/sglang/references/deployment.md +490 -0
  299. package/bin/skills/sglang/references/radix-attention.md +413 -0
  300. package/bin/skills/sglang/references/structured-generation.md +541 -0
  301. package/bin/skills/simpo/SKILL.md +219 -0
  302. package/bin/skills/simpo/references/datasets.md +478 -0
  303. package/bin/skills/simpo/references/hyperparameters.md +452 -0
  304. package/bin/skills/simpo/references/loss-functions.md +350 -0
  305. package/bin/skills/skypilot/SKILL.md +509 -0
  306. package/bin/skills/skypilot/references/advanced-usage.md +491 -0
  307. package/bin/skills/skypilot/references/troubleshooting.md +570 -0
  308. package/bin/skills/slime/SKILL.md +464 -0
  309. package/bin/skills/slime/references/api-reference.md +392 -0
  310. package/bin/skills/slime/references/troubleshooting.md +386 -0
  311. package/bin/skills/speculative-decoding/SKILL.md +467 -0
  312. package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
  313. package/bin/skills/speculative-decoding/references/medusa.md +350 -0
  314. package/bin/skills/stable-diffusion/SKILL.md +519 -0
  315. package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
  316. package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
  317. package/bin/skills/tensorboard/SKILL.md +629 -0
  318. package/bin/skills/tensorboard/references/integrations.md +638 -0
  319. package/bin/skills/tensorboard/references/profiling.md +545 -0
  320. package/bin/skills/tensorboard/references/visualization.md +620 -0
  321. package/bin/skills/tensorrt-llm/SKILL.md +187 -0
  322. package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
  323. package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
  324. package/bin/skills/tensorrt-llm/references/serving.md +470 -0
  325. package/bin/skills/tinker/SKILL.md +362 -0
  326. package/bin/skills/tinker/references/api-reference.md +168 -0
  327. package/bin/skills/tinker/references/getting-started.md +157 -0
  328. package/bin/skills/tinker/references/loss-functions.md +163 -0
  329. package/bin/skills/tinker/references/models-and-lora.md +139 -0
  330. package/bin/skills/tinker/references/recipes.md +280 -0
  331. package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
  332. package/bin/skills/tinker/references/rendering.md +243 -0
  333. package/bin/skills/tinker/references/supervised-learning.md +232 -0
  334. package/bin/skills/tinker-training-cost/SKILL.md +187 -0
  335. package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
  336. package/bin/skills/torchforge/SKILL.md +433 -0
  337. package/bin/skills/torchforge/references/api-reference.md +327 -0
  338. package/bin/skills/torchforge/references/troubleshooting.md +409 -0
  339. package/bin/skills/torchtitan/SKILL.md +358 -0
  340. package/bin/skills/torchtitan/references/checkpoint.md +181 -0
  341. package/bin/skills/torchtitan/references/custom-models.md +258 -0
  342. package/bin/skills/torchtitan/references/float8.md +133 -0
  343. package/bin/skills/torchtitan/references/fsdp.md +126 -0
  344. package/bin/skills/transformer-lens/SKILL.md +346 -0
  345. package/bin/skills/transformer-lens/references/README.md +54 -0
  346. package/bin/skills/transformer-lens/references/api.md +362 -0
  347. package/bin/skills/transformer-lens/references/tutorials.md +339 -0
  348. package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
  349. package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
  350. package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
  351. package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
  352. package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
  353. package/bin/skills/unsloth/SKILL.md +80 -0
  354. package/bin/skills/unsloth/references/index.md +7 -0
  355. package/bin/skills/unsloth/references/llms-full.md +16799 -0
  356. package/bin/skills/unsloth/references/llms-txt.md +12044 -0
  357. package/bin/skills/unsloth/references/llms.md +82 -0
  358. package/bin/skills/verl/SKILL.md +391 -0
  359. package/bin/skills/verl/references/api-reference.md +301 -0
  360. package/bin/skills/verl/references/troubleshooting.md +391 -0
  361. package/bin/skills/vllm/SKILL.md +364 -0
  362. package/bin/skills/vllm/references/optimization.md +226 -0
  363. package/bin/skills/vllm/references/quantization.md +284 -0
  364. package/bin/skills/vllm/references/server-deployment.md +255 -0
  365. package/bin/skills/vllm/references/troubleshooting.md +447 -0
  366. package/bin/skills/weights-and-biases/SKILL.md +590 -0
  367. package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
  368. package/bin/skills/weights-and-biases/references/integrations.md +700 -0
  369. package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
  370. package/bin/skills/whisper/SKILL.md +317 -0
  371. package/bin/skills/whisper/references/languages.md +189 -0
  372. package/bin/synsc +0 -0
  373. package/package.json +10 -0
package/bin/skills/openrlhf/SKILL.md
@@ -0,0 +1,249 @@
+ ---
+ name: openrlhf-training
+ description: High-performance RLHF framework with Ray+vLLM acceleration. Use for PPO, GRPO, RLOO, DPO training of large models (7B-70B+). Built on Ray, vLLM, ZeRO-3. 2× faster than DeepSpeedChat with distributed architecture and GPU resource sharing.
+ version: 1.0.0
+ author: Synthetic Sciences
+ license: MIT
+ tags: [Post-Training, OpenRLHF, RLHF, PPO, GRPO, RLOO, DPO, Ray, vLLM, Distributed Training, Large Models, ZeRO-3]
+ dependencies: [openrlhf, ray, vllm, torch, transformers, deepspeed]
+ ---
+
+ # OpenRLHF - High-Performance RLHF Training
+
+ ## Quick start
+
+ OpenRLHF is a Ray-based RLHF framework optimized for distributed training with vLLM inference acceleration.
+
+ **Installation**:
+ ```bash
+ # Launch the NVIDIA PyTorch Docker container
+ docker run --runtime=nvidia -it --rm --shm-size="10g" --cap-add=SYS_ADMIN \
+   -v $PWD:/openrlhf nvcr.io/nvidia/pytorch:25.02-py3 bash
+
+ # Uninstall conflicting packages
+ sudo pip uninstall xgboost transformer_engine flash_attn pynvml -y
+
+ # Install OpenRLHF with vLLM support
+ pip install openrlhf[vllm]
+ ```
+
+ **PPO Training** (Hybrid Engine):
+ ```bash
+ ray start --head --node-ip-address 0.0.0.0 --num-gpus 8
+
+ ray job submit --address="http://127.0.0.1:8265" \
+   --runtime-env-json='{"working_dir": "/openrlhf"}' \
+   -- python3 -m openrlhf.cli.train_ppo_ray \
+   --ref_num_nodes 1 --ref_num_gpus_per_node 8 \
+   --reward_num_nodes 1 --reward_num_gpus_per_node 8 \
+   --critic_num_nodes 1 --critic_num_gpus_per_node 8 \
+   --actor_num_nodes 1 --actor_num_gpus_per_node 8 \
+   --vllm_num_engines 4 --vllm_tensor_parallel_size 2 \
+   --colocate_all_models \
+   --vllm_gpu_memory_utilization 0.5 \
+   --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
+   --reward_pretrain OpenRLHF/Llama-3-8b-rm-700k \
+   --save_path ./output/llama3-8b-rlhf \
+   --micro_train_batch_size 8 --train_batch_size 128 \
+   --micro_rollout_batch_size 16 --rollout_batch_size 1024 \
+   --max_epochs 1 --prompt_max_len 1024 --generate_max_len 1024 \
+   --zero_stage 3 --bf16 \
+   --actor_learning_rate 5e-7 --critic_learning_rate 9e-6 \
+   --init_kl_coef 0.01 --normalize_reward \
+   --gradient_checkpointing --packing_samples \
+   --vllm_enable_sleep --deepspeed_enable_sleep
+ ```
+
+ **GRPO Training** (Group Relative Policy Optimization):
+ ```bash
+ # Same command as PPO, but add:
+ --advantage_estimator group_norm
+ ```
+
+ ## Common workflows
+
+ ### Workflow 1: Full RLHF pipeline (SFT → Reward Model → PPO)
+
+ **Step 1: Train a reward model** (pairwise preference loss on chosen/rejected pairs):
+ ```bash
+ deepspeed --module openrlhf.cli.train_rm \
+   --save_path ./output/llama3-8b-rm \
+   --save_steps -1 --logging_steps 1 \
+   --eval_steps -1 --train_batch_size 256 \
+   --micro_train_batch_size 1 --pretrain meta-llama/Meta-Llama-3-8B \
+   --bf16 --max_epochs 1 --max_len 8192 \
+   --zero_stage 3 --learning_rate 9e-6 \
+   --dataset OpenRLHF/preference_dataset_mixture2_and_safe_pku \
+   --apply_chat_template --chosen_key chosen \
+   --rejected_key rejected --flash_attn --gradient_checkpointing
+ ```
+
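+ Under the hood, `train_rm` fits a pairwise preference loss on the chosen/rejected columns. A minimal sketch of that objective for orientation (the function below is illustrative, not OpenRLHF's internal API):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def pairwise_rm_loss(chosen_rewards: torch.Tensor,
+                      rejected_rewards: torch.Tensor) -> torch.Tensor:
+     # Bradley-Terry-style objective: -log sigmoid(r_chosen - r_rejected)
+     # pushes the scalar reward of chosen responses above rejected ones.
+     return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
+ ```
+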
+ **Step 2: PPO training**:
+ ```bash
+ ray start --head --node-ip-address 0.0.0.0 --num-gpus 8
+
+ ray job submit --address="http://127.0.0.1:8265" \
+   -- python3 -m openrlhf.cli.train_ppo_ray \
+   --ref_num_nodes 1 --ref_num_gpus_per_node 8 \
+   --reward_num_nodes 1 --reward_num_gpus_per_node 8 \
+   --critic_num_nodes 1 --critic_num_gpus_per_node 8 \
+   --actor_num_nodes 1 --actor_num_gpus_per_node 8 \
+   --vllm_num_engines 4 --vllm_tensor_parallel_size 2 \
+   --colocate_all_models \
+   --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
+   --reward_pretrain ./output/llama3-8b-rm \
+   --save_path ./output/llama3-8b-ppo \
+   --micro_train_batch_size 8 --train_batch_size 128 \
+   --micro_rollout_batch_size 16 --rollout_batch_size 1024 \
+   --max_epochs 1 --prompt_max_len 1024 --generate_max_len 1024 \
+   --zero_stage 3 --bf16 \
+   --actor_learning_rate 5e-7 --critic_learning_rate 9e-6 \
+   --init_kl_coef 0.01 --normalize_reward \
+   --vllm_enable_sleep --deepspeed_enable_sleep
+ ```
+
+ ### Workflow 2: GRPO training (no critic model needed)
+
+ A memory-efficient alternative to PPO:
+
+ ```bash
+ ray job submit --address="http://127.0.0.1:8265" \
+   -- python3 -m openrlhf.cli.train_ppo_ray \
+   --advantage_estimator group_norm \
+   --ref_num_nodes 1 --ref_num_gpus_per_node 8 \
+   --reward_num_nodes 1 --reward_num_gpus_per_node 8 \
+   --actor_num_nodes 1 --actor_num_gpus_per_node 8 \
+   --vllm_num_engines 4 --vllm_tensor_parallel_size 2 \
+   --colocate_all_models \
+   --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
+   --reward_pretrain OpenRLHF/Llama-3-8b-rm-700k \
+   --save_path ./output/llama3-8b-grpo \
+   --micro_train_batch_size 8 --train_batch_size 128 \
+   --micro_rollout_batch_size 16 --rollout_batch_size 1024 \
+   --max_epochs 1 --bf16 \
+   --actor_learning_rate 5e-7 \
+   --init_kl_coef 0.01 --use_kl_loss --kl_estimator k3 \
+   --normalize_reward --no_advantage_std_norm
+ ```
+
+ **Key GRPO parameters**:
+ - `--advantage_estimator group_norm` - Enables GRPO
+ - `--use_kl_loss` - KL loss term from the GRPO paper
+ - `--kl_estimator k3` - KL estimator used in the loss (k2 behaves like k1)
+ - `--no_advantage_std_norm` - Disables std normalization of advantages
+
+ ### Workflow 3: DPO training (preference optimization)
+
+ A simpler alternative that needs no reward model:
+
+ ```bash
+ deepspeed --module openrlhf.cli.train_dpo \
+   --save_path ./output/llama3-8b-dpo \
+   --save_steps -1 --logging_steps 1 \
+   --eval_steps -1 --train_batch_size 256 \
+   --micro_train_batch_size 2 --pretrain meta-llama/Meta-Llama-3-8B \
+   --bf16 --max_epochs 1 --max_len 8192 \
+   --zero_stage 3 --learning_rate 5e-7 --beta 0.1 \
+   --dataset OpenRLHF/preference_dataset_mixture2_and_safe_pku \
+   --apply_chat_template --chosen_key chosen \
+   --rejected_key rejected --flash_attn --gradient_checkpointing
+ ```
+
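+ The `--beta 0.1` flag sets the β that scales the implicit reward in the DPO objective. A minimal sketch of that loss, assuming per-sequence log-probabilities have already been summed (illustrative, not OpenRLHF's internal API):
+
+ ```python
+ import torch.nn.functional as F
+
+ def dpo_loss(policy_chosen_logps, policy_rejected_logps,
+              ref_chosen_logps, ref_rejected_logps, beta=0.1):
+     # Implicit rewards are beta * log(pi/pi_ref); the loss is
+     # -log sigmoid of the chosen-vs-rejected reward margin.
+     chosen_logratios = policy_chosen_logps - ref_chosen_logps
+     rejected_logratios = policy_rejected_logps - ref_rejected_logps
+     return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()
+ ```
+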
+ ## When to use vs alternatives
+
+ **Use OpenRLHF when**:
+ - Training large models (7B-70B+) with RL
+ - Need vLLM inference acceleration
+ - Want a distributed architecture built on Ray
+ - Have a multi-node GPU cluster
+ - Need PPO/GRPO/RLOO/DPO in one framework
+
+ **Algorithm selection**:
+ - **PPO**: Maximum control, best for complex rewards
+ - **GRPO**: Memory-efficient, no critic needed
+ - **RLOO**: REINFORCE with a leave-one-out baseline, per-token KL, and PPO-clip loss
+ - **REINFORCE++**: More stable than GRPO, faster than PPO
+ - **DPO**: Simplest, no reward model needed
+
+ **Use alternatives instead**:
+ - **TRL**: Single-node training, simpler API
+ - **veRL**: ByteDance's framework, scales to 671B-class models
+ - **DeepSpeedChat**: Integrated with the DeepSpeed ecosystem
+
+ ## Common issues
+
+ **Issue: GPU OOM with large models**
+
+ Disable model colocation:
+ ```bash
+ # Remove the --colocate_all_models flag and
+ # allocate separate GPUs for each model
+ --actor_num_gpus_per_node 8 \
+ --critic_num_gpus_per_node 8 \
+ --reward_num_gpus_per_node 8 \
+ --ref_num_gpus_per_node 8
+ ```
+
+ **Issue: DeepSpeed GPU index out of range**
+
+ Set environment variable:
+ ```bash
+ export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1
+ ```
+
+ **Issue: Training instability**
+
+ Use the Hybrid Engine instead of asynchronous training:
+ ```bash
+ --colocate_all_models \
+ --vllm_enable_sleep \
+ --deepspeed_enable_sleep
+ ```
+
+ Adjust the KL coefficient:
+ ```bash
+ --init_kl_coef 0.05 # Increase from 0.01
+ ```
+
+ **Issue: Slow generation during PPO**
+
+ Enable vLLM acceleration:
+ ```bash
+ --vllm_num_engines 4 \
+ --vllm_tensor_parallel_size 2 \
+ --vllm_gpu_memory_utilization 0.5
+ ```
+
+ ## Advanced topics
+
+ **Hybrid Engine GPU sharing**: See [references/hybrid-engine.md](references/hybrid-engine.md) for vLLM sleep mode, DeepSpeed sleep mode, and optimal node allocation.
+
+ **Algorithm comparison**: See [references/algorithm-comparison.md](references/algorithm-comparison.md) for PPO vs GRPO vs RLOO vs REINFORCE++ benchmarks and hyperparameters.
+
+ **Multi-node setup**: See [references/multi-node-training.md](references/multi-node-training.md) for Ray cluster configuration and fault tolerance.
+
+ **Custom reward functions**: See [references/custom-rewards.md](references/custom-rewards.md) for reinforced fine-tuning and agent RLHF.
+
+ ## Hardware requirements
+
+ - **GPU**: NVIDIA A100/H100 recommended
+ - **VRAM**:
+   - 7B model: 8× A100 40GB (Hybrid Engine)
+   - 70B model: 48× A100 80GB (vLLM:Actor:Critic = 1:1:1)
+ - **Multi-node**: Ray cluster with InfiniBand recommended
+ - **Docker**: NVIDIA PyTorch container 25.02+
+
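+ For a quick sanity check against these numbers, a back-of-envelope estimate (assumptions: bf16 weights at 2 bytes/param, roughly 16 bytes/param of gradients and Adam state for trainable models, everything sharded evenly by ZeRO-3; activations and the vLLM KV cache are ignored):
+
+ ```python
+ def rough_vram_per_gpu_gb(params_b: float, n_gpus: int,
+                           trainable: int = 2,   # actor + critic
+                           frozen: int = 2) -> float:  # reward + reference
+     weights_gb = params_b * 2   # 1B params ~ 2 GB in bf16
+     optim_gb = params_b * 16    # grads + Adam moments per trainable model
+     total = (trainable + frozen) * weights_gb + trainable * optim_gb
+     return total / n_gpus       # ZeRO-3 shards states across all GPUs
+
+ print(rough_vram_per_gpu_gb(7, 8))  # ~35 GB: tight but plausible on A100 40GB
+ ```
+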
+ **Performance**:
+ - 2× faster than DeepSpeedChat
+ - vLLM inference acceleration
+ - Hybrid Engine minimizes GPU idle time
+
+ ## Resources
+
+ - Docs: https://github.com/OpenRLHF/OpenRLHF
+ - Paper: https://arxiv.org/abs/2405.11143
+ - Examples: https://github.com/OpenRLHF/OpenRLHF/tree/main/examples
+ - Discord: Community support
+
package/bin/skills/openrlhf/references/algorithm-comparison.md
@@ -0,0 +1,404 @@
+ # Algorithm Comparison
+
+ Complete guide to the RL algorithms in OpenRLHF: PPO, REINFORCE++, GRPO, RLOO, and their variants.
+
+ ## Overview
+
+ OpenRLHF supports 6 RL algorithms, selectable via `--advantage_estimator`:
+ - **gae** - PPO with Generalized Advantage Estimation
+ - **reinforce** - REINFORCE++ (PPO optimizations without critic)
+ - **reinforce_baseline** - REINFORCE++ with baseline
+ - **group_norm** - GRPO (Group Relative Policy Optimization)
+ - **dr_grpo** - Dr. GRPO (GRPO without std normalization)
+ - **rloo** - RLOO (REINFORCE Leave-One-Out)
+
+ ## Algorithm Details
+
+ ### PPO (Proximal Policy Optimization)
+
+ **Formula**:
+ ```
+ loss = -min(ratio * advantages, clip(ratio, 1-ε, 1+ε) * advantages)
+ ratio = π_new(a|s) / π_old(a|s)
+ ```
+
+ **Characteristics**:
+ - **Stability**: High (clipped objective prevents large updates)
+ - **Memory**: High (stores actor + critic experiences)
+ - **Speed**: Medium (critic training overhead)
+ - **Requires**: Critic network for value estimation
+
+ **Implementation**:
+ ```python
+ import torch
+
+ # Clipped surrogate objective, computed per token
+ surr1 = ratio * advantages
+ surr2 = ratio.clamp(1 - clip_eps_low, 1 + clip_eps_high) * advantages
+ loss = -torch.min(surr1, surr2)
+ ```
+
+ **When to use**:
+ - General-purpose RLHF
+ - Complex reward functions
+ - Need stable training
+
+ **Hyperparameters**:
+ ```bash
+ --advantage_estimator gae # Enable PPO
+ --clip_eps_low 0.2 # Clipping lower bound
+ --clip_eps_high 0.2 # Clipping upper bound
+ --actor_learning_rate 1e-6
+ --critic_learning_rate 9e-6
+ --init_kl_coef 0.01
+ ```
+
+ ### REINFORCE++
+
+ **Formula**:
+ ```
+ loss = -ratio * advantages (with PPO-clip)
+ advantages = cumulative_returns - baseline
+ ```
+
+ **Characteristics**:
+ - **Stability**: Higher than GRPO
+ - **Memory**: Lower (no critic network)
+ - **Speed**: Faster than PPO
+ - **Requires**: No critic network
+
+ **Key innovation**: Integrates PPO optimizations (advantage normalization, PPO-clip loss) into REINFORCE while eliminating critic-network overhead.
+
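+ A minimal sketch of the idea (illustrative, not OpenRLHF's internal code): compute return-based advantages, normalize them across the batch, then apply the same clipped surrogate loss PPO uses.
+
+ ```python
+ import torch
+
+ def reinforce_pp_loss(log_probs, old_log_probs, returns, clip_eps=0.2):
+     # Critic-free advantages: batch-normalized returns
+     advantages = (returns - returns.mean()) / (returns.std() + 1e-9)
+     ratio = (log_probs - old_log_probs).exp()
+     surr1 = ratio * advantages
+     surr2 = ratio.clamp(1 - clip_eps, 1 + clip_eps) * advantages
+     return -torch.min(surr1, surr2).mean()
+ ```
+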
+ **When to use**:
+ - Want PPO stability without critic
+ - Limited memory budget
+ - Fast training priority
+
+ **Hyperparameters**:
+ ```bash
+ --advantage_estimator reinforce
+ --critic_pretrain None # No critic needed
+ --init_kl_coef 0.01
+ --actor_learning_rate 1e-6
+ ```
+
+ ### REINFORCE++-baseline
+
+ **Formula**:
+ ```
+ rewards = rewards - mean(rewards_same_prompt)
+ ```
+
+ **Characteristics**:
+ - **Stability**: Very high
+ - **Memory**: Lower (no critic)
+ - **Speed**: Faster than PPO
+ - **Requires**: Multiple samples per prompt
+
+ **Key innovation**: Uses the mean reward of multiple samples from the same prompt as a baseline to reshape rewards.
+
+ **When to use**:
+ - RLVR (Reinforcement Learning with Verifiable Rewards) settings
+ - Rewards take a few discrete values (e.g., 0 / 1 / -0.5)
+ - Multiple samples per prompt available
+
+ **Hyperparameters**:
+ ```bash
+ --advantage_estimator reinforce_baseline
+ --n_samples_per_prompt 4 # Must be > 1
+ --init_kl_coef 0.01
+ ```
+
+ ### GRPO (Group Relative Policy Optimization)
+
+ **Formula**:
+ ```
+ rewards = (rewards - mean(rewards)) / (std(rewards) + 1e-9)
+ loss = -ratio * normalized_advantages
+ KL loss (optional): k1, k2, or k3 estimator
+ ```
+
+ **Characteristics**:
+ - **Stability**: Lower than REINFORCE++
+ - **Memory**: Lower (no critic)
+ - **Speed**: Fast
+ - **Requires**: Group reward normalization
+
+ **Key innovation**: Group-based advantage normalization with an optional KL loss.
+
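+ A minimal sketch of the group-normalization step (illustrative; OpenRLHF's implementation differs in detail): rewards for one prompt's samples are standardized within the group, and those values serve directly as advantages.
+
+ ```python
+ import torch
+
+ def group_norm_advantages(rewards: torch.Tensor, use_std: bool = True):
+     # rewards: (n_samples_per_prompt,) rewards for one prompt's completions
+     advantages = rewards - rewards.mean()
+     if use_std:  # Dr. GRPO (dr_grpo) skips this division
+         advantages = advantages / (rewards.std() + 1e-9)
+     return advantages
+ ```
+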
+ **When to use**:
+ - Exploring policy optimization variants
+ - Need reward normalization
+ - Memory-constrained
+
+ **Hyperparameters**:
+ ```bash
+ --advantage_estimator group_norm
+ --use_kl_loss # Enable KL loss
+ --kl_estimator k3 # k3 for loss, k2 ≈ k1
+ --init_kl_coef 0.01
+ --no_advantage_std_norm # Optional: disable std norm
+ ```
+
+ **KL estimator variance**:
+ - **k3**: Larger variance under a categorical distribution
+ - **k1, k2**: Similar variance; k2 ≈ k1 when used as a loss
+
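+ The three estimators follow the standard low-variance KL approximations; a sketch, assuming the per-token ratio r = π_ref(a|s) / π(a|s) (a common convention; check the source for the exact direction OpenRLHF uses):
+
+ ```python
+ import torch
+
+ def kl_estimates(log_ratio: torch.Tensor):
+     # log_ratio = log pi_ref(a|s) - log pi(a|s), per token
+     k1 = -log_ratio                       # plain log-ratio (unbiased, high variance)
+     k2 = 0.5 * log_ratio ** 2             # squared log-ratio (biased, low variance)
+     k3 = log_ratio.exp() - 1 - log_ratio  # unbiased and always non-negative
+     return k1, k2, k3
+ ```
+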
+ ### Dr. GRPO
+
+ **Formula**:
+ ```
+ rewards = rewards - mean(rewards) # No std normalization
+ ```
+
+ **Characteristics**:
+ - **Stability**: Similar to GRPO
+ - **Memory**: Lower (no critic)
+ - **Speed**: Fast
+ - **Requires**: Group mean normalization only
+
+ **Key innovation**: Removes the per-group `/std` normalization from GRPO (the Dr. GRPO authors argue it is not needed for variance reduction and can bias optimization).
+
+ **When to use**:
+ - GRPO variant experimentation
+ - Avoiding std-normalization issues
+
+ **Hyperparameters**:
+ ```bash
+ --advantage_estimator dr_grpo
+ --init_kl_coef 0.01
+ ```
+
+ ### RLOO (REINFORCE Leave-One-Out)
+
+ **Formula**:
+ ```
+ baseline = (sum(rewards) - rewards) / (n_samples - 1)
+ rewards = rewards - baseline
+ loss = -ratio * advantages (with PPO-clip)
+ ```
+
+ **Characteristics**:
+ - **Stability**: High (PPO-clip)
+ - **Memory**: Lower (no critic)
+ - **Speed**: Fast
+ - **Requires**: Multiple samples per prompt, per-token KL
+
+ **Key innovation**: Incorporates a per-token KL reward and the PPO-clip loss.
+
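+ A minimal sketch of the leave-one-out baseline (illustrative, not OpenRLHF's internal code): each sample is baselined by the mean reward of its sibling samples for the same prompt.
+
+ ```python
+ import torch
+
+ def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
+     # rewards: (n_samples_per_prompt,) for one prompt; requires n > 1
+     n = rewards.numel()
+     loo_baseline = (rewards.sum() - rewards) / (n - 1)
+     return rewards - loo_baseline
+ ```
+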
+ **When to use**:
+ - Need per-token KL rewards
+ - Want PPO stability without critic
+ - Multiple samples per prompt
+
+ **Hyperparameters**:
+ ```bash
+ --advantage_estimator rloo
+ --n_samples_per_prompt 4 # Must be > 1
+ --init_kl_coef 0.01
+ ```
+
+ ## Comparison Table
+
+ | Algorithm | Critic | Stability | Memory | Speed | Best For |
+ |-----------|--------|-----------|--------|-------|----------|
+ | PPO | ✅ Yes | ⭐⭐⭐⭐⭐ | High | Medium | General purpose |
+ | REINFORCE++ | ❌ No | ⭐⭐⭐⭐ | Low | **Fast** | Critic-free PPO |
+ | REINFORCE++-baseline | ❌ No | ⭐⭐⭐⭐⭐ | Low | **Fast** | RLVR settings |
+ | GRPO | ❌ No | ⭐⭐⭐ | Low | Fast | Reward normalization |
+ | Dr. GRPO | ❌ No | ⭐⭐⭐ | Low | Fast | GRPO variant |
+ | RLOO | ❌ No | ⭐⭐⭐⭐ | Low | Fast | Per-token KL |
+
+ ## Experience Data Structure
+
+ **PPO (with critic)**:
+ ```python
+ from dataclasses import dataclass
+
+ import torch
+
+ @dataclass
+ class Experience:
+     sequences: torch.Tensor        # Token sequences
+     attention_mask: torch.Tensor   # Attention masks
+     action_mask: torch.Tensor      # Action masks
+     action_log_probs: torch.Tensor # Log π(a|s)
+     values: torch.Tensor           # Critic value estimates
+     returns: torch.Tensor          # Cumulative returns
+     advantages: torch.Tensor       # GAE advantages
+     reward: float                  # Total reward
+     kl: torch.Tensor               # KL divergence
+ ```
+
+ **REINFORCE++ (no critic)**:
+ ```python
+ # No values, returns, or advantages stored:
+ # only sequences, log_probs, and rewards
+ ```
+
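+ For contrast, a critic-free rollout record reduces to something like this sketch (hypothetical class, not OpenRLHF's actual definition):
+
+ ```python
+ from dataclasses import dataclass
+
+ import torch
+
+ @dataclass
+ class CriticFreeExperience:
+     sequences: torch.Tensor         # Token sequences
+     action_log_probs: torch.Tensor  # Log π(a|s)
+     reward: float                   # Total reward; advantages computed on the fly
+ ```
+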
+ ## Memory Comparison (7B Model)
+
+ | Algorithm | Components | Memory (8× A100) |
+ |-----------|-----------|------------------|
+ | PPO | Actor + Critic + Reward + Ref | ~40GB |
+ | REINFORCE++ | Actor + Reward + Ref | ~28GB |
+ | GRPO | Actor + Reward + Ref | ~28GB |
+ | RLOO | Actor + Reward + Ref | ~28GB |
+
+ **Savings**: ~30% memory reduction without the critic
+
+ ## Speed Comparison
+
+ **Relative training time** (7B model, 1000 steps):
+ - PPO: 1.0× baseline
+ - REINFORCE++: **0.75×** (25% faster)
+ - GRPO: 0.80×
+ - RLOO: 0.80×
+
+ **Why REINFORCE++ is faster**:
+ - No critic training
+ - No value function updates
+ - Fewer backward passes
+
+ ## Choosing an Algorithm
+
+ ### Decision Tree
+
+ ```
+ Need maximum stability?
+ ├─ Yes → PPO (with critic)
+ └─ No ↓
+
+ Have multiple samples per prompt?
+ ├─ Yes ↓
+ │   └─ RLVR setting with varying rewards?
+ │       ├─ Yes → REINFORCE++-baseline
+ │       └─ No → RLOO (if you need per-token KL)
+ └─ No ↓
+
+ Want faster than PPO?
+ └─ Yes → REINFORCE++ (most stable critic-free option)
+
+ Experimenting with normalization?
+ └─ Yes → GRPO or Dr. GRPO
+ ```
+
+ ### By Use Case
+
+ **Production deployment**:
+ ```bash
+ # Maximum stability
+ --advantage_estimator gae # PPO
+ --clip_eps_low 0.2
+ --init_kl_coef 0.01
+ ```
+
+ **Memory-constrained**:
+ ```bash
+ # No critic, stable
+ --advantage_estimator reinforce # REINFORCE++
+ --critic_pretrain None
+ ```
+
+ **RLVR / verifiable rewards**:
+ ```bash
+ # Baseline reward shaping
+ --advantage_estimator reinforce_baseline
+ --n_samples_per_prompt 4
+ ```
+
+ **Research / experimentation**:
+ ```bash
+ # Explore GRPO variants
+ --advantage_estimator group_norm
+ --use_kl_loss --kl_estimator k3
+ ```
+
+ ## Advanced Configuration
+
+ ### Reward Normalization
+
+ **PPO (no manual normalization)**:
+ ```bash
+ --advantage_estimator gae
+ # GAE handles advantage normalization
+ ```
+
+ **GRPO (group normalization)**:
+ ```bash
+ --advantage_estimator group_norm
+ --normalize_reward # Optional additional normalization
+ ```
+
+ **Disable std normalization**:
+ ```bash
+ --no_advantage_std_norm # Keep mean norm only
+ ```
+
+ ### KL Penalty Configuration
+
+ **All algorithms support**:
+ ```bash
+ --init_kl_coef 0.01 # Initial KL coefficient
+ --kl_target 0.1 # Target KL divergence
+ --kl_horizon 10000 # Steps to reach target
+ ```
+
+ **GRPO-specific**:
+ ```bash
+ --use_kl_loss # Enable KL loss term
+ --kl_estimator k3 # Loss function choice
+ ```
+
+ ### Clipping Configuration
+
+ **PPO clipping**:
+ ```bash
+ --clip_eps_low 0.2 # Lower bound
+ --clip_eps_high 0.2 # Upper bound
+ ```
+
+ **Reward clipping**:
+ ```bash
+ --reward_clip_range 10.0 # Clip rewards to [-10, 10]
+ ```
+
+ ## Common Issues
+
+ ### PPO Instability
+
+ **Symptom**: Large policy updates, divergence
+
+ **Solution**: Reduce the clipping range
+ ```bash
+ --clip_eps_low 0.1 # Reduce from 0.2
+ --clip_eps_high 0.1
+ ```
+
+ ### GRPO High Variance
+
+ **Symptom**: Unstable training with GRPO
+
+ **Solution**: Switch to REINFORCE++
+ ```bash
+ --advantage_estimator reinforce # More stable
+ ```
+
+ ### Memory OOM with PPO
+
+ **Symptom**: OOM during critic training
+
+ **Solution**: Switch to a critic-free algorithm
+ ```bash
+ --advantage_estimator reinforce # No critic
+ --critic_pretrain None
+ ```
+
+ ### RLOO/Baseline Requires Multiple Samples
+
+ **Symptom**: `AssertionError: n_samples_per_prompt must be > 1`
+
+ **Solution**:
+ ```bash
+ --n_samples_per_prompt 4 # Minimum 2, recommended 4-8
+ ```
+
+ ## References
+
+ - PPO paper: https://arxiv.org/abs/1707.06347
+ - GRPO paper: https://arxiv.org/abs/2402.03300
+ - OpenRLHF: https://github.com/OpenRLHF/OpenRLHF
+ - OpenRLHF paper: https://arxiv.org/abs/2405.11143