@synsci/cli-darwin-x64 1.1.49
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/skills/accelerate/SKILL.md +332 -0
- package/bin/skills/accelerate/references/custom-plugins.md +453 -0
- package/bin/skills/accelerate/references/megatron-integration.md +489 -0
- package/bin/skills/accelerate/references/performance.md +525 -0
- package/bin/skills/audiocraft/SKILL.md +564 -0
- package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
- package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
- package/bin/skills/autogpt/SKILL.md +403 -0
- package/bin/skills/autogpt/references/advanced-usage.md +535 -0
- package/bin/skills/autogpt/references/troubleshooting.md +420 -0
- package/bin/skills/awq/SKILL.md +310 -0
- package/bin/skills/awq/references/advanced-usage.md +324 -0
- package/bin/skills/awq/references/troubleshooting.md +344 -0
- package/bin/skills/axolotl/SKILL.md +158 -0
- package/bin/skills/axolotl/references/api.md +5548 -0
- package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
- package/bin/skills/axolotl/references/index.md +15 -0
- package/bin/skills/axolotl/references/other.md +3563 -0
- package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
- package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
- package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
- package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
- package/bin/skills/bitsandbytes/SKILL.md +411 -0
- package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
- package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
- package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
- package/bin/skills/blip-2/SKILL.md +564 -0
- package/bin/skills/blip-2/references/advanced-usage.md +680 -0
- package/bin/skills/blip-2/references/troubleshooting.md +526 -0
- package/bin/skills/chroma/SKILL.md +406 -0
- package/bin/skills/chroma/references/integration.md +38 -0
- package/bin/skills/clip/SKILL.md +253 -0
- package/bin/skills/clip/references/applications.md +207 -0
- package/bin/skills/constitutional-ai/SKILL.md +290 -0
- package/bin/skills/crewai/SKILL.md +498 -0
- package/bin/skills/crewai/references/flows.md +438 -0
- package/bin/skills/crewai/references/tools.md +429 -0
- package/bin/skills/crewai/references/troubleshooting.md +480 -0
- package/bin/skills/deepspeed/SKILL.md +141 -0
- package/bin/skills/deepspeed/references/08.md +17 -0
- package/bin/skills/deepspeed/references/09.md +173 -0
- package/bin/skills/deepspeed/references/2020.md +378 -0
- package/bin/skills/deepspeed/references/2023.md +279 -0
- package/bin/skills/deepspeed/references/assets.md +179 -0
- package/bin/skills/deepspeed/references/index.md +35 -0
- package/bin/skills/deepspeed/references/mii.md +118 -0
- package/bin/skills/deepspeed/references/other.md +1191 -0
- package/bin/skills/deepspeed/references/tutorials.md +6554 -0
- package/bin/skills/dspy/SKILL.md +590 -0
- package/bin/skills/dspy/references/examples.md +663 -0
- package/bin/skills/dspy/references/modules.md +475 -0
- package/bin/skills/dspy/references/optimizers.md +566 -0
- package/bin/skills/faiss/SKILL.md +221 -0
- package/bin/skills/faiss/references/index_types.md +280 -0
- package/bin/skills/flash-attention/SKILL.md +367 -0
- package/bin/skills/flash-attention/references/benchmarks.md +215 -0
- package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
- package/bin/skills/gguf/SKILL.md +427 -0
- package/bin/skills/gguf/references/advanced-usage.md +504 -0
- package/bin/skills/gguf/references/troubleshooting.md +442 -0
- package/bin/skills/gptq/SKILL.md +450 -0
- package/bin/skills/gptq/references/calibration.md +337 -0
- package/bin/skills/gptq/references/integration.md +129 -0
- package/bin/skills/gptq/references/troubleshooting.md +95 -0
- package/bin/skills/grpo-rl-training/README.md +97 -0
- package/bin/skills/grpo-rl-training/SKILL.md +572 -0
- package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
- package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
- package/bin/skills/guidance/SKILL.md +572 -0
- package/bin/skills/guidance/references/backends.md +554 -0
- package/bin/skills/guidance/references/constraints.md +674 -0
- package/bin/skills/guidance/references/examples.md +767 -0
- package/bin/skills/hqq/SKILL.md +445 -0
- package/bin/skills/hqq/references/advanced-usage.md +528 -0
- package/bin/skills/hqq/references/troubleshooting.md +503 -0
- package/bin/skills/hugging-face-cli/SKILL.md +191 -0
- package/bin/skills/hugging-face-cli/references/commands.md +954 -0
- package/bin/skills/hugging-face-cli/references/examples.md +374 -0
- package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
- package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
- package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
- package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
- package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
- package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
- package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
- package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
- package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
- package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
- package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
- package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
- package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
- package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
- package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
- package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
- package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
- package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
- package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
- package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
- package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
- package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
- package/bin/skills/hugging-face-jobs/index.html +216 -0
- package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
- package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
- package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
- package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
- package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
- package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
- package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
- package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
- package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
- package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
- package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
- package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
- package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
- package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
- package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
- package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
- package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
- package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
- package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
- package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
- package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
- package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
- package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
- package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
- package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
- package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
- package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
- package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
- package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
- package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
- package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
- package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
- package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
- package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
- package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
- package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
- package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
- package/bin/skills/instructor/SKILL.md +740 -0
- package/bin/skills/instructor/references/examples.md +107 -0
- package/bin/skills/instructor/references/providers.md +70 -0
- package/bin/skills/instructor/references/validation.md +606 -0
- package/bin/skills/knowledge-distillation/SKILL.md +458 -0
- package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
- package/bin/skills/lambda-labs/SKILL.md +545 -0
- package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
- package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
- package/bin/skills/langchain/SKILL.md +480 -0
- package/bin/skills/langchain/references/agents.md +499 -0
- package/bin/skills/langchain/references/integration.md +562 -0
- package/bin/skills/langchain/references/rag.md +600 -0
- package/bin/skills/langsmith/SKILL.md +422 -0
- package/bin/skills/langsmith/references/advanced-usage.md +548 -0
- package/bin/skills/langsmith/references/troubleshooting.md +537 -0
- package/bin/skills/litgpt/SKILL.md +469 -0
- package/bin/skills/litgpt/references/custom-models.md +568 -0
- package/bin/skills/litgpt/references/distributed-training.md +451 -0
- package/bin/skills/litgpt/references/supported-models.md +336 -0
- package/bin/skills/litgpt/references/training-recipes.md +619 -0
- package/bin/skills/llama-cpp/SKILL.md +258 -0
- package/bin/skills/llama-cpp/references/optimization.md +89 -0
- package/bin/skills/llama-cpp/references/quantization.md +213 -0
- package/bin/skills/llama-cpp/references/server.md +125 -0
- package/bin/skills/llama-factory/SKILL.md +80 -0
- package/bin/skills/llama-factory/references/_images.md +23 -0
- package/bin/skills/llama-factory/references/advanced.md +1055 -0
- package/bin/skills/llama-factory/references/getting_started.md +349 -0
- package/bin/skills/llama-factory/references/index.md +19 -0
- package/bin/skills/llama-factory/references/other.md +31 -0
- package/bin/skills/llamaguard/SKILL.md +337 -0
- package/bin/skills/llamaindex/SKILL.md +569 -0
- package/bin/skills/llamaindex/references/agents.md +83 -0
- package/bin/skills/llamaindex/references/data_connectors.md +108 -0
- package/bin/skills/llamaindex/references/query_engines.md +406 -0
- package/bin/skills/llava/SKILL.md +304 -0
- package/bin/skills/llava/references/training.md +197 -0
- package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
- package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
- package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
- package/bin/skills/long-context/SKILL.md +536 -0
- package/bin/skills/long-context/references/extension_methods.md +468 -0
- package/bin/skills/long-context/references/fine_tuning.md +611 -0
- package/bin/skills/long-context/references/rope.md +402 -0
- package/bin/skills/mamba/SKILL.md +260 -0
- package/bin/skills/mamba/references/architecture-details.md +206 -0
- package/bin/skills/mamba/references/benchmarks.md +255 -0
- package/bin/skills/mamba/references/training-guide.md +388 -0
- package/bin/skills/megatron-core/SKILL.md +366 -0
- package/bin/skills/megatron-core/references/benchmarks.md +249 -0
- package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
- package/bin/skills/megatron-core/references/production-examples.md +473 -0
- package/bin/skills/megatron-core/references/training-recipes.md +547 -0
- package/bin/skills/miles/SKILL.md +315 -0
- package/bin/skills/miles/references/api-reference.md +141 -0
- package/bin/skills/miles/references/troubleshooting.md +352 -0
- package/bin/skills/mlflow/SKILL.md +704 -0
- package/bin/skills/mlflow/references/deployment.md +744 -0
- package/bin/skills/mlflow/references/model-registry.md +770 -0
- package/bin/skills/mlflow/references/tracking.md +680 -0
- package/bin/skills/modal/SKILL.md +341 -0
- package/bin/skills/modal/references/advanced-usage.md +503 -0
- package/bin/skills/modal/references/troubleshooting.md +494 -0
- package/bin/skills/model-merging/SKILL.md +539 -0
- package/bin/skills/model-merging/references/evaluation.md +462 -0
- package/bin/skills/model-merging/references/examples.md +428 -0
- package/bin/skills/model-merging/references/methods.md +352 -0
- package/bin/skills/model-pruning/SKILL.md +495 -0
- package/bin/skills/model-pruning/references/wanda.md +347 -0
- package/bin/skills/moe-training/SKILL.md +526 -0
- package/bin/skills/moe-training/references/architectures.md +432 -0
- package/bin/skills/moe-training/references/inference.md +348 -0
- package/bin/skills/moe-training/references/training.md +425 -0
- package/bin/skills/nanogpt/SKILL.md +290 -0
- package/bin/skills/nanogpt/references/architecture.md +382 -0
- package/bin/skills/nanogpt/references/data.md +476 -0
- package/bin/skills/nanogpt/references/training.md +564 -0
- package/bin/skills/nemo-curator/SKILL.md +383 -0
- package/bin/skills/nemo-curator/references/deduplication.md +87 -0
- package/bin/skills/nemo-curator/references/filtering.md +102 -0
- package/bin/skills/nemo-evaluator/SKILL.md +494 -0
- package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
- package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
- package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
- package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
- package/bin/skills/nemo-guardrails/SKILL.md +297 -0
- package/bin/skills/nnsight/SKILL.md +436 -0
- package/bin/skills/nnsight/references/README.md +78 -0
- package/bin/skills/nnsight/references/api.md +344 -0
- package/bin/skills/nnsight/references/tutorials.md +300 -0
- package/bin/skills/openrlhf/SKILL.md +249 -0
- package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
- package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
- package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
- package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
- package/bin/skills/outlines/SKILL.md +652 -0
- package/bin/skills/outlines/references/backends.md +615 -0
- package/bin/skills/outlines/references/examples.md +773 -0
- package/bin/skills/outlines/references/json_generation.md +652 -0
- package/bin/skills/peft/SKILL.md +431 -0
- package/bin/skills/peft/references/advanced-usage.md +514 -0
- package/bin/skills/peft/references/troubleshooting.md +480 -0
- package/bin/skills/phoenix/SKILL.md +475 -0
- package/bin/skills/phoenix/references/advanced-usage.md +619 -0
- package/bin/skills/phoenix/references/troubleshooting.md +538 -0
- package/bin/skills/pinecone/SKILL.md +358 -0
- package/bin/skills/pinecone/references/deployment.md +181 -0
- package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
- package/bin/skills/pytorch-fsdp/references/index.md +7 -0
- package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
- package/bin/skills/pytorch-lightning/SKILL.md +346 -0
- package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
- package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
- package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
- package/bin/skills/pyvene/SKILL.md +473 -0
- package/bin/skills/pyvene/references/README.md +73 -0
- package/bin/skills/pyvene/references/api.md +383 -0
- package/bin/skills/pyvene/references/tutorials.md +376 -0
- package/bin/skills/qdrant/SKILL.md +493 -0
- package/bin/skills/qdrant/references/advanced-usage.md +648 -0
- package/bin/skills/qdrant/references/troubleshooting.md +631 -0
- package/bin/skills/ray-data/SKILL.md +326 -0
- package/bin/skills/ray-data/references/integration.md +82 -0
- package/bin/skills/ray-data/references/transformations.md +83 -0
- package/bin/skills/ray-train/SKILL.md +406 -0
- package/bin/skills/ray-train/references/multi-node.md +628 -0
- package/bin/skills/rwkv/SKILL.md +260 -0
- package/bin/skills/rwkv/references/architecture-details.md +344 -0
- package/bin/skills/rwkv/references/rwkv7.md +386 -0
- package/bin/skills/rwkv/references/state-management.md +369 -0
- package/bin/skills/saelens/SKILL.md +386 -0
- package/bin/skills/saelens/references/README.md +70 -0
- package/bin/skills/saelens/references/api.md +333 -0
- package/bin/skills/saelens/references/tutorials.md +318 -0
- package/bin/skills/segment-anything/SKILL.md +500 -0
- package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
- package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
- package/bin/skills/sentence-transformers/SKILL.md +255 -0
- package/bin/skills/sentence-transformers/references/models.md +123 -0
- package/bin/skills/sentencepiece/SKILL.md +235 -0
- package/bin/skills/sentencepiece/references/algorithms.md +200 -0
- package/bin/skills/sentencepiece/references/training.md +304 -0
- package/bin/skills/sglang/SKILL.md +442 -0
- package/bin/skills/sglang/references/deployment.md +490 -0
- package/bin/skills/sglang/references/radix-attention.md +413 -0
- package/bin/skills/sglang/references/structured-generation.md +541 -0
- package/bin/skills/simpo/SKILL.md +219 -0
- package/bin/skills/simpo/references/datasets.md +478 -0
- package/bin/skills/simpo/references/hyperparameters.md +452 -0
- package/bin/skills/simpo/references/loss-functions.md +350 -0
- package/bin/skills/skypilot/SKILL.md +509 -0
- package/bin/skills/skypilot/references/advanced-usage.md +491 -0
- package/bin/skills/skypilot/references/troubleshooting.md +570 -0
- package/bin/skills/slime/SKILL.md +464 -0
- package/bin/skills/slime/references/api-reference.md +392 -0
- package/bin/skills/slime/references/troubleshooting.md +386 -0
- package/bin/skills/speculative-decoding/SKILL.md +467 -0
- package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
- package/bin/skills/speculative-decoding/references/medusa.md +350 -0
- package/bin/skills/stable-diffusion/SKILL.md +519 -0
- package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
- package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
- package/bin/skills/tensorboard/SKILL.md +629 -0
- package/bin/skills/tensorboard/references/integrations.md +638 -0
- package/bin/skills/tensorboard/references/profiling.md +545 -0
- package/bin/skills/tensorboard/references/visualization.md +620 -0
- package/bin/skills/tensorrt-llm/SKILL.md +187 -0
- package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
- package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
- package/bin/skills/tensorrt-llm/references/serving.md +470 -0
- package/bin/skills/tinker/SKILL.md +362 -0
- package/bin/skills/tinker/references/api-reference.md +168 -0
- package/bin/skills/tinker/references/getting-started.md +157 -0
- package/bin/skills/tinker/references/loss-functions.md +163 -0
- package/bin/skills/tinker/references/models-and-lora.md +139 -0
- package/bin/skills/tinker/references/recipes.md +280 -0
- package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
- package/bin/skills/tinker/references/rendering.md +243 -0
- package/bin/skills/tinker/references/supervised-learning.md +232 -0
- package/bin/skills/tinker-training-cost/SKILL.md +187 -0
- package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
- package/bin/skills/torchforge/SKILL.md +433 -0
- package/bin/skills/torchforge/references/api-reference.md +327 -0
- package/bin/skills/torchforge/references/troubleshooting.md +409 -0
- package/bin/skills/torchtitan/SKILL.md +358 -0
- package/bin/skills/torchtitan/references/checkpoint.md +181 -0
- package/bin/skills/torchtitan/references/custom-models.md +258 -0
- package/bin/skills/torchtitan/references/float8.md +133 -0
- package/bin/skills/torchtitan/references/fsdp.md +126 -0
- package/bin/skills/transformer-lens/SKILL.md +346 -0
- package/bin/skills/transformer-lens/references/README.md +54 -0
- package/bin/skills/transformer-lens/references/api.md +362 -0
- package/bin/skills/transformer-lens/references/tutorials.md +339 -0
- package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
- package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
- package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
- package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
- package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
- package/bin/skills/unsloth/SKILL.md +80 -0
- package/bin/skills/unsloth/references/index.md +7 -0
- package/bin/skills/unsloth/references/llms-full.md +16799 -0
- package/bin/skills/unsloth/references/llms-txt.md +12044 -0
- package/bin/skills/unsloth/references/llms.md +82 -0
- package/bin/skills/verl/SKILL.md +391 -0
- package/bin/skills/verl/references/api-reference.md +301 -0
- package/bin/skills/verl/references/troubleshooting.md +391 -0
- package/bin/skills/vllm/SKILL.md +364 -0
- package/bin/skills/vllm/references/optimization.md +226 -0
- package/bin/skills/vllm/references/quantization.md +284 -0
- package/bin/skills/vllm/references/server-deployment.md +255 -0
- package/bin/skills/vllm/references/troubleshooting.md +447 -0
- package/bin/skills/weights-and-biases/SKILL.md +590 -0
- package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
- package/bin/skills/weights-and-biases/references/integrations.md +700 -0
- package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
- package/bin/skills/whisper/SKILL.md +317 -0
- package/bin/skills/whisper/references/languages.md +189 -0
- package/bin/synsc +0 -0
- package/package.json +10 -0
@@ -0,0 +1,249 @@

---
name: openrlhf-training
description: High-performance RLHF framework with Ray+vLLM acceleration. Use for PPO, GRPO, RLOO, DPO training of large models (7B-70B+). Built on Ray, vLLM, ZeRO-3. 2× faster than DeepSpeedChat with distributed architecture and GPU resource sharing.
version: 1.0.0
author: Synthetic Sciences
license: MIT
tags: [Post-Training, OpenRLHF, RLHF, PPO, GRPO, RLOO, DPO, Ray, vLLM, Distributed Training, Large Models, ZeRO-3]
dependencies: [openrlhf, ray, vllm, torch, transformers, deepspeed]
---

# OpenRLHF - High-Performance RLHF Training

## Quick start

OpenRLHF is a Ray-based RLHF framework optimized for distributed training with vLLM inference acceleration.

**Installation**:
```bash
# Launch Docker container
docker run --runtime=nvidia -it --rm --shm-size="10g" --cap-add=SYS_ADMIN \
  -v $PWD:/openrlhf nvcr.io/nvidia/pytorch:25.02-py3 bash

# Uninstall conflicting packages
sudo pip uninstall xgboost transformer_engine flash_attn pynvml -y

# Install OpenRLHF with vLLM
pip install openrlhf[vllm]
```

**PPO Training** (Hybrid Engine):
```bash
ray start --head --node-ip-address 0.0.0.0 --num-gpus 8

ray job submit --address="http://127.0.0.1:8265" \
  --runtime-env-json='{"working_dir": "/openrlhf"}' \
  -- python3 -m openrlhf.cli.train_ppo_ray \
  --ref_num_nodes 1 --ref_num_gpus_per_node 8 \
  --reward_num_nodes 1 --reward_num_gpus_per_node 8 \
  --critic_num_nodes 1 --critic_num_gpus_per_node 8 \
  --actor_num_nodes 1 --actor_num_gpus_per_node 8 \
  --vllm_num_engines 4 --vllm_tensor_parallel_size 2 \
  --colocate_all_models \
  --vllm_gpu_memory_utilization 0.5 \
  --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
  --reward_pretrain OpenRLHF/Llama-3-8b-rm-700k \
  --save_path ./output/llama3-8b-rlhf \
  --micro_train_batch_size 8 --train_batch_size 128 \
  --micro_rollout_batch_size 16 --rollout_batch_size 1024 \
  --max_epochs 1 --prompt_max_len 1024 --generate_max_len 1024 \
  --zero_stage 3 --bf16 \
  --actor_learning_rate 5e-7 --critic_learning_rate 9e-6 \
  --init_kl_coef 0.01 --normalize_reward \
  --gradient_checkpointing --packing_samples \
  --vllm_enable_sleep --deepspeed_enable_sleep
```

**GRPO Training** (Group Relative Policy Optimization):
```bash
# Same command as PPO, but add:
--advantage_estimator group_norm
```

## Common workflows

### Workflow 1: Full RLHF pipeline (SFT → Reward Model → PPO)

**Step 1: Train the reward model**:
```bash
deepspeed --module openrlhf.cli.train_rm \
  --save_path ./output/llama3-8b-rm \
  --save_steps -1 --logging_steps 1 \
  --eval_steps -1 --train_batch_size 256 \
  --micro_train_batch_size 1 --pretrain meta-llama/Meta-Llama-3-8B \
  --bf16 --max_epochs 1 --max_len 8192 \
  --zero_stage 3 --learning_rate 9e-6 \
  --dataset OpenRLHF/preference_dataset_mixture2_and_safe_pku \
  --apply_chat_template --chosen_key chosen \
  --rejected_key rejected --flash_attn --gradient_checkpointing
```
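
For intuition only: reward modeling on chosen/rejected pairs typically minimizes a pairwise Bradley-Terry loss. The sketch below is an illustration under that assumption, not OpenRLHF's internal implementation; `score_chosen`/`score_rejected` are hypothetical scalar outputs of the reward head.

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Standard pairwise preference loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Toy usage: reward scores for a batch of 4 preference pairs
loss = pairwise_rm_loss(torch.tensor([1.2, 0.3, 0.8, 2.0]),
                        torch.tensor([0.4, 0.5, -0.1, 1.1]))
```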

**Step 2: PPO training**:
```bash
ray start --head --node-ip-address 0.0.0.0 --num-gpus 8

ray job submit --address="http://127.0.0.1:8265" \
  -- python3 -m openrlhf.cli.train_ppo_ray \
  --ref_num_nodes 1 --ref_num_gpus_per_node 8 \
  --reward_num_nodes 1 --reward_num_gpus_per_node 8 \
  --critic_num_nodes 1 --critic_num_gpus_per_node 8 \
  --actor_num_nodes 1 --actor_num_gpus_per_node 8 \
  --vllm_num_engines 4 --vllm_tensor_parallel_size 2 \
  --colocate_all_models \
  --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
  --reward_pretrain ./output/llama3-8b-rm \
  --save_path ./output/llama3-8b-ppo \
  --micro_train_batch_size 8 --train_batch_size 128 \
  --micro_rollout_batch_size 16 --rollout_batch_size 1024 \
  --max_epochs 1 --prompt_max_len 1024 --generate_max_len 1024 \
  --zero_stage 3 --bf16 \
  --actor_learning_rate 5e-7 --critic_learning_rate 9e-6 \
  --init_kl_coef 0.01 --normalize_reward \
  --vllm_enable_sleep --deepspeed_enable_sleep
```

### Workflow 2: GRPO training (no critic model needed)

Memory-efficient alternative to PPO:

```bash
ray job submit --address="http://127.0.0.1:8265" \
  -- python3 -m openrlhf.cli.train_ppo_ray \
  --advantage_estimator group_norm \
  --ref_num_nodes 1 --ref_num_gpus_per_node 8 \
  --reward_num_nodes 1 --reward_num_gpus_per_node 8 \
  --actor_num_nodes 1 --actor_num_gpus_per_node 8 \
  --vllm_num_engines 4 --vllm_tensor_parallel_size 2 \
  --colocate_all_models \
  --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
  --reward_pretrain OpenRLHF/Llama-3-8b-rm-700k \
  --save_path ./output/llama3-8b-grpo \
  --micro_train_batch_size 8 --train_batch_size 128 \
  --micro_rollout_batch_size 16 --rollout_batch_size 1024 \
  --max_epochs 1 --bf16 \
  --actor_learning_rate 5e-7 \
  --init_kl_coef 0.01 --use_kl_loss --kl_estimator k3 \
  --normalize_reward --no_advantage_std_norm
```

**Key GRPO parameters** (see the sketch below for what the group-normalized advantage computes):
- `--advantage_estimator group_norm` - Enables GRPO
- `--use_kl_loss` - KL loss term from the GRPO paper
- `--kl_estimator k3` - KL estimator used for the loss (k2 ≈ k1 in practice)
- `--no_advantage_std_norm` - Disables std normalization
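
A minimal sketch of what the `group_norm` estimator computes, following the formula in [references/algorithm-comparison.md](references/algorithm-comparison.md); the function name and tensor shapes are illustrative, not OpenRLHF's internal API.

```python
import torch

def group_normalized_advantages(rewards: torch.Tensor, std_norm: bool = True) -> torch.Tensor:
    """rewards: (num_prompts, n_samples_per_prompt) scalar rewards per sampled response.

    GRPO normalizes within each prompt group; passing std_norm=False keeps only the
    mean subtraction, which is what --no_advantage_std_norm disables.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    adv = rewards - mean
    if std_norm:
        adv = adv / (rewards.std(dim=1, keepdim=True) + 1e-9)
    return adv

# Toy usage: 2 prompts × 4 sampled responses each
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.5, 0.5, 1.0, 0.0]])
advantages = group_normalized_advantages(rewards)
```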

### Workflow 3: DPO training (preference optimization)

Simpler alternative without reward model:

```bash
deepspeed --module openrlhf.cli.train_dpo \
  --save_path ./output/llama3-8b-dpo \
  --save_steps -1 --logging_steps 1 \
  --eval_steps -1 --train_batch_size 256 \
  --micro_train_batch_size 2 --pretrain meta-llama/Meta-Llama-3-8B \
  --bf16 --max_epochs 1 --max_len 8192 \
  --zero_stage 3 --learning_rate 5e-7 --beta 0.1 \
  --dataset OpenRLHF/preference_dataset_mixture2_and_safe_pku \
  --apply_chat_template --chosen_key chosen \
  --rejected_key rejected --flash_attn --gradient_checkpointing
```
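
For intuition, DPO optimizes the policy directly on preference pairs; below is a minimal sketch of the standard DPO objective (the `--beta 0.1` flag above corresponds to β). The log-probability tensors are placeholders, not OpenRLHF internals.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """-log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r)))"""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with made-up sequence log-probabilities for 3 pairs
loss = dpo_loss(torch.tensor([-12.0, -9.5, -11.0]), torch.tensor([-13.0, -10.0, -12.5]),
                torch.tensor([-12.5, -9.8, -11.2]), torch.tensor([-12.8, -9.9, -12.0]))
```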

## When to use vs alternatives

**Use OpenRLHF when**:
- Training large models (7B-70B+) with RL
- Need vLLM inference acceleration
- Want distributed architecture with Ray
- Have multi-node GPU cluster
- Need PPO/GRPO/RLOO/DPO in one framework

**Algorithm selection**:
- **PPO**: Maximum control, best for complex rewards
- **GRPO**: Memory-efficient, no critic needed
- **RLOO**: Critic-free, leave-one-out baseline with per-token KL reward
- **REINFORCE++**: More stable than GRPO, faster than PPO
- **DPO**: Simplest, no reward model needed

**Use alternatives instead**:
- **TRL**: Single-node training, simpler API
- **veRL**: ByteDance's framework for 671B models
- **DeepSpeedChat**: Integrated with DeepSpeed ecosystem

## Common issues

**Issue: GPU OOM with large models**

Disable model colocation:
```bash
# Remove the --colocate_all_models flag
# Allocate separate GPUs for each model
--actor_num_gpus_per_node 8 \
--critic_num_gpus_per_node 8 \
--reward_num_gpus_per_node 8 \
--ref_num_gpus_per_node 8
```

**Issue: DeepSpeed GPU index out of range**

Set environment variable:
```bash
export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1
```

**Issue: Training instability**

Use the Hybrid Engine instead of async training:
```bash
--colocate_all_models \
--vllm_enable_sleep \
--deepspeed_enable_sleep
```

Adjust the KL coefficient:
```bash
--init_kl_coef 0.05   # Increase from 0.01
```

**Issue: Slow generation during PPO**

Enable vLLM acceleration:
```bash
--vllm_num_engines 4 \
--vllm_tensor_parallel_size 2 \
--vllm_gpu_memory_utilization 0.5
```

## Advanced topics

**Hybrid Engine GPU sharing**: See [references/hybrid-engine.md](references/hybrid-engine.md) for vLLM sleep mode, DeepSpeed sleep mode, and optimal node allocation.

**Algorithm comparison**: See [references/algorithm-comparison.md](references/algorithm-comparison.md) for PPO vs GRPO vs RLOO vs REINFORCE++ benchmarks and hyperparameters.

**Multi-node setup**: See [references/multi-node-training.md](references/multi-node-training.md) for Ray cluster configuration and fault tolerance.

**Custom reward functions**: See [references/custom-rewards.md](references/custom-rewards.md) for reinforced fine-tuning and agent RLHF.

## Hardware requirements

- **GPU**: NVIDIA A100/H100 recommended
- **VRAM**:
  - 7B model: 8× A100 40GB (Hybrid Engine)
  - 70B model: 48× A100 80GB (vLLM:Actor:Critic = 1:1:1)
- **Multi-node**: Ray cluster with InfiniBand recommended
- **Docker**: NVIDIA PyTorch container 25.02+

**Performance**:
- 2× faster than DeepSpeedChat
- vLLM inference acceleration
- Hybrid Engine minimizes GPU idle time

## Resources

- Docs: https://github.com/OpenRLHF/OpenRLHF
- Paper: https://arxiv.org/abs/2405.11143
- Examples: https://github.com/OpenRLHF/OpenRLHF/tree/main/examples
- Discord: Community support

@@ -0,0 +1,404 @@

# Algorithm Comparison

Complete guide to RL algorithms in OpenRLHF: PPO, REINFORCE++, GRPO, RLOO, and their variants.

## Overview

OpenRLHF supports 6 RL algorithms selectable via `--advantage_estimator`:
- **gae** - PPO with Generalized Advantage Estimation
- **reinforce** - REINFORCE++ (PPO optimizations without critic)
- **reinforce_baseline** - REINFORCE++ with baseline
- **group_norm** - GRPO (Group Relative Policy Optimization)
- **dr_grpo** - Dr. GRPO (GRPO without std normalization)
- **rloo** - RLOO (REINFORCE Leave-One-Out)

## Algorithm Details

### PPO (Proximal Policy Optimization)

**Formula**:
```
loss = -min(ratio * advantages, clip(ratio, 1-ε, 1+ε) * advantages)
ratio = π_new(a|s) / π_old(a|s)
```

**Characteristics**:
- **Stability**: High (clipped objective prevents large updates)
- **Memory**: High (stores actor + critic experiences)
- **Speed**: Medium (critic training overhead)
- **Requires**: Critic network for value estimation

**Implementation**:
```python
import torch

# ratio = exp(new_log_probs - old_log_probs); advantages come from GAE
surr1 = ratio * advantages
surr2 = ratio.clamp(1 - clip_eps_low, 1 + clip_eps_high) * advantages
loss = -torch.min(surr1, surr2)
```

**When to use**:
- General-purpose RLHF
- Complex reward functions
- Need stable training

**Hyperparameters**:
```bash
--advantage_estimator gae    # Enable PPO
--clip_eps_low 0.2           # Clipping lower bound
--clip_eps_high 0.2          # Clipping upper bound
--actor_learning_rate 1e-6
--critic_learning_rate 9e-6
--init_kl_coef 0.01
```

### REINFORCE++

**Formula**:
```
loss = -ratio * advantages   (with PPO-clip)
advantages = cumulative_returns - baseline
```

**Characteristics**:
- **Stability**: Higher than GRPO
- **Memory**: Lower (no critic network)
- **Speed**: Faster than PPO
- **Requires**: No critic network

**Key innovation**: Integrates PPO optimizations (advantage normalization, PPO-clip loss) into REINFORCE while eliminating critic network overhead.

**When to use**:
- Want PPO stability without a critic
- Limited memory budget
- Fast training is a priority

**Hyperparameters**:
```bash
--advantage_estimator reinforce
--critic_pretrain None   # No critic needed
--init_kl_coef 0.01
--actor_learning_rate 1e-6
```

### REINFORCE++-baseline

**Formula**:
```
rewards = rewards - mean(rewards_same_prompt)
```

**Characteristics**:
- **Stability**: Very high
- **Memory**: Lower (no critic)
- **Speed**: Faster than PPO
- **Requires**: Multiple samples per prompt

**Key innovation**: Uses the mean reward of multiple samples from the same prompt as a baseline to reshape rewards.

**When to use**:
- RLVR (Reinforcement Learning with Verifiable Rewards) settings
- Reward patterns vary (0/1/-0.5)
- Multiple samples per prompt available

**Hyperparameters**:
```bash
--advantage_estimator reinforce_baseline
--n_samples_per_prompt 4   # Must be > 1
--init_kl_coef 0.01
```

### GRPO (Group Relative Policy Optimization)

**Formula**:
```
rewards = (rewards - mean(rewards)) / (std(rewards) + 1e-9)
loss = -ratio * normalized_advantages
KL loss (optional): k1, k2, or k3 estimator
```

**Characteristics**:
- **Stability**: Lower than REINFORCE++
- **Memory**: Lower (no critic)
- **Speed**: Fast
- **Requires**: Group reward normalization

**Key innovation**: Group-based advantage normalization with an optional KL loss.

**When to use**:
- Exploring policy optimization variants
- Need reward normalization
- Memory-constrained

**Hyperparameters**:
```bash
--advantage_estimator group_norm
--use_kl_loss             # Enable KL loss
--kl_estimator k3         # k3 for the loss; k2 ≈ k1
--init_kl_coef 0.01
--no_advantage_std_norm   # Optional: disable std norm
```

**KL estimator variance** (see the sketch below):
- **k3**: Larger variance under a categorical distribution
- **k1, k2**: Similar variance; k2 ≈ k1 when used as a loss
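
The k1/k2/k3 names appear to follow the common Monte-Carlo KL estimators from Schulman's "Approximating KL Divergence" note; the sketch below gives those general definitions as an assumption for intuition, not a copy of OpenRLHF's code.

```python
import torch

def kl_estimates(log_ratio: torch.Tensor):
    """log_ratio = log(p_ref / p_policy) per token, for tokens sampled from the policy.

    Returns the three standard per-sample KL estimators.
    """
    k1 = -log_ratio                          # unbiased, can go negative
    k2 = 0.5 * log_ratio ** 2                # low variance, biased
    k3 = torch.expm1(log_ratio) - log_ratio  # (r - 1) - log r, always non-negative
    return k1, k2, k3

# Toy usage on random per-token log-ratios
k1, k2, k3 = kl_estimates(0.1 * torch.randn(8))
```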

### Dr. GRPO

**Formula**:
```
rewards = rewards - mean(rewards)   # No std normalization
```

**Characteristics**:
- **Stability**: Similar to GRPO
- **Memory**: Lower (no critic)
- **Speed**: Fast
- **Requires**: Group mean normalization only

**Key innovation**: Removes the per-group `/std` normalization from GRPO (std scaling is not required by variance-reduction theory in RL).

**When to use**:
- GRPO variant experimentation
- Avoiding std normalization issues

**Hyperparameters**:
```bash
--advantage_estimator dr_grpo
--init_kl_coef 0.01
```

### RLOO (REINFORCE Leave-One-Out)

**Formula**:
```
baseline = (sum(rewards) - rewards) / (n_samples - 1)
rewards = rewards - baseline
loss = -ratio * advantages   (with PPO-clip)
```

**Characteristics**:
- **Stability**: High (PPO-clip)
- **Memory**: Lower (no critic)
- **Speed**: Fast
- **Requires**: Multiple samples per prompt, per-token KL

**Key innovation**: A leave-one-out baseline over samples from the same prompt, combined with a per-token KL reward and PPO-clip loss.
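
A small sketch of the leave-one-out baseline from the formula above; the function name and shapes are illustrative only.

```python
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (num_prompts, n_samples_per_prompt).

    Each sample's baseline is the mean reward of the *other* samples for the
    same prompt: (sum - r_i) / (n - 1).
    """
    n = rewards.size(1)
    baseline = (rewards.sum(dim=1, keepdim=True) - rewards) / (n - 1)
    return rewards - baseline

# Toy usage: 2 prompts × 4 samples each
adv = rloo_advantages(torch.tensor([[1.0, 0.0, 0.0, 1.0],
                                    [0.5, 1.0, 0.0, 0.0]]))
```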

**When to use**:
- Need per-token KL rewards
- Want PPO stability without a critic
- Multiple samples per prompt

**Hyperparameters**:
```bash
--advantage_estimator rloo
--n_samples_per_prompt 4   # Must be > 1
--init_kl_coef 0.01
```

## Comparison Table

| Algorithm | Critic | Stability | Memory | Speed | Best For |
|-----------|--------|-----------|--------|-------|----------|
| PPO | ✅ Yes | ⭐⭐⭐⭐⭐ | High | Medium | General purpose |
| REINFORCE++ | ❌ No | ⭐⭐⭐⭐ | Low | **Fast** | Critic-free PPO |
| REINFORCE++-baseline | ❌ No | ⭐⭐⭐⭐⭐ | Low | **Fast** | RLVR settings |
| GRPO | ❌ No | ⭐⭐⭐ | Low | Fast | Reward normalization |
| Dr. GRPO | ❌ No | ⭐⭐⭐ | Low | Fast | GRPO variant |
| RLOO | ❌ No | ⭐⭐⭐⭐ | Low | Fast | Per-token KL |

## Experience Data Structure

**PPO (with critic)**:
```python
from dataclasses import dataclass

import torch

@dataclass
class Experience:
    sequences: torch.Tensor         # Token sequences
    attention_mask: torch.Tensor    # Attention masks
    action_mask: torch.Tensor       # Action masks
    action_log_probs: torch.Tensor  # Log π(a|s)
    values: torch.Tensor            # Critic value estimates
    returns: torch.Tensor           # Cumulative returns
    advantages: torch.Tensor        # GAE advantages
    reward: float                   # Total reward
    kl: torch.Tensor                # KL divergence
```

**REINFORCE++ (no critic)**:
```python
# No values, returns, or advantages stored;
# only sequences, log_probs, and rewards.
```

## Memory Comparison (7B Model)

| Algorithm | Components | Memory (8× A100) |
|-----------|------------|------------------|
| PPO | Actor + Critic + Reward + Ref | ~40GB |
| REINFORCE++ | Actor + Reward + Ref | ~28GB |
| GRPO | Actor + Reward + Ref | ~28GB |
| RLOO | Actor + Reward + Ref | ~28GB |

**Savings**: ~30% memory reduction without the critic

## Speed Comparison

**Relative training time** (7B model, 1000 steps):
- PPO: 1.0× baseline
- REINFORCE++: **0.75×** (25% faster)
- GRPO: 0.80×
- RLOO: 0.80×

**Why REINFORCE++ is faster**:
- No critic training
- No value function updates
- Fewer backward passes

## Choosing an Algorithm

### Decision Tree

```
Need maximum stability?
├─ Yes → PPO (with critic)
└─ No ↓

Have multiple samples per prompt?
├─ Yes ↓
│  └─ RLVR setting with varying rewards?
│     ├─ Yes → REINFORCE++-baseline
│     └─ No → RLOO (if need per-token KL)
└─ No ↓

Want faster than PPO?
└─ Yes → REINFORCE++ (most stable critic-free)

Experimenting with normalization?
└─ Yes → GRPO or Dr. GRPO
```

### By Use Case

**Production deployment**:
```bash
# Maximum stability
--advantage_estimator gae   # PPO
--clip_eps_low 0.2
--init_kl_coef 0.01
```

**Memory-constrained**:
```bash
# No critic, stable
--advantage_estimator reinforce   # REINFORCE++
--critic_pretrain None
```

**RLVR / verifiable rewards**:
```bash
# Baseline reward shaping
--advantage_estimator reinforce_baseline
--n_samples_per_prompt 4
```

**Research / experimentation**:
```bash
# Explore GRPO variants
--advantage_estimator group_norm
--use_kl_loss --kl_estimator k3
```

## Advanced Configuration

### Reward Normalization

**PPO (no manual normalization)**:
```bash
--advantage_estimator gae
# GAE handles advantage normalization
```

**GRPO (group normalization)**:
```bash
--advantage_estimator group_norm
--normalize_reward   # Optional additional normalization
```

**Disable std normalization**:
```bash
--no_advantage_std_norm   # Keep mean norm only
```

### KL Penalty Configuration

**All algorithms support**:
```bash
--init_kl_coef 0.01   # Initial KL coefficient
--kl_target 0.1       # Target KL divergence
--kl_horizon 10000    # Steps to reach target
```

**GRPO-specific**:
```bash
--use_kl_loss       # Enable KL loss term
--kl_estimator k3   # Loss function choice
```

### Clipping Configuration

**PPO clipping**:
```bash
--clip_eps_low 0.2    # Lower bound
--clip_eps_high 0.2   # Upper bound
```

**Reward clipping**:
```bash
--reward_clip_range 10.0   # Clip rewards to [-10, 10]
```

## Common Issues

### PPO Instability

**Symptom**: Large policy updates, divergence

**Solution**: Reduce the clipping range
```bash
--clip_eps_low 0.1    # Reduce from 0.2
--clip_eps_high 0.1
```

### GRPO High Variance

**Symptom**: Unstable training with GRPO

**Solution**: Switch to REINFORCE++
```bash
--advantage_estimator reinforce   # More stable
```

### Memory OOM with PPO

**Symptom**: OOM during critic training

**Solution**: Switch to a critic-free algorithm
```bash
--advantage_estimator reinforce   # No critic
--critic_pretrain None
```

### RLOO/Baseline Requires Multiple Samples

**Symptom**: `AssertionError: n_samples_per_prompt must be > 1`

**Solution**:
```bash
--n_samples_per_prompt 4   # Minimum 2, recommended 4-8
```

## References

- PPO paper: https://arxiv.org/abs/1707.06347
- GRPO paper: https://arxiv.org/abs/2402.03300
- OpenRLHF: https://github.com/OpenRLHF/OpenRLHF
- OpenRLHF paper: https://arxiv.org/abs/2405.11143