@synsci/cli-darwin-x64 1.1.49
- package/bin/skills/accelerate/SKILL.md +332 -0
- package/bin/skills/accelerate/references/custom-plugins.md +453 -0
- package/bin/skills/accelerate/references/megatron-integration.md +489 -0
- package/bin/skills/accelerate/references/performance.md +525 -0
- package/bin/skills/audiocraft/SKILL.md +564 -0
- package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
- package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
- package/bin/skills/autogpt/SKILL.md +403 -0
- package/bin/skills/autogpt/references/advanced-usage.md +535 -0
- package/bin/skills/autogpt/references/troubleshooting.md +420 -0
- package/bin/skills/awq/SKILL.md +310 -0
- package/bin/skills/awq/references/advanced-usage.md +324 -0
- package/bin/skills/awq/references/troubleshooting.md +344 -0
- package/bin/skills/axolotl/SKILL.md +158 -0
- package/bin/skills/axolotl/references/api.md +5548 -0
- package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
- package/bin/skills/axolotl/references/index.md +15 -0
- package/bin/skills/axolotl/references/other.md +3563 -0
- package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
- package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
- package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
- package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
- package/bin/skills/bitsandbytes/SKILL.md +411 -0
- package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
- package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
- package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
- package/bin/skills/blip-2/SKILL.md +564 -0
- package/bin/skills/blip-2/references/advanced-usage.md +680 -0
- package/bin/skills/blip-2/references/troubleshooting.md +526 -0
- package/bin/skills/chroma/SKILL.md +406 -0
- package/bin/skills/chroma/references/integration.md +38 -0
- package/bin/skills/clip/SKILL.md +253 -0
- package/bin/skills/clip/references/applications.md +207 -0
- package/bin/skills/constitutional-ai/SKILL.md +290 -0
- package/bin/skills/crewai/SKILL.md +498 -0
- package/bin/skills/crewai/references/flows.md +438 -0
- package/bin/skills/crewai/references/tools.md +429 -0
- package/bin/skills/crewai/references/troubleshooting.md +480 -0
- package/bin/skills/deepspeed/SKILL.md +141 -0
- package/bin/skills/deepspeed/references/08.md +17 -0
- package/bin/skills/deepspeed/references/09.md +173 -0
- package/bin/skills/deepspeed/references/2020.md +378 -0
- package/bin/skills/deepspeed/references/2023.md +279 -0
- package/bin/skills/deepspeed/references/assets.md +179 -0
- package/bin/skills/deepspeed/references/index.md +35 -0
- package/bin/skills/deepspeed/references/mii.md +118 -0
- package/bin/skills/deepspeed/references/other.md +1191 -0
- package/bin/skills/deepspeed/references/tutorials.md +6554 -0
- package/bin/skills/dspy/SKILL.md +590 -0
- package/bin/skills/dspy/references/examples.md +663 -0
- package/bin/skills/dspy/references/modules.md +475 -0
- package/bin/skills/dspy/references/optimizers.md +566 -0
- package/bin/skills/faiss/SKILL.md +221 -0
- package/bin/skills/faiss/references/index_types.md +280 -0
- package/bin/skills/flash-attention/SKILL.md +367 -0
- package/bin/skills/flash-attention/references/benchmarks.md +215 -0
- package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
- package/bin/skills/gguf/SKILL.md +427 -0
- package/bin/skills/gguf/references/advanced-usage.md +504 -0
- package/bin/skills/gguf/references/troubleshooting.md +442 -0
- package/bin/skills/gptq/SKILL.md +450 -0
- package/bin/skills/gptq/references/calibration.md +337 -0
- package/bin/skills/gptq/references/integration.md +129 -0
- package/bin/skills/gptq/references/troubleshooting.md +95 -0
- package/bin/skills/grpo-rl-training/README.md +97 -0
- package/bin/skills/grpo-rl-training/SKILL.md +572 -0
- package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
- package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
- package/bin/skills/guidance/SKILL.md +572 -0
- package/bin/skills/guidance/references/backends.md +554 -0
- package/bin/skills/guidance/references/constraints.md +674 -0
- package/bin/skills/guidance/references/examples.md +767 -0
- package/bin/skills/hqq/SKILL.md +445 -0
- package/bin/skills/hqq/references/advanced-usage.md +528 -0
- package/bin/skills/hqq/references/troubleshooting.md +503 -0
- package/bin/skills/hugging-face-cli/SKILL.md +191 -0
- package/bin/skills/hugging-face-cli/references/commands.md +954 -0
- package/bin/skills/hugging-face-cli/references/examples.md +374 -0
- package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
- package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
- package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
- package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
- package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
- package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
- package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
- package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
- package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
- package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
- package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
- package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
- package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
- package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
- package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
- package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
- package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
- package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
- package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
- package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
- package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
- package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
- package/bin/skills/hugging-face-jobs/index.html +216 -0
- package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
- package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
- package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
- package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
- package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
- package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
- package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
- package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
- package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
- package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
- package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
- package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
- package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
- package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
- package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
- package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
- package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
- package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
- package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
- package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
- package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
- package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
- package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
- package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
- package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
- package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
- package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
- package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
- package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
- package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
- package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
- package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
- package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
- package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
- package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
- package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
- package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
- package/bin/skills/instructor/SKILL.md +740 -0
- package/bin/skills/instructor/references/examples.md +107 -0
- package/bin/skills/instructor/references/providers.md +70 -0
- package/bin/skills/instructor/references/validation.md +606 -0
- package/bin/skills/knowledge-distillation/SKILL.md +458 -0
- package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
- package/bin/skills/lambda-labs/SKILL.md +545 -0
- package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
- package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
- package/bin/skills/langchain/SKILL.md +480 -0
- package/bin/skills/langchain/references/agents.md +499 -0
- package/bin/skills/langchain/references/integration.md +562 -0
- package/bin/skills/langchain/references/rag.md +600 -0
- package/bin/skills/langsmith/SKILL.md +422 -0
- package/bin/skills/langsmith/references/advanced-usage.md +548 -0
- package/bin/skills/langsmith/references/troubleshooting.md +537 -0
- package/bin/skills/litgpt/SKILL.md +469 -0
- package/bin/skills/litgpt/references/custom-models.md +568 -0
- package/bin/skills/litgpt/references/distributed-training.md +451 -0
- package/bin/skills/litgpt/references/supported-models.md +336 -0
- package/bin/skills/litgpt/references/training-recipes.md +619 -0
- package/bin/skills/llama-cpp/SKILL.md +258 -0
- package/bin/skills/llama-cpp/references/optimization.md +89 -0
- package/bin/skills/llama-cpp/references/quantization.md +213 -0
- package/bin/skills/llama-cpp/references/server.md +125 -0
- package/bin/skills/llama-factory/SKILL.md +80 -0
- package/bin/skills/llama-factory/references/_images.md +23 -0
- package/bin/skills/llama-factory/references/advanced.md +1055 -0
- package/bin/skills/llama-factory/references/getting_started.md +349 -0
- package/bin/skills/llama-factory/references/index.md +19 -0
- package/bin/skills/llama-factory/references/other.md +31 -0
- package/bin/skills/llamaguard/SKILL.md +337 -0
- package/bin/skills/llamaindex/SKILL.md +569 -0
- package/bin/skills/llamaindex/references/agents.md +83 -0
- package/bin/skills/llamaindex/references/data_connectors.md +108 -0
- package/bin/skills/llamaindex/references/query_engines.md +406 -0
- package/bin/skills/llava/SKILL.md +304 -0
- package/bin/skills/llava/references/training.md +197 -0
- package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
- package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
- package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
- package/bin/skills/long-context/SKILL.md +536 -0
- package/bin/skills/long-context/references/extension_methods.md +468 -0
- package/bin/skills/long-context/references/fine_tuning.md +611 -0
- package/bin/skills/long-context/references/rope.md +402 -0
- package/bin/skills/mamba/SKILL.md +260 -0
- package/bin/skills/mamba/references/architecture-details.md +206 -0
- package/bin/skills/mamba/references/benchmarks.md +255 -0
- package/bin/skills/mamba/references/training-guide.md +388 -0
- package/bin/skills/megatron-core/SKILL.md +366 -0
- package/bin/skills/megatron-core/references/benchmarks.md +249 -0
- package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
- package/bin/skills/megatron-core/references/production-examples.md +473 -0
- package/bin/skills/megatron-core/references/training-recipes.md +547 -0
- package/bin/skills/miles/SKILL.md +315 -0
- package/bin/skills/miles/references/api-reference.md +141 -0
- package/bin/skills/miles/references/troubleshooting.md +352 -0
- package/bin/skills/mlflow/SKILL.md +704 -0
- package/bin/skills/mlflow/references/deployment.md +744 -0
- package/bin/skills/mlflow/references/model-registry.md +770 -0
- package/bin/skills/mlflow/references/tracking.md +680 -0
- package/bin/skills/modal/SKILL.md +341 -0
- package/bin/skills/modal/references/advanced-usage.md +503 -0
- package/bin/skills/modal/references/troubleshooting.md +494 -0
- package/bin/skills/model-merging/SKILL.md +539 -0
- package/bin/skills/model-merging/references/evaluation.md +462 -0
- package/bin/skills/model-merging/references/examples.md +428 -0
- package/bin/skills/model-merging/references/methods.md +352 -0
- package/bin/skills/model-pruning/SKILL.md +495 -0
- package/bin/skills/model-pruning/references/wanda.md +347 -0
- package/bin/skills/moe-training/SKILL.md +526 -0
- package/bin/skills/moe-training/references/architectures.md +432 -0
- package/bin/skills/moe-training/references/inference.md +348 -0
- package/bin/skills/moe-training/references/training.md +425 -0
- package/bin/skills/nanogpt/SKILL.md +290 -0
- package/bin/skills/nanogpt/references/architecture.md +382 -0
- package/bin/skills/nanogpt/references/data.md +476 -0
- package/bin/skills/nanogpt/references/training.md +564 -0
- package/bin/skills/nemo-curator/SKILL.md +383 -0
- package/bin/skills/nemo-curator/references/deduplication.md +87 -0
- package/bin/skills/nemo-curator/references/filtering.md +102 -0
- package/bin/skills/nemo-evaluator/SKILL.md +494 -0
- package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
- package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
- package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
- package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
- package/bin/skills/nemo-guardrails/SKILL.md +297 -0
- package/bin/skills/nnsight/SKILL.md +436 -0
- package/bin/skills/nnsight/references/README.md +78 -0
- package/bin/skills/nnsight/references/api.md +344 -0
- package/bin/skills/nnsight/references/tutorials.md +300 -0
- package/bin/skills/openrlhf/SKILL.md +249 -0
- package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
- package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
- package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
- package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
- package/bin/skills/outlines/SKILL.md +652 -0
- package/bin/skills/outlines/references/backends.md +615 -0
- package/bin/skills/outlines/references/examples.md +773 -0
- package/bin/skills/outlines/references/json_generation.md +652 -0
- package/bin/skills/peft/SKILL.md +431 -0
- package/bin/skills/peft/references/advanced-usage.md +514 -0
- package/bin/skills/peft/references/troubleshooting.md +480 -0
- package/bin/skills/phoenix/SKILL.md +475 -0
- package/bin/skills/phoenix/references/advanced-usage.md +619 -0
- package/bin/skills/phoenix/references/troubleshooting.md +538 -0
- package/bin/skills/pinecone/SKILL.md +358 -0
- package/bin/skills/pinecone/references/deployment.md +181 -0
- package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
- package/bin/skills/pytorch-fsdp/references/index.md +7 -0
- package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
- package/bin/skills/pytorch-lightning/SKILL.md +346 -0
- package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
- package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
- package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
- package/bin/skills/pyvene/SKILL.md +473 -0
- package/bin/skills/pyvene/references/README.md +73 -0
- package/bin/skills/pyvene/references/api.md +383 -0
- package/bin/skills/pyvene/references/tutorials.md +376 -0
- package/bin/skills/qdrant/SKILL.md +493 -0
- package/bin/skills/qdrant/references/advanced-usage.md +648 -0
- package/bin/skills/qdrant/references/troubleshooting.md +631 -0
- package/bin/skills/ray-data/SKILL.md +326 -0
- package/bin/skills/ray-data/references/integration.md +82 -0
- package/bin/skills/ray-data/references/transformations.md +83 -0
- package/bin/skills/ray-train/SKILL.md +406 -0
- package/bin/skills/ray-train/references/multi-node.md +628 -0
- package/bin/skills/rwkv/SKILL.md +260 -0
- package/bin/skills/rwkv/references/architecture-details.md +344 -0
- package/bin/skills/rwkv/references/rwkv7.md +386 -0
- package/bin/skills/rwkv/references/state-management.md +369 -0
- package/bin/skills/saelens/SKILL.md +386 -0
- package/bin/skills/saelens/references/README.md +70 -0
- package/bin/skills/saelens/references/api.md +333 -0
- package/bin/skills/saelens/references/tutorials.md +318 -0
- package/bin/skills/segment-anything/SKILL.md +500 -0
- package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
- package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
- package/bin/skills/sentence-transformers/SKILL.md +255 -0
- package/bin/skills/sentence-transformers/references/models.md +123 -0
- package/bin/skills/sentencepiece/SKILL.md +235 -0
- package/bin/skills/sentencepiece/references/algorithms.md +200 -0
- package/bin/skills/sentencepiece/references/training.md +304 -0
- package/bin/skills/sglang/SKILL.md +442 -0
- package/bin/skills/sglang/references/deployment.md +490 -0
- package/bin/skills/sglang/references/radix-attention.md +413 -0
- package/bin/skills/sglang/references/structured-generation.md +541 -0
- package/bin/skills/simpo/SKILL.md +219 -0
- package/bin/skills/simpo/references/datasets.md +478 -0
- package/bin/skills/simpo/references/hyperparameters.md +452 -0
- package/bin/skills/simpo/references/loss-functions.md +350 -0
- package/bin/skills/skypilot/SKILL.md +509 -0
- package/bin/skills/skypilot/references/advanced-usage.md +491 -0
- package/bin/skills/skypilot/references/troubleshooting.md +570 -0
- package/bin/skills/slime/SKILL.md +464 -0
- package/bin/skills/slime/references/api-reference.md +392 -0
- package/bin/skills/slime/references/troubleshooting.md +386 -0
- package/bin/skills/speculative-decoding/SKILL.md +467 -0
- package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
- package/bin/skills/speculative-decoding/references/medusa.md +350 -0
- package/bin/skills/stable-diffusion/SKILL.md +519 -0
- package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
- package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
- package/bin/skills/tensorboard/SKILL.md +629 -0
- package/bin/skills/tensorboard/references/integrations.md +638 -0
- package/bin/skills/tensorboard/references/profiling.md +545 -0
- package/bin/skills/tensorboard/references/visualization.md +620 -0
- package/bin/skills/tensorrt-llm/SKILL.md +187 -0
- package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
- package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
- package/bin/skills/tensorrt-llm/references/serving.md +470 -0
- package/bin/skills/tinker/SKILL.md +362 -0
- package/bin/skills/tinker/references/api-reference.md +168 -0
- package/bin/skills/tinker/references/getting-started.md +157 -0
- package/bin/skills/tinker/references/loss-functions.md +163 -0
- package/bin/skills/tinker/references/models-and-lora.md +139 -0
- package/bin/skills/tinker/references/recipes.md +280 -0
- package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
- package/bin/skills/tinker/references/rendering.md +243 -0
- package/bin/skills/tinker/references/supervised-learning.md +232 -0
- package/bin/skills/tinker-training-cost/SKILL.md +187 -0
- package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
- package/bin/skills/torchforge/SKILL.md +433 -0
- package/bin/skills/torchforge/references/api-reference.md +327 -0
- package/bin/skills/torchforge/references/troubleshooting.md +409 -0
- package/bin/skills/torchtitan/SKILL.md +358 -0
- package/bin/skills/torchtitan/references/checkpoint.md +181 -0
- package/bin/skills/torchtitan/references/custom-models.md +258 -0
- package/bin/skills/torchtitan/references/float8.md +133 -0
- package/bin/skills/torchtitan/references/fsdp.md +126 -0
- package/bin/skills/transformer-lens/SKILL.md +346 -0
- package/bin/skills/transformer-lens/references/README.md +54 -0
- package/bin/skills/transformer-lens/references/api.md +362 -0
- package/bin/skills/transformer-lens/references/tutorials.md +339 -0
- package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
- package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
- package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
- package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
- package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
- package/bin/skills/unsloth/SKILL.md +80 -0
- package/bin/skills/unsloth/references/index.md +7 -0
- package/bin/skills/unsloth/references/llms-full.md +16799 -0
- package/bin/skills/unsloth/references/llms-txt.md +12044 -0
- package/bin/skills/unsloth/references/llms.md +82 -0
- package/bin/skills/verl/SKILL.md +391 -0
- package/bin/skills/verl/references/api-reference.md +301 -0
- package/bin/skills/verl/references/troubleshooting.md +391 -0
- package/bin/skills/vllm/SKILL.md +364 -0
- package/bin/skills/vllm/references/optimization.md +226 -0
- package/bin/skills/vllm/references/quantization.md +284 -0
- package/bin/skills/vllm/references/server-deployment.md +255 -0
- package/bin/skills/vllm/references/troubleshooting.md +447 -0
- package/bin/skills/weights-and-biases/SKILL.md +590 -0
- package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
- package/bin/skills/weights-and-biases/references/integrations.md +700 -0
- package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
- package/bin/skills/whisper/SKILL.md +317 -0
- package/bin/skills/whisper/references/languages.md +189 -0
- package/bin/synsc +0 -0
- package/package.json +10 -0
@@ -0,0 +1,327 @@

# torchforge API Reference

## Architecture Overview

torchforge implements a fully asynchronous RL system built on:

- **Monarch**: PyTorch-native distributed coordination framework
- **TorchTitan**: Meta's production LLM training platform
- **vLLM**: High-throughput inference engine

```
┌─────────────────────────────────────────────────────────┐
│ Application Layer (Your Code)                           │
│ - Define reward models, loss functions, sampling        │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│ Forge API Layer                                         │
│ - ForgeActor, Service                                   │
│ - Async service interfaces                              │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│ Distributed Services (Monarch)                          │
│ ├── TitanTrainer (TorchTitan FSDP)                      │
│ ├── Generator (vLLM inference)                          │
│ └── ReferenceModel (frozen KL baseline)                 │
└─────────────────────────────────────────────────────────┘
```

## Core Classes

### ForgeActor

Base class for Forge actors with configurable resource attributes.

**Location**: `forge.controller.actor.ForgeActor`

```python
from forge.controller.actor import ForgeActor

class MyActor(ForgeActor):
    procs = 1          # Number of processes
    hosts = None       # Host distribution
    with_gpus = True   # GPU allocation flag
    num_replicas = 1   # Service replica count
    mesh_name = None   # Process mesh identifier
```

**Class Methods**:
- `as_actor(*args, **actor_kwargs)` → Spawns single actor using `.options()` configuration
- `launch(*args, **kwargs)` → Provisions and deploys new actor replica
- `options(*, procs=1, hosts=None, with_gpus=False, num_replicas=1, mesh_name=None, **kwargs)` → Pre-configures actor class
- `shutdown(actor)` → Terminates actor instance

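The pre-configure-then-spawn idea behind `options()` can be illustrated with a self-contained stand-in. This is only a sketch of the general pattern, not torchforge's implementation: `SketchActor` and its behaviour are hypothetical, and only the attribute names and defaults come from the signature listed above.

```python
class SketchActor:
    """Hypothetical stand-in for ForgeActor; not the real class."""

    # Defaults mirror the options() signature listed above.
    procs = 1
    hosts = None
    with_gpus = False
    num_replicas = 1
    mesh_name = None

    @classmethod
    def options(cls, *, procs=1, hosts=None, with_gpus=False,
                num_replicas=1, mesh_name=None):
        """Return a subclass whose resource attributes are pre-set."""
        attrs = {
            "procs": procs,
            "hosts": hosts,
            "with_gpus": with_gpus,
            "num_replicas": num_replicas,
            "mesh_name": mesh_name,
        }
        return type(cls.__name__, (cls,), attrs)


class MyActor(SketchActor):
    pass


# Pre-configure a variant without mutating the original class.
GpuActor = MyActor.options(procs=2, with_gpus=True)
```

Returning a new subclass keeps `MyActor` itself unchanged, so several differently sized variants of the same actor can coexist.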
### TitanTrainer

Generic trainer actor built on TorchTitan's training engine.

**Location**: `forge.actors.trainer.TitanTrainer`

**Key Methods**:
- `forward_backward(batch)` → Forward and backward pass
- `train_step()` → Complete training step
- `setup()` / `cleanup()` → Lifecycle methods
- `clear_gradients()` → Reset gradients
- `save()` / `load()` → Checkpoint operations
- `push_weights()` → Sync weights to inference
- `get_config()` / `get_status()` → Introspection

**Properties**: `job`, `model`, `optimizer`, `lr_scheduler`, `training`, `parallelism`, `checkpoint`, `activation_checkpoint`, `compile`, `quantize`, `comm`, `memory_estimation`, `state_dict_key`

### Generator

vLLM-based generator for inference.

**Location**: `forge.actors.generator.Generator`

```python
from forge.actors.generator import Generator

# Constructor signature as rendered in the API docs. `<factory>` marks a
# dataclass default factory, so this block is not literal Python.
generator = Generator(
    engine_args=<factory>,
    sampling_params=<factory>,
    prefetch_weights_to_shm=True,
    n_fetcher_procs=8
)
```

**Key Methods**:
- `generate()` → Generate completions
- `run()` → Async generation loop
- `update_weights()` → Receive new weights from trainer
- `get_version()` / `get_vllm_config()` → Introspection

**Returns**: `Completion` dataclass with fields: `prompt`, `text`, `token_ids`, `logprobs`

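To make the shape of the returned record concrete, here is a minimal stand-in carrying the four fields named above. `CompletionSketch` is hypothetical, the field types are assumptions (the text only gives the names), and the token ids are arbitrary illustrative values.

```python
from dataclasses import dataclass, field


@dataclass
class CompletionSketch:
    """Hypothetical stand-in with the four documented field names."""
    prompt: str
    text: str
    token_ids: list[int] = field(default_factory=list)      # assumed type
    logprobs: list[float] = field(default_factory=list)     # assumed type


def sequence_logprob(completion: CompletionSketch) -> float:
    """Sum per-token log-probabilities into one sequence log-probability."""
    if len(completion.token_ids) != len(completion.logprobs):
        raise ValueError("token_ids and logprobs must have the same length")
    return sum(completion.logprobs)


example = CompletionSketch(
    prompt="2 + 2 =",
    text=" 4",
    token_ids=[220, 19],        # arbitrary ids, for illustration only
    logprobs=[-0.25, -0.5],
)
total = sequence_logprob(example)
```

A per-sequence log-probability like `total` is the kind of quantity a policy-gradient loss would typically consume from generation output.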
### ReferenceModel

Frozen policy copy for computing KL divergence.

**Location**: `forge.actors.reference_model.ReferenceModel`

Maintains a frozen copy of the policy for computing advantages without gradient computation.

**Key Methods**:
- `forward()` → Inference without gradients
- `setup()` → Initialize from checkpoint

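As a rough illustration of how reference log-probabilities can feed a KL term, the sketch below applies a commonly used non-negative per-token estimator. The text does not say which estimator torchforge uses, so the formula and the function name `per_token_kl` are assumptions.

```python
import math


def per_token_kl(policy_logprobs, ref_logprobs):
    """Estimate KL(policy || reference) for each sampled token.

    Uses exp(d) - d - 1 with d = ref_logprob - policy_logprob,
    which is zero when both models agree and positive otherwise.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return [
        math.exp(ref - pol) - (ref - pol) - 1.0
        for pol, ref in zip(policy_logprobs, ref_logprobs)
    ]


policy = [-0.5, -1.0, -2.0]       # log-probs from the trainable policy
reference = [-0.5, -1.5, -1.0]    # log-probs from the frozen copy
kl = per_token_kl(policy, reference)
```

Because the reference copy is frozen, its log-probabilities act as constants here; only the policy side would carry gradients in a real training step.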
### Service

Actor-less service implementation for managing replicas.

**Location**: `forge.controller.service.service.Service`

```python
Service(cfg, actor_def, actor_args, actor_kwargs)
```

**Methods**:
- `call_all(function, *args, **kwargs)` → Call function on all healthy replicas
- `get_metrics()` → Returns ServiceMetrics object
- `start_session()` / `terminate_session(sess_id)` → Session management
- `stop()` → Stop service and all replicas

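The fan-out behaviour described for `call_all` can be sketched with plain `asyncio`. Everything here is a hypothetical stand-in (`ReplicaSketch`, `ServiceSketch`, the `healthy` flag), and passing the target as a method-name string is an assumption, since the text does not specify the type of `function`.

```python
import asyncio


class ReplicaSketch:
    """Hypothetical replica with a health flag and one async endpoint."""

    def __init__(self, idx: int, healthy: bool = True):
        self.idx = idx
        self.healthy = healthy

    async def get_version(self) -> int:
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        return self.idx


class ServiceSketch:
    """Hypothetical service that fans a call out to healthy replicas."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    async def call_all(self, function: str, *args, **kwargs):
        healthy = [r for r in self.replicas if r.healthy]
        calls = [getattr(r, function)(*args, **kwargs) for r in healthy]
        # gather() runs the calls concurrently and preserves their order.
        return await asyncio.gather(*calls)


service = ServiceSketch(
    [ReplicaSketch(0), ReplicaSketch(1, healthy=False), ReplicaSketch(2)]
)
results = asyncio.run(service.call_all("get_version"))
```

The unhealthy replica is skipped, so `results` contains one entry per healthy replica in replica order.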
## Configuration (TorchTitan)

torchforge uses TorchTitan's configuration system:

### Job Configuration

```python
from torchtitan.config.job_config import Job

@dataclass
class Job:
    config_file: str
    dump_folder: str
    description: str
    print_config: bool
    custom_config_module: str
```

### Model Configuration

```python
from torchtitan.config.job_config import Model

@dataclass
class Model:
    name: str
    flavor: str
    hf_assets_path: str
    tokenizer_path: str
    converters: list
    print_after_conversion: bool
```

### Training Configuration

```python
from torchtitan.config.job_config import Training

@dataclass
class Training:
    dataset: str
    dataset_path: str
    local_batch_size: int
    global_batch_size: int
    seq_len: int
    max_norm: float
    steps: int
    dtype: str
    mixed_precision_param: str
    mixed_precision_reduce: str
    gc_freq: int
    seed: int
    deterministic: bool
    enable_cpu_offload: bool
    # ... additional fields
```

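One way to read `local_batch_size` and `global_batch_size` together is through gradient accumulation. The sketch below assumes the usual relation global = local × data-parallel degree × accumulation steps; the helper name and the `dp_degree` argument are hypothetical, and the exact rule and defaults should be checked against the TorchTitan configuration docs.

```python
def gradient_accumulation_steps(local_batch_size: int,
                                global_batch_size: int,
                                dp_degree: int) -> int:
    """Micro-batches accumulated per optimizer step.

    Assumes global = local * dp_degree * accumulation_steps.
    """
    per_pass = local_batch_size * dp_degree
    if global_batch_size % per_pass != 0:
        raise ValueError(
            f"global_batch_size {global_batch_size} is not a multiple of "
            f"local_batch_size * dp_degree = {per_pass}"
        )
    return global_batch_size // per_pass


steps = gradient_accumulation_steps(
    local_batch_size=4, global_batch_size=64, dp_degree=4
)
```

With these illustrative numbers each rank would process four micro-batches of size 4 before every optimizer step.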
### Parallelism Configuration

```python
from torchtitan.config.job_config import Parallelism

@dataclass
class Parallelism:
    # Parallelism degrees
    data_parallel_shard_degree: int
    data_parallel_replicate_degree: int
    tensor_parallel_degree: int
    pipeline_parallel_degree: int
    context_parallel_degree: int
    expert_parallel_degree: int
    # FSDP configuration options
    # ... additional fields
```

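A quick sanity check on a parallelism layout is to multiply the degrees and compare the result with the number of available ranks. The helper below is a hypothetical sketch under the assumption that the data, tensor, pipeline, and context degrees multiply to the world size; `expert_parallel_degree` is deliberately left out because the text does not describe how it maps onto ranks.

```python
from math import prod


def required_world_size(dp_replicate: int, dp_shard: int, tp: int,
                        pp: int, cp: int) -> int:
    """Number of ranks implied by the given parallelism degrees."""
    degrees = (dp_replicate, dp_shard, tp, pp, cp)
    if any(d < 1 for d in degrees):
        raise ValueError("all degrees must be >= 1 in this sketch")
    return prod(degrees)


# Example: 4-way sharded data parallelism combined with 2-way tensor parallelism.
world = required_world_size(dp_replicate=1, dp_shard=4, tp=2, pp=1, cp=1)
```

If `world` differs from the number of GPUs actually launched, the layout cannot be satisfied as written.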
### Optimizer Configuration

```python
from torchtitan.config.job_config import Optimizer

@dataclass
class Optimizer:
    name: str
    lr: float
    beta1: float
    beta2: float
    eps: float
    weight_decay: float
    implementation: str
    early_step_in_backward: bool
```

|
|
218
|
+
## YAML Configuration Example
|
|
219
|
+
|
|
220
|
+
```yaml
# config/grpo_math.yaml
model: "Qwen/Qwen2.5-7B-Instruct"

dataset:
  path: "openai/gsm8k"
  split: "train"
  streaming: true

training:
  batch_size: 4
  learning_rate: 1e-6
  seq_len: 4096
  dtype: bfloat16
  gradient_accumulation_steps: 4

grpo:
  n_samples: 8
  clip_low: 0.2
  clip_high: 0.28
  beta: 0.1
  temperature: 0.7

services:
  generator:
    procs: 1
    num_replicas: 1
    with_gpus: true
  trainer:
    procs: 1
    num_replicas: 1
    with_gpus: true
  ref_model:
    procs: 1
    num_replicas: 1
    with_gpus: true
```
|
|
257
|
+
|
|
258
|
+
## Launch Commands
|
|
259
|
+
|
|
260
|
+
### SFT Training (2+ GPUs)
|
|
261
|
+
|
|
262
|
+
```bash
|
|
263
|
+
python -m apps.sft.main --config apps/sft/llama3_8b.yaml
|
|
264
|
+
```
|
|
265
|
+
|
|
266
|
+
### GRPO Training (3+ GPUs)
|
|
267
|
+
|
|
268
|
+
```bash
|
|
269
|
+
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
|
|
270
|
+
```
|
|
271
|
+
|
|
272
|
+
### Multi-GPU Distributed
|
|
273
|
+
|
|
274
|
+
```bash
python -m apps.grpo.main \
  --config config/distributed.yaml \
  --trainer.procs 4 \
  --generator.procs 4
```
|
|
280
|
+
|
|
281
|
+
## Async Communication Pattern
|
|
282
|
+
|
|
283
|
+
torchforge uses async/await patterns for service communication:
|
|
284
|
+
|
|
285
|
+
```python
|
|
286
|
+
# Route: async point-to-point
|
|
287
|
+
response = await service.method.route(arg1, arg2)
|
|
288
|
+
|
|
289
|
+
# Fanout: broadcast to all replicas
|
|
290
|
+
await service.update_weights.fanout(training_step)
|
|
291
|
+
```
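
The difference between the two call styles can be illustrated with plain `asyncio`. This is a toy sketch only: `ToyService` is a hypothetical stand-in written for this example, it does not use or reproduce the torchforge or Monarch API, and round-robin selection is an assumption about how a single replica might be picked.

```python
import asyncio
import itertools


class ToyService:
    """Hypothetical replicated service; not the torchforge API."""

    def __init__(self, num_replicas: int) -> None:
        self.versions = [0] * num_replicas  # weight version held by each replica
        self._next = itertools.cycle(range(num_replicas))

    async def _generate(self, replica: int, prompt: str) -> str:
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        return f"replica{replica}@v{self.versions[replica]}:{prompt}"

    async def _update(self, replica: int, version: int) -> None:
        await asyncio.sleep(0)
        self.versions[replica] = version

    async def route(self, prompt: str) -> str:
        # Point-to-point: exactly one replica handles the call.
        return await self._generate(next(self._next), prompt)

    async def fanout(self, version: int) -> None:
        # Broadcast: every replica receives the call, awaited concurrently.
        await asyncio.gather(
            *(self._update(r, version) for r in range(len(self.versions)))
        )


async def demo():
    service = ToyService(num_replicas=3)
    await service.fanout(version=7)        # all replicas move to version 7
    reply = await service.route("hello")   # one replica answers
    return reply, service.versions


reply, versions = asyncio.run(demo())
print(reply, versions)
```

The point of the sketch is the shape of the calls: a routed call returns one result, while a fanned-out call completes only after every replica has processed it.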
|
|
292
|
+
|
|
293
|
+
## Installation
|
|
294
|
+
|
|
295
|
+
```bash
|
|
296
|
+
# Create environment
|
|
297
|
+
conda create -n forge python=3.12
|
|
298
|
+
conda activate forge
|
|
299
|
+
|
|
300
|
+
# Install (handles PyTorch nightly + dependencies)
|
|
301
|
+
./scripts/install.sh
|
|
302
|
+
|
|
303
|
+
# ROCm (AMD GPUs)
|
|
304
|
+
./scripts/install_rocm.sh
|
|
305
|
+
|
|
306
|
+
# Verify
|
|
307
|
+
python -c "import torch, forge, vllm; print('OK')"
|
|
308
|
+
```
|
|
309
|
+
|
|
310
|
+
**Requirements**:
|
|
311
|
+
- PyTorch >= 2.9.0 (nightly)
|
|
312
|
+
- Monarch
|
|
313
|
+
- TorchTitan
|
|
314
|
+
- vLLM
|
|
315
|
+
|
|
316
|
+
## Experimental Warning
|
|
317
|
+
|
|
318
|
+
Both Monarch and torchforge are experimental. APIs may change as the project learns from early adopters.
|
|
319
|
+
|
|
320
|
+
## Resources
|
|
321
|
+
|
|
322
|
+
- Documentation: https://meta-pytorch.org/torchforge
|
|
323
|
+
- GitHub: https://github.com/meta-pytorch/torchforge
|
|
324
|
+
- Discord: https://discord.gg/YsTYBh6PD9
|
|
325
|
+
- TorchTitan: https://github.com/pytorch/torchtitan
|
|
326
|
+
- Monarch: https://github.com/meta-pytorch/monarch
|
|
327
|
+
- Blog: https://pytorch.org/blog/introducing-torchforge/
|
|
@@ -0,0 +1,409 @@
|
|
|
1
|
+
# torchforge Troubleshooting Guide
|
|
2
|
+
|
|
3
|
+
## GPU Resource Issues
|
|
4
|
+
|
|
5
|
+
### Issue: Not Enough GPUs
|
|
6
|
+
|
|
7
|
+
**Symptoms**: "Insufficient GPU resources" error
|
|
8
|
+
|
|
9
|
+
**Solutions**:
|
|
10
|
+
|
|
11
|
+
1. **Reduce service requirements**:
|
|
12
|
+
```yaml
services:
  generator:
    procs: 1
    with_gpus: true
  trainer:
    procs: 1
    with_gpus: true
  # Remove ref_model or use CPU
```
|
|
22
|
+
|
|
23
|
+
2. **Use CPU for reference model**:
|
|
24
|
+
```yaml
ref_model:
  with_gpus: false  # Run on CPU
```
|
|
28
|
+
|
|
29
|
+
3. **Share resources between services**:
|
|
30
|
+
```yaml
services:
  generator:
    procs: 1
    num_replicas: 1
    colocate_with: trainer  # Share GPU with trainer
```
|
|
37
|
+
|
|
38
|
+
### Issue: Minimum GPU Requirements
|
|
39
|
+
|
|
40
|
+
**Reference**:
|
|
41
|
+
- SFT: 2+ GPUs (trainer processes only; SFT does not run a generator)
|
|
42
|
+
- GRPO: 3+ GPUs (trainer + generator + ref_model)
|
|
43
|
+
- Large models: 8+ GPUs with tensor parallelism
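
A quick way to sanity-check a `services` section against the hardware is to add up what it asks for. The sketch below assumes each process of a service with `with_gpus: true` occupies one GPU and that replicas multiply that count; `gpus_requested` is a hypothetical helper, not a torchforge function.

```python
def gpus_requested(services: dict) -> int:
    """Sum procs * num_replicas over services that ask for GPUs."""
    total = 0
    for spec in services.values():
        if spec.get("with_gpus", False):
            total += spec.get("procs", 1) * spec.get("num_replicas", 1)
    return total


# Mirrors a GRPO layout with one generator, one trainer, one reference model
grpo_services = {
    "generator": {"procs": 1, "num_replicas": 1, "with_gpus": True},
    "trainer": {"procs": 1, "num_replicas": 1, "with_gpus": True},
    "ref_model": {"procs": 1, "num_replicas": 1, "with_gpus": True},
}

needed = gpus_requested(grpo_services)
print(f"GPUs requested: {needed}")
```

Comparing `needed` with `torch.cuda.device_count()` before launching gives an early warning instead of a failure during service startup.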
|
|
44
|
+
|
|
45
|
+
## Memory Issues
|
|
46
|
+
|
|
47
|
+
### Issue: OOM During Generation
|
|
48
|
+
|
|
49
|
+
**Symptoms**: CUDA OOM in vLLM
|
|
50
|
+
|
|
51
|
+
**Solutions**:
|
|
52
|
+
|
|
53
|
+
1. **Reduce the generation batch size** (fewer samples per prompt):

```yaml
grpo:
  n_samples: 4  # Reduce from 8
```
|
|
58
|
+
|
|
59
|
+
2. **Reduce sequence length**:
|
|
60
|
+
```yaml
training:
  seq_len: 2048  # Reduce from 4096
```
|
|
64
|
+
|
|
65
|
+
3. **Reduce vLLM memory**:
|
|
66
|
+
```yaml
generator:
  gpu_memory_utilization: 0.7  # Reduce from 0.9
```
|
|
70
|
+
|
|
71
|
+
### Issue: OOM During Training
|
|
72
|
+
|
|
73
|
+
**Symptoms**: CUDA OOM in backward pass
|
|
74
|
+
|
|
75
|
+
**Solutions**:
|
|
76
|
+
|
|
77
|
+
1. **Enable gradient checkpointing**:
|
|
78
|
+
```yaml
training:
  gradient_checkpointing: true
```
|
|
82
|
+
|
|
83
|
+
2. **Increase gradient accumulation** (and lower `batch_size` by the same factor, so the effective batch size is unchanged while each micro-batch uses less memory):

```yaml
training:
  gradient_accumulation_steps: 8  # Increase from 4
```
|
|
88
|
+
|
|
89
|
+
3. **Reduce batch size**:
|
|
90
|
+
```yaml
training:
  batch_size: 2  # Reduce from 4
```
|
|
94
|
+
|
|
95
|
+
## Weight Synchronization Issues
|
|
96
|
+
|
|
97
|
+
### Issue: Slow Weight Sync
|
|
98
|
+
|
|
99
|
+
**Symptoms**: Long pauses between training and generation
|
|
100
|
+
|
|
101
|
+
**Solutions**:
|
|
102
|
+
|
|
103
|
+
1. **Enable RDMA** (if available):
|
|
104
|
+
```bash
|
|
105
|
+
export TORCHSTORE_USE_RDMA=1
|
|
106
|
+
```
|
|
107
|
+
|
|
108
|
+
2. **Reduce sync frequency**:
|
|
109
|
+
```yaml
training:
  sync_interval: 10  # Sync every 10 steps
```
|
|
113
|
+
|
|
114
|
+
3. **Use colocated services**:
|
|
115
|
+
```yaml
services:
  generator:
    colocate_with: trainer
```
|
|
120
|
+
|
|
121
|
+
### Issue: Weight Sync Failures
|
|
122
|
+
|
|
123
|
+
**Symptoms**: Errors in weight transfer, stale weights
|
|
124
|
+
|
|
125
|
+
**Solutions**:
|
|
126
|
+
|
|
127
|
+
1. **Check network connectivity**:
|
|
128
|
+
```bash
|
|
129
|
+
ping other_node
|
|
130
|
+
```
|
|
131
|
+
|
|
132
|
+
2. **Increase timeout**:
|
|
133
|
+
```yaml
services:
  weight_sync_timeout: 600  # 10 minutes
```
|
|
137
|
+
|
|
138
|
+
3. **Enable sync verification**:
|
|
139
|
+
```yaml
training:
  verify_weight_sync: true
```
|
|
143
|
+
|
|
144
|
+
## Training Stability Issues
|
|
145
|
+
|
|
146
|
+
### Issue: Policy Collapse
|
|
147
|
+
|
|
148
|
+
**Symptoms**: Entropy drops to zero, reward stops improving
|
|
149
|
+
|
|
150
|
+
**Solutions**:
|
|
151
|
+
|
|
152
|
+
1. **Increase KL penalty**:
|
|
153
|
+
```yaml
grpo:
  beta: 0.2  # Increase from 0.1
```
|
|
157
|
+
|
|
158
|
+
2. **Add entropy bonus**:
|
|
159
|
+
```yaml
training:
  entropy_coef: 0.01
```
|
|
163
|
+
|
|
164
|
+
3. **Reduce learning rate**:
|
|
165
|
+
```yaml
training:
  learning_rate: 5e-7  # Reduce from 1e-6
```
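
The entropy symptom above can be watched directly. A minimal sketch in plain Python, assuming per-token probability distributions are available from the policy; `token_entropy` is an illustrative helper and not part of torchforge:

```python
import math


def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    # Zero-probability entries contribute nothing to the sum.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


healthy = token_entropy([0.25, 0.25, 0.25, 0.25])   # spread over 4 tokens
collapsed = token_entropy([1.0, 0.0, 0.0, 0.0])     # all mass on one token
print(healthy, collapsed)
```

Logging the mean of this value over generated tokens at each step makes a slide toward zero visible before the reward curve flattens.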
|
|
169
|
+
|
|
170
|
+
### Issue: Loss Spikes
|
|
171
|
+
|
|
172
|
+
**Symptoms**: Sudden loss increases, training instability
|
|
173
|
+
|
|
174
|
+
**Solutions**:
|
|
175
|
+
|
|
176
|
+
1. **Enable gradient clipping**:
|
|
177
|
+
```yaml
training:
  max_grad_norm: 1.0
```
|
|
181
|
+
|
|
182
|
+
2. **Reduce clip range**:
|
|
183
|
+
```yaml
grpo:
  clip_low: 0.1  # Reduce from 0.2
  clip_high: 0.18  # Reduce from 0.28
```
|
|
188
|
+
|
|
189
|
+
3. **Use learning rate warmup**:
|
|
190
|
+
```yaml
training:
  warmup_steps: 100
```
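
To see why a narrower clip range damps loss spikes, it helps to look at the clipped objective itself. The sketch below assumes `clip_low` and `clip_high` bound the policy probability ratio to `[1 - clip_low, 1 + clip_high]`, as in PPO-style clipping with asymmetric bounds. It is a per-sample illustration in plain Python, not torchforge's loss implementation.

```python
def clipped_objective(
    ratio: float,
    advantage: float,
    clip_low: float = 0.2,
    clip_high: float = 0.28,
) -> float:
    """Per-sample clipped surrogate objective (to be maximized)."""
    # Keep the probability ratio inside [1 - clip_low, 1 + clip_high].
    clipped_ratio = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    # Take the more pessimistic of the clipped and unclipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


wide = clipped_objective(1.5, 1.0)                                   # default range
narrow = clipped_objective(1.5, 1.0, clip_low=0.1, clip_high=0.18)   # tighter range
print(wide, narrow)
```

With the tighter range, a sample whose ratio has drifted far from 1 contributes less to the update, which is the stabilizing effect the setting is after.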
|
|
194
|
+
|
|
195
|
+
### Issue: Divergent Training
|
|
196
|
+
|
|
197
|
+
**Symptoms**: Loss becomes NaN, model outputs garbage
|
|
198
|
+
|
|
199
|
+
**Solutions**:
|
|
200
|
+
|
|
201
|
+
1. **Check for data issues**:
|
|
202
|
+
```python
# Verify no empty sequences
for batch in dataset:
    assert batch.input_ids.numel() > 0
```
|
|
207
|
+
|
|
208
|
+
2. **Use BF16 instead of FP16**:
|
|
209
|
+
```yaml
training:
  dtype: bfloat16
```
|
|
213
|
+
|
|
214
|
+
3. **Reduce learning rate significantly**:
|
|
215
|
+
```yaml
training:
  learning_rate: 1e-7
```
|
|
219
|
+
|
|
220
|
+
## Service Issues
|
|
221
|
+
|
|
222
|
+
### Issue: Service Startup Failures
|
|
223
|
+
|
|
224
|
+
**Symptoms**: Services fail to initialize
|
|
225
|
+
|
|
226
|
+
**Solutions**:
|
|
227
|
+
|
|
228
|
+
1. **Check resource availability**:
|
|
229
|
+
```bash
|
|
230
|
+
nvidia-smi # Verify GPU availability
|
|
231
|
+
```
|
|
232
|
+
|
|
233
|
+
2. **Increase startup timeout**:
|
|
234
|
+
```yaml
services:
  startup_timeout: 600
```
|
|
238
|
+
|
|
239
|
+
3. **Check model path**:
|
|
240
|
+
```python
|
|
241
|
+
from transformers import AutoModelForCausalLM
|
|
242
|
+
model = AutoModelForCausalLM.from_pretrained("model_path") # Verify accessible
|
|
243
|
+
```
|
|
244
|
+
|
|
245
|
+
### Issue: Generator Not Responding
|
|
246
|
+
|
|
247
|
+
**Symptoms**: Generation hangs, timeouts
|
|
248
|
+
|
|
249
|
+
**Solutions**:
|
|
250
|
+
|
|
251
|
+
1. **Check vLLM status**:
|
|
252
|
+
```python
|
|
253
|
+
# Add health check
|
|
254
|
+
await generator.health_check.route()
|
|
255
|
+
```
|
|
256
|
+
|
|
257
|
+
2. **Restart service**:
|
|
258
|
+
```python
|
|
259
|
+
await generator.restart.fanout()
|
|
260
|
+
```
|
|
261
|
+
|
|
262
|
+
3. **Reduce concurrent requests**:
|
|
263
|
+
```yaml
generator:
  max_concurrent_requests: 10
```
|
|
267
|
+
|
|
268
|
+
## Monarch Issues
|
|
269
|
+
|
|
270
|
+
### Issue: Monarch Actor Failures
|
|
271
|
+
|
|
272
|
+
**Symptoms**: Actor crashes, communication errors
|
|
273
|
+
|
|
274
|
+
**Solutions**:
|
|
275
|
+
|
|
276
|
+
1. **Enable fault tolerance**:
|
|
277
|
+
```yaml
monarch:
  fault_tolerance: true
  max_restarts: 3
```
|
|
282
|
+
|
|
283
|
+
2. **Increase actor memory**:
|
|
284
|
+
```yaml
services:
  actor_memory_mb: 4096
```
|
|
288
|
+
|
|
289
|
+
3. **Check Monarch logs**:
|
|
290
|
+
```bash
|
|
291
|
+
export MONARCH_LOG_LEVEL=DEBUG
|
|
292
|
+
```
|
|
293
|
+
|
|
294
|
+
### Issue: Deadlock in Distributed Communication
|
|
295
|
+
|
|
296
|
+
**Symptoms**: Training hangs, no progress
|
|
297
|
+
|
|
298
|
+
**Solutions**:
|
|
299
|
+
|
|
300
|
+
1. **Check for blocking calls**:
|
|
301
|
+
```python
|
|
302
|
+
# Use async/await correctly
|
|
303
|
+
result = await service.method.route(args) # Correct
|
|
304
|
+
# result = service.method.route(args).wait() # May deadlock
|
|
305
|
+
```
|
|
306
|
+
|
|
307
|
+
2. **Add timeouts**:
|
|
308
|
+
```python
import asyncio

# Raises asyncio.TimeoutError if no reply arrives within 60 seconds
result = await asyncio.wait_for(
    service.method.route(args),
    timeout=60.0,
)
```
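
The timeout pattern can be tried without any services running. In this self-contained sketch, `slow_call` is a hypothetical stand-in for a routed service call; only `asyncio.wait_for` and `asyncio.TimeoutError` are real library names.

```python
import asyncio


async def slow_call(delay: float) -> str:
    """Stand-in for a service call that takes `delay` seconds to answer."""
    await asyncio.sleep(delay)
    return "done"


async def call_with_timeout(delay: float, timeout: float):
    try:
        return await asyncio.wait_for(slow_call(delay), timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for cancels the pending call before raising, so nothing leaks.
        return None


fast = asyncio.run(call_with_timeout(delay=0.01, timeout=1.0))
slow = asyncio.run(call_with_timeout(delay=1.0, timeout=0.01))
print(fast, slow)
```

Returning `None` here is only for demonstration; in a training loop it is usually more useful to log which service timed out and re-raise.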
|
|
314
|
+
|
|
315
|
+
## Installation Issues
|
|
316
|
+
|
|
317
|
+
### Issue: PyTorch Version Mismatch
|
|
318
|
+
|
|
319
|
+
**Symptoms**: Import errors, CUDA errors
|
|
320
|
+
|
|
321
|
+
**Solutions**:
|
|
322
|
+
|
|
323
|
+
1. **Use provided install script**:
|
|
324
|
+
```bash
|
|
325
|
+
./scripts/install.sh
|
|
326
|
+
```
|
|
327
|
+
|
|
328
|
+
2. **Verify versions**:
|
|
329
|
+
```python
|
|
330
|
+
import torch
|
|
331
|
+
print(torch.__version__) # Should be 2.9.0+
|
|
332
|
+
```
|
|
333
|
+
|
|
334
|
+
3. **Clean reinstall**:
|
|
335
|
+
```bash
|
|
336
|
+
pip uninstall torch torchvision torchaudio
|
|
337
|
+
./scripts/install.sh
|
|
338
|
+
```
|
|
339
|
+
|
|
340
|
+
### Issue: Monarch Installation Fails
|
|
341
|
+
|
|
342
|
+
**Symptoms**: Cannot import monarch
|
|
343
|
+
|
|
344
|
+
**Solutions**:
|
|
345
|
+
|
|
346
|
+
1. **Install from source**:
|
|
347
|
+
```bash
|
|
348
|
+
git clone https://github.com/meta-pytorch/monarch
|
|
349
|
+
cd monarch && pip install -e .
|
|
350
|
+
```
|
|
351
|
+
|
|
352
|
+
2. **Check CUDA compatibility**:
|
|
353
|
+
```bash
|
|
354
|
+
nvcc --version # Should match PyTorch CUDA
|
|
355
|
+
```
|
|
356
|
+
|
|
357
|
+
## Debugging Tips
|
|
358
|
+
|
|
359
|
+
### Enable Verbose Logging
|
|
360
|
+
|
|
361
|
+
```bash
|
|
362
|
+
export FORGE_DEBUG=1
|
|
363
|
+
export MONARCH_LOG_LEVEL=DEBUG
|
|
364
|
+
```
|
|
365
|
+
|
|
366
|
+
### Profile Services
|
|
367
|
+
|
|
368
|
+
```python
import torch

# Profile one training step (run inside an async function).
# Note: torch.profiler records activity in the calling process only.
with torch.profiler.profile() as prof:
    result = await trainer.train_step.route(batch)
prof.export_chrome_trace("trace.json")
```
|
|
374
|
+
|
|
375
|
+
### Monitor GPU Utilization
|
|
376
|
+
|
|
377
|
+
```bash
|
|
378
|
+
watch -n 1 nvidia-smi
|
|
379
|
+
```
|
|
380
|
+
|
|
381
|
+
### Test Services Individually
|
|
382
|
+
|
|
383
|
+
```python
# Test generator
completions = await generator.generate.route(
    prompts=["Hello"],
    max_tokens=10,
)
print(completions[0].text)

# Test trainer
result = await trainer.train_step.route(dummy_batch)
print(result.loss)
```
|
|
395
|
+
|
|
396
|
+
## Experimental Warning
|
|
397
|
+
|
|
398
|
+
Both Monarch and torchforge are experimental. Expect:
|
|
399
|
+
- API changes between versions
|
|
400
|
+
- Incomplete features
|
|
401
|
+
- Bugs in edge cases
|
|
402
|
+
|
|
403
|
+
Check Discord for latest updates and workarounds.
|
|
404
|
+
|
|
405
|
+
## Resources
|
|
406
|
+
|
|
407
|
+
- GitHub Issues: https://github.com/meta-pytorch/torchforge/issues
|
|
408
|
+
- Discord: https://discord.gg/YsTYBh6PD9
|
|
409
|
+
- Monarch Issues: https://github.com/meta-pytorch/monarch/issues
|