@synsci/cli-darwin-x64 1.1.49
- package/bin/skills/accelerate/SKILL.md +332 -0
- package/bin/skills/accelerate/references/custom-plugins.md +453 -0
- package/bin/skills/accelerate/references/megatron-integration.md +489 -0
- package/bin/skills/accelerate/references/performance.md +525 -0
- package/bin/skills/audiocraft/SKILL.md +564 -0
- package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
- package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
- package/bin/skills/autogpt/SKILL.md +403 -0
- package/bin/skills/autogpt/references/advanced-usage.md +535 -0
- package/bin/skills/autogpt/references/troubleshooting.md +420 -0
- package/bin/skills/awq/SKILL.md +310 -0
- package/bin/skills/awq/references/advanced-usage.md +324 -0
- package/bin/skills/awq/references/troubleshooting.md +344 -0
- package/bin/skills/axolotl/SKILL.md +158 -0
- package/bin/skills/axolotl/references/api.md +5548 -0
- package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
- package/bin/skills/axolotl/references/index.md +15 -0
- package/bin/skills/axolotl/references/other.md +3563 -0
- package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
- package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
- package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
- package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
- package/bin/skills/bitsandbytes/SKILL.md +411 -0
- package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
- package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
- package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
- package/bin/skills/blip-2/SKILL.md +564 -0
- package/bin/skills/blip-2/references/advanced-usage.md +680 -0
- package/bin/skills/blip-2/references/troubleshooting.md +526 -0
- package/bin/skills/chroma/SKILL.md +406 -0
- package/bin/skills/chroma/references/integration.md +38 -0
- package/bin/skills/clip/SKILL.md +253 -0
- package/bin/skills/clip/references/applications.md +207 -0
- package/bin/skills/constitutional-ai/SKILL.md +290 -0
- package/bin/skills/crewai/SKILL.md +498 -0
- package/bin/skills/crewai/references/flows.md +438 -0
- package/bin/skills/crewai/references/tools.md +429 -0
- package/bin/skills/crewai/references/troubleshooting.md +480 -0
- package/bin/skills/deepspeed/SKILL.md +141 -0
- package/bin/skills/deepspeed/references/08.md +17 -0
- package/bin/skills/deepspeed/references/09.md +173 -0
- package/bin/skills/deepspeed/references/2020.md +378 -0
- package/bin/skills/deepspeed/references/2023.md +279 -0
- package/bin/skills/deepspeed/references/assets.md +179 -0
- package/bin/skills/deepspeed/references/index.md +35 -0
- package/bin/skills/deepspeed/references/mii.md +118 -0
- package/bin/skills/deepspeed/references/other.md +1191 -0
- package/bin/skills/deepspeed/references/tutorials.md +6554 -0
- package/bin/skills/dspy/SKILL.md +590 -0
- package/bin/skills/dspy/references/examples.md +663 -0
- package/bin/skills/dspy/references/modules.md +475 -0
- package/bin/skills/dspy/references/optimizers.md +566 -0
- package/bin/skills/faiss/SKILL.md +221 -0
- package/bin/skills/faiss/references/index_types.md +280 -0
- package/bin/skills/flash-attention/SKILL.md +367 -0
- package/bin/skills/flash-attention/references/benchmarks.md +215 -0
- package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
- package/bin/skills/gguf/SKILL.md +427 -0
- package/bin/skills/gguf/references/advanced-usage.md +504 -0
- package/bin/skills/gguf/references/troubleshooting.md +442 -0
- package/bin/skills/gptq/SKILL.md +450 -0
- package/bin/skills/gptq/references/calibration.md +337 -0
- package/bin/skills/gptq/references/integration.md +129 -0
- package/bin/skills/gptq/references/troubleshooting.md +95 -0
- package/bin/skills/grpo-rl-training/README.md +97 -0
- package/bin/skills/grpo-rl-training/SKILL.md +572 -0
- package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
- package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
- package/bin/skills/guidance/SKILL.md +572 -0
- package/bin/skills/guidance/references/backends.md +554 -0
- package/bin/skills/guidance/references/constraints.md +674 -0
- package/bin/skills/guidance/references/examples.md +767 -0
- package/bin/skills/hqq/SKILL.md +445 -0
- package/bin/skills/hqq/references/advanced-usage.md +528 -0
- package/bin/skills/hqq/references/troubleshooting.md +503 -0
- package/bin/skills/hugging-face-cli/SKILL.md +191 -0
- package/bin/skills/hugging-face-cli/references/commands.md +954 -0
- package/bin/skills/hugging-face-cli/references/examples.md +374 -0
- package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
- package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
- package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
- package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
- package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
- package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
- package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
- package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
- package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
- package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
- package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
- package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
- package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
- package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
- package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
- package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
- package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
- package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
- package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
- package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
- package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
- package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
- package/bin/skills/hugging-face-jobs/index.html +216 -0
- package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
- package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
- package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
- package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
- package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
- package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
- package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
- package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
- package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
- package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
- package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
- package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
- package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
- package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
- package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
- package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
- package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
- package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
- package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
- package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
- package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
- package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
- package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
- package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
- package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
- package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
- package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
- package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
- package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
- package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
- package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
- package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
- package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
- package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
- package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
- package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
- package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
- package/bin/skills/instructor/SKILL.md +740 -0
- package/bin/skills/instructor/references/examples.md +107 -0
- package/bin/skills/instructor/references/providers.md +70 -0
- package/bin/skills/instructor/references/validation.md +606 -0
- package/bin/skills/knowledge-distillation/SKILL.md +458 -0
- package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
- package/bin/skills/lambda-labs/SKILL.md +545 -0
- package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
- package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
- package/bin/skills/langchain/SKILL.md +480 -0
- package/bin/skills/langchain/references/agents.md +499 -0
- package/bin/skills/langchain/references/integration.md +562 -0
- package/bin/skills/langchain/references/rag.md +600 -0
- package/bin/skills/langsmith/SKILL.md +422 -0
- package/bin/skills/langsmith/references/advanced-usage.md +548 -0
- package/bin/skills/langsmith/references/troubleshooting.md +537 -0
- package/bin/skills/litgpt/SKILL.md +469 -0
- package/bin/skills/litgpt/references/custom-models.md +568 -0
- package/bin/skills/litgpt/references/distributed-training.md +451 -0
- package/bin/skills/litgpt/references/supported-models.md +336 -0
- package/bin/skills/litgpt/references/training-recipes.md +619 -0
- package/bin/skills/llama-cpp/SKILL.md +258 -0
- package/bin/skills/llama-cpp/references/optimization.md +89 -0
- package/bin/skills/llama-cpp/references/quantization.md +213 -0
- package/bin/skills/llama-cpp/references/server.md +125 -0
- package/bin/skills/llama-factory/SKILL.md +80 -0
- package/bin/skills/llama-factory/references/_images.md +23 -0
- package/bin/skills/llama-factory/references/advanced.md +1055 -0
- package/bin/skills/llama-factory/references/getting_started.md +349 -0
- package/bin/skills/llama-factory/references/index.md +19 -0
- package/bin/skills/llama-factory/references/other.md +31 -0
- package/bin/skills/llamaguard/SKILL.md +337 -0
- package/bin/skills/llamaindex/SKILL.md +569 -0
- package/bin/skills/llamaindex/references/agents.md +83 -0
- package/bin/skills/llamaindex/references/data_connectors.md +108 -0
- package/bin/skills/llamaindex/references/query_engines.md +406 -0
- package/bin/skills/llava/SKILL.md +304 -0
- package/bin/skills/llava/references/training.md +197 -0
- package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
- package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
- package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
- package/bin/skills/long-context/SKILL.md +536 -0
- package/bin/skills/long-context/references/extension_methods.md +468 -0
- package/bin/skills/long-context/references/fine_tuning.md +611 -0
- package/bin/skills/long-context/references/rope.md +402 -0
- package/bin/skills/mamba/SKILL.md +260 -0
- package/bin/skills/mamba/references/architecture-details.md +206 -0
- package/bin/skills/mamba/references/benchmarks.md +255 -0
- package/bin/skills/mamba/references/training-guide.md +388 -0
- package/bin/skills/megatron-core/SKILL.md +366 -0
- package/bin/skills/megatron-core/references/benchmarks.md +249 -0
- package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
- package/bin/skills/megatron-core/references/production-examples.md +473 -0
- package/bin/skills/megatron-core/references/training-recipes.md +547 -0
- package/bin/skills/miles/SKILL.md +315 -0
- package/bin/skills/miles/references/api-reference.md +141 -0
- package/bin/skills/miles/references/troubleshooting.md +352 -0
- package/bin/skills/mlflow/SKILL.md +704 -0
- package/bin/skills/mlflow/references/deployment.md +744 -0
- package/bin/skills/mlflow/references/model-registry.md +770 -0
- package/bin/skills/mlflow/references/tracking.md +680 -0
- package/bin/skills/modal/SKILL.md +341 -0
- package/bin/skills/modal/references/advanced-usage.md +503 -0
- package/bin/skills/modal/references/troubleshooting.md +494 -0
- package/bin/skills/model-merging/SKILL.md +539 -0
- package/bin/skills/model-merging/references/evaluation.md +462 -0
- package/bin/skills/model-merging/references/examples.md +428 -0
- package/bin/skills/model-merging/references/methods.md +352 -0
- package/bin/skills/model-pruning/SKILL.md +495 -0
- package/bin/skills/model-pruning/references/wanda.md +347 -0
- package/bin/skills/moe-training/SKILL.md +526 -0
- package/bin/skills/moe-training/references/architectures.md +432 -0
- package/bin/skills/moe-training/references/inference.md +348 -0
- package/bin/skills/moe-training/references/training.md +425 -0
- package/bin/skills/nanogpt/SKILL.md +290 -0
- package/bin/skills/nanogpt/references/architecture.md +382 -0
- package/bin/skills/nanogpt/references/data.md +476 -0
- package/bin/skills/nanogpt/references/training.md +564 -0
- package/bin/skills/nemo-curator/SKILL.md +383 -0
- package/bin/skills/nemo-curator/references/deduplication.md +87 -0
- package/bin/skills/nemo-curator/references/filtering.md +102 -0
- package/bin/skills/nemo-evaluator/SKILL.md +494 -0
- package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
- package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
- package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
- package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
- package/bin/skills/nemo-guardrails/SKILL.md +297 -0
- package/bin/skills/nnsight/SKILL.md +436 -0
- package/bin/skills/nnsight/references/README.md +78 -0
- package/bin/skills/nnsight/references/api.md +344 -0
- package/bin/skills/nnsight/references/tutorials.md +300 -0
- package/bin/skills/openrlhf/SKILL.md +249 -0
- package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
- package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
- package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
- package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
- package/bin/skills/outlines/SKILL.md +652 -0
- package/bin/skills/outlines/references/backends.md +615 -0
- package/bin/skills/outlines/references/examples.md +773 -0
- package/bin/skills/outlines/references/json_generation.md +652 -0
- package/bin/skills/peft/SKILL.md +431 -0
- package/bin/skills/peft/references/advanced-usage.md +514 -0
- package/bin/skills/peft/references/troubleshooting.md +480 -0
- package/bin/skills/phoenix/SKILL.md +475 -0
- package/bin/skills/phoenix/references/advanced-usage.md +619 -0
- package/bin/skills/phoenix/references/troubleshooting.md +538 -0
- package/bin/skills/pinecone/SKILL.md +358 -0
- package/bin/skills/pinecone/references/deployment.md +181 -0
- package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
- package/bin/skills/pytorch-fsdp/references/index.md +7 -0
- package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
- package/bin/skills/pytorch-lightning/SKILL.md +346 -0
- package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
- package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
- package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
- package/bin/skills/pyvene/SKILL.md +473 -0
- package/bin/skills/pyvene/references/README.md +73 -0
- package/bin/skills/pyvene/references/api.md +383 -0
- package/bin/skills/pyvene/references/tutorials.md +376 -0
- package/bin/skills/qdrant/SKILL.md +493 -0
- package/bin/skills/qdrant/references/advanced-usage.md +648 -0
- package/bin/skills/qdrant/references/troubleshooting.md +631 -0
- package/bin/skills/ray-data/SKILL.md +326 -0
- package/bin/skills/ray-data/references/integration.md +82 -0
- package/bin/skills/ray-data/references/transformations.md +83 -0
- package/bin/skills/ray-train/SKILL.md +406 -0
- package/bin/skills/ray-train/references/multi-node.md +628 -0
- package/bin/skills/rwkv/SKILL.md +260 -0
- package/bin/skills/rwkv/references/architecture-details.md +344 -0
- package/bin/skills/rwkv/references/rwkv7.md +386 -0
- package/bin/skills/rwkv/references/state-management.md +369 -0
- package/bin/skills/saelens/SKILL.md +386 -0
- package/bin/skills/saelens/references/README.md +70 -0
- package/bin/skills/saelens/references/api.md +333 -0
- package/bin/skills/saelens/references/tutorials.md +318 -0
- package/bin/skills/segment-anything/SKILL.md +500 -0
- package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
- package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
- package/bin/skills/sentence-transformers/SKILL.md +255 -0
- package/bin/skills/sentence-transformers/references/models.md +123 -0
- package/bin/skills/sentencepiece/SKILL.md +235 -0
- package/bin/skills/sentencepiece/references/algorithms.md +200 -0
- package/bin/skills/sentencepiece/references/training.md +304 -0
- package/bin/skills/sglang/SKILL.md +442 -0
- package/bin/skills/sglang/references/deployment.md +490 -0
- package/bin/skills/sglang/references/radix-attention.md +413 -0
- package/bin/skills/sglang/references/structured-generation.md +541 -0
- package/bin/skills/simpo/SKILL.md +219 -0
- package/bin/skills/simpo/references/datasets.md +478 -0
- package/bin/skills/simpo/references/hyperparameters.md +452 -0
- package/bin/skills/simpo/references/loss-functions.md +350 -0
- package/bin/skills/skypilot/SKILL.md +509 -0
- package/bin/skills/skypilot/references/advanced-usage.md +491 -0
- package/bin/skills/skypilot/references/troubleshooting.md +570 -0
- package/bin/skills/slime/SKILL.md +464 -0
- package/bin/skills/slime/references/api-reference.md +392 -0
- package/bin/skills/slime/references/troubleshooting.md +386 -0
- package/bin/skills/speculative-decoding/SKILL.md +467 -0
- package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
- package/bin/skills/speculative-decoding/references/medusa.md +350 -0
- package/bin/skills/stable-diffusion/SKILL.md +519 -0
- package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
- package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
- package/bin/skills/tensorboard/SKILL.md +629 -0
- package/bin/skills/tensorboard/references/integrations.md +638 -0
- package/bin/skills/tensorboard/references/profiling.md +545 -0
- package/bin/skills/tensorboard/references/visualization.md +620 -0
- package/bin/skills/tensorrt-llm/SKILL.md +187 -0
- package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
- package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
- package/bin/skills/tensorrt-llm/references/serving.md +470 -0
- package/bin/skills/tinker/SKILL.md +362 -0
- package/bin/skills/tinker/references/api-reference.md +168 -0
- package/bin/skills/tinker/references/getting-started.md +157 -0
- package/bin/skills/tinker/references/loss-functions.md +163 -0
- package/bin/skills/tinker/references/models-and-lora.md +139 -0
- package/bin/skills/tinker/references/recipes.md +280 -0
- package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
- package/bin/skills/tinker/references/rendering.md +243 -0
- package/bin/skills/tinker/references/supervised-learning.md +232 -0
- package/bin/skills/tinker-training-cost/SKILL.md +187 -0
- package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
- package/bin/skills/torchforge/SKILL.md +433 -0
- package/bin/skills/torchforge/references/api-reference.md +327 -0
- package/bin/skills/torchforge/references/troubleshooting.md +409 -0
- package/bin/skills/torchtitan/SKILL.md +358 -0
- package/bin/skills/torchtitan/references/checkpoint.md +181 -0
- package/bin/skills/torchtitan/references/custom-models.md +258 -0
- package/bin/skills/torchtitan/references/float8.md +133 -0
- package/bin/skills/torchtitan/references/fsdp.md +126 -0
- package/bin/skills/transformer-lens/SKILL.md +346 -0
- package/bin/skills/transformer-lens/references/README.md +54 -0
- package/bin/skills/transformer-lens/references/api.md +362 -0
- package/bin/skills/transformer-lens/references/tutorials.md +339 -0
- package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
- package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
- package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
- package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
- package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
- package/bin/skills/unsloth/SKILL.md +80 -0
- package/bin/skills/unsloth/references/index.md +7 -0
- package/bin/skills/unsloth/references/llms-full.md +16799 -0
- package/bin/skills/unsloth/references/llms-txt.md +12044 -0
- package/bin/skills/unsloth/references/llms.md +82 -0
- package/bin/skills/verl/SKILL.md +391 -0
- package/bin/skills/verl/references/api-reference.md +301 -0
- package/bin/skills/verl/references/troubleshooting.md +391 -0
- package/bin/skills/vllm/SKILL.md +364 -0
- package/bin/skills/vllm/references/optimization.md +226 -0
- package/bin/skills/vllm/references/quantization.md +284 -0
- package/bin/skills/vllm/references/server-deployment.md +255 -0
- package/bin/skills/vllm/references/troubleshooting.md +447 -0
- package/bin/skills/weights-and-biases/SKILL.md +590 -0
- package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
- package/bin/skills/weights-and-biases/references/integrations.md +700 -0
- package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
- package/bin/skills/whisper/SKILL.md +317 -0
- package/bin/skills/whisper/references/languages.md +189 -0
- package/bin/synsc +0 -0
- package/package.json +10 -0
@@ -0,0 +1,406 @@
---
name: ray-train
description: Distributed training orchestration across clusters. Scales PyTorch/TensorFlow/HuggingFace from laptop to 1000s of nodes. Built-in hyperparameter tuning with Ray Tune, fault tolerance, elastic scaling. Use when training massive models across multiple machines or running distributed hyperparameter sweeps.
version: 1.0.0
author: Synthetic Sciences
license: MIT
tags: [Ray Train, Distributed Training, Orchestration, Ray, Hyperparameter Tuning, Fault Tolerance, Elastic Scaling, Multi-Node, PyTorch, TensorFlow]
dependencies: ["ray[train]", torch, transformers]
---

# Ray Train - Distributed Training Orchestration

## Quick start

Ray Train scales machine learning training from a single GPU to multi-node clusters with minimal code changes.

**Installation**:
```bash
pip install -U "ray[train]"
```

**Basic PyTorch training** (single node):

```python
import ray
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
import torch
import torch.nn as nn

# Define the training function; Ray runs one copy per worker
def train_func(config):
    # Your normal PyTorch code
    model = nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Prepare for distributed training (Ray handles device placement)
    model = train.torch.prepare_model(model)
    device = train.torch.get_device()

    for epoch in range(10):
        # Your training loop (inputs are created on the worker's device)
        output = model(torch.randn(32, 10, device=device))
        loss = output.sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # Report metrics (logged automatically)
        train.report({"loss": loss.item(), "epoch": epoch})

# Run distributed training
trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=4,  # 4 workers, one GPU each
        use_gpu=True
    )
)

result = trainer.fit()
print(f"Final loss: {result.metrics['loss']}")
```

**That's it!** Ray handles:
- Distributed coordination
- GPU allocation
- Fault tolerance
- Checkpointing
- Metric aggregation

## Common workflows

### Workflow 1: Scale existing PyTorch code

**Original single-GPU code**:
```python
model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters())

for epoch in range(epochs):
    for batch in dataloader:
        optimizer.zero_grad()  # clear gradients from the previous step
        loss = model(batch)
        loss.backward()
        optimizer.step()
```

**Ray Train version** (scales to multi-GPU/multi-node):
```python
import torch
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func(config):
    model = MyModel()
    optimizer = torch.optim.Adam(model.parameters())

    # Prepare for distributed training (automatic device placement)
    model = train.torch.prepare_model(model)
    # Bind to a new name: reassigning `dataloader` inside the function
    # would make it a local variable and raise UnboundLocalError
    train_loader = train.torch.prepare_data_loader(dataloader)

    for epoch in range(epochs):
        for batch in train_loader:
            optimizer.zero_grad()
            loss = model(batch)
            loss.backward()
            optimizer.step()

        # Report metrics
        train.report({"loss": loss.item()})

# Scale to 8 GPUs
trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True)
)
trainer.fit()
```

**Benefits**: The same code runs on 1 GPU or 1000 GPUs
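
To see what each worker iterates over: for map-style datasets, `prepare_data_loader` by default attaches a distributed sampler so that every worker reads a different slice of the data. The sketch below illustrates that partitioning idea in plain Python. The function name `shard_indices` is hypothetical, and the code follows the general strided-split scheme (pad to an even length, then take every `world_size`-th index), not Ray's or PyTorch's exact implementation.

```python
def shard_indices(num_samples: int, rank: int, world_size: int) -> list[int]:
    """Return the sample indices that worker `rank` would process."""
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    per_worker = -(-num_samples // world_size)  # ceiling division
    total = per_worker * world_size
    # Pad by wrapping around so every worker gets the same number of samples
    repeats = -(-total // num_samples)
    padded = (list(range(num_samples)) * repeats)[:total]
    # Strided split: worker r takes positions r, r + world_size, ...
    return padded[rank:total:world_size]


shards = [shard_indices(10, rank, world_size=4) for rank in range(4)]
for rank, shard in enumerate(shards):
    print(f"worker {rank}: {shard}")
```

With 10 samples and 4 workers, each worker receives 3 indices, and two samples are repeated to even out the shards.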

### Workflow 2: HuggingFace Transformers integration

```python
from ray.train import ScalingConfig
from ray.train.huggingface.transformers import RayTrainReportCallback, prepare_trainer
from ray.train.torch import TorchTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

def train_func(config):
    # Load model and tokenizer
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    # Training arguments (HuggingFace API)
    training_args = TrainingArguments(
        output_dir="./output",
        num_train_epochs=3,
        per_device_train_batch_size=8,
        learning_rate=2e-5,
    )

    # Standard HuggingFace Trainer; `train_dataset` is a tokenized dataset
    # that you build with `tokenizer` (not shown here)
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
    )

    # Report metrics and checkpoints to Ray, then let Ray set up
    # the distributed training environment
    trainer.add_callback(RayTrainReportCallback())
    trainer = prepare_trainer(trainer)
    trainer.train()

# Scale to multi-node (2 nodes × 8 GPUs = 16 workers)
# Recent Ray releases run HuggingFace training through TorchTrainer;
# the older TransformersTrainer is deprecated
trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=16,
        use_gpu=True,
        resources_per_worker={"GPU": 1}
    )
)

result = trainer.fit()
```
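
One consequence of this scaling is easy to miss: `per_device_train_batch_size` applies to each worker, so the effective batch size grows with `num_workers`. The following is a small sketch of the arithmetic. The helper name `effective_batch_size` is hypothetical, and the dataset size of 100,000 examples is assumed purely for illustration.

```python
import math


def effective_batch_size(per_device: int, num_workers: int, grad_accum_steps: int = 1) -> int:
    """Samples consumed per optimizer step across all data-parallel workers."""
    return per_device * num_workers * grad_accum_steps


per_device = 8      # per_device_train_batch_size from the example
num_workers = 16    # 2 nodes x 8 GPUs
global_batch = effective_batch_size(per_device, num_workers)

dataset_size = 100_000  # assumed for illustration only
steps_per_epoch = math.ceil(dataset_size / global_batch)

print(global_batch)      # 128
print(steps_per_epoch)   # 782
```

A learning rate that was chosen for a single GPU may need revisiting once the global batch is 16 times larger.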

### Workflow 3: Hyperparameter tuning with Ray Tune

```python
import torch
from ray import train, tune
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.tune.schedulers import ASHAScheduler

def train_func(config):
    # Use hyperparameters from config
    lr = config["lr"]
    batch_size = config["batch_size"]

    model = MyModel()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    model = train.torch.prepare_model(model)

    for epoch in range(10):
        # Training loop (train_epoch is your own helper)
        loss = train_epoch(model, optimizer, batch_size)
        train.report({"loss": loss, "epoch": epoch})

# Define search space; TorchTrainer passes `train_loop_config` to train_func
param_space = {
    "train_loop_config": {
        "lr": tune.loguniform(1e-5, 1e-2),
        "batch_size": tune.choice([16, 32, 64, 128])
    }
}

# Run 20 trials with early stopping
tuner = tune.Tuner(
    TorchTrainer(
        train_func,
        scaling_config=ScalingConfig(num_workers=4, use_gpu=True)
    ),
    param_space=param_space,
    tune_config=tune.TuneConfig(
        num_samples=20,
        scheduler=ASHAScheduler(metric="loss", mode="min")
    )
)

results = tuner.fit()
best = results.get_best_result(metric="loss", mode="min")
print(f"Best hyperparameters: {best.config['train_loop_config']}")
```

**Result**: Distributed hyperparameter search across the cluster
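
To make the two moving parts more concrete, the sketch below reimplements them in plain Python: drawing configurations from a log-uniform and a categorical space, and pruning trials with synchronous successive halving, the idea that ASHA applies asynchronously. Everything here (`sample_config`, `toy_loss`, `successive_halving`, the reduction factor of 2) is a hypothetical illustration, not Ray Tune's implementation.

```python
import math
import random


def sample_config(rng: random.Random) -> dict:
    """Draw one configuration from the search space used above."""
    log_lr = rng.uniform(math.log(1e-5), math.log(1e-2))
    return {
        "lr": math.exp(log_lr),  # uniform in log space
        "batch_size": rng.choice([16, 32, 64, 128]),
    }


def toy_loss(config: dict, epochs: int) -> float:
    """Stand-in for real training: best near lr=1e-3, improves with epochs."""
    distance = abs(math.log10(config["lr"]) + 3)
    return distance + 1.0 / epochs


def successive_halving(configs: list, min_epochs: int = 1, reduction: int = 2) -> dict:
    """Keep the best 1/reduction of trials at each rung and grow their budget."""
    survivors, epochs = list(configs), min_epochs
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: toy_loss(c, epochs))
        survivors = ranked[: max(1, len(ranked) // reduction)]
        epochs *= reduction  # survivors train longer at the next rung
    return survivors[0]


rng = random.Random(0)
trials = [sample_config(rng) for _ in range(20)]
best = successive_halving(trials)
print(best)
```

Sampling the learning rate in log space gives each order of magnitude between 1e-5 and 1e-2 roughly equal coverage, which a plain uniform draw would not.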

### Workflow 4: Checkpointing and fault tolerance

```python
import os
import tempfile

import torch
from ray import train
from ray.train import Checkpoint, FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

def train_func(config):
    # MyModel and train_epoch are placeholders for your own code
    model = MyModel()
    start_epoch = 0

    # Try to resume from checkpoint
    state = None
    checkpoint = train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as checkpoint_dir:
            state = torch.load(
                os.path.join(checkpoint_dir, "model.pt"), map_location="cpu"
            )
        model.load_state_dict(state["model"])
        start_epoch = state["epoch"] + 1  # continue after the saved epoch

    # Move the model to its device first, then build the optimizer on it
    model = train.torch.prepare_model(model)
    optimizer = torch.optim.Adam(model.parameters())
    if state:
        optimizer.load_state_dict(state["optimizer"])

    for epoch in range(start_epoch, 100):
        loss = train_epoch(model, optimizer)
        metrics = {"loss": loss, "epoch": epoch}

        # Save checkpoint every 10 epochs; otherwise report metrics only
        if epoch % 10 == 0:
            with tempfile.TemporaryDirectory() as tmpdir:
                # Unwrap DDP so saved keys have no "module." prefix
                unwrapped = getattr(model, "module", model)
                torch.save({
                    "model": unwrapped.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "epoch": epoch,
                }, os.path.join(tmpdir, "model.pt"))
                train.report(metrics, checkpoint=Checkpoint.from_directory(tmpdir))
        else:
            train.report(metrics)

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
    # Retry up to 3 times, restoring from the latest reported checkpoint.
    # On multi-node clusters, also point RunConfig(storage_path=...) at shared storage.
    run_config=RunConfig(failure_config=FailureConfig(max_failures=3)),
)

# With max_failures set, a failed run restarts from the latest checkpoint
result = trainer.fit()
```
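
The resume logic can be checked without a cluster. The sketch below is a standard-library stand-in, not Ray code: it saves a JSON "checkpoint" after every epoch, simulates a crash, and restarts from the epoch after the last saved one. The helper names (`save_state`, `load_start_epoch`, `run`) and the file name `state.json` are hypothetical.

```python
import json
import os
import tempfile

def save_state(directory, epoch, weights):
    """Write a JSON 'checkpoint' holding the last completed epoch."""
    with open(os.path.join(directory, "state.json"), "w") as f:
        json.dump({"epoch": epoch, "weights": weights}, f)

def load_start_epoch(directory):
    """Return (start_epoch, weights); start from 0 if no checkpoint exists."""
    path = os.path.join(directory, "state.json")
    if not os.path.exists(path):
        return 0, None
    with open(path) as f:
        state = json.load(f)
    # Resume from the epoch after the one that was saved
    return state["epoch"] + 1, state["weights"]

def run(directory, total_epochs, fail_at=None):
    """Toy loop that checkpoints every epoch and can simulate a crash."""
    start_epoch, weights = load_start_epoch(directory)
    weights = weights or [0.0]
    completed = []
    for epoch in range(start_epoch, total_epochs):
        if epoch == fail_at:
            raise RuntimeError("simulated worker failure")
        weights = [w + 1.0 for w in weights]  # stand-in for a training step
        save_state(directory, epoch, weights)
        completed.append(epoch)
    return completed, weights

with tempfile.TemporaryDirectory() as ckpt_dir:
    try:
        run(ckpt_dir, total_epochs=5, fail_at=3)
    except RuntimeError:
        pass  # first attempt crashed after finishing epochs 0-2
    resumed_epochs, final_weights = run(ckpt_dir, total_epochs=5)
```

The second attempt only runs the epochs that were not yet completed, which is why storing the epoch counter alongside the model and optimizer state matters.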

### Workflow 5: Multi-node training

```python
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# Connect to a running Ray cluster
ray.init(address="auto")  # Or ray.init("ray://head-node:10001")

# Train across 4 nodes × 8 GPUs = 32 workers
# (train_func is a training function like the ones in the workflows above)
trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=32,
        use_gpu=True,
        resources_per_worker={"GPU": 1, "CPU": 4},
        placement_strategy="SPREAD",  # Spread workers across nodes
    ),
)

result = trainer.fit()
```

**Launch the Ray cluster** (before running the script above):
```bash
# On head node
ray start --head --port=6379

# On worker nodes
ray start --address=<head-node-ip>:6379
```
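
Before launching, it can help to sanity-check that the requested `num_workers` and `resources_per_worker` fit the cluster. The following is a small sketch in plain Python, assuming one worker per GPU; `plan_workers` and the 64 CPUs per node are illustrative assumptions, not part of Ray.

```python
def plan_workers(num_nodes, gpus_per_node, cpus_per_node, cpus_per_worker):
    """Return num_workers for one worker per GPU, or raise if CPUs run short."""
    workers_per_node = gpus_per_node  # one worker per GPU
    cpus_needed = workers_per_node * cpus_per_worker
    if cpus_needed > cpus_per_node:
        raise ValueError(
            f"Each node needs {cpus_needed} CPUs but only has {cpus_per_node}"
        )
    return num_nodes * workers_per_node

# 4 nodes × 8 GPUs, 4 CPUs per worker, assuming 64 CPUs per node
num_workers = plan_workers(
    num_nodes=4, gpus_per_node=8, cpus_per_node=64, cpus_per_worker=4
)
```

If the per-worker request does not fit on a node, Ray cannot schedule all workers there, so catching the mismatch up front avoids a job that waits for resources indefinitely.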

## When to use vs alternatives

**Use Ray Train when you need**:
- Training across multiple machines (multi-node)
- Hyperparameter tuning at scale
- Fault tolerance (automatic restart of failed workers)
- Elastic scaling (adding or removing nodes during training)
- A unified framework (same code for PyTorch/TF/HF)

**Key advantages**:
- **Multi-node orchestration**: Scaling from one machine to a cluster is a `ScalingConfig` change
- **Ray Tune integration**: The same training function plugs into `tune.Tuner` for distributed search
- **Fault tolerance**: Automatic recovery from failures via checkpoints
- **Elastic**: Add/remove nodes without restarting
- **Framework agnostic**: PyTorch, TensorFlow, HuggingFace, XGBoost

**Use alternatives instead**:
- **Accelerate**: Simpler for single-node, multi-GPU training
- **PyTorch Lightning**: High-level abstractions and callbacks
- **DeepSpeed**: Maximum performance at the cost of a complex setup
- **Raw DDP**: Maximum control with minimal overhead

## Common issues

**Issue: Ray cluster not connecting**

Check the cluster status:
```bash
ray status

# Expect every node to be listed as active and the cluster totals
# to match your hardware, e.g. for 4 nodes × 8 GPUs:
# - Active nodes: 4
# - GPUs: 32 in total
```

If nodes are missing:
```bash
# Restart head node
ray stop
ray start --head --port=6379 --dashboard-host=0.0.0.0

# Restart worker nodes
ray stop
ray start --address=<head-ip>:6379
```

**Issue: Out of memory**

Each worker holds a full model replica, so reducing workers only helps when several workers share a node's memory. For per-GPU out-of-memory errors, lower the per-worker batch size and use gradient accumulation:
```python
scaling_config=ScalingConfig(
    num_workers=4,  # Reduce from 8
    use_gpu=True
)

# In train_func, accumulate gradients over several small batches
accumulation_steps = 4  # example value

for i, batch in enumerate(dataloader):
    # Scale the loss so the accumulated gradient matches one large batch
    loss = model(batch) / accumulation_steps
    loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```
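
The effective global batch size is the per-worker micro-batch times the number of workers times the accumulation steps. As a worked example, the hypothetical helper below (`accumulation_plan` is not a Ray or PyTorch function) picks the micro-batch and step count for a target global batch, given the largest micro-batch that fits in GPU memory.

```python
import math

def accumulation_plan(global_batch, num_workers, max_micro_batch):
    """Return (micro_batch, accumulation_steps) per worker.

    Satisfies micro_batch * accumulation_steps * num_workers == global_batch
    with micro_batch <= max_micro_batch.
    """
    if global_batch % num_workers != 0:
        raise ValueError("global_batch must be divisible by num_workers")
    per_worker = global_batch // num_workers
    steps = math.ceil(per_worker / max_micro_batch)
    # Increase steps until the per-worker batch splits evenly
    while per_worker % steps != 0:
        steps += 1
    return per_worker // steps, steps

# Target a global batch of 512 on 4 workers when only 16 samples fit per GPU
micro_batch, steps = accumulation_plan(512, num_workers=4, max_micro_batch=16)
```

Here each worker processes 8 micro-batches of 16 samples before calling `optimizer.step()`, which keeps the global batch at 512 while holding only 16 samples in GPU memory at a time.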

**Issue: Slow training**

Check whether data loading is the bottleneck:
```python
import time

def train_func(config):
    # epochs and dataloader are placeholders for your own setup
    for epoch in range(epochs):
        data_time = 0.0
        start = time.time()
        for batch in dataloader:
            # Time spent waiting for the next batch
            data_time += time.time() - start
            # Train...
            start = time.time()
        print(f"Epoch {epoch} data loading: {data_time:.3f}s")
```

If data loading is slow, increase the number of DataLoader workers:
```python
from torch.utils.data import DataLoader

dataloader = DataLoader(dataset, num_workers=8)
```
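
To compare data-loading time against compute time, the measurement can be wrapped in a reusable generator. This is a sketch using only the standard library; `timed_batches` and `slow_loader` are hypothetical names, and `slow_loader` simply stands in for a slow `DataLoader`.

```python
import time

def timed_batches(iterable):
    """Yield (batch, wait_seconds), where wait_seconds is the fetch time."""
    iterator = iter(iterable)
    while True:
        start = time.perf_counter()
        try:
            batch = next(iterator)
        except StopIteration:
            return
        yield batch, time.perf_counter() - start

def slow_loader(n, delay):
    """Hypothetical stand-in for a slow DataLoader."""
    for i in range(n):
        time.sleep(delay)
        yield i

data_time = 0.0
compute_time = 0.0
batches = []
for batch, wait in timed_batches(slow_loader(3, 0.01)):
    data_time += wait
    step_start = time.perf_counter()
    batches.append(batch)  # stand-in for the training step
    compute_time += time.perf_counter() - step_start

# Fraction of wall-clock time spent waiting on data
data_fraction = data_time / (data_time + compute_time)
```

A `data_fraction` close to 1 suggests the loader is the bottleneck. On GPUs, kernels run asynchronously, so the compute timing is only approximate unless the device is synchronized before reading the clock.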

## Advanced topics

**Multi-node setup**: See [references/multi-node.md](references/multi-node.md) for Ray cluster deployment on AWS, GCP, Kubernetes, and SLURM.

**Hyperparameter tuning**: See [references/hyperparameter-tuning.md](references/hyperparameter-tuning.md) for Ray Tune integration, search algorithms (Optuna, HyperOpt), and population-based training.

**Custom training loops**: See [references/custom-loops.md](references/custom-loops.md) for advanced Ray Train usage, custom backends, and integration with other frameworks.

## Hardware requirements

- **Single node**: 1+ GPUs (or CPUs)
- **Multi-node**: 2+ machines with network connectivity
- **Cloud**: AWS, GCP, Azure (Ray autoscaling)
- **On-prem**: Kubernetes, SLURM clusters

**Supported accelerators**:
- NVIDIA GPUs (CUDA)
- AMD GPUs (ROCm)
- TPUs (Google Cloud)
- CPUs

## Resources

- Docs: https://docs.ray.io/en/latest/train/train.html
- GitHub: https://github.com/ray-project/ray ⭐ 36,000+
- Version: 2.40.0+
- Examples: https://docs.ray.io/en/latest/train/examples.html
- Slack: https://forms.gle/9TSdDYUgxYs8SA9e8
- Used by: OpenAI, Uber, Spotify, Shopify, Instacart