@synsci/cli-darwin-x64 1.1.49
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/skills/accelerate/SKILL.md +332 -0
- package/bin/skills/accelerate/references/custom-plugins.md +453 -0
- package/bin/skills/accelerate/references/megatron-integration.md +489 -0
- package/bin/skills/accelerate/references/performance.md +525 -0
- package/bin/skills/audiocraft/SKILL.md +564 -0
- package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
- package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
- package/bin/skills/autogpt/SKILL.md +403 -0
- package/bin/skills/autogpt/references/advanced-usage.md +535 -0
- package/bin/skills/autogpt/references/troubleshooting.md +420 -0
- package/bin/skills/awq/SKILL.md +310 -0
- package/bin/skills/awq/references/advanced-usage.md +324 -0
- package/bin/skills/awq/references/troubleshooting.md +344 -0
- package/bin/skills/axolotl/SKILL.md +158 -0
- package/bin/skills/axolotl/references/api.md +5548 -0
- package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
- package/bin/skills/axolotl/references/index.md +15 -0
- package/bin/skills/axolotl/references/other.md +3563 -0
- package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
- package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
- package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
- package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
- package/bin/skills/bitsandbytes/SKILL.md +411 -0
- package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
- package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
- package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
- package/bin/skills/blip-2/SKILL.md +564 -0
- package/bin/skills/blip-2/references/advanced-usage.md +680 -0
- package/bin/skills/blip-2/references/troubleshooting.md +526 -0
- package/bin/skills/chroma/SKILL.md +406 -0
- package/bin/skills/chroma/references/integration.md +38 -0
- package/bin/skills/clip/SKILL.md +253 -0
- package/bin/skills/clip/references/applications.md +207 -0
- package/bin/skills/constitutional-ai/SKILL.md +290 -0
- package/bin/skills/crewai/SKILL.md +498 -0
- package/bin/skills/crewai/references/flows.md +438 -0
- package/bin/skills/crewai/references/tools.md +429 -0
- package/bin/skills/crewai/references/troubleshooting.md +480 -0
- package/bin/skills/deepspeed/SKILL.md +141 -0
- package/bin/skills/deepspeed/references/08.md +17 -0
- package/bin/skills/deepspeed/references/09.md +173 -0
- package/bin/skills/deepspeed/references/2020.md +378 -0
- package/bin/skills/deepspeed/references/2023.md +279 -0
- package/bin/skills/deepspeed/references/assets.md +179 -0
- package/bin/skills/deepspeed/references/index.md +35 -0
- package/bin/skills/deepspeed/references/mii.md +118 -0
- package/bin/skills/deepspeed/references/other.md +1191 -0
- package/bin/skills/deepspeed/references/tutorials.md +6554 -0
- package/bin/skills/dspy/SKILL.md +590 -0
- package/bin/skills/dspy/references/examples.md +663 -0
- package/bin/skills/dspy/references/modules.md +475 -0
- package/bin/skills/dspy/references/optimizers.md +566 -0
- package/bin/skills/faiss/SKILL.md +221 -0
- package/bin/skills/faiss/references/index_types.md +280 -0
- package/bin/skills/flash-attention/SKILL.md +367 -0
- package/bin/skills/flash-attention/references/benchmarks.md +215 -0
- package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
- package/bin/skills/gguf/SKILL.md +427 -0
- package/bin/skills/gguf/references/advanced-usage.md +504 -0
- package/bin/skills/gguf/references/troubleshooting.md +442 -0
- package/bin/skills/gptq/SKILL.md +450 -0
- package/bin/skills/gptq/references/calibration.md +337 -0
- package/bin/skills/gptq/references/integration.md +129 -0
- package/bin/skills/gptq/references/troubleshooting.md +95 -0
- package/bin/skills/grpo-rl-training/README.md +97 -0
- package/bin/skills/grpo-rl-training/SKILL.md +572 -0
- package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
- package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
- package/bin/skills/guidance/SKILL.md +572 -0
- package/bin/skills/guidance/references/backends.md +554 -0
- package/bin/skills/guidance/references/constraints.md +674 -0
- package/bin/skills/guidance/references/examples.md +767 -0
- package/bin/skills/hqq/SKILL.md +445 -0
- package/bin/skills/hqq/references/advanced-usage.md +528 -0
- package/bin/skills/hqq/references/troubleshooting.md +503 -0
- package/bin/skills/hugging-face-cli/SKILL.md +191 -0
- package/bin/skills/hugging-face-cli/references/commands.md +954 -0
- package/bin/skills/hugging-face-cli/references/examples.md +374 -0
- package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
- package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
- package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
- package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
- package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
- package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
- package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
- package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
- package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
- package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
- package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
- package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
- package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
- package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
- package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
- package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
- package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
- package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
- package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
- package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
- package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
- package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
- package/bin/skills/hugging-face-jobs/index.html +216 -0
- package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
- package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
- package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
- package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
- package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
- package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
- package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
- package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
- package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
- package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
- package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
- package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
- package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
- package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
- package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
- package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
- package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
- package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
- package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
- package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
- package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
- package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
- package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
- package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
- package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
- package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
- package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
- package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
- package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
- package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
- package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
- package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
- package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
- package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
- package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
- package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
- package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
- package/bin/skills/instructor/SKILL.md +740 -0
- package/bin/skills/instructor/references/examples.md +107 -0
- package/bin/skills/instructor/references/providers.md +70 -0
- package/bin/skills/instructor/references/validation.md +606 -0
- package/bin/skills/knowledge-distillation/SKILL.md +458 -0
- package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
- package/bin/skills/lambda-labs/SKILL.md +545 -0
- package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
- package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
- package/bin/skills/langchain/SKILL.md +480 -0
- package/bin/skills/langchain/references/agents.md +499 -0
- package/bin/skills/langchain/references/integration.md +562 -0
- package/bin/skills/langchain/references/rag.md +600 -0
- package/bin/skills/langsmith/SKILL.md +422 -0
- package/bin/skills/langsmith/references/advanced-usage.md +548 -0
- package/bin/skills/langsmith/references/troubleshooting.md +537 -0
- package/bin/skills/litgpt/SKILL.md +469 -0
- package/bin/skills/litgpt/references/custom-models.md +568 -0
- package/bin/skills/litgpt/references/distributed-training.md +451 -0
- package/bin/skills/litgpt/references/supported-models.md +336 -0
- package/bin/skills/litgpt/references/training-recipes.md +619 -0
- package/bin/skills/llama-cpp/SKILL.md +258 -0
- package/bin/skills/llama-cpp/references/optimization.md +89 -0
- package/bin/skills/llama-cpp/references/quantization.md +213 -0
- package/bin/skills/llama-cpp/references/server.md +125 -0
- package/bin/skills/llama-factory/SKILL.md +80 -0
- package/bin/skills/llama-factory/references/_images.md +23 -0
- package/bin/skills/llama-factory/references/advanced.md +1055 -0
- package/bin/skills/llama-factory/references/getting_started.md +349 -0
- package/bin/skills/llama-factory/references/index.md +19 -0
- package/bin/skills/llama-factory/references/other.md +31 -0
- package/bin/skills/llamaguard/SKILL.md +337 -0
- package/bin/skills/llamaindex/SKILL.md +569 -0
- package/bin/skills/llamaindex/references/agents.md +83 -0
- package/bin/skills/llamaindex/references/data_connectors.md +108 -0
- package/bin/skills/llamaindex/references/query_engines.md +406 -0
- package/bin/skills/llava/SKILL.md +304 -0
- package/bin/skills/llava/references/training.md +197 -0
- package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
- package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
- package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
- package/bin/skills/long-context/SKILL.md +536 -0
- package/bin/skills/long-context/references/extension_methods.md +468 -0
- package/bin/skills/long-context/references/fine_tuning.md +611 -0
- package/bin/skills/long-context/references/rope.md +402 -0
- package/bin/skills/mamba/SKILL.md +260 -0
- package/bin/skills/mamba/references/architecture-details.md +206 -0
- package/bin/skills/mamba/references/benchmarks.md +255 -0
- package/bin/skills/mamba/references/training-guide.md +388 -0
- package/bin/skills/megatron-core/SKILL.md +366 -0
- package/bin/skills/megatron-core/references/benchmarks.md +249 -0
- package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
- package/bin/skills/megatron-core/references/production-examples.md +473 -0
- package/bin/skills/megatron-core/references/training-recipes.md +547 -0
- package/bin/skills/miles/SKILL.md +315 -0
- package/bin/skills/miles/references/api-reference.md +141 -0
- package/bin/skills/miles/references/troubleshooting.md +352 -0
- package/bin/skills/mlflow/SKILL.md +704 -0
- package/bin/skills/mlflow/references/deployment.md +744 -0
- package/bin/skills/mlflow/references/model-registry.md +770 -0
- package/bin/skills/mlflow/references/tracking.md +680 -0
- package/bin/skills/modal/SKILL.md +341 -0
- package/bin/skills/modal/references/advanced-usage.md +503 -0
- package/bin/skills/modal/references/troubleshooting.md +494 -0
- package/bin/skills/model-merging/SKILL.md +539 -0
- package/bin/skills/model-merging/references/evaluation.md +462 -0
- package/bin/skills/model-merging/references/examples.md +428 -0
- package/bin/skills/model-merging/references/methods.md +352 -0
- package/bin/skills/model-pruning/SKILL.md +495 -0
- package/bin/skills/model-pruning/references/wanda.md +347 -0
- package/bin/skills/moe-training/SKILL.md +526 -0
- package/bin/skills/moe-training/references/architectures.md +432 -0
- package/bin/skills/moe-training/references/inference.md +348 -0
- package/bin/skills/moe-training/references/training.md +425 -0
- package/bin/skills/nanogpt/SKILL.md +290 -0
- package/bin/skills/nanogpt/references/architecture.md +382 -0
- package/bin/skills/nanogpt/references/data.md +476 -0
- package/bin/skills/nanogpt/references/training.md +564 -0
- package/bin/skills/nemo-curator/SKILL.md +383 -0
- package/bin/skills/nemo-curator/references/deduplication.md +87 -0
- package/bin/skills/nemo-curator/references/filtering.md +102 -0
- package/bin/skills/nemo-evaluator/SKILL.md +494 -0
- package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
- package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
- package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
- package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
- package/bin/skills/nemo-guardrails/SKILL.md +297 -0
- package/bin/skills/nnsight/SKILL.md +436 -0
- package/bin/skills/nnsight/references/README.md +78 -0
- package/bin/skills/nnsight/references/api.md +344 -0
- package/bin/skills/nnsight/references/tutorials.md +300 -0
- package/bin/skills/openrlhf/SKILL.md +249 -0
- package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
- package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
- package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
- package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
- package/bin/skills/outlines/SKILL.md +652 -0
- package/bin/skills/outlines/references/backends.md +615 -0
- package/bin/skills/outlines/references/examples.md +773 -0
- package/bin/skills/outlines/references/json_generation.md +652 -0
- package/bin/skills/peft/SKILL.md +431 -0
- package/bin/skills/peft/references/advanced-usage.md +514 -0
- package/bin/skills/peft/references/troubleshooting.md +480 -0
- package/bin/skills/phoenix/SKILL.md +475 -0
- package/bin/skills/phoenix/references/advanced-usage.md +619 -0
- package/bin/skills/phoenix/references/troubleshooting.md +538 -0
- package/bin/skills/pinecone/SKILL.md +358 -0
- package/bin/skills/pinecone/references/deployment.md +181 -0
- package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
- package/bin/skills/pytorch-fsdp/references/index.md +7 -0
- package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
- package/bin/skills/pytorch-lightning/SKILL.md +346 -0
- package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
- package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
- package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
- package/bin/skills/pyvene/SKILL.md +473 -0
- package/bin/skills/pyvene/references/README.md +73 -0
- package/bin/skills/pyvene/references/api.md +383 -0
- package/bin/skills/pyvene/references/tutorials.md +376 -0
- package/bin/skills/qdrant/SKILL.md +493 -0
- package/bin/skills/qdrant/references/advanced-usage.md +648 -0
- package/bin/skills/qdrant/references/troubleshooting.md +631 -0
- package/bin/skills/ray-data/SKILL.md +326 -0
- package/bin/skills/ray-data/references/integration.md +82 -0
- package/bin/skills/ray-data/references/transformations.md +83 -0
- package/bin/skills/ray-train/SKILL.md +406 -0
- package/bin/skills/ray-train/references/multi-node.md +628 -0
- package/bin/skills/rwkv/SKILL.md +260 -0
- package/bin/skills/rwkv/references/architecture-details.md +344 -0
- package/bin/skills/rwkv/references/rwkv7.md +386 -0
- package/bin/skills/rwkv/references/state-management.md +369 -0
- package/bin/skills/saelens/SKILL.md +386 -0
- package/bin/skills/saelens/references/README.md +70 -0
- package/bin/skills/saelens/references/api.md +333 -0
- package/bin/skills/saelens/references/tutorials.md +318 -0
- package/bin/skills/segment-anything/SKILL.md +500 -0
- package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
- package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
- package/bin/skills/sentence-transformers/SKILL.md +255 -0
- package/bin/skills/sentence-transformers/references/models.md +123 -0
- package/bin/skills/sentencepiece/SKILL.md +235 -0
- package/bin/skills/sentencepiece/references/algorithms.md +200 -0
- package/bin/skills/sentencepiece/references/training.md +304 -0
- package/bin/skills/sglang/SKILL.md +442 -0
- package/bin/skills/sglang/references/deployment.md +490 -0
- package/bin/skills/sglang/references/radix-attention.md +413 -0
- package/bin/skills/sglang/references/structured-generation.md +541 -0
- package/bin/skills/simpo/SKILL.md +219 -0
- package/bin/skills/simpo/references/datasets.md +478 -0
- package/bin/skills/simpo/references/hyperparameters.md +452 -0
- package/bin/skills/simpo/references/loss-functions.md +350 -0
- package/bin/skills/skypilot/SKILL.md +509 -0
- package/bin/skills/skypilot/references/advanced-usage.md +491 -0
- package/bin/skills/skypilot/references/troubleshooting.md +570 -0
- package/bin/skills/slime/SKILL.md +464 -0
- package/bin/skills/slime/references/api-reference.md +392 -0
- package/bin/skills/slime/references/troubleshooting.md +386 -0
- package/bin/skills/speculative-decoding/SKILL.md +467 -0
- package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
- package/bin/skills/speculative-decoding/references/medusa.md +350 -0
- package/bin/skills/stable-diffusion/SKILL.md +519 -0
- package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
- package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
- package/bin/skills/tensorboard/SKILL.md +629 -0
- package/bin/skills/tensorboard/references/integrations.md +638 -0
- package/bin/skills/tensorboard/references/profiling.md +545 -0
- package/bin/skills/tensorboard/references/visualization.md +620 -0
- package/bin/skills/tensorrt-llm/SKILL.md +187 -0
- package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
- package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
- package/bin/skills/tensorrt-llm/references/serving.md +470 -0
- package/bin/skills/tinker/SKILL.md +362 -0
- package/bin/skills/tinker/references/api-reference.md +168 -0
- package/bin/skills/tinker/references/getting-started.md +157 -0
- package/bin/skills/tinker/references/loss-functions.md +163 -0
- package/bin/skills/tinker/references/models-and-lora.md +139 -0
- package/bin/skills/tinker/references/recipes.md +280 -0
- package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
- package/bin/skills/tinker/references/rendering.md +243 -0
- package/bin/skills/tinker/references/supervised-learning.md +232 -0
- package/bin/skills/tinker-training-cost/SKILL.md +187 -0
- package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
- package/bin/skills/torchforge/SKILL.md +433 -0
- package/bin/skills/torchforge/references/api-reference.md +327 -0
- package/bin/skills/torchforge/references/troubleshooting.md +409 -0
- package/bin/skills/torchtitan/SKILL.md +358 -0
- package/bin/skills/torchtitan/references/checkpoint.md +181 -0
- package/bin/skills/torchtitan/references/custom-models.md +258 -0
- package/bin/skills/torchtitan/references/float8.md +133 -0
- package/bin/skills/torchtitan/references/fsdp.md +126 -0
- package/bin/skills/transformer-lens/SKILL.md +346 -0
- package/bin/skills/transformer-lens/references/README.md +54 -0
- package/bin/skills/transformer-lens/references/api.md +362 -0
- package/bin/skills/transformer-lens/references/tutorials.md +339 -0
- package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
- package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
- package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
- package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
- package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
- package/bin/skills/unsloth/SKILL.md +80 -0
- package/bin/skills/unsloth/references/index.md +7 -0
- package/bin/skills/unsloth/references/llms-full.md +16799 -0
- package/bin/skills/unsloth/references/llms-txt.md +12044 -0
- package/bin/skills/unsloth/references/llms.md +82 -0
- package/bin/skills/verl/SKILL.md +391 -0
- package/bin/skills/verl/references/api-reference.md +301 -0
- package/bin/skills/verl/references/troubleshooting.md +391 -0
- package/bin/skills/vllm/SKILL.md +364 -0
- package/bin/skills/vllm/references/optimization.md +226 -0
- package/bin/skills/vllm/references/quantization.md +284 -0
- package/bin/skills/vllm/references/server-deployment.md +255 -0
- package/bin/skills/vllm/references/troubleshooting.md +447 -0
- package/bin/skills/weights-and-biases/SKILL.md +590 -0
- package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
- package/bin/skills/weights-and-biases/references/integrations.md +700 -0
- package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
- package/bin/skills/whisper/SKILL.md +317 -0
- package/bin/skills/whisper/references/languages.md +189 -0
- package/bin/synsc +0 -0
- package/package.json +10 -0

@@ -0,0 +1,404 @@
# Parallelism Strategies Guide

Complete guide to parallelism in Megatron-Core: when to use each strategy, performance characteristics, and configuration examples.

## Parallelism Types

### 1. Data Parallel (DP)

**How it works**: Replicate full model on each GPU, split data batches, synchronize gradients.

**Memory**: Low efficiency - full model replication
**Communication**: Low - only gradients (all-reduce)
**Scalability**: Good for small models

**When to use**:
- Models <1B parameters that fit on single GPU
- Plenty of GPUs available
- Limited model complexity

**Configuration**:
```bash
# 8 GPUs, no model parallelism
torchrun --nproc_per_node=8 train.py \
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 1
# Implicit DP = 8
```

**Performance**:
- Near-linear scaling for small models
- 7B model on 8×A100: ~90% efficiency
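
The gradient synchronization step can be sketched numerically (plain NumPy standing in for NCCL's all-reduce; purely illustrative):

```python
import numpy as np

# Two "GPUs", each computing gradients on its own half of the batch.
grad_gpu0 = np.array([1.0, 2.0, 3.0])
grad_gpu1 = np.array([3.0, 4.0, 5.0])

# All-reduce (mean): every replica ends up with the same averaged gradient,
# so the replicated weights stay identical after the optimizer step.
avg = (grad_gpu0 + grad_gpu1) / 2
print(avg)  # [2. 3. 4.]
```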

### 2. Tensor Parallel (TP)

**How it works**: Split individual layers/tensors across GPUs (column/row partitioning of weight matrices).
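
The column/row split can be verified with a toy NumPy model of a single linear layer (no real GPUs or collectives here: concatenation stands in for all-gather, summation for all-reduce):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # (batch, hidden)
W = rng.standard_normal((8, 16))  # full weight matrix

# Column parallelism: each "GPU" holds a slice of W's output columns.
shards = np.split(W, 2, axis=1)          # TP=2
partial = [X @ w for w in shards]        # local matmuls, no communication
Y_col = np.concatenate(partial, axis=1)  # all-gather along columns

# Row parallelism: each GPU holds a slice of W's input rows (and of X's columns).
W_rows = np.split(W, 2, axis=0)
X_cols = np.split(X, 2, axis=1)
Y_row = sum(x @ w for x, w in zip(X_cols, W_rows))  # all-reduce of partial sums

# Both recover the unsharded result exactly.
assert np.allclose(Y_col, X @ W)
assert np.allclose(Y_row, X @ W)
```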

**Memory**: Excellent - 1/N reduction per GPU
**Communication**: Very high - all-reduce after every layer
**Scalability**: Best ≤8 GPUs within single node (needs NVLink)

**When to use**:
- Models >10B parameters
- Have NVLink-connected GPUs
- Within single node (network latency kills performance across nodes)

**Configuration**:
```bash
# Split model across 4 GPUs with TP
torchrun --nproc_per_node=4 train.py \
    --tensor-model-parallel-size 4
```

**Performance**:
- **1 node (8 GPUs, NVLink)**: 85-95% efficiency
- **Across nodes**: <50% efficiency (avoid)

**Memory savings**:
```
LLaMA 70B without TP: 140GB (won't fit on 80GB GPU)
LLaMA 70B with TP=4: 35GB per GPU (fits easily)
```

**Communication volume** (70B model):
- Per layer: ~20GB all-reduce
- 80 layers × 20GB = 1.6TB total traffic
- With NVLink (600GB/s): Manageable
- With Ethernet (100Gb/s = 12.5GB/s): Too slow
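
The bandwidth arithmetic above can be made explicit with a small helper (the traffic and bandwidth figures are the rough numbers from this section, not measurements; latency and compute overlap are ignored):

```python
def transfer_seconds(total_bytes: float, bandwidth_bytes_per_s: float) -> float:
    # Time to move a given traffic volume at a given link bandwidth.
    return total_bytes / bandwidth_bytes_per_s

traffic = 1.6e12  # 80 layers x ~20GB all-reduce per step
print(round(transfer_seconds(traffic, 600e9), 1))   # NVLink ~600GB/s -> 2.7 s/step
print(round(transfer_seconds(traffic, 12.5e9), 0))  # 100Gb Ethernet -> 128 s/step
```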

### 3. Pipeline Parallel (PP)

**How it works**: Divide model layers into stages, assign stages to different GPUs, process microbatches in pipeline.

**Memory**: Very high - divide layers evenly
**Communication**: Low-medium - only activations between stages
**Scalability**: Good across nodes

**Pipeline Schedules**:

**GPipe** (simple but inefficient):
```
GPU0: F F F F ........ B B B B
GPU1: .... F F F F .... B B B B
GPU2: ........ F F F F B B B B
```
Bubble: 50% idle time

**1F1B** (one-forward-one-backward):
```
GPU0: F F F F B B B B B B B B
GPU1: .. F F F F B B B B B B B B
GPU2: .... F F F F B B B B B B B B
```
Bubble: ~25% idle time

**Interleaved 1F1B** (best):
```
GPU0: F1 F2 F3 F4 B1 B2 B3 B4 ...
GPU1: F1 F2 F3 F4 B1 B2 B3 B4 ...
```
Bubble: 5-10% idle time
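
The idle fractions quoted above follow from the standard bubble formula, bubble = (p − 1) / (m + p − 1) for p stages and m microbatches; interleaving further divides the bubble by the number of virtual stages per GPU. A minimal sketch:

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    # (stages - 1) warm-up + drain slots out of (microbatches + stages - 1)
    # total pipeline slots.
    p, m = stages, microbatches
    return (p - 1) / (m + p - 1)

print(round(bubble_fraction(4, 4), 2))   # 0.43 -- few microbatches, big bubble
print(round(bubble_fraction(4, 32), 2))  # 0.09 -- more microbatches amortize it
```

This is why raising the microbatch count (or interleaving) matters more than any other pipeline knob.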

**When to use**:
- Models >70B parameters
- Multi-node training
- Limited inter-node bandwidth (PP communicates far less than TP)

**Configuration**:
```bash
# 4-stage pipeline
torchrun --nproc_per_node=8 --nnodes=4 train.py \
    --pipeline-model-parallel-size 4 \
    --num-layers 80 \
    --num-layers-per-virtual-pipeline-stage 2  # Interleaved
```

**Performance**:
- Interleaved schedule: 90-95% efficiency
- Standard 1F1B: 75-85% efficiency

### 4. Sequence Parallel (SP)

**How it works**: Split sequence dimension across tensor-parallel GPUs, reduce activation memory.

**Memory**: Reduces activations by TP factor
**Communication**: Same as TP (already using all-reduce)
**Scalability**: Tied to TP

**When to use**:
- Long sequences (>4K tokens)
- Using TP already
- Activation memory is bottleneck

**Configuration**:
```bash
torchrun --nproc_per_node=8 train.py \
    --tensor-model-parallel-size 4 \
    --sequence-parallel  # Requires TP > 1
```

**Memory savings**:
```
70B model, 4K sequence, TP=4:
Without SP: 48GB activations per GPU
With SP: 12GB activations per GPU
Savings: 75%
```
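
The savings above are simply the activation volume divided by the TP degree, since SP shards the sequence dimension of activations (layer norm, dropout) that TP alone would replicate. A rough model, not a per-layer accounting:

```python
def activations_per_gpu_gb(no_sp_gb: float, tp: int, sequence_parallel: bool) -> float:
    # Without SP each TP rank holds full activations in the non-sharded
    # regions; with SP those are sharded 1/TP as well (rough approximation).
    return no_sp_gb / tp if sequence_parallel else no_sp_gb

print(activations_per_gpu_gb(48, tp=4, sequence_parallel=True))   # 12.0
print(activations_per_gpu_gb(48, tp=4, sequence_parallel=False))  # 48
```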

### 5. Context Parallel (CP)

**How it works**: Split very long sequences across GPUs using Ring Attention.

**Memory**: Reduces KV cache and activations
**Communication**: Medium - ring communication pattern
**Scalability**: Good for >8K sequences

**When to use**:
- Sequences >8K tokens
- Long-context models (>32K)
- KV cache memory bottleneck

**Configuration**:
```bash
torchrun --nproc_per_node=8 train.py \
    --context-parallel-size 2 \
    --seq-length 32768  # 32K tokens
```

**Memory savings** (32K sequence):
```
Without CP: 64GB KV cache
With CP=4: 16GB KV cache per GPU
```
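
A rough KV-cache sizing helper shows where the CP savings come from; the shapes below are hypothetical (2 bytes per element for bf16), not the configuration behind the 64GB figure above:

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int, seq_len: int,
                 batch: int, cp: int = 1, bytes_per: int = 2) -> float:
    # K and V tensors per layer; CP shards the sequence dimension across ranks.
    elems = 2 * layers * kv_heads * head_dim * (seq_len // cp) * batch
    return elems * bytes_per / 2**30

full = kv_cache_gib(80, 8, 128, 32768, batch=4)
sharded = kv_cache_gib(80, 8, 128, 32768, batch=4, cp=4)
print(full / sharded)  # 4.0 -- CP=4 cuts the per-GPU cache by 4x
```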

### 6. Expert Parallel (EP)

**How it works**: For MoE models, distribute different experts across GPUs.

**Memory**: Excellent - only store 1/N experts per GPU
**Communication**: Low - only route tokens to experts
**Scalability**: Matches number of experts

**When to use**:
- Mixture of Experts models
- Want model capacity without memory cost
- Have ≥8 GPUs

**Configuration**:
```bash
# Mixtral 8x7B: 8 experts
torchrun --nproc_per_node=8 train.py \
    --expert-model-parallel-size 4 \
    --num-experts 8 \
    --tensor-model-parallel-size 2
```

**Memory** (Mixtral 8×7B):
```
Without EP: 8 experts × 7B = 56GB
With EP=4: 2 experts × 7B = 14GB
Savings: 75%
```
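
Token routing under EP can be sketched in NumPy. This is illustrative, not Megatron's implementation: random integers stand in for the learned router's top-1 choice, and the list comprehension stands in for the all-to-all dispatch:

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, num_experts, ep = 16, 8, 4
expert_of = rng.integers(0, num_experts, size=num_tokens)  # router's top-1 pick

# Each of the 4 EP ranks owns a contiguous block of 8/4 = 2 experts and
# receives only the tokens routed to the experts it owns.
per_rank = num_experts // ep
counts = [int(np.isin(expert_of, range(r * per_rank, (r + 1) * per_rank)).sum())
          for r in range(ep)]
assert sum(counts) == num_tokens  # every token lands on exactly one rank
print(counts)
```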
|
|
203
|
+
## Combining Parallelism Strategies
|
|
204
|
+
|
|
205
|
+
### 3D Parallelism (TP + PP + DP)
|
|
206
|
+
|
|
207
|
+
Standard for large models.
|
|
208
|
+
|
|
209
|
+
**LLaMA 3 70B on 64 GPUs**:
|
|
210
|
+
```bash
|
|
211
|
+
TP=4 # Within each node
|
|
212
|
+
PP=4 # Across nodes
|
|
213
|
+
DP=4 # Remaining dimension
|
|
214
|
+
Total = 4 × 4 × 4 = 64 GPUs
|
|
215
|
+
```
|
|
216
|
+
|
|
217
|
+
**Memory per GPU**: 70B / 4 (TP) / 4 (PP) = 4.4B params ≈ 20GB
|
|
218
|
+
|
|
219
|
+
**Configuration**:
|
|
220
|
+
```bash
|
|
221
|
+
torchrun --nproc_per_node=8 --nnodes=8 train.py \
|
|
222
|
+
--tensor-model-parallel-size 4 \
|
|
223
|
+
--pipeline-model-parallel-size 4
|
|
224
|
+
# DP is implicit: 64 / (4*4) = 4
|
|
225
|
+
```
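Because DP is derived rather than passed, it is worth checking the factorization before launch. A pre-launch sanity check (a sketch, not a Megatron API):

```python
def data_parallel_size(world_size, tp, pp, cp=1):
    """DP is implicit in Megatron-style launches: whatever remains
    after tensor, pipeline, and context parallelism are carved out."""
    denom = tp * pp * cp
    assert world_size % denom == 0, "TP × PP × CP must divide the world size"
    return world_size // denom

assert data_parallel_size(64, tp=4, pp=4) == 4        # the 70B example above
assert data_parallel_size(256, tp=8, pp=8, cp=2) == 2 # the 405B example below
```

If the assertion fires, torchrun will start but Megatron will reject the configuration, so failing fast here saves a scheduler round-trip.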

### 4D Parallelism (TP + PP + DP + CP)

For very large models or long context.

**LLaMA 3 405B on 256 GPUs**:
```bash
TP=8  # Max NVLink
PP=8  # Across nodes
CP=2  # Long sequences
DP=2  # Remaining
Total = 8 × 8 × 2 × 2 = 256 GPUs
```

**Configuration**:
```bash
torchrun --nproc_per_node=8 --nnodes=32 train.py \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 8 \
    --context-parallel-size 2
```

### 4D + EP (5D Parallelism)

For sparse MoE models.

**DeepSeek-V3 671B (37B active) on 1024 GPUs**:
```bash
TP=2   # Limited by active params
PP=16  # Many stages
EP=64  # 256 experts / 4 experts per GPU
DP=32  # 1024 / (2 × 16)
Total = TP × PP × DP = 2 × 16 × 32 = 1024 GPUs
# EP is not an extra multiplier: expert parallelism remaps the
# existing data-parallel (and tensor-parallel) ranks for expert layers
```

## Decision Guide

### By Model Size

| Model Size | GPUs | Recommended Strategy |
|------------|------|---------------------|
| <1B | 1-8 | DP only |
| 1-10B | 8-16 | TP=2-4 + DP |
| 10-70B | 16-64 | TP=4 + PP=2-4 + DP |
| 70-175B | 64-256 | TP=8 + PP=4-8 + DP |
| 175-500B | 256-1024 | TP=8 + PP=8-16 + CP=2 + DP |
| 500B+ | 1024+ | 4D or 5D (with EP) |
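The table above is essentially a threshold lookup, which can be encoded directly when scripting sweeps. A sketch (function name and tier boundaries mirror the table; they are starting points, not hard rules):

```python
def recommended_strategy(params_b):
    """Starting-point parallelism strategy by model size in billions
    of parameters, mirroring the decision table above."""
    tiers = [
        (1,   "DP only"),
        (10,  "TP=2-4 + DP"),
        (70,  "TP=4 + PP=2-4 + DP"),
        (175, "TP=8 + PP=4-8 + DP"),
        (500, "TP=8 + PP=8-16 + CP=2 + DP"),
    ]
    for upper, strategy in tiers:
        if params_b < upper:
            return strategy
    return "4D or 5D (with EP)"

assert recommended_strategy(7) == "TP=2-4 + DP"
assert recommended_strategy(405) == "TP=8 + PP=8-16 + CP=2 + DP"
```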

### By Hardware Topology

**Single node (8 GPUs with NVLink)**:
```bash
# Up to 70B
TP=8  # Use all NVLink bandwidth
```

**Multiple nodes (InfiniBand)**:
```bash
# Minimize cross-node communication
TP=8  # Within node only
PP=N  # Across nodes
DP=remaining
```

**Limited network (Ethernet)**:
```bash
# Avoid TP across nodes
TP=1-4   # Within node
PP=many  # PP has low communication volume
```

### By Sequence Length

| Sequence | Parallelism |
|----------|------------|
| <2K | Standard (TP + PP + DP) |
| 2K-8K | + SP (sequence parallel) |
| 8K-32K | + CP=2 (context parallel) |
| 32K+ | + CP=4-8 |
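The sequence-length tiers also reduce to a simple rule of thumb when automating configs. A sketch (the cutoffs come from the table; treat them as defaults to tune, and the function name is hypothetical):

```python
def context_parallel_degree(seq_len):
    """CP degree suggested by the sequence-length table above.
    Returns 1 when sequence parallelism alone is sufficient."""
    if seq_len <= 8192:
        return 1   # SP alone covers up to ~8K
    if seq_len <= 32768:
        return 2
    return 4       # 32K+: start at CP=4; scale toward 8 if KV cache still dominates

assert context_parallel_degree(4096) == 1
assert context_parallel_degree(16384) == 2
assert context_parallel_degree(131072) == 4
```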

## Performance Characteristics

### Communication Volume (per iteration)

**Data Parallel**: O(model_size) - gradient all-reduce
**Tensor Parallel**: O(batch × seq × hidden × layers) - activation all-reduces in every layer
**Pipeline Parallel**: O(batch × seq × hidden) - activations at stage boundaries only
**Context Parallel**: O(seq × hidden) - ring exchange of K/V blocks

### Memory Breakdown (70B model example)

Without parallelism:
```
Model parameters: 140GB (FP16)
Gradients: 140GB
Optimizer states: 280GB (Adam)
Activations: 48GB (batch=1, seq=4K)
Total: 608GB (won't fit!)
```

With TP=4, PP=4, DP=4 (64 GPUs):
```
Parameters: 140GB / 4 / 4 = 8.75GB per GPU
Gradients: 8.75GB per GPU
Optimizer: 17.5GB per GPU
Activations: 48GB / 4 / 4 = 3GB per GPU
Total: ~38GB per GPU (fits on A100 80GB)
```
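The per-GPU figures above are just the unparallelized totals divided by TP × PP. A sketch that reproduces them (the helper is illustrative; note the breakdown implies ~4 bytes/param of optimizer state, whereas full FP32 Adam moments would be 8):

```python
def per_gpu_memory_gb(params_b, tp, pp, activations_gb,
                      bytes_weights=2, bytes_optim=4):
    """Per-GPU training memory in GB: FP16 weights + same-size gradients
    + optimizer states, all sharded TP × PP ways; activations shard
    across TP (with SP enabled) and PP stages as well."""
    shards = tp * pp
    weights = params_b * bytes_weights / shards  # 1B params × 2 bytes = 2GB
    grads = weights
    optim = params_b * bytes_optim / shards
    acts = activations_gb / shards
    return weights + grads + optim + acts

# Reproduces the 70B breakdown above: 8.75 + 8.75 + 17.5 + 3 = 38GB per GPU
assert per_gpu_memory_gb(70, tp=4, pp=4, activations_gb=48) == 38.0
```

DP does not appear in the formula: plain data parallelism replicates the model, so it adds throughput, not memory relief (ZeRO-style sharding changes that, but is out of scope here).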

## Best Practices

1. **Start with TP within a single node**
   ```bash
   --tensor-model-parallel-size 8  # Use all NVLink
   ```

2. **Add PP for cross-node scaling**
   ```bash
   --pipeline-model-parallel-size 4
   --num-layers-per-virtual-pipeline-stage 2  # Interleaved schedule
   ```

3. **Enable SP when using TP**
   ```bash
   --sequence-parallel  # Free activation savings
   ```

4. **Use CP for long sequences**
   ```bash
   --context-parallel-size 2  # If seq_len > 8K
   ```

5. **Avoid TP across nodes** (network latency kills performance)

6. **Match TP to GPU topology** (TP=8 for 8-GPU nodes)

7. **Profile the first iterations** to check memory and communication:
   ```bash
   --profile          # Enable profiling
   --profile-ranks 0  # Profile first rank only
   ```

## Troubleshooting

**High communication overhead (low MFU)**:
- Reduce TP degree (especially across nodes)
- Increase PP degree instead
- Enable the interleaved pipeline schedule

**Out of memory**:
- Increase TP/PP (split the model further)
- Enable gradient checkpointing:
  ```bash
  --recompute-granularity full
  --recompute-method block
  ```
- Reduce micro-batch size

**Pipeline bubbles (low GPU utilization)**:
- Use the interleaved schedule:
  ```bash
  --num-layers-per-virtual-pipeline-stage 2
  ```
- Increase the number of microbatches:
  ```bash
  --global-batch-size 1024
  --micro-batch-size 1  # More microbatches = smaller bubbles
  ```
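Why more microbatches shrink the bubble follows from the standard 1F1B/GPipe estimate: with p pipeline stages and m microbatches, the idle fraction is (p − 1) / (m + p − 1). A quick sketch of that relationship (function name is illustrative):

```python
def bubble_fraction(pp_stages, num_microbatches):
    """Idle-time fraction of a 1F1B/GPipe pipeline:
    (p - 1) / (m + p - 1) for p stages and m microbatches."""
    p, m = pp_stages, num_microbatches
    return (p - 1) / (m + p - 1)

# global_batch=1024, micro_batch=1, DP=4 -> m = 1024 / (1 * 4) = 256 microbatches
assert round(bubble_fraction(4, 256), 3) == 0.012   # near-perfect utilization
assert bubble_fraction(4, 8) > bubble_fraction(4, 256)  # fewer microbatches, bigger bubble
```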

**Load imbalance in MoE**:
- Tune the load-balancing loss:
  ```bash
  --moe-router-load-balancing-type aux_loss
  --moe-aux-loss-coeff 0.01
  ```
- Increase the expert parallel degree:
  ```bash
  --expert-model-parallel-size 8  # Fewer experts per GPU
  ```