@synsci/cli-darwin-x64 1.1.49
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/skills/accelerate/SKILL.md +332 -0
- package/bin/skills/accelerate/references/custom-plugins.md +453 -0
- package/bin/skills/accelerate/references/megatron-integration.md +489 -0
- package/bin/skills/accelerate/references/performance.md +525 -0
- package/bin/skills/audiocraft/SKILL.md +564 -0
- package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
- package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
- package/bin/skills/autogpt/SKILL.md +403 -0
- package/bin/skills/autogpt/references/advanced-usage.md +535 -0
- package/bin/skills/autogpt/references/troubleshooting.md +420 -0
- package/bin/skills/awq/SKILL.md +310 -0
- package/bin/skills/awq/references/advanced-usage.md +324 -0
- package/bin/skills/awq/references/troubleshooting.md +344 -0
- package/bin/skills/axolotl/SKILL.md +158 -0
- package/bin/skills/axolotl/references/api.md +5548 -0
- package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
- package/bin/skills/axolotl/references/index.md +15 -0
- package/bin/skills/axolotl/references/other.md +3563 -0
- package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
- package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
- package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
- package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
- package/bin/skills/bitsandbytes/SKILL.md +411 -0
- package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
- package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
- package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
- package/bin/skills/blip-2/SKILL.md +564 -0
- package/bin/skills/blip-2/references/advanced-usage.md +680 -0
- package/bin/skills/blip-2/references/troubleshooting.md +526 -0
- package/bin/skills/chroma/SKILL.md +406 -0
- package/bin/skills/chroma/references/integration.md +38 -0
- package/bin/skills/clip/SKILL.md +253 -0
- package/bin/skills/clip/references/applications.md +207 -0
- package/bin/skills/constitutional-ai/SKILL.md +290 -0
- package/bin/skills/crewai/SKILL.md +498 -0
- package/bin/skills/crewai/references/flows.md +438 -0
- package/bin/skills/crewai/references/tools.md +429 -0
- package/bin/skills/crewai/references/troubleshooting.md +480 -0
- package/bin/skills/deepspeed/SKILL.md +141 -0
- package/bin/skills/deepspeed/references/08.md +17 -0
- package/bin/skills/deepspeed/references/09.md +173 -0
- package/bin/skills/deepspeed/references/2020.md +378 -0
- package/bin/skills/deepspeed/references/2023.md +279 -0
- package/bin/skills/deepspeed/references/assets.md +179 -0
- package/bin/skills/deepspeed/references/index.md +35 -0
- package/bin/skills/deepspeed/references/mii.md +118 -0
- package/bin/skills/deepspeed/references/other.md +1191 -0
- package/bin/skills/deepspeed/references/tutorials.md +6554 -0
- package/bin/skills/dspy/SKILL.md +590 -0
- package/bin/skills/dspy/references/examples.md +663 -0
- package/bin/skills/dspy/references/modules.md +475 -0
- package/bin/skills/dspy/references/optimizers.md +566 -0
- package/bin/skills/faiss/SKILL.md +221 -0
- package/bin/skills/faiss/references/index_types.md +280 -0
- package/bin/skills/flash-attention/SKILL.md +367 -0
- package/bin/skills/flash-attention/references/benchmarks.md +215 -0
- package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
- package/bin/skills/gguf/SKILL.md +427 -0
- package/bin/skills/gguf/references/advanced-usage.md +504 -0
- package/bin/skills/gguf/references/troubleshooting.md +442 -0
- package/bin/skills/gptq/SKILL.md +450 -0
- package/bin/skills/gptq/references/calibration.md +337 -0
- package/bin/skills/gptq/references/integration.md +129 -0
- package/bin/skills/gptq/references/troubleshooting.md +95 -0
- package/bin/skills/grpo-rl-training/README.md +97 -0
- package/bin/skills/grpo-rl-training/SKILL.md +572 -0
- package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
- package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
- package/bin/skills/guidance/SKILL.md +572 -0
- package/bin/skills/guidance/references/backends.md +554 -0
- package/bin/skills/guidance/references/constraints.md +674 -0
- package/bin/skills/guidance/references/examples.md +767 -0
- package/bin/skills/hqq/SKILL.md +445 -0
- package/bin/skills/hqq/references/advanced-usage.md +528 -0
- package/bin/skills/hqq/references/troubleshooting.md +503 -0
- package/bin/skills/hugging-face-cli/SKILL.md +191 -0
- package/bin/skills/hugging-face-cli/references/commands.md +954 -0
- package/bin/skills/hugging-face-cli/references/examples.md +374 -0
- package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
- package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
- package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
- package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
- package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
- package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
- package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
- package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
- package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
- package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
- package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
- package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
- package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
- package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
- package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
- package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
- package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
- package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
- package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
- package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
- package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
- package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
- package/bin/skills/hugging-face-jobs/index.html +216 -0
- package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
- package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
- package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
- package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
- package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
- package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
- package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
- package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
- package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
- package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
- package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
- package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
- package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
- package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
- package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
- package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
- package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
- package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
- package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
- package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
- package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
- package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
- package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
- package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
- package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
- package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
- package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
- package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
- package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
- package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
- package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
- package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
- package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
- package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
- package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
- package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
- package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
- package/bin/skills/instructor/SKILL.md +740 -0
- package/bin/skills/instructor/references/examples.md +107 -0
- package/bin/skills/instructor/references/providers.md +70 -0
- package/bin/skills/instructor/references/validation.md +606 -0
- package/bin/skills/knowledge-distillation/SKILL.md +458 -0
- package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
- package/bin/skills/lambda-labs/SKILL.md +545 -0
- package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
- package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
- package/bin/skills/langchain/SKILL.md +480 -0
- package/bin/skills/langchain/references/agents.md +499 -0
- package/bin/skills/langchain/references/integration.md +562 -0
- package/bin/skills/langchain/references/rag.md +600 -0
- package/bin/skills/langsmith/SKILL.md +422 -0
- package/bin/skills/langsmith/references/advanced-usage.md +548 -0
- package/bin/skills/langsmith/references/troubleshooting.md +537 -0
- package/bin/skills/litgpt/SKILL.md +469 -0
- package/bin/skills/litgpt/references/custom-models.md +568 -0
- package/bin/skills/litgpt/references/distributed-training.md +451 -0
- package/bin/skills/litgpt/references/supported-models.md +336 -0
- package/bin/skills/litgpt/references/training-recipes.md +619 -0
- package/bin/skills/llama-cpp/SKILL.md +258 -0
- package/bin/skills/llama-cpp/references/optimization.md +89 -0
- package/bin/skills/llama-cpp/references/quantization.md +213 -0
- package/bin/skills/llama-cpp/references/server.md +125 -0
- package/bin/skills/llama-factory/SKILL.md +80 -0
- package/bin/skills/llama-factory/references/_images.md +23 -0
- package/bin/skills/llama-factory/references/advanced.md +1055 -0
- package/bin/skills/llama-factory/references/getting_started.md +349 -0
- package/bin/skills/llama-factory/references/index.md +19 -0
- package/bin/skills/llama-factory/references/other.md +31 -0
- package/bin/skills/llamaguard/SKILL.md +337 -0
- package/bin/skills/llamaindex/SKILL.md +569 -0
- package/bin/skills/llamaindex/references/agents.md +83 -0
- package/bin/skills/llamaindex/references/data_connectors.md +108 -0
- package/bin/skills/llamaindex/references/query_engines.md +406 -0
- package/bin/skills/llava/SKILL.md +304 -0
- package/bin/skills/llava/references/training.md +197 -0
- package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
- package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
- package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
- package/bin/skills/long-context/SKILL.md +536 -0
- package/bin/skills/long-context/references/extension_methods.md +468 -0
- package/bin/skills/long-context/references/fine_tuning.md +611 -0
- package/bin/skills/long-context/references/rope.md +402 -0
- package/bin/skills/mamba/SKILL.md +260 -0
- package/bin/skills/mamba/references/architecture-details.md +206 -0
- package/bin/skills/mamba/references/benchmarks.md +255 -0
- package/bin/skills/mamba/references/training-guide.md +388 -0
- package/bin/skills/megatron-core/SKILL.md +366 -0
- package/bin/skills/megatron-core/references/benchmarks.md +249 -0
- package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
- package/bin/skills/megatron-core/references/production-examples.md +473 -0
- package/bin/skills/megatron-core/references/training-recipes.md +547 -0
- package/bin/skills/miles/SKILL.md +315 -0
- package/bin/skills/miles/references/api-reference.md +141 -0
- package/bin/skills/miles/references/troubleshooting.md +352 -0
- package/bin/skills/mlflow/SKILL.md +704 -0
- package/bin/skills/mlflow/references/deployment.md +744 -0
- package/bin/skills/mlflow/references/model-registry.md +770 -0
- package/bin/skills/mlflow/references/tracking.md +680 -0
- package/bin/skills/modal/SKILL.md +341 -0
- package/bin/skills/modal/references/advanced-usage.md +503 -0
- package/bin/skills/modal/references/troubleshooting.md +494 -0
- package/bin/skills/model-merging/SKILL.md +539 -0
- package/bin/skills/model-merging/references/evaluation.md +462 -0
- package/bin/skills/model-merging/references/examples.md +428 -0
- package/bin/skills/model-merging/references/methods.md +352 -0
- package/bin/skills/model-pruning/SKILL.md +495 -0
- package/bin/skills/model-pruning/references/wanda.md +347 -0
- package/bin/skills/moe-training/SKILL.md +526 -0
- package/bin/skills/moe-training/references/architectures.md +432 -0
- package/bin/skills/moe-training/references/inference.md +348 -0
- package/bin/skills/moe-training/references/training.md +425 -0
- package/bin/skills/nanogpt/SKILL.md +290 -0
- package/bin/skills/nanogpt/references/architecture.md +382 -0
- package/bin/skills/nanogpt/references/data.md +476 -0
- package/bin/skills/nanogpt/references/training.md +564 -0
- package/bin/skills/nemo-curator/SKILL.md +383 -0
- package/bin/skills/nemo-curator/references/deduplication.md +87 -0
- package/bin/skills/nemo-curator/references/filtering.md +102 -0
- package/bin/skills/nemo-evaluator/SKILL.md +494 -0
- package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
- package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
- package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
- package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
- package/bin/skills/nemo-guardrails/SKILL.md +297 -0
- package/bin/skills/nnsight/SKILL.md +436 -0
- package/bin/skills/nnsight/references/README.md +78 -0
- package/bin/skills/nnsight/references/api.md +344 -0
- package/bin/skills/nnsight/references/tutorials.md +300 -0
- package/bin/skills/openrlhf/SKILL.md +249 -0
- package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
- package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
- package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
- package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
- package/bin/skills/outlines/SKILL.md +652 -0
- package/bin/skills/outlines/references/backends.md +615 -0
- package/bin/skills/outlines/references/examples.md +773 -0
- package/bin/skills/outlines/references/json_generation.md +652 -0
- package/bin/skills/peft/SKILL.md +431 -0
- package/bin/skills/peft/references/advanced-usage.md +514 -0
- package/bin/skills/peft/references/troubleshooting.md +480 -0
- package/bin/skills/phoenix/SKILL.md +475 -0
- package/bin/skills/phoenix/references/advanced-usage.md +619 -0
- package/bin/skills/phoenix/references/troubleshooting.md +538 -0
- package/bin/skills/pinecone/SKILL.md +358 -0
- package/bin/skills/pinecone/references/deployment.md +181 -0
- package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
- package/bin/skills/pytorch-fsdp/references/index.md +7 -0
- package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
- package/bin/skills/pytorch-lightning/SKILL.md +346 -0
- package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
- package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
- package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
- package/bin/skills/pyvene/SKILL.md +473 -0
- package/bin/skills/pyvene/references/README.md +73 -0
- package/bin/skills/pyvene/references/api.md +383 -0
- package/bin/skills/pyvene/references/tutorials.md +376 -0
- package/bin/skills/qdrant/SKILL.md +493 -0
- package/bin/skills/qdrant/references/advanced-usage.md +648 -0
- package/bin/skills/qdrant/references/troubleshooting.md +631 -0
- package/bin/skills/ray-data/SKILL.md +326 -0
- package/bin/skills/ray-data/references/integration.md +82 -0
- package/bin/skills/ray-data/references/transformations.md +83 -0
- package/bin/skills/ray-train/SKILL.md +406 -0
- package/bin/skills/ray-train/references/multi-node.md +628 -0
- package/bin/skills/rwkv/SKILL.md +260 -0
- package/bin/skills/rwkv/references/architecture-details.md +344 -0
- package/bin/skills/rwkv/references/rwkv7.md +386 -0
- package/bin/skills/rwkv/references/state-management.md +369 -0
- package/bin/skills/saelens/SKILL.md +386 -0
- package/bin/skills/saelens/references/README.md +70 -0
- package/bin/skills/saelens/references/api.md +333 -0
- package/bin/skills/saelens/references/tutorials.md +318 -0
- package/bin/skills/segment-anything/SKILL.md +500 -0
- package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
- package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
- package/bin/skills/sentence-transformers/SKILL.md +255 -0
- package/bin/skills/sentence-transformers/references/models.md +123 -0
- package/bin/skills/sentencepiece/SKILL.md +235 -0
- package/bin/skills/sentencepiece/references/algorithms.md +200 -0
- package/bin/skills/sentencepiece/references/training.md +304 -0
- package/bin/skills/sglang/SKILL.md +442 -0
- package/bin/skills/sglang/references/deployment.md +490 -0
- package/bin/skills/sglang/references/radix-attention.md +413 -0
- package/bin/skills/sglang/references/structured-generation.md +541 -0
- package/bin/skills/simpo/SKILL.md +219 -0
- package/bin/skills/simpo/references/datasets.md +478 -0
- package/bin/skills/simpo/references/hyperparameters.md +452 -0
- package/bin/skills/simpo/references/loss-functions.md +350 -0
- package/bin/skills/skypilot/SKILL.md +509 -0
- package/bin/skills/skypilot/references/advanced-usage.md +491 -0
- package/bin/skills/skypilot/references/troubleshooting.md +570 -0
- package/bin/skills/slime/SKILL.md +464 -0
- package/bin/skills/slime/references/api-reference.md +392 -0
- package/bin/skills/slime/references/troubleshooting.md +386 -0
- package/bin/skills/speculative-decoding/SKILL.md +467 -0
- package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
- package/bin/skills/speculative-decoding/references/medusa.md +350 -0
- package/bin/skills/stable-diffusion/SKILL.md +519 -0
- package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
- package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
- package/bin/skills/tensorboard/SKILL.md +629 -0
- package/bin/skills/tensorboard/references/integrations.md +638 -0
- package/bin/skills/tensorboard/references/profiling.md +545 -0
- package/bin/skills/tensorboard/references/visualization.md +620 -0
- package/bin/skills/tensorrt-llm/SKILL.md +187 -0
- package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
- package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
- package/bin/skills/tensorrt-llm/references/serving.md +470 -0
- package/bin/skills/tinker/SKILL.md +362 -0
- package/bin/skills/tinker/references/api-reference.md +168 -0
- package/bin/skills/tinker/references/getting-started.md +157 -0
- package/bin/skills/tinker/references/loss-functions.md +163 -0
- package/bin/skills/tinker/references/models-and-lora.md +139 -0
- package/bin/skills/tinker/references/recipes.md +280 -0
- package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
- package/bin/skills/tinker/references/rendering.md +243 -0
- package/bin/skills/tinker/references/supervised-learning.md +232 -0
- package/bin/skills/tinker-training-cost/SKILL.md +187 -0
- package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
- package/bin/skills/torchforge/SKILL.md +433 -0
- package/bin/skills/torchforge/references/api-reference.md +327 -0
- package/bin/skills/torchforge/references/troubleshooting.md +409 -0
- package/bin/skills/torchtitan/SKILL.md +358 -0
- package/bin/skills/torchtitan/references/checkpoint.md +181 -0
- package/bin/skills/torchtitan/references/custom-models.md +258 -0
- package/bin/skills/torchtitan/references/float8.md +133 -0
- package/bin/skills/torchtitan/references/fsdp.md +126 -0
- package/bin/skills/transformer-lens/SKILL.md +346 -0
- package/bin/skills/transformer-lens/references/README.md +54 -0
- package/bin/skills/transformer-lens/references/api.md +362 -0
- package/bin/skills/transformer-lens/references/tutorials.md +339 -0
- package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
- package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
- package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
- package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
- package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
- package/bin/skills/unsloth/SKILL.md +80 -0
- package/bin/skills/unsloth/references/index.md +7 -0
- package/bin/skills/unsloth/references/llms-full.md +16799 -0
- package/bin/skills/unsloth/references/llms-txt.md +12044 -0
- package/bin/skills/unsloth/references/llms.md +82 -0
- package/bin/skills/verl/SKILL.md +391 -0
- package/bin/skills/verl/references/api-reference.md +301 -0
- package/bin/skills/verl/references/troubleshooting.md +391 -0
- package/bin/skills/vllm/SKILL.md +364 -0
- package/bin/skills/vllm/references/optimization.md +226 -0
- package/bin/skills/vllm/references/quantization.md +284 -0
- package/bin/skills/vllm/references/server-deployment.md +255 -0
- package/bin/skills/vllm/references/troubleshooting.md +447 -0
- package/bin/skills/weights-and-biases/SKILL.md +590 -0
- package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
- package/bin/skills/weights-and-biases/references/integrations.md +700 -0
- package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
- package/bin/skills/whisper/SKILL.md +317 -0
- package/bin/skills/whisper/references/languages.md +189 -0
- package/bin/synsc +0 -0
- package/package.json +10 -0

@@ -0,0 +1,402 @@

# RoPE: Rotary Position Embeddings

Complete technical guide based on the RoFormer paper (arXiv 2104.09864) and the HuggingFace transformers implementation.

## Table of Contents
- Mathematical Formulation
- Implementation Details
- Scaling Techniques
- Production Usage

## Mathematical Formulation

**Source**: RoFormer: Enhanced Transformer with Rotary Position Embedding (arXiv 2104.09864)

### Core Idea

RoPE encodes absolute position with a rotation matrix while naturally incorporating relative position dependency into the attention computation.

### Formulation

Given position index `m` and embedding dimension `d`:

```
Rotation Matrix R_θ(m):
[cos(mθ₁)  -sin(mθ₁)   0          0        ]
[sin(mθ₁)   cos(mθ₁)   0          0        ]
[0          0          cos(mθ₂)  -sin(mθ₂) ]
[0          0          sin(mθ₂)   cos(mθ₂) ]
...

where θⱼ = base^(-2(j-1)/d) for j ∈ {1, 2, ..., d/2}
```

**Key property**: the attention score between positions m and n depends only on the relative distance (m - n).

### Derivation

**Step 1: Position encoding via rotation**

```
q_m = W_q x_m rotated by mθ
k_n = W_k x_n rotated by nθ
```

**Step 2: Attention score**

```
score(q_m, k_n) = q_m^T k_n
                = (rotated query) · (rotated key)
                = f(q, k, m - n)
```

The score depends on the relative position `m - n`, not on the absolute positions.
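
This invariance is easy to check numerically. Below is a minimal sketch in plain Python (not part of any library): a single 2-D query/key pair is encoded as complex numbers and rotated, with `THETA` and the vectors being arbitrary values chosen for illustration.

```python
import cmath

THETA = 0.3  # arbitrary rotation frequency for one 2-D pair

def rotate(v: complex, pos: int) -> complex:
    """Rotate a 2-D vector (encoded as a complex number) by pos * THETA."""
    return v * cmath.exp(1j * pos * THETA)

def score(q: complex, k: complex, m: int, n: int) -> float:
    """Dot product of the rotated query (position m) and key (position n)."""
    qm, kn = rotate(q, m), rotate(k, n)
    return (qm * kn.conjugate()).real  # Re(<q_m, k_n>) for 2-D vectors

q, k = 1.0 + 2.0j, 0.5 - 1.0j
s_a = score(q, k, m=3, n=1)    # relative distance m - n = 2
s_b = score(q, k, m=40, n=38)  # same relative distance, shifted absolute positions
print(abs(s_a - s_b) < 1e-9)   # True: the score depends only on m - n
```

Shifting both positions by the same offset leaves the score unchanged, which is exactly the `f(q, k, m - n)` form above.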

## Implementation Details

**Source**: HuggingFace transformers/modeling_rope_utils.py

### Basic RoPE Implementation

```python
import torch

def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
    """Precompute rotation frequencies (cos + i*sin)."""
    # Compute inverse frequencies
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))

    # Position indices
    t = torch.arange(end, device=freqs.device)

    # Outer product: (end, dim/2)
    freqs = torch.outer(t, freqs).float()

    # Convert to complex exponential (Euler's formula)
    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)  # e^(i*θ) = cos(θ) + i*sin(θ)

    return freqs_cis

def reshape_for_broadcast(freqs_cis, x):
    """Reshape frequency tensor to broadcast against x."""
    ndim = x.ndim
    assert ndim > 1
    assert freqs_cis.shape == (x.shape[1], x.shape[-1])
    shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
    return freqs_cis.view(*shape)

def apply_rotary_emb(xq, xk, freqs_cis):
    """Apply rotary embeddings to queries and keys."""
    # View adjacent pairs of dims as complex numbers
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))

    # Reshape freqs for broadcasting
    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)

    # Apply rotation by complex multiplication
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)

    return xq_out.type_as(xq), xk_out.type_as(xk)
```
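
As a sanity check, the same complex-multiplication steps can be run inline on random tensors shaped `(batch, seq_len, n_heads, head_dim)`; all sizes below are arbitrary. Since each rotation is unitary, per-position vector norms must be preserved:

```python
import torch

# Inline the same steps as above on random tensors (batch=2, seq=16, heads=4, head_dim=8)
bs, seq, heads, dim = 2, 16, 4, 8

# Frequencies and complex rotations, as in precompute_freqs_cis
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, dim, 2).float() / dim))
angles = torch.outer(torch.arange(seq).float(), inv_freq)   # (seq, dim/2)
freqs_cis = torch.polar(torch.ones_like(angles), angles)    # complex, (seq, dim/2)

xq = torch.randn(bs, seq, heads, dim)

# Pair adjacent dims as complex numbers and rotate, as in apply_rotary_emb
xq_c = torch.view_as_complex(xq.reshape(bs, seq, heads, dim // 2, 2))
xq_out = torch.view_as_real(xq_c * freqs_cis.view(1, seq, 1, dim // 2)).flatten(3)

print(xq_out.shape)  # torch.Size([2, 16, 4, 8])
# Rotations are unitary: per-position vector norms are unchanged
print(torch.allclose(xq_out.norm(dim=-1), xq.norm(dim=-1), atol=1e-5))  # True
```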

### Alternative: GPT-NeoX Style (HuggingFace)

```python
import torch

def rotate_half(x):
    """Rotate half the hidden dimensions of the input: (x1, x2) -> (-x2, x1)."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb_gpt_neox(q, k, cos, sin, position_ids=None):
    """GPT-NeoX style RoPE (used in HuggingFace transformers)."""
    if position_ids is not None:
        # Select cos/sin for specific positions
        cos = cos[position_ids].unsqueeze(1)  # (bs, 1, seq_len, dim)
        sin = sin[position_ids].unsqueeze(1)
    else:
        cos = cos.unsqueeze(0).unsqueeze(0)  # (1, 1, seq_len, dim)
        sin = sin.unsqueeze(0).unsqueeze(0)

    # Apply rotation
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```
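
The `cos`/`sin` caches consumed by this function are typically precomputed once. The sketch below shows one common layout; note how the `cat((freqs, freqs))` duplication lines up with the two-halves pairing of `rotate_half`. Exact cache shapes vary across transformers versions, so treat this as illustrative:

```python
import torch

dim, max_pos = 8, 32

# Build the cache: each frequency appears twice, once per half of the head dim
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, dim, 2).float() / dim))
t = torch.arange(max_pos).float()
freqs = torch.outer(t, inv_freq)         # (max_pos, dim/2)
emb = torch.cat((freqs, freqs), dim=-1)  # (max_pos, dim)
cos, sin = emb.cos(), emb.sin()

def rotate_half(x):
    x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

q = torch.randn(2, 4, max_pos, dim)       # (bs, heads, seq, head_dim)
q_embed = q * cos + rotate_half(q) * sin  # cos/sin broadcast over (bs, heads)
print(q_embed.shape)  # torch.Size([2, 4, 32, 8])
```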

### Difference: GPT-J vs GPT-NeoX Style

**GPT-J style** (Meta LLaMA):
- Processes pairs in complex-number space
- Pairs adjacent dimensions: (0,1), (2,3), (4,5), ...

**GPT-NeoX style** (HuggingFace):
- Splits the head dimension into two halves
- Pairs across halves: (0, d/2), (1, d/2+1), ...

Both are mathematically equivalent; the pairings differ only by a fixed permutation of dimensions, so the choice is an implementation detail (though weights trained with one layout must be permuted to use the other).
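
The equivalence (up to a permutation) can be verified on a toy 4-dimensional head in plain Python; the helper names and values here are illustrative, not library APIs:

```python
import math

theta = [0.5, 0.05]  # one frequency per 2-D pair (arbitrary values), d = 4
m = 7                # arbitrary position

def gptj_rotate(x):
    """Pair adjacent dims: (x0, x1) with theta[0], (x2, x3) with theta[1]."""
    out = []
    for j in range(2):
        c, s = math.cos(m * theta[j]), math.sin(m * theta[j])
        a, b = x[2 * j], x[2 * j + 1]
        out += [a * c - b * s, b * c + a * s]
    return out

def neox_rotate(x):
    """Pair across halves: (x0, x2) with theta[0], (x1, x3) with theta[1]."""
    cos = [math.cos(m * theta[0]), math.cos(m * theta[1])] * 2
    sin = [math.sin(m * theta[0]), math.sin(m * theta[1])] * 2
    rot = [-x[2], -x[3], x[0], x[1]]  # rotate_half
    return [x[i] * cos[i] + rot[i] * sin[i] for i in range(4)]

x = [1.0, 2.0, 3.0, 4.0]
perm = [0, 2, 1, 3]  # interleaved layout -> two-halves layout

a = gptj_rotate(x)
b = neox_rotate([x[i] for i in perm])
# Undo the permutation on the NeoX output and compare
b_unperm = [b[perm.index(i)] for i in range(4)]
print(all(abs(u - v) < 1e-12 for u, v in zip(a, b_unperm)))  # True
```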

## Scaling Techniques

### 1. Linear Scaling

**Simplest method**: scale position indices linearly.

```python
import torch
from torch import nn

# Original: positions [0, 1, 2, ..., L-1]
# Scaled:   positions [0, 1/s, 2/s, ..., (L-1)/s]

class LinearScaledRoPE(nn.Module):
    def __init__(self, dim, max_seq_len=2048, base=10000, scaling_factor=1.0):
        super().__init__()
        self.scaling_factor = scaling_factor
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, seq_len, device):
        # Scale positions by 1/s
        t = torch.arange(seq_len, device=device).type_as(self.inv_freq)
        t = t / self.scaling_factor  # linear scaling

        freqs = torch.outer(t, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        return emb.cos(), emb.sin()
```

**Pros**: simple, easy to implement
**Cons**: may lose high-frequency information, since adjacent positions are pushed closer together
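
The effect on positions is easy to see numerically: with a factor of 4, an 8192-token input is squeezed back into the [0, 2048) range, at the cost of shrinking the gap between neighboring tokens (the trained context length and factor below are illustrative):

```python
trained_ctx, scaling_factor, seq_len = 2048, 4, 8192

scaled = [p / scaling_factor for p in range(seq_len)]
print(scaled[-1])                # 2047.75 -- last position mapped back inside the trained range
print(scaled[-1] < trained_ctx)  # True
print(scaled[1] - scaled[0])     # 0.25 -- adjacent tokens are now only 1/4 apart
```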
|
|
171
|
+
|
|
172
|
+
### 2. NTK-Aware Scaling (RoPE-NTK)
|
|
173
|
+
|
|
174
|
+
**Source**: Community discovery (Reddit, GitHub)
|
|
175
|
+
|
|
176
|
+
**Key insight**: Scale base frequency instead of positions.
|
|
177
|
+
|
|
178
|
+
```python
|
|
179
|
+
# Instead of scaling positions, scale theta (base frequency)
|
|
180
|
+
base_new = base * (scaling_factor ** (dim / (dim - 2)))
|
|
181
|
+
|
|
182
|
+
# This preserves high frequencies while extending low frequencies
|
|
183
|
+
```
|
|
184
|
+
|
|
185
|
+
**Implementation**:
|
|
186
|
+
|
|
187
|
+
```python
class NTKScaledRoPE(nn.Module):
    def __init__(self, dim, max_seq_len=2048, base=10000, scaling_factor=1.0):
        super().__init__()
        # Compute the new base
        base = base * (scaling_factor ** (dim / (dim - 2)))

        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, seq_len, device):
        t = torch.arange(seq_len, device=device).type_as(self.inv_freq)
        freqs = torch.outer(t, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        return emb.cos(), emb.sin()
```

**Pros**: Better than linear scaling
**Cons**: Still not perfect for very long contexts

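A small numeric check (an illustration added here, not from the original) of why scaling the base works: the highest frequency is untouched, while the lowest frequency ends up stretched by almost exactly 1/s, matching linear scaling where long-range behavior needs it:

```python
dim, base, s = 128, 10000.0, 4.0
base_new = base * (s ** (dim / (dim - 2)))

def inv_freq(i, b):
    # Rotary pair i rotates with frequency b ** (-2*i/dim)
    return b ** (-2 * i / dim)

# Highest frequency (i = 0): unchanged, so local detail is preserved
print(inv_freq(0, base_new) == inv_freq(0, base))  # True (both are 1.0)

# Lowest frequency (i = dim/2 - 1): stretched by ~1/s
ratio = inv_freq(dim // 2 - 1, base_new) / inv_freq(dim // 2 - 1, base)
print(abs(ratio - 1 / s) < 1e-9)                   # True
```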
### 3. Dynamic Scaling

**Source**: HuggingFace transformers

**Idea**: Adjust the scaling factor dynamically based on input length.

```python
class DynamicScaledRoPE(nn.Module):
    def __init__(self, dim, max_seq_len=2048, base=10000, scaling_factor=1.0):
        super().__init__()
        self.max_seq_len = max_seq_len
        self.scaling_factor = scaling_factor
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, seq_len, device):
        # Compute dynamic scaling factor
        if seq_len > self.max_seq_len:
            # Scale proportionally to the overshoot
            scale = seq_len / self.max_seq_len
        else:
            scale = 1.0

        # Scale positions
        t = torch.arange(seq_len, device=device).type_as(self.inv_freq)
        t = t / (self.scaling_factor * scale)

        freqs = torch.outer(t, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        return emb.cos(), emb.sin()
```

**Pros**: Adapts to input length; inputs within the trained range keep the original encoding
**Cons**: The same position is encoded differently depending on input length

### 4. YaRN (Yet another RoPE extensioN)

**Source**: arXiv 2309.00071

**Most sophisticated**: Combines NTK-by-parts scaling with an attention temperature and a smooth ramp.

```python
class YaRNScaledRoPE(nn.Module):
    """YaRN: NTK + Attention Temperature + Ramp."""

    def __init__(
        self,
        dim,
        max_seq_len=2048,
        base=10000,
        scaling_factor=1.0,
        beta_fast=32,
        beta_slow=1,
        attn_factor=1.0
    ):
        super().__init__()
        self.scaling_factor = scaling_factor
        self.beta_fast = beta_fast
        self.beta_slow = beta_slow
        self.attn_factor = attn_factor

        # Compute frequencies
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, seq_len, device):
        t = torch.arange(seq_len, device=device).type_as(self.inv_freq)

        # NTK-by-parts: different scaling for different frequencies.
        # High frequencies (inv_freq > 1/beta_fast): left unscaled
        # Low frequencies: linear scaling
        # (A full implementation ramps smoothly between beta_slow and beta_fast.)
        inv_freq_mask = (self.inv_freq > 1 / self.beta_fast).float()
        inv_freq_scaled = (
            inv_freq_mask * self.inv_freq
            + (1 - inv_freq_mask) * self.inv_freq / self.scaling_factor
        )
        freqs = torch.outer(t, inv_freq_scaled)

        emb = torch.cat((freqs, freqs), dim=-1)
        # Attention temperature applied to the embeddings
        return emb.cos() * self.attn_factor, emb.sin() * self.attn_factor
```

**Pros**: State-of-the-art context extension
**Cons**: More complex, more hyperparameters

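The "smooth ramp" mentioned above can be sketched in isolation (a simplified illustration of the idea; the actual YaRN ramp operates in units of rotation counts derived from beta_fast and beta_slow):

```python
def yarn_ramp(x, low, high):
    # 0 below `low`, 1 above `high`, linear in between:
    # blends "fully scaled" (low frequencies) into "unscaled"
    # (high frequencies) instead of a hard per-frequency mask.
    if low == high:
        high = low + 1e-3
    y = (x - low) / (high - low)
    return min(max(y, 0.0), 1.0)

print(yarn_ramp(-1.0, 0.0, 1.0))  # 0.0 -> apply full scaling
print(yarn_ramp(0.5, 0.0, 1.0))   # 0.5 -> blend the two
print(yarn_ramp(2.0, 0.0, 1.0))   # 1.0 -> leave unscaled
```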
## Production Usage

### HuggingFace Integration

```python
from transformers import AutoModelForCausalLM, AutoConfig

# Linear scaling
config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
config.rope_scaling = {
    "type": "linear",
    "factor": 4.0  # 4k → 16k (Llama-2 is trained on 4096 positions)
}

# NTK-aware scaling is exposed as "dynamic" (dynamic NTK) in transformers;
# there is no separate "ntk" type
config.rope_scaling = {
    "type": "dynamic",
    "factor": 4.0
}

# YaRN scaling
config.rope_scaling = {
    "type": "yarn",
    "factor": 16.0,
    "original_max_position_embeddings": 4096,
    "attention_factor": 1.0,
    "beta_fast": 32,
    "beta_slow": 1
}

model = AutoModelForCausalLM.from_config(config)
```

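Choosing the `factor` is just the target length divided by the trained length; a tiny helper (hypothetical, not part of transformers) makes that explicit:

```python
def rope_factor(target_len, trained_len=4096):
    # rope_scaling "factor" = how far the trained window must stretch
    factor = target_len / trained_len
    if factor < 1.0:
        raise ValueError("target fits in the trained context; no scaling needed")
    return factor

print(rope_factor(16384))  # 4.0
print({"type": "linear", "factor": rope_factor(16384)})
```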
### Custom Implementation

```python
class RoPEAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.num_heads = config.num_attention_heads
        self.head_dim = config.hidden_size // config.num_attention_heads

        # Projections
        self.q_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
        self.k_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
        self.v_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
        self.o_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=False)

        # RoPE
        self.rotary_emb = RotaryEmbedding(
            dim=self.head_dim,
            max_seq_len=config.max_position_embeddings,
            base=config.rope_theta
        )

    def forward(self, hidden_states, attention_mask=None, position_ids=None):
        bsz, seq_len, _ = hidden_states.size()

        # Q, K, V
        query_states = self.q_proj(hidden_states)
        key_states = self.k_proj(hidden_states)
        value_states = self.v_proj(hidden_states)

        # Reshape to (batch, num_heads, seq_len, head_dim)
        query_states = query_states.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        key_states = key_states.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        value_states = value_states.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        # Apply RoPE
        cos, sin = self.rotary_emb(seq_len, device=hidden_states.device)
        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)

        # Attention
        attn_output = F.scaled_dot_product_attention(
            query_states, key_states, value_states,
            attn_mask=attention_mask
        )

        # Reshape and project
        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.reshape(bsz, seq_len, -1)
        attn_output = self.o_proj(attn_output)

        return attn_output
```

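The helper `apply_rotary_pos_emb` used above is not defined in this section. A minimal rotate-half sketch (an illustrative version following the common Llama-style convention, not necessarily the exact helper the author intended):

```python
import torch

def rotate_half(x):
    # Split the last dimension in half and swap with a sign flip:
    # (x1, x2) -> (-x2, x1)
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin):
    # cos/sin have shape (seq_len, head_dim) and broadcast over
    # (batch, num_heads, seq_len, head_dim)
    q_rot = (q * cos) + (rotate_half(q) * sin)
    k_rot = (k * cos) + (rotate_half(k) * sin)
    return q_rot, k_rot
```

Because the transform is a per-pair 2D rotation, it preserves the norm of each position's query and key vectors, which makes for an easy sanity check.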
## Performance Comparison

**Scaling method comparison** (8k → 32k extension):

| Method  | Fine-tune Steps | Perplexity | Memory | Speed |
|---------|-----------------|------------|--------|-------|
| Linear  | 1000            | 12.5       | 1.0×   | 1.0×  |
| NTK     | 500             | 11.8       | 1.0×   | 1.0×  |
| Dynamic | 1000            | 12.2       | 1.0×   | 0.98× |
| YaRN    | 400             | 11.2       | 1.0×   | 0.95× |

**Source**: YaRN paper (arXiv 2309.00071)

## Resources

- **RoFormer Paper**: https://arxiv.org/abs/2104.09864
- **YaRN Paper**: https://arxiv.org/abs/2309.00071
- **HuggingFace RoPE Utils**: https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_rope_utils.py
- **Rotary Embeddings PyTorch**: https://github.com/lucidrains/rotary-embedding-torch

---
name: mamba-architecture
description: State-space model with O(n) complexity vs Transformers' O(n²). 5× faster inference, million-token sequences, no KV cache. Selective SSM with hardware-aware design. Mamba-1 (d_state=16) and Mamba-2 (d_state=128, multi-head). Models 130M-2.8B on HuggingFace.
version: 1.0.0
author: Synthetic Sciences
license: MIT
tags: [Model Architecture, Mamba, State Space Models, SSM, Linear Complexity, Long Context, Efficient Inference, Hardware-Aware, Alternative To Transformers]
dependencies: [mamba-ssm, torch, transformers, causal-conv1d]
---

# Mamba - Selective State Space Models

## Quick start

Mamba is a state-space model architecture achieving O(n) linear complexity for sequence modeling.

**Installation**:
```bash
# Install causal-conv1d (optional, for efficiency)
pip install "causal-conv1d>=1.4.0"

# Install Mamba
pip install mamba-ssm
# Or both together
pip install "mamba-ssm[causal-conv1d]"
```

**Prerequisites**: Linux, NVIDIA GPU, PyTorch 1.12+, CUDA 11.6+

**Basic usage** (Mamba block):
```python
import torch
from mamba_ssm import Mamba

batch, length, dim = 2, 64, 16
x = torch.randn(batch, length, dim).to("cuda")

model = Mamba(
    d_model=dim,  # Model dimension
    d_state=16,   # SSM state dimension
    d_conv=4,     # Conv1d kernel size
    expand=2      # Expansion factor
).to("cuda")

y = model(x)  # O(n) complexity!
assert y.shape == x.shape
```

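Where the O(n) claim comes from can be seen in a toy recurrence (an illustrative pure-Python sketch, not the library's selective scan kernel): each token costs one fixed-size state update, so time is linear and per-token memory is constant:

```python
def ssm_scan(xs, a=0.5, b=1.0, c=1.0):
    # Minimal (non-selective) scalar SSM:
    #   h_t = a * h_{t-1} + b * x_t   (fixed-size state update)
    #   y_t = c * h_t                 (readout)
    # One pass over the sequence: O(n) time, O(1) state -- no KV cache.
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys

print(ssm_scan([1.0, 0.0, 0.0, 0.0]))  # [1.0, 0.5, 0.25, 0.125]
```

In real Mamba the (A, B, C) parameters are input-dependent ("selective"), which is what lets the fixed-size state decide what to keep and what to forget.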
## Common workflows

### Workflow 1: Language model with Mamba-2

**Complete LM with generation**:
```python
import torch
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
from mamba_ssm.models.config_mamba import MambaConfig

# Configure a Mamba-2 LM
config = MambaConfig(
    d_model=1024,      # Hidden dimension
    n_layer=24,        # Number of layers
    vocab_size=50277,  # Vocabulary size
    ssm_cfg=dict(
        layer="Mamba2",  # Use Mamba-2
        d_state=128,     # Larger state for Mamba-2
        headdim=64,      # Head dimension
        ngroups=1        # Number of groups
    )
)

model = MambaLMHeadModel(config, device="cuda", dtype=torch.float16)

# Generate text
input_ids = torch.randint(0, 1000, (1, 20), device="cuda", dtype=torch.long)
output = model.generate(
    input_ids=input_ids,
    max_length=100,
    temperature=0.7,
    top_p=0.9
)
```

### Workflow 2: Use pretrained Mamba models

**Load from HuggingFace**:
```python
import torch
from transformers import AutoTokenizer
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

# Load pretrained model
model_name = "state-spaces/mamba-2.8b"
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")  # Use the compatible GPT-NeoX tokenizer
model = MambaLMHeadModel.from_pretrained(model_name, device="cuda", dtype=torch.float16)

# Generate
prompt = "The future of AI is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
output_ids = model.generate(
    input_ids=input_ids,
    max_length=200,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2
)
generated_text = tokenizer.decode(output_ids[0])
print(generated_text)
```

**Available models**:
- `state-spaces/mamba-130m`
- `state-spaces/mamba-370m`
- `state-spaces/mamba-790m`
- `state-spaces/mamba-1.4b`
- `state-spaces/mamba-2.8b`

### Workflow 3: Mamba-1 vs Mamba-2

**Mamba-1** (smaller state):
```python
from mamba_ssm import Mamba

model = Mamba(
    d_model=256,
    d_state=16,  # Smaller state dimension
    d_conv=4,
    expand=2
).to("cuda")
```

**Mamba-2** (multi-head, larger state):
```python
from mamba_ssm import Mamba2

model = Mamba2(
    d_model=256,
    d_state=128,  # Larger state dimension
    d_conv=4,
    expand=2,
    headdim=64,  # Head dimension for multi-head
    ngroups=1    # Parallel groups
).to("cuda")
```

**Key differences**:
- **State size**: Mamba-1 (d_state=16) vs Mamba-2 (d_state=128)
- **Architecture**: Mamba-2 has a multi-head structure
- **Normalization**: Mamba-2 uses RMSNorm
- **Distributed**: Mamba-2 supports tensor parallelism

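The practical consequence of a fixed-size state shows up in inference memory. A back-of-envelope comparison (illustrative formulas and parameter choices added here, not measured numbers): a Transformer's KV cache grows with sequence length, while Mamba's recurrent state does not:

```python
def kv_cache_elems(seq_len, n_layers=24, n_heads=16, head_dim=64):
    # Transformer: keys + values cached for every past token, every layer
    return 2 * n_layers * seq_len * n_heads * head_dim

def mamba_state_elems(n_layers=24, d_model=1024, d_state=16, expand=2):
    # Mamba: one fixed-size SSM state per layer, independent of seq_len
    return n_layers * expand * d_model * d_state

for n in (1_000, 100_000):
    print(n, kv_cache_elems(n), mamba_state_elems())
```

At 100K tokens the hypothetical KV cache above is thousands of times larger than the constant Mamba state, which is the "no KV cache" advantage in concrete terms.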
### Workflow 4: Benchmark vs Transformers

**Generation speed comparison**:
```bash
# Benchmark Mamba
python benchmarks/benchmark_generation_mamba_simple.py \
    --model-name "state-spaces/mamba-2.8b" \
    --prompt "The future of machine learning is" \
    --topp 0.9 --temperature 0.7 --repetition-penalty 1.2

# Benchmark a comparable Transformer
python benchmarks/benchmark_generation_mamba_simple.py \
    --model-name "EleutherAI/pythia-2.8b" \
    --prompt "The future of machine learning is" \
    --topp 0.9 --temperature 0.7 --repetition-penalty 1.2
```

**Expected results**:
- **Mamba**: 5× faster inference
- **Memory**: no KV cache needed
- **Scaling**: linear with sequence length

## When to use vs alternatives

**Use Mamba when**:
- You need long sequences (100K+ tokens)
- You want faster inference than Transformers
- You are memory-constrained (no KV cache)
- You are building streaming applications
- Linear scaling is important

**Advantages**:
- **O(n) complexity**: linear vs quadratic
- **5× faster inference**: no attention overhead
- **No KV cache**: lower memory usage
- **Million-token sequences**: hardware-efficient design
- **Streaming**: constant memory per token

**Use alternatives instead**:
- **Transformers**: need best-in-class quality and have the compute
- **RWKV**: want an RNN+Transformer hybrid
- **RetNet**: need a retention-based architecture
- **Hyena**: want a convolution-based approach

## Common issues

**Issue: CUDA out of memory**

Reduce batch size or enable gradient checkpointing:
```python
model = MambaLMHeadModel(config, device="cuda", dtype=torch.float16)
model.gradient_checkpointing_enable()  # Enable checkpointing
```

**Issue: Slow installation**

Compiling the CUDA kernels from source is slow; prefer the prebuilt wheels, or skip build isolation so the build reuses your installed torch:
```bash
pip install mamba-ssm --no-build-isolation
```

**Issue: Missing causal-conv1d**

Install it separately:
```bash
pip install "causal-conv1d>=1.4.0"
```

**Issue: Model not loading from HuggingFace**

Use `MambaLMHeadModel.from_pretrained` (not `AutoModel`):
```python
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-2.8b")
```

## Advanced topics

**Selective SSM**: See [references/selective-ssm.md](references/selective-ssm.md) for the mathematical formulation, state-space equations, and how selectivity enables O(n) complexity.

**Mamba-2 architecture**: See [references/mamba2-details.md](references/mamba2-details.md) for the multi-head structure, tensor parallelism, and distributed training setup.

**Performance optimization**: See [references/performance.md](references/performance.md) for hardware-aware design, CUDA kernels, and memory-efficiency techniques.

## Hardware requirements

- **GPU**: NVIDIA with CUDA 11.6+
- **VRAM**:
  - 130M model: 2GB
  - 370M model: 4GB
  - 790M model: 8GB
  - 1.4B model: 14GB
  - 2.8B model: 28GB (FP16)
- **Inference**: 5× faster than Transformers
- **Memory**: no KV cache (lower than Transformers)

**Performance** (vs Transformers):
- **Speed**: 5× faster inference
- **Memory**: 50% less (no KV cache)
- **Scaling**: linear vs quadratic

## Resources

- Paper (Mamba-1): https://arxiv.org/abs/2312.00752 (Dec 2023)
- Paper (Mamba-2): https://arxiv.org/abs/2405.21060 (May 2024)
- GitHub: https://github.com/state-spaces/mamba ⭐ 13,000+
- Models: https://huggingface.co/state-spaces
- Docs: repository README and wiki