@synsci/cli-darwin-x64 1.1.49
- package/bin/skills/accelerate/SKILL.md +332 -0
- package/bin/skills/accelerate/references/custom-plugins.md +453 -0
- package/bin/skills/accelerate/references/megatron-integration.md +489 -0
- package/bin/skills/accelerate/references/performance.md +525 -0
- package/bin/skills/audiocraft/SKILL.md +564 -0
- package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
- package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
- package/bin/skills/autogpt/SKILL.md +403 -0
- package/bin/skills/autogpt/references/advanced-usage.md +535 -0
- package/bin/skills/autogpt/references/troubleshooting.md +420 -0
- package/bin/skills/awq/SKILL.md +310 -0
- package/bin/skills/awq/references/advanced-usage.md +324 -0
- package/bin/skills/awq/references/troubleshooting.md +344 -0
- package/bin/skills/axolotl/SKILL.md +158 -0
- package/bin/skills/axolotl/references/api.md +5548 -0
- package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
- package/bin/skills/axolotl/references/index.md +15 -0
- package/bin/skills/axolotl/references/other.md +3563 -0
- package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
- package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
- package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
- package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
- package/bin/skills/bitsandbytes/SKILL.md +411 -0
- package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
- package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
- package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
- package/bin/skills/blip-2/SKILL.md +564 -0
- package/bin/skills/blip-2/references/advanced-usage.md +680 -0
- package/bin/skills/blip-2/references/troubleshooting.md +526 -0
- package/bin/skills/chroma/SKILL.md +406 -0
- package/bin/skills/chroma/references/integration.md +38 -0
- package/bin/skills/clip/SKILL.md +253 -0
- package/bin/skills/clip/references/applications.md +207 -0
- package/bin/skills/constitutional-ai/SKILL.md +290 -0
- package/bin/skills/crewai/SKILL.md +498 -0
- package/bin/skills/crewai/references/flows.md +438 -0
- package/bin/skills/crewai/references/tools.md +429 -0
- package/bin/skills/crewai/references/troubleshooting.md +480 -0
- package/bin/skills/deepspeed/SKILL.md +141 -0
- package/bin/skills/deepspeed/references/08.md +17 -0
- package/bin/skills/deepspeed/references/09.md +173 -0
- package/bin/skills/deepspeed/references/2020.md +378 -0
- package/bin/skills/deepspeed/references/2023.md +279 -0
- package/bin/skills/deepspeed/references/assets.md +179 -0
- package/bin/skills/deepspeed/references/index.md +35 -0
- package/bin/skills/deepspeed/references/mii.md +118 -0
- package/bin/skills/deepspeed/references/other.md +1191 -0
- package/bin/skills/deepspeed/references/tutorials.md +6554 -0
- package/bin/skills/dspy/SKILL.md +590 -0
- package/bin/skills/dspy/references/examples.md +663 -0
- package/bin/skills/dspy/references/modules.md +475 -0
- package/bin/skills/dspy/references/optimizers.md +566 -0
- package/bin/skills/faiss/SKILL.md +221 -0
- package/bin/skills/faiss/references/index_types.md +280 -0
- package/bin/skills/flash-attention/SKILL.md +367 -0
- package/bin/skills/flash-attention/references/benchmarks.md +215 -0
- package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
- package/bin/skills/gguf/SKILL.md +427 -0
- package/bin/skills/gguf/references/advanced-usage.md +504 -0
- package/bin/skills/gguf/references/troubleshooting.md +442 -0
- package/bin/skills/gptq/SKILL.md +450 -0
- package/bin/skills/gptq/references/calibration.md +337 -0
- package/bin/skills/gptq/references/integration.md +129 -0
- package/bin/skills/gptq/references/troubleshooting.md +95 -0
- package/bin/skills/grpo-rl-training/README.md +97 -0
- package/bin/skills/grpo-rl-training/SKILL.md +572 -0
- package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
- package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
- package/bin/skills/guidance/SKILL.md +572 -0
- package/bin/skills/guidance/references/backends.md +554 -0
- package/bin/skills/guidance/references/constraints.md +674 -0
- package/bin/skills/guidance/references/examples.md +767 -0
- package/bin/skills/hqq/SKILL.md +445 -0
- package/bin/skills/hqq/references/advanced-usage.md +528 -0
- package/bin/skills/hqq/references/troubleshooting.md +503 -0
- package/bin/skills/hugging-face-cli/SKILL.md +191 -0
- package/bin/skills/hugging-face-cli/references/commands.md +954 -0
- package/bin/skills/hugging-face-cli/references/examples.md +374 -0
- package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
- package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
- package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
- package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
- package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
- package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
- package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
- package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
- package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
- package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
- package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
- package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
- package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
- package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
- package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
- package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
- package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
- package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
- package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
- package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
- package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
- package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
- package/bin/skills/hugging-face-jobs/index.html +216 -0
- package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
- package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
- package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
- package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
- package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
- package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
- package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
- package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
- package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
- package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
- package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
- package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
- package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
- package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
- package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
- package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
- package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
- package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
- package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
- package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
- package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
- package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
- package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
- package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
- package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
- package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
- package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
- package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
- package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
- package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
- package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
- package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
- package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
- package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
- package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
- package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
- package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
- package/bin/skills/instructor/SKILL.md +740 -0
- package/bin/skills/instructor/references/examples.md +107 -0
- package/bin/skills/instructor/references/providers.md +70 -0
- package/bin/skills/instructor/references/validation.md +606 -0
- package/bin/skills/knowledge-distillation/SKILL.md +458 -0
- package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
- package/bin/skills/lambda-labs/SKILL.md +545 -0
- package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
- package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
- package/bin/skills/langchain/SKILL.md +480 -0
- package/bin/skills/langchain/references/agents.md +499 -0
- package/bin/skills/langchain/references/integration.md +562 -0
- package/bin/skills/langchain/references/rag.md +600 -0
- package/bin/skills/langsmith/SKILL.md +422 -0
- package/bin/skills/langsmith/references/advanced-usage.md +548 -0
- package/bin/skills/langsmith/references/troubleshooting.md +537 -0
- package/bin/skills/litgpt/SKILL.md +469 -0
- package/bin/skills/litgpt/references/custom-models.md +568 -0
- package/bin/skills/litgpt/references/distributed-training.md +451 -0
- package/bin/skills/litgpt/references/supported-models.md +336 -0
- package/bin/skills/litgpt/references/training-recipes.md +619 -0
- package/bin/skills/llama-cpp/SKILL.md +258 -0
- package/bin/skills/llama-cpp/references/optimization.md +89 -0
- package/bin/skills/llama-cpp/references/quantization.md +213 -0
- package/bin/skills/llama-cpp/references/server.md +125 -0
- package/bin/skills/llama-factory/SKILL.md +80 -0
- package/bin/skills/llama-factory/references/_images.md +23 -0
- package/bin/skills/llama-factory/references/advanced.md +1055 -0
- package/bin/skills/llama-factory/references/getting_started.md +349 -0
- package/bin/skills/llama-factory/references/index.md +19 -0
- package/bin/skills/llama-factory/references/other.md +31 -0
- package/bin/skills/llamaguard/SKILL.md +337 -0
- package/bin/skills/llamaindex/SKILL.md +569 -0
- package/bin/skills/llamaindex/references/agents.md +83 -0
- package/bin/skills/llamaindex/references/data_connectors.md +108 -0
- package/bin/skills/llamaindex/references/query_engines.md +406 -0
- package/bin/skills/llava/SKILL.md +304 -0
- package/bin/skills/llava/references/training.md +197 -0
- package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
- package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
- package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
- package/bin/skills/long-context/SKILL.md +536 -0
- package/bin/skills/long-context/references/extension_methods.md +468 -0
- package/bin/skills/long-context/references/fine_tuning.md +611 -0
- package/bin/skills/long-context/references/rope.md +402 -0
- package/bin/skills/mamba/SKILL.md +260 -0
- package/bin/skills/mamba/references/architecture-details.md +206 -0
- package/bin/skills/mamba/references/benchmarks.md +255 -0
- package/bin/skills/mamba/references/training-guide.md +388 -0
- package/bin/skills/megatron-core/SKILL.md +366 -0
- package/bin/skills/megatron-core/references/benchmarks.md +249 -0
- package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
- package/bin/skills/megatron-core/references/production-examples.md +473 -0
- package/bin/skills/megatron-core/references/training-recipes.md +547 -0
- package/bin/skills/miles/SKILL.md +315 -0
- package/bin/skills/miles/references/api-reference.md +141 -0
- package/bin/skills/miles/references/troubleshooting.md +352 -0
- package/bin/skills/mlflow/SKILL.md +704 -0
- package/bin/skills/mlflow/references/deployment.md +744 -0
- package/bin/skills/mlflow/references/model-registry.md +770 -0
- package/bin/skills/mlflow/references/tracking.md +680 -0
- package/bin/skills/modal/SKILL.md +341 -0
- package/bin/skills/modal/references/advanced-usage.md +503 -0
- package/bin/skills/modal/references/troubleshooting.md +494 -0
- package/bin/skills/model-merging/SKILL.md +539 -0
- package/bin/skills/model-merging/references/evaluation.md +462 -0
- package/bin/skills/model-merging/references/examples.md +428 -0
- package/bin/skills/model-merging/references/methods.md +352 -0
- package/bin/skills/model-pruning/SKILL.md +495 -0
- package/bin/skills/model-pruning/references/wanda.md +347 -0
- package/bin/skills/moe-training/SKILL.md +526 -0
- package/bin/skills/moe-training/references/architectures.md +432 -0
- package/bin/skills/moe-training/references/inference.md +348 -0
- package/bin/skills/moe-training/references/training.md +425 -0
- package/bin/skills/nanogpt/SKILL.md +290 -0
- package/bin/skills/nanogpt/references/architecture.md +382 -0
- package/bin/skills/nanogpt/references/data.md +476 -0
- package/bin/skills/nanogpt/references/training.md +564 -0
- package/bin/skills/nemo-curator/SKILL.md +383 -0
- package/bin/skills/nemo-curator/references/deduplication.md +87 -0
- package/bin/skills/nemo-curator/references/filtering.md +102 -0
- package/bin/skills/nemo-evaluator/SKILL.md +494 -0
- package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
- package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
- package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
- package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
- package/bin/skills/nemo-guardrails/SKILL.md +297 -0
- package/bin/skills/nnsight/SKILL.md +436 -0
- package/bin/skills/nnsight/references/README.md +78 -0
- package/bin/skills/nnsight/references/api.md +344 -0
- package/bin/skills/nnsight/references/tutorials.md +300 -0
- package/bin/skills/openrlhf/SKILL.md +249 -0
- package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
- package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
- package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
- package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
- package/bin/skills/outlines/SKILL.md +652 -0
- package/bin/skills/outlines/references/backends.md +615 -0
- package/bin/skills/outlines/references/examples.md +773 -0
- package/bin/skills/outlines/references/json_generation.md +652 -0
- package/bin/skills/peft/SKILL.md +431 -0
- package/bin/skills/peft/references/advanced-usage.md +514 -0
- package/bin/skills/peft/references/troubleshooting.md +480 -0
- package/bin/skills/phoenix/SKILL.md +475 -0
- package/bin/skills/phoenix/references/advanced-usage.md +619 -0
- package/bin/skills/phoenix/references/troubleshooting.md +538 -0
- package/bin/skills/pinecone/SKILL.md +358 -0
- package/bin/skills/pinecone/references/deployment.md +181 -0
- package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
- package/bin/skills/pytorch-fsdp/references/index.md +7 -0
- package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
- package/bin/skills/pytorch-lightning/SKILL.md +346 -0
- package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
- package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
- package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
- package/bin/skills/pyvene/SKILL.md +473 -0
- package/bin/skills/pyvene/references/README.md +73 -0
- package/bin/skills/pyvene/references/api.md +383 -0
- package/bin/skills/pyvene/references/tutorials.md +376 -0
- package/bin/skills/qdrant/SKILL.md +493 -0
- package/bin/skills/qdrant/references/advanced-usage.md +648 -0
- package/bin/skills/qdrant/references/troubleshooting.md +631 -0
- package/bin/skills/ray-data/SKILL.md +326 -0
- package/bin/skills/ray-data/references/integration.md +82 -0
- package/bin/skills/ray-data/references/transformations.md +83 -0
- package/bin/skills/ray-train/SKILL.md +406 -0
- package/bin/skills/ray-train/references/multi-node.md +628 -0
- package/bin/skills/rwkv/SKILL.md +260 -0
- package/bin/skills/rwkv/references/architecture-details.md +344 -0
- package/bin/skills/rwkv/references/rwkv7.md +386 -0
- package/bin/skills/rwkv/references/state-management.md +369 -0
- package/bin/skills/saelens/SKILL.md +386 -0
- package/bin/skills/saelens/references/README.md +70 -0
- package/bin/skills/saelens/references/api.md +333 -0
- package/bin/skills/saelens/references/tutorials.md +318 -0
- package/bin/skills/segment-anything/SKILL.md +500 -0
- package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
- package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
- package/bin/skills/sentence-transformers/SKILL.md +255 -0
- package/bin/skills/sentence-transformers/references/models.md +123 -0
- package/bin/skills/sentencepiece/SKILL.md +235 -0
- package/bin/skills/sentencepiece/references/algorithms.md +200 -0
- package/bin/skills/sentencepiece/references/training.md +304 -0
- package/bin/skills/sglang/SKILL.md +442 -0
- package/bin/skills/sglang/references/deployment.md +490 -0
- package/bin/skills/sglang/references/radix-attention.md +413 -0
- package/bin/skills/sglang/references/structured-generation.md +541 -0
- package/bin/skills/simpo/SKILL.md +219 -0
- package/bin/skills/simpo/references/datasets.md +478 -0
- package/bin/skills/simpo/references/hyperparameters.md +452 -0
- package/bin/skills/simpo/references/loss-functions.md +350 -0
- package/bin/skills/skypilot/SKILL.md +509 -0
- package/bin/skills/skypilot/references/advanced-usage.md +491 -0
- package/bin/skills/skypilot/references/troubleshooting.md +570 -0
- package/bin/skills/slime/SKILL.md +464 -0
- package/bin/skills/slime/references/api-reference.md +392 -0
- package/bin/skills/slime/references/troubleshooting.md +386 -0
- package/bin/skills/speculative-decoding/SKILL.md +467 -0
- package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
- package/bin/skills/speculative-decoding/references/medusa.md +350 -0
- package/bin/skills/stable-diffusion/SKILL.md +519 -0
- package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
- package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
- package/bin/skills/tensorboard/SKILL.md +629 -0
- package/bin/skills/tensorboard/references/integrations.md +638 -0
- package/bin/skills/tensorboard/references/profiling.md +545 -0
- package/bin/skills/tensorboard/references/visualization.md +620 -0
- package/bin/skills/tensorrt-llm/SKILL.md +187 -0
- package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
- package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
- package/bin/skills/tensorrt-llm/references/serving.md +470 -0
- package/bin/skills/tinker/SKILL.md +362 -0
- package/bin/skills/tinker/references/api-reference.md +168 -0
- package/bin/skills/tinker/references/getting-started.md +157 -0
- package/bin/skills/tinker/references/loss-functions.md +163 -0
- package/bin/skills/tinker/references/models-and-lora.md +139 -0
- package/bin/skills/tinker/references/recipes.md +280 -0
- package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
- package/bin/skills/tinker/references/rendering.md +243 -0
- package/bin/skills/tinker/references/supervised-learning.md +232 -0
- package/bin/skills/tinker-training-cost/SKILL.md +187 -0
- package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
- package/bin/skills/torchforge/SKILL.md +433 -0
- package/bin/skills/torchforge/references/api-reference.md +327 -0
- package/bin/skills/torchforge/references/troubleshooting.md +409 -0
- package/bin/skills/torchtitan/SKILL.md +358 -0
- package/bin/skills/torchtitan/references/checkpoint.md +181 -0
- package/bin/skills/torchtitan/references/custom-models.md +258 -0
- package/bin/skills/torchtitan/references/float8.md +133 -0
- package/bin/skills/torchtitan/references/fsdp.md +126 -0
- package/bin/skills/transformer-lens/SKILL.md +346 -0
- package/bin/skills/transformer-lens/references/README.md +54 -0
- package/bin/skills/transformer-lens/references/api.md +362 -0
- package/bin/skills/transformer-lens/references/tutorials.md +339 -0
- package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
- package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
- package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
- package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
- package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
- package/bin/skills/unsloth/SKILL.md +80 -0
- package/bin/skills/unsloth/references/index.md +7 -0
- package/bin/skills/unsloth/references/llms-full.md +16799 -0
- package/bin/skills/unsloth/references/llms-txt.md +12044 -0
- package/bin/skills/unsloth/references/llms.md +82 -0
- package/bin/skills/verl/SKILL.md +391 -0
- package/bin/skills/verl/references/api-reference.md +301 -0
- package/bin/skills/verl/references/troubleshooting.md +391 -0
- package/bin/skills/vllm/SKILL.md +364 -0
- package/bin/skills/vllm/references/optimization.md +226 -0
- package/bin/skills/vllm/references/quantization.md +284 -0
- package/bin/skills/vllm/references/server-deployment.md +255 -0
- package/bin/skills/vllm/references/troubleshooting.md +447 -0
- package/bin/skills/weights-and-biases/SKILL.md +590 -0
- package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
- package/bin/skills/weights-and-biases/references/integrations.md +700 -0
- package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
- package/bin/skills/whisper/SKILL.md +317 -0
- package/bin/skills/whisper/references/languages.md +189 -0
- package/bin/synsc +0 -0
- package/package.json +10 -0
package/bin/skills/mamba/references/architecture-details.md
@@ -0,0 +1,206 @@

# Mamba Architecture Details

## Selective State Space Mechanism

Mamba's core innovation is the **Selective SSM (S6)** layer that makes state space model parameters input-dependent.

### How S6 Works

**Traditional SSMs** (non-selective):

```python
# Fixed A, B, C matrices for all inputs
h(t) = A * h(t-1) + B * x(t)  # State update
y(t) = C * h(t)               # Output
```

**Mamba's Selective SSM**:

```python
# Input-dependent parameters
B(t) = Linear_B(x(t))  # Selection mechanism
C(t) = Linear_C(x(t))  # Output projection
Δ(t) = Linear_Δ(x(t))  # Discretization step

# Selective state update
h(t) = discretize(A, Δ(t)) * h(t-1) + Δ(t) * B(t) * x(t)
y(t) = C(t) * h(t)
```
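The recurrence above can be sketched as a runnable sequential reference. This is a NumPy illustration of the selective-scan idea, not the library's fused kernel; the weight shapes, the 0.1 scaling, and the softplus on Δ are assumptions made for the sketch:

```python
import numpy as np

def selective_scan(x, A, W_B, W_C, W_dt):
    """Sequential selective-SSM reference: B, C, Δ computed per token from x."""
    d, n = A.shape                 # channels, state size
    h = np.zeros((d, n))           # SSM state
    ys = []
    for x_t in x:                  # x: (seq_len, d)
        dt = np.log1p(np.exp(W_dt @ x_t))   # softplus keeps Δ > 0
        B_t = W_B @ x_t            # input-dependent selection, shape (n,)
        C_t = W_C @ x_t            # input-dependent readout, shape (n,)
        A_bar = np.exp(dt[:, None] * A)     # discretize A per channel
        h = A_bar * h + (dt * x_t)[:, None] * B_t[None, :]
        ys.append(h @ C_t)         # output per channel, shape (d,)
    return np.stack(ys)            # (seq_len, d)

rng = np.random.default_rng(0)
d, n, L = 4, 16, 8
y = selective_scan(
    rng.normal(size=(L, d)),
    -np.exp(rng.normal(size=(d, n))),      # A < 0 for a decaying state
    0.1 * rng.normal(size=(n, d)),
    0.1 * rng.normal(size=(n, d)),
    0.1 * rng.normal(size=(d, d)),
)
print(y.shape)  # (8, 4)
```

A token with a large Δ writes strongly into the state (remember); a token with a small Δ leaves it nearly unchanged (forget), which is exactly the selection mechanism described above.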
### Key Advantages

**1. Content-based reasoning**:
- Can selectively remember or forget based on input
- Addresses discrete modality weakness of traditional SSMs
- Example: Remembers important tokens, forgets padding

**2. Input-dependent selection**:

```python
# Mamba decides per token what to remember
if is_important(x(t)):
    Δ(t) = large_value  # Keep in state
else:
    Δ(t) = small_value  # Forget quickly
```

**3. No attention required**:
- Replaces O(n²) attention with O(n) state updates
- State dimension is constant (typically 16)
## Model Configuration

### Core Parameters

```python
from mamba_ssm import Mamba

model = Mamba(
    d_model=256,       # Hidden dimension (256, 512, 768, 1024, 2048)
    d_state=16,        # SSM state dimension (16 is optimal)
    d_conv=4,          # Local convolution width (4 is standard)
    expand=2,          # Expansion factor (1.5-2.0)
    dt_rank="auto",    # Rank of Δ projection (auto = d_model / 16)
    dt_min=0.001,      # Min Δ init (controls forgetting rate)
    dt_max=0.1,        # Max Δ init
    dt_init="random",  # Δ initialization (random, constant)
    dt_scale=1.0,      # Δ scaling factor
    conv_bias=True,    # Use bias in convolution
    bias=False,        # Use bias in linear projections
)
```
### Parameter Impact

**d_state** (SSM state dimension):
- Standard: 16 (optimal from ablations)
- Smaller (8): Faster but less capacity
- Larger (32, 64): Minimal improvement, 2× slower

**expand** (block expansion):
- Standard: 2.0
- Range: 1.5-2.0
- Controls inner dimension = expand * d_model

**d_conv** (convolution width):
- Standard: 4
- Local context window before SSM
- Helps with positional information

**dt_rank** (Δ projection rank):
- Auto: d_model / 16 (recommended)
- Controls Δ parameter efficiency
- Lower rank = more efficient but less expressive
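The derived sizes follow directly from these rules; a small helper (a hypothetical convenience function, mirroring the defaults listed above) makes them explicit:

```python
def mamba_dims(d_model, expand=2.0, dt_rank="auto"):
    """Derived Mamba dimensions per the defaults above (illustrative helper)."""
    d_inner = int(expand * d_model)                        # inner dim = expand * d_model
    rank = d_model // 16 if dt_rank == "auto" else dt_rank  # "auto" = d_model / 16
    return d_inner, rank

print(mamba_dims(768))   # (1536, 48)
print(mamba_dims(2048))  # (4096, 128)
```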
## Mamba Block Structure

```python
import torch.nn as nn
from mamba_ssm import Mamba

# Mamba block (replaces the Transformer block)
class MambaBlock(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.norm = RMSNorm(d_model)
        self.mamba = Mamba(d_model, d_state=16, d_conv=4, expand=2)

    def forward(self, x):
        return x + self.mamba(self.norm(x))  # Residual connection

# Full model (stack of Mamba blocks)
model = nn.Sequential(
    Embedding(...),
    *[MambaBlock(d_model) for _ in range(n_layers)],
    RMSNorm(d_model),
    LMHead(...),
)
```

**Key differences from Transformers**:
- No multi-head attention (MHA)
- No feedforward network (FFN)
- Single Mamba layer per block
- 2× more layers than an equivalent Transformer
## Hardware-Aware Implementation

### Parallel Algorithm

Mamba uses a **scan-based parallel algorithm** for training:

```python
# Parallel mode (training)
# GPU kernel fuses operations
y = parallel_scan(A, B, C, x)  # O(n log n) parallel

# Sequential mode (inference)
# Constant-memory, RNN-style
h = 0
for x_t in sequence:
    h = A * h + B * x_t
    y_t = C * h
```
### Memory Efficiency

**Training**:
- Recomputes activations in the backward pass
- Similar to the FlashAttention strategy
- Memory: O(batch_size * seq_len * d_model)

**Inference**:
- RNN-style sequential processing
- State size: O(d_model * d_state) = constant
- No KV cache needed (huge advantage!)
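To make the constant-state advantage concrete, here is a back-of-the-envelope comparison of a Transformer's KV cache against Mamba's per-layer state. The layer counts, head sizes, and 2-byte (bf16) width below are illustrative assumptions for ~1.3B-scale models, not measured figures:

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, bytes_per=2):
    # Transformer decoding caches K and V per layer, per token (bf16 = 2 bytes)
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per

def mamba_state_bytes(n_layers, d_inner, d_state, d_conv, bytes_per=2):
    # Mamba keeps a fixed SSM state plus a small conv buffer per layer,
    # independent of sequence length
    return n_layers * d_inner * (d_state + d_conv) * bytes_per

# Illustrative configs (assumed, not official)
for seq_len in (2048, 8192, 32768):
    kv = kv_cache_bytes(24, 16, 128, seq_len) / 2**20
    ssm = mamba_state_bytes(48, 4096, 16, 4) / 2**20
    print(f"{seq_len:6d} tokens: KV cache {kv:7.1f} MiB vs Mamba state {ssm:.1f} MiB")
```

Under these assumptions the KV cache grows linearly into the gigabytes while the Mamba state stays a constant few MiB, which is why batch sizes and context lengths scale so differently at inference time.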
### CUDA Kernel Optimizations

A single fused GPU kernel combines:

- Discretization (continuous → discrete A, B)
- SSM recurrence (parallel scan)
- Convolution (efficient 1D conv)
## Layer Count Scaling

Mamba models use **2× layers** compared to Transformers:

| Model | d_model | n_layers | Params |
|-------|---------|----------|--------|
| Mamba-130M | 768 | 24 | 130M |
| Mamba-370M | 1024 | 48 | 370M |
| Mamba-790M | 1536 | 48 | 790M |
| Mamba-1.4B | 2048 | 48 | 1.4B |
| Mamba-2.8B | 2560 | 64 | 2.8B |

**Why 2× layers?**
- Mamba blocks are simpler (no MHA, no FFN)
- ~50% fewer parameters per layer
- Doubling layers matches the compute budget
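The table's parameter counts can be sanity-checked from the block structure. The accounting below is an estimate: it assumes tied embeddings, a ~50K vocabulary (GPT-NeoX tokenizer), and the per-projection breakdown of a standard Mamba block, so treat it as a sketch rather than the reference implementation:

```python
def mamba_params(d_model, n_layers, d_state=16, d_conv=4, expand=2, vocab=50280):
    """Rough parameter count for a Mamba LM (assumes tied embedding/LM head)."""
    d_in = expand * d_model
    dt_rank = d_model // 16                    # "auto" rank
    block = (
        d_model * 2 * d_in                     # in_proj: x branch + gate z
        + d_in * (dt_rank + 2 * d_state)       # x_proj -> Δ, B, C
        + dt_rank * d_in + d_in                # dt_proj (+ bias)
        + d_in * d_conv + d_in                 # depthwise conv (+ bias)
        + d_in * d_state + d_in                # A_log and D
        + d_in * d_model                       # out_proj
        + d_model                              # RMSNorm
    )
    return n_layers * block + vocab * d_model  # blocks + tied embedding

for name, d, n in [("130M", 768, 24), ("370M", 1024, 48),
                   ("790M", 1536, 48), ("1.4B", 2048, 48)]:
    print(name, f"~{mamba_params(d, n) / 1e6:.0f}M")  # close to the table's sizes
```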
## Initialization Strategy

```python
import math
import torch

# Δ (discretization step) initialization: log-uniform in [dt_min, dt_max]
dt_init_floor = 1e-4
dt = torch.exp(
    torch.rand(d_inner) * (math.log(dt_max) - math.log(dt_min))
    + math.log(dt_min)
).clamp(min=dt_init_floor)

# A (state transition) initialization
A = -torch.exp(torch.rand(d_inner, d_state))  # Negative for stability

# B, C (input/output) initialization
B = torch.randn(d_inner, d_state)
C = torch.randn(d_inner, d_state)
```

**Critical for stability**:
- A must be negative (exponential decay)
- Δ in range [dt_min, dt_max]
- Random initialization helps diversity
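Why negative A gives stability: after discretization the state multiplier is exp(Δ·A), which lies in (0, 1) whenever A < 0 and Δ > 0, so the state decays rather than blowing up. A quick check in plain Python (the specific A and Δ values are just illustrative):

```python
import math

def decay_factor(A, dt):
    # Discretized state multiplier: h(t) = exp(Δ·A) · h(t-1) + ...
    return math.exp(dt * A)

for A in (-0.1, -1.0, -10.0):
    for dt in (0.001, 0.1):
        f = decay_factor(A, dt)
        assert 0.0 < f < 1.0   # bounded multiplier → stable recurrence
        print(f"A={A:6.1f}  Δ={dt:5.3f}  → multiplier {f:.4f}")
```

Large |A| or large Δ means fast forgetting; small values mean long retention, which is how the [dt_min, dt_max] init range controls the forgetting rate.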
## Resources

- Paper (Mamba-1): https://arxiv.org/abs/2312.00752
- Paper (Mamba-2): https://arxiv.org/abs/2405.21060
- GitHub: https://github.com/state-spaces/mamba
- Models: https://huggingface.co/state-spaces
- CUDA kernels: https://github.com/state-spaces/mamba/tree/main/csrc
@@ -0,0 +1,255 @@

# Mamba Performance Benchmarks

## Inference Speed Comparison

### Throughput (tokens/sec)

**Mamba-1.4B vs Transformer-1.3B** on a single A100 80GB:

| Sequence Length | Mamba-1.4B | Transformer-1.3B | Speedup |
|----------------|------------|------------------|---------|
| 512 | 8,300 | 6,200 | 1.3× |
| 1024 | 7,800 | 4,100 | 1.9× |
| 2048 | 7,200 | 2,300 | 3.1× |
| 4096 | 6,800 | 1,200 | 5.7× |
| 8192 | 6,400 | 600 | **10.7×** |
| 16384 | 6,100 | OOM | n/a |

**Key insight**: the speedup grows with sequence length (Mamba is O(n), attention is O(n²))
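
A toy cost model shows the same trend. It ignores constants, kernel efficiency, and memory effects, and the hidden size d is an illustrative assumption, but it captures why the ratio grows with n:

```python
def transformer_flops(n, d=2048):
    return n * d * d + n * n * d   # linear layers + quadratic attention

def mamba_flops(n, d=2048):
    return n * d * d               # selective scan stays linear in n

for n in (512, 2048, 8192):
    print(n, transformer_flops(n) / mamba_flops(n))  # speedup = 1 + n/d
```

Once n exceeds d, the attention term dominates and the modeled speedup grows roughly linearly in sequence length.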

### Latency (ms per token)

**Generation latency** (batch size 1, autoregressive):

| Model | First Token | Per Token | 100 Tokens Total |
|-------|-------------|-----------|------------------|
| Mamba-130M | 3 ms | 0.8 ms | 83 ms |
| Transformer-130M | 5 ms | 1.2 ms | 125 ms |
| Mamba-1.4B | 12 ms | 3.2 ms | 332 ms |
| Transformer-1.3B | 18 ms | 8.5 ms | 868 ms |
| Mamba-2.8B | 20 ms | 6.1 ms | 631 ms |
| Transformer-2.7B | 35 ms | 18.2 ms | 1855 ms |

**Mamba advantage**: constant per-token latency regardless of context length

## Memory Usage

### Training Memory (BF16, per GPU)

**Mamba-1.4B** training memory breakdown:

| Sequence Length | Activations | Gradients | Optimizer | Total | vs Transformer |
|----------------|-------------|-----------|-----------|-------|----------------|
| 512 | 2.1 GB | 3.2 GB | 11.2 GB | 16.5 GB | 0.9× |
| 1024 | 3.8 GB | 3.2 GB | 11.2 GB | 18.2 GB | 0.6× |
| 2048 | 7.2 GB | 3.2 GB | 11.2 GB | 21.6 GB | 0.4× |
| 4096 | 14.1 GB | 3.2 GB | 11.2 GB | 28.5 GB | 0.25× |
| 8192 | 28.0 GB | 3.2 GB | 11.2 GB | 42.4 GB | 0.15× |

**Note**: the comparable Transformer OOMs at 8K sequence length on a 40GB A100

### Inference Memory (FP16, batch size 1)

| Model | KV Cache (8K ctx, Transformer) | State (Mamba) |
|-------|-------------------------------|---------------|
| 130M | 2.1 GB | ~0 |
| 370M | 5.2 GB | ~0 |
| 1.4B | 19.7 GB | ~0 |
| 2.8B | 38.4 GB | ~0 |

**Mamba stores no KV cache** - constant memory per token!

Actual Mamba recurrent state size (SSM state plus conv buffer, per sequence):
- 130M: ~3 MB (d_inner × d_state × n_layers ≈ 1536 × 16 × 24 elements)
- 2.8B: ~13 MB (5120 × 16 × 64)
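
Those figures can be approximated from the model shape. A hedged sketch (d_conv=4 and expand=2 are the usual repo defaults; exact sizes depend on dtype and which buffers you count):

```python
def mamba_state_mb(d_model, n_layers, d_state=16, d_conv=4,
                   expand=2, bytes_per_elem=2):
    """Per-sequence recurrent state: SSM state + conv buffer, all layers."""
    d_inner = expand * d_model
    ssm_elems = d_inner * d_state * n_layers
    conv_elems = d_inner * d_conv * n_layers
    return (ssm_elems + conv_elems) * bytes_per_elem / 2**20

print(round(mamba_state_mb(2560, 64), 1))  # Mamba-2.8B: ~13 MB in FP16
```

Either way the state is megabytes, not gigabytes, which is why the KV-cache column above dwarfs it.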

## Language Modeling Benchmarks

### Perplexity on Common Datasets

**Models trained on The Pile (300B tokens)**, lower is better:

| Model | Params | Pile (val) | WikiText-103 | C4 | Lambada |
|-------|--------|------------|--------------|-----|---------|
| Pythia | 160M | 29.6 | 28.4 | 23.1 | 51.2 |
| **Mamba** | **130M** | **28.1** | **26.7** | **21.8** | **48.3** |
| Pythia | 410M | 18.3 | 17.6 | 16.2 | 32.1 |
| **Mamba** | **370M** | **16.7** | **16.2** | **15.1** | **28.4** |
| Pythia | 1.4B | 10.8 | 10.2 | 11.3 | 15.2 |
| **Mamba** | **1.4B** | **9.1** | **9.6** | **10.1** | **12.8** |
| Pythia | 2.8B | 8.3 | 7.9 | 9.2 | 10.6 |
| **Mamba** | **2.8B** | **7.4** | **7.2** | **8.3** | **9.1** |

**Mamba consistently outperforms** Transformers of similar size, with perplexity roughly 5-15% lower across scales

### Zero-Shot Task Performance

**Mamba-2.8B vs Transformer-2.7B** on common benchmarks (accuracy, %):

| Task | Mamba-2.8B | Transformer-2.7B | Delta |
|------|------------|------------------|-------|
| HellaSwag | 61.3 | 58.7 | +2.6 |
| PIQA | 78.1 | 76.4 | +1.7 |
| ARC-Easy | 68.2 | 65.9 | +2.3 |
| ARC-Challenge | 42.7 | 40.1 | +2.6 |
| WinoGrande | 64.8 | 62.3 | +2.5 |
| OpenBookQA | 43.2 | 41.8 | +1.4 |
| BoolQ | 71.4 | 68.2 | +3.2 |
| MMLU (5-shot) | 35.2 | 33.8 | +1.4 |

**Average improvement**: +2.2 points across benchmarks

## Audio Modeling Benchmarks

### SC09 (Speech Commands)

**Task**: audio classification (10 classes)

| Model | Params | Accuracy | Inference (ms) |
|-------|--------|----------|----------------|
| Transformer | 8.2M | 96.2% | 18 ms |
| S4 | 6.1M | 97.1% | 8 ms |
| **Mamba** | **6.3M** | **98.4%** | **6 ms** |

### LJSpeech (Speech Generation)

**Task**: text-to-speech quality (MOS score)

| Model | Params | MOS ↑ | RTF ↓ |
|-------|--------|-------|-------|
| Transformer | 12M | 3.82 | 0.45 |
| Conformer | 11M | 3.91 | 0.38 |
| **Mamba** | **10M** | **4.03** | **0.21** |

**RTF** (Real-Time Factor): lower is better (0.21 means synthesis runs ~5× faster than real time)
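
RTF is simply synthesis time divided by the duration of the generated audio; 1/RTF is the speedup over real time. A trivial illustration (the 2.1 s / 10 s figures are made-up inputs, not measurements from the table):

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    # RTF < 1 means faster than real time; 1/RTF is the speedup factor
    return synthesis_seconds / audio_seconds

rtf = real_time_factor(2.1, 10.0)  # e.g. 10 s of audio synthesized in 2.1 s
print(rtf, 1 / rtf)                # RTF ~0.21 -> about 4.8x real time
```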

## Genomics Benchmarks

### Human Reference Genome (HG38)

**Task**: next-nucleotide prediction

| Model | Context Length | Perplexity | Throughput |
|-------|----------------|------------|------------|
| Transformer | 1024 | 3.21 | 1,200 bp/s |
| Hyena | 32768 | 2.87 | 8,500 bp/s |
| **Mamba** | **1M** | **2.14** | **45,000 bp/s** |

**Mamba handles million-length sequences** efficiently

## Scaling Laws

### Compute-Optimal Training

**FLOPs vs perplexity** (The Pile validation):

| Model Size | Training FLOPs | Mamba Perplexity | Transformer Perplexity |
|------------|----------------|------------------|------------------------|
| 130M | 6e19 | 28.1 | 29.6 |
| 370M | 3e20 | 16.7 | 18.3 |
| 790M | 8e20 | 12.3 | 13.9 |
| 1.4B | 2e21 | 9.1 | 10.8 |
| 2.8B | 6e21 | 7.4 | 8.3 |

**Scaling coefficient**: Mamba reaches the same perplexity as a Transformer with roughly **0.8×** the compute
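
The FLOPs column is consistent with the common ~6·N·D rule of thumb for dense-model training (N parameters, D tokens; the Pile runs above use 300B tokens):

```python
def training_flops(n_params, n_tokens):
    # rule of thumb: ~6 FLOPs per parameter per training token
    return 6 * n_params * n_tokens

flops = training_flops(1.4e9, 300e9)
print(f"{flops:.1e}")  # ~2.5e21, the same ballpark as the 2e21 in the table
```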

### Parameter Efficiency

**Parameters needed to reach perplexity 10.0** on The Pile:

| Model Type | Parameters Needed | Memory (inference) |
|------------|-------------------|-------------------|
| Transformer | 1.6B | 3.2 GB |
| **Mamba** | **1.1B** | **2.2 GB** |

**Mamba needs ~30% fewer parameters** for the same performance

## Long-Range Arena (LRA)

**Task**: long-context understanding benchmarks (accuracy, %)

| Task | Length | Transformer | S4 | Mamba |
|------|--------|-------------|-----|-------|
| ListOps | 2K | 36.4% | 59.6% | **61.2%** |
| Text | 4K | 64.3% | 86.8% | **88.1%** |
| Retrieval | 4K | 57.5% | 90.9% | **92.3%** |
| Image | 1K | 42.4% | 88.7% | **89.4%** |
| PathFinder | 1K | 71.4% | 86.1% | **87.8%** |
| Path-X | 16K | OOM | 88.3% | **91.2%** |

**Average**: Mamba 85.0%, S4 83.4%, Transformer 54.4%

## Training Throughput

### Tokens/sec During Training

**8× A100 80GB** cluster, BF16, varying sequence length:

| Model | Seq Len 512 | Seq Len 2K | Seq Len 8K | Seq Len 32K |
|-------|-------------|------------|------------|-------------|
| Transformer-1.3B | 180K | 52K | OOM | OOM |
| **Mamba-1.4B** | **195K** | **158K** | **121K** | **89K** |
| Transformer-2.7B | 92K | 26K | OOM | OOM |
| **Mamba-2.8B** | **98K** | **81K** | **62K** | **45K** |

**Mamba scales to longer sequences** without running out of memory

## Hardware Utilization

### GPU Memory Bandwidth

**Mamba-1.4B** inference on different GPUs:

| GPU | Memory BW | Tokens/sec | Efficiency |
|-----|-----------|------------|------------|
| A100 80GB | 2.0 TB/s | 6,800 | 85% |
| A100 40GB | 1.6 TB/s | 5,400 | 84% |
| V100 32GB | 900 GB/s | 3,100 | 86% |
| RTX 4090 | 1.0 TB/s | 3,600 | 90% |

**High efficiency**: Mamba inference is memory-bandwidth bound, so throughput tracks memory bandwidth rather than peak FLOPs

### Multi-GPU Scaling

**Mamba-2.8B** training throughput:

| GPUs | Tokens/sec | Scaling Efficiency |
|------|------------|-------------------|
| 1× A100 | 12,300 | 100% |
| 2× A100 | 23,800 | 97% |
| 4× A100 | 46,100 | 94% |
| 8× A100 | 89,400 | 91% |
| 16× A100 | 172,000 | 88% |

**Near-linear scaling** up to 16 GPUs
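
Scaling efficiency here is just measured throughput divided by the ideal linear throughput (n GPUs × single-GPU baseline). The table's own numbers check out:

```python
def scaling_efficiency(throughput, n_gpus, per_gpu_baseline):
    # 1.0 means perfectly linear scaling
    return throughput / (n_gpus * per_gpu_baseline)

eff = scaling_efficiency(89_400, 8, 12_300)  # 8x A100 row from the table
print(f"{eff:.0%}")  # 91%
```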

## Cost Analysis

### Training Cost (USD)

**Training to The Pile perplexity 10.0** on cloud GPUs:

| Model | Cloud GPUs | Hours | Cost (A100) | Cost (H100) |
|-------|------------|-------|-------------|-------------|
| Transformer-1.6B | 8× A100 | 280 | $8,400 | $4,200 |
| **Mamba-1.1B** | **8× A100** | **180** | **$5,400** | **$2,700** |

**Savings**: 36% cost reduction vs Transformer
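
The dollar figures imply a flat rate of about $3.75 per A100-hour (an assumption read off the table, not a quoted cloud price):

```python
def training_cost_usd(n_gpus, hours, usd_per_gpu_hour):
    return n_gpus * hours * usd_per_gpu_hour

A100_RATE = 3.75  # implied by the table; actual cloud pricing varies
print(training_cost_usd(8, 280, A100_RATE))  # 8400.0 (Transformer-1.6B)
print(training_cost_usd(8, 180, A100_RATE))  # 5400.0 (Mamba-1.1B)
```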

### Inference Cost (USD/million tokens)

**API-style inference** (batch size 1, 2K context):

| Model | Latency | Cost/M tokens | Quality (perplexity) |
|-------|---------|---------------|---------------------|
| Transformer-1.3B | 8.5 ms/tok | $0.42 | 10.8 |
| **Mamba-1.4B** | **3.2 ms/tok** | **$0.18** | **9.1** |

**Mamba provides**: 2.6× faster generation, 57% lower cost, and better quality

## Resources

- Benchmarks code: https://github.com/state-spaces/mamba/tree/main/benchmarks
- Paper (Mamba-1): https://arxiv.org/abs/2312.00752 (Section 4: Experiments)
- Paper (Mamba-2): https://arxiv.org/abs/2405.21060 (Section 5: Experiments)
- Pretrained models: https://huggingface.co/state-spaces