@synsci/cli-darwin-x64 1.1.49

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (373)
  1. package/bin/skills/accelerate/SKILL.md +332 -0
  2. package/bin/skills/accelerate/references/custom-plugins.md +453 -0
  3. package/bin/skills/accelerate/references/megatron-integration.md +489 -0
  4. package/bin/skills/accelerate/references/performance.md +525 -0
  5. package/bin/skills/audiocraft/SKILL.md +564 -0
  6. package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
  7. package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
  8. package/bin/skills/autogpt/SKILL.md +403 -0
  9. package/bin/skills/autogpt/references/advanced-usage.md +535 -0
  10. package/bin/skills/autogpt/references/troubleshooting.md +420 -0
  11. package/bin/skills/awq/SKILL.md +310 -0
  12. package/bin/skills/awq/references/advanced-usage.md +324 -0
  13. package/bin/skills/awq/references/troubleshooting.md +344 -0
  14. package/bin/skills/axolotl/SKILL.md +158 -0
  15. package/bin/skills/axolotl/references/api.md +5548 -0
  16. package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
  17. package/bin/skills/axolotl/references/index.md +15 -0
  18. package/bin/skills/axolotl/references/other.md +3563 -0
  19. package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
  20. package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
  21. package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
  22. package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
  23. package/bin/skills/bitsandbytes/SKILL.md +411 -0
  24. package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
  25. package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
  26. package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
  27. package/bin/skills/blip-2/SKILL.md +564 -0
  28. package/bin/skills/blip-2/references/advanced-usage.md +680 -0
  29. package/bin/skills/blip-2/references/troubleshooting.md +526 -0
  30. package/bin/skills/chroma/SKILL.md +406 -0
  31. package/bin/skills/chroma/references/integration.md +38 -0
  32. package/bin/skills/clip/SKILL.md +253 -0
  33. package/bin/skills/clip/references/applications.md +207 -0
  34. package/bin/skills/constitutional-ai/SKILL.md +290 -0
  35. package/bin/skills/crewai/SKILL.md +498 -0
  36. package/bin/skills/crewai/references/flows.md +438 -0
  37. package/bin/skills/crewai/references/tools.md +429 -0
  38. package/bin/skills/crewai/references/troubleshooting.md +480 -0
  39. package/bin/skills/deepspeed/SKILL.md +141 -0
  40. package/bin/skills/deepspeed/references/08.md +17 -0
  41. package/bin/skills/deepspeed/references/09.md +173 -0
  42. package/bin/skills/deepspeed/references/2020.md +378 -0
  43. package/bin/skills/deepspeed/references/2023.md +279 -0
  44. package/bin/skills/deepspeed/references/assets.md +179 -0
  45. package/bin/skills/deepspeed/references/index.md +35 -0
  46. package/bin/skills/deepspeed/references/mii.md +118 -0
  47. package/bin/skills/deepspeed/references/other.md +1191 -0
  48. package/bin/skills/deepspeed/references/tutorials.md +6554 -0
  49. package/bin/skills/dspy/SKILL.md +590 -0
  50. package/bin/skills/dspy/references/examples.md +663 -0
  51. package/bin/skills/dspy/references/modules.md +475 -0
  52. package/bin/skills/dspy/references/optimizers.md +566 -0
  53. package/bin/skills/faiss/SKILL.md +221 -0
  54. package/bin/skills/faiss/references/index_types.md +280 -0
  55. package/bin/skills/flash-attention/SKILL.md +367 -0
  56. package/bin/skills/flash-attention/references/benchmarks.md +215 -0
  57. package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
  58. package/bin/skills/gguf/SKILL.md +427 -0
  59. package/bin/skills/gguf/references/advanced-usage.md +504 -0
  60. package/bin/skills/gguf/references/troubleshooting.md +442 -0
  61. package/bin/skills/gptq/SKILL.md +450 -0
  62. package/bin/skills/gptq/references/calibration.md +337 -0
  63. package/bin/skills/gptq/references/integration.md +129 -0
  64. package/bin/skills/gptq/references/troubleshooting.md +95 -0
  65. package/bin/skills/grpo-rl-training/README.md +97 -0
  66. package/bin/skills/grpo-rl-training/SKILL.md +572 -0
  67. package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
  68. package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
  69. package/bin/skills/guidance/SKILL.md +572 -0
  70. package/bin/skills/guidance/references/backends.md +554 -0
  71. package/bin/skills/guidance/references/constraints.md +674 -0
  72. package/bin/skills/guidance/references/examples.md +767 -0
  73. package/bin/skills/hqq/SKILL.md +445 -0
  74. package/bin/skills/hqq/references/advanced-usage.md +528 -0
  75. package/bin/skills/hqq/references/troubleshooting.md +503 -0
  76. package/bin/skills/hugging-face-cli/SKILL.md +191 -0
  77. package/bin/skills/hugging-face-cli/references/commands.md +954 -0
  78. package/bin/skills/hugging-face-cli/references/examples.md +374 -0
  79. package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
  80. package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
  81. package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
  82. package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
  83. package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
  84. package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
  85. package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
  86. package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
  87. package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
  88. package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
  89. package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
  90. package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
  91. package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
  92. package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
  93. package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
  94. package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
  95. package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
  96. package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
  97. package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
  98. package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
  99. package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
  100. package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
  101. package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
  102. package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
  103. package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
  104. package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
  105. package/bin/skills/hugging-face-jobs/index.html +216 -0
  106. package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
  107. package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
  108. package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
  109. package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
  110. package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
  111. package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
  112. package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
  113. package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
  114. package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
  115. package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
  116. package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
  117. package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
  118. package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
  119. package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
  120. package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
  121. package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
  122. package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
  123. package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
  124. package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
  125. package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
  126. package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
  127. package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
  128. package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
  129. package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
  130. package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
  131. package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
  132. package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
  133. package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
  134. package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
  135. package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
  136. package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
  137. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
  138. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
  139. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
  140. package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
  141. package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
  142. package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
  143. package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
  144. package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
  145. package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
  146. package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
  147. package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
  148. package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
  149. package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
  150. package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
  151. package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
  152. package/bin/skills/instructor/SKILL.md +740 -0
  153. package/bin/skills/instructor/references/examples.md +107 -0
  154. package/bin/skills/instructor/references/providers.md +70 -0
  155. package/bin/skills/instructor/references/validation.md +606 -0
  156. package/bin/skills/knowledge-distillation/SKILL.md +458 -0
  157. package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
  158. package/bin/skills/lambda-labs/SKILL.md +545 -0
  159. package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
  160. package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
  161. package/bin/skills/langchain/SKILL.md +480 -0
  162. package/bin/skills/langchain/references/agents.md +499 -0
  163. package/bin/skills/langchain/references/integration.md +562 -0
  164. package/bin/skills/langchain/references/rag.md +600 -0
  165. package/bin/skills/langsmith/SKILL.md +422 -0
  166. package/bin/skills/langsmith/references/advanced-usage.md +548 -0
  167. package/bin/skills/langsmith/references/troubleshooting.md +537 -0
  168. package/bin/skills/litgpt/SKILL.md +469 -0
  169. package/bin/skills/litgpt/references/custom-models.md +568 -0
  170. package/bin/skills/litgpt/references/distributed-training.md +451 -0
  171. package/bin/skills/litgpt/references/supported-models.md +336 -0
  172. package/bin/skills/litgpt/references/training-recipes.md +619 -0
  173. package/bin/skills/llama-cpp/SKILL.md +258 -0
  174. package/bin/skills/llama-cpp/references/optimization.md +89 -0
  175. package/bin/skills/llama-cpp/references/quantization.md +213 -0
  176. package/bin/skills/llama-cpp/references/server.md +125 -0
  177. package/bin/skills/llama-factory/SKILL.md +80 -0
  178. package/bin/skills/llama-factory/references/_images.md +23 -0
  179. package/bin/skills/llama-factory/references/advanced.md +1055 -0
  180. package/bin/skills/llama-factory/references/getting_started.md +349 -0
  181. package/bin/skills/llama-factory/references/index.md +19 -0
  182. package/bin/skills/llama-factory/references/other.md +31 -0
  183. package/bin/skills/llamaguard/SKILL.md +337 -0
  184. package/bin/skills/llamaindex/SKILL.md +569 -0
  185. package/bin/skills/llamaindex/references/agents.md +83 -0
  186. package/bin/skills/llamaindex/references/data_connectors.md +108 -0
  187. package/bin/skills/llamaindex/references/query_engines.md +406 -0
  188. package/bin/skills/llava/SKILL.md +304 -0
  189. package/bin/skills/llava/references/training.md +197 -0
  190. package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
  191. package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
  192. package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
  193. package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
  194. package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
  195. package/bin/skills/long-context/SKILL.md +536 -0
  196. package/bin/skills/long-context/references/extension_methods.md +468 -0
  197. package/bin/skills/long-context/references/fine_tuning.md +611 -0
  198. package/bin/skills/long-context/references/rope.md +402 -0
  199. package/bin/skills/mamba/SKILL.md +260 -0
  200. package/bin/skills/mamba/references/architecture-details.md +206 -0
  201. package/bin/skills/mamba/references/benchmarks.md +255 -0
  202. package/bin/skills/mamba/references/training-guide.md +388 -0
  203. package/bin/skills/megatron-core/SKILL.md +366 -0
  204. package/bin/skills/megatron-core/references/benchmarks.md +249 -0
  205. package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
  206. package/bin/skills/megatron-core/references/production-examples.md +473 -0
  207. package/bin/skills/megatron-core/references/training-recipes.md +547 -0
  208. package/bin/skills/miles/SKILL.md +315 -0
  209. package/bin/skills/miles/references/api-reference.md +141 -0
  210. package/bin/skills/miles/references/troubleshooting.md +352 -0
  211. package/bin/skills/mlflow/SKILL.md +704 -0
  212. package/bin/skills/mlflow/references/deployment.md +744 -0
  213. package/bin/skills/mlflow/references/model-registry.md +770 -0
  214. package/bin/skills/mlflow/references/tracking.md +680 -0
  215. package/bin/skills/modal/SKILL.md +341 -0
  216. package/bin/skills/modal/references/advanced-usage.md +503 -0
  217. package/bin/skills/modal/references/troubleshooting.md +494 -0
  218. package/bin/skills/model-merging/SKILL.md +539 -0
  219. package/bin/skills/model-merging/references/evaluation.md +462 -0
  220. package/bin/skills/model-merging/references/examples.md +428 -0
  221. package/bin/skills/model-merging/references/methods.md +352 -0
  222. package/bin/skills/model-pruning/SKILL.md +495 -0
  223. package/bin/skills/model-pruning/references/wanda.md +347 -0
  224. package/bin/skills/moe-training/SKILL.md +526 -0
  225. package/bin/skills/moe-training/references/architectures.md +432 -0
  226. package/bin/skills/moe-training/references/inference.md +348 -0
  227. package/bin/skills/moe-training/references/training.md +425 -0
  228. package/bin/skills/nanogpt/SKILL.md +290 -0
  229. package/bin/skills/nanogpt/references/architecture.md +382 -0
  230. package/bin/skills/nanogpt/references/data.md +476 -0
  231. package/bin/skills/nanogpt/references/training.md +564 -0
  232. package/bin/skills/nemo-curator/SKILL.md +383 -0
  233. package/bin/skills/nemo-curator/references/deduplication.md +87 -0
  234. package/bin/skills/nemo-curator/references/filtering.md +102 -0
  235. package/bin/skills/nemo-evaluator/SKILL.md +494 -0
  236. package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
  237. package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
  238. package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
  239. package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
  240. package/bin/skills/nemo-guardrails/SKILL.md +297 -0
  241. package/bin/skills/nnsight/SKILL.md +436 -0
  242. package/bin/skills/nnsight/references/README.md +78 -0
  243. package/bin/skills/nnsight/references/api.md +344 -0
  244. package/bin/skills/nnsight/references/tutorials.md +300 -0
  245. package/bin/skills/openrlhf/SKILL.md +249 -0
  246. package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
  247. package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
  248. package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
  249. package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
  250. package/bin/skills/outlines/SKILL.md +652 -0
  251. package/bin/skills/outlines/references/backends.md +615 -0
  252. package/bin/skills/outlines/references/examples.md +773 -0
  253. package/bin/skills/outlines/references/json_generation.md +652 -0
  254. package/bin/skills/peft/SKILL.md +431 -0
  255. package/bin/skills/peft/references/advanced-usage.md +514 -0
  256. package/bin/skills/peft/references/troubleshooting.md +480 -0
  257. package/bin/skills/phoenix/SKILL.md +475 -0
  258. package/bin/skills/phoenix/references/advanced-usage.md +619 -0
  259. package/bin/skills/phoenix/references/troubleshooting.md +538 -0
  260. package/bin/skills/pinecone/SKILL.md +358 -0
  261. package/bin/skills/pinecone/references/deployment.md +181 -0
  262. package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
  263. package/bin/skills/pytorch-fsdp/references/index.md +7 -0
  264. package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
  265. package/bin/skills/pytorch-lightning/SKILL.md +346 -0
  266. package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
  267. package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
  268. package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
  269. package/bin/skills/pyvene/SKILL.md +473 -0
  270. package/bin/skills/pyvene/references/README.md +73 -0
  271. package/bin/skills/pyvene/references/api.md +383 -0
  272. package/bin/skills/pyvene/references/tutorials.md +376 -0
  273. package/bin/skills/qdrant/SKILL.md +493 -0
  274. package/bin/skills/qdrant/references/advanced-usage.md +648 -0
  275. package/bin/skills/qdrant/references/troubleshooting.md +631 -0
  276. package/bin/skills/ray-data/SKILL.md +326 -0
  277. package/bin/skills/ray-data/references/integration.md +82 -0
  278. package/bin/skills/ray-data/references/transformations.md +83 -0
  279. package/bin/skills/ray-train/SKILL.md +406 -0
  280. package/bin/skills/ray-train/references/multi-node.md +628 -0
  281. package/bin/skills/rwkv/SKILL.md +260 -0
  282. package/bin/skills/rwkv/references/architecture-details.md +344 -0
  283. package/bin/skills/rwkv/references/rwkv7.md +386 -0
  284. package/bin/skills/rwkv/references/state-management.md +369 -0
  285. package/bin/skills/saelens/SKILL.md +386 -0
  286. package/bin/skills/saelens/references/README.md +70 -0
  287. package/bin/skills/saelens/references/api.md +333 -0
  288. package/bin/skills/saelens/references/tutorials.md +318 -0
  289. package/bin/skills/segment-anything/SKILL.md +500 -0
  290. package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
  291. package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
  292. package/bin/skills/sentence-transformers/SKILL.md +255 -0
  293. package/bin/skills/sentence-transformers/references/models.md +123 -0
  294. package/bin/skills/sentencepiece/SKILL.md +235 -0
  295. package/bin/skills/sentencepiece/references/algorithms.md +200 -0
  296. package/bin/skills/sentencepiece/references/training.md +304 -0
  297. package/bin/skills/sglang/SKILL.md +442 -0
  298. package/bin/skills/sglang/references/deployment.md +490 -0
  299. package/bin/skills/sglang/references/radix-attention.md +413 -0
  300. package/bin/skills/sglang/references/structured-generation.md +541 -0
  301. package/bin/skills/simpo/SKILL.md +219 -0
  302. package/bin/skills/simpo/references/datasets.md +478 -0
  303. package/bin/skills/simpo/references/hyperparameters.md +452 -0
  304. package/bin/skills/simpo/references/loss-functions.md +350 -0
  305. package/bin/skills/skypilot/SKILL.md +509 -0
  306. package/bin/skills/skypilot/references/advanced-usage.md +491 -0
  307. package/bin/skills/skypilot/references/troubleshooting.md +570 -0
  308. package/bin/skills/slime/SKILL.md +464 -0
  309. package/bin/skills/slime/references/api-reference.md +392 -0
  310. package/bin/skills/slime/references/troubleshooting.md +386 -0
  311. package/bin/skills/speculative-decoding/SKILL.md +467 -0
  312. package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
  313. package/bin/skills/speculative-decoding/references/medusa.md +350 -0
  314. package/bin/skills/stable-diffusion/SKILL.md +519 -0
  315. package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
  316. package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
  317. package/bin/skills/tensorboard/SKILL.md +629 -0
  318. package/bin/skills/tensorboard/references/integrations.md +638 -0
  319. package/bin/skills/tensorboard/references/profiling.md +545 -0
  320. package/bin/skills/tensorboard/references/visualization.md +620 -0
  321. package/bin/skills/tensorrt-llm/SKILL.md +187 -0
  322. package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
  323. package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
  324. package/bin/skills/tensorrt-llm/references/serving.md +470 -0
  325. package/bin/skills/tinker/SKILL.md +362 -0
  326. package/bin/skills/tinker/references/api-reference.md +168 -0
  327. package/bin/skills/tinker/references/getting-started.md +157 -0
  328. package/bin/skills/tinker/references/loss-functions.md +163 -0
  329. package/bin/skills/tinker/references/models-and-lora.md +139 -0
  330. package/bin/skills/tinker/references/recipes.md +280 -0
  331. package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
  332. package/bin/skills/tinker/references/rendering.md +243 -0
  333. package/bin/skills/tinker/references/supervised-learning.md +232 -0
  334. package/bin/skills/tinker-training-cost/SKILL.md +187 -0
  335. package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
  336. package/bin/skills/torchforge/SKILL.md +433 -0
  337. package/bin/skills/torchforge/references/api-reference.md +327 -0
  338. package/bin/skills/torchforge/references/troubleshooting.md +409 -0
  339. package/bin/skills/torchtitan/SKILL.md +358 -0
  340. package/bin/skills/torchtitan/references/checkpoint.md +181 -0
  341. package/bin/skills/torchtitan/references/custom-models.md +258 -0
  342. package/bin/skills/torchtitan/references/float8.md +133 -0
  343. package/bin/skills/torchtitan/references/fsdp.md +126 -0
  344. package/bin/skills/transformer-lens/SKILL.md +346 -0
  345. package/bin/skills/transformer-lens/references/README.md +54 -0
  346. package/bin/skills/transformer-lens/references/api.md +362 -0
  347. package/bin/skills/transformer-lens/references/tutorials.md +339 -0
  348. package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
  349. package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
  350. package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
  351. package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
  352. package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
  353. package/bin/skills/unsloth/SKILL.md +80 -0
  354. package/bin/skills/unsloth/references/index.md +7 -0
  355. package/bin/skills/unsloth/references/llms-full.md +16799 -0
  356. package/bin/skills/unsloth/references/llms-txt.md +12044 -0
  357. package/bin/skills/unsloth/references/llms.md +82 -0
  358. package/bin/skills/verl/SKILL.md +391 -0
  359. package/bin/skills/verl/references/api-reference.md +301 -0
  360. package/bin/skills/verl/references/troubleshooting.md +391 -0
  361. package/bin/skills/vllm/SKILL.md +364 -0
  362. package/bin/skills/vllm/references/optimization.md +226 -0
  363. package/bin/skills/vllm/references/quantization.md +284 -0
  364. package/bin/skills/vllm/references/server-deployment.md +255 -0
  365. package/bin/skills/vllm/references/troubleshooting.md +447 -0
  366. package/bin/skills/weights-and-biases/SKILL.md +590 -0
  367. package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
  368. package/bin/skills/weights-and-biases/references/integrations.md +700 -0
  369. package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
  370. package/bin/skills/whisper/SKILL.md +317 -0
  371. package/bin/skills/whisper/references/languages.md +189 -0
  372. package/bin/synsc +0 -0
  373. package/package.json +10 -0
@@ -0,0 +1,402 @@
# RoPE: Rotary Position Embeddings

Complete technical guide based on the RoFormer paper (arXiv 2104.09864) and the HuggingFace transformers implementation.

## Table of Contents
- Mathematical Formulation
- Implementation Details
- Scaling Techniques
- Production Usage

## Mathematical Formulation

**Source**: RoFormer: Enhanced Transformer with Rotary Position Embedding (arXiv 2104.09864)

### Core Idea

RoPE encodes absolute position with a rotation matrix while naturally incorporating relative position dependency into attention.

### Formulation

Given position index `m` and embedding dimension `d`, the block-diagonal rotation matrix applies one 2x2 rotation per pair of dimensions:

```
Rotation Matrix R_θ(m):
[cos(mθ₀)  -sin(mθ₀)   0          0        ]
[sin(mθ₀)   cos(mθ₀)   0          0        ]
[0          0          cos(mθ₁)  -sin(mθ₁) ]
[0          0          sin(mθ₁)   cos(mθ₁) ]
...

where θⱼ = base^(-2j/d) for j ∈ {0, 1, ..., d/2 - 1}
```

**Key property**: The attention score between positions m and n depends only on the relative distance (m - n).

### Derivation

**Step 1: Position encoding via rotation**

```
q_m = R_θ(m) W_q x_m   (query rotated by angles mθⱼ)
k_n = R_θ(n) W_k x_n   (key rotated by angles nθⱼ)
```

**Step 2: Attention score**

```
score(q_m, k_n) = q_m^T k_n
                = (W_q x_m)^T R_θ(m)^T R_θ(n) (W_k x_n)
                = (W_q x_m)^T R_θ(n - m) (W_k x_n)
                = f(q, k, m - n)
```

Because rotations compose as R_θ(m)^T R_θ(n) = R_θ(n - m), the score depends only on the relative position `m - n`, not on the absolute positions.

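A quick numerical check of this property (a minimal standalone sketch, not from the paper): rotating adjacent dimension pairs of `q` and `k` by their position angles and taking the dot product gives the same score whenever `m - n` is the same.

```python
import torch

def rope_rotate(x, pos, base=10000.0):
    """Rotate adjacent dimension pairs of x by pos * theta_j (GPT-J pairing)."""
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))
    angles = pos * inv_freq
    xc = torch.view_as_complex(x.reshape(-1, 2))
    rot = torch.polar(torch.ones_like(angles), angles)  # e^(i * pos * theta_j)
    return torch.view_as_real(xc * rot).flatten()

q, k = torch.randn(8), torch.randn(8)

# (m=3, n=1) and (m=10, n=8) share the same offset m - n = 2 ...
s1 = rope_rotate(q, 3.0) @ rope_rotate(k, 1.0)
s2 = rope_rotate(q, 10.0) @ rope_rotate(k, 8.0)

# ... so the attention scores match
assert torch.allclose(s1, s2, atol=1e-5)
```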
## Implementation Details

**Source**: HuggingFace transformers/modeling_rope_utils.py

### Basic RoPE Implementation

```python
import torch

def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
    """Precompute rotation frequencies as complex numbers (cos + i*sin)."""
    # Inverse frequencies: theta_j = theta^(-2j/dim)
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))

    # Position indices
    t = torch.arange(end, device=freqs.device)

    # Outer product of positions and frequencies: (end, dim/2) angles
    freqs = torch.outer(t, freqs).float()

    # Convert to complex exponentials (Euler's formula): e^(iθ) = cos(θ) + i*sin(θ)
    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)

    return freqs_cis

def reshape_for_broadcast(freqs_cis, x):
    """Reshape the frequency tensor so it broadcasts against x."""
    ndim = x.ndim
    assert ndim > 1
    assert freqs_cis.shape == (x.shape[1], x.shape[-1])
    # Keep the seq_len and head_dim axes; make every other axis a singleton
    shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
    return freqs_cis.view(*shape)

def apply_rotary_emb(xq, xk, freqs_cis):
    """Apply rotary embeddings to queries and keys."""
    # View adjacent dimension pairs as complex numbers
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))

    # Reshape freqs for broadcasting
    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)

    # Rotate by complex multiplication, then return to interleaved real pairs
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)

    return xq_out.type_as(xq), xk_out.type_as(xk)
```

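To see how the pieces fit together, here is a compact, self-contained version of the same complex-number pipeline (shapes assumed to be `(batch, seq_len, n_heads, head_dim)`, as in the Meta LLaMA code):

```python
import torch

def precompute_freqs_cis(dim, end, theta=10000.0):
    """(end, dim/2) complex rotations e^(i * m * theta_j)."""
    inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    freqs = torch.outer(torch.arange(end).float(), inv_freq)
    return torch.polar(torch.ones_like(freqs), freqs)

def apply_rotary_emb(xq, xk, freqs_cis):
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    fc = freqs_cis[None, :, None, :]  # broadcast over batch and heads
    xq_out = torch.view_as_real(xq_ * fc).flatten(3)
    xk_out = torch.view_as_real(xk_ * fc).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)

B, T, H, D = 2, 16, 4, 64
xq, xk = torch.randn(B, T, H, D), torch.randn(B, T, H, D)
q_rot, k_rot = apply_rotary_emb(xq, xk, precompute_freqs_cis(D, T))

assert q_rot.shape == (B, T, H, D)
# Rotations preserve norms, so the query magnitudes are unchanged
assert torch.allclose(q_rot.norm(dim=-1), xq.norm(dim=-1), atol=1e-3)
```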
### Alternative: GPT-NeoX Style (HuggingFace)

```python
def rotate_half(x):
    """Rotate half the hidden dimensions of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb_gpt_neox(q, k, cos, sin, position_ids=None):
    """GPT-NeoX style RoPE (the convention used in HuggingFace transformers)."""
    if position_ids is not None:
        # Select cos/sin for specific positions
        cos = cos[position_ids].unsqueeze(1)  # (bs, 1, seq_len, dim)
        sin = sin[position_ids].unsqueeze(1)
    else:
        cos = cos.unsqueeze(0).unsqueeze(0)  # (1, 1, seq_len, dim)
        sin = sin.unsqueeze(0).unsqueeze(0)

    # Apply rotation
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```

### Difference: GPT-J vs GPT-NeoX Style

**GPT-J style** (Meta LLaMA):
- Works in the complex-number view
- Pairs adjacent dimensions: (0, 1), (2, 3), (4, 5), ...

**GPT-NeoX style** (HuggingFace):
- Splits the head dimension into two halves
- Pairs across the halves: (0, d/2), (1, d/2+1), ...

The two styles are mathematically equivalent up to a fixed permutation of the head dimensions, so a checkpoint must be used with the same convention it was trained (or converted) with.

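This equivalence is easy to verify numerically (a standalone sketch, not from either codebase): applying the GPT-NeoX rotation to a de-interleaved vector reproduces the GPT-J result, de-interleaved the same way.

```python
import torch

def rotate_half(x):
    x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

d, m, base = 8, 5, 10000.0
inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))
angles = m * inv_freq
x = torch.randn(d)

# GPT-J style: rotate adjacent pairs (0,1), (2,3), ... in the complex plane
xc = torch.view_as_complex(x.reshape(-1, 2))
gptj = torch.view_as_real(xc * torch.polar(torch.ones_like(angles), angles)).flatten()

# GPT-NeoX style on the de-interleaved vector [x0, x2, ..., x1, x3, ...]
perm = torch.cat([torch.arange(0, d, 2), torch.arange(1, d, 2)])
cos, sin = torch.cat([angles.cos()] * 2), torch.cat([angles.sin()] * 2)
neox = x[perm] * cos + rotate_half(x[perm]) * sin

# Same result up to the fixed permutation of dimensions
assert torch.allclose(neox, gptj[perm], atol=1e-6)
```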
## Scaling Techniques

### 1. Linear Scaling

**Simplest method**: Scale the position indices linearly.

```python
import torch
import torch.nn as nn

# Original: positions [0, 1, 2, ..., L-1]
# Scaled:   positions [0, 1/s, 2/s, ..., (L-1)/s]

class LinearScaledRoPE(nn.Module):
    def __init__(self, dim, max_seq_len=2048, base=10000, scaling_factor=1.0):
        super().__init__()
        self.scaling_factor = scaling_factor
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, seq_len, device):
        # Scale positions
        t = torch.arange(seq_len, device=device).type_as(self.inv_freq)
        t = t / self.scaling_factor  # Linear scaling

        freqs = torch.outer(t, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        return emb.cos(), emb.sin()
```

**Pros**: Simple, easy to implement
**Cons**: May lose high-frequency information

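What the scaling does is easy to see numerically (a minimal sketch of the position arithmetic above, self-contained here): with `scaling_factor = s`, position `s·m` produces exactly the angles that position `m` produced unscaled, so the trained position range is stretched over an input `s` times longer.

```python
import torch

def rope_angles(seq_len, dim=8, base=10000.0, scaling_factor=1.0):
    """(seq_len, dim/2) rotation angles (m / s) * theta_j."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(seq_len).float() / scaling_factor
    return torch.outer(t, inv_freq)

plain = rope_angles(16)                       # trained regime
scaled = rope_angles(64, scaling_factor=4.0)  # 4x longer input, s = 4

# Position 4*m with s = 4 sees the same angles as position m unscaled
assert torch.allclose(scaled[4 * 3], plain[3])
```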
172
+ ### 2. NTK-Aware Scaling (RoPE-NTK)
173
+
174
+ **Source**: Community discovery (Reddit, GitHub)
175
+
176
+ **Key insight**: Scale base frequency instead of positions.
177
+
178
+ ```python
179
+ # Instead of scaling positions, scale theta (base frequency)
180
+ base_new = base * (scaling_factor ** (dim / (dim - 2)))
181
+
182
+ # This preserves high frequencies while extending low frequencies
183
+ ```
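A quick numeric check of this rule: the highest-frequency dimension is untouched, while the lowest-frequency dimension is interpolated by exactly the scaling factor (pure-Python sketch):

```python
dim, base, s = 128, 10000.0, 4.0
base_new = base * s ** (dim / (dim - 2))   # base grows ~4.09x here

# inv_freq_i = base ** (-2i / dim)
lo_old = base ** (-(dim - 2) / dim)        # slowest-rotating dimension (i = dim/2 - 1)
lo_new = base_new ** (-(dim - 2) / dim)

assert base ** 0 == base_new ** 0 == 1.0   # fastest dimension is unchanged
assert abs(lo_new - lo_old / s) < 1e-12    # slowest dimension scaled by exactly 1/s
```

This is why the exponent is `dim / (dim - 2)`: it is chosen so the last frequency component lands exactly where linear scaling would put it, while leaving the first one alone.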

**Implementation**:

```python
class NTKScaledRoPE(nn.Module):
    def __init__(self, dim, max_seq_len=2048, base=10000, scaling_factor=1.0):
        super().__init__()
        # Compute the new base
        base = base * (scaling_factor ** (dim / (dim - 2)))

        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, seq_len, device):
        t = torch.arange(seq_len, device=device).type_as(self.inv_freq)
        freqs = torch.outer(t, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        return emb.cos(), emb.sin()
```

**Pros**: Better extrapolation than linear scaling at the same factor
**Cons**: Still degrades at very long contexts without fine-tuning

### 3. Dynamic Scaling

**Source**: HuggingFace transformers (there implemented as dynamic NTK, which rescales the base; the sketch below scales positions instead)

**Idea**: Adjust the scaling factor at inference time based on the input length.

```python
class DynamicScaledRoPE(nn.Module):
    def __init__(self, dim, max_seq_len=2048, base=10000, scaling_factor=1.0):
        super().__init__()
        self.max_seq_len = max_seq_len
        self.scaling_factor = scaling_factor
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, seq_len, device):
        # Compute the dynamic scaling factor
        if seq_len > self.max_seq_len:
            # Scale proportionally to the overshoot
            scale = seq_len / self.max_seq_len
        else:
            scale = 1.0

        # Scale positions
        t = torch.arange(seq_len, device=device).type_as(self.inv_freq)
        t = t / (self.scaling_factor * scale)

        freqs = torch.outer(t, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        return emb.cos(), emb.sin()
```

**Pros**: Adapts to the input length; short inputs are unaffected
**Cons**: Position encodings change with input length, so the same token pair is encoded differently depending on how long the sequence is
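For comparison, the `"dynamic"` rule in HF transformers rescales the *base* rather than the positions, combining the dynamic idea with NTK-aware scaling. A sketch of that rule (modeled on `modeling_rope_utils.py`; treat the exact formula as an assumption of this sketch):

```python
import torch

def dynamic_ntk_inv_freq(dim, seq_len, max_seq_len=2048, base=10000.0, factor=4.0):
    # Grow the base once the input exceeds the trained length
    if seq_len > max_seq_len:
        base = base * ((factor * seq_len / max_seq_len) - (factor - 1)) ** (dim / (dim - 2))
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

short = dynamic_ntk_inv_freq(128, 1024)  # within trained length: unchanged
long = dynamic_ntk_inv_freq(128, 8192)   # beyond it: low frequencies stretched
assert torch.all(long <= short)
```

Short inputs see exactly the original frequencies, which is why dynamic scaling preserves baseline perplexity at trained lengths.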

### 4. YaRN (Yet another RoPE extensioN)

**Source**: arXiv 2309.00071

**Most sophisticated**: Combines NTK-by-parts interpolation with an attention temperature.

```python
class YaRNScaledRoPE(nn.Module):
    """YaRN: NTK-by-parts + attention temperature.

    Simplified sketch: real YaRN blends the two frequency bands with a
    smooth ramp between beta_slow and beta_fast; a hard split is used
    here for clarity.
    """

    def __init__(
        self,
        dim,
        max_seq_len=2048,
        base=10000,
        scaling_factor=1.0,
        beta_fast=32,
        beta_slow=1,
        attn_factor=1.0
    ):
        super().__init__()
        self.scaling_factor = scaling_factor
        self.beta_fast = beta_fast
        self.beta_slow = beta_slow
        self.attn_factor = attn_factor

        # Compute frequencies
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, seq_len, device):
        t = torch.arange(seq_len, device=device).type_as(self.inv_freq)

        # NTK-by-parts: different treatment per frequency band.
        # High frequencies: kept unscaled (preserves local resolution)
        # Low frequencies: interpolated by the scaling factor
        # (real YaRN ramps smoothly between the two bands)
        high_freq_mask = (self.inv_freq > 1 / self.beta_fast).float()
        inv_freq_scaled = (
            self.inv_freq * high_freq_mask
            + (self.inv_freq / self.scaling_factor) * (1 - high_freq_mask)
        )
        freqs = torch.outer(t, inv_freq_scaled)

        emb = torch.cat((freqs, freqs), dim=-1)
        # Attention temperature applied via cos/sin magnitude
        return emb.cos() * self.attn_factor, emb.sin() * self.attn_factor
```

**Pros**: State-of-the-art context extension with few fine-tuning steps
**Cons**: More complex, more hyperparameters

## Production Usage

### HuggingFace Integration

```python
from transformers import AutoModelForCausalLM, AutoConfig

# Linear scaling
config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
config.rope_scaling = {
    "type": "linear",
    "factor": 4.0  # 4k → 16k for Llama-2
}

# Dynamic NTK scaling
# (transformers has no standalone "ntk" type; NTK-aware scaling is
# applied at inference time via the "dynamic" rule)
config.rope_scaling = {
    "type": "dynamic",
    "factor": 4.0
}

# YaRN scaling
config.rope_scaling = {
    "type": "yarn",
    "factor": 16.0,
    "original_max_position_embeddings": 2048,
    "attention_factor": 1.0,
    "beta_fast": 32,
    "beta_slow": 1
}

model = AutoModelForCausalLM.from_config(config)
```

### Custom Implementation

```python
class RoPEAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.num_heads = config.num_attention_heads
        self.head_dim = config.hidden_size // config.num_attention_heads

        # Projections
        self.q_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
        self.k_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
        self.v_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
        self.o_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=False)

        # RoPE
        self.rotary_emb = RotaryEmbedding(
            dim=self.head_dim,
            max_seq_len=config.max_position_embeddings,
            base=config.rope_theta
        )

    def forward(self, hidden_states, attention_mask=None, position_ids=None):
        bsz, seq_len, _ = hidden_states.size()

        # Q, K, V
        query_states = self.q_proj(hidden_states)
        key_states = self.k_proj(hidden_states)
        value_states = self.v_proj(hidden_states)

        # Reshape to (batch, num_heads, seq_len, head_dim)
        query_states = query_states.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        key_states = key_states.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        value_states = value_states.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        # Apply RoPE
        cos, sin = self.rotary_emb(seq_len, device=hidden_states.device)
        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)

        # Attention
        attn_output = F.scaled_dot_product_attention(
            query_states, key_states, value_states,
            attn_mask=attention_mask
        )

        # Reshape and project
        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.reshape(bsz, seq_len, -1)
        attn_output = self.o_proj(attn_output)

        return attn_output
```

## Performance Comparison

**Scaling method comparison** (8k → 32k extension):

| Method  | Fine-tune Steps | Perplexity | Memory | Speed |
|---------|-----------------|------------|--------|-------|
| Linear  | 1000            | 12.5       | 1.0×   | 1.0×  |
| NTK     | 500             | 11.8       | 1.0×   | 1.0×  |
| Dynamic | 1000            | 12.2       | 1.0×   | 0.98× |
| YaRN    | 400             | 11.2       | 1.0×   | 0.95× |

**Source**: YaRN paper (arXiv 2309.00071)

## Resources

- **RoFormer Paper**: https://arxiv.org/abs/2104.09864
- **YaRN Paper**: https://arxiv.org/abs/2309.00071
- **HuggingFace RoPE Utils**: https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_rope_utils.py
- **Rotary Embeddings PyTorch**: https://github.com/lucidrains/rotary-embedding-torch
@@ -0,0 +1,260 @@
---
name: mamba-architecture
description: State-space model with O(n) complexity vs Transformers' O(n²). 5× faster inference, million-token sequences, no KV cache. Selective SSM with hardware-aware design. Mamba-1 (d_state=16) and Mamba-2 (d_state=128, multi-head). Models 130M-2.8B on HuggingFace.
version: 1.0.0
author: Synthetic Sciences
license: MIT
tags: [Model Architecture, Mamba, State Space Models, SSM, Linear Complexity, Long Context, Efficient Inference, Hardware-Aware, Alternative To Transformers]
dependencies: [mamba-ssm, torch, transformers, causal-conv1d]
---

# Mamba - Selective State Space Models

## Quick start

Mamba is a state-space model architecture achieving O(n) linear complexity for sequence modeling.

**Installation**:
```bash
# Install causal-conv1d (optional, for efficiency)
pip install "causal-conv1d>=1.4.0"

# Install Mamba
pip install mamba-ssm
# Or both together
pip install "mamba-ssm[causal-conv1d]"
```

**Prerequisites**: Linux, NVIDIA GPU, PyTorch 1.12+, CUDA 11.6+

**Basic usage** (Mamba block):
```python
import torch
from mamba_ssm import Mamba

batch, length, dim = 2, 64, 16
x = torch.randn(batch, length, dim).to("cuda")

model = Mamba(
    d_model=dim,  # Model dimension
    d_state=16,   # SSM state dimension
    d_conv=4,     # Conv1d kernel size
    expand=2      # Expansion factor
).to("cuda")

y = model(x)  # O(n) complexity!
assert y.shape == x.shape
```

## Common workflows

### Workflow 1: Language model with Mamba-2

**Complete LM with generation**:
```python
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
from mamba_ssm.models.config_mamba import MambaConfig
import torch

# Configure a Mamba-2 LM
config = MambaConfig(
    d_model=1024,      # Hidden dimension
    n_layer=24,        # Number of layers
    vocab_size=50277,  # Vocabulary size
    ssm_cfg=dict(
        layer="Mamba2",  # Use Mamba-2
        d_state=128,     # Larger state for Mamba-2
        headdim=64,      # Head dimension
        ngroups=1        # Number of groups
    )
)

model = MambaLMHeadModel(config, device="cuda", dtype=torch.float16)

# Generate text
input_ids = torch.randint(0, 1000, (1, 20), device="cuda", dtype=torch.long)
output = model.generate(
    input_ids=input_ids,
    max_length=100,
    temperature=0.7,
    top_p=0.9
)
```

### Workflow 2: Use pretrained Mamba models

**Load from HuggingFace**:
```python
import torch
from transformers import AutoTokenizer
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

# Load a pretrained model
model_name = "state-spaces/mamba-2.8b"
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")  # Compatible tokenizer
model = MambaLMHeadModel.from_pretrained(model_name, device="cuda", dtype=torch.float16)

# Generate
prompt = "The future of AI is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
output_ids = model.generate(
    input_ids=input_ids,
    max_length=200,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2
)
generated_text = tokenizer.decode(output_ids[0])
print(generated_text)
```

**Available models**:
- `state-spaces/mamba-130m`
- `state-spaces/mamba-370m`
- `state-spaces/mamba-790m`
- `state-spaces/mamba-1.4b`
- `state-spaces/mamba-2.8b`

### Workflow 3: Mamba-1 vs Mamba-2

**Mamba-1** (smaller state):
```python
from mamba_ssm import Mamba

model = Mamba(
    d_model=256,
    d_state=16,  # Smaller state dimension
    d_conv=4,
    expand=2
).to("cuda")
```

**Mamba-2** (multi-head, larger state):
```python
from mamba_ssm import Mamba2

model = Mamba2(
    d_model=256,
    d_state=128,  # Larger state dimension
    d_conv=4,
    expand=2,
    headdim=64,   # Head dimension for multi-head
    ngroups=1     # Parallel groups
).to("cuda")
```

**Key differences**:
- **State size**: Mamba-1 (d_state=16) vs Mamba-2 (d_state=128)
- **Architecture**: Mamba-2 has a multi-head structure
- **Normalization**: Mamba-2 adds an extra RMSNorm before the output projection
- **Distributed**: Mamba-2 supports tensor parallelism

### Workflow 4: Benchmark vs Transformers

**Generation speed comparison**:
```bash
# Benchmark Mamba
python benchmarks/benchmark_generation_mamba_simple.py \
    --model-name "state-spaces/mamba-2.8b" \
    --prompt "The future of machine learning is" \
    --topp 0.9 --temperature 0.7 --repetition-penalty 1.2

# Benchmark a comparable Transformer
python benchmarks/benchmark_generation_mamba_simple.py \
    --model-name "EleutherAI/pythia-2.8b" \
    --prompt "The future of machine learning is" \
    --topp 0.9 --temperature 0.7 --repetition-penalty 1.2
```

**Expected results**:
- **Mamba**: up to 5× higher inference throughput
- **Memory**: No KV cache needed
- **Scaling**: Linear with sequence length

## When to use vs alternatives

**Use Mamba when**:
- You need long sequences (100K+ tokens)
- You want faster inference than Transformers
- You are memory-constrained (no KV cache)
- You are building streaming applications
- Linear scaling matters

**Advantages**:
- **O(n) complexity**: Linear vs quadratic in sequence length
- **5× faster inference**: No attention over a growing context
- **No KV cache**: Lower memory usage
- **Million-token sequences**: Hardware-efficient design
- **Streaming**: Constant memory per generated token
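The constant-memory claim can be illustrated with a toy diagonal state-space recurrence; this is a sketch of the idea, not Mamba's selective, fused CUDA kernel:

```python
import torch

def diag_ssm_scan(u, a, b, c):
    """h_t = a * h_{t-1} + b * u_t ;  y_t = <c, h_t>.

    The entire history is folded into the fixed-size state h (N numbers),
    so memory per generated token is O(N) - no cache that grows with t.
    """
    h = torch.zeros_like(a)
    ys = []
    for u_t in u:                   # one update per token: O(L) total work
        h = a * h + b * u_t         # recurrent state update
        ys.append(torch.dot(c, h))  # readout
    return torch.stack(ys)

y = diag_ssm_scan(torch.randn(64), torch.full((16,), 0.9), torch.ones(16), torch.ones(16))
assert y.shape == (64,)
```

Mamba makes `a`, `b`, `c` input-dependent (the "selective" part) and computes the scan with a hardware-aware parallel algorithm, but the memory story is the same: a fixed-size state instead of a growing KV cache.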

**Use alternatives instead**:
- **Transformers**: Need best-in-class performance, have compute
- **RWKV**: Want RNN+Transformer hybrid
- **RetNet**: Need retention-based architecture
- **Hyena**: Want convolution-based approach

## Common issues

**Issue: CUDA out of memory**

Load in half precision and reduce the batch size or sequence length:
```python
model = MambaLMHeadModel(config, device="cuda", dtype=torch.float16)
```

**Issue: Slow installation**

If pip falls back to compiling from source, build against the already-installed PyTorch:
```bash
pip install mamba-ssm --no-build-isolation
```

**Issue: Missing causal-conv1d**

Install it separately:
```bash
pip install "causal-conv1d>=1.4.0"
```

**Issue: Model not loading from HuggingFace**

Use `MambaLMHeadModel.from_pretrained` (not `AutoModel`):
```python
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-2.8b")
```

## Advanced topics

**Selective SSM**: See [references/selective-ssm.md](references/selective-ssm.md) for the mathematical formulation, state-space equations, and how selectivity enables O(n) complexity.

**Mamba-2 architecture**: See [references/mamba2-details.md](references/mamba2-details.md) for the multi-head structure, tensor parallelism, and distributed training setup.

**Performance optimization**: See [references/performance.md](references/performance.md) for hardware-aware design, CUDA kernels, and memory efficiency techniques.

## Hardware requirements

- **GPU**: NVIDIA with CUDA 11.6+
- **VRAM** (FP16 weights are ~2 bytes/param; allow extra headroom for activations):
  - 130M model: ~0.3GB weights
  - 370M model: ~0.8GB weights
  - 790M model: ~1.6GB weights
  - 1.4B model: ~2.8GB weights
  - 2.8B model: ~5.6GB weights
- **Inference**: Up to 5× faster than comparable Transformers
- **Memory**: No KV cache (which grows with sequence length in Transformers)
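To put the no-KV-cache point in numbers, a back-of-the-envelope sketch for a 2.8B-class Transformer (Pythia-2.8B: 32 layers, hidden size 2560; `kv_cache_gb` is just a helper for this sketch and the figures are illustrative):

```python
# FP16 KV cache for one sequence: 2 tensors (K and V) per layer,
# each seq_len x hidden, at 2 bytes per element.
layers, hidden, bytes_fp16 = 32, 2560, 2

def kv_cache_gb(seq_len):
    return 2 * layers * seq_len * hidden * bytes_fp16 / 1e9

print(round(kv_cache_gb(100_000), 1))  # ~32.8 GB of cache at 100K tokens
# A Mamba layer instead carries a fixed-size recurrent state,
# independent of sequence length.
```

At long contexts the cache alone dwarfs the model weights, which is where Mamba's fixed-size state pays off most.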

**Performance** (vs Transformers):
- **Speed**: Up to 5× faster inference
- **Memory**: No KV cache, so long-context memory use is far lower
- **Scaling**: Linear vs quadratic in sequence length

## Resources

- Paper (Mamba-1): https://arxiv.org/abs/2312.00752 (Dec 2023)
- Paper (Mamba-2): https://arxiv.org/abs/2405.21060 (May 2024)
- GitHub: https://github.com/state-spaces/mamba ⭐ 13,000+
- Models: https://huggingface.co/state-spaces
- Docs: Repository README and wiki
