@synsci/cli-darwin-x64 1.1.49

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (373)
  1. package/bin/skills/accelerate/SKILL.md +332 -0
  2. package/bin/skills/accelerate/references/custom-plugins.md +453 -0
  3. package/bin/skills/accelerate/references/megatron-integration.md +489 -0
  4. package/bin/skills/accelerate/references/performance.md +525 -0
  5. package/bin/skills/audiocraft/SKILL.md +564 -0
  6. package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
  7. package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
  8. package/bin/skills/autogpt/SKILL.md +403 -0
  9. package/bin/skills/autogpt/references/advanced-usage.md +535 -0
  10. package/bin/skills/autogpt/references/troubleshooting.md +420 -0
  11. package/bin/skills/awq/SKILL.md +310 -0
  12. package/bin/skills/awq/references/advanced-usage.md +324 -0
  13. package/bin/skills/awq/references/troubleshooting.md +344 -0
  14. package/bin/skills/axolotl/SKILL.md +158 -0
  15. package/bin/skills/axolotl/references/api.md +5548 -0
  16. package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
  17. package/bin/skills/axolotl/references/index.md +15 -0
  18. package/bin/skills/axolotl/references/other.md +3563 -0
  19. package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
  20. package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
  21. package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
  22. package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
  23. package/bin/skills/bitsandbytes/SKILL.md +411 -0
  24. package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
  25. package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
  26. package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
  27. package/bin/skills/blip-2/SKILL.md +564 -0
  28. package/bin/skills/blip-2/references/advanced-usage.md +680 -0
  29. package/bin/skills/blip-2/references/troubleshooting.md +526 -0
  30. package/bin/skills/chroma/SKILL.md +406 -0
  31. package/bin/skills/chroma/references/integration.md +38 -0
  32. package/bin/skills/clip/SKILL.md +253 -0
  33. package/bin/skills/clip/references/applications.md +207 -0
  34. package/bin/skills/constitutional-ai/SKILL.md +290 -0
  35. package/bin/skills/crewai/SKILL.md +498 -0
  36. package/bin/skills/crewai/references/flows.md +438 -0
  37. package/bin/skills/crewai/references/tools.md +429 -0
  38. package/bin/skills/crewai/references/troubleshooting.md +480 -0
  39. package/bin/skills/deepspeed/SKILL.md +141 -0
  40. package/bin/skills/deepspeed/references/08.md +17 -0
  41. package/bin/skills/deepspeed/references/09.md +173 -0
  42. package/bin/skills/deepspeed/references/2020.md +378 -0
  43. package/bin/skills/deepspeed/references/2023.md +279 -0
  44. package/bin/skills/deepspeed/references/assets.md +179 -0
  45. package/bin/skills/deepspeed/references/index.md +35 -0
  46. package/bin/skills/deepspeed/references/mii.md +118 -0
  47. package/bin/skills/deepspeed/references/other.md +1191 -0
  48. package/bin/skills/deepspeed/references/tutorials.md +6554 -0
  49. package/bin/skills/dspy/SKILL.md +590 -0
  50. package/bin/skills/dspy/references/examples.md +663 -0
  51. package/bin/skills/dspy/references/modules.md +475 -0
  52. package/bin/skills/dspy/references/optimizers.md +566 -0
  53. package/bin/skills/faiss/SKILL.md +221 -0
  54. package/bin/skills/faiss/references/index_types.md +280 -0
  55. package/bin/skills/flash-attention/SKILL.md +367 -0
  56. package/bin/skills/flash-attention/references/benchmarks.md +215 -0
  57. package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
  58. package/bin/skills/gguf/SKILL.md +427 -0
  59. package/bin/skills/gguf/references/advanced-usage.md +504 -0
  60. package/bin/skills/gguf/references/troubleshooting.md +442 -0
  61. package/bin/skills/gptq/SKILL.md +450 -0
  62. package/bin/skills/gptq/references/calibration.md +337 -0
  63. package/bin/skills/gptq/references/integration.md +129 -0
  64. package/bin/skills/gptq/references/troubleshooting.md +95 -0
  65. package/bin/skills/grpo-rl-training/README.md +97 -0
  66. package/bin/skills/grpo-rl-training/SKILL.md +572 -0
  67. package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
  68. package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
  69. package/bin/skills/guidance/SKILL.md +572 -0
  70. package/bin/skills/guidance/references/backends.md +554 -0
  71. package/bin/skills/guidance/references/constraints.md +674 -0
  72. package/bin/skills/guidance/references/examples.md +767 -0
  73. package/bin/skills/hqq/SKILL.md +445 -0
  74. package/bin/skills/hqq/references/advanced-usage.md +528 -0
  75. package/bin/skills/hqq/references/troubleshooting.md +503 -0
  76. package/bin/skills/hugging-face-cli/SKILL.md +191 -0
  77. package/bin/skills/hugging-face-cli/references/commands.md +954 -0
  78. package/bin/skills/hugging-face-cli/references/examples.md +374 -0
  79. package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
  80. package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
  81. package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
  82. package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
  83. package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
  84. package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
  85. package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
  86. package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
  87. package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
  88. package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
  89. package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
  90. package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
  91. package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
  92. package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
  93. package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
  94. package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
  95. package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
  96. package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
  97. package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
  98. package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
  99. package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
  100. package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
  101. package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
  102. package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
  103. package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
  104. package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
  105. package/bin/skills/hugging-face-jobs/index.html +216 -0
  106. package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
  107. package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
  108. package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
  109. package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
  110. package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
  111. package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
  112. package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
  113. package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
  114. package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
  115. package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
  116. package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
  117. package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
  118. package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
  119. package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
  120. package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
  121. package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
  122. package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
  123. package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
  124. package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
  125. package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
  126. package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
  127. package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
  128. package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
  129. package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
  130. package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
  131. package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
  132. package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
  133. package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
  134. package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
  135. package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
  136. package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
  137. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
  138. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
  139. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
  140. package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
  141. package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
  142. package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
  143. package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
  144. package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
  145. package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
  146. package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
  147. package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
  148. package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
  149. package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
  150. package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
  151. package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
  152. package/bin/skills/instructor/SKILL.md +740 -0
  153. package/bin/skills/instructor/references/examples.md +107 -0
  154. package/bin/skills/instructor/references/providers.md +70 -0
  155. package/bin/skills/instructor/references/validation.md +606 -0
  156. package/bin/skills/knowledge-distillation/SKILL.md +458 -0
  157. package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
  158. package/bin/skills/lambda-labs/SKILL.md +545 -0
  159. package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
  160. package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
  161. package/bin/skills/langchain/SKILL.md +480 -0
  162. package/bin/skills/langchain/references/agents.md +499 -0
  163. package/bin/skills/langchain/references/integration.md +562 -0
  164. package/bin/skills/langchain/references/rag.md +600 -0
  165. package/bin/skills/langsmith/SKILL.md +422 -0
  166. package/bin/skills/langsmith/references/advanced-usage.md +548 -0
  167. package/bin/skills/langsmith/references/troubleshooting.md +537 -0
  168. package/bin/skills/litgpt/SKILL.md +469 -0
  169. package/bin/skills/litgpt/references/custom-models.md +568 -0
  170. package/bin/skills/litgpt/references/distributed-training.md +451 -0
  171. package/bin/skills/litgpt/references/supported-models.md +336 -0
  172. package/bin/skills/litgpt/references/training-recipes.md +619 -0
  173. package/bin/skills/llama-cpp/SKILL.md +258 -0
  174. package/bin/skills/llama-cpp/references/optimization.md +89 -0
  175. package/bin/skills/llama-cpp/references/quantization.md +213 -0
  176. package/bin/skills/llama-cpp/references/server.md +125 -0
  177. package/bin/skills/llama-factory/SKILL.md +80 -0
  178. package/bin/skills/llama-factory/references/_images.md +23 -0
  179. package/bin/skills/llama-factory/references/advanced.md +1055 -0
  180. package/bin/skills/llama-factory/references/getting_started.md +349 -0
  181. package/bin/skills/llama-factory/references/index.md +19 -0
  182. package/bin/skills/llama-factory/references/other.md +31 -0
  183. package/bin/skills/llamaguard/SKILL.md +337 -0
  184. package/bin/skills/llamaindex/SKILL.md +569 -0
  185. package/bin/skills/llamaindex/references/agents.md +83 -0
  186. package/bin/skills/llamaindex/references/data_connectors.md +108 -0
  187. package/bin/skills/llamaindex/references/query_engines.md +406 -0
  188. package/bin/skills/llava/SKILL.md +304 -0
  189. package/bin/skills/llava/references/training.md +197 -0
  190. package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
  191. package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
  192. package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
  193. package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
  194. package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
  195. package/bin/skills/long-context/SKILL.md +536 -0
  196. package/bin/skills/long-context/references/extension_methods.md +468 -0
  197. package/bin/skills/long-context/references/fine_tuning.md +611 -0
  198. package/bin/skills/long-context/references/rope.md +402 -0
  199. package/bin/skills/mamba/SKILL.md +260 -0
  200. package/bin/skills/mamba/references/architecture-details.md +206 -0
  201. package/bin/skills/mamba/references/benchmarks.md +255 -0
  202. package/bin/skills/mamba/references/training-guide.md +388 -0
  203. package/bin/skills/megatron-core/SKILL.md +366 -0
  204. package/bin/skills/megatron-core/references/benchmarks.md +249 -0
  205. package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
  206. package/bin/skills/megatron-core/references/production-examples.md +473 -0
  207. package/bin/skills/megatron-core/references/training-recipes.md +547 -0
  208. package/bin/skills/miles/SKILL.md +315 -0
  209. package/bin/skills/miles/references/api-reference.md +141 -0
  210. package/bin/skills/miles/references/troubleshooting.md +352 -0
  211. package/bin/skills/mlflow/SKILL.md +704 -0
  212. package/bin/skills/mlflow/references/deployment.md +744 -0
  213. package/bin/skills/mlflow/references/model-registry.md +770 -0
  214. package/bin/skills/mlflow/references/tracking.md +680 -0
  215. package/bin/skills/modal/SKILL.md +341 -0
  216. package/bin/skills/modal/references/advanced-usage.md +503 -0
  217. package/bin/skills/modal/references/troubleshooting.md +494 -0
  218. package/bin/skills/model-merging/SKILL.md +539 -0
  219. package/bin/skills/model-merging/references/evaluation.md +462 -0
  220. package/bin/skills/model-merging/references/examples.md +428 -0
  221. package/bin/skills/model-merging/references/methods.md +352 -0
  222. package/bin/skills/model-pruning/SKILL.md +495 -0
  223. package/bin/skills/model-pruning/references/wanda.md +347 -0
  224. package/bin/skills/moe-training/SKILL.md +526 -0
  225. package/bin/skills/moe-training/references/architectures.md +432 -0
  226. package/bin/skills/moe-training/references/inference.md +348 -0
  227. package/bin/skills/moe-training/references/training.md +425 -0
  228. package/bin/skills/nanogpt/SKILL.md +290 -0
  229. package/bin/skills/nanogpt/references/architecture.md +382 -0
  230. package/bin/skills/nanogpt/references/data.md +476 -0
  231. package/bin/skills/nanogpt/references/training.md +564 -0
  232. package/bin/skills/nemo-curator/SKILL.md +383 -0
  233. package/bin/skills/nemo-curator/references/deduplication.md +87 -0
  234. package/bin/skills/nemo-curator/references/filtering.md +102 -0
  235. package/bin/skills/nemo-evaluator/SKILL.md +494 -0
  236. package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
  237. package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
  238. package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
  239. package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
  240. package/bin/skills/nemo-guardrails/SKILL.md +297 -0
  241. package/bin/skills/nnsight/SKILL.md +436 -0
  242. package/bin/skills/nnsight/references/README.md +78 -0
  243. package/bin/skills/nnsight/references/api.md +344 -0
  244. package/bin/skills/nnsight/references/tutorials.md +300 -0
  245. package/bin/skills/openrlhf/SKILL.md +249 -0
  246. package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
  247. package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
  248. package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
  249. package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
  250. package/bin/skills/outlines/SKILL.md +652 -0
  251. package/bin/skills/outlines/references/backends.md +615 -0
  252. package/bin/skills/outlines/references/examples.md +773 -0
  253. package/bin/skills/outlines/references/json_generation.md +652 -0
  254. package/bin/skills/peft/SKILL.md +431 -0
  255. package/bin/skills/peft/references/advanced-usage.md +514 -0
  256. package/bin/skills/peft/references/troubleshooting.md +480 -0
  257. package/bin/skills/phoenix/SKILL.md +475 -0
  258. package/bin/skills/phoenix/references/advanced-usage.md +619 -0
  259. package/bin/skills/phoenix/references/troubleshooting.md +538 -0
  260. package/bin/skills/pinecone/SKILL.md +358 -0
  261. package/bin/skills/pinecone/references/deployment.md +181 -0
  262. package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
  263. package/bin/skills/pytorch-fsdp/references/index.md +7 -0
  264. package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
  265. package/bin/skills/pytorch-lightning/SKILL.md +346 -0
  266. package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
  267. package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
  268. package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
  269. package/bin/skills/pyvene/SKILL.md +473 -0
  270. package/bin/skills/pyvene/references/README.md +73 -0
  271. package/bin/skills/pyvene/references/api.md +383 -0
  272. package/bin/skills/pyvene/references/tutorials.md +376 -0
  273. package/bin/skills/qdrant/SKILL.md +493 -0
  274. package/bin/skills/qdrant/references/advanced-usage.md +648 -0
  275. package/bin/skills/qdrant/references/troubleshooting.md +631 -0
  276. package/bin/skills/ray-data/SKILL.md +326 -0
  277. package/bin/skills/ray-data/references/integration.md +82 -0
  278. package/bin/skills/ray-data/references/transformations.md +83 -0
  279. package/bin/skills/ray-train/SKILL.md +406 -0
  280. package/bin/skills/ray-train/references/multi-node.md +628 -0
  281. package/bin/skills/rwkv/SKILL.md +260 -0
  282. package/bin/skills/rwkv/references/architecture-details.md +344 -0
  283. package/bin/skills/rwkv/references/rwkv7.md +386 -0
  284. package/bin/skills/rwkv/references/state-management.md +369 -0
  285. package/bin/skills/saelens/SKILL.md +386 -0
  286. package/bin/skills/saelens/references/README.md +70 -0
  287. package/bin/skills/saelens/references/api.md +333 -0
  288. package/bin/skills/saelens/references/tutorials.md +318 -0
  289. package/bin/skills/segment-anything/SKILL.md +500 -0
  290. package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
  291. package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
  292. package/bin/skills/sentence-transformers/SKILL.md +255 -0
  293. package/bin/skills/sentence-transformers/references/models.md +123 -0
  294. package/bin/skills/sentencepiece/SKILL.md +235 -0
  295. package/bin/skills/sentencepiece/references/algorithms.md +200 -0
  296. package/bin/skills/sentencepiece/references/training.md +304 -0
  297. package/bin/skills/sglang/SKILL.md +442 -0
  298. package/bin/skills/sglang/references/deployment.md +490 -0
  299. package/bin/skills/sglang/references/radix-attention.md +413 -0
  300. package/bin/skills/sglang/references/structured-generation.md +541 -0
  301. package/bin/skills/simpo/SKILL.md +219 -0
  302. package/bin/skills/simpo/references/datasets.md +478 -0
  303. package/bin/skills/simpo/references/hyperparameters.md +452 -0
  304. package/bin/skills/simpo/references/loss-functions.md +350 -0
  305. package/bin/skills/skypilot/SKILL.md +509 -0
  306. package/bin/skills/skypilot/references/advanced-usage.md +491 -0
  307. package/bin/skills/skypilot/references/troubleshooting.md +570 -0
  308. package/bin/skills/slime/SKILL.md +464 -0
  309. package/bin/skills/slime/references/api-reference.md +392 -0
  310. package/bin/skills/slime/references/troubleshooting.md +386 -0
  311. package/bin/skills/speculative-decoding/SKILL.md +467 -0
  312. package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
  313. package/bin/skills/speculative-decoding/references/medusa.md +350 -0
  314. package/bin/skills/stable-diffusion/SKILL.md +519 -0
  315. package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
  316. package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
  317. package/bin/skills/tensorboard/SKILL.md +629 -0
  318. package/bin/skills/tensorboard/references/integrations.md +638 -0
  319. package/bin/skills/tensorboard/references/profiling.md +545 -0
  320. package/bin/skills/tensorboard/references/visualization.md +620 -0
  321. package/bin/skills/tensorrt-llm/SKILL.md +187 -0
  322. package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
  323. package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
  324. package/bin/skills/tensorrt-llm/references/serving.md +470 -0
  325. package/bin/skills/tinker/SKILL.md +362 -0
  326. package/bin/skills/tinker/references/api-reference.md +168 -0
  327. package/bin/skills/tinker/references/getting-started.md +157 -0
  328. package/bin/skills/tinker/references/loss-functions.md +163 -0
  329. package/bin/skills/tinker/references/models-and-lora.md +139 -0
  330. package/bin/skills/tinker/references/recipes.md +280 -0
  331. package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
  332. package/bin/skills/tinker/references/rendering.md +243 -0
  333. package/bin/skills/tinker/references/supervised-learning.md +232 -0
  334. package/bin/skills/tinker-training-cost/SKILL.md +187 -0
  335. package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
  336. package/bin/skills/torchforge/SKILL.md +433 -0
  337. package/bin/skills/torchforge/references/api-reference.md +327 -0
  338. package/bin/skills/torchforge/references/troubleshooting.md +409 -0
  339. package/bin/skills/torchtitan/SKILL.md +358 -0
  340. package/bin/skills/torchtitan/references/checkpoint.md +181 -0
  341. package/bin/skills/torchtitan/references/custom-models.md +258 -0
  342. package/bin/skills/torchtitan/references/float8.md +133 -0
  343. package/bin/skills/torchtitan/references/fsdp.md +126 -0
  344. package/bin/skills/transformer-lens/SKILL.md +346 -0
  345. package/bin/skills/transformer-lens/references/README.md +54 -0
  346. package/bin/skills/transformer-lens/references/api.md +362 -0
  347. package/bin/skills/transformer-lens/references/tutorials.md +339 -0
  348. package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
  349. package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
  350. package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
  351. package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
  352. package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
  353. package/bin/skills/unsloth/SKILL.md +80 -0
  354. package/bin/skills/unsloth/references/index.md +7 -0
  355. package/bin/skills/unsloth/references/llms-full.md +16799 -0
  356. package/bin/skills/unsloth/references/llms-txt.md +12044 -0
  357. package/bin/skills/unsloth/references/llms.md +82 -0
  358. package/bin/skills/verl/SKILL.md +391 -0
  359. package/bin/skills/verl/references/api-reference.md +301 -0
  360. package/bin/skills/verl/references/troubleshooting.md +391 -0
  361. package/bin/skills/vllm/SKILL.md +364 -0
  362. package/bin/skills/vllm/references/optimization.md +226 -0
  363. package/bin/skills/vllm/references/quantization.md +284 -0
  364. package/bin/skills/vllm/references/server-deployment.md +255 -0
  365. package/bin/skills/vllm/references/troubleshooting.md +447 -0
  366. package/bin/skills/weights-and-biases/SKILL.md +590 -0
  367. package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
  368. package/bin/skills/weights-and-biases/references/integrations.md +700 -0
  369. package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
  370. package/bin/skills/whisper/SKILL.md +317 -0
  371. package/bin/skills/whisper/references/languages.md +189 -0
  372. package/bin/synsc +0 -0
  373. package/package.json +10 -0
@@ -0,0 +1,293 @@
# HuggingFace Transformers Integration

## Contents
- Enabling Flash Attention in Transformers
- Supported model architectures
- Configuration examples
- Performance comparisons
- Troubleshooting model-specific issues

## Enabling Flash Attention in Transformers

HuggingFace Transformers (v4.36+) supports Flash Attention 2 natively.

**Simple enable for any supported model**:
```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="auto"
)
```

**Install requirements**:
```bash
pip install "transformers>=4.36"
pip install flash-attn --no-build-isolation
```

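A quick post-install sanity check can be done with the standard library alone (a minimal sketch; `check_install` is an illustrative helper, not part of the transformers API):

```python
from importlib.metadata import version, PackageNotFoundError

def check_install(packages=("transformers", "flash-attn")):
    """Report installed versions; a missing flash-attn means FA2 loading will fail."""
    found = {}
    for pkg in packages:
        try:
            found[pkg] = version(pkg)
        except PackageNotFoundError:
            found[pkg] = None
    return found

for pkg, ver in check_install().items():
    print(f"{pkg}: {ver or 'NOT INSTALLED'}")
```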
## Supported model architectures

As of Transformers 4.40:

**Fully supported**:
- Llama / Llama 2 / Llama 3
- Mistral / Mixtral
- Falcon
- GPT-NeoX
- Phi / Phi-2 / Phi-3
- Qwen / Qwen2
- Gemma
- Starcoder2
- GPT-J
- OPT
- BLOOM

**Partially supported** (encoder-decoder):
- BART
- T5 / Flan-T5
- Whisper

**Check support**: the config's `_attn_implementation` attribute is private and only reflects what was requested, not whether the architecture supports Flash Attention 2. The reliable check is to attempt loading; Transformers raises a `ValueError` for unsupported architectures:
```python
import torch
from transformers import AutoModelForCausalLM

try:
    AutoModelForCausalLM.from_pretrained(
        "model-name",
        attn_implementation="flash_attention_2",
        torch_dtype=torch.float16
    )
    print("flash_attention_2 supported")
except ValueError as e:
    print(f"Not supported: {e}")
```

## Configuration examples

### Llama 2 with Flash Attention

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "meta-llama/Llama-2-7b-hf"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Generate
inputs = tokenizer("Once upon a time", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0]))
```

### Mistral with Flash Attention for long context

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,  # Better for long context
    device_map="auto",
    max_position_embeddings=32768  # Extended context
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Process long document (32K tokens)
long_text = "..." * 10000
inputs = tokenizer(long_text, return_tensors="pt", truncation=False).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512)
```

### Fine-tuning with Flash Attention

```python
import torch
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16
)

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    fp16=True,  # Must match model dtype
    optim="adamw_torch_fused"  # Fast fused optimizer
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset
)

trainer.train()
```

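For reference, a `TrainingArguments` setup like the one above yields an effective batch size of per-device batch × gradient-accumulation steps × number of GPUs (`effective_batch_size` is an illustrative helper, not a Trainer API):

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus=1):
    """Samples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_gpus

# batch 4, accumulation 4, single GPU
print(effective_batch_size(4, 4))  # 16
```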
### Multi-GPU training

```python
from transformers import AutoModelForCausalLM
import torch

# Model parallelism with Flash Attention
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="auto",  # Automatic multi-GPU placement
    max_memory={0: "20GB", 1: "20GB"}  # Limit per GPU
)
```

## Performance comparisons

### Memory usage (Llama 2 7B, batch=1)

| Sequence Length | Standard Attention | Flash Attention 2 | Reduction |
|-----------------|--------------------|-------------------|-----------|
| 512             | 1.2 GB             | 0.9 GB            | 25%       |
| 2048            | 3.8 GB             | 1.4 GB            | 63%       |
| 8192            | 14.2 GB            | 3.2 GB            | 77%       |
| 32768           | OOM (>24 GB)       | 10.8 GB           | Fits      |

### Speed (tokens/sec, A100 80GB)

| Model                  | Standard | Flash Attn 2 | Speedup |
|------------------------|----------|--------------|---------|
| Llama 2 7B (seq=2048)  | 42       | 118          | 2.8x    |
| Llama 2 13B (seq=4096) | 18       | 52           | 2.9x    |
| Llama 2 70B (seq=2048) | 4        | 11           | 2.75x   |

### Training throughput (samples/sec)

| Model       | Batch Size | Standard | Flash Attn 2 | Speedup |
|-------------|------------|----------|--------------|---------|
| Llama 2 7B  | 4          | 1.2      | 3.1          | 2.6x    |
| Llama 2 7B  | 8          | 2.1      | 5.8          | 2.8x    |
| Llama 2 13B | 2          | 0.6      | 1.7          | 2.8x    |

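The quadratic memory growth in the table follows from standard attention materializing a seq × seq score matrix per head, which Flash Attention never stores. A back-of-envelope estimate (illustrative only; real usage also includes weights, activations, and the KV cache):

```python
def score_matrix_bytes(seq_len, num_heads=32, batch=1, bytes_per_el=2):
    """fp16 attention score matrix, per layer: batch x heads x seq x seq."""
    return batch * num_heads * seq_len * seq_len * bytes_per_el

# Llama 2 7B has 32 attention heads
for seq in (512, 2048, 8192, 32768):
    gib = score_matrix_bytes(seq) / 1024**3
    print(f"seq={seq:>5}: {gib:8.2f} GiB per layer")
```

At seq=32768 the score matrices alone would need 64 GiB per layer in fp16, which is why standard attention goes OOM while Flash Attention's O(seq) memory still fits.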
## Troubleshooting model-specific issues

### Issue: Model doesn't support Flash Attention

Check the support list above. If the architecture is not supported, fall back to PyTorch's native SDPA, which is still faster than eager attention:

```python
model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    attn_implementation="sdpa",  # PyTorch-native scaled_dot_product_attention
    torch_dtype=torch.float16
)
```

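The fallback order this section implies (`flash_attention_2`, then `sdpa`, then `eager`) can be captured in a small helper; `choose_impl` and the preference tuple are illustrative names, not a transformers API:

```python
PREFERENCE = ("flash_attention_2", "sdpa", "eager")

def choose_impl(available):
    """Return the fastest attention implementation present in `available`."""
    for impl in PREFERENCE:
        if impl in available:
            return impl
    return "eager"  # eager always works

print(choose_impl({"sdpa", "eager"}))  # sdpa
```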
### Issue: CUDA out of memory during loading

Reduce memory footprint:

```python
model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "18GB"},  # Reserve headroom for the KV cache
    low_cpu_mem_usage=True
)
```

210
### Issue: Slower inference than expected

Ensure dtypes match. The model weights must be float16/bfloat16; tokenized `input_ids` are integers and stay as-is, so only floating-point inputs (e.g. `inputs_embeds` or `pixel_values`) need casting:

```python
model = model.to(torch.float16)
inputs = tokenizer(..., return_tensors="pt").to("cuda")
inputs = {k: v.to(torch.float16) if v.is_floating_point() else v
          for k, v in inputs.items()}
```

### Issue: Different outputs vs standard attention

Flash Attention is mathematically equivalent to standard attention, but its tiled computation changes the floating-point summation order, so small numerical differences (<1e-3) are normal:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("model-name")
model_standard = AutoModelForCausalLM.from_pretrained(
    "model-name", torch_dtype=torch.float16, device_map="cuda"
)
model_flash = AutoModelForCausalLM.from_pretrained(
    "model-name",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="cuda"
)

inputs = tokenizer("Test", return_tensors="pt").to("cuda")

with torch.no_grad():
    out_standard = model_standard(**inputs).logits
    out_flash = model_flash(**inputs).logits

diff = (out_standard - out_flash).abs().max()
print(f"Max diff: {diff:.6f}")  # Typically ~1e-4 to 1e-3
```

### Issue: ImportError during model loading

Install flash-attn:
```bash
pip install flash-attn --no-build-isolation
```

Or disable Flash Attention:
```python
model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    attn_implementation="eager",  # Standard PyTorch attention
    torch_dtype=torch.float16
)
```

## Best practices

1. **Always use float16/bfloat16** with Flash Attention (not float32)
2. **Set `device_map="auto"`** for automatic memory management
3. **Use bfloat16 for long context** (better numerical stability)
4. **Enable gradient checkpointing** when training large models
5. **Monitor memory** with `torch.cuda.max_memory_allocated()`

**Example with all best practices**:
```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,  # Better numerical range for training
    device_map="auto",
    low_cpu_mem_usage=True
)

# Enable gradient checkpointing to trade compute for memory
model.gradient_checkpointing_enable()

# Training with optimizations
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    bf16=True,  # Match model dtype
    optim="adamw_torch_fused",
    gradient_checkpointing=True
)
```

---
name: gguf-quantization
description: GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.
version: 1.0.0
author: Synthetic Sciences
license: MIT
tags: [GGUF, Quantization, llama.cpp, CPU Inference, Apple Silicon, Model Compression, Optimization]
dependencies: [llama-cpp-python>=0.2.0]
---

# GGUF - Quantization Format for llama.cpp

GGUF (GPT-Generated Unified Format) is the standard file format for llama.cpp, enabling efficient inference on CPUs, Apple Silicon, and GPUs with flexible quantization options.

## When to use GGUF

**Use GGUF when:**
- Deploying on consumer hardware (laptops, desktops)
- Running on Apple Silicon (M1/M2/M3) with Metal acceleration
- Running CPU-only inference without a GPU
- Needing flexible quantization (Q2_K to Q8_0)
- Using local AI tools (LM Studio, Ollama, text-generation-webui)

**Key advantages:**
- **Universal hardware**: CPU, Apple Silicon, NVIDIA, and AMD support
- **No Python runtime**: Pure C/C++ inference
- **Flexible quantization**: 2-8 bit with various methods (K-quants)
- **Ecosystem support**: LM Studio, Ollama, koboldcpp, and more
- **imatrix**: Importance matrix for better low-bit quality

**Use alternatives instead:**
- **AWQ/GPTQ**: Maximum accuracy with calibration on NVIDIA GPUs
- **HQQ**: Fast calibration-free quantization for HuggingFace
- **bitsandbytes**: Simple integration with the transformers library
- **TensorRT-LLM**: Production NVIDIA deployment with maximum speed

## Quick start

### Installation

```bash
# Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build (CPU)
make

# Build with CUDA (NVIDIA)
make GGML_CUDA=1

# Build with Metal (Apple Silicon)
make GGML_METAL=1

# Note: recent llama.cpp versions build with CMake instead of make,
# e.g. cmake -B build -DGGML_CUDA=ON && cmake --build build

# Install Python bindings (optional)
pip install llama-cpp-python
```

### Convert model to GGUF

```bash
# Install conversion requirements (from the llama.cpp repo)
pip install -r requirements.txt

# Convert HuggingFace model to GGUF (FP16)
python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf

# Or specify the output type explicitly
python convert_hf_to_gguf.py ./path/to/model \
    --outfile model-f16.gguf \
    --outtype f16
```

### Quantize model

```bash
# Basic quantization to Q4_K_M
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Quantize with importance matrix (better quality)
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

### Run inference

```bash
# CLI inference
./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"

# Interactive mode
./llama-cli -m model-q4_k_m.gguf --interactive

# With GPU offload (35 layers)
./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"
```

## Quantization types

### K-quant methods (recommended)

| Type | Bits | Size (7B) | Quality | Use Case |
|------|------|-----------|---------|----------|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Balance |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Good balance |
| Q4_K_M | 4.5 | ~4.1 GB | High | **Recommended default** |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality |

### Legacy methods

| Type | Description |
|------|-------------|
| Q4_0 | 4-bit, basic |
| Q4_1 | 4-bit with delta |
| Q5_0 | 5-bit, basic |
| Q5_1 | 5-bit with delta |

**Recommendation**: Use K-quant methods (Q4_K_M, Q5_K_M) for the best quality/size ratio.

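The sizes in the K-quant table track a simple rule of thumb: file size is roughly parameter count times bits-per-weight divided by 8, plus overhead for metadata and any tensors kept at higher precision. A quick estimator (the ~6.74e9 weight count for a "7B" Llama-style model is an assumption for illustration):

```python
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate quantized file size in GB (ignores metadata overhead)."""
    return n_params * bits_per_weight / 8 / 1e9

# Bits-per-weight values taken from the table above
for name, bpw in [("Q4_K_M", 4.5), ("Q6_K", 6.0), ("Q8_0", 8.0)]:
    print(f"{name}: ~{gguf_size_gb(6.74e9, bpw):.1f} GB")
```

Estimates land slightly under the table's figures because embedding and output tensors are often stored at higher precision than the quoted bits-per-weight.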

## Conversion workflows

### Workflow 1: HuggingFace to GGUF

```bash
# 1. Download model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b

# 2. Convert to GGUF (FP16)
python convert_hf_to_gguf.py ./llama-3.1-8b \
    --outfile llama-3.1-8b-f16.gguf \
    --outtype f16

# 3. Quantize
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M

# 4. Test
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50
```

### Workflow 2: With importance matrix (better quality)

```bash
# 1. Convert to GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf

# 2. Create calibration text (use a large, diverse corpus in practice)
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
Python is a popular programming language.
EOF

# 3. Generate importance matrix
./llama-imatrix -m model-f16.gguf \
    -f calibration.txt \
    --chunk 512 \
    -o model.imatrix \
    -ngl 35  # GPU layers if available

# 4. Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
    model-f16.gguf \
    model-q4_k_m.gguf \
    Q4_K_M
```

### Workflow 3: Multiple quantizations

```bash
#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"

# Generate imatrix once
./llama-imatrix -m "$MODEL" -f wiki.txt -o "$IMATRIX" -ngl 35

# Create multiple quantizations
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
    OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
    ./llama-quantize --imatrix "$IMATRIX" "$MODEL" "$OUTPUT" "$QUANT"
    echo "Created: $OUTPUT ($(du -h "$OUTPUT" | cut -f1))"
done
```

## Python usage

### llama-cpp-python

```python
from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,       # Context window
    n_gpu_layers=35,  # GPU offload (0 for CPU only)
    n_threads=8       # CPU threads
)

# Generate
output = llm(
    "What is machine learning?",
    max_tokens=256,
    temperature=0.7,
    stop=["</s>", "\n\n"]
)
print(output["choices"][0]["text"])
```

### Chat completion

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    chat_format="llama-3"  # Or "chatml", "mistral", etc.
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=256,
    temperature=0.7
)
print(response["choices"][0]["message"]["content"])
```

### Streaming

```python
from llama_cpp import Llama

llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)

# Stream tokens as they are generated
for chunk in llm(
    "Explain quantum computing:",
    max_tokens=256,
    stream=True
):
    print(chunk["choices"][0]["text"], end="", flush=True)
```

## Server mode

### Start OpenAI-compatible server

```bash
# Start server
./llama-server -m model-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 4096

# Or with Python bindings
python -m llama_cpp.server \
    --model model-q4_k_m.gguf \
    --n_gpu_layers 35 \
    --host 0.0.0.0 \
    --port 8080
```

### Use with OpenAI client

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"  # Local server ignores the key
)

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256
)
print(response.choices[0].message.content)
```

## Hardware optimization

### Apple Silicon (Metal)

```bash
# Build with Metal
make clean && make GGML_METAL=1

# Run with Metal acceleration
./llama-cli -m model.gguf -ngl 99 -p "Hello"
```

```python
# Python with Metal
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,  # Offload all layers
    n_threads=1       # Metal handles parallelism
)
```

### NVIDIA CUDA

```bash
# Build with CUDA
make clean && make GGML_CUDA=1

# Run with CUDA
./llama-cli -m model.gguf -ngl 35 -p "Hello"

# Specify GPU
CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35
```

### CPU optimization

```bash
# Build with AVX2/AVX512 (auto-detected)
make clean && make

# Run with optimal threads
./llama-cli -m model.gguf -t 8 -p "Hello"
```

```python
# Python CPU config
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=0,  # CPU only
    n_threads=8,     # Match physical cores
    n_batch=512      # Batch size for prompt processing
)
```

## Integration with tools

### Ollama

```bash
# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF

# Create Ollama model
ollama create mymodel -f Modelfile

# Run
ollama run mymodel "Hello!"
```

### LM Studio

1. Place the GGUF file in `~/.cache/lm-studio/models/`
2. Open LM Studio and select the model
3. Configure context length and GPU offload
4. Start inference

### text-generation-webui

```bash
# Place in models folder
cp model-q4_k_m.gguf text-generation-webui/models/

# Start with the llama.cpp loader
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
```

## Best practices

1. **Use K-quants**: Q4_K_M offers the best quality/size balance
2. **Use imatrix**: Always use an importance matrix for Q4 and below
3. **GPU offload**: Offload as many layers as VRAM allows
4. **Context length**: Start with 4096, increase if needed
5. **Thread count**: Match physical CPU cores, not logical
6. **Batch size**: Increase n_batch for faster prompt processing

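For the thread-count rule, note that `os.cpu_count()` reports logical (SMT) cores. A small stdlib heuristic that assumes 2-way SMT as a fallback (`psutil.cpu_count(logical=False)` gives the exact physical count if psutil is available):

```python
import os

def n_threads_estimate() -> int:
    """Estimate physical cores for llama.cpp's n_threads; assumes 2-way SMT."""
    logical = os.cpu_count() or 2
    return max(1, logical // 2)

# e.g. Llama(model_path="model.gguf", n_threads=n_threads_estimate())
print(n_threads_estimate())
```

On machines without SMT the halving undershoots, so treat this as a starting point and benchmark a few values of `n_threads`.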
391
+
392
+ ## Common issues
393
+
394
+ **Model loads slowly:**
395
+ ```bash
396
+ # Use mmap for faster loading
397
+ ./llama-cli -m model.gguf --mmap
398
+ ```
399
+
400
+ **Out of memory:**
401
+ ```bash
402
+ # Reduce GPU layers
403
+ ./llama-cli -m model.gguf -ngl 20 # Reduce from 35
404
+
405
+ # Or use smaller quantization
406
+ ./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
407
+ ```
408
+
409
+ **Poor quality at low bits:**
410
+ ```bash
411
+ # Always use imatrix for Q4 and below
412
+ ./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
413
+ ./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
414
+ ```
415
+
416
## References

- **[Advanced Usage](references/advanced-usage.md)** - Batching, speculative decoding, custom builds
- **[Troubleshooting](references/troubleshooting.md)** - Common issues, debugging, benchmarks

## Resources

- **Repository**: https://github.com/ggml-org/llama.cpp
- **Python Bindings**: https://github.com/abetlen/llama-cpp-python
- **Pre-quantized Models**: https://huggingface.co/TheBloke
- **GGUF Converter**: https://huggingface.co/spaces/ggml-org/gguf-my-repo
- **License**: MIT