@synsci/cli-darwin-x64 1.1.49
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/skills/accelerate/SKILL.md +332 -0
- package/bin/skills/accelerate/references/custom-plugins.md +453 -0
- package/bin/skills/accelerate/references/megatron-integration.md +489 -0
- package/bin/skills/accelerate/references/performance.md +525 -0
- package/bin/skills/audiocraft/SKILL.md +564 -0
- package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
- package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
- package/bin/skills/autogpt/SKILL.md +403 -0
- package/bin/skills/autogpt/references/advanced-usage.md +535 -0
- package/bin/skills/autogpt/references/troubleshooting.md +420 -0
- package/bin/skills/awq/SKILL.md +310 -0
- package/bin/skills/awq/references/advanced-usage.md +324 -0
- package/bin/skills/awq/references/troubleshooting.md +344 -0
- package/bin/skills/axolotl/SKILL.md +158 -0
- package/bin/skills/axolotl/references/api.md +5548 -0
- package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
- package/bin/skills/axolotl/references/index.md +15 -0
- package/bin/skills/axolotl/references/other.md +3563 -0
- package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
- package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
- package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
- package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
- package/bin/skills/bitsandbytes/SKILL.md +411 -0
- package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
- package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
- package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
- package/bin/skills/blip-2/SKILL.md +564 -0
- package/bin/skills/blip-2/references/advanced-usage.md +680 -0
- package/bin/skills/blip-2/references/troubleshooting.md +526 -0
- package/bin/skills/chroma/SKILL.md +406 -0
- package/bin/skills/chroma/references/integration.md +38 -0
- package/bin/skills/clip/SKILL.md +253 -0
- package/bin/skills/clip/references/applications.md +207 -0
- package/bin/skills/constitutional-ai/SKILL.md +290 -0
- package/bin/skills/crewai/SKILL.md +498 -0
- package/bin/skills/crewai/references/flows.md +438 -0
- package/bin/skills/crewai/references/tools.md +429 -0
- package/bin/skills/crewai/references/troubleshooting.md +480 -0
- package/bin/skills/deepspeed/SKILL.md +141 -0
- package/bin/skills/deepspeed/references/08.md +17 -0
- package/bin/skills/deepspeed/references/09.md +173 -0
- package/bin/skills/deepspeed/references/2020.md +378 -0
- package/bin/skills/deepspeed/references/2023.md +279 -0
- package/bin/skills/deepspeed/references/assets.md +179 -0
- package/bin/skills/deepspeed/references/index.md +35 -0
- package/bin/skills/deepspeed/references/mii.md +118 -0
- package/bin/skills/deepspeed/references/other.md +1191 -0
- package/bin/skills/deepspeed/references/tutorials.md +6554 -0
- package/bin/skills/dspy/SKILL.md +590 -0
- package/bin/skills/dspy/references/examples.md +663 -0
- package/bin/skills/dspy/references/modules.md +475 -0
- package/bin/skills/dspy/references/optimizers.md +566 -0
- package/bin/skills/faiss/SKILL.md +221 -0
- package/bin/skills/faiss/references/index_types.md +280 -0
- package/bin/skills/flash-attention/SKILL.md +367 -0
- package/bin/skills/flash-attention/references/benchmarks.md +215 -0
- package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
- package/bin/skills/gguf/SKILL.md +427 -0
- package/bin/skills/gguf/references/advanced-usage.md +504 -0
- package/bin/skills/gguf/references/troubleshooting.md +442 -0
- package/bin/skills/gptq/SKILL.md +450 -0
- package/bin/skills/gptq/references/calibration.md +337 -0
- package/bin/skills/gptq/references/integration.md +129 -0
- package/bin/skills/gptq/references/troubleshooting.md +95 -0
- package/bin/skills/grpo-rl-training/README.md +97 -0
- package/bin/skills/grpo-rl-training/SKILL.md +572 -0
- package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
- package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
- package/bin/skills/guidance/SKILL.md +572 -0
- package/bin/skills/guidance/references/backends.md +554 -0
- package/bin/skills/guidance/references/constraints.md +674 -0
- package/bin/skills/guidance/references/examples.md +767 -0
- package/bin/skills/hqq/SKILL.md +445 -0
- package/bin/skills/hqq/references/advanced-usage.md +528 -0
- package/bin/skills/hqq/references/troubleshooting.md +503 -0
- package/bin/skills/hugging-face-cli/SKILL.md +191 -0
- package/bin/skills/hugging-face-cli/references/commands.md +954 -0
- package/bin/skills/hugging-face-cli/references/examples.md +374 -0
- package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
- package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
- package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
- package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
- package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
- package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
- package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
- package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
- package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
- package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
- package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
- package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
- package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
- package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
- package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
- package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
- package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
- package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
- package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
- package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
- package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
- package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
- package/bin/skills/hugging-face-jobs/index.html +216 -0
- package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
- package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
- package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
- package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
- package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
- package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
- package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
- package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
- package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
- package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
- package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
- package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
- package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
- package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
- package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
- package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
- package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
- package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
- package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
- package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
- package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
- package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
- package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
- package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
- package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
- package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
- package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
- package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
- package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
- package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
- package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
- package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
- package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
- package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
- package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
- package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
- package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
- package/bin/skills/instructor/SKILL.md +740 -0
- package/bin/skills/instructor/references/examples.md +107 -0
- package/bin/skills/instructor/references/providers.md +70 -0
- package/bin/skills/instructor/references/validation.md +606 -0
- package/bin/skills/knowledge-distillation/SKILL.md +458 -0
- package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
- package/bin/skills/lambda-labs/SKILL.md +545 -0
- package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
- package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
- package/bin/skills/langchain/SKILL.md +480 -0
- package/bin/skills/langchain/references/agents.md +499 -0
- package/bin/skills/langchain/references/integration.md +562 -0
- package/bin/skills/langchain/references/rag.md +600 -0
- package/bin/skills/langsmith/SKILL.md +422 -0
- package/bin/skills/langsmith/references/advanced-usage.md +548 -0
- package/bin/skills/langsmith/references/troubleshooting.md +537 -0
- package/bin/skills/litgpt/SKILL.md +469 -0
- package/bin/skills/litgpt/references/custom-models.md +568 -0
- package/bin/skills/litgpt/references/distributed-training.md +451 -0
- package/bin/skills/litgpt/references/supported-models.md +336 -0
- package/bin/skills/litgpt/references/training-recipes.md +619 -0
- package/bin/skills/llama-cpp/SKILL.md +258 -0
- package/bin/skills/llama-cpp/references/optimization.md +89 -0
- package/bin/skills/llama-cpp/references/quantization.md +213 -0
- package/bin/skills/llama-cpp/references/server.md +125 -0
- package/bin/skills/llama-factory/SKILL.md +80 -0
- package/bin/skills/llama-factory/references/_images.md +23 -0
- package/bin/skills/llama-factory/references/advanced.md +1055 -0
- package/bin/skills/llama-factory/references/getting_started.md +349 -0
- package/bin/skills/llama-factory/references/index.md +19 -0
- package/bin/skills/llama-factory/references/other.md +31 -0
- package/bin/skills/llamaguard/SKILL.md +337 -0
- package/bin/skills/llamaindex/SKILL.md +569 -0
- package/bin/skills/llamaindex/references/agents.md +83 -0
- package/bin/skills/llamaindex/references/data_connectors.md +108 -0
- package/bin/skills/llamaindex/references/query_engines.md +406 -0
- package/bin/skills/llava/SKILL.md +304 -0
- package/bin/skills/llava/references/training.md +197 -0
- package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
- package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
- package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
- package/bin/skills/long-context/SKILL.md +536 -0
- package/bin/skills/long-context/references/extension_methods.md +468 -0
- package/bin/skills/long-context/references/fine_tuning.md +611 -0
- package/bin/skills/long-context/references/rope.md +402 -0
- package/bin/skills/mamba/SKILL.md +260 -0
- package/bin/skills/mamba/references/architecture-details.md +206 -0
- package/bin/skills/mamba/references/benchmarks.md +255 -0
- package/bin/skills/mamba/references/training-guide.md +388 -0
- package/bin/skills/megatron-core/SKILL.md +366 -0
- package/bin/skills/megatron-core/references/benchmarks.md +249 -0
- package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
- package/bin/skills/megatron-core/references/production-examples.md +473 -0
- package/bin/skills/megatron-core/references/training-recipes.md +547 -0
- package/bin/skills/miles/SKILL.md +315 -0
- package/bin/skills/miles/references/api-reference.md +141 -0
- package/bin/skills/miles/references/troubleshooting.md +352 -0
- package/bin/skills/mlflow/SKILL.md +704 -0
- package/bin/skills/mlflow/references/deployment.md +744 -0
- package/bin/skills/mlflow/references/model-registry.md +770 -0
- package/bin/skills/mlflow/references/tracking.md +680 -0
- package/bin/skills/modal/SKILL.md +341 -0
- package/bin/skills/modal/references/advanced-usage.md +503 -0
- package/bin/skills/modal/references/troubleshooting.md +494 -0
- package/bin/skills/model-merging/SKILL.md +539 -0
- package/bin/skills/model-merging/references/evaluation.md +462 -0
- package/bin/skills/model-merging/references/examples.md +428 -0
- package/bin/skills/model-merging/references/methods.md +352 -0
- package/bin/skills/model-pruning/SKILL.md +495 -0
- package/bin/skills/model-pruning/references/wanda.md +347 -0
- package/bin/skills/moe-training/SKILL.md +526 -0
- package/bin/skills/moe-training/references/architectures.md +432 -0
- package/bin/skills/moe-training/references/inference.md +348 -0
- package/bin/skills/moe-training/references/training.md +425 -0
- package/bin/skills/nanogpt/SKILL.md +290 -0
- package/bin/skills/nanogpt/references/architecture.md +382 -0
- package/bin/skills/nanogpt/references/data.md +476 -0
- package/bin/skills/nanogpt/references/training.md +564 -0
- package/bin/skills/nemo-curator/SKILL.md +383 -0
- package/bin/skills/nemo-curator/references/deduplication.md +87 -0
- package/bin/skills/nemo-curator/references/filtering.md +102 -0
- package/bin/skills/nemo-evaluator/SKILL.md +494 -0
- package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
- package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
- package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
- package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
- package/bin/skills/nemo-guardrails/SKILL.md +297 -0
- package/bin/skills/nnsight/SKILL.md +436 -0
- package/bin/skills/nnsight/references/README.md +78 -0
- package/bin/skills/nnsight/references/api.md +344 -0
- package/bin/skills/nnsight/references/tutorials.md +300 -0
- package/bin/skills/openrlhf/SKILL.md +249 -0
- package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
- package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
- package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
- package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
- package/bin/skills/outlines/SKILL.md +652 -0
- package/bin/skills/outlines/references/backends.md +615 -0
- package/bin/skills/outlines/references/examples.md +773 -0
- package/bin/skills/outlines/references/json_generation.md +652 -0
- package/bin/skills/peft/SKILL.md +431 -0
- package/bin/skills/peft/references/advanced-usage.md +514 -0
- package/bin/skills/peft/references/troubleshooting.md +480 -0
- package/bin/skills/phoenix/SKILL.md +475 -0
- package/bin/skills/phoenix/references/advanced-usage.md +619 -0
- package/bin/skills/phoenix/references/troubleshooting.md +538 -0
- package/bin/skills/pinecone/SKILL.md +358 -0
- package/bin/skills/pinecone/references/deployment.md +181 -0
- package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
- package/bin/skills/pytorch-fsdp/references/index.md +7 -0
- package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
- package/bin/skills/pytorch-lightning/SKILL.md +346 -0
- package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
- package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
- package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
- package/bin/skills/pyvene/SKILL.md +473 -0
- package/bin/skills/pyvene/references/README.md +73 -0
- package/bin/skills/pyvene/references/api.md +383 -0
- package/bin/skills/pyvene/references/tutorials.md +376 -0
- package/bin/skills/qdrant/SKILL.md +493 -0
- package/bin/skills/qdrant/references/advanced-usage.md +648 -0
- package/bin/skills/qdrant/references/troubleshooting.md +631 -0
- package/bin/skills/ray-data/SKILL.md +326 -0
- package/bin/skills/ray-data/references/integration.md +82 -0
- package/bin/skills/ray-data/references/transformations.md +83 -0
- package/bin/skills/ray-train/SKILL.md +406 -0
- package/bin/skills/ray-train/references/multi-node.md +628 -0
- package/bin/skills/rwkv/SKILL.md +260 -0
- package/bin/skills/rwkv/references/architecture-details.md +344 -0
- package/bin/skills/rwkv/references/rwkv7.md +386 -0
- package/bin/skills/rwkv/references/state-management.md +369 -0
- package/bin/skills/saelens/SKILL.md +386 -0
- package/bin/skills/saelens/references/README.md +70 -0
- package/bin/skills/saelens/references/api.md +333 -0
- package/bin/skills/saelens/references/tutorials.md +318 -0
- package/bin/skills/segment-anything/SKILL.md +500 -0
- package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
- package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
- package/bin/skills/sentence-transformers/SKILL.md +255 -0
- package/bin/skills/sentence-transformers/references/models.md +123 -0
- package/bin/skills/sentencepiece/SKILL.md +235 -0
- package/bin/skills/sentencepiece/references/algorithms.md +200 -0
- package/bin/skills/sentencepiece/references/training.md +304 -0
- package/bin/skills/sglang/SKILL.md +442 -0
- package/bin/skills/sglang/references/deployment.md +490 -0
- package/bin/skills/sglang/references/radix-attention.md +413 -0
- package/bin/skills/sglang/references/structured-generation.md +541 -0
- package/bin/skills/simpo/SKILL.md +219 -0
- package/bin/skills/simpo/references/datasets.md +478 -0
- package/bin/skills/simpo/references/hyperparameters.md +452 -0
- package/bin/skills/simpo/references/loss-functions.md +350 -0
- package/bin/skills/skypilot/SKILL.md +509 -0
- package/bin/skills/skypilot/references/advanced-usage.md +491 -0
- package/bin/skills/skypilot/references/troubleshooting.md +570 -0
- package/bin/skills/slime/SKILL.md +464 -0
- package/bin/skills/slime/references/api-reference.md +392 -0
- package/bin/skills/slime/references/troubleshooting.md +386 -0
- package/bin/skills/speculative-decoding/SKILL.md +467 -0
- package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
- package/bin/skills/speculative-decoding/references/medusa.md +350 -0
- package/bin/skills/stable-diffusion/SKILL.md +519 -0
- package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
- package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
- package/bin/skills/tensorboard/SKILL.md +629 -0
- package/bin/skills/tensorboard/references/integrations.md +638 -0
- package/bin/skills/tensorboard/references/profiling.md +545 -0
- package/bin/skills/tensorboard/references/visualization.md +620 -0
- package/bin/skills/tensorrt-llm/SKILL.md +187 -0
- package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
- package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
- package/bin/skills/tensorrt-llm/references/serving.md +470 -0
- package/bin/skills/tinker/SKILL.md +362 -0
- package/bin/skills/tinker/references/api-reference.md +168 -0
- package/bin/skills/tinker/references/getting-started.md +157 -0
- package/bin/skills/tinker/references/loss-functions.md +163 -0
- package/bin/skills/tinker/references/models-and-lora.md +139 -0
- package/bin/skills/tinker/references/recipes.md +280 -0
- package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
- package/bin/skills/tinker/references/rendering.md +243 -0
- package/bin/skills/tinker/references/supervised-learning.md +232 -0
- package/bin/skills/tinker-training-cost/SKILL.md +187 -0
- package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
- package/bin/skills/torchforge/SKILL.md +433 -0
- package/bin/skills/torchforge/references/api-reference.md +327 -0
- package/bin/skills/torchforge/references/troubleshooting.md +409 -0
- package/bin/skills/torchtitan/SKILL.md +358 -0
- package/bin/skills/torchtitan/references/checkpoint.md +181 -0
- package/bin/skills/torchtitan/references/custom-models.md +258 -0
- package/bin/skills/torchtitan/references/float8.md +133 -0
- package/bin/skills/torchtitan/references/fsdp.md +126 -0
- package/bin/skills/transformer-lens/SKILL.md +346 -0
- package/bin/skills/transformer-lens/references/README.md +54 -0
- package/bin/skills/transformer-lens/references/api.md +362 -0
- package/bin/skills/transformer-lens/references/tutorials.md +339 -0
- package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
- package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
- package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
- package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
- package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
- package/bin/skills/unsloth/SKILL.md +80 -0
- package/bin/skills/unsloth/references/index.md +7 -0
- package/bin/skills/unsloth/references/llms-full.md +16799 -0
- package/bin/skills/unsloth/references/llms-txt.md +12044 -0
- package/bin/skills/unsloth/references/llms.md +82 -0
- package/bin/skills/verl/SKILL.md +391 -0
- package/bin/skills/verl/references/api-reference.md +301 -0
- package/bin/skills/verl/references/troubleshooting.md +391 -0
- package/bin/skills/vllm/SKILL.md +364 -0
- package/bin/skills/vllm/references/optimization.md +226 -0
- package/bin/skills/vllm/references/quantization.md +284 -0
- package/bin/skills/vllm/references/server-deployment.md +255 -0
- package/bin/skills/vllm/references/troubleshooting.md +447 -0
- package/bin/skills/weights-and-biases/SKILL.md +590 -0
- package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
- package/bin/skills/weights-and-biases/references/integrations.md +700 -0
- package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
- package/bin/skills/whisper/SKILL.md +317 -0
- package/bin/skills/whisper/references/languages.md +189 -0
- package/bin/synsc +0 -0
- package/package.json +10 -0
@@ -0,0 +1,293 @@

# HuggingFace Transformers Integration

## Contents
- Enabling Flash Attention in Transformers
- Supported model architectures
- Configuration examples
- Performance comparisons
- Troubleshooting model-specific issues

## Enabling Flash Attention in Transformers

HuggingFace Transformers (v4.36+) supports Flash Attention 2 natively.

**Simple enable for any supported model**:
```python
from transformers import AutoModel
import torch

model = AutoModel.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="auto"
)
```

**Install requirements**:
```bash
pip install "transformers>=4.36"
pip install flash-attn --no-build-isolation
```

## Supported model architectures

As of Transformers 4.40:

**Fully supported**:
- Llama / Llama 2 / Llama 3
- Mistral / Mixtral
- Falcon
- GPT-NeoX
- Phi / Phi-2 / Phi-3
- Qwen / Qwen2
- Gemma
- Starcoder2
- GPT-J
- OPT
- BLOOM

**Partially supported** (encoder-decoder):
- BART
- T5 / Flan-T5
- Whisper

**Check support**:
```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("model-name")
print(config._attn_implementation_internal)
# 'flash_attention_2' if supported
```

## Configuration examples

### Llama 2 with Flash Attention

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "meta-llama/Llama-2-7b-hf"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Generate
inputs = tokenizer("Once upon a time", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0]))
```

### Mistral with Flash Attention for long context

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,  # Better for long context
    device_map="auto",
    max_position_embeddings=32768  # Extended context
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Process long document (32K tokens)
long_text = "..." * 10000
inputs = tokenizer(long_text, return_tensors="pt", truncation=False).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512)
```

### Fine-tuning with Flash Attention

```python
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16
)

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    fp16=True,  # Must match model dtype
    optim="adamw_torch_fused"  # Fast fused optimizer
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset  # prepared elsewhere
)

trainer.train()
```

### Multi-GPU training

```python
from transformers import AutoModelForCausalLM
import torch

# Model parallelism with Flash Attention
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="auto",  # Automatic multi-GPU placement
    max_memory={0: "20GB", 1: "20GB"}  # Limit per GPU
)
```

## Performance comparisons

### Memory usage (Llama 2 7B, batch=1)

| Sequence Length | Standard Attention | Flash Attention 2 | Reduction |
|-----------------|--------------------|-------------------|-----------|
| 512 | 1.2 GB | 0.9 GB | 25% |
| 2048 | 3.8 GB | 1.4 GB | 63% |
| 8192 | 14.2 GB | 3.2 GB | 77% |
| 32768 | OOM (>24 GB) | 10.8 GB | Fits! |

### Speed (tokens/sec, A100 80GB)

| Model | Standard | Flash Attn 2 | Speedup |
|-------|----------|--------------|---------|
| Llama 2 7B (seq=2048) | 42 | 118 | 2.8x |
| Llama 2 13B (seq=4096) | 18 | 52 | 2.9x |
| Llama 2 70B (seq=2048) | 4 | 11 | 2.75x |

### Training throughput (samples/sec)

| Model | Batch Size | Standard | Flash Attn 2 | Speedup |
|-------|------------|----------|--------------|---------|
| Llama 2 7B | 4 | 1.2 | 3.1 | 2.6x |
| Llama 2 7B | 8 | 2.1 | 5.8 | 2.8x |
| Llama 2 13B | 2 | 0.6 | 1.7 | 2.8x |

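Numbers like these are hardware-dependent; a minimal sketch for reproducing the memory comparison on your own GPU (the model id, sequence length, and helper name are illustrative, not part of any library API):

```python
import torch
from transformers import AutoModelForCausalLM

def peak_memory_gb(model_id: str, attn: str, seq_len: int = 2048) -> float:
    """Run one forward pass and report peak GPU memory in GB."""
    torch.cuda.reset_peak_memory_stats()
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        attn_implementation=attn,
        torch_dtype=torch.float16,
        device_map="cuda",
    )
    # Random token ids are enough to exercise the attention kernels
    input_ids = torch.randint(0, model.config.vocab_size, (1, seq_len), device="cuda")
    with torch.no_grad():
        model(input_ids)
    del model
    torch.cuda.empty_cache()
    return torch.cuda.max_memory_allocated() / 1e9

# Example: compare eager vs Flash Attention 2 at 8K tokens
# for attn in ("eager", "flash_attention_2"):
#     print(attn, peak_memory_gb("meta-llama/Llama-2-7b-hf", attn, 8192))
```

Run each configuration in a fresh process for clean numbers, since CUDA caching can blur back-to-back measurements.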
|
|
181
|
+
## Troubleshooting model-specific issues
|
|
182
|
+
|
|
183
|
+
### Issue: Model doesn't support Flash Attention
|
|
184
|
+
|
|
185
|
+
Check support list above. If not supported, use PyTorch SDPA as fallback:
|
|
186
|
+
|
|
187
|
+
```python
|
|
188
|
+
model = AutoModelForCausalLM.from_pretrained(
|
|
189
|
+
"model-name",
|
|
190
|
+
attn_implementation="sdpa", # PyTorch native (still faster)
|
|
191
|
+
torch_dtype=torch.float16
|
|
192
|
+
)
|
|
193
|
+
```
|
|
194
|
+
|
|
195
|
+
### Issue: CUDA out of memory during loading
|
|
196
|
+
|
|
197
|
+
Reduce memory footprint:
|
|
198
|
+
|
|
199
|
+
```python
|
|
200
|
+
model = AutoModelForCausalLM.from_pretrained(
|
|
201
|
+
"model-name",
|
|
202
|
+
attn_implementation="flash_attention_2",
|
|
203
|
+
torch_dtype=torch.float16,
|
|
204
|
+
device_map="auto",
|
|
205
|
+
max_memory={0: "18GB"}, # Reserve memory for KV cache
|
|
206
|
+
low_cpu_mem_usage=True
|
|
207
|
+
)
|
|
208
|
+
```
|
|
209
|
+

### Issue: Slower inference than expected

Ensure the dtypes match:

```python
# Model and inputs must both be float16/bfloat16
model = model.to(torch.float16)
inputs = tokenizer(..., return_tensors="pt").to("cuda")
inputs = {k: v.to(torch.float16) if v.dtype == torch.float32 else v
          for k, v in inputs.items()}
```

### Issue: Different outputs vs standard attention

Flash Attention is numerically equivalent but uses a different computation order, so small differences (<1e-3) are normal:

```python
# Compare outputs (both models must live on the GPU)
model_standard = AutoModelForCausalLM.from_pretrained(
    "model-name", torch_dtype=torch.float16
).to("cuda")
model_flash = AutoModelForCausalLM.from_pretrained(
    "model-name",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16
).to("cuda")

inputs = tokenizer("Test", return_tensors="pt").to("cuda")

with torch.no_grad():
    out_standard = model_standard(**inputs).logits
    out_flash = model_flash(**inputs).logits

diff = (out_standard - out_flash).abs().max()
print(f"Max diff: {diff:.6f}")  # Typically ~1e-4 to 1e-3
```

### Issue: ImportError during model loading

Install flash-attn:

```bash
pip install flash-attn --no-build-isolation
```

Or disable Flash Attention entirely:

```python
model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    attn_implementation="eager",  # Standard PyTorch attention
    torch_dtype=torch.float16
)
```

## Best practices

1. **Always use float16/bfloat16** with Flash Attention (not float32)
2. **Set `device_map="auto"`** for automatic memory management
3. **Use bfloat16 for long context** (better numerical stability)
4. **Enable gradient checkpointing** when training large models
5. **Monitor memory** with `torch.cuda.max_memory_allocated()`

**Example with all best practices**:

```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,  # Better numerical range for training
    device_map="auto",
    low_cpu_mem_usage=True
)

# Enable gradient checkpointing to trade compute for memory
model.gradient_checkpointing_enable()

# Training with optimizations
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    bf16=True,  # Match model dtype
    optim="adamw_torch_fused",
    gradient_checkpointing=True
)
```
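One detail worth calling out in the config above: the batch size the optimizer actually sees is the product of the per-device batch size, the accumulation steps, and the number of devices:

```python
def effective_batch_size(per_device, accum_steps, n_gpus=1):
    """Effective global batch size seen by the optimizer per update."""
    return per_device * accum_steps * n_gpus

print(effective_batch_size(8, 2))  # the TrainingArguments above -> 16
```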
---
name: gguf-quantization
description: GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.
version: 1.0.0
author: Synthetic Sciences
license: MIT
tags: [GGUF, Quantization, llama.cpp, CPU Inference, Apple Silicon, Model Compression, Optimization]
dependencies: [llama-cpp-python>=0.2.0]
---

# GGUF - Quantization Format for llama.cpp

GGUF (GPT-Generated Unified Format) is the standard file format for llama.cpp, enabling efficient inference on CPUs, Apple Silicon, and GPUs with flexible quantization options.

## When to use GGUF

**Use GGUF when:**
- Deploying on consumer hardware (laptops, desktops)
- Running on Apple Silicon (M1/M2/M3) with Metal acceleration
- You need CPU inference without GPU requirements
- You want flexible quantization (Q2_K to Q8_0)
- Using local AI tools (LM Studio, Ollama, text-generation-webui)

**Key advantages:**
- **Universal hardware**: CPU, Apple Silicon, NVIDIA, and AMD support
- **No Python runtime**: Pure C/C++ inference
- **Flexible quantization**: 2-8 bit with various methods (K-quants)
- **Ecosystem support**: LM Studio, Ollama, koboldcpp, and more
- **imatrix**: Importance matrix for better low-bit quality

**Use alternatives instead:**
- **AWQ/GPTQ**: Maximum accuracy with calibration on NVIDIA GPUs
- **HQQ**: Fast calibration-free quantization for HuggingFace
- **bitsandbytes**: Simple integration with the transformers library
- **TensorRT-LLM**: Production NVIDIA deployment with maximum speed

## Quick start

### Installation

```bash
# Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build (CPU)
make

# Build with CUDA (NVIDIA)
make GGML_CUDA=1

# Build with Metal (Apple Silicon)
make GGML_METAL=1

# Install Python bindings (optional)
pip install llama-cpp-python
```

### Convert model to GGUF

```bash
# Install conversion requirements
pip install -r requirements.txt

# Convert HuggingFace model to GGUF (FP16)
python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf

# Or specify the output type explicitly
python convert_hf_to_gguf.py ./path/to/model \
    --outfile model-f16.gguf \
    --outtype f16
```

### Quantize model

```bash
# Basic quantization to Q4_K_M
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Quantize with an importance matrix (better quality)
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

### Run inference

```bash
# CLI inference
./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"

# Interactive mode
./llama-cli -m model-q4_k_m.gguf --interactive

# With GPU offload (-ngl = number of layers to offload)
./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"
```

## Quantization types

### K-quant methods (recommended)

| Type | Bits | Size (7B) | Quality | Use Case |
|------|------|-----------|---------|----------|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Balance |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Good balance |
| Q4_K_M | 4.5 | ~4.1 GB | High | **Recommended default** |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality |

### Legacy methods

| Type | Description |
|------|-------------|
| Q4_0 | 4-bit, basic |
| Q4_1 | 4-bit with delta |
| Q5_0 | 5-bit, basic |
| Q5_1 | 5-bit with delta |

**Recommendation**: Use K-quant methods (Q4_K_M, Q5_K_M) for the best quality/size ratio.

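The Size column follows a simple rule of thumb: file size is roughly parameter count times bits-per-weight divided by 8. A sketch (actual files run somewhat larger because some tensors are kept at higher precision):

```python
def gguf_size_gb(n_params, bits_per_weight):
    """Rough GGUF file size in GB: parameters x bits / 8, ignoring
    metadata and the tensors stored at higher precision."""
    return n_params * bits_per_weight / 8 / 1e9

print(gguf_size_gb(7e9, 4.5))  # Q4_K_M on a 7B model -> ~3.9 GB (table shows ~4.1 GB)
```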
## Conversion workflows

### Workflow 1: HuggingFace to GGUF

```bash
# 1. Download the model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b

# 2. Convert to GGUF (FP16)
python convert_hf_to_gguf.py ./llama-3.1-8b \
    --outfile llama-3.1-8b-f16.gguf \
    --outtype f16

# 3. Quantize
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M

# 4. Test
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50
```

### Workflow 2: With importance matrix (better quality)

```bash
# 1. Convert to GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf

# 2. Create a calibration text file (diverse samples)
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
Python is a popular programming language.
# Add more diverse text samples...
EOF

# 3. Generate the importance matrix
./llama-imatrix -m model-f16.gguf \
    -f calibration.txt \
    --chunk 512 \
    -o model.imatrix \
    -ngl 35  # GPU layers if available

# 4. Quantize with the imatrix
./llama-quantize --imatrix model.imatrix \
    model-f16.gguf \
    model-q4_k_m.gguf \
    Q4_K_M
```

### Workflow 3: Multiple quantizations

```bash
#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"

# Generate the imatrix once
./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35

# Create multiple quantizations
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
    OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
    ./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
    echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
done
```

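One portability note: the `${QUANT,,}` lowercase expansion requires bash 4+, and macOS still ships bash 3.2 as `/bin/bash`. A `tr`-based alternative works everywhere:

```shell
# Portable lowercasing for older bash (e.g. macOS /bin/bash)
QUANT="Q4_K_M"
OUTPUT="llama-3.1-8b-$(echo "$QUANT" | tr '[:upper:]' '[:lower:]').gguf"
echo "$OUTPUT"  # llama-3.1-8b-q4_k_m.gguf
```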
## Python usage

### llama-cpp-python

```python
from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,       # Context window
    n_gpu_layers=35,  # GPU offload (0 for CPU only)
    n_threads=8       # CPU threads
)

# Generate
output = llm(
    "What is machine learning?",
    max_tokens=256,
    temperature=0.7,
    stop=["</s>", "\n\n"]
)
print(output["choices"][0]["text"])
```

### Chat completion

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    chat_format="llama-3"  # Or "chatml", "mistral", etc.
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=256,
    temperature=0.7
)
print(response["choices"][0]["message"]["content"])
```

### Streaming

```python
from llama_cpp import Llama

llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)

# Stream tokens as they are generated
for chunk in llm(
    "Explain quantum computing:",
    max_tokens=256,
    stream=True
):
    print(chunk["choices"][0]["text"], end="", flush=True)
```

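Each streamed chunk carries only a fragment in `chunk["choices"][0]["text"]`, so collecting the full completion is plain concatenation. A sketch using stand-in chunks rather than a live model:

```python
def collect_stream(chunks):
    """Accumulate llama-cpp-python style streaming chunks into one string."""
    return "".join(chunk["choices"][0]["text"] for chunk in chunks)

# Stand-in for the generator returned by llm(..., stream=True)
fake_stream = [{"choices": [{"text": t}]} for t in ["Quantum ", "computing ", "is..."]]
print(collect_stream(fake_stream))  # Quantum computing is...
```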
## Server mode

### Start OpenAI-compatible server

```bash
# Start the server
./llama-server -m model-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 4096

# Or with the Python bindings
python -m llama_cpp.server \
    --model model-q4_k_m.gguf \
    --n_gpu_layers 35 \
    --host 0.0.0.0 \
    --port 8080
```

### Use with OpenAI client

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"  # The local server does not check the key by default
)

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256
)
print(response.choices[0].message.content)
```

## Hardware optimization

### Apple Silicon (Metal)

```bash
# Build with Metal
make clean && make GGML_METAL=1

# Run with Metal acceleration
./llama-cli -m model.gguf -ngl 99 -p "Hello"
```

```python
# Python with Metal
from llama_cpp import Llama

llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,  # Offload all layers
    n_threads=1       # Metal handles parallelism
)
```

### NVIDIA CUDA

```bash
# Build with CUDA
make clean && make GGML_CUDA=1

# Run with CUDA
./llama-cli -m model.gguf -ngl 35 -p "Hello"

# Pin to a specific GPU
CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35
```

### CPU optimization

```bash
# Build with AVX2/AVX512 (detected automatically)
make clean && make

# Run with an explicit thread count
./llama-cli -m model.gguf -t 8 -p "Hello"
```

```python
# Python CPU configuration
from llama_cpp import Llama

llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=0,  # CPU only
    n_threads=8,     # Match physical cores
    n_batch=512      # Batch size for prompt processing
)
```

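`n_threads` should match physical cores, but Python's `os.cpu_count()` reports *logical* cores; on SMT/hyper-threaded machines, halving it is a reasonable heuristic (for an exact count, `psutil.cpu_count(logical=False)` works). A sketch:

```python
import os

def default_n_threads():
    """Heuristic thread count for llama-cpp-python: assume 2 logical
    cores per physical core and never go below 1."""
    logical = os.cpu_count() or 1
    return max(1, logical // 2)

print(default_n_threads())
```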
## Integration with tools

### Ollama

```bash
# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF

# Create the Ollama model
ollama create mymodel -f Modelfile

# Run it
ollama run mymodel "Hello!"
```

### LM Studio

1. Place the GGUF file in `~/.cache/lm-studio/models/`
2. Open LM Studio and select the model
3. Configure context length and GPU offload
4. Start inference

### text-generation-webui

```bash
# Place the model in the models folder
cp model-q4_k_m.gguf text-generation-webui/models/

# Start with the llama.cpp loader
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
```

## Best practices

1. **Use K-quants**: Q4_K_M offers the best quality/size balance
2. **Use imatrix**: Always use an importance matrix for Q4 and below
3. **GPU offload**: Offload as many layers as VRAM allows
4. **Context length**: Start with 4096 and increase only if needed
5. **Thread count**: Match physical CPU cores, not logical ones
6. **Batch size**: Increase `n_batch` for faster prompt processing

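The GPU-offload rule of thumb can be turned into a quick estimate: divide usable VRAM by the quantized model's per-layer size. A rough sketch only; compute buffers and the KV cache also consume VRAM, hence the safety margin:

```python
def layers_to_offload(vram_gb, model_gb, n_layers, margin=0.9):
    """Rough -ngl value: how many layers fit in vram_gb * margin,
    capped at the model's total layer count."""
    per_layer_gb = model_gb / n_layers
    return min(n_layers, int(vram_gb * margin / per_layer_gb))

print(layers_to_offload(8, 4.1, 32))  # 4.1 GB Q4_K_M model, 8 GB GPU -> 32 (all layers)
```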
## Common issues

**Model loads slowly:**
```bash
# mmap is enabled by default; use --mlock to keep the model resident in RAM
./llama-cli -m model.gguf --mlock
```

**Out of memory:**
```bash
# Reduce GPU layers
./llama-cli -m model.gguf -ngl 20  # Reduced from 35

# Or use a smaller quantization
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
```

**Poor quality at low bits:**
```bash
# Always use an imatrix for Q4 and below
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

## References

- **[Advanced Usage](references/advanced-usage.md)** - Batching, speculative decoding, custom builds
- **[Troubleshooting](references/troubleshooting.md)** - Common issues, debugging, benchmarks

## Resources

- **Repository**: https://github.com/ggml-org/llama.cpp
- **Python Bindings**: https://github.com/abetlen/llama-cpp-python
- **Pre-quantized Models**: https://huggingface.co/TheBloke
- **GGUF Converter**: https://huggingface.co/spaces/ggml-org/gguf-my-repo
- **License**: MIT