@synsci/cli-darwin-x64 1.1.49
- package/bin/skills/accelerate/SKILL.md +332 -0
- package/bin/skills/accelerate/references/custom-plugins.md +453 -0
- package/bin/skills/accelerate/references/megatron-integration.md +489 -0
- package/bin/skills/accelerate/references/performance.md +525 -0
- package/bin/skills/audiocraft/SKILL.md +564 -0
- package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
- package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
- package/bin/skills/autogpt/SKILL.md +403 -0
- package/bin/skills/autogpt/references/advanced-usage.md +535 -0
- package/bin/skills/autogpt/references/troubleshooting.md +420 -0
- package/bin/skills/awq/SKILL.md +310 -0
- package/bin/skills/awq/references/advanced-usage.md +324 -0
- package/bin/skills/awq/references/troubleshooting.md +344 -0
- package/bin/skills/axolotl/SKILL.md +158 -0
- package/bin/skills/axolotl/references/api.md +5548 -0
- package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
- package/bin/skills/axolotl/references/index.md +15 -0
- package/bin/skills/axolotl/references/other.md +3563 -0
- package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
- package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
- package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
- package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
- package/bin/skills/bitsandbytes/SKILL.md +411 -0
- package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
- package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
- package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
- package/bin/skills/blip-2/SKILL.md +564 -0
- package/bin/skills/blip-2/references/advanced-usage.md +680 -0
- package/bin/skills/blip-2/references/troubleshooting.md +526 -0
- package/bin/skills/chroma/SKILL.md +406 -0
- package/bin/skills/chroma/references/integration.md +38 -0
- package/bin/skills/clip/SKILL.md +253 -0
- package/bin/skills/clip/references/applications.md +207 -0
- package/bin/skills/constitutional-ai/SKILL.md +290 -0
- package/bin/skills/crewai/SKILL.md +498 -0
- package/bin/skills/crewai/references/flows.md +438 -0
- package/bin/skills/crewai/references/tools.md +429 -0
- package/bin/skills/crewai/references/troubleshooting.md +480 -0
- package/bin/skills/deepspeed/SKILL.md +141 -0
- package/bin/skills/deepspeed/references/08.md +17 -0
- package/bin/skills/deepspeed/references/09.md +173 -0
- package/bin/skills/deepspeed/references/2020.md +378 -0
- package/bin/skills/deepspeed/references/2023.md +279 -0
- package/bin/skills/deepspeed/references/assets.md +179 -0
- package/bin/skills/deepspeed/references/index.md +35 -0
- package/bin/skills/deepspeed/references/mii.md +118 -0
- package/bin/skills/deepspeed/references/other.md +1191 -0
- package/bin/skills/deepspeed/references/tutorials.md +6554 -0
- package/bin/skills/dspy/SKILL.md +590 -0
- package/bin/skills/dspy/references/examples.md +663 -0
- package/bin/skills/dspy/references/modules.md +475 -0
- package/bin/skills/dspy/references/optimizers.md +566 -0
- package/bin/skills/faiss/SKILL.md +221 -0
- package/bin/skills/faiss/references/index_types.md +280 -0
- package/bin/skills/flash-attention/SKILL.md +367 -0
- package/bin/skills/flash-attention/references/benchmarks.md +215 -0
- package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
- package/bin/skills/gguf/SKILL.md +427 -0
- package/bin/skills/gguf/references/advanced-usage.md +504 -0
- package/bin/skills/gguf/references/troubleshooting.md +442 -0
- package/bin/skills/gptq/SKILL.md +450 -0
- package/bin/skills/gptq/references/calibration.md +337 -0
- package/bin/skills/gptq/references/integration.md +129 -0
- package/bin/skills/gptq/references/troubleshooting.md +95 -0
- package/bin/skills/grpo-rl-training/README.md +97 -0
- package/bin/skills/grpo-rl-training/SKILL.md +572 -0
- package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
- package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
- package/bin/skills/guidance/SKILL.md +572 -0
- package/bin/skills/guidance/references/backends.md +554 -0
- package/bin/skills/guidance/references/constraints.md +674 -0
- package/bin/skills/guidance/references/examples.md +767 -0
- package/bin/skills/hqq/SKILL.md +445 -0
- package/bin/skills/hqq/references/advanced-usage.md +528 -0
- package/bin/skills/hqq/references/troubleshooting.md +503 -0
- package/bin/skills/hugging-face-cli/SKILL.md +191 -0
- package/bin/skills/hugging-face-cli/references/commands.md +954 -0
- package/bin/skills/hugging-face-cli/references/examples.md +374 -0
- package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
- package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
- package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
- package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
- package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
- package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
- package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
- package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
- package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
- package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
- package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
- package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
- package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
- package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
- package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
- package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
- package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
- package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
- package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
- package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
- package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
- package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
- package/bin/skills/hugging-face-jobs/index.html +216 -0
- package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
- package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
- package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
- package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
- package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
- package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
- package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
- package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
- package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
- package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
- package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
- package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
- package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
- package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
- package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
- package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
- package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
- package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
- package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
- package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
- package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
- package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
- package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
- package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
- package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
- package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
- package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
- package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
- package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
- package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
- package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
- package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
- package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
- package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
- package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
- package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
- package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
- package/bin/skills/instructor/SKILL.md +740 -0
- package/bin/skills/instructor/references/examples.md +107 -0
- package/bin/skills/instructor/references/providers.md +70 -0
- package/bin/skills/instructor/references/validation.md +606 -0
- package/bin/skills/knowledge-distillation/SKILL.md +458 -0
- package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
- package/bin/skills/lambda-labs/SKILL.md +545 -0
- package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
- package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
- package/bin/skills/langchain/SKILL.md +480 -0
- package/bin/skills/langchain/references/agents.md +499 -0
- package/bin/skills/langchain/references/integration.md +562 -0
- package/bin/skills/langchain/references/rag.md +600 -0
- package/bin/skills/langsmith/SKILL.md +422 -0
- package/bin/skills/langsmith/references/advanced-usage.md +548 -0
- package/bin/skills/langsmith/references/troubleshooting.md +537 -0
- package/bin/skills/litgpt/SKILL.md +469 -0
- package/bin/skills/litgpt/references/custom-models.md +568 -0
- package/bin/skills/litgpt/references/distributed-training.md +451 -0
- package/bin/skills/litgpt/references/supported-models.md +336 -0
- package/bin/skills/litgpt/references/training-recipes.md +619 -0
- package/bin/skills/llama-cpp/SKILL.md +258 -0
- package/bin/skills/llama-cpp/references/optimization.md +89 -0
- package/bin/skills/llama-cpp/references/quantization.md +213 -0
- package/bin/skills/llama-cpp/references/server.md +125 -0
- package/bin/skills/llama-factory/SKILL.md +80 -0
- package/bin/skills/llama-factory/references/_images.md +23 -0
- package/bin/skills/llama-factory/references/advanced.md +1055 -0
- package/bin/skills/llama-factory/references/getting_started.md +349 -0
- package/bin/skills/llama-factory/references/index.md +19 -0
- package/bin/skills/llama-factory/references/other.md +31 -0
- package/bin/skills/llamaguard/SKILL.md +337 -0
- package/bin/skills/llamaindex/SKILL.md +569 -0
- package/bin/skills/llamaindex/references/agents.md +83 -0
- package/bin/skills/llamaindex/references/data_connectors.md +108 -0
- package/bin/skills/llamaindex/references/query_engines.md +406 -0
- package/bin/skills/llava/SKILL.md +304 -0
- package/bin/skills/llava/references/training.md +197 -0
- package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
- package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
- package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
- package/bin/skills/long-context/SKILL.md +536 -0
- package/bin/skills/long-context/references/extension_methods.md +468 -0
- package/bin/skills/long-context/references/fine_tuning.md +611 -0
- package/bin/skills/long-context/references/rope.md +402 -0
- package/bin/skills/mamba/SKILL.md +260 -0
- package/bin/skills/mamba/references/architecture-details.md +206 -0
- package/bin/skills/mamba/references/benchmarks.md +255 -0
- package/bin/skills/mamba/references/training-guide.md +388 -0
- package/bin/skills/megatron-core/SKILL.md +366 -0
- package/bin/skills/megatron-core/references/benchmarks.md +249 -0
- package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
- package/bin/skills/megatron-core/references/production-examples.md +473 -0
- package/bin/skills/megatron-core/references/training-recipes.md +547 -0
- package/bin/skills/miles/SKILL.md +315 -0
- package/bin/skills/miles/references/api-reference.md +141 -0
- package/bin/skills/miles/references/troubleshooting.md +352 -0
- package/bin/skills/mlflow/SKILL.md +704 -0
- package/bin/skills/mlflow/references/deployment.md +744 -0
- package/bin/skills/mlflow/references/model-registry.md +770 -0
- package/bin/skills/mlflow/references/tracking.md +680 -0
- package/bin/skills/modal/SKILL.md +341 -0
- package/bin/skills/modal/references/advanced-usage.md +503 -0
- package/bin/skills/modal/references/troubleshooting.md +494 -0
- package/bin/skills/model-merging/SKILL.md +539 -0
- package/bin/skills/model-merging/references/evaluation.md +462 -0
- package/bin/skills/model-merging/references/examples.md +428 -0
- package/bin/skills/model-merging/references/methods.md +352 -0
- package/bin/skills/model-pruning/SKILL.md +495 -0
- package/bin/skills/model-pruning/references/wanda.md +347 -0
- package/bin/skills/moe-training/SKILL.md +526 -0
- package/bin/skills/moe-training/references/architectures.md +432 -0
- package/bin/skills/moe-training/references/inference.md +348 -0
- package/bin/skills/moe-training/references/training.md +425 -0
- package/bin/skills/nanogpt/SKILL.md +290 -0
- package/bin/skills/nanogpt/references/architecture.md +382 -0
- package/bin/skills/nanogpt/references/data.md +476 -0
- package/bin/skills/nanogpt/references/training.md +564 -0
- package/bin/skills/nemo-curator/SKILL.md +383 -0
- package/bin/skills/nemo-curator/references/deduplication.md +87 -0
- package/bin/skills/nemo-curator/references/filtering.md +102 -0
- package/bin/skills/nemo-evaluator/SKILL.md +494 -0
- package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
- package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
- package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
- package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
- package/bin/skills/nemo-guardrails/SKILL.md +297 -0
- package/bin/skills/nnsight/SKILL.md +436 -0
- package/bin/skills/nnsight/references/README.md +78 -0
- package/bin/skills/nnsight/references/api.md +344 -0
- package/bin/skills/nnsight/references/tutorials.md +300 -0
- package/bin/skills/openrlhf/SKILL.md +249 -0
- package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
- package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
- package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
- package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
- package/bin/skills/outlines/SKILL.md +652 -0
- package/bin/skills/outlines/references/backends.md +615 -0
- package/bin/skills/outlines/references/examples.md +773 -0
- package/bin/skills/outlines/references/json_generation.md +652 -0
- package/bin/skills/peft/SKILL.md +431 -0
- package/bin/skills/peft/references/advanced-usage.md +514 -0
- package/bin/skills/peft/references/troubleshooting.md +480 -0
- package/bin/skills/phoenix/SKILL.md +475 -0
- package/bin/skills/phoenix/references/advanced-usage.md +619 -0
- package/bin/skills/phoenix/references/troubleshooting.md +538 -0
- package/bin/skills/pinecone/SKILL.md +358 -0
- package/bin/skills/pinecone/references/deployment.md +181 -0
- package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
- package/bin/skills/pytorch-fsdp/references/index.md +7 -0
- package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
- package/bin/skills/pytorch-lightning/SKILL.md +346 -0
- package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
- package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
- package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
- package/bin/skills/pyvene/SKILL.md +473 -0
- package/bin/skills/pyvene/references/README.md +73 -0
- package/bin/skills/pyvene/references/api.md +383 -0
- package/bin/skills/pyvene/references/tutorials.md +376 -0
- package/bin/skills/qdrant/SKILL.md +493 -0
- package/bin/skills/qdrant/references/advanced-usage.md +648 -0
- package/bin/skills/qdrant/references/troubleshooting.md +631 -0
- package/bin/skills/ray-data/SKILL.md +326 -0
- package/bin/skills/ray-data/references/integration.md +82 -0
- package/bin/skills/ray-data/references/transformations.md +83 -0
- package/bin/skills/ray-train/SKILL.md +406 -0
- package/bin/skills/ray-train/references/multi-node.md +628 -0
- package/bin/skills/rwkv/SKILL.md +260 -0
- package/bin/skills/rwkv/references/architecture-details.md +344 -0
- package/bin/skills/rwkv/references/rwkv7.md +386 -0
- package/bin/skills/rwkv/references/state-management.md +369 -0
- package/bin/skills/saelens/SKILL.md +386 -0
- package/bin/skills/saelens/references/README.md +70 -0
- package/bin/skills/saelens/references/api.md +333 -0
- package/bin/skills/saelens/references/tutorials.md +318 -0
- package/bin/skills/segment-anything/SKILL.md +500 -0
- package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
- package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
- package/bin/skills/sentence-transformers/SKILL.md +255 -0
- package/bin/skills/sentence-transformers/references/models.md +123 -0
- package/bin/skills/sentencepiece/SKILL.md +235 -0
- package/bin/skills/sentencepiece/references/algorithms.md +200 -0
- package/bin/skills/sentencepiece/references/training.md +304 -0
- package/bin/skills/sglang/SKILL.md +442 -0
- package/bin/skills/sglang/references/deployment.md +490 -0
- package/bin/skills/sglang/references/radix-attention.md +413 -0
- package/bin/skills/sglang/references/structured-generation.md +541 -0
- package/bin/skills/simpo/SKILL.md +219 -0
- package/bin/skills/simpo/references/datasets.md +478 -0
- package/bin/skills/simpo/references/hyperparameters.md +452 -0
- package/bin/skills/simpo/references/loss-functions.md +350 -0
- package/bin/skills/skypilot/SKILL.md +509 -0
- package/bin/skills/skypilot/references/advanced-usage.md +491 -0
- package/bin/skills/skypilot/references/troubleshooting.md +570 -0
- package/bin/skills/slime/SKILL.md +464 -0
- package/bin/skills/slime/references/api-reference.md +392 -0
- package/bin/skills/slime/references/troubleshooting.md +386 -0
- package/bin/skills/speculative-decoding/SKILL.md +467 -0
- package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
- package/bin/skills/speculative-decoding/references/medusa.md +350 -0
- package/bin/skills/stable-diffusion/SKILL.md +519 -0
- package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
- package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
- package/bin/skills/tensorboard/SKILL.md +629 -0
- package/bin/skills/tensorboard/references/integrations.md +638 -0
- package/bin/skills/tensorboard/references/profiling.md +545 -0
- package/bin/skills/tensorboard/references/visualization.md +620 -0
- package/bin/skills/tensorrt-llm/SKILL.md +187 -0
- package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
- package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
- package/bin/skills/tensorrt-llm/references/serving.md +470 -0
- package/bin/skills/tinker/SKILL.md +362 -0
- package/bin/skills/tinker/references/api-reference.md +168 -0
- package/bin/skills/tinker/references/getting-started.md +157 -0
- package/bin/skills/tinker/references/loss-functions.md +163 -0
- package/bin/skills/tinker/references/models-and-lora.md +139 -0
- package/bin/skills/tinker/references/recipes.md +280 -0
- package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
- package/bin/skills/tinker/references/rendering.md +243 -0
- package/bin/skills/tinker/references/supervised-learning.md +232 -0
- package/bin/skills/tinker-training-cost/SKILL.md +187 -0
- package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
- package/bin/skills/torchforge/SKILL.md +433 -0
- package/bin/skills/torchforge/references/api-reference.md +327 -0
- package/bin/skills/torchforge/references/troubleshooting.md +409 -0
- package/bin/skills/torchtitan/SKILL.md +358 -0
- package/bin/skills/torchtitan/references/checkpoint.md +181 -0
- package/bin/skills/torchtitan/references/custom-models.md +258 -0
- package/bin/skills/torchtitan/references/float8.md +133 -0
- package/bin/skills/torchtitan/references/fsdp.md +126 -0
- package/bin/skills/transformer-lens/SKILL.md +346 -0
- package/bin/skills/transformer-lens/references/README.md +54 -0
- package/bin/skills/transformer-lens/references/api.md +362 -0
- package/bin/skills/transformer-lens/references/tutorials.md +339 -0
- package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
- package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
- package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
- package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
- package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
- package/bin/skills/unsloth/SKILL.md +80 -0
- package/bin/skills/unsloth/references/index.md +7 -0
- package/bin/skills/unsloth/references/llms-full.md +16799 -0
- package/bin/skills/unsloth/references/llms-txt.md +12044 -0
- package/bin/skills/unsloth/references/llms.md +82 -0
- package/bin/skills/verl/SKILL.md +391 -0
- package/bin/skills/verl/references/api-reference.md +301 -0
- package/bin/skills/verl/references/troubleshooting.md +391 -0
- package/bin/skills/vllm/SKILL.md +364 -0
- package/bin/skills/vllm/references/optimization.md +226 -0
- package/bin/skills/vllm/references/quantization.md +284 -0
- package/bin/skills/vllm/references/server-deployment.md +255 -0
- package/bin/skills/vllm/references/troubleshooting.md +447 -0
- package/bin/skills/weights-and-biases/SKILL.md +590 -0
- package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
- package/bin/skills/weights-and-biases/references/integrations.md +700 -0
- package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
- package/bin/skills/whisper/SKILL.md +317 -0
- package/bin/skills/whisper/references/languages.md +189 -0
- package/bin/synsc +0 -0
- package/package.json +10 -0
@@ -0,0 +1,468 @@

# Context Extension Methods

Comprehensive comparison of YaRN, ALiBi, and Position Interpolation based on published research.

## Table of Contents

- YaRN (Yet another RoPE extensioN)
- ALiBi (Attention with Linear Biases)
- Position Interpolation
- Method Comparison

## YaRN: Yet another RoPE extensioN

**Paper**: arXiv 2309.00071 (2023)
**Authors**: Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enrico Shippole

### Overview

YaRN extends RoPE-based models to 128k+ context with 10× less training data than previous methods.

### Key Innovations

1. **NTK-aware interpolation**: Scales different frequency components differently
2. **Attention temperature scaling**: Adjusts attention sharpness
3. **NTK-by-parts**: Hybrid interpolation/extrapolation

### Technical Details

**Problem**: Naive position interpolation compresses all frequencies uniformly, losing high-frequency information.

**Solution**: Different treatment for different frequencies.

```python
import math

import torch

# Frequency decomposition
# Low frequencies (< 1/β_slow): Interpolate (compress)
# High frequencies (> 1/β_fast): Extrapolate (extend as-is)
# Middle frequencies: Smooth ramp between the two

def yarn_get_mscale(scale=1.0):
    """Attention temperature scaling."""
    if scale <= 1:
        return 1.0
    return 0.1 * math.log(scale) + 1.0

def yarn_find_correction_dim(num_rotations, dim, base=10000, max_position_embeddings=2048):
    """Find dimension cutoffs for NTK-by-parts."""
    return (dim * math.log(max_position_embeddings / (num_rotations * 2 * math.pi))) / (2 * math.log(base))

def yarn_find_correction_range(low_rot, high_rot, dim, base=10000, max_position_embeddings=2048):
    """Find frequency ranges for interpolation."""
    low = math.floor(yarn_find_correction_dim(low_rot, dim, base, max_position_embeddings))
    high = math.ceil(yarn_find_correction_dim(high_rot, dim, base, max_position_embeddings))
    return max(low, 0), min(high, dim - 1)

def yarn_linear_ramp_mask(min_val, max_val, dim):
    """Create smooth ramp between interpolation and extrapolation."""
    if min_val == max_val:
        max_val += 0.001  # Avoid division by zero
    linear_func = (torch.arange(dim, dtype=torch.float32) - min_val) / (max_val - min_val)
    ramp_func = torch.clamp(linear_func, 0, 1)
    return ramp_func
```

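To make the cutoffs concrete, here is a standalone sanity check of the NTK-by-parts boundaries, reimplementing the correction-range formula above with plain `math` (the head dimension 128 and the defaults are illustrative, chosen to match a LLaMA-style RoPE):

```python
import math

def find_correction_dim(num_rotations, dim, base=10000, max_pos=2048):
    # Same formula as yarn_find_correction_dim above
    return (dim * math.log(max_pos / (num_rotations * 2 * math.pi))) / (2 * math.log(base))

def find_correction_range(low_rot, high_rot, dim, base=10000, max_pos=2048):
    low = math.floor(find_correction_dim(low_rot, dim, base, max_pos))
    high = math.ceil(find_correction_dim(high_rot, dim, base, max_pos))
    return max(low, 0), min(high, dim - 1)

# With the paper defaults beta_fast=32, beta_slow=1 and head dim 128:
low, high = find_correction_range(32, 1, dim=128)
print(low, high)  # 16 41
```

So dimensions below 16 (high frequency) are extrapolated unchanged, dimensions above 41 (low frequency) are interpolated, and the band in between is blended by the ramp mask.
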
### Complete YaRN Implementation

```python
import math

import torch
import torch.nn as nn

class YaRNScaledRoPE(nn.Module):
    """Full YaRN implementation (uses the helper functions defined above)."""

    def __init__(
        self,
        dim,
        max_position_embeddings=2048,
        base=10000,
        scale=1.0,
        original_max_position_embeddings=2048,
        extrapolation_factor=1.0,
        attn_factor=1.0,
        beta_fast=32,
        beta_slow=1,
        device=None
    ):
        super().__init__()
        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        self.base = base
        self.scale = scale
        self.original_max_position_embeddings = original_max_position_embeddings
        self.extrapolation_factor = extrapolation_factor
        self.attn_factor = attn_factor
        self.beta_fast = beta_fast
        self.beta_slow = beta_slow

        # Compute mscale (attention temperature)
        self.mscale = float(yarn_get_mscale(self.scale) * self.attn_factor)

        # Compute frequency bands
        self.low, self.high = yarn_find_correction_range(
            self.beta_fast,
            self.beta_slow,
            self.dim,
            self.base,
            self.original_max_position_embeddings
        )

        # Interpolated (compressed by `scale`) and extrapolated (unchanged) inverse frequencies
        pos_freqs = self.base ** (torch.arange(0, self.dim, 2, dtype=torch.float32) / self.dim)
        inv_freq_extrapolation = 1.0 / pos_freqs
        inv_freq_interpolation = 1.0 / (self.scale * pos_freqs)

        # Ramp mask: 1 for high-frequency dims (extrapolate), 0 for low-frequency dims (interpolate)
        inv_freq_mask = (1.0 - yarn_linear_ramp_mask(self.low, self.high, self.dim // 2)) * self.extrapolation_factor
        inv_freq = inv_freq_interpolation * (1.0 - inv_freq_mask) + inv_freq_extrapolation * inv_freq_mask

        self.register_buffer("inv_freq", inv_freq)

    def forward(self, seq_len, device):
        t = torch.arange(seq_len, device=device, dtype=self.inv_freq.dtype)

        # Apply YaRN-scaled frequencies
        freqs = torch.outer(t, self.inv_freq)

        # Attention temperature scaling
        emb = torch.cat((freqs, freqs), dim=-1)
        cos = emb.cos() * self.mscale
        sin = emb.sin() * self.mscale

        return cos, sin
```

### YaRN Parameters

```python
# Default YaRN configuration (from paper)
yarn_config = {
    "scale": 16,                    # 16× extension (2k → 32k)
    "original_max_position": 2048,  # Original context length
    "extrapolation_factor": 1.0,    # How much to extrapolate high freqs
    "attn_factor": 1.0,             # Base attention temperature
    "beta_fast": 32,                # High-frequency threshold
    "beta_slow": 1,                 # Low-frequency threshold
}

# For larger extensions (64k, 128k)
yarn_config_large = {
    "scale": 64,
    "beta_fast": 64,  # Increase for larger scales
    "beta_slow": 2,
}
```

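For intuition on the attention temperature: the `yarn_get_mscale` formula above grows only logarithmically with the scale factor, so even large extensions sharpen attention only modestly. A quick standalone check (plain `math`, values rounded):

```python
import math

def get_mscale(scale=1.0):
    # Attention temperature formula from the YaRN paper (same as yarn_get_mscale above)
    if scale <= 1:
        return 1.0
    return 0.1 * math.log(scale) + 1.0

for s in (1, 16, 64):
    print(s, round(get_mscale(s), 4))
# 1 1.0
# 16 1.2773
# 64 1.4159
```

Even a 64× context extension scales the attention logits by only about 1.42×.
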
### Performance

**Results from paper (LLaMA 7B)**:

| Method | Training Tokens | Steps | Final Perplexity | Context Length |
|--------|----------------|-------|------------------|----------------|
| Full Fine-tune | 10B | 10000 | 11.2 | 32k |
| Position Interpolation | 1B | 1000 | 12.5 | 32k |
| **YaRN** | **100M** | **400** | **11.8** | **32k** |

**10× less data and 2.5× fewer steps than Position Interpolation.**

## ALiBi: Attention with Linear Biases

**Paper**: arXiv 2108.12409 (ICLR 2022)
**Authors**: Ofir Press, Noah A. Smith, Mike Lewis
**Title**: "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation"

### Core Concept

**Key idea**: Don't add positional embeddings. Instead, bias attention scores based on distance.

```
attention_score[i, j] = q_i · k_j + bias[i, j]

where bias[i, j] = -m * |i - j|
      m = slope for each head
```

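As a toy illustration of that bias (a standalone sketch; the single-head slope m = 0.25 is arbitrary), the penalty a query at position i applies to a key at position j grows linearly with their distance in the causal lower triangle:

```python
m = 0.25  # slope for one head (illustrative value)
seq_len = 4

# Causal ALiBi bias: bias[i][j] = -m * (i - j) for j <= i
bias = [[m * (j - i) for j in range(i + 1)] for i in range(seq_len)]
for row in bias:
    print(row)
# [0.0]
# [-0.25, 0.0]
# [-0.5, -0.25, 0.0]
# [-0.75, -0.5, -0.25, 0.0]
```

The most recent key (j = i) is unpenalized, and each step further back costs another m.
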
|
|
178
|
+
### Mathematical Formulation
|
|
179
|
+
|
|
180
|
+
**Standard attention**:
|
|
181
|
+
```
|
|
182
|
+
Attention(Q, K, V) = softmax(QK^T / √d_k) V
|
|
183
|
+
```
|
|
184
|
+
|
|
185
|
+
**ALiBi attention**:
|
|
186
|
+
```
|
|
187
|
+
Attention(Q, K, V) = softmax((QK^T + m · L) / √d_k) V
|
|
188
|
+
|
|
189
|
+
where L[i,j] = -(i - j) (lower triangular)
|
|
190
|
+
m = head-specific slope
|
|
191
|
+
```
|
|
192
|
+
|
|
193
|
+
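For intuition, here is a tiny worked example (my illustration, not from the paper) that builds the causal bias m · L for a 4-token sequence and a single head with slope m = 0.5:

```python
# Causal ALiBi bias for seq_len=4, one head with slope m=0.5.
# bias[i][j] = -m * (i - j) for the visible keys j <= i (masked entries omitted).
m = 0.5
seq_len = 4
bias = [[-m * (i - j) for j in range(i + 1)] for i in range(seq_len)]

for row in bias:
    print(row)
# → [0.0]
# → [-0.5, 0.0]
# → [-1.0, -0.5, 0.0]
# → [-1.5, -1.0, -0.5, 0.0]
```

Row i penalizes key j by 0.5 · (i − j): the further back a token is, the larger the penalty added to its attention score.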
### Implementation

```python
import math
import torch
import torch.nn.functional as F

def get_alibi_slopes(num_heads):
    """Compute ALiBi slope for each attention head.

    Source: Official ALiBi implementation
    """
    def get_slopes_power_of_2(n):
        start = 2 ** (-(2 ** -(math.log2(n) - 3)))
        ratio = start
        return [start * (ratio ** i) for i in range(n)]

    # If power of 2
    if math.log2(num_heads).is_integer():
        return get_slopes_power_of_2(num_heads)

    # If not power of 2, use closest power of 2 and interpolate
    closest_power_of_2 = 2 ** math.floor(math.log2(num_heads))
    slopes = get_slopes_power_of_2(closest_power_of_2)

    # Add extra slopes from next power of 2
    extra_slopes = get_slopes_power_of_2(2 * closest_power_of_2)
    slopes.extend(extra_slopes[0::2][:num_heads - closest_power_of_2])

    return slopes

def create_alibi_bias(seq_len, num_heads, device='cpu'):
    """Create ALiBi attention bias matrix."""
    # Relative positions: L[i, j] = -(i - j)
    context_position = torch.arange(seq_len, device=device)[:, None]
    memory_position = torch.arange(seq_len, device=device)[None, :]

    # Absolute distance matrix (symmetric; combine with a causal mask for decoding)
    relative_position = memory_position - context_position
    relative_position = torch.abs(relative_position).unsqueeze(0)  # (1, seq_len, seq_len)

    # Get slopes for each head
    slopes = torch.tensor(get_alibi_slopes(num_heads), device=device).unsqueeze(-1).unsqueeze(-1)

    # Apply slopes: (num_heads, seq_len, seq_len)
    alibi = -slopes * relative_position

    return alibi

def alibi_attention(query, key, value, num_heads, scale=None):
    """Multi-head attention with ALiBi."""
    batch_size, seq_len, embed_dim = query.shape
    head_dim = embed_dim // num_heads

    if scale is None:
        scale = head_dim ** -0.5

    # Reshape for multi-head: (batch, num_heads, seq_len, head_dim)
    query = query.reshape(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)
    key = key.reshape(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)
    value = value.reshape(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)

    # Attention scores: (batch, num_heads, seq_len, seq_len)
    attn_scores = torch.matmul(query, key.transpose(-2, -1)) * scale

    # Add ALiBi bias (broadcasts over the batch dimension)
    alibi_bias = create_alibi_bias(seq_len, num_heads, device=query.device)
    attn_scores = attn_scores + alibi_bias

    # Softmax and apply to values
    attn_weights = F.softmax(attn_scores, dim=-1)
    output = torch.matmul(attn_weights, value)

    # Reshape back: (batch, seq_len, embed_dim)
    output = output.transpose(1, 2).reshape(batch_size, seq_len, embed_dim)

    return output
```

### Slope Values

**Example slopes for 8 heads**:
```python
slopes = get_alibi_slopes(8)
# Output: [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]

# Each head has a different slope (a geometric sequence: 1/2^1, 1/2^2, ..., 1/2^8)
# → Different heads attend to different distance ranges
# → Head 1: Strong recency bias (slope=0.5)
# → Head 8: Weak recency bias (slope≈0.0039)
```

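As a sanity check on the slope construction, the power-of-two branch can be re-derived with nothing but `math`: it reduces to the closed form slope_i = 2^(−8(i+1)/n) from the paper.

```python
import math

def slopes_power_of_2(n):
    # start equals 2^(-8/n); the sequence is start^(i+1), i.e. a geometric decay
    start = 2 ** (-(2 ** -(math.log2(n) - 3)))
    return [start * (start ** i) for i in range(n)]

slopes = slopes_power_of_2(8)
print(slopes)
# → [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]

# Closed form from the paper: slope_i = 2^(-8 * (i + 1) / n)
closed = [2 ** (-8 * (i + 1) / 8) for i in range(8)]
assert all(abs(a - b) < 1e-12 for a, b in zip(slopes, closed))
```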
### Advantages

1. **No position limit**: Works for any sequence length
2. **Efficient**: 11% less memory than sinusoidal embeddings
3. **Fast**: 11% faster training
4. **Extrapolates well**: Train 1k, test 2k+ tokens
5. **Simple**: No learned parameters for position

### Disadvantages

1. **Requires pre-training**: Can't retrofit existing models
2. **Recency bias**: Always biases toward recent tokens (may not suit all tasks)

## Position Interpolation

**Paper**: arXiv 2306.15595 (2023)
**Authors**: Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian
**Title**: "Extending Context Window of Large Language Models via Positional Interpolation"

### Core Idea

Instead of extrapolating positions beyond the training range, interpolate within the trained range.

```
# Extrapolation (bad): positions [0, 1, 2, ..., 2048, 2049, ..., 32768]
# Positions > 2048 are out-of-distribution

# Interpolation (good): positions [0, 0.0625, 0.125, ..., 2048]
# All positions within [0, 2048] (in-distribution)
```

### Mathematical Formulation

**Original RoPE**:
```
position_ids = [0, 1, 2, 3, ..., L-1]
```

**Position Interpolation** (scale factor s):
```
position_ids = [0, 1/s, 2/s, 3/s, ..., (L-1)/s]
```

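A quick numeric sketch of that formula in plain Python (no model code): with s = 16, all 32,768 positions map back inside the trained [0, 2048) range.

```python
L = 32768  # target context length
s = 16.0   # scale factor = 32768 / 2048

position_ids = [i / s for i in range(L)]

print(position_ids[:4])  # → [0.0, 0.0625, 0.125, 0.1875]
print(position_ids[-1])  # → 2047.9375, still inside the trained [0, 2048) range
```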
### Implementation

```python
import torch
import torch.nn as nn

class InterpolatedRoPE(nn.Module):
    """RoPE with position interpolation."""

    def __init__(self, dim, max_seq_len=2048, base=10000, scaling_factor=1.0):
        super().__init__()
        self.scaling_factor = scaling_factor

        # Standard RoPE frequencies
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, seq_len, device):
        # Position indices
        t = torch.arange(seq_len, device=device).type_as(self.inv_freq)

        # Interpolate positions
        t = t / self.scaling_factor  # KEY LINE

        # Standard RoPE
        freqs = torch.outer(t, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        return emb.cos(), emb.sin()
```

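To see the effect on the rotary angles themselves, here is a math-only sketch (mirroring the frequency computation above, but without torch): with a scaling factor of 16, the angle at position p equals the original angle at position p / 16, so no angle ever exceeds what was seen in training.

```python
import math

dim, base, scaling_factor = 8, 10000.0, 16.0

# Same frequencies as the module: base^(-2k/dim) for k = 0..dim/2-1
inv_freq = [base ** (-(2 * k) / dim) for k in range(dim // 2)]

def angles(pos, scale=1.0):
    """Rotary angles for one position, with optional position interpolation."""
    return [(pos / scale) * f for f in inv_freq]

# Interpolated position 32767 produces exactly the angles of
# un-interpolated position 2047.9375 — inside the trained range.
assert angles(32767, scale=scaling_factor) == angles(32767 / scaling_factor)
```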
### Fine-tuning Requirements

**Minimal fine-tuning needed**:

```python
# Extension: 2k → 32k (16× scale)
scaling_factor = 16.0

# Training config
training_args = {
    "max_steps": 1000,       # Only 1000 steps!
    "learning_rate": 2e-5,   # Small LR
    "batch_size": 1,
    "gradient_accumulation_steps": 16,
}

# Results: Near-perfect perplexity retention
```

### Theoretical Analysis

**Interpolation bound** (from paper):

The upper bound of the interpolation error is ~600× smaller than the extrapolation error.

```
Extrapolation error: O(L^2)  # Grows quadratically
Interpolation error: O(1/s)  # Shrinks linearly with scale
```

### Results

**LLaMA models extended to 32k**:

| Model | Original Context | Extended Context | Fine-tune Steps | Perplexity |
|-------|------------------|------------------|-----------------|------------|
| LLaMA 7B | 2048 | 32768 | 1000 | 2.72 |
| LLaMA 13B | 2048 | 32768 | 1000 | 2.55 |
| LLaMA 33B | 2048 | 32768 | 1000 | 2.38 |
| LLaMA 65B | 2048 | 32768 | 1000 | 2.26 |

**Passkey retrieval**: 100% accuracy up to 32k tokens

### Advantages

1. **Minimal training**: 1000 steps sufficient
2. **Stable**: Interpolation more stable than extrapolation
3. **Simple**: One-line code change
4. **Effective**: Works across all LLaMA sizes

### Disadvantages

1. **Limited extrapolation**: Can't go beyond trained range without fine-tuning
2. **Information compression**: All positions compressed into trained range

## Method Comparison

### Training Requirements

| Method | Pre-training Needed | Fine-tuning Steps | Training Tokens |
|--------|---------------------|-------------------|-----------------|
| **ALiBi** | Yes (from scratch) | 0 | Full (100B+) |
| **Position Interpolation** | No | 1,000 | ~100M |
| **YaRN** | No | 400 | ~100M |
| **Linear RoPE Scaling** | No | 1,000-5,000 | ~1B |

### Extrapolation Performance

**Test**: Train on 2k, test on 8k, 16k, 32k

| Method | 8k PPL | 16k PPL | 32k PPL | Extrapolation Quality |
|--------|--------|---------|---------|-----------------------|
| **ALiBi** | 12.1 | 12.3 | 12.5 | Excellent |
| **YaRN** | 11.8 | 12.0 | 12.2 | Excellent |
| **Position Interpolation** | 12.5 | 13.2 | 14.8 | Poor |
| **Linear Scaling** | 13.1 | 15.2 | 19.4 | Poor |

### Memory and Speed

| Method | Memory vs Baseline | Speed vs Baseline |
|--------|--------------------|-------------------|
| **ALiBi** | -11% | +11% |
| **Position Interpolation** | 0% | 0% |
| **YaRN** | 0% | -5% |
| **Linear Scaling** | 0% | 0% |

### Use Case Recommendations

```python
# New model from scratch → ALiBi
if training_from_scratch:
    use_method = "ALiBi"

# Extending existing RoPE model with best quality → YaRN
elif need_sota_quality:
    use_method = "YaRN"

# Quick extension with minimal compute → Position Interpolation
elif need_quick_solution:
    use_method = "Position Interpolation"

# Moderate extension, simple implementation → Linear Scaling
else:
    use_method = "Linear RoPE Scaling"
```

## Resources

- **YaRN Paper**: https://arxiv.org/abs/2309.00071
- **ALiBi Paper**: https://arxiv.org/abs/2108.12409
- **Position Interpolation Paper**: https://arxiv.org/abs/2306.15595
- **YaRN Implementation**: https://github.com/jquesnelle/yarn
- **ALiBi Implementation**: https://github.com/ofirpress/attention_with_linear_biases
- **Together AI Blog**: https://www.together.ai/blog/llama-2-7b-32k