@synsci/cli-darwin-x64 1.1.49
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/skills/accelerate/SKILL.md +332 -0
- package/bin/skills/accelerate/references/custom-plugins.md +453 -0
- package/bin/skills/accelerate/references/megatron-integration.md +489 -0
- package/bin/skills/accelerate/references/performance.md +525 -0
- package/bin/skills/audiocraft/SKILL.md +564 -0
- package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
- package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
- package/bin/skills/autogpt/SKILL.md +403 -0
- package/bin/skills/autogpt/references/advanced-usage.md +535 -0
- package/bin/skills/autogpt/references/troubleshooting.md +420 -0
- package/bin/skills/awq/SKILL.md +310 -0
- package/bin/skills/awq/references/advanced-usage.md +324 -0
- package/bin/skills/awq/references/troubleshooting.md +344 -0
- package/bin/skills/axolotl/SKILL.md +158 -0
- package/bin/skills/axolotl/references/api.md +5548 -0
- package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
- package/bin/skills/axolotl/references/index.md +15 -0
- package/bin/skills/axolotl/references/other.md +3563 -0
- package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
- package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
- package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
- package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
- package/bin/skills/bitsandbytes/SKILL.md +411 -0
- package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
- package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
- package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
- package/bin/skills/blip-2/SKILL.md +564 -0
- package/bin/skills/blip-2/references/advanced-usage.md +680 -0
- package/bin/skills/blip-2/references/troubleshooting.md +526 -0
- package/bin/skills/chroma/SKILL.md +406 -0
- package/bin/skills/chroma/references/integration.md +38 -0
- package/bin/skills/clip/SKILL.md +253 -0
- package/bin/skills/clip/references/applications.md +207 -0
- package/bin/skills/constitutional-ai/SKILL.md +290 -0
- package/bin/skills/crewai/SKILL.md +498 -0
- package/bin/skills/crewai/references/flows.md +438 -0
- package/bin/skills/crewai/references/tools.md +429 -0
- package/bin/skills/crewai/references/troubleshooting.md +480 -0
- package/bin/skills/deepspeed/SKILL.md +141 -0
- package/bin/skills/deepspeed/references/08.md +17 -0
- package/bin/skills/deepspeed/references/09.md +173 -0
- package/bin/skills/deepspeed/references/2020.md +378 -0
- package/bin/skills/deepspeed/references/2023.md +279 -0
- package/bin/skills/deepspeed/references/assets.md +179 -0
- package/bin/skills/deepspeed/references/index.md +35 -0
- package/bin/skills/deepspeed/references/mii.md +118 -0
- package/bin/skills/deepspeed/references/other.md +1191 -0
- package/bin/skills/deepspeed/references/tutorials.md +6554 -0
- package/bin/skills/dspy/SKILL.md +590 -0
- package/bin/skills/dspy/references/examples.md +663 -0
- package/bin/skills/dspy/references/modules.md +475 -0
- package/bin/skills/dspy/references/optimizers.md +566 -0
- package/bin/skills/faiss/SKILL.md +221 -0
- package/bin/skills/faiss/references/index_types.md +280 -0
- package/bin/skills/flash-attention/SKILL.md +367 -0
- package/bin/skills/flash-attention/references/benchmarks.md +215 -0
- package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
- package/bin/skills/gguf/SKILL.md +427 -0
- package/bin/skills/gguf/references/advanced-usage.md +504 -0
- package/bin/skills/gguf/references/troubleshooting.md +442 -0
- package/bin/skills/gptq/SKILL.md +450 -0
- package/bin/skills/gptq/references/calibration.md +337 -0
- package/bin/skills/gptq/references/integration.md +129 -0
- package/bin/skills/gptq/references/troubleshooting.md +95 -0
- package/bin/skills/grpo-rl-training/README.md +97 -0
- package/bin/skills/grpo-rl-training/SKILL.md +572 -0
- package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
- package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
- package/bin/skills/guidance/SKILL.md +572 -0
- package/bin/skills/guidance/references/backends.md +554 -0
- package/bin/skills/guidance/references/constraints.md +674 -0
- package/bin/skills/guidance/references/examples.md +767 -0
- package/bin/skills/hqq/SKILL.md +445 -0
- package/bin/skills/hqq/references/advanced-usage.md +528 -0
- package/bin/skills/hqq/references/troubleshooting.md +503 -0
- package/bin/skills/hugging-face-cli/SKILL.md +191 -0
- package/bin/skills/hugging-face-cli/references/commands.md +954 -0
- package/bin/skills/hugging-face-cli/references/examples.md +374 -0
- package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
- package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
- package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
- package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
- package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
- package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
- package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
- package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
- package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
- package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
- package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
- package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
- package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
- package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
- package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
- package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
- package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
- package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
- package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
- package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
- package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
- package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
- package/bin/skills/hugging-face-jobs/index.html +216 -0
- package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
- package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
- package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
- package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
- package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
- package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
- package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
- package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
- package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
- package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
- package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
- package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
- package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
- package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
- package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
- package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
- package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
- package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
- package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
- package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
- package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
- package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
- package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
- package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
- package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
- package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
- package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
- package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
- package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
- package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
- package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
- package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
- package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
- package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
- package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
- package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
- package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
- package/bin/skills/instructor/SKILL.md +740 -0
- package/bin/skills/instructor/references/examples.md +107 -0
- package/bin/skills/instructor/references/providers.md +70 -0
- package/bin/skills/instructor/references/validation.md +606 -0
- package/bin/skills/knowledge-distillation/SKILL.md +458 -0
- package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
- package/bin/skills/lambda-labs/SKILL.md +545 -0
- package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
- package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
- package/bin/skills/langchain/SKILL.md +480 -0
- package/bin/skills/langchain/references/agents.md +499 -0
- package/bin/skills/langchain/references/integration.md +562 -0
- package/bin/skills/langchain/references/rag.md +600 -0
- package/bin/skills/langsmith/SKILL.md +422 -0
- package/bin/skills/langsmith/references/advanced-usage.md +548 -0
- package/bin/skills/langsmith/references/troubleshooting.md +537 -0
- package/bin/skills/litgpt/SKILL.md +469 -0
- package/bin/skills/litgpt/references/custom-models.md +568 -0
- package/bin/skills/litgpt/references/distributed-training.md +451 -0
- package/bin/skills/litgpt/references/supported-models.md +336 -0
- package/bin/skills/litgpt/references/training-recipes.md +619 -0
- package/bin/skills/llama-cpp/SKILL.md +258 -0
- package/bin/skills/llama-cpp/references/optimization.md +89 -0
- package/bin/skills/llama-cpp/references/quantization.md +213 -0
- package/bin/skills/llama-cpp/references/server.md +125 -0
- package/bin/skills/llama-factory/SKILL.md +80 -0
- package/bin/skills/llama-factory/references/_images.md +23 -0
- package/bin/skills/llama-factory/references/advanced.md +1055 -0
- package/bin/skills/llama-factory/references/getting_started.md +349 -0
- package/bin/skills/llama-factory/references/index.md +19 -0
- package/bin/skills/llama-factory/references/other.md +31 -0
- package/bin/skills/llamaguard/SKILL.md +337 -0
- package/bin/skills/llamaindex/SKILL.md +569 -0
- package/bin/skills/llamaindex/references/agents.md +83 -0
- package/bin/skills/llamaindex/references/data_connectors.md +108 -0
- package/bin/skills/llamaindex/references/query_engines.md +406 -0
- package/bin/skills/llava/SKILL.md +304 -0
- package/bin/skills/llava/references/training.md +197 -0
- package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
- package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
- package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
- package/bin/skills/long-context/SKILL.md +536 -0
- package/bin/skills/long-context/references/extension_methods.md +468 -0
- package/bin/skills/long-context/references/fine_tuning.md +611 -0
- package/bin/skills/long-context/references/rope.md +402 -0
- package/bin/skills/mamba/SKILL.md +260 -0
- package/bin/skills/mamba/references/architecture-details.md +206 -0
- package/bin/skills/mamba/references/benchmarks.md +255 -0
- package/bin/skills/mamba/references/training-guide.md +388 -0
- package/bin/skills/megatron-core/SKILL.md +366 -0
- package/bin/skills/megatron-core/references/benchmarks.md +249 -0
- package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
- package/bin/skills/megatron-core/references/production-examples.md +473 -0
- package/bin/skills/megatron-core/references/training-recipes.md +547 -0
- package/bin/skills/miles/SKILL.md +315 -0
- package/bin/skills/miles/references/api-reference.md +141 -0
- package/bin/skills/miles/references/troubleshooting.md +352 -0
- package/bin/skills/mlflow/SKILL.md +704 -0
- package/bin/skills/mlflow/references/deployment.md +744 -0
- package/bin/skills/mlflow/references/model-registry.md +770 -0
- package/bin/skills/mlflow/references/tracking.md +680 -0
- package/bin/skills/modal/SKILL.md +341 -0
- package/bin/skills/modal/references/advanced-usage.md +503 -0
- package/bin/skills/modal/references/troubleshooting.md +494 -0
- package/bin/skills/model-merging/SKILL.md +539 -0
- package/bin/skills/model-merging/references/evaluation.md +462 -0
- package/bin/skills/model-merging/references/examples.md +428 -0
- package/bin/skills/model-merging/references/methods.md +352 -0
- package/bin/skills/model-pruning/SKILL.md +495 -0
- package/bin/skills/model-pruning/references/wanda.md +347 -0
- package/bin/skills/moe-training/SKILL.md +526 -0
- package/bin/skills/moe-training/references/architectures.md +432 -0
- package/bin/skills/moe-training/references/inference.md +348 -0
- package/bin/skills/moe-training/references/training.md +425 -0
- package/bin/skills/nanogpt/SKILL.md +290 -0
- package/bin/skills/nanogpt/references/architecture.md +382 -0
- package/bin/skills/nanogpt/references/data.md +476 -0
- package/bin/skills/nanogpt/references/training.md +564 -0
- package/bin/skills/nemo-curator/SKILL.md +383 -0
- package/bin/skills/nemo-curator/references/deduplication.md +87 -0
- package/bin/skills/nemo-curator/references/filtering.md +102 -0
- package/bin/skills/nemo-evaluator/SKILL.md +494 -0
- package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
- package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
- package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
- package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
- package/bin/skills/nemo-guardrails/SKILL.md +297 -0
- package/bin/skills/nnsight/SKILL.md +436 -0
- package/bin/skills/nnsight/references/README.md +78 -0
- package/bin/skills/nnsight/references/api.md +344 -0
- package/bin/skills/nnsight/references/tutorials.md +300 -0
- package/bin/skills/openrlhf/SKILL.md +249 -0
- package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
- package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
- package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
- package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
- package/bin/skills/outlines/SKILL.md +652 -0
- package/bin/skills/outlines/references/backends.md +615 -0
- package/bin/skills/outlines/references/examples.md +773 -0
- package/bin/skills/outlines/references/json_generation.md +652 -0
- package/bin/skills/peft/SKILL.md +431 -0
- package/bin/skills/peft/references/advanced-usage.md +514 -0
- package/bin/skills/peft/references/troubleshooting.md +480 -0
- package/bin/skills/phoenix/SKILL.md +475 -0
- package/bin/skills/phoenix/references/advanced-usage.md +619 -0
- package/bin/skills/phoenix/references/troubleshooting.md +538 -0
- package/bin/skills/pinecone/SKILL.md +358 -0
- package/bin/skills/pinecone/references/deployment.md +181 -0
- package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
- package/bin/skills/pytorch-fsdp/references/index.md +7 -0
- package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
- package/bin/skills/pytorch-lightning/SKILL.md +346 -0
- package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
- package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
- package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
- package/bin/skills/pyvene/SKILL.md +473 -0
- package/bin/skills/pyvene/references/README.md +73 -0
- package/bin/skills/pyvene/references/api.md +383 -0
- package/bin/skills/pyvene/references/tutorials.md +376 -0
- package/bin/skills/qdrant/SKILL.md +493 -0
- package/bin/skills/qdrant/references/advanced-usage.md +648 -0
- package/bin/skills/qdrant/references/troubleshooting.md +631 -0
- package/bin/skills/ray-data/SKILL.md +326 -0
- package/bin/skills/ray-data/references/integration.md +82 -0
- package/bin/skills/ray-data/references/transformations.md +83 -0
- package/bin/skills/ray-train/SKILL.md +406 -0
- package/bin/skills/ray-train/references/multi-node.md +628 -0
- package/bin/skills/rwkv/SKILL.md +260 -0
- package/bin/skills/rwkv/references/architecture-details.md +344 -0
- package/bin/skills/rwkv/references/rwkv7.md +386 -0
- package/bin/skills/rwkv/references/state-management.md +369 -0
- package/bin/skills/saelens/SKILL.md +386 -0
- package/bin/skills/saelens/references/README.md +70 -0
- package/bin/skills/saelens/references/api.md +333 -0
- package/bin/skills/saelens/references/tutorials.md +318 -0
- package/bin/skills/segment-anything/SKILL.md +500 -0
- package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
- package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
- package/bin/skills/sentence-transformers/SKILL.md +255 -0
- package/bin/skills/sentence-transformers/references/models.md +123 -0
- package/bin/skills/sentencepiece/SKILL.md +235 -0
- package/bin/skills/sentencepiece/references/algorithms.md +200 -0
- package/bin/skills/sentencepiece/references/training.md +304 -0
- package/bin/skills/sglang/SKILL.md +442 -0
- package/bin/skills/sglang/references/deployment.md +490 -0
- package/bin/skills/sglang/references/radix-attention.md +413 -0
- package/bin/skills/sglang/references/structured-generation.md +541 -0
- package/bin/skills/simpo/SKILL.md +219 -0
- package/bin/skills/simpo/references/datasets.md +478 -0
- package/bin/skills/simpo/references/hyperparameters.md +452 -0
- package/bin/skills/simpo/references/loss-functions.md +350 -0
- package/bin/skills/skypilot/SKILL.md +509 -0
- package/bin/skills/skypilot/references/advanced-usage.md +491 -0
- package/bin/skills/skypilot/references/troubleshooting.md +570 -0
- package/bin/skills/slime/SKILL.md +464 -0
- package/bin/skills/slime/references/api-reference.md +392 -0
- package/bin/skills/slime/references/troubleshooting.md +386 -0
- package/bin/skills/speculative-decoding/SKILL.md +467 -0
- package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
- package/bin/skills/speculative-decoding/references/medusa.md +350 -0
- package/bin/skills/stable-diffusion/SKILL.md +519 -0
- package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
- package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
- package/bin/skills/tensorboard/SKILL.md +629 -0
- package/bin/skills/tensorboard/references/integrations.md +638 -0
- package/bin/skills/tensorboard/references/profiling.md +545 -0
- package/bin/skills/tensorboard/references/visualization.md +620 -0
- package/bin/skills/tensorrt-llm/SKILL.md +187 -0
- package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
- package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
- package/bin/skills/tensorrt-llm/references/serving.md +470 -0
- package/bin/skills/tinker/SKILL.md +362 -0
- package/bin/skills/tinker/references/api-reference.md +168 -0
- package/bin/skills/tinker/references/getting-started.md +157 -0
- package/bin/skills/tinker/references/loss-functions.md +163 -0
- package/bin/skills/tinker/references/models-and-lora.md +139 -0
- package/bin/skills/tinker/references/recipes.md +280 -0
- package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
- package/bin/skills/tinker/references/rendering.md +243 -0
- package/bin/skills/tinker/references/supervised-learning.md +232 -0
- package/bin/skills/tinker-training-cost/SKILL.md +187 -0
- package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
- package/bin/skills/torchforge/SKILL.md +433 -0
- package/bin/skills/torchforge/references/api-reference.md +327 -0
- package/bin/skills/torchforge/references/troubleshooting.md +409 -0
- package/bin/skills/torchtitan/SKILL.md +358 -0
- package/bin/skills/torchtitan/references/checkpoint.md +181 -0
- package/bin/skills/torchtitan/references/custom-models.md +258 -0
- package/bin/skills/torchtitan/references/float8.md +133 -0
- package/bin/skills/torchtitan/references/fsdp.md +126 -0
- package/bin/skills/transformer-lens/SKILL.md +346 -0
- package/bin/skills/transformer-lens/references/README.md +54 -0
- package/bin/skills/transformer-lens/references/api.md +362 -0
- package/bin/skills/transformer-lens/references/tutorials.md +339 -0
- package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
- package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
- package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
- package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
- package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
- package/bin/skills/unsloth/SKILL.md +80 -0
- package/bin/skills/unsloth/references/index.md +7 -0
- package/bin/skills/unsloth/references/llms-full.md +16799 -0
- package/bin/skills/unsloth/references/llms-txt.md +12044 -0
- package/bin/skills/unsloth/references/llms.md +82 -0
- package/bin/skills/verl/SKILL.md +391 -0
- package/bin/skills/verl/references/api-reference.md +301 -0
- package/bin/skills/verl/references/troubleshooting.md +391 -0
- package/bin/skills/vllm/SKILL.md +364 -0
- package/bin/skills/vllm/references/optimization.md +226 -0
- package/bin/skills/vllm/references/quantization.md +284 -0
- package/bin/skills/vllm/references/server-deployment.md +255 -0
- package/bin/skills/vllm/references/troubleshooting.md +447 -0
- package/bin/skills/weights-and-biases/SKILL.md +590 -0
- package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
- package/bin/skills/weights-and-biases/references/integrations.md +700 -0
- package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
- package/bin/skills/whisper/SKILL.md +317 -0
- package/bin/skills/whisper/references/languages.md +189 -0
- package/bin/synsc +0 -0
- package/package.json +10 -0
@@ -0,0 +1,425 @@
# MoE Training Guide

Complete training guide based on DeepSpeed official documentation and production practices.

## Table of Contents
- DeepSpeed MoE Setup
- Training Parameters
- PR-MoE (Pyramid-Residual-MoE)
- Mixture-of-Students (MoS)
- Hyperparameter Tuning
- Production Training

## DeepSpeed MoE Setup

**Source**: DeepSpeed MoE Tutorial (https://www.deepspeed.ai/tutorials/mixture-of-experts-nlg/)

### Requirements

```bash
# Install DeepSpeed v0.6.0 or higher (quoted so the shell does not parse ">" as a redirect)
pip install "deepspeed>=0.6.0"

# Clone Megatron-DeepSpeed
git clone https://github.com/microsoft/Megatron-DeepSpeed
cd Megatron-DeepSpeed
pip install -r requirements.txt
```

### Basic MoE Configuration

```json
{
  "train_batch_size": 256,
  "gradient_accumulation_steps": 1,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16
  },
  "moe": {
    "enabled": true,
    "num_experts": 128,
    "expert_parallel_size": 8,
    "moe_loss_coeff": 0.01,
    "train_capacity_factor": 1.25,
    "eval_capacity_factor": 2.0,
    "min_capacity": 4,
    "drop_tokens": true
  },
  "zero_optimization": {
    "stage": 1
  }
}
```

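As a quick sanity check on a config like the one above (plain Python, not a DeepSpeed API), the MoE fields can be parsed and validated: the expert-parallel size must evenly divide the expert count.

```python
import json

# Subset of the MoE config above, inlined for the example
cfg = json.loads("""
{
  "train_batch_size": 256,
  "moe": {
    "num_experts": 128,
    "expert_parallel_size": 8,
    "train_capacity_factor": 1.25,
    "min_capacity": 4
  }
}
""")

moe = cfg["moe"]
# expert_parallel_size must evenly divide num_experts
assert moe["num_experts"] % moe["expert_parallel_size"] == 0
experts_per_gpu = moe["num_experts"] // moe["expert_parallel_size"]
print(experts_per_gpu)  # 16
```
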
## Training Parameters

### Core MoE Parameters

**From DeepSpeed documentation:**

1. **`--num-experts`**
   - Number of experts per MoE layer
   - Recommended: 128 experts
   - Range: 8-256 depending on scale

2. **`--moe-expert-parallel-size`**
   - Degree of expert parallelism
   - Distributes experts across GPUs
   - Example: 128 experts / 8 GPUs = 16 experts per GPU

3. **`--moe-loss-coeff`**
   - MoE auxiliary loss coefficient
   - Recommended: 0.01
   - Controls load balancing strength

4. **`--moe-train-capacity-factor`**
   - Training capacity multiplier
   - Default: 1.25
   - Formula: capacity = (tokens / num_experts) × capacity_factor

5. **`--moe-eval-capacity-factor`**
   - Evaluation capacity multiplier
   - Default: 2.0 (no token dropping during eval)

6. **`--moe-min-capacity`**
   - Minimum expert capacity
   - Default: 4
   - Ensures each expert processes a minimum number of tokens

7. **`--disable-moe-token-dropping`**
   - Removes expert capacity limits
   - All tokens processed (no dropping)
   - May increase memory usage

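The capacity formula from item 4 can be checked with a few lines of plain Python (an illustrative helper, not DeepSpeed code). The token count assumes a micro-batch of 4 sequences of length 2048, as in the example script:

```python
import math

def expert_capacity(tokens, num_experts, capacity_factor, min_capacity=4):
    # capacity = (tokens / num_experts) * capacity_factor, floored at min_capacity
    return max(min_capacity, math.ceil(tokens / num_experts * capacity_factor))

tokens = 4 * 2048  # micro-batch 4 x seq-length 2048 = 8192 tokens per rank per step

print(expert_capacity(tokens, num_experts=128, capacity_factor=1.25))  # 80 (training)
print(expert_capacity(tokens, num_experts=128, capacity_factor=2.0))   # 128 (eval)
print(expert_capacity(16, num_experts=128, capacity_factor=1.25))      # 4 (min_capacity floor)
```
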
### Example Training Script

```bash
#!/bin/bash

deepspeed --num_gpus 8 pretrain_gpt_moe.py \
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 1 \
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --seq-length 2048 \
    --max-position-embeddings 2048 \
    --micro-batch-size 4 \
    --global-batch-size 256 \
    --train-iters 500000 \
    --lr 0.0001 \
    --min-lr 0.00001 \
    --lr-decay-style cosine \
    --lr-warmup-iters 2000 \
    --clip-grad 1.0 \
    --weight-decay 0.1 \
    --num-experts 128 \
    --moe-expert-parallel-size 8 \
    --moe-loss-coeff 0.01 \
    --moe-train-capacity-factor 1.25 \
    --moe-eval-capacity-factor 2.0 \
    --moe-min-capacity 4 \
    --fp16 \
    --deepspeed \
    --deepspeed_config ds_config_moe.json \
    --data-path /path/to/data \
    --vocab-file /path/to/vocab.json \
    --merge-file /path/to/merges.txt \
    --save-interval 5000 \
    --eval-interval 1000 \
    --eval-iters 100
```

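For budgeting, the script's batch settings imply a fixed token throughput per optimizer step. Simple arithmetic, no DeepSpeed required:

```python
global_batch = 256      # --global-batch-size
seq_len = 2048          # --seq-length
train_iters = 500_000   # --train-iters

tokens_per_step = global_batch * seq_len
total_tokens = tokens_per_step * train_iters

print(tokens_per_step)     # 524288 tokens per optimizer step
print(total_tokens / 1e9)  # 262.144, i.e. roughly 262B training tokens
```
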
## PR-MoE: Pyramid-Residual-MoE

**Source**: DeepSpeed documentation - improves parameter efficiency 3× over standard MoE

### Architecture

PR-MoE uses:
- A varying number of experts per layer (pyramid structure)
- Residual connections between expert layers
- Better parameter efficiency

### Configuration

```bash
# PR-MoE specific parameters (fragment of the full training command)
--num-experts "[128, 64, 32, 16]"   # Pyramid: different experts per layer
--mlp-type residual                 # Use residual connections
--moe-expert-parallel-size 4
--moe-loss-coeff 0.01
```

### Full PR-MoE Training

```bash
# Pyramid expert counts via --num-experts "[128, 64, 32, 16]",
# residual MoE via --mlp-type residual (inline comments after "\" would break the command)
deepspeed --num_gpus 8 pretrain_gpt_moe.py \
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --seq-length 2048 \
    --max-position-embeddings 2048 \
    --micro-batch-size 4 \
    --global-batch-size 256 \
    --num-experts "[128, 64, 32, 16]" \
    --mlp-type residual \
    --moe-expert-parallel-size 4 \
    --moe-loss-coeff 0.01 \
    --moe-train-capacity-factor 1.25 \
    --fp16 \
    --deepspeed \
    --deepspeed_config ds_config_moe.json \
    --data-path /path/to/data \
    --save-interval 5000
```

**Benefits**:
- 3× better parameter efficiency vs standard MoE
- Fewer total parameters for the same performance
- Better gradient flow with residual connections

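The parameter saving is easy to see with rough arithmetic. The sketch below is illustrative only: it counts just the expert-FFN weights at four hypothetical MoE layers with hidden size 1024 and a 4× FFN, ignoring biases, routers, attention, and all shared parameters.

```python
h = 1024
ffn_params = 2 * h * (4 * h)    # one expert's FFN weights: h -> 4h and 4h -> h

uniform = [128, 128, 128, 128]  # standard MoE: same expert count at every MoE layer
pyramid = [128, 64, 32, 16]     # PR-MoE: pyramid expert counts

uniform_total = sum(n * ffn_params for n in uniform)
pyramid_total = sum(n * ffn_params for n in pyramid)

print(uniform_total / 1e9)      # ~4.29B expert parameters
print(pyramid_total / 1e9)      # ~2.01B expert parameters
```
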
## Mixture-of-Students (MoS)

**Source**: DeepSpeed documentation - knowledge distillation for MoE

### Overview

MoS = MoE + knowledge distillation:
- Student: MoE model (being trained)
- Teacher: dense model (pre-trained)
- Transfers knowledge from the dense teacher to the sparse MoE student

### Configuration

```bash
# MoS parameters (fragment of the full training command)
--mos                               # Enable MoS distillation
--load-teacher /path/to/teacher     # Teacher model checkpoint
--teacher-forward                   # Enable teacher forward pass
--teacher-model-parallel-size 1
```

### Full MoS Training

```bash
deepspeed --num_gpus 8 pretrain_gpt_moe.py \
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --num-experts 128 \
    --moe-expert-parallel-size 8 \
    --moe-loss-coeff 0.01 \
    --mos \                                  # Enable MoS
    --load-teacher /path/to/dense/teacher \  # Teacher checkpoint
    --teacher-forward \
    --teacher-model-parallel-size 1 \
    --fp16 \
    --deepspeed \
    --deepspeed_config ds_config_moe.json \
    --data-path /path/to/data
```
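Expert parallelism shards the experts evenly across the GPUs of an expert-parallel group, so the flags above imply a fixed number of experts resident per GPU. A quick illustrative sanity check (not a DeepSpeed API):

```python
def experts_per_gpu(num_experts: int, expert_parallel_size: int) -> int:
    """Each GPU in an expert-parallel group holds an equal shard of the experts."""
    assert num_experts % expert_parallel_size == 0, "experts must divide evenly"
    return num_experts // expert_parallel_size

# The config above: 128 experts sharded 8 ways -> 16 experts resident per GPU
print(experts_per_gpu(128, 8))  # -> 16
```

If the division is not even, DeepSpeed cannot shard the experts, which is why `--num-experts` is normally chosen as a multiple of `--moe-expert-parallel-size`.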

### Staged Distillation

**Recommended**: stop distillation partway through training.

```python
# In the training loop
if iteration < 400000:
    # Distillation phase: train the MoE student against the teacher
    loss = moe_loss + distillation_loss
else:
    # Stop distillation, train the MoE on its own loss only
    loss = moe_loss
```

**Benefits**:
- Faster convergence
- Better final performance
- Preserves teacher knowledge while allowing MoE specialization
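The `distillation_loss` term is typically a KL divergence between the teacher's and the student's temperature-softened output distributions. A minimal pure-Python sketch of the staged combination (the temperature `T`, weighting `alpha`, and helper names are illustrative assumptions, not the Megatron-DeepSpeed implementation):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-softened softmax over a list of logits."""
    exps = [math.exp(x / T) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_div(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mos_loss(moe_loss, student_logits, teacher_logits,
             T=2.0, alpha=0.5, iteration=0, stop_at=400_000):
    """Staged MoS loss: add the distillation term only before `stop_at`."""
    if iteration >= stop_at:
        return moe_loss
    distill = kl_div(softmax(teacher_logits, T), softmax(student_logits, T))
    return moe_loss + alpha * distill
```

When student and teacher agree exactly, the KL term is zero and the loss reduces to the plain MoE loss, which is also what the post-`stop_at` stage returns.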

## Hyperparameter Tuning

### Learning Rate

**Key insight**: MoE models need a lower learning rate than dense models.

```bash
# Dense model
--lr 0.0006 \
--min-lr 0.00006

# MoE model (3-6× lower)
--lr 0.0001 \
--min-lr 0.00001
```

### LR Decay

**Extend the decay schedule** for MoE:

```bash
# Dense model
--lr-decay-iters 300000 \
--lr-warmup-iters 2000

# MoE model (1.5-2× longer)
--lr-decay-iters 500000 \
--lr-warmup-iters 2000
```

### Capacity Factor

**Tune based on the memory/speed tradeoff.** Pick one `train_capacity_factor` value (the comments list the common choices):

```json
{
  "moe": {
    // Training: lower capacity is faster but drops more tokens
    // 1.0 = aggressive, 1.25 = balanced (recommended), 1.5 = conservative
    "train_capacity_factor": 1.25,

    // Evaluation: higher capacity so no tokens are dropped
    "eval_capacity_factor": 2.0
  }
}
```
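The capacity factor fixes how many tokens each expert may accept per batch; tokens routed beyond that budget are dropped. The arithmetic, as a quick illustrative sketch:

```python
import math

def expert_capacity(tokens_per_batch: int, num_experts: int,
                    capacity_factor: float) -> int:
    """Max tokens one expert processes: factor * (tokens / experts)."""
    return math.ceil(capacity_factor * tokens_per_batch / num_experts)

# 4096 tokens routed over 64 experts:
print(expert_capacity(4096, 64, 1.0))   # -> 64, an exactly even split; drops on any imbalance
print(expert_capacity(4096, 64, 1.25))  # -> 80, 25% headroom (recommended)
print(expert_capacity(4096, 64, 2.0))   # -> 128, eval setting; rarely drops
```

The headroom is why a factor above 1.0 trades memory for fewer dropped tokens: each expert's buffer is allocated at this capacity regardless of how many tokens it actually receives.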

### Load Balancing Coefficient

Pick one `moe_loss_coeff` value:

```json
{
  "moe": {
    // 0.001 = weak, 0.01 = standard (recommended), 0.1 = strong balancing
    "moe_loss_coeff": 0.01
  }
}
```

**Rule**: if load imbalance persists, increase the coefficient.
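The coefficient scales an auxiliary load-balancing loss. A common formulation (the Switch Transformer style, shown here as an illustrative pure-Python sketch, not the exact DeepSpeed code) multiplies, per expert, the fraction of tokens dispatched to it by the mean router probability it received:

```python
def load_balancing_loss(token_fractions, router_probs, coeff=0.01):
    """
    token_fractions[i]: fraction of tokens dispatched to expert i
    router_probs[i]:    mean router probability assigned to expert i
    Minimized (value = coeff) when both are uniform at 1/num_experts.
    """
    n = len(token_fractions)
    return coeff * n * sum(f * p for f, p in zip(token_fractions, router_probs))

# Perfectly balanced routing over 4 experts hits the minimum, coeff * 1.0
print(load_balancing_loss([0.25] * 4, [0.25] * 4))  # -> 0.01
```

Skewed routing raises the product terms for hot experts, so increasing `coeff` pushes the router harder toward a uniform split.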

## Production Training

### Performance Benchmarks

**From the DeepSpeed documentation:**

Standard MoE:
- **5× training cost reduction** vs a dense model of the same quality
- **3× model size reduction** with PR-MoE

Example:
- Dense 13B model: 100% cost
- MoE at 13B quality (128 experts): 20% cost (5× faster)
- PR-MoE at 13B quality: 15% cost with 3× fewer parameters

### Recommended Dataset

**The Pile** - a publicly available training dataset:
- 800GB of diverse text
- Standard benchmark for MoE training
- Used in the DeepSpeed examples

### Example Configs

**Small MoE (8 experts)**:

```bash
deepspeed --num_gpus 4 pretrain_gpt_moe.py \
    --num-layers 12 \
    --hidden-size 768 \
    --num-attention-heads 12 \
    --num-experts 8 \
    --moe-expert-parallel-size 2 \
    --global-batch-size 128 \
    --fp16
```

**Medium MoE (64 experts)**:

```bash
deepspeed --num_gpus 16 pretrain_gpt_moe.py \
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --num-experts 64 \
    --moe-expert-parallel-size 8 \
    --global-batch-size 256 \
    --fp16
```

**Large MoE (128 experts)**:

```bash
deepspeed --num_gpus 32 pretrain_gpt_moe.py \
    --num-layers 32 \
    --hidden-size 2048 \
    --num-attention-heads 32 \
    --num-experts 128 \
    --moe-expert-parallel-size 16 \
    --global-batch-size 512 \
    --fp16
```

### Monitoring

Key metrics to track (`expert.token_count` here stands in for whatever per-expert counters your setup exposes):

```python
# Expert load balance
expert_counts = [expert.token_count for expert in experts]
load_imbalance = max(expert_counts) / min(expert_counts)
# Should be close to 1.0 (perfectly balanced)
# If > 2.0, increase moe_loss_coeff

# Expert utilization
utilized_experts = sum(count > 0 for count in expert_counts)
utilization_rate = utilized_experts / num_experts
# Should be close to 1.0 (all experts used)

# Token dropping rate
dropped_tokens = total_tokens - processed_tokens
drop_rate = dropped_tokens / total_tokens
# Should be low (<5%) during training
```
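With a plain list of per-expert token counts, the same checks become fully runnable; a small illustrative helper (not a DeepSpeed API) that also guards against division by an idle expert's zero count:

```python
def routing_health(expert_counts, total_tokens):
    """Summarize expert routing from a list of per-expert token counts."""
    nonzero = [c for c in expert_counts if c > 0]
    return {
        # want ~1.0; if > 2.0, increase moe_loss_coeff
        "load_imbalance": max(expert_counts) / min(nonzero),
        # want ~1.0 (all experts used)
        "utilization": len(nonzero) / len(expert_counts),
        # want < 5% during training
        "drop_rate": 1 - sum(expert_counts) / total_tokens,
    }

stats = routing_health([120, 100, 95, 85], total_tokens=410)
print(stats)
```

Logging these three numbers per evaluation interval is usually enough to catch routing collapse early.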

## Troubleshooting

### Issue: Load Imbalance

**Symptoms**: a few experts receive most of the tokens.

**Solutions**:
1. Increase `moe_loss_coeff` (0.01 → 0.1)
2. Reduce `train_capacity_factor` (forces redistribution)
3. Add noise to the router (gating network) logits

### Issue: High Memory Usage

**Solutions**:
1. Enable ZeRO Stage 1 or 2
2. Reduce `train_capacity_factor`
3. Enable `drop_tokens`
4. Increase `moe_expert_parallel_size`

### Issue: Unstable Training

**Solutions**:
1. Lower the learning rate
2. Increase warmup steps
3. Use gradient clipping (`--clip-grad 1.0`)
4. Reduce the router z-loss coefficient
## Resources

- **DeepSpeed MoE Tutorial**: https://www.deepspeed.ai/tutorials/mixture-of-experts-nlg/
- **Megatron-DeepSpeed**: https://github.com/microsoft/Megatron-DeepSpeed
- **Example Scripts**: `examples_deepspeed/MoE/`
---
name: nanogpt
description: Educational GPT implementation in ~300 lines. Reproduces GPT-2 (124M) on OpenWebText. Clean, hackable code for learning transformers. By Andrej Karpathy. Perfect for understanding GPT architecture from scratch. Train on Shakespeare (CPU) or OpenWebText (multi-GPU).
version: 1.0.0
author: Synthetic Sciences
license: MIT
tags: [Model Architecture, NanoGPT, GPT-2, Educational, Andrej Karpathy, Transformer, Minimalist, From Scratch, Training]
dependencies: [torch, transformers, datasets, tiktoken, wandb]
---
# nanoGPT - Minimalist GPT Training

## Quick start

nanoGPT is a simplified GPT implementation designed for learning and experimentation.

**Installation**:
```bash
pip install torch numpy transformers datasets tiktoken wandb tqdm
```

**Train on Shakespeare** (CPU-friendly):
```bash
# Prepare data
python data/shakespeare_char/prepare.py

# Train (about 5 minutes on CPU)
python train.py config/train_shakespeare_char.py

# Generate text
python sample.py --out_dir=out-shakespeare-char
```

**Sample output**:
```
ROMEO:
What say'st thou? Shall I speak, and be a man?

JULIET:
I am afeard, and yet I'll speak; for thou art
One that hath been a man, and yet I know not
What thou art.
```

## Common workflows

### Workflow 1: Character-level Shakespeare

**Complete training pipeline**:
```bash
# Step 1: Prepare data (creates train.bin, val.bin)
python data/shakespeare_char/prepare.py

# Step 2: Train a small model
python train.py config/train_shakespeare_char.py

# Step 3: Generate text
python sample.py --out_dir=out-shakespeare-char
```

**Config** (`config/train_shakespeare_char.py`):
```python
# Model config
n_layer = 6       # 6 transformer layers
n_head = 6        # 6 attention heads
n_embd = 384      # 384-dim embeddings
block_size = 256  # 256-character context

# Training config
batch_size = 64
learning_rate = 1e-3
max_iters = 5000
eval_interval = 500

# Hardware
device = 'cpu'    # Or 'cuda'
compile = False   # Set True on PyTorch 2.0+
```

**Training time**: ~5 minutes (CPU), ~1 minute (GPU)

### Workflow 2: Reproduce GPT-2 (124M)

**Multi-GPU training on OpenWebText**:
```bash
# Step 1: Prepare OpenWebText (takes ~1 hour)
python data/openwebtext/prepare.py

# Step 2: Train GPT-2 124M with DDP (8 GPUs)
torchrun --standalone --nproc_per_node=8 \
    train.py config/train_gpt2.py

# Step 3: Sample from the trained model
python sample.py --out_dir=out
```

**Config** (`config/train_gpt2.py`):
```python
# GPT-2 (124M) architecture
n_layer = 12
n_head = 12
n_embd = 768
block_size = 1024
dropout = 0.0

# Training
batch_size = 12
gradient_accumulation_steps = 5 * 8  # Total batch ~0.5M tokens
learning_rate = 6e-4
max_iters = 600000
lr_decay_iters = 600000

# System
compile = True  # PyTorch 2.0
```

**Training time**: ~4 days (8× A100)
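The learning rate follows a linear warmup into a cosine decay down to a floor. A self-contained sketch of that schedule, with the values above; the `min_lr` and `warmup_iters` defaults are assumptions taken from the repo's usual settings, and the exact formula may differ slightly from `train.py`:

```python
import math

def get_lr(it, learning_rate=6e-4, min_lr=6e-5,
           warmup_iters=2000, lr_decay_iters=600000):
    """Linear warmup, then cosine decay to min_lr, then flat."""
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)
```

The peak LR is reached at the end of warmup and the floor only after `lr_decay_iters`, which is why the config sets `lr_decay_iters = max_iters`.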

### Workflow 3: Fine-tune pretrained GPT-2

**Start from an OpenAI checkpoint** by setting `init_from` in the config (valid values: `gpt2`, `gpt2-medium`, `gpt2-large`, `gpt2-xl`); the pretrained weights are downloaded and loaded automatically:

```bash
python train.py config/finetune_shakespeare.py
```

**Example config** (`config/finetune_shakespeare.py`):
```python
# Start from GPT-2
init_from = 'gpt2'

# Dataset
dataset = 'shakespeare_char'
batch_size = 1
block_size = 1024

# Fine-tuning
learning_rate = 3e-5  # Lower LR for fine-tuning
max_iters = 2000
warmup_iters = 100

# Regularization
weight_decay = 1e-1
```

### Workflow 4: Custom dataset

**Train on your own text**:
```python
# data/custom/prepare.py
import numpy as np

# Load your data
with open('my_data.txt', 'r') as f:
    text = f.read()

# Create character mappings
chars = sorted(list(set(text)))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# Tokenize
data = np.array([stoi[ch] for ch in text], dtype=np.uint16)

# Split train/val
n = len(data)
train_data = data[:int(n * 0.9)]
val_data = data[int(n * 0.9):]

# Save
train_data.tofile('data/custom/train.bin')
val_data.tofile('data/custom/val.bin')
```

**Train**:
```bash
python data/custom/prepare.py
python train.py --dataset=custom
```
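On the training side, these `.bin` files are read back as flat uint16 arrays and random contiguous windows become (input, target) pairs. A simplified sketch of that loading step (the file name, window size, and index are illustrative, not the exact `train.py` code):

```python
import numpy as np

# Write a toy token file in the same uint16 format prepare.py produces
tokens = np.arange(100, dtype=np.uint16)
tokens.tofile('toy_train.bin')

# Load it back memory-mapped, as done for large datasets
data = np.memmap('toy_train.bin', dtype=np.uint16, mode='r')

# One training example: a block_size window plus its one-token-shifted target
block_size = 8
ix = 42  # chosen at random during training
x = data[ix : ix + block_size].astype(np.int64)
y = data[ix + 1 : ix + 1 + block_size].astype(np.int64)  # next-token targets
print(x, y)
```

Memory-mapping keeps the dataset out of RAM, which is what makes the multi-gigabyte OpenWebText binaries practical on a single machine.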

## When to use vs alternatives

**Use nanoGPT when**:
- Learning how GPT works
- Experimenting with transformer variants
- Teaching/educational purposes
- Quick prototyping
- Limited compute (it can run on a CPU)

**Simplicity advantages**:
- **~300 lines**: entire model in `model.py`
- **~300 lines**: training loop in `train.py`
- **Hackable**: easy to modify
- **No abstractions**: pure PyTorch

**Use alternatives instead**:
- **HuggingFace Transformers**: production use, many models
- **Megatron-LM**: large-scale distributed training
- **LitGPT**: more architectures, production-ready
- **PyTorch Lightning**: when you need a high-level framework

## Common issues

**Issue: CUDA out of memory**

Reduce batch size or context length:
```python
batch_size = 1    # Reduce from 12
block_size = 512  # Reduce from 1024
gradient_accumulation_steps = 40  # Increase to maintain the effective batch
```
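Whether the effective batch is actually maintained can be checked with tokens per optimizer step = batch_size × gradient_accumulation_steps × block_size (× number of GPUs, if any). A quick illustrative sketch of the arithmetic:

```python
def tokens_per_step(batch_size, grad_accum, block_size):
    """Tokens consumed per optimizer update on one GPU."""
    return batch_size * grad_accum * block_size

def grad_accum_needed(target_tokens, batch_size, block_size):
    """Gradient-accumulation steps needed to hit a target tokens/step."""
    return target_tokens // (batch_size * block_size)

target = tokens_per_step(12, 40, 1024)    # original: 491520 tokens/step
print(grad_accum_needed(target, 1, 512))  # -> 960 accumulation steps to match
```

Accumulation trades wall-clock time for memory: gradients are summed over many small forward/backward passes before a single optimizer step, so peak memory follows the small per-pass batch.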

**Issue: Training too slow**

Enable compilation (PyTorch 2.0+):
```python
compile = True  # ~2× speedup
```

Use mixed precision:
```python
dtype = 'bfloat16'  # Or 'float16'
```

**Issue: Poor generation quality**

Train longer:
```python
max_iters = 10000  # Increase from 5000
```

Lower the temperature:
```python
# In sample.py
temperature = 0.7  # Lower from 1.0
top_k = 200        # Add top-k sampling
```
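Temperature and top-k act at the final softmax: temperature rescales the logits (lower = more conservative), and top-k masks out everything outside the k most likely tokens before sampling. A minimal pure-Python sketch of that filtering (illustrative, not `sample.py`'s exact code):

```python
import math
import random

def sample_next(logits, temperature=0.7, top_k=2, rng=random.random):
    """Sample a token index from logits with temperature and top-k filtering."""
    scaled = [l / temperature for l in logits]
    # Keep only the top_k logits; mask the rest to -inf
    cutoff = sorted(scaled, reverse=True)[top_k - 1]
    scaled = [l if l >= cutoff else float('-inf') for l in scaled]
    # Softmax over the surviving logits (subtract max for stability)
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw from the categorical distribution
    r, acc = rng(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```

With `top_k=1` this degenerates to greedy decoding, and as `temperature` falls toward zero the distribution concentrates on the top token either way.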

**Issue: Can't load GPT-2 weights**

Install transformers:
```bash
pip install transformers
```

Check the model name:
```python
init_from = 'gpt2'  # Valid: gpt2, gpt2-medium, gpt2-large, gpt2-xl
```

## Advanced topics

**Model architecture**: See [references/architecture.md](references/architecture.md) for the GPT block structure, multi-head attention, and MLP layers explained simply.

**Training loop**: See [references/training.md](references/training.md) for the learning rate schedule, gradient accumulation, and distributed data parallel setup.

**Data preparation**: See [references/data.md](references/data.md) for tokenization strategies (character-level vs BPE) and binary format details.

## Hardware requirements

- **Shakespeare (char-level)**:
  - CPU: 5 minutes
  - GPU (T4): 1 minute
  - VRAM: <1GB

- **GPT-2 (124M)**:
  - 1× A100: ~1 week
  - 8× A100: ~4 days
  - VRAM: ~16GB per GPU

- **GPT-2 Medium (350M)**:
  - 8× A100: ~2 weeks
  - VRAM: ~40GB per GPU

**Performance**:
- With `compile=True`: ~2× speedup
- With `dtype='bfloat16'`: ~50% memory reduction
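The 124M figure follows directly from the Workflow 2 architecture; a back-of-envelope count (weight tying between the token embedding and the output head assumed, biases and layernorms ignored for simplicity):

```python
def gpt2_params(n_layer=12, n_embd=768, block_size=1024, vocab_size=50257):
    """Approximate parameter count for a GPT-2-style model."""
    embeddings = vocab_size * n_embd + block_size * n_embd  # token + position
    attn = 4 * n_embd * n_embd       # Q, K, V, and output projections
    mlp = 2 * n_embd * (4 * n_embd)  # up- and down-projection (4x width)
    return embeddings + n_layer * (attn + mlp)

print(f"{gpt2_params() / 1e6:.0f}M")  # -> 124M
```

Most of the non-embedding budget is the 12 transformer blocks at roughly 7M parameters each, which is also why VRAM scales primarily with `n_layer` and `n_embd`.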

## Resources

- GitHub: https://github.com/karpathy/nanoGPT ⭐ 48,000+
- Video: "Let's build GPT" by Andrej Karpathy
- Paper: "Attention Is All You Need" (Vaswani et al.)
- OpenWebText: https://huggingface.co/datasets/Skylion007/openwebtext
- Educational: best for understanding transformers from scratch