@synsci/cli-darwin-x64 1.1.49

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (373)
  1. package/bin/skills/accelerate/SKILL.md +332 -0
  2. package/bin/skills/accelerate/references/custom-plugins.md +453 -0
  3. package/bin/skills/accelerate/references/megatron-integration.md +489 -0
  4. package/bin/skills/accelerate/references/performance.md +525 -0
  5. package/bin/skills/audiocraft/SKILL.md +564 -0
  6. package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
  7. package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
  8. package/bin/skills/autogpt/SKILL.md +403 -0
  9. package/bin/skills/autogpt/references/advanced-usage.md +535 -0
  10. package/bin/skills/autogpt/references/troubleshooting.md +420 -0
  11. package/bin/skills/awq/SKILL.md +310 -0
  12. package/bin/skills/awq/references/advanced-usage.md +324 -0
  13. package/bin/skills/awq/references/troubleshooting.md +344 -0
  14. package/bin/skills/axolotl/SKILL.md +158 -0
  15. package/bin/skills/axolotl/references/api.md +5548 -0
  16. package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
  17. package/bin/skills/axolotl/references/index.md +15 -0
  18. package/bin/skills/axolotl/references/other.md +3563 -0
  19. package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
  20. package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
  21. package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
  22. package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
  23. package/bin/skills/bitsandbytes/SKILL.md +411 -0
  24. package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
  25. package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
  26. package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
  27. package/bin/skills/blip-2/SKILL.md +564 -0
  28. package/bin/skills/blip-2/references/advanced-usage.md +680 -0
  29. package/bin/skills/blip-2/references/troubleshooting.md +526 -0
  30. package/bin/skills/chroma/SKILL.md +406 -0
  31. package/bin/skills/chroma/references/integration.md +38 -0
  32. package/bin/skills/clip/SKILL.md +253 -0
  33. package/bin/skills/clip/references/applications.md +207 -0
  34. package/bin/skills/constitutional-ai/SKILL.md +290 -0
  35. package/bin/skills/crewai/SKILL.md +498 -0
  36. package/bin/skills/crewai/references/flows.md +438 -0
  37. package/bin/skills/crewai/references/tools.md +429 -0
  38. package/bin/skills/crewai/references/troubleshooting.md +480 -0
  39. package/bin/skills/deepspeed/SKILL.md +141 -0
  40. package/bin/skills/deepspeed/references/08.md +17 -0
  41. package/bin/skills/deepspeed/references/09.md +173 -0
  42. package/bin/skills/deepspeed/references/2020.md +378 -0
  43. package/bin/skills/deepspeed/references/2023.md +279 -0
  44. package/bin/skills/deepspeed/references/assets.md +179 -0
  45. package/bin/skills/deepspeed/references/index.md +35 -0
  46. package/bin/skills/deepspeed/references/mii.md +118 -0
  47. package/bin/skills/deepspeed/references/other.md +1191 -0
  48. package/bin/skills/deepspeed/references/tutorials.md +6554 -0
  49. package/bin/skills/dspy/SKILL.md +590 -0
  50. package/bin/skills/dspy/references/examples.md +663 -0
  51. package/bin/skills/dspy/references/modules.md +475 -0
  52. package/bin/skills/dspy/references/optimizers.md +566 -0
  53. package/bin/skills/faiss/SKILL.md +221 -0
  54. package/bin/skills/faiss/references/index_types.md +280 -0
  55. package/bin/skills/flash-attention/SKILL.md +367 -0
  56. package/bin/skills/flash-attention/references/benchmarks.md +215 -0
  57. package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
  58. package/bin/skills/gguf/SKILL.md +427 -0
  59. package/bin/skills/gguf/references/advanced-usage.md +504 -0
  60. package/bin/skills/gguf/references/troubleshooting.md +442 -0
  61. package/bin/skills/gptq/SKILL.md +450 -0
  62. package/bin/skills/gptq/references/calibration.md +337 -0
  63. package/bin/skills/gptq/references/integration.md +129 -0
  64. package/bin/skills/gptq/references/troubleshooting.md +95 -0
  65. package/bin/skills/grpo-rl-training/README.md +97 -0
  66. package/bin/skills/grpo-rl-training/SKILL.md +572 -0
  67. package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
  68. package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
  69. package/bin/skills/guidance/SKILL.md +572 -0
  70. package/bin/skills/guidance/references/backends.md +554 -0
  71. package/bin/skills/guidance/references/constraints.md +674 -0
  72. package/bin/skills/guidance/references/examples.md +767 -0
  73. package/bin/skills/hqq/SKILL.md +445 -0
  74. package/bin/skills/hqq/references/advanced-usage.md +528 -0
  75. package/bin/skills/hqq/references/troubleshooting.md +503 -0
  76. package/bin/skills/hugging-face-cli/SKILL.md +191 -0
  77. package/bin/skills/hugging-face-cli/references/commands.md +954 -0
  78. package/bin/skills/hugging-face-cli/references/examples.md +374 -0
  79. package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
  80. package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
  81. package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
  82. package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
  83. package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
  84. package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
  85. package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
  86. package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
  87. package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
  88. package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
  89. package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
  90. package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
  91. package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
  92. package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
  93. package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
  94. package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
  95. package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
  96. package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
  97. package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
  98. package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
  99. package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
  100. package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
  101. package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
  102. package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
  103. package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
  104. package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
  105. package/bin/skills/hugging-face-jobs/index.html +216 -0
  106. package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
  107. package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
  108. package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
  109. package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
  110. package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
  111. package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
  112. package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
  113. package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
  114. package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
  115. package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
  116. package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
  117. package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
  118. package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
  119. package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
  120. package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
  121. package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
  122. package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
  123. package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
  124. package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
  125. package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
  126. package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
  127. package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
  128. package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
  129. package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
  130. package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
  131. package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
  132. package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
  133. package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
  134. package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
  135. package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
  136. package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
  137. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
  138. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
  139. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
  140. package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
  141. package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
  142. package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
  143. package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
  144. package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
  145. package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
  146. package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
  147. package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
  148. package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
  149. package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
  150. package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
  151. package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
  152. package/bin/skills/instructor/SKILL.md +740 -0
  153. package/bin/skills/instructor/references/examples.md +107 -0
  154. package/bin/skills/instructor/references/providers.md +70 -0
  155. package/bin/skills/instructor/references/validation.md +606 -0
  156. package/bin/skills/knowledge-distillation/SKILL.md +458 -0
  157. package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
  158. package/bin/skills/lambda-labs/SKILL.md +545 -0
  159. package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
  160. package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
  161. package/bin/skills/langchain/SKILL.md +480 -0
  162. package/bin/skills/langchain/references/agents.md +499 -0
  163. package/bin/skills/langchain/references/integration.md +562 -0
  164. package/bin/skills/langchain/references/rag.md +600 -0
  165. package/bin/skills/langsmith/SKILL.md +422 -0
  166. package/bin/skills/langsmith/references/advanced-usage.md +548 -0
  167. package/bin/skills/langsmith/references/troubleshooting.md +537 -0
  168. package/bin/skills/litgpt/SKILL.md +469 -0
  169. package/bin/skills/litgpt/references/custom-models.md +568 -0
  170. package/bin/skills/litgpt/references/distributed-training.md +451 -0
  171. package/bin/skills/litgpt/references/supported-models.md +336 -0
  172. package/bin/skills/litgpt/references/training-recipes.md +619 -0
  173. package/bin/skills/llama-cpp/SKILL.md +258 -0
  174. package/bin/skills/llama-cpp/references/optimization.md +89 -0
  175. package/bin/skills/llama-cpp/references/quantization.md +213 -0
  176. package/bin/skills/llama-cpp/references/server.md +125 -0
  177. package/bin/skills/llama-factory/SKILL.md +80 -0
  178. package/bin/skills/llama-factory/references/_images.md +23 -0
  179. package/bin/skills/llama-factory/references/advanced.md +1055 -0
  180. package/bin/skills/llama-factory/references/getting_started.md +349 -0
  181. package/bin/skills/llama-factory/references/index.md +19 -0
  182. package/bin/skills/llama-factory/references/other.md +31 -0
  183. package/bin/skills/llamaguard/SKILL.md +337 -0
  184. package/bin/skills/llamaindex/SKILL.md +569 -0
  185. package/bin/skills/llamaindex/references/agents.md +83 -0
  186. package/bin/skills/llamaindex/references/data_connectors.md +108 -0
  187. package/bin/skills/llamaindex/references/query_engines.md +406 -0
  188. package/bin/skills/llava/SKILL.md +304 -0
  189. package/bin/skills/llava/references/training.md +197 -0
  190. package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
  191. package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
  192. package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
  193. package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
  194. package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
  195. package/bin/skills/long-context/SKILL.md +536 -0
  196. package/bin/skills/long-context/references/extension_methods.md +468 -0
  197. package/bin/skills/long-context/references/fine_tuning.md +611 -0
  198. package/bin/skills/long-context/references/rope.md +402 -0
  199. package/bin/skills/mamba/SKILL.md +260 -0
  200. package/bin/skills/mamba/references/architecture-details.md +206 -0
  201. package/bin/skills/mamba/references/benchmarks.md +255 -0
  202. package/bin/skills/mamba/references/training-guide.md +388 -0
  203. package/bin/skills/megatron-core/SKILL.md +366 -0
  204. package/bin/skills/megatron-core/references/benchmarks.md +249 -0
  205. package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
  206. package/bin/skills/megatron-core/references/production-examples.md +473 -0
  207. package/bin/skills/megatron-core/references/training-recipes.md +547 -0
  208. package/bin/skills/miles/SKILL.md +315 -0
  209. package/bin/skills/miles/references/api-reference.md +141 -0
  210. package/bin/skills/miles/references/troubleshooting.md +352 -0
  211. package/bin/skills/mlflow/SKILL.md +704 -0
  212. package/bin/skills/mlflow/references/deployment.md +744 -0
  213. package/bin/skills/mlflow/references/model-registry.md +770 -0
  214. package/bin/skills/mlflow/references/tracking.md +680 -0
  215. package/bin/skills/modal/SKILL.md +341 -0
  216. package/bin/skills/modal/references/advanced-usage.md +503 -0
  217. package/bin/skills/modal/references/troubleshooting.md +494 -0
  218. package/bin/skills/model-merging/SKILL.md +539 -0
  219. package/bin/skills/model-merging/references/evaluation.md +462 -0
  220. package/bin/skills/model-merging/references/examples.md +428 -0
  221. package/bin/skills/model-merging/references/methods.md +352 -0
  222. package/bin/skills/model-pruning/SKILL.md +495 -0
  223. package/bin/skills/model-pruning/references/wanda.md +347 -0
  224. package/bin/skills/moe-training/SKILL.md +526 -0
  225. package/bin/skills/moe-training/references/architectures.md +432 -0
  226. package/bin/skills/moe-training/references/inference.md +348 -0
  227. package/bin/skills/moe-training/references/training.md +425 -0
  228. package/bin/skills/nanogpt/SKILL.md +290 -0
  229. package/bin/skills/nanogpt/references/architecture.md +382 -0
  230. package/bin/skills/nanogpt/references/data.md +476 -0
  231. package/bin/skills/nanogpt/references/training.md +564 -0
  232. package/bin/skills/nemo-curator/SKILL.md +383 -0
  233. package/bin/skills/nemo-curator/references/deduplication.md +87 -0
  234. package/bin/skills/nemo-curator/references/filtering.md +102 -0
  235. package/bin/skills/nemo-evaluator/SKILL.md +494 -0
  236. package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
  237. package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
  238. package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
  239. package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
  240. package/bin/skills/nemo-guardrails/SKILL.md +297 -0
  241. package/bin/skills/nnsight/SKILL.md +436 -0
  242. package/bin/skills/nnsight/references/README.md +78 -0
  243. package/bin/skills/nnsight/references/api.md +344 -0
  244. package/bin/skills/nnsight/references/tutorials.md +300 -0
  245. package/bin/skills/openrlhf/SKILL.md +249 -0
  246. package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
  247. package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
  248. package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
  249. package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
  250. package/bin/skills/outlines/SKILL.md +652 -0
  251. package/bin/skills/outlines/references/backends.md +615 -0
  252. package/bin/skills/outlines/references/examples.md +773 -0
  253. package/bin/skills/outlines/references/json_generation.md +652 -0
  254. package/bin/skills/peft/SKILL.md +431 -0
  255. package/bin/skills/peft/references/advanced-usage.md +514 -0
  256. package/bin/skills/peft/references/troubleshooting.md +480 -0
  257. package/bin/skills/phoenix/SKILL.md +475 -0
  258. package/bin/skills/phoenix/references/advanced-usage.md +619 -0
  259. package/bin/skills/phoenix/references/troubleshooting.md +538 -0
  260. package/bin/skills/pinecone/SKILL.md +358 -0
  261. package/bin/skills/pinecone/references/deployment.md +181 -0
  262. package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
  263. package/bin/skills/pytorch-fsdp/references/index.md +7 -0
  264. package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
  265. package/bin/skills/pytorch-lightning/SKILL.md +346 -0
  266. package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
  267. package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
  268. package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
  269. package/bin/skills/pyvene/SKILL.md +473 -0
  270. package/bin/skills/pyvene/references/README.md +73 -0
  271. package/bin/skills/pyvene/references/api.md +383 -0
  272. package/bin/skills/pyvene/references/tutorials.md +376 -0
  273. package/bin/skills/qdrant/SKILL.md +493 -0
  274. package/bin/skills/qdrant/references/advanced-usage.md +648 -0
  275. package/bin/skills/qdrant/references/troubleshooting.md +631 -0
  276. package/bin/skills/ray-data/SKILL.md +326 -0
  277. package/bin/skills/ray-data/references/integration.md +82 -0
  278. package/bin/skills/ray-data/references/transformations.md +83 -0
  279. package/bin/skills/ray-train/SKILL.md +406 -0
  280. package/bin/skills/ray-train/references/multi-node.md +628 -0
  281. package/bin/skills/rwkv/SKILL.md +260 -0
  282. package/bin/skills/rwkv/references/architecture-details.md +344 -0
  283. package/bin/skills/rwkv/references/rwkv7.md +386 -0
  284. package/bin/skills/rwkv/references/state-management.md +369 -0
  285. package/bin/skills/saelens/SKILL.md +386 -0
  286. package/bin/skills/saelens/references/README.md +70 -0
  287. package/bin/skills/saelens/references/api.md +333 -0
  288. package/bin/skills/saelens/references/tutorials.md +318 -0
  289. package/bin/skills/segment-anything/SKILL.md +500 -0
  290. package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
  291. package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
  292. package/bin/skills/sentence-transformers/SKILL.md +255 -0
  293. package/bin/skills/sentence-transformers/references/models.md +123 -0
  294. package/bin/skills/sentencepiece/SKILL.md +235 -0
  295. package/bin/skills/sentencepiece/references/algorithms.md +200 -0
  296. package/bin/skills/sentencepiece/references/training.md +304 -0
  297. package/bin/skills/sglang/SKILL.md +442 -0
  298. package/bin/skills/sglang/references/deployment.md +490 -0
  299. package/bin/skills/sglang/references/radix-attention.md +413 -0
  300. package/bin/skills/sglang/references/structured-generation.md +541 -0
  301. package/bin/skills/simpo/SKILL.md +219 -0
  302. package/bin/skills/simpo/references/datasets.md +478 -0
  303. package/bin/skills/simpo/references/hyperparameters.md +452 -0
  304. package/bin/skills/simpo/references/loss-functions.md +350 -0
  305. package/bin/skills/skypilot/SKILL.md +509 -0
  306. package/bin/skills/skypilot/references/advanced-usage.md +491 -0
  307. package/bin/skills/skypilot/references/troubleshooting.md +570 -0
  308. package/bin/skills/slime/SKILL.md +464 -0
  309. package/bin/skills/slime/references/api-reference.md +392 -0
  310. package/bin/skills/slime/references/troubleshooting.md +386 -0
  311. package/bin/skills/speculative-decoding/SKILL.md +467 -0
  312. package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
  313. package/bin/skills/speculative-decoding/references/medusa.md +350 -0
  314. package/bin/skills/stable-diffusion/SKILL.md +519 -0
  315. package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
  316. package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
  317. package/bin/skills/tensorboard/SKILL.md +629 -0
  318. package/bin/skills/tensorboard/references/integrations.md +638 -0
  319. package/bin/skills/tensorboard/references/profiling.md +545 -0
  320. package/bin/skills/tensorboard/references/visualization.md +620 -0
  321. package/bin/skills/tensorrt-llm/SKILL.md +187 -0
  322. package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
  323. package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
  324. package/bin/skills/tensorrt-llm/references/serving.md +470 -0
  325. package/bin/skills/tinker/SKILL.md +362 -0
  326. package/bin/skills/tinker/references/api-reference.md +168 -0
  327. package/bin/skills/tinker/references/getting-started.md +157 -0
  328. package/bin/skills/tinker/references/loss-functions.md +163 -0
  329. package/bin/skills/tinker/references/models-and-lora.md +139 -0
  330. package/bin/skills/tinker/references/recipes.md +280 -0
  331. package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
  332. package/bin/skills/tinker/references/rendering.md +243 -0
  333. package/bin/skills/tinker/references/supervised-learning.md +232 -0
  334. package/bin/skills/tinker-training-cost/SKILL.md +187 -0
  335. package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
  336. package/bin/skills/torchforge/SKILL.md +433 -0
  337. package/bin/skills/torchforge/references/api-reference.md +327 -0
  338. package/bin/skills/torchforge/references/troubleshooting.md +409 -0
  339. package/bin/skills/torchtitan/SKILL.md +358 -0
  340. package/bin/skills/torchtitan/references/checkpoint.md +181 -0
  341. package/bin/skills/torchtitan/references/custom-models.md +258 -0
  342. package/bin/skills/torchtitan/references/float8.md +133 -0
  343. package/bin/skills/torchtitan/references/fsdp.md +126 -0
  344. package/bin/skills/transformer-lens/SKILL.md +346 -0
  345. package/bin/skills/transformer-lens/references/README.md +54 -0
  346. package/bin/skills/transformer-lens/references/api.md +362 -0
  347. package/bin/skills/transformer-lens/references/tutorials.md +339 -0
  348. package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
  349. package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
  350. package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
  351. package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
  352. package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
  353. package/bin/skills/unsloth/SKILL.md +80 -0
  354. package/bin/skills/unsloth/references/index.md +7 -0
  355. package/bin/skills/unsloth/references/llms-full.md +16799 -0
  356. package/bin/skills/unsloth/references/llms-txt.md +12044 -0
  357. package/bin/skills/unsloth/references/llms.md +82 -0
  358. package/bin/skills/verl/SKILL.md +391 -0
  359. package/bin/skills/verl/references/api-reference.md +301 -0
  360. package/bin/skills/verl/references/troubleshooting.md +391 -0
  361. package/bin/skills/vllm/SKILL.md +364 -0
  362. package/bin/skills/vllm/references/optimization.md +226 -0
  363. package/bin/skills/vllm/references/quantization.md +284 -0
  364. package/bin/skills/vllm/references/server-deployment.md +255 -0
  365. package/bin/skills/vllm/references/troubleshooting.md +447 -0
  366. package/bin/skills/weights-and-biases/SKILL.md +590 -0
  367. package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
  368. package/bin/skills/weights-and-biases/references/integrations.md +700 -0
  369. package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
  370. package/bin/skills/whisper/SKILL.md +317 -0
  371. package/bin/skills/whisper/references/languages.md +189 -0
  372. package/bin/synsc +0 -0
  373. package/package.json +10 -0
@@ -0,0 +1,425 @@
# MoE Training Guide

Complete training guide based on the official DeepSpeed documentation and production practices.

## Table of Contents
- DeepSpeed MoE Setup
- Training Parameters
- PR-MoE (Pyramid-Residual-MoE)
- Mixture-of-Students (MoS)
- Hyperparameter Tuning
- Production Training

## DeepSpeed MoE Setup

**Source**: DeepSpeed MoE Tutorial (https://www.deepspeed.ai/tutorials/mixture-of-experts-nlg/)

### Requirements

```bash
# Install DeepSpeed v0.6.0 or higher (quoted so the shell does not treat >= as a redirection)
pip install "deepspeed>=0.6.0"

# Clone Megatron-DeepSpeed
git clone https://github.com/microsoft/Megatron-DeepSpeed
cd Megatron-DeepSpeed
pip install -r requirements.txt
```

### Basic MoE Configuration

```json
{
  "train_batch_size": 256,
  "gradient_accumulation_steps": 1,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16
  },
  "moe": {
    "enabled": true,
    "num_experts": 128,
    "expert_parallel_size": 8,
    "moe_loss_coeff": 0.01,
    "train_capacity_factor": 1.25,
    "eval_capacity_factor": 2.0,
    "min_capacity": 4,
    "drop_tokens": true
  },
  "zero_optimization": {
    "stage": 1
  }
}
```

## Training Parameters

### Core MoE Parameters

**From DeepSpeed documentation:**

1. **`--num-experts`**
   - Number of experts per MoE layer
   - Recommended: 128 experts
   - Range: 8-256 depending on scale

2. **`--moe-expert-parallel-size`**
   - Degree of expert parallelism
   - Distributes experts across GPUs
   - Example: 128 experts / 8 GPUs = 16 experts per GPU

3. **`--moe-loss-coeff`**
   - MoE auxiliary loss coefficient
   - Recommended: 0.01
   - Controls load-balancing strength

4. **`--moe-train-capacity-factor`**
   - Training capacity multiplier
   - Default: 1.25
   - Formula: capacity = (tokens / num_experts) × capacity_factor

5. **`--moe-eval-capacity-factor`**
   - Evaluation capacity multiplier
   - Default: 2.0 (no token dropping during eval)

6. **`--moe-min-capacity`**
   - Minimum expert capacity
   - Default: 4
   - Ensures each expert processes a minimum number of tokens

7. **`--disable-moe-token-dropping`**
   - Removes expert capacity limits
   - All tokens are processed (no dropping)
   - May increase memory usage

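The capacity formula above can be sanity-checked in a few lines of Python. The 8192-token batch below is just micro-batch × sequence length from the example script, not anything read from DeepSpeed:

```python
import math

def expert_capacity(tokens_per_batch, num_experts, capacity_factor, min_capacity=4):
    """Per-expert token budget: (tokens / num_experts) * capacity_factor, floored at min_capacity."""
    capacity = math.ceil(tokens_per_batch / num_experts * capacity_factor)
    return max(capacity, min_capacity)

# 4 micro-batch x 2048 seq-length = 8192 tokens routed per MoE layer (illustrative)
tokens = 4 * 2048
print(expert_capacity(tokens, 128, 1.25))  # training  -> 80 tokens per expert
print(expert_capacity(tokens, 128, 2.0))   # evaluation -> 128 tokens per expert
```

Tokens beyond this budget are dropped when `drop_tokens` is enabled, which is why the eval factor is set higher than the train factor.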
### Example Training Script

```bash
#!/bin/bash

deepspeed --num_gpus 8 pretrain_gpt_moe.py \
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 1 \
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --seq-length 2048 \
    --max-position-embeddings 2048 \
    --micro-batch-size 4 \
    --global-batch-size 256 \
    --train-iters 500000 \
    --lr 0.0001 \
    --min-lr 0.00001 \
    --lr-decay-style cosine \
    --lr-warmup-iters 2000 \
    --clip-grad 1.0 \
    --weight-decay 0.1 \
    --num-experts 128 \
    --moe-expert-parallel-size 8 \
    --moe-loss-coeff 0.01 \
    --moe-train-capacity-factor 1.25 \
    --moe-eval-capacity-factor 2.0 \
    --moe-min-capacity 4 \
    --fp16 \
    --deepspeed \
    --deepspeed_config ds_config_moe.json \
    --data-path /path/to/data \
    --vocab-file /path/to/vocab.json \
    --merge-file /path/to/merges.txt \
    --save-interval 5000 \
    --eval-interval 1000 \
    --eval-iters 100
```

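A quick sanity check on the batch arithmetic in this script: with tensor and pipeline parallel sizes of 1, all 8 GPUs are data-parallel, so gradient accumulation has to make up the difference between the micro and global batch sizes. A back-of-the-envelope sketch (not DeepSpeed code):

```python
micro_batch = 4
global_batch = 256
num_gpus = 8
tensor_parallel = 1
pipeline_parallel = 1

# GPUs not used for tensor/pipeline parallelism are data-parallel replicas
data_parallel = num_gpus // (tensor_parallel * pipeline_parallel)

# global_batch = micro_batch * data_parallel * grad_accum_steps
grad_accum_steps = global_batch // (micro_batch * data_parallel)
print(grad_accum_steps)  # -> 8
```

If these numbers do not divide evenly, Megatron-DeepSpeed will reject the configuration at startup.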
## PR-MoE: Pyramid-Residual-MoE

**Source**: DeepSpeed documentation - improves parameter efficiency 3× over standard MoE

### Architecture

PR-MoE uses:
- A varying number of experts per layer (pyramid structure)
- Residual connections between expert layers
- Better parameter efficiency

### Configuration

```bash
# PR-MoE specific parameters.
# Pyramid structure: a different expert count per layer.
# mlp-type "residual" enables the residual connections.
--num-experts "[128, 64, 32, 16]" \
--mlp-type residual \
--moe-expert-parallel-size 4 \
--moe-loss-coeff 0.01
```

### Full PR-MoE Training

```bash
# Note: in bash, a comment after a trailing "\" breaks the line continuation,
# so the pyramid/residual flags are left uncommented here (see Configuration above).
deepspeed --num_gpus 8 pretrain_gpt_moe.py \
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --seq-length 2048 \
    --max-position-embeddings 2048 \
    --micro-batch-size 4 \
    --global-batch-size 256 \
    --num-experts "[128, 64, 32, 16]" \
    --mlp-type residual \
    --moe-expert-parallel-size 4 \
    --moe-loss-coeff 0.01 \
    --moe-train-capacity-factor 1.25 \
    --fp16 \
    --deepspeed \
    --deepspeed_config ds_config_moe.json \
    --data-path /path/to/data \
    --save-interval 5000
```

**Benefits**:
- 3× better parameter efficiency vs standard MoE
- Fewer total parameters for the same performance
- Better gradient flow through residual connections

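To see where the savings come from, here is a rough count of expert FFN parameters alone, assuming a standard 2-layer FFN with 4× expansion and four MoE layers (illustrative numbers, not the actual Megatron-DeepSpeed model definitions, which also include attention and shared weights):

```python
hidden = 1024
ffn = 4 * hidden
expert_params = 2 * hidden * ffn  # up-projection + down-projection per expert

# Uniform MoE: 128 experts at every MoE layer; pyramid: fewer experts in lower layers
uniform = sum(128 * expert_params for _ in range(4))
pyramid = sum(n * expert_params for n in [128, 64, 32, 16])
print(f"uniform: {uniform/1e9:.1f}B, pyramid: {pyramid/1e9:.1f}B, ratio: {uniform/pyramid:.2f}x")
```

The ratio grows with the number of MoE layers and with how steeply the pyramid narrows, which is how PR-MoE reaches the larger savings reported by DeepSpeed.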
## Mixture-of-Students (MoS)

**Source**: DeepSpeed documentation - knowledge distillation for MoE

### Overview

MoS = MoE + knowledge distillation:
- Student: the MoE model being trained
- Teacher: a pre-trained dense model
- Transfers knowledge from the dense teacher to the sparse MoE student

### Configuration

```bash
# MoS parameters.
# --mos enables MoS distillation; --load-teacher points at the teacher checkpoint;
# --teacher-forward enables the teacher forward pass.
--mos \
--load-teacher /path/to/teacher \
--teacher-forward \
--teacher-model-parallel-size 1
```

205
+ ### Full MoS Training
206
+
207
+ ```bash
208
+ deepspeed --num_gpus 8 pretrain_gpt_moe.py \
209
+ --num-layers 24 \
210
+ --hidden-size 1024 \
211
+ --num-attention-heads 16 \
212
+ --num-experts 128 \
213
+ --moe-expert-parallel-size 8 \
214
+ --moe-loss-coeff 0.01 \
215
+ --mos \
+ --load-teacher /path/to/dense/teacher \
217
+ --teacher-forward \
218
+ --teacher-model-parallel-size 1 \
219
+ --fp16 \
220
+ --deepspeed \
221
+ --deepspeed_config ds_config_moe.json \
222
+ --data-path /path/to/data
223
+ ```
224
+
225
+ ### Staged Distillation
226
+
227
+ **Recommended**: Stop distillation partway through training and finish with the MoE loss alone
228
+
229
+ ```python
230
+ # In training loop
231
+ if iteration < 400000:
232
+ # Use MoS (distillation)
233
+ loss = moe_loss + distillation_loss
234
+ else:
235
+ # Stop distillation, train MoE only
236
+ loss = moe_loss
237
+ ```
238
+
239
+ **Benefits**:
240
+ - Faster convergence
241
+ - Better final performance
242
+ - Preserves teacher knowledge while allowing MoE specialization
243
+
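The staged loss above can be sketched end-to-end. This uses a generic temperature-softened KL distillation term in the style of Hinton et al.; the exact loss DeepSpeed MoS computes may differ, and the 400K-iteration cutoff is the one from the staged-distillation snippet:

```python
import math

def softmax(xs, T=1.0):
    # numerically stable temperature-softened softmax
    m = max(x / T for x in xs)
    es = [math.exp(x / T - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def distill_kl(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on softened distributions, scaled by T^2
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return (T * T) * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_loss(moe_loss, student_logits, teacher_logits, iteration,
               stop_distill_at=400_000):
    # staged distillation: distill early, then train the MoE student alone
    if iteration < stop_distill_at:
        return moe_loss + distill_kl(student_logits, teacher_logits)
    return moe_loss
```

Once `iteration` passes the cutoff, the distillation term vanishes and the student optimizes only its own MoE objective.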
244
+ ## Hyperparameter Tuning
245
+
246
+ ### Learning Rate
247
+
248
+ **Key insight**: MoE needs lower LR than dense models
249
+
250
+ ```bash
251
+ # Dense model
252
+ --lr 0.0006 \
253
+ --min-lr 0.00006
254
+
255
+ # MoE model (3-6× lower)
256
+ --lr 0.0001 \
257
+ --min-lr 0.00001
258
+ ```
259
+
260
+ ### LR Decay
261
+
262
+ **Extend decay schedule** for MoE:
263
+
264
+ ```bash
265
+ # Dense model
266
+ --lr-decay-iters 300000 \
267
+ --lr-warmup-iters 2000
268
+
269
+ # MoE model (1.5-2× longer)
270
+ --lr-decay-iters 500000 \
271
+ --lr-warmup-iters 2000
272
+ ```
273
+
274
+ ### Capacity Factor
275
+
276
+ **Tune based on memory/speed tradeoff**:
277
+
278
+ ```json
+ {
+   "moe": {
+     "train_capacity_factor": 1.25,
+     "eval_capacity_factor": 2.0
+   }
+ }
+ ```
+
+ Training values: 1.0 is aggressive (fastest, drops the most tokens), 1.25 is balanced (recommended), 1.5 is conservative. Use a higher factor at evaluation (2.0 is standard) so that no tokens are dropped.
291
+
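Numerically, the capacity factor sets a fixed per-expert token budget per batch. A sketch of the common GShard/Switch-style formula (DeepSpeed's exact rounding may differ):

```python
import math

def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    # max tokens each expert may process; overflow tokens are dropped
    # (or rerouted, depending on configuration)
    return math.ceil(capacity_factor * tokens_per_batch / num_experts)

# 4096 tokens routed across 128 experts at different capacity factors
caps = {f: expert_capacity(4096, 128, f) for f in (1.0, 1.25, 1.5)}
```

At factor 1.0 each of the 128 experts gets a budget of 32 tokens; 1.25 raises it to 40, trading memory and compute for fewer dropped tokens.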
292
+ ### Load Balancing Coefficient
293
+
294
+ ```json
+ {
+   "moe": {
+     "moe_loss_coeff": 0.01
+   }
+ }
+ ```
+
+ Typical values: 0.001 gives weak balancing, 0.01 is standard (recommended), 0.1 gives strong balancing.
302
+ ```
303
+
304
+ **Rule**: If load imbalance persists, increase the coefficient.
305
+
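The coefficient scales an auxiliary loss of roughly this shape (the Switch Transformers formulation; a sketch, not DeepSpeed's exact code), which is minimized at 1.0 under perfectly uniform routing:

```python
def load_balancing_loss(router_probs, expert_assignment, num_experts):
    # f[e]: fraction of tokens routed to expert e
    # P[e]: mean router probability assigned to expert e
    # loss = N * sum_e f[e] * P[e], minimized when both are uniform
    n = len(expert_assignment)
    f = [expert_assignment.count(e) / n for e in range(num_experts)]
    P = [sum(tok[e] for tok in router_probs) / n for e in range(num_experts)]
    return num_experts * sum(fe * Pe for fe, Pe in zip(f, P))
```

Skewed routing pushes the loss above 1.0, so multiplying it by `moe_loss_coeff` and adding it to the training loss nudges the router back toward balance.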
306
+ ## Production Training
307
+
308
+ ### Performance Benchmarks
309
+
310
+ **From DeepSpeed documentation:**
311
+
312
+ - Standard MoE: **5× training cost reduction** vs a dense model of comparable quality
+ - PR-MoE: an additional **3× model size reduction**
315
+
316
+ Example:
317
+ - Dense 13B model: 100% cost
318
+ - MoE 13B (128 experts): 20% cost (5× faster)
319
+ - PR-MoE 13B: 15% cost + 3× fewer params
320
+
321
+ ### Recommended Dataset
322
+
323
+ **The Pile** - publicly available training dataset
324
+ - 800GB of diverse text
325
+ - Standard benchmark for MoE training
326
+ - Used in DeepSpeed examples
327
+
328
+ ### Example Configs
329
+
330
+ **Small MoE (8 experts)**:
331
+
332
+ ```bash
333
+ deepspeed --num_gpus 4 pretrain_gpt_moe.py \
334
+ --num-layers 12 \
335
+ --hidden-size 768 \
336
+ --num-attention-heads 12 \
337
+ --num-experts 8 \
338
+ --moe-expert-parallel-size 2 \
339
+ --global-batch-size 128 \
340
+ --fp16
341
+ ```
342
+
343
+ **Medium MoE (64 experts)**:
344
+
345
+ ```bash
346
+ deepspeed --num_gpus 16 pretrain_gpt_moe.py \
347
+ --num-layers 24 \
348
+ --hidden-size 1024 \
349
+ --num-attention-heads 16 \
350
+ --num-experts 64 \
351
+ --moe-expert-parallel-size 8 \
352
+ --global-batch-size 256 \
353
+ --fp16
354
+ ```
355
+
356
+ **Large MoE (128 experts)**:
357
+
358
+ ```bash
359
+ deepspeed --num_gpus 32 pretrain_gpt_moe.py \
360
+ --num-layers 32 \
361
+ --hidden-size 2048 \
362
+ --num-attention-heads 32 \
363
+ --num-experts 128 \
364
+ --moe-expert-parallel-size 16 \
365
+ --global-batch-size 512 \
366
+ --fp16
367
+ ```
368
+
369
+ ### Monitoring
370
+
371
+ Key metrics to track:
372
+
373
+ ```python
374
+ # Expert load balance
375
+ expert_counts = [expert.token_count for expert in experts]
376
+ load_imbalance = max(expert_counts) / max(min(expert_counts), 1)  # guard: an unused expert would make min() zero
377
+
378
+ # Should be close to 1.0 (perfectly balanced)
379
+ # If > 2.0, increase moe_loss_coeff
380
+
381
+ # Expert utilization
382
+ utilized_experts = sum(count > 0 for count in expert_counts)
383
+ utilization_rate = utilized_experts / num_experts
384
+
385
+ # Should be close to 1.0 (all experts used)
386
+
387
+ # Token dropping rate
388
+ dropped_tokens = total_tokens - processed_tokens
389
+ drop_rate = dropped_tokens / total_tokens
390
+
391
+ # Should be low (<5%) during training
392
+ ```
393
+
394
+ ## Troubleshooting
395
+
396
+ ### Issue: Load Imbalance
397
+
398
+ **Symptoms**: Some experts get most tokens
399
+
400
+ **Solutions**:
401
+ 1. Increase `moe_loss_coeff` (0.01 → 0.1)
402
+ 2. Reduce `train_capacity_factor` (forces redistribution)
403
+ 3. Add noise to router logits (gating network)
404
+
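Solution 3 can be illustrated with a toy router (a sketch; the function name and shape are hypothetical, not the DeepSpeed gating API):

```python
import random

def noisy_top1(logits, noise_std=1.0, training=True):
    # jitter router logits during training so borderline tokens spread
    # across experts instead of piling onto one; disable at inference
    if training and noise_std > 0:
        logits = [x + random.gauss(0.0, noise_std) for x in logits]
    return max(range(len(logits)), key=lambda i: logits[i])
```

With `noise_std=0` (or `training=False`) this reduces to plain argmax routing.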
405
+ ### Issue: High Memory Usage
406
+
407
+ **Solutions**:
408
+ 1. Enable ZeRO Stage 1 or 2
409
+ 2. Reduce `train_capacity_factor`
410
+ 3. Enable `drop_tokens`
411
+ 4. Increase `moe_expert_parallel_size`
412
+
413
+ ### Issue: Unstable Training
414
+
415
+ **Solutions**:
416
+ 1. Lower learning rate
417
+ 2. Increase warmup steps
418
+ 3. Use gradient clipping (`--clip-grad 1.0`)
419
+ 4. Reduce router z-loss coefficient
420
+
421
+ ## Resources
422
+
423
+ - **DeepSpeed MoE Tutorial**: https://www.deepspeed.ai/tutorials/mixture-of-experts-nlg/
424
+ - **Megatron-DeepSpeed**: https://github.com/microsoft/Megatron-DeepSpeed
425
+ - **Example Scripts**: `examples_deepspeed/MoE/`
@@ -0,0 +1,290 @@
1
+ ---
2
+ name: nanogpt
3
+ description: Educational GPT implementation in ~300 lines. Reproduces GPT-2 (124M) on OpenWebText. Clean, hackable code for learning transformers. By Andrej Karpathy. Perfect for understanding GPT architecture from scratch. Train on Shakespeare (CPU) or OpenWebText (multi-GPU).
4
+ version: 1.0.0
5
+ author: Synthetic Sciences
6
+ license: MIT
7
+ tags: [Model Architecture, NanoGPT, GPT-2, Educational, Andrej Karpathy, Transformer, Minimalist, From Scratch, Training]
8
+ dependencies: [torch, transformers, datasets, tiktoken, wandb]
9
+ ---
10
+
11
+ # nanoGPT - Minimalist GPT Training
12
+
13
+ ## Quick start
14
+
15
+ nanoGPT is a simplified GPT implementation designed for learning and experimentation.
16
+
17
+ **Installation**:
18
+ ```bash
19
+ pip install torch numpy transformers datasets tiktoken wandb tqdm
20
+ ```
21
+
22
+ **Train on Shakespeare** (CPU-friendly):
23
+ ```bash
24
+ # Prepare data
25
+ python data/shakespeare_char/prepare.py
26
+
27
+ # Train (5 minutes on CPU)
28
+ python train.py config/train_shakespeare_char.py
29
+
30
+ # Generate text
31
+ python sample.py --out_dir=out-shakespeare-char
32
+ ```
33
+
34
+ **Output**:
35
+ ```
36
+ ROMEO:
37
+ What say'st thou? Shall I speak, and be a man?
38
+
39
+ JULIET:
40
+ I am afeard, and yet I'll speak; for thou art
41
+ One that hath been a man, and yet I know not
42
+ What thou art.
43
+ ```
44
+
45
+ ## Common workflows
46
+
47
+ ### Workflow 1: Character-level Shakespeare
48
+
49
+ **Complete training pipeline**:
50
+ ```bash
51
+ # Step 1: Prepare data (creates train.bin, val.bin)
52
+ python data/shakespeare_char/prepare.py
53
+
54
+ # Step 2: Train small model
55
+ python train.py config/train_shakespeare_char.py
56
+
57
+ # Step 3: Generate text
58
+ python sample.py --out_dir=out-shakespeare-char
59
+ ```
60
+
61
+ **Config** (`config/train_shakespeare_char.py`):
62
+ ```python
63
+ # Model config
64
+ n_layer = 6 # 6 transformer layers
65
+ n_head = 6 # 6 attention heads
66
+ n_embd = 384 # 384-dim embeddings
67
+ block_size = 256 # 256 char context
68
+
69
+ # Training config
70
+ batch_size = 64
71
+ learning_rate = 1e-3
72
+ max_iters = 5000
73
+ eval_interval = 500
74
+
75
+ # Hardware
76
+ device = 'cpu' # Or 'cuda'
77
+ compile = False # Set True for PyTorch 2.0
78
+ ```
79
+
80
+ **Training time**: ~5 minutes (CPU), ~1 minute (GPU)
81
+
82
+ ### Workflow 2: Reproduce GPT-2 (124M)
83
+
84
+ **Multi-GPU training on OpenWebText**:
85
+ ```bash
86
+ # Step 1: Prepare OpenWebText (takes ~1 hour)
87
+ python data/openwebtext/prepare.py
88
+
89
+ # Step 2: Train GPT-2 124M with DDP (8 GPUs)
90
+ torchrun --standalone --nproc_per_node=8 \
91
+ train.py config/train_gpt2.py
92
+
93
+ # Step 3: Sample from trained model
94
+ python sample.py --out_dir=out
95
+ ```
96
+
97
+ **Config** (`config/train_gpt2.py`):
98
+ ```python
99
+ # GPT-2 (124M) architecture
100
+ n_layer = 12
101
+ n_head = 12
102
+ n_embd = 768
103
+ block_size = 1024
104
+ dropout = 0.0
105
+
106
+ # Training
107
+ batch_size = 12
108
+ gradient_accumulation_steps = 5 * 8 # Total batch ~0.5M tokens
109
+ learning_rate = 6e-4
110
+ max_iters = 600000
111
+ lr_decay_iters = 600000
112
+
113
+ # System
114
+ compile = True # PyTorch 2.0
115
+ ```
116
+
117
+ **Training time**: ~4 days (8× A100)
118
+
119
+ ### Workflow 3: Fine-tune pretrained GPT-2
120
+
121
+ **Start from OpenAI checkpoint**:
122
+ ```python
123
+ # In train.py or config
124
+ init_from = 'gpt2' # Options: gpt2, gpt2-medium, gpt2-large, gpt2-xl
125
+
126
+ # Model loads OpenAI weights automatically
127
+ python train.py config/finetune_shakespeare.py
128
+ ```
129
+
130
+ **Example config** (`config/finetune_shakespeare.py`):
131
+ ```python
132
+ # Start from GPT-2
133
+ init_from = 'gpt2'
134
+
135
+ # Dataset
136
+ dataset = 'shakespeare_char'
137
+ batch_size = 1
138
+ block_size = 1024
139
+
140
+ # Fine-tuning
141
+ learning_rate = 3e-5 # Lower LR for fine-tuning
142
+ max_iters = 2000
143
+ warmup_iters = 100
144
+
145
+ # Regularization
146
+ weight_decay = 1e-1
147
+ ```
148
+
149
+ ### Workflow 4: Custom dataset
150
+
151
+ **Train on your own text**:
152
+ ```python
153
+ # data/custom/prepare.py
154
+ import numpy as np
155
+
156
+ # Load your data
157
+ with open('my_data.txt', 'r') as f:
158
+ text = f.read()
159
+
160
+ # Create character mappings
161
+ chars = sorted(list(set(text)))
162
+ stoi = {ch: i for i, ch in enumerate(chars)}
163
+ itos = {i: ch for i, ch in enumerate(chars)}
164
+
165
+ # Tokenize
166
+ data = np.array([stoi[ch] for ch in text], dtype=np.uint16)
167
+
168
+ # Split train/val
169
+ n = len(data)
170
+ train_data = data[:int(n*0.9)]
171
+ val_data = data[int(n*0.9):]
172
+
173
+ # Save
174
+ train_data.tofile('data/custom/train.bin')
175
+ val_data.tofile('data/custom/val.bin')
176
+ ```
177
+
178
+ **Train**:
179
+ ```bash
180
+ python data/custom/prepare.py
181
+ python train.py --dataset=custom
182
+ ```
183
+
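Before training, it is worth verifying that the character-level mapping is lossless. A quick roundtrip check using the same `stoi`/`itos` construction as the prepare script (a toy string stands in for `my_data.txt`):

```python
# toy text stands in for the contents of my_data.txt
text = "hello world"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

encoded = [stoi[ch] for ch in text]
decoded = "".join(itos[i] for i in encoded)
```

If `decoded` does not equal `text`, the vocab was built from different data than was tokenized.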
184
+ ## When to use vs alternatives
185
+
186
+ **Use nanoGPT when**:
187
+ - Learning how GPT works
188
+ - Experimenting with transformer variants
189
+ - Teaching/education purposes
190
+ - Quick prototyping
191
+ - Limited compute (can run on CPU)
192
+
193
+ **Simplicity advantages**:
194
+ - **~300 lines**: Entire model in `model.py`
195
+ - **~300 lines**: Training loop in `train.py`
196
+ - **Hackable**: Easy to modify
197
+ - **No abstractions**: Pure PyTorch
198
+
199
+ **Use alternatives instead**:
200
+ - **HuggingFace Transformers**: Production use, many models
201
+ - **Megatron-LM**: Large-scale distributed training
202
+ - **LitGPT**: More architectures, production-ready
203
+ - **PyTorch Lightning**: Need high-level framework
204
+
205
+ ## Common issues
206
+
207
+ **Issue: CUDA out of memory**
208
+
209
+ Reduce batch size or context length:
210
+ ```python
211
+ batch_size = 1 # Reduce from 12
212
+ block_size = 512 # Reduce from 1024
213
+ gradient_accumulation_steps = 40 # Increase to maintain effective batch
214
+ ```
215
+
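When shrinking `batch_size` and `block_size`, check how far the effective batch (tokens per optimizer step) has dropped and raise `gradient_accumulation_steps` to compensate. A small helper (the function name is illustrative):

```python
def effective_tokens(batch_size, block_size, grad_accum, n_gpus=1):
    # tokens contributing to each optimizer step
    return batch_size * block_size * grad_accum * n_gpus

gpt2_default = effective_tokens(12, 1024, 5 * 8)  # ~0.5M tokens
reduced = effective_tokens(1, 512, 40)            # far smaller
```

The reduced config above processes roughly 24× fewer tokens per step than the GPT-2 default, so `gradient_accumulation_steps` would need a matching increase to keep training dynamics comparable.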
216
+ **Issue: Training too slow**
217
+
218
+ Enable compilation (PyTorch 2.0+):
219
+ ```python
220
+ compile = True # 2× speedup
221
+ ```
222
+
223
+ Use mixed precision:
224
+ ```python
225
+ dtype = 'bfloat16' # Or 'float16'
226
+ ```
227
+
228
+ **Issue: Poor generation quality**
229
+
230
+ Train longer:
231
+ ```python
232
+ max_iters = 10000 # Increase from 5000
233
+ ```
234
+
235
+ Lower temperature:
236
+ ```python
237
+ # In sample.py
238
+ temperature = 0.7 # Lower from 1.0
239
+ top_k = 200 # Add top-k sampling
240
+ ```
241
+
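What those two knobs do, as a self-contained sketch of temperature plus top-k sampling (sample.py's actual implementation operates on torch tensors):

```python
import math
import random

def sample_next(logits, temperature=0.7, top_k=200):
    # keep the top_k logits, soften by temperature, then sample one index
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in idx]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    r = random.random() * sum(weights)
    acc = 0.0
    for i, w in zip(idx, weights):
        acc += w
        if acc >= r:
            return i
    return idx[-1]
```

Lower temperature sharpens the distribution toward the most likely tokens, and `top_k` discards the unlikely tail entirely; `top_k=1` reduces to greedy decoding.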
242
+ **Issue: Can't load GPT-2 weights**
243
+
244
+ Install transformers:
245
+ ```bash
246
+ pip install transformers
247
+ ```
248
+
249
+ Check model name:
250
+ ```python
251
+ init_from = 'gpt2' # Valid: gpt2, gpt2-medium, gpt2-large, gpt2-xl
252
+ ```
253
+
254
+ ## Advanced topics
255
+
256
+ **Model architecture**: See [references/architecture.md](references/architecture.md) for GPT block structure, multi-head attention, and MLP layers explained simply.
257
+
258
+ **Training loop**: See [references/training.md](references/training.md) for learning rate schedule, gradient accumulation, and distributed data parallel setup.
259
+
260
+ **Data preparation**: See [references/data.md](references/data.md) for tokenization strategies (character-level vs BPE) and binary format details.
261
+
262
+ ## Hardware requirements
263
+
264
+ - **Shakespeare (char-level)**:
265
+ - CPU: 5 minutes
266
+ - GPU (T4): 1 minute
267
+ - VRAM: <1GB
268
+
269
+ - **GPT-2 (124M)**:
270
+ - 1× A100: ~1 week
271
+ - 8× A100: ~4 days
272
+ - VRAM: ~16GB per GPU
273
+
274
+ - **GPT-2 Medium (350M)**:
275
+ - 8× A100: ~2 weeks
276
+ - VRAM: ~40GB per GPU
277
+
278
+ **Performance**:
279
+ - With `compile=True`: 2× speedup
280
+ - With `dtype=bfloat16`: 50% memory reduction
281
+
282
+ ## Resources
283
+
284
+ - GitHub: https://github.com/karpathy/nanoGPT ⭐ 48,000+
285
+ - Video: "Let's build GPT" by Andrej Karpathy
286
+ - Paper: "Attention is All You Need" (Vaswani et al.)
287
+ - OpenWebText: https://huggingface.co/datasets/Skylion007/openwebtext
288
+ - Educational: Best for understanding transformers from scratch
289
+
290
+