@synsci/cli-darwin-x64 1.1.49

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (373)
  1. package/bin/skills/accelerate/SKILL.md +332 -0
  2. package/bin/skills/accelerate/references/custom-plugins.md +453 -0
  3. package/bin/skills/accelerate/references/megatron-integration.md +489 -0
  4. package/bin/skills/accelerate/references/performance.md +525 -0
  5. package/bin/skills/audiocraft/SKILL.md +564 -0
  6. package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
  7. package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
  8. package/bin/skills/autogpt/SKILL.md +403 -0
  9. package/bin/skills/autogpt/references/advanced-usage.md +535 -0
  10. package/bin/skills/autogpt/references/troubleshooting.md +420 -0
  11. package/bin/skills/awq/SKILL.md +310 -0
  12. package/bin/skills/awq/references/advanced-usage.md +324 -0
  13. package/bin/skills/awq/references/troubleshooting.md +344 -0
  14. package/bin/skills/axolotl/SKILL.md +158 -0
  15. package/bin/skills/axolotl/references/api.md +5548 -0
  16. package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
  17. package/bin/skills/axolotl/references/index.md +15 -0
  18. package/bin/skills/axolotl/references/other.md +3563 -0
  19. package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
  20. package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
  21. package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
  22. package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
  23. package/bin/skills/bitsandbytes/SKILL.md +411 -0
  24. package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
  25. package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
  26. package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
  27. package/bin/skills/blip-2/SKILL.md +564 -0
  28. package/bin/skills/blip-2/references/advanced-usage.md +680 -0
  29. package/bin/skills/blip-2/references/troubleshooting.md +526 -0
  30. package/bin/skills/chroma/SKILL.md +406 -0
  31. package/bin/skills/chroma/references/integration.md +38 -0
  32. package/bin/skills/clip/SKILL.md +253 -0
  33. package/bin/skills/clip/references/applications.md +207 -0
  34. package/bin/skills/constitutional-ai/SKILL.md +290 -0
  35. package/bin/skills/crewai/SKILL.md +498 -0
  36. package/bin/skills/crewai/references/flows.md +438 -0
  37. package/bin/skills/crewai/references/tools.md +429 -0
  38. package/bin/skills/crewai/references/troubleshooting.md +480 -0
  39. package/bin/skills/deepspeed/SKILL.md +141 -0
  40. package/bin/skills/deepspeed/references/08.md +17 -0
  41. package/bin/skills/deepspeed/references/09.md +173 -0
  42. package/bin/skills/deepspeed/references/2020.md +378 -0
  43. package/bin/skills/deepspeed/references/2023.md +279 -0
  44. package/bin/skills/deepspeed/references/assets.md +179 -0
  45. package/bin/skills/deepspeed/references/index.md +35 -0
  46. package/bin/skills/deepspeed/references/mii.md +118 -0
  47. package/bin/skills/deepspeed/references/other.md +1191 -0
  48. package/bin/skills/deepspeed/references/tutorials.md +6554 -0
  49. package/bin/skills/dspy/SKILL.md +590 -0
  50. package/bin/skills/dspy/references/examples.md +663 -0
  51. package/bin/skills/dspy/references/modules.md +475 -0
  52. package/bin/skills/dspy/references/optimizers.md +566 -0
  53. package/bin/skills/faiss/SKILL.md +221 -0
  54. package/bin/skills/faiss/references/index_types.md +280 -0
  55. package/bin/skills/flash-attention/SKILL.md +367 -0
  56. package/bin/skills/flash-attention/references/benchmarks.md +215 -0
  57. package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
  58. package/bin/skills/gguf/SKILL.md +427 -0
  59. package/bin/skills/gguf/references/advanced-usage.md +504 -0
  60. package/bin/skills/gguf/references/troubleshooting.md +442 -0
  61. package/bin/skills/gptq/SKILL.md +450 -0
  62. package/bin/skills/gptq/references/calibration.md +337 -0
  63. package/bin/skills/gptq/references/integration.md +129 -0
  64. package/bin/skills/gptq/references/troubleshooting.md +95 -0
  65. package/bin/skills/grpo-rl-training/README.md +97 -0
  66. package/bin/skills/grpo-rl-training/SKILL.md +572 -0
  67. package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
  68. package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
  69. package/bin/skills/guidance/SKILL.md +572 -0
  70. package/bin/skills/guidance/references/backends.md +554 -0
  71. package/bin/skills/guidance/references/constraints.md +674 -0
  72. package/bin/skills/guidance/references/examples.md +767 -0
  73. package/bin/skills/hqq/SKILL.md +445 -0
  74. package/bin/skills/hqq/references/advanced-usage.md +528 -0
  75. package/bin/skills/hqq/references/troubleshooting.md +503 -0
  76. package/bin/skills/hugging-face-cli/SKILL.md +191 -0
  77. package/bin/skills/hugging-face-cli/references/commands.md +954 -0
  78. package/bin/skills/hugging-face-cli/references/examples.md +374 -0
  79. package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
  80. package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
  81. package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
  82. package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
  83. package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
  84. package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
  85. package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
  86. package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
  87. package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
  88. package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
  89. package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
  90. package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
  91. package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
  92. package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
  93. package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
  94. package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
  95. package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
  96. package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
  97. package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
  98. package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
  99. package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
  100. package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
  101. package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
  102. package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
  103. package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
  104. package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
  105. package/bin/skills/hugging-face-jobs/index.html +216 -0
  106. package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
  107. package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
  108. package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
  109. package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
  110. package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
  111. package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
  112. package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
  113. package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
  114. package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
  115. package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
  116. package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
  117. package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
  118. package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
  119. package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
  120. package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
  121. package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
  122. package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
  123. package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
  124. package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
  125. package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
  126. package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
  127. package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
  128. package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
  129. package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
  130. package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
  131. package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
  132. package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
  133. package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
  134. package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
  135. package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
  136. package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
  137. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
  138. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
  139. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
  140. package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
  141. package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
  142. package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
  143. package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
  144. package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
  145. package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
  146. package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
  147. package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
  148. package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
  149. package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
  150. package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
  151. package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
  152. package/bin/skills/instructor/SKILL.md +740 -0
  153. package/bin/skills/instructor/references/examples.md +107 -0
  154. package/bin/skills/instructor/references/providers.md +70 -0
  155. package/bin/skills/instructor/references/validation.md +606 -0
  156. package/bin/skills/knowledge-distillation/SKILL.md +458 -0
  157. package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
  158. package/bin/skills/lambda-labs/SKILL.md +545 -0
  159. package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
  160. package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
  161. package/bin/skills/langchain/SKILL.md +480 -0
  162. package/bin/skills/langchain/references/agents.md +499 -0
  163. package/bin/skills/langchain/references/integration.md +562 -0
  164. package/bin/skills/langchain/references/rag.md +600 -0
  165. package/bin/skills/langsmith/SKILL.md +422 -0
  166. package/bin/skills/langsmith/references/advanced-usage.md +548 -0
  167. package/bin/skills/langsmith/references/troubleshooting.md +537 -0
  168. package/bin/skills/litgpt/SKILL.md +469 -0
  169. package/bin/skills/litgpt/references/custom-models.md +568 -0
  170. package/bin/skills/litgpt/references/distributed-training.md +451 -0
  171. package/bin/skills/litgpt/references/supported-models.md +336 -0
  172. package/bin/skills/litgpt/references/training-recipes.md +619 -0
  173. package/bin/skills/llama-cpp/SKILL.md +258 -0
  174. package/bin/skills/llama-cpp/references/optimization.md +89 -0
  175. package/bin/skills/llama-cpp/references/quantization.md +213 -0
  176. package/bin/skills/llama-cpp/references/server.md +125 -0
  177. package/bin/skills/llama-factory/SKILL.md +80 -0
  178. package/bin/skills/llama-factory/references/_images.md +23 -0
  179. package/bin/skills/llama-factory/references/advanced.md +1055 -0
  180. package/bin/skills/llama-factory/references/getting_started.md +349 -0
  181. package/bin/skills/llama-factory/references/index.md +19 -0
  182. package/bin/skills/llama-factory/references/other.md +31 -0
  183. package/bin/skills/llamaguard/SKILL.md +337 -0
  184. package/bin/skills/llamaindex/SKILL.md +569 -0
  185. package/bin/skills/llamaindex/references/agents.md +83 -0
  186. package/bin/skills/llamaindex/references/data_connectors.md +108 -0
  187. package/bin/skills/llamaindex/references/query_engines.md +406 -0
  188. package/bin/skills/llava/SKILL.md +304 -0
  189. package/bin/skills/llava/references/training.md +197 -0
  190. package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
  191. package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
  192. package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
  193. package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
  194. package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
  195. package/bin/skills/long-context/SKILL.md +536 -0
  196. package/bin/skills/long-context/references/extension_methods.md +468 -0
  197. package/bin/skills/long-context/references/fine_tuning.md +611 -0
  198. package/bin/skills/long-context/references/rope.md +402 -0
  199. package/bin/skills/mamba/SKILL.md +260 -0
  200. package/bin/skills/mamba/references/architecture-details.md +206 -0
  201. package/bin/skills/mamba/references/benchmarks.md +255 -0
  202. package/bin/skills/mamba/references/training-guide.md +388 -0
  203. package/bin/skills/megatron-core/SKILL.md +366 -0
  204. package/bin/skills/megatron-core/references/benchmarks.md +249 -0
  205. package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
  206. package/bin/skills/megatron-core/references/production-examples.md +473 -0
  207. package/bin/skills/megatron-core/references/training-recipes.md +547 -0
  208. package/bin/skills/miles/SKILL.md +315 -0
  209. package/bin/skills/miles/references/api-reference.md +141 -0
  210. package/bin/skills/miles/references/troubleshooting.md +352 -0
  211. package/bin/skills/mlflow/SKILL.md +704 -0
  212. package/bin/skills/mlflow/references/deployment.md +744 -0
  213. package/bin/skills/mlflow/references/model-registry.md +770 -0
  214. package/bin/skills/mlflow/references/tracking.md +680 -0
  215. package/bin/skills/modal/SKILL.md +341 -0
  216. package/bin/skills/modal/references/advanced-usage.md +503 -0
  217. package/bin/skills/modal/references/troubleshooting.md +494 -0
  218. package/bin/skills/model-merging/SKILL.md +539 -0
  219. package/bin/skills/model-merging/references/evaluation.md +462 -0
  220. package/bin/skills/model-merging/references/examples.md +428 -0
  221. package/bin/skills/model-merging/references/methods.md +352 -0
  222. package/bin/skills/model-pruning/SKILL.md +495 -0
  223. package/bin/skills/model-pruning/references/wanda.md +347 -0
  224. package/bin/skills/moe-training/SKILL.md +526 -0
  225. package/bin/skills/moe-training/references/architectures.md +432 -0
  226. package/bin/skills/moe-training/references/inference.md +348 -0
  227. package/bin/skills/moe-training/references/training.md +425 -0
  228. package/bin/skills/nanogpt/SKILL.md +290 -0
  229. package/bin/skills/nanogpt/references/architecture.md +382 -0
  230. package/bin/skills/nanogpt/references/data.md +476 -0
  231. package/bin/skills/nanogpt/references/training.md +564 -0
  232. package/bin/skills/nemo-curator/SKILL.md +383 -0
  233. package/bin/skills/nemo-curator/references/deduplication.md +87 -0
  234. package/bin/skills/nemo-curator/references/filtering.md +102 -0
  235. package/bin/skills/nemo-evaluator/SKILL.md +494 -0
  236. package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
  237. package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
  238. package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
  239. package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
  240. package/bin/skills/nemo-guardrails/SKILL.md +297 -0
  241. package/bin/skills/nnsight/SKILL.md +436 -0
  242. package/bin/skills/nnsight/references/README.md +78 -0
  243. package/bin/skills/nnsight/references/api.md +344 -0
  244. package/bin/skills/nnsight/references/tutorials.md +300 -0
  245. package/bin/skills/openrlhf/SKILL.md +249 -0
  246. package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
  247. package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
  248. package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
  249. package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
  250. package/bin/skills/outlines/SKILL.md +652 -0
  251. package/bin/skills/outlines/references/backends.md +615 -0
  252. package/bin/skills/outlines/references/examples.md +773 -0
  253. package/bin/skills/outlines/references/json_generation.md +652 -0
  254. package/bin/skills/peft/SKILL.md +431 -0
  255. package/bin/skills/peft/references/advanced-usage.md +514 -0
  256. package/bin/skills/peft/references/troubleshooting.md +480 -0
  257. package/bin/skills/phoenix/SKILL.md +475 -0
  258. package/bin/skills/phoenix/references/advanced-usage.md +619 -0
  259. package/bin/skills/phoenix/references/troubleshooting.md +538 -0
  260. package/bin/skills/pinecone/SKILL.md +358 -0
  261. package/bin/skills/pinecone/references/deployment.md +181 -0
  262. package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
  263. package/bin/skills/pytorch-fsdp/references/index.md +7 -0
  264. package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
  265. package/bin/skills/pytorch-lightning/SKILL.md +346 -0
  266. package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
  267. package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
  268. package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
  269. package/bin/skills/pyvene/SKILL.md +473 -0
  270. package/bin/skills/pyvene/references/README.md +73 -0
  271. package/bin/skills/pyvene/references/api.md +383 -0
  272. package/bin/skills/pyvene/references/tutorials.md +376 -0
  273. package/bin/skills/qdrant/SKILL.md +493 -0
  274. package/bin/skills/qdrant/references/advanced-usage.md +648 -0
  275. package/bin/skills/qdrant/references/troubleshooting.md +631 -0
  276. package/bin/skills/ray-data/SKILL.md +326 -0
  277. package/bin/skills/ray-data/references/integration.md +82 -0
  278. package/bin/skills/ray-data/references/transformations.md +83 -0
  279. package/bin/skills/ray-train/SKILL.md +406 -0
  280. package/bin/skills/ray-train/references/multi-node.md +628 -0
  281. package/bin/skills/rwkv/SKILL.md +260 -0
  282. package/bin/skills/rwkv/references/architecture-details.md +344 -0
  283. package/bin/skills/rwkv/references/rwkv7.md +386 -0
  284. package/bin/skills/rwkv/references/state-management.md +369 -0
  285. package/bin/skills/saelens/SKILL.md +386 -0
  286. package/bin/skills/saelens/references/README.md +70 -0
  287. package/bin/skills/saelens/references/api.md +333 -0
  288. package/bin/skills/saelens/references/tutorials.md +318 -0
  289. package/bin/skills/segment-anything/SKILL.md +500 -0
  290. package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
  291. package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
  292. package/bin/skills/sentence-transformers/SKILL.md +255 -0
  293. package/bin/skills/sentence-transformers/references/models.md +123 -0
  294. package/bin/skills/sentencepiece/SKILL.md +235 -0
  295. package/bin/skills/sentencepiece/references/algorithms.md +200 -0
  296. package/bin/skills/sentencepiece/references/training.md +304 -0
  297. package/bin/skills/sglang/SKILL.md +442 -0
  298. package/bin/skills/sglang/references/deployment.md +490 -0
  299. package/bin/skills/sglang/references/radix-attention.md +413 -0
  300. package/bin/skills/sglang/references/structured-generation.md +541 -0
  301. package/bin/skills/simpo/SKILL.md +219 -0
  302. package/bin/skills/simpo/references/datasets.md +478 -0
  303. package/bin/skills/simpo/references/hyperparameters.md +452 -0
  304. package/bin/skills/simpo/references/loss-functions.md +350 -0
  305. package/bin/skills/skypilot/SKILL.md +509 -0
  306. package/bin/skills/skypilot/references/advanced-usage.md +491 -0
  307. package/bin/skills/skypilot/references/troubleshooting.md +570 -0
  308. package/bin/skills/slime/SKILL.md +464 -0
  309. package/bin/skills/slime/references/api-reference.md +392 -0
  310. package/bin/skills/slime/references/troubleshooting.md +386 -0
  311. package/bin/skills/speculative-decoding/SKILL.md +467 -0
  312. package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
  313. package/bin/skills/speculative-decoding/references/medusa.md +350 -0
  314. package/bin/skills/stable-diffusion/SKILL.md +519 -0
  315. package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
  316. package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
  317. package/bin/skills/tensorboard/SKILL.md +629 -0
  318. package/bin/skills/tensorboard/references/integrations.md +638 -0
  319. package/bin/skills/tensorboard/references/profiling.md +545 -0
  320. package/bin/skills/tensorboard/references/visualization.md +620 -0
  321. package/bin/skills/tensorrt-llm/SKILL.md +187 -0
  322. package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
  323. package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
  324. package/bin/skills/tensorrt-llm/references/serving.md +470 -0
  325. package/bin/skills/tinker/SKILL.md +362 -0
  326. package/bin/skills/tinker/references/api-reference.md +168 -0
  327. package/bin/skills/tinker/references/getting-started.md +157 -0
  328. package/bin/skills/tinker/references/loss-functions.md +163 -0
  329. package/bin/skills/tinker/references/models-and-lora.md +139 -0
  330. package/bin/skills/tinker/references/recipes.md +280 -0
  331. package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
  332. package/bin/skills/tinker/references/rendering.md +243 -0
  333. package/bin/skills/tinker/references/supervised-learning.md +232 -0
  334. package/bin/skills/tinker-training-cost/SKILL.md +187 -0
  335. package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
  336. package/bin/skills/torchforge/SKILL.md +433 -0
  337. package/bin/skills/torchforge/references/api-reference.md +327 -0
  338. package/bin/skills/torchforge/references/troubleshooting.md +409 -0
  339. package/bin/skills/torchtitan/SKILL.md +358 -0
  340. package/bin/skills/torchtitan/references/checkpoint.md +181 -0
  341. package/bin/skills/torchtitan/references/custom-models.md +258 -0
  342. package/bin/skills/torchtitan/references/float8.md +133 -0
  343. package/bin/skills/torchtitan/references/fsdp.md +126 -0
  344. package/bin/skills/transformer-lens/SKILL.md +346 -0
  345. package/bin/skills/transformer-lens/references/README.md +54 -0
  346. package/bin/skills/transformer-lens/references/api.md +362 -0
  347. package/bin/skills/transformer-lens/references/tutorials.md +339 -0
  348. package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
  349. package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
  350. package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
  351. package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
  352. package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
  353. package/bin/skills/unsloth/SKILL.md +80 -0
  354. package/bin/skills/unsloth/references/index.md +7 -0
  355. package/bin/skills/unsloth/references/llms-full.md +16799 -0
  356. package/bin/skills/unsloth/references/llms-txt.md +12044 -0
  357. package/bin/skills/unsloth/references/llms.md +82 -0
  358. package/bin/skills/verl/SKILL.md +391 -0
  359. package/bin/skills/verl/references/api-reference.md +301 -0
  360. package/bin/skills/verl/references/troubleshooting.md +391 -0
  361. package/bin/skills/vllm/SKILL.md +364 -0
  362. package/bin/skills/vllm/references/optimization.md +226 -0
  363. package/bin/skills/vllm/references/quantization.md +284 -0
  364. package/bin/skills/vllm/references/server-deployment.md +255 -0
  365. package/bin/skills/vllm/references/troubleshooting.md +447 -0
  366. package/bin/skills/weights-and-biases/SKILL.md +590 -0
  367. package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
  368. package/bin/skills/weights-and-biases/references/integrations.md +700 -0
  369. package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
  370. package/bin/skills/whisper/SKILL.md +317 -0
  371. package/bin/skills/whisper/references/languages.md +189 -0
  372. package/bin/synsc +0 -0
  373. package/package.json +10 -0
@@ -0,0 +1,653 @@
# Tokenization Algorithms Deep Dive

Comprehensive explanation of the BPE, WordPiece, and Unigram algorithms.

## Byte-Pair Encoding (BPE)

### Algorithm overview

BPE iteratively merges the most frequent pair of adjacent tokens in a corpus.

**Training process**:
1. Initialize vocabulary with all characters
2. Count frequency of all adjacent token pairs
3. Merge most frequent pair into new token
4. Add new token to vocabulary
5. Update corpus with new token
6. Repeat until vocabulary size reached

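The six training steps can be sketched from scratch in a few lines. This is a toy illustration of the loop, not an optimized implementation:

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    # Step 1: represent each word as a sequence of character tokens
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Step 2: count all adjacent pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Step 3: pick the most frequent pair
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Steps 4-5: merge that pair everywhere in the corpus
        new_corpus = {}
        for word, freq in corpus.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges

# On the toy corpus used below, the first two merges learned are
# 'e'+'s' and then 'es'+'t'.
merges = train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 2)
print(merges)  # [('e', 's'), ('es', 't')]
```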
### Step-by-step example

**Corpus**:
```
low: 5
lower: 2
newest: 6
widest: 3
```

**Iteration 1**:
```
Count pairs:
'e' + 's': 9 (newest: 6, widest: 3) ← most frequent
'l' + 'o': 7
'o' + 'w': 7
...

Merge: 'e' + 's' → 'es'

Updated corpus:
low: 5
lower: 2
newest: 6 → n e w es t
widest: 3 → w i d es t

Vocabulary: initial characters + ['es']
```

**Iteration 2**:
```
Count pairs:
'es' + 't': 9 ← most frequent
'l' + 'o': 7
...

Merge: 'es' + 't' → 'est'

Updated corpus:
low: 5
lower: 2
newest: 6 → n e w est
widest: 3 → w i d est

Vocabulary: initial characters + ['es', 'est']
```

**Continue until the desired vocabulary size is reached...**

### Tokenization with trained BPE

Given vocabulary: `['l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'i', 'd', 'es', 'est', 'lo', 'low', 'ne', 'new', 'newest', 'wi', 'wid', 'widest']`

Tokenize "lowest":
```
Step 1: Split into characters
['l', 'o', 'w', 'e', 's', 't']

Step 2: Apply merges in the order learned during training
- Merge 'l' + 'o' → 'lo' (if this merge was learned)
- Merge 'lo' + 'w' → 'low' (if learned)
- Merge 'e' + 's' → 'es' (learned)
- Merge 'es' + 't' → 'est' (learned)

Final: ['low', 'est']
```

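Applying learned merges in order can be sketched as follows; the merge list here is illustrative, in an order a corpus like the one above might produce:

```python
def bpe_tokenize(word, merges):
    # Start from characters and apply each learned merge, in training order
    tokens = list(word)
    for a, b in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]  # replace the pair with the merged token
            else:
                i += 1
    return tokens

# Illustrative merge list (assumed learned order)
merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]
print(bpe_tokenize("lowest", merges))  # ['low', 'est']
```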
### Implementation

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Initialize
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Configure trainer
trainer = BpeTrainer(
    vocab_size=1000,
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)

# Train
corpus = [
    "This is a sample corpus for BPE training.",
    "BPE learns subword units from the training data.",
    # ... more sentences
]

tokenizer.train_from_iterator(corpus, trainer=trainer)

# Use
output = tokenizer.encode("This is tokenization")
print(output.tokens)  # ['This', 'is', 'token', 'ization']
```

### Byte-level BPE (GPT-2 variant)

**Problem**: Character-level BPE needs every possible character in its base vocabulary, and Unicode defines over 140,000 characters.

**Solution**: Operate on bytes instead, so the base vocabulary is just the 256 possible byte values.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder

tokenizer = Tokenizer(BPE())

# Byte-level pre-tokenization
tokenizer.pre_tokenizer = ByteLevel()
tokenizer.decoder = ByteLevelDecoder()

# This handles ALL possible characters, including emojis
text = "Hello 🌍 世界"
tokens = tokenizer.encode(text).tokens
```

**Advantages**:
- Handles any Unicode character (every string decomposes into the 256 byte values)
- No unknown tokens (worst case: falls back to individual bytes)
- Used by GPT-2, GPT-3, BART

**Trade-offs**:
- Slightly worse compression (bytes vs. characters)
- More tokens for non-ASCII text

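A quick way to see why 256 symbols suffice, using only plain Python: UTF-8 encodes every character, emoji included, as one or more bytes, so a byte-level model never meets an out-of-vocabulary symbol.

```python
text = "Hello 🌍 世界"
data = text.encode("utf-8")

# Non-ASCII characters cost several bytes each (the trade-off noted above)
print(len(text), len(data))  # 10 characters become 17 bytes

# Every byte falls in the fixed range 0-255, so a base vocabulary
# of 256 symbols covers any possible input.
print(all(0 <= b < 256 for b in data))  # True
```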
### BPE variants

**SentencePiece BPE**:
- Language-independent (no language-specific pre-tokenization)
- Treats the input as a raw stream; whitespace is kept as an ordinary symbol ('▁')
- Used by models trained with the SentencePiece library (e.g. T5, ALBERT, XLNet, though these mostly use its Unigram mode)

**BPE-dropout** (sometimes called robust BPE):
- Randomly skips merges during tokenization
- Exposes the model to multiple segmentations of the same word
- Acts as regularization and reduces overfitting to a single segmentation

## WordPiece

### Algorithm overview

WordPiece is similar to BPE but uses a different merge selection criterion.

**Training process**:
1. Initialize vocabulary with all characters
2. Count frequency of all token pairs
3. Score each pair: `score = freq(pair) / (freq(first) × freq(second))`
4. Merge pair with highest score
5. Repeat until vocabulary size reached

### Why different scoring?

**BPE**: merges the most frequent pairs
- "aa" appears 100 times → high priority
- Even if 'a' appears 1000 times alone

**WordPiece**: merges pairs that occur together more often than their parts' individual frequencies predict
- "aa" appears 100 times, 'a' appears 1000 times → low score (100 / (1000 × 1000))
- "th" appears 50 times, 't' appears 60 times, 'h' appears 55 times → high score (50 / (60 × 55))
- Prioritizes pairs whose parts rarely appear outside the pair

### Step-by-step example

**Corpus**:
```
low: 5
lower: 2
newest: 6
widest: 3
```

**Iteration 1**:
```
Count character frequencies (per occurrence):
'e': 17 (lower: 2, newest: 12, widest: 3)
'w': 16
's': 9
't': 9
'l': 7
'o': 7
'i': 3
'd': 3
...

Count adjacent pairs:
'e' + 's': 9 (newest: 6, widest: 3)
's' + 't': 9 (newest: 6, widest: 3)
'l' + 'o': 7 (low: 5, lower: 2)
'i' + 'd': 3 (widest: 3)
...

Compute scores:
score('e' + 's') = 9 / (17 × 9) ≈ 0.059
score('s' + 't') = 9 / (9 × 9) ≈ 0.111
score('l' + 'o') = 7 / (7 × 7) ≈ 0.143
score('i' + 'd') = 3 / (3 × 3) ≈ 0.333 ← highest score

Choose: 'i' + 'd' → 'id'
```

**Key difference**: BPE would merge the most frequent pair ('e' + 's' or 's' + 't'); WordPiece merges 'i' + 'd', because those characters appear almost exclusively together.

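These counts and scores can be checked mechanically. A small script over the same corpus, counting every character occurrence (so "newest" contributes two e's):

```python
from collections import Counter

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}

chars, pairs = Counter(), Counter()
for word, freq in corpus.items():
    for ch in word:
        chars[ch] += freq          # per-occurrence character counts
    for a, b in zip(word, word[1:]):
        pairs[(a, b)] += freq      # adjacent-pair counts

scores = {p: f / (chars[p[0]] * chars[p[1]]) for p, f in pairs.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))  # ('i', 'd') 0.333
```
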
### Tokenization with WordPiece

Given vocabulary: `['##e', '##s', '##t', '##est', 'l', 'o', 'w', 'new', 'low']`

Tokenize "lowest":
```
Step 1: Find longest matching prefix
'lowest' → 'low' (matches)

Step 2: Find longest match for remainder (continuations carry the '##' prefix)
'est' → '##est' (matches)

Final: ['low', '##est']
```

**If no match**:
```
Tokenize "unknownword":
'unknownword' → no match
'unknownwor' → no match
...
'un' → no match
'u' → no match
→ [UNK] (the whole word, if any piece fails to match)
```

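The greedy longest-match-first procedure above is straightforward to implement; a minimal sketch (the vocabulary is illustrative):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    # Greedy longest-match-first, BERT-style: continuation pieces
    # carry '##'; an unmatched remainder makes the whole word [UNK].
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"low", "new", "##est", "##e", "##s", "##t", "l", "o", "w"}
print(wordpiece_tokenize("lowest", vocab))  # ['low', '##est']
print(wordpiece_tokenize("xyzzy", vocab))   # ['[UNK]']
```
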
### Implementation

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.normalizers import BertNormalizer
from tokenizers.pre_tokenizers import BertPreTokenizer

# Initialize BERT-style tokenizer
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# Normalization (lowercase, accent stripping)
tokenizer.normalizer = BertNormalizer(lowercase=True)

# Pre-tokenization (whitespace + punctuation)
tokenizer.pre_tokenizer = BertPreTokenizer()

# Configure trainer
trainer = WordPieceTrainer(
    vocab_size=30522,  # BERT vocab size
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    continuing_subword_prefix="##"  # BERT uses ##
)

# Train
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Use
output = tokenizer.encode("Tokenization works great!")
print(output.tokens)  # ['token', '##ization', 'works', 'great', '!']
```

### Subword prefix

**BERT uses `##` prefix**:
```
"unbelievable" → ['un', '##believ', '##able']
```

**Why?**
- Indicates the token is a continuation
- Allows reconstruction: remove ##, concatenate
- Helps the model distinguish word boundaries

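Reconstruction is the mirror image: strip `##` and glue onto the previous piece. A minimal sketch:

```python
def detokenize(pieces):
    # '##' marks a continuation: strip it and append to the previous piece
    words = []
    for p in pieces:
        if p.startswith("##") and words:
            words[-1] += p[2:]
        else:
            words.append(p)
    return " ".join(words)

print(detokenize(["un", "##believ", "##able", "is", "one", "word"]))
# unbelievable is one word
```
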
### WordPiece advantages

**Statistically meaningful merges**:
- Prioritizes combinations that occur together consistently
- "qu" has a high score (almost always together)
- "qx" has a low score (rare combination)

**Better for morphology**:
- Captures affixes: un-, -ing, -ed
- Preserves word stems

**Trade-offs**:
- Slower training than BPE
- Stores the full vocabulary rather than an ordered merge list
- Original (Google) training implementation is not open-source; Hugging Face provides a reimplementation

## Unigram

### Algorithm overview

Unigram works backward: start with a large vocabulary, then remove tokens.

**Training process**:
1. Initialize with a large vocabulary (all frequent substrings)
2. Estimate the probability of each token (frequency-based, via EM)
3. For each token, compute the loss increase if it were removed
4. Remove the 10-20% of tokens with the lowest loss impact
5. Re-estimate probabilities
6. Repeat until the desired vocabulary size is reached

### Probabilistic tokenization

**Unigram assumption**: Each token is independent.

Given vocabulary with probabilities:
```
P('low') = 0.02
P('l') = 0.01
P('o') = 0.015
P('w') = 0.01
P('est') = 0.03
P('e') = 0.02
P('s') = 0.015
P('t') = 0.015
```

Tokenize "lowest":
```
Option 1: ['low', 'est']
P = P('low') × P('est') = 0.02 × 0.03 = 0.0006

Option 2: ['l', 'o', 'w', 'est']
P = 0.01 × 0.015 × 0.01 × 0.03 = 0.000000045

Option 3: ['low', 'e', 's', 't']
P = 0.02 × 0.02 × 0.015 × 0.015 = 0.00000009

Choose option 1 (highest probability)
```

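For a word this short, every candidate segmentation can be enumerated and scored by brute force (using the probability table above):

```python
from math import prod

probs = {"low": 0.02, "l": 0.01, "o": 0.015, "w": 0.01,
         "est": 0.03, "e": 0.02, "s": 0.015, "t": 0.015}

def segmentations(word):
    # Recursively enumerate every split of `word` into in-vocabulary tokens
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        if word[:i] in probs:
            for rest in segmentations(word[i:]):
                yield [word[:i]] + rest

best = max(segmentations("lowest"), key=lambda seg: prod(probs[t] for t in seg))
print(best)  # ['low', 'est']
```

This enumeration is exponential in word length, which is exactly why the Viterbi algorithm is used in practice.
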
### Viterbi algorithm

Finding the best tokenization by brute force is expensive (exponentially many segmentations).

**Viterbi algorithm** (dynamic programming):
```python
from math import log

def tokenize_viterbi(word, vocab, probs):
    n = len(word)
    # dp[i] = (best_log_prob, best_tokens) for the prefix word[:i]
    dp = [(float('-inf'), []) for _ in range(n + 1)]
    dp[0] = (0.0, [])  # empty prefix: log probability 0

    for i in range(1, n + 1):
        best_prob = float('-inf')
        best_tokens = []

        # Try all possible last tokens word[j:i]
        for j in range(i):
            token = word[j:i]
            if token in vocab:
                prob = dp[j][0] + log(probs[token])
                if prob > best_prob:
                    best_prob = prob
                    best_tokens = dp[j][1] + [token]

        dp[i] = (best_prob, best_tokens)

    return dp[n][1]
```

**Time complexity**: O(n²) substring lookups vs O(2ⁿ) segmentations by brute force

### Implementation

```python
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

# Initialize
tokenizer = Tokenizer(Unigram())

# Configure trainer
trainer = UnigramTrainer(
    vocab_size=8000,
    special_tokens=["<unk>", "<s>", "</s>"],
    unk_token="<unk>",
    max_piece_length=16,   # Max token length
    n_sub_iterations=2,    # EM iterations
    shrinking_factor=0.75  # Remove 25% of tokens each iteration
)

# Train
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Use
output = tokenizer.encode("Tokenization with Unigram")
print(output.tokens)  # ['▁Token', 'ization', '▁with', '▁Un', 'igram']
```

### Unigram advantages

**Probabilistic**:
- Multiple valid tokenizations per input
- Can sample different tokenizations (data augmentation)

**Subword regularization**:
```python
# Sampling requires an encoder that exposes it: the Hugging Face
# tokenizers encode() is deterministic, but SentencePiece supports
# sampling directly (model path here is illustrative)
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="unigram.model")
for _ in range(3):
    print(sp.encode("tokenization", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))

# Possible outputs (different each time):
# ['▁token', 'ization']
# ['▁tok', 'en', 'ization']
# ['▁token', 'iz', 'ation']
```

**Language-independent**:
- No word boundaries needed
- Works for CJK languages (Chinese, Japanese, Korean)
- Treats input as a character stream

**Trade-offs**:
- Slower training (EM algorithm)
- More hyperparameters
- Larger model (stores probabilities)

## Algorithm comparison

### Training speed

| Algorithm | Small (10MB) | Medium (100MB) | Large (1GB) |
|-----------|--------------|----------------|-------------|
| BPE       | 10-15 sec    | 1-2 min        | 10-20 min   |
| WordPiece | 15-20 sec    | 2-3 min        | 15-30 min   |
| Unigram   | 20-30 sec    | 3-5 min        | 30-60 min   |

**Tested on**: 16-core CPU, 30k vocab

### Tokenization quality

Tested on English Wikipedia:

| Algorithm | Vocab Size | Tokens/Word | Unknown Rate |
|-----------|------------|-------------|--------------|
| BPE       | 30k        | 1.3         | 0.5%         |
| WordPiece | 30k        | 1.2         | 1.2%         |
| Unigram   | 8k         | 1.5         | 0.3%         |

**Key observations**:
- WordPiece: Slightly better compression (fewer tokens per word)
- BPE: Lower unknown rate than WordPiece
- Unigram: Smallest vocab, good coverage

### Compression ratio

Characters per token (higher = better compression):

| Language | BPE (30k) | WordPiece (30k) | Unigram (8k) |
|----------|-----------|-----------------|--------------|
| English  | 4.2       | 4.5             | 3.8          |
| Chinese  | 2.1       | 2.3             | 2.5          |
| Arabic   | 3.5       | 3.8             | 3.2          |

**Best for each**:
- English: WordPiece
- Chinese: Unigram (language-independent)
- Arabic: WordPiece

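Characters per token is easy to measure for any tokenizer. A minimal helper, shown with whitespace splitting as a stand-in for a trained subword tokenizer:

```python
def chars_per_token(texts, tokenize):
    # Compression ratio: total characters divided by total tokens produced
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(tokenize(t)) for t in texts)
    return total_chars / total_tokens

texts = ["the quick brown fox", "tokenization algorithms"]
print(round(chars_per_token(texts, str.split), 2))  # 7.0
```

Pass your tokenizer's encode function (e.g. `lambda t: tokenizer.encode(t).tokens`) in place of `str.split` to compare trained models on the same corpus.
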
### Use case recommendations

**BPE** - Best for:
- English language models
- Code (handles symbols well)
- Fast training needed
- **Models**: GPT-2, GPT-3, RoBERTa, BART

**WordPiece** - Best for:
- Masked language modeling (BERT-style)
- Morphologically rich languages
- Semantic understanding tasks
- **Models**: BERT, DistilBERT, ELECTRA

**Unigram** - Best for:
- Multilingual models
- Languages without word boundaries (CJK)
- Data augmentation via subword regularization
- **Models**: T5, ALBERT, XLNet (via SentencePiece)

## Advanced topics

### Handling rare words

**BPE approach**:
```
"antidisestablishmentarianism"
→ ['anti', 'dis', 'establish', 'ment', 'arian', 'ism']
```

**WordPiece approach**:
```
"antidisestablishmentarianism"
→ ['anti', '##dis', '##establish', '##ment', '##arian', '##ism']
```

**Unigram approach**:
```
"antidisestablishmentarianism"
→ ['▁anti', 'dis', 'establish', 'ment', 'arian', 'ism']
```

### Handling numbers

**Challenge**: Infinite number combinations

**BPE solution**: Byte-level (handles any digit sequence)
```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevel()

# Handles any number:
# "123456789" → byte-level tokens
```

**WordPiece solution**: Digit pre-tokenization
```python
from tokenizers.pre_tokenizers import Digits

# Split digits individually or as groups
tokenizer.pre_tokenizer = Digits(individual_digits=True)

# "123" → ['1', '2', '3']
```

**Unigram solution**: Learns common number patterns
```python
# Learned during training, e.g.:
# "2023" → ['202', '3'] or ['20', '23']
```

### Handling case sensitivity

**Lowercase (BERT uncased)**:
```python
from tokenizers.normalizers import Lowercase

tokenizer.normalizer = Lowercase()

# "Hello WORLD" → "hello world" → ['hello', 'world']
```

**Preserve case (GPT-2)**:
```python
# Simply attach no case normalizer; casing reaches the model unchanged
# "Hello WORLD" → ['Hello', 'WORLD']
```

**Cased vocabularies (GPT-2, RoBERTa)**:
```python
# Case-preserving tokenizers learn separate tokens for different casings
# Vocabulary: ['Hello', 'hello', 'HELLO', 'world', 'WORLD']
```

### Handling emojis and special characters

**Byte-level (GPT-2)**:
```python
from tokenizers.pre_tokenizers import ByteLevel

tokenizer.pre_tokenizer = ByteLevel()

# "Hello 🌍 👋" → byte-level representation (always encodable)
```

**Unicode normalization**:
```python
from tokenizers.normalizers import NFKC

tokenizer.normalizer = NFKC()

# "é" (composed) and "é" (decomposed) → normalized to a single form
```

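The composed/decomposed distinction can be seen with the standard library's `unicodedata` module:

```python
import unicodedata

composed = "\u00e9"     # 'é' as a single code point
decomposed = "e\u0301"  # 'e' plus a combining acute accent

print(composed == decomposed)  # False: different code point sequences
print(unicodedata.normalize("NFKC", composed)
      == unicodedata.normalize("NFKC", decomposed))  # True: one canonical form
```

Without normalization, the two forms would receive different token IDs for visually identical text.
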
## Troubleshooting

### Issue: Poor subword splitting

**Symptom**:
```
"running" → ['r', 'u', 'n', 'n', 'i', 'n', 'g'] (too granular)
```

**Solutions**:
1. Increase vocabulary size
2. Train longer (more merge iterations)
3. Lower the `min_frequency` threshold

### Issue: Too many unknown tokens

**Symptom**:
```
5% of tokens are [UNK]
```

**Solutions**:
1. Increase vocabulary size
2. Use byte-level BPE (no UNK possible)
3. Verify the training corpus is representative

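Measuring the unknown-token rate on held-out text takes only a few lines:

```python
def unk_rate(token_lists, unk="[UNK]"):
    # Fraction of produced tokens that are the unknown token
    tokens = [t for seq in token_lists for t in seq]
    return tokens.count(unk) / len(tokens)

encoded = [["hello", "world"], ["[UNK]", "##ing"]]
print(unk_rate(encoded))  # 0.25
```

Run this over a representative sample before committing to a tokenizer; a rate above roughly 1% usually signals a vocabulary or corpus-coverage problem.
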
### Issue: Inconsistent tokenization

**Symptom**:
```
"running" → ['run', 'ning']
"runner" → ['r', 'u', 'n', 'n', 'e', 'r']
```

**Solutions**:
1. Check normalization consistency
2. Ensure pre-tokenization is deterministic
3. If using Unigram with subword-regularization sampling, disable sampling at inference

## Best practices

1. **Match algorithm to model architecture**:
   - BERT-style → WordPiece
   - GPT-style → BPE
   - T5-style → Unigram

2. **Use byte-level for multilingual**:
   - Handles any Unicode
   - No unknown tokens

3. **Test on representative data**:
   - Measure compression ratio
   - Check unknown token rate
   - Inspect sample tokenizations

4. **Version control tokenizers**:
   - Save with the model
   - Document special tokens
   - Track vocabulary changes