@synsci/cli-darwin-x64 1.1.49

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (373)
  1. package/bin/skills/accelerate/SKILL.md +332 -0
  2. package/bin/skills/accelerate/references/custom-plugins.md +453 -0
  3. package/bin/skills/accelerate/references/megatron-integration.md +489 -0
  4. package/bin/skills/accelerate/references/performance.md +525 -0
  5. package/bin/skills/audiocraft/SKILL.md +564 -0
  6. package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
  7. package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
  8. package/bin/skills/autogpt/SKILL.md +403 -0
  9. package/bin/skills/autogpt/references/advanced-usage.md +535 -0
  10. package/bin/skills/autogpt/references/troubleshooting.md +420 -0
  11. package/bin/skills/awq/SKILL.md +310 -0
  12. package/bin/skills/awq/references/advanced-usage.md +324 -0
  13. package/bin/skills/awq/references/troubleshooting.md +344 -0
  14. package/bin/skills/axolotl/SKILL.md +158 -0
  15. package/bin/skills/axolotl/references/api.md +5548 -0
  16. package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
  17. package/bin/skills/axolotl/references/index.md +15 -0
  18. package/bin/skills/axolotl/references/other.md +3563 -0
  19. package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
  20. package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
  21. package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
  22. package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
  23. package/bin/skills/bitsandbytes/SKILL.md +411 -0
  24. package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
  25. package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
  26. package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
  27. package/bin/skills/blip-2/SKILL.md +564 -0
  28. package/bin/skills/blip-2/references/advanced-usage.md +680 -0
  29. package/bin/skills/blip-2/references/troubleshooting.md +526 -0
  30. package/bin/skills/chroma/SKILL.md +406 -0
  31. package/bin/skills/chroma/references/integration.md +38 -0
  32. package/bin/skills/clip/SKILL.md +253 -0
  33. package/bin/skills/clip/references/applications.md +207 -0
  34. package/bin/skills/constitutional-ai/SKILL.md +290 -0
  35. package/bin/skills/crewai/SKILL.md +498 -0
  36. package/bin/skills/crewai/references/flows.md +438 -0
  37. package/bin/skills/crewai/references/tools.md +429 -0
  38. package/bin/skills/crewai/references/troubleshooting.md +480 -0
  39. package/bin/skills/deepspeed/SKILL.md +141 -0
  40. package/bin/skills/deepspeed/references/08.md +17 -0
  41. package/bin/skills/deepspeed/references/09.md +173 -0
  42. package/bin/skills/deepspeed/references/2020.md +378 -0
  43. package/bin/skills/deepspeed/references/2023.md +279 -0
  44. package/bin/skills/deepspeed/references/assets.md +179 -0
  45. package/bin/skills/deepspeed/references/index.md +35 -0
  46. package/bin/skills/deepspeed/references/mii.md +118 -0
  47. package/bin/skills/deepspeed/references/other.md +1191 -0
  48. package/bin/skills/deepspeed/references/tutorials.md +6554 -0
  49. package/bin/skills/dspy/SKILL.md +590 -0
  50. package/bin/skills/dspy/references/examples.md +663 -0
  51. package/bin/skills/dspy/references/modules.md +475 -0
  52. package/bin/skills/dspy/references/optimizers.md +566 -0
  53. package/bin/skills/faiss/SKILL.md +221 -0
  54. package/bin/skills/faiss/references/index_types.md +280 -0
  55. package/bin/skills/flash-attention/SKILL.md +367 -0
  56. package/bin/skills/flash-attention/references/benchmarks.md +215 -0
  57. package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
  58. package/bin/skills/gguf/SKILL.md +427 -0
  59. package/bin/skills/gguf/references/advanced-usage.md +504 -0
  60. package/bin/skills/gguf/references/troubleshooting.md +442 -0
  61. package/bin/skills/gptq/SKILL.md +450 -0
  62. package/bin/skills/gptq/references/calibration.md +337 -0
  63. package/bin/skills/gptq/references/integration.md +129 -0
  64. package/bin/skills/gptq/references/troubleshooting.md +95 -0
  65. package/bin/skills/grpo-rl-training/README.md +97 -0
  66. package/bin/skills/grpo-rl-training/SKILL.md +572 -0
  67. package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
  68. package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
  69. package/bin/skills/guidance/SKILL.md +572 -0
  70. package/bin/skills/guidance/references/backends.md +554 -0
  71. package/bin/skills/guidance/references/constraints.md +674 -0
  72. package/bin/skills/guidance/references/examples.md +767 -0
  73. package/bin/skills/hqq/SKILL.md +445 -0
  74. package/bin/skills/hqq/references/advanced-usage.md +528 -0
  75. package/bin/skills/hqq/references/troubleshooting.md +503 -0
  76. package/bin/skills/hugging-face-cli/SKILL.md +191 -0
  77. package/bin/skills/hugging-face-cli/references/commands.md +954 -0
  78. package/bin/skills/hugging-face-cli/references/examples.md +374 -0
  79. package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
  80. package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
  81. package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
  82. package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
  83. package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
  84. package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
  85. package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
  86. package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
  87. package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
  88. package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
  89. package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
  90. package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
  91. package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
  92. package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
  93. package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
  94. package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
  95. package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
  96. package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
  97. package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
  98. package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
  99. package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
  100. package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
  101. package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
  102. package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
  103. package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
  104. package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
  105. package/bin/skills/hugging-face-jobs/index.html +216 -0
  106. package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
  107. package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
  108. package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
  109. package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
  110. package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
  111. package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
  112. package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
  113. package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
  114. package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
  115. package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
  116. package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
  117. package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
  118. package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
  119. package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
  120. package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
  121. package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
  122. package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
  123. package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
  124. package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
  125. package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
  126. package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
  127. package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
  128. package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
  129. package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
  130. package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
  131. package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
  132. package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
  133. package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
  134. package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
  135. package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
  136. package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
  137. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
  138. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
  139. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
  140. package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
  141. package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
  142. package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
  143. package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
  144. package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
  145. package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
  146. package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
  147. package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
  148. package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
  149. package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
  150. package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
  151. package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
  152. package/bin/skills/instructor/SKILL.md +740 -0
  153. package/bin/skills/instructor/references/examples.md +107 -0
  154. package/bin/skills/instructor/references/providers.md +70 -0
  155. package/bin/skills/instructor/references/validation.md +606 -0
  156. package/bin/skills/knowledge-distillation/SKILL.md +458 -0
  157. package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
  158. package/bin/skills/lambda-labs/SKILL.md +545 -0
  159. package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
  160. package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
  161. package/bin/skills/langchain/SKILL.md +480 -0
  162. package/bin/skills/langchain/references/agents.md +499 -0
  163. package/bin/skills/langchain/references/integration.md +562 -0
  164. package/bin/skills/langchain/references/rag.md +600 -0
  165. package/bin/skills/langsmith/SKILL.md +422 -0
  166. package/bin/skills/langsmith/references/advanced-usage.md +548 -0
  167. package/bin/skills/langsmith/references/troubleshooting.md +537 -0
  168. package/bin/skills/litgpt/SKILL.md +469 -0
  169. package/bin/skills/litgpt/references/custom-models.md +568 -0
  170. package/bin/skills/litgpt/references/distributed-training.md +451 -0
  171. package/bin/skills/litgpt/references/supported-models.md +336 -0
  172. package/bin/skills/litgpt/references/training-recipes.md +619 -0
  173. package/bin/skills/llama-cpp/SKILL.md +258 -0
  174. package/bin/skills/llama-cpp/references/optimization.md +89 -0
  175. package/bin/skills/llama-cpp/references/quantization.md +213 -0
  176. package/bin/skills/llama-cpp/references/server.md +125 -0
  177. package/bin/skills/llama-factory/SKILL.md +80 -0
  178. package/bin/skills/llama-factory/references/_images.md +23 -0
  179. package/bin/skills/llama-factory/references/advanced.md +1055 -0
  180. package/bin/skills/llama-factory/references/getting_started.md +349 -0
  181. package/bin/skills/llama-factory/references/index.md +19 -0
  182. package/bin/skills/llama-factory/references/other.md +31 -0
  183. package/bin/skills/llamaguard/SKILL.md +337 -0
  184. package/bin/skills/llamaindex/SKILL.md +569 -0
  185. package/bin/skills/llamaindex/references/agents.md +83 -0
  186. package/bin/skills/llamaindex/references/data_connectors.md +108 -0
  187. package/bin/skills/llamaindex/references/query_engines.md +406 -0
  188. package/bin/skills/llava/SKILL.md +304 -0
  189. package/bin/skills/llava/references/training.md +197 -0
  190. package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
  191. package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
  192. package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
  193. package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
  194. package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
  195. package/bin/skills/long-context/SKILL.md +536 -0
  196. package/bin/skills/long-context/references/extension_methods.md +468 -0
  197. package/bin/skills/long-context/references/fine_tuning.md +611 -0
  198. package/bin/skills/long-context/references/rope.md +402 -0
  199. package/bin/skills/mamba/SKILL.md +260 -0
  200. package/bin/skills/mamba/references/architecture-details.md +206 -0
  201. package/bin/skills/mamba/references/benchmarks.md +255 -0
  202. package/bin/skills/mamba/references/training-guide.md +388 -0
  203. package/bin/skills/megatron-core/SKILL.md +366 -0
  204. package/bin/skills/megatron-core/references/benchmarks.md +249 -0
  205. package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
  206. package/bin/skills/megatron-core/references/production-examples.md +473 -0
  207. package/bin/skills/megatron-core/references/training-recipes.md +547 -0
  208. package/bin/skills/miles/SKILL.md +315 -0
  209. package/bin/skills/miles/references/api-reference.md +141 -0
  210. package/bin/skills/miles/references/troubleshooting.md +352 -0
  211. package/bin/skills/mlflow/SKILL.md +704 -0
  212. package/bin/skills/mlflow/references/deployment.md +744 -0
  213. package/bin/skills/mlflow/references/model-registry.md +770 -0
  214. package/bin/skills/mlflow/references/tracking.md +680 -0
  215. package/bin/skills/modal/SKILL.md +341 -0
  216. package/bin/skills/modal/references/advanced-usage.md +503 -0
  217. package/bin/skills/modal/references/troubleshooting.md +494 -0
  218. package/bin/skills/model-merging/SKILL.md +539 -0
  219. package/bin/skills/model-merging/references/evaluation.md +462 -0
  220. package/bin/skills/model-merging/references/examples.md +428 -0
  221. package/bin/skills/model-merging/references/methods.md +352 -0
  222. package/bin/skills/model-pruning/SKILL.md +495 -0
  223. package/bin/skills/model-pruning/references/wanda.md +347 -0
  224. package/bin/skills/moe-training/SKILL.md +526 -0
  225. package/bin/skills/moe-training/references/architectures.md +432 -0
  226. package/bin/skills/moe-training/references/inference.md +348 -0
  227. package/bin/skills/moe-training/references/training.md +425 -0
  228. package/bin/skills/nanogpt/SKILL.md +290 -0
  229. package/bin/skills/nanogpt/references/architecture.md +382 -0
  230. package/bin/skills/nanogpt/references/data.md +476 -0
  231. package/bin/skills/nanogpt/references/training.md +564 -0
  232. package/bin/skills/nemo-curator/SKILL.md +383 -0
  233. package/bin/skills/nemo-curator/references/deduplication.md +87 -0
  234. package/bin/skills/nemo-curator/references/filtering.md +102 -0
  235. package/bin/skills/nemo-evaluator/SKILL.md +494 -0
  236. package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
  237. package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
  238. package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
  239. package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
  240. package/bin/skills/nemo-guardrails/SKILL.md +297 -0
  241. package/bin/skills/nnsight/SKILL.md +436 -0
  242. package/bin/skills/nnsight/references/README.md +78 -0
  243. package/bin/skills/nnsight/references/api.md +344 -0
  244. package/bin/skills/nnsight/references/tutorials.md +300 -0
  245. package/bin/skills/openrlhf/SKILL.md +249 -0
  246. package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
  247. package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
  248. package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
  249. package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
  250. package/bin/skills/outlines/SKILL.md +652 -0
  251. package/bin/skills/outlines/references/backends.md +615 -0
  252. package/bin/skills/outlines/references/examples.md +773 -0
  253. package/bin/skills/outlines/references/json_generation.md +652 -0
  254. package/bin/skills/peft/SKILL.md +431 -0
  255. package/bin/skills/peft/references/advanced-usage.md +514 -0
  256. package/bin/skills/peft/references/troubleshooting.md +480 -0
  257. package/bin/skills/phoenix/SKILL.md +475 -0
  258. package/bin/skills/phoenix/references/advanced-usage.md +619 -0
  259. package/bin/skills/phoenix/references/troubleshooting.md +538 -0
  260. package/bin/skills/pinecone/SKILL.md +358 -0
  261. package/bin/skills/pinecone/references/deployment.md +181 -0
  262. package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
  263. package/bin/skills/pytorch-fsdp/references/index.md +7 -0
  264. package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
  265. package/bin/skills/pytorch-lightning/SKILL.md +346 -0
  266. package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
  267. package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
  268. package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
  269. package/bin/skills/pyvene/SKILL.md +473 -0
  270. package/bin/skills/pyvene/references/README.md +73 -0
  271. package/bin/skills/pyvene/references/api.md +383 -0
  272. package/bin/skills/pyvene/references/tutorials.md +376 -0
  273. package/bin/skills/qdrant/SKILL.md +493 -0
  274. package/bin/skills/qdrant/references/advanced-usage.md +648 -0
  275. package/bin/skills/qdrant/references/troubleshooting.md +631 -0
  276. package/bin/skills/ray-data/SKILL.md +326 -0
  277. package/bin/skills/ray-data/references/integration.md +82 -0
  278. package/bin/skills/ray-data/references/transformations.md +83 -0
  279. package/bin/skills/ray-train/SKILL.md +406 -0
  280. package/bin/skills/ray-train/references/multi-node.md +628 -0
  281. package/bin/skills/rwkv/SKILL.md +260 -0
  282. package/bin/skills/rwkv/references/architecture-details.md +344 -0
  283. package/bin/skills/rwkv/references/rwkv7.md +386 -0
  284. package/bin/skills/rwkv/references/state-management.md +369 -0
  285. package/bin/skills/saelens/SKILL.md +386 -0
  286. package/bin/skills/saelens/references/README.md +70 -0
  287. package/bin/skills/saelens/references/api.md +333 -0
  288. package/bin/skills/saelens/references/tutorials.md +318 -0
  289. package/bin/skills/segment-anything/SKILL.md +500 -0
  290. package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
  291. package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
  292. package/bin/skills/sentence-transformers/SKILL.md +255 -0
  293. package/bin/skills/sentence-transformers/references/models.md +123 -0
  294. package/bin/skills/sentencepiece/SKILL.md +235 -0
  295. package/bin/skills/sentencepiece/references/algorithms.md +200 -0
  296. package/bin/skills/sentencepiece/references/training.md +304 -0
  297. package/bin/skills/sglang/SKILL.md +442 -0
  298. package/bin/skills/sglang/references/deployment.md +490 -0
  299. package/bin/skills/sglang/references/radix-attention.md +413 -0
  300. package/bin/skills/sglang/references/structured-generation.md +541 -0
  301. package/bin/skills/simpo/SKILL.md +219 -0
  302. package/bin/skills/simpo/references/datasets.md +478 -0
  303. package/bin/skills/simpo/references/hyperparameters.md +452 -0
  304. package/bin/skills/simpo/references/loss-functions.md +350 -0
  305. package/bin/skills/skypilot/SKILL.md +509 -0
  306. package/bin/skills/skypilot/references/advanced-usage.md +491 -0
  307. package/bin/skills/skypilot/references/troubleshooting.md +570 -0
  308. package/bin/skills/slime/SKILL.md +464 -0
  309. package/bin/skills/slime/references/api-reference.md +392 -0
  310. package/bin/skills/slime/references/troubleshooting.md +386 -0
  311. package/bin/skills/speculative-decoding/SKILL.md +467 -0
  312. package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
  313. package/bin/skills/speculative-decoding/references/medusa.md +350 -0
  314. package/bin/skills/stable-diffusion/SKILL.md +519 -0
  315. package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
  316. package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
  317. package/bin/skills/tensorboard/SKILL.md +629 -0
  318. package/bin/skills/tensorboard/references/integrations.md +638 -0
  319. package/bin/skills/tensorboard/references/profiling.md +545 -0
  320. package/bin/skills/tensorboard/references/visualization.md +620 -0
  321. package/bin/skills/tensorrt-llm/SKILL.md +187 -0
  322. package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
  323. package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
  324. package/bin/skills/tensorrt-llm/references/serving.md +470 -0
  325. package/bin/skills/tinker/SKILL.md +362 -0
  326. package/bin/skills/tinker/references/api-reference.md +168 -0
  327. package/bin/skills/tinker/references/getting-started.md +157 -0
  328. package/bin/skills/tinker/references/loss-functions.md +163 -0
  329. package/bin/skills/tinker/references/models-and-lora.md +139 -0
  330. package/bin/skills/tinker/references/recipes.md +280 -0
  331. package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
  332. package/bin/skills/tinker/references/rendering.md +243 -0
  333. package/bin/skills/tinker/references/supervised-learning.md +232 -0
  334. package/bin/skills/tinker-training-cost/SKILL.md +187 -0
  335. package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
  336. package/bin/skills/torchforge/SKILL.md +433 -0
  337. package/bin/skills/torchforge/references/api-reference.md +327 -0
  338. package/bin/skills/torchforge/references/troubleshooting.md +409 -0
  339. package/bin/skills/torchtitan/SKILL.md +358 -0
  340. package/bin/skills/torchtitan/references/checkpoint.md +181 -0
  341. package/bin/skills/torchtitan/references/custom-models.md +258 -0
  342. package/bin/skills/torchtitan/references/float8.md +133 -0
  343. package/bin/skills/torchtitan/references/fsdp.md +126 -0
  344. package/bin/skills/transformer-lens/SKILL.md +346 -0
  345. package/bin/skills/transformer-lens/references/README.md +54 -0
  346. package/bin/skills/transformer-lens/references/api.md +362 -0
  347. package/bin/skills/transformer-lens/references/tutorials.md +339 -0
  348. package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
  349. package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
  350. package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
  351. package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
  352. package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
  353. package/bin/skills/unsloth/SKILL.md +80 -0
  354. package/bin/skills/unsloth/references/index.md +7 -0
  355. package/bin/skills/unsloth/references/llms-full.md +16799 -0
  356. package/bin/skills/unsloth/references/llms-txt.md +12044 -0
  357. package/bin/skills/unsloth/references/llms.md +82 -0
  358. package/bin/skills/verl/SKILL.md +391 -0
  359. package/bin/skills/verl/references/api-reference.md +301 -0
  360. package/bin/skills/verl/references/troubleshooting.md +391 -0
  361. package/bin/skills/vllm/SKILL.md +364 -0
  362. package/bin/skills/vllm/references/optimization.md +226 -0
  363. package/bin/skills/vllm/references/quantization.md +284 -0
  364. package/bin/skills/vllm/references/server-deployment.md +255 -0
  365. package/bin/skills/vllm/references/troubleshooting.md +447 -0
  366. package/bin/skills/weights-and-biases/SKILL.md +590 -0
  367. package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
  368. package/bin/skills/weights-and-biases/references/integrations.md +700 -0
  369. package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
  370. package/bin/skills/whisper/SKILL.md +317 -0
  371. package/bin/skills/whisper/references/languages.md +189 -0
  372. package/bin/synsc +0 -0
  373. package/package.json +10 -0
@@ -0,0 +1,723 @@
# Tokenization Pipeline Components

Complete guide to normalizers, pre-tokenizers, models, post-processors, and decoders.

## Pipeline overview

**Full tokenization pipeline**:
```
Raw Text
  ↓
Normalization (cleaning, lowercasing)
  ↓
Pre-tokenization (split into words)
  ↓
Model (apply BPE/WordPiece/Unigram)
  ↓
Post-processing (add special tokens)
  ↓
Token IDs
```

**Decoding reverses the process**:
```
Token IDs
  ↓
Decoder (handle special encodings)
  ↓
Raw Text
```

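The stages above can be sketched as plain Python functions. This is a toy illustration with a made-up five-token vocabulary, not the `tokenizers` API:

```python
import unicodedata

# Hypothetical toy vocabulary, for illustration only
VOCAB = {"[CLS]": 0, "[SEP]": 1, "hello": 2, "world": 3, "[UNK]": 4}

def normalize(text):
    # Stage 1: Unicode-normalize and lowercase
    return unicodedata.normalize("NFC", text).lower()

def pre_tokenize(text):
    # Stage 2: split into word-like units
    return text.split()

def model(words):
    # Stage 3: map each unit to an id (real models split into subwords here)
    return [VOCAB.get(w, VOCAB["[UNK]"]) for w in words]

def post_process(ids):
    # Stage 4: add special tokens
    return [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]

ids = post_process(model(pre_tokenize(normalize("Hello WORLD"))))
print(ids)  # [0, 2, 3, 1]
```

A real `Tokenizer` runs the same four stages, but each is a swappable component, as the rest of this guide shows.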
## Normalizers

Clean and standardize input text.

### Common normalizers

**Lowercase**:
```python
from tokenizers.normalizers import Lowercase

tokenizer.normalizer = Lowercase()

# Input: "Hello WORLD"
# Output: "hello world"
```

**Unicode normalization**:
```python
from tokenizers.normalizers import NFD, NFC, NFKD, NFKC

# NFD: Canonical decomposition
tokenizer.normalizer = NFD()
# "é" → "e" + "́" (separate characters)

# NFC: Canonical composition (default)
tokenizer.normalizer = NFC()
# "e" + "́" → "é" (composed)

# NFKD: Compatibility decomposition
tokenizer.normalizer = NFKD()
# "ﬁ" (ligature) → "f" + "i"

# NFKC: Compatibility composition
tokenizer.normalizer = NFKC()
# Most aggressive normalization
```
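These are the standard Unicode normalization forms, so their effect can be checked with Python's stdlib `unicodedata` (the normalizers apply the same transforms inside the pipeline):

```python
import unicodedata

nfd = unicodedata.normalize("NFD", "\u00e9")     # "é"
print(len(nfd))     # 2 code points: "e" + combining acute accent

nfc = unicodedata.normalize("NFC", "e\u0301")
print(len(nfc))     # 1 code point: recomposed "é"

nfkd = unicodedata.normalize("NFKD", "\ufb01")   # "ﬁ" ligature
print(nfkd)         # fi
```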

**Strip accents**:
```python
from tokenizers.normalizers import StripAccents

tokenizer.normalizer = StripAccents()

# Input: "café"
# Output: "cafe"
```

**Whitespace handling**:
```python
from tokenizers.normalizers import Strip

# Remove leading/trailing whitespace
tokenizer.normalizer = Strip()

# Input: " hello "
# Output: "hello"
```

**Replace patterns**:
```python
from tokenizers.normalizers import Replace

# Replace newlines with spaces
tokenizer.normalizer = Replace("\n", " ")

# Input: "hello\nworld"
# Output: "hello world"
```

### Combining normalizers

```python
from tokenizers.normalizers import Sequence, NFD, Lowercase, StripAccents

# BERT-style normalization
tokenizer.normalizer = Sequence([
    NFD(),          # Unicode decomposition
    Lowercase(),    # Convert to lowercase
    StripAccents()  # Remove accents
])

# Input: "Café au Lait"
# After NFD: "Café au Lait" ("é" decomposed into "e" + combining accent)
# After Lowercase: "café au lait"
# After StripAccents: "cafe au lait"
```
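This BERT-style sequence can be reproduced with the stdlib, which makes the three steps easy to see in isolation (a re-implementation for illustration, not the library's code path):

```python
import unicodedata

def bert_style_normalize(text):
    # NFD-decompose and lowercase, then drop combining marks (accents)
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(bert_style_normalize("Café au Lait"))  # cafe au lait
```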

### Use case examples

**Case-insensitive model (BERT)**:
```python
from tokenizers.normalizers import BertNormalizer

# All-in-one BERT normalization
tokenizer.normalizer = BertNormalizer(
    clean_text=True,            # Remove control characters
    handle_chinese_chars=True,  # Add spaces around Chinese characters
    strip_accents=True,         # Remove accents
    lowercase=True              # Lowercase
)
```

**Case-sensitive model (GPT-2)**:
```python
from tokenizers.normalizers import NFC

# Minimal normalization
tokenizer.normalizer = NFC()  # Only normalize Unicode
```

**Multilingual (mBERT)**:
```python
from tokenizers.normalizers import NFKC

# Preserve scripts, normalize form
tokenizer.normalizer = NFKC()
```

## Pre-tokenizers

Split text into word-like units before tokenization.

### Whitespace splitting

```python
from tokenizers.pre_tokenizers import Whitespace

tokenizer.pre_tokenizer = Whitespace()

# Input: "Hello world! How are you?"
# Output: [("Hello", (0, 5)), ("world", (6, 11)), ("!", (11, 12)), ("How", (13, 16)), ("are", (17, 20)), ("you", (21, 24)), ("?", (24, 25))]
```

Note that `Whitespace` splits on whitespace *and* isolates punctuation runs; use `WhitespaceSplit` to split on whitespace only.

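`Whitespace` matches runs of word characters or runs of non-word, non-space characters. A stdlib regex sketch of that behavior (not the Rust implementation), including the character offsets:

```python
import re

def whitespace_pretokenize(text):
    # \w+ = run of word characters; [^\w\s]+ = run of punctuation/symbols
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]+", text)]

print(whitespace_pretokenize("Hello world!"))
# [('Hello', (0, 5)), ('world', (6, 11)), ('!', (11, 12))]
```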
### Punctuation isolation

```python
from tokenizers.pre_tokenizers import Punctuation

tokenizer.pre_tokenizer = Punctuation()

# Input: "Hello, world!"
# Output: [("Hello", ...), (",", ...), ("world", ...), ("!", ...)]
```

### Byte-level (GPT-2)

```python
from tokenizers.pre_tokenizers import ByteLevel

tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=True)

# Input: "Hello world"
# Output: byte-level tokens, with a Ġ prefix marking spaces
# [("ĠHello", ...), ("Ġworld", ...)]
```

**Key feature**: maps every input byte to one of 256 printable characters, so any Unicode text (emoji included) can be tokenized without unknown tokens.

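The Ġ marker falls out of GPT-2's byte→character table: printable ASCII and Latin-1 bytes map to themselves, and the remaining bytes (space included) are shifted into unused code points. A sketch of that construction, mirroring the published GPT-2 encoder code:

```python
def bytes_to_unicode():
    # Bytes that stay as themselves: printable ASCII plus two Latin-1 ranges
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Shift every other byte into an unused code point above 255
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

table = bytes_to_unicode()
print(table[ord(" ")])  # Ġ  — this is where the space marker comes from
```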
### Metaspace (SentencePiece)

```python
from tokenizers.pre_tokenizers import Metaspace

tokenizer.pre_tokenizer = Metaspace(replacement="▁", add_prefix_space=True)

# Input: "Hello world"
# Output: [("▁Hello", ...), ("▁world", ...)]
```

**Used by**: T5, ALBERT (via SentencePiece)

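The ▁ convention is simple enough to sketch in a few lines of stdlib Python (a toy model of the behavior, not the library's implementation): replace spaces with the marker, then split so each piece keeps its leading marker.

```python
import re

def metaspace_pretokenize(text, replacement="▁", add_prefix_space=True):
    if add_prefix_space and not text.startswith(" "):
        text = " " + text            # so the first word also gets a marker
    text = text.replace(" ", replacement)
    # Each piece starts at a marker and runs until the next one
    return re.findall(rf"{replacement}[^{replacement}]*", text)

print(metaspace_pretokenize("Hello world"))  # ['▁Hello', '▁world']
```

Because spaces survive as visible ▁ characters, decoding is lossless: join the tokens and map ▁ back to spaces.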
### Digits splitting

```python
from tokenizers.pre_tokenizers import Digits

# Split digits individually
tokenizer.pre_tokenizer = Digits(individual_digits=True)

# Input: "Room 123"
# Output: [("Room", ...), ("1", ...), ("2", ...), ("3", ...)]

# Keep digits together
tokenizer.pre_tokenizer = Digits(individual_digits=False)

# Input: "Room 123"
# Output: [("Room", ...), ("123", ...)]
```

### BERT pre-tokenizer

```python
from tokenizers.pre_tokenizers import BertPreTokenizer

tokenizer.pre_tokenizer = BertPreTokenizer()

# Splits on whitespace and punctuation, preserves CJK characters
# Input: "Hello, 世界!"
# Output: [("Hello", ...), (",", ...), ("世", ...), ("界", ...), ("!", ...)]
```

### Combining pre-tokenizers

```python
from tokenizers.pre_tokenizers import Sequence, WhitespaceSplit, Punctuation

tokenizer.pre_tokenizer = Sequence([
    WhitespaceSplit(),  # Split on whitespace first
    Punctuation()       # Then isolate punctuation
])

# Input: "Hello, world!"
# After WhitespaceSplit: [("Hello,", ...), ("world!", ...)]
# After Punctuation: [("Hello", ...), (",", ...), ("world", ...), ("!", ...)]
```
242

### Pre-tokenizer comparison

| Pre-tokenizer    | Use Case               | Example                            |
|------------------|------------------------|------------------------------------|
| Whitespace       | Simple English         | "Hello world" → ["Hello", "world"] |
| Punctuation      | Isolate symbols        | "world!" → ["world", "!"]          |
| ByteLevel        | Multilingual, emojis   | "🌍" → byte tokens                 |
| Metaspace        | SentencePiece-style    | "Hello" → ["▁Hello"]               |
| BertPreTokenizer | BERT-style (CJK aware) | "世界" → ["世", "界"]              |
| Digits           | Handle numbers         | "123" → ["1", "2", "3"] or ["123"] |

## Models

Core tokenization algorithms.

### BPE Model

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

model = BPE(
    vocab=None,                    # Or provide a pre-built vocab
    merges=None,                   # Or provide merge rules
    unk_token="[UNK]",             # Unknown token
    continuing_subword_prefix="",
    end_of_word_suffix="",
    fuse_unk=False                 # Keep unknown tokens separate
)

tokenizer = Tokenizer(model)
```

**Parameters**:
- `vocab`: Dict of token → id
- `merges`: List of merge pairs `[("a", "b"), ("ab", "c")]`
- `unk_token`: Token for unknown words
- `continuing_subword_prefix`: Prefix for subwords (empty for GPT-2)
- `end_of_word_suffix`: Suffix for last subword (empty for GPT-2)

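At encode time, the learned merges are replayed over a word's character sequence. A simplified plain-Python sketch (it applies merges in priority order; the real model repeatedly fuses the best-ranked adjacent pair):

```python
def apply_bpe(word, merges):
    """Apply ordered merge rules to a word's character sequence."""
    symbols = list(word)
    for left, right in merges:  # highest-priority merge first
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]  # fuse the pair in place
            else:
                i += 1
    return symbols

print(apply_bpe("lower", [("l", "o"), ("lo", "w"), ("e", "r")]))  # ['low', 'er']
```
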
### WordPiece Model

```python
from tokenizers.models import WordPiece

model = WordPiece(
    vocab=None,
    unk_token="[UNK]",
    max_input_chars_per_word=100,   # Max word length
    continuing_subword_prefix="##"  # BERT-style prefix
)

tokenizer = Tokenizer(model)
```

**Key difference**: Uses `##` prefix for continuing subwords.

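WordPiece segments each word by greedy longest-match-first lookup. A plain-Python sketch of that rule (toy vocabulary, not the library code):

```python
def wordpiece(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first segmentation with a continuation prefix."""
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # continuation inside a word
            if piece in vocab:
                match = piece
                break
            end -= 1  # shrink the candidate until a vocab entry matches
        if match is None:
            return [unk]  # no segmentation possible
        tokens.append(match)
        start = end
    return tokens

print(wordpiece("tokenization", {"token", "##ization", "##ize"}))  # ['token', '##ization']
```
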
### Unigram Model

```python
from tokenizers.models import Unigram

model = Unigram(
    vocab=None,          # List of (token, score) tuples
    unk_id=0,            # ID for unknown token
    byte_fallback=False  # Fall back to bytes if no match
)

tokenizer = Tokenizer(model)
```

**Probabilistic**: Selects the tokenization with the highest probability.

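That maximization is a dynamic program over per-token log-scores. A toy plain-Python version (made-up scores, not a trained model):

```python
import math

def best_segmentation(text, log_probs):
    """Pick the tokenization maximizing the sum of token log-probabilities."""
    n = len(text)
    best = [(-math.inf, None)] * (n + 1)  # best[i] = (score, split point)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    tokens, pos = [], n  # backtrack through the recorded split points
    while pos > 0:
        start = best[pos][1]
        tokens.append(text[start:pos])
        pos = start
    return tokens[::-1]

vocab = {"un": -2.0, "related": -3.0, "rel": -4.0, "ated": -4.0}
print(best_segmentation("unrelated", vocab))  # ['un', 'related']
```

Here "un" + "related" (total score -5.0) beats "un" + "rel" + "ated" (-10.0), so it wins.
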
### WordLevel Model

```python
from tokenizers.models import WordLevel

# Simple word-to-ID mapping (no subwords)
model = WordLevel(
    vocab=None,
    unk_token="[UNK]"
)

tokenizer = Tokenizer(model)
```

**Warning**: Requires a huge vocabulary (one token per word).

## Post-processors

Add special tokens and format output.

### Template processing

**BERT-style** (`[CLS] sentence [SEP]`):
```python
from tokenizers.processors import TemplateProcessing

tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", 101),
        ("[SEP]", 102),
    ],
)

# Single sentence
output = tokenizer.encode("Hello world")
# [101, ..., 102] ([CLS] hello world [SEP])

# Sentence pair
output = tokenizer.encode("Hello", "world")
# [101, ..., 102, ..., 102] ([CLS] hello [SEP] world [SEP])
```

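Mechanically, the template is filled slot by slot. The sketch below also derives the `token_type_ids` BERT expects (hypothetical ids, and simplified relative to TemplateProcessing's `$A:0`/`$B:1` annotations):

```python
def apply_template(template, special_ids, a_ids, b_ids=None):
    """Fill $A/$B slots and assign type id 1 from $B onward."""
    ids, type_ids, current_type = [], [], 0
    for piece in template.split():
        if piece == "$A":
            ids += a_ids
            type_ids += [0] * len(a_ids)
        elif piece == "$B":
            current_type = 1
            ids += b_ids
            type_ids += [1] * len(b_ids)
        else:  # a special token such as [CLS] or [SEP]
            ids.append(special_ids[piece])
            type_ids.append(current_type)
    return ids, type_ids

specials = {"[CLS]": 101, "[SEP]": 102}
print(apply_template("[CLS] $A [SEP] $B [SEP]", specials, [7592], [2088]))
# ([101, 7592, 102, 2088, 102], [0, 0, 0, 1, 1])
```
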
**GPT-2 style** (`sentence <|endoftext|>`):
```python
tokenizer.post_processor = TemplateProcessing(
    single="$A <|endoftext|>",
    special_tokens=[
        ("<|endoftext|>", 50256),
    ],
)
```

**RoBERTa style** (`<s> sentence </s>`):
```python
tokenizer.post_processor = TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> </s> $B </s>",
    special_tokens=[
        ("<s>", 0),
        ("</s>", 2),
    ],
)
```

**T5 style** (`sentence </s>`):
```python
# T5 adds no [CLS]/[SEP], but it does append the EOS token </s> (id 1)
tokenizer.post_processor = TemplateProcessing(
    single="$A </s>",
    pair="$A </s> $B </s>",
    special_tokens=[
        ("</s>", 1),
    ],
)
```


### RobertaProcessing

```python
from tokenizers.processors import RobertaProcessing

tokenizer.post_processor = RobertaProcessing(
    sep=("</s>", 2),
    cls=("<s>", 0),
    add_prefix_space=True,  # Whether the pre-tokenizer added a prefix space
    trim_offsets=True       # Trim the leading space from offsets
)
```

### ByteLevelProcessing

```python
from tokenizers.processors import ByteLevel as ByteLevelProcessing

tokenizer.post_processor = ByteLevelProcessing(
    trim_offsets=True  # Remove Ġ from offsets
)
```

## Decoders

Convert token IDs back to text.

### ByteLevel decoder

```python
from tokenizers.decoders import ByteLevel

tokenizer.decoder = ByteLevel()

# Handles byte-level tokens
# ["ĠHello", "Ġworld"] → "Hello world"
```

### WordPiece decoder

```python
from tokenizers.decoders import WordPiece

tokenizer.decoder = WordPiece(prefix="##")

# Removes ## prefix and concatenates
# ["token", "##ization"] → "tokenization"
```

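The join rule is simple enough to sketch in plain Python (an illustration of the prefix handling, not the library's implementation):

```python
def wordpiece_decode(tokens, prefix="##"):
    """Glue continuation tokens to their predecessor; space-join the rest."""
    text = ""
    for i, token in enumerate(tokens):
        if token.startswith(prefix):
            text += token[len(prefix):]            # continuation: no space
        else:
            text += (" " if i > 0 else "") + token  # new word: add a space
    return text

print(wordpiece_decode(["token", "##ization", "rocks"]))  # tokenization rocks
```
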
### Metaspace decoder

```python
from tokenizers.decoders import Metaspace

tokenizer.decoder = Metaspace(replacement="▁", add_prefix_space=True)

# Converts ▁ back to spaces
# ["▁Hello", "▁world"] → "Hello world"
```

### BPEDecoder

```python
from tokenizers.decoders import BPEDecoder

tokenizer.decoder = BPEDecoder(suffix="</w>")

# Removes suffix and concatenates
# ["token", "ization</w>"] → "tokenization"
```

### Sequence decoder

```python
from tokenizers.decoders import Sequence, ByteLevel, Strip

tokenizer.decoder = Sequence([
    ByteLevel(),      # Decode byte-level first
    Strip(' ', 1, 1)  # Strip leading/trailing spaces
])
```

## Complete pipeline examples

### BERT tokenizer

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.normalizers import BertNormalizer
from tokenizers.pre_tokenizers import BertPreTokenizer
from tokenizers.processors import TemplateProcessing
from tokenizers.decoders import WordPiece as WordPieceDecoder

# Model
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# Normalization
tokenizer.normalizer = BertNormalizer(lowercase=True)

# Pre-tokenization
tokenizer.pre_tokenizer = BertPreTokenizer()

# Post-processing
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[("[CLS]", 101), ("[SEP]", 102)],
)

# Decoder
tokenizer.decoder = WordPieceDecoder(prefix="##")

# Enable padding
tokenizer.enable_padding(pad_id=0, pad_token="[PAD]")

# Enable truncation
tokenizer.enable_truncation(max_length=512)
```


### GPT-2 tokenizer

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import NFC
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.processors import TemplateProcessing

# Model
tokenizer = Tokenizer(BPE())

# Normalization (minimal)
tokenizer.normalizer = NFC()

# Byte-level pre-tokenization
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)

# Post-processing
tokenizer.post_processor = TemplateProcessing(
    single="$A <|endoftext|>",
    special_tokens=[("<|endoftext|>", 50256)],
)

# Byte-level decoder
tokenizer.decoder = ByteLevelDecoder()
```


### T5 tokenizer (SentencePiece-style)

```python
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.normalizers import NFKC
from tokenizers.pre_tokenizers import Metaspace
from tokenizers.processors import TemplateProcessing
from tokenizers.decoders import Metaspace as MetaspaceDecoder

# Model
tokenizer = Tokenizer(Unigram())

# Normalization
tokenizer.normalizer = NFKC()

# Metaspace pre-tokenization
tokenizer.pre_tokenizer = Metaspace(replacement="▁", add_prefix_space=True)

# Post-processing (T5 adds no CLS/SEP, only a trailing </s>)
tokenizer.post_processor = TemplateProcessing(
    single="$A </s>",
    special_tokens=[("</s>", 1)],
)

# Metaspace decoder
tokenizer.decoder = MetaspaceDecoder(replacement="▁", add_prefix_space=True)
```


## Alignment tracking

Track token positions in the original text.

### Basic alignment

```python
text = "Hello, world!"
output = tokenizer.encode(text)

for token, (start, end) in zip(output.tokens, output.offsets):
    print(f"{token:10s} → [{start:2d}, {end:2d}): {text[start:end]!r}")

# Output:
# [CLS]      → [ 0,  0): ''
# hello      → [ 0,  5): 'Hello'
# ,          → [ 5,  6): ','
# world      → [ 7, 12): 'world'
# !          → [12, 13): '!'
# [SEP]      → [ 0,  0): ''
```

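Offsets make it trivial to recover each token's surface form without the tokenizer. Using the offsets printed above (where `(0, 0)` marks special tokens):

```python
text = "Hello, world!"
# Offsets as printed above; (0, 0) spans mark special tokens
offsets = [(0, 0), (0, 5), (5, 6), (7, 12), (12, 13), (0, 0)]

# Slice the original text, skipping the empty special-token spans
spans = [text[start:end] for start, end in offsets if end > start]
print(spans)  # ['Hello', ',', 'world', '!']
```
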
### Word-level alignment

```python
# Get word_ids (which word each token belongs to)
encoding = tokenizer.encode("Hello world")
word_ids = encoding.word_ids

print(word_ids)
# [None, 0, 0, 1, None]
# None = special token, 0 = first word, 1 = second word
```

**Use case**: Token classification (NER)
```python
# Align predictions to words (keep the first sub-token's prediction)
predictions = ["O", "B-PER", "I-PER", "O", "O"]
word_predictions = {}

for token_idx, word_idx in enumerate(encoding.word_ids):
    if word_idx is not None and word_idx not in word_predictions:
        word_predictions[word_idx] = predictions[token_idx]

print(word_predictions)
# {0: "B-PER", 1: "O"}  # First word is a PERSON, second is outside any entity
```


### Span alignment

```python
# Find the token span for a character span
text = "Machine learning is awesome"
char_start, char_end = 8, 16  # "learning"

encoding = tokenizer.encode(text)

# Find token span
token_start = encoding.char_to_token(char_start)
token_end = encoding.char_to_token(char_end - 1) + 1

print(f"Tokens {token_start}:{token_end} = {encoding.tokens[token_start:token_end]}")
# Tokens 2:3 = ['learning']
```

**Use case**: Question answering (extract answer span)

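`char_to_token` is essentially an offset lookup. A plain-Python equivalent, with hypothetical offsets for the sentence above (assuming a leading [CLS] and trailing [SEP]):

```python
def char_to_token(offsets, char_idx):
    """Return the index of the token whose span covers char_idx, else None."""
    for i, (start, end) in enumerate(offsets):
        if start <= char_idx < end:
            return i
    return None

# Hypothetical spans: [CLS], "Machine", "learning", "is", "awesome", [SEP]
offsets = [(0, 0), (0, 7), (8, 16), (17, 19), (20, 27), (0, 0)]
print(char_to_token(offsets, 8))  # 2 (first character of "learning")
print(char_to_token(offsets, 7))  # None (the space between words)
```
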
## Custom components

### Custom normalizer

```python
from tokenizers import NormalizedString
from tokenizers.normalizers import Normalizer

class CustomNormalizer:
    def normalize(self, normalized: NormalizedString):
        # Custom normalization logic
        normalized.lowercase()
        normalized.replace("  ", " ")  # Collapse double spaces

# Python components must be wrapped with Normalizer.custom
tokenizer.normalizer = Normalizer.custom(CustomNormalizer())
```

### Custom pre-tokenizer

```python
from tokenizers import PreTokenizedString
from tokenizers.pre_tokenizers import PreTokenizer

class CustomPreTokenizer:
    def pre_tokenize(self, pretok: PreTokenizedString):
        # Split every piece on spaces; the callback returns the sub-pieces
        pretok.split(lambda i, normalized: normalized.split(" ", "removed"))

tokenizer.pre_tokenizer = PreTokenizer.custom(CustomPreTokenizer())
```

## Troubleshooting

### Issue: Misaligned offsets

**Symptom**: Offsets don't match the original text
```python
text = "  hello"   # Leading spaces
offsets = [(0, 5)]  # Points at "  hel" instead of "hello"
```

**Solution**: Check whether normalization strips characters
```python
from tokenizers.normalizers import Sequence, Strip
from tokenizers.processors import ByteLevel as ByteLevelProcessing

# Problem: stripping in the normalizer shifts the offsets
tokenizer.normalizer = Sequence([
    Strip(),  # This changes offsets!
])

# Fix: leave the text alone and trim in the post-processor instead
tokenizer.post_processor = ByteLevelProcessing(trim_offsets=True)
```

### Issue: Special tokens not added

**Symptom**: No [CLS] or [SEP] in output

**Solution**: Check post-processor is set
```python
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[("[CLS]", 101), ("[SEP]", 102)],
)
```

### Issue: Incorrect decoding

**Symptom**: Decoded text has ## or ▁

**Solution**: Set correct decoder
```python
# For WordPiece
tokenizer.decoder = WordPieceDecoder(prefix="##")

# For SentencePiece
tokenizer.decoder = MetaspaceDecoder(replacement="▁")
```


## Best practices

1. **Match pipeline to model architecture**:
   - BERT → BertNormalizer + BertPreTokenizer + WordPiece
   - GPT-2 → NFC + ByteLevel + BPE
   - T5 → NFKC + Metaspace + Unigram

2. **Test pipeline on sample inputs**:
   - Check normalization doesn't over-normalize
   - Verify pre-tokenization splits correctly
   - Ensure decoding reconstructs text

3. **Preserve alignment for downstream tasks**:
   - Use `trim_offsets` instead of stripping in normalizer
   - Test `char_to_token()` on sample spans

4. **Document your pipeline**:
   - Save complete tokenizer config
   - Document special tokens
   - Note any custom components