@synsci/cli-darwin-x64 1.1.49
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/skills/accelerate/SKILL.md +332 -0
- package/bin/skills/accelerate/references/custom-plugins.md +453 -0
- package/bin/skills/accelerate/references/megatron-integration.md +489 -0
- package/bin/skills/accelerate/references/performance.md +525 -0
- package/bin/skills/audiocraft/SKILL.md +564 -0
- package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
- package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
- package/bin/skills/autogpt/SKILL.md +403 -0
- package/bin/skills/autogpt/references/advanced-usage.md +535 -0
- package/bin/skills/autogpt/references/troubleshooting.md +420 -0
- package/bin/skills/awq/SKILL.md +310 -0
- package/bin/skills/awq/references/advanced-usage.md +324 -0
- package/bin/skills/awq/references/troubleshooting.md +344 -0
- package/bin/skills/axolotl/SKILL.md +158 -0
- package/bin/skills/axolotl/references/api.md +5548 -0
- package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
- package/bin/skills/axolotl/references/index.md +15 -0
- package/bin/skills/axolotl/references/other.md +3563 -0
- package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
- package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
- package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
- package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
- package/bin/skills/bitsandbytes/SKILL.md +411 -0
- package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
- package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
- package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
- package/bin/skills/blip-2/SKILL.md +564 -0
- package/bin/skills/blip-2/references/advanced-usage.md +680 -0
- package/bin/skills/blip-2/references/troubleshooting.md +526 -0
- package/bin/skills/chroma/SKILL.md +406 -0
- package/bin/skills/chroma/references/integration.md +38 -0
- package/bin/skills/clip/SKILL.md +253 -0
- package/bin/skills/clip/references/applications.md +207 -0
- package/bin/skills/constitutional-ai/SKILL.md +290 -0
- package/bin/skills/crewai/SKILL.md +498 -0
- package/bin/skills/crewai/references/flows.md +438 -0
- package/bin/skills/crewai/references/tools.md +429 -0
- package/bin/skills/crewai/references/troubleshooting.md +480 -0
- package/bin/skills/deepspeed/SKILL.md +141 -0
- package/bin/skills/deepspeed/references/08.md +17 -0
- package/bin/skills/deepspeed/references/09.md +173 -0
- package/bin/skills/deepspeed/references/2020.md +378 -0
- package/bin/skills/deepspeed/references/2023.md +279 -0
- package/bin/skills/deepspeed/references/assets.md +179 -0
- package/bin/skills/deepspeed/references/index.md +35 -0
- package/bin/skills/deepspeed/references/mii.md +118 -0
- package/bin/skills/deepspeed/references/other.md +1191 -0
- package/bin/skills/deepspeed/references/tutorials.md +6554 -0
- package/bin/skills/dspy/SKILL.md +590 -0
- package/bin/skills/dspy/references/examples.md +663 -0
- package/bin/skills/dspy/references/modules.md +475 -0
- package/bin/skills/dspy/references/optimizers.md +566 -0
- package/bin/skills/faiss/SKILL.md +221 -0
- package/bin/skills/faiss/references/index_types.md +280 -0
- package/bin/skills/flash-attention/SKILL.md +367 -0
- package/bin/skills/flash-attention/references/benchmarks.md +215 -0
- package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
- package/bin/skills/gguf/SKILL.md +427 -0
- package/bin/skills/gguf/references/advanced-usage.md +504 -0
- package/bin/skills/gguf/references/troubleshooting.md +442 -0
- package/bin/skills/gptq/SKILL.md +450 -0
- package/bin/skills/gptq/references/calibration.md +337 -0
- package/bin/skills/gptq/references/integration.md +129 -0
- package/bin/skills/gptq/references/troubleshooting.md +95 -0
- package/bin/skills/grpo-rl-training/README.md +97 -0
- package/bin/skills/grpo-rl-training/SKILL.md +572 -0
- package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
- package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
- package/bin/skills/guidance/SKILL.md +572 -0
- package/bin/skills/guidance/references/backends.md +554 -0
- package/bin/skills/guidance/references/constraints.md +674 -0
- package/bin/skills/guidance/references/examples.md +767 -0
- package/bin/skills/hqq/SKILL.md +445 -0
- package/bin/skills/hqq/references/advanced-usage.md +528 -0
- package/bin/skills/hqq/references/troubleshooting.md +503 -0
- package/bin/skills/hugging-face-cli/SKILL.md +191 -0
- package/bin/skills/hugging-face-cli/references/commands.md +954 -0
- package/bin/skills/hugging-face-cli/references/examples.md +374 -0
- package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
- package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
- package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
- package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
- package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
- package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
- package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
- package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
- package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
- package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
- package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
- package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
- package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
- package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
- package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
- package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
- package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
- package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
- package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
- package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
- package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
- package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
- package/bin/skills/hugging-face-jobs/index.html +216 -0
- package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
- package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
- package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
- package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
- package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
- package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
- package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
- package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
- package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
- package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
- package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
- package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
- package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
- package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
- package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
- package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
- package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
- package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
- package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
- package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
- package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
- package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
- package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
- package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
- package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
- package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
- package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
- package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
- package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
- package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
- package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
- package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
- package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
- package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
- package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
- package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
- package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
- package/bin/skills/instructor/SKILL.md +740 -0
- package/bin/skills/instructor/references/examples.md +107 -0
- package/bin/skills/instructor/references/providers.md +70 -0
- package/bin/skills/instructor/references/validation.md +606 -0
- package/bin/skills/knowledge-distillation/SKILL.md +458 -0
- package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
- package/bin/skills/lambda-labs/SKILL.md +545 -0
- package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
- package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
- package/bin/skills/langchain/SKILL.md +480 -0
- package/bin/skills/langchain/references/agents.md +499 -0
- package/bin/skills/langchain/references/integration.md +562 -0
- package/bin/skills/langchain/references/rag.md +600 -0
- package/bin/skills/langsmith/SKILL.md +422 -0
- package/bin/skills/langsmith/references/advanced-usage.md +548 -0
- package/bin/skills/langsmith/references/troubleshooting.md +537 -0
- package/bin/skills/litgpt/SKILL.md +469 -0
- package/bin/skills/litgpt/references/custom-models.md +568 -0
- package/bin/skills/litgpt/references/distributed-training.md +451 -0
- package/bin/skills/litgpt/references/supported-models.md +336 -0
- package/bin/skills/litgpt/references/training-recipes.md +619 -0
- package/bin/skills/llama-cpp/SKILL.md +258 -0
- package/bin/skills/llama-cpp/references/optimization.md +89 -0
- package/bin/skills/llama-cpp/references/quantization.md +213 -0
- package/bin/skills/llama-cpp/references/server.md +125 -0
- package/bin/skills/llama-factory/SKILL.md +80 -0
- package/bin/skills/llama-factory/references/_images.md +23 -0
- package/bin/skills/llama-factory/references/advanced.md +1055 -0
- package/bin/skills/llama-factory/references/getting_started.md +349 -0
- package/bin/skills/llama-factory/references/index.md +19 -0
- package/bin/skills/llama-factory/references/other.md +31 -0
- package/bin/skills/llamaguard/SKILL.md +337 -0
- package/bin/skills/llamaindex/SKILL.md +569 -0
- package/bin/skills/llamaindex/references/agents.md +83 -0
- package/bin/skills/llamaindex/references/data_connectors.md +108 -0
- package/bin/skills/llamaindex/references/query_engines.md +406 -0
- package/bin/skills/llava/SKILL.md +304 -0
- package/bin/skills/llava/references/training.md +197 -0
- package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
- package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
- package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
- package/bin/skills/long-context/SKILL.md +536 -0
- package/bin/skills/long-context/references/extension_methods.md +468 -0
- package/bin/skills/long-context/references/fine_tuning.md +611 -0
- package/bin/skills/long-context/references/rope.md +402 -0
- package/bin/skills/mamba/SKILL.md +260 -0
- package/bin/skills/mamba/references/architecture-details.md +206 -0
- package/bin/skills/mamba/references/benchmarks.md +255 -0
- package/bin/skills/mamba/references/training-guide.md +388 -0
- package/bin/skills/megatron-core/SKILL.md +366 -0
- package/bin/skills/megatron-core/references/benchmarks.md +249 -0
- package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
- package/bin/skills/megatron-core/references/production-examples.md +473 -0
- package/bin/skills/megatron-core/references/training-recipes.md +547 -0
- package/bin/skills/miles/SKILL.md +315 -0
- package/bin/skills/miles/references/api-reference.md +141 -0
- package/bin/skills/miles/references/troubleshooting.md +352 -0
- package/bin/skills/mlflow/SKILL.md +704 -0
- package/bin/skills/mlflow/references/deployment.md +744 -0
- package/bin/skills/mlflow/references/model-registry.md +770 -0
- package/bin/skills/mlflow/references/tracking.md +680 -0
- package/bin/skills/modal/SKILL.md +341 -0
- package/bin/skills/modal/references/advanced-usage.md +503 -0
- package/bin/skills/modal/references/troubleshooting.md +494 -0
- package/bin/skills/model-merging/SKILL.md +539 -0
- package/bin/skills/model-merging/references/evaluation.md +462 -0
- package/bin/skills/model-merging/references/examples.md +428 -0
- package/bin/skills/model-merging/references/methods.md +352 -0
- package/bin/skills/model-pruning/SKILL.md +495 -0
- package/bin/skills/model-pruning/references/wanda.md +347 -0
- package/bin/skills/moe-training/SKILL.md +526 -0
- package/bin/skills/moe-training/references/architectures.md +432 -0
- package/bin/skills/moe-training/references/inference.md +348 -0
- package/bin/skills/moe-training/references/training.md +425 -0
- package/bin/skills/nanogpt/SKILL.md +290 -0
- package/bin/skills/nanogpt/references/architecture.md +382 -0
- package/bin/skills/nanogpt/references/data.md +476 -0
- package/bin/skills/nanogpt/references/training.md +564 -0
- package/bin/skills/nemo-curator/SKILL.md +383 -0
- package/bin/skills/nemo-curator/references/deduplication.md +87 -0
- package/bin/skills/nemo-curator/references/filtering.md +102 -0
- package/bin/skills/nemo-evaluator/SKILL.md +494 -0
- package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
- package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
- package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
- package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
- package/bin/skills/nemo-guardrails/SKILL.md +297 -0
- package/bin/skills/nnsight/SKILL.md +436 -0
- package/bin/skills/nnsight/references/README.md +78 -0
- package/bin/skills/nnsight/references/api.md +344 -0
- package/bin/skills/nnsight/references/tutorials.md +300 -0
- package/bin/skills/openrlhf/SKILL.md +249 -0
- package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
- package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
- package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
- package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
- package/bin/skills/outlines/SKILL.md +652 -0
- package/bin/skills/outlines/references/backends.md +615 -0
- package/bin/skills/outlines/references/examples.md +773 -0
- package/bin/skills/outlines/references/json_generation.md +652 -0
- package/bin/skills/peft/SKILL.md +431 -0
- package/bin/skills/peft/references/advanced-usage.md +514 -0
- package/bin/skills/peft/references/troubleshooting.md +480 -0
- package/bin/skills/phoenix/SKILL.md +475 -0
- package/bin/skills/phoenix/references/advanced-usage.md +619 -0
- package/bin/skills/phoenix/references/troubleshooting.md +538 -0
- package/bin/skills/pinecone/SKILL.md +358 -0
- package/bin/skills/pinecone/references/deployment.md +181 -0
- package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
- package/bin/skills/pytorch-fsdp/references/index.md +7 -0
- package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
- package/bin/skills/pytorch-lightning/SKILL.md +346 -0
- package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
- package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
- package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
- package/bin/skills/pyvene/SKILL.md +473 -0
- package/bin/skills/pyvene/references/README.md +73 -0
- package/bin/skills/pyvene/references/api.md +383 -0
- package/bin/skills/pyvene/references/tutorials.md +376 -0
- package/bin/skills/qdrant/SKILL.md +493 -0
- package/bin/skills/qdrant/references/advanced-usage.md +648 -0
- package/bin/skills/qdrant/references/troubleshooting.md +631 -0
- package/bin/skills/ray-data/SKILL.md +326 -0
- package/bin/skills/ray-data/references/integration.md +82 -0
- package/bin/skills/ray-data/references/transformations.md +83 -0
- package/bin/skills/ray-train/SKILL.md +406 -0
- package/bin/skills/ray-train/references/multi-node.md +628 -0
- package/bin/skills/rwkv/SKILL.md +260 -0
- package/bin/skills/rwkv/references/architecture-details.md +344 -0
- package/bin/skills/rwkv/references/rwkv7.md +386 -0
- package/bin/skills/rwkv/references/state-management.md +369 -0
- package/bin/skills/saelens/SKILL.md +386 -0
- package/bin/skills/saelens/references/README.md +70 -0
- package/bin/skills/saelens/references/api.md +333 -0
- package/bin/skills/saelens/references/tutorials.md +318 -0
- package/bin/skills/segment-anything/SKILL.md +500 -0
- package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
- package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
- package/bin/skills/sentence-transformers/SKILL.md +255 -0
- package/bin/skills/sentence-transformers/references/models.md +123 -0
- package/bin/skills/sentencepiece/SKILL.md +235 -0
- package/bin/skills/sentencepiece/references/algorithms.md +200 -0
- package/bin/skills/sentencepiece/references/training.md +304 -0
- package/bin/skills/sglang/SKILL.md +442 -0
- package/bin/skills/sglang/references/deployment.md +490 -0
- package/bin/skills/sglang/references/radix-attention.md +413 -0
- package/bin/skills/sglang/references/structured-generation.md +541 -0
- package/bin/skills/simpo/SKILL.md +219 -0
- package/bin/skills/simpo/references/datasets.md +478 -0
- package/bin/skills/simpo/references/hyperparameters.md +452 -0
- package/bin/skills/simpo/references/loss-functions.md +350 -0
- package/bin/skills/skypilot/SKILL.md +509 -0
- package/bin/skills/skypilot/references/advanced-usage.md +491 -0
- package/bin/skills/skypilot/references/troubleshooting.md +570 -0
- package/bin/skills/slime/SKILL.md +464 -0
- package/bin/skills/slime/references/api-reference.md +392 -0
- package/bin/skills/slime/references/troubleshooting.md +386 -0
- package/bin/skills/speculative-decoding/SKILL.md +467 -0
- package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
- package/bin/skills/speculative-decoding/references/medusa.md +350 -0
- package/bin/skills/stable-diffusion/SKILL.md +519 -0
- package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
- package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
- package/bin/skills/tensorboard/SKILL.md +629 -0
- package/bin/skills/tensorboard/references/integrations.md +638 -0
- package/bin/skills/tensorboard/references/profiling.md +545 -0
- package/bin/skills/tensorboard/references/visualization.md +620 -0
- package/bin/skills/tensorrt-llm/SKILL.md +187 -0
- package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
- package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
- package/bin/skills/tensorrt-llm/references/serving.md +470 -0
- package/bin/skills/tinker/SKILL.md +362 -0
- package/bin/skills/tinker/references/api-reference.md +168 -0
- package/bin/skills/tinker/references/getting-started.md +157 -0
- package/bin/skills/tinker/references/loss-functions.md +163 -0
- package/bin/skills/tinker/references/models-and-lora.md +139 -0
- package/bin/skills/tinker/references/recipes.md +280 -0
- package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
- package/bin/skills/tinker/references/rendering.md +243 -0
- package/bin/skills/tinker/references/supervised-learning.md +232 -0
- package/bin/skills/tinker-training-cost/SKILL.md +187 -0
- package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
- package/bin/skills/torchforge/SKILL.md +433 -0
- package/bin/skills/torchforge/references/api-reference.md +327 -0
- package/bin/skills/torchforge/references/troubleshooting.md +409 -0
- package/bin/skills/torchtitan/SKILL.md +358 -0
- package/bin/skills/torchtitan/references/checkpoint.md +181 -0
- package/bin/skills/torchtitan/references/custom-models.md +258 -0
- package/bin/skills/torchtitan/references/float8.md +133 -0
- package/bin/skills/torchtitan/references/fsdp.md +126 -0
- package/bin/skills/transformer-lens/SKILL.md +346 -0
- package/bin/skills/transformer-lens/references/README.md +54 -0
- package/bin/skills/transformer-lens/references/api.md +362 -0
- package/bin/skills/transformer-lens/references/tutorials.md +339 -0
- package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
- package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
- package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
- package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
- package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
- package/bin/skills/unsloth/SKILL.md +80 -0
- package/bin/skills/unsloth/references/index.md +7 -0
- package/bin/skills/unsloth/references/llms-full.md +16799 -0
- package/bin/skills/unsloth/references/llms-txt.md +12044 -0
- package/bin/skills/unsloth/references/llms.md +82 -0
- package/bin/skills/verl/SKILL.md +391 -0
- package/bin/skills/verl/references/api-reference.md +301 -0
- package/bin/skills/verl/references/troubleshooting.md +391 -0
- package/bin/skills/vllm/SKILL.md +364 -0
- package/bin/skills/vllm/references/optimization.md +226 -0
- package/bin/skills/vllm/references/quantization.md +284 -0
- package/bin/skills/vllm/references/server-deployment.md +255 -0
- package/bin/skills/vllm/references/troubleshooting.md +447 -0
- package/bin/skills/weights-and-biases/SKILL.md +590 -0
- package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
- package/bin/skills/weights-and-biases/references/integrations.md +700 -0
- package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
- package/bin/skills/whisper/SKILL.md +317 -0
- package/bin/skills/whisper/references/languages.md +189 -0
- package/bin/synsc +0 -0
- package/package.json +10 -0
@@ -0,0 +1,723 @@

# Tokenization Pipeline Components

Complete guide to normalizers, pre-tokenizers, models, post-processors, and decoders.

## Pipeline overview

**Full tokenization pipeline**:
```
Raw Text
    ↓
Normalization (cleaning, lowercasing)
    ↓
Pre-tokenization (split into words)
    ↓
Model (apply BPE/WordPiece/Unigram)
    ↓
Post-processing (add special tokens)
    ↓
Token IDs
```

**Decoding reverses the process**:
```
Token IDs
    ↓
Decoder (handle special encodings)
    ↓
Raw Text
```
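Assembled end to end, the stages above look like the following in the `tokenizers` library. This is a minimal sketch: the two-sentence training corpus, the `vocab_size`, and the `[CLS]`/`[SEP]` template are illustrative choices, not part of the guide.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# Model + normalizer + pre-tokenizer
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = Whitespace()

# Train on a toy corpus (illustrative only)
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]"], vocab_size=200)
tokenizer.train_from_iterator(["Hello world", "Hello there"], trainer=trainer)

# Post-processor wraps each sequence in special tokens
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

enc = tokenizer.encode("Hello world")
print(enc.tokens)  # e.g. ['[CLS]', 'hello', 'world', '[SEP]']
```

Note how the normalizer lowercased the input before the model ever saw it.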
## Normalizers

Clean and standardize input text.

### Common normalizers

**Lowercase**:
```python
from tokenizers.normalizers import Lowercase

tokenizer.normalizer = Lowercase()

# Input:  "Hello WORLD"
# Output: "hello world"
```

**Unicode normalization**:
```python
from tokenizers.normalizers import NFD, NFC, NFKD, NFKC

# NFD: Canonical decomposition
tokenizer.normalizer = NFD()
# "é" → "e" + "́" (separate characters)

# NFC: Canonical composition (the usual default)
tokenizer.normalizer = NFC()
# "e" + "́" → "é" (composed)

# NFKD: Compatibility decomposition
tokenizer.normalizer = NFKD()
# "ﬁ" (ligature) → "f" + "i"

# NFKC: Compatibility composition
tokenizer.normalizer = NFKC()
# Most aggressive normalization
```
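Python's standard-library `unicodedata` module applies the same four normal forms, which makes their differences easy to inspect without any tokenizer at all:

```python
import unicodedata

s = "caf\u00e9"  # "café" with a single precomposed é (U+00E9)

nfd = unicodedata.normalize("NFD", s)
print(len(s), len(nfd))  # prints "4 5": NFD splits é into e + combining accent

# NFC recomposes exactly what NFD decomposed
assert unicodedata.normalize("NFC", nfd) == s

# NFKC additionally folds compatibility characters such as the ﬁ ligature
assert unicodedata.normalize("NFKC", "\ufb01") == "fi"
```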
**Strip accents**:
```python
from tokenizers.normalizers import StripAccents

tokenizer.normalizer = StripAccents()

# Input:  "café"
# Output: "cafe"
```

Note: `StripAccents` removes combining marks, so pair it with `NFD` when the input may contain precomposed characters like "é".

**Whitespace handling**:
```python
from tokenizers.normalizers import Strip

# Remove leading/trailing whitespace
tokenizer.normalizer = Strip()

# Input:  "  hello  "
# Output: "hello"
```

**Replace patterns**:
```python
from tokenizers.normalizers import Replace

# Replace newlines with spaces
tokenizer.normalizer = Replace("\n", " ")

# Input:  "hello\nworld"
# Output: "hello world"
```

### Combining normalizers

```python
from tokenizers.normalizers import Sequence, NFD, Lowercase, StripAccents

# BERT-style normalization
tokenizer.normalizer = Sequence([
    NFD(),           # Unicode decomposition
    Lowercase(),     # Convert to lowercase
    StripAccents(),  # Remove accents
])

# Input: "Café au Lait"
# After NFD:          "Café au Lait" (é stored as e + ́)
# After Lowercase:    "café au lait"
# After StripAccents: "cafe au lait"
```
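Normalizers can be tested in isolation via `normalize_str`, which is handy for verifying a sequence like the one above before wiring it into a tokenizer:

```python
from tokenizers.normalizers import Sequence, NFD, Lowercase, StripAccents

normalizer = Sequence([NFD(), Lowercase(), StripAccents()])
print(normalizer.normalize_str("Café au Lait"))  # cafe au lait
```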
### Use case examples

**Case-insensitive model (BERT)**:
```python
from tokenizers.normalizers import BertNormalizer

# All-in-one BERT normalization
tokenizer.normalizer = BertNormalizer(
    clean_text=True,            # Remove control characters
    handle_chinese_chars=True,  # Add spaces around CJK characters
    strip_accents=True,         # Remove accents
    lowercase=True,             # Lowercase
)
```

**Case-sensitive model (GPT-2)**:
```python
from tokenizers.normalizers import NFC

# Minimal normalization
tokenizer.normalizer = NFC()  # Only normalize Unicode
```

**Multilingual (mBERT)**:
```python
from tokenizers.normalizers import NFKC

# Preserve scripts, normalize form
tokenizer.normalizer = NFKC()
```
## Pre-tokenizers

Split text into word-like units before tokenization.

### Whitespace splitting

```python
from tokenizers.pre_tokenizers import Whitespace

tokenizer.pre_tokenizer = Whitespace()

# Input:  "Hello world! How are you?"
# Output: [("Hello", (0, 5)), ("world", (6, 11)), ("!", (11, 12)),
#          ("How", (13, 16)), ("are", (17, 20)), ("you", (21, 24)), ("?", (24, 25))]
# Note: Whitespace matches \w+|[^\w\s]+, so punctuation is split off too.
# Use WhitespaceSplit to split on whitespace only:
# [("Hello", (0, 5)), ("world!", (6, 12)), ...]
```
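`pre_tokenize_str` shows exactly how a pre-tokenizer splits, which makes the `Whitespace` vs `WhitespaceSplit` distinction concrete:

```python
from tokenizers.pre_tokenizers import Whitespace, WhitespaceSplit

text = "Hello world!"
print(Whitespace().pre_tokenize_str(text))
# [('Hello', (0, 5)), ('world', (6, 11)), ('!', (11, 12))]
print(WhitespaceSplit().pre_tokenize_str(text))
# [('Hello', (0, 5)), ('world!', (6, 12))]
```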
### Punctuation isolation

```python
from tokenizers.pre_tokenizers import Punctuation

tokenizer.pre_tokenizer = Punctuation()

# Input:  "Hello, world!"
# Output: [("Hello", ...), (",", ...), ("world", ...), ("!", ...)]
```

### Byte-level (GPT-2)

```python
from tokenizers.pre_tokenizers import ByteLevel

tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=True)

# Input:  "Hello world"
# Output: byte-level tokens, with Ġ marking a leading space
# [("ĠHello", ...), ("Ġworld", ...)]
```

**Key feature**: Works on raw bytes, so every Unicode string is covered by a base alphabet of just 256 symbols and no unknown token is ever needed.
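The Ġ convention above can be observed directly; a quick check using the same `add_prefix_space=True` setting:

```python
from tokenizers.pre_tokenizers import ByteLevel

toks = [tok for tok, _ in ByteLevel(add_prefix_space=True).pre_tokenize_str("Hello world")]
print(toks)  # ['ĠHello', 'Ġworld']
```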
### Metaspace (SentencePiece)

```python
from tokenizers.pre_tokenizers import Metaspace

tokenizer.pre_tokenizer = Metaspace(replacement="▁", add_prefix_space=True)

# Input:  "Hello world"
# Output: [("▁Hello", ...), ("▁world", ...)]
```

**Used by**: T5, ALBERT (via SentencePiece)

### Digits splitting

```python
from tokenizers.pre_tokenizers import Digits

# Split digits individually
tokenizer.pre_tokenizer = Digits(individual_digits=True)

# Input:  "Room 123"
# Output: [("Room", ...), ("1", ...), ("2", ...), ("3", ...)]

# Keep digit runs together
tokenizer.pre_tokenizer = Digits(individual_digits=False)

# Input:  "Room 123"
# Output: [("Room", ...), ("123", ...)]
```

### BERT pre-tokenizer

```python
from tokenizers.pre_tokenizers import BertPreTokenizer

tokenizer.pre_tokenizer = BertPreTokenizer()

# Splits on whitespace and punctuation, preserves CJK
# Input:  "Hello, 世界!"
# Output: [("Hello", ...), (",", ...), ("世", ...), ("界", ...), ("!", ...)]
```
228
|
+
### Combining pre-tokenizers

```python
from tokenizers.pre_tokenizers import Sequence, WhitespaceSplit, Punctuation

tokenizer.pre_tokenizer = Sequence([
    WhitespaceSplit(),  # Split on whitespace only, first
    Punctuation()       # Then isolate punctuation
])

# Input: "Hello, world!"
# After WhitespaceSplit: [("Hello,", ...), ("world!", ...)]
# After Punctuation: [("Hello", ...), (",", ...), ("world", ...), ("!", ...)]
```

### Pre-tokenizer comparison

| Pre-tokenizer    | Use Case               | Example                            |
|------------------|------------------------|------------------------------------|
| Whitespace       | Simple English         | "Hello world" → ["Hello", "world"] |
| Punctuation      | Isolate symbols        | "world!" → ["world", "!"]          |
| ByteLevel        | Multilingual, emojis   | "🌍" → byte tokens                 |
| Metaspace        | SentencePiece-style    | "Hello" → ["▁Hello"]               |
| BertPreTokenizer | BERT-style (CJK aware) | "世界" → ["世", "界"]              |
| Digits           | Handle numbers         | "123" → ["1", "2", "3"] or ["123"] |

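The behaviours in the table can be probed directly with `pre_tokenize_str`, which runs a pre-tokenizer on a raw string and returns `(piece, (start, end))` tuples — a quick sketch:

```python
from tokenizers.pre_tokenizers import Whitespace, Sequence, Digits

# Probe a single pre-tokenizer; each entry is (piece, (start, end))
pieces = Whitespace().pre_tokenize_str("Hello, world!")
print(pieces)
# [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]

# Chains behave the same way: whitespace split, then per-digit split
chain = Sequence([Whitespace(), Digits(individual_digits=True)])
print([p for p, _ in chain.pre_tokenize_str("Room 123")])
# ['Room', '1', '2', '3']
```
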
## Models

Core tokenization algorithms.

### BPE Model

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

model = BPE(
    vocab=None,          # Or provide a pre-built vocab
    merges=None,         # Or provide merge rules
    unk_token="[UNK]",   # Unknown token
    continuing_subword_prefix="",
    end_of_word_suffix="",
    fuse_unk=False       # Keep unknown tokens separate
)

tokenizer = Tokenizer(model)
```

**Parameters**:
- `vocab`: Dict of token → id
- `merges`: List of merge rules, e.g. `[("a", "b"), ("ab", "c")]`
- `unk_token`: Token for unknown words
- `continuing_subword_prefix`: Prefix for subwords (empty for GPT-2)
- `end_of_word_suffix`: Suffix for the last subword (empty for GPT-2)

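A tiny hand-built example makes the merge mechanics concrete. The vocab and merges below are made up for illustration; a real model would learn them from a corpus:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Toy vocab/merges (illustrative only): characters plus learned merges
vocab = {"[UNK]": 0, "h": 1, "e": 2, "l": 3, "o": 4,
         "he": 5, "ll": 6, "hell": 7, "hello": 8}
merges = [("h", "e"), ("l", "l"), ("he", "ll"), ("hell", "o")]

tokenizer = Tokenizer(BPE(vocab=vocab, merges=merges, unk_token="[UNK]"))

# Merges apply by rank: h+e, l+l, he+ll, hell+o
print(tokenizer.encode("hello").tokens)  # ['hello']
print(tokenizer.encode("hell").tokens)   # ['hell']
```
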
### WordPiece Model

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece

model = WordPiece(
    vocab=None,
    unk_token="[UNK]",
    max_input_chars_per_word=100,   # Max word length
    continuing_subword_prefix="##"  # BERT-style prefix
)

tokenizer = Tokenizer(model)
```

**Key difference**: uses the `##` prefix for continuing subwords.

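The `##` mechanics can be seen with a toy vocab (made up for illustration). WordPiece greedily matches the longest known prefix, then continues with `##`-prefixed pieces:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace

# Toy vocab: whole-word prefixes plus "##"-prefixed continuations
vocab = {"[UNK]": 0, "token": 1, "##ization": 2, "play": 3, "##ing": 4}
tokenizer = Tokenizer(WordPiece(vocab=vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

print(tokenizer.encode("tokenization playing").tokens)
# ['token', '##ization', 'play', '##ing']
```
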
### Unigram Model

```python
from tokenizers import Tokenizer
from tokenizers.models import Unigram

model = Unigram(
    vocab=None,          # List of (token, score) tuples
    unk_id=0,            # ID of the unknown token
    byte_fallback=False  # Fall back to bytes if no match
)

tokenizer = Tokenizer(model)
```

**Probabilistic**: selects the tokenization with the highest probability.

### WordLevel Model

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel

# Simple word-to-ID mapping (no subwords)
model = WordLevel(
    vocab=None,
    unk_token="[UNK]"
)

tokenizer = Tokenizer(model)
```

**Warning**: requires a huge vocabulary (one token per word).

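The no-subwords limitation is easy to see with a toy vocab (made up for illustration): any out-of-vocabulary word collapses to `[UNK]`.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Every known word needs its own entry; anything else becomes [UNK]
vocab = {"[UNK]": 0, "hello": 1, "world": 2}
tokenizer = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

print(tokenizer.encode("hello there world").tokens)
# ['hello', '[UNK]', 'world']
```
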
## Post-processors

Add special tokens and format output.

### Template processing

**BERT-style** (`[CLS] sentence [SEP]`):
```python
from tokenizers.processors import TemplateProcessing

tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", 101),
        ("[SEP]", 102),
    ],
)

# Single sentence
output = tokenizer.encode("Hello world")
# [101, ..., 102] ([CLS] hello world [SEP])

# Sentence pair
output = tokenizer.encode("Hello", "world")
# [101, ..., 102, ..., 102] ([CLS] hello [SEP] world [SEP])
```

**GPT-2 style** (`sentence <|endoftext|>`):
```python
tokenizer.post_processor = TemplateProcessing(
    single="$A <|endoftext|>",
    special_tokens=[
        ("<|endoftext|>", 50256),
    ],
)
```

**RoBERTa style** (`<s> sentence </s>`):
```python
tokenizer.post_processor = TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> </s> $B </s>",
    special_tokens=[
        ("<s>", 0),
        ("</s>", 2),
    ],
)
```

**T5 style** (`sentence </s>`):
```python
# T5 adds no CLS-style prefix, only a trailing end-of-sequence token
tokenizer.post_processor = TemplateProcessing(
    single="$A </s>",
    pair="$A </s> $B </s>",
    special_tokens=[("</s>", 1)],
)
```

### RobertaProcessing

```python
from tokenizers.processors import RobertaProcessing

tokenizer.post_processor = RobertaProcessing(
    sep=("</s>", 2),
    cls=("<s>", 0),
    add_prefix_space=True,  # Add space before first token
    trim_offsets=True       # Trim leading space from offsets
)
```

### ByteLevelProcessing

```python
from tokenizers.processors import ByteLevel as ByteLevelProcessing

tokenizer.post_processor = ByteLevelProcessing(
    trim_offsets=True  # Remove Ġ from offsets
)
```

## Decoders

Convert token IDs back to text.

### ByteLevel decoder

```python
from tokenizers.decoders import ByteLevel

tokenizer.decoder = ByteLevel()

# Handles byte-level tokens
# ["ĠHello", "Ġworld"] → "Hello world"
```

### WordPiece decoder

```python
from tokenizers.decoders import WordPiece

tokenizer.decoder = WordPiece(prefix="##")

# Removes the ## prefix and concatenates
# ["token", "##ization"] → "tokenization"
```

### Metaspace decoder

```python
from tokenizers.decoders import Metaspace

tokenizer.decoder = Metaspace(replacement="▁", add_prefix_space=True)

# Converts ▁ back to spaces
# ["▁Hello", "▁world"] → "Hello world"
```

### BPEDecoder

```python
from tokenizers.decoders import BPEDecoder

tokenizer.decoder = BPEDecoder(suffix="</w>")

# Removes the suffix and concatenates
# ["token", "ization</w>"] → "tokenization"
```

### Sequence decoder

```python
from tokenizers.decoders import Sequence, ByteLevel, Strip

tokenizer.decoder = Sequence([
    ByteLevel(),      # Decode byte-level first
    Strip(' ', 1, 1)  # Strip one leading/trailing space
])
```

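A decoder matched to the model closes the encode/decode loop. A minimal round-trip sketch with a toy WordPiece vocab (made up for illustration):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.decoders import WordPiece as WordPieceDecoder

# Toy vocab; the decoder must use the same "##" prefix as the model
vocab = {"[UNK]": 0, "token": 1, "##ization": 2}
tokenizer = Tokenizer(WordPiece(vocab=vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.decoder = WordPieceDecoder(prefix="##")

ids = tokenizer.encode("tokenization").ids
print(tokenizer.decode(ids))  # 'tokenization'
```
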
## Complete pipeline examples

### BERT tokenizer

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.normalizers import BertNormalizer
from tokenizers.pre_tokenizers import BertPreTokenizer
from tokenizers.processors import TemplateProcessing
from tokenizers.decoders import WordPiece as WordPieceDecoder

# Model
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# Normalization
tokenizer.normalizer = BertNormalizer(lowercase=True)

# Pre-tokenization
tokenizer.pre_tokenizer = BertPreTokenizer()

# Post-processing
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[("[CLS]", 101), ("[SEP]", 102)],
)

# Decoder
tokenizer.decoder = WordPieceDecoder(prefix="##")

# Enable padding
tokenizer.enable_padding(pad_id=0, pad_token="[PAD]")

# Enable truncation
tokenizer.enable_truncation(max_length=512)
```

### GPT-2 tokenizer

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import NFC
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.processors import TemplateProcessing

# Model
tokenizer = Tokenizer(BPE())

# Normalization (minimal)
tokenizer.normalizer = NFC()

# Byte-level pre-tokenization
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)

# Post-processing
tokenizer.post_processor = TemplateProcessing(
    single="$A <|endoftext|>",
    special_tokens=[("<|endoftext|>", 50256)],
)

# Byte-level decoder
tokenizer.decoder = ByteLevelDecoder()
```

### T5 tokenizer (SentencePiece-style)

```python
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.normalizers import NFKC
from tokenizers.pre_tokenizers import Metaspace
from tokenizers.processors import TemplateProcessing
from tokenizers.decoders import Metaspace as MetaspaceDecoder

# Model
tokenizer = Tokenizer(Unigram())

# Normalization
tokenizer.normalizer = NFKC()

# Metaspace pre-tokenization
tokenizer.pre_tokenizer = Metaspace(replacement="▁", add_prefix_space=True)

# Post-processing (no CLS/SEP; T5 appends only a trailing </s>)
tokenizer.post_processor = TemplateProcessing(
    single="$A </s>",
    pair="$A </s> $B </s>",
    special_tokens=[("</s>", 1)],
)

# Metaspace decoder
tokenizer.decoder = MetaspaceDecoder(replacement="▁", add_prefix_space=True)
```

## Alignment tracking

Track token positions in the original text.

### Basic alignment

```python
text = "Hello, world!"
output = tokenizer.encode(text)

for token, (start, end) in zip(output.tokens, output.offsets):
    print(f"{token:10s} → [{start:2d}, {end:2d}): {text[start:end]!r}")

# Output (BERT-style tokenizer):
# [CLS]      → [ 0,  0): ''
# hello      → [ 0,  5): 'Hello'
# ,          → [ 5,  6): ','
# world      → [ 7, 12): 'world'
# !          → [12, 13): '!'
# [SEP]      → [ 0,  0): ''
```

### Word-level alignment

```python
# word_ids maps each token to the word it came from
encoding = tokenizer.encode("Hello world")
word_ids = encoding.word_ids

print(word_ids)
# [None, 0, 0, 1, None]
# None = special token, 0 = first word, 1 = second word
```

**Use case**: Token classification (NER)
```python
# Align predictions to words: keep the prediction of each word's first token
predictions = ["O", "B-PER", "I-PER", "O", "O"]
word_predictions = {}

for token_idx, word_idx in enumerate(encoding.word_ids):
    if word_idx is not None and word_idx not in word_predictions:
        word_predictions[word_idx] = predictions[token_idx]

print(word_predictions)
# {0: "B-PER", 1: "O"}  # First word is PERSON, second is OTHER
```

### Span alignment

```python
# Find the token span covering a character span
text = "Machine learning is awesome"
char_start, char_end = 8, 16  # "learning"

encoding = tokenizer.encode(text)

# char_to_token returns None if the position maps to no token
# (e.g. whitespace), so check before arithmetic in production code
token_start = encoding.char_to_token(char_start)
token_end = encoding.char_to_token(char_end - 1) + 1

print(f"Tokens {token_start}:{token_end} = {encoding.tokens[token_start:token_end]}")
# Tokens 2:3 = ['learning']
```

**Use case**: Question answering (extract answer span)

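For QA the mapping also runs the other way: given predicted start/end token indices, `offsets` recovers the answer text. A sketch with a toy word-level tokenizer (vocab made up; any tokenizer that produces offsets works the same way):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

context = "Machine learning is awesome"

# Toy tokenizer for illustration
vocab = {"[UNK]": 0, "Machine": 1, "learning": 2, "is": 3, "awesome": 4}
tokenizer = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

encoding = tokenizer.encode(context)

# Suppose the model predicted tokens 1..2 (inclusive) as the answer
start_tok, end_tok = 1, 2
char_start = encoding.offsets[start_tok][0]
char_end = encoding.offsets[end_tok][1]

print(context[char_start:char_end])  # 'learning is'
```
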
## Custom components

### Custom normalizer

```python
from tokenizers import NormalizedString
from tokenizers.normalizers import Normalizer

class CustomNormalizer:
    def normalize(self, normalized: NormalizedString):
        # Custom normalization logic
        normalized.lowercase()
        normalized.replace("  ", " ")  # Collapse double spaces

# Custom Python components must be wrapped with .custom()
tokenizer.normalizer = Normalizer.custom(CustomNormalizer())
```

### Custom pre-tokenizer

```python
from tokenizers import NormalizedString, PreTokenizedString
from tokenizers.pre_tokenizers import PreTokenizer

class CustomPreTokenizer:
    def pre_tokenize(self, pretok: PreTokenizedString):
        # The callback receives (index, NormalizedString) and must
        # return the list of pieces to keep
        pretok.split(lambda i, ns: ns.split(" ", "removed"))

tokenizer.pre_tokenizer = PreTokenizer.custom(CustomPreTokenizer())
```

**Note**: tokenizers that use custom Python components cannot be serialized with `save()`.

## Troubleshooting

### Issue: Misaligned offsets

**Symptom**: Offsets don't match the original text
```python
text = " hello"      # Leading space
offsets = [(0, 5)]   # text[0:5] == " hell", not "hello"
```

**Solution**: don't strip characters in the normalizer; trim offsets in the post-processor instead
```python
from tokenizers.normalizers import Sequence, Strip
from tokenizers.processors import ByteLevel as ByteLevelProcessing

# Problematic: stripping during normalization
tokenizer.normalizer = Sequence([
    Strip(),  # This changes offsets!
])

# Preferred: trim in the post-processor
tokenizer.post_processor = ByteLevelProcessing(trim_offsets=True)
```

### Issue: Special tokens not added

**Symptom**: No [CLS] or [SEP] in output

**Solution**: Check that a post-processor is set
```python
from tokenizers.processors import TemplateProcessing

tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[("[CLS]", 101), ("[SEP]", 102)],
)
```

### Issue: Incorrect decoding

**Symptom**: Decoded text contains ## or ▁

**Solution**: Set the decoder matching the model
```python
from tokenizers.decoders import WordPiece as WordPieceDecoder
from tokenizers.decoders import Metaspace as MetaspaceDecoder

# For WordPiece
tokenizer.decoder = WordPieceDecoder(prefix="##")

# For SentencePiece
tokenizer.decoder = MetaspaceDecoder(replacement="▁")
```

## Best practices

1. **Match pipeline to model architecture**:
   - BERT → BertNormalizer + BertPreTokenizer + WordPiece
   - GPT-2 → NFC + ByteLevel + BPE
   - T5 → NFKC + Metaspace + Unigram

2. **Test the pipeline on sample inputs**:
   - Check normalization doesn't over-normalize
   - Verify pre-tokenization splits correctly
   - Ensure decoding reconstructs the text

3. **Preserve alignment for downstream tasks**:
   - Use `trim_offsets` instead of stripping in the normalizer
   - Test `char_to_token()` on sample spans

4. **Document your pipeline**:
   - Save the complete tokenizer config
   - Document special tokens
   - Note any custom components