@synsci/cli-darwin-x64 1.1.49
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/skills/accelerate/SKILL.md +332 -0
- package/bin/skills/accelerate/references/custom-plugins.md +453 -0
- package/bin/skills/accelerate/references/megatron-integration.md +489 -0
- package/bin/skills/accelerate/references/performance.md +525 -0
- package/bin/skills/audiocraft/SKILL.md +564 -0
- package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
- package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
- package/bin/skills/autogpt/SKILL.md +403 -0
- package/bin/skills/autogpt/references/advanced-usage.md +535 -0
- package/bin/skills/autogpt/references/troubleshooting.md +420 -0
- package/bin/skills/awq/SKILL.md +310 -0
- package/bin/skills/awq/references/advanced-usage.md +324 -0
- package/bin/skills/awq/references/troubleshooting.md +344 -0
- package/bin/skills/axolotl/SKILL.md +158 -0
- package/bin/skills/axolotl/references/api.md +5548 -0
- package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
- package/bin/skills/axolotl/references/index.md +15 -0
- package/bin/skills/axolotl/references/other.md +3563 -0
- package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
- package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
- package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
- package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
- package/bin/skills/bitsandbytes/SKILL.md +411 -0
- package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
- package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
- package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
- package/bin/skills/blip-2/SKILL.md +564 -0
- package/bin/skills/blip-2/references/advanced-usage.md +680 -0
- package/bin/skills/blip-2/references/troubleshooting.md +526 -0
- package/bin/skills/chroma/SKILL.md +406 -0
- package/bin/skills/chroma/references/integration.md +38 -0
- package/bin/skills/clip/SKILL.md +253 -0
- package/bin/skills/clip/references/applications.md +207 -0
- package/bin/skills/constitutional-ai/SKILL.md +290 -0
- package/bin/skills/crewai/SKILL.md +498 -0
- package/bin/skills/crewai/references/flows.md +438 -0
- package/bin/skills/crewai/references/tools.md +429 -0
- package/bin/skills/crewai/references/troubleshooting.md +480 -0
- package/bin/skills/deepspeed/SKILL.md +141 -0
- package/bin/skills/deepspeed/references/08.md +17 -0
- package/bin/skills/deepspeed/references/09.md +173 -0
- package/bin/skills/deepspeed/references/2020.md +378 -0
- package/bin/skills/deepspeed/references/2023.md +279 -0
- package/bin/skills/deepspeed/references/assets.md +179 -0
- package/bin/skills/deepspeed/references/index.md +35 -0
- package/bin/skills/deepspeed/references/mii.md +118 -0
- package/bin/skills/deepspeed/references/other.md +1191 -0
- package/bin/skills/deepspeed/references/tutorials.md +6554 -0
- package/bin/skills/dspy/SKILL.md +590 -0
- package/bin/skills/dspy/references/examples.md +663 -0
- package/bin/skills/dspy/references/modules.md +475 -0
- package/bin/skills/dspy/references/optimizers.md +566 -0
- package/bin/skills/faiss/SKILL.md +221 -0
- package/bin/skills/faiss/references/index_types.md +280 -0
- package/bin/skills/flash-attention/SKILL.md +367 -0
- package/bin/skills/flash-attention/references/benchmarks.md +215 -0
- package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
- package/bin/skills/gguf/SKILL.md +427 -0
- package/bin/skills/gguf/references/advanced-usage.md +504 -0
- package/bin/skills/gguf/references/troubleshooting.md +442 -0
- package/bin/skills/gptq/SKILL.md +450 -0
- package/bin/skills/gptq/references/calibration.md +337 -0
- package/bin/skills/gptq/references/integration.md +129 -0
- package/bin/skills/gptq/references/troubleshooting.md +95 -0
- package/bin/skills/grpo-rl-training/README.md +97 -0
- package/bin/skills/grpo-rl-training/SKILL.md +572 -0
- package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
- package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
- package/bin/skills/guidance/SKILL.md +572 -0
- package/bin/skills/guidance/references/backends.md +554 -0
- package/bin/skills/guidance/references/constraints.md +674 -0
- package/bin/skills/guidance/references/examples.md +767 -0
- package/bin/skills/hqq/SKILL.md +445 -0
- package/bin/skills/hqq/references/advanced-usage.md +528 -0
- package/bin/skills/hqq/references/troubleshooting.md +503 -0
- package/bin/skills/hugging-face-cli/SKILL.md +191 -0
- package/bin/skills/hugging-face-cli/references/commands.md +954 -0
- package/bin/skills/hugging-face-cli/references/examples.md +374 -0
- package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
- package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
- package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
- package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
- package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
- package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
- package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
- package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
- package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
- package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
- package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
- package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
- package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
- package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
- package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
- package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
- package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
- package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
- package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
- package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
- package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
- package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
- package/bin/skills/hugging-face-jobs/index.html +216 -0
- package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
- package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
- package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
- package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
- package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
- package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
- package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
- package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
- package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
- package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
- package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
- package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
- package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
- package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
- package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
- package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
- package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
- package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
- package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
- package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
- package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
- package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
- package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
- package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
- package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
- package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
- package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
- package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
- package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
- package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
- package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
- package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
- package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
- package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
- package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
- package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
- package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
- package/bin/skills/instructor/SKILL.md +740 -0
- package/bin/skills/instructor/references/examples.md +107 -0
- package/bin/skills/instructor/references/providers.md +70 -0
- package/bin/skills/instructor/references/validation.md +606 -0
- package/bin/skills/knowledge-distillation/SKILL.md +458 -0
- package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
- package/bin/skills/lambda-labs/SKILL.md +545 -0
- package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
- package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
- package/bin/skills/langchain/SKILL.md +480 -0
- package/bin/skills/langchain/references/agents.md +499 -0
- package/bin/skills/langchain/references/integration.md +562 -0
- package/bin/skills/langchain/references/rag.md +600 -0
- package/bin/skills/langsmith/SKILL.md +422 -0
- package/bin/skills/langsmith/references/advanced-usage.md +548 -0
- package/bin/skills/langsmith/references/troubleshooting.md +537 -0
- package/bin/skills/litgpt/SKILL.md +469 -0
- package/bin/skills/litgpt/references/custom-models.md +568 -0
- package/bin/skills/litgpt/references/distributed-training.md +451 -0
- package/bin/skills/litgpt/references/supported-models.md +336 -0
- package/bin/skills/litgpt/references/training-recipes.md +619 -0
- package/bin/skills/llama-cpp/SKILL.md +258 -0
- package/bin/skills/llama-cpp/references/optimization.md +89 -0
- package/bin/skills/llama-cpp/references/quantization.md +213 -0
- package/bin/skills/llama-cpp/references/server.md +125 -0
- package/bin/skills/llama-factory/SKILL.md +80 -0
- package/bin/skills/llama-factory/references/_images.md +23 -0
- package/bin/skills/llama-factory/references/advanced.md +1055 -0
- package/bin/skills/llama-factory/references/getting_started.md +349 -0
- package/bin/skills/llama-factory/references/index.md +19 -0
- package/bin/skills/llama-factory/references/other.md +31 -0
- package/bin/skills/llamaguard/SKILL.md +337 -0
- package/bin/skills/llamaindex/SKILL.md +569 -0
- package/bin/skills/llamaindex/references/agents.md +83 -0
- package/bin/skills/llamaindex/references/data_connectors.md +108 -0
- package/bin/skills/llamaindex/references/query_engines.md +406 -0
- package/bin/skills/llava/SKILL.md +304 -0
- package/bin/skills/llava/references/training.md +197 -0
- package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
- package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
- package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
- package/bin/skills/long-context/SKILL.md +536 -0
- package/bin/skills/long-context/references/extension_methods.md +468 -0
- package/bin/skills/long-context/references/fine_tuning.md +611 -0
- package/bin/skills/long-context/references/rope.md +402 -0
- package/bin/skills/mamba/SKILL.md +260 -0
- package/bin/skills/mamba/references/architecture-details.md +206 -0
- package/bin/skills/mamba/references/benchmarks.md +255 -0
- package/bin/skills/mamba/references/training-guide.md +388 -0
- package/bin/skills/megatron-core/SKILL.md +366 -0
- package/bin/skills/megatron-core/references/benchmarks.md +249 -0
- package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
- package/bin/skills/megatron-core/references/production-examples.md +473 -0
- package/bin/skills/megatron-core/references/training-recipes.md +547 -0
- package/bin/skills/miles/SKILL.md +315 -0
- package/bin/skills/miles/references/api-reference.md +141 -0
- package/bin/skills/miles/references/troubleshooting.md +352 -0
- package/bin/skills/mlflow/SKILL.md +704 -0
- package/bin/skills/mlflow/references/deployment.md +744 -0
- package/bin/skills/mlflow/references/model-registry.md +770 -0
- package/bin/skills/mlflow/references/tracking.md +680 -0
- package/bin/skills/modal/SKILL.md +341 -0
- package/bin/skills/modal/references/advanced-usage.md +503 -0
- package/bin/skills/modal/references/troubleshooting.md +494 -0
- package/bin/skills/model-merging/SKILL.md +539 -0
- package/bin/skills/model-merging/references/evaluation.md +462 -0
- package/bin/skills/model-merging/references/examples.md +428 -0
- package/bin/skills/model-merging/references/methods.md +352 -0
- package/bin/skills/model-pruning/SKILL.md +495 -0
- package/bin/skills/model-pruning/references/wanda.md +347 -0
- package/bin/skills/moe-training/SKILL.md +526 -0
- package/bin/skills/moe-training/references/architectures.md +432 -0
- package/bin/skills/moe-training/references/inference.md +348 -0
- package/bin/skills/moe-training/references/training.md +425 -0
- package/bin/skills/nanogpt/SKILL.md +290 -0
- package/bin/skills/nanogpt/references/architecture.md +382 -0
- package/bin/skills/nanogpt/references/data.md +476 -0
- package/bin/skills/nanogpt/references/training.md +564 -0
- package/bin/skills/nemo-curator/SKILL.md +383 -0
- package/bin/skills/nemo-curator/references/deduplication.md +87 -0
- package/bin/skills/nemo-curator/references/filtering.md +102 -0
- package/bin/skills/nemo-evaluator/SKILL.md +494 -0
- package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
- package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
- package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
- package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
- package/bin/skills/nemo-guardrails/SKILL.md +297 -0
- package/bin/skills/nnsight/SKILL.md +436 -0
- package/bin/skills/nnsight/references/README.md +78 -0
- package/bin/skills/nnsight/references/api.md +344 -0
- package/bin/skills/nnsight/references/tutorials.md +300 -0
- package/bin/skills/openrlhf/SKILL.md +249 -0
- package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
- package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
- package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
- package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
- package/bin/skills/outlines/SKILL.md +652 -0
- package/bin/skills/outlines/references/backends.md +615 -0
- package/bin/skills/outlines/references/examples.md +773 -0
- package/bin/skills/outlines/references/json_generation.md +652 -0
- package/bin/skills/peft/SKILL.md +431 -0
- package/bin/skills/peft/references/advanced-usage.md +514 -0
- package/bin/skills/peft/references/troubleshooting.md +480 -0
- package/bin/skills/phoenix/SKILL.md +475 -0
- package/bin/skills/phoenix/references/advanced-usage.md +619 -0
- package/bin/skills/phoenix/references/troubleshooting.md +538 -0
- package/bin/skills/pinecone/SKILL.md +358 -0
- package/bin/skills/pinecone/references/deployment.md +181 -0
- package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
- package/bin/skills/pytorch-fsdp/references/index.md +7 -0
- package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
- package/bin/skills/pytorch-lightning/SKILL.md +346 -0
- package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
- package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
- package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
- package/bin/skills/pyvene/SKILL.md +473 -0
- package/bin/skills/pyvene/references/README.md +73 -0
- package/bin/skills/pyvene/references/api.md +383 -0
- package/bin/skills/pyvene/references/tutorials.md +376 -0
- package/bin/skills/qdrant/SKILL.md +493 -0
- package/bin/skills/qdrant/references/advanced-usage.md +648 -0
- package/bin/skills/qdrant/references/troubleshooting.md +631 -0
- package/bin/skills/ray-data/SKILL.md +326 -0
- package/bin/skills/ray-data/references/integration.md +82 -0
- package/bin/skills/ray-data/references/transformations.md +83 -0
- package/bin/skills/ray-train/SKILL.md +406 -0
- package/bin/skills/ray-train/references/multi-node.md +628 -0
- package/bin/skills/rwkv/SKILL.md +260 -0
- package/bin/skills/rwkv/references/architecture-details.md +344 -0
- package/bin/skills/rwkv/references/rwkv7.md +386 -0
- package/bin/skills/rwkv/references/state-management.md +369 -0
- package/bin/skills/saelens/SKILL.md +386 -0
- package/bin/skills/saelens/references/README.md +70 -0
- package/bin/skills/saelens/references/api.md +333 -0
- package/bin/skills/saelens/references/tutorials.md +318 -0
- package/bin/skills/segment-anything/SKILL.md +500 -0
- package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
- package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
- package/bin/skills/sentence-transformers/SKILL.md +255 -0
- package/bin/skills/sentence-transformers/references/models.md +123 -0
- package/bin/skills/sentencepiece/SKILL.md +235 -0
- package/bin/skills/sentencepiece/references/algorithms.md +200 -0
- package/bin/skills/sentencepiece/references/training.md +304 -0
- package/bin/skills/sglang/SKILL.md +442 -0
- package/bin/skills/sglang/references/deployment.md +490 -0
- package/bin/skills/sglang/references/radix-attention.md +413 -0
- package/bin/skills/sglang/references/structured-generation.md +541 -0
- package/bin/skills/simpo/SKILL.md +219 -0
- package/bin/skills/simpo/references/datasets.md +478 -0
- package/bin/skills/simpo/references/hyperparameters.md +452 -0
- package/bin/skills/simpo/references/loss-functions.md +350 -0
- package/bin/skills/skypilot/SKILL.md +509 -0
- package/bin/skills/skypilot/references/advanced-usage.md +491 -0
- package/bin/skills/skypilot/references/troubleshooting.md +570 -0
- package/bin/skills/slime/SKILL.md +464 -0
- package/bin/skills/slime/references/api-reference.md +392 -0
- package/bin/skills/slime/references/troubleshooting.md +386 -0
- package/bin/skills/speculative-decoding/SKILL.md +467 -0
- package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
- package/bin/skills/speculative-decoding/references/medusa.md +350 -0
- package/bin/skills/stable-diffusion/SKILL.md +519 -0
- package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
- package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
- package/bin/skills/tensorboard/SKILL.md +629 -0
- package/bin/skills/tensorboard/references/integrations.md +638 -0
- package/bin/skills/tensorboard/references/profiling.md +545 -0
- package/bin/skills/tensorboard/references/visualization.md +620 -0
- package/bin/skills/tensorrt-llm/SKILL.md +187 -0
- package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
- package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
- package/bin/skills/tensorrt-llm/references/serving.md +470 -0
- package/bin/skills/tinker/SKILL.md +362 -0
- package/bin/skills/tinker/references/api-reference.md +168 -0
- package/bin/skills/tinker/references/getting-started.md +157 -0
- package/bin/skills/tinker/references/loss-functions.md +163 -0
- package/bin/skills/tinker/references/models-and-lora.md +139 -0
- package/bin/skills/tinker/references/recipes.md +280 -0
- package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
- package/bin/skills/tinker/references/rendering.md +243 -0
- package/bin/skills/tinker/references/supervised-learning.md +232 -0
- package/bin/skills/tinker-training-cost/SKILL.md +187 -0
- package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
- package/bin/skills/torchforge/SKILL.md +433 -0
- package/bin/skills/torchforge/references/api-reference.md +327 -0
- package/bin/skills/torchforge/references/troubleshooting.md +409 -0
- package/bin/skills/torchtitan/SKILL.md +358 -0
- package/bin/skills/torchtitan/references/checkpoint.md +181 -0
- package/bin/skills/torchtitan/references/custom-models.md +258 -0
- package/bin/skills/torchtitan/references/float8.md +133 -0
- package/bin/skills/torchtitan/references/fsdp.md +126 -0
- package/bin/skills/transformer-lens/SKILL.md +346 -0
- package/bin/skills/transformer-lens/references/README.md +54 -0
- package/bin/skills/transformer-lens/references/api.md +362 -0
- package/bin/skills/transformer-lens/references/tutorials.md +339 -0
- package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
- package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
- package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
- package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
- package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
- package/bin/skills/unsloth/SKILL.md +80 -0
- package/bin/skills/unsloth/references/index.md +7 -0
- package/bin/skills/unsloth/references/llms-full.md +16799 -0
- package/bin/skills/unsloth/references/llms-txt.md +12044 -0
- package/bin/skills/unsloth/references/llms.md +82 -0
- package/bin/skills/verl/SKILL.md +391 -0
- package/bin/skills/verl/references/api-reference.md +301 -0
- package/bin/skills/verl/references/troubleshooting.md +391 -0
- package/bin/skills/vllm/SKILL.md +364 -0
- package/bin/skills/vllm/references/optimization.md +226 -0
- package/bin/skills/vllm/references/quantization.md +284 -0
- package/bin/skills/vllm/references/server-deployment.md +255 -0
- package/bin/skills/vllm/references/troubleshooting.md +447 -0
- package/bin/skills/weights-and-biases/SKILL.md +590 -0
- package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
- package/bin/skills/weights-and-biases/references/integrations.md +700 -0
- package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
- package/bin/skills/whisper/SKILL.md +317 -0
- package/bin/skills/whisper/references/languages.md +189 -0
- package/bin/synsc +0 -0
- package/package.json +10 -0
@@ -0,0 +1,653 @@

# Tokenization Algorithms Deep Dive

Comprehensive explanation of BPE, WordPiece, and Unigram algorithms.

## Byte-Pair Encoding (BPE)

### Algorithm overview

BPE iteratively merges the most frequent pair of adjacent tokens in a corpus.

**Training process**:
1. Initialize vocabulary with all characters
2. Count frequency of all adjacent token pairs
3. Merge most frequent pair into new token
4. Add new token to vocabulary
5. Update corpus with new token
6. Repeat until vocabulary size reached
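The loop above can be sketched in a few lines of plain Python. This is a toy trainer for illustration, not the library implementation: words are assumed pre-split and weighted by frequency.

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Toy BPE trainer: words are symbol tuples, merged most-frequent-first."""
    # Step 1: initialize - every word starts as a sequence of characters
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Step 3: pick the most frequent pair
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Steps 4-5: record the merged token by rewriting every word
        new_corpus = {}
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus[tuple(out)] = freq
        corpus = new_corpus
    return merges

merges = train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 2)
print(merges)  # [('e', 's'), ('es', 't')]
```

On the toy corpus below, the first two merges it learns are exactly the ones the worked example walks through.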
### Step-by-step example

**Corpus**:
```
low: 5
lower: 2
newest: 6
widest: 3
```

**Iteration 1**:
```
Count pairs:
'e' + 's': 9 (newest: 6, widest: 3) ← most frequent
'l' + 'o': 7
'o' + 'w': 7
...

Merge: 'e' + 's' → 'es'

Updated corpus:
low: 5
lower: 2
newest: 6 → newes|t: 6
widest: 3 → wides|t: 3

Vocabulary: [a-z] + ['es']
```

**Iteration 2**:
```
Count pairs:
'es' + 't': 9 ← most frequent
'l' + 'o': 7
...

Merge: 'es' + 't' → 'est'

Updated corpus:
low: 5
lower: 2
newest: 6 → new|est: 6
widest: 3 → wid|est: 3

Vocabulary: [a-z] + ['es', 'est']
```

**Continue until desired vocabulary size...**

### Tokenization with trained BPE

Given vocabulary: `['l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'i', 'd', 'es', 'est', 'lo', 'low', 'ne', 'new', 'newest', 'wi', 'wid', 'widest']`

Tokenize "lowest":
```
Step 1: Split into characters
['l', 'o', 'w', 'e', 's', 't']

Step 2: Apply merges in order learned during training
- Merge 'l' + 'o' → 'lo' (if this merge was learned)
- Merge 'lo' + 'w' → 'low' (if learned)
- Merge 'e' + 's' → 'es' (learned)
- Merge 'es' + 't' → 'est' (learned)

Final: ['low', 'est']
```
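The merge replay in Step 2 can be sketched in plain Python. The merge list below is one plausible order for the toy corpus, chosen for illustration rather than taken from the document:

```python
def bpe_encode(word, merges):
    """Apply learned merges to one word, in the order they were learned."""
    tokens = list(word)  # Step 1: split into characters
    for a, b in merges:  # Step 2: replay merges in training order
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]  # collapse the pair in place
            else:
                i += 1
    return tokens

# Hypothetical merge order consistent with the vocabulary above
merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]
print(bpe_encode("lowest", merges))  # ['low', 'est']
```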
### Implementation

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Initialize
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Configure trainer
trainer = BpeTrainer(
    vocab_size=1000,
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)

# Train
corpus = [
    "This is a sample corpus for BPE training.",
    "BPE learns subword units from the training data.",
    # ... more sentences
]

tokenizer.train_from_iterator(corpus, trainer=trainer)

# Use
output = tokenizer.encode("This is tokenization")
print(output.tokens)  # ['This', 'is', 'token', 'ization']
```
### Byte-level BPE (GPT-2 variant)

**Problem**: Character-level BPE must include every Unicode character it may encounter in its base vocabulary, and there are far more than 256 of them, so rare characters either bloat the vocabulary or become unknown tokens.

**Solution**: Operate at the byte level, where the base vocabulary is exactly 256 symbols.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder

tokenizer = Tokenizer(BPE())

# Byte-level pre-tokenization
tokenizer.pre_tokenizer = ByteLevel()
tokenizer.decoder = ByteLevelDecoder()

# After training, this handles ALL possible characters, including emojis
text = "Hello 🌍 世界"
tokens = tokenizer.encode(text).tokens
```

**Advantages**:
- Handles any Unicode character from a base alphabet of only 256 bytes
- No unknown tokens (worst case: falls back to raw bytes)
- Used by GPT-2, GPT-3, BART

**Trade-offs**:
- Slightly worse compression (bytes vs characters)
- More tokens for non-ASCII text
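The non-ASCII trade-off falls out of UTF-8 encoding widths: before any merges apply, a byte-level tokenizer starts from one base symbol per byte, so wider characters cost more base tokens. A quick check:

```python
# UTF-8 widths behind the "more tokens for non-ASCII text" trade-off:
# each character starts as one base token per byte before merges apply.
for ch in ["H", "é", "世", "🌍"]:
    print(ch, len(ch.encode("utf-8")))
# H 1
# é 2
# 世 3
# 🌍 4
```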
### BPE variants

**SentencePiece BPE**:
- Language-independent (no external pre-tokenization)
- Treats input as a raw character stream, encoding whitespace explicitly (as `▁`)
- Used by T5, ALBERT, XLNet

**BPE-dropout**:
- Randomly skips merges while segmenting during model training
- More robust tokenization at inference
- Reduces overfitting to the training data
## WordPiece

### Algorithm overview

WordPiece is similar to BPE but uses a different merge selection criterion.

**Training process**:
1. Initialize vocabulary with all characters
2. Count frequency of all token pairs
3. Score each pair: `score = freq(pair) / (freq(first) × freq(second))`
4. Merge pair with highest score
5. Repeat until vocabulary size reached

### Why different scoring?

**BPE**: Merges most frequent pairs
- "aa" appears 100 times → high priority
- Even if 'a' appears 1000 times alone

**WordPiece**: Merges pairs whose parts occur together more often than their individual frequencies would suggest
- "aa" appears 100 times, 'a' appears 1000 times → low score (100 / (1000 × 1000))
- "th" appears 50 times, 't' appears 60 times, 'h' appears 55 times → high score (50 / (60 × 55))
- Prioritizes pairs that appear together more than expected
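The scoring rule is simple enough to state directly in code, using the illustrative counts from the bullets above:

```python
def wordpiece_score(pair_freq, first_freq, second_freq):
    """WordPiece merge score: pair frequency normalized by part frequencies."""
    return pair_freq / (first_freq * second_freq)

# 'aa' is frequent, but so is 'a' on its own -> low score
low = wordpiece_score(100, 1000, 1000)  # 0.0001
# 't' and 'h' almost always occur together as 'th' -> high score
high = wordpiece_score(50, 60, 55)
print(low < high)  # True
```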
### Step-by-step example

**Corpus**:
```
low: 5
lower: 2
newest: 6
widest: 3
```

**Iteration 1**:
```
Count frequencies:
'e': 11 (lower: 2, newest: 6, widest: 3)
's': 9
't': 9
...

Count pairs:
'e' + 's': 9 (newest: 6, widest: 3)
'es' + 't': 9 (newest: 6, widest: 3)
...

Compute scores:
score('e' + 's') = 9 / (11 × 9) = 0.091
score('es' + 't') = 9 / (9 × 9) = 0.111 ← highest score
score('l' + 'o') = 7 / (7 × 9) = 0.111 ← tied

Choose: 'es' + 't' → 'est' (or 'lo' if tied)
```

**Key difference**: WordPiece prioritizes pairs that occur together disproportionately often, not simply the most frequent pairs.
### Tokenization with WordPiece

Given vocabulary: `['##e', '##s', '##t', 'l', 'o', 'w', 'new', 'est', 'low']`

Tokenize "lowest":
```
Step 1: Find longest matching prefix
'lowest' → 'low' (matches)

Step 2: Find longest match for remainder
'est' → 'est' (matches)

Final: ['low', 'est']
```

**If no match**:
```
Tokenize "unknownword":
'unknownword' → no match
'unknown' → no match
'unkn' → no match
'un' → no match
'u' → no match
→ [UNK]
```
|
|
244
|
+
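The greedy longest-match procedure above can be written directly. The vocabulary below is the illustrative toy from this section; real implementations (e.g. BERT's) also cap the length of a word before falling back to `[UNK]`:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-prefix matching with ## continuation tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate span until it matches a vocab entry
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation inside a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # any unmatchable span makes the whole word [UNK]
        tokens.append(match)
        start = end
    return tokens

vocab = {"##e", "##s", "##t", "##est", "l", "o", "w", "new", "est", "low"}
print(wordpiece_tokenize("lowest", vocab))       # ['low', '##est']
print(wordpiece_tokenize("unknownword", vocab))  # ['[UNK]']
```
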
### Implementation

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.normalizers import BertNormalizer
from tokenizers.pre_tokenizers import BertPreTokenizer

# Initialize BERT-style tokenizer
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# Normalization (lowercase, accent stripping)
tokenizer.normalizer = BertNormalizer(lowercase=True)

# Pre-tokenization (whitespace + punctuation)
tokenizer.pre_tokenizer = BertPreTokenizer()

# Configure trainer
trainer = WordPieceTrainer(
    vocab_size=30522,  # BERT vocab size
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    continuing_subword_prefix="##",  # BERT uses ##
)

# Train (corpus is any iterable of raw text strings)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Use
output = tokenizer.encode("Tokenization works great!")
print(output.tokens)  # e.g. ['token', '##ization', 'works', 'great', '!']
```

### Subword prefix

**BERT uses the `##` prefix**:
```
"unbelievable" → ['un', '##believ', '##able']
```

**Why?**
- Marks a token as a continuation of the previous one
- Allows reconstruction: remove ##, concatenate
- Helps the model distinguish word boundaries

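Reconstruction from ##-prefixed tokens is mechanical; a minimal sketch (ignoring punctuation spacing):

```python
def detokenize(tokens):
    """Rebuild words: ## tokens glue onto the previous token, others start a new word."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return " ".join(words)

print(detokenize(["un", "##believ", "##able"]))     # unbelievable
print(detokenize(["token", "##ization", "works"]))  # tokenization works
```
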
### WordPiece advantages

**Semantic merges**:
- Prioritizes meaningful combinations
- "qu" has a high score ('q' and 'u' almost always appear together)
- "qx" has a low score (rare combination)

**Better for morphology**:
- Captures affixes: un-, -ing, -ed
- Preserves word stems

**Trade-offs**:
- Slower training than BPE
- More memory (matches against the full vocabulary, not a merge list)
- Original implementation is not open-source (Hugging Face provides a reimplementation)

## Unigram

### Algorithm overview

Unigram works backward: start with a large vocabulary and prune tokens away.

**Training process**:
1. Initialize with a large vocabulary (e.g. all frequent substrings)
2. Estimate the probability of each token (via EM)
3. For each token, compute how much the loss would increase if it were removed
4. Remove the 10-20% of tokens with the lowest loss impact
5. Re-estimate probabilities
6. Repeat until the desired vocabulary size is reached

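A toy version of the pruning criterion in steps 3-4: score each word's best segmentation with and without a candidate token, and keep the tokens whose removal hurts the corpus log-likelihood most. The corpus, vocabulary, and probabilities are illustrative assumptions, and real Unigram training also re-estimates probabilities with EM after each pruning round:

```python
from math import log

def best_logprob(word, probs):
    """Best (Viterbi) segmentation log-probability of word, or -inf."""
    NEG = float("-inf")
    dp = [NEG] * (len(word) + 1)
    dp[0] = 0.0
    for i in range(1, len(word) + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in probs and dp[j] > NEG:
                dp[i] = max(dp[i], dp[j] + log(probs[piece]))
    return dp[len(word)]

def loss_increase(token, corpus, probs):
    """How much the corpus loss grows if `token` is removed."""
    pruned = {t: p for t, p in probs.items() if t != token}
    before = sum(f * best_logprob(w, probs) for w, f in corpus.items())
    after = sum(f * best_logprob(w, pruned) for w, f in corpus.items())
    return before - after  # >= 0; large means the token is important

corpus = {"low": 5, "lowest": 3}
probs = {"low": 0.02, "est": 0.03, "l": 0.01, "o": 0.015,
         "w": 0.01, "e": 0.02, "s": 0.015, "t": 0.015}

# Removing 'est' forces 'lowest' onto a much worse character-level split
print(loss_increase("est", corpus, probs))
# 'e' is never on a best path here, so removing it costs nothing
print(loss_increase("e", corpus, probs))
```
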
### Probabilistic tokenization

**Unigram assumption**: Each token is independent, so a segmentation's probability is the product of its token probabilities.

Given vocabulary with probabilities:
```
P('low') = 0.02
P('l') = 0.01
P('o') = 0.015
P('w') = 0.01
P('est') = 0.03
P('e') = 0.02
P('s') = 0.015
P('t') = 0.015
```

Tokenize "lowest":
```
Option 1: ['low', 'est']
P = P('low') × P('est') = 0.02 × 0.03 = 0.0006

Option 2: ['l', 'o', 'w', 'est']
P = 0.01 × 0.015 × 0.01 × 0.03 = 0.000000045

Option 3: ['low', 'e', 's', 't']
P = 0.02 × 0.02 × 0.015 × 0.015 = 0.00000009

Choose option 1 (highest probability)
```

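The comparison above, computed directly with the probabilities from the listing:

```python
from math import prod

probs = {"low": 0.02, "l": 0.01, "o": 0.015, "w": 0.01,
         "est": 0.03, "e": 0.02, "s": 0.015, "t": 0.015}

candidates = [
    ["low", "est"],
    ["l", "o", "w", "est"],
    ["low", "e", "s", "t"],
]

# Segmentation probability = product of independent token probabilities
for seg in candidates:
    print(seg, prod(probs[t] for t in seg))

best = max(candidates, key=lambda seg: prod(probs[t] for t in seg))
print("best:", best)  # best: ['low', 'est']
```
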
### Viterbi algorithm

Finding the best tokenization by brute force is expensive (exponentially many segmentations).

**Viterbi algorithm** (dynamic programming):
```python
from math import log

def tokenize_viterbi(word, vocab, probs):
    n = len(word)
    # dp[i] = (best_log_prob, best_tokens) for word[:i]
    dp = [(float('-inf'), []) for _ in range(n + 1)]
    dp[0] = (0.0, [])  # empty prefix: log probability 0

    for i in range(1, n + 1):
        best_prob = float('-inf')
        best_tokens = []

        # Try all possible last tokens ending at position i
        for j in range(i):
            token = word[j:i]
            if token in vocab:
                prob = dp[j][0] + log(probs[token])
                if prob > best_prob:
                    best_prob = prob
                    best_tokens = dp[j][1] + [token]

        dp[i] = (best_prob, best_tokens)

    return dp[n][1]
```

**Time complexity**: O(n²) vocabulary lookups vs O(2^n) candidate segmentations for brute force

### Implementation

```python
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer
from tokenizers.pre_tokenizers import Metaspace

# Initialize
tokenizer = Tokenizer(Unigram())

# Pre-tokenization (replaces spaces with ▁, SentencePiece-style)
tokenizer.pre_tokenizer = Metaspace()

# Configure trainer
trainer = UnigramTrainer(
    vocab_size=8000,
    special_tokens=["<unk>", "<s>", "</s>"],
    unk_token="<unk>",
    max_piece_length=16,    # Max token length
    n_sub_iterations=2,     # EM iterations
    shrinking_factor=0.75,  # Remove 25% each iteration
)

# Train (corpus is any iterable of raw text strings)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Use
output = tokenizer.encode("Tokenization with Unigram")
print(output.tokens)  # e.g. ['▁Token', 'ization', '▁with', '▁Un', 'igram']
```

### Unigram advantages

**Probabilistic**:
- Multiple valid tokenizations exist for the same text
- Can sample different tokenizations (data augmentation)

**Subword regularization**:

Sampling is exposed through SentencePiece's Unigram implementation (`enable_sampling`); the Hugging Face `tokenizers` `encode` call is deterministic.
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="unigram.model")

# Sample a different segmentation on each call
for _ in range(3):
    print(sp.encode("tokenization", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))

# Possible outputs (vary across calls), e.g.:
# ['▁token', 'ization']
# ['▁to', 'ken', 'ization']
# ['▁token', 'iz', 'ation']
```

**Language-independent**:
- No word boundaries needed
- Works for CJK languages (Chinese, Japanese, Korean)
- Treats input as a character stream

**Trade-offs**:
- Slower training (EM algorithm)
- More hyperparameters
- Larger model (stores a probability per token)

## Algorithm comparison

### Training speed

| Algorithm | Small (10MB) | Medium (100MB) | Large (1GB) |
|-----------|--------------|----------------|-------------|
| BPE       | 10-15 sec    | 1-2 min        | 10-20 min   |
| WordPiece | 15-20 sec    | 2-3 min        | 15-30 min   |
| Unigram   | 20-30 sec    | 3-5 min        | 30-60 min   |

**Tested on**: 16-core CPU, 30k vocab

### Tokenization quality

Measured on English Wikipedia:

| Algorithm | Vocab Size | Tokens/Word | Unknown Rate |
|-----------|------------|-------------|--------------|
| BPE       | 30k        | 1.3         | 0.5%         |
| WordPiece | 30k        | 1.2         | 1.2%         |
| Unigram   | 8k         | 1.5         | 0.3%         |

**Key observations**:
- WordPiece: Slightly better compression (fewest tokens per word)
- BPE: Lower unknown rate than WordPiece
- Unigram: Smallest vocab, best coverage (lowest unknown rate)

### Compression ratio

Characters per token (higher = better compression):

| Language | BPE (30k) | WordPiece (30k) | Unigram (8k) |
|----------|-----------|-----------------|--------------|
| English  | 4.2       | 4.5             | 3.8          |
| Chinese  | 2.1       | 2.3             | 2.5          |
| Arabic   | 3.5       | 3.8             | 3.2          |

**Best for each**:
- English: WordPiece
- Chinese: Unigram (language-independent)
- Arabic: WordPiece

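Characters per token is easy to measure for any tokenizer. A minimal sketch, where `tokenize` stands in for whatever encode function you are evaluating (the whitespace splitter below is just a placeholder):

```python
def chars_per_token(texts, tokenize):
    """Average characters per token over a sample (higher = better compression)."""
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(tokenize(t)) for t in texts)
    return total_chars / total_tokens

# Example with a trivial whitespace "tokenizer" as a stand-in:
texts = ["the quick brown fox", "tokenization ratios"]
print(round(chars_per_token(texts, str.split), 2))  # 6.33
```
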
### Use case recommendations

**BPE** - Best for:
- English language models
- Code (handles symbols well)
- Fast training needed
- **Models**: GPT-2, GPT-3, RoBERTa, BART

**WordPiece** - Best for:
- Masked language modeling (BERT-style)
- Morphologically rich languages
- Semantic understanding tasks
- **Models**: BERT, DistilBERT, ELECTRA

**Unigram** - Best for:
- Multilingual models
- Languages without word boundaries (CJK)
- Data augmentation via subword regularization
- **Models**: T5, ALBERT, XLNet (via SentencePiece)

## Advanced topics

### Handling rare words

**BPE approach**:
```
"antidisestablishmentarianism"
→ ['anti', 'dis', 'establish', 'ment', 'arian', 'ism']
```

**WordPiece approach**:
```
"antidisestablishmentarianism"
→ ['anti', '##dis', '##establish', '##ment', '##arian', '##ism']
```

**Unigram approach**:
```
"antidisestablishmentarianism"
→ ['▁anti', 'dis', 'establish', 'ment', 'arian', 'ism']
```

### Handling numbers

**Challenge**: Infinitely many digit sequences

**BPE solution**: Byte-level pre-tokenization (handles any digit sequence)
```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevel()

# Handles any number:
# "123456789" → byte-level tokens, never [UNK]
```

**WordPiece solution**: Digit pre-tokenization
```python
from tokenizers.pre_tokenizers import Digits

# Split digits individually (or as groups with individual_digits=False)
tokenizer.pre_tokenizer = Digits(individual_digits=True)

# "123" → ['1', '2', '3']
```

**Unigram solution**: Learns common number patterns
```python
# Learned during training; segmentation depends on corpus statistics:
# "2023" → ['202', '3'] or ['20', '23']
```

### Handling case sensitivity

**Lowercase (BERT)**:
```python
from tokenizers.normalizers import Lowercase

tokenizer.normalizer = Lowercase()

# "Hello WORLD" → "hello world" → ['hello', 'world']
```

**Preserve case (GPT-2)**:
```python
# Simply leave the normalizer unset: no case normalization is applied
# "Hello WORLD" → ['Hello', 'WORLD']
```

**Cased tokens (RoBERTa)**:
```python
# A case-preserving tokenizer learns separate tokens for different casings:
# Vocabulary: ['Hello', 'hello', 'HELLO', 'world', 'WORLD']
```

### Handling emojis and special characters

**Byte-level (GPT-2)**:
```python
from tokenizers.pre_tokenizers import ByteLevel

tokenizer.pre_tokenizer = ByteLevel()

# "Hello 🌍 👋" → byte-level representation (always works, no [UNK])
```

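Byte-level works because any Unicode string decomposes into UTF-8 bytes, and a base vocabulary of all 256 byte values can therefore represent any input. A quick illustration with the standard library:

```python
# Every character, including emoji, is a short sequence of UTF-8 bytes
for ch in ["H", "é", "🌍"]:
    print(ch, list(ch.encode("utf-8")))

# Emoji are simply 4-byte sequences at the byte level
assert len("🌍".encode("utf-8")) == 4
```
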
**Unicode normalization**:
```python
from tokenizers.normalizers import NFKC

tokenizer.normalizer = NFKC()

# "é" (precomposed) and "é" (e + combining accent) normalize to one form
```

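The same normalization is available in the standard library via `unicodedata`, which is handy for checking what a normalizer will do to a given string:

```python
import unicodedata

composed = "\u00e9"     # 'é' as a single code point
decomposed = "e\u0301"  # 'e' + combining acute accent

print(composed == decomposed)                      # False: different code points
print(unicodedata.normalize("NFKC", composed) ==
      unicodedata.normalize("NFKC", decomposed))   # True: one canonical form
```
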
## Troubleshooting

### Issue: Poor subword splitting

**Symptom**:
```
"running" → ['r', 'u', 'n', 'n', 'i', 'n', 'g'] (too granular)
```

**Solutions**:
1. Increase vocabulary size
2. Train longer (more merge iterations)
3. Lower the `min_frequency` threshold

### Issue: Too many unknown tokens

**Symptom**:
```
5% of tokens are [UNK]
```

**Solutions**:
1. Increase vocabulary size
2. Use byte-level BPE (no [UNK] possible)
3. Verify the training corpus is representative of your target data

### Issue: Inconsistent tokenization

**Symptom**:
```
"running" → ['run', 'ning']
"runner" → ['r', 'u', 'n', 'n', 'e', 'r']
```

**Solutions**:
1. Check that normalization is applied identically at training and inference time
2. Ensure pre-tokenization is deterministic
3. If the variation is intentional (subword regularization), use Unigram so sampling is explicit and controllable

## Best practices

1. **Match algorithm to model architecture**:
   - BERT-style → WordPiece
   - GPT-style → BPE
   - T5-style → Unigram

2. **Use byte-level for multilingual**:
   - Handles any Unicode input
   - No unknown tokens

3. **Test on representative data**:
   - Measure compression ratio
   - Check unknown token rate
   - Inspect sample tokenizations

4. **Version control tokenizers**:
   - Save the tokenizer with the model
   - Document special tokens
   - Track vocabulary changes