@synsci/cli-darwin-x64 1.1.49
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/skills/accelerate/SKILL.md +332 -0
- package/bin/skills/accelerate/references/custom-plugins.md +453 -0
- package/bin/skills/accelerate/references/megatron-integration.md +489 -0
- package/bin/skills/accelerate/references/performance.md +525 -0
- package/bin/skills/audiocraft/SKILL.md +564 -0
- package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
- package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
- package/bin/skills/autogpt/SKILL.md +403 -0
- package/bin/skills/autogpt/references/advanced-usage.md +535 -0
- package/bin/skills/autogpt/references/troubleshooting.md +420 -0
- package/bin/skills/awq/SKILL.md +310 -0
- package/bin/skills/awq/references/advanced-usage.md +324 -0
- package/bin/skills/awq/references/troubleshooting.md +344 -0
- package/bin/skills/axolotl/SKILL.md +158 -0
- package/bin/skills/axolotl/references/api.md +5548 -0
- package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
- package/bin/skills/axolotl/references/index.md +15 -0
- package/bin/skills/axolotl/references/other.md +3563 -0
- package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
- package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
- package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
- package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
- package/bin/skills/bitsandbytes/SKILL.md +411 -0
- package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
- package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
- package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
- package/bin/skills/blip-2/SKILL.md +564 -0
- package/bin/skills/blip-2/references/advanced-usage.md +680 -0
- package/bin/skills/blip-2/references/troubleshooting.md +526 -0
- package/bin/skills/chroma/SKILL.md +406 -0
- package/bin/skills/chroma/references/integration.md +38 -0
- package/bin/skills/clip/SKILL.md +253 -0
- package/bin/skills/clip/references/applications.md +207 -0
- package/bin/skills/constitutional-ai/SKILL.md +290 -0
- package/bin/skills/crewai/SKILL.md +498 -0
- package/bin/skills/crewai/references/flows.md +438 -0
- package/bin/skills/crewai/references/tools.md +429 -0
- package/bin/skills/crewai/references/troubleshooting.md +480 -0
- package/bin/skills/deepspeed/SKILL.md +141 -0
- package/bin/skills/deepspeed/references/08.md +17 -0
- package/bin/skills/deepspeed/references/09.md +173 -0
- package/bin/skills/deepspeed/references/2020.md +378 -0
- package/bin/skills/deepspeed/references/2023.md +279 -0
- package/bin/skills/deepspeed/references/assets.md +179 -0
- package/bin/skills/deepspeed/references/index.md +35 -0
- package/bin/skills/deepspeed/references/mii.md +118 -0
- package/bin/skills/deepspeed/references/other.md +1191 -0
- package/bin/skills/deepspeed/references/tutorials.md +6554 -0
- package/bin/skills/dspy/SKILL.md +590 -0
- package/bin/skills/dspy/references/examples.md +663 -0
- package/bin/skills/dspy/references/modules.md +475 -0
- package/bin/skills/dspy/references/optimizers.md +566 -0
- package/bin/skills/faiss/SKILL.md +221 -0
- package/bin/skills/faiss/references/index_types.md +280 -0
- package/bin/skills/flash-attention/SKILL.md +367 -0
- package/bin/skills/flash-attention/references/benchmarks.md +215 -0
- package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
- package/bin/skills/gguf/SKILL.md +427 -0
- package/bin/skills/gguf/references/advanced-usage.md +504 -0
- package/bin/skills/gguf/references/troubleshooting.md +442 -0
- package/bin/skills/gptq/SKILL.md +450 -0
- package/bin/skills/gptq/references/calibration.md +337 -0
- package/bin/skills/gptq/references/integration.md +129 -0
- package/bin/skills/gptq/references/troubleshooting.md +95 -0
- package/bin/skills/grpo-rl-training/README.md +97 -0
- package/bin/skills/grpo-rl-training/SKILL.md +572 -0
- package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
- package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
- package/bin/skills/guidance/SKILL.md +572 -0
- package/bin/skills/guidance/references/backends.md +554 -0
- package/bin/skills/guidance/references/constraints.md +674 -0
- package/bin/skills/guidance/references/examples.md +767 -0
- package/bin/skills/hqq/SKILL.md +445 -0
- package/bin/skills/hqq/references/advanced-usage.md +528 -0
- package/bin/skills/hqq/references/troubleshooting.md +503 -0
- package/bin/skills/hugging-face-cli/SKILL.md +191 -0
- package/bin/skills/hugging-face-cli/references/commands.md +954 -0
- package/bin/skills/hugging-face-cli/references/examples.md +374 -0
- package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
- package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
- package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
- package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
- package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
- package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
- package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
- package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
- package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
- package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
- package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
- package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
- package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
- package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
- package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
- package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
- package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
- package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
- package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
- package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
- package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
- package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
- package/bin/skills/hugging-face-jobs/index.html +216 -0
- package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
- package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
- package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
- package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
- package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
- package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
- package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
- package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
- package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
- package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
- package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
- package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
- package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
- package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
- package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
- package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
- package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
- package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
- package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
- package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
- package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
- package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
- package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
- package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
- package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
- package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
- package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
- package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
- package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
- package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
- package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
- package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
- package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
- package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
- package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
- package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
- package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
- package/bin/skills/instructor/SKILL.md +740 -0
- package/bin/skills/instructor/references/examples.md +107 -0
- package/bin/skills/instructor/references/providers.md +70 -0
- package/bin/skills/instructor/references/validation.md +606 -0
- package/bin/skills/knowledge-distillation/SKILL.md +458 -0
- package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
- package/bin/skills/lambda-labs/SKILL.md +545 -0
- package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
- package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
- package/bin/skills/langchain/SKILL.md +480 -0
- package/bin/skills/langchain/references/agents.md +499 -0
- package/bin/skills/langchain/references/integration.md +562 -0
- package/bin/skills/langchain/references/rag.md +600 -0
- package/bin/skills/langsmith/SKILL.md +422 -0
- package/bin/skills/langsmith/references/advanced-usage.md +548 -0
- package/bin/skills/langsmith/references/troubleshooting.md +537 -0
- package/bin/skills/litgpt/SKILL.md +469 -0
- package/bin/skills/litgpt/references/custom-models.md +568 -0
- package/bin/skills/litgpt/references/distributed-training.md +451 -0
- package/bin/skills/litgpt/references/supported-models.md +336 -0
- package/bin/skills/litgpt/references/training-recipes.md +619 -0
- package/bin/skills/llama-cpp/SKILL.md +258 -0
- package/bin/skills/llama-cpp/references/optimization.md +89 -0
- package/bin/skills/llama-cpp/references/quantization.md +213 -0
- package/bin/skills/llama-cpp/references/server.md +125 -0
- package/bin/skills/llama-factory/SKILL.md +80 -0
- package/bin/skills/llama-factory/references/_images.md +23 -0
- package/bin/skills/llama-factory/references/advanced.md +1055 -0
- package/bin/skills/llama-factory/references/getting_started.md +349 -0
- package/bin/skills/llama-factory/references/index.md +19 -0
- package/bin/skills/llama-factory/references/other.md +31 -0
- package/bin/skills/llamaguard/SKILL.md +337 -0
- package/bin/skills/llamaindex/SKILL.md +569 -0
- package/bin/skills/llamaindex/references/agents.md +83 -0
- package/bin/skills/llamaindex/references/data_connectors.md +108 -0
- package/bin/skills/llamaindex/references/query_engines.md +406 -0
- package/bin/skills/llava/SKILL.md +304 -0
- package/bin/skills/llava/references/training.md +197 -0
- package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
- package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
- package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
- package/bin/skills/long-context/SKILL.md +536 -0
- package/bin/skills/long-context/references/extension_methods.md +468 -0
- package/bin/skills/long-context/references/fine_tuning.md +611 -0
- package/bin/skills/long-context/references/rope.md +402 -0
- package/bin/skills/mamba/SKILL.md +260 -0
- package/bin/skills/mamba/references/architecture-details.md +206 -0
- package/bin/skills/mamba/references/benchmarks.md +255 -0
- package/bin/skills/mamba/references/training-guide.md +388 -0
- package/bin/skills/megatron-core/SKILL.md +366 -0
- package/bin/skills/megatron-core/references/benchmarks.md +249 -0
- package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
- package/bin/skills/megatron-core/references/production-examples.md +473 -0
- package/bin/skills/megatron-core/references/training-recipes.md +547 -0
- package/bin/skills/miles/SKILL.md +315 -0
- package/bin/skills/miles/references/api-reference.md +141 -0
- package/bin/skills/miles/references/troubleshooting.md +352 -0
- package/bin/skills/mlflow/SKILL.md +704 -0
- package/bin/skills/mlflow/references/deployment.md +744 -0
- package/bin/skills/mlflow/references/model-registry.md +770 -0
- package/bin/skills/mlflow/references/tracking.md +680 -0
- package/bin/skills/modal/SKILL.md +341 -0
- package/bin/skills/modal/references/advanced-usage.md +503 -0
- package/bin/skills/modal/references/troubleshooting.md +494 -0
- package/bin/skills/model-merging/SKILL.md +539 -0
- package/bin/skills/model-merging/references/evaluation.md +462 -0
- package/bin/skills/model-merging/references/examples.md +428 -0
- package/bin/skills/model-merging/references/methods.md +352 -0
- package/bin/skills/model-pruning/SKILL.md +495 -0
- package/bin/skills/model-pruning/references/wanda.md +347 -0
- package/bin/skills/moe-training/SKILL.md +526 -0
- package/bin/skills/moe-training/references/architectures.md +432 -0
- package/bin/skills/moe-training/references/inference.md +348 -0
- package/bin/skills/moe-training/references/training.md +425 -0
- package/bin/skills/nanogpt/SKILL.md +290 -0
- package/bin/skills/nanogpt/references/architecture.md +382 -0
- package/bin/skills/nanogpt/references/data.md +476 -0
- package/bin/skills/nanogpt/references/training.md +564 -0
- package/bin/skills/nemo-curator/SKILL.md +383 -0
- package/bin/skills/nemo-curator/references/deduplication.md +87 -0
- package/bin/skills/nemo-curator/references/filtering.md +102 -0
- package/bin/skills/nemo-evaluator/SKILL.md +494 -0
- package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
- package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
- package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
- package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
- package/bin/skills/nemo-guardrails/SKILL.md +297 -0
- package/bin/skills/nnsight/SKILL.md +436 -0
- package/bin/skills/nnsight/references/README.md +78 -0
- package/bin/skills/nnsight/references/api.md +344 -0
- package/bin/skills/nnsight/references/tutorials.md +300 -0
- package/bin/skills/openrlhf/SKILL.md +249 -0
- package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
- package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
- package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
- package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
- package/bin/skills/outlines/SKILL.md +652 -0
- package/bin/skills/outlines/references/backends.md +615 -0
- package/bin/skills/outlines/references/examples.md +773 -0
- package/bin/skills/outlines/references/json_generation.md +652 -0
- package/bin/skills/peft/SKILL.md +431 -0
- package/bin/skills/peft/references/advanced-usage.md +514 -0
- package/bin/skills/peft/references/troubleshooting.md +480 -0
- package/bin/skills/phoenix/SKILL.md +475 -0
- package/bin/skills/phoenix/references/advanced-usage.md +619 -0
- package/bin/skills/phoenix/references/troubleshooting.md +538 -0
- package/bin/skills/pinecone/SKILL.md +358 -0
- package/bin/skills/pinecone/references/deployment.md +181 -0
- package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
- package/bin/skills/pytorch-fsdp/references/index.md +7 -0
- package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
- package/bin/skills/pytorch-lightning/SKILL.md +346 -0
- package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
- package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
- package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
- package/bin/skills/pyvene/SKILL.md +473 -0
- package/bin/skills/pyvene/references/README.md +73 -0
- package/bin/skills/pyvene/references/api.md +383 -0
- package/bin/skills/pyvene/references/tutorials.md +376 -0
- package/bin/skills/qdrant/SKILL.md +493 -0
- package/bin/skills/qdrant/references/advanced-usage.md +648 -0
- package/bin/skills/qdrant/references/troubleshooting.md +631 -0
- package/bin/skills/ray-data/SKILL.md +326 -0
- package/bin/skills/ray-data/references/integration.md +82 -0
- package/bin/skills/ray-data/references/transformations.md +83 -0
- package/bin/skills/ray-train/SKILL.md +406 -0
- package/bin/skills/ray-train/references/multi-node.md +628 -0
- package/bin/skills/rwkv/SKILL.md +260 -0
- package/bin/skills/rwkv/references/architecture-details.md +344 -0
- package/bin/skills/rwkv/references/rwkv7.md +386 -0
- package/bin/skills/rwkv/references/state-management.md +369 -0
- package/bin/skills/saelens/SKILL.md +386 -0
- package/bin/skills/saelens/references/README.md +70 -0
- package/bin/skills/saelens/references/api.md +333 -0
- package/bin/skills/saelens/references/tutorials.md +318 -0
- package/bin/skills/segment-anything/SKILL.md +500 -0
- package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
- package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
- package/bin/skills/sentence-transformers/SKILL.md +255 -0
- package/bin/skills/sentence-transformers/references/models.md +123 -0
- package/bin/skills/sentencepiece/SKILL.md +235 -0
- package/bin/skills/sentencepiece/references/algorithms.md +200 -0
- package/bin/skills/sentencepiece/references/training.md +304 -0
- package/bin/skills/sglang/SKILL.md +442 -0
- package/bin/skills/sglang/references/deployment.md +490 -0
- package/bin/skills/sglang/references/radix-attention.md +413 -0
- package/bin/skills/sglang/references/structured-generation.md +541 -0
- package/bin/skills/simpo/SKILL.md +219 -0
- package/bin/skills/simpo/references/datasets.md +478 -0
- package/bin/skills/simpo/references/hyperparameters.md +452 -0
- package/bin/skills/simpo/references/loss-functions.md +350 -0
- package/bin/skills/skypilot/SKILL.md +509 -0
- package/bin/skills/skypilot/references/advanced-usage.md +491 -0
- package/bin/skills/skypilot/references/troubleshooting.md +570 -0
- package/bin/skills/slime/SKILL.md +464 -0
- package/bin/skills/slime/references/api-reference.md +392 -0
- package/bin/skills/slime/references/troubleshooting.md +386 -0
- package/bin/skills/speculative-decoding/SKILL.md +467 -0
- package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
- package/bin/skills/speculative-decoding/references/medusa.md +350 -0
- package/bin/skills/stable-diffusion/SKILL.md +519 -0
- package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
- package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
- package/bin/skills/tensorboard/SKILL.md +629 -0
- package/bin/skills/tensorboard/references/integrations.md +638 -0
- package/bin/skills/tensorboard/references/profiling.md +545 -0
- package/bin/skills/tensorboard/references/visualization.md +620 -0
- package/bin/skills/tensorrt-llm/SKILL.md +187 -0
- package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
- package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
- package/bin/skills/tensorrt-llm/references/serving.md +470 -0
- package/bin/skills/tinker/SKILL.md +362 -0
- package/bin/skills/tinker/references/api-reference.md +168 -0
- package/bin/skills/tinker/references/getting-started.md +157 -0
- package/bin/skills/tinker/references/loss-functions.md +163 -0
- package/bin/skills/tinker/references/models-and-lora.md +139 -0
- package/bin/skills/tinker/references/recipes.md +280 -0
- package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
- package/bin/skills/tinker/references/rendering.md +243 -0
- package/bin/skills/tinker/references/supervised-learning.md +232 -0
- package/bin/skills/tinker-training-cost/SKILL.md +187 -0
- package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
- package/bin/skills/torchforge/SKILL.md +433 -0
- package/bin/skills/torchforge/references/api-reference.md +327 -0
- package/bin/skills/torchforge/references/troubleshooting.md +409 -0
- package/bin/skills/torchtitan/SKILL.md +358 -0
- package/bin/skills/torchtitan/references/checkpoint.md +181 -0
- package/bin/skills/torchtitan/references/custom-models.md +258 -0
- package/bin/skills/torchtitan/references/float8.md +133 -0
- package/bin/skills/torchtitan/references/fsdp.md +126 -0
- package/bin/skills/transformer-lens/SKILL.md +346 -0
- package/bin/skills/transformer-lens/references/README.md +54 -0
- package/bin/skills/transformer-lens/references/api.md +362 -0
- package/bin/skills/transformer-lens/references/tutorials.md +339 -0
- package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
- package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
- package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
- package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
- package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
- package/bin/skills/unsloth/SKILL.md +80 -0
- package/bin/skills/unsloth/references/index.md +7 -0
- package/bin/skills/unsloth/references/llms-full.md +16799 -0
- package/bin/skills/unsloth/references/llms-txt.md +12044 -0
- package/bin/skills/unsloth/references/llms.md +82 -0
- package/bin/skills/verl/SKILL.md +391 -0
- package/bin/skills/verl/references/api-reference.md +301 -0
- package/bin/skills/verl/references/troubleshooting.md +391 -0
- package/bin/skills/vllm/SKILL.md +364 -0
- package/bin/skills/vllm/references/optimization.md +226 -0
- package/bin/skills/vllm/references/quantization.md +284 -0
- package/bin/skills/vllm/references/server-deployment.md +255 -0
- package/bin/skills/vllm/references/troubleshooting.md +447 -0
- package/bin/skills/weights-and-biases/SKILL.md +590 -0
- package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
- package/bin/skills/weights-and-biases/references/integrations.md +700 -0
- package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
- package/bin/skills/whisper/SKILL.md +317 -0
- package/bin/skills/whisper/references/languages.md +189 -0
- package/bin/synsc +0 -0
- package/package.json +10 -0
@@ -0,0 +1,516 @@
---
name: huggingface-tokenizers
description: Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.
version: 1.0.0
author: Synthetic Sciences
license: MIT
tags: [Tokenization, HuggingFace, BPE, WordPiece, Unigram, Fast Tokenization, Rust, Custom Tokenizer, Alignment Tracking, Production]
dependencies: [tokenizers, transformers, datasets]
---

# HuggingFace Tokenizers - Fast Tokenization for NLP

Fast, production-ready tokenizers with Rust performance and Python ease-of-use.

## When to use HuggingFace Tokenizers

**Use HuggingFace Tokenizers when:**
- Need extremely fast tokenization (<20s per GB of text)
- Training custom tokenizers from scratch
- Want alignment tracking (token → original text position)
- Building production NLP pipelines
- Need to tokenize large corpora efficiently

**Performance**:
- **Speed**: <20 seconds to tokenize 1GB on CPU
- **Implementation**: Rust core with Python/Node.js bindings
- **Efficiency**: 10-100× faster than pure Python implementations

**Use alternatives instead**:
- **SentencePiece**: Language-independent, used by T5/ALBERT
- **tiktoken**: OpenAI's BPE tokenizer for GPT models
- **transformers AutoTokenizer**: Loading pretrained tokenizers only (uses this library internally)

## Quick start

### Installation

```bash
# Install tokenizers
pip install tokenizers

# With transformers integration
pip install tokenizers transformers
```

### Load pretrained tokenizer

```python
from tokenizers import Tokenizer

# Load from HuggingFace Hub
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Encode text
output = tokenizer.encode("Hello, how are you?")
print(output.tokens)  # ['hello', ',', 'how', 'are', 'you', '?']
print(output.ids)     # [7592, 1010, 2129, 2024, 2017, 1029]

# Decode back
text = tokenizer.decode(output.ids)
print(text)  # "hello, how are you?"
```

### Train custom BPE tokenizer

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Initialize tokenizer with BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Configure trainer
trainer = BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    min_frequency=2
)

# Train on files
files = ["train.txt", "validation.txt"]
tokenizer.train(files, trainer)

# Save
tokenizer.save("my-tokenizer.json")
```

**Training time**: ~1-2 minutes for a 100MB corpus, ~10-20 minutes for 1GB

### Batch encoding with padding

```python
# Enable padding
tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")

# Encode batch
texts = ["Hello world", "This is a longer sentence"]
encodings = tokenizer.encode_batch(texts)

for encoding in encodings:
    print(encoding.ids)
    # [101, 7592, 2088, 102, 3, 3, 3]
    # [101, 2023, 2003, 1037, 2936, 6251, 102]
```

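At the ID level, padding is simple to replicate by hand. The sketch below (toy ID lists, not the library's implementation) right-pads a batch to the longest sequence and builds the matching attention masks:

```python
def pad_batch(batch, pad_id):
    """Pad ID sequences to the longest length and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    padded, masks = [], []
    for ids in batch:
        pad_count = max_len - len(ids)
        padded.append(ids + [pad_id] * pad_count)       # right-pad with pad_id
        masks.append([1] * len(ids) + [0] * pad_count)  # 1 = real token, 0 = padding
    return padded, masks

ids_batch = [[101, 7592, 2088, 102], [101, 2023, 2003, 1037, 2936, 6251, 102]]
padded, masks = pad_batch(ids_batch, pad_id=3)
print(padded[0])  # [101, 7592, 2088, 102, 3, 3, 3]
print(masks[0])   # [1, 1, 1, 1, 0, 0, 0]
```

The library also exposes these masks directly via `encoding.attention_mask`, so downstream models can ignore the padded positions.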
## Tokenization algorithms

### BPE (Byte-Pair Encoding)

**How it works**:
1. Start with character-level vocabulary
2. Find most frequent character pair
3. Merge into new token, add to vocabulary
4. Repeat until vocabulary size reached

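The merge loop in steps 1-4 can be sketched in a few lines of pure Python. This is a toy illustration of the algorithm, not the library's Rust implementation:

```python
from collections import Counter

def train_toy_bpe(words, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent symbol pair."""
    corpus = [list(w) for w in words]  # step 1: each word as a character sequence
    merges = []
    for _ in range(num_merges):
        pairs = Counter()              # step 2: count adjacent symbol pairs
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_corpus = []                # step 3: merge that pair everywhere
        for symbols in corpus:
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus.append(merged)
        corpus = new_corpus            # step 4: repeat
    return merges, corpus

merges, corpus = train_toy_bpe(["low", "lower", "lowest"], num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

On this toy corpus the first merges fuse `'l'+'o'` and then `'lo'+'w'`, the frequency-driven behavior described in step 2.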
**Used by**: GPT-2, GPT-3, RoBERTa, BART, DeBERTa

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))
tokenizer.pre_tokenizer = ByteLevel()

trainer = BpeTrainer(
    vocab_size=50257,
    special_tokens=["<|endoftext|>"],
    min_frequency=2
)

tokenizer.train(files=["data.txt"], trainer=trainer)
```

**Advantages**:
- Handles OOV words well (breaks them into subwords)
- Flexible vocabulary size
- Good for morphologically rich languages

**Trade-offs**:
- Tokenization depends on merge order
- May split common words unexpectedly

### WordPiece

**How it works**:
1. Start with character vocabulary
2. Score merge pairs: `frequency(pair) / (frequency(first) × frequency(second))`
3. Merge highest-scoring pair
4. Repeat until vocabulary size reached

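The scoring rule in step 2 is easy to compute directly. A small pure-Python sketch (toy character corpus, not the library's trainer) that scores every adjacent pair:

```python
from collections import Counter

def wordpiece_scores(corpus_symbols):
    """Score candidate merges with freq(pair) / (freq(first) * freq(second))."""
    unit_freq, pair_freq = Counter(), Counter()
    for symbols in corpus_symbols:
        unit_freq.update(symbols)
        pair_freq.update(zip(symbols, symbols[1:]))
    return {
        pair: count / (unit_freq[pair[0]] * unit_freq[pair[1]])
        for pair, count in pair_freq.items()
    }

# "hugging" and "hugs" as character sequences
scores = wordpiece_scores([list("hugging"), list("hugs")])
print(scores[('h', 'u')])  # 2 / (2 * 2) = 0.5
print(scores[('i', 'n')])  # 1 / (1 * 1) = 1.0
```

Note how `('i', 'n')`, which only ever occurs as a pair, scores 1.0, while the frequent but promiscuous `('g', 'g')` scores far lower (1/16): WordPiece prioritizes pairs whose parts rarely occur apart.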
**Used by**: BERT, DistilBERT, MobileBERT

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.normalizers import BertNormalizer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=30522,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    continuing_subword_prefix="##"
)

tokenizer.train(files=["corpus.txt"], trainer=trainer)
```

**Advantages**:
- Prioritizes meaningful merges (a high score means the pair's parts rarely occur apart)
- Used successfully in BERT (state-of-the-art results)

**Trade-offs**:
- Unknown words become `[UNK]` if no subword matches
- Saves the full vocabulary rather than merge rules, so tokenizer files can be larger

### Unigram

**How it works**:
1. Start with large vocabulary (all substrings)
2. Compute loss for corpus with current vocabulary
3. Remove tokens with minimal impact on loss
4. Repeat until vocabulary size reached

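"Finds the most likely tokenization" means: given a learned log-probability per token, pick the segmentation whose summed log-probability is highest. A Viterbi-style sketch over a hand-written toy vocabulary (the log-probs below are made up for illustration):

```python
import math

def best_segmentation(text, token_logprob):
    """Viterbi search for the most likely tokenization under a unigram model."""
    n = len(text)
    best = [(-math.inf, None)] * (n + 1)  # (score, backpointer) per prefix length
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in token_logprob and best[start][0] > -math.inf:
                score = best[start][0] + token_logprob[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # text not coverable by the vocabulary
    tokens, pos = [], n  # walk backpointers to recover the token sequence
    while pos > 0:
        start = best[pos][1]
        tokens.append(text[start:pos])
        pos = start
    return tokens[::-1]

vocab = {"un": -2.0, "able": -2.5, "u": -4.0, "n": -4.0, "a": -4.0,
         "b": -4.0, "l": -4.0, "e": -4.0}
print(best_segmentation("unable", vocab))  # ['un', 'able']
```

Here `['un', 'able']` (score -4.5) beats the character-level fallback (score -24.0), which is exactly the probabilistic tie-breaking that the advantages below refer to.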
**Used by**: ALBERT, T5, mBART, XLNet (via SentencePiece)
|
|
195
|
+
|
|
196
|
+
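At encoding time, a Unigram model picks the segmentation whose tokens have the highest total probability. This can be sketched as a small Viterbi search over hypothetical log-probabilities:

```python
import math

# Hypothetical unigram log-probabilities for a toy vocabulary
log_p = {"h": -5.0, "e": -5.0, "l": -5.0, "o": -5.0,
         "he": -3.0, "ll": -3.5, "hello": -2.0}

def segment(text):
    """Best-scoring segmentation under the unigram model (Viterbi)."""
    best = [(-math.inf, [])] * (len(text) + 1)
    best[0] = (0.0, [])
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_p and best[start][0] + log_p[piece] > best[end][0]:
                best[end] = (best[start][0] + log_p[piece],
                             best[start][1] + [piece])
    return best[-1][1]

print(segment("hello"))  # ['hello'] beats ['he', 'll', 'o'] (-2.0 vs -11.5)
```
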
```python
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

tokenizer = Tokenizer(Unigram())

trainer = UnigramTrainer(
    vocab_size=8000,
    special_tokens=["<unk>", "<s>", "</s>"],
    unk_token="<unk>"
)

tokenizer.train(files=["data.txt"], trainer=trainer)
```

**Advantages**:
- Probabilistic (finds most likely tokenization)
- Works well for languages without word boundaries
- Handles diverse linguistic contexts

**Trade-offs**:
- Computationally expensive to train
- More hyperparameters to tune

## Tokenization pipeline

Complete pipeline: **Normalization → Pre-tokenization → Model → Post-processing**

### Normalization

Clean and standardize text:

```python
from tokenizers.normalizers import NFD, StripAccents, Lowercase, Sequence

tokenizer.normalizer = Sequence([
    NFD(),          # Unicode normalization (decompose)
    Lowercase(),    # Convert to lowercase
    StripAccents()  # Remove accents
])

# Input: "Héllo WORLD"
# After normalization: "hello world"
```

**Common normalizers**:
- `NFD`, `NFC`, `NFKD`, `NFKC` - Unicode normalization forms
- `Lowercase()` - Convert to lowercase
- `StripAccents()` - Remove accents (é → e)
- `Strip()` - Remove whitespace
- `Replace(pattern, content)` - Regex replacement

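The NFD + accent-stripping combination can be reproduced with Python's standard library, which is handy for sanity-checking what a normalizer should emit:

```python
import unicodedata

s = "Héllo WORLD"
# NFD decomposes "é" into "e" plus a combining accent mark
decomposed = unicodedata.normalize("NFD", s)
# Dropping combining marks strips the accents
stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
print(stripped.lower())  # hello world
```
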
### Pre-tokenization

Split text into word-like units:

```python
from tokenizers.pre_tokenizers import Whitespace, Punctuation, Sequence

# Split on whitespace and punctuation
tokenizer.pre_tokenizer = Sequence([
    Whitespace(),
    Punctuation()
])

# Input: "Hello, world!"
# After pre-tokenization: ["Hello", ",", "world", "!"]
```

**Common pre-tokenizers**:
- `Whitespace()` - Split on spaces, tabs, newlines
- `ByteLevel()` - GPT-2 style byte-level splitting
- `Punctuation()` - Isolate punctuation
- `Digits(individual_digits=True)` - Split digits individually
- `Metaspace()` - Replace spaces with ▁ (SentencePiece style)

### Post-processing

Add special tokens for model input:

```python
from tokenizers.processors import TemplateProcessing

# BERT-style: [CLS] sentence [SEP]
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)
```

**Common patterns**:
```python
# GPT-2: sentence <|endoftext|>
TemplateProcessing(
    single="$A <|endoftext|>",
    special_tokens=[("<|endoftext|>", 50256)]
)

# RoBERTa: <s> sentence </s>
TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> </s> $B </s>",
    special_tokens=[("<s>", 0), ("</s>", 2)]
)
```

## Alignment tracking

Track token positions in original text:

```python
text = "Hello, world!"
output = tokenizer.encode(text)

# Get token offsets
for token, offset in zip(output.tokens, output.offsets):
    start, end = offset
    print(f"{token:10} → [{start:2}, {end:2}): {text[start:end]!r}")

# Output:
# hello      → [ 0,  5): 'Hello'
# ,          → [ 5,  6): ','
# world      → [ 7, 12): 'world'
# !          → [12, 13): '!'
```

**Use cases**:
- Named entity recognition (map predictions back to text)
- Question answering (extract answer spans)
- Token classification (align labels to original positions)

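For instance, a character-level annotation can be mapped onto tokens using offsets alone. A minimal sketch, assuming the offsets produced in the snippet above:

```python
# Token offsets: token index -> (start, end) in the original string
offsets = [(0, 5), (5, 6), (7, 12), (12, 13)]

# A hypothetical NER annotation over characters 7-12 ("world")
entity_start, entity_end = 7, 12

# Label every token whose span overlaps the annotation
labeled = [i for i, (s, e) in enumerate(offsets)
           if s < entity_end and e > entity_start]
print(labeled)  # [2] -> the token 'world'
```
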
## Integration with transformers

### Load with AutoTokenizer

```python
from transformers import AutoTokenizer

# AutoTokenizer automatically uses fast tokenizers
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Check if using fast tokenizer
print(tokenizer.is_fast)  # True

# Access underlying tokenizers.Tokenizer
fast_tokenizer = tokenizer.backend_tokenizer
print(type(fast_tokenizer))  # <class 'tokenizers.Tokenizer'>
```

### Convert custom tokenizer to transformers

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from transformers import PreTrainedTokenizerFast

# Train custom tokenizer
tokenizer = Tokenizer(BPE())
# ... train tokenizer ...
tokenizer.save("my-tokenizer.json")

# Wrap for transformers
transformers_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my-tokenizer.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]"
)

# Use like any transformers tokenizer
outputs = transformers_tokenizer(
    "Hello world",
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)
```

## Common patterns

### Train from iterator (large datasets)

```python
from datasets import load_dataset

# Load dataset
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

# Create batch iterator
def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i:i + batch_size]["text"]

# Train tokenizer
tokenizer.train_from_iterator(
    batch_iterator(),
    trainer=trainer,
    length=len(dataset)  # For progress bar
)
```

**Performance**: Processes 1 GB in ~10-20 minutes

### Enable truncation and padding

```python
# Enable truncation
tokenizer.enable_truncation(max_length=512)

# Enable padding
tokenizer.enable_padding(
    pad_id=tokenizer.token_to_id("[PAD]"),
    pad_token="[PAD]",
    length=512  # Fixed length, or None for batch max
)

# Encode with both
output = tokenizer.encode("This is a long sentence that will be truncated...")
print(len(output.ids))  # 512
```

### Multi-processing

```python
from tokenizers import Tokenizer
from multiprocessing import Pool

# Load tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")

corpus = [...]  # your list of texts

def encode_batch(texts):
    return tokenizer.encode_batch(texts)

# Process large corpus in parallel
with Pool(8) as pool:
    # Split corpus into chunks
    chunk_size = 1000
    chunks = [corpus[i:i + chunk_size] for i in range(0, len(corpus), chunk_size)]

    # Encode in parallel
    results = pool.map(encode_batch, chunks)
```

**Speedup**: 5-8× with 8 cores

## Performance benchmarks

### Training speed

| Corpus Size | BPE (30k vocab) | WordPiece (30k) | Unigram (8k) |
|-------------|-----------------|-----------------|--------------|
| 10 MB       | 15 sec          | 18 sec          | 25 sec       |
| 100 MB      | 1.5 min         | 2 min           | 4 min        |
| 1 GB        | 15 min          | 20 min          | 40 min       |

**Hardware**: 16-core CPU, tested on English Wikipedia

### Tokenization speed

| Implementation | 1 GB corpus | Throughput |
|----------------|-------------|------------|
| Pure Python    | ~20 minutes | ~50 MB/min |
| HF Tokenizers  | ~15 seconds | ~4 GB/min  |
| **Speedup**    | **80×**     | **80×**    |

**Test**: English text, average sentence length 20 words

### Memory usage

| Task                  | Memory  |
|-----------------------|---------|
| Load tokenizer        | ~10 MB  |
| Train BPE (30k vocab) | ~200 MB |
| Encode 1M sentences   | ~500 MB |

## Supported models

Pre-trained tokenizers available via `from_pretrained()`:

**BERT family**:
- `bert-base-uncased`, `bert-large-cased`
- `distilbert-base-uncased`
- `roberta-base`, `roberta-large`

**GPT family**:
- `gpt2`, `gpt2-medium`, `gpt2-large`
- `distilgpt2`

**T5 family**:
- `t5-small`, `t5-base`, `t5-large`
- `google/flan-t5-xxl`

**Other**:
- `facebook/bart-base`, `facebook/mbart-large-cc25`
- `albert-base-v2`, `albert-xlarge-v2`
- `xlm-roberta-base`, `xlm-roberta-large`

Browse all: https://huggingface.co/models?library=tokenizers

## References

- **[Training Guide](references/training.md)** - Train custom tokenizers, configure trainers, handle large datasets
- **[Algorithms Deep Dive](references/algorithms.md)** - BPE, WordPiece, Unigram explained in detail
- **[Pipeline Components](references/pipeline.md)** - Normalizers, pre-tokenizers, post-processors, decoders
- **[Transformers Integration](references/integration.md)** - AutoTokenizer, PreTrainedTokenizerFast, special tokens

## Resources

- **Docs**: https://huggingface.co/docs/tokenizers
- **GitHub**: https://github.com/huggingface/tokenizers ⭐ 9,000+
- **Version**: 0.20.0+
- **Course**: https://huggingface.co/learn/nlp-course/chapter6/1
- **Papers**: BPE (Sennrich et al., 2016), WordPiece (Schuster & Nakajima, 2012)