@synsci/cli-darwin-x64 1.1.49
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/skills/accelerate/SKILL.md +332 -0
- package/bin/skills/accelerate/references/custom-plugins.md +453 -0
- package/bin/skills/accelerate/references/megatron-integration.md +489 -0
- package/bin/skills/accelerate/references/performance.md +525 -0
- package/bin/skills/audiocraft/SKILL.md +564 -0
- package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
- package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
- package/bin/skills/autogpt/SKILL.md +403 -0
- package/bin/skills/autogpt/references/advanced-usage.md +535 -0
- package/bin/skills/autogpt/references/troubleshooting.md +420 -0
- package/bin/skills/awq/SKILL.md +310 -0
- package/bin/skills/awq/references/advanced-usage.md +324 -0
- package/bin/skills/awq/references/troubleshooting.md +344 -0
- package/bin/skills/axolotl/SKILL.md +158 -0
- package/bin/skills/axolotl/references/api.md +5548 -0
- package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
- package/bin/skills/axolotl/references/index.md +15 -0
- package/bin/skills/axolotl/references/other.md +3563 -0
- package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
- package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
- package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
- package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
- package/bin/skills/bitsandbytes/SKILL.md +411 -0
- package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
- package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
- package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
- package/bin/skills/blip-2/SKILL.md +564 -0
- package/bin/skills/blip-2/references/advanced-usage.md +680 -0
- package/bin/skills/blip-2/references/troubleshooting.md +526 -0
- package/bin/skills/chroma/SKILL.md +406 -0
- package/bin/skills/chroma/references/integration.md +38 -0
- package/bin/skills/clip/SKILL.md +253 -0
- package/bin/skills/clip/references/applications.md +207 -0
- package/bin/skills/constitutional-ai/SKILL.md +290 -0
- package/bin/skills/crewai/SKILL.md +498 -0
- package/bin/skills/crewai/references/flows.md +438 -0
- package/bin/skills/crewai/references/tools.md +429 -0
- package/bin/skills/crewai/references/troubleshooting.md +480 -0
- package/bin/skills/deepspeed/SKILL.md +141 -0
- package/bin/skills/deepspeed/references/08.md +17 -0
- package/bin/skills/deepspeed/references/09.md +173 -0
- package/bin/skills/deepspeed/references/2020.md +378 -0
- package/bin/skills/deepspeed/references/2023.md +279 -0
- package/bin/skills/deepspeed/references/assets.md +179 -0
- package/bin/skills/deepspeed/references/index.md +35 -0
- package/bin/skills/deepspeed/references/mii.md +118 -0
- package/bin/skills/deepspeed/references/other.md +1191 -0
- package/bin/skills/deepspeed/references/tutorials.md +6554 -0
- package/bin/skills/dspy/SKILL.md +590 -0
- package/bin/skills/dspy/references/examples.md +663 -0
- package/bin/skills/dspy/references/modules.md +475 -0
- package/bin/skills/dspy/references/optimizers.md +566 -0
- package/bin/skills/faiss/SKILL.md +221 -0
- package/bin/skills/faiss/references/index_types.md +280 -0
- package/bin/skills/flash-attention/SKILL.md +367 -0
- package/bin/skills/flash-attention/references/benchmarks.md +215 -0
- package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
- package/bin/skills/gguf/SKILL.md +427 -0
- package/bin/skills/gguf/references/advanced-usage.md +504 -0
- package/bin/skills/gguf/references/troubleshooting.md +442 -0
- package/bin/skills/gptq/SKILL.md +450 -0
- package/bin/skills/gptq/references/calibration.md +337 -0
- package/bin/skills/gptq/references/integration.md +129 -0
- package/bin/skills/gptq/references/troubleshooting.md +95 -0
- package/bin/skills/grpo-rl-training/README.md +97 -0
- package/bin/skills/grpo-rl-training/SKILL.md +572 -0
- package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
- package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
- package/bin/skills/guidance/SKILL.md +572 -0
- package/bin/skills/guidance/references/backends.md +554 -0
- package/bin/skills/guidance/references/constraints.md +674 -0
- package/bin/skills/guidance/references/examples.md +767 -0
- package/bin/skills/hqq/SKILL.md +445 -0
- package/bin/skills/hqq/references/advanced-usage.md +528 -0
- package/bin/skills/hqq/references/troubleshooting.md +503 -0
- package/bin/skills/hugging-face-cli/SKILL.md +191 -0
- package/bin/skills/hugging-face-cli/references/commands.md +954 -0
- package/bin/skills/hugging-face-cli/references/examples.md +374 -0
- package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
- package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
- package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
- package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
- package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
- package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
- package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
- package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
- package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
- package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
- package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
- package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
- package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
- package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
- package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
- package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
- package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
- package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
- package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
- package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
- package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
- package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
- package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
- package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
- package/bin/skills/hugging-face-jobs/index.html +216 -0
- package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
- package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
- package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
- package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
- package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
- package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
- package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
- package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
- package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
- package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
- package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
- package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
- package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
- package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
- package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
- package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
- package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
- package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
- package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
- package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
- package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
- package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
- package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
- package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
- package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
- package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
- package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
- package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
- package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
- package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
- package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
- package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
- package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
- package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
- package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
- package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
- package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
- package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
- package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
- package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
- package/bin/skills/instructor/SKILL.md +740 -0
- package/bin/skills/instructor/references/examples.md +107 -0
- package/bin/skills/instructor/references/providers.md +70 -0
- package/bin/skills/instructor/references/validation.md +606 -0
- package/bin/skills/knowledge-distillation/SKILL.md +458 -0
- package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
- package/bin/skills/lambda-labs/SKILL.md +545 -0
- package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
- package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
- package/bin/skills/langchain/SKILL.md +480 -0
- package/bin/skills/langchain/references/agents.md +499 -0
- package/bin/skills/langchain/references/integration.md +562 -0
- package/bin/skills/langchain/references/rag.md +600 -0
- package/bin/skills/langsmith/SKILL.md +422 -0
- package/bin/skills/langsmith/references/advanced-usage.md +548 -0
- package/bin/skills/langsmith/references/troubleshooting.md +537 -0
- package/bin/skills/litgpt/SKILL.md +469 -0
- package/bin/skills/litgpt/references/custom-models.md +568 -0
- package/bin/skills/litgpt/references/distributed-training.md +451 -0
- package/bin/skills/litgpt/references/supported-models.md +336 -0
- package/bin/skills/litgpt/references/training-recipes.md +619 -0
- package/bin/skills/llama-cpp/SKILL.md +258 -0
- package/bin/skills/llama-cpp/references/optimization.md +89 -0
- package/bin/skills/llama-cpp/references/quantization.md +213 -0
- package/bin/skills/llama-cpp/references/server.md +125 -0
- package/bin/skills/llama-factory/SKILL.md +80 -0
- package/bin/skills/llama-factory/references/_images.md +23 -0
- package/bin/skills/llama-factory/references/advanced.md +1055 -0
- package/bin/skills/llama-factory/references/getting_started.md +349 -0
- package/bin/skills/llama-factory/references/index.md +19 -0
- package/bin/skills/llama-factory/references/other.md +31 -0
- package/bin/skills/llamaguard/SKILL.md +337 -0
- package/bin/skills/llamaindex/SKILL.md +569 -0
- package/bin/skills/llamaindex/references/agents.md +83 -0
- package/bin/skills/llamaindex/references/data_connectors.md +108 -0
- package/bin/skills/llamaindex/references/query_engines.md +406 -0
- package/bin/skills/llava/SKILL.md +304 -0
- package/bin/skills/llava/references/training.md +197 -0
- package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
- package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
- package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
- package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
- package/bin/skills/long-context/SKILL.md +536 -0
- package/bin/skills/long-context/references/extension_methods.md +468 -0
- package/bin/skills/long-context/references/fine_tuning.md +611 -0
- package/bin/skills/long-context/references/rope.md +402 -0
- package/bin/skills/mamba/SKILL.md +260 -0
- package/bin/skills/mamba/references/architecture-details.md +206 -0
- package/bin/skills/mamba/references/benchmarks.md +255 -0
- package/bin/skills/mamba/references/training-guide.md +388 -0
- package/bin/skills/megatron-core/SKILL.md +366 -0
- package/bin/skills/megatron-core/references/benchmarks.md +249 -0
- package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
- package/bin/skills/megatron-core/references/production-examples.md +473 -0
- package/bin/skills/megatron-core/references/training-recipes.md +547 -0
- package/bin/skills/miles/SKILL.md +315 -0
- package/bin/skills/miles/references/api-reference.md +141 -0
- package/bin/skills/miles/references/troubleshooting.md +352 -0
- package/bin/skills/mlflow/SKILL.md +704 -0
- package/bin/skills/mlflow/references/deployment.md +744 -0
- package/bin/skills/mlflow/references/model-registry.md +770 -0
- package/bin/skills/mlflow/references/tracking.md +680 -0
- package/bin/skills/modal/SKILL.md +341 -0
- package/bin/skills/modal/references/advanced-usage.md +503 -0
- package/bin/skills/modal/references/troubleshooting.md +494 -0
- package/bin/skills/model-merging/SKILL.md +539 -0
- package/bin/skills/model-merging/references/evaluation.md +462 -0
- package/bin/skills/model-merging/references/examples.md +428 -0
- package/bin/skills/model-merging/references/methods.md +352 -0
- package/bin/skills/model-pruning/SKILL.md +495 -0
- package/bin/skills/model-pruning/references/wanda.md +347 -0
- package/bin/skills/moe-training/SKILL.md +526 -0
- package/bin/skills/moe-training/references/architectures.md +432 -0
- package/bin/skills/moe-training/references/inference.md +348 -0
- package/bin/skills/moe-training/references/training.md +425 -0
- package/bin/skills/nanogpt/SKILL.md +290 -0
- package/bin/skills/nanogpt/references/architecture.md +382 -0
- package/bin/skills/nanogpt/references/data.md +476 -0
- package/bin/skills/nanogpt/references/training.md +564 -0
- package/bin/skills/nemo-curator/SKILL.md +383 -0
- package/bin/skills/nemo-curator/references/deduplication.md +87 -0
- package/bin/skills/nemo-curator/references/filtering.md +102 -0
- package/bin/skills/nemo-evaluator/SKILL.md +494 -0
- package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
- package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
- package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
- package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
- package/bin/skills/nemo-guardrails/SKILL.md +297 -0
- package/bin/skills/nnsight/SKILL.md +436 -0
- package/bin/skills/nnsight/references/README.md +78 -0
- package/bin/skills/nnsight/references/api.md +344 -0
- package/bin/skills/nnsight/references/tutorials.md +300 -0
- package/bin/skills/openrlhf/SKILL.md +249 -0
- package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
- package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
- package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
- package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
- package/bin/skills/outlines/SKILL.md +652 -0
- package/bin/skills/outlines/references/backends.md +615 -0
- package/bin/skills/outlines/references/examples.md +773 -0
- package/bin/skills/outlines/references/json_generation.md +652 -0
- package/bin/skills/peft/SKILL.md +431 -0
- package/bin/skills/peft/references/advanced-usage.md +514 -0
- package/bin/skills/peft/references/troubleshooting.md +480 -0
- package/bin/skills/phoenix/SKILL.md +475 -0
- package/bin/skills/phoenix/references/advanced-usage.md +619 -0
- package/bin/skills/phoenix/references/troubleshooting.md +538 -0
- package/bin/skills/pinecone/SKILL.md +358 -0
- package/bin/skills/pinecone/references/deployment.md +181 -0
- package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
- package/bin/skills/pytorch-fsdp/references/index.md +7 -0
- package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
- package/bin/skills/pytorch-lightning/SKILL.md +346 -0
- package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
- package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
- package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
- package/bin/skills/pyvene/SKILL.md +473 -0
- package/bin/skills/pyvene/references/README.md +73 -0
- package/bin/skills/pyvene/references/api.md +383 -0
- package/bin/skills/pyvene/references/tutorials.md +376 -0
- package/bin/skills/qdrant/SKILL.md +493 -0
- package/bin/skills/qdrant/references/advanced-usage.md +648 -0
- package/bin/skills/qdrant/references/troubleshooting.md +631 -0
- package/bin/skills/ray-data/SKILL.md +326 -0
- package/bin/skills/ray-data/references/integration.md +82 -0
- package/bin/skills/ray-data/references/transformations.md +83 -0
- package/bin/skills/ray-train/SKILL.md +406 -0
- package/bin/skills/ray-train/references/multi-node.md +628 -0
- package/bin/skills/rwkv/SKILL.md +260 -0
- package/bin/skills/rwkv/references/architecture-details.md +344 -0
- package/bin/skills/rwkv/references/rwkv7.md +386 -0
- package/bin/skills/rwkv/references/state-management.md +369 -0
- package/bin/skills/saelens/SKILL.md +386 -0
- package/bin/skills/saelens/references/README.md +70 -0
- package/bin/skills/saelens/references/api.md +333 -0
- package/bin/skills/saelens/references/tutorials.md +318 -0
- package/bin/skills/segment-anything/SKILL.md +500 -0
- package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
- package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
- package/bin/skills/sentence-transformers/SKILL.md +255 -0
- package/bin/skills/sentence-transformers/references/models.md +123 -0
- package/bin/skills/sentencepiece/SKILL.md +235 -0
- package/bin/skills/sentencepiece/references/algorithms.md +200 -0
- package/bin/skills/sentencepiece/references/training.md +304 -0
- package/bin/skills/sglang/SKILL.md +442 -0
- package/bin/skills/sglang/references/deployment.md +490 -0
- package/bin/skills/sglang/references/radix-attention.md +413 -0
- package/bin/skills/sglang/references/structured-generation.md +541 -0
- package/bin/skills/simpo/SKILL.md +219 -0
- package/bin/skills/simpo/references/datasets.md +478 -0
- package/bin/skills/simpo/references/hyperparameters.md +452 -0
- package/bin/skills/simpo/references/loss-functions.md +350 -0
- package/bin/skills/skypilot/SKILL.md +509 -0
- package/bin/skills/skypilot/references/advanced-usage.md +491 -0
- package/bin/skills/skypilot/references/troubleshooting.md +570 -0
- package/bin/skills/slime/SKILL.md +464 -0
- package/bin/skills/slime/references/api-reference.md +392 -0
- package/bin/skills/slime/references/troubleshooting.md +386 -0
- package/bin/skills/speculative-decoding/SKILL.md +467 -0
- package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
- package/bin/skills/speculative-decoding/references/medusa.md +350 -0
- package/bin/skills/stable-diffusion/SKILL.md +519 -0
- package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
- package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
- package/bin/skills/tensorboard/SKILL.md +629 -0
- package/bin/skills/tensorboard/references/integrations.md +638 -0
- package/bin/skills/tensorboard/references/profiling.md +545 -0
- package/bin/skills/tensorboard/references/visualization.md +620 -0
- package/bin/skills/tensorrt-llm/SKILL.md +187 -0
- package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
- package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
- package/bin/skills/tensorrt-llm/references/serving.md +470 -0
- package/bin/skills/tinker/SKILL.md +362 -0
- package/bin/skills/tinker/references/api-reference.md +168 -0
- package/bin/skills/tinker/references/getting-started.md +157 -0
- package/bin/skills/tinker/references/loss-functions.md +163 -0
- package/bin/skills/tinker/references/models-and-lora.md +139 -0
- package/bin/skills/tinker/references/recipes.md +280 -0
- package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
- package/bin/skills/tinker/references/rendering.md +243 -0
- package/bin/skills/tinker/references/supervised-learning.md +232 -0
- package/bin/skills/tinker-training-cost/SKILL.md +187 -0
- package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
- package/bin/skills/torchforge/SKILL.md +433 -0
- package/bin/skills/torchforge/references/api-reference.md +327 -0
- package/bin/skills/torchforge/references/troubleshooting.md +409 -0
- package/bin/skills/torchtitan/SKILL.md +358 -0
- package/bin/skills/torchtitan/references/checkpoint.md +181 -0
- package/bin/skills/torchtitan/references/custom-models.md +258 -0
- package/bin/skills/torchtitan/references/float8.md +133 -0
- package/bin/skills/torchtitan/references/fsdp.md +126 -0
- package/bin/skills/transformer-lens/SKILL.md +346 -0
- package/bin/skills/transformer-lens/references/README.md +54 -0
- package/bin/skills/transformer-lens/references/api.md +362 -0
- package/bin/skills/transformer-lens/references/tutorials.md +339 -0
- package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
- package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
- package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
- package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
- package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
- package/bin/skills/unsloth/SKILL.md +80 -0
- package/bin/skills/unsloth/references/index.md +7 -0
- package/bin/skills/unsloth/references/llms-full.md +16799 -0
- package/bin/skills/unsloth/references/llms-txt.md +12044 -0
- package/bin/skills/unsloth/references/llms.md +82 -0
- package/bin/skills/verl/SKILL.md +391 -0
- package/bin/skills/verl/references/api-reference.md +301 -0
- package/bin/skills/verl/references/troubleshooting.md +391 -0
- package/bin/skills/vllm/SKILL.md +364 -0
- package/bin/skills/vllm/references/optimization.md +226 -0
- package/bin/skills/vllm/references/quantization.md +284 -0
- package/bin/skills/vllm/references/server-deployment.md +255 -0
- package/bin/skills/vllm/references/troubleshooting.md +447 -0
- package/bin/skills/weights-and-biases/SKILL.md +590 -0
- package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
- package/bin/skills/weights-and-biases/references/integrations.md +700 -0
- package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
- package/bin/skills/whisper/SKILL.md +317 -0
- package/bin/skills/whisper/references/languages.md +189 -0
- package/bin/synsc +0 -0
- package/package.json +10 -0
@@ -0,0 +1,637 @@

# Transformers Integration

Complete guide to using HuggingFace Tokenizers with the Transformers library.

## AutoTokenizer

The easiest way to load tokenizers.

### Loading pretrained tokenizers

```python
from transformers import AutoTokenizer

# Load from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Check if using fast tokenizer (Rust-based)
print(tokenizer.is_fast)  # True

# Access underlying tokenizers.Tokenizer
if tokenizer.is_fast:
    fast_tokenizer = tokenizer.backend_tokenizer
    print(type(fast_tokenizer))  # <class 'tokenizers.Tokenizer'>
```
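One thing the fast backend enables directly through the transformers API is offset mapping: each token can be traced back to its character span in the input. A minimal sketch (assumes the `bert-base-uncased` checkpoint is available locally or via the Hub):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Hello world"
# return_offsets_mapping is supported by fast tokenizers only
enc = tokenizer(text, return_offsets_mapping=True)
for token, (start, end) in zip(enc.tokens(), enc["offset_mapping"]):
    # Special tokens such as [CLS]/[SEP] map to the empty span (0, 0)
    print(f"{token!r} -> {text[start:end]!r}")
```

Slicing the original string with each `(start, end)` pair recovers the exact characters a token covers, which is the basis for span tasks like NER and extractive QA.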
### Fast vs slow tokenizers

| Feature            | Fast (Rust)     | Slow (Python) |
|--------------------|-----------------|---------------|
| Speed              | 5-10× faster    | Baseline      |
| Alignment tracking | ✅ Full support | ❌ Limited    |
| Batch processing   | ✅ Optimized    | ⚠️ Slower     |
| Offset mapping     | ✅ Yes          | ❌ No         |
| Installation       | `tokenizers`    | Built-in      |

**Always use fast tokenizers when available.**
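The speedup is easy to measure yourself. A rough timing sketch (absolute numbers vary by machine; assumes the checkpoint is downloadable):

```python
import time

from transformers import AutoTokenizer

texts = ["An example sentence for benchmarking."] * 1000

for use_fast in (True, False):
    tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=use_fast)
    start = time.perf_counter()
    tok(texts)  # batch-encode all sentences
    print(f"use_fast={use_fast}: {time.perf_counter() - start:.3f}s")
```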
### Check available tokenizers

```python
from transformers import TOKENIZER_MAPPING

# List all fast tokenizers
for config_class, (slow, fast) in TOKENIZER_MAPPING.items():
    if fast is not None:
        print(f"{config_class.__name__}: {fast.__name__}")
```
## PreTrainedTokenizerFast

Wrap custom tokenizers for use with Transformers.

### Convert custom tokenizer

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from transformers import PreTrainedTokenizerFast

# Train custom tokenizer
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Save tokenizer
tokenizer.save("my-tokenizer.json")

# Wrap for transformers
transformers_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my-tokenizer.json",
    unk_token="[UNK]",
    sep_token="[SEP]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    mask_token="[MASK]"
)

# Save in transformers format
transformers_tokenizer.save_pretrained("my-tokenizer")
```

**Result**: Directory with `tokenizer.json` + `tokenizer_config.json` + `special_tokens_map.json`.
### Use like any transformers tokenizer

```python
# Load
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("my-tokenizer")

# Encode with all transformers features
outputs = tokenizer(
    "Hello world",
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

print(outputs.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
```
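The returned `BatchEncoding` is designed to unpack straight into a model's forward pass. A sketch using the stock `bert-base-uncased` checkpoint (a wrapped custom tokenizer works the same way once a matching model exists):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Hello world", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)  # keys match the forward-pass argument names

print(out.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```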
## Special tokens

### Default special tokens

| Model Family | CLS/BOS | SEP/EOS         | PAD             | UNK             | MASK   |
|--------------|---------|-----------------|-----------------|-----------------|--------|
| BERT         | [CLS]   | [SEP]           | [PAD]           | [UNK]           | [MASK] |
| GPT-2        | -       | <\|endoftext\|> | <\|endoftext\|> | <\|endoftext\|> | -      |
| RoBERTa      | <s>     | </s>            | <pad>           | <unk>           | <mask> |
| T5           | -       | </s>            | <pad>           | <unk>           | -      |

### Adding special tokens

```python
# Add new special tokens
special_tokens_dict = {
    "additional_special_tokens": ["<|image|>", "<|video|>", "<|audio|>"]
}

num_added_tokens = tokenizer.add_special_tokens(special_tokens_dict)
print(f"Added {num_added_tokens} tokens")

# Resize model embeddings to cover the new vocabulary entries
model.resize_token_embeddings(len(tokenizer))

# Use new tokens
text = "This is an image: <|image|>"
tokens = tokenizer.encode(text)
```
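A quick sanity check that a newly added marker really is treated atomically; a sketch assuming the `<|image|>` token from the snippet above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_special_tokens({"additional_special_tokens": ["<|image|>"]})

# The marker maps to exactly one vocabulary id and is never split
image_id = tokenizer.convert_tokens_to_ids("<|image|>")
ids = tokenizer.encode("This is an image: <|image|>", add_special_tokens=False)
assert ids.count(image_id) == 1

# And skip_special_tokens removes it on the way back out
print(tokenizer.decode(ids, skip_special_tokens=True))
```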
### Adding regular tokens

```python
# Add domain-specific tokens
new_tokens = ["COVID-19", "mRNA", "vaccine"]
num_added = tokenizer.add_tokens(new_tokens)

# These are NOT special tokens (default; still subject to normalization)
tokenizer.add_tokens(new_tokens, special_tokens=False)

# These ARE special tokens (never split, removable via skip_special_tokens)
tokenizer.add_tokens(new_tokens, special_tokens=True)
```
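The effect is easiest to see by tokenizing the same string before and after the call; a sketch (the exact subword splits depend on the checkpoint's vocabulary and normalization):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

before = tokenizer.tokenize("mRNA vaccine")
tokenizer.add_tokens(["mRNA"])
after = tokenizer.tokenize("mRNA vaccine")

# Before: "mRNA" falls apart into subword pieces; after: one piece
print(before)
print(after)
assert len(after) <= len(before)
```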
## Encoding and decoding

### Basic encoding

```python
# Single sentence
text = "Hello, how are you?"
encoded = tokenizer(text)

print(encoded)
# {'input_ids': [101, 7592, 1010, 2129, 2024, 2017, 1029, 102],
#  'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0],
#  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
```
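`tokenizer(text)` bundles three steps: split into subwords, map them to ids, then add special tokens. The pieces are also exposed individually, which makes the relationship explicit (assumes `bert-base-uncased`):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Hello, how are you?"
tokens = tokenizer.tokenize(text)              # subword strings, no special tokens
ids = tokenizer.convert_tokens_to_ids(tokens)  # vocabulary ids
assert tokenizer.convert_ids_to_tokens(ids) == tokens

# The one-call form is the same ids wrapped in [CLS] ... [SEP]
assert tokenizer(text)["input_ids"][1:-1] == ids
```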
### Batch encoding

```python
# Multiple sentences
texts = ["Hello world", "How are you?", "I am fine"]
encoded = tokenizer(texts, padding=True, truncation=True, max_length=10)

print(encoded['input_ids'])
# [[101, 7592, 2088, 102, 0, 0, 0, 0, 0, 0],
#  [101, 2129, 2024, 2017, 1029, 102, 0, 0, 0, 0],
#  [101, 1045, 2572, 2986, 102, 0, 0, 0, 0, 0]]
```
### Return tensors

```python
# Return PyTorch tensors
outputs = tokenizer("Hello world", return_tensors="pt")
print(outputs['input_ids'].shape)  # torch.Size([1, 4])

# Return TensorFlow tensors
outputs = tokenizer("Hello world", return_tensors="tf")

# Return NumPy arrays
outputs = tokenizer("Hello world", return_tensors="np")

# Return lists (default)
outputs = tokenizer("Hello world", return_tensors=None)
```
### Decoding

```python
# Decode token IDs
ids = [101, 7592, 2088, 102]
text = tokenizer.decode(ids)
print(text)  # "[CLS] hello world [SEP]"

# Skip special tokens
text = tokenizer.decode(ids, skip_special_tokens=True)
print(text)  # "hello world"

# Batch decode
batch_ids = [[101, 7592, 102], [101, 2088, 102]]
texts = tokenizer.batch_decode(batch_ids, skip_special_tokens=True)
print(texts)  # ["hello", "world"]
```
|
|
214
|
+
|
|
215
|
+
## Padding and truncation

### Padding strategies

```python
# Pad to the longest sequence in the batch
tokenizer(texts, padding="longest")

# Pad to a fixed max_length
tokenizer(texts, padding="max_length", max_length=128)

# No padding
tokenizer(texts, padding=False)

# Pad to a multiple of a value (for efficient computation)
tokenizer(texts, padding="max_length", max_length=128, pad_to_multiple_of=8)
# Result: length will be 128 (already a multiple of 8)
```

### Truncation strategies

```python
# Truncate to max length
tokenizer(text, truncation=True, max_length=10)

# Only truncate the first sequence (for pairs)
tokenizer(text1, text2, truncation="only_first", max_length=20)

# Only truncate the second sequence
tokenizer(text1, text2, truncation="only_second", max_length=20)

# Truncate the longer sequence first (default for pairs)
tokenizer(text1, text2, truncation="longest_first", max_length=20)

# No truncation (overlong inputs may later error in the model)
tokenizer(text, truncation=False)
```

### Stride for long documents

```python
# For documents longer than max_length
text = "Very long document " * 1000

# Encode with overlap
encodings = tokenizer(
    text,
    max_length=512,
    stride=128,  # Overlap between chunks
    truncation=True,
    return_overflowing_tokens=True,
    return_offsets_mapping=True
)

# Get all chunks
num_chunks = len(encodings['input_ids'])
print(f"Split into {num_chunks} chunks")

# Each chunk overlaps by stride tokens
for i, chunk in enumerate(encodings['input_ids']):
    print(f"Chunk {i}: {len(chunk)} tokens")
```
**Use case**: Long document QA, sliding window inference
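
The chunking above reduces to simple arithmetic: each window advances by `max_length - stride` tokens, so consecutive chunks share `stride` tokens. A sketch of the idea (the helper `chunk_spans` is ours, not a tokenizers API; special-token bookkeeping is ignored for clarity):

```python
def chunk_spans(n_tokens, window, stride):
    """Return (start, end) token spans of width <= window, where each
    span overlaps the previous one by `stride` tokens."""
    step = window - stride
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start += step
    return spans

print(chunk_spans(10, 4, 2))
# [(0, 4), (2, 6), (4, 8), (6, 10)]
```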

## Alignment and offsets

### Offset mapping

```python
# Get character offsets for each token
encoded = tokenizer("Hello, world!", return_offsets_mapping=True)

for token, (start, end) in zip(
    encoded.tokens(),
    encoded['offset_mapping']
):
    print(f"{token:10s} → [{start:2d}, {end:2d})")

# Output:
# [CLS]      → [ 0,  0)
# hello      → [ 0,  5)
# ,          → [ 5,  6)
# world      → [ 7, 12)
# !          → [12, 13)
# [SEP]      → [ 0,  0)
```

### Word IDs

```python
# Get word index for each token
encoded = tokenizer("Hello world")
word_ids = encoded.word_ids()

print(word_ids)
# [None, 0, 1, None]
# None = special token, 0 = first word, 1 = second word
```
**Use case**: Token classification (NER, POS tagging)
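
A common pattern built on `word_ids()` is expanding word-level NER labels to token level. The helper below is a sketch (`align_labels_with_tokens` is our name, not a library function); `-100` is the index PyTorch's cross-entropy loss ignores by default:

```python
def align_labels_with_tokens(word_ids, word_labels, ignore_index=-100):
    """Expand one label per word into one label per token.
    Special tokens (word id None) receive ignore_index."""
    return [
        ignore_index if wid is None else word_labels[wid]
        for wid in word_ids
    ]

# word_ids() for "Hello world" with BERT: [None, 0, 1, None]
print(align_labels_with_tokens([None, 0, 1, None], [3, 0]))
# [-100, 3, 0, -100]
```

Subword splits are handled automatically: every token of a split word shares that word's id, so each copy receives the same label.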

### Character to token mapping

```python
text = "Machine learning is awesome"
encoded = tokenizer(text, return_offsets_mapping=True)

# Find token for character position
char_pos = 8  # "l" in "learning"
token_idx = encoded.char_to_token(char_pos)

print(f"Character {char_pos} is in token {token_idx}: {encoded.tokens()[token_idx]}")
# Character 8 is in token 2: learning
```
**Use case**: Question answering (map answer character span to tokens)
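
`char_to_token()` generalizes to mapping a whole answer span. As a sketch of the idea (hypothetical helper operating directly on an `offset_mapping`-style list, not a library API):

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Find the first and last token overlapping [start_char, end_char).
    `offsets` is a list of (start, end) char pairs; (0, 0) marks specials."""
    start_tok = end_tok = None
    for i, (s, e) in enumerate(offsets):
        if s == e:  # special token, skip
            continue
        if start_tok is None and e > start_char:
            start_tok = i
        if s < end_char:
            end_tok = i
    return start_tok, end_tok

# Offsets for "Hello, world!": the answer "world" spans chars [7, 12)
offsets = [(0, 0), (0, 5), (5, 6), (7, 12), (12, 13), (0, 0)]
print(char_span_to_token_span(offsets, 7, 12))  # (3, 3)
```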

### Sequence pairs

```python
# Encode sentence pair
encoded = tokenizer("Question here", "Answer here", return_offsets_mapping=True)

# Get sequence IDs (which sequence each token belongs to)
sequence_ids = encoded.sequence_ids()
print(sequence_ids)
# [None, 0, 0, None, 1, 1, None]
# None = special token, 0 = question, 1 = answer
```

## Model integration

### Use with transformers models

```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize
text = "Hello world"
inputs = tokenizer(text, return_tensors="pt")

# Forward pass
with torch.no_grad():
    outputs = model(**inputs)

# Get embeddings
last_hidden_state = outputs.last_hidden_state
print(last_hidden_state.shape)  # [1, seq_len, hidden_size]
```

### Custom model with custom tokenizer

```python
from transformers import BertConfig, BertModel

# Train custom tokenizer
from tokenizers import Tokenizer, models, trainers
tokenizer = Tokenizer(models.BPE())
trainer = trainers.BpeTrainer(vocab_size=30000)
tokenizer.train(files=["data.txt"], trainer=trainer)

# Wrap for transformers
from transformers import PreTrainedTokenizerFast
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]"
)

# Create model with custom vocab size
config = BertConfig(vocab_size=30000)
model = BertModel(config)

# Use together
inputs = fast_tokenizer("Hello world", return_tensors="pt")
outputs = model(**inputs)
```

### Save and load together

```python
# Save both
model.save_pretrained("my-model")
tokenizer.save_pretrained("my-model")

# Directory structure:
# my-model/
# ├── config.json
# ├── pytorch_model.bin
# ├── tokenizer.json
# ├── tokenizer_config.json
# └── special_tokens_map.json

# Load both
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("my-model")
tokenizer = AutoTokenizer.from_pretrained("my-model")
```

## Advanced features

### Multimodal tokenization

```python
from transformers import AutoTokenizer

# LLaVA-style (image + text)
tokenizer = AutoTokenizer.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Add image placeholder token
tokenizer.add_special_tokens({"additional_special_tokens": ["<image>"]})

# Use in prompt
text = "Describe this image: <image>"
inputs = tokenizer(text, return_tensors="pt")
```

### Template formatting

```python
# Chat template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help?"},
    {"role": "user", "content": "What's the weather?"}
]

# Apply chat template (if tokenizer has one)
if hasattr(tokenizer, "apply_chat_template"):
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    inputs = tokenizer(text, return_tensors="pt")
```

### Custom template

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

# Define chat template (Jinja2)
tokenizer.chat_template = """
{%- for message in messages %}
{%- if message['role'] == 'system' %}
System: {{ message['content'] }}\\n
{%- elif message['role'] == 'user' %}
User: {{ message['content'] }}\\n
{%- elif message['role'] == 'assistant' %}
Assistant: {{ message['content'] }}\\n
{%- endif %}
{%- endfor %}
Assistant:
"""

# Use template
text = tokenizer.apply_chat_template(messages, tokenize=False)
```

## Performance optimization

### Batch processing

```python
# Process large datasets efficiently
from datasets import load_dataset

dataset = load_dataset("imdb", split="train[:1000]")

# Tokenize in batches
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512
    )

# Map over dataset (batched)
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    batch_size=1000,
    num_proc=4  # Parallel processing
)
```

### Caching

```python
# Cache downloaded tokenizer files on disk
tokenizer = AutoTokenizer.from_pretrained(
    "bert-base-uncased",
    use_fast=True,
    cache_dir="./cache"  # Cache tokenizer files
)

# Cache tokenization results for repeated inputs
from functools import lru_cache

@lru_cache(maxsize=10000)
def cached_tokenize(text):
    return tuple(tokenizer.encode(text))

# Reuses cached results for repeated inputs
```

### Memory efficiency

```python
# For very large datasets, use streaming
from datasets import load_dataset

dataset = load_dataset("pile", split="train", streaming=True)

def process_batch(batch):
    # Tokenize
    tokens = tokenizer(batch["text"], truncation=True, max_length=512)

    # Process tokens...

    return tokens

# Process in chunks (memory efficient)
for batch in dataset.batch(batch_size=1000):
    processed = process_batch(batch)
```

## Troubleshooting

### Issue: Tokenizer not fast

**Symptom**:
```python
tokenizer.is_fast  # False
```

**Solution**: Install the tokenizers library
```bash
pip install tokenizers
```

### Issue: Special tokens not working

**Symptom**: Special tokens are split into subwords

**Solution**: Add as special tokens, not regular tokens
```python
# Wrong
tokenizer.add_tokens(["<|image|>"])

# Correct
tokenizer.add_special_tokens({"additional_special_tokens": ["<|image|>"]})
```

### Issue: Offset mapping not available

**Symptom**:
```python
tokenizer("text", return_offsets_mapping=True)
# Error: return_offsets_mapping not supported
```

**Solution**: Use a fast tokenizer
```python
from transformers import AutoTokenizer

# Load fast version
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
```

### Issue: Padding inconsistent

**Symptom**: Some sequences padded, others not

**Solution**: Specify a padding strategy explicitly
```python
# Explicit padding
tokenizer(
    texts,
    padding="max_length",  # or "longest"
    max_length=128
)
```

## Best practices

1. **Always use fast tokenizers**:
   - 5-10× faster
   - Full alignment tracking
   - Better batch processing

2. **Save tokenizer with model**:
   - Ensures reproducibility
   - Prevents version mismatches

3. **Use batch processing for datasets**:
   - Tokenize with `.map(batched=True)`
   - Set `num_proc` for parallelism

4. **Enable caching for repeated inputs**:
   - Use `lru_cache` for inference
   - Cache tokenizer files with `cache_dir`

5. **Handle special tokens properly**:
   - Use `add_special_tokens()` for never-split tokens
   - Resize embeddings after adding tokens

6. **Test alignment for downstream tasks**:
   - Verify `offset_mapping` is correct
   - Test `char_to_token()` on samples

7. **Version control tokenizer config**:
   - Save `tokenizer_config.json`
   - Document custom templates
   - Track vocabulary changes