@synsci/cli-darwin-x64 1.1.49

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (373)
  1. package/bin/skills/accelerate/SKILL.md +332 -0
  2. package/bin/skills/accelerate/references/custom-plugins.md +453 -0
  3. package/bin/skills/accelerate/references/megatron-integration.md +489 -0
  4. package/bin/skills/accelerate/references/performance.md +525 -0
  5. package/bin/skills/audiocraft/SKILL.md +564 -0
  6. package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
  7. package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
  8. package/bin/skills/autogpt/SKILL.md +403 -0
  9. package/bin/skills/autogpt/references/advanced-usage.md +535 -0
  10. package/bin/skills/autogpt/references/troubleshooting.md +420 -0
  11. package/bin/skills/awq/SKILL.md +310 -0
  12. package/bin/skills/awq/references/advanced-usage.md +324 -0
  13. package/bin/skills/awq/references/troubleshooting.md +344 -0
  14. package/bin/skills/axolotl/SKILL.md +158 -0
  15. package/bin/skills/axolotl/references/api.md +5548 -0
  16. package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
  17. package/bin/skills/axolotl/references/index.md +15 -0
  18. package/bin/skills/axolotl/references/other.md +3563 -0
  19. package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
  20. package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
  21. package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
  22. package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
  23. package/bin/skills/bitsandbytes/SKILL.md +411 -0
  24. package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
  25. package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
  26. package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
  27. package/bin/skills/blip-2/SKILL.md +564 -0
  28. package/bin/skills/blip-2/references/advanced-usage.md +680 -0
  29. package/bin/skills/blip-2/references/troubleshooting.md +526 -0
  30. package/bin/skills/chroma/SKILL.md +406 -0
  31. package/bin/skills/chroma/references/integration.md +38 -0
  32. package/bin/skills/clip/SKILL.md +253 -0
  33. package/bin/skills/clip/references/applications.md +207 -0
  34. package/bin/skills/constitutional-ai/SKILL.md +290 -0
  35. package/bin/skills/crewai/SKILL.md +498 -0
  36. package/bin/skills/crewai/references/flows.md +438 -0
  37. package/bin/skills/crewai/references/tools.md +429 -0
  38. package/bin/skills/crewai/references/troubleshooting.md +480 -0
  39. package/bin/skills/deepspeed/SKILL.md +141 -0
  40. package/bin/skills/deepspeed/references/08.md +17 -0
  41. package/bin/skills/deepspeed/references/09.md +173 -0
  42. package/bin/skills/deepspeed/references/2020.md +378 -0
  43. package/bin/skills/deepspeed/references/2023.md +279 -0
  44. package/bin/skills/deepspeed/references/assets.md +179 -0
  45. package/bin/skills/deepspeed/references/index.md +35 -0
  46. package/bin/skills/deepspeed/references/mii.md +118 -0
  47. package/bin/skills/deepspeed/references/other.md +1191 -0
  48. package/bin/skills/deepspeed/references/tutorials.md +6554 -0
  49. package/bin/skills/dspy/SKILL.md +590 -0
  50. package/bin/skills/dspy/references/examples.md +663 -0
  51. package/bin/skills/dspy/references/modules.md +475 -0
  52. package/bin/skills/dspy/references/optimizers.md +566 -0
  53. package/bin/skills/faiss/SKILL.md +221 -0
  54. package/bin/skills/faiss/references/index_types.md +280 -0
  55. package/bin/skills/flash-attention/SKILL.md +367 -0
  56. package/bin/skills/flash-attention/references/benchmarks.md +215 -0
  57. package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
  58. package/bin/skills/gguf/SKILL.md +427 -0
  59. package/bin/skills/gguf/references/advanced-usage.md +504 -0
  60. package/bin/skills/gguf/references/troubleshooting.md +442 -0
  61. package/bin/skills/gptq/SKILL.md +450 -0
  62. package/bin/skills/gptq/references/calibration.md +337 -0
  63. package/bin/skills/gptq/references/integration.md +129 -0
  64. package/bin/skills/gptq/references/troubleshooting.md +95 -0
  65. package/bin/skills/grpo-rl-training/README.md +97 -0
  66. package/bin/skills/grpo-rl-training/SKILL.md +572 -0
  67. package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
  68. package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
  69. package/bin/skills/guidance/SKILL.md +572 -0
  70. package/bin/skills/guidance/references/backends.md +554 -0
  71. package/bin/skills/guidance/references/constraints.md +674 -0
  72. package/bin/skills/guidance/references/examples.md +767 -0
  73. package/bin/skills/hqq/SKILL.md +445 -0
  74. package/bin/skills/hqq/references/advanced-usage.md +528 -0
  75. package/bin/skills/hqq/references/troubleshooting.md +503 -0
  76. package/bin/skills/hugging-face-cli/SKILL.md +191 -0
  77. package/bin/skills/hugging-face-cli/references/commands.md +954 -0
  78. package/bin/skills/hugging-face-cli/references/examples.md +374 -0
  79. package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
  80. package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
  81. package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
  82. package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
  83. package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
  84. package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
  85. package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
  86. package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
  87. package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
  88. package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
  89. package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
  90. package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
  91. package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
  92. package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
  93. package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
  94. package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
  95. package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
  96. package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
  97. package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
  98. package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
  99. package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
  100. package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
  101. package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
  102. package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
  103. package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
  104. package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
  105. package/bin/skills/hugging-face-jobs/index.html +216 -0
  106. package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
  107. package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
  108. package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
  109. package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
  110. package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
  111. package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
  112. package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
  113. package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
  114. package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
  115. package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
  116. package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
  117. package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
  118. package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
  119. package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
  120. package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
  121. package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
  122. package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
  123. package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
  124. package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
  125. package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
  126. package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
  127. package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
  128. package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
  129. package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
  130. package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
  131. package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
  132. package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
  133. package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
  134. package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
  135. package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
  136. package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
  137. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
  138. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
  139. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
  140. package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
  141. package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
  142. package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
  143. package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
  144. package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
  145. package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
  146. package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
  147. package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
  148. package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
  149. package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
  150. package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
  151. package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
  152. package/bin/skills/instructor/SKILL.md +740 -0
  153. package/bin/skills/instructor/references/examples.md +107 -0
  154. package/bin/skills/instructor/references/providers.md +70 -0
  155. package/bin/skills/instructor/references/validation.md +606 -0
  156. package/bin/skills/knowledge-distillation/SKILL.md +458 -0
  157. package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
  158. package/bin/skills/lambda-labs/SKILL.md +545 -0
  159. package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
  160. package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
  161. package/bin/skills/langchain/SKILL.md +480 -0
  162. package/bin/skills/langchain/references/agents.md +499 -0
  163. package/bin/skills/langchain/references/integration.md +562 -0
  164. package/bin/skills/langchain/references/rag.md +600 -0
  165. package/bin/skills/langsmith/SKILL.md +422 -0
  166. package/bin/skills/langsmith/references/advanced-usage.md +548 -0
  167. package/bin/skills/langsmith/references/troubleshooting.md +537 -0
  168. package/bin/skills/litgpt/SKILL.md +469 -0
  169. package/bin/skills/litgpt/references/custom-models.md +568 -0
  170. package/bin/skills/litgpt/references/distributed-training.md +451 -0
  171. package/bin/skills/litgpt/references/supported-models.md +336 -0
  172. package/bin/skills/litgpt/references/training-recipes.md +619 -0
  173. package/bin/skills/llama-cpp/SKILL.md +258 -0
  174. package/bin/skills/llama-cpp/references/optimization.md +89 -0
  175. package/bin/skills/llama-cpp/references/quantization.md +213 -0
  176. package/bin/skills/llama-cpp/references/server.md +125 -0
  177. package/bin/skills/llama-factory/SKILL.md +80 -0
  178. package/bin/skills/llama-factory/references/_images.md +23 -0
  179. package/bin/skills/llama-factory/references/advanced.md +1055 -0
  180. package/bin/skills/llama-factory/references/getting_started.md +349 -0
  181. package/bin/skills/llama-factory/references/index.md +19 -0
  182. package/bin/skills/llama-factory/references/other.md +31 -0
  183. package/bin/skills/llamaguard/SKILL.md +337 -0
  184. package/bin/skills/llamaindex/SKILL.md +569 -0
  185. package/bin/skills/llamaindex/references/agents.md +83 -0
  186. package/bin/skills/llamaindex/references/data_connectors.md +108 -0
  187. package/bin/skills/llamaindex/references/query_engines.md +406 -0
  188. package/bin/skills/llava/SKILL.md +304 -0
  189. package/bin/skills/llava/references/training.md +197 -0
  190. package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
  191. package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
  192. package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
  193. package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
  194. package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
  195. package/bin/skills/long-context/SKILL.md +536 -0
  196. package/bin/skills/long-context/references/extension_methods.md +468 -0
  197. package/bin/skills/long-context/references/fine_tuning.md +611 -0
  198. package/bin/skills/long-context/references/rope.md +402 -0
  199. package/bin/skills/mamba/SKILL.md +260 -0
  200. package/bin/skills/mamba/references/architecture-details.md +206 -0
  201. package/bin/skills/mamba/references/benchmarks.md +255 -0
  202. package/bin/skills/mamba/references/training-guide.md +388 -0
  203. package/bin/skills/megatron-core/SKILL.md +366 -0
  204. package/bin/skills/megatron-core/references/benchmarks.md +249 -0
  205. package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
  206. package/bin/skills/megatron-core/references/production-examples.md +473 -0
  207. package/bin/skills/megatron-core/references/training-recipes.md +547 -0
  208. package/bin/skills/miles/SKILL.md +315 -0
  209. package/bin/skills/miles/references/api-reference.md +141 -0
  210. package/bin/skills/miles/references/troubleshooting.md +352 -0
  211. package/bin/skills/mlflow/SKILL.md +704 -0
  212. package/bin/skills/mlflow/references/deployment.md +744 -0
  213. package/bin/skills/mlflow/references/model-registry.md +770 -0
  214. package/bin/skills/mlflow/references/tracking.md +680 -0
  215. package/bin/skills/modal/SKILL.md +341 -0
  216. package/bin/skills/modal/references/advanced-usage.md +503 -0
  217. package/bin/skills/modal/references/troubleshooting.md +494 -0
  218. package/bin/skills/model-merging/SKILL.md +539 -0
  219. package/bin/skills/model-merging/references/evaluation.md +462 -0
  220. package/bin/skills/model-merging/references/examples.md +428 -0
  221. package/bin/skills/model-merging/references/methods.md +352 -0
  222. package/bin/skills/model-pruning/SKILL.md +495 -0
  223. package/bin/skills/model-pruning/references/wanda.md +347 -0
  224. package/bin/skills/moe-training/SKILL.md +526 -0
  225. package/bin/skills/moe-training/references/architectures.md +432 -0
  226. package/bin/skills/moe-training/references/inference.md +348 -0
  227. package/bin/skills/moe-training/references/training.md +425 -0
  228. package/bin/skills/nanogpt/SKILL.md +290 -0
  229. package/bin/skills/nanogpt/references/architecture.md +382 -0
  230. package/bin/skills/nanogpt/references/data.md +476 -0
  231. package/bin/skills/nanogpt/references/training.md +564 -0
  232. package/bin/skills/nemo-curator/SKILL.md +383 -0
  233. package/bin/skills/nemo-curator/references/deduplication.md +87 -0
  234. package/bin/skills/nemo-curator/references/filtering.md +102 -0
  235. package/bin/skills/nemo-evaluator/SKILL.md +494 -0
  236. package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
  237. package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
  238. package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
  239. package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
  240. package/bin/skills/nemo-guardrails/SKILL.md +297 -0
  241. package/bin/skills/nnsight/SKILL.md +436 -0
  242. package/bin/skills/nnsight/references/README.md +78 -0
  243. package/bin/skills/nnsight/references/api.md +344 -0
  244. package/bin/skills/nnsight/references/tutorials.md +300 -0
  245. package/bin/skills/openrlhf/SKILL.md +249 -0
  246. package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
  247. package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
  248. package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
  249. package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
  250. package/bin/skills/outlines/SKILL.md +652 -0
  251. package/bin/skills/outlines/references/backends.md +615 -0
  252. package/bin/skills/outlines/references/examples.md +773 -0
  253. package/bin/skills/outlines/references/json_generation.md +652 -0
  254. package/bin/skills/peft/SKILL.md +431 -0
  255. package/bin/skills/peft/references/advanced-usage.md +514 -0
  256. package/bin/skills/peft/references/troubleshooting.md +480 -0
  257. package/bin/skills/phoenix/SKILL.md +475 -0
  258. package/bin/skills/phoenix/references/advanced-usage.md +619 -0
  259. package/bin/skills/phoenix/references/troubleshooting.md +538 -0
  260. package/bin/skills/pinecone/SKILL.md +358 -0
  261. package/bin/skills/pinecone/references/deployment.md +181 -0
  262. package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
  263. package/bin/skills/pytorch-fsdp/references/index.md +7 -0
  264. package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
  265. package/bin/skills/pytorch-lightning/SKILL.md +346 -0
  266. package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
  267. package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
  268. package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
  269. package/bin/skills/pyvene/SKILL.md +473 -0
  270. package/bin/skills/pyvene/references/README.md +73 -0
  271. package/bin/skills/pyvene/references/api.md +383 -0
  272. package/bin/skills/pyvene/references/tutorials.md +376 -0
  273. package/bin/skills/qdrant/SKILL.md +493 -0
  274. package/bin/skills/qdrant/references/advanced-usage.md +648 -0
  275. package/bin/skills/qdrant/references/troubleshooting.md +631 -0
  276. package/bin/skills/ray-data/SKILL.md +326 -0
  277. package/bin/skills/ray-data/references/integration.md +82 -0
  278. package/bin/skills/ray-data/references/transformations.md +83 -0
  279. package/bin/skills/ray-train/SKILL.md +406 -0
  280. package/bin/skills/ray-train/references/multi-node.md +628 -0
  281. package/bin/skills/rwkv/SKILL.md +260 -0
  282. package/bin/skills/rwkv/references/architecture-details.md +344 -0
  283. package/bin/skills/rwkv/references/rwkv7.md +386 -0
  284. package/bin/skills/rwkv/references/state-management.md +369 -0
  285. package/bin/skills/saelens/SKILL.md +386 -0
  286. package/bin/skills/saelens/references/README.md +70 -0
  287. package/bin/skills/saelens/references/api.md +333 -0
  288. package/bin/skills/saelens/references/tutorials.md +318 -0
  289. package/bin/skills/segment-anything/SKILL.md +500 -0
  290. package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
  291. package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
  292. package/bin/skills/sentence-transformers/SKILL.md +255 -0
  293. package/bin/skills/sentence-transformers/references/models.md +123 -0
  294. package/bin/skills/sentencepiece/SKILL.md +235 -0
  295. package/bin/skills/sentencepiece/references/algorithms.md +200 -0
  296. package/bin/skills/sentencepiece/references/training.md +304 -0
  297. package/bin/skills/sglang/SKILL.md +442 -0
  298. package/bin/skills/sglang/references/deployment.md +490 -0
  299. package/bin/skills/sglang/references/radix-attention.md +413 -0
  300. package/bin/skills/sglang/references/structured-generation.md +541 -0
  301. package/bin/skills/simpo/SKILL.md +219 -0
  302. package/bin/skills/simpo/references/datasets.md +478 -0
  303. package/bin/skills/simpo/references/hyperparameters.md +452 -0
  304. package/bin/skills/simpo/references/loss-functions.md +350 -0
  305. package/bin/skills/skypilot/SKILL.md +509 -0
  306. package/bin/skills/skypilot/references/advanced-usage.md +491 -0
  307. package/bin/skills/skypilot/references/troubleshooting.md +570 -0
  308. package/bin/skills/slime/SKILL.md +464 -0
  309. package/bin/skills/slime/references/api-reference.md +392 -0
  310. package/bin/skills/slime/references/troubleshooting.md +386 -0
  311. package/bin/skills/speculative-decoding/SKILL.md +467 -0
  312. package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
  313. package/bin/skills/speculative-decoding/references/medusa.md +350 -0
  314. package/bin/skills/stable-diffusion/SKILL.md +519 -0
  315. package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
  316. package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
  317. package/bin/skills/tensorboard/SKILL.md +629 -0
  318. package/bin/skills/tensorboard/references/integrations.md +638 -0
  319. package/bin/skills/tensorboard/references/profiling.md +545 -0
  320. package/bin/skills/tensorboard/references/visualization.md +620 -0
  321. package/bin/skills/tensorrt-llm/SKILL.md +187 -0
  322. package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
  323. package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
  324. package/bin/skills/tensorrt-llm/references/serving.md +470 -0
  325. package/bin/skills/tinker/SKILL.md +362 -0
  326. package/bin/skills/tinker/references/api-reference.md +168 -0
  327. package/bin/skills/tinker/references/getting-started.md +157 -0
  328. package/bin/skills/tinker/references/loss-functions.md +163 -0
  329. package/bin/skills/tinker/references/models-and-lora.md +139 -0
  330. package/bin/skills/tinker/references/recipes.md +280 -0
  331. package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
  332. package/bin/skills/tinker/references/rendering.md +243 -0
  333. package/bin/skills/tinker/references/supervised-learning.md +232 -0
  334. package/bin/skills/tinker-training-cost/SKILL.md +187 -0
  335. package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
  336. package/bin/skills/torchforge/SKILL.md +433 -0
  337. package/bin/skills/torchforge/references/api-reference.md +327 -0
  338. package/bin/skills/torchforge/references/troubleshooting.md +409 -0
  339. package/bin/skills/torchtitan/SKILL.md +358 -0
  340. package/bin/skills/torchtitan/references/checkpoint.md +181 -0
  341. package/bin/skills/torchtitan/references/custom-models.md +258 -0
  342. package/bin/skills/torchtitan/references/float8.md +133 -0
  343. package/bin/skills/torchtitan/references/fsdp.md +126 -0
  344. package/bin/skills/transformer-lens/SKILL.md +346 -0
  345. package/bin/skills/transformer-lens/references/README.md +54 -0
  346. package/bin/skills/transformer-lens/references/api.md +362 -0
  347. package/bin/skills/transformer-lens/references/tutorials.md +339 -0
  348. package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
  349. package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
  350. package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
  351. package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
  352. package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
  353. package/bin/skills/unsloth/SKILL.md +80 -0
  354. package/bin/skills/unsloth/references/index.md +7 -0
  355. package/bin/skills/unsloth/references/llms-full.md +16799 -0
  356. package/bin/skills/unsloth/references/llms-txt.md +12044 -0
  357. package/bin/skills/unsloth/references/llms.md +82 -0
  358. package/bin/skills/verl/SKILL.md +391 -0
  359. package/bin/skills/verl/references/api-reference.md +301 -0
  360. package/bin/skills/verl/references/troubleshooting.md +391 -0
  361. package/bin/skills/vllm/SKILL.md +364 -0
  362. package/bin/skills/vllm/references/optimization.md +226 -0
  363. package/bin/skills/vllm/references/quantization.md +284 -0
  364. package/bin/skills/vllm/references/server-deployment.md +255 -0
  365. package/bin/skills/vllm/references/troubleshooting.md +447 -0
  366. package/bin/skills/weights-and-biases/SKILL.md +590 -0
  367. package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
  368. package/bin/skills/weights-and-biases/references/integrations.md +700 -0
  369. package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
  370. package/bin/skills/whisper/SKILL.md +317 -0
  371. package/bin/skills/whisper/references/languages.md +189 -0
  372. package/bin/synsc +0 -0
  373. package/package.json +10 -0
@@ -0,0 +1,656 @@
+ ---
+ name: hugging-face-evaluation
+ description: Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.
+ version: 1.0.0
+ author: Synthetic Sciences
+ license: MIT
+ tags: [Hugging Face, Evaluation, Benchmarking, Metrics]
+ dependencies: [huggingface-hub, transformers]
+ ---
+
+ # Overview
+ This skill provides tools to add structured evaluation results to Hugging Face model cards. It supports multiple methods for adding evaluation data:
+ - Extracting existing evaluation tables from README content
+ - Importing benchmark scores from Artificial Analysis
+ - Running custom model evaluations with vLLM or accelerate backends (lighteval/inspect-ai)
+
+ ## Integration with HF Ecosystem
+ - **Model Cards**: Updates model-index metadata for leaderboard integration
+ - **Artificial Analysis**: Direct API integration for benchmark imports
+ - **Papers with Code**: Compatible with their model-index specification
+ - **Jobs**: Run evaluations directly on Hugging Face Jobs with `uv` integration
+ - **vLLM**: Efficient GPU inference for custom model evaluation
+ - **lighteval**: HuggingFace's evaluation library with vLLM/accelerate backends
+ - **inspect-ai**: UK AI Safety Institute's evaluation framework
+
+ # Version
+ 1.3.0
+
+ # Dependencies
+
+ ## Core Dependencies
+ - huggingface_hub>=0.26.0
+ - markdown-it-py>=3.0.0
+ - python-dotenv>=1.2.1
+ - pyyaml>=6.0.3
+ - requests>=2.32.5
+ - re (built-in)
+
+ ## Inference Provider Evaluation
+ - inspect-ai>=0.3.0
+ - inspect-evals
+ - openai
+
+ ## vLLM Custom Model Evaluation (GPU required)
+ - lighteval[accelerate,vllm]>=0.6.0
+ - vllm>=0.4.0
+ - torch>=2.0.0
+ - transformers>=4.40.0
+ - accelerate>=0.30.0
+
+ Note: vLLM dependencies are installed automatically via PEP 723 script headers when using `uv run`.
+
+ # IMPORTANT: Using This Skill
+
+ ## ⚠️ CRITICAL: Check for Existing PRs Before Creating New Ones
+
+ **Before creating ANY pull request with `--create-pr`, you MUST check for existing open PRs:**
+
+ ```bash
+ uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
+ ```
+
+ **If open PRs exist:**
+ 1. **DO NOT create a new PR** - this creates duplicate work for maintainers
+ 2. **Warn the user** that open PRs already exist
+ 3. **Show the user** the existing PR URLs so they can review them
+ 4. Only proceed if the user explicitly confirms they want to create another PR
+
+ This prevents spamming model repositories with duplicate evaluation PRs.
+
+ ---
+
+ > **All paths are relative to the directory containing this SKILL.md file.**
+ > Before running any script, first `cd` to that directory or use the full path.
+
+ **Use `--help` for the latest workflow guidance.** Works with plain Python or `uv run`:
+ ```bash
+ uv run scripts/evaluation_manager.py --help
+ uv run scripts/evaluation_manager.py inspect-tables --help
+ uv run scripts/evaluation_manager.py extract-readme --help
+ ```
+ Key workflow (matches CLI help):
+
+ 1) `get-prs` → check for existing open PRs first
+ 2) `inspect-tables` → find table numbers/columns
+ 3) `extract-readme --table N` → prints YAML by default
+ 4) add `--apply` (push) or `--create-pr` to write changes
+
+ # Core Capabilities
+
+ ## 1. Inspect and Extract Evaluation Tables from README
+ - **Inspect Tables**: Use `inspect-tables` to see all tables in a README with structure, columns, and sample rows
+ - **Parse Markdown Tables**: Accurate parsing using markdown-it-py (ignores code blocks and examples)
+ - **Table Selection**: Use `--table N` to extract from a specific table (required when multiple tables exist)
+ - **Format Detection**: Recognize common formats (benchmarks as rows, columns, or comparison tables with multiple models)
+ - **Column Matching**: Automatically identify model columns/rows; prefer `--model-column-index` (index from inspect output). Use `--model-name-override` only with exact column header text.
+ - **YAML Generation**: Convert selected table to model-index YAML format
+ - **Task Typing**: `--task-type` sets the `task.type` field in model-index output (e.g., `text-generation`, `summarization`)
+
+ ## 2. Import from Artificial Analysis
+ - **API Integration**: Fetch benchmark scores directly from Artificial Analysis
+ - **Automatic Formatting**: Convert API responses to model-index format
+ - **Metadata Preservation**: Maintain source attribution and URLs
+ - **PR Creation**: Automatically create pull requests with evaluation updates
+
+ ## 3. Model-Index Management
+ - **YAML Generation**: Create properly formatted model-index entries
+ - **Merge Support**: Add evaluations to existing model cards without overwriting
+ - **Validation**: Ensure compliance with Papers with Code specification
+ - **Batch Operations**: Process multiple models efficiently
+
+ ## 4. Run Evaluations on HF Jobs (Inference Providers)
+ - **Inspect-AI Integration**: Run standard evaluations using the `inspect-ai` library
+ - **UV Integration**: Seamlessly run Python scripts with ephemeral dependencies on HF infrastructure
+ - **Zero-Config**: No Dockerfiles or Space management required
+ - **Hardware Selection**: Configure CPU or GPU hardware for the evaluation job
+ - **Secure Execution**: Handles API tokens safely via secrets passed through the CLI
+
122
+ ## 5. Run Custom Model Evaluations with vLLM (NEW)
123
+
124
+ ⚠️ **Important:** This approach is only possible on devices with `uv` installed and sufficient GPU memory.
125
+ **Benefits:** No need to use `hf_jobs()` MCP tool, can run scripts directly in terminal
126
+ **When to use:** User working in local device directly when GPU is available
127
+
128
+ ### Before running the script
129
+
130
+ - check the script path
131
+ - check uv is installed
132
+ - check gpu is available with `nvidia-smi`
133
+
134
+ ### Running the script
135
+
136
+ ```bash
137
+ uv run scripts/train_sft_example.py
138
+ ```
139
+ ### Features
140
+
141
+ - **vLLM Backend**: High-performance GPU inference (5-10x faster than standard HF methods)
142
+ - **lighteval Framework**: HuggingFace's evaluation library with Open LLM Leaderboard tasks
143
+ - **inspect-ai Framework**: UK AI Safety Institute's evaluation library
144
+ - **Standalone or Jobs**: Run locally or submit to HF Jobs infrastructure
145
+
146
+ # Usage Instructions
147
+
148
+ The skill includes Python scripts in `scripts/` to perform operations.
149
+
150
+ ### Prerequisites
151
+ - Preferred: use `uv run` (PEP 723 header auto-installs deps)
152
+ - Or install manually: `pip install huggingface-hub markdown-it-py python-dotenv pyyaml requests`
153
+ - Set `HF_TOKEN` environment variable with Write-access token
154
+ - For Artificial Analysis: Set `AA_API_KEY` environment variable
155
+ - `.env` is loaded automatically if `python-dotenv` is installed
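
For reference, a PEP 723 inline-metadata header at the top of a script looks like the following (the dependency list shown here is illustrative); `uv run` reads it and installs the dependencies into an ephemeral environment before executing:

```python
# /// script
# requires-python = ">=3.9"
# dependencies = [
#     "huggingface-hub",
#     "markdown-it-py",
#     "python-dotenv",
#     "pyyaml",
#     "requests",
# ]
# ///

# Script code follows as usual; plain `python` ignores the comment block,
# while `uv run` provisions the listed dependencies first.
```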

### Method 1: Extract from README (CLI workflow)

Recommended flow (matches `--help`):
```bash
# 1) Inspect tables to get table numbers and column hints
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model"

# 2) Extract a specific table (prints YAML by default)
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  [--model-column-index <column index shown by inspect-tables>] \
  [--model-name-override "<column header/model name>"]  # use exact header text if you can't use the index

# 3) Apply changes (push or PR)
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  --apply  # push directly
# or
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  --create-pr  # open a PR
```

Validation checklist:
- YAML is printed by default; compare against the README table before applying.
- Prefer `--model-column-index`; if using `--model-name-override`, the column header text must be exact.
- For transposed tables (models as rows), ensure only one row is extracted.

### Method 2: Import from Artificial Analysis

Fetch benchmark scores from the Artificial Analysis API and add them to a model card.

**Basic Usage:**
```bash
AA_API_KEY="your-api-key" uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name"
```

**With Environment File:**
```bash
# Create .env file
echo "AA_API_KEY=your-api-key" >> .env
echo "HF_TOKEN=your-hf-token" >> .env

# Run import
uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name"
```

**Create Pull Request:**
```bash
uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name" \
  --create-pr
```

### Method 3: Run Evaluation Job

Submit an evaluation job on Hugging Face infrastructure using the `hf jobs uv run` CLI.

**Direct CLI Usage:**
```bash
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \
  --flavor cpu-basic \
  --secret HF_TOKEN=$HF_TOKEN \
  -- --model "meta-llama/Llama-2-7b-hf" \
  --task "mmlu"
```

**GPU Example (A10G):**
```bash
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \
  --flavor a10g-small \
  --secret HF_TOKEN=$HF_TOKEN \
  -- --model "meta-llama/Llama-2-7b-hf" \
  --task "gsm8k"
```

**Python Helper (optional):**
```bash
uv run scripts/run_eval_job.py \
  --model "meta-llama/Llama-2-7b-hf" \
  --task "mmlu" \
  --hardware "t4-small"
```

### Method 4: Run Custom Model Evaluation with vLLM

Evaluate custom HuggingFace models directly on GPU using vLLM or accelerate backends. These scripts are **separate from inference provider scripts** and run models locally on the job's hardware.

#### When to Use vLLM Evaluation (vs Inference Providers)

| Feature | vLLM Scripts | Inference Provider Scripts |
|---------|-------------|---------------------------|
| Model access | Any HF model | Models with API endpoints |
| Hardware | Your GPU (or HF Jobs GPU) | Provider's infrastructure |
| Cost | HF Jobs compute cost | API usage fees |
| Speed | vLLM optimized | Depends on provider |
| Offline | Yes (after download) | No |

#### Option A: lighteval with vLLM Backend

lighteval is HuggingFace's evaluation library, supporting Open LLM Leaderboard tasks.

**Standalone (local GPU):**
```bash
# Run MMLU 5-shot with vLLM
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5"

# Run multiple tasks
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"

# Use accelerate backend instead of vLLM
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5" \
  --backend accelerate

# Chat/instruction-tuned models
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --tasks "leaderboard|mmlu|5" \
  --use-chat-template
```

**Via HF Jobs:**
```bash
hf jobs uv run scripts/lighteval_vllm_uv.py \
  --flavor a10g-small \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5"
```

**lighteval Task Format:**
Tasks use the format `suite|task|num_fewshot`:
- `leaderboard|mmlu|5` - MMLU with 5-shot
- `leaderboard|gsm8k|5` - GSM8K with 5-shot
- `lighteval|hellaswag|0` - HellaSwag zero-shot
- `leaderboard|arc_challenge|25` - ARC-Challenge with 25-shot

**Finding Available Tasks:**
The complete list of available lighteval tasks can be found at:
https://github.com/huggingface/lighteval/blob/main/examples/tasks/all_tasks.txt

This file contains all supported tasks in the format `suite|task|num_fewshot|0` (the trailing `0` is a version flag and can be ignored). Common suites include:
- `leaderboard` - Open LLM Leaderboard tasks (MMLU, GSM8K, ARC, HellaSwag, etc.)
- `lighteval` - Additional lighteval tasks
- `bigbench` - BigBench tasks
- `original` - Original benchmark tasks

To use a task from the list, extract the `suite|task|num_fewshot` portion (without the trailing `0`) and pass it to the `--tasks` parameter. For example:
- From file: `leaderboard|mmlu|0` → Use: `leaderboard|mmlu|0` (or change to `5` for 5-shot)
- From file: `bigbench|abstract_narrative_understanding|0` → Use: `bigbench|abstract_narrative_understanding|0`
- From file: `lighteval|wmt14:hi-en|0` → Use: `lighteval|wmt14:hi-en|0`

Multiple tasks can be specified as comma-separated values: `--tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"`
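
The handling above amounts to a little string parsing; a hedged sketch (function names hypothetical):

```python
def parse_task_spec(spec):
    """Split a lighteval task spec into (suite, task, num_fewshot),
    tolerating the trailing version flag found in all_tasks.txt."""
    parts = spec.split("|")
    if len(parts) == 4:          # e.g. "leaderboard|mmlu|0|0" copied from the file
        parts = parts[:3]
    suite, task, fewshot = parts
    return suite, task, int(fewshot)

def parse_task_list(arg):
    """Handle comma-separated --tasks values."""
    return [parse_task_spec(s) for s in arg.split(",")]
```

Note that task names themselves may contain colons (e.g. `wmt14:hi-en`), so only the `|` separator is significant.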

#### Option B: inspect-ai with vLLM Backend

inspect-ai is the UK AI Safety Institute's evaluation framework.

**Standalone (local GPU):**
```bash
# Run MMLU with vLLM
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task mmlu

# Use HuggingFace Transformers backend
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task mmlu \
  --backend hf

# Multi-GPU with tensor parallelism
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-70B \
  --task mmlu \
  --tensor-parallel-size 4
```

**Via HF Jobs:**
```bash
hf jobs uv run scripts/inspect_vllm_uv.py \
  --flavor a10g-small \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model meta-llama/Llama-3.2-1B \
  --task mmlu
```

**Available inspect-ai Tasks:**
- `mmlu` - Massive Multitask Language Understanding
- `gsm8k` - Grade School Math
- `hellaswag` - Common sense reasoning
- `arc_challenge` - AI2 Reasoning Challenge
- `truthfulqa` - TruthfulQA benchmark
- `winogrande` - Winograd Schema Challenge
- `humaneval` - Code generation

#### Option C: Python Helper Script

The helper script auto-selects hardware and simplifies job submission:

```bash
# Auto-detect hardware based on model size
uv run scripts/run_vllm_eval_job.py \
  --model meta-llama/Llama-3.2-1B \
  --task "leaderboard|mmlu|5" \
  --framework lighteval

# Explicit hardware selection
uv run scripts/run_vllm_eval_job.py \
  --model meta-llama/Llama-3.2-70B \
  --task mmlu \
  --framework inspect \
  --hardware a100-large \
  --tensor-parallel-size 4

# Use HF Transformers backend
uv run scripts/run_vllm_eval_job.py \
  --model microsoft/phi-2 \
  --task mmlu \
  --framework inspect \
  --backend hf
```

**Hardware Recommendations:**
| Model Size | Recommended Hardware |
|------------|---------------------|
| < 3B params | `t4-small` |
| 3B - 13B | `a10g-small` |
| 13B - 34B | `a10g-large` |
| 34B+ | `a100-large` |
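
The table maps directly to a threshold lookup; a sketch of the kind of auto-detection the helper performs (thresholds taken from the table above, function name hypothetical):

```python
def recommend_hardware(param_count_billions):
    """Pick an HF Jobs flavor from the model's parameter count (in billions)."""
    if param_count_billions < 3:
        return "t4-small"
    if param_count_billions < 13:
        return "a10g-small"
    if param_count_billions < 34:
        return "a10g-large"
    return "a100-large"
```

These are rough sizing heuristics for full-precision inference; quantized models may fit on smaller flavors.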

### Commands Reference

**Top-level help and version:**
```bash
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py --version
```

**Inspect Tables (start here):**
```bash
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model-name"
```

**Extract from README:**
```bash
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model-name" \
  --table N \
  [--model-column-index N] \
  [--model-name-override "Exact Column Header or Model Name"] \
  [--task-type "text-generation"] \
  [--dataset-name "Custom Benchmarks"] \
  [--apply | --create-pr]
```

**Import from Artificial Analysis:**
```bash
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "creator-name" \
  --model-name "model-slug" \
  --repo-id "username/model-name" \
  [--create-pr]
```

**View / Validate:**
```bash
uv run scripts/evaluation_manager.py show --repo-id "username/model-name"
uv run scripts/evaluation_manager.py validate --repo-id "username/model-name"
```

**Check Open PRs (ALWAYS run before --create-pr):**
```bash
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
```
Lists all open pull requests for the model repository. Shows PR number, title, author, date, and URL.

**Run Evaluation Job (Inference Providers):**
```bash
hf jobs uv run scripts/inspect_eval_uv.py \
  --flavor "cpu-basic|t4-small|..." \
  --secret HF_TOKEN=$HF_TOKEN \
  -- --model "model-id" \
  --task "task-name"
```

or use the Python helper:

```bash
uv run scripts/run_eval_job.py \
  --model "model-id" \
  --task "task-name" \
  --hardware "cpu-basic|t4-small|..."
```

**Run vLLM Evaluation (Custom Models):**
```bash
# lighteval with vLLM
hf jobs uv run scripts/lighteval_vllm_uv.py \
  --flavor "a10g-small" \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model "model-id" \
  --tasks "leaderboard|mmlu|5"

# inspect-ai with vLLM
hf jobs uv run scripts/inspect_vllm_uv.py \
  --flavor "a10g-small" \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model "model-id" \
  --task "mmlu"

# Helper script (auto hardware selection)
uv run scripts/run_vllm_eval_job.py \
  --model "model-id" \
  --task "leaderboard|mmlu|5" \
  --framework lighteval
```

### Model-Index Format

The generated model-index follows this structure:

```yaml
model-index:
- name: Model Name
  results:
  - task:
      type: text-generation
    dataset:
      name: Benchmark Dataset
      type: benchmark_type
    metrics:
    - name: MMLU
      type: mmlu
      value: 85.2
    - name: HumanEval
      type: humaneval
      value: 72.5
    source:
      name: Source Name
      url: https://source-url.com
```

WARNING: Do not use markdown formatting in the model name; use the exact name from the table. Put URLs only in the `source.url` field.
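
An entry in that shape can be assembled from (name, type, value) metric triples; a minimal sketch (helper name hypothetical, field layout taken from the example above):

```python
def build_model_index(model_name, task_type, dataset_name, dataset_type,
                      metrics, source_name, source_url):
    """Assemble a model-index entry from metric triples (name, type, value)."""
    return {
        "model-index": [{
            "name": model_name,  # plain text, no markdown
            "results": [{
                "task": {"type": task_type},
                "dataset": {"name": dataset_name, "type": dataset_type},
                "metrics": [
                    {"name": n, "type": t, "value": v} for n, t, v in metrics
                ],
                "source": {"name": source_name, "url": source_url},
            }],
        }],
    }
```

Serializing the returned dict with a YAML library (e.g. PyYAML's `safe_dump`) yields the structure shown above.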

### Error Handling
- **Table Not Found**: Script will report if no evaluation tables are detected
- **Invalid Format**: Clear error messages for malformed tables
- **API Errors**: Retry logic for transient Artificial Analysis API failures
- **Token Issues**: Validation before attempting updates
- **Merge Conflicts**: Preserves existing model-index entries when adding new ones
- **Space Creation**: Handles naming conflicts and hardware request failures gracefully

### Best Practices

1. **Check for existing PRs first**: Run `get-prs` before creating any new PR to avoid duplicates
2. **Always start with `inspect-tables`**: See table structure and get the correct extraction command
3. **Use `--help` for guidance**: Run `inspect-tables --help` to see the complete workflow
4. **Preview first**: Default behavior prints YAML; review it before using `--apply` or `--create-pr`
5. **Verify extracted values**: Compare YAML output against the README table manually
6. **Use `--table N` for multi-table READMEs**: Required when multiple evaluation tables exist
7. **Use `--model-name-override` for comparison tables**: Copy the exact column header from `inspect-tables` output
8. **Create PRs for others**: Use `--create-pr` when updating models you don't own
9. **One model per repo**: Only add the main model's results to the model-index
10. **No markdown in YAML names**: The model name field in YAML should be plain text

### Model Name Matching

When extracting evaluation tables with multiple models (either as columns or rows), the script uses **exact normalized token matching**:

- Removes markdown formatting (bold `**`, links `[]()`)
- Normalizes names (lowercase, replace `-` and `_` with spaces)
- Compares token sets: `"OLMo-3-32B"` → `{"olmo", "3", "32b"}` matches `"**Olmo 3 32B**"` or `"[Olmo-3-32B](...)"`
- Only extracts if tokens match exactly (handles different word orders and separators)
- Fails if no exact match is found (rather than guessing from similar names)

**For column-based tables** (benchmarks as rows, models as columns):
- Finds the column header matching the model name
- Extracts scores from that column only

**For transposed tables** (models as rows, benchmarks as columns):
- Finds the row in the first column matching the model name
- Extracts all benchmark scores from that row only

This ensures only the correct model's scores are extracted, never unrelated models or training checkpoints.
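
The normalization described above can be sketched in a few lines (helper names hypothetical; the real script's rules may differ in detail):

```python
import re

def name_tokens(raw):
    """Reduce a model name or column header to a comparable token set."""
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", raw)  # unwrap markdown links
    text = text.replace("**", "")                        # drop bold markers
    text = re.sub(r"[-_]", " ", text.lower())            # unify separators
    return frozenset(text.split())

def find_matching_column(headers, model_name):
    """Return the index of the header whose tokens match exactly, else None."""
    target = name_tokens(model_name)
    for i, header in enumerate(headers):
        if name_tokens(header) == target:
            return i
    return None
```

Returning `None` instead of the closest match is deliberate: failing loudly beats silently extracting a different model's scores.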

### Common Patterns

**Update Your Own Model:**
```bash
# Extract from README and push directly
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "your-username/your-model" \
  --task-type "text-generation" \
  --apply
```

**Update Someone Else's Model (Full Workflow):**
```bash
# Step 1: ALWAYS check for existing PRs first
uv run scripts/evaluation_manager.py get-prs \
  --repo-id "other-username/their-model"

# Step 2: If NO open PRs exist, proceed with creating one
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "other-username/their-model" \
  --create-pr

# If open PRs DO exist:
# - Warn the user about existing PRs
# - Show them the PR URLs
# - Do NOT create a new PR unless user explicitly confirms
```

**Import Fresh Benchmarks:**
```bash
# Step 1: Check for existing PRs
uv run scripts/evaluation_manager.py get-prs \
  --repo-id "anthropic/claude-sonnet-4"

# Step 2: If no PRs, import from Artificial Analysis
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "anthropic/claude-sonnet-4" \
  --create-pr
```

### Troubleshooting

**Issue**: "No evaluation tables found in README"
- **Solution**: Check if the README contains markdown tables with numeric scores

**Issue**: "Could not find model 'X' in transposed table"
- **Solution**: The script will display available models. Use `--model-name-override` with the exact name from the list
- **Example**: `--model-name-override "**Olmo 3-32B**"`

**Issue**: "AA_API_KEY not set"
- **Solution**: Set the environment variable or add it to a `.env` file

**Issue**: "Token does not have write access"
- **Solution**: Ensure HF_TOKEN has write permissions for the repository

**Issue**: "Model not found in Artificial Analysis"
- **Solution**: Verify that the creator-slug and model-name match the API values

**Issue**: "Payment required for hardware"
- **Solution**: Add a payment method to your Hugging Face account to use non-CPU hardware

**Issue**: "vLLM out of memory" or CUDA OOM
- **Solution**: Use a larger hardware flavor, reduce `--gpu-memory-utilization`, or use `--tensor-parallel-size` for multi-GPU

**Issue**: "Model architecture not supported by vLLM"
- **Solution**: Use `--backend hf` (inspect-ai) or `--backend accelerate` (lighteval) for HuggingFace Transformers

**Issue**: "Trust remote code required"
- **Solution**: Add the `--trust-remote-code` flag for models with custom code (e.g., Phi-2, Qwen)

**Issue**: "Chat template not found"
- **Solution**: Only use `--use-chat-template` for instruction-tuned models that include a chat template

### Integration Examples

**Python Script Integration:**
```python
import subprocess

def update_model_evaluations(repo_id):
    """Update a model card with evaluations extracted from its README."""
    result = subprocess.run([
        "uv", "run", "scripts/evaluation_manager.py",
        "extract-readme",
        "--repo-id", repo_id,
        "--create-pr",
    ], capture_output=True, text=True)

    if result.returncode == 0:
        print(f"Successfully updated {repo_id}")
    else:
        print(f"Error: {result.stderr}")
```