@synsci/cli-darwin-x64 1.1.49

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (373)
  1. package/bin/skills/accelerate/SKILL.md +332 -0
  2. package/bin/skills/accelerate/references/custom-plugins.md +453 -0
  3. package/bin/skills/accelerate/references/megatron-integration.md +489 -0
  4. package/bin/skills/accelerate/references/performance.md +525 -0
  5. package/bin/skills/audiocraft/SKILL.md +564 -0
  6. package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
  7. package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
  8. package/bin/skills/autogpt/SKILL.md +403 -0
  9. package/bin/skills/autogpt/references/advanced-usage.md +535 -0
  10. package/bin/skills/autogpt/references/troubleshooting.md +420 -0
  11. package/bin/skills/awq/SKILL.md +310 -0
  12. package/bin/skills/awq/references/advanced-usage.md +324 -0
  13. package/bin/skills/awq/references/troubleshooting.md +344 -0
  14. package/bin/skills/axolotl/SKILL.md +158 -0
  15. package/bin/skills/axolotl/references/api.md +5548 -0
  16. package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
  17. package/bin/skills/axolotl/references/index.md +15 -0
  18. package/bin/skills/axolotl/references/other.md +3563 -0
  19. package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
  20. package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
  21. package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
  22. package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
  23. package/bin/skills/bitsandbytes/SKILL.md +411 -0
  24. package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
  25. package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
  26. package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
  27. package/bin/skills/blip-2/SKILL.md +564 -0
  28. package/bin/skills/blip-2/references/advanced-usage.md +680 -0
  29. package/bin/skills/blip-2/references/troubleshooting.md +526 -0
  30. package/bin/skills/chroma/SKILL.md +406 -0
  31. package/bin/skills/chroma/references/integration.md +38 -0
  32. package/bin/skills/clip/SKILL.md +253 -0
  33. package/bin/skills/clip/references/applications.md +207 -0
  34. package/bin/skills/constitutional-ai/SKILL.md +290 -0
  35. package/bin/skills/crewai/SKILL.md +498 -0
  36. package/bin/skills/crewai/references/flows.md +438 -0
  37. package/bin/skills/crewai/references/tools.md +429 -0
  38. package/bin/skills/crewai/references/troubleshooting.md +480 -0
  39. package/bin/skills/deepspeed/SKILL.md +141 -0
  40. package/bin/skills/deepspeed/references/08.md +17 -0
  41. package/bin/skills/deepspeed/references/09.md +173 -0
  42. package/bin/skills/deepspeed/references/2020.md +378 -0
  43. package/bin/skills/deepspeed/references/2023.md +279 -0
  44. package/bin/skills/deepspeed/references/assets.md +179 -0
  45. package/bin/skills/deepspeed/references/index.md +35 -0
  46. package/bin/skills/deepspeed/references/mii.md +118 -0
  47. package/bin/skills/deepspeed/references/other.md +1191 -0
  48. package/bin/skills/deepspeed/references/tutorials.md +6554 -0
  49. package/bin/skills/dspy/SKILL.md +590 -0
  50. package/bin/skills/dspy/references/examples.md +663 -0
  51. package/bin/skills/dspy/references/modules.md +475 -0
  52. package/bin/skills/dspy/references/optimizers.md +566 -0
  53. package/bin/skills/faiss/SKILL.md +221 -0
  54. package/bin/skills/faiss/references/index_types.md +280 -0
  55. package/bin/skills/flash-attention/SKILL.md +367 -0
  56. package/bin/skills/flash-attention/references/benchmarks.md +215 -0
  57. package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
  58. package/bin/skills/gguf/SKILL.md +427 -0
  59. package/bin/skills/gguf/references/advanced-usage.md +504 -0
  60. package/bin/skills/gguf/references/troubleshooting.md +442 -0
  61. package/bin/skills/gptq/SKILL.md +450 -0
  62. package/bin/skills/gptq/references/calibration.md +337 -0
  63. package/bin/skills/gptq/references/integration.md +129 -0
  64. package/bin/skills/gptq/references/troubleshooting.md +95 -0
  65. package/bin/skills/grpo-rl-training/README.md +97 -0
  66. package/bin/skills/grpo-rl-training/SKILL.md +572 -0
  67. package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
  68. package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
  69. package/bin/skills/guidance/SKILL.md +572 -0
  70. package/bin/skills/guidance/references/backends.md +554 -0
  71. package/bin/skills/guidance/references/constraints.md +674 -0
  72. package/bin/skills/guidance/references/examples.md +767 -0
  73. package/bin/skills/hqq/SKILL.md +445 -0
  74. package/bin/skills/hqq/references/advanced-usage.md +528 -0
  75. package/bin/skills/hqq/references/troubleshooting.md +503 -0
  76. package/bin/skills/hugging-face-cli/SKILL.md +191 -0
  77. package/bin/skills/hugging-face-cli/references/commands.md +954 -0
  78. package/bin/skills/hugging-face-cli/references/examples.md +374 -0
  79. package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
  80. package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
  81. package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
  82. package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
  83. package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
  84. package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
  85. package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
  86. package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
  87. package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
  88. package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
  89. package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
  90. package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
  91. package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
  92. package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
  93. package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
  94. package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
  95. package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
  96. package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
  97. package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
  98. package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
  99. package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
  100. package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
  101. package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
  102. package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
  103. package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
  104. package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
  105. package/bin/skills/hugging-face-jobs/index.html +216 -0
  106. package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
  107. package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
  108. package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
  109. package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
  110. package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
  111. package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
  112. package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
  113. package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
  114. package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
  115. package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
  116. package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
  117. package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
  118. package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
  119. package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
  120. package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
  121. package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
  122. package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
  123. package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
  124. package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
  125. package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
  126. package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
  127. package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
  128. package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
  129. package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
  130. package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
  131. package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
  132. package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
  133. package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
  134. package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
  135. package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
  136. package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
  137. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
  138. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
  139. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
  140. package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
  141. package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
  142. package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
  143. package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
  144. package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
  145. package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
  146. package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
  147. package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
  148. package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
  149. package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
  150. package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
  151. package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
  152. package/bin/skills/instructor/SKILL.md +740 -0
  153. package/bin/skills/instructor/references/examples.md +107 -0
  154. package/bin/skills/instructor/references/providers.md +70 -0
  155. package/bin/skills/instructor/references/validation.md +606 -0
  156. package/bin/skills/knowledge-distillation/SKILL.md +458 -0
  157. package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
  158. package/bin/skills/lambda-labs/SKILL.md +545 -0
  159. package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
  160. package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
  161. package/bin/skills/langchain/SKILL.md +480 -0
  162. package/bin/skills/langchain/references/agents.md +499 -0
  163. package/bin/skills/langchain/references/integration.md +562 -0
  164. package/bin/skills/langchain/references/rag.md +600 -0
  165. package/bin/skills/langsmith/SKILL.md +422 -0
  166. package/bin/skills/langsmith/references/advanced-usage.md +548 -0
  167. package/bin/skills/langsmith/references/troubleshooting.md +537 -0
  168. package/bin/skills/litgpt/SKILL.md +469 -0
  169. package/bin/skills/litgpt/references/custom-models.md +568 -0
  170. package/bin/skills/litgpt/references/distributed-training.md +451 -0
  171. package/bin/skills/litgpt/references/supported-models.md +336 -0
  172. package/bin/skills/litgpt/references/training-recipes.md +619 -0
  173. package/bin/skills/llama-cpp/SKILL.md +258 -0
  174. package/bin/skills/llama-cpp/references/optimization.md +89 -0
  175. package/bin/skills/llama-cpp/references/quantization.md +213 -0
  176. package/bin/skills/llama-cpp/references/server.md +125 -0
  177. package/bin/skills/llama-factory/SKILL.md +80 -0
  178. package/bin/skills/llama-factory/references/_images.md +23 -0
  179. package/bin/skills/llama-factory/references/advanced.md +1055 -0
  180. package/bin/skills/llama-factory/references/getting_started.md +349 -0
  181. package/bin/skills/llama-factory/references/index.md +19 -0
  182. package/bin/skills/llama-factory/references/other.md +31 -0
  183. package/bin/skills/llamaguard/SKILL.md +337 -0
  184. package/bin/skills/llamaindex/SKILL.md +569 -0
  185. package/bin/skills/llamaindex/references/agents.md +83 -0
  186. package/bin/skills/llamaindex/references/data_connectors.md +108 -0
  187. package/bin/skills/llamaindex/references/query_engines.md +406 -0
  188. package/bin/skills/llava/SKILL.md +304 -0
  189. package/bin/skills/llava/references/training.md +197 -0
  190. package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
  191. package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
  192. package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
  193. package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
  194. package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
  195. package/bin/skills/long-context/SKILL.md +536 -0
  196. package/bin/skills/long-context/references/extension_methods.md +468 -0
  197. package/bin/skills/long-context/references/fine_tuning.md +611 -0
  198. package/bin/skills/long-context/references/rope.md +402 -0
  199. package/bin/skills/mamba/SKILL.md +260 -0
  200. package/bin/skills/mamba/references/architecture-details.md +206 -0
  201. package/bin/skills/mamba/references/benchmarks.md +255 -0
  202. package/bin/skills/mamba/references/training-guide.md +388 -0
  203. package/bin/skills/megatron-core/SKILL.md +366 -0
  204. package/bin/skills/megatron-core/references/benchmarks.md +249 -0
  205. package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
  206. package/bin/skills/megatron-core/references/production-examples.md +473 -0
  207. package/bin/skills/megatron-core/references/training-recipes.md +547 -0
  208. package/bin/skills/miles/SKILL.md +315 -0
  209. package/bin/skills/miles/references/api-reference.md +141 -0
  210. package/bin/skills/miles/references/troubleshooting.md +352 -0
  211. package/bin/skills/mlflow/SKILL.md +704 -0
  212. package/bin/skills/mlflow/references/deployment.md +744 -0
  213. package/bin/skills/mlflow/references/model-registry.md +770 -0
  214. package/bin/skills/mlflow/references/tracking.md +680 -0
  215. package/bin/skills/modal/SKILL.md +341 -0
  216. package/bin/skills/modal/references/advanced-usage.md +503 -0
  217. package/bin/skills/modal/references/troubleshooting.md +494 -0
  218. package/bin/skills/model-merging/SKILL.md +539 -0
  219. package/bin/skills/model-merging/references/evaluation.md +462 -0
  220. package/bin/skills/model-merging/references/examples.md +428 -0
  221. package/bin/skills/model-merging/references/methods.md +352 -0
  222. package/bin/skills/model-pruning/SKILL.md +495 -0
  223. package/bin/skills/model-pruning/references/wanda.md +347 -0
  224. package/bin/skills/moe-training/SKILL.md +526 -0
  225. package/bin/skills/moe-training/references/architectures.md +432 -0
  226. package/bin/skills/moe-training/references/inference.md +348 -0
  227. package/bin/skills/moe-training/references/training.md +425 -0
  228. package/bin/skills/nanogpt/SKILL.md +290 -0
  229. package/bin/skills/nanogpt/references/architecture.md +382 -0
  230. package/bin/skills/nanogpt/references/data.md +476 -0
  231. package/bin/skills/nanogpt/references/training.md +564 -0
  232. package/bin/skills/nemo-curator/SKILL.md +383 -0
  233. package/bin/skills/nemo-curator/references/deduplication.md +87 -0
  234. package/bin/skills/nemo-curator/references/filtering.md +102 -0
  235. package/bin/skills/nemo-evaluator/SKILL.md +494 -0
  236. package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
  237. package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
  238. package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
  239. package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
  240. package/bin/skills/nemo-guardrails/SKILL.md +297 -0
  241. package/bin/skills/nnsight/SKILL.md +436 -0
  242. package/bin/skills/nnsight/references/README.md +78 -0
  243. package/bin/skills/nnsight/references/api.md +344 -0
  244. package/bin/skills/nnsight/references/tutorials.md +300 -0
  245. package/bin/skills/openrlhf/SKILL.md +249 -0
  246. package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
  247. package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
  248. package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
  249. package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
  250. package/bin/skills/outlines/SKILL.md +652 -0
  251. package/bin/skills/outlines/references/backends.md +615 -0
  252. package/bin/skills/outlines/references/examples.md +773 -0
  253. package/bin/skills/outlines/references/json_generation.md +652 -0
  254. package/bin/skills/peft/SKILL.md +431 -0
  255. package/bin/skills/peft/references/advanced-usage.md +514 -0
  256. package/bin/skills/peft/references/troubleshooting.md +480 -0
  257. package/bin/skills/phoenix/SKILL.md +475 -0
  258. package/bin/skills/phoenix/references/advanced-usage.md +619 -0
  259. package/bin/skills/phoenix/references/troubleshooting.md +538 -0
  260. package/bin/skills/pinecone/SKILL.md +358 -0
  261. package/bin/skills/pinecone/references/deployment.md +181 -0
  262. package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
  263. package/bin/skills/pytorch-fsdp/references/index.md +7 -0
  264. package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
  265. package/bin/skills/pytorch-lightning/SKILL.md +346 -0
  266. package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
  267. package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
  268. package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
  269. package/bin/skills/pyvene/SKILL.md +473 -0
  270. package/bin/skills/pyvene/references/README.md +73 -0
  271. package/bin/skills/pyvene/references/api.md +383 -0
  272. package/bin/skills/pyvene/references/tutorials.md +376 -0
  273. package/bin/skills/qdrant/SKILL.md +493 -0
  274. package/bin/skills/qdrant/references/advanced-usage.md +648 -0
  275. package/bin/skills/qdrant/references/troubleshooting.md +631 -0
  276. package/bin/skills/ray-data/SKILL.md +326 -0
  277. package/bin/skills/ray-data/references/integration.md +82 -0
  278. package/bin/skills/ray-data/references/transformations.md +83 -0
  279. package/bin/skills/ray-train/SKILL.md +406 -0
  280. package/bin/skills/ray-train/references/multi-node.md +628 -0
  281. package/bin/skills/rwkv/SKILL.md +260 -0
  282. package/bin/skills/rwkv/references/architecture-details.md +344 -0
  283. package/bin/skills/rwkv/references/rwkv7.md +386 -0
  284. package/bin/skills/rwkv/references/state-management.md +369 -0
  285. package/bin/skills/saelens/SKILL.md +386 -0
  286. package/bin/skills/saelens/references/README.md +70 -0
  287. package/bin/skills/saelens/references/api.md +333 -0
  288. package/bin/skills/saelens/references/tutorials.md +318 -0
  289. package/bin/skills/segment-anything/SKILL.md +500 -0
  290. package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
  291. package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
  292. package/bin/skills/sentence-transformers/SKILL.md +255 -0
  293. package/bin/skills/sentence-transformers/references/models.md +123 -0
  294. package/bin/skills/sentencepiece/SKILL.md +235 -0
  295. package/bin/skills/sentencepiece/references/algorithms.md +200 -0
  296. package/bin/skills/sentencepiece/references/training.md +304 -0
  297. package/bin/skills/sglang/SKILL.md +442 -0
  298. package/bin/skills/sglang/references/deployment.md +490 -0
  299. package/bin/skills/sglang/references/radix-attention.md +413 -0
  300. package/bin/skills/sglang/references/structured-generation.md +541 -0
  301. package/bin/skills/simpo/SKILL.md +219 -0
  302. package/bin/skills/simpo/references/datasets.md +478 -0
  303. package/bin/skills/simpo/references/hyperparameters.md +452 -0
  304. package/bin/skills/simpo/references/loss-functions.md +350 -0
  305. package/bin/skills/skypilot/SKILL.md +509 -0
  306. package/bin/skills/skypilot/references/advanced-usage.md +491 -0
  307. package/bin/skills/skypilot/references/troubleshooting.md +570 -0
  308. package/bin/skills/slime/SKILL.md +464 -0
  309. package/bin/skills/slime/references/api-reference.md +392 -0
  310. package/bin/skills/slime/references/troubleshooting.md +386 -0
  311. package/bin/skills/speculative-decoding/SKILL.md +467 -0
  312. package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
  313. package/bin/skills/speculative-decoding/references/medusa.md +350 -0
  314. package/bin/skills/stable-diffusion/SKILL.md +519 -0
  315. package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
  316. package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
  317. package/bin/skills/tensorboard/SKILL.md +629 -0
  318. package/bin/skills/tensorboard/references/integrations.md +638 -0
  319. package/bin/skills/tensorboard/references/profiling.md +545 -0
  320. package/bin/skills/tensorboard/references/visualization.md +620 -0
  321. package/bin/skills/tensorrt-llm/SKILL.md +187 -0
  322. package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
  323. package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
  324. package/bin/skills/tensorrt-llm/references/serving.md +470 -0
  325. package/bin/skills/tinker/SKILL.md +362 -0
  326. package/bin/skills/tinker/references/api-reference.md +168 -0
  327. package/bin/skills/tinker/references/getting-started.md +157 -0
  328. package/bin/skills/tinker/references/loss-functions.md +163 -0
  329. package/bin/skills/tinker/references/models-and-lora.md +139 -0
  330. package/bin/skills/tinker/references/recipes.md +280 -0
  331. package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
  332. package/bin/skills/tinker/references/rendering.md +243 -0
  333. package/bin/skills/tinker/references/supervised-learning.md +232 -0
  334. package/bin/skills/tinker-training-cost/SKILL.md +187 -0
  335. package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
  336. package/bin/skills/torchforge/SKILL.md +433 -0
  337. package/bin/skills/torchforge/references/api-reference.md +327 -0
  338. package/bin/skills/torchforge/references/troubleshooting.md +409 -0
  339. package/bin/skills/torchtitan/SKILL.md +358 -0
  340. package/bin/skills/torchtitan/references/checkpoint.md +181 -0
  341. package/bin/skills/torchtitan/references/custom-models.md +258 -0
  342. package/bin/skills/torchtitan/references/float8.md +133 -0
  343. package/bin/skills/torchtitan/references/fsdp.md +126 -0
  344. package/bin/skills/transformer-lens/SKILL.md +346 -0
  345. package/bin/skills/transformer-lens/references/README.md +54 -0
  346. package/bin/skills/transformer-lens/references/api.md +362 -0
  347. package/bin/skills/transformer-lens/references/tutorials.md +339 -0
  348. package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
  349. package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
  350. package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
  351. package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
  352. package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
  353. package/bin/skills/unsloth/SKILL.md +80 -0
  354. package/bin/skills/unsloth/references/index.md +7 -0
  355. package/bin/skills/unsloth/references/llms-full.md +16799 -0
  356. package/bin/skills/unsloth/references/llms-txt.md +12044 -0
  357. package/bin/skills/unsloth/references/llms.md +82 -0
  358. package/bin/skills/verl/SKILL.md +391 -0
  359. package/bin/skills/verl/references/api-reference.md +301 -0
  360. package/bin/skills/verl/references/troubleshooting.md +391 -0
  361. package/bin/skills/vllm/SKILL.md +364 -0
  362. package/bin/skills/vllm/references/optimization.md +226 -0
  363. package/bin/skills/vllm/references/quantization.md +284 -0
  364. package/bin/skills/vllm/references/server-deployment.md +255 -0
  365. package/bin/skills/vllm/references/troubleshooting.md +447 -0
  366. package/bin/skills/weights-and-biases/SKILL.md +590 -0
  367. package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
  368. package/bin/skills/weights-and-biases/references/integrations.md +700 -0
  369. package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
  370. package/bin/skills/whisper/SKILL.md +317 -0
  371. package/bin/skills/whisper/references/languages.md +189 -0
  372. package/bin/synsc +0 -0
  373. package/package.json +10 -0
@@ -0,0 +1,1041 @@
---
name: hugging-face-jobs
description: This skill should be used when users want to run any workload on Hugging Face Jobs infrastructure. Covers UV scripts, Docker-based jobs, hardware selection, cost estimation, authentication with tokens, secrets management, timeout configuration, and result persistence. Designed for general-purpose compute workloads including data processing, inference, experiments, batch jobs, and any Python-based tasks. Should be invoked for tasks involving cloud compute, GPU workloads, or when users mention running jobs on Hugging Face infrastructure without local setup.
version: 1.0.0
author: Synthetic Sciences
license: MIT (complete terms in LICENSE.txt)
tags: [Hugging Face, Cloud Compute, Training Jobs, GPU]
dependencies: [huggingface-hub, transformers]
---

# Running Workloads on Hugging Face Jobs

## Overview

Run any workload on fully managed Hugging Face infrastructure. No local setup required—jobs run on cloud CPUs, GPUs, or TPUs and can persist results to the Hugging Face Hub.

**Common use cases:**
- **Data Processing** - Transform, filter, or analyze large datasets
- **Batch Inference** - Run inference on thousands of samples
- **Experiments & Benchmarks** - Reproducible ML experiments
- **Model Training** - Fine-tune models (see `model-trainer` skill for TRL-specific training)
- **Synthetic Data Generation** - Generate datasets using LLMs
- **Development & Testing** - Test code without local GPU setup
- **Scheduled Jobs** - Automate recurring tasks

**For model training specifically:** See the `model-trainer` skill for TRL-based training workflows.

## When to Use This Skill

Use this skill when users want to:
- Run Python workloads on cloud infrastructure
- Execute jobs without local GPU/TPU setup
- Process data at scale
- Run batch inference or experiments
- Schedule recurring tasks
- Use GPUs/TPUs for any workload
- Persist results to the Hugging Face Hub

## Key Directives

When assisting with jobs:

1. **ALWAYS use the `hf_jobs()` MCP tool** - Submit jobs using `hf_jobs("uv", {...})` or `hf_jobs("run", {...})`. The `script` parameter accepts Python code directly. Do NOT save to local files unless the user explicitly requests it. Pass the script content as a string to `hf_jobs()`.

2. **Always handle authentication** - Jobs that interact with the Hub require `HF_TOKEN` via secrets. See the Token Usage section below.

3. **Provide job details after submission** - After submitting, provide the job ID, monitoring URL, and estimated run time, and note that the user can request status checks later.

4. **Set appropriate timeouts** - The default 30-minute timeout may be insufficient for long-running tasks.

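Putting these directives together, a typical submission looks like the sketch below. It assumes the `hf_jobs` MCP tool described above; the inline script, the dataset ID, and the `flavor` and `timeout` values are illustrative placeholders, not defaults:

```python
hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["datasets"]
# ///
from datasets import load_dataset

ds = load_dataset("stanfordnlp/imdb", split="train[:1000]")
print(ds)
""",
    "flavor": "cpu-basic",                 # hardware selection (illustrative)
    "timeout": "1h",                       # raise the 30-minute default for longer runs
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},  # only needed if the script touches the Hub
})
```

After submission, report the job ID and monitoring URL back to the user, per directive 3.
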
## Prerequisites Checklist

Before starting any job, verify:

### ✅ **Account & Authentication**
- Hugging Face account with a [Pro](https://hf.co/pro), [Team](https://hf.co/enterprise), or [Enterprise](https://hf.co/enterprise) plan (Jobs require a paid plan)
- Authenticated login: check with `hf_whoami()`
- **HF_TOKEN for Hub Access** ⚠️ CRITICAL - Required for any Hub operations (push models/datasets, download private repos, etc.)
- Token must have appropriate permissions (read for downloads, write for uploads)

### ✅ **Token Usage** (See Token Usage section for details)

**When tokens are required:**
- Pushing models/datasets to the Hub
- Accessing private repositories
- Using Hub APIs in scripts
- Any authenticated Hub operations

**How to provide tokens:**
```python
{
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # Recommended: automatic token
}
```

**⚠️ CRITICAL:** The `$HF_TOKEN` placeholder is automatically replaced with your logged-in token. Never hardcode tokens in scripts.

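Since tokens must never appear verbatim in code or logs, redact them whenever one has to be displayed. The `mask_token` helper below is a hypothetical illustration of that practice, not part of this skill:

```python
def mask_token(token: str, keep: int = 4) -> str:
    """Return a redacted form of a token that is safe to show in logs."""
    if len(token) <= keep * 2:
        # Too short to partially reveal; hide it entirely
        return "*" * len(token)
    return token[:keep] + "..." + token[-keep:]

print(mask_token("hf_abcdefghijklmnop"))  # prints: hf_a...mnop
```
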
## Token Usage Guide

### Understanding Tokens

**What are HF Tokens?**
- Authentication credentials for the Hugging Face Hub
- Required for authenticated operations (push, private repos, API access)
- Stored securely on your machine after `hf auth login`

**Token Types:**
- **Read Token** - Can download models/datasets, read private repos
- **Write Token** - Can push models/datasets, create repos, modify content
- **Organization Token** - Can act on behalf of an organization

### When Tokens Are Required

**Always Required:**
- Pushing models/datasets to the Hub
- Accessing private repositories
- Creating new repositories
- Modifying existing repositories
- Using Hub APIs programmatically

**Not Required:**
- Downloading public models/datasets
- Running jobs that don't interact with the Hub
- Reading public repository information

### How to Provide Tokens to Jobs

#### Method 1: Automatic Token (Recommended)

```python
hf_jobs("uv", {
    "script": "your_script.py",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # ✅ Automatic replacement
})
```

**How it works:**
- `$HF_TOKEN` is a placeholder that gets replaced with your actual token
- Uses the token from your logged-in session (`hf auth login`)
- Most secure and convenient method
- Token is encrypted server-side when passed as a secret

**Benefits:**
- No token exposure in code
- Uses your current login session
- Automatically updated if you re-login
- Works seamlessly with MCP tools

#### Method 2: Explicit Token (Not Recommended)

```python
hf_jobs("uv", {
    "script": "your_script.py",
    "secrets": {"HF_TOKEN": "hf_abc123..."}  # ⚠️ Hardcoded token
})
```

**When to use:**
- Only if the automatic token doesn't work
- Testing with a specific token
- Organization tokens (use with caution)

**Security concerns:**
- Token visible in code/logs
- Must be manually updated if the token rotates
- Risk of token exposure

#### Method 3: Environment Variable (Less Secure)

```python
hf_jobs("uv", {
    "script": "your_script.py",
    "env": {"HF_TOKEN": "hf_abc123..."}  # ⚠️ Less secure than secrets
})
```

**Difference from secrets:**
- `env` variables are visible in job logs
- `secrets` are encrypted server-side
- Always prefer `secrets` for tokens

163
+ ### Using Tokens in Scripts
164
+
165
+ **In your Python script, tokens are available as environment variables:**
166
+
167
+ ```python
168
+ # /// script
169
+ # dependencies = ["huggingface-hub"]
170
+ # ///
171
+
172
+ import os
173
+ from huggingface_hub import HfApi
174
+
175
+ # Token is automatically available if passed via secrets
176
+ token = os.environ.get("HF_TOKEN")
177
+
178
+ # Use with Hub API
179
+ api = HfApi(token=token)
180
+
181
+ # Or let huggingface_hub auto-detect
182
+ api = HfApi() # Automatically uses HF_TOKEN env var
183
+ ```
184
+
185
+ **Best practices:**
186
+ - Don't hardcode tokens in scripts
187
+ - Use `os.environ.get("HF_TOKEN")` to access
188
+ - Let `huggingface_hub` auto-detect when possible
189
+ - Verify token exists before Hub operations
190
+
191
+ ### Token Verification
192
+
193
+ **Check if you're logged in:**
194
+ ```python
195
+ from huggingface_hub import whoami
196
+ user_info = whoami()  # Returns account info (a dict); raises if not authenticated
197
+ ```
198
+
199
+ **Verify token in job:**
200
+ ```python
201
+ import os
202
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN not found!"
203
+ token = os.environ["HF_TOKEN"]
204
+ print(f"Token starts with: {token[:7]}...") # Should start with "hf_"
205
+ ```
206
+
207
+ ### Common Token Issues
208
+
209
+ **Error: 401 Unauthorized**
210
+ - **Cause:** Token missing or invalid
211
+ - **Fix:** Add `secrets={"HF_TOKEN": "$HF_TOKEN"}` to job config
212
+ - **Verify:** Check `hf_whoami()` works locally
213
+
214
+ **Error: 403 Forbidden**
215
+ - **Cause:** Token lacks required permissions
216
+ - **Fix:** Ensure token has write permissions for push operations
217
+ - **Check:** Token type at https://huggingface.co/settings/tokens
218
+
219
+ **Error: Token not found in environment**
220
+ - **Cause:** `secrets` not passed or wrong key name
221
+ - **Fix:** Use `secrets={"HF_TOKEN": "$HF_TOKEN"}` (not `env`)
222
+ - **Verify:** Script checks `os.environ.get("HF_TOKEN")`
223
+
224
+ **Error: Repository access denied**
225
+ - **Cause:** Token doesn't have access to private repo
226
+ - **Fix:** Use token from account with access
227
+ - **Check:** Verify repo visibility and your permissions
228
+
229
+ ### Token Security Best Practices
230
+
231
+ 1. **Never commit tokens** - Use `$HF_TOKEN` placeholder or environment variables
232
+ 2. **Use secrets, not env** - Secrets are encrypted server-side
233
+ 3. **Rotate tokens regularly** - Generate new tokens periodically
234
+ 4. **Use minimal permissions** - Create tokens with only needed permissions
235
+ 5. **Don't share tokens** - Each user should use their own token
236
+ 6. **Monitor token usage** - Check token activity in Hub settings
237
+
238
+ ### Complete Token Example
239
+
240
+ ```python
241
+ # Example: Push results to Hub
242
+ hf_jobs("uv", {
243
+ "script": """
244
+ # /// script
245
+ # dependencies = ["huggingface-hub", "datasets"]
246
+ # ///
247
+
248
+ import os
249
+ from huggingface_hub import HfApi
250
+ from datasets import Dataset
251
+
252
+ # Verify token is available
253
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"
254
+
255
+ # Use token for Hub operations
256
+ api = HfApi(token=os.environ["HF_TOKEN"])
257
+
258
+ # Create and push dataset
259
+ data = {"text": ["Hello", "World"]}
260
+ dataset = Dataset.from_dict(data)
261
+ dataset.push_to_hub("username/my-dataset", token=os.environ["HF_TOKEN"])
262
+
263
+ print("✅ Dataset pushed successfully!")
264
+ """,
265
+ "flavor": "cpu-basic",
266
+ "timeout": "30m",
267
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # ✅ Token provided securely
268
+ })
269
+ ```
270
+
271
+ ## Quick Start: Two Approaches
272
+
273
+ ### Approach 1: UV Scripts (Recommended)
274
+
275
+ UV scripts use PEP 723 inline dependencies for clean, self-contained workloads.
276
+
277
+ **MCP Tool:**
278
+ ```python
279
+ hf_jobs("uv", {
280
+ "script": """
281
+ # /// script
282
+ # dependencies = ["transformers", "torch"]
283
+ # ///
284
+
285
+ from transformers import pipeline
286
+ import torch
287
+
288
+ # Your workload here
289
+ classifier = pipeline("sentiment-analysis")
290
+ result = classifier("I love Hugging Face!")
291
+ print(result)
292
+ """,
293
+ "flavor": "cpu-basic",
294
+ "timeout": "30m"
295
+ })
296
+ ```
297
+
298
+ **CLI Equivalent:**
299
+ ```bash
300
+ hf jobs uv run my_script.py --flavor cpu-basic --timeout 30m
301
+ ```
302
+
303
+ **Python API:**
304
+ ```python
305
+ from huggingface_hub import run_uv_job
306
+ run_uv_job("my_script.py", flavor="cpu-basic", timeout="30m")
307
+ ```
308
+
309
+ **Benefits:** Direct MCP tool usage, clean code, dependencies declared inline, no file saving required
310
+
311
+ **When to use:** Default choice for all workloads, custom logic, any scenario requiring `hf_jobs()`
312
+
313
+ #### Custom Docker Images for UV Scripts
314
+
315
+ By default, UV scripts use `ghcr.io/astral-sh/uv:python3.12-bookworm-slim`. For ML workloads with complex dependencies, use pre-built images:
316
+
317
+ ```python
318
+ hf_jobs("uv", {
319
+ "script": "inference.py",
320
+ "image": "vllm/vllm-openai:latest", # Pre-built image with vLLM
321
+ "flavor": "a10g-large"
322
+ })
323
+ ```
324
+
325
+ **CLI:**
326
+ ```bash
327
+ hf jobs uv run --image vllm/vllm-openai:latest --flavor a10g-large inference.py
328
+ ```
329
+
330
+ **Benefits:** Faster startup, pre-installed dependencies, optimized for specific frameworks
331
+
332
+ #### Python Version
333
+
334
+ By default, UV scripts use Python 3.12. Specify a different version:
335
+
336
+ ```python
337
+ hf_jobs("uv", {
338
+ "script": "my_script.py",
339
+ "python": "3.11", # Use Python 3.11
340
+ "flavor": "cpu-basic"
341
+ })
342
+ ```
343
+
344
+ **Python API:**
345
+ ```python
346
+ from huggingface_hub import run_uv_job
347
+ run_uv_job("my_script.py", python="3.11")
348
+ ```
349
+
350
+ #### Working with Scripts
351
+
352
+ ⚠️ **Important:** There are *two* different "script path" behaviors depending on how you run Jobs:
353
+
354
+ - **Using the `hf_jobs()` MCP tool (recommended in this repo)**: the `script` value must be **inline code** (a string) or a **URL**. A local filesystem path (like `"./scripts/foo.py"`) won't exist inside the remote container.
355
+ - **Using the `hf jobs uv run` CLI**: local file paths **do work** (the CLI uploads your script).
356
+
357
+ **Common mistake with `hf_jobs()` MCP tool:**
358
+
359
+ ```python
360
+ # ❌ Will fail (remote container can't see your local path)
361
+ hf_jobs("uv", {"script": "./scripts/foo.py"})
362
+ ```
363
+
364
+ **Correct patterns with `hf_jobs()` MCP tool:**
365
+
366
+ ```python
367
+ # ✅ Inline: read the local script file and pass its *contents*
368
+ from pathlib import Path
369
+ script = Path("hf-jobs/scripts/foo.py").read_text()
370
+ hf_jobs("uv", {"script": script})
371
+
372
+ # ✅ URL: host the script somewhere reachable
373
+ hf_jobs("uv", {"script": "https://huggingface.co/datasets/uv-scripts/.../raw/main/foo.py"})
374
+
375
+ # ✅ URL from GitHub
376
+ hf_jobs("uv", {"script": "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py"})
377
+ ```
378
+
379
+ **CLI equivalent (local paths supported):**
380
+
381
+ ```bash
382
+ hf jobs uv run ./scripts/foo.py -- --your --args
383
+ ```
384
+
385
+ #### Adding Dependencies at Runtime
386
+
387
+ Add extra dependencies beyond what's in the PEP 723 header:
388
+
389
+ ```python
390
+ hf_jobs("uv", {
391
+ "script": "inference.py",
392
+ "dependencies": ["transformers", "torch>=2.0"], # Extra deps
393
+ "flavor": "a10g-small"
394
+ })
395
+ ```
396
+
397
+ **Python API:**
398
+ ```python
399
+ from huggingface_hub import run_uv_job
400
+ run_uv_job("inference.py", dependencies=["transformers", "torch>=2.0"])
401
+ ```
402
+
403
+ ### Approach 2: Docker-Based Jobs
404
+
405
+ Run jobs with custom Docker images and commands.
406
+
407
+ **MCP Tool:**
408
+ ```python
409
+ hf_jobs("run", {
410
+ "image": "python:3.12",
411
+ "command": ["python", "-c", "print('Hello from HF Jobs!')"],
412
+ "flavor": "cpu-basic",
413
+ "timeout": "30m"
414
+ })
415
+ ```
416
+
417
+ **CLI Equivalent:**
418
+ ```bash
419
+ hf jobs run python:3.12 python -c "print('Hello from HF Jobs!')"
420
+ ```
421
+
422
+ **Python API:**
423
+ ```python
424
+ from huggingface_hub import run_job
425
+ run_job(image="python:3.12", command=["python", "-c", "print('Hello!')"], flavor="cpu-basic")
426
+ ```
427
+
428
+ **Benefits:** Full Docker control, use pre-built images, run any command
429
+ **When to use:** Need specific Docker images, non-Python workloads, complex environments
430
+
431
+ **Example with GPU:**
432
+ ```python
433
+ hf_jobs("run", {
434
+ "image": "pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel",
435
+ "command": ["python", "-c", "import torch; print(torch.cuda.get_device_name())"],
436
+ "flavor": "a10g-small",
437
+ "timeout": "1h"
438
+ })
439
+ ```
440
+
441
+ **Using Hugging Face Spaces as Images:**
442
+
443
+ You can use Docker images from HF Spaces:
444
+ ```python
445
+ hf_jobs("run", {
446
+ "image": "hf.co/spaces/lhoestq/duckdb", # Space as Docker image
447
+ "command": ["duckdb", "-c", "SELECT 'Hello from DuckDB!'"],
448
+ "flavor": "cpu-basic"
449
+ })
450
+ ```
451
+
452
+ **CLI:**
453
+ ```bash
454
+ hf jobs run hf.co/spaces/lhoestq/duckdb duckdb -c "SELECT 'Hello!'"
455
+ ```
456
+
457
+ ### Finding More UV Scripts on Hub
458
+
459
+ The `uv-scripts` organization provides ready-to-use UV scripts stored as datasets on Hugging Face Hub:
460
+
461
+ ```python
462
+ # Discover available UV script collections
463
+ dataset_search({"author": "uv-scripts", "sort": "downloads", "limit": 20})
464
+
465
+ # Explore a specific collection
466
+ hub_repo_details(["uv-scripts/classification"], repo_type="dataset", include_readme=True)
467
+ ```
468
+
469
+ **Popular collections:** OCR, classification, synthetic-data, vLLM, dataset-creation
470
+
471
+ ## Hardware Selection
472
+
473
+ > **Reference:** [HF Jobs Hardware Docs](https://huggingface.co/docs/hub/en/spaces-config-reference) (updated 07/2025)
474
+
475
+ | Workload Type | Recommended Hardware | Use Case |
476
+ |---------------|---------------------|----------|
477
+ | Data processing, testing | `cpu-basic`, `cpu-upgrade` | Lightweight tasks |
478
+ | Small models, demos | `t4-small` | <1B models, quick tests |
479
+ | Medium models | `t4-medium`, `l4x1` | 1-7B models |
480
+ | Large models, production | `a10g-small`, `a10g-large` | 7-13B models |
481
+ | Very large models | `a100-large` | 13B+ models |
482
+ | Batch inference | `a10g-large`, `a100-large` | High-throughput |
483
+ | Multi-GPU workloads | `l4x4`, `a10g-largex2`, `a10g-largex4` | Parallel/large models |
484
+ | TPU workloads | `v5e-1x1`, `v5e-2x2`, `v5e-2x4` | JAX/Flax, TPU-optimized |
485
+
486
+ **All Available Flavors:**
487
+ - **CPU:** `cpu-basic`, `cpu-upgrade`
488
+ - **GPU:** `t4-small`, `t4-medium`, `l4x1`, `l4x4`, `a10g-small`, `a10g-large`, `a10g-largex2`, `a10g-largex4`, `a100-large`
489
+ - **TPU:** `v5e-1x1`, `v5e-2x2`, `v5e-2x4`
490
+
491
+ **Guidelines:**
492
+ - Start with smaller hardware for testing
493
+ - Scale up based on actual needs
494
+ - Use multi-GPU for parallel workloads or large models
495
+ - Use TPUs for JAX/Flax workloads
496
+ - See `references/hardware_guide.md` for detailed specifications
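The table and guidelines above can be condensed into a small helper. This is an illustrative sketch only: the thresholds are rough rules of thumb taken from the table, not official sizing guidance, and `suggest_flavor` is a hypothetical name.

```python
# Illustrative only: pick a *starting* flavor from approximate model size
# (billions of parameters), following the table above. Tune from real runs.
def suggest_flavor(model_params_b: float, multi_gpu: bool = False) -> str:
    """Suggest a starting hardware flavor; scale up or down after testing."""
    if multi_gpu:
        return "a10g-largex2"   # or l4x4 / a10g-largex4 for larger parallel jobs
    if model_params_b < 1:
        return "t4-small"       # small models, quick tests
    if model_params_b <= 7:
        return "l4x1"           # medium models
    if model_params_b <= 13:
        return "a10g-large"     # large models
    return "a100-large"         # very large models

print(suggest_flavor(7))    # l4x1
print(suggest_flavor(30))   # a100-large
```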
497
+
498
+ ## Critical: Saving Results
499
+
500
+ **⚠️ EPHEMERAL ENVIRONMENT: MUST PERSIST RESULTS**
501
+
502
+ The Jobs environment is temporary. All files are deleted when the job ends. If results aren't persisted, **ALL WORK IS LOST**.
503
+
504
+ ### Persistence Options
505
+
506
+ **1. Push to Hugging Face Hub (Recommended)**
507
+
508
+ ```python
509
+ # Push models
510
+ model.push_to_hub("username/model-name", token=os.environ["HF_TOKEN"])
511
+
512
+ # Push datasets
513
+ dataset.push_to_hub("username/dataset-name", token=os.environ["HF_TOKEN"])
514
+
515
+ # Push artifacts
516
+ api.upload_file(
517
+ path_or_fileobj="results.json",
518
+ path_in_repo="results.json",
519
+ repo_id="username/results",
520
+ token=os.environ["HF_TOKEN"]
521
+ )
522
+ ```
523
+
524
+ **2. Use External Storage**
525
+
526
+ ```python
527
+ # Upload to S3, GCS, etc.
528
+ import boto3
529
+ s3 = boto3.client('s3')
530
+ s3.upload_file('results.json', 'my-bucket', 'results.json')
531
+ ```
532
+
533
+ **3. Send Results via API**
534
+
535
+ ```python
536
+ # POST results to your API
537
+ import requests
538
+ requests.post("https://your-api.com/results", json=results)
539
+ ```
540
+
541
+ ### Required Configuration for Hub Push
542
+
543
+ **In job submission:**
544
+ ```python
545
+ {
546
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # Enables authentication
547
+ }
548
+ ```
549
+
550
+ **In script:**
551
+ ```python
552
+ import os
553
+ from huggingface_hub import HfApi
554
+
555
+ # Token automatically available from secrets
556
+ api = HfApi(token=os.environ.get("HF_TOKEN"))
557
+
558
+ # Push your results
559
+ api.upload_file(...)
560
+ ```
561
+
562
+ ### Verification Checklist
563
+
564
+ Before submitting:
565
+ - [ ] Results persistence method chosen
566
+ - [ ] `secrets={"HF_TOKEN": "$HF_TOKEN"}` if using Hub
567
+ - [ ] Script handles missing token gracefully
568
+ - [ ] Test persistence path works
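The "handles missing token gracefully" item can be as simple as a guard that fails fast with an actionable message instead of a 401 deep in the run. `require_hf_token` below is a hypothetical helper, not part of `huggingface_hub`:

```python
import os

def require_hf_token() -> str:
    """Return the HF token, or raise a clear error before any Hub work starts."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError(
            "HF_TOKEN not set; pass secrets={'HF_TOKEN': '$HF_TOKEN'} in the job config"
        )
    return token
```

Call it once at the top of the script so a misconfigured job fails in seconds, not after hours of compute.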
569
+
570
+ **See:** `references/hub_saving.md` for detailed Hub persistence guide
571
+
572
+ ## Timeout Management
573
+
574
+ **⚠️ DEFAULT: 30 MINUTES**
575
+
576
+ Jobs automatically stop after the timeout. For long-running tasks like training, always set a custom timeout.
577
+
578
+ ### Setting Timeouts
579
+
580
+ **MCP Tool:**
581
+ ```python
582
+ {
583
+ "timeout": "2h" # 2 hours
584
+ }
585
+ ```
586
+
587
+ **Supported formats:**
588
+ - Integer/float: seconds (e.g., `300` = 5 minutes)
589
+ - String with suffix: `"5m"` (minutes), `"2h"` (hours), `"1d"` (days)
590
+ - Examples: `"90m"`, `"2h"`, `"1.5h"`, `300`, `"1d"`
591
+
592
+ **Python API:**
593
+ ```python
594
+ from huggingface_hub import run_job, run_uv_job
595
+
596
+ run_job(image="python:3.12", command=[...], timeout="2h")
597
+ run_uv_job("script.py", timeout=7200) # 2 hours in seconds
598
+ ```
599
+
600
+ ### Timeout Guidelines
601
+
602
+ | Scenario | Recommended | Notes |
603
+ |----------|-------------|-------|
604
+ | Quick test | 10-30 min | Verify setup |
605
+ | Data processing | 1-2 hours | Depends on data size |
606
+ | Batch inference | 2-4 hours | Large batches |
607
+ | Experiments | 4-8 hours | Multiple runs |
608
+ | Long-running | 8-24 hours | Production workloads |
609
+
610
+ **Always add 20-30% buffer** for setup, network delays, and cleanup.
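The buffer rule can be sketched as a one-liner; `timeout_with_buffer` is a hypothetical helper name:

```python
# Derive a Jobs timeout string from an estimated runtime plus a safety buffer.
def timeout_with_buffer(estimated_minutes: float, buffer: float = 0.25) -> str:
    """Return a timeout string with a proportional buffer (default 25%)."""
    total = int(estimated_minutes * (1 + buffer))  # truncate to whole minutes
    return f"{total}m"

print(timeout_with_buffer(120))  # "150m" for a 2-hour estimate
```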
611
+
612
+ **On timeout:** The job is killed immediately and all unsaved progress is lost
613
+
614
+ ## Cost Estimation
615
+
616
+ **General guidelines:**
617
+
618
+ ```
619
+ Total Cost = (Hours of runtime) × (Cost per hour)
620
+ ```
621
+
622
+ **Example calculations:**
623
+
624
+ **Quick test:**
625
+ - Hardware: cpu-basic ($0.10/hour)
626
+ - Time: 15 minutes (0.25 hours)
627
+ - Cost: $0.03
628
+
629
+ **Data processing:**
630
+ - Hardware: l4x1 ($2.50/hour)
631
+ - Time: 2 hours
632
+ - Cost: $5.00
633
+
634
+ **Batch inference:**
635
+ - Hardware: a10g-large ($5/hour)
636
+ - Time: 4 hours
637
+ - Cost: $20.00
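The example calculations above can be reproduced with a few lines. The per-hour rates are illustrative values copied from this section, not authoritative pricing; check current Hub pricing before budgeting.

```python
# Quick cost estimate: Total Cost = hours of runtime x cost per hour.
# Rates below are illustrative only, taken from the examples in this section.
ILLUSTRATIVE_RATES = {"cpu-basic": 0.10, "l4x1": 2.50, "a10g-large": 5.00}

def estimate_cost(flavor: str, hours: float) -> float:
    """Estimated USD cost for a job of the given flavor and runtime."""
    return round(ILLUSTRATIVE_RATES[flavor] * hours, 2)

print(estimate_cost("cpu-basic", 0.25))   # 0.03 (quick test)
print(estimate_cost("a10g-large", 4))     # 20.0 (batch inference)
```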
638
+
639
+ **Cost optimization tips:**
640
+ 1. Start small - Test on cpu-basic or t4-small
641
+ 2. Monitor runtime - Set appropriate timeouts
642
+ 3. Use checkpoints - Resume if job fails
643
+ 4. Optimize code - Reduce unnecessary compute
644
+ 5. Choose right hardware - Don't over-provision
645
+
646
+ ## Monitoring and Tracking
647
+
648
+ ### Check Job Status
649
+
650
+ **MCP Tool:**
651
+ ```python
652
+ # List all jobs
653
+ hf_jobs("ps")
654
+
655
+ # Inspect specific job
656
+ hf_jobs("inspect", {"job_id": "your-job-id"})
657
+
658
+ # View logs
659
+ hf_jobs("logs", {"job_id": "your-job-id"})
660
+
661
+ # Cancel a job
662
+ hf_jobs("cancel", {"job_id": "your-job-id"})
663
+ ```
664
+
665
+ **Python API:**
666
+ ```python
667
+ from huggingface_hub import list_jobs, inspect_job, fetch_job_logs, cancel_job
668
+
669
+ # List your jobs
670
+ jobs = list_jobs()
671
+
672
+ # List running jobs only
673
+ running = [j for j in list_jobs() if j.status.stage == "RUNNING"]
674
+
675
+ # Inspect specific job
676
+ job_info = inspect_job(job_id="your-job-id")
677
+
678
+ # View logs
679
+ for log in fetch_job_logs(job_id="your-job-id"):
680
+ print(log)
681
+
682
+ # Cancel a job
683
+ cancel_job(job_id="your-job-id")
684
+ ```
685
+
686
+ **CLI:**
687
+ ```bash
688
+ hf jobs ps # List jobs
689
+ hf jobs logs <job-id> # View logs
690
+ hf jobs cancel <job-id> # Cancel job
691
+ ```
692
+
693
+ **Remember:** Wait for user to request status checks. Avoid polling repeatedly.
694
+
695
+ ### Job URLs
696
+
697
+ After submission, jobs have monitoring URLs:
698
+ ```
699
+ https://huggingface.co/jobs/username/job-id
700
+ ```
701
+
702
+ View logs, status, and details in the browser.
703
+
704
+ ### Wait for Multiple Jobs
705
+
706
+ ```python
707
+ import time
708
+ from huggingface_hub import inspect_job, run_job
709
+
710
+ # Run multiple jobs
711
+ jobs = [run_job(image=img, command=cmd) for img, cmd in workloads]
712
+
713
+ # Wait for all to complete
714
+ for job in jobs:
715
+ while inspect_job(job_id=job.id).status.stage not in ("COMPLETED", "ERROR"):
716
+ time.sleep(10)
717
+ ```
718
+
719
+ ## Scheduled Jobs
720
+
721
+ Run jobs on a schedule using CRON expressions or predefined schedules.
722
+
723
+ **MCP Tool:**
724
+ ```python
725
+ # Schedule a UV script that runs every hour
726
+ hf_jobs("scheduled uv", {
727
+ "script": "your_script.py",
728
+ "schedule": "@hourly",
729
+ "flavor": "cpu-basic"
730
+ })
731
+
732
+ # Schedule with CRON syntax
733
+ hf_jobs("scheduled uv", {
734
+ "script": "your_script.py",
735
+ "schedule": "0 9 * * 1", # 9 AM every Monday
736
+ "flavor": "cpu-basic"
737
+ })
738
+
739
+ # Schedule a Docker-based job
740
+ hf_jobs("scheduled run", {
741
+ "image": "python:3.12",
742
+ "command": ["python", "-c", "print('Scheduled!')"],
743
+ "schedule": "@daily",
744
+ "flavor": "cpu-basic"
745
+ })
746
+ ```
747
+
748
+ **Python API:**
749
+ ```python
750
+ from huggingface_hub import create_scheduled_job, create_scheduled_uv_job
751
+
752
+ # Schedule a Docker job
753
+ create_scheduled_job(
754
+ image="python:3.12",
755
+ command=["python", "-c", "print('Running on schedule!')"],
756
+ schedule="@hourly"
757
+ )
758
+
759
+ # Schedule a UV script
760
+ create_scheduled_uv_job("my_script.py", schedule="@daily", flavor="cpu-basic")
761
+
762
+ # Schedule with GPU
763
+ create_scheduled_uv_job(
764
+ "ml_inference.py",
765
+ schedule="0 */6 * * *", # Every 6 hours
766
+ flavor="a10g-small"
767
+ )
768
+ ```
769
+
770
+ **Available schedules:**
771
+ - `@annually`, `@yearly` - Once per year
772
+ - `@monthly` - Once per month
773
+ - `@weekly` - Once per week
774
+ - `@daily` - Once per day
775
+ - `@hourly` - Once per hour
776
+ - CRON expression - Custom schedule (e.g., `"*/5 * * * *"` for every 5 minutes)
777
+
778
+ **Manage scheduled jobs:**
779
+ ```python
780
+ # MCP Tool
781
+ hf_jobs("scheduled ps") # List scheduled jobs
782
+ hf_jobs("scheduled inspect", {"job_id": "..."}) # Inspect details
783
+ hf_jobs("scheduled suspend", {"job_id": "..."}) # Pause
784
+ hf_jobs("scheduled resume", {"job_id": "..."}) # Resume
785
+ hf_jobs("scheduled delete", {"job_id": "..."}) # Delete
786
+ ```
787
+
788
+ **Python API for management:**
789
+ ```python
790
+ from huggingface_hub import (
791
+ list_scheduled_jobs,
792
+ inspect_scheduled_job,
793
+ suspend_scheduled_job,
794
+ resume_scheduled_job,
795
+ delete_scheduled_job
796
+ )
797
+
798
+ # List all scheduled jobs
799
+ scheduled = list_scheduled_jobs()
800
+
801
+ # Inspect a scheduled job
802
+ info = inspect_scheduled_job(scheduled_job_id)
803
+
804
+ # Suspend (pause) a scheduled job
805
+ suspend_scheduled_job(scheduled_job_id)
806
+
807
+ # Resume a scheduled job
808
+ resume_scheduled_job(scheduled_job_id)
809
+
810
+ # Delete a scheduled job
811
+ delete_scheduled_job(scheduled_job_id)
812
+ ```
813
+
814
+ ## Webhooks: Trigger Jobs on Events
815
+
816
+ Trigger jobs automatically when changes happen in Hugging Face repositories.
817
+
818
+ **Python API:**
819
+ ```python
820
+ from huggingface_hub import create_webhook
821
+
822
+ # Create webhook that triggers a job when a repo changes
823
+ webhook = create_webhook(
824
+ job_id=job.id,
825
+ watched=[
826
+ {"type": "user", "name": "your-username"},
827
+ {"type": "org", "name": "your-org-name"}
828
+ ],
829
+ domains=["repo", "discussion"],
830
+ secret="your-secret"
831
+ )
832
+ ```
833
+
834
+ **How it works:**
835
+ 1. Webhook listens for changes in watched repositories
836
+ 2. When triggered, the job runs with `WEBHOOK_PAYLOAD` environment variable
837
+ 3. Your script can parse the payload to understand what changed
838
+
839
+ **Use cases:**
840
+ - Auto-process new datasets when uploaded
841
+ - Trigger inference when models are updated
842
+ - Run tests when code changes
843
+ - Generate reports on repository activity
844
+
845
+ **Access webhook payload in script:**
846
+ ```python
847
+ import os
848
+ import json
849
+
850
+ payload = json.loads(os.environ.get("WEBHOOK_PAYLOAD", "{}"))
851
+ print(f"Event type: {payload.get('event', {}).get('action')}")
852
+ ```
853
+
854
+ See [Webhooks Documentation](https://huggingface.co/docs/huggingface_hub/guides/webhooks) for more details.
855
+
856
+ ## Common Workload Patterns
857
+
858
+ This repository ships ready-to-run UV scripts in `hf-jobs/scripts/`. Prefer using them instead of inventing new templates.
859
+
860
+ ### Pattern 1: Dataset → Model Responses (vLLM) — `scripts/generate-responses.py`
861
+
862
+ **What it does:** loads a Hub dataset (chat `messages` or a `prompt` column), applies a model chat template, generates responses with vLLM, and **pushes** the output dataset + dataset card back to the Hub.
863
+
864
+ **Requires:** GPU + **write** token (it pushes a dataset).
865
+
866
+ ```python
867
+ from pathlib import Path
868
+
869
+ script = Path("hf-jobs/scripts/generate-responses.py").read_text()
870
+ hf_jobs("uv", {
871
+ "script": script,
872
+ "script_args": [
873
+ "username/input-dataset",
874
+ "username/output-dataset",
875
+ "--messages-column", "messages",
876
+ "--model-id", "Qwen/Qwen3-30B-A3B-Instruct-2507",
877
+ "--temperature", "0.7",
878
+ "--top-p", "0.8",
879
+ "--max-tokens", "2048",
880
+ ],
881
+ "flavor": "a10g-large",
882
+ "timeout": "4h",
883
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"},
884
+ })
885
+ ```
886
+
887
+ ### Pattern 2: CoT Self-Instruct Synthetic Data — `scripts/cot-self-instruct.py`
888
+
889
+ **What it does:** generates synthetic prompts/answers via CoT Self-Instruct, optionally filters outputs (answer-consistency / RIP), then **pushes** the generated dataset + dataset card to the Hub.
890
+
891
+ **Requires:** GPU + **write** token (it pushes a dataset).
892
+
893
+ ```python
894
+ from pathlib import Path
895
+
896
+ script = Path("hf-jobs/scripts/cot-self-instruct.py").read_text()
897
+ hf_jobs("uv", {
898
+ "script": script,
899
+ "script_args": [
900
+ "--seed-dataset", "davanstrien/s1k-reasoning",
901
+ "--output-dataset", "username/synthetic-math",
902
+ "--task-type", "reasoning",
903
+ "--num-samples", "5000",
904
+ "--filter-method", "answer-consistency",
905
+ ],
906
+ "flavor": "l4x4",
907
+ "timeout": "8h",
908
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"},
909
+ })
910
+ ```
911
+
912
+ ### Pattern 3: Streaming Dataset Stats (Polars + HF Hub) — `scripts/finepdfs-stats.py`
913
+
914
+ **What it does:** scans parquet directly from Hub (no 300GB download), computes temporal stats, and (optionally) uploads results to a Hub dataset repo.
915
+
916
+ **Requires:** CPU is often enough; token needed **only** if you pass `--output-repo` (upload).
917
+
918
+ ```python
919
+ from pathlib import Path
920
+
921
+ script = Path("hf-jobs/scripts/finepdfs-stats.py").read_text()
922
+ hf_jobs("uv", {
923
+ "script": script,
924
+ "script_args": [
925
+ "--limit", "10000",
926
+ "--show-plan",
927
+ "--output-repo", "username/finepdfs-temporal-stats",
928
+ ],
929
+ "flavor": "cpu-upgrade",
930
+ "timeout": "2h",
931
+ "env": {"HF_XET_HIGH_PERFORMANCE": "1"},
932
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"},
933
+ })
934
+ ```
935
+
936
+ ## Common Failure Modes
937
+
938
+ ### Out of Memory (OOM)
939
+
940
+ **Fix:**
941
+ 1. Reduce batch size or data chunk size
942
+ 2. Process data in smaller batches
943
+ 3. Upgrade hardware: cpu → t4 → a10g → a100
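Fix 2 (smaller batches) can look like this minimal sketch, where the doubling step stands in for your real per-batch work (e.g. model inference):

```python
# Process data in fixed-size chunks so peak memory stays bounded.
def chunks(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

results = []
for batch in chunks(list(range(10_000)), size=512):
    # Replace this with your real per-batch work.
    results.extend(x * 2 for x in batch)

print(len(results))  # 10000
```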
944
+
945
+ ### Job Timeout
946
+
947
+ **Fix:**
948
+ 1. Check logs for actual runtime
949
+ 2. Increase timeout with buffer: `"timeout": "3h"`
950
+ 3. Optimize code for faster execution
951
+ 4. Process data in chunks
952
+
953
+ ### Hub Push Failures
954
+
955
+ **Fix:**
956
+ 1. Add to job: `secrets={"HF_TOKEN": "$HF_TOKEN"}`
957
+ 2. Verify token in script: `assert "HF_TOKEN" in os.environ`
958
+ 3. Check token permissions
959
+ 4. Verify repo exists or can be created
960
+
961
+ ### Missing Dependencies
962
+
963
+ **Fix:**
964
+ Add to PEP 723 header:
965
+ ```python
966
+ # /// script
967
+ # dependencies = ["package1", "package2>=1.0.0"]
968
+ # ///
969
+ ```
970
+
971
+ ### Authentication Errors
972
+
973
+ **Fix:**
974
+ 1. Check `hf_whoami()` works locally
975
+ 2. Verify `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config
976
+ 3. Re-login: `hf auth login`
977
+ 4. Check token has required permissions
978
+
979
+ ## Troubleshooting
980
+
981
+ **Common issues:**
982
+ - Job times out → Increase timeout, optimize code
983
+ - Results not saved → Check persistence method, verify HF_TOKEN
984
+ - Out of Memory → Reduce batch size, upgrade hardware
985
+ - Import errors → Add dependencies to PEP 723 header
986
+ - Authentication errors → Check token, verify secrets parameter
987
+
988
+ **See:** `references/troubleshooting.md` for complete troubleshooting guide
989
+
990
+ ## Resources
991
+
992
+ ### References (In This Skill)
993
+ - `references/token_usage.md` - Complete token usage guide
994
+ - `references/hardware_guide.md` - Hardware specs and selection
995
+ - `references/hub_saving.md` - Hub persistence guide
996
+ - `references/troubleshooting.md` - Common issues and solutions
997
+
998
+ ### Scripts (In This Skill)
999
+ - `scripts/generate-responses.py` - vLLM batch generation: dataset → responses → push to Hub
1000
+ - `scripts/cot-self-instruct.py` - CoT Self-Instruct synthetic data generation + filtering → push to Hub
1001
+ - `scripts/finepdfs-stats.py` - Polars streaming stats over `finepdfs-edu` parquet on Hub (optional push)
1002
+
1003
+ ### External Links
1004
+
1005
+ **Official Documentation:**
1006
+ - [HF Jobs Guide](https://huggingface.co/docs/huggingface_hub/guides/jobs) - Main documentation
1007
+ - [HF Jobs CLI Reference](https://huggingface.co/docs/huggingface_hub/guides/cli#hf-jobs) - Command line interface
1008
+ - [HF Jobs API Reference](https://huggingface.co/docs/huggingface_hub/package_reference/hf_api) - Python API details
1009
+ - [Hardware Flavors Reference](https://huggingface.co/docs/hub/en/spaces-config-reference) - Available hardware
1010
+
1011
+ **Related Tools:**
1012
+ - [UV Scripts Guide](https://docs.astral.sh/uv/guides/scripts/) - PEP 723 inline dependencies
1013
+ - [UV Scripts Organization](https://huggingface.co/uv-scripts) - Community UV script collection
1014
+ - [HF Hub Authentication](https://huggingface.co/docs/huggingface_hub/quick-start#authentication) - Token setup
1015
+ - [Webhooks Documentation](https://huggingface.co/docs/huggingface_hub/guides/webhooks) - Event triggers
1016
+
1017
+ ## Key Takeaways
1018
+
1019
+ 1. **Submit scripts inline** - The `script` parameter accepts Python code directly; no file saving required unless user requests
1020
+ 2. **Jobs are asynchronous** - Don't wait/poll; let user check when ready
1021
+ 3. **Always set timeout** - Default 30 min may be insufficient; set appropriate timeout
1022
+ 4. **Always persist results** - Environment is ephemeral; without persistence, all work is lost
1023
+ 5. **Use tokens securely** - Always use `secrets={"HF_TOKEN": "$HF_TOKEN"}` for Hub operations
1024
+ 6. **Choose appropriate hardware** - Start small, scale up based on needs (see hardware guide)
1025
+ 7. **Use UV scripts** - Default to `hf_jobs("uv", {...})` with inline scripts for Python workloads
1026
+ 8. **Handle authentication** - Verify tokens are available before Hub operations
1027
+ 9. **Monitor jobs** - Provide job URLs and status check commands
1028
+ 10. **Optimize costs** - Choose right hardware, set appropriate timeouts
1029
+
1030
+ ## Quick Reference: MCP Tool vs CLI vs Python API
1031
+
1032
+ | Operation | MCP Tool | CLI | Python API |
1033
+ |-----------|----------|-----|------------|
1034
+ | Run UV script | `hf_jobs("uv", {...})` | `hf jobs uv run script.py` | `run_uv_job("script.py")` |
1035
+ | Run Docker job | `hf_jobs("run", {...})` | `hf jobs run image cmd` | `run_job(image, command)` |
1036
+ | List jobs | `hf_jobs("ps")` | `hf jobs ps` | `list_jobs()` |
1037
+ | View logs | `hf_jobs("logs", {...})` | `hf jobs logs <id>` | `fetch_job_logs(job_id)` |
1038
+ | Cancel job | `hf_jobs("cancel", {...})` | `hf jobs cancel <id>` | `cancel_job(job_id)` |
1039
+ | Schedule UV | `hf_jobs("scheduled uv", {...})` | - | `create_scheduled_uv_job()` |
1040
+ | Schedule Docker | `hf_jobs("scheduled run", {...})` | - | `create_scheduled_job()` |
1041
+