@synsci/cli-darwin-x64 1.1.49

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (373)
  1. package/bin/skills/accelerate/SKILL.md +332 -0
  2. package/bin/skills/accelerate/references/custom-plugins.md +453 -0
  3. package/bin/skills/accelerate/references/megatron-integration.md +489 -0
  4. package/bin/skills/accelerate/references/performance.md +525 -0
  5. package/bin/skills/audiocraft/SKILL.md +564 -0
  6. package/bin/skills/audiocraft/references/advanced-usage.md +666 -0
  7. package/bin/skills/audiocraft/references/troubleshooting.md +504 -0
  8. package/bin/skills/autogpt/SKILL.md +403 -0
  9. package/bin/skills/autogpt/references/advanced-usage.md +535 -0
  10. package/bin/skills/autogpt/references/troubleshooting.md +420 -0
  11. package/bin/skills/awq/SKILL.md +310 -0
  12. package/bin/skills/awq/references/advanced-usage.md +324 -0
  13. package/bin/skills/awq/references/troubleshooting.md +344 -0
  14. package/bin/skills/axolotl/SKILL.md +158 -0
  15. package/bin/skills/axolotl/references/api.md +5548 -0
  16. package/bin/skills/axolotl/references/dataset-formats.md +1029 -0
  17. package/bin/skills/axolotl/references/index.md +15 -0
  18. package/bin/skills/axolotl/references/other.md +3563 -0
  19. package/bin/skills/bigcode-evaluation-harness/SKILL.md +405 -0
  20. package/bin/skills/bigcode-evaluation-harness/references/benchmarks.md +393 -0
  21. package/bin/skills/bigcode-evaluation-harness/references/custom-tasks.md +424 -0
  22. package/bin/skills/bigcode-evaluation-harness/references/issues.md +394 -0
  23. package/bin/skills/bitsandbytes/SKILL.md +411 -0
  24. package/bin/skills/bitsandbytes/references/memory-optimization.md +521 -0
  25. package/bin/skills/bitsandbytes/references/qlora-training.md +521 -0
  26. package/bin/skills/bitsandbytes/references/quantization-formats.md +447 -0
  27. package/bin/skills/blip-2/SKILL.md +564 -0
  28. package/bin/skills/blip-2/references/advanced-usage.md +680 -0
  29. package/bin/skills/blip-2/references/troubleshooting.md +526 -0
  30. package/bin/skills/chroma/SKILL.md +406 -0
  31. package/bin/skills/chroma/references/integration.md +38 -0
  32. package/bin/skills/clip/SKILL.md +253 -0
  33. package/bin/skills/clip/references/applications.md +207 -0
  34. package/bin/skills/constitutional-ai/SKILL.md +290 -0
  35. package/bin/skills/crewai/SKILL.md +498 -0
  36. package/bin/skills/crewai/references/flows.md +438 -0
  37. package/bin/skills/crewai/references/tools.md +429 -0
  38. package/bin/skills/crewai/references/troubleshooting.md +480 -0
  39. package/bin/skills/deepspeed/SKILL.md +141 -0
  40. package/bin/skills/deepspeed/references/08.md +17 -0
  41. package/bin/skills/deepspeed/references/09.md +173 -0
  42. package/bin/skills/deepspeed/references/2020.md +378 -0
  43. package/bin/skills/deepspeed/references/2023.md +279 -0
  44. package/bin/skills/deepspeed/references/assets.md +179 -0
  45. package/bin/skills/deepspeed/references/index.md +35 -0
  46. package/bin/skills/deepspeed/references/mii.md +118 -0
  47. package/bin/skills/deepspeed/references/other.md +1191 -0
  48. package/bin/skills/deepspeed/references/tutorials.md +6554 -0
  49. package/bin/skills/dspy/SKILL.md +590 -0
  50. package/bin/skills/dspy/references/examples.md +663 -0
  51. package/bin/skills/dspy/references/modules.md +475 -0
  52. package/bin/skills/dspy/references/optimizers.md +566 -0
  53. package/bin/skills/faiss/SKILL.md +221 -0
  54. package/bin/skills/faiss/references/index_types.md +280 -0
  55. package/bin/skills/flash-attention/SKILL.md +367 -0
  56. package/bin/skills/flash-attention/references/benchmarks.md +215 -0
  57. package/bin/skills/flash-attention/references/transformers-integration.md +293 -0
  58. package/bin/skills/gguf/SKILL.md +427 -0
  59. package/bin/skills/gguf/references/advanced-usage.md +504 -0
  60. package/bin/skills/gguf/references/troubleshooting.md +442 -0
  61. package/bin/skills/gptq/SKILL.md +450 -0
  62. package/bin/skills/gptq/references/calibration.md +337 -0
  63. package/bin/skills/gptq/references/integration.md +129 -0
  64. package/bin/skills/gptq/references/troubleshooting.md +95 -0
  65. package/bin/skills/grpo-rl-training/README.md +97 -0
  66. package/bin/skills/grpo-rl-training/SKILL.md +572 -0
  67. package/bin/skills/grpo-rl-training/examples/reward_functions_library.py +393 -0
  68. package/bin/skills/grpo-rl-training/templates/basic_grpo_training.py +228 -0
  69. package/bin/skills/guidance/SKILL.md +572 -0
  70. package/bin/skills/guidance/references/backends.md +554 -0
  71. package/bin/skills/guidance/references/constraints.md +674 -0
  72. package/bin/skills/guidance/references/examples.md +767 -0
  73. package/bin/skills/hqq/SKILL.md +445 -0
  74. package/bin/skills/hqq/references/advanced-usage.md +528 -0
  75. package/bin/skills/hqq/references/troubleshooting.md +503 -0
  76. package/bin/skills/hugging-face-cli/SKILL.md +191 -0
  77. package/bin/skills/hugging-face-cli/references/commands.md +954 -0
  78. package/bin/skills/hugging-face-cli/references/examples.md +374 -0
  79. package/bin/skills/hugging-face-datasets/SKILL.md +547 -0
  80. package/bin/skills/hugging-face-datasets/examples/diverse_training_examples.json +239 -0
  81. package/bin/skills/hugging-face-datasets/examples/system_prompt_template.txt +196 -0
  82. package/bin/skills/hugging-face-datasets/examples/training_examples.json +176 -0
  83. package/bin/skills/hugging-face-datasets/scripts/dataset_manager.py +522 -0
  84. package/bin/skills/hugging-face-datasets/scripts/sql_manager.py +844 -0
  85. package/bin/skills/hugging-face-datasets/templates/chat.json +55 -0
  86. package/bin/skills/hugging-face-datasets/templates/classification.json +62 -0
  87. package/bin/skills/hugging-face-datasets/templates/completion.json +51 -0
  88. package/bin/skills/hugging-face-datasets/templates/custom.json +75 -0
  89. package/bin/skills/hugging-face-datasets/templates/qa.json +54 -0
  90. package/bin/skills/hugging-face-datasets/templates/tabular.json +81 -0
  91. package/bin/skills/hugging-face-evaluation/SKILL.md +656 -0
  92. package/bin/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +382 -0
  93. package/bin/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +141 -0
  94. package/bin/skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
  95. package/bin/skills/hugging-face-evaluation/examples/metric_mapping.json +50 -0
  96. package/bin/skills/hugging-face-evaluation/requirements.txt +20 -0
  97. package/bin/skills/hugging-face-evaluation/scripts/evaluation_manager.py +1374 -0
  98. package/bin/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +104 -0
  99. package/bin/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +317 -0
  100. package/bin/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +303 -0
  101. package/bin/skills/hugging-face-evaluation/scripts/run_eval_job.py +98 -0
  102. package/bin/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +331 -0
  103. package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +206 -0
  104. package/bin/skills/hugging-face-jobs/SKILL.md +1041 -0
  105. package/bin/skills/hugging-face-jobs/index.html +216 -0
  106. package/bin/skills/hugging-face-jobs/references/hardware_guide.md +336 -0
  107. package/bin/skills/hugging-face-jobs/references/hub_saving.md +352 -0
  108. package/bin/skills/hugging-face-jobs/references/token_usage.md +546 -0
  109. package/bin/skills/hugging-face-jobs/references/troubleshooting.md +475 -0
  110. package/bin/skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
  111. package/bin/skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
  112. package/bin/skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
  113. package/bin/skills/hugging-face-model-trainer/SKILL.md +711 -0
  114. package/bin/skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
  115. package/bin/skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
  116. package/bin/skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
  117. package/bin/skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
  118. package/bin/skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
  119. package/bin/skills/hugging-face-model-trainer/references/training_methods.md +150 -0
  120. package/bin/skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
  121. package/bin/skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
  122. package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
  123. package/bin/skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
  124. package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
  125. package/bin/skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
  126. package/bin/skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
  127. package/bin/skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
  128. package/bin/skills/hugging-face-paper-publisher/SKILL.md +627 -0
  129. package/bin/skills/hugging-face-paper-publisher/examples/example_usage.md +327 -0
  130. package/bin/skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
  131. package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +508 -0
  132. package/bin/skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
  133. package/bin/skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
  134. package/bin/skills/hugging-face-paper-publisher/templates/modern.md +319 -0
  135. package/bin/skills/hugging-face-paper-publisher/templates/standard.md +201 -0
  136. package/bin/skills/hugging-face-tool-builder/SKILL.md +115 -0
  137. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.py +57 -0
  138. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.sh +40 -0
  139. package/bin/skills/hugging-face-tool-builder/references/baseline_hf_api.tsx +57 -0
  140. package/bin/skills/hugging-face-tool-builder/references/find_models_by_paper.sh +230 -0
  141. package/bin/skills/hugging-face-tool-builder/references/hf_enrich_models.sh +96 -0
  142. package/bin/skills/hugging-face-tool-builder/references/hf_model_card_frontmatter.sh +188 -0
  143. package/bin/skills/hugging-face-tool-builder/references/hf_model_papers_auth.sh +171 -0
  144. package/bin/skills/hugging-face-trackio/SKILL.md +65 -0
  145. package/bin/skills/hugging-face-trackio/references/logging_metrics.md +206 -0
  146. package/bin/skills/hugging-face-trackio/references/retrieving_metrics.md +223 -0
  147. package/bin/skills/huggingface-tokenizers/SKILL.md +516 -0
  148. package/bin/skills/huggingface-tokenizers/references/algorithms.md +653 -0
  149. package/bin/skills/huggingface-tokenizers/references/integration.md +637 -0
  150. package/bin/skills/huggingface-tokenizers/references/pipeline.md +723 -0
  151. package/bin/skills/huggingface-tokenizers/references/training.md +565 -0
  152. package/bin/skills/instructor/SKILL.md +740 -0
  153. package/bin/skills/instructor/references/examples.md +107 -0
  154. package/bin/skills/instructor/references/providers.md +70 -0
  155. package/bin/skills/instructor/references/validation.md +606 -0
  156. package/bin/skills/knowledge-distillation/SKILL.md +458 -0
  157. package/bin/skills/knowledge-distillation/references/minillm.md +334 -0
  158. package/bin/skills/lambda-labs/SKILL.md +545 -0
  159. package/bin/skills/lambda-labs/references/advanced-usage.md +611 -0
  160. package/bin/skills/lambda-labs/references/troubleshooting.md +530 -0
  161. package/bin/skills/langchain/SKILL.md +480 -0
  162. package/bin/skills/langchain/references/agents.md +499 -0
  163. package/bin/skills/langchain/references/integration.md +562 -0
  164. package/bin/skills/langchain/references/rag.md +600 -0
  165. package/bin/skills/langsmith/SKILL.md +422 -0
  166. package/bin/skills/langsmith/references/advanced-usage.md +548 -0
  167. package/bin/skills/langsmith/references/troubleshooting.md +537 -0
  168. package/bin/skills/litgpt/SKILL.md +469 -0
  169. package/bin/skills/litgpt/references/custom-models.md +568 -0
  170. package/bin/skills/litgpt/references/distributed-training.md +451 -0
  171. package/bin/skills/litgpt/references/supported-models.md +336 -0
  172. package/bin/skills/litgpt/references/training-recipes.md +619 -0
  173. package/bin/skills/llama-cpp/SKILL.md +258 -0
  174. package/bin/skills/llama-cpp/references/optimization.md +89 -0
  175. package/bin/skills/llama-cpp/references/quantization.md +213 -0
  176. package/bin/skills/llama-cpp/references/server.md +125 -0
  177. package/bin/skills/llama-factory/SKILL.md +80 -0
  178. package/bin/skills/llama-factory/references/_images.md +23 -0
  179. package/bin/skills/llama-factory/references/advanced.md +1055 -0
  180. package/bin/skills/llama-factory/references/getting_started.md +349 -0
  181. package/bin/skills/llama-factory/references/index.md +19 -0
  182. package/bin/skills/llama-factory/references/other.md +31 -0
  183. package/bin/skills/llamaguard/SKILL.md +337 -0
  184. package/bin/skills/llamaindex/SKILL.md +569 -0
  185. package/bin/skills/llamaindex/references/agents.md +83 -0
  186. package/bin/skills/llamaindex/references/data_connectors.md +108 -0
  187. package/bin/skills/llamaindex/references/query_engines.md +406 -0
  188. package/bin/skills/llava/SKILL.md +304 -0
  189. package/bin/skills/llava/references/training.md +197 -0
  190. package/bin/skills/lm-evaluation-harness/SKILL.md +490 -0
  191. package/bin/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
  192. package/bin/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
  193. package/bin/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
  194. package/bin/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
  195. package/bin/skills/long-context/SKILL.md +536 -0
  196. package/bin/skills/long-context/references/extension_methods.md +468 -0
  197. package/bin/skills/long-context/references/fine_tuning.md +611 -0
  198. package/bin/skills/long-context/references/rope.md +402 -0
  199. package/bin/skills/mamba/SKILL.md +260 -0
  200. package/bin/skills/mamba/references/architecture-details.md +206 -0
  201. package/bin/skills/mamba/references/benchmarks.md +255 -0
  202. package/bin/skills/mamba/references/training-guide.md +388 -0
  203. package/bin/skills/megatron-core/SKILL.md +366 -0
  204. package/bin/skills/megatron-core/references/benchmarks.md +249 -0
  205. package/bin/skills/megatron-core/references/parallelism-guide.md +404 -0
  206. package/bin/skills/megatron-core/references/production-examples.md +473 -0
  207. package/bin/skills/megatron-core/references/training-recipes.md +547 -0
  208. package/bin/skills/miles/SKILL.md +315 -0
  209. package/bin/skills/miles/references/api-reference.md +141 -0
  210. package/bin/skills/miles/references/troubleshooting.md +352 -0
  211. package/bin/skills/mlflow/SKILL.md +704 -0
  212. package/bin/skills/mlflow/references/deployment.md +744 -0
  213. package/bin/skills/mlflow/references/model-registry.md +770 -0
  214. package/bin/skills/mlflow/references/tracking.md +680 -0
  215. package/bin/skills/modal/SKILL.md +341 -0
  216. package/bin/skills/modal/references/advanced-usage.md +503 -0
  217. package/bin/skills/modal/references/troubleshooting.md +494 -0
  218. package/bin/skills/model-merging/SKILL.md +539 -0
  219. package/bin/skills/model-merging/references/evaluation.md +462 -0
  220. package/bin/skills/model-merging/references/examples.md +428 -0
  221. package/bin/skills/model-merging/references/methods.md +352 -0
  222. package/bin/skills/model-pruning/SKILL.md +495 -0
  223. package/bin/skills/model-pruning/references/wanda.md +347 -0
  224. package/bin/skills/moe-training/SKILL.md +526 -0
  225. package/bin/skills/moe-training/references/architectures.md +432 -0
  226. package/bin/skills/moe-training/references/inference.md +348 -0
  227. package/bin/skills/moe-training/references/training.md +425 -0
  228. package/bin/skills/nanogpt/SKILL.md +290 -0
  229. package/bin/skills/nanogpt/references/architecture.md +382 -0
  230. package/bin/skills/nanogpt/references/data.md +476 -0
  231. package/bin/skills/nanogpt/references/training.md +564 -0
  232. package/bin/skills/nemo-curator/SKILL.md +383 -0
  233. package/bin/skills/nemo-curator/references/deduplication.md +87 -0
  234. package/bin/skills/nemo-curator/references/filtering.md +102 -0
  235. package/bin/skills/nemo-evaluator/SKILL.md +494 -0
  236. package/bin/skills/nemo-evaluator/references/adapter-system.md +340 -0
  237. package/bin/skills/nemo-evaluator/references/configuration.md +447 -0
  238. package/bin/skills/nemo-evaluator/references/custom-benchmarks.md +315 -0
  239. package/bin/skills/nemo-evaluator/references/execution-backends.md +361 -0
  240. package/bin/skills/nemo-guardrails/SKILL.md +297 -0
  241. package/bin/skills/nnsight/SKILL.md +436 -0
  242. package/bin/skills/nnsight/references/README.md +78 -0
  243. package/bin/skills/nnsight/references/api.md +344 -0
  244. package/bin/skills/nnsight/references/tutorials.md +300 -0
  245. package/bin/skills/openrlhf/SKILL.md +249 -0
  246. package/bin/skills/openrlhf/references/algorithm-comparison.md +404 -0
  247. package/bin/skills/openrlhf/references/custom-rewards.md +530 -0
  248. package/bin/skills/openrlhf/references/hybrid-engine.md +287 -0
  249. package/bin/skills/openrlhf/references/multi-node-training.md +454 -0
  250. package/bin/skills/outlines/SKILL.md +652 -0
  251. package/bin/skills/outlines/references/backends.md +615 -0
  252. package/bin/skills/outlines/references/examples.md +773 -0
  253. package/bin/skills/outlines/references/json_generation.md +652 -0
  254. package/bin/skills/peft/SKILL.md +431 -0
  255. package/bin/skills/peft/references/advanced-usage.md +514 -0
  256. package/bin/skills/peft/references/troubleshooting.md +480 -0
  257. package/bin/skills/phoenix/SKILL.md +475 -0
  258. package/bin/skills/phoenix/references/advanced-usage.md +619 -0
  259. package/bin/skills/phoenix/references/troubleshooting.md +538 -0
  260. package/bin/skills/pinecone/SKILL.md +358 -0
  261. package/bin/skills/pinecone/references/deployment.md +181 -0
  262. package/bin/skills/pytorch-fsdp/SKILL.md +126 -0
  263. package/bin/skills/pytorch-fsdp/references/index.md +7 -0
  264. package/bin/skills/pytorch-fsdp/references/other.md +4249 -0
  265. package/bin/skills/pytorch-lightning/SKILL.md +346 -0
  266. package/bin/skills/pytorch-lightning/references/callbacks.md +436 -0
  267. package/bin/skills/pytorch-lightning/references/distributed.md +490 -0
  268. package/bin/skills/pytorch-lightning/references/hyperparameter-tuning.md +556 -0
  269. package/bin/skills/pyvene/SKILL.md +473 -0
  270. package/bin/skills/pyvene/references/README.md +73 -0
  271. package/bin/skills/pyvene/references/api.md +383 -0
  272. package/bin/skills/pyvene/references/tutorials.md +376 -0
  273. package/bin/skills/qdrant/SKILL.md +493 -0
  274. package/bin/skills/qdrant/references/advanced-usage.md +648 -0
  275. package/bin/skills/qdrant/references/troubleshooting.md +631 -0
  276. package/bin/skills/ray-data/SKILL.md +326 -0
  277. package/bin/skills/ray-data/references/integration.md +82 -0
  278. package/bin/skills/ray-data/references/transformations.md +83 -0
  279. package/bin/skills/ray-train/SKILL.md +406 -0
  280. package/bin/skills/ray-train/references/multi-node.md +628 -0
  281. package/bin/skills/rwkv/SKILL.md +260 -0
  282. package/bin/skills/rwkv/references/architecture-details.md +344 -0
  283. package/bin/skills/rwkv/references/rwkv7.md +386 -0
  284. package/bin/skills/rwkv/references/state-management.md +369 -0
  285. package/bin/skills/saelens/SKILL.md +386 -0
  286. package/bin/skills/saelens/references/README.md +70 -0
  287. package/bin/skills/saelens/references/api.md +333 -0
  288. package/bin/skills/saelens/references/tutorials.md +318 -0
  289. package/bin/skills/segment-anything/SKILL.md +500 -0
  290. package/bin/skills/segment-anything/references/advanced-usage.md +589 -0
  291. package/bin/skills/segment-anything/references/troubleshooting.md +484 -0
  292. package/bin/skills/sentence-transformers/SKILL.md +255 -0
  293. package/bin/skills/sentence-transformers/references/models.md +123 -0
  294. package/bin/skills/sentencepiece/SKILL.md +235 -0
  295. package/bin/skills/sentencepiece/references/algorithms.md +200 -0
  296. package/bin/skills/sentencepiece/references/training.md +304 -0
  297. package/bin/skills/sglang/SKILL.md +442 -0
  298. package/bin/skills/sglang/references/deployment.md +490 -0
  299. package/bin/skills/sglang/references/radix-attention.md +413 -0
  300. package/bin/skills/sglang/references/structured-generation.md +541 -0
  301. package/bin/skills/simpo/SKILL.md +219 -0
  302. package/bin/skills/simpo/references/datasets.md +478 -0
  303. package/bin/skills/simpo/references/hyperparameters.md +452 -0
  304. package/bin/skills/simpo/references/loss-functions.md +350 -0
  305. package/bin/skills/skypilot/SKILL.md +509 -0
  306. package/bin/skills/skypilot/references/advanced-usage.md +491 -0
  307. package/bin/skills/skypilot/references/troubleshooting.md +570 -0
  308. package/bin/skills/slime/SKILL.md +464 -0
  309. package/bin/skills/slime/references/api-reference.md +392 -0
  310. package/bin/skills/slime/references/troubleshooting.md +386 -0
  311. package/bin/skills/speculative-decoding/SKILL.md +467 -0
  312. package/bin/skills/speculative-decoding/references/lookahead.md +309 -0
  313. package/bin/skills/speculative-decoding/references/medusa.md +350 -0
  314. package/bin/skills/stable-diffusion/SKILL.md +519 -0
  315. package/bin/skills/stable-diffusion/references/advanced-usage.md +716 -0
  316. package/bin/skills/stable-diffusion/references/troubleshooting.md +555 -0
  317. package/bin/skills/tensorboard/SKILL.md +629 -0
  318. package/bin/skills/tensorboard/references/integrations.md +638 -0
  319. package/bin/skills/tensorboard/references/profiling.md +545 -0
  320. package/bin/skills/tensorboard/references/visualization.md +620 -0
  321. package/bin/skills/tensorrt-llm/SKILL.md +187 -0
  322. package/bin/skills/tensorrt-llm/references/multi-gpu.md +298 -0
  323. package/bin/skills/tensorrt-llm/references/optimization.md +242 -0
  324. package/bin/skills/tensorrt-llm/references/serving.md +470 -0
  325. package/bin/skills/tinker/SKILL.md +362 -0
  326. package/bin/skills/tinker/references/api-reference.md +168 -0
  327. package/bin/skills/tinker/references/getting-started.md +157 -0
  328. package/bin/skills/tinker/references/loss-functions.md +163 -0
  329. package/bin/skills/tinker/references/models-and-lora.md +139 -0
  330. package/bin/skills/tinker/references/recipes.md +280 -0
  331. package/bin/skills/tinker/references/reinforcement-learning.md +212 -0
  332. package/bin/skills/tinker/references/rendering.md +243 -0
  333. package/bin/skills/tinker/references/supervised-learning.md +232 -0
  334. package/bin/skills/tinker-training-cost/SKILL.md +187 -0
  335. package/bin/skills/tinker-training-cost/scripts/calculate_cost.py +123 -0
  336. package/bin/skills/torchforge/SKILL.md +433 -0
  337. package/bin/skills/torchforge/references/api-reference.md +327 -0
  338. package/bin/skills/torchforge/references/troubleshooting.md +409 -0
  339. package/bin/skills/torchtitan/SKILL.md +358 -0
  340. package/bin/skills/torchtitan/references/checkpoint.md +181 -0
  341. package/bin/skills/torchtitan/references/custom-models.md +258 -0
  342. package/bin/skills/torchtitan/references/float8.md +133 -0
  343. package/bin/skills/torchtitan/references/fsdp.md +126 -0
  344. package/bin/skills/transformer-lens/SKILL.md +346 -0
  345. package/bin/skills/transformer-lens/references/README.md +54 -0
  346. package/bin/skills/transformer-lens/references/api.md +362 -0
  347. package/bin/skills/transformer-lens/references/tutorials.md +339 -0
  348. package/bin/skills/trl-fine-tuning/SKILL.md +455 -0
  349. package/bin/skills/trl-fine-tuning/references/dpo-variants.md +227 -0
  350. package/bin/skills/trl-fine-tuning/references/online-rl.md +82 -0
  351. package/bin/skills/trl-fine-tuning/references/reward-modeling.md +122 -0
  352. package/bin/skills/trl-fine-tuning/references/sft-training.md +168 -0
  353. package/bin/skills/unsloth/SKILL.md +80 -0
  354. package/bin/skills/unsloth/references/index.md +7 -0
  355. package/bin/skills/unsloth/references/llms-full.md +16799 -0
  356. package/bin/skills/unsloth/references/llms-txt.md +12044 -0
  357. package/bin/skills/unsloth/references/llms.md +82 -0
  358. package/bin/skills/verl/SKILL.md +391 -0
  359. package/bin/skills/verl/references/api-reference.md +301 -0
  360. package/bin/skills/verl/references/troubleshooting.md +391 -0
  361. package/bin/skills/vllm/SKILL.md +364 -0
  362. package/bin/skills/vllm/references/optimization.md +226 -0
  363. package/bin/skills/vllm/references/quantization.md +284 -0
  364. package/bin/skills/vllm/references/server-deployment.md +255 -0
  365. package/bin/skills/vllm/references/troubleshooting.md +447 -0
  366. package/bin/skills/weights-and-biases/SKILL.md +590 -0
  367. package/bin/skills/weights-and-biases/references/artifacts.md +584 -0
  368. package/bin/skills/weights-and-biases/references/integrations.md +700 -0
  369. package/bin/skills/weights-and-biases/references/sweeps.md +847 -0
  370. package/bin/skills/whisper/SKILL.md +317 -0
  371. package/bin/skills/whisper/references/languages.md +189 -0
  372. package/bin/synsc +0 -0
  373. package/package.json +10 -0
@@ -0,0 +1,327 @@
1
+ # torchforge API Reference
2
+
3
+ ## Architecture Overview
4
+
5
+ torchforge implements a fully asynchronous RL system built on:
6
+
7
+ - **Monarch**: PyTorch-native distributed coordination framework
8
+ - **TorchTitan**: Meta's production LLM training platform
9
+ - **vLLM**: High-throughput inference engine
10
+
11
+ ```
12
+ ┌─────────────────────────────────────────────────────────┐
13
+ │ Application Layer (Your Code)                           │
14
+ │ - Define reward models, loss functions, sampling        │
15
+ └─────────────────────┬───────────────────────────────────┘
16
+
17
+ ┌─────────────────────▼───────────────────────────────────┐
18
+ │ Forge API Layer                                         │
19
+ │ - ForgeActor, Service                                   │
20
+ │ - Async service interfaces                              │
21
+ └─────────────────────┬───────────────────────────────────┘
22
+
23
+ ┌─────────────────────▼───────────────────────────────────┐
24
+ │ Distributed Services (Monarch)                          │
25
+ │ ├── TitanTrainer (TorchTitan FSDP)                      │
26
+ │ ├── Generator (vLLM inference)                          │
27
+ │ └── ReferenceModel (frozen KL baseline)                 │
28
+ └─────────────────────────────────────────────────────────┘
29
+ ```
30
+
31
+ ## Core Classes
32
+
33
+ ### ForgeActor
34
+
35
+ Base class for Forge actors with configurable resource attributes.
36
+
37
+ **Location**: `forge.controller.actor.ForgeActor`
38
+
39
+ ```python
40
+ from forge.controller.actor import ForgeActor
41
+
42
+ class MyActor(ForgeActor):
43
+     procs = 1  # Number of processes
43
+     hosts = None  # Host distribution
44
+     with_gpus = True  # GPU allocation flag
45
+     num_replicas = 1  # Service replica count
46
+     mesh_name = None  # Process mesh identifier
48
+ ```
49
+
50
+ **Class Methods**:
51
+ - `as_actor(*args, **actor_kwargs)` → Spawns single actor using .options() configuration
52
+ - `launch(*args, **kwargs)` → Provisions and deploys new actor replica
53
+ - `options(*, procs=1, hosts=None, with_gpus=False, num_replicas=1, mesh_name=None, **kwargs)` → Pre-configures actor class
54
+ - `shutdown(actor)` → Terminates actor instance
55
+
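The `options()` pre-configuration pattern listed above can be illustrated with a small stand-in. This is a minimal sketch, not the forge implementation: `SketchActor` is a hypothetical class that only mirrors the documented attribute names and the keyword-only signature of `options()`, and returning a configured subclass is an assumption about how such a method could work.

```python
class SketchActor:
    """Hypothetical stand-in for a ForgeActor-style base class (not forge code)."""

    procs = 1
    hosts = None
    with_gpus = False
    num_replicas = 1
    mesh_name = None

    @classmethod
    def options(cls, *, procs=1, hosts=None, with_gpus=False,
                num_replicas=1, mesh_name=None):
        # Build a new subclass that carries the overrides as class attributes.
        # The class that options() was called on is left unchanged.
        overrides = {
            "procs": procs,
            "hosts": hosts,
            "with_gpus": with_gpus,
            "num_replicas": num_replicas,
            "mesh_name": mesh_name,
        }
        return type(cls.__name__, (cls,), overrides)


class MyActor(SketchActor):
    with_gpus = True


# Pre-configure resources first; spawning would happen in a later step.
configured = MyActor.options(procs=4, with_gpus=True, num_replicas=2)
```

In this sketch, unspecified keywords fall back to the defaults in the `options()` signature rather than to the values set on `MyActor`, which is why `with_gpus=True` is passed again.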
56
+ ### TitanTrainer
57
+
58
+ Generic trainer actor built on TorchTitan's training engine.
59
+
60
+ **Location**: `forge.actors.trainer.TitanTrainer`
61
+
62
+ **Key Methods**:
63
+ - `forward_backward(batch)` → Forward and backward pass
64
+ - `train_step()` → Complete training step
65
+ - `setup()` / `cleanup()` → Lifecycle methods
66
+ - `clear_gradients()` → Reset gradients
67
+ - `save()` / `load()` → Checkpoint operations
68
+ - `push_weights()` → Sync weights to inference
69
+ - `get_config()` / `get_status()` → Introspection
70
+
71
+ **Properties**: `job`, `model`, `optimizer`, `lr_scheduler`, `training`, `parallelism`, `checkpoint`, `activation_checkpoint`, `compile`, `quantize`, `comm`, `memory_estimation`, `state_dict_key`
72
+
73
+ ### Generator
74
+
75
+ vLLM-based generator for inference.
76
+
77
+ **Location**: `forge.actors.generator.Generator`
78
+
79
+ ```python
80
+ from forge.actors.generator import Generator
81
+
82
+ generator = Generator(
83
+     engine_args=<factory>,
83
+     sampling_params=<factory>,
84
+     prefetch_weights_to_shm=True,
85
+     n_fetcher_procs=8
87
+ )
88
+ ```
89
+
90
+ **Key Methods**:
91
+ - `generate()` → Generate completions
92
+ - `run()` → Async generation loop
93
+ - `update_weights()` → Receive new weights from trainer
94
+ - `get_version()` / `get_vllm_config()` → Introspection
95
+
96
+ **Returns**: `Completion` dataclass with fields: `prompt`, `text`, `token_ids`, `logprobs`
97
+
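As an illustration of the `Completion` fields listed above, the sketch below defines a look-alike dataclass and sums its per-token log-probabilities. The field types, the defaults, and the `sequence_logprob` helper are assumptions made for this example; only the four field names come from the reference.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CompletionSketch:
    """Hypothetical mirror of the documented fields; types are assumed."""

    prompt: str
    text: str
    token_ids: List[int] = field(default_factory=list)
    logprobs: Optional[List[float]] = None


def sequence_logprob(completion: CompletionSketch) -> float:
    # Sum of per-token log-probabilities; 0.0 when none were recorded.
    if completion.logprobs is None:
        return 0.0
    return sum(completion.logprobs)


example = CompletionSketch(
    prompt="2 + 2 =",
    text=" 4",
    token_ids=[220, 19],  # placeholder ids, not real tokenizer output
    logprobs=[-0.1, -0.2],
)
total = sequence_logprob(example)
```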
98
+ ### ReferenceModel
99
+
100
+ Frozen policy copy for computing KL divergence.
101
+
102
+ **Location**: `forge.actors.reference_model.ReferenceModel`
103
+
104
+ Maintains a frozen copy of the policy and runs it without gradient computation to provide the reference for the KL term.
105
+
106
+ **Key Methods**:
107
+ - `forward()` → Inference without gradients
108
+ - `setup()` → Initialize from checkpoint
109
+
110
+ ### Service
111
+
112
+ Actor-less service implementation for managing replicas.
113
+
114
+ **Location**: `forge.controller.service.service.Service`
115
+
116
+ ```python
117
+ Service(cfg, actor_def, actor_args, actor_kwargs)
118
+ ```
119
+
120
+ **Methods**:
121
+ - `call_all(function, *args, **kwargs)` → Call function on all healthy replicas
122
+ - `get_metrics()` → Returns ServiceMetrics object
123
+ - `start_session()` / `terminate_session(sess_id)` → Session management
124
+ - `stop()` → Stop service and all replicas
125
+
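The fan-out behaviour described for `call_all` (invoke a function on every healthy replica) can be sketched with plain `asyncio`. This is a simplified stand-in, not the Service implementation: `ReplicaSketch`, its `healthy` flag, and passing the function by name are all assumptions for illustration.

```python
import asyncio


class ReplicaSketch:
    """Hypothetical replica with a health flag and one async endpoint."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    async def get_version(self):
        await asyncio.sleep(0)  # yield control, as a real remote call would
        return f"{self.name}:v1"


async def call_all(replicas, function, *args, **kwargs):
    # Skip unhealthy replicas, then run the remaining calls concurrently.
    healthy = [r for r in replicas if r.healthy]
    calls = [getattr(r, function)(*args, **kwargs) for r in healthy]
    # gather() returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*calls)


replicas = [
    ReplicaSketch("r0"),
    ReplicaSketch("r1", healthy=False),
    ReplicaSketch("r2"),
]
results = asyncio.run(call_all(replicas, "get_version"))
```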
126
+ ## Configuration (TorchTitan)
127
+
128
+ torchforge uses TorchTitan's configuration system:
129
+
130
+ ### Job Configuration
131
+
132
+ ```python
133
+ from torchtitan.config.job_config import Job
134
+
135
+ @dataclass
136
+ class Job:
137
+     config_file: str
137
+     dump_folder: str
138
+     description: str
139
+     print_config: bool
140
+     custom_config_module: str
142
+ ```
143
+
144
+ ### Model Configuration
145
+
146
+ ```python
147
+ from torchtitan.config.job_config import Model
148
+
149
+ @dataclass
150
+ class Model:
151
+     name: str
151
+     flavor: str
152
+     hf_assets_path: str
153
+     tokenizer_path: str
154
+     converters: list
155
+     print_after_conversion: bool
157
+ ```
158
+
159
+ ### Training Configuration
160
+
161
+ ```python
162
+ from torchtitan.config.job_config import Training
163
+
164
+ @dataclass
165
+ class Training:
166
+     dataset: str
166
+     dataset_path: str
167
+     local_batch_size: int
168
+     global_batch_size: int
169
+     seq_len: int
170
+     max_norm: float
171
+     steps: int
172
+     dtype: str
173
+     mixed_precision_param: str
174
+     mixed_precision_reduce: str
175
+     gc_freq: int
176
+     seed: int
177
+     deterministic: bool
178
+     enable_cpu_offload: bool
179
+     # ... additional fields
181
+ ```
182
+
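The relationship between `local_batch_size`, `global_batch_size`, and `seq_len` can be made concrete with a little arithmetic. The sketch below assumes the common convention that the global batch equals the local batch times the data-parallel degree times the number of gradient-accumulation steps; the helper names are hypothetical, and this is not presented as TorchTitan's own code.

```python
def gradient_accumulation_steps(local_batch_size, global_batch_size, dp_degree):
    # Samples processed across all data-parallel ranks in one micro-step.
    per_micro_step = local_batch_size * dp_degree
    if global_batch_size % per_micro_step != 0:
        raise ValueError(
            "global_batch_size must be a multiple of local_batch_size * dp_degree"
        )
    return global_batch_size // per_micro_step


def tokens_per_optimizer_step(global_batch_size, seq_len):
    # Each sample contributes seq_len tokens to one optimizer update.
    return global_batch_size * seq_len


steps = gradient_accumulation_steps(
    local_batch_size=4, global_batch_size=64, dp_degree=4
)
tokens = tokens_per_optimizer_step(global_batch_size=64, seq_len=4096)
```

With these example values, each optimizer update accumulates over four micro-steps and covers 262,144 tokens.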
183
+ ### Parallelism Configuration
184
+
185
+ ```python
186
+ from torchtitan.config.job_config import Parallelism
187
+
188
+ @dataclass
189
+ class Parallelism:
190
+     # Parallelism degrees
190
+     data_parallel_shard_degree: int
191
+     data_parallel_replicate_degree: int
192
+     tensor_parallel_degree: int
193
+     pipeline_parallel_degree: int
194
+     context_parallel_degree: int
195
+     expert_parallel_degree: int
196
+     # FSDP configuration options
197
+     # ... additional fields
199
+ ```
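The parallelism degrees have to multiply out to the number of GPUs in the job. A minimal sketch of that check, assuming the TorchTitan convention that `data_parallel_shard_degree = -1` means "use whatever ranks are left over"; `resolve_dp_shard` is a hypothetical helper, and expert parallelism is left out because it is not assumed to be an independent factor of the world size.

```python
from math import prod


def resolve_dp_shard(world_size: int, dp_replicate: int = 1, dp_shard: int = -1,
                     cp: int = 1, tp: int = 1, pp: int = 1) -> int:
    """Return the effective shard degree, or raise if the degrees do not fit."""
    other = prod([dp_replicate, cp, tp, pp])
    if dp_shard < 0:
        # Assumed convention: infer the shard degree from the remaining ranks.
        if world_size % other != 0:
            raise ValueError(f"world_size={world_size} is not divisible by {other}")
        dp_shard = world_size // other
    if dp_shard * other != world_size:
        raise ValueError(
            f"degrees multiply to {dp_shard * other}, expected world_size={world_size}"
        )
    return dp_shard


# 8 GPUs with tensor parallelism of 2 leaves a shard degree of 4.
print(resolve_dp_shard(world_size=8, tp=2))
```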
200
+
201
+ ### Optimizer Configuration
202
+
203
+ ```python
204
+ from torchtitan.config.job_config import Optimizer
205
+
206
+ @dataclass
207
+ class Optimizer:
208
+     name: str
+     lr: float
+     beta1: float
+     beta2: float
+     eps: float
+     weight_decay: float
+     implementation: str
+     early_step_in_backward: bool
216
+ ```
217
+
218
+ ## YAML Configuration Example
219
+
220
+ ```yaml
221
+ # config/grpo_math.yaml
222
+ model: "Qwen/Qwen2.5-7B-Instruct"
223
+
224
+ dataset:
+   path: "openai/gsm8k"
+   split: "train"
+   streaming: true
+
+ training:
+   batch_size: 4
+   learning_rate: 1e-6
+   seq_len: 4096
+   dtype: bfloat16
+   gradient_accumulation_steps: 4
+
+ grpo:
+   n_samples: 8
+   clip_low: 0.2
+   clip_high: 0.28
+   beta: 0.1
+   temperature: 0.7
+
+ services:
+   generator:
+     procs: 1
+     num_replicas: 1
+     with_gpus: true
+   trainer:
+     procs: 1
+     num_replicas: 1
+     with_gpus: true
+   ref_model:
+     procs: 1
+     num_replicas: 1
+     with_gpus: true
256
+ ```
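The `clip_low` and `clip_high` values in the `grpo` section are not defined further in this document. One plausible reading, sketched here as an assumption rather than as torchforge's implementation, is an asymmetric PPO-style clip that bounds the probability ratio to `[1 - clip_low, 1 + clip_high]`. The function name `clipped_objective` is illustrative.

```python
def clipped_objective(ratio: float, advantage: float,
                      clip_low: float = 0.2, clip_high: float = 0.28) -> float:
    """Per-token clipped surrogate: min(r * A, clip(r, 1 - low, 1 + high) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    return min(ratio * advantage, clipped_ratio * advantage)


# A large ratio with a positive advantage is capped at the upper bound.
print(clipped_objective(ratio=1.5, advantage=1.0))
# A small ratio with a negative advantage is held at the lower bound.
print(clipped_objective(ratio=0.5, advantage=-1.0))
```

Under this reading, a wider upper bound than lower bound lets the policy raise the probability of good samples more freely than it can lower the probability of bad ones.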
257
+
258
+ ## Launch Commands
259
+
260
+ ### SFT Training (2+ GPUs)
261
+
262
+ ```bash
263
+ python -m apps.sft.main --config apps/sft/llama3_8b.yaml
264
+ ```
265
+
266
+ ### GRPO Training (3+ GPUs)
267
+
268
+ ```bash
269
+ python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
270
+ ```
271
+
272
+ ### Multi-GPU Distributed
273
+
274
+ ```bash
275
+ python -m apps.grpo.main \
276
+   --config config/distributed.yaml \
+   --trainer.procs 4 \
+   --generator.procs 4
279
+ ```
280
+
281
+ ## Async Communication Pattern
282
+
283
+ torchforge uses async/await patterns for service communication:
284
+
285
+ ```python
286
+ # Route: async point-to-point
287
+ response = await service.method.route(arg1, arg2)
288
+
289
+ # Fanout: broadcast to all replicas
290
+ await service.update_weights.fanout(training_step)
291
+ ```
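The difference between the two call styles can be shown with a toy model built on plain `asyncio`. This is not torchforge code: `ToyService` and `ToyReplica` are hypothetical stand-ins, and round-robin selection is an assumed routing policy chosen only to keep the example deterministic.

```python
import asyncio
import itertools


class ToyReplica:
    def __init__(self, idx: int) -> None:
        self.idx = idx
        self.weights_version = 0

    async def generate(self, prompt: str) -> str:
        await asyncio.sleep(0)  # yield to the event loop like a real remote call
        return f"replica{self.idx}:{prompt}"

    async def update_weights(self, step: int) -> int:
        self.weights_version = step
        return self.idx


class ToyService:
    def __init__(self, num_replicas: int) -> None:
        self.replicas = [ToyReplica(i) for i in range(num_replicas)]
        self._round_robin = itertools.cycle(self.replicas)

    async def route(self, method: str, *args):
        """Point-to-point: exactly one replica handles the call."""
        replica = next(self._round_robin)
        return await getattr(replica, method)(*args)

    async def fanout(self, method: str, *args):
        """Broadcast: every replica handles the call; results come back as a list."""
        return await asyncio.gather(
            *(getattr(r, method)(*args) for r in self.replicas)
        )


async def main():
    service = ToyService(num_replicas=3)
    first = await service.route("generate", "hi")
    acked = await service.fanout("update_weights", 7)
    return first, acked, [r.weights_version for r in service.replicas]


first, acked, versions = asyncio.run(main())
print(first, acked, versions)
```

Routing suits request/response work such as generation, while fanout suits state changes that every replica must observe, such as a weight update.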
292
+
293
+ ## Installation
294
+
295
+ ```bash
296
+ # Create environment
297
+ conda create -n forge python=3.12
298
+ conda activate forge
299
+
300
+ # Install (handles PyTorch nightly + dependencies)
301
+ ./scripts/install.sh
302
+
303
+ # ROCm (AMD GPUs)
304
+ ./scripts/install_rocm.sh
305
+
306
+ # Verify
307
+ python -c "import torch, forge, vllm; print('OK')"
308
+ ```
309
+
310
+ **Requirements**:
311
+ - PyTorch >= 2.9.0 (nightly)
312
+ - Monarch
313
+ - TorchTitan
314
+ - vLLM
315
+
316
+ ## Experimental Warning
317
+
318
+ Both Monarch and torchforge are experimental. APIs may change as the project learns from early adopters.
319
+
320
+ ## Resources
321
+
322
+ - Documentation: https://meta-pytorch.org/torchforge
323
+ - GitHub: https://github.com/meta-pytorch/torchforge
324
+ - Discord: https://discord.gg/YsTYBh6PD9
325
+ - TorchTitan: https://github.com/pytorch/torchtitan
326
+ - Monarch: https://github.com/meta-pytorch/monarch
327
+ - Blog: https://pytorch.org/blog/introducing-torchforge/
@@ -0,0 +1,409 @@
1
+ # torchforge Troubleshooting Guide
2
+
3
+ ## GPU Resource Issues
4
+
5
+ ### Issue: Not Enough GPUs
6
+
7
+ **Symptoms**: "Insufficient GPU resources" error
8
+
9
+ **Solutions**:
10
+
11
+ 1. **Reduce service requirements**:
12
+ ```yaml
13
+ services:
+   generator:
+     procs: 1
+     with_gpus: true
+   trainer:
+     procs: 1
+     with_gpus: true
+   # Remove ref_model or use CPU
21
+ ```
22
+
23
+ 2. **Use CPU for reference model**:
24
+ ```yaml
25
+ ref_model:
+   with_gpus: false # Run on CPU
27
+ ```
28
+
29
+ 3. **Share resources between services**:
30
+ ```yaml
31
+ services:
+   generator:
+     procs: 1
+     num_replicas: 1
+     colocate_with: trainer # Share GPU with trainer
36
+ ```
37
+
38
+ ### Issue: Minimum GPU Requirements
39
+
40
+ **Reference**:
41
+ - SFT: 2+ GPUs (trainer + generator)
42
+ - GRPO: 3+ GPUs (trainer + generator + ref_model)
43
+ - Large models: 8+ GPUs with tensor parallelism
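The GRPO figure follows from summing the GPU-backed processes across services. A minimal sketch, assuming each process of a service with `with_gpus: true` occupies one GPU and that replicas multiply the process count. The `services` dict mirrors a typical three-service config, and `gpus_required` is a hypothetical helper, not a torchforge function.

```python
services = {
    "generator": {"procs": 1, "num_replicas": 1, "with_gpus": True},
    "trainer": {"procs": 1, "num_replicas": 1, "with_gpus": True},
    "ref_model": {"procs": 1, "num_replicas": 1, "with_gpus": True},
}


def gpus_required(services: dict) -> int:
    """Sum procs * num_replicas over every service that requests GPUs."""
    return sum(
        spec.get("procs", 1) * spec.get("num_replicas", 1)
        for spec in services.values()
        if spec.get("with_gpus", False)
    )


print(gpus_required(services))  # 3
```

Comparing this number with the GPU count reported by `nvidia-smi` before launching is a quick way to anticipate an "Insufficient GPU resources" error.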
44
+
45
+ ## Memory Issues
46
+
47
+ ### Issue: OOM During Generation
48
+
49
+ **Symptoms**: CUDA OOM in vLLM
50
+
51
+ **Solutions**:
52
+
53
+ 1. **Reduce the number of samples per prompt**:
54
+ ```yaml
55
+ grpo:
+   n_samples: 4 # Reduce from 8
57
+ ```
58
+
59
+ 2. **Reduce sequence length**:
60
+ ```yaml
61
+ training:
+   seq_len: 2048 # Reduce from 4096
63
+ ```
64
+
65
+ 3. **Reduce vLLM memory**:
66
+ ```yaml
67
+ generator:
+   gpu_memory_utilization: 0.7 # Reduce from 0.9
69
+ ```
70
+
71
+ ### Issue: OOM During Training
72
+
73
+ **Symptoms**: CUDA OOM in backward pass
74
+
75
+ **Solutions**:
76
+
77
+ 1. **Enable gradient checkpointing**:
78
+ ```yaml
79
+ training:
+   gradient_checkpointing: true
81
+ ```
82
+
83
+ 2. **Increase gradient accumulation**:
84
+ ```yaml
85
+ training:
+   gradient_accumulation_steps: 8 # Increase from 4
87
+ ```
88
+
89
+ 3. **Reduce batch size**:
90
+ ```yaml
91
+ training:
+   batch_size: 2 # Reduce from 4
93
+ ```
94
+
95
+ ## Weight Synchronization Issues
96
+
97
+ ### Issue: Slow Weight Sync
98
+
99
+ **Symptoms**: Long pauses between training and generation
100
+
101
+ **Solutions**:
102
+
103
+ 1. **Enable RDMA** (if available):
104
+ ```bash
105
+ export TORCHSTORE_USE_RDMA=1
106
+ ```
107
+
108
+ 2. **Reduce sync frequency**:
109
+ ```yaml
110
+ training:
+   sync_interval: 10 # Sync every 10 steps
112
+ ```
113
+
114
+ 3. **Use colocated services**:
115
+ ```yaml
116
+ services:
+   generator:
+     colocate_with: trainer
119
+ ```
120
+
121
+ ### Issue: Weight Sync Failures
122
+
123
+ **Symptoms**: Errors in weight transfer, stale weights
124
+
125
+ **Solutions**:
126
+
127
+ 1. **Check network connectivity**:
128
+ ```bash
129
+ ping other_node
130
+ ```
131
+
132
+ 2. **Increase timeout**:
133
+ ```yaml
134
+ services:
+   weight_sync_timeout: 600 # 10 minutes
136
+ ```
137
+
138
+ 3. **Enable sync verification**:
139
+ ```yaml
140
+ training:
+   verify_weight_sync: true
142
+ ```
143
+
144
+ ## Training Stability Issues
145
+
146
+ ### Issue: Policy Collapse
147
+
148
+ **Symptoms**: Entropy drops to zero, reward stops improving
149
+
150
+ **Solutions**:
151
+
152
+ 1. **Increase KL penalty**:
153
+ ```yaml
154
+ grpo:
+   beta: 0.2 # Increase from 0.1
156
+ ```
157
+
158
+ 2. **Add entropy bonus**:
159
+ ```yaml
160
+ training:
+   entropy_coef: 0.01
162
+ ```
163
+
164
+ 3. **Reduce learning rate**:
165
+ ```yaml
166
+ training:
+   learning_rate: 5e-7 # Reduce from 1e-6
168
+ ```
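Spotting collapse early requires tracking entropy rather than waiting for reward to stall. The sketch below computes the mean Shannon entropy of per-token probability distributions in plain Python. How torchforge exposes these probabilities is not covered here, so the input format (a list of probability lists) and both function names are assumptions for illustration.

```python
import math


def token_entropy(probs: list[float]) -> float:
    """Shannon entropy in nats of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(distributions: list[list[float]]) -> float:
    """Average entropy over the sampled positions of a batch."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)


healthy = [[0.5, 0.5], [0.25, 0.25, 0.25, 0.25]]
collapsed = [[1.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
print(mean_entropy(healthy), mean_entropy(collapsed))
```

A value that trends toward zero over successive steps matches the symptom described above and is a signal to apply one of the listed remedies.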
169
+
170
+ ### Issue: Loss Spikes
171
+
172
+ **Symptoms**: Sudden loss increases, training instability
173
+
174
+ **Solutions**:
175
+
176
+ 1. **Enable gradient clipping**:
177
+ ```yaml
178
+ training:
+   max_grad_norm: 1.0
180
+ ```
181
+
182
+ 2. **Reduce clip range**:
183
+ ```yaml
184
+ grpo:
+   clip_low: 0.1 # Reduce from 0.2
+   clip_high: 0.18 # Reduce from 0.28
187
+ ```
188
+
189
+ 3. **Use learning rate warmup**:
190
+ ```yaml
191
+ training:
+   warmup_steps: 100
193
+ ```
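Warmup helps with early loss spikes because the first updates use only a fraction of the target learning rate. A minimal sketch of a linear warmup schedule follows. The linear shape is an assumption (the scheduler that `warmup_steps` actually selects is not specified here), and `warmup_lr` is a hypothetical helper.

```python
def warmup_lr(step: int, base_lr: float = 1e-6, warmup_steps: int = 100) -> float:
    """Scale the learning rate linearly from base_lr / warmup_steps up to base_lr."""
    if warmup_steps <= 0:
        return base_lr
    return base_lr * min(1.0, (step + 1) / warmup_steps)


for step in (0, 49, 99, 500):
    print(step, warmup_lr(step))
```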
194
+
195
+ ### Issue: Divergent Training
196
+
197
+ **Symptoms**: Loss becomes NaN, model outputs garbage
198
+
199
+ **Solutions**:
200
+
201
+ 1. **Check for data issues**:
202
+ ```python
203
+ # Verify no empty sequences
204
+ for batch in dataset:
+     assert batch.input_ids.numel() > 0
206
+ ```
207
+
208
+ 2. **Use BF16 instead of FP16**:
209
+ ```yaml
210
+ training:
+   dtype: bfloat16
212
+ ```
213
+
214
+ 3. **Reduce learning rate significantly**:
215
+ ```yaml
216
+ training:
+   learning_rate: 1e-7
218
+ ```
219
+
220
+ ## Service Issues
221
+
222
+ ### Issue: Service Startup Failures
223
+
224
+ **Symptoms**: Services fail to initialize
225
+
226
+ **Solutions**:
227
+
228
+ 1. **Check resource availability**:
229
+ ```bash
230
+ nvidia-smi # Verify GPU availability
231
+ ```
232
+
233
+ 2. **Increase startup timeout**:
234
+ ```yaml
235
+ services:
+   startup_timeout: 600
237
+ ```
238
+
239
+ 3. **Check model path**:
240
+ ```python
241
+ from transformers import AutoModelForCausalLM
242
+ model = AutoModelForCausalLM.from_pretrained("model_path") # Verify accessible
243
+ ```
244
+
245
+ ### Issue: Generator Not Responding
246
+
247
+ **Symptoms**: Generation hangs, timeouts
248
+
249
+ **Solutions**:
250
+
251
+ 1. **Check vLLM status**:
252
+ ```python
253
+ # Add health check
254
+ await generator.health_check.route()
255
+ ```
256
+
257
+ 2. **Restart service**:
258
+ ```python
259
+ await generator.restart.fanout()
260
+ ```
261
+
262
+ 3. **Reduce concurrent requests**:
263
+ ```yaml
264
+ generator:
+   max_concurrent_requests: 10
266
+ ```
267
+
268
+ ## Monarch Issues
269
+
270
+ ### Issue: Monarch Actor Failures
271
+
272
+ **Symptoms**: Actor crashes, communication errors
273
+
274
+ **Solutions**:
275
+
276
+ 1. **Enable fault tolerance**:
277
+ ```yaml
278
+ monarch:
+   fault_tolerance: true
+   max_restarts: 3
281
+ ```
282
+
283
+ 2. **Increase actor memory**:
284
+ ```yaml
285
+ services:
+   actor_memory_mb: 4096
287
+ ```
288
+
289
+ 3. **Check Monarch logs**:
290
+ ```bash
291
+ export MONARCH_LOG_LEVEL=DEBUG
292
+ ```
293
+
294
+ ### Issue: Deadlock in Distributed Communication
295
+
296
+ **Symptoms**: Training hangs, no progress
297
+
298
+ **Solutions**:
299
+
300
+ 1. **Check for blocking calls**:
301
+ ```python
302
+ # Use async/await correctly
303
+ result = await service.method.route(args) # Correct
304
+ # result = service.method.route(args).wait() # May deadlock
305
+ ```
306
+
307
+ 2. **Add timeouts**:
308
+ ```python
309
+ import asyncio
+
+ # Bound the wait so a hung replica surfaces as asyncio.TimeoutError
+ result = await asyncio.wait_for(
+     service.method.route(args),
+     timeout=60.0,
+ )
313
+ ```
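A timeout only helps if the caller handles the resulting exception. The self-contained sketch below uses a coroutine that never returns as a stand-in for a hung service call. `hung_call` and `call_with_timeout` are illustrative names, and returning `None` on timeout is one possible policy, not torchforge behavior.

```python
import asyncio


async def hung_call() -> str:
    """Stand-in for a service call that never completes."""
    await asyncio.Event().wait()
    return "unreachable"


async def quick_call() -> str:
    return "ok"


async def call_with_timeout(coro, timeout_s: float):
    """Await coro, returning None instead of hanging if it exceeds timeout_s."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the inner task before raising, so nothing is leaked.
        return None


hung_result = asyncio.run(call_with_timeout(hung_call(), 0.05))
quick_result = asyncio.run(call_with_timeout(quick_call(), 0.05))
print(hung_result, quick_result)
```

In a training loop, the timeout branch is a natural place to log which service stalled before retrying or aborting.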
314
+
315
+ ## Installation Issues
316
+
317
+ ### Issue: PyTorch Version Mismatch
318
+
319
+ **Symptoms**: Import errors, CUDA errors
320
+
321
+ **Solutions**:
322
+
323
+ 1. **Use provided install script**:
324
+ ```bash
325
+ ./scripts/install.sh
326
+ ```
327
+
328
+ 2. **Verify versions**:
329
+ ```python
330
+ import torch
331
+ print(torch.__version__) # Should be 2.9.0+
332
+ ```
333
+
334
+ 3. **Clean reinstall**:
335
+ ```bash
336
+ pip uninstall torch torchvision torchaudio
337
+ ./scripts/install.sh
338
+ ```
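Nightly builds report versions with suffixes such as `.dev` and a local tag after `+`, so comparing the raw string is unreliable. A minimal sketch of a check against the 2.9.0 minimum, using only the standard library; `meets_minimum` is a hypothetical helper and the sample version strings are made up for illustration.

```python
import re


def meets_minimum(version: str, minimum: tuple = (2, 9, 0)) -> bool:
    """Compare the leading major.minor.patch of a version string with a minimum."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups()) >= minimum


# Pass torch.__version__ here in a real environment.
for sample in ("2.9.0.dev20250915+cu128", "2.8.1", "2.10.0"):
    print(sample, meets_minimum(sample))
```

Comparing integer tuples avoids the string-ordering pitfall where "2.10.0" sorts before "2.9.0".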
339
+
340
+ ### Issue: Monarch Installation Fails
341
+
342
+ **Symptoms**: Cannot import monarch
343
+
344
+ **Solutions**:
345
+
346
+ 1. **Install from source**:
347
+ ```bash
348
+ git clone https://github.com/meta-pytorch/monarch
349
+ cd monarch && pip install -e .
350
+ ```
351
+
352
+ 2. **Check CUDA compatibility**:
353
+ ```bash
354
+ nvcc --version # Should match PyTorch CUDA
355
+ ```
356
+
357
+ ## Debugging Tips
358
+
359
+ ### Enable Verbose Logging
360
+
361
+ ```bash
362
+ export FORGE_DEBUG=1
363
+ export MONARCH_LOG_LEVEL=DEBUG
364
+ ```
365
+
366
+ ### Profile Services
367
+
368
+ ```python
369
+ import torch
+
+ # Profile a single training step (run inside an async function)
+ with torch.profiler.profile() as prof:
+     result = await trainer.train_step.route(batch)
+ prof.export_chrome_trace("trace.json")
373
+ ```
374
+
375
+ ### Monitor GPU Utilization
376
+
377
+ ```bash
378
+ watch -n 1 nvidia-smi
379
+ ```
380
+
381
+ ### Test Services Individually
382
+
383
+ ```python
384
+ # Test generator
385
+ completions = await generator.generate.route(
+     prompts=["Hello"],
+     max_tokens=10,
+ )
389
+ print(completions[0].text)
390
+
391
+ # Test trainer
392
+ result = await trainer.train_step.route(dummy_batch)
393
+ print(result.loss)
394
+ ```
395
+
396
+ ## Experimental Warning
397
+
398
+ Both Monarch and torchforge are experimental. Expect:
399
+ - API changes between versions
400
+ - Incomplete features
401
+ - Bugs in edge cases
402
+
403
+ Check Discord for latest updates and workarounds.
404
+
405
+ ## Resources
406
+
407
+ - GitHub Issues: https://github.com/meta-pytorch/torchforge/issues
408
+ - Discord: https://discord.gg/YsTYBh6PD9
409
+ - Monarch Issues: https://github.com/meta-pytorch/monarch/issues