claude-turing 4.6.0 → 4.8.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (333)
  1. package/.claude-plugin/plugin.json +2 -2
  2. package/README.md +1 -1
  3. package/commands/ablate.md +0 -1
  4. package/commands/annotate.md +0 -1
  5. package/commands/archive.md +0 -1
  6. package/commands/audit.md +0 -1
  7. package/commands/baseline.md +0 -1
  8. package/commands/brief.md +0 -1
  9. package/commands/budget.md +0 -1
  10. package/commands/calibrate.md +0 -1
  11. package/commands/card.md +0 -1
  12. package/commands/changelog.md +0 -1
  13. package/commands/checkpoint.md +0 -1
  14. package/commands/cite.md +0 -1
  15. package/commands/compare.md +0 -1
  16. package/commands/counterfactual.md +0 -1
  17. package/commands/curriculum.md +0 -1
  18. package/commands/design.md +0 -1
  19. package/commands/diagnose.md +0 -1
  20. package/commands/diff.md +0 -1
  21. package/commands/distill.md +0 -1
  22. package/commands/doctor.md +0 -1
  23. package/commands/ensemble.md +0 -1
  24. package/commands/explore.md +0 -1
  25. package/commands/export.md +0 -1
  26. package/commands/feature.md +0 -1
  27. package/commands/flashback.md +0 -1
  28. package/commands/fork.md +0 -1
  29. package/commands/frontier.md +0 -1
  30. package/commands/init.md +0 -1
  31. package/commands/leak.md +0 -1
  32. package/commands/lit.md +0 -1
  33. package/commands/logbook.md +0 -1
  34. package/commands/merge.md +0 -1
  35. package/commands/mode.md +0 -1
  36. package/commands/onboard.md +0 -1
  37. package/commands/paper.md +0 -1
  38. package/commands/plan.md +0 -1
  39. package/commands/poster.md +0 -1
  40. package/commands/postmortem.md +0 -1
  41. package/commands/preflight.md +0 -1
  42. package/commands/present.md +0 -1
  43. package/commands/profile.md +0 -1
  44. package/commands/prune.md +0 -1
  45. package/commands/quantize.md +0 -1
  46. package/commands/queue.md +0 -1
  47. package/commands/registry.md +0 -1
  48. package/commands/regress.md +0 -1
  49. package/commands/replay.md +0 -1
  50. package/commands/report.md +0 -1
  51. package/commands/reproduce.md +0 -1
  52. package/commands/retry.md +0 -1
  53. package/commands/review.md +0 -1
  54. package/commands/sanity.md +0 -1
  55. package/commands/scale.md +0 -1
  56. package/commands/search.md +0 -1
  57. package/commands/seed.md +0 -1
  58. package/commands/sensitivity.md +0 -1
  59. package/commands/share.md +0 -1
  60. package/commands/simulate.md +0 -1
  61. package/commands/status.md +0 -1
  62. package/commands/stitch.md +0 -1
  63. package/commands/suggest.md +0 -1
  64. package/commands/surgery.md +0 -1
  65. package/commands/sweep.md +0 -1
  66. package/commands/template.md +0 -1
  67. package/commands/train.md +0 -1
  68. package/commands/transfer.md +0 -1
  69. package/commands/trend.md +0 -1
  70. package/commands/try.md +0 -1
  71. package/commands/turing.md +3 -3
  72. package/commands/update.md +0 -1
  73. package/commands/validate.md +0 -1
  74. package/commands/warm.md +0 -1
  75. package/commands/watch.md +0 -1
  76. package/commands/whatif.md +0 -1
  77. package/commands/xray.md +0 -1
  78. package/config/commands.yaml +74 -74
  79. package/package.json +10 -3
  80. package/skills/turing/SKILL.md +180 -0
  81. package/skills/turing/ablate/SKILL.md +46 -0
  82. package/skills/turing/annotate/SKILL.md +22 -0
  83. package/skills/turing/archive/SKILL.md +22 -0
  84. package/skills/turing/audit/SKILL.md +55 -0
  85. package/skills/turing/baseline/SKILL.md +44 -0
  86. package/skills/turing/brief/SKILL.md +94 -0
  87. package/skills/turing/budget/SKILL.md +51 -0
  88. package/skills/turing/calibrate/SKILL.md +46 -0
  89. package/skills/turing/card/SKILL.md +35 -0
  90. package/skills/turing/changelog/SKILL.md +21 -0
  91. package/skills/turing/checkpoint/SKILL.md +46 -0
  92. package/skills/turing/cite/SKILL.md +22 -0
  93. package/skills/turing/compare/SKILL.md +23 -0
  94. package/skills/turing/counterfactual/SKILL.md +26 -0
  95. package/skills/turing/curriculum/SKILL.md +42 -0
  96. package/skills/turing/design/SKILL.md +96 -0
  97. package/skills/turing/diagnose/SKILL.md +51 -0
  98. package/skills/turing/diff/SKILL.md +47 -0
  99. package/skills/turing/distill/SKILL.md +55 -0
  100. package/skills/turing/doctor/SKILL.md +30 -0
  101. package/skills/turing/ensemble/SKILL.md +53 -0
  102. package/skills/turing/explore/SKILL.md +106 -0
  103. package/skills/turing/export/SKILL.md +47 -0
  104. package/skills/turing/feature/SKILL.md +41 -0
  105. package/skills/turing/flashback/SKILL.md +21 -0
  106. package/skills/turing/fork/SKILL.md +39 -0
  107. package/skills/turing/frontier/SKILL.md +44 -0
  108. package/skills/turing/init/SKILL.md +153 -0
  109. package/skills/turing/leak/SKILL.md +46 -0
  110. package/skills/turing/lit/SKILL.md +46 -0
  111. package/skills/turing/logbook/SKILL.md +50 -0
  112. package/skills/turing/merge/SKILL.md +23 -0
  113. package/skills/turing/mode/SKILL.md +42 -0
  114. package/skills/turing/onboard/SKILL.md +19 -0
  115. package/skills/turing/paper/SKILL.md +43 -0
  116. package/skills/turing/plan/SKILL.md +26 -0
  117. package/skills/turing/poster/SKILL.md +88 -0
  118. package/skills/turing/postmortem/SKILL.md +27 -0
  119. package/skills/turing/preflight/SKILL.md +74 -0
  120. package/skills/turing/present/SKILL.md +22 -0
  121. package/skills/turing/profile/SKILL.md +42 -0
  122. package/skills/turing/prune/SKILL.md +25 -0
  123. package/skills/turing/quantize/SKILL.md +23 -0
  124. package/skills/turing/queue/SKILL.md +47 -0
  125. package/skills/turing/registry/SKILL.md +30 -0
  126. package/skills/turing/regress/SKILL.md +52 -0
  127. package/skills/turing/replay/SKILL.md +22 -0
  128. package/skills/turing/report/SKILL.md +96 -0
  129. package/skills/turing/reproduce/SKILL.md +47 -0
  130. package/skills/turing/retry/SKILL.md +40 -0
  131. package/skills/turing/review/SKILL.md +19 -0
  132. package/skills/turing/rules/loop-protocol.md +91 -0
  133. package/skills/turing/sanity/SKILL.md +47 -0
  134. package/skills/turing/scale/SKILL.md +54 -0
  135. package/skills/turing/search/SKILL.md +21 -0
  136. package/skills/turing/seed/SKILL.md +46 -0
  137. package/skills/turing/sensitivity/SKILL.md +40 -0
  138. package/skills/turing/share/SKILL.md +19 -0
  139. package/skills/turing/simulate/SKILL.md +27 -0
  140. package/skills/turing/status/SKILL.md +23 -0
  141. package/skills/turing/stitch/SKILL.md +48 -0
  142. package/skills/turing/suggest/SKILL.md +158 -0
  143. package/skills/turing/surgery/SKILL.md +26 -0
  144. package/skills/turing/sweep/SKILL.md +44 -0
  145. package/skills/turing/template/SKILL.md +21 -0
  146. package/skills/turing/train/SKILL.md +74 -0
  147. package/skills/turing/transfer/SKILL.md +53 -0
  148. package/skills/turing/trend/SKILL.md +20 -0
  149. package/skills/turing/try/SKILL.md +62 -0
  150. package/skills/turing/update/SKILL.md +26 -0
  151. package/skills/turing/validate/SKILL.md +33 -0
  152. package/skills/turing/warm/SKILL.md +52 -0
  153. package/skills/turing/watch/SKILL.md +59 -0
  154. package/skills/turing/whatif/SKILL.md +30 -0
  155. package/skills/turing/xray/SKILL.md +42 -0
  156. package/src/command-registry.js +21 -0
  157. package/src/install.js +4 -3
  158. package/src/sync-commands-layout.js +149 -0
  159. package/src/sync-skills-layout.js +20 -0
  160. package/templates/__pycache__/evaluate.cpython-312.pyc +0 -0
  161. package/templates/__pycache__/evaluate.cpython-314.pyc +0 -0
  162. package/templates/__pycache__/prepare.cpython-312.pyc +0 -0
  163. package/templates/__pycache__/prepare.cpython-314.pyc +0 -0
  164. package/templates/features/__pycache__/__init__.cpython-312.pyc +0 -0
  165. package/templates/features/__pycache__/__init__.cpython-314.pyc +0 -0
  166. package/templates/features/__pycache__/featurizers.cpython-312.pyc +0 -0
  167. package/templates/features/__pycache__/featurizers.cpython-314.pyc +0 -0
  168. package/templates/scripts/__pycache__/__init__.cpython-312.pyc +0 -0
  169. package/templates/scripts/__pycache__/__init__.cpython-314.pyc +0 -0
  170. package/templates/scripts/__pycache__/ablation_study.cpython-312.pyc +0 -0
  171. package/templates/scripts/__pycache__/ablation_study.cpython-314.pyc +0 -0
  172. package/templates/scripts/__pycache__/architecture_surgery.cpython-312.pyc +0 -0
  173. package/templates/scripts/__pycache__/architecture_surgery.cpython-314.pyc +0 -0
  174. package/templates/scripts/__pycache__/budget_manager.cpython-312.pyc +0 -0
  175. package/templates/scripts/__pycache__/budget_manager.cpython-314.pyc +0 -0
  176. package/templates/scripts/__pycache__/build_ensemble.cpython-312.pyc +0 -0
  177. package/templates/scripts/__pycache__/build_ensemble.cpython-314.pyc +0 -0
  178. package/templates/scripts/__pycache__/calibration.cpython-312.pyc +0 -0
  179. package/templates/scripts/__pycache__/calibration.cpython-314.pyc +0 -0
  180. package/templates/scripts/__pycache__/check_convergence.cpython-312.pyc +0 -0
  181. package/templates/scripts/__pycache__/check_convergence.cpython-314.pyc +0 -0
  182. package/templates/scripts/__pycache__/checkpoint_manager.cpython-312.pyc +0 -0
  183. package/templates/scripts/__pycache__/checkpoint_manager.cpython-314.pyc +0 -0
  184. package/templates/scripts/__pycache__/citation_manager.cpython-312.pyc +0 -0
  185. package/templates/scripts/__pycache__/citation_manager.cpython-314.pyc +0 -0
  186. package/templates/scripts/__pycache__/cost_frontier.cpython-312.pyc +0 -0
  187. package/templates/scripts/__pycache__/cost_frontier.cpython-314.pyc +0 -0
  188. package/templates/scripts/__pycache__/counterfactual_explanation.cpython-312.pyc +0 -0
  189. package/templates/scripts/__pycache__/counterfactual_explanation.cpython-314.pyc +0 -0
  190. package/templates/scripts/__pycache__/critique_hypothesis.cpython-312.pyc +0 -0
  191. package/templates/scripts/__pycache__/critique_hypothesis.cpython-314.pyc +0 -0
  192. package/templates/scripts/__pycache__/curriculum_optimizer.cpython-312.pyc +0 -0
  193. package/templates/scripts/__pycache__/curriculum_optimizer.cpython-314.pyc +0 -0
  194. package/templates/scripts/__pycache__/diagnose_errors.cpython-312.pyc +0 -0
  195. package/templates/scripts/__pycache__/diagnose_errors.cpython-314.pyc +0 -0
  196. package/templates/scripts/__pycache__/draft_paper_sections.cpython-312.pyc +0 -0
  197. package/templates/scripts/__pycache__/draft_paper_sections.cpython-314.pyc +0 -0
  198. package/templates/scripts/__pycache__/equivalence_checker.cpython-312.pyc +0 -0
  199. package/templates/scripts/__pycache__/equivalence_checker.cpython-314.pyc +0 -0
  200. package/templates/scripts/__pycache__/experiment_annotations.cpython-312.pyc +0 -0
  201. package/templates/scripts/__pycache__/experiment_annotations.cpython-314.pyc +0 -0
  202. package/templates/scripts/__pycache__/experiment_archive.cpython-312.pyc +0 -0
  203. package/templates/scripts/__pycache__/experiment_archive.cpython-314.pyc +0 -0
  204. package/templates/scripts/__pycache__/experiment_diff.cpython-312.pyc +0 -0
  205. package/templates/scripts/__pycache__/experiment_diff.cpython-314.pyc +0 -0
  206. package/templates/scripts/__pycache__/experiment_index.cpython-312.pyc +0 -0
  207. package/templates/scripts/__pycache__/experiment_index.cpython-314.pyc +0 -0
  208. package/templates/scripts/__pycache__/experiment_queue.cpython-312.pyc +0 -0
  209. package/templates/scripts/__pycache__/experiment_queue.cpython-314.pyc +0 -0
  210. package/templates/scripts/__pycache__/experiment_replay.cpython-312.pyc +0 -0
  211. package/templates/scripts/__pycache__/experiment_replay.cpython-314.pyc +0 -0
  212. package/templates/scripts/__pycache__/experiment_search.cpython-312.pyc +0 -0
  213. package/templates/scripts/__pycache__/experiment_search.cpython-314.pyc +0 -0
  214. package/templates/scripts/__pycache__/experiment_simulator.cpython-312.pyc +0 -0
  215. package/templates/scripts/__pycache__/experiment_simulator.cpython-314.pyc +0 -0
  216. package/templates/scripts/__pycache__/experiment_templates.cpython-312.pyc +0 -0
  217. package/templates/scripts/__pycache__/experiment_templates.cpython-314.pyc +0 -0
  218. package/templates/scripts/__pycache__/export_card.cpython-312.pyc +0 -0
  219. package/templates/scripts/__pycache__/export_card.cpython-314.pyc +0 -0
  220. package/templates/scripts/__pycache__/export_formats.cpython-312.pyc +0 -0
  221. package/templates/scripts/__pycache__/export_formats.cpython-314.pyc +0 -0
  222. package/templates/scripts/__pycache__/failure_postmortem.cpython-312.pyc +0 -0
  223. package/templates/scripts/__pycache__/failure_postmortem.cpython-314.pyc +0 -0
  224. package/templates/scripts/__pycache__/feature_intelligence.cpython-312.pyc +0 -0
  225. package/templates/scripts/__pycache__/feature_intelligence.cpython-314.pyc +0 -0
  226. package/templates/scripts/__pycache__/fork_experiment.cpython-312.pyc +0 -0
  227. package/templates/scripts/__pycache__/fork_experiment.cpython-314.pyc +0 -0
  228. package/templates/scripts/__pycache__/generate_baselines.cpython-312.pyc +0 -0
  229. package/templates/scripts/__pycache__/generate_baselines.cpython-314.pyc +0 -0
  230. package/templates/scripts/__pycache__/generate_brief.cpython-312.pyc +0 -0
  231. package/templates/scripts/__pycache__/generate_brief.cpython-314.pyc +0 -0
  232. package/templates/scripts/__pycache__/generate_changelog.cpython-312.pyc +0 -0
  233. package/templates/scripts/__pycache__/generate_changelog.cpython-314.pyc +0 -0
  234. package/templates/scripts/__pycache__/generate_figures.cpython-312.pyc +0 -0
  235. package/templates/scripts/__pycache__/generate_figures.cpython-314.pyc +0 -0
  236. package/templates/scripts/__pycache__/generate_logbook.cpython-312.pyc +0 -0
  237. package/templates/scripts/__pycache__/generate_logbook.cpython-314.pyc +0 -0
  238. package/templates/scripts/__pycache__/generate_model_card.cpython-312.pyc +0 -0
  239. package/templates/scripts/__pycache__/generate_model_card.cpython-314.pyc +0 -0
  240. package/templates/scripts/__pycache__/generate_onboarding.cpython-312.pyc +0 -0
  241. package/templates/scripts/__pycache__/generate_onboarding.cpython-314.pyc +0 -0
  242. package/templates/scripts/__pycache__/harness_doctor.cpython-312.pyc +0 -0
  243. package/templates/scripts/__pycache__/harness_doctor.cpython-314.pyc +0 -0
  244. package/templates/scripts/__pycache__/incremental_update.cpython-312.pyc +0 -0
  245. package/templates/scripts/__pycache__/incremental_update.cpython-314.pyc +0 -0
  246. package/templates/scripts/__pycache__/knowledge_transfer.cpython-312.pyc +0 -0
  247. package/templates/scripts/__pycache__/knowledge_transfer.cpython-314.pyc +0 -0
  248. package/templates/scripts/__pycache__/latency_benchmark.cpython-312.pyc +0 -0
  249. package/templates/scripts/__pycache__/latency_benchmark.cpython-314.pyc +0 -0
  250. package/templates/scripts/__pycache__/leakage_detector.cpython-312.pyc +0 -0
  251. package/templates/scripts/__pycache__/leakage_detector.cpython-314.pyc +0 -0
  252. package/templates/scripts/__pycache__/literature_search.cpython-312.pyc +0 -0
  253. package/templates/scripts/__pycache__/literature_search.cpython-314.pyc +0 -0
  254. package/templates/scripts/__pycache__/log_experiment.cpython-312.pyc +0 -0
  255. package/templates/scripts/__pycache__/log_experiment.cpython-314.pyc +0 -0
  256. package/templates/scripts/__pycache__/manage_hypotheses.cpython-312.pyc +0 -0
  257. package/templates/scripts/__pycache__/manage_hypotheses.cpython-314.pyc +0 -0
  258. package/templates/scripts/__pycache__/methodology_audit.cpython-312.pyc +0 -0
  259. package/templates/scripts/__pycache__/methodology_audit.cpython-314.pyc +0 -0
  260. package/templates/scripts/__pycache__/model_distiller.cpython-312.pyc +0 -0
  261. package/templates/scripts/__pycache__/model_distiller.cpython-314.pyc +0 -0
  262. package/templates/scripts/__pycache__/model_lifecycle.cpython-312.pyc +0 -0
  263. package/templates/scripts/__pycache__/model_lifecycle.cpython-314.pyc +0 -0
  264. package/templates/scripts/__pycache__/model_merger.cpython-312.pyc +0 -0
  265. package/templates/scripts/__pycache__/model_merger.cpython-314.pyc +0 -0
  266. package/templates/scripts/__pycache__/model_pruning.cpython-312.pyc +0 -0
  267. package/templates/scripts/__pycache__/model_pruning.cpython-314.pyc +0 -0
  268. package/templates/scripts/__pycache__/model_quantization.cpython-312.pyc +0 -0
  269. package/templates/scripts/__pycache__/model_quantization.cpython-314.pyc +0 -0
  270. package/templates/scripts/__pycache__/model_xray.cpython-312.pyc +0 -0
  271. package/templates/scripts/__pycache__/model_xray.cpython-314.pyc +0 -0
  272. package/templates/scripts/__pycache__/novelty_guard.cpython-312.pyc +0 -0
  273. package/templates/scripts/__pycache__/novelty_guard.cpython-314.pyc +0 -0
  274. package/templates/scripts/__pycache__/package_experiments.cpython-312.pyc +0 -0
  275. package/templates/scripts/__pycache__/package_experiments.cpython-314.pyc +0 -0
  276. package/templates/scripts/__pycache__/pareto_frontier.cpython-312.pyc +0 -0
  277. package/templates/scripts/__pycache__/pareto_frontier.cpython-314.pyc +0 -0
  278. package/templates/scripts/__pycache__/parse_metrics.cpython-312.pyc +0 -0
  279. package/templates/scripts/__pycache__/parse_metrics.cpython-314.pyc +0 -0
  280. package/templates/scripts/__pycache__/pipeline_manager.cpython-312.pyc +0 -0
  281. package/templates/scripts/__pycache__/pipeline_manager.cpython-314.pyc +0 -0
  282. package/templates/scripts/__pycache__/profile_training.cpython-312.pyc +0 -0
  283. package/templates/scripts/__pycache__/profile_training.cpython-314.pyc +0 -0
  284. package/templates/scripts/__pycache__/regression_gate.cpython-312.pyc +0 -0
  285. package/templates/scripts/__pycache__/regression_gate.cpython-314.pyc +0 -0
  286. package/templates/scripts/__pycache__/reproduce_experiment.cpython-312.pyc +0 -0
  287. package/templates/scripts/__pycache__/reproduce_experiment.cpython-314.pyc +0 -0
  288. package/templates/scripts/__pycache__/research_planner.cpython-312.pyc +0 -0
  289. package/templates/scripts/__pycache__/research_planner.cpython-314.pyc +0 -0
  290. package/templates/scripts/__pycache__/sanity_checks.cpython-312.pyc +0 -0
  291. package/templates/scripts/__pycache__/sanity_checks.cpython-314.pyc +0 -0
  292. package/templates/scripts/__pycache__/scaffold.cpython-312.pyc +0 -0
  293. package/templates/scripts/__pycache__/scaffold.cpython-314.pyc +0 -0
  294. package/templates/scripts/__pycache__/scaling_estimator.cpython-312.pyc +0 -0
  295. package/templates/scripts/__pycache__/scaling_estimator.cpython-314.pyc +0 -0
  296. package/templates/scripts/__pycache__/seed_runner.cpython-312.pyc +0 -0
  297. package/templates/scripts/__pycache__/seed_runner.cpython-314.pyc +0 -0
  298. package/templates/scripts/__pycache__/sensitivity_analysis.cpython-312.pyc +0 -0
  299. package/templates/scripts/__pycache__/sensitivity_analysis.cpython-314.pyc +0 -0
  300. package/templates/scripts/__pycache__/session_flashback.cpython-312.pyc +0 -0
  301. package/templates/scripts/__pycache__/session_flashback.cpython-314.pyc +0 -0
  302. package/templates/scripts/__pycache__/show_experiment_tree.cpython-312.pyc +0 -0
  303. package/templates/scripts/__pycache__/show_experiment_tree.cpython-314.pyc +0 -0
  304. package/templates/scripts/__pycache__/show_families.cpython-312.pyc +0 -0
  305. package/templates/scripts/__pycache__/show_families.cpython-314.pyc +0 -0
  306. package/templates/scripts/__pycache__/simulate_review.cpython-312.pyc +0 -0
  307. package/templates/scripts/__pycache__/simulate_review.cpython-314.pyc +0 -0
  308. package/templates/scripts/__pycache__/smart_retry.cpython-312.pyc +0 -0
  309. package/templates/scripts/__pycache__/smart_retry.cpython-314.pyc +0 -0
  310. package/templates/scripts/__pycache__/statistical_compare.cpython-312.pyc +0 -0
  311. package/templates/scripts/__pycache__/statistical_compare.cpython-314.pyc +0 -0
  312. package/templates/scripts/__pycache__/suggest_next.cpython-312.pyc +0 -0
  313. package/templates/scripts/__pycache__/suggest_next.cpython-314.pyc +0 -0
  314. package/templates/scripts/__pycache__/sweep.cpython-312.pyc +0 -0
  315. package/templates/scripts/__pycache__/sweep.cpython-314.pyc +0 -0
  316. package/templates/scripts/__pycache__/synthesize_decision.cpython-312.pyc +0 -0
  317. package/templates/scripts/__pycache__/synthesize_decision.cpython-314.pyc +0 -0
  318. package/templates/scripts/__pycache__/training_monitor.cpython-312.pyc +0 -0
  319. package/templates/scripts/__pycache__/training_monitor.cpython-314.pyc +0 -0
  320. package/templates/scripts/__pycache__/treequest_suggest.cpython-312.pyc +0 -0
  321. package/templates/scripts/__pycache__/treequest_suggest.cpython-314.pyc +0 -0
  322. package/templates/scripts/__pycache__/trend_analysis.cpython-312.pyc +0 -0
  323. package/templates/scripts/__pycache__/trend_analysis.cpython-314.pyc +0 -0
  324. package/templates/scripts/__pycache__/turing_io.cpython-312.pyc +0 -0
  325. package/templates/scripts/__pycache__/turing_io.cpython-314.pyc +0 -0
  326. package/templates/scripts/__pycache__/update_state.cpython-312.pyc +0 -0
  327. package/templates/scripts/__pycache__/update_state.cpython-314.pyc +0 -0
  328. package/templates/scripts/__pycache__/verify_placeholders.cpython-312.pyc +0 -0
  329. package/templates/scripts/__pycache__/verify_placeholders.cpython-314.pyc +0 -0
  330. package/templates/scripts/__pycache__/warm_start.cpython-312.pyc +0 -0
  331. package/templates/scripts/__pycache__/warm_start.cpython-314.pyc +0 -0
  332. package/templates/scripts/__pycache__/whatif_engine.cpython-312.pyc +0 -0
  333. package/templates/scripts/__pycache__/whatif_engine.cpython-314.pyc +0 -0
@@ -0,0 +1,180 @@
+ ---
+ name: turing
+ description: Autonomous ML research harness. Thin router that detects ML training intent and identifies the matching Turing sub-command execution path. Each sub-command handles one phase of the experiment lifecycle.
+ ---
+
+ You are the Turing ML research router. Detect the user's intent and identify the matching Turing sub-command execution path.
+
+ ## Execution Contract
+
+ Turing sub-commands are slash-command skills that allow model invocation, so router handling may select the focused skill when the user's intent matches a sub-command.
+
+ - If the user explicitly invokes `/turing:<cmd>`, handle that focused sub-command directly.
+ - If the user invokes `/turing` as a router and the detected command is `slash_only`, route to the focused sub-command skill when appropriate.
+ - If a command has a documented safe equivalent script, the assistant may execute those documented steps inline when safe and appropriate.
+
+ ## Routing Table
+
+ | User says... | Route to | Lifecycle phase |
+ |---|---|---|
+ | "train", "train ml/coding", "train ml/claims", "run experiments", "run experiments in ml/X", "autoresearch", "improve the model", "start training" | `/turing:train` | Execute |
+ | "status", "how's training", "experiment results", "current metrics" | `/turing:status` | Observe |
+ | "compare", "diff runs", "which is better" | `/turing:compare` | Analyze |
+ | "sweep", "grid search", "hyperparameter search", "tune" | `/turing:sweep` | Explore |
+ | "init", "set up ML", "initialize", "scaffold", "bootstrap" | `/turing:init` | Setup |
+ | "try", "test this", "inject", "what if we", "I think we should" | `/turing:try` | Steer |
+ | "brief", "briefing", "what have we learned", "summary" | `/turing:brief` | Report |
+ | "logbook", "log", "history", "timeline", "narrative" | `/turing:logbook` | Document |
+ | "poster", "presentation", "one-pager", "visual summary" | `/turing:poster` | Document |
+ | "report", "write-up", "findings", "document results" | `/turing:report` | Document |
+ | "validate", "stability", "check variance", "noisy" | `/turing:validate` | Validate |
+ | "seed", "seed study", "multi-seed", "lucky seed", "seed sensitivity" | `/turing:seed` | Validate |
+ | "reproduce", "reproducibility", "verify results", "re-run experiment", "repro" | `/turing:reproduce` | Validate |
+ | "suggest", "what model", "recommend", "which architecture", "literature" | `/turing:suggest` | Research |
+ | "explore hypotheses", "tree search", "treequest", "search hypothesis space", "MCTS" | `/turing:explore` | Research |
+ | "design", "plan experiment", "how should I test", "experiment design" | `/turing:design` | Design |
+ | "mode", "explore", "exploit", "replicate", "strategy" | `/turing:mode` | Strategy |
+ | "preflight", "resources", "VRAM", "memory", "can I run", "OOM", "GPU" | `/turing:preflight` | Check |
+ | "card", "model card", "document model", "model documentation" | `/turing:card` | Document |
+ | "diagnose", "error analysis", "failure modes", "where does it fail", "confusion matrix" | `/turing:diagnose` | Analyze |
+ | "ablate", "ablation", "remove component", "which features matter", "component impact" | `/turing:ablate` | Analyze |
+ | "frontier", "pareto", "tradeoff", "tradeoffs", "multi-objective", "which model is best" | `/turing:frontier` | Analyze |
+ | "lit", "literature", "papers", "SOTA", "baseline", "related work", "citations" | `/turing:lit` | Research |
+ | "paper", "draft paper", "write paper", "results table", "latex", "experimental setup" | `/turing:paper` | Document |
+ | "export", "deploy", "production", "onnx", "torchscript", "tflite", "ship model" | `/turing:export` | Deploy |
+ | "queue", "batch", "overnight", "schedule experiments", "run queue" | `/turing:queue` | Orchestrate |
+ | "retry", "retry experiment", "crashed", "OOM", "fix and rerun" | `/turing:retry` | Orchestrate |
+ | "fork", "branch", "try both", "parallel experiments", "A or B" | `/turing:fork` | Orchestrate |
+ | "profile", "profiling", "bottleneck", "slow training", "why is it slow", "timing" | `/turing:profile` | Check |
+ | "checkpoint", "checkpoints", "prune checkpoints", "disk space", "resume training" | `/turing:checkpoint` | Check |
+ | "diff", "deep compare", "what changed", "why did it diverge", "experiment diff" | `/turing:diff` | Analyze |
+ | "watch", "monitor", "live training", "loss spike", "is it overfitting", "training progress" | `/turing:watch` | Monitor |
+ | "regress", "regression", "did metrics degrade", "check for regression", "CI gate", "stability check" | `/turing:regress` | Validate |
+ | "ensemble", "combine models", "voting", "stacking", "blending", "merge models" | `/turing:ensemble` | Compose |
+ | "stitch", "pipeline", "swap stage", "cache stage", "pipeline composition" | `/turing:stitch` | Compose |
+ | "warm", "warm start", "fine-tune", "continue training", "transfer learning", "from checkpoint" | `/turing:warm` | Compose |
+ | "scale", "scaling law", "how much data", "is more data worth it", "power law", "data efficiency" | `/turing:scale` | Analyze |
+ | "budget", "compute budget", "how many experiments", "spending limit", "stop after" | `/turing:budget` | Manage |
+ | "distill", "compress", "smaller model", "student model", "knowledge distillation", "model compression" | `/turing:distill` | Deploy |
+ | "transfer", "what worked before", "similar project", "cross-project", "institutional knowledge", "prior projects" | `/turing:transfer` | Research |
+ | "audit", "methodology check", "pre-submission", "reviewer checklist", "data leakage", "missing baselines" | `/turing:audit` | Validate |
+ | "sanity", "sanity check", "pre-training", "is it broken", "before training", "quick check" | `/turing:sanity` | Check |
+ | "baseline", "baselines", "trivial baseline", "majority class", "is it better than random" | `/turing:baseline` | Analyze |
+ | "leak", "leakage", "data leakage scan", "suspicious feature", "train test overlap" | `/turing:leak` | Validate |
+ | "xray", "model internals", "dead neurons", "gradient flow", "weight distribution", "inside the model" | `/turing:xray` | Analyze |
+ | "sensitivity", "which params matter", "hyperparameter importance", "parameter ranking" | `/turing:sensitivity` | Analyze |
+ | "calibrate", "calibration", "ECE", "reliability diagram", "overconfident", "probability calibration" | `/turing:calibrate` | Analyze |
+ | "feature", "features", "feature selection", "feature importance", "which features matter", "redundant features" | `/turing:feature` | Analyze |
+ | "curriculum", "training order", "easy to hard", "data ordering", "curriculum learning" | `/turing:curriculum` | Optimize |
+ | "prune", "pruning", "sparsity", "remove weights", "smaller model", "weight pruning" | `/turing:prune` | Optimize |
+ | "quantize", "quantization", "int8", "fp16", "reduce precision", "faster inference" | `/turing:quantize` | Optimize |
+ | "merge", "model soup", "merge weights", "average models", "TIES", "DARE" | `/turing:merge` | Compose |
+ | "surgery", "architecture", "add layer", "widen", "modify model", "swap activation" | `/turing:surgery` | Modify |
+ | "cite", "citation", "bibliography", "bibtex", "attribution", "references" | `/turing:cite` | Record |
+ | "present", "figures", "slides", "presentation", "charts", "plots" | `/turing:present` | Document |
+ | "changelog", "model changelog", "progress summary", "what improved" | `/turing:changelog` | Document |
+ | "onboard", "onboarding", "walkthrough", "new collaborator", "project overview" | `/turing:onboard` | Document |
+ | "share", "package", "export experiments", "send results", "portable" | `/turing:share` | Share |
+ | "review", "peer review", "reviewer", "simulate review", "weakness" | `/turing:review` | Validate |
+ | "trend", "trends", "research direction", "improvement rate", "diminishing returns", "what's working" | `/turing:trend` | Analyze |
+ | "flashback", "where was I", "context", "resume", "catch up", "what happened" | `/turing:flashback` | Recall |
+ | "archive", "cleanup", "compress old", "disk space", "archive experiments" | `/turing:archive` | Manage |
+ | "annotate", "note", "tag experiment", "add note", "experiment note" | `/turing:annotate` | Record |
+ | "search", "find experiment", "query experiments", "which experiments" | `/turing:search` | Query |
+ | "template", "recipe", "save config", "reusable config", "starting point" | `/turing:template` | Manage |
+ | "replay", "re-run", "revisit", "retry old", "would it work now" | `/turing:replay` | Validate |
+ | "what if", "what-if", "hypothetical", "estimate impact", "would it help" | `/turing:whatif` | Analyze |
+ | "counterfactual", "flip prediction", "why this prediction", "minimum change", "explanation" | `/turing:counterfactual` | Explain |
+ | "simulate", "predict outcome", "pre-filter", "which configs will work", "forecast" | `/turing:simulate` | Predict |
+ | "update", "incremental", "new data", "add data", "fine-tune existing", "partial update" | `/turing:update` | Update |
+ | "registry", "promote", "demote", "staging", "production", "which model is deployed", "model lifecycle" | `/turing:registry` | Govern |
+ | "postmortem", "why failing", "failure streak", "why no improvement", "what went wrong" | `/turing:postmortem` | Diagnose |
+ | "doctor", "health check", "is it broken", "diagnose harness", "self-check" | `/turing:doctor` | Check |
+ | "plan", "research plan", "campaign", "what next", "allocate budget", "strategic plan" | `/turing:plan` | Plan |
+
+ ## Sub-commands
+
+ | Command | Purpose | Invocation |
+ |---|---|---|
+ | `/turing:train [ml/project] [N]` | Run the autonomous experiment loop (auto-detects project from path or cwd) | slash_only |
+ | `/turing:status` | Show experiment status, best model, convergence | slash_only |
+ | `/turing:compare <a> <b>` | Side-by-side experiment comparison | slash_only |
+ | `/turing:sweep` | Generate and run hyperparameter sweep | slash_only |
+ | `/turing:try <hypothesis>` | Inject a hypothesis into the agent's queue | slash_only |
+ | `/turing:brief` | Generate structured research intelligence report | slash_only |
+ | `/turing:init` | Scaffold a new ML project | slash_only |
+ | `/turing:validate` | Check metric stability, auto-fix if noisy | slash_only |
+ | `/turing:seed [N] [--quick]` | Multi-seed study: mean/std/CI, flag seed-sensitive results | slash_only |
+ | `/turing:reproduce <exp-id>` | Reproducibility verification with tolerance checking | slash_only |
+ | `/turing:suggest` | Literature-grounded model architecture suggestions | slash_only |
+ | `/turing:explore` | Tree-search hypothesis exploration via AB-MCTS | slash_only |
+ | `/turing:design <hyp-id>` | Generate structured experiment design from hypothesis | slash_only |
+ | `/turing:logbook` | HTML/markdown logbook with trajectory chart | slash_only |
+ | `/turing:poster` | Single-page HTML research poster | slash_only |
+ | `/turing:report` | Structured markdown research report | slash_only |
+ | `/turing:mode <mode>` | Set research strategy (explore/exploit/replicate) | slash_only |
+ | `/turing:preflight` | Pre-flight resource check (VRAM/RAM/disk) | slash_only |
+ | `/turing:card` | Generate standardized model card (type, performance, data, limitations, contract) | slash_only |
+ | `/turing:diagnose [exp-id]` | Error analysis: failure modes, confused pairs, feature-range bias | slash_only |
+ | `/turing:ablate [--components]` | Ablation study: remove components, measure impact, flag dead weight | slash_only |
120
+ | `/turing:frontier [--metrics]` | Pareto frontier: multi-objective tradeoff visualization | slash_only |
121
+ | `/turing:lit <query>` | Literature search: papers, SOTA baselines, related work | slash_only |
122
+ | `/turing:paper [--sections] [--format]` | Draft paper sections from experiment logs (setup, results, ablation, hyperparams) | slash_only |
123
+ | `/turing:export [exp-id] [--format]` | Export model to production format with equivalence check + latency benchmark | slash_only |
124
+ | `/turing:queue <action>` | Batch experiment scheduler: add, list, run, pause, clear | slash_only |
125
+ | `/turing:retry <exp-id>` | Smart failure recovery: auto-diagnose crash, apply fix, re-run | slash_only |
126
+ | `/turing:fork <exp-id> --branches` | Experiment branching: run parallel tracks, report winner | slash_only |
127
+ | `/turing:profile [exp-id]` | Computational profiling: timing, memory, throughput, bottleneck detection | slash_only |
128
+ | `/turing:checkpoint <action>` | Smart checkpoint management: list, prune (Pareto), average, resume, stats | slash_only |
129
+ | `/turing:diff <exp-a> <exp-b>` | Deep experiment comparison: config diff, metric significance, per-class regressions, curve divergence | slash_only |
130
+ | `/turing:watch [--analyze]` | Live training monitor with early-warning alerts (loss spike, NaN, overfitting, plateau) | slash_only |
131
+ | `/turing:regress [--tolerance]` | Performance regression gate: re-run best experiment, verify metrics haven't degraded | slash_only |
132
+ | `/turing:ensemble [--top-k] [--methods]` | Automated ensemble: voting, weighted voting, stacking, blending from top-K models | slash_only |
133
+ | `/turing:stitch <action> [stage]` | Pipeline composition: show/swap/cache/run stages independently | slash_only |
134
+ | `/turing:warm <exp-id>` | Warm-start from prior model: load checkpoint, freeze layers, adjust LR | slash_only |
135
+ | `/turing:scale [--axis]` | Scaling law estimator: fit power law, predict full-scale performance | slash_only |
136
+ | `/turing:budget <action>` | Compute budget manager: set limits, track allocation, auto-shift modes | slash_only |
137
+ | `/turing:distill <exp-id>` | Model compression: distill teacher into smaller student model | slash_only |
138
+ | `/turing:transfer [--from]` | Cross-project knowledge transfer: find similar prior projects, surface what worked | slash_only |
139
+ | `/turing:audit [--strict]` | Pre-submission methodology audit: data leakage, baselines, seeds, ablations, reproducibility | slash_only |
140
+ | `/turing:sanity [--quick]` | Pre-training sanity checks: initial loss, overfit test, gradient flow, output validation | slash_only |
141
+ | `/turing:baseline [--methods]` | Automatic baseline generation: random, majority/mean, linear, k-NN | slash_only |
142
+ | `/turing:leak [--deep]` | Targeted leakage detection: single-feature tests, correlation, train/test overlap | slash_only |
143
+ | `/turing:xray [exp-id]` | Internal model diagnostics: gradient flow, dead neurons, weight distributions, tree analysis | slash_only |
144
+ | `/turing:sensitivity [exp-id]` | Hyperparameter sensitivity analysis: rank parameters by impact, detect non-monotonic responses | slash_only |
145
+ | `/turing:calibrate [exp-id]` | Probability calibration: ECE/MCE, reliability diagrams, Platt/isotonic/temperature scaling | slash_only |
146
+ | `/turing:feature [--method]` | Automated feature selection: multi-method consensus ranking, redundancy, interaction generation | slash_only |
147
+ | `/turing:curriculum [exp-id]` | Training curriculum optimization: difficulty scoring, strategy comparison, impossible sample detection | slash_only |
148
+ | `/turing:prune <exp-id>` | Weight pruning: magnitude/structured/lottery, sparsity sweep, knee point detection | slash_only |
149
+ | `/turing:quantize <exp-id>` | Post-training quantization: FP16/INT8, accuracy-latency comparison, QAT suggestion | slash_only |
150
+ | `/turing:merge <exp-ids...>` | Model merging: uniform/greedy soup, TIES, DARE — free accuracy, zero latency cost | slash_only |
151
+ | `/turing:surgery <exp-id>` | Architecture modification: add/remove layer, widen/narrow, swap activation, skip connections | slash_only |
152
+ | `/turing:trend` | Long-term trend analysis: improvement velocity, family ROI, diminishing returns detection | slash_only |
153
+ | `/turing:flashback` | Session context restoration: "where was I?" after days away from the project | slash_only |
154
+ | `/turing:archive` | Experiment lifecycle cleanup: compress old artifacts, prune checkpoints, summary index | slash_only |
155
+ | `/turing:annotate <exp-id>` | Retrospective annotations: add human notes, tags, search by content | slash_only |
156
+ | `/turing:search <query>` | Natural language experiment search with structured filters | slash_only |
157
+ | `/turing:template <action>` | Experiment template library: save/list/apply reusable configs across projects | slash_only |
158
+ | `/turing:replay <exp-id>` | Experiment replay: re-run old experiment with current infrastructure | slash_only |
159
+ | `/turing:cite <action>` | Citation manager: add/list/check/bib for papers, datasets, methods | slash_only |
160
+ | `/turing:present [--figures]` | Presentation figures: training curves, comparisons, ablation, Pareto, sensitivity | slash_only |
161
+ | `/turing:changelog [--audience]` | Model changelog: version-grouped improvements for technical or stakeholder audiences | slash_only |
162
+ | `/turing:onboard [--audience]` | Project onboarding: full walkthrough for new collaborators | slash_only |
163
+ | `/turing:share <exp-ids...>` | Experiment packaging: portable archive with manifest and README | slash_only |
164
+ | `/turing:review [--venue]` | Peer review simulation: weaknesses, questions, fix commands, score | slash_only |
165
+ | `/turing:whatif "<question>"` | What-if analysis: route hypotheticals to existing estimators (scaling, ablation, sensitivity, ensemble, pruning) | slash_only |
166
+ | `/turing:counterfactual <exp-id> --sample <index>` | Input-level counterfactual explanations: minimum input change to flip a prediction | slash_only |
167
+ | `/turing:simulate [--configs] [--top-k]` | Experiment outcome prediction: pre-filter configs using surrogate model, save budget | slash_only |
168
+ | `/turing:update <exp-id> --new-data <path>` | Incremental model update: add new data without full retraining, forgetting detection | slash_only |
169
+ | `/turing:registry [list\|register\|promote\|demote\|history]` | Model registry: stage lifecycle (candidate → staging → production) with promotion gates | slash_only |
170
+ | `/turing:postmortem [--window N]` | Failure postmortem: diagnose why experiments stopped improving (exhaustion, config error, data issue, ceiling, noise) | slash_only |
171
+ | `/turing:doctor [--fix]` | Harness self-diagnosis: environment, dependencies, config, log integrity, scripts, disk, git state, Claude hooks | slash_only |
172
+ | `/turing:plan [--budget N] [--goal]` | Research planning assistant: strategic campaign design with budget-aware ROI allocation | slash_only |
173
+
174
+ ## Proactive Detection
175
+
176
+ If you detect ML training intent in the conversation (e.g., "the model accuracy is bad", "we need to improve predictions", "let's try a different model"), suggest the relevant sub-command.
177
+
178
+ ## First-Time Setup
179
+
180
+ If no ML project is detected (no `config.yaml`, no `train.py`, no `experiments/`), suggest `/turing:init` first.
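The detection rule above can be sketched in Python. This is a minimal illustration only; the plugin's actual detection logic lives elsewhere and may check more than these three markers:

```python
from pathlib import Path

def detect_ml_project(root: str) -> bool:
    """Return True if the directory looks like an initialized ML project."""
    base = Path(root)
    markers = ["config.yaml", "train.py", "experiments"]
    # Any one marker present is enough to skip the /turing:init suggestion.
    return any((base / m).exists() for m in markers)
```

A directory with none of the markers triggers the `/turing:init` suggestion; one with any of them is treated as an existing project.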
@@ -0,0 +1,46 @@
---
name: ablate
description: Run systematic ablation study — remove components one at a time, measure impact, produce publication-ready table with dead-weight flagging.
argument-hint: "[exp-id] [--components \"X,Y\"] [--seeds 3] [--latex]"
allowed-tools: Read, Bash(*), Grep, Glob
---

Run a systematic ablation study to measure the contribution of each model component.

## Steps

1. **Activate environment:**
   ```bash
   source .venv/bin/activate
   ```

2. **Parse arguments from `$ARGUMENTS`:**
   - First argument can be an experiment ID (e.g., `exp-042`); defaults to best
   - `--components "dropout,feature_X,regularization"` specifies components to ablate
   - `--seeds 3` runs each ablation 3 times for statistical robustness (uses seed runner)
   - `--latex` outputs a LaTeX-formatted table instead of markdown

3. **Run ablation study:**
   ```bash
   python scripts/ablation_study.py $ARGUMENTS
   ```

4. **Report results:**
   - Show the ablation table: Configuration | Metric | Δ from Full | % Change
   - Rank by impact (largest Δ first)
   - Flag **dead-weight** components (removing them improves the metric)
   - If `--latex`, output ready for copy-paste into a paper

5. **Saved output:** results written to `experiments/ablations/exp-NNN-ablation.yaml`

6. **If no ablatable components detected:** suggest using `--components` explicitly.

## Examples

```
/turing:ablate                                  # Auto-detect components
/turing:ablate exp-042                          # Specific experiment
/turing:ablate --components "dropout,subsample" # Specific components
/turing:ablate --seeds 3                        # Multi-seed for robustness
/turing:ablate --latex                          # LaTeX table output
```
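The Δ computation and dead-weight flagging in step 4 can be sketched as follows. This is an illustration under the assumption of a higher-is-better metric; `scripts/ablation_study.py` is the real implementation and its details may differ:

```python
def ablation_table(full_metric: float, ablated: dict) -> list:
    """Rank ablations by impact; flag components whose removal helps."""
    rows = []
    for component, metric in ablated.items():
        delta = metric - full_metric  # Δ from Full
        pct = 100.0 * delta / full_metric if full_metric else 0.0
        rows.append({
            "configuration": f"full - {component}",
            "metric": metric,
            "delta": round(delta, 4),
            "pct_change": round(pct, 2),
            "dead_weight": delta > 0,  # removing it improved the metric
        })
    rows.sort(key=lambda r: abs(r["delta"]), reverse=True)  # largest impact first
    return rows
```

Sorting by absolute Δ puts the most load-bearing components at the top, while the `dead_weight` flag marks candidates for removal.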
@@ -0,0 +1,22 @@
---
name: annotate
description: Retrospective experiment annotations — add human notes, tags, and context that automated metrics can't capture.
argument-hint: "<exp-id> \"note\" [--tag fragile] | --list | --search \"keyword\""
allowed-tools: Read, Bash(*), Grep, Glob
---

Add context that experiment logs can't capture. "This only worked because the data was pre-sorted."

## Steps
1. **Activate environment:** `source .venv/bin/activate`
2. **Run:** `python scripts/experiment_annotations.py $ARGUMENTS`
3. **Operations:** add (text + tags), list (per-experiment or all), search (keyword or tag)
4. **Stored in:** `experiments/annotations.yaml`

## Examples
```
/turing:annotate exp-042 "Fragile — only works with specific preprocessing"
/turing:annotate exp-042 "Reviewer 2 requested this" --tag reviewer-requested
/turing:annotate --list
/turing:annotate --search "fragile"
```
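The add/list/search operations can be sketched with a plain in-memory store. This is an illustration only: the real `scripts/experiment_annotations.py` persists to `experiments/annotations.yaml`, and its record schema is an assumption here:

```python
from datetime import datetime, timezone

class AnnotationStore:
    """Minimal add/list/search over per-experiment notes and tags."""

    def __init__(self):
        self.notes = []  # one dict per annotation

    def add(self, exp_id: str, text: str, tags=()):
        self.notes.append({
            "exp_id": exp_id,
            "text": text,
            "tags": list(tags),
            "created": datetime.now(timezone.utc).isoformat(),
        })

    def list(self, exp_id=None):
        # Per-experiment when exp_id is given, otherwise all annotations.
        return [n for n in self.notes if exp_id is None or n["exp_id"] == exp_id]

    def search(self, keyword: str):
        # Case-insensitive match on note text or tags.
        kw = keyword.lower()
        return [n for n in self.notes
                if kw in n["text"].lower() or kw in (t.lower() for t in n["tags"])]
```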
@@ -0,0 +1,22 @@
---
name: archive
description: Experiment lifecycle cleanup — compress old artifacts, prune checkpoints, create queryable summary index. Reclaim disk space.
argument-hint: "[--older-than 30d] [--keep-best 10] [--dry-run]"
allowed-tools: Read, Bash(*), Grep, Glob
---

Keep your project directory manageable after 200+ experiments.

## Steps
1. **Activate environment:** `source .venv/bin/activate`
2. **Run:** `python scripts/experiment_archive.py $ARGUMENTS`
3. **Protected experiments:** Pareto-optimal, current best, recent, top-N by metric
4. **Report:** archived count, preserved count, space reclaimed
5. **Saved output:** `experiments/archive/index.yaml`

## Examples
```
/turing:archive --dry-run                      # Preview what would be archived
/turing:archive --older-than 30 --keep-best 10 # Archive old, keep top 10
/turing:archive                                # Default: 30 days, keep 10
```
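The selection policy — archive old runs, but never protected ones — can be sketched as below. This is a simplified illustration: it only protects the top-N by metric, while the real script also protects Pareto-optimal and recent runs and handles compression:

```python
from datetime import datetime, timedelta, timezone

def select_for_archive(experiments, older_than_days=30, keep_best=10,
                       metric="accuracy", now=None):
    """Pick experiment IDs to archive: old AND not among the top-N by metric."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=older_than_days)
    top = {e["id"] for e in sorted(experiments, key=lambda e: e[metric],
                                   reverse=True)[:keep_best]}
    return [e["id"] for e in experiments
            if e["finished"] < cutoff and e["id"] not in top]
```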
@@ -0,0 +1,55 @@
---
name: audit
description: Pre-submission methodology audit — catch data leakage, missing baselines, cherry-picked seeds, and incomplete ablations before a reviewer does.
argument-hint: "[--strict] [--checklist neurips]"
allowed-tools: Read, Bash(*), Grep, Glob
---

A reviewer checklist you run before submitting. Catches methodology mistakes that cause desk rejections.

## Steps

1. **Activate environment:**
   ```bash
   source .venv/bin/activate
   ```

2. **Parse arguments from `$ARGUMENTS`:**
   - `--strict` — treat warnings as failures
   - `--checklist neurips|icml|iclr` — add venue-specific checks
   - `--json` — raw JSON output

3. **Run methodology audit:**
   ```bash
   python scripts/methodology_audit.py $ARGUMENTS
   ```

4. **Checks performed:**
   - **Data leakage** (critical): verify prepare.py/evaluate.py separation
   - **CV strategy** (critical): verify appropriate cross-validation for data type
   - **Seed sensitivity** (high): seed studies exist for best experiments
   - **Ablation completeness** (high): ablation studies performed
   - **Baseline comparison** (high): simple baselines in experiment log
   - **Reproducibility** (high): best result successfully reproduced
   - **Hyperparameter budget** (medium): total tuning cost documented
   - **Regression stability** (medium): regression checks performed

5. **Verdicts:**
   - **PASS** — ready for submission
   - **PASS (with warnings)** — address before submission
   - **NEEDS WORK** — fix failures first
   - **FAIL** — critical issues found

6. **Actions:** each failure suggests the `/turing:` command to fix it

7. **Venue checklists:** `--checklist neurips` adds NeurIPS-specific checks (broader impact, reproducibility checklist, code availability)

8. **Saved output:** report in `experiments/audits/audit-YYYY-MM-DD.yaml`

## Examples

```
/turing:audit                     # Standard audit
/turing:audit --strict            # Warnings become failures
/turing:audit --checklist neurips # NeurIPS submission checklist
```
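One plausible way the check severities could roll up into the four verdicts of step 5, including the `--strict` promotion of warnings to failures. This is an assumption about the rollup rules, not the actual logic of `scripts/methodology_audit.py`:

```python
def audit_verdict(results, strict=False):
    """results: list of (severity, status) with status in {'pass','warn','fail'}."""
    if strict:
        # --strict treats warnings as failures.
        results = [(sev, "fail" if st == "warn" else st) for sev, st in results]
    critical_fail = any(sev == "critical" and st == "fail" for sev, st in results)
    any_fail = any(st == "fail" for _, st in results)
    any_warn = any(st == "warn" for _, st in results)
    if critical_fail:
        return "FAIL"
    if any_fail:
        return "NEEDS WORK"
    if any_warn:
        return "PASS (with warnings)"
    return "PASS"
```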
@@ -0,0 +1,44 @@
---
name: baseline
description: Automatic baseline generation — random, majority/mean, linear, k-NN baselines in 60 seconds. Every experiment needs a "is this better than dumb?" reference.
argument-hint: "[--methods all|simple|linear] [--data data.npz]"
allowed-tools: Read, Bash(*), Grep, Glob
---

Generate trivial baselines so you always know if your model is meaningfully better than simple approaches.

## Steps

1. **Activate environment:**
   ```bash
   source .venv/bin/activate
   ```

2. **Parse arguments from `$ARGUMENTS`:**
   - `--methods all|simple|linear` — baseline group (default: all)
   - `--data data.npz` — data file with X and y arrays
   - `--json` — raw JSON output

3. **Run baseline generation:**
   ```bash
   python scripts/generate_baselines.py $ARGUMENTS
   ```

4. **Baselines generated:**
   - **Classification:** Random, Majority class, Stratified random, Logistic Regression, k-NN
   - **Regression:** Random, Mean predictor, Median predictor, Ridge Regression, k-NN
   - Each evaluated with the same protocol as real experiments

5. **Report includes:** comparison table with metric values and notes (floor, ceiling, reference)

6. **Integration:** satisfies the "baseline comparison" check in `/turing:audit`

7. **Saved output:** report in `experiments/baselines/baselines-*.yaml`

## Examples

```
/turing:baseline                          # All baselines
/turing:baseline --methods simple         # Just random + majority
/turing:baseline --data data/processed.npz # With actual data
```
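The simplest of these baselines need no ML library at all. A sketch of the majority-class and mean-predictor references (illustration only; `scripts/generate_baselines.py` additionally fits linear and k-NN models):

```python
from collections import Counter

def majority_baseline_accuracy(y_train, y_test):
    """Predict the most common training label for every test sample."""
    majority = Counter(y_train).most_common(1)[0][0]
    return sum(1 for y in y_test if y == majority) / len(y_test)

def mean_baseline_mse(y_train, y_test):
    """Predict the training mean for every test sample."""
    mean = sum(y_train) / len(y_train)
    return sum((y - mean) ** 2 for y in y_test) / len(y_test)
```

If a trained model doesn't clearly beat these numbers, it hasn't learned anything the label distribution didn't already give away.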
@@ -0,0 +1,94 @@
---
name: brief
description: Generate a structured research intelligence report from experiment history — what's been learned, what's promising, what's exhausted, and what the human should consider next. Use --deep for literature-grounded suggestions.
argument-hint: "[ml/project] [--deep]"
allowed-tools: Read, Bash(python scripts/*:*, source .venv/bin/activate:*), Grep, Glob, WebSearch, WebFetch
---

Generate a research briefing that a human can read in 2 minutes and immediately decide what to inject next.

## Project Detection

Before generating the briefing, detect which project to report on:

0. **Detect project directory:**
   - If `$ARGUMENTS` contains a path (e.g., `ml/coding`), use that as the project directory
   - Else if cwd contains `config.yaml` and `train.py`, use cwd
   - Else search for `ml/*/` subdirectories containing `config.yaml`
     - If exactly one found, use it
     - If multiple found, list them and ask the user which to report on
   - All subsequent commands run from the detected project directory

## Steps

1. **Generate the briefing:**
   ```bash
   source .venv/bin/activate && python scripts/generate_brief.py
   ```

2. **Self-critique the briefing** before presenting. Review the generated output and check:
   - **Recommendations specificity:** Are they concrete enough to act on? "Try a different model" is bad. "Try LightGBM with leaf-wise growth because exp-004 showed depth sensitivity" is good. If vague, rewrite them with specific model/hyperparameter suggestions grounded in the experiment data.
   - **Exhausted directions coverage:** Cross-reference the "Model Types Explored" section against `experiments/log.jsonl`. Are there discarded experiments missing from the summary? If so, add them.
   - **Convergence estimate grounding:** If the briefing says "close to convergence" or "further improvement possible", verify against the actual metric trajectory. Is the claim supported by the numbers?
   - **Metric accuracy:** Spot-check that the "Current Best" metrics match the actual log. Run `python scripts/show_metrics.py --last 1` if uncertain.

   If any section fails the check, regenerate just that section. Max 1 revision round — don't over-polish.

3. **Present the output** to the user. The briefing has 6 sections:
   - **Campaign Summary** — total experiments, keep rate, timespan
   - **Current Best** — model type, metrics, experiment ID, configuration
   - **Improvement Trajectory** — metric over time, rate of improvement
   - **Model Types Explored** — which approaches have been tried and their hit rates
   - **Hypothesis Queue** — pending and completed hypotheses
   - **Recommendations** — data-driven next steps

4. **If `$ARGUMENTS` contains `--deep`:** run the Literature-Grounded Suggestions step below.

5. **Prompt for action:**
   - "Want to inject a hypothesis? Use `/turing:try <idea>`"
   - "Want to continue training? Use `/turing:train`"
   - "Want literature-backed suggestions? Use `/turing:brief --deep`"

## Literature-Grounded Suggestions (--deep flag)

When `--deep` is requested, add a 7th section: **Literature-Grounded Suggestions**.

### Steps:

1. **Read context:** Read `config.yaml` and the briefing output to understand:
   - What task type this is (tabular classification, time series, etc.)
   - Which model families have been exhausted (from "Model Types Explored")
   - Where improvement has plateaued (from "Improvement Trajectory")
   - What failure patterns keep recurring

2. **Search literature** with `WebSearch` for techniques that address the specific stagnation:
   - If plateaued: "improve [task type] accuracy beyond [current metric] 2024"
   - If overfitting: "regularization techniques [model family] [task type]"
   - If all models tried: "state of the art [task type] benchmark 2024 2025"

3. **Distill 3-5 suggestions** from the literature, each with:
   - **Technique:** specific and actionable
   - **Source:** paper or article URL
   - **Why now:** how it addresses the specific stagnation point
   - **Impact estimate:** high/medium/low
   - **Complexity:** low/medium/high

4. **Queue suggestions** as hypotheses:
   ```bash
   source .venv/bin/activate && python scripts/manage_hypotheses.py add "<technique>: <rationale> (source: <citation>)" --priority medium --source literature
   ```

5. **Format as a section** appended to the briefing.

## Saving Briefs

```bash
mkdir -p briefs && python scripts/generate_brief.py > briefs/brief-$(date +%Y-%m-%d).md
```

## When to Use

- After a training session completes or converges
- Before injecting new hypotheses (to understand what's already been tried)
- When returning to a project after time away
- **With `--deep`:** when the agent seems stuck and you want evidence-based direction
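The Campaign Summary numbers (total experiments, keep rate, current best) can be sketched from `experiments/log.jsonl` like this. The record fields (`id`, `status`, metric name) are assumptions for illustration; the real `scripts/generate_brief.py` defines the actual schema and computes far more:

```python
import json

def campaign_summary(log_lines, metric="accuracy"):
    """Summarize a JSONL experiment log: count, keep rate, best run."""
    runs = [json.loads(line) for line in log_lines if line.strip()]
    kept = [r for r in runs if r.get("status") == "keep"]
    best = max(runs, key=lambda r: r.get(metric, float("-inf")))
    return {
        "total": len(runs),
        "keep_rate": len(kept) / len(runs) if runs else 0.0,
        "best_id": best["id"],
        "best_metric": best[metric],
    }
```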
@@ -0,0 +1,51 @@
---
name: budget
description: Compute budget manager — set experiment/time limits, track allocation across explore/exploit phases, auto-shift modes, hard stop.
argument-hint: "<set|status|reset> [--experiments 50] [--hours 8]"
allowed-tools: Read, Bash(*), Grep, Glob
---

Set a compute ceiling and let the system optimize within it. Prevents runaway experiment loops.

## Steps

1. **Activate environment:**
   ```bash
   source .venv/bin/activate
   ```

2. **Parse arguments from `$ARGUMENTS`:**
   - First argument is action: `set`, `status`, `reset`, or `check`
   - `--experiments 50` — max experiment count
   - `--hours 8` — max wall-clock hours
   - `--json` — raw JSON output

3. **Run budget manager:**
   ```bash
   python scripts/budget_manager.py $ARGUMENTS
   ```

4. **Actions:**
   - **set:** create a budget with experiment and/or time constraints
   - **status:** show usage, burn rate, projected exhaustion, allocation breakdown
   - **reset:** deactivate the current budget
   - **check:** returns whether another experiment is allowed (used by `/turing:train`)

5. **Budget allocation policy:**
   - **0-50% budget:** EXPLORE — try diverse hypotheses
   - **50-80% budget:** MIXED — explore promising, exploit best
   - **80-100% budget:** EXPLOIT ONLY — refine the winner
   - **100% budget:** HARD STOP — `/turing:train` refuses new experiments

6. **Budget state** stored in `experiment_state.yaml` under the `budget` key.

7. **If no budget exists:** `/turing:train` runs without limits.

## Examples

```
/turing:budget set --experiments 50 --hours 8 # Set both constraints
/turing:budget set --experiments 30           # Experiment count only
/turing:budget status                         # Show usage and projections
/turing:budget reset                          # Remove budget limits
```
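The allocation policy in step 5 maps consumed budget to a mode. A sketch using the thresholds from the table above (the state handling in `scripts/budget_manager.py` may differ):

```python
def budget_mode(used: int, limit: int) -> str:
    """Map consumed budget fraction to the experiment strategy."""
    if limit <= 0 or used >= limit:
        return "HARD STOP"      # /turing:train refuses new experiments
    frac = used / limit
    if frac < 0.5:
        return "EXPLORE"        # try diverse hypotheses
    if frac < 0.8:
        return "MIXED"          # explore promising, exploit best
    return "EXPLOIT ONLY"       # refine the winner
```

The intent of the phased policy: spend cheap early budget on diversity, then progressively concentrate the remaining budget on the current winner.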
@@ -0,0 +1,46 @@
---
name: calibrate
description: Probability calibration — measure ECE, plot reliability diagrams, apply Platt scaling or isotonic regression.
argument-hint: "[exp-id] [--method platt|isotonic|temperature|auto]"
allowed-tools: Read, Bash(*), Grep, Glob
---

Make model probabilities trustworthy. Does 80% confidence actually mean 80% correct?

## Steps

1. **Activate environment:**
   ```bash
   source .venv/bin/activate
   ```

2. **Parse arguments from `$ARGUMENTS`:**
   - Optional experiment ID
   - `--method platt|isotonic|temperature|auto` — calibration method (default: auto)
   - `--json` — raw JSON output

3. **Run calibration:**
   ```bash
   python scripts/calibration.py $ARGUMENTS
   ```

4. **Report includes:**
   - ECE/MCE before calibration
   - Reliability diagram (predicted vs actual per bin)
   - Calibration method comparison table
   - Verdict: ALREADY CALIBRATED / IMPROVED / NO IMPROVEMENT

5. **Methods:**
   - **Platt:** logistic regression on logits
   - **Isotonic:** non-parametric (more flexible, needs more data)
   - **Temperature:** single scalar T parameter
   - **Auto:** tries all, picks lowest ECE

6. **Saved output:** report in `experiments/calibration/<exp-id>-calibration.yaml`

## Examples

```
/turing:calibrate exp-042                # Auto-select best method
/turing:calibrate exp-042 --method platt # Platt scaling only
```
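Expected Calibration Error bins predictions by confidence and compares each bin's average confidence with its empirical accuracy. A minimal sketch with equal-width bins (`scripts/calibration.py` is the authoritative implementation):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of |accuracy - avg confidence| weighted by bin size."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(1 for i in idx if correct[i]) / len(idx)
        ece += (len(idx) / n) * abs(accuracy - avg_conf)
    return ece
```

A perfectly calibrated model scores 0; an overconfident model (high confidence, low accuracy) scores high, which is exactly the gap the Platt/isotonic/temperature methods try to close.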
@@ -0,0 +1,35 @@
---
name: card
description: Generate a standardized model card documenting the trained model — type, performance, training data, limitations, intended use, and artifact contract.
allowed-tools: Read, Bash(python scripts/*:*, source .venv/bin/activate:*), Grep, Glob
---

You generate a standardized model card from the experiment log, model contract, and config.

## Steps

1. **Activate the virtual environment:**
   ```bash
   source .venv/bin/activate
   ```

2. **Run the model card generator:**
   ```bash
   python scripts/generate_model_card.py --config config.yaml --log experiments/log.jsonl --contract model_contract.md --output MODEL_CARD.md
   ```

3. **Read and present the generated card:**
   - Read `MODEL_CARD.md` and display it to the user.
   - If no experiments exist yet, inform the user and show the skeleton card.

4. **Suggest next steps:**
   - Review the **Ethical Considerations** section and fill in bias, fairness, and impact notes.
   - Review the **Intended Use** section and document what the model is NOT intended for.
   - If limitations mention overfitting, suggest running `/turing:validate` for stability checks.
   - If the card looks complete, suggest committing it to version control.

## Error Handling

- If `config.yaml` is missing, tell the user to run `/turing:init` first.
- If `experiments/log.jsonl` is missing or empty, generate a skeleton card and note that training is needed.
- If `.venv` doesn't exist, try `python3 scripts/generate_model_card.py` directly.
@@ -0,0 +1,21 @@
---
name: changelog
description: Model changelog generation — auto-generate human-readable progress narrative from experiment history for stakeholders.
argument-hint: "[--since exp-id|date] [--audience technical|stakeholder]"
allowed-tools: Read, Bash(*), Grep, Glob
---

Translate experiment logs into a narrative that PMs and stakeholders can read in 2 minutes.

## Steps
1. **Activate environment:** `source .venv/bin/activate`
2. **Run:** `python scripts/generate_changelog.py $ARGUMENTS`
3. **Audience:** technical (experiment IDs, configs), stakeholder (plain English, percentages)
4. **Saved output:** `paper/CHANGELOG.md`

## Examples
```
/turing:changelog                        # Full changelog
/turing:changelog --audience stakeholder # Non-technical summary
/turing:changelog --since exp-042        # Since specific experiment
```
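The audience switch in step 3 amounts to rendering the same improvement record two ways. A sketch with a hypothetical entry schema (`exp_id`, `metric`, `before`/`after`, `change`, `summary` are illustrative names, not the real script's fields):

```python
def changelog_line(entry, audience="technical"):
    """Render one improvement as a changelog bullet for the given audience."""
    if audience == "technical":
        # Experiment IDs and raw metric values.
        return (f"- {entry['exp_id']}: {entry['metric']} "
                f"{entry['before']:.3f} -> {entry['after']:.3f} ({entry['change']})")
    # Stakeholder: plain English plus a percentage.
    gain = 100.0 * (entry["after"] - entry["before"]) / entry["before"]
    return f"- {entry['summary']} ({gain:+.1f}% {entry['metric']})"
```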