@researai/deepscientist 1.5.13 → 1.5.15

Files changed (142)
  1. package/README.md +8 -0
  2. package/assets/branding/logo-raster.png +0 -0
  3. package/bin/ds.js +134 -49
  4. package/docs/en/00_QUICK_START.md +2 -2
  5. package/docs/en/01_SETTINGS_REFERENCE.md +20 -4
  6. package/docs/en/03_QQ_CONNECTOR_GUIDE.md +19 -0
  7. package/docs/en/05_TUI_GUIDE.md +466 -96
  8. package/docs/en/10_WEIXIN_CONNECTOR_GUIDE.md +20 -0
  9. package/docs/en/14_PROMPT_SKILLS_AND_MCP_GUIDE.md +2 -0
  10. package/docs/en/16_TELEGRAM_CONNECTOR_GUIDE.md +134 -0
  11. package/docs/en/17_WHATSAPP_CONNECTOR_GUIDE.md +126 -0
  12. package/docs/en/18_FEISHU_CONNECTOR_GUIDE.md +136 -0
  13. package/docs/en/README.md +8 -0
  14. package/docs/zh/00_QUICK_START.md +2 -2
  15. package/docs/zh/01_SETTINGS_REFERENCE.md +20 -4
  16. package/docs/zh/03_QQ_CONNECTOR_GUIDE.md +19 -0
  17. package/docs/zh/05_TUI_GUIDE.md +465 -82
  18. package/docs/zh/10_WEIXIN_CONNECTOR_GUIDE.md +20 -0
  19. package/docs/zh/14_PROMPT_SKILLS_AND_MCP_GUIDE.md +2 -0
  20. package/docs/zh/16_TELEGRAM_CONNECTOR_GUIDE.md +134 -0
  21. package/docs/zh/17_WHATSAPP_CONNECTOR_GUIDE.md +126 -0
  22. package/docs/zh/18_FEISHU_CONNECTOR_GUIDE.md +136 -0
  23. package/docs/zh/README.md +8 -0
  24. package/install.sh +2 -0
  25. package/package.json +1 -1
  26. package/pyproject.toml +1 -1
  27. package/src/deepscientist/__init__.py +1 -1
  28. package/src/deepscientist/artifact/charts.py +567 -0
  29. package/src/deepscientist/artifact/guidance.py +50 -10
  30. package/src/deepscientist/artifact/metrics.py +228 -5
  31. package/src/deepscientist/artifact/schemas.py +3 -0
  32. package/src/deepscientist/artifact/service.py +4004 -538
  33. package/src/deepscientist/bash_exec/models.py +23 -0
  34. package/src/deepscientist/bash_exec/monitor.py +147 -67
  35. package/src/deepscientist/bash_exec/runtime.py +218 -156
  36. package/src/deepscientist/bash_exec/service.py +79 -64
  37. package/src/deepscientist/bash_exec/shells.py +87 -0
  38. package/src/deepscientist/bridges/connectors.py +51 -2
  39. package/src/deepscientist/config/models.py +6 -3
  40. package/src/deepscientist/config/service.py +7 -2
  41. package/src/deepscientist/connector/lingzhu_support.py +23 -4
  42. package/src/deepscientist/connector/weixin_support.py +122 -1
  43. package/src/deepscientist/daemon/api/handlers.py +75 -4
  44. package/src/deepscientist/daemon/api/router.py +1 -0
  45. package/src/deepscientist/daemon/app.py +869 -236
  46. package/src/deepscientist/doctor.py +51 -0
  47. package/src/deepscientist/file_lock.py +48 -0
  48. package/src/deepscientist/gitops/diff.py +167 -1
  49. package/src/deepscientist/mcp/server.py +331 -21
  50. package/src/deepscientist/process_control.py +161 -0
  51. package/src/deepscientist/prompts/builder.py +275 -491
  52. package/src/deepscientist/quest/service.py +2336 -145
  53. package/src/deepscientist/quest/stage_views.py +305 -29
  54. package/src/deepscientist/runners/base.py +2 -0
  55. package/src/deepscientist/runners/codex.py +88 -5
  56. package/src/deepscientist/runners/runtime_overrides.py +17 -1
  57. package/src/deepscientist/shared.py +6 -1
  58. package/src/prompts/contracts/shared_interaction.md +13 -4
  59. package/src/prompts/system.md +984 -1985
  60. package/src/skills/analysis-campaign/SKILL.md +31 -2
  61. package/src/skills/analysis-campaign/references/artifact-orchestration.md +1 -1
  62. package/src/skills/analysis-campaign/references/writing-facing-slice-examples.md +65 -0
  63. package/src/skills/baseline/SKILL.md +267 -994
  64. package/src/skills/baseline/references/baseline-checklist-template.md +21 -32
  65. package/src/skills/baseline/references/baseline-plan-template.md +41 -57
  66. package/src/skills/decision/SKILL.md +19 -2
  67. package/src/skills/experiment/SKILL.md +8 -2
  68. package/src/skills/finalize/SKILL.md +18 -0
  69. package/src/skills/idea/SKILL.md +78 -0
  70. package/src/skills/idea/references/idea-generation-playbook.md +100 -0
  71. package/src/skills/idea/references/outline-seeding-example.md +60 -0
  72. package/src/skills/intake-audit/SKILL.md +1 -1
  73. package/src/skills/optimize/SKILL.md +1644 -0
  74. package/src/skills/rebuttal/SKILL.md +2 -1
  75. package/src/skills/review/SKILL.md +2 -1
  76. package/src/skills/write/SKILL.md +80 -12
  77. package/src/skills/write/references/outline-evidence-contract-example.md +107 -0
  78. package/src/tui/dist/app/AppContainer.js +1445 -52
  79. package/src/tui/dist/components/Composer.js +1 -1
  80. package/src/tui/dist/components/ConfigScreen.js +190 -36
  81. package/src/tui/dist/components/GradientStatusText.js +1 -20
  82. package/src/tui/dist/components/InputPrompt.js +41 -32
  83. package/src/tui/dist/components/LoadingIndicator.js +1 -1
  84. package/src/tui/dist/components/Logo.js +61 -38
  85. package/src/tui/dist/components/MainContent.js +10 -3
  86. package/src/tui/dist/components/WelcomePanel.js +4 -12
  87. package/src/tui/dist/components/messages/AssistantMessage.js +1 -1
  88. package/src/tui/dist/components/messages/BashExecOperationMessage.js +3 -3
  89. package/src/tui/dist/components/messages/OperationMessage.js +1 -1
  90. package/src/tui/dist/index.js +28 -1
  91. package/src/tui/dist/layouts/DefaultAppLayout.js +3 -3
  92. package/src/tui/dist/lib/api.js +17 -0
  93. package/src/tui/dist/lib/connectors.js +261 -0
  94. package/src/tui/dist/semantic-colors.js +29 -19
  95. package/src/tui/package.json +1 -1
  96. package/src/ui/dist/assets/{AiManusChatView-CnJcXynW.js → AiManusChatView-DDjbFnbt.js} +12 -12
  97. package/src/ui/dist/assets/{AnalysisPlugin-DeyzPEhV.js → AnalysisPlugin-Yb5IdmaU.js} +1 -1
  98. package/src/ui/dist/assets/CliPlugin-e64sreyu.js +31037 -0
  99. package/src/ui/dist/assets/{CodeEditorPlugin-B-xicq1e.js → CodeEditorPlugin-C4D2TIkU.js} +8 -8
  100. package/src/ui/dist/assets/{CodeViewerPlugin-DT54ysXa.js → CodeViewerPlugin-BVoNZIvC.js} +5 -5
  101. package/src/ui/dist/assets/{DocViewerPlugin-DQtKT-VD.js → DocViewerPlugin-CLChbllo.js} +3 -3
  102. package/src/ui/dist/assets/{GitDiffViewerPlugin-hqHbCfnv.js → GitDiffViewerPlugin-C4xeFyFQ.js} +20 -20
  103. package/src/ui/dist/assets/{ImageViewerPlugin-OcVo33jV.js → ImageViewerPlugin-OiMUAcLi.js} +5 -5
  104. package/src/ui/dist/assets/{LabCopilotPanel-DdGwhEUV.js → LabCopilotPanel-BjD2ThQF.js} +11 -11
  105. package/src/ui/dist/assets/{LabPlugin-Ciz1gDaX.js → LabPlugin-DQPg-NrB.js} +2 -2
  106. package/src/ui/dist/assets/{LatexPlugin-BhmjNQRC.js → LatexPlugin-CI05XAV9.js} +7 -7
  107. package/src/ui/dist/assets/{MarkdownViewerPlugin-BzdVH9Bx.js → MarkdownViewerPlugin-DpeBLYZf.js} +4 -4
  108. package/src/ui/dist/assets/{MarketplacePlugin-DmyHspXt.js → MarketplacePlugin-DolE58Q2.js} +3 -3
  109. package/src/ui/dist/assets/{NotebookEditor-BTVYRGkm.js → NotebookEditor-7Qm2rSWD.js} +11 -11
  110. package/src/ui/dist/assets/{NotebookEditor-BMXKrDRk.js → NotebookEditor-C1kWaxKi.js} +1 -1
  111. package/src/ui/dist/assets/{PdfLoader-CvcjJHXv.js → PdfLoader-BfOHw8Zw.js} +1 -1
  112. package/src/ui/dist/assets/{PdfMarkdownPlugin-DW2ej8Vk.js → PdfMarkdownPlugin-BulDREv1.js} +2 -2
  113. package/src/ui/dist/assets/{PdfViewerPlugin-CmlDxbhU.js → PdfViewerPlugin-C-daaOaL.js} +10 -10
  114. package/src/ui/dist/assets/{SearchPlugin-DAjQZPSv.js → SearchPlugin-CjpaiJ3A.js} +1 -1
  115. package/src/ui/dist/assets/{TextViewerPlugin-C-nVAZb_.js → TextViewerPlugin-BxIyqPQC.js} +5 -5
  116. package/src/ui/dist/assets/{VNCViewer-D7-dIYon.js → VNCViewer-HAg9mF7M.js} +10 -10
  117. package/src/ui/dist/assets/{bot-C_G4WtNI.js → bot-0DYntytV.js} +1 -1
  118. package/src/ui/dist/assets/{code-Cd7WfiWq.js → code-B20Slj_w.js} +1 -1
  119. package/src/ui/dist/assets/{file-content-B57zsL9y.js → file-content-DT24KFma.js} +1 -1
  120. package/src/ui/dist/assets/{file-diff-panel-DVoheLFq.js → file-diff-panel-DK13YPql.js} +1 -1
  121. package/src/ui/dist/assets/{file-socket-B5kXFxZP.js → file-socket-B4T2o4nR.js} +1 -1
  122. package/src/ui/dist/assets/{image-LLOjkMHF.js → image-DSeR_sDS.js} +1 -1
  123. package/src/ui/dist/assets/{index-hOUOWbW2.js → index-BrFje2Uk.js} +2 -2
  124. package/src/ui/dist/assets/{index-Dxa2eYMY.js → index-BwRJaoTl.js} +1 -1
  125. package/src/ui/dist/assets/{index-CLQauncb.js → index-D_E4281X.js} +5418 -28620
  126. package/src/ui/dist/assets/{index-C3r2iGrp.js → index-DnYB3xb1.js} +12 -12
  127. package/src/ui/dist/assets/{index-BQG-1s2o.css → index-G7AcWcMu.css} +43 -2
  128. package/src/ui/dist/assets/{monaco-BGGAEii3.js → monaco-LExaAN3Y.js} +1 -1
  129. package/src/ui/dist/assets/{pdf-effect-queue-DlEr1_y5.js → pdf-effect-queue-BJk5okWJ.js} +1 -1
  130. package/src/ui/dist/assets/{popover-CWJbJuYY.js → popover-D3Gg_FoV.js} +1 -1
  131. package/src/ui/dist/assets/{project-sync-CRJiucYO.js → project-sync-C_ygLlVU.js} +1 -1
  132. package/src/ui/dist/assets/{select-CoHB7pvH.js → select-CpAK6uWm.js} +2 -2
  133. package/src/ui/dist/assets/{sigma-D5aJWR8J.js → sigma-DEccaSgk.js} +1 -1
  134. package/src/ui/dist/assets/{square-check-big-DUK_mnkS.js → square-check-big-uUfyVsbD.js} +1 -1
  135. package/src/ui/dist/assets/{trash-ChU3SEE3.js → trash-CXvwwSe8.js} +1 -1
  136. package/src/ui/dist/assets/{useCliAccess-BrJBV3tY.js → useCliAccess-Bnop4mgR.js} +1 -1
  137. package/src/ui/dist/assets/{useFileDiffOverlay-C2OQaVWc.js → useFileDiffOverlay-B8eUAX0I.js} +1 -1
  138. package/src/ui/dist/assets/{wrap-text-C7Qqh-om.js → wrap-text-9vbOBpkW.js} +1 -1
  139. package/src/ui/dist/assets/{zoom-out-rtX0FKya.js → zoom-out-BgVMmOW4.js} +1 -1
  140. package/src/ui/dist/index.html +2 -2
  141. package/uv.lock +1 -1
  142. package/src/ui/dist/assets/CliPlugin-CB1YODQn.js +0 -5905
@@ -6,112 +6,83 @@ description: Use when a quest needs to attach, import, reproduce, repair, verify
  # Baseline
 
  This skill establishes the reference system the quest will compare against.
- It absorbs the essential old DeepScientist reproducer discipline into one stage skill.
+ The target is one trustworthy baseline line, not an endless reproduction diary.
 
  ## Interaction discipline
 
  - Follow the shared interaction contract injected by the system prompt.
- - For ordinary active work, prefer a concise progress update once work has crossed roughly 6 tool calls with a human-meaningful delta, and do not drift beyond roughly 12 tool calls or about 8 minutes without a user-visible update.
- - Keep ordinary setup and debugging updates concise. Reserve richer milestone reports for accepted / waived / blocked baseline outcomes or other route-changing checkpoints instead of narrating every small setup step.
- - Message templates are references only. Adapt to the actual context and vary wording so updates feel natural and non-robotic.
- - If a threaded user reply arrives, interpret it relative to the latest baseline progress update before assuming the task changed completely.
- - Prefer `bash_exec` for setup, reproduction, and verification commands so each baseline action keeps a durable quest-local session id and log trail.
- - When the baseline route is durably chosen, confirmed, waived, or blocked with a clear next action, send one richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` update that says whether the baseline is trusted, blocked, or waived, why that matters, and what the next stage is.
+ - Keep ordinary setup and debugging updates concise.
+ - Use richer milestone updates only when the baseline becomes trusted, caveated, blocked, waived, or route-changing.
+ - Hard execution rule: every terminal command in this stage must go through `bash_exec`; do not use any other terminal path for setup, reproduction, monitoring, verification, Git, Python, package-manager, or file-inspection commands.
+ - Prefer `bash_exec` for setup, reproduction, monitoring, and verification commands so the baseline line stays durable and auditable.
 
  ## Non-negotiable rules
 
- - no fabrication of metrics, logs, run status, or success claims
- - do not skip baseline steps or silently simplify the reproduction path without explicit approval
+ - no fabricated metrics, logs, run status, or success claims
+ - do not skip baseline steps or silently simplify the route when that would change trust or comparability
  - do not claim a baseline is ready before verification is complete
- - do not infer missing commands, scripts, or parameters when the uncertainty would change the result
+ - do not infer missing commands, scripts, or parameters when the uncertainty could change the result
  - any unavoidable guess must be written down explicitly with expected impact
- - for Python baselines, standardize environment setup with `uv`; do not default to ad-hoc `pip install ...`, a fresh `conda create ...`, or global package mutation when `uv` can provide the same environment reproducibly
  - use web search for discovering papers or repos, but use `artifact.arxiv(paper_id=..., full_text=False)` for actually reading a source arXiv paper when it exists
- - set `full_text=True` only when the summary/abstract view is insufficient for the needed detail; do not default to the raw PDF
-
- ## Language and interaction rules
-
- - match the user's language in all visible outputs
- - keep updates concise but concrete
- - if a structured user decision is required, ask only for decisions that the system cannot safely derive locally
- - do not ask speculative or premature questions when local analysis can narrow the choices first
+ - set `full_text=True` only when the short form is insufficient
+ - for Python baselines, environment setup should be standardized around `uv`
 
  ## Stage purpose
 
  The baseline stage should produce a usable reference point through one of four routes:
 
- - attach an existing reusable baseline
- - import a reusable baseline package
- - reproduce a baseline from source
- - repair a broken or stale baseline
+ 1. attach an existing reusable baseline
+ 2. import a reusable baseline package
+ 3. reproduce a baseline from source
+ 4. repair a broken or stale baseline
 
- The stage must preserve the classic four-part reproducer flow:
+ Keep the classic control flow:
 
  1. analysis
  2. setup
  3. execution
  4. verification
 
- Do not casually skip these gates.
+ These are control gates, not paperwork walls.
 
  ## Quick workflow
 
- Treat this as the compressed map of the detailed sections below, not as a second independent SOP.
-
- 1. Read the source paper and source repo first, or explicitly record what is missing and why.
+ 1. Read the source paper and source repo first, or record exactly what is missing and why.
  2. Choose the lightest trustworthy route: attach, import, reproduce, or repair.
- 3. Before substantial setup, code changes, or a real run, create `PLAN.md` and `CHECKLIST.md`, and keep them updated when the route, assets, commands, or trust judgment changes materially.
- 4. Keep one dominant phase visible: analysis -> setup -> execution -> verification, with a bounded smoke test before any real long run.
- 5. Once the route is concrete, prefer one clean implementation pass, one smoke test, and then one normal baseline run; retry only when the smoke test, verification, or runtime evidence shows a concrete failure or incompatibility.
- 6. Close the baseline stage by confirming or waiving the gate, then send a concise `1-2` sentence summary that says whether the baseline is trusted, caveated, blocked, or waived, and what happens next.
-
- ## Route priority and escalation
+ 3. Start with the fast path whenever the current baseline object, command path, and acceptance target are already clear enough to validate cheaply.
+ 4. Before substantial baseline setup, code edits, or a real baseline run, create `PLAN.md` and `CHECKLIST.md`; short-form files are enough for simple fast-path work.
+ 5. Keep one dominant phase visible: analysis -> setup -> execution -> verification.
+ 6. Prefer one clean implementation pass, one smoke test, and then one normal baseline run.
+ 7. Retry only when smoke, verification, or runtime evidence shows a concrete failure or incompatibility.
+ 8. Close the stage by confirming or waiving the gate, then hand off with a concise `1-2` sentence summary of trust status and next anchor.
 
- This section sets route priority and escalation rules. The authoritative step-by-step execution remains in `Workflow`.
+ ## Fast-path first
 
  Default to the lightest baseline path that can still establish a trustworthy comparison.
- Do not front-load a full reproduction dossier when a faster truth-finding step would tell you whether the route is even viable.
- User requirements and explicit constraints are the primary boundary for the reproduction plan.
- Within that boundary, prefer equivalence-preserving efficiency gains before more compute: larger safe batch size, cache reuse, checkpoint resume, parallel downloads or workers, and the cheapest comparable smoke path.
+ Default to a fast path when it can establish trust with less work.
 
- The ordinary baseline order is:
+ Fast path is the default when any of the following is true:
 
- 1. confirm quest binding and current baseline state
- 2. look for the cheapest trustworthy route in order: attach, import, reproduce, repair
- 3. capture the minimum viable contract: task, dataset or split, metric, source identity, expected command path, and main risks
- 4. run a bounded smoke test as soon as that contract is concrete enough, then expand setup notes and launch the real run only after the smoke test is credible
- 5. verify before accepting, then archive, publish, or attach the result when appropriate
+ - `requested_baseline_ref` or `confirmed_baseline_ref` already points to the active baseline object
+ - the route is clearly `attach` or `import`
+ - the repo entrypoint, dataset or split, and metric contract are already concrete enough to validate cheaply
+ - reproduction requires no meaningful code changes and the main uncertainty is only whether the command still runs
 
- Escalate to the heavier baseline path only when the baseline is ambiguous, broken, multi-variant, paper-to-repo mismatched, or likely to be reused beyond the current quest.
+ Fast path means:
 
- If the quest is not yet bound to a stable baseline context, do not pretend the stage is ready just because some code exists locally.
+ - do not restart broad baseline discovery by default
+ - do not front-load a full codebase audit when the entrypoint is already concrete
+ - use a minimal `PLAN.md`, a minimal `CHECKLIST.md`, one bounded smoke test when needed, and then one real validation or run
+ - default to reuse-and-verify when runtime already attached a concrete baseline
 
- ## Required plan and checklist
+ Escalate from fast path to fuller audit only when:
 
- Before substantial baseline setup, code edits, or a real baseline run, create a quest-visible `PLAN.md` and `CHECKLIST.md`.
-
- - Use `references/baseline-plan-template.md` as the canonical structure for `PLAN.md`.
- - Use `references/baseline-checklist-template.md` as the canonical structure for `CHECKLIST.md`.
- - `PLAN.md` becomes mandatory after you have read the source paper and repo enough to restate the method faithfully, identify the real entrypoints, and explain the likely failure points; if either source is missing, record that gap explicitly before proceeding.
- - `PLAN.md` should put the user's explicit requirements and non-negotiable constraints first, then cover the chosen route, source package and provenance, safe efficiency levers, code touchpoints, environment and asset plan, smoke test, main run, fallback options such as ModelScope or local mirrors when Hugging Face is blocked, monitoring and sleep rules, verification targets, and a revision log.
- - `CHECKLIST.md` is the living companion to `PLAN.md`; update it during reading, setup, smoke testing, real execution, verification, and every material route change.
- - If an older quest already uses `analysis_plan.md` or `REPRO_CHECKLIST.md`, keep those files aligned with the canonical `PLAN.md` / `CHECKLIST.md` or turn them into clear compatibility pointers rather than splitting truth across parallel planning files.
- - Do not treat the plan as static: if the route, commands, source package, fallback path, or trust judgment changes materially, revise `PLAN.md` before continuing.
- - Once `PLAN.md` makes the route concrete, do not keep rewriting code or commands speculatively. The normal default is one bounded smoke test and then one real run, with retries only after a documented failure, invalidity, or compatibility problem.
-
- ## Phase routing rule
-
- Treat `analysis`, `setup`, `execution`, and `verification` as logical control gates, not paperwork walls.
- At any moment, the work should have one dominant phase among:
-
- - `analysis`
- - `setup`
- - `execution`
- - `verification`
-
- Keep the dominant phase explicit, but allow small backtracks and lightweight overlap when they reduce wasted work.
- Do not delay an early smoke test just because a fuller write-up is not done yet.
- Before a real long run, make sure the minimum viable contract is explicit and the active phase is still easy to reconstruct.
+ - the paper and repo disagree materially
+ - the real run or eval entrypoint is unclear
+ - code changes are likely required
+ - the contract spans multiple metrics, datasets, subtasks, or splits that still need interpretation
+ - the same failure class reappears after one documented autonomous fix
+ - the quest is trying to publish a reusable global baseline rather than only clear the current gate
 
  ## Use when
 
@@ -119,7 +90,7 @@ Before a real long run, make sure the minimum viable contract is explicit and th
  - the current baseline is unverified or stale
  - the user already has a baseline package that should be attached or imported
  - a reproduction failed earlier and now needs repair
- - the quest was resumed and the baseline trust state is unclear
+ - the quest resumed and the baseline trust state is unclear
 
  ## Do not use when
 
@@ -128,97 +99,83 @@ Before a real long run, make sure the minimum viable contract is explicit and th
 
  ## Stage gate
 
- Do not proceed to `idea` or `experiment` unless one of the following is durably true:
+ Do not proceed to comparison-heavy downstream work unless one of the following is durably true:
 
  - a baseline has been attached and accepted
  - a baseline has been imported and accepted
  - a baseline reproduction has completed and been verified
  - an explicit waiver decision exists with a clear reason
 
- Operationally, the canonical exit is stricter:
-
- - after the accepted baseline root is clear, call `artifact.confirm_baseline(...)`
- - if the quest must continue without a baseline, call `artifact.waive_baseline(...)`
-
- `attach`, `import`, `publish`, or a plain `baseline` artifact alone do not open the downstream gate.
+ Operationally:
 
- ## Truth sources
+ - call `artifact.confirm_baseline(...)` once the accepted baseline root and trusted comparison contract are clear
+ - call `artifact.waive_baseline(...)` when the quest must continue without a baseline
+ - attach, import, or publish alone do not open the downstream gate
 
- Use the following as baseline truth sources:
-
- - user objective and task framing
- - source paper and official repo when available
- - existing baseline registry entries
- - local baseline directories under `quest_root`
- - repo code, configs, and scripts
- - device and environment constraints detected locally
- - logs, metrics, and summaries from actual runs
-
- Do not treat memory alone as sufficient evidence for baseline readiness.
-
- ## Baseline workspace rules
+ ## Required plan and checklist
 
- - treat the baseline workspace as a system-managed reproduction surface, not an unrelated sandbox
- - avoid creating nested Git workflows inside the baseline workspace
- - keep the authoritative quest history in the quest repo
- - if papers are converted or notes are generated during baseline work, keep the durable copies under the quest-visible artifacts area unless there is a strong reason to keep a baseline-side copy
- - if runtime environment variables or secrets are provided by the runner, use them as authoritative but never echo or persist secret values
+ Before substantial baseline setup, code edits, or a real baseline run, create a quest-visible `PLAN.md` and `CHECKLIST.md`.
 
- The baseline line should also maintain a durable working-record area outside the execution surface.
- Recommended quest-visible records include:
+ - Use `references/baseline-plan-template.md` as the canonical structure for `PLAN.md`.
+ - Use `references/baseline-checklist-template.md` as the canonical structure for `CHECKLIST.md`.
+ - `analysis_plan.md` and `REPRO_CHECKLIST.md` remain acceptable compatibility alias files when an older quest already depends on them.
+ - For fast-path attach/import/prebound validation or a simple reproduce path with no expected code changes, short-form `PLAN.md` and `CHECKLIST.md` are enough.
+ - The plan should put the user's explicit requirements and non-negotiable constraints first.
+ - Then record the chosen route, source identity, command path, expected outputs, acceptance condition, safe efficiency levers, main risks, and fallback.
+ - If the route, commands, source package, fallback path, or trust judgment changes materially, revise `PLAN.md` before continuing.
+ - Once the route is concrete, stop reshaping code and commands speculatively.
 
- - `PLAN.md` as the canonical baseline plan; older quests may keep `analysis_plan.md` as a compatibility alias
- - `CHECKLIST.md` as the canonical living checklist; older quests may keep `REPRO_CHECKLIST.md` as a compatibility alias when already wired
- - `setup.md`
- - `execution.md`
- - `verification.md`
- - `STRUCTURE.md` only when the workspace layout is non-obvious or later reuse depends on it
+ Default retry discipline:
 
- For a simple attach/import flow or a straightforward reproduce flow, do not stall just to precreate every one of these files.
- Start with the smallest durable note that preserves the route, command path, target outputs, and main risks; expand it only after the route proves real.
+ - do not rerun the same unchanged smoke command just to reconfirm the same fact
+ - treat one autonomous retry for the same failure class as the normal upper bound
+ - if the same failure class appears again, switch explicitly into `repair`, record `blocked`, or route through `decision`
 
  ## Required durable outputs
 
  The baseline stage should usually leave behind:
 
  - a baseline directory under `baselines/local/` or `baselines/imported/`
- - a verification note or report under the quest
+ - `PLAN.md` and `CHECKLIST.md`
+ - a verification note or report
  - command, config, environment, and metrics pointers
  - a baseline artifact
  - a confirmed baseline gate via `artifact.confirm_baseline(...)`, or an explicit waiver via `artifact.waive_baseline(...)`
  - an optional registry publication if the baseline is reusable beyond this quest
 
- ## Stable execution contract
+ For simple attach/import flows or a straightforward reproduce flow, do not stall just to precreate every optional note file.
 
- To keep baseline work stable across different quests, do not stop at loose prose.
- But also do not confuse stability with ceremony.
- Use the lightest durable structure that keeps the baseline auditable and reusable.
+ Useful optional notes:
+
+ - `setup.md`
+ - `execution.md`
+ - `verification.md`
+ - `STRUCTURE.md` when the layout is non-obvious
+
+ ## File-by-file contract
+
+ - `PLAN.md` or compatibility alias `analysis_plan.md` is the required route contract before substantial setup, code edits, or a real run; it should state the route, source identity, command path, expected outputs, acceptance condition, main risks, and fallback.
+ - `CHECKLIST.md` or compatibility alias `REPRO_CHECKLIST.md` is the required living state tracker; it should show whether the baseline object, smoke decision, real run decision, and final accept / block / waive outcome are explicit.
+ - `setup.md` is optional unless environment or layout choices are non-trivial; if used, record the working directory, environment route, important config paths, source revision, and notable setup deviations.
+ - `execution.md` is optional unless the run is long, multi-step, or rerun-heavy; if used, record the launched commands, durable log paths, checkpoints, exit state, and any reruns or repairs.
+ - `verification.md` is optional as a filename but required in substance before acceptance or blocked closeout; either this file or an equivalent report should record trusted metrics, expected-versus-observed comparison, caveats, canonical output paths, and the next anchor.
+ - `STRUCTURE.md` becomes required when the workspace layout, mounts, symlinks, or generated outputs are non-obvious or meant for reuse; it should map the important directories and say which paths are canonical.
+ - `attachment.yaml` is required for attached or imported baselines under `baselines/imported/`; preserve source identity, selected variant when relevant, and attachment provenance there.
+ - `<baseline_root>/json/metric_contract.json` is the canonical accepted comparison contract; once the baseline is accepted, do not leave the authoritative metric surface only in chat, memory, or prose.
+ - `Result/metric.md` is scratch-only; it may help during execution, but it is never the final source of truth.
 
  Minimum stability rules:
 
  - before the first real run, leave one durable note with the chosen route, expected command path, target outputs, and main risks
  - after each smoke test or real run, record what actually happened and whether the route still looks viable
  - before acceptance, leave a clear verification note and baseline gate decision
- - every route selection should leave one explicit reasoned decision record
  - every accepted baseline should leave one accepted baseline artifact
  - every blocked baseline line should leave one blocked report and one next-step decision
- - every handoff should name the active baseline reference and trusted metric set explicitly
- - when the accepted paper-facing contract spans multiple metrics, datasets, subtasks, or splits, preserve that full comparison surface in the durable metric contract rather than collapsing it to one headline number
- - do not require every optional checklist or template before the first smoke test
  - if one rolling note is enough for a simple baseline line, use it
 
- Recommended phase-to-output mapping:
-
- - `analysis` -> a brief `PLAN.md` or compatible `analysis_plan.md`, plus optional route decision artifact
- - `setup` -> `setup.md` when setup choices are non-trivial
- - `execution` -> `execution.md` plus progress artifacts when long-running
- - `verification` -> `verification.md` plus accepted baseline artifact and `artifact.confirm_baseline(...)`, or a blocked report plus `artifact.waive_baseline(...)` when skipping is intentional
-
- If the work skips one of these durable outputs, explain why the baseline remains interpretable without it.
-
  ## Durable path contract
 
- The baseline stage should use the real runtime paths consistently.
+ Use the real runtime paths consistently.
 
  Quest-local paths:
 
@@ -235,33 +192,33 @@ Global reusable registry paths:
235
192
  - baseline registry index: `~/DeepScientist/config/baselines/index.jsonl`
236
193
  - canonical baseline entry: `~/DeepScientist/config/baselines/entries/<baseline_id>.yaml`
237
194
 
195
+ ## Baseline id and variant rules
196
+
197
+ - `baseline_id` should be short, stable, and filesystem-safe
198
+ - use letters, digits, `.`, `_`, or `-`
199
+ - do not use spaces, `/`, `\\`, or `..`
200
+ - if one codebase contains multiple comparable baselines, prefer one `baseline_id` with structured variants instead of inventing many near-duplicate entries
201
+ - when variants exist, keep `default_variant_id`, `baseline_variants`, and per-variant metric summaries stable enough that later `experiment` and `write` stages can cite them directly
202
+
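The id rules above can be checked mechanically. The following is a minimal sketch, not the package's own validator: the function name `is_valid_baseline_id` and the `MAX_LEN` cap are hypothetical, and "letters" is read here as ASCII letters only.

```python
import re

# Allowed characters per the rules above: letters, digits, '.', '_', '-'.
# Assumption: "letters" means ASCII letters.
_ALLOWED = re.compile(r"[A-Za-z0-9._-]+")

# Hypothetical cap; the rules only say the id should be "short".
MAX_LEN = 64


def is_valid_baseline_id(baseline_id: str) -> bool:
    """Return True when the id is non-empty, short, and filesystem-safe."""
    if not baseline_id or len(baseline_id) > MAX_LEN:
        return False
    # '..' is made of allowed characters, so it needs its own check.
    if ".." in baseline_id:
        return False
    # fullmatch rejects spaces, '/', '\\', and any other character
    # outside the allowed set.
    return _ALLOWED.fullmatch(baseline_id) is not None
```

An id such as `resnet18_cifar10-v1.0` passes this check, while `my baseline`, `a/b`, and `a..b` are rejected.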
238
203
  Do not invent parallel durable locations when these runtime contracts already exist.
239
204
  Do not leave the authoritative metric contract only in chat, memory, or prose once the baseline is accepted.
240
205
 
241
206
  If a baseline is reproduced only because an analysis campaign needs an extra comparator:
242
207
 
243
- - still place it under `<quest_root>/baselines/local/<baseline_id>/` or `<quest_root>/baselines/imported/<baseline_id>/`
208
+ - still place it under the normal baseline roots
244
209
  - treat it as a supplementary analysis baseline unless the quest explicitly promotes it into the canonical gate
245
210
  - do not call `artifact.confirm_baseline(...)` for that supplementary case unless the quest truly intends to replace the canonical baseline
246
211
 
247
- ## Baseline id and variant rules
248
-
249
- Baseline identity should be stable and path-safe.
250
-
251
- - `baseline_id` should be short, stable, and filesystem-safe
252
- - use letters, digits, `.`, `_`, or `-`
253
- - do not use spaces, `/`, `\\`, or `..`
254
- - if one codebase contains multiple comparable baselines, use one `baseline_id` with structured variants instead of inventing many unrelated entries
255
-
256
- When variants exist, maintain at least:
212
+ ## Multi-baseline policy
257
213
 
258
- - `default_variant_id`
259
- - `baseline_variants`
260
- - per-variant metric summaries when available
214
+ One quest may legitimately need more than one baseline.
261
215
 
262
- The baseline stage should treat `baseline_id` and `variant_id` as durable references that later `idea`, `experiment`, and `write` stages can cite directly.
216
+ - explicitly mark which baseline is the primary downstream comparator
217
+ - distinguish primary comparison baselines from fallback or infrastructure baselines
218
+ - if several baselines are credible, record why the chosen primary baseline is the fairest paper-facing comparator
219
+ - do not leave later stages guessing which baseline is authoritative
263
220
 
264
- ## Baseline route order
221
+ ## Route order
265
222
 
266
223
  Prefer this order:
267
224
 
@@ -272,101 +229,6 @@ Prefer this order:
272
229
 
273
230
  Prefer reuse over redundant reproduction.
274
231
 
275
- ## Route selection rules
276
-
277
- Choose the route explicitly rather than by habit.
278
-
279
- - choose `attach` when a published baseline already exists in the registry and its metrics or provenance are trustworthy enough for the quest
280
- - choose `import` when the user or repo provides a reusable baseline package or bundle that is not yet attached to the current quest
281
- - choose `reproduce` when no trustworthy reusable baseline is available but the source repo, paper, and evaluation path are concrete enough to establish one
282
- - choose `repair` when a baseline route already exists but failed, drifted, or is only partially complete and the broken point is bounded enough to diagnose directly
283
-
284
- Do not default to reproduction if attach or import would establish an equally trustworthy reference with less risk and cost.
285
-
286
- Before locking the route, explicitly answer:
287
-
288
- - what object is being reused or established
289
- - what makes it trustworthy enough for downstream comparison
290
- - what evidence is missing
291
- - what the cheapest credible next step is
292
-
293
- For a more explicit route-selection rubric, read `references/route-selection.md`.
294
-
295
- ## Baseline comparability contract
296
-
297
- The baseline stage is not complete just because something ran.
298
- It is complete when later stages can compare against it fairly.
299
-
300
- Before declaring a baseline usable, make the comparability contract explicit:
301
-
302
- - task identity
303
- - dataset identity and version
304
- - split contract
305
- - preprocessing boundary
306
- - evaluation script or evaluation path
307
- - required metric keys
308
- - metric directions
309
- - seed policy when relevant
310
- - source commit or source package identity
311
- - known deviations from the source reference
312
-
313
- If any of these are still materially unknown, do not pretend the baseline is a clean downstream reference.
314
-
315
- Use `references/comparability-contract.md` for the full checklist.
316
-
317
- ## Feasibility and acceptance classes
318
-
319
- Before accepting a baseline, classify feasibility as one of:
320
-
321
- - `full_reproducible`
322
- - `degraded_but_acceptable`
323
- - `blocked`
324
-
325
- And classify downstream trust as one of:
326
-
327
- - `verified`
328
- - `partially_verified`
329
- - `operational_but_incomparable`
330
- - `failed`
331
-
332
- Rules:
333
-
334
- - `full_reproducible` means the baseline can be reproduced within the agreed contract
335
- - `degraded_but_acceptable` means the quest explicitly allows a bounded degraded gate
336
- - `blocked` means insufficient assets, compute, or environment to produce an acceptable baseline
337
- - `verified` means trusted for downstream comparison
338
- - `partially_verified` means useful but still caveated
339
- - `operational_but_incomparable` means it runs, but the comparison contract is not stable enough yet
340
- - `failed` means it should not be used downstream
341
-
342
- Do not silently upgrade a degraded or only operational result into a normal trusted baseline.
343
-
344
- ## Multi-baseline policy
345
-
346
- One quest may legitimately need more than one baseline reference.
347
-
348
- Common roles include:
349
-
350
- - primary comparison baseline
351
- - strongest literature baseline
352
- - cheapest operational fallback baseline
353
-
354
- If more than one baseline exists, explicitly record:
355
-
356
- - which one is the primary downstream comparison
357
- - which one is only a fallback or infrastructure reference
358
- - why the primary choice is the fairest or strongest comparison
359
-
360
- Do not leave later stages guessing which baseline is authoritative.
361
-
362
- When useful, record the route choice as a decision artifact with action such as:
363
-
364
- - `attach_baseline`
365
- - `reuse_baseline`
366
- - `publish_baseline`
367
- - `continue`
368
- - `request_user_decision`
369
-
370
232
  ## Workflow
371
233
 
372
234
  ### Phase 1. Analysis
@@ -379,236 +241,88 @@ Before running anything substantial, determine:
379
241
  - source baseline identity
380
242
  - source code path
381
243
  - expected run command or evaluation path
382
- - expected paper or repo numbers, if any
244
+ - expected paper or repo numbers when they exist
383
245
  - local resource constraints
384
246
 
385
- For straightforward baseline work, start with a quick viability pass:
247
+ Default analysis discipline:
386
248
 
387
- - find the real run or evaluation entrypoint
388
- - identify the dataset/split and metric contract
249
+ - read the source paper and source repo first
250
+ - if runtime already exposes a matching `requested_baseline_ref` or `confirmed_baseline_ref`, validate that concrete object before restarting broad discovery
251
+ - identify the real run or evaluation entrypoint
252
+ - identify the dataset or split and metric contract
389
253
  - identify likely environment blockers
390
254
  - define the cheapest credible smoke test
391
255
 
392
- Escalate from that quick pass to a fuller baseline codebase audit when the command path is unclear, the repo is large or confusing, the paper and code diverge materially, repair mode is active, or custom code changes look likely.
256
+ Escalate to a fuller audit only when the command path is unclear, the repo is large or confusing, repair mode is active, or custom code changes look likely.
393
257
 
394
- When the fuller audit is necessary, capture at least:
258
+ When the fuller audit is necessary, capture only what later stages truly need:
395
259
 
396
- - major modules and files
260
+ - major entry scripts, configs, and modules
397
261
  - end-to-end data flow
398
- - key classes, functions, or scripts
399
- - external dependencies and environment assumptions
400
- - computational hotspots or obvious bottlenecks
401
- - current evaluation pipeline and metric computation path
402
- - coupling, maintainability, or scalability issues that may slow later iterations
262
+ - evaluation path and metric computation path
263
+ - obvious environment assumptions
264
+ - obvious bottlenecks or incompatibilities
403
265
 
404
- When the source paper is available, also record:
266
+ If the source paper is available, record:
405
267
 
406
- - read it through `artifact.arxiv(paper_id=..., full_text=False)` first, and only switch to `full_text=True` when the shorter view is insufficient
407
268
  - the core algorithm in compact, implementation-faithful form
408
269
  - the main reported numbers
409
- - the main weaknesses or bottlenecks likely to matter on the current quest task or dataset
410
-
411
- If helpful, restate the core algorithm using two of the following:
412
-
413
- - short pseudocode
414
- - a compact equation or objective
415
- - a code-level sketch tied to real files
416
-
417
- The goal is not academic polish.
418
- The goal is that later `idea`, `experiment`, and `write` stages can understand what the baseline actually does without reopening the whole repo from scratch.
419
-
420
- You should inspect local feasibility with shell-based checks when needed, including:
421
-
422
- - OS
423
- - GPU availability
424
- - CPU and RAM
425
- - free disk
426
- - Python or conda environment availability
427
- - whether `uv` is available and which Python version `uv` should target
428
-
429
- Use the collected constraints to choose a realistic baseline route and runtime plan.
430
-
431
- The analysis phase should leave behind a concrete baseline plan rather than only conversational intent.
432
- At minimum, the plan should capture:
433
-
434
- - chosen route
435
- - source identity
436
- - expected commands
437
- - expected outputs
438
- - feasibility notes
439
- - key risks
440
- - verification targets
441
-
442
- Prefer `PLAN.md` for new work and use `references/baseline-plan-template.md` when you need a concrete starting structure.
443
- When the analysis note becomes substantial, structure `PLAN.md` or a legacy-compatible `analysis_plan.md` with headings close to:
270
+ - the main weaknesses or bottlenecks likely to matter for this quest
444
271
 
445
- - executive summary
446
- - codebase analysis
447
- - limitations or bottlenecks
448
- - KPI and metric contract
449
- - route choice
450
- - risks and mitigations
272
+ You may inspect local feasibility with shell-based checks for OS, GPU, CPU, RAM, disk, Python version, and whether `uv` is available.
451
273
 
452
- Analysis-phase questioning rules:
274
+ The analysis phase should leave behind a concrete plan rather than only conversational intent.
453
275
 
454
- - ask the user only after the analysis is concrete enough to expose real choices
455
- - the early exception is when code access, paper access, source identity, or execution permission is missing and that absence blocks even baseline analysis
456
- - do not ask generic “how should I set up the environment” questions before you inspect the device and code requirements
457
- - do not repeat already confirmed decisions unless the plan materially changed
458
-
459
- If a user decision is required, make it structured and compact:
460
-
461
- - usually `1-6` questions total
462
- - each question should contain concrete options
463
- - options should reflect actual hardware/code feasibility
464
- - options should include tradeoffs
465
- - the recommended option should be explicit
466
- - free-form input should be requested only where a preset choice is genuinely insufficient
467
-
468
- If parallel execution is proposed, it must be explicitly confirmed rather than silently enabled.
469
-
470
- Avoid asking the user to design the environment for you.
471
- Instead, analyze the environment first, then present the recommended path and tradeoffs only if a user decision is actually required.
472
-
473
- If the code, paper, or baseline source is missing and the missing piece changes the route materially, stop and ask for a structured decision rather than guessing.
474
-
475
- For a denser audit checklist, read `references/codebase-audit-checklist.md`.
476
-
477
- ### Phase 2. Setup
276
+ ## Phase 2. Setup
478
277
 
479
278
  Prepare the selected route:
480
279
 
481
280
  - attach: validate the selected baseline id and variant
482
- - import: place the imported baseline metadata under the quest
281
+ - import: place the imported baseline metadata under the quest and confirm the package is readable
483
282
  - reproduce: prepare the baseline work directory, commands, config pointers, and environment notes
484
283
  - repair: identify the precise broken point before rerunning blindly
485
284
 
486
- For Python baselines, environment setup should be standardized around `uv`.
487
- Treat `uv` as the default environment and package manager for baseline setup, smoke tests, and real runs.
488
- Do not casually switch to a new conda environment or a manual `pip install` flow just because the repo is old.
489
- If the baseline already ships a `pyproject.toml` / `uv.lock`, use that path first.
490
- If it only ships `requirements.txt`, still create the environment with `uv` and install through `uv pip`.
491
- Only accept a non-`uv` environment route when there is a concrete blocker that cannot be resolved locally, and record that blocker explicitly in `setup.md` and the progress update.
492
-
493
- For a fast-path reproduction, setup can stay lightweight.
494
- Confirm the working directory, environment, config, output paths, smoke command, and long-run command, then move forward.
495
- Do not manufacture a fresh workspace tree or copy the repo just to satisfy a template if the existing layout is already workable and auditable.
496
-
497
- Capture:
498
-
499
- - baseline identifier
500
- - source and provenance
501
- - working directory
502
- - config files
503
- - command template
504
- - expected outputs
505
- - risks and known deviations from the paper or source
506
-
507
- Setup should also confirm:
508
-
509
- - the intended working directory is correct
510
- - the output paths are durable and quest-visible
511
- - required dependencies or environments are known
512
- - the execution plan is realistic for the detected hardware
285
+ For Python baselines, standardize environment setup around `uv`.
513
286
 
514
287
  ### Python environment rule: use `uv`
515
288
 
516
- When the baseline is Python-based, prefer the following order:
517
-
518
- 1. if the repo already contains `uv.lock` or a solid `pyproject.toml`, use `uv sync`
519
- 2. otherwise create a local virtual environment with `uv venv`
520
- 3. install dependencies with `uv pip install ...`
521
- 4. run setup, smoke tests, and real commands through `uv run ...`
289
+ - if the repo already contains `uv.lock` or a solid `pyproject.toml`, use `uv sync`
290
+ - otherwise create a local virtual environment with `uv venv`
291
+ - install dependencies with `uv pip install ...`
292
+ - run setup, smoke tests, and real commands through `uv run ...`
522
293
 
523
294
  Practical rules:
524
295
 
525
- - prefer a quest-local or baseline-local `.venv` under the actual working tree
526
- - prefer `uv run python ...` / `uv run bash ...` over relying on shell activation state
296
+ - prefer a quest-local or baseline-local `.venv`
297
+ - prefer `uv run python ...` or `uv run bash ...` over relying on shell activation state
527
298
  - if a specific interpreter is required, make it explicit with `uv venv --python 3.11` or `uv run --python 3.11 ...`
528
- - if CUDA, PyTorch, JAX, or custom wheels require a special index URL, still keep the installation command under `uv pip`
529
- - if the repo insists on conda-only tooling, first check whether the same packages can be installed with `uv`; only keep the conda route if you can explain why `uv` is not viable
530
-
531
- Examples:
532
-
533
- ```bash
534
- # modern repo with pyproject.toml / uv.lock
535
- cd <baseline_root>
536
- uv sync
537
- uv run python -m pytest tests/test_smoke.py -q
538
- uv run python train.py --config configs/baseline.yaml
539
- ```
540
-
541
- ```bash
542
- # legacy repo with requirements.txt
543
- cd <baseline_root>
544
- uv venv --python 3.11
545
- uv pip install -r requirements.txt
546
- uv run python scripts/smoke_test.py
547
- uv run python main.py --dataset cifar10 --config configs/resnet18.yaml
548
- ```
549
-
550
- ```bash
551
- # one-off package additions without leaving the uv-managed flow
552
- cd <baseline_root>
553
- uv venv --python 3.11
554
- uv pip install -r requirements.txt
555
- uv pip install "torch==2.4.1" "torchvision==0.19.1"
556
- uv run python evaluate.py --checkpoint outputs/best.pt
557
- ```
558
-
559
- When you record the setup, explicitly note:
560
-
561
- - the chosen `uv` route: `uv sync` vs `uv venv` + `uv pip`
562
- - the Python version
563
- - the dependency source files used
564
- - the exact `uv run ...` command used for the smoke test
565
- - any blocker that prevented a pure `uv` flow
566
-
567
- If a dedicated baseline workspace is needed, establish a clear layout.
568
- One workable structure is:
569
-
570
- ```text
571
- <baseline_root>/
572
- src/
573
- scripts/
574
- logs/
575
- cache/
576
- results/
577
- exports/
578
- latest/
579
- <run_id>/
580
- ```
581
-
582
- If the baseline becomes long-lived, shared, or non-obvious, the quest-visible audit area may contain:
583
-
584
- ```text
585
- <quest_root>/
586
- baselines/
587
- local/
588
- <baseline_id>/
589
- analysis_plan.md
590
- setup.md
591
- execution.md
592
- verification.md
593
- STRUCTURE.md
594
- REPRO_CHECKLIST.md
595
- ```
299
+ - if CUDA, PyTorch, JAX, or custom wheels require a special index URL, keep that install under `uv pip`
300
+ - only accept a non-`uv` route when there is a concrete blocker that cannot be resolved locally
301
+
302
+ Common `uv` patterns:
303
+
304
+ - `uv sync`
305
+ - `uv venv --python 3.11`
306
+ - `uv pip install -r requirements.txt`
307
+ - `uv run python scripts/smoke_test.py`
308
+ - `uv run python train.py --config ...`
596
309
 
597
310
  Setup should record:
598
311
 
599
- - how the source was obtained: attach/import/copy/clone
600
- - upstream URL when known
601
- - upstream commit hash when known
602
- - `uv` environment route and Python version
603
- - key environment variables by name only, with sensitive values redacted
604
- - the directory tree and key files expected to matter later
312
+ - baseline id and source identity
313
+ - working directory
314
+ - config files
315
+ - command template
316
+ - expected outputs
317
+ - known deviations from paper or source
318
+ - the chosen `uv` route and Python version
605
319
 
606
- If a local source repo was copied into the workspace, preserve provenance but do not keep a nested authoritative Git lifecycle inside the baseline execution root.
320
+ Fallbacks:
607
321
 
608
- If setup reveals that the chosen route is infeasible on the current device, do not brute-force ahead.
609
- Either downgrade scope explicitly, switch route, or request a structured decision.
322
+ - if Hugging Face access is blocked, record the blocker and try an approved local mirror such as ModelScope, provided that does not change the comparison meaning
323
+ - if a quest already depends on `analysis_plan.md` or `REPRO_CHECKLIST.md`, keep the compatibility alias explicit rather than splitting truth across two active plans
610
324
 
611
- ### Phase 3. Execution
325
+ ## Phase 3. Execution
612
326
 
613
327
  Run only the work required to establish the baseline credibly.
614
328
 
@@ -617,88 +331,31 @@ Execution rules:
617
331
  - keep commands auditable
618
332
  - keep logs durable
619
333
  - avoid uncontrolled side experiments during baseline establishment
620
- - if a run is long, emit progress artifacts at meaningful checkpoints
621
- - if setup required code changes, checkpoint only explainable, minimal changes
622
-
623
- Execution should rely on existing explicit scripts or command paths where possible.
624
- Prefer the smallest runnable command that proves the baseline route.
625
- Do not build a new wrapper, registry, or result-export scaffold unless existing commands are missing, repeated reruns justify it, or later automation clearly needs it.
626
- If a wrapper or entry script is truly needed, it should support most of the following:
627
-
628
- - run mode for missing combinations
629
- - print-only mode that summarizes existing results without rerunning everything
630
- - result registry or skip logic so old baseline results are not re-executed unnecessarily
631
- - export of per-run results and a `latest/` snapshot
632
- - final Markdown and/or JSON summary output
633
- - cache and debug logs
634
- - environment checks when relevant
635
- - throttled structured progress markers for long loops
636
- - `--new-only` or equivalent incremental mode
637
- - `--rerun` or equivalent force-rerun mode when needed
638
- - scope flags such as minimal/full/custom when the analysis plan distinguishes them
639
- - speed flags such as parallelism, batch size, epochs, or steps when relevant
640
- - optional evaluation and postprocess steps when the repo separates them
641
-
642
- Prefer those efficiency levers only when they do not change the accepted baseline meaning, effective evaluation contract, or trust judgment.
643
-
644
- If adding this scaffolding would require large assumptions about missing scripts, stop and return to analysis rather than creating a misleading opaque wrapper.
645
-
646
- Recommended result structures to maintain:
647
-
648
- - per-combination result records
649
- - an aggregated `result.json`
650
- - a registry or JSONL index mapping each combination to its stored result
651
- - exported snapshots in both run-specific and `latest/` locations
652
- - run metadata capturing the environment and command context
653
-
654
- Recommended run metadata includes:
655
-
656
- - config snapshot
657
- - relevant Git or source snapshot identifiers
658
- - package/environment summary
659
- - machine summary such as GPU visibility when relevant
660
-
661
- If a result backup is useful for audit or recovery, create it explicitly rather than assuming the latest export is enough.
662
-
663
- Long-running execution rules:
664
-
665
- - before a substantial baseline reproduction, run a bounded smoke test first so command paths, output locations, and metric plumbing are validated cheaply
666
- - once the smoke test passes, launch the real baseline reproduction with `bash_exec(mode='detach', ...)` and normally leave `timeout_seconds` unset for the long run itself
667
- - `bash_exec(mode='read', id=...)` returns the full rendered log when it is 2000 lines or fewer; for longer logs it returns the first 500 lines plus the last 1500 lines and a hint to inspect omitted sections with `start` and `tail`
668
- - if a long saved log omits the middle section you need, use `bash_exec(mode='read', id=..., start=..., tail=...)` to inspect that forward rendered-line window
669
- - when monitoring that detached run, prefer `bash_exec(mode='read', id=..., tail_limit=..., order='desc')` so you inspect the newest log evidence first
670
- - after the first read, prefer incremental checks with `bash_exec(mode='read', id=..., after_seq=last_seen_seq, tail_limit=..., order='asc')` so you only inspect newly appended evidence
671
- - if you need to recover ids or confirm the newest session quickly, use `bash_exec(mode='history')` or `bash_exec(mode='list')` rather than guessing
672
- - include a structured `comment` on long-running bash sessions with fields such as `stage`, `goal`, `action`, `expected_signal`, and `next_check`
673
- - use `silent_seconds`, `progress_age_seconds`, `signal_age_seconds`, and `watchdog_overdue` from `bash_exec(mode='list'|'read', ...)` as the default staleness checks
674
- - when the reproduction code is under your control, prefer a throttled `tqdm` progress reporter and, when feasible, pair it with periodic `__DS_PROGRESS__` JSON lines carrying phase and ETA
675
- - if a command is expected to run for a long time, monitor it as a real background task rather than assuming success
676
- - do not write final summaries or accepted metrics until the command has actually completed
677
- - verify that the expected result files exist before treating the run as finished
678
- - if a task is invalid, wedged, or failed, stop it with `bash_exec(mode='kill', id=..., wait=true, timeout_seconds=...)`; if it must die immediately, add `force=true`, then diagnose the reason and either retry with a documented fix or record the failure durably
679
- - canonical sleep choice:
680
- - if you only need wall-clock waiting between checks, use `bash_exec(command='sleep N', mode='await', timeout_seconds=N+buffer, ...)`
681
- - keep a real buffer on that sleep timeout; do not set `timeout_seconds` exactly equal to `N`
682
- - if you are waiting on an already running managed session, prefer `bash_exec(mode='await', id=..., timeout_seconds=...)` instead of starting a new sleep command
683
-
684
- Recommended monitoring cadence for long-running work:
685
-
686
- - first check after about 60 seconds
687
- - second check after about 120 seconds
688
- - third check after about 300 seconds
689
- - fourth check after about 600 seconds
690
- - fifth check after about 1800 seconds
691
- - after that, keep checking about every 1800 seconds while the run is still active
692
-
693
- The exact mechanism should prefer `bash_exec(mode='await' | 'detach' | 'read' | 'list' | 'history' | 'kill', ...)`, with `read` usually using a tailed or incremental window during monitoring, but the behavioral rule stays the same:
694
- do not report completion until the run is actually done and the outputs are real.
695
- After each meaningful check, notify the user through `artifact.interact(kind='progress', ...)` with current status, latest evidence, and the next monitoring point.
696
- Do this after every completed wait cycle for important long-running work; do not skip several sleep windows without reporting.
697
- When structured progress markers are available, include `eta` and preferably `next_reply_at` or `next_check_at` so the UI can show the next expected update time.
698
-
699
- Do not silently widen scope from “baseline reproduction” into “new method exploration”.
700
-
701
- ### Phase 4. Verification
334
+ - checkpoint only explainable, minimal code changes
335
+ - prefer equivalence-preserving efficiency gains such as larger safe batch size, cache reuse, checkpoint resume, and parallel downloads or workers
336
+ - do not use an efficiency lever if it changes accepted baseline meaning, effective evaluation contract, or trust judgment
337
+
338
+ Long-running execution discipline:
339
+
340
+ - run one bounded smoke test before a substantial baseline reproduction
341
+ - once the smoke test passes, launch the real baseline reproduction with `bash_exec(mode='detach', ...)`
342
+ - monitor by forward progress rather than by whether the run finishes within a short window
343
+ - do not report final success until the command actually finished and the expected result files exist
344
+ - if you need to recover ids or inspect session state, use `bash_exec(mode='history')` or `bash_exec(mode='list')`
345
+ - `bash_exec(mode='read', id=...)` returns the full saved log when it is `2000 lines or fewer`; for longer logs, inspect omitted middle windows with `start` and `tail`
346
+ - during monitoring, prefer `bash_exec(mode='read', id=..., tail_limit=..., order='desc')`, and after the first read prefer incremental checks with `after_seq=last_seen_seq`
347
+ - use `silent_seconds`, `progress_age_seconds`, `signal_age_seconds`, and `watchdog_overdue` as the default staleness clues
348
+ - if a run is clearly invalid, wedged, or superseded, stop it with `bash_exec(mode='kill', id=..., wait=true, timeout_seconds=...)`, document why, and relaunch cleanly
349
+ - do not let more than the `30-minute visibility bound` pass without a real inspection and a `next expected update time`
350
+ - when the baseline code is under your control, prefer a throttled `tqdm` progress reporter and periodic `__DS_PROGRESS__` markers when feasible
351
+
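The 30-minute visibility bound above can be tracked with simple timestamp arithmetic. This is a minimal sketch, assuming the caller records when it last inspected the run. The helper names are hypothetical, and the sketch does not describe how `watchdog_overdue` or the other `bash_exec` staleness fields are computed.

```python
from datetime import datetime, timedelta

# The bound named in the rules above: at most 30 minutes between inspections.
VISIBILITY_BOUND = timedelta(minutes=30)


def next_inspection_deadline(last_inspection: datetime) -> datetime:
    """Latest time by which the next real inspection should happen."""
    return last_inspection + VISIBILITY_BOUND


def inspection_overdue(last_inspection: datetime, now: datetime) -> bool:
    """True when more than the visibility bound has passed since the last check."""
    return now - last_inspection > VISIBILITY_BOUND
```

The value returned by `next_inspection_deadline` is one candidate for the next expected update time reported alongside each progress note.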
352
+ Keep retries bounded:
353
+
354
+ - one smoke test is the default
355
+ - one autonomous fix-and-retry for the same failure class is the normal upper bound
356
+ - if the same failure class returns, stop looping
357
+
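The retry bound above can be expressed as a loop that remembers which failure classes it has already tried to fix. This is a sketch under assumptions: `run_with_bounded_retry`, the `attempt` and `fix` callables, and the `max_attempts` cap are all hypothetical, and how a failure gets mapped to a class label is left to the caller.

```python
from typing import Callable, Optional


def run_with_bounded_retry(
    attempt: Callable[[], Optional[str]],
    fix: Callable[[str], None],
    max_attempts: int = 4,  # hypothetical overall cap, not stated in the rules
) -> tuple[bool, list[str]]:
    """Run `attempt` with at most one fix-and-retry per failure class.

    `attempt` returns None on success or a failure-class label on failure.
    Returns (succeeded, failure classes observed in order).
    """
    seen: set[str] = set()
    history: list[str] = []
    for attempt_no in range(max_attempts):
        failure = attempt()
        if failure is None:
            return True, history
        history.append(failure)
        # Same failure class came back after a fix, or the cap is reached:
        # stop looping and leave the decision to a durable report.
        if failure in seen or attempt_no == max_attempts - 1:
            return False, history
        seen.add(failure)
        fix(failure)
    return False, history  # only reached when max_attempts <= 0
```

When the function returns `False`, the observed failure classes in `history` are the natural input for the blocked report and next-step decision.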
358
+ ## Phase 4. Verification
702
359
 
703
360
  Verification is mandatory before baseline acceptance.
704
361
 
@@ -707,32 +364,17 @@ Verify:
707
364
  - the run actually finished
708
365
  - the reported metrics came from the intended dataset and split
709
366
  - the metric definitions match the quest contract
710
- - the result is comparable to the paper, source repo, or selected baseline target
367
+ - the result is comparable to the paper, source repo, or selected target
711
368
  - any deviations are explicitly stated
712
369
 
713
- Classify the outcome:
370
+ Classify the outcome as one of:
714
371
 
715
372
  - `verified_match`
716
373
  - `verified_close`
717
374
  - `verified_diverged`
718
375
  - `broken`
719
376
 
720
- Verification must be evidence-first.
721
- Do not accept any of the following without explanation:
722
-
723
- - missing result files
724
- - metrics that cannot be traced to an actual run
725
- - metric definitions that do not match the quest contract
726
- - unexplained mismatch versus the intended paper or source repo setup
727
-
728
- Verification-phase interaction rules:
729
-
730
- - do not ask new questions during verification unless the stage has genuinely fallen back to analysis
731
- - if requirements, scope, or permissions changed materially, stop verification and return to the analysis phase explicitly
732
- - verification should summarize real progress milestones rather than quoting raw internal progress markers
733
- - structured progress markers are for runtime monitoring, not for final verification prose
734
-
735
- If the reproduced result differs materially from the source reference, verification should explicitly separate:
377
+ Verification must explicitly separate:
736
378
 
737
379
  - likely implementation mismatch
738
380
  - environment mismatch
@@ -740,43 +382,64 @@ If the reproduced result differs materially from the source reference, verificat
740
382
  - expected stochastic variance
741
383
  - unexplained divergence
742
384
 
743
- Verification should also answer:
385
+ Verification should answer:
744
386
 
745
387
  - whether the baseline is trustworthy enough for downstream comparison
746
388
  - whether the result is reusable beyond this quest
747
389
  - whether another repair or rerun is justified
748
- - whether the baseline line should stop here and hand off to another stage
390
+ - whether the line should stop here and hand off
391
+
392
+ A verification report should be self-contained enough that a later stage can answer:
393
+
394
+ - what was used
395
+ - how it was obtained: attach, import, reproduce, or repair
396
+ - what commands and configs were used
397
+ - what metrics are trusted
398
+ - what caveats remain
399
+ - whether the result is reusable beyond this quest
749
400
 
750
- Verification checklist before accepting results:
401
+ ## Baseline comparability contract
402
+
403
+ The baseline stage is not complete just because something ran.
404
+ It is complete when later stages can compare against it fairly.
751
405
 
752
- - logs show command completion rather than only task start
753
- - final result files exist
754
- - exported latest snapshot exists when the workflow expects it
755
- - metrics are non-empty, non-placeholder, and non-NaN
756
- - execution notes document the actual commands and outcomes
757
- - the baseline phase state is ready to hand off
758
- - the infrastructure needed for reproduction is actually present and usable
759
- - any closed-loop or key-metric steps expected by the plan were completed or their omission was explicitly documented
406
+ Before declaring a baseline usable, make the comparability contract explicit:
760
407
 
761
- If the workflow uses both result files and export files, they should agree or the mismatch must be explained.
408
+ - task identity
409
+ - dataset identity and version
410
+ - split contract
411
+ - preprocessing boundary
412
+ - evaluation script or evaluation path
413
+ - required metric keys
414
+ - metric directions
415
+ - seed policy when relevant
416
+ - source commit or source package identity
417
+ - known deviations from the source reference
762
418
 
763
- Verification should also test the reporting surface itself when the baseline workflow includes one.
764
- For example, if the baseline uses a main driver script with a print-only mode, verify that:
419
+ Unless the user explicitly specifies otherwise, treat the original paper's evaluation protocol as the canonical baseline contract.
420
+ If any of these fields are still materially unknown, do not pretend the baseline is a clean downstream reference.
421
+ For the fuller checklist and verdict meanings, read `references/comparability-contract.md`.
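One way to make the contract explicit is a small record whose unset fields can be listed before acceptance. The sketch below is an assumption-laden illustration: the class name, the snake_case field names, and the `unknown_fields` helper are all hypothetical and are derived from the list above rather than from any schema in the package.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ComparabilityContract:
    # Field names are hypothetical; each mirrors one bullet of the contract list.
    task_identity: Optional[str] = None
    dataset_identity_and_version: Optional[str] = None
    split_contract: Optional[str] = None
    preprocessing_boundary: Optional[str] = None
    evaluation_path: Optional[str] = None
    required_metric_keys: Optional[list] = None
    metric_directions: Optional[dict] = None
    seed_policy: Optional[str] = None  # set to e.g. "not relevant" when it does not apply
    source_identity: Optional[str] = None
    known_deviations: Optional[list] = None  # an empty list means "none known"


def unknown_fields(contract: ComparabilityContract) -> list:
    """Return the names of contract fields that are still unset (None)."""
    return [f.name for f in fields(contract) if getattr(contract, f.name) is None]
```

A non-empty result from `unknown_fields(...)` is a cue to state the gap explicitly rather than present the baseline as a clean downstream reference.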
765
422
 
766
- - summary mode runs successfully
767
- - exported Markdown and/or JSON summaries are actually generated
768
- - incremental flags such as `--new-only` behave as documented when they are part of the workflow
423
+ ## Feasibility and trust classes
769
424
 
770
- Then record:
425
+ Before acceptance, classify feasibility as one of:
771
426
 
772
- - trusted metrics
773
- - important caveats
774
- - exact paths for logs, configs, and outputs
775
- - whether the baseline is reusable and should be published
427
+ - `full_reproducible`
428
+ - `degraded_but_acceptable`
429
+ - `blocked`
430
+
431
+ And classify downstream trust as one of:
432
+
433
+ - `verified`
434
+ - `partially_verified`
435
+ - `operational_but_incomparable`
436
+ - `failed`
437
+
438
+ Do not silently upgrade a degraded or merely operational result into a normal trusted baseline.
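The two classifications can be kept as closed sets so that an unexpected label fails loudly. This is a minimal sketch: the enum and function names are hypothetical, and treating only the `full_reproducible` plus `verified` pair as a normal trusted baseline is a conservative reading of the rule above, not something the text states outright.

```python
from enum import Enum


class Feasibility(str, Enum):
    FULL_REPRODUCIBLE = "full_reproducible"
    DEGRADED_BUT_ACCEPTABLE = "degraded_but_acceptable"
    BLOCKED = "blocked"


class Trust(str, Enum):
    VERIFIED = "verified"
    PARTIALLY_VERIFIED = "partially_verified"
    OPERATIONAL_BUT_INCOMPARABLE = "operational_but_incomparable"
    FAILED = "failed"


def is_normal_trusted_baseline(feasibility: str, trust: str) -> bool:
    """Return True only for the unqualified case (assumed conservative rule)."""
    # Enum lookup raises ValueError for labels outside the listed classes.
    f = Feasibility(feasibility)
    t = Trust(trust)
    return f is Feasibility.FULL_REPRODUCIBLE and t is Trust.VERIFIED
```

Any other combination would then carry its degraded or partial status forward explicitly instead of being promoted by default.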
776
439
 
777
440
  ## Minimum baseline artifact content
778
441
 
779
- The baseline artifact should clearly include at least:
442
+ The accepted baseline artifact should include at least:
780
443
 
781
444
  - `baseline_id`
782
445
  - `baseline_kind`
@@ -794,254 +457,32 @@ If variants exist, also include:
794
457
  - `default_variant_id`
795
458
  - `baseline_variants`
796
459
 
797
- Metric-contract rule:
460
+ Metric-contract rules:
798
461
 
799
- - unless the user explicitly specifies otherwise, treat the original paper's evaluation protocol as the canonical baseline contract
800
462
  - if the accepted baseline contract includes multiple metrics, datasets, subtasks, or splits, record all of them in `<baseline_root>/json/metric_contract.json`
801
- - keep `primary_metric` as the headline metric only; do not let it erase the rest of the accepted paper-facing comparison surface
802
- - when confirming a baseline, submit the canonical `metrics_summary` as a flat top-level dictionary keyed by the paper-facing metric ids; if the raw evaluator output is nested, use explicit `origin_path` fields in `metric_contract.metrics` to map the required canonical metrics
803
- - every canonical baseline metric entry should include `description`, either `derivation` or `origin_path`, and `source_ref` so later stages can audit where the number came from
463
+ - keep `primary_metric` as the headline metric only; do not let it erase the rest of the comparison surface
464
+ - when confirming a baseline, submit the canonical `metrics_summary` as a flat top-level dictionary keyed by the paper-facing metric ids
465
+ - every canonical baseline metric entry should include `description`, either `derivation` or `origin_path`, and `source_ref`
804
466
  - if the paper reports both aggregate and per-dataset or per-task results, preserve both whenever feasible through `metrics_summary` plus structured rows rather than one cherry-picked scalar
805
- - `Result/metric.md` is optional temporary scratch memory only; if it exists, reconcile the final baseline submission against it before calling `artifact.confirm_baseline(...)`, but do not treat it as a required file
806
-
807
- ## Durable note templates
808
-
809
- Use compact but structured notes so later stages do not need to reconstruct baseline state from chat history.
810
- The templates below are references, not prerequisites for the first smoke test.
811
- For simple baseline lines, keep them short and fill only the sections that matter.
812
-
813
- Canonical naming for new work:
814
-
815
- - `PLAN.md` -> use `references/baseline-plan-template.md`
816
- - `CHECKLIST.md` -> use `references/baseline-checklist-template.md`
817
- - `analysis_plan.md` and `REPRO_CHECKLIST.md` remain acceptable compatibility aliases when a quest already depends on them
818
-
819
- ### `PLAN.md` or `analysis_plan.md`
820
-
821
- Recommended shape:
822
-
823
- ```md
824
- # Baseline Analysis Plan
825
-
826
- - quest_id:
827
- - baseline_id:
828
- - requested_route: attach | import | reproduce | repair
829
- - recommended_route:
830
- - source_identity:
831
- - task:
832
- - dataset_and_split:
833
- - metric_contract:
834
- - expected_reference:
835
- - feasibility_summary:
836
-
837
- ## Existing evidence
838
- - published registry entries:
839
- - local baseline roots:
840
- - relevant repo paths:
841
-
842
- ## Planned commands
843
- - inspect:
844
- - setup:
845
- - run:
846
- - verify:
847
-
848
- ## Expected outputs
849
- - baseline_root:
850
- - metrics_path:
851
- - logs_path:
852
- - export_paths:
853
-
854
- ## Risks
855
- - risk:
856
-
857
- ## Gate to next phase
858
- - what must be true before setup starts
859
- ```
860
-
861
- ### `setup.md`
862
-
863
- Recommended shape:
864
-
865
- ```md
866
- # Baseline Setup
467
+ - if the source package already has a richer leaderboard table, structured result file, or `json/metric_contract.json`, reuse that richer contract instead of hand-writing a thinner one that keeps only one averaged scalar
468
+ - `Result/metric.md` is optional temporary scratch memory only; reconcile against it before calling `artifact.confirm_baseline(...)`, but do not treat it as a required durable file
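As an illustration of the flat `metrics_summary` rule, the sketch below maps nested evaluator output to a flat dictionary through `origin_path` entries. Several details are assumptions made for the example: the dot-separated path syntax, the `metric_id` key, and both helper names are hypothetical and may not match the real `metric_contract.json` layout.

```python
from typing import Any


def resolve_origin_path(raw: dict, origin_path: str) -> Any:
    """Follow a dot-separated path (assumed syntax) into nested evaluator output."""
    node: Any = raw
    for key in origin_path.split("."):
        node = node[key]  # raises KeyError if the path does not exist
    return node


def build_metrics_summary(raw: dict, metrics: list) -> dict:
    """Return a flat top-level dict keyed by metric id.

    Each entry in `metrics` is assumed to carry `metric_id` and `origin_path`.
    """
    return {m["metric_id"]: resolve_origin_path(raw, m["origin_path"]) for m in metrics}
```

For example, with the hypothetical nested output `{"eval": {"test": {"acc": 0.9}}}` and one entry mapping `accuracy` to `eval.test.acc`, the summary is the flat dictionary `{"accuracy": 0.9}`.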
469
+
470
+ ## Publication and reuse
867
471
 
868
- - baseline_id:
869
- - route:
870
- - working_directory:
871
- - source_origin:
872
- - source_commit:
873
- - environment_summary:
874
- - uv_strategy:
875
- - python_version:
876
- - config_paths:
877
- - command_template:
878
-
879
- ## Directory contract
880
- - baseline_root:
881
- - logs_root:
882
- - results_root:
883
- - exports_root:
884
-
885
- ## Known deviations
886
- - deviation:
887
-
888
- ## Ready-for-execution check
889
- - uv_route_recorded: yes/no
890
- - dependencies_known: yes/no
891
- - outputs_defined: yes/no
892
- - feasible_on_current_machine: yes/no
893
- ```
894
-
895
- ### `execution.md`
896
-
897
- Recommended shape:
898
-
899
- ```md
900
- # Baseline Execution
901
-
902
- - baseline_id:
903
- - route:
904
- - run_scope:
905
- - command_started:
906
- - started_at:
907
- - monitoring_plan:
908
-
909
- ## Runtime log pointers
910
- - stdout_or_main_log:
911
- - stderr_or_error_log:
912
- - result_index:
913
-
914
- ## Checkpoints
915
- - checkpoint:
916
-
917
- ## Final execution state
918
- - completed_at:
919
- - exit_status:
920
- - produced_outputs:
921
- - reruns_or_repairs:
922
- ```
923
-
924
- ### `verification.md`
925
-
926
- Recommended shape:
927
-
928
- ```md
929
- # Baseline Verification
930
-
931
- - baseline_id:
932
- - route:
933
- - verification_outcome: verified_match | verified_close | verified_diverged | broken
934
- - trusted_for_downstream: yes/no
935
- - reusable_beyond_quest: yes/no
936
- - publish_recommended: yes/no
937
-
938
- ## Trusted metrics
939
- - metric:
940
-
941
- ## Reference comparison
942
- - expected_reference:
943
- - observed_result:
944
- - delta_or_gap:
945
-
946
- ## Evidence paths
947
- - final_metrics:
948
- - logs:
949
- - exports:
950
- - config_snapshot:
951
-
952
- ## Caveats
953
- - caveat:
954
-
955
- ## Next recommendation
956
- - next_anchor:
957
- - next_action:
958
- ```
959
-
960
- These notes do not need to be verbose.
961
- They do need to be complete enough that another stage can read them without replaying the full baseline process.
962
-
963
- ## Artifact payload templates
964
-
965
- When writing artifacts, prefer a stable field shape.
966
-
967
- ### Route or blocked decision artifact template
968
-
969
- ```json
970
- {
971
- "kind": "decision",
972
- "verdict": "neutral",
973
- "action": "attach_baseline",
974
- "reason": "A published baseline already matches the quest task and metric contract.",
975
- "baseline_id": "baseline-demo",
976
- "baseline_variant_id": "main",
977
- "evidence_paths": [
978
- "<quest_root>/artifacts/reports/report-....json"
979
- ],
980
- "next_direction": "Attach the baseline and move to verification or idea selection."
981
- }
982
- ```
983
-
984
- If blocked, keep the same structure but use a blocked-appropriate action and reason.
985
-
986
- ### Accepted baseline artifact template
987
-
988
- ```json
989
- {
990
- "kind": "baseline",
991
- "publish_global": true,
992
- "baseline_id": "baseline-demo",
993
- "name": "Demo baseline",
994
- "baseline_kind": "reproduced",
995
- "task": "image-classification",
996
- "dataset": "CIFAR-10/test",
997
- "primary_metric": {
998
- "name": "accuracy",
999
- "value": 0.943
1000
- },
1001
- "metrics_summary": {
1002
- "accuracy": 0.943
1003
- },
1004
- "default_variant_id": "main",
1005
- "baseline_variants": [
1006
- {
1007
- "variant_id": "main",
1008
- "label": "Main",
1009
- "metrics_summary": {
1010
- "accuracy": 0.943
1011
- }
1012
- }
1013
- ],
1014
- "environment": {
1015
- "python": "3.11"
1016
- },
1017
- "summary": "Verified reproduced baseline accepted for downstream comparison.",
1018
- "path": "<quest_root>/baselines/local/baseline-demo",
1019
- "source": {
1020
- "kind": "artifact_publish",
1021
- "quest_id": "<quest_id>",
1022
- "quest_root": "<quest_root>"
1023
- }
1024
- }
1025
- ```
1026
-
1027
- Only set `publish_global: true` when verification is complete and reuse is justified.
1028
-
1029
- ## Registry publication and attachment contract
1030
-
1031
- The baseline skill should use the durable registry deliberately, not as an afterthought.
472
+ Use the registry deliberately, not as an afterthought.
1032
473
 
1033
474
  If the result is reusable beyond the current quest:
1034
475
 
1035
476
  - publish it through `artifact.publish_baseline(...)`
1036
- - ensure the payload includes the baseline identity, provenance, trusted metrics, and any variant structure
1037
- - prefer `publish_global: true` only when the baseline is actually reusable and verification is complete
477
+ - ensure the payload includes identity, provenance, trusted metrics, and any variant structure
478
+ - set `publish_global: true` only when verification is complete and reuse is justified
1038
479
 
1039
480
  If the current quest should reuse an existing baseline:
1040
481
 
1041
482
  - attach it through `artifact.attach_baseline(...)`
1042
483
  - preserve the selected `baseline_id`
1043
484
  - preserve the selected `variant_id` when one is used
1044
- - ensure the resulting attachment record is durable under `baselines/imported/`
485
+ - keep the attachment durable under `baselines/imported/`
1045
486
 
1046
487
  If runtime state already includes `requested_baseline_ref` or a matching `confirmed_baseline_ref`:
1047
488
 
@@ -1049,221 +490,55 @@ If runtime state already includes `requested_baseline_ref` or a matching `confir
1049
490
  - treat a creation-time pre-bound baseline as the active starting point unless you find a concrete incompatibility
1050
491
  - do not rerun broad baseline scouting or full reproduction just because the stage name is `baseline`
1051
492
 
1052
- Do not publish a baseline that is still blocked, speculative, or verification-incomplete.
1053
- Do not attach a baseline without explaining why it is the right reference for the quest.
1054
-
1055
- ## Verification report expectations
1056
-
1057
- A baseline verification report should answer:
1058
-
1059
- - what baseline was used
1060
- - how it was obtained: attach, import, reproduce, or repair
1061
- - what commands and configs were used
1062
- - what metrics are trusted
1063
- - how the result compares with the expected reference
1064
- - what caveats remain
1065
-
1066
- The report should also include:
1067
-
1068
- - whether the run should be trusted for downstream comparison
1069
- - whether the baseline is reusable beyond this quest
1070
- - whether another repair or rerun is justified
1071
-
1072
- The verification report should be strong enough that later `idea`, `experiment`, and `write` stages can cite the baseline setup without reconstructing it from scratch.
1073
-
1074
- It should ideally also function as a self-contained reproduction note describing:
1075
-
1076
- - baseline identity
1077
- - source provenance
1078
- - key commands
1079
- - environment assumptions
1080
- - result locations
1081
- - trusted interpretation of the outcome
1082
-
1083
- If the baseline line is meant to be reused later, the final report should be self-contained enough that another stage can answer:
1084
-
1085
- - what to run
1086
- - where to run it
1087
- - what outputs should appear
1088
- - how to interpret those outputs
1089
-
1090
- without reopening the whole reproduction process from scratch.
1091
-
1092
- When useful, generate a single merged reproduction report that includes:
1093
-
1094
- - structure overview
1095
- - modification summary
1096
- - testing commands
1097
- - device and environment summary
1098
- - baseline status and blockers
1099
- - redacted configuration inventory
1100
- - key implementation measures
1101
- - core method equations or mathematical notes when they matter for later understanding
1102
- - results table
1103
- - export paths
1104
-
1105
- For a reusable baseline package checklist, read `references/publishable-baseline-package.md`.
493
+ For a clearer attach/import/reproduce/repair rubric, read `references/route-selection.md`.
494
+ For reusable-package expectations, read `references/publishable-baseline-package.md`.
1106
495
 
1107
- ## Branch and worktree rules
496
+ ## Workspace and branch rules
1108
497
 
1109
- - Use the quest branch unless isolation is genuinely needed.
1110
- - If baseline setup is risky or intrusive, prepare an isolated branch or worktree first.
1111
- - Do not proliferate branches without a reason.
1112
- - If a branch was used, record why the baseline needed isolation.
1113
-
1114
- The baseline stage should not build a parallel Git lifecycle of its own.
1115
- Branching and promotion remain quest-level concerns.
1116
-
1117
- However, if baseline setup materially changed code or scripts, preserve at least:
1118
-
1119
- - an initial snapshot of the baseline workspace state
1120
- - a final snapshot after setup/execution changes
1121
-
1122
- so the quest can later audit what changed during reproduction.
1123
-
1124
- If the workflow uses a baseline-local Git snapshot for audit, treat it as an execution snapshot only.
1125
- The quest repo remains the durable authority for promotion and narrative state.
498
+ - treat the baseline workspace as a system-managed reproduction surface, not an unrelated sandbox
499
+ - avoid creating a nested authoritative Git lifecycle inside the baseline workspace
500
+ - use the quest branch unless isolation is genuinely needed
501
+ - if baseline setup is risky or intrusive, prepare an isolated branch or worktree first and record why
502
+ - do not proliferate branches without a reason
1126
503
 
1127
504
  ## Memory rules
1128
505
 
1129
506
  Stage-start requirement:
1130
507
 
1131
- - begin every baseline pass with `memory.list_recent(scope='quest', limit=5)`
508
+ - by default, begin every baseline pass with `memory.list_recent(scope='quest', limit=5)`
1132
509
  - then run at least one baseline-relevant `memory.search(...)` before new baseline analysis, repair, or rerun work
1133
- - if several baseline or idea lines exist, narrow retrieval to the active baseline route instead of mixing notes from unrelated lines
510
+ - fast-path exception: if the quest already exposes a clear `requested_baseline_ref` or `confirmed_baseline_ref` and the immediate task is only to validate or reattach that concrete baseline, you may skip broad retrieval
1134
511
 
1135
- Write to memory only when the lesson is reusable, such as:
512
+ Write memory only for reusable lessons such as:
1136
513
 
1137
- - baseline pitfalls
1138
- - environment gotchas
1139
- - dataset quirks
1140
514
  - paper-to-code mismatch notes
1141
-
1142
- Do not use memory as a substitute for the baseline artifact itself.
1143
-
1144
- Preferred memory usage:
1145
-
1146
- - quest `papers`:
1147
- - paper-to-code mismatch notes
1148
- - baseline paper caveats
1149
- - quest `decisions`:
1150
- - attach / import / reproduce / repair rationale
1151
- - accepted-versus-rejected baseline route choices
1152
- - quest `episodes`:
1153
- - setup failures
1154
- - execution failures
1155
- - environment incidents
1156
- - suspicious or divergent baseline runs
1157
- - quest `knowledge`:
1158
- - verified metric contract
1159
- - stable setup rules
1160
- - data and evaluation caveats
1161
- - reproducibility lessons that matter later in this quest
1162
- - global `knowledge`:
1163
- - reusable reproduction heuristics
1164
- - stable verification heuristics
1165
- - cross-quest baseline debugging lessons
1166
- - global `templates`:
1167
- - setup checklist templates
1168
- - verification checklist templates
1169
- - publishable baseline package templates
1170
-
1171
- Useful tags include:
1172
-
1173
- - `stage:baseline`
1174
- - `baseline:<baseline_id>`
1175
- - `type:repro-lesson`
1176
- - `type:verification-caveat`
1177
- - `type:environment-incident`
1178
- - `topic:<dataset-or-method>`
515
+ - environment incidents
516
+ - dataset quirks
517
+ - verification caveats
518
+ - attach vs import vs reproduce vs repair rationale
1179
519
 
1180
520
  When calling `memory.write(...)`, pass `tags` as an array like `["stage:baseline", "baseline:<baseline_id>", "type:repro-lesson"]`, not as one comma-joined string.
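A small helper can keep the tag shape consistent. This is a sketch only: `baseline_tags` is a hypothetical function, and how the resulting list is passed to `memory.write(...)` depends on that tool's actual signature, which is not shown here.

```python
def baseline_tags(baseline_id: str, lesson_type: str = "repro-lesson") -> list:
    """Build baseline memory tags as a list of strings, not one joined string."""
    return ["stage:baseline", f"baseline:{baseline_id}", f"type:{lesson_type}"]
```

The point of returning a list is to avoid the comma-joined form, for example `"stage:baseline,baseline:<baseline_id>"`, which would be a single tag rather than several.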
1181
521
 
1182
- Recommended read timing:
1183
-
1184
- - before route selection:
1185
- - consult quest `decisions`, `knowledge`, and relevant `papers`
1186
- - before reruns or repairs:
1187
- - search quest `episodes` first
1188
- - before acceptance:
1189
- - re-check quest `knowledge` and `decisions`
1190
- - before publishing globally:
1191
- - confirm the lesson is truly reusable and not only quest-local
1192
-
1193
522
  Stage-end requirement:
1194
523
 
1195
524
  - if baseline work produced a durable reproduction lesson, verification caveat, environment incident, or route rationale, write at least one `memory.write(...)` before leaving the stage
1196
525
 
1197
- For a fuller memory strategy, read `references/memory-playbook.md`.
1198
-
1199
526
  ## Artifact rules
1200
527
 
1201
528
  Typical artifact sequence:
1202
529
 
1203
- - progress artifact for long-running setup or execution
1204
- - report artifact for analysis or verification notes
1205
- - baseline artifact for the accepted result
1206
- - decision artifact when choosing attach/import/reproduce/repair or when deciding the next anchor
1207
-
1208
- If a reusable baseline was established, prefer recording it in a form that later stages can attach or reuse directly instead of forcing redundant reproduction.
1209
-
1210
- Use `artifact.attach_baseline(...)` or `artifact.publish_baseline(...)` when appropriate.
530
+ - `progress` for long-running setup or execution checkpoints
531
+ - `report` for analysis notes or verification notes
532
+ - `decision` for route choice, blocked routing, or accept/reject/rerun/repair calls
533
+ - `baseline` only for an accepted baseline record
1211
534
 
1212
- Preferred artifact choices:
535
+ For stable field shapes, read `references/artifact-payload-examples.md`.
1213
536
 
1214
- - use `decision` for:
1215
- - route selection
1216
- - blocked-state routing
1217
- - accept / reject / rerun / repair choices
1218
- - use `report` for:
1219
- - analysis notes
1220
- - verification reports
1221
- - merged reproduction reports
1222
- - comparability-contract summaries
1223
- - use `progress` during long-running setup or execution
1224
- - use `baseline` only for an accepted baseline record
1225
- - use `approval` only if an explicit user approval was needed for a costly or degraded baseline gate
1226
-
1227
- ## Handoff contract
1228
-
1229
- Before handing the quest to `idea`, `experiment`, or `write`, the baseline stage should make the next stage's life easy.
1230
-
1231
- At minimum, downstream stages should be able to answer all of the following without reopening the full reproduction investigation:
1232
-
1233
- - which baseline is active
1234
- - which route produced it: attach, import, reproduce, or repair
1235
- - which metrics are trusted
1236
- - where the baseline outputs and logs live
1237
- - what caveats or deviations still matter
1238
- - whether the baseline is quest-local only or globally reusable
1239
-
1240
- ## Publication and reuse rules
1241
-
1242
- Publish or attach baselines deliberately.
1243
-
1244
- - attach when a trusted reusable baseline already exists and is the right reference for this quest
1245
- - publish when this quest produced a verified reusable baseline that later quests should be able to reuse
1246
- - do not publish a blocked, speculative, or verification-incomplete baseline
1247
- - do not attach a baseline without explaining why it is the correct downstream reference
1248
-
1249
- If a baseline is accepted but not globally reusable, say that explicitly instead of leaving the reuse status ambiguous.
1250
-
1251
- The baseline stage should normally hand off with:
1252
-
1253
- - one accepted baseline artifact
1254
- - one verification-oriented report artifact
1255
- - one active baseline reference through attachment or accepted local baseline state
1256
- - one concise next-step guidance statement or decision artifact when the next anchor is not obvious
1257
-
1258
- ## Final handoff packet
1259
-
1260
- Before leaving the baseline stage, make sure the next stage can read a compact handoff packet from durable state.
1261
-
1262
- The handoff packet should make these items obvious:
537
+ The baseline handoff should make these items obvious:
1263
538
 
1264
539
  - `baseline_id`
1265
540
  - `baseline_variant_id` when relevant
1266
- - route used: attach/import/reproduce/repair
541
+ - route used: attach, import, reproduce, or repair
1267
542
  - trusted metrics
1268
543
  - canonical metric contract JSON path
1269
544
  - verification outcome
@@ -1272,38 +547,36 @@ The handoff packet should make these items obvious:
1272
547
  - main caveats
1273
548
  - recommended next anchor
1274
549
 
1275
- If this packet is not obvious from the artifact plus verification note, the baseline stage is not yet stable enough.
550
+ If this packet is not obvious from the accepted artifact plus verification note, the baseline line is not stable enough yet.
1276
551
 
1277
552
  ## Failure and blocked handling
1278
553
 
1279
- Do not hide baseline failures.
554
+ Do not hide failures.
1280
555
 
1281
- If blocked, record exactly which class applies:
556
+ If blocked, record the class explicitly:
1282
557
 
1283
- - missing_source
1284
- - missing_code
1285
- - missing_metric_contract
1286
- - environment_infeasible
1287
- - command_unknown
1288
- - run_failed
1289
- - verification_failed
558
+ - `missing_source`
559
+ - `missing_code`
560
+ - `missing_metric_contract`
561
+ - `environment_infeasible`
562
+ - `command_unknown`
563
+ - `run_failed`
564
+ - `verification_failed`
1290
565
 
1291
- A blocked baseline result must state:
566
+ A blocked result must state:
1292
567
 
1293
568
  - what failed
1294
569
  - what was tried
1295
- - which paths/logs show the issue
1296
- - whether the next best move is attach/import/retry/reset/ask user
1297
-
1298
- If the failure happened after a long-running task, include the monitored command/log path rather than only a prose description.
570
+ - which paths or logs show the issue
571
+ - whether the next best move is attach, import, retry, repair, reset, or ask the user
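The blocked-result requirements above can be captured in a record that rejects unknown classes and missing evidence. This is a minimal sketch under assumptions: the `BlockedResult` class, its field names, and the `ask_user` spelling are hypothetical; only the blocked class labels and the list of next moves come from the text.

```python
from dataclasses import dataclass

BLOCKED_CLASSES = {
    "missing_source",
    "missing_code",
    "missing_metric_contract",
    "environment_infeasible",
    "command_unknown",
    "run_failed",
    "verification_failed",
}
NEXT_MOVES = {"attach", "import", "retry", "repair", "reset", "ask_user"}


@dataclass
class BlockedResult:
    blocked_class: str
    what_failed: str
    what_was_tried: list
    evidence_paths: list  # paths or logs that show the issue
    next_move: str

    def __post_init__(self) -> None:
        if self.blocked_class not in BLOCKED_CLASSES:
            raise ValueError(f"unknown blocked class: {self.blocked_class}")
        if self.next_move not in NEXT_MOVES:
            raise ValueError(f"unknown next move: {self.next_move}")
        if not self.evidence_paths:
            raise ValueError("a blocked result needs at least one path or log")
```

Requiring at least one evidence path is a design choice in this sketch; it reflects the expectation that a blocked result points at concrete logs rather than only a prose description.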
1299
572
 
1300
- Common autonomous fixes before falling back:
573
+ Reasonable autonomous fixes before escalation:
1301
574
 
1302
575
  - missing module or dependency
1303
576
  - wrong dataset path
1304
577
  - permission errors on scripts
1305
578
  - reasonable batch-size reductions for OOM
1306
- - obvious environment activation issues
579
+ - obvious environment activation mistakes
1307
580
 
1308
581
  If a fix would change confirmed scope, metrics, permissions, or resource assumptions, stop and return to analysis rather than applying it silently.
1309
582