@researai/deepscientist 1.5.12 → 1.5.14

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (99)
  1. package/bin/ds.js +20 -3
  2. package/docs/en/00_QUICK_START.md +24 -5
  3. package/docs/en/01_SETTINGS_REFERENCE.md +4 -0
  4. package/docs/en/05_TUI_GUIDE.md +466 -96
  5. package/docs/en/09_DOCTOR.md +24 -5
  6. package/docs/en/15_CODEX_PROVIDER_SETUP.md +113 -15
  7. package/docs/en/README.md +2 -0
  8. package/docs/zh/00_QUICK_START.md +24 -5
  9. package/docs/zh/01_SETTINGS_REFERENCE.md +4 -0
  10. package/docs/zh/05_TUI_GUIDE.md +465 -82
  11. package/docs/zh/09_DOCTOR.md +24 -5
  12. package/docs/zh/15_CODEX_PROVIDER_SETUP.md +113 -15
  13. package/docs/zh/README.md +2 -0
  14. package/package.json +2 -1
  15. package/pyproject.toml +1 -1
  16. package/src/deepscientist/__init__.py +1 -1
  17. package/src/deepscientist/artifact/service.py +125 -2
  18. package/src/deepscientist/cli.py +3 -0
  19. package/src/deepscientist/codex_cli_compat.py +117 -0
  20. package/src/deepscientist/config/service.py +53 -6
  21. package/src/deepscientist/connector/lingzhu_support.py +23 -4
  22. package/src/deepscientist/daemon/app.py +111 -30
  23. package/src/deepscientist/mcp/server.py +161 -19
  24. package/src/deepscientist/prompts/builder.py +13 -54
  25. package/src/deepscientist/quest/service.py +99 -0
  26. package/src/deepscientist/quest/stage_views.py +134 -29
  27. package/src/deepscientist/runners/codex.py +11 -2
  28. package/src/deepscientist/runners/runtime_overrides.py +3 -0
  29. package/src/deepscientist/shared.py +6 -1
  30. package/src/prompts/system.md +220 -2065
  31. package/src/skills/baseline/SKILL.md +265 -994
  32. package/src/skills/baseline/references/artifact-payload-examples.md +39 -0
  33. package/src/skills/baseline/references/baseline-checklist-template.md +21 -32
  34. package/src/skills/baseline/references/baseline-plan-template.md +41 -57
  35. package/src/tui/dist/app/AppContainer.js +1442 -52
  36. package/src/tui/dist/components/Composer.js +1 -1
  37. package/src/tui/dist/components/ConfigScreen.js +190 -36
  38. package/src/tui/dist/components/GradientStatusText.js +1 -20
  39. package/src/tui/dist/components/InputPrompt.js +41 -32
  40. package/src/tui/dist/components/LoadingIndicator.js +1 -1
  41. package/src/tui/dist/components/Logo.js +61 -38
  42. package/src/tui/dist/components/MainContent.js +10 -3
  43. package/src/tui/dist/components/WelcomePanel.js +4 -12
  44. package/src/tui/dist/components/messages/AssistantMessage.js +1 -1
  45. package/src/tui/dist/components/messages/BashExecOperationMessage.js +3 -3
  46. package/src/tui/dist/components/messages/OperationMessage.js +1 -1
  47. package/src/tui/dist/index.js +28 -1
  48. package/src/tui/dist/layouts/DefaultAppLayout.js +3 -3
  49. package/src/tui/dist/lib/api.js +17 -0
  50. package/src/tui/dist/lib/connectorConfig.js +90 -0
  51. package/src/tui/dist/lib/connectors.js +261 -0
  52. package/src/tui/dist/lib/qr.js +21 -0
  53. package/src/tui/dist/semantic-colors.js +29 -19
  54. package/src/tui/package.json +2 -1
  55. package/src/ui/dist/assets/{AiManusChatView-CnJcXynW.js → AiManusChatView-DaF9Nge_.js} +12 -12
  56. package/src/ui/dist/assets/{AnalysisPlugin-DeyzPEhV.js → AnalysisPlugin-BSVx6dXE.js} +1 -1
  57. package/src/ui/dist/assets/{CliPlugin-CB1YODQn.js → CliPlugin-C9gzJX41.js} +9 -9
  58. package/src/ui/dist/assets/{CodeEditorPlugin-B-xicq1e.js → CodeEditorPlugin-DU9G0Tox.js} +8 -8
  59. package/src/ui/dist/assets/{CodeViewerPlugin-DT54ysXa.js → CodeViewerPlugin-DoX_fI9l.js} +5 -5
  60. package/src/ui/dist/assets/{DocViewerPlugin-DQtKT-VD.js → DocViewerPlugin-C4FWIXuU.js} +3 -3
  61. package/src/ui/dist/assets/{GitDiffViewerPlugin-hqHbCfnv.js → GitDiffViewerPlugin-BgfFMgtf.js} +20 -20
  62. package/src/ui/dist/assets/{ImageViewerPlugin-OcVo33jV.js → ImageViewerPlugin-tcPkfY_x.js} +5 -5
  63. package/src/ui/dist/assets/{LabCopilotPanel-DdGwhEUV.js → LabCopilotPanel-_dKV60Bf.js} +11 -11
  64. package/src/ui/dist/assets/{LabPlugin-Ciz1gDaX.js → LabPlugin-Bje0ayoC.js} +2 -2
  65. package/src/ui/dist/assets/{LatexPlugin-BhmjNQRC.js → LatexPlugin-CVsBzAln.js} +7 -7
  66. package/src/ui/dist/assets/{MarkdownViewerPlugin-BzdVH9Bx.js → MarkdownViewerPlugin-xjmrqv_8.js} +4 -4
  67. package/src/ui/dist/assets/{MarketplacePlugin-DmyHspXt.js → MarketplacePlugin-mMM2A8wP.js} +3 -3
  68. package/src/ui/dist/assets/{NotebookEditor-BTVYRGkm.js → NotebookEditor-3kVDSOBo.js} +11 -11
  69. package/src/ui/dist/assets/{NotebookEditor-BMXKrDRk.js → NotebookEditor-SoJ8X-MO.js} +1 -1
  70. package/src/ui/dist/assets/{PdfLoader-CvcjJHXv.js → PdfLoader-DElVuHl9.js} +1 -1
  71. package/src/ui/dist/assets/{PdfMarkdownPlugin-DW2ej8Vk.js → PdfMarkdownPlugin-Bq88XT4G.js} +2 -2
  72. package/src/ui/dist/assets/{PdfViewerPlugin-CmlDxbhU.js → PdfViewerPlugin-CsCXMo9S.js} +10 -10
  73. package/src/ui/dist/assets/{SearchPlugin-DAjQZPSv.js → SearchPlugin-oUPvy19k.js} +1 -1
  74. package/src/ui/dist/assets/{TextViewerPlugin-C-nVAZb_.js → TextViewerPlugin-CRkT9yNy.js} +5 -5
  75. package/src/ui/dist/assets/{VNCViewer-D7-dIYon.js → VNCViewer-BgbuvWhR.js} +10 -10
  76. package/src/ui/dist/assets/{bot-C_G4WtNI.js → bot-v_RASACv.js} +1 -1
  77. package/src/ui/dist/assets/{code-Cd7WfiWq.js → code-5hC9d0VH.js} +1 -1
  78. package/src/ui/dist/assets/{file-content-B57zsL9y.js → file-content-D1PxfOrp.js} +1 -1
  79. package/src/ui/dist/assets/{file-diff-panel-DVoheLFq.js → file-diff-panel-DG1oT_Hj.js} +1 -1
  80. package/src/ui/dist/assets/{file-socket-B5kXFxZP.js → file-socket-BmdFYQlk.js} +1 -1
  81. package/src/ui/dist/assets/{image-LLOjkMHF.js → image-Dqe2X2tW.js} +1 -1
  82. package/src/ui/dist/assets/{index-Dxa2eYMY.js → index-DVsMKK_y.js} +1 -1
  83. package/src/ui/dist/assets/{index-C3r2iGrp.js → index-Duvz8Ip0.js} +12 -12
  84. package/src/ui/dist/assets/{index-CLQauncb.js → index-Nt9hS4ck.js} +470 -165
  85. package/src/ui/dist/assets/{index-hOUOWbW2.js → index-RDlNXXx1.js} +2 -2
  86. package/src/ui/dist/assets/{monaco-BGGAEii3.js → monaco-DIXge1CP.js} +1 -1
  87. package/src/ui/dist/assets/{pdf-effect-queue-DlEr1_y5.js → pdf-effect-queue-BBTTQaO-.js} +1 -1
  88. package/src/ui/dist/assets/{popover-CWJbJuYY.js → popover-BWlolyxo.js} +1 -1
  89. package/src/ui/dist/assets/{project-sync-CRJiucYO.js → project-sync-BM5PkFH4.js} +1 -1
  90. package/src/ui/dist/assets/{select-CoHB7pvH.js → select-D4dAtrA8.js} +2 -2
  91. package/src/ui/dist/assets/{sigma-D5aJWR8J.js → sigma-CKbE5jJT.js} +1 -1
  92. package/src/ui/dist/assets/{square-check-big-DUK_mnkS.js → square-check-big-CZNGMgiB.js} +1 -1
  93. package/src/ui/dist/assets/{trash-ChU3SEE3.js → trash-DaB37xAz.js} +1 -1
  94. package/src/ui/dist/assets/{useCliAccess-BrJBV3tY.js → useCliAccess-C2OmAcWe.js} +1 -1
  95. package/src/ui/dist/assets/{useFileDiffOverlay-C2OQaVWc.js → useFileDiffOverlay-Dowd1Ij4.js} +1 -1
  96. package/src/ui/dist/assets/{wrap-text-C7Qqh-om.js → wrap-text-BGjAhAUq.js} +1 -1
  97. package/src/ui/dist/assets/{zoom-out-rtX0FKya.js → zoom-out-dMZQMXzc.js} +1 -1
  98. package/src/ui/dist/index.html +1 -1
  99. package/uv.lock +1 -1
@@ -6,112 +6,82 @@ description: Use when a quest needs to attach, import, reproduce, repair, verify
  # Baseline
 
  This skill establishes the reference system the quest will compare against.
- It absorbs the essential old DeepScientist reproducer discipline into one stage skill.
+ The target is one trustworthy baseline line, not an endless reproduction diary.
 
  ## Interaction discipline
 
  - Follow the shared interaction contract injected by the system prompt.
- - For ordinary active work, prefer a concise progress update once work has crossed roughly 6 tool calls with a human-meaningful delta, and do not drift beyond roughly 12 tool calls or about 8 minutes without a user-visible update.
- - Keep ordinary setup and debugging updates concise. Reserve richer milestone reports for accepted / waived / blocked baseline outcomes or other route-changing checkpoints instead of narrating every small setup step.
- - Message templates are references only. Adapt to the actual context and vary wording so updates feel natural and non-robotic.
- - If a threaded user reply arrives, interpret it relative to the latest baseline progress update before assuming the task changed completely.
- - Prefer `bash_exec` for setup, reproduction, and verification commands so each baseline action keeps a durable quest-local session id and log trail.
- - When the baseline route is durably chosen, confirmed, waived, or blocked with a clear next action, send one richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` update that says whether the baseline is trusted, blocked, or waived, why that matters, and what the next stage is.
+ - Keep ordinary setup and debugging updates concise.
+ - Use richer milestone updates only when the baseline becomes trusted, caveated, blocked, waived, or route-changing.
+ - Prefer `bash_exec` for setup, reproduction, monitoring, and verification commands so the baseline line stays durable and auditable.
 
  ## Non-negotiable rules
 
- - no fabrication of metrics, logs, run status, or success claims
- - do not skip baseline steps or silently simplify the reproduction path without explicit approval
+ - no fabricated metrics, logs, run status, or success claims
+ - do not skip baseline steps or silently simplify the route when that would change trust or comparability
  - do not claim a baseline is ready before verification is complete
- - do not infer missing commands, scripts, or parameters when the uncertainty would change the result
+ - do not infer missing commands, scripts, or parameters when the uncertainty could change the result
  - any unavoidable guess must be written down explicitly with expected impact
- - for Python baselines, standardize environment setup with `uv`; do not default to ad-hoc `pip install ...`, a fresh `conda create ...`, or global package mutation when `uv` can provide the same environment reproducibly
  - use web search for discovering papers or repos, but use `artifact.arxiv(paper_id=..., full_text=False)` for actually reading a source arXiv paper when it exists
- - set `full_text=True` only when the summary/abstract view is insufficient for the needed detail; do not default to the raw PDF
-
- ## Language and interaction rules
-
- - match the user's language in all visible outputs
- - keep updates concise but concrete
- - if a structured user decision is required, ask only for decisions that the system cannot safely derive locally
- - do not ask speculative or premature questions when local analysis can narrow the choices first
+ - set `full_text=True` only when the short form is insufficient
+ - for Python baselines, environment setup should be standardized around `uv`
 
  ## Stage purpose
 
  The baseline stage should produce a usable reference point through one of four routes:
 
- - attach an existing reusable baseline
- - import a reusable baseline package
- - reproduce a baseline from source
- - repair a broken or stale baseline
+ 1. attach an existing reusable baseline
+ 2. import a reusable baseline package
+ 3. reproduce a baseline from source
+ 4. repair a broken or stale baseline
 
- The stage must preserve the classic four-part reproducer flow:
+ Keep the classic control flow:
 
  1. analysis
  2. setup
  3. execution
  4. verification
 
- Do not casually skip these gates.
+ These are control gates, not paperwork walls.
 
  ## Quick workflow
 
- Treat this as the compressed map of the detailed sections below, not as a second independent SOP.
-
- 1. Read the source paper and source repo first, or explicitly record what is missing and why.
+ 1. Read the source paper and source repo first, or record exactly what is missing and why.
  2. Choose the lightest trustworthy route: attach, import, reproduce, or repair.
- 3. Before substantial setup, code changes, or a real run, create `PLAN.md` and `CHECKLIST.md`, and keep them updated when the route, assets, commands, or trust judgment changes materially.
- 4. Keep one dominant phase visible: analysis -> setup -> execution -> verification, with a bounded smoke test before any real long run.
- 5. Once the route is concrete, prefer one clean implementation pass, one smoke test, and then one normal baseline run; retry only when the smoke test, verification, or runtime evidence shows a concrete failure or incompatibility.
- 6. Close the baseline stage by confirming or waiving the gate, then send a concise `1-2` sentence summary that says whether the baseline is trusted, caveated, blocked, or waived, and what happens next.
-
- ## Route priority and escalation
+ 3. Start with the fast path whenever the current baseline object, command path, and acceptance target are already clear enough to validate cheaply.
+ 4. Before substantial baseline setup, code edits, or a real baseline run, create `PLAN.md` and `CHECKLIST.md`; short-form files are enough for simple fast-path work.
+ 5. Keep one dominant phase visible: analysis -> setup -> execution -> verification.
+ 6. Prefer one clean implementation pass, one smoke test, and then one normal baseline run.
+ 7. Retry only when smoke, verification, or runtime evidence shows a concrete failure or incompatibility.
+ 8. Close the stage by confirming or waiving the gate, then hand off with a concise `1-2` sentence summary of trust status and next anchor.
 
- This section sets route priority and escalation rules. The authoritative step-by-step execution remains in `Workflow`.
+ ## Fast-path first
 
  Default to the lightest baseline path that can still establish a trustworthy comparison.
- Do not front-load a full reproduction dossier when a faster truth-finding step would tell you whether the route is even viable.
- User requirements and explicit constraints are the primary boundary for the reproduction plan.
- Within that boundary, prefer equivalence-preserving efficiency gains before more compute: larger safe batch size, cache reuse, checkpoint resume, parallel downloads or workers, and the cheapest comparable smoke path.
+ Default to a fast path when it can establish trust with less work.
 
- The ordinary baseline order is:
+ Fast path is the default when any of the following is true:
 
- 1. confirm quest binding and current baseline state
- 2. look for the cheapest trustworthy route in order: attach, import, reproduce, repair
- 3. capture the minimum viable contract: task, dataset or split, metric, source identity, expected command path, and main risks
- 4. run a bounded smoke test as soon as that contract is concrete enough, then expand setup notes and launch the real run only after the smoke test is credible
- 5. verify before accepting, then archive, publish, or attach the result when appropriate
+ - `requested_baseline_ref` or `confirmed_baseline_ref` already points to the active baseline object
+ - the route is clearly `attach` or `import`
+ - the repo entrypoint, dataset or split, and metric contract are already concrete enough to validate cheaply
+ - reproduction requires no meaningful code changes and the main uncertainty is only whether the command still runs
 
- Escalate to the heavier baseline path only when the baseline is ambiguous, broken, multi-variant, paper-to-repo mismatched, or likely to be reused beyond the current quest.
+ Fast path means:
 
- If the quest is not yet bound to a stable baseline context, do not pretend the stage is ready just because some code exists locally.
+ - do not restart broad baseline discovery by default
+ - do not front-load a full codebase audit when the entrypoint is already concrete
+ - use a minimal `PLAN.md`, a minimal `CHECKLIST.md`, one bounded smoke test when needed, and then one real validation or run
+ - default to reuse-and-verify when runtime already attached a concrete baseline
 
- ## Required plan and checklist
+ Escalate from fast path to fuller audit only when:
 
- Before substantial baseline setup, code edits, or a real baseline run, create a quest-visible `PLAN.md` and `CHECKLIST.md`.
-
- - Use `references/baseline-plan-template.md` as the canonical structure for `PLAN.md`.
- - Use `references/baseline-checklist-template.md` as the canonical structure for `CHECKLIST.md`.
- - `PLAN.md` becomes mandatory after you have read the source paper and repo enough to restate the method faithfully, identify the real entrypoints, and explain the likely failure points; if either source is missing, record that gap explicitly before proceeding.
- - `PLAN.md` should put the user's explicit requirements and non-negotiable constraints first, then cover the chosen route, source package and provenance, safe efficiency levers, code touchpoints, environment and asset plan, smoke test, main run, fallback options such as ModelScope or local mirrors when Hugging Face is blocked, monitoring and sleep rules, verification targets, and a revision log.
- - `CHECKLIST.md` is the living companion to `PLAN.md`; update it during reading, setup, smoke testing, real execution, verification, and every material route change.
- - If an older quest already uses `analysis_plan.md` or `REPRO_CHECKLIST.md`, keep those files aligned with the canonical `PLAN.md` / `CHECKLIST.md` or turn them into clear compatibility pointers rather than splitting truth across parallel planning files.
- - Do not treat the plan as static: if the route, commands, source package, fallback path, or trust judgment changes materially, revise `PLAN.md` before continuing.
- - Once `PLAN.md` makes the route concrete, do not keep rewriting code or commands speculatively. The normal default is one bounded smoke test and then one real run, with retries only after a documented failure, invalidity, or compatibility problem.
-
- ## Phase routing rule
-
- Treat `analysis`, `setup`, `execution`, and `verification` as logical control gates, not paperwork walls.
- At any moment, the work should have one dominant phase among:
-
- - `analysis`
- - `setup`
- - `execution`
- - `verification`
-
- Keep the dominant phase explicit, but allow small backtracks and lightweight overlap when they reduce wasted work.
- Do not delay an early smoke test just because a fuller write-up is not done yet.
- Before a real long run, make sure the minimum viable contract is explicit and the active phase is still easy to reconstruct.
+ - the paper and repo disagree materially
+ - the real run or eval entrypoint is unclear
+ - code changes are likely required
+ - the contract spans multiple metrics, datasets, subtasks, or splits that still need interpretation
+ - the same failure class reappears after one documented autonomous fix
+ - the quest is trying to publish a reusable global baseline rather than only clear the current gate
 
  ## Use when
 
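Reviewer sketch (not part of the package): the fast-path and escalation bullets added in the hunk above read like a small decision procedure. The Python below is one possible reading, under stated assumptions. `BaselineState`, its field names, and `choose_path` are hypothetical names invented for illustration; the package does not define them. The sketch also assumes that an escalation signal overrides a fast-path signal, and that the fuller audit applies when no fast-path signal is present, neither of which the skill text states outright.

```python
from dataclasses import dataclass


@dataclass
class BaselineState:
    """Hypothetical snapshot of what is currently known about the baseline."""

    # Fast-path signals: any one of these makes the fast path the default.
    baseline_ref_bound: bool = False       # requested/confirmed ref points at the active baseline object
    route: str = ""                        # "attach", "import", "reproduce", or "repair"
    contract_concrete: bool = False        # entrypoint, dataset/split, and metric are cheap to validate
    only_run_uncertainty: bool = False     # no meaningful code changes; only "does the command still run?"

    # Escalation signals: any one of these calls for the fuller audit.
    paper_repo_mismatch: bool = False
    entrypoint_unclear: bool = False
    code_changes_likely: bool = False
    contract_needs_interpretation: bool = False
    failure_repeated_after_fix: bool = False
    publishing_global_baseline: bool = False


def choose_path(state: BaselineState) -> str:
    """Return "fast_path" or "full_audit" for the given state."""
    escalate = any([
        state.paper_repo_mismatch,
        state.entrypoint_unclear,
        state.code_changes_likely,
        state.contract_needs_interpretation,
        state.failure_repeated_after_fix,
        state.publishing_global_baseline,
    ])
    if escalate:
        # Assumption: escalation wins even when a fast-path signal is also set.
        return "full_audit"

    fast = any([
        state.baseline_ref_bound,
        state.route in {"attach", "import"},
        state.contract_concrete,
        state.only_run_uncertainty,
    ])
    # Assumption: with no fast-path signal at all, fall back to the fuller audit.
    return "fast_path" if fast else "full_audit"
```

For example, `choose_path(BaselineState(route="attach"))` selects the fast path, while adding `paper_repo_mismatch=True` to the same state selects the fuller audit.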
@@ -119,7 +89,7 @@ Before a real long run, make sure the minimum viable contract is explicit and th
  - the current baseline is unverified or stale
  - the user already has a baseline package that should be attached or imported
  - a reproduction failed earlier and now needs repair
- - the quest was resumed and the baseline trust state is unclear
+ - the quest resumed and the baseline trust state is unclear
 
  ## Do not use when
 
@@ -128,97 +98,83 @@ Before a real long run, make sure the minimum viable contract is explicit and th
 
  ## Stage gate
 
- Do not proceed to `idea` or `experiment` unless one of the following is durably true:
+ Do not proceed to comparison-heavy downstream work unless one of the following is durably true:
 
  - a baseline has been attached and accepted
  - a baseline has been imported and accepted
  - a baseline reproduction has completed and been verified
  - an explicit waiver decision exists with a clear reason
 
- Operationally, the canonical exit is stricter:
-
- - after the accepted baseline root is clear, call `artifact.confirm_baseline(...)`
- - if the quest must continue without a baseline, call `artifact.waive_baseline(...)`
-
- `attach`, `import`, `publish`, or a plain `baseline` artifact alone do not open the downstream gate.
+ Operationally:
 
- ## Truth sources
+ - call `artifact.confirm_baseline(...)` once the accepted baseline root and trusted comparison contract are clear
+ - call `artifact.waive_baseline(...)` when the quest must continue without a baseline
+ - attach, import, or publish alone do not open the downstream gate
 
- Use the following as baseline truth sources:
-
- - user objective and task framing
- - source paper and official repo when available
- - existing baseline registry entries
- - local baseline directories under `quest_root`
- - repo code, configs, and scripts
- - device and environment constraints detected locally
- - logs, metrics, and summaries from actual runs
-
- Do not treat memory alone as sufficient evidence for baseline readiness.
-
- ## Baseline workspace rules
+ ## Required plan and checklist
 
- - treat the baseline workspace as a system-managed reproduction surface, not an unrelated sandbox
- - avoid creating nested Git workflows inside the baseline workspace
- - keep the authoritative quest history in the quest repo
- - if papers are converted or notes are generated during baseline work, keep the durable copies under the quest-visible artifacts area unless there is a strong reason to keep a baseline-side copy
- - if runtime environment variables or secrets are provided by the runner, use them as authoritative but never echo or persist secret values
+ Before substantial baseline setup, code edits, or a real baseline run, create a quest-visible `PLAN.md` and `CHECKLIST.md`.
 
- The baseline line should also maintain a durable working-record area outside the execution surface.
- Recommended quest-visible records include:
+ - Use `references/baseline-plan-template.md` as the canonical structure for `PLAN.md`.
+ - Use `references/baseline-checklist-template.md` as the canonical structure for `CHECKLIST.md`.
+ - `analysis_plan.md` and `REPRO_CHECKLIST.md` remain acceptable compatibility alias files when an older quest already depends on them.
+ - For fast-path attach/import/prebound validation or a simple reproduce path with no expected code changes, short-form `PLAN.md` and `CHECKLIST.md` are enough.
+ - The plan should put the user's explicit requirements and non-negotiable constraints first.
+ - Then record the chosen route, source identity, command path, expected outputs, acceptance condition, safe efficiency levers, main risks, and fallback.
+ - If the route, commands, source package, fallback path, or trust judgment changes materially, revise `PLAN.md` before continuing.
+ - Once the route is concrete, stop reshaping code and commands speculatively.
 
- - `PLAN.md` as the canonical baseline plan; older quests may keep `analysis_plan.md` as a compatibility alias
- - `CHECKLIST.md` as the canonical living checklist; older quests may keep `REPRO_CHECKLIST.md` as a compatibility alias when already wired
- - `setup.md`
- - `execution.md`
- - `verification.md`
- - `STRUCTURE.md` only when the workspace layout is non-obvious or later reuse depends on it
+ Default retry discipline:
 
- For a simple attach/import flow or a straightforward reproduce flow, do not stall just to precreate every one of these files.
- Start with the smallest durable note that preserves the route, command path, target outputs, and main risks; expand it only after the route proves real.
+ - do not rerun the same unchanged smoke command just to reconfirm the same fact
+ - treat one autonomous retry for the same failure class as the normal upper bound
+ - if the same failure class appears again, switch explicitly into `repair`, record `blocked`, or route through `decision`
 
  ## Required durable outputs
 
  The baseline stage should usually leave behind:
 
  - a baseline directory under `baselines/local/` or `baselines/imported/`
- - a verification note or report under the quest
+ - `PLAN.md` and `CHECKLIST.md`
+ - a verification note or report
  - command, config, environment, and metrics pointers
  - a baseline artifact
  - a confirmed baseline gate via `artifact.confirm_baseline(...)`, or an explicit waiver via `artifact.waive_baseline(...)`
  - an optional registry publication if the baseline is reusable beyond this quest
 
- ## Stable execution contract
+ For simple attach/import flows or a straightforward reproduce flow, do not stall just to precreate every optional note file.
 
- To keep baseline work stable across different quests, do not stop at loose prose.
- But also do not confuse stability with ceremony.
- Use the lightest durable structure that keeps the baseline auditable and reusable.
+ Useful optional notes:
+
+ - `setup.md`
+ - `execution.md`
+ - `verification.md`
+ - `STRUCTURE.md` when the layout is non-obvious
+
+ ## File-by-file contract
+
+ - `PLAN.md` or compatibility alias `analysis_plan.md` is the required route contract before substantial setup, code edits, or a real run; it should state the route, source identity, command path, expected outputs, acceptance condition, main risks, and fallback.
+ - `CHECKLIST.md` or compatibility alias `REPRO_CHECKLIST.md` is the required living state tracker; it should show whether the baseline object, smoke decision, real run decision, and final accept / block / waive outcome are explicit.
+ - `setup.md` is optional unless environment or layout choices are non-trivial; if used, record the working directory, environment route, important config paths, source revision, and notable setup deviations.
+ - `execution.md` is optional unless the run is long, multi-step, or rerun-heavy; if used, record the launched commands, durable log paths, checkpoints, exit state, and any reruns or repairs.
+ - `verification.md` is optional as a filename but required in substance before acceptance or blocked closeout; either this file or an equivalent report should record trusted metrics, expected-versus-observed comparison, caveats, canonical output paths, and the next anchor.
+ - `STRUCTURE.md` becomes required when the workspace layout, mounts, symlinks, or generated outputs are non-obvious or meant for reuse; it should map the important directories and say which paths are canonical.
+ - `attachment.yaml` is required for attached or imported baselines under `baselines/imported/`; preserve source identity, selected variant when relevant, and attachment provenance there.
+ - `<baseline_root>/json/metric_contract.json` is the canonical accepted comparison contract; once the baseline is accepted, do not leave the authoritative metric surface only in chat, memory, or prose.
+ - `Result/metric.md` is scratch-only; it may help during execution, but it is never the final source of truth.
 
  Minimum stability rules:
 
  - before the first real run, leave one durable note with the chosen route, expected command path, target outputs, and main risks
  - after each smoke test or real run, record what actually happened and whether the route still looks viable
  - before acceptance, leave a clear verification note and baseline gate decision
- - every route selection should leave one explicit reasoned decision record
  - every accepted baseline should leave one accepted baseline artifact
  - every blocked baseline line should leave one blocked report and one next-step decision
- - every handoff should name the active baseline reference and trusted metric set explicitly
- - when the accepted paper-facing contract spans multiple metrics, datasets, subtasks, or splits, preserve that full comparison surface in the durable metric contract rather than collapsing it to one headline number
- - do not require every optional checklist or template before the first smoke test
  - if one rolling note is enough for a simple baseline line, use it
 
- Recommended phase-to-output mapping:
-
- - `analysis` -> a brief `PLAN.md` or compatible `analysis_plan.md`, plus optional route decision artifact
- - `setup` -> `setup.md` when setup choices are non-trivial
- - `execution` -> `execution.md` plus progress artifacts when long-running
- - `verification` -> `verification.md` plus accepted baseline artifact and `artifact.confirm_baseline(...)`, or a blocked report plus `artifact.waive_baseline(...)` when skipping is intentional
-
- If the work skips one of these durable outputs, explain why the baseline remains interpretable without it.
-
  ## Durable path contract
 
- The baseline stage should use the real runtime paths consistently.
+ Use the real runtime paths consistently.
 
  Quest-local paths:
 
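Reviewer sketch (not part of the package): the "Default retry discipline" bullets added in the hunk above amount to a per-failure-class retry budget of one. The Python below is a minimal illustration under that reading. `RetryTracker`, `MAX_AUTONOMOUS_RETRIES`, and the failure-class strings are hypothetical names invented here. The `"escalate"` result stands in for the three outcomes the skill lists (switch into `repair`, record `blocked`, or route through `decision`); choosing among them is left to the caller.

```python
from collections import Counter

# "treat one autonomous retry for the same failure class as the normal upper bound"
MAX_AUTONOMOUS_RETRIES = 1


class RetryTracker:
    """Counts failures per failure class and says whether another retry is allowed."""

    def __init__(self) -> None:
        self._failures: Counter[str] = Counter()

    def next_action(self, failure_class: str) -> str:
        """Record one failure and return "retry" or "escalate"."""
        self._failures[failure_class] += 1
        if self._failures[failure_class] <= MAX_AUTONOMOUS_RETRIES:
            # First occurrence of this class: one documented autonomous fix is allowed.
            return "retry"
        # Same class again: stop retrying and escalate (repair / blocked / decision).
        return "escalate"
```

Failure classes are tracked independently, so a second distinct problem still gets its own single retry; only a repeat of the same class triggers escalation.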
@@ -235,33 +191,33 @@ Global reusable registry paths:
  - baseline registry index: `~/DeepScientist/config/baselines/index.jsonl`
  - canonical baseline entry: `~/DeepScientist/config/baselines/entries/<baseline_id>.yaml`
 
+ ## Baseline id and variant rules
+
+ - `baseline_id` should be short, stable, and filesystem-safe
+ - use letters, digits, `.`, `_`, or `-`
+ - do not use spaces, `/`, `\\`, or `..`
+ - if one codebase contains multiple comparable baselines, prefer one `baseline_id` with structured variants instead of inventing many near-duplicate entries
+ - when variants exist, keep `default_variant_id`, `baseline_variants`, and per-variant metric summaries stable enough that later `experiment` and `write` stages can cite them directly
+
  Do not invent parallel durable locations when these runtime contracts already exist.
  Do not leave the authoritative metric contract only in chat, memory, or prose once the baseline is accepted.
 
  If a baseline is reproduced only because an analysis campaign needs an extra comparator:
 
- - still place it under `<quest_root>/baselines/local/<baseline_id>/` or `<quest_root>/baselines/imported/<baseline_id>/`
+ - still place it under the normal baseline roots
  - treat it as a supplementary analysis baseline unless the quest explicitly promotes it into the canonical gate
  - do not call `artifact.confirm_baseline(...)` for that supplementary case unless the quest truly intends to replace the canonical baseline
 
- ## Baseline id and variant rules
-
- Baseline identity should be stable and path-safe.
-
- - `baseline_id` should be short, stable, and filesystem-safe
- - use letters, digits, `.`, `_`, or `-`
- - do not use spaces, `/`, `\\`, or `..`
- - if one codebase contains multiple comparable baselines, use one `baseline_id` with structured variants instead of inventing many unrelated entries
-
- When variants exist, maintain at least:
+ ## Multi-baseline policy
 
- - `default_variant_id`
- - `baseline_variants`
- - per-variant metric summaries when available
+ One quest may legitimately need more than one baseline.
 
- The baseline stage should treat `baseline_id` and `variant_id` as durable references that later `idea`, `experiment`, and `write` stages can cite directly.
+ - explicitly mark which baseline is the primary downstream comparator
+ - distinguish primary comparison baselines from fallback or infrastructure baselines
+ - if several baselines are credible, record why the chosen primary baseline is the fairest paper-facing comparator
+ - do not leave later stages guessing which baseline is authoritative
 
- ## Baseline route order
+ ## Route order
 
  Prefer this order:
 
@@ -272,101 +228,6 @@ Prefer this order:
272
228
 
273
229
  Prefer reuse over redundant reproduction.
274
230
 
275
- ## Route selection rules
276
-
277
- Choose the route explicitly rather than by habit.
278
-
279
- - choose `attach` when a published baseline already exists in the registry and its metrics or provenance are trustworthy enough for the quest
280
- - choose `import` when the user or repo provides a reusable baseline package or bundle that is not yet attached to the current quest
281
- - choose `reproduce` when no trustworthy reusable baseline is available but the source repo, paper, and evaluation path are concrete enough to establish one
282
- - choose `repair` when a baseline route already exists but failed, drifted, or is only partially complete and the broken point is bounded enough to diagnose directly
283
-
284
- Do not default to reproduction if attach or import would establish an equally trustworthy reference with less risk and cost.
285
-
286
- Before locking the route, explicitly answer:
287
-
288
- - what object is being reused or established
289
- - what makes it trustworthy enough for downstream comparison
290
- - what evidence is missing
291
- - what the cheapest credible next step is
292
-
293
- For a more explicit route-selection rubric, read `references/route-selection.md`.
294
-
295
- ## Baseline comparability contract
296
-
297
- The baseline stage is not complete just because something ran.
298
- It is complete when later stages can compare against it fairly.
299
-
300
- Before declaring a baseline usable, make the comparability contract explicit:
301
-
302
- - task identity
303
- - dataset identity and version
304
- - split contract
305
- - preprocessing boundary
306
- - evaluation script or evaluation path
307
- - required metric keys
308
- - metric directions
309
- - seed policy when relevant
310
- - source commit or source package identity
311
- - known deviations from the source reference
312
-
313
- If any of these are still materially unknown, do not pretend the baseline is a clean downstream reference.
314
-
315
- Use `references/comparability-contract.md` for the full checklist.
316
-
317
- ## Feasibility and acceptance classes
318
-
319
- Before accepting a baseline, classify feasibility as one of:
320
-
321
- - `full_reproducible`
322
- - `degraded_but_acceptable`
323
- - `blocked`
324
-
325
- And classify downstream trust as one of:
326
-
327
- - `verified`
328
- - `partially_verified`
329
- - `operational_but_incomparable`
330
- - `failed`
331
-
332
- Rules:
333
-
334
- - `full_reproducible` means the baseline can be reproduced within the agreed contract
335
- - `degraded_but_acceptable` means the quest explicitly allows a bounded degraded gate
336
- - `blocked` means insufficient assets, compute, or environment to produce an acceptable baseline
337
- - `verified` means trusted for downstream comparison
338
- - `partially_verified` means useful but still caveated
339
- - `operational_but_incomparable` means it runs, but the comparison contract is not stable enough yet
340
- - `failed` means it should not be used downstream
341
-
342
- Do not silently upgrade a degraded or only operational result into a normal trusted baseline.
343
-
344
- ## Multi-baseline policy
345
-
346
- One quest may legitimately need more than one baseline reference.
347
-
348
- Common roles include:
349
-
350
- - primary comparison baseline
351
- - strongest literature baseline
352
- - cheapest operational fallback baseline
353
-
354
- If more than one baseline exists, explicitly record:
355
-
356
- - which one is the primary downstream comparison
357
- - which one is only a fallback or infrastructure reference
358
- - why the primary choice is the fairest or strongest comparison
359
-
360
- Do not leave later stages guessing which baseline is authoritative.
361
-
362
- When useful, record the route choice as a decision artifact with action such as:
363
-
364
- - `attach_baseline`
365
- - `reuse_baseline`
366
- - `publish_baseline`
367
- - `continue`
368
- - `request_user_decision`
369
-
370
231
  ## Workflow
371
232
 
372
233
  ### Phase 1. Analysis
@@ -379,236 +240,88 @@ Before running anything substantial, determine:
379
240
  - source baseline identity
380
241
  - source code path
381
242
  - expected run command or evaluation path
382
- - expected paper or repo numbers, if any
243
+ - expected paper or repo numbers when they exist
383
244
  - local resource constraints
384
245
 
385
- For straightforward baseline work, start with a quick viability pass:
246
+ Default analysis discipline:
386
247
 
387
- - find the real run or evaluation entrypoint
388
- - identify the dataset/split and metric contract
248
+ - read the source paper and source repo first
249
+ - if runtime already exposes a matching `requested_baseline_ref` or `confirmed_baseline_ref`, validate that concrete object before restarting broad discovery
250
+ - identify the real run or evaluation entrypoint
251
+ - identify the dataset or split and metric contract
389
252
  - identify likely environment blockers
390
253
  - define the cheapest credible smoke test
391
254
 
392
- Escalate from that quick pass to a fuller baseline codebase audit when the command path is unclear, the repo is large or confusing, the paper and code diverge materially, repair mode is active, or custom code changes look likely.
255
+ Escalate to a fuller audit only when the command path is unclear, the repo is large or confusing, repair mode is active, or custom code changes look likely.
393
256
 
394
- When the fuller audit is necessary, capture at least:
257
+ When the fuller audit is necessary, capture only what later stages truly need:
395
258
 
396
- - major modules and files
259
+ - major entry scripts, configs, and modules
397
260
  - end-to-end data flow
398
- - key classes, functions, or scripts
399
- - external dependencies and environment assumptions
400
- - computational hotspots or obvious bottlenecks
401
- - current evaluation pipeline and metric computation path
402
- - coupling, maintainability, or scalability issues that may slow later iterations
261
+ - evaluation path and metric computation path
262
+ - obvious environment assumptions
263
+ - obvious bottlenecks or incompatibilities
403
264
 
404
- When the source paper is available, also record:
265
+ If the source paper is available, record:
405
266
 
406
- - read it through `artifact.arxiv(paper_id=..., full_text=False)` first, and only switch to `full_text=True` when the shorter view is insufficient
407
267
  - the core algorithm in compact, implementation-faithful form
408
268
  - the main reported numbers
409
- - the main weaknesses or bottlenecks likely to matter on the current quest task or dataset
410
-
411
- If helpful, restate the core algorithm using one or more of the following:
412
-
413
- - short pseudocode
414
- - a compact equation or objective
415
- - a code-level sketch tied to real files
416
-
417
- The goal is not academic polish.
418
- The goal is that later `idea`, `experiment`, and `write` stages can understand what the baseline actually does without reopening the whole repo from scratch.
419
-
420
- You should inspect local feasibility with shell-based checks when needed, including:
421
-
422
- - OS
423
- - GPU availability
424
- - CPU and RAM
425
- - free disk
426
- - Python or conda environment availability
427
- - whether `uv` is available and which Python version `uv` should target
428
-
429
- Use the collected constraints to choose a realistic baseline route and runtime plan.
430
-
431
- The analysis phase should leave behind a concrete baseline plan rather than only conversational intent.
432
- At minimum, the plan should capture:
433
-
434
- - chosen route
435
- - source identity
436
- - expected commands
437
- - expected outputs
438
- - feasibility notes
439
- - key risks
440
- - verification targets
441
-
442
- Prefer `PLAN.md` for new work and use `references/baseline-plan-template.md` when you need a concrete starting structure.
443
- When the analysis note becomes substantial, structure `PLAN.md` or a legacy-compatible `analysis_plan.md` with headings close to:
269
+ - the main weaknesses or bottlenecks likely to matter for this quest
444
270
 
445
- - executive summary
446
- - codebase analysis
447
- - limitations or bottlenecks
448
- - KPI and metric contract
449
- - route choice
450
- - risks and mitigations
271
+ You may inspect local feasibility with shell-based checks for OS, GPU, CPU, RAM, disk, Python version, and whether `uv` is available.
451
272
 
452
- Analysis-phase questioning rules:
273
+ The analysis phase should leave behind a concrete plan rather than only conversational intent.
453
274
 
454
- - ask the user only after the analysis is concrete enough to expose real choices
455
- - the early exception is when code access, paper access, source identity, or execution permission is missing and that absence blocks even baseline analysis
456
- - do not ask generic “how should I set up the environment” questions before you inspect the device and code requirements
457
- - do not repeat already confirmed decisions unless the plan materially changed
458
-
459
- If a user decision is required, make it structured and compact:
460
-
461
- - usually `1-6` questions total
462
- - each question should contain concrete options
463
- - options should reflect actual hardware/code feasibility
464
- - options should include tradeoffs
465
- - the recommended option should be explicit
466
- - free-form input should be requested only where a preset choice is genuinely insufficient
467
-
468
- If parallel execution is proposed, it must be explicitly confirmed rather than silently enabled.
469
-
470
- Avoid asking the user to design the environment for you.
471
- Instead, analyze the environment first, then present the recommended path and tradeoffs only if a user decision is actually required.
472
-
473
- If the code, paper, or baseline source is missing and the missing piece changes the route materially, stop and ask for a structured decision rather than guessing.
474
-
475
- For a denser audit checklist, read `references/codebase-audit-checklist.md`.
476
-
477
- ### Phase 2. Setup
275
+ ## Phase 2. Setup
478
276
 
479
277
  Prepare the selected route:
480
278
 
481
279
  - attach: validate the selected baseline id and variant
482
- - import: place the imported baseline metadata under the quest
280
+ - import: place the imported baseline metadata under the quest and confirm the package is readable
483
281
  - reproduce: prepare the baseline work directory, commands, config pointers, and environment notes
484
282
  - repair: identify the precise broken point before rerunning blindly
485
283
 
486
- For Python baselines, environment setup should be standardized around `uv`.
487
- Treat `uv` as the default environment and package manager for baseline setup, smoke tests, and real runs.
488
- Do not casually switch to a new conda environment or a manual `pip install` flow just because the repo is old.
489
- If the baseline already ships a `pyproject.toml` / `uv.lock`, use that path first.
490
- If it only ships `requirements.txt`, still create the environment with `uv` and install through `uv pip`.
491
- Only accept a non-`uv` environment route when there is a concrete blocker that cannot be resolved locally, and record that blocker explicitly in `setup.md` and the progress update.
492
-
493
- For a fast-path reproduction, setup can stay lightweight.
494
- Confirm the working directory, environment, config, output paths, smoke command, and long-run command, then move forward.
495
- Do not manufacture a fresh workspace tree or copy the repo just to satisfy a template if the existing layout is already workable and auditable.
496
-
497
- Capture:
498
-
499
- - baseline identifier
500
- - source and provenance
501
- - working directory
502
- - config files
503
- - command template
504
- - expected outputs
505
- - risks and known deviations from the paper or source
506
-
507
- Setup should also confirm:
508
-
509
- - the intended working directory is correct
510
- - the output paths are durable and quest-visible
511
- - required dependencies or environments are known
512
- - the execution plan is realistic for the detected hardware
284
+ For Python baselines, standardize environment setup around `uv`.
513
285
 
514
286
  ### Python environment rule: use `uv`
515
287
 
516
- When the baseline is Python-based, prefer the following order:
517
-
518
- 1. if the repo already contains `uv.lock` or a solid `pyproject.toml`, use `uv sync`
519
- 2. otherwise create a local virtual environment with `uv venv`
520
- 3. install dependencies with `uv pip install ...`
521
- 4. run setup, smoke tests, and real commands through `uv run ...`
288
+ - if the repo already contains `uv.lock` or a solid `pyproject.toml`, use `uv sync`
289
+ - otherwise create a local virtual environment with `uv venv`
290
+ - install dependencies with `uv pip install ...`
291
+ - run setup, smoke tests, and real commands through `uv run ...`
522
292
 
523
293
  Practical rules:
524
294
 
525
- - prefer a quest-local or baseline-local `.venv` under the actual working tree
526
- - prefer `uv run python ...` / `uv run bash ...` over relying on shell activation state
295
+ - prefer a quest-local or baseline-local `.venv`
296
+ - prefer `uv run python ...` or `uv run bash ...` over relying on shell activation state
527
297
  - if a specific interpreter is required, make it explicit with `uv venv --python 3.11` or `uv run --python 3.11 ...`
528
- - if CUDA, PyTorch, JAX, or custom wheels require a special index URL, still keep the installation command under `uv pip`
529
- - if the repo insists on conda-only tooling, first check whether the same packages can be installed with `uv`; only keep the conda route if you can explain why `uv` is not viable
530
-
531
- Examples:
532
-
533
- ```bash
534
- # modern repo with pyproject.toml / uv.lock
535
- cd <baseline_root>
536
- uv sync
537
- uv run python -m pytest tests/test_smoke.py -q
538
- uv run python train.py --config configs/baseline.yaml
539
- ```
540
-
541
- ```bash
542
- # legacy repo with requirements.txt
543
- cd <baseline_root>
544
- uv venv --python 3.11
545
- uv pip install -r requirements.txt
546
- uv run python scripts/smoke_test.py
547
- uv run python main.py --dataset cifar10 --config configs/resnet18.yaml
548
- ```
549
-
550
- ```bash
551
- # one-off package additions without leaving the uv-managed flow
552
- cd <baseline_root>
553
- uv venv --python 3.11
554
- uv pip install -r requirements.txt
555
- uv pip install "torch==2.4.1" "torchvision==0.19.1"
556
- uv run python evaluate.py --checkpoint outputs/best.pt
557
- ```
558
-
559
- When you record the setup, explicitly note:
560
-
561
- - the chosen `uv` route: `uv sync` vs `uv venv` + `uv pip`
562
- - the Python version
563
- - the dependency source files used
564
- - the exact `uv run ...` command used for the smoke test
565
- - any blocker that prevented a pure `uv` flow
566
-
567
- If a dedicated baseline workspace is needed, establish a clear layout.
568
- One workable structure is:
569
-
570
- ```text
571
- <baseline_root>/
572
- src/
573
- scripts/
574
- logs/
575
- cache/
576
- results/
577
- exports/
578
- latest/
579
- <run_id>/
580
- ```
581
-
582
- If the baseline becomes long-lived, shared, or non-obvious, the quest-visible audit area may contain:
583
-
584
- ```text
585
- <quest_root>/
586
- baselines/
587
- local/
588
- <baseline_id>/
589
- analysis_plan.md
590
- setup.md
591
- execution.md
592
- verification.md
593
- STRUCTURE.md
594
- REPRO_CHECKLIST.md
595
- ```
298
+ - if CUDA, PyTorch, JAX, or custom wheels require a special index URL, keep that install under `uv pip`
299
+ - only accept a non-`uv` route when there is a concrete blocker that cannot be resolved locally
300
+
301
+ Common `uv` patterns:
302
+
303
+ - `uv sync`
304
+ - `uv venv --python 3.11`
305
+ - `uv pip install -r requirements.txt`
306
+ - `uv run python scripts/smoke_test.py`
307
+ - `uv run python train.py --config ...`
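The route choice described above can be sketched as a small helper that inspects the baseline root and returns the `uv` setup commands to run. The helper name `plan_uv_commands` is hypothetical. The sketch also assumes that any `pyproject.toml` counts as "solid", which a real check would need to verify.

```python
from pathlib import Path


def plan_uv_commands(baseline_root: str, python_version: str = "3.11") -> list[str]:
    """Return the uv setup commands for a baseline root, in execution order."""
    root = Path(baseline_root)
    # Route 1: the repo already ships uv.lock or pyproject.toml.
    # Assumption: any pyproject.toml is treated as usable here.
    if (root / "uv.lock").exists() or (root / "pyproject.toml").exists():
        return ["uv sync"]
    # Route 2: create a local virtual environment with an explicit interpreter.
    commands = [f"uv venv --python {python_version}"]
    if (root / "requirements.txt").exists():
        commands.append("uv pip install -r requirements.txt")
    return commands
```

The helper only plans commands and does not execute them, so the chosen route and Python version can be recorded in the setup notes before anything runs.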
596
308
 
597
309
  Setup should record:
598
310
 
599
- - how the source was obtained: attach/import/copy/clone
600
- - upstream URL when known
601
- - upstream commit hash when known
602
- - `uv` environment route and Python version
603
- - key environment variables by name only, with sensitive values redacted
604
- - the directory tree and key files expected to matter later
311
+ - baseline id and source identity
312
+ - working directory
313
+ - config files
314
+ - command template
315
+ - expected outputs
316
+ - known deviations from paper or source
317
+ - the chosen `uv` route and Python version
605
318
 
606
- If a local source repo was copied into the workspace, preserve provenance but do not keep a nested authoritative Git lifecycle inside the baseline execution root.
319
+ Fallbacks:
607
320
 
608
- If setup reveals that the chosen route is infeasible on the current device, do not brute-force ahead.
609
- Either downgrade scope explicitly, switch route, or request a structured decision.
321
+ - if Hugging Face access is blocked, record the blocker and try an approved local mirror such as ModelScope, provided that does not change the comparison meaning
322
+ - if a quest already depends on `analysis_plan.md` or `REPRO_CHECKLIST.md`, keep the compatibility alias explicit rather than splitting truth across two active plans
610
323
 
611
- ### Phase 3. Execution
324
+ ## Phase 3. Execution
612
325
 
613
326
  Run only the work required to establish the baseline credibly.
614
327
 
@@ -617,88 +330,31 @@ Execution rules:
617
330
  - keep commands auditable
618
331
  - keep logs durable
619
332
  - avoid uncontrolled side experiments during baseline establishment
620
- - if a run is long, emit progress artifacts at meaningful checkpoints
621
- - if setup required code changes, checkpoint only explainable, minimal changes
622
-
623
- Execution should rely on existing explicit scripts or command paths where possible.
624
- Prefer the smallest runnable command that proves the baseline route.
625
- Do not build a new wrapper, registry, or result-export scaffold unless existing commands are missing, repeated reruns justify it, or later automation clearly needs it.
626
- If a wrapper or entry script is truly needed, it should support most of the following:
627
-
628
- - run mode for missing combinations
629
- - print-only mode that summarizes existing results without rerunning everything
630
- - result registry or skip logic so old baseline results are not re-executed unnecessarily
631
- - export of per-run results and a `latest/` snapshot
632
- - final Markdown and/or JSON summary output
633
- - cache and debug logs
634
- - environment checks when relevant
635
- - throttled structured progress markers for long loops
636
- - `--new-only` or equivalent incremental mode
637
- - `--rerun` or equivalent force-rerun mode when needed
638
- - scope flags such as minimal/full/custom when the analysis plan distinguishes them
639
- - speed flags such as parallelism, batch size, epochs, or steps when relevant
640
- - optional evaluation and postprocess steps when the repo separates them
641
-
642
- Prefer those efficiency levers only when they do not change the accepted baseline meaning, effective evaluation contract, or trust judgment.
643
-
644
- If adding this scaffolding would require large assumptions about missing scripts, stop and return to analysis rather than creating a misleading opaque wrapper.
645
-
646
- Recommended result structures to maintain:
647
-
648
- - per-combination result records
649
- - an aggregated `result.json`
650
- - a registry or JSONL index mapping each combination to its stored result
651
- - exported snapshots in both run-specific and `latest/` locations
652
- - run metadata capturing the environment and command context
653
-
654
- Recommended run metadata includes:
655
-
656
- - config snapshot
657
- - relevant Git or source snapshot identifiers
658
- - package/environment summary
659
- - machine summary such as GPU visibility when relevant
660
-
661
- If a result backup is useful for audit or recovery, create it explicitly rather than assuming the latest export is enough.
662
-
663
- Long-running execution rules:
664
-
665
- - before a substantial baseline reproduction, run a bounded smoke test first so command paths, output locations, and metric plumbing are validated cheaply
666
- - once the smoke test passes, launch the real baseline reproduction with `bash_exec(mode='detach', ...)` and normally leave `timeout_seconds` unset for the long run itself
667
- - `bash_exec(mode='read', id=...)` returns the full rendered log when it is 2000 lines or fewer; for longer logs it returns the first 500 lines plus the last 1500 lines and a hint to inspect omitted sections with `start` and `tail`
668
- - if a long saved log omits the middle section you need, use `bash_exec(mode='read', id=..., start=..., tail=...)` to inspect that forward rendered-line window
669
- - when monitoring that detached run, prefer `bash_exec(mode='read', id=..., tail_limit=..., order='desc')` so you inspect the newest log evidence first
670
- - after the first read, prefer incremental checks with `bash_exec(mode='read', id=..., after_seq=last_seen_seq, tail_limit=..., order='asc')` so you only inspect newly appended evidence
671
- - if you need to recover ids or confirm the newest session quickly, use `bash_exec(mode='history')` or `bash_exec(mode='list')` rather than guessing
672
- - include a structured `comment` on long-running bash sessions with fields such as `stage`, `goal`, `action`, `expected_signal`, and `next_check`
673
- - use `silent_seconds`, `progress_age_seconds`, `signal_age_seconds`, and `watchdog_overdue` from `bash_exec(mode='list'|'read', ...)` as the default staleness checks
674
- - when the reproduction code is under your control, prefer a throttled `tqdm` progress reporter and, when feasible, pair it with periodic `__DS_PROGRESS__` JSON lines carrying phase and ETA
675
- - if a command is expected to run for a long time, monitor it as a real background task rather than assuming success
676
- - do not write final summaries or accepted metrics until the command has actually completed
677
- - verify that the expected result files exist before treating the run as finished
678
- - if a task is invalid, wedged, or failed, stop it with `bash_exec(mode='kill', id=..., wait=true, timeout_seconds=...)`; if it must die immediately, add `force=true`, then diagnose the reason and either retry with a documented fix or record the failure durably
679
- - canonical sleep choice:
680
- - if you only need wall-clock waiting between checks, use `bash_exec(command='sleep N', mode='await', timeout_seconds=N+buffer, ...)`
681
- - keep a real buffer on that sleep timeout; do not set `timeout_seconds` exactly equal to `N`
682
- - if you are waiting on an already running managed session, prefer `bash_exec(mode='await', id=..., timeout_seconds=...)` instead of starting a new sleep command
683
-
684
- Recommended monitoring cadence for long-running work:
685
-
686
- - first check after about 60 seconds
687
- - second check after about 120 seconds
688
- - third check after about 300 seconds
689
- - fourth check after about 600 seconds
690
- - fifth check after about 1800 seconds
691
- - after that, keep checking about every 1800 seconds while the run is still active
692
-
693
- The exact mechanism should prefer `bash_exec(mode='await' | 'detach' | 'read' | 'list' | 'history' | 'kill', ...)`, with `read` usually using a tailed or incremental window during monitoring, but the behavioral rule stays the same:
694
- do not report completion until the run is actually done and the outputs are real.
695
- After each meaningful check, notify the user through `artifact.interact(kind='progress', ...)` with current status, latest evidence, and the next monitoring point.
696
- Do this after every completed wait cycle for important long-running work; do not skip several sleep windows without reporting.
697
- When structured progress markers are available, include `eta` and preferably `next_reply_at` or `next_check_at` so the UI can show the next expected update time.
698
-
699
- Do not silently widen scope from “baseline reproduction” into “new method exploration”.
700
-
701
- ### Phase 4. Verification
333
+ - checkpoint only explainable, minimal code changes
334
+ - prefer equivalence-preserving efficiency gains such as larger safe batch size, cache reuse, checkpoint resume, and parallel downloads or workers
335
+ - do not use an efficiency lever if it changes accepted baseline meaning, effective evaluation contract, or trust judgment
336
+
337
+ Long-running execution discipline:
338
+
339
+ - run one bounded smoke test before a substantial baseline reproduction
340
+ - once the smoke test passes, launch the real baseline reproduction with `bash_exec(mode='detach', ...)`
341
+ - monitor by forward progress instead of by short-window completion anxiety
342
+ - do not report final success until the command actually finished and the expected result files exist
343
+ - if you need to recover ids or inspect session state, use `bash_exec(mode='history')` or `bash_exec(mode='list')`
344
+ - `bash_exec(mode='read', id=...)` returns the full saved log when it is `2000 lines or fewer`; for longer logs, inspect omitted middle windows with `start` and `tail`
345
+ - during monitoring, prefer `bash_exec(mode='read', id=..., tail_limit=..., order='desc')`, and after the first read prefer incremental checks with `after_seq=last_seen_seq`
346
+ - use `silent_seconds`, `progress_age_seconds`, `signal_age_seconds`, and `watchdog_overdue` as the default staleness clues
347
+ - if a run is clearly invalid, wedged, or superseded, stop it with `bash_exec(mode='kill', id=..., wait=true, timeout_seconds=...)`, document why, and relaunch cleanly
348
+ - do not let more than the `30-minute visibility bound` pass without a real inspection and a `next expected update time`
349
+ - when the baseline code is under your control, prefer a throttled `tqdm` progress reporter and periodic `__DS_PROGRESS__` markers when feasible
350
+
351
+ Keep retries bounded:
352
+
353
+ - one smoke test is the default
354
+ - one autonomous fix-and-retry for the same failure class is the normal upper bound
355
+ - if the same failure class returns, stop looping
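One way to make the retry bound above mechanical is a per-failure-class counter. This is a sketch under assumptions: the class name `RetryBudget` and the failure-class labels are hypothetical, and the default of one retry mirrors the "normal upper bound" stated in the list.

```python
from collections import Counter


class RetryBudget:
    """Track autonomous fix-and-retry attempts per failure class."""

    def __init__(self, max_retries_per_class: int = 1) -> None:
        self.max_retries_per_class = max_retries_per_class
        self._retries: Counter = Counter()

    def should_retry(self, failure_class: str) -> bool:
        """Consume one retry and return True if the budget allows it."""
        if self._retries[failure_class] >= self.max_retries_per_class:
            # Same failure class returned after a fix: stop looping.
            return False
        self._retries[failure_class] += 1
        return True
```

A new failure class gets its own budget, so a first out-of-memory failure and a first missing-dataset failure can each be retried once.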
356
+
357
+ ## Phase 4. Verification
702
358
 
703
359
  Verification is mandatory before baseline acceptance.
704
360
 
@@ -707,32 +363,17 @@ Verify:
707
363
  - the run actually finished
708
364
  - the reported metrics came from the intended dataset and split
709
365
  - the metric definitions match the quest contract
710
- - the result is comparable to the paper, source repo, or selected baseline target
366
+ - the result is comparable to the paper, source repo, or selected target
711
367
  - any deviations are explicitly stated
712
368
 
713
- Classify the outcome:
369
+ Classify the outcome as one of:
714
370
 
715
371
  - `verified_match`
716
372
  - `verified_close`
717
373
  - `verified_diverged`
718
374
  - `broken`
719
375
 
720
- Verification must be evidence-first.
721
- Do not accept any of the following without explanation:
722
-
723
- - missing result files
724
- - metrics that cannot be traced to an actual run
725
- - metric definitions that do not match the quest contract
726
- - unexplained mismatch versus the intended paper or source repo setup
727
-
728
- Verification-phase interaction rules:
729
-
730
- - do not ask new questions during verification unless the stage has genuinely fallen back to analysis
731
- - if requirements, scope, or permissions changed materially, stop verification and return to the analysis phase explicitly
732
- - verification should summarize real progress milestones rather than quoting raw internal progress markers
733
- - structured progress markers are for runtime monitoring, not for final verification prose
734
-
735
- If the reproduced result differs materially from the source reference, verification should explicitly separate:
376
+ Verification must explicitly separate:
736
377
 
737
378
  - likely implementation mismatch
738
379
  - environment mismatch
@@ -740,43 +381,64 @@ If the reproduced result differs materially from the source reference, verificat
740
381
  - expected stochastic variance
741
382
  - unexplained divergence
742
383
 
743
- Verification should also answer:
384
+ Verification should answer:
744
385
 
745
386
  - whether the baseline is trustworthy enough for downstream comparison
746
387
  - whether the result is reusable beyond this quest
747
388
  - whether another repair or rerun is justified
748
- - whether the baseline line should stop here and hand off to another stage
389
+ - whether the line should stop here and hand off
390
+
391
+ A verification report should be self-contained enough that a later stage can answer:
392
+
393
+ - what was used
394
+ - how it was obtained: attach, import, reproduce, or repair
395
+ - what commands and configs were used
396
+ - what metrics are trusted
397
+ - what caveats remain
398
+ - whether the result is reusable beyond this quest
749
399
 
750
- Verification checklist before accepting results:
400
+ ## Baseline comparability contract
401
+
402
+ The baseline stage is not complete just because something ran.
403
+ It is complete when later stages can compare against it fairly.
751
404
 
752
- - logs show command completion rather than only task start
753
- - final result files exist
754
- - exported latest snapshot exists when the workflow expects it
755
- - metrics are non-empty, non-placeholder, and non-NaN
756
- - execution notes document the actual commands and outcomes
757
- - the baseline phase state is ready to hand off
758
- - the infrastructure needed for reproduction is actually present and usable
759
- - any closed-loop or key-metric steps expected by the plan were completed or their omission was explicitly documented
405
+ Before declaring a baseline usable, make the comparability contract explicit:
760
406
 
761
- If the workflow uses both result files and export files, they should agree or the mismatch must be explained.
407
+ - task identity
408
+ - dataset identity and version
409
+ - split contract
410
+ - preprocessing boundary
411
+ - evaluation script or evaluation path
412
+ - required metric keys
413
+ - metric directions
414
+ - seed policy when relevant
415
+ - source commit or source package identity
416
+ - known deviations from the source reference
762
417
 
763
- Verification should also test the reporting surface itself when the baseline workflow includes one.
764
- For example, if the baseline uses a main driver script with a print-only mode, verify that:
418
+ Unless the user explicitly specifies otherwise, treat the original paper's evaluation protocol as the canonical baseline contract.
419
+ If any of these fields are still materially unknown, do not pretend the baseline is a clean downstream reference.
420
+ For the fuller checklist and verdict meanings, read `references/comparability-contract.md`.
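As a rough illustration of the comparability contract listed above, the sketch below checks a contract dictionary for fields that are still unknown. The field names are hypothetical and only mirror the bullets; they are not the package's actual schema, and the seed policy is treated as optional because it applies only "when relevant".

```python
# Hypothetical field names mirroring the contract bullets; not the package's schema.
CONTRACT_FIELDS = (
    "task_identity",
    "dataset_identity_and_version",
    "split_contract",
    "preprocessing_boundary",
    "evaluation_path",
    "required_metric_keys",
    "metric_directions",
    "source_identity",
    "known_deviations",
)


def unknown_contract_fields(contract: dict) -> list[str]:
    """Return the contract fields that are missing, empty, or marked unknown."""
    missing = []
    for name in CONTRACT_FIELDS:
        value = contract.get(name)
        # An empty list is allowed: it can mean "no known deviations".
        if value is None or value == "" or value == "unknown":
            missing.append(name)
    return missing
```

If the returned list is non-empty, the baseline should not be presented as a clean downstream reference.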
765
421
 
766
- - summary mode runs successfully
767
- - exported Markdown and/or JSON summaries are actually generated
768
- - incremental flags such as `--new-only` behave as documented when they are part of the workflow
422
+ ## Feasibility and trust classes
769
423
 
770
- Then record:
424
+ Before acceptance, classify feasibility as one of:
771
425
 
772
- - trusted metrics
773
- - important caveats
774
- - exact paths for logs, configs, and outputs
775
- - whether the baseline is reusable and should be published
426
+ - `full_reproducible`
427
+ - `degraded_but_acceptable`
428
+ - `blocked`
429
+
430
+ And classify downstream trust as one of:
431
+
432
+ - `verified`
433
+ - `partially_verified`
434
+ - `operational_but_incomparable`
435
+ - `failed`
436
+
437
+ Do not silently upgrade a degraded or merely operational result into a normal trusted baseline.
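A minimal sketch of the two classifications above, using the class values exactly as listed. The helper `is_normal_trusted_baseline` is hypothetical, and its strict policy (only `full_reproducible` plus `verified` counts as a normal trusted baseline) is an assumption made for illustration, not a rule stated by the skill.

```python
from enum import Enum


class Feasibility(str, Enum):
    FULL_REPRODUCIBLE = "full_reproducible"
    DEGRADED_BUT_ACCEPTABLE = "degraded_but_acceptable"
    BLOCKED = "blocked"


class Trust(str, Enum):
    VERIFIED = "verified"
    PARTIALLY_VERIFIED = "partially_verified"
    OPERATIONAL_BUT_INCOMPARABLE = "operational_but_incomparable"
    FAILED = "failed"


def is_normal_trusted_baseline(feasibility: Feasibility, trust: Trust) -> bool:
    """Assumed policy: anything degraded or merely operational is never upgraded."""
    return feasibility is Feasibility.FULL_REPRODUCIBLE and trust is Trust.VERIFIED
```

Parsing a stored string with `Feasibility("blocked")` raises `ValueError` for any value outside the listed classes, which keeps unlisted labels from slipping through.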
776
438
 
777
439
  ## Minimum baseline artifact content
778
440
 
779
- The baseline artifact should clearly include at least:
441
+ The accepted baseline artifact should include at least:
780
442
 
781
443
  - `baseline_id`
782
444
  - `baseline_kind`
@@ -794,254 +456,31 @@ If variants exist, also include:
794
456
  - `default_variant_id`
795
457
  - `baseline_variants`
796
458
 
797
- Metric-contract rule:
459
+ Metric-contract rules:
798
460
 
799
- - unless the user explicitly specifies otherwise, treat the original paper's evaluation protocol as the canonical baseline contract
800
461
  - if the accepted baseline contract includes multiple metrics, datasets, subtasks, or splits, record all of them in `<baseline_root>/json/metric_contract.json`
801
- - keep `primary_metric` as the headline metric only; do not let it erase the rest of the accepted paper-facing comparison surface
802
- - when confirming a baseline, submit the canonical `metrics_summary` as a flat top-level dictionary keyed by the paper-facing metric ids; if the raw evaluator output is nested, use explicit `origin_path` fields in `metric_contract.metrics` to map the required canonical metrics
803
- - every canonical baseline metric entry should include `description`, either `derivation` or `origin_path`, and `source_ref` so later stages can audit where the number came from
462
+ - keep `primary_metric` as the headline metric only; do not let it erase the rest of the comparison surface
463
+ - when confirming a baseline, submit the canonical `metrics_summary` as a flat top-level dictionary keyed by the paper-facing metric ids
464
+ - every canonical baseline metric entry should include `description`, either `derivation` or `origin_path`, and `source_ref`
804
465
  - if the paper reports both aggregate and per-dataset or per-task results, preserve both whenever feasible through `metrics_summary` plus structured rows rather than one cherry-picked scalar
805
- - `Result/metric.md` is optional temporary scratch memory only; if it exists, reconcile the final baseline submission against it before calling `artifact.confirm_baseline(...)`, but do not treat it as a required file
806
-
807
- ## Durable note templates
808
-
809
- Use compact but structured notes so later stages do not need to reconstruct baseline state from chat history.
810
- The templates below are references, not prerequisites for the first smoke test.
811
- For simple baseline lines, keep them short and fill only the sections that matter.
812
-
813
- Canonical naming for new work:
814
-
815
- - `PLAN.md` -> use `references/baseline-plan-template.md`
816
- - `CHECKLIST.md` -> use `references/baseline-checklist-template.md`
817
- - `analysis_plan.md` and `REPRO_CHECKLIST.md` remain acceptable compatibility aliases when a quest already depends on them
818
-
819
- ### `PLAN.md` or `analysis_plan.md`
820
-
821
- Recommended shape:
822
-
823
- ```md
824
- # Baseline Analysis Plan
825
-
826
- - quest_id:
827
- - baseline_id:
828
- - requested_route: attach | import | reproduce | repair
829
- - recommended_route:
830
- - source_identity:
831
- - task:
832
- - dataset_and_split:
833
- - metric_contract:
834
- - expected_reference:
835
- - feasibility_summary:
836
-
837
- ## Existing evidence
838
- - published registry entries:
839
- - local baseline roots:
840
- - relevant repo paths:
841
-
842
- ## Planned commands
843
- - inspect:
844
- - setup:
845
- - run:
846
- - verify:
847
-
848
- ## Expected outputs
849
- - baseline_root:
850
- - metrics_path:
851
- - logs_path:
852
- - export_paths:
853
-
854
- ## Risks
855
- - risk:
856
-
857
- ## Gate to next phase
858
- - what must be true before setup starts
859
- ```
860
-
861
- ### `setup.md`
862
-
863
- Recommended shape:
864
-
865
- ```md
866
- # Baseline Setup
466
+ - `Result/metric.md` is optional temporary scratch memory only; reconcile against it before calling `artifact.confirm_baseline(...)`, but do not treat it as a required durable file
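The metric-contract rules above can be sketched as follows. This is an illustration under stated assumptions: the `metric_id` key and the dot-separated `origin_path` syntax are hypothetical, and only `description`, `derivation`, `origin_path`, `source_ref`, and `metrics_summary` come from the text.

```python
REQUIRED_ENTRY_KEYS = ("description", "source_ref")


def entry_problems(entry: dict) -> list[str]:
    """List what a canonical metric entry is missing for later auditing."""
    problems = [key for key in REQUIRED_ENTRY_KEYS if not entry.get(key)]
    if not (entry.get("derivation") or entry.get("origin_path")):
        problems.append("derivation_or_origin_path")
    return problems


def resolve_origin_path(raw: dict, origin_path: str):
    """Walk a nested evaluator output along an assumed dot-separated path."""
    node = raw
    for key in origin_path.split("."):
        node = node[key]
    return node


def build_metrics_summary(raw: dict, entries: list[dict]) -> dict:
    """Build a flat top-level dictionary keyed by the assumed `metric_id`."""
    summary = {}
    for entry in entries:
        # Entries that only carry a `derivation` need manual handling and are skipped.
        if entry.get("origin_path"):
            summary[entry["metric_id"]] = resolve_origin_path(raw, entry["origin_path"])
    return summary
```

Keeping the summary flat means later stages can look up a metric by its paper-facing id without knowing how the raw evaluator output was nested.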
467
+
468
+ ## Publication and reuse
867
469
 
868
- - baseline_id:
869
- - route:
870
- - working_directory:
871
- - source_origin:
872
- - source_commit:
873
- - environment_summary:
874
- - uv_strategy:
875
- - python_version:
876
- - config_paths:
877
- - command_template:
878
-
879
- ## Directory contract
880
- - baseline_root:
881
- - logs_root:
882
- - results_root:
883
- - exports_root:
884
-
885
- ## Known deviations
886
- - deviation:
887
-
888
- ## Ready-for-execution check
889
- - uv_route_recorded: yes/no
890
- - dependencies_known: yes/no
891
- - outputs_defined: yes/no
892
- - feasible_on_current_machine: yes/no
893
- ```
894
-
895
- ### `execution.md`
896
-
897
- Recommended shape:
898
-
899
- ```md
900
- # Baseline Execution
901
-
902
- - baseline_id:
903
- - route:
904
- - run_scope:
905
- - command_started:
906
- - started_at:
907
- - monitoring_plan:
908
-
909
- ## Runtime log pointers
910
- - stdout_or_main_log:
911
- - stderr_or_error_log:
912
- - result_index:
913
-
914
- ## Checkpoints
915
- - checkpoint:
916
-
917
- ## Final execution state
918
- - completed_at:
919
- - exit_status:
920
- - produced_outputs:
921
- - reruns_or_repairs:
922
- ```
923
-
924
- ### `verification.md`
925
-
926
- Recommended shape:
927
-
928
- ```md
929
- # Baseline Verification
930
-
931
- - baseline_id:
932
- - route:
933
- - verification_outcome: verified_match | verified_close | verified_diverged | broken
934
- - trusted_for_downstream: yes/no
935
- - reusable_beyond_quest: yes/no
936
- - publish_recommended: yes/no
937
-
938
- ## Trusted metrics
939
- - metric:
940
-
941
- ## Reference comparison
942
- - expected_reference:
943
- - observed_result:
944
- - delta_or_gap:
945
-
946
- ## Evidence paths
947
- - final_metrics:
948
- - logs:
949
- - exports:
950
- - config_snapshot:
951
-
952
- ## Caveats
953
- - caveat:
954
-
955
- ## Next recommendation
956
- - next_anchor:
957
- - next_action:
958
- ```
959
-
960
- These notes do not need to be verbose.
961
- They do need to be complete enough that another stage can read them without replaying the full baseline process.
962
-
963
- ## Artifact payload templates
964
-
965
- When writing artifacts, prefer a stable field shape.
966
-
967
- ### Route or blocked decision artifact template
968
-
- ```json
- {
-   "kind": "decision",
-   "verdict": "neutral",
-   "action": "attach_baseline",
-   "reason": "A published baseline already matches the quest task and metric contract.",
-   "baseline_id": "baseline-demo",
-   "baseline_variant_id": "main",
-   "evidence_paths": [
-     "<quest_root>/artifacts/reports/report-....json"
-   ],
-   "next_direction": "Attach the baseline and move to verification or idea selection."
- }
- ```
983
-
984
- If blocked, keep the same structure but use a blocked-appropriate action and reason.
985
-
986
- ### Accepted baseline artifact template
987
-
- ```json
- {
-   "kind": "baseline",
-   "publish_global": true,
-   "baseline_id": "baseline-demo",
-   "name": "Demo baseline",
-   "baseline_kind": "reproduced",
-   "task": "image-classification",
-   "dataset": "CIFAR-10/test",
-   "primary_metric": {
-     "name": "accuracy",
-     "value": 0.943
-   },
-   "metrics_summary": {
-     "accuracy": 0.943
-   },
-   "default_variant_id": "main",
-   "baseline_variants": [
-     {
-       "variant_id": "main",
-       "label": "Main",
-       "metrics_summary": {
-         "accuracy": 0.943
-       }
-     }
-   ],
-   "environment": {
-     "python": "3.11"
-   },
-   "summary": "Verified reproduced baseline accepted for downstream comparison.",
-   "path": "<quest_root>/baselines/local/baseline-demo",
-   "source": {
-     "kind": "artifact_publish",
-     "quest_id": "<quest_id>",
-     "quest_root": "<quest_root>"
-   }
- }
- ```
1026
-
1027
- Only set `publish_global: true` when verification is complete and reuse is justified.
1028
-
1029
- ## Registry publication and attachment contract
1030
-
1031
- The baseline skill should use the durable registry deliberately, not as an afterthought.
470
+ Use the registry deliberately, not as an afterthought.
1032
471
 
1033
472
  If the result is reusable beyond the current quest:
1034
473
 
1035
474
  - publish it through `artifact.publish_baseline(...)`
1036
- - ensure the payload includes the baseline identity, provenance, trusted metrics, and any variant structure
1037
- - prefer `publish_global: true` only when the baseline is actually reusable and verification is complete
475
+ - ensure the payload includes identity, provenance, trusted metrics, and any variant structure
476
+ - set `publish_global: true` only when verification is complete and reuse is justified
1038
477
 
1039
478
  If the current quest should reuse an existing baseline:
1040
479
 
1041
480
  - attach it through `artifact.attach_baseline(...)`
1042
481
  - preserve the selected `baseline_id`
1043
482
  - preserve the selected `variant_id` when one is used
1044
- - ensure the resulting attachment record is durable under `baselines/imported/`
483
+ - keep the attachment durable under `baselines/imported/`
1045
484
 
1046
485
  If runtime state already includes `requested_baseline_ref` or a matching `confirmed_baseline_ref`:
1047
486
 
@@ -1049,221 +488,55 @@ If runtime state already includes `requested_baseline_ref` or a matching `confir
1049
488
  - treat a creation-time pre-bound baseline as the active starting point unless you find a concrete incompatibility
1050
489
  - do not rerun broad baseline scouting or full reproduction just because the stage name is `baseline`
1051
490
 
1052
- Do not publish a baseline that is still blocked, speculative, or verification-incomplete.
1053
- Do not attach a baseline without explaining why it is the right reference for the quest.
1054
-
1055
- ## Verification report expectations
1056
-
1057
- A baseline verification report should answer:
1058
-
1059
- - what baseline was used
1060
- - how it was obtained: attach, import, reproduce, or repair
1061
- - what commands and configs were used
1062
- - what metrics are trusted
1063
- - how the result compares with the expected reference
1064
- - what caveats remain
1065
-
1066
- The report should also include:
1067
-
1068
- - whether the run should be trusted for downstream comparison
1069
- - whether the baseline is reusable beyond this quest
1070
- - whether another repair or rerun is justified
1071
-
1072
- The verification report should be strong enough that later `idea`, `experiment`, and `write` stages can cite the baseline setup without reconstructing it from scratch.
1073
-
1074
- It should ideally also function as a self-contained reproduction note describing:
1075
-
1076
- - baseline identity
1077
- - source provenance
1078
- - key commands
1079
- - environment assumptions
1080
- - result locations
1081
- - trusted interpretation of the outcome
1082
-
1083
- If the baseline line is meant to be reused later, the final report should be self-contained enough that another stage can answer:
1084
-
1085
- - what to run
1086
- - where to run it
1087
- - what outputs should appear
1088
- - how to interpret those outputs
1089
-
1090
- without reopening the whole reproduction process from scratch.
1091
-
1092
- When useful, generate a single merged reproduction report that includes:
1093
-
1094
- - structure overview
1095
- - modification summary
1096
- - testing commands
1097
- - device and environment summary
1098
- - baseline status and blockers
1099
- - redacted configuration inventory
1100
- - key implementation measures
1101
- - core method equations or mathematical notes when they matter for later understanding
1102
- - results table
1103
- - export paths
1104
-
1105
- For a reusable baseline package checklist, read `references/publishable-baseline-package.md`.
491
+ For a clearer attach/import/reproduce/repair rubric, read `references/route-selection.md`.
492
+ For reusable-package expectations, read `references/publishable-baseline-package.md`.
1106
493
 
1107
- ## Branch and worktree rules
494
+ ## Workspace and branch rules
1108
495
 
1109
- - Use the quest branch unless isolation is genuinely needed.
1110
- - If baseline setup is risky or intrusive, prepare an isolated branch or worktree first.
1111
- - Do not proliferate branches without a reason.
1112
- - If a branch was used, record why the baseline needed isolation.
1113
-
1114
- The baseline stage should not build a parallel Git lifecycle of its own.
1115
- Branching and promotion remain quest-level concerns.
1116
-
1117
- However, if baseline setup materially changed code or scripts, preserve at least:
1118
-
1119
- - an initial snapshot of the baseline workspace state
1120
- - a final snapshot after setup/execution changes
1121
-
1122
- so the quest can later audit what changed during reproduction.
1123
-
1124
- If the workflow uses a baseline-local Git snapshot for audit, treat it as an execution snapshot only.
1125
- The quest repo remains the durable authority for promotion and narrative state.
496
+ - treat the baseline workspace as a system-managed reproduction surface, not an unrelated sandbox
497
+ - avoid creating a nested authoritative Git lifecycle inside the baseline workspace
498
+ - use the quest branch unless isolation is genuinely needed
499
+ - if baseline setup is risky or intrusive, prepare an isolated branch or worktree first and record why
500
+ - do not proliferate branches without a reason
1126
501
 
1127
502
  ## Memory rules
1128
503
 
1129
504
  Stage-start requirement:
1130
505
 
1131
- - begin every baseline pass with `memory.list_recent(scope='quest', limit=5)`
506
+ - by default, begin every baseline pass with `memory.list_recent(scope='quest', limit=5)`
1132
507
  - then run at least one baseline-relevant `memory.search(...)` before new baseline analysis, repair, or rerun work
1133
- - if several baseline or idea lines exist, narrow retrieval to the active baseline route instead of mixing notes from unrelated lines
508
+ - fast-path exception: if the quest already exposes a clear `requested_baseline_ref` or `confirmed_baseline_ref` and the immediate task is only to validate or reattach that concrete baseline, you may skip broad retrieval
1134
509
 
1135
- Write to memory only when the lesson is reusable, such as:
510
+ Write memory only for reusable lessons such as:
1136
511
 
1137
- - baseline pitfalls
1138
- - environment gotchas
1139
- - dataset quirks
1140
512
  - paper-to-code mismatch notes
1141
-
1142
- Do not use memory as a substitute for the baseline artifact itself.
1143
-
1144
- Preferred memory usage:
1145
-
1146
- - quest `papers`:
1147
- - paper-to-code mismatch notes
1148
- - baseline paper caveats
1149
- - quest `decisions`:
1150
- - attach / import / reproduce / repair rationale
1151
- - accepted-versus-rejected baseline route choices
1152
- - quest `episodes`:
1153
- - setup failures
1154
- - execution failures
1155
- - environment incidents
1156
- - suspicious or divergent baseline runs
1157
- - quest `knowledge`:
1158
- - verified metric contract
1159
- - stable setup rules
1160
- - data and evaluation caveats
1161
- - reproducibility lessons that matter later in this quest
1162
- - global `knowledge`:
1163
- - reusable reproduction heuristics
1164
- - stable verification heuristics
1165
- - cross-quest baseline debugging lessons
1166
- - global `templates`:
1167
- - setup checklist templates
1168
- - verification checklist templates
1169
- - publishable baseline package templates
1170
-
1171
- Useful tags include:
1172
-
1173
- - `stage:baseline`
1174
- - `baseline:<baseline_id>`
1175
- - `type:repro-lesson`
1176
- - `type:verification-caveat`
1177
- - `type:environment-incident`
1178
- - `topic:<dataset-or-method>`
513
+ - environment incidents
514
+ - dataset quirks
515
+ - verification caveats
516
+ - attach vs import vs reproduce vs repair rationale
1179
517
 
1180
518
  When calling `memory.write(...)`, pass `tags` as an array like `["stage:baseline", "baseline:<baseline_id>", "type:repro-lesson"]`, not as one comma-joined string.
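To make the array-versus-string distinction concrete, here is a small sketch of a normalizer that turns a comma-joined string into the expected array form. The helper name `normalize_tags` is hypothetical and is not part of the package; it only illustrates the shape that `tags` should have.

```python
def normalize_tags(tags) -> list[str]:
    """Return tags as a list of strings, splitting a comma-joined string if needed."""
    if isinstance(tags, str):
        # "a, b,c" becomes ["a", "b", "c"]; empty pieces are dropped.
        return [part.strip() for part in tags.split(",") if part.strip()]
    return [str(tag) for tag in tags]
```

For example, `normalize_tags("stage:baseline, type:repro-lesson")` yields a two-element list rather than one combined string.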
1181
519
 
1182
- Recommended read timing:
1183
-
1184
- - before route selection:
1185
- - consult quest `decisions`, `knowledge`, and relevant `papers`
1186
- - before reruns or repairs:
1187
- - search quest `episodes` first
1188
- - before acceptance:
1189
- - re-check quest `knowledge` and `decisions`
1190
- - before publishing globally:
1191
- - confirm the lesson is truly reusable and not only quest-local
1192
-
1193
520
  Stage-end requirement:
1194
521
 
1195
522
  - if baseline work produced a durable reproduction lesson, verification caveat, environment incident, or route rationale, write at least one `memory.write(...)` before leaving the stage
1196
523
 
1197
- For a fuller memory strategy, read `references/memory-playbook.md`.
1198
-
1199
524
  ## Artifact rules
1200
525
 
1201
526
  Typical artifact sequence:
1202
527
 
1203
- - progress artifact for long-running setup or execution
1204
- - report artifact for analysis or verification notes
1205
- - baseline artifact for the accepted result
1206
- - decision artifact when choosing attach/import/reproduce/repair or when deciding the next anchor
1207
-
1208
- If a reusable baseline was established, prefer recording it in a form that later stages can attach or reuse directly instead of forcing redundant reproduction.
1209
-
1210
- Use `artifact.attach_baseline(...)` or `artifact.publish_baseline(...)` when appropriate.
528
+ - `progress` for long-running setup or execution checkpoints
529
+ - `report` for analysis notes or verification notes
530
+ - `decision` for route choice, blocked routing, or accept/reject/rerun/repair calls
531
+ - `baseline` only for an accepted baseline record
1211
532
 
1212
- Preferred artifact choices:
533
+ For stable field shapes, read `references/artifact-payload-examples.md`.
1213
534
 
1214
- - use `decision` for:
1215
- - route selection
1216
- - blocked-state routing
1217
- - accept / reject / rerun / repair choices
1218
- - use `report` for:
1219
- - analysis notes
1220
- - verification reports
1221
- - merged reproduction reports
1222
- - comparability-contract summaries
1223
- - use `progress` during long-running setup or execution
1224
- - use `baseline` only for an accepted baseline record
1225
- - use `approval` only if an explicit user approval was needed for a costly or degraded baseline gate
1226
-
1227
- ## Handoff contract
1228
-
1229
- Before handing the quest to `idea`, `experiment`, or `write`, the baseline stage should make the next stage's life easy.
1230
-
1231
- At minimum, downstream stages should be able to answer all of the following without reopening the full reproduction investigation:
1232
-
1233
- - which baseline is active
1234
- - which route produced it: attach, import, reproduce, or repair
1235
- - which metrics are trusted
1236
- - where the baseline outputs and logs live
1237
- - what caveats or deviations still matter
1238
- - whether the baseline is quest-local only or globally reusable
1239
-
1240
- ## Publication and reuse rules
1241
-
1242
- Publish or attach baselines deliberately.
1243
-
1244
- - attach when a trusted reusable baseline already exists and is the right reference for this quest
1245
- - publish when this quest produced a verified reusable baseline that later quests should be able to reuse
1246
- - do not publish a blocked, speculative, or verification-incomplete baseline
1247
- - do not attach a baseline without explaining why it is the correct downstream reference
1248
-
1249
- If a baseline is accepted but not globally reusable, say that explicitly instead of leaving the reuse status ambiguous.
1250
-
1251
- The baseline stage should normally hand off with:
1252
-
1253
- - one accepted baseline artifact
1254
- - one verification-oriented report artifact
1255
- - one active baseline reference through attachment or accepted local baseline state
1256
- - one concise next-step guidance statement or decision artifact when the next anchor is not obvious
1257
-
1258
- ## Final handoff packet
1259
-
1260
- Before leaving the baseline stage, make sure the next stage can read a compact handoff packet from durable state.
1261
-
1262
- The handoff packet should make these items obvious:
535
+ The baseline handoff should make these items obvious:
1263
536
 
1264
537
  - `baseline_id`
1265
538
  - `baseline_variant_id` when relevant
1266
- - route used: attach/import/reproduce/repair
539
+ - route used: attach, import, reproduce, or repair
1267
540
  - trusted metrics
1268
541
  - canonical metric contract JSON path
1269
542
  - verification outcome
@@ -1272,38 +545,36 @@ The handoff packet should make these items obvious:
1272
545
  - main caveats
1273
546
  - recommended next anchor
1274
547
 
1275
- If this packet is not obvious from the artifact plus verification note, the baseline stage is not yet stable enough.
548
+ If this packet is not obvious from the accepted artifact plus verification note, the baseline line is not stable enough yet.
1276
549
 
1277
550
  ## Failure and blocked handling
1278
551
 
1279
- Do not hide baseline failures.
552
+ Do not hide failures.
1280
553
 
1281
- If blocked, record exactly which class applies:
554
+ If blocked, record the class explicitly:
1282
555
 
1283
- - missing_source
1284
- - missing_code
1285
- - missing_metric_contract
1286
- - environment_infeasible
1287
- - command_unknown
1288
- - run_failed
1289
- - verification_failed
556
+ - `missing_source`
557
+ - `missing_code`
558
+ - `missing_metric_contract`
559
+ - `environment_infeasible`
560
+ - `command_unknown`
561
+ - `run_failed`
562
+ - `verification_failed`
1290
563
 
1291
- A blocked baseline result must state:
564
+ A blocked result must state:
1292
565
 
1293
566
  - what failed
1294
567
  - what was tried
1295
- - which paths/logs show the issue
1296
- - whether the next best move is attach/import/retry/reset/ask user
1297
-
1298
- If the failure happened after a long-running task, include the monitored command/log path rather than only a prose description.
568
+ - which paths or logs show the issue
569
+ - whether the next best move is attach, import, retry, repair, reset, or ask the user
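The requirements for a blocked result listed above can be sketched as a small record with a self-check. The class names come from the list in this section; the `BlockedResult` type, its field names, and the `ask_user` spelling are assumptions for illustration, not the package's artifact schema.

```python
from dataclasses import dataclass

BLOCKED_CLASSES = {
    "missing_source",
    "missing_code",
    "missing_metric_contract",
    "environment_infeasible",
    "command_unknown",
    "run_failed",
    "verification_failed",
}
NEXT_MOVES = {"attach", "import", "retry", "repair", "reset", "ask_user"}


@dataclass
class BlockedResult:
    blocked_class: str
    what_failed: str
    what_was_tried: list[str]
    evidence_paths: list[str]  # paths or logs that show the issue
    next_move: str

    def problems(self) -> list[str]:
        """Return reasons this record does not yet satisfy the blocked-result rules."""
        found = []
        if self.blocked_class not in BLOCKED_CLASSES:
            found.append(f"unknown blocked class: {self.blocked_class}")
        if self.next_move not in NEXT_MOVES:
            found.append(f"unknown next move: {self.next_move}")
        if not self.evidence_paths:
            found.append("no paths or logs recorded")
        return found
```

An empty `problems()` list only says the record is structurally complete; whether the chosen next move is sensible remains a judgment call.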
1299
570
 
1300
- Common autonomous fixes before falling back:
571
+ Reasonable autonomous fixes before escalation:
1301
572
 
1302
573
  - missing module or dependency
1303
574
  - wrong dataset path
1304
575
  - permission errors on scripts
1305
576
  - reasonable batch-size reductions for OOM
1306
- - obvious environment activation issues
577
+ - obvious environment activation mistakes
1307
578
 
1308
579
  If a fix would change confirmed scope, metrics, permissions, or resource assumptions, stop and return to analysis rather than applying it silently.
1309
580