@researai/deepscientist 1.5.13 → 1.5.15
- package/README.md +8 -0
- package/assets/branding/logo-raster.png +0 -0
- package/bin/ds.js +134 -49
- package/docs/en/00_QUICK_START.md +2 -2
- package/docs/en/01_SETTINGS_REFERENCE.md +20 -4
- package/docs/en/03_QQ_CONNECTOR_GUIDE.md +19 -0
- package/docs/en/05_TUI_GUIDE.md +466 -96
- package/docs/en/10_WEIXIN_CONNECTOR_GUIDE.md +20 -0
- package/docs/en/14_PROMPT_SKILLS_AND_MCP_GUIDE.md +2 -0
- package/docs/en/16_TELEGRAM_CONNECTOR_GUIDE.md +134 -0
- package/docs/en/17_WHATSAPP_CONNECTOR_GUIDE.md +126 -0
- package/docs/en/18_FEISHU_CONNECTOR_GUIDE.md +136 -0
- package/docs/en/README.md +8 -0
- package/docs/zh/00_QUICK_START.md +2 -2
- package/docs/zh/01_SETTINGS_REFERENCE.md +20 -4
- package/docs/zh/03_QQ_CONNECTOR_GUIDE.md +19 -0
- package/docs/zh/05_TUI_GUIDE.md +465 -82
- package/docs/zh/10_WEIXIN_CONNECTOR_GUIDE.md +20 -0
- package/docs/zh/14_PROMPT_SKILLS_AND_MCP_GUIDE.md +2 -0
- package/docs/zh/16_TELEGRAM_CONNECTOR_GUIDE.md +134 -0
- package/docs/zh/17_WHATSAPP_CONNECTOR_GUIDE.md +126 -0
- package/docs/zh/18_FEISHU_CONNECTOR_GUIDE.md +136 -0
- package/docs/zh/README.md +8 -0
- package/install.sh +2 -0
- package/package.json +1 -1
- package/pyproject.toml +1 -1
- package/src/deepscientist/__init__.py +1 -1
- package/src/deepscientist/artifact/charts.py +567 -0
- package/src/deepscientist/artifact/guidance.py +50 -10
- package/src/deepscientist/artifact/metrics.py +228 -5
- package/src/deepscientist/artifact/schemas.py +3 -0
- package/src/deepscientist/artifact/service.py +4004 -538
- package/src/deepscientist/bash_exec/models.py +23 -0
- package/src/deepscientist/bash_exec/monitor.py +147 -67
- package/src/deepscientist/bash_exec/runtime.py +218 -156
- package/src/deepscientist/bash_exec/service.py +79 -64
- package/src/deepscientist/bash_exec/shells.py +87 -0
- package/src/deepscientist/bridges/connectors.py +51 -2
- package/src/deepscientist/config/models.py +6 -3
- package/src/deepscientist/config/service.py +7 -2
- package/src/deepscientist/connector/lingzhu_support.py +23 -4
- package/src/deepscientist/connector/weixin_support.py +122 -1
- package/src/deepscientist/daemon/api/handlers.py +75 -4
- package/src/deepscientist/daemon/api/router.py +1 -0
- package/src/deepscientist/daemon/app.py +869 -236
- package/src/deepscientist/doctor.py +51 -0
- package/src/deepscientist/file_lock.py +48 -0
- package/src/deepscientist/gitops/diff.py +167 -1
- package/src/deepscientist/mcp/server.py +331 -21
- package/src/deepscientist/process_control.py +161 -0
- package/src/deepscientist/prompts/builder.py +275 -491
- package/src/deepscientist/quest/service.py +2336 -145
- package/src/deepscientist/quest/stage_views.py +305 -29
- package/src/deepscientist/runners/base.py +2 -0
- package/src/deepscientist/runners/codex.py +88 -5
- package/src/deepscientist/runners/runtime_overrides.py +17 -1
- package/src/deepscientist/shared.py +6 -1
- package/src/prompts/contracts/shared_interaction.md +13 -4
- package/src/prompts/system.md +984 -1985
- package/src/skills/analysis-campaign/SKILL.md +31 -2
- package/src/skills/analysis-campaign/references/artifact-orchestration.md +1 -1
- package/src/skills/analysis-campaign/references/writing-facing-slice-examples.md +65 -0
- package/src/skills/baseline/SKILL.md +267 -994
- package/src/skills/baseline/references/baseline-checklist-template.md +21 -32
- package/src/skills/baseline/references/baseline-plan-template.md +41 -57
- package/src/skills/decision/SKILL.md +19 -2
- package/src/skills/experiment/SKILL.md +8 -2
- package/src/skills/finalize/SKILL.md +18 -0
- package/src/skills/idea/SKILL.md +78 -0
- package/src/skills/idea/references/idea-generation-playbook.md +100 -0
- package/src/skills/idea/references/outline-seeding-example.md +60 -0
- package/src/skills/intake-audit/SKILL.md +1 -1
- package/src/skills/optimize/SKILL.md +1644 -0
- package/src/skills/rebuttal/SKILL.md +2 -1
- package/src/skills/review/SKILL.md +2 -1
- package/src/skills/write/SKILL.md +80 -12
- package/src/skills/write/references/outline-evidence-contract-example.md +107 -0
- package/src/tui/dist/app/AppContainer.js +1445 -52
- package/src/tui/dist/components/Composer.js +1 -1
- package/src/tui/dist/components/ConfigScreen.js +190 -36
- package/src/tui/dist/components/GradientStatusText.js +1 -20
- package/src/tui/dist/components/InputPrompt.js +41 -32
- package/src/tui/dist/components/LoadingIndicator.js +1 -1
- package/src/tui/dist/components/Logo.js +61 -38
- package/src/tui/dist/components/MainContent.js +10 -3
- package/src/tui/dist/components/WelcomePanel.js +4 -12
- package/src/tui/dist/components/messages/AssistantMessage.js +1 -1
- package/src/tui/dist/components/messages/BashExecOperationMessage.js +3 -3
- package/src/tui/dist/components/messages/OperationMessage.js +1 -1
- package/src/tui/dist/index.js +28 -1
- package/src/tui/dist/layouts/DefaultAppLayout.js +3 -3
- package/src/tui/dist/lib/api.js +17 -0
- package/src/tui/dist/lib/connectors.js +261 -0
- package/src/tui/dist/semantic-colors.js +29 -19
- package/src/tui/package.json +1 -1
- package/src/ui/dist/assets/{AiManusChatView-CnJcXynW.js → AiManusChatView-DDjbFnbt.js} +12 -12
- package/src/ui/dist/assets/{AnalysisPlugin-DeyzPEhV.js → AnalysisPlugin-Yb5IdmaU.js} +1 -1
- package/src/ui/dist/assets/CliPlugin-e64sreyu.js +31037 -0
- package/src/ui/dist/assets/{CodeEditorPlugin-B-xicq1e.js → CodeEditorPlugin-C4D2TIkU.js} +8 -8
- package/src/ui/dist/assets/{CodeViewerPlugin-DT54ysXa.js → CodeViewerPlugin-BVoNZIvC.js} +5 -5
- package/src/ui/dist/assets/{DocViewerPlugin-DQtKT-VD.js → DocViewerPlugin-CLChbllo.js} +3 -3
- package/src/ui/dist/assets/{GitDiffViewerPlugin-hqHbCfnv.js → GitDiffViewerPlugin-C4xeFyFQ.js} +20 -20
- package/src/ui/dist/assets/{ImageViewerPlugin-OcVo33jV.js → ImageViewerPlugin-OiMUAcLi.js} +5 -5
- package/src/ui/dist/assets/{LabCopilotPanel-DdGwhEUV.js → LabCopilotPanel-BjD2ThQF.js} +11 -11
- package/src/ui/dist/assets/{LabPlugin-Ciz1gDaX.js → LabPlugin-DQPg-NrB.js} +2 -2
- package/src/ui/dist/assets/{LatexPlugin-BhmjNQRC.js → LatexPlugin-CI05XAV9.js} +7 -7
- package/src/ui/dist/assets/{MarkdownViewerPlugin-BzdVH9Bx.js → MarkdownViewerPlugin-DpeBLYZf.js} +4 -4
- package/src/ui/dist/assets/{MarketplacePlugin-DmyHspXt.js → MarketplacePlugin-DolE58Q2.js} +3 -3
- package/src/ui/dist/assets/{NotebookEditor-BTVYRGkm.js → NotebookEditor-7Qm2rSWD.js} +11 -11
- package/src/ui/dist/assets/{NotebookEditor-BMXKrDRk.js → NotebookEditor-C1kWaxKi.js} +1 -1
- package/src/ui/dist/assets/{PdfLoader-CvcjJHXv.js → PdfLoader-BfOHw8Zw.js} +1 -1
- package/src/ui/dist/assets/{PdfMarkdownPlugin-DW2ej8Vk.js → PdfMarkdownPlugin-BulDREv1.js} +2 -2
- package/src/ui/dist/assets/{PdfViewerPlugin-CmlDxbhU.js → PdfViewerPlugin-C-daaOaL.js} +10 -10
- package/src/ui/dist/assets/{SearchPlugin-DAjQZPSv.js → SearchPlugin-CjpaiJ3A.js} +1 -1
- package/src/ui/dist/assets/{TextViewerPlugin-C-nVAZb_.js → TextViewerPlugin-BxIyqPQC.js} +5 -5
- package/src/ui/dist/assets/{VNCViewer-D7-dIYon.js → VNCViewer-HAg9mF7M.js} +10 -10
- package/src/ui/dist/assets/{bot-C_G4WtNI.js → bot-0DYntytV.js} +1 -1
- package/src/ui/dist/assets/{code-Cd7WfiWq.js → code-B20Slj_w.js} +1 -1
- package/src/ui/dist/assets/{file-content-B57zsL9y.js → file-content-DT24KFma.js} +1 -1
- package/src/ui/dist/assets/{file-diff-panel-DVoheLFq.js → file-diff-panel-DK13YPql.js} +1 -1
- package/src/ui/dist/assets/{file-socket-B5kXFxZP.js → file-socket-B4T2o4nR.js} +1 -1
- package/src/ui/dist/assets/{image-LLOjkMHF.js → image-DSeR_sDS.js} +1 -1
- package/src/ui/dist/assets/{index-hOUOWbW2.js → index-BrFje2Uk.js} +2 -2
- package/src/ui/dist/assets/{index-Dxa2eYMY.js → index-BwRJaoTl.js} +1 -1
- package/src/ui/dist/assets/{index-CLQauncb.js → index-D_E4281X.js} +5418 -28620
- package/src/ui/dist/assets/{index-C3r2iGrp.js → index-DnYB3xb1.js} +12 -12
- package/src/ui/dist/assets/{index-BQG-1s2o.css → index-G7AcWcMu.css} +43 -2
- package/src/ui/dist/assets/{monaco-BGGAEii3.js → monaco-LExaAN3Y.js} +1 -1
- package/src/ui/dist/assets/{pdf-effect-queue-DlEr1_y5.js → pdf-effect-queue-BJk5okWJ.js} +1 -1
- package/src/ui/dist/assets/{popover-CWJbJuYY.js → popover-D3Gg_FoV.js} +1 -1
- package/src/ui/dist/assets/{project-sync-CRJiucYO.js → project-sync-C_ygLlVU.js} +1 -1
- package/src/ui/dist/assets/{select-CoHB7pvH.js → select-CpAK6uWm.js} +2 -2
- package/src/ui/dist/assets/{sigma-D5aJWR8J.js → sigma-DEccaSgk.js} +1 -1
- package/src/ui/dist/assets/{square-check-big-DUK_mnkS.js → square-check-big-uUfyVsbD.js} +1 -1
- package/src/ui/dist/assets/{trash-ChU3SEE3.js → trash-CXvwwSe8.js} +1 -1
- package/src/ui/dist/assets/{useCliAccess-BrJBV3tY.js → useCliAccess-Bnop4mgR.js} +1 -1
- package/src/ui/dist/assets/{useFileDiffOverlay-C2OQaVWc.js → useFileDiffOverlay-B8eUAX0I.js} +1 -1
- package/src/ui/dist/assets/{wrap-text-C7Qqh-om.js → wrap-text-9vbOBpkW.js} +1 -1
- package/src/ui/dist/assets/{zoom-out-rtX0FKya.js → zoom-out-BgVMmOW4.js} +1 -1
- package/src/ui/dist/index.html +2 -2
- package/uv.lock +1 -1
- package/src/ui/dist/assets/CliPlugin-CB1YODQn.js +0 -5905
|
@@ -6,112 +6,83 @@ description: Use when a quest needs to attach, import, reproduce, repair, verify
|
|
|
6
6
|
# Baseline
|
|
7
7
|
|
|
8
8
|
This skill establishes the reference system the quest will compare against.
|
|
9
|
-
|
|
9
|
+
The target is one trustworthy baseline line, not an endless reproduction diary.
|
|
10
10
|
|
|
11
11
|
## Interaction discipline
|
|
12
12
|
|
|
13
13
|
- Follow the shared interaction contract injected by the system prompt.
|
|
14
|
-
-
|
|
15
|
-
-
|
|
16
|
-
-
|
|
17
|
-
-
|
|
18
|
-
- Prefer `bash_exec` for setup, reproduction, and verification commands so each baseline action keeps a durable quest-local session id and log trail.
|
|
19
|
-
- When the baseline route is durably chosen, confirmed, waived, or blocked with a clear next action, send one richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` update that says whether the baseline is trusted, blocked, or waived, why that matters, and what the next stage is.
|
|
14
|
+
- Keep ordinary setup and debugging updates concise.
|
|
15
|
+
- Use richer milestone updates only when the baseline becomes trusted, caveated, blocked, waived, or route-changing.
|
|
16
|
+
- Hard execution rule: every terminal command in this stage must go through `bash_exec`; do not use any other terminal path for setup, reproduction, monitoring, verification, Git, Python, package-manager, or file-inspection commands.
|
|
17
|
+
- Prefer `bash_exec` for setup, reproduction, monitoring, and verification commands so the baseline line stays durable and auditable.
|
|
20
18
|
|
|
21
19
|
## Non-negotiable rules
|
|
22
20
|
|
|
23
|
-
- no
|
|
24
|
-
- do not skip baseline steps or silently simplify the
|
|
21
|
+
- no fabricated metrics, logs, run status, or success claims
|
|
22
|
+
- do not skip baseline steps or silently simplify the route when that would change trust or comparability
|
|
25
23
|
- do not claim a baseline is ready before verification is complete
|
|
26
|
-
- do not infer missing commands, scripts, or parameters when the uncertainty
|
|
24
|
+
- do not infer missing commands, scripts, or parameters when the uncertainty could change the result
|
|
27
25
|
- any unavoidable guess must be written down explicitly with expected impact
|
|
28
|
-
- for Python baselines, standardize environment setup with `uv`; do not default to ad-hoc `pip install ...`, a fresh `conda create ...`, or global package mutation when `uv` can provide the same environment reproducibly
|
|
29
26
|
- use web search for discovering papers or repos, but use `artifact.arxiv(paper_id=..., full_text=False)` for actually reading a source arXiv paper when it exists
|
|
30
|
-
- set `full_text=True` only when the
|
|
31
|
-
|
|
32
|
-
## Language and interaction rules
|
|
33
|
-
|
|
34
|
-
- match the user's language in all visible outputs
|
|
35
|
-
- keep updates concise but concrete
|
|
36
|
-
- if a structured user decision is required, ask only for decisions that the system cannot safely derive locally
|
|
37
|
-
- do not ask speculative or premature questions when local analysis can narrow the choices first
|
|
27
|
+
- set `full_text=True` only when the short form is insufficient
|
|
28
|
+
- for Python baselines, environment setup should be standardized around `uv`
|
|
38
29
|
|
|
39
30
|
## Stage purpose
|
|
40
31
|
|
|
41
32
|
The baseline stage should produce a usable reference point through one of four routes:
|
|
42
33
|
|
|
43
|
-
|
|
44
|
-
|
|
45
|
-
|
|
46
|
-
|
|
34
|
+
1. attach an existing reusable baseline
|
|
35
|
+
2. import a reusable baseline package
|
|
36
|
+
3. reproduce a baseline from source
|
|
37
|
+
4. repair a broken or stale baseline
|
|
47
38
|
|
|
48
|
-
|
|
39
|
+
Keep the classic control flow:
|
|
49
40
|
|
|
50
41
|
1. analysis
|
|
51
42
|
2. setup
|
|
52
43
|
3. execution
|
|
53
44
|
4. verification
|
|
54
45
|
|
|
55
|
-
|
|
46
|
+
These are control gates, not paperwork walls.
|
|
56
47
|
|
|
57
48
|
## Quick workflow
|
|
58
49
|
|
|
59
|
-
|
|
60
|
-
|
|
61
|
-
1. Read the source paper and source repo first, or explicitly record what is missing and why.
|
|
50
|
+
1. Read the source paper and source repo first, or record exactly what is missing and why.
|
|
62
51
|
2. Choose the lightest trustworthy route: attach, import, reproduce, or repair.
|
|
63
|
-
3.
|
|
64
|
-
4.
|
|
65
|
-
5.
|
|
66
|
-
6.
|
|
67
|
-
|
|
68
|
-
|
|
52
|
+
3. Start with the fast path whenever the current baseline object, command path, and acceptance target are already clear enough to validate cheaply.
|
|
53
|
+
4. Before substantial baseline setup, code edits, or a real baseline run, create `PLAN.md` and `CHECKLIST.md`; short-form files are enough for simple fast-path work.
|
|
54
|
+
5. Keep one dominant phase visible: analysis -> setup -> execution -> verification.
|
|
55
|
+
6. Prefer one clean implementation pass, one smoke test, and then one normal baseline run.
|
|
56
|
+
7. Retry only when smoke, verification, or runtime evidence shows a concrete failure or incompatibility.
|
|
57
|
+
8. Close the stage by confirming or waiving the gate, then hand off with a concise `1-2` sentence summary of trust status and next anchor.
|
|
69
58
|
|
|
70
|
-
|
|
59
|
+
## Fast-path first
|
|
71
60
|
|
|
72
61
|
Default to the lightest baseline path that can still establish a trustworthy comparison.
|
|
73
|
-
|
|
74
|
-
User requirements and explicit constraints are the primary boundary for the reproduction plan.
|
|
75
|
-
Within that boundary, prefer equivalence-preserving efficiency gains before more compute: larger safe batch size, cache reuse, checkpoint resume, parallel downloads or workers, and the cheapest comparable smoke path.
|
|
62
|
+
Default to a fast path when it can establish trust with less work.
|
|
76
63
|
|
|
77
|
-
|
|
64
|
+
Fast path is the default when any of the following is true:
|
|
78
65
|
|
|
79
|
-
|
|
80
|
-
|
|
81
|
-
|
|
82
|
-
|
|
83
|
-
5. verify before accepting, then archive, publish, or attach the result when appropriate
|
|
66
|
+
- `requested_baseline_ref` or `confirmed_baseline_ref` already points to the active baseline object
|
|
67
|
+
- the route is clearly `attach` or `import`
|
|
68
|
+
- the repo entrypoint, dataset or split, and metric contract are already concrete enough to validate cheaply
|
|
69
|
+
- reproduction requires no meaningful code changes and the main uncertainty is only whether the command still runs
|
|
84
70
|
|
|
85
|
-
|
|
71
|
+
Fast path means:
|
|
86
72
|
|
|
87
|
-
|
|
73
|
+
- do not restart broad baseline discovery by default
|
|
74
|
+
- do not front-load a full codebase audit when the entrypoint is already concrete
|
|
75
|
+
- use a minimal `PLAN.md`, a minimal `CHECKLIST.md`, one bounded smoke test when needed, and then one real validation or run
|
|
76
|
+
- default to reuse-and-verify when runtime already attached a concrete baseline
|
|
88
77
|
|
|
89
|
-
|
|
78
|
+
Escalate from fast path to fuller audit only when:
|
|
90
79
|
|
|
91
|
-
|
|
92
|
-
|
|
93
|
-
-
|
|
94
|
-
-
|
|
95
|
-
-
|
|
96
|
-
-
|
|
97
|
-
- `CHECKLIST.md` is the living companion to `PLAN.md`; update it during reading, setup, smoke testing, real execution, verification, and every material route change.
|
|
98
|
-
- If an older quest already uses `analysis_plan.md` or `REPRO_CHECKLIST.md`, keep those files aligned with the canonical `PLAN.md` / `CHECKLIST.md` or turn them into clear compatibility pointers rather than splitting truth across parallel planning files.
|
|
99
|
-
- Do not treat the plan as static: if the route, commands, source package, fallback path, or trust judgment changes materially, revise `PLAN.md` before continuing.
|
|
100
|
-
- Once `PLAN.md` makes the route concrete, do not keep rewriting code or commands speculatively. The normal default is one bounded smoke test and then one real run, with retries only after a documented failure, invalidity, or compatibility problem.
|
|
101
|
-
|
|
102
|
-
## Phase routing rule
|
|
103
|
-
|
|
104
|
-
Treat `analysis`, `setup`, `execution`, and `verification` as logical control gates, not paperwork walls.
|
|
105
|
-
At any moment, the work should have one dominant phase among:
|
|
106
|
-
|
|
107
|
-
- `analysis`
|
|
108
|
-
- `setup`
|
|
109
|
-
- `execution`
|
|
110
|
-
- `verification`
|
|
111
|
-
|
|
112
|
-
Keep the dominant phase explicit, but allow small backtracks and lightweight overlap when they reduce wasted work.
|
|
113
|
-
Do not delay an early smoke test just because a fuller write-up is not done yet.
|
|
114
|
-
Before a real long run, make sure the minimum viable contract is explicit and the active phase is still easy to reconstruct.
|
|
80
|
+
- the paper and repo disagree materially
|
|
81
|
+
- the real run or eval entrypoint is unclear
|
|
82
|
+
- code changes are likely required
|
|
83
|
+
- the contract spans multiple metrics, datasets, subtasks, or splits that still need interpretation
|
|
84
|
+
- the same failure class reappears after one documented autonomous fix
|
|
85
|
+
- the quest is trying to publish a reusable global baseline rather than only clear the current gate
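Reviewer note: the fast-path and escalation lists added in this hunk amount to a small decision rule. The sketch below is one possible reading of it, not code from the package. The `BaselineState` class, its field names, and the `choose_mode` function are hypothetical, and the fallback to a fuller audit when no fast-path condition holds is an assumption the skill text does not state.

```python
from dataclasses import dataclass


@dataclass
class BaselineState:
    """Hypothetical summary of what is currently known about the baseline."""
    has_baseline_ref: bool = False            # requested_baseline_ref or confirmed_baseline_ref is set
    route: str = ""                           # "attach", "import", "reproduce", or "repair"
    contract_is_concrete: bool = False        # entrypoint, dataset/split, metric contract are clear
    only_question_is_runnability: bool = False
    code_changes_likely: bool = False
    paper_repo_disagree: bool = False
    entrypoint_unclear: bool = False
    contract_needs_interpretation: bool = False
    repeated_failure_after_fix: bool = False
    publishing_global_baseline: bool = False


def fast_path_eligible(s: BaselineState) -> bool:
    # "Fast path is the default when any of the following is true"
    return any([
        s.has_baseline_ref,
        s.route in ("attach", "import"),
        s.contract_is_concrete,
        s.route == "reproduce" and not s.code_changes_likely and s.only_question_is_runnability,
    ])


def should_escalate(s: BaselineState) -> bool:
    # "Escalate from fast path to fuller audit only when"
    return any([
        s.paper_repo_disagree,
        s.entrypoint_unclear,
        s.code_changes_likely,
        s.contract_needs_interpretation,
        s.repeated_failure_after_fix,
        s.publishing_global_baseline,
    ])


def choose_mode(s: BaselineState) -> str:
    if should_escalate(s):
        return "full_audit"
    if fast_path_eligible(s):
        return "fast_path"
    # Assumption: with no fast-path signal at all, fall back to the fuller audit.
    return "full_audit"
```

Checking the escalation triggers first reflects that they override an otherwise eligible fast path.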
|
|
115
86
|
|
|
116
87
|
## Use when
|
|
117
88
|
|
|
@@ -119,7 +90,7 @@ Before a real long run, make sure the minimum viable contract is explicit and th
|
|
|
119
90
|
- the current baseline is unverified or stale
|
|
120
91
|
- the user already has a baseline package that should be attached or imported
|
|
121
92
|
- a reproduction failed earlier and now needs repair
|
|
122
|
-
- the quest
|
|
93
|
+
- the quest resumed and the baseline trust state is unclear
|
|
123
94
|
|
|
124
95
|
## Do not use when
|
|
125
96
|
|
|
@@ -128,97 +99,83 @@ Before a real long run, make sure the minimum viable contract is explicit and th
|
|
|
128
99
|
|
|
129
100
|
## Stage gate
|
|
130
101
|
|
|
131
|
-
Do not proceed to
|
|
102
|
+
Do not proceed to comparison-heavy downstream work unless one of the following is durably true:
|
|
132
103
|
|
|
133
104
|
- a baseline has been attached and accepted
|
|
134
105
|
- a baseline has been imported and accepted
|
|
135
106
|
- a baseline reproduction has completed and been verified
|
|
136
107
|
- an explicit waiver decision exists with a clear reason
|
|
137
108
|
|
|
138
|
-
Operationally
|
|
139
|
-
|
|
140
|
-
- after the accepted baseline root is clear, call `artifact.confirm_baseline(...)`
|
|
141
|
-
- if the quest must continue without a baseline, call `artifact.waive_baseline(...)`
|
|
142
|
-
|
|
143
|
-
`attach`, `import`, `publish`, or a plain `baseline` artifact alone do not open the downstream gate.
|
|
109
|
+
Operationally:
|
|
144
110
|
|
|
145
|
-
|
|
111
|
+
- call `artifact.confirm_baseline(...)` once the accepted baseline root and trusted comparison contract are clear
|
|
112
|
+
- call `artifact.waive_baseline(...)` when the quest must continue without a baseline
|
|
113
|
+
- attach, import, or publish alone do not open the downstream gate
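Reviewer note: the gate rule in this hunk can be illustrated with a short check. This is a sketch under stated assumptions. The record format (a list of dicts with `kind` and `reason` keys) and the `gate_is_open` helper are hypothetical. Only the names `confirm_baseline` and `waive_baseline` come from the skill text, and their real signatures are not shown in this diff.

```python
def gate_is_open(records):
    """Return True when the downstream gate may open.

    `records` is a hypothetical list of dicts such as
    {"kind": "confirm_baseline"} or {"kind": "waive_baseline", "reason": "..."}.
    """
    for record in records:
        kind = record.get("kind")
        if kind == "confirm_baseline":
            return True
        # A waiver only counts when it carries a clear, non-empty reason.
        if kind == "waive_baseline" and record.get("reason", "").strip():
            return True
    # attach, import, or publish records alone never open the gate.
    return False
```

For example, a history containing only `attach` and `publish` records leaves the gate closed until a confirmation or a reasoned waiver is recorded.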
|
|
146
114
|
|
|
147
|
-
|
|
148
|
-
|
|
149
|
-
- user objective and task framing
|
|
150
|
-
- source paper and official repo when available
|
|
151
|
-
- existing baseline registry entries
|
|
152
|
-
- local baseline directories under `quest_root`
|
|
153
|
-
- repo code, configs, and scripts
|
|
154
|
-
- device and environment constraints detected locally
|
|
155
|
-
- logs, metrics, and summaries from actual runs
|
|
156
|
-
|
|
157
|
-
Do not treat memory alone as sufficient evidence for baseline readiness.
|
|
158
|
-
|
|
159
|
-
## Baseline workspace rules
|
|
115
|
+
## Required plan and checklist
|
|
160
116
|
|
|
161
|
-
|
|
162
|
-
- avoid creating nested Git workflows inside the baseline workspace
|
|
163
|
-
- keep the authoritative quest history in the quest repo
|
|
164
|
-
- if papers are converted or notes are generated during baseline work, keep the durable copies under the quest-visible artifacts area unless there is a strong reason to keep a baseline-side copy
|
|
165
|
-
- if runtime environment variables or secrets are provided by the runner, use them as authoritative but never echo or persist secret values
|
|
117
|
+
Before substantial baseline setup, code edits, or a real baseline run, create a quest-visible `PLAN.md` and `CHECKLIST.md`.
|
|
166
118
|
|
|
167
|
-
|
|
168
|
-
|
|
119
|
+
- Use `references/baseline-plan-template.md` as the canonical structure for `PLAN.md`.
|
|
120
|
+
- Use `references/baseline-checklist-template.md` as the canonical structure for `CHECKLIST.md`.
|
|
121
|
+
- `analysis_plan.md` and `REPRO_CHECKLIST.md` remain acceptable compatibility alias files when an older quest already depends on them.
|
|
122
|
+
- For fast-path attach/import/prebound validation or a simple reproduce path with no expected code changes, short-form `PLAN.md` and `CHECKLIST.md` are enough.
|
|
123
|
+
- The plan should put the user's explicit requirements and non-negotiable constraints first.
|
|
124
|
+
- Then record the chosen route, source identity, command path, expected outputs, acceptance condition, safe efficiency levers, main risks, and fallback.
|
|
125
|
+
- If the route, commands, source package, fallback path, or trust judgment changes materially, revise `PLAN.md` before continuing.
|
|
126
|
+
- Once the route is concrete, stop reshaping code and commands speculatively.
|
|
169
127
|
|
|
170
|
-
|
|
171
|
-
- `CHECKLIST.md` as the canonical living checklist; older quests may keep `REPRO_CHECKLIST.md` as a compatibility alias when already wired
|
|
172
|
-
- `setup.md`
|
|
173
|
-
- `execution.md`
|
|
174
|
-
- `verification.md`
|
|
175
|
-
- `STRUCTURE.md` only when the workspace layout is non-obvious or later reuse depends on it
|
|
128
|
+
Default retry discipline:
|
|
176
129
|
|
|
177
|
-
|
|
178
|
-
|
|
130
|
+
- do not rerun the same unchanged smoke command just to reconfirm the same fact
|
|
131
|
+
- treat one autonomous retry for the same failure class as the normal upper bound
|
|
132
|
+
- if the same failure class appears again, switch explicitly into `repair`, record `blocked`, or route through `decision`
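Reviewer note: the retry bound described above (one autonomous retry per failure class, then an explicit switch to `repair`, `blocked`, or `decision`) could be tracked as follows. This is a minimal sketch. The `RetryTracker` class, the failure-class strings, and the `"retry"` / `"escalate"` return values are hypothetical and not part of the package.

```python
from collections import Counter

MAX_AUTONOMOUS_RETRIES = 1  # "one autonomous retry for the same failure class"


class RetryTracker:
    """Counts failures per failure class and says when to stop retrying."""

    def __init__(self):
        self._failures = Counter()

    def record_failure(self, failure_class: str) -> str:
        self._failures[failure_class] += 1
        if self._failures[failure_class] <= MAX_AUTONOMOUS_RETRIES:
            return "retry"
        # Same class seen again: switch to repair, record blocked, or route through decision.
        return "escalate"
```

Keying the counter on the failure class rather than on the command keeps unrelated failures from consuming each other's retry budget.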
|
|
179
133
|
|
|
180
134
|
## Required durable outputs
|
|
181
135
|
|
|
182
136
|
The baseline stage should usually leave behind:
|
|
183
137
|
|
|
184
138
|
- a baseline directory under `baselines/local/` or `baselines/imported/`
|
|
185
|
-
-
|
|
139
|
+
- `PLAN.md` and `CHECKLIST.md`
|
|
140
|
+
- a verification note or report
|
|
186
141
|
- command, config, environment, and metrics pointers
|
|
187
142
|
- a baseline artifact
|
|
188
143
|
- a confirmed baseline gate via `artifact.confirm_baseline(...)`, or an explicit waiver via `artifact.waive_baseline(...)`
|
|
189
144
|
- an optional registry publication if the baseline is reusable beyond this quest
|
|
190
145
|
|
|
191
|
-
|
|
146
|
+
For simple attach/import flows or a straightforward reproduce flow, do not stall just to precreate every optional note file.
|
|
192
147
|
|
|
193
|
-
|
|
194
|
-
|
|
195
|
-
|
|
148
|
+
Useful optional notes:
|
|
149
|
+
|
|
150
|
+
- `setup.md`
|
|
151
|
+
- `execution.md`
|
|
152
|
+
- `verification.md`
|
|
153
|
+
- `STRUCTURE.md` when the layout is non-obvious
|
|
154
|
+
|
|
155
|
+
## File-by-file contract
|
|
156
|
+
|
|
157
|
+
- `PLAN.md` or compatibility alias `analysis_plan.md` is the required route contract before substantial setup, code edits, or a real run; it should state the route, source identity, command path, expected outputs, acceptance condition, main risks, and fallback.
|
|
158
|
+
- `CHECKLIST.md` or compatibility alias `REPRO_CHECKLIST.md` is the required living state tracker; it should show whether the baseline object, smoke decision, real run decision, and final accept / block / waive outcome are explicit.
|
|
159
|
+
- `setup.md` is optional unless environment or layout choices are non-trivial; if used, record the working directory, environment route, important config paths, source revision, and notable setup deviations.
|
|
160
|
+
- `execution.md` is optional unless the run is long, multi-step, or rerun-heavy; if used, record the launched commands, durable log paths, checkpoints, exit state, and any reruns or repairs.
|
|
161
|
+
- `verification.md` is optional as a filename but required in substance before acceptance or blocked closeout; either this file or an equivalent report should record trusted metrics, expected-versus-observed comparison, caveats, canonical output paths, and the next anchor.
|
|
162
|
+
- `STRUCTURE.md` becomes required when the workspace layout, mounts, symlinks, or generated outputs are non-obvious or meant for reuse; it should map the important directories and say which paths are canonical.
|
|
163
|
+
- `attachment.yaml` is required for attached or imported baselines under `baselines/imported/`; preserve source identity, selected variant when relevant, and attachment provenance there.
|
|
164
|
+
- `<baseline_root>/json/metric_contract.json` is the canonical accepted comparison contract; once the baseline is accepted, do not leave the authoritative metric surface only in chat, memory, or prose.
|
|
165
|
+
- `Result/metric.md` is scratch-only; it may help during execution, but it is never the final source of truth.
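Reviewer note: the two required files and their compatibility aliases lend themselves to a simple presence check. The sketch below assumes both files sit directly in one directory passed in by the caller. The diff only says they must be quest-visible, so that location and the `missing_required_files` helper are hypothetical.

```python
from pathlib import Path

# Canonical name -> accepted file names (canonical first, then compatibility alias).
REQUIRED_WITH_ALIASES = {
    "PLAN.md": ("PLAN.md", "analysis_plan.md"),
    "CHECKLIST.md": ("CHECKLIST.md", "REPRO_CHECKLIST.md"),
}


def missing_required_files(directory):
    """Return the canonical names of required files that have no accepted form present."""
    root = Path(directory)
    missing = []
    for canonical, candidates in REQUIRED_WITH_ALIASES.items():
        if not any((root / name).is_file() for name in candidates):
            missing.append(canonical)
    return missing
```

An empty result means both the route contract and the living checklist exist in some accepted form. It says nothing about whether their contents are adequate.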
|
|
196
166
|
|
|
197
167
|
Minimum stability rules:
|
|
198
168
|
|
|
199
169
|
- before the first real run, leave one durable note with the chosen route, expected command path, target outputs, and main risks
|
|
200
170
|
- after each smoke test or real run, record what actually happened and whether the route still looks viable
|
|
201
171
|
- before acceptance, leave a clear verification note and baseline gate decision
|
|
202
|
-
- every route selection should leave one explicit reasoned decision record
|
|
203
172
|
- every accepted baseline should leave one accepted baseline artifact
|
|
204
173
|
- every blocked baseline line should leave one blocked report and one next-step decision
|
|
205
|
-
- every handoff should name the active baseline reference and trusted metric set explicitly
|
|
206
|
-
- when the accepted paper-facing contract spans multiple metrics, datasets, subtasks, or splits, preserve that full comparison surface in the durable metric contract rather than collapsing it to one headline number
|
|
207
|
-
- do not require every optional checklist or template before the first smoke test
|
|
208
174
|
- if one rolling note is enough for a simple baseline line, use it
|
|
209
175
|
|
|
210
|
-
Recommended phase-to-output mapping:
|
|
211
|
-
|
|
212
|
-
- `analysis` -> a brief `PLAN.md` or compatible `analysis_plan.md`, plus optional route decision artifact
|
|
213
|
-
- `setup` -> `setup.md` when setup choices are non-trivial
|
|
214
|
-
- `execution` -> `execution.md` plus progress artifacts when long-running
|
|
215
|
-
- `verification` -> `verification.md` plus accepted baseline artifact and `artifact.confirm_baseline(...)`, or a blocked report plus `artifact.waive_baseline(...)` when skipping is intentional
|
|
216
|
-
|
|
217
|
-
If the work skips one of these durable outputs, explain why the baseline remains interpretable without it.
|
|
218
|
-
|
|
219
176
|
## Durable path contract
|
|
220
177
|
|
|
221
|
-
|
|
178
|
+
Use the real runtime paths consistently.
|
|
222
179
|
|
|
223
180
|
Quest-local paths:
|
|
224
181
|
|
|
@@ -235,33 +192,33 @@ Global reusable registry paths:
|
|
|
235
192
|
- baseline registry index: `~/DeepScientist/config/baselines/index.jsonl`
|
|
236
193
|
- canonical baseline entry: `~/DeepScientist/config/baselines/entries/<baseline_id>.yaml`
|
|
237
194
|
|
|
195
|
+
## Baseline id and variant rules
|
|
196
|
+
|
|
197
|
+
- `baseline_id` should be short, stable, and filesystem-safe
|
|
198
|
+
- use letters, digits, `.`, `_`, or `-`
|
|
199
|
+
- do not use spaces, `/`, `\\`, or `..`
|
|
200
|
+
- if one codebase contains multiple comparable baselines, prefer one `baseline_id` with structured variants instead of inventing many near-duplicate entries
|
|
201
|
+
- when variants exist, keep `default_variant_id`, `baseline_variants`, and per-variant metric summaries stable enough that later `experiment` and `write` stages can cite them directly
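Reviewer note: the character rules for `baseline_id` can be expressed as a small validator. This is an illustrative sketch, not the package's own check. The `is_valid_baseline_id` function is hypothetical, and both the length cap and the rejection of a bare `.` are assumptions added for filesystem safety.

```python
import re

# Letters, digits, ".", "_", or "-" only, as listed in the rules above.
_ALLOWED = re.compile(r"[A-Za-z0-9._-]+")
MAX_LEN = 64  # assumption: the skill text says "short" without giving a number


def is_valid_baseline_id(baseline_id: str) -> bool:
    if not baseline_id or len(baseline_id) > MAX_LEN:
        return False
    # ".." is explicitly forbidden; a bare "." is rejected as an extra assumption.
    if ".." in baseline_id or baseline_id == ".":
        return False
    # Spaces, "/", and "\" fail the allowed-character match.
    return _ALLOWED.fullmatch(baseline_id) is not None
```

Under these assumptions an id such as `resnet50-v1.5` passes, while `my baseline`, `a/b`, and `a..b` are rejected.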
|
|
202
|
+
|
|
238
203
|
Do not invent parallel durable locations when these runtime contracts already exist.
|
|
239
204
|
Do not leave the authoritative metric contract only in chat, memory, or prose once the baseline is accepted.
|
|
240
205
|
|
|
241
206
|
If a baseline is reproduced only because an analysis campaign needs an extra comparator:
|
|
242
207
|
|
|
243
|
-
- still place it under
|
|
208
|
+
- still place it under the normal baseline roots
|
|
244
209
|
- treat it as a supplementary analysis baseline unless the quest explicitly promotes it into the canonical gate
|
|
245
210
|
- do not call `artifact.confirm_baseline(...)` for that supplementary case unless the quest truly intends to replace the canonical baseline
|
|
246
211
|
|
|
247
|
-
##
|
|
248
|
-
|
|
249
|
-
Baseline identity should be stable and path-safe.
|
|
250
|
-
|
|
251
|
-
- `baseline_id` should be short, stable, and filesystem-safe
|
|
252
|
-
- use letters, digits, `.`, `_`, or `-`
|
|
253
|
-
- do not use spaces, `/`, `\\`, or `..`
|
|
254
|
-
- if one codebase contains multiple comparable baselines, use one `baseline_id` with structured variants instead of inventing many unrelated entries
|
|
255
|
-
|
|
256
|
-
When variants exist, maintain at least:
|
|
212
|
+
## Multi-baseline policy
|
|
257
213
|
|
|
258
|
-
|
|
259
|
-
- `baseline_variants`
|
|
260
|
-
- per-variant metric summaries when available
|
|
214
|
+
One quest may legitimately need more than one baseline.
|
|
261
215
|
|
|
262
|
-
|
|
216
|
+
- explicitly mark which baseline is the primary downstream comparator
|
|
217
|
+
- distinguish primary comparison baselines from fallback or infrastructure baselines
|
|
218
|
+
- if several baselines are credible, record why the chosen primary baseline is the fairest paper-facing comparator
|
|
219
|
+
- do not leave later stages guessing which baseline is authoritative
|
|
263
220
|
|
|
264
|
-
##
|
|
221
|
+
## Route order
|
|
265
222
|
|
|
266
223
|
Prefer this order:
|
|
267
224
|
|
|
@@ -272,101 +229,6 @@ Prefer this order:
|
|
|
272
229
|
|
|
273
230
|
Prefer reuse over redundant reproduction.
|
|
274
231
|
|
|
275
|
-
## Route selection rules
|
|
276
|
-
|
|
277
|
-
Choose the route explicitly rather than by habit.
|
|
278
|
-
|
|
279
|
-
- choose `attach` when a published baseline already exists in the registry and its metrics or provenance are trustworthy enough for the quest
|
|
280
|
-
- choose `import` when the user or repo provides a reusable baseline package or bundle that is not yet attached to the current quest
|
|
281
|
-
- choose `reproduce` when no trustworthy reusable baseline is available but the source repo, paper, and evaluation path are concrete enough to establish one
|
|
282
|
-
- choose `repair` when a baseline route already exists but failed, drifted, or is only partially complete and the broken point is bounded enough to diagnose directly
|
|
283
|
-
|
|
284
|
-
Do not default to reproduction if attach or import would establish an equally trustworthy reference with less risk and cost.
|
|
285
|
-
|
|
286
|
-
Before locking the route, explicitly answer:
|
|
287
|
-
|
|
288
|
-
- what object is being reused or established
|
|
289
|
-
- what makes it trustworthy enough for downstream comparison
|
|
290
|
-
- what evidence is missing
|
|
291
|
-
- what the cheapest credible next step is
|
|
292
|
-
|
|
293
|
-
For a more explicit route-selection rubric, read `references/route-selection.md`.
|
|
294
|
-
|
|
295
|
-
## Baseline comparability contract
|
|
296
|
-
|
|
297
|
-
The baseline stage is not complete just because something ran.
|
|
298
|
-
It is complete when later stages can compare against it fairly.
|
|
299
|
-
|
|
300
|
-
Before declaring a baseline usable, make the comparability contract explicit:
|
|
301
|
-
|
|
302
|
-
- task identity
|
|
303
|
-
- dataset identity and version
|
|
304
|
-
- split contract
|
|
305
|
-
- preprocessing boundary
|
|
306
|
-
- evaluation script or evaluation path
|
|
307
|
-
- required metric keys
|
|
308
|
-
- metric directions
|
|
309
|
-
- seed policy when relevant
|
|
310
|
-
- source commit or source package identity
|
|
311
|
-
- known deviations from the source reference
|
|
312
|
-
|
|
313
|
-
If any of these are still materially unknown, do not pretend the baseline is a clean downstream reference.
|
|
314
|
-
|
|
315
|
-
Use `references/comparability-contract.md` for the full checklist.
|
|
316
|
-
|
|
317
|
-
## Feasibility and acceptance classes
|
|
318
|
-
|
|
319
|
-
Before accepting a baseline, classify feasibility as one of:
|
|
320
|
-
|
|
321
|
-
- `full_reproducible`
|
|
322
|
-
- `degraded_but_acceptable`
|
|
323
|
-
- `blocked`
|
|
324
|
-
|
|
325
|
-
And classify downstream trust as one of:
|
|
326
|
-
|
|
327
|
-
- `verified`
|
|
328
|
-
- `partially_verified`
|
|
329
|
-
- `operational_but_incomparable`
|
|
330
|
-
- `failed`
|
|
331
|
-
|
|
332
|
-
Rules:
|
|
333
|
-
|
|
334
|
-
- `full_reproducible` means the baseline can be reproduced within the agreed contract
|
|
335
|
-
- `degraded_but_acceptable` means the quest explicitly allows a bounded degraded gate
|
|
336
|
-
- `blocked` means insufficient assets, compute, or environment to produce an acceptable baseline
|
|
337
|
-
- `verified` means trusted for downstream comparison
|
|
338
|
-
- `partially_verified` means useful but still caveated
|
|
339
|
-
- `operational_but_incomparable` means it runs, but the comparison contract is not stable enough yet
|
|
340
|
-
- `failed` means it should not be used downstream
|
|
341
|
-
|
|
342
|
-
Do not silently upgrade a degraded or only operational result into a normal trusted baseline.
|
|
343
|
-
|
|
344
|
-
## Multi-baseline policy
|
|
345
|
-
|
|
346
|
-
One quest may legitimately need more than one baseline reference.
|
|
347
|
-
|
|
348
|
-
Common roles include:
|
|
349
|
-
|
|
350
|
-
- primary comparison baseline
|
|
351
|
-
- strongest literature baseline
|
|
352
|
-
- cheapest operational fallback baseline
|
|
353
|
-
|
|
354
|
-
If more than one baseline exists, explicitly record:
|
|
355
|
-
|
|
356
|
-
- which one is the primary downstream comparison
|
|
357
|
-
- which one is only a fallback or infrastructure reference
|
|
358
|
-
- why the primary choice is the fairest or strongest comparison
|
|
359
|
-
|
|
360
|
-
Do not leave later stages guessing which baseline is authoritative.
|
|
361
|
-
|
|
362
|
-
When useful, record the route choice as a decision artifact with action such as:
|
|
363
|
-
|
|
364
|
-
- `attach_baseline`
|
|
365
|
-
- `reuse_baseline`
|
|
366
|
-
- `publish_baseline`
|
|
367
|
-
- `continue`
|
|
368
|
-
- `request_user_decision`
|
|
369
|
-
|
|
370
232
|
## Workflow
|
|
371
233
|
|
|
372
234
|
### Phase 1. Analysis
|
|
@@ -379,236 +241,88 @@ Before running anything substantial, determine:
|
|
|
379
241
|
- source baseline identity
|
|
380
242
|
- source code path
|
|
381
243
|
- expected run command or evaluation path
|
|
382
|
-
- expected paper or repo numbers
|
|
244
|
+
- expected paper or repo numbers when they exist
|
|
383
245
|
- local resource constraints
|
|
384
246
|
|
|
385
|
-
|
|
247
|
+
Default analysis discipline:
|
|
386
248
|
|
|
387
|
-
-
|
|
388
|
-
-
|
|
249
|
+
- read the source paper and source repo first
|
|
250
|
+
- if runtime already exposes a matching `requested_baseline_ref` or `confirmed_baseline_ref`, validate that concrete object before restarting broad discovery
|
|
251
|
+
- identify the real run or evaluation entrypoint
|
|
252
|
+
- identify the dataset or split and metric contract
|
|
389
253
|
- identify likely environment blockers
|
|
390
254
|
- define the cheapest credible smoke test
|
|
391
255
|
|
|
392
|
-
Escalate
|
|
256
|
+
Escalate to a fuller audit only when the command path is unclear, the repo is large or confusing, repair mode is active, or custom code changes look likely.
|
|
393
257
|
|
|
394
|
-
When the fuller audit is necessary, capture
|
|
258
|
+
When the fuller audit is necessary, capture only what later stages truly need:
|
|
395
259
|
|
|
396
|
-
- major
|
|
260
|
+
- major entry scripts, configs, and modules
|
|
397
261
|
- end-to-end data flow
|
|
398
|
-
-
|
|
399
|
-
-
|
|
400
|
-
-
|
|
401
|
-
- current evaluation pipeline and metric computation path
|
|
402
|
-
- coupling, maintainability, or scalability issues that may slow later iterations
|
|
262
|
+
- evaluation path and metric computation path
|
|
263
|
+
- obvious environment assumptions
|
|
264
|
+
- obvious bottlenecks or incompatibilities
|
|
403
265
|
|
|
404
|
-
|
|
266
|
+
If the source paper is available, record:
|
|
405
267
|
|
|
406
|
-
- read it through `artifact.arxiv(paper_id=..., full_text=False)` first, and only switch to `full_text=True` when the shorter view is insufficient
|
|
407
268
|
- the core algorithm in compact, implementation-faithful form
|
|
408
269
|
- the main reported numbers
|
|
409
|
-
- the main weaknesses or bottlenecks likely to matter
|
|
410
|
-
|
|
411
|
-
If helpful, restate the core algorithm using two of the following:
|
|
412
|
-
|
|
413
|
-
- short pseudocode
|
|
414
|
-
- a compact equation or objective
|
|
415
|
-
- a code-level sketch tied to real files
|
|
416
|
-
|
|
417
|
-
The goal is not academic polish.
|
|
418
|
-
The goal is that later `idea`, `experiment`, and `write` stages can understand what the baseline actually does without reopening the whole repo from scratch.
|
|
419
|
-
|
|
420
|
-
You should inspect local feasibility with shell-based checks when needed, including:
|
|
421
|
-
|
|
422
|
-
- OS
|
|
423
|
-
- GPU availability
|
|
424
|
-
- CPU and RAM
|
|
425
|
-
- free disk
|
|
426
|
-
- Python or conda environment availability
|
|
427
|
-
- whether `uv` is available and which Python version `uv` should target
|
|
428
|
-
|
|
429
|
-
Use the collected constraints to choose a realistic baseline route and runtime plan.
|
|
430
|
-
|
|
431
|
-
The analysis phase should leave behind a concrete baseline plan rather than only conversational intent.
|
|
432
|
-
At minimum, the plan should capture:
|
|
433
|
-
|
|
434
|
-
- chosen route
|
|
435
|
-
- source identity
|
|
436
|
-
- expected commands
|
|
437
|
-
- expected outputs
|
|
438
|
-
- feasibility notes
|
|
439
|
-
- key risks
|
|
440
|
-
- verification targets
|
|
441
|
-
|
|
442
|
-
Prefer `PLAN.md` for new work and use `references/baseline-plan-template.md` when you need a concrete starting structure.
|
|
443
|
-
When the analysis note becomes substantial, structure `PLAN.md` or a legacy-compatible `analysis_plan.md` with headings close to:
|
|
270
|
+
- the main weaknesses or bottlenecks likely to matter for this quest
|
|
444
271
|
|
|
445
|
-
-
|
|
446
|
-
- codebase analysis
|
|
447
|
-
- limitations or bottlenecks
|
|
448
|
-
- KPI and metric contract
|
|
449
|
-
- route choice
|
|
450
|
-
- risks and mitigations
|
|
272
|
+
You may inspect local feasibility with shell-based checks for OS, GPU, CPU, RAM, disk, Python version, and whether `uv` is available.
|
|
451
273
|
|
|
452
|
-
|
|
274
|
+
The analysis phase should leave behind a concrete plan rather than only conversational intent.
|
|
453
275
|
|
|
454
|
-
|
|
455
|
-
- the early exception is when code access, paper access, source identity, or execution permission is missing and that absence blocks even baseline analysis
|
|
456
|
-
- do not ask generic “how should I set up the environment” questions before you inspect the device and code requirements
|
|
457
|
-
- do not repeat already confirmed decisions unless the plan materially changed
|
|
458
|
-
|
|
459
|
-
If a user decision is required, make it structured and compact:
|
|
460
|
-
|
|
461
|
-
- usually `1-6` questions total
|
|
462
|
-
- each question should contain concrete options
|
|
463
|
-
- options should reflect actual hardware/code feasibility
|
|
464
|
-
- options should include tradeoffs
|
|
465
|
-
- the recommended option should be explicit
|
|
466
|
-
- free-form input should be requested only where a preset choice is genuinely insufficient
|
|
467
|
-
|
|
468
|
-
If parallel execution is proposed, it must be explicitly confirmed rather than silently enabled.
|
|
469
|
-
|
|
470
|
-
Avoid asking the user to design the environment for you.
|
|
471
|
-
Instead, analyze the environment first, then present the recommended path and tradeoffs only if a user decision is actually required.
|
|
472
|
-
|
|
473
|
-
If the code, paper, or baseline source is missing and the missing piece changes the route materially, stop and ask for a structured decision rather than guessing.
|
|
474
|
-
|
|
475
|
-
For a denser audit checklist, read `references/codebase-audit-checklist.md`.
|
|
476
|
-
|
|
477
|
-
### Phase 2. Setup
|
|
276
|
+
## Phase 2. Setup
|
|
478
277
|
|
|
479
278
|
Prepare the selected route:
|
|
480
279
|
|
|
481
280
|
- attach: validate the selected baseline id and variant
|
|
482
|
-
- import: place the imported baseline metadata under the quest
|
|
281
|
+
- import: place the imported baseline metadata under the quest and confirm the package is readable
|
|
483
282
|
- reproduce: prepare the baseline work directory, commands, config pointers, and environment notes
|
|
484
283
|
- repair: identify the precise broken point before rerunning blindly
|
|
485
284
|
|
|
486
|
-
For Python baselines, environment setup
|
|
487
|
-
Treat `uv` as the default environment and package manager for baseline setup, smoke tests, and real runs.
|
|
488
|
-
Do not casually switch to a new conda environment or a manual `pip install` flow just because the repo is old.
|
|
489
|
-
If the baseline already ships a `pyproject.toml` / `uv.lock`, use that path first.
|
|
490
|
-
If it only ships `requirements.txt`, still create the environment with `uv` and install through `uv pip`.
|
|
491
|
-
Only accept a non-`uv` environment route when there is a concrete blocker that cannot be resolved locally, and record that blocker explicitly in `setup.md` and the progress update.
|
|
492
|
-
|
|
493
|
-
For a fast-path reproduction, setup can stay lightweight.
|
|
494
|
-
Confirm the working directory, environment, config, output paths, smoke command, and long-run command, then move forward.
|
|
495
|
-
Do not manufacture a fresh workspace tree or copy the repo just to satisfy a template if the existing layout is already workable and auditable.
|
|
496
|
-
|
|
497
|
-
Capture:
|
|
498
|
-
|
|
499
|
-
- baseline identifier
|
|
500
|
-
- source and provenance
|
|
501
|
-
- working directory
|
|
502
|
-
- config files
|
|
503
|
-
- command template
|
|
504
|
-
- expected outputs
|
|
505
|
-
- risks and known deviations from the paper or source
|
|
506
|
-
|
|
507
|
-
Setup should also confirm:
|
|
508
|
-
|
|
509
|
-
- the intended working directory is correct
|
|
510
|
-
- the output paths are durable and quest-visible
|
|
511
|
-
- required dependencies or environments are known
|
|
512
|
-
- the execution plan is realistic for the detected hardware
|
|
285
|
+
For Python baselines, standardize environment setup around `uv`.
|
|
513
286
|
|
|
514
287
|
### Python environment rule: use `uv`
|
|
515
288
|
|
|
516
|
-
|
|
517
|
-
|
|
518
|
-
|
|
519
|
-
|
|
520
|
-
3. install dependencies with `uv pip install ...`
|
|
521
|
-
4. run setup, smoke tests, and real commands through `uv run ...`
|
|
289
|
+
- if the repo already contains `uv.lock` or a solid `pyproject.toml`, use `uv sync`
|
|
290
|
+
- otherwise create a local virtual environment with `uv venv`
|
|
291
|
+
- install dependencies with `uv pip install ...`
|
|
292
|
+
- run setup, smoke tests, and real commands through `uv run ...`
|
|
522
293
|
|
|
523
294
|
Practical rules:
|
|
524
295
|
|
|
525
|
-
- prefer a quest-local or baseline-local `.venv`
|
|
526
|
-
- prefer `uv run python ...`
|
|
296
|
+
- prefer a quest-local or baseline-local `.venv`
|
|
297
|
+
- prefer `uv run python ...` or `uv run bash ...` over relying on shell activation state
|
|
527
298
|
- if a specific interpreter is required, make it explicit with `uv venv --python 3.11` or `uv run --python 3.11 ...`
|
|
528
|
-
- if CUDA, PyTorch, JAX, or custom wheels require a special index URL,
|
|
529
|
-
-
|
|
530
|
-
|
|
531
|
-
|
|
532
|
-
|
|
533
|
-
|
|
534
|
-
|
|
535
|
-
|
|
536
|
-
uv
|
|
537
|
-
uv run python
|
|
538
|
-
uv run python train.py --config configs/baseline.yaml
|
|
539
|
-
```
|
|
540
|
-
|
|
541
|
-
```bash
|
|
542
|
-
# legacy repo with requirements.txt
|
|
543
|
-
cd <baseline_root>
|
|
544
|
-
uv venv --python 3.11
|
|
545
|
-
uv pip install -r requirements.txt
|
|
546
|
-
uv run python scripts/smoke_test.py
|
|
547
|
-
uv run python main.py --dataset cifar10 --config configs/resnet18.yaml
|
|
548
|
-
```
|
|
549
|
-
|
|
550
|
-
```bash
|
|
551
|
-
# one-off package additions without leaving the uv-managed flow
|
|
552
|
-
cd <baseline_root>
|
|
553
|
-
uv venv --python 3.11
|
|
554
|
-
uv pip install -r requirements.txt
|
|
555
|
-
uv pip install "torch==2.4.1" "torchvision==0.19.1"
|
|
556
|
-
uv run python evaluate.py --checkpoint outputs/best.pt
|
|
557
|
-
```
|
|
558
|
-
|
|
559
|
-
When you record the setup, explicitly note:
|
|
560
|
-
|
|
561
|
-
- the chosen `uv` route: `uv sync` vs `uv venv` + `uv pip`
|
|
562
|
-
- the Python version
|
|
563
|
-
- the dependency source files used
|
|
564
|
-
- the exact `uv run ...` command used for the smoke test
|
|
565
|
-
- any blocker that prevented a pure `uv` flow
|
|
566
|
-
|
|
567
|
-
If a dedicated baseline workspace is needed, establish a clear layout.
|
|
568
|
-
One workable structure is:
|
|
569
|
-
|
|
570
|
-
```text
|
|
571
|
-
<baseline_root>/
|
|
572
|
-
src/
|
|
573
|
-
scripts/
|
|
574
|
-
logs/
|
|
575
|
-
cache/
|
|
576
|
-
results/
|
|
577
|
-
exports/
|
|
578
|
-
latest/
|
|
579
|
-
<run_id>/
|
|
580
|
-
```
|
|
581
|
-
|
|
582
|
-
If the baseline becomes long-lived, shared, or non-obvious, the quest-visible audit area may contain:
|
|
583
|
-
|
|
584
|
-
```text
|
|
585
|
-
<quest_root>/
|
|
586
|
-
baselines/
|
|
587
|
-
local/
|
|
588
|
-
<baseline_id>/
|
|
589
|
-
analysis_plan.md
|
|
590
|
-
setup.md
|
|
591
|
-
execution.md
|
|
592
|
-
verification.md
|
|
593
|
-
STRUCTURE.md
|
|
594
|
-
REPRO_CHECKLIST.md
|
|
595
|
-
```
|
|
299
|
+
- if CUDA, PyTorch, JAX, or custom wheels require a special index URL, keep that install under `uv pip`
|
|
300
|
+
- only accept a non-`uv` route when there is a concrete blocker that cannot be resolved locally
|
|
301
|
+
|
|
302
|
+
Common `uv` patterns:
|
|
303
|
+
|
|
304
|
+
- `uv sync`
|
|
305
|
+
- `uv venv --python 3.11`
|
|
306
|
+
- `uv pip install -r requirements.txt`
|
|
307
|
+
- `uv run python scripts/smoke_test.py`
|
|
308
|
+
- `uv run python train.py --config ...`
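The choice between `uv sync` and `uv venv` plus `uv pip install` can be sketched as a small planner. This is an illustration only: the function name `plan_uv_commands` is hypothetical, and it treats any existing `pyproject.toml` as usable, whereas the rule above asks for a judgment about whether it is solid. The sketch only builds the command lists and does not run them.

```python
from pathlib import Path


def plan_uv_commands(baseline_root: Path, python_version: str = "3.11") -> list[list[str]]:
    """Return the uv commands to run, in order, without executing them."""
    has_lock = (baseline_root / "uv.lock").exists()
    has_pyproject = (baseline_root / "pyproject.toml").exists()
    if has_lock or has_pyproject:
        # Assumption: any pyproject.toml counts as "solid" in this sketch.
        return [["uv", "sync"]]
    # Legacy layout: create a local virtual environment first.
    commands = [["uv", "venv", "--python", python_version]]
    if (baseline_root / "requirements.txt").exists():
        commands.append(["uv", "pip", "install", "-r", "requirements.txt"])
    return commands
```

Returning the plan as data keeps the chosen route easy to record in the setup notes before anything is executed.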
|
|
596
309
|
|
|
597
310
|
Setup should record:
|
|
598
311
|
|
|
599
|
-
-
|
|
600
|
-
-
|
|
601
|
-
-
|
|
602
|
-
-
|
|
603
|
-
-
|
|
604
|
-
-
|
|
312
|
+
- baseline id and source identity
|
|
313
|
+
- working directory
|
|
314
|
+
- config files
|
|
315
|
+
- command template
|
|
316
|
+
- expected outputs
|
|
317
|
+
- known deviations from paper or source
|
|
318
|
+
- the chosen `uv` route and Python version
|
|
605
319
|
|
|
606
|
-
|
|
320
|
+
Fallbacks:
|
|
607
321
|
|
|
608
|
-
|
|
609
|
-
|
|
322
|
+
- if Hugging Face access is blocked, record and try an approved local mirror such as ModelScope when that does not change the comparison meaning
|
|
323
|
+
- if a quest already depends on `analysis_plan.md` or `REPRO_CHECKLIST.md`, keep the compatibility alias explicit rather than splitting truth across two active plans
|
|
610
324
|
|
|
611
|
-
|
|
325
|
+
## Phase 3. Execution
|
|
612
326
|
|
|
613
327
|
Run only the work required to establish the baseline credibly.
|
|
614
328
|
|
|
@@ -617,88 +331,31 @@ Execution rules:
|
|
|
617
331
|
- keep commands auditable
|
|
618
332
|
- keep logs durable
|
|
619
333
|
- avoid uncontrolled side experiments during baseline establishment
|
|
620
|
-
-
|
|
621
|
-
-
|
|
622
|
-
|
|
623
|
-
|
|
624
|
-
|
|
625
|
-
|
|
626
|
-
|
|
627
|
-
|
|
628
|
-
-
|
|
629
|
-
-
|
|
630
|
-
-
|
|
631
|
-
-
|
|
632
|
-
-
|
|
633
|
-
-
|
|
634
|
-
-
|
|
635
|
-
-
|
|
636
|
-
-
|
|
637
|
-
|
|
638
|
-
|
|
639
|
-
|
|
640
|
-
-
|
|
641
|
-
|
|
642
|
-
|
|
643
|
-
|
|
644
|
-
|
|
645
|
-
|
|
646
|
-
Recommended result structures to maintain:
|
|
647
|
-
|
|
648
|
-
- per-combination result records
|
|
649
|
-
- an aggregated `result.json`
|
|
650
|
-
- a registry or JSONL index mapping each combination to its stored result
|
|
651
|
-
- exported snapshots in both run-specific and `latest/` locations
|
|
652
|
-
- run metadata capturing the environment and command context
|
|
653
|
-
|
|
654
|
-
Recommended run metadata includes:
|
|
655
|
-
|
|
656
|
-
- config snapshot
|
|
657
|
-
- relevant Git or source snapshot identifiers
|
|
658
|
-
- package/environment summary
|
|
659
|
-
- machine summary such as GPU visibility when relevant
|
|
660
|
-
|
|
661
|
-
If a result backup is useful for audit or recovery, create it explicitly rather than assuming the latest export is enough.
|
|
662
|
-
|
|
663
|
-
Long-running execution rules:
|
|
664
|
-
|
|
665
|
-
- before a substantial baseline reproduction, run a bounded smoke test first so command paths, output locations, and metric plumbing are validated cheaply
|
|
666
|
-
- once the smoke test passes, launch the real baseline reproduction with `bash_exec(mode='detach', ...)` and normally leave `timeout_seconds` unset for the long run itself
|
|
667
|
-
- `bash_exec(mode='read', id=...)` returns the full rendered log when it is 2000 lines or fewer; for longer logs it returns the first 500 lines plus the last 1500 lines and a hint to inspect omitted sections with `start` and `tail`
|
|
668
|
-
- if a long saved log omits the middle section you need, use `bash_exec(mode='read', id=..., start=..., tail=...)` to inspect that forward rendered-line window
|
|
669
|
-
- when monitoring that detached run, prefer `bash_exec(mode='read', id=..., tail_limit=..., order='desc')` so you inspect the newest log evidence first
|
|
670
|
-
- after the first read, prefer incremental checks with `bash_exec(mode='read', id=..., after_seq=last_seen_seq, tail_limit=..., order='asc')` so you only inspect newly appended evidence
|
|
671
|
-
- if you need to recover ids or confirm the newest session quickly, use `bash_exec(mode='history')` or `bash_exec(mode='list')` rather than guessing
|
|
672
|
-
- include a structured `comment` on long-running bash sessions with fields such as `stage`, `goal`, `action`, `expected_signal`, and `next_check`
|
|
673
|
-
- use `silent_seconds`, `progress_age_seconds`, `signal_age_seconds`, and `watchdog_overdue` from `bash_exec(mode='list'|'read', ...)` as the default staleness checks
|
|
674
|
-
- when the reproduction code is under your control, prefer a throttled `tqdm` progress reporter and, when feasible, pair it with periodic `__DS_PROGRESS__` JSON lines carrying phase and ETA
|
|
675
|
-
- if a command is expected to run for a long time, monitor it as a real background task rather than assuming success
|
|
676
|
-
- do not write final summaries or accepted metrics until the command has actually completed
|
|
677
|
-
- verify that the expected result files exist before treating the run as finished
|
|
678
|
-
- if a task is invalid, wedged, or failed, stop it with `bash_exec(mode='kill', id=..., wait=true, timeout_seconds=...)`; if it must die immediately, add `force=true`, then diagnose the reason and either retry with a documented fix or record the failure durably
|
|
679
|
-
- canonical sleep choice:
|
|
680
|
-
- if you only need wall-clock waiting between checks, use `bash_exec(command='sleep N', mode='await', timeout_seconds=N+buffer, ...)`
|
|
681
|
-
- keep a real buffer on that sleep timeout; do not set `timeout_seconds` exactly equal to `N`
|
|
682
|
-
- if you are waiting on an already running managed session, prefer `bash_exec(mode='await', id=..., timeout_seconds=...)` instead of starting a new sleep command
|
|
683
|
-
|
|
684
|
-
Recommended monitoring cadence for long-running work:
|
|
685
|
-
|
|
686
|
-
- first check after about 60 seconds
|
|
687
|
-
- second check after about 120 seconds
|
|
688
|
-
- third check after about 300 seconds
|
|
689
|
-
- fourth check after about 600 seconds
|
|
690
|
-
- fifth check after about 1800 seconds
|
|
691
|
-
- after that, keep checking about every 1800 seconds while the run is still active
|
|
692
|
-
|
|
693
|
-
The exact mechanism should prefer `bash_exec(mode='await' | 'detach' | 'read' | 'list' | 'history' | 'kill', ...)`, with `read` usually using a tailed or incremental window during monitoring, but the behavioral rule stays the same:
|
|
694
|
-
do not report completion until the run is actually done and the outputs are real.
|
|
695
|
-
After each meaningful check, notify the user through `artifact.interact(kind='progress', ...)` with current status, latest evidence, and the next monitoring point.
|
|
696
|
-
Do this after every completed wait cycle for important long-running work; do not skip several sleep windows without reporting.
|
|
697
|
-
When structured progress markers are available, include `eta` and preferably `next_reply_at` or `next_check_at` so the UI can show the next expected update time.
|
|
698
|
-
|
|
699
|
-
Do not silently widen scope from “baseline reproduction” into “new method exploration”.
|
|
700
|
-
|
|
701
|
-
### Phase 4. Verification
|
|
334
|
+
- checkpoint only explainable, minimal code changes
|
|
335
|
+
- prefer equivalence-preserving efficiency gains such as larger safe batch size, cache reuse, checkpoint resume, and parallel downloads or workers
|
|
336
|
+
- do not use an efficiency lever if it changes accepted baseline meaning, effective evaluation contract, or trust judgment
|
|
337
|
+
|
|
338
|
+
Long-running execution discipline:
|
|
339
|
+
|
|
340
|
+
- run one bounded smoke test before a substantial baseline reproduction
|
|
341
|
+
- once the smoke test passes, launch the real baseline reproduction with `bash_exec(mode='detach', ...)`
|
|
342
|
+
- monitor by forward progress instead of by short-window completion anxiety
|
|
343
|
+
- do not report final success until the command actually finished and the expected result files exist
|
|
344
|
+
- if you need to recover ids or inspect session state, use `bash_exec(mode='history')` or `bash_exec(mode='list')`
|
|
345
|
+
- `bash_exec(mode='read', id=...)` returns the full saved log when it is `2000 lines or fewer`; for longer logs, inspect omitted middle windows with `start` and `tail`
|
|
346
|
+
- during monitoring, prefer `bash_exec(mode='read', id=..., tail_limit=..., order='desc')`, and after the first read prefer incremental checks with `after_seq=last_seen_seq`
|
|
347
|
+
- use `silent_seconds`, `progress_age_seconds`, `signal_age_seconds`, and `watchdog_overdue` as the default staleness clues
|
|
348
|
+
- if a run is clearly invalid, wedged, or superseded, stop it with `bash_exec(mode='kill', id=..., wait=true, timeout_seconds=...)`, document why, and relaunch cleanly
|
|
349
|
+
- do not let more than the `30-minute visibility bound` pass without a real inspection and a `next expected update time`
|
|
350
|
+
- when the baseline code is under your control, prefer a throttled `tqdm` progress reporter and periodic `__DS_PROGRESS__` markers when feasible
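The visibility bound above can be expressed as a simple check. This sketch assumes the staleness fields arrive as a plain dictionary and that `watchdog_overdue` is a boolean; neither shape is confirmed by the text, and `needs_inspection` is a hypothetical helper, not a `bash_exec` API.

```python
# The 30-minute visibility bound named in the rules above, in seconds.
VISIBILITY_BOUND_SECONDS = 30 * 60


def needs_inspection(seconds_since_last_inspection: float, status: dict) -> bool:
    """Decide whether a detached run should be inspected now.

    `status` is a hypothetical dict carrying the staleness fields named above.
    """
    if seconds_since_last_inspection >= VISIBILITY_BOUND_SECONDS:
        return True
    # Assumption: `watchdog_overdue` is a boolean flag when present.
    return bool(status.get("watchdog_overdue", False))
```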
|
|
351
|
+
|
|
352
|
+
Keep retries bounded:
|
|
353
|
+
|
|
354
|
+
- one smoke test is the default
|
|
355
|
+
- one autonomous fix-and-retry for the same failure class is the normal upper bound
|
|
356
|
+
- if the same failure class returns, stop looping
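The retry bound can be tracked with a small counter per failure class. This is a sketch under assumptions: the class name `RetryBudget` and the idea of labelling failures with a string are illustrative and do not come from the package.

```python
from collections import Counter


class RetryBudget:
    """Allow one autonomous fix-and-retry per failure class, then stop."""

    def __init__(self, max_retries_per_class: int = 1) -> None:
        self.max_retries_per_class = max_retries_per_class
        self._retries: Counter = Counter()

    def may_retry(self, failure_class: str) -> bool:
        """Record a failure and report whether another autonomous retry is allowed."""
        if self._retries[failure_class] >= self.max_retries_per_class:
            # Same failure class returned: stop looping.
            return False
        self._retries[failure_class] += 1
        return True
```

The first failure of a class consumes its single retry; a second failure of the same class returns `False`, which is the signal to stop and escalate rather than loop.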
|
|
357
|
+
|
|
358
|
+
## Phase 4. Verification
|
|
702
359
|
|
|
703
360
|
Verification is mandatory before baseline acceptance.
|
|
704
361
|
|
|
@@ -707,32 +364,17 @@ Verify:
|
|
|
707
364
|
- the run actually finished
|
|
708
365
|
- the reported metrics came from the intended dataset and split
|
|
709
366
|
- the metric definitions match the quest contract
|
|
710
|
-
- the result is comparable to the paper, source repo, or selected
|
|
367
|
+
- the result is comparable to the paper, source repo, or selected target
|
|
711
368
|
- any deviations are explicitly stated
|
|
712
369
|
|
|
713
|
-
Classify the outcome:
|
|
370
|
+
Classify the outcome as one of:
|
|
714
371
|
|
|
715
372
|
- `verified_match`
|
|
716
373
|
- `verified_close`
|
|
717
374
|
- `verified_diverged`
|
|
718
375
|
- `broken`
|
|
719
376
|
|
|
720
|
-
Verification must
|
|
721
|
-
Do not accept any of the following without explanation:
|
|
722
|
-
|
|
723
|
-
- missing result files
|
|
724
|
-
- metrics that cannot be traced to an actual run
|
|
725
|
-
- metric definitions that do not match the quest contract
|
|
726
|
-
- unexplained mismatch versus the intended paper or source repo setup
|
|
727
|
-
|
|
728
|
-
Verification-phase interaction rules:
|
|
729
|
-
|
|
730
|
-
- do not ask new questions during verification unless the stage has genuinely fallen back to analysis
|
|
731
|
-
- if requirements, scope, or permissions changed materially, stop verification and return to the analysis phase explicitly
|
|
732
|
-
- verification should summarize real progress milestones rather than quoting raw internal progress markers
|
|
733
|
-
- structured progress markers are for runtime monitoring, not for final verification prose
|
|
734
|
-
|
|
735
|
-
If the reproduced result differs materially from the source reference, verification should explicitly separate:
|
|
377
|
+
Verification must explicitly separate:
|
|
736
378
|
|
|
737
379
|
- likely implementation mismatch
|
|
738
380
|
- environment mismatch
|
|
@@ -740,43 +382,64 @@ If the reproduced result differs materially from the source reference, verificat
|
|
|
740
382
|
- expected stochastic variance
|
|
741
383
|
- unexplained divergence
|
|
742
384
|
|
|
743
|
-
Verification should
|
|
385
|
+
Verification should answer:
|
|
744
386
|
|
|
745
387
|
- whether the baseline is trustworthy enough for downstream comparison
|
|
746
388
|
- whether the result is reusable beyond this quest
|
|
747
389
|
- whether another repair or rerun is justified
|
|
748
|
-
- whether the
|
|
390
|
+
- whether the line should stop here and hand off
|
|
391
|
+
|
|
392
|
+
A verification report should be self-contained enough that a later stage can answer:
|
|
393
|
+
|
|
394
|
+
- what was used
|
|
395
|
+
- how it was obtained: attach, import, reproduce, or repair
|
|
396
|
+
- what commands and configs were used
|
|
397
|
+
- what metrics are trusted
|
|
398
|
+
- what caveats remain
|
|
399
|
+
- whether the result is reusable beyond this quest
|
|
749
400
|
|
|
750
|
-
|
|
401
|
+
## Baseline comparability contract
|
|
402
|
+
|
|
403
|
+
The baseline stage is not complete just because something ran.
|
|
404
|
+
It is complete when later stages can compare against it fairly.
|
|
751
405
|
|
|
752
|
-
|
|
753
|
-
- final result files exist
|
|
754
|
-
- exported latest snapshot exists when the workflow expects it
|
|
755
|
-
- metrics are non-empty, non-placeholder, and non-NaN
|
|
756
|
-
- execution notes document the actual commands and outcomes
|
|
757
|
-
- the baseline phase state is ready to hand off
|
|
758
|
-
- the infrastructure needed for reproduction is actually present and usable
|
|
759
|
-
- any closed-loop or key-metric steps expected by the plan were completed or their omission was explicitly documented
|
|
406
|
+
Before declaring a baseline usable, make the comparability contract explicit:
|
|
760
407
|
|
|
761
|
-
|
|
408
|
+
- task identity
|
|
409
|
+
- dataset identity and version
|
|
410
|
+
- split contract
|
|
411
|
+
- preprocessing boundary
|
|
412
|
+
- evaluation script or evaluation path
|
|
413
|
+
- required metric keys
|
|
414
|
+
- metric directions
|
|
415
|
+
- seed policy when relevant
|
|
416
|
+
- source commit or source package identity
|
|
417
|
+
- known deviations from the source reference
|
|
762
418
|
|
|
763
|
-
|
|
764
|
-
|
|
419
|
+
Unless the user explicitly specifies otherwise, treat the original paper's evaluation protocol as the canonical baseline contract.
|
|
420
|
+
If any of these fields are still materially unknown, do not pretend the baseline is a clean downstream reference.
|
|
421
|
+
For the fuller checklist and verdict meanings, read `references/comparability-contract.md`.
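One way to make the contract explicit is to list the fields that are still unknown. The following is a minimal sketch: the field keys are hypothetical spellings of the bullets above, `seed_policy` is left out of the required set because it applies only when relevant, and `None` is taken to mean unknown.

```python
# Hypothetical keys mirroring the contract bullets above.
REQUIRED_CONTRACT_FIELDS = (
    "task_identity",
    "dataset_identity_and_version",
    "split_contract",
    "preprocessing_boundary",
    "evaluation_path",
    "required_metric_keys",
    "metric_directions",
    "source_identity",
    "known_deviations",
)


def unknown_contract_fields(contract: dict) -> list[str]:
    """List required fields that are absent or explicitly None.

    An empty list for `known_deviations` counts as known (no deviations).
    """
    return [name for name in REQUIRED_CONTRACT_FIELDS if contract.get(name) is None]
```

A non-empty result means the baseline should not yet be presented as a clean downstream reference.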
|
|
765
422
|
|
|
766
|
-
|
|
767
|
-
- exported Markdown and/or JSON summaries are actually generated
|
|
768
|
-
- incremental flags such as `--new-only` behave as documented when they are part of the workflow
|
|
423
|
+
## Feasibility and trust classes
|
|
769
424
|
|
|
770
|
-
|
|
425
|
+
Before acceptance, classify feasibility as one of:
|
|
771
426
|
|
|
772
|
-
-
|
|
773
|
-
-
|
|
774
|
-
-
|
|
775
|
-
|
|
427
|
+
- `full_reproducible`
|
|
428
|
+
- `degraded_but_acceptable`
|
|
429
|
+
- `blocked`
|
|
430
|
+
|
|
431
|
+
And classify downstream trust as one of:
|
|
432
|
+
|
|
433
|
+
- `verified`
|
|
434
|
+
- `partially_verified`
|
|
435
|
+
- `operational_but_incomparable`
|
|
436
|
+
- `failed`
|
|
437
|
+
|
|
438
|
+
Do not silently upgrade a degraded or merely operational result into a normal trusted baseline.
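The two classifications can be kept as separate enumerations so that neither is silently folded into the other. This sketch uses the class labels listed above. The gate function `usable_as_trusted_baseline` is hypothetical, and its strict reading (only `full_reproducible` plus `verified` passes) is an assumption, not a rule stated in the text.

```python
from enum import Enum


class Feasibility(Enum):
    FULL_REPRODUCIBLE = "full_reproducible"
    DEGRADED_BUT_ACCEPTABLE = "degraded_but_acceptable"
    BLOCKED = "blocked"


class Trust(Enum):
    VERIFIED = "verified"
    PARTIALLY_VERIFIED = "partially_verified"
    OPERATIONAL_BUT_INCOMPARABLE = "operational_but_incomparable"
    FAILED = "failed"


def usable_as_trusted_baseline(feasibility: Feasibility, trust: Trust) -> bool:
    """Conservative gate: degraded or merely operational results never pass."""
    return feasibility is Feasibility.FULL_REPRODUCIBLE and trust is Trust.VERIFIED
```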
|
|
776
439
|
|
|
777
440
|
## Minimum baseline artifact content
|
|
778
441
|
|
|
779
|
-
The baseline artifact should
|
|
442
|
+
The accepted baseline artifact should include at least:
|
|
780
443
|
|
|
781
444
|
- `baseline_id`
|
|
782
445
|
- `baseline_kind`
|
|
@@ -794,254 +457,32 @@ If variants exist, also include:
|
|
|
794
457
|
- `default_variant_id`
|
|
795
458
|
- `baseline_variants`
|
|
796
459
|
|
|
797
|
-
Metric-contract
|
|
460
|
+
Metric-contract rules:
|
|
798
461
|
|
|
799
|
-
- unless the user explicitly specifies otherwise, treat the original paper's evaluation protocol as the canonical baseline contract
|
|
800
462
|
- if the accepted baseline contract includes multiple metrics, datasets, subtasks, or splits, record all of them in `<baseline_root>/json/metric_contract.json`
|
|
801
|
-
- keep `primary_metric` as the headline metric only; do not let it erase the rest of the
|
|
802
|
-
- when confirming a baseline, submit the canonical `metrics_summary` as a flat top-level dictionary keyed by the paper-facing metric ids
|
|
803
|
-
- every canonical baseline metric entry should include `description`, either `derivation` or `origin_path`, and `source_ref`
|
|
463
|
+
- keep `primary_metric` as the headline metric only; do not let it erase the rest of the comparison surface
|
|
464
|
+
- when confirming a baseline, submit the canonical `metrics_summary` as a flat top-level dictionary keyed by the paper-facing metric ids
|
|
465
|
+
- every canonical baseline metric entry should include `description`, either `derivation` or `origin_path`, and `source_ref`
|
|
804
466
|
- if the paper reports both aggregate and per-dataset or per-task results, preserve both whenever feasible through `metrics_summary` plus structured rows rather than one cherry-picked scalar
|
|
805
|
-
-
|
|
806
|
-
|
|
807
|
-
|
|
808
|
-
|
|
809
|
-
Use compact but structured notes so later stages do not need to reconstruct baseline state from chat history.
|
|
810
|
-
The templates below are references, not prerequisites for the first smoke test.
|
|
811
|
-
For simple baseline lines, keep them short and fill only the sections that matter.
|
|
812
|
-
|
|
813
|
-
Canonical naming for new work:
|
|
814
|
-
|
|
815
|
-
- `PLAN.md` -> use `references/baseline-plan-template.md`
|
|
816
|
-
- `CHECKLIST.md` -> use `references/baseline-checklist-template.md`
|
|
817
|
-
- `analysis_plan.md` and `REPRO_CHECKLIST.md` remain acceptable compatibility aliases when a quest already depends on them
|
|
818
|
-
|
|
819
|
-
### `PLAN.md` or `analysis_plan.md`
|
|
820
|
-
|
|
821
|
-
Recommended shape:
|
|
822
|
-
|
|
823
|
-
```md
|
|
824
|
-
# Baseline Analysis Plan
|
|
825
|
-
|
|
826
|
-
- quest_id:
|
|
827
|
-
- baseline_id:
|
|
828
|
-
- requested_route: attach | import | reproduce | repair
|
|
829
|
-
- recommended_route:
|
|
830
|
-
- source_identity:
|
|
831
|
-
- task:
|
|
832
|
-
- dataset_and_split:
|
|
833
|
-
- metric_contract:
|
|
834
|
-
- expected_reference:
|
|
835
|
-
- feasibility_summary:
|
|
836
|
-
|
|
837
|
-
## Existing evidence
|
|
838
|
-
- published registry entries:
|
|
839
|
-
- local baseline roots:
|
|
840
|
-
- relevant repo paths:
|
|
841
|
-
|
|
842
|
-
## Planned commands
|
|
843
|
-
- inspect:
|
|
844
|
-
- setup:
|
|
845
|
-
- run:
|
|
846
|
-
- verify:
|
|
847
|
-
|
|
848
|
-
## Expected outputs
|
|
849
|
-
- baseline_root:
|
|
850
|
-
- metrics_path:
|
|
851
|
-
- logs_path:
|
|
852
|
-
- export_paths:
|
|
853
|
-
|
|
854
|
-
## Risks
|
|
855
|
-
- risk:
|
|
856
|
-
|
|
857
|
-
## Gate to next phase
|
|
858
|
-
- what must be true before setup starts
|
|
859
|
-
```
|
|
860
|
-
|
|
861
|
-
### `setup.md`
|
|
862
|
-
|
|
863
|
-
Recommended shape:
|
|
864
|
-
|
|
865
|
-
```md
|
|
866
|
-
# Baseline Setup
|
|
467
|
+
- if the source package already has a richer leaderboard table, structured result file, or `json/metric_contract.json`, reuse that richer contract instead of hand-writing a thinner one that keeps only one averaged scalar
|
|
468
|
+
- `Result/metric.md` is optional temporary scratch memory only; reconcile against it before calling `artifact.confirm_baseline(...)`, but do not treat it as a required durable file
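
The metric-contract rules above can be checked before confirmation with a small pre-flight helper. This is a sketch under stated assumptions: `check_metric_contract` is a hypothetical name, "flat" is read as "each metric id maps directly to a plain number", and the per-metric entries are assumed to be available as a dictionary keyed by the same metric ids. None of this is the signature of `artifact.confirm_baseline(...)`.

```python
from numbers import Real

# Field names taken from the rules above.
REQUIRED_FIELDS = ("description", "source_ref")
ONE_OF_FIELDS = ("derivation", "origin_path")


def check_metric_contract(metrics_summary: dict, metric_entries: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the check passed."""
    problems = []
    for metric_id, value in metrics_summary.items():
        # Assumption: a flat summary holds plain numbers, not nested dicts or lists.
        if isinstance(value, bool) or not isinstance(value, Real):
            problems.append(f"{metric_id}: summary value should be a plain number")
        if metric_id not in metric_entries:
            problems.append(f"{metric_id}: no metric entry")
    for metric_id, entry in metric_entries.items():
        for name in REQUIRED_FIELDS:
            if not entry.get(name):
                problems.append(f"{metric_id}: missing {name}")
        # Either derivation or origin_path is enough.
        if not any(entry.get(name) for name in ONE_OF_FIELDS):
            problems.append(f"{metric_id}: needs derivation or origin_path")
    return problems
```

A non-empty result is a cue to fix the contract first instead of submitting a thinner summary.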
|
|
469
|
+
|
|
470
|
+
## Publication and reuse
|
|
867
471
|
|
|
868
|
-
|
|
869
|
-
- route:
|
|
870
|
-
- working_directory:
|
|
871
|
-
- source_origin:
|
|
872
|
-
- source_commit:
|
|
873
|
-
- environment_summary:
|
|
874
|
-
- uv_strategy:
|
|
875
|
-
- python_version:
|
|
876
|
-
- config_paths:
|
|
877
|
-
- command_template:
|
|
878
|
-
|
|
879
|
-
## Directory contract
|
|
880
|
-
- baseline_root:
|
|
881
|
-
- logs_root:
|
|
882
|
-
- results_root:
|
|
883
|
-
- exports_root:
|
|
884
|
-
|
|
885
|
-
## Known deviations
|
|
886
|
-
- deviation:
|
|
887
|
-
|
|
888
|
-
## Ready-for-execution check
|
|
889
|
-
- uv_route_recorded: yes/no
|
|
890
|
-
- dependencies_known: yes/no
|
|
891
|
-
- outputs_defined: yes/no
|
|
892
|
-
- feasible_on_current_machine: yes/no
|
|
893
|
-
```
|
|
894
|
-
|
|
895
|
-
### `execution.md`
|
|
896
|
-
|
|
897
|
-
Recommended shape:
|
|
898
|
-
|
|
899
|
-
```md
|
|
900
|
-
# Baseline Execution
|
|
901
|
-
|
|
902
|
-
- baseline_id:
|
|
903
|
-
- route:
|
|
904
|
-
- run_scope:
|
|
905
|
-
- command_started:
|
|
906
|
-
- started_at:
|
|
907
|
-
- monitoring_plan:
|
|
908
|
-
|
|
909
|
-
## Runtime log pointers
|
|
910
|
-
- stdout_or_main_log:
|
|
911
|
-
- stderr_or_error_log:
|
|
912
|
-
- result_index:
|
|
913
|
-
|
|
914
|
-
## Checkpoints
|
|
915
|
-
- checkpoint:
|
|
916
|
-
|
|
917
|
-
## Final execution state
|
|
918
|
-
- completed_at:
|
|
919
|
-
- exit_status:
|
|
920
|
-
- produced_outputs:
|
|
921
|
-
- reruns_or_repairs:
|
|
922
|
-
```
|
|
923
|
-
|
|
924
|
-
### `verification.md`
|
|
925
|
-
|
|
926
|
-
Recommended shape:
|
|
927
|
-
|
|
928
|
-
```md
|
|
929
|
-
# Baseline Verification
|
|
930
|
-
|
|
931
|
-
- baseline_id:
|
|
932
|
-
- route:
|
|
933
|
-
- verification_outcome: verified_match | verified_close | verified_diverged | broken
|
|
934
|
-
- trusted_for_downstream: yes/no
|
|
935
|
-
- reusable_beyond_quest: yes/no
|
|
936
|
-
- publish_recommended: yes/no
|
|
937
|
-
|
|
938
|
-
## Trusted metrics
|
|
939
|
-
- metric:
|
|
940
|
-
|
|
941
|
-
## Reference comparison
|
|
942
|
-
- expected_reference:
|
|
943
|
-
- observed_result:
|
|
944
|
-
- delta_or_gap:
|
|
945
|
-
|
|
946
|
-
## Evidence paths
|
|
947
|
-
- final_metrics:
|
|
948
|
-
- logs:
|
|
949
|
-
- exports:
|
|
950
|
-
- config_snapshot:
|
|
951
|
-
|
|
952
|
-
## Caveats
|
|
953
|
-
- caveat:
|
|
954
|
-
|
|
955
|
-
## Next recommendation
|
|
956
|
-
- next_anchor:
|
|
957
|
-
- next_action:
|
|
958
|
-
```
|
|
959
|
-
|
|
960
|
-
These notes do not need to be verbose.
|
|
961
|
-
They do need to be complete enough that another stage can read them without replaying the full baseline process.
|
|
962
|
-
|
|
963
|
-
## Artifact payload templates
|
|
964
|
-
|
|
965
|
-
When writing artifacts, prefer a stable field shape.
|
|
966
|
-
|
|
967
|
-
### Route or blocked decision artifact template
|
|
968
|
-
|
|
969
|
-
```json
|
|
970
|
-
{
|
|
971
|
-
"kind": "decision",
|
|
972
|
-
"verdict": "neutral",
|
|
973
|
-
"action": "attach_baseline",
|
|
974
|
-
"reason": "A published baseline already matches the quest task and metric contract.",
|
|
975
|
-
"baseline_id": "baseline-demo",
|
|
976
|
-
"baseline_variant_id": "main",
|
|
977
|
-
"evidence_paths": [
|
|
978
|
-
"<quest_root>/artifacts/reports/report-....json"
|
|
979
|
-
],
|
|
980
|
-
"next_direction": "Attach the baseline and move to verification or idea selection."
|
|
981
|
-
}
|
|
982
|
-
```
|
|
983
|
-
|
|
984
|
-
If blocked, keep the same structure but use a blocked-appropriate action and reason.
|
|
985
|
-
|
|
986
|
-
### Accepted baseline artifact template
|
|
987
|
-
|
|
988
|
-
```json
|
|
989
|
-
{
|
|
990
|
-
"kind": "baseline",
|
|
991
|
-
"publish_global": true,
|
|
992
|
-
"baseline_id": "baseline-demo",
|
|
993
|
-
"name": "Demo baseline",
|
|
994
|
-
"baseline_kind": "reproduced",
|
|
995
|
-
"task": "image-classification",
|
|
996
|
-
"dataset": "CIFAR-10/test",
|
|
997
|
-
"primary_metric": {
|
|
998
|
-
"name": "accuracy",
|
|
999
|
-
"value": 0.943
|
|
1000
|
-
},
|
|
1001
|
-
"metrics_summary": {
|
|
1002
|
-
"accuracy": 0.943
|
|
1003
|
-
},
|
|
1004
|
-
"default_variant_id": "main",
|
|
1005
|
-
"baseline_variants": [
|
|
1006
|
-
{
|
|
1007
|
-
"variant_id": "main",
|
|
1008
|
-
"label": "Main",
|
|
1009
|
-
"metrics_summary": {
|
|
1010
|
-
"accuracy": 0.943
|
|
1011
|
-
}
|
|
1012
|
-
}
|
|
1013
|
-
],
|
|
1014
|
-
"environment": {
|
|
1015
|
-
"python": "3.11"
|
|
1016
|
-
},
|
|
1017
|
-
"summary": "Verified reproduced baseline accepted for downstream comparison.",
|
|
1018
|
-
"path": "<quest_root>/baselines/local/baseline-demo",
|
|
1019
|
-
"source": {
|
|
1020
|
-
"kind": "artifact_publish",
|
|
1021
|
-
"quest_id": "<quest_id>",
|
|
1022
|
-
"quest_root": "<quest_root>"
|
|
1023
|
-
}
|
|
1024
|
-
}
|
|
1025
|
-
```
|
|
1026
|
-
|
|
1027
|
-
Only set `publish_global: true` when verification is complete and reuse is justified.
|
|
1028
|
-
|
|
1029
|
-
## Registry publication and attachment contract
|
|
1030
|
-
|
|
1031
|
-
The baseline skill should use the durable registry deliberately, not as an afterthought.
|
|
472
|
+
Use the registry deliberately, not as an afterthought.
|
|
1032
473
|
|
|
1033
474
|
If the result is reusable beyond the current quest:
|
|
1034
475
|
|
|
1035
476
|
- publish it through `artifact.publish_baseline(...)`
|
|
1036
|
-
- ensure the payload includes
|
|
1037
|
-
-
|
|
477
|
+
- ensure the payload includes identity, provenance, trusted metrics, and any variant structure
|
|
478
|
+
- set `publish_global: true` only when verification is complete and reuse is justified
|
|
1038
479
|
|
|
1039
480
|
If the current quest should reuse an existing baseline:
|
|
1040
481
|
|
|
1041
482
|
- attach it through `artifact.attach_baseline(...)`
|
|
1042
483
|
- preserve the selected `baseline_id`
|
|
1043
484
|
- preserve the selected `variant_id` when one is used
|
|
1044
|
-
-
|
|
485
|
+
- keep the attachment durable under `baselines/imported/`
|
|
1045
486
|
|
|
1046
487
|
If runtime state already includes `requested_baseline_ref` or a matching `confirmed_baseline_ref`:
|
|
1047
488
|
|
|
@@ -1049,221 +490,55 @@ If runtime state already includes `requested_baseline_ref` or a matching `confir
|
|
|
1049
490
|
- treat a creation-time pre-bound baseline as the active starting point unless you find a concrete incompatibility
|
|
1050
491
|
- do not rerun broad baseline scouting or full reproduction just because the stage name is `baseline`
|
|
1051
492
|
|
|
1052
|
-
|
|
1053
|
-
|
|
1054
|
-
|
|
1055
|
-
## Verification report expectations
|
|
1056
|
-
|
|
1057
|
-
A baseline verification report should answer:
|
|
1058
|
-
|
|
1059
|
-
- what baseline was used
|
|
1060
|
-
- how it was obtained: attach, import, reproduce, or repair
|
|
1061
|
-
- what commands and configs were used
|
|
1062
|
-
- what metrics are trusted
|
|
1063
|
-
- how the result compares with the expected reference
|
|
1064
|
-
- what caveats remain
|
|
1065
|
-
|
|
1066
|
-
The report should also include:
|
|
1067
|
-
|
|
1068
|
-
- whether the run should be trusted for downstream comparison
|
|
1069
|
-
- whether the baseline is reusable beyond this quest
|
|
1070
|
-
- whether another repair or rerun is justified
|
|
1071
|
-
|
|
1072
|
-
The verification report should be strong enough that later `idea`, `experiment`, and `write` stages can cite the baseline setup without reconstructing it from scratch.
|
|
1073
|
-
|
|
1074
|
-
It should ideally also function as a self-contained reproduction note describing:
|
|
1075
|
-
|
|
1076
|
-
- baseline identity
|
|
1077
|
-
- source provenance
|
|
1078
|
-
- key commands
|
|
1079
|
-
- environment assumptions
|
|
1080
|
-
- result locations
|
|
1081
|
-
- trusted interpretation of the outcome
|
|
1082
|
-
|
|
1083
|
-
If the baseline line is meant to be reused later, the final report should be self-contained enough that another stage can answer:
|
|
1084
|
-
|
|
1085
|
-
- what to run
|
|
1086
|
-
- where to run it
|
|
1087
|
-
- what outputs should appear
|
|
1088
|
-
- how to interpret those outputs
|
|
1089
|
-
|
|
1090
|
-
without reopening the whole reproduction process from scratch.
|
|
1091
|
-
|
|
1092
|
-
When useful, generate a single merged reproduction report that includes:
|
|
1093
|
-
|
|
1094
|
-
- structure overview
|
|
1095
|
-
- modification summary
|
|
1096
|
-
- testing commands
|
|
1097
|
-
- device and environment summary
|
|
1098
|
-
- baseline status and blockers
|
|
1099
|
-
- redacted configuration inventory
|
|
1100
|
-
- key implementation measures
|
|
1101
|
-
- core method equations or mathematical notes when they matter for later understanding
|
|
1102
|
-
- results table
|
|
1103
|
-
- export paths
|
|
1104
|
-
|
|
1105
|
-
For a reusable baseline package checklist, read `references/publishable-baseline-package.md`.
|
|
493
|
+
For a clearer attach/import/reproduce/repair rubric, read `references/route-selection.md`.
|
|
494
|
+
For reusable-package expectations, read `references/publishable-baseline-package.md`.
|
|
1106
495
|
|
|
1107
|
-
##
|
|
496
|
+
## Workspace and branch rules
|
|
1108
497
|
|
|
1109
|
-
-
|
|
1110
|
-
-
|
|
1111
|
-
-
|
|
1112
|
-
-
|
|
1113
|
-
|
|
1114
|
-
The baseline stage should not build a parallel Git lifecycle of its own.
|
|
1115
|
-
Branching and promotion remain quest-level concerns.
|
|
1116
|
-
|
|
1117
|
-
However, if baseline setup materially changed code or scripts, preserve at least:
|
|
1118
|
-
|
|
1119
|
-
- an initial snapshot of the baseline workspace state
|
|
1120
|
-
- a final snapshot after setup/execution changes
|
|
1121
|
-
|
|
1122
|
-
so the quest can later audit what changed during reproduction.
|
|
1123
|
-
|
|
1124
|
-
If the workflow uses a baseline-local Git snapshot for audit, treat it as an execution snapshot only.
|
|
1125
|
-
The quest repo remains the durable authority for promotion and narrative state.
|
|
498
|
+
- treat the baseline workspace as a system-managed reproduction surface, not an unrelated sandbox
|
|
499
|
+
- avoid creating a nested authoritative Git lifecycle inside the baseline workspace
|
|
500
|
+
- use the quest branch unless isolation is genuinely needed
|
|
501
|
+
- if baseline setup is risky or intrusive, prepare an isolated branch or worktree first and record why
|
|
502
|
+
- do not proliferate branches without a reason
|
|
1126
503
|
|
|
1127
504
|
## Memory rules
|
|
1128
505
|
|
|
1129
506
|
Stage-start requirement:
|
|
1130
507
|
|
|
1131
|
-
- begin every baseline pass with `memory.list_recent(scope='quest', limit=5)`
|
|
508
|
+
- by default, begin every baseline pass with `memory.list_recent(scope='quest', limit=5)`
|
|
1132
509
|
- then run at least one baseline-relevant `memory.search(...)` before new baseline analysis, repair, or rerun work
|
|
1133
|
-
- if
|
|
510
|
+
- fast-path exception: if the quest already exposes a clear `requested_baseline_ref` or `confirmed_baseline_ref` and the immediate task is only to validate or reattach that concrete baseline, you may skip broad retrieval
|
|
1134
511
|
|
|
1135
|
-
Write
|
|
512
|
+
Write memory only for reusable lessons such as:
|
|
1136
513
|
|
|
1137
|
-
- baseline pitfalls
|
|
1138
|
-
- environment gotchas
|
|
1139
|
-
- dataset quirks
|
|
1140
514
|
- paper-to-code mismatch notes
|
|
1141
|
-
|
|
1142
|
-
|
|
1143
|
-
|
|
1144
|
-
|
|
1145
|
-
|
|
1146
|
-
- quest `papers`:
|
|
1147
|
-
- paper-to-code mismatch notes
|
|
1148
|
-
- baseline paper caveats
|
|
1149
|
-
- quest `decisions`:
|
|
1150
|
-
- attach / import / reproduce / repair rationale
|
|
1151
|
-
- accepted-versus-rejected baseline route choices
|
|
1152
|
-
- quest `episodes`:
|
|
1153
|
-
- setup failures
|
|
1154
|
-
- execution failures
|
|
1155
|
-
- environment incidents
|
|
1156
|
-
- suspicious or divergent baseline runs
|
|
1157
|
-
- quest `knowledge`:
|
|
1158
|
-
- verified metric contract
|
|
1159
|
-
- stable setup rules
|
|
1160
|
-
- data and evaluation caveats
|
|
1161
|
-
- reproducibility lessons that matter later in this quest
|
|
1162
|
-
- global `knowledge`:
|
|
1163
|
-
- reusable reproduction heuristics
|
|
1164
|
-
- stable verification heuristics
|
|
1165
|
-
- cross-quest baseline debugging lessons
|
|
1166
|
-
- global `templates`:
|
|
1167
|
-
- setup checklist templates
|
|
1168
|
-
- verification checklist templates
|
|
1169
|
-
- publishable baseline package templates
|
|
1170
|
-
|
|
1171
|
-
Useful tags include:
|
|
1172
|
-
|
|
1173
|
-
- `stage:baseline`
|
|
1174
|
-
- `baseline:<baseline_id>`
|
|
1175
|
-
- `type:repro-lesson`
|
|
1176
|
-
- `type:verification-caveat`
|
|
1177
|
-
- `type:environment-incident`
|
|
1178
|
-
- `topic:<dataset-or-method>`
|
|
515
|
+
- environment incidents
|
|
516
|
+
- dataset quirks
|
|
517
|
+
- verification caveats
|
|
518
|
+
- attach vs import vs reproduce vs repair rationale
|
|
1179
519
|
|
|
1180
520
|
When calling `memory.write(...)`, pass `tags` as an array like `["stage:baseline", "baseline:<baseline_id>", "type:repro-lesson"]`, not as one comma-joined string.
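
To make the array-versus-string distinction concrete, here is a small sketch of a tag builder. The `baseline_tags` helper is hypothetical and does not call `memory.write(...)` itself; it only shows the list shape described above and rejects a comma-joined string passed by mistake.

```python
def baseline_tags(baseline_id: str, *extra: str) -> list[str]:
    """Build the tag list as separate items, one tag per element."""
    tags = ["stage:baseline", f"baseline:{baseline_id}", *extra]
    for tag in tags:
        # A comma inside a single tag usually means several tags were joined into one string.
        if "," in tag:
            raise ValueError(f"pass tags as separate items, not comma-joined: {tag!r}")
    return tags
```

The returned list is what would be passed as the `tags` value.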
|
|
1181
521
|
|
|
1182
|
-
Recommended read timing:
|
|
1183
|
-
|
|
1184
|
-
- before route selection:
|
|
1185
|
-
- consult quest `decisions`, `knowledge`, and relevant `papers`
|
|
1186
|
-
- before reruns or repairs:
|
|
1187
|
-
- search quest `episodes` first
|
|
1188
|
-
- before acceptance:
|
|
1189
|
-
- re-check quest `knowledge` and `decisions`
|
|
1190
|
-
- before publishing globally:
|
|
1191
|
-
- confirm the lesson is truly reusable and not only quest-local
|
|
1192
|
-
|
|
1193
522
|
Stage-end requirement:
|
|
1194
523
|
|
|
1195
524
|
- if baseline work produced a durable reproduction lesson, verification caveat, environment incident, or route rationale, write at least one `memory.write(...)` before leaving the stage
|
|
1196
525
|
|
|
1197
|
-
For a fuller memory strategy, read `references/memory-playbook.md`.
|
|
1198
|
-
|
|
1199
526
|
## Artifact rules
|
|
1200
527
|
|
|
1201
528
|
Typical artifact sequence:
|
|
1202
529
|
|
|
1203
|
-
- progress
|
|
1204
|
-
- report
|
|
1205
|
-
-
|
|
1206
|
-
-
|
|
1207
|
-
|
|
1208
|
-
If a reusable baseline was established, prefer recording it in a form that later stages can attach or reuse directly instead of forcing redundant reproduction.
|
|
1209
|
-
|
|
1210
|
-
Use `artifact.attach_baseline(...)` or `artifact.publish_baseline(...)` when appropriate.
|
|
530
|
+
- `progress` for long-running setup or execution checkpoints
|
|
531
|
+
- `report` for analysis notes or verification notes
|
|
532
|
+
- `decision` for route choice, blocked routing, or accept/reject/rerun/repair calls
|
|
533
|
+
- `baseline` only for an accepted baseline record
|
|
1211
534
|
|
|
1212
|
-
|
|
535
|
+
For stable field shapes, read `references/artifact-payload-examples.md`.
|
|
1213
536
|
|
|
1214
|
-
|
|
1215
|
-
- route selection
|
|
1216
|
-
- blocked-state routing
|
|
1217
|
-
- accept / reject / rerun / repair choices
|
|
1218
|
-
- use `report` for:
|
|
1219
|
-
- analysis notes
|
|
1220
|
-
- verification reports
|
|
1221
|
-
- merged reproduction reports
|
|
1222
|
-
- comparability-contract summaries
|
|
1223
|
-
- use `progress` during long-running setup or execution
|
|
1224
|
-
- use `baseline` only for an accepted baseline record
|
|
1225
|
-
- use `approval` only if an explicit user approval was needed for a costly or degraded baseline gate
|
|
1226
|
-
|
|
1227
|
-
## Handoff contract
|
|
1228
|
-
|
|
1229
|
-
Before handing the quest to `idea`, `experiment`, or `write`, the baseline stage should make the next stage's life easy.
|
|
1230
|
-
|
|
1231
|
-
At minimum, downstream stages should be able to answer all of the following without reopening the full reproduction investigation:
|
|
1232
|
-
|
|
1233
|
-
- which baseline is active
|
|
1234
|
-
- which route produced it: attach, import, reproduce, or repair
|
|
1235
|
-
- which metrics are trusted
|
|
1236
|
-
- where the baseline outputs and logs live
|
|
1237
|
-
- what caveats or deviations still matter
|
|
1238
|
-
- whether the baseline is quest-local only or globally reusable
|
|
1239
|
-
|
|
1240
|
-
## Publication and reuse rules
|
|
1241
|
-
|
|
1242
|
-
Publish or attach baselines deliberately.
|
|
1243
|
-
|
|
1244
|
-
- attach when a trusted reusable baseline already exists and is the right reference for this quest
|
|
1245
|
-
- publish when this quest produced a verified reusable baseline that later quests should be able to reuse
|
|
1246
|
-
- do not publish a blocked, speculative, or verification-incomplete baseline
|
|
1247
|
-
- do not attach a baseline without explaining why it is the correct downstream reference
|
|
1248
|
-
|
|
1249
|
-
If a baseline is accepted but not globally reusable, say that explicitly instead of leaving the reuse status ambiguous.
|
|
1250
|
-
|
|
1251
|
-
The baseline stage should normally hand off with:
|
|
1252
|
-
|
|
1253
|
-
- one accepted baseline artifact
|
|
1254
|
-
- one verification-oriented report artifact
|
|
1255
|
-
- one active baseline reference through attachment or accepted local baseline state
|
|
1256
|
-
- one concise next-step guidance statement or decision artifact when the next anchor is not obvious
|
|
1257
|
-
|
|
1258
|
-
## Final handoff packet
|
|
1259
|
-
|
|
1260
|
-
Before leaving the baseline stage, make sure the next stage can read a compact handoff packet from durable state.
|
|
1261
|
-
|
|
1262
|
-
The handoff packet should make these items obvious:
|
|
537
|
+
The baseline handoff should make these items obvious:
|
|
1263
538
|
|
|
1264
539
|
- `baseline_id`
|
|
1265
540
|
- `baseline_variant_id` when relevant
|
|
1266
|
-
- route used: attach
|
|
541
|
+
- route used: attach, import, reproduce, or repair
|
|
1267
542
|
- trusted metrics
|
|
1268
543
|
- canonical metric contract JSON path
|
|
1269
544
|
- verification outcome
|
|
@@ -1272,38 +547,36 @@ The handoff packet should make these items obvious:
|
|
|
1272
547
|
- main caveats
|
|
1273
548
|
- recommended next anchor
|
|
1274
549
|
|
|
1275
|
-
If this packet is not obvious from the artifact plus verification note, the baseline
|
|
550
|
+
If this packet is not obvious from the accepted artifact plus verification note, the baseline line is not stable enough yet.
|
|
1276
551
|
|
|
1277
552
|
## Failure and blocked handling
|
|
1278
553
|
|
|
1279
|
-
Do not hide
|
|
554
|
+
Do not hide failures.
|
|
1280
555
|
|
|
1281
|
-
If blocked, record
|
|
556
|
+
If blocked, record the class explicitly:
|
|
1282
557
|
|
|
1283
|
-
- missing_source
|
|
1284
|
-
- missing_code
|
|
1285
|
-
- missing_metric_contract
|
|
1286
|
-
- environment_infeasible
|
|
1287
|
-
- command_unknown
|
|
1288
|
-
- run_failed
|
|
1289
|
-
- verification_failed
|
|
558
|
+
- `missing_source`
|
|
559
|
+
- `missing_code`
|
|
560
|
+
- `missing_metric_contract`
|
|
561
|
+
- `environment_infeasible`
|
|
562
|
+
- `command_unknown`
|
|
563
|
+
- `run_failed`
|
|
564
|
+
- `verification_failed`
|
|
1290
565
|
|
|
1291
|
-
A blocked
|
|
566
|
+
A blocked result must state:
|
|
1292
567
|
|
|
1293
568
|
- what failed
|
|
1294
569
|
- what was tried
|
|
1295
|
-
- which paths
|
|
1296
|
-
- whether the next best move is attach
|
|
1297
|
-
|
|
1298
|
-
If the failure happened after a long-running task, include the monitored command/log path rather than only a prose description.
|
|
570
|
+
- which paths or logs show the issue
|
|
571
|
+
- whether the next best move is attach, import, retry, repair, reset, or ask the user
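
The blocked classes and the four required statements above can be captured in one small record. This is an illustrative sketch only: the `BlockedResult` dataclass, its field names, and the `ask_user` identifier for "ask the user" are assumptions, not the package's artifact schema.

```python
from dataclasses import dataclass

# Blocked classes copied from the list above.
BLOCKED_CLASSES = {
    "missing_source", "missing_code", "missing_metric_contract",
    "environment_infeasible", "command_unknown", "run_failed",
    "verification_failed",
}
# Next moves from the list above; "ask_user" is an assumed identifier.
NEXT_MOVES = {"attach", "import", "retry", "repair", "reset", "ask_user"}


@dataclass
class BlockedResult:
    blocked_class: str
    what_failed: str
    what_was_tried: list[str]
    evidence_paths: list[str]  # paths or logs that show the issue
    next_move: str

    def problems(self) -> list[str]:
        """Return what is still missing; an empty list means the record is complete."""
        issues = []
        if self.blocked_class not in BLOCKED_CLASSES:
            issues.append(f"unknown blocked class: {self.blocked_class}")
        if not self.what_failed.strip():
            issues.append("what_failed is empty")
        if not self.what_was_tried:
            issues.append("what_was_tried is empty")
        if not self.evidence_paths:
            issues.append("no paths or logs recorded")
        if self.next_move not in NEXT_MOVES:
            issues.append(f"unknown next move: {self.next_move}")
        return issues
```

Checking `problems()` before writing the blocked decision keeps a failure from being recorded as prose alone.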
|
|
1299
572
|
|
|
1300
|
-
|
|
573
|
+
Reasonable autonomous fixes before escalation:
|
|
1301
574
|
|
|
1302
575
|
- missing module or dependency
|
|
1303
576
|
- wrong dataset path
|
|
1304
577
|
- permission errors on scripts
|
|
1305
578
|
- reasonable batch-size reductions for OOM
|
|
1306
|
-
- obvious environment activation
|
|
579
|
+
- obvious environment activation mistakes
|
|
1307
580
|
|
|
1308
581
|
If a fix would change confirmed scope, metrics, permissions, or resource assumptions, stop and return to analysis rather than applying it silently.
|
|
1309
582
|
|