@researai/deepscientist 1.5.6 → 1.5.8
- package/README.md +32 -0
- package/bin/ds.js +274 -18
- package/docs/en/07_MEMORY_AND_MCP.md +40 -3
- package/docs/en/99_ACKNOWLEDGEMENTS.md +1 -0
- package/docs/zh/07_MEMORY_AND_MCP.md +40 -3
- package/docs/zh/99_ACKNOWLEDGEMENTS.md +1 -0
- package/install.sh +34 -0
- package/package.json +1 -1
- package/pyproject.toml +1 -1
- package/src/deepscientist/__init__.py +1 -1
- package/src/deepscientist/acp/envelope.py +1 -0
- package/src/deepscientist/artifact/metrics.py +813 -80
- package/src/deepscientist/artifact/schemas.py +1 -0
- package/src/deepscientist/artifact/service.py +1101 -99
- package/src/deepscientist/bash_exec/monitor.py +1 -1
- package/src/deepscientist/bash_exec/service.py +17 -9
- package/src/deepscientist/channels/qq.py +17 -0
- package/src/deepscientist/channels/relay.py +16 -0
- package/src/deepscientist/cli.py +1 -1
- package/src/deepscientist/config/models.py +12 -6
- package/src/deepscientist/config/service.py +75 -2
- package/src/deepscientist/connector_profiles.py +34 -6
- package/src/deepscientist/daemon/api/handlers.py +290 -15
- package/src/deepscientist/daemon/api/router.py +2 -0
- package/src/deepscientist/daemon/app.py +521 -23
- package/src/deepscientist/gitops/diff.py +6 -10
- package/src/deepscientist/mcp/server.py +188 -39
- package/src/deepscientist/prompts/builder.py +71 -22
- package/src/deepscientist/qq_profiles.py +19 -9
- package/src/deepscientist/quest/layout.py +1 -0
- package/src/deepscientist/quest/service.py +83 -34
- package/src/deepscientist/quest/stage_views.py +74 -29
- package/src/deepscientist/runners/codex.py +32 -14
- package/src/deepscientist/runners/runtime_overrides.py +46 -0
- package/src/deepscientist/skills/installer.py +7 -0
- package/src/prompts/connectors/qq.md +1 -1
- package/src/prompts/contracts/shared_interaction.md +14 -0
- package/src/prompts/system.md +134 -30
- package/src/skills/analysis-campaign/SKILL.md +34 -8
- package/src/skills/analysis-campaign/references/campaign-checklist-template.md +41 -0
- package/src/skills/analysis-campaign/references/campaign-plan-template.md +68 -0
- package/src/skills/baseline/SKILL.md +145 -32
- package/src/skills/baseline/references/baseline-checklist-template.md +57 -0
- package/src/skills/baseline/references/baseline-plan-template.md +105 -0
- package/src/skills/decision/SKILL.md +12 -8
- package/src/skills/experiment/SKILL.md +51 -9
- package/src/skills/experiment/references/main-experiment-checklist-template.md +52 -0
- package/src/skills/experiment/references/main-experiment-plan-template.md +79 -0
- package/src/skills/figure-polish/SKILL.md +1 -0
- package/src/skills/finalize/SKILL.md +3 -8
- package/src/skills/idea/SKILL.md +2 -8
- package/src/skills/intake-audit/SKILL.md +2 -8
- package/src/skills/rebuttal/SKILL.md +2 -8
- package/src/skills/review/SKILL.md +2 -8
- package/src/skills/scout/SKILL.md +2 -8
- package/src/skills/write/SKILL.md +52 -16
- package/src/skills/write/templates/DEEPSCIENTIST_NOTES.md +21 -0
- package/src/skills/write/templates/README.md +408 -0
- package/src/skills/write/templates/UPSTREAM_LICENSE.txt +21 -0
- package/src/skills/write/templates/aaai2026/README.md +534 -0
- package/src/skills/write/templates/aaai2026/aaai2026-unified-supp.tex +144 -0
- package/src/skills/write/templates/aaai2026/aaai2026-unified-template.tex +952 -0
- package/src/skills/write/templates/aaai2026/aaai2026.bib +111 -0
- package/src/skills/write/templates/aaai2026/aaai2026.bst +1493 -0
- package/src/skills/write/templates/aaai2026/aaai2026.sty +315 -0
- package/src/skills/write/templates/acl/README.md +50 -0
- package/src/skills/write/templates/acl/acl.sty +312 -0
- package/src/skills/write/templates/acl/acl_latex.tex +377 -0
- package/src/skills/write/templates/acl/acl_lualatex.tex +101 -0
- package/src/skills/write/templates/acl/acl_natbib.bst +1940 -0
- package/src/skills/write/templates/acl/anthology.bib.txt +26 -0
- package/src/skills/write/templates/acl/custom.bib +70 -0
- package/src/skills/write/templates/acl/formatting.md +326 -0
- package/src/skills/write/templates/asplos2027/main.tex +459 -0
- package/src/skills/write/templates/asplos2027/references.bib +135 -0
- package/src/skills/write/templates/colm2025/README.md +3 -0
- package/src/skills/write/templates/colm2025/colm2025_conference.bib +11 -0
- package/src/skills/write/templates/colm2025/colm2025_conference.bst +1440 -0
- package/src/skills/write/templates/colm2025/colm2025_conference.sty +218 -0
- package/src/skills/write/templates/colm2025/colm2025_conference.tex +305 -0
- package/src/skills/write/templates/colm2025/fancyhdr.sty +485 -0
- package/src/skills/write/templates/colm2025/math_commands.tex +508 -0
- package/src/skills/write/templates/colm2025/natbib.sty +1246 -0
- package/src/skills/write/templates/iclr2026/fancyhdr.sty +485 -0
- package/src/skills/write/templates/iclr2026/iclr2026_conference.bib +24 -0
- package/src/skills/write/templates/iclr2026/iclr2026_conference.bst +1440 -0
- package/src/skills/write/templates/iclr2026/iclr2026_conference.sty +246 -0
- package/src/skills/write/templates/iclr2026/iclr2026_conference.tex +414 -0
- package/src/skills/write/templates/iclr2026/math_commands.tex +508 -0
- package/src/skills/write/templates/iclr2026/natbib.sty +1246 -0
- package/src/skills/write/templates/icml2026/algorithm.sty +79 -0
- package/src/skills/write/templates/icml2026/algorithmic.sty +201 -0
- package/src/skills/write/templates/icml2026/example_paper.bib +75 -0
- package/src/skills/write/templates/icml2026/example_paper.tex +662 -0
- package/src/skills/write/templates/icml2026/fancyhdr.sty +864 -0
- package/src/skills/write/templates/icml2026/icml2026.bst +1443 -0
- package/src/skills/write/templates/icml2026/icml2026.sty +767 -0
- package/src/skills/write/templates/neurips2025/Makefile +36 -0
- package/src/skills/write/templates/neurips2025/extra_pkgs.tex +53 -0
- package/src/skills/write/templates/neurips2025/main.tex +38 -0
- package/src/skills/write/templates/neurips2025/neurips.sty +382 -0
- package/src/skills/write/templates/nsdi2027/main.tex +426 -0
- package/src/skills/write/templates/nsdi2027/references.bib +151 -0
- package/src/skills/write/templates/nsdi2027/usenix-2020-09.sty +83 -0
- package/src/skills/write/templates/osdi2026/main.tex +429 -0
- package/src/skills/write/templates/osdi2026/references.bib +150 -0
- package/src/skills/write/templates/osdi2026/usenix-2020-09.sty +83 -0
- package/src/skills/write/templates/sosp2026/main.tex +532 -0
- package/src/skills/write/templates/sosp2026/references.bib +148 -0
- package/src/tui/package.json +1 -1
- package/src/ui/dist/assets/{AiManusChatView-BGLArZRn.js → AiManusChatView-m2FNtwbn.js} +110 -14
- package/src/ui/dist/assets/{AnalysisPlugin-BgDGSigG.js → AnalysisPlugin-BMTF8EGL.js} +1 -1
- package/src/ui/dist/assets/{AutoFigurePlugin-B65HD7L4.js → AutoFigurePlugin-DxPdMUNb.js} +5 -5
- package/src/ui/dist/assets/{CliPlugin-CUqgsFHC.js → CliPlugin-BEOWgxCI.js} +9 -9
- package/src/ui/dist/assets/{CodeEditorPlugin-CF5EdvaS.js → CodeEditorPlugin-BCXvjqmb.js} +8 -8
- package/src/ui/dist/assets/{CodeViewerPlugin-DEeU063D.js → CodeViewerPlugin-DaJcy3nD.js} +5 -5
- package/src/ui/dist/assets/{DocViewerPlugin-Df-FuDlZ.js → DocViewerPlugin-ByfeIq4K.js} +3 -3
- package/src/ui/dist/assets/{GitDiffViewerPlugin-RAnNaRxM.js → GitDiffViewerPlugin-Cksf3VZ-.js} +830 -86
- package/src/ui/dist/assets/{ImageViewerPlugin-DXJ0ZJGg.js → ImageViewerPlugin-CFz-OsTS.js} +5 -5
- package/src/ui/dist/assets/{LabCopilotPanel-BlO-sKsj.js → LabCopilotPanel-CJ1cJzoX.js} +10 -10
- package/src/ui/dist/assets/{LabPlugin-BajPZW5v.js → LabPlugin-BF3dVJwa.js} +1 -1
- package/src/ui/dist/assets/{LatexPlugin-F1OEol8D.js → LatexPlugin-DDkwZ6Sj.js} +7 -7
- package/src/ui/dist/assets/{MarkdownViewerPlugin-MhUupqwT.js → MarkdownViewerPlugin-HAuvurcT.js} +4 -4
- package/src/ui/dist/assets/{MarketplacePlugin-DxhIEsv0.js → MarketplacePlugin-BtoTYy2C.js} +3 -3
- package/src/ui/dist/assets/{index-B-2scqCJ.js → NotebookEditor-CSJYx7b-.js} +12 -155
- package/src/ui/dist/assets/{NotebookEditor-q7TkhewC.js → NotebookEditor-DQgRezm_.js} +1 -1
- package/src/ui/dist/assets/{PdfLoader-B8ZOTKFc.js → PdfLoader-DPa_-fv6.js} +1 -1
- package/src/ui/dist/assets/{PdfMarkdownPlugin-xFPvzvWh.js → PdfMarkdownPlugin-BZpXOEjm.js} +3 -3
- package/src/ui/dist/assets/{PdfViewerPlugin-EjEcsIB8.js → PdfViewerPlugin-BT8a6wGR.js} +10 -10
- package/src/ui/dist/assets/{SearchPlugin-ixY-1lgW.js → SearchPlugin-D_blveZi.js} +1 -1
- package/src/ui/dist/assets/{Stepper-gYFK2Pgz.js → Stepper-DH2k75Vo.js} +1 -1
- package/src/ui/dist/assets/{TextViewerPlugin-Cym6pv_n.js → TextViewerPlugin-Btx0M3hX.js} +4 -4
- package/src/ui/dist/assets/{VNCViewer-BPmIHcmK.js → VNCViewer-DImJO4rO.js} +9 -9
- package/src/ui/dist/assets/{bibtex-Btv6Wi7f.js → bibtex-B-Hqu0Sg.js} +1 -1
- package/src/ui/dist/assets/{code-BlG7g85c.js → code-BUfXGJSl.js} +1 -1
- package/src/ui/dist/assets/{file-content-DBT5OfTZ.js → file-content-VqamwI3X.js} +1 -1
- package/src/ui/dist/assets/{file-diff-panel-BWXYzqHk.js → file-diff-panel-C_wOoS7a.js} +1 -1
- package/src/ui/dist/assets/{file-socket-wDlx6byM.js → file-socket-D2bTuMVP.js} +1 -1
- package/src/ui/dist/assets/{file-utils-Ba3nJmH0.js → file-utils--zJCPN1i.js} +1 -1
- package/src/ui/dist/assets/{image-BwtCyguk.js → image-BZkGJ4mM.js} +1 -1
- package/src/ui/dist/assets/{index-CfRpE209.js → index-CxkvSeKw.js} +2 -2
- package/src/ui/dist/assets/{index-DcqvKzeJ.js → index-D9QIGcmc.js} +1 -1
- package/src/ui/dist/assets/{index-DpMZw8aM.css → index-DXZ1daiJ.css} +163 -34
- package/src/ui/dist/assets/index-DdRW6RMJ.js +159 -0
- package/src/ui/dist/assets/{index-Bz5AaWL7.js → index-DjggJovS.js} +2948 -1565
- package/src/ui/dist/assets/{message-square-BnlyWVH0.js → message-square-FUIPIhU2.js} +1 -1
- package/src/ui/dist/assets/{monaco-CXe0pAVe.js → monaco-DHMc7kKM.js} +1 -1
- package/src/ui/dist/assets/{popover-BCHmVhHj.js → popover-B85oCgCS.js} +1 -1
- package/src/ui/dist/assets/{project-sync-Brk6kaOD.js → project-sync-DOMCcPac.js} +1 -1
- package/src/ui/dist/assets/{sigma-D72eSUep.js → sigma-BO2rQrl3.js} +1 -1
- package/src/ui/dist/assets/{tooltip-BMWd0dqX.js → tooltip-B1OspAkx.js} +1 -1
- package/src/ui/dist/assets/{trash-BIt_eWIS.js → trash-BsVEH_dV.js} +1 -1
- package/src/ui/dist/assets/{useCliAccess-N1hkTRrR.js → useCliAccess-b8L6JuZm.js} +1 -1
- package/src/ui/dist/assets/{useFileDiffOverlay-DPRPv6rv.js → useFileDiffOverlay-BY7uA9hV.js} +1 -1
- package/src/ui/dist/assets/{wrap-text-E5-UheyP.js → wrap-text-BwyVuUIK.js} +1 -1
- package/src/ui/dist/assets/{zoom-out-D4TR-ZZ_.js → zoom-out-RDpLugQP.js} +1 -1
- package/src/ui/dist/index.html +5 -2
- package/src/ui/dist/assets/{index-CccQYZjX.css → NotebookEditor-CccQYZjX.css} +0 -0
@@ -10,15 +10,10 @@ It absorbs the essential old DeepScientist reproducer discipline into one stage
 
 ## Interaction discipline
 
--
--
--
-- Emit `artifact.interact(kind='progress', reply_mode='threaded', ...)` when there is real user-visible progress: the first meaningful signal of long work, a meaningful checkpoint, or a concise keepalive if active work has drifted beyond roughly 10 to 30 tool calls without a user-visible update.
-- Keep progress updates chat-like and easy to understand: say what changed, what it means, and what happens next.
-- Default to plain-language summaries. Do not mention file paths, artifact ids, branch/worktree ids, session ids, raw commands, or raw logs unless the user asks or needs them to act.
+- Follow the shared interaction contract injected by the system prompt.
+- For ordinary active work, prefer a concise progress update once work has crossed roughly 10 tool calls with a human-meaningful delta, and do not drift beyond roughly 20 tool calls or about 15 minutes without a user-visible update.
+- Keep ordinary setup and debugging updates concise. Reserve richer milestone reports for accepted / waived / blocked baseline outcomes or other route-changing checkpoints instead of narrating every small setup step.
 - Message templates are references only. Adapt to the actual context and vary wording so updates feel natural and non-robotic.
-- Use `reply_mode='blocking'` only for real user decisions that cannot be resolved from local evidence.
-- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, and wait up to 1 day when feasible. If the blocker is a missing external credential or secret that only the user can provide, keep the quest waiting, ask the user to supply it or choose an alternative, and do not self-resolve; if resumed without that credential and no other work is possible, a long low-frequency wait such as `bash_exec(command='sleep 3600', mode='await', timeout_seconds=3700)` is acceptable. Otherwise choose the best option yourself and notify the user of the chosen option if the timeout expires.
 - If a threaded user reply arrives, interpret it relative to the latest baseline progress update before assuming the task changed completely.
 - Prefer `bash_exec` for setup, reproduction, and verification commands so each baseline action keeps a durable quest-local session id and log trail.
 - When the baseline route is durably chosen, confirmed, waived, or blocked with a clear next action, send one richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` update that says whether the baseline is trusted, blocked, or waived, why that matters, and what the next stage is.
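The new progress-cadence bullet in this hunk can be read as a small decision rule. The following is a minimal sketch under stated assumptions: `should_send_progress` and its parameters are hypothetical names, not part of the package, and the thresholds are the approximate values quoted in the added line ("roughly 10", "roughly 20", "about 15 minutes"), hard-coded here only for illustration.

```python
def should_send_progress(tool_calls_since_update: int,
                         minutes_since_update: float,
                         has_meaningful_delta: bool) -> bool:
    """Hypothetical reading of the added cadence rule; thresholds are approximate."""
    # Hard ceiling: do not drift past ~20 tool calls or ~15 minutes in silence.
    if tool_calls_since_update >= 20 or minutes_since_update >= 15:
        return True
    # Soft trigger: ~10 tool calls, but only when there is something worth reporting.
    return tool_calls_since_update >= 10 and has_meaningful_delta
```

Compared with the removed bullet, the ceiling tightens from roughly 30 tool calls to roughly 20, and a wall-clock limit is added.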
@@ -30,6 +25,7 @@ It absorbs the essential old DeepScientist reproducer discipline into one stage
 - do not claim a baseline is ready before verification is complete
 - do not infer missing commands, scripts, or parameters when the uncertainty would change the result
 - any unavoidable guess must be written down explicitly with expected impact
+- for Python baselines, standardize environment setup with `uv`; do not default to ad-hoc `pip install ...`, a fresh `conda create ...`, or global package mutation when `uv` can provide the same environment reproducibly
 - use web search for discovering papers or repos, but use `artifact.arxiv(paper_id=..., full_text=False)` for actually reading a source arXiv paper when it exists
 - set `full_text=True` only when the summary/abstract view is insufficient for the needed detail; do not default to the raw PDF
 
@@ -40,25 +36,6 @@ It absorbs the essential old DeepScientist reproducer discipline into one stage
 - if a structured user decision is required, ask only for decisions that the system cannot safely derive locally
 - do not ask speculative or premature questions when local analysis can narrow the choices first
 
-## Priority workflow
-
-Default to the lightest baseline path that can still establish a trustworthy comparison.
-Do not front-load a full reproduction dossier when a faster truth-finding step would tell you whether the route is even viable.
-
-The ordinary baseline order is:
-
-1. confirm quest binding and current baseline state
-2. look for the cheapest trustworthy route in order: attach, import, reproduce, repair
-3. capture the minimum viable contract: task, dataset or split, metric, source identity, expected command path, and main risks
-4. run a bounded smoke test as soon as that contract is concrete enough
-5. only after the smoke test is credible, expand setup notes and launch the real run
-6. verify before accepting
-7. archive, publish, or attach the result when appropriate
-
-Escalate to the heavier baseline path only when the baseline is ambiguous, broken, multi-variant, paper-to-repo mismatched, or likely to be reused beyond the current quest.
-
-If the quest is not yet bound to a stable baseline context, do not pretend the stage is ready just because some code exists locally.
-
 ## Stage purpose
 
 The baseline stage should produce a usable reference point through one of four routes:
@@ -77,6 +54,51 @@ The stage must preserve the classic four-part reproducer flow:
 
 Do not casually skip these gates.
 
+## Quick workflow
+
+Treat this as the compressed map of the detailed sections below, not as a second independent SOP.
+
+1. Read the source paper and source repo first, or explicitly record what is missing and why.
+2. Choose the lightest trustworthy route: attach, import, reproduce, or repair.
+3. Before substantial setup, code changes, or a real run, create `PLAN.md` and `CHECKLIST.md`, and keep them updated when the route, assets, commands, or trust judgment changes materially.
+4. Keep one dominant phase visible: analysis -> setup -> execution -> verification, with a bounded smoke test before any real long run.
+5. Once the route is concrete, prefer one clean implementation pass, one smoke test, and then one normal baseline run; retry only when the smoke test, verification, or runtime evidence shows a concrete failure or incompatibility.
+6. Close the baseline stage by confirming or waiving the gate, then send a concise `1-2` sentence summary that says whether the baseline is trusted, caveated, blocked, or waived, and what happens next.
+
+## Route priority and escalation
+
+This section sets route priority and escalation rules. The authoritative step-by-step execution remains in `Workflow`.
+
+Default to the lightest baseline path that can still establish a trustworthy comparison.
+Do not front-load a full reproduction dossier when a faster truth-finding step would tell you whether the route is even viable.
+User requirements and explicit constraints are the primary boundary for the reproduction plan.
+Within that boundary, prefer equivalence-preserving efficiency gains before more compute: larger safe batch size, cache reuse, checkpoint resume, parallel downloads or workers, and the cheapest comparable smoke path.
+
+The ordinary baseline order is:
+
+1. confirm quest binding and current baseline state
+2. look for the cheapest trustworthy route in order: attach, import, reproduce, repair
+3. capture the minimum viable contract: task, dataset or split, metric, source identity, expected command path, and main risks
+4. run a bounded smoke test as soon as that contract is concrete enough, then expand setup notes and launch the real run only after the smoke test is credible
+5. verify before accepting, then archive, publish, or attach the result when appropriate
+
+Escalate to the heavier baseline path only when the baseline is ambiguous, broken, multi-variant, paper-to-repo mismatched, or likely to be reused beyond the current quest.
+
+If the quest is not yet bound to a stable baseline context, do not pretend the stage is ready just because some code exists locally.
+
+## Required plan and checklist
+
+Before substantial baseline setup, code edits, or a real baseline run, create a quest-visible `PLAN.md` and `CHECKLIST.md`.
+
+- Use `references/baseline-plan-template.md` as the canonical structure for `PLAN.md`.
+- Use `references/baseline-checklist-template.md` as the canonical structure for `CHECKLIST.md`.
+- `PLAN.md` becomes mandatory after you have read the source paper and repo enough to restate the method faithfully, identify the real entrypoints, and explain the likely failure points; if either source is missing, record that gap explicitly before proceeding.
+- `PLAN.md` should put the user's explicit requirements and non-negotiable constraints first, then cover the chosen route, source package and provenance, safe efficiency levers, code touchpoints, environment and asset plan, smoke test, main run, fallback options such as ModelScope or local mirrors when Hugging Face is blocked, monitoring and sleep rules, verification targets, and a revision log.
+- `CHECKLIST.md` is the living companion to `PLAN.md`; update it during reading, setup, smoke testing, real execution, verification, and every material route change.
+- If an older quest already uses `analysis_plan.md` or `REPRO_CHECKLIST.md`, keep those files aligned with the canonical `PLAN.md` / `CHECKLIST.md` or turn them into clear compatibility pointers rather than splitting truth across parallel planning files.
+- Do not treat the plan as static: if the route, commands, source package, fallback path, or trust judgment changes materially, revise `PLAN.md` before continuing.
+- Once `PLAN.md` makes the route concrete, do not keep rewriting code or commands speculatively. The normal default is one bounded smoke test and then one real run, with retries only after a documented failure, invalidity, or compatibility problem.
+
 ## Phase routing rule
 
 Treat `analysis`, `setup`, `execution`, and `verification` as logical control gates, not paperwork walls.
@@ -145,12 +167,12 @@ Do not treat memory alone as sufficient evidence for baseline readiness.
 The baseline line should also maintain a durable working-record area outside the execution surface.
 Recommended quest-visible records include:
 
-- `
+- `PLAN.md` as the canonical baseline plan; older quests may keep `analysis_plan.md` as a compatibility alias
+- `CHECKLIST.md` as the canonical living checklist; older quests may keep `REPRO_CHECKLIST.md` as a compatibility alias when already wired
 - `setup.md`
 - `execution.md`
 - `verification.md`
 - `STRUCTURE.md` only when the workspace layout is non-obvious or later reuse depends on it
-- `REPRO_CHECKLIST.md` only when the route is complex, repair-heavy, multi-variant, or publication-facing
 
 For a simple attach/import flow or a straightforward reproduce flow, do not stall just to precreate every one of these files.
 Start with the smallest durable note that preserves the route, command path, target outputs, and main risks; expand it only after the route proves real.
@@ -181,12 +203,13 @@ Minimum stability rules:
 - every accepted baseline should leave one accepted baseline artifact
 - every blocked baseline line should leave one blocked report and one next-step decision
 - every handoff should name the active baseline reference and trusted metric set explicitly
+- when the accepted paper-facing contract spans multiple metrics, datasets, subtasks, or splits, preserve that full comparison surface in the durable metric contract rather than collapsing it to one headline number
 - do not require every optional checklist or template before the first smoke test
 - if one rolling note is enough for a simple baseline line, use it
 
 Recommended phase-to-output mapping:
 
-- `analysis` -> a brief `
+- `analysis` -> a brief `PLAN.md` or compatible `analysis_plan.md`, plus optional route decision artifact
 - `setup` -> `setup.md` when setup choices are non-trivial
 - `execution` -> `execution.md` plus progress artifacts when long-running
 - `verification` -> `verification.md` plus accepted baseline artifact and `artifact.confirm_baseline(...)`, or a blocked report plus `artifact.waive_baseline(...)` when skipping is intentional
@@ -401,6 +424,7 @@ You should inspect local feasibility with shell-based checks when needed, includ
 - CPU and RAM
 - free disk
 - Python or conda environment availability
+- whether `uv` is available and which Python version `uv` should target
 
 Use the collected constraints to choose a realistic baseline route and runtime plan.
 
@@ -415,7 +439,8 @@ At minimum, the plan should capture:
 - key risks
 - verification targets
 
-
+Prefer `PLAN.md` for new work and use `references/baseline-plan-template.md` when you need a concrete starting structure.
+When the analysis note becomes substantial, structure `PLAN.md` or a legacy-compatible `analysis_plan.md` with headings close to:
 
 - executive summary
 - codebase analysis
@@ -458,6 +483,13 @@ Prepare the selected route:
 - reproduce: prepare the baseline work directory, commands, config pointers, and environment notes
 - repair: identify the precise broken point before rerunning blindly
 
+For Python baselines, environment setup should be standardized around `uv`.
+Treat `uv` as the default environment and package manager for baseline setup, smoke tests, and real runs.
+Do not casually switch to a new conda environment or a manual `pip install` flow just because the repo is old.
+If the baseline already ships a `pyproject.toml` / `uv.lock`, use that path first.
+If it only ships `requirements.txt`, still create the environment with `uv` and install through `uv pip`.
+Only accept a non-`uv` environment route when there is a concrete blocker that cannot be resolved locally, and record that blocker explicitly in `setup.md` and the progress update.
+
 For a fast-path reproduction, setup can stay lightweight.
 Confirm the working directory, environment, config, output paths, smoke command, and long-run command, then move forward.
 Do not manufacture a fresh workspace tree or copy the repo just to satisfy a template if the existing layout is already workable and auditable.
@@ -479,6 +511,59 @@ Setup should also confirm:
 - required dependencies or environments are known
 - the execution plan is realistic for the detected hardware
 
+### Python environment rule: use `uv`
+
+When the baseline is Python-based, prefer the following order:
+
+1. if the repo already contains `uv.lock` or a solid `pyproject.toml`, use `uv sync`
+2. otherwise create a local virtual environment with `uv venv`
+3. install dependencies with `uv pip install ...`
+4. run setup, smoke tests, and real commands through `uv run ...`
+
+Practical rules:
+
+- prefer a quest-local or baseline-local `.venv` under the actual working tree
+- prefer `uv run python ...` / `uv run bash ...` over relying on shell activation state
+- if a specific interpreter is required, make it explicit with `uv venv --python 3.11` or `uv run --python 3.11 ...`
+- if CUDA, PyTorch, JAX, or custom wheels require a special index URL, still keep the installation command under `uv pip`
+- if the repo insists on conda-only tooling, first check whether the same packages can be installed with `uv`; only keep the conda route if you can explain why `uv` is not viable
+
+Examples:
+
+```bash
+# modern repo with pyproject.toml / uv.lock
+cd <baseline_root>
+uv sync
+uv run python -m pytest tests/test_smoke.py -q
+uv run python train.py --config configs/baseline.yaml
+```
+
+```bash
+# legacy repo with requirements.txt
+cd <baseline_root>
+uv venv --python 3.11
+uv pip install -r requirements.txt
+uv run python scripts/smoke_test.py
+uv run python main.py --dataset cifar10 --config configs/resnet18.yaml
+```
+
+```bash
+# one-off package additions without leaving the uv-managed flow
+cd <baseline_root>
+uv venv --python 3.11
+uv pip install -r requirements.txt
+uv pip install "torch==2.4.1" "torchvision==0.19.1"
+uv run python evaluate.py --checkpoint outputs/best.pt
+```
+
+When you record the setup, explicitly note:
+
+- the chosen `uv` route: `uv sync` vs `uv venv` + `uv pip`
+- the Python version
+- the dependency source files used
+- the exact `uv run ...` command used for the smoke test
+- any blocker that prevented a pure `uv` flow
+
 If a dedicated baseline workspace is needed, establish a clear layout.
 One workable structure is:
 
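The route order added in this hunk (`uv sync` when `uv.lock` or `pyproject.toml` exists, otherwise `uv venv` plus `uv pip install`) can be sketched as a lookup over the files found in the baseline root. This is a minimal sketch, not code from the package: `choose_uv_route` is a hypothetical helper, and it cannot judge whether a `pyproject.toml` is "solid", which the skill text leaves to the agent.

```python
def choose_uv_route(filenames: set[str]) -> list[str]:
    """Return the uv setup commands implied by the files in the baseline root.

    Hypothetical helper: it only checks file presence and does not validate
    that pyproject.toml actually declares usable dependencies.
    """
    # Route 1: a lockfile or project file exists, so let uv resolve and sync.
    if "uv.lock" in filenames or "pyproject.toml" in filenames:
        return ["uv sync"]
    # Route 2: create a local virtual environment first.
    steps = ["uv venv"]
    # Route 3: install from requirements.txt when the legacy repo ships one.
    if "requirements.txt" in filenames:
        steps.append("uv pip install -r requirements.txt")
    return steps
```

Whatever route is returned, the hunk asks that it be recorded in the setup notes together with the Python version and the exact `uv run ...` smoke command.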
@@ -514,6 +599,7 @@ Setup should record:
 - how the source was obtained: attach/import/copy/clone
 - upstream URL when known
 - upstream commit hash when known
+- `uv` environment route and Python version
 - key environment variables by name only, with sensitive values redacted
 - the directory tree and key files expected to matter later
 
@@ -553,6 +639,8 @@ If a wrapper or entry script is truly needed, it should support most of the foll
 - speed flags such as parallelism, batch size, epochs, or steps when relevant
 - optional evaluation and postprocess steps when the repo separates them
 
+Prefer those efficiency levers only when they do not change the accepted baseline meaning, effective evaluation contract, or trust judgment.
+
 If adding this scaffolding would require large assumptions about missing scripts, stop and return to analysis rather than creating a misleading opaque wrapper.
 
 Recommended result structures to maintain:
@@ -576,6 +664,8 @@ Long-running execution rules:
|
|
|
576
664
|
|
|
577
665
|
- before a substantial baseline reproduction, run a bounded smoke test first so command paths, output locations, and metric plumbing are validated cheaply
|
|
578
666
|
- once the smoke test passes, launch the real baseline reproduction with `bash_exec(mode='detach', ...)` and normally leave `timeout_seconds` unset for the long run itself
|
|
667
|
+
- `bash_exec(mode='read', id=...)` returns the full rendered log when it is 2000 lines or fewer; for longer logs it returns the first 500 lines plus the last 1500 lines and a hint to inspect omitted sections with `start` and `tail`
|
|
668
|
+
- if a long saved log omits the middle section you need, use `bash_exec(mode='read', id=..., start=..., tail=...)` to inspect that forward rendered-line window
|
|
579
669
|
- when monitoring that detached run, prefer `bash_exec(mode='read', id=..., tail_limit=..., order='desc')` so you inspect the newest log evidence first
|
|
580
670
|
- after the first read, prefer incremental checks with `bash_exec(mode='read', id=..., after_seq=last_seen_seq, tail_limit=..., order='asc')` so you only inspect newly appended evidence
|
|
581
671
|
- if you need to recover ids or confirm the newest session quickly, use `bash_exec(mode='history')` or `bash_exec(mode='list')` rather than guessing
|
|
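The head-and-tail preview rule for `bash_exec(mode='read', id=...)` described above can be sketched as follows. This is a minimal illustration of the stated thresholds only (2000, 500, 1500), not the package's actual implementation; the function name `preview_log` and its return shape are hypothetical.

```python
from typing import List, Tuple

FULL_LIMIT = 2000  # logs with this many lines or fewer are returned whole
HEAD_LINES = 500   # lines kept from the start of a longer log
TAIL_LINES = 1500  # lines kept from the end of a longer log


def preview_log(lines: List[str]) -> Tuple[List[str], int]:
    """Return (visible_lines, omitted_count) under the documented rule.

    Hypothetical helper: it only mirrors the thresholds stated in the text.
    """
    if len(lines) <= FULL_LIMIT:
        return list(lines), 0
    # Everything between the head and the tail is omitted from the preview.
    omitted = len(lines) - HEAD_LINES - TAIL_LINES
    return lines[:HEAD_LINES] + lines[-TAIL_LINES:], omitted
```

When `omitted_count` is non-zero, the omitted middle is what a follow-up read with `start` and `tail` would be used to inspect.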
@@ -586,6 +676,10 @@ Long-running execution rules:
|
|
|
586
676
|
- do not write final summaries or accepted metrics until the command has actually completed
|
|
587
677
|
- verify that the expected result files exist before treating the run as finished
|
|
588
678
|
- if a task is invalid, wedged, or failed, stop it with `bash_exec(mode='kill', id=..., wait=true, timeout_seconds=...)`; if it must die immediately, add `force=true`, then diagnose the reason and either retry with a documented fix or record the failure durably
|
|
679
|
+
- canonical sleep choice:
|
|
680
|
+
- if you only need wall-clock waiting between checks, use `bash_exec(command='sleep N', mode='await', timeout_seconds=N+buffer, ...)`
|
|
681
|
+
- keep a real buffer on that sleep timeout; do not set `timeout_seconds` exactly equal to `N`
|
|
682
|
+
- if you are waiting on an already running managed session, prefer `bash_exec(mode='await', id=..., timeout_seconds=...)` instead of starting a new sleep command
|
|
589
683
|
|
|
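The sleep rule above (timeout strictly larger than the sleep itself) can be sketched as a small argument builder. The helper name `sleep_await_args` and the default 30-second buffer are assumptions for illustration; only `command`, `mode`, and `timeout_seconds` come from the text.

```python
def sleep_await_args(sleep_seconds: int, buffer_seconds: int = 30) -> dict:
    """Build keyword arguments for a wall-clock wait via bash_exec.

    Hypothetical helper: the buffer default is an assumption, not a
    documented value. The timeout is always sleep + buffer, never equal
    to the sleep duration itself.
    """
    if sleep_seconds <= 0:
        raise ValueError("sleep_seconds must be positive")
    if buffer_seconds <= 0:
        raise ValueError("buffer_seconds must be positive so the timeout exceeds the sleep")
    return {
        "command": f"sleep {sleep_seconds}",
        "mode": "await",
        "timeout_seconds": sleep_seconds + buffer_seconds,
    }
```

For an already running managed session, the text prefers `bash_exec(mode='await', id=..., timeout_seconds=...)` over starting a new sleep, so a builder like this would apply only to plain waiting between checks.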
590
684
|
Recommended monitoring cadence for long-running work:
|
|
591
685
|
|
|
@@ -700,13 +794,29 @@ If variants exist, also include:
|
|
|
700
794
|
- `default_variant_id`
|
|
701
795
|
- `baseline_variants`
|
|
702
796
|
|
|
797
|
+
Metric-contract rule:
|
|
798
|
+
|
|
799
|
+
- unless the user explicitly specifies otherwise, treat the original paper's evaluation protocol as the canonical baseline contract
|
|
800
|
+
- if the accepted baseline contract includes multiple metrics, datasets, subtasks, or splits, record all of them in `<baseline_root>/json/metric_contract.json`
|
|
801
|
+
- keep `primary_metric` as the headline metric only; do not let it erase the rest of the accepted paper-facing comparison surface
|
|
802
|
+
- when confirming a baseline, submit the canonical `metrics_summary` as a flat top-level dictionary keyed by the paper-facing metric ids; if the raw evaluator output is nested, use explicit `origin_path` fields in `metric_contract.metrics` to map the required canonical metrics
|
|
803
|
+
- every canonical baseline metric entry should include `description`, either `derivation` or `origin_path`, and `source_ref` so later stages can audit where the number came from
|
|
804
|
+
- if the paper reports both aggregate and per-dataset or per-task results, preserve both whenever feasible through `metrics_summary` plus structured rows rather than one cherry-picked scalar
|
|
805
|
+
- `Result/metric.md` is optional temporary scratch memory only; if it exists, reconcile the final baseline submission against it before calling `artifact.confirm_baseline(...)`, but do not treat it as a required file
|
|
806
|
+
|
|
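As a rough sketch of the metric-contract rule above, the snippet below flattens a nested evaluator output into a flat `metrics_summary` using `origin_path` fields. The dotted-path syntax, the `metric_id` key, and both function names are assumptions made for illustration; the text only states that entries carry `description`, `origin_path` or `derivation`, and `source_ref`.

```python
from typing import Any, Dict, List


def resolve_origin_path(raw: Dict[str, Any], origin_path: str) -> Any:
    """Walk a dotted path (assumed syntax) into a nested evaluator output."""
    node: Any = raw
    for key in origin_path.split("."):
        node = node[key]  # raises KeyError if the path does not exist
    return node


def build_metrics_summary(
    raw: Dict[str, Any], metric_entries: List[Dict[str, str]]
) -> Dict[str, Any]:
    """Return a flat dictionary keyed by paper-facing metric ids."""
    return {
        entry["metric_id"]: resolve_origin_path(raw, entry["origin_path"])
        for entry in metric_entries
    }


# Illustrative data only; these names and numbers are not from any real run.
raw_output = {"eval": {"test": {"acc": 0.9, "f1": 0.8}}}
entries = [
    {
        "metric_id": "accuracy",
        "description": "Top-1 accuracy on the test split",
        "origin_path": "eval.test.acc",
        "source_ref": "hypothetical evaluator output file",
    },
    {
        "metric_id": "f1",
        "description": "F1 on the test split",
        "origin_path": "eval.test.f1",
        "source_ref": "hypothetical evaluator output file",
    },
]
metrics_summary = build_metrics_summary(raw_output, entries)
```

Keeping every required metric in the flat summary, rather than only `primary_metric`, is what preserves the full comparison surface for later stages.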
703
807
|
## Durable note templates
|
|
704
808
|
|
|
705
809
|
Use compact but structured notes so later stages do not need to reconstruct baseline state from chat history.
|
|
706
810
|
The templates below are references, not prerequisites for the first smoke test.
|
|
707
811
|
For simple baseline lines, keep them short and fill only the sections that matter.
|
|
708
812
|
|
|
709
|
-
|
|
813
|
+
Canonical naming for new work:
|
|
814
|
+
|
|
815
|
+
- `PLAN.md` -> use `references/baseline-plan-template.md`
|
|
816
|
+
- `CHECKLIST.md` -> use `references/baseline-checklist-template.md`
|
|
817
|
+
- `analysis_plan.md` and `REPRO_CHECKLIST.md` remain acceptable compatibility aliases when a quest already depends on them
|
|
818
|
+
|
|
819
|
+
### `PLAN.md` or `analysis_plan.md`
|
|
710
820
|
|
|
711
821
|
Recommended shape:
|
|
712
822
|
|
|
@@ -761,6 +871,8 @@ Recommended shape:
|
|
|
761
871
|
- source_origin:
|
|
762
872
|
- source_commit:
|
|
763
873
|
- environment_summary:
|
|
874
|
+
- uv_strategy:
|
|
875
|
+
- python_version:
|
|
764
876
|
- config_paths:
|
|
765
877
|
- command_template:
|
|
766
878
|
|
|
@@ -774,6 +886,7 @@ Recommended shape:
|
|
|
774
886
|
- deviation:
|
|
775
887
|
|
|
776
888
|
## Ready-for-execution check
|
|
889
|
+
- uv_route_recorded: yes/no
|
|
777
890
|
- dependencies_known: yes/no
|
|
778
891
|
- outputs_defined: yes/no
|
|
779
892
|
- feasible_on_current_machine: yes/no
|
|
@@ -0,0 +1,57 @@
|
|
|
1
|
+
# Baseline Checklist Template
|
|
2
|
+
|
|
3
|
+
Use this as a living checklist.
|
|
4
|
+
Update it during reading, setup, smoke testing, real execution, verification, and route changes.
|
|
5
|
+
|
|
6
|
+
## Identity
|
|
7
|
+
|
|
8
|
+
- baseline id:
|
|
9
|
+
- route:
|
|
10
|
+
- owner stage:
|
|
11
|
+
|
|
12
|
+
## Analysis
|
|
13
|
+
|
|
14
|
+
- [ ] paper source identified
|
|
15
|
+
- [ ] repo source identified
|
|
16
|
+
- [ ] paper read enough to restate the core method faithfully
|
|
17
|
+
- [ ] repo read enough to identify the real entrypoints
|
|
18
|
+
- [ ] dataset / split contract confirmed
|
|
19
|
+
- [ ] metric contract confirmed
|
|
20
|
+
- [ ] main files to inspect or modify listed
|
|
21
|
+
- [ ] risks and fallbacks written into `PLAN.md`
|
|
22
|
+
|
|
23
|
+
## Setup
|
|
24
|
+
|
|
25
|
+
- [ ] working directory confirmed
|
|
26
|
+
- [ ] environment route chosen
|
|
27
|
+
- [ ] key dependencies checked
|
|
28
|
+
- [ ] model / data download path confirmed
|
|
29
|
+
- [ ] fallback source recorded for critical downloads
|
|
30
|
+
|
|
31
|
+
## Smoke Test
|
|
32
|
+
|
|
33
|
+
- [ ] smoke command written in `PLAN.md`
|
|
34
|
+
- [ ] smoke command executed
|
|
35
|
+
- [ ] smoke outputs verified
|
|
36
|
+
- [ ] smoke failure handled or route revised
|
|
37
|
+
|
|
38
|
+
## Main Run
|
|
39
|
+
|
|
40
|
+
- [ ] real command written in `PLAN.md`
|
|
41
|
+
- [ ] real run launched with durable logging
|
|
42
|
+
- [ ] monitoring cadence started
|
|
43
|
+
- [ ] health signals confirmed
|
|
44
|
+
- [ ] any execution deviation reflected back into `PLAN.md`
|
|
45
|
+
|
|
46
|
+
## Verification
|
|
47
|
+
|
|
48
|
+
- [ ] expected result files exist
|
|
49
|
+
- [ ] metric keys are complete
|
|
50
|
+
- [ ] baseline is comparable to the intended contract
|
|
51
|
+
- [ ] verification note written
|
|
52
|
+
- [ ] baseline accepted or explicitly blocked / waived
|
|
53
|
+
|
|
54
|
+
## Closeout
|
|
55
|
+
|
|
56
|
+
- [ ] concise `1-2` sentence baseline summary written
|
|
57
|
+
- [ ] next stage named explicitly
|
|
@@ -0,0 +1,105 @@
|
|
|
1
|
+
# Baseline Plan Template
|
|
2
|
+
|
|
3
|
+
Use this when the `baseline` stage becomes concrete enough to act.
|
|
4
|
+
Keep it short when the route is simple, but do not skip the sections that affect reproducibility, code touchpoints, or fallback handling.
|
|
5
|
+
|
|
6
|
+
## 1. Objective
|
|
7
|
+
|
|
8
|
+
- quest goal:
|
|
9
|
+
- user's core requirements:
|
|
10
|
+
- non-negotiable user constraints:
|
|
11
|
+
- chosen baseline route:
|
|
12
|
+
- attach / import / reproduce / repair
|
|
13
|
+
- baseline id:
|
|
14
|
+
- variant id:
|
|
15
|
+
|
|
16
|
+
## 2. Source Package
|
|
17
|
+
|
|
18
|
+
- source paper:
|
|
19
|
+
- source repo:
|
|
20
|
+
- fallback repo or mirror:
|
|
21
|
+
- source commit / version / tag:
|
|
22
|
+
- task:
|
|
23
|
+
- dataset / split:
|
|
24
|
+
- metric contract:
|
|
25
|
+
|
|
26
|
+
## 3. Paper And Repo Reading Notes
|
|
27
|
+
|
|
28
|
+
- paper summary in `1-3` bullets:
|
|
29
|
+
- repo summary in `1-3` bullets:
|
|
30
|
+
- what the baseline actually does:
|
|
31
|
+
- what the likely bottlenecks or brittle points are:
|
|
32
|
+
- what still needs verification:
|
|
33
|
+
|
|
34
|
+
## 4. Code Touchpoints
|
|
35
|
+
|
|
36
|
+
List the main files or modules that matter before you change anything substantial.
|
|
37
|
+
|
|
38
|
+
| Path | Role | Why it matters now | Expected action | Notes |
|
|
39
|
+
|---|---|---|---|---|
|
|
40
|
+
| | | | inspect / modify / leave alone | |
|
|
41
|
+
|
|
42
|
+
## 5. Environment And Asset Plan
|
|
43
|
+
|
|
44
|
+
- working directory:
|
|
45
|
+
- environment plan:
|
|
46
|
+
- required downloads:
|
|
47
|
+
- checkpoints / models:
|
|
48
|
+
- hardware assumptions:
|
|
49
|
+
- likely external blockers:
|
|
50
|
+
|
|
51
|
+
Fallbacks and contingency options:
|
|
52
|
+
|
|
53
|
+
- if Hugging Face is slow, blocked, or rate-limited:
|
|
54
|
+
- try ModelScope, official mirrors, quest-local caches, or manually staged files
|
|
55
|
+
- if the official repo is unavailable:
|
|
56
|
+
- use a verified mirror and record the exact provenance
|
|
57
|
+
- if the full run is too expensive:
|
|
58
|
+
- define the smoke-test path and the cheapest comparable reduced pilot
|
|
59
|
+
|
|
60
|
+
## 6. Execution Strategy
|
|
61
|
+
|
|
62
|
+
### Smoke Test
|
|
63
|
+
|
|
64
|
+
- command:
|
|
65
|
+
- purpose:
|
|
66
|
+
- expected outputs:
|
|
67
|
+
- fastest failure signal:
|
|
68
|
+
|
|
69
|
+
### Main Run
|
|
70
|
+
|
|
71
|
+
- command:
|
|
72
|
+
- expected outputs:
|
|
73
|
+
- expected runtime / budget:
|
|
74
|
+
- durable log path:
|
|
75
|
+
- safe efficiency levers to try first:
|
|
76
|
+
|
|
77
|
+
### Monitoring And Sleep Rules
|
|
78
|
+
|
|
79
|
+
- first checks:
|
|
80
|
+
- `60s`
|
|
81
|
+
- `120s`
|
|
82
|
+
- `300s`
|
|
83
|
+
- `600s`
|
|
84
|
+
- `1800s`
|
|
85
|
+
- health signals that justify continued monitoring rather than intervention:
|
|
86
|
+
- conditions that require plan revision or kill-and-relaunch:
|
|
87
|
+
|
|
88
|
+
## 7. Verification Plan
|
|
89
|
+
|
|
90
|
+
- required result files:
|
|
91
|
+
- required metric keys:
|
|
92
|
+
- comparability checks:
|
|
93
|
+
- acceptance condition:
|
|
94
|
+
- downgrade / blocked condition:
|
|
95
|
+
|
|
96
|
+
## 8. Checklist Link
|
|
97
|
+
|
|
98
|
+
- checklist path:
|
|
99
|
+
- which item should move next:
|
|
100
|
+
|
|
101
|
+
## 9. Revision Log
|
|
102
|
+
|
|
103
|
+
| Time | What changed | Why it changed | Impact on execution |
|
|
104
|
+
|---|---|---|---|
|
|
105
|
+
| | | | |
|
|
@@ -9,17 +9,12 @@ Use this skill whenever continuation is non-trivial.
|
|
|
9
9
|
|
|
10
10
|
## Interaction discipline
|
|
11
11
|
|
|
12
|
-
-
|
|
13
|
-
-
|
|
14
|
-
- Immediately follow any non-empty mailbox poll with another `artifact.interact(...)` update that confirms receipt; if the request is directly answerable, answer there, otherwise say the current subtask is paused, give a short plan plus nearest report-back point, and handle that request first.
|
|
15
|
-
- Emit `artifact.interact(kind='progress', reply_mode='threaded', ...)` when there is real user-visible progress: a meaningful checkpoint, a route-shaping update, or a concise keepalive if active work has drifted beyond roughly 10 to 30 tool calls without a user-visible update.
|
|
12
|
+
- Follow the shared interaction contract injected by the system prompt.
|
|
13
|
+
- For ordinary active work, prefer a concise progress update once work has crossed roughly 10 tool calls with a human-meaningful delta, and do not drift beyond roughly 20 tool calls or about 15 minutes without a user-visible update.
|
|
16
14
|
- Message templates are references only. Adapt to context and vary wording so updates feel natural and non-robotic.
|
|
17
|
-
- Keep progress updates chat-like and easy to understand: say what changed, what it means, and what happens next.
|
|
18
|
-
- Default to plain-language summaries. Do not mention file paths, artifact ids, branch/worktree ids, session ids, raw commands, or raw logs unless the user asks or needs them to act.
|
|
19
15
|
- If the runtime starts an auto-continue turn with no new user message, continue from the active requirements and durable quest state instead of replaying the previous user turn.
|
|
20
16
|
- If `startup_contract.decision_policy = autonomous`, do not emit ordinary `artifact.interact(kind='decision_request', ...)` calls; decide the route yourself, record the reason, and continue.
|
|
21
17
|
- Use `reply_mode='blocking'` for the actual decision request only when the user must choose before safe continuation and the quest contract still allows a user-gated decision.
|
|
22
|
-
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, and wait up to 1 day when feasible. If the blocker is a missing external credential or secret that only the user can provide, keep the quest waiting, ask the user to supply it or choose an alternative, and do not self-resolve; if resumed without that credential and no other work is possible, a long low-frequency wait such as `bash_exec(command='sleep 3600', mode='await', timeout_seconds=3700)` is acceptable. Otherwise choose the best option yourself and notify the user of the chosen option if the timeout expires.
|
|
23
18
|
- If a threaded user reply arrives, interpret it relative to the latest decision or progress interaction before assuming the task changed completely.
|
|
24
19
|
- Quest completion is a special terminal decision: first ask for explicit completion approval with `artifact.interact(kind='decision_request', reply_mode='blocking', reply_schema={'decision_type': 'quest_completion_approval'}, ...)`, and only after an explicit approval reply should you call `artifact.complete_quest(...)`.
|
|
25
20
|
|
|
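The progress-update cadence in the revised interaction rules can be read as a simple check. The sketch below treats the "roughly 10", "roughly 20", and "about 15 minutes" figures as hard thresholds purely for illustration; the function name and parameters are hypothetical, and the real contract is whatever the system prompt injects.

```python
def progress_update_due(
    tool_calls_since_update: int,
    minutes_since_update: float,
    has_meaningful_delta: bool,
) -> bool:
    """Decide whether a user-visible progress update is due.

    Hypothetical helper: the thresholds approximate the wording in the text.
    """
    # Upper bound: do not drift past ~20 tool calls or ~15 minutes.
    if tool_calls_since_update >= 20 or minutes_since_update >= 15:
        return True
    # Preferred point: ~10 tool calls, but only with a human-meaningful delta.
    return tool_calls_since_update >= 10 and has_meaningful_delta
```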
@@ -74,6 +69,7 @@ Use the following canonical actions:
|
|
|
74
69
|
- `launch_analysis_campaign`
|
|
75
70
|
- `branch`
|
|
76
71
|
- `prepare_branch`
|
|
72
|
+
- `activate_branch`
|
|
77
73
|
- `reuse_baseline`
|
|
78
74
|
- `attach_baseline`
|
|
79
75
|
- `publish_baseline`
|
|
@@ -91,6 +87,8 @@ In the current runtime, prefer these concrete flow actions:
|
|
|
91
87
|
- accepted idea -> `artifact.submit_idea(mode='create', lineage_intent='continue_line'|'branch_alternative', ...)`
|
|
92
88
|
- maintenance-only in-place cleanup of the same branch -> `artifact.submit_idea(mode='revise', ...)`
|
|
93
89
|
- compare branch foundations before a new round -> `artifact.list_research_branches(...)`
|
|
90
|
+
- return to an older durable branch without creating a new node -> `artifact.activate_branch(...)`
|
|
91
|
+
- materialize the concrete main-result node when a real main experiment line is about to be or was just durably recorded -> dedicated child `run/*` branch/worktree
|
|
94
92
|
- start the next optimization round from a measured result -> `artifact.record(kind='decision', action='iterate', ...)`
|
|
95
93
|
- launch analysis campaign -> `artifact.create_analysis_campaign(...)`
|
|
96
94
|
- finish one analysis slice -> `artifact.record_analysis_slice(...)`
|
|
@@ -104,8 +102,12 @@ If the chosen action is baseline reuse, the decision is not complete until one o
|
|
|
104
102
|
- or the quest recorded an explicit blocker or waiver explaining why reuse could not be completed safely
|
|
105
103
|
|
|
106
104
|
Treat `prepare_branch` as a compatibility or recovery action, not the normal path.
|
|
105
|
+
Treat `activate_branch` as the correct recovery or revisit action when the quest should resume on an existing older durable branch while preserving the newer research head.
|
|
107
106
|
Treat each accepted branch as one durable research round.
|
|
108
107
|
If a branch already has a durable main-experiment result, a genuinely new optimization round should normally create a child branch from a chosen foundation rather than keep revising that old branch in place.
|
|
108
|
+
Treat each durable main experiment as its own child `run/*` branch/node, not as another mutable state on the idea branch.
|
|
109
|
+
When paper mode is enabled and the necessary analysis for a strong run is done, the next default route is `write` on a dedicated `paper/*` branch/worktree derived from that run branch.
|
|
110
|
+
Do not approve `launch_analysis_campaign` casually; analysis usually carries extra resource cost and should require clear academic or claim-level value before spending that budget.
|
|
109
111
|
|
|
110
112
|
## Truth sources
|
|
111
113
|
|
|
@@ -146,7 +148,7 @@ Typical mapping:
|
|
|
146
148
|
- `good`
|
|
147
149
|
- continue, branch, launch experiment, write, finalize
|
|
148
150
|
- `neutral`
|
|
149
|
-
- branch, launch analysis campaign, request user decision
|
|
151
|
+
- branch, activate branch, launch analysis campaign, request user decision
|
|
150
152
|
- `bad`
|
|
151
153
|
- reset, stop
|
|
152
154
|
- `blocked`
|
|
@@ -301,6 +303,7 @@ This is especially useful for:
|
|
|
301
303
|
- idea branch selection
|
|
302
304
|
- experiment package selection
|
|
303
305
|
- launch of an analysis campaign
|
|
306
|
+
- reactivation of an older durable branch
|
|
304
307
|
- post-campaign routing
|
|
305
308
|
- stop / pivot / finalize choices
|
|
306
309
|
|
|
@@ -341,6 +344,7 @@ Good decisions:
|
|
|
341
344
|
- say what happens next
|
|
342
345
|
- say why the alternative was not chosen
|
|
343
346
|
- explicitly identify the winning candidate when choosing among multiple packages
|
|
347
|
+
- do not launch analysis campaigns unless the expected information gain clearly justifies the extra resource cost
|
|
344
348
|
|
|
345
349
|
Weak decisions:
|
|
346
350
|
|