@researai/deepscientist 1.5.1 → 1.5.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +69 -1
- package/bin/ds.js +2239 -153
- package/docs/en/00_QUICK_START.md +60 -20
- package/docs/en/01_SETTINGS_REFERENCE.md +20 -20
- package/docs/en/02_START_RESEARCH_GUIDE.md +11 -11
- package/docs/en/03_QQ_CONNECTOR_GUIDE.md +10 -10
- package/docs/en/05_TUI_GUIDE.md +1 -1
- package/docs/en/09_DOCTOR.md +48 -4
- package/docs/en/90_ARCHITECTURE.md +4 -2
- package/docs/zh/00_QUICK_START.md +60 -20
- package/docs/zh/01_SETTINGS_REFERENCE.md +21 -21
- package/docs/zh/02_START_RESEARCH_GUIDE.md +19 -19
- package/docs/zh/03_QQ_CONNECTOR_GUIDE.md +10 -10
- package/docs/zh/05_TUI_GUIDE.md +1 -1
- package/docs/zh/09_DOCTOR.md +46 -4
- package/install.sh +125 -8
- package/package.json +2 -1
- package/pyproject.toml +1 -1
- package/src/deepscientist/__init__.py +6 -1
- package/src/deepscientist/artifact/service.py +553 -26
- package/src/deepscientist/bash_exec/monitor.py +23 -4
- package/src/deepscientist/bash_exec/runtime.py +3 -0
- package/src/deepscientist/bash_exec/service.py +132 -4
- package/src/deepscientist/bridges/base.py +10 -19
- package/src/deepscientist/channels/discord_gateway.py +25 -2
- package/src/deepscientist/channels/feishu_long_connection.py +41 -3
- package/src/deepscientist/channels/qq.py +524 -64
- package/src/deepscientist/channels/qq_gateway.py +22 -3
- package/src/deepscientist/channels/relay.py +429 -90
- package/src/deepscientist/channels/slack_socket.py +29 -5
- package/src/deepscientist/channels/telegram_polling.py +25 -2
- package/src/deepscientist/channels/whatsapp_local_session.py +32 -4
- package/src/deepscientist/cli.py +27 -0
- package/src/deepscientist/config/models.py +6 -40
- package/src/deepscientist/config/service.py +165 -156
- package/src/deepscientist/connector_profiles.py +346 -0
- package/src/deepscientist/connector_runtime.py +88 -43
- package/src/deepscientist/daemon/api/handlers.py +65 -11
- package/src/deepscientist/daemon/api/router.py +4 -2
- package/src/deepscientist/daemon/app.py +772 -219
- package/src/deepscientist/doctor.py +69 -2
- package/src/deepscientist/gitops/diff.py +3 -0
- package/src/deepscientist/home.py +25 -2
- package/src/deepscientist/mcp/context.py +3 -1
- package/src/deepscientist/mcp/server.py +66 -7
- package/src/deepscientist/migration.py +114 -0
- package/src/deepscientist/prompts/builder.py +71 -3
- package/src/deepscientist/qq_profiles.py +186 -0
- package/src/deepscientist/quest/layout.py +1 -0
- package/src/deepscientist/quest/service.py +70 -12
- package/src/deepscientist/quest/stage_views.py +46 -0
- package/src/deepscientist/runners/codex.py +2 -0
- package/src/deepscientist/shared.py +44 -17
- package/src/prompts/connectors/lingzhu.md +3 -0
- package/src/prompts/connectors/qq.md +42 -2
- package/src/prompts/system.md +123 -10
- package/src/skills/analysis-campaign/SKILL.md +35 -6
- package/src/skills/baseline/SKILL.md +73 -32
- package/src/skills/decision/SKILL.md +4 -3
- package/src/skills/experiment/SKILL.md +28 -6
- package/src/skills/finalize/SKILL.md +5 -2
- package/src/skills/idea/SKILL.md +2 -2
- package/src/skills/intake-audit/SKILL.md +2 -2
- package/src/skills/rebuttal/SKILL.md +4 -2
- package/src/skills/review/SKILL.md +4 -2
- package/src/skills/scout/SKILL.md +2 -2
- package/src/skills/write/SKILL.md +2 -2
- package/src/tui/package.json +1 -1
- package/src/ui/dist/assets/{AiManusChatView-w5lF2Ttt.js → AiManusChatView-qzChi9uh.js} +67 -94
- package/src/ui/dist/assets/{AnalysisPlugin-DJOED79I.js → AnalysisPlugin-CcC_-UqN.js} +1 -1
- package/src/ui/dist/assets/{AutoFigurePlugin-DaG61Y0M.js → AutoFigurePlugin-DD8LkJLe.js} +5 -5
- package/src/ui/dist/assets/{CliPlugin-CV4LqUB_.js → CliPlugin-DJJFfVmW.js} +17 -110
- package/src/ui/dist/assets/{CodeEditorPlugin-DylfAea4.js → CodeEditorPlugin-CrjkHNLh.js} +8 -8
- package/src/ui/dist/assets/{CodeViewerPlugin-F7saY0LM.js → CodeViewerPlugin-obnD6G5R.js} +5 -5
- package/src/ui/dist/assets/{DocViewerPlugin-COP0c7jf.js → DocViewerPlugin-DB9SUQVd.js} +3 -3
- package/src/ui/dist/assets/{GitDiffViewerPlugin-CAS05pT9.js → GitDiffViewerPlugin-DZLlNlD2.js} +1 -1
- package/src/ui/dist/assets/{ImageViewerPlugin-Bco1CN_w.js → ImageViewerPlugin-BGwfDZ0Y.js} +5 -5
- package/src/ui/dist/assets/{LabCopilotPanel-CvMlCD99.js → LabCopilotPanel-dfLptQcR.js} +10 -10
- package/src/ui/dist/assets/{LabPlugin-BYankkE4.js → LabPlugin-CeGjAl3A.js} +1 -1
- package/src/ui/dist/assets/{LatexPlugin-LDSMR-t-.js → LatexPlugin-BBJ7kd1V.js} +7 -7
- package/src/ui/dist/assets/{MarkdownViewerPlugin-B7o80jgm.js → MarkdownViewerPlugin-DKZi7BcB.js} +4 -4
- package/src/ui/dist/assets/{MarketplacePlugin-CM6ZOcpC.js → MarketplacePlugin-C_k-9jD0.js} +3 -3
- package/src/ui/dist/assets/{NotebookEditor-Dc61cXmK.js → NotebookEditor-4R88_BMO.js} +1 -1
- package/src/ui/dist/assets/{PdfLoader-DWowuQwx.js → PdfLoader-DwEFQLrw.js} +1 -1
- package/src/ui/dist/assets/{PdfMarkdownPlugin-BsJM1q_a.js → PdfMarkdownPlugin-D-jdsqF8.js} +3 -3
- package/src/ui/dist/assets/{PdfViewerPlugin-DB2eEEFQ.js → PdfViewerPlugin-CmeBGDY0.js} +10 -10
- package/src/ui/dist/assets/{SearchPlugin-CraThSvt.js → SearchPlugin-Dlz2WKJ4.js} +1 -1
- package/src/ui/dist/assets/{Stepper-CgocRTPq.js → Stepper-ClOgzWM3.js} +1 -1
- package/src/ui/dist/assets/{TextViewerPlugin-B1JGhKtd.js → TextViewerPlugin-DDQWxibk.js} +4 -4
- package/src/ui/dist/assets/{VNCViewer-CclFC7FM.js → VNCViewer-CJXT0Nm8.js} +9 -9
- package/src/ui/dist/assets/{bibtex-D3IKsMl7.js → bibtex-DLr4Rtk4.js} +1 -1
- package/src/ui/dist/assets/{code-BP37Xx0p.js → code-DgKK408Y.js} +1 -1
- package/src/ui/dist/assets/{file-content-BAJSu-9r.js → file-content-6HBqQnvQ.js} +1 -1
- package/src/ui/dist/assets/{file-diff-panel-DUGeCTuy.js → file-diff-panel-Dhu0TbBM.js} +1 -1
- package/src/ui/dist/assets/{file-socket-CXc1Ojf7.js → file-socket-CP3iwVZG.js} +1 -1
- package/src/ui/dist/assets/{file-utils-2J21jt7M.js → file-utils-BsS-Aw68.js} +1 -1
- package/src/ui/dist/assets/{image-CMMmgvcn.js → image-ByeK-Zcv.js} +1 -1
- package/src/ui/dist/assets/{index-DmwmJmbW.js → index-BLjo5--a.js} +33610 -31016
- package/src/ui/dist/assets/{index-CWgMgpow.js → index-BdsE0uRz.js} +11 -11
- package/src/ui/dist/assets/{index-s7aHnNQ4.js → index-C-eX-N6A.js} +1 -1
- package/src/ui/dist/assets/{index-KGt-z-dD.css → index-CuQhlrR-.css} +2747 -2
- package/src/ui/dist/assets/{index-BaVumsQT.js → index-DyremSIv.js} +2 -2
- package/src/ui/dist/assets/{message-square-CQRfX0Am.js → message-square-DnagiLnc.js} +1 -1
- package/src/ui/dist/assets/{monaco-B4TbdsrF.js → monaco-4kBFeprs.js} +1 -1
- package/src/ui/dist/assets/{popover-B8Rokodk.js → popover-hRCXZzs2.js} +1 -1
- package/src/ui/dist/assets/{project-sync-D_i96KH4.js → project-sync-O_85YuP6.js} +1 -1
- package/src/ui/dist/assets/{sigma-D12PnzCN.js → sigma-DvKopSnL.js} +1 -1
- package/src/ui/dist/assets/{tooltip-B6YrI4aJ.js → tooltip-BmlPc6kc.js} +1 -1
- package/src/ui/dist/assets/{trash-Bc8jGp0V.js → trash-n-UvdZFR.js} +1 -1
- package/src/ui/dist/assets/{useCliAccess-mXVCYSZ-.js → useCliAccess-WDd3_wIh.js} +1 -1
- package/src/ui/dist/assets/{useFileDiffOverlay-Bg6b9H9K.js → useFileDiffOverlay-rXLIL2NF.js} +1 -1
- package/src/ui/dist/assets/{wrap-text-Drh5GEnL.js → wrap-text-qIYQ4a_W.js} +1 -1
- package/src/ui/dist/assets/{zoom-out-CJj9DZLn.js → zoom-out-fZXCEFsy.js} +1 -1
- package/src/ui/dist/index.html +2 -2
- package/uv.lock +1155 -0
- package/src/ui/dist/assets/LabPlugin-D9jVIo0A.css +0 -2698
@@ -13,12 +13,12 @@ It absorbs the essential old DeepScientist reproducer discipline into one stage
 - Treat `artifact.interact(...)` as the main long-lived communication thread across TUI, web, and bound connectors.
 - If `artifact.interact(...)` returns queued user requirements, treat them as the highest-priority user instruction bundle before continuing baseline work.
 - Immediately follow any non-empty mailbox poll with another `artifact.interact(...)` update that confirms receipt; if the request is directly answerable, answer there, otherwise say the current subtask is paused, give a short plan plus nearest report-back point, and handle that request first.
-- Emit `artifact.interact(kind='progress', reply_mode='threaded', ...)`
+- Emit `artifact.interact(kind='progress', reply_mode='threaded', ...)` when there is real user-visible progress: the first meaningful signal of long work, a meaningful checkpoint, or a concise keepalive if active work has drifted beyond roughly 10 to 30 tool calls without a user-visible update.
 - Keep progress updates chat-like and easy to understand: say what changed, what it means, and what happens next.
 - Default to plain-language summaries. Do not mention file paths, artifact ids, branch/worktree ids, session ids, raw commands, or raw logs unless the user asks or needs them to act.
 - Message templates are references only. Adapt to the actual context and vary wording so updates feel natural and non-robotic.
 - Use `reply_mode='blocking'` only for real user decisions that cannot be resolved from local evidence.
-- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, wait up to 1 day when feasible,
+- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, and wait up to 1 day when feasible. If the blocker is a missing external credential or secret that only the user can provide, keep the quest waiting, ask the user to supply it or choose an alternative, and do not self-resolve; if resumed without that credential and no other work is possible, a long low-frequency wait such as `bash_exec(command='sleep 3600', mode='await', timeout_seconds=3700)` is acceptable. Otherwise choose the best option yourself and notify the user of the chosen option if the timeout expires.
 - If a threaded user reply arrives, interpret it relative to the latest baseline progress update before assuming the task changed completely.
 - Prefer `bash_exec` for setup, reproduction, and verification commands so each baseline action keeps a durable quest-local session id and log trail.
 - When the baseline route is durably chosen, confirmed, waived, or blocked with a clear next action, send one richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` update that says whether the baseline is trusted, blocked, or waived, why that matters, and what the next stage is.
@@ -42,16 +42,20 @@ It absorbs the essential old DeepScientist reproducer discipline into one stage
 
 ## Priority workflow
 
-
+Default to the lightest baseline path that can still establish a trustworthy comparison.
+Do not front-load a full reproduction dossier when a faster truth-finding step would tell you whether the route is even viable.
+
+The ordinary baseline order is:
 
 1. confirm quest binding and current baseline state
-2.
-3.
-4.
-5.
-6.
-7.
-
+2. look for the cheapest trustworthy route in order: attach, import, reproduce, repair
+3. capture the minimum viable contract: task, dataset or split, metric, source identity, expected command path, and main risks
+4. run a bounded smoke test as soon as that contract is concrete enough
+5. only after the smoke test is credible, expand setup notes and launch the real run
+6. verify before accepting
+7. archive, publish, or attach the result when appropriate
+
+Escalate to the heavier baseline path only when the baseline is ambiguous, broken, multi-variant, paper-to-repo mismatched, or likely to be reused beyond the current quest.
 
 If the quest is not yet bound to a stable baseline context, do not pretend the stage is ready just because some code exists locally.
 
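
The route order and the "minimum viable contract" added in this hunk can be pictured with a small Python sketch. It is illustrative only: `BaselineContract`, `pick_route`, and the availability flags are hypothetical names chosen for this example and are not taken from the DeepScientist package.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Route names from the workflow text, cheapest first.
ROUTE_ORDER = ("attach", "import", "reproduce", "repair")


@dataclass
class BaselineContract:
    """Hypothetical container for the minimum viable contract fields."""
    task: str
    dataset_or_split: str
    metric: str
    source_identity: str
    expected_command_path: str
    main_risks: List[str] = field(default_factory=list)

    def is_concrete(self) -> bool:
        # Assumption: "concrete enough" means every core field is non-empty.
        core = (self.task, self.dataset_or_split, self.metric,
                self.source_identity, self.expected_command_path)
        return all(value.strip() for value in core)


def pick_route(available: Dict[str, bool]) -> Optional[str]:
    """Return the first viable route in the documented order, or None."""
    for route in ROUTE_ORDER:
        if available.get(route, False):
            return route
    return None
```

Under these assumptions, a smoke test would be scheduled as soon as `is_concrete()` returns true for the chosen route, rather than after a full write-up.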
@@ -75,16 +79,17 @@ Do not casually skip these gates.
 
 ## Phase routing rule
 
-Treat
-At any moment, the work should
+Treat `analysis`, `setup`, `execution`, and `verification` as logical control gates, not paperwork walls.
+At any moment, the work should have one dominant phase among:
 
 - `analysis`
 - `setup`
 - `execution`
 - `verification`
 
-
-
+Keep the dominant phase explicit, but allow small backtracks and lightweight overlap when they reduce wasted work.
+Do not delay an early smoke test just because a fuller write-up is not done yet.
+Before a real long run, make sure the minimum viable contract is explicit and the active phase is still easy to reconstruct.
 
 ## Use when
 
@@ -140,14 +145,15 @@ Do not treat memory alone as sufficient evidence for baseline readiness.
 The baseline line should also maintain a durable working-record area outside the execution surface.
 Recommended quest-visible records include:
 
-- `analysis_plan.md`
+- `analysis_plan.md` or a compact equivalent section in `execution.md`
 - `setup.md`
 - `execution.md`
 - `verification.md`
-- `STRUCTURE.md`
-- `REPRO_CHECKLIST.md`
+- `STRUCTURE.md` only when the workspace layout is non-obvious or later reuse depends on it
+- `REPRO_CHECKLIST.md` only when the route is complex, repair-heavy, multi-variant, or publication-facing
 
-
+For a simple attach/import flow or a straightforward reproduce flow, do not stall just to precreate every one of these files.
+Start with the smallest durable note that preserves the route, command path, target outputs, and main risks; expand it only after the route proves real.
 
 ## Required durable outputs
 
@@ -163,20 +169,25 @@ The baseline stage should usually leave behind:
 ## Stable execution contract
 
 To keep baseline work stable across different quests, do not stop at loose prose.
-
+But also do not confuse stability with ceremony.
+Use the lightest durable structure that keeps the baseline auditable and reusable.
 
 Minimum stability rules:
 
--
+- before the first real run, leave one durable note with the chosen route, expected command path, target outputs, and main risks
+- after each smoke test or real run, record what actually happened and whether the route still looks viable
+- before acceptance, leave a clear verification note and baseline gate decision
 - every route selection should leave one explicit reasoned decision record
 - every accepted baseline should leave one accepted baseline artifact
 - every blocked baseline line should leave one blocked report and one next-step decision
 - every handoff should name the active baseline reference and trusted metric set explicitly
+- do not require every optional checklist or template before the first smoke test
+- if one rolling note is enough for a simple baseline line, use it
 
 Recommended phase-to-output mapping:
 
-- `analysis` -> `analysis_plan.md` plus optional route decision artifact
-- `setup` -> `setup.md`
+- `analysis` -> a brief `analysis_plan.md` or equivalent compact route note, plus optional route decision artifact
+- `setup` -> `setup.md` when setup choices are non-trivial
 - `execution` -> `execution.md` plus progress artifacts when long-running
 - `verification` -> `verification.md` plus accepted baseline artifact and `artifact.confirm_baseline(...)`, or a blocked report plus `artifact.waive_baseline(...)` when skipping is intentional
 
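
The revised phase-to-output mapping can be read as a small lookup table. The sketch below is one way to express it in Python; `PHASE_NOTES`, `notes_for`, and the `setup_is_nontrivial` flag are hypothetical names for illustration, and only the primary note per phase is modelled (artifacts and `artifact.*` calls are left out).

```python
from typing import Dict, List

# Phase -> primary durable note, following the mapping in the hunk above.
PHASE_NOTES: Dict[str, str] = {
    "analysis": "analysis_plan.md",
    "setup": "setup.md",
    "execution": "execution.md",
    "verification": "verification.md",
}


def notes_for(phases: List[str], setup_is_nontrivial: bool = False) -> List[str]:
    """List the notes a baseline line would leave for the given phases."""
    notes = []
    for phase in phases:
        if phase not in PHASE_NOTES:
            raise KeyError(f"unknown phase: {phase}")
        if phase == "setup" and not setup_is_nontrivial:
            # setup.md is only expected when setup choices are non-trivial.
            continue
        notes.append(PHASE_NOTES[phase])
    return notes
```

For a simple baseline line this yields a short list, which matches the stated intent of not requiring every template before the first smoke test.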
@@ -204,6 +215,12 @@ Global reusable registry paths:
 Do not invent parallel durable locations when these runtime contracts already exist.
 Do not leave the authoritative metric contract only in chat, memory, or prose once the baseline is accepted.
 
+If a baseline is reproduced only because an analysis campaign needs an extra comparator:
+
+- still place it under `<quest_root>/baselines/local/<baseline_id>/` or `<quest_root>/baselines/imported/<baseline_id>/`
+- treat it as a supplementary analysis baseline unless the quest explicitly promotes it into the canonical gate
+- do not call `artifact.confirm_baseline(...)` for that supplementary case unless the quest truly intends to replace the canonical baseline
+
 ## Baseline id and variant rules
 
 Baseline identity should be stable and path-safe.
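
As a rough illustration of the two baseline locations named above, the following Python sketch builds the directory for a baseline id and rejects ids that are not path-safe. The `SAFE_ID` pattern and the `baseline_dir` helper are assumptions made for this example; the package's actual id rules are not shown in this hunk.

```python
import re
from pathlib import Path

# Assumed notion of "path-safe": letters, digits, dot, underscore, hyphen,
# starting with a letter or digit. The real rule may differ.
SAFE_ID = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*$")


def baseline_dir(quest_root: str, baseline_id: str, imported: bool = False) -> Path:
    """Return <quest_root>/baselines/{local|imported}/<baseline_id>."""
    if not SAFE_ID.match(baseline_id) or ".." in baseline_id:
        raise ValueError(f"baseline id is not path-safe: {baseline_id!r}")
    kind = "imported" if imported else "local"
    return Path(quest_root) / "baselines" / kind / baseline_id
```

A supplementary comparator baseline would use the same helper; whether it is later promoted to the canonical gate is a separate decision that this sketch does not model.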
@@ -342,8 +359,16 @@ Before running anything substantial, determine:
 - expected paper or repo numbers, if any
 - local resource constraints
 
-
-
+For straightforward baseline work, start with a quick viability pass:
+
+- find the real run or evaluation entrypoint
+- identify the dataset/split and metric contract
+- identify likely environment blockers
+- define the cheapest credible smoke test
+
+Escalate from that quick pass to a fuller baseline codebase audit when the command path is unclear, the repo is large or confusing, the paper and code diverge materially, repair mode is active, or custom code changes look likely.
+
+When the fuller audit is necessary, capture at least:
 
 - major modules and files
 - end-to-end data flow
@@ -390,7 +415,7 @@ At minimum, the plan should capture:
 - key risks
 - verification targets
 
-When
+When the analysis note becomes substantial, structure `analysis_plan.md` with headings close to:
 
 - executive summary
 - codebase analysis
@@ -433,6 +458,10 @@ Prepare the selected route:
 - reproduce: prepare the baseline work directory, commands, config pointers, and environment notes
 - repair: identify the precise broken point before rerunning blindly
 
+For a fast-path reproduction, setup can stay lightweight.
+Confirm the working directory, environment, config, output paths, smoke command, and long-run command, then move forward.
+Do not manufacture a fresh workspace tree or copy the repo just to satisfy a template if the existing layout is already workable and auditable.
+
 Capture:
 
 - baseline identifier
@@ -450,8 +479,8 @@ Setup should also confirm:
 - required dependencies or environments are known
 - the execution plan is realistic for the detected hardware
 
-
-
+If a dedicated baseline workspace is needed, establish a clear layout.
+One workable structure is:
 
 ```text
 <baseline_root>/
@@ -465,7 +494,7 @@ Recommended structure:
 <run_id>/
 ```
 
-
+If the baseline becomes long-lived, shared, or non-obvious, the quest-visible audit area may contain:
 
 ```text
 <quest_root>/
@@ -505,8 +534,10 @@ Execution rules:
 - if a run is long, emit progress artifacts at meaningful checkpoints
 - if setup required code changes, checkpoint only explainable, minimal changes
 
-Execution should rely on explicit scripts or command paths where possible.
-
+Execution should rely on existing explicit scripts or command paths where possible.
+Prefer the smallest runnable command that proves the baseline route.
+Do not build a new wrapper, registry, or result-export scaffold unless existing commands are missing, repeated reruns justify it, or later automation clearly needs it.
+If a wrapper or entry script is truly needed, it should support most of the following:
 
 - run mode for missing combinations
 - print-only mode that summarizes existing results without rerunning everything
@@ -543,10 +574,18 @@ If a result backup is useful for audit or recovery, create it explicitly rather
 
 Long-running execution rules:
 
+- before a substantial baseline reproduction, run a bounded smoke test first so command paths, output locations, and metric plumbing are validated cheaply
+- once the smoke test passes, launch the real baseline reproduction with `bash_exec(mode='detach', ...)` and normally leave `timeout_seconds` unset for the long run itself
+- when monitoring that detached run, prefer `bash_exec(mode='read', id=..., tail_limit=..., order='desc')` so you inspect the newest log evidence first
+- after the first read, prefer incremental checks with `bash_exec(mode='read', id=..., after_seq=last_seen_seq, tail_limit=..., order='asc')` so you only inspect newly appended evidence
+- if you need to recover ids or confirm the newest session quickly, use `bash_exec(mode='history')` or `bash_exec(mode='list')` rather than guessing
+- include a structured `comment` on long-running bash sessions with fields such as `stage`, `goal`, `action`, `expected_signal`, and `next_check`
+- use `silent_seconds`, `progress_age_seconds`, `signal_age_seconds`, and `watchdog_overdue` from `bash_exec(mode='list'|'read', ...)` as the default staleness checks
+- when the reproduction code is under your control, prefer a throttled `tqdm` progress reporter and, when feasible, pair it with periodic `__DS_PROGRESS__` JSON lines carrying phase and ETA
 - if a command is expected to run for a long time, monitor it as a real background task rather than assuming success
 - do not write final summaries or accepted metrics until the command has actually completed
 - verify that the expected result files exist before treating the run as finished
-- if a task
+- if a task is invalid, wedged, or failed, stop it with `bash_exec(mode='kill', id=..., wait=true, timeout_seconds=...)`; if it must die immediately, add `force=true`, then diagnose the reason and either retry with a documented fix or record the failure durably
 
 Recommended monitoring cadence for long-running work:
 
@@ -557,7 +596,7 @@ Recommended monitoring cadence for long-running work:
 - fifth check after about 1800 seconds
 - after that, keep checking about every 1800 seconds while the run is still active
 
-The exact mechanism should prefer `bash_exec(mode='await' | 'detach' | 'read' | 'list' | 'kill', ...)`, but the behavioral rule stays the same:
+The exact mechanism should prefer `bash_exec(mode='await' | 'detach' | 'read' | 'list' | 'history' | 'kill', ...)`, with `read` usually using a tailed or incremental window during monitoring, but the behavioral rule stays the same:
 do not report completion until the run is actually done and the outputs are real.
 After each meaningful check, notify the user through `artifact.interact(kind='progress', ...)` with current status, latest evidence, and the next monitoring point.
 Do this after every completed wait cycle for important long-running work; do not skip several sleep windows without reporting.
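
The "newest first, then incremental" reading pattern described for `bash_exec(mode='read', ...)` can be illustrated with a self-contained Python sketch. Nothing here calls the real tool: `FakeSession` is a hypothetical in-memory stand-in, and the shape of its return value (a list of dicts with `seq` and `text`) is an assumption, since the diff does not show what `bash_exec` actually returns.

```python
from typing import Any, Dict, List, Tuple


class FakeSession:
    """Hypothetical in-memory stand-in for a detached session's log."""

    def __init__(self) -> None:
        self._lines: List[Dict[str, Any]] = []

    def append(self, text: str) -> None:
        # Sequence numbers start at 1 and grow by one per log line.
        self._lines.append({"seq": len(self._lines) + 1, "text": text})

    def read(self, after_seq: int = 0, tail_limit: int = 50,
             order: str = "asc") -> List[Dict[str, Any]]:
        # Keep lines newer than after_seq, then only the newest tail_limit of them.
        fresh = [line for line in self._lines if line["seq"] > after_seq]
        window = fresh[-tail_limit:]
        return list(reversed(window)) if order == "desc" else window


def poll_new_lines(session: FakeSession,
                   last_seen_seq: int) -> Tuple[List[Dict[str, Any]], int]:
    """Return (new_lines, updated_last_seen_seq) for one monitoring check."""
    new_lines = session.read(after_seq=last_seen_seq, order="asc")
    if new_lines:
        last_seen_seq = new_lines[-1]["seq"]
    return new_lines, last_seen_seq
```

In this sketch the first check would use `read(tail_limit=..., order="desc")` to see the latest evidence, and every later check would go through `poll_new_lines` so that only newly appended lines are inspected before the next progress update.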
@@ -664,6 +703,8 @@ If variants exist, also include:
 ## Durable note templates
 
 Use compact but structured notes so later stages do not need to reconstruct baseline state from chat history.
+The templates below are references, not prerequisites for the first smoke test.
+For simple baseline lines, keep them short and fill only the sections that matter.
 
 ### `analysis_plan.md`
 
@@ -12,14 +12,14 @@ Use this skill whenever continuation is non-trivial.
 - Treat `artifact.interact(...)` as the main long-lived communication thread across TUI, web, and bound connectors.
 - If `artifact.interact(...)` returns queued user requirements, treat them as the highest-priority user instruction bundle before making the next decision.
 - Immediately follow any non-empty mailbox poll with another `artifact.interact(...)` update that confirms receipt; if the request is directly answerable, answer there, otherwise say the current subtask is paused, give a short plan plus nearest report-back point, and handle that request first.
-- Emit `artifact.interact(kind='progress', reply_mode='threaded', ...)`
+- Emit `artifact.interact(kind='progress', reply_mode='threaded', ...)` when there is real user-visible progress: a meaningful checkpoint, a route-shaping update, or a concise keepalive if active work has drifted beyond roughly 10 to 30 tool calls without a user-visible update.
 - Message templates are references only. Adapt to context and vary wording so updates feel natural and non-robotic.
 - Keep progress updates chat-like and easy to understand: say what changed, what it means, and what happens next.
 - Default to plain-language summaries. Do not mention file paths, artifact ids, branch/worktree ids, session ids, raw commands, or raw logs unless the user asks or needs them to act.
 - If the runtime starts an auto-continue turn with no new user message, continue from the active requirements and durable quest state instead of replaying the previous user turn.
 - If `startup_contract.decision_policy = autonomous`, do not emit ordinary `artifact.interact(kind='decision_request', ...)` calls; decide the route yourself, record the reason, and continue.
 - Use `reply_mode='blocking'` for the actual decision request only when the user must choose before safe continuation and the quest contract still allows a user-gated decision.
-- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, wait up to 1 day when feasible,
+- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, and wait up to 1 day when feasible. If the blocker is a missing external credential or secret that only the user can provide, keep the quest waiting, ask the user to supply it or choose an alternative, and do not self-resolve; if resumed without that credential and no other work is possible, a long low-frequency wait such as `bash_exec(command='sleep 3600', mode='await', timeout_seconds=3700)` is acceptable. Otherwise choose the best option yourself and notify the user of the chosen option if the timeout expires.
 - If a threaded user reply arrives, interpret it relative to the latest decision or progress interaction before assuming the task changed completely.
 - Quest completion is a special terminal decision: first ask for explicit completion approval with `artifact.interact(kind='decision_request', reply_mode='blocking', reply_schema={'decision_type': 'quest_completion_approval'}, ...)`, and only after an explicit approval reply should you call `artifact.complete_quest(...)`.
 
@@ -319,7 +319,7 @@ When asking, use a structured decision request with:
 - tradeoffs, including the main pros and cons for each option
 - recommended option first
 - explicit reply format
-- a stated timeout window; normally wait up to 1 day before self-resolving if no user reply arrives
+- a stated timeout window; normally wait up to 1 day before self-resolving if no user reply arrives, except when the only blocker is a missing external credential or secret that only the user can provide
 
 ### 6. Record the decision durably
 
@@ -327,6 +327,7 @@ Use `artifact.record(kind='decision', ...)` for the final decision.
 
 If user input is needed, also use `artifact.interact(kind='decision_request', ...)`.
 If the timeout expires without a user reply, choose the best option yourself, record why, and notify the user of the chosen option before moving on.
+This does not apply when the only blocker is a missing external credential or secret that only the user can provide; in that case keep the interaction waiting and, if resumed without the credential, you may park with `bash_exec(command='sleep 3600', mode='await', timeout_seconds=3700)` instead of busy-looping.
 
 If `startup_contract.decision_policy = autonomous`, ordinary route ambiguity is not by itself grounds to request user input.
 In that mode, only explicit approval-style exceptions such as quest completion should normally become blocking user decisions.
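
The timeout rule and its new credential exception can be summarised as a small decision function. This Python sketch is a reading of the text above, not code from the package: `DecisionRequest`, `resolve`, and `needs_user_credential` are hypothetical names, and treating the first (recommended) option as "the best option" on timeout is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

ONE_DAY_SECONDS = 24 * 60 * 60  # "wait up to 1 day" from the text above


@dataclass
class DecisionRequest:
    """Hypothetical model of a blocking decision request."""
    options: List[str]                  # 1 to 3 options, recommended one first
    needs_user_credential: bool = False


def resolve(request: DecisionRequest, user_reply: Optional[str],
            waited_seconds: float) -> Optional[str]:
    """Return the chosen option, or None while the quest should keep waiting."""
    if not 1 <= len(request.options) <= 3:
        raise ValueError("provide 1 to 3 concrete options")
    if user_reply is not None:
        return user_reply
    if request.needs_user_credential:
        # Only the user can supply the secret, so never self-resolve.
        return None
    if waited_seconds >= ONE_DAY_SECONDS:
        # Timeout expired: fall back to the recommended (first) option.
        return request.options[0]
    return None
```

A `None` result corresponds to parking the quest (for example with the long `sleep 3600` wait the text mentions) rather than busy-looping; any non-`None` result would still need to be recorded and reported to the user.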
@@ -12,7 +12,7 @@ Use this skill for the main evidence-producing runs of the quest.
 - Treat `artifact.interact(...)` as the main long-lived communication thread across TUI, web, and bound connectors.
 - If `artifact.interact(...)` returns queued user requirements, treat them as the highest-priority user instruction bundle before continuing the run plan.
 - Immediately follow any non-empty mailbox poll with another `artifact.interact(...)` update that confirms receipt; if the request is directly answerable, answer there, otherwise say the current subtask is paused, give a short plan plus nearest report-back point, and handle that request first.
-- Emit `artifact.interact(kind='progress', reply_mode='threaded', ...)`
+- Emit `artifact.interact(kind='progress', reply_mode='threaded', ...)` when there is real user-visible progress: the first meaningful signal of long work, a meaningful checkpoint, or a concise keepalive if active work has drifted beyond roughly 10 to 30 tool calls without a user-visible update.
 - Keep progress updates chat-like and easy to understand: say what changed, what it means, and what happens next.
 - Default to plain-language summaries. Do not mention file paths, artifact ids, branch/worktree ids, session ids, raw commands, or raw logs unless the user asks or needs them to act.
 - Keep ordinary subtask completions concise. When a main experiment actually finishes or reaches a stage-significant checkpoint, upgrade to a richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` report rather than another short progress line.
@@ -43,7 +43,7 @@ Use this skill for the main evidence-producing runs of the quest.
|
|
|
43
43
|
- If the runtime starts an auto-continue turn with no new user message, continue from the current run state, logs, artifacts, and active requirements instead of replaying the previous user turn.
|
|
44
44
|
- Progress message templates are references only. Adapt to the actual context and vary wording so messages feel human, respectful, and non-robotic.
|
|
45
45
|
- Use `reply_mode='blocking'` only for real user decisions that cannot be resolved from local evidence.
|
|
46
|
-
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, wait up to 1 day when feasible,
|
|
46
|
+
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, and wait up to 1 day when feasible. If the blocker is a missing external credential or secret that only the user can provide, keep the quest waiting, ask the user to supply it or choose an alternative, and do not self-resolve; if resumed without that credential and no other work is possible, a long low-frequency wait such as `bash_exec(command='sleep 3600', mode='await', timeout_seconds=3700)` is acceptable. Otherwise, if the timeout expires, choose the best option yourself and notify the user of the chosen option.
|
|
47
47
|
- If a threaded user reply arrives, interpret it relative to the latest experiment progress update before assuming the task changed completely.
|
|
48
48
|
- Prefer `bash_exec` for experiment commands so each run gets a durable session id, quest-local log folder, and later `read/list/kill` control.
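The blocking-decision rule in the list above can be sketched as a small pure function. This is a minimal illustration under stated assumptions, not the runtime's implementation: `DecisionRequest`, `resolve`, the `"wait"` sentinel, and the choice to fall back to the first (recommended) option on timeout are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

ONE_DAY_SECONDS = 24 * 60 * 60  # "wait up to 1 day when feasible"


@dataclass
class DecisionRequest:
    """Hypothetical container for one blocking decision request."""
    options: List[str]               # recommended option goes first
    needs_user_secret: bool = False  # blocker is a credential only the user can supply

    def __post_init__(self) -> None:
        if not 1 <= len(self.options) <= 3:
            raise ValueError("provide 1 to 3 concrete options")


def resolve(request: DecisionRequest, user_choice: Optional[str], waited_seconds: float) -> str:
    """Return the chosen option, or 'wait' if the quest should keep waiting."""
    if user_choice is not None:
        return user_choice
    if request.needs_user_secret:
        # A missing credential is never self-resolved; keep waiting for the user.
        return "wait"
    if waited_seconds >= ONE_DAY_SECONDS:
        # Timeout expired: assume the recommended (first) option is the best one.
        return request.options[0]
    return "wait"
```

Keeping the credential check ahead of the timeout check is the point of the sketch: a missing secret keeps the quest waiting no matter how long it has already waited.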
|
|
49
49
|
|
|
@@ -377,9 +377,14 @@ Last-known-good rule:
|
|
|
377
377
|
|
|
378
378
|
For commands that may run longer than a few minutes:
|
|
379
379
|
|
|
380
|
-
-
|
|
380
|
+
- before the real long run, execute a bounded smoke test or pilot that validates command paths, outputs, and basic metrics
|
|
381
|
+
- once the smoke test passes, launch the real run with `bash_exec(mode='detach', ...)` and normally leave `timeout_seconds` unset for that long run
|
|
381
382
|
- monitor through durable logs rather than only live terminal output
|
|
382
|
-
- use `bash_exec(mode='list')` and `bash_exec(mode='read', id
|
|
383
|
+
- use `bash_exec(mode='list')` and `bash_exec(mode='read', id=..., tail_limit=..., order='desc')` to monitor or revisit managed commands while focusing on the newest evidence first
|
|
384
|
+
- after the first read, prefer `bash_exec(mode='read', id=..., after_seq=last_seen_seq, tail_limit=..., order='asc')` so later checks only fetch new evidence
|
|
385
|
+
- if you need to recover ids or sanity-check the active session ordering, use `bash_exec(mode='history')`
|
|
386
|
+
- launch important runs with a structured `comment` such as `{stage, goal, action, expected_signal, next_check}`
|
|
387
|
+
- use `silent_seconds`, `progress_age_seconds`, `signal_age_seconds`, and `watchdog_overdue` from `bash_exec(mode='list'|'read', ...)` as your default watchdog signals
|
|
383
388
|
- use an explicit wait-and-check loop such as:
|
|
384
389
|
- wait about `60s`, then inspect logs
|
|
385
390
|
- wait about `120s`, then inspect logs
|
|
@@ -387,9 +392,10 @@ For commands that may run longer than a few minutes:
|
|
|
387
392
|
- wait about `600s`, then inspect logs
|
|
388
393
|
- wait about `1800s`, then inspect logs
|
|
389
394
|
- then keep checking about every `1800s` while the run is still active
|
|
390
|
-
- if needed, use
|
|
395
|
+
- if needed, use an explicit bounded wait such as `bash_exec(command='sleep 60', mode='await', timeout_seconds=70)` or `bash_exec(mode='await', id=..., timeout_seconds=...)` between checks
|
|
391
396
|
- after every completed sleep / await cycle, inspect logs and send `artifact.interact(kind='progress', ...)` with the latest real status, latest evidence, the next checkpoint, and the estimated next reply time
|
|
392
397
|
- after the first meaningful signal and then at real checkpoints (e.g., completion, or roughly every 30 minutes if still running), keep those progress updates going rather than waiting silently
|
|
398
|
+
- if the run is clearly invalid, wedged, or superseded, stop it with `bash_exec(mode='kill', id=..., wait=true, timeout_seconds=...)`, adding `force=true` if it must die immediately; then record the reason, fix the issue, and relaunch cleanly
|
|
393
399
|
- do not report completion until logs and output files both confirm completion
|
|
394
400
|
|
|
395
401
|
Always preserve the managed `bash_exec` log and export it into the experiment artifact directory when the run artifact is written.
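The wait-and-check cadence and the `after_seq` cursor described in the list above can be sketched in a few lines. This is only an illustration under assumptions: `check_delays` and `new_entries` are hypothetical helpers, the log-entry shape (`{"seq": ...}`) is invented because the real `bash_exec(mode='read', ...)` response format is not shown here, and the default ramp uses only the steps visible in this diff (the hunk omits a line between the `120s` and `600s` steps).

```python
from itertools import chain, repeat
from typing import Dict, Iterable, Iterator, List, Tuple


def check_delays(initial: Iterable[int] = (60, 120, 600, 1800),
                 steady: int = 1800) -> Iterator[int]:
    """Yield seconds to wait before each log check: a ramp-up, then a steady cadence."""
    return chain(initial, repeat(steady))


def new_entries(log: List[Dict], after_seq: int) -> Tuple[List[Dict], int]:
    """Return entries with seq > after_seq in ascending order, plus the updated cursor."""
    fresh = sorted((e for e in log if e["seq"] > after_seq), key=lambda e: e["seq"])
    last_seen = fresh[-1]["seq"] if fresh else after_seq
    return fresh, last_seen
```

Taking the first six values of `check_delays()` gives 60, 120, 600, 1800, 1800, 1800. Feeding the returned cursor back into `new_entries` on each check means later reads only surface evidence that has not been seen yet.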
|
|
@@ -404,7 +410,7 @@ Long loops should emit structured progress markers rather than noisy raw progres
|
|
|
404
410
|
- do not paste raw progress lines into summaries
|
|
405
411
|
- when possible include `eta` in seconds and `next_reply_at` or `next_check_at` so web/TUI can show the next expected update
|
|
406
412
|
|
|
407
|
-
If the
|
|
413
|
+
If you control the code, prefer a throttled `tqdm`-style progress reporter for the run itself and pair it with concise structured `__DS_PROGRESS__` lines when feasible so monitoring remains machine-readable.
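A throttled reporter of the kind suggested above might look like the following. It is a standard-library stand-in for a `tqdm`-style reporter, written as a sketch: the class name `ThrottledProgress`, the 30-second default interval, and the JSON payload keys `done` and `total` are assumptions. Only the `__DS_PROGRESS__` marker and an `eta` in seconds come from the text, and the actual line format the runtime parses is not shown in this diff.

```python
import json
import time
from typing import Callable, Optional


class ThrottledProgress:
    """Emit at most one structured progress line per `min_interval` seconds."""

    def __init__(self, total: int, min_interval: float = 30.0,
                 clock: Callable[[], float] = time.monotonic,
                 emit: Callable[[str], None] = print) -> None:
        self.total = total
        self.min_interval = min_interval
        self.clock = clock
        self.emit = emit
        self.start = clock()
        self.last_emit: Optional[float] = None

    def update(self, done: int) -> bool:
        """Report progress; return True if a line was emitted."""
        now = self.clock()
        finished = done >= self.total
        recently_emitted = (self.last_emit is not None
                            and now - self.last_emit < self.min_interval)
        if recently_emitted and not finished:
            return False  # throttled: too soon since the last line
        elapsed = now - self.start
        # Linear ETA estimate in seconds; None until at least one item is done.
        eta = None if done <= 0 else round(elapsed * (self.total - done) / done)
        # Payload keys are assumptions; the text only names `eta` in seconds.
        payload = {"done": done, "total": self.total, "eta": eta}
        self.emit("__DS_PROGRESS__ " + json.dumps(payload))
        self.last_emit = now
        return True
```

Calling `update(step)` inside a training loop then produces at most one machine-readable line per interval, plus a final line when the loop completes.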
|
|
408
414
|
|
|
409
415
|
### 6. Validate the outputs
|
|
410
416
|
|
|
@@ -466,6 +472,22 @@ That call is responsible for writing:
|
|
|
466
472
|
- evidence paths
|
|
467
473
|
- changed files
|
|
468
474
|
- relevant config paths when applicable
|
|
475
|
+
- `evaluation_summary` with exactly these six fields:
|
|
476
|
+
- `takeaway`
|
|
477
|
+
- `claim_update`
|
|
478
|
+
- `baseline_relation`
|
|
479
|
+
- `comparability`
|
|
480
|
+
- `failure_mode`
|
|
481
|
+
- `next_action`
|
|
482
|
+
|
|
483
|
+
Use `evaluation_summary` as the short structured judgment layer on top of the longer narrative fields:
|
|
484
|
+
|
|
485
|
+
- `takeaway`: one sentence the next reader can reuse directly
|
|
486
|
+
- `claim_update`: `strengthens`, `weakens`, `narrows`, or `neutral`
|
|
487
|
+
- `baseline_relation`: `better`, `worse`, `mixed`, or `not_comparable`
|
|
488
|
+
- `comparability`: `high`, `medium`, or `low`
|
|
489
|
+
- `failure_mode`: `none`, `implementation`, `evaluation`, `environment`, or `direction`
|
|
490
|
+
- `next_action`: the immediate route such as `continue`, `revise_idea`, `analysis_campaign`, `write`, or `stop`
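The six-field contract above lends itself to a quick pre-flight check before calling `artifact.record_main_experiment(...)`. The sketch below is illustrative only: `validate_evaluation_summary` is a hypothetical helper, not part of the package, and it treats `takeaway` and `next_action` as free-form because the text gives no closed value set for them (`next_action` is listed with "such as").

```python
from typing import Dict, List

REQUIRED_FIELDS = (
    "takeaway", "claim_update", "baseline_relation",
    "comparability", "failure_mode", "next_action",
)

# Closed value sets copied from the field descriptions above.
ALLOWED_VALUES = {
    "claim_update": {"strengthens", "weakens", "narrows", "neutral"},
    "baseline_relation": {"better", "worse", "mixed", "not_comparable"},
    "comparability": {"high", "medium", "low"},
    "failure_mode": {"none", "implementation", "evaluation", "environment", "direction"},
}


def validate_evaluation_summary(summary: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the summary passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in summary:
            problems.append(f"missing field: {field}")
    for field in summary:
        if field not in REQUIRED_FIELDS:
            problems.append(f"unexpected field: {field}")
    for field, allowed in ALLOWED_VALUES.items():
        if field in summary and summary[field] not in allowed:
            problems.append(f"invalid value for {field}: {summary[field]!r}")
    return problems
```

Returning a list of problems instead of raising on the first one lets a caller report every issue in a single progress update.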
|
|
469
491
|
|
|
470
492
|
After `artifact.record_main_experiment(...)` succeeds, do not assume the same branch should absorb the next round by default.
|
|
471
493
|
Interpret the measured result first, then either:
|
|
@@ -12,12 +12,12 @@ Use this skill to close or pause a quest responsibly.
|
|
|
12
12
|
- Treat `artifact.interact(...)` as the main long-lived communication thread across TUI, web, and bound connectors.
|
|
13
13
|
- If `artifact.interact(...)` returns queued user requirements, treat them as the highest-priority user instruction bundle before closing or pausing the quest.
|
|
14
14
|
- Immediately follow any non-empty mailbox poll with another `artifact.interact(...)` update that confirms receipt; if the request is directly answerable, answer there, otherwise say the current subtask is paused, give a short plan plus nearest report-back point, and handle that request first.
|
|
15
|
-
- Emit `artifact.interact(kind='progress', reply_mode='threaded', ...)`
|
|
15
|
+
- Emit `artifact.interact(kind='progress', reply_mode='threaded', ...)` when there is real user-visible progress: the first meaningful signal of long work, a meaningful checkpoint, or a concise keepalive if active work has drifted beyond roughly 10 to 30 tool calls without a user-visible update.
|
|
16
16
|
- Keep progress updates chat-like and easy to understand: say what changed, what it means, and what happens next.
|
|
17
17
|
- Default to plain-language summaries. Do not mention file paths, artifact ids, branch/worktree ids, session ids, raw commands, or raw logs unless the user asks or needs them to act.
|
|
18
18
|
- If the runtime starts an auto-continue turn with no new user message, keep finalizing from the durable quest state and active requirements instead of replaying the previous user turn.
|
|
19
19
|
- Use `reply_mode='blocking'` only for real user decisions that cannot be resolved from local evidence.
|
|
20
|
-
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, wait up to 1 day when feasible,
|
|
20
|
+
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, and wait up to 1 day when feasible. If the blocker is a missing external credential or secret that only the user can provide, keep the quest waiting, ask the user to supply it or choose an alternative, and do not self-resolve; if resumed without that credential and no other work is possible, a long low-frequency wait such as `bash_exec(command='sleep 3600', mode='await', timeout_seconds=3700)` is acceptable. Otherwise, if the timeout expires, choose the best option yourself and notify the user of the chosen option.
|
|
21
21
|
- If a threaded user reply arrives, interpret it relative to the latest finalize progress update before assuming the task changed completely.
|
|
22
22
|
- When finalize reaches a real closure state, pause-ready packet, or route-back decision, send one threaded `artifact.interact(kind='milestone', ...)` update that names the recommendation, why it is the right call, and any reopen condition that still matters.
|
|
23
23
|
- True quest completion still requires explicit user approval through the runtime completion flow before calling `artifact.complete_quest(...)`.
|
|
@@ -124,9 +124,12 @@ When a paper bundle exists, verify the manifest inventory explicitly, including:
|
|
|
124
124
|
- referenced `writing_plan_path`
|
|
125
125
|
- referenced `references_path`
|
|
126
126
|
- referenced `claim_evidence_map_path`
|
|
127
|
+
- referenced `baseline_inventory_path`
|
|
127
128
|
- referenced `compile_report_path`
|
|
128
129
|
- referenced `pdf_path`
|
|
129
130
|
- referenced `latex_root_path`
|
|
131
|
+
- `release/open_source/manifest.json` when open-source preparation has started
|
|
132
|
+
- `release/open_source/cleanup_plan.md` when the paper line is being prepared for a public code release
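The inventory check above can be automated with a small existence test. This is a sketch under assumptions: it supposes the manifest is a flat mapping from the listed `*_path` keys to paths relative to a bundle root, which this diff does not confirm, and `missing_manifest_paths` is a hypothetical helper. Only the keys visible in this hunk are included; earlier checklist entries are not shown here.

```python
from pathlib import Path
from typing import Dict, List

# Keys taken from the checklist lines visible in this hunk.
REFERENCED_PATH_KEYS = (
    "writing_plan_path",
    "references_path",
    "claim_evidence_map_path",
    "baseline_inventory_path",
    "compile_report_path",
    "pdf_path",
    "latex_root_path",
)


def missing_manifest_paths(manifest: Dict[str, str], root: Path) -> List[str]:
    """Return the keys whose path is absent from the manifest or missing on disk."""
    missing = []
    for key in REFERENCED_PATH_KEYS:
        value = manifest.get(key)
        if not value or not (root / value).exists():
            missing.append(key)
    return missing
```

An empty result means every referenced path resolved; any returned key points at a concrete gap to close before finalizing.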
|
|
130
133
|
|
|
131
134
|
### 2. Build the final claim ledger
|
|
132
135
|
|
package/src/skills/idea/SKILL.md
CHANGED
|
@@ -12,7 +12,7 @@ Use this skill to turn the current baseline and problem frame into concrete, lit
|
|
|
12
12
|
- Treat `artifact.interact(...)` as the main long-lived communication thread across TUI, web, and bound connectors.
|
|
13
13
|
- If `artifact.interact(...)` returns queued user requirements, treat them as the highest-priority user instruction bundle before selecting or refining ideas.
|
|
14
14
|
- Immediately follow any non-empty mailbox poll with another `artifact.interact(...)` update that confirms receipt; if the request is directly answerable, answer there, otherwise say the current subtask is paused, give a short plan plus nearest report-back point, and handle that request first.
|
|
15
|
-
- Emit `artifact.interact(kind='progress', reply_mode='threaded', ...)`
|
|
15
|
+
- Emit `artifact.interact(kind='progress', reply_mode='threaded', ...)` when there is real user-visible progress: the first meaningful signal of long work, a meaningful checkpoint, or a concise keepalive if active work has drifted beyond roughly 10 to 30 tool calls without a user-visible update.
|
|
16
16
|
- Keep progress updates chat-like and easy to understand: say what changed, what it means, and what happens next.
|
|
17
17
|
- Default to plain-language summaries. Do not mention file paths, artifact ids, branch/worktree ids, session ids, raw commands, or raw logs unless the user asks or needs them to act.
|
|
18
18
|
- Keep ordinary subtask completions concise. When the idea stage actually finishes a meaningful deliverable such as a selected idea package, a rejected-ideas summary, or a route-shaping ideation checkpoint, upgrade to a richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` report.
|
|
@@ -21,7 +21,7 @@ Use this skill to turn the current baseline and problem frame into concrete, lit
|
|
|
21
21
|
- If the runtime starts an auto-continue turn with no new user message, keep advancing from the active requirements and current durable state instead of re-answering the previous user turn.
|
|
22
22
|
- Message templates are references only. Adapt to the actual context and vary wording so updates feel natural and non-robotic.
|
|
23
23
|
- Use `reply_mode='blocking'` only for real user decisions that cannot be resolved from local evidence.
|
|
24
|
-
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, wait up to 1 day when feasible,
|
|
24
|
+
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, and wait up to 1 day when feasible. If the blocker is a missing external credential or secret that only the user can provide, keep the quest waiting, ask the user to supply it or choose an alternative, and do not self-resolve; if resumed without that credential and no other work is possible, a long low-frequency wait such as `bash_exec(command='sleep 3600', mode='await', timeout_seconds=3700)` is acceptable. Otherwise, if the timeout expires, choose the best option yourself and notify the user of the chosen option.
|
|
25
25
|
- If a threaded user reply arrives, interpret it relative to the latest idea progress update before assuming the task changed completely.
|
|
26
26
|
|
|
27
27
|
## Stage purpose
|
|
@@ -12,12 +12,12 @@ Use this skill when the quest already has meaningful state and the first job is
|
|
|
12
12
|
- Treat `artifact.interact(...)` as the main long-lived communication thread across TUI, web, and bound connectors.
|
|
13
13
|
- If `artifact.interact(...)` returns queued user requirements, treat them as the highest-priority user instruction bundle before continuing the audit.
|
|
14
14
|
- Immediately follow any non-empty mailbox poll with another `artifact.interact(...)` update that confirms receipt; if the request is directly answerable, answer there, otherwise say the current subtask is paused, give a short plan plus nearest report-back point, and handle that request first.
|
|
15
|
-
- Emit `artifact.interact(kind='progress', reply_mode='threaded', ...)`
|
|
15
|
+
- Emit `artifact.interact(kind='progress', reply_mode='threaded', ...)` when there is real user-visible progress: the first meaningful signal of the audit, a meaningful checkpoint, or a concise keepalive if active work has drifted beyond roughly 10 to 30 tool calls without a user-visible update.
|
|
16
16
|
- Keep progress updates chat-like and easy to understand: say what changed, what it means, and what happens next.
|
|
17
17
|
- Default to plain-language summaries. Do not mention file paths, artifact ids, branch/worktree ids, session ids, raw commands, or raw logs unless the user asks or needs them to act.
|
|
18
18
|
- Message templates are references only. Adapt to the actual context and vary wording so updates feel natural and non-robotic.
|
|
19
19
|
- Use `reply_mode='blocking'` only for real user decisions that cannot be resolved from local evidence.
|
|
20
|
-
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, wait up to 1 day when feasible,
|
|
20
|
+
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, and wait up to 1 day when feasible. If the blocker is a missing external credential or secret that only the user can provide, keep the quest waiting, ask the user to supply it or choose an alternative, and do not self-resolve; if resumed without that credential and no other work is possible, a long low-frequency wait such as `bash_exec(command='sleep 3600', mode='await', timeout_seconds=3700)` is acceptable. Otherwise, if the timeout expires, choose the best option yourself and notify the user of the chosen option.
|
|
21
21
|
- If a threaded user reply arrives, interpret it relative to the latest intake-audit progress update before assuming the task changed completely.
|
|
22
22
|
- When the audit reaches a durable route recommendation, send one richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` update that says what state is trusted, what still needs work, and which anchor should run next.
|
|
23
23
|
|
|
@@ -16,12 +16,12 @@ The task is “respond to concrete reviewer pressure with the smallest honest se
|
|
|
16
16
|
- Treat `artifact.interact(...)` as the main long-lived communication thread across TUI, web, and bound connectors.
|
|
17
17
|
- If `artifact.interact(...)` returns queued user requirements, treat them as the highest-priority user instruction bundle before continuing the rebuttal pass.
|
|
18
18
|
- Immediately follow any non-empty mailbox poll with another `artifact.interact(...)` update that confirms receipt; if the request is directly answerable, answer there, otherwise say the current subtask is paused, give a short plan plus nearest report-back point, and handle that request first.
|
|
19
|
-
- Emit `artifact.interact(kind='progress', reply_mode='threaded', ...)`
|
|
19
|
+
- Emit `artifact.interact(kind='progress', reply_mode='threaded', ...)` when there is real user-visible progress: the first meaningful signal of the rebuttal pass, a meaningful checkpoint, or a concise keepalive if active work has drifted beyond roughly 10 to 30 tool calls without a user-visible update.
|
|
20
20
|
- Keep progress updates chat-like and easy to understand: say what changed, what it means, and what happens next.
|
|
21
21
|
- Default to plain-language summaries. Do not mention file paths, artifact ids, branch/worktree ids, session ids, raw commands, or raw logs unless the user asks or needs them to act.
|
|
22
22
|
- Message templates are references only. Adapt to the actual context and vary wording so updates feel natural and non-robotic.
|
|
23
23
|
- Use `reply_mode='blocking'` only for real user decisions that cannot be resolved from local evidence.
|
|
24
|
-
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, wait up to 1 day when feasible,
|
|
24
|
+
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, and wait up to 1 day when feasible. If the blocker is a missing external credential or secret that only the user can provide, keep the quest waiting, ask the user to supply it or choose an alternative, and do not self-resolve; if resumed without that credential and no other work is possible, a long low-frequency wait such as `bash_exec(command='sleep 3600', mode='await', timeout_seconds=3700)` is acceptable. Otherwise, if the timeout expires, choose the best option yourself and notify the user of the chosen option.
|
|
25
25
|
- If a threaded user reply arrives, interpret it relative to the latest rebuttal progress update before assuming the task changed completely.
|
|
26
26
|
- When the rebuttal plan, the main supplementary-evidence package, or the final response bundle becomes durable, send one richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` update that says what reviewer concerns are now addressed, what still remains open, and what happens next.
|
|
27
27
|
|
|
@@ -87,11 +87,13 @@ Use, in roughly this order:
|
|
|
87
87
|
- the current paper or draft
|
|
88
88
|
- the selected outline if one exists
|
|
89
89
|
- review comments, meta-review, or editor letter
|
|
90
|
+
- the six-field `evaluation_summary` blocks from recent main experiments and analysis slices
|
|
90
91
|
- recent main and analysis experiment results
|
|
91
92
|
- prior decision and writing memory
|
|
92
93
|
- existing figures, tables, and claim-evidence maps
|
|
93
94
|
|
|
94
95
|
If the current paper/result state is still unclear, open `intake-audit` first before continuing the rebuttal workflow.
|
|
96
|
+
Before launching any new supplementary experiment, read those structured `evaluation_summary` blocks first so the rebuttal plan starts from the already-recorded evidence state rather than from raw narrative memory.
|
|
95
97
|
|
|
96
98
|
## Core outputs
|
|
97
99
|
|
|
@@ -19,11 +19,11 @@ It is also not the same as `rebuttal`.
|
|
|
19
19
|
- Treat `artifact.interact(...)` as the main long-lived communication thread across TUI, web, and bound connectors.
|
|
20
20
|
- If `artifact.interact(...)` returns queued user requirements, treat them as the highest-priority user instruction bundle before continuing the review pass.
|
|
21
21
|
- Immediately follow any non-empty mailbox poll with another `artifact.interact(...)` update that confirms receipt; if the request is directly answerable, answer there, otherwise say the current subtask is paused, give a short plan plus nearest report-back point, and handle that request first.
|
|
22
|
-
- Emit `artifact.interact(kind='progress', reply_mode='threaded', ...)`
|
|
22
|
+
- Emit `artifact.interact(kind='progress', reply_mode='threaded', ...)` when there is real user-visible progress: the first meaningful signal of the review pass, a meaningful checkpoint, or a concise keepalive if active work has drifted beyond roughly 10 to 30 tool calls without a user-visible update.
|
|
23
23
|
- Keep progress updates chat-like and easy to understand: say what changed, what it means, and what happens next.
|
|
24
24
|
- Default to plain-language summaries. Do not mention file paths, artifact ids, branch/worktree ids, session ids, raw commands, or raw logs unless the user asks or needs them to act.
|
|
25
25
|
- Use `reply_mode='blocking'` only for real user decisions that cannot be resolved from local evidence.
|
|
26
|
-
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, wait up to 1 day when feasible,
|
|
26
|
+
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, and wait up to 1 day when feasible. If the blocker is a missing external credential or secret that only the user can provide, keep the quest waiting, ask the user to supply it or choose an alternative, and do not self-resolve; if resumed without that credential and no other work is possible, a long low-frequency wait such as `bash_exec(command='sleep 3600', mode='await', timeout_seconds=3700)` is acceptable. Otherwise, if the timeout expires, choose the best option yourself and notify the user of the chosen option.
|
|
27
27
|
- When the review report, revision plan, or follow-up experiment TODO list becomes durable, send a richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` update that says what the main risks are, what should be fixed next, and whether the next route is writing, experiment, or claim downgrade.
|
|
28
28
|
|
|
29
29
|
## Purpose
|
|
@@ -77,12 +77,14 @@ Use, in roughly this order:
|
|
|
77
77
|
- the current paper or report draft
|
|
78
78
|
- the selected outline if one exists
|
|
79
79
|
- the claim-evidence map if one exists
|
|
80
|
+
- the six-field `evaluation_summary` blocks from recent main experiments and analysis slices
|
|
80
81
|
- recent main and analysis experiment results
|
|
81
82
|
- figures, tables, and captions
|
|
82
83
|
- prior self-review or reviewer-first notes as low-trust auxiliary input
|
|
83
84
|
- nearby papers when novelty or comparison is unclear
|
|
84
85
|
|
|
85
86
|
If the draft/result state is still unclear, open `intake-audit` first before continuing the review workflow.
|
|
87
|
+
Before proposing extra experiments, read those structured `evaluation_summary` blocks first so you do not request work that the recorded evidence already resolved.
|
|
86
88
|
|
|
87
89
|
## Core outputs
|
|
88
90
|
|
|
@@ -12,12 +12,12 @@ Use this skill when the quest does not yet have a stable research frame.
|
|
|
12
12
|
- Treat `artifact.interact(...)` as the main long-lived communication thread across TUI, web, and bound connectors.
|
|
13
13
|
- If `artifact.interact(...)` returns queued user requirements, treat them as the highest-priority user instruction bundle before continuing scouting.
|
|
14
14
|
- Immediately follow any non-empty mailbox poll with another `artifact.interact(...)` update that confirms receipt; if the request is directly answerable, answer there, otherwise say the current subtask is paused, give a short plan plus nearest report-back point, and handle that request first.
|
|
15
|
-
- Emit `artifact.interact(kind='progress', reply_mode='threaded', ...)`
|
|
15
|
+
- Emit `artifact.interact(kind='progress', reply_mode='threaded', ...)` when there is real user-visible progress: the first meaningful signal of long work, a meaningful checkpoint, or a concise keepalive if active work has drifted beyond roughly 10 to 30 tool calls without a user-visible update.
|
|
16
16
|
- Keep progress updates chat-like and easy to understand: say what changed, what it means, and what happens next.
|
|
17
17
|
- Default to plain-language summaries. Do not mention file paths, artifact ids, branch/worktree ids, session ids, raw commands, or raw logs unless the user asks or needs them to act.
|
|
18
18
|
- Message templates are references only. Adapt to the actual context and vary wording so updates feel natural and non-robotic.
|
|
19
19
|
- Use `reply_mode='blocking'` only for real user decisions that cannot be resolved from local evidence.
|
|
20
|
-
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, wait up to 1 day when feasible,
|
|
20
|
+
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, and wait up to 1 day when feasible. If the blocker is a missing external credential or secret that only the user can provide, keep the quest waiting, ask the user to supply it or choose an alternative, and do not self-resolve; if resumed without that credential and no other work is possible, a long low-frequency wait such as `bash_exec(command='sleep 3600', mode='await', timeout_seconds=3700)` is acceptable. Otherwise, if the timeout expires, choose the best option yourself and notify the user of the chosen option.
|
|
21
21
|
- If a threaded user reply arrives, interpret it relative to the latest scout progress update before assuming the task changed completely.
|
|
22
22
|
- When scouting actually resolves the framing ambiguity, locks the evaluation contract, or makes the next anchor obvious, send one richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` update that says what is now clear, why it matters, and which stage should come next.
|
|
23
23
|
|
|
@@ -22,7 +22,7 @@ This skill intentionally absorbs the strongest old DeepScientist writing discipl
|
|
|
22
22
|
- Treat `artifact.interact(...)` as the main long-lived communication thread across TUI, web, and bound connectors.
|
|
23
23
|
- If `artifact.interact(...)` returns queued user requirements, treat them as the highest-priority user instruction bundle before continuing drafting or revision.
|
|
24
24
|
- Immediately follow any non-empty mailbox poll with another `artifact.interact(...)` update that confirms receipt; if the request is directly answerable, answer there, otherwise say the current subtask is paused, give a short plan plus nearest report-back point, and handle that request first.
|
|
25
|
-
- Emit `artifact.interact(kind='progress', reply_mode='threaded', ...)`
|
|
25
|
+
- Emit `artifact.interact(kind='progress', reply_mode='threaded', ...)` when there is real user-visible progress: the first meaningful signal of long work, a meaningful checkpoint, or a concise keepalive if active work has drifted beyond roughly 10 to 30 tool calls without a user-visible update.
|
|
26
26
|
- Prefer `bash_exec` for durable document-build commands such as LaTeX compilation, figure regeneration, and scripted export steps so logs remain quest-local and reviewable.
|
|
27
27
|
- Keep progress updates chat-like and easy to understand: say what changed, what it means, and what happens next.
|
|
28
28
|
- Default to plain-language summaries. Do not mention file paths, artifact ids, branch/worktree ids, session ids, raw commands, or raw logs unless the user asks or needs them to act.
|
|
@@ -56,7 +56,7 @@ This skill intentionally absorbs the strongest old DeepScientist writing discipl
|
|
|
56
56
|
- If the runtime starts an auto-continue turn with no new user message, keep drafting or verifying from the durable state and active requirements instead of replaying the previous user turn.
|
|
57
57
|
- Message templates are references only. Adapt to the actual context and vary wording so updates feel respectful, human, and non-robotic.
|
|
58
58
|
- Use `reply_mode='blocking'` only for real user decisions that cannot be resolved from local evidence.
|
|
59
|
-
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, wait up to 1 day when feasible,
|
|
59
|
+
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, and wait up to 1 day when feasible. If the blocker is a missing external credential or secret that only the user can provide, keep the quest waiting, ask the user to supply it or choose an alternative, and do not self-resolve; if resumed without that credential and no other work is possible, a long low-frequency wait such as `bash_exec(command='sleep 3600', mode='await', timeout_seconds=3700)` is acceptable. Otherwise, if the timeout expires, choose the best option yourself and notify the user of the chosen option.
|
|
60
60
|
- If a threaded user reply arrives, interpret it relative to the latest writing progress update before assuming the task changed completely.
|
|
61
61
|
- Use milestone updates deliberately when outline selection, claim downgrades, proofing completion, bundle readiness, or route-back-to-experiment decisions become durably true.
|
|
62
62
|
|