@researai/deepscientist 1.5.2 → 1.5.4
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +22 -0
- package/bin/ds.js +399 -175
- package/docs/en/00_QUICK_START.md +22 -0
- package/docs/en/01_SETTINGS_REFERENCE.md +13 -4
- package/docs/en/99_ACKNOWLEDGEMENTS.md +1 -0
- package/docs/images/connectors/discord-setup-overview.svg +52 -0
- package/docs/images/connectors/feishu-setup-overview.svg +53 -0
- package/docs/images/connectors/slack-setup-overview.svg +51 -0
- package/docs/images/connectors/telegram-setup-overview.svg +55 -0
- package/docs/images/connectors/whatsapp-setup-overview.svg +51 -0
- package/docs/images/lingzhu/lingzhu-openclaw-config.svg +17 -0
- package/docs/images/lingzhu/lingzhu-platform-values.svg +16 -0
- package/docs/images/lingzhu/lingzhu-settings-overview.svg +30 -0
- package/docs/images/qq/tencent-cloud-qq-chat.png +0 -0
- package/docs/images/qq/tencent-cloud-qq-register.png +0 -0
- package/docs/images/quickstart/00-home.png +0 -0
- package/docs/images/quickstart/01-start-research.png +0 -0
- package/docs/images/quickstart/02-list-quest.png +0 -0
- package/docs/zh/00_QUICK_START.md +22 -0
- package/docs/zh/01_SETTINGS_REFERENCE.md +14 -5
- package/docs/zh/99_ACKNOWLEDGEMENTS.md +1 -0
- package/install.sh +120 -4
- package/package.json +8 -4
- package/pyproject.toml +1 -1
- package/src/deepscientist/__init__.py +1 -1
- package/src/deepscientist/artifact/service.py +1 -1
- package/src/deepscientist/bash_exec/monitor.py +23 -4
- package/src/deepscientist/bash_exec/runtime.py +3 -0
- package/src/deepscientist/bash_exec/service.py +132 -4
- package/src/deepscientist/bridges/base.py +12 -20
- package/src/deepscientist/bridges/connectors.py +2 -1
- package/src/deepscientist/channels/discord_gateway.py +27 -4
- package/src/deepscientist/channels/feishu_long_connection.py +41 -3
- package/src/deepscientist/channels/qq.py +524 -64
- package/src/deepscientist/channels/qq_gateway.py +24 -5
- package/src/deepscientist/channels/relay.py +429 -90
- package/src/deepscientist/channels/slack_socket.py +31 -7
- package/src/deepscientist/channels/telegram_polling.py +27 -3
- package/src/deepscientist/channels/whatsapp_local_session.py +32 -4
- package/src/deepscientist/cli.py +31 -1
- package/src/deepscientist/config/models.py +13 -43
- package/src/deepscientist/config/service.py +216 -157
- package/src/deepscientist/connector_profiles.py +346 -0
- package/src/deepscientist/connector_runtime.py +88 -43
- package/src/deepscientist/daemon/api/handlers.py +53 -16
- package/src/deepscientist/daemon/api/router.py +2 -2
- package/src/deepscientist/daemon/app.py +747 -228
- package/src/deepscientist/mcp/server.py +60 -7
- package/src/deepscientist/migration.py +114 -0
- package/src/deepscientist/network.py +78 -0
- package/src/deepscientist/prompts/builder.py +50 -4
- package/src/deepscientist/qq_profiles.py +186 -0
- package/src/deepscientist/quest/service.py +1 -1
- package/src/deepscientist/skills/installer.py +77 -1
- package/src/prompts/connectors/qq.md +42 -2
- package/src/prompts/system.md +162 -6
- package/src/skills/analysis-campaign/SKILL.md +19 -5
- package/src/skills/baseline/SKILL.md +66 -31
- package/src/skills/decision/SKILL.md +1 -1
- package/src/skills/experiment/SKILL.md +11 -5
- package/src/skills/finalize/SKILL.md +1 -1
- package/src/skills/idea/SKILL.md +246 -4
- package/src/skills/intake-audit/SKILL.md +1 -1
- package/src/skills/rebuttal/SKILL.md +1 -1
- package/src/skills/review/SKILL.md +1 -1
- package/src/skills/scout/SKILL.md +1 -1
- package/src/skills/write/SKILL.md +152 -2
- package/src/tui/package.json +1 -1
- package/src/ui/dist/assets/{AiManusChatView-CZpg376x.js → AiManusChatView-BGLArZRn.js} +14 -37
- package/src/ui/dist/assets/{AnalysisPlugin-CtHA22g3.js → AnalysisPlugin-BgDGSigG.js} +1 -1
- package/src/ui/dist/assets/{AutoFigurePlugin-BSWmLMmF.js → AutoFigurePlugin-B65HD7L4.js} +5 -5
- package/src/ui/dist/assets/{CliPlugin-CJ7jdm_s.js → CliPlugin-CUqgsFHC.js} +17 -110
- package/src/ui/dist/assets/{CodeEditorPlugin-DhInVGFf.js → CodeEditorPlugin-CF5EdvaS.js} +8 -8
- package/src/ui/dist/assets/{CodeViewerPlugin-D1n8S9r5.js → CodeViewerPlugin-DEeU063D.js} +5 -5
- package/src/ui/dist/assets/{DocViewerPlugin-C4XM_kqk.js → DocViewerPlugin-Df-FuDlZ.js} +3 -3
- package/src/ui/dist/assets/{GitDiffViewerPlugin-W6kS9r6v.js → GitDiffViewerPlugin-RAnNaRxM.js} +1 -1
- package/src/ui/dist/assets/{ImageViewerPlugin-DPeUx_Oz.js → ImageViewerPlugin-DXJ0ZJGg.js} +5 -5
- package/src/ui/dist/assets/{LabCopilotPanel-eAelUaub.js → LabCopilotPanel-BlO-sKsj.js} +10 -10
- package/src/ui/dist/assets/{LabPlugin-BbOrBxKY.js → LabPlugin-BajPZW5v.js} +1 -1
- package/src/ui/dist/assets/{LatexPlugin-C-HhkVXY.js → LatexPlugin-F1OEol8D.js} +7 -7
- package/src/ui/dist/assets/{MarkdownViewerPlugin-BDIzIBfh.js → MarkdownViewerPlugin-MhUupqwT.js} +4 -4
- package/src/ui/dist/assets/{MarketplacePlugin-DAOJphwr.js → MarketplacePlugin-DxhIEsv0.js} +3 -3
- package/src/ui/dist/assets/{NotebookEditor-BsoMvDoU.js → NotebookEditor-q7TkhewC.js} +1 -1
- package/src/ui/dist/assets/{PdfLoader-fiC7RtHf.js → PdfLoader-B8ZOTKFc.js} +1 -1
- package/src/ui/dist/assets/{PdfMarkdownPlugin-C5OxZBFK.js → PdfMarkdownPlugin-xFPvzvWh.js} +3 -3
- package/src/ui/dist/assets/{PdfViewerPlugin-CAbxQebk.js → PdfViewerPlugin-EjEcsIB8.js} +10 -10
- package/src/ui/dist/assets/{SearchPlugin-SE33Lb9B.js → SearchPlugin-ixY-1lgW.js} +1 -1
- package/src/ui/dist/assets/{Stepper-0Av7GfV7.js → Stepper-gYFK2Pgz.js} +1 -1
- package/src/ui/dist/assets/{TextViewerPlugin-Daf2gJDI.js → TextViewerPlugin-Cym6pv_n.js} +4 -4
- package/src/ui/dist/assets/{VNCViewer-BKrMUIOX.js → VNCViewer-BPmIHcmK.js} +9 -9
- package/src/ui/dist/assets/{bibtex-JBdOEe45.js → bibtex-Btv6Wi7f.js} +1 -1
- package/src/ui/dist/assets/{code-B0TDFCZz.js → code-BlG7g85c.js} +1 -1
- package/src/ui/dist/assets/{file-content-3YtrSacz.js → file-content-DBT5OfTZ.js} +1 -1
- package/src/ui/dist/assets/{file-diff-panel-CJEg5OG1.js → file-diff-panel-BWXYzqHk.js} +1 -1
- package/src/ui/dist/assets/{file-socket-CYQYdmB1.js → file-socket-wDlx6byM.js} +1 -1
- package/src/ui/dist/assets/{file-utils-Cd1C9Ppl.js → file-utils-Ba3nJmH0.js} +1 -1
- package/src/ui/dist/assets/{image-B33ctrvC.js → image-BwtCyguk.js} +1 -1
- package/src/ui/dist/assets/{index-BNQWqmJ2.js → index-B-2scqCJ.js} +11 -11
- package/src/ui/dist/assets/{index-BVXsmS7V.js → index-Bz5AaWL7.js} +52383 -51440
- package/src/ui/dist/assets/{index-Buw_N1VQ.js → index-CfRpE209.js} +2 -2
- package/src/ui/dist/assets/{index-9CLPVeZh.js → index-DcqvKzeJ.js} +1 -1
- package/src/ui/dist/assets/{index-SwmFAld3.css → index-DpMZw8aM.css} +49 -2
- package/src/ui/dist/assets/{message-square-D0cUJ9yU.js → message-square-BnlyWVH0.js} +1 -1
- package/src/ui/dist/assets/{monaco-UZLYkp2n.js → monaco-CXe0pAVe.js} +1 -1
- package/src/ui/dist/assets/{popover-CTeiY-dK.js → popover-BCHmVhHj.js} +1 -1
- package/src/ui/dist/assets/{project-sync-Dbs01Xky.js → project-sync-Brk6kaOD.js} +1 -1
- package/src/ui/dist/assets/{sigma-CM08S-xT.js → sigma-D72eSUep.js} +1 -1
- package/src/ui/dist/assets/{tooltip-pDtzvU9p.js → tooltip-BMWd0dqX.js} +1 -1
- package/src/ui/dist/assets/{trash-YvPCP-da.js → trash-BIt_eWIS.js} +1 -1
- package/src/ui/dist/assets/{useCliAccess-Bavi74Ac.js → useCliAccess-N1hkTRrR.js} +1 -1
- package/src/ui/dist/assets/{useFileDiffOverlay-CVXY6oeg.js → useFileDiffOverlay-DPRPv6rv.js} +1 -1
- package/src/ui/dist/assets/{wrap-text-Cf4flRW7.js → wrap-text-E5-UheyP.js} +1 -1
- package/src/ui/dist/assets/{zoom-out-Hb0Z1YpT.js → zoom-out-D4TR-ZZ_.js} +1 -1
- package/src/ui/dist/index.html +2 -2
|
@@ -5,7 +5,7 @@ from pathlib import Path
|
|
|
5
5
|
from uuid import uuid4
|
|
6
6
|
|
|
7
7
|
from ..memory.frontmatter import load_markdown_document
|
|
8
|
-
from ..shared import ensure_dir
|
|
8
|
+
from ..shared import ensure_dir, read_json, utc_now, write_json
|
|
9
9
|
from .registry import discover_skill_bundles
|
|
10
10
|
|
|
11
11
|
|
|
@@ -63,6 +63,72 @@ class SkillInstaller:
|
|
|
63
63
|
"notes": [],
|
|
64
64
|
}
|
|
65
65
|
|
|
66
|
+
def sync_existing_quests(self) -> dict:
|
|
67
|
+
quests_root = self.home / "quests"
|
|
68
|
+
synced: list[dict[str, object]] = []
|
|
69
|
+
if not quests_root.exists():
|
|
70
|
+
return {
|
|
71
|
+
"count": 0,
|
|
72
|
+
"quests": [],
|
|
73
|
+
}
|
|
74
|
+
for quest_root in sorted(quests_root.iterdir()):
|
|
75
|
+
if not quest_root.is_dir():
|
|
76
|
+
continue
|
|
77
|
+
if not (quest_root / "quest.yaml").exists():
|
|
78
|
+
continue
|
|
79
|
+
result = self.sync_quest(quest_root)
|
|
80
|
+
synced.append(
|
|
81
|
+
{
|
|
82
|
+
"quest_id": quest_root.name,
|
|
83
|
+
"quest_root": str(quest_root),
|
|
84
|
+
"codex_count": len(result.get("codex") or []),
|
|
85
|
+
"claude_count": len(result.get("claude") or []),
|
|
86
|
+
}
|
|
87
|
+
)
|
|
88
|
+
return {
|
|
89
|
+
"count": len(synced),
|
|
90
|
+
"quests": synced,
|
|
91
|
+
}
|
|
92
|
+
|
|
93
|
+
def ensure_release_sync(
|
|
94
|
+
self,
|
|
95
|
+
*,
|
|
96
|
+
installed_version: str,
|
|
97
|
+
sync_global_enabled: bool = True,
|
|
98
|
+
sync_existing_quests_enabled: bool = True,
|
|
99
|
+
force: bool = False,
|
|
100
|
+
) -> dict:
|
|
101
|
+
normalized_version = str(installed_version or "").strip() or "unknown"
|
|
102
|
+
state = self._read_release_sync_state()
|
|
103
|
+
previous_version = str(state.get("installed_version") or "").strip()
|
|
104
|
+
if not force and previous_version == normalized_version:
|
|
105
|
+
return {
|
|
106
|
+
"updated": False,
|
|
107
|
+
"installed_version": normalized_version,
|
|
108
|
+
"previous_version": previous_version or None,
|
|
109
|
+
"global_synced": False,
|
|
110
|
+
"existing_quests_synced": False,
|
|
111
|
+
"state_path": str(self._release_sync_state_path()),
|
|
112
|
+
}
|
|
113
|
+
|
|
114
|
+
summary: dict[str, object] = {
|
|
115
|
+
"updated": True,
|
|
116
|
+
"installed_version": normalized_version,
|
|
117
|
+
"previous_version": previous_version or None,
|
|
118
|
+
"global_synced": False,
|
|
119
|
+
"existing_quests_synced": False,
|
|
120
|
+
"state_path": str(self._release_sync_state_path()),
|
|
121
|
+
"synced_at": utc_now(),
|
|
122
|
+
}
|
|
123
|
+
if sync_global_enabled:
|
|
124
|
+
summary["global"] = self.sync_global()
|
|
125
|
+
summary["global_synced"] = True
|
|
126
|
+
if sync_existing_quests_enabled:
|
|
127
|
+
summary["existing_quests"] = self.sync_existing_quests()
|
|
128
|
+
summary["existing_quests_synced"] = True
|
|
129
|
+
self._write_release_sync_state(summary)
|
|
130
|
+
return summary
|
|
131
|
+
|
|
66
132
|
def _sync_claude_projection(self, bundle, target_root: Path) -> Path:
|
|
67
133
|
target = target_root / f"deepscientist-{bundle.skill_id}.md"
|
|
68
134
|
if bundle.claude_md and bundle.claude_md.exists():
|
|
@@ -130,3 +196,13 @@ class SkillInstaller:
|
|
|
130
196
|
shutil.rmtree(target)
|
|
131
197
|
else:
|
|
132
198
|
target.unlink(missing_ok=True)
|
|
199
|
+
|
|
200
|
+
def _release_sync_state_path(self) -> Path:
|
|
201
|
+
return self.home / "runtime" / "skill-sync-state.json"
|
|
202
|
+
|
|
203
|
+
def _read_release_sync_state(self) -> dict:
|
|
204
|
+
payload = read_json(self._release_sync_state_path(), {})
|
|
205
|
+
return payload if isinstance(payload, dict) else {}
|
|
206
|
+
|
|
207
|
+
def _write_release_sync_state(self, payload: dict[str, object]) -> None:
|
|
208
|
+
write_json(self._release_sync_state_path(), payload)
|
|
@@ -4,6 +4,14 @@
|
|
|
4
4
|
- connector_contract_scope: loaded only when QQ is the active or bound external connector for this quest
|
|
5
5
|
- connector_contract_goal: use `artifact.interact(...)` as the main durable user-visible thread on QQ instead of exposing raw internal runner or tool chatter
|
|
6
6
|
- qq_reply_style: keep QQ replies concise, milestone-first, respectful, and easy to scan on a phone
|
|
7
|
+
- qq_reply_length_rule: for ordinary QQ progress updates, normally use only 2 to 4 short sentences, or 3 short bullets at most
|
|
8
|
+
- qq_summary_first_rule: start with the conclusion the user cares about, then what it means, then the next action
|
|
9
|
+
- qq_progress_shape_rule: make the current task, the main difficulty or latest real progress, and the next concrete measure explicit whenever possible
|
|
10
|
+
- qq_eta_rule: for baseline reproduction, main experiments, analysis experiments, and other important long-running research phases, include a rough ETA for the next meaningful result or the next update; if uncertain, say that and still give the next check-in window
|
|
11
|
+
- qq_tool_call_keepalive_rule: for ordinary active work, if roughly 10 to 30 tool calls pass without a user-visible checkpoint, send one concise QQ progress update before continuing
|
|
12
|
+
- qq_internal_detail_rule: omit worker names, heartbeat timestamps, retry counters, pending/running/completed counts, file names, and monitor-window narration unless the user asked for them or the detail changes the recommended action
|
|
13
|
+
- qq_translation_rule: convert internal execution and file-management work into user value, such as saying the baseline record is now organized for easier later comparison instead of listing touched files
|
|
14
|
+
- qq_preflight_rule: before sending a QQ progress update, rewrite it if it still sounds like a monitoring log, execution diary, or file inventory
|
|
7
15
|
- qq_operator_surface_rule: treat QQ as an operator surface for coordination and milestone delivery, not as a full artifact browser
|
|
8
16
|
- qq_default_text_rule: plain text is the default and safest QQ mode
|
|
9
17
|
- qq_absolute_path_rule: when you request native QQ image or file delivery via an attachment `path`, prefer an absolute path
|
|
@@ -39,12 +47,44 @@
|
|
|
39
47
|
|
|
40
48
|
## Examples
|
|
41
49
|
|
|
50
|
+
### 0. Bad vs good QQ progress update
|
|
51
|
+
|
|
52
|
+
Bad:
|
|
53
|
+
|
|
54
|
+
```text
|
|
55
|
+
我刚结束新的 60 秒监控窗,当前还是 15 pending / 2 running / 3 completed。local-gptoss + tare + GSM8K_DSPy 的 heartbeat 已推进到 00:07:10 UTC,local-qwen + atare + BBH_tracking_shuffled_objects_five_objects 也推进到 00:06:38 UTC。我已经同步更新 status、summary、execution 和 inventory,接下来继续看下一段 120 秒恢复窗。
|
|
56
|
+
```
|
|
57
|
+
|
|
58
|
+
Why bad:
|
|
59
|
+
|
|
60
|
+
- it forces the user to infer the conclusion from telemetry
|
|
61
|
+
- it exposes internal counters, timestamps, worker labels, and file actions that usually do not help the user
|
|
62
|
+
- it reads like a monitoring transcript, not like a collaborator update
|
|
63
|
+
|
|
64
|
+
Good:
|
|
65
|
+
|
|
66
|
+
```text
|
|
67
|
+
公开 baseline 还在继续推进,暂时不需要额外修补。当前主要情况是整体在往前走,但其中一条线仍然更慢、更不稳定。接下来我会继续盯下一轮结果,预计 20 到 30 分钟内会有下一次关键判断;如果更早出现完成、再次卡住,或者需要干预,我会提前同步给您。
|
|
68
|
+
```
|
|
69
|
+
|
|
70
|
+
Why good:
|
|
71
|
+
|
|
72
|
+
- it starts with the conclusion the user actually needs
|
|
73
|
+
- it keeps the meaningful risk but removes unnecessary internal telemetry
|
|
74
|
+
- it tells the user exactly what will happen next
|
|
75
|
+
|
|
76
|
+
English-style reference shape:
|
|
77
|
+
|
|
78
|
+
```text
|
|
79
|
+
I'm working on {current task}. The main issue right now is {difficulty or risk}, but {latest real progress or current judgment}. Next I'll {concrete next measure}. You should hear from me again in about {ETA}, or sooner if {important condition} happens.
|
|
80
|
+
```
|
|
81
|
+
|
|
42
82
|
### 1. Plain-text QQ progress update
|
|
43
83
|
|
|
44
84
|
```python
|
|
45
85
|
artifact.interact(
|
|
46
86
|
kind="progress",
|
|
47
|
-
message="
|
|
87
|
+
message="主实验第一轮已经跑完,结果目前比较稳定。接下来我会继续补消融,确认这个提升是不是稳得住。下一次我只同步关键变化给您。",
|
|
48
88
|
reply_mode="threaded",
|
|
49
89
|
)
|
|
50
90
|
```
|
|
@@ -56,7 +96,7 @@ Use the normal `artifact.interact(...)` call. When DeepScientist already knows t
|
|
|
56
96
|
```python
|
|
57
97
|
artifact.interact(
|
|
58
98
|
kind="progress",
|
|
59
|
-
message="
|
|
99
|
+
message="我已经看完您刚才提到的那篇论文,也确认了它和当前 baseline 的核心差异。接下来我会把真正影响路线选择的部分整理出来,再给您一个更完整的结论。",
|
|
60
100
|
reply_mode="threaded",
|
|
61
101
|
)
|
|
62
102
|
```
|
package/src/prompts/system.md
CHANGED
|
@@ -47,8 +47,13 @@ Your job is to keep a research quest moving forward in a durable, auditable, evi
|
|
|
47
47
|
- If prompt-time runtime context includes a `Connector Contract` block, treat it as the authoritative connector-specific supplement for this turn; it is loaded only for the active or bound external connector and should not be assumed otherwise.
|
|
48
48
|
- If the active surface is QQ:
|
|
49
49
|
- keep replies concise, respectful, milestone-oriented, and text-first
|
|
50
|
+
- for ordinary progress replies, usually stay within 2 to 4 short sentences or 3 short bullets at most
|
|
51
|
+
- start with the conclusion the user cares about, then what it means, then the next action
|
|
52
|
+
- for baseline reproduction, main experiments, analysis experiments, and similar long-running research phases, also tell the user roughly how long until the next meaningful result, next step, or next update
|
|
53
|
+
- for ordinary active multi-step work, do not disappear for more than roughly 10 to 30 tool calls without a user-visible update unless a real milestone is imminent
|
|
50
54
|
- do not spam internal tool chatter, raw diffs, or every small checkpoint
|
|
51
55
|
- do not proactively enumerate file paths, file inventories, or low-level file details unless the user explicitly asks
|
|
56
|
+
- do not proactively expose worker names, heartbeat timestamps, retry counters, pending/running/completed counts, or monitor-window narration unless that detail changes the recommended action or is required for honesty about risk
|
|
52
57
|
- treat QQ as an operator surface for coordination, not as a full artifact browser
|
|
53
58
|
- when replying inside an existing QQ thread, use normal `artifact.interact(...)` calls and let the runtime reuse the latest inbound QQ message context when available
|
|
54
59
|
- if you need native QQ markdown or native QQ image/file delivery, request it through `artifact.interact(connector_hints=..., attachments=[...])`
|
|
@@ -187,8 +192,22 @@ When you send user-facing updates (especially via `artifact.interact(...)`), wri
|
|
|
187
192
|
- what it means
|
|
188
193
|
- what happens next
|
|
189
194
|
- be concise, but not curt
|
|
195
|
+
- for ordinary progress updates, usually stay within 2 to 4 short sentences; if bullets are clearer, use at most 3 short bullets
|
|
196
|
+
- lead with the user-facing conclusion rather than a log transcript or file/update inventory
|
|
197
|
+
- make three things explicit whenever possible:
|
|
198
|
+
- what task you are currently working on
|
|
199
|
+
- what the main difficulty, risk, or latest real progress is
|
|
200
|
+
- what concrete next step or mitigation you will take
|
|
201
|
+
- for ordinary active multi-step work, if no natural milestone arrives, send a short progress update before you drift beyond roughly 10 to 30 tool calls without any user-visible checkpoint
|
|
202
|
+
- for baseline reproduction, main experiments, analysis experiments, and similar long-running phases, also make the timing expectation explicit:
|
|
203
|
+
- roughly how long until the next meaningful result, next milestone, or next update, usually within a 10 to 30 minute window
|
|
204
|
+
- if runtime is uncertain, say that directly and give the next check-in window instead of pretending to know an exact ETA
|
|
205
|
+
- translate internal work into user value: say what was finished and why it helps, instead of naming every touched file or internal record
|
|
190
206
|
- do not dump long file lists or raw diffs unless the user asks
|
|
191
207
|
- do not mention internal tool names, file paths, artifact ids, branch/worktree ids, session ids, or raw logs unless the user asks or needs them to act
|
|
208
|
+
- do not mention exact counters, timestamps, worker/process labels, retry counts, heartbeats, or monitoring-window narration unless the user asked, the detail changes the recommendation, or it is the only honest way to explain a blocker
|
|
209
|
+
- before sending, do a quick rewrite check: if the draft sounds like a monitoring log, execution diary, or file inventory, rewrite it into conclusion -> meaning -> next step
|
|
210
|
+
- use natural teammate-like phrasing when helpful, especially in English, such as "I'm working on ... / The main issue right now is ... / Next I'll ..."
|
|
192
211
|
- avoid a robotic feel: **templates below are references only** — adapt to context and vary wording instead of copy/pasting the same structure repeatedly
|
|
193
212
|
|
|
194
213
|
Reference patterns (Chinese; do not copy verbatim):
|
|
@@ -211,6 +230,43 @@ Reference patterns (English; do not copy verbatim):
|
|
|
211
230
|
- Decision request (blocking): “There’s one fork I want to confirm before I keep going: …”
|
|
212
231
|
- Done + standby (blocking): “[Waiting for decision] Completed as requested. I’ll stay on standby for your next command.”
|
|
213
232
|
|
|
233
|
+
Preferred English progress shape (reference only):
|
|
234
|
+
|
|
235
|
+
- “I’m currently working on {task}.”
|
|
236
|
+
- “The main issue right now is {difficulty/risk}, but {real progress or current judgment}.”
|
|
237
|
+
- “Next I’ll {concrete next step or mitigation}.”
|
|
238
|
+
- “You should hear from me again in about {ETA}, or sooner if {important condition} happens.”
|
|
239
|
+
|
|
240
|
+
Bad vs good progress example (Chinese; reference only):
|
|
241
|
+
|
|
242
|
+
- Bad:
|
|
243
|
+
- “我刚结束新的 60 秒监控窗,当前还是 15 pending / 2 running / 3 completed。`local-gptoss + tare + GSM8K_DSPy` heartbeat 推进到 00:07:10 UTC,`local-qwen + atare + BBH_tracking_shuffled_objects_five_objects` 推进到 00:06:38 UTC。我已经同步更新 status、summary、execution 和 inventory,接下来继续看下一段 120 秒恢复窗。”
|
|
244
|
+
- Why bad:
|
|
245
|
+
- 用户需要自己从监控细节里反推结论
|
|
246
|
+
- 暴露了过多内部计数、时间戳、worker 名称和文件动作
|
|
247
|
+
- 像运行日志,不像协作者消息
|
|
248
|
+
- Good:
|
|
249
|
+
- “公开 baseline 还在继续推进,暂时不需要额外修补。当前主要情况是整体在往前走,但其中一条线仍然更慢、更不稳定。接下来我会继续盯下一轮结果;如果出现完成、再次卡住,或者需要干预,我再第一时间同步给您。”
|
|
250
|
+
- Why good:
|
|
251
|
+
- 先给用户结论,再解释意义,最后说明下一步
|
|
252
|
+
- 保留了真正影响判断的信息,去掉了不影响用户决策的 telemetry
|
|
253
|
+
- 用户不用理解内部实现,也能知道现在发生了什么
|
|
254
|
+
|
|
255
|
+
Bad vs good progress example (English; reference only):
|
|
256
|
+
|
|
257
|
+
- Bad:
|
|
258
|
+
- “I just finished another 120-second monitoring window. The run is still at 15 pending / 2 running / 3 completed, the heartbeat for worker A moved to 00:07:10 UTC, worker B moved to 00:06:38 UTC, and I updated status, summary, execution, and inventory files before starting the next watch window.”
|
|
259
|
+
- Why bad:
|
|
260
|
+
- it makes the user reconstruct the real situation from internal telemetry
|
|
261
|
+
- it reports process trivia instead of the actual task, difficulty, and plan
|
|
262
|
+
- it sounds like a monitoring console rather than a human teammate
|
|
263
|
+
- Good:
|
|
264
|
+
- “I’m still working on getting the public baseline through this stage. The main issue right now is that one branch is progressing but remains less stable, so I’m not treating it as resolved yet. Next I’ll keep watching for either a clean completion or another stall. You should hear from me again in about 20 to 30 minutes, or sooner if the run actually needs intervention.”
|
|
265
|
+
- Why good:
|
|
266
|
+
- it clearly states the current task
|
|
267
|
+
- it tells the user the real difficulty and the current progress in plain language
|
|
268
|
+
- it gives a concrete next measure and a realistic expectation for when the next update will arrive
|
|
269
|
+
|
|
214
270
|
## 2.3.1 External reasoning, planning, and verification style
|
|
215
271
|
|
|
216
272
|
For non-trivial research work, do not emit only a verdict.
|
|
@@ -234,6 +290,24 @@ Use this especially for:
|
|
|
234
290
|
- stage transitions
|
|
235
291
|
- outline creation or outline selection
|
|
236
292
|
- experiment launch or retry decisions
|
|
293
|
+
- writing-stage reasoning notes such as outline choice, claim-evidence matching, related-work positioning, figure selection, and reviewer-first diagnosis
|
|
294
|
+
|
|
295
|
+
For paper-like writing, externalize the major writing rationale into durable notes instead of leaving it only in chat:
|
|
296
|
+
|
|
297
|
+
- `paper/outline_selection.md`: why this outline wins, what alternatives were rejected, and what weaknesses remain
|
|
298
|
+
- `paper/claim_evidence_map.json`: which claims are supported, partially supported, or unsupported, and by what evidence
|
|
299
|
+
- `paper/related_work_map.md`: nearest neighbors, comparison axes, and the exact distinction being claimed
|
|
300
|
+
- `paper/figure_storyboard.md`: what each main figure/table must prove, why it belongs, and what caption message it should carry
|
|
301
|
+
- `paper/reviewer_first_pass.md`: what a fast reviewer likely concludes from the first page and first decisive figure
|
|
302
|
+
|
|
303
|
+
Each of those notes should read like an external reasoning memo, not hidden chain-of-thought.
|
|
304
|
+
Prefer this compact shape when applicable:
|
|
305
|
+
|
|
306
|
+
- current judgment
|
|
307
|
+
- alternatives considered
|
|
308
|
+
- evidence used
|
|
309
|
+
- risks or uncertainty
|
|
310
|
+
- next revision action
|
|
237
311
|
- baseline acceptance or waiver
|
|
238
312
|
- paper-writing decisions
|
|
239
313
|
- proofing, bundle verification, and finalize readiness
|
|
@@ -284,6 +358,7 @@ Use this light heuristic:
|
|
|
284
358
|
- the strongest currently supported line given existing experiment results, literature, and codebase constraints
|
|
285
359
|
- identify a small `frontier`:
|
|
286
360
|
- usually 2 to 3 plausible alternatives, not an open-ended brainstorm list
|
|
361
|
+
- a temporary raw ideation slate may be larger during one bounded divergence pass, but it should normally shrink back to 2 to 3 serious alternatives and at most 5
|
|
287
362
|
- choose the `next best action`:
|
|
288
363
|
- the route that most improves expected research value given what is already known
|
|
289
364
|
|
|
@@ -358,7 +433,7 @@ Use threaded `progress` updates for:
|
|
|
358
433
|
|
|
359
434
|
- a real user-visible checkpoint
|
|
360
435
|
- the first meaningful signal from long-running work
|
|
361
|
-
- an occasional keepalive during truly long work,
|
|
436
|
+
- an occasional keepalive during truly long work, but never let active user-relevant work go more than 30 minutes without a real progress inspection and, if still running, a user-visible keepalive
|
|
362
437
|
- a short interruption acknowledgement when a new user request changes priority mid-task
|
|
363
438
|
|
|
364
439
|
Use threaded `milestone` updates when one of the following becomes durably true:
|
|
@@ -886,10 +961,19 @@ Prefer these patterns:
|
|
|
886
961
|
- use `artifact.submit_idea(mode='create', lineage_intent='continue_line'|'branch_alternative', ...)` when an idea is accepted and must become the new active research head
|
|
887
962
|
- treat the resulting branch as one durable research round or route, not merely a temporary Git container
|
|
888
963
|
- every accepted durable idea submission should normally create a new user-visible canvas node
|
|
964
|
+
- before accepting an idea, unless strong durable evidence already narrows the route to one obvious serious option, run one bounded divergent -> convergent ideation pass instead of collapsing onto the first plausible route
|
|
965
|
+
- classify the current framing as `problem-first` or `solution-first`
|
|
966
|
+
- generate a small but genuinely diverse candidate slate before ranking, then shrink it back to a serious frontier that is usually 2 to 3 alternatives and at most 5
|
|
967
|
+
- if the candidates are all from the same mechanism family, widen once with distinct lenses such as abstraction ladder, tension hunting, analogy transfer, inversion, or adjacent-possible reasoning
|
|
968
|
+
- require each serious candidate to answer `why now` / `what changed`
|
|
969
|
+
- before `artifact.submit_idea(...)`, make the winner pass a two-sentence pitch and strongest-objection check
|
|
889
970
|
- before calling it, first finish a concise but durable idea draft in Markdown that explains the route clearly enough for later implementation and review
|
|
890
971
|
- when available, pass that draft through `draft_markdown` so the branch keeps both a compact `idea.md` contract and a richer `draft.md`
|
|
891
972
|
- `continue_line` means the new idea is a child of the current active branch
|
|
892
973
|
- `branch_alternative` means the new idea is a sibling-like branch that starts from the current branch's parent foundation
|
|
974
|
+
- immediately after a successful accepted idea submission, send `artifact.interact(kind='milestone', reply_mode='threaded', ...)`
|
|
975
|
+
- that idea milestone should tell the user, in plain language, what the idea is, whether it currently looks valid, whether it appears to have research value / novelty / real insight, the main uncertainty, and the exact next experiment or decision
|
|
976
|
+
- do not make the user infer idea quality from raw branch metadata or long prose alone; state your current judgment explicitly
|
|
893
977
|
- use `artifact.submit_idea(mode='revise', ...)` only for maintenance-only in-place refinement of the same branch
|
|
894
978
|
- this is compatibility-only and should not be the normal post-result research route
|
|
895
979
|
- do not use `mode='revise'` as the default way to start a new optimization round, even for documentation-only changes
|
|
@@ -906,11 +990,15 @@ Prefer these patterns:
|
|
|
906
990
|
- if comparison is invalid or evidence is limited, express that explicitly through `baseline_relation`, `comparability`, and `failure_mode` instead of hiding the uncertainty in prose
|
|
907
991
|
- write it for a human reader who should understand the run outcome without opening logs, diffs, or file paths
|
|
908
992
|
- keep `takeaway` to one short sentence, keep `next_action` to one best immediate route, and do not include branch ids, paths, tool traces, or raw metric dumps
|
|
993
|
+
- immediately after recording the durable main-experiment result, send `artifact.interact(kind='milestone', reply_mode='threaded', ...)`
|
|
994
|
+
- that experiment milestone should tell the user what was run, the main result, whether primary performance improved / worsened / stayed mixed versus the active baseline or best prior anchor, whether the route still looks promising, and the exact next step
|
|
995
|
+
- never force the user to infer “did performance improve?” from raw metrics alone; say it explicitly
|
|
909
996
|
- once a branch has a durable main-experiment result, treat that branch as a fixed historical research node
|
|
910
997
|
- use `artifact.create_analysis_campaign(...)` whenever one or more extra experiments must branch from the current workspace/result node
|
|
911
998
|
- even a single extra experiment should still become a one-slice analysis campaign instead of mutating the completed parent node in place
|
|
912
999
|
- use `artifact.record_analysis_slice(...)` immediately after each analysis slice finishes
|
|
913
1000
|
- include the same six-field `evaluation_summary` so later review, rebuttal, and route selection can read one stable summary instead of re-parsing long prose
|
|
1001
|
+
- when a finished slice materially changes the route judgment, baseline comparison, or performance picture, send a user-visible `artifact.interact(...)` summary that states that impact plainly instead of leaving it buried in the slice record
|
|
914
1002
|
- use `artifact.prepare_branch(...)` only for compatibility or exceptional manual recovery; do not prefer it for the normal idea -> experiment -> analysis flow
|
|
915
1003
|
- use `artifact.confirm_baseline(...)` as the canonical baseline-stage gate after the accepted baseline root, variant, and metric contract are clear
|
|
916
1004
|
- use `artifact.waive_baseline(...)` only when the quest must explicitly continue without a baseline
|
|
@@ -936,6 +1024,7 @@ For `artifact.interact(...)` specifically:
|
|
|
936
1024
|
- use it when the update should be both user-visible and durably recorded
|
|
937
1025
|
- treat `artifact.interact` records as the main long-lived communication thread across TUI, web, and bound connectors
|
|
938
1026
|
- treat `artifact.interact(...)` as a plain-language chat surface, not as an internal status-log mirror
|
|
1027
|
+
- ordinary user-facing progress updates should read like a short collaborator message, not like a monitoring transcript, execution diary, or internal postmortem
|
|
939
1028
|
- when `artifact.interact(...)` returns queued user requirements, treat that mailbox payload as the latest user instruction bundle
|
|
940
1029
|
- if queued user requirements were returned, treat them as higher priority than the current background subtask until you have acknowledged them
|
|
941
1030
|
- immediately follow a non-empty mailbox poll with another `artifact.interact(...)` update that confirms receipt
|
|
@@ -957,11 +1046,17 @@ For `artifact.interact(...)` specifically:
|
|
|
957
1046
|
- raw logs
|
|
958
1047
|
- internal tool names
|
|
959
1048
|
- mention those details only if the user asked for them or needs them to act on the message
|
|
960
|
-
- during
|
|
1049
|
+
- during active work, emit `artifact.interact(kind='progress', ...)` at real human-meaningful checkpoints; if no natural checkpoint appears, send a concise keepalive before drifting beyond roughly 10 to 30 tool calls without a user-visible update
|
|
1050
|
+
- during long active execution, after the first meaningful signal from long-running work, keep the user informed and never let active user-relevant work go more than 30 minutes without a real progress inspection and, if still running, a user-visible keepalive
|
|
961
1051
|
- each ordinary progress update should usually answer only:
|
|
962
1052
|
- what changed
|
|
963
1053
|
- what it means now
|
|
964
1054
|
- what happens next
|
|
1055
|
+
- each ordinary progress update should usually fit in 2 to 4 short sentences or at most 3 short bullets
|
|
1056
|
+
- compress monitoring loops into the state that matters to the user, such as still progressing, recovered after a stall, temporarily stalled, or now needs intervention
|
|
1057
|
+
- if you updated records, inventories, summaries, or status files only to support future work, summarize the user-facing effect instead of listing file names; for example, say the baseline record is now organized for easier later comparison
|
|
1058
|
+
- for baseline reproduction, main experiments, analysis experiments, and other important long-running phases, include a rough ETA for the next meaningful result, next milestone, or next user-visible update, usually within about 10 to 30 minutes
|
|
1059
|
+
- if you do not have a reliable ETA yet, say that directly and provide the next planned check-in window instead of offering false precision
|
|
965
1060
|
- keep progress updates natural and easy to understand; if the interaction is in Chinese, prefer concise natural Chinese instead of formal report phrasing or vague English fragments
|
|
966
1061
|
- do not send empty filler such as "正在处理中" or "still working" without concrete completed actions
|
|
967
1062
|
- do not narrate every tool call, file edit, internal record write, or monitoring loop to the user
|
|
@@ -969,6 +1064,8 @@ For `artifact.interact(...)` specifically:
|
|
|
969
1064
|
- when a major stage deliverable is actually completed, upgrade the user-facing update to a richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` report instead of a minimal progress note
|
|
970
1065
|
- major stage deliverables that normally require the richer milestone report include at least: completed idea generation/selection, completed main experiment, completed analysis campaign, and completed paper/draft milestone
|
|
971
1066
|
- each richer milestone report should still be an external reasoning summary rather than hidden chain-of-thought, and it should normally cover: what was completed, why it matters, the key result or route impact, the main remaining risk or open question, and the exact recommended next step
|
|
1067
|
+
- for completed idea generation/selection, that richer milestone report should also make your current judgment explicit about whether the idea looks valid, research-worthy, and insight-bearing
|
|
1068
|
+
- for completed main experiments and other finished experiment records, that richer milestone report should also make explicit whether performance improved, worsened, or stayed mixed, and what evidence supports that judgment
|
|
972
1069
|
- that richer milestone report is still normally non-blocking: after sending it, continue the quest automatically whenever the next step is already clear from local evidence
|
|
973
1070
|
- if the active communication surface is QQ and the corresponding auto-send policy is enabled, a richer milestone report may include one high-value attachment such as a summary PNG or final paper PDF
|
|
974
1071
|
- when you explicitly request outbound media attachments through `artifact.interact(...)`, prefer one absolute-path attachment over many relative-path attachments
|
|
@@ -1045,6 +1142,10 @@ Use this exact pattern:
|
|
|
1045
1142
|
Protocol rules:
|
|
1046
1143
|
|
|
1047
1144
|
- even if only one extra experiment is needed, still use a one-slice campaign
|
|
1145
|
+
- plan the full slice list before running the first slice, and ground that list in current quest assets rather than hypothetical future resources
|
|
1146
|
+
- treat files, datasets, checkpoints, extracted texts, baselines, prior results, and user-provided attachments already present in the quest as the first-choice asset pool for supplementary experiments
|
|
1147
|
+
- do not launch slices that require unavailable assets or unsupported capabilities unless you first recover them legitimately within the current system
|
|
1148
|
+
- if legitimate recovery fails, report that inability explicitly and keep the missing dependency visible in the durable record rather than quietly narrowing the task
|
|
1048
1149
|
- do not create ad-hoc follow-up branches outside this protocol unless recovery/debugging truly requires it
|
|
1049
1150
|
- the completed parent result node is immutable history
|
|
1050
1151
|
- for supplementary work, the canonical identity is `campaign_id + slice_id`; do not invent a separate main `run_id`
|
|
@@ -1137,6 +1238,13 @@ For analysis campaigns specifically, the safest default sequence is:
|
|
|
1137
1238
|
5. call `artifact.record_analysis_slice(...)` after each slice with setup, execution, results, metrics, and a six-field `evaluation_summary`
|
|
1138
1239
|
6. after the last slice, return automatically to the parent idea branch and continue writing
|
|
1139
1240
|
|
|
1241
|
+
Before launching or extending an analysis campaign:
|
|
1242
|
+
|
|
1243
|
+
- start from the current quest asset pool first, especially anything the user already provided or the quest already contains, such as datasets, configs, checkpoints, extracted texts, baselines, logs, and reusable code paths
|
|
1244
|
+
- only launch slices that are actually executable with the current quest assets, current runtime/tooling, and currently available credentials
|
|
1245
|
+
- if a proposed slice depends on unavailable data, unsupported infrastructure, or capabilities the current system does not actually have, either redesign it around available assets or report plainly that the slice / campaign cannot currently be completed
|
|
1246
|
+
- if a slice becomes infeasible during execution, attempt bounded recovery first; if it still cannot be completed honestly, record that explicitly with a non-success status and explain the blocker instead of pretending the slice ran
|
|
1247
|
+
|
|
1140
1248
|
When writing `evaluation_summary`, use these semantics:
|
|
1141
1249
|
|
|
1142
1250
|
- `takeaway`: one-sentence human-readable conclusion, starting with the outcome rather than the procedure
|
|
@@ -1623,6 +1731,17 @@ It should preserve:
|
|
|
1623
1731
|
- `experimental_designs`
|
|
1624
1732
|
- `contributions`
|
|
1625
1733
|
|
|
1734
|
+
For story quality, keep one core paper-writing discipline visible:
|
|
1735
|
+
|
|
1736
|
+
- the paper should sell one cohesive contribution or claim cluster, not a random bag of experiments
|
|
1737
|
+
- force the story to answer three reader questions early and clearly:
|
|
1738
|
+
- `What`: the concrete claim or contribution
|
|
1739
|
+
- `Why`: the evidence that supports it
|
|
1740
|
+
- `So What`: why the community should care
|
|
1741
|
+
- if you cannot state the contribution in one sentence, the outline is not stable yet
|
|
1742
|
+
- front-load value: title, abstract, introduction opening, and the first decisive figure/table should already communicate why the work matters
|
|
1743
|
+
- organize every major section around that core contribution with surgical focus; remove side branches that do not support the main claim
|
|
1744
|
+
|
|
1626
1745
|
When building or revising a paper-like outline, prefer the following paperagent-style requirements whenever they fit the quest:
|
|
1627
1746
|
|
|
1628
1747
|
- read all relevant experiments individually before fixing the outline
|
|
@@ -1634,6 +1753,8 @@ When building or revising a paper-like outline, prefer the following paperagent-
|
|
|
1634
1753
|
- prefer actual quest artifacts over older paper numbers when they conflict
|
|
1635
1754
|
- verify that any planned figure or table can be backed by real available data
|
|
1636
1755
|
- keep the method as the protagonist of the story without overstating what belongs to the baseline
|
|
1756
|
+
- make the reader-facing research value explicit early: the outline should say why the problem matters, what concrete bottleneck or gap remains, and why the current intervention changes an important evidence boundary instead of being just another variant
|
|
1757
|
+
- do not assume the reader will infer significance from novelty words alone; make the practical, empirical, or methodological value visible in the title / abstract / introduction plan
|
|
1637
1758
|
|
|
1638
1759
|
Do not mark writing complete if critical evidence, claim mapping, proofing, or submission checks are still missing.
|
|
1639
1760
|
If writing reveals missing evidence, route the quest back through a durable decision instead of glossing over the gap.
|
|
@@ -1641,6 +1762,16 @@ If writing reveals missing evidence, route the quest back through a durable deci
|
|
|
1641
1762
|
During writing:
|
|
1642
1763
|
|
|
1643
1764
|
- persist important search findings, citation notes, figure decisions, and revision notes immediately in durable files
|
|
1765
|
+
- before treating related work or claim framing as stable, run broad literature search and reading passes; for a normal paper-like deliverable, the default target is roughly `30` to `50` verified references unless the scope clearly justifies fewer
|
|
1766
|
+
- every cited paper must be real and verified from an actual source; never invent citations from memory or rely only on second-hand summaries
|
|
1767
|
+
- use one consistent citation workflow: `SEARCH -> VERIFY -> RETRIEVE -> VALIDATE -> ADD`
|
|
1768
|
+
- for search and first-pass metadata, use Semantic Scholar by default or Google Scholar via normal manual search / export only; do not rely on ad hoc random sites as the primary citation source
|
|
1769
|
+
- because Google Scholar has no official API, do not rely on Scholar scraping as an automated backend; use Semantic Scholar as the default programmatic search source and use DOI/Crossref, arXiv, OpenAlex, or publisher metadata as verification/backfill sources when needed
|
|
1770
|
+
- store actual bibliography entries in `paper/references.bib` as valid BibTeX copied or exported from Google Scholar, Semantic Scholar-linked metadata, DOI/Crossref, or publisher metadata; do not hand-write BibTeX entries from scratch
|
|
1771
|
+
- before `artifact.submit_paper_bundle(...)`, run one explicit reference audit for breadth, existence, and claim-level spot checks; unresolved citations keep the draft incomplete
|
|
1772
|
+
- for the abstract, prefer a compact five-part formula: what you achieved -> why it matters / is hard -> how you do it -> what evidence you have -> most important result
|
|
1773
|
+
- write the introduction in a standard research-paper shape: `problem and stakes -> concrete gap/bottleneck -> remedy / core idea -> evidence preview -> contributions`
|
|
1774
|
+
- keep the introduction short and high-density; for paper-style output, aim for roughly `1` to `1.5` pages, include `2` to `4` specific contribution bullets, and do not bury the methods too late when the venue style expects them earlier
|
|
1644
1775
|
- prefer section-aware review with issue location and severity
|
|
1645
1776
|
- re-check the introduction and claimed contributions after the experiments section stabilizes
|
|
1646
1777
|
- run at least one explicit `5-minute reviewer pass` before calling the draft structurally sound
|
|
@@ -1792,9 +1923,24 @@ When summarizing long logs, campaigns, or multi-agent work:
|
|
|
1792
1923
|
- Use shell only when needed and keep the result auditable.
|
|
1793
1924
|
- Any shell-like command execution must go through `bash_exec`; this includes `curl`, `python`, `python3`, `bash`, `sh`, `node`, package managers, and similar CLI tools.
|
|
1794
1925
|
- Do not execute shell commands through any non-`bash_exec` path.
|
|
1795
|
-
- Use `bash_exec(mode='detach', ...)` for long-running work, `bash_exec(mode='await', ...)` for bounded blocking checks, `bash_exec(mode='read', id=...)` to inspect saved logs, `bash_exec(mode='list')` to inspect active and finished sessions, and `bash_exec(mode='kill', id=...)` to stop a managed command.
|
|
1926
|
+
- Use `bash_exec(mode='detach', ...)` for long-running work, `bash_exec(mode='await', ...)` for bounded blocking checks, `bash_exec(mode='read', id=...)` to inspect saved logs, `bash_exec(mode='read', id=..., tail_limit=..., order='desc')` to inspect only the newest saved log evidence first, `bash_exec(mode='read', id=..., after_seq=...)` to fetch only newly appended log entries, `bash_exec(mode='list')` to inspect active and finished sessions, `bash_exec(mode='history')` to recover recent bash ids quickly, and `bash_exec(mode='kill', id=...)` to stop a managed command.
|
|
1796
1927
|
- Before using a bounded wait such as `bash_exec(mode='await', ...)`, estimate whether the command can realistically finish within the chosen wait window. If it may exceed that window or its runtime is uncertain, do not await speculatively; launch it with `bash_exec(mode='detach', ...)` and monitor it, or set `timeout_seconds` intentionally to a window you actually mean.
|
|
1797
1928
|
- For important MCP calls, especially long-running `bash_exec`, include a structured `comment` that briefly states what you are doing, why now, and the next check or next action.
|
|
1929
|
+
- For long-running baseline, experiment, and analysis runs, prefer a compact `comment` shape such as `{stage, goal, action, expected_signal, next_check}` so later monitoring and recovery can be understood without re-reading the whole chat.
|
|
1930
|
+
- For baseline reproduction, main experiments, and analysis experiments, prefer this execution contract:
|
|
1931
|
+
- first run a bounded smoke test or pilot that validates the command path, output location, and basic metric plumbing
|
|
1932
|
+
- once the smoke test passes, launch the real run with `bash_exec(mode='detach', ...)`
|
|
1933
|
+
- for the real long run, normally leave `timeout_seconds` unset unless you intentionally want a bounded wait
|
|
1934
|
+
- if you need to recover or verify ids before monitoring, call `bash_exec(mode='history')` and use the reverse-chronological lines
|
|
1935
|
+
- after launch, monitor with explicit sleeps plus `bash_exec(mode='list')` and `bash_exec(mode='read', id=..., tail_limit=..., order='desc')`
|
|
1936
|
+
- after the first log read, prefer incremental checks with `bash_exec(mode='read', id=..., after_seq=last_seen_seq, tail_limit=..., order='asc')` so you only inspect newly appended evidence
|
|
1937
|
+
- when supervising a long-running baseline, experiment, or analysis run, judge health by forward progress rather than by whether a final artifact has already appeared
|
|
1938
|
+
- treat new sample counters, task counters, saved-result markers, output files, `last_output_seq`, and `last_progress` as the primary liveness signals
|
|
1939
|
+
- if logs expose counters such as `6/46`, `99 instances`, task-completion markers, or save markers, compare those deltas first before inferring that the run is stuck
|
|
1940
|
+
- use `silent_seconds`, `progress_age_seconds`, `signal_age_seconds`, and `watchdog_overdue` from `bash_exec(mode='list'|'read', ...)` as the default watchdog clues instead of inferring staleness from prose alone
|
|
1941
|
+
- do not restart or kill a run merely because a short observation window passed without final completion
|
|
1942
|
+
- if the run is clearly invalid, wedged, superseded, or shows no meaningful delta across a sufficiently long observation window, stop it with `bash_exec(mode='kill', id=..., wait=true, timeout_seconds=...)`; if it must die immediately, add `force=true`
|
|
1943
|
+
- after a kill-and-wait completes, relaunch cleanly with a fresh structured `comment` rather than reusing the broken session
|
|
1798
1944
|
- For a command that is likely to run for a long time, do not launch it and disappear. After `bash_exec(mode='detach', ...)`, keep monitoring it in the same turn through an explicit wait-and-check loop.
|
|
1799
1945
|
- The default long-run monitoring cadence is:
|
|
1800
1946
|
- sleep about `60s`, then inspect with `bash_exec(mode='list')` and `bash_exec(mode='read', id=...)`
|
|
@@ -1803,21 +1949,31 @@ When summarizing long logs, campaigns, or multi-agent work:
|
|
|
1803
1949
|
- sleep about `600s`, then inspect again
|
|
1804
1950
|
- sleep about `1800s`, then inspect again
|
|
1805
1951
|
- if the run is still active, continue checking about every `1800s`
|
|
1952
|
+
- You may widen those windows when the user already told you that the model, endpoint, or workload is expected to be slow; prefer patience over premature intervention in that case.
|
|
1953
|
+
- You may monitor more frequently, but for baseline reproduction, baseline-running phases, main experiments, artifact-production phases, and other important detached work, never let more than `1800s` (30 minutes) pass without inspecting real logs or status again.
|
|
1954
|
+
- For those same important long-running tasks, if the run is still active after the inspection, ensure the user-visible thread also receives a concise `artifact.interact(kind='progress', ...)` update within that same `1800s` window.
|
|
1806
1955
|
- If the only blocker is a missing user-supplied external credential that has already been requested through a blocking interaction and no other useful work is possible, you may intentionally park with a much longer low-frequency wait such as `bash_exec(command='sleep 3600', mode='await', timeout_seconds=3700, ...)` to avoid busy-looping.
|
|
1807
1956
|
- If the environment or tool surface makes direct shell waiting awkward, an equivalent bounded wait such as `bash_exec(mode='await', id=..., timeout_seconds=...)` is acceptable, but the behavior must stay the same: wait, inspect real logs, then continue.
|
|
1808
|
-
- Never stay silent
|
|
1809
|
-
- After each sleep/await cycle finishes and you inspect the real logs again,
|
|
1957
|
+
- Never stay silent for more than `1800s` across an important long-running task.
|
|
1958
|
+
- After each sleep/await cycle finishes and you inspect the real logs again, first compare the new evidence against the last inspection.
|
|
1959
|
+
- If the inspection reveals a human-meaningful delta such as new samples, new completed tasks, new saved outputs, a changed `last_progress`, a route change, or a real problem, send `artifact.interact(kind='progress', ...)` with:
|
|
1810
1960
|
- the current status
|
|
1811
1961
|
- the latest concrete evidence from logs or outputs
|
|
1962
|
+
- what changed since the previous inspection
|
|
1812
1963
|
- the next planned check time
|
|
1813
1964
|
- the estimated next reply time (usually the next sleep interval you are about to use)
|
|
1965
|
+
- If the run still looks healthy but there is no human-meaningful delta yet, continue monitoring silently instead of sending a no-change keepalive just because a sleep finished.
|
|
1966
|
+
- For baseline reproduction, main experiments, analysis experiments, and similar user-relevant long runs, translate that monitoring ETA into user-facing language such as how long until the next meaningful result or the next expected update.
|
|
1967
|
+
- Outside those detached experiment waits, if active work has already consumed roughly 10 to 30 tool calls without any user-visible checkpoint, send a concise `artifact.interact(kind='progress', ...)` before continuing.
|
|
1968
|
+
- If you forget a bash id, do not guess. Use `bash_exec(mode='history')` or `bash_exec(mode='list')` and recover it from the reverse-chronological session list.
|
|
1814
1969
|
- If the long-running command or wrapper code can emit structured progress markers, prefer a concise `__DS_PROGRESS__ { ... }` JSON line with fields such as:
|
|
1815
1970
|
- `current`
|
|
1816
1971
|
- `total` or `percent`
|
|
1817
1972
|
- `phase` or `desc`
|
|
1818
1973
|
- `eta` (seconds until the next meaningful update or completion)
|
|
1819
1974
|
- `next_reply_at` or `next_check_at` when you can compute an absolute timestamp
|
|
1820
|
-
-
|
|
1975
|
+
- When you control the experiment code for baseline reproduction, main experiments, or analysis experiments, prefer a throttled `tqdm`-style progress reporter for human visibility and pair it with periodic `__DS_PROGRESS__` JSON markers when feasible so monitoring stays machine-readable.
|
|
1976
|
+
- Use those structured progress markers for UI progress bars and countdowns; do not rely only on noisy native terminal bars when a stable structured marker is feasible.
|
|
1821
1977
|
- Never claim that a long run is complete, healthy, or successful only because it was launched. Completion must come from terminal `bash_exec` state plus real output files or metrics.
|
|
1822
1978
|
- Prefer small, explainable changes over large speculative rewrites.
|
|
1823
1979
|
- Record why a code change matters to the research question.
|
|
@@ -22,7 +22,7 @@ Do not invent a separate experiment system for those cases.
|
|
|
22
22
|
- Treat `artifact.interact(...)` as the main long-lived communication thread across TUI, web, and bound connectors.
|
|
23
23
|
- If `artifact.interact(...)` returns queued user requirements, treat them as the highest-priority user instruction bundle before continuing the campaign.
|
|
24
24
|
- Immediately follow any non-empty mailbox poll with another `artifact.interact(...)` update that confirms receipt; if the request is directly answerable, answer there, otherwise say the current subtask is paused, give a short plan plus nearest report-back point, and handle that request first.
|
|
25
|
-
- Emit `artifact.interact(kind='progress', reply_mode='threaded', ...)`
|
|
25
|
+
- Emit `artifact.interact(kind='progress', reply_mode='threaded', ...)` when there is real user-visible progress: the first meaningful signal of long work, a meaningful checkpoint, or a concise keepalive if active work has drifted beyond roughly 10 to 30 tool calls without a user-visible update.
|
|
26
26
|
- Prefer `bash_exec` for campaign slice commands so each run has a durable session id, quest-local log folder, and later `read/list/kill` control.
|
|
27
27
|
- Keep progress updates chat-like and easy to understand: say what changed, what it means, and what happens next.
|
|
28
28
|
- Default to plain-language summaries. Do not mention file paths, artifact ids, branch/worktree ids, session ids, raw commands, or raw logs unless the user asks or needs them to act.
|
|
@@ -103,10 +103,16 @@ Before launching a campaign, confirm:
|
|
|
103
103
|
- the comparison target
|
|
104
104
|
- the metric or observable of interest
|
|
105
105
|
- the list of specific analysis questions
|
|
106
|
+
- the current quest / user-provided assets that each planned slice will actually use
|
|
107
|
+
- whether each slice is executable with the current assets, tooling, and available credentials
|
|
106
108
|
- if durable state exposes `active_baseline_metric_contract_json`, read that JSON file before defining slice success criteria or comparison tables
|
|
107
109
|
- treat `active_baseline_metric_contract_json` as the default baseline comparison contract unless a slice is explicitly testing a different evaluation contract
|
|
108
110
|
|
|
109
111
|
If the question list is fuzzy, sharpen it before running anything.
|
|
112
|
+
Treat quest files, attached user assets, checkpoints, configs, extracted texts, baselines, and existing code paths as the first-choice asset pool.
|
|
113
|
+
Do not design slices around hypothetical resources that the current system cannot actually access or run.
|
|
114
|
+
If a slice cannot be executed with the current system, redesign it around available assets or explicitly report that the task cannot currently be completed.
|
|
115
|
+
If infeasibility appears mid-run, attempt bounded recovery first; if still blocked, record the slice with a non-success status and explain why.
|
|
110
116
|
|
|
111
117
|
## Truth sources
|
|
112
118
|
|
|
@@ -289,11 +295,13 @@ Create the campaign with `artifact.create_analysis_campaign(...)` before startin
|
|
|
289
295
|
Even one extra experiment should still be represented as a one-slice campaign so Git and Canvas show a real child node.
|
|
290
296
|
Branch that campaign from the current workspace/result node rather than mutating the completed parent node in place.
|
|
291
297
|
That tool should receive the full slice list, and each returned slice worktree becomes the required execution location for that slice.
|
|
298
|
+
Only create the campaign after you have verified that the listed slices are actually executable with the current quest assets and runtime.
|
|
292
299
|
When the campaign is writing-facing, the same call should also carry `selected_outline_ref`, `research_questions`, `experimental_designs`, and `todo_items`.
|
|
293
300
|
If ids or refs are unclear, recover them first with `artifact.resolve_runtime_refs(...)`, `artifact.get_analysis_campaign(...)`, or `artifact.list_paper_outlines(...)` instead of guessing.
|
|
294
301
|
Treat `campaign_id` as system-owned, and treat `slice_id` / `todo_id` as agent-authored semantic ids.
|
|
295
302
|
Do not replace the normal campaign flow with repeated manual `artifact.prepare_branch(...)` calls.
|
|
296
303
|
After each slice finishes, call `artifact.record_analysis_slice(...)` immediately so the result is mirrored back to the parent branch and the next slice can be activated.
|
|
304
|
+
If a slice fails or becomes infeasible, still call `artifact.record_analysis_slice(...)` with an honest non-success status plus the real blocker and next recommendation; do not leave the campaign state ambiguous.
|
|
297
305
|
For slice recording, `deviations` and `evidence_paths` are optional context fields, not mandatory ceremony; include them only when they materially help explanation or auditability.
|
|
298
306
|
Each `artifact.record_analysis_slice(...)` call should also include an `evaluation_summary` with exactly these six fields:
|
|
299
307
|
|
|
@@ -311,14 +319,20 @@ For writing-facing campaigns, prefer running `claim-carrying` slices before `sup
|
|
|
311
319
|
|
|
312
320
|
For slices that run longer than a quick smoke check:
|
|
313
321
|
|
|
314
|
-
-
|
|
315
|
-
-
|
|
322
|
+
- first run a bounded smoke test so the slice command, outputs, and metric path are validated cheaply
|
|
323
|
+
- once the smoke test passes, launch the real slice with `bash_exec(mode='detach', ...)` and normally leave `timeout_seconds` unset for that long run
|
|
324
|
+
- monitor them with `bash_exec(mode='list')` and `bash_exec(mode='read', id=..., tail_limit=..., order='desc')`
|
|
325
|
+
- after the first read, prefer `bash_exec(mode='read', id=..., after_seq=last_seen_seq, tail_limit=..., order='asc')` for incremental monitoring
|
|
326
|
+
- if ids become unclear, recover them through `bash_exec(mode='history')`
|
|
327
|
+
- launch long slices with a structured `comment` such as `{stage, goal, action, expected_signal, next_check}`
|
|
328
|
+
- use `silent_seconds`, `progress_age_seconds`, `signal_age_seconds`, and `watchdog_overdue` from `bash_exec(mode='list'|'read', ...)` as the default stall checks
|
|
316
329
|
- use an explicit wait-and-check cadence of about `60s`, `120s`, `300s`, `600s`, `1800s`, then every `1800s` while still running
|
|
317
|
-
- if needed, use
|
|
330
|
+
- if needed, use an explicit bounded wait such as `bash_exec(command='sleep 60', mode='await', timeout_seconds=70)` or `bash_exec(mode='await', id=..., timeout_seconds=...)` between checks
|
|
318
331
|
- after the first meaningful signal and then at real checkpoints (e.g., completion, or roughly every ~30 minutes if still running), send `artifact.interact(kind='progress', ...)` so the user sees slice status, latest evidence, and the next check point
|
|
319
332
|
- after each completed sleep / await monitoring cycle for an active slice, send another concise `artifact.interact(kind='progress', ...)` update rather than going silent
|
|
320
333
|
- include the estimated next reply time or next check time in those monitoring updates
|
|
321
|
-
- stop them with `bash_exec(mode='kill', id=...)` if the slice is invalid, wedged, or superseded
|
|
334
|
+
- stop them with `bash_exec(mode='kill', id=..., wait=true, timeout_seconds=...)` if the slice is invalid, wedged, or superseded; add `force=true` when immediate termination is required
|
|
335
|
+
- when you control the slice code, prefer a throttled `tqdm` progress reporter and, when feasible, pair it with concise `__DS_PROGRESS__` lines carrying phase and ETA
|
|
322
336
|
- do not mark a slice complete until the managed log and outputs both confirm completion
|
|
323
337
|
|
|
324
338
|
### 3. Keep comparability
|