@researai/deepscientist 1.5.1 → 1.5.2
- package/README.md +47 -1
- package/bin/ds.js +1823 -121
- package/docs/en/00_QUICK_START.md +38 -20
- package/docs/en/01_SETTINGS_REFERENCE.md +20 -20
- package/docs/en/02_START_RESEARCH_GUIDE.md +11 -11
- package/docs/en/03_QQ_CONNECTOR_GUIDE.md +10 -10
- package/docs/en/05_TUI_GUIDE.md +1 -1
- package/docs/en/09_DOCTOR.md +48 -4
- package/docs/en/90_ARCHITECTURE.md +4 -2
- package/docs/zh/00_QUICK_START.md +38 -20
- package/docs/zh/01_SETTINGS_REFERENCE.md +21 -21
- package/docs/zh/02_START_RESEARCH_GUIDE.md +19 -19
- package/docs/zh/03_QQ_CONNECTOR_GUIDE.md +10 -10
- package/docs/zh/05_TUI_GUIDE.md +1 -1
- package/docs/zh/09_DOCTOR.md +46 -4
- package/install.sh +9 -8
- package/package.json +2 -1
- package/pyproject.toml +1 -1
- package/src/deepscientist/__init__.py +6 -1
- package/src/deepscientist/artifact/service.py +552 -25
- package/src/deepscientist/config/service.py +1 -1
- package/src/deepscientist/daemon/api/handlers.py +18 -1
- package/src/deepscientist/daemon/api/router.py +2 -0
- package/src/deepscientist/daemon/app.py +90 -1
- package/src/deepscientist/doctor.py +69 -2
- package/src/deepscientist/gitops/diff.py +3 -0
- package/src/deepscientist/home.py +25 -2
- package/src/deepscientist/mcp/context.py +3 -1
- package/src/deepscientist/mcp/server.py +6 -0
- package/src/deepscientist/prompts/builder.py +41 -0
- package/src/deepscientist/quest/layout.py +1 -0
- package/src/deepscientist/quest/service.py +70 -12
- package/src/deepscientist/quest/stage_views.py +46 -0
- package/src/deepscientist/runners/codex.py +2 -0
- package/src/deepscientist/shared.py +44 -17
- package/src/prompts/connectors/lingzhu.md +3 -0
- package/src/prompts/system.md +38 -5
- package/src/skills/analysis-campaign/SKILL.md +24 -1
- package/src/skills/baseline/SKILL.md +7 -1
- package/src/skills/decision/SKILL.md +3 -2
- package/src/skills/experiment/SKILL.md +17 -1
- package/src/skills/finalize/SKILL.md +4 -1
- package/src/skills/idea/SKILL.md +1 -1
- package/src/skills/intake-audit/SKILL.md +1 -1
- package/src/skills/rebuttal/SKILL.md +3 -1
- package/src/skills/review/SKILL.md +3 -1
- package/src/skills/scout/SKILL.md +1 -1
- package/src/skills/write/SKILL.md +1 -1
- package/src/tui/package.json +1 -1
- package/src/ui/dist/assets/{AiManusChatView-w5lF2Ttt.js → AiManusChatView-CZpg376x.js} +64 -68
- package/src/ui/dist/assets/{AnalysisPlugin-DJOED79I.js → AnalysisPlugin-CtHA22g3.js} +1 -1
- package/src/ui/dist/assets/{AutoFigurePlugin-DaG61Y0M.js → AutoFigurePlugin-BSWmLMmF.js} +5 -5
- package/src/ui/dist/assets/{CliPlugin-CV4LqUB_.js → CliPlugin-CJ7jdm_s.js} +9 -9
- package/src/ui/dist/assets/{CodeEditorPlugin-DylfAea4.js → CodeEditorPlugin-DhInVGFf.js} +8 -8
- package/src/ui/dist/assets/{CodeViewerPlugin-F7saY0LM.js → CodeViewerPlugin-D1n8S9r5.js} +5 -5
- package/src/ui/dist/assets/{DocViewerPlugin-COP0c7jf.js → DocViewerPlugin-C4XM_kqk.js} +3 -3
- package/src/ui/dist/assets/{GitDiffViewerPlugin-CAS05pT9.js → GitDiffViewerPlugin-W6kS9r6v.js} +1 -1
- package/src/ui/dist/assets/{ImageViewerPlugin-Bco1CN_w.js → ImageViewerPlugin-DPeUx_Oz.js} +5 -5
- package/src/ui/dist/assets/{LabCopilotPanel-CvMlCD99.js → LabCopilotPanel-eAelUaub.js} +10 -10
- package/src/ui/dist/assets/{LabPlugin-BYankkE4.js → LabPlugin-BbOrBxKY.js} +1 -1
- package/src/ui/dist/assets/{LatexPlugin-LDSMR-t-.js → LatexPlugin-C-HhkVXY.js} +7 -7
- package/src/ui/dist/assets/{MarkdownViewerPlugin-B7o80jgm.js → MarkdownViewerPlugin-BDIzIBfh.js} +4 -4
- package/src/ui/dist/assets/{MarketplacePlugin-CM6ZOcpC.js → MarketplacePlugin-DAOJphwr.js} +3 -3
- package/src/ui/dist/assets/{NotebookEditor-Dc61cXmK.js → NotebookEditor-BsoMvDoU.js} +1 -1
- package/src/ui/dist/assets/{PdfLoader-DWowuQwx.js → PdfLoader-fiC7RtHf.js} +1 -1
- package/src/ui/dist/assets/{PdfMarkdownPlugin-BsJM1q_a.js → PdfMarkdownPlugin-C5OxZBFK.js} +3 -3
- package/src/ui/dist/assets/{PdfViewerPlugin-DB2eEEFQ.js → PdfViewerPlugin-CAbxQebk.js} +10 -10
- package/src/ui/dist/assets/{SearchPlugin-CraThSvt.js → SearchPlugin-SE33Lb9B.js} +1 -1
- package/src/ui/dist/assets/{Stepper-CgocRTPq.js → Stepper-0Av7GfV7.js} +1 -1
- package/src/ui/dist/assets/{TextViewerPlugin-B1JGhKtd.js → TextViewerPlugin-Daf2gJDI.js} +4 -4
- package/src/ui/dist/assets/{VNCViewer-CclFC7FM.js → VNCViewer-BKrMUIOX.js} +9 -9
- package/src/ui/dist/assets/{bibtex-D3IKsMl7.js → bibtex-JBdOEe45.js} +1 -1
- package/src/ui/dist/assets/{code-BP37Xx0p.js → code-B0TDFCZz.js} +1 -1
- package/src/ui/dist/assets/{file-content-BAJSu-9r.js → file-content-3YtrSacz.js} +1 -1
- package/src/ui/dist/assets/{file-diff-panel-DUGeCTuy.js → file-diff-panel-CJEg5OG1.js} +1 -1
- package/src/ui/dist/assets/{file-socket-CXc1Ojf7.js → file-socket-CYQYdmB1.js} +1 -1
- package/src/ui/dist/assets/{file-utils-2J21jt7M.js → file-utils-Cd1C9Ppl.js} +1 -1
- package/src/ui/dist/assets/{image-CMMmgvcn.js → image-B33ctrvC.js} +1 -1
- package/src/ui/dist/assets/{index-s7aHnNQ4.js → index-9CLPVeZh.js} +1 -1
- package/src/ui/dist/assets/{index-CWgMgpow.js → index-BNQWqmJ2.js} +11 -11
- package/src/ui/dist/assets/{index-DmwmJmbW.js → index-BVXsmS7V.js} +15808 -14025
- package/src/ui/dist/assets/{index-BaVumsQT.js → index-Buw_N1VQ.js} +2 -2
- package/src/ui/dist/assets/{index-KGt-z-dD.css → index-SwmFAld3.css} +2700 -2
- package/src/ui/dist/assets/{message-square-CQRfX0Am.js → message-square-D0cUJ9yU.js} +1 -1
- package/src/ui/dist/assets/{monaco-B4TbdsrF.js → monaco-UZLYkp2n.js} +1 -1
- package/src/ui/dist/assets/{popover-B8Rokodk.js → popover-CTeiY-dK.js} +1 -1
- package/src/ui/dist/assets/{project-sync-D_i96KH4.js → project-sync-Dbs01Xky.js} +1 -1
- package/src/ui/dist/assets/{sigma-D12PnzCN.js → sigma-CM08S-xT.js} +1 -1
- package/src/ui/dist/assets/{tooltip-B6YrI4aJ.js → tooltip-pDtzvU9p.js} +1 -1
- package/src/ui/dist/assets/{trash-Bc8jGp0V.js → trash-YvPCP-da.js} +1 -1
- package/src/ui/dist/assets/{useCliAccess-mXVCYSZ-.js → useCliAccess-Bavi74Ac.js} +1 -1
- package/src/ui/dist/assets/{useFileDiffOverlay-Bg6b9H9K.js → useFileDiffOverlay-CVXY6oeg.js} +1 -1
- package/src/ui/dist/assets/{wrap-text-Drh5GEnL.js → wrap-text-Cf4flRW7.js} +1 -1
- package/src/ui/dist/assets/{zoom-out-CJj9DZLn.js → zoom-out-Hb0Z1YpT.js} +1 -1
- package/src/ui/dist/index.html +2 -2
- package/uv.lock +1155 -0
- package/src/ui/dist/assets/LabPlugin-D9jVIo0A.css +0 -2698
@@ -62,6 +62,38 @@ def _field(label: str, value: object, *, tone: str = "default") -> dict[str, Any
     }
 
 
+def _evaluation_summary(value: object) -> dict[str, Any]:
+    if not isinstance(value, dict):
+        return {}
+    normalized: dict[str, Any] = {}
+    for key in (
+        "takeaway",
+        "claim_update",
+        "baseline_relation",
+        "comparability",
+        "failure_mode",
+        "next_action",
+    ):
+        raw = value.get(key)
+        text = str(raw).strip() if raw is not None else ""
+        if text:
+            normalized[key] = text
+    return normalized
+
+
+def _evaluation_summary_fields(value: object, *, prefix: str = "Evaluation") -> list[dict[str, Any]]:
+    summary = _evaluation_summary(value)
+    labels = (
+        ("takeaway", f"{prefix} Takeaway"),
+        ("claim_update", f"{prefix} Claim Update"),
+        ("baseline_relation", f"{prefix} Baseline Relation"),
+        ("comparability", f"{prefix} Comparability"),
+        ("failure_mode", f"{prefix} Failure Mode"),
+        ("next_action", f"{prefix} Next Action"),
+    )
+    return [_field(label, summary[key]) for key, label in labels if summary.get(key)]
+
+
 def _artifact_sort_key(item: dict[str, Any]) -> tuple[str, str]:
     payload = item.get("payload") if isinstance(item.get("payload"), dict) else {}
     return (
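The normalization added in the hunk above can be tried in isolation. The following is a minimal sketch, not the package's module: `field` is a hypothetical stand-in for `_field` (whose body is not shown in this diff), and the sample input values are made up for illustration.

```python
from typing import Any

# The six keys listed in the hunk above.
SUMMARY_KEYS = (
    "takeaway",
    "claim_update",
    "baseline_relation",
    "comparability",
    "failure_mode",
    "next_action",
)


def field(label: str, value: object) -> dict[str, Any]:
    # Hypothetical stand-in for `_field`; the real helper's body is not in this diff.
    return {"label": label, "value": value}


def evaluation_summary(value: object) -> dict[str, Any]:
    # Keep only the known keys, coerce values to stripped strings, drop blanks.
    if not isinstance(value, dict):
        return {}
    normalized: dict[str, Any] = {}
    for key in SUMMARY_KEYS:
        raw = value.get(key)
        text = str(raw).strip() if raw is not None else ""
        if text:
            normalized[key] = text
    return normalized


def evaluation_summary_fields(value: object, *, prefix: str = "Evaluation") -> list[dict[str, Any]]:
    # Turn the normalized summary into labelled display fields, in key order.
    summary = evaluation_summary(value)
    return [
        field(f"{prefix} {key.replace('_', ' ').title()}", summary[key])
        for key in SUMMARY_KEYS
        if key in summary
    ]


# Illustrative input: whitespace, None, an empty string, a non-string, and an unknown key.
raw = {
    "takeaway": "  Accuracy improved.  ",
    "claim_update": None,
    "failure_mode": "",
    "next_action": 3,
    "extra": "dropped",
}
clean = evaluation_summary(raw)
fields = evaluation_summary_fields(raw)
```

Under these assumptions, blank or missing entries simply disappear from the view rather than rendering as "Not recorded", which appears to be the intent of the `if summary.get(key)` filter in the diff.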
@@ -814,6 +846,9 @@ class QuestStageViewBuilder:
         )
         latest_metrics_summary = latest_experiment_payload.get("metrics_summary") or latest_result_payload.get("metrics_summary") or {}
         latest_run_id = str(latest_experiment_payload.get("run_id") or "").strip() or None
+        latest_evaluation_summary = _evaluation_summary(
+            latest_experiment_payload.get("evaluation_summary") or latest_result_payload.get("evaluation_summary")
+        )
 
         analysis_manifests = self._analysis_manifests()
         analysis_manifest = next(
@@ -883,6 +918,7 @@ class QuestStageViewBuilder:
                 _field("Latest Metrics", latest_metrics_summary or "Not recorded"),
                 _field("Delta vs Baseline", latest_progress_eval.get("delta_vs_baseline") or "Not recorded"),
                 _field("Breakthrough", latest_progress_eval.get("breakthrough_level") or "Not recorded"),
+                *_evaluation_summary_fields(latest_evaluation_summary),
             ],
             key_files=self._dedupe_files(
                 [
@@ -940,6 +976,7 @@ class QuestStageViewBuilder:
                 "verdict": latest_experiment_payload.get("verdict"),
                 "metrics_summary": latest_metrics_summary,
                 "progress_eval": latest_progress_eval,
+                "evaluation_summary": latest_evaluation_summary,
                 "run_md_path": latest_experiment_paths.get("run_md"),
                 "result_json_path": latest_experiment_paths.get("result_json"),
             }
@@ -979,6 +1016,7 @@ class QuestStageViewBuilder:
         result_payload = read_json(Path(paths.get("result_json")), {}) if str(paths.get("result_json") or "").strip() else {}
         progress_eval = payload.get("progress_eval") or result_payload.get("progress_eval") or {}
         baseline_ref = payload.get("baseline_ref") or result_payload.get("baseline_ref") or {}
+        evaluation_summary = _evaluation_summary(payload.get("evaluation_summary") or result_payload.get("evaluation_summary"))
         run_id = str(payload.get("run_id") or "pending").strip() or "pending"
         note = (
             str(payload.get("summary") or result_payload.get("conclusion") or (progress_eval or {}).get("reason") or "").strip()
@@ -1028,6 +1066,7 @@ class QuestStageViewBuilder:
                 _field("Metrics Summary", metrics_summary or "Not recorded"),
                 _field("Delta vs Baseline", (progress_eval or {}).get("delta_vs_baseline") or "Not recorded"),
                 _field("Breakthrough Level", (progress_eval or {}).get("breakthrough_level") or "Not recorded"),
+                *_evaluation_summary_fields(evaluation_summary),
             ],
             key_files=key_files,
             history=self._artifact_history(experiment_items),
@@ -1040,6 +1079,7 @@ class QuestStageViewBuilder:
                     "baseline_ref": baseline_ref,
                     "metrics_summary": metrics_summary,
                     "progress_eval": progress_eval,
+                    "evaluation_summary": evaluation_summary,
                     "result_payload": result_payload,
                 }
             },
@@ -1141,6 +1181,9 @@ class QuestStageViewBuilder:
                 "reviewer_resolution": detail_payload.get("reviewer_resolution"),
                 "manuscript_update_hint": detail_payload.get("manuscript_update_hint"),
                 "next_recommendation": detail_payload.get("next_recommendation"),
+                "evaluation_summary": _evaluation_summary(
+                    run_payload.get("evaluation_summary") or detail_payload.get("evaluation_summary")
+                ),
                 "deviations": detail_payload.get("deviations") or [],
                 "evidence_paths": detail_payload.get("evidence_paths") or [],
                 "plan_path": item.get("plan_path"),
@@ -1233,8 +1276,11 @@ class QuestStageViewBuilder:
             self._file_entry("paper/writing_plan.md", label="Writing Plan", description="Paper writing plan."),
             self._file_entry("paper/references.bib", label="References", description="Bibliography file."),
             self._file_entry("paper/claim_evidence_map.json", label="Claim-Evidence Map", description="Claim to evidence mapping."),
+            self._file_entry("paper/baseline_inventory.json", label="Baseline Inventory", description="Canonical and supplementary baseline inventory for writing."),
             self._file_entry("paper/build/compile_report.json", label="Compile Report", description="Paper build/compile report."),
             self._file_entry("paper/paper_bundle_manifest.json", label="Bundle Manifest", description="Final paper bundle manifest."),
+            self._file_entry("release/open_source/manifest.json", label="Open Source Manifest", description="Open-source cleanup and release preparation manifest."),
+            self._file_entry("release/open_source/cleanup_plan.md", label="Open Source Cleanup Plan", description="Checklist for cleaning the paper branch into a public release."),
             self._file_entry(latex_root_rel, label="LaTeX Sources", description="LaTeX source folder.", expected_kind="directory"),
             self._file_entry(main_tex_rel, label="Main TeX", description="Primary TeX source file."),
         ]
@@ -530,6 +530,7 @@ class CodexRunner:
 
         env = dict(**os.environ)
         env["CODEX_HOME"] = str(codex_home)
+        env["DEEPSCIENTIST_HOME"] = str(self.home)
         env["DS_HOME"] = str(self.home)
         env["DS_QUEST_ID"] = request.quest_id
         env["DS_QUEST_ROOT"] = str(request.quest_root)
@@ -846,6 +847,7 @@ class CodexRunner:
         tool_timeout_sec = None
 
         shared_env = {
+            "DEEPSCIENTIST_HOME": str(self.home),
             "DS_HOME": str(self.home),
             "DS_QUEST_ID": quest_id,
             "DS_QUEST_ROOT": str(quest_root),
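Both `CodexRunner` hunks above export the same home directory under two names. A minimal sketch of that pattern follows, assuming a standalone helper; `build_runner_env`, its parameters, and the sample paths are hypothetical and not part of the package.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def build_runner_env(
    home: Path,
    quest_id: str,
    quest_root: Path,
    base_env: Optional[Mapping[str, str]] = None,
) -> dict[str, str]:
    # Copy the parent environment so the child process inherits PATH and friends.
    env = dict(os.environ if base_env is None else base_env)
    # Export the home directory under both the long and the short variable name,
    # as the two hunks above do.
    env["DEEPSCIENTIST_HOME"] = str(home)
    env["DS_HOME"] = str(home)
    env["DS_QUEST_ID"] = quest_id
    env["DS_QUEST_ROOT"] = str(quest_root)
    return env


# Illustrative values only.
env = build_runner_env(
    Path("/tmp/ds-home"),
    "quest-001",
    Path("/tmp/ds-home/quests/quest-001"),
    base_env={"PATH": "/usr/bin"},
)
```

Setting both names presumably lets tools that read either variable find the same home; the diff itself does not state the motivation.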
@@ -71,7 +71,8 @@ def write_json(path: Path, payload: Any) -> None:
     )
 
 
-def read_json(path: Path, default: Any = None) -> Any:
+def read_json(path: Path | str, default: Any = None) -> Any:
+    path = Path(path)
     if not path.exists():
         return default
     payload = path.read_text(encoding="utf-8").strip()
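The signature change above lets callers pass either a string or a `Path`. A minimal sketch of the widened function follows. Everything after the `read_text` line is an assumption, since the hunk does not show the rest of the body: here an empty file falls back to `default` and anything else goes through `json.loads`.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Union


def read_json(path: Union[Path, str], default: Any = None) -> Any:
    # Coerce first, so string paths behave exactly like Path objects.
    path = Path(path)
    if not path.exists():
        return default
    payload = path.read_text(encoding="utf-8").strip()
    # Assumption: the real function's handling past this point is not shown in the diff.
    if not payload:
        return default
    return json.loads(payload)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "result.json"
    target.write_text('{"run_id": "r1"}', encoding="utf-8")
    from_str = read_json(str(target))    # string input
    from_path = read_json(target)        # Path input
    missing = read_json(Path(tmp) / "missing.json", {})
```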
@@ -155,35 +156,61 @@ def which(binary: str) -> str | None:
     return shutil.which(binary)
 
 
-def
-    normalized = str(
+def _resolve_executable_reference(reference: str) -> str | None:
+    normalized = str(reference or "").strip()
     if not normalized:
         return None
 
     candidate = Path(normalized).expanduser()
     if candidate.is_absolute() or os.path.sep in normalized or (os.path.altsep and os.path.altsep in normalized):
         return str(candidate) if candidate.exists() else None
+    return shutil.which(normalized)
+
+
+def _codex_repo_roots() -> list[Path]:
+    roots: list[Path] = []
+    configured = str(os.environ.get("DEEPSCIENTIST_REPO_ROOT") or "").strip()
+    if configured:
+        roots.append(Path(configured).expanduser().resolve())
+    roots.append(Path(__file__).resolve().parents[2])
+
+    deduped: list[Path] = []
+    seen: set[str] = set()
+    for root in roots:
+        key = str(root)
+        if key in seen:
+            continue
+        seen.add(key)
+        deduped.append(root)
+    return deduped
 
-
-
-
+
+def resolve_runner_binary(binary: str, *, runner_name: str | None = None) -> str | None:
+    normalized = str(binary or "").strip()
+    if not normalized:
+        return None
+
+    resolved_reference = _resolve_executable_reference(normalized)
+    candidate = Path(normalized).expanduser()
+    if candidate.is_absolute() or os.path.sep in normalized or (os.path.altsep and os.path.altsep in normalized):
+        return resolved_reference
 
     normalized_runner = str(runner_name or candidate.name or normalized).strip().lower()
     if normalized_runner != "codex":
-        return
+        return resolved_reference
 
     for env_name in ("DEEPSCIENTIST_CODEX_BINARY", "DS_CODEX_BINARY"):
         override = os.environ.get(env_name)
         if override:
-
-            if
-                return
+            resolved_override = _resolve_executable_reference(override)
+            if resolved_override:
+                return resolved_override
 
-    repo_root = Path(__file__).resolve().parents[2]
-    node_bin_root = repo_root / "node_modules" / ".bin"
     names = ["codex.cmd", "codex.exe", "codex"] if sys.platform.startswith("win") else ["codex"]
-    for
-
-
-
-
+    for root in _codex_repo_roots():
+        node_bin_root = root / "node_modules" / ".bin"
+        for name in names:
+            package_local = node_bin_root / name
+            if package_local.exists():
+                return str(package_local)
+    return resolved_reference
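The new `_resolve_executable_reference` helper treats path-like references and bare command names differently. The sketch below restates that logic as a standalone function so the two branches can be exercised; the public name `resolve_executable_reference` and the sample inputs are illustrative assumptions.

```python
import os
import shutil
import sys
from pathlib import Path
from typing import Optional


def resolve_executable_reference(reference: Optional[str]) -> Optional[str]:
    normalized = str(reference or "").strip()
    if not normalized:
        return None
    candidate = Path(normalized).expanduser()
    has_separator = os.path.sep in normalized or bool(
        os.path.altsep and os.path.altsep in normalized
    )
    if candidate.is_absolute() or has_separator:
        # Path-like reference: it must exist on disk; PATH is never consulted.
        return str(candidate) if candidate.exists() else None
    # Bare command name: defer to the PATH lookup.
    return shutil.which(normalized)


empty = resolve_executable_reference("   ")
by_path = resolve_executable_reference(sys.executable)
missing_path = resolve_executable_reference(os.path.join(".", "no-such-dir", "no-such-binary"))
missing_name = resolve_executable_reference("definitely-not-a-real-binary-xyz")
```

In the diff, `resolve_runner_binary` computes this result once as `resolved_reference` and returns it wherever the removed lines held a bare `return`, so a plain PATH lookup becomes the fallback after the environment overrides and the `node_modules/.bin` search.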
@@ -10,3 +10,6 @@
 - lingzhu_progress_rule: for long-running work, your first substantive reply should contain either the direct answer or the first concrete checkpoint, not a duplicate transport acknowledgement
 - lingzhu_safety_rule: request only actions that are clearly justified by the current quest and understandable to the human user
 - lingzhu_text_rule: even when requesting `surface_actions`, always include a clear text explanation of what is happening and why
+- lingzhu_reply_style_rule: for Lingzhu-facing user-visible text sent through `artifact.interact(...)`, keep the message clear, concise, respectful, and high-information-density
+- lingzhu_reply_length_rule: for each Lingzhu-facing `artifact.interact(...)` message, normally answer in at most 2 to 3 sentences unless the user explicitly asks for more detail
+- lingzhu_summary_first_rule: in Lingzhu-facing `artifact.interact(...)` messages, usually give only the synopsis and key facts needed for the user's next decision or understanding; avoid long preambles, repetition, and low-signal detail
package/src/prompts/system.md
CHANGED
@@ -433,12 +433,16 @@ If you must deviate, record the reason in an artifact report or decision.
 
 - `baselines/local/` (baseline code you maintain)
 - Baseline code that you are actively fixing, reproducing, or extending inside this quest.
+- Supplementary analysis comparators still live here when they are reproduced inside the quest; do not create a parallel top-level baseline root.
 - Store durable baseline variants here when they must be committed and reviewed.
 
 - `artifacts/baselines/` (baseline records)
 - Baseline audit notes, metric contracts, reproduction notes, and baseline attachment records.
 - This is metadata and reporting, not the baseline code itself.
 
+- `release/open_source/` (public-release preparation)
+- Use this for open-source cleanup manifests, include/exclude lists, and the final public-code pruning checklist after the paper bundle exists.
+
 - `experiments/main/` (main experiment workspace)
 - Main experiment scripts, configs, and durable outputs tied to the active idea branch.
 
@@ -891,10 +895,22 @@ Prefer these patterns:
 - do not use `mode='revise'` as the default way to start a new optimization round, even for documentation-only changes
 - use `artifact.record_main_experiment(...)` immediately after a real main experiment finishes on the active idea workspace
 - this call is the normal path to write `RUN.md` and `RESULT.json`
+- include a compact `evaluation_summary` for every durable main-experiment result with exactly these fields:
+- `takeaway`
+- `claim_update`
+- `baseline_relation`
+- `comparability`
+- `failure_mode`
+- `next_action`
+- do not omit `evaluation_summary` just because the result is weak, mixed, or not directly comparable
+- if comparison is invalid or evidence is limited, express that explicitly through `baseline_relation`, `comparability`, and `failure_mode` instead of hiding the uncertainty in prose
+- write it for a human reader who should understand the run outcome without opening logs, diffs, or file paths
+- keep `takeaway` to one short sentence, keep `next_action` to one best immediate route, and do not include branch ids, paths, tool traces, or raw metric dumps
 - once a branch has a durable main-experiment result, treat that branch as a fixed historical research node
 - use `artifact.create_analysis_campaign(...)` whenever one or more extra experiments must branch from the current workspace/result node
 - even a single extra experiment should still become a one-slice analysis campaign instead of mutating the completed parent node in place
 - use `artifact.record_analysis_slice(...)` immediately after each analysis slice finishes
+- include the same six-field `evaluation_summary` so later review, rebuttal, and route selection can read one stable summary instead of re-parsing long prose
 - use `artifact.prepare_branch(...)` only for compatibility or exceptional manual recovery; do not prefer it for the normal idea -> experiment -> analysis flow
 - use `artifact.confirm_baseline(...)` as the canonical baseline-stage gate after the accepted baseline root, variant, and metric contract are clear
 - use `artifact.waive_baseline(...)` only when the quest must explicitly continue without a baseline
@@ -968,7 +984,10 @@ For `artifact.interact(...)` specifically:
 - when requesting user input, include concrete options and an explicit reply format whenever possible
 - for a blocking `artifact.interact(kind='decision_request', ...)`, provide 1 to 3 concrete options, put the recommended option first, and explain each option's actual content, pros, cons, and expected consequence
 - for a blocking `artifact.interact(kind='decision_request', ...)`, state the reply format clearly and normally wait up to 1 day for the user unless the task or user already defined a shorter safe deadline
-- if that
+- if the blocker is a user-supplied external credential or secret that you cannot safely obtain yourself, such as an API key, GitHub key/token, Hugging Face key/token, or similar account credential, always use `artifact.interact(kind='decision_request', reply_mode='blocking', ...)` to ask the user to provide it or choose an alternative route
+- for that credential-blocked case, do not fabricate placeholder credentials, do not silently skip the blocked step, and do not self-resolve by pretending the credential is optional unless the user explicitly chose an alternative route
+- if such a credential request remains unanswered, keep the quest waiting instead of forcing a route decision; if the runtime or tool loop resumes you without fresh credentials and no other work is possible, you may park with a long low-frequency wait such as `bash_exec(command='sleep 3600', mode='await', timeout_seconds=3700, ...)` rather than busy-looping
+- otherwise, if that blocking decision request times out, choose the best option yourself from the stated options, record the evidence-backed reason, and notify the user of the chosen option before continuing
 - prefer one blocking user request at a time unless true parallel ambiguity is unavoidable
 - if a threaded user reply arrives after a progress update, interpret it relative to that progress thread first before treating it as a new unrelated task
 - after sending a blocking request, treat the next unseen inbound user messages as higher-priority context than stale plan assumptions
@@ -1115,16 +1134,27 @@ For analysis campaigns specifically, the safest default sequence is:
 2. call `artifact.create_analysis_campaign(...)` with the full slice list
 3. move into the returned slice worktrees one by one
 4. emit `progress` during long-running slices
-5. call `artifact.record_analysis_slice(...)` after each slice with setup, execution, results, metrics, and
+5. call `artifact.record_analysis_slice(...)` after each slice with setup, execution, results, metrics, and a six-field `evaluation_summary`
 6. after the last slice, return automatically to the parent idea branch and continue writing
 
+When writing `evaluation_summary`, use these semantics:
+
+- `takeaway`: one-sentence human-readable conclusion, starting with the outcome rather than the procedure
+- `claim_update`: only describe whether the core claim is strengthened, weakened, narrowed, or left neutral
+- `baseline_relation`: compare against the active baseline only when the comparison is methodologically valid; otherwise use `not_comparable`
+- `comparability`: use this as the explicit uncertainty channel when protocol drift, data mismatch, or incomplete runs reduce confidence
+- `failure_mode`: classify the dominant reason for failure or instability instead of reframing failures as support
+- `next_action`: choose one immediate route only; do not turn it into a wishlist
+
+Before planning further work, first read the most recent `evaluation_summary` blocks from the relevant main experiment and analysis slices; only drop to raw logs or long prose when the short judgment layer is still ambiguous.
+
 For a normal main experiment specifically, the safest default sequence is:
 
 1. stay in the active idea worktree returned by `artifact.submit_idea(...)`
 2. implement and run there
 3. verify that the metric keys still match the active baseline contract
-4. write the human-readable run log and structured result through `artifact.record_main_experiment(...)`
-5. use the returned baseline comparison
+4. write the human-readable run log and structured result through `artifact.record_main_experiment(...)`, including a six-field `evaluation_summary`
+5. use the returned baseline comparison, breakthrough signal, and `evaluation_summary` before deciding whether to continue, launch analysis, or write
 
 ### Startup-contract delivery mode
 
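The six-field contract described in the hunk above can be illustrated with a concrete payload. The sketch below is only an illustration: the field values are invented (apart from `not_comparable`, which the prompt text names), and `check_evaluation_summary` is a hypothetical helper, not something the package is known to provide.

```python
REQUIRED_FIELDS = (
    "takeaway",
    "claim_update",
    "baseline_relation",
    "comparability",
    "failure_mode",
    "next_action",
)


def check_evaluation_summary(summary: dict) -> tuple[list[str], list[str]]:
    """Return (missing_or_blank_fields, unexpected_fields) for a summary dict."""
    missing = [
        key for key in REQUIRED_FIELDS
        if not str(summary.get(key) or "").strip()
    ]
    unexpected = sorted(set(summary) - set(REQUIRED_FIELDS))
    return missing, unexpected


# Illustrative summary for a run whose comparison with the baseline is not valid.
example_summary = {
    "takeaway": "The run finished, but its scores cannot yet be compared with the baseline.",
    "claim_update": "neutral",
    "baseline_relation": "not_comparable",
    "comparability": "The evaluation split differs from the baseline protocol.",
    "failure_mode": "protocol drift",
    "next_action": "Rerun the evaluation on the baseline split.",
}

ok_result = check_evaluation_summary(example_summary)
bad_result = check_evaluation_summary({"takeaway": "Done.", "branch_id": "b-17"})
```

A check of this kind reflects the "exactly these fields" wording: extra keys such as branch ids are flagged rather than silently accepted.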
@@ -1524,6 +1554,7 @@ First ensure one selected outline exists, then bind the campaign to that outline
 
 If durable state exposes `active_baseline_metric_contract_json`, read that JSON file before defining slice success criteria or comparison tables.
 By default, use it as the campaign's baseline comparison contract unless a slice is explicitly designed to test a different evaluation contract and that deviation is recorded durably.
+If a slice needs an extra comparator baseline, reproduce or attach it under the normal `baselines/local/` or `baselines/imported/` quest roots, record that requirement in the campaign slice, and later submit the realized comparator through `record_analysis_slice(..., comparison_baselines=[...])` without replacing the canonical baseline gate unless the quest explicitly promotes it.
 
 Recommended tool discipline:
 
@@ -1668,7 +1699,7 @@ Before finalizing:
 
 - re-check the latest decisions, reports, and package inventory
 - re-check writing review / proofing / submission outputs when a paper bundle exists
-- when a paper bundle exists or should exist, verify `paper/paper_bundle_manifest.json` and its referenced `outline_path`, `draft_path`, `writing_plan_path`, `references_path`, `claim_evidence_map_path`, `compile_report_path`, `pdf_path`, and `
+- when a paper bundle exists or should exist, verify `paper/paper_bundle_manifest.json` and its referenced `outline_path`, `draft_path`, `writing_plan_path`, `references_path`, `claim_evidence_map_path`, `baseline_inventory_path`, `compile_report_path`, `pdf_path`, `latex_root_path`, and any `open_source_manifest_path`
 - classify major claims as supported, partial, unsupported, or deferred
 - preserve important failures and downgrade history instead of hiding them
 
@@ -1762,6 +1793,7 @@ When summarizing long logs, campaigns, or multi-agent work:
 - Any shell-like command execution must go through `bash_exec`; this includes `curl`, `python`, `python3`, `bash`, `sh`, `node`, package managers, and similar CLI tools.
 - Do not execute shell commands through any non-`bash_exec` path.
 - Use `bash_exec(mode='detach', ...)` for long-running work, `bash_exec(mode='await', ...)` for bounded blocking checks, `bash_exec(mode='read', id=...)` to inspect saved logs, `bash_exec(mode='list')` to inspect active and finished sessions, and `bash_exec(mode='kill', id=...)` to stop a managed command.
+- Before using a bounded wait such as `bash_exec(mode='await', ...)`, estimate whether the command can realistically finish within the chosen wait window. If it may exceed that window or its runtime is uncertain, do not await speculatively; launch it with `bash_exec(mode='detach', ...)` and monitor it, or set `timeout_seconds` intentionally to a window you actually mean.
 - For important MCP calls, especially long-running `bash_exec`, include a structured `comment` that briefly states what you are doing, why now, and the next check or next action.
 - For a command that is likely to run for a long time, do not launch it and disappear. After `bash_exec(mode='detach', ...)`, keep monitoring it in the same turn through an explicit wait-and-check loop.
 - The default long-run monitoring cadence is:
@@ -1771,6 +1803,7 @@ When summarizing long logs, campaigns, or multi-agent work:
|
|
|
1771
1803
|
- sleep about `600s`, then inspect again
|
|
1772
1804
|
- sleep about `1800s`, then inspect again
|
|
1773
1805
|
- if the run is still active, continue checking about every `1800s`
|
|
1806
|
+
- If the only blocker is a missing user-supplied external credential that has already been requested through a blocking interaction and no other useful work is possible, you may intentionally park with a much longer low-frequency wait such as `bash_exec(command='sleep 3600', mode='await', timeout_seconds=3700, ...)` to avoid busy-looping.
|
|
1774
1807
|
- If the environment or tool surface makes direct shell waiting awkward, an equivalent bounded wait such as `bash_exec(mode='await', id=..., timeout_seconds=...)` is acceptable, but the behavior must stay the same: wait, inspect real logs, then continue.
|
|
1775
1808
|
- Never stay silent across multiple sleep windows for an important long-running task.
|
|
1776
1809
|
- After each sleep/await cycle finishes and you inspect the real logs again, send `artifact.interact(kind='progress', ...)` with:
|
|
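The bounded-wait rule added in this hunk (await only when the command can realistically finish inside the wait window, otherwise detach and monitor) can be read as a small decision function. The Python sketch below is illustrative only: `choose_bash_exec_mode` and its parameters are hypothetical names that do not appear in the package. Only the `await` and `detach` mode names come from the text.

```python
from typing import Optional


def choose_bash_exec_mode(estimated_seconds: Optional[float],
                          wait_window_seconds: float) -> str:
    """Pick a bash_exec mode name following the bounded-wait rule.

    Hypothetical helper: returns 'await' only when the runtime estimate
    fits inside the chosen wait window, and 'detach' otherwise.
    """
    if estimated_seconds is None:
        # Runtime is uncertain, so do not await speculatively.
        return "detach"
    if estimated_seconds > wait_window_seconds:
        # The command may outlive the window: launch detached and monitor.
        return "detach"
    return "await"
```

A caller would pass `None` whenever it has no credible runtime estimate, which keeps the conservative branch as the default.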
@@ -53,7 +53,7 @@ Do not invent a separate experiment system for those cases.
|
|
|
53
53
|
- If the runtime starts an auto-continue turn with no new user message, resume from the current campaign state and active requirements instead of replaying the previous user turn.
|
|
54
54
|
- Progress message templates are references only. Adapt to the actual context and vary wording so messages feel human, respectful, and non-robotic.
|
|
55
55
|
- Use `reply_mode='blocking'` only for real user decisions that cannot be resolved from local evidence.
|
|
56
|
-
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, wait up to 1 day when feasible,
|
|
56
|
+
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, and wait up to 1 day when feasible. If the blocker is a missing external credential or secret that only the user can provide, keep the quest waiting, ask the user to supply it or choose an alternative, and do not self-resolve; if resumed without that credential and no other work is possible, a long low-frequency wait such as `bash_exec(command='sleep 3600', mode='await', timeout_seconds=3700)` is acceptable. Otherwise choose the best option yourself and notify the user of the chosen option if the timeout expires.
|
|
57
57
|
- If a threaded user reply arrives, interpret it relative to the latest campaign progress update before assuming the task changed completely.
|
|
58
58
|
|
|
59
59
|
## Stage purpose
|
|
@@ -129,6 +129,8 @@ A campaign should usually leave behind:
|
|
|
129
129
|
- a campaign identifier
|
|
130
130
|
- a selected outline reference when the campaign is writing-facing
|
|
131
131
|
- one directory per analysis run
|
|
132
|
+
- any supplementary baseline reproduced for analysis under `baselines/local/<baseline_id>/` or attached under `baselines/imported/<baseline_id>/`
|
|
133
|
+
- one quest-level supplementary baseline inventory at `artifacts/baselines/analysis_inventory.json`
|
|
132
134
|
- one run artifact per analysis slice
|
|
133
135
|
- one outline-bound todo manifest when the campaign is writing-facing
|
|
134
136
|
- an aggregated campaign report
|
|
@@ -252,12 +254,21 @@ For each slice, define at minimum:
|
|
|
252
254
|
- metric or observable
|
|
253
255
|
- stop condition
|
|
254
256
|
- evidence path expectations
|
|
257
|
+
- `required_baselines` when the slice depends on an extra comparator that is not yet available in the quest
|
|
255
258
|
|
|
256
259
|
Recommended extra per-slice fields:
|
|
257
260
|
|
|
258
261
|
- `slice_id`
|
|
259
262
|
- `run_kind`
|
|
260
263
|
- `slice_class`, such as `auxiliary`, `claim-carrying`, or `supporting`
|
|
264
|
+
- `required_baselines`, where each item records at least `baseline_id` plus the reason, benchmark, and split when known
|
|
265
|
+
|
|
266
|
+
If a slice needs an extra comparator baseline:
|
|
267
|
+
|
|
268
|
+
- reproduce it under `baselines/local/<baseline_id>/` unless it is attached under `baselines/imported/<baseline_id>/`
|
|
269
|
+
- keep the usual durable baseline notes there, including `analysis_plan.md`, `setup.md`, `execution.md`, and `verification.md`
|
|
270
|
+
- do not overwrite the canonical quest baseline gate just because an analysis slice needed a supplementary baseline
|
|
271
|
+
- after the comparator is ready, record it back through `record_analysis_slice(..., comparison_baselines=[...])` with its `baseline_id`, path, benchmark/split, and metrics summary
|
|
261
272
|
- `parent_run_id`
|
|
262
273
|
- whether a code diff is required
|
|
263
274
|
- whether an isolated branch/worktree is required
|
|
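As a rough illustration of the `required_baselines` item described in this hunk (at least `baseline_id`, plus the reason, benchmark, and split when known), the sketch below models one item as a small record. The class name `RequiredBaseline` and the method `to_item` are assumptions made for this example. The diff does not show how the package actually represents these items.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass
class RequiredBaseline:
    """Hypothetical shape for one `required_baselines` entry."""
    baseline_id: str
    reason: Optional[str] = None
    benchmark: Optional[str] = None
    split: Optional[str] = None

    def to_item(self) -> Dict[str, str]:
        # baseline_id is the only field the text treats as mandatory.
        if not self.baseline_id:
            raise ValueError("baseline_id is required")
        # Fields that are not yet known are left out rather than guessed.
        return {k: v for k, v in asdict(self).items() if v is not None}
```

Omitting unknown fields, instead of writing placeholders, matches the "when known" wording in the added line.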
@@ -284,6 +295,17 @@ Treat `campaign_id` as system-owned, and treat `slice_id` / `todo_id` as agent-a
|
|
|
284
295
|
Do not replace the normal campaign flow with repeated manual `artifact.prepare_branch(...)` calls.
|
|
285
296
|
After each slice finishes, call `artifact.record_analysis_slice(...)` immediately so the result is mirrored back to the parent branch and the next slice can be activated.
|
|
286
297
|
For slice recording, `deviations` and `evidence_paths` are optional context fields, not mandatory ceremony; include them only when they materially help explanation or auditability.
|
|
298
|
+
Each `artifact.record_analysis_slice(...)` call should also include an `evaluation_summary` with exactly these six fields:
|
|
299
|
+
|
|
300
|
+
- `takeaway`
|
|
301
|
+
- `claim_update`
|
|
302
|
+
- `baseline_relation`
|
|
303
|
+
- `comparability`
|
|
304
|
+
- `failure_mode`
|
|
305
|
+
- `next_action`
|
|
306
|
+
|
|
307
|
+
Use those six fields to keep each slice readable at a glance from Canvas, stage tabs, review, and rebuttal.
|
|
308
|
+
The longer prose still matters, but the six-field summary is the stable routing summary.
|
|
287
309
|
|
|
288
310
|
For writing-facing campaigns, prefer running `claim-carrying` slices before `supporting` slices unless an auxiliary check is required to make the main slice interpretable.
|
|
289
311
|
|
|
@@ -473,6 +495,7 @@ Stage-end requirement:
|
|
|
473
495
|
- if the campaign produced a durable cross-slice lesson, failure pattern, or comparability caveat, write at least one `memory.write(...)` before leaving the stage
|
|
474
496
|
|
|
475
497
|
The campaign’s main record belongs in run artifacts and the aggregated report.
|
|
498
|
+
When synthesizing the campaign, read the per-slice `evaluation_summary` fields first, then expand into longer evidence only where the short summaries are still ambiguous.
|
|
476
499
|
|
|
477
500
|
## Artifact rules
|
|
478
501
|
|
|
@@ -18,7 +18,7 @@ It absorbs the essential old DeepScientist reproducer discipline into one stage
|
|
|
18
18
|
- Default to plain-language summaries. Do not mention file paths, artifact ids, branch/worktree ids, session ids, raw commands, or raw logs unless the user asks or needs them to act.
|
|
19
19
|
- Message templates are references only. Adapt to the actual context and vary wording so updates feel natural and non-robotic.
|
|
20
20
|
- Use `reply_mode='blocking'` only for real user decisions that cannot be resolved from local evidence.
|
|
21
|
-
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, wait up to 1 day when feasible,
|
|
21
|
+
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, and wait up to 1 day when feasible. If the blocker is a missing external credential or secret that only the user can provide, keep the quest waiting, ask the user to supply it or choose an alternative, and do not self-resolve; if resumed without that credential and no other work is possible, a long low-frequency wait such as `bash_exec(command='sleep 3600', mode='await', timeout_seconds=3700)` is acceptable. Otherwise choose the best option yourself and notify the user of the chosen option if the timeout expires.
|
|
22
22
|
- If a threaded user reply arrives, interpret it relative to the latest baseline progress update before assuming the task changed completely.
|
|
23
23
|
- Prefer `bash_exec` for setup, reproduction, and verification commands so each baseline action keeps a durable quest-local session id and log trail.
|
|
24
24
|
- When the baseline route is durably chosen, confirmed, waived, or blocked with a clear next action, send one richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` update that says whether the baseline is trusted, blocked, or waived, why that matters, and what the next stage is.
|
|
@@ -204,6 +204,12 @@ Global reusable registry paths:
|
|
|
204
204
|
Do not invent parallel durable locations when these runtime contracts already exist.
|
|
205
205
|
Do not leave the authoritative metric contract only in chat, memory, or prose once the baseline is accepted.
|
|
206
206
|
|
|
207
|
+
If a baseline is reproduced only because an analysis campaign needs an extra comparator:
|
|
208
|
+
|
|
209
|
+
- still place it under `<quest_root>/baselines/local/<baseline_id>/` or `<quest_root>/baselines/imported/<baseline_id>/`
|
|
210
|
+
- treat it as a supplementary analysis baseline unless the quest explicitly promotes it into the canonical gate
|
|
211
|
+
- do not call `artifact.confirm_baseline(...)` for that supplementary case unless the quest truly intends to replace the canonical baseline
|
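The placement rule above maps a supplementary baseline to one of two directories under the quest root. A minimal Python sketch of that mapping follows. The function name `supplementary_baseline_dir`, the `imported` flag, and the path-separator guard are assumptions for illustration, not code from the package.

```python
from pathlib import Path


def supplementary_baseline_dir(quest_root: str, baseline_id: str,
                               imported: bool = False) -> Path:
    """Return <quest_root>/baselines/{local|imported}/<baseline_id>.

    Hypothetical helper; the guard below is an assumed reading of
    "path-safe" and may be stricter or looser than the real rule.
    """
    if not baseline_id or "/" in baseline_id or "\\" in baseline_id:
        raise ValueError("baseline_id must be a non-empty, path-safe name")
    kind = "imported" if imported else "local"
    return Path(quest_root) / "baselines" / kind / baseline_id
```

The helper only computes a location. It deliberately does nothing resembling `artifact.confirm_baseline(...)`, since the text reserves that call for replacing the canonical baseline.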
|
212
|
+
|
|
207
213
|
## Baseline id and variant rules
|
|
208
214
|
|
|
209
215
|
Baseline identity should be stable and path-safe.
|
|
@@ -19,7 +19,7 @@ Use this skill whenever continuation is non-trivial.
|
|
|
19
19
|
- If the runtime starts an auto-continue turn with no new user message, continue from the active requirements and durable quest state instead of replaying the previous user turn.
|
|
20
20
|
- If `startup_contract.decision_policy = autonomous`, do not emit ordinary `artifact.interact(kind='decision_request', ...)` calls; decide the route yourself, record the reason, and continue.
|
|
21
21
|
- Use `reply_mode='blocking'` for the actual decision request only when the user must choose before safe continuation and the quest contract still allows a user-gated decision.
|
|
22
|
-
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, wait up to 1 day when feasible,
|
|
22
|
+
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, and wait up to 1 day when feasible. If the blocker is a missing external credential or secret that only the user can provide, keep the quest waiting, ask the user to supply it or choose an alternative, and do not self-resolve; if resumed without that credential and no other work is possible, a long low-frequency wait such as `bash_exec(command='sleep 3600', mode='await', timeout_seconds=3700)` is acceptable. Otherwise choose the best option yourself and notify the user of the chosen option if the timeout expires.
|
|
23
23
|
- If a threaded user reply arrives, interpret it relative to the latest decision or progress interaction before assuming the task changed completely.
|
|
24
24
|
- Quest completion is a special terminal decision: first ask for explicit completion approval with `artifact.interact(kind='decision_request', reply_mode='blocking', reply_schema={'decision_type': 'quest_completion_approval'}, ...)`, and only after an explicit approval reply should you call `artifact.complete_quest(...)`.
|
|
25
25
|
|
|
@@ -319,7 +319,7 @@ When asking, use a structured decision request with:
|
|
|
319
319
|
- tradeoffs, including the main pros and cons for each option
|
|
320
320
|
- recommended option first
|
|
321
321
|
- explicit reply format
|
|
322
|
-
- a stated timeout window; normally wait up to 1 day before self-resolving if no user reply arrives
|
|
322
|
+
- a stated timeout window; normally wait up to 1 day before self-resolving if no user reply arrives, except when the only blocker is a missing external credential or secret that only the user can provide
|
|
323
323
|
|
|
324
324
|
### 6. Record the decision durably
|
|
325
325
|
|
|
@@ -327,6 +327,7 @@ Use `artifact.record(kind='decision', ...)` for the final decision.
|
|
|
327
327
|
|
|
328
328
|
If user input is needed, also use `artifact.interact(kind='decision_request', ...)`.
|
|
329
329
|
If the timeout expires without a user reply, choose the best option yourself, record why, and notify the user of the chosen option before moving on.
|
|
330
|
+
This does not apply when the only blocker is a missing external credential or secret that only the user can provide; in that case keep the interaction waiting and, if resumed without the credential, you may park with `bash_exec(command='sleep 3600', mode='await', timeout_seconds=3700)` instead of busy-looping.
|
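The timeout rule and its credential exception can be summarized as one predicate. This is a sketch under stated assumptions: `may_self_resolve` and its parameters are hypothetical names, and the one-day default reflects the "normally wait up to 1 day" wording rather than a value read from the package.

```python
ONE_DAY_SECONDS = 24 * 60 * 60


def may_self_resolve(waited_seconds: float,
                     user_replied: bool,
                     blocked_on_user_credential: bool,
                     timeout_seconds: float = ONE_DAY_SECONDS) -> bool:
    """Decide whether the agent may pick the best option itself.

    Hypothetical helper mirroring the prose rule, not package code.
    """
    if user_replied:
        # A real reply exists, so there is nothing to self-resolve.
        return False
    if blocked_on_user_credential:
        # Only the user can supply the secret: keep waiting, never self-resolve.
        return False
    return waited_seconds >= timeout_seconds
```

When the predicate returns `False` because of the credential case, the text allows a long low-frequency wait rather than a busy loop.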
|
330
331
|
|
|
331
332
|
If `startup_contract.decision_policy = autonomous`, ordinary route ambiguity is not by itself grounds to request user input.
|
|
332
333
|
In that mode, only explicit approval-style exceptions such as quest completion should normally become blocking user decisions.
|
|
@@ -43,7 +43,7 @@ Use this skill for the main evidence-producing runs of the quest.
|
|
|
43
43
|
- If the runtime starts an auto-continue turn with no new user message, continue from the current run state, logs, artifacts, and active requirements instead of replaying the previous user turn.
|
|
44
44
|
- Progress message templates are references only. Adapt to the actual context and vary wording so messages feel human, respectful, and non-robotic.
|
|
45
45
|
- Use `reply_mode='blocking'` only for real user decisions that cannot be resolved from local evidence.
|
|
46
|
-
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, wait up to 1 day when feasible,
|
|
46
|
+
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, and wait up to 1 day when feasible. If the blocker is a missing external credential or secret that only the user can provide, keep the quest waiting, ask the user to supply it or choose an alternative, and do not self-resolve; if resumed without that credential and no other work is possible, a long low-frequency wait such as `bash_exec(command='sleep 3600', mode='await', timeout_seconds=3700)` is acceptable. Otherwise choose the best option yourself and notify the user of the chosen option if the timeout expires.
|
|
47
47
|
- If a threaded user reply arrives, interpret it relative to the latest experiment progress update before assuming the task changed completely.
|
|
48
48
|
- Prefer `bash_exec` for experiment commands so each run gets a durable session id, quest-local log folder, and later `read/list/kill` control.
|
|
49
49
|
|
|
@@ -466,6 +466,22 @@ That call is responsible for writing:
|
|
|
466
466
|
- evidence paths
|
|
467
467
|
- changed files
|
|
468
468
|
- relevant config paths when applicable
|
|
469
|
+
- `evaluation_summary` with exactly these six fields:
|
|
470
|
+
- `takeaway`
|
|
471
|
+
- `claim_update`
|
|
472
|
+
- `baseline_relation`
|
|
473
|
+
- `comparability`
|
|
474
|
+
- `failure_mode`
|
|
475
|
+
- `next_action`
|
|
476
|
+
|
|
477
|
+
Use `evaluation_summary` as the short structured judgment layer on top of the longer narrative fields:
|
|
478
|
+
|
|
479
|
+
- `takeaway`: one sentence the next reader can reuse directly
|
|
480
|
+
- `claim_update`: `strengthens`, `weakens`, `narrows`, or `neutral`
|
|
481
|
+
- `baseline_relation`: `better`, `worse`, `mixed`, or `not_comparable`
|
|
482
|
+
- `comparability`: `high`, `medium`, or `low`
|
|
483
|
+
- `failure_mode`: `none`, `implementation`, `evaluation`, `environment`, or `direction`
|
|
484
|
+
- `next_action`: the immediate route such as `continue`, `revise_idea`, `analysis_campaign`, `write`, or `stop`
|
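The six fields and their listed values lend themselves to a simple structural check. The sketch below is one possible validator, assuming the summary is passed as a plain dictionary. The function name `check_evaluation_summary` is hypothetical, and `next_action` is treated as free text because the list above introduces its values with "such as".

```python
from typing import Dict, List

FIELDS = ("takeaway", "claim_update", "baseline_relation",
          "comparability", "failure_mode", "next_action")

# Closed value sets copied from the field descriptions above.
ALLOWED = {
    "claim_update": {"strengthens", "weakens", "narrows", "neutral"},
    "baseline_relation": {"better", "worse", "mixed", "not_comparable"},
    "comparability": {"high", "medium", "low"},
    "failure_mode": {"none", "implementation", "evaluation",
                     "environment", "direction"},
}


def check_evaluation_summary(summary: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the summary passes."""
    problems = []
    for field in FIELDS:
        if field not in summary:
            problems.append(f"missing field: {field}")
    for key in sorted(summary):
        if key not in FIELDS:
            problems.append(f"unexpected field: {key}")
    for field, allowed in ALLOWED.items():
        if field in summary and summary[field] not in allowed:
            problems.append(f"invalid value for {field}: {summary[field]!r}")
    for field in ("takeaway", "next_action"):
        value = summary.get(field)
        if field in summary and (not isinstance(value, str) or not value.strip()):
            problems.append(f"{field} must be a non-empty string")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every defect in a single pass.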
|
469
485
|
|
|
470
486
|
After `artifact.record_main_experiment(...)` succeeds, do not assume the same branch should absorb the next round by default.
|
|
471
487
|
Interpret the measured result first, then either:
|
|
@@ -17,7 +17,7 @@ Use this skill to close or pause a quest responsibly.
|
|
|
17
17
|
- Default to plain-language summaries. Do not mention file paths, artifact ids, branch/worktree ids, session ids, raw commands, or raw logs unless the user asks or needs them to act.
|
|
18
18
|
- If the runtime starts an auto-continue turn with no new user message, keep finalizing from the durable quest state and active requirements instead of replaying the previous user turn.
|
|
19
19
|
- Use `reply_mode='blocking'` only for real user decisions that cannot be resolved from local evidence.
|
|
20
|
-
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, wait up to 1 day when feasible,
|
|
20
|
+
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, and wait up to 1 day when feasible. If the blocker is a missing external credential or secret that only the user can provide, keep the quest waiting, ask the user to supply it or choose an alternative, and do not self-resolve; if resumed without that credential and no other work is possible, a long low-frequency wait such as `bash_exec(command='sleep 3600', mode='await', timeout_seconds=3700)` is acceptable. Otherwise choose the best option yourself and notify the user of the chosen option if the timeout expires.
|
|
21
21
|
- If a threaded user reply arrives, interpret it relative to the latest finalize progress update before assuming the task changed completely.
|
|
22
22
|
- When finalize reaches a real closure state, pause-ready packet, or route-back decision, send one threaded `artifact.interact(kind='milestone', ...)` update that names the recommendation, why it is the right call, and any reopen condition that still matters.
|
|
23
23
|
- True quest completion still requires explicit user approval through the runtime completion flow before calling `artifact.complete_quest(...)`.
|
|
@@ -124,9 +124,12 @@ When a paper bundle exists, verify the manifest inventory explicitly, including:
|
|
|
124
124
|
- referenced `writing_plan_path`
|
|
125
125
|
- referenced `references_path`
|
|
126
126
|
- referenced `claim_evidence_map_path`
|
|
127
|
+
- referenced `baseline_inventory_path`
|
|
127
128
|
- referenced `compile_report_path`
|
|
128
129
|
- referenced `pdf_path`
|
|
129
130
|
- referenced `latex_root_path`
|
|
131
|
+
- `release/open_source/manifest.json` when open-source preparation has started
|
|
132
|
+
- `release/open_source/cleanup_plan.md` when the paper line is being prepared for a public code release
|
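The manifest inventory check can be sketched as a loop over the referenced path keys. The example below covers only the keys visible in this hunk and assumes the manifest is a flat dictionary whose values are paths relative to some root directory. That layout, along with the names `REFERENCED_PATH_KEYS` and `missing_manifest_paths`, is an assumption for illustration.

```python
from pathlib import Path
from typing import Dict, List

# Only the keys shown in this hunk; the full list may be longer.
REFERENCED_PATH_KEYS = (
    "writing_plan_path",
    "references_path",
    "claim_evidence_map_path",
    "baseline_inventory_path",
    "compile_report_path",
    "pdf_path",
    "latex_root_path",
)


def missing_manifest_paths(manifest: Dict[str, str], root: Path) -> List[str]:
    """Return the keys that are absent or point at a path that does not exist."""
    missing = []
    for key in REFERENCED_PATH_KEYS:
        value = manifest.get(key)
        if not value or not (root / value).exists():
            missing.append(key)
    return missing
```

The two `release/open_source/` files are conditional in the text, so they are left out of the unconditional key list here.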
|
130
133
|
|
|
131
134
|
### 2. Build the final claim ledger
|
|
132
135
|
|
package/src/skills/idea/SKILL.md
CHANGED
|
@@ -21,7 +21,7 @@ Use this skill to turn the current baseline and problem frame into concrete, lit
|
|
|
21
21
|
- If the runtime starts an auto-continue turn with no new user message, keep advancing from the active requirements and current durable state instead of re-answering the previous user turn.
|
|
22
22
|
- Message templates are references only. Adapt to the actual context and vary wording so updates feel natural and non-robotic.
|
|
23
23
|
- Use `reply_mode='blocking'` only for real user decisions that cannot be resolved from local evidence.
|
|
24
|
-
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, wait up to 1 day when feasible,
|
|
24
|
+
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, and wait up to 1 day when feasible. If the blocker is a missing external credential or secret that only the user can provide, keep the quest waiting, ask the user to supply it or choose an alternative, and do not self-resolve; if resumed without that credential and no other work is possible, a long low-frequency wait such as `bash_exec(command='sleep 3600', mode='await', timeout_seconds=3700)` is acceptable. Otherwise choose the best option yourself and notify the user of the chosen option if the timeout expires.
|
|
25
25
|
- If a threaded user reply arrives, interpret it relative to the latest idea progress update before assuming the task changed completely.
|
|
26
26
|
|
|
27
27
|
## Stage purpose
|
|
@@ -17,7 +17,7 @@ Use this skill when the quest already has meaningful state and the first job is
|
|
|
17
17
|
- Default to plain-language summaries. Do not mention file paths, artifact ids, branch/worktree ids, session ids, raw commands, or raw logs unless the user asks or needs them to act.
|
|
18
18
|
- Message templates are references only. Adapt to the actual context and vary wording so updates feel natural and non-robotic.
|
|
19
19
|
- Use `reply_mode='blocking'` only for real user decisions that cannot be resolved from local evidence.
|
|
20
|
-
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, wait up to 1 day when feasible,
|
|
20
|
+
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, and wait up to 1 day when feasible. If the blocker is a missing external credential or secret that only the user can provide, keep the quest waiting, ask the user to supply it or choose an alternative, and do not self-resolve; if resumed without that credential and no other work is possible, a long low-frequency wait such as `bash_exec(command='sleep 3600', mode='await', timeout_seconds=3700)` is acceptable. Otherwise choose the best option yourself and notify the user of the chosen option if the timeout expires.
|
|
21
21
|
- If a threaded user reply arrives, interpret it relative to the latest intake-audit progress update before assuming the task changed completely.
|
|
22
22
|
- When the audit reaches a durable route recommendation, send one richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` update that says what state is trusted, what still needs work, and which anchor should run next.
|
|
23
23
|
|
|
@@ -21,7 +21,7 @@ The task is “respond to concrete reviewer pressure with the smallest honest se
|
|
|
21
21
|
- Default to plain-language summaries. Do not mention file paths, artifact ids, branch/worktree ids, session ids, raw commands, or raw logs unless the user asks or needs them to act.
|
|
22
22
|
- Message templates are references only. Adapt to the actual context and vary wording so updates feel natural and non-robotic.
|
|
23
23
|
- Use `reply_mode='blocking'` only for real user decisions that cannot be resolved from local evidence.
|
|
24
|
-
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, wait up to 1 day when feasible,
|
|
24
|
+
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, and wait up to 1 day when feasible. If the blocker is a missing external credential or secret that only the user can provide, keep the quest waiting, ask the user to supply it or choose an alternative, and do not self-resolve; if resumed without that credential and no other work is possible, a long low-frequency wait such as `bash_exec(command='sleep 3600', mode='await', timeout_seconds=3700)` is acceptable. Otherwise choose the best option yourself and notify the user of the chosen option if the timeout expires.
|
|
25
25
|
- If a threaded user reply arrives, interpret it relative to the latest rebuttal progress update before assuming the task changed completely.
|
|
26
26
|
- When the rebuttal plan, the main supplementary-evidence package, or the final response bundle becomes durable, send one richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` update that says what reviewer concerns are now addressed, what still remains open, and what happens next.
|
|
27
27
|
|
|
@@ -87,11 +87,13 @@ Use, in roughly this order:
|
|
|
87
87
|
- the current paper or draft
|
|
88
88
|
- the selected outline if one exists
|
|
89
89
|
- review comments, meta-review, or editor letter
|
|
90
|
+
- the six-field `evaluation_summary` blocks from recent main experiments and analysis slices
|
|
90
91
|
- recent main and analysis experiment results
|
|
91
92
|
- prior decision and writing memory
|
|
92
93
|
- existing figures, tables, and claim-evidence maps
|
|
93
94
|
|
|
94
95
|
If the current paper/result state is still unclear, open `intake-audit` first before continuing the rebuttal workflow.
|
|
96
|
+
Before launching any new supplementary experiment, read those structured `evaluation_summary` blocks first so the rebuttal plan starts from the already-recorded evidence state rather than from raw narrative memory.
|
|
95
97
|
|
|
96
98
|
## Core outputs
|
|
97
99
|
|
|
@@ -23,7 +23,7 @@ It is also not the same as `rebuttal`.
|
|
|
23
23
|
- Keep progress updates chat-like and easy to understand: say what changed, what it means, and what happens next.
|
|
24
24
|
- Default to plain-language summaries. Do not mention file paths, artifact ids, branch/worktree ids, session ids, raw commands, or raw logs unless the user asks or needs them to act.
|
|
25
25
|
- Use `reply_mode='blocking'` only for real user decisions that cannot be resolved from local evidence.
|
|
26
|
-
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, wait up to 1 day when feasible,
|
|
26
|
+
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, and wait up to 1 day when feasible. If the blocker is a missing external credential or secret that only the user can provide, keep the quest waiting, ask the user to supply it or choose an alternative, and do not self-resolve; if resumed without that credential and no other work is possible, a long low-frequency wait such as `bash_exec(command='sleep 3600', mode='await', timeout_seconds=3700)` is acceptable. Otherwise choose the best option yourself and notify the user of the chosen option if the timeout expires.
|
|
27
27
|
- When the review report, revision plan, or follow-up experiment TODO list becomes durable, send a richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` update that says what the main risks are, what should be fixed next, and whether the next route is writing, experiment, or claim downgrade.
|
|
28
28
|
|
|
29
29
|
## Purpose
|
|
@@ -77,12 +77,14 @@ Use, in roughly this order:
|
|
|
77
77
|
- the current paper or report draft
|
|
78
78
|
- the selected outline if one exists
|
|
79
79
|
- the claim-evidence map if one exists
|
|
80
|
+
- the six-field `evaluation_summary` blocks from recent main experiments and analysis slices
|
|
80
81
|
- recent main and analysis experiment results
|
|
81
82
|
- figures, tables, and captions
|
|
82
83
|
- prior self-review or reviewer-first notes as low-trust auxiliary input
|
|
83
84
|
- nearby papers when novelty or comparison is unclear
|
|
84
85
|
|
|
85
86
|
If the draft/result state is still unclear, open `intake-audit` first before continuing the review workflow.
|
|
87
|
+
Before proposing extra experiments, read those structured `evaluation_summary` blocks first so you do not request work that the recorded evidence already resolved.
|
|
86
88
|
|
|
87
89
|
## Core outputs
|
|
88
90
|
|
|
@@ -17,7 +17,7 @@ Use this skill when the quest does not yet have a stable research frame.
|
|
|
17
17
|
- Default to plain-language summaries. Do not mention file paths, artifact ids, branch/worktree ids, session ids, raw commands, or raw logs unless the user asks or needs them to act.
|
|
18
18
|
- Message templates are references only. Adapt to the actual context and vary wording so updates feel natural and non-robotic.
|
|
19
19
|
- Use `reply_mode='blocking'` only for real user decisions that cannot be resolved from local evidence.
|
|
20
|
-
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, wait up to 1 day when feasible,
|
|
20
|
+
- For any blocking decision request, provide 1 to 3 concrete options, put the recommended option first, explain each option's actual content plus pros and cons, and wait up to 1 day when feasible. If the blocker is a missing external credential or secret that only the user can provide, keep the quest waiting, ask the user to supply it or choose an alternative, and do not self-resolve; if resumed without that credential and no other work is possible, a long low-frequency wait such as `bash_exec(command='sleep 3600', mode='await', timeout_seconds=3700)` is acceptable. Otherwise choose the best option yourself and notify the user of the chosen option if the timeout expires.
|
|
21
21
|
- If a threaded user reply arrives, interpret it relative to the latest scout progress update before assuming the task changed completely.
|
|
22
22
|
- When scouting actually resolves the framing ambiguity, locks the evaluation contract, or makes the next anchor obvious, send one richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` update that says what is now clear, why it matters, and which stage should come next.
|
|
23
23
|
|