@researai/deepscientist 1.5.11 → 1.5.13
- package/README.md +8 -8
- package/bin/ds.js +375 -61
- package/docs/en/00_QUICK_START.md +55 -4
- package/docs/en/01_SETTINGS_REFERENCE.md +15 -0
- package/docs/en/02_START_RESEARCH_GUIDE.md +68 -4
- package/docs/en/09_DOCTOR.md +48 -4
- package/docs/en/12_GUIDED_WORKFLOW_TOUR.md +21 -2
- package/docs/en/15_CODEX_PROVIDER_SETUP.md +382 -0
- package/docs/en/README.md +4 -0
- package/docs/zh/00_QUICK_START.md +54 -3
- package/docs/zh/01_SETTINGS_REFERENCE.md +15 -0
- package/docs/zh/02_START_RESEARCH_GUIDE.md +69 -3
- package/docs/zh/09_DOCTOR.md +48 -2
- package/docs/zh/12_GUIDED_WORKFLOW_TOUR.md +21 -2
- package/docs/zh/15_CODEX_PROVIDER_SETUP.md +383 -0
- package/docs/zh/README.md +4 -1
- package/package.json +2 -1
- package/pyproject.toml +1 -1
- package/src/deepscientist/__init__.py +1 -1
- package/src/deepscientist/bash_exec/monitor.py +7 -5
- package/src/deepscientist/bash_exec/service.py +84 -21
- package/src/deepscientist/channels/local.py +3 -3
- package/src/deepscientist/channels/qq.py +7 -7
- package/src/deepscientist/channels/relay.py +7 -7
- package/src/deepscientist/channels/weixin_ilink.py +90 -19
- package/src/deepscientist/cli.py +3 -0
- package/src/deepscientist/codex_cli_compat.py +117 -0
- package/src/deepscientist/config/models.py +1 -0
- package/src/deepscientist/config/service.py +173 -25
- package/src/deepscientist/daemon/app.py +314 -6
- package/src/deepscientist/doctor.py +1 -5
- package/src/deepscientist/mcp/server.py +124 -3
- package/src/deepscientist/prompts/builder.py +113 -11
- package/src/deepscientist/quest/service.py +247 -31
- package/src/deepscientist/runners/codex.py +132 -24
- package/src/deepscientist/runners/runtime_overrides.py +9 -0
- package/src/deepscientist/shared.py +33 -14
- package/src/prompts/connectors/qq.md +2 -1
- package/src/prompts/connectors/weixin.md +2 -1
- package/src/prompts/contracts/shared_interaction.md +4 -1
- package/src/prompts/system.md +59 -9
- package/src/skills/analysis-campaign/SKILL.md +46 -6
- package/src/skills/analysis-campaign/references/campaign-plan-template.md +21 -8
- package/src/skills/baseline/SKILL.md +1 -1
- package/src/skills/baseline/references/artifact-payload-examples.md +39 -0
- package/src/skills/decision/SKILL.md +1 -1
- package/src/skills/experiment/SKILL.md +1 -1
- package/src/skills/finalize/SKILL.md +1 -1
- package/src/skills/idea/SKILL.md +1 -1
- package/src/skills/intake-audit/SKILL.md +1 -1
- package/src/skills/rebuttal/SKILL.md +74 -1
- package/src/skills/rebuttal/references/response-letter-template.md +55 -11
- package/src/skills/review/SKILL.md +118 -1
- package/src/skills/review/references/experiment-todo-template.md +23 -0
- package/src/skills/review/references/review-report-template.md +16 -0
- package/src/skills/review/references/revision-log-template.md +4 -0
- package/src/skills/scout/SKILL.md +1 -1
- package/src/skills/write/SKILL.md +168 -7
- package/src/skills/write/references/paper-experiment-matrix-template.md +131 -0
- package/src/tui/dist/lib/connectorConfig.js +90 -0
- package/src/tui/dist/lib/qr.js +21 -0
- package/src/tui/package.json +2 -1
- package/src/ui/dist/assets/{AiManusChatView-D0mTXG4-.js → AiManusChatView-CnJcXynW.js} +12 -12
- package/src/ui/dist/assets/{AnalysisPlugin-Db0cTXxm.js → AnalysisPlugin-DeyzPEhV.js} +1 -1
- package/src/ui/dist/assets/{CliPlugin-DrV8je02.js → CliPlugin-CB1YODQn.js} +9 -9
- package/src/ui/dist/assets/{CodeEditorPlugin-QXMSCH71.js → CodeEditorPlugin-B-xicq1e.js} +8 -8
- package/src/ui/dist/assets/{CodeViewerPlugin-7hhtWj_E.js → CodeViewerPlugin-DT54ysXa.js} +5 -5
- package/src/ui/dist/assets/{DocViewerPlugin-BWMSnRJe.js → DocViewerPlugin-DQtKT-VD.js} +3 -3
- package/src/ui/dist/assets/{GitDiffViewerPlugin-7J9h9Vy_.js → GitDiffViewerPlugin-hqHbCfnv.js} +20 -20
- package/src/ui/dist/assets/{ImageViewerPlugin-CHJl_0lr.js → ImageViewerPlugin-OcVo33jV.js} +5 -5
- package/src/ui/dist/assets/{LabCopilotPanel-1qSow1es.js → LabCopilotPanel-DdGwhEUV.js} +11 -11
- package/src/ui/dist/assets/{LabPlugin-eQpPPCEp.js → LabPlugin-Ciz1gDaX.js} +2 -2
- package/src/ui/dist/assets/{LatexPlugin-BwRfi89Z.js → LatexPlugin-BhmjNQRC.js} +37 -11
- package/src/ui/dist/assets/{MarkdownViewerPlugin-836PVQWV.js → MarkdownViewerPlugin-BzdVH9Bx.js} +4 -4
- package/src/ui/dist/assets/{MarketplacePlugin-C2y_556i.js → MarketplacePlugin-DmyHspXt.js} +3 -3
- package/src/ui/dist/assets/{NotebookEditor-DIX7Mlzu.js → NotebookEditor-BMXKrDRk.js} +1 -1
- package/src/ui/dist/assets/{NotebookEditor-BRzJbGsn.js → NotebookEditor-BTVYRGkm.js} +11 -11
- package/src/ui/dist/assets/{PdfLoader-DzRaTAlq.js → PdfLoader-CvcjJHXv.js} +1 -1
- package/src/ui/dist/assets/{PdfMarkdownPlugin-DZUfIUnp.js → PdfMarkdownPlugin-DW2ej8Vk.js} +2 -2
- package/src/ui/dist/assets/{PdfViewerPlugin-BwtICzue.js → PdfViewerPlugin-CmlDxbhU.js} +10 -10
- package/src/ui/dist/assets/{SearchPlugin-DHeIAMsx.js → SearchPlugin-DAjQZPSv.js} +1 -1
- package/src/ui/dist/assets/{TextViewerPlugin-C3tCmFox.js → TextViewerPlugin-C-nVAZb_.js} +5 -5
- package/src/ui/dist/assets/{VNCViewer-CQsKVm3t.js → VNCViewer-D7-dIYon.js} +10 -10
- package/src/ui/dist/assets/{bot-BEA2vWuK.js → bot-C_G4WtNI.js} +1 -1
- package/src/ui/dist/assets/{code-XfbSR8K2.js → code-Cd7WfiWq.js} +1 -1
- package/src/ui/dist/assets/{file-content-BjxNaIfy.js → file-content-B57zsL9y.js} +1 -1
- package/src/ui/dist/assets/{file-diff-panel-D_lLVQk0.js → file-diff-panel-DVoheLFq.js} +1 -1
- package/src/ui/dist/assets/{file-socket-D9x_5vlY.js → file-socket-B5kXFxZP.js} +1 -1
- package/src/ui/dist/assets/{image-BhWT33W1.js → image-LLOjkMHF.js} +1 -1
- package/src/ui/dist/assets/{index-Dqj-Mjb4.css → index-BQG-1s2o.css} +40 -2
- package/src/ui/dist/assets/{index--c4iXtuy.js → index-C3r2iGrp.js} +12 -12
- package/src/ui/dist/assets/{index-DZTZ8mWP.js → index-CLQauncb.js} +911 -120
- package/src/ui/dist/assets/{index-PJbSbPTy.js → index-Dxa2eYMY.js} +1 -1
- package/src/ui/dist/assets/{index-BDxipwrC.js → index-hOUOWbW2.js} +2 -2
- package/src/ui/dist/assets/{monaco-K8izTGgo.js → monaco-BGGAEii3.js} +1 -1
- package/src/ui/dist/assets/{pdf-effect-queue-DfBors6y.js → pdf-effect-queue-DlEr1_y5.js} +1 -1
- package/src/ui/dist/assets/{popover-yFK1J4fL.js → popover-CWJbJuYY.js} +1 -1
- package/src/ui/dist/assets/{project-sync-PENr2zcz.js → project-sync-CRJiucYO.js} +18 -4
- package/src/ui/dist/assets/{select-CAbJDfYv.js → select-CoHB7pvH.js} +2 -2
- package/src/ui/dist/assets/{sigma-DEuYJqTl.js → sigma-D5aJWR8J.js} +1 -1
- package/src/ui/dist/assets/{square-check-big-omoSUmcd.js → square-check-big-DUK_mnkS.js} +1 -1
- package/src/ui/dist/assets/{trash--F119N47.js → trash-ChU3SEE3.js} +1 -1
- package/src/ui/dist/assets/{useCliAccess-D31UR23I.js → useCliAccess-BrJBV3tY.js} +1 -1
- package/src/ui/dist/assets/{useFileDiffOverlay-BH6KcMzq.js → useFileDiffOverlay-C2OQaVWc.js} +1 -1
- package/src/ui/dist/assets/{wrap-text-CZ613PM5.js → wrap-text-C7Qqh-om.js} +1 -1
- package/src/ui/dist/assets/{zoom-out-BgDLAv3z.js → zoom-out-rtX0FKya.js} +1 -1
- package/src/ui/dist/index.html +2 -2
@@ -17,7 +17,7 @@ It is also not the same as `rebuttal`.
 ## Interaction discipline

 - Follow the shared interaction contract injected by the system prompt.
-- For ordinary active work, prefer a concise progress update once work has crossed roughly
+- For ordinary active work, prefer a concise progress update once work has crossed roughly 6 tool calls with a human-meaningful delta, and do not drift beyond roughly 12 tool calls or about 8 minutes without a user-visible update.
 - When the review report, revision plan, or follow-up experiment TODO list becomes durable, send a richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` update that says what the main risks are, what should be fixed next, and whether the next route is writing, experiment, or claim downgrade.

 ## Purpose
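The update cadence added in the hunk above (a soft trigger at roughly 6 tool calls with a meaningful delta, a ceiling at roughly 12 tool calls or about 8 minutes) could be checked with something like the sketch below. This is only an illustration: the `WorkSinceLastUpdate` type, its field names, and the function name are hypothetical and do not come from the package, and the thresholds are treated as hard numbers even though the prompt text calls them rough.

```python
from dataclasses import dataclass

# Thresholds copied from the guidance text; the source calls them "roughly".
PREFERRED_TOOL_CALLS = 6
MAX_TOOL_CALLS = 12
MAX_SILENT_MINUTES = 8.0


@dataclass
class WorkSinceLastUpdate:
    """Hypothetical counters accumulated since the last user-visible update."""
    tool_calls: int
    minutes_elapsed: float
    has_meaningful_delta: bool


def should_send_progress_update(state: WorkSinceLastUpdate) -> bool:
    # Ceiling: do not drift past ~12 tool calls or ~8 minutes in silence.
    if state.tool_calls >= MAX_TOOL_CALLS or state.minutes_elapsed >= MAX_SILENT_MINUTES:
        return True
    # Soft trigger: ~6 tool calls, but only if there is something worth reporting.
    return state.tool_calls >= PREFERRED_TOOL_CALLS and state.has_meaningful_delta
```

Keeping the ceiling separate from the soft trigger means a long stretch of work with no meaningful delta still produces an update once the ceiling is reached.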
@@ -63,6 +63,16 @@ Do not treat “looks polished” as “is defensible”.
 - Do not recommend rhetoric when the real problem is missing evidence.
 - If novelty or positioning is uncertain, treat that as a literature-audit question first, not an automatic experiment request.
 - If a claim is too broad for the evidence, prefer narrowing or downgrading the claim over defending it with style.
+- If `startup_contract.review_followup_policy` is present, honor it:
+  - `audit_only`
+    - stop after durable review artifacts and a clear route recommendation
+  - `auto_execute_followups`
+    - do not stop at the audit if the next route is already clear; continue into the required experiments and manuscript deltas
+  - `user_gated_followups`
+    - finish the audit first, then package the next expensive follow-up step into one structured decision
+- If `startup_contract.manuscript_edit_mode = latex_required`, treat the provided LaTeX tree or `paper/latex/` as the writing surface when manuscript revision is needed.
+- If LaTeX source is unavailable while `latex_required` is requested, do not pretend the manuscript was edited; produce LaTeX-ready replacement text and an explicit blocker note instead.
+- Accept manuscript and review inputs from URLs, local file paths, local directories, or current-turn attachments; do not assume the draft is already perfectly normalized.

 ## Primary inputs

@@ -74,11 +84,13 @@ Use, in roughly this order:
 - the six-field `evaluation_summary` blocks from recent main experiments and analysis slices
 - recent main and analysis experiment results
 - figures, tables, and captions
+- current-turn attachments and user-provided local paths / directories / URLs for the manuscript bundle or review packet
 - prior self-review or reviewer-first notes as low-trust auxiliary input
 - nearby papers when novelty or comparison is unclear

 If the draft/result state is still unclear, open `intake-audit` first before continuing the review workflow.
 Before proposing extra experiments, read those structured `evaluation_summary` blocks first so you do not request work that the recorded evidence already resolved.
+If the user provided draft files or manuscript bundles directly, first normalize them into durable quest-visible paths before planning experiments or section-level revisions.

 ## Core outputs

@@ -87,6 +99,8 @@ The review pass should usually leave behind:
 - `paper/review/review.md`
 - `paper/review/revision_log.md`
 - `paper/review/experiment_todo.md`
+- `paper/paper_experiment_matrix.md` when more evidence is still needed
+- `paper/paper_experiment_matrix.json` when more evidence is still needed

 Use the templates in `references/` when needed:

@@ -175,14 +189,25 @@ For each serious issue, record:
 - what should change
 - whether the fix is writing-only, evidence-only, or experiment-dependent
 - whether the issue blocks `finalize`
+- one copy-ready replacement sentence / paragraph when feasible
+- one LaTeX-ready replacement block when `startup_contract.manuscript_edit_mode = latex_required`

 ### 5. Produce the follow-up experiment TODO list

 Only if more evidence is truly needed, write `paper/review/experiment_todo.md` using `references/experiment-todo-template.md`.

+When the paper still lacks experimental support, also create or revise:
+
+- `paper/paper_experiment_matrix.md`
+- `paper/paper_experiment_matrix.json`
+
+Treat the matrix as the paper-facing master plan and `paper/review/experiment_todo.md` as only the current execution frontier or review-facing subset.
+
 Each TODO item should include:

 - the review issue it answers
+- the matrix exp id
+- the corresponding `exp_id` in the paper experiment matrix
 - why existing evidence is still insufficient
 - the minimum experiment or analysis needed
 - required metric(s)
@@ -195,6 +220,50 @@ Each TODO item should include:

 Do not write a vague “run more ablations” list.
 Each TODO item should be concrete enough to turn into `analysis-campaign` slices or a `baseline` recovery task.
+The matrix should be broader than the TODO list and should classify the full paper-facing experiment space, not just analysis work.
+When building or revising that matrix, explicitly consider:
+
+- main comparison packaging or extension
+- component ablations
+- sensitivity / hyperparameter checks
+- robustness checks
+- efficiency / cost / latency / token-overhead checks when relevant
+- highlight-validation experiments that test the likely strengths of the method
+- limitation-boundary analyses
+- case study rows as optional rather than mandatory evidence
+
+Do not assume the paper only needs “analysis experiments”.
+Do not assume case studies belong in the required set.
+If efficiency or cost could become a reviewer-facing strength or concern, put that into the matrix explicitly.
+
+For the matrix, each row should usually record:
+
+- `exp_id`
+- `tier`
+- `experiment_type`
+- `status`
+- `feasibility_now`
+- `claim_ids`
+- `highlight_ids`
+- `research_question`
+- `hypothesis`
+- `comparators`
+- `metrics`
+- `minimal_success_criterion`
+- `paper_placement`
+- `promotion_rule`
+- `next_action`
+
+The matrix should also keep a short `highlight hypotheses` block.
+Do not rely on prose intuition for the method's best selling point; if a likely highlight matters, it should have a corresponding validation row in the matrix.
+
+Before treating the experiments section as stable, require that every currently feasible matrix row that is not merely `optional` or `dropped` is either:
+
+- completed
+- analyzed
+- excluded with a real reason
+- or blocked with a real reason
+
 When extra evidence is truly needed, use the shared supplementary-experiment protocol:

 - recover ids / refs first if needed
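The stability rule in the hunk above (every currently feasible row that is not `optional` or `dropped` must be completed, analyzed, excluded with a reason, or blocked with a reason) can be read as a simple filter over matrix rows. The sketch below is one possible reading, not the package's implementation. It assumes `feasibility_now` is a boolean and adds a hypothetical `reason` field for the "real reason", neither of which the source specifies. It mirrors only the four outcomes the rule lists.

```python
from dataclasses import dataclass

# Tiers exempt from the gate and statuses that count as addressed, per the rule.
EXEMPT_TIERS = {"optional", "dropped"}
ADDRESSED_STATUSES = {"completed", "analyzed", "excluded", "blocked"}


@dataclass
class MatrixRow:
    exp_id: str
    tier: str
    status: str
    feasibility_now: bool  # assumed boolean; the source does not fix a type
    reason: str = ""       # hypothetical field holding the "real reason"


def open_rows(rows):
    """Return rows that still keep the experiments section from being stable."""
    blocking = []
    for row in rows:
        if not row.feasibility_now or row.tier in EXEMPT_TIERS:
            continue
        if row.status not in ADDRESSED_STATUSES:
            blocking.append(row)
        elif row.status in {"excluded", "blocked"} and not row.reason.strip():
            # Excluded or blocked rows only count when a reason is recorded.
            blocking.append(row)
    return blocking
```

An empty result from `open_rows` would correspond to the condition under which the experiments section may be treated as stable.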
@@ -216,6 +285,54 @@ After the review artifacts are durable:

 Do not stop immediately after writing the review if the next route is already clear.

+### 7. Auto follow-up execution contract
+
+When `startup_contract.review_followup_policy = auto_execute_followups`:
+
+- treat the review as a gate, not as the endpoint
+- immediately turn the accepted follow-up route into action:
+  - `analysis-campaign`
+    - when new evidence is truly required
+  - `baseline`
+    - when a missing comparator baseline blocks fair review
+  - `write`
+    - when the issues are mostly text, outline, claim-scope, figure, or framing revisions
+- after each completed follow-up step, update:
+  - `paper/review/revision_log.md`
+  - `paper/review/experiment_todo.md`
+  - the draft or manuscript-facing revision package
+- only treat the review line as truly closed after the follow-up route has either completed or been downgraded / blocked explicitly
+
+When `startup_contract.review_followup_policy = user_gated_followups`:
+
+- stop after the durable audit artifacts
+- turn the next expensive follow-up package into one structured decision instead of continuing silently
+
+When `startup_contract.review_followup_policy = audit_only`:
+
+- stop after the durable audit artifacts and route recommendation
+
+### 8. Manuscript revision delivery contract
+
+If manuscript revision is required, make the delta explicit:
+
+- section
+- old claim / weakness
+- new wording
+- evidence basis
+- remaining limitation
+
+If `startup_contract.manuscript_edit_mode = copy_ready_text`:
+
+- provide copy-ready replacement wording in `paper/review/revision_log.md` or a nearby revision note
+- keep the wording directly usable by the user or downstream `write`
+
+If `startup_contract.manuscript_edit_mode = latex_required`:
+
+- prefer editing the actual LaTeX sources when they are available
+- otherwise provide LaTeX-ready replacement text blocks with explicit insertion targets
+- preserve labels, citations, figure/table refs, and section structure in the suggested replacements
+
 ## Companion skill routing

 Open additional skills only when the review workflow requires them:
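The three `review_followup_policy` values described in section 7 above amount to a small dispatch on what happens once the audit artifacts are durable. The sketch below illustrates that dispatch under stated assumptions: it treats `startup_contract` as a plain dict, and the function name and returned label strings are hypothetical, chosen only for illustration. The source does not say what happens when the policy is absent, so the sketch returns a neutral label instead of guessing a default.

```python
from enum import Enum


class FollowupPolicy(str, Enum):
    AUDIT_ONLY = "audit_only"
    AUTO_EXECUTE = "auto_execute_followups"
    USER_GATED = "user_gated_followups"


def next_step_after_audit(startup_contract: dict, route: str) -> str:
    """Map the policy to the action taken once audit artifacts are durable.

    `route` is the recommended follow-up skill, for example
    "analysis-campaign", "baseline", or "write".
    """
    raw = startup_contract.get("review_followup_policy")
    if raw is None:
        # The text only says to honor the policy when present.
        return "no_policy_specified"
    policy = FollowupPolicy(raw)  # raises ValueError on an unknown value
    if policy is FollowupPolicy.AUTO_EXECUTE:
        return f"continue_into:{route}"
    if policy is FollowupPolicy.USER_GATED:
        return f"request_decision:{route}"
    return "stop_with_recommendation"
```

Parsing the raw string through the enum makes an unrecognized policy value fail loudly instead of silently falling through to `audit_only` behavior.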
@@ -5,25 +5,48 @@
 ### TODO EXP-001

 - source review issue:
+- matrix exp id:
 - why current evidence is insufficient:
 - route type:
   - existing-result analysis
   - comparator baseline
   - supplementary experiment
   - figure / table regeneration
+- experiment type:
+  - component_ablation
+  - sensitivity
+  - robustness
+  - efficiency_cost
+  - highlight_validation
+  - failure_boundary
+  - case_study_optional
+- tier:
+  - main_required
+  - main_optional
+  - appendix
+  - optional
 - minimum task:
 - required metric(s):
 - minimal success criterion:
+- likely paper placement:
+  - main_text
+  - appendix
+  - maybe
+  - omit
 - expected manuscript impact:
 - owner / next step:

 ### TODO EXP-002

 - source review issue:
+- matrix exp id:
 - why current evidence is insufficient:
 - route type:
+- experiment type:
+- tier:
 - minimum task:
 - required metric(s):
 - minimal success criterion:
+- likely paper placement:
 - expected manuscript impact:
 - owner / next step:
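Because each TODO entry in the template above now carries a `matrix exp id`, one cheap consistency check is that every TODO points at a row that exists in the paper experiment matrix. The sketch below shows that check under assumptions: TODO items and matrix rows are modeled as plain dicts, and the keys `todo_id` and `matrix_exp_id` are hypothetical names. Only `exp_id` appears in the source.

```python
def unlinked_todos(todos, matrix_rows):
    """Return the ids of TODO items whose matrix link is missing or unknown."""
    known = {row["exp_id"] for row in matrix_rows}
    # A TODO with no `matrix_exp_id` yields None, which is never in `known`.
    return [t["todo_id"] for t in todos if t.get("matrix_exp_id") not in known]
```

A non-empty result would suggest the TODO list has drifted from the matrix, which the review skill treats as the broader master plan.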
@@ -1,5 +1,11 @@
 # Review Report Template

+## Review mode
+
+- review_followup_policy:
+- manuscript_edit_mode:
+- manuscript_source_status:
+
 ## Summary

 - paper / draft:
@@ -36,6 +42,8 @@
 - cause:
 - actionable fix:
 - acceptance criterion:
+- copy-ready revision text:
+- latex-ready revision text:

 ## Storyline Options + Writing Outlines

@@ -49,6 +57,14 @@
 2.
 3.

+## Manuscript Revision Package
+
+- section:
+- old wording / weakness:
+- new wording:
+- evidence basis:
+- latex-ready replacement block:
+
 ## Experiment Inventory & Research Experiment Plan

 - what existing experiments already cover:
@@ -3,6 +3,8 @@
 ## Revision Summary

 - current draft state:
+- review_followup_policy:
+- manuscript_edit_mode:
 - highest-priority fixes:
 - blockers:

@@ -20,6 +22,8 @@
 - supplementary experiment
 - claim downgrade
 - concrete change:
+- copy-ready revision text:
+- latex-ready revision text:
 - status:
 - blocks finalize:

@@ -10,7 +10,7 @@ Use this skill when the quest does not yet have a stable research frame.
 ## Interaction discipline

 - Follow the shared interaction contract injected by the system prompt.
-- For ordinary active work, prefer a concise progress update once work has crossed roughly
+- For ordinary active work, prefer a concise progress update once work has crossed roughly 6 tool calls with a human-meaningful delta, and do not drift beyond roughly 12 tool calls or about 8 minutes without a user-visible update.
 - Message templates are references only. Adapt to the actual context and vary wording so updates feel natural and non-robotic.
 - If a threaded user reply arrives, interpret it relative to the latest scout progress update before assuming the task changed completely.
 - When scouting actually resolves the framing ambiguity, locks the evaluation contract, or makes the next anchor obvious, send one richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` update that says what is now clear, why it matters, and which stage should come next.
@@ -20,7 +20,7 @@ This skill intentionally absorbs the strongest old DeepScientist writing discipl
 ## Interaction discipline

 - Follow the shared interaction contract injected by the system prompt.
-- For ordinary active work, prefer a concise progress update once work has crossed roughly
+- For ordinary active work, prefer a concise progress update once work has crossed roughly 6 tool calls with a human-meaningful delta, and do not drift beyond roughly 12 tool calls or about 8 minutes without a user-visible update.
 - Prefer `bash_exec` for durable document-build commands such as LaTeX compilation, figure regeneration, and scripted export steps so logs remain quest-local and reviewable.
 - Keep ordinary subtask completions concise. When a paper/draft milestone is actually completed, upgrade to a richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` report instead of another short progress update.
 - That richer writing-stage milestone report should normally cover: which draft, section, or outline milestone finished, what is now supportable, what is still missing, and the exact recommended next revision or route decision.
@@ -146,6 +146,8 @@ The write stage should usually produce most of the following:

 - `paper/outline.md` or equivalent outline
 - `paper/selected_outline.json`
+- `paper/paper_experiment_matrix.md`
+- `paper/paper_experiment_matrix.json`
 - `paper/outline_selection.md`
 - `paper/reviewer_first_pass.md`
 - `paper/section_contracts.md`
@@ -202,6 +204,144 @@ At minimum, repeatedly verify:
 - figure and table provenance
 - file inclusion integrity for the draft or bundle

+## Paper experiment matrix contract
+
+For any paper-like writing line that has more than a trivial single-result story, create and maintain:
+
+- `paper/paper_experiment_matrix.md`
+- `paper/paper_experiment_matrix.json`
+
+Use `references/paper-experiment-matrix-template.md` when helpful.
+
+The paper experiment matrix is the durable experiment-control surface for the paper line.
+It exists to prevent two common failures:
+
+- an outline that overweights post-hoc analysis and under-specifies paper-typical experiments
+- a drifting supplementary-experiment queue where runs are launched ad hoc without a full paper-facing plan
+
+The matrix is not just an “analysis list”.
+It should cover the full paper-facing experiment program beyond the already-finished main run, including:
+
+- main comparison surfaces that still need packaging or extension
+- component ablations
+- sensitivity / hyperparameter checks
+- robustness or stress checks
+- efficiency / cost / latency / token-overhead checks when the method may have a strong deployment or efficiency story
+- highlight-validation experiments that test the method's most likely reader-facing strengths rather than merely assuming those strengths
+- failure-boundary or limitation-surface analyses
+- case study or trace walkthrough rows as optional supporting material rather than mandatory core evidence
+
+Case study is usually optional.
+Do not let it displace stronger quantitative evidence.
+Efficiency or cost experiments are not mandatory in every paper, but they should be added whenever:
+
+- the method may be attractive partly because it is lightweight or prompt-level
+- the overhead skepticism from reviewers is easy to anticipate
+- a performance-over-cost tradeoff could become part of the paper's practical contribution
+
+Highlight-validation rule:
+
+- do not assume the method's strongest selling point is already obvious from the aggregate metric
+- explicitly write down `highlight hypotheses`
+- plan at least one experiment that could confirm or falsify each serious highlight hypothesis
+
+Typical highlight hypotheses include:
+
+- the method is more selective rather than merely more conservative
+- the gain comes from a named mechanism rather than from generic stubbornness or scale
+- the improvement concentrates on the intended failure regime
+- the method keeps a strong performance / overhead tradeoff
+
+Each matrix row should normally record at least:
+
+- `exp_id`
+- `title`
+- `tier`
+  - `main_required`
+  - `main_optional`
+  - `appendix`
+  - `optional`
+  - `dropped`
+- `experiment_type`
+  - `main_comparison`
+  - `component_ablation`
+  - `sensitivity`
+  - `robustness`
+  - `efficiency_cost`
+  - `highlight_validation`
+  - `failure_boundary`
+  - `case_study_optional`
+- `status`
+  - `proposed`
+  - `planned`
+  - `ready`
+  - `running`
+  - `completed`
+  - `analyzed`
+  - `written`
+  - `excluded`
+  - `blocked`
+- `feasibility_now`
+  - whether the row is runnable with current assets or still blocked
+- `claim_ids`
+- `highlight_ids`
+- `research_question`
+- `hypothesis`
+- `why_this_matters`
+- `comparators`
+- `fixed_conditions`
+- `changed_variables`
+- `metrics`
+- `cost_budget`
+- `minimal_success_criterion`
+- `promotion_rule`
+  - what result would move the row into main text
+  - what result keeps it appendix-only
+  - what result should exclude it
+- `paper_placement`
+  - `main_text`
+  - `appendix`
+  - `maybe`
+  - `omit`
+- `result_artifacts`
+- `next_action`
+
+The matrix should also contain:
+
+- core paper claims
+- highlight hypotheses
+- a short experiment taxonomy summary
+- the current execution frontier
+- an explicit main-text gate
+- a refresh log that records how priorities changed after new evidence arrived
+
+Main-text drafting gate:
+
+- do not treat the main experiments section as stable while any row that is both:
+  - currently feasible
+  - and not marked `optional` or `dropped`
+  remains unaddressed
+- before the experiments section becomes stable, every currently feasible row should be:
+  - `completed`
+  - `analyzed`
+  - `excluded` with a real reason
+  - or `blocked` with a real reason
+
+This does not forbid drafting the introduction, method, or placeholders early.
+It does forbid pretending the paper's experimental story is settled while the feasible experiment frontier is still open.
+
+After every meaningful experiment outcome, even a null result or exclusion:
+
+- reopen the matrix first
+- update the row status and feasibility
+- update `paper_placement`
+- update the claim and highlight impact
+- update the priority order of the remaining rows
+- then decide the next experiment or writing move
+
+Do not decide the next supplementary experiment from memory alone when the matrix exists.
+The matrix should be the authoritative experiment-routing surface for the paper line, and the selected outline's `experimental_designs` should stay consistent with that matrix rather than drifting away from it.
+
 ## Venue template selection

 For paper-like writing, use a real venue template rather than improvising a blank LaTeX tree.
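The highlight-validation rule in the contract above asks for at least one planned experiment per serious highlight hypothesis. A rough way to check that against the rows of `paper/paper_experiment_matrix.json` is sketched below. The actual JSON layout is not shown in this diff, so the sketch assumes rows are dicts carrying the `tier` and `highlight_ids` fields named in the contract, and that highlight ids are plain strings such as `"H1"` (an illustrative format, not one taken from the package).

```python
def uncovered_highlights(highlight_ids, rows):
    """Return highlight ids that no live matrix row is planned to validate."""
    covered = set()
    for row in rows:
        if row.get("tier") == "dropped":
            continue  # a dropped row no longer plans any validation
        covered.update(row.get("highlight_ids", []))
    # Preserve the order in which the hypotheses were written down.
    return [h for h in highlight_ids if h not in covered]
```

Each id returned would be a highlight hypothesis that currently rests on prose intuition alone and still needs a validation row.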
@@ -246,18 +386,20 @@ For paper-like deliverables, the safest default order is:
 3. choose the venue template from `templates/`, copy it into `paper/latex/`, and default general ML work to `templates/iclr2026/` unless a stronger venue target exists
 4. if the line benefits from an explicit outline contract, record one or more outline candidates with `artifact.submit_paper_outline(mode='candidate', ...)`
 5. if one outline should become the durable paper contract, select or revise it with `artifact.submit_paper_outline(mode='select'|'revise', ...)`
-6.
-7.
-8.
-9.
-10.
-11.
+6. create or refresh `paper/paper_experiment_matrix.md` and `paper/paper_experiment_matrix.json` before stabilizing the experiments section
+7. if the selected outline or matrix still exposes evidence gaps, launch an outline-bound and matrix-bound `artifact.create_analysis_campaign(...)` before drafting the experiments section as if it were settled
+8. plan and generate decisive figures or tables
+9. draft sections directly from the evidence and the current working outline; do not force extra outline rounds when direct drafting is clearer and safer
+10. run harsh review and revision cycles
+11. proof, package, submit `artifact.submit_paper_bundle(...)` when the bundle is ready, and then pass to `finalize`
+12. if the final paper PDF exists and QQ milestone media is enabled in config, the bundle-ready milestone may attach that PDF once

 Before real drafting, force one explicit planning pass that stabilizes at least:

 - the current claim inventory
 - the claim-evidence map skeleton
 - the outline or outline candidates
+- the paper experiment matrix
 - the figure/table plan
 - the main evidence gaps

@@ -273,6 +415,7 @@ For substantial paper-like writing, the durable writing plan should usually incl
|
|
|
273
415
|
|
|
274
416
|
- section goals
|
|
275
417
|
- paragraph or subsection intent when it materially affects correctness
|
|
418
|
+
- paper experiment matrix status and execution frontier
|
|
276
419
|
- experiment-to-section mapping
|
|
277
420
|
- figure/table-to-data-source mapping
|
|
278
421
|
- citation/search plan
|
|
@@ -284,6 +427,7 @@ Do not let drafting quietly outrun the current evidence inventory.
|
|
|
284
427
|
|
|
285
428
|
For reviewer-facing structure and section-level drafting contracts, read these references when the line needs sharper paper craft:
|
|
286
429
|
|
|
430
|
+
- `references/paper-experiment-matrix-template.md`
|
|
287
431
|
- `references/reviewer-first-writing.md`
|
|
288
432
|
- `references/section-contracts.md`
|
|
289
433
|
- `references/sentence-level-proofing.md`
|
|
@@ -306,6 +450,21 @@ Also build an experiment inventory before outlining:
|
|
|
306
450
|
- appendix-only evidence
|
|
307
451
|
- unusable or too-weak evidence
|
|
308
452
|
- verify that each planned main claim has at least one durable evidence path
|
|
453
|
+
- convert that inventory into the paper experiment matrix instead of leaving it as loose notes
|
|
454
|
+
|
|
455
|
+
When building the matrix, do not reduce the candidate pool to “analysis experiments”.
|
|
456
|
+
The inventory should explicitly consider:
|
|
457
|
+
|
|
458
|
+
- ablations
|
|
459
|
+
- robustness checks
|
|
460
|
+
- sensitivity or hyperparameter checks
|
|
461
|
+
- efficiency / cost / latency / token-overhead checks
|
|
462
|
+
- experiments aimed at validating likely highlights
|
|
463
|
+
- limitation-boundary analyses
|
|
464
|
+
- optional case studies
|
|
465
|
+
|
|
466
|
+
If the method appears to have a likely practical or deployment-facing strength, test it directly instead of burying that possibility in prose.
|
|
467
|
+
If the method appears to have a likely conceptual highlight, write the corresponding `highlight hypothesis` and treat it as something that still needs evidence rather than something to assume.
|
|
309
468
|
|
|
310
469
|
If an experiment is too weak, too tiny, or poorly comparable, do not let it silently anchor a main claim.
|
|
311
470
|
As a strong default, experiments with very small evaluation support, such as `<=10` effective examples or similarly fragile sample counts, should not carry a main-text claim unless the user explicitly accepts that limitation and the caveat is written next to the claim.
|
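The small-sample default above can be sketched as a guard. This is an illustrative sketch, not code from the package: the function name and parameters are hypothetical, and it covers only the sample-count part of the rule. Whether an experiment is "poorly comparable" still needs human judgment.

```python
def can_anchor_main_claim(
    effective_examples: int,
    user_accepted_limitation: bool = False,
    caveat_written_next_to_claim: bool = False,
    fragile_threshold: int = 10,
) -> bool:
    """Decide whether an experiment may carry a main-text claim.

    Hypothetical helper: `fragile_threshold` mirrors the `<=10` default in the
    text; both override flags must be true for a fragile result to pass.
    """
    if effective_examples > fragile_threshold:
        return True
    # Fragile sample count: allowed only with explicit acceptance AND a caveat.
    return user_accepted_limitation and caveat_written_next_to_claim
```

A result with 8 effective examples is rejected by default and passes only when the user has accepted the limitation and the caveat sits next to the claim.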
|
@@ -1083,3 +1242,5 @@ Exit the write stage only when one of the following is durably true:
|
|
|
1083
1242
|
- the current draft is evidence-complete enough for `finalize`, including a selected outline and a durable paper bundle manifest when the deliverable is paper-like
|
|
1084
1243
|
- a clear evidence gap has been recorded and the quest is routed backward
|
|
1085
1244
|
- a packaging or proofing blocker has been recorded and the next action is explicit
|
|
1245
|
+
|
|
1246
|
+
For paper-like writing, do not treat the draft as evidence-complete enough for `finalize` while `paper/paper_experiment_matrix.*` still contains currently feasible non-optional rows that remain unresolved.
|
|
@@ -0,0 +1,131 @@
|
|
|
1
|
+
# Paper Experiment Matrix Template
|
|
2
|
+
|
|
3
|
+
Use this template when a paper-like line needs a durable experiment-control surface beyond the selected outline.
|
|
4
|
+
|
|
5
|
+
Create and maintain both:
|
|
6
|
+
|
|
7
|
+
- `paper/paper_experiment_matrix.md`
|
|
8
|
+
- `paper/paper_experiment_matrix.json`
|
|
9
|
+
|
|
10
|
+
The Markdown file is the human-facing control surface.
|
|
11
|
+
The JSON file is the machine-facing mirror.
|
|
12
|
+
|
|
13
|
+
## 1. Current Judgment
|
|
14
|
+
|
|
15
|
+
- current judgment:
|
|
16
|
+
- why the matrix is needed now:
|
|
17
|
+
- what would make the experiments section stable:
|
|
18
|
+
- what still blocks stable paper writing:
|
|
19
|
+
|
|
20
|
+
## 2. Core Claims
|
|
21
|
+
|
|
22
|
+
- `C1`:
|
|
23
|
+
- one-line claim:
|
|
24
|
+
- current support status:
|
|
25
|
+
- strongest current evidence:
|
|
26
|
+
- still-missing evidence:
|
|
27
|
+
|
|
28
|
+
- `C2`:
|
|
29
|
+
- one-line claim:
|
|
30
|
+
- current support status:
|
|
31
|
+
- strongest current evidence:
|
|
32
|
+
- still-missing evidence:
|
|
33
|
+
|
|
34
|
+
- `C3`:
|
|
35
|
+
- one-line claim:
|
|
36
|
+
- current support status:
|
|
37
|
+
- strongest current evidence:
|
|
38
|
+
- still-missing evidence:
|
|
39
|
+
|
|
40
|
+
## 3. Highlight Hypotheses
|
|
41
|
+
|
|
42
|
+
Write only serious hypotheses that could matter to the paper's reader-facing value.
|
|
43
|
+
Do not assume the highlight is already true just because it sounds attractive.
|
|
44
|
+
|
|
45
|
+
- `H1`:
|
|
46
|
+
- one-line highlight:
|
|
47
|
+
- why it is plausible:
|
|
48
|
+
- validation rows:
|
|
49
|
+
- fallback if unsupported:
|
|
50
|
+
|
|
51
|
+
- `H2`:
|
|
52
|
+
- one-line highlight:
|
|
53
|
+
- why it is plausible:
|
|
54
|
+
- validation rows:
|
|
55
|
+
- fallback if unsupported:
|
|
56
|
+
|
|
57
|
+
## 4. Taxonomy Summary
|
|
58
|
+
|
|
59
|
+
Check every category deliberately.
|
|
60
|
+
Do not collapse the matrix into “analysis only”.
|
|
61
|
+
|
|
62
|
+
- main comparison:
|
|
63
|
+
- component ablation:
|
|
64
|
+
- sensitivity:
|
|
65
|
+
- robustness:
|
|
66
|
+
- efficiency / cost:
|
|
67
|
+
- highlight validation:
|
|
68
|
+
- failure boundary:
|
|
69
|
+
- case study optional:
|
|
70
|
+
|
|
71
|
+
## 5. Matrix Table
|
|
72
|
+
|
|
73
|
+
| Exp id | Title | Tier | Experiment type | Status | Feasibility now | Claim ids | Highlight ids | Research question | Metrics | Paper placement | Next action |
|
|
74
|
+
|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
75
|
+
| | | main_required / main_optional / appendix / optional / dropped | main_comparison / component_ablation / sensitivity / robustness / efficiency_cost / highlight_validation / failure_boundary / case_study_optional | proposed / planned / ready / running / completed / analyzed / written / excluded / blocked | feasible_now / light_setup / blocked / uncertain | | | | | main_text / appendix / maybe / omit | |
|
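One way the JSON mirror could model a matrix row is sketched below. The enumerated values are copied from the table row above, but the field names and the `MatrixRow` and `validate_row` helpers are assumptions for illustration; the package's actual JSON schema is not shown in this diff.

```python
from dataclasses import dataclass, field

# Vocabularies copied from the template's matrix table.
TIERS = {"main_required", "main_optional", "appendix", "optional", "dropped"}
EXPERIMENT_TYPES = {
    "main_comparison", "component_ablation", "sensitivity", "robustness",
    "efficiency_cost", "highlight_validation", "failure_boundary",
    "case_study_optional",
}
STATUSES = {
    "proposed", "planned", "ready", "running", "completed",
    "analyzed", "written", "excluded", "blocked",
}
FEASIBILITY = {"feasible_now", "light_setup", "blocked", "uncertain"}
PLACEMENTS = {"main_text", "appendix", "maybe", "omit"}


@dataclass
class MatrixRow:
    # Field names are hypothetical; the real JSON keys may differ.
    exp_id: str
    title: str
    tier: str
    experiment_type: str
    status: str
    feasibility: str
    paper_placement: str
    claim_ids: list[str] = field(default_factory=list)
    highlight_ids: list[str] = field(default_factory=list)
    research_question: str = ""
    metrics: list[str] = field(default_factory=list)
    next_action: str = ""


def validate_row(row: MatrixRow) -> list[str]:
    """Return a list of problems; an empty list means every enum field is valid."""
    checks = [
        ("tier", row.tier, TIERS),
        ("experiment_type", row.experiment_type, EXPERIMENT_TYPES),
        ("status", row.status, STATUSES),
        ("feasibility", row.feasibility, FEASIBILITY),
        ("paper_placement", row.paper_placement, PLACEMENTS),
    ]
    return [
        f"{row.exp_id}: {name}={value!r} is not one of {sorted(allowed)}"
        for name, value, allowed in checks
        if value not in allowed
    ]
```

Validating rows when they are written keeps the Markdown table and the JSON mirror on the same vocabulary, so a typo such as `main-required` is caught before it affects routing.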
|
76
|
+
|
|
77
|
+
## 6. Detail Cards
|
|
78
|
+
|
|
79
|
+
Repeat one card per meaningful row.
|
|
80
|
+
|
|
81
|
+
### EXP-001
|
|
82
|
+
|
|
83
|
+
- title:
|
|
84
|
+
- tier:
|
|
85
|
+
- experiment type:
|
|
86
|
+
- current status:
|
|
87
|
+
- feasibility now:
|
|
88
|
+
- why this row exists:
|
|
89
|
+
- research question:
|
|
90
|
+
- hypothesis:
|
|
91
|
+
- comparators:
|
|
92
|
+
- fixed conditions:
|
|
93
|
+
- changed variables:
|
|
94
|
+
- required metric(s):
|
|
95
|
+
- minimal success criterion:
|
|
96
|
+
- cost / runtime budget:
|
|
97
|
+
- promotion rule:
|
|
98
|
+
- main text if:
|
|
99
|
+
- appendix if:
|
|
100
|
+
- omit if:
|
|
101
|
+
- expected figure or table:
|
|
102
|
+
- result artifact paths:
|
|
103
|
+
- dependencies:
|
|
104
|
+
- next action:
|
|
105
|
+
|
|
106
|
+
## 7. Execution Frontier
|
|
107
|
+
|
|
108
|
+
- rows ready now:
|
|
109
|
+
- rows blocked now:
|
|
110
|
+
- rows that must finish before the experiments section is stable:
|
|
111
|
+
- rows that are appendix-only and can wait:
|
|
112
|
+
- rows that are optional and should not block:
|
|
113
|
+
|
|
114
|
+
## 8. Main-Text Gate
|
|
115
|
+
|
|
116
|
+
Do not treat the experiments section as stable while any currently feasible row that is not merely `optional` or `dropped` remains unresolved.
|
|
117
|
+
|
|
118
|
+
Every currently feasible non-optional row should be one of:
|
|
119
|
+
|
|
120
|
+
- completed
|
|
121
|
+
- analyzed
|
|
122
|
+
- excluded with reason
|
|
123
|
+
- blocked with reason
|
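The Main-Text Gate above can be sketched as a check over the JSON mirror. This is a minimal sketch under stated assumptions, not the package's implementation: it assumes a top-level `rows` list whose entries carry `exp_id`, `tier`, `status`, `feasibility`, and `reason` keys, and it reads "currently feasible" as `feasible_now` only. All of these names are hypothetical.

```python
import json
from pathlib import Path

# Statuses the gate accepts, taken from the list in the template.
RESOLVED_STATUSES = {"completed", "analyzed", "excluded", "blocked"}
# Tiers the gate says must not block the experiments section.
NON_BLOCKING_TIERS = {"optional", "dropped"}
# Statuses that only count as resolved when a reason is recorded.
NEEDS_REASON = {"excluded", "blocked"}


def unresolved_gate_rows(rows):
    """Return the rows that still block a stable experiments section."""
    blocking = []
    for row in rows:
        if row.get("feasibility") != "feasible_now":
            continue  # assumption: only `feasible_now` counts as currently feasible
        if row.get("tier") in NON_BLOCKING_TIERS:
            continue
        status = row.get("status")
        if status not in RESOLVED_STATUSES:
            blocking.append(row)
        elif status in NEEDS_REASON and not row.get("reason"):
            blocking.append(row)  # "excluded/blocked with reason" lacks the reason
    return blocking


def load_rows(path="paper/paper_experiment_matrix.json"):
    """Load rows from the JSON mirror; the `rows` key is an assumed layout."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return data["rows"]
```

The template's gate list does not mention `written`, so this sketch treats a `written` row as unresolved. If `written` is meant to imply `analyzed`, add it to `RESOLVED_STATUSES`.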
|
124
|
+
|
|
125
|
+
## 9. Refresh Log
|
|
126
|
+
|
|
127
|
+
After every completed, excluded, or blocked slice, reopen the matrix first and update it before selecting the next run.
|
|
128
|
+
|
|
129
|
+
| Time | Exp id | What changed | Claim/highlight impact | Priority change | New next action |
|
|
130
|
+
|---|---|---|---|---|---|
|
|
131
|
+
| | | | | | |
|
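A refresh-log entry for the table above could be formatted as follows. This is a sketch with a hypothetical helper name and a timestamp format chosen for illustration; the template does not prescribe either.

```python
from datetime import datetime, timezone


def refresh_log_row(exp_id, what_changed, impact, priority_change, next_action, when=None):
    """Format one Markdown row for the Refresh Log table.

    Hypothetical helper: column order follows the template header
    (Time, Exp id, What changed, Claim/highlight impact, Priority change,
    New next action).
    """
    when = when or datetime.now(timezone.utc)
    cells = [
        when.strftime("%Y-%m-%d %H:%M UTC"),
        exp_id,
        what_changed,
        impact,
        priority_change,
        next_action,
    ]
    # Escape pipes and flatten newlines so a cell cannot break the table layout.
    escaped = [str(cell).replace("|", "\\|").replace("\n", " ") for cell in cells]
    return "| " + " | ".join(escaped) + " |"
```

Appending the returned line to `paper/paper_experiment_matrix.md` after each completed, excluded, or blocked slice keeps the log in step with the rule stated above the table.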