@researai/deepscientist 1.5.11 → 1.5.13

Files changed (107)
  1. package/README.md +8 -8
  2. package/bin/ds.js +375 -61
  3. package/docs/en/00_QUICK_START.md +55 -4
  4. package/docs/en/01_SETTINGS_REFERENCE.md +15 -0
  5. package/docs/en/02_START_RESEARCH_GUIDE.md +68 -4
  6. package/docs/en/09_DOCTOR.md +48 -4
  7. package/docs/en/12_GUIDED_WORKFLOW_TOUR.md +21 -2
  8. package/docs/en/15_CODEX_PROVIDER_SETUP.md +382 -0
  9. package/docs/en/README.md +4 -0
  10. package/docs/zh/00_QUICK_START.md +54 -3
  11. package/docs/zh/01_SETTINGS_REFERENCE.md +15 -0
  12. package/docs/zh/02_START_RESEARCH_GUIDE.md +69 -3
  13. package/docs/zh/09_DOCTOR.md +48 -2
  14. package/docs/zh/12_GUIDED_WORKFLOW_TOUR.md +21 -2
  15. package/docs/zh/15_CODEX_PROVIDER_SETUP.md +383 -0
  16. package/docs/zh/README.md +4 -1
  17. package/package.json +2 -1
  18. package/pyproject.toml +1 -1
  19. package/src/deepscientist/__init__.py +1 -1
  20. package/src/deepscientist/bash_exec/monitor.py +7 -5
  21. package/src/deepscientist/bash_exec/service.py +84 -21
  22. package/src/deepscientist/channels/local.py +3 -3
  23. package/src/deepscientist/channels/qq.py +7 -7
  24. package/src/deepscientist/channels/relay.py +7 -7
  25. package/src/deepscientist/channels/weixin_ilink.py +90 -19
  26. package/src/deepscientist/cli.py +3 -0
  27. package/src/deepscientist/codex_cli_compat.py +117 -0
  28. package/src/deepscientist/config/models.py +1 -0
  29. package/src/deepscientist/config/service.py +173 -25
  30. package/src/deepscientist/daemon/app.py +314 -6
  31. package/src/deepscientist/doctor.py +1 -5
  32. package/src/deepscientist/mcp/server.py +124 -3
  33. package/src/deepscientist/prompts/builder.py +113 -11
  34. package/src/deepscientist/quest/service.py +247 -31
  35. package/src/deepscientist/runners/codex.py +132 -24
  36. package/src/deepscientist/runners/runtime_overrides.py +9 -0
  37. package/src/deepscientist/shared.py +33 -14
  38. package/src/prompts/connectors/qq.md +2 -1
  39. package/src/prompts/connectors/weixin.md +2 -1
  40. package/src/prompts/contracts/shared_interaction.md +4 -1
  41. package/src/prompts/system.md +59 -9
  42. package/src/skills/analysis-campaign/SKILL.md +46 -6
  43. package/src/skills/analysis-campaign/references/campaign-plan-template.md +21 -8
  44. package/src/skills/baseline/SKILL.md +1 -1
  45. package/src/skills/baseline/references/artifact-payload-examples.md +39 -0
  46. package/src/skills/decision/SKILL.md +1 -1
  47. package/src/skills/experiment/SKILL.md +1 -1
  48. package/src/skills/finalize/SKILL.md +1 -1
  49. package/src/skills/idea/SKILL.md +1 -1
  50. package/src/skills/intake-audit/SKILL.md +1 -1
  51. package/src/skills/rebuttal/SKILL.md +74 -1
  52. package/src/skills/rebuttal/references/response-letter-template.md +55 -11
  53. package/src/skills/review/SKILL.md +118 -1
  54. package/src/skills/review/references/experiment-todo-template.md +23 -0
  55. package/src/skills/review/references/review-report-template.md +16 -0
  56. package/src/skills/review/references/revision-log-template.md +4 -0
  57. package/src/skills/scout/SKILL.md +1 -1
  58. package/src/skills/write/SKILL.md +168 -7
  59. package/src/skills/write/references/paper-experiment-matrix-template.md +131 -0
  60. package/src/tui/dist/lib/connectorConfig.js +90 -0
  61. package/src/tui/dist/lib/qr.js +21 -0
  62. package/src/tui/package.json +2 -1
  63. package/src/ui/dist/assets/{AiManusChatView-D0mTXG4-.js → AiManusChatView-CnJcXynW.js} +12 -12
  64. package/src/ui/dist/assets/{AnalysisPlugin-Db0cTXxm.js → AnalysisPlugin-DeyzPEhV.js} +1 -1
  65. package/src/ui/dist/assets/{CliPlugin-DrV8je02.js → CliPlugin-CB1YODQn.js} +9 -9
  66. package/src/ui/dist/assets/{CodeEditorPlugin-QXMSCH71.js → CodeEditorPlugin-B-xicq1e.js} +8 -8
  67. package/src/ui/dist/assets/{CodeViewerPlugin-7hhtWj_E.js → CodeViewerPlugin-DT54ysXa.js} +5 -5
  68. package/src/ui/dist/assets/{DocViewerPlugin-BWMSnRJe.js → DocViewerPlugin-DQtKT-VD.js} +3 -3
  69. package/src/ui/dist/assets/{GitDiffViewerPlugin-7J9h9Vy_.js → GitDiffViewerPlugin-hqHbCfnv.js} +20 -20
  70. package/src/ui/dist/assets/{ImageViewerPlugin-CHJl_0lr.js → ImageViewerPlugin-OcVo33jV.js} +5 -5
  71. package/src/ui/dist/assets/{LabCopilotPanel-1qSow1es.js → LabCopilotPanel-DdGwhEUV.js} +11 -11
  72. package/src/ui/dist/assets/{LabPlugin-eQpPPCEp.js → LabPlugin-Ciz1gDaX.js} +2 -2
  73. package/src/ui/dist/assets/{LatexPlugin-BwRfi89Z.js → LatexPlugin-BhmjNQRC.js} +37 -11
  74. package/src/ui/dist/assets/{MarkdownViewerPlugin-836PVQWV.js → MarkdownViewerPlugin-BzdVH9Bx.js} +4 -4
  75. package/src/ui/dist/assets/{MarketplacePlugin-C2y_556i.js → MarketplacePlugin-DmyHspXt.js} +3 -3
  76. package/src/ui/dist/assets/{NotebookEditor-DIX7Mlzu.js → NotebookEditor-BMXKrDRk.js} +1 -1
  77. package/src/ui/dist/assets/{NotebookEditor-BRzJbGsn.js → NotebookEditor-BTVYRGkm.js} +11 -11
  78. package/src/ui/dist/assets/{PdfLoader-DzRaTAlq.js → PdfLoader-CvcjJHXv.js} +1 -1
  79. package/src/ui/dist/assets/{PdfMarkdownPlugin-DZUfIUnp.js → PdfMarkdownPlugin-DW2ej8Vk.js} +2 -2
  80. package/src/ui/dist/assets/{PdfViewerPlugin-BwtICzue.js → PdfViewerPlugin-CmlDxbhU.js} +10 -10
  81. package/src/ui/dist/assets/{SearchPlugin-DHeIAMsx.js → SearchPlugin-DAjQZPSv.js} +1 -1
  82. package/src/ui/dist/assets/{TextViewerPlugin-C3tCmFox.js → TextViewerPlugin-C-nVAZb_.js} +5 -5
  83. package/src/ui/dist/assets/{VNCViewer-CQsKVm3t.js → VNCViewer-D7-dIYon.js} +10 -10
  84. package/src/ui/dist/assets/{bot-BEA2vWuK.js → bot-C_G4WtNI.js} +1 -1
  85. package/src/ui/dist/assets/{code-XfbSR8K2.js → code-Cd7WfiWq.js} +1 -1
  86. package/src/ui/dist/assets/{file-content-BjxNaIfy.js → file-content-B57zsL9y.js} +1 -1
  87. package/src/ui/dist/assets/{file-diff-panel-D_lLVQk0.js → file-diff-panel-DVoheLFq.js} +1 -1
  88. package/src/ui/dist/assets/{file-socket-D9x_5vlY.js → file-socket-B5kXFxZP.js} +1 -1
  89. package/src/ui/dist/assets/{image-BhWT33W1.js → image-LLOjkMHF.js} +1 -1
  90. package/src/ui/dist/assets/{index-Dqj-Mjb4.css → index-BQG-1s2o.css} +40 -2
  91. package/src/ui/dist/assets/{index--c4iXtuy.js → index-C3r2iGrp.js} +12 -12
  92. package/src/ui/dist/assets/{index-DZTZ8mWP.js → index-CLQauncb.js} +911 -120
  93. package/src/ui/dist/assets/{index-PJbSbPTy.js → index-Dxa2eYMY.js} +1 -1
  94. package/src/ui/dist/assets/{index-BDxipwrC.js → index-hOUOWbW2.js} +2 -2
  95. package/src/ui/dist/assets/{monaco-K8izTGgo.js → monaco-BGGAEii3.js} +1 -1
  96. package/src/ui/dist/assets/{pdf-effect-queue-DfBors6y.js → pdf-effect-queue-DlEr1_y5.js} +1 -1
  97. package/src/ui/dist/assets/{popover-yFK1J4fL.js → popover-CWJbJuYY.js} +1 -1
  98. package/src/ui/dist/assets/{project-sync-PENr2zcz.js → project-sync-CRJiucYO.js} +18 -4
  99. package/src/ui/dist/assets/{select-CAbJDfYv.js → select-CoHB7pvH.js} +2 -2
  100. package/src/ui/dist/assets/{sigma-DEuYJqTl.js → sigma-D5aJWR8J.js} +1 -1
  101. package/src/ui/dist/assets/{square-check-big-omoSUmcd.js → square-check-big-DUK_mnkS.js} +1 -1
  102. package/src/ui/dist/assets/{trash--F119N47.js → trash-ChU3SEE3.js} +1 -1
  103. package/src/ui/dist/assets/{useCliAccess-D31UR23I.js → useCliAccess-BrJBV3tY.js} +1 -1
  104. package/src/ui/dist/assets/{useFileDiffOverlay-BH6KcMzq.js → useFileDiffOverlay-C2OQaVWc.js} +1 -1
  105. package/src/ui/dist/assets/{wrap-text-CZ613PM5.js → wrap-text-C7Qqh-om.js} +1 -1
  106. package/src/ui/dist/assets/{zoom-out-BgDLAv3z.js → zoom-out-rtX0FKya.js} +1 -1
  107. package/src/ui/dist/index.html +2 -2
package/src/skills/review/SKILL.md +118 -1
@@ -17,7 +17,7 @@ It is also not the same as `rebuttal`.
  ## Interaction discipline
 
  - Follow the shared interaction contract injected by the system prompt.
- - For ordinary active work, prefer a concise progress update once work has crossed roughly 10 tool calls with a human-meaningful delta, and do not drift beyond roughly 20 tool calls or about 15 minutes without a user-visible update.
+ - For ordinary active work, prefer a concise progress update once work has crossed roughly 6 tool calls with a human-meaningful delta, and do not drift beyond roughly 12 tool calls or about 8 minutes without a user-visible update.
  - When the review report, revision plan, or follow-up experiment TODO list becomes durable, send a richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` update that says what the main risks are, what should be fixed next, and whether the next route is writing, experiment, or claim downgrade.
 
  ## Purpose
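The revised cadence can be read as a soft threshold (about 6 tool calls with a meaningful delta) and a hard one (about 12 tool calls or about 8 minutes). The following is a minimal Python sketch of that reading. The helper name `update_action` and its return labels are hypothetical, and the sketch treats the "roughly" figures as exact cut-offs, which the skill text does not require.

```python
# Hypothetical helper: not part of the package, only an illustration of the cadence.
SOFT_CALLS = 6     # "roughly 6 tool calls with a human-meaningful delta"
HARD_CALLS = 12    # "do not drift beyond roughly 12 tool calls"
HARD_MINUTES = 8   # "or about 8 minutes"


def update_action(tool_calls: int, minutes: float, has_delta: bool) -> str:
    """Return 'required', 'preferred', or 'none' for a user-visible update."""
    # The hard limits apply whether or not there is a meaningful delta.
    if tool_calls >= HARD_CALLS or minutes >= HARD_MINUTES:
        return "required"
    # The soft limit only applies when there is something worth reporting.
    if tool_calls >= SOFT_CALLS and has_delta:
        return "preferred"
    return "none"
```

Under the previous wording the same sketch would use 10, 20, and 15 for the three constants.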
@@ -63,6 +63,16 @@ Do not treat “looks polished” as “is defensible”.
  - Do not recommend rhetoric when the real problem is missing evidence.
  - If novelty or positioning is uncertain, treat that as a literature-audit question first, not an automatic experiment request.
  - If a claim is too broad for the evidence, prefer narrowing or downgrading the claim over defending it with style.
+ - If `startup_contract.review_followup_policy` is present, honor it:
+   - `audit_only`
+     - stop after durable review artifacts and a clear route recommendation
+   - `auto_execute_followups`
+     - do not stop at the audit if the next route is already clear; continue into the required experiments and manuscript deltas
+   - `user_gated_followups`
+     - finish the audit first, then package the next expensive follow-up step into one structured decision
+ - If `startup_contract.manuscript_edit_mode = latex_required`, treat the provided LaTeX tree or `paper/latex/` as the writing surface when manuscript revision is needed.
+ - If LaTeX source is unavailable while `latex_required` is requested, do not pretend the manuscript was edited; produce LaTeX-ready replacement text and an explicit blocker note instead.
+ - Accept manuscript and review inputs from URLs, local file paths, local directories, or current-turn attachments; do not assume the draft is already perfectly normalized.
 
  ## Primary inputs
 
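The `latex_required` fallback added in this hunk amounts to a small decision: edit real sources if a LaTeX tree exists, otherwise emit LaTeX-ready text plus a blocker note. A minimal Python sketch follows. The function name `resolve_writing_surface`, the returned keys, and the "any `.tex` file counts as available" test are assumptions for illustration, not the package's behavior.

```python
from pathlib import Path
from typing import Optional


def resolve_writing_surface(edit_mode: str, latex_dir: Optional[str]) -> dict:
    """Pick a writing surface for manuscript revision (hypothetical helper)."""
    if edit_mode != "latex_required":
        # Assumption: any other mode is served with plain replacement text.
        return {"surface": "copy_ready_text", "blocker": None}

    # Assumption: the tree counts as available if it contains any .tex file.
    has_sources = (
        latex_dir is not None
        and Path(latex_dir).is_dir()
        and any(Path(latex_dir).rglob("*.tex"))
    )
    if has_sources:
        return {"surface": "latex_sources", "blocker": None}

    # Do not pretend the manuscript was edited: report the blocker explicitly.
    return {
        "surface": "latex_ready_text",
        "blocker": "latex_required was requested but no LaTeX source is available",
    }
```

A caller would typically pass the user-provided tree first and fall back to `paper/latex/` before accepting the blocker result.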
@@ -74,11 +84,13 @@ Use, in roughly this order:
  - the six-field `evaluation_summary` blocks from recent main experiments and analysis slices
  - recent main and analysis experiment results
  - figures, tables, and captions
+ - current-turn attachments and user-provided local paths / directories / URLs for the manuscript bundle or review packet
  - prior self-review or reviewer-first notes as low-trust auxiliary input
  - nearby papers when novelty or comparison is unclear
 
  If the draft/result state is still unclear, open `intake-audit` first before continuing the review workflow.
  Before proposing extra experiments, read those structured `evaluation_summary` blocks first so you do not request work that the recorded evidence already resolved.
+ If the user provided draft files or manuscript bundles directly, first normalize them into durable quest-visible paths before planning experiments or section-level revisions.
 
  ## Core outputs
 
@@ -87,6 +99,8 @@ The review pass should usually leave behind:
  - `paper/review/review.md`
  - `paper/review/revision_log.md`
  - `paper/review/experiment_todo.md`
+ - `paper/paper_experiment_matrix.md` when more evidence is still needed
+ - `paper/paper_experiment_matrix.json` when more evidence is still needed
 
  Use the templates in `references/` when needed:
 
@@ -175,14 +189,25 @@ For each serious issue, record:
  - what should change
  - whether the fix is writing-only, evidence-only, or experiment-dependent
  - whether the issue blocks `finalize`
+ - one copy-ready replacement sentence / paragraph when feasible
+ - one LaTeX-ready replacement block when `startup_contract.manuscript_edit_mode = latex_required`
 
  ### 5. Produce the follow-up experiment TODO list
 
  Only if more evidence is truly needed, write `paper/review/experiment_todo.md` using `references/experiment-todo-template.md`.
 
+ When the paper still lacks experimental support, also create or revise:
+
+ - `paper/paper_experiment_matrix.md`
+ - `paper/paper_experiment_matrix.json`
+
+ Treat the matrix as the paper-facing master plan and `paper/review/experiment_todo.md` as only the current execution frontier or review-facing subset.
+
  Each TODO item should include:
 
  - the review issue it answers
+ - the matrix exp id
+ - the corresponding `exp_id` in the paper experiment matrix
  - why existing evidence is still insufficient
  - the minimum experiment or analysis needed
  - required metric(s)
@@ -195,6 +220,50 @@ Each TODO item should include:
 
  Do not write a vague “run more ablations” list.
  Each TODO item should be concrete enough to turn into `analysis-campaign` slices or a `baseline` recovery task.
+ The matrix should be broader than the TODO list and should classify the full paper-facing experiment space, not just analysis work.
+ When building or revising that matrix, explicitly consider:
+
+ - main comparison packaging or extension
+ - component ablations
+ - sensitivity / hyperparameter checks
+ - robustness checks
+ - efficiency / cost / latency / token-overhead checks when relevant
+ - highlight-validation experiments that test the likely strengths of the method
+ - limitation-boundary analyses
+ - case study rows as optional rather than mandatory evidence
+
+ Do not assume the paper only needs “analysis experiments”.
+ Do not assume case studies belong in the required set.
+ If efficiency or cost could become a reviewer-facing strength or concern, put that into the matrix explicitly.
+
+ For the matrix, each row should usually record:
+
+ - `exp_id`
+ - `tier`
+ - `experiment_type`
+ - `status`
+ - `feasibility_now`
+ - `claim_ids`
+ - `highlight_ids`
+ - `research_question`
+ - `hypothesis`
+ - `comparators`
+ - `metrics`
+ - `minimal_success_criterion`
+ - `paper_placement`
+ - `promotion_rule`
+ - `next_action`
+
+ The matrix should also keep a short `highlight hypotheses` block.
+ Do not rely on prose intuition for the method's best selling point; if a likely highlight matters, it should have a corresponding validation row in the matrix.
+
+ Before treating the experiments section as stable, require that every currently feasible matrix row that is not merely `optional` or `dropped` is either:
+
+ - completed
+ - analyzed
+ - excluded with a real reason
+ - or blocked with a real reason
+
  When extra evidence is truly needed, use the shared supplementary-experiment protocol:
 
  - recover ids / refs first if needed
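The stability gate added in this hunk can be expressed as a filter over matrix rows. The sketch below is one possible Python reading, not the package's implementation. It assumes rows are plain dicts keyed by the field names listed above, that `feasibility_now` is a boolean, and that a hypothetical `reason` field carries the "real reason" for excluded or blocked rows.

```python
RESOLVED_STATUSES = {"completed", "analyzed", "excluded", "blocked"}
NEEDS_REASON = {"excluded", "blocked"}   # these must carry a real reason
EXEMPT_TIERS = {"optional", "dropped"}   # not part of the gate


def open_rows(rows: list) -> list:
    """Return exp_ids that still keep the experiments section from being stable."""
    pending = []
    for row in rows:
        if not row.get("feasibility_now"):
            continue  # the gate only covers currently feasible rows
        if row.get("tier") in EXEMPT_TIERS:
            continue
        status = row.get("status")
        resolved = status in RESOLVED_STATUSES
        if resolved and status in NEEDS_REASON and not row.get("reason"):
            resolved = False  # excluded/blocked without a reason does not count
        if not resolved:
            pending.append(row["exp_id"])
    return pending
```

An empty return value means every feasible, non-optional row is accounted for; anything else is the list of rows to address or justify first.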
@@ -216,6 +285,54 @@ After the review artifacts are durable:
 
  Do not stop immediately after writing the review if the next route is already clear.
 
+ ### 7. Auto follow-up execution contract
+
+ When `startup_contract.review_followup_policy = auto_execute_followups`:
+
+ - treat the review as a gate, not as the endpoint
+ - immediately turn the accepted follow-up route into action:
+   - `analysis-campaign`
+     - when new evidence is truly required
+   - `baseline`
+     - when a missing comparator baseline blocks fair review
+   - `write`
+     - when the issues are mostly text, outline, claim-scope, figure, or framing revisions
+ - after each completed follow-up step, update:
+   - `paper/review/revision_log.md`
+   - `paper/review/experiment_todo.md`
+   - the draft or manuscript-facing revision package
+ - only treat the review line as truly closed after the follow-up route has either completed or been downgraded / blocked explicitly
+
+ When `startup_contract.review_followup_policy = user_gated_followups`:
+
+ - stop after the durable audit artifacts
+ - turn the next expensive follow-up package into one structured decision instead of continuing silently
+
+ When `startup_contract.review_followup_policy = audit_only`:
+
+ - stop after the durable audit artifacts and route recommendation
+
+ ### 8. Manuscript revision delivery contract
+
+ If manuscript revision is required, make the delta explicit:
+
+ - section
+ - old claim / weakness
+ - new wording
+ - evidence basis
+ - remaining limitation
+
+ If `startup_contract.manuscript_edit_mode = copy_ready_text`:
+
+ - provide copy-ready replacement wording in `paper/review/revision_log.md` or a nearby revision note
+ - keep the wording directly usable by the user or downstream `write`
+
+ If `startup_contract.manuscript_edit_mode = latex_required`:
+
+ - prefer editing the actual LaTeX sources when they are available
+ - otherwise provide LaTeX-ready replacement text blocks with explicit insertion targets
+ - preserve labels, citations, figure/table refs, and section structure in the suggested replacements
+
  ## Companion skill routing
 
  Open additional skills only when the review workflow requires them:
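The three `review_followup_policy` values in section 7 differ only in what happens once the audit artifacts are durable. A minimal Python sketch of that branching follows. The function `after_audit` and its `action` labels are hypothetical, and raising on an unknown policy is an assumption, since the skill text only says to honor the policy when it is present.

```python
FOLLOWUP_SKILLS = {"analysis-campaign", "baseline", "write"}


def after_audit(policy: str, route: str) -> dict:
    """Decide what follows a completed review audit (illustrative only)."""
    if route not in FOLLOWUP_SKILLS:
        raise ValueError(f"unknown follow-up route: {route!r}")

    if policy == "auto_execute_followups":
        # The review is a gate, not the endpoint: continue into the route.
        return {"action": "execute", "skill": route}
    if policy == "user_gated_followups":
        # Stop, and package the next expensive step as one structured decision.
        return {"action": "request_decision", "skill": route}
    if policy == "audit_only":
        # Stop after the artifacts and the route recommendation.
        return {"action": "stop", "recommendation": route}
    raise ValueError(f"unknown review_followup_policy: {policy!r}")
```

Keeping the route in every result mirrors the requirement that even `audit_only` ends with a clear route recommendation.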
package/src/skills/review/references/experiment-todo-template.md +23 -0
@@ -5,25 +5,48 @@
  ### TODO EXP-001
 
  - source review issue:
+ - matrix exp id:
  - why current evidence is insufficient:
  - route type:
    - existing-result analysis
    - comparator baseline
    - supplementary experiment
    - figure / table regeneration
+ - experiment type:
+   - component_ablation
+   - sensitivity
+   - robustness
+   - efficiency_cost
+   - highlight_validation
+   - failure_boundary
+   - case_study_optional
+ - tier:
+   - main_required
+   - main_optional
+   - appendix
+   - optional
  - minimum task:
  - required metric(s):
  - minimal success criterion:
+ - likely paper placement:
+   - main_text
+   - appendix
+   - maybe
+   - omit
  - expected manuscript impact:
  - owner / next step:
 
  ### TODO EXP-002
 
  - source review issue:
+ - matrix exp id:
  - why current evidence is insufficient:
  - route type:
+ - experiment type:
+ - tier:
  - minimum task:
  - required metric(s):
  - minimal success criterion:
+ - likely paper placement:
  - expected manuscript impact:
  - owner / next step:
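The new template fields each come with a closed list of options, so a filled-in TODO item can be checked mechanically. The sketch below assumes a TODO item has been parsed into a dict. The key names (`matrix_exp_id`, `experiment_type`, `tier`, `paper_placement`) and the helper `todo_problems` are hypothetical; only the option values are taken from the template.

```python
# Option values copied from the template; the dict keys are assumptions.
ALLOWED = {
    "experiment_type": {
        "component_ablation", "sensitivity", "robustness", "efficiency_cost",
        "highlight_validation", "failure_boundary", "case_study_optional",
    },
    "tier": {"main_required", "main_optional", "appendix", "optional"},
    "paper_placement": {"main_text", "appendix", "maybe", "omit"},
}


def todo_problems(item: dict) -> list:
    """List the ways a parsed TODO item departs from the template options."""
    problems = []
    if not item.get("matrix_exp_id"):
        problems.append("missing matrix_exp_id")
    for field_name, allowed in ALLOWED.items():
        value = item.get(field_name)
        if value not in allowed:
            problems.append(f"{field_name}: {value!r} is not a template option")
    return problems
```

An empty list means the item names a matrix row and uses only values the template offers.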
package/src/skills/review/references/review-report-template.md +16 -0
@@ -1,5 +1,11 @@
  # Review Report Template
 
+ ## Review mode
+
+ - review_followup_policy:
+ - manuscript_edit_mode:
+ - manuscript_source_status:
+
  ## Summary
 
  - paper / draft:
@@ -36,6 +42,8 @@
  - cause:
  - actionable fix:
  - acceptance criterion:
+ - copy-ready revision text:
+ - latex-ready revision text:
 
  ## Storyline Options + Writing Outlines
 
@@ -49,6 +57,14 @@
  2.
  3.
 
+ ## Manuscript Revision Package
+
+ - section:
+ - old wording / weakness:
+ - new wording:
+ - evidence basis:
+ - latex-ready replacement block:
+
  ## Experiment Inventory & Research Experiment Plan
 
  - what existing experiments already cover:
package/src/skills/review/references/revision-log-template.md +4 -0
@@ -3,6 +3,8 @@
  ## Revision Summary
 
  - current draft state:
+ - review_followup_policy:
+ - manuscript_edit_mode:
  - highest-priority fixes:
  - blockers:
 
@@ -20,6 +22,8 @@
  - supplementary experiment
  - claim downgrade
  - concrete change:
+ - copy-ready revision text:
+ - latex-ready revision text:
  - status:
  - blocks finalize:
 
package/src/skills/scout/SKILL.md +1 -1
@@ -10,7 +10,7 @@ Use this skill when the quest does not yet have a stable research frame.
  ## Interaction discipline
 
  - Follow the shared interaction contract injected by the system prompt.
- - For ordinary active work, prefer a concise progress update once work has crossed roughly 10 tool calls with a human-meaningful delta, and do not drift beyond roughly 20 tool calls or about 15 minutes without a user-visible update.
+ - For ordinary active work, prefer a concise progress update once work has crossed roughly 6 tool calls with a human-meaningful delta, and do not drift beyond roughly 12 tool calls or about 8 minutes without a user-visible update.
  - Message templates are references only. Adapt to the actual context and vary wording so updates feel natural and non-robotic.
  - If a threaded user reply arrives, interpret it relative to the latest scout progress update before assuming the task changed completely.
  - When scouting actually resolves the framing ambiguity, locks the evaluation contract, or makes the next anchor obvious, send one richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` update that says what is now clear, why it matters, and which stage should come next.
package/src/skills/write/SKILL.md +168 -7
@@ -20,7 +20,7 @@ This skill intentionally absorbs the strongest old DeepScientist writing discipl
  ## Interaction discipline
 
  - Follow the shared interaction contract injected by the system prompt.
- - For ordinary active work, prefer a concise progress update once work has crossed roughly 10 tool calls with a human-meaningful delta, and do not drift beyond roughly 20 tool calls or about 15 minutes without a user-visible update.
+ - For ordinary active work, prefer a concise progress update once work has crossed roughly 6 tool calls with a human-meaningful delta, and do not drift beyond roughly 12 tool calls or about 8 minutes without a user-visible update.
  - Prefer `bash_exec` for durable document-build commands such as LaTeX compilation, figure regeneration, and scripted export steps so logs remain quest-local and reviewable.
  - Keep ordinary subtask completions concise. When a paper/draft milestone is actually completed, upgrade to a richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` report instead of another short progress update.
  - That richer writing-stage milestone report should normally cover: which draft, section, or outline milestone finished, what is now supportable, what is still missing, and the exact recommended next revision or route decision.
@@ -146,6 +146,8 @@ The write stage should usually produce most of the following:
 
  - `paper/outline.md` or equivalent outline
  - `paper/selected_outline.json`
+ - `paper/paper_experiment_matrix.md`
+ - `paper/paper_experiment_matrix.json`
  - `paper/outline_selection.md`
  - `paper/reviewer_first_pass.md`
  - `paper/section_contracts.md`
@@ -202,6 +204,144 @@ At minimum, repeatedly verify:
  - figure and table provenance
  - file inclusion integrity for the draft or bundle
 
+ ## Paper experiment matrix contract
+
+ For any paper-like writing line that has more than a trivial single-result story, create and maintain:
+
+ - `paper/paper_experiment_matrix.md`
+ - `paper/paper_experiment_matrix.json`
+
+ Use `references/paper-experiment-matrix-template.md` when helpful.
+
+ The paper experiment matrix is the durable experiment-control surface for the paper line.
+ It exists to prevent two common failures:
+
+ - an outline that overweights post-hoc analysis and under-specifies paper-typical experiments
+ - a drifting supplementary-experiment queue where runs are launched ad hoc without a full paper-facing plan
+
+ The matrix is not just an “analysis list”.
+ It should cover the full paper-facing experiment program beyond the already-finished main run, including:
+
+ - main comparison surfaces that still need packaging or extension
+ - component ablations
+ - sensitivity / hyperparameter checks
+ - robustness or stress checks
+ - efficiency / cost / latency / token-overhead checks when the method may have a strong deployment or efficiency story
+ - highlight-validation experiments that test the method's most likely reader-facing strengths rather than merely assuming those strengths
+ - failure-boundary or limitation-surface analyses
+ - case study or trace walkthrough rows as optional supporting material rather than mandatory core evidence
+
+ Case study is usually optional.
+ Do not let it displace stronger quantitative evidence.
+ Efficiency or cost experiments are not mandatory in every paper, but they should be added whenever:
+
+ - the method may be attractive partly because it is lightweight or prompt-level
+ - the overhead skepticism from reviewers is easy to anticipate
+ - a performance-over-cost tradeoff could become part of the paper's practical contribution
+
+ Highlight-validation rule:
+
+ - do not assume the method's strongest selling point is already obvious from the aggregate metric
+ - explicitly write down `highlight hypotheses`
+ - plan at least one experiment that could confirm or falsify each serious highlight hypothesis
+
+ Typical highlight hypotheses include:
+
+ - the method is more selective rather than merely more conservative
+ - the gain comes from a named mechanism rather than from generic stubbornness or scale
+ - the improvement concentrates on the intended failure regime
+ - the method keeps a strong performance / overhead tradeoff
+
+ Each matrix row should normally record at least:
+
+ - `exp_id`
+ - `title`
+ - `tier`
+   - `main_required`
+   - `main_optional`
+   - `appendix`
+   - `optional`
+   - `dropped`
+ - `experiment_type`
+   - `main_comparison`
+   - `component_ablation`
+   - `sensitivity`
+   - `robustness`
+   - `efficiency_cost`
+   - `highlight_validation`
+   - `failure_boundary`
+   - `case_study_optional`
+ - `status`
+   - `proposed`
+   - `planned`
+   - `ready`
+   - `running`
+   - `completed`
+   - `analyzed`
+   - `written`
+   - `excluded`
+   - `blocked`
+ - `feasibility_now`
+   - whether the row is runnable with current assets or still blocked
+ - `claim_ids`
+ - `highlight_ids`
+ - `research_question`
+ - `hypothesis`
+ - `why_this_matters`
+ - `comparators`
+ - `fixed_conditions`
+ - `changed_variables`
+ - `metrics`
+ - `cost_budget`
+ - `minimal_success_criterion`
+ - `promotion_rule`
+   - what result would move the row into main text
+   - what result keeps it appendix-only
+   - what result should exclude it
+ - `paper_placement`
+   - `main_text`
+   - `appendix`
+   - `maybe`
+   - `omit`
+ - `result_artifacts`
+ - `next_action`
+
+ The matrix should also contain:
+
+ - core paper claims
+ - highlight hypotheses
+ - a short experiment taxonomy summary
+ - the current execution frontier
+ - an explicit main-text gate
+ - a refresh log that records how priorities changed after new evidence arrived
+
+ Main-text drafting gate:
+
+ - do not treat the main experiments section as stable while any row that is both:
+   - currently feasible
+   - and not marked `optional` or `dropped`
+   remains unaddressed
+ - before the experiments section becomes stable, every currently feasible row should be:
+   - `completed`
+   - `analyzed`
+   - `excluded` with a real reason
+   - or `blocked` with a real reason
+
+ This does not forbid drafting the introduction, method, or placeholders early.
+ It does forbid pretending the paper's experimental story is settled while the feasible experiment frontier is still open.
+
+ After every meaningful experiment outcome, even a null result or exclusion:
+
+ - reopen the matrix first
+ - update the row status and feasibility
+ - update `paper_placement`
+ - update the claim and highlight impact
+ - update the priority order of the remaining rows
+ - then decide the next experiment or writing move
+
+ Do not decide the next supplementary experiment from memory alone when the matrix exists.
+ The matrix should be the authoritative experiment-routing surface for the paper line, and the selected outline's `experimental_designs` should stay consistent with that matrix rather than drifting away from it.
+
  ## Venue template selection
 
  For paper-like writing, use a real venue template rather than improvising a blank LaTeX tree.
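The row fields and their allowed values in the matrix contract map naturally onto a small typed record. The Python sketch below covers only a subset of the listed fields and is an illustration, not the package's schema. The class `MatrixRow`, the helper `dump_matrix`, the default values, and the top-level `{"rows": [...]}` JSON layout are all assumptions; the actual layout of `paper/paper_experiment_matrix.json` is not shown in this diff.

```python
import json
from dataclasses import asdict, dataclass, field

# Allowed values copied from the contract text.
TIERS = {"main_required", "main_optional", "appendix", "optional", "dropped"}
EXPERIMENT_TYPES = {
    "main_comparison", "component_ablation", "sensitivity", "robustness",
    "efficiency_cost", "highlight_validation", "failure_boundary",
    "case_study_optional",
}
STATUSES = {
    "proposed", "planned", "ready", "running", "completed",
    "analyzed", "written", "excluded", "blocked",
}
PLACEMENTS = {"main_text", "appendix", "maybe", "omit"}


@dataclass
class MatrixRow:
    """A subset of the row fields; defaults are illustrative assumptions."""
    exp_id: str
    title: str
    tier: str
    experiment_type: str
    status: str = "proposed"
    paper_placement: str = "maybe"
    claim_ids: list = field(default_factory=list)
    highlight_ids: list = field(default_factory=list)
    next_action: str = ""

    def __post_init__(self):
        # Reject values outside the enumerations listed in the contract.
        checks = (
            ("tier", TIERS),
            ("experiment_type", EXPERIMENT_TYPES),
            ("status", STATUSES),
            ("paper_placement", PLACEMENTS),
        )
        for name, allowed in checks:
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")


def dump_matrix(rows: list) -> str:
    """Serialize rows to JSON text (the top-level layout is an assumption)."""
    return json.dumps({"rows": [asdict(row) for row in rows]}, indent=2)
```

Validating at construction time keeps a typo such as `tier="core"` from silently bypassing the main-text gate, which depends on exact tier and status values.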
@@ -246,18 +386,20 @@ For paper-like deliverables, the safest default order is:
  3. choose the venue template from `templates/`, copy it into `paper/latex/`, and default general ML work to `templates/iclr2026/` unless a stronger venue target exists
  4. if the line benefits from an explicit outline contract, record one or more outline candidates with `artifact.submit_paper_outline(mode='candidate', ...)`
  5. if one outline should become the durable paper contract, select or revise it with `artifact.submit_paper_outline(mode='select'|'revise', ...)`
- 6. if the selected outline still exposes evidence gaps, launch an outline-bound `artifact.create_analysis_campaign(...)` before drafting
- 7. plan and generate decisive figures or tables
- 8. draft sections directly from the evidence and the current working outline; do not force extra outline rounds when direct drafting is clearer and safer
- 9. run harsh review and revision cycles
- 10. proof, package, submit `artifact.submit_paper_bundle(...)` when the bundle is ready, and then pass to `finalize`
- 11. if the final paper PDF exists and QQ milestone media is enabled in config, the bundle-ready milestone may attach that PDF once
+ 6. create or refresh `paper/paper_experiment_matrix.md` and `paper/paper_experiment_matrix.json` before stabilizing the experiments section
+ 7. if the selected outline or matrix still exposes evidence gaps, launch an outline-bound and matrix-bound `artifact.create_analysis_campaign(...)` before drafting the experiments section as if it were settled
+ 8. plan and generate decisive figures or tables
+ 9. draft sections directly from the evidence and the current working outline; do not force extra outline rounds when direct drafting is clearer and safer
+ 10. run harsh review and revision cycles
+ 11. proof, package, submit `artifact.submit_paper_bundle(...)` when the bundle is ready, and then pass to `finalize`
+ 12. if the final paper PDF exists and QQ milestone media is enabled in config, the bundle-ready milestone may attach that PDF once
 
  Before real drafting, force one explicit planning pass that stabilizes at least:
 
  - the current claim inventory
  - the claim-evidence map skeleton
  - the outline or outline candidates
+ - the paper experiment matrix
  - the figure/table plan
  - the main evidence gaps
 
@@ -273,6 +415,7 @@ For substantial paper-like writing, the durable writing plan should usually incl
 
  - section goals
  - paragraph or subsection intent when it materially affects correctness
+ - paper experiment matrix status and execution frontier
  - experiment-to-section mapping
  - figure/table-to-data-source mapping
  - citation/search plan
@@ -284,6 +427,7 @@ Do not let drafting quietly outrun the current evidence inventory.
 
  For reviewer-facing structure and section-level drafting contracts, read these references when the line needs sharper paper craft:
 
+ - `references/paper-experiment-matrix-template.md`
  - `references/reviewer-first-writing.md`
  - `references/section-contracts.md`
  - `references/sentence-level-proofing.md`
@@ -306,6 +450,21 @@ Also build an experiment inventory before outlining:
306
450
  - appendix-only evidence
307
451
  - unusable or too-weak evidence
308
452
  - verify that each planned main claim has at least one durable evidence path
453
+ - convert that inventory into the paper experiment matrix instead of leaving it as loose notes
454
+
455
+ When building the matrix, do not reduce the candidate pool to “analysis experiments”.
456
+ The inventory should explicitly consider:
457
+
458
+ - ablations
459
+ - robustness checks
460
+ - sensitivity or hyperparameter checks
461
+ - efficiency / cost / latency / token-overhead checks
462
+ - experiments aimed at validating likely highlights
463
+ - limitation-boundary analyses
464
+ - optional case studies
465
+
466
+ If the method appears to have a likely practical or deployment-facing strength, test it directly instead of burying that possibility in prose.
467
+ If the method appears to have a likely conceptual highlight, write the corresponding `highlight hypothesis` and treat it as something that still needs evidence rather than something to assume.
309
468
 
310
469
  If an experiment is too weak, too tiny, or poorly comparable, do not let it silently anchor a main claim.
311
470
  As a strong default, experiments with very small evaluation support, such as `<=10` effective examples or similarly fragile sample counts, should not carry a main-text claim unless the user explicitly accepts that limitation and the caveat is written next to the claim.
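The small-sample default above can be sketched as a guard. This is a minimal illustration, not code from the package: `EvidenceItem`, `may_anchor_main_claim`, and the constant name are hypothetical, and only the sample-count part of the rule is modeled (weak comparability still needs human judgment).

```python
from dataclasses import dataclass

# Assumption: mirrors the `<=10` default quoted in the text; not a package constant.
FRAGILE_SAMPLE_THRESHOLD = 10


@dataclass
class EvidenceItem:
    """Hypothetical record for one experiment considered as claim support."""
    name: str
    effective_examples: int
    user_accepted_limitation: bool = False
    caveat_written_next_to_claim: bool = False


def may_anchor_main_claim(item: EvidenceItem) -> bool:
    """Return True if the item may carry a main-text claim under the default rule."""
    if item.effective_examples > FRAGILE_SAMPLE_THRESHOLD:
        return True
    # Fragile support: both the explicit user acceptance and the written caveat are required.
    return item.user_accepted_limitation and item.caveat_written_next_to_claim
```

An item with 8 effective examples is rejected by this sketch unless both flags are set, which matches the "user explicitly accepts that limitation and the caveat is written next to the claim" wording.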
@@ -1083,3 +1242,5 @@ Exit the write stage only when one of the following is durably true:
1083
1242
  - the current draft is evidence-complete enough for `finalize`, including a selected outline and a durable paper bundle manifest when the deliverable is paper-like
1084
1243
  - a clear evidence gap has been recorded and the quest is routed backward
1085
1244
  - a packaging or proofing blocker has been recorded and the next action is explicit
1245
+
1246
+ For paper-like writing, do not treat the draft as evidence-complete enough for `finalize` while `paper/paper_experiment_matrix.*` still contains currently feasible non-optional rows that remain unresolved.
@@ -0,0 +1,131 @@
1
+ # Paper Experiment Matrix Template
2
+
3
+ Use this template when a paper-like line needs a durable experiment-control surface beyond the selected outline.
4
+
5
+ Create and maintain both:
6
+
7
+ - `paper/paper_experiment_matrix.md`
8
+ - `paper/paper_experiment_matrix.json`
9
+
10
+ The Markdown file is the human-facing control surface.
11
+ The JSON file is the machine-facing mirror.
12
+
13
+ ## 1. Current Judgment
14
+
15
+ - current judgment:
16
+ - why the matrix is needed now:
17
+ - what would make the experiments section stable:
18
+ - what still blocks stable paper writing:
19
+
20
+ ## 2. Core Claims
21
+
22
+ - `C1`:
23
+ - one-line claim:
24
+ - current support status:
25
+ - strongest current evidence:
26
+ - still-missing evidence:
27
+
28
+ - `C2`:
29
+ - one-line claim:
30
+ - current support status:
31
+ - strongest current evidence:
32
+ - still-missing evidence:
33
+
34
+ - `C3`:
35
+ - one-line claim:
36
+ - current support status:
37
+ - strongest current evidence:
38
+ - still-missing evidence:
39
+
40
+ ## 3. Highlight Hypotheses
41
+
42
+ Write only serious hypotheses that could matter to the paper's reader-facing value.
43
+ Do not assume the highlight is already true just because it sounds attractive.
44
+
45
+ - `H1`:
46
+ - one-line highlight:
47
+ - why it is plausible:
48
+ - validation rows:
49
+ - fallback if unsupported:
50
+
51
+ - `H2`:
52
+ - one-line highlight:
53
+ - why it is plausible:
54
+ - validation rows:
55
+ - fallback if unsupported:
56
+
57
+ ## 4. Taxonomy Summary
58
+
59
+ Check every category deliberately.
60
+ Do not collapse the matrix into “analysis only”.
61
+
62
+ - main comparison:
63
+ - component ablation:
64
+ - sensitivity:
65
+ - robustness:
66
+ - efficiency / cost:
67
+ - highlight validation:
68
+ - failure boundary:
69
+ - case study (optional):
70
+
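The taxonomy check can be partly mechanized. The following sketch lists which experiment types have no row yet; the type names come from the Matrix Table section, while the row shape (`exp_id`, `experiment_type` keys) and function names are assumptions for illustration. An empty category is not automatically an error, it only signals that the category still needs a deliberate decision.

```python
# Experiment types as enumerated in the matrix table of this template.
EXPERIMENT_TYPES = (
    "main_comparison",
    "component_ablation",
    "sensitivity",
    "robustness",
    "efficiency_cost",
    "highlight_validation",
    "failure_boundary",
    "case_study_optional",
)


def taxonomy_summary(rows):
    """Map each experiment type to the ids of the rows that cover it."""
    summary = {exp_type: [] for exp_type in EXPERIMENT_TYPES}
    for row in rows:
        # Unknown types are kept rather than dropped so they stay visible.
        summary.setdefault(row["experiment_type"], []).append(row["exp_id"])
    return summary


def unchecked_categories(rows):
    """Return the known experiment types that currently have no row."""
    summary = taxonomy_summary(rows)
    return [exp_type for exp_type in EXPERIMENT_TYPES if not summary[exp_type]]
```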
71
+ ## 5. Matrix Table
72
+
73
+ | Exp id | Title | Tier | Experiment type | Status | Feasibility now | Claim ids | Highlight ids | Research question | Metrics | Paper placement | Next action |
74
+ |---|---|---|---|---|---|---|---|---|---|---|---|
75
+ | | | main_required / main_optional / appendix / optional / dropped | main_comparison / component_ablation / sensitivity / robustness / efficiency_cost / highlight_validation / failure_boundary / case_study_optional | proposed / planned / ready / running / completed / analyzed / written / excluded / blocked | feasible_now / light_setup / blocked / uncertain | | | | | main_text / appendix / maybe / omit | |
76
+
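Because the table row above fixes a closed vocabulary for several columns, a small validator can catch typos before they reach the JSON mirror. The allowed values below are copied from that row; the field names and the `validate_row` helper are hypothetical.

```python
# Allowed values per column, copied from the template row; the keys are assumed field names.
ALLOWED_VALUES = {
    "tier": {"main_required", "main_optional", "appendix", "optional", "dropped"},
    "experiment_type": {
        "main_comparison", "component_ablation", "sensitivity", "robustness",
        "efficiency_cost", "highlight_validation", "failure_boundary", "case_study_optional",
    },
    "status": {
        "proposed", "planned", "ready", "running", "completed",
        "analyzed", "written", "excluded", "blocked",
    },
    "feasibility": {"feasible_now", "light_setup", "blocked", "uncertain"},
    "paper_placement": {"main_text", "appendix", "maybe", "omit"},
}


def validate_row(row):
    """Return a list of human-readable problems; an empty list means the row is valid."""
    problems = []
    for field, allowed in ALLOWED_VALUES.items():
        value = row.get(field)
        if value not in allowed:
            problems.append(f"{field}: {value!r} is not one of {sorted(allowed)}")
    return problems
```

Returning a list of problems instead of raising lets a caller report every bad cell in a row at once.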
77
+ ## 6. Detail Cards
78
+
79
+ Repeat one card per meaningful row.
80
+
81
+ ### EXP-001
82
+
83
+ - title:
84
+ - tier:
85
+ - experiment type:
86
+ - current status:
87
+ - feasibility now:
88
+ - why this row exists:
89
+ - research question:
90
+ - hypothesis:
91
+ - comparators:
92
+ - fixed conditions:
93
+ - changed variables:
94
+ - required metric(s):
95
+ - minimal success criterion:
96
+ - cost / runtime budget:
97
+ - promotion rule:
98
+ - main text if:
99
+ - appendix if:
100
+ - omit if:
101
+ - expected figure or table:
102
+ - result artifact paths:
103
+ - dependencies:
104
+ - next action:
105
+
106
+ ## 7. Execution Frontier
107
+
108
+ - rows ready now:
109
+ - rows blocked now:
110
+ - rows that must finish before the experiments section is stable:
111
+ - rows that are appendix-only and can wait:
112
+ - rows that are optional and should not block:
113
+
114
+ ## 8. Main-Text Gate
115
+
116
+ Do not treat the experiments section as stable while any currently feasible row that is not merely `optional` or `dropped` remains unresolved.
117
+
118
+ Every currently feasible non-optional row should be one of:
119
+
120
+ - completed
121
+ - analyzed
122
+ - excluded with reason
123
+ - blocked with reason
124
+
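The gate above reads naturally as a predicate over the matrix rows. This is a sketch under stated assumptions: "currently feasible" is taken to mean `feasible_now` only, `written` is treated as resolved even though the list above does not name it, and the `reason` field is a hypothetical way to record "with reason".

```python
# Statuses the gate lists as acceptable; "written" is added as an assumption.
RESOLVED_STATUSES = {"completed", "analyzed", "written", "excluded", "blocked"}
# Statuses that only count as resolved when a reason is recorded.
NEEDS_REASON = {"excluded", "blocked"}
# Tiers that never block the experiments section.
NON_BLOCKING_TIERS = {"optional", "dropped"}


def unresolved_rows(rows):
    """Return ids of feasible, non-optional rows that still block the gate."""
    pending = []
    for row in rows:
        if row.get("feasibility") != "feasible_now":
            continue  # assumption: only `feasible_now` counts as currently feasible
        if row.get("tier") in NON_BLOCKING_TIERS:
            continue
        status = row.get("status")
        resolved = status in RESOLVED_STATUSES
        if status in NEEDS_REASON and not row.get("reason"):
            resolved = False  # excluded/blocked without a reason is not resolved
        if not resolved:
            pending.append(row["exp_id"])
    return pending


def experiments_section_is_stable(rows):
    """True when no feasible, non-optional row remains unresolved."""
    return not unresolved_rows(rows)
```

The same predicate could back the `finalize` exit check, since that rule is phrased in terms of the same unresolved rows.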
125
+ ## 9. Refresh Log
126
+
127
+ After every completed, excluded, or blocked slice, reopen the matrix first and update it before selecting the next run.
128
+
129
+ | Time | Exp id | What changed | Claim/highlight impact | Priority change | New next action |
130
+ |---|---|---|---|---|---|
131
+ | | | | | | |
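A refresh-log entry can be formatted by a small helper so every update has the same six columns. The sketch below is illustrative only: `format_refresh_row` and its parameter names are hypothetical, and the UTC ISO timestamp is one possible choice for the `Time` column.

```python
from datetime import datetime, timezone


def format_refresh_row(exp_id, what_changed, impact, priority_change, next_action, when=None):
    """Format one refresh-log table row in the template's column order."""
    # Assumption: timestamps are recorded in UTC, ISO 8601, to the second.
    when = when or datetime.now(timezone.utc)
    cells = [
        when.isoformat(timespec="seconds"),
        exp_id,
        what_changed,
        impact,
        priority_change,
        next_action,
    ]
    # Escape pipes so free-text cells cannot break the table layout.
    return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"
```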