@researai/deepscientist 1.5.11 → 1.5.12

Files changed (102)
  1. package/README.md +8 -8
  2. package/bin/ds.js +358 -61
  3. package/docs/en/00_QUICK_START.md +35 -3
  4. package/docs/en/01_SETTINGS_REFERENCE.md +11 -0
  5. package/docs/en/02_START_RESEARCH_GUIDE.md +68 -4
  6. package/docs/en/09_DOCTOR.md +28 -3
  7. package/docs/en/12_GUIDED_WORKFLOW_TOUR.md +21 -2
  8. package/docs/en/15_CODEX_PROVIDER_SETUP.md +284 -0
  9. package/docs/en/README.md +4 -0
  10. package/docs/zh/00_QUICK_START.md +34 -2
  11. package/docs/zh/01_SETTINGS_REFERENCE.md +11 -0
  12. package/docs/zh/02_START_RESEARCH_GUIDE.md +69 -3
  13. package/docs/zh/09_DOCTOR.md +28 -1
  14. package/docs/zh/12_GUIDED_WORKFLOW_TOUR.md +21 -2
  15. package/docs/zh/15_CODEX_PROVIDER_SETUP.md +285 -0
  16. package/docs/zh/README.md +4 -1
  17. package/package.json +1 -1
  18. package/pyproject.toml +1 -1
  19. package/src/deepscientist/__init__.py +1 -1
  20. package/src/deepscientist/bash_exec/monitor.py +7 -5
  21. package/src/deepscientist/bash_exec/service.py +84 -21
  22. package/src/deepscientist/channels/local.py +3 -3
  23. package/src/deepscientist/channels/qq.py +7 -7
  24. package/src/deepscientist/channels/relay.py +7 -7
  25. package/src/deepscientist/channels/weixin_ilink.py +90 -19
  26. package/src/deepscientist/config/models.py +1 -0
  27. package/src/deepscientist/config/service.py +121 -20
  28. package/src/deepscientist/daemon/app.py +314 -6
  29. package/src/deepscientist/doctor.py +1 -5
  30. package/src/deepscientist/mcp/server.py +124 -3
  31. package/src/deepscientist/prompts/builder.py +113 -11
  32. package/src/deepscientist/quest/service.py +247 -31
  33. package/src/deepscientist/runners/codex.py +121 -22
  34. package/src/deepscientist/runners/runtime_overrides.py +6 -0
  35. package/src/deepscientist/shared.py +33 -14
  36. package/src/prompts/connectors/qq.md +2 -1
  37. package/src/prompts/connectors/weixin.md +2 -1
  38. package/src/prompts/contracts/shared_interaction.md +4 -1
  39. package/src/prompts/system.md +59 -9
  40. package/src/skills/analysis-campaign/SKILL.md +46 -6
  41. package/src/skills/analysis-campaign/references/campaign-plan-template.md +21 -8
  42. package/src/skills/baseline/SKILL.md +1 -1
  43. package/src/skills/decision/SKILL.md +1 -1
  44. package/src/skills/experiment/SKILL.md +1 -1
  45. package/src/skills/finalize/SKILL.md +1 -1
  46. package/src/skills/idea/SKILL.md +1 -1
  47. package/src/skills/intake-audit/SKILL.md +1 -1
  48. package/src/skills/rebuttal/SKILL.md +74 -1
  49. package/src/skills/rebuttal/references/response-letter-template.md +55 -11
  50. package/src/skills/review/SKILL.md +118 -1
  51. package/src/skills/review/references/experiment-todo-template.md +23 -0
  52. package/src/skills/review/references/review-report-template.md +16 -0
  53. package/src/skills/review/references/revision-log-template.md +4 -0
  54. package/src/skills/scout/SKILL.md +1 -1
  55. package/src/skills/write/SKILL.md +168 -7
  56. package/src/skills/write/references/paper-experiment-matrix-template.md +131 -0
  57. package/src/tui/package.json +1 -1
  58. package/src/ui/dist/assets/{AiManusChatView-D0mTXG4-.js → AiManusChatView-CnJcXynW.js} +12 -12
  59. package/src/ui/dist/assets/{AnalysisPlugin-Db0cTXxm.js → AnalysisPlugin-DeyzPEhV.js} +1 -1
  60. package/src/ui/dist/assets/{CliPlugin-DrV8je02.js → CliPlugin-CB1YODQn.js} +9 -9
  61. package/src/ui/dist/assets/{CodeEditorPlugin-QXMSCH71.js → CodeEditorPlugin-B-xicq1e.js} +8 -8
  62. package/src/ui/dist/assets/{CodeViewerPlugin-7hhtWj_E.js → CodeViewerPlugin-DT54ysXa.js} +5 -5
  63. package/src/ui/dist/assets/{DocViewerPlugin-BWMSnRJe.js → DocViewerPlugin-DQtKT-VD.js} +3 -3
  64. package/src/ui/dist/assets/{GitDiffViewerPlugin-7J9h9Vy_.js → GitDiffViewerPlugin-hqHbCfnv.js} +20 -20
  65. package/src/ui/dist/assets/{ImageViewerPlugin-CHJl_0lr.js → ImageViewerPlugin-OcVo33jV.js} +5 -5
  66. package/src/ui/dist/assets/{LabCopilotPanel-1qSow1es.js → LabCopilotPanel-DdGwhEUV.js} +11 -11
  67. package/src/ui/dist/assets/{LabPlugin-eQpPPCEp.js → LabPlugin-Ciz1gDaX.js} +2 -2
  68. package/src/ui/dist/assets/{LatexPlugin-BwRfi89Z.js → LatexPlugin-BhmjNQRC.js} +37 -11
  69. package/src/ui/dist/assets/{MarkdownViewerPlugin-836PVQWV.js → MarkdownViewerPlugin-BzdVH9Bx.js} +4 -4
  70. package/src/ui/dist/assets/{MarketplacePlugin-C2y_556i.js → MarketplacePlugin-DmyHspXt.js} +3 -3
  71. package/src/ui/dist/assets/{NotebookEditor-DIX7Mlzu.js → NotebookEditor-BMXKrDRk.js} +1 -1
  72. package/src/ui/dist/assets/{NotebookEditor-BRzJbGsn.js → NotebookEditor-BTVYRGkm.js} +11 -11
  73. package/src/ui/dist/assets/{PdfLoader-DzRaTAlq.js → PdfLoader-CvcjJHXv.js} +1 -1
  74. package/src/ui/dist/assets/{PdfMarkdownPlugin-DZUfIUnp.js → PdfMarkdownPlugin-DW2ej8Vk.js} +2 -2
  75. package/src/ui/dist/assets/{PdfViewerPlugin-BwtICzue.js → PdfViewerPlugin-CmlDxbhU.js} +10 -10
  76. package/src/ui/dist/assets/{SearchPlugin-DHeIAMsx.js → SearchPlugin-DAjQZPSv.js} +1 -1
  77. package/src/ui/dist/assets/{TextViewerPlugin-C3tCmFox.js → TextViewerPlugin-C-nVAZb_.js} +5 -5
  78. package/src/ui/dist/assets/{VNCViewer-CQsKVm3t.js → VNCViewer-D7-dIYon.js} +10 -10
  79. package/src/ui/dist/assets/{bot-BEA2vWuK.js → bot-C_G4WtNI.js} +1 -1
  80. package/src/ui/dist/assets/{code-XfbSR8K2.js → code-Cd7WfiWq.js} +1 -1
  81. package/src/ui/dist/assets/{file-content-BjxNaIfy.js → file-content-B57zsL9y.js} +1 -1
  82. package/src/ui/dist/assets/{file-diff-panel-D_lLVQk0.js → file-diff-panel-DVoheLFq.js} +1 -1
  83. package/src/ui/dist/assets/{file-socket-D9x_5vlY.js → file-socket-B5kXFxZP.js} +1 -1
  84. package/src/ui/dist/assets/{image-BhWT33W1.js → image-LLOjkMHF.js} +1 -1
  85. package/src/ui/dist/assets/{index-Dqj-Mjb4.css → index-BQG-1s2o.css} +40 -2
  86. package/src/ui/dist/assets/{index--c4iXtuy.js → index-C3r2iGrp.js} +12 -12
  87. package/src/ui/dist/assets/{index-DZTZ8mWP.js → index-CLQauncb.js} +911 -120
  88. package/src/ui/dist/assets/{index-PJbSbPTy.js → index-Dxa2eYMY.js} +1 -1
  89. package/src/ui/dist/assets/{index-BDxipwrC.js → index-hOUOWbW2.js} +2 -2
  90. package/src/ui/dist/assets/{monaco-K8izTGgo.js → monaco-BGGAEii3.js} +1 -1
  91. package/src/ui/dist/assets/{pdf-effect-queue-DfBors6y.js → pdf-effect-queue-DlEr1_y5.js} +1 -1
  92. package/src/ui/dist/assets/{popover-yFK1J4fL.js → popover-CWJbJuYY.js} +1 -1
  93. package/src/ui/dist/assets/{project-sync-PENr2zcz.js → project-sync-CRJiucYO.js} +18 -4
  94. package/src/ui/dist/assets/{select-CAbJDfYv.js → select-CoHB7pvH.js} +2 -2
  95. package/src/ui/dist/assets/{sigma-DEuYJqTl.js → sigma-D5aJWR8J.js} +1 -1
  96. package/src/ui/dist/assets/{square-check-big-omoSUmcd.js → square-check-big-DUK_mnkS.js} +1 -1
  97. package/src/ui/dist/assets/{trash--F119N47.js → trash-ChU3SEE3.js} +1 -1
  98. package/src/ui/dist/assets/{useCliAccess-D31UR23I.js → useCliAccess-BrJBV3tY.js} +1 -1
  99. package/src/ui/dist/assets/{useFileDiffOverlay-BH6KcMzq.js → useFileDiffOverlay-C2OQaVWc.js} +1 -1
  100. package/src/ui/dist/assets/{wrap-text-CZ613PM5.js → wrap-text-C7Qqh-om.js} +1 -1
  101. package/src/ui/dist/assets/{zoom-out-BgDLAv3z.js → zoom-out-rtX0FKya.js} +1 -1
  102. package/src/ui/dist/index.html +2 -2
@@ -15,12 +15,19 @@ Use the same route for:
  - rebuttal-driven extra experiments
  - writing-driven evidence gaps
 
+ For paper-facing work, treat “analysis campaign” broadly:
+
+ - not only post-hoc interpretation
+ - also ablations, sensitivity checks, robustness checks, efficiency or cost checks, highlight-validation runs, and limitation-boundary work beyond the main result
+
+ Do not assume a writing-facing campaign means “analysis only”.
+
  Do not invent a separate experiment system for those cases.
 
  ## Interaction discipline
 
  - Follow the shared interaction contract injected by the system prompt.
- - For ordinary active work, prefer a concise progress update once work has crossed roughly 10 tool calls with a human-meaningful delta, and do not drift beyond roughly 20 tool calls or about 15 minutes without a user-visible update.
+ - For ordinary active work, prefer a concise progress update once work has crossed roughly 6 tool calls with a human-meaningful delta, and do not drift beyond roughly 12 tool calls or about 8 minutes without a user-visible update.
  - Prefer `bash_exec` for campaign slice commands so each run has a durable session id, quest-local log folder, and later `read/list/kill` control.
  - Keep ordinary subtask completions concise. When an analysis campaign or a stage-significant campaign checkpoint is complete, upgrade to a richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` report.
  - That richer campaign milestone report should normally cover: which slices completed, the main takeaway, whether the claim got stronger or weaker, and the exact recommended next route.
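
The tightened cadence in this hunk (a preferred update after roughly 6 tool calls with a meaningful delta, and a ceiling of roughly 12 tool calls or about 8 minutes) could be tracked along the following lines. This is a minimal sketch, not the package's implementation: `UpdateCadence`, its field names, and the use of exact `>=` comparisons for "roughly" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class UpdateCadence:
    """Hypothetical tracker for the progress-update guidance (all names are assumptions)."""

    soft_tool_calls: int = 6      # "roughly 6 tool calls" with a human-meaningful delta
    hard_tool_calls: int = 12     # "roughly 12 tool calls" without a user-visible update
    hard_seconds: float = 8 * 60  # "about 8 minutes" without a user-visible update

    def decide(self, tool_calls: int, elapsed_seconds: float, meaningful_delta: bool) -> str:
        """Return 'required', 'preferred', or 'none' for the work since the last update."""
        if tool_calls >= self.hard_tool_calls or elapsed_seconds >= self.hard_seconds:
            return "required"
        if tool_calls >= self.soft_tool_calls and meaningful_delta:
            return "preferred"
        return "none"


cadence = UpdateCadence()
print(cadence.decide(tool_calls=7, elapsed_seconds=120, meaningful_delta=True))
print(cadence.decide(tool_calls=3, elapsed_seconds=500, meaningful_delta=False))
```

Keeping the soft and hard limits as separate fields mirrors the wording of the rule: the soft limit only matters when there is something worth reporting, while the hard limits apply regardless.
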
@@ -69,11 +76,12 @@ For campaign prioritization and writing-facing slice design, read `references/ca
  Treat this as the compressed campaign map. The authoritative slice protocol and aggregation rules remain in `Workflow`.
 
  1. Bind the campaign to the parent run or idea and, when writing-facing, to the selected outline.
- 2. Before launching slices, create `PLAN.md` and `CHECKLIST.md`.
- 3. Use `PLAN.md` as the durable charter and `CHECKLIST.md` as the living execution surface while launching, monitoring, recording, and aggregating slices.
- 4. Run claim-critical slices first and smoke-test long slices before their real runs.
- 5. Revise the plan if slice feasibility, ordering, comparators, or campaign interpretation changes materially, and record every slice durably, including honest non-success states.
- 6. Close meaningful campaign milestones with a concise `1-2` sentence summary that says whether the claim gained stable support, partial support, contradiction, or unresolved ambiguity, and what happens next.
+ 2. When the campaign is writing-facing, refresh `paper/paper_experiment_matrix.*` before freezing the slice frontier.
+ 3. Before launching slices, create `PLAN.md` and `CHECKLIST.md`.
+ 4. Use `PLAN.md` as the durable charter and `CHECKLIST.md` as the living execution surface while launching, monitoring, recording, and aggregating slices.
+ 5. Run claim-critical slices first and smoke-test long slices before their real runs.
+ 6. Revise the plan and matrix if slice feasibility, ordering, comparators, or campaign interpretation changes materially, and record every slice durably, including honest non-success states.
+ 7. Close meaningful campaign milestones with a concise `1-2` sentence summary that says whether the claim gained stable support, partial support, contradiction, or unresolved ambiguity, what the matrix frontier now looks like, and what happens next.
 
  ## Non-negotiable rules
 
@@ -83,6 +91,8 @@ Treat this as the compressed campaign map. The authoritative slice protocol and
  - Every analysis slice must have a specific research question and a falsifiable or at least decision-relevant expectation.
  - If the campaign is supporting a paper or paper-like report, do not launch it until a selected outline exists.
  - When a selected outline exists, every slice should map to a named `research_question` and `experimental_design` from that outline.
+ - When the campaign is supporting a paper or paper-like report, do not launch or reorder the slice set without first reading `paper/paper_experiment_matrix.md` when it exists.
+ - For writing-facing campaigns, every slice should correspond to a stable matrix row such as `exp_id`, not just a free-form note.
  - Do not aggregate campaign conclusions without per-run evidence.
  - Do not bury null or contradictory findings.
 
@@ -110,6 +120,7 @@ Before launching a campaign, confirm:
  - the list of specific analysis questions
  - the current quest / user-provided assets that each planned slice will actually use
  - whether each slice is executable with the current assets, tooling, and available credentials
+ - for paper-facing campaigns, the current paper experiment matrix frontier and which rows are actually feasible now
  - if durable state exposes `active_baseline_metric_contract_json`, read that JSON file before defining slice success criteria or comparison tables
  - treat `active_baseline_metric_contract_json` as the default baseline comparison contract unless a slice is explicitly testing a different evaluation contract
 
@@ -150,6 +161,8 @@ A campaign should usually leave behind:
 
  - a campaign identifier
  - a selected outline reference when the campaign is writing-facing
+ - a refreshed `paper/paper_experiment_matrix.md`
+ - a refreshed `paper/paper_experiment_matrix.json`
  - one directory per analysis run
  - any supplementary baseline reproduced for analysis under `baselines/local/<baseline_id>/` or attached under `baselines/imported/<baseline_id>/`
  - one quest-level supplementary baseline inventory at `artifacts/baselines/analysis_inventory.json`
@@ -198,17 +211,28 @@ If the campaign exists to support a paper or paper-like report:
 
  - do not proceed until one selected outline exists
  - if no selected outline exists yet, route to `write` or `decision` first so the outline can be created and selected durably
+ - before deciding the slice list, create or refresh `paper/paper_experiment_matrix.md` when it is missing or stale
+ - treat that matrix as the upstream paper experiment contract, not `todo_items` alone
+ - use the matrix to decide:
+ - which rows are `main_required`
+ - which are `main_optional`
+ - which are appendix-only
+ - which are optional or should be dropped
+ - do not start stable experiments-section drafting while currently feasible non-optional matrix rows remain unresolved
  - call `artifact.create_analysis_campaign(...)` with:
  - `selected_outline_ref`
  - `research_questions`
  - `experimental_designs`
  - `todo_items`
  - ensure each todo item names at least:
+ - `exp_id`
  - `todo_id`
  - `slice_id`
  - `title`
  - `research_question`
  - `experimental_design`
+ - `tier`
+ - `paper_placement`
  - `completion_condition`
 
  This keeps the analysis campaign aligned with the paper plan instead of becoming a free-floating batch of slices.
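
The expanded todo-item contract above can be checked before a campaign is created. The sketch below is illustrative only: the field names come from the list in this hunk, but `missing_todo_fields`, the dict representation, and every example value are hypothetical and not part of the package.

```python
# Field names taken from the "ensure each todo item names at least" list.
REQUIRED_TODO_FIELDS = (
    "exp_id",
    "todo_id",
    "slice_id",
    "title",
    "research_question",
    "experimental_design",
    "tier",
    "paper_placement",
    "completion_condition",
)


def missing_todo_fields(item):
    """Return the required field names that are absent or blank in one todo item."""
    return [
        name
        for name in REQUIRED_TODO_FIELDS
        if not str(item.get(name) or "").strip()
    ]


# Hypothetical todo item; ids and text are invented for the example.
todo_item = {
    "exp_id": "exp-003",
    "todo_id": "todo-003",
    "slice_id": "slice-lr-sensitivity",
    "title": "Learning-rate sensitivity",
    "research_question": "Is the gain stable across learning rates?",
    "experimental_design": "Sweep three learning rates with fixed seeds",
    "tier": "main_required",
    "paper_placement": "main_text",
    "completion_condition": "",  # left blank on purpose to show the check
}

print(missing_todo_fields(todo_item))
```

A check like this would run before the `todo_items` list is handed to `artifact.create_analysis_campaign(...)`; how that call validates its own input is not shown in this diff.
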
@@ -229,6 +253,7 @@ The charter should also include:
  - campaign type priority order
  - expected slice count
  - dependency structure between slices
+ - the matrix path and current execution frontier
  - whether any slice requires isolated code changes or only reruns/config changes
  - the top-level success condition for ending the campaign
  - the top-level abandonment condition for stopping it early
@@ -238,6 +263,7 @@ Prefer to keep this charter in `PLAN.md` first and mirror the execution frontier
  For each analysis question, also state:
 
  - why it matters to the main claim
+ - whether it exists mainly to support a core claim, validate a highlight, answer an efficiency or cost concern, or bound a limitation
  - what result would strengthen the claim
  - what result would weaken or complicate the claim
  - whether the run is:
@@ -267,6 +293,8 @@ Each analysis run should correspond to one need, such as:
  - run additional seeds
  - inspect one failure bucket
  - test one environment variation
+ - measure one efficiency or cost dimension
+ - validate one highlight hypothesis
 
  Avoid changing many factors at once unless the campaign is explicitly exploratory.
 
@@ -283,9 +311,13 @@ For each slice, define at minimum:
 
  Recommended extra per-slice fields:
 
+ - `exp_id`
  - `slice_id`
  - `run_kind`
  - `slice_class`, such as `auxiliary`, `claim-carrying`, or `supporting`
+ - `tier`, such as `main_required`, `main_optional`, `appendix`, or `optional`
+ - `paper_placement`
+ - `highlight_ids`
  - `required_baselines`, where each item records at least `baseline_id` plus the reason, benchmark, and split when known
 
  If a slice needs an extra comparator baseline:
@@ -321,6 +353,14 @@ Treat `campaign_id` as system-owned, and treat `slice_id` / `todo_id` as agent-a
  Do not replace the normal campaign flow with repeated manual `artifact.prepare_branch(...)` calls.
  After each slice finishes, call `artifact.record_analysis_slice(...)` immediately so the result is mirrored back to the parent branch and the next slice can be activated.
  If a slice fails or becomes infeasible, still call `artifact.record_analysis_slice(...)` with an honest non-success status plus the real blocker and next recommendation; do not leave the campaign state ambiguous.
+ After every completed, excluded, or blocked writing-facing slice:
+
+ - reopen `paper/paper_experiment_matrix.md`
+ - update the row status, feasibility, and result artifacts
+ - update whether the row now belongs in main text, appendix, or omission
+ - update the remaining execution frontier before choosing the next slice
+
+ Do not keep launching writing-facing slices from stale memory when the matrix has changed.
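
The refresh-then-choose loop added here can be pictured as a small update over matrix rows. The diff does not show the schema of `paper/paper_experiment_matrix.json`, so the row keys (`exp_id`, `tier`, `status`, `feasible`), the status vocabulary, and both function names below are assumptions. The sketch also assumes that "non-optional" means any tier other than `optional`.

```python
RESOLVED_STATUSES = {"completed", "excluded"}  # assumed status vocabulary


def execution_frontier(rows):
    """Return exp_ids of feasible, non-optional rows that are still unresolved."""
    return [
        row["exp_id"]
        for row in rows
        if row["tier"] != "optional"
        and row["feasible"]
        and row["status"] not in RESOLVED_STATUSES
    ]


def record_slice_outcome(rows, exp_id, status, feasible=True):
    """Update one row in place after a slice ends, then return the new frontier."""
    for row in rows:
        if row["exp_id"] == exp_id:
            row["status"] = status
            row["feasible"] = feasible
            break
    else:
        raise KeyError(f"no matrix row with exp_id={exp_id!r}")
    return execution_frontier(rows)


# Hypothetical matrix rows, invented for the example.
rows = [
    {"exp_id": "exp-001", "tier": "main_required", "status": "pending", "feasible": True},
    {"exp_id": "exp-002", "tier": "appendix", "status": "pending", "feasible": True},
    {"exp_id": "exp-003", "tier": "optional", "status": "pending", "feasible": True},
]

frontier = record_slice_outcome(rows, "exp-001", "completed")
print(frontier)
```

Recomputing the frontier from the rows after every recorded outcome, rather than caching it, is one way to avoid launching the next slice from a stale view of the matrix.
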
  For slice recording, `deviations` and `evidence_paths` are optional context fields, not mandatory ceremony; include them only when they materially help explanation or auditability.
  Each `artifact.record_analysis_slice(...)` call should also include an `evaluation_summary` with exactly these six fields:
 
@@ -10,6 +10,9 @@ Treat it as the durable version of the charter, not a separate optional memo.
  - main claim under test:
  - user's core requirements:
  - campaign outcome needed:
+ - selected outline ref:
+ - paper experiment matrix path:
+ - current matrix execution frontier:
 
  ## 2. Boundary And Comparability
 
@@ -20,18 +23,26 @@ Treat it as the durable version of the charter, not a separate optional memo.
 
  ## 3. Slice Plan
 
- | Slice id | Slice class | Research question | Expected value | Priority | Needs code change? | Needs extra baseline? |
- |---|---|---|---|---|---|---|
- | | auxiliary / claim-carrying / supporting | | | | yes / no | yes / no |
+ | Exp id | Slice id | Tier | Slice class | Experiment type | Research question | Expected value | Priority | Paper placement | Needs code change? | Needs extra baseline? |
+ |---|---|---|---|---|---|---|---|---|---|---|
+ | | | main_required / main_optional / appendix / optional | auxiliary / claim-carrying / supporting | ablation / sensitivity / robustness / efficiency / highlight / boundary / case-study | | | | main_text / appendix / maybe / omit | yes / no | yes / no |
 
- ## 4. Assets And Dependencies
+ ## 4. Highlight Hypotheses
+
+ - highlight id:
+ - one-line claim:
+ - why it is plausible:
+ - which slices validate or falsify it:
+ - what happens if it fails:
+
+ ## 5. Assets And Dependencies
 
  - quest-local assets already available:
  - checkpoints / baselines already available:
  - downloads or services still needed:
  - fallback options if external assets are blocked:
 
- ## 5. Execution Strategy
+ ## 6. Execution Strategy
 
  - first slices to run:
  - smoke-test policy:
@@ -49,19 +60,21 @@ Monitoring and sleep plan:
  - health signals that justify continued monitoring:
  - conditions that trigger slice redesign, kill, or campaign revision:
 
- ## 6. Reporting Plan
+ ## 7. Reporting Plan
 
  - what will count as stable support:
  - what will count as contradiction:
  - what will count as unresolved ambiguity:
  - campaign summary should say in `1-2` sentences:
+ - matrix refresh rule after every slice:
+ - main-text gating rule:
 
- ## 7. Checklist Link
+ ## 8. Checklist Link
 
  - checklist path:
  - next unchecked item:
 
- ## 8. Revision Log
+ ## 9. Revision Log
 
  | Time | What changed | Why it changed | Impact on slices or interpretation |
  |---|---|---|---|
@@ -11,7 +11,7 @@ It absorbs the essential old DeepScientist reproducer discipline into one stage
  ## Interaction discipline
 
  - Follow the shared interaction contract injected by the system prompt.
- - For ordinary active work, prefer a concise progress update once work has crossed roughly 10 tool calls with a human-meaningful delta, and do not drift beyond roughly 20 tool calls or about 15 minutes without a user-visible update.
+ - For ordinary active work, prefer a concise progress update once work has crossed roughly 6 tool calls with a human-meaningful delta, and do not drift beyond roughly 12 tool calls or about 8 minutes without a user-visible update.
  - Keep ordinary setup and debugging updates concise. Reserve richer milestone reports for accepted / waived / blocked baseline outcomes or other route-changing checkpoints instead of narrating every small setup step.
  - Message templates are references only. Adapt to the actual context and vary wording so updates feel natural and non-robotic.
  - If a threaded user reply arrives, interpret it relative to the latest baseline progress update before assuming the task changed completely.
@@ -10,7 +10,7 @@ Use this skill whenever continuation is non-trivial.
  ## Interaction discipline
 
  - Follow the shared interaction contract injected by the system prompt.
- - For ordinary active work, prefer a concise progress update once work has crossed roughly 10 tool calls with a human-meaningful delta, and do not drift beyond roughly 20 tool calls or about 15 minutes without a user-visible update.
+ - For ordinary active work, prefer a concise progress update once work has crossed roughly 6 tool calls with a human-meaningful delta, and do not drift beyond roughly 12 tool calls or about 8 minutes without a user-visible update.
  - Message templates are references only. Adapt to context and vary wording so updates feel natural and non-robotic.
  - If the runtime starts an auto-continue turn with no new user message, continue from the active requirements and durable quest state instead of replaying the previous user turn.
  - If `startup_contract.decision_policy = autonomous`, do not emit ordinary `artifact.interact(kind='decision_request', ...)` calls; decide the route yourself, record the reason, and continue.
@@ -10,7 +10,7 @@ Use this skill for the main evidence-producing runs of the quest.
  ## Interaction discipline
 
  - Follow the shared interaction contract injected by the system prompt.
- - For ordinary active work, prefer a concise progress update once work has crossed roughly 10 tool calls with a human-meaningful delta, and do not drift beyond roughly 20 tool calls or about 15 minutes without a user-visible update.
+ - For ordinary active work, prefer a concise progress update once work has crossed roughly 6 tool calls with a human-meaningful delta, and do not drift beyond roughly 12 tool calls or about 8 minutes without a user-visible update.
  - Keep ordinary subtask completions concise. When a main experiment actually finishes or reaches a stage-significant checkpoint, upgrade to a richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` report rather than another short progress line.
  - That richer experiment-stage milestone report should normally cover: what run finished, the headline result versus baseline or expectation, the main caveat, and the exact recommended next action.
  - That richer milestone report is still normally non-blocking. If the next route is already justified locally, continue automatically after reporting rather than idling for acknowledgment.
@@ -10,7 +10,7 @@ Use this skill to close or pause a quest responsibly.
  ## Interaction discipline
 
  - Follow the shared interaction contract injected by the system prompt.
- - For ordinary active work, prefer a concise progress update once work has crossed roughly 10 tool calls with a human-meaningful delta, and do not drift beyond roughly 20 tool calls or about 15 minutes without a user-visible update.
+ - For ordinary active work, prefer a concise progress update once work has crossed roughly 6 tool calls with a human-meaningful delta, and do not drift beyond roughly 12 tool calls or about 8 minutes without a user-visible update.
  - If the runtime starts an auto-continue turn with no new user message, keep finalizing from the durable quest state and active requirements instead of replaying the previous user turn.
  - If a threaded user reply arrives, interpret it relative to the latest finalize progress update before assuming the task changed completely.
  - When finalize reaches a real closure state, pause-ready packet, or route-back decision, send one threaded `artifact.interact(kind='milestone', ...)` update that names the recommendation, why it is the right call, and any reopen condition that still matters.
@@ -10,7 +10,7 @@ Use this skill to turn the current baseline and problem frame into concrete, lit
  ## Interaction discipline
 
  - Follow the shared interaction contract injected by the system prompt.
- - For ordinary active work, prefer a concise progress update once work has crossed roughly 10 tool calls with a human-meaningful delta, and do not drift beyond roughly 20 tool calls or about 15 minutes without a user-visible update.
+ - For ordinary active work, prefer a concise progress update once work has crossed roughly 6 tool calls with a human-meaningful delta, and do not drift beyond roughly 12 tool calls or about 8 minutes without a user-visible update.
  - Keep ordinary subtask completions concise. When the idea stage actually finishes a meaningful deliverable such as a selected idea package, a rejected-ideas summary, or a route-shaping ideation checkpoint, upgrade to a richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` report.
  - That richer idea-stage milestone report should normally cover: the final selected or rejected direction, why it won or lost, the main remaining risk, and the exact recommended next stage or experiment.
  - That richer milestone report is still normally non-blocking. If the next experiment or route is already clear from durable evidence, continue automatically after reporting instead of waiting.
@@ -10,7 +10,7 @@ Use this skill when the quest already has meaningful state and the first job is
  ## Interaction discipline
 
  - Follow the shared interaction contract injected by the system prompt.
- - For ordinary active work, prefer a concise progress update once work has crossed roughly 10 tool calls with a human-meaningful delta, and do not drift beyond roughly 20 tool calls or about 15 minutes without a user-visible update.
+ - For ordinary active work, prefer a concise progress update once work has crossed roughly 6 tool calls with a human-meaningful delta, and do not drift beyond roughly 12 tool calls or about 8 minutes without a user-visible update.
  - Message templates are references only. Adapt to the actual context and vary wording so updates feel natural and non-robotic.
  - If a threaded user reply arrives, interpret it relative to the latest intake-audit progress update before assuming the task changed completely.
  - When the audit reaches a durable route recommendation, send one richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` update that says what state is trusted, what still needs work, and which anchor should run next.
@@ -14,7 +14,7 @@ The task is “respond to concrete reviewer pressure with the smallest honest se
  ## Interaction discipline
 
  - Follow the shared interaction contract injected by the system prompt.
- - For ordinary active work, prefer a concise progress update once work has crossed roughly 10 tool calls with a human-meaningful delta, and do not drift beyond roughly 20 tool calls or about 15 minutes without a user-visible update.
+ - For ordinary active work, prefer a concise progress update once work has crossed roughly 6 tool calls with a human-meaningful delta, and do not drift beyond roughly 12 tool calls or about 8 minutes without a user-visible update.
  - Message templates are references only. Adapt to the actual context and vary wording so updates feel natural and non-robotic.
  - If a threaded user reply arrives, interpret it relative to the latest rebuttal progress update before assuming the task changed completely.
  - When the rebuttal plan, the main supplementary-evidence package, or the final response bundle becomes durable, send one richer `artifact.interact(kind='milestone', reply_mode='threaded', ...)` update that says what reviewer concerns are now addressed, what still remains open, and what happens next.
@@ -73,6 +73,16 @@ First decide whether the issue is actually:
  - Do not run supplementary experiments without first mapping them to named reviewer concerns.
  - Do not keep the original claim scope if the new evidence no longer supports it.
  - If a reviewer request cannot be fully satisfied, say so clearly and explain the honest limitation.
+ - If `startup_contract.baseline_execution_policy` is present, honor it:
+ - `must_reproduce_or_verify`
+ - verify or recover the rebuttal-critical baseline/comparator before reviewer-linked follow-up work
+ - `reuse_existing_only`
+ - trust the current baseline/results unless you find concrete inconsistency, corruption, or missing-evidence problems
+ - `skip_unless_blocking`
+ - do not spend time rerunning baselines unless a named reviewer item truly depends on a missing comparator
+ - If `startup_contract.manuscript_edit_mode = latex_required`, treat the provided LaTeX tree or `paper/latex/` as the preferred writing surface when manuscript revision is needed.
+ - If LaTeX source is unavailable while `latex_required` is requested, do not pretend the manuscript was edited; produce LaTeX-ready replacement text and an explicit blocker note instead.
+ - Accept review inputs from URLs, local file paths, local directories, or current-turn attachments; do not assume the review packet must already be neatly structured.
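
The three `baseline_execution_policy` values added above read as a small decision table. The following is a rough sketch of that table, assuming the policy arrives as a plain string in a dict-like startup contract. `baseline_work_needed`, its keyword arguments, and the example contract are hypothetical and not taken from the package.

```python
KNOWN_POLICIES = {
    "must_reproduce_or_verify",
    "reuse_existing_only",
    "skip_unless_blocking",
}


def baseline_work_needed(policy, *, evidence_problem_found=False, reviewer_item_blocked=False):
    """Decide whether baseline verification or rerun work is warranted.

    evidence_problem_found: a concrete inconsistency, corruption, or missing-evidence problem.
    reviewer_item_blocked: a named reviewer item depends on a missing comparator.
    """
    if policy not in KNOWN_POLICIES:
        raise ValueError(f"unknown baseline_execution_policy: {policy!r}")
    if policy == "must_reproduce_or_verify":
        return True
    if policy == "reuse_existing_only":
        return evidence_problem_found
    return reviewer_item_blocked  # remaining case: skip_unless_blocking


# Hypothetical startup contract, invented for the example.
startup_contract = {"baseline_execution_policy": "skip_unless_blocking"}
policy = startup_contract.get("baseline_execution_policy")
if policy is not None:  # the rule only applies when the field is present
    print(baseline_work_needed(policy, reviewer_item_blocked=True))
```

Raising on an unknown value, rather than silently picking a default, keeps a mistyped policy from being treated as permission to skip baseline checks.
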
 
  ## Primary inputs
 
@@ -81,6 +91,7 @@ Use, in roughly this order:
  - the current paper or draft
  - the selected outline if one exists
  - review comments, meta-review, or editor letter
+ - current-turn attachments and user-provided local paths / directories / URLs for the manuscript or review packet
  - the six-field `evaluation_summary` blocks from recent main experiments and analysis slices
  - recent main and analysis experiment results
  - prior decision and writing memory
@@ -88,6 +99,7 @@ Use, in roughly this order:
88
99
 
89
100
  If the current paper/result state is still unclear, open `intake-audit` first before continuing the rebuttal workflow.
90
101
  Before launching any new supplementary experiment, read those structured `evaluation_summary` blocks first so the rebuttal plan starts from the already-recorded evidence state rather than from raw narrative memory.
102
+ If the user provided manuscript files or review-packet files directly, first normalize them into durable quest-visible paths under `paper/` or `paper/rebuttal/input/` before planning reviewer-linked experiments or draft replies.
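
As a rough illustration of the normalization step, the sketch below copies user-provided local files and directories into `paper/rebuttal/input/` under a quest root. The function name and the `quest_root` parameter are assumptions made for this example; URLs and current-turn attachments are deliberately left out because the source does not say how they are fetched.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def normalize_review_inputs(sources: Iterable[str], quest_root: str) -> List[Path]:
    """Copy local files/directories into paper/rebuttal/input/ and return the new paths."""
    target_dir = Path(quest_root) / "paper" / "rebuttal" / "input"
    target_dir.mkdir(parents=True, exist_ok=True)
    copied: List[Path] = []
    for source in sources:
        src = Path(source)
        if not src.exists():
            # URLs and missing paths are out of scope for this sketch.
            continue
        dest = target_dir / src.name
        if src.is_dir():
            # dirs_exist_ok (Python 3.8+) lets a repeated run refresh the copy.
            shutil.copytree(src, dest, dirs_exist_ok=True)
        else:
            shutil.copy2(src, dest)
        copied.append(dest)
    return copied
```

Copying rather than moving leaves the user's originals untouched, so the planning steps that follow can rely on the quest-visible copies without altering the source material.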
91
103
 
92
104
  ## Core outputs
93
105
 
@@ -98,6 +110,8 @@ The rebuttal pass should usually leave behind:
98
110
  - `paper/rebuttal/response_letter.md`
99
111
  - `paper/rebuttal/text_deltas.md`
100
112
  - `paper/rebuttal/evidence_update.md`
113
+ - `paper/paper_experiment_matrix.md` when reviewer concerns materially change the paper experiment plan
114
+ - `paper/paper_experiment_matrix.json` when reviewer concerns materially change the paper experiment plan
101
115
 
102
116
  Use the templates in `references/` when needed:
103
117
 
@@ -212,6 +226,7 @@ For each reviewer issue, decide whether the right answer is:
212
226
 
213
227
  Then write one durable rebuttal plan in `paper/rebuttal/action_plan.md`.
214
228
  That plan should explicitly include the analysis-experiment TODO list for reviewer-linked follow-up work.
229
+ If reviewer concerns materially change the paper's experiment story, also create or revise `paper/paper_experiment_matrix.*` so the rebuttal experiment package stays consistent with the paper-facing plan rather than drifting into a reviewer-only side list.
215
230
 
216
231
  The action plan should be the main thinking draft before execution.
217
232
  For each serious item, record:
@@ -237,6 +252,18 @@ Write at least:
237
252
  For novelty / comparison / positioning complaints, do not default to experiments.
238
253
  First decide whether the issue is better answered by a focused literature audit and clearer paper positioning.
239
254
 
255
+ When a reviewer concern really does imply experimental follow-up, map it into the same paper experiment taxonomy used by the writing line:
256
+
257
+ - `component_ablation`
258
+ - `sensitivity`
259
+ - `robustness`
260
+ - `efficiency_cost`
261
+ - `highlight_validation`
262
+ - `failure_boundary`
263
+ - `case_study_optional`
264
+
265
+ Case study remains optional unless the reviewer concern is specifically qualitative and cannot be addressed better with quantitative evidence.
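
One way to keep reviewer-linked follow-ups inside this taxonomy is to validate the mapping before any run is planned. The sketch below is illustrative only: the type names are taken from the list above, while the `assignments` dict shape (reviewer item id to experiment type) and the function name are assumptions.

```python
# Experiment types copied from the taxonomy above.
EXPERIMENT_TYPES = (
    "component_ablation",
    "sensitivity",
    "robustness",
    "efficiency_cost",
    "highlight_validation",
    "failure_boundary",
    "case_study_optional",
)


def invalid_assignments(assignments: dict) -> dict:
    """Return the reviewer items whose experiment type is outside the taxonomy."""
    return {
        item_id: exp_type
        for item_id, exp_type in assignments.items()
        if exp_type not in EXPERIMENT_TYPES
    }
```

An empty result means every reviewer item maps to a known type; any entry that comes back needs to be reclassified or answered without an experiment.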
266
+
240
267
  ### 3. Route experiments only when genuinely needed
241
268
 
242
269
  If one or more comments truly require new runs:
@@ -252,9 +279,18 @@ If one or more comments truly require new runs:
252
279
  Do not launch a free-floating ablation batch.
253
280
  Every supplementary run should answer a named reviewer issue.
254
281
  Every slice should reference one or more stable reviewer item ids.
282
+ Every rebuttal-linked slice should also reference the corresponding `exp_id` from `paper/paper_experiment_matrix.*` when that matrix exists.
255
283
  After each completed reviewer-linked slice, record the result, the implication for the manuscript, and the concrete modification advice in `paper/rebuttal/evidence_update.md`.
256
284
  Use the same shared supplementary-experiment protocol as ordinary analysis work; do not invent a rebuttal-only experiment system.
257
285
  If ids or refs are unclear, recover them first with `artifact.resolve_runtime_refs(...)`, `artifact.get_analysis_campaign(...)`, or `artifact.list_paper_outlines(...)`.
286
+ After each completed, excluded, or blocked reviewer-linked slice:
287
+
288
+ - reopen `paper/paper_experiment_matrix.*`
289
+ - update the affected `exp_id`
290
+ - update whether the result now belongs in main text, appendix, or omission
291
+ - update which reviewer items are now fully answered
292
+
293
+ Do not finalize the rebuttal package while reviewer-critical and currently feasible matrix rows remain unresolved without an explicit blocker note.
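
The finalization rule above can be sketched as a gate over the matrix rows. The real schema of `paper/paper_experiment_matrix.json` is not shown in this diff, so every field name below other than `exp_id` (`reviewer_critical`, `feasible`, `status`, `blocker_note`) is a hypothetical stand-in, as is the choice to treat `completed` and `excluded` as resolved.

```python
from typing import List

# Assumption: these two statuses count as resolved; anything else is still open.
RESOLVED_STATUSES = {"completed", "excluded"}


def rows_blocking_finalization(rows: List[dict]) -> List[str]:
    """Return exp_ids of reviewer-critical, feasible rows that are unresolved and unexplained."""
    blocking = []
    for row in rows:
        if not (row.get("reviewer_critical") and row.get("feasible")):
            continue  # the rule only covers reviewer-critical, currently feasible rows
        resolved = row.get("status") in RESOLVED_STATUSES
        has_blocker_note = bool(row.get("blocker_note"))
        if not resolved and not has_blocker_note:
            blocking.append(row["exp_id"])
    return blocking
```

Under these assumptions, the rebuttal package would be held back for as long as the function returns a non-empty list.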
258
294
 
259
295
  ### 4. Route manuscript changes explicitly
260
296
 
@@ -279,6 +315,14 @@ If a reviewer request forces a narrower story, revise the outline before polishi
279
315
 
280
316
  Use `references/response-letter-template.md` when helpful.
281
317
 
318
+ Before treating the response letter as final:
319
+
320
+ - first complete every feasible reviewer-linked experiment or analysis slice that the current plan marked as necessary
321
+ - ensure the necessary rows in `paper/paper_experiment_matrix.*` have been refreshed after those runs
322
+ - use real completed experiment results directly in the reply wherever the concern is genuinely experimental
323
+ - for non-experimental items, do not wait for unnecessary experiments; answer as strongly as the current manuscript, literature, and analysis already allow
324
+ - if one experimental item cannot be completed in time, keep the reply honest and explicit about the remaining limitation or fallback wording
325
+
282
326
  The response should be:
283
327
 
284
328
  - professional
@@ -290,6 +334,8 @@ The response should be:
290
334
  Good response structure:
291
335
 
292
336
  - short appreciation / acknowledgement
337
+ - overall response that summarizes the revision strategy and the main strengths acknowledged across reviewers
338
+ - strengths recognized across reviewers
293
339
  - direct answer to the reviewer concern
294
340
  - keep stable item ids visible when helpful
295
341
  - restate reviewer wording faithfully before answering
@@ -300,6 +346,28 @@ Good response structure:
300
346
  - claim scope
301
347
  - if not fully addressed, why not and what honest limitation remains
302
348
 
349
+ Drafting style rules for the actual author reply body:
350
+
351
+ - Treat `response_letter.md` as rebuttal-ready author text, not as internal coaching notes.
352
+ - Write in a calm, direct, precise author voice.
353
+ - Sound like authors clarifying the record, not authors asking for approval.
354
+ - Brief professional courtesy is allowed, but keep it short and move to substance immediately.
355
+ - Avoid sycophancy, flattery, excessive gratitude, or approval-seeking language.
356
+ - Do not default to conceding fault.
357
+ - Concede selectively, clarify selectively, and defend selectively.
358
+ - Answer the reviewer concern directly in the first 1 to 2 sentences.
359
+ - For non-experimental items, reduce reviewer uncertainty as much as the real evidence allows; the goal is to make a score improvement reasonable for an honest reviewer, not to persuade through rhetoric alone.
360
+ - Write strongly enough that a neutral reviewer or AC can judge the concern substantially addressed from the rebuttal text alone.
361
+ - After the literal answer, address the underlying doubt about validity, novelty, scope, fairness, or completeness.
362
+ - If the answer already exists in the manuscript, restate it in the rebuttal and then point to the manuscript change; do not only say “we will clarify”.
363
+ - If the issue is about wording, interpretation, or claim strength, include the revised sentence or close paraphrase that should appear in the manuscript.
364
+ - Keep the main response body for each item as 1 to 2 full paragraphs of polished prose.
365
+ - Do not use bullets, numbered lists, bold labels, or checklist fragments inside the actual response paragraphs.
366
+ - Do not narrate rebuttal strategy inside the author reply.
367
+ - Do not rely on future edits alone when you can already give the clarification, argument, or wording now.
368
+ - When pushing back, lead with evidence, scope, or feasibility constraints before intuition.
369
+ - If `startup_contract.manuscript_edit_mode = latex_required`, keep manuscript-facing replacement text LaTeX-ready.
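
Some of the formatting rules above are mechanical enough to check automatically. The sketch below is a hypothetical lint, not part of the package: it flags bullets, numbered lists, and bold labels in a response body and checks the 1 to 2 paragraph limit. Tone rules such as avoiding flattery are not covered.

```python
import re
from typing import List

# Markup that the drafting rules ask to keep out of response paragraphs.
_BULLET = re.compile(r"^\s*[-*+]\s+")
_NUMBERED = re.compile(r"^\s*\d+[.)]\s+")
_BOLD_LABEL = re.compile(r"^\s*\*\*[^*]+\*\*\s*:?")


def lint_response_body(body: str) -> List[str]:
    """Return a list of human-readable problems found in one response body."""
    problems = []
    # Paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", body.strip()) if p.strip()]
    if not 1 <= len(paragraphs) <= 2:
        problems.append(f"expected 1 to 2 paragraphs, found {len(paragraphs)}")
    for line in body.splitlines():
        if _BULLET.match(line):
            problems.append(f"bullet line: {line.strip()}")
        elif _NUMBERED.match(line):
            problems.append(f"numbered line: {line.strip()}")
        elif _BOLD_LABEL.match(line):
            problems.append(f"bold label: {line.strip()}")
    return problems
```

The check is meant for the response paragraphs only; running it on the whole `response_letter.md` would flag the template's own section labels.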
370
+
303
371
  If details are still genuinely unknown, use explicit placeholders such as `[[AUTHOR TO FILL]]` rather than inventing specifics.
304
372
 
305
373
  Avoid:
@@ -319,6 +387,8 @@ When the rebuttal package is durably ready:
319
387
 
320
388
  If a combined rebuttal note is useful, make sure the total package still covers:
321
389
 
390
+ - overall response
391
+ - strengths recognized across reviewers
322
392
  - overview and revision strategy
323
393
  - draft responses to reviewers
324
394
  - point-by-point triage
@@ -398,6 +468,9 @@ Useful tags include:
398
468
  - supplementary experiments, if needed, are routed cleanly
399
469
  - manuscript deltas are explicit
400
470
  - the response letter is evidence-backed and honest
471
+ - the final package contains both:
472
+   - reviewer-specific replies
473
+   - one overall response that makes the paper strengths, the main resolved concerns, and the remaining limitations legible to a neutral reader or AC
401
474
 
402
475
  The goal is not just “write a nicer response”.
403
476
  The goal is to convert review pressure into a durable, auditable revision workflow.
@@ -1,9 +1,55 @@
1
1
  # Response Letter Template
2
2
 
3
+ ## Drafting rules
4
+
5
+ - Treat this file as rebuttal-ready author text, not as private coaching notes.
6
+ - Write in a calm, direct, precise author voice.
7
+ - Brief professional courtesy is allowed, but keep it short and move to substance immediately.
8
+ - Avoid sycophancy, flattery, excessive gratitude, or approval-seeking language.
9
+ - Do not default to conceding fault.
10
+ - Concede selectively, clarify selectively, and defend selectively.
11
+ - Answer the reviewer concern directly in the first 1 to 2 sentences.
12
+ - Keep the actual response body for each item as 1 to 2 full paragraphs of polished prose.
13
+ - If the issue is about wording, interpretation, or claim strength, include the revised sentence or close paraphrase that should appear in the manuscript.
14
+ - Do not use bullets, numbered lists, or label-value schemas inside the actual response paragraphs.
15
+ - Do not rely on future edits alone when you can already give the clarification, argument, or draft wording now.
16
+ - If a concrete number, setup detail, or result is still unknown, use `[[AUTHOR TO FILL]]`.
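
Because `[[AUTHOR TO FILL]]` is a fixed marker, any that remain are easy to list before the letter goes out. The helper below is a minimal sketch with a hypothetical name; it reports each line of a text that still contains the placeholder.

```python
import re
from typing import List, Tuple

# The literal placeholder from the drafting rules, with brackets escaped.
PLACEHOLDER = re.compile(r"\[\[AUTHOR TO FILL\]\]")


def find_placeholders(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs for every line that still has a placeholder."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if PLACEHOLDER.search(line)
    ]
```

Line numbers are 1-based so they match what an editor shows when the author goes back to fill in the gaps.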
17
+
3
18
  ## Cover note
4
19
 
5
20
  We thank the reviewers for the careful reading and constructive feedback. Below we respond point by point and indicate the corresponding manuscript changes and supplementary evidence when applicable.
6
21
 
22
+ ## Overview & Revision Strategy
23
+
24
+ - main reviewer risks:
25
+ - current strongest evidence:
26
+ - current weakest evidence:
27
+ - baseline handling decision:
28
+ - response strategy:
29
+ - manuscript_edit_mode:
30
+
31
+ ## Overall Response
32
+
33
+ - main strengths recognized across reviewers:
34
+ - overall revision strategy:
35
+ - biggest concerns now addressed:
36
+ - concerns still partially open:
37
+ - claim-scope changes:
38
+ - remaining limitation:
39
+
40
+ ## Strengths Recognized Across Reviewers
41
+
42
+ - strength 1:
43
+ - strength 2:
44
+ - why these strengths still matter after revision:
45
+
46
+ ## Resolution Snapshot
47
+
48
+ | Item ID | Status | What changed | Evidence basis | Manuscript delta |
49
+ | --- | --- | --- | --- | --- |
50
+ | R1-C1 | | | | |
51
+ | R1-C2 | | | | |
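
If per-item status is already tracked in structured form, the Resolution Snapshot table can be rendered rather than typed by hand. This is a hedged sketch: the column headers come from the template above, while the dict keys (`item_id`, `status`, `what_changed`, `evidence_basis`, `manuscript_delta`) and the function name are assumptions.

```python
from typing import Iterable, Mapping

# Column headers from the template; KEYS are hypothetical field names.
COLUMNS = ("Item ID", "Status", "What changed", "Evidence basis", "Manuscript delta")
KEYS = ("item_id", "status", "what_changed", "evidence_basis", "manuscript_delta")


def render_resolution_snapshot(items: Iterable[Mapping[str, str]]) -> str:
    """Render reviewer items as a Markdown table with the template's columns."""
    lines = [
        "| " + " | ".join(COLUMNS) + " |",
        "| " + " | ".join("---" for _ in COLUMNS) + " |",
    ]
    for item in items:
        # Escape pipes so cell text cannot break the table layout.
        cells = [str(item.get(key, "")).replace("|", "\\|") for key in KEYS]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

Missing fields render as empty cells, which mirrors the blank rows in the template and leaves unfinished items visible.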
52
+
7
53
  ## Reviewer 1
8
54
 
9
55
  ### Item R1-C1
@@ -20,9 +66,11 @@ We thank the reviewers for the careful reading and constructive feedback. Below
20
66
 
21
67
  - agree / partially_agree / clarify / respectful_disagree
22
68
 
23
- **Response**
69
+ **Response Draft**
24
70
 
25
- -
71
+ Write 1 to 2 full paragraphs of rebuttal-ready prose here.
72
+ The first 1 to 2 sentences should answer the concern directly.
73
+ Then explain the evidence, manuscript rationale, and the exact clarification or wording that should appear in the revision.
26
74
 
27
75
  **What changed**
28
76
 
@@ -30,6 +78,7 @@ We thank the reviewers for the careful reading and constructive feedback. Below
30
78
  - evidence basis:
31
79
  - claim-scope effect:
32
80
  - remaining limitation:
81
+ - latex-ready manuscript text:
33
82
 
34
83
  **If an experiment is still pending**
35
84
 
@@ -51,9 +100,9 @@ We thank the reviewers for the careful reading and constructive feedback. Below
51
100
 
52
101
  - agree / partially_agree / clarify / respectful_disagree
53
102
 
54
- **Response**
103
+ **Response Draft**
55
104
 
56
- -
105
+ Write 1 to 2 full paragraphs of rebuttal-ready prose here.
57
106
 
58
107
  **What changed**
59
108
 
@@ -84,9 +133,9 @@ We thank the reviewers for the careful reading and constructive feedback. Below
84
133
 
85
134
  - agree / partially_agree / clarify / respectful_disagree
86
135
 
87
- **Response**
136
+ **Response Draft**
88
137
 
89
- -
138
+ Write 1 to 2 full paragraphs of rebuttal-ready prose here.
90
139
 
91
140
  **What changed**
92
141
 
@@ -106,8 +155,3 @@ We thank the reviewers for the careful reading and constructive feedback. Below
106
155
  - what could not be fully addressed:
107
156
  - why:
108
157
  - how the manuscript now reflects that limitation:
109
-
110
- ## Author placeholders
111
-
112
- - If a concrete number, setup detail, or result is still unknown, use `[[AUTHOR TO FILL]]`.
113
- - Do not fabricate missing details just to make the letter sound complete.