superlab 0.1.11 → 0.1.13
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +19 -2
- package/README.zh-CN.md +19 -2
- package/bin/superlab.cjs +43 -1
- package/lib/auto.cjs +14 -771
- package/lib/auto_common.cjs +129 -0
- package/lib/auto_contracts.cjs +387 -0
- package/lib/auto_runner.cjs +830 -0
- package/lib/auto_state.cjs +227 -0
- package/lib/context.cjs +94 -0
- package/lib/eval_protocol.cjs +236 -0
- package/lib/i18n.cjs +140 -11
- package/lib/install.cjs +26 -6
- package/package-assets/claude/commands/lab/auto.md +1 -1
- package/package-assets/claude/commands/lab.md +2 -1
- package/package-assets/codex/prompts/lab-auto.md +1 -1
- package/package-assets/codex/prompts/lab.md +2 -1
- package/package-assets/shared/lab/context/auto-mode.md +16 -0
- package/package-assets/shared/lab/context/auto-outcome.md +28 -0
- package/package-assets/shared/lab/context/auto-status.md +3 -0
- package/package-assets/shared/lab/context/eval-protocol.md +46 -0
- package/package-assets/shared/skills/lab/SKILL.md +12 -1
- package/package-assets/shared/skills/lab/stages/auto.md +37 -7
- package/package-assets/shared/skills/lab/stages/iterate.md +4 -0
- package/package-assets/shared/skills/lab/stages/report.md +4 -0
- package/package-assets/shared/skills/lab/stages/run.md +4 -1
- package/package.json +1 -1
package/lib/i18n.cjs
CHANGED
|
@@ -129,6 +129,7 @@ const ZH_SKILL_FILES = {
 - \`.lab/context/mission.md\`
 - \`.lab/context/state.md\`
 - \`.lab/context/data-decisions.md\`
+- \`.lab/context/eval-protocol.md\`
 - \`.lab/config/workflow.json\`
 
 ## 上下文写回
@@ -140,6 +141,7 @@ const ZH_SKILL_FILES = {
 
 - 优先选择能打通全链路的最小实验。
 - 数据、环境或 metric 接线有问题时要尽快失败。
+- run 目标必须对齐已批准的评估协议,而不是只跟随聊天里的临时目标。
 - 记录精确启动命令和输出位置。
 - 持久 run 输出、日志和 checkpoint 写到 \`results_root\`。
 - 图表和可视化写到 \`figures_root\`。
@@ -151,7 +153,11 @@ const ZH_SKILL_FILES = {
 2. 登记 run
 3. 执行最小有意义实验
 4. 标准化原始指标
-5.
+5. 按当前评估协议校验标准化摘要
+
+## 约束
+
+- 不要凭记忆现想指标定义、baseline 行为或对比方法实现;它们必须锚定到已批准评估协议里记录的来源。
 
 ## 交互约束
 
@@ -171,6 +177,7 @@ const ZH_SKILL_FILES = {
 - baseline
 - 主指标
 - 成功阈值
+- evaluation ladder 与 benchmark 扩量 gate
 - verification commands
 - completion_promise
 - 最大迭代轮次
@@ -182,6 +189,7 @@ const ZH_SKILL_FILES = {
 - \`.lab/context/decisions.md\`
 - \`.lab/context/evidence-index.md\`
 - \`.lab/context/data-decisions.md\`
+- \`.lab/context/eval-protocol.md\`
 - \`.lab/config/workflow.json\`
 
 ## 上下文写回
@@ -221,6 +229,8 @@ const ZH_SKILL_FILES = {
 - 持久 run 输出、日志和 checkpoint 放在 \`results_root\`。
 - 图表和可视化放在 \`figures_root\`。
 - 不要把长期结果堆在 \`.lab/changes/<change-id>/runs\` 里。
+- 不要修改指标定义、baseline 语义或对比方法实现,除非评估协议已经记录了来源和与原始实现的偏差。
+- 如果要调整 ladder、样本量或升格 gate,必须继续锚定到带来源的评估协议,而不是靠聊天临时判断。
 
 ## 交互约束
 
@@ -294,6 +304,7 @@ const ZH_SKILL_FILES = {
 - \`.lab/context/state.md\`
 - \`.lab/context/decisions.md\`
 - \`.lab/context/evidence-index.md\`
+- \`.lab/context/eval-protocol.md\`
 
 ## 上下文写回
 
@@ -304,6 +315,9 @@ const ZH_SKILL_FILES = {
 
 - 不能隐藏失败迭代。
 - 每个主要 claim 都要指向已记录的 summary 或 iteration artifact。
+- 主表结构、gate 和最终结果 framing 必须对齐已批准的评估协议。
+- 不要凭记忆重述指标定义、baseline 行为或对比方法实现;直接引用评估协议里记录的来源。
+- 如果报告依赖了对原始指标或原始实现的偏差,必须明确写出这个偏差。
 - 解释优先保守,不要写成营销文案。
 - 要给 \`/lab:write\` 留下清晰 handoff,尤其是 section draft 可以直接引用的证据链接。
 
@@ -919,8 +933,13 @@ const ZH_SKILL_FILES = {
 ## 目标
 
 - Objective:
+- Autonomy level: L2
+- Approval status: draft
 - Allowed stages: run, iterate, review, report
 - Success criteria:
+- Terminal goal type:
+- Terminal goal target:
+- Required terminal artifact:
 
 ## 循环预算
 
@@ -941,6 +960,14 @@ const ZH_SKILL_FILES = {
 - Promotion check command:
 - Promotion command:
 
+## 阶段产物约束
+
+- Run stage contract: write persistent outputs under \`results_root\`.
+- Iterate stage contract: update persistent outputs under \`results_root\`.
+- Review stage contract: update canonical review context such as \`.lab/context/decisions.md\`、\`state.md\`、\`open-questions.md\` or \`evidence-index.md\`.
+- Report stage contract: write the final report to \`<deliverables_root>/report.md\`.
+- Write stage contract: write LaTeX output under \`<deliverables_root>/paper/\`.
+
 ## 升格策略
 
 - Promotion policy:
@@ -954,6 +981,37 @@ const ZH_SKILL_FILES = {
 
 - Stop conditions:
 - Escalation conditions:
+- Canonical promotion writeback: update \`.lab/context/data-decisions.md\`、\`.lab/context/decisions.md\`、\`.lab/context/state.md\` and \`.lab/context/session-brief.md\`.
+`,
+[path.join(".lab", "context", "auto-outcome.md")]:
+`# 自动结果
+
+## 目标
+
+- Objective:
+- Experiment ladder:
+- Metric glossary:
+- Metric source papers:
+- Metric implementation source:
+- Comparison source papers:
+- Comparison implementation source:
+- Deviation from original implementation:
+- Terminal goal type:
+- Terminal goal target:
+- Required terminal artifact:
+
+## 结果
+
+- Status: idle
+- Goal reached: no
+- Stop reason:
+- Promotion applied: no
+- Final artifact:
+- Final rung:
+- Executed stages:
+- Iterations completed: 0
+- Started at:
+- Finished at:
 `,
 [path.join(".lab", "context", "auto-status.md")]:
 `# 自动模式状态
@@ -965,6 +1023,9 @@ const ZH_SKILL_FILES = {
 - Current command:
 - Active run id:
 - Iteration count: 0
+- Current rung:
+- Watch target:
+- Next rung:
 
 ## 时间
 
@@ -1322,7 +1383,7 @@ ZH_CONTENT[path.join(".lab", ".managed", "templates", "framing.md")] = `# 论文
 ZH_CONTENT[path.join(".codex", "prompts", "lab.md")] = codexPrompt(
 "查看 /lab 研究工作流总览并选择合适阶段",
 "workflow question 或 stage choice",
-"# `/lab` for Codex\n\n`/lab` 是严格的研究工作流命令族。每次都使用同一套仓库工件和阶段边界。\n\n## 子命令\n\n- `/lab:idea`\n 调研 idea,定义问题与 failure case,归类 contribution 与 breakthrough level,对比现有方法,收束三个一眼就有意义的点,并在实现前保留 approval gate。\n\n- `/lab:data`\n 把已批准的 idea 转成数据集与 benchmark 方案,记录数据集年份、使用过该数据集的论文、下载来源、许可或访问限制,以及 classic-public、recent-strong-public、claim-specific 三类 benchmark 的纳入理由,和 canonical baselines、strong historical baselines、recent strong public methods、closest prior work 四类对比方法的纳入理由。\n\n- `/lab:auto`\n 在不改变 mission、framing 和核心 claims 的前提下,读取 auto-mode 契约并自动编排 `run`、`iterate`、`review`、`report`,必要时扩展数据集、benchmark 和 comparison methods,并在满足升格策略时自动升级 primary package
+"# `/lab` for Codex\n\n`/lab` 是严格的研究工作流命令族。每次都使用同一套仓库工件和阶段边界。\n\n## 子命令\n\n- `/lab:idea`\n 调研 idea,定义问题与 failure case,归类 contribution 与 breakthrough level,对比现有方法,收束三个一眼就有意义的点,并在实现前保留 approval gate。\n\n- `/lab:data`\n 把已批准的 idea 转成数据集与 benchmark 方案,记录数据集年份、使用过该数据集的论文、下载来源、许可或访问限制,以及 classic-public、recent-strong-public、claim-specific 三类 benchmark 的纳入理由,和 canonical baselines、strong historical baselines、recent strong public methods、closest prior work 四类对比方法的纳入理由。\n\n- `/lab:auto`\n 在不改变 mission、framing 和核心 claims 的前提下,读取 eval-protocol 与 auto-mode 契约并自动编排 `run`、`iterate`、`review`、`report`,必要时扩展数据集、benchmark 和 comparison methods,并在满足升格策略时自动升级 primary package。启动前必须选定 autonomy level、声明 terminal goal,并显式批准契约。\n\n- `/lab:framing`\n 通过审计当前领域与相邻领域的术语,锁定 paper-facing 的方法名、模块名、论文题目和 contribution bullets,并在 section 起草前保留 approval gate。\n\n- `/lab:spec`\n 把已批准的 idea 转成 `.lab/changes/<change-id>/` 下的一个 lab change 目录,并在其中写出 `proposal`、`design`、`spec`、`tasks`。\n\n- `/lab:run`\n 执行最小有意义验证运行,登记 run,并生成第一版标准化评估摘要。\n\n- `/lab:iterate`\n 在冻结 mission、阈值、verification commands 与 `completion_promise` 的前提下执行有边界的实验迭代。\n\n- `/lab:review`\n 以 reviewer mode 审查文档或结果,先给短摘要,再输出 findings、fatal flaws、fix priority 和 residual risks。\n\n- `/lab:report`\n 从 runs 和 iterations 工件生成最终研究报告。\n\n- `/lab:write`\n 使用已安装 `lab` skill 下 vendored 的 paper-writing references,把稳定 report 工件转成论文 section。\n\n## 调度规则\n\n- 始终使用 `skills/lab/SKILL.md` 作为工作流合同。\n- 用户显式调用 `/lab:<stage>` 时,要立刻执行该 stage,而不是只推荐别的 `/lab` stage。\n- 先给简洁摘要,再决定是否写工件,最后回报输出路径和下一步。\n- 如果歧义会影响结论,一次只问一个问题;如果有多条可行路径,先给 2-3 个方案再收敛。\n- `/lab:spec` 前应已有经批准的数据集与 benchmark 方案。\n- `/lab:run`、`/lab:iterate`、`/lab:auto`、`/lab:report` 都应遵循 `.lab/context/eval-protocol.md`。\n- `.lab/context/eval-protocol.md` 不只定义主指标和主表,也应定义指标释义、实验阶梯,以及指标和对比实现的来源。\n- `/lab:auto` 只编排已批准边界内的执行阶段,不替代手动的 idea/data/framing/spec 决策。\n- `/lab:write` 前必须已有经批准的 `/lab:framing` 工件。\n"
 );
 
 ZH_CONTENT[path.join(".codex", "prompts", "lab-data.md")] = codexPrompt(
@@ -1334,14 +1395,14 @@ ZH_CONTENT[path.join(".codex", "prompts", "lab-data.md")] = codexPrompt(
 ZH_CONTENT[path.join(".codex", "prompts", "lab-auto.md")] = codexPrompt(
 "在已批准边界内编排自动实验循环",
 "auto mode objective",
-"使用已安装的 `lab` 技能:`.codex/skills/lab/SKILL.md`。\n\n立刻针对用户当前给出的参数执行 `/lab:auto`,不要只推荐别的 `/lab` 阶段。只有在缺少阻塞性前提时,才明确指出缺什么,并且一次最多追问一个问题。\n\n本命令运行 `/lab:auto` 阶段。它必须读取 `.lab/context/auto-mode.md` 与 `.lab/context/auto-
+"使用已安装的 `lab` 技能:`.codex/skills/lab/SKILL.md`。\n\n立刻针对用户当前给出的参数执行 `/lab:auto`,不要只推荐别的 `/lab` 阶段。只有在缺少阻塞性前提时,才明确指出缺什么,并且一次最多追问一个问题。\n\n本命令运行 `/lab:auto` 阶段。它必须读取 `.lab/context/eval-protocol.md`、`.lab/context/auto-mode.md`、`.lab/context/auto-status.md` 与 `.lab/context/auto-outcome.md`,先确认 autonomy level、approval status 与 terminal goal schema,再把 eval-protocol 里的指标释义、主表计划、来源约束与结构化实验阶梯当作执行依据,在不修改 mission、framing 和核心 claims 的前提下编排已批准的 `run`、`iterate`、`review`、`report`,轮询长任务完成情况;如果声明了 rung,就保持会话活着并按 rung 转移继续推进。"
 );
 
 ZH_CONTENT[path.join(".claude", "commands", "lab.md")] = claudeCommand(
 "LAB",
 "查看 /lab 研究工作流总览并选择合适阶段",
 "workflow, research, overview",
-"# `/lab` for Claude\n\n`/lab` 是严格的研究工作流命令族。每次都使用同一套仓库工件和阶段边界。\n\n## 子命令\n\n- `/lab:idea`\n 调研 idea,定义问题与 failure case,归类 contribution 与 breakthrough level,对比现有方法,收束三个一眼就有意义的点,并在实现前保留 approval gate。\n\n- `/lab:data`\n 把已批准的 idea 转成数据集与 benchmark 方案,记录数据集年份、使用过该数据集的论文、下载来源、许可或访问限制,以及 classic-public、recent-strong-public、claim-specific 三类 benchmark 的纳入理由,和 canonical baselines、strong historical baselines、recent strong public methods、closest prior work 四类对比方法的纳入理由。\n\n- `/lab:auto`\n 在不改变 mission、framing 和核心 claims 的前提下,读取 auto-mode 契约并自动编排 `run`、`iterate`、`review`、`report`,必要时扩展数据集、benchmark 和 comparison methods,并在满足升格策略时自动升级 primary package
+"# `/lab` for Claude\n\n`/lab` 是严格的研究工作流命令族。每次都使用同一套仓库工件和阶段边界。\n\n## 子命令\n\n- `/lab:idea`\n 调研 idea,定义问题与 failure case,归类 contribution 与 breakthrough level,对比现有方法,收束三个一眼就有意义的点,并在实现前保留 approval gate。\n\n- `/lab:data`\n 把已批准的 idea 转成数据集与 benchmark 方案,记录数据集年份、使用过该数据集的论文、下载来源、许可或访问限制,以及 classic-public、recent-strong-public、claim-specific 三类 benchmark 的纳入理由,和 canonical baselines、strong historical baselines、recent strong public methods、closest prior work 四类对比方法的纳入理由。\n\n- `/lab:auto`\n 在不改变 mission、framing 和核心 claims 的前提下,读取 eval-protocol 与 auto-mode 契约并自动编排 `run`、`iterate`、`review`、`report`,必要时扩展数据集、benchmark 和 comparison methods,并在满足升格策略时自动升级 primary package。启动前必须选定 autonomy level、声明 terminal goal,并显式批准契约。\n\n- `/lab:framing`\n 通过审计当前领域与相邻领域的术语,锁定 paper-facing 的方法名、模块名、论文题目和 contribution bullets,并在 section 起草前保留 approval gate。\n\n- `/lab:spec`\n 把已批准的 idea 转成 `.lab/changes/<change-id>/` 下的一个 lab change 目录,并在其中写出 `proposal`、`design`、`spec`、`tasks`。\n\n- `/lab:run`\n 执行最小有意义验证运行,登记 run,并生成第一版标准化评估摘要。\n\n- `/lab:iterate`\n 在冻结 mission、阈值、verification commands 与 `completion_promise` 的前提下执行有边界的实验迭代。\n\n- `/lab:review`\n 以 reviewer mode 审查文档或结果,先给短摘要,再输出 findings、fatal flaws、fix priority 和 residual risks。\n\n- `/lab:report`\n 从 runs 和 iterations 工件生成最终研究报告。\n\n- `/lab:write`\n 使用已安装 `lab` skill 下 vendored 的 paper-writing references,把稳定 report 工件转成论文 section。\n\n## 调度规则\n\n- 始终使用 `skills/lab/SKILL.md` 作为工作流合同。\n- 用户显式调用 `/lab:<stage>` 时,要立刻执行该 stage,而不是只推荐别的 `/lab` stage。\n- 先给简洁摘要,再决定是否写工件,最后回报输出路径和下一步。\n- 如果歧义会影响结论,一次只问一个问题;如果有多条可行路径,先给 2-3 个方案再收敛。\n- `/lab:spec` 前应已有经批准的数据集与 benchmark 方案。\n- `/lab:run`、`/lab:iterate`、`/lab:auto`、`/lab:report` 都应遵循 `.lab/context/eval-protocol.md`。\n- `.lab/context/eval-protocol.md` 不只定义主指标和主表,也应定义指标释义、实验阶梯,以及指标和对比实现的来源。\n- `/lab:auto` 只编排已批准边界内的执行阶段,不替代手动的 idea/data/framing/spec 决策。\n- `/lab:write` 前必须已有经批准的 `/lab:framing` 工件。\n"
 );
 
 ZH_CONTENT[path.join(".claude", "commands", "lab", "data.md")] = claudeCommand(
@@ -1355,7 +1416,7 @@ ZH_CONTENT[path.join(".claude", "commands", "lab", "auto.md")] = claudeCommand(
 "LAB: Auto",
 "在已批准边界内编排自动实验循环",
 "workflow, research, auto",
-"使用已安装的 `lab` 技能:`.claude/skills/lab/SKILL.md`。\n\n立刻针对用户当前给出的参数执行 `/lab:auto`,不要只推荐别的 `/lab` 阶段。只有在缺少阻塞性前提时,才明确指出缺什么,并且一次最多追问一个问题。\n\n本命令运行 `/lab:auto` 阶段。它必须读取 `.lab/context/auto-mode.md` 与 `.lab/context/auto-
+"使用已安装的 `lab` 技能:`.claude/skills/lab/SKILL.md`。\n\n立刻针对用户当前给出的参数执行 `/lab:auto`,不要只推荐别的 `/lab` 阶段。只有在缺少阻塞性前提时,才明确指出缺什么,并且一次最多追问一个问题。\n\n本命令运行 `/lab:auto` 阶段。它必须读取 `.lab/context/eval-protocol.md`、`.lab/context/auto-mode.md`、`.lab/context/auto-status.md` 与 `.lab/context/auto-outcome.md`,先确认 autonomy level、approval status 与 terminal goal schema,再把 eval-protocol 里的指标释义、主表计划、来源约束与结构化实验阶梯当作执行依据,在不修改 mission、framing 和核心 claims 的前提下编排已批准的 `run`、`iterate`、`review`、`report`,轮询长任务完成情况;如果声明了 rung,就保持会话活着并按 rung 转移继续推进。"
 );
 
 ZH_CONTENT[path.join(".codex", "skills", "lab", "SKILL.md")] = `---
@@ -1755,6 +1816,52 @@ ZH_CONTENT[path.join(".lab", "context", "data-decisions.md")] = `# 已批准数
 - 剩余预处理或 leakage 风险:
 `;
 
+ZH_CONTENT[path.join(".lab", "context", "eval-protocol.md")] = `# 评估协议
+
+用这份文件定义 \`/lab:run\`、\`/lab:iterate\`、\`/lab:auto\` 和 \`/lab:report\` 共用的论文导向评估目标、主表计划、gate 与 benchmark ladder。
+
+## 主评估目标
+
+- 主评估目标:
+- 主指标:
+- 次级指标:
+- 必要终局证据:
+
+## 主表计划
+
+- 主表计划:
+- 每张表必须支撑的 claims:
+
+## 指标释义
+
+- 指标释义:
+- 指标来源论文:
+- 指标实现来源:
+- 对比方法来源论文:
+- 对比方法实现来源:
+- 与原始实现的偏差:
+
+## Gate Ladder
+
+- 实验阶梯:
+- benchmark 阶梯:
+- 对比方法 gate:
+- 升格 gate:
+- 最小样本量:
+- 必要输出工件:
+
+### Rung: <rung-id>
+
+- 阶段:
+- 目标:
+- 命令:
+- 监视目标:
+- gate 命令:
+- 通过后:
+- 失败后:
+- 停止后:
+`;
+
 ZH_CONTENT[path.join(".codex", "skills", "lab", "stages", "auto.md")] = `# \`/lab:auto\` 阶段指南
 
 ## 必要输出
@@ -1772,9 +1879,11 @@ ZH_CONTENT[path.join(".codex", "skills", "lab", "stages", "auto.md")] = `# \`/la
 - \`.lab/context/decisions.md\`
 - \`.lab/context/data-decisions.md\`
 - \`.lab/context/evidence-index.md\`
+- \`.lab/context/eval-protocol.md\`
 - \`.lab/context/terminology-lock.md\`
 - \`.lab/context/auto-mode.md\`
 - \`.lab/context/auto-status.md\`
+- \`.lab/context/auto-outcome.md\`
 
 ## 上下文写回
 
@@ -1785,31 +1894,51 @@ ZH_CONTENT[path.join(".codex", "skills", "lab", "stages", "auto.md")] = `# \`/la
 - \`.lab/context/summary.md\`
 - \`.lab/context/session-brief.md\`
 - \`.lab/context/auto-status.md\`
+- \`.lab/context/auto-outcome.md\`
 
 ## 边界规则
 
 - 把 \`/lab:auto\` 当作编排层,不要再发明第二套 workflow。
+- 把 \`.lab/context/eval-protocol.md\` 当作论文导向指标、指标释义、主表、gate 与结构化实验阶梯的唯一来源。
+- 把评估协议当作“带来源的协议”,不是“临场想出来的说明”:指标定义、baseline 行为、对比实现和偏差都必须先写明来源,再用于 gate 或 promotion。
+- 契约里必须声明 \`Autonomy level\` 和 \`Approval status\`,只有显式写成 \`approved\` 才能启动。
+- 契约里还必须声明具体的 terminal goal:\`rounds\`、\`metric-threshold\` 或 \`task-completion\`,并补齐 \`Terminal goal target\` 与 \`Required terminal artifact\`。
+- 级别含义固定为:
+  - \`L1\`:safe run,只允许 \`run\`、\`review\`、\`report\`
+  - \`L2\`:bounded iteration,允许 \`run\`、\`iterate\`、\`review\`、\`report\`
+  - \`L3\`:aggressive campaign,才允许额外编排 \`write\`
 - 默认只编排 \`run\`、\`iterate\`、\`review\`、\`report\`;只有 framing 已批准时才可选 \`write\`。
 - 不要自动修改 mission、paper-facing framing 或核心 claims。
 - 可以在 exploration envelope 内增加数据集、benchmark 和 comparison methods。
 - 只有在 auto-mode 契约中的升格策略满足时,才允许把 exploratory addition 自动升格为 primary package。
 - 长任务必须通过轮询推进,直到完成、超时或命中停止条件。
+- 每次结束都必须写出规范的 \`.lab/context/auto-outcome.md\`。
+- 如果评估协议声明了结构化 rung,就按前台 rung 状态机执行:每个 rung 都要声明阶段、目标、命令、监视目标、gate、通过后/失败后/停止后的转移,并把当前 rung、监视目标和下一 rung 写进 \`.lab/context/auto-status.md\`。
+- 不要只看命令退出码;必须检查阶段产物约束:
+  - \`run\` 和 \`iterate\` 更新 \`results_root\`
+  - \`review\` 更新规范审查上下文
+  - \`report\` 写出 \`<deliverables_root>/report.md\`
+  - \`write\` 写出 \`<deliverables_root>/paper/\` 下的 LaTeX 产物
+- promotion 成功后,必须写回 \`data-decisions.md\`、\`decisions.md\`、\`state.md\` 和 \`session-brief.md\`。
+- 如果某个指标或对比 claim 在评估协议里没有带来源的定义,就不能拿它做 stop 或 promotion 判断。
 
 ## 最小流程
 
 1. 校验自动模式契约
-2.
-3.
-4.
-5.
-6.
+2. 确认已批准的 autonomy level 与允许阶段一致
+3. 设置或刷新自动模式状态
+4. 选择下一个允许的 \`/lab\` 子阶段
+5. 发起有边界动作
+6. 轮询进程、checkpoint 或 summary 的变化
+7. 评估声明过的 terminal goal 是否已经达成
+8. 记录结果并决定 continue、promote、stop 或 escalate
 
 ## 交互约束
 
 - 开始前先简洁说明:objective、frozen core 和下一自动阶段。
 - 如果契约本身不完整,一次只追问一个问题。
 - 如果存在多个可信的下一动作,先给 2-3 个 bounded 方案和推荐项,再启动长任务。
--
+- 只有当下一步会离开已批准的 exploration envelope、超出选定 autonomy level,或实质改变 frozen core 时,才保留人工 approval gate。
 `;
 
 ZH_CONTENT[path.join(".claude", "skills", "lab", "stages", "auto.md")] =
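The rung state machine that the new auto-stage guide describes (each rung declares a gate plus on-pass/on-fail/on-stop transitions) can be sketched roughly as follows. This is an illustrative reading of the contract, not superlab's actual implementation; `runLadder`, the rung object shape, and the `"stop"` sentinel are all invented for the example.

```javascript
// Hypothetical sketch of the rung ladder semantics described above: each rung
// names its transition targets, and the loop follows the gate verdict until a
// terminal "stop" marker is reached. Names here are assumptions, not
// superlab's real API.
function runLadder(rungs, startId, runGate) {
  const executed = [];
  let current = startId;
  while (current && current !== "stop") {
    const rung = rungs[current];
    executed.push(current);
    const verdict = runGate(rung); // expected: "pass" | "fail" | "stop"
    if (verdict === "pass") {
      current = rung.onPass;
    } else if (verdict === "fail") {
      current = rung.onFail;
    } else {
      current = rung.onStop;
    }
  }
  return executed;
}
```

A real orchestrator would also persist the current rung, watch target, and next rung into `.lab/context/auto-status.md` on every transition, as the boundary rules require.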
package/lib/install.cjs
CHANGED
|
@@ -40,8 +40,10 @@ const PROJECT_OWNED_LOCALIZED_PATHS = [
   path.join(".lab", "context", "evidence-index.md"),
   path.join(".lab", "context", "open-questions.md"),
   path.join(".lab", "context", "data-decisions.md"),
+  path.join(".lab", "context", "eval-protocol.md"),
   path.join(".lab", "context", "auto-mode.md"),
   path.join(".lab", "context", "auto-status.md"),
+  path.join(".lab", "context", "auto-outcome.md"),
   path.join(".lab", "context", "terminology-lock.md"),
   path.join(".lab", "context", "summary.md"),
   path.join(".lab", "context", "next-action.md"),
@@ -431,13 +433,31 @@ function registerProjectInstall(targetDir, metadata, { env = process.env } = {})
 }
 
 function isTemporaryTestPath(targetDir) {
-  const
-
-
-
-
+  const normalizedCandidates = new Set([path.resolve(targetDir)]);
+  try {
+    normalizedCandidates.add(fs.realpathSync(targetDir));
+  } catch {}
+
+  const tempRoots = new Set([path.resolve(os.tmpdir()), path.resolve("/tmp"), path.resolve("/private/tmp")]);
+  for (const root of Array.from(tempRoots)) {
+    try {
+      tempRoots.add(fs.realpathSync(root));
+    } catch {}
+  }
+
+  for (const candidate of normalizedCandidates) {
+    if (!path.basename(candidate).startsWith("superlab-")) {
+      continue;
+    }
+    for (const tempRoot of tempRoots) {
+      const relativeToTmp = path.relative(tempRoot, candidate);
+      if (!relativeToTmp.startsWith("..") && !path.isAbsolute(relativeToTmp)) {
+        return true;
+      }
+    }
+  }
   }
-
+
+  return false;
 }
 
 function detectLanguage({ explicitLang, env = process.env } = {}) {
package/package-assets/claude/commands/lab/auto.md
CHANGED
@@ -8,4 +8,4 @@ tags: [workflow, research, auto]
 Use the installed `lab` skill at `.claude/skills/lab/SKILL.md`.
 
 Execute the requested `/lab:auto` stage against the user's argument now. Do not only recommend another lab stage. If a blocking prerequisite is missing, say exactly what is missing and ask at most one clarifying question.
-This command runs the `/lab:auto` stage. It must read `.lab/context/auto-mode.md
+This command runs the `/lab:auto` stage. It must read `.lab/context/eval-protocol.md`, `.lab/context/auto-mode.md`, `.lab/context/auto-status.md`, and `.lab/context/auto-outcome.md`, enforce the declared terminal goal schema, orchestrate approved run, iterate, review, and report stages inside that contract, poll long-running work until completion or stop conditions, and write progress plus the final outcome back into `.lab/context/auto-status.md` and `.lab/context/auto-outcome.md`.
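The "poll long-running work until completion or stop conditions" behavior amounts to re-checking a completion predicate on an interval with a deadline. A hedged sketch under that reading; `pollUntil` and its defaults are assumptions for illustration, not superlab's code:

```javascript
// Hypothetical polling helper: re-evaluate `check` (e.g. "has the run's
// summary file appeared?") until it returns true or the timeout elapses.
// On timeout the caller would record a stop reason in auto-outcome.
async function pollUntil(check, { intervalMs = 5000, timeoutMs = 3600000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    if (await check()) return true;
    if (Date.now() >= deadline) return false; // timed out; record a stop reason
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```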
package/package-assets/claude/commands/lab.md
CHANGED
@@ -18,7 +18,7 @@ tags: [workflow, research, overview]
   Turn the approved idea into an approved dataset and benchmark package with dataset years, papers that used each dataset, source audit, download plan, classic-public versus recent-strong-public versus claim-specific benchmark roles, and explicit rationale for canonical baselines, strong historical baselines, recent strong public methods, and closest prior work.
 
 - `/lab:auto`
-  Run a bounded orchestration loop over approved execution stages. Use an auto-mode contract plus live auto-status to drive `run`, `iterate`, `review`, `report`, and optionally `write` without changing the frozen mission or framing.
+  Run a bounded orchestration loop over approved execution stages. Use an auto-mode contract plus live auto-status to drive `run`, `iterate`, `review`, `report`, and optionally `write` without changing the frozen mission or framing. Choose an autonomy level, declare a concrete terminal goal, explicitly approve the contract before starting, and treat `.lab/context/eval-protocol.md` as the source of truth for metrics, metric glossary, source-backed comparison semantics, tables, and structured experiment-ladder rungs.
 
 - `/lab:framing`
   Lock paper-facing method name, module names, paper title, and contribution bullets by auditing current-field and adjacent-field terminology, then keep an approval gate before any section drafting.
@@ -51,5 +51,6 @@ tags: [workflow, research, overview]
 - `/lab:spec` should inherit the approved dataset package from `.lab/context/data-decisions.md`.
 - Never skip directly from `/lab:idea` to code.
 - `/lab:iterate` requires a normalized summary from `scripts/eval_report.py`.
+- `/lab:run`, `/lab:iterate`, `/lab:auto`, and `/lab:report` should all follow `.lab/context/eval-protocol.md`, including its recorded sources for metrics and comparison implementations.
 - `/lab:write` requires an approved framing artifact from `/lab:framing`.
 - `/lab:write` requires stable report artifacts, a mini-outline, the active section guide, `paper-review.md`, and `does-my-writing-flow-source.md`, and should only change one section per round.
package/package-assets/codex/prompts/lab-auto.md
CHANGED
@@ -6,4 +6,4 @@ argument-hint: autonomous campaign target
 Use the installed `lab` skill at `.codex/skills/lab/SKILL.md`.
 
 Execute the requested `/lab:auto` stage against the user's argument now. Do not only recommend another lab stage. If a blocking prerequisite is missing, say exactly what is missing and ask at most one clarifying question.
-This command runs the `/lab:auto` stage. It must read `.lab/context/auto-mode.md
+This command runs the `/lab:auto` stage. It must read `.lab/context/eval-protocol.md`, `.lab/context/auto-mode.md`, `.lab/context/auto-status.md`, and `.lab/context/auto-outcome.md`, enforce the declared terminal goal schema, orchestrate approved run, iterate, review, and report stages inside that contract, poll long-running work until completion or stop conditions, and write progress plus the final outcome back into `.lab/context/auto-status.md` and `.lab/context/auto-outcome.md`.
package/package-assets/codex/prompts/lab.md
CHANGED
@@ -16,7 +16,7 @@ argument-hint: workflow question or stage choice
   Turn the approved idea into an approved dataset and benchmark package with dataset years, papers that used each dataset, source audit, download plan, classic-public versus recent-strong-public versus claim-specific benchmark roles, and explicit rationale for canonical baselines, strong historical baselines, recent strong public methods, and closest prior work.
 
 - `/lab:auto`
-  Run a bounded orchestration loop over approved execution stages. Use an auto-mode contract plus live auto-status to drive `run`, `iterate`, `review`, `report`, and optionally `write` without changing the frozen mission or framing.
+  Run a bounded orchestration loop over approved execution stages. Use an auto-mode contract plus live auto-status to drive `run`, `iterate`, `review`, `report`, and optionally `write` without changing the frozen mission or framing. Choose an autonomy level, declare a concrete terminal goal, explicitly approve the contract before starting, and treat `.lab/context/eval-protocol.md` as the source of truth for metrics, metric glossary, source-backed comparison semantics, tables, and structured experiment-ladder rungs.
 
 - `/lab:framing`
   Lock paper-facing method name, module names, paper title, and contribution bullets by auditing current-field and adjacent-field terminology, then keep an approval gate before any section drafting.
@@ -49,5 +49,6 @@ argument-hint: workflow question or stage choice
 - `/lab:spec` should inherit the approved dataset package from `.lab/context/data-decisions.md`.
 - Never skip directly from `/lab:idea` to code.
 - `/lab:iterate` requires a normalized summary from `scripts/eval_report.py`.
+- `/lab:run`, `/lab:iterate`, `/lab:auto`, and `/lab:report` should all follow `.lab/context/eval-protocol.md`, including its recorded sources for metrics and comparison implementations.
 - `/lab:write` requires an approved framing artifact from `/lab:framing`.
 - `/lab:write` requires stable report artifacts, a mini-outline, the active section guide, `paper-review.md`, and `does-my-writing-flow-source.md`, and should only change one section per round.
package/package-assets/shared/lab/context/auto-mode.md
CHANGED
@@ -1,12 +1,19 @@
 # Auto Mode Contract
 
 Use this file to define the bounded autonomous execution envelope for `/lab:auto`.
+Pair it with `.lab/context/eval-protocol.md`, which defines the paper-facing metrics, tables, gates, and benchmark ladder that auto mode should optimize against.
+If `eval-protocol.md` declares structured rung entries, auto mode follows those rung transitions first and uses the stage commands here as per-stage fallbacks.
 
 ## Objective
 
 - Objective:
+- Autonomy level: L2
+- Approval status: draft
 - Allowed stages: run, iterate, review, report
 - Success criteria:
+- Terminal goal type:
+- Terminal goal target:
+- Required terminal artifact:
 
 ## Loop Budget
 
@@ -27,6 +34,14 @@ Use this file to define the bounded autonomous execution envelope for `/lab:auto
 - Promotion check command:
 - Promotion command:
 
+## Stage Output Contracts
+
+- Run stage contract: write persistent outputs under `results_root`.
+- Iterate stage contract: update persistent outputs under `results_root`.
+- Review stage contract: update canonical review context such as `.lab/context/decisions.md`, `state.md`, `open-questions.md`, or `evidence-index.md`.
+- Report stage contract: write the final report to `<deliverables_root>/report.md`.
+- Write stage contract: write LaTeX output under `<deliverables_root>/paper/`.
+
 ## Promotion Policy
 
 - Promotion policy:
@@ -40,3 +55,4 @@ Use this file to define the bounded autonomous execution envelope for `/lab:auto
 
 - Stop conditions:
 - Escalation conditions:
+- Canonical promotion writeback: update `.lab/context/data-decisions.md`, `.lab/context/decisions.md`, `.lab/context/state.md`, and `.lab/context/session-brief.md`.
package/package-assets/shared/lab/context/auto-outcome.md
ADDED
@@ -0,0 +1,28 @@
+# Auto Outcome
+
+## Goal
+
+- Objective:
+- Experiment ladder:
+- Metric glossary:
+- Metric source papers:
+- Metric implementation source:
+- Comparison source papers:
+- Comparison implementation source:
+- Deviation from original implementation:
+- Terminal goal type:
+- Terminal goal target:
+- Required terminal artifact:
+
+## Outcome
+
+- Status: idle
+- Goal reached: no
+- Stop reason:
+- Promotion applied: no
+- Final artifact:
+- Final rung:
+- Executed stages:
+- Iterations completed: 0
+- Started at:
+- Finished at:
package/package-assets/shared/lab/context/eval-protocol.md
ADDED
@@ -0,0 +1,46 @@
+# Evaluation Protocol
+
+Use this file to define the paper-facing evaluation objective, table plan, gates, and benchmark ladder for `/lab:run`, `/lab:iterate`, `/lab:auto`, and `/lab:report`.
+
+## Primary Evaluation Objective
+
+- Primary evaluation objective:
+- Primary metrics:
+- Secondary metrics:
+- Required terminal evidence:
+
+## Table Plan
+
+- Table plan:
+- Required claims per table:
+
+## Metric Glossary
+
+- Metric glossary:
+- Metric source papers:
+- Metric implementation source:
+- Comparison source papers:
+- Comparison implementation source:
+- Deviation from original implementation:
+
+Record enough source detail here that later `run`, `iterate`, `auto`, and `report` stages do not have to guess what a metric means, which baseline implementation is canonical, or where a comparison method came from.
+
+## Gate Ladder
+
+- Experiment ladder:
+- Benchmark ladder:
+- Comparison gate:
+- Promotion gate:
+- Minimum sample sizes:
+- Required output artifacts:
+
+### Rung: <rung-id>
+
+- Stage:
+- Goal:
+- Command:
+- Watch:
+- Gate:
+- On pass:
+- On fail:
+- On stop:
package/package-assets/shared/skills/lab/SKILL.md CHANGED
@@ -22,6 +22,8 @@ Use this skill when the user invokes `/lab:*` or asks for the structured researc
 - Write durable artifacts to disk instead of leaving key decisions only in chat.
 - Use `.lab/config/workflow.json` as the global contract for workflow language, paper language, and paper format.
 - Use `.lab/context/` as the shared project state for both Codex and Claude entrypoints.
+- Use `.lab/context/eval-protocol.md` as the shared evaluation contract for run, iterate, auto, and report stages, including metric glossary and experiment ladder semantics.
+- Treat evaluation semantics as source-backed once evaluation planning starts: metrics, benchmark gates, baseline behavior, comparison implementations, and deviations should come from recorded sources, not memory.
 - Workflow artifacts should follow the installed workflow language.
 - Final paper output should default to LaTeX, and its manuscript language should be decided separately from the workflow language.
 - Separate sourced facts from model-generated hypotheses.
@@ -82,6 +84,8 @@ Use this skill when the user invokes `/lab:*` or asks for the structured researc
 - Use this stage to orchestrate approved execution stages with bounded autonomy.
 - Read `.lab/config/workflow.json`, `.lab/context/mission.md`, `.lab/context/state.md`, `.lab/context/decisions.md`, `.lab/context/data-decisions.md`, `.lab/context/evidence-index.md`, `.lab/context/terminology-lock.md`, `.lab/context/auto-mode.md`, and `.lab/context/auto-status.md` before acting.
 - Treat `.lab/context/auto-mode.md` as the control contract and `.lab/context/auto-status.md` as the live state file.
+- Require `Autonomy level` and `Approval status` in `.lab/context/auto-mode.md` before execution.
+- Treat `L1` as safe-run validation, `L2` as bounded iteration, and `L3` as aggressive campaign mode.
 - Reuse `/lab:run`, `/lab:iterate`, `/lab:review`, `/lab:report`, and optional `/lab:write` instead of inventing a second workflow.
 - Do not automatically change the research mission, paper-facing framing, or core claims.
 - You may add exploratory datasets, benchmarks, and comparison methods inside the approved exploration envelope.
@@ -106,6 +110,7 @@ Use this skill when the user invokes `/lab:*` or asks for the structured researc
 - Register the run with `.lab/.managed/scripts/register_run.py`.
 - Normalize the result with `.lab/.managed/scripts/eval_report.py`.
 - Validate normalized output with `.lab/.managed/scripts/validate_results.py`.
+- Read `.lab/context/eval-protocol.md` before choosing the smallest run so the first experiment already targets the approved tables, metrics, and gates.
 - Update `.lab/context/state.md` and `.lab/context/evidence-index.md` after the run.
 
 ### `/lab:iterate`
@@ -122,6 +127,8 @@ Use this skill when the user invokes `/lab:*` or asks for the structured researc
 - Require a normalized evaluation report each round.
 - Read `.lab/context/mission.md`, `.lab/context/state.md`, `.lab/context/decisions.md`, and `.lab/context/evidence-index.md` at the start of each round.
 - Read `.lab/context/data-decisions.md` before changing benchmark-facing experiments.
+- Read `.lab/context/eval-protocol.md` before changing evaluation ladders, sample sizes, or promotion gates.
+- Keep metric definitions, baseline behavior, and comparison implementations anchored to the source-backed evaluation protocol before changing thresholds, gates, or ladder transitions.
 - Switch to diagnostic mode if risk increases for two consecutive rounds.
 - Write round reports with `.lab/.managed/templates/iteration-report.md`.
 - Update `.lab/context/state.md`, `.lab/context/decisions.md`, `.lab/context/evidence-index.md`, and `.lab/context/open-questions.md` each round as needed.
@@ -141,6 +148,8 @@ Use this skill when the user invokes `/lab:*` or asks for the structured researc
 
 - Summarize all validated iteration summaries.
 - Read `.lab/context/mission.md`, `.lab/context/state.md`, `.lab/context/decisions.md`, `.lab/context/evidence-index.md`, and `.lab/context/data-decisions.md` before drafting.
+- Read `.lab/context/eval-protocol.md` before choosing tables, thresholds, or final result framing.
+- Keep metric definitions, comparison semantics, and implementation references anchored to the approved evaluation protocol instead of re-deriving them during reporting.
 - Aggregate them with `.lab/.managed/scripts/summarize_iterations.py`.
 - Write the final document with `.lab/.managed/templates/final-report.md`.
 - Keep failed attempts and limitations visible.
@@ -172,7 +181,9 @@ Use this skill when the user invokes `/lab:*` or asks for the structured researc
 - No implementation before `/lab:spec`.
 - No frozen spec without an approved dataset package or an explicit defer reason recorded in `.lab/context/data-decisions.md`.
 - No unconstrained iteration. Every `/lab:iterate` campaign must declare done criteria and `max_iterations`.
+- No execution or reporting campaign without an evaluation protocol or an explicit defer reason recorded in `.lab/context/eval-protocol.md`.
 - No unconstrained auto mode. Every `/lab:auto` campaign must declare allowed stages, stop conditions, and a promotion policy in `.lab/context/auto-mode.md`.
+- No auto start without an explicit autonomy level and `Approval status: approved`.
 - No final report without validated normalized results.
 - No paper-writing round without stable report artifacts, an approved framing artifact, evidence links, and LaTeX manuscript output.
 
@@ -194,6 +205,6 @@ Use this skill when the user invokes `/lab:*` or asks for the structured researc
 - Vendored paper-writing references: `.codex/skills/lab/references/paper-writing/{abstract,introduction,related-work,method,experiments,conclusion,paper-review,does-my-writing-flow-source}.md` or `.claude/skills/lab/references/paper-writing/{abstract,introduction,related-work,method,experiments,conclusion,paper-review,does-my-writing-flow-source}.md`
 - Command adapters: the installed `/lab:*` command assets
 - Shared workflow config: `.lab/config/workflow.json`
-- Shared project context: `.lab/context/{mission,state,decisions,evidence-index,open-questions,data-decisions,auto-mode,auto-status}.md`
+- Shared project context: `.lab/context/{mission,state,decisions,evidence-index,open-questions,data-decisions,eval-protocol,auto-mode,auto-status}.md`
 - Templates: `.lab/.managed/templates/`
 - Scripts: `.lab/.managed/scripts/`